# <font color=Blue>Clustering Model</font>

## Problem Statement:

HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities. It runs a lot of operational projects from time to time along with advocacy drives to raise awareness as well as for funding purposes. After the recent funding programmes, they have been able to raise around $ 10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. The significant issues that come while making this decision are mostly related to choosing the countries that are in the direst need of aid. 

## Approach:
Categorise the countries using socio-economic and health factors that determine the overall development of a country, then suggest the countries on which the CEO should focus most.

## Model Objective:
- Cluster the countries by the factors mentioned and recommend to the CEO.

## The tasks performed in the model presented below are:
- Necessary data inspection and EDA tasks suitable for this dataset.
- Outlier Analysis
- Tried both K-means and hierarchical clustering (with both single and complete linkage) on this dataset to create the clusters.
- Analysed the clusters and identified the ones in direst need of aid by comparing how three variables, `gdpp`, `child_mort` and `income`, vary across clusters, in order to distinguish the clusters of developed countries from the clusters of under-developed countries.
- Visualised the clusters that were formed.
- Finally, listed the five countries in direst need of aid based on this analysis.


### Steps followed to build this model
1. Importing the Libraries
2. Data Understanding
3. Exploratory Data Analysis
4. Outlier Analysis
5. Feature Scaling
6. Data Modeling 
    - Optimal k value
    - Clustering using k means
    - Cluster Profiling
    - Clustering using Hierarchical Method
    - Cluster Profiling
7. Inference and Recommendation
  


In [None]:
import sklearn
print(sklearn.__version__)

In [None]:
## Importing the libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

pd.set_option('display.max_rows', 500)

In [None]:
import scipy
print(scipy.__version__)

## **Data Understanding**

In [None]:
# Read the data and create the dataframe
country_df=pd.read_csv("../input/country-data/Country-data.csv")

In [None]:
# Check if data is loaded or not
country_df.head()

Data Dictionary
- country :	Name of the country
- child_mort :	Death of children under 5 years of age per 1000 live births
- exports :	Exports of goods and services per capita. Given as %age of the GDP per capita
- health :	Total health spending per capita. Given as %age of GDP per capita
- imports :	Imports of goods and services per capita. Given as %age of the GDP per capita
- income :	Net income per person
- inflation :	The measurement of the annual growth rate of the Total GDP
- life_expec :	The average number of years a new born child would live if the current mortality patterns are to remain the same
- total_fer :	The number of children that would be born to each woman if the current age-fertility rates remain the same.
- gdpp :	The GDP per capita. Calculated as the Total GDP divided by the total population.



Here the features `exports`, `health` and `imports` are given as a percentage of GDP per capita (`gdpp`). We therefore convert them to absolute per-capita values.
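As a quick check of the arithmetic, here is a minimal sketch with made-up figures. The helper name `pct_to_absolute` and the numbers are illustrative assumptions, not part of the dataset or of the notebook's own code.

```python
def pct_to_absolute(pct_of_gdpp, gdpp):
    """Convert a value given as % of GDP per capita into an absolute per-capita amount."""
    return pct_of_gdpp * gdpp / 100

# Hypothetical country: GDP per capita of 2000, exports worth 25% of it
exports_per_capita = pct_to_absolute(25, 2000)
print(exports_per_capita)  # 500.0
```

The cells below apply the same formula column-wise to the whole dataframe.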

In [None]:
#Converting the exports columns to its true numerical values
country_df['exports']=(country_df['exports']*country_df['gdpp'])/100

In [None]:
#Converting the health columns to its true numerical values
country_df['health']=(country_df['health']*country_df['gdpp'])/100

In [None]:
#Converting the imports columns to its true numerical values
country_df['imports']=(country_df['imports']*country_df['gdpp'])/100

In [None]:
# inspect the first five rows of data
country_df.head()

In [None]:
# check the no of rows and columns
country_df.shape

In [None]:
# Column names, non-null counts and dtypes for the whole dataframe
country_df.info()

In [None]:
#Summary of the numerical columns in the dataframe
country_df.describe()

In [None]:
country_df.columns

In [None]:
# Check if any null values are present in the data
country_df.isnull().sum()

There are no null values in the dataset

## **Exploratory Data Analysis**

In [None]:
plt.figure(figsize=(20,20))
feat_list=country_df.columns[1:]
for i in enumerate(feat_list):
    plt.subplot(3,3, i[0]+1)
    sns.distplot(country_df[i[1]])


Seaborn's `distplot` shows the overall distribution of each continuous variable.
The plots above show that `child_mort`, `gdpp`, `income`, `total_fer` and `life_expec` are widely spread.

In [None]:
num_data=country_df[['child_mort', 'exports', 'health', 'imports', 'income',
       'inflation', 'life_expec', 'total_fer', 'gdpp']]
sns.pairplot(num_data)
plt.show()

## **Outlier Analysis**

In [None]:
plt.figure(figsize=(10,20))
feat_list=country_df.columns[1:]
for i in enumerate(feat_list):
    plt.subplot(3,3,i[0]+1) 
    sns.boxplot(country_df[i[1]])

plt.show()

Observation: All the features have outliers. We need to treat them, except for the outliers at the high end of child mortality and at the low end of `gdpp` and `income`. These three factors help in cluster profiling, because countries with these characteristics may need the funding most.

Because the dataset is small, we chose not to delete any outliers and instead used a capping method.

In [None]:
# Handling the outliers by capping rather than deleting rows.
# The 1st and 95th percentile cut-offs are a judgment call; in practice
# they are usually driven by the business context.

def cap_outliers(df, col, lower=True, upper=True):
    """Cap `col` at its 1st percentile (lower) and/or 95th percentile (upper)."""
    q_low = df[col].quantile(0.01) if lower else None
    q_high = df[col].quantile(0.95) if upper else None
    # Reassign the column instead of using chained indexing, which may
    # silently fail to modify the dataframe
    df[col] = df[col].clip(lower=q_low, upper=q_high)

# child_mort: keep the high values (they flag countries in need), cap the low end only
cap_outliers(country_df, 'child_mort', upper=False)

# income and gdpp: keep the low values (they flag countries in need), cap the high end only
cap_outliers(country_df, 'income', lower=False)
cap_outliers(country_df, 'gdpp', lower=False)

# remaining features: cap both ends
for col in ['exports', 'health', 'imports', 'inflation', 'life_expec', 'total_fer']:
    cap_outliers(country_df, col)

In [None]:
country_df.head()

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
country_df1 = scaler.fit_transform(country_df.drop('country', axis = 1))
country_df1

In [None]:
country_df1 = pd.DataFrame(country_df1, columns = country_df.columns[1:])
country_df1.head()

## Hopkins Statistics:

The Hopkins statistic (Lawson and Jurs 1990) assesses the clustering tendency of a data set by measuring the probability that it was generated by a uniform data distribution. In other words, it tests the spatial randomness of the data. We pass a dataframe to the Hopkins statistic function to find out whether the dataset is suitable for clustering. Because the algorithm samples points at random, repeated runs give different values, so we run it several times before drawing a conclusion. If the value of the Hopkins statistic is close to 1, we can reject the null hypothesis (that the data are uniformly distributed) and conclude that the dataset has a significant clustering tendency.
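In symbols, the ratio computed by the function in the next cell can be written as follows. The notation is an assumption introduced here for illustration: $u_j$ is the nearest-neighbour distance recorded for the $j$-th uniformly generated point, $w_j$ is the nearest-neighbour distance recorded for the $j$-th sampled data point, and $m$ is the number of points of each kind.

$$H = \frac{\sum_{j=1}^{m} u_j}{\sum_{j=1}^{m} u_j + \sum_{j=1}^{m} w_j}$$

For spatially random data the two sums are of similar size, so $H \approx 0.5$. For clustered data the $w_j$ are much smaller than the $u_j$, which pushes $H$ towards 1.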

In [None]:
#Calculating the Hopkins statistic
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
import numpy as np
from math import isnan
 
def hopkins(X):
    d = X.shape[1]
    #d = len(vars) # columns
    n = len(X) # rows
    m = int(0.1 * n) 
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
 
    rand_X = sample(range(0, n, 1), m)
 
    ujd = []
    wjd = []
    for j in range(0, m):
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])
 
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
 
    return H

In [None]:
hopkins(country_df1)

In [None]:
# Run the Hopkins statistic 10 times and average the results, because the value changes on every run.
hop_list=[]
for i in range(10):
    hop_list.append(hopkins(country_df1))
hop_list

In [None]:
import statistics
statistics.mean(hop_list) 

The average Hopkins score is about 0.85, which indicates that the data is well suited to clustering.

## Modeling- Clustering

### Finding the Optimal Number of Clusters

#### SSD/ Elbow curve

In [None]:
# elbow-curve/SSD
ssd = []
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, max_iter=50)
    kmeans.fit(country_df1)
    
    ssd.append(kmeans.inertia_)

In [None]:
ssd

In [None]:
# plot the SSDs for each n_clusters
plt.xlabel("number of clusters")
plt.ylabel("SSD")
plt.plot(range_n_clusters,ssd)
plt.title("Elbow Curve")
plt.show()

The graph above suggests that 3 clusters is optimal.
SSD is the sum of squared distances from each sample to its nearest cluster centre. SSD always decreases as the number of clusters increases, so the question is whether the drop is large enough to justify an extra cluster. From 3 to 4 clusters the slope flattens and the drop is no longer significant, so the optimal value is 3.
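To make the SSD definition concrete, here is a minimal pure-Python sketch on toy data. The function name `ssd_for` and the points are illustrative assumptions; in the notebook the value comes from `kmeans.inertia_`, and the centroids are found by K-means rather than supplied by hand.

```python
def ssd_for(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centroids
        )
    return total

# Toy 1-D data with two obvious groups (illustrative values only)
points = [(1.0,), (2.0,), (9.0,), (10.0,)]

print(ssd_for(points, [(5.5,)]))                  # one centroid:    65.0
print(ssd_for(points, [(1.5,), (9.5,)]))          # two centroids:   1.0
print(ssd_for(points, [(1.5,), (9.0,), (10.0,)])) # three centroids: 0.5
```

The drop from one to two centroids is large, while a third centroid barely helps. That flattening is the "elbow" the plot is used to locate.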

### Silhouette Analysis
$$\text{silhouette score} = \frac{p - q}{\max(p, q)}$$

$p$ is the mean distance from the data point to the points in the nearest cluster that it is not a part of, so this should be high.

$q$ is the mean intra-cluster distance from the data point to all the other points in its own cluster, so this should be low.

The silhouette score ranges from -1 to 1.

A score close to 1 indicates that the data point is very similar to the other points in its cluster and dissimilar to the points in other clusters: a very good clustering.

A score close to -1 indicates that the data point is not similar to the points in its own cluster and is similar to the points in other clusters: a poor clustering.
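The formula can be checked by hand for a single point. The sketch below is a simplified illustration: the function name `silhouette` is hypothetical, and the caller is assumed to have already identified which other cluster is nearest. The notebook itself relies on `sklearn.metrics.silhouette_score`, which averages this quantity over all points.

```python
from math import dist  # Python 3.8+

def silhouette(point, own_cluster, nearest_other_cluster):
    """Silhouette score (p - q) / max(p, q) for a single point.

    `own_cluster` must not include `point` itself.
    """
    q = sum(dist(point, o) for o in own_cluster) / len(own_cluster)
    p = sum(dist(point, o) for o in nearest_other_cluster) / len(nearest_other_cluster)
    return (p - q) / max(p, q)

point = (0.0, 0.0)
own = [(0.0, 1.0), (1.0, 0.0)]    # mean distance q = 1.0
other = [(4.0, 0.0), (0.0, 6.0)]  # mean distance p = 5.0
print(silhouette(point, own, other))  # 0.8
```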

In [None]:
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]

for num_clusters in range_n_clusters:
    
    # initialise kmeans
    kmeans = KMeans(n_clusters=num_clusters, max_iter=50)
    kmeans.fit(country_df1)
    
    cluster_labels = kmeans.labels_
    
    # silhouette score
    silhouette_avg = silhouette_score(country_df1, cluster_labels)
    print("For n_clusters={0}, the silhouette score is {1}".format(num_clusters, silhouette_avg))

In [None]:

from sklearn.metrics import silhouette_score
ss = []
for k in range(2, 11):
    kmean = KMeans(n_clusters = k).fit(country_df1)
    ss.append([k, silhouette_score(country_df1, kmean.labels_)])
temp = pd.DataFrame(ss)    
plt.plot(temp[0], temp[1])
plt.title("Silhouette_Score")
plt.xlabel("number of clusters")
plt.ylabel("silhouette_score")

The graph and the silhouette scores show that k=2 and k=3 are both good candidates for the number of clusters.
Based on the elbow curve and the silhouette score together, we conclude that k=3 is the optimal value of k.

## Clustering using K Means with K=3

In [None]:
# Kmean Clustering with k=3

kmeans = KMeans(n_clusters=3,random_state=50,max_iter=50)
kmeans.fit(country_df1)

In [None]:
kmeans.labels_

In [None]:
# assign the cluster label
country_df['cluster_id'] = kmeans.labels_
country_df.head()

In [None]:
#Count of number of countries under each cluster
country_df.cluster_id.value_counts()

### Cluster Profiling : Getting insights from the cluster

In [None]:
#scatter plot between income and gdpp with respect to Cluster Ids
sns.scatterplot(x='gdpp',y='income',hue='cluster_id',data=country_df,palette='Set1')
plt.title("Scatter plot between income and gdpp with respect to Cluster Ids")
plt.show()

> Observation: Cluster 2 should be targeted, as it is low in both income and gdpp.

In [None]:
#scatter plot between child_mort and gdpp with respect to Cluster Ids
sns.scatterplot(x='gdpp',y='child_mort',hue='cluster_id',data=country_df,palette='Set1')
plt.title("Scatter plot between child_mort and gdpp with respect to Cluster Ids")
plt.show()

Observation: The graph above also shows that the Cluster 2 countries are the ones to target for aid, as they have low gdpp and a high child mortality rate.

In [None]:
#scatter plot between income and child_mort with respect to Cluster Ids
sns.scatterplot(x='income',y='child_mort',hue='cluster_id',data=country_df,palette='Set1')
plt.title("Scatter plot between income and child_mort with respect to Cluster Ids")
plt.show()

Observation: The graph above also shows that the Cluster 2 countries are the ones to target for aid, as they have low income and a high child mortality rate.

In [None]:
# Bar plot of the cluster-wise means of gdpp, child_mort and income
country_df.drop(['country','exports', 'health', 'imports', 'inflation', 'life_expec', 'total_fer'], axis = 1).groupby('cluster_id').mean().plot(kind = 'bar')
plt.title("Bar plot which shows all the three parameters cluster wise")
plt.show()

In [None]:
# Same bar plot on a log scale, to make the child mortality values more evident
country_df.drop(['country','exports', 'health', 'imports', 'inflation', 'life_expec', 'total_fer'], axis = 1).groupby('cluster_id').mean().plot(kind = 'bar')
plt.title("Bar plot which shows all the three parameters cluster wise (log scale)")
plt.yscale("log")
plt.show()

Observation:

- Cluster 0: Medium child mortality rate, gdpp and income: developing countries.
- Cluster 1: Low child mortality rate, high gdpp and income: developed countries.
- Cluster 2: High child mortality rate, low gdpp and income: under-developed countries, which we need to target.

In [None]:
# plot for child_mort column for different clusters to check its variation
sns.boxplot(x='cluster_id', y='child_mort', data=country_df)
plt.title("Plot for child_mort column for different clusters to check its variation")
plt.show()

Observation: Cluster 1 has the countries with low child mortality and cluster 2 has the countries with high child mortality.

In [None]:
# plot for income column for different clusters to check its variation
sns.boxplot(x='cluster_id', y='income', data=country_df)
plt.title("Plot for income column for different clusters to check its variation")
plt.show()

Observation: Cluster 1 has the countries with high income and cluster 2 has the countries with low income.

In [None]:
# plot for gdpp column for different clusters to check its variation
sns.boxplot(x='cluster_id', y='gdpp', data=country_df)
plt.title("Plot for gdpp column for different clusters to check its variation")
plt.show()

Observation: Cluster 1 has the countries with high gdpp and cluster 2 has the countries with low gdpp.

Observation:

- Cluster 0: Medium child mortality rate, gdpp and income: developing countries.
- Cluster 1: Low child mortality rate, high gdpp and income: developed countries.
- Cluster 2: High child mortality rate, low gdpp and income: under-developed countries, which we need to target.



In [None]:
#Countries under Cluster 2 ie. which have less gdpp, less income and high child mortality
country_df[country_df['cluster_id'] == 2]

In [None]:
country_df['country'][country_df['cluster_id'] == 2]

## The list above shows the countries in cluster 2. These 48 countries need aid.

In [None]:
#Countries under Cluster 0 ie. which have medium gdpp, medium income and medium child mortality
country_df[country_df['cluster_id'] == 0]

In [None]:
#Countries under Cluster 1 ie. which have high gdpp, high income and low child mortality
country_df[country_df['cluster_id'] == 1]

In [None]:
# Top 5 countries in direst need of help, with gdpp, income, child_mort as the order of preference
country_df[country_df['cluster_id'] == 2].sort_values(by = ['gdpp','income','child_mort'], ascending = [True,True,False]).head(5)

In [None]:
# Top 5 countries in direst need of help, with income, gdpp, child_mort as the order of preference
country_df[country_df['cluster_id'] == 2].sort_values(by = ['income','gdpp','child_mort'], ascending = [True,True,False]).head(5)

In [None]:
# Top 5 countries in direst need of help, with child_mort, income, gdpp as the order of preference
country_df[country_df['cluster_id'] == 2].sort_values(by = ['child_mort','income','gdpp'], ascending = [False,True,True]).head(5)

**Inference:**

Low gdpp, low income and high child mortality are the parameters considered.
- Five countries in direst need of aid when gdpp, income, child_mort is the order of preference: Burundi; Liberia; Congo, Dem. Rep.; Niger; Sierra Leone
- Five countries in direst need of aid when income, gdpp, child_mort is the order of preference: Congo, Dem. Rep.; Liberia; Burundi; Niger; Central African Republic
- Five countries in direst need of aid when child_mort, income, gdpp is the order of preference: Haiti; Sierra Leone; Chad; Central African Republic; Mali

Final inference which can be given to the business: the top five countries that need aid are Burundi; Liberia; Congo, Dem. Rep.; Niger; and Sierra Leone (with gdpp, income and child mortality as the order of preference).

The reasons for sorting in this order are:
1. The Gross Domestic Product of a country is the total value of the goods and services it produces, so the five countries above have the least production. This in turn creates a need for external support. When financial help is targeted using the low-GDP indicator, these countries should be able to produce more goods and services, which helps their overall development.
2. Low income is a sign of the inability to purchase the commodities and services that keep a family comfortable and, for very low-income families, of a dire need for resources. Foreign aid would make the required help available to such families at a lower cost.
3. Child mortality rate is the third indicator because it is not linked directly to the resources available; it is also affected by other factors such as inadequate care of the mother during pregnancy and lapses by health care workers. It is therefore treated as the third factor in this problem, keeping in mind the malnutrition and other financial problems the country is facing.
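The order of preference amounts to a lexicographic sort: ties on the first key are broken by the second, then by the third. A minimal sketch with made-up rows follows. The country names, the figures and the helper `rank_by_need` are hypothetical and are not taken from the dataset.

```python
# Hypothetical figures, for illustration only (not taken from the dataset)
countries = [
    {"country": "A", "gdpp": 300, "income": 900, "child_mort": 90.0},
    {"country": "B", "gdpp": 250, "income": 800, "child_mort": 110.0},
    {"country": "C", "gdpp": 300, "income": 700, "child_mort": 150.0},
    {"country": "D", "gdpp": 300, "income": 700, "child_mort": 160.0},
]

def rank_by_need(rows):
    """Lowest gdpp first, then lowest income, then highest child_mort."""
    return sorted(rows, key=lambda r: (r["gdpp"], r["income"], -r["child_mort"]))

print([r["country"] for r in rank_by_need(countries)])  # ['B', 'D', 'C', 'A']
```

B comes first on gdpp alone. A, C and D tie on gdpp, so income separates A from the other two, and child mortality finally decides between C and D. This mirrors `sort_values(by=[...], ascending=[True, True, False])` in the cells above.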

## Hierarchical Clustering

In [None]:
country_df_hierarchy = country_df.copy()
country_df_hierarchy=country_df_hierarchy.drop(['cluster_id'],axis=1)

In [None]:
country_df_hierarchy.head()

In [None]:
# single linkage - performing on scaled df
plt.figure(figsize=(20,10))
mergings = linkage(country_df1, method="single", metric='euclidean')
dendrogram(mergings)
plt.show()

In [None]:
# complete linkage
plt.figure(figsize=(20,10))
mergings = linkage(country_df1, method="complete", metric='euclidean')
dendrogram(mergings)
plt.show()

The complete-linkage dendrogram above suggests cutting the tree into k=3 clusters.
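The two dendrograms differ only in how the distance between two clusters is measured when deciding which pair to merge next. A minimal pure-Python sketch of the two rules on toy points follows. The function names and data are illustrative assumptions; the notebook itself uses `scipy.cluster.hierarchy.linkage`.

```python
from itertools import product
from math import dist  # Python 3.8+

def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(dist(p, q) for p, q in product(a, b))

cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # 6.0
```

Single linkage tends to chain points together one at a time, while complete linkage favours compact clusters, which is often why its dendrogram is easier to cut.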

In [None]:
# complete linkage
plt.figure(figsize=(20,10))
mergings = linkage(country_df1, method="complete", metric='euclidean')
dendrogram(mergings)
plt.show()
# Hierarchical clustering using 3 clusters
cluster_labels = cut_tree(mergings, n_clusters=3).reshape(-1, )
cluster_labels

In [None]:
# assign cluster labels
country_df_hierarchy['cluster_labels'] = cluster_labels

In [None]:
country_df_hierarchy.head()

In [None]:
#Count of number of countries under each cluster
country_df_hierarchy.cluster_labels.value_counts()

### Cluster Profiling : Getting insights from the cluster

In [None]:
#scatter plot between income and gdpp with respect to Cluster Ids
sns.scatterplot(x='gdpp',y='income',hue='cluster_labels',data=country_df_hierarchy,palette='Set1')
plt.title("Scatter plot between income and gdpp with respect to Cluster Ids")
plt.show()

Observation: Cluster 0 should be targeted, as it is low in both income and gdpp.

In [None]:
#scatter plot between child_mort and gdpp with respect to Cluster Ids
sns.scatterplot(x='gdpp',y='child_mort',hue='cluster_labels',data=country_df_hierarchy,palette='Set1')
plt.title("Scatter plot between child_mort and gdpp with respect to Cluster Ids")
plt.show()


Observation: The graph above also shows that the Cluster 0 countries are the ones to target for aid, as they have low gdpp and a high child mortality rate.

In [None]:
#scatter plot between income and child_mort with respect to Cluster Ids
sns.scatterplot(x='income',y='child_mort',hue='cluster_labels',data=country_df_hierarchy,palette='Set1')
plt.title("Scatter plot between income and child_mort with respect to Cluster Ids")
plt.show()


Observation: The graph above also shows that the Cluster 0 countries are the ones to target for aid, as they have low income and a high child mortality rate.

In [None]:
# Bar plot of the cluster-wise means of gdpp, child_mort and income
country_df_hierarchy.drop(['country','exports', 'health', 'imports', 'inflation', 'life_expec', 'total_fer'], axis = 1).groupby('cluster_labels').mean().plot(kind = 'bar')
plt.title("Bar plot which shows all the three parameters cluster wise")
plt.show()

In [None]:
# Same bar plot on a log scale, to make the child mortality values more evident
country_df_hierarchy.drop(['country','exports', 'health', 'imports', 'inflation', 'life_expec', 'total_fer'], axis = 1).groupby('cluster_labels').mean().plot(kind = 'bar')
plt.title("Bar plot which shows all the three parameters cluster wise (log scale)")
plt.yscale("log")
plt.show()

Observation:
- Cluster 0: High child mortality rate, low gdpp and income: under-developed countries, which we need to target.
- Cluster 1: Medium child mortality rate, gdpp and income: developing countries.
- Cluster 2: Low child mortality rate, high gdpp and income: developed countries.

In [None]:
# plot for child_mort column for different clusters to check its variation
sns.boxplot(x='cluster_labels', y='child_mort', data=country_df_hierarchy)
plt.title("Plot for child_mort column for different clusters to check its variation")
plt.show()

Observation: Cluster 2 has the countries with low child mortality and cluster 0 has the countries with high child mortality.

In [None]:
# plot for income column for different clusters to check its variation
sns.boxplot(x='cluster_labels', y='income', data=country_df_hierarchy)
plt.title("Plot for income column for different clusters to check its variation")
plt.show()

Observation: Cluster 0 has the countries with low income and cluster 2 has the countries with high income.

In [None]:
# plot for gdpp column for different clusters to check its variation
sns.boxplot(x='cluster_labels', y='gdpp', data=country_df_hierarchy)
plt.title("Plot for gdpp column for different clusters to check its variation")
plt.show()

Observation: Cluster 0 has the countries with low gdpp and cluster 2 has the countries with high gdpp.

### Overall Observations:
- Cluster 0 denotes under-developed countries
- Cluster 1 denotes developing countries
- Cluster 2 denotes developed countries


## We need to target the countries in Cluster 0, the countries which need aid.

In [None]:
#Countries under Cluster 0 ie. which have less gdpp, less income and high child mortality
country_df_hierarchy[country_df_hierarchy['cluster_labels'] == 0]

In [None]:
country_df_hierarchy.country[country_df_hierarchy['cluster_labels'] == 0]

## The list above shows the countries in cluster 0, the countries which need aid.
## There are 67 countries in this category.

In [None]:
#Countries under Cluster 1 ie. which have medium gdpp, medium income and medium child mortality
country_df_hierarchy[country_df_hierarchy['cluster_labels'] == 1]

In [None]:
#Countries under Cluster 2 ie. which have high gdpp, high income and low child mortality
country_df_hierarchy[country_df_hierarchy['cluster_labels'] == 2]

In [None]:
# Top 5 countries in direst need of help, with gdpp, income, child_mort as the order of preference
country_df_hierarchy[country_df_hierarchy['cluster_labels'] == 0].sort_values(by = ['gdpp','income','child_mort'], ascending = [True,True,False]).head(5)

In [None]:
# Top 5 countries in direst need of help, with income, gdpp, child_mort as the order of preference
country_df_hierarchy[country_df_hierarchy['cluster_labels'] == 0].sort_values(by = ['income','gdpp','child_mort'], ascending = [True,True,False]).head(5)

In [None]:
# Top 5 countries in direst need of help, with child_mort, income, gdpp as the order of preference
country_df_hierarchy[country_df_hierarchy['cluster_labels'] == 0].sort_values(by = ['child_mort','income','gdpp'], ascending = [False,True,True]).head(5)

Inference:

Low gdpp, low income and high child mortality are the parameters considered.
- Five countries in direst need of aid when gdpp, income, child_mort is the order of preference: Burundi; Liberia; Congo, Dem. Rep.; Niger; Sierra Leone
- Five countries in direst need of aid when income, gdpp, child_mort is the order of preference: Congo, Dem. Rep.; Liberia; Burundi; Niger; Central African Republic
- Five countries in direst need of aid when child_mort, income, gdpp is the order of preference: Haiti; Sierra Leone; Chad; Central African Republic; Mali

## <font color=Green>Final Inference and Recommendations </font>

- The optimal value of k for clustering is 3.
- For the K-means clustering model, the number of countries under each cluster is:

|Cluster |Count| Description                                                                           |
|--------|-----|---------------------------------------------------------------------------------------|
|0       | 82  | The cluster 0 has countries with medium gdpp, medium income and medium child mortality | 
|2       | 48  | The cluster 2 has countries with less gdpp, less income and high child mortality       | 
|1       | 37  | The cluster 1 has countries with high gdpp, high income and less child mortality       | 


- We need to concentrate on the 48 countries belonging to Cluster 2, as this cluster denotes the under-developed countries.

- For the hierarchical clustering model, the number of countries under each cluster is:

|Cluster |Count| Description                                                                           |
|--------|-----|---------------------------------------------------------------------------------------|
|0       | 67  | The cluster 0 has countries with less gdpp, less income and high child mortality       | 
|1       | 60  | The cluster 1 has countries with medium gdpp, medium income and medium child mortality | 
|2       | 40  | The cluster 2 has countries with high gdpp, high income and less child mortality       | 

- We need to concentrate on the 67 countries belonging to Cluster 0, as this cluster denotes the under-developed countries.
- The top five countries that need aid (the same countries are obtained by the K-means and hierarchical algorithms) are:
    - Burundi
    - Liberia
    - Congo, Dem. Rep.
    - Niger
    - Sierra Leone
- I have considered low gdpp first, followed by low income and then high child mortality rate, because:

1. The Gross Domestic Product of a country is the total value of the goods and services it produces, so the five countries above have the least production. This in turn creates a need for external support. When financial help is targeted using the low-GDP indicator, these countries should be able to produce more goods and services, which helps their overall development.
2. Low income is a sign of the inability to purchase the commodities and services that keep a family comfortable and, for very low-income families, of a dire need for resources. Foreign aid would make the required help available to such families at a lower cost.
3. Child mortality rate is the third indicator because it is not linked directly to the resources available; it is also affected by other factors such as inadequate care of the mother during pregnancy and lapses by health care workers. It is therefore treated as the third factor in this problem, keeping in mind the malnutrition and other financial problems the country is facing.
