# Clustering of countries

## Problem Statement
HELP International is an international humanitarian NGO committed to fighting poverty and providing the people of underdeveloped countries with basic amenities and relief during disasters and natural calamities. It runs many operational projects, along with advocacy drives to raise awareness and funds.

After its recent funding programmes, it has raised around $10 million.

## Objective
- The CEO of the NGO needs to decide how to use this money strategically and effectively. The main difficulty in making this decision is choosing the countries that are in the direst need of aid.

- Categorise the countries using socio-economic and health factors that determine the overall development of a country, then suggest the countries the CEO should focus on the most. The dataset containing those socio-economic factors and the corresponding data dictionary are provided below.

## Algorithm for analysis (Clustering)

>1. Data Quality Check

>2. EDA: Univariate and Bivariate

>3. Outlier treatment

>4. Scaling

>5. Hopkins test

>6. Finding the best value of k (number of clusters) using the SSD elbow curve and the silhouette score

>7. Performing the k-means analysis with the final value of k

>8. Visualizing the clusters using scatter plots

>9. Performing Cluster profiling: __GDPP, CHILD_MORT and INCOME.__

>10. Hierarchical Clustering: Single linkage, Complete Linkage


### Importing required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',100)

### Reading the data

In [None]:
df = pd.read_csv('../input/unsupervised-learning-on-country-data/Country-data.csv')
df.head()

### Reading Data dictionary

In [None]:
data_dict = pd.read_csv('../input/unsupervised-learning-on-country-data/data-dictionary.csv')
data_dict

### Check for shape / size parameters

In [None]:
df.shape

### Checking for info - Datatypes

In [None]:
df.info()

> All data types are correct

### __Check for null values__

In [None]:
df.isnull().sum()

> No null values are observed

### Describing the data

In [None]:
df.describe()

### Finding duplicate entries in the country names

In [None]:
df['country'].duplicated().sum()

In [None]:
df['country'].nunique()

> All the country names are unique and no duplicates found.

### Data Quality Check

In [None]:
# exports, health and imports are given as percentages of gdpp (GDP per capita); convert them to absolute values
df['exports'] = df['exports'] * df['gdpp']/100
df['imports'] = df['imports'] * df['gdpp']/100
df['health'] = df['health'] * df['gdpp']/100
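
As a quick check of the conversion above, here is a minimal sketch on a made-up two-row frame. The values and the `exports_abs` column name are hypothetical and not taken from the dataset: a country with a `gdpp` of 1000 and `exports` of 25% ends up with absolute exports of 250 per capita.

```python
import pandas as pd

# Hypothetical values for illustration only (not rows from Country-data.csv)
toy = pd.DataFrame({
    'country': ['A', 'B'],
    'exports': [25.0, 10.0],   # % of GDP per capita
    'gdpp': [1000, 40000],     # GDP per capita
})

# Same conversion as in the cell above: percentage of gdpp -> absolute value
toy['exports_abs'] = toy['exports'] * toy['gdpp'] / 100
print(toy)
```

After the conversion, exports, imports and health are on the same per-capita scale as `gdpp`, which is one reason they correlate strongly with it later on.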

In [None]:
df.head()

### Univariate analysis

In [None]:
df.columns # Columns in the dataframe

In [None]:
columns = df.columns[1:] # Excluding the country variable from the columns list used for data visualization

In [None]:
# Visualizing the distribution of each numerical variable
for col in columns:
    sns.histplot(df[col], kde=True) # histplot replaces the deprecated distplot
    plt.show()

> - exports, imports and inflation appear approximately normally distributed, with almost all values within 3σ of the mean. There may still be internal groupings useful for clustering.

> - gdpp, total fertility, income, child mortality and health show multimodal distributions, where we can expect to find patterns for clustering.

### Bivariate analysis

In [None]:
sns.pairplot(df)
plt.show()

In [None]:
# Correlation data and visualization using heatmaps
plt.figure(figsize=(12,10))
df_corr = (df.drop('country',axis=1)).corr()
sns.heatmap(df_corr,cmap="YlGnBu",annot=True)

#### From above pair plot and heatmap
> Child mortality rate, life expectancy and total fertility are highly correlated.

> __GDPP__ is highly correlated with exports, imports, health and income. When __GDPP__ is high, income is also high and health spending is good, so health is less of a concern for those countries.

## Data Preparation
### Outlier

In [None]:
plt.figure(figsize=(15,12))
for i, col in enumerate(columns, start=1):
    plt.subplot(3,3,i)
    sns.boxplot(x=df[col])
plt.show()

> Except for life expectancy, all the variables have outliers above the upper whisker.

> Our objective is to find the countries in the direst need of aid, so the cases of interest are high child mortality, low health spending and low life expectancy on the health side, and high inflation, low income and low gdpp on the economic side.

> Soft capping is used for further analysis, since the dataset is small and every low socio-economic country is important.

In [None]:
# Treatment using soft capping: clip each variable to its 1st-99th percentile range
for col in columns:
    lower, upper = df[col].quantile([0.01,0.99]).values
    # clip() avoids chained assignment, which may silently fail to modify df
    df[col] = df[col].clip(lower=lower, upper=upper)

In [None]:
plt.figure(figsize=(15,12))
for i, col in enumerate(columns, start=1):
    plt.subplot(3,3,i)
    sns.boxplot(x=df[col])
plt.show()

> Soft capping reduces the influence of extreme values on the clustering while keeping every country in the analysis.
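
To make the effect of soft capping concrete, here is a minimal sketch on a hypothetical series (99 ordinary values plus one extreme outlier, not data from this notebook). It assumes the same 1st and 99th percentile bounds used above and applies them with `Series.clip`.

```python
import pandas as pd

# Hypothetical series: 99 ordinary values plus one extreme outlier
s = pd.Series(list(range(1, 100)) + [10_000], dtype=float)

# Bounds at the 1st and 99th percentiles, as in the soft-capping cell
lower, upper = s.quantile([0.01, 0.99]).values
capped = s.clip(lower=lower, upper=upper)

print(len(s), len(capped))            # the number of rows is unchanged
print(s.max(), '->', capped.max())    # the outlier is pulled down to the 99th percentile
```

Unlike dropping outliers, capping keeps every row, which matters here because the extreme low-income countries are exactly the ones of interest.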

In [None]:
#Scaling the data
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(df.drop('country',axis=1)))
scaled_data.columns = df.drop('country',axis=1).columns
scaled_data.head()
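
`StandardScaler` replaces each value with its z-score, (x - mean) / std, computed per column with the population standard deviation. The sketch below checks this on a small hypothetical array whose two columns sit on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column data on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)   # np.std defaults to ddof=0

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))  # each column: mean 0, std 1
```

Without this step, distance-based methods such as k-means would be dominated by the variables with the largest numeric range (for example `gdpp` and `income`).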

In [None]:
# Calculating the Hopkins statistic
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
from math import isnan

def hopkins(X):
    d = X.shape[1] # number of columns
    n = len(X) # number of rows
    m = int(0.1 * n) # number of sampled points (10% of the rows)
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
 
    rand_X = sample(range(0, n, 1), m)
 
    ujd = []
    wjd = []
    for j in range(0, m):
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])
 
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
 
    return H
hopkins(df.drop('country',axis=1))

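> We get a Hopkins statistic of 0.866. A value close to 1 indicates a strong clustering tendency, so the data is not uniformly randomly distributed and is suitable for clustering. The exact value varies slightly between runs because the function samples rows at random.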
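
To see how the Hopkins statistic behaves, the sketch below compares uniform and clearly clustered synthetic data. The function `hopkins_statistic` is a hypothetical, vectorised variant written for this illustration and is not the `hopkins` function defined above: it uses the nearest data point for each uniform probe and the nearest other point for each sampled row.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, sample_frac=0.1, seed=0):
    """Hypothetical helper: Hopkins statistic of a 2-D array X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = max(1, int(sample_frac * n))
    nbrs = NearestNeighbors(n_neighbors=2).fit(X)

    # Distances from m uniform probe points (inside the bounding box) to the data
    probes = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u_dist, _ = nbrs.kneighbors(probes, n_neighbors=1)

    # Distances from m sampled rows to their nearest *other* row
    idx = rng.choice(n, size=m, replace=False)
    w_dist, _ = nbrs.kneighbors(X[idx], n_neighbors=2)

    u = u_dist[:, 0].sum()
    w = w_dist[:, 1].sum()   # column 0 is the row itself (distance 0)
    return u / (u + w)

rng = np.random.default_rng(42)
uniform_data = rng.uniform(0, 1, size=(500, 2))
clustered_data = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(250, 2)),
    rng.normal(loc=5.0, scale=0.05, size=(250, 2)),
])

h_uniform = hopkins_statistic(uniform_data)
h_clustered = hopkins_statistic(clustered_data)
print(round(h_uniform, 2), round(h_clustered, 2))
```

Uniform data should give a value near 0.5, while tightly clustered data should give a value close to 1, which is the regime the 0.866 above falls into.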

### K-means Clustering

In [None]:
ssd = []
range_n_clusters = list(range(2,10))
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, max_iter=50,random_state=0)
    kmeans.fit(scaled_data)
    
    ssd.append(kmeans.inertia_)
    
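# plot the SSDs against the number of clusters, so the x-axis shows k rather than the list index
plt.plot(range_n_clusters, ssd)
plt.title('Elbow curve')
plt.xlabel('Number of clusters (k)')
plt.ylabel('SSD (inertia)')
plt.grid(True)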


> From the above plot, the SSD drops sharply up to k=3. Beyond k=3 the rate of change of the k-means inertia is insignificant, so the elbow is at k=3.
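
The elbow can also be read numerically instead of by eye. The following is a rough sketch on synthetic data with three well-separated groups (an assumption made for illustration, not the country data): it computes how much the SSD falls each time one more cluster is added.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative assumption)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 0], [5, 9]],
                  cluster_std=1.0, random_state=0)

ks = list(range(2, 10))
ssd = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# drops[k] = reduction in SSD gained by going from k-1 to k clusters
drops = dict(zip(ks[1:], -np.diff(ssd)))
largest_drop_k = max(drops, key=drops.get)

for k, drop in drops.items():
    print(k, round(float(drop), 1))
print('largest drop ends at k =', largest_drop_k)
```

On data like this, the drop from k=2 to k=3 dwarfs all later drops, which is the numeric counterpart of the bend in the curve. On real data the differences are usually less clear-cut, so this heuristic is best used together with the plot.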

In [None]:
# silhouette analysis

range_n_clusters = list(range(2,10))
sil_score = []
for num_clusters in range_n_clusters:
    
    # initialise kmeans (fixed random_state so the scores are reproducible)
    kmeans = KMeans(n_clusters=num_clusters, max_iter=50, random_state=0)
    kmeans.fit(scaled_data)
    
    cluster_labels = kmeans.labels_
    
    # silhouette score (computed once and reused)
    silhouette_avg = silhouette_score(scaled_data, cluster_labels)
    sil_score.append([num_clusters, silhouette_avg])
    print("For n_clusters={0}, the silhouette score is {1}".format(num_clusters, silhouette_avg))

print()
sil_df = pd.DataFrame(sil_score)
sil_df    


In [None]:
plt.plot(sil_df[0],sil_df[1])
plt.title('Silhouette score')

> The silhouette analysis gives the highest score at k=2. Two clusters are too coarse to be useful in practice, so we take the next-highest score, which is at k=3.
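
Ranking the candidate values of k by silhouette score makes this kind of "best, then next best" comparison explicit. The sketch below does so on synthetic three-group data (an illustrative assumption, so the ranking differs from the one obtained on the country data).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative assumption)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 0], [5, 9]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Candidate k values ordered from highest to lowest silhouette score
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The silhouette score lies between -1 and 1, and higher values mean points sit closer to their own cluster than to the nearest other cluster.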

In [None]:
# final model with k=3
kmeans = KMeans(n_clusters=3, max_iter=50, random_state=0)
kmeans.fit(scaled_data)

In [None]:
kmeans.labels_

In [None]:
# Assigning the cluster labels to the data frame for later retrieval
df['cluster_id_kmeans'] = kmeans.labels_
df.head()

> K-Means clustering is done.

### Hierarchical clustering

In [None]:
# Dendrogram using single linkage
plt.figure(figsize=(8,6))
mergings = linkage(scaled_data, method="single", metric='euclidean')
dendrogram(mergings)
plt.title('Single linkage')
plt.show()

> In the above dendrogram, the data points merge at very similar distances, which makes it hard to determine the value of k. So, let's go ahead with complete linkage.

In [None]:
# complete linkage
plt.figure(figsize=(8,6))
mergings = linkage(scaled_data, method="complete", metric='euclidean')
dendrogram(mergings)
plt.title('Complete linkage')
plt.show()

> In the above dendrogram, the clusters separate clearly when the tree is cut at a distance of about 10, which gives k=3.
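
Cutting a dendrogram at a height and asking for a fixed number of clusters are two views of the same operation. The sketch below shows both on synthetic data, using SciPy's `fcluster` with `criterion='distance'` for the height cut and `cut_tree` for the count. The threshold of 5.0 is an assumption chosen for this synthetic data and is unrelated to the distance of 10 read from the dendrogram above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, cut_tree
from sklearn.datasets import make_blobs

# Synthetic data with three tight, well-separated groups (illustrative assumption)
X, _ = make_blobs(n_samples=90, centers=[[0, 0], [10, 0], [5, 9]],
                  cluster_std=0.5, random_state=0)

Z = linkage(X, method='complete', metric='euclidean')

# Cut by height: merges that happen above distance 5.0 are not applied
labels_by_height = fcluster(Z, t=5.0, criterion='distance')   # labels start at 1

# Cut by count: ask directly for 3 clusters
labels_by_count = cut_tree(Z, n_clusters=3).reshape(-1)       # labels start at 0

n_height = len(np.unique(labels_by_height))
n_count = len(np.unique(labels_by_count))
print(n_height, n_count)
```

Reading the cut height off the dendrogram and then passing the resulting k to `cut_tree`, as done in this notebook, is equivalent as long as the height falls inside the gap between the last within-group merge and the first between-group merge.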

In [None]:
# From the above dendrogram, cutting the tree into 3 clusters gives the following labels
cluster_labels = cut_tree(mergings, n_clusters=3).reshape(-1, )
cluster_labels

In [None]:
# Assigning the cluster labels to the main dataframe
df['cluster_id_hc'] = cluster_labels
df.head()
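
With both label columns in the data frame, one way to check how far the two methods agree is a cross-tabulation plus the adjusted Rand index. This is a sketch on hypothetical labels standing in for `cluster_id_kmeans` and `cluster_id_hc`, not on the notebook's results. Note that cluster ids are arbitrary, so cluster 0 in one method need not correspond to cluster 0 in the other.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical label columns standing in for cluster_id_kmeans and cluster_id_hc
labels = pd.DataFrame({
    'cluster_id_kmeans': [0, 0, 0, 1, 1, 2, 2, 2],
    'cluster_id_hc':     [1, 1, 1, 0, 0, 2, 2, 0],
})

# Rows: k-means clusters, columns: hierarchical clusters
table = pd.crosstab(labels['cluster_id_kmeans'], labels['cluster_id_hc'])
print(table)

# 1.0 means identical partitions (up to renaming); values near 0 mean chance-level agreement
ari = adjusted_rand_score(labels['cluster_id_kmeans'], labels['cluster_id_hc'])
print(round(ari, 3))
```

A cross-tabulation with one dominant cell per row indicates that the two methods found essentially the same groups under different ids.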

# Visualize using scatterplot

In [None]:
# GDPP v/s Income
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.scatterplot(x='gdpp',y='income',data=df,hue='cluster_id_kmeans',palette='Set1')
plt.title('K-means clustering')
plt.subplot(1,2,2)
sns.scatterplot(x='gdpp',y='income',data=df,hue='cluster_id_hc',palette='Set1')
plt.title('Hierarchical clustering')
plt.show()

> The clusters labelled 0 and 1 overlap in this view, possibly because the other variables that separate them are not shown in the plot.

In [None]:
# Child mortality rate v/s Income
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.scatterplot(x='child_mort',y='income',data=df,hue='cluster_id_kmeans',palette='Set1')
plt.title('K-means clustering')
plt.subplot(1,2,2)
sns.scatterplot(x='child_mort',y='income',data=df,hue='cluster_id_hc',palette='Set1')
plt.title('Hierarchical clustering')
plt.show()

> Clusters are well distinguished

In [None]:
# GDPP v/s Child mortality
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.scatterplot(x='gdpp',y='child_mort',data=df,hue='cluster_id_kmeans',palette='Set1')
plt.title('K-means clustering')
plt.subplot(1,2,2)
sns.scatterplot(x='gdpp',y='child_mort',data=df,hue='cluster_id_hc',palette='Set1')
plt.title('Hierarchical clustering')
plt.show()

> Clusters are well distinguished

# Cluster profiling:

### K-means clustering parameters analysis

In [None]:
# Grouping the data to find the centroids of the clusters
df[['gdpp','income','child_mort','cluster_id_kmeans']].groupby('cluster_id_kmeans').mean()

In [None]:
df[['gdpp','income','child_mort','cluster_id_kmeans']].groupby('cluster_id_kmeans').mean().plot(kind='bar')

> From the above bar plot and data frame, we can identify 3 clusters with the following characteristics:

- Low __GDPP__, low __Income__ and high __Child_mortality__.

- Medium __GDPP__, __income__ and __Child_mortality__.

- High __GDPP__, high __income__ and low __Child_mortality__.

In [None]:
# Box plots
plt.subplots(ncols=3,figsize=(15,5))
plt.subplot(1,3,1)
sns.boxplot(x='cluster_id_kmeans', y='gdpp', data=df)
plt.subplot(1,3,2)
sns.boxplot(x='cluster_id_kmeans', y='income', data=df)
plt.subplot(1,3,3)
sns.boxplot(x='cluster_id_kmeans', y='child_mort', data=df)

> The plots above show that the clustering is effective at k=3: the cluster medians are well separated, and the pattern matches the cluster characteristics derived from the bar plot above.

In [None]:
# Top 5 countries in direst need according to k-means clustering (cluster 0 is taken as the low-gdpp, low-income, high-child-mortality cluster)
df[df['cluster_id_kmeans']==0].sort_values(['gdpp','income','child_mort'],ascending=[True,True,False]).head()
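
Cluster ids are arbitrary and can change with the random seed or the library version, so hard-coding id 0 is fragile. A minimal sketch of an alternative, on made-up numbers rather than the notebook's results, and assuming the neediest cluster is the one with the lowest mean `gdpp`:

```python
import pandas as pd

# Hypothetical profiled data (made-up numbers, not the notebook's results)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E', 'F'],
    'gdpp': [40000, 500, 6000, 350, 45000, 7000],
    'income': [38000, 900, 11000, 700, 41000, 12000],
    'child_mort': [4.0, 95.0, 22.0, 110.0, 3.5, 18.0],
    'cluster_id_kmeans': [2, 1, 0, 1, 2, 0],
})

# Find the cluster with the lowest mean gdpp instead of hard-coding its id
centroids = toy.groupby('cluster_id_kmeans')[['gdpp', 'income', 'child_mort']].mean()
needy_id = centroids['gdpp'].idxmin()

# Same ordering as above: lowest gdpp and income first, highest child mortality first
needy = (toy[toy['cluster_id_kmeans'] == needy_id]
         .sort_values(['gdpp', 'income', 'child_mort'], ascending=[True, True, False]))
print(needy_id, needy['country'].tolist())
```

The same pattern would apply to `cluster_id_hc`; other selection rules (for example the highest mean `child_mort`) are equally plausible and should normally point to the same cluster.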

### Hierarchical clustering - Parameters analysis

In [None]:
# Grouping the data to find the centroids of the clusters
df[['gdpp','income','child_mort','cluster_id_hc']].groupby('cluster_id_hc').mean()

In [None]:
# Plot of centroids using barplot
df[['gdpp','income','child_mort','cluster_id_hc']].groupby('cluster_id_hc').mean().plot(kind='bar')

> From the above bar plot and data frame, we can identify 3 clusters with the following characteristics:

- Low __GDPP__, low __Income__ and high __Child_mortality__.

- Medium __GDPP__, __income__ and __Child_mortality__.

- High __GDPP__, high __income__ and low __Child_mortality__.

In [None]:
plt.subplots(ncols=3,figsize=(15,5))
plt.subplot(1,3,1)
sns.boxplot(x='cluster_id_hc', y='gdpp', data=df)
plt.subplot(1,3,2)
sns.boxplot(x='cluster_id_hc', y='income', data=df)
plt.subplot(1,3,3)
sns.boxplot(x='cluster_id_hc', y='child_mort', data=df)

> The plots above show that the clustering is effective at k=3: the cluster medians are well separated, and the pattern matches the cluster characteristics derived from the bar plot above.

In [None]:
# Top 5 countries in direst need according to hierarchical clustering (cluster 0 is taken as the low-gdpp, low-income, high-child-mortality cluster)
df[df['cluster_id_hc']==0].sort_values(['gdpp','income','child_mort'],ascending=[True,True,False]).head()

In [None]:
(df[df['cluster_id_hc']==0].sort_values(['gdpp','income','child_mort'],ascending=[True,True,False]).head())[['country']]

> Based on the clustering of countries using K-Means and hierarchical clustering, both methods lead to the following common observations:

- Both give an optimal number of clusters of k = 3.
- The 3 clusters show the distinct characteristics we were looking for:
    
    - Low __GDPP__, low __Income__ and high __Child_mortality__.
    - Medium __GDPP__, __income__ and __Child_mortality__.
    - High __GDPP__, high __income__ and low __Child_mortality__.
    
- Our objective is to identify the countries with poor socio-economic conditions and high child mortality, which are in the direst need of help. The countries most in need of help are:
    - __Liberia__
    - __Burundi__
    - __Congo, Dem. Rep__
    - __Niger__
    - __Sierra Leone__