# Definition
Unsupervised Learning is a type of Machine Learning used to draw inferences from data without any specified targets. Hence the goal of the algorithm in this case is to find underlying structure or patterns in unlabeled data.  

There are 3 main usages of Unsupervised Learning:-
1. **Clustering** (Forming Groups of Data Points)
2. **Association** (Example:- People who buy a bat also tend to buy a ball)
3. **Dimensionality Reduction** (Reducing dimensionality of data by projection)

# About this Notebook
* Here we are going to learn the most widely used aspect of Unsupervised Learning:- "**Clustering**".
* Being **beginner friendly**, this notebook will focus solely on the basics: getting to know the data and building a primitive yet effective model.
* We will explore the following clustering algorithms:-
    1. k-Means Clustering
    2. Hierarchical Clustering
    3. Affinity Propagation
    4. Mean Shift
    5. Spectral Clustering
    6. DBSCAN
    7. Gaussian Mixture Model
* We will also learn how to determine the **optimal number of clusters** for some of the algorithms.
* Lastly we will learn how to draw some interesting insights using the results of the models.

# What is Clustering
Clustering is the task of segregating a population into smaller groups such that the members of each group are more similar to one another than to members of other groups.  
For example:-  
The complete set of news in a day can be clustered into groups like Political News, Sports News, Entertainment News and Weather Report. In such a case each member of the Sports News cluster (for example Cricket and Formula 1) will have more similarities to one another than to any member of the Weather Report cluster.

# Types of Clustering Algorithms
1. **Centroid Algorithm**:- This is an iterative approach of finding cluster centroids and assigning each point to a cluster based on its distance to each cluster center. This process is repeated until the centroids stop moving, i.e. until they converge. (Ex:- k-Means)
2. **Density Algorithm**:- Density-based clustering refers to unsupervised learning methods that identify distinctive groups/clusters in the data, based on the idea that a cluster in a data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density. This algorithm requires only one pass over the data space. (Ex:- DBSCAN)
3. **Distribution Algorithm**:- This algorithm is based on the idea that a cluster can be defined as the set of objects most likely belonging to the same distribution. (Ex:- Gaussian Mixture Models)
4. **Connectivity Algorithm**:- The core idea behind this algorithm is that data points closer to one another in N-dimensional space tend to have more similar properties than data points much farther away. (Ex:- Hierarchical Clustering)
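
To make the four families concrete, here is a minimal sketch that fits one representative of each on a toy dataset. The blob data and every parameter value (`eps=1.0`, three clusters, the seeds) are assumptions chosen for illustration and are not taken from the country data used later.

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data (an assumption for illustration): three well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.5, random_state=0)

# One representative per family, all fitted on the same points.
labels_by_family = {
    "centroid (k-Means)": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "density (DBSCAN)": DBSCAN(eps=1.0, min_samples=5).fit_predict(X),
    "distribution (GMM)": GaussianMixture(n_components=3, random_state=0).fit_predict(X),
    "connectivity (Agglomerative)": AgglomerativeClustering(n_clusters=3).fit_predict(X),
}

for family, labels in labels_by_family.items():
    n_found = len(set(labels) - {-1})  # DBSCAN marks noise points with -1
    print(f"{family}: {n_found} clusters")
```

On well-separated blobs like these the families tend to agree; they start to differ once clusters overlap, vary in density or have non-convex shapes.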

# Applications of Clustering
Clustering is widely used in many different domains including:-  
* Fraud Detection
* Recommender Systems
* News Segmentation
* Spam Detection

and many more...

# Problem Statement

## About the Data:-
The data contains the following columns:-

| Column Name | Description |
|:------------|:------------|
| country | Name of the Country |
| child_mort | Child Mortality Rate |
| exports | Per capita export of goods and services |
| health | Per capita spending on health |
| imports | Per capita import of goods and services |
| income | Per capita Income |
| inflation | Annual growth rate of GDP |
| life_expec | Life Expectancy |
| total_fer | Fertility rate |
| gdpp | Per capita GDP |

## Expected Outcome:-
Based on the above socio-economic factors, we need to determine which countries are in the direst need of aid and should therefore receive the investment.  

So now that this is clear, let's get into it starting with some basic imports:-

In [None]:
# For reproducible results
from numpy.random import seed
seed(1)

# Aesthetics
import warnings
import sklearn.exceptions
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

# General
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import os
from tqdm import tqdm

# Visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid")
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# Models
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation
from sklearn.cluster import MeanShift, estimate_bandwidth, SpectralClustering
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as shc

In [None]:
data_path = '../input/unsupervised-learning-on-country-data'

train_file_path = os.path.join(data_path, 'Country-data.csv')

print(f'Training file path: {train_file_path}')

# EDA

## 1. Descriptive Analysis

In [None]:
train_df = pd.read_csv(train_file_path)
train_df.sample(10) # Random 10 rows from the data

In [None]:
# Basic Descriptive Analysis
train_df.describe().T

In [None]:
train_df.isnull().sum()

As we can see, there are no null values in any of the columns, so that is one step less for us. Now let's move on to understanding each column...

In [None]:
train_df.dtypes

We can see that all feature columns (except country) are numerical. The descriptions also make it clear that none of them are categorical values encoded as numbers, so everything is continuous. Good for us... Moving on to visualizations.

In [None]:
features = [
    'child_mort', 'exports', 'health','imports',
    'income', 'inflation', 'life_expec', 'total_fer',
    'gdpp'
]

## 2. Univariate Analysis

In [None]:
# Features of similar scales grouped together for better visibility
features_1 = [
    'income', 'gdpp'
]

features_2 = [
    'child_mort', 'exports','imports',
    'inflation', 'life_expec'
]

features_3 = [
    'health','total_fer'
]

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(data=train_df[features_1], orient="h", palette="Set2");

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(data=train_df[features_2], orient="h", palette="Set2");

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(data=train_df[features_3], orient="h", palette="Set2");

## 3. Pair-Wise Analysis

In [None]:
g = sns.pairplot(train_df[features])
g.fig.set_size_inches(12,10)

In [None]:
train_df_cor_spear = train_df[features].corr(method='spearman')
plt.figure(figsize=(10,8))
sns.heatmap(train_df_cor_spear, square=True, cmap='coolwarm', annot=True);

In [None]:
train_df_cor_spear = train_df[features].corr(method='spearman')
# Boolean mask hiding the upper triangle (and diagonal) so each pair is shown once
mask = np.zeros_like(train_df_cor_spear, dtype=bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(10,8))
sns.heatmap(train_df_cor_spear, mask=mask, square=True, cmap='coolwarm', annot=True);

## EDA Inferences:-
Interesting dataset! Some inferences I can make out right now are as follows:-  
1. Child mortality rate decreases as GDPP increases. The two are strongly negatively correlated, which is expected, as developed countries (having higher GDPP) have better healthcare and hence a better chance of survival.
2. Child mortality rate is positively correlated with total fertility rate. In my opinion the two are dependent: women tend to give birth to more children when the previous ones unfortunately did not survive.
3. Child mortality rate is negatively correlated with life expectancy. This is again a dependent relationship, because when more children unfortunately die too early, it pulls down the overall life expectancy of the country.
4. Inflation is weakly negatively correlated with GDPP, which in my opinion might be due to economic saturation in highly developed nations.
5. Per capita income is heavily correlated with GDPP because one is roughly a function of the other.
6. Imports and exports increase with one another, which implies that the trading power of the country as a whole grows; i.e., countries that export more are also likely to import more.
7. Spending on health increases with GDPP and income, which is self-explanatory.
8. Income rises with exports, which might be because people generate income by producing goods and services which are later exported.
9. Child mortality rate is negatively correlated with healthcare expenditure and income, which suggests that the unfortunate circumstances of low-income groups are often responsible for a low life expectancy among children.  

Keeping these in mind, we see that there is a very high level of correlation between most of the features. So let's first drop one variable from each pair that has a very strong correlation.
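
Strongly correlated pairs can also be listed programmatically rather than read off the heatmap. The following is a minimal sketch on a hypothetical toy frame; the column names `a`, `b`, `c` and the `THRESHOLD` of 0.8 are assumptions for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                          # independent of both
})

THRESHOLD = 0.8  # assumed cut-off for "very strong" correlation
corr = toy.corr(method="spearman").abs()

# Keep only the strict upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = [
    (row, col, round(float(upper.loc[row, col]), 2))
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > THRESHOLD
]
print(strong_pairs)
```

Which member of each pair to drop remains a judgment call; the list only tells us where the redundancy is.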

# Feature Engineering

In [None]:
features_to_drop = ['gdpp', 'child_mort', 'total_fer']
train_df.drop(features_to_drop, axis=1, inplace=True)

In [None]:
# Exports minus imports: positive values mean a trade surplus, negative values a trade deficit
train_df['Trade_Deficiency'] = train_df['exports'] - train_df['imports']

In [None]:
features = [feat for feat in train_df.columns if feat not in ['country']]
print(features)

## Scaling Data
Since many of our algorithms are based on point-to-point distance, it is essential to scale the data; otherwise a feature with a much larger variance would dominate the distances and could degrade the model. We are just going to use the StandardScaler from sklearn. This will essentially make the mean of each feature ~ 0 and its variance ~ 1.
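
As a quick check of what standardisation does, here is a minimal sketch on a made-up 3x2 array (the numbers are assumptions for illustration). It compares `StandardScaler` with the manual formula `(x - mean) / std`, using the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score per column; np.std defaults to the population std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled.mean(axis=0))          # approximately 0 for every column
print(scaled.std(axis=0))           # approximately 1 for every column
print(np.allclose(scaled, manual))
```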

In [None]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(train_df[features])
scaled_data = pd.DataFrame(scaled_data, columns=features)

Now let's move on to the modelling part... But before that, let's create some functions which will help us through the whole process, because we will be repeating similar steps for most of the models. This also makes the code generic and reusable for any of you interested in following the same approach.

# Utils

In [None]:
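def project_to_2d(df, features=features, plot=False, cluster=None):
    """Project df[features] onto its first two principal components; optionally scatter-plot them, coloured by the `cluster` column."""
    pca = PCA(n_components=2)
    projected = pca.fit_transform(df[features])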
    if plot:
        if cluster is None:
            sns.scatterplot(
                x=projected[:, 0],
                y=projected[:, 1]
            )
        else:
            num_clusters = df[cluster].nunique()
            sns.scatterplot(
                x=projected[:, 0],
                y=projected[:, 1],
                hue=df[cluster].values,
                palette=sns.color_palette("husl", num_clusters)
            )
    return projected

In [None]:
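def pair_plot_cluster(df, scaled_data, cluster, features=features):
    """Pair-plot df[features] coloured by cluster labels; note that this copies the `cluster` column from scaled_data into df."""
    df[cluster] = scaled_data[cluster]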
    num_clusters = df[cluster].nunique()
    g = sns.pairplot(
        df[features + [cluster]],
        hue=cluster,
        palette=sns.color_palette("husl", num_clusters)
    )
    g.fig.set_size_inches(12,10)
    plt.show()

In [None]:
RANDOM_SEED = 42

Before we cluster the data, let's look at how it looks in a 2-D projection.
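
A 2-D projection always discards some information, and PCA reports how much variance the kept components retain. The following is a minimal sketch on synthetic data; the 5-feature toy matrix is an assumption for illustration and not the country data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 5 observed features driven mostly by 2 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.1, size=(200, 5))

pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by the two kept components.
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.1%}")
```

If the retained fraction is low on real data, clusters that look overlapping in the 2-D plot may still be well separated in the full feature space.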

In [None]:
projected_df = project_to_2d(scaled_data, features, plot=True)

# Model Creation
We will be generating various models, as discussed earlier, and we will use the utility functions defined above to judge the models and derive inferences from them.

# 1. KMeans  
The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields.
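
To see what inertia measures, here is a minimal sketch that recomputes it by hand as the sum of squared distances from each point to its assigned centroid. The blob data and the choice of 3 clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to every sample, then sum the squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```

The two printed numbers should agree up to floating-point error, which is why a lower inertia means tighter clusters.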

In [None]:
model = KMeans(n_clusters=2, init='k-means++', random_state=RANDOM_SEED)
# 2 is just an arbitrary number, we will find the exact number soon below

model.fit(scaled_data[features])
scaled_data['KMeans'] = model.predict(scaled_data[features])

In [None]:
projected_df = project_to_2d(scaled_data, features, plot=True, cluster='KMeans')

In [None]:
pair_plot_cluster(train_df, scaled_data, 'KMeans')

There are 2 broad methods to find the optimal number of clusters, let's look at them, one by one...
### A. Elbow Method  
In this method we iterate over a range of candidate cluster counts and record the overall inertia for each. The resulting plot looks like an arm, with a shoulder, an elbow and a forearm. Based on this we **"eyeball"** the plot to find the elbow and take that as our optimal number of clusters.

In [None]:
INERTIAS = []
for cluster in range(1,20):
    # n_jobs is not passed: the parameter was removed from KMeans in scikit-learn 1.0
    model = KMeans(n_clusters=cluster, init='k-means++',
                   random_state=RANDOM_SEED)
    model.fit(scaled_data[features])
    INERTIAS.append(model.inertia_)

inert_df = pd.DataFrame({'Num_Clusters':range(1,20), 'Inertia':INERTIAS})
plt.figure(figsize=(12,6))
sns.lineplot(data=inert_df, x="Num_Clusters", y="Inertia", marker='o')
plt.ylim(0, 1200)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia');

Based on this plot, it looks like our elbow lies somewhere in the 3-7 region. Let's take 3 as our optimal number of clusters.
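
The eyeballing can be supplemented with a simple heuristic: pick the point farthest from the straight line joining the first and last points of the curve. This is one common rule of thumb and not the method used in this notebook; the helper name `elbow_index` and the inertia values below are hypothetical.

```python
import numpy as np

def elbow_index(values):
    """Index of the point farthest from the chord joining the first and last points."""
    y = np.asarray(values, dtype=float)
    x = np.arange(len(y), dtype=float)
    # Normalise both axes to [0, 1] so neither dominates the distance.
    x_n = (x - x[0]) / (x[-1] - x[0])
    y_n = (y - y.min()) / (y.max() - y.min())
    # Perpendicular distance from each point to the chord (0, y_n[0]) -> (1, y_n[-1]).
    dx, dy = 1.0, y_n[-1] - y_n[0]
    dist = np.abs(dy * x_n - dx * (y_n - y_n[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist))

ks = list(range(1, 8))
toy_inertias = [1000, 400, 150, 120, 100, 90, 85]  # hypothetical inertias for k = 1..7
best_k = ks[elbow_index(toy_inertias)]
print(best_k)
```

The heuristic still depends on the range of k that is scanned, so it reduces the subjectivity rather than removing it.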

In [None]:
model = KMeans(n_clusters=3, init='k-means++', random_state=RANDOM_SEED)
model.fit(scaled_data[features])
scaled_data['KMeans'] = model.predict(scaled_data[features])

In [None]:
projected_df = project_to_2d(scaled_data, features, plot=True, cluster='KMeans')

In [None]:
pair_plot_cluster(train_df, scaled_data, 'KMeans')

As you might have guessed, this method is very subjective and involves a lot of eyeballing and assumptions. That is manageable on a simple dataset like this one, but frankly we are going to encounter much more complex problems than this in real life. So, let's try the second method.
### B. Silhouette Coefficient Method
A higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:
* **a**: The mean distance between a sample and all other points in the same cluster.
* **b**: The mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient of a single sample is then `(b - a) / max(a, b)`, which ranges from -1 to 1. Finally, the Total Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.
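
The definition can be checked by hand on a tiny example. The sketch below uses six made-up points in two obvious groups (an assumption for illustration), computes `(b - a) / max(a, b)` per sample and compares the result with sklearn's `silhouette_samples` and `silhouette_score`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_of(i):
    """Silhouette coefficient of sample i, computed straight from the definition."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False                       # exclude the sample itself
    a = dists[same].mean()                # mean distance within its own cluster
    b = min(dists[labels == c].mean()     # mean distance to the nearest other cluster
            for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i) for i in range(len(X))])
print(np.allclose(manual, silhouette_samples(X, labels)))
print(manual.mean(), silhouette_score(X, labels))
```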

In [None]:
SILHOUETTES = []
for cluster in range(2,20):
    # n_jobs is not passed: the parameter was removed from KMeans in scikit-learn 1.0
    model = KMeans(
        n_clusters=cluster, init='k-means++',
        random_state=RANDOM_SEED)
    model.fit(scaled_data[features])
    labels = model.labels_
    SILHOUETTES.append(silhouette_score(
        scaled_data[features],
        labels, metric = 'euclidean'
    ))

silhouette_df = pd.DataFrame({'Num_Clusters':range(2,20), 'Silhouette':SILHOUETTES})
plt.figure(figsize=(12,6))
sns.lineplot(data=silhouette_df, x="Num_Clusters", y="Silhouette", marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score');

Here we see that we have a clear peak at 2, thus our optimal number of clusters as per kmeans is 2.  
As you can see, the Silhouette method is much more objective in nature and we do not have to guess the position.

***NOTE:- You might be wondering why we do not simply set a very high number of clusters. The inertia keeps decreasing as we increase the number of clusters, until it reaches zero when there are exactly as many clusters as points in the dataset. But this would defeat the purpose: we do not want 100% purity, we want to group together SIMILAR data points.***

In [None]:
model = KMeans(n_clusters=2, init='k-means++', random_state=RANDOM_SEED)
model.fit(scaled_data[features])
scaled_data['KMeans'] = model.predict(scaled_data[features])

In [None]:
projected_df = project_to_2d(scaled_data, features, plot=True, cluster='KMeans')

In [None]:
pair_plot_cluster(train_df, scaled_data, 'KMeans')

Here we see that the differentiating factors among the clusters are Income, Health and Life expectancy, which happen to be key indicators of lifestyle.  
So based on these we can classify the nations into 'Healthy Lifestyle' and 'Unhealthy Lifestyle'.
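
Cluster ids are arbitrary integers, so which id ends up meaning which group can change between algorithms, seeds or library versions. A minimal sketch of naming clusters from their mean income instead of fixing the ids by hand follows; the toy frame and its `cluster` and `prediction` columns are hypothetical stand-ins for the notebook's data.

```python
import pandas as pd

# Hypothetical toy frame: three low-income and three high-income rows.
toy = pd.DataFrame({
    "income":  [1200, 1500, 900, 42000, 38000, 51000],
    "cluster": [1, 1, 1, 0, 0, 0],
})

# Name the clusters by comparing their mean income, whatever ids they received.
mean_income = toy.groupby("cluster")["income"].mean()
label_map = {
    mean_income.idxmax(): "Healthy Lifestyle",
    mean_income.idxmin(): "Unhealthy Lifestyle",
}

toy["prediction"] = toy["cluster"].map(label_map)
print(toy)
```

With more than two clusters the same idea extends to ranking the cluster means rather than taking only the extremes.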
### Predictions

In [None]:
label_dict = {
    0 : 'Unhealthy Lifestyle',
    1 : 'Healthy Lifestyle'
}

train_df['Kmeans_Prediction'] = train_df['KMeans'].map(label_dict)

There can be nations where people earn a lot but do not spend on healthcare, or vice versa. This broad category takes care of that as well. Let's see some examples of each class...

In [None]:
print(train_df[train_df['Kmeans_Prediction'] == 'Healthy Lifestyle'].sample(10)['country'].to_list())

In [None]:
print(train_df[train_df['Kmeans_Prediction'] == 'Unhealthy Lifestyle'].sample(10)['country'].to_list())

# 2. Hierarchical Clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. This is an unsupervised clustering algorithm which involves creating clusters that have predominant ordering. Strategies for hierarchical clustering generally fall into two types:
1. Agglomerative Hierarchical Clustering
2. Divisive Hierarchical Clustering

And unlike k-Means, here we use the **dendrogram method** to determine the ideal number of clusters from the hierarchy. What is the dendrogram method? Let me explain:-

### Dendrogram
We can use a dendrogram to visualize the history of groupings and figure out the optimal number of clusters.
1. Determine the largest vertical distance that doesn’t intersect any of the other clusters.
2. Draw a horizontal line at both extremities.
3. The optimal number of clusters is equal to the number of vertical lines crossed by a horizontal line drawn between those two extremities.

![](https://miro.medium.com/proxy/1*LBOReupihNEsI6Kot3Q6YQ.png)
Source:- [Medium](https://towardsdatascience.com/machine-learning-algorithms-part-12-hierarchical-agglomerative-clustering-example-in-python-1e18e0075019)  
In this example, the ideal number of clusters will be 4. Now let's draw a similar diagram for our problem...
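
The "largest vertical distance" rule can also be applied numerically to the linkage matrix. The sketch below assumes two toy blobs (not the country data): it finds the biggest jump between successive merge heights and cuts the tree inside that jump with scipy's `fcluster`.

```python
import numpy as np
import scipy.cluster.hierarchy as shc
from sklearn.datasets import make_blobs

# Toy data (an assumption for illustration): two tight, distant blobs of 30 points each.
X, _ = make_blobs(n_samples=60, centers=[[0, 0], [10, 10]],
                  cluster_std=0.5, random_state=0)

Z = shc.linkage(X, method="ward")
heights = Z[:, 2]            # merge distances, in increasing order for Ward linkage
gaps = np.diff(heights)      # vertical distance between successive merges

# Cutting inside the largest gap keeps merges 0..i and undoes all later ones.
i = int(np.argmax(gaps))
n_clusters = len(X) - (i + 1)

labels = shc.fcluster(Z, t=n_clusters, criterion="maxclust")
print(n_clusters, np.bincount(labels)[1:])
```

On real data the largest gap is often less clear-cut, so the dendrogram plot is still worth inspecting.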

In [None]:
plt.figure(figsize=(50, 20))
_ = shc.dendrogram(shc.linkage(scaled_data[features], method='ward'))

So in our case the optimal number of clusters will be **2** using the dendrogram method.  
### Agglomerative Hierarchical Clustering
Bottom up approach. Start with many small clusters and merge them together to create bigger clusters.  

In [None]:
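# Ward linkage uses Euclidean distance, which is the default, so it is not passed explicitly
# (the `affinity` keyword was removed in newer scikit-learn releases)
model = AgglomerativeClustering(n_clusters=2, linkage='ward')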
model.fit(scaled_data[features])
scaled_data['Agglomerative_H'] = model.labels_

In [None]:
projected_df = project_to_2d(scaled_data, features, plot=True, cluster='Agglomerative_H')

In [None]:
pair_plot_cluster(train_df, scaled_data, 'Agglomerative_H')

### Predictions

In [None]:
label_dict = {
    0 : 'Healthy Lifestyle',
    1 : 'Unhealthy Lifestyle'
}

train_df['Agglomerative_Prediction'] = train_df['Agglomerative_H'].map(label_dict)

In [None]:
print(train_df[train_df['Agglomerative_Prediction'] == 'Healthy Lifestyle'].sample(10)['country'].to_list())

In [None]:
print(train_df[train_df['Agglomerative_Prediction'] == 'Unhealthy Lifestyle'].sample(10)['country'].to_list())

# 3. Affinity Propagation

AffinityPropagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given.

In [None]:
af = AffinityPropagation(preference=-200)
af.fit(scaled_data[features]);
scaled_data['Affinity'] = af.labels_

In [None]:
projected_df = project_to_2d(scaled_data, features, plot=True, cluster='Affinity')

In [None]:
pair_plot_cluster(train_df, scaled_data, 'Affinity')

### Predictions

In [None]:
label_dict = {
    1 : 'Healthy Lifestyle',
    0 : 'Unhealthy Lifestyle'
}

train_df['Affinity_Prediction'] = train_df['Affinity'].map(label_dict)

In [None]:
print(train_df[train_df['Affinity_Prediction'] == 'Healthy Lifestyle'].sample(10)['country'].to_list())

In [None]:
print(train_df[train_df['Affinity_Prediction'] == 'Unhealthy Lifestyle'].sample(10)['country'].to_list())

As we can see, Affinity Propagation has made splits similar to those of the earlier models.

The best thing about Affinity Propagation is that the number of clusters is calculated automatically depending on the hyperparameters, so we do not have to guess the number of clusters.
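
The main hyperparameter behind that automatic choice is `preference`: lower (more negative) values tend to produce fewer exemplars and therefore fewer clusters. Here is a minimal sketch on toy blobs; the data and the three preference values are assumptions for illustration, and `random_state` requires a reasonably recent scikit-learn.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=90, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.6, random_state=0)

cluster_counts = {}
for preference in (-200, -50, -5):
    ap = AffinityPropagation(preference=preference, random_state=0).fit(X)
    # One exemplar per cluster; the array is empty if the run did not converge.
    cluster_counts[preference] = len(ap.cluster_centers_indices_)

print(cluster_counts)
```

So the number of clusters is not guessed directly, but it is still steered by how `preference` is set.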

# 4. Mean Shift
MeanShift clustering aims to discover blobs in a smooth density of samples. It is a centroid based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.

In [None]:
bandwidth = estimate_bandwidth(scaled_data[features], quantile=0.2)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(scaled_data[features])
scaled_data['Mean Shift'] = ms.labels_

In [None]:
projected_df = project_to_2d(scaled_data, features, plot=True, cluster='Mean Shift')

In [None]:
pair_plot_cluster(train_df, scaled_data, 'Mean Shift')

We can see that Mean Shift is not working particularly well on this type of data. Nevertheless, it is a very powerful algorithm and is worth trying on any unsupervised problem.  
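
How Mean Shift behaves depends heavily on the bandwidth, which `estimate_bandwidth` derives from the `quantile` argument. A minimal sketch of that sensitivity on toy blobs follows; the data and the quantile values are assumptions for illustration.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.6, random_state=0)

results = {}
for quantile in (0.1, 0.3, 0.5):
    # A larger quantile looks at more distant neighbours, giving a wider bandwidth.
    bandwidth = estimate_bandwidth(X, quantile=quantile)
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
    results[quantile] = (round(float(bandwidth), 2), len(ms.cluster_centers_))

print(results)  # quantile -> (bandwidth, number of clusters found)
```

A wider bandwidth smooths the density estimate and usually merges nearby modes, so trying a few quantiles is a cheap first tuning step.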

# 5. Spectral Clustering

SpectralClustering performs a low-dimension embedding of the affinity matrix between samples, followed by clustering, e.g., by KMeans, of the components of the eigenvectors in the low dimensional space.
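
Because it clusters in an embedding built from the affinity graph, spectral clustering can separate shapes that centroid methods cannot. The following is a minimal sketch on the two-moons toy dataset, which is an assumption for illustration and unrelated to the country data; agreement with the true moons is measured with the adjusted Rand index.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: non-convex clusters.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors",
    assign_labels="kmeans", random_state=0
).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, km_labels)
ari_spectral = adjusted_rand_score(y_true, sc_labels)
print(f"k-Means ARI: {ari_kmeans:.2f}, Spectral ARI: {ari_spectral:.2f}")
```

scikit-learn may warn that the neighbour graph is not fully connected here; for this toy case that disconnection is exactly what separates the two moons.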

In [None]:
sc = SpectralClustering(
    n_clusters=2, assign_labels="kmeans",
    affinity='nearest_neighbors',
    random_state=RANDOM_SEED
)
sc.fit(scaled_data[features])
scaled_data['Spectral'] = sc.labels_

In [None]:
projected_df = project_to_2d(scaled_data, features, plot=True, cluster='Spectral')

In [None]:
pair_plot_cluster(train_df, scaled_data, 'Spectral')

### Predictions

In [None]:
label_dict = {
    1 : 'Healthy Lifestyle',
    0 : 'Unhealthy Lifestyle'
}

train_df['Spectral_Prediction'] = train_df['Spectral'].map(label_dict)

In [None]:
print(train_df[train_df['Spectral_Prediction'] == 'Healthy Lifestyle'].sample(10)['country'].to_list())

In [None]:
print(train_df[train_df['Spectral_Prediction'] == 'Unhealthy Lifestyle'].sample(10)['country'].to_list())

# 6. DBSCAN
The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples).

In [None]:
dbs = DBSCAN(eps=1, min_samples=5)
dbs.fit(scaled_data[features])
scaled_data['DBSCAN'] = dbs.labels_

In [None]:
projected_df = project_to_2d(scaled_data, features, plot=True, cluster='DBSCAN')

In [None]:
pair_plot_cluster(train_df, scaled_data, 'DBSCAN')

DBSCAN is not performing particularly well on this dataset, but it has two major hyperparameters (`eps` and `min_samples`) that can be tuned to achieve better performance. DBSCAN is a very powerful algorithm and is extensively used in unsupervised problems.  
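
A simple way to start tuning is to scan `eps` and watch how many clusters and noise points come out. This is a minimal sketch on toy blobs; the data and the `eps` values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=[[0, 0], [6, 6]],
                  cluster_std=0.7, random_state=0)

summary = {}
for eps in (0.2, 0.5, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels) - {-1})   # label -1 marks noise, not a cluster
    n_noise = int(np.sum(labels == -1))
    summary[eps] = (n_clusters, n_noise)

print(summary)  # eps -> (clusters found, noise points)
```

With `min_samples` fixed, a larger `eps` can only shrink the set of noise points, so the useful range is usually where the cluster count is stable and the noise count is acceptable.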
# 7. Gaussian Mixture Model
The GaussianMixture object implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussian models. It can also draw confidence ellipsoids for multivariate models, and compute the Bayesian Information Criterion to assess the number of clusters in the data.
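
The BIC mentioned above gives a way to pick the number of components: fit several models and prefer the one with the lowest BIC. Here is a minimal sketch on toy blobs; the data and the scanned range of 1 to 6 components are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [0, 8]],
                  cluster_std=0.7, random_state=0)

bics = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gmm.bic(X)   # lower BIC = better trade-off between fit and complexity

best_k = min(bics, key=bics.get)
print(best_k)
```

Unlike inertia, BIC penalises extra components, so it does not simply keep improving as more clusters are added.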

In [None]:
gmm = GaussianMixture(n_components=2, random_state=RANDOM_SEED)
gmm.fit(scaled_data[features])
scaled_data['GMM'] = gmm.predict(scaled_data[features])

In [None]:
projected_df = project_to_2d(scaled_data, features, plot=True, cluster='GMM')

In [None]:
pair_plot_cluster(train_df, scaled_data, 'GMM')

### Predictions

In [None]:
label_dict = {
    1 : 'Healthy Lifestyle',
    0 : 'Unhealthy Lifestyle'
}

train_df['GMM_Prediction'] = train_df['GMM'].map(label_dict)

In [None]:
print(train_df[train_df['GMM_Prediction'] == 'Healthy Lifestyle'].sample(10)['country'].to_list())

In [None]:
print(train_df[train_df['GMM_Prediction'] == 'Unhealthy Lifestyle'].sample(10)['country'].to_list())

# Recommendation:-  
As almost all of the algorithms give us only 2 distinct classes, it is apparent from the problem statement that we should recommend the 'Unhealthy Lifestyle' group of countries to the firm for investment. Those countries are in greater need of the money: as we can see from the distributions on the pair-plots, income and healthcare expenditure were among the major distinguishing factors of the classification.

This was a quick overview/implementation example of almost all of the major unsupervised machine learning models for tabular data.  
Hope you learnt something from this notebook.  
I will always keep updating and adding new things to this notebook as and when I come across more algorithms worth sharing. So come back for more if you liked this one...  

**Also if you found this notebook useful and use parts of it in your work, please don't forget to show your appreciation by upvoting this kernel.**