<a href="https://colab.research.google.com/github/DLPY/Unsupervised-Learning-Session-1/blob/main/Hierarchical_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering Customers based on Bank Account Data

Details on the data: https://www.kaggle.com/shrutimechlearn/churn-modelling

# Download source data from GitHub
!wget -nc https://raw.githubusercontent.com/DLPY/Classification_Session_1/main/Churn_Modelling.csv

# **1. Import the packages needed for Hierarchical Clustering**

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.cluster.hierarchy as sch
import scipy.stats as stats
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list, linkage
from scipy.spatial import distance
from sklearn import preprocessing
from sklearn.cluster import AgglomerativeClustering


import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
# -nc skips the download if Churn_Modelling.csv is already present.
!wget -nc https://raw.githubusercontent.com/DLPY/Classification_Session_1/main/Churn_Modelling.csv

# **2. Read data from csv file into Pandas dataframe**

In [None]:
df = pd.read_csv('Churn_Modelling.csv')

## Data Description
This data set contains details of a bank's customers.

The features are described below.

**RowNumber:** Row numbers from 1 to 10000.

**CustomerId:** Unique Ids for bank customer identification.

**Surname:** Customer's last name.

**CreditScore:** Credit score of the customer.

**Geography:** The country the customer belongs to (Germany/France/Spain).

**Gender:** Male or Female.

**Age:** Age of the customer.

**Tenure:** Number of years for which the customer has been with the bank.

**Balance:** Bank balance of the customer.

**NumOfProducts:** Number of bank products the customer is utilising.

**HasCrCard:** Binary flag for whether the customer holds a credit card with the bank (0=No, 1=Yes).

**IsActiveMember:** Binary flag for whether the customer is an active member with the bank (0=No, 1=Yes).

**EstimatedSalary:** Estimated salary of the customer in Euro.

**Exited:** Binary flag for whether the customer closed their account with the bank (0=No, the customer is retained; 1=Yes).

In [None]:
df.head(5)

## Feature selection - columns that are useful and general to all customers (and not binary values):
* **Balance**
* **EstimatedSalary**
* **CreditScore**
* **Tenure**
* **NumOfProducts**

In [None]:
features = df[['Balance', 'EstimatedSalary', 'CreditScore', 'Tenure', 'NumOfProducts']]

## Quick review of the data set

In [None]:
features.isnull().sum()

In [None]:
# Visualise the distribution of each selected feature.
plt.figure(1, figsize=(15, 6))
col_count = len(features.columns)
for n, x in enumerate(features.columns, start=1):
    plt.subplot(1, col_count, n)
    # sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it.
    sns.histplot(features[x], bins=15, kde=True, stat='density')
    plt.title('Distribution of {}'.format(x))
plt.subplots_adjust(hspace=0.5, wspace=0.5)
plt.show()

# **3. Could we identify clusters using business rules? How?**

In [None]:
sns.scatterplot(x=features['EstimatedSalary'],
                y=features['Balance'],
                hue=features['NumOfProducts'],
                data=features)
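
One way to approach this question is to write the segments by hand as explicit rules and see how customers fall into them. The sketch below is only an illustration: the small `sample` frame stands in for `features`, and the two cut-offs (a non-zero balance and a salary of 100000) are assumptions chosen for the example, not values derived from the data.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for `features` (values are illustrative only).
sample = pd.DataFrame({
    "Balance": [0.0, 125000.0, 0.0, 90000.0],
    "EstimatedSalary": [50000.0, 150000.0, 120000.0, 40000.0],
})

# Hypothetical business rules: these cut-offs are assumptions, not derived from the data.
has_balance = sample["Balance"] > 0
high_salary = sample["EstimatedSalary"] >= 100000

# np.select assigns the choice of the first condition that is True for each row.
sample["Segment"] = np.select(
    [has_balance & high_salary, has_balance, high_salary],
    ["balance+high salary", "balance only", "high salary only"],
    default="neither",
)
print(sample)
```

Rules like these are easy to explain, but every threshold has to be chosen and maintained by hand, which is the motivation for letting a clustering algorithm propose the groups instead.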

# **4. Plot a dendrogram to identify the count of clusters.**

In [None]:
# The cutoff is chosen by viewing the dendrogram and 'eyeballing' the result.
max_d = 0.7

# Absolute value of the correlation matrix, subtracted from 1 to give a dissimilarity
DF_dism = 1 - np.abs(features.corr())

# Compute average linkage on the condensed distance matrix
A_dist = distance.squareform(DF_dism.to_numpy())
Z = linkage(A_dist, method="average")

# Dendrogram, with a horizontal line marking the cut at max_d
plt.axhline(y=max_d, c='k')
D = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=max_d, leaf_font_size=12, leaf_rotation=45)
plt.show()
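
Reading the groups off the plot by eye can be cross-checked in code. The sketch below uses `fcluster` (imported at the top of the notebook) to cut a linkage tree at a height threshold and return flat group ids. The 4x4 `dism` matrix and the item names are made up for the example; in the notebook the equivalent inputs would be `Z` and `max_d`, where the items being grouped are the feature columns, since the dissimilarity is built from their correlation matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial import distance

# Made-up dissimilarity matrix for four items (symmetric, zero diagonal).
names = ["a", "b", "c", "d"]
dism = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])

# Average linkage on the condensed form, as in the cell above.
Z_toy = linkage(distance.squareform(dism), method="average")

# Cut the tree at a height threshold; items joined below it share a group id.
max_d_toy = 0.7
flat = fcluster(Z_toy, t=max_d_toy, criterion="distance")
groups = dict(zip(names, flat))
print(groups)
```

Here `a` and `b` merge at 0.1 and `c` and `d` at 0.2, while the two pairs only join at about 0.86, so a cut at 0.7 leaves two groups.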

# **5. Agglomerative Clustering (bottom up approach)**

In [None]:
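# Ward linkage always uses Euclidean distance, so the `affinity` argument
# (deprecated in scikit-learn 1.2 and later removed) is omitted.
# The other arguments previously passed explicitly were the defaults.
aggloclust = AgglomerativeClustering(n_clusters=4, linkage='ward').fit(features)
print(aggloclust)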

# Get the clustered labels
labels = aggloclust.labels_
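
Ward linkage works on Euclidean distances, so columns with large numeric ranges (here `Balance` and `EstimatedSalary`) contribute far more to the distance than `Tenure` or `NumOfProducts` when the features are passed in unscaled, as they are above. A common alternative is to standardise the columns first. The sketch below shows that variant on synthetic data; it is not what this notebook does, and the `toy` frame and its values are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for `features`: one large-range column, one small-range column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Balance": rng.uniform(0, 200000, size=60),
    "NumOfProducts": rng.integers(1, 5, size=60),
})

# Standardise each column to zero mean and unit variance before clustering.
scaled = StandardScaler().fit_transform(toy)

model = AgglomerativeClustering(n_clusters=4, linkage="ward")
scaled_labels = model.fit_predict(scaled)

# Number of rows assigned to each cluster.
print(pd.Series(scaled_labels).value_counts().sort_index())
```

Whether scaling is appropriate depends on the question being asked: leaving the features unscaled deliberately lets the monetary columns drive the grouping.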

In [None]:
features.columns.values

In [None]:
# Scatter plots of each pair of continuous features, coloured by cluster label.
# Notice that EstimatedSalary vs. Balance shows an especially clear distinction.
plt.figure(1, figsize=(15, 6))
n = 0
cols = list(features.columns.values)
cols.remove('NumOfProducts')
cols.remove('Tenure')
col_count = len(cols)  # one subplot per plotted pair

for x in cols:
    # Drop x and rotate the remaining columns so each iteration plots a different pair.
    sublist = list(cols)
    sublist.remove(x)
    newlist = sublist[n:] + sublist[:n]
    n += 1
    plt.subplot(1, col_count, n)
    sns.scatterplot(x=features[newlist[1]],
                    y=features[newlist[0]],
                    hue=labels,
                    data=features)

plt.subplots_adjust(hspace=5, wspace=0.5)
plt.show()

# **6. Review and compare the clusters**

In [None]:
# Store the results as a copy and add the cluster values to the dataframe.
results = features.copy(deep=True)

In [None]:
results['Cluster'] = labels

In [None]:
results.head(10)

# **7. Summary of Hierarchical Clustering**

In [None]:
features_agg = results.groupby("Cluster")

In [None]:
features_agg_avg = features_agg.mean().reset_index()
features_agg_avg
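
Cluster means are easier to interpret next to the number of customers in each cluster, since a mean over a handful of rows is less stable than one over thousands. The sketch below computes both in one `groupby` using pandas named aggregation. The `toy_results` frame is made up for the example and stands in for `results`.

```python
import pandas as pd

# Made-up stand-in for `results`: one feature plus the cluster id.
toy_results = pd.DataFrame({
    "Balance": [0.0, 10.0, 20.0, 30.0, 40.0],
    "Cluster": [0, 0, 1, 1, 1],
})

# Named aggregation: each output column is (source column, aggregation).
summary = toy_results.groupby("Cluster").agg(
    mean_balance=("Balance", "mean"),
    size=("Balance", "size"),
).reset_index()
print(summary)
```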

In [None]:
fig, ax = plt.subplots(figsize=(18, 3))

# Two extra y-axes that share the same x-axis as ax.
ax2 = ax.twinx()
ax3 = ax.twinx()
ax3.spines['right'].set_position(('outward', 60))  # offset so it does not overlap ax2

width = 0.2

p1 = features_agg_avg.EstimatedSalary.plot(kind='bar', color='red', ax=ax, width=width, position=1, label='Estimated Salary')
p2 = features_agg_avg.Balance.plot(kind='bar', color='blue', ax=ax2, width=width, position=0, label='Balance')
p3 = features_agg_avg.NumOfProducts.plot(kind='bar', color='green', ax=ax3, width=width, position=2, label='Products')

ax.grid()
ax.set_xlabel('Cluster')
ax.tick_params(axis='x', rotation=0)
ax.set_ylabel('Estimated Salary')
ax2.set_ylabel('Balance')
ax3.set_ylabel('Products')
ax.set_ylim(0, 170000)
ax2.set_ylim(0, 170000)
ax3.set_ylim(0, 4)
plt.title("Cluster vs. (Avg Balance & Estimated Salary & Products)", weight='bold')

# Collect the plotted objects from all three axes into one combined legend.
# The names avoid `labels`, so the cluster labels array is not overwritten.
handles1, names1 = ax.get_legend_handles_labels()
handles2, names2 = ax2.get_legend_handles_labels()
handles3, names3 = ax3.get_legend_handles_labels()
ax3.legend(handles1 + handles2 + handles3, names1 + names2 + names3, loc=0)
plt.show()

**Conclusion:** 

1. Average balance for each cluster, from highest to lowest: 0, 2, 3, 1

2. Average estimated salary for each cluster, from highest to lowest: 2, 1, 0, 3

3. Average count of products for each cluster, from highest to lowest: 1 & 3, 0 & 2


Balance and estimated salary appear to be the features that have the biggest influence on determining the cluster groups.  

Average credit score and tenure are nearly the same across all clusters.  

The average count of products is similar within two pairs of clusters (0 and 2, and 1 and 3), with clusters 1 and 3 having a slightly higher average count than the other two.
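
The orderings in the conclusion can be read off the table of cluster means, but deriving them in code makes them reproducible if the clustering is rerun. The sketch below sorts a table of means per feature and lists the cluster ids from highest to lowest. The `toy_avg` frame and its numbers are made up for the example and are not the notebook's results; in the notebook the equivalent input would be `features_agg_avg`.

```python
import pandas as pd

# Made-up cluster means standing in for `features_agg_avg` (not the notebook's results).
toy_avg = pd.DataFrame({
    "Cluster": [0, 1, 2, 3],
    "Balance": [30.0, 10.0, 20.0, 40.0],
    "EstimatedSalary": [5.0, 8.0, 2.0, 6.0],
})

# For each feature, list the cluster ids from highest to lowest mean.
rankings = {
    col: toy_avg.sort_values(col, ascending=False)["Cluster"].tolist()
    for col in ["Balance", "EstimatedSalary"]
}
print(rankings)
```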