###### Objective

To analyze the All life credit card customer base, identify the shortcomings in the customer care services offered to its customers, and help the bank upgrade its service delivery model.

###### Set the working directory

In [None]:
import os

In [None]:
os.getcwd()

###### Importing the required libraries

In [None]:
import pandas as pd
import numpy as np

## Visualization libraries
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns

# For missing values
import missingno as msno

# Model libraries
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

# Ignore warnings 
import warnings
pd.options.display.max_columns = None
pd.options.display.max_rows = None
warnings.filterwarnings("ignore")

###### Read the dataset

In [None]:
data=pd.read_excel('Credit Card Customer Data.xlsx')

In [None]:
data1=pd.read_excel('Credit Card Customer Data.xlsx')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.describe().transpose()

In [None]:
data.nunique()

In [None]:
data.columns

##### Basic checks on the data before getting it ready for analysis 

In [None]:
def basic_checks(df):
    
    print('='*50)
    print('Shape of the dataframe is: \n',df.shape)
    print('='*50)
    print('Basic stats for the data: \n',df.describe())
    print('='*50)
    print('Data type and info :')
    print(df.info())
    print('='*50)
    print('Missing value information : \n',df.isnull().any())
    print('='*50)
    print('Sum of missing values if any : \n',df.isnull().sum())

In [None]:
basic_checks(data)

###### Missing values matrix

In [None]:
msno.matrix(data)

There are no missing values in any of the variables in the dataset.

###### Dropping the unnecessary columns

Drop the serial number (`Sl_No`) and `Customer Key` columns from the analysis, since they are identifiers rather than customer attributes.

In [None]:
data.drop(['Sl_No','Customer Key'],axis=1,inplace=True)

In [None]:
data.head()

In [None]:
data1.drop(['Sl_No'],axis=1,inplace=True)

In [None]:
data1.head()

###### Plotting the correlations 

In [None]:
data.corr()

In [None]:
plt.figure(figsize=(10,8))

sns.heatmap(data.corr(),
            annot=True,
            linewidths=.5,
            center=0,
            cbar=False,
            cmap="YlGnBu")

plt.show()

In [None]:

sns.pairplot(data,diag_kind='kde');

In [None]:
############################### EDA PART 1 ##################################


#### EDA - Part 1 

**Basic checks**

1. There are 660 records in the dataset but only 655 unique customer keys, which implies that some customer keys are duplicated. Five customer keys appear more than once, so about 10 of the records could indicate a data entry issue.
2. All the variables in the data are of int data type.
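
The duplicate customer keys noted above can be listed directly. The following is a minimal sketch on a made-up toy frame (the values and the `toy` name are hypothetical); on the real data the same calls would be applied to a copy that still has the `Customer Key` column.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw data (values are made up).
toy = pd.DataFrame({
    "Customer Key": [101, 102, 103, 102, 104, 101],
    "Avg_Credit_Limit": [10000, 20000, 30000, 25000, 40000, 10000],
})

# keep=False marks every occurrence of a repeated key, not only the later ones.
dup_mask = toy["Customer Key"].duplicated(keep=False)
dup_rows = toy[dup_mask].sort_values("Customer Key")

# Number of distinct keys that repeat, and number of rows they account for.
n_repeated_keys = toy.loc[dup_mask, "Customer Key"].nunique()
print(n_repeated_keys, len(dup_rows))
print(dup_rows)
```

Inspecting `dup_rows` side by side helps decide whether a repeated key is a true duplicate or an updated record for the same customer.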

**Basic stats**

1. There are 660 records and 7 variables in the dataset.
2. The mean credit limit is 34574.2 and the median is 18000, which clearly indicates that the data is right-skewed.
3. The maximum credit limit is 200000.
4. The clients in the dataset hold at least 1 credit card and at most 10.
5. The median number of credit cards is 5; the credit card distribution is not skewed.
6. The number of bank visits ranges from 0 to 5, which indicates there is a customer base that never visits the bank.
7. The bank visit data is not skewed.
8. The maximum number of online visits is 15 and the mean is 2.5; there is also a customer base that never makes online visits.
9. The number of calls made by a customer ranges from 0 to 10.
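
The mean-versus-median reading of skewness used above can be checked numerically. This is a small sketch on made-up numbers (the `limits` series is hypothetical, not the real credit limits), using the pandas sample skewness as a second opinion.

```python
import pandas as pd

# Hypothetical right-skewed sample (made-up values).
limits = pd.Series([5000, 8000, 10000, 12000, 18000, 20000,
                    35000, 60000, 150000, 200000])

mean_val = limits.mean()
median_val = limits.median()
skewness = limits.skew()  # sample skewness; a positive value suggests a longer right tail

print(f"mean={mean_val:.1f}, median={median_val:.1f}, skew={skewness:.2f}")
```

A mean well above the median together with a positive skewness points to a right-skewed variable.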


**Missing values**

There are no missing values in the dataset 

**Correlation matrix**

1. There seems to be a high correlation between average credit limit and total credit cards.
2. There is also some correlation between total online visits and average credit limit.
3. Total calls made and average credit limit are negatively correlated, implying that the higher the credit limit, the fewer the calls.
4. There is a negative correlation between the number of bank visits and the number of online visits, i.e. customers with more online visits have fewer bank visits.
5. Total credit cards and calls made are negatively correlated, indicating that clients with more credit cards make fewer calls to the bank.

Correlation doesn’t imply causation, so let us dig into the details of how the variables may be related and whether we could derive a metric from them.
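
To read the correlation matrix pair by pair rather than from the heatmap, the pairs can be ranked by absolute correlation. The sketch below uses synthetic columns that are deliberately built to correlate (the `toy` frame and its values are assumptions, only the column names mirror the dataset).

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
limit = rng.normal(size=n)

# Synthetic stand-in data: cards rise with the limit, calls fall with it.
toy = pd.DataFrame({
    "Avg_Credit_Limit": limit,
    "Total_Credit_Cards": limit + rng.normal(scale=0.5, size=n),
    "Total_calls_made": -limit + rng.normal(scale=0.5, size=n),
})

corr = toy.corr()

# One entry per unordered pair of columns, strongest relationship first.
pairs = sorted(
    ((a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)),
    key=lambda t: abs(t[2]),
    reverse=True,
)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```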

**Possible clusters in the data**

1. Looking at the KDE plots, a few of the variables clearly show a tendency to group together.
2. We see 3 or more peaks for this data, but let’s look at it in more detail.



###### Univariate and Bivariate analysis 

In [None]:
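def plots(variable):
    # Histogram with KDE on the left, boxplot on the right
    fig=plt.figure(figsize=(15,5))
    plt.subplot(121)
    sns.histplot(data[variable],kde=True)  # distplot is deprecated in recent seaborn
    plt.subplot(122)
    sns.boxplot(x=data[variable])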

In [None]:
plots('Avg_Credit_Limit')

There are many outliers, as we can see, and the data is right-skewed.

In [None]:
def CL_cat(x):
    # Bucket the average credit limit: 0 = up to 50000, 1 = up to 100000, 2 = above 100000
    if 0 < x <= 50000:
        return 0
    elif 50000 < x <= 100000:
        return 1
    elif x > 100000:
        return 2

In [None]:
data['CL_cat']=data['Avg_Credit_Limit'].apply(CL_cat)

In [None]:
sns.countplot(x='CL_cat',data=data)  # recent seaborn requires the column as a keyword argument
#plt.xticks(rotation=90)

In [None]:
a=data[data['Avg_Credit_Limit']>=100000].count()/data.shape[0]

print(a*100)

b=data[data['Avg_Credit_Limit']<100000].count()/data.shape[0]

print(b*100)


In [None]:
plots('Total_Credit_Cards')

In [None]:
a=data[data['Total_Credit_Cards']>=7].count()/data.shape[0]
print(a*100)

In [None]:
sns.countplot(x=data['Total_Credit_Cards'])

In [None]:
plots('Total_visits_bank')

In [None]:
sns.countplot(x=data['Total_visits_bank'])

In [None]:
a=data[data['Total_visits_bank']==0].count()/data.shape[0]
print(a*100)

b=data[data['Total_visits_bank']<=2].count()/data.shape[0]
print(b*100)

c=data[data['Total_visits_bank']>3].count()/data.shape[0]
print(c*100)

In [None]:
plots('Total_visits_online')

In [None]:
a=data[data['Total_visits_online']==0]['Total_visits_online'].count()/data.shape[0]
print('Total online visits = 0 :',a*100)

b=data[data['Total_visits_online']<=2]['Total_visits_online'].count()/data.shape[0]
print('Total online visits <=2 :',b*100)

c=data[data['Total_visits_online']>6]['Total_visits_online'].count()/data.shape[0]
print('Total online visits >6:',c*100)

In [None]:
def OV_cat(x):
    # Bucket online visits: 0 = none, 1 = 1 to 6 visits, 2 = more than 6 visits
    if x == 0:
        return 0
    elif 0 < x <= 6:
        return 1
    elif x > 6:
        return 2

In [None]:
data['OV_cat']=data['Total_visits_online'].apply(OV_cat)

In [None]:
sns.countplot(x=data['OV_cat'])

In [None]:
plots('Total_calls_made')

In [None]:
sns.countplot(x=data['Total_calls_made'])

In [None]:
a=data[data['Total_calls_made']==0]['Total_calls_made'].count()/data.shape[0]
print('Total_calls_made = 0 :',a*100)

b=data[data['Total_calls_made']<=2]['Total_calls_made'].count()/data.shape[0]
print('Total_calls_made <=2 :',b*100)

d=data[data['Total_calls_made']>6]['Total_calls_made'].count()/data.shape[0]
print('Total_calls_made >6:',d*100)


In [None]:
data.columns

In [None]:
data.drop(['CL_cat','OV_cat'],axis=1,inplace=True)

###### Multivariate plots 

In [None]:
############### Average CreditLimit vs (Bank Visits,TotalCredit Cards) ######################

In [None]:
fig, ax = plt.subplots(figsize=(10,7))
sns.scatterplot(data=data,x='Avg_Credit_Limit',y='Total_visits_bank',hue='Total_Credit_Cards')
plt.xticks(rotation=90);
ax.set_title("Average CreditLimit vs (Bank Visits,TotalCredit Cards)")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()

In [None]:
############### Average CreditLimit vs (Online Visits,TotalCalls Made) ######################

In [None]:
fig, ax = plt.subplots(figsize=(10,7))
sns.scatterplot(data=data,x='Avg_Credit_Limit',y='Total_visits_online',hue='Total_calls_made')
plt.xticks(rotation=90);
ax.set_title("Average CreditLimit vs (Online Visits,TotalCalls Made)")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()

In [None]:
sns.scatterplot(data=data,x='Avg_Credit_Limit',y='Total_calls_made')
plt.xticks(rotation=90);

In [None]:
sns.scatterplot(data=data,x='Total_Credit_Cards',y='Total_visits_bank')
plt.xticks(rotation=90);

Customers holding between 4 and 7 credit cards seem to visit the bank more often.

In [None]:
sns.scatterplot(data=data,x='Total_Credit_Cards',y='Total_visits_online')
plt.xticks(rotation=90);

In [None]:
sns.scatterplot(data=data,x='Total_Credit_Cards',y='Total_calls_made')
plt.xticks(rotation=90);

In [None]:
sns.scatterplot(data=data,x='Total_visits_bank',y='Total_visits_online')
plt.xticks(rotation=90);

In [None]:
sns.scatterplot(data=data,x='Total_visits_online',y='Total_calls_made')
plt.xticks(rotation=90); 

In [None]:
########################## EDA - PART 2 ######################

###### EDA - Part 2 


**Univariate Analysis** 

Average Credit Limit 

1. Most of the data lies between 0 and 100000 average credit limit.
2. The distribution is right-skewed and there are outliers where the credit limit is greater than 100000.
3. 6% of the records have a credit limit of 100000 or more and the remaining 93% are below 100000.

Total Credit Cards 

1. The distribution is somewhat even and shows no outliers.
2. The range is 1 to 10 credit cards.
3. 21% of the users have 7 or more credit cards.

Total Bank visits

1. The distribution is somewhat even and shows no outliers.
2. The range is 0 to 5 visits.
3. 15% of the card holders never visited the bank, 56% visited the bank 2 times or fewer, and 29% visited the bank more than 3 times.

Total Online visits 

1. The distribution is right-skewed and has a few outliers.
2. 22% of the customers have never visited the online platform, 67% have visited it 2 times or fewer, and 8% have visited it more than 6 times.

Total calls made

1. The distribution is somewhat even, with no outliers.
2. 14.6% of the clients made no calls at all to the bank and 42% of the customers made 2 calls or fewer.
3. 19% of the customers called more than 6 times.
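
The percentages quoted in this section are all shares of rows that satisfy a condition. A compact way to compute them is to take the mean of a boolean mask. The sketch below uses made-up call counts, and the `share` helper is a hypothetical name introduced only for illustration.

```python
import pandas as pd

# Hypothetical call counts for ten customers (made-up values).
calls = pd.Series([0, 1, 2, 2, 3, 4, 5, 7, 8, 10], name="Total_calls_made")

def share(mask):
    """Percentage of rows where the boolean mask is True."""
    return mask.mean() * 100

print("no calls    :", share(calls == 0))
print("2 or fewer  :", share(calls <= 2))
print("more than 6 :", share(calls > 6))
```

Writing the condition next to the label makes it easier to keep the wording ("2 or fewer", "more than 6") in step with the comparison that was actually used.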

**Multivariate analysis** 

Average CreditLimit vs (Bank Visits,TotalCredit Cards)

1. Customers with an average credit limit above 100000 and more than 6 credit cards don’t seem to visit the bank more than once.
2. The highest numbers of bank visits come from customers with a credit limit between 0 and 75000.

Average CreditLimit vs (Online Visits,TotalCalls Made)

1. Customers with a higher credit limit visit online more frequently than those with a lower credit limit.

2. Customers with a higher credit limit also seem to make fewer calls (0-4) than those with a lower credit limit.

3. There are clearly 2 distinct groups here: customers with a credit limit below 75000 have fewer than 6 online visits, while those with a credit limit above 100000 have more online visits recorded (8 or above).

A few more points from the scatterplots

1. More bank visits come from clients who hold 4-7 cards.
2. The more credit cards a customer holds, the more online visits they make; customers with fewer cards show little to no online activity.
3. Customers who visit the bank less make more online visits.
4. More calls were made by customers who held 1-4 credit cards.
5. Customers who visited the bank online more than 7 times made far fewer calls than those with fewer online visits.


In [None]:
################################# DATA PREPROCESSING ##################################

###### Data Preprocessing 

**Dropping the unnecessary variables** 

Since we introduced a few categorical variables during EDA, we drop those unnecessary variables before proceeding with scaling and modelling the data.

**Outlier Treatment** 

A couple of variables, namely Average credit limit and Total online visits, have outliers. These are treated because clustering algorithms are very sensitive to outliers. The remaining variables don't need outlier treatment.
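
As an illustration of the treatment described above, the sketch below computes IQR fences and caps values at them. It runs on a made-up column, and `iqr_bounds` is a hypothetical helper name; it caps both tails with `clip`, whereas the cells in this notebook cap only the upper tail.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical column with one extreme value (made-up numbers).
visits = pd.Series([0, 1, 1, 2, 2, 3, 3, 4, 15], name="Total_visits_online")

lower, upper = iqr_bounds(visits)

# Values beyond the fences are pulled back to the fences.
capped = visits.clip(lower=lower, upper=upper)
print(lower, upper)
print(capped.tolist())
```

Computing the fences from the same column that is being capped avoids hard-coding a threshold that silently goes stale if the data changes.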

**Scaling the data** 
1. Import the necessary functions for scaling.
2. Apply scaling to the dataframe to bring all variables of the data to the same scale.
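
The z-score scaling used in this notebook subtracts each column's mean and divides by its population standard deviation (`ddof=0`, which is also the default of `scipy.stats.zscore`). A minimal sketch of the same computation in plain pandas, on a made-up two-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical values on very different scales (made-up numbers).
toy = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 20000, 30000, 100000],
    "Total_Credit_Cards": [2, 4, 6, 8],
})

# Column-wise z-score: (x - mean) / population standard deviation.
scaled = (toy - toy.mean()) / toy.std(ddof=0)
print(scaled.round(3))
```

After scaling, every column has mean 0 and standard deviation 1, so no single variable dominates the Euclidean distances used for clustering.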

**Hopkins Statistic**

We apply the Hopkins statistic to score the clustering tendency of the data. A value of H higher than 0.75 indicates a clustering tendency at the 90% confidence level.
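
For reference, the statistic can be written in the plain-distance form that the `hopkins` function in this notebook is built around. The symbols $u_j$ and $w_j$ are notation introduced here for illustration; some references raise the distances to the power of the data dimension, which that function does not do.

```latex
H = \frac{\sum_{j=1}^{m} u_j}{\sum_{j=1}^{m} u_j + \sum_{j=1}^{m} w_j}
```

Here $m$ is the number of sampled points (10% of the rows in the function), $u_j$ is the distance from the $j$-th uniformly generated point to its nearest data point, and $w_j$ is the distance from the $j$-th sampled data point to its nearest other data point. For uniformly scattered data the two sums are similar and $H$ is close to 0.5; for clustered data the $w_j$ are small and $H$ moves towards 1.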


###### Treat Outliers

In [None]:
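def outlier(x):
    # Tukey fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
    Q1,Q3=np.percentile(x,[25,75])  # percentile does not need pre-sorted input
    IQR=Q3-Q1
    lower_range=Q1-(1.5*IQR)
    upper_range=Q3+(1.5*IQR)
    return lower_range,upper_range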

In [None]:
outlier(data1['Avg_Credit_Limit'])

In [None]:
lowerbound,upperbound=outlier(data['Avg_Credit_Limit'])

In [None]:
data[(data['Avg_Credit_Limit']<lowerbound)|(data['Avg_Credit_Limit']>upperbound)].count()

In [None]:
data['Avg_Credit_Limit']=np.where(data['Avg_Credit_Limit']>upperbound,105000,data['Avg_Credit_Limit'])


In [None]:
lb,ub=outlier(data['Total_visits_online'])  # bounds must come from the column being checked
data[(data['Total_visits_online']<lb)|(data['Total_visits_online']>ub)].count()

In [None]:
lb,ub=outlier(data['Total_calls_made'])  # bounds must come from the column being checked
data[(data['Total_calls_made']<lb)|(data['Total_calls_made']>ub)].count()

In [None]:
outlier(data1['Total_visits_online'])

In [None]:
lowerbound,upperbound=outlier(data['Total_visits_online'])

In [None]:
data[(data['Total_visits_online']<lowerbound)|(data['Total_visits_online']>upperbound)].count()

In [None]:
data['Total_visits_online']=np.where(data['Total_visits_online']>upperbound,8.5,data['Total_visits_online'])

###### Scaling the data 

In [None]:
from scipy.stats import zscore 

data_scaled=data.apply(zscore)

In [None]:
data_scaled.head()

###### Hopkins statistic - The ability for data to cluster

In [None]:
# Calculate Hopkins Statistic

# To perform KMeans clustering 
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
from math import isnan

def hopkins(X):
    d = X.shape[1]
    #d = len(vars) # columns
    n = len(X) # rows
    m = int(0.1 * n) 
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
 
    rand_X = sample(range(0, n, 1), m)
 
    ujd = []
    wjd = []
    for j in range(0, m):
        # A random uniform point is not part of X, so its nearest data point is neighbor index 0
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 1, return_distance=True)
        ujd.append(u_dist[0][0])
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])
 
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
 
    return H

In [None]:
hopkins(data_scaled)

The data shows a tendency to cluster, and the score is considered good because it is above 0.75.

##### Elbow method - Finding the optimal number of clusters 

In [None]:
#Finding optimal no. of clusters
from scipy.spatial.distance import cdist
clusters=range(1,10)
meanDistortions=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(data_scaled)
    # Mean distance from each point to its nearest centroid
    distances=cdist(data_scaled, model.cluster_centers_, 'euclidean')
    meanDistortions.append(np.min(distances, axis=1).sum() / data_scaled.shape[0])


plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')

The elbow appears to be at k = 3.
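
An alternative way to read the elbow is from the within-cluster sum of squares that scikit-learn exposes as `inertia_`. The sketch below is not the notebook's method: it runs on synthetic blobs (the data, seeds and `n_init` value are assumptions chosen for reproducibility) and looks at how much each extra cluster reduces the inertia.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated 2-D blobs (made up for illustration).
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [6, 6], [0, 6]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

# Improvement gained by each additional cluster; the elbow is where this flattens.
drops = -np.diff(inertias)
for k, drop in zip(range(2, 6), drops):
    print(f"k={k}: inertia reduced by {drop:.1f}")
```

Fixing `random_state` keeps the curve identical between runs, which makes it easier to compare notes on where the bend is.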

###### KMeans Algorithm

Let's check K=3 and K=4 using the KMeans algorithm, since the elbow is at 3.

**Approach**

1. The elbow method indicates 3 clusters. We will try different algorithms to understand the performance for 3 and 4 clusters.

2. We calculate the cluster means from the KMeans labels to see how the variables are distributed among the different clusters.

3. We then visually assess the distribution with a box plot.

4. We join the labeled K=3 KMeans data to the main dataframe that we have set up for analysis.

5. We print the value counts of the datapoints by cluster to see whether the distribution is intuitive and meaningful.

6. At each stage we calculate the Silhouette score to be sure that the clustering is good enough. We will pick the best model based on our intuition about the data distribution and on the Silhouette scores.

7. We repeat the steps above for K=4.

8. We compare the Silhouette scores and the box plots of the clusters to see which option is the best choice.
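
The comparison in steps 6 to 8 can be sketched as a short loop. This is an illustration on synthetic blobs, not the notebook's data; the seeds, `n_init` and the `scores` dictionary are assumptions introduced for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated 2-D blobs (made up for illustration).
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [6, 6], [0, 6]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in (3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # ranges from -1 to 1, higher is better

best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```

The score is one input among several; the cluster sizes and box plots still need to make sense for the chosen K.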

###### KMeans , K=3

**Silhouette Score**

In [None]:
# For K = 3 
final_model3=KMeans(3)
final_model3.fit(data_scaled)
prediction=final_model3.predict(data_scaled)

## Print Silhouette score 
from sklearn.metrics import silhouette_score
silhouette_score(data_scaled,final_model3.labels_)

In [None]:
#Append the prediction 
data["GROUP_3"] = prediction
data_scaled["GROUP_3"] = prediction
print("Groups Assigned : \n")
data.head()

In [None]:
datagrouped_K3=data.groupby(['GROUP_3'])
datagrouped_K3.mean()

In [None]:
data_scaled.boxplot(by='GROUP_3',layout=(2,4),figsize=(15,10));

In [None]:
centroids=final_model3.cluster_centers_

In [None]:
centroids

In [None]:
## creating a new dataframe only for labels and converting it into categorical variable
df_labels = pd.DataFrame(final_model3.labels_ , columns = list(['K=3']))

df_labels['K=3'] = df_labels['K=3'].astype('category')

In [None]:
df_labeled=data1.join(df_labels)

In [None]:
df_labeled['K=3'].value_counts()

In [None]:
data_scaled.head()

In [None]:
from mpl_toolkits.mplot3d import Axes3D

In [None]:
data_scaled.drop(['GROUP_3'],axis=1,inplace=True)

In [None]:
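fig = plt.figure(figsize=(8, 6))
# Recent Matplotlib no longer attaches Axes3D(fig, ...) to the figure automatically; add a 3D subplot instead
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=20, azim=60)
k3_model=KMeans(3)
k3_model.fit(data_scaled)
labels = k3_model.labels_
# np.float was removed in NumPy 1.24; the built-in float behaves the same here
ax.scatter(data_scaled.iloc[:, 0], data_scaled.iloc[:, 1], data_scaled.iloc[:, 2],c=labels.astype(float), edgecolor='k')
# The w_xaxis/w_yaxis/w_zaxis aliases are gone in recent Matplotlib; use the plain axis attributes
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('Avg_Credit_Limit')
ax.set_ylabel('Total_Credit_Cards')
ax.set_zlabel('Total_visits_bank')
ax.set_title('3D plot of KMeans Clustering')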

###### KMeans , K=4 

**Silhouette Score**

In [None]:
# For K = 4 
final_model4=KMeans(4)
final_model4.fit(data_scaled)
prediction=final_model4.predict(data_scaled)

## Print the Silhouette score 

silhouette_score(data_scaled,final_model4.labels_)


In [None]:
data_scaled.head()

In [None]:
#Append the prediction 
data["GROUP_4"] = prediction
data_scaled["GROUP_4"] = prediction
print("Groups Assigned : \n")
data.head()


In [None]:
datagrouped=data.groupby(['GROUP_4'])
datagrouped.mean()

In [None]:
###### Adding the values to the dataframe 

In [None]:
df_labels = pd.DataFrame(final_model4.labels_ , columns = list(['K=4']))

df_labels['K=4'] = df_labels['K=4'].astype('category')

In [None]:
#Joining the newly created column to existing dataframe

df_labeled=df_labeled.join(df_labels)

In [None]:
df_labeled.groupby(df_labeled['K=4']).size()

In [None]:
df_labeled.head()

In [None]:
data_scaled.boxplot(by = 'GROUP_4',  layout=(2,4), figsize=(20, 15))
plt.xticks(rotation=90);

In [None]:
data_scaled.drop(['GROUP_4'],axis=1,inplace=True)

In [None]:
data_scaled.head()

**KMeans results** 

1. Looking at the K=3 mean values, we see 3 distinct clusters with different mean values for all variables.
2. The box plot indicates that the distribution of the data is different for all 3 clusters formed.
3. Looking at the K=4 mean values and comparing them across all variables, we see that the means of a couple of clusters are close to each other, indicating they could form 1 cluster instead of being split into 2.
4. The box plot supports the above point too.
5. The Silhouette score for K=3 is much better than the one for K=4.
6. Considering the above factors, we decide that K=3 is the best option for the KMeans algorithm.

Let's proceed with Hierarchical clustering now.

###### Hierarchical clustering 

**Approach - Hierarchical Clustering**

1. Import the required libraries to check the cophenetic coefficient for the different linkages in Hierarchical clustering.

2. Compute the cophenetic coefficient for all the linkages and pick a handful in descending order.

3. Find the linkage matrix.

4. Plot the dendrogram for the consolidated data frame.

5. Truncate the dendrogram by setting the p value (the number of last merged clusters to show), and cut the dendrogram at a reasonable maximum distance (max_d).

6. Use the distance threshold with the fcluster function to cluster the data.

7. Evaluate the clusters by looking at the distribution of the datapoints.

8. Print the Silhouette coefficient and assess the data distribution.
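
The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's run: the blobs and seed are made up, and the tree is cut with `criterion="maxclust"` (a fixed number of clusters) rather than the distance threshold used in the cells that follow.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist

# Synthetic data: three well-separated 2-D blobs (made up for illustration).
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [6, 6], [0, 6]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

distances = pdist(X)  # condensed matrix of pairwise Euclidean distances

results = {}
for method in ("average", "complete"):
    Z = linkage(X, method=method, metric="euclidean")
    coph_corr, _ = cophenet(Z, distances)              # how well the tree preserves distances
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into at most 3 clusters
    results[method] = (coph_corr, labels)
    print(method, round(coph_corr, 3), np.bincount(labels)[1:])
```

A high cophenetic coefficient says the dendrogram is faithful to the distances; it does not by itself guarantee that the resulting cluster sizes are useful, which is why the counts are printed alongside it.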

In [None]:
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage

In [None]:
from scipy.spatial.distance import pdist  # Pairwise distances between data points

###### Average linkage

##### Cophenetic coefficients for different linkages

In [None]:
links=['single','complete','average','ward','median','centroid']
for each in links:
    Z = linkage(data_scaled, method=each, metric='euclidean')
    cc,cophn_dist=cophenet(Z,pdist(data_scaled))
    print (each,cc)

###### Linkage Matrix- Average

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
Z_average = linkage(data_scaled, 'average', metric='euclidean')
Z_average.shape

In [None]:
Z_average[:]

###### Full dendrogram - Average linkage

In [None]:
plt.figure(figsize=(25, 10))
dendrogram(Z_average)
plt.show()

###### Truncated Dendrogram

In [None]:
# Truncated dendrogram: truncate_mode='lastp' shows only the last p merged clusters
dendrogram(
    Z_average,
    truncate_mode='lastp',
    p=6,  # number of merged clusters to display
)
plt.show()

In [None]:
max_d=3.5


###### Distance threshold & fcluster function to cluster the data

In [None]:
from scipy.cluster.hierarchy import fcluster
clusters_average = fcluster(Z_average, max_d, criterion='distance')
clusters_average
set(clusters_average)

In [None]:
df_labeled['clusters_label_average']=clusters_average

In [None]:
df_labeled.groupby(['clusters_label_average']).size()

###### Calculate the average Silhouette score - Average linkage

In [None]:
from sklearn.metrics import silhouette_score
silhouette_score(data_scaled,clusters_average)

###### Linkage Matrix- Centroid

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
Z_centroid = linkage(data_scaled, 'centroid', metric='euclidean')
Z_centroid.shape

In [None]:
Z_centroid[:]  # Z is only the leftover loop variable from the cophenetic cell

###### Dendrogram 

In [None]:
plt.figure(figsize=(25, 10))
dendrogram(Z_centroid)
plt.show()

In [None]:
# Truncated dendrogram: truncate_mode='lastp' shows only the last p merged clusters
dendrogram(
    Z_centroid,
    truncate_mode='lastp',
    p=6,  # number of merged clusters to display
)
plt.show()

In [None]:
max_d=3.1

In [None]:
from scipy.cluster.hierarchy import fcluster
clusters_centroid = fcluster(Z_centroid, max_d, criterion='distance')
clusters_centroid
set(clusters_centroid)

In [None]:
df_labeled['clusters_label_centroid']=clusters_centroid

In [None]:
df_labeled.groupby(['clusters_label_centroid']).size()

In [None]:
#from sklearn.metrics import silhouette_score
silhouette_score(data_scaled,clusters_centroid)

###### Complete linkage

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
Z_complete = linkage(data_scaled, 'complete', metric='euclidean')
Z_complete.shape

In [None]:
Z_complete[:]

In [None]:
plt.figure(figsize=(25, 10))
dendrogram(Z_complete)
plt.show()

In [None]:
# Truncated dendrogram: truncate_mode='lastp' shows only the last p merged clusters
dendrogram(
    Z_complete,
    truncate_mode='lastp',
    p=6,  # number of merged clusters to display
)
plt.show()

In [None]:
max_d=5.5 #- test 1 
#max_d=4.5 - test 2

In [None]:
clusters_complete = fcluster(Z_complete, max_d, criterion='distance')
clusters_complete
set(clusters_complete)

In [None]:
silhouette_score(data_scaled,clusters_complete)

In [None]:
df_labeled['clusters_label_complete']=clusters_complete

In [None]:
df_labeled.groupby(['clusters_label_complete']).size()

**Hierarchical Clustering - Results** 

1. Looking at the cophenetic coefficients, we could see that some linkages were best suited to proceed with the clustering.

2. We apply Hierarchical clustering with the Euclidean distance measure and plot the dendrograms for the linkages with the best cophenetic coefficient values, i.e. Average, Centroid and Complete.

3. Although Average and Centroid have high cophenetic values, the data distribution and the dendrogram make sense only in the case of ‘Complete’ linkage.

4. The Silhouette coefficient gives a good score for Complete linkage, so we settle for 3 clusters using Hierarchical clustering with Complete linkage.


Let's look further at the results and compare the results from the clusters 

In [None]:
####### Boxplots - Hierarchical clustering(Complete linkage) &  KMeans @ K=3 ########

In [None]:
df_labeled.head()

In [None]:
plt.figure(figsize=(14,10))
plt.subplot(321)
sns.boxplot(x='K=3',y='Avg_Credit_Limit',data=df_labeled)
plt.subplot(322)
sns.boxplot(x='K=3',y='Total_Credit_Cards',data=df_labeled)
plt.subplot(323)
sns.boxplot(x='K=3',y='Total_visits_bank',data=df_labeled)
plt.subplot(324)
sns.boxplot(x='K=3',y='Total_visits_online',data=df_labeled)
plt.subplot(325)
sns.boxplot(x='K=3',y='Total_calls_made',data=df_labeled)

In [None]:
plt.figure(figsize=(14,10))
plt.subplot(321)
sns.boxplot(x='clusters_label_complete',y='Avg_Credit_Limit',data=df_labeled)
plt.subplot(322)
sns.boxplot(x='clusters_label_complete',y='Total_Credit_Cards',data=df_labeled)
plt.subplot(323)
sns.boxplot(x='clusters_label_complete',y='Total_visits_bank',data=df_labeled)
plt.subplot(324)
sns.boxplot(x='clusters_label_complete',y='Total_visits_online',data=df_labeled)
plt.subplot(325)
sns.boxplot(x='clusters_label_complete',y='Total_calls_made',data=df_labeled)

###### Compare K Means & Hierarchical Clusters using boxplots & Comparing the Cluster means 

In [None]:
plt.figure(figsize=(14,10))
plt.subplot(521)
sns.boxplot(x='K=3',y='Avg_Credit_Limit',data=df_labeled)
plt.subplot(522)
sns.boxplot(x='clusters_label_complete',y='Avg_Credit_Limit',data=df_labeled)

plt.subplot(523)
sns.boxplot(x='K=3',y='Total_Credit_Cards',data=df_labeled)
plt.subplot(524)
sns.boxplot(x='clusters_label_complete',y='Total_Credit_Cards',data=df_labeled)

plt.subplot(525)
sns.boxplot(x='K=3',y='Total_visits_bank',data=df_labeled)
plt.subplot(526)
sns.boxplot(x='clusters_label_complete',y='Total_visits_bank',data=df_labeled)

plt.subplot(527)
sns.boxplot(x='K=3',y='Total_visits_online',data=df_labeled)
plt.subplot(528)
sns.boxplot(x='clusters_label_complete',y='Total_visits_online',data=df_labeled)

plt.subplot(529)
sns.boxplot(x='K=3',y='Total_calls_made',data=df_labeled)
plt.subplot(5,2,10)
sns.boxplot(x='clusters_label_complete',y='Total_calls_made',data=df_labeled)


Let's compare the Cluster means of the 2 best models at hand 

In [None]:
datagrouped_Hclust=df_labeled.groupby(['clusters_label_complete'])
datagrouped_Hclust.mean(numeric_only=True)  # skip the categorical K=3 / K=4 label columns

In [None]:
datagrouped_KMeans=df_labeled.groupby(['K=3'])
datagrouped_KMeans.mean(numeric_only=True)  # skip the categorical K=4 label column


**Comparing the results of KMeans & Hierarchical Clusters**

1. We get 3 clusters for both Hierarchical and KMeans clustering.

2. Both give us a Silhouette score of 0.516.

3. We compare the box plots of these clusters using the labeled data and see that, except for some slight variation, all variables behave in a very similar way under both algorithms.

4. Comparing variable by variable, Average Credit Limit shows 3 clusters which can be seen as very high, medium and low value credit spend; the two algorithms show no significant difference in these values.

5. Total Credit Cards shows clusters with means at 2.4, 5.5 and 8.74 for both clustering algorithms.

6. For Total Bank visits, the 3 clusters show mean values of 0.6, 0.9 and 3.4 for both algorithms.

7. Total Calls made shows similar values for the clusters, with means of 1.08, 6.8 and 2.0.

8. Total Online visits shows cluster means at 10.9, 3.55 and 0.98.

9. For all the variables, the cluster means are significantly different from each other in the results of both algorithms.

Since both sets of clusters behave similarly, we group by the K=3 labels to understand the behavior of the customers.
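
Beyond comparing means and box plots, the agreement between two labelings can be checked directly with a cross-tabulation and the adjusted Rand index, which ignores how the cluster ids happen to be numbered. The labels below are made up for illustration; on the real data the `K=3` and `clusters_label_complete` columns would take their place.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for ten customers: the same grouping except for the
# last customer, and with different cluster ids.
kmeans_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
hier_labels = [3, 3, 3, 1, 1, 1, 2, 2, 2, 1]

# Rows: KMeans clusters, columns: hierarchical clusters, cells: customer counts.
table = pd.crosstab(
    pd.Series(kmeans_labels, name="kmeans"),
    pd.Series(hier_labels, name="hierarchical"),
)
ari = adjusted_rand_score(kmeans_labels, hier_labels)

print(table)
print("adjusted Rand index:", round(ari, 3))
```

An index of 1 means the two partitions are identical up to renaming; values near 0 mean the agreement is no better than chance.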

In [None]:
fig, ax = plt.subplots(figsize=(10,7))
sns.scatterplot(data=df_labeled,x='Avg_Credit_Limit',y='Total_visits_bank',hue='K=3')
plt.xticks(rotation=90);
ax.set_title("Average Credit Limit vs Bank Visits, by K=3 cluster")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,7))
sns.scatterplot(data=df_labeled,x='Avg_Credit_Limit',y='Total_visits_online',hue='K=3')
plt.xticks(rotation=90);
ax.set_title("Average Credit Limit vs Online Visits, by K=3 cluster")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()

In [None]:
sns.scatterplot(data=df_labeled,x='Avg_Credit_Limit',y='Total_calls_made',hue='K=3')
plt.xticks(rotation=90);

In [None]:
sns.scatterplot(data=df_labeled,x='Total_Credit_Cards',y='Total_visits_online',hue='K=3')
plt.xticks(rotation=90);

In [None]:
sns.scatterplot(data=df_labeled,x='Total_visits_online',y='Total_calls_made',hue='K=3')
plt.xticks(rotation=90); 

**Business Insights** 

1. There are 3 major segments of customers in the credit card data: high credit limit customers, moderate credit limit customers and low credit limit customers.

2. Based on the average credit limit and the total bank visits, low to moderate credit limit customers with more than 2 bank visits and those with fewer than 2 bank visits are segmented separately. High credit limit customers, who visited the bank no more than once, form a clearly separate group.

3. High credit limit customers show a high number of online visits, while those with a low to moderate credit limit show very few online visits.

4. A high number of calls is made by low credit limit customers; moderate credit limit customers make fewer calls than low credit limit customers.

5. Customers with more credit cards make more online visits.

**Hence the key actions should be:**

1. All of this implies that a high-value customer who holds numerous credit cards does more online transactions and reaches customer care less often by phone or through a bank visit. This customer should be sent marketing messages through email, text or any other online channel to upsell products and for customer retention.

2. The moderate-value customers use a mix of bank visits, online visits and calls to the bank. This type of customer can be retained and upsold through all channels, but preferably through the mobile channel and by upselling products during their bank visits.

3. Lastly, the largest share of the customer base is the low credit limit customers. These customers make the most calls to the bank and also visit the bank often, so they should not be targeted for upsell through online channels; instead, marketing should be channelled through phone calls and marketing assistants at the bank.