# AllLife Credit Card Customer Segmentation 

# Objective:
### To identify different segments in the existing customer base, based on their spending patterns and past interactions with the bank

# Key Questions:
### 1. How many different segments of customers are there?
### 2. How are these segments different from each other?
### 3. What are your recommendations to the bank on how to better market to and service these customers?

# Data Description:
### The data covers customers of a bank: their average credit limit, the total number of credit cards each customer holds, and the channels through which they have contacted the bank with queries (visiting the bank, online, and through a call centre)

In [327]:
# Importing the required libraries
import numpy as np   

from sklearn.model_selection import train_test_split

from sklearn.cluster import KMeans

# to handle data in form of rows and columns 
import pandas as pd    

# importing plotting libraries
import matplotlib.pyplot as plt   

# To enable plotting graphs in Jupyter notebook
%matplotlib inline 

#importing seaborn for statistical plots
import seaborn as sns

from sklearn import metrics
from scipy.stats import zscore

from scipy.spatial.distance import cdist
from mpl_toolkits.mplot3d import Axes3D

from scipy.cluster.hierarchy import fcluster
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import cophenet

import warnings
warnings.filterwarnings('ignore')

In [328]:
# Reading the excel file into pandas dataframe
mydata = pd.read_excel("Credit Card Customer Data.xlsx").dropna()

### Univariate analysis, EDA and Visualization (comments and explanations of every step included) are as follows:

In [None]:
# Analysis of the body of distributions / head
mydata.head()

In [329]:
##Remove id since it is redundant
mydata.drop('Sl_No', axis=1, inplace=True)

In [158]:
# Number of rows and columns
mydata.shape

(660, 6)

In [None]:
# Missing values
mydata.isna().sum()

In [None]:
# Taking a value count of the missing values 
mydata.isnull().apply(lambda col: col.value_counts())  # top-level pd.value_counts is deprecated in recent pandas

In [None]:
# Outliers discovery using Interquartile range (IQR) 
Q1 = mydata.quantile(0.25)
Q3 = mydata.quantile(0.75)
IQR = Q3-Q1
print(IQR)
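
The IQR on its own does not say which rows are outliers. A minimal sketch of the usual 1.5 × IQR rule follows, assuming a small synthetic frame (`demo` and its values are illustrative, not the bank's data):

```python
import pandas as pd

# Synthetic stand-in for the notebook's data (values are illustrative only)
demo = pd.DataFrame({
    "Avg_Credit_Limit": [5000, 8000, 10000, 12000, 15000, 18000, 20000, 200000],
    "Total_visits_online": [0, 1, 1, 2, 2, 3, 4, 15],
})

q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1

# A value is flagged when it lies more than 1.5 * IQR outside the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outlier_mask = (demo < lower) | (demo > upper)

# Number of flagged values per column
outlier_counts = outlier_mask.sum()
print(outlier_counts)
```

The same mask could be applied to `mydata` to count or inspect the flagged rows before deciding how to treat them.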

In [72]:
# Check number of zeros in the column
(mydata==0).sum() 

Customer Key             0
Avg_Credit_Limit         0
Total_Credit_Cards       0
Total_visits_bank      100
Total_visits_online    144
Total_calls_made        97
dtype: int64

In [None]:
# Basic Information of the dataset including data types
mydata.info()

In [None]:
# Checking only the data types not the whole information
mydata.dtypes

In [None]:
# Description of the independent attributes (name, range of values observed, mean and median, standard deviation and quartiles)
mydata.describe().transpose()

In [None]:
# Number of unique values in each column
mydata.nunique()

In [None]:
#Check for duplicate rows
mydata.duplicated().sum()

In [330]:
##Remove Customer key because it will add no value to the clustering
mydata.drop('Customer Key', axis=1, inplace=True)

In [74]:
# Removing duplicates
mydata = mydata.drop_duplicates()

In [75]:
# Skewness of the dataset
mydata.skew()

Avg_Credit_Limit       2.186592
Total_Credit_Cards     0.150120
Total_visits_bank      0.149368
Total_visits_online    2.209521
Total_calls_made       0.656954
dtype: float64

#### Avg_Credit_Limit and Total_visits_online are highly skewed, so I will treat them later in this project

In [None]:
# Outlier discovery using visualization tools (histogram)
sns.histplot(mydata['Avg_Credit_Limit'], kde=False)  # histplot replaces the deprecated distplot
plt.show()

#### There are outliers in the average credit limit above

In [None]:
sns.histplot(mydata['Total_Credit_Cards'], kde=True)  # histplot replaces the deprecated distplot
plt.show()

In [None]:
sns.countplot(x=mydata['Total_visits_bank'])
plt.show()

In [None]:
sns.countplot(x=mydata['Total_visits_online'])
plt.show()

#### There are outliers in the total visits online as shown in the above plot

In [None]:
sns.boxplot(x=mydata['Total_calls_made'])
plt.show()

In [None]:
# Visualization using pairplot
sns.pairplot(mydata, diag_kind='kde') 

In [None]:
corr = mydata.corr()
sns.heatmap(corr, annot = True)

In [176]:
###  Findings in terms of degree of relationship between the above variables:

#### - There is a negative correlation between the Avg_Credit_Limit and Total_visits_bank
#### - Avg_Credit_Limit and Total_calls_made also have a negative correlation
#### - Total_Credit_Cards and Total_calls_made have a negative correlation
#### - Total_visits_bank and Total_calls_made have a negative correlation
#### - Avg_Credit_Limit and Total_Credit_Cards have a positive correlation
#### - Avg_Credit_Limit and Total_visits_online have a positive correlation
#### - Total_Credit_Cards also has a positive correlation with Total_visits_bank
#### - Total_Credit_Cards also has a positive correlation with Total_visits_online
#### - Total_calls_made and Total_visits_online have a positive correlation
#### - Total_visits_bank and Total_visits_online have a negative correlation

In [205]:
### Treat the outliers discovered
# mydata['Avg_Credit_Limit'] = np.log(mydata['Avg_Credit_Limit'])
# mydata['Total_visits_online'] = np.log(mydata['Total_visits_online'])

When I treated the outliers and scaled the data, the values became too large and caused errors while plotting the elbow method. So I have left the treatment code above commented out, for reference only.
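
One possible cause of those errors, given the zero counts reported earlier for Total_visits_online, is that `np.log(0)` evaluates to `-inf`. This is an assumption about the failure, not something the notebook confirms. A hedged sketch of the difference between `np.log` and `np.log1p` on illustrative values:

```python
import numpy as np
import pandas as pd

# Illustrative counts that include zeros, as Total_visits_online does
visits = pd.Series([0, 0, 1, 2, 5, 15], name="Total_visits_online")

with np.errstate(divide="ignore"):
    plain_log = np.log(visits)      # log(0) evaluates to -inf

shifted_log = np.log1p(visits)      # log(1 + x) maps 0 to 0 and stays finite

print(np.isinf(plain_log).sum())       # count of infinite entries
print(np.isfinite(shifted_log).all())  # whether every entry is finite
```

Infinite values propagate through `zscore` as NaN, which K-means cannot fit, so a shifted log is one option worth trying if the transform is reinstated.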

In [280]:
##Scale the data
from scipy.stats import zscore
mydata_z = mydata.apply(zscore)

In [None]:
mydata_z

### K-means Clustering

In [None]:
#Finding optimal no. of clusters

clusters=range(1,10)
meanDistortions=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(mydata_z)  # fit on the scaled data, as the later K-means cells do
    # mean distance from each point to its nearest centroid
    distances=cdist(mydata_z, model.cluster_centers_, 'euclidean')
    meanDistortions.append(np.min(distances, axis=1).sum() / mydata_z.shape[0])


plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
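
The elbow can also be read from the fitted model's `inertia_` attribute (the within-cluster sum of squared distances) rather than from a plot. The following is a rough sketch on synthetic blobs; the data, `n_init`, and `random_state` values are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three well separated groups (illustrative only)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias = []
ks = range(1, 7)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

# Reduction in inertia gained by each additional cluster
drops = -np.diff(inertias)
for k, drop in zip(list(ks)[1:], drops):
    print(k, round(float(drop), 2))
```

The value of k after which the drops become small is the candidate elbow; on real data the bend is usually less sharp than on these blobs.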

In [332]:
# Set the value of k=6
kmeans = KMeans(n_clusters=6, n_init = 15, random_state=2345)

In [None]:
kmeans.fit(mydata_z)

In [334]:
centroids = kmeans.cluster_centers_

In [None]:
centroids

In [336]:
#Calculate the centroids for the columns to profile
centroid_df = pd.DataFrame(centroids, columns = list(mydata_z) )

In [None]:
centroid_df.T

In [338]:
# Group 2 has the highest Avg_Credit_Limit while group 5 has the lowest
# Group 0 has the lowest Total_Credit_Cards and group 2 has the highest
# Group 3 has the highest average number of bank visits while group 2 has the lowest
# Group 2 has the highest average number of online visits while group 4 has the lowest
# Group 1 made the most calls to the bank on average while group 2 made the fewest
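
Centroids of z-scored data are expressed in standard deviations, which makes them hard to communicate. A minimal sketch of mapping them back to original units, assuming synthetic data (`raw`, `km`, and the column values are illustrative):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.cluster import KMeans

# Synthetic stand-in data (illustrative values, not the bank's records)
rng = np.random.default_rng(1)
raw = pd.DataFrame({
    "Avg_Credit_Limit": np.concatenate(
        [rng.normal(10000, 1000, 40), rng.normal(100000, 5000, 40)]),
    "Total_calls_made": np.concatenate(
        [rng.normal(8, 1, 40), rng.normal(1, 0.3, 40)]),
})

scaled = raw.apply(zscore)   # scipy's zscore uses the population std (ddof=0)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Undo the scaling column by column: x = z * std + mean
centroids_original = pd.DataFrame(
    km.cluster_centers_ * raw.std(ddof=0).to_numpy() + raw.mean().to_numpy(),
    columns=raw.columns,
)
print(centroids_original.round(1))
```

Because each centroid is the mean of its members, the rescaled centroids match the per-cluster means of the unscaled columns.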

In [339]:
## creating a new dataframe only for labels and converting it into categorical variable
# reuse the index of mydata_z so the labels stay aligned with their rows after drop_duplicates
df_labels = pd.DataFrame(kmeans.labels_, columns=['labels'], index=mydata_z.index)

df_labels['labels'] = df_labels['labels'].astype('category')

In [340]:
# Joining the label dataframe with the data frame.
df_labeled = mydata_z.join(df_labels)

In [None]:
# Per-cluster mean of each scaled feature
df_analysis = df_labeled.groupby('labels', observed=True).mean()
df_analysis

In [None]:
df_labeled['labels'].value_counts()  

In [343]:
# Demonstrate the 3d plot using mplot3d

In [None]:

## 3D plots of clusters

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches itself in recent matplotlib
ax.view_init(elev=20, azim=60)
k3_model=KMeans(3)
k3_model.fit(mydata_z)
labels = k3_model.labels_
# np.float was removed from NumPy; the built-in float is equivalent
ax.scatter(mydata_z.iloc[:, 0], mydata_z.iloc[:, 1], mydata_z.iloc[:, 2],c=labels.astype(float), edgecolor='k')
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
# label each axis with the column it actually shows
ax.set_xlabel(mydata_z.columns[0])
ax.set_ylabel(mydata_z.columns[1])
ax.set_zlabel(mydata_z.columns[2])
ax.set_title('3D plot of KMeans Clustering')

In [None]:
# Let K = 3 for the following demonstration
final_model=KMeans(3)
final_model.fit(mydata)
prediction=final_model.predict(mydata)

#Append the prediction 
mydata["GROUP"] = prediction
print("Groups Assigned : \n")
mydata[["Total_calls_made", "GROUP"]]

In [None]:
mydata.boxplot(by = 'GROUP',  layout=(2,4), figsize=(20, 15))

### To determine the quality of customer relationship in the bank, we will check the relationship between the Avg_Credit_Limit of customers and the various channels the customers used in contacting the bank

In [232]:
 mydata['Avg_Credit_Limit'].corr(mydata['Total_visits_bank'])

-0.10031230969326996

In [233]:
 mydata['Avg_Credit_Limit'].corr(mydata['Total_visits_online'])

0.5513845236894896

In [234]:
 mydata['Avg_Credit_Limit'].corr(mydata['Total_calls_made'])

-0.4143518934760464

The correlation coefficients above indicate that average credit limit is positively correlated with total visits online but negatively correlated with the other two channels. So I will plot that positive relationship below.

In [None]:
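plt.plot(mydata['Avg_Credit_Limit'], mydata['Total_visits_online'],'bo')
# fit a straight line: Total_visits_online as a function of Avg_Credit_Limit
z = np.polyfit(mydata['Avg_Credit_Limit'], mydata['Total_visits_online'],1)
p = np.poly1d(z)
# evaluate the fitted line at the x values (the credit limits), not at the y values
plt.plot(mydata['Avg_Credit_Limit'], p(mydata['Avg_Credit_Limit']), "r--")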

In [91]:
# From the above graph, the data does not form a perfectly straight line. It is important to note, however, that correlation is merely
#indicative of a relationship between the two.

### Hierarchical clustering (with different linkages) with the help of dendrograms and the cophenetic coefficient

In [347]:
# Generate the linkage matrix using ward

Z = linkage(mydata, 'ward', metric='euclidean')
Z.shape

(659, 4)

In [348]:
Z[:]

array([[3.10000000e+02, 3.95000000e+02, 0.00000000e+00, 2.00000000e+00],
       [5.60000000e+01, 1.75000000e+02, 0.00000000e+00, 2.00000000e+00],
       [2.90000000e+01, 2.15000000e+02, 0.00000000e+00, 2.00000000e+00],
       ...,
       [1.31300000e+03, 1.31400000e+03, 3.76027041e+05, 5.18000000e+02],
       [1.31200000e+03, 1.31500000e+03, 6.29334215e+05, 1.42000000e+02],
       [1.31600000e+03, 1.31700000e+03, 1.09385169e+06, 6.60000000e+02]])

In [None]:
# Plot the dendrogram for the consolidated dataframe
plt.figure(figsize=(25, 10))
dendrogram(Z)
plt.show()

#### From the truncated dendrogram, find the optimal distance between clusters to use as the input threshold for clustering the data

In [None]:
# Hint: Use truncate_mode='lastp' attribute in dendrogram function to arrive at dendrogram
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=3,  # number of merged clusters to display
)
plt.show()

In [351]:
max_d = 52

In [None]:
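# Use this distance measure (max_d) and the fcluster function to cluster the data into 3 different groups
# Note: the Ward linkage above was built on unscaled data, where the last merge distances are roughly 3.8e5, 6.3e5 and 1.1e6,
# so a cut at 52 yields far more than 3 clusters; a cut between about 3.8e5 and 6.3e5 would give 3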
clusters = fcluster(Z, max_d, criterion='distance')
clusters
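
When a specific number of clusters is wanted, `fcluster` can be asked for it directly with `criterion='maxclust'`, or the distance threshold can be derived from the merge heights stored in the third column of the linkage matrix. A small sketch on synthetic data (the blobs and the names `Z_demo`, `labels_maxclust`, `labels_distance` are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data with three separated groups (illustrative only)
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

Z_demo = linkage(X, method="ward")

# Option 1: request the number of clusters directly
labels_maxclust = fcluster(Z_demo, t=3, criterion="maxclust")

# Option 2: cut midway between the third-last and second-last merge heights,
# which leaves the final two merges undone and therefore three clusters
heights = Z_demo[:, 2]
threshold = (heights[-3] + heights[-2]) / 2
labels_distance = fcluster(Z_demo, t=threshold, criterion="distance")

print(len(np.unique(labels_maxclust)), len(np.unique(labels_distance)))
```

Deriving the threshold from the linkage matrix keeps it on the same scale as the data, whether or not the features were standardized.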

In [None]:
### Final dendrogram with 'ward linkage'
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
plt.figure(figsize=(18, 16))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
Z = linkage(mydata_z, 'ward')
dendrogram(Z,leaf_rotation=90.0,p=5,color_threshold=52,leaf_font_size=10,truncate_mode='level')
plt.tight_layout()

#### Use average as the linkage method and Euclidean as the distance metric, as follows:

In [305]:
#### generate the linkage matrix
from scipy.cluster.hierarchy import dendrogram, linkage
Z1 = linkage(mydata_z, 'average', metric='euclidean')
Z1.shape

(659, 4)

In [306]:
Z1[:]

array([[ 464.        ,  497.        ,    0.        ,    2.        ],
       [ 250.        ,  361.        ,    0.        ,    2.        ],
       [ 252.        ,  324.        ,    0.        ,    2.        ],
       ...,
       [   0.        , 1309.        ,    3.11135778,  387.        ],
       [1315.        , 1316.        ,    3.25253923,  610.        ],
       [1314.        , 1317.        ,    5.45418035,  660.        ]])

In [None]:
plt.figure(figsize=(25, 10))
dendrogram(Z1)
plt.show()

In [None]:
# Hint: Use truncate_mode='lastp' attribute in dendrogram function to arrive at dendrogram
dendrogram(
    Z1,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=3,  # number of merged clusters to display
)
plt.show()

In [309]:
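# Note: the largest merge distance in Z1 is about 5.45 (see the output above), so a cut at 50 keeps every point in one cluster;
# a cut between about 3.12 and 3.25 would give 3 clusters
max_d = 50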

In [None]:
clusters = fcluster(Z1, max_d, criterion='distance')
clusters

In [None]:
plt.figure(figsize=(18, 16))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
Z1 = linkage(mydata_z, 'average')
dendrogram(Z1,leaf_rotation=90.0,p=5,color_threshold=50,leaf_font_size=10,truncate_mode='level')
plt.tight_layout()

### Use complete as the linkage method and Euclidean as the distance metric

In [312]:
#### generate the linkage matrix
from scipy.cluster.hierarchy import dendrogram, linkage
Z2 = linkage(mydata_z, 'complete', metric='euclidean')
Z2.shape

(659, 4)

In [313]:
Z2[:]

array([[ 464.        ,  497.        ,    0.        ,    2.        ],
       [ 250.        ,  361.        ,    0.        ,    2.        ],
       [ 320.        ,  378.        ,    0.        ,    2.        ],
       ...,
       [1309.        , 1314.        ,    4.3347102 ,  397.        ],
       [1313.        , 1316.        ,    5.95846764,  610.        ],
       [1315.        , 1317.        ,    8.44853628,  660.        ]])

In [None]:
plt.figure(figsize=(25, 10))
dendrogram(Z2)
plt.show()

In [None]:
# Hint: Use truncate_mode='lastp' attribute in dendrogram function to arrive at dendrogram
dendrogram(
    Z2,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=3,  # number of merged clusters to display
)
plt.show()

In [316]:
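# Note: the largest merge distance in Z2 is about 8.45 (see the output above), so a cut at 50 keeps every point in one cluster,
# as the labels below show; a cut between about 4.34 and 5.95 would give 3 clusters
max_d = 50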

In [317]:
clusters = fcluster(Z2, max_d, criterion='distance')
clusters

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [None]:
plt.figure(figsize=(18, 16))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
Z2 = linkage(mydata_z, 'complete')
dendrogram(Z2,leaf_rotation=90.0,p=5,color_threshold=50,leaf_font_size=10,truncate_mode='level')
plt.tight_layout()

### Use sklearn Agglomerative Clustering and see how it differs from scipy.cluster.hierarchy

In [369]:
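# Ward linkage always uses Euclidean distance; the affinity argument is omitted because newer scikit-learn versions removed it
model = AgglomerativeClustering(n_clusters=3, linkage='ward')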
model.fit(mydata_z)

AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
                        connectivity=None, distance_threshold=None,
                        linkage='ward', memory=None, n_clusters=3)

In [370]:
L=model.labels_
L

array([0, 2, 0, 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [None]:
sns.boxplot(data=mydata_z)  # distribution of each scaled feature
plt.show()

In [None]:
sns.boxplot(Z)

In [None]:
sns.boxplot(Z1)

In [None]:
sns.boxplot(Z2)

In [None]:
# sns.boxplot(Z3)  # Z3 is never defined in this notebook, so this call would raise a NameError

### Calculate the cophenetic coefficient of the hierarchical clustering

In [272]:
c, coph_dists = cophenet(Z, pdist(mydata_z))
c

0.8977080867389372
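
The cophenetic coefficient can be computed for each linkage method in the same way, which makes the methods easier to compare. A hedged sketch on random synthetic data (`X`, the method list, and the `coph` dictionary are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Random synthetic data standing in for the scaled features
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
distances = pdist(X)  # condensed pairwise Euclidean distances

coph = {}
for method in ["ward", "average", "complete", "single"]:
    Z_m = linkage(X, method=method)
    # correlation between the original distances and the dendrogram's merge heights
    c, _ = cophenet(Z_m, distances)
    coph[method] = c
    print(method, round(float(c), 3))
```

A value closer to 1 means the dendrogram preserves the original pairwise distances more faithfully; it does not by itself say how well separated the resulting clusters are.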

### Calculate Avg Silhouette Score of the K-means Clustering


In [354]:
silhouette_score(mydata_z,clusters)

-0.36807635364871844

In [326]:
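### Silhouette Score is better when closer to 1 and worse when closer to -1. Here, -0.37 is poor. Note that `clusters` holds the labels from the most recent fcluster call, so this score describes a hierarchical cut rather than the K-means labels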
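
The silhouette score can also be computed directly from K-means labels for several values of k. A minimal sketch on synthetic data (the blobs, the range of k, and the `n_init` and `random_state` values are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with three well separated groups (illustrative only)
rng = np.random.default_rng(4)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)
print(best_k)
```

On the notebook's data the equivalent call would pass the scaled frame together with the fitted model's `labels_`.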

### Compare K-means clusters with Hierarchical clusters

In [None]:
# K-means needs the number of clusters chosen up front (here with the elbow method), while hierarchical clustering builds its tree with an agglomerative or divisive method. In this project, I used agglomerative
# K-means can handle big data while hierarchical clustering scales poorly
# K-means clusters can be inspected in a 2D or 3D scatter plot (as depicted above), while hierarchical clustering is read from a dendrogram as shown above
# From the elbow plot or the 3D plot above you can tell the number of K-means clusters at a glance, but in hierarchical clustering you have to count the branches crossed by a horizontal cut at a chosen height on the dendrogram
# K-means results can differ between runs because the initial centroids are chosen at random, but hierarchical clustering results are reproducible
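
One way to compare the two approaches quantitatively is to measure how much their label assignments agree, for example with the adjusted Rand index. This is a sketch on synthetic blobs, not a result from the bank data; all names and parameter values are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Synthetic 2-D data with three well separated groups (illustrative only)
rng = np.random.default_rng(5)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

# Fixing random_state makes the K-means run repeatable
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# 1.0 means identical partitions (label numbering does not matter)
ari = adjusted_rand_score(km_labels, hc_labels)
print(round(float(ari), 3))
```

A high index suggests both methods recover the same segments; a low one signals that the segmentation depends on the algorithm and deserves a closer look.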

### Analyse the clusters formed, tell us how is one cluster different from another and answer all the key questions


In [None]:
# Difference between one cluster and another
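The cophenetic coefficient of the hierarchical clustering (0.90) is high, while the average silhouette score computed above (-0.37) is poor. The two metrics measure different things (how faithfully the dendrogram preserves pairwise distances versus how well separated the clusters are), so they are not directly comparable.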

In [None]:
# Answers to key questions
1) There are 3 different segments of customers
2) One segment walks into the bank for transactions and support, another segment uses the online banking facilities, and the last segment prefers to call the bank for services and resolution of their problems
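
The channel preference of each segment can be read from a per-group profile. The following sketch uses hand-made rows (the `demo` frame and its GROUP values are hypothetical stand-ins for the labelled data):

```python
import pandas as pd

# Hand-made example rows; the GROUP values stand in for cluster labels
demo = pd.DataFrame({
    "Total_visits_bank":   [4, 5, 0, 1, 1, 0],
    "Total_visits_online": [1, 0, 10, 12, 3, 4],
    "Total_calls_made":    [2, 1, 1, 0, 8, 9],
    "GROUP":               [0, 0, 1, 1, 2, 2],
})

# Mean usage of each channel per segment
profile = demo.groupby("GROUP").mean()

# Channel with the highest mean usage in each segment
dominant_channel = profile.idxmax(axis=1)
print(profile)
print(dominant_channel)
```

Applied to the labelled bank data, a table like `profile` would back the segment descriptions with numbers.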

In [None]:
# Recommendations to the bank on how to better market to and service these customers

Based on the analysis of this data, it is clear that more customers contact the bank through the call center, followed by those who use the bank's online services.
While planning to upgrade the service delivery model, more emphasis should be placed on the online platforms and on the customers who visit the bank, to enable more customers to contact the bank through these means.
To capture more credit card customers, the head of marketing can coach the call center staff to target the customers who use this channel. They should market the credit cards more through this channel because the bulk of their customers communicate with them via phone calls.