# Understanding Mall customers

In this exercise, you work for a consulting firm as a data scientist. Your client owns a mall and wants to understand which customers are most likely to buy.

You have a mall customer dataset with 5 features:
- CustomerID of the customer
- Gender of the customer
- Age of the customer
- Annual Income of the customer in k$
- Spending Score assigned by the mall based on customer behavior and spending nature (1-99)


You have one day to perform this analysis.

In [1]:
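import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # used by the plotting cells below
import seaborn as sns            # used by the plotting cells below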

In [2]:
mall_df = pd.read_csv('../data/mail_customer.txt')
mall_df

FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/mattb/Simplon/Rendu/Segmentation_client/mail_customer.txt'

# Customer Segmentation using different clustering methods

Try different clustering methods (e.g. k-means, agglomerative, DBSCAN, Gaussian mixture) to create clusters and understand customer behavior.

https://machinelearningmastery.com/clustering-algorithms-with-python/

https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
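
As a rough starting point, the four methods named above can be run side by side with scikit-learn. This is a minimal sketch on synthetic blobs, not on the mall data: `X_demo`, the choice of 5 clusters, and the DBSCAN parameters are assumptions made only for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two scaled numeric features (NOT the mall data)
X_raw, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.6, random_state=0)
X_demo = StandardScaler().fit_transform(X_raw)

# Hypothetical settings; each one would need tuning on the real dataset
models = {
    "k-means": KMeans(n_clusters=5, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=5, linkage="ward"),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=4),
    "Gaussian mixture": GaussianMixture(n_components=5, random_state=0),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X_demo)
    # DBSCAN marks noise with -1, which is not a cluster
    n_found = len(set(labels)) - (1 if -1 in labels else 0)
    # The silhouette score is only defined for at least 2 distinct labels
    score = silhouette_score(X_demo, labels) if len(set(labels)) > 1 else float("nan")
    results[name] = (n_found, score)
    print(f"{name:>17}: {n_found} clusters, silhouette = {score:.3f}")
```

The silhouette values are only comparable here because every model sees the same `X_demo`; on the real data the same feature matrix should be passed to each method.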

### 🔎 Dataset exploration

In [None]:
# Data shape
mall_df.shape

In [None]:
# Dataset information
mall_df.info()

In [None]:
# Descriptive statistics
mall_df.describe()

In [None]:
# Column data types
mall_dtype = mall_df.dtypes
mall_dtype.value_counts()

In [None]:
# Missing values per column
mall_df.isna().sum()

### 🔎 Visualisations

In [None]:
# Outliers
mall_df.plot(kind='box', subplots=True, layout=(2, 2), figsize = (9, 6), color='royalblue')
plt.show()

In [None]:
# Barplots
plt.figure(figsize = (7,5))
sns.countplot(x = "Gender", data = mall_df, palette="Blues")

In [None]:
# Correlation (numeric columns only; Gender is a string column)
corr = mall_df.corr(numeric_only=True)
plt.figure(figsize=(7,7))
sns.heatmap(corr,cbar=True, square=True, annot=True, cmap='Blues')

In [None]:
# Age distribution (number of customers per age)
plt.figure(figsize=(15,6))
sns.countplot(x = "Age", data = mall_df, palette="Blues")

### 🔎 DBSCAN clustering

In [None]:
# Libraries
from numpy import unique
import seaborn as sns
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from sklearn import metrics

In [None]:
# Create copy of initial dataframe
df_dbscan = mall_df.copy()

In [None]:
# Encode the categorical Gender column as 0/1
df_dbscan['Gender'] = df_dbscan['Gender'].map({'Female': 1, 'Male': 0})

# df_dbscan

In [None]:
# Scaling data before clustering
scaler = StandardScaler()
df_dbscan_scale = scaler.fit_transform(df_dbscan[['Age','Annual Income (k$)','Spending Score (1-100)']])
df_dbscan[['Age','Annual Income (k$)','Spending Score (1-100)']] = df_dbscan_scale
df_dbscan

In [None]:
# DBSCAN clustering using the Annual Income and Spending Score features
X = df_dbscan[['Annual Income (k$)', 'Spending Score (1-100)']].values

# Instance of DBSCAN
dbscan = DBSCAN(eps=0.37, min_samples=4)

# Fitting the model
model = dbscan.fit(X)

# Storing cluster labels (-1 marks noise points)
labels = model.labels_

# Creating a new column in our dataframe
df_dbscan['cluster'] = labels

In [None]:
# Plotting our clusters (noise points, labelled -1, are left out)
plt.scatter(data=df_dbscan[df_dbscan['cluster']>=0], x='Spending Score (1-100)', y='Annual Income (k$)', c='cluster', cmap="plasma")

In [None]:
# Identifying the core points
sample_cores = np.zeros_like(labels, dtype=bool)
sample_cores[dbscan.core_sample_indices_] = True

# Calculating the number of clusters (the noise label -1 is not a cluster)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# Printing the silhouette score (noise points count as one extra cluster here)
print(metrics.silhouette_score(X, labels))
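
The silhouette call above passes the noise label `-1` to the metric as if it were a cluster, which can distort the score. One possible variant, sketched here on synthetic data, scores only the points DBSCAN actually assigned to a cluster. The data, `eps`, and `min_samples` values below are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs plus a few scattered points (NOT the mall data)
rng = np.random.default_rng(0)
blobs, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [0, 5]],
                      cluster_std=0.4, random_state=0)
scattered = rng.uniform(-3, 8, size=(10, 2))
X_demo = np.vstack([blobs, scattered])

labels_demo = DBSCAN(eps=0.5, min_samples=4).fit_predict(X_demo)

# Boolean mask of points that belong to a real cluster
clustered = labels_demo != -1
n_clusters_demo = len(set(labels_demo[clustered]))
n_noise = int((~clustered).sum())

# Score with noise counted as a pseudo-cluster vs. score on clustered points only
score_all = silhouette_score(X_demo, labels_demo)
score_clustered = silhouette_score(X_demo[clustered], labels_demo[clustered])

print(f"clusters: {n_clusters_demo}, noise points: {n_noise}")
print(f"silhouette (noise included): {score_all:.3f}")
print(f"silhouette (noise excluded): {score_clustered:.3f}")
```

Reporting the number of noise points next to the score helps, since a high score obtained by discarding many points as noise is not necessarily a good segmentation.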

In [None]:
#Plotting distribution of our clusters (Age feature)
sns.boxplot(x='cluster', y='Age', data=df_dbscan)

In [None]:
#Plotting distribution of our clusters (Annual income feature)
sns.boxplot(x='cluster', y='Annual Income (k$)', data=df_dbscan)

In [None]:
#Plotting distribution of our clusters (Spending score feature)
sns.boxplot(x='cluster', y='Spending Score (1-100)', data=df_dbscan)

In [None]:
#Pairplot for distribution visualisation
sns.pairplot(df_dbscan, hue='cluster')

### 🔎 K-means clustering

In [None]:
df_km = mall_df.copy()

In [None]:
df_km = df_km.drop(['CustomerID','Gender'],axis=1)

In [None]:
from sklearn.cluster import KMeans

n_clusters = [2,3,4,5,6,7,8,9,10]
clusters_inertia = [] # inertia of clusters
s_scores = [] # silhouette scores

for n in n_clusters:
    KM_est = KMeans(n_clusters=n, init='k-means++').fit(df_km)
    clusters_inertia.append(KM_est.inertia_)    # data for the elbow method
    silhouette_avg = silhouette_score(df_km, KM_est.labels_)
    s_scores.append(silhouette_avg)

In [None]:
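# KElbowVisualizer comes from the third-party yellowbrick package
from yellowbrick.cluster import KElbowVisualizer

model = KMeans(random_state=4)
visualizer = KElbowVisualizer(model, k =(2,10))
visualizer.fit(df_km)
visualizer.show()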

In [None]:
km = KMeans(n_clusters=5)
km_pred = km.fit_predict(df_km)

In [None]:
df_km['cluster'] = km_pred
# Cluster sizes (printed explicitly, since only the last expression of a cell is displayed)
print(df_km['cluster'].value_counts())
df_km

In [None]:
# Exclude the 'cluster' column so the labels are not used as a feature
score = silhouette_score(df_km.drop(columns='cluster'), km.labels_, metric='euclidean')
print('Silhouette Score: %.3f' % score)

In [None]:
sns.pairplot(df_km, hue = 'cluster')

### 🔎 Agglomerative clustering

In [None]:
# Copy of the dataset
df_agg = mall_df.copy()

In [None]:
from sklearn.cluster import AgglomerativeClustering 

In [None]:
# Agglomerative clustering illustration (requires the third-party mglearn package)
import mglearn
mglearn.plots.plot_agglomerative_algorithm()

💡 First, every point is initialized as its own cluster. Then, at each step, the two closest clusters are merged.

In the first four steps, four two-point clusters are formed. In steps 5 to 7, three three-point clusters are formed, and finally, at step 9, three main clusters remain, each made up of different points.
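
The merging procedure described above can be traced step by step with SciPy's `linkage`, whose output matrix records one merge per row. This is a small sketch on five made-up points (the `points` array is hypothetical and unrelated to the mall data), using Ward linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five toy 2-D points (hypothetical, for illustration only)
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5], [10.0, 0.0]])

# Each row of Z is one merge: [cluster_a, cluster_b, merge_distance, new_cluster_size]
# Original points have ids 0..4; each merge creates a new cluster id 5, 6, ...
Z = linkage(points, method="ward")

for step, (a, b, dist, size) in enumerate(Z, start=1):
    print(f"step {step}: merge clusters {int(a)} and {int(b)} "
          f"at distance {dist:.2f} -> new cluster of size {int(size)}")
```

With n points there are always n - 1 merges, and the last row of `Z` corresponds to the single cluster containing every point.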

In [None]:
X = df_agg[['Annual Income (k$)','Spending Score (1-100)']].values
Y = df_agg[['Age','Spending Score (1-100)']].values

In [None]:
# Agglomerative clustering with 5 clusters
# (Ward linkage always uses Euclidean distance, so the deprecated `affinity` argument is dropped)
agglo = AgglomerativeClustering(n_clusters = 5, linkage = 'ward')
y_hc = agglo.fit_predict(X)

In [None]:
df_agg = pd.get_dummies(mall_df,columns=['Gender'], prefix = ['sex'])

In [None]:
# Drop the ID column
df = df_agg.drop('CustomerID',axis=1)

In [None]:
agg_pred = agglo.fit_predict(df)

In [None]:
# New "cluster" feature
df['cluster'] = agg_pred
df

In [None]:
# Boxplots of Annual Income per cluster
plt.figure(figsize=(9,5))
sns.boxplot(x='cluster',y='Annual Income (k$)', data=df, palette='Blues')

In [None]:
# Boxplots of Spending Score per cluster
plt.figure(figsize=(9,5))
sns.boxplot(x='cluster',y='Spending Score (1-100)', data=df, palette='Blues')

In [None]:
# Boxplots of Age per cluster
plt.figure(figsize=(9,5))
sns.boxplot(x='cluster',y='Age', data=df, palette='Blues')

In [None]:
# Pairplot
sns.pairplot(df,hue='cluster', palette="mako")

In [None]:
# Plot the 5 clusters: Age vs. Spending Score
plt.figure(figsize=(5,6))

plt.scatter(Y[y_hc == 0, 0], Y[y_hc == 0, 1], c = 'paleturquoise', label = 'Cluster 0')
plt.scatter(Y[y_hc == 1, 0], Y[y_hc == 1, 1], c = 'skyblue', label = 'Cluster 1')
plt.scatter(Y[y_hc == 2, 0], Y[y_hc == 2, 1], c = 'royalblue', label = 'Cluster 2')
plt.scatter(Y[y_hc == 3, 0], Y[y_hc == 3, 1], c = 'blue', label = 'Cluster 3')
plt.scatter(Y[y_hc == 4, 0], Y[y_hc == 4, 1], c = 'navy', label = 'Cluster 4')

plt.title('Clusters of customers')
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.legend()
plt.show()

💡 There are no distinct groups when plotting "Age" against "Spending Score".

In [None]:
# Plot the 5 clusters: Annual Income vs. Spending Score
plt.figure(figsize=(5,6))

plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], c = 'paleturquoise', label = 'Cluster 0')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], c = 'skyblue', label = 'Cluster 1')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], c = 'royalblue', label = 'Cluster 2')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], c = 'blue', label = 'Cluster 3')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], c = 'navy', label = 'Cluster 4')

plt.title('Clusters of customers')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.show()

💡 Five clusters separate the subgroups more clearly. The five groups are:

* Low income with a high spending score (Cluster 0)
* Low income with a low spending score (Cluster 1)
* High income with a high spending score (Cluster 2)
* Medium income with a medium spending score (Cluster 3)
* High income with a low spending score (Cluster 4)

💡 The clusters are identical to those obtained with "complete" linkage. This may indicate that the clusters are well defined, since changing the linkage does not change them. This plot uses 'ward' linkage and 5 clusters.
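
One way to check such a claim numerically, rather than by eye, is to compare the two label vectors with the adjusted Rand index, which equals 1 when two partitions are identical up to a renaming of the labels. The sketch below uses synthetic, well-separated blobs; `X_demo` and its centers are assumptions, and the result on the mall data may differ.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Five well-separated synthetic blobs (NOT the mall data)
demo_centers = [[0, 0], [0, 20], [20, 0], [20, 20], [10, 10]]
X_demo, _ = make_blobs(n_samples=200, centers=demo_centers,
                       cluster_std=0.5, random_state=42)

# Same number of clusters, two different linkage criteria
labels_ward = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X_demo)
labels_complete = AgglomerativeClustering(n_clusters=5, linkage="complete").fit_predict(X_demo)

# The index ignores label numbering, so cluster "0" in one run may be "3" in the other
ari = adjusted_rand_score(labels_ward, labels_complete)
print(f"Adjusted Rand index (ward vs complete): {ari:.3f}")
```

A value close to 1 supports the idea that the segmentation does not depend much on the linkage choice; a noticeably lower value would suggest the clusters are less clear-cut.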

In [None]:
# Import libraries
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px
from scipy.cluster.hierarchy import ward
from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
# Dendrogram
plt.figure(figsize = (17, 7))
dendo = dendrogram(linkage(X, method = 'ward'))
plt.show()

In [None]:
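# Interactive dendrogram
# NOTE: `fig` was never created in the original cell; building it with plotly's
# create_dendrogram and Ward linkage is an assumption.
fig = ff.create_dendrogram(X, linkagefun=lambda d: linkage(d, method='ward'))

# Layout
fig.update_layout(width = 1000, height = 500, yaxis_title = 'Distance between clusters', xaxis_title = 'Sample index')
fig.update_xaxes(showticklabels=False)

# Dashed line
fig.add_shape(
        type='line',
        x0=0,
        y0=260,
        x1=1985,
        y1=260,
        line=dict(
            color='Black',
            dash='dash'
        )
)
fig.show()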

💡 The vertical axis (distance between clusters) is derived from the Euclidean distance. Starting from the bottom, leaves merge into branches, which correspond to samples or clusters that are similar to one another. The vertical distance reflects how similar the clusters are: the larger the vertical distance before a merge, the more dissimilar the clusters.

💡 The dashed line shows where we choose to cut the dendrogram to obtain the desired number of clusters. The number of vertical lines crossed by the dashed line gives the number of clusters obtained when cutting at that height. In this dendrogram, cutting at a cluster distance of 260 crosses 5 vertical lines, so we get five clusters.

💡 Using this dendrogram, we can examine each individual cluster and how it merges, from the bottom up, into larger clusters. This is very informative, since we can inspect each individual sample and see how similar (or not) it is to the samples it merges with.
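
Cutting a dendrogram at a given height can also be done programmatically with SciPy's `fcluster`. The sketch below uses synthetic data; `X_demo`, `demo_centers`, and `cut_height` are hypothetical and chosen so that the cut falls between the within-group and between-group merge distances. They are not the values used for the mall data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Five synthetic groups of 40 points each (NOT the mall data)
rng = np.random.default_rng(0)
demo_centers = np.array([[0, 0], [0, 50], [50, 0], [50, 50], [25, 25]])
X_demo = np.vstack([c + rng.normal(scale=2.0, size=(40, 2)) for c in demo_centers])

# Ward linkage matrix, the same structure a dendrogram is drawn from
Z = linkage(X_demo, method="ward")

# Option 1: cut at a fixed height, like drawing a horizontal dashed line
cut_height = 100.0  # hypothetical threshold for this synthetic data
labels_by_height = fcluster(Z, t=cut_height, criterion="distance")

# Option 2: ask directly for at most 5 clusters and let SciPy find the cut
labels_by_count = fcluster(Z, t=5, criterion="maxclust")

print("clusters when cutting at height:", len(set(labels_by_height)))
print("clusters when requesting 5:", len(set(labels_by_count)))
```

Cutting by height mirrors reading the dendrogram visually, while `maxclust` is convenient when the number of segments is fixed in advance.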

### Affinity propagation clustering

In [None]:
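# Start again from the original dataframe: `df` no longer has the CustomerID and Gender columns
df_ap = mall_df.copy()
df_ap = df_ap.drop(['CustomerID','Gender'],axis=1)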

In [None]:
from sklearn.cluster import AffinityPropagation

no_of_clusters = []
preferences = range(-20000,-5000,100) # arbitrarily chosen range
af_sil_score = [] # silhouette scores

for p in preferences:
    AF = AffinityPropagation(preference=p, max_iter=200, random_state=None).fit(df_ap)
    no_of_clusters.append((len(np.unique(AF.labels_))))
    af_sil_score.append(silhouette_score(df_ap, AF.labels_))

af_results = pd.DataFrame([preferences, no_of_clusters, af_sil_score], index=['preference','clusters', 'sil_score']).T
af_results.sort_values(by='sil_score', ascending=False).head()

In [None]:
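# Elbow plot of the K-means inertias (n_clusters and clusters_inertia come from the K-means section)
fig, ax = plt.subplots(figsize=(12,5))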
ax = sns.lineplot(x = n_clusters,y = clusters_inertia, marker='o', ax=ax)
ax.set_title("Elbow method")
ax.set_xlabel("number of clusters")
ax.set_ylabel("clusters inertia")
ax.axvline(5, ls="--", c="red")
ax.axvline(6, ls="--", c="red")
plt.grid()
plt.show()

In [None]:
ap = AffinityPropagation(preference=-14500,random_state=None)
ap_pred = ap.fit_predict(df_ap)

In [None]:
df_ap['cluster'] = ap_pred
df_ap['cluster'].value_counts()

In [None]:
sns.pairplot(df_ap,hue = 'cluster')

### Gaussian clustering

# Conclusions