<a href="https://colab.research.google.com/github/Hamzandj/Open-Week-3.0/blob/main/Clustering_with_K_means.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**OPEN WEEK 3.0 : Clustering in a nutshell**

This example was made for a short presentation during the Open Week 3.0 event organized by [Open Coding Club](https://www.facebook.com/OpenCodingClub). The dataset and the code used in this project are inspired by [this project](https://www.kaggle.com/roshansharma/mall-customers-clustering-analysis).

Let's import the main dependencies:

In [11]:
# for basic mathematical operations and data handling
import numpy as np
import pandas as pd
from pandas import plotting

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

# for interactive visualizations
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
from plotly import tools
init_notebook_mode(connected = True)
import plotly.figure_factory as ff

# for path
import os

#for clustering
from sklearn.cluster import KMeans


As you can see, we are using scikit-learn (`sklearn`), one of the most widely used Python libraries for machine learning. It provides many machine learning algorithms; here we use K-means.

To test our program, we first need to load the dataset. We use the `pandas` library to load and manipulate the data.

Let's load the data and take a closer look at it:

In [None]:

# importing the dataset
data = pd.read_csv('./sample_data/Mall_Customers.csv')

import warnings
warnings.filterwarnings('ignore')

# Plotting the distribution of the selected fields
plt.rcParams['figure.figsize'] = (18, 8)

selected_field = 'Annual Income (k$)'
plt.subplot(1, 2, 1)
sns.set(style = 'whitegrid')
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data[selected_field], kde = True)
plt.title('Distribution of '+selected_field, fontsize = 20)
plt.xlabel('Range of '+selected_field)
plt.ylabel('Count')

selected_field = 'Spending Score (1-100)'
plt.subplot(1, 2, 2)
sns.set(style = 'whitegrid')
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data[selected_field], kde = True, color = 'red')
plt.title('Distribution of '+selected_field, fontsize = 20)
plt.xlabel('Range of '+selected_field)
plt.ylabel('Count')
plt.show()
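
Alongside the plots, a quick look at the table itself is a common sanity check. The sketch below uses a few invented rows in place of `Mall_Customers.csv` (the values and the `sample` name are assumptions for illustration only); on the real data the same methods would be called on `data`.

```python
import io
import pandas as pd

# Invented rows standing in for the real CSV (assumption, not actual data);
# the third row deliberately lacks a spending score
csv_text = """Annual Income (k$),Spending Score (1-100)
15,39
16,81
54,
78,17
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())              # first rows of the table
print(sample.shape)               # (number of rows, number of columns)
missing = sample.isnull().sum()   # number of missing values per column
print(missing)
print(sample.describe())          # basic statistics for numeric columns
```

Missing values matter here because `KMeans` does not accept them, so rows like the third one would have to be dropped or filled before clustering.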

Let's now proceed to a first combination. It is plausible that the spending score depends on the customer's income, so we plot those two fields to see whether such a relationship appears in the dataset.

In [None]:
#Plotting
data.plot.scatter(x='Annual Income (k$)',y='Spending Score (1-100)')

plt.show()

As you can see, we can distinguish 5 different customer profiles. The example we're using is an easy one, so we can proceed to an initial clustering with K-means on those two fields to group customers by annual income and spending score.

To do that, we use the `KMeans` class imported earlier. It takes many parameters; here are some of them:


*   **n_clusters** : number of clusters we want (k)
*   **init** : the method used to initialize the centroids (`'k-means++'` spreads the initial centroids apart to speed up convergence)
*   **max_iter** : the maximum number of iterations for a single run; the algorithm stops earlier if the centroids stop moving (convergence)

Once we configure the K-means model, we will execute it on our dataset by calling `kmeans.fit_predict()`, which fits the model and returns the cluster label of each row.
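
To make these parameters concrete before using the real data, here is a minimal sketch on six invented points that form two obvious groups. The names `points`, `toy_kmeans` and `toy_labels` are hypothetical and only serve this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six toy points forming two obvious groups (invented for illustration)
points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]], dtype=float)

toy_kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300,
                    n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(points)   # one cluster index per point

print(toy_labels)                   # cluster index assigned to each point
print(toy_kmeans.cluster_centers_)  # coordinates of the 2 centroids
print(toy_kmeans.n_iter_)           # iterations actually run (at most max_iter)
```

The cluster indices themselves (0 or 1) are arbitrary; what matters is which points share the same index.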


In [None]:
#Initialize the K-means model
kmeans = KMeans(n_clusters = 5, init = 'k-means++', max_iter = 300, n_init = 10,
                random_state = 0)

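# annual_dataset contains two columns: annual income and spending score
# (selected by name so the result does not depend on column order)
annual_dataset = data[['Annual Income (k$)', 'Spending Score (1-100)']].values

# Fit K-means and get the cluster label of each customer
labels = kmeans.fit_predict(annual_dataset)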

#Plotting the result
plt.rcParams['figure.figsize'] = (10, 10)
plt.title('Cluster of Annual Income', fontsize = 30)
plt.scatter(annual_dataset[:, 0], annual_dataset[:, 1], s = 50, c=labels, cmap = 'viridis')
plt.style.use('fivethirtyeight')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.grid()
plt.show()
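
The value of 5 clusters was read off the scatter plot by eye. One common way to check such a choice is the elbow method: fit K-means for several values of k and look at how the inertia (within-cluster sum of squared distances) drops. The sketch below is not part of the original presentation; it uses synthetic data from `make_blobs` as a stand-in for the mall data, and `X_demo`, `k_values` and `inertias` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 5 groups, standing in for the mall data (assumption)
X_demo, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0,
                       random_state=0)

k_values = list(range(1, 11))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, init='k-means++', max_iter=300,
                   n_init=10, random_state=0)
    model.fit(X_demo)
    inertias.append(model.inertia_)   # within-cluster sum of squares for this k

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `k_values` and looking for the point where the curve bends gives the usual "elbow" picture; on the real data, `annual_dataset` would take the place of `X_demo`.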

After finding the clusters, let's analyse them and add a legend entry for each cluster so we can understand each customer profile.

In [None]:
#Adding labels to clusters
plt.scatter(annual_dataset[labels == 0, 0], annual_dataset[labels == 0, 1], s = 100, c = 'pink', label = 'miser')
plt.scatter(annual_dataset[labels == 1, 0], annual_dataset[labels == 1, 1], s = 100, c = 'yellow', label = 'general')
plt.scatter(annual_dataset[labels == 2, 0], annual_dataset[labels == 2, 1], s = 100, c = 'cyan', label = 'target')
plt.scatter(annual_dataset[labels == 3, 0], annual_dataset[labels == 3, 1], s = 100, c = 'magenta', label = 'spendthrift')
plt.scatter(annual_dataset[labels == 4, 0], annual_dataset[labels == 4, 1], s = 100, c = 'orange', label = 'careful')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:, 1], 
            s = 150, c = 'black' , label = 'centroid', marker='x')

plt.style.use('fivethirtyeight')
plt.title('K Means Clustering', fontsize = 20)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score')
plt.legend()
plt.grid()
plt.show()
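
Names such as "miser" or "spendthrift" are easier to justify when backed by numbers. A possible approach, sketched here on synthetic data, is to compute the mean of each feature per cluster with a pandas `groupby`. The column names `income` and `score` and the variables `profile_df` and `summary` are hypothetical, and `make_blobs` stands in for the real dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the two columns used above (assumption)
X_demo, _ = make_blobs(n_samples=200, centers=5, random_state=0)
profile_df = pd.DataFrame(X_demo, columns=['income', 'score'])  # hypothetical names

# Attach the cluster label of each row
profile_df['cluster'] = KMeans(n_clusters=5, n_init=10,
                               random_state=0).fit_predict(X_demo)

# Mean of each feature and number of points per cluster
summary = profile_df.groupby('cluster').agg(
    income_mean=('income', 'mean'),
    score_mean=('score', 'mean'),
    n_points=('income', 'count'),
)
print(summary)
```

A cluster with a high mean income and a low mean score would, for instance, match the kind of profile labelled "miser" in the plot.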

We can make another attempt by analysing the age field and see what we get, proceeding as we did with the annual income field.

In [None]:
kmeans = KMeans(n_clusters = 4, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)

# Dataset containing the age and the spending score
# (selected by name; assumes the age column is called 'Age')
age_dataset = data[['Age', 'Spending Score (1-100)']].values

# Proceeding to the clustering
labels = kmeans.fit_predict(age_dataset)

#Plotting result
plt.rcParams['figure.figsize'] = (10, 10)
plt.title('Cluster of Ages', fontsize = 30)
plt.scatter(age_dataset[:, 0], age_dataset[:, 1], s = 50, c=labels, cmap = 'viridis')
plt.style.use('fivethirtyeight')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.grid()
plt.show()
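
Here the number of clusters (4) is less obvious from the plot than in the income case. As a hedged illustration, the silhouette score from scikit-learn can be used to compare several values of k: it ranges from -1 to 1, and higher values indicate better separated clusters. The sketch uses synthetic data (`make_blobs`) rather than the real age data, and `X_demo`, `scores` and `best_k` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data standing in for the age / spending score columns (assumption)
X_demo, _ = make_blobs(n_samples=200, centers=4, random_state=0)

scores = {}
for k in range(2, 8):
    candidate_labels = KMeans(n_clusters=k, n_init=10,
                              random_state=0).fit_predict(X_demo)
    # Mean silhouette coefficient over all points for this k
    scores[k] = silhouette_score(X_demo, candidate_labels)

best_k = max(scores, key=scores.get)   # k with the highest silhouette score
print(scores)
print('best k:', best_k)
```

On the real data, `age_dataset` would replace `X_demo`; the score is only a guide, and the final choice of k can still depend on how the clusters will be used.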

Let's add a suitable legend to each cluster.

In [None]:
plt.rcParams['figure.figsize'] = (10, 10)
plt.title('Clusters of Ages', fontsize = 30)

plt.scatter(age_dataset[labels == 0, 0], age_dataset[labels == 0, 1], s = 100, c = 'pink', label = 'Target Customers (Young)')
plt.scatter(age_dataset[labels == 1, 0], age_dataset[labels == 1, 1], s = 100, c = 'orange', label = 'Priority Customers')
plt.scatter(age_dataset[labels == 2, 0], age_dataset[labels == 2, 1], s = 100, c = 'lightgreen', label = 'Usual Customers')
plt.scatter(age_dataset[labels == 3, 0], age_dataset[labels == 3, 1], s = 100, c = 'red', label = 'Target Customers (Old)')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 100, c = 'black', marker='x')

plt.style.use('fivethirtyeight')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.grid()
plt.show()