# Customer Segmentation
The purpose of this project is to apply clustering algorithms to split customers into distinct groups. This is a common business task because it enables targeted advertising, among many other things.
I will mainly focus on k-Means.

## Data import and initial exploration
First we need to import the data and run some initial checks.

In [1]:
import pandas as pd

df = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
df.head()

In [2]:
df.shape

In [3]:
df.describe()

In [4]:
df.dtypes

In [5]:
df.isnull().sum()

Nothing unusual in the data needs fixing, although the customer ID column does not seem to be useful for clustering.

## Data Visualisation
It is often useful to plot the data to see whether there are trends that need to be explained.

In [6]:
import seaborn as sns
sns.countplot(y = 'Gender' , data = df)

In [7]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(10,5),sharey=True)
#plt.ylabel("Count")
sns.histplot(ax=axes[0], data=df, x='Annual Income (k$)' , kde=True)
axes[0].set_title('Annual Income')
sns.histplot(ax=axes[1], data=df, x='Age' , kde=True)
axes[1].set_title('Age')
sns.histplot(ax=axes[2], data=df, x='Spending Score (1-100)' , kde=True)
axes[2].set_title('Spending Score (1-100)')
plt.show()

## Machine learning

### kMeans
This algorithm partitions the data into k groups by choosing centroids that minimise the inertia, which is the within-cluster sum of squared distances from each sample to its nearest centroid. It implicitly assumes clusters of roughly equal variance. It scales well to large numbers of samples and is one of the most commonly used clustering algorithms.  
In basic terms it has 3 steps. The first step chooses the initial centroid locations randomly; the algorithm then loops between the next 2 steps. The first of these assigns each sample to its nearest centroid; the second creates new centroids by taking the mean of all the samples assigned to each previous centroid. The distance between the old and new centroids is computed, and the algorithm repeats until the centroids no longer move significantly.  
Inertia measures how internally coherent the clusters are, with lower values meaning tighter clusters. However, because it favours compact, roughly spherical clusters, the algorithm performs poorly on elongated clusters. Inertia can be used to choose the number of clusters: plot it against k and look for the "elbow", the point where adding more clusters stops reducing the inertia substantially.  
The elbow is not always easy to identify. An alternative is to plot the silhouette score, which is the mean silhouette coefficient over all the instances, against k. The peak of this curve is usually more prominent and easier to spot than the elbow.
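
The three steps described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: the function name `lloyd_kmeans`, its parameters and the toy data are all made up for this example, and an empty cluster simply keeps its previous centroid.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, tol=1e-4, seed=0):
    """Illustrative k-means loop with random initial centroids (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every sample to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the samples assigned to it
        # (an empty cluster keeps its old centroid in this sketch)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move significantly
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment and inertia (sum of squared distances to the nearest centroid)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists[np.arange(len(X)), labels] ** 2).sum())
    return centroids, labels, inertia

# Toy data (made up for illustration): two well separated blobs
rng = np.random.default_rng(42)
toy_X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
toy_centroids, toy_labels, toy_inertia = lloyd_kmeans(toy_X, k=2)
print(toy_centroids)
print(toy_inertia)
```

Because the starting centroids are random, different seeds can converge to different solutions, which is why library implementations typically run several initialisations and keep the one with the lowest inertia.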

In [8]:
from sklearn.cluster import KMeans
import numpy as np
from sklearn.metrics import silhouette_score
import time
start_time = time.time()

train_x = df[['Age' , 'Spending Score (1-100)']]
inertia = []
silhoute_score = []
range_n_clusters = [2, 3, 4, 5, 6,7,8,9]
    
for i in range_n_clusters:
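    # init='random' gives plain k-means; scikit-learn's default is init='k-means++'
    kmean = KMeans(n_clusters = i, init='random').fit(train_x)
    inertia.append(kmean.inertia_)
    # reuse the labels from the fit above instead of refitting with fit_predict
    silhoute_score.append(silhouette_score(train_x, kmean.labels_))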
print("--- %s seconds ---" % (time.time() - start_time)) 
plt.figure(figsize = (15 ,6))
plt.plot(range_n_clusters , inertia , 'o')
plt.plot(range_n_clusters , inertia , '-' , alpha = 0.5)
plt.xlabel("$k$", fontsize=14)
plt.ylabel('Inertia', fontsize=14)
plt.show()

plt.figure(figsize=(15 ,6))
plt.plot(range_n_clusters, silhoute_score, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Silhouette score", fontsize=14)
plt.show()

You can also plot every instance's silhouette coefficient, sorted by cluster and by value. This is called a silhouette diagram, in which each cluster is shaped like a knife. The knife's height indicates the number of instances the cluster contains, and its width represents the sorted silhouette coefficients of the instances in the cluster.  
A good value for the number of clusters is one where most instances in every cluster have a silhouette coefficient above the mean value, and the clusters have similar sizes and coefficients.
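
To make the per-instance coefficient concrete, here is a small sketch that computes it by hand using the standard definition s = (b - a) / max(a, b), where a is the mean distance to the other members of the instance's own cluster and b is the smallest mean distance to the members of any other cluster. The helper name `silhouette_by_hand` and the toy data are assumptions made for illustration; in the notebook itself `silhouette_samples` from scikit-learn does this work.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Per-sample silhouette coefficient s = (b - a) / max(a, b) (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a sample alone in its cluster scores 0
        # a: mean distance to the other members of the sample's own cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy data (made up for illustration): two tight groups on a line
toy_X = np.array([[0.0], [1.0], [10.0], [11.0]])
toy_labels = np.array([0, 0, 1, 1])
toy_scores = silhouette_by_hand(toy_X, toy_labels)
print(toy_scores)         # one coefficient per sample
print(toy_scores.mean())  # the silhouette score is their mean
```

Values close to 1 mean the instance sits well inside its own cluster, values near 0 mean it lies close to a cluster boundary, and negative values suggest it may have been assigned to the wrong cluster.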

In [9]:
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples

for n_clusters in range_n_clusters:
    # Create a figure with a single axes for this value of k
    fig, ax = plt.subplots()

    # Silhouette coefficients lie in [-1, 1]; only this part of the range is shown
    ax.set_xlim([-0.1, 1])
    # Extra height leaves white space between the clusters
    ax.set_ylim([0, len(train_x) + (n_clusters + 1) * 10])

    clusterer = KMeans(n_clusters=n_clusters)
    cluster_labels = clusterer.fit_predict(train_x)
    silhouette_avg = silhouette_score(train_x, cluster_labels)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette coefficient of each sample
    sample_silhouette_values = silhouette_samples(train_x, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        # Collect and sort the coefficients of the samples in cluster i
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values,
                         facecolor=color, edgecolor=color, alpha=0.7)

        # Label each silhouette with its cluster number at the middle
        ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for the next cluster, leaving a gap of 10
        y_lower = y_upper + 10

    ax.set_title(f"The silhouette plot for n_clusters = {n_clusters}")
    ax.set_xlabel("The silhouette coefficient values")
    ax.set_ylabel("Cluster label")

    # Vertical line for the average silhouette score over all samples
    ax.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax.set_yticks([])  # Clear the y-axis labels / ticks
    ax.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.show()



### kMeans++
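An improved version of this algorithm is called kMeans++. It changes the initial centroid selection step so that the starting centroids tend to be far apart from each other, which typically gives better and more consistent results than purely random starting points. Note that scikit-learn's `KMeans` uses this initialisation by default (`init='k-means++'`).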
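
The seeding idea can be sketched as follows: pick the first centroid uniformly at random, then pick each further centroid from the samples with probability proportional to its squared distance from the nearest centroid chosen so far. This is a simplified illustration; the function name `kmeans_pp_init` and the toy data are assumptions, and scikit-learn's own implementation may differ in its details.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Illustrative k-means++ style seeding (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # First centroid: one sample chosen uniformly at random
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        chosen = np.array(centroids)
        # Squared distance from each sample to its nearest chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - chosen[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Samples far from every chosen centroid are more likely to be picked next
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

# Toy data (made up for illustration): three separated blobs
rng = np.random.default_rng(1)
toy_X = np.vstack([
    rng.normal(0, 1, (30, 2)),
    rng.normal(8, 1, (30, 2)),
    rng.normal(16, 1, (30, 2)),
])
toy_seeds = kmeans_pp_init(toy_X, k=3)
print(toy_seeds)
```

A sample that has already been chosen has zero distance to itself, so it cannot be picked twice; the remaining iterations of the algorithm are unchanged.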

In [10]:
import time
start_time = time.time()

train_x = df[['Age' , 'Spending Score (1-100)']]
inertia = []
silhoute_score = []
range_n_clusters = [2, 3, 4, 5, 6,7,8,9]
    
for i in range_n_clusters:
    kmean = KMeans(n_clusters = i, init='k-means++').fit(train_x)
    inertia.append(kmean.inertia_)
    # reuse the labels from the fit above instead of refitting with fit_predict
    silhoute_score.append(silhouette_score(train_x, kmean.labels_))

print("--- %s seconds ---" % (time.time() - start_time))

plt.figure(figsize = (15 ,6))
plt.plot(range_n_clusters , inertia , 'o')
plt.plot(range_n_clusters , inertia , '-' , alpha = 0.5)
plt.xlabel("$k$", fontsize=14)
plt.ylabel('Inertia', fontsize=14)
plt.show()

plt.figure(figsize=(15 ,6))
plt.plot(range_n_clusters, silhoute_score, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Silhouette score", fontsize=14)
plt.show()

In [11]:
# Fit the final model with k = 4 and keep the centroids and cluster labels
kmeans = KMeans(n_clusters = 4, init='k-means++').fit(train_x)
centroids1 = kmeans.cluster_centers_
labels1 = kmeans.labels_

# Plot the customers coloured by cluster, with the centroids overlaid in red
plt.figure(figsize=(17 ,8))
plt.scatter(data = df, x='Age', y='Spending Score (1-100)',c = labels1)
plt.ylabel("Spending Score (1-100)", fontsize=14)
plt.xlabel("Age", fontsize=14)
plt.scatter(x = centroids1[: , 0] , y =  centroids1[: , 1] , s = 150 , c = 'red' , alpha = 0.5)
plt.show()