## **Table of Contents**

* [**Getting Started**](#getting_started)
* [**Data Preparation**](#preparation)
    * [Data Exploration](#exploration)
    * [Data Cleaning](#cleaning)
* [**Feature Engineering**](#engineering)
    * [Data Visualization & Analysis](#data_visualization)
    * [Feature Scaling](#scaling)
* [**KMeans Clustering**](#kmeans)
    * [**Age-Score Clustering**](#as-cluster)
        * [Elbow Method](#a_s_elbow)
        * [Model Fit](#a_s_fit)
        * [Clusters](#a_s_clusters)
        * [**Target Customers**](#a_s_target)
    * [**Income-Score Clustering**](#is-cluster)
        * [Elbow Method](#i_s_elbow)
        * [Model Fit](#i_s_fit)
        * [Clusters](#i_s_clusters)
        * [**Target Customers**](#i_s_target)  
    * [**Age-Income-Score Clustering**](#ais-cluster)
        * [Elbow Method](#a_i_s_elbow)
        * [Model Fit](#a_i_s_fit)
        * [Clusters](#a_i_s_clusters)
        * [**Target Customers**](#a_i_s_target)

## **Getting Started**
<a id = "getting_started"></a>

**Importing necessary libraries.**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
import plotly.graph_objs as go
import plotly as py
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

**Fetching dataset path information (Kaggle)**

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Loading the dataset into a pandas DataFrame**


In [None]:
data = pd.read_csv("/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv")
data.head()

## **Data Preparation**
<a id = "preparation"></a>

* [**Data Exploration**](#exploration)
* [**Data Cleaning**](#cleaning)

### **Data Exploration**
<a id = "exploration"></a>

In [None]:
data.describe()

In [None]:
data.dtypes

### **Data Cleaning**
<a id = "cleaning"></a>

**Do we have any null values?**

Fortunately, there are no null values in our dataset.

In [None]:
data.isnull().sum()

**Do we have any unnecessary columns that don't affect the output?**

The `CustomerID` column is only an identifier and contributes nothing to determining the target customer group.

So, we can drop this column.

In [None]:
data.describe()

In [None]:
data.drop(["CustomerID"], axis=1, inplace=True)
data.head()

**Converting Categorical Values**

Male = 1, Female = 0

In [None]:
le = LabelEncoder()
data['Gender'] = le.fit_transform(data['Gender'])
data.head()
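
The `Male = 1, Female = 0` mapping follows from `LabelEncoder` sorting the class labels before assigning integer codes. A minimal sketch, using a small made-up `Gender` series rather than the real dataset, that makes the mapping explicit through `classes_`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Gender column (illustrative values only).
gender = pd.Series(["Male", "Female", "Female", "Male"])

le = LabelEncoder()
codes = le.fit_transform(gender)

# classes_ holds the sorted labels; the position of each label is its code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())
```

Checking `le.classes_` after fitting is a cheap way to confirm which integer stands for which category before interpreting grouped results.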

**Renaming Columns**

In [None]:
new_column_names = {
    "Gender": "Gender", 
    "Age": "Age", 
    "Annual Income (k$)": "Income", 
    "Spending Score (1-100)":"Score"
}
data.rename(columns=new_column_names, inplace=True)
data.head()

## **Feature Engineering**
<a id = "engineering"></a>

* [**Data Visualization & Analysis**](#data_visualization)
* [**Feature Scaling**](#scaling)

### **Data Visualization & Analysis**
<a id = "data_visualization"></a>

**KDE Plot** 

Kernel Density Estimate for the quantitative variables

(Similar to a histogram, but it shows a smoothed density instead of raw frequencies.)

In [None]:
quantitative_columns = ['Age' , 'Income' , 'Score']

In [None]:
sns.set(rc={'figure.figsize':(18,9)})
sns.kdeplot(data=data[quantitative_columns])

The three variables are spread over different ranges, so it would be a good idea to normalize them. (See the [Feature Scaling section](#scaling).)

**Count Plot**

For the only qualitative variable, `Gender`

In [None]:
sns.countplot(x='Gender',data=data)

**Scatter Plot: Age vs Score**

`Age` and `Score` have a **weak** relationship.

* Younger customers tend to have higher scores, and thus a higher buying tendency.

In [None]:
sns.scatterplot(data=data, x="Age", y="Score", hue="Gender")

**Scatter Plot: Income vs Score**

`Income` and `Score` have a **moderate** relationship.

* Customers with an annual income of 40-70 k tend to have a medium score of 40-60.

* On the other hand, both the lower-income and the higher-income customers split into two separate groups: one with low scores and one with high scores.

In [None]:
sns.scatterplot(data=data, x="Income", y="Score", hue="Gender")

**Group By Mean (Gender)**

* Mean score of `Female (0)` customers > Mean score of `Male (1)` customers

* But the difference doesn't indicate a `strong` relation that we could base the clustering on.

* So, we can ignore `Gender` when clustering.

In [None]:
data.groupby(['Gender']).mean()

### **Feature Scaling**
<a id ="scaling"></a>

**Normalization**

Using `MinMaxScaler` from `sklearn`

In [None]:
min_max_scaler = MinMaxScaler()
data[quantitative_columns] = min_max_scaler.fit_transform(data[quantitative_columns])
data.head()
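
Min-max normalization rescales each column independently as `(x - min) / (max - min)`, so every feature ends up in the range [0, 1]. A minimal sketch, assuming a small made-up age column (the values are illustrative, not taken from the dataset), that checks the manual formula against `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy ages (illustrative values only).
ages = np.array([[18.0], [30.0], [44.0], [70.0]])

# Manual min-max scaling, computed per column.
col_min = ages.min(axis=0)
col_max = ages.max(axis=0)
manual = (ages - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(ages)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

Because KMeans relies on Euclidean distance, bringing the features onto a common range keeps any single feature from dominating the distance simply because of its units.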

**Standardization**

Using `StandardScaler` from `sklearn`

(Probably not needed here, since the quantitative columns have already been normalized above.)

In [None]:
# standard_scaler = StandardScaler()
# data[quantitative_columns] = standard_scaler.fit_transform(data[quantitative_columns])
# data.head()
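
For comparison, standardization rescales each column as `(x - mean) / std`, giving zero mean and unit variance rather than a fixed [0, 1] range. A minimal sketch on made-up income values (illustrative only), checking the manual formula against `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy incomes in k$ (illustrative values only).
incomes = np.array([[15.0], [40.0], [60.0], [137.0]])

# Manual z-score; np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
manual = (incomes - incomes.mean(axis=0)) / incomes.std(axis=0)

standardized = StandardScaler().fit_transform(incomes)

print(manual.ravel())
print(np.allclose(manual, standardized))
```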

## **KMeans Clustering**
<a id = "kmeans"></a>

* [**Age-Score Clustering**](#as-cluster)
* [**Income-Score Clustering**](#is-cluster)
* [**Age-Income-Score Clustering**](#ais-cluster)

### **Age-Score Clustering**
<a id="as-cluster"></a>

* [**Elbow Method**](#a_s_elbow)
* [**Model Fit**](#a_s_fit)
* [**Clusters**](#a_s_clusters)
* [**Target Customers**](#a_s_target)

In [None]:
a_s_clustering_data = data.copy()
# [Gender, Age, Income, Score]
a_s_clustering_data = a_s_clustering_data.iloc[:, [False, True, False, True]]
a_s_clustering_data.head()

#### **Elbow Method**
<a id = a_s_elbow></a>

In [None]:
sum_of_squared_error = []
max_k = 10
for k in range(1, max_k):
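    # Fixed seed and n_init so the elbow curve is reproducible and matches the final fit settings
    model = KMeans(n_clusters=k, n_init=10, random_state=111)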
    model.fit(a_s_clustering_data)
    sum_of_squared_error.append(model.inertia_)

In [None]:
plt.title('The Elbow Method (Age-Score)')
plt.xlabel('k')
plt.ylabel('Sum of Squared Error')
plt.xticks(range(1, max_k))
plt.plot(range(1,max_k),sum_of_squared_error)
plt.plot(2, sum_of_squared_error[1],'ro') 
plt.show()
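
The quantity plotted on the y-axis is KMeans' `inertia_`: the sum of squared distances from each point to the centroid of its assigned cluster. A minimal sketch, assuming a handful of made-up 2-D points, that recomputes it by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two loose groups (illustrative values only).
X = np.array([[0.10, 0.90], [0.20, 0.80], [0.15, 0.85],
              [0.80, 0.20], [0.90, 0.10], [0.85, 0.15]])

model = KMeans(n_clusters=2, n_init=10, random_state=111).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((X - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)
```

Inertia always decreases as `k` grows, which is why the elbow method looks for the point where the decrease levels off rather than for the minimum.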

#### **Model Fit**
<a id = a_s_fit></a>

The elbow appears at k = 2.

In [None]:
k=2
model = (KMeans(n_clusters = k ,init='k-means++', n_init = 10 ,max_iter=300, tol=0.0001,  random_state= 111  , algorithm='elkan') )
model.fit(a_s_clustering_data)

#### **Clusters**
<a id = a_s_clusters></a>

In [None]:
clusters = model.labels_
centroids = model.cluster_centers_
a_s_clustering_data['Clusters'] = clusters

sns.scatterplot(x=a_s_clustering_data['Age'], 
                y=a_s_clustering_data['Score'], 
                hue=a_s_clustering_data['Clusters'], 
                palette=sns.color_palette('husl', k))
plt.title('Age-Score KMeans Clustering (k={})'.format(k))
plt.show()

#### **Target Customers**
<a id = a_s_target></a>

* **RED CLUSTER**
    * `LOW Age` customers tend to have a `HIGH Score`. (Good prospects for new products)
* **GREEN CLUSTER**
    * `HIGH Age` and `LOW Score`

So, we should target **LOW AGE** customers.
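
Color names depend on the palette and on the arbitrary order in which KMeans numbers its clusters, so it can help to back this reading with per-cluster statistics. A minimal sketch on made-up normalized data (the column names match the notebook, the values are illustrative only):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy normalized Age/Score rows (illustrative values only).
toy = pd.DataFrame({
    "Age":   [0.05, 0.10, 0.15, 0.80, 0.90, 0.95],
    "Score": [0.90, 0.80, 0.85, 0.20, 0.10, 0.15],
})

model = KMeans(n_clusters=2, n_init=10, random_state=111).fit(toy)
toy["Clusters"] = model.labels_

# Mean Age, mean Score, and number of customers per cluster.
summary = toy.groupby("Clusters").agg(
    Age=("Age", "mean"),
    Score=("Score", "mean"),
    Size=("Age", "size"),
)
print(summary)

# The cluster with the lowest mean age, identified by label instead of color.
young_cluster = summary["Age"].idxmin()
print(young_cluster)
```

The same `groupby` could be applied to `a_s_clustering_data` once its `Clusters` column has been assigned.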

### **Income-Score Clustering**
<a id="is-cluster"></a>

* [**Elbow Method**](#i_s_elbow)
* [**Model Fit**](#i_s_fit)
* [**Clusters**](#i_s_clusters)
* [**Target Customers**](#i_s_target)

In [None]:
i_s_clustering_data = data.copy()
# [Gender, Age, Income, Score]
i_s_clustering_data = i_s_clustering_data.iloc[:, [False, False, True, True]]
i_s_clustering_data.head()

#### **Elbow Method**
<a id = i_s_elbow></a>

In [None]:
sum_of_squared_error = []
max_k = 20
for k in range(1, max_k):
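    # Fixed seed and n_init so the elbow curve is reproducible and matches the final fit settings
    model = KMeans(n_clusters=k, n_init=10, random_state=111)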
    model.fit(i_s_clustering_data)
    sum_of_squared_error.append(model.inertia_)

In [None]:
plt.title('The Elbow Method (Income-Score)')
plt.xlabel('k')
plt.ylabel('Sum of Squared Error')
plt.xticks(range(1, max_k))
plt.plot(range(1,max_k),sum_of_squared_error)
plt.plot(5, sum_of_squared_error[4],'ro') 
plt.show()
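
The elbow in this notebook is picked by eye. As a rough cross-check, one common heuristic (an assumption here, not the method used above) is to take the `k` where the curve bends most sharply, measured by the largest second difference of the error values. A minimal sketch on hypothetical SSE numbers:

```python
import numpy as np

# Hypothetical SSE values for k = 1..8 (invented to show a clear bend).
ks = np.arange(1, 9)
sse = np.array([100.0, 60.0, 25.0, 20.0, 16.0, 13.0, 11.0, 10.0])

# Second difference: how sharply the curve bends at each interior k.
second_diff = sse[:-2] - 2 * sse[1:-1] + sse[2:]

# second_diff[i] belongs to ks[i + 1], so index into the interior ks.
elbow_k = int(ks[1:-1][np.argmax(second_diff)])
print(elbow_k)
```

On real curves the bend is often less pronounced, so the result is best treated as a suggestion to compare against the plot.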

#### **Model Fit**
<a id = i_s_fit></a>

k = 5

In [None]:
k = 5
model = (KMeans(n_clusters = k ,init='k-means++', n_init = 10 ,max_iter=300, tol=0.0001,  random_state= 111  , algorithm='elkan') )
model.fit(i_s_clustering_data)

#### **Clusters**
<a id = i_s_clusters></a>

In [None]:
clusters = model.labels_
centroids = model.cluster_centers_
i_s_clustering_data['Clusters'] = clusters
sns.scatterplot(x=i_s_clustering_data['Income'], 
                y=i_s_clustering_data['Score'], 
                hue=i_s_clustering_data['Clusters'], 
                palette=sns.color_palette('husl', k))
plt.title('Income-Score KMeans Clustering (k={})'.format(k))
plt.show()

#### **Target Customers**
<a id = i_s_target></a>

* **RED CLUSTER**
    * `MEDIUM Income` customers tend to have a `MEDIUM Score`. (Good prospects as trusted customers)
* **GREEN CLUSTER**
    * `LOW Income` and `LOW Score`
* **PURPLE CLUSTER**
    * `LOW Income` but `HIGH Score` (But not consistent)
* **YELLOW CLUSTER**
    * `HIGH Income` and `LOW Score`
* **BLUE CLUSTER**
    * `HIGH Income` and `HIGH Score` (But not consistent)

So, we should target **MEDIUM INCOME** customers.
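
Since the features were normalized before clustering, the centroids are expressed on a 0-1 scale. To describe a segment in original units (k$ and score points), the centroids can be mapped back with the scaler's `inverse_transform`. A minimal sketch, assuming a scaler fit on just the two clustered columns and made-up raw values:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Toy [Income (k$), Score] rows (illustrative values only).
raw = np.array([[15.0, 10.0], [20.0, 15.0], [60.0, 50.0],
                [65.0, 55.0], [120.0, 90.0], [130.0, 95.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(raw)

model = KMeans(n_clusters=3, n_init=10, random_state=111).fit(scaled)

# Centroids live in the scaled space; map them back to the original units.
centroids_original = scaler.inverse_transform(model.cluster_centers_)

# Sort by income so the output order does not depend on cluster numbering.
centroids_sorted = centroids_original[np.argsort(centroids_original[:, 0])]
print(np.round(centroids_sorted, 1))
```

In the notebook the scaler was fit on all three quantitative columns at once, so the centroid array would need the same column layout (or a separate scaler per clustering) before inverting.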

### **Age-Income-Score Clustering**
<a id="ais-cluster"></a>

* [**Elbow Method**](#a_i_s_elbow)
* [**Model Fit**](#a_i_s_fit)
* [**Clusters**](#a_i_s_clusters)
* [**Target Customers**](#a_i_s_target)


In [None]:
a_i_s_clustering_data = data.copy()
# [Gender, Age, Income, Score]
a_i_s_clustering_data = a_i_s_clustering_data.iloc[:, [False, True, True, True]]
a_i_s_clustering_data.head()

#### **Elbow Method**
<a id = a_i_s_elbow></a>

In [None]:
sum_of_squared_error = []
max_k = 20
for k in range(1, max_k):
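    # Fixed seed and n_init so the elbow curve is reproducible and matches the final fit settings
    model = KMeans(n_clusters=k, n_init=10, random_state=111)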
    model.fit(a_i_s_clustering_data)
    sum_of_squared_error.append(model.inertia_)

In [None]:
plt.title('The Elbow Method (Age-Income-Score)')
plt.xlabel('k')
plt.ylabel('Sum of Squared Error')
plt.xticks(range(1, max_k))
plt.plot(range(1,max_k),sum_of_squared_error)
plt.plot(4, sum_of_squared_error[3],'ro') 
plt.show()

#### **Model Fit**
<a id = a_i_s_fit></a>

The elbow appears at k = 4.

In [None]:
k = 4
model = (KMeans(n_clusters = k ,init='k-means++', n_init = 10 ,max_iter=300, tol=0.0001,  random_state= 111  , algorithm='elkan') )
model.fit(a_i_s_clustering_data)

#### **Clusters**
<a id = a_i_s_clusters></a>

In [None]:
labels = model.labels_
centroids = model.cluster_centers_
a_i_s_clustering_data['Clusters'] = labels
trace1 = go.Scatter3d(
    x=a_i_s_clustering_data['Age'],
    y=a_i_s_clustering_data['Score'],
    z=a_i_s_clustering_data['Income'],
    mode='markers',
    marker=dict(
        colorscale="sunset",
        color=a_i_s_clustering_data['Clusters'],
        size=20,
        line=dict(
            colorscale="sunset",
            color=a_i_s_clustering_data['Clusters'],
            width=12
        ),
        opacity=0.6
    )
)
data_trace = [trace1]
layout = go.Layout(
    title= '3D Clusters (Age-Income-Score)',
    scene = dict(
            xaxis = dict(title  = 'Age'),
            yaxis = dict(title  = 'Score'),
            zaxis = dict(title  = 'Income')
        )
)
fig = go.Figure(data=data_trace, layout=layout)
py.offline.iplot(fig)

#### **Target Customers**
<a id = a_i_s_target></a>

* **BLUE CLUSTER**
    * `LOW-MEDIUM Income` and `MEDIUM-HIGH Age` customers tend to have a `MEDIUM-LOW Score`
* **YELLOW CLUSTER**
    * `LOW-MEDIUM Income` and `LOW-MEDIUM Age` customers tend to have a `MEDIUM-HIGH Score`
* **MAGENTA CLUSTER**
    * `MEDIUM-HIGH Income` and `LOW-HIGH Age` customers tend to have a `MEDIUM-HIGH Score`
* **ORANGE CLUSTER**
    * `MEDIUM-HIGH Income` and `LOW-MEDIUM Age` customers tend to have a `MEDIUM-HIGH Score`

So, we should target **LOW-MEDIUM AGE** and **MEDIUM-HIGH INCOME** customers.
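
Labels such as LOW, MEDIUM, and HIGH are read off the 3D plot here. One way to make them repeatable is to bucket the normalized centroid coordinates with fixed cut-offs. A minimal sketch, where both the centroid values and the thirds-based cut-offs in the hypothetical `level` helper are assumptions for illustration:

```python
import pandas as pd

# Hypothetical normalized centroids (values invented for illustration).
centroids = pd.DataFrame(
    {
        "Age":    [0.70, 0.15, 0.30],
        "Income": [0.30, 0.25, 0.60],
        "Score":  [0.35, 0.65, 0.80],
    },
    index=pd.Index([0, 1, 2], name="Clusters"),
)

def level(value, low=1 / 3, high=2 / 3):
    """Bucket a value in [0, 1] into LOW / MEDIUM / HIGH (assumed cut-offs)."""
    if value < low:
        return "LOW"
    if value < high:
        return "MEDIUM"
    return "HIGH"

# Apply the bucketing to every centroid coordinate, column by column.
levels = centroids.apply(lambda col: col.map(level))
print(levels)
```

With the real model, `centroids` could be built from `model.cluster_centers_` and the column names `['Age', 'Income', 'Score']`.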