<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

---
# **Table of Contents**
---

**1.** [**Introduction**](#Section1)<br>
**2.** [**Problem Statement**](#Section2)<br>
**3.** [**Installing & Importing Libraries**](#Section3)<br>
  - **3.1** [**Installing Libraries**](#Section31)
  - **3.2** [**Upgrading Libraries**](#Section32)
  - **3.3** [**Importing Libraries**](#Section33)

**4.** [**Data Acquisition & Description**](#Section4)<br>
  - **4.1** [**Data Description**](#Section41)
  - **4.2** [**Data Information**](#Section42)

**5.** [**Data Pre-Processing**](#Section5)<br>
  - **5.1** [**Pre-Profiling Report**](#Section51)<br>

**6.** [**Exploratory Data Analysis**](#Section6)<br>
**7.** [**Model Development & Evaluation**](#Section7)<br>
**8.** [**Conclusion**](#Section8)<br>

---
<a name = Section1></a>
# **1. Introduction**
---

- K-Means is an **unsupervised**, **iterative**, **partition-based** clustering algorithm.

- It tries to partition the dataset into **K pre-defined** distinct, **non-overlapping subgroups** (clusters).

- It **assigns** each data point to a cluster such that the within-cluster **sum of squared distances** is at a **minimum**.

<center><img src="https://iliazaitsev.me/static/images/posts/clustering.gif" width=30%></center>

- The sum of squared distances is taken over the **distance** between each **data point** and its **cluster's centroid**.

- The **less variation** we have within clusters, the **more homogeneous** (similar) the data **points are** within the same cluster.
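
The assign-then-update loop described above can be sketched in plain NumPy. This is only a minimal illustration, not the scikit-learn implementation used later in the notebook; the function name `kmeans_sketch`, the toy points, and the fixed number of iterations are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Plain K-Means iterations: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid, shape (n_points, k)
        sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it unchanged if its cluster is empty)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    # Final assignment and the sum of squared distances (inertia) for the final centroids
    sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    inertia = sq_dist.min(axis=1).sum()
    return labels, centroids, inertia

# Two well-separated toy groups (made-up values)
toy_points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                       [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
labels, centroids, inertia = kmeans_sketch(toy_points, k=2)
print(labels, inertia)
```

The returned `inertia` is the quantity K-Means tries to minimise: the lower it is, the more homogeneous the clusters are.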

---
<a name = Section2></a>
# **2. Problem Statement**
---

- It is crucial to **understand customer behaviour** and **categorise customers** based on their demography and buying behaviour.

- This is broadly one aspect of **customer segmentation**. 

- Marketers use it to **better tailor** their **marketing efforts** to **various audience subsets**.

- The subsets can be in terms of **promotional**, **marketing** and **product development** strategies.

<center><img src="https://raw.githubusercontent.com/insaid2018/Domain_Case_Studies/master/Retail/customer-segmentation3.jpg"></center>

**<h4>Scenario:</h4>**

- **Adshock** is an established company that **analyzes customer data** for other companies and **targets advertisements**.

- A local bank wants to **target** its **customers** with new promotional offers.

- They have **consulted Adshock** to help them find **customer groups** for the offers.

- As one of their data scientists, you are given this particular task.

- You are provided with a **dataset** that contains the **income** and **spending** of **anonymous customers**.

- Your task is to **provide a solution** that **segments** their **customers** so that they can be targeted for marketing in the most effective way.

---
<a name = Section3></a>
# **3. Installing & Importing Libraries**
---

<a name = Section31></a>
### **3.1 Installing Libraries**

In [None]:
!pip install -q datascience                                         # Package that is required by pandas profiling
!pip install -q pandas-profiling                                    # Library to generate basic statistics about data
!pip install -q yellowbrick                                         # Toolbox for visualizing machine learning model performance

[?25l[K     |████▊                           | 10kB 19.7MB/s eta 0:00:01[K     |█████████▍                      | 20kB 23.2MB/s eta 0:00:01[K     |██████████████                  | 30kB 18.6MB/s eta 0:00:01[K     |██████████████████▊             | 40kB 15.7MB/s eta 0:00:01[K     |███████████████████████▍        | 51kB 8.4MB/s eta 0:00:01[K     |████████████████████████████    | 61kB 9.7MB/s eta 0:00:01[K     |████████████████████████████████| 71kB 4.6MB/s 
[?25h  Building wheel for folium (setup.py) ... [?25l[?25hdone


<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the upgraded versions are loaded.

- After restarting the runtime, do not execute the installation cell above (3.1) or the upgrade cell below (3.2) again.

In [None]:
!pip install -q --upgrade pandas-profiling
!pip install -q --upgrade yellowbrick

<a name = Section33></a>
### **3.3 Importing Libraries**

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis) 
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high      
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
pd.set_option('display.float_format', lambda x: '%.2f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (for Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.cluster import KMeans                                  # To instantiate KMeans model
from sklearn.preprocessing import StandardScaler                    # To import a standard scaler for scaling the features
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warnings to control runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---

- The dataset consists of **Annual Income** (in **`$`**) of approximately **300 customers** and their **Annual Spend** (in **`$`**) for a period of one year.

<br>

| Records | Features | Dataset Size |
| :-- | :-- | :-- |
| 303 | 2 | 2.37 KB| 

<br>

| Id | Features | Description |
| :-- | :--| :--| 
|01| **INCOME** | Annual Income (in thousand `$`). |
|02| **SPEND** | Annual Spend (in thousand `$`). |


In [None]:
data = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/insaid2018/Term-3/master/Data/Assignment/CLV.csv')
print('Data Shape:', data.shape)
data.head()

<a name = Section41></a>
### **4.1 Data Description**

- In this section we will look at the **descriptive statistics of the data** and note some observations.

In [None]:
data.describe()

<a name = Section42></a>
### **4.2 Data Information**

- In this section we will see the **information about the types of features**.

In [None]:
data.info()

<a name = Section5></a>

---
# **5. Data Pre-Processing**
---

<a name = Section51></a>
### **5.1 Pre-Profiling Report**

- For **quick analysis**, pandas profiling is very handy.

- It generates profile reports from a pandas DataFrame.

- For each column, **statistics** are presented in an interactive HTML report.

In [None]:
# profile = ProfileReport(df = data)
# profile.to_file(output_file = 'Pre Profiling Report.html')
# print('Accomplished!')



**Performing Operations**


---
**<h4>Question 1:** Create a function that removes duplicate rows from the dataset.</h4>

---

<details>

**<summary>Hint:</summary>**

- You can use the `.drop_duplicates()` method to remove the duplicates.

</details>


In [None]:
def duplicate_removal(data=None):
  # Write your code here...

In [None]:
duplicate_removal(data=data)
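
One possible way to approach Question 1 is sketched below. It assumes the function should return the de-duplicated frame (so the caller would assign the result back, e.g. `data = duplicate_removal(data=data)`); the toy frame and its values are made up and only stand in for the real dataset.

```python
import pandas as pd

def duplicate_removal(data=None):
    """Return a copy of `data` without duplicate rows and report how many were dropped."""
    rows_before = data.shape[0]
    # drop_duplicates keeps the first occurrence of each repeated row
    deduplicated = data.drop_duplicates().reset_index(drop=True)
    print('Duplicate rows removed:', rows_before - deduplicated.shape[0])
    return deduplicated

# Toy frame with one repeated row (values are made up)
toy = pd.DataFrame({'INCOME': [233, 250, 233], 'SPEND': [150, 187, 150]})
toy = duplicate_removal(data=toy)
print(toy.shape)
```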

<a name = Section6></a>

---
# **6. Exploratory Data Analysis**
---


---
**<h4>Question 2:** Create a function that checks the distribution of the INCOME feature.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a 15x7 inches figure.

- Plot a box using the `sns.boxplot` method for the 'INCOME' feature.

- Add additional cosmetics like `grid` and `title`.

- Set `fontsize` for labels as 14, for ticks as 12, and title as 16.

- Use `plt.show()` to properly display the plot.

</details>


In [None]:
def income_plot(data=None, column=None):
  # Write your code here...

In [None]:
income_plot(data=data, column='INCOME')
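
The hint for Question 2 could translate into something like the sketch below. The title text and the toy values are assumptions, and returning the figure is a choice made here only so the result can be inspected afterwards.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def income_plot(data=None, column=None):
    """Draw a box plot of one numeric column on a 15x7 inch figure."""
    fig = plt.figure(figsize=(15, 7))
    sns.boxplot(x=data[column])
    plt.xlabel(column, fontsize=14)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.title('Distribution of ' + column, fontsize=16)
    plt.grid(True)
    plt.show()
    return fig

# Toy values standing in for the real INCOME column (made up)
toy = pd.DataFrame({'INCOME': [120, 150, 180, 200, 230, 260, 400]})
fig = income_plot(data=toy, column='INCOME')
```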


---
**<h4>Question 3:** Create a function that checks the distribution of the SPEND feature.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a 15x7 inches figure.

- Plot a histogram using the `sns.histplot` method for the 'SPEND' feature and keep `kde=True`.

- Add additional cosmetics like `grid` and `title`.

- Set `fontsize` for labels as 14, for ticks as 12, and title as 16.

- Use `plt.show()` to properly display the plot.

</details>


In [None]:
def spend_plot(data=None, column=None):
  # Write your code here...

In [None]:
spend_plot(data=data, column='SPEND')


---
**<h4>Question 4:** Create a function that compares the INCOME and SPEND features.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a 15x10 inches figure.

- Plot a scatter plot using the `sns.scatterplot` method for the 'INCOME' and 'SPEND' features.

- Add additional cosmetics like `grid` and `title`.

- Set `fontsize` for labels as 14, for ticks as 12, and title as 16.

- Use `plt.show()` to properly display the plot.

</details>


In [None]:
def compare_plot(data=None, column1=None, column2=None):
  # Write your code here...

In [None]:
compare_plot(data=data, column1='INCOME', column2='SPEND')

<a name = Section7></a>

---
# **7. Model Development & Evaluation**
---

- In this section we will apply K-Means clustering to our customer data.

- We will find the best number of clusters using the elbow plot analysis.


---
**<h4>Question 5:** Create a function that initializes a KMeans clustering model.</h4>

---

<details>

**<summary>Hint:</summary>**

- Use `n_clusters=k` and `random_state=42` for initialization. The `n_jobs` parameter was removed from `KMeans` in scikit-learn 1.0, so pass `n_jobs=-1` only on older versions.

- Add a `k` parameter to the function and pass its value to `n_clusters`.

</details>


In [None]:
def model_initialize(k=None):
  # Write your code here...

In [None]:
k_means = model_initialize(k=2)
k_means
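
A minimal sketch for Question 5 is shown below. It assumes a recent scikit-learn release, where `KMeans` no longer accepts `n_jobs`, so that argument is left out; setting `n_init=10` explicitly is an additional assumption made here because its default has changed between releases.

```python
from sklearn.cluster import KMeans

def model_initialize(k=None):
    """Return an unfitted KMeans model with k clusters and a fixed random seed."""
    # n_init is set explicitly (assumption) so the behaviour does not depend on the installed version
    return KMeans(n_clusters=k, random_state=42, n_init=10)

k_means = model_initialize(k=2)
print(k_means.get_params()['n_clusters'])
```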


---
**<h4>Question 6:** Create a function that applies KMeans clustering on the given data.</h4>

---

<details>

**<summary>Hint:</summary>**

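- Use the `.fit()` method of the model on the `INCOME` and `SPEND` features and return the fitted model.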

</details>


In [None]:
def form_clusters(k_model=None):
  # Write your code here...

In [None]:
k_model=form_clusters(k_model=k_means)


---
**<h4>Question 7:** Create a function that extracts the cluster centers and the cluster labels for each observation in the data.</h4>

---

<details>

**<summary>Hint:</summary>**

- Use `.cluster_centers_` to extract the cluster centers.

- Use `k_model.labels_` and merge it with the dataset as a `'Labels'` column.

</details>


In [None]:
def center_n_labels(k_model=None, data=None):
  # Write your code here...

In [None]:
data, centers = center_n_labels(k_model=k_model, data=data)
data.head()
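
Questions 6 and 7 could be approached roughly as follows. Note that this sketch passes the data to `form_clusters` explicitly, which differs from the one-argument signature in the cell above and is an assumption made to keep the example self-contained; the toy values are made up.

```python
import pandas as pd
from sklearn.cluster import KMeans

def form_clusters(k_model=None, data=None):
    """Fit the model on the two numeric features and return the fitted model."""
    return k_model.fit(data[['INCOME', 'SPEND']])

def center_n_labels(k_model=None, data=None):
    """Attach the cluster label of each row as a 'Labels' column and return the centers."""
    labelled = data.copy()
    labelled['Labels'] = k_model.labels_
    return labelled, k_model.cluster_centers_

# Two clearly separated toy groups (made-up values)
toy = pd.DataFrame({'INCOME': [30, 32, 35, 90, 95, 98],
                    'SPEND':  [10, 12, 11, 60, 65, 62]})
k_model = form_clusters(k_model=KMeans(n_clusters=2, random_state=42, n_init=10), data=toy)
toy, centers = center_n_labels(k_model=k_model, data=toy)
print(toy)
print(centers)
```

Selecting the feature columns by name inside `form_clusters` keeps a later `Labels` column out of the fit if the function is called again on labelled data.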


---
**<h4>Question 8:** Create a function that plots the clusters along with the cluster centers.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a 15x10 inches figure.

- Plot a scatter plot using the `sns.scatterplot` method for the 'INCOME' and 'SPEND' features and use 'Labels' as hue.

- Plot the cluster centers using plt.plot() and assign a unique marker to them (say 'P' or '+' or '*').

- Add additional cosmetics like `grid` and `title`.

- Set `fontsize` for labels as 14, for ticks as 12, and title as 16.

- Use `plt.show()` to properly display the plot.

</details>


In [None]:
def cluster_plot():
  # Write your code here...

In [None]:
cluster_plot()
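
The plotting steps in the hint for Question 8 might look like the sketch below. Unlike the argument-free call above, this version takes the labelled data and the centers as parameters, which is an assumption made so the example runs on its own; the toy values, marker choice, and title are also illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def cluster_plot(data=None, centers=None):
    """Scatter INCOME against SPEND coloured by cluster label and mark the cluster centers."""
    fig = plt.figure(figsize=(15, 10))
    sns.scatterplot(data=data, x='INCOME', y='SPEND', hue='Labels')
    # Centers are drawn as black stars on top of the points
    plt.plot(centers[:, 0], centers[:, 1], 'k*', markersize=18, label='Centers')
    plt.xlabel('INCOME', fontsize=14)
    plt.ylabel('SPEND', fontsize=14)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.title('Customer Clusters', fontsize=16)
    plt.legend()
    plt.grid(True)
    plt.show()
    return fig

# Made-up labelled data and centers, for illustration only
toy = pd.DataFrame({'INCOME': [30, 32, 35, 90, 95, 98],
                    'SPEND':  [10, 12, 11, 60, 65, 62],
                    'Labels': [0, 0, 0, 1, 1, 1]})
toy_centers = np.array([[32.33, 11.0], [94.33, 62.33]])
fig = cluster_plot(data=toy, centers=toy_centers)
```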


---
**<h4>Question 9:** Create a function that runs K-Means for different values of K and calculates the inertia for each value of K.</h4>

---

<details>

**<summary>Hint:</summary>**

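- Fit a KMeans model for each value of K in a chosen range and collect the `.inertia_` attribute of each fitted model in a list.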

</details>


In [None]:
def k_means_iterations(data=None):
  # Write your code here...

In [None]:
inertia_vals = k_means_iterations(data=data)
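
A possible shape for Question 9 is sketched below. The `k_values` parameter and its default range are assumptions (the notebook does not state which values of K to try), and the toy data is made up.

```python
import pandas as pd
from sklearn.cluster import KMeans

def k_means_iterations(data=None, k_values=range(1, 6)):
    """Fit one KMeans model per value of K and collect the inertia of each fit."""
    inertia_vals = []
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=42, n_init=10)
        model.fit(data[['INCOME', 'SPEND']])
        # inertia_ is the sum of squared distances of samples to their closest center
        inertia_vals.append(model.inertia_)
    return inertia_vals

# Toy data with two obvious groups (made-up values)
toy = pd.DataFrame({'INCOME': [30, 32, 35, 90, 95, 98],
                    'SPEND':  [10, 12, 11, 60, 65, 62]})
inertia_vals = k_means_iterations(data=toy)
print(inertia_vals)
```

Inertia generally falls as K grows, so the value itself is less informative than the point where the decrease flattens out.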


---
**<h4>Question 10:** Create a function that plots an elbow plot to check the optimal value of K with respect to inertia of clusters.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a 15x7 inches figure.

- Use plt.plot() method to plot values of k and the corresponding inertia values.

- Add additional cosmetics like `grid` and `title`.

- Set `fontsize` for labels as 14, for ticks as 12, and title as 16.

- Use `plt.show()` to properly display the plot.

</details>


In [None]:
def plot_inertia(inertia_vals=None):
  # Write your code here...

In [None]:
plot_inertia(inertia_vals=inertia_vals)
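
An elbow plot along the lines of the hint could be drawn as in the sketch below. The function name `plot_elbow`, the extra `k_values` parameter, and the inertia numbers are all hypothetical and used for illustration only.

```python
import matplotlib.pyplot as plt

def plot_elbow(inertia_vals=None, k_values=None):
    """Plot inertia against the number of clusters on a 15x7 inch figure."""
    fig = plt.figure(figsize=(15, 7))
    plt.plot(list(k_values), inertia_vals, marker='o')
    plt.xlabel('Number of clusters (K)', fontsize=14)
    plt.ylabel('Inertia', fontsize=14)
    plt.xticks(list(k_values), fontsize=12)
    plt.yticks(fontsize=12)
    plt.title('Elbow Plot', fontsize=16)
    plt.grid(True)
    plt.show()
    return fig

# Made-up inertia values; the bend after the third point is where an "elbow" would be read off
fig = plot_elbow(inertia_vals=[1000.0, 400.0, 150.0, 120.0, 100.0], k_values=range(1, 6))
```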


---
**<h4>Question 11:** Use the previously written functions to form clusters on the given data with the best number of clusters (selected from the elbow plot).</h4>

---

<details>

**<summary>Hint:</summary>**

- Use model_initialize(), form_clusters(), center_n_labels(), and cluster_plot() functions with k=best_number_of_clusters

</details>


In [None]:
k_final = model_initialize(k=5)
k_final = form_clusters(k_model=k_final)
data, centers = center_n_labels(k_model=k_final, data=data)
cluster_plot()

<a name = Section8></a>

---
# **8. Conclusion**
---

- We have observed that **spending habits** **don't relate** to a **customer's income**.

- Using the above generated model, we have formed **five different customer segments**.

- The segments take into account the customers' **income** and **spending** as recorded by the bank.

- **High income**, **high expense** groups can be **targeted** with some **premium plans**.

- There is a **low probability** that targeting offers to groups with **low income**, **low spending** will be effective.

- **Targeting offers** to **low income**, **high spending** groups could be **arbitrary** and should be **explored** further.