
# Advanced Certification Program in Computational Data Science

##  A program by IISc and TalentSprint

### Mini Project Notebook: Customer segmentation using clustering

## Learning Objectives

At the end of the experiment, you will be able to:

* extract summary level insight from a given customer dataset.

* handle the missing data and identify the underlying pattern or structure of the data.

* create an unsupervised model that generates the optimum number of segments for the customer base

* identify customer segments based on the overall buying behaviour


## Dataset

The dataset chosen for this mini project is the Online Retail dataset. It is a transnational dataset containing all the transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered non-store online retailer.

The dataset contains 541909 records, and each record is made up of 8 fields.

To learn more about the dataset, [click here](https://archive.ics.uci.edu/ml/datasets/Online+Retail).

## Information

**Clustering** is the task of grouping together a set of objects so that the objects in the same cluster are more similar to each other than to objects in other clusters. Similarity is a measure that reflects the strength of the relationship between two data objects.

K-Means is one of the most popular clustering algorithms. This analysis uses it to group similar data items together.

In retail and e-commerce (B2C), and more broadly in B2B, a key element shaping a firm's business strategy is understanding customer behaviour. More specifically, this means understanding customers through different business metrics: how much they spend (revenue), how often they spend (frequency), whether they are new or existing customers, what their favorite products are, and so on. Such understanding in turn helps direct marketing, sales, account management and product teams to support customers on a personalized level and improve the product offering.

Furthermore, segmenting customers into categories based on similar or cyclical buying patterns over a period of one year helps retail shops manage their inventory better, lowering costs and raising revenues by placing orders in sync with the buying cycles.

## Problem Statement

Perform customer segmentation for an online retailer using an unsupervised clustering technique.

## Grading = 10 Points

### Import Required packages

In [65]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from scipy import stats

## Data Wrangling

In [3]:
#@title Download the data
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/Online_Retail.zip
!unzip -qq Online_Retail.zip

## Load the data

In [33]:
# YOUR CODE HERE
df = pd.read_csv('Online_Retail_Train.csv', encoding = 'unicode_escape')

In [45]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,579427,22636,CHILDS BREAKFAST SET CIRCUS PARADE,2,2011-11-29 13:04:00,8.5,16479.0,United Kingdom
1,554092,21916,SET 12 RETRO WHITE CHALK STICKS,24,2011-05-22 12:41:00,0.42,17176.0,United Kingdom
2,577774,84692,BOX OF 24 COCKTAIL PARASOLS,6,2011-11-21 15:57:00,0.42,16712.0,United Kingdom
3,C571196,23350,ROLL WRAP VINTAGE SPOT,-12,2011-10-14 12:02:00,1.25,,United Kingdom
4,546649,84509a,SET OF 4 ENGLISH ROSE PLACEMATS,1,2011-03-15 14:17:00,7.46,,United Kingdom


In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514813 entries, 0 to 514812
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    514813 non-null  object 
 1   StockCode    514813 non-null  object 
 2   Description  513428 non-null  object 
 3   Quantity     514813 non-null  int64  
 4   InvoiceDate  514813 non-null  object 
 5   UnitPrice    514813 non-null  float64
 6   CustomerID   514813 non-null  object 
 7   Country      514813 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 31.4+ MB


In [34]:
# Cast CustomerID to str because it is a categorical variable (missing values become the string 'nan')
df['CustomerID'] = df['CustomerID'].astype(str)

## Data Pre-processing (2 points)

Explore the dataset by performing the following operations:

* There is a lot of redundant data. Identify such data and take appropriate action.

  **Hint:** refer to this [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)

* Most invoices appear as normal transactions with positive quantities and prices, but some invoice numbers are prefixed with "C" or "A", which denote different transaction types. An invoice number starting with C represents a cancelled order, and one starting with A represents an adjustment. Identify such data and take appropriate action.

  **Hint:** Check the negative values in Quantity column for all cancelled orders

* Handle the null values by dropping or filling with appropriate mean


* Some transactions, based on the `StockCode` variable, are not actual products but represent costs or fees related to postage, bank charges, or other transactions. Find such data and handle it accordingly.

  **Hint:**
    - Transactions with `'POST' 'PADS' 'M' 'DOT' 'C2' 'BANK CHARGES'` as their `StockCode` are considered irrelevant.

* Identify the outliers in `UnitPrice` and `Quantity` and handle them accordingly.

  **Hint:** [link](https://thecleverprogrammer.com/2023/07/26/detect-and-remove-outliers-using-python/)

* Create a `DayOfWeek` column using `InvoiceDate`.

  **Hint:** `pd.to_datetime()`

**Note:** Perform all the above operations using a function to reuse and apply the same for test data.
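
The steps above can be sketched on a small made-up frame. This is only an illustration under stated assumptions, not the expected solution: the helper names (`drop_cancelled_and_adjusted`, `add_day_of_week`, `remove_iqr_outliers`), the choice to drop cancelled/adjusted invoices, and the 1.5 × IQR cutoff are all assumptions, and the toy rows are not taken from the real dataset.

```python
import pandas as pd

def drop_cancelled_and_adjusted(data):
    """Drop rows whose InvoiceNo starts with 'C' (cancelled) or 'A' (adjusted)."""
    flagged = data["InvoiceNo"].astype(str).str[0].isin(["C", "A"])
    return data[~flagged]

def add_day_of_week(data):
    """Return a copy with a DayOfWeek column derived from InvoiceDate."""
    out = data.copy()
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    out["DayOfWeek"] = out["InvoiceDate"].dt.day_name()
    return out

def remove_iqr_outliers(data, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = data[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return data[data[column].between(q1 - k * iqr, q3 + k * iqr)]

# Toy data for illustration only
toy = pd.DataFrame({
    "InvoiceNo": ["579427", "554092", "577774", "C571196", "546649", "536365"],
    "InvoiceDate": ["2011-11-29 13:04:00", "2011-05-22 12:41:00",
                    "2011-11-21 15:57:00", "2011-10-14 12:02:00",
                    "2011-03-15 14:17:00", "2010-12-01 08:26:00"],
    "Quantity": [2, 24, 6, -12, 4, 5000],
})

cleaned = drop_cancelled_and_adjusted(toy)
cleaned = add_day_of_week(cleaned)
cleaned = remove_iqr_outliers(cleaned, "Quantity")
print(cleaned[["InvoiceNo", "Quantity", "DayOfWeek"]])
```

Because each helper returns a new frame instead of modifying its input, the same chain can be applied unchanged to the test data.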

In [49]:
df[(df['CustomerID'] == 'nan')]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
3,C571196,23350,ROLL WRAP VINTAGE SPOT,-12,2011-10-14 12:02:00,1.25,,United Kingdom
4,546649,84509a,SET OF 4 ENGLISH ROSE PLACEMATS,1,2011-03-15 14:17:00,7.46,,United Kingdom
6,564758,84951A,SET OF 4 PISTACHIO LOVEBIRD COASTER,1,2011-08-30 10:39:00,2.46,,United Kingdom
15,579777,23222,CHRISTMAS TREE HANGING GOLD,1,2011-11-30 15:13:00,1.63,,United Kingdom
24,578067,23344,JUMBO BAG 50'S CHRISTMAS,2,2011-11-22 15:43:00,4.13,,United Kingdom
...,...,...,...,...,...,...,...,...
514801,561209,21990,MODERN FLORAL STATIONERY SET,1,2011-07-25 16:57:00,2.46,,United Kingdom
514804,543660,22288,HANGING METAL RABBIT DECORATION,1,2011-02-11 10:40:00,2.46,,United Kingdom
514805,580366,23489,VINTAGE BELLS GARLAND,4,2011-12-02 16:38:00,2.88,,United Kingdom
514807,540977,22208,WOOD STAMP SET THANK YOU,2,2011-01-12 15:01:00,1.66,,United Kingdom


In [41]:
def get_duplicate_row_count(data):
    invoice_count = data.groupby(['InvoiceNo','StockCode']).size().reset_index(name='count')
    return invoice_count['count'].value_counts()

In [53]:
def getRowCountWithNoCustomerID(data):
    return (data['CustomerID'] == 'nan').sum()

In [59]:
def getRowCountWithStockCodeS(data):
    return ((data['StockCode'] == 'S') & (data['CustomerID'] != 'nan')).sum()

In [63]:
def getRowCountWithIrrelevantStockCode(data):
    return data['StockCode'].isin(['POST', 'PADS', 'M', 'DOT', 'C2', 'BANK CHARGES']).sum()

In [66]:
def calculateZscoreForQuantity(data):
    # Absolute z-score of each Quantity value; large values point to outliers
    return np.abs(stats.zscore(data['Quantity']))

In [None]:
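def data_preprocessing(data):
    # Drop exact duplicates by comparing all the fields
    data.drop_duplicates(inplace=True)
    # Drop rows with a missing CustomerID (stored as the string 'nan' after astype(str)).
    # Select the matching rows first: calling .index on the boolean mask itself
    # would return every row label and empty the frame.
    data.drop(data[data['CustomerID'] == 'nan'].index, inplace=True)
    # Drop non-product transactions (postage, bank charges, manual entries, ...)
    irrelevant_codes = ['POST', 'PADS', 'M', 'DOT', 'C2', 'BANK CHARGES']
    data.drop(data[data['StockCode'].isin(irrelevant_codes)].index, inplace=True)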



In [67]:
print(f"No. of duplicate rows with count: {get_duplicate_row_count(df)}")
print(f"No. of rows with no customer ID: {getRowCountWithNoCustomerID(df)}")
print(f"No. of rows with Sample StockCode and customer ID is not null: {getRowCountWithStockCodeS(df)}")
print(f"No. of rows with irrelevant StockCodes: {getRowCountWithIrrelevantStockCode(df)}")
print(f"Z Score of Quantity: {calculateZscoreForQuantity(df)}")

No. of duplicate rows with count: count
1     496374
2       8102
3        581
4         76
5         16
6          6
7          4
9          1
19         1
16         1
Name: count, dtype: int64
No. of rows with no customer ID: 128263
No. of rows with Sample StockCode and customer ID is not null: 0
No. of rows with irrelevant StockCodes: 2585
Z Score of Quantity: 0         0.033788
1         0.064634
2         0.015893
3         0.096419
4         0.038261
            ...   
514808    0.029314
514809    0.010949
514810    0.038261
514811    0.038261
514812    0.029314
Name: Quantity, Length: 514813, dtype: float64


In [None]:
def process_data(df):

    print("BEFORE PREPROCESSING")
    print(f"No. of duplicate rows with count: {get_duplicate_row_count(df)}")
    print(f"No. of rows with no customer ID: {getRowCountWithNoCustomerID(df)}")
    print(f"No. of rows with Sample StockCode and customer ID is not null: {getRowCountWithStockCodeS(df)}")
    print(f"No. of rows with irrelevant StockCodes: {getRowCountWithIrrelevantStockCode(df)}")

    data_preprocessing(df)

    print("\n\nAFTER PREPROCESSING")
    print(f"No. of duplicate rows with count: {get_duplicate_row_count(df)}")
    print(f"No. of rows with no customer ID: {getRowCountWithNoCustomerID(df)}")
    print(f"No. of rows with Sample StockCode and customer ID is not null: {getRowCountWithStockCodeS(df)}")
    print(f"No. of rows with irrelevant StockCodes: {getRowCountWithIrrelevantStockCode(df)}")


## Understanding new insights from the data (1 point)

1.  Are there any free items in the data? How many are there?

2.  Find the number of transactions per country and visualize using an appropriate plot

3.  What is the ratio of customers who are repeat purchasers vs single-time purchasers? Visualize using an appropriate plot.

4. Plot heatmap showing unit price per month and day of the week

  **Hint:** Month name as index on Y-axis, Day of the week on X-axis

5. Find the top 10 customers who bought the largest number of items. Also find the top 10 items bought by the largest number of customers.
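
As a hedged illustration of questions 3 and 4, the sketch below counts repeat and single-time purchasers and builds the month-by-weekday table that a heatmap would display. The toy rows are invented, and two choices are assumptions: a "repeat purchaser" is taken to be a customer with more than one distinct invoice, and unit price is aggregated with the mean.

```python
import pandas as pd

# Toy transactions for illustration only
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3", "4", "5"],
    "CustomerID": ["A", "A", "A", "B", "C", "C"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-03", "2011-01-03", "2011-02-07",
        "2011-01-04", "2011-02-08", "2011-02-14",
    ]),
    "UnitPrice": [1.0, 3.0, 2.0, 4.0, 6.0, 8.0],
})

# Repeat vs single-time purchasers: count distinct invoices per customer
invoices_per_customer = toy.groupby("CustomerID")["InvoiceNo"].nunique()
repeat = int((invoices_per_customer > 1).sum())
single = int((invoices_per_customer == 1).sum())
print(f"repeat: {repeat}, single-time: {single}")

# Mean unit price per month (rows) and day of the week (columns)
toy["Month"] = toy["InvoiceDate"].dt.month_name()
toy["DayOfWeek"] = toy["InvoiceDate"].dt.day_name()
price_table = toy.pivot_table(index="Month", columns="DayOfWeek",
                              values="UnitPrice", aggfunc="mean")
print(price_table)
```

The resulting `price_table` can be passed directly to `sns.heatmap`, and the two counts suit a bar or pie chart.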

In [None]:
# YOUR CODE HERE

## Feature Engineering and Transformation (2 points)

### Create new features to uncover better insights and drop the unwanted columns

* Create a new column which represents Total amount spent by each customer

    **Hint:** Quantity * UnitPrice

* Customer IDs are seen to be repeated. Maintain unique customer IDs by grouping and summing up all possible observations per customer.

    **Hint:** [pandas.groupby.agg](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html)

**Note:** Perform the above operations in function, to reuse and apply the same for test data
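
One possible shape for this step is sketched below on invented rows. The function name `build_customer_features` and the extra aggregates (`TotalQuantity`, `NumInvoices`) are assumptions chosen for illustration; only `TotalAmount` is asked for above.

```python
import pandas as pd

def build_customer_features(data):
    """Aggregate transaction rows into one row per customer."""
    out = data.copy()
    out["TotalAmount"] = out["Quantity"] * out["UnitPrice"]
    return out.groupby("CustomerID").agg(
        TotalAmount=("TotalAmount", "sum"),
        TotalQuantity=("Quantity", "sum"),
        NumInvoices=("InvoiceNo", "nunique"),
    ).reset_index()

# Toy data for illustration only
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3"],
    "CustomerID": ["A", "A", "A", "B"],
    "Quantity": [2, 1, 4, 10],
    "UnitPrice": [0.5, 3.0, 1.0, 2.0],
})
features = build_customer_features(toy)
print(features)
```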

In [None]:
# YOUR CODE HERE

### Scale the data

Apply `StandardScaler` on the features.
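
A minimal sketch of the scaling step on a made-up feature matrix. After `fit_transform`, each column has zero mean and unit variance, so features measured on large scales do not dominate the distance calculations used by k-means.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix for illustration only (two features on different scales)
X = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

For the test data, reuse the already fitted scaler with `scaler.transform(...)` rather than fitting a new one, so both sets share the same scale.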

In [None]:
# YOUR CODE HERE for scaling

## Clustering (5 points)

### Apply k-means algorithm to identify a specific number of clusters


* Fit the k-means model

* Extract and store the cluster centroids

The following k-means parameters are helpful:

**n_clusters** is the number of clusters to form

**init** selects how the initial centroids are chosen; `k-means++` spreads the initial centroids apart instead of picking them purely at random, which helps avoid the random initialisation trap

**max_iter** is the maximum number of iterations for a single run of k-means

**n_init** is the number of times k-means is run with different initial centroids

[why-is-k-means-slower-than-random-initialization-k-means](https://stats.stackexchange.com/questions/185396/why-is-k-means-slower-than-random-initialization-k-means/185422)
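
The parameters above map onto scikit-learn's `KMeans` as sketched here. The data is synthetic (two made-up blobs), and the specific values for `max_iter`, `n_init` and `random_state` are illustrative assumptions, not recommended settings for the retail data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data for illustration only: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

kmeans = KMeans(n_clusters=2, init="k-means++", max_iter=300,
                n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Extract and store the cluster centroids
centroids = kmeans.cluster_centers_
print(centroids)
```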

In [None]:
# YOUR CODE HERE to apply KMeans

#### Find the optimal number of clusters (K) by using the [Elbow method](https://pythonprogramminglanguage.com/kmeans-elbow-method/).

Use the optimal no. of clusters and store the cluster centroids
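
A sketch of the elbow method on synthetic data with three made-up blobs: fit k-means for a range of K and record the inertia (within-cluster sum of squared distances). The range of K and the blob layout are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data for illustration only: three well-separated blobs
rng = np.random.default_rng(0)
centers = [(0, 0), (5, 5), (10, 0)]
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in centers])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

Plotting inertia against K (for example with `plt.plot(list(inertias), list(inertias.values()), marker="o")`) makes the bend visible; here the curve flattens after K = 3, the number of blobs that were generated.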

In [None]:
# YOUR CODE HERE

### Apply DBSCAN algorithm for clustering

- Compare the results of clusters from k-means and DBSCAN
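
To illustrate one difference worth looking for in the comparison, the sketch below runs both algorithms on synthetic data containing two blobs and one far-away point. The `eps` and `min_samples` values are assumptions tuned to this toy data and will need adjusting for the scaled retail features.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Synthetic data for illustration only: two blobs plus one isolated point
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.2, (50, 2)),
    rng.normal(5, 0.2, (50, 2)),
    [[20.0, 20.0]],
])

db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN labels noise points as -1; k-means assigns every point to a cluster
n_db_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = int((db_labels == -1).sum())
print(f"DBSCAN clusters: {n_db_clusters}, noise points: {n_noise}")
print(f"k-means label of the isolated point: {km_labels[-1]}")
```

DBSCAN does not need the number of clusters in advance and can flag outliers as noise, whereas k-means forces the isolated point into one of its clusters.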


In [None]:
# YOUR CODE HERE

### Analyze the clusters


- consider two features and visualize the clusters with different colors using the predicted cluster centers.

  **Hint:** 2D plot

- consider three features and visualize the clusters with different colors using the predicted cluster centers.

  **Hint:** [3D plot](https://matplotlib.org/stable/gallery/mplot3d/scatter3d.html)
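
A possible starting point for the 2D case, on synthetic data. The helper name `plot_clusters_2d`, the colormap and the axis labels are assumptions for illustration; substitute two real feature columns and their names.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

def plot_clusters_2d(X, labels, centroids, ax=None):
    """Scatter two features colored by cluster label, with centroids overlaid."""
    if ax is None:
        _, ax = plt.subplots()
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
    ax.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="X",
               s=120, label="centroids")
    ax.set_xlabel("feature 1")
    ax.set_ylabel("feature 2")
    ax.legend()
    return ax

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

ax = plot_clusters_2d(X, kmeans.labels_, kmeans.cluster_centers_)
```

The 3D case follows the same pattern with an axes created via `plt.figure().add_subplot(projection="3d")` and a third coordinate passed to `scatter`.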

In [None]:
# YOUR CODE HERE

### Train a supervised algorithm on clustered data

This will allow us to predict cluster numbers (label) for each test data instance

* Create labelled data with k-means cluster labels
  
  **Hint**: [`kmeans.labels_`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
* Split the data into train and validation sets
* Train a supervised algorithm on the train data
* Find the accuracy of the model using validation data
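
The four steps above can be sketched as follows on synthetic data. The choice of `RandomForestClassifier`, the 80/20 split and the hyperparameters are assumptions for illustration; any supervised classifier could be used instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])

# 1. Create labelled data from the k-means cluster assignments
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
y = kmeans.labels_

# 2. Split into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 3. Train a supervised model on the train split
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# 4. Measure accuracy on the validation split
accuracy = accuracy_score(y_val, clf.predict(X_val))
print(f"validation accuracy: {accuracy:.3f}")
```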

In [None]:
# YOUR CODE HERE

### Evaluation of Test Data
* Format the test data in the same way as the train data.
* Use the trained supervised ML model to predict the cluster labels for the test data below.

In [None]:
# Test set provided as below
test = pd.read_csv("Online_Retail_Test.csv")
test.head(3)

In [None]:
# YOUR CODE HERE

### Report Analysis

- Discuss the pros and cons of removing the missing values vs replacing with the mean values
- Based on the visualization of clusters, comment on the difference in buying patterns of each cluster
- What other methods could be used to determine the optimal no. of clusters?
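
As one example relevant to the last question, the silhouette score can be computed for several values of K and the K with the highest score selected. This is a sketch on synthetic data with three made-up blobs, not an analysis of the retail dataset; the range of K is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data for illustration only: three well-separated blobs
rng = np.random.default_rng(0)
centers = [(0, 0), (5, 5), (10, 0)]
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in centers])

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"best K by silhouette score: {best_k}")
```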