# Customer segmentation using clustering

## Objectives

* Extract summary-level insight from a given customer dataset.

* Handle missing data and identify the underlying pattern or structure of the data.

* Create an unsupervised model that finds the optimal number of segments for the customer base.

* Identify customer segments based on overall buying behaviour.


## Dataset

The dataset chosen for this project is the Online Retail dataset. It is a transnational dataset containing all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer.

The dataset contains 541909 records, and each record is made up of 8 fields.

Download the dataset: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Online+Retail)

## Problem Statement

Perform customer segmentation for an online retailer using an unsupervised clustering technique.

### Import Required packages

In [None]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from mpl_toolkits import mplot3d
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.metrics import silhouette_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn import metrics              
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree 

## Data Wrangling

Download the data

## Load the data

In [None]:
df_train = pd.read_csv('Online_Retail_Train.csv')

In [None]:
df_train.shape

## Data Pre-processing

Explore the dataset by performing the following operations:

* There is a lot of redundant data. Identify such data and take appropriate action.

* Most invoices are normal transactions with positive quantities and prices, but some invoice numbers are prefixed with "C" or "A", which denote other transaction types: "C" marks a cancelled order and "A" marks an adjustment. Identify such data and take appropriate action.

* Handle the null values by dropping them or filling them with an appropriate value (for example, the mean).


* Some transactions, identified by the `StockCode` variable, are not actual products but costs or fees such as postage, bank charges, or other transactions. Find such data and handle it accordingly.

* Identify the outliers in `UnitPrice` and `Quantity` and handle them accordingly.

* Create a DayOfWeek column using `InvoiceDate`.

In [None]:
df_train.head()

In [None]:
df_train.info()

In [None]:
# Rows that exactly duplicate an earlier row
duplicate_rows = df_train[df_train.duplicated()]
len(duplicate_rows)

In [None]:
df_train.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=False)
df_train.shape

In [None]:
df_train.isnull().sum()

In [None]:
df_train.dropna(inplace=True)

In [None]:
df_train.shape

In [None]:
df_train.isnull().sum()

In [None]:
dfC = df_train[df_train['InvoiceNo'].str.startswith('C')]
dfC

In [None]:
dfA = df_train[df_train['InvoiceNo'].str.startswith('A')]
dfA

In [None]:
df_train.drop(df_train[df_train['InvoiceNo'].str.startswith('C')].index, inplace=True)

In [None]:
df_train[df_train['InvoiceNo'].str.startswith('C')]

In [None]:
df_train[df_train['Quantity']<0]

In [None]:
#The transaction with 'POST' 'PADS' 'M' 'DOT' 'C2' 'BANK CHARGES' 'CRUK' 'AMAZONFEE' 'gift_0001_10'   as their StockCodes are considered as irrelevant transactions.

In [None]:
print(df_train['StockCode'].unique())

In [None]:
df_train[df_train['StockCode'].str[0].str.isalpha()]

In [None]:
df_train.drop(df_train[df_train['StockCode'].str[0].str.isalpha()].index, inplace=True)

In [None]:
df_train[df_train['StockCode'].str[0].str.isalpha()]

In [None]:
plt.figure(figsize=(5,5))
plt.scatter(df_train['UnitPrice'], df_train['Quantity'])

In [None]:
print(df_train.shape)
df_train.drop(df_train[(df_train['Quantity']>3500) | (df_train['UnitPrice']>200)].index, inplace=True)
df_train.shape
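
The cut-offs used above (`Quantity` > 3500, `UnitPrice` > 200) were read off the scatter plot. As a hedged alternative, the sketch below derives bounds from the interquartile range instead. The helper `iqr_bounds`, the multiplier `k=1.5`, and the toy values are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule (hypothetical helper)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values standing in for df_train['Quantity']; they are made up
toy = pd.Series([1, 2, 2, 3, 4, 5, 6, 8, 12, 5000])
lower, upper = iqr_bounds(toy)

# Keep only the values that fall inside the fences
kept = toy[toy.between(lower, upper)]
print(lower, upper, len(kept))
```

On heavily skewed retail data the 1.5 × IQR rule can be aggressive, so a larger `k` or a manual threshold may still be the better choice.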

In [None]:
plt.figure(figsize=(5,5))
plt.scatter(df_train['UnitPrice'], df_train['Quantity'])

In [None]:
df_train['InvoiceDate'] = pd.to_datetime(df_train['InvoiceDate'])
df_train['InvoiceDate'].dtype

In [None]:
df_train['DayOfWeek'] = df_train['InvoiceDate'].dt.day_name()

In [None]:
df_train.head()

## Understanding new insights from the data

1.  Are there any free items in the data? How many are there?

2.  Find the number of transactions per country and visualize using an appropriate plot

3.  What is the ratio of customers who are repeat purchasers vs single-time purchasers? Visualize using an appropriate plot.

4. Plot a heatmap showing unit price per month and day of the week.

5. Find the top 10 customers who bought the most items, and the top 10 items bought by the most customers.

In [None]:
df_train[df_train['UnitPrice']==0].shape

In [None]:
df_train['Country'].value_counts()

In [None]:
plt.figure(figsize=(10,10))
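country_counts = df_train['Country'].value_counts()
# Take labels from the value_counts() index so each bar matches its country
plt.barh(country_counts.index, country_counts.values)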
plt.xscale('log')
plt.xlabel('Transactions')
plt.title('Transactions per country')
plt.show()

In [None]:
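# Count distinct invoices per customer; more than one invoice means a repeat purchaser
invoices_per_cust = df_train.groupby('CustomerID')['InvoiceNo'].nunique()
repeat_cust = int((invoices_per_cust > 1).sum())
# Derive the total from the data instead of hard-coding the customer count
single_cust = len(invoices_per_cust) - repeat_cust
repeat_cust, single_cust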

In [None]:
from fractions import Fraction
# Build the ratio from the integer counts so it is exact
repeat_single_ratio = Fraction(repeat_cust, single_cust)
print(repeat_single_ratio)

In [None]:
x_labels = ['Repeat Purchasers', 'Single-Time Purchasers']
heights = [repeat_cust, single_cust]
plt.figure(figsize=(7,7))
plt.pie(heights, autopct='%1.1f%%', labels=x_labels)
plt.show()

In [None]:
# Holds the month number (1-12), despite the column name
df_train['MonthName'] = df_train['InvoiceDate'].dt.month

In [None]:
df_train.head()

In [None]:
#heatmap
# Create a pivot table with MonthName as index and DayOfWeek as columns
table = pd.pivot_table(df_train, values='UnitPrice', index='MonthName', columns='DayOfWeek', aggfunc='sum')

# Plot the heatmap using seaborn
plt.figure(figsize = (10,7))
sns.heatmap(table, cmap='cool')
plt.title('Unit price per month and day of the week', fontsize = 15)
plt.show()

In [None]:
# Top 10 customers by total quantity purchased
groupby_cust = df_train.groupby('CustomerID')['Quantity'].sum().sort_values(ascending=False)
groupby_cust.head(10)

In [None]:
# Top 10 items by number of distinct customers who bought them
groupby_item = df_train.groupby('StockCode')['CustomerID'].nunique().sort_values(ascending=False)
groupby_item.head(10)

## Feature Engineering and Transformation

### Create new features to uncover better insights and drop the unwanted columns

* Create a new column which represents Total amount spent by each customer

* Customer IDs are seen to be repeated. Maintain unique customer IDs by grouping and summing up all possible observations per customer.

In [None]:
df_train['Total Amount'] = df_train['Quantity']*df_train['UnitPrice']
df_train.head()

In [None]:
import datetime
snapshot_date = datetime.datetime(2011, 12, 10)
df = df_train.groupby(['CustomerID'],as_index=False).agg({
'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
'InvoiceNo': 'count',
'Total Amount': 'sum'}).rename(columns = {'InvoiceDate': 'Recency', 'InvoiceNo': 'Frequency','Total Amount': 'MonetaryValue'})

In [None]:
df.head()

### Scale the data
 
Apply `StandardScaler` on the features.

In [None]:
plt.figure(figsize=(12,10))
plt.subplot(3, 1, 1); sns.histplot(df['Recency'])
plt.subplot(3, 1, 2); sns.histplot(df['Frequency'])
plt.subplot(3, 1, 3); sns.histplot(df['MonetaryValue'])

In [None]:
df.describe()

In [None]:
# log10(0) is -inf, so drop customers with a Recency of 0 days before the log transform
df.drop(df[df['Recency']==0].index, inplace=True)

In [None]:
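# Log-transform only the RFM features; CustomerID is an identifier, not a quantity
df1 = np.log10(df[['Recency', 'Frequency', 'MonetaryValue']])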
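
The notebook drops customers with `Recency == 0` before taking `log10`, presumably because the logarithm of zero is not finite. A possible alternative, sketched below on made-up values, is `np.log1p`, which computes log(1 + x) and so maps 0 to 0 without discarding rows. The toy frame is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Toy RFM-like values (made up); note the zero in Recency
toy = pd.DataFrame({
    "Recency": [0, 1, 9, 99],
    "Frequency": [1, 10, 100, 1000],
})

# log1p(x) = ln(1 + x), which is finite for x = 0
logged = np.log1p(toy)
print(logged)
```

This uses the natural logarithm rather than base 10, which only rescales the values by a constant factor and has no effect after standardization.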

In [None]:
plt.figure(figsize=(12,10))
plt.subplot(3, 1, 1); sns.histplot(df1['Recency'])
plt.subplot(3, 1, 2); sns.histplot(df1['Frequency'])
plt.subplot(3, 1, 3); sns.histplot(df1['MonetaryValue'])

In [None]:
df1.describe()

In [None]:
df1.shape

In [None]:
# Standardize the log-transformed features to zero mean and unit variance
X = StandardScaler().fit_transform(df1[['Recency', 'Frequency', 'MonetaryValue']])

In [None]:
X

## Clustering

### Apply k-means algorithm to identify a specific number of clusters


* Fit the k-means model

* Extract and store the cluster centroids

In [None]:
kmeans3 = KMeans(n_clusters=3, random_state=1, n_init=10)
y_predict = kmeans3.fit_predict(X)
plt.scatter(X[:,1], X[:,2], c=y_predict, cmap = 'summer')
plt.show()

In [None]:
kmeans3.inertia_
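
The task list asks to extract and store the cluster centroids. A minimal sketch of that step follows, run on synthetic data from `make_blobs` rather than the retail features. The names `raw`, `X_demo`, and `centroids_raw` are assumptions for illustration. `cluster_centers_` is in scaled units, so `inverse_transform` is used to map the centroids back to the original feature scale.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three RFM features
raw, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=1)

# Keep the fitted scaler so centroids can be mapped back later
scaler = StandardScaler().fit(raw)
X_demo = scaler.transform(raw)

km = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X_demo)

# One row per cluster, one column per feature (scaled units)
centroids_scaled = km.cluster_centers_
# Same centroids expressed in the original feature units
centroids_raw = scaler.inverse_transform(centroids_scaled)
print(centroids_raw)
```

In the notebook the features are also log-transformed before scaling, so the log would have to be undone as well to read the centroids in days, counts, and currency.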

#### Find the optimal number of clusters (K) by using the Elbow method.

Use the optimal no. of clusters and store the cluster centroids

In [None]:
# Elbow method: fit k-means for k = 1..9 and record the inertia for each k
clusters = np.arange(1,10)
inertia = []
for c in clusters:
    kmeans = KMeans(n_clusters = c, random_state=1, n_init=10)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
plt.plot(clusters, inertia, marker= '.')
plt.title('Inertia Plot')
plt.xlabel("$k$")
plt.ylabel('Inertia')
plt.show()

In [None]:
kmeans4 = KMeans(n_clusters=4, random_state=1, n_init=10)
y_predict = kmeans4.fit_predict(X)
plt.scatter(X[:,1], X[:,2], c=y_predict, cmap = 'summer')
plt.show()

In [None]:
kmeans4.inertia_

In [None]:
kmeans5 = KMeans(n_clusters=5, random_state=1, n_init=10)
y_predict = kmeans5.fit_predict(X)
kmeans5.inertia_

In [None]:
clusters = np.arange(2,10)
sil_score = []
for c in clusters:
    kmeans = KMeans(n_clusters = c, random_state=1, n_init=10)
    kmeans.fit(X)
    sil_score.append(silhouette_score(X, kmeans.labels_))
plt.plot(clusters, sil_score, marker= '.')
plt.title('Silhouette score plot')
plt.xlabel("$k$")
plt.ylabel("Silhouette score")
plt.show()

In [None]:
clusters = [2, 3, 4, 5]
for c in clusters:
    plt.figure(figsize=(6, 4))
    kmeans = KMeans(c, random_state=1, n_init=10)
    visualizer = SilhouetteVisualizer(kmeans, colors='yellowbrick')
    visualizer.fit(X)        
    plt.title("k={}".format(c))
    plt.xlabel("Silhouette score")
    plt.ylabel('Number of Instances')
    plt.show()

### Apply DBSCAN algorithm for clustering

- Compare the results of clusters from k-means and DBSCAN


In [None]:
dbscan = DBSCAN(eps=0.5, min_samples=10)
dbscan.fit(X)
print("Unique clusters in data: ", np.unique(dbscan.labels_))
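
One way to compare the k-means and DBSCAN results is sketched below on synthetic data. It reports how many clusters DBSCAN found, how many points it marked as noise (label `-1`), and the adjusted Rand index between the two labelings. The data, the cluster count of 3, and the variable names are assumptions for illustration; the `eps` and `min_samples` values mirror the cell above.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled RFM matrix
X_demo, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.6, random_state=1)
X_demo = StandardScaler().fit_transform(X_demo)

km_labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(X_demo)
db_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X_demo)

# DBSCAN labels noise points as -1, which is not a cluster
n_db_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = int(np.sum(db_labels == -1))

# Agreement between the two labelings (1.0 means identical partitions)
ari = adjusted_rand_score(km_labels, db_labels)
print(n_db_clusters, n_noise, round(ari, 3))
```

Unlike k-means, DBSCAN does not take the number of clusters as input, so a very different cluster count usually says more about `eps` and `min_samples` than about the data.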

### Analyze the clusters


- consider two features and visualize the clusters with different colors using the predicted cluster centers.

- consider three features and visualize the clusters with different colors using the predicted cluster centers.

In [None]:
plt.scatter(X[:,1], X[:,2], c=y_predict, cmap = 'summer')
plt.show()

In [None]:
fig = plt.figure(figsize = (10, 10))
ax = plt.axes(projection ="3d")
ax.scatter3D(X[:,0], X[:,1], X[:,2], c=y_predict, cmap = 'Accent')
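
The plots above colour the points by cluster but do not show the cluster centers themselves. A minimal sketch of overlaying them is given below, using two synthetic features and four clusters. The data and variable names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-feature stand-in for two of the scaled RFM columns
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=1)

km = KMeans(n_clusters=4, random_state=1, n_init=10).fit(X_demo)
centers = km.cluster_centers_

fig, ax = plt.subplots(figsize=(6, 6))
# Points coloured by their assigned cluster
ax.scatter(X_demo[:, 0], X_demo[:, 1], c=km.labels_, cmap='summer', s=15)
# Cluster centers drawn on top as large red crosses
ax.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200)
ax.set_xlabel('feature 0')
ax.set_ylabel('feature 1')
```

The same idea extends to the 3D plot by passing a third coordinate of `cluster_centers_` to `scatter3D`.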

### Train a supervised algorithm on clustered data

This will allow us to predict cluster numbers (label) for each test data instance

* Create labelled data with k-means cluster labels
* Split the data into train and validation sets
* Train a supervised algorithm on the train data
* Find the accuracy of the model using validation data

In [None]:
from sklearn.metrics import pairwise_distances_argmin_min

In [None]:
'''km = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
closest'''

In [None]:
'''print('Closest to cluster 1: ',X[160])
print('Closest to cluster 2: ',X[3252])
print('Closest to cluster 3: ',X[940])
print('Closest to cluster 4: ',X[2894])'''

In [None]:
'''print("labels: ", km.labels_)
mydict = {i: np.where(km.labels_ == i)[0] for i in range(km.n_clusters)}
print('data points in centroid 1:',len(mydict[0]))
print('data points in centroid 2:',len(mydict[1]))
print('data points in centroid 3:',len(mydict[2]))
print('data points in centroid 4:',len(mydict[3]))'''

In [None]:
'''target = np.empty(len(X), dtype=np.int32)
#labels = ['Lost Customers', 'Best Customers', 'At Risk Customers', 'New Customers']
#labels = [0, 1, 2 ,3]
for i in range(4):
  temp = np.where(km.labels_ == i)[0]
  for j in temp:
    target[j] = i

target'''

In [None]:
#np.unique(target)

In [None]:
#y = target

In [None]:
kmeans4 = KMeans(n_clusters = 4, n_init = 'auto', random_state = 1)
kmeans4.fit(X)
y_kmeans4 = kmeans4.predict(X)

In [None]:
df1['Cluster'] = kmeans4.labels_
df1.head()

In [None]:
X = StandardScaler().fit_transform(df1[['Recency', 'Frequency', 'MonetaryValue']])
y = df1['Cluster']

In [None]:
X_train, X_validate, y_train, y_validate = train_test_split(X, y, test_size = 0.25, random_state=123)

In [None]:
ADB = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                         n_estimators=125,
                         learning_rate = 0.8,
                         random_state=42)

ADB.fit(X_train, y_train)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

In [None]:
n_scores = cross_val_score(ADB, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f%%' % (np.mean(n_scores)*100))

In [None]:
labels = ADB.predict(X_validate)
matrix = metrics.confusion_matrix(y_validate, labels)
# creating a heat map to visualize confusion matrix
sns.heatmap(matrix.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');

In [None]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg.score(X_validate, y_validate)

In [None]:
kmeans5 = KMeans(n_clusters = 5, n_init = 'auto', random_state = 1)
kmeans5.fit(X)
y_kmeans5 = kmeans5.predict(X)

In [None]:
df1['Cluster'] = kmeans5.labels_
df1.head()

In [None]:
X = StandardScaler().fit_transform(df1[['Recency', 'Frequency', 'MonetaryValue']])
y = df1['Cluster']
X_train, X_validate, y_train, y_validate = train_test_split(X, y, test_size = 0.25, random_state=123)

In [None]:
ADB = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                         n_estimators=125,
                         learning_rate = 0.8,
                         random_state=42)

ADB.fit(X_train, y_train)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(ADB, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f%%' % (np.mean(n_scores)*100))

In [None]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg.score(X_validate, y_validate)

In [None]:
kmeans3 = KMeans(n_clusters = 3, n_init = 'auto', random_state = 1)
kmeans3.fit(X)
y_kmeans3 = kmeans3.predict(X)

In [None]:
df1['Cluster'] = kmeans3.labels_
df1.head()

In [None]:
X = StandardScaler().fit_transform(df1[['Recency', 'Frequency', 'MonetaryValue']])
y = df1['Cluster']
X_train, X_validate, y_train, y_validate = train_test_split(X, y, test_size = 0.25, random_state=123)

In [None]:
ADB = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                         n_estimators=125,
                         learning_rate = 0.8,
                         random_state=42)

ADB.fit(X_train, y_train)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(ADB, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f%%' % (np.mean(n_scores)*100))

In [None]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg.score(X_validate, y_validate)

### Evaluation of Test Data
* Use the model to predict the labels for the Test data below
* Format the test data in the same format as the train data.
* Predict it with trained supervised ML model

In [None]:
# Test set provided as below
test = pd.read_csv("Online_Retail_Test.csv")
test.head(3)

In [None]:
test.info()

In [None]:
test.isnull().sum()

In [None]:
test.dropna(inplace=True)

In [None]:
test.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=False)
test.shape

In [None]:
test.drop(test[test['InvoiceNo'].str.startswith('C')].index, inplace=True)

In [None]:
test['Total Amount'] = test['Quantity']*test['UnitPrice']
test.head()

In [None]:
test['InvoiceDate'] = pd.to_datetime(test['InvoiceDate'])
test['InvoiceDate'].dtype

In [None]:
snapshot_date = datetime.datetime(2011, 12, 10)
test_df = test.groupby(['CustomerID'],as_index=False).agg({
'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
'InvoiceNo': 'count',
'Total Amount': 'sum'}).rename(columns = {'InvoiceDate': 'Recency', 'InvoiceNo': 'Frequency','Total Amount': 'MonetaryValue'})
test_df.head()

In [None]:
plt.figure(figsize=(12,10))
plt.subplot(3, 1, 1); sns.histplot(test_df['Recency'])
plt.subplot(3, 1, 2); sns.histplot(test_df['Frequency'])
plt.subplot(3, 1, 3); sns.histplot(test_df['MonetaryValue'])

In [None]:
test_df.describe()

In [None]:
test_df.drop(test_df[test_df['Recency']==0].index, inplace=True)
test_df.describe()

In [None]:
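# Log-transform only the RFM features; CustomerID is an identifier, not a quantity
test_df1 = np.log10(test_df[['Recency', 'Frequency', 'MonetaryValue']])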

In [None]:
plt.figure(figsize=(12,10))
plt.subplot(3, 1, 1); sns.histplot(test_df1['Recency'])
plt.subplot(3, 1, 2); sns.histplot(test_df1['Frequency'])
plt.subplot(3, 1, 3); sns.histplot(test_df1['MonetaryValue'])

In [None]:
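# Fit the scaler on the training features and only transform the test features,
# so the test data is scaled with the same statistics the models were trained on
scaler = StandardScaler().fit(df1[['Recency', 'Frequency', 'MonetaryValue']])
X_test = scaler.transform(test_df1[['Recency', 'Frequency', 'MonetaryValue']])

In [None]:
y_test_predADB = ADB.predict(X_test)

In [None]:
y_test_predADB

In [None]:
np.unique(y_test_predADB)

In [None]:
y_test_predlr = log_reg.predict(X_test)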

In [None]:
y_test_predlr

In [None]:
np.unique(y_test_predlr)
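
Formatting the test data like the training data includes scaling it with the statistics learned from the training set. One way to make that automatic, sketched below on randomly generated features, is to bundle the scaler and the classifier in a scikit-learn pipeline. The data, the choice of 3 clusters, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up positive features standing in for Recency, Frequency, MonetaryValue
rng = np.random.default_rng(0)
train_log = np.log10(rng.lognormal(size=(200, 3)))
test_log = np.log10(rng.lognormal(size=(50, 3)))

# Cluster labels for the training customers, used as the supervised target
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(
    StandardScaler().fit_transform(train_log)
)

# The pipeline fits the scaler on the training data only and
# reuses those statistics whenever predict() is called
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(train_log, labels)
test_pred = clf.predict(test_log)
print(np.unique(test_pred))
```

With this arrangement there is no separate scaler object to keep track of, which removes the risk of fitting a second scaler on the test set.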