Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [None]:
NAME = "Quanpu Xiao"
STUDENT_ID = "14368978"

---

## Overview
Welcome to the clustering task. In previous courses, you learned how to apply several clustering methods. Now it's time to put them to use in a real-world task. One of the most classic applications is the analysis of business data. By means of clustering, we can analyze the daily transactional data of a business to perform tasks like customer segmentation.

In the following tasks, we will analyze a database provided by a real online retailer to help them to perform a basic customer segmentation.

Online retailing refers to the sale of goods or the provision of services targeted to the needs of individuals or households via the Internet or other electronic channels. The attached document includes a database collected by one retailer for all its transactions registered for non-store online retailing in the UK between 12/12/2010 and 12/9/2011. The company primarily sells unique all-occasion gifts.

You will use what you have learned to help the retailer analyze its customers. You may have come across many of these tasks before, but this is the first time we combine them.

## Goal
We aim to separate the customers into several groups, ideally 3 or 4, so that the company can target its customers efficiently.

## The steps are broadly divided into:

1. [Step 1: Reading and Understanding the Data](#1)
2. [Step 2: Data Cleaning](#2)
3. [Step 3: Feature Engineering](#3)
   1. [Step 3.1: Feature Creation](#3)
   2. [Step 3.2: Outlier detection](#3.2)
   3. [Step 3.3: Standardization](#3.3)
4. [Step 4: Model Building](#4)
5. [Step 5: Final Analysis](#5)

**This code is adapted from the code of MANISH KUMAR.** The original is at https://www.kaggle.com/datasets/hellbuoy/online-retail-customer-clustering

<a id="1"></a> <br>
## Step 1 : Reading and Understanding Data

Import the packages and set the random seed so that the results are reproducible.

In [None]:
# Import required libraries for dataframe and visualization
# If some packages are not installed on your computer, you
# can install them by using "pip"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

# import required libraries for clustering
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

np.random.seed(42)

As the first step in the analysis, we need to load the data and do some inspections.

You can use this code to load the database.

In [None]:
# Reading the data on which analysis needs to be done
retail = pd.read_csv('OnlineRetail.csv', sep=",", encoding="ISO-8859-1", header=0)

### Task 1

Print the basic information of the data and check whether there are problems inside.

At a minimum, inspect the data with the `DataFrame.info()` method.
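As a minimal sketch of this kind of inspection, the snippet below runs `info()` and a per-column missing-value count on a small made-up frame. The frame `toy` and its values are assumptions for illustration only; they are not the retail data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration (not the retail data)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367"],
    "Description": ["MUG", None, "LAMP"],
    "CustomerID": [17850.0, np.nan, 13047.0],
})

# Prints the dtype and the non-null count of every column
toy.info()

# Counts the missing values in every column
missing = toy.isnull().sum()
print(missing)
```

Comparing the non-null counts with the total number of rows is usually the quickest way to spot columns that contain missing values.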

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<a id="2"></a> <br>
## Step 2 : Data Cleaning

### By observing the information of the data, at least three issues are found:
1. Missing values exist in "Description" and "CustomerID"
2. "CustomerID" should be string type rather than a numeric type
3. "InvoiceDate" needs to be datetime type for further usage

### Task 2
1. Delete all rows that have missing values
2. Convert the "CustomerID" into string type
3. Convert the "InvoiceDate" into datetime type with the format '%d-%m-%Y %H:%M'
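The three operations can be sketched on a small made-up frame as follows. The frame `toy` and its values are hypothetical; only the column names and the date format are taken from the task description.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration (not the retail data)
toy = pd.DataFrame({
    "CustomerID": [17850.0, np.nan, 13047.0],
    "InvoiceDate": ["01-12-2010 08:26", "01-12-2010 08:28", "09-12-2011 12:50"],
})

# Drop every row that contains at least one missing value
toy = toy.dropna()

# Convert the numeric IDs to strings (a float such as 17850.0 becomes "17850.0")
toy["CustomerID"] = toy["CustomerID"].astype(str)

# Parse the date strings with an explicit day-month-year format
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], format="%d-%m-%Y %H:%M")

print(toy)
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates.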

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(retail) == 406829

<a id="3"></a> <br>
## Step 3 : Feature Engineering
## 3.1 : Feature Creation

### Next, we deal with the features. We adopt a simple and classic approach, i.e. we calculate the following 3 factors:
- Recency: Number of days since last purchase
- Frequency: Number of transactions
- Monetary: Total amount of transactions (revenue contributed)

You can run the following code to compute the attribute "Recency".

In [None]:
# New Attribute : Recency

# Compute last transaction date to get the recency of customers
max_date = max(retail['InvoiceDate'])
retail['Recency'] = max_date - retail['InvoiceDate']

# Compute recency by grouping the same Customers together
rfm_p = retail.groupby('CustomerID')['Recency'].min()
rfm_p = rfm_p.reset_index()

# Extract number of days only
rfm_p['Recency'] = rfm_p['Recency'].dt.days
rfm_p.head()

### Task 3
Compute the other two attributes: "Monetary" and "Frequency".

The amount of one order line is Quantity times UnitPrice. Sum these amounts over all order lines with the same CustomerID to get the Monetary value, and save the result in a `df` called `rfm_m` in column 'Amount'.

Count the number of InvoiceNo entries for each CustomerID to get the Frequency, and save the result in a `df` called `rfm_f` in column 'Frequency'.
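A minimal sketch of both aggregations on a made-up frame is shown below. The names `toy`, `toy_m` and `toy_f` and all values are assumptions for illustration; the same group-by pattern should carry over to the retail data.

```python
import pandas as pd

# Hypothetical toy frame for illustration (not the retail data)
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "B"],
    "InvoiceNo": ["1", "2", "3"],
    "Quantity": [2, 1, 5],
    "UnitPrice": [3.0, 10.0, 1.5],
})

# Amount of each order line
toy["Amount"] = toy["Quantity"] * toy["UnitPrice"]

# Monetary: total amount per customer
toy_m = toy.groupby("CustomerID")["Amount"].sum().reset_index()

# Frequency: number of InvoiceNo entries per customer
toy_f = toy.groupby("CustomerID")["InvoiceNo"].count().reset_index()
toy_f.columns = ["CustomerID", "Frequency"]

print(toy_m)
print(toy_f)
```

Calling `reset_index()` turns the group key back into an ordinary 'CustomerID' column, which is what the checks in the next cell look for.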

In [None]:
rfm_m = ...
# YOUR CODE HERE
raise NotImplementedError()

rfm_f = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Check if columns exist in rfm_m and rfm_f
assert 'CustomerID' in rfm_m.columns, "Expected column name: CustomerID"
assert 'Amount' in rfm_m.columns, "Expected column name: Amount"
assert 'CustomerID' in rfm_f.columns, "Expected column name: CustomerID"
assert 'Frequency' in rfm_f.columns, "Expected column name: Frequency"


### Task 4
Merge all three attributes to form a `dataframe` called "rfm".
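One possible way to combine three per-customer tables is to chain inner merges on the shared key, as sketched below with made-up frames. The names `toy_m`, `toy_f`, `toy_p` and `toy_rfm` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical per-customer tables for illustration
toy_m = pd.DataFrame({"CustomerID": ["A", "B"], "Amount": [16.0, 7.5]})
toy_f = pd.DataFrame({"CustomerID": ["A", "B"], "Frequency": [2, 1]})
toy_p = pd.DataFrame({"CustomerID": ["A", "B"], "Recency": [3, 40]})

# Chain two inner merges on the shared key column
toy_rfm = (
    toy_m
    .merge(toy_f, on="CustomerID", how="inner")
    .merge(toy_p, on="CustomerID", how="inner")
)

print(toy_rfm)
```

Merging in this order yields the columns CustomerID, Amount, Frequency, Recency.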

In [None]:
rfm = ...
# YOUR CODE HERE
raise NotImplementedError()

<a id="3.2"></a> <br>
## 3.2 : Outlier detection

### Task 5
Plot the distribution of all three attributes in "rfm".

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

After plotting, do you think the distributions indicate any problems?

# Answer cell
#### The distributions above look abnormal because most of the data lies in a very narrow band. The cause can be traced to outliers.
#### In our distributions, a few outliers take extremely high values, which stretches the axis of the plot and squeezes the bulk of the data into a narrow band.

### Task 6
In this situation, we need to remove the extreme values before modeling. Please use the function `dataframe.quantile()` to find the 0.99 quantile (the 99th percentile) of each of the three attributes, and use it as a threshold to delete all points higher than this value.
Save the remaining data in `rfm`.
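The thresholding idea can be sketched on a single made-up column as follows. The frame `toy` is hypothetical; the real task applies the same filter to all three attributes.

```python
import pandas as pd

# Hypothetical column: 100 ordinary values plus one extreme outlier
toy = pd.DataFrame({"Amount": list(range(1, 101)) + [10_000]})

# 0.99 quantile of the column, used as the cut-off
threshold = toy["Amount"].quantile(0.99)

# Keep only the rows that are not higher than the cut-off
trimmed = toy[toy["Amount"] <= threshold]

print(threshold, len(toy), len(trimmed))
```

Here the single extreme value is removed while the ordinary values are kept.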

After that, you can plot the distribution again. This time you should be able to see the shape of the distribution.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(rfm) == 4244

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<a id="3.3"></a> <br>
### Step 3.3 : Standardization

It is extremely important to rescale the variables so that they have a comparable scale.

### Task 7
Please use standardization scaling (for example, the `StandardScaler` imported above) to transform all three attributes, and store the result in a new `dataframe` called "rfm_df_scaled".
Be sure to run the cell below so that your `dataframe` has the correct columns in the correct place.
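Standardization replaces each value $x$ with $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the column. A minimal sketch with scikit-learn's `StandardScaler` on a made-up frame follows; the names `toy` and `scaled` and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with columns on very different scales
toy = pd.DataFrame({
    "Amount": [10.0, 20.0, 30.0],
    "Frequency": [1.0, 2.0, 3.0],
    "Recency": [5.0, 50.0, 500.0],
})

# fit_transform returns a NumPy array with the same shape as the input
scaled = StandardScaler().fit_transform(toy)

# After scaling, every column has mean 0 and standard deviation 1
print(np.round(scaled.mean(axis=0), 6))
print(np.round(scaled.std(axis=0), 6))
```

Because the result is a plain array, the column names are lost and have to be assigned again afterwards.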

In [None]:
# YOUR CODE HERE
raise NotImplementedError()


In [None]:
# Run this cell
rfm_df_scaled = pd.DataFrame(rfm_df_scaled)
rfm_df_scaled.columns = ['Amount', 'Frequency', 'Recency']
rfm_df_scaled.head()

<a id="4"></a> <br>
## Step 4 : Building the Model
Now it is time to cluster the customers into groups. We use k-means and hierarchical clustering to achieve this task.
We will calculate the silhouette score for various settings and save the scores in a dictionary.
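For reference, the silhouette coefficient of a single sample $i$ is commonly defined as

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ is the mean distance from $i$ to the other points in its own cluster and $b(i)$ is the mean distance from $i$ to the points in the nearest other cluster. The value returned by `silhouette_score` is the mean of $s(i)$ over all samples. It lies between $-1$ and $1$, and higher values indicate tighter, better-separated clusters.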

### Task 8
Perform k-means clustering on the data with the number of clusters set to 3.

Then compute the silhouette score.
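The following is a minimal sketch of fitting k-means and scoring the result, using synthetic points rather than the scaled RFM data. The array `X`, the blob centers and the `n_init` value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated synthetic blobs in 2D
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit k-means with 3 clusters and get one label per sample
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

# Mean silhouette coefficient over all samples
score = silhouette_score(X, labels)
print(score)
```

Fixing `random_state` makes the centroid initialization, and therefore the score, repeatable between runs.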

In [None]:
# Initialize an empty dictionary
silhouette_scores = {}

In [None]:
silhouette_scores['Kmean3'] = 0

# YOUR CODE HERE
raise NotImplementedError()


### Task 9
Perform hierarchical clustering on the data. Use agglomerative clustering as the algorithm, with the number of clusters set to 3 and the linkage set to "ward".

Compute the Silhouette Scores.
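Agglomerative clustering with Ward linkage can be sketched in the same way on synthetic points. The array `X` and the blob centers are hypothetical and stand in for the scaled RFM data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated synthetic blobs in 2D
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Ward linkage merges, at each step, the pair of clusters whose union
# increases the within-cluster variance the least
ward = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = ward.fit_predict(X)

# Mean silhouette coefficient over all samples
score = silhouette_score(X, labels)
print(score)
```

Unlike k-means, this procedure involves no random initialization, so repeated runs on the same data give the same labels.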

In [None]:
silhouette_scores['Ward3'] = 0

# YOUR CODE HERE
raise NotImplementedError()

### Task 10
Change the number of clusters to 4 for both k-means and hierarchical clustering.

Compute the Silhouette scores for both cases.

After that, the variable "silhouette_scores" should have four values.

In [None]:
silhouette_scores['Kmean4'] = 0

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
silhouette_scores['Ward4'] = 0

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print(silhouette_scores)

### Additional Task
You can use a scatter plot (Amount vs. Frequency) to show the differences between these two methods.

The plots show that k-means produces clearer cluster boundaries.

Based on how the two algorithms work, can you explain why?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

### Task 11
Record the result of k-means with 3 clusters as a new column named "ClusterLabel" in "rfm".

In [None]:
# YOUR CODE HERE
raise NotImplementedError()


In [None]:
assert 'ClusterLabel' in rfm.columns, "Missing column"

<a id="5"></a> <br>
## Step 5 : Result Analysis

### Task 12
Plot scatter plots of both Amount vs. Recency and Amount vs. Frequency. Try to identify the differences between clusters.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Additional Task
You can use "ClusterLabel" as the x-axis to draw a box plot for each attribute. This makes the differences between clusters easier to see.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### According to the figures above, we can separate all customers into three types of groups:
- Good buyers: They are loyal supporters of the company who either bought many times or bought large amounts.
- Potential buyers: They have bought a certain amount of products and have been quite active recently. They may become loyal supporters in the future, even though the amount they have spent so far may not be large.
- Lost buyers: They haven't bought any products for a long time. In the beginning they made a few small purchases, and after that they disappeared.

### Final Task:
Determine which type of buyer matches which of the 3 clusters we obtained by eyeballing the plots.
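As a complement to eyeballing the plots, a table of per-cluster means can make the profiles easier to compare. The sketch below uses a made-up frame; `toy_rfm`, its values and its cluster labels are hypothetical and say nothing about which label corresponds to which buyer type in the real data.

```python
import pandas as pd

# Hypothetical labelled RFM table for illustration
toy_rfm = pd.DataFrame({
    "Amount": [5000.0, 4200.0, 600.0, 700.0, 150.0, 90.0],
    "Frequency": [200, 180, 40, 50, 8, 5],
    "Recency": [5, 10, 20, 15, 250, 300],
    "ClusterLabel": [0, 0, 1, 1, 2, 2],
})

# Mean of each attribute within each cluster
profile = toy_rfm.groupby("ClusterLabel")[["Amount", "Frequency", "Recency"]].mean()
print(profile)
```

In this toy table, the cluster with the largest mean Recency and the smallest mean Amount would be the natural candidate for the lost buyers.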

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Good_buyer_cluster = ...
Potential_buyer_cluster = ...
Lost_buyer_cluster = ...


#### Congratulations, the analysis is complete. 
#### After we presented the results to the company, all of its departments developed meaningful strategies based on your analysis, and the analysis helped them dramatically improve their future sales.

End