# Clustering

In this assignment, you will implement a K-Means Clustering algorithm from scratch and compare the results to existing sklearn algorithm.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Question 1.1: Write a method that determine Labels from Points and ClusterCentroids, and return a list of a label for each point

In [2]:
def FindLabelOfClosest(Points, ClusterCentroids): # determine Labels from Points and ClusterCentroids
    NumberOfClusters, NumberOfDimensions = ClusterCentroids.shape # dimensions of the initial Centroids
    Distances = np.zeros(NumberOfClusters)
    NumberOfPoints, NumberOfDimensions = Points.shape
    Labels = np.zeros(NumberOfPoints)
    for PointNumber in range(NumberOfPoints): # assign labels to all data points            
        for ClusterNumber in range(NumberOfClusters): # for each cluster
            Distances[ClusterNumber] = (np.linalg.norm(Points.iloc[PointNumber]-ClusterCentroids.iloc[ClusterNumber]))
        Labels[PointNumber] = np.argmin(Distances)
    return Labels # return the a label for each point


Question 1.2: Write a method that determine centroid of Points with the same label

In [3]:
def CalculateClusterCentroid(Points, Labels): # determine centroid of Points with the same label
    ClusterLabels = np.unique(Labels) # names of labels
    NumberOfPoints, NumberOfDimensions = Points.shape
    ClusterCentroids = pd.DataFrame(0, index=np.arange(len(ClusterLabels)), columns=range(NumberOfDimensions))
    for ClusterNumber in ClusterLabels: # for each cluster
        # get mean for each label 
        mean_ = Points[Labels==ClusterNumber].mean(axis=0)
        ClusterCentroids.loc[ClusterNumber, :] = mean_
        print(ClusterCentroids.loc[ClusterNumber, :])
        print(type(mean_))
    return ClusterCentroids # return the a label for each point

In [4]:
# test_labels = np.random.choice(4, 10, replace=True)
# ClusterLabels = np.unique(test_labels)
# range(ClusterLabels.size-1)
# range(3)
# clus_no = 2
# points = pd.DataFrame([[1,1], [2,2],[3,3],[4,4],[5,5],[6,6],[7,7],[8,8],[9,9],[10,10]])
# print(test_labels)
# print(test_labels == clus_no)
# res = points[test_labels == clus_no]
# print(res.shape)
# res.mean(axis=0)
# np.arange(len(ClusterLabels))

Question 1.3: Put it all together as such. K-means algorithm partitions the input data into K clusters by iterating between the following two steps:
- Compute the cluster center by computing the arithmetic mean of all the points belonging to the cluster.
- Assign each point to the closest cluster center.

In [5]:
def KMeans(Points, ClusterCentroidGuesses):
    #TODO
    ClusterCentroids = ClusterCentroidGuesses.copy()
    Labels_Previous = None
    # Get starting set of labels
    Labels = FindLabelOfClosest(Points, ClusterCentroids)
    while not np.array_equal(Labels, Labels_Previous):
        # Re-calculate cluster centers based on new set of labels
        ClusterCentroids = CalculateClusterCentroid(Points.to_numpy(), Labels)
        Labels_Previous = Labels.copy() # Must make a deep copy
        # Determine new labels based on new cluster centers
        Labels = FindLabelOfClosest(Points, ClusterCentroids)
    return Labels, ClusterCentroids

In [6]:
StoreTxn = pd.read_csv("./Superstore Transaction data.csv")
StoreTxn['Order Date'] = pd.to_datetime(StoreTxn['Order Date'] )
StoreTxn.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,2016-11-08,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,2,CA-2016-152156,2016-11-08,11/11/2016,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,3,CA-2016-138688,2016-06-12,6/16/2016,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,4,US-2015-108966,2015-10-11,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,5,US-2015-108966,2015-10-11,10/18/2015,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164


In [7]:
# StoreTxn.info()

In [8]:
# StoreTxn.describe()

Extract RFM features from the transaction data:
- Recency: when was the last purchase they made
- Frequency: how often do they make a purchase in the last month (or any given window you choose)
- Monetary: how much money did they spend in the last month

Question 2.1:
- Use groupby to summarize the quantity and dollar columns by user_id and date
- Name the aggregated data txn_agg
- Reset the index for txn_agg to the default and user_id and date to dataframe columns
- Confirm changes

In [9]:
txn_agg = StoreTxn.groupby(['Customer ID', 'Order Date']).agg({'Quantity': 'sum', 'Sales': 'sum'}).reset_index()
#TODO #Summarize quantity and dollar by user_id - date.  
txn_agg.head(10)

Unnamed: 0,Customer ID,Order Date,Quantity,Sales
0,AA-10315,2014-03-31,4,726.548
1,AA-10315,2014-09-15,5,29.5
2,AA-10315,2015-10-04,2,26.96
3,AA-10315,2016-03-03,14,4406.072
4,AA-10315,2017-06-29,5,374.48
5,AA-10375,2014-04-21,5,16.52
6,AA-10375,2014-10-24,3,34.272
7,AA-10375,2015-02-03,5,178.37
8,AA-10375,2015-05-08,2,5.248
9,AA-10375,2015-11-13,6,84.96


Question 2.2:Using the aggregated data, obtain recency, frequency and monetary features for both dollar and quantity. Use a 7-day moving window for frequency and monetary. Call your new features last_visit_ndays (recency) quantity_roll_sum_7D (frequency) and dollar_roll_sum_7D (monetary)

In [10]:
txn_agg['Order Date'] = pd.to_datetime(txn_agg['Order Date'] )

In [11]:
last = txn_agg.groupby("Customer ID").apply(lambda d: d.assign(last_visit_ndays=d["Order Date"].diff() ))
#TODO # Group the data by user_id and calculate lag as the differnce between the current and previous date (lag by one period)
# last.rename(columns = {'Order Date' : 'last_visit_ndays'}, inplace = True) # Name the lagged date values last_visit_ndays
print(last.head(10), end='\n\n')

roll = txn_agg.groupby("Customer ID").rolling(window="7D", on="Order Date").agg({"Quantity":"sum","Sales":"sum"})
roll = roll.reset_index()
#TODO # Group the data by user_id.  Apply a 7 day offset to implement a moving 7-day window totaling quantity and dollars sold within each time window. 
roll.rename(columns = {'Quantity' : 'Quantity_roll_sum_7D', 'Sales' : 'Sales_roll_sum_7D'}, inplace = True) # Name the resulting data values quantity_roll_sum_7D and dollar_roll_sum_7D
print(roll.head(10), end='\n\n')


  Customer ID Order Date  Quantity     Sales last_visit_ndays
0    AA-10315 2014-03-31         4   726.548              NaT
1    AA-10315 2014-09-15         5    29.500         168 days
2    AA-10315 2015-10-04         2    26.960         384 days
3    AA-10315 2016-03-03        14  4406.072         151 days
4    AA-10315 2017-06-29         5   374.480         483 days
5    AA-10375 2014-04-21         5    16.520              NaT
6    AA-10375 2014-10-24         3    34.272         186 days
7    AA-10375 2015-02-03         5   178.370         102 days
8    AA-10375 2015-05-08         2     5.248          94 days
9    AA-10375 2015-11-13         6    84.960         189 days

  Customer ID Order Date  Quantity_roll_sum_7D  Sales_roll_sum_7D
0    AA-10315 2014-03-31                   4.0            726.548
1    AA-10315 2014-09-15                   5.0             29.500
2    AA-10315 2015-10-04                   2.0             26.960
3    AA-10315 2016-03-03                  14.0       

Question 2.3: Combine all three features into a single DataFrame and call it txn_roll

In [12]:
# print(last.shape)
# print(roll.shape)
# print(txn_roll.shape)

In [13]:
txn_roll = pd.merge(last, roll, on=["Customer ID", "Order Date"], how='inner')
#TODO # Inner join between roll (frequency and monetary fields) and last (recency fields) to create churn_roll.  Join based on index which works given that both dataframes are sorted by user_id and date.
txn_roll = txn_roll.drop('Quantity', axis=1)
txn_roll = txn_roll.drop('Sales', axis=1)

print(txn_roll.dtypes, end='\n\n')
txn_roll.head(10)


Customer ID                      object
Order Date               datetime64[ns]
last_visit_ndays        timedelta64[ns]
Quantity_roll_sum_7D            float64
Sales_roll_sum_7D               float64
dtype: object



Unnamed: 0,Customer ID,Order Date,last_visit_ndays,Quantity_roll_sum_7D,Sales_roll_sum_7D
0,AA-10315,2014-03-31,NaT,4.0,726.548
1,AA-10315,2014-09-15,168 days,5.0,29.5
2,AA-10315,2015-10-04,384 days,2.0,26.96
3,AA-10315,2016-03-03,151 days,14.0,4406.072
4,AA-10315,2017-06-29,483 days,5.0,374.48
5,AA-10375,2014-04-21,NaT,5.0,16.52
6,AA-10375,2014-10-24,186 days,3.0,34.272
7,AA-10375,2015-02-03,102 days,5.0,178.37
8,AA-10375,2015-05-08,94 days,2.0,5.248
9,AA-10375,2015-11-13,189 days,6.0,84.96


Question 2.4: Use fillna to replace missing values for recency with a large value like 100 days (whatever makes business sense). HINT: You can use pd.Timedelta('100 days') to set the value.

In [14]:
txn_roll['last_visit_ndays'] = txn_roll['last_visit_ndays'].fillna(pd.Timedelta(days=1000))
#TODO # Replace missing recency values with 1000 days
txn_roll.head(10)

Unnamed: 0,Customer ID,Order Date,last_visit_ndays,Quantity_roll_sum_7D,Sales_roll_sum_7D
0,AA-10315,2014-03-31,1000 days,4.0,726.548
1,AA-10315,2014-09-15,168 days,5.0,29.5
2,AA-10315,2015-10-04,384 days,2.0,26.96
3,AA-10315,2016-03-03,151 days,14.0,4406.072
4,AA-10315,2017-06-29,483 days,5.0,374.48
5,AA-10375,2014-04-21,1000 days,5.0,16.52
6,AA-10375,2014-10-24,186 days,3.0,34.272
7,AA-10375,2015-02-03,102 days,5.0,178.37
8,AA-10375,2015-05-08,94 days,2.0,5.248
9,AA-10375,2015-11-13,189 days,6.0,84.96


In [15]:
txn_roll['last_visit_ndays'] = txn_roll['last_visit_ndays'].dt.days

Question 2.5: Merge the aggregated data churn_agg with the RFM features in churn_roll. You can use the merge method to do this with the right keys specified.

In [16]:
txn_rfm = pd.merge(txn_roll, StoreTxn, on=["Customer ID", "Order Date"], how='right')
#TODO #merge on Customer ID and Order Date
print(txn_rfm.shape)
txn_rfm.head(10)

(9994, 24)


Unnamed: 0,Customer ID,Order Date,last_visit_ndays,Quantity_roll_sum_7D,Sales_roll_sum_7D,Row ID,Order ID,Ship Date,Ship Mode,Customer Name,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,CG-12520,2016-11-08,390,5.0,993.9,1,CA-2016-152156,11/11/2016,Second Class,Claire Gute,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,CG-12520,2016-11-08,390,5.0,993.9,2,CA-2016-152156,11/11/2016,Second Class,Claire Gute,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,DV-13045,2016-06-12,1000,2.0,14.62,3,CA-2016-138688,6/16/2016,Second Class,Darrin Van Huff,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,SO-20335,2015-10-11,1000,7.0,979.9455,4,US-2015-108966,10/18/2015,Standard Class,Sean O'Donnell,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,SO-20335,2015-10-11,1000,7.0,979.9455,5,US-2015-108966,10/18/2015,Standard Class,Sean O'Donnell,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164
5,BH-11710,2014-06-09,1000,38.0,3714.304,6,CA-2014-115812,6/14/2014,Standard Class,Brosina Hoffman,...,90032,West,FUR-FU-10001487,Furniture,Furnishings,Eldon Expressions Wood and Plastic Desk Access...,48.86,7,0.0,14.1694
6,BH-11710,2014-06-09,1000,38.0,3714.304,7,CA-2014-115812,6/14/2014,Standard Class,Brosina Hoffman,...,90032,West,OFF-AR-10002833,Office Supplies,Art,Newell 322,7.28,4,0.0,1.9656
7,BH-11710,2014-06-09,1000,38.0,3714.304,8,CA-2014-115812,6/14/2014,Standard Class,Brosina Hoffman,...,90032,West,TEC-PH-10002275,Technology,Phones,Mitel 5320 IP Phone VoIP phone,907.152,6,0.2,90.7152
8,BH-11710,2014-06-09,1000,38.0,3714.304,9,CA-2014-115812,6/14/2014,Standard Class,Brosina Hoffman,...,90032,West,OFF-BI-10003910,Office Supplies,Binders,DXL Angle-View Binders with Locking Rings by S...,18.504,3,0.2,5.7825
9,BH-11710,2014-06-09,1000,38.0,3714.304,10,CA-2014-115812,6/14/2014,Standard Class,Brosina Hoffman,...,90032,West,OFF-AP-10002892,Office Supplies,Appliances,Belkin F5C206VTEL 6 Outlet Surge,114.9,5,0.0,34.47


Question 3.1: Train the k-means algorithm you developed earlier on the RFM features using  𝑘=4 . What are the cluster centroids? The cluster centroids should be reported in the original scale, not the standardized scale.

In [17]:
new_df = txn_roll.drop(['Customer ID', 'Order Date'], axis=1)
idx = np.random.choice(len(txn_roll), 4, replace=False)

ClusterCentroidGuesses = new_df.iloc[idx]

Labels, ClusterCentroids = KMeans(new_df, ClusterCentroidGuesses)

0     293.426846
1      14.304698
2    1987.596275
Name: 0, dtype: float64
<class 'numpy.ndarray'>
0     58.681196
1      7.709838
2    265.122604
Name: 1, dtype: float64
<class 'numpy.ndarray'>
0    926.541802
1      6.896851
2    233.498895
Name: 2, dtype: float64
<class 'numpy.ndarray'>
0    243.914535
1      5.619186
2    132.967120
Name: 3, dtype: float64
<class 'numpy.ndarray'>
0    317.652845
1      7.823718
2    470.813713
Name: 0, dtype: float64
<class 'numpy.ndarray'>


In [22]:
print(ClusterCentroids)
print(Labels)

            0         1           2
0  317.652845  7.823718  470.813713
[0. 0. 0. ... 0. 0. 0.]


Question 3.2: Pick few pairs and plot scatter plots along with cluster centroids.

[Bonus] Question 4: Train k-means model using sklearn library and compare results to the model developed above.

In [20]:
from sklearn.cluster import KMeans

kmeans = KMeans(random_state=0, n_clusters=4).fit(txn_roll.drop(['Customer ID', 'Order Date'], axis=1))
print(kmeans.labels_)
print(kmeans.cluster_centers_)

[2 0 0 ... 0 2 0]
[[3.13355567e+02 6.36233701e+00 1.67715466e+02]
 [3.04790210e+02 1.74405594e+01 3.62480178e+03]
 [3.37426210e+02 1.29185360e+01 1.18283617e+03]
 [4.76857143e+02 1.76428571e+01 1.15174207e+04]]


Comparing the 2 models: My model seems very rudimentary and is not clustering well.

Question 5: Create a new text cell in your Notebook: Complete a 50-100 word summary (or short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include: What was your incoming experience with this model, if any? what steps you took, what obstacles you encountered. how you link this exercise to real-world, machine learning problem-solving. (What steps were missing? What else do you need to learn?) This summary allows your instructor to know how you are doing and allot points for your effort in thinking and planning, and making connections to real-world work.

Incoming experience: No incoming experience apart from previous assignments.

Steps taken: This week's lesson was about unsupervised learning. Got a feel for kmeans clustering works under the hood.

Obstacles: Took a while to figure out what each method did, the types of inputs and outputs and some of the errors thrown.

Link to real world: Helped me understand when Scikit learn libraries and how they can be leveraged. Also, with implementation, I learnt how we could tweak the distance method used (for example, can use Manhattan instead of Euclidean distances bases on use case)

Steps missing (with just this week's learning): Scaling, model evaluation and finding the appropriate number of clusters through elbow method.