# Assignment

In the previous assignment, we featurized the `retail-churn.csv` data using RFM (recency, frequency, monetary) features. In this assignment, we build on that feature engineering and run k-means on the RFM features in order to do **customer segmentation**. Since k-means is unsupervised, we will also encounter challenges around interpreting the results at the end.

In [9]:
import pandas as pd
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt
col_names = ['user_id', 'gender', 'address', 'store_id', 'trans_id', 'timestamp', 'item_id', 'quantity', 'dollar']
churn = pd.read_csv("data/retail-churn.csv", sep = ",", skiprows = 1, names = col_names)
churn.dtypes

user_id        int64
gender        object
address       object
store_id       int64
trans_id       int64
timestamp     object
item_id      float64
quantity       int64
dollar         int64
dtype: object

1. Rerun the feature engineering steps on the data to extract RFM features. <span style="color:red" float:right>[2 points]</span>

In [10]:
churn['timestamp']= pd.to_datetime(churn['timestamp'])
churn['date'] = pd.DatetimeIndex(churn['timestamp']).date
churn_agg = churn.groupby(by=['user_id','date'],as_index = False).sum()
churn_agg = churn_agg.reset_index() #resets the index from aggregation
churn_agg['date'] = pd.to_datetime(churn_agg['date']) #reassigns date to datetime

recency = churn_agg.groupby('user_id').diff() #creates recency dataframe
frequency = churn_agg.groupby('user_id').rolling('7D', on ='date').sum() 
monetary = pd.DataFrame(churn_agg.groupby('user_id').rolling('7D', on ='date').sum()) 

churn_roll = pd.concat([frequency['quantity'], monetary['dollar']], axis = 1, keys = ['quantity_roll_sum_7D','dollar_roll_sum_7D' ]) #makes new dataframe
churn_roll = churn_roll.reset_index() # resets index from .rolling
churn_roll['last_visit_ndays'] = recency['date'] # adds recency
churn_roll['last_visit_ndays'] = churn_roll['last_visit_ndays'].fillna(pd.Timedelta('999 days')) # sets NaN (no previous visit) to 999 days
churn_agg = churn_agg.merge(churn_roll)
churn_agg.head(10)

,index,user_id,date,store_id,trans_id,item_id,quantity,dollar,level_1,quantity_roll_sum_7D,dollar_roll_sum_7D,last_visit_ndays
0,0,1113,2000-11-12,236305,1810321,9610000000000.0,5,420,0,5.0,420.0,999 days
1,0,1113,2000-11-12,236305,1810321,9610000000000.0,5,420,1,8.0,978.0,14 days
2,0,1113,2000-11-12,236305,1810321,9610000000000.0,5,420,2,14.0,1602.0,1 days
3,0,1113,2000-11-12,236305,1810321,9610000000000.0,5,420,3,23.0,2230.0,40 days
4,1,1113,2000-11-26,354465,3000946,17230000000000.0,3,558,0,5.0,420.0,999 days
5,1,1113,2000-11-26,354465,3000946,17230000000000.0,3,558,1,8.0,978.0,14 days
6,1,1113,2000-11-26,354465,3000946,17230000000000.0,3,558,2,14.0,1602.0,1 days
7,1,1113,2000-11-26,354465,3000946,17230000000000.0,3,558,3,23.0,2230.0,40 days
8,2,1113,2000-11-27,708957,6115029,28270000000000.0,6,624,0,5.0,420.0,999 days
9,2,1113,2000-11-27,708957,6115029,28270000000000.0,6,624,1,8.0,978.0,14 days
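
The repeated rows in the output above (the same `date` paired with every `level_1` value) suggest that the merge matched on `user_id` alone. As a minimal sketch only, and not the method used above, the three RFM features can be computed on a toy daily aggregate and merged back on both `user_id` and `date`, which keeps one row per user per day. The toy values and the names `daily`, `roll` and `rfm` are assumptions made for illustration.

```python
import pandas as pd

# Toy daily aggregate: one row per (user_id, date). The values are made up.
daily = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2000-11-12", "2000-11-26", "2000-11-27",
                            "2000-11-13", "2000-11-15"]),
    "quantity": [5, 3, 6, 2, 4],
    "dollar": [420, 558, 624, 100, 250],
}).sort_values(["user_id", "date"]).reset_index(drop=True)

# Recency: days since the same user's previous visit (999 for a first visit).
daily["last_visit_ndays"] = (
    daily.groupby("user_id")["date"].diff().dt.days.fillna(999).astype(int)
)

# Frequency and monetary: 7-day rolling sums per user.
# With the date as index, the result is indexed by (user_id, date).
roll = (
    daily.set_index("date")
         .groupby("user_id")[["quantity", "dollar"]]
         .rolling("7D")
         .sum()
         .rename(columns={"quantity": "quantity_roll_sum_7D",
                          "dollar": "dollar_roll_sum_7D"})
         .reset_index()
)

# Merging on both keys keeps exactly one row per user per day;
# validate raises an error if either side has duplicate keys.
rfm = daily.merge(roll, on=["user_id", "date"], how="left",
                  validate="one_to_one")
print(rfm)
```

Comparing `len(rfm)` with `len(daily)` is a quick check that the merge did not multiply rows.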


2. Train a k-means algorithm on the RFM features using $k = 10$. What are the cluster centroids? The cluster centroids should be reported in the **original scale**, not the standardized scale. <span style="color:red" float:right>[2 points]</span>

In [11]:
churn_agg['last_visit_ndays'] = pd.to_numeric(churn_agg['last_visit_ndays'].dt.days, downcast='integer')
churn_agg.head()

,index,user_id,date,store_id,trans_id,item_id,quantity,dollar,level_1,quantity_roll_sum_7D,dollar_roll_sum_7D,last_visit_ndays
0,0,1113,2000-11-12,236305,1810321,9610000000000.0,5,420,0,5.0,420.0,999
1,0,1113,2000-11-12,236305,1810321,9610000000000.0,5,420,1,8.0,978.0,14
2,0,1113,2000-11-12,236305,1810321,9610000000000.0,5,420,2,14.0,1602.0,1
3,0,1113,2000-11-12,236305,1810321,9610000000000.0,5,420,3,23.0,2230.0,40
4,1,1113,2000-11-26,354465,3000946,17230000000000.0,3,558,0,5.0,420.0,999


In [12]:
from sklearn.cluster import KMeans
n_clusters = 10 # the number of clusters (k)
which_cols = ['quantity_roll_sum_7D', 'dollar_roll_sum_7D','last_visit_ndays']

X = churn_agg[which_cols]
kmeans = KMeans(n_clusters = n_clusters, random_state = 0) # step 1: initialize
kmeans.fit(X) # step 2, learn the clusters
churn_agg['cluster'] = kmeans.predict(X) # step 3, assign a cluster to each row
churn_agg

,index,user_id,date,store_id,trans_id,item_id,quantity,dollar,level_1,quantity_roll_sum_7D,dollar_roll_sum_7D,last_visit_ndays,cluster
0,0,1113,2000-11-12,236305,1810321,9.610000e+12,5,420,0,5.0,420.0,999,0
1,0,1113,2000-11-12,236305,1810321,9.610000e+12,5,420,1,8.0,978.0,14,0
2,0,1113,2000-11-12,236305,1810321,9.610000e+12,5,420,2,14.0,1602.0,1,0
3,0,1113,2000-11-12,236305,1810321,9.610000e+12,5,420,3,23.0,2230.0,40,0
4,1,1113,2000-11-26,354465,3000946,1.723000e+13,3,558,0,5.0,420.0,999,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
369751,37053,2179315,2001-02-28,503817,3257058,9.420000e+12,3,377,37053,3.0,377.0,999,0
369752,37054,2179346,2001-02-28,3778890,24447343,6.991000e+13,23,3567,37054,23.0,3567.0,999,7
369753,37055,2179414,2001-02-28,9067842,58670039,1.607080e+14,46,4993,37055,46.0,4993.0,999,7
369754,37056,2179469,2001-02-28,1763405,11404692,3.298000e+13,15,1706,37056,15.0,1706.0,999,0
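
The question asks for centroids in the original scale, which matters when the features are standardized before clustering. The cell above fits k-means on the raw features and does not print the centroids. The following is a hedged sketch on synthetic data, not the assignment data, showing one way to standardize with `StandardScaler`, fit `KMeans`, and map `cluster_centers_` back with `inverse_transform`. The blob locations and scales are made-up assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for three features on very different scales
# (two well-separated groups; all numbers are made up).
X = np.vstack([
    rng.normal(loc=[5, 500, 10], scale=[1, 50, 2], size=(100, 3)),
    rng.normal(loc=[50, 5000, 300], scale=[5, 300, 20], size=(100, 3)),
])

# Standardize so that no single feature dominates the Euclidean distance.
scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_std)

# cluster_centers_ live in standardized units; undo the scaling to report them.
centroids_original = scaler.inverse_transform(km.cluster_centers_)
print(centroids_original)
```

With real data, the same `inverse_transform` call applies to the centroids of whatever model was fit on the standardized features.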


3. Our earlier choice of $k=10$ was arbitrary. To find a better value of $k$, create a **scree plot**, which plots the number of clusters $k$ on the x-axis and the sum of squared distances from each point to its cluster centroid on the y-axis. We can get the latter by calling the `inertia_` attribute as shown in the lab. Plot the scree plot for $k$ values from 3 to 15. <span style="color:red" float:right>[3 points]</span>

In [14]:
from sklearn import metrics
from scipy.spatial.distance import cdist
distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(3, 16) # k = 3, ..., 15 (range excludes its stop value)

for k in K:
    # Build and fit the model once per value of k
    kmeanModel = KMeans(n_clusters=k).fit(X)

    # Distortion: mean Euclidean distance from each point to its nearest centroid
    distortion = np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1).mean()
    distortions.append(distortion)
    # Inertia: sum of squared distances from each point to its centroid
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_


In [None]:
# The question asks for the sum of squared distances (inertia) on the y-axis
plt.plot(K, inertias, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()
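
To make the y-axis quantity concrete, here is a small sketch on synthetic data (four made-up blobs, not the assignment data) that checks that `inertia_` equals the sum of squared distances from each point to its nearest centroid. The names `inertias` and `manual` are local to this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Four well-separated 2-D blobs (synthetic, for illustration only).
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in centers])

inertias = {}
manual = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

    # Squared Euclidean distance from every point to every centroid,
    # then keep the nearest centroid per point and sum over points.
    sq_dist = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
    manual[k] = sq_dist.min(axis=1).sum()

for k in inertias:
    print(k, round(inertias[k], 2), round(manual[k], 2))
```

Plotting `inertias` against `k` gives the scree plot. On data like this, the curve would be expected to flatten once `k` reaches the number of blobs.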

4. Based on the scree plot, what is a good value to pick for $k$? Provide a brief justification for your choice. <span style="color:red" float:right>[2 points]</span>

To choose the number of clusters, we select the value of $k$ at the “elbow”, i.e. the point after which the distortion/inertia starts decreasing in a roughly linear fashion. For the given data, we conclude that a good number of clusters is 7.

5. Train a k-means algorithm on the RFM features using your new value of $k$. Report the size, mean and standard deviation for the RFM features for each cluster. <span style="color:red" float:right>[2 points]</span>

In [15]:
n_clusters = 7 # the number of clusters (k)
which_cols = ['quantity_roll_sum_7D', 'dollar_roll_sum_7D','last_visit_ndays']

X = churn_agg[which_cols]
kmeans = KMeans(n_clusters = n_clusters, random_state = 0) # step 1: initialize
kmeans.fit(X) # step 2, learn the clusters
churn_agg['cluster'] = kmeans.predict(X) # step 3, assign a cluster to each row

In [19]:
churn_agg.groupby('cluster').std()

cluster,index,user_id,store_id,trans_id,item_id,quantity,dollar,level_1,quantity_roll_sum_7D,dollar_roll_sum_7D,last_visit_ndays
0,10759.987288,641573.984909,742210.978349,6402139.0,22867190000000.0,11.554765,848.986409,10760.23891,20.10971,1285.286371,384.915118
1,11369.369602,812381.895573,609123.341621,14934380.0,52339390000000.0,51.97201,3328.637928,11366.991009,373.481735,6458.902147,23.269032
2,11212.657929,683177.917989,333333.803954,24169770.0,85860170000000.0,25.471659,6550.280844,11207.666186,344.45779,15348.128789,2.954946
3,9993.487623,630515.228581,701405.808263,9090057.0,32828440000000.0,23.613692,1614.275978,9994.26697,91.005914,2596.450486,21.755533
4,10302.872129,640259.360797,814185.155616,7704206.0,27851160000000.0,9.928053,1085.620953,10303.247751,37.943907,1846.26746,76.682594
5,13797.307417,853697.501862,960441.261598,21576830.0,75106990000000.0,89.265662,6827.506098,13799.361152,549.116298,15471.688511,41.54244
6,10138.100203,677940.416707,544722.613449,9538176.0,34307060000000.0,11.607413,2031.821398,10141.045223,81.760951,4032.612174,18.600638


In [20]:
churn_agg.groupby('cluster').mean()

cluster,index,user_id,store_id,trans_id,item_id,quantity,dollar,level_1,quantity_roll_sum_7D,dollar_roll_sum_7D,last_visit_ndays
0,18042.018557,1365585.0,499691.854482,6352326.0,23134890000000.0,7.024838,579.750936,18038.623718,27.794264,1975.721475,195.486775
1,12517.723954,973084.7,274576.336797,11665520.0,41880700000000.0,17.448974,1678.207168,12527.323475,539.002531,45691.441114,3.865041
2,4260.596875,272659.1,317058.919375,21845660.0,80330450000000.0,25.281875,3462.884375,4279.755,1307.624375,143264.409375,2.4
3,16420.65221,1289833.0,354256.4251,8016738.0,29339260000000.0,9.34948,925.631129,16426.035446,161.775814,14974.96983,5.356766
4,17000.638954,1317994.0,451343.69305,7442417.0,27186420000000.0,8.213661,759.607643,17001.948924,87.625704,7239.411647,13.356214
5,18644.700475,1301786.0,478959.555891,14925920.0,54615870000000.0,31.090203,3277.712128,18654.136383,1038.582218,82843.997842,5.497626
6,16552.676364,1283675.0,243228.547755,8330378.0,30021440000000.0,9.508014,1079.581019,16562.040345,242.836426,26248.260237,3.764519


In [21]:
churn_agg.groupby('cluster').size()

cluster
0    195343
1      7506
2      1600
3     43785
4    100987
5      2317
6     18218
dtype: int64
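
The tables above report statistics for every column, including identifiers such as `user_id` and `trans_id`. As a small sketch on a toy frame (made-up values, with the same RFM column names assumed), the summary can be restricted to the three RFM features in one `groupby` call:

```python
import pandas as pd

# Toy frame with a cluster label and the three RFM columns (values made up).
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "quantity_roll_sum_7D": [1.0, 3.0, 2.0, 4.0, 6.0],
    "dollar_roll_sum_7D": [10.0, 30.0, 20.0, 40.0, 60.0],
    "last_visit_ndays": [7, 9, 1, 2, 3],
})
which_cols = ["quantity_roll_sum_7D", "dollar_roll_sum_7D", "last_visit_ndays"]

# Number of rows per cluster.
sizes = toy.groupby("cluster").size()

# Mean and standard deviation of only the RFM features, per cluster.
summary = toy.groupby("cluster")[which_cols].agg(["mean", "std"])

print(sizes)
print(summary)
```

The result has a two-level column index, so a single statistic can be read with, for example, `summary.loc[1, ("quantity_roll_sum_7D", "mean")]`.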

6. Pick 3 clusters at random and describe what makes them different from one another (in terms of their RFM features). <span style="color:red" float:right>[3 points]</span>

Cluster 0: Low-quantity, low-dollar, sparse transactions. These transactions occur rarely and involve a small quantity and a low price.

Cluster 2: The opposite of cluster 0: transactions that are very close together in time, with high quantity and high dollar value.

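Cluster 5: Transactions that are about a week apart (mean `last_visit_ndays` of about 5.5, so perhaps weekly for most). According to the cluster means reported above, its 7-day rolling quantity (about 1,039) and dollar amount (about 82,844) are the second highest of all clusters, behind only cluster 2, rather than average.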

# End of assignment