# Assignment (Milestone 2)

We saw how TF-IDF can be used to create features on text data. Let's now look at an example of a special transformation very common in the retail industry: RFM, or recency-frequency-monetary transformation. The goal of this assignment is to create RFM features for the `retail-churn.csv` data. You will see that having time series data opens up many types of features (although how useful they will ultimately be is another question).

Before running the following code, make sure `pandas` is at version `1.0.1` or later, for example by running `pip install pandas==1.0.1`. You can check the installed version of `pandas` by running the next cell.

In [115]:
import pandas as pd
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt
pd.__version__

'1.1.2'

Let's read the `retail-churn.csv` data, which is by now very familiar.

In [116]:
col_names = ['user_id', 'gender', 'address', 'store_id', 'trans_id', 'timestamp', 'item_id', 'quantity', 'dollar']
churn = pd.read_csv("data/retail-churn.csv", sep = ",", skiprows = 1, names = col_names)
churn.head()

Unnamed: 0,user_id,gender,address,store_id,trans_id,timestamp,item_id,quantity,dollar
0,101981,F,E,2860,818463,11/1/2000 0:00,4710000000000.0,1,37
1,101981,F,E,2861,818464,11/1/2000 0:00,4710000000000.0,1,17
2,101981,F,E,2862,818465,11/1/2000 0:00,4710000000000.0,1,23
3,101981,F,E,2863,818466,11/1/2000 0:00,4710000000000.0,1,41
4,101981,F,E,2864,818467,11/1/2000 0:00,4710000000000.0,8,288


Run the following steps to feature engineer the data.

1. Convert the `timestamp` column to be of type `datetime`. <span style="color:red" float:right>[1 point]</span>

In [117]:
churn['timestamp'] = pd.to_datetime(churn['timestamp'])  # convert the timestamp strings to datetime
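The conversion above lets `pandas` infer the layout of the timestamp strings. A minimal sketch of the same step with an explicit format follows, assuming the strings are month-first as the parsed output suggests (`11/1/2000` becomes `2000-11-01`). The `raw` series is toy data, and its second value is a made-up example in the same shape.

```python
import pandas as pd

# Toy strings shaped like the raw `timestamp` column (second value is hypothetical).
raw = pd.Series(["11/1/2000 0:00", "2/28/2001 0:00"])

# An explicit format removes month/day ambiguity and raises on rows that do not match.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Passing `format` is optional here, but it documents the assumption about the input and tends to fail loudly instead of silently misreading a day as a month.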

2. Extract the date from `datetime` and store it in a new column called `date`. <span style="color:red" float:right>[1 point]</span>

In [118]:
churn['date'] = churn['timestamp'].dt.date  # keep only the calendar date of each transaction
churn.head()

Unnamed: 0,user_id,gender,address,store_id,trans_id,timestamp,item_id,quantity,dollar,date
0,101981,F,E,2860,818463,2000-11-01,4710000000000.0,1,37,2000-11-01
1,101981,F,E,2861,818464,2000-11-01,4710000000000.0,1,17,2000-11-01
2,101981,F,E,2862,818465,2000-11-01,4710000000000.0,1,23,2000-11-01
3,101981,F,E,2863,818466,2000-11-01,4710000000000.0,1,41,2000-11-01
4,101981,F,E,2864,818467,2000-11-01,4710000000000.0,8,288,2000-11-01
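There is more than one way to drop the time of day. The sketch below, on toy timestamps, contrasts `.dt.date` (Python `date` objects) with `.dt.normalize()` (midnight timestamps that keep a datetime dtype). Which one is preferable depends on what later steps need. Time-based windows such as `rolling('7D')` require a datetime dtype.

```python
import pandas as pd

# Two toy transactions on the same day at different times.
ts = pd.Series(pd.to_datetime(["2000-11-01 09:30", "2000-11-01 17:45"]))

as_objects = ts.dt.date          # Python datetime.date objects, dtype object
as_midnight = ts.dt.normalize()  # same day at 00:00, still a datetime dtype

print(as_objects.dtype, as_midnight.dtype)
```

With `.dt.normalize()` the column can be used directly in date arithmetic and time-based windows, without converting back with `pd.to_datetime`.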


Notice that the **granularity** of the data is not daily spend, but rather individual transactions. We can see that because the same user has multiple transactions with the same timestamp. Before we run RFM, we need to **aggregate** the data so we have daily granularity.
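One way to confirm the granularity is to count rows that share a `(user_id, date)` key. This is a small sketch on toy rows shaped like the data above. The frame `toy` and its values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [101981, 101981, 101981, 1113],
    "date": pd.to_datetime(["2000-11-01", "2000-11-01", "2000-11-01", "2000-11-12"]),
    "dollar": [37, 17, 23, 420],
})

# keep=False flags every row whose (user_id, date) key also appears on another row.
n_repeated = toy.duplicated(["user_id", "date"], keep=False).sum()
print(n_repeated)
```

A count above zero means the table is finer than one row per user per day, so it needs aggregating before rolling features are computed.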

3. Aggregate `quantity` and `dollar` to daily data (so that each combination of `user_id` and `date` appears in exactly one row). Call the aggregated data `churn_agg`. You can ignore all the other columns, as they are not needed. <span style="color:red" float:right>[2 points]</span>

In [119]:
churn_agg = churn.groupby(by=['user_id','date'],as_index = False).sum() #aggregates quantity and dollar
churn_agg

Unnamed: 0,user_id,date,store_id,trans_id,item_id,quantity,dollar
0,1113,2000-11-12,236305,1810321,9.610000e+12,5,420
1,1113,2000-11-26,354465,3000946,1.723000e+13,3,558
2,1113,2000-11-27,708957,6115029,2.827000e+13,6,624
3,1113,2001-01-06,827162,8990176,3.333700e+13,9,628
4,1250,2001-02-04,973126,5978686,1.907000e+13,5,734
...,...,...,...,...,...,...,...
37053,2179315,2001-02-28,503817,3257058,9.420000e+12,3,377
37054,2179346,2001-02-28,3778890,24447343,6.991000e+13,23,3567
37055,2179414,2001-02-28,9067842,58670039,1.607080e+14,46,4993
37056,2179469,2001-02-28,1763405,11404692,3.298000e+13,15,1706
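The output above also sums `store_id`, `trans_id`, and `item_id`, which are identifiers rather than amounts, so those totals carry no meaning. A minimal sketch of restricting the aggregation to the two measures follows, on toy rows. The name `toy_agg` is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [101981, 101981, 101981],
    "date": pd.to_datetime(["2000-11-01"] * 3),
    "store_id": [2860, 2861, 2862],
    "quantity": [1, 1, 1],
    "dollar": [37, 17, 23],
})

# Select the measure columns before summing so the identifier columns are left out.
toy_agg = (
    toy.groupby(["user_id", "date"], as_index=False)[["quantity", "dollar"]].sum()
)
print(toy_agg)
```

The result has one row per `(user_id, date)` pair and only the key columns plus `quantity` and `dollar`.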


4. Using the aggregated data, obtain recency, frequency and monetary features for both `dollar` and `quantity`. Use a 7-day moving window for frequency and monetary. Call your new features `last_visit_ndays` (recency), `quantity_roll_sum_7D` (frequency) and `dollar_roll_sum_7D` (monetary). <span style="color:red" float:right>[4 points]</span>

  HINT: In `pandas`, recency is a kind of **difference** feature, because it is based on the difference between the current date and a previous date (called a **lag**). We can use the `diff` method to get recency. Frequency and monetary features are called **rolling** features, because each is a kind of cumulative sum taken over a moving window. We can use the `rolling` method to get frequency and monetary, where the `window` and `on` arguments need to be chosen carefully.
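To see why the `window` argument matters, the sketch below compares a calendar window (`'7D'`) with a row-count window (`7`) for a single customer. The values mirror the daily quantities of user 1113 in the aggregated output. The series `s` is a toy stand-in, not the assignment's data frame.

```python
import pandas as pd

# One customer's daily quantities, indexed by visit date.
s = pd.Series(
    [5, 3, 6, 9],
    index=pd.to_datetime(["2000-11-12", "2000-11-26", "2000-11-27", "2001-01-06"]),
)

by_days = s.rolling("7D").sum()              # calendar window: visits in the last 7 days
by_rows = s.rolling(7, min_periods=1).sum()  # row window: the last 7 visits, any gap

print(by_days.tolist())  # [5.0, 3.0, 9.0, 9.0]
print(by_rows.tolist())  # [5.0, 8.0, 14.0, 23.0]
```

Only the `'7D'` window respects the gaps between visits. The integer window keeps adding earlier visits no matter how long ago they happened.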

In [120]:
churn_agg = churn_agg.reset_index() #resets the index from aggregation
churn_agg['date'] = pd.to_datetime(churn_agg['date']) #reassigns date to datetime

recency = churn_agg.groupby('user_id').diff() #creates recency dataframe
recency

Unnamed: 0,index,date,store_id,trans_id,item_id,quantity,dollar
0,,NaT,,,,,
1,1.0,14 days,118160.0,1190625.0,7.620000e+12,-2.0,138.0
2,1.0,1 days,354492.0,3114083.0,1.104000e+13,3.0,66.0
3,1.0,40 days,118205.0,2875147.0,5.067000e+12,3.0,4.0
4,,NaT,,,,,
...,...,...,...,...,...,...,...
37053,,NaT,,,,,
37054,,NaT,,,,,
37055,,NaT,,,,,
37056,,NaT,,,,,
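The `diff` call above is applied to every column, although only the `date` difference is needed for recency. A minimal sketch that computes just that column follows, on toy rows using the dates shown for users 1113 and 1250. The frame `toy_agg` is illustrative.

```python
import pandas as pd

toy_agg = pd.DataFrame({
    "user_id": [1113, 1113, 1113, 1113, 1250],
    "date": pd.to_datetime(
        ["2000-11-12", "2000-11-26", "2000-11-27", "2001-01-06", "2001-02-04"]
    ),
})

# Sort first so each diff compares a visit with the same user's previous visit.
toy_agg = toy_agg.sort_values(["user_id", "date"]).reset_index(drop=True)

# The first visit of each user has no predecessor, so its recency is NaT.
toy_agg["last_visit_ndays"] = toy_agg.groupby("user_id")["date"].diff()
print(toy_agg)
```

Selecting `['date']` before `diff` returns a Series that shares the frame's index, so it can be assigned as a column directly.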


In [121]:
frequency = churn_agg.groupby('user_id').rolling('7D', on ='date').sum() #creates frequency dataframe
frequency

user_id,Unnamed: 1,index,user_id,date,store_id,trans_id,item_id,quantity,dollar
1113,0,0.0,1113.0,2000-11-12,236305.0,1810321.0,9.610000e+12,5.0,420.0
1113,1,1.0,2226.0,2000-11-26,590770.0,4811267.0,2.684000e+13,8.0,978.0
1113,2,3.0,3339.0,2000-11-27,1299727.0,10926296.0,5.511000e+13,14.0,1602.0
1113,3,6.0,4452.0,2001-01-06,2126889.0,19916472.0,8.844700e+13,23.0,2230.0
1250,4,4.0,1250.0,2001-02-04,973126.0,5978686.0,1.907000e+13,5.0,734.0
...,...,...,...,...,...,...,...,...,...
2179315,37053,37053.0,2179315.0,2001-02-28,503817.0,3257058.0,9.420000e+12,3.0,377.0
2179346,37054,37054.0,2179346.0,2001-02-28,3778890.0,24447343.0,6.991000e+13,23.0,3567.0
2179414,37055,37055.0,2179414.0,2001-02-28,9067842.0,58670039.0,1.607080e+14,46.0,4993.0
2179469,37056,37056.0,2179469.0,2001-02-28,1763405.0,11404692.0,3.298000e+13,15.0,1706.0


In [122]:
monetary = pd.DataFrame(churn_agg.groupby('user_id').rolling('7D', on ='date').sum()) # creates monetary dataframe
monetary

user_id,Unnamed: 1,index,user_id,date,store_id,trans_id,item_id,quantity,dollar
1113,0,0.0,1113.0,2000-11-12,236305.0,1810321.0,9.610000e+12,5.0,420.0
1113,1,1.0,2226.0,2000-11-26,590770.0,4811267.0,2.684000e+13,8.0,978.0
1113,2,3.0,3339.0,2000-11-27,1299727.0,10926296.0,5.511000e+13,14.0,1602.0
1113,3,6.0,4452.0,2001-01-06,2126889.0,19916472.0,8.844700e+13,23.0,2230.0
1250,4,4.0,1250.0,2001-02-04,973126.0,5978686.0,1.907000e+13,5.0,734.0
...,...,...,...,...,...,...,...,...,...
2179315,37053,37053.0,2179315.0,2001-02-28,503817.0,3257058.0,9.420000e+12,3.0,377.0
2179346,37054,37054.0,2179346.0,2001-02-28,3778890.0,24447343.0,6.991000e+13,23.0,3567.0
2179414,37055,37055.0,2179414.0,2001-02-28,9067842.0,58670039.0,1.607080e+14,46.0,4993.0
2179469,37056,37056.0,2179469.0,2001-02-28,1763405.0,11404692.0,3.298000e+13,15.0,1706.0
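The frequency and monetary cells above run the same computation, so both features can come from a single pass. The printed sums for user 1113 also grow to 8, 14 and 23 even though the visits are 14, 1 and 40 days apart. They therefore look like running totals rather than 7-day sums, and are worth double-checking. The sketch below is one possible way to compute both features on toy rows. It assumes a datetime `date` column, and `toy_agg` and `toy_roll` are illustrative names.

```python
import pandas as pd

toy_agg = pd.DataFrame({
    "user_id": [1113, 1113, 1113, 1113, 1250],
    "date": pd.to_datetime(
        ["2000-11-12", "2000-11-26", "2000-11-27", "2001-01-06", "2001-02-04"]
    ),
    "quantity": [5, 3, 6, 9, 5],
    "dollar": [420, 558, 624, 628, 734],
})

toy_roll = (
    toy_agg.sort_values(["user_id", "date"])
    .set_index("date")                           # time-based windows use the index
    .groupby("user_id")[["quantity", "dollar"]]  # roll only the two measures
    .rolling("7D")
    .sum()
    .rename(columns={"quantity": "quantity_roll_sum_7D",
                     "dollar": "dollar_roll_sum_7D"})
    .reset_index()                               # back to flat user_id / date columns
)
print(toy_roll)
```

For user 1113 this gives quantity sums of 5, 3, 9 and 9. Only the visits on 2000-11-26 and 2000-11-27 fall within the same 7-day window.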


5. Combine all three features into a single `DataFrame` and call it `churn_roll`. <span style="color:red" float:right>[1 point]</span>

In [123]:
churn_roll = pd.concat([frequency['quantity'], monetary['dollar']], axis = 1, keys = ['quantity_roll_sum_7D','dollar_roll_sum_7D' ]) #makes new dataframe
churn_roll = churn_roll.reset_index() # flattens the (user_id, row) index created by .rolling
churn_roll['last_visit_ndays'] = recency['date'] # adds recency, aligned on the row index
churn_roll.head()

Unnamed: 0,user_id,level_1,quantity_roll_sum_7D,dollar_roll_sum_7D,last_visit_ndays
0,1113,0,5.0,420.0,NaT
1,1113,1,8.0,978.0,14 days
2,1113,2,14.0,1602.0,1 days
3,1113,3,23.0,2230.0,40 days
4,1250,4,5.0,734.0,NaT
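Both `pd.concat(..., axis=1)` and assigning a Series as a column align on index labels, not on row position. The combination above therefore depends on the three pieces sharing the same index. A small sketch of that behaviour with two toy Series, `a` and `b`, which are illustrative names:

```python
import pandas as pd

a = pd.Series([5.0, 3.0, 9.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])  # same labels, reversed order

# Rows are matched by label, so label 0 pairs 5.0 with 30.0 (not with 10.0).
combined = pd.concat([a, b], axis=1, keys=["a", "b"])
print(combined)
```

If the pieces were sorted differently or had different indexes, values would pair up by label or become `NaN`, so it helps to check the index before combining.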


6. Use `fillna` to replace missing values for recency with a large value like 100 days (whatever makes business sense). HINT: You can use `pd.Timedelta('100 days')` to set the value. <span style="color:red" float:right>[1 point]</span>

In [124]:
churn_roll['last_visit_ndays'] = churn_roll['last_visit_ndays'].fillna(pd.Timedelta('100 days')) # replaces missing recency (NaT) with 100 days
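After the fill, the column still holds `Timedelta` values. If a model needs a plain number of days, `.dt.days` can convert it. This is a sketch on a toy series, and the integer conversion is an optional extra rather than part of the assignment.

```python
import pandas as pd

# Toy recency values: the first visit has no previous visit (NaT).
recency = pd.Series(pd.to_timedelta([None, "14 days", "1 days", "40 days"]))

filled = recency.fillna(pd.Timedelta("100 days"))
ndays = filled.dt.days  # plain integers, easier to feed to most models
print(ndays.tolist())   # [100, 14, 1, 40]
```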

7. To see if things worked, merge the aggregated data `churn_agg` with the RFM features in `churn_roll`. You can use the `merge` method to do this with the right keys specified. <span style="color:red" float:right>[2 points]</span>

In [125]:
churn_agg = churn_agg.merge(churn_roll)

8. Check the features we created to make sure they appear to show the right calculations. You can do this by just checking the first 10 rows of the data. <span style="color:red" float:right>[1 point]</span>

In [126]:
churn_agg.head(10)

Unnamed: 0,index,user_id,date,store_id,trans_id,item_id,quantity,dollar,level_1,quantity_roll_sum_7D,dollar_roll_sum_7D,last_visit_ndays
0,0,1113,2000-11-12,236305,1810321,9610000000000.0,5,420,0,5.0,420.0,100 days
1,0,1113,2000-11-12,236305,1810321,9610000000000.0,5,420,1,8.0,978.0,14 days
2,0,1113,2000-11-12,236305,1810321,9610000000000.0,5,420,2,14.0,1602.0,1 days
3,0,1113,2000-11-12,236305,1810321,9610000000000.0,5,420,3,23.0,2230.0,40 days
4,1,1113,2000-11-26,354465,3000946,17230000000000.0,3,558,0,5.0,420.0,100 days
5,1,1113,2000-11-26,354465,3000946,17230000000000.0,3,558,1,8.0,978.0,14 days
6,1,1113,2000-11-26,354465,3000946,17230000000000.0,3,558,2,14.0,1602.0,1 days
7,1,1113,2000-11-26,354465,3000946,17230000000000.0,3,558,3,23.0,2230.0,40 days
8,2,1113,2000-11-27,708957,6115029,28270000000000.0,6,624,0,5.0,420.0,100 days
9,2,1113,2000-11-27,708957,6115029,28270000000000.0,6,624,1,8.0,978.0,14 days
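In the output above, each row of `churn_agg` is repeated once for every `churn_roll` row of the same user. This suggests the merge matched on `user_id` alone, the only column the two tables share. The sketch below, on toy tables, compares that behaviour with a merge on explicit keys. It assumes the feature table carries a `date` column, whereas in the notebook `level_1` holds the original row number instead.

```python
import pandas as pd

toy_agg = pd.DataFrame({
    "user_id": [1113, 1113, 1250],
    "date": pd.to_datetime(["2000-11-12", "2000-11-26", "2001-02-04"]),
    "dollar": [420, 558, 734],
})
toy_roll = pd.DataFrame({
    "user_id": [1113, 1113, 1250],
    "date": pd.to_datetime(["2000-11-12", "2000-11-26", "2001-02-04"]),
    "dollar_roll_sum_7D": [420.0, 558.0, 734.0],
})

# Matching on user_id alone pairs every visit with every feature row of that user.
by_user = toy_agg.merge(toy_roll[["user_id", "dollar_roll_sum_7D"]], on="user_id")

# Matching on both keys keeps one row per visit; validate raises if keys repeat.
by_user_date = toy_agg.merge(toy_roll, on=["user_id", "date"], validate="one_to_one")

print(len(by_user), len(by_user_date))  # 5 3
```

Comparing the merged row count with `len(churn_agg)` is a quick way to catch this kind of accidental row multiplication.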


One takeaway from the above example is that feature engineering can be complicated and relies to some extent on creativity and domain knowledge, as we saw with time series data and RFM. For this reason, some modern machine learning libraries are working on what is called **automated feature engineering**, which tries to have algorithms automatically find a good set of features for the machine learning model to use.

# End of assignment