## Decision Tree Modeling: Model 2

For my second model, I will build a decision tree. A decision tree is a supervised model, used here for classification, that makes predictions by applying a sequence of learned rules. I plan to build the model from customers' historical behaviour (past purchases), demographic information and item attributes, such as price and popularity.

Each internal node in the tree represents a test on a single feature, each branch represents one outcome of that test, and each leaf holds a predicted class. To recommend an item, the customer's characteristics are passed down the tree, and the leaf at the end of the matching path gives the prediction.
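To make the node-and-branch idea concrete, here is a minimal sketch on made-up data. The feature names, the six toy customers and the two product groups are assumptions for illustration only, not values taken from the dataset.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [age, bought_items] -> preferred product group.
X = [[18, 20], [22, 35], [25, 18], [45, 60], [52, 25], [60, 90]]
y = ["Tops", "Tops", "Tops", "Trousers", "Trousers", "Trousers"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned rules: each line is a threshold test or a leaf class.
rules = export_text(tree, feature_names=["age", "bought_items"])
print(rules)

# A new customer is classified by following the branches that match them.
prediction = tree.predict([[30, 40]])[0]
print(prediction)
```

In this toy data only age separates the two groups cleanly, so the printed rules consist of a single age test followed by two leaves.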



In [51]:
# importing necessary Python libraries
import time
import warnings
from datetime import datetime

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
from sklearn.model_selection import KFold, cross_val_score, train_test_split, GridSearchCV, cross_validate
from sklearn.tree import DecisionTreeClassifier

# avoid displaying warnings
warnings.filterwarnings("ignore")

In [2]:
# importing customer dataset
customers = pd.read_csv('../src/data/cleancustomers.csv', index_col=0)
# importing transactions dataset
transactions = pd.read_csv('../src/data/cleantransactions.csv', index_col=0)
# importing articles dataset
articles = pd.read_csv('../src/data/cleanarticles.csv', index_col=0)

Checking the number of transactions per customer 

In [3]:
# checking number of transactions per customer
customercounts = transactions.customer_id.value_counts()

In [4]:
customercounts

be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee985513d9e8e53c6d91b    1641
b4db5e5259234574edfff958e170fe3a5e13b6f146752ca066abca3c156acc71    1321
a65f77281a528bf5c1e9f270141d601d116e1df33bf9df512f495ee06647a9cc    1304
49beaacac0c7801c2ce2d189efe525fe80b5d37e46ed05b50a4cd88e34d0748f    1233
cd04ec2726dd58a8c753e0d6423e57716fd9ebcf2f14ed6012e7e5bea016b4d6    1217
                                                                    ... 
caab3a054f5e7a752412ee02c2e16d760b6d118c44c35962c0669d7aecb95f9b       1
0f060816d7ead7fe5d842325cb65e0225c11fdf1758a2c9e9d175addbdca826c       1
8c35d73f72679d27fed2ab79e96220b45b5242fd17f5dfc0635483a519654f9d       1
8bd64d3f8cfab43f1dec4884777e481979e6bc36637e07f97383c644e364ea0f       1
efc7e4c42b1d5729e4b777b929618aed5bef8bcb73619af5df4821f467417952       1
Name: customer_id, Length: 1362281, dtype: int64

I will start by looking at customers who have made more than 15 transactions.

In [7]:
# selecting those customers with over 15 transactions
customers_15 = customercounts.index[customercounts > 15]
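The count-then-filter pattern used here can be sketched on a miniature table. The toy frames and the threshold of 1 are assumptions chosen so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical miniature transactions and customers tables.
toy_transactions = pd.DataFrame(
    {"customer_id": ["a", "a", "a", "b", "c", "c"],
     "article_id": [1, 2, 3, 1, 2, 4]}
)
toy_customers = pd.DataFrame({"customer_id": ["a", "b", "c"], "age": [30, 41, 25]})

# Count transactions per customer, then keep the ids above a threshold.
counts = toy_transactions["customer_id"].value_counts()
frequent_ids = counts.index[counts > 1]  # the notebook uses > 15

# isin() keeps only the customer rows whose id is in that set.
frequent_customers = toy_customers[toy_customers["customer_id"].isin(frequent_ids)]
print(frequent_customers)
```

Because the filter is strictly greater than the threshold, every customer kept by `customercounts > 15` has at least 16 transactions, so taking their last 15 purchases later always yields a full list.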

Decision trees can take a very long time to fit on data of this size, so I will start with a sample of 10,000 customers.

In [10]:
# filtering the customer table for customers who have made over 15 transactions,
# then taking a reproducible sample of 10,000 of them
samplecustomer = customers[customers['customer_id'].isin(customers_15)].sample(n=10000, random_state=42)

In [11]:
samplecustomer

Unnamed: 0,customer_id,Active,club_member_status,fashion_news_frequency,age,postal_code
182343,21f784e38b53662ed5e54657149112fc283b3a695657aa...,1,ACTIVE,Regularly,24,bca6416aca7ba1fca428bcc70ef5d6cbcd32e384737d81...
849682,9e9cdb717a39ee881d16a2c53f0acf8a3d8c9c7401eada...,1,ACTIVE,Regularly,40,03581b3344d730a36d08757ddc704377db2fe5ec0d8aab...
524532,61efebb5f2d630f26006883f08c8c605947d967264f37c...,1,ACTIVE,Regularly,29,57b75b559ce8717b1110f998a2d901f0ecc7219f959b32...
469768,57bfe3d884947368e312f29560d5fa1dca127114437936...,0,PRE-CREATE,NONE,27,e3a61f56a52c9db8bd840cfd8f1b785d643ff4880701ba...
1355910,fcfde18e06accd996c8361748ec651fafd83120e20409f...,0,ACTIVE,NONE,43,c9a99a31a62101ea2af38568b1f637565dfddcfc852221...
...,...,...,...,...,...,...
406719,4c0ccc7603bc22e6252d79c48dead2872ac60dd818150a...,0,PRE-CREATE,UNKNOWN,54,fe1d5f6e72b95978d23e6380f54278118152f7f158ece8...
1243621,e81564f8e4bba08144d904fdf8fe1fe674580bfa5db2b3...,0,ACTIVE,NONE,56,00e55a934a4bef613c584d0e4f347b9648c07b1ba4b8f3...
120425,1672c3d95c031449a17bc739ea8eab454b84fc40f18ff3...,1,ACTIVE,Regularly,39,9074e11bd825ab41eecf4ea1c6ae55a4eeefe11efaaf3c...
203375,25e9c3e9046e082bfd9c4160beecfcf36d40860efa393b...,0,ACTIVE,NONE,30,09d60d19ec9abf64456d170a1684ccebe2ad53615ad0db...


Now I will split the samplecustomer dataset into a train set and a test set. The model is fitted on the train set only and evaluated on the test set, which gives an unbiased estimate of how well it generalizes to new, unseen data. Because the split is made at the customer level, every transaction of a given customer ends up on one side of the split, which prevents data leakage between the two sets.

Without a separate test set, overfitting would go unnoticed: the model may become too complex and fit noise in the training data rather than the underlying patterns, which results in poor performance on new data.
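A small sketch of a customer-level split, using invented tables, shows how to confirm that no customer appears on both sides once the transactions are merged back in. The toy frames and the seed are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 8 customers with 2 transactions each.
toy_customers = pd.DataFrame({"customer_id": list("abcdefgh"), "age": range(20, 28)})
toy_transactions = pd.DataFrame(
    {"customer_id": list("aabbccddeeffgghh"), "article_id": range(16)}
)

# Split the customers first, then attach each customer's transactions.
cust_train, cust_test = train_test_split(toy_customers, test_size=0.25, random_state=0)
train_rows = cust_train.merge(toy_transactions, on="customer_id", how="inner")
test_rows = cust_test.merge(toy_transactions, on="customer_id", how="inner")

# An empty intersection means no customer leaks from train into test.
overlap = set(train_rows["customer_id"]) & set(test_rows["customer_id"])
print(len(cust_train), len(cust_test), overlap)
```

Splitting the transaction rows directly would instead scatter a single customer's purchases across both sets, which is the leakage the customer-level split avoids.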

In [12]:
# Splitting sample data into train and test (fixed seed so the split is reproducible)
customer_train, customer_test = train_test_split(samplecustomer, test_size=0.25, random_state=42)

In [13]:
customer_train

Unnamed: 0,customer_id,Active,club_member_status,fashion_news_frequency,age,postal_code
1268456,ecb6e236773246368b383988811f9bca5408482b1fee28...,0,ACTIVE,NONE,54,aa7db475b219fcad52ca6ed2acb00a8d1a456ae9cdad58...
1335731,f93a495bc39f0935706713a7aa89248667080bc30199d1...,1,ACTIVE,Regularly,22,4497446069b83f8956c08ad4ad14c142d7b2644f608f2e...
1265844,ec3c51b0c803af2ecf49a3555e97373b07f9f144c8c366...,0,ACTIVE,NONE,21,210bee0df0809e6e4a7b8f5856c033b4de7f175893df8c...
402593,4b43dd31d96dd26776e72f19054dadb46adc40109f4fa8...,0,ACTIVE,NONE,28,84d908904e777fb73fe397ba28475edc487402d9419639...
355053,4268ce8a25ac0707b617633bec3c5dd0ff4291f84f0c23...,0,ACTIVE,NONE,23,1e76d55a6b7510e37adb208e9191dc3b403b2dcaf2e895...
...,...,...,...,...,...,...
1352677,fc620a365316ccdc9b20ec35ee3d7cb4292e3d999c1287...,1,ACTIVE,Regularly,20,479c52f72347575d310aa45ddb8df2616f69e0ababf8ed...
382357,47818992f78a1e3352c66560ebd11907f69fa95c2073dd...,1,ACTIVE,Regularly,55,5ab32317cfc42eedafd91305ddfd37a6bbf467cdeae50f...
1087544,caf5ce756422dbe2bae05f5421f105c85f1f01f3b61850...,0,ACTIVE,NONE,46,e5f8ab27942c4d428512c8b744e370b89a71729800dfd6...
1048308,c3ab2a3ecf1c4cb18efece84fe777eebdd39935840dc94...,1,ACTIVE,Regularly,24,b04f84cc06b7723b306f7336e11e4f781aee033ce2ef1f...


In [8]:
articles.head(5)

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


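Many columns in the articles table carry the same information twice, typically as a numeric code and a matching text name. I will drop one column from each such pair. Redundant features add nothing to a decision tree, and dropping them also reduces memory use and speeds up the later merges.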
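Before dropping a column as a duplicate, it can be worth checking that two columns really are in one-to-one correspondence. The helper below is a hypothetical sketch (`is_one_to_one` is not part of pandas), shown on a toy frame.

```python
import pandas as pd

# Hypothetical slice of an articles-like table.
toy_articles = pd.DataFrame({
    "colour_group_code": [9, 10, 11, 9],
    "colour_group_name": ["Black", "White", "Off White", "Black"],
    "section_no": [16, 16, 61, 61],
})

def is_one_to_one(df, col_a, col_b):
    """Return True if each col_a value maps to exactly one col_b value and vice versa."""
    a_to_b = df.groupby(col_a)[col_b].nunique().max() == 1
    b_to_a = df.groupby(col_b)[col_a].nunique().max() == 1
    return bool(a_to_b and b_to_a)

code_vs_name = is_one_to_one(toy_articles, "colour_group_code", "colour_group_name")
code_vs_section = is_one_to_one(toy_articles, "colour_group_code", "section_no")
print(code_vs_name, code_vs_section)
```

A pair that passes this check carries identical information in both columns, so either one can be dropped safely.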

In [14]:
# drop redundant columns from articles df
redundant_columns = [
    'prod_name', 'graphical_appearance_name', 'colour_group_name', 'department_name',
    'perceived_colour_value_id', 'perceived_colour_master_id', 'index_name',
    'index_group_name', 'section_name', 'garment_group_name',
]
reducedarticles = articles.drop(columns=redundant_columns)

In [15]:
reducedarticles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105126 entries, 0 to 105125
Data columns (total 15 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   article_id                    105126 non-null  int64 
 1   product_code                  105126 non-null  int64 
 2   product_type_no               105126 non-null  int64 
 3   product_type_name             105126 non-null  object
 4   product_group_name            105126 non-null  object
 5   graphical_appearance_no       105126 non-null  int64 
 6   colour_group_code             105126 non-null  int64 
 7   perceived_colour_value_name   105126 non-null  object
 8   perceived_colour_master_name  105126 non-null  object
 9   department_no                 105126 non-null  int64 
 10  index_code                    105126 non-null  object
 11  index_group_no                105126 non-null  int64 
 12  section_no                    105126 non-null  int64 
 13 

In [16]:
# merging articles and transaction tables into new df 
transactions_articles = transactions.merge(reducedarticles, on='article_id', how='left')

In [17]:
transactions_articles.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_purchase_count,product_code,product_type_no,product_type_name,product_group_name,graphical_appearance_no,colour_group_code,perceived_colour_value_name,perceived_colour_master_name,department_no,index_code,index_group_no,section_no,garment_group_no,detail_desc
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2,1,541518.0,306.0,Bra,Underwear,1010016.0,51.0,Dusty Light,Pink,1334.0,B,1.0,61.0,1017.0,"Lace push-up bras with underwired, moulded, pa..."
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2,1,663713.0,283.0,Underwear body,Underwear,1010016.0,9.0,Dark,Black,1338.0,B,1.0,61.0,1017.0,"Lace push-up body with underwired, moulded, pa..."
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221001,0.020322,2,1,505221.0,252.0,Sweater,Garment Upper body,1010010.0,7.0,Medium Dusty,Unknown,5963.0,D,2.0,58.0,1003.0,Jumper in rib-knit cotton with hard-worn detai...
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2,1,505221.0,252.0,Sweater,Garment Upper body,1010010.0,52.0,Medium Dusty,Pink,5963.0,D,2.0,58.0,1003.0,Jumper in rib-knit cotton with hard-worn detai...
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687001,0.016932,2,1,685687.0,252.0,Sweater,Garment Upper body,1010010.0,8.0,Dark,Grey,3090.0,A,1.0,15.0,1023.0,V-neck knitted jumper with long sleeves and ri...


In [21]:
transactions_articles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28813419 entries, 0 to 28813418
Data columns (total 20 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   t_dat                         object 
 1   customer_id                   object 
 2   article_id                    int64  
 3   price                         float64
 4   sales_channel_id              int64  
 5   article_purchase_count        int64  
 6   product_code                  float64
 7   product_type_no               float64
 8   product_type_name             object 
 9   product_group_name            object 
 10  graphical_appearance_no       float64
 11  colour_group_code             float64
 12  perceived_colour_value_name   object 
 13  perceived_colour_master_name  object 
 14  department_no                 float64
 15  index_code                    object 
 16  index_group_no                float64
 17  section_no                    float64
 18  garment_group_no    
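The article columns switch from int64 to float64 after the left merge, which is what pandas does when unmatched rows introduce NaN into an integer column. The sketch below reproduces that effect on invented data and uses the `indicator` option of `merge` to count the unmatched rows.

```python
import pandas as pd

# Hypothetical tables: article 999 has no entry in the articles table.
toy_transactions = pd.DataFrame({"article_id": [101, 102, 999], "price": [0.03, 0.05, 0.02]})
toy_articles = pd.DataFrame({"article_id": [101, 102], "product_code": [10, 11]})

# indicator=True adds a '_merge' column recording where each row came from.
merged = toy_transactions.merge(toy_articles, on="article_id", how="left", indicator=True)

# Rows tagged 'left_only' had no matching article, so their article columns are NaN.
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged["product_code"].dtype, unmatched)
```

The same effect shows up later in `train.info()`, where `product_code` has 371215 non-null values out of 372549 rows.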

In [22]:
# Merging transactions_articles df with customer train 
train = customer_train.merge(transactions_articles, on='customer_id', how='inner')
test = customer_test.merge(transactions_articles, on='customer_id', how='inner')

In [23]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 372549 entries, 0 to 372548
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   customer_id                   372549 non-null  object 
 1   Active                        372549 non-null  int64  
 2   club_member_status            372549 non-null  object 
 3   fashion_news_frequency        372549 non-null  object 
 4   age                           372549 non-null  int64  
 5   postal_code                   372549 non-null  object 
 6   t_dat                         372549 non-null  object 
 7   article_id                    372549 non-null  int64  
 8   price                         372549 non-null  float64
 9   sales_channel_id              372549 non-null  int64  
 10  article_purchase_count        372549 non-null  int64  
 11  product_code                  371215 non-null  float64
 12  product_type_no               371215 non-nul

In [24]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 120767 entries, 0 to 120766
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   customer_id                   120767 non-null  object 
 1   Active                        120767 non-null  int64  
 2   club_member_status            120767 non-null  object 
 3   fashion_news_frequency        120767 non-null  object 
 4   age                           120767 non-null  int64  
 5   postal_code                   120767 non-null  object 
 6   t_dat                         120767 non-null  object 
 7   article_id                    120767 non-null  int64  
 8   price                         120767 non-null  float64
 9   sales_channel_id              120767 non-null  int64  
 10  article_purchase_count        120767 non-null  int64  
 11  product_code                  120330 non-null  float64
 12  product_type_no               120330 non-nul

In [55]:
train = train.sort_values(['customer_id', 't_dat'])
test = test.sort_values(['customer_id', 't_dat'])

In [56]:
train

Unnamed: 0,customer_id,Active,club_member_status,fashion_news_frequency,age,postal_code,t_dat,article_id,price,sales_channel_id,...,graphical_appearance_no,colour_group_code,perceived_colour_value_name,perceived_colour_master_name,department_no,index_code,index_group_no,section_no,garment_group_no,detail_desc
205292,0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c91...,1,ACTIVE,Regularly,30,7c902dca60ee0bd0f9030eefd445d11146e8d24835738a...,2018-09-29,373506019,0.010831,1,...,1010010.0,10.0,Light,White,3610.0,B,1.0,62.0,1021.0,Fine-knit liner socks designed to be hidden in...
205293,0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c91...,1,ACTIVE,Regularly,30,7c902dca60ee0bd0f9030eefd445d11146e8d24835738a...,2018-12-07,564309022,0.013542,2,...,1010016.0,9.0,Dark,Black,7988.0,F,3.0,26.0,1017.0,Trunks in stretch cotton jersey with flatlock ...
205294,0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c91...,1,ACTIVE,Regularly,30,7c902dca60ee0bd0f9030eefd445d11146e8d24835738a...,2018-12-07,564309022,0.013559,2,...,1010016.0,9.0,Dark,Black,7988.0,F,3.0,26.0,1017.0,Trunks in stretch cotton jersey with flatlock ...
205295,0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c91...,1,ACTIVE,Regularly,30,7c902dca60ee0bd0f9030eefd445d11146e8d24835738a...,2018-12-07,661166006,0.040661,2,...,1010016.0,32.0,Medium,Orange,1616.0,A,1.0,11.0,1003.0,"Jumper in a soft, fine-knit wool blend with ri..."
205296,0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c91...,1,ACTIVE,Regularly,30,7c902dca60ee0bd0f9030eefd445d11146e8d24835738a...,2018-12-07,676990001,0.040661,2,...,1010010.0,6.0,Dusty Light,Grey,1610.0,A,1.0,6.0,1003.0,Long jumper in a soft knit in a relaxed fit wi...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161059,ffed87e65b5196e7ac9d8631e35a3c4243a169fc06a7ea...,0,ACTIVE,NONE,37,493d2362f919f80aac0a69edaae9cae505b85592452545...,2019-06-20,768912001,0.016136,1,...,1010016.0,9.0,Dark,Black,1676.0,A,1.0,16.0,1002.0,Leggings in soft jersey made from an organic c...
161060,ffed87e65b5196e7ac9d8631e35a3c4243a169fc06a7ea...,0,ACTIVE,NONE,37,493d2362f919f80aac0a69edaae9cae505b85592452545...,2019-11-23,784856005,0.025407,2,...,1010016.0,11.0,Dusty Light,White,3611.0,B,1.0,62.0,1021.0,"Slippers in soft, rib-knit chenille. Fold-down..."
161061,ffed87e65b5196e7ac9d8631e35a3c4243a169fc06a7ea...,0,ACTIVE,NONE,37,493d2362f919f80aac0a69edaae9cae505b85592452545...,2019-11-23,792058001,0.042356,2,...,1010026.0,9.0,Dark,Black,3709.0,B,1.0,62.0,1017.0,"Top in supersoft pile with a wrapover, jersey-..."
161062,ffed87e65b5196e7ac9d8631e35a3c4243a169fc06a7ea...,0,ACTIVE,NONE,37,493d2362f919f80aac0a69edaae9cae505b85592452545...,2019-11-23,792058002,0.042356,2,...,1010002.0,10.0,Light,White,3709.0,B,1.0,62.0,1017.0,"Top in supersoft pile with a wrapover, jersey-..."


To make the model as accurate as possible, I will use the last 15 items that each customer bought to represent their historical behaviour in the decision tree model. Restricting the history to recent purchases should limit the effect of a customer's fashion taste changing over time.
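Selecting the most recent purchases per customer relies on the rows being in date order within each customer. A minimal sketch with a toy frame and N = 2 (both assumptions for illustration):

```python
import pandas as pd

# Hypothetical transactions, deliberately out of date order.
toy = pd.DataFrame({
    "customer_id": ["a", "b", "a", "a", "b"],
    "t_dat": ["2020-03-01", "2020-01-05", "2020-01-01", "2020-02-01", "2020-02-10"],
    "article_id": [3, 10, 1, 2, 11],
})

# Sort chronologically within each customer, then keep the last N rows per group.
toy_sorted = toy.sort_values(["customer_id", "t_dat"])
last_two = toy_sorted.groupby("customer_id").tail(2)

# Collect each customer's most recent purchases into a list, oldest first.
recent = last_two.groupby("customer_id")["article_id"].apply(list)
print(recent)
```

ISO-formatted date strings such as `2020-03-01` sort correctly as plain text, so the sort works even while `t_dat` is still an object column.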

In [25]:
# getting the index number of last 15 transactions for each customer 
train15 = train.groupby('customer_id', observed=True).tail(15).index

In [26]:
train15

Int64Index([    90,     91,     92,     93,     94,     95,     96,     97,
                98,     99,
            ...
            372539, 372540, 372541, 372542, 372543, 372544, 372545, 372546,
            372547, 372548],
           dtype='int64', length=112500)

In [29]:
test15 = test.groupby('customer_id', observed=True).tail(15).index

In [60]:
test15

Int64Index([124390, 124391, 124392, 124393, 124394, 124395, 124396, 124397,
            124398, 124399,
            ...
            119071, 119072, 119073, 119074, 119075, 119076, 119077, 119078,
            119079, 119080],
           dtype='int64', length=37500)

In [39]:
# saving previous transactions_articles table to use for other model notebooks
transactions_articles.to_csv('merged_df1.csv', index=False)

In [30]:
# selecting the last 15 transactions for each customer using the indexes from train15 and test15
# (only the last expression, the test selection, is displayed below)
train.loc[train15,:]
test.loc[test15,:]

Unnamed: 0,customer_id,Active,club_member_status,fashion_news_frequency,age,postal_code,t_dat,article_id,price,sales_channel_id,...,graphical_appearance_no,colour_group_code,perceived_colour_value_name,perceived_colour_master_name,department_no,index_code,index_group_no,section_no,garment_group_no,detail_desc
55,328bf991e38bea51519e362885a545a95dfe8889d95117...,1,ACTIVE,Regularly,50,e77824dde8c7d846d76146147d61c0aff720d293bd697a...,2019-12-22,659832004,0.005237,2,...,1010016.0,10.0,Light,White,1643.0,D,2.0,51.0,1002.0,"Short, off-the-shoulder top in ribbed jersey w..."
56,328bf991e38bea51519e362885a545a95dfe8889d95117...,1,ACTIVE,Regularly,50,e77824dde8c7d846d76146147d61c0aff720d293bd697a...,2019-12-22,738943004,0.019678,2,...,1010010.0,93.0,Dark,Green,1626.0,A,1.0,15.0,1003.0,Boxy polo-neck jumper in a soft knit containin...
57,328bf991e38bea51519e362885a545a95dfe8889d95117...,1,ACTIVE,Regularly,50,e77824dde8c7d846d76146147d61c0aff720d293bd697a...,2019-12-22,744291001,0.010492,2,...,1010016.0,42.0,Medium,Red,1640.0,D,2.0,53.0,1005.0,"Short, off-the-shoulder top in crinkled jersey..."
58,328bf991e38bea51519e362885a545a95dfe8889d95117...,1,ACTIVE,Regularly,50,e77824dde8c7d846d76146147d61c0aff720d293bd697a...,2019-12-29,224606019,0.008458,2,...,1010016.0,9.0,Dark,Black,3509.0,C,1.0,65.0,1019.0,Narrow belt in grained imitation leather with ...
59,328bf991e38bea51519e362885a545a95dfe8889d95117...,1,ACTIVE,Regularly,50,e77824dde8c7d846d76146147d61c0aff720d293bd697a...,2019-12-29,728997001,0.025407,2,...,1010016.0,9.0,Dark,Black,3509.0,C,1.0,65.0,1019.0,Leather belt with a metal buckle. Width 2.8 cm.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120762,afe9980e05079d2b078c9418d0554bb55812b4689c9c83...,0,ACTIVE,NONE,18,7954b99f03af96349ad705e250afcbe312022cc3af25f6...,2020-02-12,757903001,0.006763,1,...,1010010.0,7.0,Dusty Light,Grey,3409.0,C,1.0,65.0,1019.0,Scarf in a soft weave with fringes on the shor...
120763,afe9980e05079d2b078c9418d0554bb55812b4689c9c83...,0,ACTIVE,NONE,18,7954b99f03af96349ad705e250afcbe312022cc3af25f6...,2020-02-12,780418002,0.016932,1,...,1010016.0,7.0,Dusty Light,Grey,3945.0,D,2.0,52.0,1019.0,"Scarf in a soft, fine knit with fringes on the..."
120764,afe9980e05079d2b078c9418d0554bb55812b4689c9c83...,0,ACTIVE,NONE,18,7954b99f03af96349ad705e250afcbe312022cc3af25f6...,2020-07-07,610776087,0.008458,1,...,1010017.0,22.0,Medium,Yellow,1676.0,A,1.0,16.0,1002.0,T-shirt in lightweight jersey with a rounded h...
120765,afe9980e05079d2b078c9418d0554bb55812b4689c9c83...,0,ACTIVE,NONE,18,7954b99f03af96349ad705e250afcbe312022cc3af25f6...,2020-07-07,879248002,0.016932,1,...,1010016.0,19.0,Dark,Khaki green,1676.0,A,1.0,16.0,1002.0,Short shorts in lightweight sweatshirt fabric ...


In [49]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 372549 entries, 0 to 372548
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   customer_id                   372549 non-null  object 
 1   Active                        372549 non-null  int64  
 2   club_member_status            372549 non-null  object 
 3   fashion_news_frequency        372549 non-null  object 
 4   age                           372549 non-null  int64  
 5   postal_code                   372549 non-null  object 
 6   t_dat                         372549 non-null  object 
 7   article_id                    372549 non-null  int64  
 8   price                         372549 non-null  float64
 9   sales_channel_id              372549 non-null  int64  
 10  article_purchase_count        372549 non-null  int64  
 11  product_code                  371215 non-null  float64
 12  product_type_no               371215 non-nul

In [31]:
# customers and their last 15 purchases 
y_train = train.loc[train15,:].groupby('customer_id', observed=True)['article_id'].apply(lambda x: x.tolist())
y_test =  test.loc[test15,:].groupby('customer_id', observed=True)['article_id'].apply(lambda x: x.tolist())

In [32]:
y_train

customer_id
0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c9199e53dbb81641becd7    [841383002, 864716001, 889714001, 685814048, 8...
000346516dd355b40badca0c0f5f37a318ddae31f0e0f76a3a0454eb591b6384    [642437010, 685823003, 745232001, 456163030, 6...
000a5f3c8be9167cb0d09dd8a17b6b54998e9e83faaf52072ee9434b85c85dd3    [817401001, 817401003, 832309007, 839959002, 8...
000c886e014a122bd9066501103e3f4a3ec157af27399a5f6fa2dc540c123356    [751556005, 457722013, 490113004, 850617001, 6...
00180a85401e63b5063a3ffbf51da92002944c06c161b16e90ed7f66527a566d    [693246002, 744910001, 751254001, 761579001, 5...
                                                                                          ...                        
ffe42e2da309097faed4b0544399d193a989dfb6563b811da06bd9812026f72e    [823118001, 823118002, 823165001, 853393002, 8...
ffe51c3eb582e649b4f3de8634463cfb2a089740c8d8c5426fafbd6f9977e09a    [827411001, 850244001, 860949001, 880792001, 3...
ffe8cceee64827679ba7535d4ad9427316610466f914

In [36]:
# taking the second article (position 1) from each customer's last 15 purchases as the target label;
# note this is a fixed position, not a random draw
one_y_train = y_train.apply(lambda x: x[1])
one_y_test = y_test.apply(lambda x: x[1])
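Indexing with `x[1]` always returns the item at the same position in each list. If a genuinely random pick per customer were wanted, one possible sketch uses a seeded NumPy generator. The toy Series and the seed are assumptions, and this is not what the notebook's results are based on.

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer purchase lists.
baskets = pd.Series({"a": [11, 12, 13], "b": [21, 22, 23]})

# Seeded generator so the random picks are reproducible.
rng = np.random.default_rng(42)

# Fixed position: always the second item of each list.
fixed_pick = baskets.apply(lambda items: items[1])

# Random position: a different index may be drawn for every customer.
random_pick = baskets.apply(lambda items: items[rng.integers(len(items))])
print(fixed_pick.tolist(), random_pick.tolist())
```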

In [37]:
one_y_train

customer_id
0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c9199e53dbb81641becd7    864716001
000346516dd355b40badca0c0f5f37a318ddae31f0e0f76a3a0454eb591b6384    685823003
000a5f3c8be9167cb0d09dd8a17b6b54998e9e83faaf52072ee9434b85c85dd3    817401003
000c886e014a122bd9066501103e3f4a3ec157af27399a5f6fa2dc540c123356    457722013
00180a85401e63b5063a3ffbf51da92002944c06c161b16e90ed7f66527a566d    744910001
                                                                      ...    
ffe42e2da309097faed4b0544399d193a989dfb6563b811da06bd9812026f72e    823118002
ffe51c3eb582e649b4f3de8634463cfb2a089740c8d8c5426fafbd6f9977e09a    850244001
ffe8cceee64827679ba7535d4ad9427316610466f91418d3ee4c16a42c4764c0    697119002
ffecb79ac8c7eb58476bf69ebaaa03cc2d1bd16197d0e444256317ea03e1129c    790645005
ffee1b7dd2cdde971d8082beaba080326b21a71358f28e449da9a2c432a0822c    649098003
Name: article_id, Length: 7500, dtype: int64

In [38]:
one_y_test

customer_id
002e0bc901590e07341001b2358bded94031771537c24b6a7805c9b82cc95a1a    452618001
007b0039f5f4af2f5936a728efb425dca66adb7576ba18f3235266342fe26226    757995009
008298e14999e4e19a4d3b667e86b314db208cf94e63e1515bd1287ba71382f6    756189002
008ce803acbbcb2659d8dfe9c073e9f422367f19443a13062621aa22342e578f    727611001
009a729f3c2e43f01ccaa31a2c4be7d6c453dde6a200e4a5953ae519c575ace6    832481001
                                                                      ...    
ffceea30ef718f589d3e4a6987e45f6d634a7f7add65ab40636f723623818ef9    739680002
ffd094f5287e36345b647669162a7668c40d66c269409a7a52fec6b3765271bc    762856007
ffe44295c63a13498687134a9e6ee5c57e08d84bfffa0d810821c1181b8873e1    692709002
ffed87e65b5196e7ac9d8631e35a3c4243a169fc06a7ea0b8e9b80831abaefc0    708478001
fff71dcdcda46de4b2c36f782c9f64376707edbb5df110a3ae3eb5079844aad2    567993017
Name: article_id, Length: 2500, dtype: int64

In [39]:
train

Unnamed: 0,customer_id,Active,club_member_status,fashion_news_frequency,age,postal_code,t_dat,article_id,price,sales_channel_id,...,graphical_appearance_no,colour_group_code,perceived_colour_value_name,perceived_colour_master_name,department_no,index_code,index_group_no,section_no,garment_group_no,detail_desc
0,ecb6e236773246368b383988811f9bca5408482b1fee28...,0,ACTIVE,NONE,54,aa7db475b219fcad52ca6ed2acb00a8d1a456ae9cdad58...,2018-10-19,663236002,0.050831,2,...,1010010.0,22.0,Medium Dusty,Yellow,1626.0,A,1.0,15.0,1003.0,Polo-neck jumper in a soft knit containing som...
1,ecb6e236773246368b383988811f9bca5408482b1fee28...,0,ACTIVE,NONE,54,aa7db475b219fcad52ca6ed2acb00a8d1a456ae9cdad58...,2018-11-02,529953006,0.033881,2,...,1010016.0,42.0,Medium,Red,1747.0,D,2.0,53.0,1009.0,Treggings with a high waist and concealed elas...
2,ecb6e236773246368b383988811f9bca5408482b1fee28...,0,ACTIVE,NONE,54,aa7db475b219fcad52ca6ed2acb00a8d1a456ae9cdad58...,2018-11-02,656719003,0.050831,2,...,1010004.0,83.0,Dark,Green,1722.0,A,1.0,15.0,1009.0,Tailored trousers in a stretch weave with two ...
3,ecb6e236773246368b383988811f9bca5408482b1fee28...,0,ACTIVE,NONE,54,aa7db475b219fcad52ca6ed2acb00a8d1a456ae9cdad58...,2018-12-03,663986003,0.016085,2,...,1010016.0,9.0,Dark,Black,1322.0,A,1.0,15.0,1013.0,"Short, fitted dress in velour with a small sta..."
4,ecb6e236773246368b383988811f9bca5408482b1fee28...,0,ACTIVE,NONE,54,aa7db475b219fcad52ca6ed2acb00a8d1a456ae9cdad58...,2019-05-12,471714002,0.016932,2,...,1010016.0,13.0,Dusty Light,Mole,5690.0,F,3.0,55.0,1025.0,Knee-length shorts in a cotton weave with a bu...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
372544,e204a7a9090584034ec0b658bdce4520f2fba3b610fba3...,1,ACTIVE,Regularly,61,77180d26d7666a016f4538ca3caacaff0e9eef7ce82195...,2020-02-21,751664001,0.025407,1,...,1010016.0,9.0,Dark,Black,1722.0,A,1.0,15.0,1009.0,Treggings in superstretch twill with an elasti...
372545,e204a7a9090584034ec0b658bdce4520f2fba3b610fba3...,1,ACTIVE,Regularly,61,77180d26d7666a016f4538ca3caacaff0e9eef7ce82195...,2020-02-21,796210010,0.025407,1,...,1010001.0,9.0,Dark,Black,1722.0,A,1.0,15.0,1009.0,Trousers in superstretch twill with a high wai...
372546,e204a7a9090584034ec0b658bdce4520f2fba3b610fba3...,1,ACTIVE,Regularly,61,77180d26d7666a016f4538ca3caacaff0e9eef7ce82195...,2020-03-03,598515031,0.016932,1,...,1010006.0,9.0,Dark,Black,1336.0,B,1.0,61.0,1017.0,Hipster briefs in organic cotton jersey with l...
372547,e204a7a9090584034ec0b658bdce4520f2fba3b610fba3...,1,ACTIVE,Regularly,61,77180d26d7666a016f4538ca3caacaff0e9eef7ce82195...,2020-06-19,853881001,0.016932,1,...,1010016.0,9.0,Dark,Black,8310.0,S,26.0,5.0,1005.0,Sports shorts in fast-drying functional fabric...


In [88]:
customers

Unnamed: 0,customer_id,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0,ACTIVE,NONE,49,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0,ACTIVE,NONE,25,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0,ACTIVE,NONE,24,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0,ACTIVE,NONE,54,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1,ACTIVE,Regularly,52,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...
...,...,...,...,...,...,...
1371975,ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474...,0,ACTIVE,NONE,24,7aa399f7e669990daba2d92c577b52237380662f36480b...
1371976,ffffcd5046a6143d29a04fb8c424ce494a76e5cdf4fab5...,0,ACTIVE,NONE,21,3f47f1279beb72215f4de557d950e0bfa73789d24acb5e...
1371977,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,1,ACTIVE,Regularly,21,4563fc79215672cd6a863f2b4bf56b8f898f2d96ed590e...
1371978,ffffd7744cebcf3aca44ae7049d2a94b87074c3d4ffe38...,1,ACTIVE,Regularly,18,8892c18e9bc3dca6aa4000cb8094fc4b51ee8db2ed14d7...


Now I will create a function which I can apply to the train and test dfs to remove unnecessary columns. I am going to focus on customer demographic features and the number of items each customer has bought. The categorical features need to be turned into binary columns before fitting the decision tree, so I won't use information about the items: there are far too many products to binarize those columns. I have dropped postal_code because the actual postcodes have been anonymized, meaning we can't get much information from it.

In [43]:
def useful_features(group):
    # group holds all merged rows for one customer; the demographic columns are
    # identical on every row, so the first row is enough
    features_rows = {
        'Active': group['Active'].iloc[0],
        'club_member_status': group['club_member_status'].iloc[0],
        'fashion_news_frequency': group['fashion_news_frequency'].iloc[0],
        'age': group['age'].iloc[0]}
    # number of transactions (rows) for this customer
    features_rows['bought_items'] = group.shape[0]
    return pd.Series(features_rows)

In [44]:
# applying useful features function to make new train and test df 
x_train = train.groupby('customer_id', observed=True).apply(useful_features) 

x_test  = test.groupby('customer_id', observed=True).apply(useful_features)
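The same per-customer table can also be sketched with pandas named aggregation, which avoids calling a Python function once per group and keeps numeric dtypes intact. This is an alternative under assumed toy data, not the approach used for the results in this notebook.

```python
import pandas as pd

# Hypothetical merged rows: one row per transaction, demographics repeated.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "Active": [1, 1, 0],
    "club_member_status": ["ACTIVE", "ACTIVE", "PRE-CREATE"],
    "age": [30, 30, 25],
    "article_id": [1, 2, 3],
})

# One output column per keyword: (source column, aggregation).
features = toy.groupby("customer_id").agg(
    Active=("Active", "first"),
    club_member_status=("club_member_status", "first"),
    age=("age", "first"),
    bought_items=("article_id", "size"),  # number of rows per customer
)
print(features)
```

On a few hundred thousand rows this form is typically much faster than `groupby(...).apply(...)` with a function that builds a Series for every group.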

In [45]:
x_train.head()

Unnamed: 0_level_0,Active,club_member_status,fashion_news_frequency,age,bought_items
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c9199e53dbb81641becd7,1,ACTIVE,Regularly,30,123
000346516dd355b40badca0c0f5f37a318ddae31f0e0f76a3a0454eb591b6384,0,ACTIVE,NONE,25,20
000a5f3c8be9167cb0d09dd8a17b6b54998e9e83faaf52072ee9434b85c85dd3,0,PRE-CREATE,NONE,36,21
000c886e014a122bd9066501103e3f4a3ec157af27399a5f6fa2dc540c123356,0,ACTIVE,NONE,18,29
00180a85401e63b5063a3ffbf51da92002944c06c161b16e90ed7f66527a566d,0,ACTIVE,NONE,39,25


Next I will one-hot encode the categorical customer features (Active, club_member_status and fashion_news_frequency) so that the decision tree receives only numeric inputs.

In [50]:
# applying one-hot encoding to the X train and X test dataframes for the 'Active', 'club_member_status' and 'fashion_news_frequency' columns
x_train_encoded = pd.get_dummies(x_train, columns=['Active', 'club_member_status', 'fashion_news_frequency'])
x_test_encoded = pd.get_dummies(x_test, columns=['Active', 'club_member_status', 'fashion_news_frequency'])
# making sure the test columns match the train columns in name and order
x_test_encoded = x_test_encoded.reindex(columns=x_train_encoded.columns, fill_value=0)
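One caveat when `pd.get_dummies` is called separately on train and test: each frame only gets columns for the categories it actually contains. The sketch below uses invented data (the `OTHER` status is a made-up category) to show the mismatch and one way to align the frames.

```python
import pandas as pd

# Hypothetical frames: the test set happens to contain only one status value.
toy_train = pd.DataFrame({"age": [30, 25, 41], "status": ["ACTIVE", "PRE-CREATE", "OTHER"]})
toy_test = pd.DataFrame({"age": [22, 35], "status": ["ACTIVE", "ACTIVE"]})

train_enc = pd.get_dummies(toy_train, columns=["status"])
test_enc = pd.get_dummies(toy_test, columns=["status"])

# Encoded separately, the test frame is missing two of the dummy columns.
print(list(train_enc.columns))
print(list(test_enc.columns))

# Reindex the test frame to the training columns; missing dummies become 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns) == list(train_enc.columns))
```

With 7,500 training and 2,500 test customers the category sets are likely to coincide, but aligning explicitly guards against a fitted model receiving columns in a different number or order.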

In [64]:
# Create decision tree classifier object
clf = DecisionTreeClassifier()

# Train decision tree classifier
clf = clf.fit(x_train_encoded, one_y_train)

# Predict on the training set (used below to measure training accuracy)
y_pred = clf.predict(x_train_encoded)

In [65]:
print(y_pred)

[855249006 188183015 631878006 ... 697119002 790645005 649098003]


In [66]:
y_pred.shape

(7500,)

In [68]:
print("Accuracy:",metrics.accuracy_score(one_y_train, y_pred))

Accuracy: 0.5805333333333333


In [70]:
#Predict the response for test dataset
y_pred_test = clf.predict(x_test_encoded)

In [71]:
print("Accuracy y_test:",metrics.accuracy_score(one_y_test, y_pred_test))

Accuracy y_test: 0.0


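Accuracy is 0.58 on the train set but zero on the test set, so the model does not generalize at all. The tree has largely memorized the training customers: the target is an individual article_id, so with 7,500 training customers there can be up to 7,500 distinct classes with very few examples of each, and a handful of demographic features cannot separate them. I will now move on to matrix multiplication/collaborative filtering as an alternative way of producing recommendations.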
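A quick diagnostic for this kind of result is to count the distinct target classes and to check what share of test labels ever appeared in training. A classifier can only predict labels it has seen, so that share is an upper bound on exact-match test accuracy. The label values below are invented for illustration.

```python
import pandas as pd

# Hypothetical target labels (article ids) for train and test customers.
train_labels = pd.Series([101, 102, 103, 104, 102])
test_labels = pd.Series([201, 202, 102])

# Number of distinct classes the classifier has to choose between.
n_classes = train_labels.nunique()

# Share of test labels seen during training: a ceiling on test accuracy.
seen_share = test_labels.isin(train_labels.unique()).mean()
print(n_classes, round(seen_share, 2))
```

Applied to `one_y_train` and `one_y_test`, a low `seen_share` would confirm that exact article prediction from demographics alone cannot succeed, whatever the tree's hyperparameters.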