## Baseline Model 1: Recommending the most popular product from the customer's most-bought category

In this notebook I will create a baseline model that finds the most popular product within each product type and recommends to each customer the most popular product from the product type they buy most often. For example, if a customer buys trousers more than any other product type, I will recommend the most popular product within trousers.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from scipy.stats import norm

In [2]:
# load data
df = pd.read_csv('../src/data/merged_df1.csv')

In [3]:
df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_purchase_count,product_code,product_type_no,product_type_name,product_group_name,graphical_appearance_no,colour_group_code,perceived_colour_value_name,perceived_colour_master_name,department_no,index_code,index_group_no,section_no,garment_group_no,detail_desc
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2,1,541518.0,306.0,Bra,Underwear,1010016.0,51.0,Dusty Light,Pink,1334.0,B,1.0,61.0,1017.0,"Lace push-up bras with underwired, moulded, pa..."
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2,1,663713.0,283.0,Underwear body,Underwear,1010016.0,9.0,Dark,Black,1338.0,B,1.0,61.0,1017.0,"Lace push-up body with underwired, moulded, pa..."
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221001,0.020322,2,1,505221.0,252.0,Sweater,Garment Upper body,1010010.0,7.0,Medium Dusty,Unknown,5963.0,D,2.0,58.0,1003.0,Jumper in rib-knit cotton with hard-worn detai...
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2,1,505221.0,252.0,Sweater,Garment Upper body,1010010.0,52.0,Medium Dusty,Pink,5963.0,D,2.0,58.0,1003.0,Jumper in rib-knit cotton with hard-worn detai...
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687001,0.016932,2,1,685687.0,252.0,Sweater,Garment Upper body,1010010.0,8.0,Dark,Grey,3090.0,A,1.0,15.0,1023.0,V-neck knitted jumper with long sleeves and ri...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25454992 entries, 0 to 25454991
Data columns (total 20 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   t_dat                         object 
 1   customer_id                   object 
 2   article_id                    int64  
 3   price                         float64
 4   sales_channel_id              int64  
 5   article_purchase_count        int64  
 6   product_code                  float64
 7   product_type_no               float64
 8   product_type_name             object 
 9   product_group_name            object 
 10  graphical_appearance_no       float64
 11  colour_group_code             float64
 12  perceived_colour_value_name   object 
 13  perceived_colour_master_name  object 
 14  department_no                 float64
 15  index_code                    object 
 16  index_group_no                float64
 17  section_no                    float64
 18  garment_group_no    

In [7]:
# checking number of transactions per customer
customercounts = df.customer_id.value_counts()

In [8]:
customercounts

be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee985513d9e8e53c6d91b    1641
b4db5e5259234574edfff958e170fe3a5e13b6f146752ca066abca3c156acc71    1321
a65f77281a528bf5c1e9f270141d601d116e1df33bf9df512f495ee06647a9cc    1304
49beaacac0c7801c2ce2d189efe525fe80b5d37e46ed05b50a4cd88e34d0748f    1233
cd04ec2726dd58a8c753e0d6423e57716fd9ebcf2f14ed6012e7e5bea016b4d6    1217
                                                                    ... 
caab3a054f5e7a752412ee02c2e16d760b6d118c44c35962c0669d7aecb95f9b       1
0f060816d7ead7fe5d842325cb65e0225c11fdf1758a2c9e9d175addbdca826c       1
8c35d73f72679d27fed2ab79e96220b45b5242fd17f5dfc0635483a519654f9d       1
8bd64d3f8cfab43f1dec4884777e481979e6bc36637e07f97383c644e364ea0f       1
efc7e4c42b1d5729e4b777b929618aed5bef8bcb73619af5df4821f467417952       1
Name: customer_id, Length: 1362281, dtype: int64

For this model to be reasonably effective, I need to restrict it to customers who have made enough purchases for their most purchased product type to be meaningful. I will therefore select customers with more than 50 transactions.

In [9]:
# selecting those customers with over 50 transactions
customers_50 = customercounts.index[customercounts > 50]

In [10]:
customers_50

Index(['be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee985513d9e8e53c6d91b',
       'b4db5e5259234574edfff958e170fe3a5e13b6f146752ca066abca3c156acc71',
       'a65f77281a528bf5c1e9f270141d601d116e1df33bf9df512f495ee06647a9cc',
       '49beaacac0c7801c2ce2d189efe525fe80b5d37e46ed05b50a4cd88e34d0748f',
       'cd04ec2726dd58a8c753e0d6423e57716fd9ebcf2f14ed6012e7e5bea016b4d6',
       '55d15396193dfd45836af3a6269a079efea339e875eff42cc0c228b002548a9d',
       '689f4eda82fdf3d9bfe8e524bbd0d931c4d7690f2234d3e48779f924aaf4103d',
       'd80ed4ababfa96812e22b911629e6bcbf5093769051ea447e2b696ac98a3dae9',
       '6cc121e5cc202d2bf344ffe795002bdbf87178054bcda2e57161f0ef810a4b55',
       '8df45859ccd71ef1e48e2ee9d1c65d5728c31c46ae957d659fa4e5c3af6cc076',
       ...
       '35d2387a7ecad82c398d1046f09558225ffa92ac83b97ca135e7ccb02f460272',
       '3eb6eec2db6cca5ad302ad4aaa571273f9d7e757b0899e4a447e64a28108f6d7',
       '59e634a40a349158c2c7c20e7416024306fc5dff130e3c9fcbcfa5dd64595a53',
       '781f20

In [12]:
# random sample of 10,000 transactions made by customers with over 50 transactions
customers50df = df[df['customer_id'].isin(customers_50)].sample(n=10000, random_state=42)

In [14]:
customers50df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 28590805 to 21262368
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   t_dat                         10000 non-null  object 
 1   customer_id                   10000 non-null  object 
 2   article_id                    10000 non-null  int64  
 3   price                         10000 non-null  float64
 4   sales_channel_id              10000 non-null  int64  
 5   article_purchase_count        10000 non-null  int64  
 6   product_code                  9968 non-null   float64
 7   product_type_no               9968 non-null   float64
 8   product_type_name             9968 non-null   object 
 9   product_group_name            9968 non-null   object 
 10  graphical_appearance_no       9968 non-null   float64
 11  colour_group_code             9968 non-null   float64
 12  perceived_colour_value_name   9968 non-null   obje
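
Note that `sample` above draws 10,000 individual transactions, so each customer may keep only a few rows in `customers50df`. If the goal is a smaller set of customers with their full purchase history, one option is to sample customer IDs instead. The following is a minimal sketch on made-up data; `toy`, `sampled_ids` and `sampled_df` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the transactions table (values are made up).
toy = pd.DataFrame({
    "customer_id": ["a"] * 4 + ["b"] * 3 + ["c"] * 5,
    "article_id": range(12),
})

# Sample customers (not rows), then keep every transaction of the sampled customers.
sampled_ids = pd.Series(toy["customer_id"].unique()).sample(n=2, random_state=42)
sampled_df = toy[toy["customer_id"].isin(sampled_ids)]
```

Each sampled customer keeps all of their transactions, so the "most bought product type" is computed from their complete history.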

In [15]:
# testing for the customer who made the most transactions (top customer)
customer_id = 'be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee985513d9e8e53c6d91b'

In [16]:
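# where() keeps every row and replaces the rows of other customers with NaN,
# which is why the head below is empty until dropna() is applied
filtered_df = df.where(df['customer_id'] == customer_id)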

In [17]:
filtered_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_purchase_count,product_code,product_type_no,product_type_name,product_group_name,graphical_appearance_no,colour_group_code,perceived_colour_value_name,perceived_colour_master_name,department_no,index_code,index_group_no,section_no,garment_group_no,detail_desc
0,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,,,,


In [18]:
# drop the masked rows (this also drops any of this customer's rows that contain a missing value)
filtered_df = filtered_df.dropna()

In [19]:
filtered_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_purchase_count,product_code,product_type_no,product_type_name,product_group_name,graphical_appearance_no,colour_group_code,perceived_colour_value_name,perceived_colour_master_name,department_no,index_code,index_group_no,section_no,garment_group_no,detail_desc
75642,2018-09-21,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,658506001.0,0.059305,1.0,1.0,658506.0,275.0,Skirt,Garment Lower body,1010023.0,71.0,Dusty Light,Blue,1773.0,D,2.0,57.0,1016.0,"5-pocket, knee-length skirt in washed stretch ..."
75643,2018-09-21,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,662980002.0,0.033881,1.0,1.0,662980.0,252.0,Sweater,Garment Upper body,1010010.0,7.0,Dusty Light,Grey,1626.0,A,1.0,15.0,1003.0,Long jumper in a soft knit containing some woo...
98914,2018-09-22,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,478549001.0,0.050831,2.0,1.0,478549.0,272.0,Trousers,Garment Lower body,1010018.0,9.0,Dark,Black,1649.0,D,2.0,58.0,1009.0,5-pocket trousers in imitation leather with a ...
98915,2018-09-22,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,637515003.0,0.033881,2.0,1.0,637515.0,252.0,Sweater,Garment Upper body,1010010.0,73.0,Dark,Blue,1626.0,A,1.0,15.0,1003.0,"V-neck jumper in a soft, fine knit with droppe..."
98916,2018-09-22,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,667709001.0,0.050831,2.0,1.0,667709.0,275.0,Skirt,Garment Lower body,1010023.0,9.0,Dark,Black,1773.0,D,2.0,57.0,1016.0,Pencil skirt in washed stretch denim with a tw...
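
The `where` + `dropna` pattern used above has two side effects that are visible in the output: integer columns such as `article_id` become floats, and any of the customer's rows with a missing value are dropped as well. Boolean indexing avoids both. Below is a small sketch comparing the two on made-up data; `toy`, `via_where` and `via_mask` are illustrative names only.

```python
import pandas as pd

# Toy stand-in for the transactions table (values are made up).
toy = pd.DataFrame({
    "customer_id": ["a", "b", "a", "c"],
    "article_id": [101, 102, 103, 104],
    "product_type_name": ["Dress", "Sweater", None, "Dress"],
})

# where() masks non-matching rows with NaN; dropna() then removes them,
# along with customer "a"'s row that has a missing product type.
via_where = toy.where(toy["customer_id"] == "a").dropna()

# Boolean indexing returns only the matching rows and keeps the integer dtype.
via_mask = toy[toy["customer_id"] == "a"]

print(len(via_where), len(via_mask))
```

On the full dataset, `df[df['customer_id'] == customer_id]` would be the equivalent of `via_mask`.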


In [20]:
# counting the top customer's purchases per product type, most purchased first
filtered_df.groupby('product_type_name')['article_id'].count().sort_values(ascending=False)

product_type_name
Dress                293
Trousers             195
Sweater              164
Underwear bottom     100
Skirt                 90
Bikini top            89
Shorts                70
Top                   65
Bra                   64
Leggings/Tights       58
Swimwear bottom       50
Jacket                48
T-shirt               46
Vest top              41
Blouse                41
Socks                 25
Belt                  24
Blazer                21
Hoodie                19
Shirt                 18
Swimsuit              14
Hat/beanie            11
Pyjama set            10
Bag                    8
Bodysuit               8
Hair clip              8
Cardigan               6
Other accessories      5
Coat                   5
Boots                  5
Unknown                4
Hair/alice band        4
Pyjama bottom          3
Sandals                3
Scarf                  3
Night gown             2
Hair ties              2
Hair string            2
Slippers               2
Sarong 

Since the top customer's most purchased product type is Dress (293 purchases), I will find the most popular dress across all customers and recommend it.

In [21]:
# keeping only dress transactions from the original df: where() masks all other rows with NaN and dropna() removes them
dress_filtered = df.where(df['product_type_name'] == 'Dress')
dress_filtered = dress_filtered.dropna()

In [22]:
dress_filtered.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_purchase_count,product_code,product_type_no,product_type_name,product_group_name,graphical_appearance_no,colour_group_code,perceived_colour_value_name,perceived_colour_master_name,department_no,index_code,index_group_no,section_no,garment_group_no,detail_desc
46,2018-09-20,00401a367c5ac085cb9d4b77c56f3edcabf25153615db9...,567475001.0,0.033881,2.0,1.0,567475.0,265.0,Dress,Garment Full body,1010016.0,9.0,Dark,Black,1919.0,A,1.0,2.0,1005.0,Long-sleeved tunic in jersey crêpe with a V-ne...
47,2018-09-20,00401a367c5ac085cb9d4b77c56f3edcabf25153615db9...,567594001.0,0.011847,2.0,1.0,567594.0,265.0,Dress,Garment Full body,1010016.0,9.0,Dark,Black,1919.0,A,1.0,2.0,1005.0,Tunic in soft viscose jersey with a sheen. Sho...
54,2018-09-20,00401a367c5ac085cb9d4b77c56f3edcabf25153615db9...,648719001.0,0.025407,2.0,1.0,648719.0,265.0,Dress,Garment Full body,1010016.0,9.0,Dark,Black,1919.0,A,1.0,2.0,1005.0,Short-sleeved dress in cotton jersey with a se...
55,2018-09-20,00401a367c5ac085cb9d4b77c56f3edcabf25153615db9...,681358001.0,0.025407,2.0,1.0,681358.0,265.0,Dress,Garment Full body,1010016.0,9.0,Dark,Black,1676.0,A,1.0,16.0,1002.0,"Straight, wide-fitting dress in jersey crêpe w..."
57,2018-09-20,00402f4463c8dc1b3ee54abfdea280e96cd87320449eca...,665481004.0,0.050831,1.0,1.0,665481.0,265.0,Dress,Garment Full body,1010001.0,9.0,Dark,Black,1666.0,A,1.0,11.0,1005.0,Knee-length dress in jersey crêpe with a V-nec...


In [23]:
# grouping and counting by article id, then sorting in descending order so the most popular dress comes first
dress_filtered.groupby('article_id')['article_id'].count().sort_values(ascending=False)

article_id
716348001.0    4604
612935009.0    4555
401044004.0    4415
721298001.0    4308
714824001.0    4194
               ... 
542833004.0       1
866596003.0       1
543760004.0       1
701085020.0       1
527487004.0       1
Name: article_id, Length: 10222, dtype: int64
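
To pull out the single best seller rather than reading it off the sorted list, `value_counts` combined with `idxmax` is one option. This is a minimal sketch on made-up article IDs; `toy_dresses`, `counts`, `top_article` and `top_count` are illustrative names.

```python
import pandas as pd

# Toy stand-in for the dress transactions (article IDs are made up).
toy_dresses = pd.DataFrame({"article_id": [11, 12, 11, 13, 11, 12]})

# Number of purchases per article.
counts = toy_dresses["article_id"].value_counts()

# ID of the most purchased article and how many times it was bought.
top_article = counts.idxmax()
top_count = counts.max()
```

Applied to `dress_filtered['article_id']`, the same two calls would return the article at the top of the list above.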

Now I will write a function that repeats the above process for any given customer.

In [24]:
def find_most_popular_article(df, customer_id):
    # filter DataFrame to include only transactions by the specified customer
    customer_df = df[df['customer_id'] == customer_id]
    
    # group transactions by article category name and count the number of articles purchased in each category
    category_counts = customer_df.groupby('product_type_name')['article_id'].count()
    
    # find the category with the highest number of articles purchased
    most_popular_category = category_counts.idxmax()
    
    # filter the full DataFrame (all customers) to transactions in that category,
    # so the recommendation is the category's overall best seller, as in the dress example
    category_df = df[df['product_type_name'] == most_popular_category]
    
    # group transactions by article ID and count the number of times each article was purchased
    article_counts = category_df.groupby('article_id')['article_id'].count()
    
    # find the article with the highest number of purchases in the most popular category
    most_popular_article = article_counts.idxmax()
    
    # return the ID of the most popular article in the most popular category for the customer
    return most_popular_article

In [25]:
# group transactions by customer ID and count the number of transactions for each customer
customer_counts = df.groupby('customer_id')['article_id'].count()

# filter to include only customers with more than 50 transactions
big_customers = customer_counts[customer_counts > 50]

# loop through the customer IDs in the filtered DataFrame
for customer_id in big_customers.index:
    most_popular_article = find_most_popular_article(df, customer_id)
    print(f"The most popular article for customer {customer_id} is {most_popular_article}")

KeyboardInterrupt: 
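
The loop above calls `find_most_popular_article` once per customer, and each call filters the full DataFrame of about 25 million rows, which is probably why the cell had to be interrupted. A possible alternative is to compute the best seller per product type once, compute each customer's top product type once, and then join the two. The following is a sketch under that assumption, shown on made-up data; `recommend_for_big_customers`, `min_transactions` and `toy` are hypothetical names, and ties are broken by whichever value `value_counts` lists first.

```python
import pandas as pd

def recommend_for_big_customers(transactions, min_transactions=50):
    """Return a Series mapping customer_id to a recommended article_id."""
    # Rows without a product type cannot contribute to either lookup.
    known = transactions.dropna(subset=["product_type_name"])

    # Most purchased article within each product type, over all customers.
    top_article_by_type = (
        known.groupby("product_type_name")["article_id"]
        .agg(lambda s: s.value_counts().idxmax())
    )

    # Keep customers with more than `min_transactions` transactions.
    counts = transactions["customer_id"].value_counts()
    big = known[known["customer_id"].isin(counts.index[counts > min_transactions])]

    # Most purchased product type for each remaining customer.
    top_type_by_customer = (
        big.groupby("customer_id")["product_type_name"]
        .agg(lambda s: s.value_counts().idxmax())
    )

    # Look up each customer's top product type in the per-type table.
    return top_type_by_customer.map(top_article_by_type)

# Toy stand-in for the transactions table (values are made up).
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "a", "b", "c", "c", "c", "c", "c"],
    "product_type_name": ["Dress", "Dress", "Dress", "Sweater", "Dress",
                          "Sweater", "Sweater", "Sweater", "Sweater", "Dress"],
    "article_id": [1, 2, 1, 5, 2, 6, 6, 6, 5, 1],
})

recs = recommend_for_big_customers(toy, min_transactions=2)
print(recs)
```

On the real data the call would be `recommend_for_big_customers(df)`, which makes a few grouped passes over the data instead of one full scan per customer.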