# **<a id="Content">HnM RecSys Notebook 9417</a>**

## **<a id="Content">Table of Contents</a>**
* [**<span>1. Imports</span>**](#Imports)  
* [**<span>2. Pre-Processing</span>**](#Pre-Processing)
* [**<span>3. Exploratory Data Analysis</span>**](#Exploratory%20Data%20Analysis)  
    * [**<span>3.1 Articles</span>**](#EDA::Articles)  
    * [**<span>3.2 Customers</span>**](#EDA::Customers)
    * [**<span>3.3 Transactions</span>**](#EDA::Transactions)
* [**<span>4. Models</span>**](#Models) 
    * [**<span>4.1 Popularity</span>**](#Popularity%20Model)   
    * [**<span>4.2 ALS</span>**](#Alternating%20Least%20Squares)  
    * [**<span>4.2 GBDT</span>**](#GBDT)  
    * [**<span>4.3 SGD/similar</span>**](#SGD)  
    * [**<span>4.4 NN</span>**](#NN)  



## Imports

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import os
import re
import warnings


# Importing data
articles = pd.read_csv('/Users/priyashshah/Desktop/COMP9417/ML Group Assignment/articles.csv')
print(articles.head())
print("--")
customers = pd.read_csv('/Users/priyashshah/Desktop/COMP9417/ML Group Assignment/customers.csv')
print(customers.head())
print("--")
transactions = pd.read_csv("/Users/priyashshah/Desktop/COMP9417/ML Group Assignment/transactions_train.csv")
print(transactions.head())
print("--")

   article_id  product_code          prod_name  product_type_no  \
0   108775015        108775          Strap top              253   
1   108775044        108775          Strap top              253   
2   108775051        108775      Strap top (1)              253   
3   110065001        110065  OP T-shirt (Idro)              306   
4   110065002        110065  OP T-shirt (Idro)              306   

  product_type_name  product_group_name  graphical_appearance_no  \
0          Vest top  Garment Upper body                  1010016   
1          Vest top  Garment Upper body                  1010016   
2          Vest top  Garment Upper body                  1010017   
3               Bra           Underwear                  1010016   
4               Bra           Underwear                  1010016   

  graphical_appearance_name  colour_group_code colour_group_name  ...  \
0                     Solid                  9             Black  ...   
1                     Solid               

## Pre-Processing

In [26]:
# ----- empty value stats -------------
print("Missing values: ")
print(customers.isnull().sum())
print("--\n")

print("FN Newsletter vals: ", customers['FN'].unique())
print("Active communication vals: ",customers['Active'].unique())
print("Club member status vals: ", customers['club_member_status'].unique())
print("Fashion News frequency vals: ", customers['fashion_news_frequency'].unique())
print("--\n")

# ---- data cleaning -------------

customers['FN'] = customers['FN'].fillna(0)
customers['Active'] = customers['Active'].fillna(0)

# replace club_member_status missing values with 'LEFT CLUB' --> no members with LEFT CLUB status in data
customers['club_member_status'] = customers['club_member_status'].fillna('LEFT CLUB')
customers['fashion_news_frequency'] = customers['fashion_news_frequency'].fillna('None')
customers['fashion_news_frequency'] = customers['fashion_news_frequency'].replace('NONE', 'None')
customers['age'] = customers['age'].fillna(customers['age'].mean())
customers['age'] = customers['age'].astype(int)
articles['detail_desc'] = articles['detail_desc'].fillna('None')

print("Missing values: ")
print(customers.isnull().sum())
print("--\n")


# ---- memory optimizations -------------

# uses 8 bytes instead of given 64 byte string, reduces mem by 8x, 
# !!!! have to convert back before merging w/ sample_submissions.csv
transactions['customer_id'] = transactions['customer_id'].apply(lambda x: int(x[-16:],16) ).astype('int64')
# uses 4 bytes instead of given 10 byte string, reduces mem by 2.5x
transactions['article_id'] = transactions['article_id'].astype('int32') 
# !!!! ADD LEADING ZERO BACK BEFORE SUBMISSION OF PREDICTIONS TO KAGGLE: 
#transactions['article_id'] = '0' + transactions.article_id.astype('str')

# reduces mem by 3x
transactions['price'] = transactions['price'].astype('float32') 
transactions['sales_channel_id'] = transactions['sales_channel_id'].astype('int8')

Missing values: 
customer_id               0
FN                        0
Active                    0
club_member_status        0
fashion_news_frequency    0
age                       0
postal_code               0
dtype: int64
--

FN Newsletter vals:  [0. 1.]
Active communication vals:  [0. 1.]
Club member status vals:  ['ACTIVE' 'LEFT CLUB' 'PRE-CREATE']
Fashion News frequency vals:  ['None' 'Regularly' 'Monthly']
--

Missing values: 
customer_id               0
FN                        0
Active                    0
club_member_status        0
fashion_news_frequency    0
age                       0
postal_code               0
dtype: int64
--



TypeError: 'int' object is not subscriptable

In [24]:
#!pip install catboost
!pip install ipywidgets
#!jupyter nbextension enable  — py widgetsnbextension

Defaulting to user installation because normal site-packages is not writeable
Collecting catboost
  Downloading catboost-1.1.1-cp37-none-macosx_10_6_universal2.whl (22.0 MB)
[K     |████████████████████████████████| 22.0 MB 225 kB/s eta 0:00:01     |█████████████████▉              | 12.3 MB 243 kB/s eta 0:00:40
Collecting graphviz
  Downloading graphviz-0.20.1-py3-none-any.whl (47 kB)
[K     |████████████████████████████████| 47 kB 8.7 MB/s  eta 0:00:01
[?25hCollecting plotly
  Downloading plotly-5.14.1-py2.py3-none-any.whl (15.3 MB)
[K     |████████████████████████████████| 15.3 MB 12.3 MB/s eta 0:00:01
Collecting tenacity>=6.2.0
  Downloading tenacity-8.2.2-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly, graphviz, catboost
Successfully installed catboost-1.1.1 graphviz-0.20.1 plotly-5.14.1 tenacity-8.2.2
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.[0m


In [25]:
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

IMPLEMENTATION

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import os
import re
import warnings

# ----- empty value stats -------------
print("Missing values: ")
print(customers.isnull().sum())
print("--\n")

print("FN Newsletter vals: ", customers['FN'].unique())
print("Active communication vals: ",customers['Active'].unique())
print("Club member status vals: ", customers['club_member_status'].unique())
print("Fashion News frequency vals: ", customers['fashion_news_frequency'].unique())
print("--\n")

# ---- data cleaning -------------

customers['FN'] = customers['FN'].fillna(0)
customers['Active'] = customers['Active'].fillna(0)

# replace club_member_status missing values with 'LEFT CLUB' --> no members with LEFT CLUB status in data
customers['club_member_status'] = customers['club_member_status'].fillna('LEFT CLUB')
customers['fashion_news_frequency'] = customers['fashion_news_frequency'].fillna('None')
customers['fashion_news_frequency'] = customers['fashion_news_frequency'].replace('NONE', 'None')
customers['age'] = customers['age'].fillna(customers['age'].mean())
customers['age'] = customers['age'].astype(int)
articles['detail_desc'] = articles['detail_desc'].fillna('None')


print("Customers' Missing values: ")
print(customers.isnull().sum())
print("--\n")

Missing values: 
customer_id                    0
FN                        895050
Active                    907576
club_member_status          6062
fashion_news_frequency     16009
age                        15861
postal_code                    0
dtype: int64
--

FN Newsletter vals:  [nan  1.]
Active communication vals:  [nan  1.]
Club member status vals:  ['ACTIVE' nan 'PRE-CREATE' 'LEFT CLUB']
Fashion News frequency vals:  ['NONE' 'Regularly' nan 'Monthly' 'None']
--

Customers' Missing values: 
customer_id               0
FN                        0
Active                    0
club_member_status        0
fashion_news_frequency    0
age                       0
postal_code               0
dtype: int64
--



In [13]:
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """Iterate over all the columns of a DataFrame and modify the data type
    to reduce memory usage, handling ordered Categoricals"""
    
    # check the memory usage of the DataFrame
    start_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage of dataframe is {:.2f} MB".format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type == 'category':
            if df[col].cat.ordered:
                # Convert ordered Categorical to an integer
                df[col] = df[col].cat.codes.astype('int16')
            else:
                # Convert unordered Categorical to a string
                df[col] = df[col].astype('str')
        
        elif col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    
    # check the memory usage after optimization
    end_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage after optimization is: {:.2f} MB".format(end_mem))

    # calculate the percentage of the memory usage reduction
    mem_reduction = 100 * (start_mem - end_mem) / start_mem
    print("Memory usage decreased by {:.1f}%".format(mem_reduction))
    
    return df

In [7]:
customers['club_member_status'].replace({'LEFT CLUB': 0, 'PRE-CREATE': 1, 'ACTIVE': 2}, inplace=True)
customers['club_member_status'] = customers['club_member_status'].astype('int8')
print(customers['club_member_status'].unique())

[2 0 1]


In [16]:
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

articles = articles.drop(columns=['product_code', 'prod_name', 'product_type_name', 'product_group_name', 'graphical_appearance_name', 'department_name', 'index_name', 'index_group_name', 'section_name', 'garment_group_name', 'detail_desc'])
articles = articles.drop(columns=[col for col in articles.columns if 'colour_' in col or 'perceived_' in col])

In [18]:
# Join the transaction dataframe with the customers dataframe
merged = pd.merge(transactions, customers, on='customer_id', how='inner')

# Calculate the mean age for each article
item_mean_age = merged.groupby('article_id')['age'].mean()

# Calculate the difference between every user's age and the mean age of users who have purchased a particular item
merged['age_diff'] = merged['age'] - merged['article_id'].map(item_mean_age)

# Group by article and take the mean of age_diff
article_age_diff = merged.groupby('article_id')['age_diff'].mean()

# Append the age difference feature to the articles dataframe
articles['age_diff'] = articles['article_id'].map(article_age_diff)

articles.head()

Unnamed: 0,article_id,product_type_no,graphical_appearance_no,department_no,index_code,index_group_no,section_no,garment_group_no,age_diff
0,108775015,253,1010016,1676,A,1,16,1002,2.731011e-14
1,108775044,253,1010016,1676,A,1,16,1002,-1.909939e-14
2,108775051,253,1010017,1676,A,1,16,1002,-6.279215e-16
3,110065001,306,1010016,1339,B,1,61,1017,6.812771e-15
4,110065002,306,1010016,1339,B,1,61,1017,-9.465115e-15


In [49]:

transactions_df = transactions

transactions_df = transactions_df.drop_duplicates(subset=['customer_id','article_id'])
transactions_df= transactions_df.reset_index(drop = True)
customer_list = transactions_df.customer_id.value_counts()[:50].index.tolist()
customers_1 = customers.loc[customers['customer_id'].isin(customer_list)]
transactions_train_t = transactions_df.loc[transactions_df['customer_id'].isin(customer_list)]
articles_t = articles.loc[articles['article_id'].isin(list(transactions_train_t['article_id']))]
merge_df = pd.DataFrame()
for customer in customers_1['customer_id']:
    new_df = pd.DataFrame()
    new_df['Buy'] = []
    print(customer)
    new_df = new_df.append([customers_1.loc[customers_1['customer_id']==customer]]*len(articles_t)).reset_index(drop = True)
    new_df['article_id'] = list(articles_t['article_id'])
    tran_temp = transactions_train_t.loc[transactions_train_t['customer_id']==customer]
    for article in list(tran_temp['article_id']):
        new_df.loc[(new_df['customer_id']==customer) & (new_df['article_id']==article),['Buy']] = 1
    merge_df = merge_df.append(new_df).reset_index(drop = True)
merge_df


1135991499650384534
-738104704126742956
8572405339616664040
-3864861336650713034
-7724193906943787184
8775840678685643411
1137836553762819430
6025578233842836240
7952062577868193503
3267368325910590416
-8030009826304683278
-5697361604447432642
742206143892321423
-4557035136527070563
628088449402505029
-8684636470525030339
-1277389631032871377
8170076110479575893
-1897419710364364417
-1719797630016463240
-6737281167552191327
3319755920444815383
2697331706436144899
-1229325526706752666
-5976676472876429720
4181321428134864254
9070597775879917454
-6943172096873938826
-2385229251180014276
8683271572215123548
3407358910964148684
-1219804509442908655
7398229172292340849
-8298629768350782975
5854009424779598107
-4781325834093528838
1362310019195253974
-1021296598342089594
-2110333707337082135
5345619263166536794
-8742378727249606979
-20833052798049461
1575517627660529403
-2021215128221699234
-7961016853555269047
-2954818649882845510
2768447994533782618
6876265344195936652
5876564334685098354


Unnamed: 0,Buy,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code,article_id
0,,1.135991e+18,1.0,1.0,2.0,Regularly,51.0,8db52856d17c197683efbc9d5ef2dc873aaf7062486b2d...,108775015
1,1.0,1.135991e+18,1.0,1.0,2.0,Regularly,51.0,8db52856d17c197683efbc9d5ef2dc873aaf7062486b2d...,108775044
2,,1.135991e+18,1.0,1.0,2.0,Regularly,51.0,8db52856d17c197683efbc9d5ef2dc873aaf7062486b2d...,110065001
3,,1.135991e+18,1.0,1.0,2.0,Regularly,51.0,8db52856d17c197683efbc9d5ef2dc873aaf7062486b2d...,110065011
4,,1.135991e+18,1.0,1.0,2.0,Regularly,51.0,8db52856d17c197683efbc9d5ef2dc873aaf7062486b2d...,111565001
...,...,...,...,...,...,...,...,...,...
1146195,,2.636587e+18,1.0,1.0,2.0,Regularly,70.0,3455b39b24a47ae0262c91c5728ab9ddcfccc43628291e...,946795001
1146196,,2.636587e+18,1.0,1.0,2.0,Regularly,70.0,3455b39b24a47ae0262c91c5728ab9ddcfccc43628291e...,947509001
1146197,,2.636587e+18,1.0,1.0,2.0,Regularly,70.0,3455b39b24a47ae0262c91c5728ab9ddcfccc43628291e...,947934001
1146198,,2.636587e+18,1.0,1.0,2.0,Regularly,70.0,3455b39b24a47ae0262c91c5728ab9ddcfccc43628291e...,952267001


In [53]:
merge_df['Buy'] = merge_df['Buy'].fillna(0)
merge_df.head()

Unnamed: 0,Buy,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code,article_id,product_type_no,graphical_appearance_no,department_no,index_code,index_group_no,section_no,garment_group_no,age_diff
0,0.0,1.135991e+18,1.0,1.0,2.0,Regularly,51.0,8db52856d17c197683efbc9d5ef2dc873aaf7062486b2d...,108775015,253,1010016,1676,A,1,16,1002,2.731011e-14
1,1.0,1.135991e+18,1.0,1.0,2.0,Regularly,51.0,8db52856d17c197683efbc9d5ef2dc873aaf7062486b2d...,108775044,253,1010016,1676,A,1,16,1002,-1.909939e-14
2,0.0,1.135991e+18,1.0,1.0,2.0,Regularly,51.0,8db52856d17c197683efbc9d5ef2dc873aaf7062486b2d...,110065001,306,1010016,1339,B,1,61,1017,6.812771e-15
3,0.0,1.135991e+18,1.0,1.0,2.0,Regularly,51.0,8db52856d17c197683efbc9d5ef2dc873aaf7062486b2d...,110065011,306,1010016,1339,B,1,61,1017,-1.981697e-15
4,0.0,1.135991e+18,1.0,1.0,2.0,Regularly,51.0,8db52856d17c197683efbc9d5ef2dc873aaf7062486b2d...,111565001,304,1010016,3608,B,1,62,1021,5.801825e-15


In [52]:
merge_df= pd.merge(merge_df, articles, on='article_id', how='left')

In [68]:
merge_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1146200 entries, 0 to 1146199
Data columns (total 17 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   Buy                      1146200 non-null  float64
 1   customer_id              1146200 non-null  int32  
 2   FN                       1146200 non-null  int8   
 3   Active                   1146200 non-null  float64
 4   club_member_status       1146200 non-null  float64
 5   fashion_news_frequency   1146200 non-null  object 
 6   age                      1146200 non-null  float64
 7   postal_code              1146200 non-null  object 
 8   article_id               1146200 non-null  int64  
 9   product_type_no          1146200 non-null  int64  
 10  graphical_appearance_no  1146200 non-null  int64  
 11  department_no            1146200 non-null  int64  
 12  index_code               1146200 non-null  object 
 13  index_group_no           1146200 non-null 

In [69]:
merge_df['customer_id'] = merge_df['customer_id'].astype('int32')
merge_df['FN'] = merge_df['FN'].astype('int8')
merge_df['Active'] = merge_df['Active'].astype('int8')
merge_df['club_member_status'] = merge_df['club_member_status'].astype('int8')


In [70]:
X = merge_df.drop(columns=['Buy']) 
y = merge_df['Buy']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)

In [71]:
CAT_FEATURES = ['FN','Active','customer_id','club_member_status', 'fashion_news_frequency','postal_code',
                 'article_id','product_type_no','graphical_appearance_no','department_no','index_code','index_group_no','section_no','garment_group_no'] #list of your categorical features


In [74]:
# set up the model
catboost_model = CatBoostRegressor(n_estimators=100,
                                   loss_function = 'RMSE',
                                   eval_metric = 'RMSE',
                                   cat_features = CAT_FEATURES)


In [75]:
# fit model
catboost_model.fit(X_train, y_train, 
                   eval_set = (X_test, y_test),
                   use_best_model = True,
                   plot = True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Learning rate set to 0.5
0:	learn: 0.1774853	test: 0.1773356	best: 0.1773356 (0)	total: 500ms	remaining: 49.5s
1:	learn: 0.1773755	test: 0.1773093	best: 0.1773093 (1)	total: 1.27s	remaining: 1m 2s
2:	learn: 0.1773248	test: 0.1773069	best: 0.1773069 (2)	total: 1.42s	remaining: 45.9s
3:	learn: 0.1772832	test: 0.1772877	best: 0.1772877 (3)	total: 1.7s	remaining: 40.8s
4:	learn: 0.1772374	test: 0.1772418	best: 0.1772418 (4)	total: 1.89s	remaining: 35.9s
5:	learn: 0.1771778	test: 0.1771914	best: 0.1771914 (5)	total: 2.03s	remaining: 31.8s
6:	learn: 0.1771432	test: 0.1771945	best: 0.1771914 (5)	total: 2.15s	remaining: 28.6s
7:	learn: 0.1770740	test: 0.1771301	best: 0.1771301 (7)	total: 2.36s	remaining: 27.2s
8:	learn: 0.1770610	test: 0.1771227	best: 0.1771227 (8)	total: 2.52s	remaining: 25.4s
9:	learn: 0.1770577	test: 0.1771381	best: 0.1771227 (8)	total: 2.6s	remaining: 23.4s
10:	learn: 0.1770394	test: 0.1771286	best: 0.1771227 (8)	total: 2.73s	remaining: 22.1s
11:	learn: 0.1769767	test: 0.1

<catboost.core.CatBoostRegressor at 0x1899ca438>

In [76]:
# get your predictions
preds = catboost_model.predict(X_test)

In [98]:
#create DF
new_df = pd.DataFrame()
new_df['customer_id'] = X_test['customer_id']
new_df['article_id'] = X_test['article_id']
new_df['modelOutput'] = preds
print(new_df)

         customer_id  article_id  modelOutput
957265   -2147483648   819113007     0.032365
1056829  -2147483648   580102004     0.024923
728664   -2147483648   828067002     0.043149
746543   -2147483648   759108001     0.007688
208701   -2147483648   581781010     0.045422
...              ...         ...          ...
599491   -2147483648   622969013     0.033838
1099024  -2147483648   887681002     0.043588
828199   -2147483648   607427005     0.093917
20073    -2147483648   861547002     0.024769
563128   -2147483648   758573001     0.036887

[286550 rows x 3 columns]


In [99]:
new_df.sort_values(
    by=["modelOutput", "customer_id"],
    ascending=False
)

Unnamed: 0,customer_id,article_id,modelOutput
201669,-2147483648,832298003,0.390544
197685,-2147483648,776237011,0.337248
1143144,-2147483648,859125005,0.288886
998788,-2147483648,759871002,0.284604
746624,-2147483648,759871002,0.280164
...,...,...,...
133111,-2147483648,835639002,-0.029898
1129533,-2147483648,677930023,-0.031371
1126038,-2147483648,599580017,-0.033246
1143296,-2147483648,860949002,-0.038040
