**Student name: Giorgi Guledani**

**Student ID: 20193667**

# Lecture 2: Feature engineering

In [14]:
import pandas as pd
path = "../datasets/"
articles = pd.read_parquet(path + "articles.parquet")
customers = pd.read_parquet(path + "customers.parquet")
transactions_train = pd.read_parquet(path + "transactions_train.parquet")

# Radek's preprocessing

Radek's preprocessing already replaced missing values with -1, label-encoded several columns, and reduced RAM usage by changing data types. He also added a "week" column to the transactions table, which represents the "t_dat" date as a week number. This notebook reads the .parquet files produced by that preprocessing and applies further preprocessing and feature engineering to them.

Source: https://www.kaggle.com/code/marcogorelli/radek-s-lgbmranker-starter-pack-warmup
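
As a rough illustration of the kind of preprocessing described above, the sketch below label-encodes a toy column and shrinks its integer dtype. This is not Radek's actual code: the `colour` data is made up, and `pd.factorize` plus `pd.to_numeric(..., downcast=...)` are simply one way to get a similar result.

```python
import pandas as pd

# Toy column with a missing value (hypothetical data, not from the H&M files)
colour = pd.Series(["black", None, "white", "black"])

# factorize assigns one integer code per distinct value; missing values get -1
codes, uniques = pd.factorize(colour)
encoded = pd.Series(codes)

# Downcast to the smallest integer dtype that can hold the codes
encoded_small = pd.to_numeric(encoded, downcast="integer")

print(encoded_small.tolist())
print(encoded_small.dtype)
```

The -1 code for missing values is why -1 shows up as a "missing" marker in the tables further down.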

# Articles

No additional changes are made to this table.

In [15]:
articles.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,12855,253,9,0,1010016,0,9,0,...,10,0,0,1,0,16,30,1002,2,8834
1,108775044,108775,12855,253,9,0,1010016,0,10,2,...,10,0,0,1,0,16,30,1002,2,8834
2,108775051,108775,44846,253,9,0,1010017,3,11,11,...,10,0,0,1,0,16,30,1002,2,8834
3,110065001,110065,8159,306,13,4,1010016,0,9,0,...,131,7,7,1,0,61,5,1017,4,8243
4,110065002,110065,8159,306,13,4,1010016,0,10,2,...,131,7,7,1,0,61,5,1017,4,8243


# Transactions

No additional changes are made to this table.

In [16]:
transactions_train.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,week
25784,2018-09-20,1728846800780188,519773001,0.028458,2,0
25785,2018-09-20,1728846800780188,578472001,0.032525,2,0
5389,2018-09-20,2076973761519164,661795002,0.167797,2,0
5390,2018-09-20,2076973761519164,684080003,0.101678,2,0
47429,2018-09-20,2918879973994241,662980001,0.033881,1,0


# Customers

### Preprocessing

In [17]:
customers.groupby(by=["Active", "FN"], dropna=False).size().reset_index()

Unnamed: 0,Active,FN,0
0,-1,-1,895050
1,-1,1,12526
2,1,1,464404
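
The relationship between the two columns in the counts above can also be checked directly. A minimal sketch on toy data (the values are made up; on the real table the same expressions would be applied to `customers`):

```python
import pandas as pd

# Toy stand-in for the customers table (hypothetical values; -1 means missing)
toy = pd.DataFrame({
    "Active": [-1, -1, 1, 1],
    "FN":     [-1,  1, 1, 1],
})

# Does every active customer also receive fashion news?
active_implies_fn = bool((toy.loc[toy["Active"] == 1, "FN"] == 1).all())

# How many rows have FN set although Active is missing?
fn_without_active = int(((toy["Active"] == -1) & (toy["FN"] == 1)).sum())

print(active_implies_fn, fn_without_active)
```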


We can see that the two columns are highly correlated: every customer with Active = 1 also has FN = 1. The only rows where the two columns differ are the 12,526 customers who have FN = 1 while Active is missing (-1).
Let's keep a single column by dropping FN:

In [18]:
customers = customers.drop("FN", axis=1)
customers.head()

Unnamed: 0,customer_id,Active,club_member_status,fashion_news_frequency,age,postal_code
0,6883939031699146327,-1,0,0,49,6305
1,11246327431398957306,-1,0,0,25,33726
2,18439897732908966680,-1,0,0,24,3247
3,18352672461570950206,-1,0,0,54,168643
4,18162778555210377306,1,0,1,52,168645


### Feature engineering #1: Missing ages to mean

In [19]:
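# fill in missing age values (encoded as -1) with the mean of the known ages
mean_age = customers.loc[customers['age'] != -1, 'age'].mean()  # exclude the -1 sentinel from the mean
customers["age"] = customers['age'].replace(-1, mean_age)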
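
Because missing ages are encoded as -1, the sentinel pulls the average down unless it is excluded before taking the mean. A small sketch on made-up ages shows the difference:

```python
import pandas as pd

# Toy ages (hypothetical); -1 marks a missing age
ages = pd.Series([20, 30, 40, -1, -1])

naive_mean = ages.mean()               # sentinel values included
known_mean = ages[ages != -1].mean()   # sentinel values excluded

# Replace the sentinel with the mean of the known ages
filled = ages.replace(-1, known_mean)

print(naive_mean, known_mean)
print(filled.tolist())
```

The more missing values a column has, the larger the gap between the two means becomes.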

### Feature engineering #2: Age bins


In [20]:
# assign each customer to a labelled age group
bin_edges = [0, 25, 35, 50, float('inf')]  # right-closed bins: (0, 25], (25, 35], (35, 50], (50, inf)
bin_labels = ['Young', 'Adult', 'Middle-aged', 'Senior']
customers['age_group'] = pd.cut(customers['age'], bins=bin_edges, labels=bin_labels)
customers.head()

Unnamed: 0,customer_id,Active,club_member_status,fashion_news_frequency,age,postal_code,age_group
0,6883939031699146327,-1,0,0,49.0,6305,Middle-aged
1,11246327431398957306,-1,0,0,25.0,33726,Young
2,18439897732908966680,-1,0,0,24.0,3247,Young
3,18352672461570950206,-1,0,0,54.0,168643,Senior
4,18162778555210377306,1,0,1,52.0,168645,Senior
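
Note that `pd.cut` closes each bin on the right by default, which is why age 25 lands in "Young" in the output above. The following sketch, using toy ages placed on and around the bin edges, compares the default with `right=False`:

```python
import pandas as pd

bin_edges = [0, 25, 35, 50, float("inf")]
bin_labels = ["Young", "Adult", "Middle-aged", "Senior"]

# Toy ages chosen to sit on and around the bin edges
ages = pd.Series([24, 25, 26, 35, 50, 51])

# Default right=True: intervals are (0, 25], (25, 35], (35, 50], (50, inf)
right_closed = pd.cut(ages, bins=bin_edges, labels=bin_labels)

# right=False: intervals become [0, 25), [25, 35), [35, 50), [50, inf)
left_closed = pd.cut(ages, bins=bin_edges, labels=bin_labels, right=False)

print(right_closed.astype(str).tolist())
print(left_closed.astype(str).tolist())
```

Which convention is preferable depends on whether an edge age such as 25 should count as the older or the younger group.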


### Feature engineering #3: Favorite category

In [21]:
articles_transactions = pd.merge(articles[["article_id", "product_group_name", "product_type_name"]], transactions_train[["article_id", "customer_id"]], on="article_id")
articles_transactions_customers = pd.merge(customers["customer_id"], articles_transactions[["article_id", "customer_id"]], on="customer_id")
articles_transactions_customers.head()


Unnamed: 0,customer_id,article_id
0,6883939031699146327,176209023
1,6883939031699146327,568601006
2,6883939031699146327,568601006
3,6883939031699146327,568601043
4,6883939031699146327,607642008


We now find the most popular group, which will be assigned to customers who have no favorite product group (no purchases):

In [22]:
most_popular_group = articles_transactions["product_group_name"].mode().iloc[0]  # mode() returns a Series; iloc[0] extracts the value itself
most_popular_group

0

Comparing all groups shows that purchases are concentrated in the top 7, which together account for roughly 96% of all transaction rows:

In [23]:
articles_transactions["product_group_name"].value_counts(dropna=False)


product_group_name
0     12552755
1      7046054
2      3552470
6      2579222
4      2565858
3      1599593
5       745521
7       685712
8       348180
9        97040
12        7313
13        5427
11        1500
10         559
14         533
15         279
16         229
17          74
18           5
Name: count, dtype: int64
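
To quantify how concentrated the purchases are, the counts printed above can be turned into cumulative shares. The sketch below copies the numbers from that output; the variable names are only illustrative.

```python
import pandas as pd

# Purchase counts per product group, copied from the output above
counts = pd.Series({
    0: 12552755, 1: 7046054, 2: 3552470, 6: 2579222, 4: 2565858,
    3: 1599593, 5: 745521, 7: 685712, 8: 348180, 9: 97040,
    12: 7313, 13: 5427, 11: 1500, 10: 559, 14: 533,
    15: 279, 16: 229, 17: 74, 18: 5,
})

# Cumulative fraction of all purchase rows covered by the k most popular groups
cumulative_share = counts.sort_values(ascending=False).cumsum() / counts.sum()

# Share covered by the seven most popular groups
top7_share = cumulative_share.iloc[6]
print(round(top7_share, 3))
```

On the full table, `value_counts(normalize=True).cumsum()` should give the same shares without copying the numbers by hand.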

Now we compute each customer's favorite product group, i.e. the group they purchased most often:

In [24]:
articles_transactions = (
    articles_transactions.groupby(["customer_id"])
    .product_group_name
    .apply(lambda x: x.mode()[0])
    .reset_index()
    .rename({'product_group_name': 'fav_product_group'}, axis=1)
)

articles_transactions

Unnamed: 0,customer_id,fav_product_group
0,4245900472157,0
1,23962613628581,1
2,25398598941468,1
3,28847241659200,0
4,41046458195168,4
...,...,...
1362276,18446630855572834764,0
1362277,18446662237889060501,0
1362278,18446705133201055310,0
1362279,18446723086055369602,0
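
One detail of this approach is how ties are resolved: `Series.mode()` returns all modes in sorted order, so taking the first one picks the lowest group code when a customer bought several groups equally often. A minimal sketch on made-up purchases:

```python
import pandas as pd

# Toy purchases (hypothetical): customer 2 bought groups 0 and 1 equally often
toy = pd.DataFrame({
    "customer_id":        [1, 1, 1, 2, 2, 2, 2],
    "product_group_name": [3, 3, 0, 1, 0, 1, 0],
})

fav = (
    toy.groupby("customer_id")["product_group_name"]
    .apply(lambda x: x.mode().iloc[0])  # first of the sorted modes
    .reset_index()
    .rename(columns={"product_group_name": "fav_product_group"})
)

print(fav)
```

Since group 0 is also the most popular group overall, this tie-breaking slightly favors it; whether that matters depends on how common ties are in the real data.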


Merge with the customer dataset:

In [25]:
customers = pd.merge(customers, articles_transactions[["customer_id", "fav_product_group"]], on="customer_id", how="left")
customers["fav_product_group"] = customers["fav_product_group"].fillna(most_popular_group) # give most popular group to customers without purchases
customers.head()

Unnamed: 0,customer_id,Active,club_member_status,fashion_news_frequency,age,postal_code,age_group,fav_product_group
0,6883939031699146327,-1,0,0,49.0,6305,Middle-aged,0.0
1,11246327431398957306,-1,0,0,25.0,33726,Young,6.0
2,18439897732908966680,-1,0,0,24.0,3247,Young,0.0
3,18352672461570950206,-1,0,0,54.0,168643,Senior,4.0
4,18162778555210377306,1,0,1,52.0,168645,Senior,0.0
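
The `.0` suffix in `fav_product_group` above comes from the left merge: customers without purchases receive NaN, which forces the column to a float dtype. A minimal sketch on toy data (hypothetical values; the final integer cast is an optional extra, not something this notebook does):

```python
import pandas as pd

# Toy tables (hypothetical): customer 2 has no purchases
toy_customers = pd.DataFrame({"customer_id": [1, 2, 3]})
toy_favs = pd.DataFrame({"customer_id": [1, 3], "fav_product_group": [3, 4]})
fallback_group = 0  # stand-in for the most popular group

# how="left" keeps every customer, including those missing from toy_favs
merged = pd.merge(toy_customers, toy_favs, on="customer_id", how="left")
dtype_after_merge = str(merged["fav_product_group"].dtype)  # float, because of the NaN

# Fill the gap with the fallback and cast back to an integer code
merged["fav_product_group"] = (
    merged["fav_product_group"].fillna(fallback_group).astype("int64")
)

print(dtype_after_merge)
print(merged)
```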


# Generate updated datasets

Only a new customers file is generated, since no changes were made to articles and transactions_train.

In [26]:
customers.to_parquet(path + "customers2.parquet")