# FALL 2021 IE 582 PROJECT REPORT

### Alperen KÖKSAL
### Mehmet ÖZER

### Topic: Customer Gender Prediction based on E-commerce Data

#### 23.01.2022

### Professor: Mustafa Gökçe BAYDOĞAN

# Introduction

## Problem Description

### The aim of the project is to predict customer gender from the actions taken on an online retail website. These predictions could be beneficial for use cases such as targeted promotions. For this purpose, we are given data from one of the biggest retail companies in Turkey. The data consists of features such as the timestamp, the action taken by the user, information about the product on whose page the action was taken, and the unique id and gender of the user who took the action.

## Approaches

### Two main approaches are used to develop a solution to this business problem:
* 1. Treating each instance as a separate entity and building a model that maximizes the AUC on the train data. The model then predicts a probability for each instance in the test data, and these probabilities are averaged within each unique id.
* 2. Grouping the instances by their unique ids using our proposed bag representation, then building a predictive model that maximizes the predefined performance metric ((AUC+BAR)/2).
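
The aggregation step of the first approach and the performance metric can be sketched as follows. This is a minimal illustration, not the project code: the column names `unique_id` and `p_female`, the toy values, and the 0.5 decision threshold are assumptions, and BAR is read here as the balanced accuracy rate.

```python
import pandas as pd
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical instance-level predictions: one row per action.
preds = pd.DataFrame({
    "unique_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "p_female": [0.9, 0.7, 0.2, 0.4, 0.3, 0.8, 0.35, 0.45],
})
# Hypothetical ground truth per user (1 = female, 0 = male).
truth = pd.Series({1: 1, 2: 0, 3: 1, 4: 0}, name="is_female")

# Approach 1: average the instance probabilities within each unique_id.
user_prob = preds.groupby("unique_id")["p_female"].mean()


def performance(y_true, y_prob, threshold=0.5):
    """(AUC + BAR) / 2, with BAR assumed to be balanced accuracy."""
    auc = roc_auc_score(y_true, y_prob)
    bar = balanced_accuracy_score(y_true, (y_prob >= threshold).astype(int))
    return (auc + bar) / 2


score = performance(truth.loc[user_prob.index], user_prob)
```

Averaging is only one possible way to pool instance probabilities; other pooling rules (median, maximum) could be swapped in at the `groupby` step.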

## Descriptive Analysis

* As the first step, the data is checked for duplicates. In both the train and the test data, more than half of the instances are duplicates. These instances do not bring any extra information, therefore they are dropped. There are also missing values, which are dealt with later on. Except for the selling price, the data consists of categorical variables. The relations between the categorical variables are investigated. Brand name has more distinct values than brand id, therefore brand id is dropped. Contentid and product name have a very large number of distinct values. It is unlikely that they will provide any useful information for model improvement, therefore they are dropped as well.
* For all three category levels, the id and the name of the category represent the same thing. Thus, the category level names are dropped. Comparing category_id with Level3_Category_Id, we see that category_id has more distinct values than Level3_Category_Id. When the data is grouped by Level3_Category_Id, each group has its own distinct set of category_id values, so category_id already determines Level3_Category_Id. Therefore, Level3_Category_Id is dropped. User action has five different categories. product_gender is missing in 11% of the data. In the test data, there are unique ids that have at least one missing value in all of their instances, therefore dropping the missing values blindly is not an option.

* Product-related features are missing in exactly the same instances, which suggests that these missing values have a structural cause.
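
The checks described above can be sketched with pandas. The frame below is a toy example with assumed column values; `determines` is a hypothetical helper that tests whether a finer column fully determines a coarser one, which is the condition under which the coarser column is redundant.

```python
import numpy as np
import pandas as pd

# Toy data with assumed values; rows 0/1 and 3/4 are exact duplicates.
df = pd.DataFrame({
    "unique_id": [1, 1, 1, 2, 2],
    "category_id": [10, 10, 11, 12, 12],
    "Level3_Category_Id": [100, 100, 100, 101, 101],
    "selling_price": [5.0, 5.0, np.nan, np.nan, np.nan],
})

# Share of rows removed as exact duplicates.
n_before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
share_duplicates = 1 - len(df) / n_before


def determines(frame, fine, coarse):
    """True if every value of `fine` maps to exactly one value of `coarse`."""
    return bool((frame.groupby(fine)[coarse].nunique() <= 1).all())


# Unique ids for which every remaining row has at least one missing value.
row_has_na = df.isna().any(axis=1)
all_rows_missing = row_has_na.groupby(df["unique_id"]).all()
ids_fully_missing = all_rows_missing[all_rows_missing].index.tolist()
```

If `determines(df, "category_id", "Level3_Category_Id")` holds, dropping `Level3_Category_Id` loses no information, which matches the reasoning in the list above.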

# Literature

### Küçükaşcı, E.Ş., Baydoğan, M.G., "Bag Encoding Strategies in Multiple Instance Learning Problems", Information Sciences, 467, 559-578, 2018.
### Cheplygina, V., Tax, D.M.J., Loog, M., "Multiple Instance Learning with Bag Dissimilarities", Pattern Recognition, 48(1), 264-275, 2015.

### These papers helped us to build a base knowledge for the bag representation of the second approach.

### Tran Duc, Duong & Tan, Hanh & Pham, Son. (2016). Customer gender prediction based on E-commerce data. 91-95. 10.1109/KSE.2016.7758035. 

### Based on this paper, we had the idea of including the timestamp in different ways and chose our initial models as SVC and Random Forest.

# Approach

### Throughout the project, different types of approaches are combined for the different preprocessing and model building steps. For each step, everything that we tried will be explained.

* From the timestamp feature, the week, day, day of the week, month, hour, minute and second are extracted. In addition, a feature called 'FLAG_weekend' is created, which takes the value 1 if the day falls on a weekend.

* Before dealing with the missing values, two FLAG features are created for the product_gender and selling_price features. Each flag takes the value 1 if the corresponding column has a missing value.

* For one of the approaches, product gender is dropped due to the possibility of creating bias.
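
A minimal sketch of these feature engineering steps is given below. The column name `time_stamp`, the flag names other than `FLAG_weekend`, and the toy rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows; 2020-10-03 is a Saturday and 2020-10-05 is a Monday.
df = pd.DataFrame({
    "time_stamp": pd.to_datetime(["2020-10-03 14:05:30", "2020-10-05 09:15:00"]),
    "selling_price": [np.nan, 49.9],
    "product_gender": ["F", None],
})

ts = df["time_stamp"].dt
df["week"] = ts.isocalendar().week.astype(int)
df["day"] = ts.day
df["dayofweek"] = ts.dayofweek  # Monday = 0, Sunday = 6
df["month"] = ts.month
df["hour"] = ts.hour
df["minute"] = ts.minute
df["second"] = ts.second
df["FLAG_weekend"] = (ts.dayofweek >= 5).astype(int)

# Missing-value indicators are created before any imputation takes place.
for col in ["selling_price", "product_gender"]:
    df[f"FLAG_{col}_missing"] = df[col].isna().astype(int)
```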

## Handling the Missing Values

### For the missing value problem, two approaches are applied:
* The first approach is to fill the categorical variables with the most frequent values and the numerical variables with the mean.
* The second approach is to take advantage of Random Forest. Product-related features are missing in 0.1% of the data. Because this proportion is small, they are simply filled with the most frequent values. Afterwards, only two columns are left with missing values: selling_price and product_gender. selling_price has around 30000 missing values, far fewer than the missing value count of product_gender, which is 23400. Therefore, selling_price is imputed first, using the other columns (except product_gender) as inputs to a Random Forest model with selling_price as the target variable. Then, using the same logic, the product_gender column is imputed using all the other columns.
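
The model-based imputation in the second approach can be sketched as below. This is an illustration under stated assumptions: the data is synthetic, `impute_with_model` is a hypothetical helper, and the hyperparameters are arbitrary. For the categorical product_gender column, a `RandomForestClassifier` would take the place of the regressor.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: price depends on two already complete, encoded columns.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "category_id": rng.integers(0, 5, n),
    "brand_id": rng.integers(0, 3, n),
})
df["selling_price"] = 10.0 * df["category_id"] + 5.0 * df["brand_id"] + rng.normal(0, 1, n)
df.loc[rng.choice(n, 20, replace=False), "selling_price"] = np.nan


def impute_with_model(frame, target, features, model):
    """Fit `model` on rows where `target` is known, then fill the missing rows."""
    out = frame.copy()
    missing = out[target].isna()
    if missing.any():
        model.fit(out.loc[~missing, features], out.loc[~missing, target])
        out.loc[missing, target] = model.predict(out.loc[missing, features])
    return out


imputed = impute_with_model(
    df,
    target="selling_price",
    features=["category_id", "brand_id"],
    model=RandomForestRegressor(n_estimators=50, random_state=0),
)
```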

## Rare Encoding

### Next step is to do a rare encoding. For rare encoding, two approaches are applied:

* In the first approach, the train and test data are merged. By analysing the levels of all the categorical features, a percentage threshold that balances dimensionality against information loss is defined for each feature. Rare encoding is then applied: for every categorical feature, the levels whose frequency is below that feature's threshold are transformed into a new level called 'Rare'. One hot encoding is applied to the categorical features. Then, the train and test data are split again.

* In the second approach, levels that do not appear in both the train and the test data are converted into a new level called 'Other'. After that, the two datasets are merged and rare encoding is applied such that levels with a frequency of less than 1% are transformed into a new level called 'Rare'. Then, the train and test data are split again. Label encoding is applied to each categorical feature.
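
Both encoding steps can be sketched with two small helpers. The function names and the toy levels are hypothetical, and the sketch assumes missing values have already been handled, since a missing value would otherwise be relabeled as 'Other'.

```python
import pandas as pd


def rare_encode(series, threshold=0.01, label="Rare"):
    """Replace levels whose relative frequency is below `threshold` with `label`."""
    freq = series.value_counts(normalize=True)
    rare_levels = freq[freq < threshold].index
    return series.where(~series.isin(rare_levels), label)


def mark_unshared(train_col, test_col, label="Other"):
    """Replace levels that do not occur in both columns with `label`."""
    shared = set(train_col) & set(test_col)
    return (
        train_col.where(train_col.isin(shared), label),
        test_col.where(test_col.isin(shared), label),
    )


# Levels B (4%) and C (1%) fall below the 5% threshold used in this toy case.
brand = pd.Series(["A"] * 95 + ["B"] * 4 + ["C"])
brand_encoded = rare_encode(brand, threshold=0.05)

# Levels C and D each appear in only one of the two datasets.
train_level, test_level = mark_unshared(
    pd.Series(["A", "B", "C"]), pd.Series(["A", "B", "D"])
)
```

Using one threshold per feature, as in the first approach, only requires calling `rare_encode` with a different `threshold` for each column.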

## Extra Steps

* One of the approaches was to apply PCA to reduce the dimensionality of the data while keeping as much of the variance as possible.

* Another approach was to apply undersampling and oversampling (SMOTE) to deal with class imbalance. The data contains more female-labeled instances, so the model may favor the more frequent class. Undersampling is done by sampling from the female population with replacement such that #females = #males. SMOTE creates new male instances by interpolating between existing male instances and their nearest neighbors (the number of neighbors is set by the k_neighbors parameter).
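
The two resampling ideas can be sketched as follows. The data is synthetic, and `smote_like` is a simplified NumPy stand-in written for illustration only; it is not the SMOTE implementation used in the project and handles numeric features only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
    "gender": ["F"] * 70 + ["M"] * 30,
})
females = df[df["gender"] == "F"]
males = df[df["gender"] == "M"]

# Undersampling: draw as many female rows as there are male rows
# (with replacement, as described in the text).
under = pd.concat([females.sample(n=len(males), replace=True, random_state=0), males])


def smote_like(X, n_new, k=5, rng=rng):
    """Interpolate between a minority point and one of its k nearest minority neighbours."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X), n_new)
    picked = neighbours[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))  # position along the segment between the two points
    return X[base] + gap * (X[picked] - X[base])


X_male = males[["x1", "x2"]].to_numpy()
synthetic_males = smote_like(X_male, n_new=len(females) - len(males))
```

Because every synthetic point lies on a segment between two existing male instances, the new points stay inside the region already covered by the minority class.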

## Model Building

### In general, for model building, the train data is split into two datasets: Train and Validation. The model is trained on the train data with cross validation and validated on the validation data.
* When SVC is applied to the data, all the test instances are labeled as Female. It is therefore decided that SVC is not a good fit for this problem.

* Random Forest had good results. It is robust to individual features dominating the overall results and it has low variance due to the high number of estimators. Robustness to feature dominance was a good fit for this problem because product_gender had the potential to dominate the results and create a bias. These are the reasons we chose Random Forest as one of our main algorithms.

* LightGBM and CatBoost are similar algorithms that we used. Both of them are good at dealing with high cardinality categorical features (CatBoost is probably superior in this respect) and they have built-in ways to handle missing data (this was useful in the first phase of the project, when we had not yet decided to impute the missing values). CatBoost was our first choice; however, we found its structure isolated and hard to combine with sklearn, which prevented us from using some useful features of other libraries. For these reasons, CatBoost was abandoned.

* LightGBM was fast and compatible with other libraries. For these reasons, we kept using it together with the other algorithms.

* The last model building approach we used was AutoML from the H2O package. It is an automated ML library that performs the parameter tuning and ensembling steps automatically, given a proper dataset. For AutoML, the dataset must be converted to an H2O Frame and the variable types must be defined. The data is then passed in, a stopping criterion is selected (the number of models to try or a time limit in seconds), and the run is started.
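
A minimal sketch of the train/validation workflow with Random Forest is shown below. Splitting by unique id, so that all actions of one user stay on the same side of the split, is an assumption about how leakage can be avoided rather than a description of the project code. The data is synthetic and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, cross_val_score

# Synthetic data: 60 users with 5 actions each; the label is constant per user.
rng = np.random.default_rng(0)
n_users, per_user = 60, 5
groups = np.repeat(np.arange(n_users), per_user)
y = (np.arange(n_users) % 2)[groups]
X = rng.normal(size=(n_users * per_user, 4))
X[:, 0] += 1.5 * y  # one informative feature

# Hold out 25% of the users (not of the rows) as the validation set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(splitter.split(X, y, groups))

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross validation on the train part, again keeping users together.
cv_auc = cross_val_score(
    model, X[train_idx], y[train_idx],
    groups=groups[train_idx], cv=GroupKFold(n_splits=3), scoring="roc_auc",
)

# Final fit on the train part and evaluation on the held-out users.
model.fit(X[train_idx], y[train_idx])
valid_auc = roc_auc_score(y[valid_idx], model.predict_proba(X[valid_idx])[:, 1])
```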

# Results

### The images show the training metrics. The validation score is obtained from the validation set that is separated before training. The submission score is the score obtained by submitting predictions for the test data. For AutoML models, the best model is used for prediction.

### RF (Submission AUC= 0.845, Submission Performance= 0.804)
![title](images/rf_raw.jpg)

### RF with PCA
![title](images/rf_pca.jpg)

### SVC
![title](images/svc_raw.jpg)

### SVC with PCA
![title](images/svc_pca.jpg)

### AutoML CV  (10 mins) (Validation AUC= 0.87, Submission AUC= 0.839, Submission Performance= 0.707)
![title](images/automl_600.png)

### AutoML CV (80 mins) (Validation AUC= 0.90, Submission AUC= 0.833, Submission Performance= 0.805)
![title](images/automl_4800.png)

### AutoML Feature Importances
![title](images/importance.png)

### AutoML Feature Importances (without product_gender)
![title](images/importance_without_gender.png)

# Conclusions and Future Work

### As the final result, we obtained a performance ((BAR+AUC)/2) value of ~0.80, with an AUC score of 0.845 and a BAR score of 0.76. This is a good AUC score overall. On the validation set, however, the model reached an AUC score of ~0.91. Considering that the data was split carefully to prevent any leakage, a drop of 0.065 in the AUC score was not expected. Even though the data was large enough that we did not expect overfitting, the gap suggests that some overfitting occurred.

### Clearly, there is scope for future improvement in this project. For the group by approach, features such as 'percentage of the actions taken on a specific category level', 'percentage of the actions taken in a specific time interval' and 'percentage of the actions taken in a specific price interval' could be added. Some useful information could be extracted from the product name. Features similar to those used in RFM applications could also be added, such as 'frequency of the purchases', 'monetary value of the purchases', 'RFM values of the person', and interactions between selling price and action types.
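
The proposed percentage features could be built per unique id along the following lines. This is a sketch of the idea only: the action names, the price bin edges and the feature prefixes are hypothetical.

```python
import pandas as pd

# Toy action log with assumed action names and prices.
actions = pd.DataFrame({
    "unique_id": [1, 1, 1, 1, 2, 2],
    "user_action": ["visit", "visit", "basket", "order", "visit", "search"],
    "selling_price": [20.0, 80.0, 80.0, 80.0, 15.0, 300.0],
})

# Share of each action type within a user's bag of instances.
action_share = pd.crosstab(
    actions["unique_id"], actions["user_action"], normalize="index"
).add_prefix("share_action_")

# Share of actions per price interval (bin edges are arbitrary here).
price_bin = pd.cut(
    actions["selling_price"],
    bins=[0, 50, 100, float("inf")],
    labels=["low", "mid", "high"],
).astype(str)
price_share = pd.crosstab(
    actions["unique_id"], price_bin, normalize="index"
).add_prefix("share_price_")

# One row per unique_id, ready to be used as a bag-level feature table.
bags = action_share.join(price_share)
```

The same pattern applies to time intervals or category levels by replacing the column passed to `pd.crosstab`.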

# Codes

## Group By Approach: https://github.com/BU-IE-582/fall21-sencer4898/blob/gh-pages/files/project_codes/groupby_source_code.ipynb

## Individual Instances Approach: https://github.com/BU-IE-582/fall21-sencer4898/blob/gh-pages/files/project_codes/individual_treatment_source_code.ipynb

```python
# This line is used to create the HTML version of the notebook.
import os

os.system('jupyter nbconvert --to html ProjectReport.ipynb')
```