# Training Facebook's DLRM on the DIGIX dataset

Run this notebook on Google Colab and select GPU as the runtime type. Run the cells below in order to train DLRM on the DIGIX data.

Optionally, mount your Google Drive account to persist the data beyond the current runtime:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os
os.chdir("/content/drive/MyDrive/Colab Notebooks/")

In [None]:
base_path = "/content/drive/MyDrive/Colab Notebooks/"

Or skip the Drive mount and keep everything on the Colab machine only:



In [1]:
base_path = "/content/"

In [2]:
!mkdir dlrm_facebook

In [3]:
import os
os.chdir(base_path + "dlrm_facebook")

In [4]:
# download the DIGIX dataset from Kaggle
os.environ['KAGGLE_USERNAME'] = "<your username>"
os.environ['KAGGLE_KEY'] = "<your kaggle key>"
!kaggle datasets download -d louischen7/2020-digix-advertisement-ctr-prediction

Downloading 2020-digix-advertisement-ctr-prediction.zip to /content/dlrm_facebook
100% 1.20G/1.20G [00:12<00:00, 116MB/s]
100% 1.20G/1.20G [00:12<00:00, 100MB/s]


In [5]:
import zipfile

# extract the archive into the current directory; the context manager closes the file
with zipfile.ZipFile("2020-digix-advertisement-ctr-prediction.zip", 'r') as zip_ref:
    zip_ref.extractall()

In [39]:
!ls

2020-digix-advertisement-ctr-prediction.zip  test_data_A.csv  train_data
dlrm					     test_data_B.csv


In [7]:
!git clone https://github.com/mabeckers/dlrm.git

Cloning into 'dlrm'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 511 (delta 11), reused 14 (delta 6), pack-reused 491
Receiving objects: 100% (511/511), 1.32 MiB | 30.00 MiB/s, done.
Resolving deltas: 100% (300/300), done.


In [8]:
os.chdir(base_path + "dlrm_facebook/dlrm")

In [9]:
!git checkout new_dataset

Branch 'new_dataset' set up to track remote branch 'new_dataset' from 'origin'.
Switched to a new branch 'new_dataset'


In [40]:
!python dlrm_s_pytorch.py --data-generation=dataset --mini-batch-size=1000 --data-set=digix --raw-data-file="/content/dlrm_facebook/train_data/train_data.csv" --processed-data-file="/content/dlrm_facebook/dlrm/digix_processed.npz" --nepochs=50 --print-freq=4000 --test-freq=4000 --loss-function=bce --test-mini-batch-size=2000 --arch-sparse-feature-size=5 --arch-mlp-bot="8-5" --arch-mlp-top="150-75-1" --learning-rate=0.01 --use-gpu

In [None]:
# run this instead if you have mounted Google Drive
#!python dlrm_s_pytorch.py --data-generation=dataset --mini-batch-size=1000 --data-set=digix --raw-data-file="/content/drive/MyDrive/Colab Notebooks/dlrm_facebook/ctr-prediction-digix/train_data/train_data.csv" --processed-data-file="/content/drive/MyDrive/Colab Notebooks/dlrm_facebook/ctr-prediction-digix/train_data/digix_processed.npz" --nepochs=30 --print-freq=4000 --test-freq=4000 --loss-function=bce --test-mini-batch-size=2000 --arch-sparse-feature-size=5 --arch-mlp-bot="8-5" --arch-mlp-top="150-75-1" --learning-rate=0.01 --use-gpu

In [41]:
%load_ext tensorboard

In [42]:
%tensorboard --logdir="/content/dlrm_facebook/dlrm/run_kaggle_pt"

# Notes

1. The goal of this project was to create a simple and technically clean extension of the DLRM model API that can use the raw digix data as input.

2. The model is still far from its performance ceiling. Things that could improve performance:
- Data cleaning: outlier detection, removal of unwanted features and unrealistic rows, etc.
- Potentially removing users that do not contribute enough (e.g. too few 1s).
- Potentially removing ad campaigns that aren't successful enough.
- Feature engineering, and more generally a much better understanding of the data and of CTR prediction in the DIGIX context.
- Tuning the hyperparameters of the model, such as the embedding size and the top and bottom MLP architectures.
- Trying out different under- and oversampling techniques.
- Using the full dataset.
- Changing the training mechanism: a different LR scheduler and/or optimization algorithm (e.g. stochastic gradient descent with warm restarts and cosine annealing of the learning rate).
- Adding more regularization. The TensorBoard logs of the last training run above clearly show overfitting setting in around the 40,000th training iteration, so some form of regularization should be added to counter it.
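
To make the scheduler idea from the list above concrete, here is a minimal sketch of a cosine-annealing learning-rate schedule with warm restarts, written in plain Python. It is not part of the training script used in this notebook; the function name `cosine_restart_lr` and the parameters `cycle_len` and `cycle_mult` are hypothetical, and the default `base_lr=0.01` simply mirrors the `--learning-rate=0.01` flag used above.

```python
import math


def cosine_restart_lr(step, base_lr=0.01, min_lr=0.0, cycle_len=10_000, cycle_mult=1):
    """Learning rate at `step` under cosine annealing with warm restarts.

    Within each cycle the rate decays from `base_lr` to `min_lr` along half a
    cosine wave, then jumps back to `base_lr` at the start of the next cycle.
    `cycle_mult` (an integer >= 1) stretches every following cycle by that factor.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    position, length = step, cycle_len
    # Walk forward through completed cycles to find the position in the current one.
    while position >= length:
        position -= length
        length *= cycle_mult
    cosine = math.cos(math.pi * position / length)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# Learning rate at a few points of the first two cycles (cycle length 10,000 steps).
for step in (0, 2_500, 5_000, 9_999, 10_000):
    print(step, cosine_restart_lr(step))
```

In a PyTorch training loop the same schedule is available as `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`, which would be the natural choice if the training script were modified; whether the restart period should be shorter than the roughly 40,000 iterations after which overfitting was observed is something that would need to be tuned.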

3. It is important to look at the label distribution in this dataset: there is a huge imbalance between the positive and the negative class (far more negatives), as one would expect from a real-world dataset. To reduce this imbalance, I did two things:
- Tossed out all users that only have negative labels.
- Undersampled the remaining dataset such that 60% 0s and 40% 1s remain.
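
The two balancing steps can be sketched in pandas as follows. This is an illustration only and not necessarily how the preprocessing in the repository implements them: the function name `balance_clicks` is hypothetical, and the column names `uid` and `label` are assumptions about the data layout. The toy frame at the end stands in for the real training data.

```python
import pandas as pd


def balance_clicks(df, user_col="uid", label_col="label", pos_fraction=0.4, seed=0):
    """Drop users without any positive label, then undersample the negatives.

    Column names are assumed; adjust them to the actual dataset.
    """
    # Step 1: keep only rows of users who have at least one positive label.
    user_has_positive = df.groupby(user_col)[label_col].transform("max") == 1
    kept = df[user_has_positive]

    positives = kept[kept[label_col] == 1]
    negatives = kept[kept[label_col] == 0]

    # Step 2: number of negatives needed so that positives make up `pos_fraction`.
    n_neg_target = int(round(len(positives) * (1 - pos_fraction) / pos_fraction))
    n_neg = min(n_neg_target, len(negatives))
    sampled_negatives = negatives.sample(n=n_neg, random_state=seed)

    # Recombine and shuffle so that the classes are interleaved.
    balanced = pd.concat([positives, sampled_negatives])
    return balanced.sample(frac=1.0, random_state=seed).reset_index(drop=True)


# Toy data: user "b" never clicks and is removed entirely.
toy = pd.DataFrame({
    "uid": ["a"] * 24 + ["b"] * 10 + ["c"] * 14,
    "label": [1] * 4 + [0] * 20 + [0] * 10 + [1] * 4 + [0] * 10,
})
balanced = balance_clicks(toy)
print(balanced["label"].value_counts())
```

All positives are kept and only negatives are discarded, so the size of the balanced set is determined by the number of positive rows.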


4. In my trials, Google Colab was not able to handle the full dataset (around 40 million rows), which is why I only used half of it.

