In [1]:
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Set the correct backend for DGL
os.environ["DGLBACKEND"] = "pytorch"

In [2]:
# Load the 5% sample of the transactions data
BASE_PATH = '../data/'

transactions = pd.read_parquet(BASE_PATH + 'parquet/transactions_train_sample_0.05.parquet')

# Preprocessing the data

DGSR expects the data in a specific format. The following code converts the data to that format so the DGSR code can run unmodified.

In [3]:
# These are the column names that DGSR expects
transactions = transactions.rename(columns={"article_id": "item_id", "customer_id": "user_id"})
# DGSR needs a Unix timestamp (whole seconds since 1970-01-01)
transactions['time'] = (transactions['t_dat'] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')

# Drop the columns that are not needed for DGSR
transactions.drop(columns=['t_dat', 'sales_channel_id', 'price', 'week'], inplace=True)
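
The timestamp conversion above subtracts the Unix epoch and floor-divides by one second. A minimal sketch of the same arithmetic with the standard library, assuming the dates are timezone-naive; the example date is hypothetical and not taken from the dataset:

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def to_unix_seconds(moment: datetime) -> int:
    """Whole seconds between the Unix epoch and a timezone-naive datetime."""
    # Floor-dividing one timedelta by another yields an integer count.
    return (moment - EPOCH) // timedelta(seconds=1)


# Hypothetical purchase date, not taken from the dataset
example = datetime(2018, 9, 20)
seconds = to_unix_seconds(example)

# Converting back recovers the original date
round_trip = EPOCH + timedelta(seconds=seconds)
print(seconds, round_trip)
```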

In [4]:
# Convert the ID columns to the integer format expected by DGSR.
# Label-encode user_id so the IDs start at 0 and have no gaps.
user_encoder = LabelEncoder()
transactions['user_id'] = user_encoder.fit_transform(transactions['user_id'])

# Label-encode item_id the same way. A separate encoder is kept for each
# column so both mappings can be reversed later with inverse_transform.
item_encoder = LabelEncoder()
transactions['item_id'] = item_encoder.fit_transform(transactions['item_id'])
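
To make clear what label encoding does to the IDs, here is a minimal pure-Python sketch of the same idea: every distinct value gets its position in the sorted list of distinct values. The helper `fit_label_mapping` and the raw IDs are hypothetical and only serve as illustration; the notebook itself uses scikit-learn.

```python
def fit_label_mapping(values):
    """Map each distinct value to its rank among the sorted distinct values."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


# Hypothetical raw IDs, not taken from the dataset
raw_item_ids = [706016001, 108775015, 706016001, 372860001]

mapping = fit_label_mapping(raw_item_ids)
encoded = [mapping[value] for value in raw_item_ids]

# Keep the reverse mapping to translate encoded IDs back to the raw IDs
inverse = {index: value for value, index in mapping.items()}
decoded = [inverse[index] for index in encoded]

print(encoded)
print(decoded == raw_item_ids)
```

Keeping the reverse mapping around matters if the model output ever has to be translated back into the original customer and article IDs.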

In [5]:
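# Save to CSV (create the output directory first if it does not exist yet)
os.makedirs(BASE_PATH + 'dgsr', exist_ok=True)
transactions.to_csv(BASE_PATH + 'dgsr/transactions_train.csv', index=False)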

# Important note
Make sure the DGSR submodule is pulled from git and the correct requirements are installed.
 
# Installation

The versions pinned in the [requirements](./requirements.txt) file work on my machine. Depending on your GPU and the CUDA versions it supports, you might have to install different versions of dgl and pytorch. If you have an AMD GPU, you will have to find out what works for you.

Version selector for [dgl](https://www.dgl.ai/pages/start.html) and for [pytorch](https://pytorch.org/get-started/locally/)

At some point I needed an older version of pytorch so that its CUDA version did not clash with the one required by DGL. Older releases are listed [here](https://pytorch.org/get-started/previous-versions/).
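
Before running anything heavy, it can help to print which versions are actually installed and whether they import at all. The following is a rough sketch, not part of DGSR; the helper `describe_install` is a hypothetical name. It catches every exception on import because a mismatched build can fail with more than a plain `ImportError`.

```python
import importlib


def describe_install(module_name):
    """Return the installed version of a module, or why it failed to import."""
    try:
        module = importlib.import_module(module_name)
    except Exception as error:  # mismatched builds can raise more than ImportError
        return f"not usable ({type(error).__name__}: {error})"
    return getattr(module, "__version__", "unknown version")


report = {name: describe_install(name) for name in ("torch", "dgl")}

try:
    import torch

    # CUDA version pytorch was built against (None for CPU-only builds)
    report["torch CUDA build"] = str(torch.version.cuda)
    report["CUDA available"] = str(torch.cuda.is_available())
except Exception:
    report["torch CUDA build"] = "unknown"

for key, value in report.items():
    print(f"{key}: {value}")
```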

In [6]:
# Move the data into the DGSR submodule
!cp -a ../data/dgsr/. ../DGSR/Data/
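
On systems without `cp` (for example plain Windows), roughly the same copy can be done from Python. This is a sketch under the assumption that Python 3.8+ is available; `copy_directory_contents` is a hypothetical helper, and the demonstration runs in a throwaway directory so no real data is touched.

```python
import shutil
import tempfile
from pathlib import Path


def copy_directory_contents(source, destination):
    """Copy everything inside source into destination, creating it if needed."""
    source, destination = Path(source), Path(destination)
    if not source.is_dir():
        raise FileNotFoundError(f"source directory not found: {source}")
    # dirs_exist_ok (Python 3.8+) merges into an existing destination
    shutil.copytree(source, destination, dirs_exist_ok=True)
    return sorted(path.name for path in destination.iterdir())


# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "dgsr")
    destination = Path(tmp, "Data")
    source.mkdir()
    destination.mkdir()
    (source / "transactions_train.csv").write_text("user_id,item_id,time\n")
    (destination / "existing.txt").write_text("kept\n")
    copied = copy_directory_contents(source, destination)

print(copied)
```

With the paths from the cell above, the call would be `copy_directory_contents("../data/dgsr", "../DGSR/Data")`.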

Generating all the graphs might take a while and takes quite a lot of storage space (50+ GB). Expanding this small amount of data results in massive amounts of graph data.

The DGSR authors' datasets were similar in size to the 5% sample I use, but I later found out that they had far more compute power than I do. This is why running it on my own hardware is not feasible for me.
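
Given the storage requirement, a quick check of the free disk space before generating the graphs can save a failed run. A minimal sketch, where the 50 GB threshold is only the rough figure mentioned above and `has_free_space` is a hypothetical helper:

```python
import shutil

REQUIRED_GB = 50  # rough lower bound from the note above, not a measured value


def has_free_space(path, required_gb):
    """Return (enough, free_gb) for the disk that holds the given path."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb, free_gb


enough, free_gb = has_free_space(".", REQUIRED_GB)
status = "enough" if enough else "probably not enough"
print(f"{free_gb:.1f} GB free, {status} for the generated graphs")
```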

In [None]:
# Generate the dynamic graphs and save them to disk so they don't have to be generated multiple times at runtime
!./load_data.sh

# Changes to make in the DGSR code
Open [generate_neg.py](./DGSR/generate_neg.py) and change the dataset to "transactions_train".

In [None]:
# Generate negative samples, which are used for the validation and test sets
!./load_neg_data.sh

The following takes a long time to run. I ran it for 2 to 3 hours and it was not even close to finishing; I estimate it would take at least 1.5 to 2 days on my hardware. I also need the GPU for other courses, so I can't run it for that long.

In [None]:
# Train the model
!./train.sh