In [27]:
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder

os.environ["DGLBACKEND"] = "pytorch"

In [37]:
# Load the 5% sample of the transactions data
BASE_PATH = '../data/'

transactions = pd.read_parquet(BASE_PATH + 'parquet/transactions_train_sample_0.05.parquet')

user_id    int64
item_id    int64
time       int64
dtype: object
   user_id  item_id        time
0      184     4220  1537401600
1      224     3747  1537401600
2      224    23666  1537401600
3      224    15699  1537401600
4      303      479  1537401600


In [None]:
# These are the column names that DGSR expects
transactions = transactions.rename(columns={"article_id": "item_id", "customer_id": "user_id"})
# Convert the transaction date to Unix time in whole seconds
transactions['time'] = (transactions['t_dat'] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')

# Drop the columns that are not needed for DGSR
transactions.drop(columns=['t_dat', 'sales_channel_id', 'price', 'week'], inplace=True)
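
As a quick sanity check of the epoch conversion, the sketch below applies the same expression to a tiny made-up frame. The `demo` name and its two dates are hypothetical, and it assumes `t_dat` is already a datetime column, as the subtraction above requires.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real transactions data.
demo = pd.DataFrame({"t_dat": pd.to_datetime(["2018-09-20", "2018-09-21"])})

# Same expression as in the cell above: whole seconds since 1970-01-01.
demo["time"] = (demo["t_dat"] - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

print(demo)
print(demo.dtypes)
```

Consecutive days should differ by exactly 86400 seconds, and the `time` column should come out as an integer dtype.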

In [None]:
# Label encode the ids so they are contiguous integers starting at 0,
# which is the format DGSR expects.
# Separate encoders are kept so either mapping can be inverted later.
user_encoder = LabelEncoder()
transactions['user_id'] = user_encoder.fit_transform(transactions['user_id'])

item_encoder = LabelEncoder()
transactions['item_id'] = item_encoder.fit_transform(transactions['item_id'])
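
To illustrate what the label encoding does, here is a minimal sketch on a handful of made-up ids (the values in `raw_ids` are hypothetical). `LabelEncoder` sorts the distinct values and assigns each one its position, so the codes start at 0 and have no gaps, and `inverse_transform` recovers the original ids.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw ids; the real ones come from the transactions frame.
raw_ids = [706016001, 108775015, 706016001, 541518023]

encoder = LabelEncoder()
codes = encoder.fit_transform(raw_ids)

# Codes are positions in the sorted list of distinct ids.
print(codes.tolist())
# The fitted encoder can map the codes back to the original ids.
print(encoder.inverse_transform(codes).tolist())
```

Keeping the fitted encoder around is only needed if the model output has to be translated back to the original ids afterwards.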

In [None]:
# Save to csv
transactions.to_csv(BASE_PATH + 'dgsr/transactions_train.csv', index=False)
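
Before handing the file to DGSR it can help to verify the layout described above. The sketch below is one possible check, not part of DGSR: `check_dgsr_frame` is a hypothetical helper, and the rules it enforces (three columns, integer ids that start at 0 without gaps) are taken from the preprocessing steps in this notebook.

```python
import pandas as pd

def check_dgsr_frame(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if df does not match the layout built above."""
    missing = {"user_id", "item_id", "time"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for col in ("user_id", "item_id"):
        ids = df[col]
        if not pd.api.types.is_integer_dtype(ids):
            raise ValueError(f"{col} is not an integer column")
        # Label-encoded ids run from 0 to n_unique - 1 with no gaps.
        if ids.min() != 0 or ids.nunique() != ids.max() + 1:
            raise ValueError(f"{col} is not a contiguous range starting at 0")

# Made-up example frame that satisfies the checks.
example = pd.DataFrame({
    "user_id": [0, 1, 1],
    "item_id": [2, 0, 1],
    "time": [1537401600, 1537401600, 1537488000],
})
check_dgsr_frame(example)
print("frame looks consistent")
```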

# Important note
Make sure the DGSR submodule is pulled from git and the requirements are installed.
 
# Installation

The versions inside the [requirements](./requirements.txt) file work for me. Depending on your GPU and the CUDA versions it
supports, you might have to install different versions of dgl and pytorch.

Version selector for [dgl](https://www.dgl.ai/pages/start.html) and for [pytorch](https://pytorch.org/get-started/locally/)

At some point I needed an older version of pytorch to get the correct CUDA version; older releases are listed [here](https://pytorch.org/get-started/previous-versions/).
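
To see which versions actually ended up in the environment, a small sketch using only the standard library is shown below. `installed_version` is a hypothetical helper, and the distribution names `torch` and `dgl` are assumptions: CUDA-specific builds may be published under a different distribution name, in which case the lookup reports the package as not installed.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package: str):
    """Hypothetical helper: return the installed version string, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Assumed distribution names; adjust if a CUDA-specific build was installed.
for name in ("torch", "dgl"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

This only reads package metadata, so it does not confirm that the installed build can actually use the GPU.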

In [38]:
!cp -a ../data/dgsr/. ../DGSR/Data/

The following step generates all the graphs, which might take a while and takes quite a lot of storage space. Expanding even this small amount of data results in massive amounts of graph data, which is why it is not feasible for me to run this on my own hardware.

Their datasets were similar in size to the 5% sample I use, but they either have far more compute or far more time available to run the training stage.
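
Since storage is the bottleneck here, it may be worth checking the free disk space before starting the graph generation. The sketch below is only an illustration: `free_gib` is a hypothetical helper and `MIN_FREE_GIB` is an arbitrary placeholder, not a measured requirement of DGSR.

```python
import shutil

def free_gib(path: str = ".") -> float:
    """Hypothetical helper: free space on the filesystem containing path, in GiB."""
    return shutil.disk_usage(path).free / 2**30

# Arbitrary illustrative threshold; the real need depends on the dataset.
MIN_FREE_GIB = 50

available = free_gib(".")
if available < MIN_FREE_GIB:
    print(f"Only {available:.1f} GiB free; graph generation may run out of space.")
else:
    print(f"{available:.1f} GiB free.")
```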

In [39]:
!./load_data.sh

start: 2023-12-18 21:51:35.653903
^C


# Changes to make in the DGSR code
Open [generate_neg.py](./DGSR/generate_neg.py) and change the dataset to "transactions_train", so that it matches the name of the CSV file saved earlier.

In [None]:
!./load_neg_data.sh

In [None]:
!./train.sh