# Downloading and preprocessing data

<a id='contents'></a>
## Contents

* [Introduction](#introduction)
* [Setup](#setup)
* [Downloading csv files](#downloading_csv)
* [Reading csv files into dataframes and inspecting](#reading_csv)
* [Edgelist data](#edgelist_data)
    * [Sorting by increasing order of 'ts' (timestamp) column](#edgelist_preprocessing_sort_ts)
    * [Encoding users, genres, etc. using TGB function ```load_edgelist_datetime```](#edgelist_preprocessing_tgb_load_edgelist_datetime)
    * [Fixing text encoding mismatch 'rock en español' -> 'rock en espaÃ±ol' in labels_dict](#edgelist_preprocessing_fix_rock_en_espanol)
    * [Converting datatypes and removing redundant columns to save memory](#edgelist_preprocessing_dtype_convert)
    * [Adding multiplicity column and removing duplicate records](#edgelist_preprocessing_duplicates)
    * [Renaming the columns, resetting the index, and saving to csv](#edgelist_preprocessing_reset_idx_save)
    * [Saving labels_dict and node_ids to csv](#edgelist_preprocessing_save_dict)
* [Node labels data](#node_labels_preprocessing)
    * [Notwithstanding weights, there is overlap between node label data and edgelist data](#node_labels_overlap)
    * [Node label data is sorted by increasing timestamp](#node_labels_preprocessing_sort)
    * [Node label data contains no duplicate records](#node_labels_preprocessing_verify_no_duplicates)
    * [Node label user_ids are a proper subset of edgelist user_ids](#node_labels_preprocessing_verify_user_id)
    * [Node label genres are the same as edgelist genres](#node_labels_preprocessing_verify_genre)
    * [Mapping node label user_ids and genres using node_ids and labels_dict](#node_labels_preprocessing_mapping)
    * [Renaming and reordering node label dataframe columns, and saving to csv](#node_labels_preprocessing_renaming_columns)
* [Daily labels and final genre list](#daily_labels)
* [References](#references)

<a id='introduction'></a>
## Introduction
↑↑ [Contents](#contents) ↓ [Setup](#setup)

From [[1, p. 7]](#H:2023):
>```tgbn-genre``` This is a bipartite and weighted interaction network between users and the music genres of songs they listen to. Both users and music genres are represented as nodes while an interaction specifies a user listens to a music genre at a given time. The edge weights denote the percentage of which a song belongs to a certain genre. The dataset is constructed by cross-referencing the songs in the [LastFM-song-listens dataset](http://snap.stanford.edu/jodie/#datasets) [15, 24] with that of music genres in the [million-song dataset](http://millionsongdataset.com/) [2]. The LastFM-song-listens dataset has one month of who-listens-to-which-song information for 1000 users and the million-song dataset provides genre weights for all songs in the [LastFM-song-listens dataset](http://snap.stanford.edu/jodie/#datasets). We only retained genres with at least 10% weights for each song that are repeated at least a thousand times in the dataset. Genre names are cleaned to remove typos. Here, the task is to predict how frequently each user will interact with music genres over the next week. This is applicable to many music recommendation systems where providing personalized recommendation is important and user preference shifts over time.

> **References**

>[2] Bertin-Mahieux, T., D. P. Ellis, B. Whitman, and P. Lamere. '[The million-song dataset](https://ismir2011.ismir.net/papers/OS6-1.pdf).' 2011.

>[15] Hidasi, B. and D. Tikk. '[Fast ALS-based tensor factorization for context-aware recommendation from implicit feedback](https://doi.org/10.1007/978-3-642-33486-3_5).' In: Flach, P.A., T. De Bie, and N. Cristianini (eds) _Machine Learning and Knowledge Discovery in Databases_. ECML PKDD 2012. Lecture Notes in Computer Science, vol 7524. Springer, Berlin, Heidelberg. 

>[24] Kumar, S., X. Zhang, and J. Leskovec. '[Predicting dynamic embedding trajectory in temporal interaction networks](https://doi.org/10.1145/3292500.3330895).' In _Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining_, pp. 1269–1278, 2019.


This will all be elucidated and explored in depth in our [EDA notebook](02-eda-tgbn-genre-dataset.ipynb). The purpose of the present notebook is merely to download the data, perform some very basic preprocessing, and save the processed data in our data directory for later use.

<a id='setup'></a>
## Setup
↑↑ [Contents](#contents) ↑ [Introduction](#introduction) ↓ [Downloading csv files](#downloading_csv)

In [1]:
from pathlib import Path
import os
import sys

# Determine the project root directory and add it to the Python path
notebook_path = Path(os.getcwd()).resolve()  # Path to the current working directory
project_root = notebook_path.parent          # Parent directory of notebooks, which is the project root

# Add the project root directory to the Python path
sys.path.append(str(project_root))

# The setup module is project_root/scripts/prjct_setup.py
from scripts.prjct_setup import *


PROJECT DIRECTORY STRUCTURE

├─ Resources/
├─ presentation/
├─ notebooks/
├─ scripts/
│  ├─ tgb/
│  │  ├─ linkproppred/
│  │  ├─ nodeproppred/
│  │  ├─ utils/
│  │  ├─ datasets/
│  │  │  ├─ dataset_scripts/
├─ literature/
├─ data/
├─ models/


FIRST-LEVEL SUBDIRECTORY PATHS

path['Resources'] = F:\projects\temporal-graphs\Resources
path['presentation'] = F:\projects\temporal-graphs\presentation
path['notebooks'] = F:\projects\temporal-graphs\notebooks
path['scripts'] = F:\projects\temporal-graphs\scripts
path['literature'] = F:\projects\temporal-graphs\literature
path['data'] = F:\projects\temporal-graphs\data
path['models'] = F:\projects\temporal-graphs\models

TGB PATHS

tgb_path['linkproppred'] = F:\projects\temporal-graphs\scripts\tgb\linkproppred
tgb_path['nodeproppred'] = F:\projects\temporal-graphs\scripts\tgb\nodeproppred
tgb_path['utils'] = F:\projects\temporal-graphs\scripts\tgb\utils
tgb_path['datasets'] = F:\projects\temporal-graphs\scripts\tgb\datasets
tgb_path['dataset_s

<a id='downloading_csv'></a>
## Downloading csv files (for the first time)
↑↑ [Contents](#contents) ↑ [Setup](#setup) ↓ [Reading csv files into dataframes and inspecting](#reading_csv)

```csv``` files containing records related to the ```tgbn-genre``` dataset can be obtained as follows. The [TGB package](https://github.com/shenyangHuang/TGB) [[2](#H_GH:2023)] can also download and process the data for us, but we opt to follow the manual steps below.

The data may move or change: the instructions and dataset properties described here reflect the state of the data as at May 23, 2024.

1. Point browser to [https://object-arbutus.cloud.computecanada.ca/tgb/](https://object-arbutus.cloud.computecanada.ca/tgb/)
2. Search for the term ```genre```.
3. Find ```<Key>tgbn-genre.zip</Key>```. Append ```tgbn-genre.zip``` to the URL above. This will initiate the download of a zipped folder containing the two csv files below:
    * ```tgbn-genre.zip``` -> ```tgbn-genre_edgelist.csv``` and ```tgbn-genre_node_labels.csv```.
    We believe that these files contain what would be considered raw data for the purposes of this project. As we will see, the files in Step 4 are either duplicates or can be generated using TGB processing.
4. Repeat the procedure for the other keys containing the string ```genre```, which pertain to the ```tgbn-genre``` dataset:
    * ```lastfmgenre.zip``` -> ```lastfmgenre_edgelist.csv```, ```lastfmgenre_node_labels.csv```
    * ```lastfmgenre_raw.zip``` -> ```daily_labels.csv```, ```genre_list_final.csv```, ```lastfmgenre_edgelist.csv```, ```lastfmgenre_node_labels.csv```.
    * ```lastfmgenre_processed.zip``` -> ```genre_list_final.csv```
5. Unzip and move all csv files to the directory specified by ```path['data']``` (this project's data directory). We assume that files with the same name are identical.
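
The manual steps above could also be scripted. The following is a minimal sketch, assuming that the object store serves each key at the base URL followed by the key name (as in Step 3). The helper names ```zip_url```, ```extract_csvs``` and ```download_and_extract``` are hypothetical and are not part of TGB. The sketch reads the whole archive into memory, which may not be appropriate for the larger archives.

```python
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen

# Base URL taken from Step 1; assumed to serve each key at BASE_URL + key.
BASE_URL = 'https://object-arbutus.cloud.computecanada.ca/tgb/'


def zip_url(key):
    """Build the download URL for a key such as 'tgbn-genre.zip'."""
    return BASE_URL + key


def extract_csvs(zip_bytes, dest):
    """Write every csv member of an in-memory zip archive into dest (flattened)."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            if member.lower().endswith('.csv'):
                target = dest / Path(member).name
                target.write_bytes(archive.read(member))
                written.append(target.name)
    return sorted(written)


def download_and_extract(key, dest):
    """Download one archive and extract its csv files (performs a network request)."""
    with urlopen(zip_url(key)) as response:
        return extract_csvs(response.read(), dest)
```

For example, ```download_and_extract('tgbn-genre.zip', path['data'])``` would place the two csv files in the data directory, provided the URL is still valid.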

<a id='reading_csv'></a>
## Reading csv files into dataframes and inspecting
↑↑ [Contents](#contents) ↑ [Downloading csv files](#downloading_csv) ↓ [Edgelist data](#edgelist_data)

We'll read the just-downloaded csv files into dataframes, which we'll store in a dictionary called ```df_```. The items of ```df_``` are of the form ```df_name : df```, where ```df_name.csv``` is a csv file and ```df``` is the corresponding dataframe.

In [2]:
import time 
import pandas as pd

csv_fnames = ['tgbn-genre_edgelist',             
              'lastfmgenre_edgelist',
              'tgbn-genre_node_labels',
              'lastfmgenre_node_labels',
              'daily_labels',
              'genre_list_final',
             ]

df_ = {}

tic = [time.time()]

for fname in csv_fnames:
    df_[fname] = pd.read_csv(path['data'].joinpath(fname + '.csv'))
    tic.append(time.time())
    print(f'Time taken to read {fname}.csv into dataframe: {tic[-1] - tic[-2]:.2f} seconds.')

print(f'Total time taken: {tic[-1] - tic[0]:.2f} seconds.')

Time taken to read tgbn-genre_edgelist.csv into dataframe: 8.84 seconds.
Time taken to read lastfmgenre_edgelist.csv into dataframe: 8.91 seconds.
Time taken to read tgbn-genre_node_labels.csv into dataframe: 1.47 seconds.
Time taken to read lastfmgenre_node_labels.csv into dataframe: 1.46 seconds.
Time taken to read daily_labels.csv into dataframe: 1.85 seconds.
Time taken to read genre_list_final.csv into dataframe: 0.00 seconds.
Total time taken: 22.53 seconds.


Let's inspect the first five rows of each of the dataframes we've just created.

In [3]:
for fname in csv_fnames:
    print_header(fname)
    display(df_[fname].head())


TGBN-GENRE_EDGELIST



Unnamed: 0,ts,user_id,genre,weight
0,1108357203,user_000871,Rock Argentino,0.375
1,1108357203,user_000871,Rock Argentino,0.375
2,1108357264,user_000709,80s,0.452489
3,1108357264,user_000709,pop,0.289593
4,1108357264,user_000709,new wave,0.257919



LASTFMGENRE_EDGELIST



Unnamed: 0,ts,user_id,genre,weight
0,1108357203,user_000871,Rock Argentino,0.375
1,1108357203,user_000871,Rock Argentino,0.375
2,1108357264,user_000709,80s,0.452489
3,1108357264,user_000709,pop,0.289593
4,1108357264,user_000709,new wave,0.257919



TGBN-GENRE_NODE_LABELS



Unnamed: 0,ts,user_id,genre,weight
0,1108443600,user_000054,chillout,0.015835
1,1108443600,user_000054,female vocalist,0.01533
2,1108443600,user_000054,downtempo,0.008128
3,1108443600,user_000054,electronic,0.072162
4,1108443600,user_000054,reggae,0.021465



LASTFMGENRE_NODE_LABELS



Unnamed: 0,ts,user_id,genre,weight
0,1108443600,user_000054,chillout,0.015835
1,1108443600,user_000054,female vocalist,0.01533
2,1108443600,user_000054,downtempo,0.008128
3,1108443600,user_000054,electronic,0.072162
4,1108443600,user_000054,reggae,0.021465



DAILY_LABELS



Unnamed: 0,user_id,year,month,day,genre,weight
0,user_000001,2006,8,15,electronic,1.172941
1,user_000001,2006,8,15,alternative,0.468085
2,user_000001,2006,8,15,chillout,0.358974
3,user_000001,2006,8,15,math rock,1.0
4,user_000001,2006,8,16,acid jazz,0.35461



GENRE_LIST_FINAL



Unnamed: 0,genre
electronic,743529
alternative,862812
chillout,187629
math rock,6559
electronica,54289


It seems like the ```lastfm``` dataframes are copies of others, so before going any further let us verify this. Instead of removing them from memory, we'll actually keep them and apply some basic pre-processing steps to them: we want to preserve the original ```df_['tgbn-genre_edgelist']``` and ```df_['tgbn-genre_node_labels']```. However, we'll create a new reference to the ```lastfm``` dataframes, so that it's clear that we'll think of them as processed versions of the 'original' edgelist and node label dataframes.

In [5]:
for suffix in ['_edgelist', '_node_labels']:
    df_name = 'tgbn-genre' + suffix
    df = df_[df_name]
    df_[df_name + '_processed'] = df_['lastfmgenre' + suffix]
    # DataFrame.equals compares values; all(df1 == df2) would only iterate over the column labels
    if df_[df_name + '_processed'].equals(df):
        print(f'lastfmgenre{suffix}.csv is identical to tgbn-genre{suffix}.csv.')
    else:
        print(f'lastfmgenre{suffix}.csv is not identical to tgbn-genre{suffix}.csv.')

lastfmgenre_edgelist.csv is identical to tgbn-genre_edgelist.csv.
lastfmgenre_node_labels.csv is identical to tgbn-genre_node_labels.csv.
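
A word of caution when comparing dataframes: Python's built-in ```all``` applied to a dataframe iterates over its column labels, not its values, so it is truthy whenever the labels are non-empty strings. The toy example below (made-up data, not taken from the dataset) sketches the difference between that and a value-level comparison with ```DataFrame.equals```.

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2], 'y': ['p', 'q']})
b = pd.DataFrame({'x': [1, 3], 'y': ['p', 'q']})   # differs from a in one cell

elementwise = (a == b)                       # dataframe of booleans

builtin_all = all(elementwise)               # iterates over the labels 'x', 'y'
values_all = bool(elementwise.all().all())   # reduces over the actual values
equals = a.equals(b)                         # value-level comparison

print(builtin_all, values_all, equals)       # True False False
```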


<a id='edgelist_data'></a>
## Edgelist data
↑↑ [Contents](#contents) ↑ [Reading csv files into dataframes and inspecting](#reading_csv) ↓ [Sorting by increasing order of 'ts' (timestamp) column](#edgelist_preprocessing_sort_ts)

We will now look a bit more closely at the remaining dataframes, and pre-process them. We start with the edgelist data, i.e. ```df_['tgbn-genre_edgelist']```, which we read from ```tgbn-genre_edgelist.csv```. Recall that we now have a reference ```df_['tgbn-genre_edgelist_processed']``` to an exact copy of this dataframe: we will modify this copy, but never the original.

<a id='edgelist_preprocessing_sort_ts'></a>
### Sorting by increasing order of 'ts' (timestamp) column
↑↑ [Contents](#contents) ↑ [Edgelist data](#edgelist_data) ↓ [Encoding users, genres, etc. using TGB function](#edgelist_preprocessing_tgb_load_edgelist_datetime)

In the 'ts' column, the values are [Unix timestamps](https://www.unixtimestamp.com/), i.e. the number of non-leap seconds since the start of the 'Unix epoch' (January 1, 1970, 00:00:00 UTC). The original edgelist dataframe, and hence the one we'll process, turns out **not** to be sorted by increasing timestamp. It 'almost' is, in the sense that it consists of several runs of consecutive records that are sorted chronologically. We will sort the processed dataframe. Perhaps the original edgelist csv data is supposed to be completely sorted, with the discrepancies due to some kind of batching in its creation. Regardless, we will not modify the original edgelist dataframe in any way.
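
To make the timestamps and the notion of 'almost sorted' concrete, here is a small standard-library sketch. The function names ```to_utc``` and ```descents``` are ours, introduced only for illustration. The value 1108357203 is the first 'ts' value displayed above, and the list passed to ```descents``` is made up.

```python
from datetime import datetime, timezone


def to_utc(ts):
    """Convert a Unix timestamp (in seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def descents(values):
    """Return the positions i at which values[i] < values[i - 1]."""
    return [i for i in range(1, len(values)) if values[i] < values[i - 1]]


print(to_utc(1108357203).isoformat())        # 2005-02-14T05:00:03+00:00
print(descents([10, 20, 20, 15, 30, 5]))     # [3, 5]
```

A sequence with k descents splits into k + 1 non-decreasing runs.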

In [6]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]
col = 'ts'
        
if df[col].is_monotonic_increasing:
    print(f'df_[\'{df_name}\'] is already sorted by {col} column (increasing).\n')
else:
    print(f'df_[\'{df_name}\'] is not sorted by {col} column (increasing).')
    
    condition = df[col] < df[col].shift(1)
    indices = condition[condition].index
    
    print(f'Indices where {col}[i] < {col}[i - 1]:', end = ' ')
    for i, index in enumerate(indices):
        if i < len(indices) - 1:
            print(index, end = ', ')
        else:
            print(index, end = '.\n')
    print('')

df_['tgbn-genre_edgelist_processed'] is not sorted by ts column (increasing).
Indices where ts[i] < ts[i - 1]: 74517, 1709056, 5694923, 10575592, 16591333.



In [7]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]
col = 'ts'

print(f'Now sorting df_[\'{df_name}\'] by increasing {col} values...')

tic = time.time()

df.sort_values(by = col, inplace=True)

toc = time.time()

print(f'Done sorting. Time taken: {toc - tic:.2f} seconds.')

Now sorting df_['tgbn-genre_edgelist_processed'] by increasing ts values...
Done sorting. Time taken: 1.35 seconds.
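
A side note on ties: the default sorting algorithm used by ```sort_values``` is not guaranteed to preserve the relative order of records that share a timestamp. Whether that order matters for this dataset is not something we establish here, but if it does, a stable sort can be requested. A toy sketch with made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'ts': [2, 1, 2, 1], 'tag': ['a', 'b', 'c', 'd']})

# kind='stable' keeps records with equal 'ts' in their original relative order
stable = toy.sort_values(by='ts', kind='stable')

print(stable['tag'].tolist())   # ['b', 'd', 'a', 'c']
```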


<a id='edgelist_preprocessing_tgb_load_edgelist_datetime'></a>
### Encoding users, genres, etc. using TGB function ```load_edgelist_datetime```
↑↑ [Contents](#contents) ↑ [Sorting by increasing order of 'ts' (timestamp) column](#edgelist_preprocessing_sort_ts) ↓ [Fixing text encoding mismatch 'rock en español' -> 'rock en espaÃ±ol' in labels_dict.](#edgelist_preprocessing_fix_rock_en_espanol)

From the TGB module ```tgb_genre.py``` (under ```/scripts/tgb/datasets/dataset_scripts/```, also see it on [TGB's GitHub](https://github.com/shenyangHuang/TGB/blob/main/tgb/datasets/dataset_scripts/tgbn-genre.py) [[3]](#H_GH:tgbn_genre_py)), we import the function ```load_edgelist_datetime```, and apply it to our edgelist dataframe ```df_['tgbn-genre_edgelist_processed']``` (or to be precise, a csv file equivalent). Among other things, this will create dictionaries that map ```user_id``` and ```genre``` to integers, and apply the mappings to create a new dataframe, which we will call ```processed_df``` for now.

In [18]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

print(f'Saving df_{df_name} to csv file {df_name}.csv in data directory.')
print('This file will be overwritten later with the final version of the processed data.')

tic = time.time()

df.to_csv(path['data'].joinpath(df_name + '.csv'), index = False)

toc = time.time()
print(f'Time taken: {toc - tic:.2f} seconds.')

Saving df_tgbn-genre_edgelist_processed to csv file tgbn-genre_edgelist_processed.csv in data directory.
This file will be overwritten later with the final version of the processed data.
Time taken: 392.48 seconds.


In [19]:
# Our setup added all subdirectories of scripts to sys.path, so we can simply do
from pre_process import load_edgelist_datetime

In [75]:
# We can inspect the TGB code to see how to use the function.
# It is presumed that a string will be passed to the fname argument.
# No type hint is given and it works just as well with a Path object.
df_name = 'tgbn-genre_edgelist_processed'
fname = path['data'].joinpath(df_name + '.csv')
df_[df_name], edge_feat, node_ids, labels_dict = load_edgelist_datetime(fname)

number of lines counted 17858395


17858396it [01:12, 244725.01it/s]


So far, we have taken a copy of ```df_['tgbn-genre_edgelist']```, which we're calling ```df_['tgbn-genre_edgelist_processed']```, sorted this copy by increasing timestamp, saved it to a csv file, and applied the ```load_edgelist_datetime``` function to this csv, creating a new dataframe. We're now using ```df_['tgbn-genre_edgelist_processed']``` to reference this new dataframe. 

Let's take a look at the original and the current version of the processed dataframe.

In [76]:
df_original_name = 'tgbn-genre_edgelist'
df_original = df_[df_original_name]
df_processed_name = 'tgbn-genre_edgelist_processed'
df_processed = df_[df_processed_name]

print_header('Original edgelist dataframe')
display(df_original.head())

print_header('Processed edgelist dataframe')
display(df_processed.head())


ORIGINAL EDGELIST DATAFRAME



Unnamed: 0,ts,user_id,genre,weight
0,1108357203,user_000871,Rock Argentino,0.375
1,1108357203,user_000871,Rock Argentino,0.375
2,1108357264,user_000709,80s,0.452489
3,1108357264,user_000709,pop,0.289593
4,1108357264,user_000709,new wave,0.257919



PROCESSED EDGELIST DATAFRAME



Unnamed: 0,u,i,ts,idx,w
0,514.0,0.0,1108357000.0,1.0,0.375
1,514.0,0.0,1108357000.0,2.0,0.375
2,515.0,1.0,1108357000.0,3.0,0.452489
3,515.0,2.0,1108357000.0,4.0,0.289593
4,515.0,3.0,1108357000.0,5.0,0.257919


From inspecting the code, we see that the first value in the 'genre' column is mapped to 0, the next new value to 1, and so on. There are 513 distinct genres, so the last new genre in that column is mapped to 512. Perhaps to leave open the option of 1-based indexing of the 'user_id' column, 513 is skipped: the first value in the 'user_id' column is mapped to 514, the next new value to 515, and so on. There are 992 distinct user_ids, so the last new user_id in that column is mapped to 514 + 992 - 1 = 1505.

Let us subtract one from every value in the 'u' column of the processed dataframe, so that the genre codes go from 0 to 512 inclusive, and the user ID codes go from 513 to 1504 inclusive.
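
The scheme we end up with after this shift (genres numbered from 0 in order of first appearance, user IDs continuing where the genres leave off) can be mimicked in a few lines. This is only an illustrative sketch of the behaviour described above, not TGB's implementation. The function ```encode_first_seen``` is hypothetical, and the toy input uses just the first five records shown earlier.

```python
def encode_first_seen(values, start=0):
    """Map each distinct value to an integer, in order of first appearance."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = start + len(codes)
    return codes


genres = ['Rock Argentino', 'Rock Argentino', '80s', 'pop', 'new wave']
users = ['user_000871', 'user_000871', 'user_000709', 'user_000709', 'user_000709']

genre_codes = encode_first_seen(genres)                         # codes 0, 1, 2, 3
user_codes = encode_first_seen(users, start=len(genre_codes))   # codes 4, 5

print(genre_codes)
print(user_codes)
```

With all 513 genres present, the same call would give user codes starting at 513.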

In [77]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

df['u'] = df['u'] - 1

We'll have to apply the same map to the values of ```node_ids```, which we'll discuss presently.

In [78]:
node_ids = {user_id : code - 1 for user_id, code in node_ids.items()}

We can see the 'genre' correspondence by inspecting the ```labels_dict``` dictionary, and the 'user_id' correspondence by inspecting the ```node_ids``` dictionary.

In [79]:
labels_dict_df = pd.DataFrame(list(labels_dict.items()), columns=['genre', 'genre_code'])
node_ids_df = pd.DataFrame(list(node_ids.items()), columns=['user_id', 'user_id_code'])

print_header('Genre codes')
display(labels_dict_df)

print_header('user_id codes')
display(node_ids_df)


GENRE CODES



Unnamed: 0,genre,genre_code
0,Rock Argentino,0
1,80s,1
2,pop,2
3,new wave,3
4,rock,4
...,...,...
508,Jordin Sparks,508
509,duffy,509
510,Gabriella Cilmi,510
511,Lady Gaga,511



USER_ID CODES



Unnamed: 0,user_id,user_id_code
0,user_000871,513
1,user_000709,514
2,user_000285,515
3,user_000525,516
4,user_000966,517
...,...,...
987,user_000098,1500
988,user_000186,1501
989,user_000533,1502
990,user_000538,1503


The ```edge_feat``` variable stores a ```numpy``` array: it corresponds precisely to the 'w' column of ```df_['tgbn-genre_edgelist_processed']```, which in turn is just a reordering of the 'weight' column of the original ```df_['tgbn-genre_edgelist']``` (we sorted by timestamp before applying the ```load_edgelist_datetime``` function). We verify all of this now.

In [80]:
edge_feat

array([[0.375     ],
       [0.375     ],
       [0.45248869],
       ...,
       [0.54945055],
       [0.54945055],
       [0.45054945]])

Recall that ```df_['tgbn-genre_edgelist']``` is not sorted by increasing timestamp, while ```df_['tgbn-genre_edgelist_processed']``` is, and we sorted it before applying the ```load_edgelist_datetime``` function to it. Thus, in order to check that the original 'weight' column and the current 'w' column agree up to order (and up to negligible floating-point error), we need a sorted version of the 'weight' column of the original dataframe.
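
When two weight arrays might differ by rounding, exact equality can be complemented by a comparison with a tolerance. The sketch below uses made-up toy arrays shaped like ```edge_feat``` (one column) and like a dataframe column (flat). The tolerance of 1e-6 is an arbitrary choice for illustration.

```python
import numpy as np

edge_feat_toy = np.array([[0.375], [0.45248869], [0.289593]])   # shape (n, 1)
weights_toy = np.array([0.375, 0.452489, 0.289593])             # shape (n,), rounded

flat = edge_feat_toy.reshape(-1)                                 # flatten to shape (n,)

exact = bool(np.array_equal(flat, weights_toy))                  # strict equality
close = bool(np.allclose(flat, weights_toy, rtol=0.0, atol=1e-6))
max_abs_diff = float(np.max(np.abs(flat - weights_toy)))

print(exact, close)   # False True
```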

In [81]:
df_original_name = 'tgbn-genre_edgelist'
df_original_weight = df_[df_original_name][['ts','weight']].copy(deep=True)
df_original_weight.sort_values(by = 'ts', inplace=True)
df_original_weight = df_original_weight['weight']

In [82]:
import numpy as np

df_processed_name = 'tgbn-genre_edgelist_processed'
df_processed = df_[df_processed_name]

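# Nesting the same quote type inside an f-string requires Python 3.12+, so we compute the booleans first
matches_processed = all(edge_feat.reshape(-1) == df_processed['w'].values)
matches_original = all(edge_feat.reshape(-1) == df_original_weight.values)

print(f"edge_feat corresponds exactly to df_['tgbn-genre_edgelist_processed']['w']: {matches_processed}.")
print(f"edge_feat corresponds exactly to (sorted) df_['tgbn-genre_edgelist']['weight']: {matches_original}.")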

error = np.sum(np.abs(df_original_weight - df_processed['w'].values))
print(f'Sum absolute error between (sorted) original and processed weights: {error}.')

edge_feat corresponds exactly to df_['tgbn-genre_edgelist_processed']['w']: True.
edge_feat corresponds exactly to (sorted) df_['tgbn-genre_edgelist']['weight']: True.
Sum absolute error between (sorted) original and processed weights: 0.0.


<a id='edgelist_preprocessing_fix_rock_en_espanol'></a>
### Fixing text encoding mismatch 'rock en español' -> 'rock en espaÃ±ol' in labels_dict
↑↑ [Contents](#contents) ↑ [Encoding users, genres, etc. using TGB function](#edgelist_preprocessing_tgb_load_edgelist_datetime) ↓ [Converting datatypes and removing redundant columns to save memory](#edgelist_preprocessing_dtype_convert)

Further down, during processing of node labels data, we noticed some null values in the ```genre``` column after applying the map given by ```labels_dict```. We discovered that this is due to ```'rock en español'``` in ```df_['tgbn-genre_edgelist_processed']``` becoming ```'rock en espaÃ±ol'``` in ```labels_dict```. We fix this now, and re-do the saving to csv file below (if this is the first run through the notebook).
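
This kind of garbling is what one gets when UTF-8 bytes are decoded with a single-byte codec such as Latin-1. We have not checked which codec was actually involved here, so the sketch below is an assumption about the mechanism rather than a diagnosis. The helpers ```garble```, ```repair``` and ```repair_keys``` are hypothetical names.

```python
def garble(text):
    """Encode as UTF-8, then (wrongly) decode the bytes as Latin-1."""
    return text.encode('utf-8').decode('latin-1')


def repair(text):
    """Undo garble(): re-encode as Latin-1 and decode the bytes as UTF-8."""
    return text.encode('latin-1').decode('utf-8')


def repair_keys(mapping):
    """Return a copy of mapping with repairable keys fixed; other keys are kept as-is."""
    fixed = {}
    for key, value in mapping.items():
        try:
            fixed[repair(key)] = value
        except (UnicodeEncodeError, UnicodeDecodeError):
            fixed[key] = value   # key was not garbled in this particular way
    return fixed


print(garble('rock en español'))                          # rock en espaÃ±ol
print(repair_keys({'rock en espaÃ±ol': 67, 'pop': 2}))    # {'rock en español': 67, 'pop': 2}
```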

In [83]:
df_original_name = 'tgbn-genre_edgelist'
df_original = df_[df_original_name]

print('rock en español' in set(df_original['genre']))
print('rock en español' in labels_dict)
print('rock en espaÃ±ol' in set(df_original['genre']))
print(labels_dict['rock en espaÃ±ol'])

True
False
False
67


In [84]:
labels_dict['rock en español'] = labels_dict.pop('rock en espaÃ±ol')

In [85]:
labels_dict_df['genre'] = labels_dict_df['genre'].replace('rock en espaÃ±ol', 'rock en español')

In [86]:
labels_dict['rock en español']

67

In [87]:
labels_dict_df[labels_dict_df['genre'] == 'rock en español']

Unnamed: 0,genre,genre_code
67,rock en español,67


<a id='edgelist_preprocessing_dtype_convert'></a>
### Converting datatypes and removing redundant columns to save memory
↑↑ [Contents](#contents) ↑ [Fixing text encoding mismatch 'rock en español' -> 'rock en espaÃ±ol' in labels_dict](#edgelist_preprocessing_fix_rock_en_espanol) ↓ [Adding multiplicity column and removing duplicate records](#edgelist_preprocessing_duplicates)

Every column in ```df_['tgbn-genre_edgelist_processed']``` has ```float64``` ```dtype```, but outside the weight ('w') column, all values are integers. Also, notice that the 'idx' column is simply an index column (it's the index shifted by one). After verifying these claims, we convert datatypes and drop the 'idx' column, saving about 340 MB of memory.
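
The memory figures can be anticipated with some simple arithmetic. The sketch below assumes that the reported 'MB' are units of 2**20 bytes and ignores the (negligible) memory used by a range index. The row count is the one reported for this dataframe, and ```frame_mib``` is a hypothetical helper.

```python
N_ROWS = 17_858_395   # number of rows in the processed edgelist dataframe
MIB = 2 ** 20         # assumed unit behind the reported 'MB' figures


def frame_mib(n_rows, bytes_per_column):
    """Approximate memory of the column data alone, in units of 2**20 bytes."""
    return n_rows * sum(bytes_per_column) / MIB


before = frame_mib(N_ROWS, [8, 8, 8, 8, 8])   # five float64 columns
after = frame_mib(N_ROWS, [4, 4, 4, 8])       # three int32 columns and one float64

# The timestamps must fit in a signed 32-bit integer for int32 to be safe
INT32_MAX = 2 ** 31 - 1
fits = 1108357203 <= INT32_MAX

print(round(before, 1), round(after, 1), fits)   # 681.2 340.6 True
```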

In [88]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17858395 entries, 0 to 17858394
Data columns (total 5 columns):
 #   Column  Dtype  
---  ------  -----  
 0   u       float64
 1   i       float64
 2   ts      float64
 3   idx     float64
 4   w       float64
dtypes: float64(5)
memory usage: 681.2 MB


In [89]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

tic = time.time()

for col in ['u', 'i', 'ts']:
    all_integers = df[col].apply(lambda x: x.is_integer()).all()
    if all_integers:
        print(f'In df[\'tgbn-genre_edgelist_processed\'], all values in {col} column are integers.')
    else:
        print(f'In df[\'tgbn-genre_edgelist_processed\'], not all values in {col} column are integers.')

for col in ['idx']:
    all_equal = (df[col] == df.index + 1).all()
    if all_equal:
        print(f'In df[\'tgbn-genre_edgelist_processed\'], every value in {col} is equal to its row index plus one.')
    else:
        print(f'In df[\'tgbn-genre_edgelist_processed\'], not every value in {col} is equal to its row index plus one.')

toc = time.time()

print(f'Time taken: {toc - tic:.2f} seconds.')

In df['tgbn-genre_edgelist_processed'], all values in u column are integers.
In df['tgbn-genre_edgelist_processed'], all values in i column are integers.
In df['tgbn-genre_edgelist_processed'], all values in ts column are integers.
In df['tgbn-genre_edgelist_processed'], every value in idx is equal to its row index plus one.
Time taken: 14.21 seconds.


In [90]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

tic = time.time()

for col in ['u', 'i','ts']:
    print(f'Converting dtype of {col} column to int.')
    # Explicit 32-bit width: plain int is platform-dependent (often int64 outside Windows)
    df[col] = df[col].astype('int32')

for col in ['idx']:
    print(f'Dropping {col} column.')
    df.drop(col, axis=1, inplace=True)
    
toc = time.time()

print(f'Time taken: {toc - tic:.2f} seconds.')

Converting dtype of u column to int.
Converting dtype of i column to int.
Converting dtype of ts column to int.
Dropping idx column.
Time taken: 0.47 seconds.


In [91]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17858395 entries, 0 to 17858394
Data columns (total 4 columns):
 #   Column  Dtype  
---  ------  -----  
 0   u       int32  
 1   i       int32  
 2   ts      int32  
 3   w       float64
dtypes: float64(1), int32(3)
memory usage: 340.6 MB


<a id='edgelist_preprocessing_duplicates'></a>
### Adding multiplicity column and removing duplicate records 
↑↑ [Contents](#contents) ↑ [Converting datatypes and removing redundant columns to save memory](#edgelist_preprocessing_dtype_convert) ↓ [Renaming the columns, resetting the index, and saving to csv](#edgelist_preprocessing_reset_idx_save)

We notice that, after removing the 'idx' column, which was simply a 1-based index column, ```df_['tgbn-genre_edgelist_processed']``` contains duplicate records. We're not sure if this is intentional, so while we'll remove duplicate records, we'll keep track of the number of copies of each record by adding a 'multiplicity' column. This reduces the number of rows from 17,858,395 to 11,630,360.

Could the duplicate records be an artifact of the [way the dataset is constructed](#introduction), namely, by cross-referencing the songs in the [LastFM-song-listens dataset](http://snap.stanford.edu/jodie/#datasets) with that of music genres in the [million-song dataset](http://millionsongdataset.com/), and the fact that the million-song dataset contains duplicates (various versions of songs, etc.)? We do not know, and perhaps it doesn't matter too much because of [the way the data is aggregated](#node_labels_preprocessing) in the node labels dataset. See the [EDA notebook](02-eda-tgbn-genre-dataset.ipynb) for further discussion.
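
The idea of replacing repeated records by a single record plus a count can be illustrated without pandas. In this standard-library sketch the three toy records are the first rows of the processed dataframe. ```Counter``` keeps keys in order of first appearance.

```python
from collections import Counter

records = [
    (513, 0, 1108357203, 0.375),
    (513, 0, 1108357203, 0.375),      # exact duplicate of the previous record
    (514, 1, 1108357264, 0.452489),
]

counts = Counter(records)

# One row per distinct record, with its multiplicity appended as a fifth field
deduplicated = [record + (multiplicity,) for record, multiplicity in counts.items()]

print(deduplicated)
```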

In [92]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

df.head(2)

Unnamed: 0,u,i,ts,w
0,513,0,1108357203,0.375
1,513,0,1108357203,0.375


In [93]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

print(f'Adding multiplicity column to df[\'tgbn-genre_edgelist_processed\'].')

tic = time.time()

df['multiplicity'] = df.groupby(df.columns.tolist()).transform('size')

toc = time.time()
print(f'Time taken: {toc - tic:.2f} seconds.')

Adding multiplicity column to df['tgbn-genre_edgelist_processed'].
Time taken: 8.18 seconds.


In [94]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

df.head()

Unnamed: 0,u,i,ts,w,multiplicity
0,513,0,1108357203,0.375,2
1,513,0,1108357203,0.375,2
2,514,1,1108357264,0.452489,1
3,514,2,1108357264,0.289593,1
4,514,3,1108357264,0.257919,1


In [95]:
print(f'Dropping duplicate records from df_[\'tgbn-genre_edgelist_processed\'].')

tic = time.time()

# Careful! drop_duplicates() returns a new DataFrame. Writing df = df.drop_duplicates() would only
# rebind the local name df to that new DataFrame and leave df_[...] unchanged, so we assign the
# result back to df_[...] directly.
# Trying to use inplace=True instead throws the warning
# 'A value is trying to be set on a copy of a slice from a DataFrame'.
df_['tgbn-genre_edgelist_processed'] = df_['tgbn-genre_edgelist_processed'].drop_duplicates(keep='first')
toc = time.time()
print(f'Time taken: {toc - tic:.2f} seconds.')

Dropping duplicate records from df_['tgbn-genre_edgelist_processed'].
Time taken: 3.95 seconds.


In [96]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11630360 entries, 0 to 17858392
Data columns (total 5 columns):
 #   Column        Dtype  
---  ------        -----  
 0   u             int32  
 1   i             int32  
 2   ts            int32  
 3   w             float64
 4   multiplicity  int64  
dtypes: float64(1), int32(3), int64(1)
memory usage: 399.3 MB


We can now see, for instance, that there are 45 copies of four of the records, 15 copies of 5,555 of the records, and so on.

In [97]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

df['multiplicity'].value_counts()

multiplicity
1     8395030
2     1926100
3      609210
4      295074
5      186173
6       78563
7       71569
8       30243
12       9351
11       9339
10       6035
15       5555
9        3839
14       2225
13        763
18        656
17        328
21        213
19         32
16         27
24         11
30          8
20          6
22          6
45          4
Name: count, dtype: int64

<a id='edgelist_preprocessing_reset_idx_save'></a>
### Renaming the columns, resetting the index, and saving to csv
↑↑ [Contents](#contents) ↑ [Adding multiplicity column and removing duplicate records](#edgelist_preprocessing_duplicates) ↓ [Saving labels_dict and node_ids to csv](#edgelist_preprocessing_save_dict)

We will now reset the index of ```df_['tgbn-genre_edgelist_processed']```. If we hadn't sorted the records by increasing timestamp, all operations hitherto would be reversible and we'd be able to recover ```df_['tgbn-genre_edgelist']``` exactly from ```df_['tgbn-genre_edgelist_processed']```. We'll rename the columns (for reasons that will become clear in the [EDA notebook](02-eda-tgbn-genre-dataset.ipynb)), and save the final processed dataframe to a csv file so that it can be read into a dataframe at a later time without needing to repeat the above steps.
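
Dropping duplicates is reversible in the sense that the 'multiplicity' column lets us expand each record back into its copies, although not necessarily in their original positions among records with equal timestamps. A toy sketch, using the first two deduplicated records and the column names in use before renaming:

```python
import pandas as pd

dedup = pd.DataFrame({
    'u': [513, 514],
    'i': [0, 1],
    'ts': [1108357203, 1108357264],
    'w': [0.375, 0.452489],
    'multiplicity': [2, 1],
})

# Repeat each row label as many times as its multiplicity, then select those rows
repeated_labels = dedup.index.repeat(dedup['multiplicity'].to_numpy())
expanded = (
    dedup.loc[repeated_labels]
         .drop(columns='multiplicity')
         .reset_index(drop=True)
)

print(expanded['u'].tolist())   # [513, 513, 514]
```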

In [99]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

df.reset_index(drop=True, inplace=True)

In [101]:
rename_dict = {
    'u': 'sources',
    'i': 'destinations',
    'ts': 'timestamps',
    'w' : 'edge_feat',
}
df_['tgbn-genre_edgelist_processed'] = df_['tgbn-genre_edgelist_processed'].rename(columns=rename_dict)

In [102]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

df.head()

Unnamed: 0,sources,destinations,timestamps,edge_feat,multiplicity
0,513,0,1108357203,0.375,2
1,514,1,1108357264,0.452489,1
2,514,2,1108357264,0.289593,1
3,514,3,1108357264,0.257919,1
4,515,4,1108357321,0.362319,1


In [103]:
df_name = 'tgbn-genre_edgelist_processed'
df = df_[df_name]

print('Saving df_[\'tgbn-genre_edgelist_processed\'] to csv file.')

tic = time.time()

df.to_csv(path['data'].joinpath(df_name + '.csv'), index=False)

toc = time.time()
print(f'Time taken: {toc - tic:.2f} seconds.')

Saving df_['tgbn-genre_edgelist_processed'] to csv file.
Time taken: 177.51 seconds.
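
A csv file does not store column dtypes, so the memory savings from the earlier dtype conversion are lost on reading unless the dtypes are passed again. The sketch below shows one way to do this. It uses an in-memory stand-in for the saved file (two rows copied from the head shown above) and the dtypes listed in the dataframe summary; it is an illustration rather than a step of this notebook.

```python
import io
import pandas as pd

# Stand-in for the saved file; the rows are copied from the head shown above.
csv_text = """sources,destinations,timestamps,edge_feat,multiplicity
513,0,1108357203,0.375,2
514,1,1108357264,0.452489,1
"""

# Dtypes as reported for the processed dataframe before the columns were renamed.
dtypes = {
    'sources': 'int32',
    'destinations': 'int32',
    'timestamps': 'int32',
    'edge_feat': 'float64',
    'multiplicity': 'int64',
}

reloaded = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(reloaded.dtypes)
```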


<a id='edgelist_preprocessing_save_dict'></a>
### Saving labels_dict and node_ids to csv
↑↑ [Contents](#contents) ↑ [Renaming the columns, resetting the index, and saving to csv](#edgelist_preprocessing_reset_idx_save) ↓ [Node labels data](#node_labels_preprocessing)

We'll also save ```labels_dict``` and ```node_ids``` to csv files, as they'll come in useful later on.

In [104]:
print('Saving labels_dict and node_ids to csv files.')

labels_dict_df.to_csv(path['data'].joinpath('tgbn-genre_labels_dict.csv'), index=False, encoding='utf-8')
node_ids_df.to_csv(path['data'].joinpath('tgbn-genre_node_ids.csv'), index=False)

Saving labels_dict and node_ids to csv files.
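
When these files are read back later, the dataframes can be turned into dictionaries again. The sketch below shows a round trip through a temporary file. The column names 'genre' and 'code' and the codes themselves are hypothetical, chosen only for the example; the actual columns of ```labels_dict_df``` may differ.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up mapping; the real labels_dict is produced earlier in the notebook.
labels_dict = {'indie': 0, 'rock en español': 1}
labels_dict_df = pd.DataFrame(list(labels_dict.items()), columns=['genre', 'code'])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'labels_dict_example.csv'
    # Writing and reading with the same explicit encoding keeps 'ñ' intact.
    labels_dict_df.to_csv(csv_path, index=False, encoding='utf-8')
    reloaded_df = pd.read_csv(csv_path, encoding='utf-8')

# Rebuild the dictionary from the two columns.
reloaded = dict(zip(reloaded_df['genre'], reloaded_df['code']))
print(reloaded == labels_dict)
```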


<a id='node_labels_preprocessing'></a>
## Node labels data
↑↑ [Contents](#contents) ↑ [Saving labels_dict and node_ids to csv](#edgelist_preprocessing_save_dict) ↓ [Notwithstanding weights, there is overlap between node label data and edgelist data](#node_labels_overlap)

We now perform similar steps for the node labels data contained in ```df_['tgbn-genre_node_labels_processed']```, currently a copy of ```df_['tgbn-genre_node_labels']``` (read from ```tgbn-genre_node_labels.csv```).

The node labels data is generated from the edgelist data, essentially by the ```generate_aggregate_labels``` function of the TGB module ```tgb_genre.py``` (under ```/scripts/tgb/datasets/dataset_scripts/```; see also [TGB's GitHub](https://github.com/shenyangHuang/TGB/blob/main/tgb/datasets/dataset_scripts/tgbn-genre.py) [[3]](#H_GH:tgbn_genre_py)). The function aggregates the edgelist 'weight' column over seven-day intervals, using the 'ts' column as the timestamp. See the [EDA notebook](02-eda-tgbn-genre-dataset.ipynb) for further details.
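
To give a rough idea of what aggregating weights over seven-day intervals can look like, here is a minimal sketch on made-up data. It is not TGB's ```generate_aggregate_labels``` implementation: the choice of window boundaries (counted from the first timestamp) and the normalisation step are assumptions made purely for illustration.

```python
import pandas as pd

WEEK = 7 * 24 * 60 * 60  # seconds in seven days

# Made-up edgelist records for a single user.
edges = pd.DataFrame({
    'ts':      [0, 100, 200, WEEK + 5, WEEK + 50],
    'user_id': ['a', 'a', 'a', 'a', 'a'],
    'genre':   ['rock', 'rock', 'pop', 'pop', 'pop'],
    'weight':  [0.5, 0.25, 0.25, 0.4, 0.6],
})

# Assign each record to a seven-day window counted from the first timestamp
# (assumption: the real windows may be aligned differently).
edges['window'] = (edges['ts'] - edges['ts'].min()) // WEEK

# Sum the weights per (window, user, genre).
agg = edges.groupby(['window', 'user_id', 'genre'], as_index=False)['weight'].sum()

# Normalise so that each user's weights sum to 1 within a window (assumption).
totals = agg.groupby(['window', 'user_id'])['weight'].transform('sum')
agg['weight'] = agg['weight'] / totals
print(agg)
```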

<a id='node_labels_overlap'></a>
### Notwithstanding weights, there is overlap between node label data and edgelist data  
↑↑ [Contents](#contents) ↑ [Node labels data](#node_labels_preprocessing) ↓ [Node label data is sorted by increasing timestamp](#node_labels_preprocessing_sort)

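The inner merge below returns 738 rows (indexed 0 to 737) in which 'ts', 'user_id', and 'genre' agree between ```df_['tgbn-genre_edgelist']``` and ```df_['tgbn-genre_node_labels']```. Some of these rows are repeated, because the unprocessed edgelist contains duplicate records. Note, however, that the weights differ.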

As mentioned above, the weights in the node labels dataset are aggregations of weights in the edgelist dataset. As we will see in the [EDA notebook](02-eda-tgbn-genre-dataset.ipynb), the overlapping records are edgelist records whose timestamps fall at the very beginning of a 24-hour interval.

In [109]:
df1_name = 'tgbn-genre_edgelist'
df2_name = 'tgbn-genre_node_labels'
df1 = df_[df1_name]
df2 = df_[df2_name]

merged_df = pd.merge(df1, df2, on=['ts', 'user_id', 'genre'], how='inner')

In [110]:
merged_df

Unnamed: 0,ts,user_id,genre,weight_x,weight_y
0,1112245200,user_000557,indie,0.442478,1.000000
1,1112245200,user_000557,indie,0.442478,1.000000
2,1113796800,user_000709,soul,0.549451,0.091032
3,1113796800,user_000709,60s,0.258242,0.054021
4,1113796800,user_000709,oldies,0.192308,0.090759
...,...,...,...,...,...
734,1244347200,user_000488,pop rock,0.268519,0.031380
735,1244347200,user_000488,female vocalist,0.268519,0.066939
736,1244347200,user_000488,pop,0.462963,0.242574
737,1244347200,user_000488,pop rock,0.268519,0.031380
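
An inner merge returns one row per matching pair, so a key that is duplicated on one side appears more than once in the result. The sketch below, on made-up data, shows the difference between the number of merged rows and the number of distinct ('ts', 'user_id', 'genre') triples.

```python
import pandas as pd

keys = ['ts', 'user_id', 'genre']

# Made-up edgelist with one duplicated record.
edgelist = pd.DataFrame({
    'ts': [100, 100, 200],
    'user_id': ['u1', 'u1', 'u2'],
    'genre': ['indie', 'indie', 'soul'],
    'weight': [0.4, 0.4, 0.5],
})
# Made-up node labels without duplicates.
node_labels = pd.DataFrame({
    'ts': [100, 200, 300],
    'user_id': ['u1', 'u2', 'u3'],
    'genre': ['indie', 'soul', 'pop'],
    'weight': [1.0, 0.1, 0.2],
})

merged = pd.merge(edgelist, node_labels, on=keys, how='inner')

n_rows = len(merged)                                    # counts repeated matches
n_distinct = len(merged.drop_duplicates(subset=keys))   # distinct key triples
print(n_rows, n_distinct)
```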


<a id='node_labels_preprocessing_sort'></a>
### Node label data is sorted by increasing timestamp
↑↑ [Contents](#contents) ↑ [Notwithstanding weights, there is overlap between node label data and edgelist data](#node_labels_overlap) ↓ [Node label data contains no duplicate records](#node_labels_preprocessing_verify_no_duplicates)

Unlike the edgelist data, the records of ```df_['tgbn-genre_node_labels']``` turn out to be sorted chronologically already. We verify this.

In [114]:
df_name = 'tgbn-genre_node_labels'
df = df_[df_name]
col = 'ts'
        
if df[col].is_monotonic_increasing:
    print(f'df_[\'{df_name}\'] is already sorted by {col} column (increasing).\n')
else:
    print(f'df_[\'{df_name}\'] is not sorted by {col} column (increasing).')
    
    condition = df[col] < df[col].shift(1)
    indices = condition[condition].index
    
    # List every index at which the timestamp is smaller than its predecessor.
    print(f'Indices where {col}[i] < {col}[i - 1]:', ', '.join(map(str, indices)) + '.')
    print('')

df_['tgbn-genre_node_labels'] is already sorted by ts column (increasing).
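
For reference, the sketch below shows how this check behaves on a small made-up series. Note that ```is_monotonic_increasing``` tolerates ties (equal consecutive values), which matters here because many records share a timestamp.

```python
import pandas as pd

ts = pd.Series([10, 10, 20, 15, 30, 25])

# False: the series decreases at two positions.
is_sorted = ts.is_monotonic_increasing
print(is_sorted)

# Positions where a value is smaller than its predecessor.
drops = ts[ts < ts.shift(1)].index.tolist()
print(drops)

# Ties alone do not break monotonicity.
ties_ok = pd.Series([10, 10, 20]).is_monotonic_increasing
print(ties_ok)
```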



<a id='node_labels_preprocessing_verify_no_duplicates'></a>
### Node label data contains no duplicate records 
↑↑ [Contents](#contents) ↑ [Node label data is sorted by increasing timestamp](#node_labels_preprocessing_sort) ↓ [Node label user_ids are a proper subset of edgelist user_ids](#node_labels_preprocessing_verify_user_id)

Unlike the edgelist dataset, the node label dataset contains no duplicate records, as we now verify.

[Recall](#node_labels_preprocessing) that the node labels data is generated from the edgelist data, essentially by the ```generate_aggregate_labels``` function of the TGB module ```tgb_genre.py``` [[3]](#H_GH:tgbn_genre_py), which aggregates the edgelist 'weight' column over seven-day intervals. From the code, this aggregation appears to have been carried out over the duplicate records as well. However, the way in which the weights are combined may mean that the duplicates typically have little effect in any case. This is discussed in the [EDA notebook](02-eda-tgbn-genre-dataset.ipynb).

In [116]:
df_name = 'tgbn-genre_node_labels'
df = df_[df_name]

contains_duplicates = bool(df.duplicated().any())

print(f'df_[\'tgbn-genre_node_labels\'] contains duplicate records: {contains_duplicates}.') 

df_['tgbn-genre_node_labels'] contains duplicate records: False.
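
The check above compares entire rows. A stricter variant, sketched below on made-up data, asks whether any ('ts', 'user_id', 'genre') triple occurs more than once regardless of weight. Whether the real node label data passes this stricter check is not established here.

```python
import pandas as pd

# Made-up labels: the first two rows share a key triple but differ in weight.
labels = pd.DataFrame({
    'ts': [100, 100, 100],
    'user_id': ['u1', 'u1', 'u2'],
    'genre': ['indie', 'indie', 'indie'],
    'weight': [0.3, 0.7, 1.0],
})
keys = ['ts', 'user_id', 'genre']

# No two rows are identical in every column...
full_row_dupes = bool(labels.duplicated().any())

# ...but one key triple appears twice.
key_dupes = bool(labels.duplicated(subset=keys).any())
print(full_row_dupes, key_dupes)
```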


<a id='node_labels_preprocessing_verify_user_id'></a>
### Node label user_ids are a proper subset of edgelist user_ids
↑↑ [Contents](#contents) ↑ [Node label data contains no duplicate records](#node_labels_preprocessing_verify_no_duplicates) ↓ [Node label genres are the same as edgelist genres](#node_labels_preprocessing_verify_genre)

It turns out that all 'user_id's found in ```df_['tgbn-genre_node_labels']``` are also in ```df_['tgbn-genre_edgelist']```, but the latter contains records for 18 further 'user_id's that are absent from the former. Perhaps these 18 'user_id's are overlooked in the aggregation process because their records lie on the boundary between seven-day intervals, or because all of their records fall within seven days of the last timestamp in the edgelist dataset.

In [117]:
NLuid = set(df_['tgbn-genre_node_labels']['user_id'])
ELuid = set(df_['tgbn-genre_edgelist']['user_id'])

equality_condition = NLuid == ELuid
subset_condition = NLuid.issubset(ELuid)

print(f'Set of user_ids in node label data is equal to set of user_ids in edgelist data: {equality_condition}.')
print(f'Set of user_ids in node label data is contained in set of user_ids in edgelist data: {subset_condition}.')
print('\nuser_ids present in edgelist data that are not present in node labels data:')
for i, uid in enumerate(ELuid - NLuid):
    print(f'{i + 1}: {uid} ({node_ids[uid]})')

Set of user_ids in node label data is equal to set of user_ids in edgelist data: False.
Set of user_ids in node label data is contained in set of user_ids in edgelist data: True.

user_ids present in edgelist data that are not present in node labels data:
1: user_000856 (848)
2: user_000186 (1501)
3: user_000852 (743)
4: user_000895 (772)
5: user_000399 (1497)
6: user_000561 (1379)
7: user_000101 (1017)
8: user_000240 (1019)
9: user_000784 (1496)
10: user_000815 (957)
11: user_000560 (1321)
12: user_000332 (1106)
13: user_000098 (1500)
14: user_000061 (1182)
15: user_000566 (1012)
16: user_000686 (1127)
17: user_000538 (1503)
18: user_000533 (1502)
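
One way to probe the second conjecture is to compare each missing user's earliest timestamp with a cutoff seven days before the last timestamp in the edgelist. The sketch below does this on made-up data; the user names, timestamps, and the set ```labelled_users``` are hypothetical, and the result says nothing about the real dataset.

```python
import pandas as pd

WEEK = 7 * 24 * 60 * 60  # seconds in seven days

# Made-up edgelist: u2 first appears in the final week, u3 much earlier.
edges = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3'],
    'ts': [0, 10 * WEEK, 10 * WEEK - 100, 5 * WEEK],
})
labelled_users = {'u1'}   # hypothetical set of users with node labels

# Users present in the edgelist but absent from the node labels.
missing = set(edges['user_id']) - labelled_users

# Earliest record of each missing user, compared with the final-week cutoff.
first_seen = edges[edges['user_id'].isin(missing)].groupby('user_id')['ts'].min()
cutoff = edges['ts'].max() - WEEK
only_in_last_week = first_seen > cutoff
print(only_in_last_week)
```

A user flagged ```True``` has all of its records in the final seven days, which would be consistent with the conjecture; a user flagged ```False``` would need another explanation.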


<a id='node_labels_preprocessing_verify_genre'></a>
### Node label genres are the same as edgelist genres  
↑↑ [Contents](#contents) ↑ [Node label user_ids are a proper subset of edgelist user_ids](#node_labels_preprocessing_verify_user_id) ↓ [Mapping node label user_ids and genres using node_ids and labels_dict](#node_labels_preprocessing_mapping)

The 'genre's represented in both datasets are identical, as we now verify.

In [118]:
NLgenre = set(df_['tgbn-genre_node_labels']['genre'])
ELgenre = set(df_['tgbn-genre_edgelist']['genre'])

equality_condition = NLgenre == ELgenre

print(f'Set of genres in node label data is equal to set of genres in edgelist data: {equality_condition}.')

Set of genres in node label data is equal to set of genres in edgelist data: True.


<a id='node_labels_preprocessing_mapping'></a>
### Mapping node label user_ids and genres using node_ids and labels_dict 
↑↑ [Contents](#contents) ↑ [Node label genres are the same as edgelist genres](#node_labels_preprocessing_verify_genre) ↓ [Renaming and reordering node label dataframe columns, and saving to csv](#node_labels_preprocessing_renaming_columns)

We now work with our copy ```df_['tgbn-genre_node_labels_processed']``` of ```df_['tgbn-genre_node_labels']```, and map the 'user_id's and 'genre's of this copy using ```node_ids``` and ```labels_dict``` respectively. The encoded node labels data will then be consistent with the processed edgelist data.

In [123]:
df_name = 'tgbn-genre_node_labels_processed'
df = df_[df_name]

print('Mapping user_ids using node_ids dictionary.')
df['user_id'] = df['user_id'].map(node_ids)

print('Mapping genres using labels_dict.')
df['genre'] = df['genre'].map(labels_dict)

Mapping user_ids using node_ids dictionary.
Mapping genres using labels_dict.


In [124]:
df_name = 'tgbn-genre_node_labels_processed'
df = df_[df_name]

df.head()

Unnamed: 0,ts,user_id,genre,weight
0,1108443600,533,83,0.015835
1,1108443600,533,24,0.01533
2,1108443600,533,103,0.008128
3,1108443600,533,56,0.072162
4,1108443600,533,66,0.021465


<a id='node_labels_preprocessing_renaming_columns'></a>
### Renaming and reordering node label dataframe columns, and saving to csv 
↑↑ [Contents](#contents) ↑ [Mapping node label user_ids and genres using node_ids and labels_dict](#node_labels_preprocessing_mapping) ↓ [Daily labels and final genre list](#daily_labels)

For consistency with the final version of the processed edgelist dataframe ```df_['tgbn-genre_edgelist_processed']```, we rename and reorder the columns of ```df_['tgbn-genre_node_labels_processed']```. Finally, we save the processed dataframe to a csv file.

In [127]:
rename_dict = {
    'user_id': 'sources',
    'genre': 'destinations',
    'ts': 'timestamps',
}
df_['tgbn-genre_node_labels_processed'] = df_['tgbn-genre_node_labels_processed'].rename(columns=rename_dict)

new_order = ['sources', 'destinations', 'timestamps', 'weight']
df_['tgbn-genre_node_labels_processed'] = df_['tgbn-genre_node_labels_processed'].reindex(columns=new_order)
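
A caveat about ```reindex(columns=...)```: a misspelt column name does not raise an error but yields a new column filled with NaN. The sketch below, on a one-row made-up dataframe, contrasts this with plain column selection, which raises a ```KeyError``` and so surfaces the mistake immediately.

```python
import pandas as pd

df = pd.DataFrame({'ts': [1108443600], 'user_id': [533], 'genre': [83], 'weight': [0.015835]})
df = df.rename(columns={'user_id': 'sources', 'genre': 'destinations', 'ts': 'timestamps'})

# Deliberate typo: 'timestamp' instead of 'timestamps'.
wanted = ['sources', 'destinations', 'timestamp', 'weight']

# reindex silently creates an all-NaN column for the unknown name...
silent = df.reindex(columns=wanted)
all_nan = bool(silent['timestamp'].isna().all())
print(all_nan)

# ...whereas plain selection raises, which surfaces the mistake early.
try:
    df[wanted]
    raised = False
except KeyError:
    raised = True
print(raised)
```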

In [128]:
df_name = 'tgbn-genre_node_labels_processed'
df = df_[df_name]

df.head()

Unnamed: 0,sources,destinations,timestamps,weight
0,533,83,1108443600,0.015835
1,533,24,1108443600,0.01533
2,533,103,1108443600,0.008128
3,533,56,1108443600,0.072162
4,533,66,1108443600,0.021465


In [129]:
df_name = 'tgbn-genre_node_labels_processed'
df = df_[df_name]

print('Saving df_[\'tgbn-genre_node_labels_processed\'] to csv file.')

tic = time.time()

df.to_csv(path['data'].joinpath(df_name + '.csv'), index=False)

toc = time.time()
print(f'Time taken: {toc - tic:.2f} seconds.')

Saving df_['tgbn-genre_node_labels_processed'] to csv file.
Time taken: 43.83 seconds.


<a id='daily_labels'></a>
## Daily labels and final genre list
↑↑ [Contents](#contents) ↑ [Renaming and reordering node label dataframe columns, and saving to csv](#node_labels_preprocessing_renaming_columns) ↓ [References](#references)

The csv file ```daily_labels.csv``` represents an intermediate step in the generation of ```tgbn-genre_node_labels.csv``` from ```tgbn-genre_edgelist.csv```. See the ```generate_daily_node_labels``` function of the TGB module ```tgb_genre.py``` (under ```/scripts/tgb/datasets/dataset_scripts/```; see also [TGB's GitHub](https://github.com/shenyangHuang/TGB/blob/main/tgb/datasets/dataset_scripts/tgbn-genre.py) [[3]](#H_GH:tgbn_genre_py)).

The csv file ```genre_list_final.csv``` essentially maps genres used in the ```tgbn-genre``` data to codes, though not the same codes as in ```labels_dict```. Perhaps the mapping given by ```genre_list_final.csv``` comes from the full genre list in the Million Song Dataset. We will not explore or process these files.

In [130]:
df_name = 'daily_labels'
df = df_[df_name]
df.head()

Unnamed: 0,user_id,year,month,day,genre,weight
0,user_000001,2006,8,15,electronic,1.172941
1,user_000001,2006,8,15,alternative,0.468085
2,user_000001,2006,8,15,chillout,0.358974
3,user_000001,2006,8,15,math rock,1.0
4,user_000001,2006,8,16,acid jazz,0.35461
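
Although we do not process this file, its 'year', 'month', and 'day' columns could be combined into a Unix timestamp comparable to the 'ts' column of the other dataframes. The sketch below shows one way to do so on two rows shaped like those above. Treating each date as midnight UTC is an assumption; the time zone convention used by TGB may differ.

```python
import pandas as pd

daily = pd.DataFrame({
    'user_id': ['user_000001', 'user_000001'],
    'year': [2006, 2006],
    'month': [8, 8],
    'day': [15, 16],
})

# pd.to_datetime assembles dates from columns named year, month, and day.
dates = pd.to_datetime(daily[['year', 'month', 'day']])

# Seconds since the Unix epoch, treating each date as midnight UTC (assumption).
daily['ts'] = (dates - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
print(daily)
```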


In [131]:
df_name = 'genre_list_final'
df = df_[df_name]
df.head()

Unnamed: 0,genre
electronic,743529
alternative,862812
chillout,187629
math rock,6559
electronica,54289


<a id='references'></a>
## References
↑↑ [Contents](#contents) ↑ [Daily labels and final genre list](#daily_labels) 

<a id='H:2023'></a>[1] Huang, S., et al. [Temporal graph benchmark for machine learning on temporal graphs.](https://doi.org/10.48550/arXiv.2307.01026) _Advances in Neural Information Processing Systems_, 2023. Preprint: [arXiv:2307.01026](https://doi.org/10.48550/arXiv.2307.01026), 2023.

<a id='H_GH:2023'></a>[2] Huang, S., et al. [TGB.](https://github.com/shenyangHuang/TGB) GitHub Repository. [https://github.com/shenyangHuang/TGB](https://github.com/shenyangHuang/TGB), 2023. Accessed May 14, 2024.

<a id='H_GH:tgbn_genre_py'></a>[3] Huang, S., et al. [tgbn-genre dataset.](https://github.com/shenyangHuang/TGB/blob/main/tgb/datasets/dataset_scripts/tgbn-genre.py) [https://github.com/shenyangHuang/TGB/blob/main/tgb/datasets/dataset_scripts/tgbn-genre.py](https://github.com/shenyangHuang/TGB/blob/main/tgb/datasets/dataset_scripts/tgbn-genre.py), 2023. Accessed May 14, 2024.