# Exploratory data analysis

<a id='contents'></a>
## Contents

* [Introduction](#introduction)
* [Setup](#setup)
* [Data](#data)
* [Exploration](#exploration)
* [...](...)
* [References](#references)

<a id='introduction'></a>
## Introduction
↑↑ [Contents](#contents) ↓ [Setup](#setup)

From [[1, p. 7]](#H:2023):
>```tgbn-genre``` This is a bipartite and weighted interaction network between users and the music genres of songs they listen to. Both users and music genres are represented as nodes while an interaction specifies a user listens to a music genre at a given time. The edge weights denote the percentage of which a song belongs to a certain genre. The dataset is constructed by cross-referencing the songs in the [LastFM-song-listens dataset](http://snap.stanford.edu/jodie/#datasets) [15, 24] with that of music genres in the [million-song dataset](http://millionsongdataset.com/) [2]. The LastFM-song-listens dataset has one month of who-listens-to-which-song information for 1000 users and the million-song dataset provides genre weights for all songs in the [LastFM-song-listens dataset](http://snap.stanford.edu/jodie/#datasets). We only retain genres with at least 10% weights for each song that are repeated at least a thousand times in the dataset. Genre names are cleaned to remove typos. Here, the task is to predict how frequently each user will interact with music genres over the next week. This is applicable to many music recommendation systems where providing personalized recommendation is important and user preference shifts over time.

> **References**

>[2] Bertin-Mahieux, T., D. P. Ellis, B. Whitman, and P. Lamere. 'The million song dataset.' 2011.

>[15] Hidasi, B. and D. Tikk. 'Fast als-based tensor factorization for context-aware recommendation from implicit feedback.' In _Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part II 23_, pp. 67–82. Springer, 2012.

>[24] Kumar, S., X. Zhang, and J. Leskovec. 'Predicting dynamic embedding trajectory in temporal interaction networks.' In _Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining_, pp. 1269–1278, 2019.


Clarification: 'the percentage of which a song belongs to a certain genre' means...
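
The prediction task described in the quote above can be illustrated on toy data. The sketch below is not the TGB implementation: it assumes that a user's target for a given week is the sum of edge weights per genre in that week, normalised to sum to 1. The function name `weekly_genre_frequencies`, the tuple layout, and the toy interactions are all hypothetical.

```python
from collections import defaultdict

SECONDS_PER_WEEK = 7 * 24 * 60 * 60

def weekly_genre_frequencies(interactions, start_time):
    """Aggregate (user, genre, timestamp, weight) tuples into per-user,
    per-week genre distributions that each sum to 1 (an assumed label format)."""
    totals = defaultdict(lambda: defaultdict(float))
    for user, genre, timestamp, weight in interactions:
        week = int((timestamp - start_time) // SECONDS_PER_WEEK)
        totals[(user, week)][genre] += weight
    # Normalise each (user, week) bucket so that its weights sum to 1.
    return {
        key: {genre: w / sum(genres.values()) for genre, w in genres.items()}
        for key, genres in totals.items()
    }

# Hypothetical toy interactions: (user, genre, timestamp in seconds, weight).
toy_interactions = [
    ("u1", "rock", 0, 0.5),
    ("u1", "jazz", 3600, 0.5),
    ("u1", "rock", 86400, 1.0),
    ("u1", "jazz", SECONDS_PER_WEEK + 10, 0.4),
]
labels = weekly_genre_frequencies(toy_interactions, start_time=0)
print(labels)
```

Each key of `labels` is a `(user, week)` pair, so a model evaluated on week `k` would only be allowed to see interactions from earlier weeks.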

<a id='setup'></a>
## Setup
↑↑ [Contents](#contents) ↑ [Introduction](#introduction) ↓ [Data](#data)

First install the TGB (Temporal Graph Benchmark) package using ```pip install py-tgb``` as per the [README.md in Shenyang Huang's TGB GitHub repo](https://github.com/shenyangHuang/TGB/blob/main/README.md) [[2]](#H_GH:2023).

In [1]:
# SETUP

import os
from pathlib import Path
import sys

# If we're using Google Colab, we set the environment variable to point to the relevant folder in our Google Drive:
if 'COLAB_GPU' in os.environ:
    from google.colab import drive
    drive.mount('/content/drive')
    os.environ['TEMPORAL_GRAPHS'] = '/content/drive/MyDrive/Colab Notebooks/temporal_graphs'

# Otherwise, we use the environment variable on our local system:
project_environment_variable = "TEMPORAL_GRAPHS"

# Path to the root directory of the project (raises a KeyError if the variable is not set):
project_path = Path(os.environ[project_environment_variable])

# Relative path to /scripts (from where custom modules will be imported):
scripts_path = project_path.joinpath("scripts")

# Add this path to sys.path so that Python will look there for modules:
sys.path.append(str(scripts_path))

# Now import path_setup from our custom utils module to create a dictionary of paths to all subdirectories in our root directory:
from utils import path_setup
path = path_setup.subfolders(base_path = project_path)

path['project'] : F:\projects\temporal-graphs
path['Resources'] : F:\projects\temporal-graphs\Resources
path['presentation'] : F:\projects\temporal-graphs\presentation
path['notebooks'] : F:\projects\temporal-graphs\notebooks
path['scripts'] : F:\projects\temporal-graphs\scripts
path['literature'] : F:\projects\temporal-graphs\literature
path['data'] : F:\projects\temporal-graphs\data
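
The `utils.path_setup` module is custom and is not shown in this notebook. As a rough sketch of what a helper like `subfolders` might do (the implementation below is an assumption, reconstructed only from the printed output above), the function could map `'project'` to the root directory and the name of each immediate subdirectory to its path:

```python
from pathlib import Path
import tempfile

def subfolders(base_path):
    """Hypothetical stand-in for utils.path_setup.subfolders: return a
    dictionary mapping 'project' to the root directory and each immediate
    subdirectory name to its path, printing every entry."""
    base_path = Path(base_path)
    paths = {"project": base_path}
    for child in sorted(base_path.iterdir()):
        # Skip files and hidden directories such as .git.
        if child.is_dir() and not child.name.startswith("."):
            paths[child.name] = child
    for name, folder in paths.items():
        print(f"path['{name}'] : {folder}")
    return paths

# Demonstrate on a temporary directory tree rather than a real project.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("data", "notebooks", "scripts"):
        Path(tmp, name).mkdir()
    demo_paths = subfolders(tmp)
    demo_keys = sorted(demo_paths)
```

Returning `Path` objects rather than strings is what allows calls such as `path['data'].joinpath(...)` later in the notebook.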


<a id='data'></a>
## Data
↑↑ [Contents](#contents) ↑ [Setup](#setup) ↓ [Exploration](#exploration)

In [2]:
import pandas as pd

In [3]:
# from tgb.nodeproppred.dataset import NodePropPredDataset

# name = "tgbn-genre"

# dataset = NodePropPredDataset(name=name, root=path['data'], preprocess=True)

# data = dataset.full_data

Will you download the dataset(s) now? (y/N)
 y


Download started, this might take a while . . . 
Dataset title: tgbn-genre
Download completed 
Dataset directory is  C:\Users\tmfre\AppData\Local\Programs\Python\Python312\Lib\site-packages\tgb/datasets\tgbn_genre
file not processed, generating processed file
number of lines counted 17858395


17858396it [01:11, 248461.93it/s]
2741936it [00:07, 362521.13it/s]


file processed and saved


In [3]:
tgbn_genre_df = pd.read_csv(path['data'].joinpath('tgbn_genre.csv'))

In [14]:
tgbn_genre_df.head(20)

| | sources | destinations | timestamps | edge_idxs | edge_feat |
|---|---|---|---|---|---|
| 0 | 513.0 | 0.0 | 1108357000.0 | 1.0 | 0.375 |
| 1 | 513.0 | 0.0 | 1108357000.0 | 2.0 | 0.375 |
| 2 | 514.0 | 1.0 | 1108357000.0 | 3.0 | 0.452489 |
| 3 | 514.0 | 2.0 | 1108357000.0 | 4.0 | 0.289593 |
| 4 | 514.0 | 3.0 | 1108357000.0 | 5.0 | 0.257919 |
| 5 | 515.0 | 4.0 | 1108357000.0 | 6.0 | 0.362319 |
| 6 | 515.0 | 5.0 | 1108357000.0 | 7.0 | 0.322464 |
| 7 | 515.0 | 6.0 | 1108357000.0 | 8.0 | 0.315217 |
| 8 | 515.0 | 4.0 | 1108357000.0 | 9.0 | 0.361011 |
| 9 | 515.0 | 5.0 | 1108357000.0 | 10.0 | 0.3213 |


In [4]:
tgbn_genre_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17858395 entries, 0 to 17858394
Data columns (total 5 columns):
 #   Column        Dtype  
---  ------        -----  
 0   sources       float64
 1   destinations  float64
 2   timestamps    float64
 3   edge_idxs     float64
 4   edge_feat     float64
dtypes: float64(5)
memory usage: 681.2 MB
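
All five columns are stored as `float64`, although the node IDs and edge indices shown above look like whole numbers. One possible way to reduce the memory footprint, assuming those columns contain no missing or fractional values (this has not been verified on the full dataset here), is to cast them to integer types. The toy frame below stands in for `tgbn_genre_df` and uses values as displayed above.

```python
import pandas as pd

# Toy stand-in for tgbn_genre_df, using the first rows as displayed above.
toy_df = pd.DataFrame({
    "sources": [513.0, 513.0, 514.0],
    "destinations": [0.0, 0.0, 1.0],
    "timestamps": [1108357000.0, 1108357000.0, 1108357000.0],
    "edge_idxs": [1.0, 2.0, 3.0],
    "edge_feat": [0.375, 0.375, 0.452489],
})

id_columns = ["sources", "destinations", "edge_idxs"]

# Only cast when every value is a whole number; otherwise information would be lost.
assert (toy_df[id_columns] % 1 == 0).all().all()
compact_df = toy_df.astype({"sources": "int32", "destinations": "int32", "edge_idxs": "int64"})

print(compact_df.dtypes)
```

The weights in `edge_feat` are genuinely fractional, so that column is left as a float.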


<a id='exploration'></a>
## Exploration
↑↑ [Contents](#contents) ↑ [Data](#data) ↓ [...](#...)

In [10]:
for col in tgbn_genre_df.columns:
    print(f'Number of unique values in {col} column: {tgbn_genre_df[col].nunique()}')

Number of unique values in sources column: 992
Number of unique values in destinations column: 513
Number of unique values in timestamps column: 4187046
Number of unique values in edge_idxs column: 17858395
Number of unique values in edge_feat column: 6107
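
The counts above are consistent with the sources being user nodes and the destinations being genre nodes. In a bipartite network, no node ID should appear as both a source and a destination. A minimal check, shown on a toy frame standing in for `tgbn_genre_df` (it has not been run on the full dataset here), could look like this:

```python
import pandas as pd

# Toy stand-in for the sources and destinations columns of tgbn_genre_df.
toy_df = pd.DataFrame({
    "sources": [513, 513, 514, 515],
    "destinations": [0, 0, 1, 4],
})

# DataFrame.nunique() returns the per-column unique counts in a single call.
print(toy_df.nunique())

# Node IDs used as both source and destination would violate bipartiteness.
overlap = set(toy_df["sources"]) & set(toy_df["destinations"])
is_bipartite_by_id = len(overlap) == 0
print(f"IDs appearing on both sides: {len(overlap)}")
```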


In [25]:
from datetime import timedelta

def seconds_to_timedelta(seconds, base_time):
    return timedelta(seconds=seconds) - base_time

In [26]:
base_time = timedelta(seconds=tgbn_genre_df['timestamps'].iloc[0])
tgbn_genre_df['timedelta'] = tgbn_genre_df['timestamps'].apply(seconds_to_timedelta, base_time=base_time)

In [27]:
tgbn_genre_df['timedelta']

0             0 days 00:00:00
1             0 days 00:00:00
2             0 days 00:01:01
3             0 days 00:01:01
4             0 days 00:01:01
                  ...        
17858390   1586 days 20:22:58
17858391   1586 days 20:26:57
17858392   1586 days 20:26:57
17858393   1586 days 20:26:57
17858394   1586 days 20:26:57
Name: timedelta, Length: 17858395, dtype: timedelta64[ns]
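
Applying a Python function row by row over roughly 17.9 million rows is slow. A vectorised alternative, sketched here on a toy series and assuming the timestamps are in seconds (as the code above does), uses `pd.to_timedelta`. `pd.to_datetime` additionally gives calendar dates if the values are Unix epoch seconds, which is an assumption not confirmed in this notebook.

```python
import pandas as pd

# Toy stand-in for tgbn_genre_df['timestamps'] (values are illustrative).
toy_timestamps = pd.Series([1108357200.0, 1108357200.0, 1108357261.0])

# Elapsed time since the first interaction, computed for the whole column at once.
elapsed = pd.to_timedelta(toy_timestamps - toy_timestamps.iloc[0], unit="s")

# Calendar dates, ASSUMING the timestamps are Unix epoch seconds.
dates = pd.to_datetime(toy_timestamps, unit="s")

print(elapsed)
print(dates)
```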

In [35]:
(tgbn_genre_df['timestamps'].iloc[-1] - tgbn_genre_df['timestamps'].iloc[0])/(60*60*24)

1586.852048611111

In [34]:
# Time span between the first and last interaction, in millions of seconds:
137104017/10**6

137.104017
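
The figure 137104017 is the same time span as the 1586.85 days computed earlier, expressed in seconds. As a short worked example, the conversions can be written out explicitly; the weekly and yearly figures are additions for illustration, motivated by the one-week prediction horizon mentioned in the Introduction.

```python
SECONDS_PER_DAY = 60 * 60 * 24

span_seconds = 137104017  # last timestamp minus first timestamp, in seconds
span_days = span_seconds / SECONDS_PER_DAY
span_weeks = span_days / 7
span_years = span_days / 365.25  # approximate, using the average Julian year

print(f"{span_days:.2f} days, {span_weeks:.1f} weeks, {span_years:.2f} years")
```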

<a id='references'></a>
## References
↑↑ [Contents](#contents) ↑ [...](#...)

<a id='H:2023'></a>[1] Huang, S., et al. [Temporal graph benchmark for machine learning on temporal graphs.](https://doi.org/10.48550/arXiv.2307.01026) _Advances in Neural Information Processing Systems_, 2023. Preprint: [arXiv:2307.01026](https://doi.org/10.48550/arXiv.2307.01026), 2023.

<a id='H_GH:2023'></a>[2] Huang, S., et al. [TGB.](https://github.com/shenyangHuang/TGB) GitHub Repository. [https://github.com/shenyangHuang/TGB](https://github.com/shenyangHuang/TGB), 2023. Accessed May 14, 2024.

[3] Huang, S., et al. [Temporal Graph Benchmark.](https://tgb.complexdatalab.com/) [https://tgb.complexdatalab.com/](https://tgb.complexdatalab.com/), 2023. Accessed May 14, 2024.

[4] Huang, S., et al. [tgbn-genre dataset.](https://github.com/shenyangHuang/TGB/blob/main/tgb/datasets/dataset_scripts/tgbn-genre.py) [https://github.com/shenyangHuang/TGB/blob/main/tgb/datasets/dataset_scripts/tgbn-genre.py](https://github.com/shenyangHuang/TGB/blob/main/tgb/datasets/dataset_scripts/tgbn-genre.py), 2023. Accessed May 14, 2024.