This notebook is a showcase of loading datasets and their statistics

### Bitcoin dataset

This dataset was used in [Tigger](https://github.com/data-iitd/tigger/tree/main/data/bitcoin) composed of a homogeneous financial transaction graph of bitcoin trading between users of
bitcoin-alpha trading platform (Kumar et al. 2016)

In [159]:
#start by loading the data file
import pandas as pd
url= f"https://raw.githubusercontent.com/data-iitd/tigger/main/data/bitcoin/data.csv"
# Load the CSV file into a DataFrame
df = pd.read_csv(url, sep=',', header=None)
df

Unnamed: 0,0,1,2,3
0,,start,end,days
1,0.0,0,398,137
2,1.0,1,398,102
3,2.0,2,398,94
4,3.0,3,398,71
...,...,...,...,...
24182,24181.0,1064,1061,87
24183,24182.0,1061,1064,87
24184,24183.0,1064,1060,87
24185,24184.0,1060,1064,87


In [160]:
# Apply transofrmation for tglib
df = df.iloc[:, 1:]

In [161]:
df

Unnamed: 0,1,2,3
0,start,end,days
1,0,398,137
2,1,398,102
3,2,398,94
4,3,398,71
...,...,...,...
24182,1064,1061,87
24183,1061,1064,87
24184,1064,1060,87
24185,1060,1064,87


In [162]:
df = df.drop(0)
df

Unnamed: 0,1,2,3
1,0,398,137
2,1,398,102
3,2,398,94
4,3,398,71
5,4,398,68
...,...,...,...
24182,1064,1061,87
24183,1061,1064,87
24184,1064,1060,87
24185,1060,1064,87


In [163]:
import os
path_dataset = os.getcwd()
path_dataset_tglib = path_dataset + "-tglib"
df.to_csv(path_dataset_tglib, sep=' ', index=False, header=False)

**How to install the tglib package?**

You need to download and compile tglib before running this notebook

- git clone --recurse-submodules https://gitlab.com/tgpublic/tglib.gitcd 

- tglib/tglib_cppmkdir build-release

- cd build-release

- cmake .. -DCMAKE_BUILD_TYPE=Release

- make

In [164]:
import sys
sys.path.append("/home/houssem.souid/tglib/tglib_cpp/build-release/src/python_binding")

In [165]:
temporal_graph = tgl.load_ordered_edge_list(path_dataset_tglib)

In [166]:
stats = tgl.get_statistics(temporal_graph)
print(stats)

number of nodes: 3783
number of edges: 24186
number of static edges: 24186
number of time stamps: 190
number of transition times: 1
min. time stamp: 1
max. time stamp: 191
min. transition time: 1
max. transition time: 1
min. temporal in-degree: 0
max. temporal in-degree: 398
min. temporal out-degree: 0
max. temporal out-degree: 490


## Reddit dataset

This dataset was used in [Tigger](https://github.com/data-iitd/tigger/tree/main/data/CAW_data) composed of a bipartite graph of users’ post on subreddits (Leskovec and Krevl 2014)

In [167]:
#start by loading the data file
import pandas as pd
url = "https://raw.githubusercontent.com/data-iitd/tigger/main/data/CAW_data/reddit_processed.csv"
# Load the CSV file into a DataFrame
df = pd.read_csv(url, sep=',', header=None)
df

  df = pd.read_csv(url, sep=',', header=None)


Unnamed: 0,0,1,2
0,start,end,days
1,1,10001,1
2,2,10002,7
3,3,10003,8
4,4,10003,14
...,...,...,...
671335,124,10026,2678366
671336,4557,10007,2678372
671337,1072,10130,2678379
671338,11,10009,2678390


In [168]:
# Apply transofrmation for tglib
df = df.drop(0)
path_dataset = os.getcwd()
path_dataset_tglib = path_dataset + "-tglib"
df.to_csv(path_dataset_tglib, sep=' ', index=False, header=False)

In [169]:
temporal_graph = tgl.load_ordered_edge_list(path_dataset_tglib)
stats = tgl.get_statistics(temporal_graph)
print(stats)

number of nodes: 10984
number of edges: 671339
number of static edges: 78516
number of time stamps: 588915
number of transition times: 1
min. time stamp: 1
max. time stamp: 2678391
min. transition time: 1
max. transition time: 1
min. temporal in-degree: 0
max. temporal in-degree: 58725
min. temporal out-degree: 0
max. temporal out-degree: 4690


## Wiki-small dataset


This dataset was used in [Tigger](https://github.com/data-iitd/tigger/tree/main/data/CAW_data) contains a a bipartite graph between human editors and Wikipedia pages for 50 hours (Leskovec and Krevl 2014)

In [170]:
#start by loading the data file
import pandas as pd
url= "https://github.com/data-iitd/tigger/raw/main/data/CAW_data/wiki_744_50.csv"
# Load the CSV file into a DataFrame
df = pd.read_csv(url, sep=',', header=None)
df

Unnamed: 0,0,1,2,3
0,,start,end,days
1,0.0,0,1125,1
2,1.0,1,1126,1
3,2.0,2,1127,1
4,3.0,3,1128,1
...,...,...,...,...
2976,2975.0,373,1372,50
2977,2976.0,30,1152,50
2978,2977.0,727,1437,50
2979,2978.0,632,1475,50


In [171]:
# Apply transofrmation for tglib
df = df.drop(0, axis=1)
df = df.drop(0)
path_dataset = os.getcwd()
path_dataset_tglib = path_dataset + "-tglib"
df.to_csv(path_dataset_tglib, sep=' ', index=False, header=False)

In [172]:
temporal_graph = tgl.load_ordered_edge_list(path_dataset_tglib)
stats = tgl.get_statistics(temporal_graph)
print(stats)

number of nodes: 1616
number of edges: 2980
number of static edges: 1575
number of time stamps: 50
number of transition times: 1
min. time stamp: 1
max. time stamp: 50
min. transition time: 1
max. transition time: 1
min. temporal in-degree: 0
max. temporal in-degree: 74
min. temporal out-degree: 0
max. temporal out-degree: 34


### IMDB dynamic

This [dataset](https://networkrepository.com/imdb.php) contains movies and actors as nodes and the edge represents the collaboration of an actor in a movie/with another actor and the timestamp is the year of collaboration.

In [173]:
#tranformation functions
def transform_line(line: str) -> str:
    """transforms a single line from dataset:
    deleted the third column

    Args:
        line (string): line from dataset file

    Returns:
        str : new line without the third column
    """
    s_line = line.split(",")
    return ' '.join(s_line[:2] + s_line[-1:])

def add_opposite_edges(line: str) -> str:
    """transforms the edge in the opposite direction

    Args:
        line (string): line from dataset file

    Returns:
        str : new ledge in the direction
    """
    s_line = line.split(",")
    return ' '.join(s_line[:2][::-1] + s_line[-1:])


def convert_data_file_to_tglib_file(
        path: str,
        opposite_edges:bool=True
):
    """converts the dataset in the tglib format

    Args:
        path (string): line from dataset file
        opposite_edges (bool): add the edges in the opposite direction
    """
    with open(path, 'r') as file:
        string_list = file.readlines()

    # remove last column and switch column 3 with column 4
    transformed_data = list(map(transform_line, string_list))

    data = transformed_data
    if opposite_edges:
        opposite_edges_list = list(map(add_opposite_edges, string_list))
        data = data + opposite_edges_list

    new_path = path + "-tglib"
    with open(new_path , "w") as file_out:
        new_file_contents = " ".join(data)
        file_out.write(new_file_contents)
        
    file_out.close()


In [174]:
import os
import tempfile
import urllib.request
import zipfile
import sys
sys.path.append("/home/houssem.souid/tglib/tglib_cpp/build-release/src/python_binding")
import pytglib as tgl

with tempfile.TemporaryDirectory() as tmpdirname:
       
        url = f"https://nrvis.com/download/data/dynamic/imdb.zip"

        urllib.request.urlretrieve(
            url, os.path.join(tmpdirname, f"imdb.zip")
        )

        # unzip it
        with zipfile.ZipFile(
            os.path.join(tmpdirname, f"imdb.zip"), "r"
        ) as zip_ref:
            zip_ref.extractall(tmpdirname)

        # Load the file
        path_dataset = os.path.join(tmpdirname, f"imdb.edges")
        convert_data_file_to_tglib_file(path_dataset, opposite_edges=True)
        path_dataset_tglib = path_dataset + "-tglib"
        temporal_graph = tgl.load_ordered_edge_list(path_dataset_tglib)

In [175]:
stats = tgl.get_statistics(temporal_graph)
print(stats)

number of nodes: 150545
number of edges: 592375
number of static edges: 591457
number of time stamps: 28
number of transition times: 2
min. time stamp: 1980
max. time stamp: 2007
min. transition time: 1
max. transition time: 137025
min. temporal in-degree: 1
max. temporal in-degree: 435
min. temporal out-degree: 1
max. temporal out-degree: 435


##  Twitter

[Twitter mention graphs](https://pytorch-geometric-temporal.readthedocs.io/en/latest/_modules/torch_geometric_temporal/dataset/twitter_tennis.html) related to major tennis tournaments from 2017. The nodes are Twitter accounts and edges are mentions between them. Each snapshot contains the graph induced by the most popular nodes of the original dataset. Node labels encode the number of mentions received in the original dataset for the next snapshot.

In [176]:
from torch_geometric_temporal.dataset import TwitterTennisDatasetLoader
data = TwitterTennisDatasetLoader()
dataset = data.get_dataset()

In [177]:
len(dataset.features[0])

1000

In [178]:
dataset.snapshot_count

120

In [179]:
import numpy as np 
concatenated_edge_indices = np.concatenate(dataset.edge_indices , axis=1)
# Find unique values in the concatenated array
connected_nodes = np.unique(concatenated_edge_indices)

In [180]:
len(connected_nodes)

995