# Data Folder

## TL;DR
Use standard_train (test, val) to train (test, val) both the classifier and the generative model.

Save the results of the generative model G for dataset D in connectivity_augmented/D/G

Train one generative model per label.

## Structure
**connectivity_augmented/\<dataset_name\>/\<generative_model_name\>**: dataset generated by connectivity augmentation of dataset \<dataset_name\> through the generative model \<generative_model_name\>

**standard/\<dataset_name\>**: raw dataset

**standard_train/\<dataset_name\>**: dataset used to train both the generative model and the classifier (80% split)

**standard_eval/\<dataset_name\>**: dataset used to evaluate both the generative model and the classifier (10% split)

**standard_test/\<dataset_name\>**: dataset used to test both the generative model and the classifier (10% split)
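
The layout above can be turned into a small path helper so that scripts do not hard-code folder names. This is only a sketch: `split_dir`, `augmented_dir`, `SPLIT_FOLDERS` and the `data_root` argument are hypothetical names introduced for illustration, not part of `discrete_diffusion`.

```python
from pathlib import Path

# Folder names copied from the Structure list; the keys are hypothetical.
SPLIT_FOLDERS = {
    "raw": "standard",
    "train": "standard_train",
    "eval": "standard_eval",
    "test": "standard_test",
}


def split_dir(data_root, split, dataset_name):
    """Return the folder that holds `dataset_name` for the given split."""
    return Path(data_root) / SPLIT_FOLDERS[split] / dataset_name


def augmented_dir(data_root, dataset_name, generator_name):
    """Return the folder for the output of one generative model on one dataset."""
    return Path(data_root) / "connectivity_augmented" / dataset_name / generator_name


print(split_dir("data", "train", "PROTEINS_full"))
print(augmented_dir("data", "PROTEINS_full", "sample_data"))
```

An unknown split name raises a `KeyError`, which catches typos early.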

## Useful code

### Example of dataset splitting

from discrete_diffusion.io_utils import load_TU_dataset, write_TU_format
from pathlib import Path
from random import shuffle

data_list, _ = load_TU_dataset(paths=[Path("standard_test/PROTEINS_full")], dataset_names=["PROTEINS_full"],
                               output_type="networkx", max_num_nodes=50)
shuffle(data_list)

half = len(data_list) // 2
data_list_eval = data_list[:half]
data_list_test = data_list[half:]

write_TU_format(data_list_eval, path=Path("standard_eval/PROTEINS_full"), dataset_name="PROTEINS_full")
write_TU_format(data_list_test, path=Path("standard_test/PROTEINS_full"), dataset_name="PROTEINS_full")
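
The 80/10/10 proportions listed under Structure can also be produced in one pass. The sketch below works on any list and leaves out the `discrete_diffusion` I/O calls. `split_80_10_10` and its `seed` parameter are hypothetical names. Each slice starts exactly where the previous one ends, so no item is dropped at a boundary.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of `items` and return (train, val, test) lists."""
    items = list(items)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(items)
    train_end = int(len(items) * 0.8)
    val_end = int(len(items) * 0.9)
    # Slice ends are exclusive, so each slice starts where the last one stopped.
    return items[:train_end], items[train_end:val_end], items[val_end:]


train, val, test = split_80_10_10(range(1000))
print(len(train), len(val), len(test))
```

Writing `train_end + 1` as the start of the second slice would silently discard one graph per boundary.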


### Example of label merging

from pathlib import Path

from discrete_diffusion.io_utils import load_TU_dataset, write_TU_format

data_list, _ = load_TU_dataset(
    [Path("connectivity_augmented/PROTEINS_full/sample_data")] * 2, ["PROTEINS_full_label_1", "PROTEINS_full_label_2"],
    output_type="networkx"
)
write_TU_format(data_list, Path("connectivity_augmented/PROTEINS_full"), "PROTEINS_full")
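
The merge above is the last step of training one generative model per label. The first step, separating the graphs by label, might look like the sketch below. How labels are exposed by `load_TU_dataset` is not shown here, so passing them as a parallel list is an assumption, and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_label(graphs, labels):
    """Group graphs into one list per label, preserving the input order."""
    if len(graphs) != len(labels):
        raise ValueError("graphs and labels must have the same length")
    groups = defaultdict(list)
    for graph, label in zip(graphs, labels):
        groups[label].append(graph)
    return dict(groups)


# Placeholder strings stand in for graph objects.
groups = group_by_label(["g0", "g1", "g2"], [1, 2, 1])
for label, members in groups.items():
    print(label, members)
```

Each value of `groups` can then be used to train the generator for that label.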


In [None]:
import importlib
from pathlib import Path
from random import shuffle

import discrete_diffusion.io_utils
importlib.reload(discrete_diffusion.io_utils)
from discrete_diffusion.io_utils import load_TU_dataset, write_TU_format
from networkx import is_connected, nodes_with_selfloops

data_list, _ = load_TU_dataset(paths=[Path("../data/standard/PROTEINS_full")], dataset_names=["PROTEINS_full"],
                               output_type="networkx", max_num_nodes=50)

shuffle(data_list)

train_end = int(len(data_list) * 0.8)
val_end = int(len(data_list) * 0.9)
data_list_train = data_list[:train_end]
data_list_val = data_list[train_end:val_end]
data_list_test = data_list[val_end:]

In [None]:
import networkx as nx
nx.draw(data_list_train[10], with_labels=True)
len(data_list_test)

In [None]:
write_TU_format(data_list_train, path=Path("../data/standard_train/PROTEINS_50"), dataset_name="PROTEINS_50")
write_TU_format(data_list_val, path=Path("../data/standard_val/PROTEINS_50"), dataset_name="PROTEINS_50")
write_TU_format(data_list_test, path=Path("../data/standard_test/PROTEINS_50"), dataset_name="PROTEINS_50")

In [None]:
import discrete_diffusion.io_utils
from pathlib import Path
import importlib
importlib.reload(discrete_diffusion.io_utils)
train, _ = discrete_diffusion.io_utils.load_TU_dataset(
    paths=[Path("../data/standard_train/PROTEINS_50")], dataset_names=["PROTEINS_50"], 
    output_type="networkx")


In [None]:
train_check, _ = discrete_diffusion.io_utils.load_TU_dataset(
    paths=[Path("../data/tmp")], dataset_names=["tmp"], output_type="networkx")

In [None]:
vars(train_check[0])

In [None]:
vars(train[0])

In [None]:
import networkx as nx
nx.draw(train[9], with_labels=True)
train[9].number_of_nodes()

In [None]:
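import matplotlib.pyplot as plt
import networkx as nx
from torch_geometric.utils import to_networkx

disconnected = []  # graphs that are not connected
for i in range(len(dataset)):  # dataset: assumed to be a torch_geometric dataset loaded beforehand
    data = dataset.get(i)
    G = to_networkx(data).to_undirected()
    if not nx.is_connected(G):
        disconnected.append(G)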
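
The connectivity test in the loop above relies on `nx.is_connected`. As a dependency-free illustration of what that check does, here is a breadth-first-search sketch over an edge list. It is a stand-in written for this note, not the networkx implementation, and it assumes nodes are numbered `0..num_nodes-1`.

```python
from collections import deque


def is_connected(num_nodes, edges):
    """Return True if the undirected graph on nodes 0..num_nodes-1 is connected."""
    if num_nodes == 0:
        raise ValueError("connectivity is undefined for the empty graph")
    adjacency = {node: set() for node in range(num_nodes)}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Breadth-first search from node 0; the graph is connected
    # exactly when every node is reached.
    seen = {0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return len(seen) == num_nodes


print(is_connected(3, [(0, 1), (1, 2)]))
print(is_connected(3, [(0, 1)]))
```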