# Data Folder

## TL;DR
Use standard_train, standard_val, and standard_test to train, validate, and test both the classifier and the generative model.

Save the output of the generative model G for dataset D in connectivity_augmented/D/G.

Train the generator separately for each label (one model per label).
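
A minimal sketch of the one-model-per-label setup: partition the graphs by label first, then fit one generator on each group. It assumes each graph stores its class in `graph["label"]`, as the loaded networkx graphs in the inspection code of this file do. `group_by_label` is a hypothetical helper, not part of `discrete_diffusion`, and the stand-in objects only mimic that one attribute so the snippet runs without any dataset.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_label(graphs):
    """Partition graphs by the class stored in graph.graph["label"]."""
    groups = defaultdict(list)
    for g in graphs:
        groups[g.graph["label"]].append(g)
    return dict(groups)


# Stand-ins for networkx graphs: only the .graph attribute dict is mimicked.
graphs = [SimpleNamespace(graph={"label": label}) for label in (1, 2, 1, 2, 2)]

groups = group_by_label(graphs)
for label, members in sorted(groups.items()):
    # Each group would be the training set of its own generative model.
    print(f"label {label}: {len(members)} graphs")
```

With real data, each group could then be written to its own folder and used to train a separate generator.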

## Structure
**connectivity_augmented/\<dataset_name\>/\<generative_model_name\>**: dataset generated by connectivity augmentation of \<dataset_name\> using the generative model \<generative_model_name\>

**standard/\<dataset_name\>**: raw dataset

**standard_train/\<dataset_name\>**: split used to train both the generative model and the classifier (80% of the raw dataset)

**standard_val/\<dataset_name\>**: split used to validate both the generative model and the classifier (10% of the raw dataset)

**standard_test/\<dataset_name\>**: split used to test both the generative model and the classifier (10% of the raw dataset)
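
The layout above can be captured in two small path helpers, sketched here under a few assumptions: `split_dir` and `augmented_dir` are hypothetical names, the data root `../data` and the model name `edp-gnn` are taken from the code in this file, and the validation folder is spelled `standard_val` as in the splitting code.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")


def split_dir(root, split, dataset_name):
    """Folder holding one split of a dataset, e.g. standard_train/<dataset_name>."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return Path(root) / f"standard_{split}" / dataset_name


def augmented_dir(root, dataset_name, model_name):
    """Folder holding the graphs generated for a dataset by one generative model."""
    return Path(root) / "connectivity_augmented" / dataset_name / model_name


print(split_dir("../data", "train", "PROTEINS_50"))
print(augmented_dir("../data", "PROTEINS_50", "edp-gnn"))
```

Building paths in one place keeps the folder names consistent between the scripts that write the splits and those that read them.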

## Useful code

### Example of dataset splitting

from pathlib import Path
from random import shuffle
import importlib

import discrete_diffusion.io_utils
# Reload first so that edits to io_utils are picked up, then import the helpers
importlib.reload(discrete_diffusion.io_utils)
from discrete_diffusion.io_utils import load_TU_dataset, write_TU_format

# Load the raw dataset (max_num_nodes=50, hence the PROTEINS_50 output name)
data_list, _ = load_TU_dataset(paths=[Path("../data/standard/PROTEINS_full")], dataset_names=["PROTEINS_full"],
                               output_type="networkx", max_num_nodes=50)

# Shuffle in place so that the splits are random
shuffle(data_list)

# 80% / 10% / 10% split; the slices are contiguous so that no graph is dropped
train_end = int(len(data_list) * 0.8)
val_end = int(len(data_list) * 0.9)
data_list_train = data_list[:train_end]
data_list_val = data_list[train_end:val_end]
data_list_test = data_list[val_end:]

write_TU_format(data_list_train, path=Path("../data/standard_train/PROTEINS_50"), dataset_name="PROTEINS_50")
write_TU_format(data_list_val, path=Path("../data/standard_val/PROTEINS_50"), dataset_name="PROTEINS_50")
write_TU_format(data_list_test, path=Path("../data/standard_test/PROTEINS_50"), dataset_name="PROTEINS_50")
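
A quick way to sanity-check the split boundaries is to run the same index arithmetic on plain integers and confirm that the three slices cover every item exactly once. This is only a sketch: `split_80_10_10` is a hypothetical helper, and the `seed` parameter is an added assumption for reproducibility (the code above shuffles without a seed).

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of items and cut it into contiguous 80/10/10 slices."""
    items = list(items)
    random.Random(seed).shuffle(items)
    train_end = int(len(items) * 0.8)
    val_end = int(len(items) * 0.9)
    # Each slice starts exactly where the previous one ends.
    return items[:train_end], items[train_end:val_end], items[val_end:]


train, val, test = split_80_10_10(range(103))
# The three splits together must contain every item exactly once.
assert len(train) + len(val) + len(test) == 103
assert sorted(train + val + test) == list(range(103))
print(len(train), len(val), len(test))
```

Starting a slice one position after the previous slice's end index would silently drop one item per boundary, which this check would catch.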

In [None]:
### Example of label merging

from pathlib import Path

from discrete_diffusion.io_utils import load_TU_dataset, write_TU_format
import importlib
#importlib.reload(discrete_diffusion.io_utils)

# Load both per-label generated datasets (the same folder is passed once per dataset name)
data_list, _ = load_TU_dataset(
    [Path("../data/connectivity_augmented/PROTEINS_50/")] * 2, ["PROTEINS_50_label_1", "PROTEINS_50_label_2"],
    output_type="networkx"
)
# Keep only graphs with more than one node
data_new = []
for data in data_list:
    if data.number_of_nodes() > 1:
        data_new.append(data)
# Write the merged dataset to the folder of the generative model
write_TU_format(data_new, Path("../data/connectivity_augmented/PROTEINS_50/edp-gnn"), "PROTEINS_50")


In [None]:
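from pathlib import Path
from discrete_diffusion.io_utils import load_TU_dataset

# Reload the merged dataset to check that it was written correctly
data_check, _ = load_TU_dataset(
    paths=[Path("../data/connectivity_augmented/PROTEINS_50/edp-gnn")], dataset_names=["PROTEINS_50"],
    output_type="networkx")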


In [None]:
import networkx as nx

# Inspect one graph (index chosen arbitrarily): draw it, then print its node count and label
n = 211
nx.draw(data_check[n], with_labels=True)
print(data_check[n].number_of_nodes())
print(data_check[n].graph["label"])