# Data Folder

## TL;DR
Use standard_train (standard_test, standard_val) to train (test, validate) both the classifier and the generative model.

Save the results of generative model G for dataset D in connectivity_augmented/D/G.

Train the generator with one model per label.

## Structure
**connectivity_augmented/\<dataset_name\>/\<generative_model_name\>**: dataset generated by connectivity augmentation of dataset \<dataset_name\> through the generative model \<generative_model_name\>

**standard/\<dataset_name\>**: raw dataset

**standard_train/\<dataset_name\>**: dataset used to train both the generative model and the classifier (80% of the raw dataset)

**standard_val/\<dataset_name\>**: dataset used to validate both the generative model and the classifier (10% of the raw dataset)

**standard_test/\<dataset_name\>**: dataset used to test both the generative model and the classifier (10% of the raw dataset)
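The layout above can be captured in two small path helpers, which avoids retyping folder names in every cell. This is only a sketch: `split_dir`, `augmented_dir` and the `../data` root are hypothetical names chosen for illustration, and the validation folder is assumed to be called `standard_val`, as in the `write_TU_format` calls used elsewhere in this notebook.

```python
from pathlib import Path

# Hypothetical helpers (not part of discrete_diffusion); they only build paths.
SPLITS = ("train", "val", "test")


def split_dir(root: Path, split: str, dataset_name: str) -> Path:
    """Return <root>/standard_<split>/<dataset_name>."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return root / f"standard_{split}" / dataset_name


def augmented_dir(root: Path, dataset_name: str, generator_name: str) -> Path:
    """Return <root>/connectivity_augmented/<dataset_name>/<generator_name>."""
    return root / "connectivity_augmented" / dataset_name / generator_name


root = Path("../data")  # assumed location of the data folder
print(split_dir(root, "train", "PROTEINS_50"))
print(augmented_dir(root, "PROTEINS_50", "gdss"))
```

The returned paths can be passed directly as the `path` arguments of `load_TU_dataset` and `write_TU_format`.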

## Useful code

### Example of dataset splitting

import importlib
from pathlib import Path
from random import shuffle

import discrete_diffusion.io_utils

# Reload the module first so that local edits to io_utils are picked up
importlib.reload(discrete_diffusion.io_utils)
from discrete_diffusion.io_utils import load_TU_dataset, write_TU_format

# Load the raw dataset with max_num_nodes=50 (the PROTEINS_50 subset)
data_list, _ = load_TU_dataset(paths=[Path("../data/standard/PROTEINS_full")], dataset_names=["PROTEINS_full"],
                               output_type="networkx", max_num_nodes=50)

shuffle(data_list)

# 80/10/10 split; consecutive slices share their boundary index so no graph is dropped
train_end = int(len(data_list) * 0.8)
val_end = int(len(data_list) * 0.9)
data_list_train = data_list[:train_end]
data_list_val = data_list[train_end:val_end]
data_list_test = data_list[val_end:]

write_TU_format(data_list_train, path=Path("../data/standard_train/PROTEINS_50"), dataset_name="PROTEINS_50")
write_TU_format(data_list_val, path=Path("../data/standard_val/PROTEINS_50"), dataset_name="PROTEINS_50")
write_TU_format(data_list_test, path=Path("../data/standard_test/PROTEINS_50"), dataset_name="PROTEINS_50")
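The split above depends on the global random state, so rerunning the cell produces different folders. A minimal sketch of a reproducible variant follows; the helper name `split_80_10_10` and the `seed` parameter are hypothetical, and the function works on any sequence, not only on the graphs returned by `load_TU_dataset`.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of `items` with a fixed seed and cut it 80/10/10."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    train_end = int(len(shuffled) * 0.8)
    val_end = int(len(shuffled) * 0.9)
    # Consecutive slices share their boundaries, so every item lands in exactly one split
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


train, val, test = split_80_10_10(range(100), seed=42)
print(len(train), len(val), len(test))  # 80 10 10
```

Calling the helper twice with the same seed returns the same three lists, which makes it possible to regenerate the split folders later.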

### Example of labels merging

from pathlib import Path

from discrete_diffusion.io_utils import load_TU_dataset, write_TU_format

In [None]:
# Load the two per-label generated datasets, both stored in the same folder
data_list, _ = load_TU_dataset(
    [Path("../data/connectivity_augmented/PROTEINS_50/gdss")] * 2, ["PROTEINS_50_label_1", "PROTEINS_50_label_2"],
    output_type="networkx"
)
# Keep only graphs with more than one node
data_new = [data for data in data_list if data.number_of_nodes() > 1]
# Write the merged dataset under a single name
write_TU_format(data_new, Path("../data/connectivity_augmented/PROTEINS_50/gdss"), "PROTEINS_50")

### Check properties of merged dataset

In [None]:
from pathlib import Path
data_check, _ = load_TU_dataset(
    paths=[Path("../data/standard_val/PROTEINS_50")], dataset_names=["PROTEINS_50"], 
    output_type="networkx")


In [None]:
import networkx as nx
n = 10
nx.draw(data_check[n], with_labels=True)
print(data_check[n].number_of_nodes())
print(data_check[n].graph["label"])

In [None]:
c1 = []
c2 = []
for g in data_check:
    if g.graph["label"] == 1:
        c1.append(g)
    elif g.graph["label"] == 2:
        c2.append(g)
print(len(c1))
print(len(c2))
print(sum([g.number_of_nodes() for g in c1]) / len(c1))
print(sum([g.number_of_nodes() for g in c2]) / len(c2))
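The per-label counts above hard-code labels 1 and 2. A sketch of a label-agnostic version is shown below; the same grouping could also be used to hand each label's graphs to its own generator, in line with the one-model-per-label rule. `group_by_label`, `label_summary` and `fake_graph` are hypothetical names, and the stand-in objects only mimic the two networkx features used here (`graph["label"]` and `number_of_nodes()`).

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_label(graphs):
    """Map each graph label to the list of graphs carrying it."""
    groups = defaultdict(list)
    for g in graphs:
        groups[g.graph["label"]].append(g)
    return dict(groups)


def label_summary(graphs):
    """Return {label: (number of graphs, mean number of nodes)}."""
    summary = {}
    for label, members in group_by_label(graphs).items():
        mean_nodes = sum(g.number_of_nodes() for g in members) / len(members)
        summary[label] = (len(members), mean_nodes)
    return summary


def fake_graph(label, n_nodes):
    # Stand-in for a networkx graph, for illustration only
    return SimpleNamespace(graph={"label": label}, number_of_nodes=lambda: n_nodes)


toy = [fake_graph(1, 10), fake_graph(1, 20), fake_graph(2, 30)]
print(label_summary(toy))  # {1: (2, 15.0), 2: (1, 30.0)}
```

With real data, `label_summary(data_check)` would be the call to use, since networkx graphs expose the same two members.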

In [None]:
n = -1
nx.draw(c2[n], with_labels=True)
print(c2[n].number_of_nodes())
print(c2[n].graph["label"])

In [None]:
import torch_geometric
train_dataset, _ = load_TU_dataset(
    [Path("../data/connectivity_augmented/PROTEINS_50/edp-gnn")], ["PROTEINS_50"],
    output_type="networkx", max_graphs_per_dataset=[512]
)

In [None]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(4, 4, constrained_layout=True)
axs = axs.flatten()
k = 1  # page index: draws graphs 16*k .. 16*k + 15 on the 4x4 grid
for i in range(16 * k, 16 * (k + 1)):
    nx.draw(data_check[i], with_labels=True, ax=axs[i - 16 * k], node_size=0.1)


In [None]:
iso_count = []  # indices of check graphs that also share the set of "x" values with an isomorphic train graph
iso_conn = []   # (check index, train index) pairs of graphs that are isomorphic in connectivity
for idxc, test_graph in enumerate(data_check):
    for idxt, train_graph in enumerate(train_dataset):
        if nx.is_isomorphic(train_graph, test_graph):
            train_attrs = set(nx.get_node_attributes(train_graph, "x").values())
            check_attrs = set(nx.get_node_attributes(test_graph, "x").values())
            iso_conn.append((idxc, idxt))
            if check_attrs == train_attrs:
                iso_count.append(idxc)

In [None]:
len(set([couple[0] for couple in iso_conn]))

In [None]:
n = 0
graph_check = data_check[iso_conn[n][0]]
nx.draw(graph_check, with_labels=True)
graph_check.graph["label"]

In [None]:
graph_train = train_dataset[iso_conn[n][1]]
nx.draw(graph_train, with_labels=True)
graph_train.graph["label"]

In [None]:
nx.get_node_attributes(graph_check, "x")

In [None]:
nx.get_node_attributes(graph_train, "x")

## Findings of the day

1) test and val have graphs that overlap in connectivity (11, 12) with train, but not in attributes.
2) test and val have graphs that overlap in connectivity (4, 1) with conn_augm_gdss, but not in attributes.
3) test and val have graphs that overlap in connectivity (8, 9) with conn_augm_edp-gnn, but not in attributes.
4) test has 48 graphs with label 1 and 39 with label 2. val has 46 graphs with label 1 and 40 with label 2.
5) outliers: 