# Data Folder

## TL;DR
Use standard_train (standard_test, standard_val) to train (test, validate) both the classifier and the generative model.

Save the results of the generative model G for dataset D in connectivity_augmented/D/G.

Train the generator with one model per label.
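
A minimal sketch of the "one model per label" convention, assuming each graph carries its class in `graph["label"]` (as the checking code in this folder does) and that per-label outputs are named `<dataset>_label_<label>` (the pattern used in the label-merging example). The helpers `group_by_label` and `per_label_targets` are hypothetical names, not part of `discrete_diffusion`.

```python
from collections import defaultdict
from pathlib import Path
from types import SimpleNamespace


def group_by_label(graphs):
    """Group graphs by the value stored in graph.graph["label"]."""
    groups = defaultdict(list)
    for g in graphs:
        groups[g.graph["label"]].append(g)
    return dict(groups)


def per_label_targets(data_root, dataset, generator, labels):
    """Return {label: (output_dir, dataset_name)} for each per-label model."""
    out_dir = Path(data_root) / "connectivity_augmented" / dataset / generator
    return {label: (out_dir, f"{dataset}_label_{label}") for label in sorted(labels)}


# Stand-ins for networkx graphs: only the `.graph` dict is used here.
graphs = [SimpleNamespace(graph={"label": lab}) for lab in (1, 2, 1, 2, 2)]
groups = group_by_label(graphs)
targets = per_label_targets("../data", "PROTEINS_50", "gdss", groups)
for label, (out_dir, name) in targets.items():
    # One generator would be trained on groups[label] and saved under out_dir / name
    print(label, len(groups[label]), (out_dir / name).as_posix())
```

Each group would then be passed to its own generator, and the per-label results merged afterwards.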

## Structure
**connectivity_augmented**/**\<dataset name\>**/**\<generative model name\>**: dataset generated by connectivity augmentation of dataset \<dataset name\> through the generative model \<generative model name\>

**standard/\<dataset_name\>**: raw dataset

**standard_train/\<dataset_name\>**: dataset used to train both the generative model and the classifier (80% of the raw dataset).

**standard_val/\<dataset_name\>**: dataset used to validate both the generative model and the classifier (10% of the raw dataset).

**standard_test/\<dataset_name\>**: dataset used to test both the generative model and the classifier (10% of the raw dataset).
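
The 80/10/10 split can be sketched as follows. This is an illustration on plain Python lists, not the code used to produce the folders; the function name `split_80_10_10` and the fixed `seed` are assumptions added for reproducibility.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of `items` and cut it into train/val/test slices."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded, so the split is repeatable
    train_end = int(len(items) * 0.8)
    val_end = int(len(items) * 0.9)
    # Slice end indices are exclusive, so consecutive slices share a boundary
    return items[:train_end], items[train_end:val_end], items[val_end:]


train, val, test = split_80_10_10(range(100))
print(len(train), len(val), len(test))
```

Because the three slices share their boundaries, every item lands in exactly one split.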

## Useful code

In [None]:
### Imports
from pathlib import Path
from random import shuffle

import networkx as nx
import torch_geometric

from discrete_diffusion.io_utils import load_TU_dataset, write_TU_format

### Example of dataset splitting

data_list, _ = load_TU_dataset(paths=[Path("../data/standard/PROTEINS_full")], dataset_names=["PROTEINS_full"],
                               output_type="networkx", max_num_nodes=50)

shuffle(data_list)

train_end = int(len(data_list) * 0.8)
val_end = int(len(data_list) * 0.9)
# Slice end indices are exclusive, so no "+ 1" is needed (it would drop one graph per boundary)
data_list_train = data_list[:train_end]
data_list_val = data_list[train_end:val_end]
data_list_test = data_list[val_end:]

write_TU_format(data_list_train, path=Path("../data/standard_train/PROTEINS_50"), dataset_name="PROTEINS_50")
write_TU_format(data_list_val, path=Path("../data/standard_val/PROTEINS_50"), dataset_name="PROTEINS_50")
write_TU_format(data_list_test, path=Path("../data/standard_test/PROTEINS_50"), dataset_name="PROTEINS_50")

### Example of labels merging

# Load the two per-label datasets produced by the generator
data_list, _ = load_TU_dataset(
    [Path("../data/connectivity_augmented/PROTEINS_50/gdss")] * 2, ["PROTEINS_50_label_1", "PROTEINS_50_label_2"],
    output_type="networkx"
)
# Keep only graphs with more than one node
data_new = []
for data in data_list:
    if data.number_of_nodes() > 1:
        data_new.append(data)
# Write the merged dataset next to the per-label ones
write_TU_format(data_new, Path("../data/connectivity_augmented/PROTEINS_50/gdss"), "PROTEINS_50")

### Check properties of merged dataset

In [None]:
data_check, _ = load_TU_dataset(
    paths=[Path("../data/connectivity_augmented/PROTEINS_50/gdss/")], dataset_names=["PROTEINS_50"], 
    output_type="networkx")


In [None]:
# divide labels
c1 = []
c2 = []
for g in data_check:
    if g.graph["label"] == 1:
        c1.append(g)
    elif g.graph["label"] == 2:
        c2.append(g)
print(len(c1))
print(len(c2))
print(sum([g.number_of_nodes() for g in c1]) / len(c1))
print(sum([g.number_of_nodes() for g in c2]) / len(c2))
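
The per-label counts and mean sizes above can also be computed for any number of labels at once. The sketch below assumes only that each graph exposes `graph["label"]` and `number_of_nodes()`, as networkx graphs do; `FakeGraph` is a stand-in so the example runs without loading a dataset, and `label_summary` is a hypothetical helper name.

```python
from collections import defaultdict


class FakeGraph:
    """Stand-in exposing the two networkx members used below."""

    def __init__(self, label, n_nodes):
        self.graph = {"label": label}
        self._n_nodes = n_nodes

    def number_of_nodes(self):
        return self._n_nodes


def label_summary(graphs):
    """Return {label: (graph count, mean number of nodes)}."""
    sizes = defaultdict(list)
    for g in graphs:
        sizes[g.graph["label"]].append(g.number_of_nodes())
    # Labels that never occur are simply absent, so there is no division by zero
    return {label: (len(n), sum(n) / len(n)) for label, n in sorted(sizes.items())}


summary = label_summary([FakeGraph(1, 3), FakeGraph(1, 5), FakeGraph(2, 10)])
for label, (count, mean_nodes) in summary.items():
    print(label, count, mean_nodes)
```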

In [None]:
# Visualize
import matplotlib
import matplotlib.pyplot as plt
# matplotlib.use("qtAgg")

fig, axs = plt.subplots(4, 4, constrained_layout=True)
axs = axs.flatten()
k = 3  # index of the 16-graph page to display
# while k < len(data_check) // 16:
print(k)
for i in range(k * 16, 16 * (k + 1)):
    # Draw graph i of page k into the matching subplot
    nx.draw(data_check[i], with_labels=True, ax=axs[i - 16 * k], node_size=0.1)
plt.show()

In [None]:
# Check repeated
other_dataset, _ = load_TU_dataset(
    paths=[Path("../data/standard_train/PROTEINS_50/")], dataset_names=["PROTEINS_50"], 
    output_type="networkx")
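iso_count = []  # index of the checked graph, once per pair that also shares the same set of node attribute values
iso_conn = []   # (checked graph index, train graph index) pairs with isomorphic connectivity
for idxc, check_graph in enumerate(data_check):
    for idxt, other_data in enumerate(other_dataset):
        if nx.is_isomorphic(other_data, check_graph):
            # Compare attributes as sets of values, not node by node
            other_attrs = set(nx.get_node_attributes(other_data, "x").values())
            check_attrs = set(nx.get_node_attributes(check_graph, "x").values())
            iso_conn.append((idxc, idxt))
            if check_attrs == other_attrs:
                iso_count.append(idxc)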

In [None]:
len(iso_count)
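
Note that the loop above appends one entry per matching pair, so a graph that is isomorphic to several training graphs is recorded several times. If the goal is to count distinct graphs rather than pairs, a small deduplication step like the sketch below may help. The example lists are made-up placeholders, not results, and `repeated_indices` is a hypothetical helper.

```python
def repeated_indices(pairs):
    """Return the sorted, distinct first indices of (checked, train) pairs."""
    return sorted({idxc for idxc, _ in pairs})


# Placeholder data: graph 0 matches two training graphs, graph 3 matches one
iso_conn_example = [(0, 5), (0, 9), (3, 2)]
iso_count_example = [0, 0]

distinct_conn = repeated_indices(iso_conn_example)
distinct_attr = sorted(set(iso_count_example))
print(len(iso_conn_example), len(distinct_conn))
print(len(iso_count_example), len(distinct_attr))
```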

In [None]:
# Check connected 
not_conn_count = []
for idxc, check_graph in enumerate(data_check):
    if not nx.is_connected(check_graph):
        not_conn_count.append(idxc)

In [None]:
len(not_conn_count)

### Orca Evaluation

In [None]:
import pandas

Path("../orca/PROTEINS_50").mkdir(parents=True, exist_ok=True)
for i, graph in enumerate(data_check):
    pyg_graph = torch_geometric.utils.from_networkx(graph)
    pandas.DataFrame(
        pyg_graph.edge_index.T.numpy()[:, [1, 0]] + 1, 
        index=list(range(pyg_graph.num_edges)), 
        columns=[pyg_graph.num_nodes, pyg_graph.num_edges]
    ).to_csv("../orca/PROTEINS_50/PROTEINS_50_" + str(i), header=True, index=False)
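
A dependency-free way to write the same kind of file is sketched below. The layout (a header line with node and edge counts, then one edge per line) mirrors the cell above; the separator, the index offset, and whether each undirected edge should appear once are left as parameters because the conventions expected by the Orca build in use are not stated here. To my understanding, `from_networkx` on an undirected graph yields both directions of every edge, which is why the sketch collapses them; treat that as an assumption to verify. `write_edge_list` is a hypothetical helper.

```python
import io


def write_edge_list(edges, num_nodes, out, sep=" ", offset=0):
    """Write a header line 'n e' followed by one undirected edge per line."""
    # Collapse (u, v) and (v, u) into a single undirected edge
    unique = sorted({(min(u, v), max(u, v)) for u, v in edges})
    out.write(f"{num_nodes}{sep}{len(unique)}\n")
    for u, v in unique:
        out.write(f"{u + offset}{sep}{v + offset}\n")
    return len(unique)


buf = io.StringIO()
n_written = write_edge_list([(0, 1), (1, 0), (1, 2), (2, 1)], num_nodes=3, out=buf)
print(buf.getvalue(), end="")
```

Passing `offset=1` gives 1-based node IDs, matching the `+ 1` in the cell above.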


In [None]:
import discrete_diffusion

In [None]:
discrete_diffusion.evaluation.stats.eval_graph_list

## Findings of the day

### 6/9/22
1) test and val contain graphs whose connectivity overlaps with train (11, 12), but whose attributes do not.

2) test and val contain graphs whose connectivity overlaps with conn_augm_gdss (4, 1), but whose attributes do not.

3) test and val contain graphs whose connectivity overlaps with conn_augm_edp-gnn (8, 9), but whose attributes do not.

4) test has 48 graphs with label 1 and 39 with label 2. val has 46 graphs with label 1 and 40 with label 2.

### 7/9/22 (analysis of the generated datasets)
1) disconnected graphs: 108 for edp-gnn, 173 for gdss

2) repeated graphs: for gdss, 443 of 1152 repeat connectivity and 0 of 1152 also repeat attributes; for edp-gnn, 366 of 897 repeat connectivity and 0 repeat both

3) orca evaluation: 