## Downloading Wico Dataset Uploaded On Kaggle

In [None]:
!pip install -q kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download -d manaspp/wico-graph
! unzip wico-graph

# Dataset

Wico Graph Dataset : https://datasets.simula.no/wico-graph/
<br/>DOI : 10.1145/3472720.3483617<br/>
<br />
In the wake of the COVID-19 pandemic, a surge of misinformation has flooded social media and other internet channels, and some of it has the potential to cause real-world harm. To counteract this misinformation, reliably identifying it is a principal problem to be solved.
 `However, the identification of misinformation poses a formidable challenge for language processing systems since the texts containing misinformation are short, work with insinuation rather than explicitly stating a false claim, or resemble other postings that deal with the same topic ironically.` 
 Accordingly, to develop better detection systems, it is essential not only to use hand-labeled ground truth data but also to extend the analysis with methods beyond Natural Language Processing that consider the relationships between participants and the diffusion of misinformation. 

---

## Importing Necessary Libraries

In [None]:
import random
import networkx as nx
import pandas as pd
import os
import statistics

def list_files(dir):
    r = []
    for root, dirs, files in os.walk(dir):
        for name in files:
            r.append(os.path.join(root, name))
    return r

## Installing the StellarGraph Library

<img src="https://raw.githubusercontent.com/stellargraph/stellargraph/develop/stellar-graph-banner.png">

---

- `The StellarGraph library offers state-of-the-art algorithms for graph machine learning, making it easy to discover patterns and answer questions about graph-structured data.` 

- It can solve many machine learning tasks:
 - Representation learning for nodes and edges, to be used for visualisation and various downstream machine learning tasks
 - Classification and attribute inference of nodes or edges
 - Classification of whole graphs
 - Link prediction


- Graph-structured data represent entities as nodes (or vertices) and relationships between them as edges (or links), and can include data associated with either as attributes. 
- For example, a graph can contain people as nodes and friendships between them as links, with data like a person's age and the date a friendship was established. 
- StellarGraph supports analysis of many kinds of graphs:

 - homogeneous (with nodes and links of one type)

 - heterogeneous (with more than one type of nodes and/or links)

 - knowledge graphs (extreme heterogeneous graphs with thousands of types of edges)

 - graphs with or without data associated with nodes

 - graphs with edge weights

In [None]:
! pip install stellargraph

In [None]:
import stellargraph as sg
from stellargraph import StellarGraph
import tensorflow as tf

## Extracting Graph Data With Labels For Classification
---

### Homogeneous graph with features

For many real-world problems, we have more than just graph structure: we also have information about the nodes and edges. In our case, the graph consists of users (nodes) and the paths along which information travels between them (edges), and each node carries features such as `friends` and `followers`. 
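
As a minimal sketch of the two tables involved, the snippet below builds a node-feature table indexed by node id and an edge table with `source` and `target` columns, using made-up values. The toy ids and counts are hypothetical; only the column names (`id`, `friends`, `followers`, `source`, `target`) come from this notebook. Passing the two frames to `StellarGraph(node_data, edges)`, as the notebook does, is left out so the sketch runs with pandas alone.

```python
import pandas as pd

# Hypothetical toy data: three users and two edges between them.
nodes = pd.DataFrame(
    {"id": [101, 102, 103], "friends": [250, 40, 1300], "followers": [90, 15, 8000]}
)
edges = pd.DataFrame({"source": [101, 101], "target": [102, 103]})

# Index the feature table by node id; the remaining columns are the node features.
node_data = nodes.drop_duplicates("id").set_index("id")[["friends", "followers"]]

# Sanity check: every edge endpoint should appear in the node index.
missing = set(edges["source"]).union(edges["target"]) - set(node_data.index)
print(node_data.shape, len(edges), missing)
```

Checking for missing endpoints before constructing a graph can catch mismatched node and edge files early.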

In [None]:
rumor = []
labels = []
temp = []

for i in range(1,413):
    temp.append(list_files(f'5G_Conspiracy_Graphs/{i}/'))

for i in range(len(temp)):
  node_features = {}
  for j in range(len(temp[i])):
    if 'edges' in temp[i][j]:
      break
  edges = pd.read_csv(temp[i][j],sep=" ",names=['source','target'])
  labels.append(1)

  for j in range(len(temp[i])):
    if 'nodes' in temp[i][j]:
      break
  f = pd.read_csv(temp[i][j],usecols = [0,2,3])
  for k in range(f.shape[0]):
    node_features[f.iloc[k]['id']] = [f.iloc[k]['friends'],f.iloc[k]['followers']]

  node_data = pd.DataFrame(
    {"friends": [i[0] for i in list(node_features.values())], "followers": [i[1] for i in list(node_features.values())]}, index=list(node_features.keys()))
  rumor.append(StellarGraph(node_data, edges))


temp = []

for i in range(1,598):
    temp.append(list_files(f'Other_Graphs/{i}/'))

for i in range(len(temp)):
  node_features = {}
  for j in range(len(temp[i])):
    if 'edges' in temp[i][j]:
      break
  edges = pd.read_csv(temp[i][j],sep=" ",names=['source','target'])
  labels.append(1)

  for j in range(len(temp[i])):
    if 'nodes' in temp[i][j]:
      break
  f = pd.read_csv(temp[i][j],usecols = [0,2,3])
  for k in range(f.shape[0]):
    node_features[f.iloc[k]['id']] = [f.iloc[k]['friends'],f.iloc[k]['followers']]

  node_data = pd.DataFrame(
    {"friends": [i[0] for i in list(node_features.values())], "followers": [i[1] for i in list(node_features.values())]}, index=list(node_features.keys()))
  rumor.append(StellarGraph(node_data, edges))

temp = []

for i in range(1,2502):
    temp.append(list_files(f'Non_Conspiracy_Graphs/{i}/'))

for i in range(len(temp)):
  node_features = {}
  for j in range(len(temp[i])):
    if 'edges' in temp[i][j]:
      break
  edges = pd.read_csv(temp[i][j],sep=" ",names=['source','target'])
  labels.append(0)
  
  for j in range(len(temp[i])):
    if 'nodes' in temp[i][j]:
      break
  f = pd.read_csv(temp[i][j],usecols = [0,2,3])
  for k in range(f.shape[0]):
    node_features[f.iloc[k]['id']] = [f.iloc[k]['friends'],f.iloc[k]['followers']] 

  node_data = pd.DataFrame(
    {"friends": [i[0] for i in list(node_features.values())], "followers": [i[1] for i in list(node_features.values())]}, index=list(node_features.keys()))
  rumor.append(StellarGraph(node_data, edges)) 

The `rumor` list contains many `StellarGraph` instances, each of which has two node features (`friends` and `followers`):

In [None]:
print(rumor[0].info())

StellarGraph: Undirected multigraph
 Nodes: 89, Edges: 42

 Node types:
  default: [89]
    Features: float32 vector, length 2
    Edge types: default-default->default

 Edge types:
    default-default->default: [42]
        Weights: all 1 (default)
        Features: none


### Summary statistics of the sizes of the graphs:

In [None]:
summary = pd.DataFrame(
    [(g.number_of_nodes(), g.number_of_edges()) for g in rumor],
    columns=["nodes", "edges"],
)
summary.describe().round(1)

| | nodes | edges |
|---|---|---|
| count | 3510.0 | 3510.0 |
| mean | 61.2 | 144.1 |
| std | 34.7 | 268.8 |
| min | 1.0 | 0.0 |
| 25% | 24.0 | 28.0 |
| 50% | 82.5 | 77.0 |
| 75% | 92.0 | 162.0 |
| max | 101.0 | 4706.0 |


The labels are `1` or `0`:

In [None]:
labels = pd.Series(labels)
labels.value_counts().to_frame()

| | 0 |
|---|---|
| 0 | 2501 |
| 1 | 1009 |
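
Because the two classes are imbalanced, it can help to keep a trivial reference point in mind when judging a classifier. The sketch below, which assumes only the class counts printed above, computes the accuracy obtained by always predicting the most frequent label.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 2501, 1: 1009}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_acc = counts[majority_class] / total
print(f"majority class: {majority_class}, baseline accuracy: {baseline_acc:.1%}")
# prints: majority class: 0, baseline accuracy: 71.3%
```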


## Building The Graph Neural Network

In [None]:
from stellargraph.mapper import PaddedGraphGenerator
from stellargraph.layer import GCNSupervisedGraphClassification

from sklearn import model_selection

from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.callbacks import EarlyStopping

### Prepare graph generator

To feed data to the `tf.keras` model that we will create later, we need a data generator. For supervised graph classification, we create an instance of `StellarGraph`'s `PaddedGraphGenerator` class. Its `graphs` argument takes a list of `StellarGraph` objects, here the `rumor` list built above.

In [None]:
generator = PaddedGraphGenerator(graphs=rumor)
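
The graphs in this dataset differ in size, so they have to be brought to a common shape before they can be batched. The NumPy sketch below illustrates the general idea of padding: each graph's feature and adjacency matrices are zero-padded to the largest node count in the batch, and a boolean mask records which rows are real nodes. This is a conceptual illustration under those assumptions, not StellarGraph's internal code, and `pad_batch` is a hypothetical helper.

```python
import numpy as np

def pad_batch(adjacencies, features):
    """Zero-pad each graph to the largest node count in the batch."""
    max_nodes = max(a.shape[0] for a in adjacencies)
    n_feat = features[0].shape[1]
    batch = len(adjacencies)

    adj_out = np.zeros((batch, max_nodes, max_nodes), dtype=np.float32)
    feat_out = np.zeros((batch, max_nodes, n_feat), dtype=np.float32)
    mask = np.zeros((batch, max_nodes), dtype=bool)

    for b, (a, x) in enumerate(zip(adjacencies, features)):
        n = a.shape[0]
        adj_out[b, :n, :n] = a   # real adjacency in the top-left corner
        feat_out[b, :n, :] = x   # real node features in the first n rows
        mask[b, :n] = True       # True marks real (non-padding) nodes
    return feat_out, mask, adj_out

# Two toy graphs with 2 and 3 nodes, each node having 2 features.
adjs = [np.array([[0, 1], [1, 0]]), np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])]
feats = [np.ones((2, 2)), np.ones((3, 2))]

x, mask, a = pad_batch(adjs, feats)
print(x.shape, mask.shape, a.shape)  # (2, 3, 2) (2, 3) (2, 3, 3)
```

The mask lets later layers, such as a pooling step, ignore the padded rows.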

### Create the Keras graph classification model

We are now ready to create a `tf.keras` graph classification model using `StellarGraph`'s `GCNSupervisedGraphClassification` class together with standard `tf.keras` layers, e.g., `Dense`. 

The input is the graph, represented by its adjacency and node feature matrices. The first two layers are graph convolutional layers as in [2], each with 64 units and `relu` activations. The next layer is a mean pooling layer, where the learned node representations are summarized to create a graph representation. The graph representation is input to two fully connected layers with 32 and 16 units respectively and `relu` activations. The last layer is the output layer with a single unit and `sigmoid` activation.
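
To make the first part of this description concrete, here is a minimal NumPy sketch of a forward pass through two graph convolutional layers followed by mean pooling. It assumes the symmetric normalisation with self-loops that is commonly used for GCNs, random untrained weights, and a toy 3-node path graph. It illustrates the idea only and is not StellarGraph's implementation; the function names are hypothetical.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, features, weights):
    """One graph convolution: aggregate neighbours, project, apply relu."""
    return np.maximum(a_norm @ features @ weights, 0.0)

rng = np.random.default_rng(0)

# Toy path graph 0 - 1 - 2 with two features per node (e.g. friends, followers).
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
features = rng.normal(size=(3, 2))

# Random (untrained) weights for two layers with 64 units each.
w1 = rng.normal(size=(2, 64))
w2 = rng.normal(size=(64, 64))

a_norm = normalize_adjacency(adj)
h = gcn_layer(a_norm, gcn_layer(a_norm, features, w1), w2)

# Mean pooling: average the node representations into one graph vector.
graph_embedding = h.mean(axis=0)
print(h.shape, graph_embedding.shape)  # (3, 64) (64,)
```

The pooled vector is what the fully connected layers would receive as input.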

<img src="http://drive.google.com/uc?export=view&id=11ztoQtOc5la47yedafZ5jkqNq2eRL4Cg">

In [None]:
def create_graph_classification_model(generator):
    gc_model = GCNSupervisedGraphClassification(
        layer_sizes=[64, 64],
        activations=["relu", "relu"],
        generator=generator,
        dropout=0.5,
    )
    x_inp, x_out = gc_model.in_out_tensors()
    predictions = Dense(units=32, activation="relu")(x_out)
    predictions = Dense(units=16, activation="relu")(predictions)
    predictions = Dense(units=1, activation="sigmoid")(predictions)

    # Let's create the Keras model and prepare it for training
    model = Model(inputs=x_inp, outputs=predictions)
    model.compile(optimizer=Adam(0.005), loss=binary_crossentropy, metrics=["acc"])

    return model

## Training the Model


We can now train the model using the model's `fit` method. First, we specify some important training parameters: the maximum number of training epochs, the number of folds for cross-validation, and the number of times to repeat cross-validation.

In [None]:
epochs = 200  # maximum number of training epochs
folds = 10  # the number of folds for k-fold cross validation
n_repeats = 5  # the number of repeats for repeated k-fold cross validation

es = EarlyStopping(
    monitor="val_loss", min_delta=0, patience=25, restore_best_weights=True
)

The function `train_fold` trains a graph classification model on a single fold of the data.

In [None]:
def train_fold(model, train_gen, test_gen, es, epochs):
    history = model.fit(
        train_gen, epochs=epochs, validation_data=test_gen, verbose=0, callbacks=[es],
    )
    # calculate performance on the test data and return along with history
    test_metrics = model.evaluate(test_gen, verbose=0)
    test_acc = test_metrics[model.metrics_names.index("acc")]

    return history, test_acc

In [None]:
def get_generators(train_index, test_index, graph_labels, batch_size):
    train_gen = generator.flow(
        train_index, targets=graph_labels.iloc[train_index].values, batch_size=batch_size
    )
    test_gen = generator.flow(
        test_index, targets=graph_labels.iloc[test_index].values, batch_size=batch_size
    )

    return train_gen, test_gen

The code below puts all the above functionality together in a training loop for repeated k-fold cross-validation where the number of folds is 10, `folds=10`; that is we do 10-fold cross validation `n_repeats` times where `n_repeats=5`.

**Note**: The code below may take a long time to run, depending on the value of `n_repeats`: each repeat trains and evaluates 10 graph classification models, one for each fold of the data. For progress updates, we recommend setting `verbose=2` in the call to the `fit` method inside `train_fold`.
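
As a quick check of what the splitter produces, the sketch below runs scikit-learn's `RepeatedStratifiedKFold` on a label vector with the same class counts as this dataset and inspects the resulting splits. The synthetic label array and `random_state=0` are assumptions added for reproducibility; they are not part of the notebook's training loop.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic labels with the same class counts as the dataset (assumption).
y = np.array([0] * 2501 + [1] * 1009)

rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
splits = list(rskf.split(y, y))

# Number of (train, test) pairs: n_splits * n_repeats.
print(len(splits))

# Stratification keeps the number of positives per test fold nearly constant.
pos_per_test_fold = {int(y[test_idx].sum()) for _, test_idx in splits}
print(sorted(pos_per_test_fold))
```

Each of the 50 splits corresponds to one model trained and evaluated in the loop.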

In [None]:
test_accs = []

stratified_folds = model_selection.RepeatedStratifiedKFold(
    n_splits=folds, n_repeats=n_repeats
).split(labels, labels)

for i, (train_index, test_index) in enumerate(stratified_folds):
    print(f"Training and evaluating on fold {i+1} out of {folds * n_repeats}...")
    train_gen, test_gen = get_generators(
        train_index, test_index, labels, batch_size=30
    )

    model = create_graph_classification_model(generator)

    history, acc = train_fold(model, train_gen, test_gen, es, epochs)

    test_accs.append(acc)

Training and evaluating on fold 1 out of 50...
Training and evaluating on fold 2 out of 50...
Training and evaluating on fold 3 out of 50...
Training and evaluating on fold 4 out of 50...
Training and evaluating on fold 5 out of 50...
Training and evaluating on fold 6 out of 50...
Training and evaluating on fold 7 out of 50...
Training and evaluating on fold 8 out of 50...
Training and evaluating on fold 9 out of 50...
Training and evaluating on fold 10 out of 50...
Training and evaluating on fold 11 out of 50...
Training and evaluating on fold 12 out of 50...
Training and evaluating on fold 13 out of 50...
Training and evaluating on fold 14 out of 50...
Training and evaluating on fold 15 out of 50...
Training and evaluating on fold 16 out of 50...
Training and evaluating on fold 17 out of 50...
Training and evaluating on fold 18 out of 50...
Training and evaluating on fold 19 out of 50...
Training and evaluating on fold 20 out of 50...
Training and evaluating on fold 21 out of 50...
T

In [None]:
import numpy as np
print(f"Accuracy over all folds mean: {np.mean(test_accs)*100:.3}% and std: {np.std(test_accs)*100:.2}%")

Accuracy over all folds mean: 71.3% and std: 0.085%
