# Feature-based methods

In this notebook we will explore a very naive (yet powerful) approach to graph-based supervised machine learning. The idea relies on the classic machine learning practice of handcrafted feature extraction.

In Chapter 1 you learned how local and global properties can be extracted from graphs. Those properties describe the graph itself and carry important information that can be useful for classification.

In this demo, we will use the PROTEINS dataset, which is already integrated in StellarGraph.

In [None]:
from stellargraph import datasets
from IPython.display import display, HTML

datasets.PROTEINS.url = 'https://www.chrsmrrs.com/graphkerneldatasets/PROTEINS.zip'

dataset = datasets.PROTEINS()
display(HTML(dataset.description))
graphs, graph_labels = dataset.load()

To compute the graph metrics, one way is to retrieve the adjacency matrix representation of each graph.

In [None]:
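# convert graphs from StellarGraph format to dense NumPy adjacency matrices
# (.toarray() replaces the .A shorthand, which recent SciPy releases no longer provide)
adjs = [graph.to_adjacency_matrix().toarray() for graph in graphs]
# convert labels from pandas.Series to a NumPy array
labels = graph_labels.to_numpy(dtype=int)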

In [None]:
import numpy as np
import networkx as nx

metrics = []
for adj in adjs:
  G = nx.from_numpy_array(adj)  # from_numpy_matrix was removed in NetworkX 3.0
  # basic properties
  num_edges = G.number_of_edges()
  # clustering measures
  cc = nx.average_clustering(G)
  # measure of efficiency
  eff = nx.global_efficiency(G)

  metrics.append([num_edges, cc, eff])
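
To get a feel for what these three features measure, the sketch below computes them on two tiny hand-built graphs. The helper name `graph_features` and the toy adjacency matrices are illustrative assumptions, not part of the original notebook; the NetworkX calls are the same ones used in the loop above.

```python
import networkx as nx
import numpy as np


def graph_features(adj):
    """Return [number of edges, average clustering, global efficiency].

    Hypothetical helper: it mirrors the body of the loop above for one graph.
    """
    G = nx.from_numpy_array(np.asarray(adj))
    return [G.number_of_edges(), nx.average_clustering(G), nx.global_efficiency(G)]


# Complete graph on 4 nodes: every pair of nodes is directly connected.
k4 = np.ones((4, 4)) - np.eye(4)

# Path graph on 3 nodes (0 - 1 - 2): no triangles, one pair at distance 2.
p3 = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]])

k4_features = graph_features(k4)
p3_features = graph_features(p3)
print(k4_features)
print(p3_features)
```

The complete graph has 6 edges, and both its average clustering and its global efficiency equal 1, since every neighbourhood is fully connected and every pair of nodes is at distance 1. The path has 2 edges, zero clustering, and a global efficiency of 5/6, the average of the inverse distances 1, 1 and 1/2.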



We can now use scikit-learn utilities to create a training set and a test set. In our experiments, we will use 70% of the dataset as the training set and the remaining 30% as the test set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(metrics, labels, test_size=0.3, random_state=42)

As is common in many machine learning workflows, we preprocess the features to have zero mean and unit standard deviation. The scaler is fitted on the training set only and then applied to both sets.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
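
As a quick check of what the scaler does, the following sketch reproduces the transformation by hand on synthetic data. The arrays `X_train_demo` and `X_test_demo` are made-up stand-ins for the three graph features (they are not taken from PROTEINS), and the `_demo` names are only used here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for [num_edges, cc, eff]; the values are hypothetical.
X_train_demo = rng.normal(loc=[50.0, 0.5, 0.3], scale=[10.0, 0.1, 0.05], size=(20, 3))
X_test_demo = rng.normal(loc=[50.0, 0.5, 0.3], scale=[10.0, 0.1, 0.05], size=(5, 3))

# Fit on the training split only, as in the cell above.
demo_scaler = StandardScaler().fit(X_train_demo)

# Manual equivalent: subtract the training mean, divide by the training std.
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual_test = (X_test_demo - mu) / sigma

train_scaled_demo = demo_scaler.transform(X_train_demo)
print(np.allclose(demo_scaler.transform(X_test_demo), manual_test))
print(np.allclose(train_scaled_demo.mean(axis=0), 0.0))
print(np.allclose(train_scaled_demo.std(axis=0), 1.0))
```

Only the training split is guaranteed to end up with exactly zero mean and unit standard deviation. The test split is shifted and scaled with the training statistics, so that no information from the test set leaks into preprocessing.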

It is now time to train a classifier. We chose a support vector machine for this task.

In [None]:
from sklearn import svm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

clf = svm.SVC()
clf.fit(X_train_scaled, y_train)

y_pred = clf.predict(X_test_scaled)

print('Accuracy', accuracy_score(y_test,y_pred))
print('Precision', precision_score(y_test,y_pred))
print('Recall', recall_score(y_test,y_pred))
print('F1-score', f1_score(y_test,y_pred))
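
For reference, the four scores can be derived by hand from the confusion counts. The sketch below does this on a small made-up pair of label vectors (the `_demo` arrays are hypothetical, with 1 as the positive class, which is scikit-learn's default `pos_label`) and compares the results with the scikit-learn functions used above.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions; 1 is treated as the positive class.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, accuracy_score(y_true_demo, y_pred_demo))
print(precision, precision_score(y_true_demo, y_pred_demo))
print(recall, recall_score(y_true_demo, y_pred_demo))
print(f1, f1_score(y_true_demo, y_pred_demo))
```

Here TP = 2, FP = 1, FN = 2 and TN = 3, so accuracy is 5/8, precision is 2/3, recall is 1/2 and the F1-score is 4/7.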