# CORA dataset 03 - graph classification

### Feature selection
We can now use the citation links, since they are modelled as a graph. This data model lets us extract useful structural information about the data.

We create spatial features ("embeddings") for each data point based on its position in the graph. These features can then be used for classification.

We create the embeddings using the [FastRP](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/fastrp/) algorithm.

Both FastRP and the downstream classification are statistical techniques and are analogous to simple NLP techniques. We don't touch on GNNs here (analogous to transformers for text, see [here](https://graphdeeplearning.github.io/post/transformers-are-gnns/) for interesting commentary).

## 0. Preliminaries

Run the [01 - EDA](01%20-%20EDA.ipynb) notebook first to import the dataset.

## 1. Project graph natively into memory

In [1]:
from getpass import getpass
from graphdatascience import GraphDataScience
auth = ("neo4j", getpass("Password:"))
bolt = "bolt://localhost:7687/neo4j"
gds = GraphDataScience(bolt, auth=auth)

Password: ········


In [2]:
G, res = gds.graph.project(
   "cora-graph",
   {"Paper": {"properties": ["subjectClass"]} },
   {"CITES": {"orientation": "UNDIRECTED", "aggregation": "SINGLE"}}
)


In [5]:
G.memory_usage()

'901 KiB'
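
To make the projection settings above more concrete, here is a minimal plain-Python sketch of what `"orientation": "UNDIRECTED"` combined with `"aggregation": "SINGLE"` roughly amounts to: direction is dropped and repeated links between the same pair of papers collapse into one. The helper `project_undirected_single` and the toy edge list are hypothetical, and this is only an illustration of the idea, not how GDS stores the projection internally.

```python
def project_undirected_single(directed_edges):
    """Collapse directed (source, target) pairs into unique undirected edges."""
    undirected = set()
    for source, target in directed_edges:
        # Sorting the pair makes (A, B) and (B, A) the same undirected edge.
        undirected.add(tuple(sorted((source, target))))
    return sorted(undirected)


# Hypothetical citations: A and B cite each other, and A->B appears twice.
citations = [("A", "B"), ("B", "A"), ("A", "B"), ("B", "C")]
print(project_undirected_single(citations))
```

Treating citations as undirected means a paper's embedding is influenced both by the papers it cites and by the papers that cite it.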

## 2. Create node embedding vectors
We use an embedding dimension of 128.

In [7]:
result = gds.fastRP.mutate(
   G,
   featureProperties=None,
   embeddingDimension=128,
   iterationWeights=[0, 0, 1.0, 1.0],
   normalizationStrength=0.05,
   mutateProperty="fastRP_Extended_Embedding"
)
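
To give some intuition for the parameters in the call above, here is a simplified NumPy sketch of the FastRP idea: each node starts from a sparse random vector (scaled by its degree raised to `normalizationStrength`), vectors are repeatedly averaged over neighbours, and the normalised result of each iteration is added to the final embedding with the corresponding entry of `iterationWeights`. The function `fastrp_sketch` and the toy adjacency matrix are hypothetical, and the sketch leaves out details of the GDS implementation.

```python
import numpy as np


def fastrp_sketch(adjacency, dim, iteration_weights, normalization_strength=0.0, seed=0):
    """Simplified FastRP-style embedding for a dense adjacency matrix."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    degree = np.maximum(adjacency.sum(axis=1), 1.0)

    # Very sparse random projection: entries are -sqrt(3), 0 or +sqrt(3).
    scale = np.sqrt(3.0)
    random_vectors = rng.choice([-scale, 0.0, scale], size=(n, dim), p=[1 / 6, 2 / 3, 1 / 6])
    # Scale each node's initial vector by degree ** normalization_strength.
    random_vectors *= (degree ** normalization_strength)[:, None]

    # Row-normalised adjacency: multiplying by it averages over neighbours.
    averaging = adjacency / degree[:, None]

    embedding = np.zeros((n, dim))
    current = random_vectors
    for weight in iteration_weights:
        current = averaging @ current
        norms = np.linalg.norm(current, axis=1, keepdims=True)
        # Add the L2-normalised intermediate embedding, weighted per iteration.
        embedding += weight * current / np.maximum(norms, 1e-12)
    return embedding


# Toy graph: two triangles joined by a single edge (nodes 2 and 3).
adjacency = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

emb = fastrp_sketch(adjacency, dim=8, iteration_weights=[0, 0, 1.0, 1.0],
                    normalization_strength=0.05)
print(emb.shape)
```

With `iteration_weights=[0, 0, 1.0, 1.0]`, the first two propagation steps contribute nothing to the final vector, and the third and fourth steps contribute equally.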

## 3. Train a Random Forest classifier

**Note** GDS offers complete ML pipelines for data splitting, training classification models, hyperparameter tuning and inference. This is advantageous because the pipeline acts directly on the embeddings stored in the in-memory graph, reducing the need to move data. Here, however, we export the embeddings and use `scikit-learn` for familiarity.

In [86]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from hpsklearn import HyperoptEstimator, random_forest_classifier

In [47]:
X = gds.graph.streamNodeProperty(G, 'fastRP_Extended_Embedding')["propertyValue"].apply(pd.Series)
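
The cell above expands a column of embedding lists into one column per dimension. As a small illustration of that step, the sketch below does the same on a hypothetical three-row frame (`streamed` is a stand-in for the streamed result, not real data). It builds the frame directly from the list of lists, which is usually faster than `.apply(pd.Series)` on larger inputs.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the streamed result: one row per node,
# with the embedding stored as a list in "propertyValue".
streamed = pd.DataFrame({
    "nodeId": [10, 11, 12],
    "propertyValue": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

# Build the feature matrix in one step from the list of lists.
X_demo = pd.DataFrame(streamed["propertyValue"].tolist(), index=streamed.index)

# Same values as the row-by-row expansion used in the notebook.
X_slow = streamed["propertyValue"].apply(pd.Series)
print(X_demo.shape, np.allclose(X_demo.to_numpy(), X_slow.to_numpy()))
```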

In [49]:
y = gds.graph.streamNodeProperty(G, 'subjectClass')["propertyValue"]

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [67]:

clf = HyperoptEstimator(classifier=random_forest_classifier("myclf"), trial_timeout=10, seed=1)

In [68]:
clf.fit(X_train, y_train)

100%|█████████████| 1/1 [00:02<00:00,  2.80s/trial, best loss: 0.6953316953316953]
100%|█████████████| 2/2 [00:11<00:00, 11.60s/trial, best loss: 0.6953316953316953]
100%|█████████████| 3/3 [00:11<00:00, 11.50s/trial, best loss: 0.6953316953316953]
100%|████████████| 4/4 [00:03<00:00,  3.22s/trial, best loss: 0.18918918918918914]
100%|████████████| 5/5 [00:07<00:00,  7.66s/trial, best loss: 0.18918918918918914]
100%|████████████| 6/6 [00:03<00:00,  3.90s/trial, best loss: 0.18918918918918914]
100%|████████████| 7/7 [00:11<00:00, 11.61s/trial, best loss: 0.18918918918918914]
100%|████████████| 8/8 [00:09<00:00,  9.65s/trial, best loss: 0.18918918918918914]
100%|████████████| 9/9 [00:11<00:00, 11.60s/trial, best loss: 0.18918918918918914]
100%|██████████| 10/10 [00:06<00:00,  6.83s/trial, best loss: 0.18918918918918914]


In [69]:
clf.score(X_test, y_test)

0.8301329394387001
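
Assuming `score` reports mean accuracy here, as it does for scikit-learn classifiers, a single number can hide weak performance on smaller classes. The following plain-Python sketch shows accuracy next to per-class recall. The helper functions and the toy labels are hypothetical and are not taken from the Cora results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its items predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical labels: class "a" is rare and half of it is misclassified.
y_true_demo = ["a", "a", "b", "b", "b"]
y_pred_demo = ["a", "b", "b", "b", "b"]

print(accuracy(y_true_demo, y_pred_demo))          # 0.8
print(per_class_recall(y_true_demo, y_pred_demo))  # {'a': 0.5, 'b': 1.0}
```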

## 4. Extension: ensemble model

### Ignore: not useful

Combine the graph embeddings (size 128) with the text features (size 1433, reduced to 128 components with PCA).

In [72]:
df = pd.DataFrame.from_dict(gds.run_cypher("""
MATCH (n) WHERE n.features IS NOT NULL
RETURN DISTINCT n.paper_Id as id, n.subjectClass AS class, n.features AS features
"""))

In [73]:
X_t = df["features"].apply(pd.Series).to_numpy()
y_t = df["class"].to_numpy()

In [105]:
from sklearn.decomposition import PCA
X_t_PCA = PCA(128).fit_transform(X_t)

In [78]:
assert np.all(y_t == y.to_numpy())

In [106]:
X_new = np.concatenate((X, X_t_PCA), axis=1)

In [107]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y, random_state=0)
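
In the cells above, PCA is fitted on all rows before the train/test split. A variant that fits the projection on the training rows only and then applies it to both parts can be sketched with plain NumPy as follows. The helpers `fit_pca` and `apply_pca`, the random stand-in data and the fixed index split are all hypothetical and serve only to show the order of operations.

```python
import numpy as np


def fit_pca(X_fit, n_components):
    """Return the mean and top principal directions of X_fit."""
    mean = X_fit.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_fit - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(X, mean, components):
    """Project X onto the fitted principal directions."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
emb_demo = rng.normal(size=(20, 4))                           # stand-in for graph embeddings
text_demo = rng.integers(0, 2, size=(20, 10)).astype(float)   # stand-in for binary text features

train_idx, test_idx = np.arange(15), np.arange(15, 20)

# Fit on training rows only, then transform both parts with the same projection.
mean, components = fit_pca(text_demo[train_idx], n_components=3)
text_train = apply_pca(text_demo[train_idx], mean, components)
text_test = apply_pca(text_demo[test_idx], mean, components)

X_train_demo = np.hstack([emb_demo[train_idx], text_train])
X_test_demo = np.hstack([emb_demo[test_idx], text_test])
print(X_train_demo.shape, X_test_demo.shape)
```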

In [110]:
clf = HyperoptEstimator(classifier=random_forest_classifier("myclf"), trial_timeout=10, n_jobs=4, seed=1)

In [111]:
clf.fit(X_train, y_train)

100%|████████████| 1/1 [00:04<00:00,  4.24s/trial, best loss: 0.46928746928746934]
100%|████████████| 2/2 [00:11<00:00, 11.70s/trial, best loss: 0.46928746928746934]
100%|████████████| 3/3 [00:11<00:00, 11.89s/trial, best loss: 0.46928746928746934]
100%|█████████████| 4/4 [00:08<00:00,  8.80s/trial, best loss: 0.2235872235872236]
100%|█████████████| 5/5 [00:03<00:00,  3.76s/trial, best loss: 0.2235872235872236]
100%|█████████████| 6/6 [00:01<00:00,  1.97s/trial, best loss: 0.2235872235872236]
100%|█████████████| 7/7 [00:03<00:00,  3.40s/trial, best loss: 0.2235872235872236]
100%|█████████████| 8/8 [00:11<00:00, 11.58s/trial, best loss: 0.2235872235872236]
100%|█████████████| 9/9 [00:02<00:00,  2.15s/trial, best loss: 0.2235872235872236]
100%|███████████| 10/10 [00:01<00:00,  1.92s/trial, best loss: 0.2235872235872236]


In [112]:
clf.score(X_test, y_test)

0.7813884785819794