# Cora dataset 02 - tabular classification

### Feature selection

In a traditional ML system, each sample is classified from its own features, independently of the other samples. A paper's citations do not fit easily into this tabular format.

We could construct aggregate numerical features, such as the number of papers in each category that a paper cites. However, this loses information about the structure of the graph. It also makes it hard to split the data randomly into train and test sets without data leakage, because each paper's features would depend on other papers, some of which land in the other split.
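As a rough illustration of the aggregate features described above, the sketch below counts, for each paper, how many of the papers it cites fall in each category. The `edges` and `labels` objects and their values are hypothetical toy data, not an export from the Cora database.

```python
import pandas as pd

# Toy data standing in for the Cora graph (hypothetical ids and labels).
labels = pd.Series(
    {1: "Theory", 2: "Neural_Networks", 3: "Theory", 4: "Neural_Networks"},
    name="cited_class",
)
# Each row is one citation: `source` cites `target`.
edges = pd.DataFrame({"source": [1, 1, 1, 2, 4], "target": [2, 3, 4, 3, 1]})

# Attach the class of the cited paper to every citation.
edges_with_class = edges.join(labels, on="target")

# Count, per citing paper, how many cited papers fall in each class.
# Papers that cite nothing are added back as rows of zeros.
citation_counts = pd.crosstab(
    edges_with_class["source"], edges_with_class["cited_class"]
).reindex(labels.index, fill_value=0)

print(citation_counts)
```

Every row here is computed from the labels of other papers, so a random train-test split would let test labels influence training features. That is the leakage risk mentioned above.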

Therefore we will use only each paper's text features to classify the Cora dataset.

## 0. Preliminaries

Run the [01 - EDA](01%20-%20EDA.ipynb) notebook first to import the dataset.

## 1. Export nodes from DB

In [1]:
from getpass import getpass
from graphdatascience import GraphDataScience
auth = ("neo4j", getpass("Password:"))
bolt = "bolt://localhost:7687/neo4j"
gds = GraphDataScience(bolt, auth=auth)

Password: ········


In [2]:
import pandas as pd

In [4]:
# run_cypher already returns a pandas DataFrame, one row per matched node
df = gds.run_cypher("""
MATCH (n) WHERE n.features IS NOT NULL
RETURN DISTINCT n.paper_Id AS id, n.subjectClass AS class, n.features AS features
""")

## 2. Train Random Forest classifier

In [13]:
X = df["features"].apply(pd.Series).to_numpy()
y = df["class"].to_numpy()
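The conversion above turns a column of per-paper lists into a 2-D feature matrix. The sketch below shows the same step on a toy frame and compares `apply(pd.Series)` with the usually faster `np.array(col.tolist())`. The toy values and the `toy`, `X_a`, `X_b`, and `y_toy` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the exported columns (values are made up).
toy = pd.DataFrame(
    {
        "id": [10, 11, 12],
        "class": ["Theory", "Case_Based", "Theory"],
        "features": [[0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1]],
    }
)

# Option A: expand each list into its own columns, then take the array.
X_a = toy["features"].apply(pd.Series).to_numpy()

# Option B: stack the lists directly; rows are papers, columns are words.
X_b = np.array(toy["features"].tolist())

y_toy = toy["class"].to_numpy()

print(X_a.shape, X_b.shape, y_toy.shape)
print(np.array_equal(X_a, X_b))
```

Both options should give the same matrix when every list has the same length. Option B avoids building one `pd.Series` per row.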

In [17]:
import numpy as np
from sklearn.model_selection import train_test_split
from hpsklearn import HyperoptEstimator, random_forest_classifier

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
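With no `test_size` given, `train_test_split` holds out 25% of the samples. If the classes are imbalanced, one possible variation is to pass `stratify` so both partitions keep the same class ratio. The sketch below uses synthetic data and hypothetical `X_demo` / `y_demo` names. It is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels (not the Cora data).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 20))
y_demo = np.array(["A"] * 150 + ["B"] * 50)  # imbalanced: 75% A, 25% B

# stratify=y_demo keeps the class ratio roughly equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

print(len(y_tr), len(y_te))
print((y_tr == "B").mean(), (y_te == "B").mean())
```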

In [26]:
clf = HyperoptEstimator(classifier=random_forest_classifier("myclf"), trial_timeout=10)

In [27]:
clf.fit(X_train, y_train)

100%|█████████████| 1/1 [00:08<00:00,  8.02s/trial, best loss: 0.6904176904176904]
100%|█████████████| 2/2 [00:12<00:00, 12.35s/trial, best loss: 0.6904176904176904]
100%|█████████████| 3/3 [00:11<00:00, 11.70s/trial, best loss: 0.6904176904176904]
100%|█████████████| 4/4 [00:04<00:00,  4.52s/trial, best loss: 0.3906633906633906]
100%|█████████████| 5/5 [00:01<00:00,  1.88s/trial, best loss: 0.3906633906633906]
100%|█████████████| 6/6 [00:05<00:00,  5.38s/trial, best loss: 0.3587223587223587]
100%|█████████████| 7/7 [00:11<00:00, 11.57s/trial, best loss: 0.3587223587223587]
100%|█████████████| 8/8 [00:02<00:00,  2.17s/trial, best loss: 0.3587223587223587]
100%|█████████████| 9/9 [00:11<00:00, 11.87s/trial, best loss: 0.3587223587223587]
100%|███████████| 10/10 [00:03<00:00,  3.93s/trial, best loss: 0.3587223587223587]
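To read the log above: as far as I know, hpsklearn's default loss for classifiers is `1 - accuracy` on an internal validation split, so the best loss can be converted back to a validation accuracy. The snippet below assumes that default and is only arithmetic on the logged value.

```python
# Best loss reported in the last trial of the log above.
best_loss = 0.3587223587223587

# Assumption: the loss is 1 - accuracy on hpsklearn's validation split.
val_accuracy = 1.0 - best_loss

print(f"validation accuracy ~ {val_accuracy:.3f}")
```

This is a validation figure chosen during the search, so it is not directly comparable to the test-set score.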


In [24]:
# Random forest, no PCA: accuracy on the held-out test set
clf.score(X_test, y_test)

0.6691285081240768
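A single accuracy figure is easier to judge next to a trivial baseline and per-class metrics. The sketch below shows one way to do that with scikit-learn on synthetic data. It uses a plain `RandomForestClassifier` instead of the tuned hpsklearn estimator, and all names and data in it are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic binary bag-of-words-like data with 3 classes (not Cora).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 3, size=600)
X_demo = rng.integers(0, 2, size=(600, 30))

# Make the first 3 columns informative: one-hot class signal, 10% flipped.
signal = (y_demo[:, None] == np.arange(3)).astype(int)
noise = rng.random((600, 3)) < 0.1
X_demo[:, :3] = signal ^ noise

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)

# Untuned random forest for comparison.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
rf_acc = rf.score(X_te, y_te)

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"random forest accuracy: {rf_acc:.3f}")

# Per-class precision, recall and F1 show which classes are hard.
print(classification_report(y_te, rf.predict(X_te)))
```

On the real data, the same calls could be applied to `X_test` and `y_test` from the split above, with `clf.predict(X_test)` supplying the predictions.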