# Node2Vec Graph Analysis

## 1. Creating the Graph and Computing Embeddings

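First, we created a weighted graph projection in Neo4j using the Graph Data Science (GDS) library. Nodes are `Area_4` areas, and relationships are `trip` relationships weighted by their `NB` property. The relationship query matches `trip` relationships in their stored direction, so the projection is effectively undirected only if each trip is stored in both directions. The projection query was: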

```cypher
CALL gds.graph.project.cypher(
  'Node2VecGraph',
  'MATCH (n:Area_4) RETURN id(n) AS id',
  'MATCH (n)-[r:trip]->(m)
   WHERE r.NB IS NOT NULL
   RETURN id(n) AS source, id(m) AS target, r.NB AS weight'
);

```

## 2. Exporting the Embeddings

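We then ran node2vec in stream mode on the projected graph and exported the resulting embeddings, together with each node's city and region names, to a CSV file using APOC: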

```cypher
CALL apoc.export.csv.query(
  'CALL gds.node2vec.stream("Node2VecGraph", {
    embeddingDimension: 256,
    walkLength: 15,
    iterations: 15,
    inOutFactor: 2.0,
    returnFactor: 1.0,
    relationshipWeightProperty: "weight"
  })
  YIELD nodeId, embedding
  WITH gds.util.asNode(nodeId) AS node, embedding
  RETURN
    node.name_4 AS city,
    node.name_1 AS region,
    embedding',
   "node2vec_embeddings.csv",
  {}
);

```
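To make the roles of `returnFactor` and `inOutFactor` more concrete, here is a minimal sketch of the biased second-order walk step described in the node2vec paper. It is an illustration only, not the GDS implementation: the function name, the dict-of-dicts graph, and the toy nodes are all hypothetical, and it assumes `returnFactor` plays the role of the paper's `p` and `inOutFactor` the role of `q`.

```python
def node2vec_step_probabilities(graph, prev, curr, return_factor=1.0, in_out_factor=2.0):
    """Return {neighbor: probability} for the next step of a biased walk.

    graph is a dict of dicts: graph[u][v] is the weight of the edge u-v.
    prev is the node visited just before curr.
    """
    scores = {}
    for nxt, weight in graph[curr].items():
        if nxt == prev:
            bias = 1.0 / return_factor      # step back to the previous node
        elif nxt in graph[prev]:
            bias = 1.0                      # stay at distance 1 from prev
        else:
            bias = 1.0 / in_out_factor      # move further away from prev
        scores[nxt] = bias * weight
    total = sum(scores.values())
    return {node: score / total for node, score in scores.items()}


# Hypothetical toy graph: "a" is a neighbor of both "t" and "v"; "b" only of "v".
toy_graph = {
    "t": {"v": 1.0, "a": 1.0},
    "v": {"t": 1.0, "a": 1.0, "b": 1.0},
    "a": {"t": 1.0, "v": 1.0},
    "b": {"v": 1.0},
}
probs = node2vec_step_probabilities(toy_graph, prev="t", curr="v")
print(probs)
```

With the values used in the export query (`returnFactor: 1.0`, `inOutFactor: 2.0`), a step that moves away from the previous node gets half the bias of a step that stays close, so walks tend to stay local.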



## 3. Loading the Embeddings

```python
import ast

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import top_k_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
```

```python
# Load the embeddings from the exported CSV.
# Note: the export query in section 2 writes "node2vec_embeddings.csv";
# adjust this path if the file was not renamed.
data = pd.read_csv("node2vec_embeddings_undirected.csv")

# Convert each embedding from its string form to a Python list
data['embedding'] = data['embedding'].apply(ast.literal_eval)

# Extract features (embeddings) as a 2-D array
X = np.array(data['embedding'].to_list())

# Encode the labels (regions) as integers
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data['region'])

# Split data into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
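The parsing and label-encoding steps can be checked on a tiny stand-in for the exported file. The sketch below assumes the `embedding` column holds a bracketed, comma-separated list in each cell (which is what `ast.literal_eval` expects); the two rows, their values, and the names `toy`, `X_toy`, and `y_toy` are made up for illustration.

```python
import ast
import io

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-row stand-in for the exported CSV.
csv_text = '''city,region,embedding
A,North,"[0.1, -0.2, 0.3]"
B,South,"[0.4, 0.5, -0.6]"
'''
toy = pd.read_csv(io.StringIO(csv_text))

# Each cell is a string such as "[0.1, -0.2, 0.3]"; parse it into a list.
toy["embedding"] = toy["embedding"].apply(ast.literal_eval)

# Stack the lists into an (n_samples, n_dimensions) feature matrix.
X_toy = np.array(toy["embedding"].to_list(), dtype=float)

# Map region names to integer class ids, and back again.
encoder = LabelEncoder()
y_toy = encoder.fit_transform(toy["region"])

print(X_toy.shape)
print(list(encoder.inverse_transform(y_toy)))
```

Keeping the fitted encoder around makes it possible to translate predicted class ids back to region names with `inverse_transform`.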


## 4. Training the Models and Evaluating the Results
### 4.1 Random Forest Classifier

```python
# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=400, random_state=42)
clf.fit(X_train, y_train)

# Evaluate the model; class probabilities are computed once and reused
proba = clf.predict_proba(X_test)
top1 = clf.score(X_test, y_test)  # Top-1 accuracy
top3 = top_k_accuracy_score(y_test, proba, k=3)
top5 = top_k_accuracy_score(y_test, proba, k=5)

print(f"Top-1 Accuracy: {top1:.4f}")
print(f"Top-3 Accuracy: {top3:.4f}")
print(f"Top-5 Accuracy: {top5:.4f}")
```

```
Top-1 Accuracy: 0.5923
Top-3 Accuracy: 0.8574
Top-5 Accuracy: 0.9369
```
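For readers unfamiliar with the metric: top-k accuracy counts a prediction as correct when the true class is among the k classes with the highest predicted scores. The following sketch computes it by hand on made-up scores and compares the result with scikit-learn's `top_k_accuracy_score`; the helper name `manual_top_k_accuracy` and the toy arrays are hypothetical, and ties between scores are not handled specially.

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score


def manual_top_k_accuracy(y_true, y_score, k):
    """Fraction of rows whose true class is among the k highest scores."""
    # Column indices of the k largest scores in each row.
    top_k = np.argsort(y_score, axis=1)[:, -k:]
    hits = (top_k == np.asarray(y_true)[:, None]).any(axis=1)
    return hits.mean()


y_true_toy = np.array([0, 1, 2, 2])
y_score_toy = np.array([
    [0.6, 0.3, 0.1],  # true class 0 is ranked 1st
    [0.5, 0.4, 0.1],  # true class 1 is ranked 2nd
    [0.2, 0.5, 0.3],  # true class 2 is ranked 2nd
    [0.7, 0.2, 0.1],  # true class 2 is ranked 3rd
])

for k in (1, 2):
    manual = manual_top_k_accuracy(y_true_toy, y_score_toy, k)
    reference = top_k_accuracy_score(y_true_toy, y_score_toy, k=k)
    print(k, manual, reference)
```

On this toy input, one of four rows is correct at k=1 and three of four at k=2, which illustrates why top-3 and top-5 accuracy are always at least as high as top-1.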


### 4.2 Support Vector Classifier

```python
# Import the Support Vector Classifier
from sklearn.svm import SVC

# Train a linear-kernel Support Vector Classifier;
# probability=True enables predict_proba, which top-k accuracy needs
clf = SVC(kernel='linear', probability=True, random_state=42)
clf.fit(X_train, y_train)

# Evaluate the model; class probabilities are computed once and reused
proba = clf.predict_proba(X_test)
top1 = clf.score(X_test, y_test)  # Top-1 accuracy
top3 = top_k_accuracy_score(y_test, proba, k=3)
top5 = top_k_accuracy_score(y_test, proba, k=5)

print(f"Top-1 Accuracy: {top1:.4f}")
print(f"Top-3 Accuracy: {top3:.4f}")
print(f"Top-5 Accuracy: {top5:.4f}")
```

```
Top-1 Accuracy: 0.6938
Top-3 Accuracy: 0.9214
Top-5 Accuracy: 0.9653
```


### 4.3 TPOT

```python
# Import TPOT
from tpot import TPOTClassifier

# Initialize the TPOT classifier
tpot = TPOTClassifier(verbosity=2, random_state=42, generations=5, population_size=20, n_jobs=-1)

# Fit TPOT on the training data
tpot.fit(X_train, y_train)
```

```
Generation 1 - Current best internal CV score: 0.7570484632209048

Generation 2 - Current best internal CV score: 0.7652722458846551

Generation 3 - Current best internal CV score: 0.7652722458846551

Generation 4 - Current best internal CV score: 0.7652722458846551

Generation 5 - Current best internal CV score: 0.7652722458846551

Best pipeline: MultinomialNB(MinMaxScaler(input_matrix), alpha=0.001, fit_prior=False)
```


```python
# Evaluate the best pipeline found by TPOT
proba = tpot.predict_proba(X_test)
top1 = tpot.score(X_test, y_test)  # Top-1 accuracy
top3 = top_k_accuracy_score(y_test, proba, k=3)
top5 = top_k_accuracy_score(y_test, proba, k=5)

print(f"Top-1 Accuracy: {top1:.4f}")
print(f"Top-3 Accuracy: {top3:.4f}")
print(f"Top-5 Accuracy: {top5:.4f}")

# Export the best pipeline as a standalone Python script
tpot.export('best_pipeline.py')
```

```
Top-1 Accuracy: 0.7678
Top-3 Accuracy: 0.9342
Top-5 Accuracy: 0.9744
```


## 5. Analysis of Results
In this analysis, we trained several classifiers to predict each city's region from its node2vec embedding. The goal was to compare the classifiers on top-1, top-3, and top-5 accuracy, measured on a held-out test set containing 30% of the data.

#### Models Evaluated
We implemented three different classifiers:
1. **Random Forest Classifier (RF)**
2. **Support Vector Classifier (SVC)**
3. **TPOT (Tree-based Pipeline Optimization Tool)**
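Since all three models are scored the same way, the evaluation code can be factored into one helper. The sketch below is one possible way to do this, not the code used to produce the results in this document: the helper name `top_k_report` is hypothetical, the data is synthetic (from `make_classification`), and the split is stratified so that every class appears in the test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import top_k_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def top_k_report(model, X_test, y_test, ks=(1, 3, 5)):
    """Return {k: top-k accuracy} for a fitted model exposing predict_proba."""
    proba = model.predict_proba(X_test)  # computed once, reused for every k
    return {k: top_k_accuracy_score(y_test, proba, k=k) for k in ks}


# Synthetic stand-in for the embeddings: 6 classes, 20 features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=6, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Linear SVC": SVC(kernel="linear", probability=True, random_state=42),
}
reports = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    reports[name] = top_k_report(model, X_te, y_te)
    print(name, reports[name])
```

Any other fitted estimator with a `predict_proba` method could be passed to the same helper, which keeps the comparison between models consistent.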

##### 1. Random Forest Classifier (RF)
- **Top-1 Accuracy**: 0.5923
- **Top-3 Accuracy**: 0.8574
- **Top-5 Accuracy**: 0.9369

The Random Forest Classifier is an ensemble method that trains many decision trees and aggregates their predictions. With 400 trees and otherwise default settings, it served as the baseline and had the lowest accuracy of the three models on every metric.

##### 2. Support Vector Classifier (SVC)
- **Top-1 Accuracy**: 0.6938
- **Top-3 Accuracy**: 0.9214
- **Top-5 Accuracy**: 0.9653

The Support Vector Classifier with a linear kernel, which looks for maximum-margin hyperplanes that separate the classes, improved top-1 accuracy by about 10 percentage points over the Random Forest (0.6938 vs. 0.5923), with smaller gains in top-3 and top-5 accuracy. One plausible explanation is that the regions are close to linearly separable in the 256-dimensional embedding space, although we did not test this directly.

##### 3. TPOT (Tree-based Pipeline Optimization Tool)
- **Top-1 Accuracy**: 0.7678
- **Top-3 Accuracy**: 0.9342
- **Top-5 Accuracy**: 0.9744

TPOT is an automated machine learning tool that uses genetic programming to search over machine learning pipelines, including preprocessing steps, models, and hyperparameters. After 5 generations with a population size of 20, the best pipeline it found was a `MultinomialNB` classifier (`alpha=0.001`, `fit_prior=False`) applied to `MinMaxScaler`-scaled embeddings. This pipeline scored highest on all three metrics: its top-1 accuracy was about 7 percentage points above the SVC (0.7678 vs. 0.6938), and its top-3 and top-5 accuracy were each about 1 percentage point higher.
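The best pipeline reported by TPOT can be approximated directly in scikit-learn. The sketch below rebuilds it from the printed description (`MinMaxScaler` followed by `MultinomialNB` with `alpha=0.001` and `fit_prior=False`) and runs it on synthetic data; it is not the exported `best_pipeline.py`, and the variable names and data are assumptions made for illustration. The scaling step matters because `MultinomialNB` requires non-negative training features, while embedding values can be negative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the embeddings; the features include negative values.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=16, n_informative=8, n_classes=4, random_state=0
)
X_tr_syn, X_te_syn, y_tr_syn, y_te_syn = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

# MinMaxScaler maps each training feature to [0, 1] before MultinomialNB is fit.
best_like = make_pipeline(
    MinMaxScaler(),
    MultinomialNB(alpha=0.001, fit_prior=False),
)
best_like.fit(X_tr_syn, y_tr_syn)

proba_syn = best_like.predict_proba(X_te_syn)
accuracy_syn = best_like.score(X_te_syn, y_te_syn)
print(proba_syn.shape, accuracy_syn)
```

Rebuilding the pipeline this way makes it possible to retrain or inspect the selected model without rerunning the TPOT search.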

#### Conclusion
On this test split, the TPOT pipeline achieved the best accuracy, followed by the Support Vector Classifier and then the Random Forest Classifier. Note that the Random Forest and SVC were trained with fixed settings and no hyperparameter tuning, whereas TPOT searched over models and settings, so the comparison reflects the benefit of the search as much as the choice of model. The results suggest that automated machine learning tools like TPOT can help with model selection and hyperparameter tuning for this kind of task.


