# Learning from networks project
### Evaluation of different Node Embedding algorithms
Members:<br>
- D'Emilio Filippo, id : 2120931
- Volpato Pietro, id : 2120825

### Introduction to the project
 ---- to do ----


### Required installations:
- pip install node2vec

### Imports

In [14]:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from node2vec import Node2Vec
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
import gzip

### Picking the graphs
(per ora solo questo esempio giocattolo). Grafo di facebook: https://snap.stanford.edu/data/ego-Facebook.html

In [19]:
file_path = "../data/facebook_combined.txt.gz"

# Read and parse the graph
G = nx.Graph()  # Create an empty undirected graph
with gzip.open(file_path, 'rt') as f:
    for line in f:
        node1, node2 = map(int, line.strip().split())
        G.add_edge(node1, node2)

print(f"Number of nodes: {len(G.nodes)}")
print(f"Number of edges: {len(G.edges)}")

Number of nodes: 4039
Number of edges: 88234


Per Pietro: ciao, visto che ti ho fatto fare il proposal, qui proverò a fare l'embedding di un grafo giocattolousando node2vec.<br>
Ho visto il link a github nel proposal e si tratta proprio dell'implementazione standard in python di node2vec, visto che nel readme come guida di installazione dice: pip install node2vec.
<br><br>
Spiegazione sui parametri di node2vec:<br>
- G (required): The graph on which to run Node2Vec. Must be an undirected networkx.Graph object.
- dimensions (default = 128): The dimensionality of the node embeddings. Higher dimensions allow for capturing more information but increase computational cost.
- walk_length (default = 80): The number of steps for each random walk. A larger walk_length captures more of the network structure.
- num_walks (default = 10): The number of random walks to start per node. Increasing this can improve the representation at the cost of additional computation.
- workers (default = 1): The number of CPU cores to use for parallel processing. If you're running this on a multi-core machine, increasing this can speed up the computation.
- p (return parameter): p<1: Increases the likelihood of revisiting a node (DFS-like behavior). p>1: Discourages revisiting nodes, encouraging exploration (BFS-like behavior).
- q (in-out parameter): q<1: Encourages walks to nodes further away from the starting node (BFS-like).q>1: Biases walks to nodes closer to the starting node (DFS-like).

Spiegazione di : model = node2vec.fit(window=5, min_count=1, batch_words=4)<br>
This trains a Word2Vec model (from the gensim library) using the random walks. Let’s go over the parameters:<br>

- window (default = 10): The maximum distance between the current and predicted nodes in the random walk sequence. Larger windows capture more context but require more computation.

- min_count (default = 1): Minimum frequency for a node to be considered in the embedding. Since most graphs are sparse, this is often set to 1.

- batch_words (default = 4): The number of words (or nodes) processed in each training batch. Adjust this for performance depending on your hardware.

### Test provando il reconstruction error

In [None]:
def reconstruction_error(G, embeddings):
    # Adjacency matrix of the original graph
    A = nx.to_numpy_array(G)

    # Compute similarity matrix from embeddings
    embedding_vectors = np.array([embeddings[node] for node in G.nodes])
    similarity_matrix = cosine_similarity(embedding_vectors)

    # Normalize similarity matrix to match adjacency scale
    similarity_matrix = (similarity_matrix - similarity_matrix.min()) / (
        similarity_matrix.max() - similarity_matrix.min()
    )

    # Compute reconstruction error (MSE)
    mse = mean_squared_error(A.flatten(), similarity_matrix.flatten())
    return mse

# Function to evaluate the reconstruction error
def evaluate_reconstruction_error(G, dimensions, walk_length, num_walks, p, q):
    # Generate embeddings with Node2Vec
    node2vec = Node2Vec(
        G, dimensions=dimensions, walk_length=walk_length, num_walks=num_walks, p=p, q=q, workers=2
    )
    model = node2vec.fit(window=10, min_count=1, batch_words=4)
    embeddings = {node: model.wv[node] for node in G.nodes}
    
    # Compute reconstruction error
    return reconstruction_error(G, embeddings)

# Define the parameter grid for p and q
p_values = [0.25, 0.5, 1, 2, 4]
q_values = [0.25, 0.5, 1, 2, 4]

# Placeholder for results
results = []

# Loop over p and q values
for p in p_values:
    for q in q_values:
        print(f"Evaluating p={p}, q={q}...")
        error = evaluate_reconstruction_error(G, dimensions=128, walk_length=10, num_walks=20, p=p, q=q)
        results.append((p, q, error))
        print(f"Reconstruction Error: {error}")

# Convert results to a structured format
results = np.array(results, dtype=[('p', float), ('q', float), ('error', float)])

# Find the best combination of p and q
best_index = np.argmin(results['error'])
best_p, best_q, best_error = results[best_index]
print(f"Best parameters: p={best_p}, q={best_q} with Reconstruction Error: {best_error}")

Evaluating p=0.25, q=0.25...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.2701472292931636
Evaluating p=0.25, q=0.5...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.26431369322650033
Evaluating p=0.25, q=1...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.2539990913159048
Evaluating p=0.25, q=2...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.2559409015413418
Evaluating p=0.25, q=4...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.2311173155050151
Evaluating p=0.5, q=0.25...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.26310072528408524
Evaluating p=0.5, q=0.5...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.27476605307279406
Evaluating p=0.5, q=1...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.2735465842805719
Evaluating p=0.5, q=2...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.25712319069877626
Evaluating p=0.5, q=4...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.23122614678320177
Evaluating p=1, q=0.25...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.27319153802185336
Evaluating p=1, q=0.5...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.27886158411085643
Evaluating p=1, q=1...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.2851119592059456
Evaluating p=1, q=2...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.2751326887142336
Evaluating p=1, q=4...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.24256983450267536
Evaluating p=2, q=0.25...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]

Reconstruction Error: 0.2569877472550811
Evaluating p=2, q=0.5...


Computing transition probabilities:   0%|          | 0/4039 [00:00<?, ?it/s]