<a href="https://colab.research.google.com/github/PietroVolpato/lfn_project/blob/main/src/LFN_project_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning from networks project
### Evaluation of different Node Embedding algorithms
Members:<br>
- D'Emilio Filippo, id : 2120931
- Volpato Pietro, id : 2120825

### Information about the notebook (have a look at the report for details)
This notebook is responsible for computing the embeddings for every embedding technique and every selected graph.<br>
Each computed embedding is saved to file as a NumPy array (extension .npy) in the directory /embeddings. This way, once an embedding is computed, it is not lost when the notebook runtime is terminated.<br>
We can then efficiently load the embeddings in the "test" notebook and evaluate their quality.<br>
Selected embedding techniques:
- Node2Vec
- LINE
- ...

For information about the graphs, see the cells below.<br>
*NOTE*: by implementation choice, each embedding is computed separately (i.e. there is no function that computes all embeddings at once).<br>
This choice comes from the fact that computing embeddings is computationally intensive, and we might want to compute only a specific embedding strategy for a specific graph, in order to update only that entry in the folder containing the embeddings.

### Imports

In [1]:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from node2vec import Node2Vec
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
import gzip
import sys
import re

# Configuration
Here you can configure the names of the graphs and of the embedding strategies. Use meaningful names.

In [None]:
graph_keys = ["facebook","citation","biological","CL","COX2"]
embedding_keys = ["node2vec", "LINE"]

# Loading the graphs
Selected graphs:
- Facebook_combined    https://snap.stanford.edu/data/ego-Facebook.html          
- cit-HepTh            https://networkrepository.com/cit-HepTh.php             
- bio-CE-CX            https://networkrepository.com/bio-CE-CX.php             
- CL-100K-1d8-L9       https://networkrepository.com/CL-100K-1d8-L9.php ---- the graph has node labels
- COX2-MD              https://networkrepository.com/COX2-MD.php  ---- the graph has node labels

To run this notebook, adjust the paths to match where the files are saved on your machine.<br>
To keep the paths as they are, create a "data" folder inside the directory of this notebook and store the files there.<br><br>

Graphs are stored in a dictionary: the key is the graph name, the value is the corresponding networkx graph.<br>

In [2]:
facebook_path = 'data/facebook_combined.txt.gz'
citation_path = 'data/cit-HepTh.edges'
biological_path = 'data/bio-CE-CX.edges'
CL_path = "data/CL-100K-1d8-L9/CL-100K-1d8-L9.edges"
COX2_path = "data/COX2-MD/COX2-MD.edges"

In [None]:
def load_graph(path):
    """
    For files with extension .edges
    """
    G = nx.Graph()
    with open(path, 'rt') as f:
        for line in f:
            if line.startswith('%'):  # Skip comment lines
                continue
            # Split the line based on spaces or commas
            data = re.split(r'[,\s]+', line.strip())
            if len(data) < 2:  # Skip lines that don't have at least two columns
                continue
            # Extract the first two columns (nodes)
            node1, node2 = int(data[0]), int(data[1])
            G.add_edge(node1, node2)
    G = nx.convert_node_labels_to_integers(G)  # Relabel nodes to integers
    return G

def load_graph_with_gz(path):
    """
    For files with extension .txt.gz
    """
    G = nx.Graph()
    with gzip.open(path, 'rt') as f:
        for line in f:
            node1, node2 = map(int, line.strip().split())
            G.add_edge(node1, node2)
    G = nx.convert_node_labels_to_integers(G)  # Relabel nodes to integers
    return G

def print_graphs_info(graphs):
    for k in graph_keys:
        G = graphs[k]
        print(f"{k}: |V|={len(G.nodes)}, |E|={len(G.edges)}")
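
To illustrate the parsing rule used by `load_graph`, the minimal sketch below applies the same regular expression to a few made-up lines. The helper name `parse_edge_line` and the sample lines are assumptions introduced for illustration; they are not part of the notebook.

```python
import re

def parse_edge_line(line):
    """Return (node1, node2) for a data line, or None for comments and short lines."""
    if line.startswith('%'):  # comment line, as found in .edges files
        return None
    # Same separator rule as load_graph: commas and/or whitespace
    data = re.split(r'[,\s]+', line.strip())
    if len(data) < 2:  # blank or malformed line
        return None
    # Only the first two columns are used; extra columns (e.g. weights) are ignored
    return int(data[0]), int(data[1])

# Hypothetical sample lines covering the formats the loader is meant to accept
sample_lines = ["% comment\n", "1 2\n", "3,4,0.5\n", "5\t6 1\n", "\n"]
parsed = [parse_edge_line(line) for line in sample_lines]
print(parsed)
```

Comment lines and blank lines yield `None`, while space-, tab- and comma-separated lines all reduce to a pair of integer node ids.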

In [3]:
graphs = {}

# the facebook graph is the only one stored as .txt.gz
graphs[graph_keys[0]] = load_graph_with_gz(facebook_path)  # nodes relabeled to integers
graphs[graph_keys[1]] = load_graph(citation_path)
graphs[graph_keys[2]] = load_graph(biological_path)
graphs[graph_keys[3]] = load_graph(CL_path)  # node labeled
graphs[graph_keys[4]] = load_graph(COX2_path)  # node labeled

print_graphs_info(graphs)

facebook graph: |V|=4039, |E|=88234
citation graph: |V|=22908, |E|=2444798
biological graph: |V|=15229, |E|=245952
CL graph: |V|=92482, |E|=436611
COX2 graph: |V|=7962, |E|=101542


# Functions and declarations for the embeddings
The embedding data structure is defined as follows:<br>
- The first index refers to the graph (e.g. embeddings["facebook"] contains the embeddings of the facebook graph for every embedding technique).<br>
- The second index refers to the embedding technique (e.g. embeddings["facebook"]["LINE"] contains the embedding of the facebook graph computed using LINE).

In [15]:
import os

def save(emb, path):
    os.makedirs("embeddings", exist_ok=True)  # create the output folder if it is missing
    np.save(f"embeddings/{path}.npy", emb)
    print(f"Successfully saved the embeddings in embeddings/{path}.npy")

# dictionaries to store the embeddings, obtained by several techniques, for each graph
embeddings = {}
for k in graph_keys:
    embeddings[k] = {}
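
Since each embedding is written to its own `.npy` file, the "test" notebook can rebuild the same nested dictionary by reading the files back. The sketch below is one possible way to do that; the helper `load_all` and its `directory` argument are hypothetical, and it assumes the file naming `embeddings_{graph}_{technique}.npy` used by the `save` calls in this notebook.

```python
import os
import tempfile
import numpy as np

def load_all(graph_keys, embedding_keys, directory="embeddings"):
    """Hypothetical helper: load every embedding file that exists in `directory`."""
    loaded = {k: {} for k in graph_keys}
    for g in graph_keys:
        for e in embedding_keys:
            path = os.path.join(directory, f"embeddings_{g}_{e}.npy")
            if os.path.exists(path):  # skip combinations that were not computed yet
                loaded[g][e] = np.load(path)
    return loaded

# Round-trip demo in a temporary directory with a toy 3x2 "embedding"
with tempfile.TemporaryDirectory() as tmp:
    toy = np.arange(6, dtype=float).reshape(3, 2)
    np.save(os.path.join(tmp, "embeddings_facebook_LINE.npy"), toy)
    restored = load_all(["facebook", "citation"], ["LINE"], directory=tmp)

print(restored["facebook"]["LINE"].shape)
```

Missing files are skipped rather than raising, which fits the workflow described above where only some graph/technique pairs may have been computed.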

Explanation of the node2vec parameters:<br>
- G (required): The graph on which to run Node2Vec. Must be an undirected networkx.Graph object.
- dimensions (default = 128): The dimensionality of the node embeddings. Higher dimensions allow for capturing more information but increase computational cost.
- walk_length (default = 80): The number of steps for each random walk. A larger walk_length captures more of the network structure.
- num_walks (default = 10): The number of random walks to start per node. Increasing this can improve the representation at the cost of additional computation.
- workers (default = 1): The number of CPU cores to use for parallel processing. If you're running this on a multi-core machine, increasing this can speed up the computation.
- p (return parameter): controls the likelihood of immediately returning to the previous node. p<1 makes returning more likely, keeping the walk local (BFS-like behavior); p>1 discourages revisiting nodes, encouraging exploration.
- q (in-out parameter): q<1 encourages walks to move to nodes farther from the previous node (DFS-like behavior); q>1 biases walks toward nodes close to the previous node (BFS-like behavior).
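
The effect of `p` and `q` can be made concrete with the unnormalized transition weights of the node2vec walk: for a walk that just moved from `t` to `v`, a neighbor `x` of `v` gets weight `1/p` if `x == t`, `1` if `x` is also a neighbor of `t`, and `1/q` otherwise. The sketch below computes these weights on a tiny hand-written unweighted graph; the function name `transition_weights` and the toy adjacency dictionary are assumptions for illustration, and this is not the `node2vec` package's internal code.

```python
def transition_weights(adj, t, v, p=1.0, q=1.0):
    """Unnormalized node2vec weights for the next step from v, having arrived from t."""
    weights = {}
    for x in adj[v]:
        if x == t:            # distance 0 from t: return to the previous node
            weights[x] = 1.0 / p
        elif x in adj[t]:     # distance 1 from t: stay close to t
            weights[x] = 1.0
        else:                 # distance 2 from t: move away from t
            weights[x] = 1.0 / q
    return weights

# Toy undirected graph with edges 0-1, 0-2, 1-2, 1-3 (hypothetical example)
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}

# The walk arrived at node 1 from node 0; p > 1 and q < 1 favor moving outward
w = transition_weights(adj, t=0, v=1, p=4.0, q=0.5)
total = sum(w.values())
probs = {x: w[x] / total for x in w}
print(w, probs)
```

With `p=4` and `q=0.5`, stepping back to node 0 gets the smallest weight and stepping out to node 3 the largest, which matches the description of large `p` and small `q` as favoring exploration.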

Explanation of model = node2vec.fit(window=10, min_count=1, batch_words=4):<br>
This trains a Word2Vec model (from the gensim library) on the random walks. The keyword arguments are passed on to gensim's Word2Vec; the values used in this notebook are:<br>

- window (10 here; gensim default is 5): The maximum distance between the current and predicted nodes in the random walk sequence. Larger windows capture more context but require more computation.

- min_count (1 here; gensim default is 5): Minimum number of occurrences in the walks for a node to receive an embedding. Setting it to 1 ensures that every node appearing in at least one walk is embedded.

- batch_words (4 here; gensim default is 10000): The target number of words (here, nodes) in each batch of examples passed to the worker threads. Adjust this for performance depending on your hardware.
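
To make the role of `window` concrete, the sketch below lists the (center, context) pairs that a skip-gram model would consider for one walk at a fixed window size. It is a simplified illustration: the function `context_pairs` and the sample walk are hypothetical, and gensim by default also shrinks the effective window at random during training, which is not modeled here.

```python
def context_pairs(walk, window):
    """List (center, context) pairs within `window` positions of each center node."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs

# Hypothetical walk of length 4, with node ids written as strings
walk = ["0", "5", "2", "7"]

pairs = context_pairs(walk, window=1)
print(pairs)
print(len(context_pairs(walk, window=10)))
```

When the window is at least as long as the walk, as with `walk_length=10` and `window=10`, every node in a walk is a context of every other node in that walk.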

# Node2Vec
- pip install node2vec

In [6]:
def get_node2vec_embeddings(G, dimensions=128, walk_length=10, num_walks=20, p=1, q=1, workers=1):
    """
    Generate node embeddings for a graph using the Node2Vec algorithm.

    Parameters:
        G (networkx.Graph): 
            The input graph for which embeddings are to be generated. 
            The graph should have nodes labeled as integers, ideally sequentially starting from 0.

        dimensions (int, optional): 
            The dimensionality of the embedding space. Default is 128.

        walk_length (int, optional): 
            The length of each random walk. Default is 10.

        num_walks (int, optional): 
            The number of random walks to start from each node. Default is 20.

        p (float, optional): 
            The return parameter, controlling the likelihood of immediately revisiting a node in the walk. 
            A higher value makes it less likely to backtrack. Default is 1.

        q (float, optional): 
            The in-out parameter, controlling the likelihood of exploring outward from the starting node. 
            A higher value makes it less likely to move outward. Default is 1.

        workers (int, optional): 
            The number of parallel workers for random walk generation and model training. Default is 1.

    Returns:
        np.ndarray:
            A NumPy array where each row represents the embedding of a node.
            The row index corresponds to the node ID, and each row has `dimensions` elements.
    """
    # Initialize Node2Vec model
    node2vec = Node2Vec(G, dimensions=dimensions, walk_length=walk_length, num_walks=num_walks, p=p, q=q, workers=workers)
    
    # Fit the Node2Vec model and generate embeddings
    model = node2vec.fit(window=10, min_count=1, batch_words=4)
    
    # Convert embeddings to a NumPy array
    num_nodes = G.number_of_nodes()
    embeddings = np.zeros((num_nodes, dimensions))  # Preallocate array
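    for node in G.nodes:
        # node2vec stores node ids as strings in the Word2Vec vocabulary,
        # so look up by str(node); an int key would be read as a vocabulary index
        embeddings[node] = model.wv[str(node)]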
    
    return embeddings

In [None]:
curr = "node2vec"
embeddings[graph_keys[0]][curr] = get_node2vec_embeddings(graphs[graph_keys[0]])
save(embeddings[graph_keys[0]][curr], f"embeddings_{graph_keys[0]}_{curr}")

# LINE : Large-scale information network embedding
Installation guide:
- git clone https://github.com/VahidooX/LINE.git
- !pip install keras
- !pip install tensorflow
- adjust sys.path to point to where you downloaded the LINE repository

*NOTE*: it was necessary to modify utils.py to adapt it to the current version of Keras, since some elements were deprecated.

In [9]:
# Adjust this path to the location of your local clone of the LINE repository
sys.path.append(r'C:\Users\oppil\OneDrive\Desktop\Universita\magistrale\2_1\LFN\LINE')

from model import create_model
from utils import batchgen_train

def get_LINE_embeddings(G, embedding_dim=128, batch_size=1024, negative_ratio=5, epochs=10, negative_sampling="UNIFORM"):
    """
    Generate LINE embeddings for a given graph.

    Parameters:
        G (nx.Graph): The graph for which embeddings are computed.
        embedding_dim (int): Dimensionality of the embeddings.
        batch_size (int): Batch size for training.
        negative_ratio (int): Ratio of negative to positive samples.
        epochs (int): Number of training epochs.
        negative_sampling (str): Negative sampling strategy ("UNIFORM" or "NON-UNIFORM").

    Returns:
        numpy.ndarray: Node embeddings (shape: [num_nodes, embedding_dim]).
    """
    num_nodes = G.number_of_nodes()

    # Convert networkx.Graph to adj_list (edge list as 2D numpy array)
    adj_list = np.array(list(G.edges()), dtype=np.int32)

    # Create LINE model
    model, embed_generator = create_model(num_nodes, embedding_dim)

    # Generate training batches
    train_gen = batchgen_train(adj_list, num_nodes, batch_size, negative_ratio, negative_sampling)

    # Compile and train the model
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(train_gen, steps_per_epoch=500, epochs=epochs)

    # Extract embeddings
    node_ids = np.arange(num_nodes)  # Sequential node IDs
    embeddings = embed_generator.predict_on_batch(node_ids)

    print("Node Embeddings Shape:", embeddings[0].shape)
    return embeddings
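
The `negative_ratio` parameter controls how many sampled node pairs are contrasted with each observed edge during training. As a rough illustration of uniform negative sampling, the sketch below builds one labeled batch from an edge list; the function `sample_batch`, the 1/0 labels, and the toy edges are assumptions made for this example and do not reproduce the `batchgen_train` implementation in the LINE repository.

```python
import random

def sample_batch(edges, num_nodes, negative_ratio, rng):
    """Pair each observed edge (label 1) with `negative_ratio` random targets (label 0)."""
    left, right, labels = [], [], []
    for u, v in edges:
        left.append(u)
        right.append(v)
        labels.append(1)  # observed edge
        for _ in range(negative_ratio):
            left.append(u)
            # Uniform draw over all nodes; it may occasionally hit a true neighbor,
            # which this simplified sketch does not filter out
            right.append(rng.randrange(num_nodes))
            labels.append(0)
    return left, right, labels

# Hypothetical toy input: 3 edges over 6 nodes, 5 negatives per edge
edges = [(0, 1), (1, 2), (3, 4)]
rng = random.Random(0)
left, right, labels = sample_batch(edges, num_nodes=6, negative_ratio=5, rng=rng)
print(len(labels), sum(labels))
```

With `negative_ratio=5`, each observed edge contributes six training pairs, so the batch is dominated by negatives by design.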

In [17]:
curr = "LINE"
embeddings[graph_keys[0]][curr] = get_LINE_embeddings(graphs[graph_keys[0]])
save(embeddings[graph_keys[0]][curr], f"embeddings_{graph_keys[0]}_{curr}")

In [18]:
embeddings["biological"]["LINE"] = get_LINE_embeddings(graphs["biological"])
save(embeddings["biological"]["LINE"], "embeddings_biological_LINE")

In [19]:
embeddings["CL"]["LINE"] = get_LINE_embeddings(graphs["CL"])
save(embeddings["CL"]["LINE"], "embeddings_CL_LINE")

In [20]:
embeddings["citation"]["LINE"] = get_LINE_embeddings(graphs["citation"])
save(embeddings["citation"]["LINE"], "embeddings_citation_LINE")

In [None]:
embeddings["COX2"]["LINE"] = get_LINE_embeddings(graphs["COX2"])
save(embeddings["COX2"]["LINE"], "embeddings_COX2_LINE")