<a href="https://colab.research.google.com/github/PietroVolpato/lfn_project/blob/main/src/LFN_project_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning from networks project
### Evaluation of different Node Embedding algorithms
Members:<br>
- D'Emilio Filippo, id : 2120931
- Volpato Pietro, id : 2120825

### Information about the notebook (have a look at the report for details)
This notebook is responsible for computing the embeddings for every embedding technique and every selected graph.<br>
Each computed embedding is saved to file as a NumPy array (extension .npy) in the ../result directory, so that once an embedding is computed it is not lost when the notebook runtime is terminated.<br>
We can then efficiently load the embeddings in the "test" notebook and evaluate their quality.<br>
Selected embedding techniques:
- Node2Vec
- LINE
- ...

For information about the graphs, see the cells below.<br>
*NOTE*: by implementation choice, each embedding is computed separately (i.e. there is no function that computes all embeddings at once).<br>
This choice comes from the fact that computing embeddings is computationally intensive, and we might want to compute only a specific
embedding strategy for a specific graph, in order to update only that entry in the folder containing the embeddings.
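
The compute-on-demand workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `compute_if_missing`, its `out_dir` argument, and the stand-in `fake_embedding` function are hypothetical and are not part of the notebook; the file-name pattern mirrors the one used by the notebook's `save` function.

```python
import os
import tempfile

import numpy as np


def compute_if_missing(graph_key, embedding_key, compute_fn, out_dir):
    """Load a cached embedding if its file exists, otherwise compute and save it.

    Hypothetical helper, for illustration only.
    Returns (embedding, computed) where `computed` tells whether compute_fn ran.
    """
    path = os.path.join(out_dir, f"embeddings_{graph_key}_{embedding_key}.npy")
    if os.path.exists(path):
        return np.load(path), False  # cache hit: nothing is recomputed
    emb = compute_fn()
    np.save(path, emb)
    return emb, True


def fake_embedding():
    # Stand-in for an expensive embedding computation (4 nodes, 3 dimensions).
    return np.arange(12, dtype=float).reshape(4, 3)


with tempfile.TemporaryDirectory() as tmp:
    first, computed_first = compute_if_missing("facebook", "node2vec", fake_embedding, tmp)
    second, computed_second = compute_if_missing("facebook", "node2vec", fake_embedding, tmp)
```

The second call finds the file written by the first one, so only the requested (graph, technique) entry is ever recomputed.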

### Imports

In [1]:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from node2vec import Node2Vec
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
import gzip
import sys
import re

# Configuration
Here you can properly configure the names of the graphs and the names of the embedding strategies. Use meaningful names.

In [2]:
graph_keys = ["facebook","citation","biological","proteins"]
embedding_keys = ["LINE", "node2vec"]

# Loading the graphs
Selected graphs:
- Facebook_combined    https://snap.stanford.edu/data/ego-Facebook.html          
- cit-HepTh            https://networkrepository.com/cit-HepTh.php             
- bio-CE-CX            https://networkrepository.com/bio-CE-CX.php             
- proteins-full        https://networkrepository.com/PROTEINS-full.php ---- the graph has node labels
- COX2-MD              https://networkrepository.com/COX2-MD.php  ---- the graph has node labels

To run this notebook, adjust the paths to match where the files are saved on your PC.<br>
To keep the paths as they are, create a "data" folder in the parent directory of this notebook (the paths below start with ../data) and store the files there, in the subfolders used by the paths.<br>
Graphs are stored in a dictionary: the key is the graph name, the value is the corresponding NetworkX graph.<br>

When a NetworkX graph is created from a text file, the nodes are renamed as integers from 0 to |V|-1, so that we can store the embeddings
in a matrix whose row index i holds the embedding vector of node i.
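
The relabeling idea can be illustrated without NetworkX. The sketch below is an assumption-labeled toy: `relabel_edges` is a hypothetical helper that assigns new IDs in order of first appearance, which should correspond to the order in which `nx.Graph` inserts nodes when edges are added one by one.

```python
def relabel_edges(edges):
    """Map arbitrary node IDs to 0..|V|-1 in order of first appearance.

    Hypothetical helper, for illustration only.
    """
    mapping = {}
    for u, v in edges:
        for node in (u, v):
            if node not in mapping:
                mapping[node] = len(mapping)
    relabeled = [(mapping[u], mapping[v]) for u, v in edges]
    return relabeled, mapping


# Toy edge list with non-contiguous original IDs.
edges = [(10, 42), (42, 7), (7, 10), (7, 99)]
relabeled, mapping = relabel_edges(edges)

# After relabeling, row i of the embedding matrix belongs to relabeled node i.
num_nodes = len(mapping)
embedding_matrix = [[0.0, 0.0] for _ in range(num_nodes)]
```

Keeping `mapping` around is useful if results ever need to be reported with the original node IDs.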

In [3]:
facebook_path = '../data/facebook/facebook_combined.txt.gz'
citation_path = '../data/citation/cit-HepTh.edges'
biological_path = '../data/biological/bio-CE-CX.edges'
proteins_path = "../data/proteins/PROTEINS-full.edges"
#COX2_path = "../data/COX2-MD.edges"

proteins_labels_path = "../data/proteins/PROTEINS-full.node_labels"
#CL_labels_path = "../data/CL-100K-1d8-L9.node_labels"

In [14]:
def load_graph(path):
    """
    For files with extension .edges
    nodes are renamed as integers, starting from 0
    """
    G = nx.Graph()
    with open(path, 'rt') as f:
        for line in f:
            if line.startswith('%'):  # Skip comment lines
                continue
            # Split the line based on spaces or commas
            data = re.split(r'[,\s]+', line.strip())
            if len(data) < 2:  # Skip lines that don't have at least two columns
                continue
            # Extract the first two columns (nodes)
            node1, node2 = int(data[0]), int(data[1])
            G.add_edge(node1, node2)
    mapping = {node : i for i,node in enumerate(G.nodes)} # mapping original : relabeled
    G = nx.relabel_nodes(G, mapping)
    return G

def load_graph_with_gz(path):
    """
    For files with extension .txt.gz
    nodes are renamed as integers, starting from 0
    """
    G = nx.Graph()
    with gzip.open(path, 'rt') as f:
        for line in f:
            node1, node2 = map(int, line.strip().split())
            G.add_edge(node1, node2)
    mapping = {node : i for i,node in enumerate(G.nodes)} # mapping original : relabeled
    G = nx.relabel_nodes(G, mapping)
    return G

def print_graphs_info(graphs):
    for k in graph_keys:
        G = graphs[k]
        print(f"{k}: |V|={len(G.nodes)}, |E|={len(G.edges)}")
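
The line-parsing rule used by `load_graph` can be checked in isolation on a few sample lines. This is a small sketch: `parse_edge_line` is a hypothetical name that wraps the same comment check and regular expression as the loader above, and the sample lines are made up.

```python
import re


def parse_edge_line(line):
    """Return (node1, node2) for a data line, or None for comments and short lines."""
    if line.startswith('%'):  # comment line
        return None
    # Columns may be separated by commas, whitespace, or both.
    data = re.split(r'[,\s]+', line.strip())
    if len(data) < 2:  # blank or malformed line
        return None
    return int(data[0]), int(data[1])


# Made-up sample lines: a comment, three separator styles, and a blank line.
sample_lines = ["% comment line\n", "1 2\n", "3,4\n", "5, 6 0.75\n", "\n"]
parsed = [parse_edge_line(line) for line in sample_lines]
```

Any extra column, such as an edge weight, is ignored because only the first two fields are read.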

In [15]:
graphs = {}

# the facebook graph is the only one stored as .txt.gz
graphs[graph_keys[0]] = load_graph_with_gz(facebook_path)  # relabeling nodes to integer
graphs[graph_keys[1]] = load_graph(citation_path)
graphs[graph_keys[2]] = load_graph(biological_path)
graphs[graph_keys[3]] = load_graph(proteins_path)  # node labeled
#graphs[graph_keys[4]] = load_graph(COX2_path)  # node labeled

print_graphs_info(graphs)

facebook: |V|=4039, |E|=88234
citation: |V|=22908, |E|=2444798
biological: |V|=15229, |E|=245952
proteins: |V|=43471, |E|=81049


# Download the dataset from the GitHub repository

In [9]:
import requests

url = "https://raw.githubusercontent.com/PietroVolpato/lfn_project/main/data/"
filename = "bio-CE-CX_edges.csv"

response = requests.get(url + filename)
response.raise_for_status()  # stop here if the download failed, instead of saving an error page
with open(filename, "wb") as file:
    file.write(response.content)

# Functions and declarations for the embeddings
The embedding data structure is defined as follows:<br>
- The first index refers to the graph (e.g. embeddings["facebook"] contains the embeddings of the facebook graph for every embedding technique).<br>
- The second index refers to the embedding technique (e.g. embeddings["facebook"]["LINE"] contains the embedding of the facebook graph computed using LINE)

In [6]:
def save(emb, graph_key, embedding_key):
    path = f"../result/embeddings_{graph_key}_{embedding_key}.npy"
    np.save(path, emb)
    print(f"Successfully saved the embeddings in {path}")

# dictionaries to store the embeddings, obtained by several techniques, for each graph
embeddings = {}
for k in graph_keys:
    embeddings[k] = {}
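
As a sketch of how the nested dictionary could be repopulated from disk (for example in the "test" notebook), the block below loads every file that exists and skips the rest. The function `load_available`, the shortened key lists, and the temporary directory are assumptions made for the example; only the file-name pattern comes from `save`.

```python
import os
import tempfile

import numpy as np

# Shortened key lists, for the example only.
graph_keys = ["facebook", "proteins"]
embedding_keys = ["LINE", "node2vec"]


def load_available(result_dir):
    """Fill embeddings[graph][technique] with every .npy file found in result_dir.

    Hypothetical helper, for illustration only.
    """
    embeddings = {g: {} for g in graph_keys}
    for g in graph_keys:
        for e in embedding_keys:
            path = os.path.join(result_dir, f"embeddings_{g}_{e}.npy")
            if os.path.exists(path):  # embeddings not computed yet are skipped
                embeddings[g][e] = np.load(path)
    return embeddings


with tempfile.TemporaryDirectory() as tmp:
    # Pretend that only one embedding has been computed so far.
    np.save(os.path.join(tmp, "embeddings_facebook_LINE.npy"), np.zeros((5, 8)))
    loaded = load_available(tmp)
```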

# Node2Vec
- pip install node2vec

In [18]:
def get_node2vec_embeddings(G, dimensions=128, walk_length=50, num_walks=40, p=0.5, q=2, workers=1):
    """
    Generate node embeddings for a graph using the Node2Vec algorithm.

    Parameters:
        G (networkx.Graph): The input graph for which embeddings are to be generated.
            The graph should have nodes labeled as integers, ideally sequentially starting from 0.
        dimensions (int, optional): The dimensionality of the embedding space. Default is 128.
        walk_length (int, optional): The length of each random walk. Default is 50.
        num_walks (int, optional): The number of random walks to start from each node. Default is 40.
        p (float, optional):
            The return parameter, controlling the likelihood of immediately revisiting the previous node in the walk.
            A lower value makes it more likely to backtrack. Default is 0.5.
        q (float, optional):
            The in-out parameter, controlling whether the walk stays close to the previous node or moves away from it.
            A lower value makes it more likely to move outward; a higher value keeps the walk local. Default is 2.
        workers (int, optional): The number of parallel workers for random walk generation and model training. Default is 1.

    Returns:
        np.ndarray: A NumPy array where each row represents the embedding of a node.
            The row index corresponds to the node ID, and each row has `dimensions` elements.
    """
    # Initialize Node2Vec model
    node2vec = Node2Vec(G, dimensions=dimensions, walk_length=walk_length, num_walks=num_walks, p=p, q=q, workers=workers)

    # Fit the Node2Vec model and generate embeddings
    model = node2vec.fit(window=10, min_count=1, batch_words=4)

    # Convert embeddings to a NumPy array
    num_nodes = G.number_of_nodes()
    embeddings = np.zeros((num_nodes, dimensions))  # Preallocate array
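    for node in G.nodes:
        # The node2vec package stores nodes as string keys in the word2vec vocabulary;
        # an integer key would be read by gensim as a positional index, not as a node ID.
        embeddings[node] = model.wv[str(node)]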

    return embeddings
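
To make the role of `p` and `q` concrete, the sketch below computes the unnormalized second-order transition weights of node2vec for one step on a tiny unweighted graph. It is an illustration only: the function name `unnormalized_weights` and the toy adjacency dictionary are assumptions, and the `node2vec` package computes these probabilities internally in its own way.

```python
def unnormalized_weights(adj, prev, curr, p, q):
    """node2vec weights for leaving `curr`, having just arrived from `prev`.

    Assumes an unweighted graph given as {node: set_of_neighbors}.
    """
    weights = {}
    for nxt in adj[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p   # go back to the previous node
        elif nxt in adj[prev]:
            weights[nxt] = 1.0       # stay at distance 1 from the previous node
        else:
            weights[nxt] = 1.0 / q   # move to distance 2 from the previous node
    return weights


# Toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 1.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}

# Same p and q as used in this notebook.
w = unnormalized_weights(adj, prev=0, curr=1, p=0.5, q=2)
total = sum(w.values())
probs = {node: weight / total for node, weight in w.items()}
```

With p=0.5 and q=2, returning to node 0 gets the largest weight and stepping outward to node 3 the smallest, so the walks tend to stay local.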

## Produce the embeddings with node2vec
Here you can produce the embeddings for any of the loaded graphs using node2vec.<br>
Set the variable curr_graph_key to the key of the graph you want to compute the embeddings for.<br>
The embeddings are saved to file (see the output for the path).

In [19]:
# graph_keys[0] = facebook
# graph_keys[1] = citation
# graph_keys[2] = biological
# graph_keys[3] = proteins
curr_graph_key = graph_keys[3]   # choose the graph

embeddings[curr_graph_key]["node2vec"] = get_node2vec_embeddings(graphs[curr_graph_key], dimensions=128, walk_length=50, num_walks=40, p=0.5, q=2, workers=1)
save(embeddings[curr_graph_key]["node2vec"], curr_graph_key, "node2vec")

Computing transition probabilities:   0%|          | 0/43471 [00:00<?, ?it/s]

Generating walks (CPU: 1): 100%|███████████████████████████████████████████████████████| 40/40 [05:50<00:00,  8.77s/it]


Successfully saved the embeddings in ../result/embeddings_proteins_node2vec.npy


# LINE : Large-scale information network embedding
Installation guide:
- git clone https://github.com/VahidooX/LINE.git
- !pip install keras
- !pip install tensorflow
- adjust the sys.path to where you downloaded the LINE repository

*NOTE*: it was necessary to modify utils.py to adapt it to the current version of Keras, because some elements were deprecated

In [12]:
sys.path.append(r'C:\Users\oppil\OneDrive\Desktop\Universita\magistrale\2_1\LFN\REPO_PROJECT\lfn_project\src\LINE')

from model import create_model
from utils import batchgen_train

def get_LINE_embeddings(G, embedding_dim=128, batch_size=1024, negative_ratio=5, epochs=10, negative_sampling="UNIFORM"):
    """
    Generate LINE embeddings for a given graph.

    Parameters:
        G (nx.Graph): The graph for which embeddings are computed.
        embedding_dim (int): Dimensionality of the embeddings.
        batch_size (int): Batch size for training.
        negative_ratio (int): Ratio of negative to positive samples.
        epochs (int): Number of training epochs.
        negative_sampling (str): Negative sampling strategy ("UNIFORM" or "NON-UNIFORM").

    Returns:
        numpy.ndarray: Node embeddings (shape: [num_nodes, embedding_dim]).
    """
    num_nodes = G.number_of_nodes()

    # Convert networkx.Graph to adj_list (edge list as 2D numpy array)
    adj_list = np.array(list(G.edges()), dtype=np.int32)

    # Create LINE model
    model, embed_generator = create_model(num_nodes, embedding_dim)

    # Generate training batches
    train_gen = batchgen_train(adj_list, num_nodes, batch_size, negative_ratio, negative_sampling)

    # Compile and train the model
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(train_gen, steps_per_epoch=500, epochs=epochs)

    # Extract embeddings
    node_ids = np.arange(num_nodes)  # Sequential node IDs
    embeddings = embed_generator.predict_on_batch(node_ids)

    print("Node Embeddings Shape:", embeddings[0].shape)
    return embeddings
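
The meaning of `negative_ratio` with uniform negative sampling can be sketched as follows. This is not the code of the cloned LINE repository: `sample_pairs`, the 1/0 labels (chosen here to suit a binary cross-entropy loss), and the toy path graph are assumptions for illustration, and `batchgen_train` may encode its batches differently.

```python
import random


def sample_pairs(edges, num_nodes, negative_ratio, rng):
    """For each edge (u, v), emit one positive pair and `negative_ratio` negative pairs.

    Negatives (u, w) are drawn uniformly among nodes that are not neighbors of u.
    Hypothetical helper, for illustration only.
    """
    # Undirected graph: store both directions for the neighbor test.
    edge_set = set(edges) | {(v, u) for u, v in edges}
    pairs, labels = [], []
    for u, v in edges:
        pairs.append((u, v))
        labels.append(1)  # observed edge
        added = 0
        while added < negative_ratio:
            w = rng.randrange(num_nodes)
            if w != u and (u, w) not in edge_set:
                pairs.append((u, w))
                labels.append(0)  # sampled non-edge
                added += 1
    return pairs, labels


# Toy path graph 0-1-2-3-4-5.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
rng = random.Random(0)  # fixed seed so the example is reproducible
pairs, labels = sample_pairs(edges, num_nodes=6, negative_ratio=2, rng=rng)
```

Each positive pair is followed by `negative_ratio` negatives, so the batch contains `len(edges) * (1 + negative_ratio)` pairs in total.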

## Produce the embeddings with LINE
Here you can produce the embeddings for any of the loaded graphs using LINE.<br>
Set the variable curr_graph_key to the key of the graph you want to compute the embeddings for.<br>
The embeddings are saved to file (see the output for the path).

In [17]:
curr_graph_key = graph_keys[3]   # choose the graph

embeddings[curr_graph_key]["LINE"] = get_LINE_embeddings(graphs[curr_graph_key], epochs = 20)
save(embeddings[curr_graph_key]["LINE"], curr_graph_key, "LINE")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Node Embeddings Shape: (128,)
Successfully saved the embeddings in ../result/embeddings_proteins_LINE.npy


# AttentionWalk

## Installation guide
<ol>
<li>git clone https://github.com/benedekrozemberczki/AttentionWalk.git</li>
<li>pip install texttable</li>
</ol>

AttentionWalk requires a .csv input file, so we first implemented a function that converts the .txt.gz and .edges files to .csv.<br>
To start the algorithm, clone the GitHub repository, enter the AttentionWalk folder, and set the arguments as described in its README.md file.
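
The conversion function itself is not shown in this notebook, so the block below is only a sketch of what such a converter could look like. It assumes that the target CSV has a header row followed by one zero-based `source,target` pair per line; the function name `edges_to_csv` and the header names `node_1`/`node_2` are hypothetical, and the expected format should be checked against the AttentionWalk README.

```python
import csv
import gzip
import os
import re
import tempfile


def edges_to_csv(src_path, dst_path):
    """Convert a .edges or .txt.gz edge list to a CSV with zero-based node IDs.

    Hypothetical helper, for illustration only. Duplicate edges are not removed.
    Returns the number of edges written.
    """
    opener = gzip.open if src_path.endswith(".gz") else open
    mapping = {}  # original ID -> new ID, in order of first appearance
    n_edges = 0
    with opener(src_path, "rt") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["node_1", "node_2"])  # assumed header names
        for line in src:
            if line.startswith("%"):  # comment line
                continue
            data = re.split(r"[,\s]+", line.strip())
            if len(data) < 2:  # blank or malformed line
                continue
            row = [mapping.setdefault(int(raw), len(mapping)) for raw in data[:2]]
            writer.writerow(row)
            n_edges += 1
    return n_edges


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy.edges")
    dst = os.path.join(tmp, "toy.csv")
    with open(src, "w") as f:
        f.write("% toy graph\n10 20\n20,30\n")
    n_written = edges_to_csv(src, dst)
    with open(dst, newline="") as f:
        rows = list(csv.reader(f))
```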

In [None]:
!git clone https://github.com/benedekrozemberczki/AttentionWalk.git

In [None]:
!pip install texttable

## Test with the facebook network
Save the embeddings in the result folder<br>
Time: 1m 50s

In [1]:
!cd AttentionWalk && python src/main.py --edge-path ../../data/facebook/facebook_combined.csv --embedding-path ../../result/embeddings_facebook_AW_256.csv --dimensions 256

+----------------+---------------------------------------------+
| Attention path |     ./output/chameleon_AW_attention.csv     |
| Beta           | 0.500                                       |
+----------------+---------------------------------------------+
| Dimensions     | 256                                         |
+----------------+---------------------------------------------+
| Edge path      | ../../data/facebook/facebook_combined.csv   |
+----------------+---------------------------------------------+
| Embedding path | ../../result/embeddings_facebook_AW_256.csv |
+----------------+---------------------------------------------+
| Epochs         | 200                                         |
+----------------+---------------------------------------------+
| Gamma          | 0.500                                       |
+----------------+---------------------------------------------+
| Learning rate  | 0.010                                       |
+----------------+-------


Adjacency matrix powers:   0%|          | 0/4 [00:00<?, ?it/s]
Adjacency matrix powers:  50%|█████     | 2/4 [00:00<00:00,  4.80it/s]
Adjacency matrix powers:  75%|███████▌  | 3/4 [00:01<00:00,  2.16it/s]
Adjacency matrix powers: 100%|██████████| 4/4 [00:02<00:00,  1.22it/s]
Adjacency matrix powers: 100%|██████████| 4/4 [00:02<00:00,  1.51it/s]

Loss:   0%|          | 0/200 [00:00<?, ?it/s]
Attention Walk (Loss=48.3137):   0%|          | 0/200 [00:00<?, ?it/s]
Attention Walk (Loss=48.3137):   0%|          | 1/200 [00:00<01:21,  2.45it/s]
Attention Walk (Loss=48.2877):   0%|          | 1/200 [00:00<01:21,  2.45it/s]
Attention Walk (Loss=48.2877):   1%|          | 2/200 [00:00<01:22,  2.41it/s]
Attention Walk (Loss=47.8906):   1%|          | 2/200 [00:01<01:22,  2.41it/s]
Attention Walk (Loss=47.8906):   2%|▏         | 3/200 [00:01<01:22,  2.39it/s]
Attention Walk (Loss=47.2749):   2%|▏         | 3/200 [00:01<01:22,  2.39it/s]
Attention Walk (Loss=47.2749):   2%|▏         | 4/200 [00:01

## Test with the citation network
Infeasible!!

In [None]:
!cd AttentionWalk && python src/main.py --edge-path ../../data/cit-HepTh_edges.csv --embedding-path ../../result/cit-HepTh_embeddings_attention.csv --attention-path ../../result/cit-HepTh_attention.csv --epochs 176

## Test with the biological network
Infeasible!!!

In [None]:
!cd AttentionWalk && python src/main.py --edge-path ../../data/bio-CE-CX_edges.csv --embedding-path ../../result/bio-CE-CX_embeddings_attention.csv --attention-path ../../result/bio-CE-CX_attention.csv

## Test with the proteins network

In [1]:
!cd AttentionWalk && python src/main.py --edge-path ../../data/proteins/PROTEINS-full.csv --embedding-path ../../result/embeddings_PROTEINS_AW_128.csv --dimensions 128

+----------------+---------------------------------------------+
| Attention path |     ./output/chameleon_AW_attention.csv     |
| Beta           | 0.500                                       |
+----------------+---------------------------------------------+
| Dimensions     | 128                                         |
+----------------+---------------------------------------------+
| Edge path      | ../../data/proteins/PROTEINS-full.csv       |
+----------------+---------------------------------------------+
| Embedding path | ../../result/embeddings_PROTEINS_AW_128.csv |
+----------------+---------------------------------------------+
| Epochs         | 200                                         |
+----------------+---------------------------------------------+
| Gamma          | 0.500                                       |
+----------------+---------------------------------------------+
| Learning rate  | 0.010                                       |
+----------------+-------

  return self._with_data(data ** n)

Adjacency matrix powers:   0%|          | 0/4 [00:00<?, ?it/s]
Adjacency matrix powers:  25%|██▌       | 1/4 [00:00<00:01,  1.75it/s]
Adjacency matrix powers:  50%|█████     | 2/4 [00:01<00:01,  1.07it/s]
Adjacency matrix powers:  50%|█████     | 2/4 [00:01<00:01,  1.02it/s]
Traceback (most recent call last):
  File "c:\Users\pietr\OneDrive\Desktop\lfn_project\src\AttentionWalk\src\main.py", line 19, in <module>
    main()
  File "c:\Users\pietr\OneDrive\Desktop\lfn_project\src\AttentionWalk\src\main.py", line 14, in main
    model = AttentionWalkTrainer(args)
  File "c:\Users\pietr\OneDrive\Desktop\lfn_project\src\AttentionWalk\src\attentionwalk.py", line 78, in __init__
    self.initialize_model_and_features()
  File "c:\Users\pietr\OneDrive\Desktop\lfn_project\src\AttentionWalk\src\attentionwalk.py", line 84, in initialize_model_and_features
    self.target_tensor = feature_calculator(self.args, self.graph)
  File "c:\Users\pietr\OneDrive\Desktop

## Test with email-Enron
Infeasible

In [None]:
!cd AttentionWalk && python src/main.py --edge-path ../../data/email-Enron.csv --embedding-path ../../result/email-Enron_embeddings_attention.csv --attention-path ../../result/email-Enron_attention.csv

## Test with CL-100K-1d8-L9

In [1]:
!cd AttentionWalk && python src/main.py --edge-path ../../data/CL-100K-1d8-L9.csv --embedding-path ../../result/CL-100K-1d8-L9_embeddings_attention.csv --attention-path ../../result/CL-100K-1d8-L9_attention.csv

+----------------+------------------------------------------------------+
| Attention path |      ../../result/CL-100K-1d8-L9_attention.csv       |
| Beta           | 0.500                                                |
+----------------+------------------------------------------------------+
| Dimensions     | 128                                                  |
+----------------+------------------------------------------------------+
| Edge path      | ../../data/CL-100K-1d8-L9.csv                        |
+----------------+------------------------------------------------------+
| Embedding path | ../../result/CL-100K-1d8-L9_embeddings_attention.csv |
+----------------+------------------------------------------------------+
| Epochs         | 200                                                  |
+----------------+------------------------------------------------------+
| Gamma          | 0.500                                                |
+----------------+--------------------


Adjacency matrix powers:   0%|          | 0/5 [00:00<?, ?it/s]
Adjacency matrix powers:  40%|████      | 2/5 [00:00<00:00,  4.25it/s]
Adjacency matrix powers:  60%|██████    | 3/5 [00:24<00:20, 10.05s/it]
Adjacency matrix powers:  60%|██████    | 3/5 [01:46<01:11, 35.56s/it]
Traceback (most recent call last):
  File "c:\Users\pietr\OneDrive\Desktop\lfn_project\src\AttentionWalk\src\main.py", line 19, in <module>
    main()
  File "c:\Users\pietr\OneDrive\Desktop\lfn_project\src\AttentionWalk\src\main.py", line 14, in main
    model = AttentionWalkTrainer(args)
  File "c:\Users\pietr\OneDrive\Desktop\lfn_project\src\AttentionWalk\src\attentionwalk.py", line 71, in __init__
    self._initialize_model_and_data()
  File "c:\Users\pietr\OneDrive\Desktop\lfn_project\src\AttentionWalk\src\attentionwalk.py", line 75, in _initialize_model_and_data
    sparse_target_tensor = feature_calculator(self.args, self.graph)
  File "c:\Users\pietr\OneDrive\Desktop\lfn_project\src\AttentionWalk\src\utils

## Test with COX2-MD

In [1]:
!cd AttentionWalk && python src/main.py --edge-path ../../data/COX2-MD.csv --embedding-path ../../result/embeddings_COX2-MD_AW_256.csv --dimensions 256

+----------------+--------------------------------------------+
| Attention path |    ./output/chameleon_AW_attention.csv     |
| Beta           | 0.500                                      |
+----------------+--------------------------------------------+
| Dimensions     | 256                                        |
+----------------+--------------------------------------------+
| Edge path      | ../../data/COX2-MD.csv                     |
+----------------+--------------------------------------------+
| Embedding path | ../../result/embeddings_COX2-MD_AW_256.csv |
+----------------+--------------------------------------------+
| Epochs         | 200                                        |
+----------------+--------------------------------------------+
| Gamma          | 0.500                                      |
+----------------+--------------------------------------------+
| Learning rate  | 0.010                                      |
+----------------+----------------------

  return self._with_data(data ** n)

Adjacency matrix powers:   0%|          | 0/4 [00:00<?, ?it/s]
Adjacency matrix powers: 100%|██████████| 4/4 [00:00<00:00, 45.91it/s]

Loss:   0%|          | 0/200 [00:00<?, ?it/s]
Attention Walk (Loss=56.1408):   0%|          | 0/200 [00:02<?, ?it/s]
Attention Walk (Loss=56.1408):   0%|          | 1/200 [00:02<07:13,  2.18s/it]
Attention Walk (Loss=56.0613):   0%|          | 1/200 [00:04<07:13,  2.18s/it]
Attention Walk (Loss=56.0613):   1%|          | 2/200 [00:04<07:09,  2.17s/it]
Attention Walk (Loss=55.3078):   1%|          | 2/200 [00:06<07:09,  2.17s/it]
Attention Walk (Loss=55.3078):   2%|▏         | 3/200 [00:06<07:04,  2.16s/it]
Attention Walk (Loss=54.3311):   2%|▏         | 3/200 [00:08<07:04,  2.16s/it]
Attention Walk (Loss=54.3311):   2%|▏         | 4/200 [00:08<06:53,  2.11s/it]
Attention Walk (Loss=52.9198):   2%|▏         | 4/200 [00:10<06:53,  2.11s/it]
Attention Walk (Loss=52.9198):   2%|▎         | 5/200 [00:10<06:38,  2.05s/it]


# GAE

In [None]:
!cd gae && python setup.py install

In [None]:
!cd gae/gae && python train.py