# Song recommendation with GNNs

Graph Neural Networks (GNNs) have recently gained popularity in both research and real-world applications, so I decided to test several models that learn, from the lyrics and from some other features of the songs, which songs are related to each other.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Installation of the libraries and download of the datasets

As the dataset I used a subset of the [Million Songs Dataset](http://millionsongdataset.com/), created by combining the musiXmatch and the Last.fm versions to retrieve the song lyrics and the similar songs, respectively. You can download the script that creates the dataset from Drive and execute it to also create the datasets for the analysis in R.

To build the models I used the [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/) (PyG) library, because it is probably the most efficient and best supported tool for working with GNNs.

In [1]:
import os
import torch

os.environ['TORCH'] = torch.__version__
print(f"torch version: {torch.__version__}")

!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}.html
!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}.html
!pip install -q torch-geometric -f https://data.pyg.org/whl/torch-${TORCH}.html

!pip install --upgrade --no-cache-dir -q gdown
import gdown

torch version: 1.13.1+cu116


In the following cells you can either download the raw data and build the dataset from the [Million Song Dataset](http://millionsongdataset.com/) (also using the musiXmatch and the Last.fm versions), or copy the already created files from Drive.

In [None]:
#@title #### Download script and create the datasets
#@markdown It first downloads the script and then it executes it to create the needed datasets.
#url_script = ""
#gdown.download(id=url_script, output="create_dataset.py", quiet=True)

!python create_dataset.py

In [None]:
#@title #### Load the files from Drive
#@markdown Download the training and evaluation dataset from Drive.

# File ids in Drive
url_train_data = "1-1I90HW0wZ1eVgXX9KETgEgYcnvJ0qAg"
url_validation_data = "1-1kUEXn2Hw_FB-xWbXk6njIao2RwhSSO"

gdown.download(id=url_train_data, output="songs_train.csv", quiet=False)
gdown.download(id=url_validation_data, output="songs_val.csv", quiet=False)

The downloaded files cannot be used directly with PyG, so I preprocessed the dataframe in order to create the nodes and the edges of the graph.

## Preprocessing to create graph data

As said before, the downloaded dataframe cannot be used as it is to build the graph. I have to preprocess the data in a specific way: creating the ids for the nodes, generating the edges and computing the features to use for these elements. First of all, let's import all the needed libraries.

In [2]:
# Import the PyG modules
from torch_geometric import seed_everything
from torch_geometric.data import Data, Dataset, InMemoryDataset
from torch_geometric.loader import DataLoader
from torch_geometric.nn import MessagePassing
from torch_geometric.transforms import RandomLinkSplit
from torch_geometric.utils import degree

# Basic modules to deal with dataframe
import pandas as pd
import numpy as np
import json

# Scikit-learn modules to create song lyrics embeddings
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import LabelEncoder, StandardScaler, Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Monitor the progress of the functions
from tqdm import tqdm
tqdm.pandas()
pd.options.mode.chained_assignment = None

# nltk preprocessing to remove stopwords
import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
stop = stopwords.words('english')

# Set a seed for every random operation
seed_everything(88)

### Download the preprocessed files

Download the preprocessed files from Drive, so that the preprocessing operations do not have to be run again, and load them into some variables.

In [3]:
url_graph_data = "1WKrN3NpXNdIovXdwEeGH9ZEUUYNOVI6j"
url_core_mapping = "1-3K1BqkUItn93_m7Wq1QGvESAXrt0RCd"
url_track_id_mapping = "1-548e0dADOLkbRoirJVTw5I2_ni1Ldif"
url_reduced_df = "1-1PXzF3XK6TsFdfvaxM708nCUrZKkRPz"

file_names = ['graph_data.pt', 'core_mapping.json', 'track_id_mapping.json', 'reduced_df.csv']

for name, url_id in zip(file_names, [url_graph_data, url_core_mapping, url_track_id_mapping, url_reduced_df]):
    gdown.download(id=url_id, output=name, quiet=True)

After downloading the pre-saved files from Drive, execute the following cell to load the data into the variables.

In [4]:
reduced_df = pd.read_csv(file_names[3])

with open(file_names[1], 'r') as fin1, open(file_names[2], 'r') as fin2:
    core_mapping = json.load(fin1)
    id_mapping = json.load(fin2)

graph_data = torch.load(file_names[0])

track_to_id, id_to_track = id_mapping['track_to_id'], id_mapping['id_to_track']
old_to_core_id, core_to_old_id = core_mapping['old_to_core_id'], core_mapping['core_to_old_id']

At this point you can skip the preprocessing part and go to the training section.

### Execute preprocessing

The following cells contain all the steps needed to build the files that I also saved to Drive.

In [5]:
#@title #### Preprocessing useful functions
#@markdown In this cell I implemented different functions to make the preprocessing operations.
def check_evaluation_dataset(target, track_id_list):
    '''
        It removes from the target string all the similar songs that are not
        in our main dataset.
    '''
    target_list = target.split(',')
    # We need to preserve the order of relevance for computing the mAP
    present = sorted(set(target_list).intersection(track_id_list), key=target_list.index)
    return ','.join(present)


def remove_stopwords(text, stop):
    '''
        It returns the string without the words within the list 'stop'.
    '''
    result = ' '.join([word for word in text.split(" ") if word not in stop])
    return result


def compute_LSA(corpus, max_features_tfidf=2000, k_svd=200):
    '''
        It builds a TF-IDF matrix from the corpus and reduces it with an SVD
        truncated at k_svd (LSA). The document vectors (U * sigma) are then
        L2-normalized.

        Parameters:
            - corpus: pd.Series
                The pandas series containing the texts used to build the
                matrix.
            - max_features_tfidf: int
                The maximum number of features (terms) to keep in the TF-IDF
                matrix.
            - k_svd: int
                The number of latent dimensions at which the SVD is truncated.
            
        Returns:
            - dict, torch.tensor, torch.tensor
                The vocabulary used by the TF-IDF vectorizer, the normalized
                documents matrix (U * sigma of the SVD decomposition) and the
                transposed terms matrix (lsa.components_, with shape
                (k_svd, n_terms)).
    '''
    # Keep only the max_features_tfidf most frequent terms, almost half in this case
    tfidf_vectorizer = TfidfVectorizer(max_features=max_features_tfidf)
    tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

    # Standardizing the vectors
    #tfidf_matrix = StandardScaler().fit_transform(tfidf_matrix.toarray())
    
    # Specify the number of latent dimensions
    lsa = TruncatedSVD(n_components=k_svd)
    
    # Apply the truncatedSVD function, the fit_transform function returns U*sigma
    documents_lsa = lsa.fit_transform(tfidf_matrix) 
    # Shape of the reduced matrix
    print(f"The documents matrix after the SVD decomposition has shape {documents_lsa.shape}")

    # Normalize the vectors
    documents_lsa = Normalizer(copy=False).fit_transform(documents_lsa)

    return tfidf_vectorizer.vocabulary_, torch.tensor(documents_lsa, dtype=torch.float32), torch.tensor(lsa.components_, dtype=torch.float32)


def project_text_lsa(text, vocabulary, words_matrix):
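    '''
        It projects a given text into the latent space: it first computes the
        TF-IDF vector using the same vocabulary used to build the latent space
        and then it projects the vector using the transposed terms matrix.

        Note: the vectorizer is re-fitted on the single input text, so the IDF
        weights learned on the training corpus are not reused here.
    '''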
    vectorizer = TfidfVectorizer(vocabulary=vocabulary)

    tf_idf_vector = torch.tensor(vectorizer.fit_transform([text]).toarray(), dtype=torch.float32)

    # Compute the projection
    ls_vector = tf_idf_vector @ words_matrix.t()

    return ls_vector


def map_string_ids_to_num(str_list, mapping):
    '''
        It converts a comma-separated string of track ids into the same string
        of numerical ids, dropping the ids that are not in the mapping.
    '''
    return ','.join(str(mapping[tid]) for tid in str_list.split(",") if tid in mapping)

Before creating the graph, the following operations are needed:
- remove the NA elements from the dataframe;
- keep only the nodes with a number of similar songs between 5 and 50 in order to reduce the dataset;
- remove the stopwords from the lyrics;
- keep in each list of similar songs only the songs that are still present in the dataset.
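
The last step can be illustrated with a small sketch. It is not the `check_evaluation_dataset` function of this notebook: the helper name `filter_similars` and the track ids are hypothetical, and, unlike the original, it does not remove duplicated ids. It only shows how a list of similar songs can be filtered against a set of known ids while preserving the order of relevance.

```python
def filter_similars(similars: str, known_ids: set) -> str:
    """Keep only the ids found in known_ids, preserving their original order."""
    return ",".join(tid for tid in similars.split(",") if tid in known_ids)


# Hypothetical track ids, not taken from the real dataset
known = {"TR_A", "TR_B", "TR_D"}
filtered = filter_similars("TR_D,TR_X,TR_A,TR_Y", known)
print(filtered)  # TR_D,TR_A
```

Testing membership against a `set` takes constant time per id, which matters when the same check is repeated for tens of thousands of rows.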

In [None]:
train_df = pd.read_csv('songs_train.csv')
print(f"The length of the dataframe is {len(train_df)}.\n")

train_df = train_df.dropna()
# Keep only the nodes that have between 5 and 50 similar songs (edges)
n_similars = train_df['similars'].str.split(',').apply(len)
reduced_df = train_df[n_similars.between(5, 50)].copy()

# Remove the stopwords from the lyrics
tqdm.pandas(desc="- Removing stopwords from lyrics")
reduced_df['lyrics'] = reduced_df['lyrics'].progress_apply(remove_stopwords, stop=stop)

# Check and keep only the lyrics with at least one word
reduced_df = reduced_df[~(reduced_df['lyrics'] == "")]
tqdm.pandas(desc="- (1) Keep only similar songs that are in the dataset")
reduced_df['similars'] = reduced_df['similars'].progress_apply(check_evaluation_dataset, track_id_list=reduced_df['track_id'].tolist())

# Remove the songs that are without similars
reduced_df = reduced_df[~(reduced_df.similars == "")].reset_index(drop=True)
# Last check after removal of songs without similars
tqdm.pandas(desc="- (2) Keep only similar songs that are in the dataset")
reduced_df['similars'] = reduced_df['similars'].progress_apply(check_evaluation_dataset, track_id_list=reduced_df['track_id'].tolist())

print(f"\n\nThe length of the reduced dataframe is {len(reduced_df)}.")

The length of the dataframe is 105031.



- Removing stopwords from lyrics: 100%|██████████| 75661/75661 [00:29<00:00, 2582.90it/s]
- (1) Keep only similar songs that are in the dataset: 100%|██████████| 75654/75654 [05:32<00:00, 227.47it/s]
- (2) Keep only similar songs that are in the dataset: 100%|██████████| 75645/75645 [05:50<00:00, 216.04it/s]


The length of the reduced dataframe is 75645.





After the previous steps I create a numerical id for the songs and for the tags (genres).

In [None]:
# Encode each song with a numerical ID
item_encoder = LabelEncoder()
reduced_df['item_id'] = item_encoder.fit_transform(reduced_df['track_id'])

# Encode the genre as numerical values
genre_encoder = LabelEncoder()
reduced_df['tag'] = genre_encoder.fit_transform(reduced_df['tag'])

I also create a mapping to convert the string ids to numerical ones and vice versa, and then I apply the conversion to the lists of similar songs.

In [None]:
# Create the mapping from track_id to item_id and vice versa
track_to_id = {}
id_to_track = {}
for tid, iid in zip(reduced_df['track_id'], reduced_df['item_id']):
    track_to_id[tid] = iid
    id_to_track[iid] = tid

# Convert the list of similar songs to list of numerical ids
tqdm.pandas(desc="Convert similar songs track ids to numerical ids")
reduced_df['similars'] = reduced_df['similars'].progress_apply(map_string_ids_to_num, mapping=track_to_id)

After all these operations I can create the graph structure that I will use later for the GNN model.

#### Creation of the nodes and edges data

In [None]:
# Install snap for simple graph creation
!pip install -q snap-stanford

import snap


In [None]:
G = snap.snap.TUNGraph().New()

# First add all song IDs as nodes in G
for i in tqdm(range(len(reduced_df))):
    song = int(reduced_df.loc[i, 'item_id'])
    if not G.IsNode(song):
        G.AddNode(song)
    # Add a node for each similar song and then add the edge
    for sim in reduced_df.loc[i, 'similars'].split(","):
        if not G.IsNode(int(sim)):
            G.AddNode(int(sim))
        G.AddEdge(song, int(sim))

print("Original graph:")
print(f"Num nodes: {len([x for x in G.Nodes()])} ({len(reduced_df['item_id'])} unique songs)")
print(f"Num edges: {len([x for x in G.Edges()])} (undirected)")

100%|██████████| 75645/75645 [00:04<00:00, 15317.53it/s]


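The graph is then reduced to its K-core, i.e. the maximal subgraph in which every node has at least K neighbours inside the subgraph. With K = 30 this keeps only the most densely connected songs and reduces the size of the graph.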

In [None]:
K = 30
kcore = G.GetKCore(K)
if kcore.Empty():
    raise Exception(f"No Core exists for K={K}")

print("K-core graph:")
print(f"Num nodes: {len([x for x in kcore.Nodes()])} (and unique songs)")
print(f"Num edges: {len([x for x in kcore.Edges()])} (undirected)")
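
As a reference for the K-core computed above, here is a minimal pure-Python sketch of the idea on a toy graph. It is not how snap implements `GetKCore`; the function name `k_core` and the toy graph are assumptions made for illustration. Nodes with fewer than k neighbours are removed repeatedly until no such node is left.

```python
def k_core(adj: dict, k: int) -> set:
    """Return the nodes of the k-core of an undirected graph.

    adj maps each node to the set of its neighbours.
    """
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            # Degree counted only among the nodes that are still alive
            if len(adj[node] & alive) < k:
                alive.remove(node)
                changed = True
    return alive


# Toy graph: a 4-clique (nodes 0-3) plus a chain 3-4-5
toy = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3, 5},
    5: {4},
}
print(sorted(k_core(toy, 3)))  # [0, 1, 2, 3]
```

Removing a node can lower the degree of its neighbours below k, which is why the pruning has to be repeated until the set stops changing.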

I need to re-index the nodes from 0 to n-1 to avoid problems with PyG, and I save the mapping so that I will be able to convert the new ids back to the original ones.

In [None]:
# We need to re-index the nodes, otherwise we will have problems later with PyG
old_to_core_id = {}
core_to_old_id = {}
for i, NI in enumerate(kcore.Nodes()):
    old_id = NI.GetId()
    assert old_id not in old_to_core_id
    new_id = i
    old_to_core_id[old_id] = new_id
    core_to_old_id[new_id] = old_id 

Now I can convert the snap graph into the edge list format expected by PyG and store it in a Data object.

In [None]:
# Convert snap graph to a format that can be used in PyG, converting to edge_index and storing in a PyG Data object
all_edges = []
for EI in tqdm(kcore.Edges()):
    edge_info = [old_to_core_id[EI.GetSrcNId()], old_to_core_id[EI.GetDstNId()]]
    all_edges.append(edge_info)
    # Also add the edge in the opposite direction because undirected
    all_edges.append(edge_info[::-1]) 
edge_idx = torch.LongTensor(all_edges)

graph_data = Data(edge_index=edge_idx.t().contiguous(), num_nodes=kcore.GetNodes())

176020it [00:01, 112467.72it/s]
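
The re-indexing and the conversion to an edge list can be sketched without snap or PyG. This is only an illustration under assumptions: the helper `build_edge_index` and the node ids are made up, and the new ids follow the sorted order of the old ones, whereas the cells above follow the iteration order of `kcore.Nodes()`.

```python
def build_edge_index(edges):
    """Re-index node ids to 0..n-1 and return both directions of each edge.

    edges is a list of (u, v) pairs of an undirected graph.
    Returns (mapping, edge_index) where edge_index is [sources, targets].
    """
    nodes = sorted({n for edge in edges for n in edge})
    mapping = {old: new for new, old in enumerate(nodes)}
    sources, targets = [], []
    for u, v in edges:
        a, b = mapping[u], mapping[v]
        # Store u->v and v->u because the graph is undirected
        sources += [a, b]
        targets += [b, a]
    return mapping, [sources, targets]


mapping, edge_index = build_edge_index([(10, 42), (42, 77)])
print(mapping)     # {10: 0, 42: 1, 77: 2}
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

The result has two rows and twice as many columns as undirected edges, which is the layout of `edge_index` in a PyG `Data` object.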


In order to use these data in a future run I save reduced_df, graph_data and the mappings in some files that I put on Drive. 

In [None]:
# Save Data object (for training model) and the reduced df for retrieving info about the songs in the future
torch.save(graph_data, 'graph_data.pt')
# Save the reduced df and the mapping from one index to another
reduced_df.to_csv('reduced_df_GNN.csv', index=False)
core_mapping = {'old_to_core_id': old_to_core_id, 'core_to_old_id': core_to_old_id}
id_mapping = {'track_to_id': track_to_id, 'id_to_track': id_to_track}

with open('core_mapping.json', 'w') as fp1, open('track_id_mapping.json', 'w') as fp2:
    json.dump(core_mapping, fp1)
    json.dump(id_mapping, fp2)
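
One detail is worth keeping in mind when these mappings are reloaded: JSON object keys are always strings, so integer keys come back as `str`. The short sketch below, with made-up ids, shows the round trip; this is why a lookup on the reloaded `core_to_old_id` needs a string key.

```python
import json

core_to_old = {0: 17, 1: 42}  # hypothetical ids
restored = json.loads(json.dumps(core_to_old))

print(restored)  # {'0': 17, '1': 42}

# Convert the keys back to int if integer lookups are preferred
restored_int = {int(k): v for k, v in restored.items()}
```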

### Create graph features to use in the GNN 

In [6]:
songs_vocabulary, songs_lsa, terms_lsa = compute_LSA(reduced_df['lyrics'], max_features_tfidf=2000, k_svd=200)

The documents matrix after the SVD decomposition has shape (75645, 200)
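
To see why a new TF-IDF vector can be projected into the latent space with the transposed terms matrix, here is a small NumPy sketch on a made-up document-term matrix. It uses the exact SVD from `numpy.linalg.svd` instead of scikit-learn's `TruncatedSVD`, so it illustrates the algebra rather than reproducing `compute_LSA`.

```python
import numpy as np

# Toy document-term matrix (3 documents, 4 terms); values are made up
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

k = 2  # number of latent dimensions kept
U, s, Vt = np.linalg.svd(X, full_matrices=False)
docs_latent = U[:, :k] * s[:k]   # U_k * Sigma_k, one row per document
terms = Vt[:k, :]                # plays the role of lsa.components_

# Projecting the original rows with the terms matrix gives the same coordinates
projected = X @ terms.T
print(np.allclose(docs_latent, projected))  # True
```

Since the rows of `Vt` are orthonormal, `X @ terms.T` equals `U_k * Sigma_k`, so the same product applied to an unseen TF-IDF vector places it in the same space as the training documents.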


In [7]:
import torch_geometric.transforms as T

class MSongsDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MSongsDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return 'graph_data.pt'

    @property
    def processed_file_names(self):
        return 'processed_data.pt'

    def download(self):
        # Download to `self.raw_dir`.
        pass

    def process(self):

        core_ids = [core_to_old_id[str(node.item())] for node in torch.unique(graph_data.edge_index[0,:])]
        core_songs = songs_lsa[core_ids, :]

        data = Data(x=core_songs, edge_index=graph_data.edge_index)
        # Transform to sparse tensor if the transformation is given
        data = data if self.pre_transform is None else self.pre_transform(data)
        
        torch.save(self.collate([data]), self.processed_paths[0])

In [8]:
graph_dataset = MSongsDataset(root='.', transform=T.ToSparseTensor())

Processing...
Done!


In [10]:
split = RandomLinkSplit(num_val=0.15, num_test=0.15, is_undirected=True,
                            add_negative_train_samples=False, neg_sampling_ratio=0.8, split_labels=True)
train_data, val_data, test_data = split(graph_dataset[0])
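
Conceptually, the split above partitions the edges at random into training, validation and test sets. The sketch below shows only this partitioning on a toy edge list; it is not PyG's `RandomLinkSplit`, it ignores negative sampling, and the function name `split_edges` and its defaults are assumptions for illustration.

```python
import random


def split_edges(edges, val_ratio=0.15, test_ratio=0.15, seed=0):
    """Randomly split a list of undirected edges into train/val/test lists."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_ratio))
    n_test = int(round(len(shuffled) * test_ratio))
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


toy_edges = [(i, i + 1) for i in range(20)]
train_e, val_e, test_e = split_edges(toy_edges)
print(len(train_e), len(val_e), len(test_e))  # 14 3 3
```

The edges held out for validation and test are the positive examples that the model has to recover at evaluation time.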

## MODELS

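### VAE GNN

The model below is a two-layer GCN encoder with a dot-product decoder, trained for link prediction with negative sampling (it has no variational component). It reads `edge_label` and `edge_label_index` from the split data, attributes that `RandomLinkSplit` creates only when `split_labels=False`, so with the split defined above (`split_labels=True`) the data must be split again before running this section.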

In [None]:
# https://towardsdatascience.com/graph-neural-networks-with-pyg-on-node-classification-link-prediction-and-anomaly-detection-14aa38fe1275

from sklearn.metrics import roc_auc_score
from torch_geometric.utils import negative_sampling
from torch_geometric.nn import GCNConv
import torch.nn.functional as F


class Net(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def encode(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

    def decode(self, z, edge_label_index):
        return (z[edge_label_index[0]] * z[edge_label_index[1]]).sum(
            dim=-1
        )  # product of a pair of nodes on each edge

    def decode_all(self, z):
        prob_adj = z @ z.t()
        return (prob_adj > 0).nonzero(as_tuple=False).t()
    

def train_link_predictor(
    model, train_data, val_data, optimizer, criterion, n_epochs=100
):
    train_losses, val_aucs = [], []
    for epoch in range(1, n_epochs + 1):

        model.train()
        optimizer.zero_grad()
        z = model.encode(train_data.x, train_data.edge_index)

        # sampling training negatives for every training epoch
        neg_edge_index = negative_sampling(
            edge_index=train_data.edge_index, num_nodes=train_data.num_nodes,
            num_neg_samples=train_data.edge_label_index.size(1), method='sparse')

        edge_label_index = torch.cat(
            [train_data.edge_label_index, neg_edge_index],
            dim=-1,
        )
        edge_label = torch.cat([
            train_data.edge_label,
            train_data.edge_label.new_zeros(neg_edge_index.size(1))
        ], dim=0)

        out = model.decode(z, edge_label_index).view(-1)
        loss = criterion(out, edge_label)
        loss.backward()
        optimizer.step()

        val_auc = eval_link_predictor(model, val_data)
        train_losses.append(loss.item())
        val_aucs.append(val_auc)

        if epoch % 10 == 0:
            print(f"Epoch: {epoch:03d}, Train Loss: {loss:.3f}, Val AUC: {val_auc:.3f}")

    return model, train_losses, val_aucs


@torch.no_grad()
def eval_link_predictor(model, data):

    model.eval()
    z = model.encode(data.x, data.edge_index)
    out = model.decode(z, data.edge_label_index).view(-1).sigmoid()

    return roc_auc_score(data.edge_label.cpu().numpy(), out.cpu().numpy())

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Net(graph_dataset.num_features, 128, 64).to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)
criterion = torch.nn.BCEWithLogitsLoss()

# Move the data to the gpu if available
train_data = train_data.to(device)
val_data = val_data.to(device)
test_data = test_data.to(device)

model, train_losses, val_aucs = train_link_predictor(model, train_data, val_data, optimizer, criterion)

In [None]:
# Test the model
test_auc = eval_link_predictor(model, test_data)
print(f"Test: {test_auc:.3f}")

Test: 0.961
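
The AUC reported above can be read as the probability that a randomly chosen existing edge receives a higher score than a randomly chosen non-existing one. The following pure-Python sketch computes it directly from this definition on made-up labels and scores; it is meant as an illustration, not as a replacement for `roc_auc_score`.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly, hence 0.75. The metric depends only on the ranking of the scores, not on their absolute values.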


### SAGEConv

In [12]:
emb = graph_dataset[0].x
adj_t = graph_dataset[0].adj_t
num_nodes = emb.size(0)

In [13]:
import torch
import torch.nn.functional as F
import random
import torch_geometric.transforms as T

from torch import Tensor
# torch's DataLoader (it shadows the PyG one imported earlier) is used to batch the edge indices
from torch.utils.data import DataLoader
from torch_geometric.utils import negative_sampling, convert, to_dense_adj
from torch_geometric.nn import GCNConv, SAGEConv
from torch_geometric.nn.conv import MessagePassing

In [14]:
class SAGE(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers,
                 dropout, aggr="add"):
        super(SAGE, self).__init__()

        self.convs = torch.nn.ModuleList()
        self.convs.append(SAGEConv(in_channels, hidden_channels, normalize=True, aggr=aggr))
        for _ in range(num_layers - 2):
            self.convs.append(SAGEConv(hidden_channels, hidden_channels, normalize=True, aggr=aggr))
        self.convs.append(SAGEConv(hidden_channels, out_channels, normalize=True, aggr=aggr))

        self.dropout = dropout

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()

    def forward(self, x, adj_t):
        for conv in self.convs[:-1]:
            x = conv(x, adj_t)
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.convs[-1](x, adj_t)
        return x


class DotProductLinkPredictor(torch.nn.Module):
    def __init__(self):
        super(DotProductLinkPredictor, self).__init__()

    def forward(self, x_i, x_j):
        out = (x_i*x_j).sum(-1)
        return torch.sigmoid(out)
    
    def reset_parameters(self):
        pass

In [15]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Initialize our model and LinkPredictor
hidden_dimension = 256
model = SAGE(emb.size(1), hidden_dimension, hidden_dimension, 7, 0.3).to(device)
predictor = DotProductLinkPredictor().to(device)

In [16]:
#@title #### Train loop
def create_train_batch(all_pos_train_edges, perm, edge_index):
    # First, we get our positive edges, with shape (2, len(perm))
    pos_edges = all_pos_train_edges[:, perm].to(device)

    # We then sample the negative edges using PyG functionality
    neg_edges = negative_sampling(edge_index, num_nodes=num_nodes,
                                  num_neg_samples=perm.shape[0], method='dense').to(device)

    # Our training batch is just the positive edges concatenated with the negative ones
    train_edge = torch.cat([pos_edges, neg_edges], dim=1)  

    # Our labels are all 1 for the positive edges and 0 for the negative ones                          
    pos_label = torch.ones(pos_edges.shape[1], )
    neg_label = torch.zeros(neg_edges.shape[1], )
    train_label = torch.cat([pos_label, neg_label], dim=0).to(device)

    return train_edge, train_label
  
def train(model, predictor, x, adj_t, train_data, loss_fn, optimizer, batch_size, num_epochs, edge_model=False, spd=None):
    # adj_t isn't used everywhere in PyG yet, so we switch back to edge_index for negative sampling
    row, col, edge_attr = adj_t.t().coo()
    edge_index = torch.stack([row, col], dim=0)

    model.train()
    predictor.train()

    model.reset_parameters()
    predictor.reset_parameters()

    all_pos_train_edges = train_data.pos_edge_label_index
    for epoch in range(num_epochs):
        epoch_total_loss = 0
        for perm in DataLoader(range(all_pos_train_edges.shape[1]), batch_size, shuffle=True):
            optimizer.zero_grad()

            train_edge, train_label = create_train_batch(all_pos_train_edges, perm, edge_index)

            # Use the GNN to generate node embeddings
            if edge_model:
                h = model(x, edge_index, spd)
            else:
                h = model(x, adj_t)

            # Get predictions for our batch and compute the loss
            preds = predictor(h[train_edge[0]], h[train_edge[1]])
            loss = loss_fn(preds, train_label)

            epoch_total_loss += loss.item()

            # Update our parameters
            loss.backward()
            # To avoid exploding gradient problem
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            torch.nn.utils.clip_grad_norm_(predictor.parameters(), 1.0)
            optimizer.step()
        print(f'Epoch {epoch} has loss {round(epoch_total_loss, 4)}')
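
The batches built in `create_train_batch` pair each positive edge with a sampled non-edge. As a rough sketch of what negative sampling does, the pure-Python function below draws random node pairs that are not known edges. It is not PyG's `negative_sampling`; the name `sample_negative_edges` and the toy edges are assumptions, and it expects the graph to have enough non-edges to sample from.

```python
import random


def sample_negative_edges(num_nodes, positive_edges, num_samples, seed=0):
    """Draw node pairs that are not in positive_edges (in either direction)."""
    rng = random.Random(seed)
    existing = {frozenset(e) for e in positive_edges}
    negatives = set()
    while len(negatives) < num_samples:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        pair = frozenset((u, v))
        # Skip self-loops, known edges and pairs already drawn
        if u == v or pair in existing or pair in negatives:
            continue
        negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]


pos = [(0, 1), (1, 2), (2, 3)]
neg = sample_negative_edges(5, pos, len(pos))

# Training batch: positives labelled 1, negatives labelled 0
batch = pos + neg
batch_labels = [1.0] * len(pos) + [0.0] * len(neg)
```

Drawing a fresh set of negatives for every batch exposes the model to many different non-edges over the course of training.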

In [17]:
!pip -q install torchmetrics


TODO: to evaluate the predictions I should take into account the similarity scores available in the dataset; otherwise the order of the predictions doesn't matter.

In [237]:
def get_similar_dict(edges, k):
    '''
        It returns the dictionary where for each node I have the list of connected
        (similar) ones.

        Parameters:
            - edges: torch.tensor
                The edges between the nodes.
            - k: int
                It is used to filter and keep only the songs with at least k
                similar songs.
        
        Returns:
            - dict
                The dictionary with the songs as key and the list of similar songs
                as values.
    '''
    songs = {}
    # Iterate over the columns of the edges tensor (with dim. (2, N))
    for i in range(edges.size(1)):
        src = edges[0, i].item()
        dest = edges[1, i].item()
        # Save for each song the similar ones
        if src not in songs:
            songs[src] = []
        if dest not in songs:
            songs[dest] = []
        # We need to save in both way since the edges are undirected
        songs[src].append(dest)
        songs[dest].append(src)

    # Delete the songs with fewer than k links from the dictionary
    songs = {song: similar for song, similar in songs.items() if len(similar) >= k}
    if len(songs) < 200:
        print(f"WARNING: the function kept {len(songs)} songs.")
    return songs


def compute_AP_k(predictions, target, k):
    '''
        It computes the average precision at k for the given lists.

        Parameters:
            - predictions: list
                The list that contains the ids of the songs predicted as similar.
            - target: list
                The groundtruth list of similar songs.
            - k: int
                The value to compute the AP at.
            
        Returns:
            - float
                It returns the AP@k.
    '''
    score, count = 0.0, 0

    # Take the minimum value between k and the number of predicted edges
    k = min(k, len(predictions))
    for i in range(1, k+1):
        if predictions[i-1] in target and predictions[i-1] not in predictions[0:(i-1)]:
            count += 1
            score = score + count/i 
    
    score = score / k
    return score


def compute_mAP_k(songs_dict, destinations, predictions, k):
    '''
        It computes the mean average precision at k on a set of predictions and
        target labels.

        Parameters:
            - songs_dict: dict
                The dictionary with the songs as keys and the similar ones as values.
            - destinations: torch.tensor
                The destination nodes for which I computed the predictions [src, dest]
                for each song.
            - predictions: torch.tensor
                The tensor that contains the probability to have a link between each
                pair of nodes.
            - k: int
                The value to compute the mAP at.

        Returns:
            - float
                It returns the mAP@k.
    '''
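    scores = []
    # Song ids in insertion order, matching the rows of 'predictions'
    song_ids = list(songs_dict.keys())
    # Iterate over the songs for which the predictions were computed
    for i in range(predictions.size(0)):
        # Take the id of the song
        song = song_ids[i]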
        # Retrieve the k most probable predicted links (edges)
        top_k_idx = torch.topk(predictions[i, :], k)[1]
        # Take the ids of the most probable predicted nodes
        predicted_k = destinations[i, top_k_idx]
        
        # The nodes similar to 'song'
        target_nodes = songs_dict[song][:k]
        # Compute the AP@k
        apk = compute_AP_k(predicted_k.tolist(), target_nodes, k)
        scores.append(apk)

    return np.mean(scores)


@torch.no_grad()
def test(model, predictor, embeddings, adj_t, test_data, k):
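    '''
        It tests the model and returns the mAP@k, given the model and the predictor.

        Parameters:
            - model:
                The GNN model that creates the embeddings for the nodes.
            - predictor:
                The predictor that returns the probability of a link for each pair
                of nodes.
            - embeddings: torch.tensor
                The input features (embeddings) of the nodes.
            - adj_t:
                The sparse adjacency matrix used for message passing.
            - test_data:
                The test split, from which the positive edges
                (pos_edge_label_index) are read.
            - k: int
                The value to compute the mAP at.

        Returns:
            - float
                The mAP@k.
    '''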
    model.eval()
    predictor.eval()

    h = model(embeddings, adj_t)

    # Create the dictionary that keep songs with at least k similar ones and the indices of the similars
    similar_songs = get_similar_dict(test_data.pos_edge_label_index, k)
    
    pos_eval_edge = test_data.pos_edge_label_index.to(device)

    num_test_nodes = k*2
    predictions = []
    destinations = []
    # TODO: repeat this operation n times and take the mean, because you will consider more random nodes
    for song, similars in similar_songs.items():

        sim_dest_nodes = torch.tensor(similars[:k])
        # I have to choose wrong, random nodes, therefore I remove the similars from the choice
        rand_nodes = pos_eval_edge[0, torch.isin(pos_eval_edge[0,:], sim_dest_nodes, invert=True)]
        # The number of random nodes to take
        rand_num = num_test_nodes - len(sim_dest_nodes)

        rand_nodes = torch.tensor(np.random.choice(rand_nodes, rand_num, replace=False))

        dest = torch.cat([sim_dest_nodes, rand_nodes])
        src = torch.full((len(dest), ), song)

        pred_song = predictor(h[src], h[dest]).squeeze().cpu()

        destinations.append(dest)
        predictions.append(pred_song)
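        # NOTE: this break stops the loop after the first song, so the value
        # returned below is the AP@k of that single song; remove it to average
        # over all the songs in similar_songs
        break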

    predictions = torch.stack(predictions)
    destinations = torch.stack(destinations)

    map_k = compute_mAP_k(similar_songs, destinations, predictions, k)

    return map_k

In [249]:
type(model)

__main__.SAGE

In [None]:
optimizer = torch.optim.Adam(
            list(model.parameters())  +
            list(predictor.parameters()), lr=0.01)

train(model, predictor, emb, adj_t, train_data, torch.nn.BCELoss(), 
      optimizer, 64 * 1024, 30)

In [243]:
map_k = test(model, predictor, emb, adj_t, test_data, 20)

In [244]:
map_k

0.27885507509346513
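
To help interpret this value, here is a small worked example of the average precision at k, written with the same normalisation as `compute_AP_k` (the sum of the precisions at each hit, divided by k). The function name `average_precision_at_k` and the song ids are hypothetical.

```python
def average_precision_at_k(predictions, target, k):
    """AP@k with the same normalisation as compute_AP_k (division by k)."""
    k = min(k, len(predictions))
    if k == 0:
        return 0.0
    target = set(target)
    seen = set()
    hits, score = 0, 0.0
    for rank, item in enumerate(predictions[:k], start=1):
        # A hit is a relevant item that was not already predicted earlier
        if item in target and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / k


# Hypothetical ranking: songs 7 and 3 are relevant, found at ranks 1 and 3
ap = average_precision_at_k([7, 5, 3, 9], target=[3, 7], k=4)
print(round(ap, 4))  # 0.4167
```

The two hits contribute 1/1 and 2/3, and their sum is divided by k = 4. With this normalisation a score of 1 is reached only when all the k predictions are relevant.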