# TITLE EMBEDDINGS AND SIMILARITY SCORE

This notebook runs the model that computes an embedding for every page title in the wiki_pagerank dataset. Run it before starting the gugol_main server and the interface.

First, let's load the necessary libraries.

In [1]:
from sentence_transformers import SentenceTransformer
import torch
import numpy as np
import pandas as pd


Check whether CUDA is available for GPU acceleration.

In [2]:
torch.cuda.is_available()  

True

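We extract the titles and node IDs from the CSV file produced by PageRank.py. The rows are sorted by node_id so that position i in the title list corresponds to node i (assuming the IDs are contiguous and start at 0); the category lookup at the end of the notebook relies on this.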

In [3]:
results_ds = pd.read_csv("results/wiki_pagerank_RNA_results.csv")
results_ds.sort_values(by="node_id", inplace=True)
results_ds.head()

Unnamed: 0,node_id,page_name,pagerank_score
351507,0,Chiasmal syndrome,3.4e-07
323596,1,Kleroterion,3.7e-07
653050,2,Pinakion,1.9e-07
1341340,3,LyndonHochschildSerre spectral sequence,1e-07
873932,4,Zariski's main theorem,1.5e-07


In [4]:
page_names = results_ds["page_name"].tolist()

In [5]:
len(page_names)

1791489

In [6]:
page_names[:10]

['Chiasmal syndrome',
 'Kleroterion',
 'Pinakion',
 'LyndonHochschildSerre spectral sequence',
 "Zariski's main theorem",
 'FultonHansen connectedness theorem',
 "Cayley's ruled cubic surface",
 'Annulus theorem',
 "Bing's recognition theorem",
 'BochnerMartinelli formula']

To calculate the embeddings we use the "all-MiniLM-L6-v2" model; see its Hugging Face page [here](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). The embeddings are saved to the data/embeddings.npy file.

In [None]:
model2 = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model2.encode(page_names, show_progress_bar=True, device="cuda:0", batch_size=64)  # Change device to "cpu" if you don't have a GPU
np.save("data/embeddings.npy", embeddings)


Batches: 100%|██████████| 27993/27993 [03:20<00:00, 139.94it/s]


In [None]:

embeddings = np.load("data/embeddings.npy")
embeddings = torch.tensor(embeddings)

Let's try the embeddings by checking similarity scores. 

First we write a query string and we calculate the query embedding. Then we calculate the cosine similarity with all the computed title embeddings.

In [9]:
query = "University"
query_embedding = model2.encode(query, device="cuda:0")
similarities = model2.similarity(query_embedding, embeddings)



In [10]:
sorted_indices = similarities[0].argsort(descending=True)

In [11]:
for i in sorted_indices[:10]:
    print(f"{page_names[i]}: {similarities[0][i]:.4f}")

University: 1.0000
University School: 0.9277
University College School: 0.8865
Campus university: 0.8792
Collegiate university: 0.8537
College: 0.8203
Corporate university: 0.8139
American University: 0.8132
Universities UK: 0.7995
University Link: 0.7993
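
For reference, the cosine similarity used above is the dot product of two vectors divided by the product of their norms. The sketch below reproduces that computation with plain NumPy on tiny made-up vectors. `cosine_similarities` and `top_k` are hypothetical helper names, not part of sentence_transformers, and the toy vectors only stand in for real title embeddings.

```python
import numpy as np


def cosine_similarities(query_vec, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    query_vec = np.asarray(query_vec, dtype=np.float64)
    matrix = np.asarray(matrix, dtype=np.float64)
    # Dot product of the query with every row
    dots = matrix @ query_vec
    # Product of the norms; assumes no all-zero vectors
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / norms


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return np.argsort(-scores)[:k]


# Toy 3-dimensional "embeddings" (made-up values for illustration only)
toy_titles = ["A", "B", "C"]
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
toy_query = np.array([1.0, 0.0, 0.0])

toy_scores = cosine_similarities(toy_query, toy_embeddings)
for i in top_k(toy_scores, 2):
    print(f"{toy_titles[i]}: {toy_scores[i]:.4f}")
```

Title "A" points in the same direction as the query and gets the top score, while "B" is orthogonal to it and scores zero.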


Now we combine the similarity score with the PageRank score as a weighted sum.

In [12]:
pagerank_scores = results_ds["pagerank_score"].tolist()
pagerank_scores = torch.tensor(pagerank_scores)
np.save("data/pagerank_scores.npy", pagerank_scores.numpy())

In [13]:
p_w = 0  # weight of the PageRank score (0 disables it, so the ranking equals the similarity-only ranking)
p_e = 1  # weight of the embedding similarity

final_scores = p_w * pagerank_scores + p_e * similarities[0]
sorted_final_indices = final_scores.argsort(descending=True)
for i in sorted_final_indices[:10]:
    print(f"{page_names[i]}: {final_scores[i]:.4f}")



University: 1.0000
University School: 0.9277
University College School: 0.8865
Campus university: 0.8792
Collegiate university: 0.8537
College: 0.8203
Corporate university: 0.8139
American University: 0.8132
Universities UK: 0.7995
University Link: 0.7993
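
The PageRank scores shown earlier are on the order of 1e-7, while cosine similarities lie in [-1, 1], so a plain weighted sum is dominated by the similarity term for any moderate `p_w`. One possible remedy, given here only as a sketch and not as the method used by this project, is to rescale the PageRank scores to [0, 1] before combining them. `min_max_normalize` and `combine_scores` are hypothetical helpers, and all numbers are made up.

```python
import numpy as np


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    values = np.asarray(values, dtype=np.float64)
    span = values.max() - values.min()
    if span == 0:
        # All values are equal: nothing to rank, return zeros
        return np.zeros_like(values)
    return (values - values.min()) / span


def combine_scores(similarities, pagerank, p_w, p_e):
    """Weighted sum of the similarity and the rescaled PageRank score."""
    similarities = np.asarray(similarities, dtype=np.float64)
    return p_w * min_max_normalize(pagerank) + p_e * similarities


# Made-up values for illustration only
toy_similarities = np.array([0.90, 0.85, 0.40])
toy_pagerank = np.array([1e-7, 5e-7, 3e-7])

combined = combine_scores(toy_similarities, toy_pagerank, p_w=0.2, p_e=0.8)
ranking = np.argsort(-combined)  # best first
print(ranking, combined)
```

In this toy case the second item overtakes the first after rescaling. With the raw scores it would stay behind, because 0.2 * 5e-7 is negligible next to the similarity term.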


Since we also want the categories of each page, let's extract them from the categories file.

In [14]:
categories = {}       # category name -> list of node IDs
node_categories = {}  # node ID -> list of category names
with open("data/wiki-topcats-categories.txt", "r") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        parts = line.split(';')
        if len(parts) < 2:
            continue
        category = parts[0].strip()
        # Remove the "Category:" prefix if it exists
        if category.startswith("Category:"):
            category = category[len("Category:"):].strip()

        node_ids = [int(x) for x in parts[1].split()]
        categories[category] = node_ids

        # Build the reverse mapping
        for node_id in node_ids:
            node_categories.setdefault(node_id, []).append(category)

In [16]:
for i in sorted_final_indices[:10]:
    print(node_categories.get(i.item(), []), page_names[i], final_scores[i].item())

['Youth'] University 0.9999999403953552
['Private_schools_in_Ohio'] University School 0.9276874661445618
['Old_Gowers', "Member_schools_of_the_Headmasters'_and_Headmistresses'_Conference"] University College School 0.8864802122116089
['School_types'] Campus university 0.8791838884353638
['School_types'] Collegiate university 0.8537181615829468
['Youth', 'School_types'] College 0.8203128576278687
['Alternative_education'] Corporate university 0.8138669729232788
['Article_Feedback_Pilot', 'National_Association_of_Independent_Colleges_and_Universities_members', 'Middle_States_Association_of_Colleges_and_Schools'] American University 0.8132139444351196
['University_associations_and_consortia'] Universities UK 0.7994534969329834
['Proposed_public_transportation_in_the_United_States'] University Link 0.7992901802062988
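
The parsing loop above expects each line of the categories file to hold a category name, a semicolon, and a space-separated list of node IDs. The sketch below applies the same steps to two sample lines so that the resulting reverse mapping is easy to inspect. The sample lines and the `parse_category_line` helper are hypothetical and are not taken from the real file.

```python
def parse_category_line(line):
    """Return (category, node_ids) for one line, or None if it is malformed."""
    parts = line.strip().split(";")
    if len(parts) < 2:
        return None
    category = parts[0].strip()
    # Remove the "Category:" prefix if it exists
    if category.startswith("Category:"):
        category = category[len("Category:"):].strip()
    node_ids = [int(x) for x in parts[1].split()]
    return category, node_ids


# Hypothetical sample lines in the assumed "Category:<name>; <id> <id> ..." format
sample_lines = [
    "Category:School_types; 10 11 12",
    "Category:Youth; 10 42",
    "",  # blank lines are skipped
]

sample_node_categories = {}
for line in sample_lines:
    parsed = parse_category_line(line)
    if parsed is None:
        continue
    category, node_ids = parsed
    for node_id in node_ids:
        sample_node_categories.setdefault(node_id, []).append(category)

print(sample_node_categories)
```

Node 10 appears in both sample lines, so it ends up with two categories, much like the 'College' row in the output above.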
