<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ca9e9; overflow:hidden"><b>Importing Libraries</b></div>

In [14]:
import pandas as pd
import numpy as np

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ca9e9; overflow:hidden"><b>Reading Data</b></div>

In [15]:
df = pd.read_csv("arxiv_data_210930-054931.csv")
df.head()

| | terms | titles | abstracts |
|---|---|---|---|
| 0 | ['cs.LG'] | Multi-Level Attention Pooling for Graph Neural... | Graph neural networks (GNNs) have been widely ... |
| 1 | ['cs.LG', 'cs.AI'] | Decision Forests vs. Deep Networks: Conceptual... | Deep networks and decision forests (such as ra... |
| 2 | ['cs.LG', 'cs.CR', 'stat.ML'] | Power up! Robust Graph Convolutional Network v... | Graph convolutional networks (GCNs) are powerf... |
| 3 | ['cs.LG', 'cs.CR'] | Releasing Graph Neural Networks with Different... | With the increasing popularity of Graph Neural... |
| 4 | ['cs.LG'] | Recurrence-Aware Long-Term Cognitive Network f... | Machine learning solutions for pattern classif... |


In [16]:
df.shape

(56181, 3)

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ca9e9; overflow:hidden"><b>Data Cleaning</b></div>

In [17]:
df.duplicated().sum() #check for duplicates

15054

In [18]:
df.drop_duplicates(inplace=True)

In [19]:
df.duplicated().sum()

0
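`drop_duplicates` keeps the original row labels, so the cleaned frame has gaps in its index. The sketch below uses a small made-up frame (`toy` is a hypothetical name) to show the difference between label-based and position-based lookup. Position-based lookup is what lines up with the rows of an embedding matrix built from the cleaned frame.

```python
import pandas as pd

# Toy frame standing in for the arXiv data (made-up rows, for illustration only).
toy = pd.DataFrame({
    "titles": ["A", "B", "A", "C"],
    "abstracts": ["a", "b", "a", "c"],
})

deduped = toy.drop_duplicates()

# The duplicate at label 2 is gone, but the surviving labels are unchanged.
print(deduped.index.tolist())     # [0, 1, 3]

# Position 2 (iloc) is the third surviving row, which carries label 3.
print(deduped.iloc[2]["titles"])  # C

# Optional: renumber so that labels and positions agree again.
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```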

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ca9e9; overflow:hidden"><b>Paper Recommendation</b></div>


In [20]:
df["content"] = df["titles"] + " " + df["abstracts"]  #combine titles and abstracts into one column
titles = df["titles"]
abstracts = df["abstracts"]
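String concatenation in pandas propagates missing values: if either `titles` or `abstracts` is missing for a row, `content` is missing for that row as well. Whether this dataset contains such rows is not shown here, so the following is only a defensive sketch on made-up data, with `toy` as a hypothetical frame.

```python
import pandas as pd

toy = pd.DataFrame({
    "titles": ["Paper one", "Paper two"],
    "abstracts": ["First abstract.", None],
})

# Plain concatenation: the missing abstract makes the whole result missing.
plain = toy["titles"] + " " + toy["abstracts"]
print(plain.isna().tolist())  # [False, True]

# Filling missing values first keeps the title; strip() drops the trailing space.
safe = (toy["titles"].fillna("") + " " + toy["abstracts"].fillna("")).str.strip()
print(safe.tolist())  # ['Paper one First abstract.', 'Paper two']
```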

In [8]:
!pip install -U -q sentence-transformers

In [9]:

# Load a pretrained model from the sentence-transformers library.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# all-MiniLM-L6-v2 is a MiniLM model fine-tuned on a large dataset of over
# 1 billion training pairs. It encodes sentences into fixed-size vectors (embeddings).
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode the combined title + abstract text of every row into embeddings.
# show_progress_bar=True displays a progress bar during encoding.
embeddings = model.encode(df["content"].tolist(), show_progress_bar=True)





Batches:   0%|          | 0/1286 [00:00<?, ?it/s]
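
The batch count in the progress bar can be cross-checked against the cleaning step. Assuming `encode` ran with its default `batch_size` of 32 (the call above does not override it), the rows left after removing duplicates should need 1286 batches:

```python
import math

total_rows = 56181        # rows reported by df.shape
duplicate_rows = 15054    # rows reported by df.duplicated().sum()
batch_size = 32           # assumed: the default batch_size of SentenceTransformer.encode

remaining_rows = total_rows - duplicate_rows
num_batches = math.ceil(remaining_rows / batch_size)

print(remaining_rows)  # 41127
print(num_batches)     # 1286
```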

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ca9e9; overflow:hidden"><b>Why select all-MiniLM-L6-v2?</b></div>




`all-MiniLM-L6-v2` is an all-round model tuned for many use cases. It was trained on a large and diverse dataset of over 1 billion training pairs.

It is small (about 80 MB) and still delivers good performance.

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ca9e9; overflow:hidden"><b>Save Files</b></div>

In [10]:
import pickle
#save the embeddings, titles, abstracts and model to disk using pickle

with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)

with open('titles.pkl', 'wb') as f:
    pickle.dump(titles, f)

with open('abstracts.pkl', 'wb') as f:
    pickle.dump(abstracts, f)

with open('rec_model.pkl', 'wb') as f:
    pickle.dump(model, f)
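
Pickle works for all four objects, but the embeddings are a plain NumPy array, so NumPy's own `.npy` format is an alternative worth knowing. The sketch below uses a random stand-in array and a temporary directory. The shape, file name, and variable names are assumptions for illustration and are not part of the notebook.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real embeddings (random values, made-up shape).
rng = np.random.default_rng(0)
fake_embeddings = rng.random((4, 8), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.npy")
    np.save(path, fake_embeddings)   # write the array in .npy format
    restored = np.load(path)         # read it back

print(restored.shape)                             # (4, 8)
print(np.array_equal(restored, fake_embeddings))  # True
```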

In [4]:
import pickle

<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ca9e9; overflow:hidden"><b>Loading Files</b></div>

In [5]:
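# Load the saved artifacts (assumes the pickle files were moved into a models/ directory).
def load_pickle(path):
    with open(path, 'rb') as f:  # the context manager closes the file after loading
        return pickle.load(f)
embeddings = load_pickle('models/embeddings.pkl')
titles = load_pickle('models/titles.pkl')
abstracts = load_pickle('models/abstracts.pkl')
rec_model = load_pickle('models/rec_model.pkl')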
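
Because recommendations are looked up by row position, the loaded objects need to stay aligned: row i of `embeddings` must belong to entry i of `titles` and `abstracts`. A small check along these lines can catch a mismatched set of files early. `check_alignment` is a hypothetical helper, demonstrated here on made-up data.

```python
import numpy as np

def check_alignment(embeddings, titles, abstracts):
    """Raise ValueError unless all three inputs describe the same number of papers."""
    n_rows = embeddings.shape[0]
    if not (n_rows == len(titles) == len(abstracts)):
        raise ValueError(
            f"size mismatch: {n_rows} embeddings, "
            f"{len(titles)} titles, {len(abstracts)} abstracts"
        )
    return n_rows

# Made-up example: three papers with 4-dimensional embeddings.
demo_embeddings = np.zeros((3, 4))
demo_titles = ["t1", "t2", "t3"]
demo_abstracts = ["a1", "a2", "a3"]
print(check_alignment(demo_embeddings, demo_titles, demo_abstracts))  # 3
```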




<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ca9e9; overflow:hidden"><b>Build Recommendation Function</b></div>

In [35]:
# Take a title as input and return the top N most similar papers by cosine similarity.
# The input title is encoded with the pretrained model and compared with the embeddings of all papers.
def recommend_papers(input_title, top_n=5):
    input_embedding = rec_model.encode([input_title])  # encode the input title
    sim_scores = cosine_similarity(input_embedding, embeddings)[0]  # similarity to every paper
    # Sort the indices in descending order of similarity and keep the top N.
    top_indices = sim_scores.argsort()[::-1][:top_n]
    # Rows of embeddings match row positions in df, so use positional lookup (iloc).
    return df.iloc[top_indices][['titles', 'abstracts']]

In [36]:
# example usage
recommended = recommend_papers("Attention Is All You Need", top_n=5)
recommended

| | titles | abstracts |
|---|---|---|
| 33669 | An Attentive Survey of Attention Models | Attention Model has now become an important co... |
| 38666 | Copy this Sentence | Attention is an operation that selects some la... |
| 1248 | Area Attention | Existing attention mechanisms are trained to a... |
| 36547 | AiR: Attention with Reasoning Capability | While attention has been an increasingly popul... |
| 34788 | Attention, please! A survey of Neural Attentio... | In humans, Attention is a core property of all... |
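
To make the ranking step concrete, here is cosine similarity and top-N selection written out in plain NumPy on made-up 2-D vectors. It is meant to mirror what `cosine_similarity` followed by `argsort` does in the function above, not to replace it. `top_n_by_cosine` and the toy vectors are hypothetical.

```python
import numpy as np

def top_n_by_cosine(query, matrix, top_n=2):
    """Return (indices, scores) of the top_n rows of matrix most similar to query."""
    # Cosine similarity = dot product divided by the product of the vector norms.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # argsort is ascending, so reverse it and keep the first top_n positions.
    order = np.argsort(scores)[::-1][:top_n]
    return order, scores[order]

docs = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # orthogonal to the query
    [1.0, 1.0],   # 45 degrees from the query
])
query = np.array([2.0, 0.0])

indices, scores = top_n_by_cosine(query, docs)
print(indices.tolist())              # [0, 2]
print(np.round(scores, 3).tolist())  # [1.0, 0.707]
```

Note that the length of the query vector does not matter: only its direction affects the score, which is why the query `[2.0, 0.0]` still scores exactly 1.0 against `[1.0, 0.0]`.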


<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:15px; color:white; margin:0; font-size:100%; font-family:Pacifico; background-color:#0ca9e9; overflow:hidden"><b>Thank You</b></div>