


<p style="font-size:36px;text-align:center"> <b>Research Paper Recommendation using Information Retrieval</b> </p>

<h1>1. Business Problem</h1>

<h2> 1.1 Description </h2>

<p>Paper submission systems (CMT, OpenReview, etc.) require the users to upload paper titles and paper abstracts and then specify the subject areas their papers best belong to.</p>
<p>
Wouldn't it be nice if such submission systems suggested viable subject areas for each submitted paper?
</p>
<p>
This dataset would allow developers to build baseline models that might benefit this use case. Data analysts might also enjoy analyzing the intricacies
of different papers and how well their abstracts correlate to their noted categories. Additionally, we hope that the dataset will serve as a decent benchmark for building useful text classification systems.</p>
<br>
> Credits: Kaggle 


<h2> 1.2 Problem Statement </h2>

- We have the titles and abstracts of a collection of research papers.
- The papers cover topics such as machine learning, deep learning, and state-of-the-art deep learning models.
- Each paper is tagged with one or more subject areas.
- Our task: when a user provides the title and abstract of a research paper, the model should predict the subject areas that paper belongs to.

<h2> 1.3 Sources/Useful Links</h2>

- Source : https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts/data
<br><br>____ Useful Links ____
- Blog 1 : https://www.analyticsvidhya.com/blog/2021/06/part-20-step-by-step-guide-to-master-nlp-information-retrieval/
- Blog 2 : https://scikit-learn.org/stable/modules/neighbors.html

<h2>1.4 Real world/Business Objectives </h2>

- Beyond the classification problem, we extend the model into an **Information Retrieval (IR) system**.
- Given a user's query, the model recommends relevant research papers.

**Objective:**
- Develop a document retrieval system that combines text classification and information retrieval techniques to improve the accuracy and efficiency of retrieving relevant documents from large datasets.

1. **Text Classification:** Automatically categorize documents into predefined categories based on their content.

2. **Information Retrieval:** Enhance the retrieval process by ranking and retrieving documents based on their relevance to user queries.

## What is an **Information Retrieval (IR)** system?

<p>1. Information retrieval is the activity of obtaining resources that are relevant to an information need from a collection of such resources.</p>
<p>2. Searches can be based on full-text or other content-based indexing.</p>
<p>3. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.  Web search engines are the most visible IR applications.</p>
<p>4. An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.</p>
<p>5. User queries are matched against the database information. In information retrieval the results returned may or may not match the query, so results are typically ranked. This ranking of results is a key difference of information retrieval searching compared to database searching. </p>

<p>6. Most IR systems compute a numeric score for how well each object in the database matches the query, and rank the objects according to this value. The top-ranking objects are then shown to the user.</p>

<img src='relevant_output_about_information.jpg'>
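
The score-then-rank loop described in point 6 can be sketched in a few lines. This is only an illustration, not the method used later in this notebook: the three documents are made up, and `overlap_score` (a hypothetical helper) simply counts the query words that also occur in a document.

```python
# Toy IR loop: score every document against the query, then rank by score.
# The documents and the scoring function are illustrative assumptions.

def overlap_score(query, document):
    """Number of distinct query words that also occur in the document."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def rank(query, documents, top_k=2):
    """Return (score, document) pairs sorted by decreasing score."""
    scored = [(overlap_score(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

documents = [
    "graph neural networks for node classification",
    "image segmentation with convolutional networks",
    "data augmentation for graph neural networks",
]

results = rank("augmentation for graph networks", documents)
for score, doc in results:
    print(score, doc)
```

Several documents match the query to different degrees, and the ranking decides which ones the user sees first.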

<h1>2. Machine Learning Problem </h1>

<h3> 2.1 Data Overview </h3>

<p> 
- The data is in the file arxiv_data_210930-054931.csv <br>
- The file contains 3 columns: terms, titles, abstracts <br>
- Size of arxiv_data_210930-054931.csv - 70MB <br>
- Number of rows in arxiv_data_210930-054931.csv = 56,181
</p>

<h3> 2.2 Type of Machine Learning Problem </h3>

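<p> It is a multi-label classification problem: a paper can carry several subject-area terms (for example <code>['cs.LG', 'cs.AI']</code>), so for a given research paper we need to predict the set of subject areas it belongs to. </p>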
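
To make the label structure concrete, here is a minimal sketch of turning `terms` cells into multi-hot target rows. It assumes each cell is the string form of a Python list, which is how the values appear in the data preview in Section 3; the three sample cells are copied from that preview.

```python
import ast

# Sample `terms` cells. Each cell is assumed to be the string form of a Python list.
raw_terms = ["['cs.LG']", "['cs.LG', 'cs.AI']", "['cs.LG', 'cs.CR', 'stat.ML']"]

# Parse each string into a real list of labels
parsed = [ast.literal_eval(cell) for cell in raw_terms]

# Collect the label vocabulary in sorted order
labels = sorted({term for terms in parsed for term in terms})

# Build one multi-hot row per paper: 1 if the paper carries the label, else 0
multi_hot = [[int(label in terms) for label in labels] for terms in parsed]

print(labels)
for row in multi_hot:
    print(row)
```

A row with more than one `1` is what distinguishes this setting from single-label (multi-class) classification.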

<h2> 3. Importing the Libraries </h2>

In [1]:
import numpy as np
import pandas as pd
import sklearn
import nltk
import re
import warnings
warnings.filterwarnings('ignore')

**Let's import the article data**

In [3]:
df = pd.read_csv("arxiv_data_210930-054931.csv")

# Show the top few papers
df.head(5)

Unnamed: 0,terms,titles,abstracts
0,['cs.LG'],Multi-Level Attention Pooling for Graph Neural...,Graph neural networks (GNNs) have been widely ...
1,"['cs.LG', 'cs.AI']",Decision Forests vs. Deep Networks: Conceptual...,Deep networks and decision forests (such as ra...
2,"['cs.LG', 'cs.CR', 'stat.ML']",Power up! Robust Graph Convolutional Network v...,Graph convolutional networks (GCNs) are powerf...
3,"['cs.LG', 'cs.CR']",Releasing Graph Neural Networks with Different...,With the increasing popularity of Graph Neural...
4,['cs.LG'],Recurrence-Aware Long-Term Cognitive Network f...,Machine learning solutions for pattern classif...


<h3> 3.1 Copy Original Data to new Dataframe </h3>

In [5]:
papers = df.copy()

In [7]:
papers.shape

(56181, 3)

<h3> 3.2 Text Preprocessing </h3>

<h4> 3.2.1 Check for duplicate data </h4>

In [9]:
papers.duplicated().sum()

15054

<h4> 3.2.2 Remove duplicate data </h4>

In [11]:
papers.drop_duplicates(keep='first', inplace=True)

In [13]:
papers.shape

(41127, 3)
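
One detail worth keeping in mind after `drop_duplicates`: the DataFrame keeps its original index labels, so label-based access (such as `papers['titles'][10]`) and position-based access (such as `papers.iloc[10]`, or row 10 of a matrix built from the DataFrame) can refer to different rows. A minimal sketch with a made-up four-row frame:

```python
import pandas as pd

# Made-up frame with one duplicated row (label 1 duplicates label 0)
toy = pd.DataFrame({"titles": ["A", "A", "B", "C"]})

deduped = toy.drop_duplicates(keep="first")
print(list(deduped.index))  # the original labels survive the drop

# Label 2 and position 2 now point at different rows
by_label = deduped["titles"][2]          # row whose index label is 2
by_position = deduped["titles"].iloc[2]  # third remaining row
print(by_label, by_position)

# Resetting the index makes labels and positions coincide again
aligned = deduped.reset_index(drop=True)
print(list(aligned.index))
```

Labels and positions agree for a given row only if no duplicate was dropped before it; calling `reset_index(drop=True)` after deduplication is one way to remove the ambiguity.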

<h4> 3.2.3 Overview of data </h4>

In [15]:
print ("Title: ", papers['titles'][10])
print ('\n')
print ("Abstract: ", papers['abstracts'][10])

Title:  Local Augmentation for Graph Neural Networks


Abstract:  Data augmentation has been widely used in image data and linguistic data but
remains under-explored on graph-structured data. Existing methods focus on
augmenting the graph data from a global perspective and largely fall into two
genres: structural manipulation and adversarial training with feature noise
injection. However, the structural manipulation approach suffers information
loss issues while the adversarial training approach may downgrade the feature
quality by injecting noise. In this work, we introduce the local augmentation,
which enhances node features by its local subgraph structures. Specifically, we
model the data argumentation as a feature generation process. Given the central
node's feature, our local augmentation approach learns the conditional
distribution of its neighbors' features and generates the neighbors' optimal
feature to boost the performance of downstream tasks. Based on the local
augmentation,

In [17]:
print ("Title: ", papers['titles'][100])
print ('\n')
print ("Abstract: ", papers['abstracts'][100])

Title:  Tensor Networks for Probabilistic Sequence Modeling


Abstract:  Tensor networks are a powerful modeling framework developed for computational
many-body physics, which have only recently been applied within machine
learning. In this work we utilize a uniform matrix product state (u-MPS) model
for probabilistic modeling of sequence data. We first show that u-MPS enable
sequence-level parallelism, with length-n sequences able to be evaluated in
depth O(log n). We then introduce a novel generative algorithm giving trained
u-MPS the ability to efficiently sample from a wide variety of conditional
distributions, each one defined by a regular expression. Special cases of this
algorithm correspond to autoregressive and fill-in-the-blank sampling, but more
complex regular expressions permit the generation of richly structured data in
a manner that has no direct analogue in neural generative models. Experiments
on sequence modeling with synthetic and real text data show u-MPS outperform

<h4> 3.2.4 Function to clean Abstracts and Titles </h4>

In [19]:
# Function that cleans text by replacing '\x0c' and '\n' characters,
# as well as all non-alphabetic characters, with spaces and finally
# converts everything to lower case
def clean_text(text):
    control_chars = ['\x0c', '\n']
    for ch in control_chars:
        # str.replace returns a new string, so the result must be reassigned
        text = text.replace(ch, ' ')
    cleaned = re.sub('[^a-zA-Z]+', ' ', text)
    return cleaned.lower()

In [21]:
# Create a column for cleaned Abstract and cleaned Title
papers['clean_abstract'] = papers['abstracts'].apply(clean_text)
papers['clean_title'] = papers['titles'].apply(clean_text)

papers.head()

Unnamed: 0,terms,titles,abstracts,clean_abstract,clean_title
0,['cs.LG'],Multi-Level Attention Pooling for Graph Neural...,Graph neural networks (GNNs) have been widely ...,graph neural networks gnns have been widely us...,multi level attention pooling for graph neural...
1,"['cs.LG', 'cs.AI']",Decision Forests vs. Deep Networks: Conceptual...,Deep networks and decision forests (such as ra...,deep networks and decision forests such as ran...,decision forests vs deep networks conceptual s...
2,"['cs.LG', 'cs.CR', 'stat.ML']",Power up! Robust Graph Convolutional Network v...,Graph convolutional networks (GCNs) are powerf...,graph convolutional networks gcns are powerful...,power up robust graph convolutional network vi...
3,"['cs.LG', 'cs.CR']",Releasing Graph Neural Networks with Different...,With the increasing popularity of Graph Neural...,with the increasing popularity of graph neural...,releasing graph neural networks with different...
4,['cs.LG'],Recurrence-Aware Long-Term Cognitive Network f...,Machine learning solutions for pattern classif...,machine learning solutions for pattern classif...,recurrence aware long term cognitive network f...


In [25]:
print ("Title: ", papers['titles'][4])
print ('\n')
print ("Abstract: ", papers['abstracts'][4])

Title:  Recurrence-Aware Long-Term Cognitive Network for Explainable Pattern Classification


Abstract:  Machine learning solutions for pattern classification problems are nowadays
widely deployed in society and industry. However, the lack of transparency and
accountability of most accurate models often hinders their safe use. Thus,
there is a clear need for developing explainable artificial intelligence
mechanisms. There exist model-agnostic methods that summarize feature
contributions, but their interpretability is limited to predictions made by
black-box models. An open challenge is to develop models that have intrinsic
interpretability and produce their own explanations, even for classes of models
that are traditionally considered black boxes like (recurrent) neural networks.
In this paper, we propose a Long-Term Cognitive Network for interpretable
pattern classification of structured data. Our method brings its own mechanism
for providing explanations by quantifying the relevance 

In [27]:
print ("Title: ", papers['clean_title'][4])
print ('\n')
print ("Abstract: ", papers['clean_abstract'][4])

Title:  recurrence aware long term cognitive network for explainable pattern classification


Abstract:  machine learning solutions for pattern classification problems are nowadays widely deployed in society and industry however the lack of transparency and accountability of most accurate models often hinders their safe use thus there is a clear need for developing explainable artificial intelligence mechanisms there exist model agnostic methods that summarize feature contributions but their interpretability is limited to predictions made by black box models an open challenge is to develop models that have intrinsic interpretability and produce their own explanations even for classes of models that are traditionally considered black boxes like recurrent neural networks in this paper we propose a long term cognitive network for interpretable pattern classification of structured data our method brings its own mechanism for providing explanations by quantifying the relevance of each featu

<h2> 4. Advanced Text Preprocessing </h2>

- Once basic text preprocessing is done, we apply the following preprocessing steps.

1. **Tokenization:**
   * Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered as tokens. It is one of the initial steps of any NLP pipeline.

   * Example - \
              Input: "Tokenization is an important NLP task."  \
              Output: ["Tokenization", "is", "an", "important", "NLP", "task", "."]
</br>

2. **Stemming:**

   * <p>Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). The resulting stem is not necessarily a valid dictionary word, which distinguishes it from a lemma.</p>

   * <p>Stemming is studied in morphology and is widely used in information retrieval and information extraction. Because several forms of a word related to a subject may need to be matched to get the best results, stemming is commonly applied to both documents and queries in search engines.</p>

   * <p>Recognizing, searching and retrieving more forms of words returns more results. When a form of a word is recognized it can make it possible to return search results that otherwise might have been missed. That additional information retrieved is why stemming is integral to search queries and information retrieval.</p>


   * <p>Applications of stemming are:</p>
     <p>1. Stemming is used in information retrieval systems like search engines.</p>
     <p>2. It is used to determine domain vocabularies in domain analysis.</p>
     <p>3. Stemming is desirable because it reduces redundancy: most of the time a word stem and its inflected or derived forms mean the same thing.</p>
</br>

3. **Lemmatization:**

   * <p>The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.</p>

   * <p>However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.</p>

   * <p>Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.</p>

</br>

<p>The difference between stemming and lemmatization is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words that have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.</p>

</br>

**For instance:**

The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

The word "walk" is the base form for word "walking", and hence this is matched in both stemming and lemmatisation.

The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.
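
The tokenization example and the stemming cases above can be tried out with a short sketch. The regex tokenizer `simple_tokenize` is an assumption made for this illustration (the notebook itself uses `nltk.word_tokenize`, which needs the `punkt` data); the stemmer is NLTK's `SnowballStemmer`, the same one used below.

```python
import re
from nltk.stem.snowball import SnowballStemmer

# Simple regex tokenizer (illustrative only): runs of word characters,
# or single punctuation marks
def simple_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Tokenization is an important NLP task.")
print(tokens)

# The stemmer looks at one word at a time, without context or a dictionary
stemmer = SnowballStemmer("english")
stems = {word: stemmer.stem(word) for word in ["walking", "meeting", "better"]}
for word, stem in stems.items():
    print(word, "->", stem)
```

The stemmer reduces "walking" and "meeting" by stripping the suffix, regardless of whether "meeting" is used as a noun or a verb, and it has no way of linking "better" to "good".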

In [29]:
'''Use NLTK word_tokenize() and SnowballStemmer() to tokenize and stem document's Titles and Abstracts'''

# Create the stemmer once instead of on every call
stemmer = nltk.stem.snowball.SnowballStemmer("english")

# Function that takes text, tokenizes it and returns a list of stemmed tokens,
# keeping only stems longer than two characters
def tokenize_and_stem(text):
    tokens = nltk.word_tokenize(text)
    stems = (stemmer.stem(t) for t in tokens)
    return [s for s in stems if len(s) > 2]

<h2> 5. Feature Extraction </h2>

<h3> 5.1 Different Feature Extraction Techniques </h3>

<h4> 5.1.1 Bag of Words (BoW) </h4>

* <p>The bag of words model is used for text representation and feature extraction in natural language processing and information retrieval tasks. It represents a text document as a multiset of its words, disregarding grammar and word order, but keeping the frequency of words. This representation is useful for tasks such as text classification, document similarity, and text clustering. </p>

* <p>Bag-of-Words is one of the most fundamental methods to transform tokens into a set of features. The BoW model is used in document classification, where each word is used as a feature for training the classifier. For example, in review-based sentiment analysis, the presence of words like ‘fabulous’ or ‘excellent’ indicates a positive review, while words like ‘annoying’ or ‘poor’ point to a negative review. </p>
    

* <p>There are 3 steps while creating a BoW model : </p>
  <p>1. The first step is text-preprocessing </p>
  <p>2. The second step is to create a vocabulary of all unique words from the corpus. </p>
  <p>3. In the third step, we create a matrix of features by assigning a separate column to each word, while each row corresponds to a review/document. This process is known as text vectorization. Each entry in the matrix records how many times the word occurs in the document (in the binary variant, simply 1 if the word is present and 0 if it is not). </p>

* <p>Let’s start with an example to understand by taking some sentences and generating vectors for those. </p>
  <p>1. "John likes to watch movies. Mary likes movies too". </p>
  <p>2. "John also likes to watch football games". </p>

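* <p>With the vocabulary ordered as ["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"], the BoW count vectors for these two sentences are </p>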
  <p>[1, 2, 1, 1, 2, 1, 1, 0, 0, 0] </p>
  <p>[1, 1, 1, 1, 0, 0, 0, 1, 1, 1] </p>


* **Issues of Bag of Words:**
  <p>1. Lack of semantic information: As the bag of words model only considers individual words, it does not capture semantic relationships or the meaning of words in context. </p> 
  <p>2. Sparsity: For many applications, the bag of words representation of a document can be very sparse, meaning that most entries in the resulting feature vector will be zero. This can lead to issues with computational efficiency and difficulty in interpretability. </p>
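
The three BoW steps can be sketched with the standard library alone. The vocabulary order (order of first appearance) and the minimal tokenization are choices made for this illustration; they reproduce the two count vectors shown above.

```python
import re
from collections import Counter

sentences = [
    "John likes to watch movies. Mary likes movies too",
    "John also likes to watch football games",
]

# Step 1: minimal preprocessing, here just splitting into alphabetic tokens
tokenized = [re.findall(r"[A-Za-z]+", s) for s in sentences]

# Step 2: vocabulary of unique words, in order of first appearance
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Step 3: one count vector per sentence, one column per vocabulary word
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vec in vectors:
    print(vec)
```

Even in this tiny example three of the ten entries in each vector are zero, which hints at the sparsity issue for realistic vocabularies.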

<h4> 5.1.2 TF-IDF (Term Frequency-Inverse Document Frequency) </h4>

<font face='georgia'>
    
   <h4><strong>What does tf-idf mean?</strong></h4>

   <p>    
Tf-idf stands for <em>term frequency-inverse document frequency</em>, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
</p>
    
   <p>
One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.
</p>
    
    
</font>

<font face='georgia'>
    <h4><strong>How to Compute:</strong></h4>

Typically, the tf-idf weight is composed of two terms: the first is the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

 <ul>
    <li>
<strong>TF:</strong> Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length (aka. the total number of terms in the document) as a way of normalization: <br>

$TF(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}}.$
</li>
<li>
<strong>IDF:</strong> Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following: <br>

$IDF(t) = \log_{e}\frac{\text{Total  number of documents}} {\text{Number of documents with term t in it}}.$
<br>For numerical stability (avoiding a division by zero when a term appears in no document), we modify this formula slightly: <br>
$IDF(t) = \log_{e}\frac{\text{Total  number of documents}} {\text{Number of documents with term t in it}+1}.$
</li>
</ul>

<br>
<h4><strong>Example</strong></h4>
<p>

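Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Using a base-10 logarithm for easy arithmetic, the inverse document frequency (i.e., idf) is log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. (With the natural logarithm used in the formula above, the idf would be about 9.21 and the weight about 0.28; the choice of base only rescales all weights by a constant factor.)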
</p>
</font>
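
The arithmetic of the cat example can be checked with a small sketch. The helper names `tf` and `idf` are made up for this illustration. Note that the value 4 corresponds to a base-10 logarithm, whereas the IDF formula above is written with the natural logarithm, so both are computed here.

```python
import math

def tf(term_count, doc_length):
    """Term frequency: occurrences of the term divided by document length."""
    return term_count / doc_length

def idf(n_docs, n_docs_with_term, log=math.log, smooth=False):
    """Inverse document frequency; smooth=True adds 1 to the denominator."""
    denominator = n_docs_with_term + 1 if smooth else n_docs_with_term
    return log(n_docs / denominator)

tf_cat = tf(3, 100)

# Base-10 logarithm, as in the worked example
idf_cat_log10 = idf(10_000_000, 1_000, log=math.log10)
weight_log10 = tf_cat * idf_cat_log10

# Natural logarithm, as in the IDF formula
idf_cat_ln = idf(10_000_000, 1_000)
weight_ln = tf_cat * idf_cat_ln

print(tf_cat, idf_cat_log10, round(weight_log10, 4))
print(round(idf_cat_ln, 4), round(weight_ln, 4))
```

Switching the logarithm base multiplies every idf value by the same constant, so it changes the absolute weights but not the relative ranking of terms.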

<h3> 5.2 Create a tf-idf vectorizer using sklearn TfidfVectorizer </h3>

1. First we create the vectorizer, specifying the parameters:
    * max_df is the maximum allowable document frequency for a token. It is set to 0.50, so terms that appear in more than 50% of the documents are ignored.
    * min_df is the minimum allowable document frequency for a token. It is set to 1, so no term is dropped for being too rare.
    * max_features sets the maximum number of features allowed and is set to an arbitrarily large number (e.g. 200,000 for abstracts) so that, in practice, no feature is cut off by this limit.
    * stop_words specifies the words/tokens to remove from the corpus.
    * use_idf enables reweighting each feature by its inverse document frequency when set to True.
    * tokenizer specifies which tokenizer to use. We want to tokenize and stem, so we pass the tokenize_and_stem() function created above. (The default tokenizer keeps only tokens of two or more alphanumeric characters.)
2. We then fit the vectorizer to our cleaned text using *vectorizer.fit_transform()*
3. The output is an n*m matrix where n is the number of documents in our corpus and m is the number of features.
4. We can inspect the features using *vectorizer.get_feature_names_out()*

In [31]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rajan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [33]:
# Import the TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

# Create vectorizer for Abstracts. max_df=0.5 ignores terms that appear
# in more than 50% of the documents (i.e. very common terms).
# We keep at most 200,000 abstract features.
abstract_tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=1, max_features=200000,
               stop_words='english', use_idf=True, tokenizer=tokenize_and_stem)

# Create vectorizer for Titles. max_df=0.5 ignores terms that appear
# in more than 50% of the documents (i.e. very common terms).
# We keep at most 100,000 title features.
title_tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=1, max_features=100000,
               stop_words='english', use_idf=True, tokenizer=tokenize_and_stem)

<h3> 5.3 Compute TF-IDF weights for Abstracts and Title </h3>

In [35]:
tfidf_weights_abs = abstract_tfidf_vectorizer.fit_transform(papers['clean_abstract'])
tfidf_weights_title = title_tfidf_vectorizer.fit_transform(papers['clean_title'])

In [36]:
tfidf_weights_abs.shape

(41127, 40339)

In [37]:
tfidf_weights_title.shape

(41127, 13543)

<h3> 5.4 Get feature names for Abstract and Title models </h3>

In [38]:
# Get feature names for Abstract and Title models
tfidf_features_abs = abstract_tfidf_vectorizer.get_feature_names_out()
tfidf_features_title = title_tfidf_vectorizer.get_feature_names_out()

In [39]:
tfidf_features_abs

array(['aaa', 'aaae', 'aaai', ..., 'zzh', 'zzz', 'zzzjzzz'], dtype=object)

<h3> 6. Function to get the top-k features associated with a document </h3>

In [40]:
# Function for returning the top_k features of an Abstract
# or Title
def get_top_features(rownum, weights, features, top_k=30):

    # weights is a sparse matrix; convert only the requested row to a dense
    # 1-D vector (calling toarray() on the full matrix would be very memory-hungry)
    weight_vec = weights[rownum].toarray().ravel()

    # weight_vec is the tf-idf weight vector for the given row (abstract or title),
    # with one entry per vocabulary term

    # Sort the weights in decreasing order and keep the indices of the top_k weights
    top_idx = np.argsort(weight_vec)[::-1][:top_k]

    # Return the features that correspond to the top_k weights.
    # Note: if the row has fewer than top_k non-zero weights, the remaining
    # entries are zero-weight features and carry no information.
    return [features[i] for i in top_idx]


In [41]:
# Top k features of Abstract 5
get_top_features(5, tfidf_weights_abs, tfidf_features_abs)

['graph',
 'fgn',
 'node',
 'lifelong',
 'convert',
 'gnn',
 'turn',
 'stream',
 'featur',
 'structur',
 'independ',
 'continu',
 'wearabl',
 'inherit',
 'problem',
 'classif',
 'gnns',
 'bridg',
 'necessari',
 'new',
 'topolog',
 'fashion',
 'devic',
 'assum',
 'cnns',
 'neural',
 'classic',
 'complet',
 'signal',
 'usual']

In [42]:
# Top k features of Title 1
get_top_features(1, tfidf_weights_title, tfidf_features_title)

['conceptu',
 'size',
 'empir',
 'forest',
 'small',
 'differ',
 'similar',
 'decis',
 'sampl',
 'deep',
 'network',
 'foraminifera',
 'zstgan',
 'forarbitrari',
 'footwear',
 'forbidden',
 'forc',
 'forcenet',
 'ford',
 'forag',
 'foothil',
 'footstep',
 'footprint',
 'forecastnet',
 'footbal',
 'footag',
 'foot',
 'fool',
 'foodlogodet',
 'food']
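
In the title output above, only the first eleven entries (through 'network') look related to the paper. The remaining ones ('foraminifera', 'zstgan', ...) are most likely zero-weight vocabulary terms that fill the list because the title contains fewer than 30 weighted terms. A hypothetical variant, `top_nonzero_features`, that drops such entries could look like this (the vocabulary and weights are made up):

```python
import numpy as np

def top_nonzero_features(weight_vec, features, top_k=30):
    """Hypothetical variant of get_top_features that drops zero-weight terms."""
    order = np.argsort(weight_vec)[::-1][:top_k]
    return [str(features[i]) for i in order if weight_vec[i] > 0]

# Made-up vocabulary and a weight vector with only two non-zero entries
features = np.array(["deep", "food", "forest", "graph", "zebra"])
weight_vec = np.array([0.6, 0.0, 0.8, 0.0, 0.0])

top_terms = top_nonzero_features(weight_vec, features, top_k=4)
print(top_terms)
```

With the filter in place, a short title simply yields a shorter keyword list instead of one padded with unrelated terms.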

<h3> 7. Find Similar Abstracts and Titles using Nearest Neighbors model </h3>

- Using **Unsupervised k-NN** for information retrieval involves leveraging the k-nearest neighbors algorithm to find relevant items or documents without relying on labeled training data. Here’s how it can be applied in information retrieval:

* In information retrieval, you often need to find items (like documents, images, or products) similar to a query item. Unsupervised k-NN can help by:

1. **Constructing Feature Vectors:** Convert items and queries into feature vectors. For text, this might involve TF-IDF, word embeddings (e.g., Word2Vec, GloVe), or transformer-based embeddings (e.g., BERT). For images, this might involve feature vectors extracted using deep learning models.

2. **Calculating Distances:** Use a distance or similarity measure (such as Euclidean distance, Manhattan distance, or cosine similarity) to compare the query vector with the vectors of the items in the dataset.

3. **Finding Neighbors:** For a given query, find the k-nearest neighbors in the feature space, which are the k most similar items to the query.
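
The three steps can be sketched as a brute-force search in NumPy. The feature vectors are made up, and `nearest_neighbors` and `l2_normalize` are hypothetical helpers written for this illustration. The sketch also checks a useful identity: for unit-length vectors, ranking by Euclidean distance and ranking by cosine similarity give the same order.

```python
import numpy as np

def nearest_neighbors(query_vec, item_vecs, k=2):
    """Brute-force k-NN: indices of the k items closest to the query (Euclidean)."""
    distances = np.linalg.norm(item_vecs - query_vec, axis=1)
    return np.argsort(distances)[:k], distances

def l2_normalize(matrix):
    """Scale every row to unit length."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

# Made-up feature vectors for four items and one query
items = l2_normalize(np.array([[1.0, 0.0, 1.0],
                               [0.0, 1.0, 0.0],
                               [1.0, 1.0, 1.0],
                               [0.9, 0.1, 1.0]]))
query = l2_normalize(np.array([[1.0, 0.0, 0.9]]))[0]

idx, distances = nearest_neighbors(query, items, k=2)
cosine = items @ query

print(idx)
# For unit-length vectors: squared Euclidean distance = 2 - 2 * cosine similarity
print(np.allclose(distances ** 2, 2 - 2 * cosine))
```

This matters for TF-IDF features, since `TfidfVectorizer` scales each row to unit L2 norm by default.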

<h4> 7.1 Build model to return 10 closest neighbors </h4>

In [43]:

from sklearn.neighbors import NearestNeighbors

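# Create the k-NN model using k=10.
# The default metric is Euclidean distance; since TfidfVectorizer L2-normalizes
# each row by default, this yields the same ranking as cosine similarity.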
nn_abs = NearestNeighbors(n_neighbors=10, algorithm='auto')
nn_title = NearestNeighbors(n_neighbors=10, algorithm='auto')

# Fit the models to the TF-IDF weights matrix
nn_fitted_abs = nn_abs.fit(tfidf_weights_abs)
nn_fitted_title = nn_title.fit(tfidf_weights_title)


<h4> 7.2 Function to return the top-k nearest papers </h4>

In [44]:
def find_nearest_papers(row, kNNmodel, tfidf_weights, tfidf_features, papers):
    # `row` is a positional row index into tfidf_weights (and into papers via .iloc)

    # Get the top_k features of an Abstract or Title
    keywords = get_top_features(row, tfidf_weights, tfidf_features)

    # Get the 10 nearest neighbors (similar papers) using k-NN
    dist, idx = kNNmodel.kneighbors(tfidf_weights[row,:])

    # The closest neighbor is normally the query paper itself, so skip it
    # and suggest the next 5 most similar papers
    idx = list(idx[0][1:6])
    return {'papers':papers.iloc[idx], 'keywords':keywords}
    #return papers.iloc[idx, 1:3]
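
The reason for slicing the neighbor indices with `[1:6]` rather than `[0:5]` can be seen on a tiny example: when the query row is part of the fitted data, its nearest neighbor is itself. The 2-D points below are made up and merely stand in for TF-IDF rows.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up 2-D points standing in for TF-IDF rows
points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.0, 0.2]])

knn = NearestNeighbors(n_neighbors=3).fit(points)

# Query with a row that is itself part of the fitted data
dist, idx = knn.kneighbors(points[[0]])
print(idx[0])   # the first index is the query row itself
print(dist[0])  # and its distance is 0

# Drop the first hit to keep only the other neighbors
neighbors = [int(i) for i in idx[0][1:]]
print(neighbors)
```

If two rows are exactly identical, both are at distance 0 and the query row is not guaranteed to come first, so this slicing is a convenient heuristic rather than a strict guarantee.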

<h3> 8. Return papers based on Abstract similarity </h3>

Now that we have a function to return similar papers, we can use it to find papers with similar abstracts.

In [45]:
print ("Title: ", papers['titles'][10])
print ('\n')
print ("Abstract: ", papers['abstracts'][10])

Title:  Local Augmentation for Graph Neural Networks


Abstract:  Data augmentation has been widely used in image data and linguistic data but
remains under-explored on graph-structured data. Existing methods focus on
augmenting the graph data from a global perspective and largely fall into two
genres: structural manipulation and adversarial training with feature noise
injection. However, the structural manipulation approach suffers information
loss issues while the adversarial training approach may downgrade the feature
quality by injecting noise. In this work, we introduce the local augmentation,
which enhances node features by its local subgraph structures. Specifically, we
model the data argumentation as a feature generation process. Given the central
node's feature, our local augmentation approach learns the conditional
distribution of its neighbors' features and generates the neighbors' optimal
feature to boost the performance of downstream tasks. Based on the local
augmentation,

In [46]:
find_nearest_papers(10, nn_fitted_abs, tfidf_weights_abs, tfidf_features_abs, papers)['papers']

Unnamed: 0,terms,titles,abstracts,clean_abstract,clean_title
54045,"['cs.LG', 'stat.ML']",Data Augmentation for Graph Neural Networks,Data augmentation has been widely used to impr...,data augmentation has been widely used to impr...,data augmentation for graph neural networks
47255,"['cs.LG', 'stat.ML']",Graph Analysis and Graph Pooling in the Spatia...,The spatial convolution layer which is widely ...,the spatial convolution layer which is widely ...,graph analysis and graph pooling in the spatia...
19868,"['cs.LG', 'cs.CV', 'stat.ML']",Data Augmentation Revisited: Rethinking the Di...,Data augmentation has been widely applied as a...,data augmentation has been widely applied as a...,data augmentation revisited rethinking the dis...
15123,['cs.LG'],Graph Contrastive Learning with Adaptive Augme...,"Recently, contrastive learning (CL) has emerge...",recently contrastive learning cl has emerged a...,graph contrastive learning with adaptive augme...
29882,['cs.CV'],Adversarial Feature Augmentation for Unsupervi...,Recent works showed that Generative Adversaria...,recent works showed that generative adversaria...,adversarial feature augmentation for unsupervi...


<h3> 9. Return papers based on Title similarity </h3>

We can use the same function to find papers with similar titles.

In [47]:
find_nearest_papers(10, nn_fitted_title, tfidf_weights_title, tfidf_features_title, papers)['papers']

Unnamed: 0,terms,titles,abstracts,clean_abstract,clean_title
54045,"['cs.LG', 'stat.ML']",Data Augmentation for Graph Neural Networks,Data augmentation has been widely used to impr...,data augmentation has been widely used to impr...,data augmentation for graph neural networks
36964,"['cs.LG', 'stat.ML']",Non-Local Graph Neural Networks,Modern graph neural networks (GNNs) learn node...,modern graph neural networks gnns learn node e...,non local graph neural networks
52623,['cs.LG'],Graph Neural Networks with Local Graph Parameters,Various recent proposals increase the distingu...,various recent proposals increase the distingu...,graph neural networks with local graph parameters
134,"['cs.LG', 'eess.SP', 'stat.ML']",Graph Neural Network for Large-Scale Network L...,Graph neural networks (GNNs) are popular to us...,graph neural networks gnns are popular to use ...,graph neural network for large scale network l...
11690,"['cs.LG', 'stat.ML']",Reinforcement Learning using Augmented Neural ...,Neural networks allow Q-learning reinforcement...,neural networks allow q learning reinforcement...,reinforcement learning using augmented neural ...


**Conclusion:**
1. Both approaches, suggesting papers by Title similarity and by Abstract similarity, give good results.
2. Abstract similarity gives the more accurate results: the papers it suggests are closer to the query paper.
3. Hence we will use the Abstract similarity model to find similar papers, and we will suggest only the titles of those papers.
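
The comparison above is made by reading the returned papers. One possible way to put a number on it, offered only as a sketch, is to measure how much the subject-area labels in the `terms` column of each suggested paper overlap with those of the query paper. The helper names below are hypothetical, the example labels are made up for illustration and do not reproduce the notebook's results, and the code assumes `terms` may be stored as a string such as `"['cs.LG', 'stat.ML']"`.

```python
import ast


def parse_terms(raw):
    """Turn "['cs.LG', 'stat.ML']" (or an actual list) into a set of labels."""
    if isinstance(raw, str):
        raw = ast.literal_eval(raw)  # safely evaluates the list literal
    return set(raw)


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_term_overlap(query_terms, neighbour_terms):
    """Average Jaccard overlap between the query labels and each neighbour's labels."""
    query = parse_terms(query_terms)
    scores = [jaccard(query, parse_terms(t)) for t in neighbour_terms]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up labels for two hypothetical result lists, for illustration only
query_terms = "['cs.LG', 'stat.ML']"
neighbours_a = ["['cs.LG', 'stat.ML']", "['cs.CV']", "['cs.LG']"]
neighbours_b = ["['cs.LG', 'stat.ML']", "['cs.LG', 'stat.ML']", "['cs.LG']"]

print(mean_term_overlap(query_terms, neighbours_a))
print(mean_term_overlap(query_terms, neighbours_b))
```

A higher average means the suggested papers share more subject areas with the query paper. Label overlap is only a rough proxy for relevance, so it complements reading the results rather than replacing it.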

<h3> 10. Let's find similar papers using Abstract similarity </h3>

<h4> 10.1 Show the Titles of similar papers </h4>

In [48]:
nearest_papers = find_nearest_papers(200, nn_fitted_abs, tfidf_weights_abs, tfidf_features_abs, papers)

for paper in nearest_papers['papers']['titles']:
    print("Title: " + paper + "\n")

Title: Adaptive Neural Message Passing for Inductive Learning on Hypergraphs

Title: Hypergraph Modelling for Geometric Model Fitting

Title: HyperSF: Spectral Hypergraph Coarsening via Flow-based Local Clustering

Title: Noise-robust classification with hypergraph neural network

Title: Hypergraph Convolution and Hypergraph Attention



<h4> 10.2 Show the Abstracts of similar papers </h4>

In [49]:
for paper in nearest_papers['papers']['abstracts']:
    print("Abstract: " + paper + "\n")

Abstract: Graphs are the most ubiquitous data structures for representing relational
datasets and performing inferences in them. They model, however, only pairwise
relations between nodes and are not designed for encoding the higher-order
relations. This drawback is mitigated by hypergraphs, in which an edge can
connect an arbitrary number of nodes. Most hypergraph learning approaches
convert the hypergraph structure to that of a graph and then deploy existing
geometric deep learning methods. This transformation leads to information loss,
and sub-optimal exploitation of the hypergraph's expressive power. We present
HyperMSG, a novel hypergraph learning framework that uses a modular two-level
neural message passing strategy to accurately and efficiently propagate
information within each hyperedge and across the hyperedges. HyperMSG adapts to
the data and task by learning an attention weight associated with each node's
degree centrality. Such a mechanism quantifies both local and global 

<h3> 11. Save and Load the Model </h3>

<h4> 11.1 Save Model and Feature weights </h4>

In [65]:
import joblib

# Fitted nearest-neighbour models
joblib.dump(nn_fitted_abs, 'models/NN_abstract_model.pkl')
joblib.dump(nn_fitted_title, 'models/NN_title_model.pkl')
# TF-IDF weights and features for abstracts and titles
joblib.dump(tfidf_weights_abs, 'models/tfidf_abstract_weights.pkl')
joblib.dump(tfidf_weights_title, 'models/tfidf_title_weights.pkl')
joblib.dump(tfidf_features_abs, 'models/tfidf_abstract_features.pkl')
joblib.dump(tfidf_features_title, 'models/tfidf_title_features.pkl')

['models/tfidf_title_features.pkl']
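
The `joblib.dump` calls above write into `models/`, so that directory has to exist before the cell runs. A small sketch of how the paths could be prepared and checked is shown here; `prepare_artifact_paths` and `missing_artifacts` are hypothetical helpers, not part of joblib or of this notebook, and only the file names are taken from the cell above.

```python
from pathlib import Path

# File names used by the save cell above
ARTIFACT_NAMES = (
    "NN_abstract_model.pkl",
    "NN_title_model.pkl",
    "tfidf_abstract_weights.pkl",
    "tfidf_title_weights.pkl",
    "tfidf_abstract_features.pkl",
    "tfidf_title_features.pkl",
)


def prepare_artifact_paths(base_dir="models", names=ARTIFACT_NAMES):
    """Create base_dir if needed and map each file name to its full path."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return {name: base / name for name in names}


def missing_artifacts(paths):
    """List the file names that have not been written to disk yet."""
    return [name for name, path in paths.items() if not path.is_file()]


# Possible usage (assumes joblib and the fitted objects from the notebook):
#   paths = prepare_artifact_paths("models")
#   joblib.dump(nn_fitted_abs, str(paths["NN_abstract_model.pkl"]))
#   print(missing_artifacts(paths))  # names still to be saved
```

Calling `missing_artifacts` before the load step gives a clearer message than a failed `joblib.load` when one of the files was never written.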

<h4> 11.2 Load Model and Feature weights </h4>

In [67]:
NN_abstract_model = joblib.load('models/NN_abstract_model.pkl')
tfidf_abstract_weights = joblib.load('models/tfidf_abstract_weights.pkl')
tfidf_abstract_features = joblib.load('models/tfidf_abstract_features.pkl')

<h4> 11.3 Recommendation of Papers </h4>

In [69]:
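query_paper_title = 'HyperSAGE: Generalizing Inductive Representation Learning on Hypergraphs'
# Row labels of every paper whose title matches the query exactly
query_paper_index = papers.index[papers['titles'] == query_paper_title].tolist()
if not query_paper_index:
    raise ValueError(f"No paper titled {query_paper_title!r} in the dataset")
print(query_paper_index[0])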

200


<h5> 11.3.1 Show the Titles of similar papers </h5>

In [71]:
nearest_papers = find_nearest_papers(query_paper_index[0], NN_abstract_model, tfidf_abstract_weights, tfidf_abstract_features, papers)

for paper in nearest_papers['papers']['titles']:
    print("Title: " + paper + "\n")

Title: Adaptive Neural Message Passing for Inductive Learning on Hypergraphs

Title: Hypergraph Modelling for Geometric Model Fitting

Title: HyperSF: Spectral Hypergraph Coarsening via Flow-based Local Clustering

Title: Noise-robust classification with hypergraph neural network

Title: Hypergraph Convolution and Hypergraph Attention



<h5> 11.3.2 Show the Abstracts of similar papers </h5>

In [73]:
for paper in nearest_papers['papers']['abstracts']:
    print("Abstract: " + paper + "\n")

Abstract: Graphs are the most ubiquitous data structures for representing relational
datasets and performing inferences in them. They model, however, only pairwise
relations between nodes and are not designed for encoding the higher-order
relations. This drawback is mitigated by hypergraphs, in which an edge can
connect an arbitrary number of nodes. Most hypergraph learning approaches
convert the hypergraph structure to that of a graph and then deploy existing
geometric deep learning methods. This transformation leads to information loss,
and sub-optimal exploitation of the hypergraph's expressive power. We present
HyperMSG, a novel hypergraph learning framework that uses a modular two-level
neural message passing strategy to accurately and efficiently propagate
information within each hyperedge and across the hyperedges. HyperMSG adapts to
the data and task by learning an attention weight associated with each node's
degree centrality. Such a mechanism quantifies both local and global 