# arXiv Paper Recommendations 

In this notebook, we explore three well-known natural language processing models for recommending articles similar to a user's input. The three models are:

1. pre-trained Word2Vec 
2. Doc2Vec with Distributed Bag of Words
3. Doc2Vec with Distributed Memory.

We implement functions that let the user enter one or more articles (denoted by their arXiv ids) and request some number $n>0$ of recommendations. We use the cosine similarity of the word or document embeddings of the article abstracts as our similarity measure.

### Note: In this notebook, recommendations for multiple inputs are made in the following manner: 
#### (1) For the abstract of each input article, compute the cosine similarities with all of the articles in the dataset and add them to a results dataset.

#### (2) Sort the rows of the dataset created in step (1) from highest to lowest cosine similarity. Remove all duplicate articles, keeping only the row with the largest cosine similarity for each article. Return the first $n$ articles in the dataset.
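
The two steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the input names, paper ids and scores are made up, and the helper `top_n_recommendations` is hypothetical (the notebook itself works with pandas DataFrames).

```python
# Step (1), already done here: pooled (input article, candidate id, cosine similarity)
# triples for every input article. All values are invented for illustration.
scores = [
    ("input_A", "paper_1", 0.91),
    ("input_A", "paper_2", 0.40),
    ("input_A", "paper_3", 0.75),
    ("input_B", "paper_1", 0.55),
    ("input_B", "paper_2", 0.88),
    ("input_B", "paper_3", 0.10),
]

def top_n_recommendations(scores, n):
    # Step (2): sort from highest to lowest cosine similarity
    ranked = sorted(scores, key=lambda row: row[2], reverse=True)
    seen = set()
    results = []
    for _, candidate_id, sim in ranked:
        # Because the rows are sorted, the first occurrence of a candidate
        # is the one with its largest cosine similarity
        if candidate_id in seen:
            continue
        seen.add(candidate_id)
        results.append((candidate_id, sim))
    # Return the first n distinct candidates
    return results[:n]

top_two = top_n_recommendations(scores, 2)
print(top_two)  # [('paper_1', 0.91), ('paper_2', 0.88)]
```

Note that `paper_2` is recommended because of its similarity to `input_B`, even though it is a poor match for `input_A`: each candidate is scored by its best match among the inputs.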

In [76]:
import pandas as pd
import numpy as np

import arxiv
import time
from string import punctuation

import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist

import gensim.downloader as api
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from multipledispatch import dispatch
from data_utils import format_query, query_to_df, clean_data, clean_authors

First we load in our pre-cleaned dataset. Our dataset contains 20,000 arXiv papers.

In [2]:
## Load in the dataset
## This file keeps the stopwords, but removes words of freq=1 in the corpus from the tokens
#df = pd.read_parquet("./data/filter_20k_tokenized_stopwords.parquet")

## This file removes both stopwords and words of freq=1 in the corpus from the tokens
df = pd.read_parquet("./data/filter_20k_tokenized.parquet")
df.head(1)

Unnamed: 0,id,title,abstract,update_date,authors_parsed,strip_cat,clean_title,clean_abstract,clean_authors,abstract_tokenized,abstract_reduced_tokens
182244,1412.3275,Limit cycles bifurcating from a degenerate center,We study the maximum number of limit cycles ...,2014-12-11,"[['Llibre', 'J.', ''], ['Pantazi', 'C.', '']]",[DS],limit cycles bifurcating from a degenerate center,we study the maximum number of limit cycles th...,"[['llibre', 'j', ''], ['pantazi', 'c', '']]","[we, study, the, maximum, number, of, limit, c...","[study, maximum, number, limit, cycles, bifurc..."


We will reset the index values for ease of analysis.

In [3]:
df = df.reset_index()
df = df.rename(columns={"index": "original_index"})
df.head(1)

Unnamed: 0,original_index,id,title,abstract,update_date,authors_parsed,strip_cat,clean_title,clean_abstract,clean_authors,abstract_tokenized,abstract_reduced_tokens
0,182244,1412.3275,Limit cycles bifurcating from a degenerate center,We study the maximum number of limit cycles ...,2014-12-11,"[['Llibre', 'J.', ''], ['Pantazi', 'C.', '']]",[DS],limit cycles bifurcating from a degenerate center,we study the maximum number of limit cycles th...,"[['llibre', 'j', ''], ['pantazi', 'c', '']]","[we, study, the, maximum, number, of, limit, c...","[study, maximum, number, limit, cycles, bifurc..."


In [4]:
df['abstract_reduced_tokens'].isnull().values.any()

False

In [5]:
# Rejoin the tokens for use with some models
df['abstract_reduced'] = df['abstract_reduced_tokens'].apply(" ".join)
df.head(1)

Unnamed: 0,original_index,id,title,abstract,update_date,authors_parsed,strip_cat,clean_title,clean_abstract,clean_authors,abstract_tokenized,abstract_reduced_tokens,abstract_reduced
0,182244,1412.3275,Limit cycles bifurcating from a degenerate center,We study the maximum number of limit cycles ...,2014-12-11,"[['Llibre', 'J.', ''], ['Pantazi', 'C.', '']]",[DS],limit cycles bifurcating from a degenerate center,we study the maximum number of limit cycles th...,"[['llibre', 'j', ''], ['pantazi', 'c', '']]","[we, study, the, maximum, number, of, limit, c...","[study, maximum, number, limit, cycles, bifurc...",study maximum number limit cycles bifurcate de...


Next, we create a corpus with all the words in the abstracts of all of the papers in the dataset. Then, we find their frequency distribution.

We will need the list of words with frequency 1 when we process the abstracts of the papers in the test data: we remove those words from the abstracts of the user's papers before inputting them into our models, which matches the cleaning already applied to the dataset.

In [6]:
def get_freq1_corpus(df):
    indices = df.index.values

    # Corpus will be a list of lists
    corpus = []
    for i  in indices:
        corpus.append(df['abstract_tokenized'][i])

    # Convert a list of lists to a list because FreqDist takes in a
    # list of strings
    flat_corpus = []
    for sublist in corpus:
        for item in sublist:
            flat_corpus.append(item)

    # Create a frequency distribution from the flattened corpus
    freq = FreqDist(flat_corpus)
    #print("There are", len(freq), "words in the frequency distribution.")

    df_fdist = pd.DataFrame(list(freq.items()), columns = ["Word","Frequency"])

    ## Create a list of the words that appear only once
    unique_words = list(df_fdist[df_fdist['Frequency'] == 1]['Word'])
    #print("There are", len(unique_words), "words that appear only once in the abstracts.")
    return unique_words

In [7]:
## The list of frequency 1 words in the entire corpus
unique_words = get_freq1_corpus(df)
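
As a small illustration of how such a list can be used, the sketch below builds the frequency-1 words of a toy corpus and removes them from a new tokenized abstract. The corpus, the tokens and the helper `drop_freq1_words` are assumptions made up for this example; they are not part of `data_utils`.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical; stands in for df['abstract_tokenized'])
corpus_tokens = [
    ["limit", "cycles", "of", "planar", "systems"],
    ["planar", "polynomial", "systems"],
    ["limit", "cycles", "bifurcate"],
]

# Count every token in the corpus and keep those that appear exactly once
counts = Counter(token for doc in corpus_tokens for token in doc)
freq1_words = {word for word, count in counts.items() if count == 1}

def drop_freq1_words(tokens, freq1_words):
    """Remove tokens that occur only once in the training corpus."""
    return [t for t in tokens if t not in freq1_words]

user_tokens = ["polynomial", "limit", "cycles", "of", "systems"]
filtered = drop_freq1_words(user_tokens, freq1_words)
print(sorted(freq1_words))  # ['bifurcate', 'of', 'polynomial']
print(filtered)             # ['limit', 'cycles', 'systems']
```

Storing the words in a set rather than a list keeps each membership test fast, which matters when the list of frequency-1 words is long.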

Now we define the functions for the various models we will use to compute cosine similarities and offer recommendations.

In [8]:
## A function to get the vector norm
def norm(u):
    return np.sqrt(np.sum(np.power(u,2)))

## A function to get the cosine similarity
def cos_sim(u,v):
    if norm(u)*norm(v) > 0:
        return (u.dot(v))/(norm(u)*norm(v))
    else:
        return np.nan
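
For reference, the cosine similarity of two vectors $u$ and $v$ is $\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$. The following dependency-free sketch mirrors the logic of `cos_sim` above on a few hand-picked vectors; the name `cosine_similarity` and the example vectors are chosen here purely for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; NaN if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u * norm_v > 0:
        return dot / (norm_u * norm_v)
    return math.nan

# Vectors pointing in the same direction: similarity is (up to rounding) 1
parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Perpendicular vectors: similarity is 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# A zero vector has no direction, so the similarity is undefined (NaN)
undefined = cosine_similarity([0.0, 0.0], [1.0, 2.0])

print(parallel, orthogonal, undefined)
```

The NaN case is why the recommendation functions below have to handle missing similarity values before sorting.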

### Functions for returning the top $n$ most similar papers to a paper already in the dataset

As practice for our recommendations with our user test data, we first consider making recommendations for a paper that is already present in our dataset.

In [10]:
## Auxiliary function
"""
    Display the top n most similar articles in the dataset to an 
    article that is already in the dataset.
    
    Inputs:
    df: a DataFrame with all of the articles
    df_sim: a DataFrame with the top n most similar articles to article_index 
            and their computed cosine similarities
            
    article_index: an integer that is an index of an article in the DataFrame;
                   should range from 0 to len(df)-1 
"""
def print_similar(df, df_sim, article_index):
    
    print("The top", len(df_sim), "articles most similar to the article \n\n", 
            article_index, ".", df['title'][article_index])
    print("-----------------------------------------------------\n")
    
    i = 1
    for index in df_sim.index.values: 
        print(i, ".", "(", index , ")", df['title'][index], 
          ", Cosine Similarity=", np.round(df_sim['Cosine Similarity'][index], 3))
        print()
        i = i + 1
        
"""
    Replaces NaN entries with -1, so that they sort last, and returns
    the modified copy.
    
    Inputs:
    x: a list or NumPy array of cosine similarities
"""        
def remove_nan(x):
    temp = x.copy()
    for i in range(len(temp)):
        if np.isnan(temp[i]):
            temp[i] = -1
    return temp

## For use with CountVectorizer and TfidfVectorizer
"""
    Returns a DataFrame with the top n most similar articles from the dataframe
    to the input article, ranked by cosine similarity, along with article_index.
    
    Inputs:
    df: a DataFrame with all of the articles
    df_vectorized: a dataframe of word frequencies
    article_index: index of the article we want to compare cosine similarities to
    n: number of most similar articles to search for
"""
@dispatch(pd.core.frame.DataFrame, pd.core.frame.DataFrame, int, int)

def get_n_most_similar(df, df_vectorized, article_index , n):
    
    print("Using a CountVectorizer or TfidVectorizer\n")
    
    # Calculate the cosine similarity scores between article_index and every other article
    cosine_sim_list = np.zeros(len(df))

    for i in range(len(df)):
        
        if i != article_index:

            text_to_vector_v1 = df_vectorized.iloc[i].values
            text_to_vector_v2 = df_vectorized.iloc[article_index].values
            
            sim_scores = cos_sim(text_to_vector_v1, text_to_vector_v2)
            cosine_sim_list[i] = sim_scores
   
    ## Caution: there may be some cosine sims with value NaN
    cosine_sim_list = remove_nan(cosine_sim_list)
    ## Getting indices of n maximum values
    x = np.argsort(cosine_sim_list)[::-1][:n]
     
    
    ## Create a dataframe with the results
    df_sim = pd.DataFrame(columns=df.columns)
    df_sim['Cosine Similarity'] = []
   
    for index in x:
        df_sim.loc[index] = df.iloc[index]
        df_sim.at[index, 'Cosine Similarity'] = cosine_sim_list[index]
          
    return df_sim, article_index

###################################################################

## For use with Word2Vec
## Word2Vec supports a function called n_similarity
"""
    Returns a DataFrame with the top n most similar (by cosine similarity)
    articles to article_index, along with article_index. 
    
    Inputs: 
    model: a trained Word2Vec model
    df: a DataFrame with all of the articles
    article_index: an integer that is an index of an article in the DataFrame;
                   should range from 0 to len(df)-1 
    n: number of most similar articles to search for
    
"""
@dispatch(gensim.models.keyedvectors.KeyedVectors, pd.core.frame.DataFrame, int, int)

def get_n_most_similar(model, df, article_index , n):
    
    print("Using Word2Vec\n")
     
    cosine_sim_list = np.zeros(len(df))

    for i in range(len(df)):      
        # Calculate the cosine similarity scores with the i-th article in the dataset
        
        if i != article_index and len(df['abstract_reduced_tokens'][i]) != 0:
        # The difference here is the use of the n_similarity function
            cosine_sim_list[i]  = model.n_similarity(df['abstract_reduced_tokens'][article_index], 
                                                     df['abstract_reduced_tokens'][i])
        
    ## Caution: there may be some cosine sims with value NaN
    cosine_sim_list = remove_nan(cosine_sim_list)
    ## Getting indices of the n maximum values
    x = np.argsort(cosine_sim_list)[::-1][:n]
            
    ## Create a dataframe with the results
    df_sim = pd.DataFrame(columns=df.columns)
    df_sim['Cosine Similarity'] = []
    
    for index in x:
        df_sim.loc[index] = df.iloc[index].copy()
        df_sim.at[index, 'Cosine Similarity'] = cosine_sim_list[index]
        
    return df_sim, article_index

###################################################################

## Doc2Vec supports a function called most_similar 
"""
    Returns a DataFrame with the top n most similar (by cosine similarity)
    articles to article_index, along with article_index.    
    
    Inputs: 
    model: a trained Doc2Vec model
    df: a DataFrame with all of the articles
    article_index: an integer that is an index of an article in the DataFrame;
                   should range from 0 to len(df)-1 
    n: number of most similar articles to search for
"""
@dispatch(gensim.models.doc2vec.Doc2Vec, pd.core.frame.DataFrame, int, int)

def get_n_most_similar(model, df, article_index , n):
    
    print("Using Doc2Vec\n")
    
    # dv.most_similar returns the same values as d2v_model.dv.similarity(i, j)
    topn = model.dv.most_similar(article_index, topn=n)
   
    article_indices = [x[0] for x in topn]
    cos_sims = [x[1] for x in topn]
    
    ## Create a dataframe with the results
    df_sim = pd.DataFrame(columns=df.columns)
    
    for index in article_indices:
        df_sim.loc[index] = df.iloc[index]
    
    df_sim['Cosine Similarity'] = cos_sims
        
    return df_sim, article_index
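
The selection step shared by the functions above (take the $n$ largest cosine similarities while ignoring the query article and undefined scores) can be sketched without pandas or NumPy. This is a variant for illustration, not the notebook's exact code: instead of replacing NaN by -1 it drops those entries, and `top_n_indices` and its `exclude` parameter are hypothetical names.

```python
import math

def top_n_indices(similarities, n, exclude=None):
    """Return the indices of the n largest similarities, skipping NaN values
    and the optional index `exclude` (e.g. the query article itself)."""
    candidates = [
        (sim, i)
        for i, sim in enumerate(similarities)
        if i != exclude and not math.isnan(sim)
    ]
    # Sort by similarity, highest first
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in candidates[:n]]

# Index 0 is the query article; index 2 has an undefined similarity
sims = [1.0, 0.2, math.nan, 0.9, 0.5]
best = top_n_indices(sims, 2, exclude=0)
print(best)  # [3, 4]
```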

### Functions for returning the top $n$ most similar papers to a paper NOT already in the dataset

In [13]:
## For use with Word2Vec
## Word2Vec supports a function called n_similarity

## user_tokens: the tokenized and cleaned abstract of the user's input

@dispatch(gensim.models.keyedvectors.KeyedVectors, pd.core.frame.DataFrame, list, int)
def get_n_most_similar(model, df, user_tokens, n):
    
    print("Using Word2Vec\n")
     
    cosine_sim_list = np.zeros(len(df))

    for i in range(len(df)):      
        # Calculate the cosine similarity scores with the i-th article in the dataset,
        # skipping articles whose token list is empty (n_similarity fails on empty lists)
        if len(df['abstract_reduced_tokens'][i]) != 0:
            cosine_sim_list[i]  = model.n_similarity(user_tokens, df['abstract_reduced_tokens'][i])
        
    ## Getting indices of the n maximum values
    x = np.argsort(cosine_sim_list)[::-1][:n]
            
    ## Create a dataframe with the results
    df_sim = pd.DataFrame(columns=df.columns)
    df_sim['Cosine Similarity'] = []
    
    for index in x:
        df_sim.loc[index] = df.iloc[index].copy() 
        df_sim.at[index, 'Cosine Similarity'] = cosine_sim_list[index]
        
    return df_sim

###################################################################

## Doc2Vec Recommender
## use the infer_vector function (may not be necessary?)
## Choose the first paper in our dataset

@dispatch(gensim.models.doc2vec.Doc2Vec, pd.core.frame.DataFrame, list, int)

def get_n_most_similar(d2v_model, df, user_vector, n):
    print("Using Doc2Vec\n")
    
    topn = d2v_model.dv.most_similar([user_vector], topn=n)
    
    article_indices = [x[0] for x in topn]
    cos_sims = [x[1] for x in topn]
    
    ## Create a dataframe with the results
    df_sim = pd.DataFrame(columns=df.columns)
    
    for index in article_indices:
        df_sim.loc[index] = df.iloc[index].copy()
    
    df_sim['Cosine Similarity'] = cos_sims
        
    return df_sim

### Functions for finding the top $n$ recommendations based on $m$ user inputs

In [14]:
## Auxiliary function
"""
    Prints the titles of the papers the user inputted
    and the top n recommendations stored in df_results.
    
    df_results: results of similar articles based on cosine similarity
    df_user: the dataset of articles the user has input
"""
def display_results(df_results, df_user):
    print("The top", len(df_results), "articles most similar to the articles: \n\n")   
    ## Print the titles of the user's articles
    for i in range(len(df_user)):
        title = df_user.iloc[i]['title']
        print(i+1, ".", title)  
        
    print("\n#############################################################\n")
    
    for i in range(len(df_results)):
       
        match_index = df_results.index.values[i]
        title = df_results.loc[match_index]['title']
        authors = df_results.loc[match_index]['authors_parsed']
        abstract = df_results.loc[match_index]['abstract']
      #  link = df_results.loc[match_index]['entry_id'] 
        # Named sim_score to avoid shadowing the cos_sim function defined earlier
        sim_score = df_results.loc[match_index]['Cosine Similarity']
        
        print(i+1, ".", "(", match_index, ")", title, 
              "\n [ Cosine Similarity=", np.round(sim_score, 3) ,"]\n")
     #   print("\n", authors)
        print("\n", abstract)
      #  print("\n", link) 
        print("\n-----------------------------------------------------\n")

In [15]:
## For use with Word2Vec
"""
    Returns a dataframe of the top n most similar (by cosine similarity)
    article titles to the user's inputs. 
    
    Inputs:
    model: a trained Word2Vec model
    df: a DataFrame with all of the articles  
    df_user: a dataframe of the user's article inputs   
    n: number of recommendations to return 
"""
## Note: Word2Vec's most_similar function returns the top n most similar words,
## which isn't what we need 
@dispatch(gensim.models.keyedvectors.KeyedVectors, pd.core.frame.DataFrame,
                                              pd.core.frame.DataFrame, int)

def n_recommendations(model, df, df_user, n):
    
    print("Using Word2Vec")
    
    ## Create a DataFrame to store the similarity results     
    df_sim_scores = pd.DataFrame(columns=df.columns)
    df_sim_scores['Cosine Similarity'] = []
    
    for article_index in df_user.index.values: 
    
        user_abstract = df_user['abstract_reduced_tokens'][article_index]
    
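        for i in range(len(df)): 
            if len(df['abstract_reduced_tokens'][i]) != 0:
                # Calculate the cosine similarity scores with the i-th article in the dataset
                sim_score = model.n_similarity(user_abstract, df['abstract_reduced_tokens'][i])
                # Append the article as a new row and store its score in that same row.
                # (Indexing the score by i would overwrite earlier scores when there
                # are several user articles or when empty abstracts are skipped.)
                row = len(df_sim_scores)
                df_sim_scores.loc[row] = df.iloc[i]
                df_sim_scores.at[row, 'Cosine Similarity'] = sim_score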
    
    ## Sort the cosine similarity scores in the dataframe
    df_sim_scores = df_sim_scores.sort_values(by=['Cosine Similarity'], ascending=False)
        
    ## Now remove duplicate articles. Since the rows are sorted, drop_duplicates keeps
    ## the first row for each id, i.e. the one with the largest cosine similarity.
    df_sim_scores = df_sim_scores.drop_duplicates(subset=['id'])
           
     ## Get the first n articles in the dataframe
    df_top_n = df_sim_scores.head(n)

    return df_top_n, df_user

###################################################################

"""
    Returns a dataframe of the top n most similar (by cosine similarity)
    article titles to the user's inputs. 
    
    Inputs:
    model: a trained Doc2Vec model
    df: a DataFrame with all of the articles  
    df_user: a dataframe of the user's article inputs   
    n: number of recommendations to return 
"""
@dispatch(gensim.models.doc2vec.Doc2Vec, pd.core.frame.DataFrame,
                                      pd.core.frame.DataFrame, int)

def n_recommendations(model, df, df_user, n):
    
    print("Using Doc2Vec")
    
    ## Create a DataFrame to store the similarity results     
    df_sim_scores = pd.DataFrame(columns=df.columns)
    df_sim_scores['Cosine Similarity'] = []
    
    for article_index in df_user.index.values: 
    
        user_abstract = df_user['abstract_reduced_tokens'][article_index]
        
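        for i in range(len(df)):      
            # Calculate the cosine similarity scores with the i-th article in the dataset   
            vector = df['abstract_reduced_tokens'][i] 
            
            if len(vector) != 0:
                sim_score = model.wv.n_similarity(user_abstract, vector)
                # Append the article as a new row and store its score in that same row.
                # (Indexing the score by i would overwrite earlier scores when there
                # are several user articles or when empty abstracts are skipped.)
                row = len(df_sim_scores)
                df_sim_scores.loc[row] = df.iloc[i]
                df_sim_scores.at[row, 'Cosine Similarity'] = sim_score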
    
    ## Sort the cosine similarity scores in the dataframe
    df_sim_scores = df_sim_scores.sort_values(by=['Cosine Similarity'], ascending=False)
        
    ## Now remove duplicate articles. Since the rows are sorted, drop_duplicates keeps
    ## the first row for each id, i.e. the one with the largest cosine similarity.
    df_sim_scores = df_sim_scores.drop_duplicates(subset=['id'])

     ## Get the first n articles in the dataframe
    df_top_n = df_sim_scores.head(n)

    return df_top_n, df_user

## The models

Here we train the models that we will be working with for the rest of this notebook.

### CountVectorizer

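Creates a (sparse) matrix in which each row is a document and each column is a unique term (here, a word or an n-gram of up to three words); each entry counts how often the term occurs in the document. This matrix is also known as a document-term matrix (DTM).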

In [16]:
## max_df: When building the vocabulary ignore terms that have a document frequency 
##         strictly higher than the given threshold (corpus-specific stop words).

count_vectorizer = CountVectorizer(analyzer="word", 
                                tokenizer=nltk.word_tokenize,
                                preprocessor=None, 
                               # stop_words='english', 
                                max_features=2500,
                                ngram_range=(1,3))
                                    ##  max_df=.9
    
bow = count_vectorizer.fit_transform(df['abstract_reduced'])

df_bow = pd.DataFrame(bow.toarray(),
                      columns = count_vectorizer.get_feature_names_out())
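
To make the structure of a document-term matrix concrete, here is a hand-rolled sketch on three made-up documents. It only counts single words; the `CountVectorizer` configured above additionally creates columns for two- and three-word n-grams and keeps only the 2500 most frequent terms.

```python
from collections import Counter

# Three tiny made-up "abstracts"
docs = ["limit cycles limit", "planar systems", "limit systems"]

# One column per unique word, in alphabetical order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document; each entry is the count of that word in the document
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['cycles', 'limit', 'planar', 'systems']
for row in matrix:
    print(row)
# [1, 2, 0, 0]
# [0, 0, 1, 1]
# [0, 1, 0, 1]
```

The rows of this matrix are the vectors whose cosine similarities are compared when recommending articles.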

### TfidfVectorizer


TF stands for term frequency and IDF stands for inverse document frequency. 
The document frequency of a given term is the number of documents that contain that term. 

Unlike CountVectorizer, which uses the raw count of each term, 
TfidfVectorizer down-weights terms that appear in many of the documents.

To compute the tf-idf of a term for a given document, multiply the term frequency of the term within that document by the $\log$ of the inverse document frequency for that term across the corpus. 

$$
\text{tf-idf } = \text{ term-frequency } \times \text{ }\log(\text{inverse-document-frequency})
$$

Here, the inverse document frequency is the number of documents in the corpus divided by the document frequency of the term. Note that scikit-learn's `TfidfVectorizer` uses the natural logarithm and, by default, a smoothed variant of this formula, and then normalizes each document's vector to unit length.

As a result, words like 'this', 'are' etc., that are commonly present in all the documents are not given a very high weight.

In [17]:
tfid_vectorizer = TfidfVectorizer(analyzer="word", 
                                tokenizer=nltk.word_tokenize,
                                preprocessor=None, 
                             #   stop_words='english', 
                                max_features=2500,
                                ngram_range=(1,3))
                                    ##  max_df=.9
    
tfid = tfid_vectorizer.fit_transform(df['abstract_reduced'])

df_tfid = pd.DataFrame(tfid.toarray(),
                      columns = tfid_vectorizer.get_feature_names_out())
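
The weighting can be checked by hand on a toy corpus. The sketch below uses the plain formula tf × log(N / df) with the natural logarithm (the base only rescales every weight by the same factor); it is not a reimplementation of scikit-learn's `TfidfVectorizer`, whose default smoothing and normalization give different numbers. The corpus and the helper `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Three tiny made-up tokenized documents
docs = [["limit", "cycles", "limit"], ["planar", "systems"], ["limit", "systems"]]
n_docs = len(docs)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Plain tf-idf: term count in doc times log(N / document frequency).
    Only valid for terms that occur somewhere in the corpus."""
    tf = doc.count(term)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

# "limit" occurs twice in the first document but appears in 2 of 3 documents
w_limit = tf_idf("limit", docs[0])    # 2 * ln(3/2), about 0.81
# "cycles" occurs once but appears in only 1 of 3 documents
w_cycles = tf_idf("cycles", docs[0])  # 1 * ln(3), about 1.10

print(round(w_limit, 2), round(w_cycles, 2))
```

Even though "limit" is more frequent within the first document, the rarer word "cycles" receives the larger weight.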

### Word2Vec 


Rather than training our own model here, we use a pretrained model that has been trained on a much larger dataset, which should give better results.

Internally, this algorithm uses a neural network to learn word associations from the corpus. It generates word embeddings (vectors of length 300 in the case below) which capture semantic and syntactic qualities of words.

In [18]:
## Load word2vec model, here GoogleNews is used
## The file must be previously downloaded
## The size of the vectors is 300
w2v_model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', 
                                                        binary=True)
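
Since Word2Vec produces one vector per word, comparing two abstracts requires combining their word vectors. As far as we understand, gensim's `n_similarity` does this by taking the cosine similarity between the mean vectors of the two token lists. The sketch below illustrates that idea with made-up 2-dimensional embeddings; it does not call gensim, and the names `embeddings`, `mean_vector` and `set_similarity` are hypothetical.

```python
import math

# Hypothetical 2-dimensional word embeddings (real ones have 300 dimensions)
embeddings = {
    "limit": [1.0, 0.0],
    "cycle": [0.8, 0.2],
    "quantum": [0.0, 1.0],
    "field": [0.2, 0.8],
}

def mean_vector(tokens):
    """Average the embeddings of the given tokens, component by component."""
    vectors = [embeddings[t] for t in tokens]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def set_similarity(tokens_a, tokens_b):
    """Cosine similarity between the mean vectors of two token lists."""
    return cosine(mean_vector(tokens_a), mean_vector(tokens_b))

same_topic = set_similarity(["limit", "cycle"], ["cycle", "limit"])
different_topic = set_similarity(["limit", "cycle"], ["quantum", "field"])
print(round(same_topic, 3), round(different_topic, 3))
```

Because the vectors are averaged, word order plays no role in this similarity: the two lists in `same_topic` contain the same words in a different order and are treated as identical.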

### Doc2Vec 

The Doc2Vec algorithm is an extension of Word2Vec. While Word2Vec computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus.

In [19]:
# Rather than a list of tokenized docs Doc2Vec requires tagged lists of tokens
# that's because the model needs to keep track of the documents.
summaries = [TaggedDocument(doc,[i]) for i, doc in enumerate(df['abstract_reduced_tokens'])]

There are two approaches to the Doc2Vec model: (1) distributed bag of words (PV-DBOW) and (2) distributed memory (PV-DM).

The difference between them is that the distributed memory model predicts a word from the document vector together with the surrounding context words, whereas the distributed bag of words model uses the document vector alone to predict words sampled from the document, ignoring word order.

As in the Word2Vec model, we will use a vector size of 300.

#### Doc2Vec using Distributed Bag of Words

In [20]:
# Now we train the Doc2Vec models

# dm = 1 or 0 (optional) – Defines the training algorithm. 
# If dm=1, ‘distributed memory’ (PV-DM) is used. 
# Otherwise, distributed bag of words (PV-DBOW) is employed.

# vector_size - Dimensionality of the feature vectors.

# window -  The maximum distance between the current and 
# predicted word within a sentence.

# min_count - Require words to show up a minimum of 2 times, i.e.
# discard words with very few occurrences. (Without a variety of representative 
# examples, retaining such infrequent words can often make a model worse!)


d2v_model_bow = Doc2Vec(documents = summaries,
                    dm = 0, ## use distributed bag of words (PV-DBOW) 
                    vector_size = 300, 
                    window = 2, 
                    min_count = 2,
                    epochs=50)

#### Doc2Vec using Distributed Memory

In [21]:
#### Doc2Vec using Distributed Memory
d2v_model_dm = Doc2Vec(documents = summaries,
                    dm = 1, ## use distributed memory
                    vector_size = 300, 
                    window = 2, 
                    min_count = 2,
                    epochs=50)

#### Sanity check

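We can see how "good" the embedding is by looping through the abstracts, inferring a new vector for each one, and recording the rank of the abstract's own trained vector among the vectors most similar to the inferred one.

Then, we can calculate the fraction of documents whose rank was 0, i.e. whose own trained vector is the closest match to the inferred vector.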

In [22]:
"""
    To see how "good" the Doc2Vec embedding is loop through the abstracts 
    and record the similarity rank of the actual abstract embedding to the 
    inferred embedding.
    Then, calculate and print the fraction of documents whose rank was 0.
    
    Input:
    d2v_model: a Doc2Vec model
"""
def check_doc2vec_embedding(d2v_model):
    # We'll loop through all of the documents
    # and record the similarity rank to their inferred vector
    summary_ranks = []

    # for each document
    for summary in summaries:
        # get the inferred vector
        inferred_vec = d2v_model.infer_vector(summary.words)
        # find the most similar vectors
        sims = d2v_model.dv.most_similar([inferred_vec], topn=len(summaries))
    
        # loop through those vectors
        for i in range(len(sims)):
            # find the rank of the document
            if summary.tags[0] == sims[i][0]:
                # record it
                summary_ranks.append(i)
                
    # the fraction of documents whose rank was 0           
    rank_0 = np.sum(np.array(summary_ranks)==0)/len(summary_ranks)
    print("The fraction of documents whose rank is 0 is", np.round(rank_0, 4))
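
The rank computation in this check can be illustrated on a toy example that needs no trained model. In the sketch below, the similarity values are invented, and `self_rank` is a hypothetical helper: row $i$ plays the role of the similarities between the inferred vector of document $i$ and the trained vectors of all documents.

```python
def self_rank(similarities, own_index):
    """Position of own_index when the indices are sorted by decreasing similarity."""
    order = sorted(range(len(similarities)),
                   key=lambda j: similarities[j], reverse=True)
    return order.index(own_index)

# Row i: made-up similarities between the inferred vector of document i
# and the trained vectors of documents 0, 1 and 2
similarity_rows = [
    [0.95, 0.30, 0.10],  # document 0 is closest to itself (rank 0)
    [0.20, 0.90, 0.40],  # document 1 is closest to itself (rank 0)
    [0.70, 0.10, 0.60],  # document 2 is closer to document 0 (rank 1)
]

ranks = [self_rank(row, i) for i, row in enumerate(similarity_rows)]
fraction_rank_0 = sum(r == 0 for r in ranks) / len(ranks)
print(ranks)  # [0, 0, 1]
print(round(fraction_rank_0, 4))
```

A fraction close to 1 means that re-inferring a vector for a training document almost always lands nearest to that document's own trained vector.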

In [23]:
print("----------------- Doc2Vec Model (Dist. Bag of Words)-----------------\n")
check_doc2vec_embedding(d2v_model_bow)

----------------- Doc2Vec Model (Dist. Bag of Words)-----------------

The fraction of documents whose rank is 0 is 0.9923


In [24]:
print("----------------- Doc2Vec Model (Dist. Memory)-----------------\n")
check_doc2vec_embedding(d2v_model_dm)

----------------- Doc2Vec Model (Dist. Memory)-----------------

The fraction of documents whose rank is 0 is 0.992


For both models, the document's own vector is ranked first for about 99% of the abstracts, so these embeddings seem reasonable!

#### Compare results of CountVectorizer, TfidfVectorizer, Word2Vec, Doc2Vec for articles within the dataset

In [56]:
## Similarity scores for the first article in the DataFrame

## Using CountVectorizer
df_cv = get_n_most_similar(df.copy(), df_bow, 0, 5)[0]

# Using TfidfVectorizer
df_tf = get_n_most_similar(df.copy(), df_tfid, 0, 5)[0]

# Using Word2Vec
df_wv = get_n_most_similar(w2v_model, df, 0, 5)[0]

# Using Doc2Vec with Distributed BOW
df_dv_bow = get_n_most_similar(d2v_model_bow, df, 0, 5)[0]

# Using Doc2Vec with Distributed Memory
df_dv_dm = get_n_most_similar(d2v_model_dm, df, 0, 5)[0]

Using a CountVectorizer or TfidVectorizer

Using a CountVectorizer or TfidVectorizer

Using Word2Vec

Using Doc2Vec

Using Doc2Vec



In [57]:
print("--------------------- Count Vectorizer ---------------------\n")
df_cv[['original_index', 'title', 'abstract', 'strip_cat', 'authors_parsed', 'Cosine Similarity']]

--------------------- Count Vectorizer ---------------------



Unnamed: 0,original_index,title,abstract,strip_cat,authors_parsed,Cosine Similarity
15936,329106,Bifurcation of limit cycles from a quadratic g...,"In this paper, we generalize the PicardFuchs...",[DS],"[['Yang', 'Jihua', '']]",0.410132
9508,315961,Planar Semiquasi Homogeneous Polynomial differ...,This paper study the planar semiquasi homoge...,[DS],"[['Tian', 'Yuzhou', ''], ['Liang', 'Haihua', '']]",0.381928
4796,408265,The local period function for Hamiltonian syst...,In the first part of the paper we develop a ...,[DS],"[['Buzzi', 'Claudio A.', ''], ['Carvalho', 'Ya...",0.335422
9854,367828,First Integrals vs Limit Cycles,This paper applies a recent result determini...,[DS],"[['García', 'Andrés G.', '']]",0.317011
7480,58128,Structure Theory for Second Order 2D Superinte...,The structure theory for the quadratic algeb...,[MP],"[['Kalnins', 'Ernest G.', ''], ['Kress', 'Jona...",0.307934


In [58]:
print("--------------------- Tfid Vectorizer ---------------------\n")
df_tf[['original_index', 'title', 'abstract', 'strip_cat', 'authors_parsed', 'Cosine Similarity']]

--------------------- Tfid Vectorizer ---------------------



Unnamed: 0,original_index,title,abstract,strip_cat,authors_parsed,Cosine Similarity
9508,315961,Planar Semiquasi Homogeneous Polynomial differ...,This paper study the planar semiquasi homoge...,[DS],"[['Tian', 'Yuzhou', ''], ['Liang', 'Haihua', '']]",0.405092
15936,329106,Bifurcation of limit cycles from a quadratic g...,"In this paper, we generalize the PicardFuchs...",[DS],"[['Yang', 'Jihua', '']]",0.325908
17297,169000,Solution of the parametric center problem for ...,The Abel differential equation with is sai...,"[CA, DS]","[['Pakovich', 'Fedor', '']]",0.32451
13706,511903,A sufficient and necessary condition of genera...,The aim of this paper is to give a sufficien...,[DS],"[['Chen', 'Hebai', ''], ['Li', 'Zhijie', ''], ...",0.309687
4796,408265,The local period function for Hamiltonian syst...,In the first part of the paper we develop a ...,[DS],"[['Buzzi', 'Claudio A.', ''], ['Carvalho', 'Ya...",0.301153


In [59]:
print("--------------------- Word2Vec Model ---------------------\n")
df_wv[['original_index', 'title', 'abstract', 'strip_cat', 'authors_parsed', 'Cosine Similarity']]

--------------------- Word2Vec Model ---------------------



Unnamed: 0,original_index,title,abstract,strip_cat,authors_parsed,Cosine Similarity
6108,158258,Topology trivialization and large deviations f...,Finding the global minimum of a cost functio...,"[MP, OC]","[['Fyodorov', 'Yan V', ''], ['Doussal', 'Pierr...",0.884011
13151,107803,Continuous Limits of Classical Repeated Intera...,We consider the physical model of a classica...,"[MP, PR]","[['Deschamps', 'Julien', '']]",0.880297
19029,53103,Phase portraits for quadratic homogeneous poly...,Let X be a homogeneous polynomial vector fie...,[DS],"[['Llibre', 'Jaume', ''], ['Pessoa', 'Claudio'...",0.878772
9508,315961,Planar Semiquasi Homogeneous Polynomial differ...,This paper study the planar semiquasi homoge...,[DS],"[['Tian', 'Yuzhou', ''], ['Liang', 'Haihua', '']]",0.877008
18932,365655,"Invariant tori, actionangle variables and phas...","We study the classical RajeevRanken model, a...","[DS, MP]","[['Krishnaswami', 'Govind S.', ''], ['Vishnu',...",0.875581


In [60]:
print("----------------- Doc2Vec Model (Dist. Bag of Words)-----------------\n")
df_dv_bow[['original_index', 'title', 'abstract', 'strip_cat', 'authors_parsed', 'Cosine Similarity']]

----------------- Doc2Vec Model (Dist. Bag of Words)-----------------



Unnamed: 0,original_index,title,abstract,strip_cat,authors_parsed,Cosine Similarity
6867,63325,Planar polynomial vector fields having a polyn...,We consider in this work planar polynomial d...,"[CA, DS]","[['Garcia', 'Belen', ''], ['Giacomini', 'Hecto...",0.591277
10619,468612,Rational integrals of 2dimensional geodesic fl...,This paper is devoted to searching for Riema...,"[DS, AP, DG]","[['Agapov', 'Sergei', '', '1 and 2'], ['Shubin...",0.517568
10096,325855,Averaging theory at any order for computing li...,This work is devoted to study the existence ...,[DS],"[['Llibre', 'Jaume', ''], ['Novaes', 'Douglas ...",0.504668
6281,464716,Bifurcation diagrams of onedimensional Kirchho...,We study the onedimensional Kirchhoff type e...,[AP],"[['Shibata', 'Tetsutaro', '']]",0.483492
15464,240592,Dual morse index estimates and application to ...,"In this paper, we study the multiplicity of ...",[AP],"[['Tang', 'Shanshan', '']]",0.482491


In [61]:
print("----------------- Doc2Vec Model (Dist. Memory)-----------------\n")
df_dv_dm[['original_index', 'title', 'abstract', 'strip_cat', 'authors_parsed', 'Cosine Similarity']]

----------------- Doc2Vec Model (Dist. Memory)-----------------



Unnamed: 0,original_index,title,abstract,strip_cat,authors_parsed,Cosine Similarity
9508,315961,Planar Semiquasi Homogeneous Polynomial differ...,This paper study the planar semiquasi homoge...,[DS],"[['Tian', 'Yuzhou', ''], ['Liang', 'Haihua', '']]",0.52938
4796,408265,The local period function for Hamiltonian syst...,In the first part of the paper we develop a ...,[DS],"[['Buzzi', 'Claudio A.', ''], ['Carvalho', 'Ya...",0.484793
10131,178087,On the dynamics of lattice systems with unboun...,We supply the mathematical arguments require...,[MP],"[['Nachtergaele', 'Bruno', ''], ['Sims', 'Robe...",0.47306
5240,40189,Displacement energy of coisotropic submanifold...,We prove that the displacement energy of a s...,"[SG, DG]","[['Kerman', 'Ely', '']]",0.453683
6867,63325,Planar polynomial vector fields having a polyn...,We consider in this work planar polynomial d...,"[CA, DS]","[['Garcia', 'Belen', ''], ['Giacomini', 'Hecto...",0.450362


## Test the models using user data

Next, we test the three more sophisticated models on papers supplied by users: 

(1) pretrained Word2Vec 

(2) Doc2Vec with distributed bag of words

(3) Doc2Vec with distributed memory.

In [31]:
## Here are several lists of papers we are interested in
ethan = ['1802.03426', '2304.14481', '2303.03190', '2210.13418',
         '2210.12824', '2210.00661', '2007.02390', '1808.05860',
         '2005.12732','1804.05690']

jeeuhn = ['0905.0486', 'math/0006187', '2106.07444', '1402.0490', 
          '1512.08942', '1603.09235', 'math/0510265', 'math/0505056', 
          'math/0604379', '2209.02568']

mike = ['2207.13571','2207.13498','2211.09644','2001.10647',
        '2103.08093','2207.08245', '2207.01677','2205.08744',
        '2008.04406','1912.09845']

jenia = ['2010.14967', '1307.0493', 'quant-ph/0604014', '2201.05140', 
         '1111.1877', 'quant-ph/9912054', '1611.08286', '1507.02858', 
         'math-ph/0107001','1511.01241', 'math-ph/9904020', '2211.15336', 
         '2212.03719']
jenia = jenia[0:10]

In [32]:
## Get the list of words of frequency 1 in the dataframe
unique_words = get_freq1_corpus(df)
print(len(unique_words))

21707


We've observed that removing these frequency-1 ("unique") words changes the recommendations and increases the execution time of the algorithm.

In [34]:
# words can be accessed like so
# print(stopwords.words('english'))

## Tokenize the abstract by splitting on whitespaces
## and get rid of the occasional empty string.
def clear_empty(clean_string):
    return [word for word in clean_string.split(" ") if word != '']

## Remove the common stop words
def remove_stop(tokens):
    
    ## Remove punctuation from the stopwords because we've already 
    ## removed it from the abstracts.
    ## punctuation is imported from the string module
    eng_stopwords = stopwords.words('english')
    new_punct = punctuation + "’" + "‘"

    new_stop = []
    for word in eng_stopwords:
        new_word = ""
        for char in word:
            if char not in new_punct:
                new_word = new_word + char
        
        new_stop.append(new_word)
    
 #   print(new_stop)
    return [token for token in tokens if token not in new_stop]

## Remove the words that appear only once in the corpus of the dataset 
def remove_unique(tokens):
    return [token for token in tokens if token not in unique_words]
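
`remove_unique` depends on `unique_words`, which is produced by `get_freq1_corpus`; that function's definition is not shown in this section. As a rough sketch of the idea only (not the notebook's actual implementation), the frequency-1 words of a tokenized corpus can be collected with `collections.Counter` and kept in a `set`, so each membership test is a hash lookup rather than a scan over tens of thousands of entries. The names `freq1_words` and `drop_freq1` and the toy corpus are hypothetical.

```python
from collections import Counter

def freq1_words(token_lists):
    """Return the set of tokens that occur exactly once across all token lists."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {word for word, count in counts.items() if count == 1}

def drop_freq1(tokens, rare_words):
    """Keep only the tokens that are not in rare_words (expected to be a set)."""
    return [token for token in tokens if token not in rare_words]

# Toy corpus standing in for the tokenized abstracts (hypothetical data)
toy_corpus = [
    ["planar", "vector", "field"],
    ["vector", "bundle", "field"],
]
rare = freq1_words(toy_corpus)
filtered = drop_freq1(toy_corpus[0], rare)
```

If `unique_words` is a list, wrapping it once as `set(unique_words)` before filtering should give the same tokens with faster lookups.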

In [35]:
def user_data(paper_ids_list):
    """
    Process the arXiv ids that the user has given. 
    Return a DataFrame with the user's article data.
    
    Inputs:
    paper_ids_list: a list of arXiv ids
    """
    ## Create the dataframe to store the user's input papers
    df_user = pd.DataFrame(columns=['id','entry_id', 'title', 'authors','abstract'])
    
    list_urls = []
    list_titles = []
    list_authors = []
    list_abstracts = []

    ## Extract the article info from arXiv
    for item in paper_ids_list:
        paper = next(arxiv.Search(id_list=[item]).results())
        list_titles.append(paper.title)
        list_authors.append(paper.authors)
        list_abstracts.append(paper.summary)
        list_urls.append(paper.entry_id)
    
    df_user['id'] = paper_ids_list
    df_user['entry_id'] = list_urls
    df_user['title'] = list_titles
    df_user['authors'] = list_authors
    df_user['abstract'] = list_abstracts
    
    ## Clean the user's data
    df_user['abstract_clean'] = df_user['abstract'].apply(clean_data)
    ## Tokenize by splitting on whitespace; clear_empty also drops empty strings.
    ## (The earlier nltk.word_tokenize result was overwritten here, so it is omitted.)
    df_user['abstract_tokenized'] = df_user['abstract_clean'].apply(clear_empty)
    df_user['abstract_tokenized'] = df_user['abstract_tokenized'].apply(remove_stop)
    df_user['abstract_reduced_tokens'] = df_user['abstract_tokenized'].apply(remove_unique)
    
    return df_user

### Ethan's Recommendations

In [36]:
df_ethan = user_data(ethan)
df_ethan

Unnamed: 0,id,entry_id,title,authors,abstract,abstract_clean,abstract_tokenized,abstract_reduced_tokens
0,1802.03426,http://arxiv.org/abs/1802.03426v3,UMAP: Uniform Manifold Approximation and Proje...,"[Leland McInnes, John Healy, James Melville]",UMAP (Uniform Manifold Approximation and Proje...,umap uniform manifold approximation and projec...,"[umap, uniform, manifold, approximation, proje...","[umap, uniform, manifold, approximation, proje..."
1,2304.14481,http://arxiv.org/abs/2304.14481v1,"Endperiodic maps, splitting sequences, and bra...","[Michael P. Landry, Chi Cheuk Tsang]",We strengthen the unpublished theorem of Gabai...,we strengthen the unpublished theorem of gabai...,"[strengthen, unpublished, theorem, gabai, mosh...","[strengthen, unpublished, theorem, gabai, mosh..."
2,2303.03190,http://arxiv.org/abs/2303.03190v1,Train track combinatorics and cluster algebras,[Shunsuke Kano],The concepts of train track was introduced by ...,the concepts of train track was introduced by ...,"[concepts, train, track, introduced, w, p, thu...","[concepts, train, track, introduced, w, p, thu..."
3,2210.13418,http://arxiv.org/abs/2210.13418v2,Standardly embedded train tracks and pseudo-An...,"[Eriko Hironaka, Chi Cheuk Tsang]",We show that given a fully-punctured pseudo-An...,we show that given a fully punctured pseudo an...,"[show, given, fully, punctured, pseudo, anosov...","[show, given, fully, punctured, pseudo, anosov..."
4,2210.12824,http://arxiv.org/abs/2210.12824v2,Class number for pseudo-Anosovs,"[François Dahmani, Mahan Mj]","Given two automorphisms of a group $G$, one is...",given two automorphisms of a group one is int...,"[given, two, automorphisms, group, one, intere...","[given, two, automorphisms, group, one, intere..."
5,2210.00661,http://arxiv.org/abs/2210.00661v1,"Braids, entropies and fibered 2-fold branched ...","[Susumu Hirose, Eiko Kin]",It is proved by Sakuma and Brooks that any clo...,it is proved by sakuma and brooks that any clo...,"[proved, sakuma, brooks, closed, orientable, m...","[proved, sakuma, brooks, closed, orientable, m..."
6,2007.02390,http://arxiv.org/abs/2007.02390v1,The (homological) persistence of gerrymandering,"[Moon Duchin, Tom Needham, Thomas Weighill]","We apply persistent homology, the dominant too...",we apply persistent homology the dominant tool...,"[apply, persistent, homology, dominant, tool, ...","[apply, persistent, homology, dominant, tool, ..."
7,1808.05860,http://arxiv.org/abs/1808.05860v1,Discrete geometry for electoral geography,"[Moon Duchin, Bridget Eileen Tenner]","We discuss the ""compactness,"" or shape analysi...",we discuss the compactness or shape analysis o...,"[discuss, compactness, shape, analysis, electo...","[discuss, compactness, shape, analysis, distri..."
8,2005.12732,http://arxiv.org/abs/2005.12732v1,Mathematics of Nested Districts: The Case of A...,"[Sophia Caldera, Daryl DeFord, Moon Duchin, Sa...","In eight states, a ""nesting rule"" requires tha...",in eight states a nesting rule requires that e...,"[eight, states, nesting, rule, requires, state...","[eight, states, nesting, rule, requires, state..."
9,1804.05690,http://arxiv.org/abs/1804.05690v4,You can hear the shape of a billiard table: Sy...,"[Moon Duchin, Viveka Erlandsson, Christopher J...",We give a complete characterization of the rel...,we give a complete characterization of the rel...,"[give, complete, characterization, relationshi...","[give, complete, characterization, relationshi..."


In [37]:
start = time.time()

## Using the pretrained Word2Vec
df_rec_wv, df_ethan_new_wv = n_recommendations(w2v_model, df, df_ethan[0:3], 10)

end = time.time()
res = (end - start)/60
print('Execution time:', res, 'minutes')

## For a set of 2 papers, the execution takes about 4.5 min.
## For a set of 5 papers, the execution takes about 19 min.

Using Word2Vec
Execution time: 9.650244998931885 minutes
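
For several input papers, the note at the top of the notebook describes the merge step: pool the per-input similarity scores, sort from highest to lowest, drop duplicate articles while keeping the largest score, and return the first $n$ rows. The following is a minimal pandas sketch of that step under stated assumptions; it is not the body of `n_recommendations`, which is not shown here. The helper name `merge_recommendations` and the toy scores are hypothetical; the column names `original_index` and `Cosine Similarity` are the ones used in the result tables above.

```python
import pandas as pd

def merge_recommendations(per_input_frames, n):
    """Combine per-input similarity tables and return the top n distinct articles."""
    combined = pd.concat(per_input_frames, ignore_index=True)
    combined = combined.sort_values("Cosine Similarity", ascending=False)
    # After sorting, the first occurrence of each article carries its largest similarity
    combined = combined.drop_duplicates(subset="original_index", keep="first")
    return combined.head(n).reset_index(drop=True)

# Toy similarity tables for two input papers (hypothetical values)
scores_a = pd.DataFrame({"original_index": [11, 22, 33],
                         "Cosine Similarity": [0.90, 0.40, 0.70]})
scores_b = pd.DataFrame({"original_index": [22, 33, 44],
                         "Cosine Similarity": [0.80, 0.60, 0.50]})
top = merge_recommendations([scores_a, scores_b], n=3)
```

Sorting before `drop_duplicates(keep="first")` is what guarantees that the retained row for each article is the one with the highest similarity.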


In [38]:
print("--------------------- Word2Vec Model ---------------------\n")
display_results(df_rec_wv, df_ethan_new_wv)

--------------------- Word2Vec Model ---------------------

The top 10 articles most similar to the articles: 


1 . UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
2 . Endperiodic maps, splitting sequences, and branched surfaces
3 . Train track combinatorics and cluster algebras

#############################################################

1 . ( 5972 ) Pointed Admissible GCovers and Gequivariant Cohomological Field   Theories 
 [ Cosine Similarity= 0.795 ]


   For any finite group G we define the moduli space of pointed admissible Gcovers and the concept of a Gequivariant cohomological field theory (GCohFT), which, when G is the trivial group, reduce to the moduli space of stable curves and a cohomological field theory (CohFT), respectively. We prove that by taking the "quotient" by G, a GCohFT reduces to a CohFT. We also prove that a GCohFT contains a GFrobenius algebra, a Gequivariant generalization of a Frobenius algebra, and that the "quotient" by G 

In [39]:
start = time.time()

## Using Doc2Vec with Distributed Bag of Words
df_rec_d2v_bow, df_ethan_new_bow = n_recommendations(d2v_model_bow, df, df_ethan[0:3], 10)

end = time.time()
res = (end - start)/60
print('Execution time:',res, 'minutes')
## For a set of 2 papers, this takes about 4.5 min.

Using Doc2Vec
Execution time: 9.684119053681691 minutes


In [42]:

print("----------------- Doc2Vec Model (Dist. Bag of Words)-----------------\n")
display_results(df_rec_d2v_bow, df_ethan_new_bow)

----------------- Doc2Vec Model (Dist. Bag of Words)-----------------

The top 10 articles most similar to the articles: 


1 . UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
2 . Endperiodic maps, splitting sequences, and branched surfaces
3 . Train track combinatorics and cluster algebras

#############################################################

1 . ( 10116 ) Stability of FractionalOrder Systems with Rational Orders 
 [ Cosine Similarity= 0.311 ]


   This paper deals with stability of a certain class of fractional order linear and nonlinear systems. The stability is investigated in the time domain and the frequency domain. The general stability conditions and several illustrative examples are presented as well. 

-----------------------------------------------------

2 . ( 13467 ) Asymptotical stability of differential equations driven by   H\"oldercontinuous paths 
 [ Cosine Similarity= 0.279 ]


   In this manuscript, we establish asymptotic local

In [40]:
start = time.time()

## Using Doc2Vec with Distributed Memory
df_rec_d2v_dm, df_ethan_new_dm = n_recommendations(d2v_model_dm, df, df_ethan[0:3], 10)

end = time.time()
res = (end - start)/60
print('Execution time:',res, 'minutes')
## For a set of 2 papers, this takes about 4.5 min.

Using Doc2Vec
Execution time: 10.251612730820973 minutes
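
Each of the three runs above takes roughly ten minutes for three input papers. How `n_recommendations` computes its similarities is not shown in this section, so the following is only a sketch of one common approach: if an embedding for every abstract in the dataset were stored as a row of a NumPy matrix, the cosine similarity of a query embedding against all rows could be computed with a single matrix-vector product. The function `cosine_to_all` and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_to_all(query_vec, corpus_matrix):
    """Cosine similarity between one vector and every row of a 2-D matrix."""
    query_norm = np.linalg.norm(query_vec)
    row_norms = np.linalg.norm(corpus_matrix, axis=1)
    denom = row_norms * query_norm
    dots = corpus_matrix @ query_vec
    # Rows (or a query) with zero norm get similarity 0 instead of dividing by zero
    return np.divide(dots, denom, out=np.zeros_like(dots, dtype=float), where=denom > 0)

# Toy 2-dimensional "embeddings" (hypothetical values)
corpus = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.0, 0.0]])
query = np.array([1.0, 0.0])
sims = cosine_to_all(query, corpus)
```

Whether this would shorten the timings reported here depends on where the time is actually spent, for example on inferring embeddings rather than on comparing them.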


In [41]:
print("----------------- Doc2Vec Model (Dist. Memory)-----------------\n")
display_results(df_rec_d2v_dm, df_ethan_new_dm)

----------------- Doc2Vec Model (Dist. Memory)-----------------

The top 10 articles most similar to the articles: 


1 . UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
2 . Endperiodic maps, splitting sequences, and branched surfaces
3 . Train track combinatorics and cluster algebras

#############################################################

1 . ( 18893 ) WeilPetersson metric on the universal Teichmuller space II. Kahler   potential and period mapping 
 [ Cosine Similarity= 0.795 ]


   We study the Hilbert manifold structure on   the connected component of the identity of the Hilbert manifold T(1). We characterize points on  in terms of Bers and preBers embeddings, and prove that the Grunsky operators  and , associated with the points in  via conformal welding, are HilbertSchmidt. We define a ``universal Liouville action''  a realvalued function  on , and prove that it is a K\"{a}hler potential of the WeilPetersson metric on . We also prove that  is  

### Jee Uhn's Recommendations

In [43]:
df_jeeuhn = user_data(jeeuhn)
df_jeeuhn

Unnamed: 0,id,entry_id,title,authors,abstract,abstract_clean,abstract_tokenized,abstract_reduced_tokens
0,0905.0486,http://arxiv.org/abs/0905.0486v3,A geometric construction of colored HOMFLYPT h...,"[Ben Webster, Geordie Williamson]","The aim of this paper is two-fold. First, we g...",the aim of this paper is two fold first we giv...,"[aim, paper, two, fold, first, give, fully, ge...","[aim, paper, two, fold, first, give, fully, ge..."
1,math/0006187,http://arxiv.org/abs/math/0006187v1,The Hard Lefschetz Theorem and the topology of...,"[Mark Andrea de Cataldo, Luca Migliorini]",We introduce the notion of lef line bundles on...,we introduce the notion of lef line bundles on...,"[introduce, notion, lef, line, bundles, comple...","[introduce, notion, line, bundles, complex, pr..."
2,2106.07444,http://arxiv.org/abs/2106.07444v2,From the Hecke Category to the Unipotent Locus,[Minh-Tâm Quang Trinh],Let $W$ be the Weyl group of a split semisimpl...,let be the weyl group of a split semisimple gr...,"[let, weyl, group, split, semisimple, group, h...","[let, weyl, group, split, semisimple, group, h..."
3,1402.0490,http://arxiv.org/abs/1402.0490v2,Legendrian knots and constructible sheaves,"[Vivek Shende, David Treumann, Eric Zaslow]",We study the unwrapped Fukaya category of Lagr...,we study the unwrapped fukaya category of lagr...,"[study, unwrapped, fukaya, category, lagrangia...","[study, unwrapped, fukaya, category, lagrangia..."
4,1512.08942,http://arxiv.org/abs/1512.08942v3,Cluster varieties from Legendrian knots,"[Vivek Shende, David Treumann, Harold Williams...",Many interesting spaces --- including all posi...,many interesting spaces including all positroi...,"[many, interesting, spaces, including, positro...","[many, interesting, spaces, including, positro..."
5,1603.09235,http://arxiv.org/abs/1603.09235v1,The Hodge theory of the Decomposition Theorem ...,[Geordie Williamson],In its simplest form the Decomposition Theorem...,in its simplest form the decomposition theorem...,"[simplest, form, decomposition, theorem, asser...","[simplest, form, decomposition, theorem, asser..."
6,math/0510265,http://arxiv.org/abs/math/0510265v3,Triply-graded link homology and Hochschild hom...,[Mikhail Khovanov],We trade matrix factorizations and Koszul comp...,we trade matrix factorizations and koszul comp...,"[trade, matrix, factorizations, koszul, comple...","[trade, matrix, factorizations, koszul, comple..."
7,math/0505056,http://arxiv.org/abs/math/0505056v2,Matrix factorizations and link homology II,"[Mikhail Khovanov, Lev Rozansky]",To a presentation of an oriented link as the c...,to a presentation of an oriented link as the c...,"[presentation, oriented, link, closure, braid,...","[presentation, oriented, link, closure, braid,..."
8,math/0604379,http://arxiv.org/abs/math/0604379v4,Constructible Sheaves and the Fukaya Category,"[David Nadler, Eric Zaslow]","Let $X$ be a compact real analytic manifold, a...",let be a compact real analytic manifold and le...,"[let, compact, real, analytic, manifold, let, ...","[let, compact, real, analytic, manifold, let, ..."
9,2209.02568,http://arxiv.org/abs/2209.02568v1,The $P=W$ conjecture for $\mathrm{GL}_n$,"[Davesh Maulik, Junliang Shen]",We prove the $P=W$ conjecture for $\mathrm{GL}...,we prove the conjecture for for all ranks and ...,"[prove, conjecture, ranks, curves, arbitrary, ...","[prove, conjecture, ranks, curves, arbitrary, ..."


In [44]:
## Using the pretrained Word2Vec
df_rec_wv2, df_jeeuhn_new_wv = n_recommendations(w2v_model, df, df_jeeuhn[0:3], 10)

Using Word2Vec


In [45]:
print("--------------------- Word2Vec Model ---------------------\n")
display_results(df_rec_wv2, df_jeeuhn_new_wv)

--------------------- Word2Vec Model ---------------------

The top 10 articles most similar to the articles: 


1 . A geometric construction of colored HOMFLYPT homology
2 . The Hard Lefschetz Theorem and the topology of semismall maps
3 . From the Hecke Category to the Unipotent Locus

#############################################################

1 . ( 7870 ) Twistors, Generalizations and Exceptional Structures 
 [ Cosine Similarity= 0.904 ]


   This paper is intended to describe twistors via the paravector model of Clifford algebras and to relate such description to conformal maps in the Clifford algebra over R(4,1), besides pointing out some applications of the pure spinor formalism. We construct twistors in Minkowski spacetime as algebraic spinors associated with the DiracClifford algebra, using one lower spacetime dimension than standard Clifford algebra formulations, since for this purpose the Clifford algebra over R{4,1} is also used to describe conformal maps, instead of R{2

In [46]:
## Using Doc2Vec with Distributed Bag of Words
df_rec_d2v_bow2, df_jeeuhn_new_bow = n_recommendations(d2v_model_bow, df, df_jeeuhn[0:3], 10)

print("----------------- Doc2Vec Model (Dist. Bag of Words)-----------------\n")
display_results(df_rec_d2v_bow2, df_jeeuhn_new_bow)

Using Doc2Vec
----------------- Doc2Vec Model (Dist. Bag of Words)-----------------

The top 10 articles most similar to the articles: 


1 . A geometric construction of colored HOMFLYPT homology
2 . The Hard Lefschetz Theorem and the topology of semismall maps
3 . From the Hecke Category to the Unipotent Locus

#############################################################

1 . ( 635 ) The restricted KirillovReshetikhin modules for the current and twisted   current algebras 
 [ Cosine Similarity= 0.317 ]


   We define a family of graded restricted modules for the polynomial current algebra associated to a simple Lie algebra. We study the graded character of these modules and show that they are the same as the graded characters of certain Demazure modules. In particular, we see that the specialized characters are the same as those of the Kirillov Reshetikhin modules for quantum affine algebras. 

-----------------------------------------------------

2 . ( 8206 ) Controlled coarse homo

In [47]:
## Using Doc2Vec with Distributed Memory
df_rec_d2v_dm2, df_jeeuhn_new_dm = n_recommendations(d2v_model_dm, df, df_jeeuhn[0:3], 10)

print("----------------- Doc2Vec Model (Dist. Memory)-----------------\n")
display_results(df_rec_d2v_dm2, df_jeeuhn_new_dm)

Using Doc2Vec
----------------- Doc2Vec Model (Dist. Memory)-----------------

The top 10 articles most similar to the articles: 


1 . A geometric construction of colored HOMFLYPT homology
2 . The Hard Lefschetz Theorem and the topology of semismall maps
3 . From the Hecke Category to the Unipotent Locus

#############################################################

1 . ( 3651 ) Cylindric Hecke characters and GromovWitten invariants via the   asymmetric sixvertex model 
 [ Cosine Similarity= 0.894 ]


   We construct a family of infinitedimensional positive subcoalgebras within the Grothendieck ring of Hecke algebras, when viewed as a Hopf algebra with respect to the induction and restriction functor. These subcoalgebras have as structure constants the 3point genus zero GromovWitten invariants of Grassmannians and are spanned by what we call cylindric Hecke characters, a particular set of virtual characters for whose computation we give several explicit combinatorial formulae. One of

### Mike's Recommendations

In [48]:
df_mike = user_data(mike)
df_mike

Unnamed: 0,id,entry_id,title,authors,abstract,abstract_clean,abstract_tokenized,abstract_reduced_tokens
0,2207.13571,http://arxiv.org/abs/2207.13571v2,Scaling asymptotics of spectral Wigner functions,"[Boris Hanin, Steve Zelditch]",We prove that smooth Wigner-Weyl spectral sums...,we prove that smooth wigner weyl spectral sums...,"[prove, smooth, wigner, weyl, spectral, sums, ...","[prove, smooth, wigner, weyl, spectral, sums, ..."
1,2207.13498,http://arxiv.org/abs/2207.13498v1,$2$-nodal domain theorems for higher dimension...,"[Junehyuk Jung, Steve Zelditch]",We prove that the real parts of equivariant (b...,we prove that the real parts of equivariant bu...,"[prove, real, parts, equivariant, non, invaria...","[prove, real, parts, equivariant, non, invaria..."
2,2211.09644,http://arxiv.org/abs/2211.09644v1,Asymptotics for the spectral function on Zoll ...,"[Yaiza Canzani, Jeffrey Galkowski, Blake Keeler]","On a smooth, compact, Riemannian manifold with...",on a smooth compact riemannian manifold withou...,"[smooth, compact, riemannian, manifold, withou...","[smooth, compact, riemannian, manifold, withou..."
3,2001.10647,http://arxiv.org/abs/2001.10647v4,Caustics of weakly Lagrangian distributions,"[Sean Gomes, Jared Wunsch]",We study semiclassical sequences of distributi...,we study semiclassical sequences of distributi...,"[study, semiclassical, sequences, distribution...","[study, semiclassical, sequences, distribution..."
4,2103.08093,http://arxiv.org/abs/2103.08093v2,Around quantum ergodicity,[Semyon Dyatlov],We discuss Shnirelman's Quantum Ergodicity The...,we discuss shnirelmans quantum ergodicity theo...,"[discuss, shnirelmans, quantum, ergodicity, th...","[discuss, shnirelmans, quantum, ergodicity, th..."
5,2207.08245,http://arxiv.org/abs/2207.08245v1,Classical Wave methods and modern gauge transf...,"[Jeffrey Galkowski, Leonid Parnovski, Roman Sh...","In this article, we consider the asymptotic be...",in this article we consider the asymptotic beh...,"[article, consider, asymptotic, behaviour, spe...","[article, consider, asymptotic, behaviour, spe..."
6,2207.01677,http://arxiv.org/abs/2207.01677v2,Scaling Asymptotics of Wigner Distributions of...,[Nicholas Lohr],The main result of this article gives scaling ...,the main result of this article gives scaling ...,"[main, result, article, gives, scaling, asympt...","[main, result, article, gives, scaling, asympt..."
7,2205.08744,http://arxiv.org/abs/2205.08744v2,A proof of a Melrose's trace formula,[Yves Colin de Verdière],We give a new proof ofan extension of the Chaz...,we give a new proof ofan extension of the chaz...,"[give, new, proof, ofan, extension, chazarain,...","[give, new, proof, ofan, extension, chazarain,..."
8,2008.04406,http://arxiv.org/abs/2008.04406v1,Reduction and Coherent States,"[Jenia Rousseva, Alejandro Uribe]",We apply a quantum version of dimensional redu...,we apply a quantum version of dimensional redu...,"[apply, quantum, version, dimensional, reducti...","[apply, quantum, version, dimensional, reducti..."
9,1912.09845,http://arxiv.org/abs/1912.09845v2,An introduction to microlocal complex deformat...,"[Jeffrey Galkowski, Maciej Zworski]",In this expository article we relate the prese...,in this expository article we relate the prese...,"[expository, article, relate, presentation, we...","[expository, article, relate, presentation, we..."


In [49]:
## Using the pretrained Word2Vec
df_rec_wv3, df_mike_new_wv = n_recommendations(w2v_model, df, df_mike[0:3], 10)

print("--------------------- Word2Vec Model ---------------------\n")
display_results(df_rec_wv3, df_mike_new_wv)

Using Word2Vec
--------------------- Word2Vec Model ---------------------

The top 10 articles most similar to the articles: 


1 . Scaling asymptotics of spectral Wigner functions
2 . $2$-nodal domain theorems for higher dimensional circle bundles
3 . Asymptotics for the spectral function on Zoll manifolds

#############################################################

1 . ( 12345 ) Critical point asymptotics for Gaussian random waves with densities of   any Sobolev regularity 
 [ Cosine Similarity= 0.878 ]


   We consider Gaussian random monochromatic waves  on the plane depending on a real parameter  that is directly related to the regularity of its Fourier transform. Specifically, the Fourier transform of  is , where  is the Hausdorff measure on the unit circle and the density  is a function on the circle that, roughly speaking, has exactly  derivatives in  almost surely. When , one recovers the classical setting for random waves with a translationinvariant covariancekernel. The m

In [50]:
## Using Doc2Vec with Distributed Bag of Words
df_rec_d2v_bow3, df_mike_new_bow = n_recommendations(d2v_model_bow, df, df_mike[0:3], 10)

print("----------------- Doc2Vec Model (Dist. Bag of Words)-----------------\n")
display_results(df_rec_d2v_bow3, df_mike_new_bow)

Using Doc2Vec
----------------- Doc2Vec Model (Dist. Bag of Words)-----------------

The top 10 articles most similar to the articles: 


1 . Scaling asymptotics of spectral Wigner functions
2 . $2$-nodal domain theorems for higher dimensional circle bundles
3 . Asymptotics for the spectral function on Zoll manifolds

#############################################################

1 . ( 11906 ) Eigenvalue Inequalities for the Clamped Plate Problem of    Operator 
 [ Cosine Similarity= 0.323 ]


    operator is introduced by Y.L. Xin (\emph{Calculus of Variations and Partial Differential Equations. 2015, \textbf{54}(2):19952016)}, which is an important extrinsic elliptic differential operator of divergence type and has profound geometric meaning. In this paper, we extend  operator to more general elliptic differential operator , and investigate the clamped plate problem of bi operator, which is denoted by , on the complete Riemannian manifolds. A general formula of eigenvalues for the  o

In [51]:
## Using Doc2Vec with Distributed Memory
df_rec_d2v_dm3, df_mike_new_dm = n_recommendations(d2v_model_dm, df, df_mike[0:3], 10)

print("----------------- Doc2Vec Model (Dist. Memory)-----------------\n")
display_results(df_rec_d2v_dm3, df_mike_new_dm)

Using Doc2Vec
----------------- Doc2Vec Model (Dist. Memory)-----------------

The top 10 articles most similar to the articles: 


1 . Scaling asymptotics of spectral Wigner functions
2 . $2$-nodal domain theorems for higher dimensional circle bundles
3 . Asymptotics for the spectral function on Zoll manifolds

#############################################################

1 . ( 13790 ) Mean of the norm for normalized random waves on compact   aperiodic Riemannian manifolds 
 [ Cosine Similarity= 0.899 ]


   This article concerns upper bounds for norms of random approximate eigenfunctions of the Laplace operator on a compact aperiodic Riemannian manifold  We study  chosen uniformly at random from the space of normalized linear combinations of Laplace eigenfunctions with eigenvalues in the interval  Our main result is that the expected value of  grows at most like  as , where  is an explicit constant depending only on the dimension and volume of  In addition, we obtain concentration o

### Jenia's Recommendations

In [52]:
df_jenia = user_data(jenia)
df_jenia

Unnamed: 0,id,entry_id,title,authors,abstract,abstract_clean,abstract_tokenized,abstract_reduced_tokens
0,2010.14967,http://arxiv.org/abs/2010.14967v5,Construction of quasimodes for non-selfadjoint...,[Víctor Arnaiz],We construct quasimodes for some non-selfadjoi...,we construct quasimodes for some non selfadjoi...,"[construct, quasimodes, non, selfadjoint, semi...","[construct, quasimodes, non, selfadjoint, semi..."
1,1307.0493,http://arxiv.org/abs/1307.0493v2,The exponential map of the complexification of...,"[Daniel Burns, Ernesto Lupercio, Alejandro Uribe]","Let $(M, \omega, J)$ be a K\""ahler manifold an...",let be a kahler manifold and k its group of ha...,"[let, kahler, manifold, k, group, hamiltonian,...","[let, kahler, manifold, k, group, hamiltonian,..."
2,quant-ph/0604014,http://arxiv.org/abs/quant-ph/0604014v2,Time evolution of non-Hermitian Hamiltonian sy...,"[Carla Figueira de Morisson Faria, Andreas Fring]","We provide time-evolution operators, gauge tra...",we provide time evolution operators gauge tran...,"[provide, time, evolution, operators, gauge, t...","[provide, time, evolution, operators, gauge, t..."
3,2201.05140,http://arxiv.org/abs/2201.05140v1,An introduction to PT-symmetric quantum mechan...,[Andreas Fring],I will provide a pedagogical introduction to n...,i will provide a pedagogical introduction to n...,"[provide, pedagogical, introduction, non, herm...","[provide, pedagogical, introduction, non, herm..."
4,1111.1877,http://arxiv.org/abs/1111.1877v2,Complexified coherent states and quantum evolu...,"[Eva-Maria Graefe, Roman Schubert]","The complex geometry underlying the Schr\""odin...",the complex geometry underlying the schrodinge...,"[complex, geometry, underlying, schrodinger, d...","[complex, geometry, underlying, schrodinger, d..."
5,quant-ph/9912054,http://arxiv.org/abs/quant-ph/9912054v2,Holomorphic Methods in Mathematical Physics,[Brian C. Hall],This set of lecture notes gives an introductio...,this set of lecture notes gives an introductio...,"[set, lecture, notes, gives, introduction, hol...","[set, lecture, notes, gives, introduction, hol..."
6,1611.08286,http://arxiv.org/abs/1611.08286v1,Unitarity of the time-evolution and observabil...,"[F. S. Luiz, M. A. Pontes, M. H. Y. Moussa]",Here we present an strategy for the derivation...,here we present an strategy for the derivation...,"[present, strategy, derivation, time, dependen...","[present, strategy, derivation, time, dependen..."
7,1507.02858,http://arxiv.org/abs/1507.02858v3,Non-Hermitian propagation of Hagedorn wavepackets,"[Caroline Lasser, Roman Schubert, Stephanie Tr...",We investigate the time evolution of Hagedorn ...,we investigate the time evolution of hagedorn ...,"[investigate, time, evolution, hagedorn, wavep...","[investigate, time, evolution, hagedorn, wavep..."
8,math-ph/0107001,http://arxiv.org/abs/math-ph/0107001v3,Pseudo-Hermiticity versus PT Symmetry: The nec...,[Ali Mostafazadeh],We introduce the notion of pseudo-Hermiticity ...,we introduce the notion of pseudo hermiticity ...,"[introduce, notion, pseudo, hermiticity, show,...","[introduce, notion, pseudo, hermiticity, show,..."
9,1511.01241,http://arxiv.org/abs/1511.01241v2,Semiclassical states associated to isotropic s...,"[Victor Guillemin, Alejandro Uribe, Zuoqin Wang]",We define classes of quantum states associated...,we define classes of quantum states associated...,"[define, classes, quantum, states, associated,...","[define, classes, quantum, states, associated,..."


In [53]:
## Using the pretrained Word2Vec
df_rec_wv4, df_jenia_new_wv = n_recommendations(w2v_model, df, df_jenia[0:3], 10)

print("--------------------- Word2Vec Model ---------------------\n")
display_results(df_rec_wv4, df_jenia_new_wv)

Using Word2Vec
--------------------- Word2Vec Model ---------------------

The top 10 articles most similar to the articles: 


1 . Construction of quasimodes for non-selfadjoint operators via propagation of Hagedorn wave-packets
2 . The exponential map of the complexification of {\em Ham} in the real-analytic case
3 . Time evolution of non-Hermitian Hamiltonian systems

#############################################################

1 . ( 3373 ) Freely floating objects on a fluid governed by the Boussinesq equations 
 [ Cosine Similarity= 0.904 ]


   We investigate here the interactions of waves governed by a Boussinesq system with a partially immersed body allowed to move freely in the vertical direction. We show that the whole system of equations can be reduced to a transmission problem for the Boussinesq equations with transmission conditions given in terms of the vertical displacement of the object and of the average horizontal discharge beneath it; these two quantities are in tur

In [54]:
## Using Doc2Vec with Distributed Bag of Words
df_rec_d2v_bow4, df_jenia_new_bow = n_recommendations(d2v_model_bow, df, df_jenia[0:3], 10)

print("----------------- Doc2Vec Model (Dist. Bag of Words)-----------------\n")
display_results(df_rec_d2v_bow4, df_jenia_new_bow)

Using Doc2Vec
----------------- Doc2Vec Model (Dist. Bag of Words)-----------------

The top 10 articles most similar to the articles: 


1 . Construction of quasimodes for non-selfadjoint operators via propagation of Hagedorn wave-packets
2 . The exponential map of the complexification of {\em Ham} in the real-analytic case
3 . Time evolution of non-Hermitian Hamiltonian systems

#############################################################

1 . ( 377 ) Existence and regularity results for weak solutions to elliptic   systems in divergence form 
 [ Cosine Similarity= 0.367 ]


   We prove existence and regularity results for weak solutions of non linear elliptic systems with non variational structure satisfying growth conditions. In particular we are able to prove higher differentiability results under a dimensionfree gap between  and . 

-----------------------------------------------------

2 . ( 13966 ) Integrable Floquet dynamics, generalized exclusion processes and "fused"   matr

In [55]:
## Using Doc2Vec with Distributed Memory
df_rec_d2v_dm4, df_jenia_new_dm = n_recommendations(d2v_model_dm, df, df_jenia[0:3], 10)

print("----------------- Doc2Vec Model (Dist. Memory)-----------------\n")
display_results(df_rec_d2v_dm4, df_jenia_new_dm)

Using Doc2Vec
----------------- Doc2Vec Model (Dist. Memory)-----------------

The top 10 articles most similar to the articles: 


1 . Construction of quasimodes for non-selfadjoint operators via propagation of Hagedorn wave-packets
2 . The exponential map of the complexification of {\em Ham} in the real-analytic case
3 . Time evolution of non-Hermitian Hamiltonian systems

#############################################################

1 . ( 3590 ) Nosignaling principle and quantum brachistochrone problem in   symmetric fermionic two and fourdimensional models 
 [ Cosine Similarity= 0.882 ]


   Fermionic systems differ from bosonic ones in several ways, in particular that the timereversal operator  is odd, . For symmetric bosonic systems, the nosignaling principle and the quantum brachistochrone problem have been studied to some degree, both of them controversially. In this paper, we apply the basic methods proposed for bosonic systems to {\it fermionic} two and fourdimensional symme
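
The three calls above each pass three input articles and ask for 10 recommendations. The pooling procedure described at the top of the notebook (score every dataset article against each input, sort by cosine similarity, drop duplicate articles, keep the first $n$) can be sketched roughly as below. This is a minimal illustration on toy 2-D vectors, not the notebook's `n_recommendations` implementation: the helper name `top_n_multi`, the column names, and the toy embeddings are all hypothetical, and the sketch does not exclude the input articles themselves from the results.

```python
import numpy as np
import pandas as pd


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_n_multi(input_vecs, corpus_vecs, corpus_ids, n):
    """Hypothetical helper: pool similarities over all inputs and keep
    each article's best score before taking the top n."""
    rows = []
    for vec in input_vecs:
        # Step 1: score every article in the dataset against this input.
        for art_id, cand in zip(corpus_ids, corpus_vecs):
            rows.append(
                {"id": art_id, "cosine_similarity": cosine_similarity(vec, cand)}
            )
    pooled = pd.DataFrame(rows)
    # Step 2: sort from highest to lowest similarity, then keep only the
    # first (largest) occurrence of each article.
    pooled = pooled.sort_values("cosine_similarity", ascending=False)
    pooled = pooled.drop_duplicates(subset="id", keep="first")
    return pooled.head(n).reset_index(drop=True)


# Toy embeddings (assumed for illustration only).
corpus_ids = ["a", "b", "c"]
corpus_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
input_vecs = np.array([[1.0, 0.1], [0.2, 1.0]])

recs = top_n_multi(input_vecs, corpus_vecs, corpus_ids, n=2)
print(recs)
```

Because duplicates are dropped after sorting, each article appears once with its best score across all inputs, so a recommendation only needs to be close to one of the input articles to rank highly.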

### Assessment of Recommendations

We do not have an objective metric for assessing the quality of the recommendations our three models make from a user's input papers. The task is inherently subjective, since different users have different use cases. For example, some users may prefer papers closely related to their current research interests, whereas others, perhaps new to a given field of mathematics, may want a broader survey of the field. Our judgments also rest only on the information in the article titles and abstracts. Nevertheless, we'll give it a shot!

Below, each user ranks the three models from best (1) to worst (3), based on how well the papers recommended from their three input papers capture their interests.

#### Ethan: 1.    2.   3.

#### Jee Uhn: 1.    2.   3.

#### Mike: 1.    2.   3.

#### Jenia: 1. `Doc2Vec w/ Distributed Memory`   2. `Word2Vec`   3. `Doc2Vec w/ Distributed Bag of Words`
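
Once more rankings are filled in, one simple way to summarize them is the mean rank each model receives (lower is better). The sketch below is only a suggestion, not part of the original analysis: the function name `mean_ranks` and the short model keys are hypothetical, and it includes only the one ranking recorded above, leaving the blank entries out.

```python
from collections import defaultdict

# Short keys (hypothetical): "d2v_dm" = Doc2Vec w/ Distributed Memory,
# "w2v" = pre-trained Word2Vec, "d2v_dbow" = Doc2Vec w/ Distributed Bag of Words.
# Only the ranking actually recorded above is listed; add the others when known.
rankings = {
    "Jenia": ["d2v_dm", "w2v", "d2v_dbow"],  # ordered best to worst
}


def mean_ranks(rankings):
    """Average the position (1 = best) each model receives across users."""
    positions = defaultdict(list)
    for ordered_models in rankings.values():
        for position, model in enumerate(ordered_models, start=1):
            positions[model].append(position)
    return {model: sum(p) / len(p) for model, p in positions.items()}


summary = mean_ranks(rankings)
for model, rank in sorted(summary.items(), key=lambda kv: kv[1]):
    print(f"{rank:.2f}  {model}")
```

With a single user the mean rank simply reproduces that user's ordering; the summary only becomes informative once several users have ranked the models.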