# arXiv Paper Recommendations 

In this notebook, we explore three well-known natural language processing models for recommending similar articles to a user's input. The three models are:

1. pre-trained Word2Vec 
2. Doc2Vec with Distributed Bag of Words
3. Doc2Vec with Distributed Memory.

We implement functions that allow for the user to enter one or more articles (denoted by their arXiv ids) and request some number $n>0$ of recommendations. We will use `cosine similarity` of the word/sentence embeddings of the article abstracts as our similary measure.

### Note: in this notebook the recommendations for multiple inputs are made in the following manner: 
#### (1) Merge the tokens of the abstracts the user inputs into one merged abstract. 

#### (2) Find the $n$ articles with the highest cosine similarity with the merged abstract in the dataset. 


In [1]:
import pandas as pd
import numpy as np

import arxiv
import time
from string import punctuation

import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist

import gensim.downloader as api
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from multipledispatch import dispatch
from data_utils import format_query, query_to_df, clean_data, clean_authors

First we load in our pre-cleaned dataset. Our dataset contains 20,000 arXiv papers.

In [2]:
## This file keeps the stopwords, but removes words of freq=1 in the corpus from the tokens
#df = pd.read_parquet("./data/filter_20k_tokenized_stopwords.parquet")

## This file removes both stopwords and words of freq=1 in the corpus from the tokens. 
## See column "abstract_reduced_tokens".
df = pd.read_parquet("./data/filter_20k_tokenized.parquet")
df.head(1)

Unnamed: 0,id,title,abstract,update_date,authors_parsed,strip_cat,clean_title,clean_abstract,clean_authors,abstract_tokenized,abstract_reduced_tokens
182244,1412.3275,Limit cycles bifurcating from a degenerate center,We study the maximum number of limit cycles ...,2014-12-11,"[['Llibre', 'J.', ''], ['Pantazi', 'C.', '']]",[DS],limit cycles bifurcating from a degenerate center,we study the maximum number of limit cycles th...,"[['llibre', 'j', ''], ['pantazi', 'c', '']]","[we, study, the, maximum, number, of, limit, c...","[study, maximum, number, limit, cycles, bifurc..."


We will reset the index values for ease of analysis.

In [3]:
df = df.reset_index()
df = df.rename(columns={"index": "original_index"})
df.head(1)

Unnamed: 0,original_index,id,title,abstract,update_date,authors_parsed,strip_cat,clean_title,clean_abstract,clean_authors,abstract_tokenized,abstract_reduced_tokens
0,182244,1412.3275,Limit cycles bifurcating from a degenerate center,We study the maximum number of limit cycles ...,2014-12-11,"[['Llibre', 'J.', ''], ['Pantazi', 'C.', '']]",[DS],limit cycles bifurcating from a degenerate center,we study the maximum number of limit cycles th...,"[['llibre', 'j', ''], ['pantazi', 'c', '']]","[we, study, the, maximum, number, of, limit, c...","[study, maximum, number, limit, cycles, bifurc..."


In [4]:
df['abstract_reduced_tokens'].isnull().values.any()

False

In [5]:
# Rejoin the tokens for use with some models
df['abstract_reduced'] = df['abstract_reduced_tokens'].apply(" ".join)
df.head(1)

Unnamed: 0,original_index,id,title,abstract,update_date,authors_parsed,strip_cat,clean_title,clean_abstract,clean_authors,abstract_tokenized,abstract_reduced_tokens,abstract_reduced
0,182244,1412.3275,Limit cycles bifurcating from a degenerate center,We study the maximum number of limit cycles ...,2014-12-11,"[['Llibre', 'J.', ''], ['Pantazi', 'C.', '']]",[DS],limit cycles bifurcating from a degenerate center,we study the maximum number of limit cycles th...,"[['llibre', 'j', ''], ['pantazi', 'c', '']]","[we, study, the, maximum, number, of, limit, c...","[study, maximum, number, limit, cycles, bifurc...",study maximum number limit cycles bifurcate de...


Next, we create a corpus with all the words in the abstracts of all of the papers in the dataset. Then, we find their frequency distribution.

We will need this list of words with frequency 1  when we process the abstracts of the papers in the test data. Essentially, we will remove those words from the abstracts of the user's papers before inputting them into our models.

In [6]:
def get_freq1_corpus(df):
    indices = df.index.values

    # Corpus will be a list of lists
    corpus = []
    for i  in indices:
        corpus.append(df['abstract_tokenized'][i])

    # Convert a list of lists to a list because FreqDist takes in a
    # list of strings
    flat_corpus = []
    for sublist in corpus:
        for item in sublist:
            flat_corpus.append(item)

    # Create a frequency distribution from the flattened corpus
    freq = FreqDist(flat_corpus)
    #print("There are", len(freq), "words in the frequency distribution.")

    df_fdist = pd.DataFrame(list(freq.items()), columns = ["Word","Frequency"])

    ## Create a list of the words that appear only once
    unique_words = list(df_fdist[df_fdist['Frequency'] == 1]['Word'])
    #print("There are", len(unique_words), "words that appear only once in the abstracts.")
    return unique_words

In [7]:
## The list of frequency 1 words in the entire corpus
unique_words = get_freq1_corpus(df)

Now we define the functions for the various models we will use to compute cosine similarities and offer recommendations.

In [8]:
## A function to get the vector norm
def norm(u):
    return np.sqrt(np.sum(np.power(u,2)))

## A function to get the cosine similarity
def cos_sim(u,v):
    if norm(u)*norm(v) > 0:
        return (u.dot(v))/(norm(u)*norm(v))
    else:
        return np.nan

### Functions for returning the top $n$ most similar papers to a paper already in the dataset

As practice for our recommendations with our user test data, we first consider making recommendations for a paper that is already present in our dataset.

In [9]:
## Auxilary function
"""
    Prints the top n most similar article in the dataset to an 
    article indexed by article_index that is assumed to be in the dataset.
    
    Inputs:
    df: a DataFrame with all of the articles
    df_sim: a DataFrame with the top n most similar articles to article_index 
            and their computed cosine similarities
            
    article_index: an integer that is an index of an article in the DataFrame;
                   should range from 0 to len(df)-1 
"""
def print_similar(df, df_sim, article_index):
    
    print("The top", len(df_sim), "articles most similar to the article \n\n", 
            article_index, ".", df['title'][article_index])
    print("-----------------------------------------------------\n")
    
    i = 1
    for index in df_sim.index.values: 
        print(i, ".", "(", index , ")", df['title'][index], 
          ", Cosine Similiarity=", np.round(df_sim['Cosine Similarity'][index], 3))
        print()
        i = i + 1

"""
    Removes NaN entries from a list and return the modified list.
    
    Inputs:
    x: a list
"""        
def remove_nan(x):
    temp = x.copy()
    for i in range(len(temp)):
        if np.isnan(temp[i]):
            temp[i] = -1
    return temp

## For use with CountVectorizer and TfidVectorizer
"""
    Prints the top n most similar article titles from the dataframe
    to the input article by calculating their cosine similarity.
    
    Inputs:
    df: a DataFrame with all of the articles
    df_vectorized: a dataframe of word frequencies
    article_index: index of the article we want to compare cosine similarities to
    n: number of most similar articles to search for
    cv: a boolean value that is True if CountVectorizer was used to compute 
        df_vectorized and False if TfidVectorizer was used
"""
@dispatch(pd.core.frame.DataFrame, pd.core.frame.DataFrame, int, int, bool)

def get_n_most_similar(df, df_vectorized, article_index, n, cv):
       
    if cv == True:
        print("--------------------- Using CountVectorizer ---------------------\n")
    else:
        print("--------------------- Using TfidVectorizer ---------------------\n")
    
    # Calculate the cosine similariy scores for the i-th article in the dataset
    cosine_sim_list = np.zeros(len(df))

    for i in range(len(df)):
        
        if i != article_index: ## No need to compare the article to itself

            text_to_vector_v1 = df_vectorized.iloc[i].values
            text_to_vector_v2 = df_vectorized.iloc[article_index].values
            
            sim_scores = cos_sim(text_to_vector_v1, text_to_vector_v2)
            cosine_sim_list[i] = sim_scores
   
    ## Caution: there may be some cosine sims with value NaN
    cosine_sim_list = remove_nan(cosine_sim_list)
    ## Getting indices of n maximum values
    x = np.argsort(cosine_sim_list)[::-1][:n]
     
    
    ## Create a dataframe with the results
    df_sim = pd.DataFrame(columns=df.columns)
    df_sim['Cosine Similarity'] = []
   
    for index in x:
        df_sim.loc[index] = df.iloc[index]
        df_sim.at[index, 'Cosine Similarity'] = cosine_sim_list[index]
          
    return df_sim, article_index

###################################################################

## For use with Word2Vec
## Word2Vec supports a function called n_similarity
"""
    Prints the top n most similar (by cosine similarity)
    article titles to article_index. 
    
    Inputs: 
    model: a trained Word2Vec model
    df: a DataFrame with all of the articles
    article_index: an integer that is an index of an article in the DataFrame;
                   should range from 0 to len(df)-1 
    n: number of most similar articles to search for
    
"""
@dispatch(gensim.models.keyedvectors.KeyedVectors, pd.core.frame.DataFrame, int, int)

def get_n_most_similar(model, df, article_index , n):
    
    print("--------------------- Using Word2Vec ---------------------\n")
     
    cosine_sim_list = np.zeros(len(df))

    for i in range(len(df)):      
        # Calculate the cosine similariy scores with the i-th article in the dataset       
        if i != article_index and len(df['abstract_reduced_tokens'][i]) != 0:
            cosine_sim_list[i]  = model.n_similarity(df['abstract_reduced_tokens'][article_index], 
                                                     df['abstract_reduced_tokens'][i])
        
    ## Caution: there may be some cosine sims with value NaN
    cosine_sim_list = remove_nan(cosine_sim_list)
    ## Getting indices of the n maximum values
    x = np.argsort(cosine_sim_list)[::-1][:n]
            
    ## Create a dataframe with the results
    df_sim = pd.DataFrame(columns=df.columns)
    df_sim['Cosine Similarity'] = []
    
    for index in x:
        df_sim.loc[index] = df.iloc[index].copy()
        df_sim.at[index, 'Cosine Similarity'] = cosine_sim_list[index]
        
    return df_sim, article_index

###################################################################

## Doc2Vec supports a function called most_similar 
"""
    Prints the top n most similar (by cosine similarity)
    article titles to article_index.    
    
    Inputs: 
    model: a trained Doc2Vec model
    df: a DataFrame with all of the articles
    article_index: an integer that is an index of an article in the DataFrame;
                   should range from 0 to len(df)-1 
    n: number of most similar articles to search for
    dm: a boolean value that is False if Doc2Vec with distributed bag of words 
        was used and True if Doc2Vec with distributed memory was used
"""
@dispatch(gensim.models.doc2vec.Doc2Vec, pd.core.frame.DataFrame, int, int, bool)

def get_n_most_similar(model, df, article_index, n, dm):
    
    if dm == True:
        print("----------------- Doc2Vec Model (Dist. Memory)-----------------\n")
    else:
        print("----------------- Doc2Vec Model (Dist. Bag of Words)-----------------\n")
    
    # dv.most_similar returns the same values as d2v_model.dv.similarity(i, j)
    topn = model.dv.most_similar(article_index, topn=n)
   
    article_indices = [x[0] for x in topn]
    cos_sims = [x[1] for x in topn]
    
    ## Create a dataframe with the results
    df_sim = pd.DataFrame(columns=df.columns)
    
    for index in article_indices:
        df_sim.loc[index] = df.iloc[index]
    
    df_sim['Cosine Similarity'] = cos_sims
        
    return df_sim, article_index

### Functions for returning the top $n$ most similar papers to a paper NOT already in the dataset

In [10]:
## For use with Word2Vec
## Word2Vec supports a function called n_similarity

## user_vector: the tokenized and cleaned abstract of the user's input

@dispatch(gensim.models.keyedvectors.KeyedVectors, pd.core.frame.DataFrame, list, int)
def get_n_most_similar(model, df, user_tokens, n):
    
    print("Using Word2Vec\n")
     
    cosine_sim_list = np.zeros(len(df))

    for i in range(len(df)):      
        # Calculate the cosine similariy scores with the i-th article in the dataset
        # The difference here is the use of the n_similarity function
        cosine_sim_list[i]  = model.n_similarity(user_tokens, df['abstract_reduced_tokens'][i])
        
    ## Getting indices of the n maximum values
    x = np.argsort(cosine_sim_list)[::-1][:n]
            
    ## Create a dataframe with the results
    df_sim = pd.DataFrame(columns=df.columns)
    df_sim['Cosine Similarity'] = []
    
    for index in x:
        df_sim.loc[index] = df.iloc[index].copy() 
        df_sim['Cosine Similarity'].loc[index] = cosine_sim_list[index]
        
    return df_sim

###################################################################

## Doc2Vec Recommender
## use the infer_vector function (may not be necessary?)
## Choose the first paper in our dataset

@dispatch(gensim.models.doc2vec.Doc2Vec, pd.core.frame.DataFrame, list, int)

def get_n_most_similar(d2v_model, df, user_vector, n):
    print("Using Doc2Vec\n")
    
    topn = d2v_model.dv.most_similar([user_vector], topn=n)
    
    article_indices = [x[0] for x in topn]
    cos_sims = [x[1] for x in topn]
    
    ## Create a dataframe with the results
    df_sim = pd.DataFrame(columns=df.columns)
    
    for index in article_indices:
        df_sim.loc[index] = df.iloc[index].copy()
    
    df_sim['Cosine Similarity'] = cos_sims
        
    return df_sim

### Functions for finding the top $n$ recommendations based on $m$ user inputs

In [11]:
"""
    Prints the titles of the papers the user inputted
    and the top n recommendations store in the df_results.
    
    df_user: the dataset of articles the user has input
    df_results: results of similar articles based on cosine similarity
"""
def display_results(df_results, df_user):
    
    print("The top", len(df_results), "article(s) most similar to the article(s): \n\n")   
    ## Get the titles of the user's inputs
    i = 0
    for i in range(len(df_user)):
        title = df_user.iloc[i]['title']
        print(i+1, ".", title)  
        i = i + 1
        
    print("\n#############################################################\n")
    
    for i in range(len(df_results)):
       
        match_index = df_results.index.values[i]
        title = df_results.loc[match_index]['title']
        authors = df_results.loc[match_index]['authors_parsed']
        abstract = df_results.loc[match_index]['abstract']
      #  link = df_results.loc[match_index]['entry_id'] 
        cos_sim = df_results.loc[match_index]['Cosine Similarity']
        
        print(i+1, ".", "(", match_index, ")", title, 
              "\n [ Cosine Similarity=", np.round(cos_sim, 3) ,"]\n")
     #   print("\n", authors)
        print("\n", abstract)
      #  print("\n", link) 
        print("\n-----------------------------------------------------\n")
        

"""
    Join the lists of tokens in the dataframe column into one tokenized list.
    
    Inputs:
    df_col: a column of a dataframe whose rows contain list of tokens 
"""
def join_tokens(df_col):
    
    all_tokens = []
    
    for i in range(len(df_col)):
        all_tokens.extend(df_col[i])
    
    return all_tokens      

In [56]:
## For use with Word2Vec
"""
    Returns a dataframe of the top n most similar (by cosine similarity)
    article titles to the user's inputs. 
    
    Inputs:
    model: a trained Word2Vec model
    df: a DataFrame with all of the articles  
    df_user: a dataframe of the user's article inputs   
    n: number of recommendations to return 
"""

## Note: Word2Vec's most_similar function returns the top n most similar words,
## which isn't what we need     
@dispatch(gensim.models.keyedvectors.KeyedVectors, pd.core.frame.DataFrame,
                                              pd.core.frame.DataFrame, int)

def n_recommendations(model, df, df_user, n):
    
    print("--------------------- Word2Vec Model ---------------------\n")
    
    ## Create a DataFrame to store the similarity results     
    df_sim_scores = pd.DataFrame(columns=df.columns)
    df_sim_scores['Cosine Similarity'] = []
    
    ## Join the abstracts into a single list of tokens
    user_abstracts = join_tokens(df_user['abstract_reduced_tokens'])
    
    for i in range(len(df)): 
        
        if len(df['abstract_reduced_tokens'][i]) != 0:
            sim_score = model.n_similarity(user_abstracts, df['abstract_reduced_tokens'][i])    
            df_sim_scores.loc[len(df_sim_scores)] = df.iloc[i]
            df_sim_scores.at[i, 'Cosine Similarity'] = sim_score
    
    ## Sort the cosine similarity scores in the dataframe
    df_sim_scores = df_sim_scores.sort_values(by=['Cosine Similarity'], ascending=False)
        
    ## Now check for duplicate articles indices and keep the last index
    ## By default, it will keep the first row and remove the redundant rows.     
    df_sim_scores = df_sim_scores.drop_duplicates(subset=['id'])
       
     ## Get the first n articles in the dataframe
    df_top_n = df_sim_scores.head(n)

    return df_top_n, df_user  

###################################################################

"""
    Returns a dataframe of the top n most similar (by cosine similarity)
    article titles to the user's inputs. 
    
    Inputs:
    model: a trained Doc2Vec model
    df: a DataFrame with all of the articles  
    df_user: a dataframe of the user's article inputs   
    n: number of recommendations to return 
    dm: boolean: True, if using Doc2Vec is using distributed memory, 
        False, if using distributed BOW 
"""
@dispatch(gensim.models.doc2vec.Doc2Vec, pd.core.frame.DataFrame,
                                      pd.core.frame.DataFrame, int, bool)

def n_recommendations(model, df, df_user, n, dm):
    
    if dm == True:
        print("----------------- Doc2Vec Model (Dist. Memory)-----------------\n")
    else:
        print("----------------- Doc2Vec Model (Dist. Bag of Words)-----------------\n")

    
    ## Create a DataFrame to store the similarity results     
    df_sim_scores = pd.DataFrame(columns=df.columns)
    df_sim_scores['Cosine Similarity'] = []
    
    ## Join the abstracts into a single list of tokens
    user_abstracts = join_tokens(df_user['abstract_reduced_tokens'])
    
    for i in range(len(df)):      
        
        if len(df['abstract_reduced_tokens'][i] ) != 0:
            ## Here we use the n_similarity function 
            sim_score = model.wv.n_similarity(user_abstracts, df['abstract_reduced_tokens'][i])
            df_sim_scores.loc[len(df_sim_scores)] = df.iloc[i]
            df_sim_scores.at[i, 'Cosine Similarity'] = sim_score 
    
    ## Sort the cosine similarity scores in the dataframe
    df_sim_scores = df_sim_scores.sort_values(by=['Cosine Similarity'], ascending=False)
        
    ## Now check for duplicate articles indices and keep the last index
    ## By default, it will keep the first row and remove the redundant rows       
    df_sim_scores = df_sim_scores.drop_duplicates(subset=['id'])
    
     ## Get the first n articles in the dataframe
    df_top_n = df_sim_scores.head(n)

    return df_top_n, df_user

## The models

Here are train the models that we will be working with for the rest of this notebook.

### CountVectorizer

Creates a (sparse) matrix in which each unique word is represented by a column of the matrix. It is also known as document term matrix (dtm).

In [14]:
## max_df: When building the vocabulary ignore terms that have a document frequency 
##         strictly higher than the given threshold (corpus-specific stop words).

count_vectorizer = CountVectorizer(analyzer="word", 
                                tokenizer=nltk.word_tokenize,
                                preprocessor=None, 
                               # stop_words='english', 
                                max_features=2500,
                                ngram_range=(1,3))
                                    ##  max_df=.9
    
bow = count_vectorizer.fit_transform(df['abstract_reduced'])

df_bow = pd.DataFrame(bow.toarray(),
                      columns = count_vectorizer.get_feature_names_out())

### TfidVectorizer

TF for term frequency and ID stands for inverse document. 
The document frequency of a given term is the number of documents that contain that term. 

Unlike CountVectorizer which gives all words equal weights, 
TfidfVectorizer weights the word counts by a measure of how often they appear in the documents.

To compute the tf-idf of a term for a given document you multiply the term frequency of the term within that document by the $\log$ (base 10) of the inverse document frequency for that term across the corpus. 

$$
\text{tf-idf } = \text{ term-frequency } \times \text{ }\log(\text{inverse-document-frequency}) \text{} 
$$

Here, words like 'this', 'are' etc., that are commonly present in all the documents are not given a very high rank.

In [15]:
tfid_vectorizer = TfidfVectorizer(analyzer="word", 
                                tokenizer=nltk.word_tokenize,
                                preprocessor=None, 
                             #   stop_words='english', 
                                max_features=2500,
                                ngram_range=(1,3))
                                    ##  max_df=.9
    
tfid = tfid_vectorizer.fit_transform(df['abstract_reduced'])

df_tfid = pd.DataFrame(tfid.toarray(),
                      columns = tfid_vectorizer.get_feature_names_out())

### Word2Vec 

Rather than training our own model here, we use a pretrained model that has been trained on a much larger dataset for better results!

Internally, this algorithm uses a neural network to learn word associations from the corpus. The generate word embeddings (vectors of length 300 in the case below) which capture semantic and syntactic qualities of words

In [16]:
## Load word2vec model, here GoogleNews is used
## The file must be previously downloaded
## The size of the vectors is 300
w2v_model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', 
                                                        binary=True)

### Doc2Vec 

The Doc2Vec algorithm is an extension of Word2Vec. While Word2Vec computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus.

In [17]:
# Rather than a list of tokenized docs Doc2Vec requires tagged lists of tokens
# that's because the model needs to keep track of the documents.
summaries = [TaggedDocument(doc,[i]) for i, doc in enumerate(df['abstract_reduced_tokens'])]

There are two approaches to the Doc2Vec model, (1) distributed bag of words and (2) distributed memory models.

The difference between the distributed bag of words and the distributed memory model is that the distributed memory model approximates the word using the context of surrounding words and the distributed bag of words model uses the target word to approximate the context of the word.

As in the Word2Vec model we will use a vector size of 300.

#### Doc2Vec using Distributed Bag of Words

In [18]:
# Now we train the Doc2Vec models

# dm = 1 or 0 (optional) – Defines the training algorithm. 
# If dm=1, ‘distributed memory’ (PV-DM) is used. 
# Otherwise, distributed bag of words (PV-DBOW) is employed.

# vector_size - Dimensionality of the feature vectors.

# window -  The maximum distance between the current and 
# predicted word within a sentence.

# min_count - require words to show up a minimum of 2 times
# iscard words with very few occurrences. (Without a variety of representative 
# examples, retaining such infrequent words can often make a model worse!)


d2v_model_bow = Doc2Vec(documents = summaries,
                    dm = 0, ## use distributed bag of words (PV-DBOW) 
                    vector_size = 300, 
                    window = 2, 
                    min_count = 2,
                    epochs=50)

#### Doc2Vec using Distributed Memory

In [19]:
#### Doc2Vec using Distributed Memory
d2v_model_dm = Doc2Vec(documents = summaries,
                    dm = 1, ## use distributed memory
                    vector_size = 300, 
                    window = 2, 
                    min_count = 2,
                    epochs=50)

#### Sanity check

We can  see how "good" the embedding is by looping through the abstracts and recording the similarity rank of the actual abstract embedding to the inferred embedding.

Then, we can calculate the fraction of documents whose rank was 0.

In [20]:
"""
    To see how "good" the Doc2Vec embedding is loop through the abstracts 
    and record the similarity rank of the actual abstract embedding to the 
    inferred embedding.
    Then, calculate and print the fraction of documents whose rank was 0.

    Input:
    d2v_model: a Doc2Vec model
"""
def check_doc2vec_embedding(d2v_model):
    # We'll loop through all of the documents
    # and record the similarity rank to their inferred vector
    summary_ranks = []

    # for each document
    for summary in summaries:
        # get the inferred vector
        inferred_vec = d2v_model.infer_vector(summary.words)
        # find the most similar vectors
        sims = d2v_model.dv.most_similar([inferred_vec], topn=len(summaries))
    
    # loop through those vectors
        for i in range(len(sims)):
            # find the rank of the document
            if summary.tags[0] == sims[i][0]:
                # record it
                summary_ranks.append(i)
                
    # the fraction of documents whose rank was 0           
    rank_0 = np.sum(np.array(summary_ranks)==0)/len(summary_ranks)
    print("The fraction of documents whose rank is 0 is", np.round(rank_0, 4))

In [21]:
print("----------------- Doc2Vec Model (Dist. Bag of Words)-----------------\n")
check_doc2vec_embedding(d2v_model_bow)

----------------- Doc2Vec Model (Dist. Bag of Words)-----------------

The fraction of documents whose rank is 0 is 0.9922


In [22]:
print("----------------- Doc2Vec Model (Dist. Memory)-----------------\n")
check_doc2vec_embedding(d2v_model_dm)

----------------- Doc2Vec Model (Dist. Memory)-----------------

The fraction of documents whose rank is 0 is 0.9924


These models seem reasonable!

#### Compare results of CountVectorizer, TfidVectorizer, Word2Vec, Doc2Vec for articles within the dataset

In [23]:
## Similarity scores for the the first article in the DataFrame

## Using CountVectorizer
df_cv = get_n_most_similar(df.copy(), df_bow, 0, 5, True)[0]

# Using TfidVectorizer
df_tf = get_n_most_similar(df.copy(), df_tfid, 0, 5, False)[0]

# Using Word2Vec
df_wv = get_n_most_similar(w2v_model, df, 0, 5)[0]

# Using Doc2Vec with Distributed BOW
df_dv_bow = get_n_most_similar(d2v_model_bow, df, 0, 5, False)[0]

# Using Doc2Vec with Distributed Memory
df_dv_dm = get_n_most_similar(d2v_model_dm, df, 0, 5, True)[0]

--------------------- Using CountVectorizer ---------------------

--------------------- Using TfidVectorizer ---------------------

--------------------- Using Word2Vec ---------------------

----------------- Doc2Vec Model (Dist. Bag of Words)-----------------

----------------- Doc2Vec Model (Dist. Memory)-----------------



In [24]:
print("--------------------- Count Vectorizer ---------------------\n")
df_cv[['original_index', 'title', 'abstract', 'strip_cat', 'authors_parsed', 'Cosine Similarity']]

--------------------- Count Vectorizer ---------------------



Unnamed: 0,original_index,title,abstract,strip_cat,authors_parsed,Cosine Similarity
15936,329106,Bifurcation of limit cycles from a quadratic g...,"In this paper, we generalize the PicardFuchs...",[DS],"[['Yang', 'Jihua', '']]",0.410132
9508,315961,Planar Semiquasi Homogeneous Polynomial differ...,This paper study the planar semiquasi homoge...,[DS],"[['Tian', 'Yuzhou', ''], ['Liang', 'Haihua', '']]",0.381928
4796,408265,The local period function for Hamiltonian syst...,In the first part of the paper we develop a ...,[DS],"[['Buzzi', 'Claudio A.', ''], ['Carvalho', 'Ya...",0.335422
9854,367828,First Integrals vs Limit Cycles,This paper applies a recent result determini...,[DS],"[['García', 'Andrés G.', '']]",0.317011
7480,58128,Structure Theory for Second Order 2D Superinte...,The structure theory for the quadratic algeb...,[MP],"[['Kalnins', 'Ernest G.', ''], ['Kress', 'Jona...",0.307934


In [25]:
print("--------------------- Tfid Vectorizer ---------------------\n")
df_tf[['original_index', 'title', 'abstract', 'strip_cat', 'authors_parsed', 'Cosine Similarity']]

--------------------- Tfid Vectorizer ---------------------



Unnamed: 0,original_index,title,abstract,strip_cat,authors_parsed,Cosine Similarity
9508,315961,Planar Semiquasi Homogeneous Polynomial differ...,This paper study the planar semiquasi homoge...,[DS],"[['Tian', 'Yuzhou', ''], ['Liang', 'Haihua', '']]",0.405092
15936,329106,Bifurcation of limit cycles from a quadratic g...,"In this paper, we generalize the PicardFuchs...",[DS],"[['Yang', 'Jihua', '']]",0.325908
17297,169000,Solution of the parametric center problem for ...,The Abel differential equation with is sai...,"[CA, DS]","[['Pakovich', 'Fedor', '']]",0.32451
13706,511903,A sufficient and necessary condition of genera...,The aim of this paper is to give a sufficien...,[DS],"[['Chen', 'Hebai', ''], ['Li', 'Zhijie', ''], ...",0.309687
4796,408265,The local period function for Hamiltonian syst...,In the first part of the paper we develop a ...,[DS],"[['Buzzi', 'Claudio A.', ''], ['Carvalho', 'Ya...",0.301153


In [26]:
print("--------------------- Word2Vec Model ---------------------\n")
df_wv[['original_index', 'title', 'abstract', 'strip_cat', 'authors_parsed', 'Cosine Similarity']]

--------------------- Word2Vec Model ---------------------



Unnamed: 0,original_index,title,abstract,strip_cat,authors_parsed,Cosine Similarity
6108,158258,Topology trivialization and large deviations f...,Finding the global minimum of a cost functio...,"[MP, OC]","[['Fyodorov', 'Yan V', ''], ['Doussal', 'Pierr...",0.884011
13151,107803,Continuous Limits of Classical Repeated Intera...,We consider the physical model of a classica...,"[MP, PR]","[['Deschamps', 'Julien', '']]",0.880297
19029,53103,Phase portraits for quadratic homogeneous poly...,Let X be a homogeneous polynomial vector fie...,[DS],"[['Llibre', 'Jaume', ''], ['Pessoa', 'Claudio'...",0.878772
9508,315961,Planar Semiquasi Homogeneous Polynomial differ...,This paper study the planar semiquasi homoge...,[DS],"[['Tian', 'Yuzhou', ''], ['Liang', 'Haihua', '']]",0.877008
18932,365655,"Invariant tori, actionangle variables and phas...","We study the classical RajeevRanken model, a...","[DS, MP]","[['Krishnaswami', 'Govind S.', ''], ['Vishnu',...",0.875581


In [27]:
print("----------------- Doc2Vec Model (Dist. Bag of Words)-----------------\n")
df_dv_bow[['original_index', 'title', 'abstract', 'strip_cat', 'authors_parsed', 'Cosine Similarity']]

----------------- Doc2Vec Model (Dist. Bag of Words)-----------------



Unnamed: 0,original_index,title,abstract,strip_cat,authors_parsed,Cosine Similarity
6867,63325,Planar polynomial vector fields having a polyn...,We consider in this work planar polynomial d...,"[CA, DS]","[['Garcia', 'Belen', ''], ['Giacomini', 'Hecto...",0.588146
10619,468612,Rational integrals of 2dimensional geodesic fl...,This paper is devoted to searching for Riema...,"[DS, AP, DG]","[['Agapov', 'Sergei', '', '1 and 2'], ['Shubin...",0.521801
10096,325855,Averaging theory at any order for computing li...,This work is devoted to study the existence ...,[DS],"[['Llibre', 'Jaume', ''], ['Novaes', 'Douglas ...",0.496199
15464,240592,Dual morse index estimates and application to ...,"In this paper, we study the multiplicity of ...",[AP],"[['Tang', 'Shanshan', '']]",0.490552
9508,315961,Planar Semiquasi Homogeneous Polynomial differ...,This paper study the planar semiquasi homoge...,[DS],"[['Tian', 'Yuzhou', ''], ['Liang', 'Haihua', '']]",0.489677


In [28]:
print("----------------- Doc2Vec Model (Dist. Memory)-----------------\n")
df_dv_dm[['original_index', 'title', 'abstract', 'strip_cat', 'authors_parsed', 'Cosine Similarity']]

----------------- Doc2Vec Model (Dist. Memory)-----------------



Unnamed: 0,original_index,title,abstract,strip_cat,authors_parsed,Cosine Similarity
9508,315961,Planar Semiquasi Homogeneous Polynomial differ...,This paper study the planar semiquasi homoge...,[DS],"[['Tian', 'Yuzhou', ''], ['Liang', 'Haihua', '']]",0.513413
4796,408265,The local period function for Hamiltonian syst...,In the first part of the paper we develop a ...,[DS],"[['Buzzi', 'Claudio A.', ''], ['Carvalho', 'Ya...",0.496991
6867,63325,Planar polynomial vector fields having a polyn...,We consider in this work planar polynomial d...,"[CA, DS]","[['Garcia', 'Belen', ''], ['Giacomini', 'Hecto...",0.480605
10131,178087,On the dynamics of lattice systems with unboun...,We supply the mathematical arguments require...,[MP],"[['Nachtergaele', 'Bruno', ''], ['Sims', 'Robe...",0.45268
15423,137024,Poisson cohomology of two Fano threefolds,We study the variety of Poisson structures a...,"[AG, DG]","[['Mayanskiy', 'Evgeny', '']]",0.446801


## Test the models using user data

We will next test our user data on the three more sophisticated models: 

(1) pretrained Word2Vec 

(2) Doc2Vec with distributed bag of words

(3) Doc2Vec with distributed memory.

In [29]:
## Presumably, these articles are not already in the dataset.

## Here are several lists of papers we are interested in
ethan = ['1802.03426', '2304.14481', '2303.03190', '2210.13418',
         '2210.12824', '2210.00661', '2007.02390', '1808.05860',
         '2005.12732','1804.05690']

jeeuhn = ['0905.0486', 'math/0006187', '2106.07444', '1402.0490', 
          '1512.08942', '1603.09235', 'math/0510265', 'math/0505056', 
          'math/0604379', '2209.02568']

mike = ['2207.13571','2207.13498','2211.09644','2001.10647',
        '2103.08093','2207.08245', '2207.01677','2205.08744',
        '2008.04406','1912.09845']

jenia = ['2010.14967', '1307.0493', 'quant-ph/0604014', '2201.05140', 
         '1111.1877', 'quant-ph/9912054', '1611.08286', '1507.02858', 
         'math-ph/0107001','1511.01241', 'math-ph/9904020', '2211.15336', 
         '2212.03719']

In [30]:
## Get the list of words of frequency 1 in the dataframe
unique_words = get_freq1_corpus(df)
print(len(unique_words))

21707


We've observed that removing unique words changes the recommendations and increases the execution time of the algorithm.

In [31]:
# words can be accessed like so
# print(stopwords.words('english'))

## Tokenize the abstract by splitting on whitespaces
## and get rid of the occasional empty string.
def clear_empty(clean_string):
    return [word for word in clean_string.split(" ") if word != '']

## Remove the common stop words
def remove_stop(tokens):
    
    ## Remove punctutation from stopwords because we've already 
    ## removed it from the abstracts
    ## punctuation is imported from the string class
    eng_stopwords = stopwords.words('english')
    new_punct = punctuation + "’" + "‘"

    new_stop = []
    for word in eng_stopwords:
        new_word = ""
        for char in word:
            if char not in new_punct:
                new_word = new_word + char
        
        new_stop.append(new_word)

    return [token for token in tokens if token not in new_stop]

## Remove the words that appear only once in the corpus of the dataset 
def remove_unique(tokens):
    return [token for token in tokens if token not in unique_words]

In [32]:
"""
    Process the arxiv ids that the user has given. 
    Return a DataFrame with the user's article data.
    
    Inputs:
    paper_ids_list: a list of ArXiv ids
"""

def user_data(paper_ids_list):
    ## Create the dataframe to store the user's input papers
    df_user = pd.DataFrame(columns=['id','entry_id', 'title', 'authors','abstract'])
    
    list_urls = []
    list_titles = []
    list_authors = []
    list_abstracts = []

    ## Extract the article info from ArXiv
    for item in paper_ids_list:
        paper = next(arxiv.Search(id_list=[item]).results())
        list_titles.append(paper.title)
        list_authors.append(paper.authors)
        list_abstracts.append(paper.summary)
        list_urls.append(paper.entry_id)
    
    df_user['id'] = paper_ids_list
    df_user['entry_id'] = list_urls
    df_user['title'] = list_titles
    df_user['authors'] = list_authors
    df_user['abstract'] = list_abstracts
    
    ## Clean the user's data
    df_user['abstract_clean'] = df_user['abstract'].apply(clean_data)
    df_user['abstract_tokenized'] = df_user['abstract_clean'].apply(nltk.word_tokenize)
    df_user['abstract_tokenized'] = df_user['abstract_clean'].apply(clear_empty)
    df_user['abstract_tokenized'] = df_user['abstract_tokenized'].apply(remove_stop)
    df_user['abstract_reduced_tokens'] = df_user['abstract_tokenized'].apply(remove_unique)
    
    return df_user

In [33]:
"""
    Join the lists of tokens in the dataframe column into one tokenized list.
"""
def join_tokens(df_col):
    
    all_tokens = []
    
    for i in range(len(df_col)):
        all_tokens.extend(df_col[i])
    
    return all_tokens

### Ethan's Recommendations

In [34]:
df_ethan = user_data(ethan)
df_ethan[['id', 'entry_id', 'title', 'authors', 'abstract']]

Unnamed: 0,id,entry_id,title,authors,abstract
0,1802.03426,http://arxiv.org/abs/1802.03426v3,UMAP: Uniform Manifold Approximation and Proje...,"[Leland McInnes, John Healy, James Melville]",UMAP (Uniform Manifold Approximation and Proje...
1,2304.14481,http://arxiv.org/abs/2304.14481v1,"Endperiodic maps, splitting sequences, and bra...","[Michael P. Landry, Chi Cheuk Tsang]",We strengthen the unpublished theorem of Gabai...
2,2303.0319,http://arxiv.org/abs/2303.03190v1,Train track combinatorics and cluster algebras,[Shunsuke Kano],The concepts of train track was introduced by ...
3,2210.13418,http://arxiv.org/abs/2210.13418v2,Standardly embedded train tracks and pseudo-An...,"[Eriko Hironaka, Chi Cheuk Tsang]",We show that given a fully-punctured pseudo-An...
4,2210.12824,http://arxiv.org/abs/2210.12824v2,Class number for pseudo-Anosovs,"[François Dahmani, Mahan Mj]","Given two automorphisms of a group $G$, one is..."
5,2210.00661,http://arxiv.org/abs/2210.00661v1,"Braids, entropies and fibered 2-fold branched ...","[Susumu Hirose, Eiko Kin]",It is proved by Sakuma and Brooks that any clo...
6,2007.0239,http://arxiv.org/abs/2007.02390v1,The (homological) persistence of gerrymandering,"[Moon Duchin, Tom Needham, Thomas Weighill]","We apply persistent homology, the dominant too..."
7,1808.0586,http://arxiv.org/abs/1808.05860v1,Discrete geometry for electoral geography,"[Moon Duchin, Bridget Eileen Tenner]","We discuss the ""compactness,"" or shape analysi..."
8,2005.12732,http://arxiv.org/abs/2005.12732v1,Mathematics of Nested Districts: The Case of A...,"[Sophia Caldera, Daryl DeFord, Moon Duchin, Sa...","In eight states, a ""nesting rule"" requires tha..."
9,1804.0569,http://arxiv.org/abs/1804.05690v4,You can hear the shape of a billiard table: Sy...,"[Moon Duchin, Viveka Erlandsson, Christopher J...",We give a complete characterization of the rel...


#### Make recommendations for a single paper. 

In [44]:
## Using the pretrained Word2Vec
df_rec_1, df_ethan_1 = n_recommendations(w2v_model, df, df_ethan[0:1], 5)
display_results(df_rec_1, df_ethan_1)

## Using Doc2Vec with Distributed Bag of Words
df_rec_2, df_ethan_2 = n_recommendations(d2v_model_bow, df, df_ethan[0:1], 5, False)
display_results(df_rec_2, df_ethan_2)

## Using Doc2Vec with Distributed Memory
df_rec_3, df_ethan_3 = n_recommendations(d2v_model_dm, df, df_ethan[0:1], 5, True)
display_results(df_rec_3, df_ethan_3)

--------------------- Word2Vec Model ---------------------

The top 5 article(s) most similar to the article(s): 


1 . UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

#############################################################

1 . ( 11576 ) The Casimir effect from the point of view of algebraic quantum field   theory 
 [ Cosine Similarity= 0.935 ]


   We consider a region of Minkowski spacetime bounded either by one or by two parallel, infinitely extended plates orthogonal to a spatial direction and a real KleinGordon field satisfying Dirichlet boundary conditions. We quantize these two systems within the algebraic approach to quantum field theory using the socalled functional formalism. As a first step we construct a suitable unital *algebra of observables whose generating functionals are characterized by a labelling space which is at the same time optimal and separating and fulfils the Flocality property. Subsequently we give a definition for these s

The top 5 article(s) most similar to the article(s): 


1 . UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

#############################################################

1 . ( 14718 ) Random Projection Trees Revisited 
 [ Cosine Similarity= 0.35 ]


   The Random Projection Tree structures proposed in  are space partitioning data structures that automatically adapt to various notions of intrinsic dimensionality of data. We prove new results for both the RPTreeMax and the RPTreeMean data structures. Our result for RPTreeMax gives a nearoptimal bound on the number of levels required by this data structure to reduce the size of its cells by a factor . We also prove a packing lemma for this data structure. Our final result shows that lowdimensional manifolds have bounded Local Covariance Dimension. As a consequence we show that RPTreeMean adapts to manifold dimension as well. 

-----------------------------------------------------

2 . ( 12410 ) Robust and sca

#### Make recommendations for a multiple papers.

In [45]:
start = time.time()

## Using the pretrained Word2Vec
df_rec_wv, df_ethan_new_wv = n_recommendations(w2v_model, df, df_ethan[0:3], 10)

end = time.time()
res = (end - start)/60
## For a set of 3 papers, the execution time is about 2.5 min.

display_results(df_rec_wv, df_ethan_new_wv)

print('\nExecution time:', res, 'minutes')

#############################################################
start = time.time()

## Using Doc2Vec with Distributed Bag of Words
df_rec_d2v_bow, df_ethan_new_bow = n_recommendations(d2v_model_bow, df, df_ethan[0:3], 10, False)

end = time.time()
res = (end - start)/60

## For a set of 3 papers, this takes about 2.5 min.
display_results(df_rec_d2v_bow, df_ethan_new_bow)

print('\nExecution time:',res, 'minutes')

#############################################################
start = time.time()

## Using Doc2Vec with Distributed Memory
df_rec_d2v_dm, df_ethan_new_dm = n_recommendations(d2v_model_dm, df, df_ethan[0:3], 10, True)

end = time.time()
res = (end - start)/60

## For a set of 3 papers, this takes about 2.5 min.

display_results(df_rec_d2v_dm, df_ethan_new_dm)

print('Execution time:',res, 'minutes')

--------------------- Word2Vec Model ---------------------

The top 10 article(s) most similar to the article(s): 


1 . UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
2 . Endperiodic maps, splitting sequences, and branched surfaces
3 . Train track combinatorics and cluster algebras

#############################################################

1 . ( 11576 ) The Casimir effect from the point of view of algebraic quantum field   theory 
 [ Cosine Similarity= 0.935 ]


   We consider a region of Minkowski spacetime bounded either by one or by two parallel, infinitely extended plates orthogonal to a spatial direction and a real KleinGordon field satisfying Dirichlet boundary conditions. We quantize these two systems within the algebraic approach to quantum field theory using the socalled functional formalism. As a first step we construct a suitable unital *algebra of observables whose generating functionals are characterized by a labelling space which is at t

The top 10 article(s) most similar to the article(s): 


1 . UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
2 . Endperiodic maps, splitting sequences, and branched surfaces
3 . Train track combinatorics and cluster algebras

#############################################################

1 . ( 9378 ) On the Quantum Theory of Molecules 
 [ Cosine Similarity= 0.327 ]


   Transition state theory was introduced in the 1930s to account for chemical reactions. Central to this theory is the idea of a potential energy surface (PES). It was assumed that such a surface could be constructed using eigensolutions of the Schr\"{o}dinger equation for the molecular (Coulomb) Hamiltonian but at that time such calculations were not possible. Nowadays quantum mechanical abinitio electronic structure calculations are routine and from their results PESs can be constructed which are believed to approximate those assumed derivable from the eigensolutions. It is argued here that t

The top 10 article(s) most similar to the article(s): 


1 . UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
2 . Endperiodic maps, splitting sequences, and branched surfaces
3 . Train track combinatorics and cluster algebras

#############################################################

1 . ( 761 ) Rectifiability of Singular Sets in Noncollapsed Spaces with Ricci   Curvature bounded below 
 [ Cosine Similarity= 0.901 ]


   This paper is concerned with the structure of GromovHausdorff limit spaces  of Riemannian manifolds satisfying a uniform lower Ricci curvature bound  as well as the noncollapsing assumption . In such cases, there is a filtration of the singular set, , where x; equivalently no tangent cone splits off a Euclidean factor  isometrically. Moreover, by , . However, little else has been understood about the structure of the singular set .   Our first result for such limit spaces  states that  is rectifiable. In fact, we will show that for a.e. 

### Jee Uhn's Recommendations

In [57]:
df_jeeuhn = user_data(jeeuhn)
df_jeeuhn[['id', 'entry_id', 'title', 'authors', 'abstract']]

Unnamed: 0,id,entry_id,title,authors,abstract
0,0905.0486,http://arxiv.org/abs/0905.0486v3,A geometric construction of colored HOMFLYPT h...,"[Ben Webster, Geordie Williamson]","The aim of this paper is two-fold. First, we g..."
1,math/0006187,http://arxiv.org/abs/math/0006187v1,The Hard Lefschetz Theorem and the topology of...,"[Mark Andrea de Cataldo, Luca Migliorini]",We introduce the notion of lef line bundles on...
2,2106.07444,http://arxiv.org/abs/2106.07444v2,From the Hecke Category to the Unipotent Locus,[Minh-Tâm Quang Trinh],Let $W$ be the Weyl group of a split semisimpl...
3,1402.0490,http://arxiv.org/abs/1402.0490v2,Legendrian knots and constructible sheaves,"[Vivek Shende, David Treumann, Eric Zaslow]",We study the unwrapped Fukaya category of Lagr...
4,1512.08942,http://arxiv.org/abs/1512.08942v3,Cluster varieties from Legendrian knots,"[Vivek Shende, David Treumann, Harold Williams...",Many interesting spaces --- including all posi...
5,1603.09235,http://arxiv.org/abs/1603.09235v1,The Hodge theory of the Decomposition Theorem ...,[Geordie Williamson],In its simplest form the Decomposition Theorem...
6,math/0510265,http://arxiv.org/abs/math/0510265v3,Triply-graded link homology and Hochschild hom...,[Mikhail Khovanov],We trade matrix factorizations and Koszul comp...
7,math/0505056,http://arxiv.org/abs/math/0505056v2,Matrix factorizations and link homology II,"[Mikhail Khovanov, Lev Rozansky]",To a presentation of an oriented link as the c...
8,math/0604379,http://arxiv.org/abs/math/0604379v4,Constructible Sheaves and the Fukaya Category,"[David Nadler, Eric Zaslow]","Let $X$ be a compact real analytic manifold, a..."
9,2209.02568,http://arxiv.org/abs/2209.02568v1,The $P=W$ conjecture for $\mathrm{GL}_n$,"[Davesh Maulik, Junliang Shen]",We prove the $P=W$ conjecture for $\mathrm{GL}...


#### Make recommendations for a single paper.

In [58]:
## Using the pretrained Word2Vec
df_rec_4, df_jeeuhn_1 = n_recommendations(w2v_model, df, df_jeeuhn[0:1], 5)
display_results(df_rec_4, df_jeeuhn_1)

## Using Doc2Vec with Distributed Bag of Words
df_rec_5, df_jeeuhn_2 = n_recommendations(d2v_model_bow, df, df_jeeuhn[0:1], 5, False)
display_results(df_rec_5, df_jeeuhn_2)

df_rec_6, df_jeeuhn_3 = n_recommendations(d2v_model_dm, df, df_jeeuhn[0:1], 5, True)
display_results(df_rec_6, df_jeeuhn_3)

--------------------- Word2Vec Model ---------------------

The top 5 article(s) most similar to the article(s): 


1 . A geometric construction of colored HOMFLYPT homology

#############################################################

1 . ( 6293 ) The term a_4 in the heat kernel expansion of noncommutative tori 
 [ Cosine Similarity= 0.862 ]


   We consider the Laplacian associated with a general metric in the canonical conformal structure of the noncommutative two torus, and calculate a local expression for the term a_4 that appears in its corresponding smalltime heat kernel expansion. The final formula involves one variable functions and lengthy two, three and four variable functions of the modular automorphism of the state that encodes the conformal perturbation of the flat metric. We confirm the validity of the calculated expressions by showing that they satisfy a family of conceptually predicted functional relations. By studying these functional relations abstractly, we derive

The top 5 article(s) most similar to the article(s): 


1 . A geometric construction of colored HOMFLYPT homology

#############################################################

1 . ( 12512 ) Isomonodromic deformations of the sl(2) Fuchsian systems on the Riemann   sphere 
 [ Cosine Similarity= 0.861 ]


   This paper is devoted to two geometric constructions related to the isomonodromic method. We follow the Drinfeld ideas and develop them in the case of the curve . Thus we generalize the results of Arinkin and Lysenko to the case of arbitrary number  of points. First, we construct separated Darboux coordinated in terms of the Hecke correspondences between moduli spaces. In this way we present a geometric interpretation of the Sklyanin formulas. In the second part of the paper, we construct Drinfeld's compactification of the initial data space and describe the compactifying divisor in terms of certain FHsheaves. Finally, we give a geometric presentation of the dynamics of the isomonod

#### Make recommendations for multiple papers.

In [59]:
## Using the pretrained Word2Vec
df_rec_wv2, df_jeeuhn_new_wv = n_recommendations(w2v_model, df, df_jeeuhn[0:3], 10)
display_results(df_rec_wv2, df_jeeuhn_new_wv)

## Using Doc2Vec with Distributed Bag of Words
df_rec_d2v_bow2, df_jeeuhn_new_bow = n_recommendations(d2v_model_bow, df, df_jeeuhn[0:3], 10, False)
display_results(df_rec_d2v_bow2, df_jeeuhn_new_bow)

df_rec_d2v_dm2, df_jeeuhn_new_dm = n_recommendations(d2v_model_dm, df, df_jeeuhn[0:3], 10, True)
display_results(df_rec_d2v_dm2, df_jeeuhn_new_dm)

--------------------- Word2Vec Model ---------------------

The top 10 article(s) most similar to the article(s): 


1 . A geometric construction of colored HOMFLYPT homology
2 . The Hard Lefschetz Theorem and the topology of semismall maps
3 . From the Hecke Category to the Unipotent Locus

#############################################################

1 . ( 961 ) Classification of ArnoldBeltrami Flows and their Hidden Symmetries 
 [ Cosine Similarity= 0.924 ]


   In the context of mathematical hydrodynamics, we consider the group theory structure which underlies the ABCflow introduced by Beltrami, Arnold and Childress. Beltrami equation is the eigenstate equation for the first order LaplaceBeltrami operator *d, which we solve by using harmonic analysis. Taking torus T^3 constructed as R^3/L, where L is a crystallographic lattice, we present a general algorithm to construct solutions of Beltrami equation which utilizes as main ingredient the orbits under the action of the point group

The top 10 article(s) most similar to the article(s): 


1 . A geometric construction of colored HOMFLYPT homology
2 . The Hard Lefschetz Theorem and the topology of semismall maps
3 . From the Hecke Category to the Unipotent Locus

#############################################################

1 . ( 8206 ) Controlled coarse homology and isoperimetric inequalities 
 [ Cosine Similarity= 0.326 ]


   We study a coarse homology theory with prescribed growth conditions. For a finitely generated group G with the word length metric this homology theory turns out to be related to amenability of G. We characterize vanishing of a certain fundamental class in our homology in terms of an isoperimetric inequality on G and show that on any group at most linear control is needed for this class to vanish. The latter is a homological version of the classical Burnside problem for infinite groups, with a positive solution. As applications we characterize existence of primitives of the volume form with 

### Mike's Recommendations

In [60]:
df_mike = user_data(mike)
df_mike[['id', 'entry_id', 'title', 'authors', 'abstract']]

Unnamed: 0,id,entry_id,title,authors,abstract
0,2207.13571,http://arxiv.org/abs/2207.13571v2,Scaling asymptotics of spectral Wigner functions,"[Boris Hanin, Steve Zelditch]",We prove that smooth Wigner-Weyl spectral sums...
1,2207.13498,http://arxiv.org/abs/2207.13498v1,$2$-nodal domain theorems for higher dimension...,"[Junehyuk Jung, Steve Zelditch]",We prove that the real parts of equivariant (b...
2,2211.09644,http://arxiv.org/abs/2211.09644v1,Asymptotics for the spectral function on Zoll ...,"[Yaiza Canzani, Jeffrey Galkowski, Blake Keeler]","On a smooth, compact, Riemannian manifold with..."
3,2001.10647,http://arxiv.org/abs/2001.10647v4,Caustics of weakly Lagrangian distributions,"[Sean Gomes, Jared Wunsch]",We study semiclassical sequences of distributi...
4,2103.08093,http://arxiv.org/abs/2103.08093v2,Around quantum ergodicity,[Semyon Dyatlov],We discuss Shnirelman's Quantum Ergodicity The...
5,2207.08245,http://arxiv.org/abs/2207.08245v1,Classical Wave methods and modern gauge transf...,"[Jeffrey Galkowski, Leonid Parnovski, Roman Sh...","In this article, we consider the asymptotic be..."
6,2207.01677,http://arxiv.org/abs/2207.01677v2,Scaling Asymptotics of Wigner Distributions of...,[Nicholas Lohr],The main result of this article gives scaling ...
7,2205.08744,http://arxiv.org/abs/2205.08744v2,A proof of a Melrose's trace formula,[Yves Colin de Verdière],We give a new proof ofan extension of the Chaz...
8,2008.04406,http://arxiv.org/abs/2008.04406v1,Reduction and Coherent States,"[Jenia Rousseva, Alejandro Uribe]",We apply a quantum version of dimensional redu...
9,1912.09845,http://arxiv.org/abs/1912.09845v2,An introduction to microlocal complex deformat...,"[Jeffrey Galkowski, Maciej Zworski]",In this expository article we relate the prese...


#### Make recommendations for a single paper.

In [61]:
## Using the pretrained Word2Vec
df_rec_7, df_mike_1 = n_recommendations(w2v_model, df, df_mike[0:1], 5)
display_results(df_rec_7, df_mike_1)

## Using Doc2Vec with Distributed Bag of Words
df_rec_8, df_mike_2 = n_recommendations(d2v_model_bow, df, df_mike[0:1], 5, False)
display_results(df_rec_8, df_mike_2)

## Using Doc2Vec with Distributed Memory
df_rec_9, df_mike_3 = n_recommendations(d2v_model_dm, df, df_mike[0:1], 5, True)
display_results(df_rec_9, df_mike_3)

--------------------- Word2Vec Model ---------------------

The top 5 article(s) most similar to the article(s): 


1 . Scaling asymptotics of spectral Wigner functions

#############################################################

1 . ( 14683 ) Encoding Curved Tetrahedra in Face Holonomies: a Phase Space of Shapes   from GroupValued Moment Maps 
 [ Cosine Similarity= 0.88 ]


   We present a generalization of Minkowski's classic theorem on the reconstruction of tetrahedra from algebraic data to homogeneously curved spaces. Euclidean notions such as the normal vector to a face are replaced by LeviCivita holonomies around each of the tetrahedron's faces. This allows the reconstruction of both spherical and hyperbolic tetrahedra within a unified framework. A new type of hyperbolic simplex is introduced in order for all the sectors encoded in the algebraic data to be covered. Generalizing the phase space of shapes associated to flat tetrahedra leads to group valued moment maps and quasiP

The top 5 article(s) most similar to the article(s): 


1 . Scaling asymptotics of spectral Wigner functions

#############################################################

1 . ( 8639 ) Black Hole Instabilities and Exponential Growth 
 [ Cosine Similarity= 0.885 ]


   Recently, a general analysis has been given of the stability with respect to axisymmetric perturbations of stationaryaxisymmetric black holes and black branes in vacuum general relativity in arbitrary dimensions. It was shown that positivity of canonical energy on an appropriate space of perturbations is necessary and sufficient for stability. However, the notions of both "stability" and "instability" in this result are significantly weaker than one would like to obtain. In this paper, we prove that if a perturbation of the form with  a solution to the linearized Einstein equationhas negative canonical energy, then that perturbation must, in fact, grow exponentially in time. The key idea is to make use of the  or ()refle

#### Make recommendations for a multiple papers.

In [62]:
## Using the pretrained Word2Vec
df_rec_wv3, df_mike_new_wv = n_recommendations(w2v_model, df, df_mike[0:3], 10)
display_results(df_rec_wv3, df_mike_new_wv)

## Using Doc2Vec with Distributed Bag of Words
df_rec_d2v_bow3, df_mike_new_bow = n_recommendations(d2v_model_bow, df, df_mike[0:3], 10, False)
display_results(df_rec_d2v_bow3, df_mike_new_bow)

## Using Doc2Vec with Distributed Memory
df_rec_d2v_dm3, df_mike_new_dm = n_recommendations(d2v_model_dm, df, df_mike[0:3], 10, True)
display_results(df_rec_d2v_dm3, df_mike_new_dm)

--------------------- Word2Vec Model ---------------------

The top 10 article(s) most similar to the article(s): 


1 . Scaling asymptotics of spectral Wigner functions
2 . $2$-nodal domain theorems for higher dimensional circle bundles
3 . Asymptotics for the spectral function on Zoll manifolds

#############################################################

1 . ( 18932 ) Invariant tori, actionangle variables and phase space structure of the   RajeevRanken model 
 [ Cosine Similarity= 0.927 ]


   We study the classical RajeevRanken model, a Hamiltonian system with three degrees of freedom describing nonlinear continuous waves in a 1+1dimensional nilpotent scalar field theory pseudodual to the SU(2) principal chiral model. While it loosely resembles the Neumann and Kirchhoff models, its equations may be viewed as the Euler equations for a centrally extended Euclidean algebra. The model has a Lax pair and rmatrix leading to four generically independent conserved quantities in involutio

The top 10 article(s) most similar to the article(s): 


1 . Scaling asymptotics of spectral Wigner functions
2 . $2$-nodal domain theorems for higher dimensional circle bundles
3 . Asymptotics for the spectral function on Zoll manifolds

#############################################################

1 . ( 7133 ) Area minimizing hypersurfaces modulo : a geometric freeboundary   problem 
 [ Cosine Similarity= 0.34 ]


   We consider area minimizing dimensional currents  in complete  Riemannian manifolds  of dimension . For odd moduli we prove that, away from a closed rectifiable set of codimension , the current in question is, locally, the union of finitely many smooth minimal hypersurfaces coming together at a common  boundary of dimension , and the result is optimal. For even  such structure holds in a neighborhood of any point where at least one tangent cone has dimensional spine. These structural results are indeed the byproduct of a theorem that proves (for any modulus) uniqueness 

The top 10 article(s) most similar to the article(s): 


1 . Scaling asymptotics of spectral Wigner functions
2 . $2$-nodal domain theorems for higher dimensional circle bundles
3 . Asymptotics for the spectral function on Zoll manifolds

#############################################################

1 . ( 5790 ) Removable sets and uniqueness on manifolds and metric measure   spaces 
 [ Cosine Similarity= 0.929 ]


   We study symmetric diffusion operators on metric measure spaces. Our main question is whether or not the restriction of the operator to a suitable core continues to be essentially selfadjoint or unique if a small closed set is removed from the space. The effect depends on how large the removed set is, and we provide characterizations of the critical size in terms of capacities and Hausdorff dimension. As a key tool we prove a truncation result for potentials of nonnegative functions. We apply our results to Laplace operators on Riemannian and subRiemannian manifolds and o

### Jenia's Recommendations

In [63]:
df_jenia = user_data(jenia)
df_jenia[['id', 'entry_id', 'title', 'authors', 'abstract']]

Unnamed: 0,id,entry_id,title,authors,abstract
0,2010.14967,http://arxiv.org/abs/2010.14967v5,Construction of quasimodes for non-selfadjoint...,[Víctor Arnaiz],We construct quasimodes for some non-selfadjoi...
1,1307.0493,http://arxiv.org/abs/1307.0493v2,The exponential map of the complexification of...,"[Daniel Burns, Ernesto Lupercio, Alejandro Uribe]","Let $(M, \omega, J)$ be a K\""ahler manifold an..."
2,quant-ph/0604014,http://arxiv.org/abs/quant-ph/0604014v2,Time evolution of non-Hermitian Hamiltonian sy...,"[Carla Figueira de Morisson Faria, Andreas Fring]","We provide time-evolution operators, gauge tra..."
3,2201.05140,http://arxiv.org/abs/2201.05140v1,An introduction to PT-symmetric quantum mechan...,[Andreas Fring],I will provide a pedagogical introduction to n...
4,1111.1877,http://arxiv.org/abs/1111.1877v2,Complexified coherent states and quantum evolu...,"[Eva-Maria Graefe, Roman Schubert]","The complex geometry underlying the Schr\""odin..."
5,quant-ph/9912054,http://arxiv.org/abs/quant-ph/9912054v2,Holomorphic Methods in Mathematical Physics,[Brian C. Hall],This set of lecture notes gives an introductio...
6,1611.08286,http://arxiv.org/abs/1611.08286v1,Unitarity of the time-evolution and observabil...,"[F. S. Luiz, M. A. Pontes, M. H. Y. Moussa]",Here we present an strategy for the derivation...
7,1507.02858,http://arxiv.org/abs/1507.02858v3,Non-Hermitian propagation of Hagedorn wavepackets,"[Caroline Lasser, Roman Schubert, Stephanie Tr...",We investigate the time evolution of Hagedorn ...
8,math-ph/0107001,http://arxiv.org/abs/math-ph/0107001v3,Pseudo-Hermiticity versus PT Symmetry: The nec...,[Ali Mostafazadeh],We introduce the notion of pseudo-Hermiticity ...
9,1511.01241,http://arxiv.org/abs/1511.01241v2,Semiclassical states associated to isotropic s...,"[Victor Guillemin, Alejandro Uribe, Zuoqin Wang]",We define classes of quantum states associated...


#### Make recommendations for a single paper.

In [64]:
## Using the pretrained Word2Vec
df_rec_10, df_jenia_1 = n_recommendations(w2v_model, df, df_jenia[0:1], 5)
display_results(df_rec_10, df_jenia_1)

## Using Doc2Vec with Distributed Bag of Words
df_rec_11, df_jenia_2 = n_recommendations(d2v_model_bow, df, df_jenia[0:1], 5, False)
display_results(df_rec_11, df_jenia_2)

df_rec_12, df_jenia_3 = n_recommendations(d2v_model_dm, df, df_jenia[0:1], 5, True)
display_results(df_rec_12, df_jenia_3)

--------------------- Word2Vec Model ---------------------

The top 5 article(s) most similar to the article(s): 


1 . Construction of quasimodes for non-selfadjoint operators via propagation of Hagedorn wave-packets

#############################################################

1 . ( 18992 ) Instability, index theorem, and exponential trichotomy for Linear   Hamiltonian PDEs 
 [ Cosine Similarity= 0.883 ]


   Consider a general linear Hamiltonian system  in a Hilbert space . We assume that induces a bounded and symmetric bilinear form  on , which has only finitely many negative dimensions . There is no restriction on the antiselfdual operator . We first obtain a structural decomposition of  into the direct sum of several closed subspaces so that  is blockwise diagonalized and  is of upper triangular form, where the blocks are easier to handle. Based on this structure, we first prove the linear exponential trichotomy of . In particular,  has at most algebraic growth in the finite co

The top 5 article(s) most similar to the article(s): 


1 . Construction of quasimodes for non-selfadjoint operators via propagation of Hagedorn wave-packets

#############################################################

1 . ( 10200 ) Fractal Weyl law for the Ruelle spectrum of Anosov flows 
 [ Cosine Similarity= 0.877 ]


   On a closed manifold , we consider a smooth vector field  that generates an Anosov flow. Let  be a smooth potential function. It is known that for any , there exists some anisotropic Sobolev space  such that the operator  has intrinsic discrete spectrum on  called RuellePollicott resonances. In this paper, we show that the density of resonances is bounded by  where ,  and  is the H\"older exponent of the distribution  (strong stable and unstable). We also obtain some more precise results concerning the wave front set of the resonances states and the group property of the transfer operator. We use some semiclassical analysis based on wave packet transform associat

#### Make recommendations for a multiple papers.

In [66]:
## Using the pretrained Word2Vec
df_rec_wv4, df_jenia_new_wv = n_recommendations(w2v_model, df, df_jenia[0:3], 10)
display_results(df_rec_wv4, df_jenia_new_wv)

## Using Doc2Vec with Distributed Bag of Words
df_rec_d2v_bow4, df_jenia_new_bow = n_recommendations(d2v_model_bow, df, df_jenia[0:3], 10, False)
display_results(df_rec_d2v_bow4, df_jenia_new_bow)

df_rec_d2v_dm4, df_jenia_new_dm = n_recommendations(d2v_model_dm, df, df_jenia[0:3], 10, True)
display_results(df_rec_d2v_dm4, df_jenia_new_dm)

--------------------- Word2Vec Model ---------------------

The top 10 article(s) most similar to the article(s): 


1 . Construction of quasimodes for non-selfadjoint operators via propagation of Hagedorn wave-packets
2 . The exponential map of the complexification of {\em Ham} in the real-analytic case
3 . Time evolution of non-Hermitian Hamiltonian systems

#############################################################

1 . ( 18992 ) Instability, index theorem, and exponential trichotomy for Linear   Hamiltonian PDEs 
 [ Cosine Similarity= 0.929 ]


   Consider a general linear Hamiltonian system  in a Hilbert space . We assume that induces a bounded and symmetric bilinear form  on , which has only finitely many negative dimensions . There is no restriction on the antiselfdual operator . We first obtain a structural decomposition of  into the direct sum of several closed subspaces so that  is blockwise diagonalized and  is of upper triangular form, where the blocks are easier to hand

The top 10 article(s) most similar to the article(s): 


1 . Construction of quasimodes for non-selfadjoint operators via propagation of Hagedorn wave-packets
2 . The exponential map of the complexification of {\em Ham} in the real-analytic case
3 . Time evolution of non-Hermitian Hamiltonian systems

#############################################################

1 . ( 377 ) Existence and regularity results for weak solutions to elliptic   systems in divergence form 
 [ Cosine Similarity= 0.397 ]


   We prove existence and regularity results for weak solutions of non linear elliptic systems with non variational structure satisfying growth conditions. In particular we are able to prove higher differentiability results under a dimensionfree gap between  and . 

-----------------------------------------------------

2 . ( 16041 ) Bundle Theory of Improper Spin Transformations 
 [ Cosine Similarity= 0.377 ]


   {\it We first give a geometrical description of the action of the parity oper

The top 10 article(s) most similar to the article(s): 


1 . Construction of quasimodes for non-selfadjoint operators via propagation of Hagedorn wave-packets
2 . The exponential map of the complexification of {\em Ham} in the real-analytic case
3 . Time evolution of non-Hermitian Hamiltonian systems

#############################################################

1 . ( 11576 ) The Casimir effect from the point of view of algebraic quantum field   theory 
 [ Cosine Similarity= 0.916 ]


   We consider a region of Minkowski spacetime bounded either by one or by two parallel, infinitely extended plates orthogonal to a spatial direction and a real KleinGordon field satisfying Dirichlet boundary conditions. We quantize these two systems within the algebraic approach to quantum field theory using the socalled functional formalism. As a first step we construct a suitable unital *algebra of observables whose generating functionals are characterized by a labelling space which is at the same tim

### Assessment of Recommendations

We don't have an objective metric for assessing the quality of recommendations for papers made by our three models based on the users' inputs. It is inherently a subjective task as different users will have different use cases. For example, some users may prefer papers closely related to their current research interests, whereas others, perhaps new to a given field of mathematics, may wish to receive a broader survey of the field. We are also only basing this on the information provided by the article titles and abstracts. Nevertheless, we'll give it a shot! 

Here are the users' rankings of the "best" and "worst" models based on how well the recommended papers from the three input papers capture their interests.

#### Ethan: 
1. `Doc2Vec w/ Distributed Memory`
2. `Doc2Vec w/ Bag of Words`
3. `Word2Vec`

#### Jee Uhn:
1. `Doc2Vec w/ Distributed Memory`
2. `Doc2Vec w/ Bag of Words`
3. `Word2Vec`

#### Mike:
1. 
2. 
3. 

#### Jenia:
1. `Word2Vec`
2. `Doc2Vec w/ Distributed Memory`
3. `Doc2Vec w/ Bag of Words`

