### Compare abstracts of papers in my library with new arXiv submissions

In [1]:
from nltk.stem import WordNetLemmatizer
import re
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer


## "Corpus" is a linguistics term for a large body of text.
stop_words = set(stopwords.words("english"))
from mylibrary.Keyword_extractor import Keyword_extractor

## New papers

In [2]:
import requests
from lxml import etree
import mylibrary


search_results = requests.get("http://export.arxiv.org/api/query?"+
                              "search_query=cat:astro-ph.GA"+
                              "&start=0&max_results=10"+
                              "&sortBy=submittedDate&sortOrder=descending")

alldata = []

root = etree.fromstring(search_results.content)
for entry in root.findall("{http://www.w3.org/2005/Atom}entry"):
    # One metadata container per paper.
    paper = mylibrary.Arxiv.Arxiv_meta()

    for element in entry:
        # Strip the XML namespace: "{http://...}title" -> "title".
        name = element.tag.split("}")[-1]
        if name == "author":
            # An <author> element wraps child elements such as <name>.
            for child in element:
                paper.meta[name].append(child.text)
            continue

        # Prefer the "term" attribute (used by <category>), else the text.
        value = element.attrib.get("term", element.text)
        try:
            if isinstance(paper.meta[name], list):
                paper.meta[name].append(value)
            else:
                paper.meta[name] = value
        except KeyError:
            # Skip tags that Arxiv_meta does not track (assumes meta is dict-like).
            continue

    alldata.append(paper)

In [4]:
kwe = Keyword_extractor()
kwe.set_vectorizer(3, max_features=2000)

# Join with spaces so words at abstract boundaries do not run together.
summed_abstract = " ".join(article.meta["summary"] for article in alldata)

In [44]:
kwe.extract_words([summed_abstract])
top_100_2 = kwe.get_top_n_words(50, ngram=2)
top_100_1 = kwe.get_top_n_words(50, ngram=1)

## Old papers

In [38]:
import bibtexparser

# Extract keywords from my library
with open("Cluster_env_papers.bib", "r") as f:
    bib_db = bibtexparser.load(f)
    
summed_my_abstract = ""
for article in bib_db.entries:
    try:
        # Trailing space keeps words at abstract boundaries apart.
        summed_my_abstract += article["abstract"] + " "
    except KeyError:
        print("No abstract")

No abstract
No abstract


In [45]:
kwe.extract_words([summed_my_abstract])
my_top100_2 = kwe.get_top_n_words(50, ngram=2)
my_top100_1 = kwe.get_top_n_words(50, ngram=1)

In [48]:
print(top_100_1)
print(my_top100_1)

[('galaxy', 20), ('model', 16), ('star', 13), ('arm', 12), ('spiral', 11), ('result', 11), ('grain', 11), ('gas', 10), ('density', 9), ('heating', 9), ('velocity', 8), ('observed', 7), ('formation', 7), ('bulge', 7), ('also', 7), ('stellar', 7), ('data', 7), ('temperature', 7), ('distance', 7), ('two', 6), ('time', 6), ('high', 6), ('winding', 6), ('observation', 6), ('process', 6), ('dispersion', 6), ('sfr', 6), ('surface', 6), ('specie', 6), ('destruction', 6), ('induced', 6), ('scattering', 6), ('evolution', 6), ('accretion', 5), ('cosmic', 5), ('inflow', 5), ('sigma', 5), ('interstellar', 5), ('whole', 5), ('low', 5), ('effective', 5), ('extinction', 5), ('band', 5), ('binary', 5), ('suggests', 4), ('resolution', 4), ('show', 4), ('map', 4), ('correlation', 4), ('size', 4)]
[('galaxy', 143), ('cluster', 86), ('mass', 39), ('sample', 28), ('code', 25), ('stellar', 20), ('group', 20), ('distribution', 19), ('formation', 18), ('halo', 18), ('star', 18), ('gas', 17), ('simulation', 17)
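
A quick way to compare two such keyword lists is to look at the words they share. The following is only a minimal sketch: `keyword_overlap` is a hypothetical helper (not part of `Keyword_extractor`), and it assumes the lists are `(word, count)` tuples as printed above. The toy input copies the first few entries of each list.

```python
def keyword_overlap(new_words, my_words):
    """Return the shared keywords and the Jaccard similarity of two lists.

    Both arguments are lists of (word, count) tuples; counts are ignored here.
    """
    new_set = {word for word, _ in new_words}
    my_set = {word for word, _ in my_words}
    shared = new_set & my_set
    union = new_set | my_set
    # Jaccard similarity: size of the intersection over size of the union.
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard


# A few entries taken from the two printed lists.
new_sample = [("galaxy", 20), ("model", 16), ("star", 13), ("gas", 10)]
my_sample = [("galaxy", 143), ("cluster", 86), ("mass", 39), ("star", 18), ("gas", 17)]

shared, jaccard = keyword_overlap(new_sample, my_sample)
print(shared, jaccard)
```

Ignoring the counts is a deliberate simplification; weighting shared words by frequency would be a natural next step.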

## Now, I want to compare each new paper to the whole set of my library.

https://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/  

1. **Collaborative filtering** cannot be applied to new entries because no user ratings are available yet. (This is actually true for any paper, since papers have no user ratings at all; for older papers, citations might serve as a substitute.)  

2. **Content-based filtering** also doesn't work very well. If I use the keywords pre-defined by journals, the list of recommendations is too broad. I tried extracting more specific keywords instead, but then the keywords of different articles no longer match one-to-one.  

3. Then...?
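
One way to sidestep exact keyword matching in item 2 is to compare whole abstracts as weighted word vectors. The sketch below is not what this notebook does; it swaps in TF-IDF vectors and cosine similarity from scikit-learn. `rank_new_papers` is a hypothetical helper, and the four short abstracts are made-up placeholders, not real papers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_new_papers(library_abstracts, new_abstracts):
    """Rank new abstracts by mean TF-IDF cosine similarity to the library."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on all texts so both sets share one vocabulary.
    matrix = vectorizer.fit_transform(library_abstracts + new_abstracts)
    lib_vecs = matrix[: len(library_abstracts)]
    new_vecs = matrix[len(library_abstracts):]
    # Rows: new papers, columns: library papers.
    sims = cosine_similarity(new_vecs, lib_vecs)
    scores = sims.mean(axis=1)
    order = scores.argsort()[::-1]
    return [(int(i), float(scores[i])) for i in order]


# Placeholder abstracts for illustration only.
library = [
    "Galaxy clusters shape the star formation of member galaxies.",
    "Dark matter halo mass in galaxy groups and clusters from simulations.",
]
new = [
    "Dust grain heating and destruction in the interstellar medium.",
    "Star formation quenching of galaxies in massive clusters.",
]

ranking = rank_new_papers(library, new)
print(ranking)  # list of (index into `new`, score), best match first
```

Averaging over the whole library is one choice among several; taking the maximum similarity instead would favor papers that closely match any single library entry.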

### Latent Dirichlet allocation
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation  

LDA, an example of a topic model, tries to quantify the topic of an article as a combination of *unobserved / latent / hidden* smaller topics, assuming each article is made up of only a few topics.  
It can be used, for example, to automatically assign news articles to categories such as culture, politics, economy, or science.

The name "Dirichlet" comes from the assumption that the topic distribution of each article and the word distribution of each topic follow sparse *Dirichlet* priors. In layman's terms, this means that an article is a mixture of a small number of topics, and each topic has a small set of characteristic/distinguishing words.


#### Things to consider further

Can a scientific paper be well represented by a number of subtopics? Maybe not, because a paper has only one central topic, and the subtopics are further details of that central topic. This hierarchy may not be captured correctly by the LDA algorithm.

Before I can settle on a specific algorithm, I need to analyze my actual thought process when deciding which papers to read and which to skip.





