# 'Recommendation of similar articles from journal abstract analysis'  
# Modeling for recommendation creation
## 2019, Misty M. Giles
### https://github.com/OhThatMisty/astro_categories/

In [1]:
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_short
import numpy as np
import os
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import spacy
import unicodedata



In [2]:
# Future work: Lemma could come before removing punctuation?

def normalize(text):
    '''Convert to ascii, remove special characters associated with LaTeX when given a df column,
       only keep alpha chars and contractions/possessives'''
    normalized_text = []
    
    for t in text:
        t = unicodedata.normalize('NFKD', t).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        # This line is necessary to separate some words because of the LaTeX/mathtext formatting 
        # (Split in three because they wouldn't work well together.)
        #t = re.sub('', ' ', t)
        t = re.sub('mathrm|gtrsim|lesssim|odot|langle', ' ', t)
        t = re.sub(r'rangle|rm off|\\\S*|\$\S*\$', ' ', t)
        # Expand "not" before removing punctuation 
        t = re.sub('n\'t', ' not', t)
        t = strip_punctuation(t)
        # This line gets rid of non-alphanumeric characters
        t = re.sub(r'\W+', ' ', t)
        # strip_short gets rid of the rest of the math leftovers
        normalized_text.append(strip_short(t.lower(), minsize=2))
    return normalized_text

# This function flags tokens to drop (excess whitespace and digit-only tokens)
def remove(token):
    '''Return True if a token is excess whitespace or consists only of digits'''
    return token.is_space or token.is_digit

# This function ensures that all printouts use the same formula
def join_tokens(sent):
    '''Joins tokens in a sent without whatever is in remove(), adds pronoun back
       in instead of -PRON-'''
    return ' '.join([token.lemma_ if token.lemma_ != '-PRON-' else token.text.lower()
                     for token in sent if not remove(token)])

# This function prevents nested lists that kill the vectorizer
def join_sentences(doc):
    '''Joins sentences in a doc (includes join_tokens)'''
    return ' '.join([join_tokens(sent) for sent in doc.sents])

# Set up spacy to lemmatize the text
nlp = spacy.load('en', disable=['ner'])
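To illustrate what the regex steps in `normalize` do to a single string, here is a minimal stdlib-only sketch. The helper name `strip_latex` and the sample sentence are hypothetical, and the final list comprehension is only a stand-in for gensim's `strip_short(minsize=2)`; it is not the notebook's actual pipeline, which also lemmatizes with spaCy.

```python
import re
import unicodedata

def strip_latex(t):
    """Stdlib-only approximation of the cleaning steps in normalize() (hypothetical helper)."""
    # Fold accented characters to their closest ASCII equivalents
    t = unicodedata.normalize('NFKD', t).encode('ascii', 'ignore').decode('ascii')
    # Drop backslash commands and inline math delimited by dollar signs
    t = re.sub(r'\\\S*|\$\S*\$', ' ', t)
    # Expand "n't" to " not" before punctuation is removed
    t = re.sub(r"n't", ' not', t)
    # Replace runs of non-word characters with a single space
    t = re.sub(r'\W+', ' ', t)
    # Stand-in for strip_short(minsize=2): drop 1-character tokens
    return ' '.join(w for w in t.lower().split() if len(w) >= 2)

sample = r"Schrödinger's mass of $10^{8}$ \odot isn't a fit"
cleaned = strip_latex(sample)
print(cleaned)  # schrodinger mass of is not fit
```

The possessive `'s` and the article `a` disappear because they end up as one-character tokens, which matches the notebook's remark that the short-token filter removes math leftovers and possessives.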

###  Change the "docs_to_run" variable in the cell below to switch between a test run and a full run.

In [3]:
# Get the csv file created in the cleaning notebook
file = os.path.join('..', 'data', 'astro_intermediate.csv')
df = pd.read_csv(file, index_col=0)

# Set variables for testing speed
docs_to_run = len(df)  # 1000 for testing, len(df) for real processing
train_docs = int(0.9 * docs_to_run)  # 90% training, 10% test
test_docs = docs_to_run - train_docs
print('Setting up to train on', train_docs, 'abstracts.')
print('Setting up test block of', test_docs, 'abstracts.')
print('Total', docs_to_run, 'abstracts in session.')

Setting up to train on 53974 abstracts.
Setting up test block of 5998 abstracts.
Total 59972 abstracts in session.
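The split sizes printed above follow directly from truncating 90% of the document count. A small sketch of that arithmetic, where `split_counts` is a hypothetical helper name and not part of the notebook:

```python
def split_counts(n_docs, train_fraction=0.9):
    """Return (train, test) sizes using the same truncation as int(0.9 * docs_to_run)."""
    train = int(train_fraction * n_docs)  # int() truncates toward zero
    return train, n_docs - train

train_docs, test_docs = split_counts(59972)
print(train_docs, test_docs)  # 53974 5998
```

Because `int()` truncates, any fractional document goes to the test block, so the two sizes always sum to the total.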


In [4]:
%%time

# Clean the file (remove punctuation, lowercase, lemmatize, remove 1-char objects --
# most are math/LaTeX formatting leftovers or possessives)
text = [join_sentences(doc) for doc in nlp.pipe(normalize(df.abstract[:docs_to_run]), batch_size=1000)]

Wall time: 1h 49min 35s


In [5]:
# Create a filepath and write out the sentences, removing stopwords to save space on GitHub.
sentences_file = os.path.join('..', 'data', 'astro_normalized_premodel.txt')
with open(sentences_file, 'w') as out_file:
    for sent in text:
        out_file.write(remove_stopwords(sent) + '\n')

In [6]:
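# Script for printing out an article from the text file to test it.  The file object is
# an iterator, so this can also be used to stream the file if necessary.
# As written (i < 0) nothing is printed; raise the threshold (e.g. i < 1) to print lines.
with open(sentences_file, 'r') as sentences:
    for i, line in enumerate(sentences):
        if i < 0:
            print(line)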
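If streaming is needed, one possible approach is a generator that yields one abstract per line, so the whole file never has to sit in memory. This is a sketch under assumptions: `stream_lines` and the temporary demo file are hypothetical and stand in for `sentences_file`.

```python
import itertools
import os
import tempfile

def stream_lines(path):
    """Yield one line at a time without loading the whole file (hypothetical helper)."""
    with open(path, 'r') as handle:
        for line in handle:
            yield line.rstrip('\n')

with tempfile.TemporaryDirectory() as tmp:
    # Small demo file standing in for the normalized premodel text
    demo_path = os.path.join(tmp, 'demo_premodel.txt')
    with open(demo_path, 'w') as out_file:
        out_file.write('first abstract\nsecond abstract\nthird abstract\n')

    lines = stream_lines(demo_path)
    first_two = list(itertools.islice(lines, 2))  # read only the first two lines
    lines.close()  # closes the underlying file before the directory is removed

print(first_two)  # ['first abstract', 'second abstract']
```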

In [7]:
print('Sample abstract to demonstrate cleaning:\n')
print(text[22:23])

Sample abstract to demonstrate cleaning:

['high resolution pc alma ghz mm and vla ghz measurement have be use to image continuum and spectral line emission from the inner region of the nearby infrared luminous galaxy ic we detect compact pc luminous mm continuum emission in the core of ic with brightness temperature the ghz continuum be equally compact but fainter in flux we suggest that the to mm continuum be opaque at mm wavelength imply very large column density of 1e26 cm and that it emerge from hot dust with temperature vibrationally excited line of hcn 1f and hcn vib be see in emission and resolve on scale of pc the hcn vib emission reveal north south nuclear velocity gradient with projected rotation velocity of km at pc the bright hcn vib emission be orient perpendicular to the velocity gradient ground state line of hcn hc hco and cs show complex line absorption and emission feature hcn and hco have red shift reversed cygni profile consistent with gas inflow of km the absorptio

###  Now that the text has been prepared, it's time to choose a sample abstract from the held-out set of abstracts that won't be used to train the model.  This is a proof-of-concept method for testing the recommendation engine, and it can be run without an internet connection.

In [8]:
# Pick an article to function as the sample 
article_idx = np.random.randint(0, test_docs)

In [9]:
# Display the cleaned data for the new sample (comment out the line below to hide it)
text[article_idx+train_docs:article_idx+train_docs+1]

['the cherenkov telescope array cta be an international project for next generation ground base gamma ray observatory cta conceive as an array of ten of image atmospheric cherenkov telescope comprise small medium and large size telescope be aim to improve on the sensitivity of current generation experiment by an order of magnitude and provide energy coverage from gev to more than tev the schwarzschild couder sc medium size candidate telescope model feature novel aplanatic two mirror optical design capable of wide field of view with significantly improve imaging resolution as compare to the traditional davis cotton optic design achieve this image resolution impose strict alignment requirement to be accomplish by dedicated alignment system in this contribution we present the status of the development of the sc optical alignment system soon to be materialize in full scale prototype sc medium size telescope at the fr lawrence whipple observatory in southern arizona']

### Model: tfidf using sklearn

In [10]:
%%time

# Set up the model for vectorizing/calculating the similarity; with min_df=0.01, terms must
# appear in at least 1% of the training documents (about 540 of the 53974 here).  sklearn
# stopwords are used, as removing stopwords at the beginning proved to damage the ngram results.
tfidf = TfidfVectorizer(ngram_range=(1,5), min_df=0.01, stop_words='english')

# Fit the model on the training data, then transform the test data with the fitted model
tfidf_matrix = tfidf.fit_transform(text[:train_docs]).todense()
article_matrix = tfidf.transform(text[train_docs:]).todense()

# Create a df of the model's values
tfidf_df = pd.DataFrame(tfidf_matrix, columns=tfidf.get_feature_names())
article_df = pd.DataFrame(article_matrix, columns=tfidf.get_feature_names())

Wall time: 6min 53s


In [11]:
print('Training vocabulary:', len(tfidf.get_feature_names()))

Training vocabulary: 1676
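The small vocabulary is a consequence of the `min_df` cutoff. As I understand scikit-learn's documented behaviour, a float `min_df` is read as a proportion of documents, and a term is kept only if its document frequency reaches that proportion. The plain-Python sketch below mimics that filter on toy data; `document_frequency_filter` and the toy documents are hypothetical, and only unigrams are counted for brevity.

```python
import math
from collections import Counter

def document_frequency_filter(docs, min_df):
    """Keep terms that appear in at least min_df (a proportion) of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    threshold = min_df * len(docs)
    return sorted(term for term, count in doc_freq.items() if count >= threshold)

docs = ['telescope array', 'telescope mirror', 'galaxy emission', 'telescope emission']
kept = document_frequency_filter(docs, min_df=0.5)
print(kept)  # ['emission', 'telescope']

# With 53974 training abstracts, a 1% cutoff means at least this many documents:
min_docs_needed = math.ceil(0.01 * 53974)
print(min_docs_needed)  # 540
```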


In [12]:
# Check the top features of an abstract from the dataset
top_features = tfidf_df.iloc[22]
top_features.sort_values(ascending=False)[:10]

pc                0.324876
mm                0.291944
continuum         0.246002
emission          0.238574
line              0.208441
ghz               0.203845
compact           0.176493
km                0.170026
north             0.165100
column density    0.145339
Name: 22, dtype: float64

#### Finding recommendations from tfidf and cosine similarity, using the sample abstract above.

In [13]:
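# Compute the document similarities with sklearn linear_kernel.  Per sklearn,
# linear_kernel is faster than cosine_similarity for tfidf; the results match because
# TfidfVectorizer L2-normalizes each row by default, so the dot product is the cosine.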
document_similarity = linear_kernel(article_df.iloc[article_idx:article_idx+1], tfidf_df).flatten()

# Get the indices for the documents that have highest cosine similarity to the sample.
related_indices = document_similarity.argsort()[:-7:-1]
related_indices

array([27715, 41498, 41819,  2565, 27908, 27977], dtype=int64)
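The reason a plain dot product can stand in for cosine similarity is that the vectors are already unit length (TfidfVectorizer's default `norm='l2'`). A stdlib-only sketch with toy vectors follows; all names and numbers are illustrative assumptions, and `top_k` plays the role of `argsort()[:-7:-1]`, which returns the six highest-scoring indices in descending order.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

query = [1.0, 2.0, 0.0]
corpus = [[2.0, 4.0, 0.0], [0.0, 1.0, 1.0], [3.0, 0.0, 0.0]]

# Dot products of unit vectors equal the cosine of the original vectors
unit_query = l2_normalize(query)
scores = [dot(unit_query, l2_normalize(row)) for row in corpus]
best = top_k(scores, 2)
print(best)  # [0, 1]
```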

In [14]:
# Create a df with the attributes of the similar documents
related_abstracts = df[['abstract', 'title', 'terms']].iloc[related_indices]
related_abstracts['document_similarity'] = document_similarity[related_indices]

# Print out the most similar documents
related_abstracts

index,abstract,title,terms,document_similarity
27715,The Cherenkov Telescope Array (CTA) is an inte...,Prototype 9.7 m Schwarzschild-Couder telescope...,astro-ph.IM,0.698308
41498,The Cherenkov Telescope Array (CTA) is planned...,Status of the Schwarzchild-Couder Medium-Sized...,astro-ph.IM,0.610975
41819,The Cherenkov Telescope Array (CTA) is a forth...,The Gamma-ray Cherenkov Telescope for the Cher...,astro-ph.IM|astro-ph.HE,0.520134
2565,The Cherenkov Telescope Array (CTA) is the maj...,Monte Carlo studies for the optimisation of th...,astro-ph.IM,0.496978
27908,The Cherenkov Telescope Array (CTA) will be th...,ASTRI for the Cherenkov Telescope Array,astro-ph.IM,0.492957
27977,The Cherenkov Telescope Array (CTA) is an inte...,Tools and Procedures for the CTA Array Calibra...,astro-ph.IM,0.490299


### How much does the first result have in common with the sample?  This table (sorted by feature importance of the sample article) shows how many and which of the features match up.

In [15]:
# Get the features of the sample article
sample_features = dict(article_df.iloc[article_idx])
sample_features = pd.DataFrame.from_dict(sample_features, orient='index').reset_index()
sample_features.columns = ['feature', 'sample_importance']
# Get the features for the highest-ranked related article
related_features = dict(tfidf_df.iloc[related_indices[0]])
related_features = pd.DataFrame.from_dict(related_features, orient='index').reset_index()
related_features.columns = ['feature', 'related_importance']
# Merge the two dfs on the feature name
features_df = sample_features.merge(related_features, on='feature')
features_df.loc[features_df.sample_importance > 0].sort_values(by='sample_importance', ascending=False)[:10]

index,feature,sample_importance,related_importance
1543,telescope,0.359725,0.301369
939,medium,0.270012,0.094254
1404,size,0.259504,0.090586
645,generation,0.208783,0.218642
379,design,0.208712,0.218567
746,improve,0.197089,0.206396
1057,observatory,0.192334,0.201416
90,array,0.191389,0.100213
734,image,0.164084,0.0
1075,optical,0.155562,0.244362
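To count how many features actually match, one option is to keep only those with nonzero weight in both articles, since a feature that is zero on either side adds nothing to the dot product. The sketch below uses four rows copied from the table above; `shared_features` is a hypothetical helper, not part of the notebook.

```python
def shared_features(sample, related):
    """Features with nonzero weight in both dicts, ordered by sample weight."""
    shared = [f for f, w in sample.items() if w > 0 and related.get(f, 0) > 0]
    return sorted(shared, key=lambda f: sample[f], reverse=True)

# Four rows taken from the table above
sample = {'telescope': 0.359725, 'medium': 0.270012, 'image': 0.164084, 'optical': 0.155562}
related = {'telescope': 0.301369, 'medium': 0.094254, 'image': 0.0, 'optical': 0.244362}

overlap = shared_features(sample, related)
print(overlap)  # ['telescope', 'medium', 'optical']

# Partial dot product contributed by the shared features in this subset
contribution = sum(sample[f] * related[f] for f in overlap)
```

Here `image` is dropped because the related article gives it zero weight, even though it matters for the sample.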


### But what features were most important for the highly ranked article?

In [16]:
related_features.sort_values(by='related_importance', ascending=False)[:5]

index,feature,related_importance
1543,telescope,0.301369
577,field view,0.269781
1075,optical,0.244362
1642,view,0.219119
645,generation,0.218642


In [17]:
# Get the abstract for the sample article
# Abstract is unaltered from download (more human-readable but includes formatting).
print('ABSTRACT TO MATCH: \n')
print(df.title.iloc[(article_idx + train_docs)], '\n')
print(df.abstract.iloc[(article_idx + train_docs)], '\n')
print(df.terms.iloc[(article_idx + train_docs)])

ABSTRACT TO MATCH: 

Construction of a medium-sized Schwarzschild-Couder telescope as a   candidate for the Cherenkov Telescope Array: development of the optical   alignment system 

The Cherenkov Telescope Array (CTA) is an international project for a next-generation ground-based gamma-ray observatory. CTA, conceived as an array of tens of imaging atmospheric Cherenkov telescopes, comprising small, medium and large-size telescopes, is aiming to improve on the sensitivity of current-generation experiments by an order of magnitude and provide energy coverage from 20 GeV to more than 300 TeV. The Schwarzschild-Couder (SC) medium-size candidate telescope model features a novel aplanatic two-mirror optical design capable of a wide field-of-view with significantly improved imaging resolution as compared to the traditional Davis-Cotton optics design. Achieving this imaging resolution imposes strict alignment requirements to be accomplished by a dedicated alignment system. In this contributio

In [18]:
# Get the abstract for the highest-ranked related article
# Abstract is unaltered from download (more human-readable but includes formatting).
print('HIGHEST-RATED MATCH \n')
print(df.title.iloc[related_indices[0]], '\n')
print(df.abstract.iloc[related_indices[0]], '\n')
print(df.terms.iloc[related_indices[0]])

HIGHEST-RATED MATCH 

Prototype 9.7 m Schwarzschild-Couder telescope for the Cherenkov   Telescope Array: status of the optical system 

The Cherenkov Telescope Array (CTA) is an international project for a next-generation ground-based gamma ray observatory, aiming to improve on the sensitivity of current-generation experiments by an order of magnitude and provide energy coverage from 30 GeV to more than 300 TeV. The 9.7m Schwarzschild-Couder (SC) candidate medium-size telescope for CTA exploits a novel aplanatic two-mirror optical design that provides a large field of view of 8 degrees and substantially improves the off-axis performance giving better angular resolution across all of the field of view with respect to single-mirror telescopes. The realization of the SC optical design implies the challenging production of large aspherical mirrors accompanied by a submillimeter-precision custom alignment system. In this contribution we report on the status of the implementation of the opt