# 'Recommendation of similar articles from journal abstract analysis'  
# Modeling for recommendation creation
## 2019, Misty M. Giles
### https://github.com/OhThatMisty/astro_categories/

In [1]:
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_short
import numpy as np
import os
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import spacy
import unicodedata



In [2]:
def normalize(text):
    '''Convert each string in an iterable (e.g. a df column) to ASCII, remove special
       characters associated with LaTeX, expand "n't" to "not", and keep only alpha chars
       (possessive leftovers are dropped later by strip_short)'''
    normalized_text = []
    
    for t in text:
        t = unicodedata.normalize('NFKD', t).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        # This line is necessary to separate some words because of the LaTeX/math formatting 
        # (Split in two because they wouldn't work well together.)
        t = re.sub('mathrm|gtrsim|lesssim|odot|langle', ' ', t)
        t = re.sub('rangle|rm off|sigma_', ' ', t) 
        # Expand "not" before removing punctuation 
        t = re.sub('n\'t', ' not', t)
        t = strip_punctuation(t)
        # This line gets rid of non-alpha (mostly for digits now) 
        t = re.sub('[^A-Za-z]+', ' ', t) 
        normalized_text.append(strip_short(t))
    # strip_short removes the rest of the math leftovers (and some abbreviations, like ir for
    # infrared).  The stray letters left over from the math and from units of measurement such
    # as km and mm caused more noise than losing a few very short acronyms would.
    return normalized_text

# This function is to remove excess whitespace 
def remove(token):
    '''Provide feedback on whether a token is excess whitespace'''
    return token.is_space

# This function ensures that all printouts use the same formula
def join_tokens(sent):
    '''Joins tokens in a sent without whatever is in remove(), adds pronoun back
       in instead of -PRON-'''
    return ' '.join([token.lemma_ if token.lemma_ != '-PRON-' else token.text.lower()
                     for token in sent if not remove(token)])

# This function prevents nested lists that kill the vectorizer
def join_sentences(doc):
    '''Joins sentences in a doc (includes join_tokens)'''
    return ' '.join([join_tokens(sent) for sent in doc.sents])

# Set up spacy to lemmatize the text (the 'en' shortcut and the -PRON- lemma used in
# join_tokens are spaCy 2.x behavior)
nlp = spacy.load('en', disable=['ner'])
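
To make the effect of `normalize` concrete, here is a minimal sketch that applies the same steps to a single made-up sentence. The two `_sketch` helpers are hypothetical standard-library stand-ins that approximate gensim's `strip_punctuation` and `strip_short`; they are assumptions for illustration, not the library's code, and the sample sentence is invented.

```python
import re
import string
import unicodedata


def strip_punctuation_sketch(s):
    # Assumed stand-in for gensim's strip_punctuation: punctuation runs become a space
    return re.sub('[%s]+' % re.escape(string.punctuation), ' ', s)


def strip_short_sketch(s, minsize=3):
    # Assumed stand-in for gensim's strip_short: drop tokens shorter than minsize
    return ' '.join(w for w in s.split() if len(w) >= minsize)


def normalize_one(t):
    # Same sequence of steps as normalize() above, applied to one string
    t = unicodedata.normalize('NFKD', t).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    t = re.sub('mathrm|gtrsim|lesssim|odot|langle', ' ', t)
    t = re.sub('rangle|rm off|sigma_', ' ', t)
    t = re.sub("n't", ' not', t)
    t = strip_punctuation_sketch(t)
    t = re.sub('[^A-Za-z]+', ' ', t)
    return strip_short_sketch(t)


# Invented sentence with LaTeX math, a possessive, a contraction, and digits
sample = "The galaxy's mass is $\\gtrsim 10^{9} M_{\\odot}$ and doesn't vary at 345 GHz."
cleaned = normalize_one(sample)
print(cleaned)  # The galaxy mass and does not vary GHz
```

Note that case is untouched at this stage; in the notebook, lowercasing comes from the spaCy lemmatization step.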

###  Change the "docs_to_run" variable in the next cell to switch between a test run and a full run.

In [3]:
# Get the csv file created in the cleaning notebook
file = os.path.join('..', 'data', 'astro_intermediate.csv')
df = pd.read_csv(file, index_col=0)

# Set variables for testing speed
docs_to_run = len(df)  # 1000 for testing, len(df) for real processing
train_docs = int(0.9 * docs_to_run)  # 90% training, 10% test
test_docs = docs_to_run - train_docs
print('Setting up to train on', train_docs, 'abstracts.')
print('Setting up test block of', test_docs, 'abstracts.')
print('Total', docs_to_run, 'abstracts in session.')

Setting up to train on 53974 abstracts.
Setting up test block of 5998 abstracts.
Total 59972 abstracts in session.


In [4]:
%%time

# Don't remove stopwords at this point.  Do that at the modeling stage with the model's 
# built-ins.

# Clean the abstracts (remove punctuation, lemmatize and lowercase, remove 1- and 2-char
# tokens -- most are math/LaTeX formatting leftovers or possessives)
text = [join_sentences(doc) for doc in nlp.pipe(normalize(df.abstract[:docs_to_run]), batch_size=1000)]

Wall time: 1h 35min 6s


In [5]:
print('Sample abstract to demonstrate cleaning:\n')
print(text[22:23])

Sample abstract to demonstrate cleaning:

['high resolution alma ghz and vla ghz measurement have be use image continuum and spectral line emission from the inner region the nearby infrared luminous galaxy detect compact luminous continuum emission the core with brightness temperature the ghz continuum equally compact but fainter flux suggest that the continuum opaque wavelength imply very large column density and that emerge from hot dust with temperature sim vibrationally excited line hcn and hcn vib be see emission and resolve scale the hcn vib emission reveal north south nuclear velocity gradient with projected rotation velocity kms the bright hcn vib emission orient perpendicular the velocity gradient ground state line hcn hco and show complex line absorption and emission feature hcn and hco have red shift reversed cygni profile consistent with gas inflow sim km the absorption feature can traced from the north east into the nucleus contrast show blue shift line wing extend km sugg

###  Now that the text has been prepared, it's time to choose a sample abstract from the held-out abstracts that won't be used to train the model.  This is a proof-of-concept test of the recommendation engine that runs entirely on local data, so it works even without a reliable internet connection.
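
The held-out block is simply the last 10% of rows, so a position inside the test block has to be shifted by `train_docs` to find the same abstract in the full dataset. A minimal sketch of that bookkeeping, using the standard-library `random` module instead of NumPy and an arbitrary seed (both are assumptions for illustration):

```python
import random

docs_to_run = 59972                      # total abstracts reported in the session output
train_docs = int(0.9 * docs_to_run)      # first 90% of rows train the model
test_docs = docs_to_run - train_docs     # remaining rows form the held-out block

rng = random.Random(0)                   # arbitrary seed, only for reproducibility
article_idx = rng.randrange(test_docs)   # position within the held-out block
df_position = article_idx + train_docs   # position of the same abstract in the full data

print(train_docs, test_docs)             # 53974 5998
```

The same offset appears in the notebook wherever the sample is looked up in `df` or `text`, while `article_df` (built only from the held-out rows) is indexed with `article_idx` directly.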

In [6]:
# Pick an article to function as the sample and print some attributes.  
# Abstract is unaltered from download (more human-readable but includes formatting).
article_idx = np.random.randint(0, test_docs)
print(df.title.iloc[(article_idx + train_docs)], '\n')
print(df.abstract.iloc[(article_idx + train_docs)], '\n')
print(df.terms.iloc[(article_idx + train_docs)])

Modeling solar coronal bright point oscillations with multiple nanoflare   heated loops 

Intensity oscillations of coronal bright points (BPs) have been studied for past several years. It has been known for a while that these BPs are closed magnetic loop like structures. However, initiation of such intensity oscillations is still an enigma. There have been many suggestions to explain these oscillations, but modeling of such BPs have not been explored so far. Using a multithreaded nanoflare heated loop model we study the behavior of such BPs in this work. We compute typical loop lengths of BPs using potential field line extrapolation of available data (Chandrashekhar et al. 2013), and set this as the length of our simulated loops. We produce intensity like observables through forward modeling and analyze the intensity time series using wavelet analysis, as was done by previous observers. The result reveals similar intensity oscillation periods reported in past observations. It is sugge

In [7]:
text[article_idx+train_docs:article_idx+train_docs+1]

['intensity oscillation coronal bright point bp have be study for past several year have be know for while that these bp be close magnetic loop like structure however initiation such intensity oscillation still enigma there have be many suggestion explain these oscillation but model such bp have not be explore far use multithread nanoflare heated loop model study the behavior such bp this work compute typical loop length bp use potential field line extrapolation available datum chandrashekhar and set this the length our simulate loop produce intensity like observable through forward modeling and analyze the intensity time series use wavelet analysis be do previous observer the result reveal similar intensity oscillation period report past observation suggest these oscillation be actually shock wave propagation along the loop also show that one consider different background subtraction one can extract adiabatic standing mode from the intensity time series datum well both from the observ

### Model: tfidf using sklearn

In [8]:
%%time

# Set up the model for vectorizing/calculating the similarity; terms (1- to 5-grams) must
# appear in at least 1% of the training documents (min_df=0.01, about 540 abstracts for the
# full run).  sklearn stopwords are used, as removing stopwords at the beginning proved to
# damage the ngram results.
tfidf = TfidfVectorizer(ngram_range=(1,5), min_df=0.01, stop_words='english')

# Fit the model on the training data, then transform the test data with the fitted vocabulary
tfidf_matrix = tfidf.fit_transform(text[:train_docs]).todense()
article_matrix = tfidf.transform(text[train_docs:]).todense()

# Create a df of the model's values (in scikit-learn 1.0 and later, get_feature_names is
# replaced by get_feature_names_out)
tfidf_df = pd.DataFrame(tfidf_matrix, columns=tfidf.get_feature_names())
article_df = pd.DataFrame(article_matrix, columns=tfidf.get_feature_names())

Wall time: 7min


In [9]:
print('Training vocabulary:', len(tfidf.get_feature_names()))

Training vocabulary: 1673
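
The small vocabulary follows from the `min_df` cutoff: a fractional `min_df` is a share of the training documents, so a term survives only if its document frequency reaches that share. The sketch below illustrates the idea on a toy corpus with unigrams only; the helper names and the toy documents are hypothetical, and it is a simplified illustration rather than scikit-learn's implementation.

```python
import math


def document_frequency(docs):
    # Count, for each term, the number of documents that contain it at least once
    counts = {}
    for doc in docs:
        for term in set(doc.split()):
            counts[term] = counts.get(term, 0) + 1
    return counts


def filter_by_min_df(docs, min_df):
    # Keep terms whose document frequency is at least min_df * (number of documents)
    threshold = min_df * len(docs)
    counts = document_frequency(docs)
    return sorted(term for term, count in counts.items() if count >= threshold)


toy_docs = ['loop oscillation', 'loop wave', 'galaxy emission', 'loop emission']
kept = filter_by_min_df(toy_docs, min_df=0.5)
print(kept)  # ['emission', 'loop']

# For the full run: 1% of 53974 training abstracts
min_doc_count = math.ceil(0.01 * 53974)
print(min_doc_count)  # 540
```

With `ngram_range=(1,5)` the same cutoff applies to every n-gram, which is why only a few multi-word terms such as "column density" survive.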


In [10]:
# Check the top features of an abstract from the dataset
top_features = tfidf_df.iloc[22]
top_features.sort_values(ascending=False)[:10]

continuum         0.280843
emission          0.272353
line              0.237815
ghz               0.229133
compact           0.201455
north             0.188428
column density    0.166471
shift             0.160873
column            0.160774
gradient          0.160358
Name: 22, dtype: float64

#### Finding recommendations from tfidf and cosine similarity, using the sample abstract above.

In [11]:
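# Compute the document similarities with sklearn linear_kernel.  TfidfVectorizer
# L2-normalizes each row by default, so the dot product equals cosine similarity, and
# per sklearn linear_kernel is faster than cosine_similarity for tfidf.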
document_similarity = linear_kernel(article_df.iloc[article_idx:article_idx+1], tfidf_df).flatten()

# Get the indices of the six training documents with the highest cosine similarity to the sample.
related_indices = document_similarity.argsort()[:-7:-1]
related_indices

array([28895, 48132, 13338, 19700, 31466, 38146], dtype=int64)
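
The reason a plain dot product can stand in for cosine similarity is that the tf-idf rows have unit length. A minimal standard-library sketch of that equivalence and of the top-k selection, using small invented vectors (the helper functions and the data are assumptions for illustration, not the notebook's code):

```python
import math


def l2_normalize(vec):
    # Scale a vector to unit Euclidean length
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Cosine similarity computed from unnormalized vectors
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


raw_query = [1.0, 2.0, 0.0]
raw_docs = [[2.0, 4.0, 0.0], [0.0, 1.0, 3.0], [1.0, 0.0, 0.0], [0.0, 0.0, 5.0]]

query = l2_normalize(raw_query)
docs = [l2_normalize(d) for d in raw_docs]

# On unit-length vectors, the dot product is the cosine similarity
similarities = [dot(query, d) for d in docs]
assert all(math.isclose(s, cosine(raw_query, r)) for s, r in zip(similarities, raw_docs))

# Indices of the two most similar documents, best first
top_k = sorted(range(len(similarities)), key=similarities.__getitem__, reverse=True)[:2]
print(top_k)  # [0, 2]
```

In the notebook, `argsort()[:-7:-1]` plays the role of the `sorted(...)[:2]` line: it walks the ascending order backwards and stops after six entries.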

In [12]:
# Create a df with the attributes of the similar documents
related_abstracts = df[['abstract', 'title', 'terms']].iloc[related_indices]
related_abstracts['document_similarity'] = document_similarity[related_indices]

# Print out the most similar documents
related_abstracts

| | abstract | title | terms | document_similarity |
|---|---|---|---|---|
| 28895 | We observe intensity oscillations along corona... | First Imaging Observation of Standing Slow Wav... | astro-ph.SR | 0.5696 |
| 48132 | An observation from the Interface Region Imagi... | Global sausage oscillation of solar flare loop... | astro-ph.SR | 0.543786 |
| 13338 | Recent developments in the observation and mod... | Evolution of the transverse density structure ... | astro-ph.SR | 0.524391 |
| 19700 | Coronal loops exist ubiquitously in the solar ... | Period Increase and Amplitude Distribution of ... | astro-ph.SR | 0.493516 |
| 31466 | The analysis of a hot loop oscillation event u... | Slow-Mode Oscillations of Hot Loops Excited at... | astro-ph.SR | 0.483658 |
| 38146 | Context. The dynamics of the flaring loops in ... | Quasi-oscillatory dynamics observed in ascendi... | astro-ph.SR\|astro-ph.IM\|physics.data-an\|physic... | 0.479122 |


In [13]:
# Get the features of the sample article
sample_features = article_df.iloc[article_idx]
sample_features.sort_values(ascending=False)[:10]

intensity      0.479109
loop           0.459913
oscillation    0.394364
time series    0.201368
past           0.173105
length         0.173020
series         0.161585
simulate       0.152786
datum          0.125236
like           0.120374
Name: 5883, dtype: float64

In [14]:
# Get the features for the highest-ranked related article
related_features = tfidf_df.iloc[related_indices[0]]
related_features.sort_values(ascending=False)[:10]

oscillation      0.498137
loop             0.497944
wave             0.227356
intensity        0.216136
phase            0.208189
active region    0.180443
coronal          0.158171
slow             0.147638
flare            0.144144
speed            0.143567
Name: 28895, dtype: float64
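
Because the similarity is a dot product of non-negative tf-idf weights, the terms that appear in both top-ten lists show directly where the score comes from. The sketch below multiplies the shared weights printed above; it only sees the ten displayed features per document, so the result is a lower bound on the reported similarity of 0.5696, not a full recomputation.

```python
# Top-ten weights copied from the two outputs above
sample_top = {
    'intensity': 0.479109, 'loop': 0.459913, 'oscillation': 0.394364,
    'time series': 0.201368, 'past': 0.173105, 'length': 0.173020,
    'series': 0.161585, 'simulate': 0.152786, 'datum': 0.125236, 'like': 0.120374,
}
related_top = {
    'oscillation': 0.498137, 'loop': 0.497944, 'wave': 0.227356,
    'intensity': 0.216136, 'phase': 0.208189, 'active region': 0.180443,
    'coronal': 0.158171, 'slow': 0.147638, 'flare': 0.144144, 'speed': 0.143567,
}

# Terms present in both lists, and each term's contribution to the dot product
shared = sorted(set(sample_top) & set(related_top))
contributions = {term: sample_top[term] * related_top[term] for term in shared}
partial_similarity = sum(contributions.values())

reported_similarity = 0.5696  # document_similarity of the top match in the table above
print(shared)
print(round(partial_similarity, 3), round(partial_similarity / reported_similarity, 2))
```

Three shared terms (intensity, loop, oscillation) account for roughly 0.53 of the 0.57 similarity, which matches the impression from reading the two abstracts.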

In [15]:
# Get the abstract for the highest-ranked related article
# Abstract is unaltered from download (more human-readable but includes formatting).
# related_indices are row positions, so use .iloc as in the cells above
print(df.title.iloc[related_indices[0]], '\n')
print(df.abstract.iloc[related_indices[0]], '\n')
print(df.terms.iloc[related_indices[0]])

First Imaging Observation of Standing Slow Wave in Coronal Fan loops 

We observe intensity oscillations along coronal fan loops associated with the active region AR 11428. The intensity oscillations were triggered by blast waves which were generated due to X-class flares in the distant active region AR 11429. To characterise the nature of oscillations, we created time-distance maps along the fan loops and noted that the intensity oscillations at two ends of the loops were out of phase. As we move along the fan loop, the amplitude of the oscillations first decreased and then increased. The out-of-phase nature together with the amplitude variation along the loop implies that these oscillations are very likely to be standing waves. The period of the oscillations are estimated to be $\sim$27 min, damping time to be $\sim$45 min and phase velocity projected in the plane of sky $\sim$ 65-83 km s$^{-1}$. The projected phase speeds were in the range of acoustic speed of coronal plasma at abou