# Extracting keywords from papers


Journal-provided keywords are often too broad to be useful.  
Instead, I will extract keywords from the abstract stored in each BibTeX entry.
Several tools exist for NLP (Natural Language Processing). In this notebook, I use a Python module called nltk (Natural Language Toolkit).  
Note that different NLTK features depend on different data packages (corpora and models), which must be downloaded separately.


A general tutorial on NLTK: https://www.guru99.com/nltk-tutorial.html  
NLTK syntax processing: https://krakensystems.co/blog/2018/nlp-syntax-processing  
A working example of keyword extraction using a different tool set: https://towardsdatascience.com/textrank-for-keyword-extraction-by-python-c0bae21bcec0

See also dedicated keyword extraction packages such as  
rake-nltk, an implementation of RAKE (Rapid Automatic Keyword Extraction), or  
KEA (Keyphrase Extraction Algorithm).
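To give a feel for what RAKE does, here is a minimal sketch of its scoring idea in plain Python. It is not the rake-nltk implementation: the stopword list is a tiny illustrative assumption, the function name `rake_keywords` is made up for this notebook, and the sample sentence is invented in the style of the abstracts. Candidate phrases are the runs of words between stopwords or punctuation, and each phrase scores the sum of degree/frequency over its words.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); real implementations use a much longer one.
STOPWORDS = {"a", "an", "and", "as", "in", "is", "of", "the", "to", "we", "with"}

def rake_keywords(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases, current = [], []
    # Group 1 captures a word (hyphens allowed inside); any other non-space character is punctuation.
    for match in re.finditer(r"([a-z0-9]+(?:-[a-z0-9]+)*)|\S", text.lower()):
        word = match.group(1)
        if word is None or word in STOPWORDS:
            # Stopwords and punctuation end the current candidate phrase.
            if current:
                phrases.append(tuple(current))
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(tuple(current))

    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases each word occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # A phrase scores the sum of degree/frequency over its words.
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Made-up sample sentence, not taken from the .bib file.
text = ("We examine the stellar mass of brightest cluster galaxies, "
        "and the stellar mass surface density profile.")
for phrase, score in rake_keywords(text):
    print(f"{score:5.1f}  {phrase}")
```

Because the score is a sum over words, this scheme favors long phrases, which is worth keeping in mind when reading RAKE-style rankings.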

In [40]:
import bibtexparser

# Abstracts are plain, unstructured text.
with open("Cluster_env_papers.bib", "r") as f:
    bib_db = bibtexparser.load(f)

for entry in bib_db.entries:
    # Not every entry has an abstract; skip the ones that do not.
    if 'abstract' in entry:
        print(entry['abstract'])

# Note that the abstracts contain quite a few special characters from LaTeX equations and acronyms.

Abstract image available at: http://adsabs.harvard.edu/abs/1974ApJ...194....1O
The unprecedented depth and area surveyed by the Subaru Strategic Program with the Hyper Suprime-Cam (HSC-SSP) have enabled us to construct and publish the largest distant cluster sample out to {\$}z\backslashsim 1{\$} to date. In this exploratory study of cluster galaxy evolution from {\$}z=1{\$} to {\$}z=0.3{\$}, we investigate the stellar mass assembly history of brightest cluster galaxies (BCGs), and evolution of stellar mass and luminosity distributions, stellar mass surface density profile, as well as the population of radio galaxies. Our analysis is the first high redshift application of the top N richest cluster selection, which is shown to allow us to trace the cluster galaxy evolution faithfully. Our stellar mass is derived from a machine-learning algorithm, which we show to be unbiased and accurate with respect to the COSMOS data. We find very mild stellar mass growth in BCGs, and no evidence for 

In [34]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/hoseung/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

#### tokenize the abstract into words

In [13]:
entry = bib_db.entries[2]

In [16]:
tokens = nltk.word_tokenize(entry['abstract'])

#### identify types of words (POS, part of speech)

In [20]:
nltk.pos_tag(tokens)

[('We', 'PRP'),
 ('examine', 'VBP'),
 ('subhaloes', 'NNS'),
 ('and', 'CC'),
 ('galaxies', 'NNS'),
 ('residing', 'VBG'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('simulated', 'JJ'),
 ('LCDM', 'NNP'),
 ('galaxy', 'NN'),
 ('cluster', 'NN'),
 ('(', '('),
 ('{', '('),
 ('\\', 'VB'),
 ('$', '$'),
 ('}', ')'),
 ('M', 'NNP'),
 ('{', '('),
 ('\\^', 'VB'),
 ('{', '('),
 ('}', ')'),
 ('}', ')'),
 ('{', '('),
 ('\\', 'JJ'),
 ('{', '('),
 ('}', ')'),
 ('\\backslashrm', 'NNP'),
 ('crit', 'NN'),
 ('{', '('),
 ('\\', 'JJ'),
 ('}', ')'),
 ('}', ')'),
 ('{', '('),
 ('\\_', 'JJ'),
 ('}', ')'),
 ('{', '('),
 ('\\', 'JJ'),
 ('{', '('),
 ('}', ')'),
 ('200', 'CD'),
 ('{', '('),
 ('\\', 'NN'),
 ('}', ')'),
 ('}', ')'),
 ('=1.1\\backslashtimes10', 'NNP'),
 ('{', '('),
 ('\\^', 'VB'),
 ('{', '('),
 ('}', ')'),
 ('}', ')'),
 ('{', '('),
 ('\\', 'JJ'),
 ('{', '('),
 ('}', ')'),
 ('15', 'CD'),
 ('{', '('),
 ('\\', 'NN'),
 ('}', ')'),
 ('}', ')'),
 ('M', 'NNP'),
 ('{', '('),
 ('\\_', 'VB'),
 ('}', ')'),
 ('\\backslashodot/h
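The tagged output above shows the LaTeX escapes in the abstract being split into many meaningless tokens. One possible remedy is to strip that markup before tokenizing. The following is a rough sketch using only the standard `re` module; the function name `strip_latex` is hypothetical, and the patterns are assumptions based on the escapes visible in these abstracts (`\backslash...` commands, escaped symbols, and braces). It discards the math rather than translating it, which seems acceptable for keyword extraction.

```python
import re

def strip_latex(text):
    """Remove LaTeX escape debris from a BibTeX abstract (rough heuristic)."""
    # Drop commands exported as \backslashsim, \backslashrm, \backslashtimes, ...
    text = re.sub(r"\\backslash[A-Za-z]*", " ", text)
    # Drop remaining escaped characters such as \$, \^, \_, \{ and \}
    text = re.sub(r"\\.", " ", text)
    # Drop leftover grouping braces
    text = re.sub(r"[{}]", " ", text)
    # Collapse the runs of whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

sample = r"cluster sample out to {\$}z\backslashsim 1{\$} to date"
print(strip_latex(sample))
```

The cleaned string can then be passed to `nltk.word_tokenize` in place of `entry['abstract']`.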

#### stem words 
Stemming reduces a word to its *root* form (solve, solving -> solv).  
NLTK already provides several stemming algorithms.

In [30]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

In [31]:
print(porter.stem("treated"))
print(lancaster.stem("treated"))
print(snowball.stem("treated"))

treat
tre
treat


#### Lemmatization 
Lemmatization converts a word to its *dictionary* form.

In [32]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

In [39]:
wnl.lemmatize("galaxies")

'galaxy'

In [35]:
wnl.lemmatize("treated")

'treated'

In [36]:
print(porter.stem("subhaloes"))
print(wnl.lemmatize("subhaloes"))

subhalo
subhaloes


In [37]:
print(porter.stem("accreted"))
print(wnl.lemmatize("accreted"))

accret
accreted


In [38]:
print(porter.stem("infalling"))
print(wnl.lemmatize("infalling"))

infal
infalling


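As demonstrated, lemmatization depends on the dictionary's vocabulary: domain-specific words such as *subhaloes*, *infalling* or *accreted* are returned unchanged. Part of the cause is the default part of speech. `lemmatize` assumes a noun (`pos='n'`) unless told otherwise, which is also why *treated* comes back unchanged. Words that are missing from WordNet stay unchanged regardless.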
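One way to give the lemmatizer the part of speech it needs is to reuse the tags produced by `nltk.pos_tag`. Those are Penn Treebank tags, while `WordNetLemmatizer.lemmatize` expects one of the WordNet codes `'n'`, `'v'`, `'a'` or `'r'`. Below is a minimal sketch of the mapping; the helper name `penn_to_wordnet` is made up for this notebook, and the (word, tag) pairs are copied from the tagging output earlier.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # everything else is treated as a noun, the lemmatizer's default

# A few (word, tag) pairs copied from the pos_tag output above.
tagged = [("examine", "VBP"), ("subhaloes", "NNS"), ("residing", "VBG"), ("simulated", "JJ")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

Each pair can then be passed on as `wnl.lemmatize(word, pos=pos)`. With `pos='v'`, for example, `wnl.lemmatize("treated", pos="v")` gives `'treat'`. This does not help with words that WordNet does not contain at all, so a stemmer may still be the more practical choice for domain-specific terms.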
