# Spotting Trending Topics in Scientific Research with Latent Dirichlet Allocation

For this week's challenge I decided to spend some time finding a novel text dataset and dabbling in a little data mining. After bumming around the internet for a while I found my target: **Nature.com**

As far as I know, nature.com does not provide any API service for programmatically accessing their content. They have been nice enough to make some papers "open access", which means they are free to download as a PDF or view in the browser as HTML, but there's no way I am gonna point, click, drag, copy, and paste through a hundred webpages to get the volume of data I would like for this notebook.

Fortunately, python is good for more than data analysis. 

And with a URL as straightforward as this

`https://www.nature.com/search?article_type=protocols,research,reviews&subject=biotechnology&page=3`

who needs an api???

There are two Python files in this directory that scraped research papers off nature.com for use in this notebook's dataset.

The [first script](https://github.com/NoahLidell/math-of-intelligence/blob/master/generative_models/collect-article-html.py), `collect-article-html.py`, plugs different [keywords](https://www.nature.com/subjects) into the `&subject=` placeholder in the URL and goes through the first eight pages of search results. Each search result page's HTML is loaded into an HTML parser (lxml), and the links and titles for all 25 articles on the results page are accessed through the XPath for the respective HTML elements. The gathered article links are followed, and the raw page HTML for each research paper is saved in a database (MySQL) along with the article's title, date, etc.
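The search-and-collect loop described above can be sketched roughly as follows. This is not the code from `collect-article-html.py`: it swaps lxml and XPath for the standard library's `html.parser` so it runs without extra dependencies, it makes no network requests, and the names `build_search_urls` and `ArticleLinkParser`, the `/articles/` link prefix, and the sample markup are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

BASE_URL = "https://www.nature.com/search"


def build_search_urls(subject, pages=8):
    """Build the search URLs for the first `pages` result pages of one subject."""
    urls = []
    for page in range(1, pages + 1):
        params = {
            "article_type": "protocols,research,reviews",
            "subject": subject,
            "page": page,
        }
        # safe="," keeps the commas in article_type readable instead of %2C
        urls.append(BASE_URL + "?" + urlencode(params, safe=","))
    return urls


class ArticleLinkParser(HTMLParser):
    """Collect (href, title) pairs from <a> tags that point at /articles/ (assumed prefix)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("/articles/"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside a matching <a> tag
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up markup standing in for one search results page
sample_html = (
    '<ul>'
    '<li><a href="/articles/s41598-017-04608-6">Sample article title</a></li>'
    '<li><a href="/subjects">Subjects</a></li>'
    '</ul>'
)

urls = build_search_urls("biotechnology")
parser = ArticleLinkParser()
parser.feed(sample_html)
print(urls[2])
print(parser.links)
```

In a real run each URL would be fetched and its HTML fed to the parser; here the parser is only exercised on the inline sample so the sketch stays offline.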

I ran `collect-article-html.py` repeatedly across 32 different searchable keywords and in a few hours pulled down over 3000 research papers.

The [second script](https://github.com/NoahLidell/math-of-intelligence/blob/master/generative_models/process-html.py), `process-html.py`, pulls the raw HTML from the db for every article downloaded. Filtering the research paper text out of the HTML document proved easier than I expected. I used the BeautifulSoup4 library to remove all the HTML tags, and with just the page text left over it was as easy as telling Python to keep only the text after the "Abstract" substring and before the "References" substring. Additional text preprocessing was done in this script: removing all special characters, numbers, and excess whitespace.
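A minimal sketch of that slicing and cleanup step, assuming the HTML tags have already been stripped and the page arrives as plain text. The function name `extract_paper_text` is hypothetical, and the lowercasing and the choice to replace (rather than delete) special characters are assumptions, not details taken from `process-html.py`.

```python
import re


def extract_paper_text(page_text):
    """Keep the text between "Abstract" and the last "References", then normalize it."""
    start = page_text.find("Abstract")
    end = page_text.rfind("References")
    if start == -1 or end == -1 or end <= start:
        return ""  # markers missing or out of order: nothing usable
    body = page_text[start + len("Abstract"):end]
    body = body.lower()                      # assumption: the dataset text is lowercase
    body = re.sub(r"[^a-z\s]", " ", body)    # replace digits and special characters with spaces
    return re.sub(r"\s+", " ", body).strip() # collapse excess whitespace


# Made-up page text for illustration
sample = "Site menu Abstract Dolphins emit short, ultrasonic pulses (clicks) at 120 kHz.  References 1. Some citation"
cleaned = extract_paper_text(sample)
print(cleaned)
```

Using `rfind` for "References" is a design choice in this sketch: it avoids cutting the paper short if the word also appears earlier in the body text.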

I dumped all of the article text data out of MySQL and into a SQLite database file for this notebook to pull its data from. The SQLite db is over 100MB, so I couldn't upload it directly to GitHub. If you are trying to run this notebook, you'll need to unzip the `article_db.zip` file in this directory.

I was motivated to compile this dataset because I believe machine learning isn't the only discipline where interesting things are happening right now. What about CRISPR? What about quantum computing? What about nanotechnology? I don't know anything about those topics, but they seem interesting... So this notebook is my attempt to gather and explore data on current research across a variety of fields, using LDA as a method for identifying keywords and topics within larger fields such as biotechnology and physics.

In [1]:
import sqlite3
import pandas as pd
from gensim import corpora, models, similarities
import nltk
from collections import Counter

### Load the DB
The table where the articles are stored is called `articles`.

The columns are:


id | title | text | url | topic | journal | date | type | wc
--- | --- | --- | --- | --- | --- | --- | --- | ---
int | mediumtext | longtext | mediumtext | varchar(245) | varchar(245) | varchar(245) | varchar(245) | int
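To make the schema concrete, here is a small self-contained sketch that builds an in-memory table with the same columns and runs the kind of queries used in this notebook. The rows are made up, the column types are simplified to SQLite's, and the idea that one paper can appear under more than one subject is my assumption about why the queries use `distinct`.

```python
import sqlite3

# In-memory stand-in for the real database file; names prefixed with demo_ are illustrative only
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    """CREATE TABLE articles (
           id INTEGER PRIMARY KEY, title TEXT, text TEXT, url TEXT,
           topic TEXT, journal TEXT, date TEXT, type TEXT, wc INTEGER)"""
)
demo_rows = [
    ("Paper A", "some text", "physics", 2400),
    ("Paper A", "some text", "materials-science", 2400),  # same paper, second subject
    ("Paper B", "some text", "physics", 900),             # below the word-count cutoff
]
demo_conn.executemany(
    "INSERT INTO articles (title, text, topic, wc) VALUES (?, ?, ?, ?)", demo_rows
)

# Count unique titles among articles longer than 1500 words
demo_count = demo_conn.execute(
    "SELECT count(distinct title) FROM articles WHERE wc > 1500"
).fetchone()[0]

# Filter by several subjects with bound parameters (one "?" per subject)
demo_subjects = ("physics", "materials-science")
placeholders = ", ".join("?" for _ in demo_subjects)
demo_titles = [
    row[0]
    for row in demo_conn.execute(
        "SELECT DISTINCT title FROM articles WHERE wc > 1500 AND topic IN ({})".format(placeholders),
        demo_subjects,
    )
]
print(demo_count, demo_titles)
```

Binding the subjects as parameters sidesteps quoting problems that can come up when Python values are formatted straight into a SQL string.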

In [100]:
conn = sqlite3.connect('./database/nature_articles.db')
cursor = conn.cursor()
num_articles = cursor.execute('SELECT count(distinct title) FROM articles WHERE wc > 1500;').fetchall()[0][0]
print('Number of unique articles in dataset: ', num_articles)

df = pd.read_sql_query("SELECT distinct(title), text, url, journal, date FROM articles WHERE wc > 1500 ORDER BY random();",
                       conn)
df.head()

Number of unique articles in dataset:  3147


index | title | text | url | journal | date
--- | --- | --- | --- | --- | ---
0 | Long-Term Monitoring of Dolphin Biosonar Activ... | dolphins emit shortultrasonic pulses clicks to... | http://www.nature.com/articles/s41598-017-04608-6 | Scientific Reports | 28 June 2017
1 | Methylation profile of a satellite DNA constit... | tandemly repeated dnas usually constitute sign... | http://www.nature.com/articles/s41598-017-07231-7 | Scientific Reports | 31 July 2017
2 | Transmission is a Noticeable Cause of Resistan... | it is generally believed that drug resistance ... | http://www.nature.com/articles/s41598-017-08061-3 | Scientific Reports | 09 August 2017
3 | Physiological and transcriptional approaches r... | to explain anaerobic nitrite nitrate productio... | http://www.nature.com/srep/2017/170320/srep447... | Scientific Reports | 20 March 2017
4 | The Rational Design of Therapeutic Peptides fo... | the m family of metalloproteases represents a ... | http://www.nature.com/articles/s41598-017-01542-5 | Scientific Reports | 02 May 2017


### Here is a sample article from the dataset

In [97]:
title, subject, article = cursor.execute("SELECT title, topic, text FROM articles ORDER BY random() LIMIT 1;").fetchall()[0]
print("\n", title)
print("\nSubject:", subject)
print("\n\t", article)


 Easy on-demand self-assembly of lateral nanodimensional hybrid graphene oxide flakes for near-infrared-induced chemothermal therapy

Subject: biotechnology

	 near infrared nir induced chemothermal doxorubicin dox release for anticancer activity was demonstrated using dox incorporated fully lateral nanodimensional graphene oxide ngo flakes layered with chitosan polyethylene glycol peg conjugate ngo dox cpeg from a single pass gas phase self assembly unlike most previously reported graphene oxide based drug carriers the proposed processing method introduced a fully nanoscale both in lateral dimension and thickness configuration without multistep wet physicochemical processes that enhance the drug loading capacity and nir induced heat generation resulting from the increased surface area the accumulation of ngo dox cpeg flakes in prostate cancer cells enhanced apoptotic phenomena via the combined effects of dox release and heat generation upon nir irradiation the combined anticancer eff

In [15]:
subjects = cursor.execute("SELECT distinct topic FROM articles;").fetchall()
print("Subjects in dataset:\n")
for s in subjects:
    print('\t',s[0])

Subjects in dataset:

	 biotechnology
	 anatomy
	 anthropology
	 physics
	 psychology
	 mathematics-and-computing
	 computational-biology-and-bioinformatics
	 ecology
	 cell-biology
	 microbiology
	 biogeochemistry
	 zoology
	 climate-sciences
	 neuroscience
	 genetics
	 cancer
	 plant-sciences
	 immunology
	 chemical-biology
	 chemistry
	 evolution
	 stem-cells
	 ocean-sciences
	 diseases
	 molecular-medicine
	 engineering
	 materials-science
	 nanoscience-and-technology
	 drug-discovery
	 philosophy
	 business-and-industry
	 developmental-biology


In [80]:
def render_topics(subjects, num_topics=3, stem=False, filter_n_most_common_words=500, num_words=30):
    # Accept either a single subject string or a tuple/list of subject strings
    subject_list = [subjects] if isinstance(subjects, str) else list(subjects)
    # Bind the subjects as query parameters instead of formatting them into the SQL string
    placeholders = ', '.join('?' for _ in subject_list)
    query = "SELECT distinct(title), text FROM articles WHERE wc > 1500 and topic IN ({});".format(placeholders)
    df = pd.read_sql_query(query, conn, params=subject_list)

    docs = df['text'].values
    # Count every word in the selected articles so the most frequent ones can be filtered out
    wcount = Counter(word for doc in docs for word in doc.split(' '))
    # 'introduction' and 'conclusion' are used as section titles in most research papers
    stopwords = set(nltk.corpus.stopwords.words('english') + ['introduction', 'conclusion'])
    stopwords.update(w for w, _ in wcount.most_common(filter_n_most_common_words))

    if stem:
        docs = [stem_and_stopword_filter(doc, stopwords) for doc in docs]
    else:
        docs = [stopword_filter(doc, stopwords) for doc in docs]
    # Map each remaining word to an id, then turn every document into a bag of words
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda_model = models.LdaMulticore(corpus, id2word=dictionary, num_topics=num_topics)
    topics = lda_model.show_topics(formatted=False, num_words=num_words)

    print(subjects)

    for t in range(len(topics)):
        print("\nTopic {}, top {} words:".format(t+1, num_words))
        print(" ".join([w[0] for w in topics[t][1]]))


def stem_and_stopword_filter(text, filter_list):
    # Drop stopwords and words of one or two characters, then stem what is left
    stemmer = nltk.stem.snowball.SnowballStemmer('english')
    return [stemmer.stem(word) for word in text.split() if word not in filter_list and len(word) > 2]

def stopword_filter(text, filter_list):
    # Drop stopwords and words of one or two characters
    return [word for word in text.split() if word not in filter_list and len(word) > 2]
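The frequency-based filtering in `render_topics` is easier to see on a toy corpus. The sketch below uses only the standard library; the helper names `build_frequency_stopwords` and `filter_doc` and the three tiny documents are made up for illustration, and no LDA model is fitted here.

```python
from collections import Counter


def build_frequency_stopwords(docs, n_most_common):
    """Return the n most frequent words across all documents as a set."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word for word, _ in counts.most_common(n_most_common)}


def filter_doc(doc, stopwords):
    """Tokenize on whitespace, dropping stopwords and words of one or two characters."""
    return [w for w in doc.split() if w not in stopwords and len(w) > 2]


toy_docs = [
    "the cells divide and the cells migrate",
    "the graphene film heats the cells",
    "the dolphins emit clicks",
]

# "the" appears 5 times and "cells" 3 times; every other word appears once
frequent = build_frequency_stopwords(toy_docs, n_most_common=2)
filtered = [filter_doc(doc, frequent) for doc in toy_docs]
print(frequent)
print(filtered)
```

With 500 words filtered per run, as in the calls in this notebook, the idea is the same: words that show up everywhere carry little information about what separates one topic from another, so they are removed before the dictionary is built.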

In [81]:
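# specific subjects to analyze for topics: either a single subject string
# or a tuple of strings, ie subjects = ('philosophy', 'nanoscience-and-technology', 'biotechnology')
# note: ('philosophy') without a trailing comma is just a string, so it is written as one here
subjects = 'philosophy'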

render_topics(subjects, num_topics=9, stem=False, filter_n_most_common_words=500)

philosophy

Topic 1, top 30 words:
hermeneutical identity sloterdijk markets seen genotype advance heidegger fascism forces faith crisis perspective mean hermeneutics genocide event remains nothing christ actual central action calls show theatre finally autonomy past follows

Topic 2, top 30 words:
theatre particularly ukk suggest later past immanence suggests provides abstract markets several existing debate elements neoliberal worlds cognitive samples actors examples private perspective attention stakeholders associated today identification taken kantian

Topic 3, top 30 words:
theatre object ssh climate internal consciousness genotype ontological standard collaboration far external established transcendence basic contemporary involves private act understood economic field funding traits every identification remains suggest concern takes

Topic 4, top 30 words:
faith event holy sacred advance markets occurs neoliberal interior sloterdijk never forces seems theological politics centra

### Discussion
Of all the subjects that I pulled from nature.com, the philosophy articles seemed to present the clearest themes in the topics generated by LDA. I think this is because in philosophy you have different areas (game theory, ethics, theology, etc) which have established jargon that is specific to that subfield of philosophy but used widely within that subfield (i.e., people who study ethics all have some take on Kant).

Contrast this with the topics generated by LDA for the scientific disciplines. Even within a narrow subfield, such as nanotechnology, papers seem to have very specific subject matter. Instead of words representing subjects of study and inquiry recurring across texts within a scientific subfield, you have terms related to the execution and process of science occurring prominently (terms like 'datasets', 'index', 'amount').

In [82]:
render_topics(('mathematics-and-computing','computational-biology-and-bioinformatics','nanoscience-and-technology'),
               num_topics=9, stem=False, filter_n_most_common_words=500)

('mathematics-and-computing', 'computational-biology-and-bioinformatics', 'nanoscience-and-technology')

Topic 1, top 30 words:
voltage film pulse tdp devices crystal medium required mechanism compounds core resistance components scattering find concentrations measurement green furthermore profiles reaction overall limited cellular ras map way compound zero follows

Topic 2, top 30 words:
graphene csp clinical infection panel datasets membrane detected dynamic crystal measure cycle variables loss term diseases end connectivity global whereas paper green mechanical medium sensitivity clear neural mass years alignment

Topic 3, top 30 words:
events metal connectivity scattering treated mice sleep pressure detected paper stability smaller graph consistent duration global errors map fluorescence end impact datasets relatively index common amount dynamic identify solid probe

Topic 4, top 30 words:
pairs rna term detected neural strategy way sequencing five probe snps involved hence errors 

In [83]:
subjects = cursor.execute('SELECT distinct topic FROM articles;').fetchall()

for s in subjects:
    render_topics(s[0], num_topics=9, stem=False, filter_n_most_common_words=500) 
    print('==================================================================================================')

biotechnology

Topic 1, top 30 words:
shell tanc paper flies fluorescent domain index insulin success pha cellulose maximum dark indicates optical prexpress included sea honey correlation pre honeys glucose reduction ihc pids immune scattering removed ils

Topic 2, top 30 words:
opsin tobacco device leaves normal accumulation opnmw infected retention peg parkinsons viral form cone loss cohort targeted common bar compounds acids pre age fatty ratios reduction amount correlation likely stem

Topic 3, top 30 words:
ngo nps ago cpg peg mir mirna slt dcas flakes needle cpeg plasmid shaped ola tumor end antigen device cycle infected patient systems retention targeted plates primers dga salmonella membranes

Topic 4, top 30 words:
phya ecm synthesis ago phyb exon pathogenic seed seq accumulation epa luciferase acids tanc csps score variant double point inflammatory dha generation stem mrna imaging signaling grown plate infected tobacco

Topic 5, top 30 words:
copulation pase success mir agnps