# Spotting Trending Topics in Scientific Research with Latent Dirichlet Allocation

For this week's challenge I decided to put some effort towards finding a novel text dataset and dabble in a little data mining. After bumming around the internet for a while I found my target: **Nature.com**

As far as I know, nature.com does not provide any API service for programmatically accessing their content. While they have been nice enough to make some papers "open access", which means they are free to download as a PDF or view in the browser as html, there's no way I am gonna point, click, drag, copy, and paste my way through at least a hundred webpages to get the volume of data I would like for this notebook.

Fortunately, python is good for more than data analysis. 

And with a url as straightforward as this:
`https://www.nature.com/search?article_type=protocols,research,reviews&subject=biotechnology&page=3`
who needs an api? 
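
The search url above can be generated for any subject and page number. Here is a minimal sketch, assuming the query parameters keep the same names and order as in the example url. `build_search_url` is a hypothetical helper written for illustration, not code taken from the scraping scripts:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.nature.com/search"

def build_search_url(subject, page, article_types=("protocols", "research", "reviews")):
    """Build a search url for one subject keyword and one results page."""
    params = {
        "article_type": ",".join(article_types),
        "subject": subject,
        "page": page,
    }
    # safe="," keeps the commas in article_type unescaped, as in the example url
    return BASE_URL + "?" + urlencode(params, safe=",")

# The first eight result pages for a single subject keyword
urls = [build_search_url("biotechnology", page) for page in range(1, 9)]
print(urls[2])
```

Looping this over a list of subject keywords gives every results page a scraper would need to visit.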

There are two python files in this folder that I used to scrape research papers off nature.com for use in this notebook's dataset.

The first script, `collect-article-html.py`, plugs different [keywords](https://www.nature.com/subjects) into the `&subject=` placeholder in the url and goes through the first eight pages of search results. Each search result page's html is loaded into an html parser (lxml) and the links and titles for all 25 articles on the results page are accessed through the xpath for their respective html elements. All article links found are followed and the raw page html for each research paper is saved in a database (mysql) along with the article's title, date, etc.
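
The storage step can be sketched roughly as follows. This is an illustration under assumptions, not the actual script: it uses an in-memory `sqlite3` database (the notebook below reads from a sqlite file) instead of mysql, the column names are taken from the queries used later in this notebook, and the `html` column, the column types, the `save_article` helper, and the example row are all made up for the sketch.

```python
import sqlite3

# Assumed schema: title, topic, url, journal, date, text and wc appear in the
# notebook's queries; the html column and all column types are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    title TEXT,
    topic TEXT,
    url TEXT,
    journal TEXT,
    date TEXT,
    html TEXT,
    text TEXT,
    wc INTEGER
)
"""

def save_article(db, title, topic, url, journal, date, html):
    """Store the raw page html of one article together with its metadata."""
    db.execute(
        "INSERT INTO articles (title, topic, url, journal, date, html) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (title, topic, url, journal, date, html),
    )
    db.commit()

demo_db = sqlite3.connect(":memory:")
demo_db.execute(SCHEMA)
# Placeholder values only; this is not a real article.
save_article(demo_db, "An example title", "biotechnology",
             "https://example.org/article", "Example Journal",
             "04 August 2017", "<html>...</html>")
print(demo_db.execute("SELECT count(*) FROM articles").fetchone()[0])
```

The `text` and `wc` columns are left empty here on the assumption that the processing script fills them in afterwards.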

I ran `collect-article-html.py` repeatedly across 32 different searchable keywords and in a few hours pulled down over 3000 research papers.

The second script, `process-html.py`, pulls the raw html for every article downloaded. Filtering the research paper text out of the html document proved easier than I expected. I used the BeautifulSoup4 library to remove all the html tags, and with just the page text left over it was as easy as telling python to keep only the text after the "Abstract" substring and before the "References" substring. Additional text preprocessing was done in this script: removing all special characters, numbers, and excess whitespace.
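
The slicing and cleanup can be sketched as below, assuming the html tags have already been stripped and only the page text remains. `extract_body` and `normalize` are hypothetical names, and the exact cleanup rules (lowercasing, hyphens turned into spaces, everything else that is not a letter dropped) are guesses based on how the cleaned text looks later in this notebook:

```python
import re

def extract_body(page_text, start="Abstract", end="References"):
    """Keep the text after the first `start` marker and before the last `end` marker."""
    start_idx = page_text.find(start)
    begin = start_idx + len(start) if start_idx != -1 else 0
    end_idx = page_text.rfind(end)
    stop = end_idx if end_idx > begin else len(page_text)
    return page_text[begin:stop]

def normalize(text):
    """Lowercase, drop numbers and special characters, collapse whitespace."""
    text = text.lower().replace("-", " ")   # "long-term" -> "long term"
    text = re.sub(r"[^a-z\s]", "", text)    # remove digits and punctuation
    return re.sub(r"\s+", " ", text).strip()

page = "Menu Login Abstract H2O2 plays a key role in long-term   signalling. References 1. Foo"
print(normalize(extract_body(page)))
```

Using the last occurrence of "References" is a design choice in this sketch: the word can also appear in the body of a paper, and cutting at the first hit would throw away everything after it.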

I have compiled a data set of over 3000 research papers across 32 different searchable topics from nature.com. 
Webscraping.
Sql.
So much information. So much scientific information.
Here I will create a tool that can be used to summarize the keywords that are being used in recent research across 32 different fields.

In [1]:
import sqlite3
import pandas as pd
from gensim import corpora, models, similarities
import nltk
from collections import Counter

In [14]:
conn = sqlite3.connect('nature_articles.db')
cursor = conn.cursor()
num_articles = cursor.execute('SELECT count(distinct title) FROM articles WHERE wc > 1500;').fetchall()[0][0]
print('Number of unique articles in dataset: ', num_articles)

df = pd.read_sql_query("SELECT distinct(title), text, url, journal, date FROM articles WHERE wc > 1500 ORDER BY random();",
                       conn)
df.head()

Number of unique articles in dataset:  3147


| | title | text | url | journal | date |
|---|---|---|---|---|---|
| 0 | Ratiometric fluorescent probe with AIE propert... | hydrogen peroxide ho plays a key role in the p... | http://www.nature.com/articles/s41598-017-07465-5 | Scientific Reports | 04 August 2017 |
| 1 | Hypoxia ameliorates intestinal inflammation th... | hypoxia regulates autophagy and nucleotide bin... | http://www.nature.com/articles/s41467-017-00213-3 | Nature Communications | 24 July 2017 |
| 2 | Clinico-biological significance of suppressor ... | suppressor of cytokine signaling socs protein ... | http://www.nature.com/bcj/journal/v7/n7/full/b... | Blood Cancer Journal | 28 July 2017 |
| 3 | Decreased long-chain acylcarnitines from insuf... | increasing evidence shows that metabolic abnor... | http://www.nature.com/articles/s41598-017-06767-y | Scientific Reports | 04 August 2017 |
| 4 | Response inhibition in Parkinson’s disease: ... | parkinsons disease is a neurodegenerative diso... | http://www.nature.com/articles/s41531-017-0024-2 | npj Parkinson's Disease | 07 July 2017 |


In [7]:
title, subject, article = cursor.execute("SELECT title, topic, text FROM articles ORDER BY random() LIMIT 1;").fetchall()[0]
print("\n", title)
print("\nSubject:", subject)
print("\n\t", article)


 Tauroursodeoxycholic acid enhances the development of porcine embryos derived from in vitro-matured oocytes and evaporatively dried spermatozoa

Subject: cell-biology

	 evaporative drying ed is an alternative technique for long term preservation of mammalian sperm which does not require liquid nitrogen or freeze drying equipment but offers advantages for storage and shipping at ambient temperature and low cost however the development of zygotes generated from these sperms was poor here we demonstrated that the supplementation of tauroursodeoxycholic acid tudca an endogenous bile acid during embryo culture improved the developmental competency of embryos derived from in vitro matured pig oocytes injected intracytoplasmically with boar ed spermatozoa by reducing the production of reactive oxygen species the dna degradation and fragmentation and the expression of apoptosis related gene bax and bak and by increasing the transcription of anti apoptosis gene bcl xl and bcl furthermore tudca trea

In [15]:
subjects = cursor.execute("SELECT distinct topic FROM articles;").fetchall()
print("Subjects in dataset:\n")
for s in subjects:
    print('\t',s[0])

Subjects in dataset:

	 biotechnology
	 anatomy
	 anthropology
	 physics
	 psychology
	 mathematics-and-computing
	 computational-biology-and-bioinformatics
	 ecology
	 cell-biology
	 microbiology
	 biogeochemistry
	 zoology
	 climate-sciences
	 neuroscience
	 genetics
	 cancer
	 plant-sciences
	 immunology
	 chemical-biology
	 chemistry
	 evolution
	 stem-cells
	 ocean-sciences
	 diseases
	 molecular-medicine
	 engineering
	 materials-science
	 nanoscience-and-technology
	 drug-discovery
	 philosophy
	 business-and-industry
	 developmental-biology


In [80]:
def render_topics(subjects, num_topics=3, stem=False, filter_n_most_common_words=500, num_words=30):
    # Accept either a single subject (str) or several subjects (tuple or list of str).
    subject_list = [subjects] if isinstance(subjects, str) else list(subjects)
    # Parameterized query with one "?" per subject, so a one-element tuple also works.
    placeholders = ", ".join("?" for _ in subject_list)
    query = "SELECT distinct(title), text FROM articles WHERE wc > 1500 AND topic IN ({});".format(placeholders)
    df = pd.read_sql_query(query, conn, params=subject_list)

    docs = df['text'].values
    # Count every word across the selected articles.
    wcount = Counter(word for doc in docs for word in doc.split(' '))
    # English stopwords plus terms used as section titles in most research papers.
    stopwords = set(nltk.corpus.stopwords.words('english')) | {'introduction', 'conclusion'}
    # Also drop the n most common words in this subset; they are too generic to mark a topic.
    stopwords.update(w for w, _ in wcount.most_common(filter_n_most_common_words))

    if stem:
        docs = [stem_and_stopword_filter(doc, stopwords) for doc in docs]
    else:
        docs = [stopword_filter(doc, stopwords) for doc in docs]
    # Map each token to an id and turn every document into a bag of words.
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda_model = models.LdaMulticore(corpus, id2word=dictionary, num_topics=num_topics)
    topics = lda_model.show_topics(formatted=False, num_words=num_words)

    print(subjects)

    for t in range(len(topics)):
        print("\nTopic {}, top {} words:".format(t+1, num_words))
        print(" ".join([w[0] for w in topics[t][1]]))


def stem_and_stopword_filter(text, filter_list):
    # Drop stopwords and very short tokens, then stem what is left.
    stemmer = nltk.stem.snowball.SnowballStemmer('english')
    return [stemmer.stem(word) for word in text.split() if word not in filter_list and len(word) > 2]

def stopword_filter(text, filter_list):
    # Drop stopwords and very short tokens.
    return [word for word in text.split() if word not in filter_list and len(word) > 2]
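
To make the frequency-based filter in `render_topics` easier to follow, here is a toy version that runs without gensim or nltk. The three "documents" are invented for illustration, and `frequent_words` and `filter_doc` are hypothetical helpers that mirror the steps above rather than code from this notebook:

```python
from collections import Counter

# Invented mini-corpus, already lowercased and stripped of punctuation
toy_docs = [
    "cells were treated and cells were imaged",
    "graphene films were treated with voltage",
    "voltage pulses were applied to graphene",
]

def frequent_words(docs, n):
    """Return the n most common words across all documents."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word for word, _ in counts.most_common(n)}

def filter_doc(doc, stop):
    """Drop filtered words and tokens of two characters or fewer."""
    return [word for word in doc.split() if word not in stop and len(word) > 2]

stop = frequent_words(toy_docs, 1)   # "were" is the single most common word
filtered = [filter_doc(doc, stop) for doc in toy_docs]
bags = [Counter(doc) for doc in filtered]  # a plain bag-of-words per document

print(stop)
print(filtered[0])
```

In the real function, `corpora.Dictionary` and `doc2bow` play the role of the `Counter` bags here, and the resulting corpus is what the LDA model is trained on.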

In [81]:
# subjects to analyze for topics: a single subject as a string, or several as a tuple of strings
# e.g. subjects = ('philosophy', 'nanoscience-and-technology', 'biotechnology')
subjects = 'philosophy'

render_topics(subjects, num_topics=9, stem=False, filter_n_most_common_words=500)

philosophy

Topic 1, top 30 words:
hermeneutical identity sloterdijk markets seen genotype advance heidegger fascism forces faith crisis perspective mean hermeneutics genocide event remains nothing christ actual central action calls show theatre finally autonomy past follows

Topic 2, top 30 words:
theatre particularly ukk suggest later past immanence suggests provides abstract markets several existing debate elements neoliberal worlds cognitive samples actors examples private perspective attention stakeholders associated today identification taken kantian

Topic 3, top 30 words:
theatre object ssh climate internal consciousness genotype ontological standard collaboration far external established transcendence basic contemporary involves private act understood economic field funding traits every identification remains suggest concern takes

Topic 4, top 30 words:
faith event holy sacred advance markets occurs neoliberal interior sloterdijk never forces seems theological politics centra

In [82]:
render_topics(('mathematics-and-computing','computational-biology-and-bioinformatics','nanoscience-and-technology'),
               num_topics=9, stem=False, filter_n_most_common_words=500)

('mathematics-and-computing', 'computational-biology-and-bioinformatics', 'nanoscience-and-technology')

Topic 1, top 30 words:
voltage film pulse tdp devices crystal medium required mechanism compounds core resistance components scattering find concentrations measurement green furthermore profiles reaction overall limited cellular ras map way compound zero follows

Topic 2, top 30 words:
graphene csp clinical infection panel datasets membrane detected dynamic crystal measure cycle variables loss term diseases end connectivity global whereas paper green mechanical medium sensitivity clear neural mass years alignment

Topic 3, top 30 words:
events metal connectivity scattering treated mice sleep pressure detected paper stability smaller graph consistent duration global errors map fluorescence end impact datasets relatively index common amount dynamic identify solid probe

Topic 4, top 30 words:
pairs rna term detected neural strategy way sequencing five probe snps involved hence errors 

In [83]:
subjects = cursor.execute('SELECT distinct topic FROM articles;').fetchall()

for s in subjects:
    render_topics(s[0], num_topics=9, stem=False, filter_n_most_common_words=500) 
    print('==================================================================================================')

biotechnology

Topic 1, top 30 words:
shell tanc paper flies fluorescent domain index insulin success pha cellulose maximum dark indicates optical prexpress included sea honey correlation pre honeys glucose reduction ihc pids immune scattering removed ils

Topic 2, top 30 words:
opsin tobacco device leaves normal accumulation opnmw infected retention peg parkinsons viral form cone loss cohort targeted common bar compounds acids pre age fatty ratios reduction amount correlation likely stem

Topic 3, top 30 words:
ngo nps ago cpg peg mir mirna slt dcas flakes needle cpeg plasmid shaped ola tumor end antigen device cycle infected patient systems retention targeted plates primers dga salmonella membranes

Topic 4, top 30 words:
phya ecm synthesis ago phyb exon pathogenic seed seq accumulation epa luciferase acids tanc csps score variant double point inflammatory dha generation stem mrna imaging signaling grown plate infected tobacco

Topic 5, top 30 words:
copulation pase success mir agnps