# Web Scraping of Scientific Articles

Automating repetitive tasks saves time and resources. Data collection is one of the fundamental tasks researchers face. An easy-to-build web scraping tool makes this task repeatable, verifiable, and efficient, and ultimately increases productivity and output.
In this notebook, I have compiled a scraping tool to automate collecting scientific articles from the PubMed and arXiv databases.


### Biopython
The Biopython project is a bioinformatics library written and designed for biologists. Biopython can read and write a variety of file formats, and it contains classes that represent biological sequences and sequence annotations. Information from bioinformatics databases can be easily accessed through the Biopython tools. Some of the specialized modules cover sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and machine learning.
ref: https://biopython.org/
 

In [1]:
from Bio import Entrez
# The Entrez module is designed for querying databases at the National Center for
# Biotechnology Information (NCBI), which include PubMed and PubMed Central
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import regex as re

import arxiv
import arxivscraper
import arxivscraper.arxivscraper as ax
import requests





### Scraping from PubMed
PubMed is a huge repository of biomedical articles. The **E-utilities** are especially helpful because they act as PubMed's API, enabling researchers to download large amounts of data freely. Queries are made through HTTP requests, which return the results as XML. The Entrez module in Biopython is designed to query the National Center for Biotechnology Information (NCBI) databases directly.
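
To make the HTTP-and-XML round trip concrete, here is a minimal sketch using only the standard library. It builds an ESearch URL and extracts IDs from a reply. The helper names (`build_esearch_url`, `parse_id_list`) and the sample reply are illustrative assumptions, not part of Biopython, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=10):
    """Return an ESearch URL that looks up `term` in PubMed (hypothetical helper)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "xml"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_id_list(xml_text):
    """Extract the article IDs from an ESearch XML reply (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


# Hand-written sample in the shape of an ESearch reply; the IDs are made up.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""

url = build_esearch_url("common cold")
ids = parse_id_list(SAMPLE_RESPONSE)
print(url)
print(ids)
```

Pasting the printed URL into a browser should return XML of the same general shape. The Entrez module wraps this request and the parsing step.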


##### Method 1
These functions, written by Marco Bonzanini, can be used to scrape articles from PubMed by:
1. Searching the database using Entrez
2. Retrieving the results
3. Parsing the XML results and printing each article title

In [2]:
# defining a search function for the query
def search(query):
    Entrez.email = 'your.email@example.com'  # replace with your own address
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',  # sort by the article's relevance
                            retmax='10',  # retrieve at most 10 articles
                            retmode='xml',
                            term=query)
    results = Entrez.read(handle)
    handle.close()
    return results

# fetching the full records for the IDs returned by the search function
def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'your.email@example.com'  # replace with your own address
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    handle.close()
    return results

In [3]:
if __name__ == '__main__':
    results = search('cold')
    id_list = results['IdList']
    papers = fetch_details(id_list)
    # print a numbered list of article titles
    for i, paper in enumerate(papers['PubmedArticle']):
        print("%d) %s" % (i+1, paper['MedlineCitation']['Article']['ArticleTitle']))
    # To print the first article in full (requires `import json`):
    # print(json.dumps(papers['PubmedArticle'][0], indent=2, separators=(',', ':')))
 

1) Effect of diurnal temperature range on outpatient visits for common cold in Shanghai, China.
2) Prenatal exposure to ambient temperature variation increases the risk of common cold in children.
3) Knowledge, Attitude And Practice Of Common Cold And Its Management Among Doctors Of Pakistan.
4) Protocol for a randomised, single-blind, two-arm, parallel-group controlled trial of the efficacy of rhinothermy delivered by nasal high flow therapy in the treatment of the common cold.
5) Association between moderately cold temperature and mortality in China.
6) Common cold among young adults in China without a history of asthma or allergic rhinitis - associations with warmer climate zone, dampness and mould at home, and outdoor PM<sub>10</sub> and PM<sub>2.5</sub>.
7) Efficacy and safety of Lian-Ju-Gan-Mao capsules for treating the common cold with wind-heat syndrome: study protocol for a randomized controlled trial.
8) Common cold among pre-school children in China - associations with ambie

The first 10 articles are returned, ordered by relevance. These results can be written to a CSV file.
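
One way to write such results to CSV is sketched below with the standard `csv` module. The helper names and the sample record are hypothetical. The nested keys (`MedlineCitation`, `PMID`, `Article`, `ArticleTitle`, `Journal`, `Title`) are assumed to match what `Entrez.read` returns for PubMed records, so check them against your own `papers` object.

```python
import csv
import io


def papers_to_rows(papers):
    """Flatten parsed PubMed records into (pmid, title, journal) rows."""
    rows = []
    for paper in papers["PubmedArticle"]:
        citation = paper["MedlineCitation"]
        article = citation["Article"]
        rows.append((str(citation["PMID"]),
                     str(article["ArticleTitle"]),
                     str(article["Journal"]["Title"])))
    return rows


def write_csv(rows, handle):
    """Write a header line followed by the rows to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["pmid", "title", "journal"])
    writer.writerows(rows)


# Hypothetical stand-in for the object returned by fetch_details (made-up values).
sample_papers = {
    "PubmedArticle": [
        {"MedlineCitation": {
            "PMID": "00000001",
            "Article": {
                "ArticleTitle": "An example title, with a comma",
                "Journal": {"Title": "Example Journal"},
            },
        }},
    ]
}

buffer = io.StringIO()
write_csv(papers_to_rows(sample_papers), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of a buffer, pass a file opened with `open('pubmed_results.csv', 'w', newline='', encoding='utf-8')`. The file name here is only an example.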

### Scraping from arXiv

arXiv is another great repository of preprints, which are articles shared before or alongside journal publication and are generally not peer-reviewed at the time of posting. The bioRxiv database contains preprints in the field of biology. Both databases follow the same OAI-PMH protocol. Functions can be written to scrape the database, or the Python libraries arxiv and arxivscraper can be used to simplify the process.
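
As a rough illustration of what a hand-written OAI-PMH function could look like, the sketch below builds a `ListRecords` request URL and parses a reply with the standard library. The endpoint address, the `arXiv` metadata prefix, and the sample reply are assumptions for illustration. Check the arXiv documentation for the current endpoint before relying on them. No network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint; it may have changed since this notebook was written.
OAI_BASE = "http://export.arxiv.org/oai2"

# XML namespaces of the OAI-PMH envelope and of the arXiv metadata format.
NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "arxiv": "http://arxiv.org/OAI/arXiv/"}


def build_list_records_url(category, date_from, date_until):
    """Return a ListRecords URL for one set and date range (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": "arXiv",
              "set": category, "from": date_from, "until": date_until}
    return f"{OAI_BASE}?{urlencode(params)}"


def parse_records(xml_text):
    """Extract id and title from each record in a ListRecords reply."""
    root = ET.fromstring(xml_text)
    records = []
    for meta in root.iterfind(".//oai:record/oai:metadata/arxiv:arXiv", NS):
        records.append({
            "id": meta.findtext("arxiv:id", default="", namespaces=NS),
            "title": meta.findtext("arxiv:title", default="", namespaces=NS),
        })
    return records


# Hand-written sample reply with made-up values.
SAMPLE_REPLY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:arXiv.org:0000.00000</identifier></header>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>0000.00000</id>
          <title>An example title</title>
        </arXiv>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

oai_url = build_list_records_url("q-bio", "2020-05-27", "2020-06-30")
parsed_records = parse_records(SAMPLE_REPLY)
print(oai_url)
print(parsed_records)
```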

arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics. https://pypi.org/project/arxiv/

### Without filtering

You can use arxivscraper directly in your scripts. Let's create a scraper to fetch all preprints in the quantitative biology category from 27 May 2020 until 30 June 2020:


In [4]:

# scraping articles in the quantitative biology category
scraper = arxivscraper.Scraper(category='q-bio', date_from='2020-05-27', date_until='2020-06-30')


In [5]:
output = scraper.scrape()

fetching up to  1000 records...
fetching is completed in 5.9 seconds.
Total number of records 590


In [6]:
# Save the output in your preferred format or convert it into a pandas DataFrame
# (pandas was already imported as pd in the first cell)

# converting to a DataFrame
cols = ('id', 'title', 'categories', 'abstract', 'doi', 'created', 'updated', 'authors')
df = pd.DataFrame(output, columns=cols)

In [7]:
df.head()

| | id | title | categories | abstract | doi | created | updated | authors |
|---|---|---|---|---|---|---|---|---|
| 0 | 1207.3563 | a mathematical model of the metabolic and perf... | q-bio.nc q-bio.to | cortical spreading depression (csd) is a slow-... | 10.1371/journal.pone.0070469 | 2012-07-15 | 2013-06-15 | [joshua c. chang, k. c. brennan, dongdong he, ... |
| 1 | 1212.0621 | development of spatial coarse-to-fine processi... | q-bio.nc | the sequential analysis of information in a co... | 10.1186/1471-2202-14-s1-p294 | 2012-12-04 | 2013-07-16 | [jasmine a. nirody] |
| 2 | 1311.1481 | assessment of the mers-cov epidemic situation ... | q-bio.pe q-bio.qm | the appearance of a novel coronavirus named mi... | 10.2807/1560-7917.es2014.19.23.20824 | 2013-11-06 | 2014-05-05 | [chiara poletto, camille pelat, daniel levy-br... |
| 3 | 1505.03738 | a domain-level dna strand displacement reactio... | cs.ce cs.et q-bio.mn | dna strand displacement systems have proven th... | 10.1098/rsif.2019.0866 | 2015-05-11 | | [casey grun, karthik sarma, brian wolfe, seung... |
| 4 | 1507.03703 | alchemical grid dock (algdock): binding free e... | physics.chem-ph q-bio.bm | alchemical grid dock (algdock) is open-source ... | 10.1002/jcc.26036 | 2015-07-13 | 2019-08-05 | [david d. l minh] |
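
Several rows in the output above have `created` dates years before the requested window. One plausible explanation is that OAI-PMH date ranges select on a record's datestamp rather than on its first submission date. This is an assumption, not confirmed behaviour of arxivscraper. If only newly created preprints are wanted, the scraped records can be filtered afterwards. The sketch below assumes each record is a dict whose `created` value is an ISO date string, as displayed above. The helper name and sample records are hypothetical.

```python
from datetime import date


def created_between(records, start, end):
    """Keep records whose 'created' date falls within [start, end], inclusive."""
    kept = []
    for record in records:
        created = date.fromisoformat(record["created"])
        if start <= created <= end:
            kept.append(record)
    return kept


# Made-up records standing in for the list returned by scraper.scrape().
sample_output = [
    {"id": "0000.00001", "created": "2012-07-15"},
    {"id": "0000.00002", "created": "2020-06-01"},
]

recent = created_between(sample_output, date(2020, 5, 27), date(2020, 6, 30))
print([record["id"] for record in recent])
```

The same call could be applied to `output` before building the DataFrame, provided the `created` values really are strings in this format.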


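### With filtering

Filters restrict the results by keyword. The scraper below requests the `stat` category from 1 October 2020 until 30 November 2020, with filters on the `categories` field (`stat.ml`) and the `abstract` field (`tensorflow`). The `t=10` argument sets a 10-second wait before retrying after a 503 response:

In [8]:

scraper = ax.Scraper(category='stat', date_from='2020-10-01', date_until='2020-11-30', t=10,
                     filters={'categories': ['stat.ml'], 'abstract': ['tensorflow']})
output = scraper.scrape()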

fetching up to  1000 records...
Got 503. Retrying after 10 seconds.
fetching up to  1000 records...
fetching up to  2000 records...
Got 503. Retrying after 10 seconds.
fetching up to  2000 records...
fetching up to  3000 records...
Got 503. Retrying after 10 seconds.
fetching up to  3000 records...
fetching up to  4000 records...
Got 503. Retrying after 10 seconds.
fetching up to  4000 records...
fetching up to  5000 records...
Got 503. Retrying after 10 seconds.
fetching up to  5000 records...
fetching is completed in 96.6 seconds.
Total number of records 3421
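
The log above shows the scraper waiting and retrying whenever the server answers 503. The general pattern can be sketched as follows. This is not arxivscraper's actual implementation: the exception class, function names, and attempt limit are hypothetical, and the fetcher is simulated so that the example runs without a network connection.

```python
import time


class ServiceUnavailable(Exception):
    """Raised by the hypothetical fetcher when the server answers 503."""


def fetch_with_retry(fetch, wait_seconds=10, max_attempts=5, sleep=time.sleep):
    """Call fetch() until it succeeds, pausing wait_seconds after each 503."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ServiceUnavailable:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            print(f"Got 503. Retrying after {wait_seconds} seconds.")
            sleep(wait_seconds)


# Simulated fetcher that fails twice and then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ServiceUnavailable()
    return "<records/>"


# Record the requested waits instead of sleeping, to keep the demo fast.
waits = []
result = fetch_with_retry(flaky_fetch, wait_seconds=10, sleep=waits.append)
print(result, waits)
```

A fixed wait keeps the sketch simple. A longer or growing wait puts less load on the server if 503 responses persist.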


In [9]:
# Save the output in your preferred format or convert it into a pandas DataFrame:

# converting to a DataFrame
cols = ('id', 'title', 'categories', 'abstract', 'doi', 'created', 'updated', 'authors')
filtered_df = pd.DataFrame(output, columns=cols)

In [10]:
filtered_df.head()

| | id | title | categories | abstract | doi | created | updated | authors |
|---|---|---|---|---|---|---|---|---|
| 0 | 1405.5576 | on the theoretical guarantees for parameter es... | stat.ml stat.co | iterative methods for fitting a gaussian rando... | | 2014-05-21 | 2020-02-06 | [sam davanloo tajbakhsh, necdet serhat aybat, ... |
| 1 | 1503.03148 | a neurodynamical system for finding a minimal ... | cs.lg stat.ml | the recently proposed minimal complexity machi... | 10.1016/j.neunet.2020.08.013 | 2015-03-10 | | [n/a jayadeva, sumit soman, amit bhaya] |
| 2 | 1503.0781 | interpretable classification models for recidi... | stat.ml stat.ap | we investigate a long-debated question, which ... | 10.1111/rssa.12227 | 2015-03-26 | 2016-07-07 | [jiaming zeng, berk ustun, cynthia rudin] |
| 3 | 1505.00398 | block basis factorization for scalable kernel ... | stat.ml cs.lg cs.na | kernel methods are widespread in machine learn... | 10.1137/18m1212586 | 2015-05-03 | 2019-04-16 | [ruoxi wang, yingzhou li, michael w. mahoney, ... |
| 4 | 1507.06217 | persistence images: a stable vector representa... | cs.cg math.at stat.ml | many datasets can be viewed as a noisy samplin... | | 2015-07-22 | 2016-07-11 | [henry adams, sofya chepushtanova, tegan emers... |


In [11]:
filtered_df.shape

(3421, 8)
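
With this many rows, it can be worth checking how many records actually mention the keyword of interest. The sketch below counts case-insensitive matches in one field. It assumes the scraped output is a list of dicts, which is consistent with how it is passed to `pd.DataFrame` above. The helper name and sample records are hypothetical.

```python
def count_keyword_matches(records, field, keyword):
    """Count records whose `field` contains `keyword`, ignoring case."""
    keyword = keyword.lower()
    return sum(1 for record in records
               if keyword in str(record.get(field, "")).lower())


# Made-up records standing in for the list returned by scraper.scrape().
sample_output = [
    {"id": "0000.00001", "abstract": "we train the model with tensorflow."},
    {"id": "0000.00002", "abstract": "a bayesian treatment of regression."},
]

matches = count_keyword_matches(sample_output, "abstract", "TensorFlow")
print(matches)
```

On the real data, the call would be `count_keyword_matches(output, 'abstract', 'tensorflow')`.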

#### Awesome! Congratulations on downloading all those research articles.
OK, so now you have thousands of research articles to read before compiling your report. Python can help you get through that task quickly as well. Head over to my Biomedical Text Summarization Tool to learn more:
https://github.com/Flo-tyna/Biomedical-Text-Summarizer

### References
1. https://marcobonzanini.com/2015/01/12/searching-pubmed-with-python/
2. https://medium.com/@kliang933/scraping-big-data-from-public-research-repositories-e-g-pubmed-arxiv-2-488666f6f29b
3. http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec4
4. https://dataguide.nlm.nih.gov/edirect/esearch.html
5. https://devpost.com/software/pubmed-web-scraper
6. https://github.com/kentaroy47/arxiv-scraping
7. https://gitlab.tu-berlin.de/NLP/arxivscraper#with-filtering