# Article Data Extraction and Removal of Irrelevant Pages
Author: Sean Flannery [sflanner@purdue.edu](mailto:sflanner@purdue.edu)

Last Updated: June 14th, 2019

This notebook was developed with the intent of satisfying local data parsing needs for work with
Professor Daisuke Kihara [dkihara@purdue.edu](mailto:dkihara@purdue.edu).
### Description
We are interested in extracting the title, abstract, authors, and introduction from a given set of article HTML files (assumed to have been downloaded already).

We have provided a folder `articles` containing one subfolder per **year**, each holding the **\*.html** files for the articles of that year.

**Libraries Needed:** 
[pandas](https://pandas.pydata.org/pandas-docs/stable/install.html), 
[numpy](https://www.numpy.org), 
[tqdm](https://github.com/tqdm/tqdm), 
[bs4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [1]:
import pandas as pd
import numpy as np
np.random.seed(42)
import re
import os
import time
import random
random.seed(42)

from bs4 import BeautifulSoup, SoupStrainer

from multiprocessing import Pool
from tqdm import tqdm_notebook as tqdm

import warnings
warnings.filterwarnings("ignore")

In [2]:
nar_df = pd.read_csv('article-path-data.csv',
                     names = ['year','article-link', 'local-path', 'title', 'abstract', 'authors', 'introduction'],
                     skiprows=1)

In [3]:
nar_df.head()

Unnamed: 0,year,article-link,local-path,title,abstract,authors,introduction
0,2019,https://doi.org/10.1093/nar/gky1267,articles/2019/0-NAR.html,,,,
1,2019,https://doi.org/10.1093/nar/gky993,articles/2019/1-NAR.html,,,,
2,2019,https://doi.org/10.1093/nar/gky1124,articles/2019/2-NAR.html,,,,
3,2019,https://doi.org/10.1093/nar/gky1069,articles/2019/3-NAR.html,,,,
4,2019,https://doi.org/10.1093/nar/gky843,articles/2019/4-NAR.html,,,,


Grab the title of the article.

In [4]:
def getTitle(filename):
    # Open html file for processing
    f = open(filename)
    soup = BeautifulSoup(f, 'html.parser')
    f.close()
    # Grab the <h1> element that holds the main article title
    tmp = str(soup.findAll("h1", {"class": "article-title-main"}))
    soup = BeautifulSoup(tmp, 'html.parser')
    # Clean up the text and store it in the result
    return soup.get_text(strip=True)[1:-1] 

Grab all text from the section with class abstract.

In [5]:
def getAbstract(filename):
    # Open html file for processing
    f = open(filename)
    soup = BeautifulSoup(f, 'html.parser')
    f.close()
    # Grab section of html named after the abstract
    tmp = str(soup.findAll("section", {"class": "abstract"}))
    soup = BeautifulSoup(tmp, 'html.parser')
    # Clean up the text and store it in the result
    return soup.get_text(strip=True)[1:-1]
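
The `[1:-1]` slice in the two functions above strips the square brackets that appear when the list of matches is converted to a string. Note also that `get_text(strip=True)` joins text fragments with no separator, which is why some titles in the output further down read like `The 2018Nucleic Acids Researchdatabase issue`. A minimal sketch of an alternative, assuming a single matching element per page; the helper `extract_text` and the sample markup are hypothetical and not part of the original notebook:

```python
from bs4 import BeautifulSoup

def extract_text(html, tag, css_class):
    """Return the text of the first <tag class=css_class>, or '' if absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(tag, class_=css_class)
    if node is None:
        return ""
    # A separator keeps words apart when the element contains nested tags
    return node.get_text(" ", strip=True)

# Hypothetical markup that imitates the structure of an article page
sample = (
    '<h1 class="article-title-main">The 2018 <em>Nucleic Acids Research</em> '
    'database issue</h1>'
    '<section class="abstract"><p>First sentence.</p><p>Second sentence.</p></section>'
)

title = extract_text(sample, "h1", "article-title-main")
abstract = extract_text(sample, "section", "abstract")

# For comparison: the same title without a separator
fused_title = BeautifulSoup(sample, "html.parser").find("h1").get_text(strip=True)
```

Working on the element directly avoids the round trip through `str(...)` and a second parse.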

Grab the author names from the info card at the top of the webpage. If the page has no info card, fall back to two other containers that may hold the author list.

In [6]:
def getAuthor(filename):
    f = open(filename)
    soup = BeautifulSoup(f, 'html.parser')
    f.close()
    # Try the info card first; keep the result as a list so that
    # an empty match can be detected and the fallbacks can run
    tmp = soup.findAll("div", {"class": "info-card-name"})
    if len(tmp) == 0:
        tmp = soup.findAll("div", {"class": "meta-authors--limited"})
    if len(tmp) == 0:
        tmp = soup.findAll("div", {"class": "wi-authors"})
    # Convert to a string only once a container has been chosen
    soup = BeautifulSoup(str(tmp), 'html.parser')
    return soup.get_text(strip=True)[1:-1].split(',')
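
The fallback order can also be written as a loop over candidate class names. This is only a sketch: the helper `first_author_blocks`, the constant `AUTHOR_CLASSES`, and the sample markup are hypothetical, and it reads names with `stripped_strings` instead of splitting on commas, which is a different technique from the one used in `getAuthor`.

```python
from bs4 import BeautifulSoup

# Class names taken from getAuthor, in the order they are tried
AUTHOR_CLASSES = ["info-card-name", "meta-authors--limited", "wi-authors"]

def first_author_blocks(soup, class_names=AUTHOR_CLASSES):
    """Return the matches for the first class name that yields any."""
    for name in class_names:
        blocks = soup.find_all("div", class_=name)
        if blocks:
            return blocks
    return []

# Hypothetical page that only has the last fallback container
html = '<div class="wi-authors"><div>Ada Lovelace</div><div>Alan Turing</div></div>'
soup = BeautifulSoup(html, "html.parser")

blocks = first_author_blocks(soup)
# One entry per text fragment inside the chosen container(s)
authors = [text for block in blocks for text in block.stripped_strings]
```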

Collect the `<h2>` section titles on the page, look for the one containing the word "introduction", and return the paragraphs between that title and the next one. If no title mentions an introduction, the text between the first two titles is used instead.

In [7]:
def getIntroductionFromHeaders(filename):
    f = open(filename)
    soup = BeautifulSoup(f, 'html.parser')
    f.close()
    # Collect every section header
    found = soup.findAll("h2", {"class": "section-title"})
    # If we find only 1 header, it's likely just the acknowledgements
    if len(found) <= 1:
        return None
    # Initially assume that neither intro nor the section after is found
    intro = None
    after = None
    for ent in found:
        if intro is not None:
            # The header right after the introduction marks where it ends
            after = ent
            break
        if 'introduction' in str(ent).lower():
            intro = ent
    # If no header is labelled as the introduction (or it is the last header),
    # fall back to the text between the first two headers
    if intro is None or after is None:
        intro = found[0]
        after = found[1]

    next_p = intro.find_next("p")
    last_p = after.find_previous_sibling("p")
    if next_p is None or last_p is None:
        return None
    last_p = last_p.get_text(strip=True)
    resStr = ''
    # Collect paragraphs until we reach the last one before the next header;
    # stop as well if we run out of paragraphs
    while next_p is not None and str(next_p.get_text(strip=True)) != str(last_p):
        resStr += str(next_p.get_text(strip=True)) + '\n'
        next_p = next_p.find_next("p")
    resStr += str(last_p)
    return resStr
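
The idea of collecting the paragraphs between two section titles can be illustrated on a small inline document. This sketch assumes the paragraphs are direct siblings of the `<h2>` titles, which may not hold for every article page; the sample markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two section titles
html = """
<div>
<h2 class="section-title">INTRODUCTION</h2>
<p>Intro one.</p>
<p>Intro two.</p>
<h2 class="section-title">METHODS</h2>
<p>Method text.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
headers = soup.find_all("h2", {"class": "section-title"})

# First header whose text mentions an introduction, if any
intro = next((h for h in headers if "introduction" in h.get_text().lower()), None)

paragraphs = []
if intro is not None:
    # Walk the following sibling tags and stop at the next section title
    for sib in intro.find_next_siblings():
        if sib.name == "h2":
            break
        if sib.name == "p":
            paragraphs.append(sib.get_text(strip=True))

intro_text = "\n".join(paragraphs)
```

Stopping on the next `<h2>` tag avoids comparing paragraph text against a stored end marker.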

When no introduction can be located from the section headers, we fall back to a coarser method that simply grabs the first few `<p>` tags that follow the abstract.

This also requires us to **define some number of paragraphs** to grab. We initially just grab up to 3 paragraphs.

In [8]:
PARAGRAPH_NUMBER = 3
def getIntroductionFromParagraphs(filename):
    abstract = getAbstract(filename)
    f = open(filename)
    soup = BeautifulSoup(f, 'html.parser')
    f.close()
    # Restrict the search to the main content block of the page
    resStr = ''  
    content_blocks = soup.findAll("div", {"id": "ContentTab"})
    if content_blocks is None or len(content_blocks) == 0:
        return None
    content_block = content_blocks[0]
    soup = BeautifulSoup(str(content_block), 'html.parser')
    pars = soup.findAll("p")
    if len(pars) == 0:
        return None
    
    parCount = 0
    next_p = str(pars[parCount].get_text(strip=True))
    parCount += 1
    while abstract not in next_p and parCount < len(pars):
        next_p = str(pars[parCount].get_text(strip=True))
        parCount += 1
    counter = 0
    while next_p is not None and counter < PARAGRAPH_NUMBER and parCount < len(pars):
        counter = counter + 1
        next_p = str(pars[parCount].get_text(strip=True))
        parCount += 1 
        resStr += str(next_p + '\n')
    if resStr == '':
        return None
    else:
        return resStr
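
The selection logic above (skip ahead to the paragraph containing the abstract, then keep the next few paragraphs) can be shown on a plain list of strings. A minimal sketch; the helper `take_after` and the sample paragraphs are hypothetical:

```python
def take_after(paragraphs, marker, limit=3):
    """Return up to `limit` paragraphs following the first one that contains `marker`."""
    for i, text in enumerate(paragraphs):
        if marker in text:
            return paragraphs[i + 1:i + 1 + limit]
    # The marker never appeared, so there is nothing to anchor on
    return []

# Hypothetical paragraph texts in page order
pars = ["Menu", "Abstract: we study X.", "Intro 1", "Intro 2", "Intro 3", "Methods"]

picked = take_after(pars, "we study X.")
not_found = take_after(pars, "absent text")
```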

In [9]:
def getIntroduction(filename):
    res = getIntroductionFromHeaders(filename)
    if res is None or len(res) < 4:
        res = getIntroductionFromParagraphs(filename)
    return res

Define a function to grab a url's various page information.

In [10]:
def getWebPageData(urlID):
    res = dict()    
    res['urlID'] = urlID
    filename = str(nar_df.loc[urlID,'local-path'])
    ### Title 
    res['title'] = getTitle(filename)
    ### Abstract
    res['abstract'] = getAbstract(filename)
    ### Authors
    res['authors'] = getAuthor(filename)
    ### Introduction
    res['introduction'] = getIntroductionFromHeaders(filename)
    if res['introduction'] is None or len(res['introduction']) <=2:
        res['introduction'] = getIntroductionFromParagraphs(filename)
    return res

Spawn a pool of worker processes to parse the files concurrently.

In [11]:
req_list = []
entry_range = range(len(nar_df))
with Pool(8) as p:
    req_list = list(tqdm(p.imap(getWebPageData, entry_range), total=len(entry_range), leave=False))
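
`imap` yields results in the same order as its inputs, so each returned dictionary lines up with its entry ID. The sketch below shows this with `ThreadPool`, which exposes the same `imap` interface as `Pool` but uses threads; that substitution is made here only so the example runs in any environment, and `parse_entry` is a hypothetical stand-in for `getWebPageData`.

```python
from multiprocessing.pool import ThreadPool

def parse_entry(entry_id):
    # Hypothetical stand-in for getWebPageData: one result dict per entry ID
    return {"urlID": entry_id, "title": "title-%d" % entry_id}

entry_ids = range(5)
with ThreadPool(2) as pool:
    # imap is lazy; wrapping it in list() collects every result in input order
    results = list(pool.imap(parse_entry, entry_ids))

ordered_ids = [r["urlID"] for r in results]
```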




We now copy the returned results into the `nar_df` DataFrame.

In [12]:
# Quick fix for authors column so it accepts valid datatype
nar_df['authors'] = nar_df['authors'].astype(object)

for res in req_list:
    urlID = res['urlID']
    nar_df.loc[urlID, 'title'] = res['title']
    nar_df.loc[urlID, 'abstract'] = res['abstract']
    nar_df.at[urlID,'authors'] = res['authors']
    nar_df.loc[urlID, 'introduction'] = res['introduction']
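
The authors column holds a Python list per row, which is why it is cast to `object` and written with `.at`, the accessor for a single cell. A small sketch on a toy frame; the column values are hypothetical:

```python
import pandas as pd

# Toy frame with an empty authors column
df = pd.DataFrame({"title": ["a", "b"], "authors": [None, None]})
df["authors"] = df["authors"].astype(object)

# .at addresses exactly one cell, so the list is stored as a single value
df.at[0, "authors"] = ["Ada Lovelace", "Alan Turing"]

stored = df.at[0, "authors"]
```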

**Let's store the original version of the nar database before we drop entries**

In [13]:
original_nar_df = nar_df.copy()
nar_df.to_csv('original-article-data.csv', index=False)
nar_df.head()

Unnamed: 0,year,article-link,local-path,title,abstract,authors,introduction
0,2019,https://doi.org/10.1093/nar/gky1267,articles/2019/0-NAR.html,The 26th annual Nucleic Acids Research databas...,The 2019 Nucleic Acids Research (NAR) Database...,"[Daniel J Rigden, Xosé M Fernández]",The Nucleic Acids Research (NAR) Database Issu...
1,2019,https://doi.org/10.1093/nar/gky993,articles/2019/1-NAR.html,Database Resources of the BIG Data Center in 2019,The BIG Data Center at Beijing Institute of Ge...,[BIG Data Center Members],The BIG Data Center (http://bigd.big.ac.cn) at...
2,2019,https://doi.org/10.1093/nar/gky1124,articles/2019/2-NAR.html,The European Bioinformatics Institute in 2018:...,The European Bioinformatics Institute (https:/...,"[Charles E Cook, Rodrigo Lopez, Oana Stroe, Gu...","A primary mission of EMBL-EBI is to collect, o..."
3,2019,https://doi.org/10.1093/nar/gky1069,articles/2019/3-NAR.html,Database resources of the National Center for ...,The National Center for Biotechnology Informat...,"[Eric W Sayers, Richa Agarwala, Evan E Bolton,...",The National Center for Biotechnology Informat...
4,2019,https://doi.org/10.1093/nar/gky843,articles/2019/4-NAR.html,AmtDB: a database of ancient human mitochondri...,Ancient mitochondrial DNA is used for tracing ...,"[Edvard Ehler, Jiří Novotný, Anna Juras, Macie...",Ancient DNA (aDNA) is a genetic material obtai...


Display Statistics!

In [14]:
no_abstracts = nar_df[nar_df.abstract.isna()]
no_intro = nar_df[nar_df.introduction.isna()]
has_intro = nar_df[nar_df.introduction.notna()]
        
print("No introduction %s/%s: %.2f%%" % (len(no_intro), len(nar_df), (100. * len(no_intro))/ len(nar_df)))
print("No abstracts %s/%s: %.2f%%" % (len(no_abstracts), len(nar_df), (100. * len(no_abstracts))/ len(nar_df)))
count = 0
for uid in has_intro.index:
    introStr = str(nar_df.loc[uid, 'introduction'])
    abstractStr = str(nar_df.loc[uid, 'abstract'])
    if abstractStr in introStr:
        count = count + 1
print("Intro contains abstract %s/%s: %.2f%%" % (count, len(has_intro), (100. * count / len(has_intro))))

No introduction 11/3149: 0.35%
No abstracts 0/3149: 0.00%
Intro contains abstract 18/3138: 0.57%


### Data Cleaning

We want to drop all unneeded pages based on title or filetype.

Drop entries whose downloaded file is a PDF (likely a PDF saying "under development"); these cannot be scraped with the HTML parser.

We include a method to check if the downloaded file is a PDF.

In [15]:
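def checkPDF(filename):
    # Read the first four bytes in binary mode so that binary PDF content
    # cannot trigger a decoding error; PDF files start with the bytes %PDF
    with open(filename, 'rb') as f:
        file_head = f.read(4)
    return file_head == b"%PDF"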

In [16]:
# Collect the rows to remove first, then drop them in one call, so that
# a row matching several rules is not dropped twice
to_drop = []
for index, row in nar_df.iterrows():
    title = str(row['title']).lower()
    # Remove all editorial pages, subscription pages, and database issue summaries
    if 'editorial' in title or 'subscription' in title or 'database issue' in title:
        to_drop.append(index)
    # Drop all files that are just PDFs
    elif checkPDF(row['local-path']):
        to_drop.append(index)
nar_df.drop(to_drop, inplace=True)
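
The title-based part of this filter can also be expressed as a boolean mask. This is a sketch on a toy frame, not a replacement for the cell above; the titles are hypothetical and the PDF check is left out.

```python
import pandas as pd

df = pd.DataFrame({"title": [
    "Editorial Board",
    "AmtDB: a database of ancient DNA",
    "The 2018 database issue",
    "Subscription Information",
    None,
]})

# Regular expression with one alternative per unwanted keyword
pattern = "editorial|subscription|database issue"

# na=False treats missing titles as non-matching instead of propagating NaN
unwanted = df["title"].str.lower().str.contains(pattern, na=False)
kept = df[~unwanted]
```

Building the mask first makes it easy to inspect `df[unwanted]` before anything is removed.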

In [17]:
no_abstracts = nar_df[nar_df.abstract.isna()]
no_intro = nar_df[nar_df.introduction.isna()]
has_intro = nar_df[nar_df.introduction.notna()]
        
print("No introduction %s/%s: %.2f%%" % (len(no_intro), len(nar_df), (100. * len(no_intro))/ len(nar_df)))
print("No abstracts %s/%s: %.2f%%" % (len(no_abstracts), len(nar_df), (100. * len(no_abstracts))/ len(nar_df)))
count = 0
for uid in nar_df.index:
    introStr = str(nar_df.loc[uid, 'introduction'])
    abstractStr = str(nar_df.loc[uid, 'abstract'])
    # Check if abstractStr isn't empty before seeing if its in the intro
    if abstractStr in introStr and len(abstractStr) > 0:
        count = count + 1
print("Intro contains abstract %s/%s: %.2f%%" % (count, len(has_intro), (100. * count / len(has_intro))))

No introduction 1/3113: 0.03%
No abstracts 0/3113: 0.00%
Intro contains abstract 0/3112: 0.00%


Drop undesirable entries (those that don't have the data we need).

In [18]:
for uid in nar_df.index:
    # Drop the undesired entries
    introStr = str(nar_df.loc[uid, 'introduction'])
    abstractStr = str(nar_df.loc[uid, 'abstract'])
    if len(abstractStr) == 0 or len(introStr) == 0:
        # inplace=True is required; otherwise drop returns a copy and nar_df is unchanged
        nar_df.drop(uid, inplace=True)
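
One caveat with the length test above: `str()` turns a missing value into the text `'nan'` or `'None'`, so a length check only catches empty strings. A hedged sketch of a filter that treats both missing values and empty strings as blank, on a toy frame with hypothetical values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "abstract": ["text", "", "more", None],
    "introduction": ["intro", "intro", np.nan, "intro"],
})
cols = ["abstract", "introduction"]

# A cell counts as blank if it is missing (None/NaN) or an empty string
blank = df[cols].isna() | df[cols].eq("")

# Keep only rows where neither column is blank
cleaned = df[~blank.any(axis=1)]
```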

Here, we should **ensure we haven't dropped articles erroneously by spot-checking the articles we've dropped**.

In [19]:
difference = original_nar_df[~original_nar_df.index.isin(nar_df.index)]
print("Total dropped entries:", len(difference))

Total dropped entries: 36


We should run through these entries and see if there are any we ought to restore.

In [20]:
difference

Unnamed: 0,year,article-link,local-path,title,abstract,authors,introduction
0,2019,https://doi.org/10.1093/nar/gky1267,articles/2019/0-NAR.html,The 26th annual Nucleic Acids Research databas...,The 2019 Nucleic Acids Research (NAR) Database...,"[Daniel J Rigden, Xosé M Fernández]",The Nucleic Acids Research (NAR) Database Issu...
31,2019,https://doi.org/10.1093/nar/gky1050,articles/2019/31-NAR.html,,,[],
85,2019,https://doi.org/10.1093/nar/gky992,articles/2019/85-NAR.html,,,[],
150,2018,https://doi.org/10.1093/nar/gkx1235,articles/2018/150-NAR.html,The 2018Nucleic Acids Researchdatabase issue a...,The 2018Nucleic Acids ResearchDatabase Issue c...,"[Daniel J Rigden, Xosé M Fernández]",This 2018Nucleic Acids ResearchDatabase Issue ...
152,2018,https://doi.org/10.1093/nar/gkx897,articles/2018/152-NAR.html,,,[],
157,2018,https://doi.org/10.1093/nar/gkx1097,articles/2018/157-NAR.html,,,[],
179,2018,https://doi.org/10.1093/nar/gkx864,articles/2018/179-NAR.html,,,[],
262,2018,https://doi.org/10.1093/nar/gkx1020,articles/2018/262-NAR.html,,,[],
298,2018,https://doi.org/10.1093/nar/gkx892,articles/2018/298-NAR.html,,,[],
300,2017,https://doi.org/10.1093/nar/gkw1188,articles/2017/300-NAR.html,The 24th annualNucleic Acids Researchdatabase ...,This year's Database Issue ofNucleic Acids Res...,"[Michael Y Galperin, Xosé M Fernández-Suárez, ...",The current 2017Nucleic Acids ResearchDatabase...


All of these appear to be editorials, database issue summaries (which just describe the database issue for that year), or subscription pages. It seems reasonable that they ought to be dropped.

In [21]:
nar_df.to_csv('complete-article-data.csv', index=False)