# Web Scraping and Downloading Files
Author: Sean Flannery [sflanner@purdue.edu](mailto:sflanner@purdue.edu)

Last Updated: June 13th, 2019

This notebook was developed with the intent of satisfying data acquisition needs for work with
Professor Daisuke Kihara [dkihara@purdue.edu](mailto:dkihara@purdue.edu).
### Description
*"The first task is to download pdf files of all the papers from all the years. Each year, the location of pdf files may be a bit different."*

We have provided a file `nardb.txt` that contains the respective **years** and **URLs** where we may find the annual collections of pertinent articles.

**Libraries Needed:**
[pandas](https://pandas.pydata.org/pandas-docs/stable/install.html),
[numpy](https://www.numpy.org),
[tqdm](https://github.com/tqdm/tqdm),
[bs4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/),
[requests](https://requests.readthedocs.io),
[urllib3](https://urllib3.readthedocs.io)

In [1]:
import pandas as pd
import numpy as np
np.random.seed(42)
import re
import os
import time
import random
random.seed(42)

import urllib3
import requests
from bs4 import BeautifulSoup, SoupStrainer

from multiprocessing import Pool
from tqdm import tqdm_notebook as tqdm

import warnings
warnings.filterwarnings("ignore")

Read in the data from local file describing years of article data to read.

In [2]:
year_df = pd.read_csv('nardb.txt', sep=r'\s+', names=['year', 'url'])
year_df.head()

Unnamed: 0,year,url
0,2019,https://academic.oup.com/nar/issue/47/D1
1,2018,https://academic.oup.com/nar/issue/46/D1
2,2017,https://academic.oup.com/nar/issue/45/D1
3,2016,https://academic.oup.com/nar/issue/44/D1
4,2015,https://academic.oup.com/nar/issue/43/D1


We will need to actually navigate these websites now to discover the locations of the PDFs of interest (and possibly extract any useful data we might find).

We shall use the `requests` and `BeautifulSoup` libraries to first download and then analyze the given webpages.

In [3]:
year_df['html-data'] = [None]*len(year_df)
bs_list = []

Define a function to grab a webpage from our dataset in `year_df` (this may take a bit of time).

Note also that the error messages we check for are specific to the website we are crawling from (Nucleic Acids Research Database).

In [4]:
def grabYearWebPage(index):
    url = str(year_df.loc[index, 'url'])
    content = requests.post(url).content
    bs = BeautifulSoup(content, 'html.parser')
    resStr = str(bs.prettify())
    # NOTE: Hacky... These are just a couple of examples 
    if bs.find(id='captcha') is not None:
        print("Captcha encountered!")
        exit() # We ought to stop and adjust something if this happens...
        return None
    if "Your IP has been blacklisted due to excessive requests to the platform or suspicious activity." in resStr:
        print("BLACKLISTED AT THIS IP!")
        exit()
        return None
    if "Object reference not set to an instance of an object." in resStr:
        print("Found error... Need to retry")
        exit()
        return None
    return {'bs': resStr, 'index': index}
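
The site-specific checks above can also be collected into one small helper that returns a status label. The following is only a sketch using the standard library: `classify_page` and `_CaptchaFinder` are hypothetical names, the marker strings are copied from the function above, and the check runs on the raw HTML rather than on the prettified string.

```python
from html.parser import HTMLParser

# Marker strings copied from the checks above; they are specific to this site.
BLACKLIST_MSG = ("Your IP has been blacklisted due to excessive requests "
                 "to the platform or suspicious activity.")
SERVER_ERROR_MSG = "Object reference not set to an instance of an object."


class _CaptchaFinder(HTMLParser):
    """Sets `found` when any tag carries id="captcha"."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == "captcha":
            self.found = True


def classify_page(html):
    """Return a status label for a downloaded page (hypothetical helper)."""
    finder = _CaptchaFinder()
    finder.feed(html)
    if finder.found:
        return "CAPTCHA"
    if BLACKLIST_MSG in html:
        return "BLACKLISTED"
    if SERVER_ERROR_MSG in html:
        return "SERVER_ERROR"
    return "OK"


print(classify_page('<div id="captcha"></div>'))
print(classify_page("<p>" + SERVER_ERROR_MSG + "</p>"))
print(classify_page("<html><body>An article</body></html>"))
```

Returning a label instead of calling `exit()` lets the caller decide whether to retry, skip, or stop.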

In [5]:
url = year_df['url'][0]

In [6]:
requests.post(url).content;

In [7]:
with Pool(10) as p:
        data_list = list(tqdm(p.imap(grabYearWebPage, range(len(year_df))), total=len(year_df), leave=True))


In order to prevent MemoryErrors, we have to store the results outside of the concurrent mapping above. We then add them to our `year_df` under the `html-data` column.

In [8]:
for dataDict in data_list:
    year_df.loc[dataDict['index'], 'html-data'] = dataDict['bs']
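
The pattern used here, in which each worker returns plain data and the parent collects it by row index, can be tried without any network access. This sketch swaps in `concurrent.futures.ThreadPoolExecutor` for `multiprocessing.Pool` (threads are often sufficient for network-bound work); `fake_fetch` and the `example.org` URLs are hypothetical stand-ins, not part of the original pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the rows of year_df.
urls = ["https://example.org/issue/%d" % i for i in range(5)]


def fake_fetch(index):
    """Pretend to download urls[index]; return plain data only."""
    html = "<html><body>page for %s</body></html>" % urls[index]
    return {"index": index, "html": html}


# Executor.map yields results in input order, like Pool.imap.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fake_fetch, range(len(urls))))

# Collect the results in the parent, keyed by row index.
html_by_index = {r["index"]: r["html"] for r in results}
print(len(html_by_index))
```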

In [9]:
year_df.head()

Unnamed: 0,year,url,html-data
0,2019,https://academic.oup.com/nar/issue/47/D1,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\n ..."
1,2018,https://academic.oup.com/nar/issue/46/D1,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\n ..."
2,2017,https://academic.oup.com/nar/issue/45/D1,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\n ..."
3,2016,https://academic.oup.com/nar/issue/44/D1,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\n ..."
4,2015,https://academic.oup.com/nar/issue/43/D1,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\n ..."


Save acquired data into intermediate `year-YYYY-NAR.html` files in the `year_pages` directory for this part. 

In [10]:
if not os.path.exists('year_pages'):
    os.mkdir('year_pages')
for urlID in range(len(year_df)):
    fname = 'year_pages/year-' + str(year_df.loc[urlID, 'year']) + '-NAR.html'
    # The context manager closes the file even if the write fails
    with open(fname, "w", encoding="utf-8") as file:
        file.write(year_df.loc[urlID, 'html-data'])
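
The same save step could be written with `pathlib`. This is a minimal sketch: the `pages` dictionary is a hypothetical stand-in for the `year` and `html-data` columns, and a temporary directory is used so that no real data is touched.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the (year, html) pairs held in year_df.
pages = {2019: "<html>2019 issue</html>", 2018: "<html>2018 issue</html>"}

# A temporary directory keeps the example from touching real data.
base = Path(tempfile.mkdtemp())
out_dir = base / "year_pages"
out_dir.mkdir(parents=True, exist_ok=True)

for year, html in pages.items():
    path = out_dir / ("year-%d-NAR.html" % year)
    # An explicit encoding avoids platform-dependent defaults.
    path.write_text(html, encoding="utf-8")

print(sorted(p.name for p in out_dir.iterdir()))
```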

Here, we define some functions to encapsulate the process of grabbing the PDF article links from each site.

Note that we use the `re` module, imported earlier, for Python's regular expression tools.

In [11]:
def grabArticleLinks(index):
    bs = BeautifulSoup(year_df['html-data'][index], 'html.parser')
    # Filter through all citations on the page
    filter_results = bs.findAll('div',  class_='ww-citation-primary')
    htmlStr = ''.join(str(elem) for elem in filter_results)
    # Grab the links from these sections for parsing again
    bstmp = BeautifulSoup(htmlStr, 'html.parser')
    # Keep only absolute links, i.e. those starting with http:// or https://
    page_links = bstmp.findAll('a', attrs={'href' : re.compile(r"^https?://")})
    # Create articledb entry for list of pages to crawl for pdfs
    data = {'year':[year_df['year'][index]]*len(page_links), 'article-link':[]}
    # Iterate over the links of the page
    for link in page_links:
        data['article-link'].append(link['href'])
    return data
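
When filtering `href` values with a regular expression, it helps to remember that `*` applies only to the preceding character and that BeautifulSoup applies a compiled pattern with its `search` method, which matches anywhere in the string. The sketch below compares a loose pattern with an anchored one; the two relative links are hypothetical examples.

```python
import re

hrefs = [
    "https://doi.org/10.1093/nar/gky1267",  # absolute link
    "/nar/issue/47/D1",                     # hypothetical relative link
    "/search?q=https",                      # relative, but contains "htt"
]

loose = re.compile("http*")         # "htt" followed by zero or more "p"
strict = re.compile(r"^https?://")  # must start with http:// or https://

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]
print(loose_hits)
print(strict_hits)
```

The loose pattern also accepts the relative search link, while the anchored pattern keeps only the absolute DOI link.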

It is useful to parallelize the link extraction we defined above, since each year's page is parsed independently of the others.

In [12]:
with Pool(20) as p:
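    # imap yields results as they finish, so the progress bar updates during the work
    resList = list(tqdm(p.imap(grabArticleLinks, range(len(year_df))), total=len(year_df)))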

Now, we collect all of this information into a single DataFrame, which we later save as a CSV file for safety (and so we don't have to do all that crawling work once again).

In [13]:
final_dict = {'year':[], 'article-link':[]}
for ent in resList:
    final_dict['year'].extend(ent['year'])
    final_dict['article-link'].extend(ent['article-link'])
nar_df = pd.DataFrame.from_dict(final_dict)[['year', 'article-link']]




In [14]:
nar_df.head()

Unnamed: 0,year,article-link
0,2019,https://doi.org/10.1093/nar/gky1267
1,2019,https://doi.org/10.1093/nar/gky993
2,2019,https://doi.org/10.1093/nar/gky1124
3,2019,https://doi.org/10.1093/nar/gky1069
4,2019,https://doi.org/10.1093/nar/gky843


Next, we need to navigate to each article page to discover the locations of the PDFs of interest (and possibly extract any useful data we might find).

We shall use the `urllib3` and `BeautifulSoup` libraries to first download and then analyze these webpages.

We need to do some pre-processing to create the folders for our crawling.

In [15]:
if not os.path.exists('articles'):
    os.mkdir('articles')
for year in set(nar_df['year']):
    if not os.path.exists('articles/' + str(year)):
        os.mkdir('articles/' + str(year))
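
As an alternative to checking `os.path.exists` before each `os.mkdir`, `os.makedirs` with `exist_ok=True` creates nested folders in one call. This is a minimal sketch: the `years` list is hypothetical, and a temporary root directory is used so the example has no side effects.

```python
import os
import tempfile

years = [2019, 2018, 2019]  # hypothetical values; duplicates are harmless

base = tempfile.mkdtemp()  # temporary root so the example is side-effect free
for year in set(years):
    # Creates 'articles' and the year folder together; no error if present.
    os.makedirs(os.path.join(base, "articles", str(year)), exist_ok=True)

print(sorted(os.listdir(os.path.join(base, "articles"))))
```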

Define a function to grab a webpage from our dataset in `nar_df` and store it locally.

In [33]:
def grabAndSaveWebPage(urlID):
    url = nar_df.loc[urlID, 'article-link']
    # Download the page; report the failure type instead of raising
    try:
        http = urllib3.PoolManager()
        r = http.request('GET', url)
        content = r.data
    except ValueError:
        print("Received Value Error... Continuing after url:", str(url))
        return 'ValueError'
    except OSError:
        print("Received OS Error... Continuing after url:", str(url))
        return 'OSError'
        
    bs = BeautifulSoup(content, 'html.parser')
    resStr = str(bs.prettify())
    if bs.find(id='captcha') is not None:
        print("CAPTCHA")
        return 'CAPTCHA'
    if "Your IP has been blacklisted due to excessive requests to the platform or suspicious activity." in resStr:
        print("BLACKLISTED AT THIS IP!")
        return 'BLACKLISTED'
    if "Object reference not set to an instance of an object." in resStr:
        print("Found error... Need to retry")
        return 'WEIRD_ERROR'
    fname = 'articles/' + str(nar_df.loc[urlID, 'year']) + '/' + str(urlID) + '-NAR.html'
    # The context manager flushes and closes the file for us
    with open(fname, "w", encoding="utf-8") as f:
        f.write(resStr)
    return 'SUCCESS'

This is a useful function to get all URLs we have yet to download. We pass in our list of urlIDs we have just attempted to crawl, and return a list containing only entries that we have not completely downloaded (note that we also check file size since occasionally we were receiving files of size 0).

In [18]:
def getAllUnscrapedUrlIDs(ids):
    values = set(ids)
    for root, dirs, files in os.walk("./articles"):
        for filename in files:
            # Ignore hidden files and anything that is not a saved article page
            if filename[0] == '.' or 'NAR.html' not in filename:
                continue
            urlID = int(filename.replace("-NAR.html", ""))
            size = os.path.getsize(os.path.join(root, filename))
            # A non-empty file means this page was already downloaded
            if size > 0:
                values.discard(urlID)
    return list(values)
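
The bookkeeping above can be tested in isolation on a small directory tree. The following sketch uses `pathlib` instead of `os.walk`; `unscraped_ids` is a hypothetical helper, and the files are created in a temporary directory purely for illustration.

```python
import tempfile
from pathlib import Path


def unscraped_ids(ids, root):
    """Return the sorted IDs lacking a non-empty '<id>-NAR.html' under root."""
    done = set()
    for path in Path(root).rglob("*-NAR.html"):
        if path.name.startswith("."):
            continue  # skip hidden files
        if path.stat().st_size > 0:
            done.add(int(path.name[: -len("-NAR.html")]))
    return sorted(set(ids) - done)


# Build a small hypothetical directory tree to exercise the helper.
root = Path(tempfile.mkdtemp()) / "articles"
(root / "2019").mkdir(parents=True)
(root / "2019" / "0-NAR.html").write_text("<html>ok</html>", encoding="utf-8")
(root / "2019" / "1-NAR.html").write_text("", encoding="utf-8")  # empty file

remaining = unscraped_ids(range(3), root)
print(remaining)
```

Here ID 0 counts as downloaded, ID 1 is kept because its file is empty, and ID 2 is kept because it has no file at all.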

Below we do the actual scraping of the database. We will continuously attempt to get whatever webpages we couldn't recover before. The good thing is that you can stop running this for a bit and try again later or on a different connection to get more URLs... The behavior of the captcha protection can be inconsistent.

In order to speed up our crawling, we will also use Python's multiprocessing package to enable simultaneous crawling. Initially we assume we want to get all of the urlIDs (we ensure we aren't doing extra work later using our prior method).

In [28]:
ids = getAllUnscrapedUrlIDs(list(range(len(nar_df))))

#### WARNING: Prepare for a HOT CPU

Also, feel free to adjust the number of processes spawned by `Pool`.

#### The captcha behavior is very inconsistent. 
You may end up with the same number of urlIDs not downloaded.
Here are some things to try should that happen (with no reason in particular given as to why they might work).
- Try pinging the `doi.org` server or `google.com` in a separate terminal window
- Try connecting to a different network than you're on or use a VPN
- Go to one of the sites that is not downloaded and complete the CAPTCHA on your current connection

In [48]:
ids = getAllUnscrapedUrlIDs(ids)
while len(ids) != 0:
    # Random backoff before each round
    time.sleep(random.randint(1,3)*random.randint(1,3))
    # Shuffle in place so the request order differs between rounds
    random.shuffle(ids)
    start = time.time()
    # Spawn processes to concurrently grab webpages
    
    with Pool(10) as p:
        res_list = list(tqdm(p.imap(grabAndSaveWebPage, ids), total=len(ids), leave=True))
    
    print(set(res_list))
    
    print("Total Scraping Time over %d undownloaded urls :{%.3f} minutes" % (len(ids),(time.time() - start)/60.))
    ids = getAllUnscrapedUrlIDs(ids)

Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
{'WEIRD_ERROR'}
Total Scraping Time over 11 undownloaded urls :{0.018} minutes


Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
{'WEIRD_ERROR'}
Total Scraping Time over 11 undownloaded urls :{0.019} minutes


Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
Found error... Need to retry
{'WEIRD_ERROR'}
Total Scraping Time over 11 undownloaded urls :{0.023} minutes


KeyboardInterrupt: 
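
The loop above keeps retrying until no IDs remain, and the run shown here was stopped with a `KeyboardInterrupt` after the same 11 pages failed repeatedly. One possible alternative, sketched below under stated assumptions, is to cap the number of rounds and lengthen the wait after each one. `retry_until_done` and `flaky_fetch` are hypothetical names, and the fetcher is a stand-in that never touches the network.

```python
import random
import time


def retry_until_done(ids, fetch, max_rounds=5, base_delay=0.0):
    """Retry `fetch` on failing IDs for at most `max_rounds` rounds.

    `fetch(i)` should return 'SUCCESS' or an error label. Returns the IDs
    that still failed after the last round.
    """
    pending = list(ids)
    for round_no in range(max_rounds):
        if not pending:
            break
        # Wait longer after each round, with random jitter.
        time.sleep(base_delay * (2 ** round_no) * random.uniform(0.5, 1.5))
        random.shuffle(pending)  # shuffles in place
        pending = [i for i in pending if fetch(i) != "SUCCESS"]
    return sorted(pending)


# Hypothetical fetcher: ID 7 always fails, the others succeed on the 2nd try.
attempts = {}


def flaky_fetch(i):
    attempts[i] = attempts.get(i, 0) + 1
    if i == 7 or attempts[i] < 2:
        return "WEIRD_ERROR"
    return "SUCCESS"


print(retry_until_done(range(10), flaky_fetch, max_rounds=3))
```

With a real fetcher, a non-zero `base_delay` would space out the rounds; the IDs returned at the end are the ones to download manually.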

We should now have all of our needed files downloaded to the `articles` folder in the same directory as this notebook.

In [54]:
ids = getAllUnscrapedUrlIDs(ids)

In [55]:
print("URLs we failed to download (download manually)")
nar_df.loc[ids, ['year', 'article-link']]

URLs we failed to download (download manually)


Unnamed: 0,year,article-link


In [58]:
for root, dirs, files in os.walk("articles"):
    # All files discovered so far
    for filename in files:
        # ignore hidden files based on machine or non-html
        if filename[0] == '.' or filename == 'articles' or 'NAR.html' not in filename:
            continue
        urlID = int(filename.replace("-NAR.html", ""))
        nar_df.loc[urlID, 'local-path'] = 'articles/' + str(nar_df.loc[urlID,'year']) + '/' + filename

In [59]:
nar_df.head()

Unnamed: 0,year,article-link,local-path
0,2019,https://doi.org/10.1093/nar/gky1267,articles/2019/0-NAR.html
1,2019,https://doi.org/10.1093/nar/gky993,articles/2019/1-NAR.html
2,2019,https://doi.org/10.1093/nar/gky1124,articles/2019/2-NAR.html
3,2019,https://doi.org/10.1093/nar/gky1069,articles/2019/3-NAR.html
4,2019,https://doi.org/10.1093/nar/gky843,articles/2019/4-NAR.html


In [62]:
nar_df.to_csv('article-path-data.csv', index=False)