# Data Scraping

I fetch the abstracts of NBER working papers through the program archive (http://www.nber.org/papersbyprog/). Each program in the archive has two links: one for recent papers and one for older papers. Working paper abstracts are also available at http://nber.org/new_archive/, but that page does not seem to work properly: the links for papers before 2013 are broken.

In [14]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

program_url = 'http://www.nber.org/papersbyprog/'
programs_data = pd.DataFrame(columns=['program', 'main_url', 'archive_url',
                                      'number_of_papers'])

# Fetch content of program's webpage
soup = BeautifulSoup(urlopen(program_url).read(), 'html.parser')
table = soup.findAll('table', {'border': 0, 'cellspacing': 1,
                               'cellpadding': 4})[0]

# Get strings of text
strings = list(table.stripped_strings)
programs_data['program'] = strings[0::2]
# Raw string so that \d is not treated as an (invalid) string escape
programs_data['number_of_papers'] = \
    pd.Series(strings[1::2]).str.extract(r'(\d+)', expand=False).astype(int)

# Get url for program
for row in table.findAll('a'):
    programs_data.loc[programs_data['program'] == row.text, 'main_url'] = \
    program_url + row['href']
    programs_data.loc[programs_data['program'] == row.text, 'archive_url'] = \
    program_url + row['href'].replace('.', '_archive.')

In [15]:
programs_data

| | program | main_url | archive_url | number_of_papers |
|---|---|---|---|---|
| 0 | Aging | http://www.nber.org/papersbyprog/AG.html | http://www.nber.org/papersbyprog/AG_archive.html | 1347 |
| 1 | Asset Pricing | http://www.nber.org/papersbyprog/AP.html | http://www.nber.org/papersbyprog/AP_archive.html | 2181 |
| 2 | Corporate Finance | http://www.nber.org/papersbyprog/CF.html | http://www.nber.org/papersbyprog/CF_archive.html | 1785 |
| 3 | Children | http://www.nber.org/papersbyprog/CH.html | http://www.nber.org/papersbyprog/CH_archive.html | 1312 |
| 4 | Development of the American Economy | http://www.nber.org/papersbyprog/DAE.html | http://www.nber.org/papersbyprog/DAE_archive.html | 1432 |
| 5 | Development Economics | http://www.nber.org/papersbyprog/DEV.html | http://www.nber.org/papersbyprog/DEV_archive.html | 716 |
| 6 | Economics of Education | http://www.nber.org/papersbyprog/ED.html | http://www.nber.org/papersbyprog/ED_archive.html | 1230 |
| 7 | Environment and Energy Economics | http://www.nber.org/papersbyprog/EEE.html | http://www.nber.org/papersbyprog/EEE_archive.html | 924 |
| 8 | Economic Fluctuations and Growth | http://www.nber.org/papersbyprog/EFG.html | http://www.nber.org/papersbyprog/EFG_archive.html | 4842 |
| 9 | Health Care | http://www.nber.org/papersbyprog/HC.html | http://www.nber.org/papersbyprog/HC_archive.html | 1292 |


In [16]:
# Total number of papers (including duplicates)
sum(programs_data.number_of_papers)

43448

In [17]:
papers_data = pd.DataFrame(columns=['id', 'title', 'authors', 'program', 'url', 'year', 'abstract'])

for papers_url in programs_data[['main_url', 'archive_url']].values.ravel():
    soup = BeautifulSoup(urlopen(papers_url).read(), 'html.parser')
    table = soup.findAll('table', {'border': 0, 'cellspacing': 1,
                                   'cellpadding': 4})[0]

    for row in table.findAll('a'):
        row_index = len(papers_data)
        papers_data.loc[row_index, 'id'] = row.text
        papers_data.loc[row_index, 'url'] = row['href']

In [35]:
papers_data = papers_data.drop_duplicates()
papers_data = papers_data[papers_data.url != '']

# Keep only the IDs in which 'w' or 't' is followed by a digit (the pattern
# also expects a leading space). Other IDs do not appear to have abstracts.
papers_data = papers_data[papers_data.id.str.contains(' [tw][0-9]')].reset_index(drop=True)

# Total number of unique papers
papers_data.shape[0]

24317

In [73]:
import re

for i, url in enumerate(papers_data.url):
    soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
    table = soup.findAll('table', {'id': 'mainTable'})[0]
    abstract = soup.findAll('p', {'style': 'margin-left: 40px; margin-right: 40px; text-align: justify'}, text=True)
    
    text = ''
    for paragraph in abstract:
        text += paragraph.text
    papers_data.loc[i, 'abstract'] = text
    
    papers_data.loc[i, 'title'] = table.findAll('h1', {'class': 'title'})[0].text
    papers_data.loc[i, 'authors'] = table.findAll('h2', {'class': 'bibtop'})[0].text
    temp_str = table.findAll('p', {'class': 'bibtop'})[0].text
    papers_data.loc[i, 'program'] = re.search(':(.*)', temp_str).group(1)
    papers_data.loc[i, 'year'] = re.search('Issued in (.*)', temp_str).group(1)
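
The two `re.search` calls above can be tried in isolation. The snippet below is a minimal sketch: the `temp_str` value is a hypothetical stand-in for the text of the `bibtop` paragraph (the real page text is not shown here and may differ), and the year step mirrors the cleanup applied later in the notebook.

```python
import re

# Hypothetical example of the 'bibtop' paragraph text; the real layout is an assumption.
temp_str = ('NBER Working Paper No. 24001\n'
            'Issued in November 2017\n'
            'NBER Program(s):Labor Studies')

# '.' does not match newlines, so each capture stops at the end of its line.
program = re.search(':(.*)', temp_str).group(1)
issued = re.search('Issued in (.*)', temp_str).group(1)

# Keep the part before any comma and pull out the first run of digits as the year.
year = int(re.search(r'(\d+)', issued.split(',')[0]).group(1))

print(program, issued, year)
```

Because `:(.*)` anchors on the first colon in the string, this only works if no colon appears before the program list.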

# String Matching Analysis

I considered two approaches here. The first was to use an approximate string matching technique such as [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to match against the original keywords in the article's chart. The problem is that this technique is not powerful enough for this application. This is particularly evident for the string 'dynamic stochastic general equilibrium', which is closer to 'equilibrium' than to 'dsge', so the technique would give meaningless results for this string. Additionally, even though strings such as 'neural nets' and 'machine learning' are closely related semantically, they are far apart in terms of Levenshtein distance. I therefore resorted to the second approach: a dictionary containing different variants of the original keywords to match the strings.
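
To make the 'dsge' example concrete, here is a minimal sketch of the standard dynamic-programming Levenshtein distance with unit costs. The function name `levenshtein` is introduced for illustration only and is not part of the notebook.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


full = 'dynamic stochastic general equilibrium'
d_equilibrium = levenshtein(full, 'equilibrium')
d_dsge = levenshtein(full, 'dsge')
d_ml = levenshtein('neural nets', 'machine learning')

print(d_equilibrium, d_dsge, d_ml)
```

The full phrase is 27 edits away from 'equilibrium' (delete the first 27 characters) but 34 edits away from 'dsge', so a nearest-match rule would pick the wrong keyword.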

In [278]:
papers_data = pd.read_csv('nber_abstracts.csv', index_col=0)

In [279]:
papers_data['abstract'] = papers_data['abstract'].str.lower()
papers_data['title'] = papers_data['title'].str.lower()

In [280]:
techniques = {
    'diff-in-diff': ['difference in difference', 'differences in difference',
                     'difference-in-difference', 'differences-in-difference',
                     'diff-in-diff', 'diff in diff'],
    'regression discontinuity': ['regression discontinuity'],
    'dynamic stochastic general equilibrium': ['dynamic stochastic general equilibrium', 'dsge',
                                               'dynamic general equilibrium', 'sdge', 'dge'],
    'randomized controlled trial': ['randomized controlled trial', 'rct'],
    'laboratory experiments': ['laboratory', 'lab experiment'],
    'machine learning': ['machine learning', 'big data', 'deep learning',
                         'supervised learning', 'unsupervised learning',
                         'reinforcement learning', 'neural nets', 'semi-supervised learning'],
    'artificial intelligence': ['artificial intelligence'],
}

In [281]:
for technique in techniques.keys():
    papers_data[technique] = 0
    for key in techniques[technique]:
        papers_data[technique] += papers_data['abstract'].str.contains(key)
        papers_data[technique] += papers_data['title'].str.contains(key)
    papers_data[technique] = papers_data[technique] > 0
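
The matching loop can be checked on a toy example. The sketch below uses made-up titles and abstracts (not NBER data) and a shortened dictionary. It also shows one caveat that may be worth checking: plain substring matching lets a short variant such as 'dge' match inside an unrelated word like 'budget'. The word-boundary variant is an illustration, not what the notebook does.

```python
import re

import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'title': ['a dsge model of trade', 'budget rules', 'schooling returns'],
    'abstract': ['we estimate a model.', 'we study budget deficits.',
                 'we use a difference-in-difference design.'],
})

# Shortened version of the dictionary above.
variants = {
    'dynamic stochastic general equilibrium': ['dynamic stochastic general equilibrium',
                                               'dsge', 'dge'],
    'diff-in-diff': ['difference-in-difference', 'diff-in-diff'],
}

text = toy['title'] + ' ' + toy['abstract']
for technique, keys in variants.items():
    pattern = '|'.join(re.escape(key) for key in keys)
    # Substring match, equivalent to summing str.contains over keys and testing > 0
    toy[technique] = text.str.contains(pattern)
    # Hypothetical stricter variant: only match whole words or phrases
    toy[technique + ' (whole words)'] = text.str.contains(r'\b(?:' + pattern + r')\b')

print(toy.drop(columns=['title', 'abstract']))
```

Here the second row is flagged as DSGE by the substring match only because 'budget' contains 'dge'. The whole-word variant drops that match and keeps the genuine ones.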

In [282]:
# np.int was removed from NumPy; use the built-in int. expand=False returns a Series.
papers_data['year'] = papers_data['year'].str.split(',').str[0].str.extract(r'(\d+)', expand=False).astype(int)

In [283]:
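grouped = papers_data.groupby('year')
# Share of papers per year (in %) that mention each technique. Restrict the sum
# to the boolean technique columns so that text columns are not aggregated.
summary_stats = grouped[list(techniques)].sum()
summary_stats = summary_stats.divide(grouped.size(), axis='rows')*100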

In [284]:
moving_average = summary_stats.rolling(window=5,center=False).mean()
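
As a sanity check on the last two cells, the sketch below repeats the computation on a tiny made-up table. The column name `rct` and the window of 2 are chosen only to keep the example short.

```python
import pandas as pd

# Made-up flags: two papers in 2000 and four papers in 2001.
toy = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001, 2001, 2001],
    'rct': [True, False, False, False, False, True],
})

grouped = toy.groupby('year')
# Sum of boolean flags divided by the group size, as in the notebook
share = grouped['rct'].sum().divide(grouped.size()) * 100
# The mean of a boolean column gives the same percentage
share_via_mean = grouped['rct'].mean() * 100

# Trailing moving average; the first window - 1 entries are NaN
smoothed = share.rolling(window=2).mean()

print(share)
print(smoothed)
```

Note that `rolling` counts rows rather than calendar years, so a year with no papers would silently stretch the window. With `window=5`, the first four years of the series have no moving-average value.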

In [286]:
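from bokeh.models import Legend
from bokeh.palettes import all_palettes
from bokeh.plotting import figure, show

# TOOLS is not defined in the cells shown; this toolbar selection is an assumption.
TOOLS = 'pan,wheel_zoom,box_zoom,reset,save'

legend_it = []

# One color per technique (Category10 provides palettes of 3 to 10 colors)
colors = all_palettes['Category10'][len(moving_average.columns)]

# Note: Bokeh 3.0+ renamed plot_width/plot_height to width/height.
p = figure(tools=TOOLS, plot_width=800,
           plot_height=600, toolbar_location="above")

for (i, col) in enumerate(moving_average.columns):
    temp_line = p.line(x=moving_average.index.values,
                       y=moving_average[col],
                       color=colors[i])
    legend_it.append((col, [temp_line]))

# Place the legend outside the plot area, to the right
legend = Legend(items=legend_it, location=(0, 25))
p.add_layout(legend, 'right')

show(p)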

## Observations

- The original article does not appear to account for variants of the keywords and, as a result, likely underestimates the actual percentages.
- Randomized controlled trials, difference-in-differences, machine learning and artificial intelligence have spiked in popularity since the original article was published.