# Basic Webscraping with Python

In today's review session, we will go over basic webscraping in Python: first retrieving data through APIs, then scraping pages directly with BeautifulSoup.

## Webscraping with APIs

Many of you are interested in analyzing Twitter sentiment or New York Times headlines. Luckily, many sites offer APIs (application programming interfaces) that we can call from Python to retrieve data directly.

In [35]:
import requests
from pprint import pprint
import pandas as pd
import time

For example, the New York Times offers several APIs that let us extract archive data or the most viewed articles. Below, I go over how to extract headlines and abstracts from the archive; see [here](https://developer.nytimes.com/apis) for more information on the other NY Times APIs.

In [2]:
api_key = 'bHx7CRYcz18D0R3igw7ZTdFk4gURhcJ1'
year = 2020
month = 1
response = requests.get(f'https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={api_key}')
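A key typed directly into a notebook is easy to leak when the notebook is shared. Here is a minimal sketch of one alternative, assuming the key is stored in an environment variable. The variable name `NYT_API_KEY` and the helper `archive_url` are hypothetical names chosen for this example, not part of the NY Times API.

```python
import os

ARCHIVE_ENDPOINT = "https://api.nytimes.com/svc/archive/v1"


def archive_url(year, month, api_key):
    """Build the archive URL used in this section for one year and month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    return f"{ARCHIVE_ENDPOINT}/{year}/{month}.json?api-key={api_key}"


# NYT_API_KEY is an assumed variable name; the placeholder keeps the example runnable
api_key = os.environ.get("NYT_API_KEY", "YOUR_KEY_HERE")
url = archive_url(2020, 1, api_key)
print(url.replace(api_key, "***"))  # mask the key instead of printing it
```

The resulting `url` can be passed to `requests.get` exactly as in the cell above.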

The request returns the matching entries in JSON format.

In [37]:
content = response.json()

Here is what each entry looks like. The available information includes the abstract, headline, and section name.

In [38]:
content['response']['docs'][0]

{'abstract': 'The gunman who shot two parishioners at the West Freeway Church of Christ had come earlier looking for food and money, church leaders said.',
 'web_url': 'https://www.nytimes.com/2019/12/31/us/texas-church-shooting-white-settlement.html',
 'snippet': 'The gunman who shot two parishioners at the West Freeway Church of Christ had come earlier looking for food and money, church leaders said.',
 'lead_paragraph': 'WHITE SETTLEMENT, Texas — Given West Freeway Church of Christ’s location, on a busy thoroughfare just off a major highway, vagabonds and homeless people regularly found their way inside. Sometimes, they sought spiritual help. But more often, they came asking for food, which the church would provide, and money, which it typically would not. ',
 'print_section': 'A',
 'print_page': '16',
 'source': 'The New York Times',
 'multimedia': [{'rank': 0,
   'subtype': 'xlarge',
   'caption': None,
   'credit': None,
   'type': 'image',
   'url': 'images/2019/12/31/us/31TEXAS

We can iterate through the loaded JSON to retrieve headlines and abstracts.

In [33]:
nyt_headlines = []
nyt_abstract = []
for article in content['response']['docs']:
    nyt_headlines.append(article['headline']['main'])
    nyt_abstract.append(article['abstract'])
nyt_df = pd.DataFrame({'headline': nyt_headlines, 'abstract': nyt_abstract})
nyt_df.head()

|  | headline | abstract |
|---|---|---|
| 0 | ‘Battling a Demon’: Drifter Sought Help Before... | The gunman who shot two parishioners at the We... |
| 1 | Protect Veterans From Fraud | Congress could do much more to protect America... |
| 2 | F.D.A. Plans to Ban Most E-Cigarette Flavors b... | The tobacco and vaping industries and conserva... |
| 3 | ‘It’s Green and Slimy’ | Christina Iverson and Jeff Chen ring in the Ne... |
| 4 | Corrections: Jan. 1, 2020 | Corrections that appeared in print on Wednesda... |
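The loop above indexes `article['headline']['main']` directly, which raises a `KeyError` if an entry lacks that field. Whether the archive ever omits these fields is not something this section establishes, so treat the following as a defensive sketch. The helper name `headline_and_abstract` and the sample entries are made up for illustration.

```python
def headline_and_abstract(article):
    """Return (headline, abstract) for one entry, with None for missing fields."""
    headline = (article.get("headline") or {}).get("main")
    abstract = article.get("abstract")
    return headline, abstract


# Hypothetical entries shaped like content['response']['docs'];
# the second one deliberately has no headline
docs = [
    {"headline": {"main": "Protect Veterans From Fraud"},
     "abstract": "Congress could do much more to protect America..."},
    {"abstract": "An entry without a headline"},
]

pairs = [headline_and_abstract(doc) for doc in docs]
print(pairs)
```

The list of pairs can then be split into the two columns of the DataFrame, with missing values showing up as `None` rather than stopping the loop.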


If we want to look at historical headlines, we have to make multiple requests, one per month. APIs usually enforce per-minute or per-day request limits. To avoid hitting the limit, it is good practice to "sleep" between requests.

In [42]:
year = 2021
nyt_headlines = []
nyt_abstract = []
nyt_dates = []
for month in [1, 2]:
    response = requests.get(f'https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={api_key}')
    content = response.json()
    for article in content['response']['docs']:
        nyt_headlines.append(article['headline']['main'])
        nyt_abstract.append(article['abstract'])
        nyt_dates.append(article['pub_date'])
    time.sleep(10)
nyt_df = pd.DataFrame({'headline': nyt_headlines, 'abstract': nyt_abstract, 'date': nyt_dates})
nyt_df.head()

|  | headline | abstract | date |
|---|---|---|---|
| 0 | Things Will Get Better. Seriously. | Reasons to be hopeful about the Biden economy. | 2021-01-01T00:00:09+0000 |
| 1 | Minneapolis Police Release Body Camera Video o... | The video shows a man raising something to his... | 2021-01-01T00:16:53+0000 |
| 2 | Resolving to live a lot better than in 2020. | Every December since 2017, Ada Rojas has guide... | 2021-01-01T00:58:19+0000 |
| 3 | Justice Dept. Asks Judge to Toss Election Laws... | The suit, led by Representative Louie Gohmert ... | 2021-01-01T01:24:55+0000 |
| 4 | The U.S. reaches 20 million cases. | The United States recorded its 20 millionth ca... | 2021-01-01T01:28:22+0000 |
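The loop above sleeps after every request, including the last one. The sketch below pauses only between requests and keeps the request logic separate from the pacing. `fetch_months` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real `requests.get(...).json()` call so that the example runs offline.

```python
import time


def fetch_months(fetch, year, months, pause_seconds=10):
    """Call fetch(year, month) for each month, sleeping only between calls."""
    results = {}
    for i, month in enumerate(months):
        if i > 0:
            time.sleep(pause_seconds)  # wait before every request except the first
        results[month] = fetch(year, month)
    return results


def fake_fetch(year, month):
    """Stand-in for a real API request; returns a payload shaped like the archive JSON."""
    return {"response": {"docs": [{"pub_date": f"{year}-{month:02d}-01"}]}}


# pause_seconds=0 keeps the demo fast; use a real pause against a live API
data = fetch_months(fake_fetch, 2021, [1, 2], pause_seconds=0)
print(sorted(data))
```

Passing the request function as an argument also makes the pacing logic easy to test without spending any of the API quota.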


## Scraping Directly

The example here comes from a project I am working on with Professor Handlan. We want to find all the session names and their corresponding JEL codes for the ASSA sessions in 2021. Since ASSA does not have an API, we have to go directly to the webpage and find where the information we want is stored. [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) is a great tool for navigating HTML source code.
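To see the two calls the scraper relies on, `find` and `find_all`, here is a small self-contained sketch. The HTML string is hypothetical and only mimics the element names used in this section; the real ASSA page is larger and may differ in detail. It uses Python's built-in `html.parser` so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the structure assumed in this section
html = """
<div id="searchResults">
  <article class="session-item">
    <h3 class="title">Working from Home</h3>
  </article>
  <article class="session-item">
    <h3 class="title">Informalization</h3>
  </article>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first match; find_all returns every match
results = soup.find(id="searchResults")
sessions = results.find_all("article", class_="session-item")

titles = [s.find("h3", class_="title").get_text(strip=True) for s in sessions]
print(titles)
```

In practice, the browser's "Inspect" tool is the quickest way to discover which tags, ids, and classes hold the information you want.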

In [54]:
from bs4 import BeautifulSoup
import re

# This script scrapes session names, JEL codes, and start times
# from a saved copy of the ASSA 2021 program page
URL = "html/ASSA2021.html"
with open(URL) as page:
    soup = BeautifulSoup(page.read(), features="lxml")
results = soup.find(id="searchResults")
elements = results.find_all("article", class_="session-item")
rows = []
for element in elements:
    name = element.find("h3", class_="title")
    title = name.text.strip()
    # Drop runs of spaces, then split the session name from the JEL codes
    title = re.sub('  +', '', title).split('\n')
    # Scrape Title
    name = title[0]
    JEL_code_1, JEL_code_2 = None, None
    if len(title) == 2:
        JEL_codes = title[1]
        JEL_codes = re.sub('[()]', '', JEL_codes).split(', ')
        JEL_code_1 = JEL_codes[0]
        if len(JEL_codes) == 2:
            JEL_code_2 = JEL_codes[1]
    # Scrape event start time
    event = element.find("span", class_="icon-clock")
    start_time = event.next_sibling.strip().split('   ')[0]
    rows.append({'name': name, 'JEL_code_1': JEL_code_1, 'JEL_code_2': JEL_code_2, 'start_time': start_time})

# DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
JEL_df = pd.DataFrame(rows, columns=['name', 'JEL_code_1', 'JEL_code_2', 'start_time'])

JEL_df.tail()

|  | name | JEL_code_1 | JEL_code_2 | start_time |
|---|---|---|---|---|
| 533 | Factors Impacting National Vitality in Emergin... | O5 | I3 | 3:45 PM |
| 534 | Working from Home | J0 |  | 3:45 PM |
| 535 | Public Utilities -2 | L9 | L5 | 3:45 PM |
| 536 | Informalization | P1 | B5 | 3:45 PM |
| 537 | Post Capitalist Futures | C6 | B5 | 3:45 PM |
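The title-parsing step in the scraping loop can be pulled into a small function and checked on its own, without the HTML file. This is a sketch: `parse_title` is a hypothetical helper, and the raw strings below are assumptions about what the `<h3 class="title">` text looks like (a name, then optionally a new line with codes in parentheses). Unlike the loop, it keeps the first two codes even if more than two are listed.

```python
import re


def parse_title(raw_title):
    """Split a raw title string into (name, JEL_code_1, JEL_code_2)."""
    # Remove runs of two or more spaces, then separate the name from the codes
    parts = re.sub(r" {2,}", "", raw_title.strip()).split("\n")
    name = parts[0]
    codes = []
    if len(parts) == 2:
        codes = re.sub(r"[()]", "", parts[1]).split(", ")
    code_1 = codes[0] if len(codes) >= 1 else None
    code_2 = codes[1] if len(codes) >= 2 else None
    return name, code_1, code_2


# Hypothetical raw strings shaped like the scraped title text
print(parse_title("Public Utilities -2\n        (L9, L5)"))
print(parse_title("Working from Home\n        (J0)"))
print(parse_title("A Session Without Codes"))
```

Testing the parsing on a few strings like this makes it easier to spot sessions whose titles do not follow the expected pattern before running the full scrape.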


## Other Approaches

Sometimes there are Python libraries that make accessing an API easier. For example, [tweepy](https://www.tweepy.org/) and [python-twitter](https://python-twitter.readthedocs.io/en/latest/) are two libraries that help you retrieve tweets through the Twitter API.

In addition, you can use Selenium, which automates a real browser, to scrape pages that load their content with JavaScript. [Here](https://towardsdatascience.com/how-to-use-selenium-to-web-scrape-with-example-80f9b23a843a) is a tutorial.