# Programming Historian Tutorial Outline: Scraping Chronicling America for Newspaper Data with Python

## 1. Introduction
- describe Chronicling America
    - size & scale
    - OCR limitations
- describe web-scraping (refer to earlier PH tutorials)
- describe Python
    - mention libraries of use: requests, pandas, BeautifulSoup, datetime, and lxml (the parser passed to BeautifulSoup in section 5)
    - components of demonstrated code: CONSTANTS, dictionaries, functions, conditionals, loops
- preliminary warnings
    - scraping programs can take a long time
        - maintaining internet connection
        - disabling device sleep mode
    - LoC server connection occasionally fails; give fair warning about the error ConnectionError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
    - preprocessing will still be necessary after scraping dataset
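
One way to soften the dropped-connection warning above is to retry a failed request a few times before giving up. The following is a minimal sketch, not part of the tutorial's demonstrated code: the helper name fetch_with_retry and its retries and wait_seconds parameters are hypothetical, and the function takes the request function as an argument so that it can be tried out without a network connection.

```python
import time


def fetch_with_retry(fetch, url, retries=3, wait_seconds=5):
    """Call fetch(url), retrying when the connection drops.

    fetch is any callable that takes a URL (for example requests.get).
    The function name and its parameters are hypothetical.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except OSError as error:
            # requests' ConnectionError and Python's built-in
            # ConnectionResetError are both subclasses of OSError
            print(f'attempt {attempt} failed: {error}')
            if attempt == retries:
                raise  # give up after the final attempt
            time.sleep(wait_seconds)  # pause before trying again
```

In the scraping loop, this would presumably replace the direct call, as in fetch_with_retry(requests.get, url_string).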

## 2. Defining dataset goals/intentions
- review CA website: [https://chroniclingamerica.loc.gov/](https://chroniclingamerica.loc.gov/)
- review CA URL patterns
    - use this URL from first hypothetical dataset below: [https://chroniclingamerica.loc.gov/lccn/sn85025569/1861-04-17/ed-1/seq-2/ocr/](https://chroniclingamerica.loc.gov/lccn/sn85025569/1861-04-17/ed-1/seq-2/ocr/)
- pick newspapers
- pick timeframes
- establish pertinent newspaper metadata: city, state, sequence (aka number of pages per issue), etc.
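
To make the URL pattern concrete, the example URL above can be split into the pieces that the later constants control. This is an illustrative sketch only; split_ca_url is a hypothetical helper, and it assumes every page URL follows the same lccn/date/edition/sequence layout as the example.

```python
from urllib.parse import urlparse

example_url = ('https://chroniclingamerica.loc.gov/lccn/sn85025569/'
               '1861-04-17/ed-1/seq-2/ocr/')


def split_ca_url(url):
    """Break a Chronicling America OCR page URL into its parts."""
    # path segments: ['lccn', 'sn85025569', '1861-04-17', 'ed-1', 'seq-2', 'ocr']
    segments = urlparse(url).path.strip('/').split('/')
    _, sn_code, date, edition, sequence, _ = segments
    return {
        'sn_code': sn_code,                      # identifies the newspaper
        'date': date,                            # issue date, YYYY-MM-DD
        'edition': int(edition.split('-')[1]),   # 'ed-1' -> 1
        'page': int(sequence.split('-')[1]),     # 'seq-2' -> 2
    }


print(split_ca_url(example_url))
# {'sn_code': 'sn85025569', 'date': '1861-04-17', 'edition': 1, 'page': 2}
```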

## 3. Importing libraries & setting constants

- briefly describe Python libraries
- briefly note how constants are determined based on aspects of newspapers being scraped

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import timedelta
from datetime import datetime

- describe 1st hypothetical: building a dataset of US newspapers from the area where I grew up. Date range is the first month of the Civil War (April 1861).
    - Small for demonstration purposes
- Hypothetical purpose of dataset: studying how my place of origin reported about the start of the Civil War

In [2]:
NEWSPAPER_DICTIONARY = {
    'sn85025569': ('Red Wing Sentinel', 'Red Wing', 'MN'),
    'sn83016836': ('St. Cloud Democrat', 'St. Cloud', 'MN'),
    'sn84031595': ('Minnesota Staats-Zeitung', 'St. Paul', 'MN')
    # Add more 'sn' strings with corresponding newspaper, city, state
    # they must remain in this format:
    # 'snxxxxxxxx': ('NewspaperName', 'City', 'State')
    # note that more metadata can be used but must also be established in section 5 loop
}

START_DATE = datetime(1861, 4, 1)  # Choose start date
END_DATE = datetime(1861, 4, 30)  # Choose end date
DATE_FORMAT = "%Y-%m-%d"
START_PAGE = 1 # always starts with 1
END_PAGE = 5 # note Python peculiarity: range() excludes its stop value, so 5 scrapes pages 1 to 4; also all papers in this example have page ranges of 1 to 4
iterating_date = START_DATE # briefly note how this date will change with timedelta
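
The two points flagged in the comments above, the page range and the changing date, can be checked in isolation before any scraping happens. The sketch below uses lowercase demo_ names and a shortened three-day range (both chosen here for illustration) so that it does not overwrite the constants just defined.

```python
from datetime import datetime, timedelta

demo_start = datetime(1861, 4, 1)
demo_end = datetime(1861, 4, 3)  # shortened range, for illustration only

# range() stops before its second argument, so (1, 5) yields pages 1 to 4
demo_pages = list(range(1, 5))
print(demo_pages)  # [1, 2, 3, 4]

# timedelta(days=1) moves the date forward one day per pass
demo_dates = []
demo_date = demo_start
while demo_date <= demo_end:
    demo_dates.append(demo_date.strftime('%Y-%m-%d'))
    demo_date += timedelta(days=1)
print(demo_dates)  # ['1861-04-01', '1861-04-02', '1861-04-03']
```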

## 4. Creating preliminary scraping request function
- describe 'building' dataframe
- pull_row() function

In [3]:
df = pd.DataFrame(columns=['url', 'text', 'date', 'newspaper', 'city', 'state'])

def pull_row(data, new_row):
    data.loc[len(data)] = new_row
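
To illustrate what pull_row() does, the sketch below adds two rows to a small throwaway dataframe. The demo_df name, the example.org URLs, and the placeholder text are invented for this illustration; only the column names and the function body come from the cell above.

```python
import pandas as pd


def pull_row(data, new_row):
    # len(data) is the next unused integer index, so this adds a row at the end
    data.loc[len(data)] = new_row


demo_df = pd.DataFrame(columns=['url', 'text', 'date', 'newspaper', 'city', 'state'])

pull_row(demo_df, ['https://example.org/page-1', 'placeholder text one',
                   '1861-04-03', 'Red Wing Sentinel', 'Red Wing', 'MN'])
pull_row(demo_df, ['https://example.org/page-2', 'placeholder text two',
                   '1861-04-04', 'St. Cloud Democrat', 'St. Cloud', 'MN'])

print(len(demo_df))                 # 2
print(demo_df.loc[1, 'newspaper'])  # St. Cloud Democrat
```

Adding rows one at a time keeps the code easy to follow. For much larger scrapes, collecting rows in a list and building the dataframe once at the end is a common alternative.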

## 5. Creating conditional and loop that pulls textual data
- describe putting the aforementioned pieces together in a conditional/loop
- explain how this code need not be edited; only the aforementioned constants change

In [4]:
while iterating_date <= END_DATE: # explain how this caps scrape range by date

    formatted_date = iterating_date.strftime(DATE_FORMAT) # explain how this converts the changing date into the YYYY-MM-DD string that the URL expects

    for sn_code, (newspaper, city, state) in NEWSPAPER_DICTIONARY.items():
        # explain sn_code & dictionary defined variables
        consistent_url = f'https://chroniclingamerica.loc.gov/lccn/{sn_code}/'
        # explain consistent portion of URL code

        for page in range(START_PAGE, END_PAGE):
            # explain looping over page range
            url_string = f'{consistent_url}{formatted_date}/ed-1/seq-{page}/ocr/'
            # explain piecing together URL string
            print(url_string)
            # explain printing to see progress (not required but helpful to see how program is running)
            pulled_data = requests.get(url_string)
            # explain requests.get as scraping function
            # note that this will pull raw HTML data, not just the newspaper text
            if pulled_data.status_code == 200:
                # explain conditional to only add data from pages/dates that have been digitized
                soup = BeautifulSoup(pulled_data.content, 'lxml')
                # note BeautifulSoup parses HTML data to separate newspaper text from other HTML tags/data
                text_chunks = soup.find_all('p')
                # Note this is where BeautifulSoup pulls HTML tagged 'p' content (just newspaper text)
                text = ' '.join([p.get_text() for p in text_chunks])
                # note this rejoins any separated/paragraphed text parsed with BeautifulSoup
                pull_row(df, [url_string, text, formatted_date, newspaper, city, state])
                # finally, pull_row function defined earlier puts data in our dataframe

    iterating_date += timedelta(days=1)
    # finally, previously defined iterating_date changes with timedelta

https://chroniclingamerica.loc.gov/lccn/sn85025569/1861-04-01/ed-1/seq-1/ocr/
https://chroniclingamerica.loc.gov/lccn/sn85025569/1861-04-01/ed-1/seq-2/ocr/
https://chroniclingamerica.loc.gov/lccn/sn85025569/1861-04-01/ed-1/seq-3/ocr/
https://chroniclingamerica.loc.gov/lccn/sn85025569/1861-04-01/ed-1/seq-4/ocr/
https://chroniclingamerica.loc.gov/lccn/sn83016836/1861-04-01/ed-1/seq-1/ocr/
https://chroniclingamerica.loc.gov/lccn/sn83016836/1861-04-01/ed-1/seq-2/ocr/
https://chroniclingamerica.loc.gov/lccn/sn83016836/1861-04-01/ed-1/seq-3/ocr/
https://chroniclingamerica.loc.gov/lccn/sn83016836/1861-04-01/ed-1/seq-4/ocr/
https://chroniclingamerica.loc.gov/lccn/sn84031595/1861-04-01/ed-1/seq-1/ocr/
https://chroniclingamerica.loc.gov/lccn/sn84031595/1861-04-01/ed-1/seq-2/ocr/
https://chroniclingamerica.loc.gov/lccn/sn84031595/1861-04-01/ed-1/seq-3/ocr/
https://chroniclingamerica.loc.gov/lccn/sn84031595/1861-04-01/ed-1/seq-4/ocr/
https://chroniclingamerica.loc.gov/lccn/sn85025569/1861-04-02/ed
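
Because the loop sends one request per date, newspaper, and page, it can help to count the URLs before starting a long scrape. The following dry run is a sketch under the same assumptions as the loop above (edition 1 only, a fixed page range); build_urls is a hypothetical helper that makes no network requests.

```python
from datetime import datetime, timedelta


def build_urls(newspapers, start_date, end_date, start_page, end_page):
    """Yield every URL the scraping loop would request, in the same order."""
    current = start_date
    while current <= end_date:
        formatted_date = current.strftime('%Y-%m-%d')
        for sn_code in newspapers:
            for page in range(start_page, end_page):
                yield (f'https://chroniclingamerica.loc.gov/lccn/{sn_code}/'
                       f'{formatted_date}/ed-1/seq-{page}/ocr/')
        current += timedelta(days=1)


demo_newspapers = {
    'sn85025569': ('Red Wing Sentinel', 'Red Wing', 'MN'),
    'sn83016836': ('St. Cloud Democrat', 'St. Cloud', 'MN'),
    'sn84031595': ('Minnesota Staats-Zeitung', 'St. Paul', 'MN'),
}

demo_urls = list(build_urls(demo_newspapers,
                            datetime(1861, 4, 1), datetime(1861, 4, 30), 1, 5))

print(len(demo_urls))  # 30 days x 3 newspapers x 4 pages = 360
print(demo_urls[0])
```

Most of these URLs will not return a page, since none of these papers published daily; the status code check in the loop is what filters them out.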

## 6. Reviewing and saving the dataset

In [5]:
df

Unnamed: 0,url,text,date,newspaper,city,state
0,https://chroniclingamerica.loc.gov/lccn/sn8502...,"O S I E SfiESENTINEL.Wi W 1»II ELI'S, Editor.P...",1861-04-03,Red Wing Sentinel,Red Wing,MN
1,https://chroniclingamerica.loc.gov/lccn/sn8502...,"THE SENTINEL.RED WING APRIL i), tSGl.11 ci.isi...",1861-04-03,Red Wing Sentinel,Red Wing,MN
2,https://chroniclingamerica.loc.gov/lccn/sn8502...,a a aaI I E OE WING CltlTRCIIES.EPISCOPAL.CHRI...,1861-04-03,Red Wing Sentinel,Red Wing,MN
3,https://chroniclingamerica.loc.gov/lccn/sn8502...,"O A E SALEWhifrciis Walter Carpenter ""f Ooodlm...",1861-04-03,Red Wing Sentinel,Red Wing,MN
4,https://chroniclingamerica.loc.gov/lccn/sn8301...,"""IYUL.CLOUD DEMOCRATICE ON THE WESTERN BANK OF...",1861-04-04,St. Cloud Democrat,St. Cloud,MN
5,https://chroniclingamerica.loc.gov/lccn/sn8301...,"'-^^r^""~^^fftesShe (SIoiul ipwcmt.JANE G. SWIS...",1861-04-04,St. Cloud Democrat,St. Cloud,MN
6,https://chroniclingamerica.loc.gov/lccn/sn8301...,Vi1'IJOQAL 3Still s.New PntBi.—'The Mill compa...,1861-04-04,St. Cloud Democrat,St. Cloud,MN
7,https://chroniclingamerica.loc.gov/lccn/sn8301...,"®b* Mil &nvH\QzmiM.JANtt'Ck SWISSHELM, PROPRIE...",1861-04-04,St. Cloud Democrat,St. Cloud,MN
8,https://chroniclingamerica.loc.gov/lccn/sn8403...,Die Minnesota Staatszeitung6de der Dritten und...,1861-04-06,Minnesota Staats-Zeitung,St. Paul,MN
9,https://chroniclingamerica.loc.gov/lccn/sn8403...,GroßerAusverkauf!vonW a azumKostenpreisbeiD. W...,1861-04-06,Minnesota Staats-Zeitung,St. Paul,MN


In [6]:
df.to_csv('mn_newspaper_data.csv', index=False, encoding='utf-8')

## 7. Second Hypothetical Adaptation
- building a dataset of Puerto Rican newspapers. Date range is 1898, the year of the Treaty of Paris (when Spain ceded colonial rule of Puerto Rico and the U.S. took over)
    - larger than first demonstration, but still relatively small for demonstration purposes
- Hypothetical purpose of dataset: studying how Puerto Ricans were reporting on the end of Spanish rule and the transition to US rule

In [7]:
# update your constants – revisit CA website to establish newspapers

df2 = pd.DataFrame(columns=['url', 'text', 'date', 'newspaper', 'city', 'state'])

NEWSPAPER_DICTIONARY = {
    'sn91099747': ('La correspondencia de Puerto Rico', 'San Juan', 'PR'),
    'sn90070270': ('La Democracia', 'Ponce', 'PR'),
    'sn91099739': ('Boletín mercantil de Puerto Rico', 'San Juan', 'PR')
    # Add more 'sn' strings with corresponding newspaper, city, state
    # they must remain in this format:
    # 'snxxxxxxxx': ('NewspaperName', 'City', 'State')
    # note that more metadata can be used but must also be established in section 5 loop
}

START_DATE = datetime(1898, 1, 1)  # Choose start date
END_DATE = datetime(1898, 12, 31)  # Choose end date
DATE_FORMAT = "%Y-%m-%d"
START_PAGE = 1 # always starts with 1
END_PAGE = 5 # note Python peculiarity: range() excludes its stop value, so 5 scrapes pages 1 to 4; also all papers in this example have page ranges of 1 to 4
iterating_date = START_DATE # briefly note how this date will change with timedelta

In [8]:
# rerun the code:
while iterating_date <= END_DATE: # explain how this caps scrape range by date

    formatted_date = iterating_date.strftime(DATE_FORMAT) # explain how this converts the changing date into the YYYY-MM-DD string that the URL expects

    for sn_code, (newspaper, city, state) in NEWSPAPER_DICTIONARY.items():
        # explain sn_code & dictionary defined variables
        consistent_url = f'https://chroniclingamerica.loc.gov/lccn/{sn_code}/'
        # explain consistent portion of URL code

        for page in range(START_PAGE, END_PAGE):
            # explain looping over page range
            url_string = f'{consistent_url}{formatted_date}/ed-1/seq-{page}/ocr/'
            # explain piecing together URL string
            print(url_string)
            # explain printing to see progress (not required but helpful to see how program is running)
            pulled_data = requests.get(url_string)
            # explain requests.get as scraping function
            # note that this will pull raw HTML data, not just the newspaper text
            if pulled_data.status_code == 200:
                # explain conditional to only add data from pages/dates that have been digitized
                soup = BeautifulSoup(pulled_data.content, 'lxml')
                # note BeautifulSoup parses HTML data to separate newspaper text from other HTML tags/data
                text_chunks = soup.find_all('p')
                # Note this is where BeautifulSoup pulls HTML tagged 'p' content (just newspaper text)
                text = ' '.join([p.get_text() for p in text_chunks])
                # note this rejoins any separated/paragraphed text parsed with BeautifulSoup
                pull_row(df2, [url_string, text, formatted_date, newspaper, city, state])
                # finally, pull_row function defined earlier puts data in our dataframe

    iterating_date += timedelta(days=1)
    # finally, previously defined iterating_date changes with timedelta

https://chroniclingamerica.loc.gov/lccn/sn91099747/1898-01-01/ed-1/seq-1/ocr/
https://chroniclingamerica.loc.gov/lccn/sn91099747/1898-01-01/ed-1/seq-2/ocr/
https://chroniclingamerica.loc.gov/lccn/sn91099747/1898-01-01/ed-1/seq-3/ocr/
https://chroniclingamerica.loc.gov/lccn/sn91099747/1898-01-01/ed-1/seq-4/ocr/
https://chroniclingamerica.loc.gov/lccn/sn90070270/1898-01-01/ed-1/seq-1/ocr/
https://chroniclingamerica.loc.gov/lccn/sn90070270/1898-01-01/ed-1/seq-2/ocr/
https://chroniclingamerica.loc.gov/lccn/sn90070270/1898-01-01/ed-1/seq-3/ocr/
https://chroniclingamerica.loc.gov/lccn/sn90070270/1898-01-01/ed-1/seq-4/ocr/
https://chroniclingamerica.loc.gov/lccn/sn91099739/1898-01-01/ed-1/seq-1/ocr/
https://chroniclingamerica.loc.gov/lccn/sn91099739/1898-01-01/ed-1/seq-2/ocr/
https://chroniclingamerica.loc.gov/lccn/sn91099739/1898-01-01/ed-1/seq-3/ocr/
https://chroniclingamerica.loc.gov/lccn/sn91099739/1898-01-01/ed-1/seq-4/ocr/
https://chroniclingamerica.loc.gov/lccn/sn91099747/1898-01-02/ed

In [9]:
df2

Unnamed: 0,url,text,date,newspaper,city,state
0,https://chroniclingamerica.loc.gov/lccn/sn9109...,"/ ' ' T¡, '' ' Si... ~, Mí ,1**£'¿ate hi).¿ai ...",1898-01-02,Boletín mercantil de Puerto Rico,San Juan,PR
1,https://chroniclingamerica.loc.gov/lccn/sn9109...,"'dificarse nuestras eoi.vic, por el hecho de q...",1898-01-02,Boletín mercantil de Puerto Rico,San Juan,PR
2,https://chroniclingamerica.loc.gov/lccn/sn9109...,Text not available \nxml |\n txt\n,1898-01-02,Boletín mercantil de Puerto Rico,San Juan,PR
3,https://chroniclingamerica.loc.gov/lccn/sn9109...,Text not available \nxml |\n txt\n,1898-01-02,Boletín mercantil de Puerto Rico,San Juan,PR
4,https://chroniclingamerica.loc.gov/lccn/sn9109...,"p -2Diario absolutamente imparcial, eco tís a ...",1898-01-04,La correspondencia de Puerto Rico,San Juan,PR
...,...,...,...,...,...,...
2219,https://chroniclingamerica.loc.gov/lccn/sn9007...,. ;1 ! DR.. JOSE H. AMADEOV CURA IIISTANTAIIEA...,1898-12-30,La Democracia,Ponce,PR
2220,https://chroniclingamerica.loc.gov/lccn/sn9007...,i. ir'-3ARIO ZDZEj Xj. TLRIDESaño viii2Ponce5 ...,1898-12-31,La Democracia,Ponce,PR
2221,https://chroniclingamerica.loc.gov/lccn/sn9007...,"a V , . i: . iy 'í . : : ';. -7- -7-;.t sr if ...",1898-12-31,La Democracia,Ponce,PR
2222,https://chroniclingamerica.loc.gov/lccn/sn9007...,r .la isla del diablo y cscrinfora saerrada lo...,1898-12-31,La Democracia,Ponce,PR


In [10]:
# save the new dataframe
df2.to_csv('puerto_rico_newspaper_data.csv', index=False, encoding='utf-8')
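
The df2 preview above includes rows whose text is only the placeholder 'Text not available', which is one example of the preprocessing that remains after scraping. A minimal sketch of filtering such rows follows. It runs on a small made-up dataframe rather than df2 itself, and it assumes the placeholder always appears at the start of the text field.

```python
import pandas as pd

# Made-up sample rows standing in for df2; only the placeholder string
# is copied from the preview output
sample = pd.DataFrame({
    'text': ['sample OCR text',
             'Text not available \nxml |\n txt\n',
             'more sample OCR text'],
    'newspaper': ['La Democracia',
                  'Boletín mercantil de Puerto Rico',
                  'La Democracia'],
})

# True for rows whose text does not begin with the placeholder
has_text = ~sample['text'].str.startswith('Text not available')
cleaned = sample[has_text].reset_index(drop=True)

print(len(sample), len(cleaned))  # 3 2
```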