# Chronicling America API

[Chronicling America](https://chroniclingamerica.loc.gov/) is a collection of digitized American newspapers dating from 1777 to 1963 provided by the Library of Congress. The collection offers an application programming interface (API) which allows users to easily harvest large amounts of data.

In this notebook we will search Chronicling America's API, gather the search results into a Pandas dataframe, clean the data, and save it as a csv file.

In [6]:
# imports
import requests
import json
import math
import pandas as pd
import spacy

##Chronicling America URLs

If I search for a term, "abolition" for example, on https://chroniclingamerica.loc.gov/ I will get a results url that looks like this:

https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1770&date2=1963&proxtext=abolition&x=12&y=18&dateFilterType=yearRange&rows=20&searchType=basic

These search results are human actionable, but not machine actionable. Chronicling America as an API that allows me to get machine actionable results if I add `&format=json`:

https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1770&date2=1963&proxtext=abolition&x=12&y=18&dateFilterType=yearRange&rows=20&searchType=basic&format=json

If we examine the url we see that there are a number of search parameters:
- `state=`
- `date1=1770`
- `date2=1963`
- `proxtext=abolition`

We can edit these values to modify our search. I change the parameters to limit our search:

https://chroniclingamerica.loc.gov/search/pages/results/?state=Massachusetts&date1=1770&date2=1865&proxtext=prohibition&x=20&y=8&dateFilterType=yearRange&rows=20&searchType=basic&format=json

Now I can use the `requests` library to retrieve data from the url.

In [16]:
# initial search
url = 'https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1770&date2=1963&proxtext=Chumash&x=7&y=12&dateFilterType=yearRange&rows=20&searchType=basic&format=json'
response = requests.get(url)
raw = response.text
results = json.loads(raw)

## Explore search results

In [17]:
results.keys()

dict_keys(['totalItems', 'endIndex', 'startIndex', 'itemsPerPage', 'items'])

In [18]:
# explore items
print(type(results['items']))
print(response.status_code)

<class 'list'>
200


In [19]:
print(results['items'][0])

{'sequence': 6, 'county': ['Hennepin', 'Ramsey'], 'edition': None, 'frequency': 'Weekly', 'id': '/lccn/sn78004468/1918-04-12/ed-1/seq-6/', 'subject': ['Hennepin County (Minn.)--Newspapers.', 'Jewish newspapers--Minnesota.', 'Jewish newspapers--United States.', 'Jewish newspapers.--fast--(OCoLC)fst00982872', 'Minneapolis (Minn.)--Newspapers.', 'Minnesota--Hennepin County.--fast--(OCoLC)fst01213354', 'Minnesota--Minneapolis.--fast--(OCoLC)fst01204260', 'Minnesota--Ramsey County.--fast--(OCoLC)fst01213443', 'Minnesota--Saint Paul.--fast--(OCoLC)fst01212130', 'Minnesota.--fast--(OCoLC)fst01204560', 'Ramsey County (Minn.)--Newspapers.', 'Saint Paul (Minn.)--Newspapers.', 'United States.--fast--(OCoLC)fst01204155'], 'city': ['Minneapolis', 'Saint Paul'], 'date': '19180412', 'title': 'The American Jewish world. [volume]', 'end_year': 9999, 'note': ['Archived issues are available in digital format from the Library of Congress Chronicling America online collection.', 'Available on microfilm fro

In [20]:
print('totalItems:', results['totalItems'])
print('endIndex:', results['endIndex'])
print('startIndex:', results['startIndex'])
print('itemsPerPage:', results['itemsPerPage'])
print('Length and type of items:', len(results['items']), type(results['items']))

totalItems: 28
endIndex: 20
startIndex: 1
itemsPerPage: 20
Length and type of items: 20 <class 'list'>


The Chronicling America API returned 1,656 results. However, it will only display 20 at a time by default. I can add a new parameter `page=` to cycle through all the results, but first I need to know how many pages there will be. I can find this out by dividing `totalItems` (1,656) by `itemsPerPage` (20) and then round-up using `math.ceil`.

In [21]:
# find total amount of pages
total_pages = math.ceil(results['totalItems'] / results['itemsPerPage'])
print(total_pages)

2


Now that I know how many pages there will be, I can use a for loop to iterate through each result page and then each item on each result page. I then gather the data I want from each item: newspaper title, city, date, and text.

Notice in the code below I placed the url string in parentheses () so that I could break it up over multiple lines making it easier to read.

Also, for the sake of this demonstration, I am only iterating over 10 pages. For the full results the for loop should begin: `for i in range(1, total_pages+1)` (the `+1` is necessary becase the seond number in the range function is exclusive).

In [22]:
# create empty list for data
data = []

In [23]:
# set search parameters
start_date = '1770'
end_date = '1963'
search_term = 'Chumash'
state = ''

In [24]:
# loop through search results and collect data
for i in range(1, total_pages+1):  # for sake of time I'm doing only 10, you will want to put total_pages+1
    url = (f'https://chroniclingamerica.loc.gov/search/pages/results/?state={state}&date1={start_date}'
           f'&date2={end_date}&proxtext={search_term}&x=16&y=8&dateFilterType=yearRange&rows=20'
           f'&searchType=basic&format=json&page={i}')  # f-string
    response = requests.get(url)
    raw = response.text
    print(f'page {i} status code:', response.status_code)  # checking for errors
    results = json.loads(raw)
    items_ = results['items']
    for item_ in items_:
        row_data = {}
        try:
          row_data['title'] = item_['title_normal']
        except:
          row_data['title'] = "none"
        try:
          row_data['city'] = item_['city']
        except:
          row_data['city'] = "none"
        try:
          row_data['date'] = item_['date']
        except:
          row_data['date'] = "none"
        try:
          row_data['raw_text'] = item_['ocr_eng']
        except:
          row_data['raw_text'] = 'none'
    data.append(row_data)

page 1 status code: 200
page 2 status code: 200


In [25]:
# put data into DataFrame
df = pd.DataFrame.from_dict(data)

In [26]:
df.head()

Unnamed: 0,title,city,date,raw_text
0,southern jewish weekly.,[Jacksonville],19560113,"-s*\nFriday, January 13, 1956\ns\nThe south Fl..."
1,evening star.,[Washington],19280630,"' WILLIAM F. BRODT,!\nHATMAKER, IS DEAD\n, Fou..."


### Change date format
Pandas allows us to clean and edit our data easily (relatively). We can first convert the string values in the date column to properly formated dates and then sort the dataframe by date.

In [27]:
# convert date column from string to date-time object
df['date'] = pd.to_datetime(df['date'])

In [28]:
df.head()

Unnamed: 0,title,city,date,raw_text
0,southern jewish weekly.,[Jacksonville],1956-01-13,"-s*\nFriday, January 13, 1956\ns\nThe south Fl..."
1,evening star.,[Washington],1928-06-30,"' WILLIAM F. BRODT,!\nHATMAKER, IS DEAD\n, Fou..."


In [29]:
# sort by date
df = df.sort_values(by='date')

In [30]:
df.head()

Unnamed: 0,title,city,date,raw_text
1,evening star.,[Washington],1928-06-30,"' WILLIAM F. BRODT,!\nHATMAKER, IS DEAD\n, Fou..."
0,southern jewish weekly.,[Jacksonville],1956-01-13,"-s*\nFriday, January 13, 1956\ns\nThe south Fl..."


### Process text
We can now porcess our text for analysis. The text provded by Chronicling America comes from optical character recognition (ocr) and the accuracy of ocr can be low. Here I will remove new line characters (`\n`), stop words, and then lemamtize the text.

**Rememeber** the decisions you make in how to process your text should be based on the kind of analysis you want to do.

In [34]:
# write function to process text
# load nlp model
! python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner', 'parser')  # these are unnecessary for the task at hand

def process_text(text):
    """Remove new line characters and lemmatize text. Returns string of lemmas"""
    text = text.replace('\n', ' ')
    doc = nlp(text)
    tokens = [token for token in doc]
    no_stops = [token for token in tokens if not token.is_stop]
    no_punct = [token for token in no_stops if token.is_alpha]
    lemmas = [token.lemma_ for token in no_punct]
    lemmas_lower = [lemma.lower() for lemma in lemmas]
    lemmas_string = ' '.join(lemmas_lower)
    return lemmas_string

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m12.2 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hCollecting spacy<3.6.0,>=3.5.0 (from en-core-web-sm==3.5.0)
  Downloading spacy-3.5.4-cp311-cp311-macosx_10_9_x86_64.whl (6.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.8/6.8 MB[0m [31m11.3 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting thinc<8.2.0,>=8.1.8 (from spacy<3.6.0,>=3.5.0->en-core-web-sm==3.5.0)
  Downloading thinc-8.1.12-cp311-cp311-macosx_10_9_x86_64.whl (858 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m858.7/858.7 kB[0m [31m10.7 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4 (from spacy<3.6.0,>=3.5.0->en-core-web-sm==3.5.0)
  Downloading pydantic

In [35]:
# apply process_text function
# this may take a few minutes
df['lemmas'] = df['raw_text'].apply(process_text)

In [38]:
# save to csv
df.to_csv(f'../PythonExercises/{search_term}{start_date}-{end_date}.csv', index=False)