# Building Chronicling America Search Keys

In [None]:
import requests
import pandas as pd
import time
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup
import re

### 1) Overview

The following notebook shows how I compiled the metadata for Chronicling America's newspapers and batch files. These metadata can be viewed here:

- OCR Batch file data: [https://github.com/MatthewKollmer/chron_am_backup/blob/main/ocr_batches.csv](https://github.com/MatthewKollmer/chron_am_backup/blob/main/ocr_batches.csv)
- Chronicling America newspaper data: [https://github.com/MatthewKollmer/chron_am_backup/blob/main/newspapers.csv](https://github.com/MatthewKollmer/chron_am_backup/blob/main/newspapers.csv)

In further steps, I plan to use these dataframes to make Chron Am searchable on my local device.

### 2) Pulling OCR Batch Metadata

Before working in Python, I copied and pasted the contents of [this Chron Am batch directory](https://chroniclingamerica.loc.gov/data/ocr/) into BBEdit. Then I used regular expressions to replace the spaces between fields with commas, creating a csv file called 'ocr_batches.csv' (a critical step for my [pull_tarbiz2_files.ipynb notebook](https://github.com/MatthewKollmer/chron_am_backup/blob/main/pull_tarbiz2_files.ipynb)). This gave me a reference list for downloading the tarball files that contain Chron Am's OCR.
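
The same conversion could also be done in Python rather than BBEdit. Here is a minimal sketch, assuming each listing line holds whitespace-separated fields. The sample lines and the `modified` and `size` column names are invented for illustration; only `file_name` is a column this notebook actually relies on.

```python
import re

# Hypothetical listing lines; the real directory page has its own columns.
listing = """\
batch_example_one_ver01.tar.bz2   2020-01-01   1000
batch_example_two_ver01.tar.bz2   2020-01-02   2000
"""

# Assumed header: 'modified' and 'size' are placeholder names.
header = 'file_name,modified,size'

# Collapse each run of whitespace into a single comma, skipping blank lines.
rows = [re.sub(r'\s+', ',', line.strip()) for line in listing.splitlines() if line.strip()]

csv_text = '\n'.join([header] + rows) + '\n'
print(csv_text)
```

Writing `csv_text` to 'ocr_batches.csv' would give a file that `pd.read_csv` can load, provided no field contains a space or comma of its own.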

But I also needed to track which newspapers and years each tarball contains, so I scraped Chron Am's batch description pages as well. The code in this section requests each batch's HTML page, reads its 'Reels in Batch' table, and saves each newspaper's title and year range in a dictionary keyed by its sn_code.
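
To make the parsing step concrete, here is a minimal sketch of how the two regular expressions used in the scraper behave on sample cell text. The sample strings are invented for illustration, and the real batch pages may format dates and titles differently.

```python
import re

year_range_regex = re.compile(r'(\d{4}) -.*(\d{4})')
sn_code_regex = re.compile(r'\((sn?\d+|\d+)\)')

# Hypothetical cell text, not copied from a real batch page.
sample_dates = 'Jan. 1, 1900 - Dec. 31, 1903'
sample_title = 'The Example Gazette (sn12345678)'

# The first group is the year right before ' -'; the greedy '.*' pushes
# the second group to the last four-digit run in the string.
year_match = year_range_regex.search(sample_dates)
years = list(range(int(year_match.group(1)), int(year_match.group(2)) + 1))

# The code in parentheses becomes the key; the rest is the title.
sn_match = sn_code_regex.search(sample_title)
sn_code = sn_match.group(1)
title = sample_title.replace(f'({sn_code})', '').strip()

entry = {sn_code: {'newspaper_title': title, 'years': years}}
print(entry)
```

Each reel row produces one such single-key dictionary per newspaper link, which is the shape stored in the 'contents' column.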

In [None]:
batches = pd.read_csv('ocr_batches.csv') # online at https://raw.githubusercontent.com/MatthewKollmer/chron_am_backup/refs/heads/main/ocr_batches.csv 
batches['batch_url'] = 'https://chroniclingamerica.loc.gov/batches/' + batches['file_name'].str.replace('.tar.bz2', '', regex=False) + '/'
batches['contents'] = None
batches.head()

In [None]:
year_range_regex = re.compile(r'(\d{4}) -.*(\d{4})')
sn_code_regex = re.compile(r'\((sn?\d+|\d+)\)')

def pull_batch_metadata(batch_url: str, retries: int = 1) -> list[dict]:
    """Scrape one batch page into a list of {sn_code: {'newspaper_title', 'years'}} dictionaries."""
    metadata_entries: list[dict] = []
    with requests.get(batch_url, timeout=30) as response:
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            header_tag = soup.find('h4', string='Reels in Batch')
            data_table = header_tag.find_next('table') if header_tag else None
            if data_table is None:
                print(f'No reel table found at {batch_url}')
                return metadata_entries

            for row in data_table.find_all('tr')[1:]:  # skip the header row
                table_cells = row.find_all('td')
                if len(table_cells) < 3:
                    continue  # not a full reel row

                date_text = table_cells[2].get_text(' ', strip=True)
                year_match = year_range_regex.search(date_text)
                if year_match is None:
                    continue  # no recognizable year range in this row
                start_year = int(year_match.group(1))
                end_year = int(year_match.group(2))
                full_years = list(range(start_year, end_year + 1))

                for title_link in table_cells[1].find_all('a'):
                    title_text = title_link.get_text(' ', strip=True)
                    sn_match = sn_code_regex.search(title_text)
                    if sn_match is None:
                        continue  # link text carries no sn_code
                    sn_code = sn_match.group(1)
                    newspaper_title = title_text.replace(f'({sn_code})', '').strip()
                    metadata_entries.append({sn_code: {'newspaper_title': newspaper_title, 'years': full_years}})

            time.sleep(2) # this seems to work when calling these html pages. If you get 429 errors, add to the delay.

        elif response.status_code == 429:
            print('Received 429 error. Sorry Chron Am! Waiting 1 hour before retrying.')
            time.sleep(3600)
            if retries > 0:
                # actually retry after the wait instead of returning an empty list
                return pull_batch_metadata(batch_url, retries - 1)

        else:
            print(f'Sumpin went wrong at {batch_url}: {response.status_code}')
            time.sleep(5)

    return metadata_entries

# Wrapping the iterator lets tqdm advance and close the progress bar on its own.
for index, batch_row in tqdm(batches.iterrows(), total=len(batches), desc='Metadata scraped', unit='batch', mininterval=1.0):
    batches.at[index, 'contents'] = pull_batch_metadata(batch_row['batch_url'])
    
batches

### 3) Merging sn_code Dictionaries

Some newspapers are split across several parts of a batch file (groupings that Chron Am calls 'reels'), so the scrape can return more than one dictionary for the same sn_code. To account for this, I merged the dictionaries by sn_code, so each tarball has one entry per newspaper with a sorted, deduplicated list of all the years it covers.
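
As a small worked example of that merge, here is a condensed sketch run on made-up reel entries. The sn codes and titles are invented, and this variant collects years in a set rather than extending a list, but the result has the same shape as the function in the next cell.

```python
def merge_by_sn_code(contents):
    """Combine entries that share an sn_code into one entry with sorted, unique years."""
    merged = {}
    for entry in contents:
        sn_code, info = next(iter(entry.items()))
        slot = merged.setdefault(
            sn_code, {'newspaper_title': info['newspaper_title'], 'years': set()}
        )
        slot['years'].update(info['years'])
    return [
        {sn: {'newspaper_title': data['newspaper_title'], 'years': sorted(data['years'])}}
        for sn, data in merged.items()
    ]

# Hypothetical reels: the first newspaper appears on two reels with an overlapping year.
reels = [
    {'sn00000001': {'newspaper_title': 'The Example Gazette', 'years': [1900, 1901]}},
    {'sn00000001': {'newspaper_title': 'The Example Gazette', 'years': [1901, 1902]}},
    {'sn00000002': {'newspaper_title': 'Example Weekly', 'years': [1895]}},
]

merged = merge_by_sn_code(reels)
print(merged)
```

The two reels for the first newspaper collapse into a single entry covering 1900 through 1902, with the overlapping year counted once.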

In [None]:
def merge_dictionaries_by_sn_code(contents):
    dictionary_key: dict[str, dict] = {}
    for single_dictionary in contents:
        sn_code, info = next(iter(single_dictionary.items()))
        if sn_code in dictionary_key:
            dictionary_key[sn_code]['years'].extend(info['years'])

        else:
            dictionary_key[sn_code] = {'newspaper_title': info['newspaper_title'], 'years': info['years'].copy()}

    for sn_code in dictionary_key:
        dictionary_key[sn_code]['years'] = sorted(set(dictionary_key[sn_code]['years']))

    return [{sn: data} for sn, data in dictionary_key.items()]

batches['contents'] = batches['contents'].apply(merge_dictionaries_by_sn_code)

batches

In [None]:
batches.to_csv('ocr_batches.csv', index=False)

### 4) Enriching newspapers.csv

I also wanted a version of the data where each row represents a newspaper/sn_code rather than a tarball file. I started with [Chron Am's own file of newspapers](https://chroniclingamerica.loc.gov/newspapers.txt), downloading it and then converting it to a csv file in BBEdit. Then I cross-referenced this file's 'LCCN' column with the dictionaries in 'ocr_batches.csv', so each newspaper row lists the tarballs that contain it and the years each tarball covers.
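
The cross-reference is an inversion of the batch data: from "tarball to newspapers" into "newspaper to tarballs". Here is a minimal sketch with plain Python structures standing in for the dataframes; the file names, sn codes, and titles are all invented.

```python
# Hypothetical stand-in for rows of the batches dataframe.
batch_rows = [
    {'file_name': 'batch_example_one_ver01.tar.bz2',
     'contents': [{'sn00000001': {'newspaper_title': 'The Example Gazette', 'years': [1900, 1901]}}]},
    {'file_name': 'batch_example_two_ver01.tar.bz2',
     'contents': [{'sn00000001': {'newspaper_title': 'The Example Gazette', 'years': [1902]}},
                  {'sn00000002': {'newspaper_title': 'Example Weekly', 'years': [1895]}}]},
]

# Invert the mapping: sn_code -> list of tarballs that hold it.
sn_to_tarfiles = {}
for batch in batch_rows:
    for entry in batch['contents']:
        sn_code, info = next(iter(entry.items()))
        sn_to_tarfiles.setdefault(sn_code, []).append(
            {'file_name': batch['file_name'], 'years': info['years']}
        )

# Hypothetical LCCN values; the last one appears in no batch.
lccns = ['sn00000001', 'sn00000002', 'sn00000003']
tarfiles = {code: sn_to_tarfiles.get(code, []) for code in lccns}
print(tarfiles)
```

A newspaper that appears in no batch gets an empty list rather than a missing value. One thing to watch: if `batches` is reloaded from 'ocr_batches.csv' instead of being kept in memory, pandas reads the 'contents' column back as plain strings, so it would need to be parsed first (for example with `ast.literal_eval`).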

In [None]:
newspapers = pd.read_csv('newspapers.csv') # online at https://raw.githubusercontent.com/MatthewKollmer/chron_am_backup/refs/heads/main/newspapers.csv

sn_to_tarfiles = {}

for row, batch_row in batches.iterrows():
    tarball_name = batch_row['file_name']
    for entry in batch_row['contents']:
        sn_code, info = next(iter(entry.items()))
        sn_to_tarfiles.setdefault(sn_code, []).append({'file_name': tarball_name, 'years': info['years']})

newspapers['tarfiles'] = newspapers['LCCN'].map(lambda code: sn_to_tarfiles.get(code, []))

newspapers

In [None]:
newspapers.to_csv('newspapers.csv', index=False)