# Scraping ChronAm OCR Files (from tar.bz2 batches)

In [None]:
import requests
import pandas as pd
import time
from tqdm.notebook import tqdm # This is best if you're working in a Jupyter notebook. Otherwise, try 'from tqdm import tqdm'
import os

### 1) Overview

The following notebook presents code for scraping the OCR-generated text files of Chronicling America. It's a step toward backing up Chronicling America and using the database locally. To replicate this process, you'll need at least 3TB of space. You will also need a solid Internet connection for 3+ days.
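Before committing to a multi-day download, it may be worth confirming that the target drive really has the room. Here is a minimal sketch using the standard library's `shutil.disk_usage`; the helper name `has_enough_space` and the decimal definition of a terabyte are my own assumptions, not part of the original workflow.

```python
import shutil

# 3 TB in decimal bytes (assumption: drive makers' definition of a terabyte)
REQUIRED_BYTES = 3 * 1000**4

def has_enough_space(path, required_bytes=REQUIRED_BYTES):
    """Return True if the drive holding `path` has at least `required_bytes` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes

# Checks the current working directory; point this at your external drive instead.
print(has_enough_space('.'))
```

If this prints `False` for your output drive, the full set of tarballs probably won't fit.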

For more info about this project, [see my blog post](https://matthewkollmer.com/how-im-backing-up-chronicling-america/) describing the motivation and the larger process.

### 2) Pulling tar.bz2 Files

All of the Chronicling America text data is stored in tarball files here: [https://chroniclingamerica.loc.gov/data/ocr/](https://chroniclingamerica.loc.gov/data/ocr/). Using this page and the [ocr_batches.csv file I created](https://github.com/MatthewKollmer/chron_am_backup/blob/main/ocr_batches.csv), the code below pulls each tarball and saves it to your chosen directory.
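As a quick dry run, you might preview the URLs that would be requested before starting the real download. The sketch below reads a `file_name` column with the standard `csv` module; the helper `build_download_urls` and the two sample file names are made up for illustration and do not come from the real ocr_batches.csv.

```python
import csv
import io

BASE_URL = 'https://chroniclingamerica.loc.gov/data/ocr/'

def build_download_urls(csv_file, base_url=BASE_URL):
    """Return one download URL per row, read from the `file_name` column."""
    return [f"{base_url}{row['file_name']}" for row in csv.DictReader(csv_file)]

# Hypothetical stand-in for ocr_batches.csv; these file names are invented.
sample_csv = io.StringIO('file_name\nexample_batch_a.tar.bz2\nexample_batch_b.tar.bz2\n')
urls = build_download_urls(sample_csv)
print(len(urls), 'files to fetch')
for url in urls:
    print(url)
```

To try it on the real list, pass `open('ocr_batches.csv', newline='')` in place of `sample_csv`.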

In [None]:
ocr_batches = pd.read_csv('ocr_batches.csv')
base_url = 'https://chroniclingamerica.loc.gov/data/ocr/'
# Be sure to set the output directory to your own external hard drive/SSD with at least 3TB of space
output_directory = 'CHANGE/TO/DIRECTORY/PATH/OF/YOUR/CHOOSING'

In [None]:
# Heads up: this code runs for a long, long time. On my device, it took about 80 hours.
# It's okay to stop and start it again: the first conditional in pull_tarbiz_files()
# ("if os.path.exists(output_path):") skips files that have already been downloaded,
# so you'll just pick up where you left off.
progress_bar = tqdm(total=len(ocr_batches), unit='file', desc='Batches Downloaded', mininterval=1.0)

def pull_tarbiz_files(file_name):
    url = f'{base_url}{file_name}'
    output_path = os.path.join(output_directory, file_name)

    if os.path.exists(output_path):
        progress_bar.update(1)
        return

    try:
        with requests.get(url, stream=True) as response:
            if response.status_code == 200:
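                # Download to a temporary '.part' file first, so an interrupted run never leaves a partial tarball that the os.path.exists() check would skip.
                temp_path = output_path + '.part'
                with open(temp_path, 'wb') as file:
                    # Write in 1 MB chunks; response.content would load the whole tarball into memory despite stream=True.
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        file.write(chunk)
                os.replace(temp_path, output_path)

                progress_bar.update(1)
                # Wait a minute between downloads to respect Chron Am's rate limits. I was still getting 429 errors with time.sleep(30), so shortening this is risky.
                time.sleep(60)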

            elif response.status_code == 429:
                print(f'Received 429 error on {file_name}. Sorry Chron Am! Pausing for 1 hour; rerun the loop later to pick this file up.')
                time.sleep(3600) # Idk if Chron Am bans IP addresses, but just in case, better to back off for an hour if you somehow get a 429 error! Note the file itself isn't retried here; the next run fetches it.

            else:
                print(f'Sumpin went wrong downloading {file_name}: {response.status_code}')
                time.sleep(5)

    except Exception as e:
        print(f'Exception issue with {file_name}: {e}')
        time.sleep(5)

for _, row in ocr_batches.iterrows():
    pull_tarbiz_files(row['file_name'])

progress_bar.close()
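Failed downloads are only printed, not retried, so after a run it can help to list which batches are still missing from the output directory. Here's a rough sketch of that check; `find_missing_batches` is a hypothetical helper of mine, and the demo uses a throwaway directory with made-up file names rather than real batches.

```python
import os
import tempfile

def find_missing_batches(expected_names, directory):
    """Return the expected file names that are not yet present in `directory`."""
    present = set(os.listdir(directory))
    return [name for name in expected_names if name not in present]

# Demo with a temporary directory and invented file names.
with tempfile.TemporaryDirectory() as demo_dir:
    # Pretend one batch has already been downloaded.
    open(os.path.join(demo_dir, 'example_batch_a.tar.bz2'), 'wb').close()
    expected = ['example_batch_a.tar.bz2', 'example_batch_b.tar.bz2']
    missing = find_missing_batches(expected, demo_dir)
    print(missing)
```

In the notebook itself, something like `find_missing_batches(ocr_batches['file_name'], output_directory)` should tell you whether another pass of the download loop is needed.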