# Scraping ChronAm Search Results for Relevant OCR URLs and Scraping Those URLs for Data

To use this notebook, first run your search in ChronAm (Chronicling America) in your web browser and copy the resulting search URL. The notebook then works in two parts: it scrapes the search results for the URL of every newspaper page that matches your search, and it scrapes the OCR text of each of those pages into a dataframe. The code is written to work within ChronAm's rate limits: https://www.loc.gov/apis/json-and-yaml/working-within-limits/#rate-limits. This code was partially written with ChatGPT (GPT-4o).

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import time
from datetime import datetime, timedelta

### 1) Scrape search results to get all pages with relevant results

Copy your search URL from the browser and paste it in as the value of search_url.

Then note two parameters in your search URL: rows and page. rows is the number of search results displayed per page. It appears in the URL as "rows=". Replace its value with 1000, which is the maximum number of results allowed per search according to ChronAm's rate limits. page is the number of the results page. It appears in the URL as "page=". Change its value to {}. This is a placeholder that lets us fill in each results page number in turn.

On that note, also look at the search results page in your browser with rows set to 1000. How many "Results" pages does it list? The loop below uses Python's range, which stops before its upper bound, so the upper bound you need is that number plus one.
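The URL edits described above can also be made in code. The following is a minimal sketch rather than part of the original workflow: make_search_template is a hypothetical helper name, the rows=20 and page=1 values in the sample URL are assumed, and the sketch assumes the URL copied from the browser carries its parameters in an ordinary query string.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def make_search_template(browser_url, rows=1000):
    """Return browser_url with rows set to `rows` and page replaced by a {} placeholder."""
    parts = urlsplit(browser_url)
    # Keep every parameter except rows and page, which are rewritten below
    params = [(key, value)
              for key, value in parse_qsl(parts.query, keep_blank_values=True)
              if key not in ("rows", "page")]
    params.append(("rows", str(rows)))
    # urlencode would percent-encode the braces, so the placeholder is appended by hand
    query = urlencode(params) + "&page={}"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

# A URL as it might be copied from the browser (rows=20 and page=1 are assumed values)
browser_url = ('https://chroniclingamerica.loc.gov/search/pages/results/list/'
               '?date1=1800&rows=20&searchType=basic&state=Minnesota&date2=1899'
               '&proxtext=star+spangled+banner&dateFilterType=yearRange&page=1&sort=date')
template = make_search_template(browser_url)
print(template.format(1))

# range() stops before its upper bound, so 2 results pages need range(1, 3)
results_pages = 2
page_numbers = list(range(1, results_pages + 1))
print(page_numbers)
```

Editing the URL by hand, as described above, gives the same kind of template; the helper only saves a step when you run many different searches.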

In [2]:
# search URL to scrape, with a placeholder for the page number and max rows set to 1000
search_url = 'https://chroniclingamerica.loc.gov/search/pages/results/list/?date1=1800&rows=1000&searchType=basic&state=Minnesota&date2=1899&proxtext=star+spangled+banner&y=0&x=0&dateFilterType=yearRange&page={}&sort=date'

We'll also need a regex pattern for the newspaper page URLs that appear in the search results. ChronAm URLs that point to a specific newspaper page follow a consistent pattern: /lccn/sn_code/year-month-day/edition/sequence(page)/. The compiled regex saved as page_pattern matches any URL of this form.

We'll also need to store the scraped URLs and their accompanying titles in a list, so we create an empty list called scrape_results.

In [3]:
# Regex pattern for ChronAm urls for specific newspaper pages
page_pattern = re.compile(r'/lccn/sn\d+/\d{4}-\d{2}-\d{2}/ed-\d/seq-\d+/')

# a shell for our scraped results
scrape_results = []
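To see what the pattern keeps and what it discards, here is a small sketch that can be run without contacting ChronAm. The helper name to_ocr_url is hypothetical, and both sample hrefs are made up for illustration (the LCCN and the trailing search terms are not real values).

```python
import re

# Same pattern as page_pattern above, repeated so this example runs on its own
page_pattern = re.compile(r'/lccn/sn\d+/\d{4}-\d{2}-\d{2}/ed-\d/seq-\d+/')

def to_ocr_url(href):
    """Return the OCR URL for a newspaper-page href, or None if href does not match."""
    match = page_pattern.search(href)
    if match is None:
        return None
    # Keep only the matched part and add 'ocr/' to reach the plain-text page
    return f"https://chroniclingamerica.loc.gov{match.group()}ocr/"

# Hypothetical hrefs: the LCCN and the trailing search terms are invented
page_href = '/lccn/sn00000000/1854-07-06/ed-1/seq-2/;words=banner'
other_href = '/lccn/sn00000000/issues/'

print(to_ocr_url(page_href))
print(to_ocr_url(other_href))
```

The first href matches and loses its trailing search terms; the second does not point to a specific page, so nothing is returned for it.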

Next we'll need a function that sends requests to ChronAm's server and retrieves content from the search results. This function is written around ChronAm's rate limits. To slow down bursts (the limit is 10 requests per 10 seconds), it pauses for 5 seconds after every 10 requests. If the number of requests reaches 200 within a minute, it pauses for 5 minutes. If you still get a 429 error, it falls back on exponential backoff, waiting longer each time a retry gets another 429, and pauses for 5 minutes once the retries run out. This is the fastest way I can think of to work within ChronAm's rate limits.
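For comparison, a burst limit of 10 requests per 10 seconds can also be enforced by remembering the time of each recent request. The sketch below is an alternative technique, a sliding window, and is not what the function in this notebook does. SlidingWindowLimiter and its parameters are hypothetical names, and the clock and sleep arguments exist only so the class can be tried out without real waiting.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_requests calls to wait() in any window_seconds span."""

    def __init__(self, max_requests=10, window_seconds=10,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.sleep = sleep
        self.timestamps = deque()  # times of the most recent requests, oldest first

    def _drop_expired(self, now):
        # Forget requests that have left the window
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()

    def wait(self):
        """Block until another request is allowed, then record it."""
        while True:
            now = self.clock()
            self._drop_expired(now)
            if len(self.timestamps) < self.max_requests:
                self.timestamps.append(now)
                return
            # Sleep until the oldest remembered request leaves the window
            self.sleep(self.window_seconds - (now - self.timestamps[0]))

limiter = SlidingWindowLimiter(max_requests=10, window_seconds=10)
# Under this approach, limiter.wait() would be called right before each requests.get(url)
```

The counter-based pauses used in this notebook are simpler to follow; a sliding window only matters if requests return fast enough that ten of them plus a 5-second pause fit well inside 10 seconds.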

In [4]:
# Function that sends requests with retry logic for handling 429 errors
def send_request_with_retry(url, retries=2, backoff_factor=2):
    global request_count, first_request_time
    
    # If this is the first request in the current window, record the current time
    if request_count == 0:
        first_request_time = datetime.now()
    
    # Check ChronAm's crawl limit (200 requests per 1 minute)
    if request_count >= 200:
        elapsed_time = datetime.now() - first_request_time
        if elapsed_time < timedelta(minutes=1):
            print("Crawl limit reached. Waiting for 5 minutes...")
            time.sleep(300)  # Wait for 5 minutes
        # Start a new window either way. Without this, the counter is never reset
        # once 200 requests have taken longer than a minute, and the check stops working.
        first_request_time = datetime.now()
        request_count = 0
    
    # Check for ChronAm's burst limit (10 requests per 10 seconds)
    if request_count > 0 and request_count % 10 == 0:
        print("Burst limit reached. Waiting for 5 seconds...")
        time.sleep(5)
    
    for i in range(retries):
        response = requests.get(url)
        
        # counter for the request count goes up
        request_count += 1
        print(f"Requests made: {request_count}")
        
        # if you get response code 200, hooray! The request was successful
        if response.status_code == 200:
            return response
        
        # if you get a 429 error, you're notified and the function waits; the wait grows with each retry
        elif response.status_code == 429:
            print(f"Rate limit exceeded. Received 429 error. Retrying in {backoff_factor**i} seconds...")
            time.sleep(backoff_factor ** i)  # Exponential backoff
            
            # If the last retry also gets a 429 error, pause for 5 minutes before giving up on this url
            if i == retries - 1:
                print("Too many retries. Waiting for 5 minutes...")
                time.sleep(300)
                
        else:
            print(f"Failed to retrieve {url}: {response.status_code}")
            return None
    return None

# Before you run this function, you'll need to set these two initial variables
# request_count is the counter for requests to ChronAm's server
# first_request_time becomes the moment you run the program
request_count = 0
first_request_time = None
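The backoff waits in the function are computed as backoff_factor ** i, where i counts the attempts starting from 0. The short sketch below only lists those waits so you can see how they grow; backoff_delays is a hypothetical helper name and is not used elsewhere in this notebook.

```python
def backoff_delays(retries=2, backoff_factor=2):
    """List the seconds waited after each consecutive 429 response."""
    return [backoff_factor ** attempt for attempt in range(retries)]

# With the defaults used by send_request_with_retry
print(backoff_delays())

# With more retries the waits keep doubling
print(backoff_delays(retries=5))
```

With the default retries=2, the function therefore waits only 1 second and then 2 seconds before the final five-minute pause. Raising retries gives the server more time to recover before a url is given up on.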

Okay, now that we have the function send_request_with_retry, we need to iterate over the pages of search results. How many Results pages are listed in your search? Use that number plus one as the end of your range, because range stops before its upper bound. We also need to extract the relevant URLs from the scraped HTML. The following code does both.

In [5]:
# to keep track of how long the program takes to run
start_time = datetime.now()

# Iterate over the pages of your search results
for results_number in range(1, 3):
    # change the page= part of your url with the different numbers in your range
    url = search_url.format(results_number)
    
    # use our function on the iterating versions of your url to get regex matches
    scrape_content = send_request_with_retry(url)
    
    # Check whether the request succeeded. If it failed after all retries, that page is skipped.
    # No sense in letting one broken link break your entire run.
    if scrape_content is None:
        print(f"Skipping page {results_number} due to repeated errors.")
        continue
    
    # Parse the HTML content from scrape_content using BeautifulSoup
    soup = BeautifulSoup(scrape_content.text, 'html.parser')
    
    # We only want the content of the <ul> tag with class 'results_list'. This contains titles and urls to ChronAm pages containing your search term
    results_list = soup.find('ul', class_='results_list')
    
    # Check if results_list was found. It's unlikely to be missing, but if a search result page has malformed HTML,
    # this skips that page instead of stopping the whole run.
    if results_list is None:
        print(f"No results found on page {results_number}")
        continue
    
    # Find all <a> tags within the results_list that match our regex pattern for ChronAm newspaper pages.
    matching_links = results_list.find_all('a', href=page_pattern)
    
    # Iterate over the matching links and put their titles and urls into our shell list scrape_results
    for link in matching_links:
        link_text = link.get_text(strip=True)
    
        # Extract the matching string from link['href']
        match = page_pattern.search(link['href'])
        if match:
            # Use only the matched part of the href. Otherwise, you get a bunch of other url parts pertaining to the search results
            matched_href = match.group()
            # add the 'ocr' part to the url so it will take you straight to "scrapeable" newspaper content in the later step
            link_href = f"https://chroniclingamerica.loc.gov{matched_href}ocr/"
            scrape_results.append({'Link Title': link_text, 'URL': link_href})

    # Optionally, print progress. I like to do this so I know how far along I'm going.
    print(f"Page {results_number} processed.")

# Calculate and print the total elapsed time. This may be useful to help you know how long it takes to scrape.
end_time = datetime.now()
total_elapsed_time = end_time - start_time
print(f"Total elapsed time: {total_elapsed_time}")

# Convert our list of page titles and urls to a DataFrame
df = pd.DataFrame(scrape_results)

Requests made: 1
Page 1 processed.
Requests made: 2
Page 2 processed.
Total elapsed time: 0:00:00.455162


Hooray! Now let's review our dataframe.

In [6]:
df

Unnamed: 0,Link Title,URL
0,"The Minnesota pioneer. [volume](St. Paul, Minn...",https://chroniclingamerica.loc.gov/lccn/sn8302...
1,"The Minnesota pioneer. [volume](St. Paul, Minn...",https://chroniclingamerica.loc.gov/lccn/sn8302...
2,"Minnesota weekly times. [volume](St. Paul, Min...",https://chroniclingamerica.loc.gov/lccn/sn8502...
3,"The weekly Minnesotian. [volume](Saint Paul, M...",https://chroniclingamerica.loc.gov/lccn/sn8301...
4,"Minnesota weekly times. [volume](St. Paul, Min...",https://chroniclingamerica.loc.gov/lccn/sn8502...
...,...,...
1431,"Warren sheaf. [volume](Warren, Marshall County...",https://chroniclingamerica.loc.gov/lccn/sn9005...
1432,"The Sauk Centre herald.(Sauk Centre, Stearns C...",https://chroniclingamerica.loc.gov/lccn/sn8906...
1433,"The Saint Paul globe.(St. Paul, Minn.), Decemb...",https://chroniclingamerica.loc.gov/lccn/sn9005...
1434,Grand Rapids herald-review. [volume](Grand Rap...,https://chroniclingamerica.loc.gov/lccn/sn8200...


### 2) Scrape the OCR URLs from step 1 to get your actual newspaper data

We now have a dataframe of ChronAm URLs that lead to the OCR pages of newspapers containing relevant search results. We can scrape those URLs with the same function, send_request_with_retry, and then add the scraped text to our dataframe.
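This step makes one request per row of df, so it helps to estimate how much deliberate waiting to expect before you start. The sketch below is a rough lower bound under stated assumptions: it counts only the 5-second burst pauses, assumes every request succeeds on the first try, and ignores network time and any crawl-limit or 429 pauses. burst_pause_seconds is a hypothetical helper name.

```python
def burst_pause_seconds(n_requests, burst_size=10, pause_seconds=5):
    """Seconds spent in burst pauses for n_requests, counting from request_count == 0."""
    if n_requests <= 0:
        return 0
    # A pause happens before a request whenever 10, 20, 30, ... requests were already made
    pauses = (n_requests - 1) // burst_size
    return pauses * pause_seconds

# 2 search result pages from step 1 plus the 1435 page URLs in df
n_requests = 2 + 1435
waiting = burst_pause_seconds(n_requests)
print(f"{waiting} seconds, about {waiting / 60:.0f} minutes")
```

The actual run will take longer than this figure, since each request also takes time to complete.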

In [7]:
# to keep track of how long the program takes to run
start_time = datetime.now()

# take the "URL" column from df and save it as urls
urls = df["URL"]

# another shell for scraped content
newspaper_content = []

# Iterate over each URL in the urls
for url in urls:
    
    try:
        # Send a request to the URL using our function that accounts for rate limits and errors
        response = send_request_with_retry(url)
        
        # Check if the request was successful. send_request_with_retry returns None on failure, so test for that first
        if response is not None and response.status_code == 200:
            # Parse the HTML content using BeautifulSoup to get just newspaper text
            soup = BeautifulSoup(response.text, 'html.parser')
            
            p_tags = soup.find_all('p')
            p_text = ' '.join([tag.get_text(strip=True) for tag in p_tags])
            
            # add the extracted text to the list
            newspaper_content.append(p_text)
            
        else:
            # If request fails, leave url instance blank
            newspaper_content.append(None)
            
    except Exception:
        # Handle any other exceptions during the scrape (for example, a dropped connection) by leaving url instance blank
        newspaper_content.append(None)

# Calculate and print the total elapsed time. This may be useful to help you know how long it takes to scrape.
end_time = datetime.now()
total_elapsed_time = end_time - start_time
print(f"Total elapsed time: {total_elapsed_time}")

# Add the scraped content as a new column in df called 'text'
df['text'] = newspaper_content

Requests made: 3
Requests made: 4
Requests made: 5
Requests made: 6
Requests made: 7
Requests made: 8
Requests made: 9
Requests made: 10
Burst limit reached. Waiting for 5 seconds...
Requests made: 11
Requests made: 12
Requests made: 13
Requests made: 14
Requests made: 15
Requests made: 16
Requests made: 17
Requests made: 18
Requests made: 19
Requests made: 20
Burst limit reached. Waiting for 5 seconds...
Requests made: 21
Requests made: 22
Requests made: 23
Requests made: 24
Requests made: 25
Requests made: 26
Requests made: 27
Requests made: 28
Requests made: 29
Requests made: 30
Burst limit reached. Waiting for 5 seconds...
Requests made: 31
Requests made: 32
Requests made: 33
Requests made: 34
Requests made: 35
Requests made: 36
Requests made: 37
Requests made: 38
Requests made: 39
Requests made: 40
Burst limit reached. Waiting for 5 seconds...
Requests made: 41
Requests made: 42
Requests made: 43
Requests made: 44
Requests made: 45
Requests made: 46
Requests made: 47
Requests made

And after some time, you'll have a df with all the pages of your search results:

In [8]:
df.head()

Unnamed: 0,Link Title,URL,text
0,"The Minnesota pioneer. [volume](St. Paul, Minn...",https://chroniclingamerica.loc.gov/lccn/sn8302...,i? <«) w IC*it. Th\«e chartuing hoes were writ...
1,"The Minnesota pioneer. [volume](St. Paul, Minn...",https://chroniclingamerica.loc.gov/lccn/sn8302...,NEWADVERTISEMENTS.FOR SALK.TTIE lurfre three s...
2,"Minnesota weekly times. [volume](St. Paul, Min...",https://chroniclingamerica.loc.gov/lccn/sn8502...,"tTlif Jpiiwstrta WkM>a (TimesNEWSON, MITCHELL ..."
3,"The weekly Minnesotian. [volume](Saint Paul, M...",https://chroniclingamerica.loc.gov/lccn/sn8301...,"THE MINNESOTIAN.THURSDAY MORNING, JULY 6,1854...."
4,"Minnesota weekly times. [volume](St. Paul, Min...",https://chroniclingamerica.loc.gov/lccn/sn8502...,"(Tbf Minnesota (Wtedilo (Times.NEWSON, MITCHEL..."


An example page:

In [9]:
df['text'][500]

'\\sfje-s.4\'I\'vibntxt.Published W ednesdaya.vjfc.0. aXEVJ2W®»Publisher.Official Paper of Storeas Gtaafa*The census year began Juno 1,1889,and will end May 81, 1890.Or. O. W. Holmes says it is better tobe 70 years young than 40 years old.At last the right word has been fotmdfor death by electricity. It is electricide.Admiral Porter had his 76th birthdayparty a few days ago. Gen. B. F. Butler sent regrets.There are at least thirty colleges thathad a perfectly quiet graduating andcommencement day. They were thedeaf mute schools.In all the national educational exhibitsat the Paris exposition the most prominent and interesting feature is the industrial and manual training display.They say that Woodruff, the Croninmurder confessor, has told at least onetruth in his multifarious stories. He hasadmitted that he is an indefatigable liar.Since President Harrison has been inoffice he has had one band of visitors whodid not annoy him. They were 126Dunkard brethren, recently. Not one ofthem wante

To save your df as a CSV file, just change the file name in the code below:

In [10]:
df.to_csv('ssb_mentions_mn_papers_1800-99.csv', index=False, encoding='utf-8')