# Scraping URLs of Victim Name Search Hits

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from tqdm import tqdm
import os

## 1) Overview

The following steps show how I scraped the digitized newspaper pages from Chronicling America (Chron Am) that were flagged as containing victim names (see 02_pull_search_results.ipynb). With over 450,000 pages to scrape, this part of the process took a very long time.

Note that this notebook needs to be run repeatedly over many days. It will not compile all of the data included in this project in one sitting.
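
A rough lower bound on the runtime follows from the page count and the four-second pause that the scraping function in this notebook applies after each request. The sketch below assumes exactly 450,000 pages and ignores network latency, retries, and any one-hour waits after a 429 error, so the real figure will be higher. The variable names are mine.

```python
# Back-of-the-envelope runtime estimate (assumed figures, see the note above)
pages = 450_000        # approximate page count quoted in this notebook
delay_seconds = 4      # pause after each request in scrape_carefully

total_seconds = pages * delay_seconds
total_hours = total_seconds / 3600
total_days = total_hours / 24

print(f'{total_hours:.0f} hours, or about {total_days:.1f} days of continuous scraping')
# 500 hours, or about 20.8 days of continuous scraping
```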

## 2) Define Scraping Function

This scrape function is set to work within Chron Am's rate limits as they were defined in late 2024. That is, its sleep times are set to four seconds after each request and one hour if you receive a 429 error (the status code Chron Am returns when you are sending too many requests).

Before running this function, check Chron Am's most recent rate limits: [https://www.loc.gov/apis/json-and-yaml/working-within-limits/](https://www.loc.gov/apis/json-and-yaml/working-within-limits/). The four-second delays in this function are based on the "Newspapers endpoint" burst limit of 20 requests per minute. Because every attempt is followed by a pause of at least four seconds, the function sends at most 15 requests per minute, which keeps it under that limit. It is therefore important to check whether the "Newspapers endpoint" rate limit has changed before you run this code. If the limit has been lowered, you may need to increase the time.sleep values in the function.
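
The arithmetic behind the four-second delay can be checked directly. This is a small sketch with hypothetical helper names (`min_delay_seconds` and `requests_per_minute`). The figure of 20 requests per minute is the late-2024 burst limit quoted above and should be replaced with whatever the limits page currently lists.

```python
def min_delay_seconds(max_requests, window_seconds):
    """Smallest fixed pause that keeps evenly spaced requests at or under the limit."""
    return window_seconds / max_requests


def requests_per_minute(delay_seconds):
    """Upper bound on requests per minute when pausing delay_seconds after each one."""
    return 60 / delay_seconds


burst_limit = 20   # requests allowed per window (late-2024 value, may have changed)
window = 60        # window length in seconds

print(min_delay_seconds(burst_limit, window))  # 3.0 seconds is the bare minimum
print(requests_per_minute(4))                  # 15.0 requests per minute with a 4 s pause
```

A three-second pause would sit exactly on the limit, so the extra second leaves some headroom.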

In [None]:
def scrape_carefully(url, retries=3):
    """Fetch url politely, pausing after every attempt. Returns None if all retries fail."""
    for _ in range(retries):
        try:
            # the 30-second timeout is an arbitrary choice that keeps a stalled
            # connection from hanging the whole run
            response = requests.get(url, timeout=30)
            
            if response.status_code == 200:
                time.sleep(4)  # sleep time may need adjustment based on changes to the rate limits
                return response
            
            elif response.status_code == 429:
                print('Received 429 error. Sorry Chron Am! Waiting one hour before retrying.')
                time.sleep(3600)  # sleep time may need adjustment based on changes to the rate limits

            else:
                time.sleep(4)  # sleep time may need adjustment based on changes to the rate limits
            
        except requests.RequestException:
            # connection errors, timeouts, and malformed URLs all land here
            time.sleep(4)  # sleep time may need adjustment based on changes to the rate limits

    return None
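
To see how the retry logic behaves without sending any requests to Chron Am, the same control flow can be exercised against a fake fetcher. The sketch below is not part of the original pipeline: `retry_with_delays`, `FakeResponse`, and the injected `fetch` and `sleep` arguments are hypothetical stand-ins for `scrape_carefully`, `requests.get`, and `time.sleep`.

```python
class FakeResponse:
    """Minimal stand-in for requests.Response; only status_code is needed here."""

    def __init__(self, status_code):
        self.status_code = status_code


def retry_with_delays(url, fetch, sleep, retries=3, delay=4, backoff=3600):
    """Mirror of scrape_carefully's control flow with injectable fetch and sleep."""
    for _ in range(retries):
        try:
            response = fetch(url)
            if response.status_code == 200:
                sleep(delay)
                return response
            # a 429 means the rate limit was hit, so wait much longer
            sleep(backoff if response.status_code == 429 else delay)
        except Exception:
            sleep(delay)
    return None


# Simulate a 429 followed by a success, recording the pauses instead of waiting
statuses = iter([429, 200])
pauses = []
result = retry_with_delays(
    'https://example.com/page',  # placeholder URL, never actually requested
    fetch=lambda url: FakeResponse(next(statuses)),
    sleep=pauses.append,
)
print(result.status_code, pauses)  # 200 [3600, 4]
```

Passing `pauses.append` in place of a real sleep records each wait, so the one-hour back-off can be confirmed in an instant.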

## 2) Point to Directory

Since this notebook is meant to be run numerous times, you'll need to change the directory each time. In 02_pull_search_results.ipynb, I split the data into manageable chunks of 4,000 to 5,000 search hits. Whenever you run this notebook, change the label (e.g. chunk_1, chunk_2, chunk_50, chunk_96) so that it runs on a different chunk of the data each time.

In [None]:
directory = 'name_clusters/chunk_96'

## 3) Run the Scraping Instance



In [None]:
# collect the csv files in the given directory, skipping anything else
# (hidden files, checkpoint folders); assumes the chunk files end in .csv
csv_files = sorted(f for f in os.listdir(directory) if f.endswith('.csv'))

# this little loop counts the rows in all of those files
# using this information it's possible to make a progress bar with tqdm
total_rows = 0
for filename in csv_files:
    file_path = os.path.join(directory, filename)

    df = pd.read_csv(file_path)
    total_rows += len(df)

# a loop that iterates over each row in each csv file and pulls the text
# from the urls in the 'url' column. It relies on BeautifulSoup and the
# scrape_carefully function defined above. It puts the scraped text into
# a new column called 'text' and writes the result back to the same file.
with tqdm(total=total_rows, desc='Scraping progress', unit='row') as pbar:
    for filename in csv_files:
        file_path = os.path.join(directory, filename)
        
        df = pd.read_csv(file_path)
        newspaper_content = []

        for url in df['url']:
            try:
                response = scrape_carefully(url)

                if response is not None and response.status_code == 200:
                    soup = BeautifulSoup(response.text, 'html.parser')

                    # the page text is taken from every <p> tag, joined with spaces
                    p_tags = soup.find_all('p')
                    p_text = ' '.join([tag.get_text(strip=True) for tag in p_tags])

                    newspaper_content.append(p_text)

                else:
                    # failed requests are recorded as missing values
                    newspaper_content.append(None)

            except Exception as e:
                print(f'Error scraping {url}: {e}')
                newspaper_content.append(None)

            pbar.update(1)

        df['text'] = newspaper_content
        df.to_csv(file_path, index=False)
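
After a run finishes, it can help to check how many rows actually received text, since failed requests are stored as empty values. The following is a minimal sketch, not part of the original workflow: `count_missing_text` is a hypothetical helper, it uses the standard library's csv module rather than pandas, and the demonstration runs on a throwaway temporary directory with made-up rows rather than a real chunk.

```python
import csv
import os
import tempfile

# scraped page text can be long, so raise the csv module's default field size limit
csv.field_size_limit(10**9)


def count_missing_text(directory):
    """Return {filename: (rows_with_text, rows_without_text)} for each csv file."""
    summary = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith('.csv'):
            continue  # skip anything that is not a csv file
        path = os.path.join(directory, filename)
        with open(path, newline='', encoding='utf-8') as f:
            rows = list(csv.DictReader(f))
        # an empty or absent 'text' value counts as missing
        missing = sum(1 for row in rows if not row.get('text'))
        summary[filename] = (len(rows) - missing, missing)
    return summary


# Demonstration on made-up data in a temporary directory
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, 'demo.csv')
    with open(demo_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['url', 'text'])
        writer.writerow(['https://example.com/a', 'some page text'])
        writer.writerow(['https://example.com/b', ''])
    demo_summary = count_missing_text(demo_dir)

print(demo_summary)  # {'demo.csv': (1, 1)}
```

Calling `count_missing_text(directory)` on a finished chunk would then show which files still contain rows worth retrying.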

## 4) Record Progress and Repeat

You can record your progress any way you like. I chose to type out each chunk's label as soon as I finished it.

Chunks completed: 

chunk_1, chunk_2, chunk_3, chunk_4, chunk_5, chunk_6, chunk_7, chunk_8, chunk_9, chunk_10, chunk_11, chunk_12, chunk_13, chunk_14, chunk_15, chunk_16, chunk_17, chunk_18, chunk_19, chunk_20, chunk_21, chunk_22, chunk_23, chunk_24, chunk_25, chunk_26, chunk_27, chunk_28, chunk_29, chunk_30, chunk_31, chunk_32, chunk_33, chunk_34, chunk_35, chunk_36, chunk_37, chunk_38, chunk_39, chunk_40, chunk_41, chunk_42, chunk_43, chunk_44, chunk_45, chunk_46, chunk_47, chunk_48, chunk_49, chunk_50, chunk_51, chunk_52, chunk_53, chunk_54, chunk_55, chunk_56, chunk_57, chunk_58, chunk_59, chunk_60, chunk_61, chunk_62, chunk_63, chunk_64, chunk_65, chunk_66, chunk_67, chunk_68, chunk_69, chunk_70, chunk_71, chunk_72, chunk_73, chunk_74, chunk_75, chunk_76, chunk_77, chunk_78, chunk_79, chunk_80, chunk_81, chunk_82, chunk_83, chunk_84, chunk_85, chunk_86, chunk_87, chunk_88, chunk_89, chunk_90, chunk_91, chunk_92, chunk_93, chunk_94, chunk_95, chunk_96
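
If typing the list by hand becomes tedious, the same bookkeeping can be done in code. This is a minimal sketch rather than what was done for this project: `remaining_chunks` is a hypothetical helper, and the completed list and the total of six chunks are example values only.

```python
def remaining_chunks(completed, total_chunks):
    """Return the labels from chunk_1 to chunk_N that are not yet in completed."""
    done = set(completed)
    labels = [f'chunk_{n}' for n in range(1, total_chunks + 1)]
    return [label for label in labels if label not in done]


completed = ['chunk_1', 'chunk_2', 'chunk_4']  # example values only
todo = remaining_chunks(completed, total_chunks=6)
print(todo)  # ['chunk_3', 'chunk_5', 'chunk_6']

# point the notebook at the next unfinished chunk, if there is one
if todo:
    directory = f'name_clusters/{todo[0]}'
    print(directory)  # name_clusters/chunk_3
```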