# Scraping US National Convention Transcripts

This notebook demonstrates how to scrape and process transcripts of the 2020 Democratic and Republican national conventions from transcript pages published on Rev.com.

In [None]:
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import random
import textwrap

## Utility Functions

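We define two helpers: `tag_visible` filters out text nodes that a browser does not render (scripts, styles, comments, and similar), and `generate_filename_from_url` turns a URL into a flat filename.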

In [None]:
from bs4.element import Comment

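def tag_visible(element):
    """Return True if a text node would be rendered on the page."""
    # Skip text that sits inside non-rendered tags.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments.
    if isinstance(element, Comment):
        return False
    return True

def generate_filename_from_url(url):
    """Build a flat .txt filename from a URL."""
    if not url:
        return None
    # Drop the scheme, then turn dots, slashes, and hyphens into underscores.
    name = url.replace("https", "").replace("http", "")
    name = name.replace("://", "").replace(".", "_").replace("/", "_").replace("-", "_")
    # Remove the last underscore (the trailing one if the URL ends in "/").
    last_underscore_spot = name.rfind("_")
    name = name[:last_underscore_spot] + name[(last_underscore_spot+1):]
    name = name + ".txt"
    return name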
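
To make the naming rule concrete, the sketch below restates `generate_filename_from_url` compactly so that it runs on its own, then applies it to one transcript URL and to a hypothetical URL with a trailing slash (`http://example.com/a-b/`, which is not part of the original notebook). Note that the final step removes the last underscore wherever it falls, so a URL without a trailing slash ends up with its last two words joined.

```python
def generate_filename_from_url(url):
    # Compact restatement of the helper above, so this snippet is self-contained.
    if not url:
        return None
    name = url.replace("https", "").replace("http", "").replace("://", "")
    for ch in "./-":
        name = name.replace(ch, "_")
    last = name.rfind("_")
    return name[:last] + name[last + 1:] + ".txt"

transcript_name = generate_filename_from_url(
    "https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-night-1-transcript"
)
print(transcript_name)
# www_rev_com_blog_transcripts_democratic_national_convention_dnc_night_1transcript.txt

# Hypothetical URL with a trailing slash: only the trailing underscore is dropped.
trailing_slash_name = generate_filename_from_url("http://example.com/a-b/")
print(trailing_slash_name)
# example_com_a_b.txt
```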

In [None]:
# Define convention transcript URLs for both parties

convention_pages = {
    "democrats": [
        "https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-night-1-transcript",
        "https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-2020-night-2-transcript",
        "https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-night-3-transcript",
        "https://www.rev.com/blog/transcripts/2020-democratic-national-convention-dnc-night-4-transcript"
    ],
    "republicans": [
        "https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-1-transcript",
        "https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-2-transcript",
        "https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-3-transcript",
        "https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-4-transcript"
    ]
}

## Data Structure

- `convention_pages` is a dictionary mapping each party name (`"democrats"`, `"republicans"`) to a list of four transcript URLs, one per convention night.
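
As a small illustration of this shape, the sketch below flattens a dictionary of the same form into `(party, url)` pairs and counts the URLs per party. The `example_pages` dictionary and its `example.com` URLs are placeholders invented for this snippet, not the real transcript links.

```python
# Placeholder data with the same shape as convention_pages (hypothetical URLs).
example_pages = {
    "democrats": ["https://example.com/dnc-night-1", "https://example.com/dnc-night-2"],
    "republicans": ["https://example.com/rnc-night-1"],
}

# Flatten to (party, url) pairs, preserving insertion order.
pairs = [(party, url) for party, links in example_pages.items() for url in links]

for party, url in pairs:
    print(party, url)

# Count the URLs listed for each party.
counts = {party: len(links) for party, links in example_pages.items()}
print(counts)
```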

## Scraping and Saving Transcripts

For each transcript URL, we fetch the page, extract the visible text, and save the result as a two-column, tab-separated text file named after the URL. The loop pauses for a random 5 to 15 seconds between requests.

In [None]:
def scrape_and_save_transcripts(convention_pages):
    for party, links in convention_pages.items():
        for url in links:
            try:
                # The 30-second timeout is an assumed value; it keeps one
                # stalled request from hanging the whole run.
                r = requests.get(url, timeout=30)
                if r.status_code == 200:
                    # Keep only the text nodes a reader would see on the page.
                    soup = BeautifulSoup(r.text, 'html.parser')
                    texts = soup.find_all(string=True)
                    visible_texts = filter(tag_visible, texts)
                    page_text = " ".join(t.strip() for t in visible_texts)
                else:
                    print(f"Failed to fetch {url} (status code {r.status_code})")
                    continue
            except Exception as e:
                print(f"Error fetching {url}: {e}")
                continue

            output_file_name = generate_filename_from_url(url)
            # Replace tabs and line breaks so the text fits in a single TSV field
            page_text = page_text.replace("\t", " ").replace("\n", " ").replace("\r", " ")
            with open(output_file_name, 'w', encoding="UTF-8") as outfile:
                outfile.write("link\ttext\n")
                outfile.write(f"{url}\t{page_text}\n")
            # Pause 5 to 15 seconds before the next request
            wait_time = 5 + random() * 10
            print(f"Saved {output_file_name}. Waiting for {wait_time:.2f} seconds.")
            sleep(wait_time)

scrape_and_save_transcripts(convention_pages)

Waiting for 10.70 seconds.
Waiting for 7.50 seconds.
Waiting for 14.00 seconds.
Waiting for 5.82 seconds.
Waiting for 9.49 seconds.
Waiting for 13.10 seconds.
Waiting for 5.84 seconds.
Waiting for 12.70 seconds.
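
The whitespace cleanup matters because each output file holds a header row and a single data row separated by tabs, so a stray tab or line break in the page text would break that layout. The following sketch writes one such file and reads it back. It uses a made-up `sample_text`, a hypothetical `sample_url`, and a temporary file rather than a real transcript.

```python
import os
import tempfile

# Made-up page text containing the characters the scraper replaces.
sample_url = "https://example.com/transcript"  # hypothetical URL
sample_text = "Speaker 1:\tGood evening.\nSpeaker 2:\r\nThank you."

# Same cleanup as in scrape_and_save_transcripts: each tab, newline, or
# carriage return becomes a single space.
clean_text = sample_text.replace("\t", " ").replace("\n", " ").replace("\r", " ")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="UTF-8") as outfile:
        outfile.write("link\ttext\n")
        outfile.write(f"{sample_url}\t{clean_text}\n")

    # Read the file back and split each line into its tab-separated fields.
    with open(path, encoding="UTF-8") as infile:
        rows = [line.rstrip("\n").split("\t") for line in infile]

print(rows[0])
print(rows[1][1])
```

Without the cleanup, the embedded tab would produce a third field and the embedded line breaks would spill the text onto extra rows.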


---

## Example: Wrapping Long Text Output

When writing out long strings, it helps to wrap the text for readability. The `textwrap` module in the Python standard library makes this easy.

In [None]:
from random import choices, seed
from string import ascii_letters

In [None]:
# Generate a long string with spaces for demonstration

string_length = 50000
chars_to_sample = ascii_letters + " " * 8

seed(20200916)
text = "".join(choices(chars_to_sample, k=string_length))

Write out the text as a single line.

In [None]:
with open("text.txt", 'w') as outfile:
    outfile.write(text)

Now wrap the text into lines of at most 70 characters (the `textwrap.wrap` default) for easier reading.

In [None]:
wrapped_text = textwrap.wrap(text)

with open("text_wrapped.txt", 'w') as outfile:
    for piece in wrapped_text:
        outfile.write(piece + "\n")
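
`textwrap.wrap` also accepts a `width` argument, and `textwrap.fill` returns the wrapped pieces already joined with newlines. A minimal sketch on a short made-up sentence, with `width=40` chosen only for illustration:

```python
import textwrap

# Short made-up sample text with single spaces between words.
sample = "The quick brown fox jumps over the lazy dog. " * 5
sample = sample.strip()

# wrap() returns a list of lines, none longer than `width` characters.
lines = textwrap.wrap(sample, width=40)
print(len(lines), max(len(line) for line in lines))

# fill() is equivalent to "\n".join(wrap(...)).
filled = textwrap.fill(sample, width=40)
print(filled)
```

Using `fill` would let the file-writing loop above collapse into a single `outfile.write` call, at the cost of building the whole wrapped string in memory first.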