# Scraping Political Convention Transcripts
In this repo, we scrape the Rev transcripts of the 2020 Democratic and Republican national conventions.

In [1]:
import requests               # To get the pages
from bs4 import BeautifulSoup # and to process them

from time import sleep      # Allowing us to pause between pulls
from random import random   # And allowing that pause to be random

import textwrap             # Useful for our wrapped output
from bs4.element import Comment

The function below decides whether a piece of text pulled from the HTML would be visible on the page. It returns `False` if the text's parent tag is one that browsers do not render (`style`, `script`, `head`, `title`, `meta`, `[document]`) or if the text is an HTML comment. Otherwise it returns `True`.

In [None]:
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True
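
To make the visibility rule concrete without fetching a page, here is a minimal standard-library sketch of the same idea. It swaps BeautifulSoup for `html.parser.HTMLParser`, so it is an illustration of the rule rather than the function used in this notebook. The class name `VisibleTextParser`, the `HIDDEN_TAGS` set, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser

# Tags whose text is never shown to the reader (meta has no text, so it is omitted)
HIDDEN_TAGS = {"style", "script", "head", "title"}

class VisibleTextParser(HTMLParser):
    """Hypothetical helper: collects text that is not inside a hidden tag."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # number of hidden tags we are currently inside
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; HTMLParser routes them to handle_comment
        if self.hidden_depth == 0 and data.strip():
            self.pieces.append(data.strip())

sample_html = (
    "<html><head><title>Night 1</title><style>p {color: red}</style></head>"
    "<body><!-- hidden note --><p>Good evening.</p>"
    "<script>var x = 1;</script><p>Thank you.</p></body></html>"
)

parser = VisibleTextParser()
parser.feed(sample_html)
visible = " ".join(parser.pieces)
print(visible)
```

Only the two paragraph texts survive. The title, the stylesheet, the script, and the comment are all dropped, which is the same outcome `tag_visible` aims for when used as a filter.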

The function below builds a filename from a URL. It drops the `http://` or `https://` scheme, replaces each `.` and `/` with an underscore, strips any trailing underscore (left by a URL that ends in a slash), and appends `.txt`. It returns `None` for an empty URL.

In [132]:
def generate_filename_from_url(url) :
    """Build a flat .txt filename from a URL; return None for an empty URL."""
    if not url :
        return None
    # drop the scheme (http:// or https://) from the front only
    name = url.split('://', 1)[-1]
    
    # replace domain dots and path slashes with underscores
    name = name.replace('.','_').replace('/','_')
    
    # strip any trailing underscore left by a trailing slash
    name = name.rstrip('_')
    
    name = name + '.txt'
    
    return(name)
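
As a point of comparison, here is a minimal sketch of the same idea using `urllib.parse.urlsplit` from the standard library instead of chained string operations. The function name `filename_from_parsed_url` is hypothetical and not part of this notebook, and the sketch assumes the URL includes a scheme so that `urlsplit` can separate the host from the path.

```python
from urllib.parse import urlsplit

def filename_from_parsed_url(url):
    """Hypothetical variant: build a .txt filename from a URL's host and path."""
    if not url:
        return None
    parts = urlsplit(url)
    # Join host and path; the scheme (http/https) is dropped by not using it
    name = parts.netloc + parts.path
    # Flatten domain dots and path slashes into underscores
    name = name.replace('.', '_').replace('/', '_')
    # A URL ending in "/" would otherwise leave a trailing underscore
    return name.rstrip('_') + '.txt'

example = "https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-night-1-transcript"
print(filename_from_parsed_url(example))
# www_rev_com_blog_transcripts_democratic-national-convention-dnc-night-1-transcript.txt
```

Parsing the URL also discards any query string or fragment, which may or may not be what you want if two pages differ only in their query parameters.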
    

## Set Up Convention URL Dictionary 
The code below builds a dictionary called `convention_pages` with two keys, "democrats" and "republicans". Each key maps to a list of URLs, one for the Rev transcript of each night of that party's convention.

In [193]:
convention_pages = dict()

convention_pages["democrats"] = """
https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-night-1-transcript
https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-2020-night-2-transcript
https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-night-3-transcript
https://www.rev.com/blog/transcripts/2020-democratic-national-convention-dnc-night-4-transcript
""".split()

convention_pages["republicans"] = """
https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-1-transcript
https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-2-transcript
https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-3-transcript
https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-4-transcript
""".split()

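The triple-quoted string followed by `.split()` is a compact way to turn a pasted column of URLs into a list. A small sketch of how it behaves, using placeholder `example.com` URLs that are not part of this project:

```python
urls_block = """
https://example.com/night-1
https://example.com/night-2
"""

# split() with no arguments splits on any run of whitespace, including
# newlines, and discards the empty strings at the start and end
urls = urls_block.split()
print(urls)
# ['https://example.com/night-1', 'https://example.com/night-2']
```

This works because URLs contain no whitespace. A list of items that can contain spaces would need `splitlines()` and some cleanup instead.
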

The code below loops over every link for each party in `convention_pages`, pulls the page with `requests`, and uses BeautifulSoup to extract its visible text. The text is stored in the `party_text` dictionary, keyed by link, and written with its link to a tab-separated file named after the URL. After each page, the code pauses for a random 5 to 15 seconds before continuing.

In [194]:
party_text = dict()

In [199]:
for party in convention_pages :
    for link in convention_pages[party] : 
        output_file_name = generate_filename_from_url(link)
        
        # pull the page 
        r = requests.get(link)
        
        # process the page, skipping anything that did not load
        if r.status_code == 200 :
            soup = BeautifulSoup(r.text, 'html.parser')
            texts = soup.find_all(string=True)
            visible_texts = filter(tag_visible, texts) 
            party_text[link] = " ".join(t.strip() for t in visible_texts)
            
            # keep the text on one line so the tab-separated file stays valid
            the_text = party_text[link]
            the_text = the_text.replace("\t"," ").replace("\n"," ").replace("\r"," ")
            
            # write out only this page to a file with the appropriate name
            with open(output_file_name, 'w', encoding = 'UTF-8') as outfile :
                outfile.write('\t'.join(["link","text"]) + '\n')
                if the_text :
                    outfile.write('\t'.join([link,the_text]) + '\n')
        else :
            print(f"Skipping {link}: status code {r.status_code}")
        
        # Pause for a bit
        wait_time = 5 + random()*10
        print(f"Waiting for {wait_time:.02f} seconds.")
        
        sleep(wait_time)
        

Waiting for 12.98 seconds.
Waiting for 8.33 seconds.
Waiting for 11.35 seconds.
Waiting for 5.94 seconds.
Waiting for 5.72 seconds.
Waiting for 12.90 seconds.
Waiting for 10.92 seconds.
Waiting for 12.42 seconds.
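
The pause length comes from `5 + random()*10`. Because `random()` returns a float in [0, 1), every wait falls between 5 and 15 seconds, which matches the output above. Here is a small sketch that checks this range without actually sleeping. The helper name `polite_wait_time` and its parameters are hypothetical.

```python
from random import random, seed

def polite_wait_time(minimum=5, spread=10):
    """Hypothetical helper: a random delay between minimum and minimum + spread seconds."""
    return minimum + random() * spread

seed(0)  # seeded only so the sketch is repeatable
waits = [polite_wait_time() for _ in range(1000)]
print(f"shortest: {min(waits):.2f}s, longest: {max(waits):.2f}s")
```

Randomizing the delay spreads requests out unevenly, which is gentler on the server than hitting it at a fixed interval.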


--- 

### A Helpful Function

When I have to write out a long string, it's nice to wrap the text. The `textwrap` library makes that easy. The code below generates a long random string, which we then write out both as-is and in wrapped form.

In [200]:
from random import choices, seed
from string import ascii_letters

In [201]:
# Generate a long string with some spaces. 

string_length = 50000
chars_to_sample = ascii_letters + " "*8 # Get some spaces in there

seed(20200916)

text = "".join(choices(chars_to_sample,k=string_length))

First we'll just write out the text. You'll notice it's just one long line. 

In [202]:
with open("text.txt",'w') as outfile :
    outfile.write(text)

The library `textwrap` will let us make a nice, wrapped output.

In [None]:
wrapped_text = textwrap.wrap(text)

with open("text_wrapped.txt",'w') as outfile :
    for piece in wrapped_text :
        outfile.write(piece + "\n")
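
The call above relies on the default line width of 70 characters. As a minimal sketch of how to control the width, the example below wraps a short sample sentence (made up for illustration) at 40 characters and shows that `textwrap.fill` gives the same result as joining the wrapped lines yourself.

```python
import textwrap

sample = (
    "Four score and seven years ago our fathers brought forth "
    "on this continent a new nation. "
) * 3

# wrap() returns a list of lines, each at most `width` characters long
lines = textwrap.wrap(sample, width=40)
for line in lines:
    print(line)

# fill() is shorthand for joining the wrapped lines with newlines
filled = textwrap.fill(sample, width=40)
print(filled == "\n".join(lines))
```

If you only need a single wrapped string to write to a file, `fill` saves the explicit loop.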