## Convention Database

In this notebook we store our convention data in a database. See the README for details.

In [5]:
import re
import sqlite3
import os
from collections import defaultdict

In [6]:
path_to_files = "C:/users/jchan/dropbox/teaching/outsidedata/2020Conventions/"
transcript_files = os.listdir(path_to_files)

In [7]:
transcript_files

['www_rev_com_blog_transcripts2020-democratic-national-convention-dnc-night-4-transcript.txt',
 'www_rev_com_blog_transcripts2020-republican-national-convention-rnc-night-1-transcript.txt',
 'www_rev_com_blog_transcripts2020-republican-national-convention-rnc-night-2-transcript.txt',
 'www_rev_com_blog_transcripts2020-republican-national-convention-rnc-night-3-transcript.txt',
 'www_rev_com_blog_transcripts2020-republican-national-convention-rnc-night-4-transcript.txt',
 'www_rev_com_blog_transcriptsdemocratic-national-convention-dnc-2020-night-2-transcript.txt',
 'www_rev_com_blog_transcriptsdemocratic-national-convention-dnc-night-1-transcript.txt',
 'www_rev_com_blog_transcriptsdemocratic-national-convention-dnc-night-3-transcript.txt']

Let's start by building a lookup from each file name to its party and night.

In [20]:
# Map each file name to a two-item list: [party, night].
# The night is kept as a string here (e.g. "4").
lookup_party_night = defaultdict(list)

file_night = re.compile(r"night-[1-4]")

for file in transcript_files :
    if "rnc" in file :
        lookup_party_night[file].append("Republican")
    elif "dnc" in file : 
        lookup_party_night[file].append("Democratic")
        
    night_text = file_night.search(file).group(0)    
    lookup_party_night[file].append(night_text.split("-")[1])
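
The same lookup logic can be wrapped in a small function that fails loudly when a file name has no party or night marker. This is only a sketch: `parse_party_night` and `NIGHT_PATTERN` are hypothetical names, not part of the notebook above, and returning the night as an `int` is my assumption, made to match the `INTEGER` column used later.

```python
import re

# Hypothetical helper (not in the original notebook): same rules as the loop
# above, but the night comes back as an int and bad names raise an error.
NIGHT_PATTERN = re.compile(r"night-([1-4])")

def parse_party_night(filename):
    """Return (party, night) for a transcript file name."""
    if "rnc" in filename:
        party = "Republican"
    elif "dnc" in filename:
        party = "Democratic"
    else:
        raise ValueError(f"No party marker in {filename!r}")

    match = NIGHT_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"No night marker in {filename!r}")
    return party, int(match.group(1))

example = "www_rev_com_blog_transcripts2020-republican-national-convention-rnc-night-3-transcript.txt"
print(parse_party_night(example))
```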


Now let's set up our DB.

In [22]:
db = sqlite3.connect("2020_Conventions.db")
cur = db.cursor()

In [25]:
cur.execute('''DROP TABLE IF EXISTS conventions''')
cur.execute('''CREATE TABLE conventions (
    party TEXT, 
    night INTEGER, 
    speaker TEXT,
    speaker_count INTEGER,
    time TEXT, 
    text TEXT,
    file TEXT)''')

db.commit()
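
Once the transcripts are parsed, rows can go into this table with a parameterized `executemany`. The sketch below uses an in-memory database so it doesn't touch `2020_Conventions.db`, and the rows are placeholder values, not real transcript data. Treating `speaker_count` as a running count of speaker turns is an assumption on my part.

```python
import sqlite3

# In-memory database so this demo leaves the real file alone.
demo_db = sqlite3.connect(":memory:")
demo_cur = demo_db.cursor()
demo_cur.execute("""CREATE TABLE conventions (
    party TEXT, night INTEGER, speaker TEXT, speaker_count INTEGER,
    time TEXT, text TEXT, file TEXT)""")

# Placeholder rows (made up for illustration), in column order.
rows = [
    ("Democratic", 4, "Example Speaker", 1, "00:00", "Placeholder text.", "example.txt"),
    ("Democratic", 4, "Example Speaker", 2, "01:30", "More placeholder text.", "example.txt"),
]

# One "?" per column; sqlite3 handles quoting and escaping.
demo_cur.executemany("INSERT INTO conventions VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
demo_db.commit()

count = demo_cur.execute("SELECT COUNT(*) FROM conventions").fetchone()[0]
print(count)
```

Using `?` placeholders instead of string formatting matters here because the transcript text is full of quotes and apostrophes.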

Okay, that was easy enough. Now the tough part. We're going to need a regular expression that matches the speaker label, split the text on it, and funnel everything into the right column. The transcript text was hard-wrapped, so we'll strip the line breaks as we load it.

### Notes

After writing this crazy regular expression, I think the better move is to look for the timestamp and then work our way backward to the previous period to pick up the speaker's name.
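
Here is a rough sketch of that timestamp-first idea. It is not what the cells below do. `split_on_times` is a hypothetical helper, the sample string is made up, and treating `?` and `!` as sentence ends alongside the period is my own assumption.

```python
import re

# Same timestamp shape as the notebook's time_pattern, with a non-capturing group.
time_re = re.compile(r"\( (?:\d{2}:)?\d{2}:\d{2} \)")

def split_on_times(transcript):
    """Return (speaker, time, text) tuples from a flattened transcript."""
    matches = list(time_re.finditer(transcript))

    # Pass 1: for each timestamp, walk back to the previous sentence end
    # (but never past the previous timestamp) to find where the label starts.
    label_starts = []
    floor = 0
    for m in matches:
        cut = max(transcript.rfind(p, floor, m.start()) for p in (". ", "? ", "! "))
        label_starts.append(cut + 2 if cut != -1 else floor)
        floor = m.end()

    # Pass 2: the text for a turn runs from its timestamp to the next label.
    segments = []
    for i, m in enumerate(matches):
        speaker = transcript[label_starts[i]:m.start()].strip().rstrip(":").strip()
        text_end = label_starts[i + 1] if i + 1 < len(matches) else len(transcript)
        time = m.group(0).strip("( )")
        segments.append((speaker, time, transcript[m.end():text_end].strip()))
    return segments

sample = "Intro text. Speaker A: ( 00:00 ) Hello there. How are you? Speaker B: ( 01:05 ) Fine. Thanks."
for row in split_on_times(sample):
    print(row)
```

One known weakness: a name that itself contains a period followed by a space (a middle initial, or "Jr.") would be cut short, so those cases would need special handling.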

In [128]:
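# Optional groups are capturing, so findall() returns tuples and split() interleaves the groups.
speaker_pattern = re.compile(r"[A-Z][a-z]+( [A-Z]\.)?( [A-Z0-9][a-zA-Z0-9’-]*)?( Jr\.)?: \( (\d{2}:)?\d{2}:\d{2} \)")
# Timestamp alone: "( MM:SS )" or "( HH:MM:SS )".
time_pattern = re.compile(r"\( (\d{2}:)?\d{2}:\d{2} \)")
# Line breaks left over from the wrapped text.
returns_pattern = re.compile(r"\n")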


In [129]:
with open(path_to_files + transcript_files[0],encoding="UTF-8") as infile :
    holder = infile.read()
    holder = returns_pattern.sub(" ",holder)
    

In [130]:
speakers = speaker_pattern.findall(holder)
text = speaker_pattern.split(holder,maxsplit=0)
text = [t for t in text if t]
text = text[1:]


In [None]:
for i in range(10) :
    print(speakers[i])
    print(text[i+1])

In [136]:
text[:10]

['Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20. Read the full transcript of the event here. Transcribe Your Own Content  Try Rev for free  and save time transcribing, captioning, and subtitling. ',
 ' 1',
 '  I’m here by c

In [138]:
speakers[100:110]

[('', ' Campbell', '', ''),
 ('', ' Coons', '', ''),
 ('', ' Coons', '', ''),
 ('', ' Coons', '', ''),
 ('', ' Coons', '', ''),
 ('', ' Thompson', '', ''),
 ('', ' Biden', '', ''),
 ('', ' Biden', '', ''),
 ('', ' Louis-Dreyfus', '', ''),
 ('', ' Bottoms', '', '')]
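
The tuples above are what `findall` returns when a pattern contains capturing groups: one tuple of group contents per match instead of the full matched string. `split` behaves similarly and puts the captured groups into its output list. One possible fix, sketched below, is to make every group non-capturing with `(?:...)`. The name `speaker_pattern_nc` and the sample string are made up for illustration.

```python
import re

# Same pattern as speaker_pattern, but every group is non-capturing.
speaker_pattern_nc = re.compile(
    r"[A-Z][a-z]+(?: [A-Z]\.)?(?: [A-Z0-9][a-zA-Z0-9’-]*)?(?: Jr\.)?: "
    r"\( (?:\d{2}:)?\d{2}:\d{2} \)"
)

# Made-up sample in the same "Name: ( time ) text" shape as the transcripts.
sample = "Intro. Speaker 1: ( 00:00 ) Hello. Jane Doe: ( 01:02:03 ) Thanks."

demo_speakers = speaker_pattern_nc.findall(sample)  # full label strings
demo_text = speaker_pattern_nc.split(sample)        # only the text between labels

print(demo_speakers)
print(demo_text)
```

With no capturing groups, the first element of `demo_text` is whatever precedes the first speaker, and element `i + 1` is the text that goes with `demo_speakers[i]`.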

In [133]:
len(text)

1211