## Comparing Convention Language

In this notebook we'll set up the convention data we've scraped so that we can compare the language used by each party.

In [None]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict
from string import punctuation

In [None]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

We'll make punctuation a set and add the curly apostrophe (’) that appears in the data, since `string.punctuation` only contains ASCII characters.

In [None]:
punctuation = set(punctuation)
punctuation.add("’")
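
To see why the extra character is needed, here is a minimal check that uses only the standard library. The name `punct_chars` is hypothetical and is used here only so the example does not overwrite the notebook's `punctuation`.

```python
from string import punctuation as ascii_punctuation

# string.punctuation holds ASCII punctuation only
punct_chars = set(ascii_punctuation)

straight_included = "'" in punct_chars      # straight apostrophe
curly_included_before = "’" in punct_chars  # curly apostrophe, not ASCII

punct_chars.add("’")
curly_included_after = "’" in punct_chars

print(straight_included, curly_included_before, curly_included_after)
# True False True
```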

Let's read in the convention data from the DB so that we can work with it. 

In [None]:
# Single quotes mark a string literal in SQL; double quotes denote identifiers
query_results = convention_cur.execute(
    '''
    SELECT text, party
    FROM conventions
    WHERE speaker != 'Unknown'
    ''')
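
The cursor returned by `execute` yields one tuple per row and can only be iterated once. As a rough sketch of that behavior, the same query pattern can be tried against a throwaway in-memory database. The rows below are made up purely for illustration, and the `demo_` names are hypothetical; only the table and column names come from the query in this notebook.

```python
import sqlite3

# Toy in-memory database; the rows are invented for illustration only
demo_db = sqlite3.connect(":memory:")
demo_cur = demo_db.cursor()
demo_cur.execute("CREATE TABLE conventions (speaker TEXT, party TEXT, text TEXT)")
demo_cur.executemany(
    "INSERT INTO conventions VALUES (?, ?, ?)",
    [
        ("Speaker A", "Democratic", "Hello, everyone."),
        ("Unknown", "Democratic", "Crowd noise."),
        ("Speaker B", "Republican", "Good evening!"),
    ],
)

# fetchall() collects the remaining rows into a list of (text, party) tuples
demo_rows = demo_cur.execute(
    "SELECT text, party FROM conventions WHERE speaker != 'Unknown'"
).fetchall()
demo_db.close()

print(demo_rows)
```

The row with the "Unknown" speaker is filtered out, so only two tuples remain.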

And now we'll store all the text from every identified speaker in a dictionary that has just two keys, "Democratic" and "Republican".

In [None]:
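# Maps each party name to one long string of its cleaned, space-separated tokens
convention_data = defaultdict(str)

for text, party in query_results:
    # Remove punctuation characters (this also turns "don't" into "dont")
    text = "".join(ch for ch in text if ch not in punctuation)
    # Keep purely alphabetic tokens and lowercase them
    tokens = [w.lower() for w in text.split() if w.isalpha()]

    convention_data[party] += " ".join(tokens) + " "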
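
To make the effect of the cleaning steps concrete, here is a small sketch that applies the same logic to a single made-up sentence. The helper `clean_text`, the set `demo_punct`, and the sample string are hypothetical and exist only for this illustration.

```python
from string import punctuation as ascii_punctuation

# ASCII punctuation plus the curly apostrophe
demo_punct = set(ascii_punctuation) | {"’"}

def clean_text(text, punct=demo_punct):
    """Strip punctuation, keep alphabetic tokens, and lowercase them."""
    no_punct = "".join(ch for ch in text if ch not in punct)
    return [w.lower() for w in no_punct.split() if w.isalpha()]

sample = "We’re here, together: 2020 won’t break us!"
print(clean_text(sample))
# ['were', 'here', 'together', 'wont', 'break', 'us']
```

Two side effects are worth keeping in mind: contractions lose their apostrophe (so "we’re" becomes indistinguishable from "were"), and tokens containing digits, such as "2020", are dropped entirely.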

In [None]:
# The 20 most frequent tokens in the Democratic text
nltk.FreqDist(convention_data['Democratic'].split()).most_common(20)

In [None]:
# The 20 most frequent tokens in the Republican text
nltk.FreqDist(convention_data['Republican'].split()).most_common(20)
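
Raw frequency lists for the two parties tend to share many of the same common words, so one possible next step is to compare relative frequencies instead. The sketch below is only one way to do this and is not part of the original analysis: the function `top_distinctive`, its parameters, and the toy token lists are all hypothetical.

```python
from collections import Counter

def top_distinctive(tokens_a, tokens_b, n=5, min_count=2):
    """Rank words by how much more often they occur in A than in B.

    The score is (share of A's tokens) / (share of B's tokens); words
    seen fewer than min_count times in either list are skipped.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)

    ratios = {}
    for word, count in counts_a.items():
        if count >= min_count and counts_b[word] >= min_count:
            ratios[word] = (count / total_a) / (counts_b[word] / total_b)

    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy token lists, invented for illustration
toy_a = ["jobs", "jobs", "jobs", "taxes"]
toy_b = ["jobs", "taxes", "taxes", "taxes"]
toy_result = top_distinctive(toy_a, toy_b, min_count=1)
print(toy_result)
```

With the notebook's data, the same function could be called as `top_distinctive(convention_data['Democratic'].split(), convention_data['Republican'].split())`, and with the arguments swapped for the other direction. The `min_count` threshold is an assumption meant to keep rare words from producing extreme ratios.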