# Word frequency analysis
This notebook runs a word frequency analysis: it tries to identify which words politicians use most often.

Let's first load the data:

In [2]:
import pandas as pd
# load the data
df = pd.read_pickle("../data/binary/us-politicians.pickle")
df.sample(5)

| Unnamed: 0 | speaker_id | quote_id | quotation | speaker | party | Date-Time |
|---|---|---|---|---|---|---|
| 417306 | 22686 | 2017-07-07-014756 | came out in droves. They voted in the last ele... | Donald Trump | 29468 | 2017-07-07 |
| 325102 | 22686 | 2017-08-23-104623 | Reply # 7 on: August 22, 2017, 08:36:55 PM | Donald Trump | 29468 | 2017-08-23 |
| 1183918 | 6279 | 2015-11-13-114714 | time is ripe for peace. | Joe Biden | 29552 | 2015-11-13 |
| 1266560 | 434706 | 2016-09-20-156989 | You have said repeatedly,' I am accountable,' ... | Elizabeth Warren | 29552 | 2016-09-20 |
| 644373 | 22686 | 2018-10-24-133410 | vote against all Republicans. | Donald Trump | 29468 | 2018-10-24 |


## Preprocessing the quotes
Now, let's define some utility functions that will allow us to apply preprocessing operations such as:
- changing string to lowercase
- removing numbers
- removing punctuation
- removing single characters
- removing leading, trailing and repeating spaces
- expanding contractions
- removing stopwords

In [3]:
import re, string, contractions
from nltk.tokenize import word_tokenize

def preprocess_quote(quote):
    # to lowercase
    quote = quote.lower()

    # remove numbers and punctuation
    quote = re.sub(r'\d+', '', quote)
    quote = quote.translate(str.maketrans('', '', string.punctuation))

    # remove single characters surrounded by whitespace
    quote = re.sub(r'\s+[a-zA-Z]\s+', ' ', quote)
    # remove a single character at the start of the string
    quote = re.sub(r'^[a-zA-Z]\s+', ' ', quote)

    # remove leading, trailing, and repeating spaces
    quote = re.sub(' +', ' ', quote)
    quote = quote.strip()

    return quote
    
def expand_contractions(quote):
    return contractions.fix(quote)

def remove_words(quote, words):
    tokens = word_tokenize(quote)
    filtered_tokens = [token for token in tokens if token.lower() not in words]
    return " ".join(filtered_tokens)
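
To see what the cleaning steps do to a single string, here is a minimal, standard-library-only sketch that mirrors the regex steps of `preprocess_quote`. The name `clean_text` and the sample quote are hypothetical. Contraction expansion and stop-word removal are left out, since they depend on `contractions` and NLTK.

```python
import re
import string


def clean_text(text):
    """Hypothetical stand-in that mirrors the regex steps of preprocess_quote."""
    text = text.lower()
    # drop digits, then punctuation
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # drop single letters surrounded by whitespace, then one at the very start
    text = re.sub(r"\s+[a-z]\s+", " ", text)
    text = re.sub(r"^[a-z]\s+", " ", text)
    # collapse repeated spaces and trim both ends
    return re.sub(r" +", " ", text).strip()


sample = "A vote in 2016, I think, was  a BIG deal!"
print(clean_text(sample))
```

Note that stripping punctuation before single characters means a leftover such as the "s" of "it's" would also be dropped if contractions had not been expanded first.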

And now let's apply the preprocessing functions to the data:

In [4]:
# take a random sample of 10,000 rows from the dataset
df = df.sample(10000)

In [5]:
# expand contractions
df["quotation"] = df["quotation"].apply(expand_contractions)

# load the stopwords (extend the standard list with the contraction leftovers) and remove them from the quotes
from nltk.corpus import stopwords
# use a separate name so the nltk `stopwords` module is not shadowed when the cell is re-run
stop_words = stopwords.words('english') + ['nt', 'ca', 'wo']
print("Example stopwords: ", stop_words[:10])
df["quotation"] = df["quotation"].apply(lambda quote: remove_words(quote, stop_words))

# apply other preprocessing
df["quotation"] = df["quotation"].apply(preprocess_quote)

Example stopwords:  ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [6]:
# look at most common words
def get_words(df):
    quotes = df["quotation"]
    tokenized = quotes.apply(lambda quote: word_tokenize(quote))
    words = tokenized.explode()
    words = words.astype("str")
    return words

words = get_words(df)
words.value_counts()

people        1149
going          859
would          811
think          690
know           668
              ... 
recedes          1
vacuums          1
successors       1
applying         1
lightfoot        1
Name: quotation, Length: 13272, dtype: int64
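
The same counting step can be sketched without pandas. This is only an illustration on made-up quotes: it uses `str.split` in place of `word_tokenize` and `collections.Counter` in place of `explode` plus `value_counts`, and the names `count_words` and `toy_quotes` are hypothetical.

```python
from collections import Counter


def count_words(quotes):
    """Count how often each whitespace-separated token occurs across all quotes."""
    counts = Counter()
    for quote in quotes:
        counts.update(quote.split())
    return counts


# made-up, already preprocessed quotes
toy_quotes = [
    "people think people know",
    "going think",
    "people going",
]

counts = count_words(toy_quotes)
for word, n in counts.most_common(3):
    print(word, n)
```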

## Removing the most common words
As you can see above, the most common words say little about the topics that politicians talk about; they are simply words that appear frequently in the English language. To be able to draw conclusions from the frequency analysis, we remove the 1000 most common English words from the quotations and repeat the analysis.

In [7]:
# load the list of most common english words, taken from: https://gist.github.com/deekayen/4148741
def read_most_common(path):
    most_common = []
    with open(path, "r") as file:
        for line in file.readlines():
            most_common.append(line.strip())
    return most_common

most_common = read_most_common("../data/misc/most_common_words_1000.txt")
most_common[:12]

['the', 'of', 'to', 'and', 'a', 'in', 'is', 'it', 'you', 'that', 'he', 'was']

We make use of stemming to perform better filtering of the words. We first apply stemming to the list of common words, and then for each token in the quotations we check if the stemmed version of the token appears in the common words list. If it does, we discard this token.

In [8]:
# filter out tokens whose stem matches the stem of a common word
from nltk.stem import PorterStemmer

def filter_words(words):
    stemmer = PorterStemmer()
    # a set gives fast membership checks
    most_common_stemmed = {stemmer.stem(word) for word in most_common}
    # stem each distinct token only once
    stemmed_words = {x: stemmer.stem(x) for x in words.unique()}

    # keep only the words whose stem is not in the common-word list
    filtered_words = [word for word in words if stemmed_words[word] not in most_common_stemmed]

    return filtered_words

filtered_words = filter_words(words)
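
To illustrate why the comparison is done on stems rather than on raw tokens, the sketch below swaps `PorterStemmer` for a deliberately crude suffix stripper so that it runs without NLTK. `toy_stem`, `filter_by_stem`, and the word lists are hypothetical, and real Porter stems will differ from the ones produced here.

```python
def toy_stem(word):
    """Crude stand-in for a stemmer: strip one common suffix (not Porter's algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def filter_by_stem(tokens, common_words, stem=toy_stem):
    """Keep only tokens whose stem is not the stem of any common word."""
    common_stems = {stem(word) for word in common_words}
    return [token for token in tokens if stem(token) not in common_stems]


common = ["think", "work", "vote"]
tokens = ["thinking", "worked", "votes", "senate", "security"]

# exact matching keeps every token, because none is literally in the list
print([token for token in tokens if token not in common])
# stem matching also drops the inflected forms of the common words
print(filter_by_stem(tokens, common))
```

Exact matching would leave "thinking", "worked" and "votes" in the data even though their base forms are on the common-word list; matching on stems removes them.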

Now, let's look at the most common words again:

In [9]:
pd.Series(filtered_words).value_counts()[:20]

president     600
trump         425
america       312
really        294
american      277
something     240
important     166
today         165
security      162
americans     150
china         143
campaign      140
political     139
russia        134
health        132
democrats     127
congress      126
everything    120
public        120
military      118
dtype: int64

The results are much more meaningful: we can see keywords like 'america', 'president' or 'security'. Let's get a better picture of them by plotting a word cloud:

In [10]:
# adapted from: https://www.datacamp.com/community/tutorials/wordcloud-python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def make_wordcloud(df, out_path):
    words = get_words(df)
    filtered_words = filter_words(words)
    
    plt.rcParams["figure.figsize"] = (10,10)

    wordcloud = WordCloud(width=1600, height = 800, max_font_size=150, max_words=100, background_color="white")
    wordcloud.generate(" ".join(filtered_words))
    wordcloud.to_file(out_path)

In [14]:
democrats_id, republicans_id = 29552, 29468

democrats = df[df["party"] == democrats_id]
republicans = df[df["party"] == republicans_id]
all_speakers = df.copy()

plots_path = "../figures/{}.png"
paths = [plots_path.format(group) for group in ["all_speakers", "democrats", "republicans"]]
groups = [all_speakers, democrats, republicans]

for path, group in zip(paths, groups):
    make_wordcloud(group, path)
    

In [25]:
from ipywidgets import interact
import ipywidgets as widgets

wordclouds = [mpimg.imread(path) for path in paths]

def f(x):
    # map each button label to the index of its word cloud image
    indices = {'Both parties': 0, 'Democrats only': 1, 'Republicans only': 2}
    plt.imshow(wordclouds[indices[x]], interpolation='bilinear')
    plt.axis("off")
    plt.show()

interact(f, x=widgets.ToggleButtons(options=['Both parties', 'Democrats only', 'Republicans only'], description='Group:'));
