### Using Word Embeddings

Word embeddings, numerical representations of words learned by neural networks, have revolutionized natural language processing over the last eight years. Nearly every application of NLP has been transformed by this technology. We'll start modestly, using a numerical notion of distance to understand political speech.

Our goal is multi-step: 

1. Scrape all speeches from the DNC and RNC national conventions. Good news, that's already done.
1. Convert each speech giver into a numeric vector, based on the words they delivered at the convention.
1. Measure their distance to the speeches of Donald Trump and Joe Biden.
1. Visualize speech givers along these dimensions.

A good reference for word embeddings in spaCy can be found in [the documentation](https://spacy.io/usage/vectors-similarity). 

In [1]:
import sqlite3
from collections import defaultdict
import spacy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Load the parser up here, so we don't keep reloading it as 
# we run cells down below.
nlp = spacy.load("en_core_web_md")

## Gather the Data

Let's query the database (a copy of which is up on Moodle) and create a dictionary to store the data. The key should be the speaker and the value should be a string containing _everything_ they said at the convention.

In [4]:
db = sqlite3.connect("ConventionSpeeches.db")
cur = db.cursor()

In [None]:
# Write a query to pull speaker, party and text from the database
convention_data = cur.execute("""

                                """)


In [None]:
speeches = defaultdict(str)
party_lu = defaultdict(str)

for item in convention_data :
    speaker, party, text = item
    
    speeches[speaker] = " ".join([speeches[speaker],text]).strip() 
    party_lu[speaker] = party
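
As a rough illustration of the query-and-accumulate pattern above, here is a minimal sketch against an in-memory database. The table name `conventions` and its column names are assumptions made for this example only; the real `ConventionSpeeches.db` may be laid out differently, so check its schema before adapting the query.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema: the table and column names below are made up
# for this sketch and may not match ConventionSpeeches.db.
demo_db = sqlite3.connect(":memory:")
demo_cur = demo_db.cursor()
demo_cur.execute("CREATE TABLE conventions (speaker TEXT, party TEXT, text TEXT)")
demo_cur.executemany(
    "INSERT INTO conventions VALUES (?, ?, ?)",
    [
        ("Speaker A", "Democratic", "Good evening."),
        ("Speaker B", "Republican", "Thank you all."),
        ("Speaker A", "Democratic", "Let us begin."),
    ],
)

# Pull the three fields in insertion order so each speaker's text
# is concatenated in the order it was stored.
rows = demo_cur.execute(
    "SELECT speaker, party, text FROM conventions ORDER BY rowid"
)

demo_speeches = defaultdict(str)
demo_party_lu = {}
for speaker, party, text in rows:
    # Append this chunk to whatever the speaker has said so far,
    # then strip the leading space left by the first join.
    demo_speeches[speaker] = " ".join([demo_speeches[speaker], text]).strip()
    demo_party_lu[speaker] = party

demo_db.close()
print(dict(demo_speeches))
```

To see which tables a SQLite file actually contains, you can run `SELECT name FROM sqlite_master WHERE type='table'` before writing the real query.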

Some people appeared in both conventions via video clips. Let's make sure they all have the correct party.

In [None]:
party_lu["Donald Trump"] = "Republican"
party_lu["Andrew Cuomo"] = "Democratic"
party_lu["Joe Biden"] = "Democratic"
party_lu["Nancy Pelosi"] = "Democratic"

Let's remove any speaker who spoke fewer than `length_cutoff` words at the convention or whose name we don't know.

In [None]:
length_cutoff = 100
to_remove = []
num_removed = 0
num_kept = 0

for speaker, text in speeches.items() :
    
    # Get the length of the speaker's text
    
    # if the text is shorter than length_cutoff,
    # add that speaker to `to_remove`
    
    # Update your kept and removed counters
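
One possible way to approach the filter, shown on toy data rather than the real speeches. The helper name `find_short_speakers`, the whitespace-based word count, and the placeholder names treated as "unknown" are all assumptions for this sketch; the actual data may mark unknown speakers differently.

```python
def find_short_speakers(speeches, length_cutoff, unknown_names=("", "Unknown")):
    """Return speakers with too few words or a placeholder name.

    `unknown_names` is a hypothetical list of placeholder labels.
    """
    to_remove = []
    for speaker, text in speeches.items():
        # Assumption: count words by splitting on whitespace.
        word_count = len(text.split())
        if word_count < length_cutoff or speaker in unknown_names:
            to_remove.append(speaker)
    return to_remove


toy_speeches = {
    "Speaker A": "one two three four five",
    "Speaker B": "one two",
    "Unknown": "one two three four five six",
}

to_remove_demo = find_short_speakers(toy_speeches, length_cutoff=3)
num_removed_demo = len(to_remove_demo)
num_kept_demo = len(toy_speeches) - num_removed_demo
print(to_remove_demo, num_kept_demo, num_removed_demo)
```

Collecting names in a list first, and deleting afterwards, avoids changing the dictionary while iterating over it.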

In [None]:
print(num_kept)

In [None]:
print(num_removed)

In [None]:
# Let's remove the speakers who don't fit our filtering criteria
for speaker in to_remove :
    del speeches[speaker]


In [None]:
db.close()

## Creating a Numeric Vector

Now let's turn each speaker into a numeric vector. There are many ways we could potentially do this. We'll follow one of the most straightforward. We'll do some cleaning, convert each word of their speech into a vector, and average those vectors. 

We'll use the spaCy library's word vectors. Note that the small model (`en_core_web_sm`) doesn't include word vectors, so you'll need to have downloaded either the medium or large model. You can do this at the command line by running `python -m spacy download en_core_web_md`. The large model is _large_ -- it'll take a long time to download.

We're going to take these averages, but let's first play around with one person's speech and the tokens from it.

In [None]:
text = speeches["Michelle Obama"]
tokens = nlp(text)
token = tokens[8]

In [None]:
print(token)
print(token.vector)

In [None]:
token.vector.shape

Let's iterate over all the tokens in Michelle Obama's speeches and build an average word vector for her. I've set up a numpy array of zeros with the proper length. Add each token's vector to that `score` array; if you then divide by the number of tokens you'll get the average word vector.

In [None]:
score = np.zeros(300)
token_count = 0

for token in tokens :
    # add the vector to the score
    
    # update the token count
    
score = score/token_count

# Look at the sum of the score. We'll compare this to a different technique.
print(sum(score))

We can do this more quickly by taking advantage of list comprehensions (making a list of vectors) and simply applying `np.mean` across the list. 

In [None]:
word_vector_list = [token.vector for token in tokens]

In [None]:
average_word_vector = np.mean(word_vector_list, axis=0)

# check that the sum is the same as the other way
print(average_word_vector.sum())

That should match the value above, up to small floating-point differences, if you've done it correctly.
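
To see why the two approaches agree, here is a small self-contained check on random stand-in vectors. It does not use spaCy: `toy_vectors` is simply a list of made-up 300-dimensional arrays playing the role of `token.vector`.

```python
import numpy as np

# Stand-ins for token vectors: five random 300-dimensional arrays.
rng = np.random.default_rng(0)
toy_vectors = [rng.normal(size=300) for _ in range(5)]

# Approach 1: accumulate a running sum, then divide by the count.
running_sum = np.zeros(300)
count = 0
for vec in toy_vectors:
    running_sum += vec
    count += 1
loop_average = running_sum / count

# Approach 2: let numpy average across the list in one call.
vectorized_average = np.mean(toy_vectors, axis=0)

# Compare with a tolerance rather than exact equality.
print(np.allclose(loop_average, vectorized_average))
```

Comparing with `np.allclose` is the safer habit, since the two routes can round slightly differently.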

---

Now let's calculate these average word vectors for each speaker. Iterate through the `speeches` object and fill in the scores dictionary with the average word vector for the speaker. It might be helpful to do some cleaning of the speeches, to try to focus on words that carry semantic meaning. There are a variety of things we could try: 

* Removing stopwords using the `token.is_stop` attribute in spaCy.
* Removing punctuation using `token.is_punct`
* Keeping only certain parts of speech (like nouns and/or verbs)
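
The cleaning options listed above could be combined along these lines. This is only a sketch: `average_vector` and `fake_token` are hypothetical helpers, and the toy tokens are plain stand-in objects that mimic the spaCy attributes `vector`, `pos_`, `is_stop`, `is_punct` and `has_vector`, so the example runs without loading a model.

```python
from types import SimpleNamespace

import numpy as np


def average_vector(tokens, keep_pos=None, dim=300):
    """Average the vectors of tokens that survive the cleaning rules."""
    kept = [
        t.vector
        for t in tokens
        if not t.is_stop
        and not t.is_punct
        and t.has_vector
        and (keep_pos is None or t.pos_ in keep_pos)
    ]
    if not kept:
        # Nothing survived the filters; fall back to a zero vector.
        return np.zeros(dim)
    return np.mean(kept, axis=0)


def fake_token(vec, pos, is_stop=False, is_punct=False):
    # Stand-in for a spaCy token, for illustration only.
    return SimpleNamespace(
        vector=np.array(vec, dtype=float),
        pos_=pos,
        is_stop=is_stop,
        is_punct=is_punct,
        has_vector=True,
    )


toy_tokens = [
    fake_token([1.0, 0.0], "NOUN"),
    fake_token([0.0, 1.0], "VERB"),
    fake_token([9.0, 9.0], "DET", is_stop=True),
    fake_token([5.0, 5.0], "PUNCT", is_punct=True),
]

all_content = average_vector(toy_tokens, dim=2)
nouns_only = average_vector(toy_tokens, keep_pos={"NOUN"}, dim=2)
print(all_content, nouns_only)
```

With real data you would pass the result of `nlp(text)` in place of `toy_tokens`. Note that a very aggressive filter can leave a speaker with no tokens at all, which is why the sketch falls back to a zero vector.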


In [None]:
scores = {}

for speaker, text in speeches.items() :

    tokens = nlp(text)
    
    # Use this space to do some cleaning of the tokens. We'll run 
    # this code with and without various cleaning approaches, trying
    # to find a version that gives us sensible output. 
    
    # write code to calculate the average word vector for the speaker.
    
    
    # store that average word vector in the `scores` object. Replace
    # the right-hand side below.
    scores[speaker] = np.zeros(300)
    
    
        

## Measuring Distance

Now we'll create two numpy arrays that measure the distance between each speaker and the presidential candidates. For reasons we'll discuss in class, cosine similarity is the overwhelming choice for text data. It measures the angle between the vectors, disregarding their magnitude, so higher values mean two speakers' vectors point in more similar directions. As you may recall from high school algebra, one formula for the cosine, and the one we'll use, is this: 

$$
    \cos(a,b) = \frac{a \cdot b}{||a||\cdot||b||}
$$

In [None]:
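def cosine_dist(a,b) :
    # Despite the name, this returns the cosine similarity:
    # values near 1 mean the vectors point in the same direction.
    dist = np.dot(a,b)/(np.linalg.norm(a)*np.linalg.norm(b))
    return dist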
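
A quick sanity check of the formula on vectors whose angles are easy to reason about. The function is redefined here under the name `cosine_similarity` (a name chosen for this sketch) so the block runs on its own.

```python
import numpy as np


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Same direction, different magnitude: the cosine should be close to 1.
same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))

# Perpendicular vectors: the cosine should be close to 0.
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Opposite directions: the cosine should be close to -1.
opposite = cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))

print(same_direction, orthogonal, opposite)
```

The first case shows why magnitude is disregarded: doubling a vector does not change its angle with another vector.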

In [None]:
trump_dists = np.zeros(len(scores))
biden_dists = np.zeros(len(scores))
speakers = []
party = []

for idx, speaker in enumerate(scores) :
    this_vec = scores[speaker]
    speakers.append(speaker)
    party.append(party_lu[speaker])
    
    trump_dists[idx] = cosine_dist(scores["Donald Trump"],this_vec)
    biden_dists[idx] = cosine_dist(scores["Joe Biden"],this_vec)
        

spaCy includes a similarity measure based on cosine similarity. You can pass in two tokens, spans, or documents (each `nlp(...)` call in the next cell returns a document) and receive the similarity score between them. How could you potentially use this functionality to measure the similarity between speakers?

In [None]:
print(nlp("young").similarity(nlp("old")))
print(nlp("dog").similarity(nlp("cat")))
print(nlp("dog").similarity(nlp("justice")))
print(nlp("inequality").similarity(nlp("justice")))

Now let's visualize these distances to get a sense of the distributions.

In [None]:
# matplotlib histogram
plt.hist(trump_dists, 
         color = 'red', 
         edgecolor = 'black',
         bins = 100)

In [None]:
# matplotlib histogram
plt.hist(biden_dists, 
         color = 'blue', 
         edgecolor = 'black',
         bins = 100)

For both speakers, most similarities are quite high. Cosine similarity can in principle range from -1 to 1, but the values here fall between zero and one. Let's figure out which speakers are closest to Trump and which are closest to Biden. One way to do this is to put everything in a data frame and do some sorting. 

In [None]:
distances = pd.DataFrame(list(zip(speakers,party, trump_dists.tolist(),biden_dists.tolist())),
                         columns=["Speaker","Party","TrumpDist","BidenDist"])

In [None]:
# Far from Trump
distances.sort_values("TrumpDist").head(n=10)

In [None]:
# Close to Trump
distances.sort_values("TrumpDist").tail(n=10)

In [None]:
# Far from Biden
distances.sort_values("BidenDist").head(n=10)

In [None]:
# Close to Biden
distances.sort_values("BidenDist").tail(n=10)

Take a look at the speakers who are close to and far from the presidential candidates. Do these make sense to you? Play around with the filtering options at the end of the "Creating a Numeric Vector" section. What combinations seem to make the most sense?

## Plotting Distances

Most speakers are pretty similar to each other. Let's plot the distances to Trump against the distances to Biden to get a sense of how the two measures correlate and how the speakers are spread across that space.

In [None]:
fig = plt.figure()
ax = plt.gca()
ax.scatter(trump_dists,biden_dists , c='blue', alpha=0.2, edgecolors='none')
ax.set_xlabel("Trump Distances")
ax.set_ylabel("Biden Distances")


## Recommendation Engines

Brenden asked me to post this assignment "as early as feasible" so that he could use these ideas to recommend songs. Here we are, last day of classes, and I'm posting this exercise. Sorry Brenden!

But the ideas in this exercise can work directly for recommendation engines. Who are the five speakers who are most similar to Biden? How could you use this for song lyrics?
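
As one way to think about the song-lyrics question, here is a minimal nearest-neighbor sketch. The song names and two-dimensional vectors are invented for illustration; in practice each vector would be an average word vector built from the lyrics, just as we did for speakers. The helper `most_similar` is hypothetical, not part of any library.

```python
import numpy as np


def most_similar(target, vectors, k=5):
    """Rank every other item by cosine similarity to `target`."""
    target_vec = vectors[target]
    ranked = []
    for name, vec in vectors.items():
        if name == target:
            continue  # don't recommend the item to itself
        sim = np.dot(target_vec, vec) / (
            np.linalg.norm(target_vec) * np.linalg.norm(vec)
        )
        ranked.append((name, float(sim)))
    # Highest similarity first, then keep the top k.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up vectors standing in for averaged lyric embeddings.
toy_songs = {
    "Song A": np.array([1.0, 0.0]),
    "Song B": np.array([0.9, 0.1]),
    "Song C": np.array([0.0, 1.0]),
    "Song D": np.array([-1.0, 0.0]),
}

recommendations = most_similar("Song A", toy_songs, k=2)
print(recommendations)
```

The same function would answer the Biden question if you passed it the `scores` dictionary with `"Joe Biden"` as the target and `k=5`.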


In [None]:
# your code here