Let's work with word embeddings. We've been focusing on texts for the most part, but now we'll look at words directly using word2vec. We'll see what these embeddings look like and then use them to explore word similarity.

Let's start by getting our environment ready.

In [1]:
from text_analytics import TextAnalytics
import os
import pandas as pd

ai = TextAnalytics()
ai.data_dir = os.path.join(".", "data")
print("Done!")

Done!


Now, learning word embeddings takes quite a long time, even longer than the tasks we've been doing so far. Because of that, we'll load pre-trained word embeddings. If you *did* want to do it yourself, here's the code, commented out. And you can see a more detailed recipe in the *text_analytics* package.

In [2]:
file = "economic.nyt.1931-2016.gz"
file = os.path.join(ai.data_dir, file)
#df = pd.read_csv(file)
#print(df)

#ai.train_word2vec(df)

ai.word_vectors = ai.deserialize("w2v_embedding", file + ".w2v_embedding.json")
ai.word_vectors_vocab = ai.deserialize("w2v_vocab", file + ".w2v_vocab.json")
    
print(ai.word_vectors)
print(list(ai.word_vectors_vocab.keys())[0:20])

[[-0.1235968   0.04748945  0.16285081 ...  0.03241363 -0.03300684
   0.26601192]
 [-0.0030584   0.04572405  0.10358463 ... -0.00173236 -0.10827369
   0.20938973]
 [-0.19914334  0.03242941  0.0766376  ...  0.12763536  0.02289162
   0.0425519 ]
 ...
 [ 0.05791736  0.24080583  0.11852352 ...  0.13066971  0.03170295
   0.20842037]
 [-0.05558461  0.00972249  0.02807915 ...  0.00696762  0.19038115
   0.08795328]
 [ 0.04808212  0.17134945 -0.06976075 ...  0.02019391  0.03982212
   0.0915439 ]]
['the_DET', 'of_ADP', 'a_DET', 'and_CCONJ', 'in_ADP', 'number_NOUN', 'to_PART', 'to_ADP', 'for_ADP', 'on_ADP', 'at_ADP', 'is_AUX', 'by_ADP', 'was_AUX', 'with_ADP', 'that_SCONJ', 'as_SCONJ', 'his_DET', 'from_ADP', 'it_PRON']


We have now loaded the embeddings for the vocabulary. These embeddings are trained on lead paragraphs from *The New York Times* from 1931 to 2016. Now let's get them ready to work with.

In [3]:
#Build an index of each word
y = list(ai.word_vectors_vocab.keys())

Let's take a look at some of our vocabulary words.

In [4]:
import random

# random.sample needs a sequence; dict views raise a TypeError in Python 3.11+
sample = random.sample(list(ai.word_vectors_vocab.keys()), 10)
print(sample)

['éminence_NOUN', 'mezzo_ADJ', 'marrieds_NOUN', 'kevles_PROPN', 'garner_PROPN', 'rebar_NOUN', 'nolting_VERB', 'margarette_PROPN', 'liveaction_PROPN', 'kusz_PROPN']


You'll notice right away that these aren't just plain words! There are tags like NOUN and VERB. What's going on?

First, we used PMI (pointwise mutual information) to find phrases *before* we trained the word embeddings. For example, one random sample I drew included "downton abbey". This is a single entity, and PMI is a way to capture that these words together mean something different from what they mean on their own.
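To make the PMI idea concrete, here is a minimal sketch on a made-up toy corpus. It is not the recipe used in the *text_analytics* package: the sentences, the `pmi` helper, and the simple count-based probability estimates are all assumptions for illustration. The point is only that a pair like "downton abbey", whose words rarely appear apart, scores higher than a pair of words that also occur freely on their own.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration (not the NYT data).
sentences = [
    ["downton", "abbey", "returns", "for", "a", "new", "season"],
    ["the", "new", "season", "of", "downton", "abbey", "is", "here"],
    ["a", "new", "tax", "is", "here", "for", "the", "season"],
]

# Count single words and adjacent word pairs.
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
n_unigrams = sum(unigrams.values())
n_bigrams = sum(bigrams.values())

def pmi(w1, w2):
    """Pointwise mutual information: log2 of p(w1, w2) / (p(w1) * p(w2))."""
    if bigrams[(w1, w2)] == 0:
        return float("-inf")  # the pair never occurs together
    p_pair = bigrams[(w1, w2)] / n_bigrams
    p_w1 = unigrams[w1] / n_unigrams
    p_w2 = unigrams[w2] / n_unigrams
    return math.log2(p_pair / (p_w1 * p_w2))

pmi_phrase = pmi("downton", "abbey")  # words that only occur together
pmi_loose = pmi("new", "season")      # "new" also occurs in "new tax"
print(pmi_phrase, pmi_loose)
```

In a real pipeline, pairs whose PMI passes some threshold would be merged into a single token before training; the threshold and the merging step are left out of this sketch.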

Second, we have these tags like NOUN and VERB. These are part-of-speech tags, or POS tags for short. They tell us the syntactic category of each word. We got them using the *spaCy* package. You can see how by looking at the *train_word2vec* and *clean* functions in our *text_analytics* package. This gives us the categories below, which you have probably heard of before. If not, you can read more about them [here](https://universaldependencies.org/u/pos/all.html).

| Open-class words | Closed-class words | Other |
| :------------- | :----------: | ------: |
| ADJ (adjective) | ADP (adposition, such as a preposition) | PUNCT (punctuation) |
| ADV (adverb) | AUX (auxiliary verb) | SYM (symbol) |
| INTJ (interjection) | CCONJ (coordinating conjunction) | X (other) |
| NOUN (noun) | DET (determiner, like "the") | |
| PROPN (proper noun) | NUM (number) | |
| VERB (verb) | PART (particle) | |
| | PRON (pronoun) | |
| | SCONJ (subordinating conjunction) | |

Just a word of explanation here: *open-class words* form an unbounded set in every language. We can always make up new nouns, for example. So these are the tags we have to learn to apply to words we've never seen before. *Closed-class words*, by contrast, form a small, fixed set. For example, all the closed-class words in English are in our list of function words. So we already know something about those words, and we can expect to have seen all the closed-class words there are.

So, why did we add these syntactic tags before we learned word embeddings? This is the easiest way to disambiguate words. In the lecture, we talked about words like "table" that can be either a noun or a verb. And they have very different meanings, sometimes completely unrelated, in each form. So, we can add the part-of-speech tags to keep track of which word sense we're dealing with.
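Here is a small sketch of what the tagged tokens look like. The tags are written by hand rather than produced by *spaCy*, and the `add_pos_suffix` helper and the lowercasing step are assumptions based on the vocabulary entries printed above (such as `the_DET`), not the package's actual code.

```python
# Hand-tagged tokens (hypothetical; in the notebook the tags come from spaCy).
tagged = [
    ("They", "PRON"), ("table", "VERB"), ("the", "DET"), ("motion", "NOUN"),
    ("at", "ADP"), ("the", "DET"), ("table", "NOUN"),
]

def add_pos_suffix(tagged_tokens):
    """Join each lowercased word to its tag with an underscore."""
    return [f"{word.lower()}_{pos}" for word, pos in tagged_tokens]

tokens = add_pos_suffix(tagged)
print(tokens)
```

Because `table_VERB` and `table_NOUN` are now different strings, each one gets its own embedding during training.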

Now let's do some coding! The block below will randomly choose one word, show us that word, and then show us the five most similar words. *Similarity* here is calculated using cosine distance, applied to words instead of texts.

In [5]:
# random.sample needs a sequence; dict views raise a TypeError in Python 3.11+
sample = random.sample(list(ai.word_vectors_vocab.keys()), 1)[0]
index = ai.word_vectors_vocab[sample]
sample, closest = ai.linguistic_distance(x = ai.word_vectors, y = y, sample = index, n = 5, metric="cosine")

print(sample, closest)

wildlings_NOUN ['cyclamens_NOUN', 'kalanchoes_NOUN', 'summerflowering_NOUN', 'bellflowers_NOUN', 'espaliers_NOUN']


And there you go! Every time you re-run the code block you will see a different set of similar words.
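If you are curious what a nearest-neighbor lookup like this involves, here is a minimal sketch in plain Python. It is not the implementation of `ai.linguistic_distance`, which is not shown here; the `cosine_similarity` and `nearest_words` helpers and the three-dimensional toy vectors are made up for illustration. Cosine distance is simply one minus cosine similarity, so the most similar words are the ones with the smallest distance.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_words(vectors, words, index, n=5):
    """Return the n words most similar to words[index], excluding itself."""
    scores = [
        (words[i], cosine_similarity(vectors[index], vec))
        for i, vec in enumerate(vectors)
        if i != index
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

# Toy vocabulary and vectors (hypothetical values, not real embeddings).
toy_words = ["corn_NOUN", "wheat_NOUN", "oats_NOUN", "bond_NOUN", "stock_NOUN"]
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
]

neighbors = nearest_words(toy_vectors, toy_words, index=0, n=2)
print(neighbors)
```

With real embeddings you would vectorize this with NumPy rather than loop over the vocabulary, but the arithmetic is the same.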

Let's load word embeddings from the corpus of congressional speeches that covers the same time period.

In [6]:
file2 = "economic.congress.1931-2016.gz"
file2 = os.path.join(ai.data_dir, file2)

congress_word_vectors = ai.deserialize("w2v_embedding", file2 + ".w2v_embedding.json")
congress_word_vectors_vocab = ai.deserialize("w2v_vocab", file2 + ".w2v_vocab.json")

#Build an index of each word
congress_y = list(congress_word_vectors_vocab.keys())

Now, let's choose a word from the NYT vocabulary and, if it is present in both corpora, compare the most similar words computed from each one.

In [7]:
# random.sample needs a sequence; dict views raise a TypeError in Python 3.11+
sample = random.sample(list(ai.word_vectors_vocab.keys()), 1)[0]
index = ai.word_vectors_vocab[sample]

if sample in congress_word_vectors_vocab:
    print(sample)
    congress_index = congress_word_vectors_vocab[sample]
    
    sample, closest = ai.linguistic_distance(x = ai.word_vectors, y = y, sample = index, n = 10, metric="cosine")
    print("\n", "NYT", sample, closest)
    
    sample, closest = ai.linguistic_distance(x = congress_word_vectors, y = congress_y, sample = congress_index, n = 10, metric="cosine")
    print("\n", "Congress", sample, closest)
    
else:
    print("Try again! This word is not found in both corpora.")

oats_VERB

 NYT oats_VERB ['corn_VERB', 'wheat_VERB', 'soybeans_VERB', 'oats_NOUN', 'soybeans_NOUN', 'soy_NOUN', 'rye_VERB', 'oldcrop_PROPN', 'corn_NOUN', 'lard_NOUN']

 Congress oats_VERB ['barley_NOUN', 'rye_NOUN', 'grains_VERB', 'flaxseed_VERB', 'sorghum_NOUN', 'soybeans_NOUN', 'grains_NOUN', 'oats_NOUN', 'oats_ADJ', 'sorghums_NOUN']
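One simple way to put a number on how much the two corpora agree is to measure the overlap between the two neighbor lists. The sketch below copies the two lists from the output above; using the Jaccard index (shared items divided by all distinct items) is my own choice for illustration, not something the *text_analytics* package provides.

```python
# Neighbor lists copied from the output above for oats_VERB.
nyt = ['corn_VERB', 'wheat_VERB', 'soybeans_VERB', 'oats_NOUN', 'soybeans_NOUN',
       'soy_NOUN', 'rye_VERB', 'oldcrop_PROPN', 'corn_NOUN', 'lard_NOUN']
congress = ['barley_NOUN', 'rye_NOUN', 'grains_VERB', 'flaxseed_VERB', 'sorghum_NOUN',
            'soybeans_NOUN', 'grains_NOUN', 'oats_NOUN', 'oats_ADJ', 'sorghums_NOUN']

# Words that appear in both lists, kept in NYT order.
congress_set = set(congress)
shared = [word for word in nyt if word in congress_set]

# Jaccard index: size of the intersection over size of the union.
jaccard = len(set(nyt) & congress_set) / len(set(nyt) | congress_set)

print(shared)
print(round(jaccard, 3))
```

Both corpora place this word among grains, but only two of the ten neighbors are the same, so the specific associations differ quite a bit.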
