## The Basics of Word Embeddings

The basic goal of word embeddings is to develop a model of your dataset in which the semantic relationships between words are given a numerical representation. If that sounds abstract, it is. Essentially, your computer uses a predefined method to analyse the way each word is distributed in your corpus. Over many passes, it attempts to "learn" the deep relationships among words, and to place words that occur in similar contexts close to one another.

You can think of it like this. Imagine that you have the tokens for `Newton`, `Leibniz`, and `Descartes`, three of the most important physicists of the seventeenth century. This is simplifying things a bit, but the goal of a word embedding is to create a model in which your computer assigns a huge number of mathematical relationships to each of these words and the words that tend to occur with them. The key is that the mathematical relationships between these different words and their contexts should be similar. So, for instance, the relationship between `Newton` and `English` should be similar to that between `Leibniz` and `German`, or `Descartes` and `French`. Now, imagine that all these words are similarly related to all the other words in the dataset, so that, for instance, `English` and `London` have a relationship that is mathematically similar to that of `French` and `Paris`, or, if you prefer, `England` and `London` and `France` and `Paris`. These mathematical relationships ultimately create an enormous, multidimensional web of semantic interconnectedness, which allows us to study the deep relationships between any words as they occur in the dataset.
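To make the idea of "similar relationships" a little more concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented by hand for illustration (a trained model learns its own values, in many more dimensions), and the helper names `offset` and `cosine_similarity` are hypothetical, not part of any library.

```python
import math

# Hand-made toy vectors: these numbers are assumptions for illustration,
# not the output of a trained model.
vectors = {
    "newton":  [0.90, 0.10, 0.80],
    "english": [0.10, 0.10, 0.80],
    "leibniz": [0.85, 0.70, 0.25],
    "german":  [0.10, 0.70, 0.20],
}

def offset(word_a, word_b):
    """Return the difference vector word_a - word_b."""
    return [a - b for a, b in zip(vectors[word_a], vectors[word_b])]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# The "relationship" between a thinker and a nationality, as a direction
newton_english = offset("newton", "english")
leibniz_german = offset("leibniz", "german")
# A mismatched pair, for contrast
newton_german = offset("newton", "german")

matched = cosine_similarity(newton_english, leibniz_german)
mismatched = cosine_similarity(newton_english, newton_german)
print("matched pairs:   ", round(matched, 3))
print("mismatched pairs:", round(mismatched, 3))
```

In this toy setup, the two matched offsets point in nearly the same direction, while the mismatched one does not. That is the sense in which the relationships are "mathematically similar".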

## 0. Visualizing Word Vectors with a Toy Corpus

In this first video, I'm going to very briefly try to show you what a word vector model "looks" like. Of course, that's an impossible thing to do properly, because real vector models have far more dimensions than we can draw (typically a hundred or more), but I'm essentially going to create a stripped-down, two-dimensional model. In doing so, I'm drawing in part on an excellent [tutorial by Jason Brownlee](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/). I'll move rather quickly through this segment, explaining as I go in the video, so I'll keep my commentary in the markdown to a minimum.

First, I create a list of sentences, import a few standard libraries like nltk and numpy, and do a bit of cleaning.

In [None]:
sentence_list = [
     ("Richardson wrote novels."),
     ("Richardson wrote fiction."), 
     ("Richardson wrote books."), 
     ("Fielding wrote novels. An example is Amelia."), 
     ("Austen wrote novels."), 
     ("Richardson is a novelist and Austen is a novelist."),
     ("Fielding and Austen both wrote novels."),
     ("A novelist writes novels."), 
     ("A novelist writes fiction."),
     ("A novelist writes books."),
     ("Novels are books."),
     ("Novels are books of fiction."),
     ("Richardson is an example of a novelist."),
     ("Fielding is an example of a novelist."),
     ("Austen is an example of a novelist."),
     ("Richardson wrote Pamela."),
     ("Fielding wrote Amelia."),
     ("Austen wrote Emma."),
     ("Pamela and Amelia are novels."),
     ("Amelia is an example of a fiction."),
     ("Emma is a fiction."),
     ("Amelia and Emma are books."),
     ("Novels are fiction.")]

In [None]:
import nltk
nltk.download("punkt")
from string import punctuation
import numpy as np

cleaned_sentences = []

for sentence in sentence_list:
    sentence_txt = ''.join(c for c in sentence if c not in punctuation)
    sentence_txt = sentence_txt.lower()
    sentence_words = nltk.tokenize.word_tokenize(sentence_txt)
    cleaned_sentences.append(sentence_words)


Then, I get Word2Vec from gensim, which I'll use to generate my toy model.

In [None]:
import gensim
from gensim.models.word2vec import Word2Vec

In [None]:
# `vector_size` is the gensim 4.x name for this parameter; gensim 3.x calls it `size`.
toy_model = Word2Vec(cleaned_sentences, min_count=1, vector_size=2)

Once I have that, I can examine my model, view the words that it contains, and print it as an array...

In [None]:
print(toy_model)

In [None]:
# gensim 4.x: `index_to_key` lists the vocabulary (gensim 3.x used `wv.vocab`).
toy_words = list(toy_model.wv.index_to_key)
print(toy_words)

In [None]:
print(toy_model.wv['novels'])

In [None]:
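# Look up every word's vector at once; rows follow the order of `index_to_key`.
# (gensim 3.x: toy_model.wv[toy_model.wv.vocab])
toy_array = toy_model.wv[toy_model.wv.index_to_key]
print(toy_array)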

Finally, I can go ahead and view the results as a scatterplot.

In [None]:
import matplotlib.pyplot as plt

plt.scatter(toy_array[:, 0], toy_array[:, 1])
# gensim 4.x vocabulary list (gensim 3.x: list(toy_model.wv.vocab))
words = list(toy_model.wv.index_to_key)
for i, word in enumerate(words):
    plt.annotate(word, xy=(toy_array[i, 0], toy_array[i, 1]))
# Draw the two axes once, outside the loop
plt.axhline(0, color='dimgrey')
plt.axvline(0, color='dimgrey')
plt.show()

What we see is all the words in my toy set, spread out in two dimensions. I encourage you to try regenerating this mini vector model a few times. You'll find that it changes radically every time, due to the small size of my corpus and the random initialization and sampling involved in training.

## 1. Initial Steps

Alright. Let's get down to business. To get started, create a dictionary out of the files in our `/working_set_cleaned/` corpus. This is, admittedly, a rather large set (although not terribly large, by word-embedding standards), and the scripts in this notebook can take a fair time to run. If you find that your script is taking too long, stop it. Instead, you can feed in a smaller directory, like `/sec5/chunked_files_principles/`, which will load much more quickly, while still producing reasonable results in many cases.

In [None]:
import os
from pathlib import Path
home = str(Path.home())

textdirectory = home + '/dh2/corpora_and_metadata/working_set_cleaned/'

os.chdir(textdirectory)
print(os.getcwd())

In [None]:
# Get list of filenames
import glob
filenames = glob.glob("*.txt")
print(filenames)

First, we'll tokenize each file in our corpus and collect the results in one giant list.

<b> Skip down to Section 2 of this notebook if you would prefer to make a model with stemmed files </b>

In [None]:
import nltk
list_files = []
for i in filenames:
    with open (str(i),'r') as file:
        readFile = file.read()
        tokenized_file = nltk.tokenize.word_tokenize(readFile)
        list_files.append(tokenized_file)

That done, let's return to our folder for Section 5.

In [None]:
textdirectory = home + '/dh2/sec5/'

os.chdir(textdirectory)
print(os.getcwd())

We'll then analyse those separate files to create a single dictionary of all the unique tokens in that list of files. For this lesson, we'll be using [Gensim's word2vec model](https://radimrehurek.com/gensim/models/word2vec.html).

In [None]:
import gensim
from gensim import corpora

dictionary = corpora.Dictionary(list_files)

print(dictionary)
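The idea behind this dictionary can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `build_token_ids` and the example documents are hypothetical, and gensim's `Dictionary` also records frequency statistics and may assign ids in a different order.

```python
def build_token_ids(tokenized_docs):
    """Give each unique token an integer id, in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                # The next free id is simply the current size of the mapping
                token_to_id[token] = len(token_to_id)
    return token_to_id

# Two tiny, already-tokenized "files" (made up for this example)
example_docs = [
    ["novels", "are", "books"],
    ["novels", "are", "fiction"],
]

token_ids = build_token_ids(example_docs)
print(token_ids)
print(len(token_ids), "unique tokens")
```

However many files you feed in, each distinct token appears only once in the resulting mapping.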

## Train Your Model

In [None]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count

# Train the Word2Vec model. By default, the resulting vectors have 100 dimensions.
# min_count=0 keeps every token, however rare.
model = Word2Vec(list_files, min_count=0, workers=cpu_count())

# Save and Load Model
model.save('newmodel')
model = Word2Vec.load('newmodel')

This is where the fun really begins. We'll be concentrating on two simple operations with this model. In the first, we'll use words that exist in clear binary pairs like young/old, man/woman, french/english (perhaps), and long/short to add and subtract qualities from another word.

The format of this sort of operation is a little counterintuitive, in that it splits words into `positive` and `negative` categories:

 `result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=15)`
 
 `print(result)`
 
 `queen`

   Or, if you prefer:
   
 `king - man + woman = queen`
 
Place the word to be manipulated in the first position of the `positive` list, the word to be added alongside it, and the word to be subtracted in the `negative` list.

In [None]:
# Query through `model.wv`; calling `most_similar` on the model itself was removed in gensim 4.
result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=15)
print(result)
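If the `positive` and `negative` lists still feel opaque, the following sketch shows the arithmetic on a handful of hand-made, two-dimensional vectors. It is a simplification under stated assumptions: the vectors are invented, the function name `analogy` is hypothetical, and gensim's own `most_similar` additionally normalizes each vector before combining them.

```python
import math

# Invented toy vectors (assumed values, not learned from a corpus).
# First number: roughly "royalty"; second number: roughly "masculinity".
toy_vectors = {
    "king":   [0.95, 0.90],
    "queen":  [0.90, 0.10],
    "man":    [0.10, 0.90],
    "woman":  [0.10, 0.10],
    "prince": [0.70, 0.90],
    "puppy":  [0.05, 0.50],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, topn=3):
    """Add the positive vectors, subtract the negative ones, and rank
    the remaining words by cosine similarity to the result."""
    target = [0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, toy_vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, toy_vectors[word])]
    # The query words themselves are left out of the ranking
    query_words = set(positive) | set(negative)
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in toy_vectors.items()
        if word not in query_words
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# king - man + woman
toy_result = analogy(positive=["king", "woman"], negative=["man"])
print(toy_result)
```

With these particular toy values, `queen` comes out on top. With a real corpus, the top result depends entirely on how the words are actually used in your texts.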

Try the following examples, or create your own. You won't always get perfect results, of course, but it's interesting to explore the dataset, nonetheless.

`newton - english + french =`

`king - man + woman =`

`husband - male + female = `

`puppy - young + old =`

`dog - puppy + cat = `

`grandison - man + woman =`

`voltaire - french + english =`

In [None]:
# Replace each ??? with your own words (and a number for topn) before running
result = model.wv.most_similar(positive=['???', '???'], negative=['???'], topn=???)
print(result)

You can also perform a simpler operation, by checking out the `most_similar` tokens that are associated with a word. This looks for words in your model that are closely associated with your word of interest.

Attempt this for the token `principle` (we'll use the results as queries for our concept-search algorithm in the next unit), and then test out these words and any others you might like: `literature`, `novel`, `wig`, `garter`, `saladin`, `damask`, `liberty`, `puppy`, `pirate`.

In [None]:
model.wv.most_similar('principle', topn=30)

Next, make a new model with stemmed tokens, rather than just the straight tokens from before; it will be interesting to compare the two. Make sure to change the variable and file name for the model, especially when saving, so you don't overwrite the old one.

## 2. Start here to make a model with stemmed files

In [None]:
import nltk
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

list_stemmed_files = []
for i in filenames:
    with open (str(i),'r') as file:
        readFile = file.read()
        tokenized_file = nltk.tokenize.word_tokenize(readFile)
        stemmed_file = [ps.stem(word) for word in tokenized_file]
        list_stemmed_files.append(stemmed_file)
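To get a feel for what stemming does to the vocabulary, here is a small sketch using the same `PorterStemmer`. The word list is an arbitrary example chosen for illustration. Because a stemmed model only knows stems, any word you query later needs to be stemmed in the same way before you look it up.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Arbitrary sample words (chosen for this example, not taken from the corpus)
sample_words = ["principle", "principles", "novel", "novels", "writes", "writing"]

# Map each word to its stem so that merged forms are easy to spot
stems = {word: ps.stem(word) for word in sample_words}
for word, stem in stems.items():
    print(word, "->", stem)

# Stem a query word before looking it up in a stemmed model
query = ps.stem("principles")
print("query token:", query)
```

Singular and plural forms collapse into a single token, which is why the stemmed model ends up with a smaller vocabulary than the unstemmed one.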

In [None]:
import gensim
from gensim import corpora

stemmed_dictionary = corpora.Dictionary(list_stemmed_files)

print(stemmed_dictionary)

In [None]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count

# Train the Word2Vec model. By default, the resulting vectors have 100 dimensions.
# min_count=0 keeps every stem, however rare.
stemmed_model = Word2Vec(list_stemmed_files, min_count=0, workers=cpu_count())

# Save and Load Model
stemmed_model.save('newmodel_stemmed')
stemmed_model = Word2Vec.load('newmodel_stemmed')