![](https://upload.wikimedia.org/wikipedia/commons/6/65/RothwellMaryShelley.jpg) 
**The task**: Today we're going to implement a skipgram model to learn word vectors. We'll use the text of Mary Wollstonecraft Shelley's novel *Frankenstein; or, The Modern Prometheus*.


I've given you an outline of the code you'll need to get this running. You'll need to fill in the missing parts that I've marked.

Your deliverables:
- Word vectors for the most frequent words in *Frankenstein* (the vocabulary size is set by `max_features` in the `CountVectorizer` below)
- A function that takes a target word and uses the word vectors to find the most similar words to the target

In [None]:
#######################
# standard code block #
#######################

# see https://ipython.readthedocs.io/en/stable/interactive/magics.html
%pylab inline

# sets backend to render higher res images
%config InlineBackend.figure_formats = ['retina']

#######################
#       imports       #
#######################
# import pandas as pd
import seaborn as sns
# import sklearn

sns.set_style("whitegrid")

In [None]:
# First, let's grab the text of Frankenstein
# (thankfully, it's in the public domain!)

import requests

url = 'http://umich.edu/~umfandsf/other/ebooks/frank10.txt'
r = requests.get(url)

shelley_text = r.text

print(shelley_text[:300])

In [None]:
# Next, we'll pre-process the text by adding periods where
# we want sentence breaks to occur.
import re

# replace 2 or more new lines with a period and two spaces
shelley_text = re.sub(r"\n{2,}", ".  ", shelley_text)

# replace single new lines with a space
shelley_text = re.sub(r"\n", " ", shelley_text)
print(shelley_text[:300])

In [None]:
# Now we're going to use NLTK's sentence tokenizer to split the text into sentences. Be careful! The sentence tokenizer is sensitive to preprocessing steps like lowercasing text.

import nltk

# newer NLTK releases may ask for the 'punkt_tab' resource instead
nltk.download('punkt')

from nltk import sent_tokenize

sentences = sent_tokenize(shelley_text)

# skip the first four sentences so the data starts at "You will rejoice to hear..."
sentences = sentences[4:]

print(f"We found {len(sentences)} sentences!", end="\n\n")

for sent in sentences[:3]:
    print(sent[:200], end="\n\n")

In [None]:
# the next step uses CountVectorizer to build a sentence vocabulary

from sklearn.feature_extraction.text import CountVectorizer

countvect = CountVectorizer(max_features=2500)

bow = countvect.fit_transform(sentences)

# Exercise 1

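You're going to write the next step yourself. We need to build the observations we'll use to train the skipgram model.

The task we'll train on asks a model to predict the center (target) word given the words in its context. (Strictly speaking, this direction is word2vec's continuous bag-of-words, or CBOW, formulation; skip-gram proper predicts the context words from the center word. We'll keep calling it skipgram here.) For instance, if we set `window_size=2`, then the first observation is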

$$
\underbrace{\text{You},  \text{will}}_\text{context},  \overbrace{\text{rejoice}}^\text{target},  \underbrace{\text{to},  \text{hear}}_\text{context}
$$

And the second observation is

$$
\underbrace{\text{will},  \text{rejoice}}_\text{context},  \overbrace{\text{to}}^\text{target},  \underbrace{\text{hear},  \text{that}}_\text{context}
$$

Using `window_size=2`, construct two lists:
- a list of context observations
- a list of targets


Here's what the first five look like:

```python
>>>context[:5]
['you will to hear',
 'will rejoice hear that',
 'rejoice to that no',
 'to hear no disaster',
 'hear that disaster has']
>>>target[:5]
['rejoice', 'to', 'hear', 'that', 'no']
```


And here is what happens at the first sentence boundary (a window never spans two sentences):

```python
>>>context[15:20]
['which you regarded with',
 'you have with such',
 'have regarded such evil',
 'regarded with evil forebodings',
 'arrived here and my']
>>>target[15:20]
['have', 'regarded', 'with', 'such', 'yesterday']
```
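
If the windowing logic is hard to picture, here is a minimal sketch on a single toy sentence. The helper name `sliding_windows` and the variable `toy_tokens` are made up for illustration and are not part of the notebook; building the full `context` and `target` lists over every sentence is still up to you.

```python
def sliding_windows(tokens, window_size=2):
    """Yield (context, target) string pairs from one tokenized sentence."""
    chunk_size = 2 * window_size + 1
    # slide a window of chunk_size tokens across the sentence
    for start in range(len(tokens) - chunk_size + 1):
        chunk = tokens[start:start + chunk_size]
        center = chunk[window_size]  # the middle token is the target
        neighbours = chunk[:window_size] + chunk[window_size + 1:]
        yield " ".join(neighbours), center


# toy input: the first seven tokens of the example sentence
toy_tokens = "you will rejoice to hear that no".split()
pairs = list(sliding_windows(toy_tokens))

for ctx, tgt in pairs:
    print(f"{ctx!r} -> {tgt!r}")
```

A sentence shorter than `chunk_size` produces no pairs at all in this sketch, which is one way to keep windows from crossing sentence boundaries.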

In [None]:
# The analyzer reproduces the preprocessing and tokenization
# used in CountVectorizer. You will probably need this to split
# sentences up in a way that our CountVectorizer understands

analyzer = countvect.build_analyzer()

# an example of what this looks like
analyzer(
    "You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings."
)

In [None]:
window_size = 2
chunk_size = window_size * 2 + 1

In [None]:
context = []
target = []

### BEGIN SOLUTION

### END SOLUTION

In [None]:
# provided tests
assert context[:5] == [
    'you will to hear',
    'will rejoice hear that',
    'rejoice to that no',
    'to hear no disaster',
    'hear that disaster has',
]
assert target[:5] == ['rejoice', 'to', 'hear', 'that', 'no']

In [None]:
assert context[15:20] == [
    'which you regarded with',
    'you have with such',
    'have regarded such evil',
    'regarded with evil forebodings',
    'arrived here and my',
]
assert target[15:20] == ['have', 'regarded', 'with', 'such', 'yesterday']

Ultimately, we want our inputs and outputs to be in a form that our models can understand. So we'll use a bag-of-words style encoding where each row will count the number of times a word appears in the context.

Thankfully, the CountVectorizer we just trained can do this for us!

The same model also has the index of each word stored in `countvect.vocabulary_`. We'll use -1 for any target word not in the vocabulary (remember that we're only keeping the `max_features` most frequent words).

Now our `X` is a bag-of-words representation of the context for each target we found in the text. Examine the nonzero elements in the rows to verify (the exact column indices depend on the vocabulary size):

```python
>>>context[0]
'you will to hear'
>>>nonzero(X[0,:])
(array([0, 0, 0, 0], dtype=int32), array([412, 882, 967, 994], dtype=int32))
>>>countvect.get_feature_names()[412]
'hear'
>>>countvect.get_feature_names()[882]
'to'
>>>countvect.get_feature_names()[967]
'will'
>>>countvect.get_feature_names()[994]
'you'
```
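
To see how the encoding and the -1 convention behave in isolation, here is a small sketch on a toy vocabulary. The names `toy_vect`, `toy_context`, and `toy_target`, and the sentences themselves, are assumptions for illustration only; the real `X` and `y` come from the notebook's own `countvect`, `context`, and `target`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# fit a tiny vocabulary: hear, rejoice, to, will, you (alphabetical order)
toy_vect = CountVectorizer()
toy_vect.fit(["you will rejoice to hear"])

toy_context = ["you will to hear", "evil forebodings you will"]
toy_target = ["rejoice", "forebodings"]

# each row counts how often each vocabulary word occurs in the context;
# words outside the vocabulary are silently ignored
toy_X = toy_vect.transform(toy_context)

# targets become vocabulary indices, with -1 for out-of-vocabulary words
toy_y = np.array([toy_vect.vocabulary_.get(t, -1) for t in toy_target])

print(toy_X.toarray())
print(toy_y)
```

Note that out-of-vocabulary context words simply vanish from the row, while out-of-vocabulary targets all collapse into the single label -1.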

In [None]:
X = countvect.transform(context)
# use -1 to represent targets that are not in the vocabulary
y = array([countvect.vocabulary_.get(t, -1) for t in target])

# Exercise 2

Now that we have our `X` and `y`, let's train a neural network! 

Import scikit-learn's `MLPClassifier` and train it on our data. For skipgrams to work, you should only use one hidden layer (hint: the size of the layer determines the size of your word vectors) and you should use a linear activation (`activation='identity'` in scikit-learn). It's also recommended that you limit the number of iterations (`max_iter`) so that training terminates in a reasonable time. You can set `verbose=True` to get a picture of how the model is training.
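
As a rough sketch of the relevant `MLPClassifier` arguments, the snippet below fits a tiny network on made-up data. The array sizes, the hidden layer width of 3, the iteration cap, and the names `toy_X`, `toy_y`, and `toy_mlp` are placeholders rather than recommended settings; choose your own values for the real data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# made-up data: 60 "contexts" over an 8-word vocabulary, 8 possible targets
rng = np.random.default_rng(0)
toy_X = rng.integers(0, 2, size=(60, 8))
toy_y = np.arange(60) % 8

toy_mlp = MLPClassifier(
    hidden_layer_sizes=(3,),   # one hidden layer with 3 units
    activation="identity",     # linear activation
    max_iter=20,               # stop early; expect a convergence warning
    random_state=0,
)
toy_mlp.fit(toy_X, toy_y)

# coefs_ holds one weight matrix per layer
print(toy_mlp.coefs_[0].shape)  # input -> hidden
print(toy_mlp.coefs_[1].shape)  # hidden -> output
```

The input-to-hidden matrix has one row per input feature and one column per hidden unit, which is why the hidden layer size sets the length of each word vector.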

In [None]:
X.shape

In [None]:
### BEGIN SOLUTION

### END SOLUTION

# Exercise 3

The last step is to use these vectors! Write a function `get_similar(query, n=10)` that returns the `n` most similar words to `query` based on your vectors. Some helpful hints:

- `mlp.coefs_` lets us access the weights of the MLPClassifier (a list with one weight matrix per layer)
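- `feature_names = countvect.get_feature_names()` will give us a list of all the words in the vocabulary so that we can use `feature_names[i]` to get the ith word. (In scikit-learn 1.0 and later the method is called `get_feature_names_out()`; the old name was removed in 1.2.)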
- Your word vectors are likely very high dimensional. Which distance metric works best for high dimensional data?
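
One common choice for high-dimensional vectors is cosine similarity, which compares directions and ignores vector length. Here is a minimal sketch of ranking rows by cosine similarity with NumPy; the helper `most_similar_rows` and the toy vectors and words are invented for illustration and are not the notebook's `get_similar` or its learned vectors.

```python
import numpy as np


def most_similar_rows(vectors, query_index, n=2):
    """Return (row index, cosine similarity) for the n rows closest to the query row."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    sims = unit @ unit[query_index]               # cosine similarity to the query
    order = np.argsort(-sims)                     # most similar first
    order = order[order != query_index]           # drop the query itself
    return [(int(i), float(sims[i])) for i in order[:n]]


# toy 2-d "word vectors", one row per word
toy_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
toy_words = ["creature", "monster", "cottage", "joy"]

nearest = [(toy_words[i], round(s, 3)) for i, s in most_similar_rows(toy_vecs, 0)]
print(nearest)
```

Turning this into `get_similar` mostly means looking up the query word's row index in the vocabulary and mapping the returned indices back to words.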

In [None]:
### BEGIN SOLUTION

### END SOLUTION

# Try it out!
get_similar("creature", 20)

In [None]:
# Victor's wife and adopted sister
get_similar("elizabeth", 20)

In [None]:


get_similar("justine", 20)

# Extra

It's not part of the exercise, but below is a demo of how we can use these word vectors to look at character relationships and similarities.

In [None]:
from sklearn import manifold
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(metric='cosine', algorithm='brute')
nn.fit(word_vecs)

word_indices = set()

key_words = [
    "monster",
    "safie",
    "agatha",
    "creature",
    "elizabeth",
    "victor",
    "frankenstein",
    "henry",
    "justine",
    "william",
    "felix",
]

for key in key_words:
    word_index = countvect.vocabulary_[key]
    dist, index = nn.kneighbors(word_vecs[[word_index], :], n_neighbors=10)
    word_indices.add(word_index)
    word_indices.update(list(index.flat))

word_indices = list(word_indices)

target_vecs = word_vecs[word_indices, :]

In [None]:
figsize(15, 10)

# note: recent scikit-learn releases rename `n_iter` to `max_iter`
vecs_emb = manifold.TSNE(metric="cosine",
                         n_iter=3000).fit_transform(target_vecs)

sns.scatterplot(x=vecs_emb[:, 0], y=vecs_emb[:, 1], marker='.')
for i, row in enumerate(vecs_emb):
    word_i = word_indices[i]
    word = feature_names[word_i]
    if word in key_words:
        annotate(word, row, color="red", zorder=10,
                 bbox=dict(boxstyle="round", fc="w", ec="k", alpha=.5))
    else:
        annotate(word, row, zorder=-1, bbox=dict(boxstyle="round", fc="w",
                                                 ec="k", alpha=.5))