# Introduction to Natural Language Processing (NLP) in PyTorch

### Word Embeddings

Word embeddings, or word vectors, provide a way of mapping words from a vocabulary into a low-dimensional space, where words with similar meanings are close together. Let's play around with a set of pre-trained word vectors, to get used to their properties. There exist many sets of pretrained word embeddings; here, we use ConceptNet Numberbatch, which provides a relatively small download in an easy-to-work-with format (h5).

To read an h5 file, we'll need to use the `h5py` package. 
If you followed the PyTorch installation instructions in 0A, you should already have it installed.
Otherwise, you can install it with
```Shell 
pip install h5py
```
You may need to re-open this notebook for the installation to take effect.

Below, we use the package to open the `mini.h5` file we just downloaded. 
We extract from the file a list of utf-8-encoded words, as well as their 300-d vectors.

In [None]:
# Load the file and pull out words and embeddings
import random

import h5py

with h5py.File('datasets/mini.h5', 'r') as f:
    all_words = [word.decode('utf-8') for word in f['mat']['axis1'][:]]
    all_embeddings = f['mat']['block0_values'][:]

print("all_words dimensions: {}".format(len(all_words)))
print("all_embeddings dimensions: {}".format(all_embeddings.shape))

print("Random example word: {}".format(random.choice(all_words)))

Now, `all_words` is a list of $V$ strings (what we call our *vocabulary*), and `all_embeddings` is a $V \times 300$ matrix. The strings are of the form `/c/language_code/word`—for example, `/c/en/cat` and `/c/es/gato`.

We are interested only in the English words. We use Python list comprehensions to pull out the indices of the English words, then extract just the English words (stripping the six-character `/c/en/` prefix) and their embeddings.

In [None]:
# Restrict our vocabulary to just the English words


print("word in all_words: {0}".format())
print("all_embeddings dimensions: {0}".format())

print()
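
The filtering step described above can be sketched on a toy vocabulary. This is only an illustration: the four words and the 4-d embeddings are made-up stand-ins for `all_words` and `all_embeddings`, and the names `english_indices`, `english_words`, and `english_embeddings` are assumptions, not necessarily the names used in the original cell.

```python
import numpy as np

# Toy stand-ins for all_words and all_embeddings (hypothetical data, 4-d instead of 300-d)
all_words = ['/c/en/cat', '/c/es/gato', '/c/en/dog', '/c/fr/chien']
all_embeddings = np.arange(16, dtype=np.float32).reshape(4, 4)

# Indices of the entries whose key starts with the English prefix
english_indices = [i for i, w in enumerate(all_words) if w.startswith('/c/en/')]

# Strip the six-character '/c/en/' prefix and keep the matching embedding rows
english_words = [all_words[i][6:] for i in english_indices]
english_embeddings = all_embeddings[english_indices]

print(english_words)             # ['cat', 'dog']
print(english_embeddings.shape)  # (2, 4)
```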

The magnitude of a word vector is less important than its direction; the magnitude can be thought of as representing frequency of use, independent of the semantics of the word. 
Here, we will be interested in semantics, so we *normalize* our vectors, dividing each by its length. 
The result is that all of our word vectors have length 1 and, as such, lie on a unit hypersphere (the high-dimensional analogue of the unit circle). 
The dot product of two unit vectors equals the cosine of the angle between them, and provides a measure of similarity (the bigger the cosine, the smaller the angle).

<img src="Figures/cosine_similarity.png" alt="cosine" style="width: 500px;"/>

We want to look up words easily, so we create a dictionary that maps us from a word to its index in the word embeddings matrix.

Now we are ready to measure the similarity between pairs of words. We use numpy to take dot products.

In [None]:


# A word is as similar to itself as possible:
print('cat\tcat\t', )

# Closely related words still get high scores:
print('cat\tfeline\t', )
print('cat\tdog\t', )

# Unrelated words, not so much
print('cat\tmoo\t', )
print('cat\tfreeze\t', )

# Antonyms are still considered related, sometimes more so than synonyms
print('antonyms\topposites\t', )
print('antonyms\tsynonyms\t', )
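
As a rough sketch of the normalization, dictionary lookup, and dot-product similarity described above, here is a self-contained version on toy data. The three words, their 2-d vectors, and the names `normalized_embeddings`, `index`, and `similarity_score` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary with 2-d embeddings (the real ones are 300-d)
english_words = ['cat', 'feline', 'freeze']
english_embeddings = np.array([[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]])

# Divide each row by its Euclidean length so every vector has norm 1
norms = np.linalg.norm(english_embeddings, axis=1, keepdims=True)
normalized_embeddings = english_embeddings / norms

# Map each word to its row index in the embedding matrix
index = {word: i for i, word in enumerate(english_words)}

def similarity_score(w1, w2):
    """Cosine similarity of two words, computed as a dot product of unit vectors."""
    return float(np.dot(normalized_embeddings[index[w1]],
                        normalized_embeddings[index[w2]]))

print('cat\tcat\t', similarity_score('cat', 'cat'))
print('cat\tfeline\t', similarity_score('cat', 'feline'))
print('cat\tfreeze\t', similarity_score('cat', 'freeze'))
```

In this toy example `cat` and `feline` point in the same direction (so their score is essentially 1), while `freeze` is orthogonal to `cat` (so its score is essentially 0).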

We can also find, for instance, the most similar words to a given word.

In [None]:
print()
print()
print()
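
One possible way to find the nearest words is to score every word in the vocabulary against a query vector and sort. The sketch below assumes unit-length embeddings and uses made-up 2-d data; `closest_to_vector` is the name used in this notebook, but its body here, and the helper `most_similar`, are assumptions rather than the original implementation.

```python
import numpy as np

# Hypothetical toy data: rows are already unit-length 2-d embeddings
english_words = ['cat', 'kitten', 'dog', 'freeze']
normalized_embeddings = np.array([[1.0, 0.0], [0.96, 0.28], [0.6, 0.8], [0.0, 1.0]])
index = {word: i for i, word in enumerate(english_words)}

def closest_to_vector(v, n):
    """Return the n words whose embeddings have the largest dot product with v."""
    scores = normalized_embeddings @ v
    best = np.argsort(-scores)[:n]   # indices sorted by decreasing score
    return [english_words[i] for i in best]

def most_similar(word, n):
    """Nearest neighbours of a word, excluding the word itself."""
    candidates = closest_to_vector(normalized_embeddings[index[word]], n + 1)
    return [w for w in candidates if w != word][:n]

print(most_similar('cat', 2))
```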

We can also use `closest_to_vector` to find words "nearby" vectors that we create ourselves. This allows us to solve analogies. For example, in order to solve the analogy "man : brother :: woman : ?", we can compute a new vector `brother - man + woman`: the meaning of brother, minus the meaning of man, plus the meaning of woman. We can then ask which words are closest, in the embedding space, to that new vector.

In [None]:


print()
print()
print()
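
The analogy computation `brother - man + woman` can be sketched as follows. The five words and their 3-d vectors are invented so that the analogy works out cleanly, and excluding the three query words from the answer is a design choice assumed here, not something the original code is known to do.

```python
import numpy as np

# Hypothetical toy embeddings: axis 0 ~ "male", axis 1 ~ "female", axis 2 ~ "sibling"
english_words = ['man', 'woman', 'brother', 'sister', 'cat']
raw = np.array([[1., 0., 0.], [0., 1., 0.], [1., 0., 1.], [0., 1., 1.], [-1., -1., 0.]])
normalized_embeddings = raw / np.linalg.norm(raw, axis=1, keepdims=True)
index = {w: i for i, w in enumerate(english_words)}

def closest_to_vector(v, n):
    """Return the n words whose embeddings have the largest dot product with v."""
    scores = normalized_embeddings @ v
    return [english_words[i] for i in np.argsort(-scores)[:n]]

def solve_analogy(a1, b1, a2, n=1):
    """Solve a1 : b1 :: a2 : ? by searching near the vector b1 - a1 + a2."""
    e = normalized_embeddings
    target = e[index[b1]] - e[index[a1]] + e[index[a2]]
    target = target / np.linalg.norm(target)
    # Ask for a few extra candidates, then drop the words used in the query
    candidates = closest_to_vector(target, n + 3)
    return [w for w in candidates if w not in (a1, b1, a2)][:n]

print(solve_analogy('man', 'brother', 'woman'))
```

Without the exclusion step, one of the query words (here `woman`) often ranks near the top, which is one of the problems you may notice when experimenting.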

These three results are quite good, but in general, the results of these analogies can be disappointing. Try experimenting with other analogies, and see if you can think of ways to get around the problems you notice (i.e., modifications to the `solve_analogy` algorithm).

### Using word embeddings in deep models

Word embeddings are fun to play around with, but their primary use is that they allow us to think of words as existing in a continuous, Euclidean space; we can then use an existing arsenal of techniques for machine learning with continuous numerical data (like logistic regression or neural networks) to process text.

Let's take a look at an especially simple version of this. 
We'll perform *sentiment analysis* on a set of movie reviews: in particular, we will attempt to classify a movie review as positive or negative based on its text.

We will use a [Simple Word Embedding Model](http://people.ee.duke.edu/~lcarin/acl2018_swem.pdf) (SWEM, Shen et al. 2018) to do so. 
We will represent a review as the *mean* of the embeddings of the words in the review. 
Then we'll train a two-layer MLP (a neural network) to classify the review as positive or negative.
As you might guess, using just the mean of the embeddings discards a lot of the information in a sentence, but for tasks like sentiment analysis, it can be surprisingly effective.

If you don't have it already, download the `movie-simple.txt` file. 
Each line of that file contains 

1. the numeral 0 (for negative) or the numeral 1 (for positive), followed by
2. a tab (the whitespace character), and then
3. the review itself.

Let's first read the data file, parsing each line into an input representation and its corresponding label.
Again, since we're using SWEM, we're going to take the mean of the word embeddings for all the words as our input.

In [None]:



# This function converts a line of our data file into
# a tuple (x, y), where x is a 300-dimensional representation
# of the words in a review, and y is its label.

# Pull out the first character: that's our label (0 or 1)


# Split the line into words using Python's split() function


# Look up the embeddings of each word, ignoring words not
# in our pretrained vocabulary.


# Take the mean of the embeddings


# Apply the function to each line in the file.


# Concatenate all examples into a numpy array



In [None]:
print("Shape of inputs: {}".format())
print("Shape of labels: {}".format())
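
The parsing steps listed in the comments above might look roughly like this. The sketch runs on two in-memory lines and a three-word, 2-d toy vocabulary instead of `movie-simple.txt` and the real embeddings; the function name `convert_line_to_example` is an assumption.

```python
import numpy as np

# Hypothetical toy embeddings (2-d) and two lines in the "label<TAB>review" format
normalized_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
index = {'good': 0, 'bad': 1, 'movie': 2}
lines = ["1\tgood movie", "0\tbad bad film"]

def convert_line_to_example(line):
    """Turn 'label<TAB>review' into (mean embedding, label)."""
    y = int(line[0])            # first character is the label (0 or 1)
    words = line[2:].split()    # skip the label and the tab, then split on whitespace
    # Keep only words that are in the pretrained vocabulary
    # (this sketch assumes every review has at least one such word)
    embeddings = [normalized_embeddings[index[w]] for w in words if w in index]
    x = np.mean(np.vstack(embeddings), axis=0)
    return x, y

# Apply the function to each line, then stack everything into numpy arrays
xs, ys = zip(*(convert_line_to_example(l) for l in lines))
xs, ys = np.vstack(xs), np.array(ys)

print("Shape of inputs: {}".format(xs.shape))
print("Shape of labels: {}".format(ys.shape))
```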



Now that we've parsed the data, let's save 20% of the data (rounded to a whole number) for testing, using the rest for training.
The file we loaded had all the negative reviews first, followed by all the positive reviews, so we need to shuffle it before we split it into the train and test splits.
We'll then convert the data into PyTorch Tensors so we can feed them into our model.

In [None]:
print("First 10 labels before shuffling: {0}".format())



print("First 10 labels after shuffling: {0}".format())
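
A minimal sketch of the shuffle, split, and tensor conversion, assuming the parsed data lives in numpy arrays called `xs` and `ys`. The random data, the seed, and the variable names are placeholders for illustration.

```python
import numpy as np
import torch

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 10 examples, negatives first, then positives
xs = rng.normal(size=(10, 4)).astype(np.float32)
ys = np.array([0] * 5 + [1] * 5)
print("First 10 labels before shuffling: {0}".format(ys[:10]))

# Shuffle inputs and labels with the same permutation so pairs stay aligned
shuffled = rng.permutation(len(ys))
xs, ys = xs[shuffled], ys[shuffled]
print("First 10 labels after shuffling: {0}".format(ys[:10]))

# Hold out 20% (rounded to a whole number) for testing
num_test = round(0.2 * len(ys))
test_x, train_x = xs[:num_test], xs[num_test:]
test_y, train_y = ys[:num_test], ys[num_test:]

# Convert to PyTorch tensors; labels become floats for the BCE loss
train_x, test_x = torch.from_numpy(train_x), torch.from_numpy(test_x)
train_y = torch.from_numpy(train_y).float()
test_y = torch.from_numpy(test_y).float()
print(train_x.shape, test_x.shape)
```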

We could format each batch individually as we feed it into the model, but to make it easier on ourselves, let's create a TensorDataset and DataLoader as we've used in the past for MNIST.

Time to build our model in PyTorch. 

First we build the model, organized as a `nn.Module`.
We could make the number of outputs for our MLP the number of classes for this dataset (2).
However, since we only have two output classes here ("positive" vs "negative"), we can instead produce a single output value, calling everything greater than $0$ "positive" and everything less than $0$ "negative".
If we pass this output through a sigmoid operation, then values are mapped to $(0,1)$, with $0.5$ being the classification threshold.
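
One way such a model could be written is sketched below. The class name `SWEM`, the layer names, and the hidden size of 64 are assumptions; only the two-layer structure and the single output come from the description above.

```python
import torch
import torch.nn as nn

class SWEM(nn.Module):
    """Two-layer MLP over a mean word embedding; outputs one logit per review."""
    def __init__(self, embedding_dim=300, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        return self.fc2(h)   # raw logit; greater than 0 means "positive"

model = SWEM()
logits = model(torch.randn(8, 300))   # a random stand-in minibatch of 8 reviews
probs = torch.sigmoid(logits)         # squashed into the range between 0 and 1
print(logits.shape)
```

Thresholding the logit at $0$ and thresholding the sigmoid output at $0.5$ give the same decision, since $\sigma(0) = 0.5$.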

To train the model, we first instantiate it. 
Notice that, since the model produces a single output, we use the binary cross-entropy (BCE) loss instead of the cross-entropy loss we've seen before.
We use the "with logits" version for numerical stability.

In [None]:
## Training
# Instantiate model


# Binary cross-entropy (BCE) Loss and SGD Optimizer


# Iterate through train set minibatches 

# Zero out the gradients
        
# Forward pass


# Backward pass


# Print training progress
print("Epoch: {0} \t Train Loss: {1} \t Train Acc: {2}".format())

## Testing



# Iterate through test set minibatches 

# Forward pass
    
print('Test accuracy: {}'.format())
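
The loop outlined in the comments above can be sketched end to end on synthetic data. Everything specific here is an assumption made so the example runs on its own: the well-separated toy inputs, the small `nn.Sequential` model, the batch size, the learning rate, and the number of epochs. For brevity, the final accuracy is computed on the same toy data rather than on a held-out test set.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Hypothetical, well-separated toy data standing in for the mean embeddings
n, dim = 200, 10
y = (torch.arange(n) % 2).float()
x = torch.randn(n, dim) * 0.5 + (2 * y - 1).unsqueeze(1)   # class means at -1 and +1

loader = DataLoader(TensorDataset(x, y), batch_size=20, shuffle=True)
model = nn.Sequential(nn.Linear(dim, 16), nn.ReLU(), nn.Linear(16, 1))

# Binary cross-entropy on raw logits, optimized with SGD
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(10):
    total_loss, correct = 0.0, 0
    for inputs, labels in loader:
        optimizer.zero_grad()                  # zero out the gradients
        logits = model(inputs).squeeze(1)      # forward pass
        loss = criterion(logits, labels)
        loss.backward()                        # backward pass
        optimizer.step()
        total_loss += loss.item() * len(labels)
        correct += ((logits > 0).float() == labels).sum().item()
    epoch_losses.append(total_loss / n)
    print("Epoch: {0} \t Train Loss: {1:.4f} \t Train Acc: {2:.3f}".format(
        epoch, epoch_losses[-1], correct / n))

# Evaluate without tracking gradients
with torch.no_grad():
    accuracy = ((model(x).squeeze(1) > 0).float() == y).float().mean().item()
print("Accuracy: {:.3f}".format(accuracy))
```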

We can now examine what our model has learned, seeing how it responds to word vectors for different words:

In [None]:
# Check some words


print("Sentiment of the word '{0}': {1}".format())

Try some words of your own!

### Recurrent Neural Networks (RNNs)

In the context of deep learning, sequential data is commonly modeled with Recurrent Neural Networks (RNNs).
As natural language can be viewed as a sequence of words, RNNs are commonly used for NLP.
As with the fully connected and convolutional networks we've seen before, RNNs use combinations of linear and nonlinear transformations to project the input into higher level representations, and these representations can be stacked with additional layers.

#### Sentences as sequences
The key difference between sequential models and the previous models we've seen is the presence of a "time" dimension: words in a sentence (or paragraph, document) have an ordering to them that conveys meaning:

<img src="Figures/sentence_as_a_sequence.PNG" alt="basic_RNN" style="width: 300px;"/>

In the example sequence above, the word "Recurrent" is the $t=1$ word, which we denote $w_1$; similarly, "neural" is $w_2$, and so on.
As the preceding sections have hopefully impressed upon you, it is often more advantageous to model words as embedding vectors $x_1, ..., x_T$ rather than as one-hot vectors (which the tokens $w_1, ..., w_T$ correspond to), so our first step is often an embedding table look-up for each input word.

Note that in our previous sentiment analysis example, we just took the average embedding across time, treating the input as a "bag-of-words."
For simple problems, this can work surprisingly well, but as you might imagine, the ordering of words in a sentence is often important, and sometimes, we'd like to be able to model this temporal meaning as well.
Enter RNNs.

#### Review: Fully connected layer

Before we introduce the RNN, let's first revisit the fully connected layer that we used in our logistic regression and multilayer perceptron examples, with a few changes in notation:

\begin{align*}
h = f(x W + b)
\end{align*}

Instead of calling the result of the fully connected layer $y$, we're going to call it $h$, for hidden state.
The variable $y$ is usually reserved for the final layer of the neural network; since logistic regression was a single layer, using $y$ was fine. 
However, if we assume there is more than one layer, it is more common to refer to the intermediate representation as $h$.
Note that we also use $f()$ to denote a nonlinear activation function.
In the past, we've seen $f()$ as a $\text{ReLU}$, but this could also be a $\sigma()$ or $\tanh()$ nonlinearity.
Visualized:

<img src="Figures/rnn_mlp.PNG" width="175"/>

The key thing to notice here is that we project the input $x$ with a linear transformation (with $W$ and $b$), and then apply a nonlinearity to the output, giving us $h$.
During training, our goal is to learn $W$ and $b$.
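
As a small illustration of $h = f(x W + b)$, here is the layer written out with explicit tensors. The sizes (300 in, 128 out), the random initialization, and the choice of ReLU for $f$ are assumptions for this sketch.

```python
import torch

torch.manual_seed(0)

x = torch.randn(1, 300)            # one input, as a row vector
W = torch.randn(300, 128) * 0.01   # weights (would be learned during training)
b = torch.zeros(128)               # bias (would be learned during training)

h = torch.relu(x @ W + b)          # linear transformation, then nonlinearity
print(h.shape)
```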

#### A basic RNN

Unlike the previous examples we've seen using fully connected layers, sequential data have multiple inputs $x_1, ..., x_T$, instead of a single $x$.
We need to adapt our models accordingly for an RNN.
While there are several variations, a common basic formulation for an RNN is the Elman RNN, which is as follows&ast;:

\begin{align}
h_t = \tanh((x_t W_x + b_x) + (h_{t-1} W_h + b_h))
\end{align}

where $\tanh()$ is the hyperbolic tangent, a nonlinear activation function.
RNNs process words one at a time in sequence ($x_t$), producing a hidden state $h_t$ at every time step.
The first half of the above equation should look familiar; as with the fully connected layer, we are linearly transforming each input $x_t$, and then applying a nonlinearity.
Notice that we apply the same linear transformation ($W_x$, $b_x$) at every time step.
The difference is that we also apply a separate linear transform ($W_h$, $b_h$) to the previous hidden state $h_{t-1}$ and add it to our projected input.
This feedback is called a *recurrent* connection.

These directed cycles in the RNN architecture give RNNs the ability to model temporal dynamics, making them particularly suited for modeling sequences (e.g. text).
We can visualize an RNN layer as follows:

<img src="Figures/rnn.PNG" width="350"/>

We can unroll an RNN through time, making its sequential nature more obvious:

<img src="Figures/rnn_unrolled.PNG" width="700"/>

You can think of these recurrent connections as allowing the model to consider previous hidden states of a sequence when calculating the hidden state for the current input.

<font size="1">&ast;Note: We don't actually need two separate biases $b_x$ and $b_h$, as you can combine both biases into a single learnable parameter $b$. 
However, writing it separately helps make it clear that we're performing a linear transformation on both $x_t$ and $h_{t-1}$.
Speaking of combining variables, we can also express the above operation by concatenating $x_t$ and $h_{t-1}$ into a single vector $z_t$, and then performing a single matrix multiply $z_t W_z + b$, where $W_z$ is essentially $W_x$ and $W_h$ concatenated.
Indeed, this is how many "official" RNN modules are implemented, as the reduction in the number of separate matrix multiply operations makes them computationally more efficient.
These are implementation details though.</font>
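
The equivalence mentioned in the note can be checked numerically. This is a sketch with randomly generated tensors of assumed sizes (300 and 128); it compares the two-matrix form with the concatenated form $z_t W_z + b$.

```python
import torch

torch.manual_seed(0)
x_t, h_prev = torch.randn(1, 300), torch.randn(1, 128)
W_x, W_h = torch.randn(300, 128) * 0.05, torch.randn(128, 128) * 0.05
b_x, b_h = torch.randn(128), torch.randn(128)

# Two separate linear transformations, as in the equation above
separate = torch.tanh((x_t @ W_x + b_x) + (h_prev @ W_h + b_h))

# One transformation on the concatenated input, with a single combined bias
z_t = torch.cat([x_t, h_prev], dim=1)   # shape (1, 428)
W_z = torch.cat([W_x, W_h], dim=0)      # shape (428, 128)
b = b_x + b_h
combined = torch.tanh(z_t @ W_z + b)

print(torch.allclose(separate, combined, atol=1e-5))
```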

#### RNNs in PyTorch
How would we implement an RNN in PyTorch? 
There are quite a few ways, but let's build the Elman RNN from scratch first, using the embeddings matrix from our previous example.

In [None]:
# As always, import PyTorch first



Let's assume we have our inputs in word embedding form already, with dimension 300 as before. Although we usually group multiple examples into minibatches for efficiency, we'll use a minibatch size of 1 for this example for simplicity.

In an RNN, we project both the input $x_t$ and the previous hidden state $h_{t-1}$ to some hidden dimension, which we're going to choose to be 128.
To perform these operations, we define the parameters that the model will learn.

In [None]:


# For projecting the input


# For projecting the previous state

print()

For convenience, we define a function for one time step of the RNN.
This function takes the current input $x_t$ and previous hidden state $h_{t-1}$, performs the linear transformations $x_t W_x + b_x$ and $h_{t-1} W_h + b_h$, and then applies a hyperbolic tangent nonlinearity.

Each step of our RNN is going to require feeding in an input (i.e. the word representation) and the previous hidden state (a summary of the preceding sequence).
Note that at the beginning of a sentence, we don't have a previous hidden state, so we initialize it to some value, for example all zeros:

In [None]:
# Word embedding for first word


# Initialize hidden state to 0


To take one time step of the RNN, we call the function we wrote, passing in $x_1$ and $h_0$.

In [None]:
# Forward pass of one RNN step for time step t=1


print("Hidden state h1 dimensions: {0}".format())
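
Putting the pieces so far together, a from-scratch step might look like the sketch below. `RNN_step` is the function name used in this notebook, but its body, the parameter names `Wx`, `bx`, `Wh`, `bh`, the initialization scale, and the random stand-in for the first word's embedding are all assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding_dim, hidden_dim = 300, 128

# For projecting the input
Wx = nn.Parameter(torch.randn(embedding_dim, hidden_dim) * 0.01)
bx = nn.Parameter(torch.zeros(hidden_dim))

# For projecting the previous state
Wh = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.01)
bh = nn.Parameter(torch.zeros(hidden_dim))

def RNN_step(x, h):
    """One Elman RNN step: h_t = tanh((x_t Wx + bx) + (h_{t-1} Wh + bh))."""
    return torch.tanh((x @ Wx + bx) + (h @ Wh + bh))

x1 = torch.randn(1, embedding_dim)   # stand-in for the first word's embedding
h0 = torch.zeros(1, hidden_dim)      # no history yet, so start from zeros

h1 = RNN_step(x1, h0)
print("Hidden state h1 dimensions: {0}".format(h1.shape))
```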

We can call the `RNN_step` function again to get the next time step output from our RNN.

In [None]:
# Word embedding for second word


# Forward pass of one RNN step for time step t=2


print("Hidden state h2 dimensions: {0}".format())

We can continue unrolling the RNN as far as we need to. 
For each step, we feed in the current input ($x_t$) and previous hidden state ($h_{t-1}$) to get a new output.
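
Unrolling over a whole sequence is then just a loop that carries the hidden state forward. The sketch below assumes a sequence of five random stand-in embeddings and uses a single combined bias `b`; none of these specifics come from the original cells.

```python
import torch

torch.manual_seed(0)
T, embedding_dim, hidden_dim = 5, 300, 128

Wx = torch.randn(embedding_dim, hidden_dim) * 0.01
Wh = torch.randn(hidden_dim, hidden_dim) * 0.01
b = torch.zeros(hidden_dim)

xs = torch.randn(T, 1, embedding_dim)   # T time steps, minibatch size 1
h = torch.zeros(1, hidden_dim)          # initial hidden state h_0

hidden_states = []
for x_t in xs:
    # The same weights are reused at every time step
    h = torch.tanh(x_t @ Wx + h @ Wh + b)
    hidden_states.append(h)

print(len(hidden_states), hidden_states[-1].shape)
```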

#### Using `torch.nn`

In practice, much like fully connected and convolutional layers, we typically don't implement RNNs from scratch as above, instead relying on higher level APIs.
PyTorch has RNNs implemented in the `torch.nn` library. 

In [None]:


print("RNN parameter shapes: {}".format())
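
For reference, a minimal use of `torch.nn.RNN` with the same sizes as before (300-d inputs, 128-d hidden state) could look like this. The sequence length of 5 and the random input are placeholders.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=300, hidden_size=128)
print("RNN parameter shapes: {}".format([tuple(p.shape) for p in rnn.parameters()]))

# By default the input is laid out as (sequence length, batch size, input size)
x = torch.randn(5, 1, 300)
outputs, h_n = rnn(x)   # hidden state at every step, and the final hidden state
print(outputs.shape, h_n.shape)
```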

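Note that the RNN created by `torch.nn` has parameters of the same sizes as our from-scratch example above; PyTorch stores each weight matrix as (hidden size × input size), i.e. transposed relative to the $x_t W_x$ notation used here.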

#### Gated RNNs

While the RNNs we've just explored can successfully model simple sequential data, they tend to struggle with longer sequences, with [vanishing gradients](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) an especially big problem.
A number of RNN variants have been proposed over the years to mitigate this issue and have been shown empirically to be more effective.
In particular, Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) have seen wide use recently in deep learning.
We're not going to go into detail here about what structural differences they have from vanilla RNNs; a fantastic summary can be found [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/).
Note that "RNN" as a name is somewhat overloaded: it can refer either to the basic recurrent model we went over previously or to recurrent models in general (including LSTMs and GRUs).

LSTM and GRU layers can be created in much the same way as basic RNN layers.
Again, rather than implementing them yourself, it's recommended to use the `torch.nn` implementations, although we highly encourage you to peek at the source code so you understand what's going on under the hood.

In [None]:

print("LSTM parameters: {}".format())


print("GRU parameters: {}".format())
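
As a sketch, here are LSTM and GRU layers created with the same sizes as the basic RNN (300-d inputs, 128-d hidden state); the input tensor is a random placeholder. The first dimension of each weight is a multiple of the hidden size because the gates' weights are stacked together: four blocks for the LSTM and three for the GRU.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=300, hidden_size=128)
print("LSTM parameters: {}".format([tuple(p.shape) for p in lstm.parameters()]))

gru = nn.GRU(input_size=300, hidden_size=128)
print("GRU parameters: {}".format([tuple(p.shape) for p in gru.parameters()]))

x = torch.randn(5, 1, 300)                 # (sequence length, batch size, input size)
lstm_out, (h_n, c_n) = lstm(x)             # the LSTM also returns a cell state
gru_out, gru_h_n = gru(x)
print(lstm_out.shape, gru_out.shape)
```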

### Torchtext

Much like PyTorch has [Torchvision](https://pytorch.org/docs/stable/torchvision/index.html) for computer vision, PyTorch also has [Torchtext](https://torchtext.readthedocs.io/en/latest/) for natural language processing.
As with Torchvision, Torchtext has a number of popular NLP benchmark datasets, across a wide range of tasks (e.g. sentiment analysis, language modeling, machine translation).
It also has a few pre-trained word embeddings available, including the popular Global Vectors for Word Representation (GloVe).
If you need to load your own dataset, Torchtext has a number of useful containers that can make the data pipeline easier.

Torchtext isn't included with the PyTorch installation; it is distributed as a separate package. 
If you'd like to take advantage of Torchtext, you can install it with `pip install torchtext`, or clone the [GitHub repo](https://github.com/pytorch/text) and follow the installation instructions there.

### Other materials:
Natural Language Processing can fill several full courses on its own at most universities, both with and without neural networks.
Here are some additional reads:

- [Fantastic introduction to LSTMs and GRUs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Popular blog post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)