# Sentiment with Deep Neural Networks

In this notebook, we will explore sentiment analysis using deep neural networks. 

## Table of Contents
- [1 - Import Libraries and try out Trax](#1)
- [2 - Importing the Data](#2)
    - [2.1 - Loading in the Data](#2-1)
    - [2.2 - Building the Vocabulary](#2-2)
    - [2.3 - Converting a Tweet to a Tensor](#2-3)
        - [2.3.1 - tweet_to_tensor ](#2.3.1)
    - [2.4 - Creating a Batch Generator](#2-4)
        - [2.4.1 - data_generator](#2.4.1)
- [3 - Defining Classes](#3)
    - [3.1 - ReLU Class](#3-1)
        - [3.1.1 - Relu](#3.1.1)
    - [3.2 - Dense Class](#3.2)
        - [3.2.1 - Dense](#3.3.1)
    - [3.3 - Model](#3-3)
        - [3.3.1 - classifier](#3.3.1)
- [4 - Training](#4)
    - [4.1 - Training the Model](#4-1)
        - [4.1.1 - train_model](#4.1.1)
    - [4.2 - Making a Prediction](#4-2)
- [5 - Evaluation](#5)
    - [5.1 - Computing the Accuracy on a Batch](#5-1)
        - [5.1.1 - compute_accuracy](#5.1.1)
    - [5.2 - Testing your Model on Validation Data](#5-2)
        - [5.2.1 - test_model](#5.2.1)
- [6 - Testing with your Own Input](#6)
- [7 - Word Embeddings](#7)

In course 1, you implemented logistic regression and Naive Bayes for sentiment analysis. However, if you were to give your old models an example like:

<center> <span style='color:blue'> <b>This movie was almost good.</b> </span> </center>

Your model would have predicted a positive sentiment for that review. However, that sentence has a negative sentiment and indicates that the movie was not good. To solve those kinds of misclassifications, you will write a program that uses deep neural networks to identify sentiment in text. By completing this assignment, you will: 

- Understand how you can build/design a model using layers
- Train a model using a training loop
- Use a binary cross-entropy loss function
- Compute the accuracy of your model
- Predict using your own input

This model follows a similar structure to the one you previously implemented in the second course of this specialization. Indeed, most of the deep nets you will be implementing share that structure; only the model architecture, the inputs, and the outputs change. Before starting the assignment, we will introduce you to `trax`, the Google library that we use for building and training models.


Now we will show you how to compute the gradient of a certain function `f` by just using `trax.fastmath.grad(f)`. 

- Trax source code can be found on Github: [Trax](https://github.com/google/trax)
- The Trax code also uses the JAX library: [JAX](https://jax.readthedocs.io/en/latest/index.html)
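
As a rough sanity check of what a gradient function should return, the sketch below approximates a derivative with a central finite difference in plain Python. This is only a stand-in for illustration, not Trax's automatic differentiation; the function `f` and the helper `numerical_grad` are hypothetical names.

```python
def f(x):
    # Example function: f(x) = 3x^2 + x, whose exact derivative is 6x + 1
    return 3.0 * x ** 2 + x

def numerical_grad(func, eps=1e-5):
    # Return a new function that approximates d(func)/dx
    # with a central finite difference.
    def grad_func(x):
        return (func(x + eps) - func(x - eps)) / (2.0 * eps)
    return grad_func

grad_f = numerical_grad(f)
approx = grad_f(5.0)
exact = 6.0 * 5.0 + 1.0
print(f"approximate gradient at x=5: {approx:.4f}, exact: {exact:.4f}")
```

Like a gradient transform, `numerical_grad` takes a function and returns another function, which you then call on an input value.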

<a name="1"></a>
## 1 - Import Libraries and try out Trax

- Let's import libraries and look at an example of using the Trax library.

In [2]:
%pip install trax


In [1]:
import os 
import shutil
import random as rnd

# import relevant libraries
import trax
import trax.fastmath.numpy as np
from trax import layers as tl
from trax import fastmath

# import Layer from the utils.py file
from utils import Layer, load_tweets, process_tweet
import w1_unittest


<a name="2"></a>
## 2 - Importing the Data

<a name="2-1"></a>
### 2.1 - Loading in the Data

- The `process_tweet` function is implemented in the utils.py file. It removes unwanted content from a tweet, e.g. hashtags, hyperlinks, and stock tickers.
- It also tokenizes the original string, returning a list of words.

In [None]:
def train_val_split():
    # Load positive and negative tweets
    all_positive_tweets, all_negative_tweets = load_tweets()

    # View the total number of positive and negative tweets.
    print(f"The number of positive tweets: {len(all_positive_tweets)}")
    print(f"The number of negative tweets: {len(all_negative_tweets)}")

    # Split positive set into validation and training
    val_pos = all_positive_tweets[4000:]    # validation set for positive tweets
    train_pos = all_positive_tweets[:4000]  # training set for positive tweets

    # Split negative set into validation and training
    val_neg = all_negative_tweets[4000:]    # validation set for negative tweets
    train_neg = all_negative_tweets[:4000]  # training set for negative tweets
    
    # Combine training data into one set
    train_x = train_pos + train_neg 

    # Combine validation data into one set
    val_x  = val_pos + val_neg

    # Set the labels for the training set (1 for positive, 0 for negative)
    train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))

    # Set the labels for the validation set (1 for positive, 0 for negative)
    val_y  = np.append(np.ones(len(val_pos)), np.zeros(len(val_neg)))


    return train_pos, train_neg, train_x, train_y, val_pos, val_neg, val_x, val_y

In [None]:
train_pos, train_neg, train_x, train_y, val_pos, val_neg, val_x, val_y = train_val_split()

print(f"length of train_x {len(train_x)}")
print(f"length of val_x {len(val_x)}")

In [None]:
# Try out function that processes tweets
print("original tweet at training position 0")
print(train_pos[0])

print("Tweet at training position 0 after processing:")
process_tweet(train_pos[0])

<a name="2-2"></a>
### 2.2 - Building the Vocabulary

Now build the vocabulary.
- Map each word in each tweet to an integer (an "index"). We build the mapping from the training data only, assigning an index to every word by iterating over the training set.

The vocabulary will also include some special tokens
- `__PAD__`: padding
- `__</e>__`: end of line
- `__UNK__`: a token representing any word that is not in the vocabulary.

The dictionary `Vocab` will look like this:
```
{'__PAD__': 0,
 '__</e>__': 1,
 '__UNK__': 2,
 'followfriday': 3,
 'top': 4,
 'engag': 5,
```

- So each unique word has a unique integer associated with it.
- The total number of words in Vocab: 9088

In [None]:
# Build the vocabulary
# There is no test set here only train/val
def get_vocab(train_x):

    # Include special tokens 
    # started with pad, end of line and unk tokens
    Vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2} 

    for tweet in train_x: 
        processed_tweet = process_tweet(tweet)
        for word in processed_tweet:
            if word not in Vocab: 
                Vocab[word] = len(Vocab)
    
    return Vocab

In [None]:
Vocab = get_vocab(train_x)

print("Total words in vocab are",len(Vocab))
display(Vocab)

<a name="2-3"></a>
### 2.3 - Converting a Tweet to a Tensor

The next function converts a tweet into a list of integers, as the following example shows.

##### Example
Input a tweet:
```python
'@happypuppy, is Maria happy?'
```

The function first converts the tweet into a list of tokens (keeping only the relevant words):
```python
['maria', 'happi']
```

Then it converts each word into its unique integer from the vocabulary:

```python
[2, 56]
```
- Notice that the word "maria" is not in the vocabulary, so it is assigned the unique integer associated with the `__UNK__` token, because it is considered "unknown."
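
The mapping described above can be tried out on toy data. The sketch below is a minimal illustration under stated assumptions: `toy_tokenize` is a hypothetical stand-in for `process_tweet` (it only lowercases and splits on whitespace, with no stemming or stop-word removal), and `build_vocab` and `to_ids` are illustrative names.

```python
def toy_tokenize(text):
    # Hypothetical stand-in for process_tweet: lowercase and split on whitespace
    return text.lower().split()

def build_vocab(texts):
    # Start with the three special tokens, then add unseen words in order
    vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2}
    for text in texts:
        for word in toy_tokenize(text):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def to_ids(text, vocab, unk_token='__UNK__'):
    # Words missing from the vocabulary fall back to the __UNK__ ID
    unk_id = vocab[unk_token]
    return [vocab.get(word, unk_id) for word in toy_tokenize(text)]

toy_vocab = build_vocab(["happy day", "sad day"])
ids = to_ids("maria happy", toy_vocab)
print(toy_vocab)
print(ids)  # "maria" is unknown, so it maps to the __UNK__ ID
```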

<a name="2.3.1"></a>
### 2.3.1 - tweet_to_tensor

In [None]:
def tweet_to_tensor(tweet, vocab_dict, unk_token='__UNK__', verbose=False):
    '''
    Input: 
        tweet - A string containing a tweet
        vocab_dict - The words dictionary
        unk_token - The special string for unknown tokens
        verbose - Print info during runtime
    Output:
        tensor_l - A python list with the unique integer ID of each word in the processed tweet
        
    '''     
    # Process the tweet into a list of words
    # where only important words are kept (stop words removed)
    word_l = process_tweet(tweet)
    
    if verbose:
        print("List of words from the processed tweet:")
        print(word_l)
        
    # Initialize the list that will contain the unique integer IDs of each word
    tensor_l = [] 
    
    # Get the unique integer ID of the __UNK__ token
    unk_ID = vocab_dict[unk_token]
    
    if verbose:
        print(f"The unique integer ID for the unk_token is {unk_ID}")
        
    # for each word in the list:
    for word in word_l:
        
        # Get the unique integer ID.
        # If the word doesn't exist in the vocab dictionary,
        # use the unique ID for __UNK__ instead.     

        word_ID = vocab_dict.get(word, unk_ID)

        # Equivalent to:
        # if word not in vocab_dict:
        #     word = unk_token
        # word_ID = vocab_dict[word]

            
        # Append the unique integer ID to the tensor list.
        tensor_l.append(word_ID)
    
    return tensor_l

In [3]:
print("Actual tweet is\n", val_pos[0])
print("\nTensor of tweet:\n", tweet_to_tensor(val_pos[0], vocab_dict=Vocab))


<a name="2-4"></a>
### 2.4 - Creating a Batch Generator

Most of the time in Natural Language Processing, and AI in general, we train models on batches of examples. 
- If you were to train a model with one example at a time instead, it would take a very long time to train the model. 
- So we will now build a data generator that takes in the positive/negative tweets and returns a batch of training examples. It returns the model inputs, the targets (positive or negative labels) and the weight for each target (e.g., this allows us to treat some examples as more important to get right than others, but commonly the weights will all be 1.0). 

Once we create the generator, we can use it in a for loop:

```python
for batch_inputs, batch_targets, batch_example_weights in data_generator:
    ...  # forward propagate, compute the loss, update the weights
```

We can also get a single batch like this:

```python
batch_inputs, batch_targets, batch_example_weights = next(data_generator)
```

The generator returns the next batch each time it's called. 
- This generator returns the data in a format (tensors) that could directly be used in our model.
- It returns a triplet: the inputs, targets, and loss weights:
    - Inputs is a tensor that contains the batch of tweets we put into the model.
    - Targets is the corresponding batch of labels that the model is trained to predict.
    - Loss weights here are just 1s with same shape as targets. 
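
To make the triplet concrete, here is a simplified sketch of a batch generator in plain Python. It assumes the tweets are already converted to lists of integer IDs, does not shuffle or loop, and yields plain lists rather than arrays; `toy_batches` is a hypothetical name used only for this illustration.

```python
def toy_batches(pos, neg, batch_size, pad_id=0):
    # pos and neg are lists of already-tokenized tweets (lists of integer IDs)
    half = batch_size // 2
    n_batches = min(len(pos), len(neg)) // half
    for b in range(n_batches):
        start = b * half
        # Half of the batch is positive, the other half negative
        batch = pos[start:start + half] + neg[start:start + half]
        # Pad every tweet in the batch to the length of the longest one
        max_len = max(len(t) for t in batch)
        inputs = [t + [pad_id] * (max_len - len(t)) for t in batch]
        targets = [1] * half + [0] * half
        example_weights = [1] * len(targets)
        yield inputs, targets, example_weights

pos = [[3, 4, 5], [6, 7]]
neg = [[8], [9, 10, 11, 12]]
gen = toy_batches(pos, neg, batch_size=2)

# Get a single batch with next()
first_inputs, first_targets, first_weights = next(gen)
print(first_inputs, first_targets, first_weights)

# Iterate over the remaining batches with a for loop
for inputs, targets, example_weights in gen:
    print(inputs, targets, example_weights)
```

Note that padding is computed per batch, so different batches can have different widths.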

<a name="2.4.1"></a>
### 2.4.1 - data_generator

In [None]:
def data_generator(data_pos, data_neg, batch_size, loop, vocab_dict, shuffle=False):
    '''
    Input: 
        data_pos - Set of positive examples
        data_neg - Set of negative examples
        batch_size - number of samples per batch. Must be even
        loop - True or False
        vocab_dict - The words dictionary
        shuffle - Shuffle the data order
    Yield:
        inputs - Subset of positive and negative examples
        targets - The corresponding labels for the subset
        example_weights - A numpy array specifying the importance of each example
        
    '''     

    # make sure the batch size is an even number
    # to allow an equal number of positive and negative samples    
    assert batch_size % 2 == 0
    
    # Number of positive examples in each batch is half of the batch size
    # same with number of negative examples in each batch
    n_to_take = batch_size // 2
    
    # Use pos_index to walk through the data_pos array
    # same with neg_index and data_neg
    pos_index = 0
    neg_index = 0
    
    len_data_pos = len(data_pos)
    len_data_neg = len(data_neg)
    
    # Get an array with the data indexes
    pos_index_lines = list(range(len_data_pos))
    neg_index_lines = list(range(len_data_neg))
    
    # shuffle lines if shuffle is set to True
    if shuffle:
        rnd.shuffle(pos_index_lines)
        rnd.shuffle(neg_index_lines)
        
    stop = False
    
    # Loop indefinitely
    while not stop:  
        
        # create a batch with positive and negative examples
        batch = []
        
        # First part: Pack n_to_take positive examples
        
        # Start from 0 and increment i up to n_to_take
        for i in range(n_to_take):
                    
            # If the positive index goes past the positive dataset,
            if pos_index >= len_data_pos: 
                
                # If loop is set to False, break once we reach the end of the dataset
                if not loop:
                    stop = True
                    break
                # If user wants to keep re-using the data, reset the index
                pos_index = 0
                if shuffle:
                    # Shuffle the index of the positive sample
                    rnd.shuffle(pos_index_lines)
                    
            # get the tweet as pos_index
            tweet = data_pos[pos_index_lines[pos_index]]
            
            # convert the tweet into tensors of integers representing the processed words
            tensor = tweet_to_tensor(tweet, vocab_dict)
            
            # append the tensor to the batch list
            batch.append(tensor)
            
            # Increment pos_index by one
            pos_index = pos_index + 1


        # Second part: Pack n_to_take negative examples

        # Using the same batch list, start from 0 and increment i up to n_to_take
        for i in range(n_to_take):
            
            # If the negative index goes past the negative dataset,
            if neg_index >= len_data_neg: 
                
                # If loop is set to False, break once we reach the end of the dataset
                if not loop:
                    stop = True 
                    break 
                    
                # If user wants to keep re-using the data, reset the index
                neg_index = 0
                
                if shuffle:
                    # Shuffle the index of the negative sample
                    rnd.shuffle(neg_index_lines)
                    
            # get the tweet as neg_index
            tweet = data_neg[neg_index_lines[neg_index]]
            
            # convert the tweet into tensors of integers representing the processed words
            tensor = tweet_to_tensor(tweet, vocab_dict)
            
            # append the tensor to the batch list
            batch.append(tensor)
            
            # Increment neg_index by one
            neg_index += 1

        if stop:
            break

        # Get the max tweet length (the length of the longest tweet) 
        # (we will pad all shorter tweets to have this length)
        max_len = max([len(t) for t in batch]) 
        
        # Initialize tensor_pad_l, which will 
        # store the padded versions of the tensors
        tensor_pad_l = []
        # Pad shorter tweets with zeros
        for tensor in batch:

            # Get the number of positions to pad for this tensor so that it will be max_len long
            n_pad = max_len - len(tensor)
            
            # Generate a list of zeros, with length n_pad
            pad_l = [0] * n_pad
            
            # concatenate the tensor and the list of padded zeros
            tensor_pad = tensor + pad_l
            
            # append the padded tensor to the list of padded tensors
            tensor_pad_l.append(tensor_pad)

        # convert the list of padded tensors to a numpy array
        # and store this as the model inputs
        inputs = np.array(tensor_pad_l)
  
        # Generate the list of targets for the positive examples (a list of ones)
        # The length is the number of positive examples in the batch
        target_pos = [1] * n_to_take
        
        # Generate the list of targets for the negative examples (a list of zeros)
        # The length is the number of negative examples in the batch
        target_neg = [0] * n_to_take
        
        # Concatenate the positive and negative targets
        target_l = target_pos + target_neg
        
        # Convert the target list into a numpy array
        targets = np.array(target_l)

        # Example weights: Treat all examples equally importantly.
        example_weights = np.ones_like(targets)
        

        # note we use yield and not return
        yield inputs, targets, example_weights

Now we can use our function to create a data generator for the training data, and another data generator for the validation data.

We will create a third data generator that does not loop, for testing the final accuracy of the model.

In [None]:
# Set the random number generator for the shuffle procedure
rnd.seed(30) 

# Create the training data generator

def train_generator(batch_size,
                    train_pos,
                    train_neg,
                    vocab_dict,
                    loop=True, 
                    shuffle = False):
    return data_generator(train_pos, train_neg, batch_size, loop, vocab_dict, shuffle)

# Create the validation data generator
def val_generator(batch_size,
                    val_pos,
                    val_neg, 
                    vocab_dict,
                    loop=True,
                    shuffle = False):
    return data_generator(val_pos, val_neg, batch_size, loop, vocab_dict, shuffle)

# Create the test data generator (uses the validation data, but does not loop)
def test_generator(batch_size, 
                    val_pos,
                    val_neg,   
                    vocab_dict, 
                    loop=False,
                    shuffle = False):
    return data_generator(val_pos, val_neg, batch_size, loop, vocab_dict, shuffle)


In [None]:
# Get a batch from the train_generator and inspect.
inputs, targets, example_weights = next(train_generator(4, train_pos, train_neg, Vocab, shuffle=True))

# this will print a list of 4 tensors padded with zeros
print(f'Inputs: {inputs}')
print(f'Targets: {targets}')
print(f'Example Weights: {example_weights}')

Now that we have the train/val generators, we can just call them and they will return batches: the tensors that correspond to the tweets in the first position, their corresponding labels (positive or negative) in the second position, and the example weights in the third. 
> Now we will go ahead and start building our neural network model. 

<a name="3"></a>
## 3 - Defining Classes

In this part, we will write our own small framework (library of layers). It will be very similar
to the one used in Trax and also in Keras and PyTorch. 

Your framework will be based on the following `Layer` class from utils.py.

```python
class Layer(object):
    """ Base class for layers.
    """
      
    # Constructor
    def __init__(self):
        # set weights to None
        self.weights = None

    # The forward propagation should be implemented
    # by subclasses of this Layer class
    def forward(self, x):
        raise NotImplementedError

    # This function initializes the weights
    # based on the input signature and random key,
    # should be implemented by subclasses of this Layer class
    def init_weights_and_state(self, input_signature, random_key):
        pass

    # This initializes and returns the weights, do not override.
    def init(self, input_signature, random_key):
        self.init_weights_and_state(input_signature, random_key)
        return self.weights
 
    # __call__ allows an object of this class
    # to be called like it's a function.
    def __call__(self, x):
        # When this layer object is called, 
        # it calls its forward propagation function
        return self.forward(x)
```

<a name="3-1"></a>
### 3.1 - ReLU Class
We will now implement the ReLU activation function in a class below. The ReLU function looks as follows: 
<img src = "images/relu.jpg" style="width:300px;height:150px;"/>

$$ \mathrm{ReLU}(x) = \mathrm{max}(0,x) $$
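
As a quick worked example of the formula, the sketch below applies ReLU element-wise to a small array. It uses standard NumPy as a stand-in for `trax.fastmath.numpy` so that it runs on its own.

```python
import numpy as np

# Negative entries are clipped to 0; non-negative entries pass through unchanged
x = np.array([[-2.0, -1.0, 0.0],
              [0.0, 1.0, 2.0]])
relu_x = np.maximum(x, 0)
print(relu_x)
```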


In [None]:

class Relu(Layer):
    """Relu activation function implementation"""
    def forward(self, x):
        '''
        Input: 
            - x (a numpy array): the input
        Output:
            - activation (numpy array): all positive or 0 version of x
        '''
        
        activation = np.maximum(x,0)

        return activation

<a name="3.2"></a>
### 3.2 - Dense Class 

- The forward function of the Dense class multiplies the input to the layer (`x`) by the weight matrix (`W`)

$$\mathrm{forward}(\mathbf{x},\mathbf{W}) = \mathbf{xW} $$

We use the Trax `fastmath` module for more efficient code execution; it includes Trax versions of `numpy` and `random`.

- Weights are initialized with a random key.
- The second parameter is a tuple for the desired shape of the weights (num_rows, num_cols)
- The num of rows for weights is equal to the number of columns in x, because for forward propagation, you will multiply x times weights.

We use `trax.fastmath.random.normal(key, shape)` to generate random values for the weight matrix. The key difference between this function and the standard `numpy` randomness is the explicit use of random keys, which need to be passed. 
- `key` can be generated by calling `trax.fastmath.random.get_prng(seed)` and passing in a number for the `seed`.
- `shape` is a tuple with the desired shape of the weight matrix.
    - The number of rows in the weight matrix should equal the number of columns in the variable `x`. Since `x` may have two dimensions if it represents a single training example (row, col), or three dimensions (batch_size, row, col), we use `shape[-1]` so that we always get the last dimension from the tuple that holds the dimensions of `x`.
    - The number of columns in the weight matrix is the number of units chosen for that dense layer.
    
- The generated values follow a standard normal distribution (mean of 0, standard deviation of 1); multiplying them by 0.1 sets the standard deviation of the initial weights to 0.1.
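
The shape rules above can be checked with a small sketch. It assumes standard NumPy and `np.random.default_rng` as stand-ins for the Trax `fastmath` versions (so a seeded generator plays the role of the random key); the input values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=10)   # stand-in for a Trax random key
x = np.array([[2.0, 7.0, 25.0]])       # one example with 3 features
n_units = 10
init_stdev = 0.1

# Rows of W match the last dimension of x; columns match the number of units.
# Scaling standard-normal samples by init_stdev gives weights with that stdev.
w = init_stdev * rng.standard_normal(size=(x.shape[-1], n_units))

# Forward pass of a dense layer without bias: x times W
out = np.dot(x, w)
print(w.shape, out.shape)
```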

<a name="3.3.1"></a>
### 3.2.1 - Dense


In [None]:
class Dense(Layer):
    """
    A dense (fully-connected) layer.
    """

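    def __init__(self, n_units, init_stdev=0.1):
        # Initialize the base Layer (sets self.weights to None)
        super().__init__()

        # Set the number of units in this layer
        self._n_units = n_units
        # Standard deviation used to scale the initial random weights
        self._init_stdev = init_stdev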

    def forward(self, x):
        # Matrix multiply x and the weight matrix
        dense = np.dot(x,self.weights)
        
        return dense

    # init_weights
    def init_weights_and_state(self, input_signature, random_key):
        
        # The input_signature has a .shape attribute that gives the shape as a tuple
        input_shape = input_signature.shape
        
        # Generate the weight matrix from a normal distribution
        # with mean 0 and standard deviation self._init_stdev
        w = self._init_stdev * trax.fastmath.random.normal(key=random_key, shape=(input_shape[-1], self._n_units))

        self.weights = w
        return self.weights

<a name="3-3"></a>
### 3.3 - Model

Now we will implement a classifier using neural networks. Here is the model architecture we will build. 

<img src = "images/nn.jpg" style="width:400px;height:250px;"/>

For the model implementation, we will use the Trax `layers` module, imported as `tl`.
Trax layers are very similar to the ones we implemented above (`Relu` and `Dense`),
but in addition to trainable weights they can also have a non-trainable state.
State is used in layers like batch normalization and for inference.

First, look at the code of the Trax Dense layer and compare to your implementation above.
- [tl.Dense](https://github.com/google/trax/blob/master/trax/layers/core.py#L29): Trax Dense layer implementation

Another important layer that you will use a lot is one that executes one layer after another in sequence.
- [tl.Serial](https://github.com/google/trax/blob/master/trax/layers/combinators.py#L26): Combinator that applies layers serially.  
    - You can pass in the layers as arguments to `Serial`, separated by commas. 
    - For example: `tl.Serial(tl.Embedding(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))`

- [tl.Embedding](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L113): Layer constructor function for an embedding layer.  
    - `tl.Embedding(vocab_size, d_feature)`.
    - `vocab_size` is the number of unique words in the given vocabulary.
    - `d_feature` is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).  

- [tl.Mean](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L276): Calculates means across an axis.  In this case, we chose axis = 1 to get an average embedding vector (an embedding vector that is an average of all words in the sentence).  
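- For example, if each word embedding has 300 elements and a batch holds 16 tweets padded to 20 tokens each, the embedding layer outputs an array of shape (16, 20, 300). Taking the mean along axis=1 then yields one 300-element vector per tweet, i.e. an array of shape (16, 300).  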
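
The embedding lookup followed by averaging can be sketched with standard NumPy to see how the shapes change. This is an illustration under assumptions, not the Trax implementation: the embedding table is random, and the sizes and token IDs are made up.

```python
import numpy as np

vocab_size, d_feature = 10, 4
rng = np.random.default_rng(0)

# One row of d_feature values per word in the vocabulary
embedding_table = rng.standard_normal((vocab_size, d_feature))

# A batch of 2 tweets, each padded to 3 token IDs (0 is the padding ID)
token_ids = np.array([[3, 4, 5],
                      [6, 7, 0]])

# Embedding lookup: (batch, seq_len) -> (batch, seq_len, d_feature)
embedded = embedding_table[token_ids]

# Mean over the sequence axis: (batch, seq_len, d_feature) -> (batch, d_feature)
averaged = embedded.mean(axis=1)
print(embedded.shape, averaged.shape)
```

In this sketch the padding positions are averaged in like any other token.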

**Online documentation**

- [tl.Dense](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Dense)

- [tl.Serial](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#module-trax.layers.combinators)

- [tl.Embedding](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Embedding)

- [tl.Mean](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Mean)

<a name="3.3.1"></a>
### 3.3.1 - classifier

In [None]:
def classifier(vocab_size=9088, embedding_dim=256, output_dim=2, mode='train'):
            
    # create embedding layer
    embed_layer = tl.Embedding( 
        vocab_size=vocab_size, # Size of the vocabulary
        d_feature=embedding_dim # Embedding dimension
    ) 
    
    # Create a mean layer, to create an "average" word embedding
    mean_layer = tl.Mean(axis=1)
    
    # Create a dense layer, one unit for each output
    dense_output_layer = tl.Dense(n_units = output_dim)
    
    # Use tl.Serial to combine all layers and create the classifier
    model = tl.Serial( 
      embed_layer,
      mean_layer,
      dense_output_layer 
      ) 
    
    # Return the model (a tl.Serial combinator)
    return model

<a name="4"></a>
## 4 - Training

To train a model on a task, Trax defines an abstraction [`trax.supervised.training.TrainTask`](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html#trax.supervised.training.TrainTask) which packages the train data, loss and optimizer (among other things) together into an object.

Similarly, to evaluate a model, Trax defines an abstraction [`trax.supervised.training.EvalTask`](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html#trax.supervised.training.EvalTask) which packages the eval data and metrics (among other things) into another object.

The final piece tying things together is the [`trax.supervised.training.Loop`](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html#trax.supervised.training.Loop) abstraction that is a very simple and flexible way to put everything together and train the model, all the while evaluating it and saving checkpoints.

Using `Loop` will save us a lot of code compared to always writing the training loop by hand.

In [None]:
# View documentation for trax.supervised.training.TrainTask
help(trax.supervised.training.TrainTask)

In [None]:
# View documentation for trax.supervised.training.EvalTask
help(trax.supervised.training.EvalTask)

In [None]:
# View documentation for trax.supervised.training.Loop
help(trax.supervised.training.Loop)

In [None]:
# View optimizers that you could choose from
help(trax.optimizers)

Notice some available optimizers include:
```
    adafactor
    adam
    momentum
    rms_prop
    sm3
```

<a name="4-1"></a>
### 4.1 - Training the Model

Now you are going to train the model. 

Let's define the `TrainTask`, `EvalTask` and `Loop` in preparation to train the model.

In [None]:
from trax.supervised import training

def get_train_eval_tasks(train_pos,
                         train_neg,
                         val_pos, 
                         val_neg, 
                         vocab_dict, 
                         loop, 
                         batch_size = 16):
    
    rnd.seed(271)

    train_task = training.TrainTask(
        labeled_data=train_generator(
                                    batch_size,
                                    train_pos,
                                    train_neg, 
                                    vocab_dict,
                                    loop, 
                                    shuffle = True),
        loss_layer=tl.WeightedCategoryCrossEntropy(),
        optimizer=trax.optimizers.Adam(0.01),
        n_steps_per_checkpoint=10,
    )

    eval_task = training.EvalTask(
        labeled_data=val_generator(
                                    batch_size, 
                                    val_pos,
                                    val_neg, 
                                    vocab_dict, 
                                    loop, 
                                    shuffle = True),
                                    
        metrics=[tl.WeightedCategoryCrossEntropy(),tl.WeightedCategoryAccuracy()],
    )
    
    return train_task, eval_task
    

train_task, eval_task = get_train_eval_tasks(train_pos, train_neg, val_pos, val_neg, Vocab, True, batch_size = 16)
model = classifier()

In [None]:
model

This defines a model trained using [`tl.WeightedCategoryCrossEntropy`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.metrics.WeightedCategoryCrossEntropy) optimized with the [`trax.optimizers.Adam`](https://trax-ml.readthedocs.io/en/latest/trax.optimizers.html#trax.optimizers.adam.Adam) optimizer, all the while tracking the accuracy using [`tl.WeightedCategoryAccuracy`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.metrics.WeightedCategoryAccuracy) metric. We also track `tl.WeightedCategoryCrossEntropy` on the validation set.
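
To build intuition for these two metrics, here is a minimal NumPy sketch. It assumes the loss applies a log-softmax to the raw model outputs and then takes the example-weighted average of the negative log-likelihood of the target class; the function names are hypothetical and this is not the Trax implementation.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row-wise max for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def weighted_cross_entropy(logits, targets, weights):
    # Pick the log-probability of the target class for each example
    log_probs = log_softmax(logits)
    picked = log_probs[np.arange(len(targets)), targets]
    return -(weights * picked).sum() / weights.sum()

def weighted_accuracy(logits, targets, weights):
    # The predicted class is the column with the largest score
    predictions = logits.argmax(axis=-1)
    correct = (predictions == targets).astype(float)
    return (weights * correct).sum() / weights.sum()

logits = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [0.5, -0.5]])
targets = np.array([0, 1, 1, 1])
weights = np.ones(4)

loss = weighted_cross_entropy(logits, targets, weights)
acc = weighted_accuracy(logits, targets, weights)
print(f"loss: {loss:.4f}, accuracy: {acc:.2f}")
```

With all weights equal to 1, both metrics reduce to plain averages over the batch.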

Now let's make an output directory and train the model.

In [None]:
output_dir = './model/'

# Remove any checkpoints left over from a previous run
try:
    shutil.rmtree(output_dir)
except OSError:
    pass

output_dir_expand = os.path.expanduser(output_dir)
print(output_dir_expand)

<a name="4.1.1"></a>
### 4.1.1 - train_model

> Now we will implement `train_model` to train the model (the `classifier` that we wrote previously) for the given number of training steps (`n_steps`) using `TrainTask`, `EvalTask` and `Loop`. 

In [None]:
def train_model(classifier, train_task, eval_task, n_steps, output_dir):
    '''
    Input: 
        classifier - the model you are building
        train_task - Training task
        eval_task - Evaluation task(s), passed as a list
        n_steps - the number of training steps to run
        output_dir - folder to save your files
    Output:
        training_loop - the trax training Loop, which holds the trained model
    '''
    rnd.seed(31)  # Seed Python's random module (used when shuffling batches)
    
    training_loop = training.Loop( 
                                classifier, # The learning model
                                train_task, # The training task
                                eval_tasks=eval_task, # The evaluation task
                                output_dir=output_dir, # The output directory
                                random_seed=31 
    ) 

    training_loop.run(n_steps = n_steps)
    
    # Return the training_loop, since it has the model.
    return training_loop

In [None]:
# Note that eval_task is passed inside square brackets, because train_model expects a list of evaluation tasks
training_loop = train_model(model, train_task, [eval_task], 100, output_dir_expand)

<a name="4-2"></a>
### 4.2 - Making a Prediction

Now that we have trained our model, we can access it as the `training_loop.model` object. We will actually use `training_loop.eval_model` to make predictions.

Use the training data just to see how the prediction process works.  
- Later, we will use validation data to evaluate your model's performance.


In [None]:
# Create a generator object
tmp_train_generator = train_generator(
                        16,
                        train_pos,
                        train_neg, 
                        Vocab, 
                        loop=True, 
                        shuffle = False)

# get one batch
tmp_batch = next(tmp_train_generator)

# Position 0 has the model inputs (tweets as tensors)
# position 1 has the targets (the actual labels)
# position 2 has the example weights
tmp_inputs, tmp_targets, tmp_example_weights = tmp_batch

print(f"The batch is a tuple of length {len(tmp_batch)} because position 0 contains the tweets, position 1 contains the targets, and position 2 contains the example weights.") 
print(f"The shape of the tweet tensors is {tmp_inputs.shape} (num of examples, length of tweet tensors)")
print(f"The shape of the labels is {tmp_targets.shape}, which is the batch size.")
print(f"The shape of the example_weights is {tmp_example_weights.shape}, which is the same as inputs/targets size.")

In [None]:
# feed the tweet tensors into the model to get a prediction
tmp_pred = training_loop.eval_model(tmp_inputs)
print(f"The prediction shape is {tmp_pred.shape}, num of tensor_tweets as rows")
print("Column 0 is the probability of a negative sentiment (class 0)")
print("Column 1 is the probability of a positive sentiment (class 1)")
print()
print("View the prediction array")
tmp_pred

To turn these probabilities into categories (negative or positive sentiment prediction), for each row:
- Compare the probabilities in each column.
- If column 1 has a value greater than column 0, classify that as a positive tweet.
- Otherwise if column 1 is less than or equal to column 0, classify that example as a negative tweet.
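
The next cell labels the model outputs as log-probabilities. If that is the case, the same column comparison still works, because the logarithm is monotonic: comparing log-probabilities gives the same ordering as comparing probabilities. A small sketch with made-up values (the `toy_*` names are hypothetical):

```python
import numpy as np

# Hypothetical log-probabilities for three tweets (column 0: negative, column 1: positive)
toy_log_probs = np.log(np.array([[0.7, 0.3],
                                 [0.1, 0.9],
                                 [0.5, 0.5]]))

# Compare the log-probabilities directly
is_positive_from_logs = toy_log_probs[:, 1] > toy_log_probs[:, 0]

# Exponentiate back to probabilities and compare again
toy_probs = np.exp(toy_log_probs)
is_positive_from_probs = toy_probs[:, 1] > toy_probs[:, 0]

print(is_positive_from_logs)   # ties (third row) are classified as negative
print(is_positive_from_probs)
```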

In [None]:
# turn probabilities into category predictions
tmp_is_positive = tmp_pred[:,1] > tmp_pred[:,0]
for i, p in enumerate(tmp_is_positive):
    print(f"Neg log prob {tmp_pred[i,0]:.4f}\tPos log prob {tmp_pred[i,1]:.4f}\t is positive? {p}\t actual {tmp_targets[i]}")

Notice that since we are making a prediction using a training batch, the model's predictions are more likely to match the actual targets (labels).  
- Every prediction that the tweet is positive matches the actual target of 1 (positive sentiment).
- Similarly, every prediction that the sentiment is not positive matches the actual target of 0 (negative sentiment).

<a name="5"></a>
## 5 - Evaluation  

<a name="5-1"></a>
### 5.1 - Computing the Accuracy on a Batch

We will now write a function that evaluates the model's predictions on a batch and returns the accuracy. 
- `preds` contains the predictions.
    - Its dimensions are `(batch_size, output_dim)`.  `output_dim` is two in this case.  Column 0 contains the probability that the tweet belongs to class 0 (negative sentiment). Column 1 contains the probability that it belongs to class 1 (positive sentiment).
    - If the probability in column 1 is greater than the probability in column 0, then interpret this as the model's prediction that the example has label 1 (positive sentiment).  
    - Otherwise, if the probabilities are equal or the probability in column 0 is higher, the model's prediction is 0 (negative sentiment).
- `y` contains the actual labels.
- `y_weights` contains the weights to give to predictions.
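
As a small worked example of how the weights enter the accuracy, consider four predictions where the last example has weight 0 and is therefore ignored. The scores and the `toy_*` names below are made up for illustration.

```python
import numpy as np

# Made-up scores for four tweets (column 0: negative, column 1: positive)
toy_preds = np.array([[-0.1, -2.3],   # predicts class 0
                      [-1.6, -0.2],   # predicts class 1
                      [-0.4, -1.1],   # predicts class 0
                      [-0.9, -0.5]])  # predicts class 1
toy_y = np.array([0, 1, 1, 1])
toy_weights = np.array([1.0, 1.0, 1.0, 0.0])  # the last example is ignored

# Predicted label is 1 where column 1 beats column 0, else 0
toy_pred_labels = (toy_preds[:, 1] > toy_preds[:, 0]).astype(np.int32)

# 1.0 where the prediction matches the label, else 0.0
toy_correct = (toy_pred_labels == toy_y).astype(np.float32)

# Weighted correct predictions divided by the sum of the weights
toy_accuracy = np.sum(toy_correct * toy_weights) / np.sum(toy_weights)
print(f"weighted accuracy: {toy_accuracy:.4f}")
```

Three of the four predictions are correct, but the correct fourth prediction carries no weight, so the weighted accuracy is 2 out of 3.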

<a name="5.1.1"></a>
### 5.1.1 - compute_accuracy

In [None]:
def compute_accuracy(preds, y, y_weights):
    """
    Input: 
        preds: a tensor of shape (dim_batch, output_dim) 
        y: a tensor of shape (dim_batch,) with the true labels
        y_weights: a np.ndarray with a weight for each example
    Output: 
        accuracy: a float between 0-1 
        weighted_num_correct (np.float32): Sum of the weighted correct predictions
        sum_weights (np.float32): Sum of the weights
    """
    # Create an array of booleans, 
    # True if the probability of positive sentiment is greater than
    # the probability of negative sentiment
    # else False
    is_pos = preds[:, 1] > preds[:, 0]

    # convert the array of booleans into an array of np.int32
    is_pos_int = is_pos.astype(np.int32)

    # compare the array of predictions (as int32) with the target (labels) of type int32
    correct = is_pos_int == y

    # Count the sum of the weights.
    sum_weights = np.sum(y_weights)
    
    # convert the array of correct predictions (boolean) into an array of np.float32
    correct_float = correct.astype(np.float32)
    
    # Multiply each prediction with its corresponding weight.
    weighted_correct_float = correct_float * y_weights

    # Sum up the weighted correct predictions (of type np.float32), to go in the
    # numerator.
    weighted_num_correct = np.sum(weighted_correct_float)

    # Divide the number of weighted correct predictions by the sum of the
    # weights.
    accuracy = weighted_num_correct / sum_weights

    return accuracy, weighted_num_correct, sum_weights

In [None]:
# test your function
tmp_val_generator = val_generator(64, val_pos
                    , val_neg, Vocab, loop=True
                    , shuffle = False)

# get one batch
tmp_batch = next(tmp_val_generator)

# Position 0 has the model inputs (tweets as tensors)
# position 1 has the targets (the actual labels)
# position 2 has the example weights
tmp_inputs, tmp_targets, tmp_example_weights = tmp_batch

# feed the tweet tensors into the model to get a prediction
tmp_pred = training_loop.eval_model(tmp_inputs)
tmp_acc, tmp_num_correct, tmp_num_predictions = compute_accuracy(preds=tmp_pred, y=tmp_targets, y_weights=tmp_example_weights)

print(f"Model's prediction accuracy on a single validation batch is: {100 * tmp_acc}%")
print(f"Weighted number of correct predictions {tmp_num_correct}; weighted number of total observations predicted {tmp_num_predictions}")

##### Expected output (Approximately)

```
Model's prediction accuracy on a single validation batch is: 100.0%
Weighted number of correct predictions 64.0; weighted number of total observations predicted 64
```

In [None]:
# Test your function
w1_unittest.test_compute_accuracy(compute_accuracy)

<a name="5-2"></a>
### 5.2 - Testing your Model on Validation Data

Now you will write a test function to check your model's prediction accuracy on validation data. 

This program will take in a data generator and your model. 
- The generator allows you to get batches of data. You can use it with a `for` loop:

```
for batch in iterator: 
   # do something with that batch
```

`batch` has `3` elements:
- the first element contains the inputs
- the second element contains the targets
- the third element contains the weights
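
The sketch below shows this loop-and-unpack pattern with a toy generator. `toy_batch_generator` is a hypothetical stand-in for the notebook's generators and only mimics the shape of what they yield.

```python
def toy_batch_generator(batch_size, examples, labels):
    """Yield (inputs, targets, weights) tuples until the data runs out.

    Hypothetical stand-in for the notebook's data generators.
    """
    for start in range(0, len(examples), batch_size):
        inputs = examples[start:start + batch_size]
        targets = labels[start:start + batch_size]
        weights = [1.0] * len(inputs)  # every example counts equally
        yield inputs, targets, weights

toy_examples = [[3, 7], [4, 1], [9, 2], [5, 5], [8, 6]]
toy_labels = [1, 0, 1, 1, 0]

batch_sizes = []
for inputs, targets, weights in toy_batch_generator(2, toy_examples, toy_labels):
    # Each batch unpacks into its three elements
    batch_sizes.append(len(inputs))
    print(len(inputs), targets, weights)
```

Because this toy generator stops when the data is exhausted, the `for` loop terminates on its own and the last batch may be smaller than the others.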

<a name="5.2.1"></a>
### 5.2.1 - test_model

**Instructions:** 
- Compute the accuracy over all the batches in the validation iterator. 
- Make use of `compute_accuracy`, which you recently implemented, and return the overall accuracy.
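
One reason to accumulate the weighted number of correct predictions and the sum of weights across batches, instead of averaging the per-batch accuracies, is that batches may differ in total weight. A short sketch with made-up batch statistics (all names here are hypothetical):

```python
# (weighted number correct, sum of weights) for two hypothetical batches
toy_batch_stats = [(15.0, 16.0), (1.0, 4.0)]

# Accumulate numerators and denominators separately, then divide once
total_correct = sum(correct for correct, _ in toy_batch_stats)
total_weight = sum(weight for _, weight in toy_batch_stats)
overall_accuracy = total_correct / total_weight

# Averaging the per-batch accuracies over-weights the small batch
mean_of_batch_accuracies = sum(c / w for c, w in toy_batch_stats) / len(toy_batch_stats)

print(overall_accuracy, mean_of_batch_accuracies)
```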

In [None]:
# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: test_model
def test_model(generator, model, compute_accuracy=compute_accuracy):
    '''
    Input: 
        generator: an iterator instance that provides batches of inputs and targets
        model: a model instance 
    Output: 
        accuracy: float corresponding to the accuracy
    '''
    
    accuracy = 0.
    total_num_correct = 0
    total_num_pred = 0
        
    ### START CODE HERE ###
    for batch in generator: 
        
        # Retrieve the inputs from the batch
        inputs = batch[0]
        
        # Retrieve the targets (actual labels) from the batch
        targets = batch[1]
        
        # Retrieve the example weight.
        example_weight = batch[2]

        # Make predictions using the inputs            
        pred = model(inputs)
        
        # Calculate accuracy for the batch by comparing its predictions and targets
        batch_accuracy, batch_num_correct, batch_num_pred = compute_accuracy(pred, targets, example_weight)
                
        # Update the total number of correct predictions
        # by adding the number of correct predictions from this batch
        total_num_correct += batch_num_correct
        
        # Update the total number of predictions 
        # by adding the number of predictions made for the batch
        total_num_pred += batch_num_pred

    # Calculate accuracy over all examples
    accuracy = total_num_correct / total_num_pred
    
    ### END CODE HERE ###
    return accuracy

In [None]:
# DO NOT EDIT THIS CELL
# testing the accuracy of your model: this takes around 20 seconds
model = training_loop.eval_model
accuracy = test_model(test_generator(16, val_pos
                    , val_neg, Vocab, loop=False
                    , shuffle = False), model)

print(f'The accuracy of your model on the validation set is {accuracy:.4f}', )

##### Expected Output (Approximately)

```
The accuracy of your model on the validation set is 0.9950
```

In [None]:
w1_unittest.unittest_test_model(test_model, test_generator(16, val_pos , val_neg, Vocab, loop=False, shuffle = False), model)

<a name="6"></a>
## 6 - Testing with your Own Input

Finally, you will test with your own input. You will see that deep nets are more powerful than the older methods you have used before. Although you got close to 100% accuracy in the first two assignments, the task there was much easier. 

In [None]:
# this is used to predict on your own sentence
def predict(sentence):
    inputs = np.array(tweet_to_tensor(sentence, vocab_dict=Vocab))
    
    # Batch size 1, add dimension for batch, to work with the model
    inputs = inputs[None, :]  
    
    # predict with the model
    preds_probs = model(inputs)
    
    # Turn probabilities into categories
    preds = int(preds_probs[0, 1] > preds_probs[0, 0])
    
    sentiment = "negative"
    if preds == 1:
        sentiment = 'positive'

    return preds, sentiment
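
The `inputs[None, :]` indexing in `predict` adds a leading batch dimension, so a single tweet looks like a batch of size 1. A minimal NumPy illustration, with made-up token ids:

```python
import numpy as np

# Hypothetical tensor for one tweet: a 1-D array of token ids
toy_tensor = np.array([12, 45, 7, 3])

# Indexing with None inserts a new axis of length 1 at that position
toy_batch = toy_tensor[None, :]

print(toy_tensor.shape, toy_batch.shape)  # (4,) (1, 4)
```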


In [None]:
# try a positive sentence
sentence = "It's such a nice day, I think I'll be taking Sid to Ramsgate for lunch and then to the beach maybe."
tmp_pred, tmp_sentiment = predict(sentence)
print(f"The sentiment of the sentence \n***\n\"{sentence}\"\n***\nis {tmp_sentiment}.")

print()
# try a negative sentence
sentence = "I hated my day, it was the worst, I'm so sad."
tmp_pred, tmp_sentiment = predict(sentence)
print(f"The sentiment of the sentence \n***\n\"{sentence}\"\n***\nis {tmp_sentiment}.")

Notice that the model works well even for complex sentences.

<a name="7"></a>
## 7 - Word Embeddings

In this section, you will visualize the word embeddings that were constructed for this sentiment analysis task. You can retrieve them by looking at the `model.weights` tuple (recall that the first layer of the model is the embedding layer).
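
An embedding layer's weights form a matrix with one row per vocabulary entry, so looking up a word's embedding amounts to selecting the row at that word's index. The sketch below uses a made-up vocabulary and random weights (all `toy_*` names are hypothetical) rather than the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and embedding matrix of shape (vocab_size, embedding_dim)
toy_vocab = {'__PAD__': 0, 'good': 1, 'bad': 2, 'day': 3}
toy_embeddings = rng.normal(size=(len(toy_vocab), 5))

# Look up the vectors for two words by indexing rows of the matrix
toy_ids = np.array([toy_vocab['good'], toy_vocab['day']])
toy_vectors = toy_embeddings[toy_ids]

print(toy_embeddings.shape, toy_vectors.shape)
```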

In [None]:
embeddings = model.weights[0]

Let's take a look at the size of the embeddings. 

In [None]:
embeddings.shape

To visualize the word embeddings, it is necessary to choose 2 directions to use as axes for the plot. You could use random directions or the first two eigenvectors from PCA. Here, you'll use scikit-learn to perform dimensionality reduction of the word embeddings using PCA. 
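
As a rough illustration of what PCA computes, the sketch below centers a random matrix and projects it onto its top two principal directions using NumPy's SVD. It uses made-up data rather than the trained embeddings, and scikit-learn's result may differ from it by a sign flip per component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "embeddings": 50 vectors of dimension 8
toy_embeddings = rng.normal(size=(50, 8))

# PCA works on centered data
centered = toy_embeddings - toy_embeddings.mean(axis=0)

# Rows of `components` are the principal directions, ordered by singular value
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Project onto the first two principal directions
toy_2d = centered @ components[:2].T

print(toy_2d.shape)  # (50, 2)
```

The first projected coordinate carries at least as much variance as the second, which is why these two directions are a natural choice of plot axes.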

In [None]:
from sklearn.decomposition import PCA #Import PCA from scikit-learn
pca = PCA(n_components=2) #PCA with two dimensions

emb_2dim = pca.fit_transform(embeddings) #Dimensionality reduction of the word embeddings

Now, everything is ready to plot a selection of words in 2d. 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

#Selection of negative and positive words
neg_words = ['worst', 'bad', 'hurt', 'sad', 'hate']
pos_words = ['best', 'good', 'nice', 'better', 'love']

#Index of each selected word
neg_n = [Vocab[w] for w in neg_words]
pos_n = [Vocab[w] for w in pos_words]

plt.figure()

#Scatter plot for negative words
plt.scatter(emb_2dim[neg_n][:,0],emb_2dim[neg_n][:,1], color = 'r')
for i, txt in enumerate(neg_words): 
    plt.annotate(txt, (emb_2dim[neg_n][i,0],emb_2dim[neg_n][i,1]))

#Scatter plot for positive words
plt.scatter(emb_2dim[pos_n][:,0],emb_2dim[pos_n][:,1], color = 'g')
for i, txt in enumerate(pos_words): 
    plt.annotate(txt,(emb_2dim[pos_n][i,0],emb_2dim[pos_n][i,1]))

plt.title('Word embeddings in 2d')

plt.show()

As you can see, the word embeddings for this task seem to distinguish negative and positive meanings very well. However, clusters don't necessarily have similar words since you only trained the model to analyze overall sentiment. 

### On Deep Nets

Deep nets allow you to understand and capture dependencies that you would not have been able to capture with a simple linear regression or logistic regression. 
- They also allow you to make better use of pre-trained embeddings for classification and tend to generalize better.