# Task 4

In this notebook, you'll implement a recurrent neural network for sentiment analysis as part of Task 4. Utilizing an RNN over a feedforward network is advantageous because it can incorporate information about the sequence of words. We'll be working with a dataset of movie reviews paired with sentiment labels.


**The network architecture is illustrated below.**
<img src="assets/network_diagram.png" width=400px>

- The input words will be passed through an embedding layer. This layer is essential since we're dealing with tens of thousands of words, necessitating a more efficient representation than one-hot encoded vectors. Although training an embedding with word2vec is an option, for simplicity, we'll employ an embedding layer and let the network learn the embeddings on its own.

- The embeddings will then be fed into LSTM cells. These cells introduce recurrent connections to the network, enabling it to capture information about the sequence of words in the data. Finally, the LSTM cells will lead to a sigmoid output layer. Given that we aim to predict whether a text expresses positive or negative sentiment, the sigmoid activation function is utilized. Consequently, the output layer will consist of a single unit with a sigmoid activation function.

- We're primarily interested in the sigmoid outputs of the last step; the rest can be disregarded. The cost will be computed based on the output of the last step and the training label.


In [None]:
import numpy as np
import tensorflow as tf

The dataset comes in two files: `reviews.txt`, containing movie reviews, and `labels.txt`, containing the corresponding sentiment labels. Begin by reading the data from these files, then preprocess it as needed for sentiment analysis.

In [2]:
with open('reviews.txt', 'r') as f:
    reviews = f.read()
with open('labels.txt', 'r') as f:
    labels = f.read()

In [None]:
reviews[:2000]

## Data preprocessing

- The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

- You can see an example of the reviews data above. We'll want to get rid of those periods. Also, you might notice that the reviews are delimited with newlines `\n`. To deal with those, I'm going to split the text into individual reviews using `\n` as the delimiter. Then I can combine all the reviews back together into one big string.

- First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words.

In [4]:
from string import punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])
reviews = all_text.split('\n')

all_text = ' '.join(reviews)
words = all_text.split()

In [None]:
all_text[:2000]

In [None]:
words[:100]

In [None]:
print(type(reviews), len(reviews), reviews[0],'last review:', reviews[25000])

## Task Details

## Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.
> **Exercise:** Now you're going to encode the words with integers. Build a dictionary that maps words to integers. Later we're going to pad our input vectors with zeros, so make sure the integers **start at 1, not 0**.
> Also, convert the reviews to integers and store the reviews in a new list called `reviews_ints`. 

In [7]:
# Create your dictionary that maps vocab words to integers here

# vocab_to_int = {word:idx+1 for idx,word in enumerate(set(words)) }

from collections import Counter
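
One possible way to approach this exercise is sketched below. It is only an illustration under stated assumptions: the toy `reviews` list stands in for the real data, and ordering the vocabulary by frequency is a design choice, not a requirement of the task.

```python
from collections import Counter

# Toy stand-ins for the notebook's `reviews` and `words` variables (assumed data).
reviews = ["best movie ever", "worst movie ever made", ""]
words = " ".join(reviews).split()

# Sort the vocabulary by frequency so the most common word gets index 1.
# Index 0 is deliberately left unused so it can serve as the padding value.
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: idx for idx, word in enumerate(vocab, start=1)}

# Encode every review as a list of integers.
reviews_ints = [[vocab_to_int[word] for word in review.split()]
                for review in reviews]

print(len(vocab_to_int), reviews_ints)
```

Note that the empty review becomes an empty list, which is the zero-length entry dealt with later in the notebook.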

## Encoding the labels

Our labels are "positive" or "negative". To use these labels in our network, we need to convert them to 0 and 1.

> **Exercise:** Convert labels from `positive` and `negative` to 1 and 0, respectively.

In [25]:
## Code Here
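
A minimal sketch of the label conversion, assuming `labels` is the raw string read from `labels.txt` with one label per line. The toy string and the trailing newline are assumptions made for illustration.

```python
# Toy stand-in for the contents of labels.txt (assumed: one label per line).
labels = "positive\nnegative\npositive\n"

# Map "positive" to 1 and everything else to 0.
# A trailing newline produces an empty final entry, which also maps to 0.
labels = [1 if label == "positive" else 0 for label in labels.split("\n")]

print(labels)
```

Later cells call `labels.shape`, so in the notebook you would likely wrap the result in `np.array(...)`.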

If you built `labels` correctly, you should see the next output.

In [None]:
from collections import Counter

review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Okay, a couple of issues here. We seem to have one review with zero length. And the maximum review length is way too many steps for our LSTM.
> **Exercise:** First, remove the review with zero length from the `reviews_ints` list.

In [None]:
# Filter out that review with 0 length
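
A sketch of the filtering step on toy data. One thing worth keeping in mind: the matching label has to be dropped as well, otherwise reviews and labels fall out of alignment. The values below are made up for illustration.

```python
# Toy stand-ins for the encoded reviews and their labels (assumed data).
reviews_ints = [[3, 1, 2], [], [4, 1, 2, 5]]
labels = [1, 0, 0]

# Indices of the reviews that contain at least one word.
non_empty = [i for i, review in enumerate(reviews_ints) if len(review) > 0]

# Apply the same index selection to both lists so they stay aligned.
reviews_ints = [reviews_ints[i] for i in non_empty]
labels = [labels[i] for i in non_empty]

print(len(reviews_ints), labels)
```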


In [None]:
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

> **Exercise:** Now, create an array `features` that contains the data we'll pass to the network. The data should come from `reviews_ints`, since we want to feed integers to the network. Each row should be 200 elements long. For reviews shorter than 200 words, left pad with 0s. That is, if the review is `['best', 'movie', 'ever']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 0, 117, 18, 128]`. For reviews longer than 200 words, use only the first 200 words as the feature vector.

This isn't trivial and there are a bunch of ways to do this. But, if you're going to be building your own deep learning networks, you're going to have to get used to preparing your data.



In [29]:
###Code here for creating feature vectors
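
The padding and truncation rule described above could be sketched as follows. The two toy reviews are assumptions: one is the short example from the exercise, the other is an artificial 250-word review used to show truncation.

```python
import numpy as np

seq_len = 200

# Toy stand-ins for `reviews_ints`: one short review, one longer than seq_len.
reviews_ints = [[117, 18, 128], list(range(1, 251))]

# Start from all zeros so that untouched positions act as left padding.
features = np.zeros((len(reviews_ints), seq_len), dtype=int)
for i, review in enumerate(reviews_ints):
    trimmed = review[:seq_len]          # keep only the first 200 words
    if trimmed:                         # guard: a slice of [-0:] would cover the whole row
        features[i, -len(trimmed):] = trimmed

print(features[0, -5:], features[1, :5])
```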

If you built `features` correctly, it should look like the cell output below.

In [None]:
features[90:100,:100]

In [None]:
features[:10,:100]

## Training, Validation, Test



With our data in nice shape, we'll split it into training, validation, and test sets.

> **Exercise:** Create the training, validation, and test sets here. You'll need to create sets for the features and the labels, `train_x` and `train_y` for example. Define a split fraction, `split_frac` as the fraction of data to keep in the training set. Usually this is set to 0.8 or 0.9. The rest of the data will be split in half to create the validation and testing data.

In [None]:
print(features.shape, labels.shape)

In [None]:
split_frac = 0.8
split_idx = int(len(features)*split_frac)

train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = labels[:split_idx], labels[split_idx:]

test_idx = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_idx], val_x[test_idx:]
val_y, test_y = val_y[:test_idx], val_y[test_idx:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

With train, validation, and test fractions of 0.8, 0.1, 0.1, the final shapes should look like:
```
                    Feature Shapes:
Train set: 		 (20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		  (2501, 200)
```

## Build Model 
- You are free to choose the library: TensorFlow or PyTorch

## Embedding

Now we'll add an embedding layer. We need to do this because there are 74000 words in our vocabulary. It is massively inefficient to one-hot encode our classes here. You should remember dealing with this problem from the word2vec lesson. Instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table. You could train an embedding layer using word2vec, then load it here. But, it's fine to just make a new layer and let the network learn the weights.

> **Exercise:** Create the embedding lookup matrix 



In [40]:
# Size of the embedding vectors (number of units in the embedding layer)
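
To make the lookup-table idea concrete, here is a framework-agnostic NumPy sketch. It is not the trainable layer itself (in practice you would likely use something like `tf.keras.layers.Embedding` or `torch.nn.Embedding`); the sizes and names below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_words = 10      # hypothetical vocabulary size, including index 0 for padding
embed_size = 4    # hypothetical size of each embedding vector

# The embedding matrix holds one row per word index.
embedding = rng.uniform(-1, 1, size=(n_words, embed_size))

# A toy batch of two left-padded reviews, each four steps long.
batch = np.array([[0, 0, 3, 7],
                  [0, 5, 1, 2]])

# Looking up an embedding is just row indexing: (2, 4) -> (2, 4, embed_size).
embed = embedding[batch]

# The same result via one-hot vectors and a matrix product, which is what
# the lookup avoids computing explicitly.
one_hot = np.eye(n_words)[batch]
same = np.allclose(one_hot @ embedding, embed)

print(embed.shape, same)
```

The one-hot route multiplies by a matrix that is almost entirely zeros, which is why the direct lookup is preferred for large vocabularies.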


## LSTM cell implementation

<img src="assets/network_diagram.png" width=400px>

- Next, we'll create our LSTM cells to use in the recurrent network
- Add Dropout: Dropout regularization can be added to the LSTM cell

In [None]:
## Code for defining basic LSTM architecture - you can refer to open source repositories for implementations 
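
For reference, a common formulation of a single LSTM step is given below. The notebook does not spell this out, so the notation is an assumption introduced here: $x_t$ is the embedded input at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication.

$$
\begin{aligned}
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
g_t &= \tanh\left(W_g\,[h_{t-1}, x_t] + b_g\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Dropout, if used, is typically applied to the cell's output $h_t$ (and sometimes to its input $x_t$) during training only.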

## RNN forward pass

<img src="assets/network_diagram.png" width=400px>

Now we need to actually run the data through the RNN nodes. 

> **Exercise:** Write the forward pass through the RNN. Remember that we're actually passing in vectors from the embedding layer, `embed`.



In [42]:
## Code for forward pass
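
As an illustration of what the forward pass computes, the following NumPy sketch unrolls a single LSTM layer over the time steps of `embed`. It is not the TensorFlow or PyTorch implementation the exercise asks for; the function name `lstm_forward`, the packing of the four gate weights into one matrix `W`, and all sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(embed, W, b, lstm_size):
    """Unroll one LSTM layer over a batch of embedded sequences.

    embed: (batch, steps, embed_size)
    W:     (lstm_size + embed_size, 4 * lstm_size), gates packed side by side
    b:     (4 * lstm_size,)
    Returns the hidden state at every step: (batch, steps, lstm_size).
    """
    batch_size, n_steps, _ = embed.shape
    h = np.zeros((batch_size, lstm_size))   # initial hidden state
    c = np.zeros((batch_size, lstm_size))   # initial cell state
    outputs = np.zeros((batch_size, n_steps, lstm_size))
    for t in range(n_steps):
        # Previous hidden state and current input feed all four gates at once.
        z = np.concatenate([h, embed[:, t, :]], axis=1) @ W + b
        i, f, o, g = np.split(z, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs[:, t, :] = h
    return outputs

rng = np.random.default_rng(0)
embed_size, lstm_size = 3, 4                   # hypothetical sizes
embed = rng.normal(size=(2, 5, embed_size))    # toy batch: 2 reviews, 5 steps
W = rng.normal(scale=0.1, size=(lstm_size + embed_size, 4 * lstm_size))
b = np.zeros(4 * lstm_size)

outputs = lstm_forward(embed, W, b, lstm_size)
print(outputs.shape)
```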

## Output

We only care about the final output, since we'll be using that as our sentiment prediction. So we need to grab the last output with `outputs[:, -1]`, then calculate the cost from that and `labels_`.

In [43]:
## Code for fetching out outputs and calculate loss (For simplicity you can use Mean Square Error Loss)
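
The last-step selection and the mean squared error cost can be sketched in NumPy as follows. The random `outputs` array stands in for real RNN outputs, and the output-layer weights `w_out` and `b_out` are hypothetical names introduced for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Stand-in for the RNN outputs: (batch, steps, lstm_size).
outputs = rng.normal(size=(3, 5, 4))
labels_ = np.array([1.0, 0.0, 1.0])

# Hypothetical weights of the single-unit sigmoid output layer.
w_out = rng.normal(scale=0.1, size=(4, 1))
b_out = 0.0

# Keep only the final time step of every sequence: (batch, lstm_size).
last = outputs[:, -1]

# One sigmoid unit per review gives a probability-like score in (0, 1).
predictions = sigmoid(last @ w_out + b_out).ravel()

# Mean squared error between the scores and the 0/1 labels.
cost = np.mean((labels_ - predictions) ** 2)

# Accuracy after thresholding the scores at 0.5.
accuracy = np.mean((predictions > 0.5) == (labels_ == 1.0))

print(predictions.shape, cost, accuracy)
```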

## Batching

This is a simple function for returning batches from our data. First it removes data such that we only have full batches. Then it iterates through the `x` and `y` arrays and returns slices out of those arrays with size `[batch_size]`.

In [45]:
def get_batches(x, y, batch_size=100):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
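
A quick usage check, repeating the function so the snippet runs on its own. The toy data of 250 examples is an assumption chosen to show that the final partial batch is dropped.

```python
def get_batches(x, y, batch_size=100):
    # Same function as above, repeated so this snippet is self-contained.
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii + batch_size], y[ii:ii + batch_size]

# 250 toy examples with a batch size of 100: the last 50 are discarded.
x = list(range(250))
y = [i % 2 for i in x]

batches = list(get_batches(x, y, batch_size=100))
print(len(batches), len(batches[0][0]), batches[-1][0][-1])
```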

## Training
Below is the typical training code. If you want to do this yourself, feel free to delete all this code and implement it yourself. Before you run this, make sure the `checkpoints` directory exists.

In [None]:
# Write code that trains on batches through the forward pass and saves checkpoints in the ./checkpoints/ folder

## Testing

In [None]:
# Write code for testing

## Submission Guidelines

Since this is an open-ended task, you are free to use whichever technique or method you want. You can use any tool of your choice, including TensorFlow or PyTorch, but please clearly state any assumptions you make while designing the solution. Submit a single .ipynb file as your solution. Please don't submit zipped folders.