
# Assignment 1: Sentiment with Deep Neural Networks

In course 1, you implemented logistic regression and Naive Bayes for sentiment analysis. However, if you were to give your old models an example like:

Your model would have predicted a positive sentiment for that review. However, that sentence has a negative sentiment and indicates that the movie was not good. To solve those kinds of misclassifications, you will write a program that uses deep neural networks to identify sentiment in text. By completing this assignment, you will:

* Understand how you can build/design a model using layers
* Train a model using a training loop
* Use a binary cross-entropy loss function
* Compute the accuracy of your model
* Predict using your own input

As you can tell, this model follows a similar structure to the one you previously implemented in the second course of this specialization.

Indeed, most of the deep nets you will implement have a similar structure. The only things that change are the model architecture, the inputs, and the outputs. Before starting the assignment, we will introduce you to Trax, the Google library that we use for building and training models.
We will also show you how to compute the gradient of a function f by just using trax.fastmath.grad(f).

Trax source code can be found on GitHub: Trax
The Trax code also uses the JAX library: JAX

## Part 1: Import libraries and try out Trax

In [1]:
import os 
import random as rnd

# Install trax

!pip install sentencepiece==0.1.91
!pip install trax

# import relevant libraries
import trax

# set random seeds to make this notebook easier to replicate
#trax.supervised.trainer_lib.init_random_number_generators(31)

# import trax.fastmath.numpy
import trax.fastmath.numpy as np

# import trax.layers
from trax import layers as tl

# Download the utils file

!wget https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C3/main/W1/utils.py

# import Layer from the utils.py file
from utils import Layer, load_tweets, process_tweet

print("Imports OK")


### Since I imported trax's version of numpy I can create vectors directly.

In [9]:
a = np.array((5., 2.))

type(a) # Notice it's not a np array but a jax DeviceArray.

jax.interpreters.xla.DeviceArray

In [10]:
# Now do a function with the same array

def f(x):

    return(x**2)

print(f"f(a) = {f(a)}")

f(a) = [25.  4.]


In [13]:
# And now the derivative (2x)

grad_f = trax.fastmath.grad(fun=f)  # grad needs f to return a scalar, so it is evaluated at a scalar input below
type(grad_f)

function

In [14]:
# grad_f(a) would fail: f(a) is a vector, and grad is only defined for scalar outputs
b = 13.0
grad_b = grad_f(b)

display(grad_b)

DeviceArray(26., dtype=float32)
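
As a sanity check on the value above, the derivative can also be approximated numerically. The sketch below uses plain Python and a central difference; `central_difference` and the step size `h` are illustrative names chosen here, not part of Trax or JAX.

```python
def f(x):
    """Same function as in the cell above: f(x) = x**2."""
    return x ** 2


def central_difference(fun, x, h=1e-4):
    """Approximate fun'(x) with a symmetric finite difference."""
    return (fun(x + h) - fun(x - h)) / (2 * h)


b = 13.0
approx_grad = central_difference(f, b)

# The analytical derivative is 2 * x, so the result should be close to 26
print(f"Numerical derivative at {b}: {approx_grad:.4f}")
```

The numerical value only approximates the exact gradient, which is why automatic differentiation is preferred for training.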

## Part 2: Importing the data

## 2.1 Loading in the data
Import the data set.

* You may recognize this from earlier assignments in the specialization.
* Details of the process_tweet function are available in the utils.py file.

In [15]:
import numpy as np  # Rebind np to standard NumPy for the data handling below.

In [16]:
all_pos_tweets, all_neg_tweets = load_tweets()

print("Number of positive tweets", len(all_pos_tweets))
print("Number of negative tweets", len(all_neg_tweets))

Number of positive tweets 5000
Number of negative tweets 5000


## Now I'll create train and test sets

* Shuffle the tweets if they are not randomly sorted (the code below slices them in their existing order).
* Split the positive and the negative tweets into train and test sets (80-20, my choice).
* Add labels (1 positive, 0 negative).
* Check it all.

In [17]:
train_pos = all_pos_tweets[:4000]
test_pos = all_pos_tweets[4000:]

train_neg = all_neg_tweets[:4000]
test_neg = all_neg_tweets[4000:]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))

print("Length of train_pos:", len(train_pos))
print("Length of train_neg:", len(train_neg))

print("Length of test_pos:", len(test_pos))
print("Length of test_neg:", len(test_neg))

print("Length of train_x:", len(train_x))
print("Length of train_y:", len(train_y))

print("First 5 values of tags", train_y[0:5])
print("Last 5 values of tags", train_y[-5:])



Length of train_pos: 4000
Length of train_neg: 4000
Length of test_pos: 1000
Length of test_neg: 1000
Length of train_x: 8000
Length of train_y: 8000
First 5 values of tags [1. 1. 1. 1. 1.]
Last 5 values of tags [0. 0. 0. 0. 0.]
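
If the tweets were not already in random order, the split could shuffle each class first and build the test labels at the same time. The following is a minimal sketch under that assumption; `split_and_label`, `train_frac`, and the toy lists are hypothetical names used only for illustration and are not part of utils.py.

```python
import random


def split_and_label(pos, neg, train_frac=0.8, seed=31):
    """Shuffle each class, split it, and return (train_x, train_y, test_x, test_y)."""
    rng = random.Random(seed)        # local RNG so the split is reproducible
    pos, neg = list(pos), list(neg)  # copy so the caller's lists are untouched
    rng.shuffle(pos)
    rng.shuffle(neg)

    n_pos = int(len(pos) * train_frac)
    n_neg = int(len(neg) * train_frac)

    train_x = pos[:n_pos] + neg[:n_neg]
    test_x = pos[n_pos:] + neg[n_neg:]
    # Labels follow the same order: positives (1.0) first, then negatives (0.0)
    train_y = [1.0] * n_pos + [0.0] * n_neg
    test_y = [1.0] * (len(pos) - n_pos) + [0.0] * (len(neg) - n_neg)
    return train_x, train_y, test_x, test_y


# Toy data standing in for all_pos_tweets / all_neg_tweets
toy_pos = [f"happy {i}" for i in range(10)]
toy_neg = [f"sad {i}" for i in range(10)]
toy_train_x, toy_train_y, toy_test_x, toy_test_y = split_and_label(toy_pos, toy_neg)
print(len(toy_train_x), len(toy_test_x))  # 16 4
```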


### Preprocess the tweets to clean them. I have a function but in any case I'm used to this.

In [18]:
# This function only processes one tweet. I'll call it in a loop or a list comprehension

print("The first positive tweet is:", all_pos_tweets[0])

clean_tweet = process_tweet(all_pos_tweets[0])

print("The clean tweet is:", clean_tweet) # Notice it removes all the twitter handles and the "#" symbol. It also tokenizes and stems the words.

The first positive tweet is: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
The clean tweet is: ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
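
To make the cleaning steps concrete, here is a simplified stand-in written with the standard library only. It is not the process_tweet function from utils.py: `simple_clean` and its tiny stop-word list are assumptions made for illustration, and it does not stem, so words such as "engaged" keep their full form.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used in utils.py)
STOPWORDS = {"for", "being", "in", "my", "this", "the", "a", "an", "is", "it"}


def simple_clean(tweet):
    """Lowercase a tweet, drop URLs, handles and '#', and remove stop words."""
    tweet = re.sub(r"https?://\S+", "", tweet)  # remove URLs
    tweet = re.sub(r"@\w+", "", tweet)          # remove Twitter handles
    tweet = tweet.replace("#", "")              # keep the hashtag word, drop '#'
    tokens = tweet.lower().split()              # naive whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]


example = ("#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being "
           "top engaged members in my community this week :)")
print(simple_clean(example))
# ['followfriday', 'top', 'engaged', 'members', 'community', 'week', ':)']
```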


## 2.2 Building the vocabulary
Now build the vocabulary.

* Map each word in each tweet to an integer (an "index").
* The following code does this for you, but please read it and understand what it's doing.
* Note that you will build the vocabulary based on the training data only.
* To do so, you will assign an index to every word by iterating over your training set.
* The vocabulary will also include some special tokens:

* "<PAD>": padding
* "<END>": end of line
* "<UNK>": a token representing any word that is not in the vocabulary.

In [None]:
# Start with the padding, end and UNK

vocab = {"<PAD>": 0, "<END>": 1, "<UNK>": 2}

# Keep in mind, the vocabulary is only with the training data!!

for tweet in train_x:
    processed_tweet = process_tweet(tweet)

    for word in processed_tweet:
        if word not in vocab:
            vocab[word] = len(vocab) # len vocab changes with every new word

print("Total words:", len(vocab))

#display(vocab)
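
Since the cell above prints only the vocabulary size, a toy run helps to see how the indices are assigned. This sketch wraps the same loop in a hypothetical helper, `build_vocab`, and feeds it two already tokenized toy tweets instead of the real training set.

```python
def build_vocab(tokenized_tweets):
    """Give each unseen token the next free integer, after the special tokens."""
    vocab = {"<PAD>": 0, "<END>": 1, "<UNK>": 2}
    for tokens in tokenized_tweets:
        for word in tokens:
            if word not in vocab:
                vocab[word] = len(vocab)  # len(vocab) grows with every new word
    return vocab


# Two toy tweets, already cleaned and tokenized
toy_tweets = [["good", "day", ":)"], ["bad", "day", ":("]]
toy_vocab = build_vocab(toy_tweets)
print(toy_vocab)
# {'<PAD>': 0, '<END>': 1, '<UNK>': 2, 'good': 3, 'day': 4, ':)': 5, 'bad': 6, ':(': 7}
```

Note that "day" appears twice but receives a single index, because it is only added the first time it is seen.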

## Exercise 01
Instructions: Write a function tweet_to_tensor that takes in a tweet and converts it to a list of integers. Use the vocab dictionary you just built to create the tensor.

* Use the vocab_dict parameter and not a global variable.
* Do not hard-code the integer value for the <UNK> token.
* Map each word in the tweet to its corresponding index in vocab_dict.
* Use Python's dict.get(key, default) so that the lookup returns a default value if the key is not found in the dictionary.
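
The fallback lookup described in the last bullet can be sketched on a toy dictionary. The names `toy_vocab`, `unk_id`, and `tokens` are illustrative only.

```python
toy_vocab = {"<PAD>": 0, "<END>": 1, "<UNK>": 2, "good": 3, "day": 4}
unk_id = toy_vocab["<UNK>"]          # look the ID up instead of hard-coding 2

tokens = ["good", "morning", "day"]  # "morning" is not in the toy vocabulary
ids = [toy_vocab.get(tok, unk_id) for tok in tokens]
print(ids)  # [3, 2, 4]
```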

In [23]:
def tweet_to_tensor(tweet, vocab_dict, unk_token="<UNK>"):
    """
    Convert a raw tweet into a list of vocabulary indices.
    Inputs:
      tweet: a raw tweet (string); it is cleaned here with process_tweet
      vocab_dict: the vocabulary, mapping each word to its index
      unk_token: the token used for words missing from vocab_dict
    Output:
      tensor_l: a list with the indices of the words in the tweet
    """
    word_list = process_tweet(tweet)

    # Look up the ID of the unknown token rather than hard-coding it
    unk_ID = vocab_dict[unk_token]

    # dict.get falls back to unk_ID for out-of-vocabulary words
    tensor_l = [vocab_dict.get(word, unk_ID) for word in word_list]

    return tensor_l



In [24]:
print("Tweet is:", test_pos[0])
tensor_tweet = tweet_to_tensor(test_pos[0], vocab)
print("Corresponding tensor is:", tensor_tweet)

Tweet is: Bro:U wan cut hair anot,ur hair long Liao bo
Me:since ord liao,take it easy lor treat as save $ leave it longer :)
Bro:LOL Sibei xialan
Corresponding tensor is: [1065, 136, 479, 2351, 745, 8146, 1123, 745, 53, 2, 2672, 791, 2, 2, 349, 601, 2, 3489, 1017, 597, 4559, 9, 1065, 157, 2, 2]


In [27]:
#Thorough checks

# test tweet_to_tensor

def test_tweet_to_tensor():
    test_cases = [
        
        {
            "name":"simple_test_check",
            "input": [test_pos[1], vocab],
            "expected":[444, 2, 304, 567, 56, 9],
            "error":"The function gives bad output for test_pos[1]. Test failed"
        },
        {
            "name":"datatype_check",
            "input":[test_pos[1], vocab],
            "expected":type([]),
            "error":"Datatype mismatch. Need only list not np.array"
        },
        {
            "name":"without_unk_check",
            "input":[test_pos[1], vocab],
            "expected":6,
            "error":"Unk word check not done- Please check if you included mapping for unknown word"
        }
    ]
    count = 0
    for test_case in test_cases:
        
        try:
            if test_case['name'] == "simple_test_check":
                assert test_case["expected"] == tweet_to_tensor(*test_case['input'])
                count += 1
            if test_case['name'] == "datatype_check":
                assert isinstance(tweet_to_tensor(*test_case['input']), test_case["expected"])
                count += 1
            if test_case['name'] == "without_unk_check":
                assert None not in tweet_to_tensor(*test_case['input'])
                count += 1
                
            
            
        except Exception:  # report the failed case without swallowing KeyboardInterrupt
            print(test_case['error'])
    if count == 3:
        print("\033[92m All tests passed")
    else:
        print(count," Tests passed out of 3")
test_tweet_to_tensor()


All tests passed
