Before starting this DataLab, preprocess the tweets as you did in the previous one.

In [None]:
import nltk
from nltk.corpus import twitter_samples
import matplotlib.pyplot as plt
import random
import numpy as np
import pandas as pd

In [None]:
nltk.download('twitter_samples')

In [None]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [None]:
print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

In [None]:
import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords_english = stopwords.words('english')

In [None]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a list of tokens
        
    Processing steps:
    - Removes hyperlinks
    - Removes # sign
    - Tokenizes
    - Removes stopwords and punctuation
    - Stems tokens
        
    """
    # YOUR CODE HERE #
    return processed_tweet

In [None]:
tweet = all_positive_tweets[2277]
tweet_processed = tweet_processor(tweet)

print(tweet)
print(tweet_processed)

In [None]:
# 80% training, 20% testing (4000 and 1000 tweets per class)
positive_tweets_tr = all_positive_tweets[:4000]
positive_tweets_te = all_positive_tweets[4000:]

negative_tweets_tr = all_negative_tweets[:4000]
negative_tweets_te = all_negative_tweets[4000:]

In [None]:
def tweet_processor_list(tweet_list):
    # YOUR CODE HERE #
    return processed_tweet_list

In [None]:
positive_tweets_tr = tweet_processor_list(positive_tweets_tr)
positive_tweets_te = tweet_processor_list(positive_tweets_te)

negative_tweets_tr = tweet_processor_list(negative_tweets_tr)
negative_tweets_te = tweet_processor_list(negative_tweets_te)

You already completed the steps up to this point in the previous DataLab. Let's do a quick sanity check:

In [None]:
assert len(positive_tweets_tr) == 4000
assert len(negative_tweets_tr) == 4000

assert len(positive_tweets_te) == 1000
assert len(negative_tweets_te) == 1000

assert type(positive_tweets_tr) is list
assert type(positive_tweets_tr[0]) is list
assert type(positive_tweets_tr[0][0]) is str

# Introduction

In this DataLab you will train a neural network with an embedding layer to classify tweets as positive or negative.

Until now, we based our models (Logistic Regression and Naive Bayes) on word counts. For the first time, we will also try to capture meaning by using embeddings.

To achieve this, we need to represent words as numbers because neural networks work with numbers.

## 1. Converting words to numbers

Before using the embedding layer, we need to convert tokens to numbers. The best way to explain how to do this is to show it, so let's use an example corpus with only 4 documents.

In [None]:
training_documents = ['This is a tasty apple.',
                      'Hello John!',
                      'I liked the movie.',
                      'I have a car.']

Let's use `Tokenizer` from Keras to convert these documents to numbers:

https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(training_documents)

What we just did was to simply tokenize all the documents, find the unique tokens (or words) and assign a number (index) to them. You can view the results by using the `word_index` attribute on the `tokenizer` object, which in fact is our vocabulary:

In [None]:
tokenizer.word_index
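To make the indexing rule concrete, here is a simplified pure-Python sketch of how such a vocabulary can be built. It is not the Keras implementation: the helper name `build_word_index` and the regex-based punctuation stripping are assumptions made for illustration, and they only approximate the default lowercasing and filtering of `Tokenizer`. Words are ranked by frequency, ties keep their order of first appearance, and indexing starts at 1 so that 0 stays free.

```python
import re
from collections import Counter

# Same example corpus as above
training_documents = ['This is a tasty apple.',
                      'Hello John!',
                      'I liked the movie.',
                      'I have a car.']

def build_word_index(documents):
    """Hypothetical helper: map each unique word to an integer index (starting at 1)."""
    counts = Counter()
    for doc in documents:
        # Lowercase, replace punctuation with spaces, then split on whitespace
        tokens = re.sub(r"[^\w\s]", " ", doc.lower()).split()
        counts.update(tokens)
    # Most frequent words first; sorted() is stable, so ties keep first-seen order
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}

word_index = build_word_index(training_documents)
print(word_index)
```

In this sketch the two most frequent words, `'a'` and `'i'`, receive indices 1 and 2, and `'hello'` and `'john'` end up at 7 and 8.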

Now we can use the `tokenizer` object again to convert the documents to numbers.

In [None]:
tokenizer.texts_to_sequences(training_documents)

`'Hello John!'` becomes `[7, 8]`.

But what happens if you would like to convert a document with an unknown word (out of vocabulary)? For example `'Hello Mary!'`

In [None]:
tokenizer.texts_to_sequences(['Hello Mary!'])

The unknown word is simply ignored. This is not ideal: it is better to have a special token that marks unknown words. Let's repeat the steps above, but this time with `oov_token="<OOV>"`.

In [None]:
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(training_documents)

In [None]:
tokenizer.word_index

This time we have a new token for out-of-vocabulary words, `<OOV>`, with `1` as its index. This means that whenever we encounter an unknown word, it will be indexed as `1`.

In [None]:
tokenizer.texts_to_sequences(['Hello Mary!'])
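The difference between the two behaviours can be sketched in a few lines of plain Python. This is an illustration, not the Keras code: `text_to_sequence` is a hypothetical helper, the tokenization is a simplified regex, and the small `word_index` below is an abbreviated, assumed copy of the `<OOV>` vocabulary of the example corpus.

```python
import re

# Abbreviated vocabulary (assumed indices, with '<OOV>' at index 1)
word_index = {'<OOV>': 1, 'a': 2, 'i': 3, 'hello': 8, 'john': 9}

def text_to_sequence(text, word_index, oov_token=None):
    """Hypothetical helper: convert one document to a list of word indices."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    sequence = []
    for token in tokens:
        if token in word_index:
            sequence.append(word_index[token])
        elif oov_token is not None:
            # Unknown word: fall back to the index of the OOV token
            sequence.append(word_index[oov_token])
        # Without an OOV token, unknown words are silently dropped
    return sequence

with_oov = text_to_sequence('Hello Mary!', word_index, oov_token='<OOV>')
without_oov = text_to_sequence('Hello Mary!', word_index)
print(with_oov)     # [8, 1]
print(without_oov)  # [8]
```

Keeping a placeholder for unknown words preserves the length and word positions of the document, which matters once the sequences are fed to a model.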

**Task 1.1**

Write a document with a few words and convert it to numbers using `tokenizer.texts_to_sequences()`.

In [None]:
# YOUR CODE HERE #

Let's take a look at our corpus turned into sequences again.

In [None]:
training_sequences = tokenizer.texts_to_sequences(training_documents)
training_sequences

Notice that the sentences naturally have different lengths. Machine learning models, however, typically expect a fixed input size, in other words a fixed number of features. Handling this is as easy as padding the short sentences with zeros, which can be done using `pad_sequences` from Keras.

https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
pad_sequences(training_sequences)

Zeros are added to the short sentences and now all sequences are 5 numbers long. Now let's dig deeper to understand the default arguments and how to change them.

For example, zeros are added to the left by default. We can use `padding='post'` to add zeros to the right instead.

In [None]:
pad_sequences(training_sequences,
              padding='post')

Note that the length of the sequences is equal to the length of the longest sentence. We can make it shorter or longer with `maxlen`.

In [None]:
pad_sequences(training_sequences,
              padding='post',
              maxlen=3)

In [None]:
pad_sequences(training_sequences,
              padding='post',
              maxlen=6)

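As you might have guessed, we can also decide whether to truncate from the left or from the right of the sentence. By default (`truncating='pre'`) values are removed from the beginning; `truncating='post'` removes them from the end.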

In [None]:
pad_sequences(training_sequences,
              padding='post',
              maxlen=3,
              truncating='post')
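The padding and truncating options can be summarised with a small pure-Python sketch. The helper `pad` is hypothetical and only mimics the options described above; unlike `pad_sequences`, it returns nested lists rather than a NumPy array.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical helper mimicking the padding/truncating options described above."""
    if maxlen is None:
        # Default: pad everything to the length of the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops values from the start, 'post' drops them from the end
            seq = seq[len(seq) - maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill values on the left, 'post' on the right
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded

sequences = [[4, 5, 2, 6, 7], [8, 9], [3, 10, 11, 12], [3, 13, 2, 14]]

print(pad(sequences))                                               # left-padded to length 5
print(pad(sequences, padding='post', maxlen=3))                     # truncated from the left
print(pad(sequences, padding='post', maxlen=3, truncating='post'))  # truncated from the right
```

Note how `padding` and `truncating` are independent: the second call pads on the right but still cuts long sentences on the left.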

**Summary**

We started with a list of strings, in other words a corpus with documents.

```
training_documents = ['This is a tasty apple.',
                      'Hello John!',
                      'I liked the movie.',
                      'I have a car.']
```

Then we fit a `tokenizer` to it, which assigned a number to every unique word.

```
{'<OOV>': 1,
 'a': 2,
 'i': 3,
 'this': 4,
 'is': 5,
 'tasty': 6,
 'apple': 7,
 'hello': 8,
 'john': 9,
 'liked': 10,
 'the': 11,
 'movie': 12,
 'have': 13,
 'car': 14}
```

Finally, we converted the strings into sequences of numbers and used padding to give every sequence the same length.

```
array([[ 4,  5,  2,  6,  7],
       [ 8,  9,  0,  0,  0],
       [ 3, 10, 11, 12,  0],
       [ 3, 13,  2, 14,  0]], dtype=int32)
```

## 2. Converting tweets to numbers

Now it is time to apply what you have learned to tweets. But let's first create `training_tweets` and `test_tweets` by combining positive and negative tweets.



In [None]:
training_tweets = positive_tweets_tr + negative_tweets_tr
test_tweets = positive_tweets_te + negative_tweets_te

While we are creating our dataset, we can also create our labels. We know that the first part of both `training_tweets` and `test_tweets` is positive (label = 1) and the second part is negative (label = 0). Therefore, creating the labels is as easy as:

In [None]:
y_train = np.append(np.ones(len(positive_tweets_tr)),
                    np.zeros(len(negative_tweets_tr)))

y_test = np.append(np.ones(len(positive_tweets_te)),
                   np.zeros(len(negative_tweets_te)))

print(y_train.shape)
print(y_test.shape)

Remember that we already preprocessed and tokenized our tweets:

In [None]:
training_tweets[0]

However, the Keras `Tokenizer` expects a list of strings, so let's join the tokens of each tweet back into a single string:

In [None]:
# Join each token list back into a single space-separated string
training_tweets_str = [' '.join(tw) for tw in training_tweets]
test_tweets_str = [' '.join(tw) for tw in test_tweets]

In [None]:
training_tweets_str[0]

**Task 2.1**

Fit a `Tokenizer` on `training_tweets_str`. By default, the tokenizer strips the characters listed in its `filters` parameter (mostly punctuation). Set `filters=''` to disable this, because our tweets are already preprocessed and tokens such as `:)` should be kept.

In [None]:
# YOUR CODE HERE #

**Task 2.2**

Calculate the size of the vocabulary.

In [None]:
vocab_size = # YOUR CODE HERE #
print(vocab_size)

**Task 2.3**

Find the numbers that represent the words `'boy'`, `'girl'`, `'man'` and `'woman'`.

In [None]:
boy_index = # YOUR CODE HERE #
girl_index = # YOUR CODE HERE #
man_index = # YOUR CODE HERE #
woman_index = # YOUR CODE HERE #

print(f"The index of 'boy' is {boy_index}")
print(f"The index of 'girl' is {girl_index}")
print(f"The index of 'man' is {man_index}")
print(f"The index of 'woman' is {woman_index}")

**Task 2.4**

Convert training and test tweets to sequences and use padding.

Example tweet:

`'followfriday top engag member commun week :)'`

Corresponding sequence:

`[347, 221, 937, 400, 286, 52, 3]`

Padded sequence:

`array([347, 221, 937, 400, 286,  52,   3,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0], dtype=int32)`


For padding arguments use `padding='post'` and `maxlen=30`.

In [None]:
training_sequences = # YOUR CODE HERE #
training_padded = # YOUR CODE HERE #

test_sequences = # YOUR CODE HERE #
test_padded = # YOUR CODE HERE #

In [None]:
assert training_padded.shape == (8000, 30)
assert test_padded.shape == (2000, 30)

## 3. Build a neural network with an embedding layer

You just converted the tweets to numbers, so we are ready to train a neural network on this dataset. Let's first define `X_train`, `y_train`, `X_test` and `y_test`. We already created the labels, and the padded sequences will be our `X_train` and `X_test`.

In [None]:
X_train = training_padded
X_test = test_padded

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

**Task 3.1**

Build a `Sequential` model from Keras. The first layer should be an `Embedding` layer. In the `Embedding` layer, define the following parameters:

>input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1.

>output_dim: Integer. Dimension of the dense embedding.

>input_length: Length of input sequences, when it is constant. This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed).

Note that `input_dim` is `vocab_size + 1` because we are padding with zeros. For `output_dim` please use `2` because we would like to plot the embeddings. Finally for `input_length` use `30` because we used `maxlen=30` during padding.

After the `Embedding` layer flatten its output and connect `Dense` layers. As a last layer, add a `Dense` layer suitable for binary classification.
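Conceptually, an embedding layer is a lookup table with one row per index, and flattening simply concatenates the looked-up rows. The NumPy sketch below illustrates the shapes involved. It is not Keras code: the table is random rather than trained, and the small `vocab_size` and `maxlen` values are assumptions chosen to keep the example readable.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10     # hypothetical small vocabulary (word indices 1..10)
embedding_dim = 2   # plays the role of output_dim=2
maxlen = 5          # shortened from 30 for readability

# One row per index; row 0 belongs to the padding value,
# which is why the table needs vocab_size + 1 rows
embedding_table = rng.normal(size=(vocab_size + 1, embedding_dim))

# A batch of two padded sequences
padded = np.array([[3, 7, 1, 0, 0],
                   [5, 0, 0, 0, 0]])

# Lookup: every index is replaced by its row in the table
embedded = embedding_table[padded]
print(embedded.shape)    # (2, 5, 2)

# Flatten: concatenate the vectors of each sequence into one feature vector
flattened = embedded.reshape(len(padded), -1)
print(flattened.shape)   # (2, 10)
```

Every padded position maps to the same row (row 0), so the flattened feature vector always has `maxlen * embedding_dim` entries regardless of the original tweet length.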

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import # YOUR CODE HERE #

model = Sequential()

# YOUR CODE HERE #

model.summary()

**Task 3.2**

Compile the model by selecting a proper loss, optimizer and metric.

In [None]:
# YOUR CODE HERE #

**Task 3.3**

Train the model with `X_train` and `y_train`. Use `X_test` and `y_test` as validation data.

In [None]:
# YOUR CODE HERE #

You just developed a neural network for sentiment analysis. Congrats!

**Task 3.4**

Predict the class of a test tweet.

Example tweet:
`"back thnx god i'm happi :)"`

Model prediction:
`array([[0.99999976]], dtype=float32)`

In [None]:
# YOUR CODE HERE #

## 4. Semantic properties of embeddings

Take a look at Figures 6.15 and 6.16 (Section 6.10, page 126, Speech and Language Processing). They show one of the ways embeddings capture meaning in language. Let's check whether the embeddings from our model have learned any semantic properties.

In order to achieve this we need to access the output of our trained embedding layer, which is the first layer. We can access each layer as follows:

In [None]:
model.layers

and input/output of a layer as follows:

In [None]:
model.layers[0].input

In [None]:
model.layers[0].output

We can create a new model only from the trained Embedding layer.

In [None]:
from tensorflow.keras.models import Model
embedding_model = Model(inputs=model.layers[0].input,
                        outputs=model.layers[0].output)

`embedding_model` accepts sequences as input and returns the output of the embedding layer.

**Task 4.1**

We can use this model just like any other model. Use the `embedding_model` on the test tweet from Task 3.4 to get its embeddings. You should obtain two numbers for each position in the padded tweet. The pair of numbers that belongs to a word is its word embedding.

Example tweet:

`"back thnx god i'm happi :)"`

Expected output:

```
array([[[ 2.03354090e-01, -1.61128387e-01],
        [ 3.02020200e-02, -4.17610519e-02],
        [-6.28291816e-03, -3.17817833e-03],
        [ 1.16536245e-01, -1.00184120e-01],
        [-2.93731004e-01,  2.99061388e-01],
        [-1.42090285e+00,  1.37957323e+00],
        [-7.92113831e-04,  2.80260388e-02],
        [-7.92113831e-04,  2.80260388e-02],
        [-7.92113831e-04,  2.80260388e-02],
        ...
        [-7.92113831e-04,  2.80260388e-02]]], dtype=float32)
```

where `'back'` is `[ 2.03354090e-01, -1.61128387e-01]`.

In [None]:
# YOUR CODE HERE #

Using the idea above, let's create a function that can give us a vector (word embedding) for any given word.

**Task 4.2**

Write a function that:
- Accepts a word as a string
```
'man'
```
- Converts it into a sequence using the `tokenizer` you fitted previously
```
[[199]]
```
- Pads it
```
array([[199,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0]], dtype=int32)
```
- Uses the `embedding_model` to get the embeddings
```
array([[[ 0.13211662,  0.11832968],
        [-0.03486646, -0.00182698],
        [-0.03486646, -0.00182698],
        [-0.03486646, -0.00182698],
        [-0.03486646, -0.00182698],
        ...
        [-0.03486646, -0.00182698]]], dtype=float32)
```
- And finally returns the embedding (vector) that corresponds to the word e.g. 
```
array([0.13211662, 0.11832968], dtype=float32)
```

In [None]:
def word_to_vector(word):
    '''
    Input:
        word: a string containing the word, e.g. 'man'
    Output:
        vector: a numpy array containing the vector obtained 
                from the trained embedding layer
    
    '''
    # YOUR CODE HERE #
    return vector

In [None]:
vector = word_to_vector('man')
vector

In [None]:
assert isinstance(vector, np.ndarray)
assert vector.shape == (2,)

**Task 4.3**

Plot the embeddings of the following words `['boy', 'girl', 'man', 'woman']` and check if your model captures the male-female relation.

<img src=https://i.imgur.com/CUZuvxW.png width="400">
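Besides inspecting the plot visually, one way to quantify such a relation is to compare the offset vectors `woman - man` and `girl - boy`: if the model captured the relation, the two offsets should point in a similar direction. The sketch below uses made-up 2-D vectors, not output from the trained model, and the helper names `offset` and `cosine` are hypothetical.

```python
import math

# Hypothetical 2-D embeddings, invented for illustration (not model output)
vectors = {
    'man':   (0.50, 0.20),
    'woman': (0.50, 0.80),
    'boy':   (0.10, 0.25),
    'girl':  (0.15, 0.80),
}

def offset(word_a, word_b):
    """Difference vector pointing from word_b to word_a."""
    return tuple(a - b for a, b in zip(vectors[word_a], vectors[word_b]))

def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 orthogonal, -1 opposite."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

similarity = cosine(offset('woman', 'man'), offset('girl', 'boy'))
print(round(similarity, 3))
```

A value close to 1 would indicate nearly parallel offsets. With only two embedding dimensions and a sentiment training objective, your own model may well show a much weaker relation.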

In [None]:
def word_plotter(words):
    # YOUR CODE HERE #

In [None]:
word_plotter(['boy', 'girl', 'man', 'woman'])