## 1. Tweet preprocessing (previous DataLab)

Before starting this DataLab, preprocess the tweets as you did in the previous one.

In [None]:
import nltk
from nltk.corpus import twitter_samples
import matplotlib.pyplot as plt
import random
import numpy as np
import pandas as pd

In [None]:
nltk.download('twitter_samples')

In [None]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [None]:
print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

In [None]:
import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

In [None]:
nltk.download('stopwords')                 # stop word lists, needed once per environment
stopwords_english = stopwords.words('english')

In [None]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a list of tokens
        
    Processing steps:
    - Removes hyperlinks
    - Removes # sign
    - Tokenizes
    - Removes stopwords and punctuation
    - Stems tokens
        
    """
    # YOUR CODE HERE #

    return processed_tweet
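
If you need a reminder, the processing steps in the docstring can be sketched as follows. This is one possible approach rather than the reference solution: the function name `tweet_processor_sketch`, the `TweetTokenizer` options and the tiny `demo_stopwords` set are assumptions made for this illustration. In the notebook you would pass `stopwords_english` instead.

```python
import re
import string

from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer


def tweet_processor_sketch(tweet, stop_words):
    """Illustrative preprocessing: returns a list of stemmed tokens."""
    # Remove hyperlinks
    tweet = re.sub(r'https?://\S+', '', tweet)
    # Remove only the hash sign, keeping the hashtag word itself
    tweet = tweet.replace('#', '')

    # Tokenize (lowercasing and handle stripping are assumed choices)
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tokens = tokenizer.tokenize(tweet)

    # Drop stopwords and single punctuation characters, then stem
    stemmer = PorterStemmer()
    punctuation = set(string.punctuation)
    return [stemmer.stem(tok) for tok in tokens
            if tok not in stop_words and tok not in punctuation]


# Hypothetical stand-in for stopwords_english, so no download is needed here
demo_stopwords = {'i', 'am', 'this'}
demo_tokens = tweet_processor_sketch(
    'I am loving this #NLP course https://t.co/abc :)', demo_stopwords)
print(demo_tokens)
```

To process a whole list, a list comprehension such as `[tweet_processor_sketch(t, stopwords_english) for t in tweet_list]` would be enough.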

In [None]:
tweet = all_positive_tweets[2277]
tweet_processed = tweet_processor(tweet)

print(tweet)
print(tweet_processed)

In [None]:
# 80% training, 20% testing (4000 / 1000 tweets per class)
positive_tweets_tr = all_positive_tweets[:4000]
positive_tweets_te = all_positive_tweets[4000:]

negative_tweets_tr = all_negative_tweets[:4000]
negative_tweets_te = all_negative_tweets[4000:]

In [None]:
def tweet_processor_list(tweet_list):
    # YOUR CODE HERE #
    return processed_tweet_list

In [None]:
positive_tweets_tr = tweet_processor_list(positive_tweets_tr)
positive_tweets_te = tweet_processor_list(positive_tweets_te)

negative_tweets_tr = tweet_processor_list(negative_tweets_tr)
negative_tweets_te = tweet_processor_list(negative_tweets_te)

You already completed all the steps up to this point in the previous DataLab. Let's do a quick sanity check:

In [None]:
assert len(positive_tweets_tr) == 4000
assert len(negative_tweets_tr) == 4000

assert len(positive_tweets_te) == 1000
assert len(negative_tweets_te) == 1000

assert type(positive_tweets_tr) is list
assert type(positive_tweets_tr[0]) is list
assert type(positive_tweets_tr[0][0]) is str

## 2. Converting tweets to numbers (previous DataLab)

Repeat your code from the previous DataLab.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
training_tweets = positive_tweets_tr + negative_tweets_tr
test_tweets = positive_tweets_te + negative_tweets_te

While building the dataset, we can also create the labels. The first half of `training_tweets` and of `test_tweets` is positive (label = 1) and the second half is negative (label = 0), so the labels can be created as follows:

In [None]:
y_train = np.append(np.ones(len(positive_tweets_tr)),
                    np.zeros(len(negative_tweets_tr)))

y_test = np.append(np.ones(len(positive_tweets_te)),
                   np.zeros(len(negative_tweets_te)))

print(y_train.shape)
print(y_test.shape)

Remember that we already preprocessed and tokenized our tweets:

In [None]:
training_tweets[0]

However, the Keras `Tokenizer` expects a list of strings, so join each tweet's tokens back into a single string:

In [None]:
# Join each token list into one space-separated string
training_tweets_str = [' '.join(tw) for tw in training_tweets]
test_tweets_str = [' '.join(tw) for tw in test_tweets]

In [None]:
training_tweets_str[0]

**Task 2.1**

Fit a Keras `Tokenizer` on `training_tweets_str`. By default the tokenizer strips punctuation through its `filters` parameter, which would also remove emoticons such as `:)`. Set `filters=''` to disable this, because the tweets are already preprocessed.

In [None]:
# YOUR CODE HERE #

**Task 2.2**

Calculate the size of the vocabulary.

In [None]:
# YOUR CODE HERE #

**Task 2.3**

Find the numbers that represent the words `'boy'`, `'girl'`, `'man'` and `'woman'`.

In [None]:
# YOUR CODE HERE #
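
To build intuition for Tasks 2.1 to 2.3, here is a pure-Python sketch of the idea behind a word index: tokens are ranked by frequency and numbered from 1, which leaves 0 free for padding. It illustrates the concept only and is not the Keras implementation. The names `toy_tweets`, `toy_word_index` and `toy_vocab_size` are hypothetical, and counting one extra slot for index 0 is a common convention assumed here.

```python
from collections import Counter

# Hypothetical toy corpus of already-processed tweets
toy_tweets = ['happi day :)', 'sad day :(', 'happi happi :)']

# Count every whitespace-separated token (no filtering, as with filters='')
counts = Counter(tok for tweet in toy_tweets for tok in tweet.split())

# The most frequent token gets index 1; index 0 is left free for padding
toy_word_index = {word: i
                  for i, (word, _) in enumerate(counts.most_common(), start=1)}

# One extra slot accounts for the reserved padding index 0
toy_vocab_size = len(toy_word_index) + 1

print(toy_word_index)
print(toy_vocab_size)
print(toy_word_index['happi'])
```

Looking up a word's number is then a plain dictionary access, as in the last line.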

**Task 2.4**

Convert the training and test tweets to integer sequences, then pad them to a fixed length. Store the results in `training_padded` and `test_padded`.

Example tweet:

`'followfriday top engag member commun week :)'`

Corresponding sequence:

`[347, 221, 937, 400, 286, 52, 3]`

Padded sequence:

`array([347, 221, 937, 400, 286,  52,   3,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0], dtype=int32)`


When padding, use the arguments `padding='post'` and `maxlen=30`.

In [None]:
# YOUR CODE HERE #
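
The padded example above can be reproduced by hand. The NumPy sketch below shows what post-padding to a fixed length means; `pad_post` is a hypothetical helper written for this illustration, not a Keras function. For sequences longer than `maxlen` it keeps the last items, which mirrors the default `truncating='pre'` behaviour of `pad_sequences` as far as this sketch is concerned.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Right-pad (or truncate) integer sequences to a fixed length."""
    padded = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        # Keep the last `maxlen` items if the sequence is too long
        kept = seq[-maxlen:]
        padded[row, :len(kept)] = kept
    return padded


example = pad_post([[347, 221, 937, 400, 286, 52, 3]], maxlen=30)
print(example.shape)    # (1, 30)
print(example[0, :9])   # the 7 word indices followed by zeros
```

Padding at the end keeps the real tokens at the start of every row, so all tweets share the shape `(30,)` regardless of their original length.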

In [None]:
assert training_padded.shape == (8000, 30)
assert test_padded.shape == (2000, 30)

## 3. Combine embedding layer with RNNs (this DataLab)

**Task 3.1**

Create a model with the following layers:
- Embedding layer
- Recurrent layer
- Dense layer

This is the minimum architecture. You can modify it to increase the number of layers or add new layers such as Dropout.

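Keras lists the following under its recurrent layers (`TimeDistributed` and `Bidirectional` are wrappers around other layers rather than recurrent cells themselves):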

- LSTM layer
- GRU layer
- SimpleRNN layer
- TimeDistributed layer
- Bidirectional layer
- ConvLSTM1D layer
- ConvLSTM2D layer
- ConvLSTM3D layer
- Base RNN layer

https://keras.io/api/layers/recurrent_layers/

For your recurrent layer, try LSTM, GRU and SimpleRNN in turn.
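
Before building the model in Keras, it may help to see what the three layers of the minimum architecture compute. The NumPy sketch below runs one padded tweet through an embedding lookup, a simple tanh recurrence and a sigmoid output. All sizes and the random weights are assumptions made for illustration, and the recurrence is the simplest variant, not an LSTM or GRU.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for this illustration
vocab_size, embed_dim, hidden_dim, seq_len = 50, 8, 4, 30

# Randomly initialised parameters (a real model learns these)
embedding = rng.normal(size=(vocab_size, embed_dim))
W_x = rng.normal(size=(embed_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
w_out = rng.normal(size=hidden_dim)
b_out = 0.0

# One padded tweet: word indices followed by zeros
tweet = np.zeros(seq_len, dtype=int)
tweet[:7] = [34, 22, 9, 40, 28, 5, 3]

# Embedding layer: look up one vector per word index
vectors = embedding[tweet]                      # shape (30, 8)

# Recurrent layer: update a hidden state one time step at a time
h = np.zeros(hidden_dim)
for x_t in vectors:
    h = np.tanh(x_t @ W_x + h @ W_h + b_h)

# Dense layer with sigmoid: a single score between 0 and 1
prob = 1.0 / (1.0 + np.exp(-(h @ w_out + b_out)))
print(vectors.shape, h.shape, float(prob))
```

The Keras layers do the same bookkeeping for whole batches and learn the weights during training.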

In [None]:
X_train = training_padded
X_test = test_padded

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import # YOUR CODE HERE #

model = Sequential()
# YOUR CODE HERE #
model.summary()

**Task 3.2**

Compile the model with a loss, optimizer and metric suited to binary classification.

In [None]:
# YOUR CODE HERE #
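
To make the choice of loss and metric in Task 3.2 more concrete, the sketch below computes binary cross-entropy and accuracy by hand in NumPy. Binary cross-entropy is a common choice for 0/1 labels with a single sigmoid output, but it is an assumption here and the task leaves the choice to you. The helper names and toy values are hypothetical.

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1.0 - y_true) * np.log(1.0 - y_prob)))


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    return float(np.mean((y_prob >= threshold) == (y_true == 1)))


toy_labels = np.array([1.0, 1.0, 0.0, 0.0])
toy_probs = np.array([0.9, 0.6, 0.2, 0.7])

print(binary_cross_entropy(toy_labels, toy_probs))
print(accuracy(toy_labels, toy_probs))   # 3 of 4 predictions are correct
```

The loss penalises confident mistakes (such as the last toy prediction) more heavily than accuracy does, which is why it is used for training while accuracy is only reported.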

**Task 3.3**

Train the model with `X_train` and `y_train`. Use `X_test` and `y_test` as validation data.

In [None]:
# YOUR CODE HERE #

**Task 3.4**

Predict the class of a test tweet.

Example tweet:
`"back thnx god i'm happi :)"`

Model prediction:
`array([[0.99999976]], dtype=float32)`

In [None]:
test_tweets_str[1]

In [None]:
# YOUR CODE HERE #
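
Two details are easy to miss in Task 3.4: the model expects a batch, so a single tweet needs shape `(1, 30)` rather than `(30,)`, and the output is a score that still has to be turned into a class. The sketch below shows both with NumPy only. `toy_X_test` is a hypothetical stand-in for `X_test`, and the 0.5 threshold assumes a sigmoid output.

```python
import numpy as np

# Hypothetical stand-in for X_test with the same shape
toy_X_test = np.zeros((2000, 30), dtype='int32')

single = toy_X_test[1]               # shape (30,): no batch axis
batch = toy_X_test[1:2]              # shape (1, 30): a batch of one tweet
same_batch = single[np.newaxis, :]   # equivalent way to add the batch axis

print(single.shape, batch.shape, same_batch.shape)

# Shape (1, 1), like the example prediction shown above
prediction = np.array([[0.99999976]], dtype='float32')

# Assumed decision rule: scores of at least 0.5 count as positive (label 1)
label = int(prediction[0, 0] >= 0.5)
print(label)
```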

## 4. Bidirectional LSTM (this DataLab)

Replace the recurrent layer from section 3 with a bidirectional LSTM and compare the results.

https://keras.io/examples/nlp/bidirectional_lstm_imdb/
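
The idea behind a bidirectional layer can be sketched in NumPy: one recurrence reads the sequence from left to right, a second one with its own weights reads it from right to left, and the two final states are concatenated. For brevity this sketch uses a simple tanh recurrence in place of an LSTM cell, and all sizes and weights are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim, hidden_dim, seq_len = 8, 4, 30   # hypothetical sizes


def run_rnn(inputs, W_x, W_h):
    """Return the final hidden state of a simple tanh recurrence."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h


def init_weights():
    """Draw a fresh (input, recurrent) weight pair."""
    return (rng.normal(size=(embed_dim, hidden_dim)),
            rng.normal(size=(hidden_dim, hidden_dim)))


inputs = rng.normal(size=(seq_len, embed_dim))   # one embedded tweet

# Two independent sets of weights, one per reading direction
h_forward = run_rnn(inputs, *init_weights())
h_backward = run_rnn(inputs[::-1], *init_weights())

# Concatenating both states doubles the feature size passed onward
h_bidirectional = np.concatenate([h_forward, h_backward])
print(h_bidirectional.shape)   # (8,)
```

Because the output size doubles, the layer that follows a bidirectional wrapper receives twice as many features as it would from a single recurrent layer with the same number of units.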