# **Implementing RNN architecture**
<font color='grey' size='1.5'> Created by Parisa Hosseinzadeh for *Machine learning for proteins*, Spring 2022. The code in this exercise is adapted from [Victor Zhou](https://victorzhou.com/blog/keras-rnn-tutorial/).</font>

In today's in-class activity, we will build a recurrent neural network (RNN) for sentiment analysis: we read sentences from movie reviews and predict whether each review is positive or negative. This is analogous to reading in protein sequences and predicting whether each one is an enzyme or not.

## Step 1. Set up

### 1.1. Set up directory

Unzip the file and put its contents in your Drive. It should contain these folders:

```
dataset/
  test/
    neg/
    pos/
  train/
    neg/
    pos/
```
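Before moving on, it can help to confirm that the folders ended up where you expect. The snippet below is a minimal sketch using only the Python standard library; the helper name `missing_dirs` and the placeholder path `"dataset"` are assumptions for illustration, not part of the original exercise.

```python
from pathlib import Path

# Sub-folders the exercise expects (taken from the tree above).
EXPECTED_DIRS = ["test/neg", "test/pos", "train/neg", "train/pos"]

def missing_dirs(dataset_root):
    """Return the expected sub-folders that are absent under dataset_root."""
    root = Path(dataset_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]

# "dataset" is a placeholder; point it at wherever you unpacked the zip file.
print(missing_dirs("dataset"))  # an empty list means the layout is complete
```

If the printed list is not empty, the listed folders are missing or the path is wrong.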

### 1.2. Mount Google Drive

In [None]:
# Mounting google drive
google_drive_mount_point = '/content/google_drive'

import os, sys, time

if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount(google_drive_mount_point)

if not os.getenv("DEBUG"):
    google_drive = google_drive_mount_point + '/My Drive' 

Mounted at /content/google_drive


## Step 2. Loading and preparing data

Now let's read in the reviews and their labels.

In [None]:
from tensorflow.keras.preprocessing import text_dataset_from_directory

train_data = text_dataset_from_directory(
    "google_drive/MyDrive/Faculty_files/Education/Teaching/LearningProt_2022/dataset/train",
    batch_size=64
)  # <-- change the path to the dataset location in your Drive
test_data = text_dataset_from_directory(
    "google_drive/MyDrive/Faculty_files/Education/Teaching/LearningProt_2022/dataset/test",
    batch_size=64
)  # <-- change the path to the dataset location in your Drive

Some lines in our data contain HTML markup (things like `<br />`). Let's clean up the data by replacing those tags with spaces.

In [None]:
from tensorflow.keras.preprocessing import text_dataset_from_directory
from tensorflow.strings import regex_replace

def prepareData(directory):
  # Read the reviews (default batch size of 32) and replace each
  # '<br />' tag with a space; labels pass through unchanged.
  data = text_dataset_from_directory(directory)
  return data.map(
    lambda text, label: (regex_replace(text, '<br />', ' '), label),
  )

train_data = prepareData(
    'google_drive/MyDrive/.../dataset/train'
)  # <-- change the path to location in drive (google_drive/MyDrive/...)
test_data = prepareData(
    'google_drive/MyDrive/.../dataset/test'
)  # <-- change the path to location in drive
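To see what this cleanup does to a single review, here is a small sketch that uses Python's built-in `re` module in place of TensorFlow's `regex_replace`. The helper name `strip_line_breaks` and the sample sentence are made up for illustration.

```python
import re

def strip_line_breaks(text):
    """Replace every literal '<br />' tag with a single space."""
    return re.sub(r"<br />", " ", text)

# Made-up review text, only for illustration.
sample = "Great movie.<br /><br />Would watch again."
print(strip_line_breaks(sample))
```

Each tag becomes one space, so two consecutive tags leave two spaces behind; the tokenizer later splits on whitespace, so this is harmless.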

## Step 3. Building your model

Now let's build our model. An RNN is still built as a Keras `Sequential` model, but it uses some new layer types. Let's walk through them one by one.

### 3.1. Starting the model input

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input

model = Sequential()
model.add(Input(shape=(1,), dtype="string"))

The input is one string at a time. We take the string, pass it through the RNN, and predict whether it is a positive or negative review.

### 3.2. First layer, text vectorization

Our first layer will be a [TextVectorization layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization), which will process the input string and turn it into a sequence of integers, each one representing a token.

For something like a protein sequence, you won't need this layer: there are only 20 amino acids, so we can one-hot encode them directly.

In [None]:
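# TextVectorization is available directly under keras.layers in TensorFlow 2.6+;
# the older "experimental.preprocessing" path has since been removed.
from tensorflow.keras.layers import TextVectorization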

max_tokens = 1000
max_len = 100
vectorize_layer = TextVectorization(
  # Max vocab size. Any words outside of the max_tokens most common ones
  # will be treated the same way: as "out of vocabulary" (OOV) tokens.
  max_tokens=max_tokens,
  # Output integer indices, one per string token
  output_mode="int",
  # Always pad or truncate to exactly this many tokens
  output_sequence_length=max_len,
  
)

To initialize the layer, we call `adapt`. This cell can take up to ~10 min to run.

In [None]:
# Call adapt(), which fits the TextVectorization layer to our text dataset.
# This is when the max_tokens most common words (i.e. the vocabulary) are selected.
train_texts = train_data.map(lambda text, label: text)
# train_texts is already batched by text_dataset_from_directory, so no batch_size is passed.
vectorize_layer.adapt(train_texts)
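To make the idea behind `adapt` and the integer output concrete, here is a plain-Python sketch of the same steps: count words, keep the most common ones, then map each review to a fixed-length list of integers. It is an illustration under simplifying assumptions, not the layer's actual implementation: the helper names are made up, and it only lowercases and splits on whitespace, whereas the real layer also strips punctuation by default. It does follow the layer's convention that index 0 is padding and index 1 is the out-of-vocabulary (OOV) token.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Keep the most frequent words; index 0 is padding and index 1 is OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Two slots are reserved, so at most max_tokens - 2 real words fit.
    words = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index for index, word in enumerate(words, start=2)}

def vectorize(text, vocab, max_len):
    """Map words to integers, then truncate or zero-pad to exactly max_len."""
    ids = [vocab.get(word, 1) for word in text.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))

# Toy corpus, made up for illustration.
corpus = ["the movie was great", "the movie was bad", "great acting"]
vocab = build_vocab(corpus, max_tokens=5)  # room for 3 real words
print(vocab)
print(vectorize("the movie was great fun", vocab, max_len=6))
```

Words that did not make it into the vocabulary all collapse to index 1, and short reviews are filled with zeros at the end.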

Now we can add it to our model.

In [None]:
model.add(vectorize_layer)
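For the protein case mentioned above, one-hot encoding could look like the following sketch. The function name and the three-residue example peptide are hypothetical; the alphabet is the 20 standard one-letter amino acid codes.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Return one 20-element row per residue, with a single 1 in each row."""
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1  # raises KeyError for non-standard letters
        rows.append(row)
    return rows

encoded = one_hot_encode("MKV")  # hypothetical three-residue peptide
print(len(encoded), len(encoded[0]))  # 3 20
```

The result has one row per residue, so a sequence of length L becomes an L x 20 matrix that can be fed to a recurrent layer directly.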

### 3.3. Embedding

Our next layer will be an [Embedding layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding), which will turn the integers produced by the previous layer into fixed-length vectors.

Note that here, we're learning the embedding during training instead of loading pretrained vectors such as GloVe or Word2Vec. For proteins, you can learn embeddings in a similar way.

In [None]:
from tensorflow.keras.layers import Embedding

# max_tokens + 1 is a safe upper bound on the number of token indices:
# TextVectorization's max_tokens already counts the padding and
# out-of-vocabulary (OOV) entries, so one extra row is simply unused.
model.add(Embedding(max_tokens + 1, 128))
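Conceptually, an embedding layer is a lookup table with one row per token index. The sketch below illustrates that idea in plain Python with toy sizes; the function names are made up, and in the real layer the rows are weights that get updated during training rather than fixed random numbers.

```python
import random

def make_embedding_table(vocab_size, embed_dim, seed=0):
    """One row of embed_dim small random numbers per token index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embed_dim)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    """Look up one row per integer token: the result is (len(token_ids), embed_dim)."""
    return [table[i] for i in token_ids]

table = make_embedding_table(vocab_size=10, embed_dim=4)  # toy sizes
vectors = embed([2, 3, 1, 0], table)
print(len(vectors), len(vectors[0]))  # 4 4
```

In the notebook the table would have `max_tokens + 1` rows and 128 columns, so each review of 100 tokens becomes a 100 x 128 matrix.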

### 3.4. The recurrent layer

For this activity, we will use a [SimpleRNN layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN).

In [None]:
from tensorflow.keras.layers import SimpleRNN

# 64 is the "units" parameter, which is the
# dimensionality of the output space.
model.add(SimpleRNN(64))
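To show what a simple recurrent layer computes, here is a plain-Python sketch of the forward pass: at each time step the hidden state is updated as `h_t = tanh(W x_t + U h_{t-1} + b)`, and only the final state is returned. The weights here are random toy values and the names are illustrative; Keras learns these weights and runs the same recurrence far more efficiently.

```python
import math
import random

def simple_rnn_forward(inputs, W, U, b):
    """Run h_t = tanh(W x_t + U h_{t-1} + b) over the sequence; return the last h."""
    units = len(b)
    h = [0.0] * units  # initial hidden state
    for x in inputs:
        h = [
            math.tanh(
                sum(W[j][k] * x[k] for k in range(len(x)))
                + sum(U[j][k] * h[k] for k in range(units))
                + b[j]
            )
            for j in range(units)
        ]
    return h

rng = random.Random(0)
input_dim, units, steps = 4, 3, 5  # toy sizes; the notebook uses 128 and 64
W = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(units)]
U = [[rng.uniform(-0.5, 0.5) for _ in range(units)] for _ in range(units)]
b = [0.0] * units
sequence = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(steps)]

last_state = simple_rnn_forward(sequence, W, U, b)
print(len(last_state))  # 3: one value per unit, whatever the sequence length
```

The output size depends only on the number of units, which is why a review of any length ends up as a single 64-element vector in the notebook's model.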

### 3.5. Dense layer

Finally, we will add our dense layers. We will include two: one dense layer with 64 units and one output layer.

Try to write it yourself. Remember, we only have two categories: positive or negative review.



In [None]:
from tensorflow.keras.layers import Dense

model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

Let's take a look at our model.

In [None]:
from tensorflow.keras.utils import plot_model

plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

#### Q1. Model architecture

Load the architecture of your model.

## Step 4. Compilation

Let's compile our model. Try to write the code yourself. Here's a refresher from the Keras documentation:

```
compile(
    optimizer='rmsprop',
    loss=None,
    metrics=None,
    loss_weights=None,
    weighted_metrics=None,
    run_eagerly=None,
    steps_per_execution=None,
    jit_compile=None,
    **kwargs
)
```

In [None]:
# your code

In [None]:
#@markdown Sample code

model.compile(
  optimizer='adam',
  loss='binary_crossentropy',
  metrics=['accuracy'],
)
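For intuition about the loss used in the sample code, binary cross-entropy for a label `y` and predicted probability `p` is `-(y*log(p) + (1-y)*log(1-p))`, averaged over examples. The sketch below computes it in plain Python; the clipping constant `eps` is an assumption added here to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(binary_crossentropy([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(binary_crossentropy([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```

A model that always outputs 0.5 gets a loss of about 0.693 (that is, `ln 2`), which is a useful baseline when reading the training log.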

## Step 5. Training and evaluation

Let's train our model. Because we have lots of training data, this will take some time (about 5-10 min).

In [None]:
# train_data is already batched, so batch_size is not passed to fit().
model.fit(train_data, epochs=10)

#### Q2. Model performance

What is the accuracy of your model?

Now let's see how your model does on two hand-written example reviews. Run the next cell.

In [None]:
# A positive review: the score should be close to 1.
print(model.predict([
  "i loved it! highly recommend it to anyone and everyone looking for a great movie to watch.",
]))

# A negative review: the score should be close to 0.
print(model.predict([
  "this was awful! i hated it so much, nobody should watch this. the acting was terrible, the music was terrible, overall it was just bad.",
]))

[[0.74810326]]
[[0.25660262]]


Try to get the accuracy on your test set based on the code you used previously. Note that `test_data` includes both X and Y, so you don't need to provide them separately. It is also already split into batches (of 32, the `text_dataset_from_directory` default), so you don't need to pass `batch_size` either.


In [None]:
# your code



In [None]:
#@markdown Sample code

# evaluate() returns the loss followed by each compiled metric (here, accuracy).
loss, accuracy = model.evaluate(test_data)
print('Accuracy: %.2f' % (accuracy*100))
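The reported accuracy comes from thresholding each predicted probability at 0.5 and comparing the result with the true label. A minimal plain-Python sketch of that calculation, with a made-up helper name, is:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)

# The two scores printed earlier in this notebook, labelled positive and negative.
print(binary_accuracy([1, 0], [0.74810326, 0.25660262]))  # 1.0
```

Note that accuracy ignores how confident the model is: scores of 0.75 and 0.99 both count as a correct positive, which is why the loss is also worth watching.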

#### Q3. Model tuning

What is the issue with your model? What are possible ways you can fix it?

#### Q4. Early stopping

Try changing the number of epochs from 10 to 8 and then to 5. Stopping training before the model starts to overfit is called early stopping. What changes do you observe in test and train performance?

Note that you need to rebuild your model before retraining; otherwise it will continue training from where it left off. For example, if your model had reached an accuracy of 0.8 and you simply repeat training, it will start from an accuracy of about 0.8.
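Instead of picking the number of epochs by hand, the stopping point can also be chosen automatically by watching the loss on held-out validation data. Keras provides an `EarlyStopping` callback for this; the sketch below is not that callback's implementation, only a plain-Python illustration of the usual "patience" rule, with a made-up function name and made-up loss values.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)  # never triggered: train for all epochs

# Made-up validation losses: they improve, then rise as the model overfits.
losses = [0.69, 0.60, 0.55, 0.57, 0.61, 0.66, 0.70]
print(early_stopping_epoch(losses, patience=2))  # 5
```

Here the best loss is reached at epoch 3, and training stops two epochs later once no further improvement is seen.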

#### Q5. Adding dropouts

One way to avoid overfitting is dropout. Dropout means that at each training step, a random fraction of the neurons is ignored. This randomness helps prevent the network from memorizing the training data.

Try adding dropout to your model at different rates (0.1, 0.2). How does that affect your performance?

```
model.add(SimpleRNN(64, dropout=0.25, recurrent_dropout=0.25))
```
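To see what dropout does to a layer's activations, here is a plain-Python sketch of the common "inverted dropout" scheme: each value is zeroed with probability `rate` during training, and the survivors are scaled up by `1/(1-rate)` so the average activation stays roughly the same. The function name and the toy activations are illustrative assumptions.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1/(1-rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.25, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_value = sum(dropped) / len(dropped)
print(round(zero_fraction, 2), round(mean_value, 2))  # close to 0.25 and 1.0
```

Dropout is only applied during training; at prediction time all neurons are used.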

In [None]:
# your code

### Optional. Adding depth

Try making the model deeper by stacking recurrent layers. To do this, you can add your layers like this:

```
# Return the full sequence instead of just the last
# output of the sequence.
model.add(SimpleRNN(64, return_sequences=True))

# This second recurrent layer's input sequence is the
# output sequence of the previous layer.
model.add(SimpleRNN(64))
```

How does this affect your results?
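The reason the first layer needs `return_sequences=True` is that the second recurrent layer expects a whole sequence as input, not a single vector. The following plain-Python sketch, with toy sizes, random weights, no bias term, and made-up helper names, shows the difference in output shapes.

```python
import math
import random

def rnn_layer(inputs, W, U, return_sequences=False):
    """Minimal tanh RNN (no bias): returns every hidden state or only the last."""
    units = len(U)
    h = [0.0] * units
    states = []
    for x in inputs:
        h = [math.tanh(sum(w * xi for w, xi in zip(W[j], x))
                       + sum(u * hi for u, hi in zip(U[j], h)))
             for j in range(units)]
        states.append(h)
    return states if return_sequences else h

def random_matrix(rows, cols, rng):
    """Toy weight matrix with small random entries."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
steps, input_dim, units = 6, 4, 3  # toy sizes
sequence = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(steps)]

# Layer 1 returns one state per time step, which layer 2 consumes as its input.
layer1_out = rnn_layer(sequence, random_matrix(units, input_dim, rng),
                       random_matrix(units, units, rng), return_sequences=True)
layer2_out = rnn_layer(layer1_out, random_matrix(units, units, rng),
                       random_matrix(units, units, rng))
print(len(layer1_out), len(layer1_out[0]), len(layer2_out))  # 6 3 3
```

The first layer keeps the time dimension (6 steps of 3 values each), while the second collapses it to a single 3-element vector for the dense layers.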