# Ch. 1 Recurrent Neural Networks and Keras
- Applications of machine learning to text data
    - Sentiment Analysis
        - classifying customer feedback as positive or negative to gauge how the customer base feels about the business
    - Multi-Class Classification
        - Recommender systems. 
    - Text Generation
        - Auto-complete sentences
    - Neural Machine Translation
        - Translate languages

## Recurrent Neural Networks
- main advantage is that they reduce the number of parameters of the model by avoiding one-hot encoding
- Sequence to Sequence Models
    - Many to one: classification tasks
        - final output is a probability distribution. Y-pred is the probability of the sentiment belonging to the class "positive"
        - used in sentiment analysis and multi-class classification applications
    - Many to Many: text generation
    - Many to Many: neural machine translation
        - encoder block
        - decoder block
        
## Introduction to Language Models
### Sentence Probability
- Many Available Models
    - Probability of "I loved this movie"
    - Neural Networks
        - the probability of the sentence is given by a softmax function on the output layer of the network
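
As a rough illustration of the idea above, the sketch below scores a sentence with the chain rule: each token's probability given the previous tokens comes from a softmax over the vocabulary, and the sentence probability is the product of those terms. The `toy_scores` function is a made-up stand-in for a trained network's output layer, and the four-word vocabulary is hypothetical.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary; a real model would use the full corpus vocabulary
vocab = ["I", "loved", "this", "movie"]

def toy_scores(previous_tokens):
    """Stand-in for a network's output layer: one raw score per vocabulary word.

    Assumption for illustration only: at step t, the word at position t in
    `vocab` gets the highest score.
    """
    scores = [0.0] * len(vocab)
    scores[len(previous_tokens) % len(vocab)] = 2.0
    return scores

def sentence_probability(tokens):
    """Chain rule: P(w_1..w_n) is the product of P(w_t | w_1..w_{t-1})."""
    probability = 1.0
    for t, token in enumerate(tokens):
        next_word_probs = softmax(toy_scores(tokens[:t]))
        probability *= next_word_probs[vocab.index(token)]
    return probability

p_ordered = sentence_probability(["I", "loved", "this", "movie"])
p_shuffled = sentence_probability(["movie", "this", "loved", "I"])
print(p_ordered, p_shuffled)
```

With these toy scores the ordered sentence gets a much higher probability than the shuffled one, which is the behaviour a trained language model is expected to show.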

### Link to RNNs
- Language models are everywhere in RNNs
- the network itself is a language model when fed text data
    - Give the probability of the next token given the previous tokens
- Embedding layer can be used to create vector representations of the tokens in the first layer
- Need to build vocabulary dictionaries where each unique word is assigned a number: its index in the array of unique vocabulary words

#### Building vocabulary dictionaries
- get unique words/tokens from the corpus

In [1]:
sheldon_quotes = ["You're afraid of insects and women, Ladybugs must render you catatonic.",
                 'Scissors cuts paper, paper covers rock, rock crushes lizard, lizard poisons Spock, Spock smashes scissors, scissors decapitates lizard, lizard eats paper, paper disproves Spock, Spock vaporizes rock, and as it always has, rock crushes scissors.',
                 'For example, I cry because others are stupid, and that makes me sad.',
                 "I'm not insane, my mother had me tested.",
                 'Two days later, Penny moved in and so much blood rushed to your genitals, your brain became a ghost town.',
                 "Amy's birthday present will be my genitals.",
                 '(3 knocks) Penny! (3 knocks) Penny! (3 knocks) Penny!',
                 'Thankfully all the things my girlfriend used to do can be taken care of with my right hand.',
                 'I would have been here sooner but the bus kept stopping for other people to get on it.',
                 'Oh gravity, thou art a heartless bitch.',
                 'I am aware of the way humans usually reproduce which is messy, unsanitary and based on living next to you for three years, involves loud and unnecessary appeals to a deity.',
                 'Well, today we tried masturbating for money.',
                 'I think that you have as much of a chance of having a sexual relationship with Penny as the Hubble telescope does of discovering at the center of every black hole is a little man with a flashlight searching for a circuit breaker.',
                 "Well, well, well, if it isn't Wil Wheaton! The Green Goblin to my Spider-Man, the Pope Paul V to my Galileo, the Internet Explorer to my Firefox.",
                 "What computer do you have? And please don't say a white one.",
                 "She calls me moon-pie because I'm nummy-nummy and she could just eat me up.",
                 'Ah, memory impairment; the free prize at the bottom of every vodka bottle.']

In [2]:
# Transform the list of sentences into a list of words
all_words = ' '.join(sheldon_quotes).split(' ')

# Get the unique words
unique_words = list(set(all_words))

# Dictionary of indexes as keys and words as values
index_to_word = {i:wd for i, wd in enumerate(sorted(unique_words))}

# Dictionary of words as keys and indexes as values
word_to_index = {wd:i for i, wd in enumerate(sorted(unique_words))}

print(word_to_index)

{'(3': 0, 'Ah,': 1, "Amy's": 2, 'And': 3, 'Explorer': 4, 'Firefox.': 5, 'For': 6, 'Galileo,': 7, 'Goblin': 8, 'Green': 9, 'Hubble': 10, 'I': 11, "I'm": 12, 'Internet': 13, 'Ladybugs': 14, 'Oh': 15, 'Paul': 16, 'Penny': 17, 'Penny!': 18, 'Pope': 19, 'Scissors': 20, 'She': 21, 'Spider-Man,': 22, 'Spock': 23, 'Spock,': 24, 'Thankfully': 25, 'The': 26, 'Two': 27, 'V': 28, 'Well,': 29, 'What': 30, 'Wheaton!': 31, 'Wil': 32, "You're": 33, 'a': 34, 'afraid': 35, 'all': 36, 'always': 37, 'am': 38, 'and': 39, 'appeals': 40, 'are': 41, 'art': 42, 'as': 43, 'at': 44, 'aware': 45, 'based': 46, 'be': 47, 'became': 48, 'because': 49, 'been': 50, 'birthday': 51, 'bitch.': 52, 'black': 53, 'blood': 54, 'bottle.': 55, 'bottom': 56, 'brain': 57, 'breaker.': 58, 'bus': 59, 'but': 60, 'calls': 61, 'can': 62, 'care': 63, 'catatonic.': 64, 'center': 65, 'chance': 66, 'circuit': 67, 'computer': 68, 'could': 69, 'covers': 70, 'crushes': 71, 'cry': 72, 'cuts': 73, 'days': 74, 'decapitates': 75, 'deity.': 76, '

In [3]:
# Join the quotes into one string so the loop below iterates over characters
text = ' '.join(sheldon_quotes)

# Create lists to keep the sentences and the next character
sentences = []   # ~ Training data
next_chars = []  # ~ Training labels

# Define hyperparameters
step = 2          # ~ Step to take when reading the texts in characters
chars_window = 10 # ~ Number of characters to use to predict the next one

# Loop over the text: length `chars_window` per time with step equal to `step`
for i in range(0, len(text) - chars_window, step):
    sentences.append(text[i:i + chars_window])
    next_chars.append(text[i + chars_window])

#### Transforming new text
In this exercise, you will transform a new text into sequences of numerical indexes using the dictionaries created before.

This is useful when you already have a trained model and want to apply it to a new dataset. The preprocessing steps applied to the training data must also be applied to the new text so the model can make predictions/classifications.
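
A minimal sketch of that transformation, under a few assumptions: the two-sentence `corpus` is a toy stand-in for the training text, and reserving index 0 for an `<UKN>` token (for words never seen in training) is a choice made here for illustration, not something the dictionaries above do.

```python
# Toy training corpus (hypothetical); in the notes this would be `sheldon_quotes`
corpus = ["rock crushes scissors", "paper covers rock"]

# Build the vocabulary; index 0 is reserved for unknown words (an assumption)
UNKNOWN = "<UKN>"
unique_words = sorted(set(" ".join(corpus).split(" ")))
word_to_index = {UNKNOWN: 0}
word_to_index.update({wd: i for i, wd in enumerate(unique_words, start=1)})
index_to_word = {i: wd for wd, i in word_to_index.items()}

def text_to_indexes(text):
    """Map each word to its index, falling back to the unknown-word index."""
    return [word_to_index.get(wd, word_to_index[UNKNOWN]) for wd in text.split(" ")]

new_text = "scissors cuts paper"
new_text_indexes = text_to_indexes(new_text)
print(new_text_indexes)  # [5, 0, 3]
```

The word "cuts" never appeared in the toy corpus, so it maps to the unknown index instead of raising a `KeyError`.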

## Introduction to Keras

- Keras: high-level API built on top of TensorFlow
    - keras.models.
        - Sequential: layers in the model run one after another. The simpler option
        - Model: more flexibility with layers. Can have multiple inputs and outputs
    - keras.layers.
        - Dense
        - LSTM
        - GRU
        - Embedding
        - Dropout
        - Bidirectional
    - keras.preprocessing
        - sequence.pad_sequences(texts, maxlen=int)
    - keras.datasets
        - IMDB Movie reviews: sentiment analysis
        - Reuters Newswire: multiclass classification with 46 classes
    

### Model Method

In [4]:
from tensorflow import keras
from keras.models import Model
from keras.layers import Dense, Input

# Define the input layer
main_input = Input(shape=(None, 10), name="input")

# One dense layer with 128 units (it is named "LSTM" but is a Dense layer;
# the input shape is already defined by the Input layer)
dense_layer = Dense(128, name="LSTM")(main_input)

# Add a dense layer with one unit
main_output = Dense(1, activation="sigmoid", name="output")(dense_layer)

# Instantiate the class at the end
model = Model(inputs=main_input, outputs=main_output, name="modelclass_model")

# The summary shows the layers and the number of parameters to train (1,537)
model.summary()

Model: "modelclass_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           [(None, None, 10)]        0         
_________________________________________________________________
LSTM (Dense)                 (None, None, 128)         1408      
_________________________________________________________________
output (Dense)               (None, None, 1)           129       
Total params: 1,537
Trainable params: 1,537
Non-trainable params: 0
_________________________________________________________________
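
The parameter counts in the summary can be checked by hand. The sketch below assumes the usual formulas: a Dense layer has one weight per input-unit pair plus one bias per unit, and a standard LSTM layer with biases has four such blocks that also include recurrent weights. The LSTM line is only a comparison with a hypothetical `LSTM(128)` in place of the first Dense layer; the helper names are made up for this example.

```python
def dense_params(input_dim, units):
    """Weights (input_dim * units) plus one bias per unit."""
    return input_dim * units + units

def lstm_params(input_dim, units):
    """Four blocks, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

hidden = dense_params(input_dim=10, units=128)  # first Dense layer
output = dense_params(input_dim=128, units=1)   # single-unit output layer
print(hidden, output, hidden + output)          # 1408 129 1537

# For comparison: an LSTM(128) instead of the first Dense layer
lstm_total = lstm_params(input_dim=10, units=128) + output
print(lstm_total)                               # 71297
```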


### Sequential Method

In [5]:
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Input

# Instantiate the class
model = Sequential(name="sequential_model")

# One dense layer with 128 units (defining the input shape because it is
# the initial layer)
model.add(Dense(128, input_shape=(None, 10), name="Dense"))

# Add a dense layer with one unit
model.add(Dense(1, activation="sigmoid", name="output"))

# The summary shows the layers and the number of parameters 
# that will be trained
model.summary()

Model: "sequential_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Dense (Dense)                (None, None, 128)         1408      
_________________________________________________________________
output (Dense)               (None, None, 1)           129       
Total params: 1,537
Trainable params: 1,537
Non-trainable params: 0
_________________________________________________________________


## Keras Preprocessing
The second most important module of Keras is keras.preprocessing. You will see how to use its most important classes and functions to prepare raw data in the input shape the model expects. Keras provides functionality that replaces the dictionary approach you learned before.

You will use the class keras.preprocessing.text.Tokenizer to create a dictionary of words with the method .fit_on_texts(), and to change the texts into numerical ids (the index of each word in the dictionary) with the method .texts_to_sequences().

Then, use the function pad_sequences() from keras.preprocessing.sequence to give all the sequences the same length (necessary for the model) by padding short texts with zeros and truncating long ones.

In [6]:
import numpy as np
texts = np.array([
    'So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits.',
    'Hello, female children. Allow me to inspire you with a story about a great female scientist. Polish-born, French-educated Madame Curie. Co-discoverer of radioactivity, she was a hero of science, until her hair fell out, her vomit and stool became filled with blood, and she was poisoned to death by her own discovery. With a little hard work, I see no reason why that can’t happen to any of you. Are we done? Can we go?',
])

# Import relevant classes/functions
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Build the dictionary of indexes
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# Change texts into sequence of indexes
texts_numeric = tokenizer.texts_to_sequences(texts)
print("Number of words in the sample texts: ({0}, {1})".format(len(texts_numeric[0]), len(texts_numeric[1])))

# Pad the sequences to length 60 (zeros are added at the start by default)
texts_pad = pad_sequences(texts_numeric, maxlen=60)
print("Now the texts have fixed length: 60. Let's see the first one: \n{0}".format(texts_pad[0]))

Number of words in the sample texts: (54, 78)
Now the texts have fixed length: 60. Let's see the first one: 
[ 0  0  0  0  0  0 24  4  1 25 13 26  5  1 14  3 27  6 28  2  7 29 30 13
 15  2  8 16 17  5 18  6  4  9 31  2  8 32  4  9 15 33  9 34 35 14 36 37
  2 38 39 40  2  8 16 41 42  5 18  6]
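
To make the padding behaviour concrete, here is a pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'` and `truncating='pre'`): short sequences get zeros at the front, long ones lose their first items. The helper `pad_pre` is a hypothetical name for this illustration and is not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    if len(sequence) >= maxlen:
        return sequence[len(sequence) - maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence

short_padded = pad_pre([5, 3, 8], maxlen=5)
long_truncated = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_padded)    # [0, 0, 5, 3, 8]
print(long_truncated)  # [3, 4, 5, 6, 7]
```

This matches the output above: the 54-token text gets six leading zeros to reach length 60.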


# Ch. 2 RNN Architecture
## Vanishing and Exploding Gradients
https://towardsdatascience.com/the-vanishing-exploding-gradient-problem-in-deep-neural-networks-191358470c11

- Exploding Gradient: during backpropagation the derivatives are multiplied together repeatedly, so they can grow exponentially and eventually "explode" toward infinity
- Vanishing Gradient: the gradients shrink toward zero instead
    - This is a much harder problem to solve because it is not as easy to detect
    - If the loss does not improve from step to step, is it because the gradients went to zero and the weights stopped updating? Or is it because the model is not able to learn?
    - This problem occurs more often in RNN models when long memory is required, that is, with long sentences
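
A deliberately simplified sketch of both effects: a single scalar weight stands in for the recurrent weight matrix and the activation derivative is ignored, so the gradient is just multiplied by the same factor once per time step. The values 0.5 and 1.5 are arbitrary choices for illustration.

```python
def gradient_scale(weight, steps):
    """Factor by which a gradient is scaled after `steps` repeated multiplications."""
    scale = 1.0
    for _ in range(steps):
        scale *= weight
    return scale

# A weight below 1 shrinks the gradient; a weight above 1 blows it up
for steps in (1, 10, 50, 100):
    print(steps, gradient_scale(0.5, steps), gradient_scale(1.5, steps))
```

After 50 steps the first factor is already vanishingly small and the second is in the hundreds of millions, which is why long sequences make the problem worse.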

## GRU and LSTM Cells
GRU and LSTM cells achieve good results in language modeling and address the vanishing gradient problem.

### No more vanishing gradients
- The SimpleRNN cell can have gradient problems
    - the weight matrix raised to the power t multiplies the other terms
- GRU and LSTM cells are much less prone to vanishing gradient problems:
    - because of their gates
    - the weight matrix terms do not multiply the rest
    - Exploding gradient problems are easier to solve
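
The notes do not say how exploding gradients are handled. One common remedy (an assumption here, not taken from the notes) is gradient clipping: rescale the gradient whenever its norm exceeds a threshold. The helper below is a hypothetical pure-Python sketch; Keras optimizers expose the same idea through their `clipnorm` and `clipvalue` arguments.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Rescale `gradients` so that their L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    return [g * max_norm / norm for g in gradients]

clipped = clip_by_norm([300.0, -400.0], max_norm=5.0)
print(clipped)  # [3.0, -4.0]
```

The direction of the gradient is preserved; only its length is capped.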

### Usage in Keras

In [13]:
# Import layers and the model class
from keras.models import Sequential
from keras.layers import GRU, LSTM

# Instantiate a model
model = Sequential()

# Add layers: `units` is the number of memory cells to keep track of.
# `return_sequences=True` returns the output at every time step, which is
# needed when the next layer is also recurrent
model.add(GRU(units=128, return_sequences=True, name='gru_layer'))
model.add(LSTM(units=64, return_sequences=False, name='lstm_layer'))
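
To show what `return_sequences` changes, here is a toy one-unit recurrent cell in pure Python. It is only a sketch: the weights `w_x` and `w_h` are fixed, hypothetical values, whereas real layers learn weight matrices and GRU/LSTM cells add gates on top of this recurrence.

```python
import math

def simple_rnn(inputs, w_x=0.5, w_h=0.8, return_sequences=False):
    """Run a one-unit recurrent cell over `inputs` (a list of floats).

    h_t = tanh(w_x * x_t + w_h * h_{t-1}), starting from h_0 = 0.
    """
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = math.tanh(w_x * x + w_h * hidden)
        states.append(hidden)
    # Either every hidden state (one per time step) or only the last one
    return states if return_sequences else hidden

all_states = simple_rnn([1.0, 2.0, 3.0], return_sequences=True)
last_state = simple_rnn([1.0, 2.0, 3.0], return_sequences=False)
print(len(all_states), last_state)
```

A following recurrent layer needs one value per time step, so the layer before it must return the full sequence; the last recurrent layer before a classifier usually returns only the final state.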

## The Embedding layer
- Advantages: 
    - Reduces the dimensionality of the data
    - Dense representation
    - Transfer learning
- Disadvantages:
    - Many parameters to train
    - Can make training slower
- Should be the first layer of the model


In [15]:
from keras.layers import Embedding
from keras import Sequential

# Instantiate a new model so that the embedding is its first layer
model = Sequential()

# Use embedding as first layer
model.add(Embedding(input_dim=10000, # size of the vocabulary
                    output_dim=300, # dimension of the embedding space
                    trainable=True, # whether to update this layer's weights during training
                    embeddings_initializer=None, # pass an initializer with pre-trained weights for transfer learning
                    input_length=120)) # size of the sequences; assumes the inputs have been padded
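
Conceptually, an embedding layer is a trainable lookup table with one row per token index. The pure-Python sketch below uses a tiny hypothetical vocabulary (6 tokens, 3 dimensions) and random initial weights; it illustrates the lookup only and is not the Keras implementation.

```python
import random

random.seed(0)  # reproducible toy weights

vocab_size = 6      # plays the role of input_dim
embedding_dim = 3   # plays the role of output_dim

# The embedding matrix: one row of `embedding_dim` floats per token index
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

def embed(token_indexes):
    """Look up one dense vector per token index."""
    return [embedding_matrix[i] for i in token_indexes]

padded_sequence = [0, 0, 4, 1, 5]  # hypothetical padded input
vectors = embed(padded_sequence)
print(len(vectors), len(vectors[0]))  # 5 3

# Parameter count of the layer configured above: input_dim * output_dim
embedding_params = 10000 * 300
print(embedding_params)               # 3000000
```

Each token becomes a 300-value dense vector instead of a 10,000-value one-hot vector, at the cost of three million trainable weights.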

## Improving RNN Models 
- To improve the model's performance we can:
    - Add the embedding layer
    - increase the number of layers
    - tune the parameters
    - Increase vocabulary size
    - accept longer sentences with more memory cells
    
- to avoid overfitting:
    - Test using different batch sizes
    - Add a Dropout layer
        - the 'rate' parameter removes the specified percentage of input data to add noise to the model
    - Add the 'dropout' and 'recurrent_dropout' parameters on RNN layers
        - they remove the specified percentage of the inputs and of the memory cells, respectively
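
A minimal sketch of what dropout does to a list of values during training, assuming the usual "inverted dropout" scheme in which surviving values are scaled by 1 / (1 - rate) so the expected sum stays the same. The function name and the seeded random generator are choices made for this example.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(42)  # seeded so repeated runs drop the same positions
activations = [1.0] * 10
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

Every entry of `dropped` is either 0.0 or 1.25; at prediction time dropout is switched off and the values pass through unchanged.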
        
    