# Sequential Data

- Each point in the dataset depends on other points in the dataset, so the order of the points matters
- Text is one of the most widespread forms of sequence data
    - Understood as either a sequence of characters or a sequence of words
    - It is common to work at the level of words
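
The two views of text mentioned above can be sketched in a few lines of Python. This is only an illustration: the sample sentence, the naive whitespace tokenizer, and the variable names are assumptions, not part of the original pipeline.

```python
text = "the cat sat on the mat"  # hypothetical sample sentence

# Character-level view: every character (including spaces) is one step
char_sequence = list(text)

# Word-level view: split on whitespace (a deliberately naive tokenizer)
word_sequence = text.split()

# Map each distinct word to an integer index, in order of first appearance
word_to_index = {}
for word in word_sequence:
    if word not in word_to_index:
        word_to_index[word] = len(word_to_index)

# The sentence as a sequence of word indices
encoded = [word_to_index[w] for w in word_sequence]
print(encoded)  # [0, 1, 2, 3, 0, 4]
```

Working at the word level gives a much shorter sequence than the character level (6 steps instead of 22 here), at the cost of a larger vocabulary.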

### How to handle Sequential Data
- Text is not stateless: each part usually depends on other parts of the text, so feedforward neural networks cannot handle this type of data effectively
- Use a neural network with internal memory, i.e. a Recurrent Neural Network (RNN)

## Recurrent Neural Network

- To generate text, past input is needed
    - The output of a layer is saved and fed back into its input to help predict the output at the next time step
- RNN $\rightarrow$ A class of artificial neural networks that is powerful for modelling sequence data such as time series or natural language

### How it handles sequential information 
- Implements an internal memory: a hidden state that carries information from earlier steps to later ones
<img src = './images/rnn.png' width="400" height="400">

### Overview of RNN

<img src = 'https://www.simplilearn.com/ice9/free_resources_article_thumb/Fully_connected_Recurrent_Neural_Network.gif'>

- "x" is the input layer, "h" is the hidden layer, and "y" is the output layer
- "A", "B", "C" $\rightarrow$ Network parameters that are adjusted to improve the output of the model
    - At time "t":
        - Current Input $\rightarrow$ Combination of the input x(t) and the hidden state h(t-1), which summarises the earlier inputs such as x(t-1)
        - The output at any given time is fed back into the network to improve the output
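
The feedback loop described above can be sketched in plain Python. This is a minimal illustration rather than the network in the figure: the weight names `W_xh`, `W_hh`, `b_h`, the `tanh` activation, the toy sizes, and the random initialisation are all assumptions, and only the hidden-state update is shown (the output layer is omitted).

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: h(t) = tanh(W_xh x(t) + W_hh h(t-1) + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_memory = matvec(W_hh, h_prev)
    return [math.tanh(a + r + b) for a, r, b in zip(from_input, from_memory, b_h)]

def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
input_size, hidden_size = 3, 4  # toy sizes (assumed)
W_xh = rand_matrix(hidden_size, input_size, rng)
W_hh = rand_matrix(hidden_size, hidden_size, rng)
b_h = [0.0] * hidden_size

# A toy sequence of three one-hot inputs
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

h = [0.0] * hidden_size  # initial hidden state: no memory yet
states = []
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # the same weights are reused at every step
    states.append(h)
```

Because `h` is passed back into `rnn_step`, the hidden state after the second input differs from what the same input would produce with an empty memory; that difference is the "internal memory" at work.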

### Problem with using Simple RNN for text generation

- Vanishing Gradient Problem 
    - Arises when the sequence is long and/or the RNN has multiple hidden layers
    - Activation functions such as the sigmoid squash a large input space into a small output range between 0 and 1, so a large change in the input causes only a small change in the output and the derivative is small
    - Gradients carry the information used to update the RNN's parameters; as they become smaller and smaller, the parameter updates become insignificant and no real learning takes place
- To circumvent $\rightarrow$ LSTM architecture  
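
The shrinking effect can be illustrated numerically. The sketch below makes strong simplifying assumptions: a single scalar unit, a recurrent weight of 1, and the same pre-activation `z` at every step. The function names are hypothetical. Even in the best case for the sigmoid (`z = 0`, where its derivative peaks at 0.25), the factor that multiplies the gradient collapses quickly with the number of steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_factor(num_steps, z=0.0, weight=1.0):
    """Product of the per-step factors weight * sigmoid'(z) over num_steps steps."""
    factor = 1.0
    for _ in range(num_steps):
        factor *= weight * sigmoid_grad(z)
    return factor

# The gradient reaching a step that lies further back is scaled by a smaller factor
for steps in (1, 5, 10, 20):
    print(steps, backprop_factor(steps))
```

After 20 steps the factor is already below 1e-11, so an error signal from the end of a long sentence barely changes the parameters that processed its beginning.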

## LSTM

- Long Short-Term Memory (LSTM) is an advanced version of the RNN architecture that models sequential data and its long-range dependencies better than a conventional RNN
- The hidden layer consists of gated cells; each cell has four layers that interact with one another to produce the output of the cell along with its state
- Gates limit the information that passes through the cell and determine which part of the information is needed by the next cell and which part is discarded
<img src = 'https://www.mdpi.com/remotesensing/remotesensing-12-00256/article_deploy/html/images/remotesensing-12-00256-g003.png' width = '400' height ='400'>

### Architecture of LSTM 

- At time "t" $\rightarrow$ input vector $[h(t-1), x(t)]$ ; network cell state $c(t)$ ; output vector $h(t)$
- Uses hyperbolic tangent (tanh) and sigmoid activation functions
- 3 gates control the cell state
    - Forget Gate $\rightarrow$ Controls what information in the cell state to forget given new information that entered the network <img src = 'https://miro.medium.com/max/711/1*PJ5atpFStpNWE_XpB4e8qQ.png' width = 300 height ='300'>
    - Input Gate $\rightarrow$ Controls what new information will be encoded into the cell state, given the new input information <img src = 'https://miro.medium.com/max/698/1*pAzAFns1ccuHmBvCqwh3Fg.png' width = 300 height ='300'>
    - Output Gate $\rightarrow$ Controls what information encoded in the cell state is sent to the network as input in the following time step via output vector h(t) <img src = 'https://miro.medium.com/max/715/1*wXoU29bsWxi1WQ0DUAnK7g.png' width = 300 height ='300'>
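
One time step of such a cell can be sketched in plain Python. This is a minimal illustration of the standard LSTM update, not the implementation behind the figures: the parameter names (`W_f`, `W_i`, `W_g`, `W_o` and the matching biases), the random initialisation, and the toy sizes are assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a matrix W (list of rows) and vectors b, v."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns the new output vector h(t) and cell state c(t)."""
    z = h_prev + x_t  # list concatenation: the input vector [h(t-1), x(t)]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]    # input gate
    g = [math.tanh(a) for a in affine(params["W_g"], params["b_g"], z)]  # candidate values
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]    # output gate
    # Keep part of the old cell state and write part of the candidate into it
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Expose a gated, squashed view of the cell state as the output
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t

def init_params(input_size, hidden_size, seed=0):
    """Random weights and zero biases; a stand-in for trained parameters."""
    rng = random.Random(seed)
    params = {}
    for name in "figo":
        params["W_" + name] = [
            [rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
            for _ in range(hidden_size)
        ]
        params["b_" + name] = [0.0] * hidden_size
    return params

params = init_params(input_size=3, hidden_size=2)
h, c = [0.0, 0.0], [0.0, 0.0]
for x_t in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    h, c = lstm_step(x_t, h, c, params)
```

The three sigmoid layers are the gates and the tanh layer proposes new content, which accounts for the four interacting layers in the cell. Pushing the forget gate towards 1 and the input gate towards 0 makes the cell copy its previous state almost unchanged.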

### How it handles the vanishing gradient problem 

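The gradient of the cell state contains the forget gate's activation. When the LSTM keeps that activation close to 1, the gradient passes back through a time step almost unchanged instead of shrinking. This lets the LSTM decide, at each time step, that certain information should not be forgotten and update the model's parameters accordingly.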
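
As a rough sketch of why this helps, the cell state update and its derivative can be written in the usual LSTM notation. The symbols $f(t)$, $i(t)$ and $g(t)$ for the forget gate, input gate and candidate values are assumptions here, not taken from the figures above, and the derivative follows only the direct cell-state path, ignoring the indirect dependence of the gates on $h(t-1)$:

$$
c(t) = f(t) \odot c(t-1) + i(t) \odot g(t)
$$

$$
\frac{\partial c(t)}{\partial c(t-1)} = \operatorname{diag}\big(f(t)\big),
\qquad
\frac{\partial c(t)}{\partial c(t-k)} = \prod_{j=0}^{k-1} \operatorname{diag}\big(f(t-j)\big)
$$

Over $k$ steps the direct-path gradient is a product of forget-gate activations. If the network learns to keep them close to 1, the product does not collapse the way repeated sigmoid derivatives (at most 0.25 each) do in a simple RNN.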

### How LSTM was implemented
Using the Keras Sequential API 

In [None]:
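from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Embedding

# Placeholder hyperparameters: assumed values, replace them with your own
vocab_size = 10000    # number of distinct words in the vocabulary
embedding_dim = 100   # length of each word vector
lstm_units = 64       # size of the LSTM hidden state
dense_units = 64      # width of the fully connected layer
dropout_rate = 0.5    # fraction of units dropped during training

model = Sequential()

# Embedding layer: word index -> dense vector
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))

# Recurrent layer
model.add(LSTM(lstm_units))

# Fully connected layer
model.add(Dense(dense_units, activation='relu'))

# Dropout for regularization
model.add(Dropout(dropout_rate))

# Output layer: one probability per word in the vocabulary
model.add(Dense(vocab_size, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])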

- `Embedding` maps each input word to an n-dimensional vector
- `LSTM` layer of recurrent cells; since we are using 2 LSTM layers, `return_sequences=True` is kept on the first so that the second receives the full sequence
- Fully-connected `Dense` layer with `relu` activation; adds representational capacity to the network
- `Dropout` layer to prevent overfitting to the training data
- `Dense` fully-connected output layer; produces a probability for every word in the vocabulary using `softmax` activation
- Compiled using the `Adam` optimizer (a variant of Stochastic Gradient Descent) and trained using the `categorical_crossentropy` loss function
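
To make the last two points concrete, the sketch below computes a softmax distribution and the categorical cross-entropy loss by hand in plain Python. It is an illustration of the formulas only, not Keras code: the toy vocabulary, the scores, and the function names are assumptions.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p + eps) for t, p in zip(one_hot_target, probs))

vocab = ["the", "cat", "sat", "mat"]  # hypothetical toy vocabulary
logits = [2.0, 1.0, 0.1, -1.0]        # made-up scores from an output layer
probs = softmax(logits)

target = [1.0, 0.0, 0.0, 0.0]         # one-hot label: the true next word is "the"
loss = categorical_crossentropy(target, probs)

# The predicted word is the one with the highest probability
predicted_word = vocab[probs.index(max(probs))]
```

The loss is small when the network puts high probability on the true next word and grows as that probability falls, which is the signal the optimizer uses to update the parameters.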

# Preprocessing the data