# Recurrent Neural Network

## References

1. [Udacity's Deep Learning Nanodegree](https://classroom.udacity.com/nanodegrees/nd101-ent/syllabus/core-curriculum) 
2. [Machine Talk](https://machinetalk.org/2019/02/08/text-generation-with-pytorch/)
3. [KD Nuggets tutorial on text generation via LSTM](https://www.kdnuggets.com/2020/07/pytorch-lstm-text-generation-tutorial.html)
4. [Pytorch official documentation](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html)

Some applications of deep learning involve temporal dependencies, that is, the output depends not just on the current input but also on past inputs. RNNs are similar to feed-forward networks, with the addition of *memory*.<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/RNNs%20-%20Temporal%20Dependencies.png?raw=1" width="250" height="40%"></img>

In RNNs, the current output *y* depends not only on the current input *x* but also on a memory element *s* that takes past inputs into account.

RNNs capture information from previous inputs by maintaining internal memory elements called *states*.<br><br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/RNNs-%20States.png?raw=1" width="300"></img>
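
The role of the state can be illustrated with a minimal scalar sketch. The weights `w_x`, `w_s`, `w_y` and the `tanh` activation are assumptions chosen for illustration, not values taken from the diagrams above.

```python
import math

def rnn_step(x, s_prev, w_x, w_s, w_y):
    """One step of a scalar RNN: update the state, then read the output."""
    s = math.tanh(w_x * x + w_s * s_prev)  # new state mixes input and old state
    y = w_y * s                            # output is computed from the state
    return y, s

def run_rnn(inputs, w_x=0.5, w_s=0.8, w_y=1.0):
    """Feed a sequence through the RNN, carrying the state between steps."""
    s = 0.0
    outputs = []
    for x in inputs:
        y, s = rnn_step(x, s, w_x, w_s, w_y)
        outputs.append(y)
    return outputs

# Same input (0.0) at the last step, but a different history
outputs_a = run_rnn([1.0, 0.0, 0.0])
outputs_b = run_rnn([0.0, 0.0, 0.0])
```

Although both sequences end with the same input, the final outputs differ because the state still carries a trace of the first input in `outputs_a`.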

## Applications of RNNs

1. Predicting the next word in a sentence, which requires looking at the *last few words instead of just the current one.*

2. Sentiment Analysis
3. Speech Recognition
4. Time Series Prediction
5. NLP
6. Gesture Recognition

## Structure of RNNs
Below are the folded and unfolded structures of an RNN:<br>

| Folded RNN                                                    | Un-folded RNN                                               |
|---------------------------------------------------------------|-------------------------------------------------------------|
| <img  src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/RNN-%20Folded%20Model.png?raw=1" width="300"></img> | <img  src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/RNNs%20-%20Unfolded.png?raw=1" width="300"></img> |



# Backpropagation Through Time (BPTT)

Let's look at timestep t=3. The error gradient with respect to Wx depends on the state vector S3 and on its predecessors S2 and S1.<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/BPTT.png?raw=1" width="600"></img><br>

Looking at the pattern above for the *accumulated gradient*, we can generalize the formula for Backpropagation Through Time (BPTT) as follows:<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/General%20formula%20for%20BPTT.png?raw=1" width="300"></img><br>
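
For readers who cannot view the images, the same idea can be written out with the chain rule. The symbols below (error $E_t$, output $y_t$, state $s_t$, input weights $W_x$) are assumed to match the figures; treat this as a sketch of the standard BPTT expansion rather than a transcription of the images.

```math
\frac{\partial E_3}{\partial W_x}
= \frac{\partial E_3}{\partial y_3}\frac{\partial y_3}{\partial s_3}\frac{\partial s_3}{\partial W_x}
+ \frac{\partial E_3}{\partial y_3}\frac{\partial y_3}{\partial s_3}\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial W_x}
+ \frac{\partial E_3}{\partial y_3}\frac{\partial y_3}{\partial s_3}\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial W_x}
```

```math
\frac{\partial E_N}{\partial W_x}
= \sum_{i=1}^{N} \frac{\partial E_N}{\partial y_N}\,\frac{\partial y_N}{\partial s_i}\,\frac{\partial s_i}{\partial W_x},
\qquad
\frac{\partial y_N}{\partial s_i}
= \frac{\partial y_N}{\partial s_N}\prod_{k=i+1}^{N}\frac{\partial s_k}{\partial s_{k-1}}
```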



# Drawbacks of RNNs

## Vanishing Gradient Problem

In RNNs, if we back-propagate further than about 8-9 time steps, the contribution of information (the gradient) decreases geometrically over time. This is known as the *vanishing gradient problem.* This is where the **LSTM** comes into the picture.<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/LSTM%20Intro.png?raw=1" width="600"></img>
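
The geometric decay can be made concrete with a small sketch. It assumes, as a simplification, that every backward step multiplies the gradient by the same constant `factor`; real networks have a different factor at each step.

```python
def gradient_contribution(factor, steps):
    """Size of a gradient term after flowing back through `steps` time steps,
    assuming each step scales it by the same constant factor."""
    contribution = 1.0
    for _ in range(steps):
        contribution *= factor
    return contribution

# A factor below 1 shrinks the contribution geometrically (vanishing),
# while a factor above 1 makes it grow geometrically (exploding).
vanishing = [gradient_contribution(0.5, k) for k in (1, 5, 10)]
exploding = [gradient_contribution(1.5, k) for k in (1, 5, 10)]
```

With a factor of 0.5, the contribution after ten steps is already below one thousandth of its original size.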

## Exploding Gradient Problem

In RNNs we can also have the opposite problem, called the *exploding gradient* problem, in which the value of the gradient grows uncontrollably. A simple remedy is **gradient clipping**, which rescales the gradient whenever its norm exceeds a chosen threshold.

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/Gradient%20Clipping.png?raw=1" width="500"></img>
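
A minimal sketch of norm-based clipping on a plain list of gradient values is shown here. The function name `clip_by_norm` is hypothetical; the training code later in this notebook relies on PyTorch's `nn.utils.clip_grad_norm_` for the same purpose.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm <= max_norm:
        return list(gradients)       # small enough: leave unchanged
    scale = max_norm / total_norm    # shrink every component equally
    return [g * scale for g in gradients]

clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)   # original norm is 50
unchanged = clip_by_norm([0.3, 0.4], max_norm=5.0)   # original norm is 0.5
```

Clipping preserves the direction of the gradient and only limits its length, which is why it is a common default for recurrent models.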


# Long Short Term Memory Cells (LSTM Cells)

## Basics of LSTM

A basic RNN is unable to retain long-term memory, for example to predict whether the current picture is that of a wolf or a dog based on what it saw much earlier. This is where the LSTM comes into the picture. The LSTM cell allows a recurrent system to learn over many time steps without losing information to the vanishing gradient problem. It is fully differentiable, so its weights can be updated with backpropagation. Below is a sample mathematical model of an LSTM cell:<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/01.lstm_cell.png?raw=1" width="300"></img><br>


In an LSTM, we would expect the following behaviour -


| Expected Behaviour of LSTM                                                                   | Reference Diagram                                                       |
|----------------------------------------------------------------------------------------------|-------------------------------------------------------------------------|
| 1. Long Term Memory (LTM) and Short Term Memory (STM) combine to produce the correct output. | <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/05.%20lstm_basics_1.png?raw=1" width="300"></img> |
| 2. LTM, STM and the event together produce the new LTM.                                      | <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/06.%20lstm_basics_2.png?raw=1" width="300"></img>  |
| 3. LTM, STM and the event together produce the new STM.                                      | <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/07.%20lstm_basics_3.png?raw=1" width="300"></img>          |



## How do LSTMs work?

| LSTM consists of 4 types of gates -  <br>1. Forget Gate<br>  2. Learn Gate<br> 3. Remember Gate<br> 4. Use Gate<br> | <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/10.%20lstm_architecture_02.png?raw=1" width="530px" height="250px"></img> |
|-------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|

### LSTM Explained
Assume the following - 
1. LTM = Elephant
2. STM = Fish
3. Event = Wolf/Dog

| LSTM Operations                                                                                                                                                                                            | Reference Video                                      |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------|
| **LSTM places LTM, STM and Event as follows -**<br> 1. Forget Gate = LTM<br>  2. Learn Gate = STM + Event<br> 3. Remember Gate = LTM + STM + Event<br> 4. Use Gate = LTM + STM + Event<br> 5. In the end, LTM and STM are updated.<br> | <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/Animated%20GIF-downsized_large.gif?raw=1"></img> |


## General Architecture of LSTM 

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/LSTM%20Architecture.png?raw=1" width="400"></img>




## Learn Gate
The learn gate takes the **short-term memory and the event**, combines them, and then ignores part of the result, retaining only the useful information.<br>
<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/11.%20learn_gate.png?raw=1" height="200px" width="500px"></img>

### Mathematically Explained
STM and Event are combined through an **activation function** (tanh), and the result is multiplied by an **ignore factor** as follows:<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/12.lean_gate_equation.png?raw=1" height="200px" width="500px"></img>

## Forget Gate
The forget gate takes the LTM and decides which parts to keep and which parts are useless and should be forgotten. The LTM is multiplied by a **forget factor** in order to forget its useless parts.<br>
<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/13.%20forget_gate.png?raw=1" height="200px" width="500px"></img>

## Remember Gate
The remember gate takes the LTM coming from the forget gate and the STM coming from the learn gate and combines them. Mathematically, the remember gate adds the two.<br><br>
<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/14.%20remember_gate.png?raw=1" height="200px" width="400px"></img> <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/15.%20remember_gate_equation.png?raw=1" height="200px" width="450px"></img>

## Use Gate
The use gate takes what is useful from the LTM and what is useful from the STM and generates a new STM, which also serves as the cell's output.<br><br>
<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/16.%20use_gate.png?raw=1" height="200px" width="400px"></img> <img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/17.%20use_gate_equation.png?raw=1" height="200px" width="450px"></img>
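
The four gates can be tied together in a single scalar sketch. It follows the standard LSTM formulation with hypothetical weight names (`candidate`, `ignore`, `forget`, `output`), so it may differ in detail from the equations in the images above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(ltm, stm, event, w):
    """One scalar LSTM step; `w` maps a gate name to (w_stm, w_event, bias)."""
    def gate(name, activation):
        w_stm, w_event, bias = w[name]
        return activation(w_stm * stm + w_event * event + bias)

    # Learn gate: candidate information scaled by an ignore factor
    learned = gate("candidate", math.tanh) * gate("ignore", sigmoid)
    # Forget gate: scale the old LTM by a forget factor
    kept = ltm * gate("forget", sigmoid)
    # Remember gate: the new LTM is the sum of what was kept and learned
    new_ltm = kept + learned
    # Use gate: squash the new LTM and filter it to get the new STM (the output)
    new_stm = math.tanh(new_ltm) * gate("output", sigmoid)
    return new_ltm, new_stm

# With all-zero weights every sigmoid gate is 0.5 and the candidate is 0
zero_weights = {name: (0.0, 0.0, 0.0)
                for name in ("candidate", "ignore", "forget", "output")}
new_ltm, new_stm = lstm_step(ltm=2.0, stm=0.3, event=1.0, w=zero_weights)
```

In a real LSTM each quantity is a vector and each weight a matrix, but the flow of information between the gates is the same.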







# RNNs and LSTM for Text Generation


## Drawbacks of one-hot encoding 

Consider an excerpt from a book with a large vocabulary. To use its words as input to an RNN we can one-hot encode them, but this means each word becomes a giant vector that is all zeros except for a single entry, as shown below:<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/One%20hot%20encoded%20vectors.png?raw=1" width="500"></img>

We then pass this one-hot encoded vector into the hidden layer of the RNN. The resulting matrix multiplication is dominated by multiplications by zero because of the one-hot encoding, which is really *computationally inefficient*.<br>

<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/Computationally%20in-efficient.png?raw=1" width="500"></img>

This is where *embeddings* come into the picture.


## Word Embeddings

Word embedding is a general technique for reducing the dimensionality of text data, and embedding models can also learn interesting traits about the words in a vocabulary.<br>

Embeddings can improve the ability of neural networks to learn from text data by representing words as *lower-dimensional vectors.*

The idea is that multiplying a one-hot encoded vector by a weight matrix returns only the row of the matrix that corresponds to the 1, that is, the "on" input unit.<br><br>

Hence, instead of doing the matrix multiplication, we use the weight matrix as a look-up table, and instead of representing words as one-hot vectors, we encode each word as a unique integer.


<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/Embedding%20Lookup.png?raw=1" width="500"></img>
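
The equivalence between the matrix product and a row look-up can be checked with a small sketch. The weight values and helper names below are made up for illustration.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_matrix_product(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

# Hypothetical embedding weights: 4 words in the vocabulary, 3 dimensions each
embedding_weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

word_index = 2
via_matmul = vector_matrix_product(one_hot(word_index, 4), embedding_weights)
via_lookup = embedding_weights[word_index]  # same result, no multiplication
```

Both paths return the third row of the matrix, but the look-up touches one row instead of every entry.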



# Look-up Tables

Consider the word "heart" from the example above, which is encoded as the integer 958. We can look up its embedding vector in the 958th row of the embedding weight matrix. This is called a *look-up table*.

## Dimensions of Look-up table

If we have a vocabulary of 10k words, the embedding weight matrix has 10k rows. The width of the table is called the *embedding dimension*.


<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/embedding_lookup_table.png?raw=1" width="500"></img>


# Word2Vec Models

The Word2Vec model provides much more efficient representations by finding vectors that represent words.<br>

There are 2 architectures for implementing Word2Vec -
1. CBOW (Continuous Bag Of Words)
2. Skip-gram



<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/word2vec_architectures.png?raw=1" width="500"></img>


We have implemented *Talking Points* using the *Skip-gram* model.
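
As a rough illustration of what Skip-gram trains on, the sketch below builds (center word, context word) pairs from a sentence. The fixed `window_size` and the function name are assumptions for this example and are not part of the notebook's own code.

```python
def skip_gram_pairs(words, window_size=2):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(words):
        start = max(0, i - window_size)
        stop = min(len(words), i + window_size + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, words[j]))
    return pairs

pairs = skip_gram_pairs(["the", "quick", "brown", "fox"], window_size=1)
```

A Skip-gram model is then trained to predict the context word from the center word, whereas CBOW predicts the center word from its context.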

In [23]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

In [24]:
df = pd.read_csv('data/parsed_data/data.csv', sep=';')

In [4]:
df.head()

Unnamed: 0,id,Question
0,1,¿Cuáles son los principios fundamentales de la...
1,2,¿Cómo se implementa el polimorfismo en lenguaj...
2,3,¿Cuál es la diferencia entre un algoritmo recu...
3,4,¿Cómo se optimiza el rendimiento de bases de d...
4,5,¿Qué es la arquitectura de microservicios y en...


In [25]:
# collect every question as a plain Python list of strings
news = df['Question'].tolist()
    
print(len(news))

500


In [26]:
news[:1]

['¿Cuáles son los principios fundamentales de la programación orientada a objetos?']

In [27]:
len(news)

500

In [28]:
news = news[:109233]

In [29]:
len(news)

500

In [30]:
os.path.join('/kaggle/working', 'finance_news.txt')

'/kaggle/working\\finance_news.txt'

In [31]:
# the context manager closes the file even if the write fails
with open('./data/parsed_data/data.txt', 'w') as f:
    f.write('\n'.join(news))

# Pre-processing the Text

The following section pre-processes our text file so that -
1. Punctuation marks are converted into tokens, so a period is changed to the bracketed token `<PERIOD>`.
2. In this data set, there aren't any periods, but tokenizing them will help in other NLP problems.
3. Removing all words that show up five or fewer times in the dataset would greatly reduce issues due to noise in the data and improve the quality of the vector representations; the code below does not apply this filter.
4. The text is lowercased and split into a list of words.
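
The punctuation step can be tried in isolation with a small sketch. The helper name `tokenize_punctuation` and the three-entry token dictionary are hypothetical; the notebook's own `token_lookup` further below defines the full set.

```python
def tokenize_punctuation(text, token_dict):
    """Replace each punctuation symbol with a spaced token, then lowercase and split."""
    for symbol, token in token_dict.items():
        text = text.replace(symbol, ' {} '.format(token))
    return text.lower().split()

example_tokens = {'.': '<PERIOD>', ',': '<COMMA>', '?': '<QUESTION_MARK>'}
example_words = tokenize_punctuation("Hello, world. Ready?", example_tokens)
```

Surrounding each token with spaces ensures that punctuation becomes its own "word" instead of staying glued to its neighbour.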

In [32]:
import os
import pickle
import torch


SPECIAL_WORDS = {'PADDING': '<PAD>'}


def load_data(path):
    """
    Load Dataset from File
    """
    input_file = os.path.join(path)
    with open(input_file, "r") as f:
        data = f.read()

    return data


def preprocess_and_save_data(dataset_path, token_lookup, create_lookup_tables):
    """
    Preprocess Text Data
    """
    text = load_data(dataset_path)
    
    # Skip the first 81 characters. This slice is inherited from a dataset that
    # began with a notice; here it discards the start of the file instead.
    text = text[81:]

    token_dict = token_lookup()
    for key, token in token_dict.items():
        text = text.replace(key, ' {} '.format(token))

    text = text.lower()
    text = text.split()

    vocab_to_int, int_to_vocab = create_lookup_tables(text + list(SPECIAL_WORDS.values()))
    int_text = [vocab_to_int[word] for word in text]
    with open('preprocess.p', 'wb') as f:
        pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), f)


def load_preprocess():
    """
    Load the preprocessed data saved by preprocess_and_save_data:
    (int_text, vocab_to_int, int_to_vocab, token_dict)
    """
    with open('preprocess.p', mode='rb') as f:
        return pickle.load(f)


def save_model(filename, decoder):
    save_filename = os.path.splitext(os.path.basename(filename))[0] + '.pt'
    torch.save(decoder, save_filename)


def load_model(filename):
    save_filename = os.path.splitext(os.path.basename(filename))[0] + '.pt'
    return torch.load(save_filename)

In [33]:
data_dir = './data/parsed_data/data.txt'
text = load_data(data_dir)

# Vocab2int & Int2vocab

Here we create 2 dictionaries to convert words to integers (`vocab_to_int`) and integers back to words (`int_to_vocab`). The integers are assigned in descending order of frequency, so the most frequent word is given the integer 0, the next most frequent word is given 1, and so on.

In [35]:
view_line_range = (0, 10)

import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len(set(text.split()))))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 1167
Number of lines: 500
Average number of words in each line: 13.362

The lines 0 to 10:
¿Cuáles son los principios fundamentales de la programación orientada a objetos?
¿Cómo se implementa el polimorfismo en lenguajes de programación como Java o C++?
¿Cuál es la diferencia entre un algoritmo recursivo y uno iterativo?
¿Cómo se optimiza el rendimiento de bases de datos en aplicaciones web de alta concurrencia?
¿Qué es la arquitectura de microservicios y en qué se diferencia de la arquitectura monolítica?
¿Cómo se implementa la seguridad en el desarrollo de aplicaciones web?
¿Cuáles son las principales técnicas para el análisis de big data?
¿Qué es el patrón MVC (Modelo-Vista-Controlador) y cuál es su utilidad en el desarrollo web?
¿Cómo se abordan los problemas de concurrencia y sincronización en el desarrollo de software?
¿Qué son los sistemas distribuidos y cuáles son sus ventajas y desventajas?


In [36]:
from collections import Counter

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text split into a list of words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    # count occurrences and sort the vocabulary from most to least frequent
    word_count = Counter(text)
    sorted_vocab = sorted(word_count, key=word_count.get, reverse=True)
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
    
    # return tuple
    return (vocab_to_int, int_to_vocab)


In [37]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenized dictionary where the key is the punctuation and the value is the token
    """
    token = dict()
    token['.'] = '<PERIOD>'
    token[','] = '<COMMA>'
    token['"'] = 'QUOTATION_MARK'
    token[';'] = 'SEMICOLON'
    token['!'] = 'EXCLAMATION_MARK'
    token['?'] = 'QUESTION_MARK'
    token['('] = 'LEFT_PAREN'
    token[')'] = 'RIGHT_PAREN'
    # each symbol needs its own token; reusing QUESTION_MARK for '-' would
    # turn every dash into '?' when the generated text is converted back
    token['-'] = 'DASH'
    token['\n'] = 'NEW_LINE'
    return token


In [38]:
preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

In [39]:
int_text, vocab_to_int, int_to_vocab, token_dict = load_preprocess()

In [40]:
train_on_gpu = torch.cuda.is_available()

# Batching Data

We'll use `TensorDataset` to provide a known format to our dataset; in combination with `DataLoader`, it will handle batching, shuffling, and other dataset iteration functions.<br>
We can create data with `TensorDataset` by passing in feature and target tensors, then create a `DataLoader` as usual.

```python
data = TensorDataset(feature_tensors, target_tensors)
data_loader = torch.utils.data.DataLoader(data, batch_size=batch_size)
```

For example, say we have these as input:<br>
```
words = [1, 2, 3, 4, 5, 6, 7]
sequence_length = 4
```
Our first feature_tensor should contain the values:<br>
```
[1, 2, 3, 4]
```
And the corresponding target_tensor should just be the next "word"/tokenized word value:<br>
```
5
```
This should continue with the second feature_tensor, target_tensor being:<br>
```
[2, 3, 4, 5]  # features
6             # target
```
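
The windowing described above can be sketched in plain Python before any tensors are involved. The function name `sliding_windows` is hypothetical; the notebook's `batch_data` below applies the same idea and wraps the result in a `DataLoader`.

```python
def sliding_windows(words, sequence_length):
    """Pair each window of `sequence_length` words with the word that follows it."""
    features, targets = [], []
    for start in range(len(words) - sequence_length):
        end = start + sequence_length
        features.append(words[start:end])  # the input sequence
        targets.append(words[end])         # the next word to predict
    return features, targets

features, targets = sliding_windows([1, 2, 3, 4, 5, 6, 7], sequence_length=4)
```

Seven words with a sequence length of four give three feature/target pairs, because the last window must still leave one word to serve as its target.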

In [41]:
from torch.utils.data import TensorDataset, DataLoader
import torch
import numpy as np


def batch_data(words, sequence_length, batch_size):
    """
    Batch the neural network data using DataLoader
    :param words: The word ids of the text
    :param sequence_length: The sequence length of each batch
    :param batch_size: The size of each batch; the number of sequences in a batch
    :return: DataLoader with batched data
    """
    # truncate the words to a multiple of batch_size
    n_batches = len(words)//batch_size
    x, y = [], []
    words = words[:n_batches*batch_size]
    
    # slide a window over the words: each window is a feature sequence
    # and the word right after it is the target
    for ii in range(0, len(words)-sequence_length):
        i_end = ii+sequence_length
        batch_x = words[ii:i_end]
        x.append(batch_x)
        batch_y = words[i_end]
        y.append(batch_y)
    
    data = TensorDataset(torch.from_numpy(np.asarray(x)), torch.from_numpy(np.asarray(y)))
    data_loader = DataLoader(data, shuffle=True, batch_size=batch_size)
        
    
    # return a dataloader
    return data_loader


In [45]:
# test dataloader

test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample = next(data_iter)  # get one sample batch

sample_x, sample_y = sample[0], sample[1]  # assuming the DataLoader returns (input, target)

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[44, 45, 46, 47, 48],
        [41, 42, 43, 44, 45],
        [10, 11, 12, 13, 14],
        [34, 35, 36, 37, 38],
        [29, 30, 31, 32, 33],
        [ 6,  7,  8,  9, 10],
        [ 2,  3,  4,  5,  6],
        [ 0,  1,  2,  3,  4],
        [40, 41, 42, 43, 44],
        [36, 37, 38, 39, 40]], dtype=torch.int32)

torch.Size([10])
tensor([49, 46, 15, 39, 34, 11,  7,  5, 45, 41], dtype=torch.int32)


# Talking Points Model

## General Architecture

### Embedding Layer

The model takes our word tokens and first passes them through the embedding layer. This layer is responsible for converting our word tokens (integers) into embeddings of a specific size. These word embeddings are then fed to the next layer of LSTM cells.<br>

The main purpose of using an embedding layer is dimensionality reduction.

### Contiguous LSTM Layer

Our LSTM layer is defined by its *hidden state size and number of layers*. At each step, an LSTM cell produces an output and a new hidden state. The hidden state is passed to the next cell as input (the memory representation).

### Final Fully Connected Linear Layer

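The output generated by the LSTM is then fed into a *fully-connected linear layer.* This layer is responsible for mapping the LSTM output to the desired output size, one score per word in the vocabulary.

A softmax over these scores gives the probability distribution of the most likely next word. In the code below, the softmax is applied inside `nn.CrossEntropyLoss` during training and explicitly with `F.softmax` during generation.<br><br>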


<img src="https://github.com/purvasingh96/Talking-points-global-hackathon/blob/master/assets/lstm_rnn_2.png?raw=1" height="500"></img>
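
The generation code later in this notebook applies a softmax to the scores of the final layer. A minimal sketch of that conversion is given here; the three example scores are made up.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # subtracting the maximum avoids overflow without changing the result
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])
```

The word with the highest score receives the highest probability, and every word keeps a non-zero chance of being sampled.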


In [70]:
import torch.nn as nn

class RNN(nn.Module):
    
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):
        """
        Initialize the PyTorch RNN Module
        :param vocab_size: The number of input dimensions of the neural network (the size of the vocabulary)
        :param output_size: The number of output dimensions of the neural network
        :param embedding_dim: The size of embeddings, should you choose to use them        
        :param hidden_dim: The size of the hidden layer outputs
        :param n_layers: The number of stacked LSTM layers
        :param dropout: dropout to add in between LSTM/GRU layers
        """
        super(RNN, self).__init__()
        
        # define embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # define lstm layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        
        
        # set class variables
        self.vocab_size = vocab_size
        self.output_size = output_size
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        
        # define model layers
        self.fc = nn.Linear(hidden_dim, output_size)
    
    
    def forward(self, x, hidden):
        """
        Forward propagation of the neural network
        :param x: The input to the neural network
        :param hidden: The hidden state        
        :return: Two Tensors, the output of the neural network and the latest hidden state
        """
        batch_size = x.size(0)
        x=x.long()
        
        # embedding and lstm_out 
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        
        # flatten the LSTM outputs to (batch_size * seq_length, hidden_dim)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # fully connected layer producing one score per vocabulary word
        out = self.fc(lstm_out)
        
        # reshaping out layer to batch_size * seq_length * output_size
        out = out.view(batch_size, -1, self.output_size)
        
        # keep only the scores from the last time step of each sequence
        out = out[:, -1]

        # return one batch of output word scores and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        '''
        Initialize the hidden state of an LSTM/GRU
        :param batch_size: The batch_size of the hidden state
        :return: hidden state of dims (n_layers, batch_size, hidden_dim)
        '''
        # create 2 new zero tensors of size n_layers * batch_size * hidden_dim,
        # one for the hidden state and one for the cell state; new_zeros keeps
        # them on the same device and dtype as the model weights
        weights = next(self.parameters()).data

        hidden = (weights.new_zeros((self.n_layers, batch_size, self.hidden_dim)),
                  weights.new_zeros((self.n_layers, batch_size, self.hidden_dim)))
        
        return hidden

In [82]:
def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden):
    """
    Forward and backward propagation on the neural network
    :param rnn: The PyTorch Module that holds the neural network
    :param optimizer: The PyTorch optimizer for the neural network
    :param criterion: The PyTorch loss function
    :param inp: A batch of input to the neural network
    :param target: The target output for the batch of input
    :param hidden: The hidden state carried over from the previous batch
    :return: The loss and the latest hidden state Tensor
    """
    
    # Detach the hidden state from its history so that gradients are not
    # back-propagated through previous batches
    h = tuple([each.detach() for each in hidden])
    
    rnn.zero_grad()
    
    # Forward pass
    output, h = rnn(inp, h)
    
    # Calculate the loss
    loss = criterion(output, target.long())
    
    # Backpropagation
    loss.backward()
    nn.utils.clip_grad_norm_(rnn.parameters(), 5)
    optimizer.step()

    # Return the loss over a batch and the hidden state produced by our model
    return loss.item(), h


In [115]:
def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        print(epoch_i)
        # initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        
        # only iterate over completely full batches, since the hidden
        # state is sized for exactly batch_size sequences
        n_batches = len(train_loader.dataset)//batch_size
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            if batch_i > n_batches:
                break
            
            # forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # record loss
            batch_losses.append(loss)

            # printing loss stats
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []

    # returns a trained rnn
    return rnn

In [189]:
# Data params
# Sequence Length
sequence_length = 10  # of words in a sequence
# Batch Size
batch_size = 128

# data loader - do not change
train_loader = batch_data(int_text, sequence_length, batch_size)

In [190]:
# Training parameters
# Number of Epochs
num_epochs = 10
# Learning Rate
learning_rate = 0.001

# Model parameters
# Vocab size
vocab_size = len(vocab_to_int)
# Output size
output_size = vocab_size
# Embedding Dimension
embedding_dim = 200
# Hidden Dimension
hidden_dim = 250
# Number of RNN Layers
n_layers = 2

# Show stats for every n number of batches
show_every_n_batches = 500

In [191]:
# create the model (it stays on the CPU, since nothing here moves it to the GPU)
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)


# defining loss and optimization functions for training
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# training the model
trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

# saving the trained model
save_model('../app/saved_models/trained_rnn', trained_rnn)
print('Model Trained and Saved')

Training for 10 epoch(s)...
1
2
3
4
5
6
7
8
9
10
Model Trained and Saved


In [192]:
import torch

_, vocab_to_int, int_to_vocab, token_dict = load_preprocess()
trained_rnn = load_model('../app/saved_models/trained_rnn')

In [195]:
import torch.nn.functional as F

def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
    """
    Generate text using the neural network
    :param rnn: The PyTorch Module that holds the trained neural network
    :param prime_id: The word id to start the first prediction
    :param int_to_vocab: Dict of word id keys to word values
    :param token_dict: Dict of punctuation keys to punctuation token values
    :param pad_value: The value used to pad a sequence
    :param predict_len: The length of text to generate
    :return: The generated text
    """
    rnn.eval()
    
    # create a sequence (batch_size=1) with the prime_id
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    
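    for _ in range(predict_len):
        # move the current sequence to whichever device holds the model,
        # so the input and the weights never end up on different devices
        device = next(rnn.parameters()).device
        current_seq = torch.LongTensor(current_seq).to(device)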
        # initialize the hidden state
        hidden = rnn.init_hidden(current_seq.size(0))
        
        # get the output of the rnn
        output, _ = rnn(current_seq, hidden)
        
        # get the next word probabilities
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
         
        # use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        
        # select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        
        # retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)     
        
        # the generated word becomes the next "current sequence" and the cycle can continue
        current_seq = np.roll(current_seq.cpu(), -1, 1)
        current_seq[-1][-1] = word_i
    
    gen_sentences = ' '.join(predicted)
    
    # Replace punctuation tokens with the original punctuation symbols
    for key, token in token_dict.items():
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    
    # return all the sentences
    return gen_sentences


In [205]:
gen_length = 8 # modify the length to your preference
prime_words = ['diferencia'] # word used to start the generated text

for prime_word in prime_words:
    pad_word = SPECIAL_WORDS['PADDING']
    generated_script = generate(trained_rnn, vocab_to_int[prime_word], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
    print(generated_script)

diferencia el desarrollo de sistemas?
¿qué desafíos
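
The top-k sampling step inside `generate` can be isolated as a small standard-library sketch. The function name `top_k_sample` and the example probabilities are hypothetical; the notebook itself uses `topk` and `np.random.choice` for this step.

```python
import random

def top_k_sample(probabilities, k, rng=random):
    """Sample an index from the k most likely entries, weighted by probability."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    top_indices = ranked[:k]
    top_weights = [probabilities[i] for i in top_indices]
    # choices() normalizes the weights, so they need not sum to 1
    return rng.choices(top_indices, weights=top_weights, k=1)[0]

rng = random.Random(0)  # seeded only to make the example repeatable
samples = [top_k_sample([0.05, 0.5, 0.3, 0.1, 0.05], k=2, rng=rng)
           for _ in range(20)]
```

Restricting the choice to the top k words keeps unlikely words out of the output while still allowing some variety between runs.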


## Future Work

There are a few things I would like to work on to improve the model's performance -

1. Use of a bidirectional LSTM
2. Pre-trained word embeddings such as GloVe or FastText
3. A larger dataset that focuses on the impact of the corona pandemic on stocks.