## This project builds a Recurrent Neural Network (RNN) to predict the sentiment of text reviews.
<br> <b>The dataset is a collection of reviews, each labeled positive or negative.
<br> The project goes through the following steps:</b>

> **Steps:** 
1. **Prepare the dataset:**
    * Load the dataset
    * Print some of the data
    * Split the reviews into words
    * Build an integer vocabulary that maps each word to a number
    * Print the vocabulary size
    * Convert the positive and negative labels to numeric values
    * Find the longest review and count the empty reviews
    * Remove the empty reviews
    * Write a function that makes every review the same length
2. **Create the training, validation, and test sets:**
    * Create a DataLoader from each set
3. **Build an RNN using [LSTM](https://pytorch.org/docs/stable/nn.html#lstm)**
4. **Train the model**
5. **Validate the model**
6. **Test the model:**
    * Write a function that maps a review onto the vocabulary known by the trained model
    * Write a prediction function
    * Run the prediction

## 1) Prepare the dataset

### 1.1 Load the dataset

In [1]:
import numpy as np

# load the reviews and the label of each one.
with open('data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

### 1.2 Print some of the data

In [2]:
print(reviews[:1000])
print()
print(labels[:20])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

### 1.3 Split the reviews into words

In [3]:
from string import punctuation

# clean the reviews before splitting them into words
reviews = reviews.lower() # normalize everything to lowercase
all_text = ''.join([c for c in reviews if c not in punctuation])

# split into individual reviews, one per line
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)

# create a list of words
words = all_text.split()

# print the first 10 words
words[:10]

['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the']
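
As a quick illustration of the cleaning step above, the sketch below applies the same lowercase, strip-punctuation, and split logic to a made-up two-review string. The sample text and the helper name `clean_and_split` are assumptions for illustration and are not part of the dataset or the notebook.

```python
from string import punctuation

def clean_and_split(raw_text):
    """Lowercase, drop punctuation, and split into reviews and words."""
    text = raw_text.lower()
    text = ''.join(c for c in text if c not in punctuation)
    reviews = text.split('\n')          # one review per line
    words = ' '.join(reviews).split()   # flat list of every word
    return reviews, words

# Hypothetical input: two short reviews separated by a newline
sample = "Great movie!\nNot good, not bad."
sample_reviews, sample_words = clean_and_split(sample)
print(sample_reviews)
print(sample_words)
```

Keeping both the per-review list and the flat word list matters because the vocabulary is built from the words while the encoding is done review by review.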

### 1.4 Build an integer vocabulary that maps each word to a number

In [4]:
from collections import Counter

## Build a word-to-integer dictionary ordered by frequency (the most frequent word gets 1), e.g. and: 1234, yes: 9452
dicionario = Counter(words)
vocab = sorted(dicionario, key=dicionario.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

## map every review through the dictionary to get the review as a list of integers
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])
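
The vocabulary construction can be checked on a tiny word list. This is only a sketch with made-up words; the names `toy_words`, `toy_vocab_to_int`, and `encoded` are hypothetical. It shows that indices start at 1, which leaves 0 free to be used as the padding value later on.

```python
from collections import Counter

# Hypothetical corpus, already lowercased and stripped of punctuation
toy_words = ['the', 'movie', 'was', 'good', 'the', 'plot', 'was', 'the', 'best']

counts = Counter(toy_words)
# Most frequent word first
toy_vocab = sorted(counts, key=counts.get, reverse=True)
# Start numbering at 1 so that 0 never refers to a real word
toy_vocab_to_int = {word: ii for ii, word in enumerate(toy_vocab, 1)}

# Encode one review with the mapping
encoded = [toy_vocab_to_int[w] for w in 'the plot was good'.split()]
print(toy_vocab_to_int)
print(encoded)
```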

### 1.5 Print the vocabulary size

In [5]:
# Number of entries in the dictionary.
print('Unique words: ', len(vocab_to_int))
print()

# Example: the first review
print('Review: \n', reviews_split[:1])
print()
print('Tokenized review: \n', reviews_ints[:1])

Unique words:  74072

Review: 
 ['bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   ']

Tokenized review: 
 [[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, 5194, 60, 154, 9, 1, 4975, 5852, 475, 71

### 1.6 Convert the positive and negative labels to numeric values
The labels must be converted to numbers before the neural network can use them.

In [6]:
# 1=positive, 0=negative label conversion
labels_split = labels.split('\n')
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])
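
One detail of this encoding is worth seeing on a small example: anything that is not exactly `'positive'` becomes 0, including an empty string produced by a trailing newline. The sketch below uses a made-up label string; whether `labels.txt` actually ends with a newline is an assumption.

```python
# Hypothetical label text with a trailing newline
toy_labels = "positive\nnegative\npositive\n"

toy_split = toy_labels.split('\n')
toy_encoded = [1 if label == 'positive' else 0 for label in toy_split]

print(toy_split)
print(toy_encoded)
```

The extra trailing entry is harmless only as long as the matching empty review is removed together with its label, which is what step 1.8 does.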

### 1.7 Find the longest review and count the empty reviews

In [7]:
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 1
Maximum review length: 2514


### 1.8 Remove the empty reviews

In [8]:
print('Number of reviews before removing outliers: ', len(reviews_ints))

## drop the empty reviews together with their labels.

# get the indices of the non-empty reviews.
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]

# keep only those reviews and their labels.
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print('Number of reviews after removing outliers: ', len(reviews_ints))

Number of reviews before removing outliers:  25001
Number of reviews after removing outliers:  25000


### 1.9 Write a function that makes every review the same length
It is important to choose a target length for the reviews. This normalizes the data: reviews that are too long are truncated and reviews that are too short are padded, so that every review ends up with the same length.

In [9]:
## Left-pads each review with 0s (or truncates it) so that it is exactly seq_length long.
def pad_features(reviews_ints, seq_length):
    
    # getting the correct rows x cols shape
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)

    # for each review, right-align it in its row, keeping at most the first seq_length tokens
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]
    
    return features

In [10]:
seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)

## test statements - do not change - ##
assert len(features)==len(reviews_ints), "The number of reviews must be the same after padding."
assert len(features[0])==seq_length, "Each review must have length seq_length."

# print the first 10 values of the first 20 reviews
print(features[:20,:10])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [22382    42 46418    15   706 17139  3389    47    77    35]
 [ 4505   505    15     3  3342   162  8312  1652     6  4819]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [   54    10    14   116    60   798   552    71   364     5]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    1   330   578    34     3   162   748  2731     9   325]
 [    9    11 10171  5305  1946   689   444    22   280   673]
 [    0     0     0     0     0     0     0     0     0
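
To make the padding behavior concrete, here is a small sketch that repeats the `pad_features` logic on two made-up reviews, one shorter and one longer than the target length. The toy input and the name `padded` are assumptions for illustration.

```python
import numpy as np

def pad_features(reviews_ints, seq_length):
    # Start from an all-zero matrix, one row per review
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
    for i, row in enumerate(reviews_ints):
        # Right-align the review; keep only its first seq_length tokens
        features[i, -len(row):] = np.array(row)[:seq_length]
    return features

# Hypothetical reviews: 3 tokens and 7 tokens, padded/truncated to 5
toy = [[1, 2, 3], [1, 2, 3, 4, 5, 6, 7]]
padded = pad_features(toy, seq_length=5)
print(padded)
```

The short review gets zeros on the left, and the long one loses its last two tokens. Note that the function assumes no review is empty, which holds here because empty reviews were removed in step 1.8.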

## 2) Create the training, validation, and test sets
With the dataset formatted, we split it into 3 sets <b>(training, validation, and test)</b>

In [11]:
split_frac = 0.8

## Split the dataset into training, validation, and test sets

split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## Print the result: shape of each set
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		(2500, 200)
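
The same 80/10/10 split can be written as a small helper and checked on plain Python lists. This is a sketch only; the function name `three_way_split` and the variable names are hypothetical.

```python
def three_way_split(items, train_frac=0.8):
    """Split into train / validation / test, halving what is left after train."""
    split_idx = int(len(items) * train_frac)
    train_part, remaining = items[:split_idx], items[split_idx:]
    half = int(len(remaining) * 0.5)
    return train_part, remaining[:half], remaining[half:]

# Stand-in for the 25000 padded reviews
train_part, val_part, test_part = three_way_split(list(range(25000)))
print(len(train_part), len(val_part), len(test_part))
```

Because the slices are taken in order, the split is only representative if the positive and negative reviews are already interleaved in the file; shuffling before splitting would remove that assumption.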


### 2.1 Create a DataLoader from each set

We use [TensorDataset](https://pytorch.org/docs/stable/data.html#), which takes the data tensor and the corresponding label tensor as arguments; both must have the same size in the first dimension.
<br>The TensorDataset is then used to create the DataLoader.

In [12]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create the Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# number of reviews loaded per iteration; here, 50 reviews per batch.
batch_size = 50

# Shuffling changes the order of the data on every pass.
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)


# Grab one batch to display its contents
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([50, 200])
Sample input: 
 tensor([[   10,    65,    39,  ...,  2571,  6078,    15],
        [  147,    36,    26,  ...,     7,     7,     1],
        [   11,     6,     3,  ...,    87,     5, 24007],
        ...,
        [    0,     0,     0,  ...,    24,   288,   316],
        [  251,    36,    13,  ...,    21,   236,    28],
        [    4,     1,   107,  ...,     1,   764,   965]], dtype=torch.int32)

Sample label size:  torch.Size([50])
Sample label: 
 tensor([1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
        0, 1], dtype=torch.int32)


## 3) Build an RNN using [LSTM](https://pytorch.org/docs/stable/nn.html#lstm)

In [13]:
# Check whether a GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


In [14]:
import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model used to perform sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(0.3)
        
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        """
        Run a forward pass through the model.
        """
        batch_size = x.size(0)

        # embeddings and lstm_out
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # keep only the output of the last time step
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes the hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden
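
The reshaping at the end of `forward` is easy to misread, so here is a NumPy stand-in for it. It assumes, as in the class above, that the batch dimension comes first, so the flattened rows are ordered review by review. The numbers are made up and the names `flat`, `per_review`, and `last_step` are hypothetical.

```python
import numpy as np

batch_size, seq_len = 2, 4

# Pretend these are the sigmoid outputs for every time step, flattened to
# shape (batch_size * seq_len, 1) as they leave the linear layer.
flat = np.array([[0.1], [0.2], [0.3], [0.4],    # review 0, steps 0..3
                 [0.5], [0.6], [0.7], [0.8]])   # review 1, steps 0..3

# Same idea as sig_out.view(batch_size, -1): one row per review
per_review = flat.reshape(batch_size, -1)

# Same idea as sig_out[:, -1]: keep the last time step of each review
last_step = per_review[:, -1]
print(per_review.shape, last_step)
```

Only the final time step is kept because, with left-padded inputs, that is the point where the LSTM has read the whole review.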
        

### Instantiate the RNN model

In [15]:
# Instantiate the model with its hyperparameters
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1 # single output: probability that the review is positive
embedding_dim = 400 
hidden_dim = 256
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
print(net)

SentimentRNN(
  (embedding): Embedding(74073, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)


## 4) Train the model
Using [BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss) (**Binary Cross Entropy Loss**)
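
For intuition about the loss values, binary cross entropy can be computed by hand as the mean of `-[y*log(p) + (1-y)*log(1-p)]`. The sketch below is a plain-Python illustration of that formula with made-up probabilities; it is not how PyTorch implements `BCELoss`, and the function name `binary_cross_entropy` is hypothetical.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# Confident and correct predictions give a small loss
loss_good = binary_cross_entropy([0.9, 0.1], [1, 0])
# Confident and wrong predictions give a large loss
loss_bad = binary_cross_entropy([0.1, 0.9], [1, 0])
# Always answering 0.5 gives log(2), about 0.693
loss_chance = binary_cross_entropy([0.5, 0.5], [1, 0])

print(round(loss_good, 4), round(loss_bad, 4), round(loss_chance, 4))
```

The value for constant 0.5 predictions is a useful reference point: a loss near 0.69 means the model is doing no better than guessing.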

In [16]:
# Create the criterion and optimization
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
globalLoss = 10  # best validation loss so far; starts high so the first check always saves

In [17]:
# Training parameters:

epochs = 4
counter = 0
print_every = 100
clip=5 # gradient clipping

# use the GPU if available
if(train_on_gpu):
    net.cuda()

net.train()
# Train for the given number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs.long(), h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs.long(), val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))
            
            # Save the model whenever the mean validation loss improves on the best saved so far.
            val_loss_mean = np.mean(val_losses)
            if val_loss_mean <= globalLoss:
                print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(
                globalLoss,
                val_loss_mean))
                torch.save(net.state_dict(), 'model-saved.pth')
                globalLoss = val_loss_mean


Epoch: 1/4... Step: 100... Loss: 0.656093... Val Loss: 0.663846
Validation loss decreased (10.000000 --> 0.663846).  Saving model ...
Epoch: 1/4... Step: 200... Loss: 0.696967... Val Loss: 0.624745
Validation loss decreased (0.663846 --> 0.624745).  Saving model ...
Epoch: 1/4... Step: 300... Loss: 0.682565... Val Loss: 0.638374
Epoch: 1/4... Step: 400... Loss: 0.431185... Val Loss: 0.535197
Validation loss decreased (0.624745 --> 0.535197).  Saving model ...
Epoch: 2/4... Step: 500... Loss: 0.588817... Val Loss: 0.610350
Epoch: 2/4... Step: 600... Loss: 0.389675... Val Loss: 0.515616
Validation loss decreased (0.535197 --> 0.515616).  Saving model ...
Epoch: 2/4... Step: 700... Loss: 0.444179... Val Loss: 0.486162
Validation loss decreased (0.515616 --> 0.486162).  Saving model ...
Epoch: 2/4... Step: 800... Loss: 0.402562... Val Loss: 0.451269
Validation loss decreased (0.486162 --> 0.451269).  Saving model ...
Epoch: 3/4... Step: 900... Loss: 0.312389... Val Loss: 0.472883
Epoch: 3/
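
The checkpointing rule can be replayed on the validation losses printed in the log above (the log is cut off after step 900, so only those nine values are used). This is a sketch of the bookkeeping only; the names `logged`, `best_loss`, and `saved_steps` are hypothetical, and no model is actually saved.

```python
# Validation losses copied from the training log (steps 100 to 900)
logged = [(100, 0.663846), (200, 0.624745), (300, 0.638374),
          (400, 0.535197), (500, 0.610350), (600, 0.515616),
          (700, 0.486162), (800, 0.451269), (900, 0.472883)]

best_loss = 10.0        # same starting value as globalLoss
saved_steps = []
for step, val_loss in logged:
    if val_loss <= best_loss:
        saved_steps.append(step)   # the notebook calls torch.save at this point
        best_loss = val_loss

print(saved_steps, best_loss)
```

The steps collected here match the "Saving model ..." lines in the log, and the file on disk always holds the weights with the lowest validation loss seen so far.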

## 5) Validate the model

In this step the best saved model is reloaded, and the average loss and the accuracy are computed on the test set. **Ideally the average loss is low and the accuracy is high; since the code below prints the loss multiplied by 100, that means a loss under 50 and an accuracy of, for example, 80.**

In [18]:
## Reloads the best saved weights into the model
def _reload_module():
    net.load_state_dict(torch.load('model-saved.pth'))
    if(train_on_gpu):
        net.cuda()


# Reload the trained model.
_reload_module()

In [19]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs.long(), h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)*100))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)*100
print("Test accuracy: {:.3f}".format(test_acc))


Test loss: 47.569
Test accuracy: 78.480
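
The accuracy figure comes from rounding each predicted probability to 0 or 1 and counting matches with the true labels. A plain-Python sketch of that calculation on made-up values follows; the function name `accuracy_percent` and the toy data are hypothetical.

```python
def accuracy_percent(probs, targets):
    """Round each probability to a class and return the share of matches in percent."""
    preds = [round(p) for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(preds, targets))
    return correct / len(targets) * 100

# Hypothetical model outputs and true labels
toy_probs = [0.97, 0.02, 0.61, 0.40]
toy_targets = [1, 0, 0, 0]

acc = accuracy_percent(toy_probs, toy_targets)
print(acc)
```

Here the third prediction rounds to 1 while its label is 0, so three of the four predictions count as correct.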


## 6) Test the model
To test the model we run what is known as **inference on a test review**: we feed in a new review that has no label and check whether the model can infer what its label should be.

### 6.1 Write a function that maps a review onto the vocabulary known by the trained model

In [20]:
from string import punctuation

def tokenize_review(test_review):
    test_review = test_review.lower() # lowercase
    # remove punctuation
    test_text = ''.join([c for c in test_review if c not in punctuation])

    # split on whitespace
    test_words = test_text.split()

    # map each word to its integer in the vocabulary
    test_ints = []
    test_ints.append([vocab_to_int[word] for word in test_words])

    return test_ints
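
As written, `tokenize_review` raises a `KeyError` for any word that was not in the training vocabulary. One possible workaround, sketched below under the assumption that silently dropping unknown words is acceptable, is to filter them out before the lookup. The function name `tokenize_review_safe` and the toy vocabulary are hypothetical and not part of the notebook.

```python
from string import punctuation

def tokenize_review_safe(test_review, vocab_to_int):
    """Like tokenize_review, but skips words missing from the vocabulary."""
    text = test_review.lower()
    text = ''.join(c for c in text if c not in punctuation)
    # Keep only the words the model saw during training
    return [[vocab_to_int[w] for w in text.split() if w in vocab_to_int]]

# Hypothetical vocabulary with four known words
toy_vocab_to_int = {'the': 1, 'movie': 2, 'was': 3, 'good': 4}
tokens = tokenize_review_safe('The movie was SUPERCALIFRAGILISTIC good!', toy_vocab_to_int)
print(tokens)
```

Dropping words changes the input the model sees, so a review made up mostly of unknown words would be judged on very little evidence.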


### 6.2 Write a prediction function

In [21]:
def predict(net, test_review, sequence_length=200):
    
    net.eval()
    
    # tokenize review
    test_ints = tokenize_review(test_review)
    
    # pad tokenized sequence
    seq_length=sequence_length
    features = pad_features(test_ints, seq_length)
    
    
    # convert to a Tensor so it can be passed to the model.
    feature_tensor = torch.from_numpy(features)
    
    batch_size = feature_tensor.size(0)
    
    # initialize hidden state
    h = net.init_hidden(batch_size)
    
    if(train_on_gpu):
        feature_tensor = feature_tensor.cuda()
    
    # get the output from the model
    output, h = net(feature_tensor.long(), h)
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze()) 
    # printing output value, before rounding
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
    
    # print custom response
    if(pred.item()==1):
        print("Positive review detected!")
    else:
        print("Negative review detected.")
        

### 6.3 Run the prediction

In [22]:
# call function
seq_length=200 # good to use the length that was trained on

# negative test review
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'

# positive test review
test_review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'


predict(net, test_review_pos, seq_length)
predict(net, test_review_neg, seq_length)

Prediction value, pre-rounding: 0.975107
Positive review detected!
Prediction value, pre-rounding: 0.023059
Negative review detected.
