# Notebook de referência 

Usar as secções como guia.

Nome: Matheus Gustavo Alves Sasso Ra: 158257


Neste colab iremos treinar um modelo para fazer análise de sentimento usando o dataset IMDB.

https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09

Habilitamos o linting (avisa sobre erros de formatação no código)

In [0]:
# !pip install --quiet flake8-nb pycodestyle_magic
# %load_ext pycodestyle_magic
# %flake8_on

Installação Pytorch Lightning

In [0]:
!pip3 install pytorch-lightning --upgrade --quiet

Descobrimos se há uma GPU disponível

In [0]:
import numpy as np 
import torch
from multiprocessing import cpu_count
import math
from tqdm import tqdm


if torch.cuda.is_available(): 
   dev = "cuda:0"
else: 
   dev = "cpu" 
print(dev)
device = torch.device(dev)

cuda:0


## Preparando Dados

Primeiro, fazemos download do dataset:

In [0]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz 
!tar -xzf aclImdb.tgz

File ‘aclImdb.tgz’ already there; not retrieving.



## Carregando o dataset

Criaremos uma divisão de treino (80%) e dev (20%) artificialmente.

Nota: Evitar de olhar ao máximo o dataset de teste para não ficar enviseado no que será testado. Em aplicações reais, o dataset de teste só estará disponível no futuro, ou seja, é quando o usuário começa a testar o seu produto.

In [0]:
import os
import random


def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path)) as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

# Embaralhamos o treino para depois fazermos a divisão treino/dev.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

n_train = int(0.8 * len(x_train))

x_dev = x_train[n_train:]
y_dev = y_train[n_train:]
x_train = x_train[:n_train]
y_train = y_train[:n_train]

print(len(x_train), 'amostras de treino.')
print(len(x_dev), 'amostras de desenvolvimento.')
print(len(x_test), 'amostras de teste.')

print('3 primeiras amostras treino:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('3 últimas amostras treino:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('3 primeiras amostras dev:')
for x, y in zip(x_dev[:3], y_test[:3]):
    print(y, x[:100])

print('3 últimas amostras dev:')
for x, y in zip(x_dev[-3:], y_dev[-3:]):
    print(y, x[:100])

20000 amostras de treino.
5000 amostras de desenvolvimento.
25000 amostras de teste.
3 primeiras amostras treino:
False I was fooled to rent this movie by its impressive cover. Alas. It is easily one of the worst movies 
True <br /><br />"Burning Paradise" is a combination of neo-Shaw Brothers action and Ringo Lam's urban cy
True TO all of yall who think 1.This was a boring telecast 2.Halle berry and denzel Washington did not de
3 últimas amostras treino:
False I was looking on Imdbs bottom 100 because i thought id never seen anything as bad as plan 9 from out
True Did Sandra (yes, she must have) know we would still be here for her some nine years later?<br /><br 
False *** THIS CONTAINS MANY, MANY SPOILERS, NOT THAT IT MATTERS, SINCE EVERYTHING IS SO PATENTLY OBVIOUS 
3 primeiras amostras dev:
True This movie is unworthy of the Omen title. It is so bad that it has actually damaged the classic natu
True This isn't another searing look at the Holocaust but rather an intimate story about

## Download do word embedding

Lista dos modelos disponíveis: https://github.com/RaRe-Technologies/gensim-data#models

In [0]:
import gensim.downloader as api

word2vec_model = api.load("glove-wiki-gigaword-300")
print('word2vec shape:', word2vec_model.vectors.shape)

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


word2vec shape: (400000, 300)


Opção alternativa

In [0]:
# !wget -nc http://nlp.stanford.edu/data/glove.6B.zip
# !unzip -o glove.6B.zip -d glove_dir

## Criando Vocabulário a partir do word embedding

In [0]:
import itertools

vocab = {word: index for index, word in enumerate(word2vec_model.index2word)}

# Adicionando PAD token
vocab['[PAD]'] = len(vocab)
pad_vector = np.zeros((1, word2vec_model.vectors.shape[1]))
embeddings = np.concatenate((word2vec_model.vectors, pad_vector), axis=0)

print('Número de palavras no vocabulário:', len(vocab))
print(f'20 tokens mais frequentes: {list(itertools.islice(vocab.keys(), 20))}')

Número de palavras no vocabulário: 400001
20 tokens mais frequentes: ['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s", 'for', '-', 'that', 'on', 'is', 'was', 'said', 'with', 'he', 'as']


Opção alternativa

In [0]:
# from torchtext.vocab import GloVe
# glove_dim = 300
# glove = GloVe(name='6B', dim=glove_dim, cache='./glove_dir')
# vectors = glove.vectors
# vocab = glove.stoi
# pad_vector = np.zeros((1, vectors.shape[1]))
# vocab['[PAD]'] = len(vocab)
# embeddings = np.concatenate((vectors, pad_vector), axis=0)

# print(len(vocab))
# print('Primeiras 20 palavras e seus índices:', list(vocab.items())[:20])

## Tokenizando o dataset e convertendo para índices (preferencialmente, usar o DataLoader)

Eu gostei dessa maneira de fazer porque o L fica variável 

In [0]:
import collections
import itertools
from torch.utils.data import Dataset
from typing import Dict
from typing import List

#texts -> X, labels -> y
class MyDataset(Dataset):
    def __init__(self, texts: List[str], labels: List[int],
                 vocab: Dict[str, int], pad_token_id: int,
                 max_seq_length: int = 64):

        self.max_seq_length = max_seq_length
        self.pad_token_id = pad_token_id
        self.vocab = vocab
        self.texts = texts
        self.labels =  torch.tensor(labels).type(torch.long)
        
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        tokens = self.tokenize(text)
        token_ids = self.tokens_to_ids(tokens)
        token_ids = self.truncate_and_pad(token_ids)
        # mask = (token_ids != self.pad_token_id)# mask =  x != self.pad_id
        # return torch.tensor(token_ids).type(torch.long), torch.tensor(mask).type(torch.long), self.labels[idx]
        return torch.tensor(token_ids).type(torch.long), self.labels[idx]

    def tokenize(self, text: str) -> List[str]:
        return text.lower().split()

    def tokens_to_ids(self, tokens: List[str]) -> List[int]:
        return [self.vocab[token] for token in tokens if token in vocab]

    def truncate_and_pad(self, tokens: List[str]):
        tokens = tokens[:self.max_seq_length]
        tokens += [self.pad_token_id] * max(0, self.max_seq_length - len(tokens))
        return tokens

## Inicializando e testando o DataLoader

In [0]:
from torch.utils.data import DataLoader

texts = ['we like pizza', 'he does not like apples']
labels = [0, 1]
mydataset_debug = MyDataset(
    texts=texts,
    labels=labels,
    vocab=vocab,
    pad_token_id=vocab['[PAD]'],
    max_seq_length=10)

dataloader_debug = DataLoader(mydataset_debug, batch_size=10, shuffle=True,num_workers=0)

batch_token_ids ,batch_labels = next(iter(dataloader_debug))
print('batch_token_ids', batch_token_ids)
print('batch_labels', batch_labels)
# print('batch_mask', batch_mask)
print('batch_token_ids.shape:', batch_token_ids.shape)
# print('batch_mask.shape', batch_mask.shape)
print('batch_labels.shape:', batch_labels.shape)

batch_token_ids tensor([[    53,    117,   9388, 400000, 400000, 400000, 400000, 400000, 400000,
         400000],
        [    18,    260,     36,    117,  13134, 400000, 400000, 400000, 400000,
         400000]])
batch_labels tensor([0, 1])
batch_token_ids.shape: torch.Size([2, 10])
batch_labels.shape: torch.Size([2])


## Definindo a Rede Neural
* Para mim a melhor estrutura foi a da **Gabriela Surita**, porque cada parte da rede faz uma coisa e depois engloba a rede para a funcionalidade especídica. Eu  fiz parecido, mas o dela ficou melhor

* Usei a estratégia do **Diedre** para o overfit one batch. Não sei se é a melhor,mas não consegui fazer com o overfit_pct=.01 do PL

* Para criar a máscara gostei da maneira simples suferida pelo **Paulo Finardi**

* Irei utilizar o Pytorch Lighting recomendado pelo **Diedre** e as funções recomendadas pelo **Mascos Piau**

* Ficarei com minha a **minha** Positional Encoding e Multihead Attention

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import pytorch_lightning as pl

Matriz de atenção

In [0]:
class AttentionMatrix(pl.LightningModule):
  def forward(self,K,Q,V):
    D = K.shape[1]
    scores = torch.matmul(Q,K.transpose(-1,-2)) # N,L,L
    mask = torch.ones_like(scores)*(-1e10)
    score_com_mascara = torch.where(scores!=0,scores,mask)
    socre_normalizado = score_com_mascara/math.sqrt(D) #
    probs = F.softmax(socre_normalizado,dim=2)
    E = torch.matmul(probs, V)
    return E

In [0]:
emb = nn.Embedding.from_pretrained(torch.Tensor(embeddings))
emb.requires_grad = False # V, D
att  = AttentionMatrix()
tensor_embedings = att(emb(batch_token_ids),emb(batch_token_ids),emb(batch_token_ids))
tensor_embedings.shape #B, L D

torch.Size([2, 10, 300])

Positional Encoding

In [0]:
class PositionalEncodingSimples(pl.LightningModule):
    "Implement the PE function."
    def __init__(self, L, D):
        super(PositionalEncodingSimples, self).__init__()
        self.L = L
        self.pos_embedding  = nn.Embedding(L, D)
        
    def forward(self, C):# C=> D,L
        P = torch.arange(end=self.L, device=C.device) #D
        P = self.pos_embedding(P) #L, D
        #C.shape => B,L,D
        X = P + C #broadcasting de P em C na dimaensão N ===> P(L,D) e C(N,L,D)
        # X=> D,L
        return X

In [0]:
Bx,Lx,Dx = 2,10,300
pos  = PositionalEncodingSimples(L=Lx,D=Dx)
X = pos(torch.rand(Bx,Lx,Dx))
X.shape

torch.Size([2, 10, 300])

Multi Headed Attention

In [0]:
class MultiHeadedAttention(pl.LightningModule):
    def __init__(self,L,D,H):
      "Take in model size and number of heads."
      super(MultiHeadedAttention, self).__init__()
      
      self.d_k = D//H
      self.H = H
      self.L = L
      self.D = D
      
      #classe de atenção
      self.attention =  AttentionMatrix()

      #projeções lineares
      self.Wq = nn.Linear(self.D, self.D, bias=False)
      self.Wk = nn.Linear(self.D, self.D, bias=False)
      self.Wv = nn.Linear(self.D, self.D, bias=False)
      self.Wo = nn.Linear(self.D, self.D, bias=False)
        
    def forward(self,x):
      # self.Wq(x).shape = self.Wk(x).shape = self.Wv(x).shape => N,L,D 
      q = self.Wq(x).view(-1, self.L, self.H,  self.d_k) # N, L, H , D/H
      k = self.Wk(x).view(-1, self.L, self.H,  self.d_k) # N, L, H , D/H
      v = self.Wv(x).view(-1, self.L, self.H,  self.d_k) # N, L, H , D/H

      # Transpor para: N, H, L, D/H
      q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
      new_x = self.attention(q, k, v)    # new_x.shape = N, H, L, D/H
      new_x = new_x.transpose(-3, -2).contiguous()  # new_x.shape = N, L, H, D/H
      new_x = new_x.view(-1, self.L, self.D) #N, L, D => Surepessão de H nas multiplicações matriciais

      return self.Wo(new_x)

In [0]:
multi_head_attention  = MultiHeadedAttention(L=10,D=300,H=6)
emb = nn.Embedding.from_pretrained(torch.Tensor(embeddings))
emb.requires_grad = False #V,D
E = emb(batch_token_ids)
Wo = multi_head_attention(E)
Wo.shape

torch.Size([2, 10, 300])

Encoder

In [0]:
# https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention
class Encoder(pl.LightningModule):
  def __init__(self,H,L,D,dropout):#H = nheads
      super(Encoder, self).__init__()

      #layers para add & norm
      self.layer_norm_1 = nn.LayerNorm(D)
      self.layer_norm_2 = nn.LayerNorm(D)

      #rede feedforward
      self.feedforward = nn.Sequential(
        nn.Linear(D,D),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(D,D)
       )
      

  def forward(self, x,emb,pos_emb,multi_head_attention,pad_id):
    
    # # Mask Calc
    mask =  x != pad_id

    #Not Postitional Embedding
    C_emb = emb(x) # B,L,D
    
    #Postitional Embedding
    X_emb = r1 =  pos_emb(C_emb) # B,L,D

    #Multihead_Attention
    E =  multi_head_attention(X_emb) # B,L,D
    
    # Add & Norm
    E = r2 = self.layer_norm_1(E + r1)# B,L,D

    #feed forward
    ff = self.feedforward(E)# B,L,D

    # Add & Norm
    ff = self.layer_norm_2(ff + r2)# B,L,D

    return ff,mask

Definicao dos Hyperparâmetros


In [0]:
from argparse import ArgumentParser
#Hyperparâmetros Fornecidos pelo professor

# Para usar depois para tunar
# parser = HyperOptArgumentParser(strategy='random_search')
# parser.add_argument('--learning_rate', default=0.002, type=float, help='the learning rate')
# # let's enable optimizing over the number of layers in the network
#parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])

parser = ArgumentParser()

# add all the available options to the trainer
parser = pl.Trainer.add_argparse_args(parser)

# parametrize the network
parser.add_argument('--n_heads', type=int, default=6)
parser.add_argument('--embedding_dim', type=int, default=300)
parser.add_argument('--max_lenght', type=int, default=200)
parser.add_argument('--batch_size', type=int, default=128)
parser.add_argument('--hidden_dim', type=int, default=300)
parser.add_argument('--feed_forward', type=int, default=300)
parser.add_argument('--learning_rate', type=float, default=300)
parser.add_argument('--dropout', type=float, default=0.1)


hparams = parser.parse_args(["--n_heads","6","--embedding_dim","300",
                             "--max_lenght","200","--batch_size","128",
                             "--hidden_dim","300","--feed_forward","300",
                             '--dropout',"0.1"])

Englobamento da rede


In [0]:
from torch.nn import functional as F
from torch.utils.data import DataLoader

class SentimentLightning(pl.LightningModule):
  def __init__(self,
               hparams,
               all_data,
               vocab,
               embeddings=torch.Tensor(embeddings),
               criterion = torch.nn.CrossEntropyLoss()):
      super(SentimentLightning, self).__init__()

      #---------- Hyperparâmetros
      self.hparams = hparams

      #----------Critério de Loss
      self.loss_criterion = criterion

      #---------- Carregamento datasets
      self.pad_idx = vocab['[PAD]'] 
      self.train_dataset = MyDataset(all_data[0], all_data[1], max_seq_length=hparams.max_lenght,vocab=vocab,pad_token_id=self.pad_idx)
      self.valid_dataset = MyDataset(all_data[2], all_data[3], max_seq_length=hparams.max_lenght,vocab=vocab,pad_token_id=self.pad_idx)
      if len(all_data) == 6:
        self.test_dataset = MyDataset(all_data[4], all_data[5], max_seq_length=hparams.max_lenght,vocab=vocab,pad_token_id=self.pad_idx)


      #---------- embeddings
      weight = embeddings #pesos fixos pré treinados
      self.emb = nn.Embedding.from_pretrained(weight)
      self.emb.requires_grad = False 


      #---------- classe positional embedding
      self.positional_embedding =PositionalEncodingSimples(L=hparams.max_lenght,
                                                           D=hparams.embedding_dim)
      #---------- multihead
      self.multi_head_attention = MultiHeadedAttention(H=hparams.n_heads,
                                                       L=hparams.max_lenght,
                                                       D=hparams.embedding_dim)
      #---------- encoder
      self.encoder = Encoder(H=hparams.n_heads,
                             L=hparams.max_lenght,
                             D=hparams.embedding_dim,
                             dropout=hparams.dropout)


      #---------- Englobamento da rede
      self.net = nn.Sequential(
       nn.Linear(hparams.embedding_dim,hparams.hidden_dim),
       nn.ReLU(),
       nn.Dropout(hparams.dropout),
       nn.Linear(hparams.hidden_dim,2)
      )

  #---------- Função forward
  def forward(self, x):
      ff,mask = self.encoder(x,self.emb,
                        self.positional_embedding,
                        self.multi_head_attention,
                        self.pad_idx)
      
      # Embeddings Mean
      ff_sum = torch.sum(ff, dim=1)#D,B
      mask = torch.sum(mask, dim=1)# Antes: B,L ; Depois:B
      ff_sum_transposto = torch.t(ff_sum)#D,B
      ff_mean = torch.div(ff_sum_transposto, mask)#D,B
      ff_mean_transposto = torch.t(ff_mean)#B,D

      #logits
      logits = self.net(ff_mean_transposto)
      return logits

  #---------- Funções de auxílio
  # def cross_entropy_loss(self, logits, labels):
  #     return F.nll_loss(logits, labels)

  #---------- Funções do Pytorch Lightning

  def training_step(self, train_batch, batch_idx):
      x, y = train_batch
      logits = self.forward(x)
      loss = self.loss_criterion(logits, y)

      logs = {'train_loss': loss}
      return {'loss': loss, 'log': logs}

  def validation_step(self, val_batch, batch_idx):
      x, y = val_batch
      logits = self.forward(x)
      loss = self.loss_criterion(logits, y)
      # import pdb;pdb.set_trace()
      predict = logits.argmax(dim=-1)
      correct = (predict == y).sum().float()

      return {'val_loss': loss,'val_acc':correct}


  def validation_epoch_end(self, outputs):
      # called at the end of the validation epoch
      # outputs is an array with what you returned in validation_step for each batch
      # outputs = [{'loss': batch_0_loss}, {'loss': batch_1_loss}, ..., {'loss': batch_n_loss}] 
      avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
      # avg_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
      acc = torch.stack([x['val_acc'] for x in outputs]).sum().type(torch.float)
      avg_acc = acc/len(self.valid_dataset)
      tensorboard_logs = {'val_loss': avg_loss,'val_acc': avg_acc}
      return {'avg_val_loss': avg_loss,'avg_val_acc': avg_acc, 'log': tensorboard_logs}


  def test_step(self, test_batch, batch_idx):
      x, y = test_batch
      logits = self.forward(x)
      loss = self.loss_criterion(logits, y)
      # import pdb;pdb.set_trace()
      predict = logits.argmax(dim=-1)
      correct = (predict == y).sum().float()

      return {'test_loss': loss,'test_acc':correct}


  def test_epoch_end(self, outputs):
      avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
      # avg_acc = torch.stack([x['test_acc'] for x in outputs]).mean()
      acc = torch.stack([x['test_acc'] for x in outputs]).sum().type(torch.float)
      avg_acc = acc/len(self.test_dataset)
      tensorboard_logs = {'test_loss': avg_loss,'test_acc': avg_acc}
      return {'avg_test_loss': avg_loss,'avg_test_acc': avg_acc, 'log': tensorboard_logs}



  def configure_optimizers(self):
      optimizer = torch.optim.Adam(self.parameters())#lr=0.0001
      return optimizer

  def train_dataloader(self):
      return DataLoader(self.train_dataset, batch_size=self.hparams.batch_size, shuffle=True)
  
  def val_dataloader(self):
      return DataLoader(self.valid_dataset, batch_size=self.hparams.batch_size, shuffle=False)

   
  def test_dataloader(self):
      return DataLoader(self.test_dataset, batch_size=self.hparams.batch_size, shuffle=False)


## Número de parâmetros do modelo

In [0]:
# all_data = [x_train,y_train,x_dev,y_dev]
all_data = [x_train,y_train,x_dev,y_dev,x_test, y_test]

In [0]:
model = SentimentLightning(hparams,all_data,vocab)

In [0]:
sum([torch.tensor(x.size()).prod() for x in model.parameters() if x.requires_grad]) # trainable parameters

tensor(692702)

## Logger

In [0]:
#Estou usando a classe de Logger da Gabriela Surita
#No tutorial as func estavam em branco, ao ver o dela clareou pra mim
from pytorch_lightning.loggers import LightningLoggerBase, rank_zero_only, TensorBoardLogger


class NotebookLogger(LightningLoggerBase):
    """
    Defines a custom logger that just prints to the notebook so we can keep some
    PDF friendly logs.

    Ref: https://pytorch-lightning.readthedocs.io/en/latest/loggers.html#custom-logger
    """
    experiment = "default"
    name = "notebook"
    version = 1

    def __init__(self, metric="val_acc", log_freq=1):
        self.count = 0
        self.log_freq = log_freq
        self.metric = metric

    @rank_zero_only
    def log_hyperparams(self, params):
        pass

    @rank_zero_only
    def log_metrics(self, metrics, step):
        if self.metric in metrics:
            if self.count % self.log_freq == 0:
                print("Event:", self.count, "Stats:", metrics)

            self.count += 1

## Testando o modelo com um batch

In [0]:
trainer= pl.Trainer(gpus=1,fast_dev_run=True)
trainer.fit(model)

INFO:lightning:Running in fast_dev_run mode: will run a full train, val and test loop using a single batch
INFO:lightning:GPU available: True, used: True
INFO:lightning:VISIBLE GPUS: 0
INFO:lightning:
   | Name                               | Type                      | Params
-----------------------------------------------------------------------------
0  | loss_criterion                     | CrossEntropyLoss          | 0     
1  | emb                                | Embedding                 | 120 M 
2  | positional_embedding               | PositionalEncodingSimples | 60 K  
3  | positional_embedding.pos_embedding | Embedding                 | 60 K  
4  | multi_head_attention               | MultiHeadedAttention      | 360 K 
5  | multi_head_attention.attention     | AttentionMatrix           | 0     
6  | multi_head_attention.Wq            | Linear                    | 90 K  
7  | multi_head_attention.Wk            | Linear                    | 90 K  
8  | multi_head_attention.Wv

HBox(children=(FloatProgress(value=1.0, bar_style='info', layout=Layout(flex='2'), max=1.0), HTML(value='')), …



HBox(children=(FloatProgress(value=0.0, description='Validating', layout=Layout(flex='2'), max=1.0, style=Prog…




1

## Overfit em um batch

Antes de treinar o modelo no dataset todo, faremos overfit do modelo em um único minibatch de treino para verificar se loss vai para próximo de 0. Isso serve para depurar se a implementação do modelo está correta.

Podemos também medir se a acurácia neste minibatch chega perto de 100%. Isso serve para depurar se nossa função que mede a acurácia está correta.

Nota: se treinarmos por muitas épocas (ex: 500) é possivel que a loss vá para zero mesmo com bugs na implementação. O ideal é que a loss chege próxima a zero antes de 100 épocas.

Specify one fixed Batch

In [0]:
mydataset_one_batch = MyDataset(
    texts=all_data[0],
    labels=all_data[1],
    vocab=vocab,
    pad_token_id=vocab['[PAD]'],
    max_seq_length=200)

dataloader_one_batch = DataLoader(mydataset_one_batch, batch_size=128, shuffle=False)



In [0]:
class SentimentLightningOneBatchOverfit(pl.LightningModule):
  def __init__(self,
               hparams,
               dataloader_one_batch,
               vocab,
               embeddings=torch.Tensor(embeddings),
               criterion = torch.nn.CrossEntropyLoss()):
      super(SentimentLightningOneBatchOverfit, self).__init__()

      #---------- Hyperparâmetros
      self.hparams = hparams

      #----------Critério de Loss
      self.loss_criterion = criterion

      #---------- embeddings
      weight = embeddings #pesos fixos pré treinados
      self.emb = nn.Embedding.from_pretrained(weight)
      self.emb.requires_grad = False 

      #---------- dataloader
      self.dataloader_one_batch = dataloader_one_batch
      self.pad_idx = vocab['[PAD]'] 


      #---------- classe positional embedding
      self.positional_embedding =PositionalEncodingSimples(L=hparams.max_lenght,
                                                           D=hparams.embedding_dim)
      #---------- multihead
      self.multi_head_attention = MultiHeadedAttention(H=hparams.n_heads,
                                                       L=hparams.max_lenght,
                                                       D=hparams.embedding_dim)
      #---------- encoder
      self.encoder = Encoder(H=hparams.n_heads,
                             L=hparams.max_lenght,
                             D=hparams.embedding_dim,
                             dropout=hparams.dropout)


      #---------- Englobamento da rede
      self.net = nn.Sequential(
       nn.Linear(hparams.embedding_dim,hparams.hidden_dim),
       nn.ReLU(),
       nn.Dropout(hparams.dropout),
       nn.Linear(hparams.hidden_dim,2)
      )

  #---------- Função forward
  def forward(self, x):
      ff,mask = self.encoder(x,self.emb,
                        self.positional_embedding,
                        self.multi_head_attention,
                        self.pad_idx)
      
      # Embeddings Mean
      ff_sum = torch.sum(ff, dim=1)#D,B
      mask = torch.sum(mask, dim=1)# Antes: B,L ; Depois:B
      ff_sum_transposto = torch.t(ff_sum)#D,B
      ff_mean = torch.div(ff_sum_transposto, mask)#D,B
      ff_mean_transposto = torch.t(ff_mean)#B,D

      #logits
      logits = self.net(ff_mean_transposto)
      return logits

  #---------- Funções de auxílio
  # def cross_entropy_loss(self, logits, labels):
  #     return F.nll_loss(logits, labels)

  #---------- Funções do Pytorch Lightning

  def training_step(self, train_batch, batch_idx):
      x, y = train_batch
      logits = self.forward(x)
      loss = self.loss_criterion(logits, y)

      logs = {'train_loss': loss}
      return {'loss': loss, 'log': logs}

  def configure_optimizers(self):
      optimizer = torch.optim.Adam(self.parameters())#lr=0.0001
      return optimizer

  def train_dataloader(self):
      return self.dataloader_one_batch


In [0]:
overfit_model = SentimentLightningOneBatchOverfit(hparams,dataloader_one_batch,vocab)
overfit_trainer = pl.Trainer(gpus=1, max_epochs=5, fast_dev_run=False)
overfit_trainer.fit(overfit_model)

INFO:lightning:GPU available: True, used: True
INFO:lightning:VISIBLE GPUS: 0
INFO:lightning:
   | Name                               | Type                      | Params
-----------------------------------------------------------------------------
0  | loss_criterion                     | CrossEntropyLoss          | 0     
1  | emb                                | Embedding                 | 120 M 
2  | positional_embedding               | PositionalEncodingSimples | 60 K  
3  | positional_embedding.pos_embedding | Embedding                 | 60 K  
4  | multi_head_attention               | MultiHeadedAttention      | 360 K 
5  | multi_head_attention.attention     | AttentionMatrix           | 0     
6  | multi_head_attention.Wq            | Linear                    | 90 K  
7  | multi_head_attention.Wk            | Linear                    | 90 K  
8  | multi_head_attention.Wv            | Linear                    | 90 K  
9  | multi_head_attention.Wo            | Linear          

HBox(children=(FloatProgress(value=1.0, bar_style='info', layout=Layout(flex='2'), max=1.0), HTML(value='')), …






1

## Treinamento e Validação no dataset todo

In [0]:
import time

In [0]:
maxe = 20
start = time.time()
trainer = pl.Trainer( gpus = 1,
                     early_stop_callback=True,
                     max_epochs = maxe,
                     logger=[NotebookLogger(log_freq=20, metric="train_loss"),NotebookLogger(log_freq=20, metric="val_acc"), TensorBoardLogger("./lightning_logs")])
trainer.fit(model)
end = time.time() - start
print("Tempo treinamento por época :",(start-end)/maxe)

INFO:lightning:GPU available: True, used: True
INFO:lightning:VISIBLE GPUS: 0
INFO:lightning:
   | Name                               | Type                      | Params
-----------------------------------------------------------------------------
0  | loss_criterion                     | CrossEntropyLoss          | 0     
1  | emb                                | Embedding                 | 120 M 
2  | positional_embedding               | PositionalEncodingSimples | 60 K  
3  | positional_embedding.pos_embedding | Embedding                 | 60 K  
4  | multi_head_attention               | MultiHeadedAttention      | 360 K 
5  | multi_head_attention.attention     | AttentionMatrix           | 0     
6  | multi_head_attention.Wq            | Linear                    | 90 K  
7  | multi_head_attention.Wk            | Linear                    | 90 K  
8  | multi_head_attention.Wv            | Linear                    | 90 K  
9  | multi_head_attention.Wo            | Linear          

HBox(children=(FloatProgress(value=0.0, description='Validation sanity check', layout=Layout(flex='2'), max=5.…

HBox(children=(FloatProgress(value=1.0, bar_style='info', layout=Layout(flex='2'), max=1.0), HTML(value='')), …



Event: 0 Stats: {'train_loss': 0.3382672965526581}


HBox(children=(FloatProgress(value=0.0, description='Validating', layout=Layout(flex='2'), max=40.0, style=Pro…

Event: 0 Stats: {'val_loss': 0.39491698145866394, 'val_acc': 0.8185999989509583}
Event: 20 Stats: {'train_loss': 0.48587766289711}


HBox(children=(FloatProgress(value=0.0, description='Validating', layout=Layout(flex='2'), max=40.0, style=Pro…

Event: 40 Stats: {'train_loss': 0.3586779534816742}


HBox(children=(FloatProgress(value=0.0, description='Validating', layout=Layout(flex='2'), max=40.0, style=Pro…

Event: 60 Stats: {'train_loss': 0.4346606731414795}


HBox(children=(FloatProgress(value=0.0, description='Validating', layout=Layout(flex='2'), max=40.0, style=Pro…

HBox(children=(FloatProgress(value=0.0, description='Validating', layout=Layout(flex='2'), max=40.0, style=Pro…

Event: 80 Stats: {'train_loss': 0.33533257246017456}


HBox(children=(FloatProgress(value=0.0, description='Validating', layout=Layout(flex='2'), max=40.0, style=Pro…

INFO:lightning:Epoch 00006: early stopping



Tempo treinamento por época : 79379628.25244467


## Após treinado, avaliamos o modelo no dataset de test.

É importante que essa avaliação seja feita poucas vezes para evitar o overfit no dataset de teste.

In [0]:
start = time.time()
trainer.test(model)
end = time.time() - start
print("Tempo teste:",(start-end))



HBox(children=(FloatProgress(value=0.0, description='Testing', layout=Layout(flex='2'), max=196.0, style=Progr…

--------------------------------------------------------------------------------
TEST RESULTS
{'avg_test_acc': 0.7949199676513672,
 'avg_test_loss': 0.44269970059394836,
 'test_acc': 0.7949199676513672,
 'test_loss': 0.44269970059394836}
--------------------------------------------------------------------------------

Tempo teste: 1587592708.271782


## Tensorboard

In [0]:
%load_ext tensorboard
%tensorboard --logdir lightning_logs/

# Conclusão

* Gostei do Pytorch Lightning como framework. É muito vantajoso para não ter que se preocupar com os .device, mas dificulta para fazer o debug.

* Aparentemente os resultadados **não estão corretos** com a utilização do Pytorch Lightning. Não deu tempo de concertar até o deadline de entrega, vou concertar futuramente.


# Fim do Notebook