<a href="https://colab.research.google.com/github/rubensmau/nlp/blob/master/Fasthugs_fastai_LM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FastHugs  --- USING roberta-base model

Modified from https://github.com/morganmcg1/fasthugs.  
My goal is to use a trained RoBERTa language model for masked-word / next-word prediction in a fastai environment.
I want to import a model trained on a large corpus, use it as is in fastai2 as a language model, and make predictions using `learn.predict`.
As a last step, I would like to modify the model head to perform classification or another task. Since I am learning fastai, I assume it will be easier to have all the customisation done in the fastai environment.

This notebook was run from Google Colab.

This notebook gives a full run through to fine-tune a text classification model with **HuggingFace 🤗 transformers** and the new **fastai-v2** library.

## Things You Might Like (❤️ ?)
**FastHugsTokenizer:** A tokenizer wrapper that can be used with fastai-v2's tokenizer.

**FastHugsModel:** A model wrapper over the HF models, more or less the same as the wrappers from the HF fastai-v1 articles mentioned below.

**Vocab:** A function to extract the vocab depending on the pre-trained transformer (HF hasn't standardised this process 😢).

**Padding:** Settings for the padding token index and for whether the transformer prefers left or right padding.

**Vocab for Albert-base-v2**: .json for Albert-base-v2's vocab, otherwise this has to be extracted from a SentencePiece model file, which isn't fun

**Model Splitters:** Functions to split the classification head from the model backbone in line with fastai-v2's new definition of `Learner`

## Housekeeping
### Pretrained Transformers only for now 😐
Initially, this notebook will only deal with fine-tuning HuggingFace's pretrained models. It covers BERT, DistilBERT, RoBERTa and ALBERT pretrained classification models only. These are the core transformer model architectures to which HuggingFace has added a classification head. HuggingFace also provides other versions of these architectures, such as the bare core model and language-model variants.

If you'd like to try training a model from scratch, HuggingFace recently published an article on [How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train). It's well worth reading to see how their `tokenizers` library can be used independently of their pretrained transformer models.

### Read these first 👇
This notebook borrows heavily from [this notebook](https://www.kaggle.com/melissarajaram/roberta-fastai-huggingface-transformers), which in turn is based on this [tutorial](https://www.kaggle.com/maroberti/fastai-with-transformers-bert-roberta) and the accompanying [article](https://towardsdatascience.com/fastai-with-transformers-bert-roberta-xlnet-xlm-distilbert-4f41ee18ecb2). Huge thanks to Melissa Rajaram and Maximilien Roberti for these great resources. If you're not familiar with the HuggingFace library, please give them a read first, as they are quite comprehensive.

### fastai-v2  ✌️2️⃣
[This paper](https://www.fast.ai/2020/02/13/fastai-A-Layered-API-for-Deep-Learning/) introduces the v2 version of the fastai library, and you can follow and contribute to v2's progress [on the forums](https://forums.fast.ai/). This notebook uses the small IMDB dataset and is based on the [fastai-v2 ULMFiT tutorial](http://dev.fast.ai/tutorial.ulmfit). Huge thanks to Jeremy, Sylvain, Rachel and the fastai community for making this library what it is. I'm super excited about the additional flexibility v2 brings. 🎉

### Dependencies 📥
If you haven't already, install HuggingFace's `transformers` library with: `pip install transformers`

In [0]:
#hide
# CUDA ERROR DEBUGGING
# https://lernapparat.de/debug-device-assert/
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

In [0]:
#!pip freeze | grep fastprogress
#!pip uninstall fastprogress

In [0]:
!pip install -q fastai2==0.0.13     # pinned: this version seems to give fewer errors
!pip install -q transformers

In [0]:
#hide
%reload_ext autoreload
%autoreload 2

from fastai2.basics import *
from fastai2.text.all import *
from fastai2.callback.all import *

from transformers import BertForSequenceClassification, BertTokenizer, BertConfig
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer, DistilBertConfig
from transformers import RobertaForSequenceClassification, RobertaForMaskedLM,  RobertaTokenizer, RobertaConfig
from transformers import AlbertForSequenceClassification, AlbertTokenizer, AlbertConfig,AutoModelWithLMHead,AutoTokenizer


import ipywidgets as widgets
from transformers import pipeline

import json

In [0]:
#hide
path = untar_data(URLs.IMDB_SAMPLE)
model_path = Path('models')
df = pd.read_csv(path/'texts.csv')

## FastHugs Tokenizer
This tokenizer wrapper is initialised with the pretrained HF tokenizer. You can also specify `max_seq_len` if you want longer or shorter sequences. Given a text, it returns the tokens, truncated to fit and wrapped in the tokenizer's start (CLS) and separator (SEP) tokens.

In [0]:
class FastHugsTokenizer():
    """ 
        transformer_tokenizer : the pretrained HF tokenizer instance to wrap
        model_type : model type set by the user
        max_seq_len : override default sequence length, typically 512 for bert-like models
    """
    def __init__(self, transformer_tokenizer, model_type = 'roberta', max_seq_len=None, **kwargs): 
        self.tok = transformer_tokenizer
        self.max_seq_len = ifnone(max_seq_len, self.tok.max_len)
        self.model_type = model_type
        self.pad_token_id = self.tok.pad_token_id
        
    def do_tokenize(self, t:str):
        """Limits the maximum sequence length and adds the special tokens"""
        CLS = self.tok.cls_token
        SEP = self.tok.sep_token
        # Reserve two positions for the CLS and SEP tokens added below
        if 'roberta' in self.model_type:
            tokens = self.tok.tokenize(t, add_prefix_space=True)[:self.max_seq_len - 2]
        else:
            tokens = self.tok.tokenize(t)[:self.max_seq_len - 2]
        return [CLS] + tokens + [SEP]

    def __call__(self, items): 
        for t in items: yield self.do_tokenize(t)
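
To make the truncation arithmetic concrete, here is a minimal, dependency-free sketch. The helper name `add_special_tokens` and the whitespace "tokenizer" are illustrative assumptions, not part of FastHugs or HuggingFace; the only thing taken from the wrapper above is the `max_seq_len - 2` slice that leaves room for the two special tokens.

```python
def add_special_tokens(tokens, max_seq_len, cls="<s>", sep="</s>"):
    """Truncate so that the result, including both special tokens, fits in max_seq_len."""
    # Two positions are reserved: one for the start token, one for the separator
    return [cls] + tokens[:max_seq_len - 2] + [sep]

# A toy whitespace split stands in for the real subword tokenizer (assumption)
toy_tokens = "this movie was far longer than it needed to be".split()
out = add_special_tokens(toy_tokens, max_seq_len=6)
print(out)       # ['<s>', 'this', 'movie', 'was', 'far', '</s>']
print(len(out))  # 6
```

Whatever the input length, the output never exceeds `max_seq_len`, which is what the model's position embeddings require.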

In [0]:
# #### Tokenizer test
# t = fasthugstok()
# t.do_tokenize('i am tall')

## FastHugs Model
This `nn.Module` wraps the pretrained transformer model and initialises it with its config file. If you'd like to make configuration changes to the model, you can do so in this class. The `forward` of this module is taken straight from Melissa's notebook above; its purpose is to create the attention mask and return only the logits from the model's output (the HF transformer models can also return the loss).

In [0]:
# More or less copy-paste from https://www.kaggle.com/melissarajaram/roberta-fastai-huggingface-transformers/data
class FastHugsModelLM(nn.Module):
    def __init__(self, pretrained_model_name, model_class, config_class, max_seq_len=None):
        super(FastHugsModelLM, self).__init__()
        self.config = config_class.from_pretrained(pretrained_model_name)
        #self.config.num_labels = n_class  # no classification task
        # This was max_len before; I think that was wrong
        if max_seq_len is not None: self.config.max_position_embeddings = max_seq_len
        
        self.transformer = model_class.from_pretrained(pretrained_model_name, config = self.config, 
                                    cache_dir=model_path/f'{pretrained_model_name}')

        
    def forward(self, input_ids, attention_mask=None):
        # Build the mask from the padding index (1 is RoBERTa's pad token id); any mask passed in is ignored
        attention_mask = (input_ids!=1).type(input_ids.type()) 
        logits = self.transformer(input_ids, attention_mask = attention_mask)[0] 
        return logits

    def reset(self): pass  # no-op: the Learner complained at one point that the `reset` attribute was missing
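
The attention mask built in `forward` is simply "1 where the token is real, 0 where it is padding". A small pure-Python sketch of that rule follows; the function name `attention_mask` and the non-special token ids are made up for illustration, while `1` as the padding index matches the value hard-coded in the forward pass above.

```python
PAD_ID = 1  # RoBERTa's padding index, as hard-coded in the forward pass above

def attention_mask(batch, pad_id=PAD_ID):
    """Return 1 for every real token and 0 for every padding token."""
    return [[int(tok != pad_id) for tok in row] for row in batch]

# Two toy sequences; the second one is padded with a single pad token (ids are illustrative)
batch = [[0, 713, 16, 2],
         [0, 713, 2, 1]]
print(attention_mask(batch))  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

For models whose padding index is not `1`, the comparison value would need to come from the tokenizer's `pad_token_id` instead of a literal.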

In [0]:
#dir(FastHugsModelLM)

## Padding
Pass the initialised transformer tokenizer to set the index of the padding token and the side on which padding should be applied; e.g. BERT and RoBERTa prefer padding on the right, so we set `pad_first=False`.

In [0]:
def transformer_padding(transformer_tokenizer): 
    # Pad at the start only when the tokenizer pads on the left; BERT and RoBERTa pad on the right
    pad_first = transformer_tokenizer.padding_side != 'right'
    return partial(pad_input_chunk, pad_first=pad_first, pad_idx=transformer_tokenizer.pad_token_id)
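
As a rough illustration of what `pad_first` and `pad_idx` control, here is a simplified stand-in written in plain Python. It is not fastai's `pad_input_chunk` (which does more work); `pad_batch` is a hypothetical helper that only shows the left-versus-right choice.

```python
def pad_batch(seqs, pad_idx, pad_first=False):
    """Pad every sequence to the length of the longest one, on the left or on the right."""
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        pad = [pad_idx] * (max_len - len(s))
        padded.append(pad + s if pad_first else s + pad)
    return padded

seqs = [[0, 5, 6, 2], [0, 7, 2]]  # toy token ids
print(pad_batch(seqs, pad_idx=1))                  # [[0, 5, 6, 2], [0, 7, 2, 1]]
print(pad_batch(seqs, pad_idx=1, pad_first=True))  # [[0, 5, 6, 2], [1, 0, 7, 2]]
```

Right padding keeps the start token at position 0 for every sequence, which is what models with absolute position embeddings such as BERT and RoBERTa expect.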

## Let's get training
### Select our HuggingFace model

from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained('roberta-base')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = model(input_ids, labels=labels)
loss, logits = outputs[:2]

In [0]:
models_dict = {'bert_classification': (BertForSequenceClassification, BertTokenizer, BertConfig, 
                                       'bert-base-uncased', 'bert_clas_splitter'),
                'roberta_classification': (RobertaForSequenceClassification, RobertaTokenizer, RobertaConfig, 
                                           'roberta-base', 'roberta_clas_splitter'),
                'distilbert_classification': (DistilBertForSequenceClassification, DistilBertTokenizer, DistilBertConfig, 
                                              'distilbert-base-uncased', 'distilbert_clas_splitter'),
                'albert_classification': (AlbertForSequenceClassification, AlbertTokenizer, AlbertConfig, 
                                          'albert-base-v2', 'albert_clas_splitter'),
                # Alternatives for the masked-LM entry: AutoModelWithLMHead, AutoTokenizer
                'roberta-maskedlm' :     (RobertaForMaskedLM, RobertaTokenizer, RobertaConfig,
                                           "roberta-base", "roberta_base_splitter")
              }

Grab the model, tokenizer and config that we'd like to use

In [0]:
model_type = 'roberta-maskedlm'
model_class, tokenizer_class, config_class, pretrained_model_name, tfmr_splitter = models_dict[model_type]

We can also change the max sequence length for the tokenizer and transformer here. If it's not set, it defaults to the pretrained model's maximum, typically `512`. Shorter values save GPU memory; values above the pretrained maximum (e.g. 1024 or 2048) only work with models whose position embeddings support them.

In [0]:
max_seq_len = 288  

## Getting the HuggingFace Tokenizer into fastai-v2
Initialise the tokenizer needed for the pretrained model; this will download the `vocab.json` and `merges.txt` files needed. Specifying `cache_dir` lets us access them easily; otherwise they will be saved to a Torch cache folder at `~/.cache/torch/transformers`.

In [0]:
transformer_tokenizer = tokenizer_class.from_pretrained(pretrained_model_name, 
                                                        cache_dir=model_path/f'{pretrained_model_name}')

**Create fasthugstok function:** Let's incorporate the `transformer_tokenizer` into fastai-v2's framework by specifying a function to pass to `Tokenizer.from_df`. (Note: `from_df` is the only method I have tested.)

In [0]:
fasthugstok = partial(FastHugsTokenizer, transformer_tokenizer = transformer_tokenizer, 
                      model_type=model_type, max_seq_len=max_seq_len)  ## None

**Set up fastai-v2's Tokenizer.from_df:** We pass `rules=[]` to override fastai's default text processing rules

In [0]:
tok_fn = Tokenizer.from_df(text_cols='text', res_col_name='text', tok_func=fasthugstok, rules=[])

## Vocab
Model and vocab files will be saved with file names that are a long string of digits and letters (e.g. `d9fc1956a0....f4cfdb5feda.json`), generated from the etag from the AWS S3 bucket as described [here in the HuggingFace repo](https://github.com/huggingface/transformers/issues/2157). For readability I prefer to save the files in a specified directory under the model name so that they can be easily found and accessed in future.

(Note: To avoid saving these files twice you could look at the `from_pretrained` and `cached_path` functions in HuggingFace's `PreTrainedTokenizer` class definition to find the code that downloads the files, and maybe modify them to download directly to your specified directory with the desired name. I haven't had time to go that deep.)

Load the vocab file into a `list` as expected by fastai-v2. The HF pretrained tokenizer vocabs come in different file formats depending on the tokenizer you're using: BERT's vocab is saved as a .txt file, RoBERTa's is saved as a .json, and Albert's has to be extracted from a SentencePiece model.

In [0]:
def get_vocab(transformer_tokenizer, pretrained_model_name):
    if pretrained_model_name in ['bert-base-uncased', 'distilbert-base-uncased',]:
        transformer_vocab = list(transformer_tokenizer.vocab.keys())
    else:
        transformer_tokenizer.save_vocabulary(model_path/f'{pretrained_model_name}')
        suff = 'json'
        if pretrained_model_name in ['albert-base-v2']:
            with open(model_path/f'{pretrained_model_name}/alberta_v2_vocab.{suff}', 'r') as f: 
                transformer_vocab = json.load(f) 
        else:
            with open(model_path/f'{pretrained_model_name}/vocab.{suff}', 'r') as f: 
                transformer_vocab = list(json.load(f).keys()) 
    return transformer_vocab
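
One detail worth keeping in mind: `list(json.load(f).keys())` gives a correct vocab only if the JSON file lists tokens in id order, because fastai's `Numericalize` uses the list position as the token id. A defensive variant, sketched here on a tiny made-up vocab (the helper name `vocab_list` and the token ids are assumptions for illustration), sorts by id explicitly:

```python
import json

# A tiny stand-in for a RoBERTa-style vocab.json: token -> id, deliberately out of order
vocab_json = '{"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "Ġthe": 5, "Ġmovie": 4}'
token_to_id = json.loads(vocab_json)

def vocab_list(token_to_id):
    """Return tokens ordered by id so that vocab[i] is the token with id i."""
    return [tok for tok, _ in sorted(token_to_id.items(), key=lambda kv: kv[1])]

vocab = vocab_list(token_to_id)
print(vocab)
```

If the file is already in id order, this returns the same list as taking the keys directly.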

In [0]:
transformer_vocab = get_vocab(transformer_tokenizer, pretrained_model_name)

## Setup Data
### Create Dataset
Let's add our custom tokenizer function (`tok_fn`) and `transformer_vocab` here.

In [58]:
splits = ColSplitter()(df)
x_tfms = [attrgetter("text"), tok_fn, Numericalize(vocab=transformer_vocab)]
dsets = Datasets(df, splits=splits, tfms=[x_tfms], dl_type=SortedDL)   ### tfms=[x_tfms] ,[attrgetter("label"), Categorize()]

### Dataloaders
Here we use our `transformer_padding()` wrapper when loading the dataloader

### (Alternatively) Factory dataloader
Here we set:
- `tok_tfm=tok_fn` to use our HF tokenizer
- `text_vocab=transformer_vocab` to load our pretrained vocab
- `before_batch=transformer_padding(transformer_tokenizer)` to use our custom padding function 

In [43]:
# Factory
bs =4
dls = TextDataLoaders.from_df(df, text_col="text", tok_tfm=tok_fn, text_vocab=transformer_vocab,
                              before_batch=transformer_padding(transformer_tokenizer),
                              is_lm=True, valid_col='is_valid', bs=bs,device='cuda')

In [59]:
dls.show_batch(max_n=2, trunc_at=60)

Unnamed: 0,text,text_
0,"<s> ĠThis Ġguy Ġhas Ġno Ġidea Ġof Ġcinema . ĠOkay , Ġit Ġseems Ġhe Ġmade Ġa Ġfew Ġinterest ig Ġtheater Ġshows Ġin Ġhis Ġyouth , Ġand Ġabout Ġtwo Ġacceptable Ġmovies Ġthat Ġhad Ġsuccess Ġmore Ġof Ġpolitical Ġreasons Ġcause Ġthey Ġtricked Ġthe Ġcommunist Ġcensorship . ĠThis Ġall Ġis Ġvery Ġgood , Ġbut Ġlook Ġcarefully : ĠHE ĠDOES ĠNOT ĠKNOW ĠHIS ĠJ","ĠThis Ġguy Ġhas Ġno Ġidea Ġof Ġcinema . ĠOkay , Ġit Ġseems Ġhe Ġmade Ġa Ġfew Ġinterest ig Ġtheater Ġshows Ġin Ġhis Ġyouth , Ġand Ġabout Ġtwo Ġacceptable Ġmovies Ġthat Ġhad Ġsuccess Ġmore Ġof Ġpolitical Ġreasons Ġcause Ġthey Ġtricked Ġthe Ġcommunist Ġcensorship . ĠThis Ġall Ġis Ġvery Ġgood , Ġbut Ġlook Ġcarefully : ĠHE ĠDOES ĠNOT ĠKNOW ĠHIS ĠJ OB"
1,ĠKnox ville Ġboth Ġgive Ġbelow Ġaverage Ġperformances . ĠThe Ġlatter Ġwas Ġpretty Ġgood Ġas ĠSt if ler Ġbut Ġhe Ġtries Ġway Ġtoo Ġhard Ġhere . ĠThe Ġlatter Ġjust Ġseems Ġto Ġbe Ġlooking Ġfor Ġa Ġpaycheck Ġand Ġnothing Ġelse . ĠJessica ĠSimpson Ġisn 't Ġknown Ġfor Ġher Ġacting Ġnor Ġis Ġshe Ġreally Ġknown Ġfor Ġher Ġsinging . ĠShe </s> <s> ĠIn,ville Ġboth Ġgive Ġbelow Ġaverage Ġperformances . ĠThe Ġlatter Ġwas Ġpretty Ġgood Ġas ĠSt if ler Ġbut Ġhe Ġtries Ġway Ġtoo Ġhard Ġhere . ĠThe Ġlatter Ġjust Ġseems Ġto Ġbe Ġlooking Ġfor Ġa Ġpaycheck Ġand Ġnothing Ġelse . ĠJessica ĠSimpson Ġisn 't Ġknown Ġfor Ġher Ġacting Ġnor Ġis Ġshe Ġreally Ġknown Ġfor Ġher Ġsinging . ĠShe </s> <s> ĠIn Ġthe
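
In the batch shown above, the `text_` column is the `text` column shifted by one token, which is how a next-token language-modelling target is built when `is_lm=True`. A minimal sketch of that shift, with made-up token ids and a hypothetical helper name `lm_pairs`:

```python
def lm_pairs(token_ids):
    """Inputs are all tokens but the last; targets are the same tokens shifted left by one."""
    return token_ids[:-1], token_ids[1:]

ids = [0, 152, 2173, 34, 117, 1114]  # illustrative ids only
x, y = lm_pairs(ids)
print(x)  # [0, 152, 2173, 34, 117]
print(y)  # [152, 2173, 34, 117, 1114]
```

Note that this next-token target differs from the masked-token objective RoBERTa was pretrained with, so the pretrained head is not expected to predict `y` from `x` without further training.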


## Model Splitters
HuggingFace's models with names such as `RobertaForSequenceClassification` are core transformer models with a classification head. Let's split the classification head from the core transformer backbone so that we can use progressive unfreezing and differential learning rates.

You can split the model into 3 groups by modifying the splitter function like so:

```python
def roberta_clas_splitter(m):
    "Split the classifier head from the backbone"
    groups = [nn.Sequential(m.transformer.roberta.embeddings,
                  m.transformer.roberta.encoder.layer[0],
                  m.transformer.roberta.encoder.layer[1],
                  m.transformer.roberta.encoder.layer[2],
                  m.transformer.roberta.encoder.layer[3],
                  m.transformer.roberta.encoder.layer[4],
                  m.transformer.roberta.encoder.layer[5],
                  m.transformer.roberta.encoder.layer[6],
                  m.transformer.roberta.encoder.layer[7],
                  m.transformer.roberta.encoder.layer[8])]
    groups+= [nn.Sequential(m.transformer.roberta.encoder.layer[9],
                  m.transformer.roberta.encoder.layer[10],
                  m.transformer.roberta.encoder.layer[11],
                  m.transformer.roberta.pooler)]
    groups = L(groups + [m.transformer.classifier])
    return groups.map(params)
```
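
The three-way grouping can also be described with list slices instead of enumerating every layer by hand. The sketch below works on layer names only, so it runs without a model; `three_way_split` and its `cut` parameter are hypothetical names used for illustration.

```python
def three_way_split(n_layers=12, cut=9):
    """Group layer names as: early backbone, late backbone, head."""
    layers = [f"encoder.layer[{i}]" for i in range(n_layers)]
    early = ["embeddings"] + layers[:cut]   # embeddings plus layers 0..cut-1
    late = layers[cut:] + ["pooler"]        # remaining layers plus the pooler
    head = ["classifier"]
    return [early, late, head]

groups = three_way_split()
print([len(g) for g in groups])  # [10, 4, 1]
print(groups[1])  # ['encoder.layer[9]', 'encoder.layer[10]', 'encoder.layer[11]', 'pooler']
```

Moving `cut` changes how much of the backbone is unfrozen together with the later layers.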

**Classification Head Differences**

Interestingly, BERT's classification head is different to RoBERTa's

BERT + ALBERT:

```
(dropout): Dropout(p=0.1, inplace=False)
(classifier): Linear(in_features=768, out_features=2, bias=True)
```

DistilBERT's has a "pre-classifier" layer:

```
(pre_classifier): Linear(in_features=768, out_features=768, bias=True)
(classifier): Linear(in_features=768, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False)
```

RoBERTa's:

```
(classifier): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=2, bias=True))
```

In [0]:
def bert_clas_splitter(m):
    "Split the classifier head from the backbone"
    groups = [nn.Sequential(m.transformer.bert.embeddings,
                m.transformer.bert.encoder.layer[0],
                m.transformer.bert.encoder.layer[1],
                m.transformer.bert.encoder.layer[2],
                m.transformer.bert.encoder.layer[3],
                m.transformer.bert.encoder.layer[4],
                m.transformer.bert.encoder.layer[5],
                m.transformer.bert.encoder.layer[6],
                m.transformer.bert.encoder.layer[7],
                m.transformer.bert.encoder.layer[8],
                m.transformer.bert.encoder.layer[9],
                m.transformer.bert.encoder.layer[10],
                m.transformer.bert.encoder.layer[11],
                m.transformer.bert.pooler)]
    groups = L(groups + [m.transformer.classifier]) 
    return groups.map(params)

In [0]:
def albert_clas_splitter(m):
    groups = [nn.Sequential(m.transformer.albert.embeddings,
                m.transformer.albert.encoder.embedding_hidden_mapping_in, 
                m.transformer.albert.encoder.albert_layer_groups,
                m.transformer.albert.pooler)]
    groups = L(groups + [m.transformer.classifier]) 
    return groups.map(params)

In [0]:
def distilbert_clas_splitter(m):
    groups = [nn.Sequential(m.transformer.distilbert.embeddings,
                m.transformer.distilbert.transformer.layer[0], 
                m.transformer.distilbert.transformer.layer[1],
                m.transformer.distilbert.transformer.layer[2],
                m.transformer.distilbert.transformer.layer[3],
                m.transformer.distilbert.transformer.layer[4],
                m.transformer.distilbert.transformer.layer[5],
                m.transformer.pre_classifier)]
    groups = L(groups + [m.transformer.classifier]) 
    return groups.map(params)

In [0]:
def roberta_clas_splitter(m):
    "Split the classifier head from the backbone"
    groups = [nn.Sequential(m.transformer.roberta.embeddings,
                  m.transformer.roberta.encoder.layer[0],
                  m.transformer.roberta.encoder.layer[1],
                  m.transformer.roberta.encoder.layer[2],
                  m.transformer.roberta.encoder.layer[3],
                  m.transformer.roberta.encoder.layer[4],
                  m.transformer.roberta.encoder.layer[5],
                  m.transformer.roberta.encoder.layer[6],
                  m.transformer.roberta.encoder.layer[7],
                  m.transformer.roberta.encoder.layer[8],
                  m.transformer.roberta.encoder.layer[9],
                  m.transformer.roberta.encoder.layer[10],
                  m.transformer.roberta.encoder.layer[11],
                  m.transformer.roberta.pooler)]
    groups = L(groups + [m.transformer.classifier])
    return groups.map(params)

In [0]:
def roberta_base_splitter(m):                       ## added splitter for the masked-LM model
    "Split the LM head from the backbone"
    groups = [nn.Sequential(m.transformer.roberta.embeddings,
                  m.transformer.roberta.encoder.layer[0],
                  m.transformer.roberta.encoder.layer[1],
                  m.transformer.roberta.encoder.layer[2],
                  m.transformer.roberta.encoder.layer[3],
                  m.transformer.roberta.encoder.layer[4],
                  m.transformer.roberta.encoder.layer[5],
                  m.transformer.roberta.encoder.layer[6],
                  m.transformer.roberta.encoder.layer[7],
                  m.transformer.roberta.encoder.layer[8],
                  m.transformer.roberta.encoder.layer[9],
                  m.transformer.roberta.encoder.layer[10],
                  m.transformer.roberta.encoder.layer[11],
                  m.transformer.roberta.pooler)]
    groups = L(groups + [m.transformer.lm_head])      ### lm_head replaces the classifier
    return groups.map(params)

In [0]:
splitters = {#'bert_clas_splitter':bert_clas_splitter,
            #'albert_clas_splitter':albert_clas_splitter,
            #'distilbert_clas_splitter':distilbert_clas_splitter,
            #'roberta_clas_splitter':roberta_clas_splitter,
            'roberta_base_splitter': roberta_base_splitter}

### Load Model with configs

Here we can tweak the HuggingFace model's config file before loading the model

In [47]:
model_class, config_class,pretrained_model_name

(transformers.modeling_auto.AutoModelWithLMHead,
 transformers.configuration_roberta.RobertaConfig,
 'roberta-base')

In [0]:
fasthugs_model_lm = FastHugsModelLM(model_class=model_class, config_class=config_class,
                               pretrained_model_name = pretrained_model_name, 
                               max_seq_len=None)         #n_class=dsets.c

Initialise everything for our Learner

In [0]:
opt_func = partial(Adam, decouple_wd=True)

cbs = [MixedPrecision(clip=0.1), SaveModelCallback()]

loss = CrossEntropyLossFlat() #LabelSmoothingCrossEntropy  

splitter = splitters[tfmr_splitter]

## Time to train
### Create our learner for a language model, to predict the next word in a sentence or a masked word, as in the original RoBERTa model

In [0]:
#learn = Learner(dls, fasthugs_model_lm, opt_func=opt_func, splitter=splitter, loss_func=loss, cbs=cbs, metrics=[accuracy])
#learn = language_model_learner(dls, fasthugs_model_lm, drop_mult=0.3,pretrained=True, metrics=[accuracy, Perplexity()])
#learn = TextLearner(dls, fasthugs_model_lm,splitter=splitter,loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn = LMLearner(dls, fasthugs_model_lm, loss_func=CrossEntropyLossFlat(), splitter=splitter,cbs=cbs)

In [0]:
def decodifica(copia):
    "Decode a sequence of token ids into their vocab tokens"
    return [transformer_vocab[i] for i in copia]
    

### Check the shapes of `xb`, `yb` and `model(xb)`

In [77]:
xb,yb= learn.dls.one_batch()
xb.shape, yb.shape

(torch.Size([4, 72]), torch.Size([4, 72]))

In [78]:
ll = learn.model(xb)[0].argmax(dim=-1).tolist()  # predicted token id at each position of the first item in the batch
decodifica(ll)[:10]

['<s>',
 'You',
 'Ġknow',
 'Ġa',
 'Ġmovie',
 'Ġwill',
 'Ġnot',
 'Ġgo',
 'Ġwell',
 'Ġwhen']

In [86]:
decodifica(yb[0])[:10]

['ĠYou',
 'Ġknow',
 'Ġa',
 'Ġmovie',
 'Ġwill',
 'Ġnot',
 'Ġgo',
 'Ġwell',
 'Ġwhen',
 'ĠJohn']

### Stage 1 training
Let's freeze the model backbone and train only the head (here the LM head, `lm_head`, rather than a classifier). `freeze_to(1)` means that only the last parameter group, the head, is trainable.

In [0]:
learn.freeze_to(1)  
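
To spell out what `freeze_to(1)` does with the two parameter groups produced by the splitter (backbone, head), here is a small sketch of the indexing rule as I understand it. `trainable_after_freeze_to` is a hypothetical helper, and the sketch ignores fastai details such as batch-norm layers being kept trainable by default.

```python
def trainable_after_freeze_to(n_groups, n):
    """Return one flag per parameter group: True if it is still trained after freeze_to(n)."""
    # Negative n counts from the end, like Python list indexing
    frozen_idx = n if n >= 0 else n_groups + n
    return [i >= frozen_idx for i in range(n_groups)]

print(trainable_after_freeze_to(2, 1))   # [False, True]: backbone frozen, head trained
print(trainable_after_freeze_to(3, -1))  # [False, False, True]
```

With a three-group splitter, `freeze_to(-1)` would therefore give the same "head only" behaviour as `freeze_to(1)` does here.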

In [0]:
#learn.summary()

Let's find a learning rate to train the head.

## Error when running `.fit` or other high-level routines

`ValueError: Expected input batch_size (72) to match target batch_size (288).`
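
The two numbers in this error are consistent with the logits losing their batch dimension before the loss is computed: with `bs=4` and a sequence length of 72, the targets flatten to 288 items, while 72 is what remains if only the first item of the batch is kept. One possible explanation, which is an assumption and not verified against fastai2 0.0.13, is that the RNN-oriented callbacks attached by fastai's text learners expect an AWD-LSTM-style model that returns a tuple and keep only element 0 of the prediction; applied to a plain logits tensor, `pred[0]` selects the first batch item instead. The shape arithmetic can be checked without a model:

```python
from math import prod

def flat_rows(shape):
    """Rows left after flattening every dimension except the last (the class dimension)."""
    return prod(shape[:-1])

bs, seq_len, vocab_size = 4, 72, 50265   # batch shape from above; roberta-base vocab size
logits = (bs, seq_len, vocab_size)       # shape of model(xb)
targets = (bs, seq_len)                  # shape of yb

print(flat_rows(logits))      # 288: what the flattened loss should see
print(prod(targets))          # 288: flattened targets
print(flat_rows(logits[1:]))  # 72: rows left if the batch dimension is indexed away
```

If that is the cause, returning a tuple from the model's `forward`, or using a plain `Learner` without the RNN callbacks, would be directions to try.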

In [98]:
learn.fit(1)

epoch,train_loss,valid_loss,time
0,0.0,00:00,


ValueError: ignored

In [0]:
learn.recorder.plot_lr_find()
plt.vlines(6.918e-05, 0.6, 1.1)
plt.vlines(0.04786, 0.6, 1.1)

In [0]:
learn.fit_one_cycle(3, lr_max=1e-3)

In [0]:
learn.summary()

In [0]:
learn.save('fasthugs-lm')

In [0]:
learn.recorder.plot_loss()

In [0]:
raise RuntimeError("stop here")  # deliberate halt so that "Run all" does not continue into stage 2

### Stage 2 training
And now let's train the full model with differential learning rates.

In [0]:
learn.unfreeze()

In [0]:
#learn.summary()

In [0]:
learn.lr_find(suggestions=True)

In [0]:
learn.recorder.plot_lr_find()
plt.vlines(2.51e-08, 0.2, 0.5)
plt.vlines(6.30e-05, 0.2, 0.5)

In [0]:
learn.fit_one_cycle(3, lr_max=slice(1e-6, 1e-5))
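
Passing `slice(1e-6, 1e-5)` asks for differential learning rates: the first parameter group gets the lower rate and the last group the higher one. As far as I understand fastai's behaviour, intermediate groups are spaced geometrically between the two ends. The sketch below reproduces that spacing in plain Python; `spread_lrs` is a hypothetical name, not the fastai function.

```python
def spread_lrs(start, stop, n_groups):
    """Geometrically spaced learning rates from start to stop, one per parameter group."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))  # constant ratio between neighbouring groups
    return [start * step ** i for i in range(n_groups)]

print(spread_lrs(1e-6, 1e-5, 2))  # two groups: backbone and head
print(spread_lrs(1e-6, 1e-5, 3))  # three groups: the middle rate is the geometric mean
```

With the two-group splitter used in this notebook, the backbone trains at roughly `1e-6` and the head at roughly `1e-5`.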

In [0]:
learn.save('roberta-fasthugs-stg2-3e-5')

In [0]:
learn.recorder.plot_loss()

## Let's look at the model's predictions: the original FastHugs model, or our model without `learn`

In [0]:
### prediction from fasthugs_model_lm
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
input_ids = tokenizer.encode(sequence, return_tensors="pt")  # shape (1, seq_len)
# Assumes fasthugs_model_lm is already on the GPU; move the logits back to the CPU for indexing
outputs = fasthugs_model_lm(input_ids.cuda()).cpu()

mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
print(mask_token_index)
mask_token_logits = outputs[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits[0], 5, dim=0).indices.tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))


In [0]:
outputs.shape

In [49]:

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")   # AutoTokenizer      
model = RobertaForMaskedLM.from_pretrained("roberta-base")   # AutoModelWithLMHead

sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."

input = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]
print(tokenizer.mask_token_id)
print(mask_token_index,tokenizer.mask_token)
token_logits = model(input)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))


50264
tensor([21]) <mask>
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help  reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help  lower our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help  minimize our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help  decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help  cut our carbon footprint.
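
The top-5 step above boils down to: take the logits at the mask position, pick the highest-scoring vocabulary ids, and substitute each decoded token into the sentence. A dependency-free sketch of that logic follows, with a four-word toy vocabulary and made-up scores; `fill_mask` is a hypothetical helper, not a HuggingFace function.

```python
import heapq

MASK = "<mask>"

def fill_mask(sequence, vocab, mask_logits, k=2):
    """Return the sequence with MASK replaced by each of the k highest-scoring tokens."""
    # Indices of the k largest logits, best first
    top_ids = heapq.nlargest(k, range(len(mask_logits)), key=mask_logits.__getitem__)
    return [sequence.replace(MASK, vocab[i]) for i in top_ids]

vocab = ["reduce", "paint", "lower", "eat"]
mask_logits = [9.1, 0.3, 8.4, -2.0]  # one made-up score per vocabulary entry
for s in fill_mask(f"This would help {MASK} our carbon footprint.", vocab, mask_logits):
    print(s)
# This would help reduce our carbon footprint.
# This would help lower our carbon footprint.
```

`torch.topk` plays the role of `heapq.nlargest` in the real cell, and `tokenizer.decode` the role of the vocabulary lookup.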


In [0]:
sequence

In [0]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
learn.get_preds(TEXT)


In [0]:
res = learn.predict("'Distilled models are smaller than the models they mimic. Using them instead of the large" )
#print(res)
top5 = torch.topk(res[2][-1], 5, dim=0).indices.tolist()
decodifica(top5)

In [0]:
res[2].shape

In [0]:
from fastai2.interpret import *
interp = Interpretation.from_learner(learn)

In [0]:
interp.plot_top_losses(3)

In [0]:
### another example
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")

sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."

input = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]
print(tokenizer.mask_token_id)
print(mask_token_index,tokenizer.mask_token)
token_logits = model(input)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

# CLASSIFIER

In [0]:
### To use the full dataset
path = untar_data(URLs.IMDB)
path.ls()

In [0]:
# ##### read data from folder
# imdb_clas = DataBlock(blocks=(TextBlock.from_folder(path, vocab=dbunch_lm.vocab),CategoryBlock),
#                       get_x=read_tokenized_file,
#                       get_y = parent_label,
#                       get_items=partial(get_text_files, folders=['train', 'test']),
#                       splitter=GrandparentSplitter(valid_name='test'))

# dls= imdb_clas.dataloaders(path, path=path, bs=bs, seq_len=80)

# # or
# #dls_clas = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test', text_vocab=dls_lm.vocab)

# learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy,loss_func = CrossEntropyLossFlat(), ).to_fp16()

In [0]:
df.columns

In [0]:
##### read data from df - WORKS!
dls = TextDataLoaders.from_df(df, text_col='text', label_col='label', path=path, is_lm=False, valid_col='is_valid', bs=64, seq_len=80)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy] ,loss_func= CrossEntropyLossFlat()).to_fp16()

In [0]:
#### read data from df - works using DataBlock
imdb_clas = DataBlock(blocks=(TextBlock.from_df('text', is_lm=False),CategoryBlock),
                    get_x=attrgetter('text'),
                    get_y=ColReader('label'),
                    splitter=RandomSplitter())
dls = imdb_clas.dataloaders(df)
xb,yb = learn.dls.one_batch()
xb.shape , yb.shape

In [0]:
# ##### This block does NOT work; wait for https://github.com/muellerzr/Practical-Deep-Learning-for-Coders-2.0
# imdb_clas = DataBlock(blocks=(TextBlock('text', is_lm=False,vocab=dbunch_lm.vocab), CategoryBlock),
#                       get_x=attrgetter('text'),
#                       get_y=ColReader('label'),
#                       splitter=RandomSplitter(),
#                       # dl_type=SortedDL, 
# dls = imdb_clas.dataloaders(df)

In [0]:
xb,yb = learn.dls.one_batch()

In [0]:
xb.shape , yb.shape

In [0]:
learn.lr_find()