# FastHugs
This notebook walks through fine-tuning a text classification model end to end with **HuggingFace transformers** and the new **fastai-v2** library.

## Things You Might Like
**FastHugsTokenizer:** A tokenizer wrapper that can be used with fastai-v2's tokenizer.

**FastHugsModel:** A model wrapper over the HF models, more or less the same as the wrappers from the HF fastai-v1 articles mentioned below.

**Vocab:** A function to extract the vocab depending on the pre-trained transformer (HF hasn't standardised this process 😢).

**Padding:** Settings for the padding token index and for whether the transformer prefers left or right padding.

**Vocab for Albert-base-v2:** A .json file of Albert-base-v2's vocab; otherwise this has to be extracted from a SentencePiece model file, which isn't fun.


### Pretrained Transformers only for now
Initially, this notebook will only deal with fine-tuning HuggingFace's pretrained models. It covers BERT, DistilBERT, RoBERTa and ALBERT pretrained classification models only. These are the core transformer architectures to which HuggingFace has added a classification head. HuggingFace also provides other versions of these architectures, such as the core model and language model variants.

If you'd like to try training a model from scratch, HuggingFace recently published an article on [How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train). It's well worth reading to see how their `tokenizers` library can be used independently of their pretrained transformer models.

### Read these first 👇
This notebook borrows heavily from [this notebook](https://www.kaggle.com/melissarajaram/roberta-fastai-huggingface-transformers), which in turn is based on this [tutorial](https://www.kaggle.com/maroberti/fastai-with-transformers-bert-roberta) and accompanying [article](https://towardsdatascience.com/fastai-with-transformers-bert-roberta-xlnet-xlm-distilbert-4f41ee18ecb2). Huge thanks to Melissa Rajaram and Maximilien Roberti for these great resources. If you're not familiar with the HuggingFace library, please read these first.

### fastai-v2
[This paper](https://www.fast.ai/2020/02/13/fastai-A-Layered-API-for-Deep-Learning/) introduces v2 of the fastai library, and you can follow and contribute to v2's progress [on the forums](https://forums.fast.ai/). This notebook is based on the [fastai-v2 ULMFiT tutorial](http://dev.fast.ai/tutorial.ulmfit). Huge thanks to Jeremy, Sylvain, Rachel and the fastai community for making this library what it is. I'm super excited about the additional flexibility v2 brings.

### Dependencies
If you haven't already, install HuggingFace's `transformers` library with: `pip install transformers`

## Vocab
Model and vocab files are saved with file names made up of a long string of digits and letters (e.g. `d9fc1956a0....f4cfdb5feda.json`), generated from the etag from the AWS S3 bucket as described [here in the HuggingFace repo](https://github.com/huggingface/transformers/issues/2157). For readability, I prefer to save the files in a specified directory under the model name so that they can be easily found and accessed in future.
 
(Note: To avoid saving these files twice, you could look at the `from_pretrained` and `cached_path` functions in HuggingFace's `PreTrainedTokenizer` class definition to find the code that downloads the files, and maybe modify them to download directly to your specified directory with the desired name. I haven't had time to go that deep.)

Load the vocab file into a `list`, as expected by fastai-v2. The HF pretrained tokenizer vocabs come in different file formats depending on the tokenizer you're using: BERT's vocab is saved as a .txt file, RoBERTa's is saved as a .json file, and ALBERT's has to be extracted from a SentencePiece model.

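In [153]:
import json

# `model_path` is assumed to be a pathlib.Path defined earlier in the notebook
def get_vocab(transformer_tokenizer, pretrained_model_name):
    "Return the pretrained tokenizer's vocab as a list of tokens."
    if pretrained_model_name in ['bert-base-uncased', 'distilbert-base-uncased']:
        # BERT-style tokenizers expose the vocab directly as a dict
        transformer_vocab = list(transformer_tokenizer.vocab.keys())
    else:
        # Other tokenizers: save the vocab files to disk, then read them back
        transformer_tokenizer.save_vocabulary(model_path/f'{pretrained_model_name}')
        suff = 'json'
        if pretrained_model_name in ['albert-base-v2']:
            # ALBERT: load the token list extracted from the SentencePiece model
            with open(model_path/f'{pretrained_model_name}/alberta_v2_vocab.{suff}', 'r') as f:
                transformer_vocab = json.load(f)
        else:
            with open(model_path/f'{pretrained_model_name}/vocab.{suff}', 'r') as f:
                transformer_vocab = list(json.load(f).keys())
    # The original cell never returned the vocab, so callers received None
    return transformer_vocab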

In [154]:
transformer_vocab = get_vocab(transformer_tokenizer, pretrained_model_name)
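
To illustrate the file-format differences described above, here is a minimal, standard-library-only sketch. The helper name `load_vocab_file` and the toy files are hypothetical and are not part of fastai or HuggingFace. It assumes a BERT-style .txt file holds one token per line (line number equals token id) and a RoBERTa-style .json file maps each token to its id.

```python
import json
import tempfile
from pathlib import Path


def load_vocab_file(path):
    """Return the tokens in `path` as a list ordered by token id (hypothetical helper)."""
    path = Path(path)
    if path.suffix == ".txt":
        # BERT-style: one token per line, line number == token id
        with open(path, "r", encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]
    if path.suffix == ".json":
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        if isinstance(data, dict):
            # RoBERTa-style {token: id}: sort by id so list position == id
            return [tok for tok, _ in sorted(data.items(), key=lambda kv: kv[1])]
        return list(data)  # already a plain list of tokens
    raise ValueError(f"Unsupported vocab format: {path.suffix}")


# Toy files standing in for real vocab files
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "vocab.txt").write_text("[PAD]\n[UNK]\nhello\n", encoding="utf-8")
(tmp_dir / "vocab.json").write_text(
    json.dumps({"<s>": 0, "hello": 2, "<pad>": 1}), encoding="utf-8"
)

txt_vocab = load_vocab_file(tmp_dir / "vocab.txt")
json_vocab = load_vocab_file(tmp_dir / "vocab.json")
print(txt_vocab)
print(json_vocab)
```

Sorting the dict by id is a defensive choice in this sketch: it keeps each token's list position equal to its id even if the .json file happens to list tokens in a different order.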

## Albert Vocab
### From SentencePiece issues: https://github.com/google/sentencepiece/issues/121

**1. Install protobuf**

`sudo apt install protobuf-compiler`



**2. Clone the SentencePiece repo**

`git clone https://github.com/google/sentencepiece.git`



**3. cd to `sentencepiece/src`**

`cd sentencepiece/src`


**4. Run the below to generate `sentencepiece_model_pb2.py`**

`protoc --python_out=. sentencepiece_model.proto`


**5. Copy the `sentencepiece_model_pb2.py` file to the same directory as your notebook**

`cp sentencepiece_model_pb2.py YOUR_NOTEBOOK_DIR/sentencepiece_model_pb2.py`


**6. Use `sentencepiece_model_pb2` to open the .model file from the tokenizer**

```python
import sentencepiece_model_pb2 as spmodel
m = spmodel.ModelProto()
m.ParseFromString(open('models/spiece.model', 'rb').read())
```

**7. Iterate through `.pieces` to extract each token from the vocab and append it to a list**

```python
vocab_ls = []
for p in m.pieces:
    vocab_ls.append(p.piece)
```

(There is also a `p.score` attribute if you are interested in that too.)


**8. Save the vocab so you don't have to do this icky work again**

```python
import json
with open('YOUR_MODEL_DIR/alberta_v2_vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab_ls, f, ensure_ascii=False, indent=4)
```
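
If you also want the scores mentioned in step 7, the loop can collect both in one pass. This is only a sketch: `pieces_to_vocab` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the real entries of `m.pieces`, assumed here only to expose `.piece` and `.score` attributes.

```python
from types import SimpleNamespace


def pieces_to_vocab(pieces):
    """Split an iterable of pieces into a token list and a token -> score dict."""
    vocab_ls, scores = [], {}
    for p in pieces:
        vocab_ls.append(p.piece)      # token string, in vocab order
        scores[p.piece] = p.score     # score attached to this token
    return vocab_ls, scores


# Hypothetical stand-ins for m.pieces; real entries come from the parsed ModelProto
toy_pieces = [
    SimpleNamespace(piece="<pad>", score=0.0),
    SimpleNamespace(piece="<unk>", score=0.0),
    SimpleNamespace(piece="▁the", score=-3.5),
]

vocab_ls, scores = pieces_to_vocab(toy_pieces)
print(vocab_ls)
print(scores["▁the"])
```

With a real model you would pass `m.pieces` in place of `toy_pieces`; the token list can then be saved exactly as in step 8.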
 
**Opening the vocab.json file in future**

Now simply use the code below to open your saved vocab file:

```python
import json
with open('models/albert-base-v2/alberta_v2_vocab.json', 'r', encoding='utf-8') as f:
    vocab_ls = json.load(f)
```
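
As a quick sanity check on the save/load round trip, the sketch below writes a toy vocab to a temporary file and reads it back. The toy tokens and the `toy_vocab.json` file name are made up for illustration. It checks that the order is preserved, that there are no duplicate tokens, and that the SentencePiece `▁` marker survives thanks to `ensure_ascii=False`.

```python
import json
import tempfile
from pathlib import Path

# Toy vocab standing in for the list extracted from the SentencePiece model
vocab_ls = ["<pad>", "<unk>", "▁the", "▁café"]

# Save with the same json.dump settings as step 8
vocab_path = Path(tempfile.mkdtemp()) / "toy_vocab.json"
with open(vocab_path, "w", encoding="utf-8") as f:
    json.dump(vocab_ls, f, ensure_ascii=False, indent=4)

# Load it back as a plain list
with open(vocab_path, "r", encoding="utf-8") as f:
    loaded_vocab = json.load(f)

# Order is preserved and every token is unique
assert loaded_vocab == vocab_ls
assert len(set(loaded_vocab)) == len(loaded_vocab)

# List position doubles as the token index
token_to_idx = {tok: i for i, tok in enumerate(loaded_vocab)}
print(token_to_idx["▁the"])
```

Running the same checks against your real saved file is a cheap way to catch a truncated or mis-encoded vocab before training.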