<a href="https://colab.research.google.com/github/ShinAsakawa/ShinAsakawa.github.io/blob/master/2022notebooks/2022_0130Adding_Custom_Layers_on_Top_of_a_HuggingFace_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- date: 2022_0130
- author: asakawa
- source: https://jovian.ai/rajbsangani/emotion-tuned-sarcasm/v/1?utm_source=embed#C11
- blog: [Adding Custom Layers on Top of a Hugging Face Model](https://towardsdatascience.com/adding-custom-layers-on-top-of-a-hugging-face-model-f1ccdfc257bd)

# Adding Custom Layers on Top of a Hugging Face Model

Learn how to extract the hidden states from a Hugging Face model body, modify or add task-specific layers on top of it, and train the whole custom setup end-to-end using PyTorch.

This post assumes basic familiarity with Hugging Face (using a model out of the box).
Also, a huge shoutout to the folks at Hugging Face for setting up a beginner-friendly learning environment!

## What will you learn from this blog?

1. Using task-specific models from the Hugging Face Hub and adapting them to your task at hand.
2. Decoupling a model’s head from its body and using the body to leverage domain-specific knowledge.
3. Building a custom head and attaching it to the body of the HF model in PyTorch and training the system end-to-end.

## The anatomy of a Hugging Face Model
Here is what a typical HF model looks like:

<center>
<img src="https://miro.medium.com/max/1260/1*7JDSKluZfSSI0O1yRWUIOQ.png"><br/>
Image By Author
</center>

## Why will I need to use the head and body separately?

Some models on Hugging Face are trained on downstream tasks like question-answering or text classification and contain knowledge about the data they were trained on in their weights.

Sometimes our task at hand has very little data or is domain-specific (for example, medical or sports-related). In such cases we can turn to other models on the Hub that were trained within the same domain, not necessarily on the same task, and reuse some of their pretrained knowledge to improve performance on our own task.

1. A very simple example: say we have a small dataset for classifying whether financial statements are positive or negative in sentiment, and we find on the Hub that many models have been trained for finance-related QA. We can use certain layers from these models to improve performance on our own task.
2. Another simple example: a domain-specific model has learned to classify text into 5 categories from the huge dataset it was trained on. Say we have a similar classification task on a completely different dataset in the same domain and only want to classify the data into 2 categories instead of 5. We can again use that model’s body and add our own head to bring its domain-specific knowledge to our task.

Diagrammatically, this is what we are trying to do:

<center>
<img src="https://miro.medium.com/max/1286/1*5h3h7WtxAZpmmjfem3eoUQ.png"><br/>

<img src="https://miro.medium.com/max/1358/1*Zz_QpVlAPF0Jkgd02948Yg.png"><br/>
</center>
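
The idea in the diagrams can be sketched with a tiny stand-in network instead of a real Hub checkpoint. `TinyBody`, `ModelWithHead`, and all sizes below are hypothetical names and values chosen for illustration; they are not part of the transformers API. The point is only that the body object is shared while the head is replaced.

```python
import torch
import torch.nn as nn


class TinyBody(nn.Module):
    """Hypothetical stand-in for a pretrained transformer body."""

    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.Linear(hidden_size, hidden_size)

    def forward(self, input_ids):
        # (batch, seq_len) -> (batch, seq_len, hidden_size)
        return torch.tanh(self.encoder(self.embed(input_ids)))


class ModelWithHead(nn.Module):
    """A body plus a linear classification head on the first token."""

    def __init__(self, body, hidden_size, num_labels):
        super().__init__()
        self.body = body
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        hidden = self.body(input_ids)
        return self.head(hidden[:, 0, :])  # use the first token's vector


torch.manual_seed(0)
body = TinyBody()

# Pretend this is the model found on the Hub: same body, 5-way head.
five_way = ModelWithHead(body, hidden_size=16, num_labels=5)

# Reuse the body and attach a freshly initialized 2-way head.
two_way = ModelWithHead(five_way.body, hidden_size=16, num_labels=2)

input_ids = torch.randint(0, 100, (4, 7))
five_logits = five_way(input_ids)
two_logits = two_way(input_ids)
print(five_logits.shape, two_logits.shape)
```

Because both wrappers hold the same `body` object, training `two_way` end-to-end also updates the shared body weights, which is the behaviour this post relies on.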

## Jumping into the code!

Our task is simple: sarcasm detection on the News Headlines Dataset for Sarcasm Detection from Kaggle.

You can check out the full code [here](https://jovian.ai/rajbsangani/emotion-tuned-sarcasm).
In the interest of time, I have left out the preprocessing and some training details below, so make sure to check out the notebook for the entire code.

I will use a model with 5 classification outputs trained on a huge corpus of tweets to classify 5 different emotions, extract the body and add custom layers in PyTorch for our task (2 labels, sarcastic and not sarcastic) and train the new model end-to-end.

Note: You can use any model in this example (not necessarily one trained for classification), since we will only use that model’s body and discard its head.

This is what our workflow looks like:

<center>
<img src="https://miro.medium.com/max/1126/1*vBXL8SiUl9lPLvZkUIoGUQ.png"><br/>
</center>

I will be skipping the data-preprocessing steps and jumping straight to the main class, but you can check out the entire code in the link at the beginning of this section.

## Tokenization and Dynamic Padding


In [None]:
!pip install datasets transformers[sentencepiece]

In [None]:
from datasets import load_dataset,Dataset,DatasetDict
from transformers import DataCollatorWithPadding,AutoModelForSequenceClassification, Trainer, TrainingArguments,AutoTokenizer,AutoModel,AutoConfig
from transformers.modeling_outputs import TokenClassifierOutput
import torch
import torch.nn as nn
import pandas as pd

In [None]:
#!echo '{"username":"YOUR_KAGGLE_USERNAME","key":"YOUR_KAGGLE_API_KEY"}' > kaggle.json
#!ls -l kaggle.json

In [None]:
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/ 
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d rmisra/news-headlines-dataset-for-sarcasm-detection

data=load_dataset("json",data_files="/content/news-headlines-dataset-for-sarcasm-detection.zip")
data=data.rename_column("is_sarcastic","label")

data=data.remove_columns(['article_link'])

data.set_format('pandas')
data=data['train'][:]

data.drop_duplicates(subset=['headline'],inplace=True)
data=data.reset_index()[['headline','label']]
data=Dataset.from_pandas(data)

# 80% train, 20% test + validation
train_testvalid = data.train_test_split(test_size=0.2,seed=15)

# Split the 20% test + valid portion in half: 10% test, 10% valid
test_valid = train_testvalid['test'].train_test_split(test_size=0.5,seed=15)

# gather everyone if you want to have a single DatasetDict
data = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})

data
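
The two-stage split above can be checked with a small plain-Python sketch. The `split_off` helper and the 1,000-row toy dataset are hypothetical and only mimic the arithmetic of `train_test_split`; they do not reproduce the shuffling order of the datasets library.

```python
import random


def split_off(items, test_size, seed):
    """Shuffle a copy of items and split off a test fraction (illustrative helper)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(1000))  # hypothetical dataset of 1,000 examples

# Stage 1: 80% train, 20% held out
train, test_valid = split_off(rows, test_size=0.2, seed=15)

# Stage 2: the held-out 20% is halved into 10% valid and 10% test
valid, test = split_off(test_valid, test_size=0.5, seed=15)

print(len(train), len(valid), len(test))
```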

In [None]:
checkpoint = "cardiffnlp/twitter-roberta-base-emotion"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.model_max_length = 512

In [None]:
def tokenize(batch):
  return tokenizer(batch["headline"], truncation=True,max_length=512)

tokenized_dataset = data.map(tokenize, batched=True)
tokenized_dataset

In [None]:
tokenized_dataset.set_format("torch",columns=["input_ids", "attention_mask", "label"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
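
The tokenizer call above truncates but does not pad, so every headline keeps its own length. `DataCollatorWithPadding` then pads each batch only up to the longest sequence in that batch, which is what "dynamic padding" refers to. Here is a plain-Python sketch of that behaviour; `pad_batch` and the `pad_id=0` default are assumptions made for illustration, not the collator's actual implementation or this tokenizer's real padding id.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two batches with different longest sequences get different padded widths.
short_batch = pad_batch([[5, 6], [7, 8, 9]])
long_batch = pad_batch([[1, 2, 3, 4, 5, 6], [7]])

print(short_batch["input_ids"])
print(short_batch["attention_mask"])
print(len(long_batch["input_ids"][0]))
```

Padding per batch rather than to a fixed 512 tokens keeps the tensors small for short inputs such as news headlines.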

### Extracting the Body and adding our own layers


In [None]:
class CustomModel(nn.Module):
  def __init__(self,checkpoint,num_labels): 
    super(CustomModel,self).__init__() 
    self.num_labels = num_labels 

    # Load the checkpoint and keep only its body (AutoModel attaches no task head)
    config = AutoConfig.from_pretrained(checkpoint, output_attentions=True, output_hidden_states=True)
    self.model = AutoModel.from_pretrained(checkpoint, config=config)
    self.dropout = nn.Dropout(0.1)
    self.classifier = nn.Linear(768, num_labels)  # new head; 768 is the hidden size of this checkpoint

  def forward(self, input_ids=None, attention_mask=None,labels=None):
    #Extract outputs from the body
    outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)

    #Add custom layers
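    sequence_output = self.dropout(outputs[0])  # outputs[0] = last hidden state

    # Classify from the hidden state of the first token (the classification token)
    logits = self.classifier(sequence_output[:, 0, :].view(-1, 768))

    # Compute the loss only when labels are provided
    loss = None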
    if labels is not None:
      loss_fct = nn.CrossEntropyLoss()
      loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    
    return TokenClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states,attentions=outputs.attentions)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model=CustomModel(checkpoint=checkpoint,num_labels=2).to(device)

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_dataset["train"], shuffle=True, batch_size=32, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_dataset["valid"], batch_size=32, collate_fn=data_collator
)

In [None]:
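from transformers import get_scheduler

# torch.optim.AdamW replaces the AdamW class deprecated in transformers;
# weight_decay=0.0 keeps the default behaviour of the old class
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)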

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)
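
With `"linear"` and `num_warmup_steps=0`, the learning rate starts at its base value and decays linearly to zero over `num_training_steps`. The function below is a plain-Python sketch of that rule, not the library's code; `linear_lr` and the step count of 300 are hypothetical.

```python
def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Linear warmup to base_lr, then linear decay to zero (illustrative)."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


total_steps = 300  # hypothetical; the notebook uses num_epochs * len(train_dataloader)
schedule = [linear_lr(s, 5e-5, 0, total_steps) for s in range(total_steps + 1)]

# Learning rate at the start, the midpoint, and the end of training
print(schedule[0], schedule[total_steps // 2], schedule[-1])
```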

In [None]:
from datasets import load_metric
metric = load_metric("f1")

As you can see in the `CustomModel` class above, we first subclass `nn.Module` from PyTorch, then extract the model body using `AutoModel` (from transformers), passing it the checkpoint of the model whose body we want to use.

Note that the forward pass returns a TokenClassifierOutput (from the transformers library), which keeps our output in a format similar to that of a Hugging Face model from the Hub.
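
To make the tensor shapes in the custom head concrete, here is a small sketch that replaces the body's output with a random tensor. The batch size, sequence length, and labels are made up for illustration; only the slicing, the linear layer, and the loss mirror the head's forward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, hidden_size, num_labels = 4, 10, 768, 2

# Random stand-in for outputs[0], the body's last hidden state
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)
labels = torch.tensor([0, 1, 1, 0])  # hypothetical sarcasm labels

classifier = nn.Linear(hidden_size, num_labels)

first_token = last_hidden_state[:, 0, :]        # (batch_size, hidden_size)
logits = classifier(first_token)                # (batch_size, num_labels)
loss = nn.CrossEntropyLoss()(logits, labels)    # scalar tensor

print(first_token.shape, logits.shape, loss.dim())
```

Only the first token's vector reaches the classifier, so the head produces one prediction per headline regardless of how long the padded sequence is.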

### Training the new model end-to-end

In [None]:
from tqdm.auto import tqdm

progress_bar_train = tqdm(range(num_training_steps))
progress_bar_eval = tqdm(range(num_epochs * len(eval_dataloader)))


for epoch in range(num_epochs):
  model.train()
  for batch in train_dataloader:
      batch = {k: v.to(device) for k, v in batch.items()}
      outputs = model(**batch)
      loss = outputs.loss
      loss.backward()

      optimizer.step()
      lr_scheduler.step()
      optimizer.zero_grad()
      progress_bar_train.update(1)

  model.eval()
  for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
    progress_bar_eval.update(1)
    
  print(metric.compute())

      

In [None]:
model.eval()

test_dataloader = DataLoader(
    tokenized_dataset["test"], batch_size=32, collate_fn=data_collator
)

for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

As you can see, we achieve decent performance using this method.
**Keep in mind that the aim of this blog isn’t to analyze performance for this particular dataset but to learn how to use a pre-trained Body and add a Custom Head**.
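
For readers unfamiliar with the metric, the F1 score accumulated batch by batch in the loops above can be sketched in plain Python. `BinaryF1` is a hypothetical class that imitates the `add_batch` / `compute` pattern for binary labels with 1 as the positive class; it is not the library's implementation, and the predictions below are made up.

```python
class BinaryF1:
    """Accumulates counts over batches and computes binary F1 (positive class = 1)."""

    def __init__(self):
        self.tp = self.fp = self.fn = 0

    def add_batch(self, predictions, references):
        for pred, ref in zip(predictions, references):
            if pred == 1 and ref == 1:
                self.tp += 1
            elif pred == 1 and ref == 0:
                self.fp += 1
            elif pred == 0 and ref == 1:
                self.fn += 1

    def compute(self):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * self.tp + self.fp + self.fn
        return {"f1": 2 * self.tp / denom if denom else 0.0}


metric_sketch = BinaryF1()
metric_sketch.add_batch(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])
metric_sketch.add_batch(predictions=[0, 0, 1], references=[1, 0, 1])
result = metric_sketch.compute()
print(result)
```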

### Conclusion

We saw how one can add custom layers to a pre-trained model’s body using the Hugging Face Hub.

Some takeaways:

1. This technique is particularly helpful in cases where we have small domain-specific datasets and want to leverage models trained on larger datasets in the same domain (task-agnostic) to augment performance on our small dataset.
2. We can choose models that have been trained on downstream tasks different from our own task and still use the knowledge from that model’s body.
3. This may not be necessary at all if your dataset is large and generic enough. In that case you can use AutoModelForSequenceClassification, or the equivalent class for whatever task you have to solve, with a BERT-like checkpoint. In fact, if that is so, I would strongly recommend not building your own head.

Check out my [GitHub](https://github.com/rajlm10) for some other projects. You can contact me [here](https://rajsangani.me/).
Thank you for your time!

If you liked this, here are some more!

- [Interpreting an LSTM through LIME](https://towardsdatascience.com/interpreting-an-lstm-through-lime-e294e6ed3a03)
- [Powerful Text Augmentation Using NLPAUG](https://towardsdatascience.com/powerful-text-augmentation-using-nlpaug-5851099b4e97)