<a href="https://colab.research.google.com/github/0xVolt/whats-up-doc/blob/main/src/experimental-notebooks/fine-tune-distilbert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning the DistilBERT Model from HuggingFace

---

## Import data to fine-tune model

In [36]:
import pandas as pd

In [37]:
dataset = pd.read_csv('assets/spam-data.csv')

In [38]:
dataset.head(), dataset.shape

(   label                                               text
 0      0  Go until jurong point, crazy.. Available only ...
 1      0                      Ok lar... Joking wif u oni...
 2      1  Free entry in 2 a wkly comp to win FA Cup fina...
 3      0  U dun say so early hor... U c already then say...
 4      0  Nah I don't think he goes to usf, he lives aro...,
 (5572, 2))

## Extract dependent and independent features

In [39]:
X = list(dataset['text'])
y = list(dataset['label'])

## Train-test Split

In [40]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=14)

## Use a HuggingFace Model

Using a pre-trained model from HuggingFace generally involves the following steps:
1. Load the pre-trained model
2. Load the model's tokenizer, since each model has its own tokenizer
3. Use the tokenizer to encode the train and test datasets
   1. `truncation` - cut off inputs that exceed the model's maximum sequence length
   2. `padding` - pad shorter inputs so that all data points in a batch have the same length
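
To make the two encoding options concrete, here is a minimal pure-Python sketch of what truncation and padding do to a batch. It is a toy illustration only: the helper `truncate_and_pad`, the `[PAD]` string, and the whitespace splitting are assumptions made for this example, whereas the real tokenizer works on subword IDs and takes its maximum length from the model.

```python
# Toy illustration only: real tokenizers operate on subword IDs, not whitespace tokens.
PAD = "[PAD]"  # hypothetical padding marker for this sketch


def truncate_and_pad(batch, max_length):
    """Truncate each token list to max_length, then pad to the longest remaining one."""
    truncated = [tokens[:max_length] for tokens in batch]
    target = max(len(tokens) for tokens in truncated)
    padded = [tokens + [PAD] * (target - len(tokens)) for tokens in truncated]
    # 1 marks a real token, 0 marks padding (the same idea as an attention mask).
    masks = [[1] * len(tokens) + [0] * (target - len(tokens)) for tokens in truncated]
    return padded, masks


batch = [
    "free entry in a weekly comp".split(),
    "ok lar".split(),
]
padded, masks = truncate_and_pad(batch, max_length=4)
for tokens, mask in zip(padded, masks):
    print(tokens, mask)
```

The first message loses its last two tokens to truncation, while the second is padded up to the same length and its mask flags the padded positions.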

In [41]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [42]:
trainEncoded = tokenizer(X_train, truncation=True, padding=True)
testEncoded = tokenizer(X_test, truncation=True, padding=True)

In [43]:
# print(testEncoded)

## Create Dataset Objects with Tensorflow

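`tf.data.Dataset.from_tensor_slices` converts the encoded inputs and labels to tensors and yields one `(features, label)` pair per example. We do this so that data flows through the training pipeline in the format the model expects.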
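
As a rough sketch of what one element of such a dataset looks like, the snippet below builds a tiny dataset by hand. The token IDs, the names `toy_encodings`, `toy_labels`, and `toy_dataset` are made up for illustration and are not taken from the spam data.

```python
import tensorflow as tf

# Hypothetical, hand-written encodings standing in for the tokenizer output.
toy_encodings = {
    "input_ids": [[101, 2489, 102], [101, 7929, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_labels = [1, 0]

# Slicing along the first axis yields one (features dict, label) pair per example.
toy_dataset = tf.data.Dataset.from_tensor_slices((toy_encodings, toy_labels))

for features, label in toy_dataset.take(1):
    print(features["input_ids"].numpy(), label.numpy())

# Batching groups examples; this works because every row has the same length.
toy_batched = toy_dataset.batch(2)
```

Because every row must have the same length to be stacked into a tensor, padding during tokenization is what makes this step possible.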

In [44]:
import tensorflow as tf

trainDataset = tf.data.Dataset.from_tensor_slices((
    dict(trainEncoded),
    y_train
))

testDataset = tf.data.Dataset.from_tensor_slices((
    dict(testEncoded),
    y_test
))

In [45]:
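# Note: TFTrainer and TFTrainingArguments were deprecated and are absent from
# later transformers releases, so this cell needs a version that still ships them.
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments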

trainingArguments = TFTrainingArguments(
    output_dir = '../results',
    num_train_epochs = 2,
    evaluation_strategy = 'steps',
    eval_steps = 500,
    per_device_train_batch_size = 4,
    per_device_eval_batch_size = 8,
    warmup_steps = 100,
    weight_decay = 0.01,
    logging_dir = '../logs',
    logging_steps = 10
)
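
To get a feel for what these arguments imply, the following back-of-the-envelope sketch estimates the number of optimizer steps. It assumes a single device, no gradient accumulation, that the final partial batch is kept, and that the 80/20 split rounds the test set up; the variable names are illustrative only.

```python
import math

# Values taken from the cells above.
n_rows = 5572
test_size = 0.2
train_batch_size = 4
num_train_epochs = 2
eval_steps = 500
warmup_steps = 100

# Assumption: the test set size is rounded up and the rest goes to training.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# Assumption: single device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Evaluation runs every eval_steps steps; warmup covers the first warmup_steps.
n_evaluations = total_steps // eval_steps
warmup_fraction = warmup_steps / total_steps

print(f"train examples:  {n_train}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"evaluations:     {n_evaluations}")
print(f"warmup fraction: {warmup_fraction:.1%}")
```

Under these assumptions the warmup phase covers only a small share of training, and evaluation runs a handful of times over the two epochs.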

In [46]:
with trainingArguments.strategy.scope():
    model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [47]:
trainer = TFTrainer(
    model = model,
    args = trainingArguments,
    train_dataset = trainDataset,
    eval_dataset = testDataset
)



In [None]:
trainer.train()