In [1]:
# Transformers installation
! pip install transformers datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git






# Fine-tune a pretrained model

Using a pretrained model has significant benefits: it reduces computation costs and your carbon footprint, and it lets you use state-of-the-art models without training one from scratch. 🤗 Transformers provides access to thousands of pretrained models for a wide range of tasks. When you use a pretrained model, you train it further on a dataset specific to your task. This is known as fine-tuning, and it typically needs far less data and compute than training from scratch. In this tutorial, you will fine-tune a pretrained model with a deep learning framework of your choice:

* Fine-tune a pretrained model with 🤗 Transformers [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer).
* Fine-tune a pretrained model in TensorFlow with Keras.
* Fine-tune a pretrained model in native PyTorch.

<a id='data-processing'></a>

## Prepare a dataset

Before you can fine-tune a pretrained model, download a dataset and prepare it for training. The previous tutorial showed you how to process data for training, and now you get an opportunity to put those skills to the test!

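The original tutorial loads the [Yelp Reviews](https://huggingface.co/datasets/yelp_review_full) dataset at this point. This notebook instead uses a local CSV file, `decmet.csv`, which holds short sentences with a binary label, and augments it with synthetic typos. Begin by defining a helper that injects random typos into a string: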

In [125]:
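import random
import string


def diffuse(text):
    """Return `text` with up to one random letter substitution and up to two adjacent swaps."""
    lt = list(text)
    if len(lt) < 2:
        return text  # too short to swap; also avoids randrange() on an empty range
    # 0 or 1 times: overwrite a random position with a random ASCII letter
    for _ in range(random.randint(0, 1)):
        lt[random.randrange(len(lt))] = random.choice(string.ascii_letters)
    # 0 to 2 times: swap a random pair of adjacent characters
    for _ in range(random.randint(0, 2)):
        j = random.randrange(len(lt) - 1)
        lt[j], lt[j + 1] = lt[j + 1], lt[j]
    return ''.join(lt)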

In [126]:
print(diffuse("this is a slepo typo filter"))

this i sa slepo typo filwer
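The `diffuse` helper above mixes two kinds of character-level noise: substituting one character and swapping two neighbors. Because it is random, its output is hard to check by eye. The following is a minimal deterministic sketch of each operation on its own; the helper names `substitute` and `swap_adjacent` are hypothetical and are not used elsewhere in this notebook.

```python
def substitute(text: str, index: int, char: str) -> str:
    """Replace the character at `index` with `char`."""
    chars = list(text)
    chars[index] = char
    return "".join(chars)


def swap_adjacent(text: str, index: int) -> str:
    """Swap the characters at `index` and `index + 1`."""
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


sample = "im eating lunch"
print(substitute(sample, 3, "x"))   # overwrites the "e" in "eating"
print(swap_adjacent(sample, 3))     # swaps the "e" and "a" in "eating"
```

Fixing the position and character by hand makes each edit easy to verify before randomness is layered on top.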


In [119]:
import pandas as pd

# Load the raw CSV; its columns are a timestamp, the sentence, and a 0/1 label
DS = pd.read_csv("decmet.csv")
print(DS.columns)

Index(['Timestamp', 'Enter a sentence',
       '0 - dont dad joke\n1 - dad joke is ok'],
      dtype='object')


In [127]:
# Keep the sentence and label columns (positions 1 and 2) under short names
dataset = DS.iloc[:, [1, 2]].copy()
dataset.columns = ['text', 'label']
print(dataset.head())

                                                text label
0  man im absolutely done with life i want to end...     0
1                 honestly im not even mad im amused     1
2                          im soo pissed off at life     0
3                          lol im happy you liked it     1
4                                    im eating lunch     1


In [121]:
# Diffuse dataset: build 20 to 40 noisy copies of randomly chosen rows
rows = []
for _ in range(random.randint(20, 40)):
    i = random.randrange(len(dataset))  # any row index, including row 0
    row = dataset.iloc[i]
    rows.append([diffuse(row['text']), row['label']])
diffused = pd.DataFrame(rows, columns=['text', 'label'])
print(diffused['text'])

0                  imk inda confused but igYet the idea
1                                    im feleing amazing
2                  im kinda confused bu ti gett he idea
3                                    I'm wnat psata Noo
4                                im standign right ehre
5                              i thikn im benig stalked
6                                    ifm eeling amazing
7                                  im fniall yback homU
8           im starting to thin ki should be na aethist
9                            im never Aonnag iev you up
10                        taht is because im sutpid lol
11    im thaknful saki gf is not Cere to see this cr...
12     must ehre playing rocket leauge on my tab lm ...
13                   ia m immuDe to your pointless acts
14                                   'Im want psata too
15                                 im pulling fr olayla
16                                          im ocnLsued
17                                      im aetin

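from datasets import load_dataset

# Alternative: load the same CSV directly with 🤗 Datasets (no typo augmentation).
# Bound to a separate name so the pandas `dataset` built above is not overwritten.
csv_dataset = load_dataset("csv", data_files="decmet.csv")
csv_dataset = csv_dataset.remove_columns('Timestamp')
csv_dataset = csv_dataset.rename_column('Enter a sentence', 'text')
csv_dataset = csv_dataset.rename_column('0 - dont dad joke\n1 - dad joke is ok', 'label')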

In [136]:
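from datasets import Dataset

# Combine the original rows with the typo-augmented rows. The index is not reset,
# which is why an extra `__index_level_0__` column appears in the resulting features.
df = pd.concat([dataset, diffused])
dataset = Dataset.from_pandas(df)
dataset = dataset.train_test_split(test_size=0.1)
dataset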

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 108
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 13
    })
})
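The split sizes in the output above can be reproduced by hand. The sketch below assumes the test share is rounded up and the train share rounded down. That rounding rule is an assumption, but it is consistent with the 108 and 13 rows reported.

```python
import math

n_rows = 121        # 108 + 13, the total implied by the output above
test_size = 0.1

n_test = math.ceil(test_size * n_rows)            # 12.1 rounds up
n_train = math.floor((1 - test_size) * n_rows)    # 108.9 rounds down
print(n_train, n_test)
```

With only 13 held-out rows, a single misclassified example shifts the reported accuracy by roughly 7.7 percentage points.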

As you now know, you need a tokenizer to process the text, along with a padding and truncation strategy to handle variable sequence lengths. To process your dataset in one step, use the 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/process.html#map) method to apply a preprocessing function over the entire dataset:

In [137]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
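To make the padding and truncation strategy concrete, here is a dependency-free sketch of what fixed-length preparation does to a sequence of token IDs. The function `pad_or_truncate`, the padding ID of 0, and the toy token IDs are illustrative assumptions. A real tokenizer also handles special tokens and picks the maximum length from the model configuration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly `max_length` long."""
    ids = list(token_ids)[:max_length]      # truncation: drop anything past the limit
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    # padding: fill the remainder; 0 in the mask marks a padded position
    return ids + [pad_id] * padding, mask + [0] * padding


short_ids, short_mask = pad_or_truncate([5, 6, 7, 8], max_length=6)
long_ids, long_mask = pad_or_truncate(range(1, 10), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up the same length, which lets a batch be stacked into a single tensor. The attention mask tells the model which positions to ignore.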


The original tutorial creates a smaller subset at this point to reduce training time. This dataset is already small (108 training and 13 test examples), so use both splits in full and only shuffle them:

In [145]:
train_dataset = tokenized_datasets["train"].shuffle(seed=42)
eval_dataset = tokenized_datasets["test"].shuffle(seed=42)
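Passing a fixed `seed` makes the shuffled order reproducible across runs. The idea can be sketched with Python's standard `random` module. This is not the 🤗 Datasets implementation, only an illustration of seeded shuffling.

```python
import random

rows = list(range(10))

first = rows.copy()
random.Random(42).shuffle(first)

second = rows.copy()
random.Random(42).shuffle(second)

print(first == second)          # same seed, same order
print(sorted(first) == rows)    # shuffling only reorders; no row is lost
```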

<a id='trainer'></a>

## Train

🤗 Transformers provides a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.

Start by loading your model and specifying the number of expected labels. The label column in this dataset takes the values 0 and 1, so there are two labels:

In [139]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

Downloading:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

<Tip>

You will see a warning about some of the pretrained weights not being used and some weights being randomly
initialized. Don't worry, this is completely normal! The pretrained head of the BERT model is discarded, and replaced with a randomly initialized classification head. You will fine-tune this new model head on your sequence classification task, transferring the knowledge of the pretrained model to it.

</Tip>

### Training hyperparameters

Next, create a [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) object, which contains all the hyperparameters you can tune as well as flags for activating different training options. For this tutorial you can start with the default training [hyperparameters](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments), but feel free to experiment with these to find your optimal settings.

Specify where to save the checkpoints from your training:

In [140]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

Using TensorFlow backend.


### Metrics

[Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) does not automatically evaluate model performance during training. You will need to pass [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) a function to compute and report metrics. The 🤗 Datasets library provides a simple [`accuracy`](https://huggingface.co/metrics/accuracy) metric that you can load with the `load_metric` function (see this [tutorial](https://huggingface.co/docs/datasets/metrics.html) for more information). Newer releases of 🤗 Datasets deprecate `load_metric` in favor of the separate 🤗 Evaluate library, so the import may need adjusting on a recent install:

In [141]:
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

Call `compute` on `metric` to calculate the accuracy of your predictions. Before passing your predictions to `compute`, you need to convert the logits to predictions (remember all 🤗 Transformers models return logits):

In [142]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
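The same computation can be sketched without NumPy or a metric object. For each row of logits, take the index of the largest value as the predicted class, then count how often it matches the label. The helper names and the toy logits below are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in `row`."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
print(accuracy(toy_logits, toy_labels))   # third row is predicted as class 1 but labeled 0
```

Logits do not need to be turned into probabilities first. Softmax preserves ordering, so the argmax is the same either way.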

If you'd like to monitor your evaluation metrics during fine-tuning, specify the `evaluation_strategy` parameter in your training arguments to report the evaluation metric at the end of each epoch:

In [143]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

### Trainer

Create a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) object with your model, training arguments, training and test datasets, and evaluation function:

In [146]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

Then fine-tune your model by calling [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train):

In [148]:
trainer.train()
print("training done")

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, text. If __index_level_0__, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 108
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 42


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.332739 | 0.923077 |
| 2 | No log | 0.388298 | 0.846154 |
| 3 | No log | 0.171042 | 0.923077 |
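The numbers in the training log and results can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, as the log reports. The "No log" entries for training loss most likely arise because the run, at 42 steps, ends before the default logging interval is reached.

```python
import math

num_examples = 108      # "Num examples" in the training log
batch_size = 8          # "Total train batch size"
num_epochs = 3          # "Num Epochs"
eval_examples = 13      # "Num examples" in the evaluation log

steps_per_epoch = math.ceil(num_examples / batch_size)   # the last batch is partial
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

# With 13 evaluation examples, accuracy can only change in increments of 1/13.
print(round(12 / eval_examples, 6), round(11 / eval_examples, 6))
```

The reported accuracies correspond to 12 and 11 correct predictions out of 13, so the dip in epoch 2 reflects a single example.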


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, text. If __index_level_0__, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 13
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, text. If __index_level_0__, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 13
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, text. If __index_level_0__, text are not expected by `BertForSequenceClassification.forward`,  you can safe

training done


<a id='pytorch_native'></a>

Save the fine-tuned model to a local directory so it can be reloaded later:

In [149]:
model.save_pretrained("zf_model")

Configuration saved in zf_model\config.json
Model weights saved in zf_model\pytorch_model.bin


In [None]:
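from transformers import AutoModelForSequenceClassification

# Reloads the original pretrained checkpoint with a fresh, untrained two-label head.
# To reuse the fine-tuned weights instead, load from the "zf_model" directory saved above.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)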