Modified from the Hugging Face tutorial:
https://huggingface.co/docs/transformers/training

In this tutorial, you will fine-tune a pretrained model with the 🤗 Transformers Trainer.

# Preparing data

In [1]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][100]



{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

As you now know, you need a tokenizer to process the text, along with a padding and truncation strategy to handle variable sequence lengths. To process your dataset in one step, use the 🤗 Datasets map method to apply a preprocessing function over the entire dataset:

In [2]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 50000/50000 [00:17<00:00, 2880.52 examples/s]
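
To make the effect of padding="max_length" and truncation=True concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: pad_or_truncate is a hypothetical helper, pad_id=0 and max_length=6 are assumptions chosen for illustration, and the real tokenizer also keeps special tokens such as [SEP] in place when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for padding="max_length" combined with truncation=True."""
    # Truncation: keep at most max_length ids.
    kept = token_ids[:max_length]
    # Padding: fill the remaining positions with pad_id.
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # The mask is 1 for real tokens and 0 for padding positions.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative id sequences: one shorter and one longer than max_length.
short = pad_or_truncate([101, 146, 1400, 102], max_length=6)
long = pad_or_truncate([101, 146, 1400, 112, 1207, 112, 14337, 102], max_length=6)

print(short)  # {'input_ids': [101, 146, 1400, 102, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [101, 146, 1400, 112, 1207, 112], 'attention_mask': [1, 1, 1, 1, 1, 1]}
```

Every example ends up with the same length, which is what lets the examples be stacked into fixed-size batches later.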


In [3]:
print(tokenized_datasets.shape)
print(tokenized_datasets["train"][0].keys())
print(tokenized_datasets["train"][0]['label'])
print(tokenized_datasets["train"][0]['text'])
print(tokenized_datasets["train"][0]['input_ids'])
print(len(tokenized_datasets["train"][0]['input_ids']))
print('Number of paddings:', tokenized_datasets["train"][0]['input_ids'].count(0))  # list.count(value) counts how many times value occurs
print(tokenized_datasets["train"][0]['token_type_ids'])
print(tokenized_datasets["train"][0]['attention_mask'])
print(len(tokenized_datasets["train"][0]['attention_mask']))
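print('Number of non-attention tokens:', tokenized_datasets["train"][0]['attention_mask'].count(0))
# As the output shows, every valid (non-padding) token takes part in attention during training.
# Padding positions are filled with id 0, and their attention mask is also 0.
# input_ids holds token ids: tokenization returns integer ids rather than token strings.
# The ids are indices into the model's embedding table, not embeddings themselves.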

{'train': (650000, 5), 'test': (50000, 5)}
dict_keys(['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'])
4
dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.
[101, 173, 1197, 119, 2284, 2953, 3272, 1917, 178, 1440, 1111, 1107, 170, 1704, 22351, 119, 1119, 112, 188, 3505, 1105, 3123, 1106, 2037, 1106, 1443, 1217, 10063, 4404, 132, 1119, 112, 188, 1579, 1113, 1159, 1107, 3195, 1117, 4420, 132, 1119, 112, 188, 6559, 1114, 170, 1499, 118, 23555, 2704, 113, 183, 9379, 114, 1134, 1139, 2153, 1138, 371
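
The input_ids printed above are integer indices, not vectors. A minimal sketch of the difference, assuming a made-up five-entry vocabulary and a random 4-dimensional embedding table (both hypothetical; a real BERT model learns its table and has a far larger vocabulary):

```python
import random

# Hypothetical toy vocabulary; the ids here do not match BERT's.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "good": 3, "food": 4}
hidden_size = 4

# Embedding table: one vector per id (random here, learned in a real model).
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in vocab]

tokens = ["[CLS]", "good", "food", "[SEP]"]
input_ids = [vocab[t] for t in tokens]                 # what a tokenizer returns
embeddings = [embedding_table[i] for i in input_ids]   # what the model looks up

print(input_ids)                            # [1, 3, 4, 2]
print(len(embeddings), len(embeddings[0]))  # 4 4
```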

In [4]:
print(tokenized_datasets.shape)
print(tokenized_datasets["test"][0].keys())
print(tokenized_datasets["test"][0]['label'])
print(tokenized_datasets["test"][0]['text'])
print(tokenized_datasets["test"][0]['input_ids'])
print(len(tokenized_datasets["test"][0]['input_ids']))
print('Number of paddings:', tokenized_datasets["test"][0]['input_ids'].count(0))  # list.count(value) counts how many times value occurs
print(tokenized_datasets["test"][0]['token_type_ids'])
print(tokenized_datasets["test"][0]['attention_mask'])
print(len(tokenized_datasets["test"][0]['attention_mask']))
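print('Number of non-attention tokens:', tokenized_datasets["test"][0]['attention_mask'].count(0))
# As the output shows, every valid (non-padding) token takes part in attention during training.
# Padding positions are filled with id 0, and their attention mask is also 0.
# input_ids holds token ids: tokenization returns integer ids rather than token strings.
# The ids are indices into the model's embedding table, not embeddings themselves.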

{'train': (650000, 5), 'test': (50000, 5)}
dict_keys(['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'])
0
I got 'new' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \nI took the tire over to Flynn's and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he'd give me a new tire \"this time\". \nI will never go back to Flynn's b/c of the way this guy treated me and the simple fact that they gave me a used tire!
[101, 146, 1400, 112, 1207, 112, 14337, 1121, 1172, 1105, 1439, 1160, 2277, 1400, 170, 3596, 119, 146, 1261, 1139, 1610, 1106, 170, 1469, 19459

In [5]:
# Get a smaller dataset if needed
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
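
The shuffle(seed=42).select(range(1000)) pattern draws a reproducible random subset. The idea can be sketched in plain Python as below; shuffled_subset is a hypothetical helper, and 🤗 Datasets uses its own random generator, so the actual rows it picks will differ from this toy version.

```python
import random


def shuffled_subset(rows, n, seed):
    """Shuffle row indices with a fixed seed, then keep the first n rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    return [rows[i] for i in indices[:n]]


rows = [f"review-{i}" for i in range(10)]
first = shuffled_subset(rows, 3, seed=42)
second = shuffled_subset(rows, 3, seed=42)

# The same seed yields the same subset on every run.
print(first == second)  # True
```

Fixing the seed matters when comparing runs: a change in accuracy can then be attributed to the hyperparameters rather than to a different sample.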

# Training

🤗 Transformers provides a Trainer class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The Trainer API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.

Start by loading your model and specifying the number of expected labels. From the Yelp Review dataset card, you know there are five labels:



In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


You will see a warning about some of the pretrained weights not being used and some weights being randomly initialized. Don’t worry, this is completely normal! The pretrained head of the BERT model is discarded, and replaced with a randomly initialized classification head. You will fine-tune this new model head on your sequence classification task, transferring the knowledge of the pretrained model to it.



Next, create a TrainingArguments object, which contains all the hyperparameters you can tune as well as flags for activating different training options. For this tutorial you can start with the default training hyperparameters, but feel free to experiment with these to find your optimal settings.

Specify where to save the checkpoints from your training:  
(Open question: when fine-tuning, how do I choose which layers are tuned?)

In [7]:
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test_trainer")
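
On the question of which layers get tuned: by default all parameters are trainable. One common approach is to switch off gradients for the parameters you want frozen before creating the Trainer. The sketch below assumes a model that exposes named_parameters() the way a PyTorch module does; freeze_except and StubModel are hypothetical names introduced only so the example runs without downloading a model, and the parameter names are illustrative.

```python
from types import SimpleNamespace


def freeze_except(model, trainable_prefixes):
    """Disable gradients for every parameter whose name lacks a listed prefix."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(tuple(trainable_prefixes))
        if param.requires_grad:
            trainable.append(name)
    return trainable


class StubModel:
    """Hypothetical stand-in that mimics named_parameters() of a PyTorch module."""

    def __init__(self, names):
        self._params = {n: SimpleNamespace(requires_grad=True) for n in names}

    def named_parameters(self):
        return self._params.items()


stub = StubModel([
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.11.output.dense.weight",
    "classifier.weight",
    "classifier.bias",
])

# Keep only the classification head trainable.
trainable_names = freeze_except(stub, ["classifier."])
print(trainable_names)  # ['classifier.weight', 'classifier.bias']
```

With the real model, the same call would presumably be freeze_except(model, ["classifier."]); adding a prefix such as "bert.encoder.layer.11." would also leave the last encoder layer trainable. Freezing most of the network also reduces the memory needed for gradients and optimizer state.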

Evaluation  
Trainer does not automatically evaluate model performance during training. You’ll need to pass Trainer a function to compute and report metrics. The 🤗 Evaluate library provides a simple accuracy function you can load with evaluate.load (see the 🤗 Evaluate quicktour for more information):



In [8]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
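
To see what compute_metrics does with the logits, here is a plain-Python equivalent of the argmax-then-accuracy step on a made-up batch of four reviews and five classes. The logits and labels are invented for illustration, and argmax and accuracy are hypothetical helpers rather than library functions.

```python
def argmax(row):
    """Index of the largest value in a list (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Invented logits: one row per review, one column per star rating (0 to 4).
logits = [
    [0.1, 0.2, 0.3, 2.5, 0.0],  # predicts class 3
    [1.9, 0.2, 0.1, 0.0, 0.3],  # predicts class 0
    [0.0, 0.1, 0.2, 0.3, 3.1],  # predicts class 4
    [0.5, 2.2, 0.1, 0.0, 0.4],  # predicts class 1
]
labels = [3, 0, 4, 2]

result = accuracy(logits, labels)
print(result)  # {'accuracy': 0.75}
```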

If you’d like to monitor your evaluation metrics during fine-tuning, specify the evaluation_strategy parameter (renamed eval_strategy in newer 🤗 Transformers releases) in your training arguments to report the evaluation metric at the end of each epoch:



In [9]:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

Create a Trainer object with your model, training arguments, training and test datasets, and evaluation function:



In [10]:
trainer = Trainer(
    model=model,
    args=training_args, 
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Then fine-tune your model by calling train():



In [11]:
trainer.train()

  0%|          | 0/375 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacty of 3.95 GiB of which 50.69 MiB is free. Including non-PyTorch memory, this process has 3.88 GiB memory in use. Of the allocated memory 3.33 GiB is allocated by PyTorch, and 47.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
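
The 375 steps in the progress bar are consistent with 1000 training examples, a per-device batch size of 8, and 3 epochs on one GPU. Those are the TrainingArguments defaults, and the notebook does not override them. A usual first response to the out-of-memory error is a smaller per-device batch combined with gradient accumulation, which keeps the effective batch size unchanged. The arithmetic can be sketched as follows; total_optimizer_steps is a hypothetical helper that rounds partial groups up, and the Trainer's own rounding may differ when the numbers do not divide evenly.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, epochs=3,
                          grad_accum_steps=1, num_devices=1):
    """Estimate the number of optimizer updates over a whole training run."""
    # Mini-batches seen per epoch across all devices.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update happens every grad_accum_steps mini-batches.
    updates_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum_steps))
    return updates_per_epoch * epochs


# Defaults assumed for this run: batch size 8, 3 epochs, one GPU.
default_steps = total_optimizer_steps(1000, per_device_batch_size=8)

# Lower-memory variant: 2 examples per step, accumulated over 4 steps (effective batch 8).
low_memory_steps = total_optimizer_steps(1000, per_device_batch_size=2, grad_accum_steps=4)

print(default_steps, low_memory_steps)  # 375 375
```

In TrainingArguments these settings correspond to per_device_train_batch_size=2 and gradient_accumulation_steps=4. Whether that fits on this roughly 4 GiB GPU is untested here. Tokenizing with a shorter max_length, or freezing part of the model, are other options worth trying.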