# Getting Started with Fine-Tuning in Hugging Face Transformers

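This example walks through the main steps of fine-tuning a model with Transformers:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- A brief introduction to the Trainer
- Running the training
- Saving the model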

## The YelpReviewFull Dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

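### Dataset Summary

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

### Supported Tasks and Leaderboards
Text classification, sentiment classification: the dataset is mainly used for text classification, that is, predicting the sentiment of a given text.

### Languages
The reviews are written mainly in English.

### Dataset Structure

#### Data Instances
A typical data point comprises a text and the corresponding label.

An example from the YelpReviewFull test set looks as follows: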

```json
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

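#### Data Fields

- 'text': The review text is escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
- 'label': The score associated with the review (1 to 5 stars). In the loaded dataset it is stored as a class index from 0 to 4, as the `'label': 0` example above shows.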
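
To make these field conventions concrete, here is a minimal sketch that turns the escaped sequences back into real characters and converts a class index into a star rating. It assumes the loaded text contains literal `\n` and `\"` sequences, as the example instance above suggests, and that the star rating is the class index plus one. The helper names `unescape_review` and `label_to_stars` are illustrative only and are not part of the dataset or the `datasets` library.

```python
def unescape_review(text: str) -> str:
    """Replace literal backslash escapes in a raw review with real characters."""
    # The raw text stores a newline as the two characters '\' and 'n',
    # and an inner double quote as '\' followed by '"'.
    return text.replace("\\n", "\n").replace('\\"', '"')


def label_to_stars(label: int) -> int:
    """Map a 0-based class index (0..4) to a star rating (1..5)."""
    if not 0 <= label <= 4:
        raise ValueError(f"label must be in 0..4, got {label}")
    return label + 1


# A made-up review written with the same escaping as the dataset text.
raw = 'Great tacos.\\nThe \\"house\\" salsa was the best part.'
clean = unescape_review(raw)
print(clean)
print(label_to_stars(0))
```

Whether to unescape before tokenization is a modeling choice; this notebook tokenizes the raw text as it is.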

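#### Data Splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 testing samples for each star rating from 1 to 5. In total there are 650,000 training samples and 50,000 testing samples.

## Downloading the Dataset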

In [2]:
!pip install datasets

In [6]:
!pip show datasets
!pip show transformers

Name: datasets
Version: 3.2.0
Summary: HuggingFace community-driven open-source library of datasets
Home-page: https://github.com/huggingface/datasets
Author: HuggingFace Inc.
Author-email: thomas@huggingface.co
License: Apache 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: aiohttp, dill, filelock, fsspec, huggingface-hub, multiprocess, numpy, packaging, pandas, pyarrow, pyyaml, requests, tqdm, xxhash
Required-by: 
Name: transformers
Version: 4.47.1
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft

In [None]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [None]:
dataset["train"][111]

{'label': 2,
 'text': "As far as Starbucks go, this is a pretty nice one.  The baristas are friendly and while I was here, a lot of regulars must have come in, because they bantered away with almost everyone.  The bathroom was clean and well maintained and the trash wasn't overflowing in the canisters around the store.  The pastries looked fresh, but I didn't partake.  The noise level was also at a nice working level - not too loud, music just barely audible.\\n\\nI do wish there was more seating.  It is nice that this location has a counter at the end of the bar for sole workers, but it doesn't replace more tables.  I'm sure this isn't as much of a problem in the summer when there's the space outside.\\n\\nThere was a treat receipt promo going on, but the barista didn't tell me about it, which I found odd.  Usually when they have promos like that going on, they ask everyone if they want their receipt to come back later in the day to claim whatever the offer is.  Today it was one of th

In [None]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [None]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(dataset["train"])

index,label,text
0,4 stars,The food was awsome the service was slow our servers was great until they sat him a party 12 and waited it on it alone shame on management that table all had bottled water and 3 different bottles wine and appetizers we were then ignored and forgot about the fod was awsome the flat iron was great the the chicken was 2 huge breast all food was seasoned and the sauses amazing our server forgot us and our little 30 to 40 dollar tip he forgot about he was going for the 400 tip hope he got it he did not get much from me dinner for 2 will run u 200 but it is worth it
1,4 stars,"This review is a little overdue (I went over two months ago) but that doesn't take away any of the fun my family and I had while we were there. We found Vertuccio Farms on Groupon and they were offering a great deal for the whole family we couldn't pass up. So one Saturday morning many weeks back we took a drive out to Gilbert to visit the farm. I think we went the last weekend that it was open for the season. \n\nThe place was not very crowded, which I liked. It was however pretty muddy and wet because of the rain that had fallen before we came. The staff was trying to keep the mud under control with hay but it wasn't working out so well. But it wasn't so bad that we couldn't have a good time. \n\nThe very first thing we did when we walked in was ride the barrel train around the farm. You get one ride included in your ticket price. This was pretty fun, though my motion sickness acted up (I am pretty sensitive about moving in things that don't use my legs). So we got to see all of the activities available without having to walk around. And we did pretty much all of the rest of the activities while we were there. There were bouncy horses, which I did not participate in but the rest of my family thought they were hysterical. Some pedal bikes, a large bouncy area for kids, a small playground, farm animal feedings. There is a thing they call a \""pizza farm\"". From the website: \""What is a Pizza Farm? The pizza farm is a one-half acre circular garden that is divided into eight pie shaped \""slices\"" (like a pizza) and grows or grazes all the ingredients needed to make a farm fresh pizza! You will learn how farmers grow wheat for the crust, tomatoes for the sauce..dairy cows or goats that give milk to make cheese, pigs for pepperoni, and much more!\"" This was pretty fun. I always like to look at growing vegetables and herbs. There is even a large 10 acre corn maze that has a different design every year. 
\n\nWe all had a pretty good time here and would come back again in the future. My three year old niece was pretty excited by everything here. So not only a good time for kids, it was fun for the four adults in our party as well."
2,3 stars,"My wife and I really enjoyed the food and the salsa. We also had very good service and did not experience any of the negative things that I had seen on Yelp. That being said, I have to mark them down because their prices are absurd! It's Mexican food- tortillas, chicken, beans, and cheese... You can't charge me $13 for that! That is more than my steak, salad, and side at Outback... Twice the price of a meal at Costa Vida (which is a Mexican grill in the same strip mall)!"
3,1 star,"Just awful... mediocre dishes for an over-rated dining experience.\n\nPerhaps this dates me, but if you slightly remember the show \""In Living Color,\"" their service gets a big fat... HATED IT!\n\nRead my reviews and you'd know that I've had my fair share of 5-diamond, 4-figure dining experiences for two. This certainly was no 5-diamond, perhaps a cubic zirconia would be more appropriate.\n\nWhy, you ask? ...Well, overall, there was a lack luster of service. Aside from Robert Smith, the master sommelier, which seemed to make us feel like he really cared, no one gave us the time of day. (Definitely not typical of 18-point Gault Millau, 2-star Michelin caliber.)\n\nWe had reservations for 6:30, for a party of three, and requested for a lakeview, upon calling. Instead, we got the closest thing to the door, with the third person facing the entry so all the world could see what he was eating. l noticed that there were several available tables for the same number seating. So, I asked the hostess for a different table. She immediately said, we have nothing available. (Whaa? Not even a \""let me go see what I can do\"" and pretend to ask the head host.) So, we asked them to move the third seat to against the entry, which would somewhat enclose our \""eating circle.\""\n\nWe went for the $110 tasting and $65 per person pairing. OMG! From one plate to the next, they were pushy. This gave us the impression that it wasn't a 5-diamond experience, but a cheap Chinese restaurant that's constantly trying to get you out of the door and pay your bill ASAP. \n\nOne person didn't know what the other was doing, forgetting requests like lime or a sherry with dessert. If it wasn't for the \""break\"" we requested before dessert and stepped out onto the patio for a full fountain show and Cuban cigars, we would have choked.\n\nLastly, or perhaps, I should have started with this. 
I couldn't even believe that they DIDN'T ASK \""Does anyone have any allergies?\"" ...which is pretty standard for a restaurant of that gauge. I had to walk over to our table waiter (yes, I had to walk over because they never seemed to stick around long enough to listen or even noted that you were giving them the eye) and inform him that my boyfriend had a nut allergy.\n\nTo cap off this dinner, I inquired with the Bellagio on whom was the current manager. They informed me that it was Gilles Kolakowski, to which I wrote a letter of my displeasure. And, much to my dismay, we didn't even get an apology. In fact, I never heard back from them.\n===============================================\n\nOh, and if I may judged it strictly on the plate, it's still a one star.\n\nFirst course: The lobster salad was OK, absolutely nothing spectacular.\n\nSecond course: The seared scallop was cooked well, but the potato sail was uncomplimentary in flavour... perhaps Chef Serrano placed it there for texture, but it was more distracting in flavour. ...And, that's not good!\n\nThird course: I admit they do have the best foie gras in Vegas. But, that's all.\n\nFourth course: The bass was over cooked, very tough. And, I can't believe that they served this course in a 5-in tall (plain white) bowl! The fish had a sauce, but it wasn't soup. Thoughts like, \""Did they run out of plates?!\"" ran through my head.\n\nFifth course: Ordered \""the custard\"" trusting the chef and his staff to redeem themselves. However, unsurprising, the dish gave all my senses the impression that it was decadent flourless chocolate cake. Obviously, they gave me the wrong order.\n\nDECADENCE DOES NOT SUBSTITUTE FOR INCOMPETENCE.\nSave yourself the few hundred dollars and hit the tables instead. The hospitality and service is better there, tenfold.\n\n===============================================\nSeveral people tell me that they've gone and had a better experience. All I have to say to that is... 
the mark of a good restaurant is consistency. Food and service should always be top notch, and I didn't find either on my visit.\n\nAlso, I read that Robert Smith was nominated by the James Beard Foundation for best sommelier this year. He didn't win, but perhaps this awful restaurant had something to do with it.... lmao."
4,4 stars,"I should begin by saying that I really want to like Spokes. Locally-owned, bicycle-friendly bar with a great selection of beer that is just blocks from my house and free of the Mill Avenue crowd. The all-day happy hour on Saturday is great, too.\n\nThat being said, I think it is just going to have to be a bar for me. I've been on a handful of occasions now, and have been pleased with my beer experience every time. However, the two times that I have gotten food, I've been disappointed with the quality and price. (First, a grilled cheese with caramelized onion and pear, second, a salmon (no B)LT). Namely, the two sandwiches I purchased had too little flavor and too much grease; I felt like I had to wipe my hands and face after each bite. My friends at dinner each of these nights echoed this sentiment, with sirloin tacos and a burger, respectively.\n\nI'll keep coming back for your beer, Spokes on Southern, but no more food for me until the menu improves."
5,3 stars,"Solid restaurant. A gringo wonderland. I have lost my license and needed a dinner spot close enough to my hotel that I wouldn't run the risk of getting pulled over (i'm a hysterical driver...watch out Phoenix) and Ajo Al's reviews were pretty decent so I headed down to get some fish tacos. \n\nI was scared about anything saying \""Del Mar\"" after my Frank & Lupe's experience but though the Hawaiian Ono in the Tacos del mar ($11.95) was previously frozen, it was an enjoyable experience. I've never had Ono and to me the texture was like that of swordfish, which is just fine for a tiny soft taco. It was a little chewy, but again that's because it's frozen. The beans and rice were okay, and there was some sauce for the tacos that was okay too...not exciting enough for me to remember what the sauce was though, or what it tasted like. \n\nWhen seated you will be presented with the world's largest basket of ultra thin tortilla chips you've ever seen, whether you are one person or 6. They were a little greasier than i prefer BUT they were warm and that makes them yummier. \n\nI don't think I would go back here but I don't regret trying it. It was a good meal and the servers are very friendly and welcoming."
6,2 star,"I thought it was not very impressive. We met a promoter on the strip who asked for a tip of $10 to get us into this club for free. Turned out that the same coupons for ladies were free everywhere else and the male coupon was refused by the club.\nThe billboards for the club advertised 3 rooms but when we went there on Thursday, 2 of them were closed!!!\nDrinks are in the $10-$15 range, which some think is normal for Las Vegas, but I personally think is way overpriced.\nMusic and DJs were OK, but the light was not very impressive.\nAnother thing to consider is that smoking is allowed everywhere in the club and you are going to inhale smoke all night!"
7,1 star,"Probably one of the worst dining experiences ever! We, a party of four, came for AZ Restaurant week. We waited about 30 minutes before our order was taken, much too long. We were brought a relish tray that was really meant for 2 but there were 4 of us. Odd. 2 of us opted for the PrixFix menu. The brie in puff pastry was really very undercooked, and appeared the kitchen was rushed just to get it out, a revelation of what was to come. The $5.95 upcharge for caesar salad and oddly placed anchovies on the edge of the plate was just weird. Given the so called classy-ness of this place, they should have been mixed in with the dressing. The pork chop was ordered medium and specifically so to the waitress that it not be dry. It came out very dry, and nearly un-edible. Prime rib was very thick and exceptionally dry. Both the pork chop and prime rib had to be taken off the bill. Our waitress was not engaged for most of the dinner, and was extremely busy for service...way too busy to attend to her diners. Management was apologetic enough to comp the meals. This place is very drab, dark and absolutely stuck in the era in which it was opened. The same red velvet wallpaper is on the walls, the clientele considerably older, as if the only patrons are the very ones who have kept the place going since the beginning. Once these patrons are dead this place will be too. Highly overrated, highly overpraised, way too noisy and highly overpriced. This is NOT a place for people who expect and enjoy a very high standard of food quality, service and ambience. I pity the diner who does not heed this review...eat at your own risk but don't say i didn't warn you."
8,5 stars,"Mon dieu ~ 100th review!\n\nDecadent. Amazing. Indulgent. Memorable. Delectable. These adjectives were among the many that swirled around in my head as I enjoyed each course of L'Atelier's Seasonal Discovery Menu.\n\nOne of the highlights of our recent R&R jaunt to Vegas was definitely this wonderful meal. We went all out and got the wine pairings as well, and I must say, it was worth the arm & leg we paid. The open kitchen, color theme, and lighting were great and helped elevate our anticipation. The sommelier, our main, and subsequent various servers (a different one for each course) were all quite professional and charming.\n\nL'AMUSE-BOUCHE - Avocado and cilantro flavored grapefruit gelee. Tiny in size - but big in flavor. Complex and delicious, loved the avo, cilantro, grapefruit combo.\n\nLE CRABE ROYAL - King crab on a turnip disc with a sweet and sour sauce. Interesting play on 'ravioli'. Another beautiful & delicious dish, plating was divine.\n\nLA SAINT-JACQUES - Sea scallop cooked in the shell with chive oil. Peerless quality - so delicious, scallops are one of my favorite foods, and this was one of the best I've had.\n\nLA CEBETTE - White onion tart with smoked bacon and asparagus. LOVED this - couldn't stop ooh'ing and ahh'ing as I ate this one. Can you say tears of joy?? Wish I had the metabolism and time to sit around and eat these tarts all day. Hub made a quite spectacular version at home the following weekend (including butter poached scallops on top). \n\nLE FOIE GRAS - Duck foie gras with confit kumquats. So good...and SO rich. Never been a fan of kumquats - but excuuse me because this dish would not have been the same without them.\n\nLA SOLE - Dover sole filet, baby leek with ginger. A very generous portion of perfectly cooked sole. Made me want hot steam rice!\n\nL'AGNEAU - Lamb shoulder confit and steam garbanzo beans - this was my main dish choice. 
The lamb was cooked down and the tender pieces were molded into a small cake-like portion, surrounded by the most vibrant green garbanzo beans, served in a tiny tagine, a beautiful dish.\n\nLA CAILLE - Foie gras stuffed free-range quail with truffled-mashed potatoes - this was the hub's main dish choice. The quail was perfectly cooked and was delicious together with the potatoes.\n\nLA PECHE - Summer peaches on basil sable, coconut milk emulsion. Super fancy presentation, lovely hard sugar candy disc perched atop the dish with a glittery patterned sheen - ALMOST too pretty to eat.\n\nLA FRAISE - White chocolate ice cream on an almond panna cotta, fresh strawberries and mint. Perfect combination of flavors and textures. Light and satisfying dessert.\n\nTHE WINE - We had champagne, whites, reds, and ice wine - all from France, and all to die for.\n\nLE CAFE - Coffee or Espresso with one dark chocolate candy. Perfect ending to dinner.\n\nIf you truly love food, this is what it's all about ~ delicious, fresh, beautiful flavors - prepared with passion, and enjoyed with a loved one! Cheers!"
9,4 stars,"(Falsetto) - \"" Sherry. Sherry Baby .Sherry. Sherry Baby . Sheeeeeery Baaaaby. Sheeeery Won't you come out tonight\""\n\nWho doesn't like Frankie Valli And the 4 Seasons? This type of music is timeless and any one who has good taste would appreciate it. \n\nBacked by several Tony awards this show kicked Broadway ass! Once you survive the first fifteen minutes it gets better, I promise. The show ran a little bit over 2 hours and I did not even notice it. It is a musical biography of the group and how they got started. The format was interesting and different. The cast was great and they all had to be versatile. The music was awesome but no one can be Frankie Valli, he is irreplaceable. However, the vocalists did a terrific job with their songs. This show is exceptionally good. Bravo!"


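## Preprocessing the Data

After downloading the dataset locally, use a tokenizer to process the text. Because the inputs vary in length, padding and truncation strategies are used to bring them to a uniform length.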
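
As a rough illustration of what padding and truncation do to a token sequence, the sketch below forces a list of token ids to a fixed length and builds the matching attention mask. It is a simplified stand-in, not the tokenizer's actual implementation: the function name `pad_or_truncate` is hypothetical, `pad_id=0` is an assumption chosen to match BERT's `[PAD]` token, and, unlike a real tokenizer, this version does not preserve the closing `[SEP]` token when it cuts a sequence.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Force a token sequence to exactly max_length by cutting or padding."""
    # Truncation: drop everything past max_length.
    ids = token_ids[:max_length]
    # The attention mask marks real tokens with 1 and padding with 0.
    mask = [1] * len(ids)
    # Padding: append pad_id until the sequence reaches max_length.
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,
    }


short_seq = pad_or_truncate([101, 6433, 102], max_length=5)
long_seq = pad_or_truncate([101, 6433, 2450, 1111, 1103, 4129, 102], max_length=5)
print(short_seq)
print(long_seq)
```

The short sequence gains two padding positions with mask value 0, while the long one is cut to five tokens and keeps a mask of all ones.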

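The `map` method of Datasets applies a preprocessing function to the entire dataset in one call.

Below, the whole dataset is processed with the pad-to-maximum-length strategy: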

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [None]:
show_random_elements(tokenized_datasets["train"], num_examples=1)

index,label,text,input_ids,token_type_ids,attention_mask
0,3 stars,"Standard location for the chain. Orange chicken was cooked well, the rice was a touch undercooked a little crunchy in parts, the chicken egg roll was tasty. Staff was friendly and moved the line quickly and easily.\n\nThe place was very clean even though it was right around lunch time and busy. They kept up on the cleaning while we ate as well after patrons would leave and the tables weren't up to par. \n\nOverall: B+\nIt's not gourmet (I think you probably knew that before heading in), but it's a good lunch stop! =)","[101, 6433, 2450, 1111, 1103, 4129, 119, 6309, 9323, 1108, 13446, 1218, 117, 1103, 7738, 1108, 170, 2828, 1223, 2528, 27499, 170, 1376, 172, 22715, 1183, 1107, 2192, 117, 1103, 9323, 9069, 5155, 1108, 27629, 13913, 119, 5949, 1108, 4931, 1105, 1427, 1103, 1413, 1976, 1105, 3253, 119, 165, 183, 165, 183, 1942, 4638, 1282, 1108, 1304, 4044, 1256, 1463, 1122, 1108, 1268, 1213, 5953, 1159, 1105, 5116, 119, 1220, 2023, 1146, 1113, 1103, 9374, 1229, 1195, 8756, 1112, 1218, 1170, 14645, 1156, 1817, 1105, 1103, 7072, 3920, 112, 189, 1146, 1106, 14247, 119, 165, 183, 165, 183, 2346, 4121, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


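### Sampling the Data

We use 1,000 samples to demonstrate small-scale training of BERT (with the PyTorch-based Trainer).

The `shuffle()` function randomly reorders the rows of the dataset. If you want more control over the algorithm used to shuffle the dataset, pass the `generator` parameter to use a different `numpy.random.Generator`.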

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
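
The idea behind `shuffle(seed=42).select(range(1000))` can be sketched without the library: permute the row indices with a seeded generator, then keep the first 1,000. This is a conceptual analogy only. It uses Python's standard `random` module rather than the NumPy generator that `datasets` relies on, so the indices it picks will not match the ones selected above, and `shuffled_subset` is a hypothetical helper name.

```python
import random
from typing import List


def shuffled_subset(num_rows: int, subset_size: int, seed: int) -> List[int]:
    """Return the first subset_size row indices of a seeded permutation."""
    indices = list(range(num_rows))
    # A dedicated Random instance keeps the result reproducible and
    # isolated from the global random state.
    random.Random(seed).shuffle(indices)
    return indices[:subset_size]


a = shuffled_subset(num_rows=650_000, subset_size=1000, seed=42)
b = shuffled_subset(num_rows=650_000, subset_size=1000, seed=42)
print(a == b)        # the same seed reproduces the same subset
print(len(set(a)))   # every selected row index is distinct
```

Fixing the seed is what makes the small training and evaluation subsets reproducible across runs.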

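## Fine-Tuning Configuration

### Loading the BERT Model

Loading the model prints a warning that some weights (`classifier.weight` and `classifier.bias`) are newly initialized rather than loaded from the checkpoint. This is entirely normal when fine-tuning: the head used for the pretraining task is dropped and replaced with a new classification head for which there are no pretrained weights. The library therefore warns that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.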

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training Hyperparameters (TrainingArguments)

Full list of parameters and their default values: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source code definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**Most important setting: the path where model weights are saved (`output_dir`)**

In [None]:
from transformers import TrainingArguments

model_dir = "models/bert-base-cased-finetune-yelp"

# logging_steps defaults to 500; given the size of our training data and the resulting number of steps, set it to 100
# Fix for "You must call wandb.init() before wandb.log()": set report_to="none" to disable WandB
training_args = TrainingArguments(output_dir=model_dir,
                                  per_device_train_batch_size=16,
                                  num_train_epochs=5,
                                  report_to="none",
                                  logging_steps=100)
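
The choice of `logging_steps=100` follows from simple arithmetic on the number of optimizer steps. Here is a rough sketch, assuming the 1,000-sample training subset selected earlier, a single device, and no gradient accumulation. The function names are illustrative and are not part of Transformers.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of optimizer steps in one epoch; the last batch may be smaller."""
    return math.ceil(num_samples / batch_size)


def total_steps(num_samples: int, batch_size: int, num_epochs: int) -> int:
    """Total optimizer steps over the whole run."""
    return steps_per_epoch(num_samples, batch_size) * num_epochs


def step_logs(total: int, logging_steps: int) -> int:
    """How many times a 'log every logging_steps steps' rule fires."""
    return total // logging_steps


run_steps = total_steps(num_samples=1000, batch_size=16, num_epochs=5)
print(run_steps)                   # 63 steps per epoch times 5 epochs
print(step_logs(run_steps, 500))   # with the default interval
print(step_logs(run_steps, 100))   # with the interval chosen above
```

With the default of 500, the run would finish before a single logging interval elapses, which is why a smaller value is used. The same arithmetic with `num_train_epochs=3` gives 189 steps, matching the `global_step=189` in the training output later in this notebook.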

In [None]:
# Full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=no,
eval_use_gather_object=F

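### Evaluating Metrics During Training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** gives access, in a single line of code, to dozens of evaluation methods across domains (natural language processing, computer vision, reinforcement learning, and more). The full list of currently supported metrics is at **https://huggingface.co/evaluate-metric**.

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy metric that you can load with the `evaluate.load` function: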

In [None]:
!pip install evaluate

In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

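Next, call the metric's `compute` method to calculate the accuracy of the predictions.

Before passing the predictions to `compute`, we need to convert the logits into predicted class labels (**all Transformers models return logits**).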

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
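
To see what this conversion does on concrete numbers, here is a dependency-free sketch of the same two steps: take the index of the largest logit in each row, then compare against the reference labels. It mirrors the logic of `compute_metrics` but is not the Evaluate implementation. The names `argmax` and `accuracy_from_logits` and the logit values are made up for illustration.

```python
from typing import Dict, List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in one row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits: List[Sequence[float]], labels: List[int]) -> Dict[str, float]:
    """Turn per-class logits into predictions and score them against labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four made-up examples with five classes each (one logit per star rating).
logits = [
    [2.1, 0.3, -1.0, -0.5, -2.0],   # largest logit at index 0
    [-1.2, 0.1, 1.9, 0.4, -0.3],    # largest logit at index 2
    [-2.0, -1.0, 0.2, 1.1, 3.0],    # largest logit at index 4
    [0.5, 1.5, 0.1, -0.2, -1.0],    # largest logit at index 1
]
labels = [0, 2, 3, 1]
result = accuracy_from_logits(logits, labels)
print(result)
```

Three of the four predictions match their labels, so the reported accuracy is 0.75. No softmax is needed because the largest logit is also the most probable class.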

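#### Monitoring Metrics During Training

To monitor how evaluation metrics change during training, set the `eval_strategy` parameter of `TrainingArguments` (named `evaluation_strategy` in older Transformers releases) so that metrics are reported at the end of each epoch.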

In [None]:
from transformers import TrainingArguments, Trainer
# Fix for "You must call wandb.init() before wandb.log()": set report_to="none" to disable WandB
training_args = TrainingArguments(output_dir=model_dir,
                                  # named `evaluation_strategy` in older Transformers releases
                                  eval_strategy="epoch",
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3,
                                  report_to="none",
                                  logging_steps=30)



## Starting the Training

### Instantiating the Trainer

`kernel version` issue: it does not currently affect running this example code.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

## Monitoring GPU Usage with nvidia-smi

To watch GPU usage in real time, poll `nvidia-smi` with the `watch` command: `watch -n 1 nvidia-smi`:

```shell
Every 1.0s: nvidia-smi                                                   Wed Dec 20 14:37:41 2023

Wed Dec 20 14:37:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     18395      C   /root/miniconda3/bin/python                6660MiB |
+---------------------------------------------------------------------------------------+
```
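
When the same numbers are needed inside a script rather than in a terminal, `nvidia-smi` can be asked for machine-readable output through its `--query-gpu` and `--format=csv,noheader,nounits` options. The sketch below is one possible way to read it; the helper names are hypothetical. It assumes every queried field is reported as an integer, so a field shown as `[N/A]` would need extra handling.

```python
import subprocess
from typing import Dict, List

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_output: str) -> List[Dict[str, int]]:
    """Parse one CSV line per GPU into a dictionary of integer readings."""
    stats = []
    for line in csv_output.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({
            "util_percent": util,
            "memory_used_mib": used,
            "memory_total_mib": total,
        })
    return stats


def query_gpus() -> List[Dict[str, int]]:
    """Run nvidia-smi once; requires an NVIDIA driver on the machine."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(output)


# Parse a sample line built from the values in the snapshot above.
sample = "98, 6665, 15360\n"
print(parse_gpu_stats(sample))
```

Calling `query_gpus()` in a loop with a short sleep gives roughly the same polling behavior as `watch -n 1`.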

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,1.5226,1.306708,0.43
2,1.071,0.976997,0.581
3,0.6962,0.980735,0.591


TrainOutput(global_step=189, training_loss=1.1280765155005077, metrics={'train_runtime': 390.2205, 'train_samples_per_second': 7.688, 'train_steps_per_second': 0.484, 'total_flos': 789354427392000.0, 'train_loss': 1.1280765155005077, 'epoch': 3.0})

In [None]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

In [None]:
trainer.evaluate(small_test_dataset)

{'eval_loss': 1.0324125289916992,
 'eval_accuracy': 0.54,
 'eval_runtime': 3.1879,
 'eval_samples_per_second': 31.369,
 'eval_steps_per_second': 4.078,
 'epoch': 3.0}

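### Saving the Model and Training State

- Use the `trainer.save_model` method to save the model; it can later be reloaded with `from_pretrained()`.
- Use the `trainer.save_state` method to save the training state.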

In [None]:
trainer.save_model(model_dir)

In [None]:
trainer.save_state()
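
Reloading the saved model later might look like the sketch below. It assumes the model was saved to the `model_dir` path used in this notebook. Because the Trainer here was not given a tokenizer, it also assumes no tokenizer files were written there, so the tokenizer is loaded from the base `bert-base-cased` checkpoint. The helper names `load_finetuned` and `predict_stars` are illustrative, and the rating is taken to be the predicted class index plus one.

```python
def load_finetuned(model_dir: str = "models/bert-base-cased-finetune-yelp"):
    """Reload the fine-tuned classifier saved by trainer.save_model."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    # Assumption: no tokenizer was saved in model_dir, so reuse the base one.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    return model, tokenizer


def predict_stars(text: str, model, tokenizer) -> int:
    """Return the predicted star rating (1..5) for a single review."""
    import torch

    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Class indices run from 0 to 4, so add 1 to get a star rating.
    return int(logits.argmax(dim=-1).item()) + 1
```

A typical call would be `model, tokenizer = load_finetuned()` followed by `predict_stars("The food was great.", model, tokenizer)`.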

In [None]:
# trainer.model.save_pretrained("./")

## Homework: Train on the full YelpReviewFull dataset and see how high the accuracy can go