# Fine Tune

**Reference:** https://neptune.ai/blog/hugging-face-pre-trained-models-find-the-best

In [2]:
# !pip install -r requirements.txt

In [1]:
import warnings
warnings.filterwarnings("ignore")

## Load Model

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


In [52]:
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-one-mmt").to(device)

In [5]:
model_name = "mbart-finetuned-cn-to-en-auto"
model_path = f"../models/{model_name}"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)

In [65]:
text = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"

tokenizer.src_lang = 'zh_CN'
tokenizer.tgt_lang = 'en_XX'

tokenized_text = tokenizer(text, return_tensors="pt").to(device)
print(tokenized_text)
print()

translation = model.generate(**tokenized_text)
print(translation)
print()

translated_text = tokenizer.decode(translation[0], skip_special_tokens=True)
print(translated_text)

{'input_ids': tensor([[250025,      6,   4185, 109612,  53340,      4, 100945,  15036,  14224,
             43,   4150,   4771,    274,      4,  41178, 106460,  70871,  10359,
           5382,      4,    562,   4185, 109612,  88467,      4,  70871,    684,
         160127,      4, 100945,  15036,    887,  14224,     43, 155132,      2]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

tensor([[     2, 250004,  14847,     70,  12944,     83,  69347,     98,      4,
             70, 172714,   6897,  36069,      7,   5792,   4271,      4,  41866,
           3229,     70,  92949,     83,  91097,     23,  41710,      5,   1650,
             25,      7,    959,   4127,   2174,     70,  12944,     83,    959,
          69347,     98,      5,   1301,  33662,    237,     70,  92949,     83,
           1238,  70463,      4,     70, 172714,   6897,  36069, 

In [6]:
sentence = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"

tokenizer.src_lang = 'zh_CN'
tokenizer.tgt_lang = 'en_XX'

encoded = tokenizer(sentence, return_tensors="pt").to(device)
model = model.to(device)
generated_tokens = model.generate(**encoded)
decoded = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

print(decoded)

If the air conditioner is on, the flight resumes too quickly, especially in winter when the weather is cold. If the air conditioner is not on, the flight resumes faster as soon as the weather is cold


## Preprocess Data

### Load Data

In [7]:
import pandas as pd
from datasets import Dataset

In [6]:
file_proc = "../data/trans/PROC-NCVQS-2021-2023.csv"

In [29]:
df_0 = pd.read_excel('../data/trans/2021_NCVQS_text.xlsx', sheet_name = 0)
df_1 = pd.read_excel('../data/trans/2022_NCVQS_text.xlsx', sheet_name = 0)
df_2 = pd.read_excel('../data/trans/2023_NCVQS_text.xlsx', sheet_name = 0)

colname = ["Chinese", "English"]
df_0.columns = colname
df_1.columns = colname
df_2.columns = colname

df_3 = pd.read_excel('../data/trans/2023_NCVQS_text.xlsx', sheet_name=1)

zh_columns = df_3.columns[:4]
en_columns = df_3.columns[4:]

df_3['Chinese'] = df_3[zh_columns].apply(lambda row: ' '.join(row.dropna().astype(str)), axis=1)
df_3['English'] = df_3[en_columns].apply(lambda row: ' '.join(row.dropna().astype(str)), axis=1)

df_3.drop(columns=en_columns, inplace=True)
df_3.drop(columns=zh_columns, inplace=True)

df = pd.concat([df_0, df_1, df_2, df_3], axis=0)
df = df.dropna()
df.drop_duplicates(keep='first', subset='Chinese', inplace=True)

In [30]:
print("Length: ", len(df), "\n Num of NaN: ", df.isna().sum().sum())

Length:  14316 
 Num of NaN:  0


In [31]:
df.head()

| | Chinese | English |
|---|---|---|
| 0 | 第二排的舒适性不太理想，减震有点硬，平时有坎时感觉咣当一下，不是很舒服，选择舒适性 | The comfort of the second row is not ideal, th... |
| 2 | 减震硬，路况不好的地方不太舒服，选择舒适性（在路况不好，沟沟坎坎比较多时候，车内晃动大） | The shock absorption is hard, and riding under... |
| 3 | 车内的网络连接不稳定（自带的车联网，通过流量卡连接的互联网，有时使用中会突然没有网，在使用任... | The network connection in the car is unstable ... |
| 4 | 开空调时车内有潮气的味道，开热风冷风都会有，问了问，有人说是滤芯的气味，不是很重（新车，没有... | There is a smell of moisture in the car when t... |
| 5 | 第二排两侧的车门关门时声音咚咚的，声音很沉，感觉车门有点重，听上去没有质感，不是什么大问题，... | When the doors on both sides of the second row... |


In [12]:
df.to_csv(file_proc, index=False)

### Load Processed Data

In [8]:
df = pd.read_csv(file_proc)

In [9]:
df.tail()

| | Chinese | English |
|---|---|---|
| 14311 | 外观时尚好看；安全系数高； 车漆是反光的黑色，很容易脏，有水印子； 外形好看；配置高；空间大； 无 | Fashionable and good-looking appearance; High ... |
| 14312 | 安全性好；内部空间大； 车机系统不稳定；整体车辆安全性能好； 省油；动力是提速快；安全性好，... | Good security; Large interior space; . Unstabl... |
| 14313 | 外观时尚；内饰豪华； 冬天打火不稳定；新车，止回阀坏过2次； 品牌安全性好；内饰豪华；定速巡... | Stylish appearance; The interior is luxurious;... |
| 14314 | 内部配置丰富；空间大；动力足； 有几个小毛病； 安全性；品牌名气；配置丰富； 小毛病比较多； | Rich internal functions; Large space; Power en... |
| 14315 | 各方面都很均衡，外观设计符合我的审美，看上去比较大气 座椅包裹性不太好，谈不上舒服 配置丰富... | It is balanced in all aspects, the appearance ... |


## Batch Translate

In [10]:
file_sample = "../data/trans/PROC_SAMPLE."

(14316, 2)

In [None]:
df = df.head()
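Translating many rows is usually done in fixed-size batches rather than one sentence per call. A minimal sketch of the batching logic only, with a stand-in function in place of the model call; `chunked`, `translate_in_batches`, and `fake_translate_batch` are hypothetical names. In this notebook the stand-in would be replaced by a function that tokenizes a list of sentences with `padding=True`, calls `model.generate`, and decodes with `tokenizer.batch_decode`:

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(
    sentences: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Apply `translate_batch` to each chunk and keep the original order."""
    results: List[str] = []
    for batch in chunked(sentences, batch_size):
        results.extend(translate_batch(batch))
    return results


def fake_translate_batch(batch: List[str]) -> List[str]:
    """Stand-in for the model call: tags each sentence instead of translating."""
    return [f"<en> {sentence}" for sentence in batch]


sentences = [f"句子{i}" for i in range(5)]
translated = translate_in_batches(sentences, fake_translate_batch, batch_size=2)
print(translated)
```

Keeping the batching separate from the model call makes it easy to tune `batch_size` to the available GPU memory without touching the translation code.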


## Fine-tune

### Transform Data

In [20]:
df = df.assign(translation=df.apply(lambda row:{'zh': row['Chinese'], 'en': row['English']}, axis=1))
df.reset_index(inplace=True)
df.drop(['index', 'Chinese', 'English'], axis=1, inplace=True)

In [21]:
df.head()

| | translation |
|---|---|
| 0 | {'zh': '第二排的舒适性不太理想，减震有点硬，平时有坎时感觉咣当一下，不是很舒服，选择... |
| 1 | {'zh': '减震硬，路况不好的地方不太舒服，选择舒适性（在路况不好，沟沟坎坎比较多时候，... |
| 2 | {'zh': '车内的网络连接不稳定（自带的车联网，通过流量卡连接的互联网，有时使用中会突然... |
| 3 | {'zh': '开空调时车内有潮气的味道，开热风冷风都会有，问了问，有人说是滤芯的气味，不是... |
| 4 | {'zh': '第二排两侧的车门关门时声音咚咚的，声音很沉，感觉车门有点重，听上去没有质感，... |


In [22]:
df.iloc[0, 0]

{'zh': '第二排的舒适性不太理想，减震有点硬，平时有坎时感觉咣当一下，不是很舒服，选择舒适性',
 'en': 'The comfort of the second row is not ideal, the shock absorption is a bit hard, and it feels awkward when there are bumps, which is not very comfortable, and I choose comfort'}

In [23]:
print("Dimension: ", df.shape)

Dimension:  (14316, 1)


In [24]:
dataset = Dataset.from_pandas(df, split='train')
dataset = dataset.train_test_split(test_size=0.05)

In [25]:
print(type(dataset['train']))
print(dataset['train'][:1])

<class 'datasets.arrow_dataset.Dataset'>
{'translation': [{'en': "The new car has a very special smell, an indescribable odor, I don't know if it uses environmentally friendly materials or where it comes from, it is suspected to be from the interior, and I can't accept the smell so far, it should be removed;", 'zh': '新车有一股很特殊的气味,没法形容的异味,是否使用了环保材料,不清楚来自于具体哪里,怀疑是内饰,购车至今不能接受气味,最好能够去掉杂味;'}]}


In [26]:
dataset

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 13600
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 716
    })
})
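The split sizes above are consistent with the test fraction being rounded up: 5% of 14,316 rows is 715.8, which gives 716 test rows and 13,600 training rows. A quick check of that arithmetic, assuming ceiling rounding for the test split (an inference from the printed counts, not taken from the `datasets` source):

```python
import math

total_rows = 14316      # rows in the DataFrame before splitting
test_fraction = 0.05    # value passed as test_size

# Assumed rounding rule: the test split is rounded up, the rest is training data.
n_test = math.ceil(total_rows * test_fraction)
n_train = total_rows - n_test

print("train rows:", n_train)
print("test rows:", n_test)
```

Because the split is random, passing a fixed `seed` to `train_test_split` keeps the same rows in each split across runs.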

### Tokenize Datasets

In [27]:
prefix = ""  # mBART and MarianMT need no task prefix
max_input_length = 512
max_target_length = 512

source_lang = "zh"
target_lang = "en"

def preprocess_function(examples):
    # Pull the source and target sentences out of each {'zh': ..., 'en': ...} pair.
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the targets as labels. `text_target` replaces the deprecated
    # `as_target_tokenizer()` context manager in recent transformers releases.
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/13600 [00:00<?, ? examples/s]

Map:   0%|          | 0/716 [00:00<?, ? examples/s]

## Train and Fine-tune the Model

### Model Setup

In [28]:
from transformers import Seq2SeqTrainingArguments
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainer
import evaluate

In [29]:
from datetime import datetime

# Get the current date
today = datetime.now()

# Format the date as "DDMMYYYY"
formatted_date = today.strftime("%d%m%Y")

print(formatted_date)

20022024


In [31]:
model_name = "mbart-finetuned-cn-to-en-auto"
model_path = f"../models/{model_name}"

batch_size = 4

args = Seq2SeqTrainingArguments(
   output_dir=model_path,
   evaluation_strategy="epoch",
   learning_rate=2e-5,
   per_device_train_batch_size=batch_size,
   per_device_eval_batch_size=batch_size,
   weight_decay=0.01,
   save_total_limit=3,
   num_train_epochs=1,
   predict_with_generate=True,
)

In [32]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model) # default setting

In [33]:
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")
meteor = evaluate.load('meteor')

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    meteor_result = meteor.compute(predictions=decoded_preds, references=decoded_labels)
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result = {'bleu' : result['score']}
    result["gen_len"] = np.mean(prediction_lens)
    result["meteor"] = meteor_result["meteor"]
    result = {k: round(v, 4) for k, v in result.items()}
    return result

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.93k [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...


In [34]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

### Train and Save the Model

In [35]:
trainer.train()

You're using a MBart50TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Bleu | Gen Len | Meteor |
|---|---|---|---|---|---|
| 1 | 0.9087 | 0.89668 | 30.5299 | 50.6047 | 0.6196 |


TrainOutput(global_step=3400, training_loss=0.9925245846019072, metrics={'train_runtime': 2206.0199, 'train_samples_per_second': 6.165, 'train_steps_per_second': 1.541, 'total_flos': 1445687188979712.0, 'train_loss': 0.9925245846019072, 'epoch': 1.0})
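The numbers in `TrainOutput` can be cross-checked against the setup: 13,600 training rows at a per-device batch size of 4 give 3,400 optimizer steps for one epoch, which matches `global_step`. The check below assumes a single device and no gradient accumulation, which the arguments above imply but the log does not state explicitly:

```python
import math

train_rows = 13600         # size of the training split
batch_size = 4             # per_device_train_batch_size
epochs = 1                 # num_train_epochs
train_runtime = 2206.0199  # seconds, from TrainOutput

# Assumes one device and gradient_accumulation_steps=1.
steps = math.ceil(train_rows / batch_size) * epochs
samples_per_second = round(train_rows * epochs / train_runtime, 3)
steps_per_second = round(steps / train_runtime, 3)

print("steps:", steps)
print("samples/s:", samples_per_second)
print("steps/s:", steps_per_second)
```

The run therefore took roughly 37 minutes for a single pass over the data.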

In [36]:
trainer.save_model()

In [29]:
# trainer.push_to_hub()

### Test

In [42]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

In [43]:
sentence = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"

tokenizer.src_lang = 'zh_CN'
tokenizer.tgt_lang = 'en_XX'

encoded = tokenizer(sentence, return_tensors="pt").to(device)
model = model.to(device)
generated_tokens = model.generate(**encoded)
decoded = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

print(decoded)

When the AC is turned on, the battery life drops too fast, especially when the weather is cold in winter. It's not good if the AC is not turned on. As soon as the weather is frozen, the battery life drops faster.


> "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"

> **No fine-tuning:** "If the air conditioner is on, the flight resumes too quickly, especially in winter when the weather is cold. If the air conditioner is not on, the flight resumes faster as soon as the weather is cold"

> **Fine-tuned on 2023 data:** "If you turn on the air conditioner, the electric range is too fast to fall off, especially in winter when the weather is cold and cold, it is not possible to not turn on the air conditioner without turning on the air conditioner, as the weather freezes and freezes, the electric range will fall off faster."

In [39]:
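def translate(sentence):
    # Uses the global tokenizer and model; tokenizer.src_lang must already be 'zh_CN'.
    encoded = tokenizer(sentence, return_tensors="pt").to(device)
    generated_tokens = model.generate(**encoded)
    # batch_decode returns one string per sequence; a single sentence yields one item.
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]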

In [40]:
for i in dataset['test']['translation'][:5]:
    print("="*30)
    print(i['zh'])
    print(i['en'])
    print("="*10)
    result = translate(i['zh'])
    print(result)

导航准确率低，使用时路线不是很准确，没有更新过导航，不如手机导航好用，具体是什么地点不准确，时间比较久了，记不太清了，导航每周都会使用
The navigation accuracy rate is low, the route is not very accurate when used, and the navigation has not been updated. It is not as easy to use as the mobile phone navigation. The specific location is not accurate. The problem has existed for a long time and I can’t remember clearly. The navigation will be used every week
The accuracy of the navigation is low, the route is not very accurate when used, the navigation has not been updated, it is not as good as the mobile phone navigation, the specific location is inaccurate, the time is relatively long, the memory is not clear, the navigation will be used every week
续航不满意，在市区不明显，温度低，上高速时电池电量消耗明显，跑了一段里程比如10公里就看电池明显少了一大部分，室外零下，冬天不常开，没算过百公里里程，经济模式，空调通常24度，不常跑高速
The range is not very satisfactory, not obvious in the city. The temperature is low, the battery power consumption is obvious when on the highway, run a period of mileage such as 10 kilometers to see a significant p