# Fine-Tune mBART-50 for Chinese-to-English Translation

**Reference:** https://neptune.ai/blog/hugging-face-pre-trained-models-find-the-best

In [1]:
# pip install -r requirements.txt

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
import torch

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(device)

cuda


## Load Model

In [4]:
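model_mbart = "facebook/mbart-large-50-many-to-one-mmt"

# Local directory for a fine-tuned checkpoint
model_name = "mbart-finetuned-cn-to-en-auto"
model_path = f"../models/{model_name}"

# Override the local path: start from the pretrained checkpoint on the Hub
model_path = model_mbart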

In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

In [6]:
sentence = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"

tokenizer.src_lang = 'zh_CN'
tokenizer.tgt_lang = 'en_XX'

encoded = tokenizer(sentence, return_tensors="pt").to(device)
model = model.to(device)
generated_tokens = model.generate(**encoded)
decoded = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

print(decoded)

If the air conditioner is on, the flight resumes too quickly, especially in winter when the weather is cold. If the air conditioner is not on, the flight resumes faster as soon as the weather is cold


## Preprocess Data

### Load Data

In [7]:
import pandas as pd
from datasets import Dataset

pd.set_option('display.max_colwidth', None)

In [8]:
# Sheet 1: rename the source and translation columns to Chinese / English
df_1 = pd.read_excel('../data/trans/2023_NCVQS_text.xlsx', sheet_name=0)
df_1.rename(columns={'Detail complains breakdown  (Chinese)': 'Chinese', 'Translation': 'English'}, inplace=True)

# Sheet 2: the first four columns are treated as Chinese, the rest as English
df_2 = pd.read_excel('../data/trans/2023_NCVQS_text.xlsx', sheet_name=1)

zh_columns = df_2.columns[:4]
en_columns = df_2.columns[4:]

# Join the non-empty cells of each row into one string per language
df_2['Chinese'] = df_2[zh_columns].apply(lambda row: ' '.join(row.dropna().astype(str)), axis=1)
df_2['English'] = df_2[en_columns].apply(lambda row: ' '.join(row.dropna().astype(str)), axis=1)

df_2.drop(columns=en_columns, inplace=True)
df_2.drop(columns=zh_columns, inplace=True)

# Stack both sheets, drop incomplete rows, and keep the first copy of each Chinese text
df = pd.concat([df_1, df_2], axis=0)
df = df.dropna()
df.drop_duplicates(keep='first', subset='Chinese', inplace=True)

In [9]:
print("Length: ", len(df), "\n Num of NaN: ", df.isna().sum().sum())

Length:  447 
 Num of NaN:  0
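
The column joining and de-duplication above can be tried on a small in-memory frame. This is a minimal sketch: the column names `zh_1`, `zh_2`, `en_1`, `en_2`, the helper `join_columns`, and the sample rows are made up for illustration and are not taken from the spreadsheet.

```python
import pandas as pd

# Toy frame standing in for the second sheet (hypothetical column names):
# two Chinese and two English columns, with None where a cell is empty.
toy = pd.DataFrame({
    "zh_1": ["动力强", "座椅舒适", "动力强"],
    "zh_2": ["加速快", None, "加速快"],
    "en_1": ["Strong power", "Comfortable seats", "Strong power"],
    "en_2": ["Fast acceleration", None, "Fast acceleration"],
})


def join_columns(frame, columns):
    """Join the non-missing cells of each row with a single space."""
    return frame[columns].apply(
        lambda row: " ".join(row.dropna().astype(str)), axis=1
    )


pairs = pd.DataFrame({
    "Chinese": join_columns(toy, ["zh_1", "zh_2"]),
    "English": join_columns(toy, ["en_1", "en_2"]),
})

# Keep only the first occurrence of each Chinese source text.
pairs = pairs.drop_duplicates(subset="Chinese", keep="first")
print(pairs)
```

The third toy row repeats the first, so only two rows remain after de-duplication.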


In [10]:
df.head()

index,Chinese,English
0,开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快,"In the case of turning on the air conditioner, the electric range drops too fast, especially when the weather is cold in winter, if you don't turn on the air conditioner, the weather freezes, the battery life will fall faster."
1,车机流畅度差，容易卡死机，车机系统，启动载入很慢，换挡杆前的车机，使用任何功能都有概率死机，发生过3-4次,"The smoothness of the IHU is poor, easy to jam, the car machine system, the start loading is very slow, the car machine before the gear lever, using any function has a probability of crashing, which has occurred 3-4 times."
2,整车的悬架系统，在过减速带时，速度在20码以下，但是车身的抖动还是很厉害，舒适性为第一的，美系车相比，差距还是比较大的,"The suspension system of the whole car, when crossing the speed bump, the speed is below 20km/h, but the shaking of the body is still very strong, the comfort is the first, compared with the American car, the gap is still relatively large."
3,大众车的通病，车子的隔音效果不太理想，车速在90码以上，车内的胎噪声就很明显了，必须把音量调大，才能缓解一点（是原厂轮胎，车窗关闭）,"The common problem of Volkswagen, the sound insulation of the car is not ideal, the speed is above 90km/h, the tire noise in the car is obvious, the volume must be turned up, in order to alleviate a little (is the original tires, the windows are closed."
4,车辆外观很不错，但是车标在晚上不能发亮，要是可以发亮的话会更拉风一点,"The appearance of the vehicle is very good, but the logo cannot be shiny at night, if it can be bright, it will be more stunning."


In [11]:
df.to_csv("../data/trans/PROC-NCVQS-2023.csv", index=False)
df.head(5).to_csv("../data/trans/PROC-SAMPLE.csv", index=False)
df.head(5)[["Chinese"]].to_csv("../data/trans/PROC-SAMPLE-INFERENCE.csv", index=False)

### Transform Data

In [12]:
df = df.assign(translation=df.apply(lambda row:{'zh': row['Chinese'], 'en': row['English']}, axis=1))
df.reset_index(inplace=True)
df.drop(['index', 'Chinese', 'English'], axis=1, inplace=True)

In [13]:
df.head()

index,translation
0,"{'zh': '开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快', 'en': 'In the case of turning on the air conditioner, the electric range drops too fast, especially when the weather is cold in winter, if you don't turn on the air conditioner, the weather freezes, the battery life will fall faster.'}"
1,"{'zh': '车机流畅度差，容易卡死机，车机系统，启动载入很慢，换挡杆前的车机，使用任何功能都有概率死机，发生过3-4次', 'en': 'The smoothness of the IHU is poor, easy to jam, the car machine system, the start loading is very slow, the car machine before the gear lever, using any function has a probability of crashing, which has occurred 3-4 times.'}"
2,"{'zh': '整车的悬架系统，在过减速带时，速度在20码以下，但是车身的抖动还是很厉害，舒适性为第一的，美系车相比，差距还是比较大的', 'en': 'The suspension system of the whole car, when crossing the speed bump, the speed is below 20km/h, but the shaking of the body is still very strong, the comfort is the first, compared with the American car, the gap is still relatively large.'}"
3,"{'zh': '大众车的通病，车子的隔音效果不太理想，车速在90码以上，车内的胎噪声就很明显了，必须把音量调大，才能缓解一点（是原厂轮胎，车窗关闭）', 'en': 'The common problem of Volkswagen, the sound insulation of the car is not ideal, the speed is above 90km/h, the tire noise in the car is obvious, the volume must be turned up, in order to alleviate a little (is the original tires, the windows are closed.'}"
4,"{'zh': '车辆外观很不错，但是车标在晚上不能发亮，要是可以发亮的话会更拉风一点', 'en': 'The appearance of the vehicle is very good, but the logo cannot be shiny at night, if it can be bright, it will be more stunning.'}"


In [14]:
df.iloc[0, 0]

{'zh': '开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快',
 'en': "In the case of turning on the air conditioner, the electric range drops too fast, especially when the weather is cold in winter, if you don't turn on the air conditioner, the weather freezes, the battery life will fall faster."}

In [15]:
print("Dimension: ", df.shape)

Dimension:  (447, 1)
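
The single `translation` column can also be built without a row-wise `apply`. The sketch below uses two made-up sentence pairs and a list comprehension; the names `toy`, `records`, and `translation_df` are illustrative only.

```python
import pandas as pd

# Two made-up sentence pairs with the same column names as the notebook.
toy = pd.DataFrame({
    "Chinese": ["动力强", "座椅舒适"],
    "English": ["Strong power", "Comfortable seats"],
})

# One {'zh': ..., 'en': ...} dict per row, the layout expected by the
# preprocessing function later in the notebook.
records = [
    {"zh": zh, "en": en}
    for zh, en in zip(toy["Chinese"], toy["English"])
]
translation_df = pd.DataFrame({"translation": records})

first = translation_df.iloc[0, 0]
print(translation_df.shape)
print(first)
```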


In [16]:
dataset = Dataset.from_pandas(df, split='train')
dataset = dataset.train_test_split(test_size=0.01)

In [17]:
print(type(dataset['train']))
print(dataset['train'][:1])

<class 'datasets.arrow_dataset.Dataset'>
{'translation': [{'en': 'The overall feeling is good. The front hood thundered. Model, brand, color, exterior, interior. Engine sound (abnormal noise, increase head-up display.', 'zh': '整体感觉挺好的 前面车盖轰轰的响 车型，品牌，颜色，外观，内饰 发动机声音（异响），增加抬头显示'}]}


In [18]:
dataset

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 442
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 5
    })
})
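
The 442 / 5 split follows from the row count and `test_size=0.01`. The sketch below assumes the test fraction is rounded up to a whole number of rows, which matches the counts shown above; `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math


def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_fraction * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test


print(split_sizes(447, 0.01))
```

With only five held-out examples, the evaluation metrics later in the notebook are a rough indication rather than a stable estimate.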

### Tokenize Datasets

In [19]:
prefix = ""  # no task prefix is needed for mBART and MarianMT
max_input_length = 512
max_target_length = 512

source_lang = "zh"
target_lang = "en"

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the targets in target-language mode; `text_target` replaces the
    # deprecated `as_target_tokenizer()` context manager
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)


## Train and Fine-tune the Model

### Model Setup

In [20]:
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
import evaluate

In [21]:
from datetime import datetime

# Get the current date
today = datetime.now()

# Format the date as "DDMMYYYY"
formatted_date = today.strftime("%d%m%Y")

print(formatted_date)

06032024


In [22]:
model_name = "mbart-finetuned-cn-to-en-auto-2023"
model_path = f"../models/{model_name}"

batch_size = 4

args = Seq2SeqTrainingArguments(
    output_dir=model_path,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
)

In [23]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)  # default settings: pads inputs and labels per batch

In [24]:
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")
meteor = evaluate.load('meteor')

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    # Replace -100 (the ignore index) with the pad token id, since -100 cannot be decoded.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Strip whitespace and wrap each reference in a list, as sacreBLEU expects
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    bleu_result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    meteor_result = meteor.compute(predictions=decoded_preds, references=decoded_labels)

    # Average generated length, counted in non-padding tokens
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result = {
        "bleu": bleu_result["score"],
        "gen_len": np.mean(prediction_lens),
        "meteor": meteor_result["meteor"],
    }
    return {k: round(v, 4) for k, v in result.items()}

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
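
The `-100` handling in `compute_metrics` can be checked on a small array without loading the model. In this sketch the token ids and `PAD_ID = 1` are made-up values for illustration; in the notebook the pad id comes from `tokenizer.pad_token_id`.

```python
import numpy as np

PAD_ID = 1            # hypothetical pad token id, for illustration only
IGNORE_INDEX = -100   # value the data collator uses to pad label sequences

# Two label sequences; the first one was padded with the ignore index.
labels = np.array([
    [250004, 581, 2, IGNORE_INDEX, IGNORE_INDEX],
    [250004, 87, 1884, 765, 2],
])

# Swap the ignore index for the pad id so the sequences can be decoded.
decodable = np.where(labels != IGNORE_INDEX, labels, PAD_ID)

# Count non-padding tokens per sequence, as done for the gen_len metric.
lengths = [int(np.count_nonzero(row != PAD_ID)) for row in decodable]
print(decodable)
print(lengths)
```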


In [25]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

### Train and Save the Model

In [26]:
trainer.train()

You're using a MBart50TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Bleu,Gen Len,Meteor
1,No log,0.80461,47.1049,55.4,0.7417
2,No log,0.801608,47.1057,53.4,0.7395


TrainOutput(global_step=222, training_loss=0.5791223886850718, metrics={'train_runtime': 142.6421, 'train_samples_per_second': 6.197, 'train_steps_per_second': 1.556, 'total_flos': 113549981024256.0, 'train_loss': 0.5791223886850718, 'epoch': 2.0})
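
The reported `global_step=222` is consistent with 442 training rows, a batch size of 4, and 2 epochs. The sketch below assumes a single device and no gradient accumulation; `total_optimizer_steps` is a hypothetical helper. It also suggests why the table shows `No log` for the training loss: the run appears to finish before the default `logging_steps` of 500 is reached.

```python
import math


def total_optimizer_steps(n_train, batch_size, epochs):
    """Optimizer steps for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs


DEFAULT_LOGGING_STEPS = 500  # default value of `logging_steps`

steps = total_optimizer_steps(n_train=442, batch_size=4, epochs=2)
print(steps)
print(steps < DEFAULT_LOGGING_STEPS)
```

Passing a smaller `logging_steps` to `Seq2SeqTrainingArguments` would make the training loss appear in the table.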

In [27]:
eval_result = trainer.evaluate(tokenized_datasets['test'])

In [28]:
eval_result

{'eval_loss': 0.8016082644462585,
 'eval_bleu': 47.1057,
 'eval_gen_len': 53.4,
 'eval_meteor': 0.7395,
 'eval_runtime': 3.8774,
 'eval_samples_per_second': 1.29,
 'eval_steps_per_second': 0.516,
 'epoch': 2.0}
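
The throughput figures in this dictionary can be reproduced from the runtime, the 5 test rows, and the evaluation batch size of 4. This is a small sanity check under the assumption that the values are rounded to three decimals; the variable names are illustrative.

```python
import math

n_eval = 5            # rows in the test split
batch_size = 4        # per_device_eval_batch_size
runtime = 3.8774      # eval_runtime in seconds, as reported above

eval_steps = math.ceil(n_eval / batch_size)
samples_per_second = round(n_eval / runtime, 3)
steps_per_second = round(eval_steps / runtime, 3)
print(samples_per_second, steps_per_second)
```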

In [29]:
trainer.save_model()

In [30]:
# trainer.push_to_hub()

### Test

In [31]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "mbart-finetuned-cn-to-en-auto-2023"
model_path = f"../models/{model_name}"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

In [32]:
sentence = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"

tokenizer.src_lang = 'zh_CN'
tokenizer.tgt_lang = 'en_XX'

encoded = tokenizer(sentence, return_tensors="pt").to(device)
model = model.to(device)
generated_tokens = model.generate(**encoded)
decoded = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

print(decoded)

In the case of turning on the air conditioner, the electric range drops too fast, especially when the weather is cold in winter, it is not good to turn off the air conditioner, as soon as the weather freezes, the electric range drops faster.


For comparison, the source sentence:

> "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"

and the translation produced by the base model before fine-tuning:

> 'If the air conditioner is on, the flight resumes too quickly, especially in winter when the weather is cold. If the air conditioner is not on, the flight resumes faster as soon as the weather is cold'

In [33]:
def translate(sentence):
    """Translate one Chinese sentence using the global tokenizer, model, and device."""
    encoded = tokenizer(sentence, return_tensors="pt").to(device)
    generated_tokens = model.generate(**encoded)
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
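
Translating one sentence per call is fine for a handful of examples. For longer lists, a batched variant may be faster. The function below is a sketch, not part of the original notebook: `translate_batch`, `batch_size`, and `max_length` are illustrative names, and it assumes the tokenizer accepts `padding=True` and `truncation=True` the way Hugging Face tokenizers do.

```python
def translate_batch(sentences, tokenizer, model, device, batch_size=8, max_length=512):
    """Translate a list of sentences in chunks of `batch_size`.

    `tokenizer` and `model` are expected to behave like the objects loaded
    in the notebook; they are passed in explicitly instead of read as globals.
    """
    results = []
    for start in range(0, len(sentences), batch_size):
        chunk = sentences[start:start + batch_size]
        # Pad to the longest sentence in the chunk so it forms one tensor.
        encoded = tokenizer(
            chunk,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        ).to(device)
        generated = model.generate(**encoded)
        results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return results
```

A call such as `translate_batch([pair['zh'] for pair in dataset['test']['translation']], tokenizer, model, device)` would return the translations in input order.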

In [34]:
for i in dataset['test']['translation'][:5]:
    print("="*30)
    print(i['zh'])
    print(i['en'])
    print("="*10)
    result = translate(i['zh'])
    print(result)

充电速度快，电机加速性能很好 车辆行驶的质感很好，终身质保 车顶的全景大天幕很上档次，动力强，加速非常快 没有仪表板很不方便；动力回馈力度没办法调节
The charging speed is fast and the motor acceleration performance is very good. The driving feeling is very good, and the lifetime warranty. The panoramic canopy on the roof is very high-grade, strong power, and acceleration is very fast. It's inconvenient not to have a dashboard; The strength of power feedback cannot be adjusted.
The charging speed is fast, and the motor acceleration performance is very good. The texture of the vehicle driving is very good, lifetime warranty. The panoramic sunroof on the roof is very high-grade, the power is strong, and the acceleration is very fast. It is inconvenient to have no dashboard; The power feedback cannot be adjusted.
整车线条流畅，现代感十足； 内饰科技感强，座椅功能齐全； 内饰科技感强，座椅舒适； 无
The whole car has smooth lines and a modern feeling. The interior has a strong sense of technology and the seats are fully functional; . The interior is highly technologically advanced and the seats are comfor