# Fine-tune

**Reference:** https://neptune.ai/blog/hugging-face-pre-trained-models-find-the-best

In [1]:
# !pip install -r requirements.txt

In [29]:
!jupyter nbextension enable --py widgetsnbextension 

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


In [3]:
import warnings
warnings.filterwarnings("ignore")

## Load Model

In [4]:
model_name = "mbart-finetuned-cn-to-en-auto"
model_path = f"../models/{model_name}"

In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


In [6]:
# tokenizer = AutoTokenizer.from_pretrained(model_path)
# model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)

In [7]:
# text = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"

# tokenizer.src_lang = 'zh'
# tokenizer.tgt_lang = 'en'

# tokenized_text = tokenizer(text, return_tensors="pt").to(device)
# print(tokenized_text)
# print()

# translation = model.generate(**tokenized_text)
# print(translation)
# print()

# translated_text = tokenizer.decode(translation[0], skip_special_tokens=True)
# print(translated_text)

## Preprocess Data

### Load Data

In [8]:
import pandas as pd

pd.set_option('display.max_colwidth', None)

In [9]:
file_proc = "../data/trans/PROC-NCVQS-2021-2023.csv"
file_sample = "../data/trans/PROC_SAMPLE.csv"

### Load Processed Data

In [10]:
# df = pd.read_csv(file_proc)

# df.head(20).to_csv(file_sample, index=False)

## Batch Translate

In [11]:
file_sample = "../data/trans/PROC_SAMPLE.csv"

In [12]:
df = pd.read_csv(file_sample)

In [13]:
df.head()

Unnamed: 0,Chinese,English
0,第二排的舒适性不太理想，减震有点硬，平时有坎时感觉咣当一下，不是很舒服，选择舒适性,"The comfort of the second row is not ideal, the shock absorption is a bit hard, and it feels awkward when there are bumps, which is not very comfortable, and I choose comfort"
1,减震硬，路况不好的地方不太舒服，选择舒适性（在路况不好，沟沟坎坎比较多时候，车内晃动大）,"The shock absorption is hard, and riding under poor road conditions are not very comfortable. I choose comfort (when the road conditions are not good and there are many ridges and bumps, the inside of the car shakes a lot)"
2,车内的网络连接不稳定（自带的车联网，通过流量卡连接的互联网，有时使用中会突然没有网，在使用任何APP时都有发生几率，不知是什么原因）,"The network connection in the car is unstable (the built-in car network, the Internet is connected through the data traffic card, sometimes there is no network during use, it might happen when using any APP, and I don’t know why)"
3,开空调时车内有潮气的味道，开热风冷风都会有，问了问，有人说是滤芯的气味，不是很重（新车，没有更换过空气滤波器）,"There is a smell of moisture in the car when the AC is turned on both when hot and cold air is supplied in the car. When I asked, some repairmen said that it was the smell of the filter element, which was not very heavy (new car, the air filter has not been replaced)"
4,第二排两侧的车门关门时声音咚咚的，声音很沉，感觉车门有点重，听上去没有质感，不是什么大问题，设计的问题,"When the doors on both sides of the second row are closed, the thumping sound is very heavy. It doesn't feel good quality texture, not a big problem, a design issue"


In [14]:
from Translate import Translate

trans = Translate(model_path)

In [15]:
text = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"
print(trans.translator(text))

In the case of turning on the air conditioner, the endurance mileage drops too fast, especially in winter when the weather is cold. If I don't turn on the air conditioner, the endurance mileage will drop faster, especially when the weather is cold in winter. If I don't turn on the air conditioner, the endurance mileage will drop faster when the weather is freezing.


In [16]:
# df = trans.translator_batch(df, col_tgt="Translation")
# df.head()
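
The internals of `trans.translator_batch` are not shown in this notebook. As a rough sketch of how batch translation is commonly organised, the texts can be cut into fixed-size chunks and each chunk passed to the model in one call. `chunked`, `translate_in_batches` and `fake_translate` are hypothetical names used only for illustration; `fake_translate` stands in for a real model call.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(
    texts: Sequence[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Translate `texts` chunk by chunk, preserving the input order."""
    results: List[str] = []
    for batch in chunked(texts, batch_size):
        results.extend(translate_fn(list(batch)))
    return results


# Hypothetical stand-in for a model call; it only tags each sentence.
def fake_translate(batch: List[str]) -> List[str]:
    return [f"<en> {text}" for text in batch]


texts = ["第一句", "第二句", "第三句"]
translations = translate_in_batches(texts, fake_translate, batch_size=2)
print(translations)
```

With a DataFrame, the result list could then be assigned to a new column, for example `df["Translation"] = translations`.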

## Fine-tune

### Transform Data

In [17]:
df = pd.read_csv(file_sample)

In [18]:
from datasets import Dataset, DatasetDict

In [19]:
def generate_dataset(df, col_src='Chinese', col_tgt='English', train_size=0.9):
    """
    Generates a DatasetDict for machine translation from a pandas DataFrame.

    Args:
    df: A pandas DataFrame containing the source and target language columns.
    col_src: The name of the source language column (default: 'Chinese').
    col_tgt: The name of the target language column (default: 'English').
    train_size: The proportion of the data to use for training (default: 0.9).

    Returns:
    A DatasetDict containing the training, validation, and test datasets.
    """
    # Build the nested {'zh': ..., 'en': ...} record from the configured columns.
    df = df.assign(translation=df.apply(lambda row: {'zh': row[col_src], 'en': row[col_tgt]}, axis=1))
    df = df.reset_index(drop=True)
    df = df.drop([col_src, col_tgt], axis=1)
    
    test_size = 1 - train_size
    dataset = Dataset.from_pandas(df, split='train')
    dataset = dataset.train_test_split(test_size=test_size)
    testset = dataset['test'].train_test_split(test_size=0.5)
    dataset = DatasetDict({
        'train': dataset['train'],
        'test': testset['test'],
        'valid': testset['train']})
    
    return dataset

In [20]:
trans.dataset = trans.generate_dataset(df, train_size=0.8)

In [22]:
trans.dataset

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 16
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2
    })
    valid: Dataset({
        features: ['translation'],
        num_rows: 2
    })
})
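
The 16/2/2 split shown above follows from the two-stage split: hold out `1 - train_size` of the rows, then halve the held-out part. Below is a minimal pure-Python sketch of that arithmetic, not the `datasets` implementation. `three_way_split` is a hypothetical helper, and rounding the held-out count up is an assumption chosen to match the output above.

```python
import math
import random


def three_way_split(rows, train_size=0.9, seed=0):
    """Hold out (1 - train_size) of the rows, then halve the held-out part."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # reproducible shuffle of a copy

    # Assumption: the held-out count is rounded up. round() guards against
    # float noise, e.g. (1 - 0.8) * 20 evaluating to slightly below 4.
    n_holdout = math.ceil(round((1 - train_size) * len(rows), 6))
    n_test = math.ceil(n_holdout / 2)
    n_train = len(rows) - n_holdout

    holdout = rows[n_train:]
    return {
        "train": rows[:n_train],
        "test": holdout[:n_test],
        "valid": holdout[n_test:],
    }


splits = three_way_split(range(20), train_size=0.8)
print({name: len(part) for name, part in splits.items()})
# {'train': 16, 'test': 2, 'valid': 2}
```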

### Tokenize Datasets

In [None]:
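def tokenize_dataset(self, max_length_input=512, max_length_target=512, prefix=''):
    """Tokenize every split of self.dataset; `prefix` is prepended to each source text."""
    self.max_length_input = max_length_input
    self.max_length_target = max_length_target
    
    # Named tokenize_batch so it is not confused with self.tokenizer.
    def tokenize_batch(case):
        # With batched=True, case["translation"] is a list of {'zh': ..., 'en': ...} dicts,
        # so self.src_lang and self.tgt_lang must match those keys.
        inputs = [prefix + i[self.src_lang] for i in case["translation"]]
        targets = [i[self.tgt_lang] for i in case["translation"]]

        model_inputs = self.tokenizer(inputs, max_length=self.max_length_input, truncation=True)
        # Tokenize the targets in target-language mode.
        with self.tokenizer.as_target_tokenizer():
            labels = self.tokenizer(targets, max_length=self.max_length_target, truncation=True)

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
    
    tokenized_datasets = self.dataset.map(tokenize_batch, batched=True)
    return tokenized_datasets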

In [30]:
tokenized_datasets = trans.tokenize_dataset()

In [24]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 16
    })
    test: Dataset({
        features: ['translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2
    })
    valid: Dataset({
        features: ['translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2
    })
})

## Train and Fine-tune the Model

### Model Setup

In [31]:
from transformers import Seq2SeqTrainingArguments
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainer
import evaluate
import numpy as np

In [None]:
def finetune(self, 
             df, train_size=0.9, col_src='Chinese', col_tgt='English', 
             max_length_input=512, max_length_target=512, prefix='',
             model_path="model", batch_size=4):
    self.dataset = self.generate_dataset(df, 
                                         train_size=train_size, 
                                         col_src=col_src, col_tgt=col_tgt)
    self.tokenized_dataset = self.tokenize_dataset(max_length_input=max_length_input, 
                                                    max_length_target=max_length_target, 
                                                    prefix=prefix)
    self.args = Seq2SeqTrainingArguments(
       output_dir=model_path,
       evaluation_strategy="epoch",
       learning_rate=2e-5,
       per_device_train_batch_size=batch_size,
       per_device_eval_batch_size=batch_size,
       weight_decay=0.01,
       save_total_limit=3,
       num_train_epochs=1,
       predict_with_generate=True,
    )
    
    self.data_collator = DataCollatorForSeq2Seq(tokenizer=self.tokenizer, model=self.model)
    
    metric = evaluate.load("sacrebleu")
    meteor = evaluate.load('meteor')
    
    self.trainer = Seq2SeqTrainer(
        model=self.model,
        args=self.args,
        train_dataset=self.tokenized_dataset['train'],
        eval_dataset=self.tokenized_dataset['valid'],
        data_collator=self.data_collator,
        tokenizer=self.tokenizer,
        compute_metrics=self.compute_metrics,
    )
    
    self.trainer.train()
    self.trainer.save_model()
    
    self.eval_result = self.trainer.evaluate(self.tokenized_dataset['test'])
    print(self.eval_result)
    return None

In [None]:
model_name = "mbart-finetuned-cn-to-en-auto-sample"
model_path = f"../models/{model_name}"

batch_size = 4

args = Seq2SeqTrainingArguments(
   output_dir=model_path,
   evaluation_strategy="epoch",
   learning_rate=2e-5,
   per_device_train_batch_size=batch_size,
   per_device_eval_batch_size=batch_size,
   weight_decay=0.01,
   save_total_limit=3,
   num_train_epochs=1,
   predict_with_generate=True,
)
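
As a sanity check on these settings, the number of optimizer steps can be estimated from the dataset size. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept; `optimizer_steps` is a hypothetical helper, not part of `transformers`.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, num_epochs=1, num_devices=1):
    """Estimate optimizer steps, keeping the last partial batch of each epoch."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


# 16 training rows from the sample split, batch_size = 4, one epoch.
print(optimizer_steps(num_examples=16, per_device_batch_size=4))  # 4
```

With only four steps, the sample run is useful for checking that the pipeline works end to end, not for judging translation quality.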

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model) # default setting
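
The tokenized examples have different lengths, so the collator pads each batch to its longest member. The pure-Python sketch below illustrates the idea only: inputs are padded with the pad token and labels with `-100`, the default label padding value, so padded positions are ignored by the loss. `pad_batch` is a hypothetical helper, the token ids are toy values, and right-side padding is assumed. The real collator also returns tensors.

```python
def pad_batch(features, pad_token_id, label_pad_token_id=-100):
    """Pad input_ids and labels on the right to the longest item in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_input - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * label_pad)
    return batch


# Toy token ids; pad_token_id=1 is an arbitrary choice for this example.
features = [
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8, 9, 10, 11]},
]
batch = pad_batch(features, pad_token_id=1)
print(batch["input_ids"])  # [[5, 6, 7], [5, 1, 1]]
print(batch["labels"])     # [[8, 9, -100, -100], [8, 9, 10, 11]]
```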

In [None]:
metric = evaluate.load("sacrebleu")
meteor = evaluate.load('meteor')

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    meteor_result = meteor.compute(predictions=decoded_preds, references=decoded_labels)
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result = {'bleu' : result['score']}
    result["gen_len"] = np.mean(prediction_lens)
    result["meteor"] = meteor_result["meteor"]
    result = {k: round(v, 4) for k, v in result.items()}
    return result
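
Two steps in `compute_metrics` are easier to follow on a toy array: replacing the `-100` label padding before decoding, and counting non-pad tokens for `gen_len`. The values below are made up for illustration, and `pad_token_id = 1` is an arbitrary choice, not read from the tokenizer.

```python
import numpy as np

pad_token_id = 1  # toy value for this example

labels = np.array([[8, 9, -100, -100],
                   [8, 9, 10, 11]])
preds = np.array([[8, 9, 1, 1],
                  [8, 9, 10, 1]])

# -100 is not a valid token id, so swap it for the pad id before decoding.
decodable_labels = np.where(labels != -100, labels, pad_token_id)

# Generated length per row = number of tokens that are not padding.
prediction_lens = [int(np.count_nonzero(row != pad_token_id)) for row in preds]
gen_len = float(np.mean(prediction_lens))

print(decodable_labels.tolist())  # [[8, 9, 1, 1], [8, 9, 10, 11]]
print(prediction_lens, gen_len)   # [2, 3] 2.5
```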

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['valid'],  # keep 'test' held out for the final evaluation
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

### Train and Save the Model

In [None]:
trainer.train()

In [None]:
trainer.save_model()

In [None]:
# trainer.push_to_hub()

### Test

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

In [None]:
sentence = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"

tokenizer.src_lang = 'zh_CN'
tokenizer.tgt_lang = 'en_XX'

encoded = tokenizer(sentence, return_tensors="pt").to(device)
model = model.to(device)
generated_tokens = model.generate(**encoded)
decoded = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

print(decoded)

> **Source:** "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"

> **No fine-tuning:** If the air conditioner is on, the flight resumes too quickly, especially in winter when the weather is cold. If the air conditioner is not on, the flight resumes faster as soon as the weather is cold

> **Fine-tuned on the 2023 data:** If you turn on the air conditioner, the electric range is too fast to fall off, especially in winter when the weather is cold and cold, it is not possible to not turn on the air conditioner without turning on the air conditioner, as the weather freezes and freezes, the electric range will fall off faster.

In [None]:
def translate(sentence):
    encoded = tokenizer(sentence, return_tensors="pt").to(device)
    generated_tokens = model.generate(**encoded)
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
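
The `translate` helper relies on `tokenizer.src_lang` and `tokenizer.tgt_lang` having been set earlier to mBART-style codes (`zh_CN`, `en_XX`), while the dataset records use the short keys `zh` and `en`. One way to keep the two in sync is a small lookup, sketched below. `MBART_LANG_CODES` and `to_mbart_code` are hypothetical names, and the mapping covers only the two languages used in this notebook.

```python
# Short dataset keys mapped to the mBART-style codes used in this notebook.
MBART_LANG_CODES = {"zh": "zh_CN", "en": "en_XX"}


def to_mbart_code(short_code):
    """Return the mBART-style code for a short key, or raise a clear error."""
    try:
        return MBART_LANG_CODES[short_code]
    except KeyError:
        known = ", ".join(sorted(MBART_LANG_CODES))
        raise ValueError(f"Unknown language {short_code!r}; known: {known}") from None


print(to_mbart_code("zh"), to_mbart_code("en"))  # zh_CN en_XX
```

The returned values could then be assigned as `tokenizer.src_lang = to_mbart_code("zh")`, assuming the checkpoint's tokenizer uses these codes.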

In [None]:
for i in trans.dataset['test']['translation'][:5]:
    print("="*30)
    print(i['zh'])
    print(i['en'])
    print("="*10)
    result = translate(i['zh'])
    print(result)