In [1]:
# !pip install -r requirements.txt

In [1]:
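from Translate import Translate  # translation and fine-tuning wrapper used throughout this notebook

import pandas as pd
import warnings

# Silence all Python warnings to keep the cell output readable.
warnings.filterwarnings("ignore")

# Show full cell text instead of truncating long sentences.
pd.set_option('display.max_colwidth', None)

# Pretrained mBART-50 many-to-one checkpoint: the baseline and the starting point for fine-tuning.
model_mbart = "facebook/mbart-large-50-many-to-one-mmt"

# Name and location of the fine-tuned model.
model_name = "mbart-finetuned-cn-to-en-auto"
model_path = f"../models/{model_name}"

# Full dataset (used for fine-tuning) and a small sample (used for the batch demos).
file_proc = "../data/trans/PROC-NCVQS-2021-2023.csv"
file_sample = "../data/trans/PROC_SAMPLE.csv"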

## Single Case

In [3]:
# Baseline: translate one sentence with the pretrained checkpoint, before any fine-tuning.
trans = Translate(model_mbart)
text = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"
print(trans.translator(text))

If the air conditioner is on, the flight resumes too quickly, especially in winter when the weather is cold. If the air conditioner is not on, the flight resumes faster as soon as the weather is cold


## Batch Processing

In [4]:
df = pd.read_csv(file_sample)
df = trans.translator_batch(df, col_tgt="Translation")
df.head()

100%|██████████| 20/20 [00:20<00:00,  1.03s/it]


Unnamed: 0,Chinese,English,Translation
0,第二排的舒适性不太理想，减震有点硬，平时有坎时感觉咣当一下，不是很舒服，选择舒适性,"The comfort of the second row is not ideal, the shock absorption is a bit hard, and it feels awkward when there are bumps, which is not very comfortable, and I choose comfort","The comfort of the second row is not ideal, the shock absorption is a little hard, usually have a can when feeling a bit uncomfortable, not very comfortable, choose comfort"
1,减震硬，路况不好的地方不太舒服，选择舒适性（在路况不好，沟沟坎坎比较多时候，车内晃动大）,"The shock absorption is hard, and riding under poor road conditions are not very comfortable. I choose comfort (when the road conditions are not good and there are many ridges and bumps, the inside of the car shakes a lot)","Shock absorption hard, uncomfortable where the road is not good, choose comfort (when the road is not good, groove groove groove more often, the inside of the car shakes a lot)"
2,车内的网络连接不稳定（自带的车联网，通过流量卡连接的互联网，有时使用中会突然没有网，在使用任何APP时都有发生几率，不知是什么原因）,"The network connection in the car is unstable (the built-in car network, the Internet is connected through the data traffic card, sometimes there is no network during use, it might happen when using any APP, and I don’t know why)","Internet connection in the car is unstable (self-contained car network, Internet connected through traffic card, sometimes in use suddenly no network, there is a probability of using any APP, I do not know what the reason is)"
3,开空调时车内有潮气的味道，开热风冷风都会有，问了问，有人说是滤芯的气味，不是很重（新车，没有更换过空气滤波器）,"There is a smell of moisture in the car when the AC is turned on both when hot and cold air is supplied in the car. When I asked, some repairmen said that it was the smell of the filter element, which was not very heavy (new car, the air filter has not been replaced)","When you turn on the air conditioner, there is a smell of damp air inside the car, hot air and cold air are all there, I asked, some people say it is the smell of the filter core, not very heavy (new car, no air filter has been replaced)"
4,第二排两侧的车门关门时声音咚咚的，声音很沉，感觉车门有点重，听上去没有质感，不是什么大问题，设计的问题,"When the doors on both sides of the second row are closed, the thumping sound is very heavy. It doesn't feel good quality texture, not a big problem, a design issue",The door on both sides of the second row is knocking when the door is closed. The sound is very loud. It feels like the door is a little heavy. It doesn't sound quality. It's not a big problem. It's a design problem
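The internals of `translator_batch` are not shown in this notebook. As a rough illustration of the general batching pattern only, the sketch below splits a list of sentences into fixed-size chunks and passes each chunk to a translation callable. The names `chunked`, `translate_in_batches`, and `fake_translate`, and the default batch size, are hypothetical and are not part of the `Translate` class.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(
    texts: List[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumed value, not taken from the notebook
) -> List[str]:
    """Apply `translate_fn` batch by batch and keep the input order."""
    results: List[str] = []
    for batch in chunked(texts, batch_size):
        outputs = translate_fn(batch)
        if len(outputs) != len(batch):
            raise ValueError("translate_fn must return one output per input")
        results.extend(outputs)
    return results


# Stand-in for a real model call so the sketch runs offline (hypothetical).
def fake_translate(batch: List[str]) -> List[str]:
    return [f"<en>{text}</en>" for text in batch]


sample = ["第一句", "第二句", "第三句"]
translated = translate_in_batches(sample, fake_translate, batch_size=2)
print(translated)
```

With a DataFrame like the one above, the input list could come from `df["Chinese"].tolist()` and the result could be assigned back to a new column, since the helper preserves order.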


## Fine-tuning

In [6]:
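# Path passed to finetune() for the fine-tuned model.
model_name_new = "mbart-finetuned-cn-to-en-auto"
finetuned_model_path = f"../models/{model_name_new}"

# Load the full dataset.
df = pd.read_csv(file_proc)

# Start from the pretrained checkpoint and fine-tune on df.
trans = Translate(model_mbart)
trans.finetune(df, finetuned_model_path=finetuned_model_path)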

Map:   0%|          | 0/12884 [00:00<?, ? examples/s]

Map:   0%|          | 0/716 [00:00<?, ? examples/s]

Map:   0%|          | 0/716 [00:00<?, ? examples/s]

You're using a MBart50TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Bleu,Gen Len,Meteor
1,0.9325,0.875066,30.6189,51.588,0.6143


[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


{'eval_loss': 0.8992447853088379, 'eval_bleu': 30.9389, 'eval_gen_len': 52.0768, 'eval_meteor': 0.6171, 'eval_runtime': 291.0182, 'eval_samples_per_second': 2.46, 'eval_steps_per_second': 0.615, 'epoch': 1.0}
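To give a feel for what a BLEU-style score measures, here is a simplified sentence-level sketch: clipped n-gram precision for n = 1 to 4, a geometric mean, and a brevity penalty, scaled to 0-100. It is not the metric implementation that produced `eval_bleu` above. Whitespace tokenization, a single reference, and the absence of smoothing are all simplifying assumptions, and `simple_bleu` is a hypothetical name.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified single-reference BLEU on a 0-100 scale (no smoothing)."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


score_same = simple_bleu("the range drops fast in winter",
                         "the range drops fast in winter")
score_short = simple_bleu("the range drops fast",
                          "the range drops fast in winter")
print(score_same, score_short)
```

A hypothesis that matches the reference but stops early keeps perfect n-gram precision and is reduced only by the brevity penalty, which is why the second score is lower than the first.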


### Single Case

In [3]:
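model_name_new = "mbart-finetuned-cn-to-en-auto"
finetuned_model_path = f"../models/{model_name_new}"

# Load the fine-tuned model from disk instead of the pretrained checkpoint.
trans = Translate(finetuned_model_path)

# Same sentence as in the baseline single case, for a direct comparison.
text = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"
print(trans.translator(text))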

In the case of turning on the air conditioner, the electric range drops too fast, especially when the weather is cold in winter, not turning on the air conditioner is not possible, the weather freezes, and the electric range drops faster.


### Batch Processing

In [4]:
df = pd.read_csv(file_sample)

df = trans.translator_batch(df, col_tgt="Translation")
df.head()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
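The warning above asks for `TOKENIZERS_PARALLELISM` to be set explicitly. A minimal way to do that from Python is shown below. Choosing `"false"` is an assumption: it silences the warning and disables parallel tokenization, while `"true"` is the other accepted value.

```python
import os

# Set this before the tokenizer is first used, for example at the top of the notebook.
# "false" is an assumed choice; the warning accepts either "true" or "false".
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```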


100%|██████████| 20/20 [01:25<00:00,  4.30s/it]


Unnamed: 0,Chinese,English,Translation
0,第二排的舒适性不太理想，减震有点硬，平时有坎时感觉咣当一下，不是很舒服，选择舒适性,"The comfort of the second row is not ideal, the shock absorption is a bit hard, and it feels awkward when there are bumps, which is not very comfortable, and I choose comfort","The comfort of the second row is not ideal, the shock absorption is a bit hard, usually feels like a bump when there is a bump, not very comfortable, choose comfort."
1,减震硬，路况不好的地方不太舒服，选择舒适性（在路况不好，沟沟坎坎比较多时候，车内晃动大）,"The shock absorption is hard, and riding under poor road conditions are not very comfortable. I choose comfort (when the road conditions are not good and there are many ridges and bumps, the inside of the car shakes a lot)","The shock absorption is hard, the bad road conditions are not very comfortable, choose comfort (in the bad road conditions, grooves and grooves are relatively frequent, and the vibration inside the car is large."
2,车内的网络连接不稳定（自带的车联网，通过流量卡连接的互联网，有时使用中会突然没有网，在使用任何APP时都有发生几率，不知是什么原因）,"The network connection in the car is unstable (the built-in car network, the Internet is connected through the data traffic card, sometimes there is no network during use, it might happen when using any APP, and I don’t know why)","The network connection in the car is unstable (the self-contained car network, the Internet connected through the traffic card, sometimes there is suddenly no Internet in use, there is a chance when using any APP, I do not know what the reason is."
3,开空调时车内有潮气的味道，开热风冷风都会有，问了问，有人说是滤芯的气味，不是很重（新车，没有更换过空气滤波器）,"There is a smell of moisture in the car when the AC is turned on both when hot and cold air is supplied in the car. When I asked, some repairmen said that it was the smell of the filter element, which was not very heavy (new car, the air filter has not been replaced)","When turning on the air conditioner, there is a smell of dampness in the car, hot and cold air can be there, I asked, some people say it is the odor of the filter core, not very heavy (new car, no air filter has been replaced."
4,第二排两侧的车门关门时声音咚咚的，声音很沉，感觉车门有点重，听上去没有质感，不是什么大问题，设计的问题,"When the doors on both sides of the second row are closed, the thumping sound is very heavy. It doesn't feel good quality texture, not a big problem, a design issue","The door on the second row on both sides of the car door is knocking, the sound is very loud, the door feels a bit heavy, it does not sound quality, is not a big problem, the design problem."
