- **Title**: LocalCat Tutorial
- **Author**: Ewen Wang (ewang1@volvocars.com)
- **Last Update**: March 4, 2024

This tutorial will guide you through the process of using **LocalCat** to load pre-trained LLMs and fine-tune them with domain data.

# Load Pre-trained LLMs

**LocalCat** simplifies the process of loading pre-trained LLMs.

## Load Packages

In [1]:
# pip install -r requirements.txt

In [2]:
import pandas as pd

from LocalCat.Translate import Translate
from LocalCat.Translate import Local



In [3]:
model_mbart = "facebook/mbart-large-50-many-to-many-mmt"

trans = Translate(model_name_or_path=model_mbart,
                  src_lang='zh_CN',
                  tgt_lang='en_XX')

**Note:**

AI engineers in China can use a Hugging Face mirror site, such as [hf-mirror.com](https://hf-mirror.com/), by setting the `HF_ENDPOINT` environment variable when running the download script:
    
``` bash
HF_ENDPOINT=https://hf-mirror.com python3 download_model.py
```
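
The mirror can also be selected from inside Python. The contents of `download_model.py` are not shown in this tutorial, so the sketch below is only one way such a script might look: the helper name `download_model` and the use of `huggingface_hub.snapshot_download` are assumptions, not LocalCat code.

```python
import os

# Set the mirror before importing `huggingface_hub` or `transformers`:
# the endpoint is read when those libraries are first imported.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"


def download_model(repo_id: str, local_dir: str) -> str:
    """Download a model repository and return the local folder path.

    Hypothetical helper for illustration; not part of LocalCat.
    """
    # Imported lazily so the environment variable above is already in place.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Example call (downloads several GB, so it is left commented out):
# download_model("facebook/mbart-large-50-many-to-many-mmt", "../models/mbart-large-50")
```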

## Translate a Single Text

In [4]:
text = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"
print(trans.translator(text=text))

It's too fast to resume air-conditioning, especially when it's cold in winter. No air-conditioning is not possible. If it's freezing, it's faster to resume
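
For readers curious about what a wrapper like `Translate` may be doing, the sketch below performs the same kind of translation with the Hugging Face `transformers` API directly. LocalCat's internals are not shown in this tutorial, so this is an illustration under that assumption; `translate_with_mbart` is a hypothetical name.

```python
def translate_with_mbart(text: str,
                         model_name: str = "facebook/mbart-large-50-many-to-many-mmt",
                         src_lang: str = "zh_CN",
                         tgt_lang: str = "en_XX") -> str:
    """Translate one string with mBART-50 (hypothetical helper, not LocalCat)."""
    # Imported lazily: loading transformers and the model is slow and needs
    # the weights to be downloaded or cached.
    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

    tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang=src_lang)
    model = MBartForConditionalGeneration.from_pretrained(model_name)

    encoded = tokenizer(text, return_tensors="pt")
    # mBART-50 selects the output language by forcing the first generated
    # token to be the target language code.
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Reloading the model on every call is wasteful; a wrapper class would normally load it once and reuse it, which is presumably why `Translate` is instantiated before calling `translator`.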


## Translate in Batch

**LocalCat** also supports batch translation. All you need to do is to load the data and call the `translator_batch` method.

The data should be in the form of a `pandas` DataFrame.

In [5]:
file_inference = "../data/trans/PROC-SAMPLE-INFERENCE.csv"

df = pd.read_csv(file_inference)
df.head()

| Unnamed: 0 | Chinese |
| --- | --- |
| 0 | 第二排的舒适性不太理想，减震有点硬，平时有坎时感觉咣当一下，不是很舒服，选择舒适性 |
| 1 | 减震硬，路况不好的地方不太舒服，选择舒适性（在路况不好，沟沟坎坎比较多时候，车内晃动大） |
| 2 | 车内的网络连接不稳定（自带的车联网，通过流量卡连接的互联网，有时使用中会突然没有网，在使用任何APP时都有发生几率，不知是什么原因） |
| 3 | 开空调时车内有潮气的味道，开热风冷风都会有，问了问，有人说是滤芯的气味，不是很重（新车，没有更换过空气滤波器） |
| 4 | 第二排两侧的车门关门时声音咚咚的，声音很沉，感觉车门有点重，听上去没有质感，不是什么大问题，设计的问题 |


In [6]:
df = trans.translator_batch(df=df, 
                            col_src='Chinese', 
                            col_tgt="English")

100%|██████████| 5/5 [00:05<00:00,  1.14s/it]


In [7]:
df.head()

| Unnamed: 0 | Chinese | English |
| --- | --- | --- |
| 0 | 第二排的舒适性不太理想，减震有点硬，平时有坎时感觉咣当一下，不是很舒服，选择舒适性 | The comfort of the second row is not ideal, the shock reduction is a bit hard, normally when there is a can feel a bit, not comfortable, choose comfort |
| 1 | 减震硬，路况不好的地方不太舒服，选择舒适性（在路况不好，沟沟坎坎比较多时候，车内晃动大） | Reduce shaking, where bad road conditions are uncomfortable, choose comfort (in bad road conditions, ditch ditch ditch ditch more time, the car shakes big) |
| 2 | 车内的网络连接不稳定（自带的车联网，通过流量卡连接的互联网，有时使用中会突然没有网，在使用任何APP时都有发生几率，不知是什么原因） | Unstable network connections in the car (self-contained networks, networks connected via traffic cards, sometimes without a network when in use, and most likely when using any APP, for whatever reason) |
| 3 | 开空调时车内有潮气的味道，开热风冷风都会有，问了问，有人说是滤芯的气味，不是很重（新车，没有更换过空气滤波器） | When you turn on the air conditioner, there's a smell of moisture in the car, and when you turn on the hot air, there's a cold air. Some people say it's a filtered air, and it's not very heavy. (New car, no air filter has been replaced) |
| 4 | 第二排两侧的车门关门时声音咚咚的，声音很沉，感觉车门有点重，听上去没有质感，不是什么大问题，设计的问题 | When the doors on both sides of the second row were closed, the sound was loud, and it felt like the doors were a bit heavy, and it didn't sound good. It wasn't a big problem, it was a design problem |
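
Conceptually, batch translation applies a translation function to every value of the source column and stores the results in the target column. How `translator_batch` is implemented is not shown here, so the following is only a minimal sketch of that idea on plain Python lists; `translate_in_batches` and `fake_translate` are hypothetical names, and the stand-in translator simply upper-cases text so the example runs without a model.

```python
from typing import Callable, Iterable, List


def translate_in_batches(texts: Iterable[str],
                         translate_fn: Callable[[List[str]], List[str]],
                         batch_size: int = 8) -> List[str]:
    """Apply `translate_fn` to `texts` in chunks, preserving the input order."""
    items = list(texts)
    results: List[str] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(translate_fn(batch))
    return results


def fake_translate(batch: List[str]) -> List[str]:
    """Stand-in for a real model call: upper-cases each string."""
    return [text.upper() for text in batch]


translated = translate_in_batches(["a", "b", "c"], fake_translate, batch_size=2)
print(translated)
```

With a DataFrame, the same helper could be used as `df["English"] = translate_in_batches(df["Chinese"].tolist(), fake_translate)`, replacing `fake_translate` with a function that calls the model.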


# Fine-tune LLMs with Domain Data

**LocalCat** also supports fine-tuning pre-trained LLMs with domain data.

![](../images/finetune.png)

## Fine-tune LLM

To fine-tune an LLM, you need labelled data. Here it is supplied as a `pandas` DataFrame.

For a translation task specifically, each row must pair a source text with its target text.

In [8]:
file_training = "../data/trans/PROC-NCVQS-2023.csv"

df = pd.read_csv(file_training)

In [9]:
df.head()

| Unnamed: 0 | Chinese | English |
| --- | --- | --- |
| 0 | 开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快 | In the case of turning on the air conditioner, the electric range drops too fast, especially when the weather is cold in winter, if you don't turn on the air conditioner, the weather freezes, the battery life will fall faster. |
| 1 | 车机流畅度差，容易卡死机，车机系统，启动载入很慢，换挡杆前的车机，使用任何功能都有概率死机，发生过3-4次 | The smoothness of the IHU is poor, easy to jam, the car machine system, the start loading is very slow, the car machine before the gear lever, using any function has a probability of crashing, which has occurred 3-4 times. |
| 2 | 整车的悬架系统，在过减速带时，速度在20码以下，但是车身的抖动还是很厉害，舒适性为第一的，美系车相比，差距还是比较大的 | The suspension system of the whole car, when crossing the speed bump, the speed is below 20km/h, but the shaking of the body is still very strong, the comfort is the first, compared with the American car, the gap is still relatively large. |
| 3 | 大众车的通病，车子的隔音效果不太理想，车速在90码以上，车内的胎噪声就很明显了，必须把音量调大，才能缓解一点（是原厂轮胎，车窗关闭） | The common problem of Volkswagen, the sound insulation of the car is not ideal, the speed is above 90km/h, the tire noise in the car is obvious, the volume must be turned up, in order to alleviate a little (is the original tires, the windows are closed. |
| 4 | 车辆外观很不错，但是车标在晚上不能发亮，要是可以发亮的话会更拉风一点 | The appearance of the vehicle is very good, but the logo cannot be shiny at night, if it can be bright, it will be more stunning. |


In [10]:
# model_mbart = "facebook/mbart-large-50-many-to-many-mmt"
# trans = Translate(model_mbart)

finetuned_model_path = "../models/mbart-finetuned-cn-to-en-auto-sample"

trans.finetune(df=df, 
               finetuned_model_path=finetuned_model_path,
               train_size=0.98, 
               batch_size=4,
               learning_rate=2e-5,
               num_train_epochs=4)

Map:   0%|          | 0/438 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

You're using a MBart50TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
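
The three `Map` progress bars above report 438, 5, and 4 examples. Those counts are consistent with a 447-row dataset in which `train_size=0.98` keeps 98% for training and the remaining rows are held out for evaluation. How LocalCat actually splits the data is not documented in this tutorial, so the sketch below is only an assumption about one plausible scheme; `split_rows` is a hypothetical helper, and the order of the two held-out sets may differ from LocalCat's.

```python
import random
from typing import List, Sequence, Tuple


def split_rows(rows: Sequence,
               train_size: float = 0.98,
               seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle rows, keep `train_size` for training, halve the rest."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_size)
    held_out = shuffled[n_train:]
    n_val = len(held_out) // 2
    return shuffled[:n_train], held_out[:n_val], held_out[n_val:]


# 447 is the sum of the three counts reported in the log (438 + 5 + 4).
train, val, test = split_rows(range(447), train_size=0.98)
print(len(train), len(val), len(test))  # 438 4 5
```

With so few held-out examples, the evaluation metrics reported during training should be read as rough indicators rather than precise estimates.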


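| Epoch | Training Loss | Validation Loss | Bleu | Gen Len | Meteor |
| --- | --- | --- | --- | --- | --- |
| 1 | No log | 0.719982 | 65.3751 | 53.5 | 0.8389 |
| 2 | No log | 0.738425 | 59.8413 | 54.75 | 0.8592 |
| 3 | No log | 0.765242 | 57.7931 | 51.25 | 0.8014 |
| 4 | No log | 0.832858 | 55.9429 | 52.0 | 0.8063 |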
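
In this run the validation loss rises from 0.72 to 0.83 and BLEU falls from 65.4 to 55.9 over the four epochs, so the later epochs are not obviously better on the validation set. The sketch below shows one simple way to pick an epoch from these numbers; the values are copied from the table above, while the variable names and the selection rule are illustrative assumptions rather than anything LocalCat does.

```python
# Metrics copied from the training table above.
history = [
    {"epoch": 1, "val_loss": 0.719982, "bleu": 65.3751, "meteor": 0.8389},
    {"epoch": 2, "val_loss": 0.738425, "bleu": 59.8413, "meteor": 0.8592},
    {"epoch": 3, "val_loss": 0.765242, "bleu": 57.7931, "meteor": 0.8014},
    {"epoch": 4, "val_loss": 0.832858, "bleu": 55.9429, "meteor": 0.8063},
]

# Lower is better for loss; higher is better for BLEU and METEOR.
best_by_loss = min(history, key=lambda row: row["val_loss"])["epoch"]
best_by_bleu = max(history, key=lambda row: row["bleu"])["epoch"]
best_by_meteor = max(history, key=lambda row: row["meteor"])["epoch"]

print(best_by_loss, best_by_bleu, best_by_meteor)  # 1 1 2
```

The metrics disagree slightly (METEOR prefers epoch 2), which is common when the validation set is very small.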


[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

{'eval_loss': 0.6715749502182007, 'eval_bleu': 44.7421, 'eval_gen_len': 53.2, 'eval_meteor': 0.7686, 'eval_runtime': 3.255, 'eval_samples_per_second': 1.536, 'eval_steps_per_second': 0.614, 'epoch': 4.0}




In [11]:
finetuned_model_path = "../models/mbart-finetuned-cn-to-en-auto-sample"

text = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"

trans = Translate(model_name_or_path=finetuned_model_path)
print(trans.translator(text=text))

In the case of turning on the air conditioner, the electric range drops too fast, especially when the weather is cold in winter, if you don't turn on the air conditioner, the weather freezes, the electric range will fall faster.


> **MBart (pre-trained, before fine-tuning):** It's too fast to resume air-conditioning, especially when it's cold in winter. No air-conditioning is not possible. If it's freezing, it's faster to resume


# Deploy the LLM

**LocalCat** also supports deploying LLMs to the cloud as an AWS SageMaker endpoint.

![](../images/aws_llm.png)

Deploying the model involves the following steps:

1. Push the model to S3
2. Deploy the model as an endpoint
3. Test the endpoint

In [None]:
model_path = "../models/"
model_finetuned = "mbart-finetuned-cn-to-en-auto-sample"

## Step 1: Push the model to S3

In [None]:
bucket = "ai"
prefix = "llm"

local = Local(model_name=model_finetuned, model_path=model_path)
local.push_to_s3(bucket=bucket, prefix=prefix)
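
SageMaker loads Hugging Face models from a `model.tar.gz` archive stored in S3, with the model files at the root of the archive. What `push_to_s3` does internally is not shown in this tutorial; the sketch below is an assumption about the general approach, using the standard library and `boto3`. The helper names `make_model_archive` and `upload_archive` are hypothetical.

```python
import os
import tarfile
import tempfile


def make_model_archive(model_dir: str, archive_path: str) -> str:
    """Pack the files in `model_dir` at the root of a .tar.gz archive.

    Write the archive outside `model_dir` so it does not include itself.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(model_dir)):
            tar.add(os.path.join(model_dir, name), arcname=name)
    return archive_path


def upload_archive(archive_path: str, bucket: str, prefix: str) -> str:
    """Upload the archive to S3 and return its S3 URI (needs AWS credentials)."""
    import boto3  # imported lazily; only needed for the actual upload

    key = f"{prefix}/{os.path.basename(archive_path)}"
    boto3.client("s3").upload_file(archive_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Local demonstration with a dummy model folder (no AWS access required).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "model")
    os.makedirs(demo_dir)
    with open(os.path.join(demo_dir, "config.json"), "w") as f:
        f.write("{}")
    archive = make_model_archive(demo_dir, os.path.join(tmp, "model.tar.gz"))
    with tarfile.open(archive) as tar:
        archived_names = tar.getnames()

print(archived_names)  # ['config.json']
```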

## Step 2: Deploy the model as an endpoint

In [None]:
local.deploy(instance_type='ml.g4dn.4xlarge',
             transformers_version='4.37.0', 
             pytorch_version='2.1.0', 
             py_version='py310')
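
The arguments passed to `local.deploy` mirror those of the SageMaker Python SDK's `HuggingFaceModel`. Assuming LocalCat wraps that class (the tutorial does not confirm this), an equivalent direct call might look like the sketch below. The function name `deploy_huggingface_model`, the `role` argument, and the `HF_TASK` setting are assumptions for illustration.

```python
def deploy_huggingface_model(model_data: str,
                             role: str,
                             instance_type: str = "ml.g4dn.4xlarge",
                             transformers_version: str = "4.37.0",
                             pytorch_version: str = "2.1.0",
                             py_version: str = "py310"):
    """Create a SageMaker endpoint from a model archive in S3.

    `model_data` is the S3 URI of the model.tar.gz archive and `role` is an
    IAM role ARN that SageMaker can assume. Hypothetical helper, not LocalCat.
    """
    # Imported lazily; requires the `sagemaker` package and AWS credentials.
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(
        model_data=model_data,
        role=role,
        transformers_version=transformers_version,
        pytorch_version=pytorch_version,
        py_version=py_version,
        # Assumption: the default inference handler is used, which selects
        # its pipeline from HF_TASK. A custom handler may be needed for mBART.
        env={"HF_TASK": "translation"},
    )
    return model.deploy(initial_instance_count=1, instance_type=instance_type)
```

The version triple must match a Hugging Face container image that AWS publishes; an unsupported combination makes the deployment fail when the image is resolved.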

## Step 3: Test the endpoint

Look up the endpoint name in the AWS SageMaker console, for example `MBART-20240226-024324`.

In [None]:
local = Local()
local.endpoint_name = "MBART-20240226-024324" 

text = "开空调的情况下，续航掉的太快了，特别是冬天天气冷的时候，不开空调不行，天气一冻，续航就掉的更快"
result = local.translator(text=text)
print(result)
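
The endpoint can also be called without LocalCat, through the `boto3` SageMaker runtime client. The sketch below assumes the endpoint accepts the JSON payload convention of the Hugging Face inference containers (`{"inputs": ...}`); the request format that LocalCat's endpoint actually expects is not shown in this tutorial, and `build_payload` and `invoke_translation` are hypothetical names.

```python
import json


def build_payload(text: str) -> bytes:
    """Encode the request body as UTF-8 JSON, keeping Chinese text readable."""
    return json.dumps({"inputs": text}, ensure_ascii=False).encode("utf-8")


def invoke_translation(endpoint_name: str, text: str):
    """Send `text` to a SageMaker endpoint and return the decoded JSON reply."""
    import boto3  # imported lazily; requires AWS credentials and a region

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_payload(text),
    )
    return json.loads(response["Body"].read().decode("utf-8"))


payload = build_payload("续航掉的太快了")

# Example call against the endpoint named in this section:
# invoke_translation("MBART-20240226-024324", "续航掉的太快了")
```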