# Fine-Tuning Gemma for Cultural Language Understanding


<br>

### Table of Contents
1. [Introduction](#1.)<br>
2. [Library Import](#2.)<br>
3. [Data Collection & Preprocessing](#3.)<br>
4. [Model & Hyperparams](#4.)<br>
&emsp;4.1 [Memory Requirement & Model Selection](#4.1)<br>
&emsp;4.2 [Low-Rank Adaptation (LoRA)](#4.2)<br>
&emsp;4.3 [Quantization](#4.3)<br>
5. [Fine-tuning model](#5.)<br>
6. [Model Evaluation](#6.)<br>
&emsp;6.1 [Prompting Technique Evaluation](#6.1)<br>
&emsp;6.2 [RAG for Advanced Nom Analysis](#6.2)<br>
7. [Benchmark](#7.)<br>
8. [Conclusion](#8.)<br>
9. [References](#9.)<br>

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href=""><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/KasaiHarcore/Gemma-Fine-tuning"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

# <div style="padding:14px;color:white;margin:0;font-family:Georgia;font-size:30px;text-align:left;display:fill;border-radius:5px;background-color:#004AAD;overflow:hidden">1. Introduction </div> <a id = "1."></a>

AI technologies such as chatbots and AI-assisted support systems have recently become part of many aspects of daily life, driven largely by advances in natural language processing (NLP).

In this domain, fine-tuning and optimizing language models for specific languages, especially underrepresented or low-resource languages, holds immense significance. This is not just a technical challenge but also a mission to foster inclusivity and linguistic diversity in the digital age.

<img src = "gemma2.jpg" alt = "Gemma Power">

This project focuses on enhancing [**Gemma 2**](https://huggingface.co/blog/gemma2#:~:text=Gemma%202%20Instruct%20has%20been,model%20oriented%20more%20towards%20conversational), Google’s latest open large language model (LLM), built by Google DeepMind from the same research and technology as the Gemini models. Designed for versatility and high performance, the model offers an 8K-token context length, a broad training foundation, and a permissive license that supports diverse applications. Its adaptability makes it a good candidate for bridging linguistic and cultural gaps in NLP.

<img src = "Nom-Tay.jpg" alt = "Internet Archive">

In this work, I aim to fine-tune Gemma 2 to better understand and generate text in [**Chữ Nôm**](https://en.wikipedia.org/wiki/Ch%E1%BB%AF_N%C3%B4m#:~:text=Ch%E1%BB%AF%20N%C3%B4m%20is%20the%20logographic,%E5%9C%8B%E8%AA%9E%2C%20'national%20language'), the traditional Vietnamese logographic script. Chữ Nôm (or simply Nôm) was used to write Vietnamese before the 20th century. It evolved from Chinese characters but was adapted to Vietnamese sounds and vocabulary, and scholars used it for literature and communication. The script differs visually from Chinese characters and expresses Vietnamese concepts through semantic and phonetic components [**(read more)**](https://www.quora.com/How-was-the-Han-Nom-Chu-Nom-script-different-from-the-Chinese-script-Was-it-ineffective-for-Vietnamese). Today, Chữ Nôm is a specialized field of study, and efforts are under way to preserve knowledge of it. Although modern Vietnamese uses the Latin alphabet, Nôm remains an integral part of Vietnam's cultural heritage.

By adapting Gemma 2 to this context, the project aspires to breathe new life into this ancient script and make it accessible in AI-driven applications.

What I will focus on in this notebook:
- Enhancing the fluency and cultural relevance of AI.
- Preserving and facilitating access to traditional literary heritage.
- Expanding the capabilities of AI to work with logographic systems.

The following libraries are required for this notebook:

In [None]:
!pip install transformers datasets scikit-learn beautifulsoup4 nltk trl huggingface_hub watermark matplotlib seaborn peft --quiet
!pip install -U bitsandbytes --quiet

# <div style="padding:14px;color:white;margin:0;font-family:Georgia;font-size:30px;text-align:left;display:fill;border-radius:5px;background-color:#004AAD;overflow:hidden">2. Library Import</div> <a id = "2."></a>

In [1]:
import re
import os
import json
import requests
from bs4 import BeautifulSoup

# NLP processing
import nltk
import unicodedata

# Data manipulation
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, PeftModel
import bitsandbytes as bnb
from trl import SFTTrainer

# Python version
from platform import python_version
print('Python version in this Jupyter Notebook:', python_version())

# Load library versions
import watermark

# Library versions
%reload_ext watermark
%watermark -a "Library versions" --iversions

import warnings
warnings.filterwarnings("ignore")


Python version in this Jupyter Notebook: 3.12.0
Author: Library versions

transformers: 4.44.2
watermark   : 2.5.0
matplotlib  : 3.9.0
numpy       : 1.26.4
json        : 2.0.9
pandas      : 2.2.2
torch       : 2.3.1+cu121
re          : 2.2.1
platform    : 1.0.8
seaborn     : 0.13.2
nltk        : 3.8.1
sklearn     : 1.5.0
peft        : 0.12.0
bs4         : 4.12.3
datasets    : 3.0.0
trl         : 0.11.0
requests    : 2.32.3
bitsandbytes: 0.44.1



In [2]:
torch.cuda.is_available()

True

In [3]:
# My specs for the GPU
!nvidia-smi

Mon Nov 25 03:08:01 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.14                 Driver Version: 566.14         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:05:00.0 Off |                  N/A |
| 55%   36C    P8             33W /  390W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

Google has already published the model on Hugging Face, so we can use it directly, but we need to log in first. Create a Hugging Face account, then follow these guides:
- [Access Token](https://huggingface.co/docs/hub/security-tokens)
- [Quick Start](https://huggingface.co/docs/huggingface_hub/quick-start)

In [4]:
from huggingface_hub import login

# Read the access token from the HF_TOKEN environment variable
# instead of hard-coding a secret in the notebook
tk = os.environ["HF_TOKEN"]

login(token = tk)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to C:\Users\nguye\.cache\huggingface\token
Login successful


# <div style="padding:14px;color:white;margin:0;font-family:Georgia;font-size:30px;text-align:left;display:fill;border-radius:5px;background-color:#004AAD;overflow:hidden">3. Data Collection & Preprocessing</div> <a id = "3."></a>

Because most surviving Nôm text comes from old poems, historical documents, and literature, we can use these texts to fine-tune the model. The datasets can be found here:
- [**Sentences set**](https://www.kaggle.com/datasets/quandang/nomnanmt) 
- [**Poetry set**](https://chunom.org/shelf/corpus/)
- A custom Nôm dictionary with verified Vietnamese translations, which I processed manually based on the [**Vietnamese Nôm Preservation Foundation**](https://nomfoundation.org/). It is used to evaluate the quality of the data and to ensure cultural sensitivity.

The data is also available on [**HuggingFace**](https://huggingface.co/datasets/KasaiDanto/vietnamese_nom_scripts).

In [5]:
def split_nom_vi(file_path: str, output_path: str = r"./data/output.csv") -> pd.DataFrame:
    """ Split a tab-separated text file into Nôm / Vietnamese columns and save it as CSV """
    nom = []
    vi = []
    
    with open(file_path, "r", encoding = "utf-8") as f:
        lines = f.readlines()
    
    for line_number, line in enumerate(lines, start = 1):
        # Skip empty lines
        if not line.strip():
            continue
        
        # Each valid line has the form "<nom>\t<vi>"
        parts = line.split("\t")
        if len(parts) == 2:
            nom.append(parts[0].strip())
            vi.append(parts[1].strip())
        else:
            # Report the line number (not only the content) so the line can be located
            print(f"Process error at line number {line_number}: {line.strip()}")
    
    # Create the DataFrame and save it
    data = pd.DataFrame({"nom": nom, "vi": vi})
    data.to_csv(output_path, index = False, encoding = "utf-8")
    print(f"Data Finish at {output_path}")
    
    return data

In [6]:
data_path = os.walk("./data/sentences")
for path, _, files in data_path:
    for file in files:
        if file.endswith(".txt"):
            file_path = os.path.join(path, file)
            output_path = os.path.join(path, file.replace(".txt", ".csv"))
            split_nom_vi(file_path, output_path)

Data Finish at ./data/sentences\DVSKTT-1 Quyen thu.csv
Data Finish at ./data/sentences\DVSKTT-2 Ngoai ky toan thu.csv
Data Finish at ./data/sentences\DVSKTT-3 Ban ky toan thu.csv
Data Finish at ./data/sentences\DVSKTT-4 Ban ky thuc luc.csv
Data Finish at ./data/sentences\DVSKTT-5 Ban ky tuc bien.csv
Data Finish at ./data/sentences\Luc Van Tien.csv
Data Finish at ./data/sentences\Tale of Kieu 1866.csv
Data Finish at ./data/sentences\Tale of Kieu 1871.csv
Data Finish at ./data/sentences\Tale of Kieu 1872.csv


In [7]:
dataset_sens = pd.DataFrame()
data_path = os.walk("./data/sentences")
for path, _, files in data_path:
    for file in files:
        if file.endswith(".csv"):
            file_path = os.path.join(path, file)
            dataset_sens = pd.concat([dataset_sens, pd.read_csv(file_path)])
            
dataset_sens = dataset_sens.reset_index(drop = True)

In [8]:
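# The two files are line-aligned: line i of nom.txt corresponds to line i of vi.txt
with open("./data/poetry/nom.txt", "r", encoding = "utf-8") as f:
    nom = [line.strip() for line in f]
    
with open("./data/poetry/vi.txt", "r", encoding = "utf-8") as f:
    vi = [line.strip() for line in f]
    
dataset_poe = pd.DataFrame({"nom": nom, "vi": vi})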

In [9]:
dataset_dict = pd.read_csv('./data/Nom_dictionary.csv')
dataset_dict.head(20)

Unnamed: 0,nom,vi
0,阿,a
1,阿片,a phiến
2,阿彌陀佛,a di đà phật
3,埃,ai
4,安,an
5,英,anh
6,英,anh
7,英姉㛪,anh chị em
8,英户,anh họ
9,𠀧,ba


Each language needs its own preprocessing steps, so Vietnamese and Nôm text are handled by separate functions.

In [10]:
def preprocess_vi_text(text: str) -> str:
    # Normalize text to a single Unicode form
    text = unicodedata.normalize('NFC', text)
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra white spaces (done last so removed punctuation leaves no double spaces)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

In [11]:
def preprocess_nom_text(text: str) -> str:
    # Remove Latin letters (both lowercase and uppercase)
    text = re.sub(r'[A-Za-z]', '', text)
    # Remove extra white spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text
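
The `NFC` normalization step matters for Vietnamese because the same accented letter can be stored either as one precomposed character or as a base letter followed by combining marks. The short sketch below illustrates this with the word "phiến" from the dictionary sample; the two spellings are constructed by hand for illustration and are not taken from the dataset files.

```python
import unicodedata

# "phiến" written with the precomposed character U+1EBF
composed = "phi\u1ebfn"
# The same word written as "e" + combining circumflex + combining acute
decomposed = "phie\u0302\u0301n"

# The strings render identically but differ as sequences of code points
print(composed == decomposed)
print(len(composed), len(decomposed))

# After NFC normalization both spellings become the same string
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
```

Without this step, visually identical sentences could be treated as different strings by `drop_duplicates` and tokenized differently by the model.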

In [12]:
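dataset_csv = pd.concat([dataset_sens, dataset_poe, dataset_dict], ignore_index = True)
# Keep only the two text columns, so an extra index column from the dictionary file
# cannot cause dropna() to discard the sentence and poetry rows
dataset_csv = dataset_csv[['nom', 'vi']]
dataset_csv = dataset_csv.dropna()
dataset_csv = dataset_csv.drop_duplicates()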

In [13]:
# Apply preprocess to nom and vi
dataset_csv['nom'] = dataset_csv['nom'].apply(preprocess_nom_text)
dataset_csv['vi'] = dataset_csv['vi'].apply(preprocess_vi_text)

In [14]:
train, val = train_test_split(dataset_csv, test_size = 0.1, random_state = 42)
val, test = train_test_split(val, test_size = 0.5, random_state = 42)

train.to_csv('./data/train.csv', index = False)
val.to_csv('./data/val.csv', index = False)
test.to_csv('./data/test.csv', index = False)
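
The two calls above produce a 90% / 5% / 5% train / validation / test split. As a rough check, the sketch below reproduces the split sizes by hand. It assumes the total row count is the sum of the three sizes reported by the `Map` progress output later in the notebook (31297 + 1739 + 1739), and that the held-out side of each split is rounded up, which is how scikit-learn treats a float `test_size` as far as I know. The helper `split_sizes` is hypothetical and not part of the notebook.

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_kept, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

# Assumption: total rows = sum of the split sizes reported later in the notebook
total = 31297 + 1739 + 1739

# First call: 10% held out for validation + test
n_train, n_holdout = split_sizes(total, 0.1)
# Second call: the held-out rows are split in half
n_val, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_val, n_test)
```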

In [15]:
dataset = load_dataset('csv', data_files = {'train': './data/train.csv', 'test': './data/test.csv', 'validation': './data/val.csv'})

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

# <div style="padding:14px;color:white;margin:0;font-family:Georgia;font-size:30px;text-align:left;display:fill;border-radius:5px;background-color:#004AAD;overflow:hidden">4. Model Loading & Hyperparams Tuning</div> <a id = "4."></a>

# 4.1 Memory Checking & Model Selection <a id = "4.1"></a>

To balance output quality against the available hardware, I use the Gemma 2 model with 2B parameters. Many people, especially those new to LLMs, have difficulty choosing a model that fits their hardware, so I want to share a website that can help: [**LLM Checker**](https://rahulschand.github.io/gpu_poor/). (**Note that the image below is only an example.**)

<img src = "llmcheck.png" alt = "LLM Model Choosing">

Note that it may not give 100% accurate results, but it is a good starting point. Now let us check what Gemma can do with our task. See this [**guide**](https://huggingface.co/docs/transformers/conversations) if you want to customize your own prompt or use a more advanced method.

In [None]:
# Note: I ran this cell only once, separately from all the others, to avoid exceeding GPU memory

def chat_with_model(model_name: str, user_input: str, max_length: int = 1024, trained_lora: str = None, bnb_config: BitsAndBytesConfig = None):
    """ Load a text generation model and simulate a single chat interaction """
    # Build the default quantization config at call time, not at definition time
    if bnb_config is None:
        bnb_config = BitsAndBytesConfig(load_in_8bit = True)
    
    try:
        # Load the tokenizer and model
        print(f"Loading model and tokenizer for '{model_name}'...")
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config = bnb_config, device_map = "auto")
        if trained_lora is not None:
            # Attach the fine-tuned LoRA adapter on top of the base model
            model = PeftModel.from_pretrained(model, trained_lora)
        
        # Tokenize user input and move it to the same device as the model
        print("Generating response... \n")
        inputs = tokenizer(user_input, return_tensors = "pt").to(model.device)
        
        # Generate a response (input ids and attention mask are passed together)
        outputs = model.generate(**inputs, max_length = max_length, do_sample = True, top_k = 10, top_p = 0.95)
        
        # Decode, print, and return the response
        response = tokenizer.decode(outputs[0], skip_special_tokens = True)
        print(f"Response: {response} \n")
        return response
            
    except Exception as e:
        # Print the error so a failure is visible in the cell output
        print(f"Error: {e}")
        return None

model_name = "google/gemma-2-9b-it"
input_text = f"Translate this '{dataset_csv['nom'].iloc[4]}' to Vietnamese and explain the context meaning"
chat_with_model(model_name, input_text)

Based on the model's response, Gemma does a reasonably good job of understanding the context. Judged by its linguistic meaning in Vietnamese, however, the translation could better reflect the style and spirit of ancient literature. The model conveys the general idea of history's continuity and incompleteness, but not the significance and depth of the original sentence, which is rich in historical and cultural meaning and written in the semantic style of ancient literature.

# 4.2 Low-Rank Adaptation (LoRA) <a id = "4.2"></a>

LoRA (Low-Rank Adaptation of Large Language Models) is a popular, lightweight training method that greatly reduces the number of trainable parameters. It works by adding a small number of new weights to the model and training only those.

Instead of backpropagating into and updating every weight of the model, LoRA freezes the original weight matrix, W, and adds two much smaller matrices, A and B, alongside it. Multiplying B by A produces a matrix with the same dimensions as W, which is added to W as a low-rank update. During training, the loss is computed as usual, but the gradients update only A and B.
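
To make the shapes concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. The sizes are toy values chosen for illustration (not Gemma's), and the initialization (small random A, zero B) follows the common convention rather than anything specific to this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only (not Gemma's dimensions)
d_out, d_in, r, alpha = 8, 6, 2, 4

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init so the update starts at 0

def lora_forward(x, W, A, B, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x); only A and B would be trained."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
y_base = W @ x
y_lora = lora_forward(x, W, A, B, alpha, r)

# B @ A has the same shape as W, but A and B together hold far fewer values
full_params = W.size
lora_params = A.size + B.size
print((B @ A).shape, W.shape)
print(full_params, lora_params)
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen one, and training moves it away from that starting point only through A and B.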

<img src = "lora.png" alt = "LoRA">

For math lovers who want to explore this further, I recommend [**this**](https://medium.com/@lokeshtodwal/demystifying-lora-q-lora-ea267abff48) article, which explains how LoRA works.

In [16]:
lora_config = LoraConfig(
    r = 16, # Rank of the update matrices A and B
    lora_alpha = 32, # Scaling factor (the update is scaled by lora_alpha / r)
    lora_dropout = 0.1, # Dropout probability applied to the LoRA layers' input
    target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    bias = "none",
    task_type = "CAUSAL_LM",
)

# 4.3 Quantization <a id = "4.3"></a>

In machine learning, quantization is a technique that lowers the numerical precision of a model's parameters. It has become widely used with the rise of Large Language Models (LLMs). With this method, 32-bit or 16-bit floating-point numbers are converted to lower-precision formats such as 8-bit or 4-bit values. The main objective is to reduce the model's size and thereby lower its memory and computational requirements.

Quantization is important for the following main reasons:
- **Reduced Model Size**: Lowering the precision of the weights significantly reduces storage requirements. This is particularly beneficial when deploying models on edge devices with limited memory.
- **Faster Inference**: Quantized models move less data and can use cheaper arithmetic, which can lead to faster inference. This is crucial for real-time applications and user experience.
- **Lower Power Consumption**: Reduced computational requirements translate to lower power consumption, which is very important for battery-operated devices.
- **Scalability**: A smaller model is easier to deploy and easier to scale.
- **Cost Efficiency**: Smaller models that retain high accuracy cost less to serve for the same number of requests. This makes large language models accessible to smaller organizations and startups.

A more in-depth explanation is available [**here**](https://www.maartengrootendorst.com/blog/quantization/).
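
As a rough illustration of the idea, the sketch below applies simple absmax quantization to a small float32 matrix: every value is divided by a single scale so that it fits in the int8 range, rounded, and later multiplied back. This is a simplified teaching example with hypothetical helper names; it is not how `bitsandbytes` implements 8-bit loading, which uses more elaborate schemes.

```python
import numpy as np

def quantize_absmax_int8(weights):
    """Map float weights to int8 using one scale for the whole tensor."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)

q, scale = quantize_absmax_int8(w)
w_hat = dequantize(q, scale)

# The int8 copy needs a quarter of the memory of the float32 original
print(w.nbytes, q.nbytes)
# Rounding introduces an error of at most half a quantization step per value
max_error = np.abs(w - w_hat).max()
print(max_error, scale / 2)
```

The trade-off is visible directly: memory drops by a factor of four, while each weight is off by at most half of one quantization step.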

In [17]:
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit = True,
#     bnb_4bit_use_double_quant = True,
#     bnb_4bit_quant_type = "nf4",
#     bnb_4bit_compute_dtype = torch.bfloat16
# )
bnb_config = BitsAndBytesConfig(
    load_in_8bit = True
)

Finally, a few predefined parameters:

In [18]:
# Hyperparams
BATCH_SIZE = 4
EPOCHS = 5
OUTPUT_DIR = "./working/model"

MODEL_NAME = "google/gemma-2-2b-it"

# <div style="padding:14px;color:white;margin:0;font-family:Georgia;font-size:30px;text-align:left;display:fill;border-radius:5px;background-color:#004AAD;overflow:hidden">5. Fine-tuning model</div> <a id = "5."></a>

In [19]:
def formatting_function(examples: dict) -> dict:
    """ Build the instruction prompt for a single Nôm / Vietnamese example """
    # `dataset.map` passes each row as a dict, not a DataFrame
    nom_text = examples['nom']
    vi_text = examples['vi']
    instruction_template = f"""
    **Instruction:**

    Please translate the following text to Vietnamese.

    **Text to Translate:**
    {nom_text}
    
    **Response:**
    {vi_text}
    """
    
    return {'prompt': instruction_template}
    

In [20]:
formatted_dataset = dataset.map(formatting_function)

Map:   0%|          | 0/31297 [00:00<?, ? examples/s]

Map:   0%|          | 0/1739 [00:00<?, ? examples/s]

Map:   0%|          | 0/1739 [00:00<?, ? examples/s]

In [None]:
def print_trainable_parameters(model):
    """ Prints the number of trainable parameters in the model. """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [22]:
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, quantization_config = bnb_config, device_map = "auto", attn_implementation = 'eager')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [23]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [24]:
print_trainable_parameters(get_peft_model(model, lora_config))

trainable params: 6389760 || all params: 2620731648 || trainable%: 0.2438158826706396
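
The trainable-parameter count can be reproduced by hand: each adapted projection with `in` inputs and `out` outputs adds `r * (in + out)` LoRA weights. The sketch below does this for the four target modules. The layer dimensions are my assumption about the Gemma 2 2B architecture and are not stated in this notebook (they can be checked against the model's `config.json`); with these values the result matches the figure printed above.

```python
# Assumed Gemma 2 2B dimensions (not stated in this notebook; check the model's config.json)
HIDDEN_SIZE = 2304
NUM_LAYERS = 26
NUM_HEADS = 8
NUM_KV_HEADS = 4
HEAD_DIM = 256
RANK = 16  # r in lora_config

# (in_features, out_features) of each projection listed in target_modules
projections = {
    "q_proj": (HIDDEN_SIZE, NUM_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM),
    "o_proj": (NUM_HEADS * HEAD_DIM, HIDDEN_SIZE),
}

# A is (r x in) and B is (out x r), so each module adds r * (in + out) weights
per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in projections.values())
total = per_layer * NUM_LAYERS

print(per_layer, total)
```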


In [25]:
trainer = SFTTrainer(
    model,
    train_dataset = formatted_dataset["train"],
    eval_dataset = formatted_dataset["validation"],
    args = TrainingArguments(
        num_train_epochs = EPOCHS,
        per_device_train_batch_size = BATCH_SIZE,
        per_device_eval_batch_size = BATCH_SIZE,
        gradient_accumulation_steps = 1,
        evaluation_strategy = "epoch",
        learning_rate = 5e-5,
        weight_decay = 0.001,
        adam_beta1 = 0.9,
        adam_beta2 = 0.995,
        adam_epsilon = 1e-8,
        max_grad_norm = 1.0,
        seed = 4856,
        output_dir = OUTPUT_DIR,
        optim = "adamw_bnb_8bit",
        lr_scheduler_type = "reduce_lr_on_plateau"
    ),
    peft_config = lora_config,
    dataset_text_field = "prompt",
)

Map:   0%|          | 0/31297 [00:00<?, ? examples/s]

Map:   0%|          | 0/1739 [00:00<?, ? examples/s]

In [26]:
trainer.train()

  0%|          | 0/39125 [00:00<?, ?it/s]

{'loss': 2.0863, 'grad_norm': 3.7867989540100098, 'learning_rate': 5e-05, 'epoch': 0.06}
{'loss': 1.8223, 'grad_norm': 8.309208869934082, 'learning_rate': 5e-05, 'epoch': 0.13}
{'loss': 1.6827, 'grad_norm': 4.461440563201904, 'learning_rate': 5e-05, 'epoch': 0.19}
{'loss': 1.5995, 'grad_norm': 3.8560783863067627, 'learning_rate': 5e-05, 'epoch': 0.26}
{'loss': 1.5734, 'grad_norm': 4.111842632293701, 'learning_rate': 5e-05, 'epoch': 0.32}
{'loss': 1.5077, 'grad_norm': 5.331409454345703, 'learning_rate': 5e-05, 'epoch': 0.38}
{'loss': 1.4931, 'grad_norm': 3.3308260440826416, 'learning_rate': 5e-05, 'epoch': 0.45}
{'loss': 1.4615, 'grad_norm': 3.298192024230957, 'learning_rate': 5e-05, 'epoch': 0.51}
{'loss': 1.4425, 'grad_norm': 3.2954165935516357, 'learning_rate': 5e-05, 'epoch': 0.58}
{'loss': 1.4192, 'grad_norm': 4.4286065101623535, 'learning_rate': 5e-05, 'epoch': 0.64}
{'loss': 1.3961, 'grad_norm': 6.291783332824707, 'learning_rate': 5e-05, 'epoch': 0.7}
{'loss': 1.4016, 'grad_norm'

  0%|          | 0/435 [00:00<?, ?it/s]

{'eval_loss': 1.3377771377563477, 'eval_runtime': 223.6673, 'eval_samples_per_second': 7.775, 'eval_steps_per_second': 1.945, 'epoch': 1.0}
{'loss': 1.34, 'grad_norm': 3.314457416534424, 'learning_rate': 5e-05, 'epoch': 1.02}
{'loss': 1.3003, 'grad_norm': 3.4284536838531494, 'learning_rate': 5e-05, 'epoch': 1.09}
{'loss': 1.2939, 'grad_norm': 2.8539879322052, 'learning_rate': 5e-05, 'epoch': 1.15}
{'loss': 1.2861, 'grad_norm': 3.207547664642334, 'learning_rate': 5e-05, 'epoch': 1.21}
{'loss': 1.3051, 'grad_norm': 9.829144477844238, 'learning_rate': 5e-05, 'epoch': 1.28}
{'loss': 1.2926, 'grad_norm': 2.627249240875244, 'learning_rate': 5e-05, 'epoch': 1.34}
{'loss': 1.2857, 'grad_norm': 3.291450262069702, 'learning_rate': 5e-05, 'epoch': 1.41}
{'loss': 1.2782, 'grad_norm': 4.691164493560791, 'learning_rate': 5e-05, 'epoch': 1.47}
{'loss': 1.2873, 'grad_norm': 2.9475576877593994, 'learning_rate': 5e-05, 'epoch': 1.53}
{'loss': 1.2476, 'grad_norm': 19.61113929748535, 'learning_rate': 5e-0

  0%|          | 0/435 [00:00<?, ?it/s]

{'eval_loss': 1.252960205078125, 'eval_runtime': 237.5987, 'eval_samples_per_second': 7.319, 'eval_steps_per_second': 1.831, 'epoch': 2.0}
{'loss': 1.2066, 'grad_norm': 3.174567937850952, 'learning_rate': 5e-05, 'epoch': 2.04}
{'loss': 1.1961, 'grad_norm': 3.890836238861084, 'learning_rate': 5e-05, 'epoch': 2.11}
{'loss': 1.1951, 'grad_norm': 2.9104647636413574, 'learning_rate': 5e-05, 'epoch': 2.17}
{'loss': 1.2048, 'grad_norm': 5.512338161468506, 'learning_rate': 5e-05, 'epoch': 2.24}
{'loss': 1.1974, 'grad_norm': 2.656956672668457, 'learning_rate': 5e-05, 'epoch': 2.3}
{'loss': 1.1843, 'grad_norm': 2.8766958713531494, 'learning_rate': 5e-05, 'epoch': 2.36}
{'loss': 1.1958, 'grad_norm': 2.3810036182403564, 'learning_rate': 5e-05, 'epoch': 2.43}
{'loss': 1.1947, 'grad_norm': 2.6805338859558105, 'learning_rate': 5e-05, 'epoch': 2.49}
{'loss': 1.1925, 'grad_norm': 2.990164041519165, 'learning_rate': 5e-05, 'epoch': 2.56}
{'loss': 1.1798, 'grad_norm': 2.560637950897217, 'learning_rate': 

  0%|          | 0/435 [00:00<?, ?it/s]

{'eval_loss': 1.2114449739456177, 'eval_runtime': 243.9981, 'eval_samples_per_second': 7.127, 'eval_steps_per_second': 1.783, 'epoch': 3.0}
{'loss': 1.1791, 'grad_norm': 3.1127374172210693, 'learning_rate': 5e-05, 'epoch': 3.0}
{'loss': 1.129, 'grad_norm': 4.226780414581299, 'learning_rate': 5e-05, 'epoch': 3.07}
{'loss': 1.1303, 'grad_norm': 3.3289742469787598, 'learning_rate': 5e-05, 'epoch': 3.13}
{'loss': 1.1033, 'grad_norm': 2.9881794452667236, 'learning_rate': 5e-05, 'epoch': 3.19}
{'loss': 1.1331, 'grad_norm': 3.9584014415740967, 'learning_rate': 5e-05, 'epoch': 3.26}
{'loss': 1.1377, 'grad_norm': 3.102477788925171, 'learning_rate': 5e-05, 'epoch': 3.32}
{'loss': 1.1514, 'grad_norm': 2.9609925746917725, 'learning_rate': 5e-05, 'epoch': 3.39}
{'loss': 1.1321, 'grad_norm': 3.2523648738861084, 'learning_rate': 5e-05, 'epoch': 3.45}
{'loss': 1.1259, 'grad_norm': 3.1557865142822266, 'learning_rate': 5e-05, 'epoch': 3.51}
{'loss': 1.1465, 'grad_norm': 10.836472511291504, 'learning_rat

  0%|          | 0/435 [00:00<?, ?it/s]

{'eval_loss': 1.1880388259887695, 'eval_runtime': 238.6713, 'eval_samples_per_second': 7.286, 'eval_steps_per_second': 1.823, 'epoch': 4.0}
{'loss': 1.1231, 'grad_norm': 3.273221492767334, 'learning_rate': 5e-05, 'epoch': 4.03}
{'loss': 1.0955, 'grad_norm': 3.9406962394714355, 'learning_rate': 5e-05, 'epoch': 4.09}
{'loss': 1.0677, 'grad_norm': 2.9625065326690674, 'learning_rate': 5e-05, 'epoch': 4.15}
{'loss': 1.0699, 'grad_norm': 2.9987339973449707, 'learning_rate': 5e-05, 'epoch': 4.22}
{'loss': 1.0744, 'grad_norm': 5.666394233703613, 'learning_rate': 5e-05, 'epoch': 4.28}
{'loss': 1.0752, 'grad_norm': 3.0551328659057617, 'learning_rate': 5e-05, 'epoch': 4.35}
{'loss': 1.106, 'grad_norm': 2.844954013824463, 'learning_rate': 5e-05, 'epoch': 4.41}
{'loss': 1.1075, 'grad_norm': 2.907071352005005, 'learning_rate': 5e-05, 'epoch': 4.47}
{'loss': 1.0733, 'grad_norm': 2.8475306034088135, 'learning_rate': 5e-05, 'epoch': 4.54}
{'loss': 1.0948, 'grad_norm': 2.5725722312927246, 'learning_rate

  0%|          | 0/435 [00:00<?, ?it/s]

{'eval_loss': 1.171726942062378, 'eval_runtime': 241.6382, 'eval_samples_per_second': 7.197, 'eval_steps_per_second': 1.8, 'epoch': 5.0}
{'train_runtime': 51442.2444, 'train_samples_per_second': 3.042, 'train_steps_per_second': 0.761, 'train_loss': 1.240223369561826, 'epoch': 5.0}


TrainOutput(global_step=39125, training_loss=1.240223369561826, metrics={'train_runtime': 51442.2444, 'train_samples_per_second': 3.042, 'train_steps_per_second': 0.761, 'total_flos': 1.8569537465717146e+17, 'train_loss': 1.240223369561826, 'epoch': 5.0})
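
Two numbers in the log can be sanity-checked with a little arithmetic. The sketch below recomputes the total number of optimizer steps from the dataset size, batch size, and epoch count, and converts each epoch's `eval_loss` into perplexity via `exp(loss)`. The step formula assumes a single GPU and `gradient_accumulation_steps = 1`, as configured above; all input values are copied from the outputs shown in this notebook.

```python
import math

train_examples, eval_examples = 31297, 1739   # sizes reported by the Map output
batch_size, epochs = 4, 5                     # BATCH_SIZE and EPOCHS

# One optimizer step per batch; the last, partial batch still counts as a step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
eval_steps = math.ceil(eval_examples / batch_size)
print(steps_per_epoch, total_steps, eval_steps)

# eval_loss after each epoch, copied from the training log
eval_loss = [
    1.3377771377563477,
    1.252960205078125,
    1.2114449739456177,
    1.1880388259887695,
    1.171726942062378,
]

# Perplexity is the exponential of the mean cross-entropy loss
perplexity = [math.exp(loss) for loss in eval_loss]
for epoch, (loss, ppl) in enumerate(zip(eval_loss, perplexity), start = 1):
    print(f"epoch {epoch}: eval_loss = {loss:.4f}, perplexity = {ppl:.3f}")
```

The evaluation loss falls in every epoch, but by a smaller amount each time, which suggests diminishing returns from further epochs at this learning rate.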

# <div style="padding:14px;color:white;margin:0;font-family:Georgia;font-size:30px;text-align:left;display:fill;border-radius:5px;background-color:#004AAD;overflow:hidden">6. Model Evaluation</div> <a id = "6."></a>

# 6.1 Prompting Technique Evaluation <a id = "6.1"></a>

In [None]:
# Gemma's chat format only has the `user` and `model` roles (no `system` or `assistant`),
# so the instructions and the retrieved context are placed inside the user turn
template = """<start_of_turn>user\nUse the following information and your own knowledge to answer the question. If you don't know the answer, say you don't know, don't try to make up the answer.\n
    {context}\n\n{question}<end_of_turn>\n<start_of_turn>model\n"""
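
For few-shot prompting, one possible layout is to present each example as a completed user/model exchange and leave the final model turn open. The sketch below is only an illustration: the helper `build_few_shot_prompt` and the instruction wording are hypothetical, the example pairs are taken from the dictionary sample shown earlier, and the turn markers follow Gemma's chat format, which uses the roles `user` and `model`.

```python
def build_few_shot_prompt(examples, query):
    """Format (nom, vi) pairs as completed turns, followed by the open query."""
    turns = []
    for nom, vi in examples:
        turns.append(f"<start_of_turn>user\nTranslate to Vietnamese: {nom}<end_of_turn>\n")
        turns.append(f"<start_of_turn>model\n{vi}<end_of_turn>\n")
    # The final model turn is left open for the model to complete
    turns.append(f"<start_of_turn>user\nTranslate to Vietnamese: {query}<end_of_turn>\n")
    turns.append("<start_of_turn>model\n")
    return "".join(turns)

# Example pairs taken from the dictionary sample shown earlier in the notebook
few_shot_examples = [("阿", "a"), ("安", "an"), ("英", "anh")]

prompt = build_few_shot_prompt(few_shot_examples, "英户")
print(prompt)
```

In practice the examples would be chosen to resemble the query (same genre or shared characters), and the resulting string can be passed to the tokenizer in place of a plain instruction.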

# 6.2 RAG for Advanced Nom Analysis <a id = "6.2"></a>

In [None]:
# Database
from langchain_community.vectorstores import FAISS

# Embedding model
from langchain_community.embeddings import GPT4AllEmbeddings

# Text splitter
from langchain_text_splitters import CharacterTextSplitter

# Chains, memory, and prompt templates
from langchain.chains import RetrievalQA, LLMChain
from langchain.memory import ConversationSummaryMemory
from langchain.prompts import PromptTemplate

In [None]:
def create_db_from_text(texts: list[str], vector_db_path: str = "vector_db"):
    """ Create a FAISS vector database from a list of texts and save it to disk """

    # Initialize the text splitter
    text_splitter = CharacterTextSplitter(
        separator = "\n",
        chunk_size = 500,
        chunk_overlap = 50,
        length_function = len
    )

    # Split every text into chunks and flatten the result into one list
    chunks = [chunk for text in texts for chunk in text_splitter.split_text(text)]

    # Load the embeddings model
    embeddings_model = GPT4AllEmbeddings(model_file = "model/all-MiniLM-L12-v2.Q8_0.gguf")

    # Create the vector database with FAISS (the classmethod is `from_texts`)
    database = FAISS.from_texts(chunks, embeddings_model)

    # Persist the index; the default `vector_db_path` is a placeholder, adjust as needed
    database.save_local(vector_db_path)

    return database
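To make the chunking step more concrete, here is a simplified pure-Python sketch of what splitting on `"\n"` with `chunk_size = 500` does: lines are packed greedily into chunks of at most 500 characters. This is an approximation for illustration only. It is not the library's actual implementation, it omits `chunk_overlap`, and the function name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text: str, chunk_size: int = 500, separator: str = "\n") -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty lines
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits (or it is a single oversized piece kept whole)
            current = candidate
        else:
            # The chunk is full: store it and start a new one with this piece
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Synthetic sample text: 12 lines of roughly 100 characters each
sample = "\n".join(f"line {i}: " + "x" * 90 for i in range(12))
chunks = split_into_chunks(sample)
for n, chunk in enumerate(chunks):
    print(n, len(chunk))
```

With lines of about 100 characters, five lines fit into each 500-character chunk, which is why short poems tend to end up in one or two chunks each.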

In this part, we use RAG to improve how Gemma reasons about and explains Nom script, using the Nôm Foundation's [**Hồ Xuân Hương poem collection**](https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems?uiLang=en) as the main source.

In [None]:
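import requests
from bs4 import BeautifulSoup

def extract_text_from_box(url: str) -> str:
    """ Fetch a page and return its URL followed by the text of every <div class="box"> element """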
    try:
        # Fetch the content from url
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for HTTP errors
    except requests.exceptions.RequestException as e:
        print(f"Error fetching the URL: {e}")
        return ""

    # Parse the content with BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all div elements with class "box"
    box_elements = soup.find_all('div', class_ = 'box')

    # Prefix the page URL as a source marker, then join the text of all box elements
    extracted_text = url + "\n\n" + "\n\n".join([box.get_text(separator = ' ', strip = True) for box in box_elements])

    return extracted_text

data = []
url = ["https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/1-Autumn%20Landscape?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/3-Offering%20betel?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/4-Confession%20(II)?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/5-Lament%20for%20the%20Prefect%20of%20V%C4%A9nh-T%C6%B0%E1%BB%9Dng?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/6-Lament%20for%20Commissioner%20C%C3%B3c?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/7-Confession%20(III)?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/8-The%20Floating%20Cake?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/9-On%20Sharing%20a%20Husband?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/10-Jackfruit?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/11-River%20Snail?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/13-Teasing%20Chi%C3%AAu-H%E1%BB%95?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/14-Chi%C3%AAu-H%E1%BB%95%E2%80%99s%20Reply?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/15-Three-Mountain%20Pass?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/18-The%20Unwed%20Mother?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/23-Picking%20Flowers?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/27-The%20Pharmacist%E2%80%99s%20Widow%20Mourns%20His%20Death?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/30-The%20Retired%20Doctor?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/32-Qu%C3%A1n%20S%C3%BA%20Pagoda?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/33-Buddhist%20Nun?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/34-The%20Lustful%20Monk?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/38-Tr%E1%BA%A5n%20Qu%E1%BB%91c%20Temple?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/49-Spring%20%E2%80%93%20watching%20pavilion?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/48-Country%20Scene?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/39-At%20the%20Chinese%20General%E2%80%99s%20Tomb?uiLang=en",
        "https://www.nomfoundation.org/nom-project/Ho-Xuan-Huong/Ho-Xuan-Huong-of-poems/42-The%20Crab?uiLang=en"]
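for i in url:
    box_content = extract_text_from_box(i)
    # Store the extracted text (not the URL); skip pages that failed to download
    if box_content:
        data.append(box_content)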

In [None]:
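def chain_lang(pt: str, llm_model):
    """ Build a RetrievalQA chain over the scraped poems using the prompt template string `pt` """
    # The chain expects a PromptTemplate object rather than a raw string
    prompt = PromptTemplate(template = pt, input_variables = ["context", "question"])
    chain = RetrievalQA.from_chain_type(llm = llm_model,
                                        retriever = create_db_from_text(data).as_retriever(),
                                        memory = ConversationSummaryMemory(llm = llm_model),
                                        chain_type_kwargs = {"prompt": prompt, "verbose": True},
                                        return_source_documents = False
                                        )
    return chain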
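Conceptually, the chain above retrieves the chunks most similar to the query and places them into the `{context}` slot of the prompt before calling the model. The sketch below illustrates only that retrieve-then-fill idea in plain Python. The bag-of-words `embed` function is a toy stand-in for the real embedding model, the chunks are neutral placeholder sentences, and none of the names correspond to LangChain internals.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Placeholder chunks and query, for illustration only
chunks = [
    "red apple on the table",
    "blue boat on the river",
    "green tree in the garden",
]
query = "boat on the river"

# Retrieve the best match and place it into a simplified prompt
top = retrieve(query, chunks, k=1)
prompt = "Context:\n{context}\n\nQuestion: {question}".format(
    context="\n\n".join(top), question=query
)
print(prompt)
```

In the real chain, a dense embedding model and a FAISS index replace `embed` and the sorted list, but the flow from query to filled prompt is the same.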

# <div style="padding:14px;color:white;margin:0;font-family:Georgia;font-size:30px;text-align:left;display:fill;border-radius:5px;background-color:#004AAD;overflow:hidden">7. Benchmark</div> <a id = "7."></a>

To compare performance, I will use this [website](https://www.clc.hcmus.edu.vn/?page_id=3039), which has already been certified by the competent authorities in Vietnam.
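The notebook does not state which metric is used for this comparison. One possible choice, shown here purely as an assumption, is the character error rate (CER): the edit distance between a system's output and a reference transliteration, divided by the reference length. The function names and the sample strings below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalized by the reference length; lower is better."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(hypothesis, reference) / len(reference)


# Placeholder strings, not real model or benchmark outputs
reference = "abcd"
hypothesis = "abcf"
cer = char_error_rate(hypothesis, reference)
print(f"CER: {cer:.2f}")
```

Computing the same score for both Gemma's output and the benchmark tool's output against a shared reference would give directly comparable numbers.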

# <div style="padding:14px;color:white;margin:0;font-family:Georgia;font-size:30px;text-align:left;display:fill;border-radius:5px;background-color:#004AAD;overflow:hidden">8. Conclusion</div> <a id = "8."></a>

# <div style="padding:14px;color:white;margin:0;font-family:Georgia;font-size:30px;text-align:left;display:fill;border-radius:5px;background-color:#004AAD;overflow:hidden">9. References</div> <a id = "9."></a>

Hu, E.J. et al. (2021) LoRA: Low-Rank Adaptation of Large Language Models, arXiv.org. Available at: https://arxiv.org/abs/2106.09685 (Accessed: 16 November 2024).

Dinh, D., Nguyen, P. and Nguyen, L.H.B. (no date) Transliterating Nôm scripts into Vietnamese National Scripts Using Statistical Machine Translation, International Journal of Advanced Computer Science and Applications (IJACSA). Available at: https://thesai.org/Publications/ViewPaper?Volume=12&Issue=2&Code=IJACSA&SerialNo=5 (Accessed: 16 November 2024). 

Gemma 2: Improving open language models at a practical ... Available at: http://arxiv.org/pdf/2408.00118 (Accessed: 16 November 2024). 