To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed under [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


Long-Context GRPO for reinforcement learning — train stably at massive sequence lengths. Fine-tune models with up to 7x more context length efficiently. [Read Blog](https://unsloth.ai/docs/new/grpo-long-context)

3× faster training with optimized sequence packing: higher throughput with no quality loss. [Read Blog](https://unsloth.ai/docs/new/3x-faster-training-packing)

500k context-length fine-tuning — push long-context models further with memory-efficient training. [Read Blog](https://unsloth.ai/docs/new/500k-context-length-fine-tuning)

Introducing FP8 precision training for faster RL inference. [Read Blog](https://docs.unsloth.ai/new/fp8-reinforcement-learning).

Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [31]:
# CHOOSE THE MODEL!
#model_name = "unsloth/Qwen2.5-VL-7B-Instruct" # Closed competition
model_name = "unsloth/Qwen3-VL-8B-Instruct" # Open competition, example model (models larger than 8B are allowed)

In [32]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

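# Each substring needs its own `in` test; `"2.5" and "7B" in model_name` only checks "7B"
if "2.5" in model_name and "7B" in model_name:
    !pip install transformers==4.56.2
elif "Qwen3" in model_name:
    !pip install transformers==4.57.1
else:
    print("Unrecognized model: no transformers version pinned")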
!pip install --no-deps trl==0.22.2
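
The `transformers` pin above is chosen by substring checks on `model_name`. A minimal sketch of that selection as a plain function, assuming the same two pins used in this notebook; the helper name `pick_transformers_version` is hypothetical and not part of Unsloth. Each substring gets its own `in` test, because an expression of the form `"a" and "b" in s` only tests `"b"`.

```python
from typing import Optional

def pick_transformers_version(model_name: str) -> Optional[str]:
    """Return the transformers pin used in this notebook (hypothetical helper)."""
    # A non-empty string literal is always truthy, so every substring
    # must be tested against the model name explicitly.
    if "2.5" in model_name and "7B" in model_name:
        return "4.56.2"
    if "Qwen3" in model_name:
        return "4.57.1"
    return None  # Unknown model: no pin chosen

print(pick_transformers_version("unsloth/Qwen2.5-VL-7B-Instruct"))
print(pick_transformers_version("unsloth/Qwen3-VL-8B-Instruct"))
```

Returning `None` for an unknown model lets the caller decide whether to stop or fall back to the default install.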

### Unsloth

In [25]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit", # Llama 3.2 vision support
    "unsloth/Llama-3.2-11B-Vision-bnb-4bit",
    "unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit", # Can fit in a 80GB card!
    "unsloth/Llama-3.2-90B-Vision-bnb-4bit",

    "unsloth/Pixtral-12B-2409-bnb-4bit",              # Pixtral fits in 16GB!
    "unsloth/Pixtral-12B-Base-2409-bnb-4bit",         # Pixtral base model

    "unsloth/Qwen2-VL-2B-Instruct-bnb-4bit",          # Qwen2 VL support
    "unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
    "unsloth/Qwen2-VL-72B-Instruct-bnb-4bit",

    "unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit",      # Any Llava variant works!
    "unsloth/llava-1.5-7b-hf-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastVisionModel.from_pretrained(
    model_name,
    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

==((====))==  Unsloth 2026.1.4: Fast Qwen2_5_Vl patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


KeyboardInterrupt: 

We now add LoRA adapters for parameter-efficient finetuning, so only a small fraction of all parameters is trained (0.62% in the training log further down).

**[NEW]** We also support finetuning ONLY the vision part of the model, or ONLY the language part. Or you can select both! You can also select to finetune the attention or the MLP layers!
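
To see why LoRA trains so few parameters: for a linear layer whose weight has shape `d_out x d_in`, a rank-`r` adapter adds two small matrices with `r * (d_in + d_out)` parameters in total. A rough sketch of that arithmetic follows; the 4096 x 4096 layer size is an illustrative assumption, not a dimension read from the model, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one rank-r LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

d_in, d_out, r = 4096, 4096, 16  # Illustrative layer size; r = 16 as in this notebook
full = d_in * d_out              # Parameters in the frozen base weight
added = lora_param_count(d_in, d_out, r)
print(f"full layer: {full:,} params, LoRA adapter: {added:,} params "
      f"({100 * added / full:.2f}% of the layer)")
```

Raising `r` grows the adapter linearly, which is why a larger rank costs more memory and can overfit more easily.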

In [3]:
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision layers
    finetune_language_layers   = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 16,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407, # Change this if you like, though it makes little difference
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import json
import os

base_path = '/content/drive/MyDrive/EVAHAN/train_data/'
#dataset_files = ['Dataset_A.json', 'Dataset_B.json', 'Dataset_C.json']
dataset_files = ['Dataset_A.json', 'Dataset_C.json']
#dataset_files = ['Dataset_B.json']
combined_dataset = []

for filename in dataset_files:
    filepath = os.path.join(base_path, filename)
    with open(filepath, 'r') as f:
        data = json.load(f)
        combined_dataset.extend(data)

print(f"Successfully loaded and combined {len(dataset_files)} datasets.")
print(f"Total items in combined dataset: {len(combined_dataset)}")

Successfully loaded and combined 2 datasets.
Total items in combined dataset: 10000


In [6]:
from sklearn.model_selection import train_test_split

subset_size = 0.25 # 25% of the total data
subset_data, _ = train_test_split(combined_dataset, test_size=1 - subset_size, random_state=42)

print(f"Using a {subset_size*100}% subset of the data: {len(subset_data)} samples")

# Split the subset into training (75%) and temporary (25%) sets
train_data, temp_data = train_test_split(subset_data, test_size=0.25, random_state=42)

# Split the temporary set into testing (60% of it, i.e. 15% of the subset)
# and validation (40% of it, i.e. 10% of the subset)
test_data, val_data = train_test_split(temp_data, test_size=0.4, random_state=42)

print(f"Training set size: {len(train_data)}")
print(f"Testing set size: {len(test_data)}")
print(f"Validation set size: {len(val_data)}")

Using a 25.0% subset of the data: 2500 samples
Training set size: 1875
Testing set size: 375
Validation set size: 250
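
The split sizes printed above can be checked by hand. A minimal sketch using exact fractions, assuming the subset size divides evenly (scikit-learn rounds when it does not); `split_sizes` is an illustrative helper, not a scikit-learn function.

```python
from fractions import Fraction

def split_sizes(n_subset: int) -> dict:
    """Expected sizes for the 75/15/10 scheme, using exact fractions (illustrative helper)."""
    held_out = Fraction(1, 4)       # test_size=0.25 in the first split
    val_share = Fraction(2, 5)      # test_size=0.4 in the second split
    shares = {
        "train": 1 - held_out,                # 75% of the subset
        "test": held_out * (1 - val_share),   # 60% of the held-out part = 15%
        "val": held_out * val_share,          # 40% of the held-out part = 10%
    }
    return {name: int(share * n_subset) for name, share in shares.items()}

print(split_sizes(2500))
```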


In [7]:
import PIL.Image
print("PIL.Image imported.")

PIL.Image imported.


In [8]:
from tqdm import tqdm # Progress bars
import json # Import json for creating the assistant's JSON response
import os
import cv2 # NEW: Import OpenCV for image processing
import numpy as np # NEW: Import numpy for image processing

instruction = """Analyze the provided image of ancient Chinese texts.

**Task Guidelines:**
1. **Transcription:** Transcribe the characters into standard Traditional Chinese (Unicode). Do not modernize the grammar.
2. **Legibility:** - If a character is an archaic variant, use the standard Traditional Chinese equivalent.
    - If a character is completely illegible due to damage, use '□'.
3. **Output:** Return ONLY the JSON object below.

```json
{
  "transcription": "TEXT_HERE",
  "notes": "Brief notes on damage/layout"
}
"""

# 4. **Reading Order:** If there is text, the text is generally written vertically (top to bottom) and arranged in columns from right to left. Respect this strict reading order.

base_path = '/content/drive/MyDrive/EVAHAN/train_data/' # Ensure base_path is accessible

# Original convert_to_conversation function (commented out)
# def convert_to_conversation(sample):
#     conversation = [
#         { "role": "user",
#           "content" : [
#             {"type" : "text",  "text"  : instruction},
#             {"type" : "image", "image" : sample["image"]} ]
#         },
#         { "role" : "assistant",
#           "content" : [
#             {"type" : "text",  "text"  : sample["text"]} ]
#         },
#     ]
#     return { "messages" : conversation }
# pass

# Adapted convert_to_conversation function
def convert_to_conversation_new(sample):
    if "text" not in sample:
        # Skip samples that do not have a 'text' key
        return None

    image_path = os.path.join(base_path, sample["image_path"])
    # TODO: maybe change this preprocessing?
    try:
        # Current approach: load the image with PIL and force RGB
        image = PIL.Image.open(image_path).convert("RGB")

        # Alternative (disabled): load with OpenCV and binarize to remove shadows
        # image_cv2 = cv2.imread(image_path)
        # if image_cv2 is None:
        #     print(f"Error: Could not load image {image_path}")
        #     return None
        #
        # # Convert to grayscale
        # gray_image = cv2.cvtColor(image_cv2, cv2.COLOR_BGR2GRAY)
        #
        # # Apply adaptive thresholding (binarization) to remove shadows
        # # ADAPTIVE_THRESH_GAUSSIAN_C: uses a gaussian weighted sum of neighborhood values
        # # THRESH_BINARY: the type of thresholding applied
        # # 255: max value to use with THRESH_BINARY
        # # 11: block size (size of neighborhood to calculate threshold for)
        # # 2: constant subtracted from the mean or weighted mean
        # binarized_image = cv2.adaptiveThreshold(gray_image, 255,
        #                                         cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        #                                         cv2.THRESH_BINARY, 11, 2)
        #
        # # Convert OpenCV image (numpy array) back to PIL Image
        # image = PIL.Image.fromarray(binarized_image).convert("RGB") # Ensure RGB for model input

    except Exception as e:
        print(f"Error processing image {image_path}: {e}")
        return None # Skip samples with problematic images

    # Construct the assistant's response as a JSON string based on the instruction format
    assistant_response_dict = {
        "transcription": sample["text"],
        "notes": "" # Assuming no 'notes' provided in the raw dataset, default to empty string
    }
    assistant_response_json_string = json.dumps(assistant_response_dict, ensure_ascii=False) # ensure_ascii=False to preserve Chinese characters

    conversation = [
        { "role": "user",
          "content" : [
            {"type" : "text",  "text"  : instruction},
            {"type" : "image", "image" : image} ]
        },
        { "role" : "assistant",
          "content" : [
            {"type" : "text",  "text"  : assistant_response_json_string} ]
        },
    ]
    return { "messages" : conversation }

# Original application of convert_to_conversation (commented out)
# converted_dataset = [convert_to_conversation(sample) for sample in dataset]

# Apply the new function to the training and validation datasets.
# Each sample is converted once (walrus operator), and None results from
# image loading errors or a missing 'text' key are filtered out.
converted_train_dataset = [conv for sample in tqdm(train_data) if (conv := convert_to_conversation_new(sample)) is not None]
converted_val_dataset = [conv for sample in tqdm(val_data) if (conv := convert_to_conversation_new(sample)) is not None]

print(f"Converted training dataset size: {len(converted_train_dataset)}")
print(f"Converted validation dataset size: {len(converted_val_dataset)}")

100%|██████████| 1875/1875 [12:51<00:00,  2.43it/s]
100%|██████████| 250/250 [01:40<00:00,  2.48it/s]

Converted training dataset size: 1875
Converted validation dataset size: 250
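
The assistant target is a JSON string built with `ensure_ascii=False`. A small sketch of what that flag changes, using the first training sample shown in this notebook; only the standard `json` module is involved.

```python
import json

sample = {"image_path": "Dataset_A/a_1932.jpg", "text": "切智智修習五眼六神通慶喜當知"}
target = {"transcription": sample["text"], "notes": ""}

escaped = json.dumps(target)                       # Default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(target, ensure_ascii=False)  # Keeps the Chinese characters as written

print(escaped)
print(readable)

# Both strings decode to the same object; only the textual form differs
assert json.loads(escaped) == json.loads(readable) == target
```

Keeping the characters unescaped means the model is trained to emit the characters themselves rather than escape sequences.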





In [9]:
train_data[0]

{'image_path': 'Dataset_A/a_1932.jpg', 'text': '切智智修習五眼六神通慶喜當知'}

In [10]:
converted_train_dataset[0]

{'messages': [{'role': 'user',
   'content': [{'type': 'text',
     'text': 'Analyze the provided image of ancient Chinese texts.\n\n**Task Guidelines:**\n1. **Transcription:** Transcribe the characters into standard Traditional Chinese (Unicode). Do not modernize the grammar.\n2. **Legibility:** - If a character is an archaic variant, use the standard Traditional Chinese equivalent.\n    - If a character is completely illegible due to damage, use \'□\'.\n3. **Output:** Return ONLY the JSON object below.\n\n```json\n{\n  "transcription": "TEXT_HERE",\n  "notes": "Brief notes on damage/layout"\n}\n'},
    {'type': 'image',
     'image': <PIL.Image.Image image mode=RGB size=1960x175>}]},
  {'role': 'assistant',
   'content': [{'type': 'text',
     'text': '{"transcription": "切智智修習五眼六神通慶喜當知", "notes": ""}'}]}]}

In [11]:
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model) # Enable for training!

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer), # Must use!
    train_dataset = converted_train_dataset, # Updated to use the new training dataset
    eval_dataset = converted_val_dataset,    # Added for evaluation
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # Set num_train_epochs to 1 and max_steps to -1 to train for one full epoch.
        # Adjust these values based on desired training duration and dataset size.
        max_steps = -1,                    # Set to -1 to train for num_train_epochs
        num_train_epochs = 1,              # Train for 1 full epoch
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",     # For Weights and Biases

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        max_length = 2048,
    ),
)

Unsloth: Model does not have a default image size - using 512


In [12]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.822 GB of memory reserved.


In [13]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,875 | Num Epochs = 1 | Total steps = 235
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 51,521,536 of 8,343,688,192 (0.62% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,4.0949
2,4.071
3,4.0608
4,3.6512
5,3.3029
6,2.7208
7,2.4365
8,2.0769
9,1.8913
10,1.5758
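
The "Total steps = 235" figure in the banner follows from the dataset size and the effective batch size. A rough sketch that reproduces it; `optimizer_steps` is a hypothetical helper, and the exact rounding inside the trainer may differ between library versions.

```python
import math

def optimizer_steps(num_examples: int, per_device_batch: int, grad_accum: int,
                    num_gpus: int = 1, epochs: int = 1) -> int:
    """Estimate optimizer steps: one step per effective batch, last partial batch included."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    return math.ceil(num_examples / effective_batch) * epochs

# Values from the training banner: 1,875 examples, batch 2, accumulation 4, 1 GPU
print(optimizer_steps(1875, per_device_batch=2, grad_accum=4))
```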


In [14]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1426.1729 seconds used for training.
23.77 minutes used for training.
Peak reserved memory = 7.506 GB.
Peak reserved memory for training = 0.684 GB.
Peak reserved memory % of max memory = 50.919 %.
Peak reserved memory for training % of max memory = 4.64 %.


In [15]:
FastVisionModel.for_inference(model) # Enable for inference!

# Comment out original image and instruction
# image = dataset[2]["image"]
# instruction = "Write the LaTeX representation for this image."

# Select an example from the test_data split
import os
import PIL.Image

test_example = test_data[0] # Get the first example from the test_data split

# Load the image from the test example using its path and the base_path
image_path = os.path.join(base_path, test_example["image_path"])
image = PIL.Image.open(image_path).convert("RGB")

# Reuse the globally defined `instruction` prompt (no reassignment needed)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

{"transcription": "景常嘯歘往一逺來生側半座塵風也", "notes": ""}<|im_end|>
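
The raw generation ends with the `<|im_end|>` token and has to be parsed before it can be scored. A minimal sketch of a tolerant parser, assuming the model returns a single JSON object; `extract_transcription` is a hypothetical helper, not part of Unsloth or the evaluation scripts.

```python
import json
from typing import Optional

def extract_transcription(raw_output: str, end_token: str = "<|im_end|>") -> Optional[str]:
    """Pull the "transcription" field out of a raw generation (hypothetical helper)."""
    cleaned = raw_output.replace(end_token, "").strip()
    # Keep only the outermost {...} span so stray text or code fences are ignored
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    return parsed.get("transcription")

raw = '{"transcription": "景常嘯歘往一逺來生側半座塵風也", "notes": ""}<|im_end|>'
print(extract_transcription(raw))
```

Returning `None` on failure keeps unparseable generations distinguishable from genuine transcriptions when scores are aggregated.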


In [17]:
import os

# Ensure the current directory is where we want the files
!wget -O task_a_c_eva.py https://raw.githubusercontent.com/GoThereGit/EvaHan/refs/heads/main/task_a_c_eva.py
!wget -O task_b_eva.py https://raw.githubusercontent.com/GoThereGit/EvaHan/refs/heads/main/task_b_eva.py

print("Downloaded task_a_c_eva.py and task_b_eva.py")

--2026-02-03 20:57:11--  https://raw.githubusercontent.com/GoThereGit/EvaHan/refs/heads/main/task_a_c_eva.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14094 (14K) [text/plain]
Saving to: ‘task_a_c_eva.py’


2026-02-03 20:57:11 (22.3 MB/s) - ‘task_a_c_eva.py’ saved [14094/14094]

--2026-02-03 20:57:11--  https://raw.githubusercontent.com/GoThereGit/EvaHan/refs/heads/main/task_b_eva.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20275 (20K) [text/plain]
Saving to: ‘task_b_eva.py’


2026-02-03 20:57:11 (11.9 MB/s) - ‘task_b

In [21]:
FastVisionModel.for_inference(model) # Ensure model is in inference mode

import json
from tqdm import tqdm
import os
import PIL.Image
import sys

# Ensure the current directory is in sys.path for module discovery
if "/content/" not in sys.path:
    sys.path.append("/content/")

# Force reload of the modules if they were previously loaded
# This helps ensure the latest version of the script is used after download.
if 'task_a_c_eva' in sys.modules:
    del sys.modules['task_a_c_eva']
if 'task_b_eva' in sys.modules:
    del sys.modules['task_b_eva']

# Import the evaluation functions
from task_a_c_eva import calculate_char_metrics # Corrected import
from task_b_eva import LayoutEvaluator

print("Comparing model inference with ground truth and evaluating for the first 50 test examples:")

sum = 0
idx = 0

# Iterate through the first 50 examples of the test_data
for i, test_example in tqdm(enumerate(test_data[:50]), total=50, desc="Performing inference and evaluation"):
    # a. Load the image using its image_path
    image_path = os.path.join(base_path, test_example["image_path"])
    try:
        image = PIL.Image.open(image_path).convert("RGB")
    except Exception as e:
        print(f"Error loading image {image_path}: {e}")
        continue

    # b. Construct the messages list for the tokenizer
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": instruction}
        ]}
    ]

    # c. Apply the chat template to messages
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)

    # d. Prepare inputs for the model
    inputs = tokenizer(
        image,
        input_text,
        add_special_tokens = False,
        return_tensors = "pt",
    ).to("cuda")

    # e. Generate the model's output
    outputs = model.generate(**inputs, max_new_tokens = 128,
                       use_cache = True, temperature = 1.5, min_p = 0.1)

    # f. Decode the generated tokens to text
    generated_text_tokens = outputs[0][len(inputs["input_ids"][0]):]
    model_output_raw = tokenizer.decode(generated_text_tokens, skip_special_tokens=True)

    # g. Parse the model's text output as a JSON string
    model_output_cleaned = model_output_raw.replace("<|im_end|>", "").strip()
    predicted_transcription = "N/A (parsing error)"
    try:
        # Replace literal newlines with escaped newlines for valid JSON parsing
        model_output_for_json = model_output_cleaned.replace('\n', '\\n')
        parsed_output = json.loads(model_output_for_json)
        predicted_transcription = parsed_output.get("transcription", "N/A (transcription key missing)")
    except json.JSONDecodeError as e:
        print(f"JSON decoding error for output: {model_output_cleaned} - {e}")
        predicted_transcription = f"JSON Error: {model_output_cleaned}"

    # h. Extract the ground truth transcription
    # Check if 'text' key exists, as Dataset B does not have it for transcription evaluation
    ground_truth_transcription = test_example.get("text", "N/A (ground truth missing)")

    # i. Print the comparison
    print(f"\n--- Example {i+1} ---")
    print(f"Model Output:     {predicted_transcription}")
    print(f"Ground Truth:     {ground_truth_transcription}")

    # Conditional evaluation based on dataset
    dataset_name = image_path.split('/')[-2]
    if "Dataset_B" in dataset_name:

        print("Evaluation (Dataset B): Not applicable. Model outputs transcription, but ground truth is for layout detection.")
        # TODO: adapt this for Dataset B

    else:
        # For Dataset A and C, use calculate_char_metrics and extract comprehensive_score
        if ground_truth_transcription != "N/A (ground truth missing)":
            metrics_ac = calculate_char_metrics(ground_truth_transcription, predicted_transcription) # Call the correct function
            score_ac = metrics_ac.get("comprehensive_score", "N/A (score missing)") # Extract the score
            score_cer = metrics_ac.get("cer", "N/A (score missing)")
            score_precision = metrics_ac.get("precision", "N/A (score missing)")
            score_recall = metrics_ac.get("recall", "N/A (score missing)")
            score_f1 = metrics_ac.get("f1", "N/A (score missing)")
            score_ned = metrics_ac.get("ned", "N/A (score missing)")

            print(f"Evaluation (Dataset A/C): Comprehensive Score = {score_ac}")
            print(f"Precisely: CER={score_cer}, P={score_precision}, R={score_recall}, F1={score_f1}, ned={score_ned}")
            idx += 1
            sum += score_ac
        else:
            print("Evaluation (Dataset A/C): Skipping due to missing ground truth.")


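# Guard against division by zero when no example could be scored
print(f"Mean: {sum / idx}" if idx else "Mean: N/A (no examples were scored)")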

Comparing model inference with ground truth and evaluating for the first 50 test examples:


Performing inference and evaluation:   2%|▏         | 1/50 [00:08<06:57,  8.52s/it]


--- Example 1 ---
Model Output:     寒山修竹地，少住坐西阿。遥岑横岫出
Ground Truth:     駰鳴夢魂夜夜隨天代直入蓬萊殿裡行
Evaluation (Dataset A/C): Comprehensive Score = -0.0312
Precisely: CER=1.0625, P=0.0, R=0.0, F1=0.0, ned=1.0


Performing inference and evaluation:   4%|▍         | 2/50 [00:13<04:56,  6.18s/it]


--- Example 2 ---
Model Output:     上之四。　鷹門太守行
Ground Truth:     上之四鴈門太守行
Evaluation (Dataset A/C): Comprehensive Score = 0.6858
Precisely: CER=0.375, P=0.7, R=0.875, F1=0.7778, ned=0.3


Performing inference and evaluation:   6%|▌         | 3/50 [00:15<03:40,  4.68s/it]


--- Example 3 ---
Model Output:     山益州元
Ground Truth:     之所至元
Evaluation (Dataset A/C): Comprehensive Score = 0.25
Precisely: CER=0.75, P=0.25, R=0.25, F1=0.25, ned=0.75


Performing inference and evaluation:   8%|▊         | 4/50 [00:19<03:23,  4.42s/it]


--- Example 4 ---
Model Output:     船修習到諸惡界與黑髮界與獄
Ground Truth:     智修習空解脫門無相解脫門無願
Evaluation (Dataset A/C): Comprehensive Score = 0.1444
Precisely: CER=0.8571, P=0.1538, R=0.1429, F1=0.1481, ned=0.8571


Performing inference and evaluation:  10%|█         | 5/50 [00:26<04:00,  5.34s/it]


--- Example 5 ---
Model Output:     無共性如是與三世滅性與異趣住說亦說二無生與
Ground Truth:     其先府君於夷山付家產於姪德令經紀宗族識者□其克終人
Evaluation (Dataset A/C): Comprehensive Score = 0.0
Precisely: CER=1.0, P=0.0, R=0.0, F1=0.0, ned=1.0


Performing inference and evaluation:  12%|█▏        | 6/50 [00:31<03:46,  5.14s/it]


--- Example 6 ---
Model Output:     柳南之梅清香如雪光媚出包
Ground Truth:     獨有小梅清見骨只將真色
Evaluation (Dataset A/C): Comprehensive Score = 0.131
Precisely: CER=0.9091, P=0.1667, R=0.1818, F1=0.1739, ned=0.8333


Performing inference and evaluation:  14%|█▍        | 7/50 [00:36<03:42,  5.16s/it]


--- Example 7 ---
Model Output:     矣夫公詩載唐事迹甚明在昔謂之詩史而唐史載公歿
Ground Truth:     矣夫公詩載唐事迹甚明在昔謂之詩史而唐史載公歿
Evaluation (Dataset A/C): Comprehensive Score = 1.0
Precisely: CER=0.0, P=1.0, R=1.0, F1=1.0, ned=0.0


Performing inference and evaluation:  16%|█▌        | 8/50 [00:40<03:19,  4.75s/it]


--- Example 8 ---
Model Output:     萬年枝葉頌草色與紅潮
Ground Truth:     無性自性空慶喜當知以一切智無
Evaluation (Dataset A/C): Comprehensive Score = 0.0
Precisely: CER=1.0, P=0.0, R=0.0, F1=0.0, ned=1.0


Performing inference and evaluation:  18%|█▊        | 9/50 [00:43<02:43,  4.00s/it]


--- Example 9 ---
Model Output:     頗盡通
Ground Truth:     題畫馬
Evaluation (Dataset A/C): Comprehensive Score = 0.0
Precisely: CER=1.0, P=0.0, R=0.0, F1=0.0, ned=1.0


Performing inference and evaluation:  20%|██        | 10/50 [00:48<02:51,  4.30s/it]


--- Example 10 ---
Model Output:     一程燈火一程山容子行時那得還女兒擊挃歌欲訖兒
Ground Truth:     一程𤇆水一程山客子行時那得還女𫤗擊榜歌欲絶
Evaluation (Dataset A/C): Comprehensive Score = 0.679
Precisely: CER=0.3333, P=0.6818, R=0.7143, F1=0.6977, ned=0.3182


Performing inference and evaluation:  22%|██▏       | 11/50 [00:53<02:56,  4.52s/it]


--- Example 11 ---
Model Output:     旦讃说喜無量法蔵喜讃说无量無
Ground Truth:     量讃說喜無量法歡喜讃歎修喜無
Evaluation (Dataset A/C): Comprehensive Score = 0.5714
Precisely: CER=0.4286, P=0.5714, R=0.5714, F1=0.5714, ned=0.4286


Performing inference and evaluation:  24%|██▍       | 12/50 [01:02<03:48,  6.00s/it]


--- Example 12 ---
Model Output:     世傳曰娶白一港西建奴過江
Ground Truth:     法當自悟耳僕自停裴家
Evaluation (Dataset A/C): Comprehensive Score = -0.1
Precisely: CER=1.2, P=0.0, R=0.0, F1=0.0, ned=1.0


Performing inference and evaluation:  26%|██▌       | 13/50 [01:12<04:30,  7.32s/it]


--- Example 13 ---
Model Output:     經垣城下避山岐州鹽井坑巨甎山城城址西劉興廟巖寺
Ground Truth:     商而有天下傳于成王歸在宗周康王成周郊皆惟豐是都龍小
Evaluation (Dataset A/C): Comprehensive Score = 0.0405
Precisely: CER=0.96, P=0.0435, R=0.04, F1=0.0417, ned=0.96


Performing inference and evaluation:  28%|██▊       | 14/50 [01:18<04:04,  6.79s/it]


--- Example 14 ---
Model Output:     性法安法世蜜除𠰥耶不思議界
Ground Truth:     性法定法住實際虚空界不思議界
Evaluation (Dataset A/C): Comprehensive Score = 0.5056
Precisely: CER=0.5, P=0.5385, R=0.5, F1=0.5185, ned=0.5


Performing inference and evaluation:  30%|███       | 15/50 [01:22<03:27,  5.92s/it]


--- Example 15 ---
Model Output:     蜜藏中所述法故世閒便有人證脫
Ground Truth:     宻蔵中所說法故世閒便有八解脫
Evaluation (Dataset A/C): Comprehensive Score = 0.6429
Precisely: CER=0.3571, P=0.6429, R=0.6429, F1=0.6429, ned=0.3571


Performing inference and evaluation:  32%|███▏      | 16/50 [01:25<02:57,  5.21s/it]


--- Example 16 ---
Model Output:     三爲方便業生爲方便無所得爲方
Ground Truth:     二為方便無生為方便無所得為方
Evaluation (Dataset A/C): Comprehensive Score = 0.6429
Precisely: CER=0.3571, P=0.6429, R=0.6429, F1=0.6429, ned=0.3571


Performing inference and evaluation:  34%|███▍      | 17/50 [01:30<02:46,  5.04s/it]


--- Example 17 ---
Model Output:     五噐七無漏法二分故般𠰥出
Ground Truth:     五眼六神通無二無二分故慶喜由
Evaluation (Dataset A/C): Comprehensive Score = 0.3654
Precisely: CER=0.6429, P=0.4167, R=0.3571, F1=0.3846, ned=0.6429


Performing inference and evaluation:  36%|███▌      | 18/50 [01:36<02:50,  5.33s/it]


--- Example 18 ---
Model Output:     與萬姓之血齊進十族血故亦得無共孫十川人祖
Ground Truth:     判承□亦爲義官承裕𦦙進士累官吏科都給事中孫十三人紞
Evaluation (Dataset A/C): Comprehensive Score = 0.1653
Precisely: CER=0.84, P=0.2, R=0.16, F1=0.1778, ned=0.84


Performing inference and evaluation:  38%|███▊      | 19/50 [01:41<02:40,  5.18s/it]


--- Example 19 ---
Model Output:     韓才氣浩江漢文采肯鳳讝羣山得是窮景象愈可觀
Ground Truth:     韓才氣浩江漢文采竒鳳鸞兹山得是翁㬌物愈可觀
Evaluation (Dataset A/C): Comprehensive Score = 0.7143
Precisely: CER=0.2857, P=0.7143, R=0.7143, F1=0.7143, ned=0.2857


Performing inference and evaluation:  40%|████      | 20/50 [01:45<02:22,  4.76s/it]


--- Example 20 ---
Model Output:     祥老草書歌灌畦老歌
Ground Truth:     祥老草書歌灌畦老歌
Evaluation (Dataset A/C): Comprehensive Score = 1.0
Precisely: CER=0.0, P=1.0, R=1.0, F1=1.0, ned=0.0


Performing inference and evaluation:  42%|████▏     | 21/50 [01:51<02:30,  5.18s/it]


--- Example 21 ---
Model Output:     天寧結藴間居民衆請再領寺事乃造鉸石宝鋐取至
Ground Truth:     天寜結菴閒居衆請再領寺事乃造鍮石宝缾取至
Evaluation (Dataset A/C): Comprehensive Score = 0.7124
Precisely: CER=0.3, P=0.7143, R=0.75, F1=0.7317, ned=0.2857


Performing inference and evaluation:  42%|████▏     | 21/50 [01:54<02:37,  5.43s/it]


KeyboardInterrupt: 
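
Some examples above report a CER greater than 1. That can happen when the prediction is longer than the reference and shares no characters with it. The sketch below uses the common definition of character error rate, Levenshtein edit distance divided by reference length. It is an assumption that `calculate_char_metrics` computes CER this way, although the sketch reproduces the logged CER for example 3.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))  # Distance from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]                    # Deleting i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # Deletion
                            curr[j - 1] + 1,      # Insertion
                            prev[j - 1] + cost))  # Substitution or match
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# Example 3 from the log above: three of four characters differ
print(cer("之所至元", "山益州元"))
```

Because the denominator is the reference length, every extra predicted character adds to the numerator only, so CER has no upper bound of 1.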

In [19]:
model.save_pretrained("lora_model_QWEN2.5_EVAHAN")
tokenizer.save_pretrained("lora_model_QWEN2.5_EVAHAN")

[]