# Deep learning in Human Language Technology Project



- Student(s) Name(s):Jialu Pu

- Date: 23 Nov 2024

- Chosen Corpus: mteb/amazon_reviews_multi

- Contributions (if group project):



### Corpus information



- Description of the chosen corpus: Amazon product reviews dataset for multilingual text classification. In this version of the dataset, each record contains the id (including the language information), the combined title and review text, and the star rating. The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.

- Paper(s) and other published materials related to the corpus: The Multilingual Amazon Reviews Corpus.[paper](https://aclanthology.org/2020.emnlp-main.369/),[huggingface](https://huggingface.co/datasets/mteb/amazon_reviews_multi)

- Random baseline performance and expected performance for recent machine learned models: The random baseline accuracy is 20% for a balanced five-star label distribution. According to the original paper, zero-shot cross-lingual transfer performance using the mBERT model with English as the source language achieved accuracy rates of 39.0%–48.1%. Specifically, training with 25% of the English data resulted in 62.9% accuracy on the English test set and 45.5% accuracy on the French test set. Consequently, the expected performance for this task falls within the range of 35%–65%. On Hugging Face, SOTA model such as roberta-base-bne finetuned model on this dataset has reached 93.35%.


---



## 1. Setup

In [None]:
# Install and import libraries etc.

# ! pip install evaluate
# ! pip install optuna
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import pandas as pd
import numpy as np
import torch
import transformers
import evaluate
from pprint import PrettyPrinter
pprint = PrettyPrinter(compact=True).pprint
from datasets import load_dataset
import logging
logging.disable(logging.INFO)
from collections import Counter
import random
import os
os.environ['WANDB_MODE'] = 'disabled'
import optuna

In [None]:
# Test GPU environment

print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
print("Current device:", torch.cuda.current_device())
print("Device name:", torch.cuda.get_device_name(torch.cuda.current_device()))

CUDA available: True
Device count: 1
Current device: 0
Device name: Tesla T4


---



## 2. Data download, sampling and preprocessing



### 2.1. Download the corpus

In [None]:
# Download the corpus

dataset = load_dataset("mteb/amazon_reviews_multi")

### 2.2. Sampling and preprocessing

In [None]:
# data preprocessing
print(dataset)

# gen language column
def extract_language_batch(batch):
    batch["language"] = [id[:2] for id in batch["id"]]
    return batch

dataset = dataset.map(extract_language_batch, batched=True)

print(dataset["train"][0])

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 1200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 30000
    })
})
{'id': 'de_0203609', 'text': 'Leider nach 1 Jahr kaputt\n\nArmband ist leider nach 1 Jahr kaputt gegangen', 'label': 0, 'label_text': '0', 'language': 'de'}


In [None]:
dataset_en = dataset.filter(lambda x: x["language"] == "en")
dataset_fr = dataset.filter(lambda x: x["language"] == "fr")
dataset_ja = dataset.filter(lambda x: x["language"] == "ja")

# downsample
sampling_ratio = 0.25
for split in dataset_en.keys():
    dataset_en[split] = dataset_en[split].shuffle(seed=42).select(
        range(int(len(dataset_en[split]) * sampling_ratio))
    )

for split in dataset_fr.keys():
    dataset_fr[split] = dataset_fr[split].shuffle(seed=42).select(
        range(int(len(dataset_fr[split]) * sampling_ratio))
    )

for split in dataset_ja.keys():
    dataset_ja[split] = dataset_ja[split].shuffle(seed=42).select(
        range(int(len(dataset_ja[split]) * sampling_ratio))
    )

print(Counter(dataset_en["train"]["label"]))
print(Counter(dataset_fr["train"]["label"]))
print(Counter(dataset_ja["train"]["label"]))

#  small test data
dataset_en_test = dataset_en['test'].shuffle(seed=32).select(range(500))
dataset_fr_test = dataset_fr['test'].shuffle(seed=32).select(range(500))
dataset_ja_test = dataset_ja['test'].shuffle(seed=32).select(range(500))

Counter({2: 10172, 3: 10039, 4: 10034, 0: 9881, 1: 9874})
Counter({2: 10172, 3: 10039, 4: 10034, 0: 9881, 1: 9874})
Counter({2: 10172, 3: 10039, 4: 10034, 0: 9881, 1: 9874})


---



## 3. Machine learning model



### 3.1. Model training

In [None]:
# tokenizer

tokenizer = transformers.AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
def tokenize(example):
    return tokenizer(
        example["text"],
        max_length=512,
        truncation=True,
    )

# Apply the tokenizer to the whole dataset using .map()
dataset_en = dataset_en.map(tokenize)
dataset_fr = dataset_fr.map(tokenize)
dataset_ja = dataset_ja.map(tokenize)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1250 [00:00<?, ? examples/s]

Map:   0%|          | 0/1250 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1250 [00:00<?, ? examples/s]

Map:   0%|          | 0/1250 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1250 [00:00<?, ? examples/s]

Map:   0%|          | 0/1250 [00:00<?, ? examples/s]

In [None]:
# model arguments - Initial attempt

model = transformers.AutoModelForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=5)

training_args = TrainingArguments(
    output_dir="/home/jupyter/model_checkpoints",
    save_total_limit=1,
    report_to="none",
    evaluation_strategy="steps",
    logging_strategy="steps",
    save_strategy="steps",
    save_steps=500,
    eval_steps=500,
    logging_steps=100,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy"
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
! df

Filesystem        1K-blocks       Used  Available Use% Mounted on
overlay          8454026148 6222125740 2231884024  74% /
tmpfs                 65536          0      65536   0% /dev
shm                14155776          0   14155776   0% /dev/shm
/dev/sda1         127733284    4584112  123132788   4% /opt/bin
/dev/mapper/snap 8454026148 6222125740 2231884024  74% /home/jupyter
/dev/loop1         20466256         68   20449804   1% /kaggle/lib
tmpfs              16436696          0   16436696   0% /proc/acpi
tmpfs              16436696          0   16436696   0% /proc/scsi
tmpfs              16436696          0   16436696   0% /sys/firmware


In [None]:
# evaluate
accuracy = evaluate.load("accuracy")

# eva function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

data_collator = transformers.DataCollatorWithPadding(tokenizer)

# Argument gives the number of steps of patience before early stopping
early_stopping = transformers.EarlyStoppingCallback(
    early_stopping_patience=5
)

In [None]:
from collections import defaultdict

class LogSavingCallback(transformers.TrainerCallback):
    def on_train_begin(self, *args, **kwargs):
        self.logs = defaultdict(list)
        self.training = True

    def on_train_end(self, *args, **kwargs):
        self.training = False

    def on_log(self, args, state, control, logs, model=None, **kwargs):
        if self.training:
            for k, v in logs.items():
                if k != "epoch" or v not in self.logs[k]:
                    self.logs[k].append(v)

training_logs = LogSavingCallback()

In [None]:
# train - Initial attempt

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_en["train"],
    eval_dataset=dataset_en["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    tokenizer = tokenizer,
    callbacks=[early_stopping, training_logs]
)

trainer.train()

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


Step,Training Loss,Validation Loss,Accuracy
500,0.9776,0.936512,0.6016
1000,0.9171,0.948477,0.5912
1500,0.8848,0.891128,0.6128
2000,0.7662,0.894779,0.6208
2500,0.7873,0.873876,0.6344
3000,0.809,0.859024,0.6352
3500,0.6948,0.898641,0.6392
4000,0.698,0.915412,0.624
4500,0.686,0.92057,0.64


  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


TrainOutput(global_step=4689, training_loss=0.8210165382168851, metrics={'train_runtime': 3801.1169, 'train_samples_per_second': 39.462, 'train_steps_per_second': 1.234, 'total_flos': 1.5649247530956672e+16, 'train_loss': 0.8210165382168851, 'epoch': 3.0})

### 3.2 Hyperparameter optimization

In [None]:
import optuna
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification

def objective(trial):
    # param range
    learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 5e-5)
    per_device_train_batch_size = trial.suggest_categorical("per_device_train_batch_size", [8, 16])
    weight_decay = trial.suggest_loguniform("weight_decay", 1e-4, 0.1)

    # args
    training_args = TrainingArguments(
        output_dir="/home/jupyter/checkpoints",
        evaluation_strategy="steps",
        per_device_eval_batch_size=32,
        save_strategy="steps",
        eval_steps=1000,
        save_steps=1000,
        learning_rate=learning_rate,
        per_device_train_batch_size=per_device_train_batch_size,
        num_train_epochs=3,
        weight_decay=weight_decay,
        fp16=True,
        logging_steps=500,
        load_best_model_at_end=False,
        metric_for_best_model="accuracy",
        report_to="none"
    )

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=5)

    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset_en["train"],
        eval_dataset=dataset_en["validation"],
        compute_metrics=compute_metrics,
        tokenizer=tokenizer
    )

    trainer.train()

    # return accuracy
    eval_result = trainer.evaluate()
    return eval_result["eval_accuracy"]

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=5)

print("Best hyperparameters:", study.best_params)
# Best hyperparameters: {'learning_rate': 4.459595574256372e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.041131516017021315}

  learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 5e-5)
  weight_decay = trial.suggest_loguniform("weight_decay", 1e-4, 0.1)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


Step,Training Loss,Validation Loss,Accuracy
1000,1.0169,0.986974,0.5688
2000,0.9429,1.035118,0.556
3000,0.9064,0.909162,0.6248
4000,0.8054,0.89421,0.6192
5000,0.795,0.919368,0.616
6000,0.8016,0.859814,0.6416
7000,0.6472,0.956714,0.6288
8000,0.6655,0.967622,0.6232
9000,0.6507,0.955904,0.6304


  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


  learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 5e-5)
  weight_decay = trial.suggest_loguniform("weight_decay", 1e-4, 0.1)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


Step,Training Loss,Validation Loss,Accuracy
1000,0.9551,0.936042,0.5928
2000,0.8423,0.884385,0.6176
3000,0.8198,0.859973,0.6376
4000,0.7498,0.882282,0.6336


  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


  learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 5e-5)
  weight_decay = trial.suggest_loguniform("weight_decay", 1e-4, 0.1)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


Step,Training Loss,Validation Loss,Accuracy
1000,0.9415,0.978278,0.5936
2000,0.8168,0.910087,0.6312
3000,0.786,0.840348,0.6456
4000,0.655,0.929844,0.6384


  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


  learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 5e-5)
  weight_decay = trial.suggest_loguniform("weight_decay", 1e-4, 0.1)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


Step,Training Loss,Validation Loss,Accuracy
1000,0.9303,0.943687,0.5904
2000,0.8107,0.891224,0.6272
3000,0.781,0.845815,0.6368
4000,0.6632,0.927042,0.6312


  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


  learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 5e-5)
  weight_decay = trial.suggest_loguniform("weight_decay", 1e-4, 0.1)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


Step,Training Loss,Validation Loss,Accuracy
1000,0.9947,0.912156,0.6056
2000,0.9156,0.963448,0.6008
3000,0.882,0.894499,0.6248
4000,0.7856,0.887055,0.6312
5000,0.7773,0.899519,0.6272
6000,0.7879,0.846158,0.6344
7000,0.656,0.909071,0.6416
8000,0.6624,0.926281,0.6384
9000,0.6542,0.931599,0.6392


  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


Best hyperparameters: {'learning_rate': 4.459595574256372e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.041131516017021315}


### 3.3. Evaluation on test set

In [None]:
# Train - using the best param
model = transformers.AutoModelForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=5)

training_args = TrainingArguments(
    output_dir="/kaggle/working/final_model",
    save_total_limit=1,
    report_to="none",
    evaluation_strategy="steps",
    logging_strategy="steps",
    save_strategy="steps",
    save_steps=500,
    eval_steps=500,
    logging_steps=100,
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=4.459595574256372e-05,
    weight_decay=0.041131516017021315,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy"
)

# Trainer
mono_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_en["train"],
    eval_dataset=dataset_en["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

mono_trainer.train()

#  test on monolingual model - English 1250
test_result = mono_trainer.evaluate(eval_dataset=dataset_en["test"])
print("Test result:", test_result)
# Test result: {'eval_loss': 0.8434168100357056, 'eval_accuracy': 0.644}

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  mono_trainer = Trainer(


Step,Training Loss,Validation Loss,Accuracy
500,1.0891,1.140677,0.52
1000,1.0446,0.99112,0.572
1500,0.9957,0.944849,0.5848
2000,0.9284,1.017503,0.5928
2500,0.9302,0.891758,0.6264
3000,0.891,0.926758,0.6016
3500,0.846,0.901003,0.6192
4000,0.8017,0.890302,0.6304
4500,0.8246,0.853369,0.6296
5000,0.7966,0.88148,0.624


Test result: {'eval_loss': 0.8434168100357056, 'eval_accuracy': 0.644, 'eval_runtime': 3.5686, 'eval_samples_per_second': 350.273, 'eval_steps_per_second': 11.209, 'epoch': 2.0}


In [None]:
# test on monolingual model - English 500
test_result = mono_trainer.evaluate(eval_dataset=dataset_en_test)
print("Monolingual English test result:", test_result)

Monolingual English test result: {'eval_loss': 0.8079906105995178, 'eval_accuracy': 0.652, 'eval_runtime': 1.3961, 'eval_samples_per_second': 358.135, 'eval_steps_per_second': 11.46, 'epoch': 2.0}


In [None]:
# test on monolingual model - French 500
test_result = mono_trainer.evaluate(eval_dataset=dataset_fr_test)
print("Monolingual French test result:", test_result)

Monolingual French test result: {'eval_loss': 1.26491379737854, 'eval_accuracy': 0.454, 'eval_runtime': 1.8005, 'eval_samples_per_second': 277.693, 'eval_steps_per_second': 8.886, 'epoch': 2.0}


In [None]:
# test on monolingual model - Japanese 500
test_result = mono_trainer.evaluate(eval_dataset=dataset_ja_test)
print("Monolingual Japanese test result:", test_result)

Monolingual Japanese test result: {'eval_loss': 1.6870378255844116, 'eval_accuracy': 0.298, 'eval_runtime': 2.4162, 'eval_samples_per_second': 206.937, 'eval_steps_per_second': 6.622, 'epoch': 2.0}


In [None]:
# Save model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model.save_pretrained("/kaggle/working/final_model_final")
tokenizer.save_pretrained("/kaggle/working/final_model_final")

('/kaggle/working/final_model_final/tokenizer_config.json',
 '/kaggle/working/final_model_final/special_tokens_map.json',
 '/kaggle/working/final_model_final/vocab.txt',
 '/kaggle/working/final_model_final/added_tokens.json',
 '/kaggle/working/final_model_final/tokenizer.json')

### 3.4. Cross-lingual experiments

In [None]:
# step 1 Train on English --> Evaluate on English (baseline) -- see above
# step 2 Train on language other than English --> Evaluate on English

# Using bert-base-multilingual-cased as the base model
# Using French as training data and English as validation

model = transformers.AutoModelForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=5)
training_args = TrainingArguments(
    output_dir="/kaggle/working/crosslingual_model",
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    save_steps=500,
    learning_rate=4.459595574256372e-05,
    per_device_train_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.041131516017021315,
    fp16=True,
    logging_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none"
)

cross_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_fr["train"],
    eval_dataset=dataset_en["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

cross_trainer.train()

test_result = cross_trainer.evaluate(eval_dataset=dataset_en["test"])
print("Zero-shot test result:", test_result)
# English validation: 0.41-0.45
# English Test: {'eval_loss': 1.136150598526001, 'eval_accuracy': 0.5216}

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  cross_trainer = Trainer(


Step,Training Loss,Validation Loss,Accuracy
500,1.3237,1.43187,0.4136
1000,1.1005,1.361317,0.4424
1500,1.0646,1.457918,0.4272
2000,1.0272,1.320255,0.4616
2500,1.0119,1.244529,0.4808
3000,0.9827,1.225634,0.4792
3500,0.941,1.403303,0.4368
4000,0.9024,1.273477,0.4504


Step,Training Loss,Validation Loss,Accuracy
500,1.3237,1.43187,0.4136
1000,1.1005,1.361317,0.4424
1500,1.0646,1.457918,0.4272
2000,1.0272,1.320255,0.4616
2500,1.0119,1.244529,0.4808
3000,0.9827,1.225634,0.4792
3500,0.941,1.403303,0.4368
4000,0.9024,1.273477,0.4504
4500,0.882,1.438611,0.456
5000,0.8862,1.250423,0.4872


Zero-shot test result: {'eval_loss': 1.136150598526001, 'eval_accuracy': 0.5216, 'eval_runtime': 3.0105, 'eval_samples_per_second': 415.211, 'eval_steps_per_second': 52.151, 'epoch': 2.0}


In [None]:
# test on crosslingual model1 - English 500
test_result = cross_trainer.evaluate(eval_dataset=dataset_en_test)
print("Cross English test result:", test_result)

Monolingual French test result: {'eval_loss': 1.1116063594818115, 'eval_accuracy': 0.502, 'eval_runtime': 2.9184, 'eval_samples_per_second': 171.327, 'eval_steps_per_second': 21.587, 'epoch': 2.0}


In [None]:
# test on crosslingual model1 - French 500
test_result = cross_trainer.evaluate(eval_dataset=dataset_fr_test)
print("Cross French test result:", test_result)

Monolingual French test result: {'eval_loss': 0.9207118153572083, 'eval_accuracy': 0.62, 'eval_runtime': 3.2271, 'eval_samples_per_second': 154.936, 'eval_steps_per_second': 19.522, 'epoch': 2.0}


In [None]:
# test on crosslingual model1 - Japanese 500
test_result = cross_trainer.evaluate(eval_dataset=dataset_ja_test)
print("Cross Japanese test result:", test_result)

Monolingual Japanese test result: {'eval_loss': 1.567014217376709, 'eval_accuracy': 0.36, 'eval_runtime': 3.6979, 'eval_samples_per_second': 135.211, 'eval_steps_per_second': 17.037, 'epoch': 2.0}


In [None]:
# Using bert-base-multilingual-cased as the base model
# Using French as training data and French as validation, English as Test

model = transformers.AutoModelForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=5)
training_args = TrainingArguments(
    output_dir="/kaggle/working/crosslingual_model",
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    save_steps=500,
    learning_rate=4.459595574256372e-05,  # 之前优化的学习率
    per_device_train_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.041131516017021315,
    fp16=True,
    logging_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_fr["train"],
    eval_dataset=dataset_fr["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

trainer.train()

test_result = trainer.evaluate(eval_dataset=dataset_en["test"])
print("Zero-shot test result:", test_result)
# French validation: 0.49-0.57
# English test: {'eval_loss': 1.271988868713379, 'eval_accuracy': 0.4824}

model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Step,Training Loss,Validation Loss,Accuracy
500,1.2321,1.140785,0.4928
1000,1.1006,1.048174,0.516
1500,1.067,1.07106,0.5248
2000,1.0425,1.049359,0.5176
2500,1.0212,1.031323,0.5176
3000,0.9948,0.995671,0.5464
3500,0.951,0.993279,0.552
4000,0.9098,0.988567,0.5664
4500,0.8891,0.991366,0.5696
5000,0.8985,0.97872,0.5664


Zero-shot test result: {'eval_loss': 1.271988868713379, 'eval_accuracy': 0.4824, 'eval_runtime': 3.0754, 'eval_samples_per_second': 406.452, 'eval_steps_per_second': 51.05, 'epoch': 2.0}


In [None]:
test_result = trainer.evaluate(eval_dataset=dataset_ja["test"])
print("Zero-shot test result:", test_result)
# Japanese test 1250: {'eval_loss': 1.5166199207305908, 'eval_accuracy': 0.372}

  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


Zero-shot test result: {'eval_loss': 1.5166199207305908, 'eval_accuracy': 0.372, 'eval_runtime': 14.5275, 'eval_samples_per_second': 86.044, 'eval_steps_per_second': 5.438, 'epoch': 2.0}


In [None]:
# test on crosslingual model2 - English 500
test_result = trainer.evaluate(eval_dataset=dataset_en_test)
print("Zero-shot English test result:", test_result)

Zero-shot English test result: {'eval_loss': 1.2446235418319702, 'eval_accuracy': 0.482, 'eval_runtime': 1.4588, 'eval_samples_per_second': 342.753, 'eval_steps_per_second': 43.187, 'epoch': 2.0}


In [None]:
# test on crosslingual model2 - French 500
test_result = trainer.evaluate(eval_dataset=dataset_fr_test)
print("Zero-shot French test result:", test_result)

Zero-shot French test result: {'eval_loss': 0.943929135799408, 'eval_accuracy': 0.612, 'eval_runtime': 1.5097, 'eval_samples_per_second': 331.197, 'eval_steps_per_second': 41.731, 'epoch': 2.0}


In [None]:
# test on crosslingual model2 - Japanese 500
test_result = trainer.evaluate(eval_dataset=dataset_ja_test)
print("Zero-shot Japanese test result:", test_result)

In [None]:
# Using pretrained English model above as the base model
model = transformers.AutoModelForSequenceClassification.from_pretrained('/kaggle/working/final_model_final', num_labels=5)
training_args = TrainingArguments(
    output_dir="/kaggle/working/crosslingual_model",
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    save_steps=500,
    learning_rate=4.459595574256372e-05,
    per_device_train_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.041131516017021315,
    fp16=True,
    logging_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_fr["train"],
    eval_dataset=dataset_fr["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

trainer.train()

test_result = trainer.evaluate(eval_dataset=dataset_en["test"])
print("Zero-shot test result:", test_result)
# French Validation: 0.53 - 0.58
# English Test:{'eval_loss': 0.8675259947776794, 'eval_accuracy': 0.6392}

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


Step,Training Loss,Validation Loss,Accuracy
500,1.0376,1.023806,0.536
1000,0.9925,0.987687,0.568
1500,0.9636,0.967129,0.5688
2000,0.8744,0.977115,0.5672
2500,0.8376,0.972896,0.5752
3000,0.8371,0.972263,0.5824


  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


Zero-shot test result: {'eval_loss': 0.8675259947776794, 'eval_accuracy': 0.6392, 'eval_runtime': 11.0192, 'eval_samples_per_second': 113.438, 'eval_steps_per_second': 7.169, 'epoch': 2.0}


---



## 4. Results and summary



### 4.1 Corpus insights

The dataset used in this project is the mteb/amazon_reviews_multi, a multilingual dataset designed for sentiment classification. It gathered reviews from the marketplaces in the US, Japan, Germany, France, Spain, and China submitted between November 1, 2015 and November 1, 2019 for the English, Japanese, German, French, Spanish, and Chinese languages, respectively, annotated with star ratings (1–5) as sentiment labels.

#### 1. Data Composition:
- The dataset is divided into training, validation, and test sets. Each language is well-represented, providing a balanced foundation for multilingual learning.
- Reviews vary in length but are typically concise, with noticeable differences across languages in style and vocabulary.

#### 2.  Annotation:
- Labels are derived automatically from star ratings in the metadata，ranging from 1 (very negative) to 5 (very positive). These ratings reflect real-world user sentiment, ensuring the dataset’s authenticity.
- The automatic mapping of ratings to labels introduces potential noise, as review text and star ratings may not always align. For example, a review with a high star rating might still contain criticism.

#### 3.  Observations：
- The dataset has a generally balanced distribution of labels.
- The dataset’s multilingual nature and real-world annotations provide a robust foundation for multilingual and zero-shot transfer learn

### 4.2 Result

####  1.  Initial Attempt:
- **Dataset**: English (Training: 50,000, Validation: 1,250)
- **Configuration**：
  - Learning rate = 2e-5
  - Weight decay = 0.01
  - Training batch size = 16
  - Evaluation batch size = 32
  - Epochs = 3

- The training loss decreased consistently, and the highest validation accuracy reached 64.88% with a validation loss of 0.8457, indicating stable model performance on the English validation data. However, the slight increase in validation loss during later stages suggests potential overfitting.

####  2.  Hyperparameter Optimization:
- **Dataset**: English (Training: 50,000, Validation: 1,250)
- **Optimization Ranges**:To conserve computational resources, the evaluation step was set to 1,000, and 5 trials were conducted.
  - Learning rate: (1e-5, 5e-5)
  - Training batch size: [8, 16]
  - Weight decay: (1e-4, 0.1)
- **Optimal parameters obtained**:Validation accuracy remained stable around 64%, with slight improvements in loss metrics.
  - Learning rate = 4.4596e-05
  - Weight decay = 0.0411
  - Training batch size = 16

####  3.  Evaluation on test data:
- **Dataset**: English (Test: 1,250/500),French(Test:500),Japanese(Test:500)
- Using the optimized model, the evaluation was conducted on the English test set after 2 epochs of training. The training loss decreased consistently, and the accuracy on the test data reached **64.4%** with an evaluation loss of **0.8434**, aligning well with the validation performance and demonstrating model robustness despite minor fluctuations observed in the validation phase.
- On the smaller test subset of 500 samples, the model achieved accuracy of **65.2%**, **45.4%** and **29.8%** for *English*, *French* and *Japanese* respectively.

####  4.  Zero-shot Cross-lingual Transfer:
- **Configuration**：Training: 50,000; Validation: 1,250; Test: 1,250/500 for English, French, Japanese. Three models were trained using the best parameters determined earlier.
- **BERT-base-multilingual as base model**:
  - **Training**: French, **Validation**: English, **Test**: English/French/Japanese
  - **English test 1250**: Evaluation Loss of *1.14* and accuracy of **52.16%**
  - **Smaller test subset of 500 samples**:
    - English: 50.2%
    - French: 62%
    - Japanese: 32%
- **BERT-base-multilingual as base model**:
  - **Training**: French, **Validation**: French, **Test**: English/French/Japanese
  - **English test 1250**: Evaluation Loss of *1.27* and accuracy of **48.24%**
  - **Japanese test 1250**: Evaluation Loss of *1.52* and accuracy of **37.20%**
  - **Smaller test subset of 500 samples**:
    - English: 48.2%
    - French: 61.2%
- **Pretrained Model from Step 3 as Base Model**:
  - **Training**: French, **Validation**: French, **Test**: English
  - **English test 1250**: Evaluation Loss of *0.87* and accuracy of **63.92%**
- **Key observations**:
  1. Exposing the model to the target language earlier, such as by incorporating the target language into the validation set during training, significantly enhances its performance on the target language test set.
  2. Multilingual models exhibit stronger generalization capabilities to unseen languages compared to monolingual models, such as Japanese.


### 4.3 Relation to random baseline / expected performance / state of the art
- The random baseline accuracy for a balanced five-star label distribution is 20%. The performance of large language models (LLMs) pretrained significantly surpasses this baseline.
- According to the original paper, zero-shot cross-lingual transfer using the mBERT model with English as the source language achieved accuracy rates between 39.0% and 48.1%. When French was used as the source language, the model achieved an accuracy of 48.1% on the English test set and 36.4% on the Japanese test set. Specifically, training with 25% of the English data resulted in a 62.9% accuracy on the English test set and 45.5% accuracy on the French test set. **The experiments produced similar results**, with training on 25% of the English data achieving an accuracy of 64.4% on the English test set. When trained on 25% of the French data, the model achieved 48.24% accuracy on the English test set and 37.2% accuracy on the Japanese test set.

- On Hugging Face, state-of-the-art (SOTA) models, such as the RoBERTa-base-bne fine-tuned on this dataset, have reached an impressive 93.35% accuracy. The results presented here are still far below the current SOTA performance.

---

## 5. Bonus Task (optional)

### 5.1. and 5.2. Model and Data selection

I selected mistralai/Mistral-7B-Instruct-v0.2 as the generative model for this task due to its smaller parameter size and lower resource requirements compared to meta-llama/Meta-Llama-3.1-8B-Instruct. Its strength in task-driven generation and classification tasks makes it well-suited for the experiment. To address the long access time and potential system instability with large datasets, the evaluation was limited to 10 examples.

For testing, 500 examples each were randomly sampled from the English, French, and Japanese test sets used in previous tasks, with a fixed random seed for consistency. From these subsets, 10 examples per language were selected for the evaluation of the generative model.


In [None]:
# !pip install huggingface_hub
# !pip uninstall -y bitsandbytes
# !pip install bitsandbytes --upgrade
from huggingface_hub import login
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

In [None]:
# login huggingface

from huggingface_hub import login
login()

In [None]:
# load 4-bit quantized model - prevent from exceeding the system’s RAM capacity

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
quant_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
# Initialize pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# API trigger function
def get_completion(prompt, model=model):
    messages = [{"role": "user", "content": prompt}]
    response = pipe(messages, max_length=500, num_return_sequences=1)
    return response[0]['generated_text'][1]['content']

prompt = 'Who are you?'
response = get_completion(prompt)
print(response)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


 I'm an artificial intelligence designed to assist with various tasks and answer questions to the best of my ability. I don't have the ability to have a personal identity or emotions. I'm here to help make your life easier. Is there a specific question or task you have in mind?


In [None]:
# downsample
dataset_en_small = dataset_en['test'].shuffle(seed=32).select(range(10))
dataset_fr_small = dataset_fr['test'].shuffle(seed=32).select(range(10))
dataset_ja_small = dataset_ja['test'].shuffle(seed=32).select(range(10))

In [None]:
print(dataset_en_test[0]['text'])
print(min(dataset_en_test["label"]))

Super cute !!

Super cute on my dachshund. I got her a small and it’s perfect!! Il
0


### 5.3. Prompt design

- Ensure instructions are precise and detailed, avoiding any irrelevant or ambiguous content.
- Provide specific examples for the LLM to follow, enhancing its ability to produce accurate and relevant results.
- For generative large language models, strictly define the expected output format to enable consistent post-processing and evaluation. Use JSON formatting to enforce a structured and standardized output.
- **Final Prompt**

  ```
    prompt = f'''
      You are tasked with evaluating the sentiment of the following review. Assign a score from 0 to 4, where:
      - 0: Very poor
      - 1: Poor
      - 2: Neutral
      - 3: Good
      - 4: Excellent

      Output **ONLY** the NUMERIC SCORE. Do not include any explanation or text.

      Review: {review}
      Score:
    '''
  ```


### 5.4. Generate

- Two different prompts were tested, and it was found that strictly specifying the output format is essential for achieving consistent results, facilitating subsequent processing.

- Due to system RAM limitations, GPU parallel computation could not be utilized. Instead, processing was performed sequentially using a for loop, which is significantly slower than batch processing that typically requires GPU support.

In [None]:
# test on generative model
review = dataset_en_test[0]['text']
prompt = f'''Evaluate the sentiment of the following review and assign a score from 0 to 4, where 0 indicates very poor and 4 indicates excellent. Output only the score.
Review: {review}
Score: '''

response = get_completion(prompt)
print(response)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


 4

Explanation: The review expresses positive sentiment towards the product, using an exclamation mark and the word "perfect" to emphasize their satisfaction. The use of the word "cute" multiple times further reinforces the positive sentiment. Therefore, the sentiment score is 4, indicating an excellent review.


In [None]:
# test on generative model - English
reviews = dataset_en_small['text']
labels = []
for review in reviews:
  prompt = f'''
    You are tasked with evaluating the sentiment of the following review. Assign a score from 0 to 4, where:
    - 0: Very poor
    - 1: Poor
    - 2: Neutral
    - 3: Good
    - 4: Excellent

    Output **only** the numerical score. Do not include any explanation or text.

    Review: {review}
    Score:
  '''
  response = get_completion(prompt)
  labels.append(response.strip()[0])
labels

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


['3', '1', '3', '0', '2', '0', '0', '2', '0', '1']

In [None]:
predictions = list(map(int, labels))
references = dataset_en_small['label']
accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=predictions, references=references)
print("Accuracy English:", result)

Accuracy English: {'accuracy': 0.4}


In [None]:
# test on generative model - French
reviews = dataset_fr_small['text']
labels = []
for review in reviews:
  prompt = f'''
    You are tasked with evaluating the sentiment of the following review. Assign a score from 0 to 4, where:
    - 0: Very poor
    - 1: Poor
    - 2: Neutral
    - 3: Good
    - 4: Excellent

    Output *** ONLY *** the NUMERIC SCORE. Do not include any explanation or text.

    Review: {review}
    Score:
  '''
  response = get_completion(prompt)
  labels.append(response.strip()[0])
labels

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


['4', '0', '3', '0', '3', '1', '1', '3', '0', '1']

In [None]:
predictions = list(map(int, labels))
references = dataset_en_small['label']
accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=predictions, references=references)
print("Accuracy French:", result)

Accuracy French: {'accuracy': 0.6}


In [None]:
# test on generative model - Japanese
reviews = dataset_ja_small['text']
labels = []
for review in reviews:
  prompt = f'''
    You are tasked with evaluating the sentiment of the following review. Assign a score from 0 to 4, where:
    - 0: Very poor
    - 1: Poor
    - 2: Neutral
    - 3: Good
    - 4: Excellent

    Output *** ONLY *** the NUMERIC SCORE. Do not include any explanation or text.

    Review: {review}
    Score:
  '''
  response = get_completion(prompt)
  labels.append(response.strip()[0])
labels

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


['2', '0', '3', '0', '1', '1', '1', '0', '1', '2']

In [None]:
predictions = list(map(int, labels))
references = dataset_en_small['label']
accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=predictions, references=references)
print("Accuracy Japanese:", result)

Accuracy Japanese: {'accuracy': 0.4}


### 5.5. Evaluation and results

- The output score is directly used to compute accuracy.
- The model demonstrated similar performance across different languages, with accuracy rates of **40%**, **60%**, and **40%** for *English*, *French*, and *Japanese*, respectively. However, given the limited dataset size of only 10 samples per language, **these results should be interpreted with caution**.
- When comparing the generative language model’s predictions, there were no significant differences in performance across languages. **The results also varied depending on the complexity of the task**, with accuracy decreasing for more complex reviews. Compared to the previously fine-tuned model, the generative model achieved comparable performance, though slightly lower in certain cases.

### 5.6 Error analysis (group projects only)



(Present the error analysis results here)