# Japanese Politeness Classifier — Model Training Notebook 1st version
This notebook contains the full training process for fine-tuning a Japanese BERT model to classify sentences based on their level of politeness: casual, neutral, or keigo.

## 1. Setup & Imports
Import required libraries including Hugging Face Transformers, Datasets, PyTorch, and other utilities.

In [1]:
import pandas as pd
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    AutoTokenizer,
    pipeline
)
from datasets import Dataset
import evaluate

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [4]:
import os
import random
import warnings
from dotenv import load_dotenv
warnings.filterwarnings("ignore")

## 2. Load and Inspect Preprocessed Data
Load the cleaned CSV file created in the preprocessing phase. Make sure the dataset contains the sentence and label columns.

In [None]:
df = pd.read_csv(r"G:\Python Projects\politeness-classifier-jp\data\processed\BunnyGirl800-Preprocessed-binary.csv")
df.head(3)

Unnamed: 0,text,label,length
0,おい ムロ ちょっと来てくれ！,0,15
1,何か出てきやがった,0,9
2,あ…,1,2


## 3. Prepare Dataset for Model Input
Tokenize the Japanese text using a tokenizer (e.g., BERT tokenizer pre-trained on Japanese). Convert the data into a Hugging Face Dataset object suitable for training.

In [6]:
load_dotenv()
token = os.getenv("HUGGINGFACE-TOKEN")
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese", token=token)

In [7]:
# Split the DataFrame
df_train, df_test = train_test_split(df, test_size=0.2, random_state=123, stratify=df["label"])

# Convert to Hugging Face datasets
train_dataset = Dataset.from_pandas(df_train.reset_index(drop=True))
test_dataset = Dataset.from_pandas(df_test.reset_index(drop=True))

In [8]:
def preprocess_function(sentences):
    return tokenizer(sentences["text"], padding=True, truncation=True)

In [9]:
# Apply the tokenizer to the datasets
train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)

Map: 100%|██████████| 645/645 [00:00<00:00, 1880.47 examples/s]
Map: 100%|██████████| 162/162 [00:00<00:00, 4764.45 examples/s]


## 4. Define Model Architecture
Load a pre-trained Japanese BERT model with a classification head for 3 classes (casual, neutral, polite).

In [10]:
model = BertForSequenceClassification.from_pretrained("cl-tohoku/bert-base-japanese", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 5. Training Configuration
Define training arguments like batch size, learning rate, epochs, evaluation strategy, logging, and checkpointing.

In [11]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [12]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=r"G:\Python Projects\politeness-classifier-jp\models",          # where to save model
    eval_strategy="epoch",     # evaluate every epoch
    learning_rate=2e-5,              # small LR for fine-tuning
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss"
)

In [13]:
from transformers import Trainer
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

## 6. Train the Model
Use the Hugging Face Trainer API to train the model on the prepared dataset.

In [14]:
trainer.train()

IndexError: Target 2 is out of bounds.

## 7. Evaluate the Model
Visualize metrics like accuracy, loss, precision, recall, or F1-score on the validation set.

In [None]:
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 0.2634919583797455, 'eval_runtime': 2.8055, 'eval_samples_per_second': 57.744, 'eval_steps_per_second': 7.485, 'epoch': 4.0}


In [None]:
from sklearn.metrics import classification_report
pred_output = trainer.predict(test_dataset)
preds = np.argmax(pred_output.predictions, axis=1)
labels = pred_output.label_ids
print(classification_report(labels, preds))

              precision    recall  f1-score   support

           0       0.86      0.90      0.88        48
           1       0.83      0.83      0.83        42
           2       0.99      0.96      0.97        72

    accuracy                           0.91       162
   macro avg       0.89      0.90      0.89       162
weighted avg       0.91      0.91      0.91       162



## 8. Save the Trained Model
Save the model and tokenizer locally (e.g., in models/politeness-bert/) so you can later load it for inference.

In [None]:
trainer.save_model(r"G:\Python Projects\politeness-classifier-jp\models\bert-finetunedv2")
tokenizer.save_pretrained(r"G:\Python Projects\politeness-classifier-jp\models\bert-finetunedv2")

('G:\\Python Projects\\politeness-classifier-jp\\models\\bert-finetuned\\tokenizer_config.json',
 'G:\\Python Projects\\politeness-classifier-jp\\models\\bert-finetuned\\special_tokens_map.json',
 'G:\\Python Projects\\politeness-classifier-jp\\models\\bert-finetuned\\vocab.txt',
 'G:\\Python Projects\\politeness-classifier-jp\\models\\bert-finetuned\\added_tokens.json')

In [None]:
import json

output_dir = r"G:\Python Projects\politeness-classifier-jp\models\bert-finetunedv2"
os.makedirs(output_dir, exist_ok=True)

with open(os.path.join(output_dir, "metrics.json"), "w") as f:
    json.dump(metrics, f, indent=4)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

def save_results(output_dir, metrics, predictions, labels, class_names=None):
    os.makedirs(output_dir, exist_ok=True)

    # 1. Save metrics.json
    with open(os.path.join(output_dir, "metrics.json"), "w") as f:
        json.dump(metrics, f, indent=4)

    # 2. Save classification_report.txt
    report = classification_report(labels, predictions, target_names=class_names, digits=4)
    with open(os.path.join(output_dir, "classification_report.txt"), "w") as f:
        f.write(report)

    # 3. Save confusion_matrix.png
    cm = confusion_matrix(labels, predictions)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
    fig, ax = plt.subplots(figsize=(6, 6))
    disp.plot(ax=ax, cmap="Blues", values_format="d")
    plt.title("Confusion Matrix")
    plt.savefig(os.path.join(output_dir, "confusion_matrix.png"))
    plt.close()

    print(f"✅ Results saved in {output_dir}")

In [None]:
# Step 1: Evaluate and predict
metrics = trainer.evaluate()
pred_output = trainer.predict(test_dataset)
preds = np.argmax(pred_output.predictions, axis=1)
labels = pred_output.label_ids

# Step 2: Save everything
save_results(
    output_dir=output_dir,
    metrics=metrics,
    predictions=preds,
    labels=labels,
    class_names=["Class 0", "Class 1", "Class 2"]  # or None
)

✅ Results saved in G:\Python Projects\politeness-classifier-jp\models\bert-finetunedv1


## 9. Test Inference on New Sentences
Try out the model on your own Japanese inputs using the pipeline or manual tokenization.