# Homework 3: Fine-tuning Encoder-Decoder Models for Restaurant Review Classification

In this homework, we compare different encoder-decoder models (T5 variants) for classifying restaurant reviews.

**Objective:**
- Split data into train/val/test (70%/15%/15%)
- Fine-tune ruT5-base and mt5-small models
- Compare performance, training time, and convergence

## 1. Setup and Data Loading

In [1]:
# Install required packages
!pip install transformers datasets torch accelerate sentencepiece -q

In [2]:
import json
import time
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback
)
from datasets import Dataset as HFDataset
import warnings
warnings.filterwarnings('ignore')

# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")



Using device: cuda
GPU: NVIDIA GeForce RTX 4060 Ti


In [3]:
# Load data
data = []
with open('restaurants_reviews-327545-5892c5.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

df = pd.DataFrame(data)
print(f"Total reviews: {len(df)}")
print(f"\nGeneral rating distribution:")
print(df['general'].value_counts().sort_index())

Total reviews: 47139

General rating distribution:
general
0    43940
1      462
2      166
3      150
4      257
5     2164
Name: count, dtype: int64


In [4]:
# Filter for general = 1, 3, 5 only
df_filtered = df[df['general'].isin([1, 3, 5])].copy()
print(f"Filtered reviews (general=1,3,5): {len(df_filtered)}")

# Recode labels: 1->0, 3->1, 5->2
label_mapping = {1: 0, 3: 1, 5: 2}
df_filtered['label'] = df_filtered['general'].map(label_mapping)

print(f"\nLabel distribution after recoding:")
print(df_filtered['label'].value_counts().sort_index())
print(f"\nLabel mapping: 1 (negative) -> 0, 3 (neutral) -> 1, 5 (positive) -> 2")

Filtered reviews (general=1,3,5): 2776

Label distribution after recoding:
label
0     462
1     150
2    2164
Name: count, dtype: int64

Label mapping: 1 (negative) -> 0, 3 (neutral) -> 1, 5 (positive) -> 2


## 2. Train/Val/Test Split

In [5]:
# Split data: 70% train, 15% val, 15% test
RANDOM_STATE = 42

texts = df_filtered['text'].tolist()
labels = df_filtered['label'].tolist()

# First split: 70% train, 30% temp (val + test)
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    texts, labels, test_size=0.30, random_state=RANDOM_STATE, stratify=labels
)

# Second split: 50% of 30% = 15% val, 15% test
val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts, temp_labels, test_size=0.50, random_state=RANDOM_STATE, stratify=temp_labels
)

print(f"Train set: {len(train_texts)} samples")
print(f"Val set: {len(val_texts)} samples")
print(f"Test set: {len(test_texts)} samples")

print(f"\nTrain label distribution: {Counter(train_labels)}")
print(f"Val label distribution: {Counter(val_labels)}")
print(f"Test label distribution: {Counter(test_labels)}")

Train set: 1943 samples
Val set: 416 samples
Test set: 417 samples

Train label distribution: Counter({2: 1515, 0: 323, 1: 105})
Val label distribution: Counter({2: 324, 0: 69, 1: 23})
Test label distribution: Counter({2: 325, 0: 70, 1: 22})
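As a quick sanity check on the stratification, the class shares of each split can be recomputed from the counts printed above. This is a minimal sketch: `split_counts` is the printed output copied by hand, not a variable from the notebook, and `class_shares` is a hypothetical helper.

```python
# Class counts copied by hand from the Counter output printed above.
split_counts = {
    "train": {0: 323, 1: 105, 2: 1515},
    "val": {0: 69, 1: 23, 2: 324},
    "test": {0: 70, 1: 22, 2: 325},
}

def class_shares(counts):
    """Return each class's share of one split, rounded to 3 decimals."""
    total = sum(counts.values())
    return {label: round(n / total, 3) for label, n in sorted(counts.items())}

shares = {name: class_shares(counts) for name, counts in split_counts.items()}
for name, split_shares in shares.items():
    print(name, split_shares)
```

If stratification worked, the three splits should show nearly identical shares for every class, with small differences caused only by rounding to whole samples.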


## 3. Prepare Data for T5 Models

T5 models use text-to-text format. We will format classification as:
- Input: "classify: {review_text}"
- Output: "0" / "1" / "2"
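The round trip implied by this format can be sketched as follows: a review and its label become an input/target string pair, and a generated string is mapped back to an integer label. The helper names `format_example` and `parse_prediction` and the sample review are illustrative assumptions, not part of the notebook.

```python
VALID_LABELS = {"0", "1", "2"}

def format_example(review_text, label):
    """Build the (input, target) string pair for one review."""
    return f"classify: {review_text}", str(label)

def parse_prediction(generated_text, invalid=-1):
    """Map generated text back to an integer label, or `invalid` if it is not a known label."""
    cleaned = generated_text.strip()
    return int(cleaned) if cleaned in VALID_LABELS else invalid

# Made-up review used only for illustration.
source, target = format_example("Great food, slow service.", 1)
print(source)
print(target)
print(parse_prediction(" 2 "), parse_prediction("positive"))
```

Returning a sentinel such as -1 for unparseable generations keeps them visible as errors instead of silently assigning them to a real class.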

In [6]:
def create_t5_dataset(texts, labels, tokenizer, max_input_length=512, max_target_length=8):
    """Create dataset for T5 model with text-to-text format."""
    # Format inputs
    inputs = [f"classify: {text}" for text in texts]
    targets = [str(label) for label in labels]
    
    # Tokenize inputs without padding; DataCollatorForSeq2Seq pads each batch dynamically
    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        truncation=True,
        padding=False,
        return_tensors=None
    )
    
    # Tokenize targets (using text_target parameter for modern transformers).
    # Leaving them unpadded lets the collator pad labels with -100, which the loss ignores.
    labels_tokenized = tokenizer(
        text_target=targets,
        max_length=max_target_length,
        truncation=True,
        padding=False,
        return_tensors=None
    )
    
    model_inputs['labels'] = labels_tokenized['input_ids']
    
    return HFDataset.from_dict(model_inputs)
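Padded label positions are conventionally set to -100, the default `ignore_index` of PyTorch's cross-entropy loss, so that padding does not contribute to the training loss. `DataCollatorForSeq2Seq` uses this value when it pads labels itself. The sketch below shows the idea on plain lists; `mask_label_padding` is a hypothetical helper and the token ids are toy values, with a pad id of 0 assumed.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_label_padding(batch_label_ids, pad_token_id):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in seq]
        for seq in batch_label_ids
    ]

# Toy ids: 0 is assumed to be the pad id and 1 the end-of-sequence id.
padded = [[333, 1, 0, 0], [444, 555, 1, 0]]
masked = mask_label_padding(padded, pad_token_id=0)
print(masked)
```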

## 4. Model Training Function

In [7]:
def compute_metrics(eval_pred, tokenizer):
    """Compute accuracy for evaluation."""
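    predictions, labels = eval_pred
    # Some models return a tuple; the generated token ids come first
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    
    # Replace -100 (padding added when batches are gathered) before decoding
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Replace -100 in labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)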
    
    # Convert to numeric for accuracy
    pred_labels = []
    for pred in decoded_preds:
        pred = pred.strip()
        if pred in ['0', '1', '2']:
            pred_labels.append(int(pred))
        else:
            pred_labels.append(-1)  # Invalid prediction
    
    true_labels = [int(l.strip()) for l in decoded_labels]
    
    accuracy = accuracy_score(true_labels, pred_labels)
    
    return {'accuracy': accuracy}

In [8]:
def train_model(model_name, train_texts, train_labels, val_texts, val_labels, 
                output_dir, num_epochs=20, batch_size=8, learning_rate=5e-5):
    """Train a T5-style model for classification."""
    print(f"\n{'='*60}")
    print(f"Training model: {model_name}")
    print(f"{'='*60}")
    
    # Load tokenizer and model
    print("Loading tokenizer and model...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model = model.to(device)
    
    # Count parameters
    num_params = sum(p.numel() for p in model.parameters())
    num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters: {num_params:,}")
    print(f"Trainable parameters: {num_trainable:,}")
    
    # Create datasets
    print("Preparing datasets...")
    train_dataset = create_t5_dataset(train_texts, train_labels, tokenizer)
    val_dataset = create_t5_dataset(val_texts, val_labels, tokenizer)
    
    # Data collator
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
    
    # Training arguments
    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        eval_strategy='epoch',
        save_strategy='epoch',
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_epochs,
        weight_decay=0.01,
        save_total_limit=2,
        predict_with_generate=True,
        generation_max_length=8,
        load_best_model_at_end=True,
        metric_for_best_model='eval_loss',
        greater_is_better=False,
        logging_steps=50,
        warmup_ratio=0.1,
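        # NOTE: fp16 is a likely cause of the 0.0 / NaN losses logged for mt5-small below;
        # T5-family checkpoints are commonly reported to overflow in fp16 (bf16 or fp32 is safer).
        fp16=torch.cuda.is_available(),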
        report_to='none',
    )
    
    # Create compute_metrics function with tokenizer
    def compute_metrics_fn(eval_pred):
        return compute_metrics(eval_pred, tokenizer)
    
    # Trainer
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics_fn,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
    )
    
    # Train
    print("Starting training...")
    start_time = time.time()
    train_result = trainer.train()
    total_time = time.time() - start_time
    
    # Get training info
    training_info = {
        'model_name': model_name,
        'num_params': num_params,
        'total_time_seconds': total_time,
        'total_epochs': train_result.metrics.get('epoch', num_epochs),
        'train_loss': train_result.metrics.get('train_loss', None),
        'trainer': trainer,
        'tokenizer': tokenizer,
        'model': model
    }
    
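    # Calculate time per step. 'train_steps' is not a key in train_result.metrics, so the
    # fallback estimate is what gets used; train_result.global_step holds the exact count.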
    steps = train_result.metrics.get('train_steps', len(train_texts) // batch_size * int(training_info['total_epochs']))
    training_info['time_per_step'] = total_time / steps if steps > 0 else 0
    
    print(f"\nTraining completed!")
    print(f"Total training time: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")
    print(f"Epochs trained: {training_info['total_epochs']}")
    
    return training_info
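T5-family checkpoints are widely reported to produce overflowing or NaN losses under fp16 mixed precision, so one option is to prefer bf16 where the GPU supports it and fall back to full precision otherwise. The sketch below is an assumption about how that choice could be made, not the notebook's method. `choose_precision` is a hypothetical helper; in practice its two arguments could come from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`.

```python
def choose_precision(cuda_available, bf16_supported):
    """Return mixed-precision flags, preferring bf16 and otherwise using fp32.

    fp16 is deliberately never enabled in this sketch.
    """
    use_bf16 = bool(cuda_available and bf16_supported)
    return {"bf16": use_bf16, "fp16": False}

print(choose_precision(cuda_available=True, bf16_supported=True))
print(choose_precision(cuda_available=True, bf16_supported=False))
print(choose_precision(cuda_available=False, bf16_supported=False))
```

The returned dictionary uses the `bf16` and `fp16` argument names of `Seq2SeqTrainingArguments`, so it could be unpacked into that call in place of the single `fp16=...` argument.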

In [9]:
def evaluate_model(trainer, tokenizer, test_texts, test_labels):
    """Evaluate model on test set."""
    print("\nEvaluating on test set...")
    
    # Create test dataset
    test_dataset = create_t5_dataset(test_texts, test_labels, tokenizer)
    
    # Get predictions
    predictions = trainer.predict(test_dataset)
    
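    # Decode predictions (replace any -100 padding first, as compute_metrics does for labels)
    pred_ids = np.where(predictions.predictions != -100, predictions.predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    
    # Convert to labels
    pred_labels = []
    for pred in decoded_preds:
        pred = pred.strip()
        if pred in ['0', '1', '2']:
            pred_labels.append(int(pred))
        else:
            # NOTE: invalid generations are counted as class 0 ("Negative"), which
            # inflates that class's recall; compute_metrics uses -1 instead.
            pred_labels.append(0)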
    
    # Calculate metrics
    accuracy = accuracy_score(test_labels, pred_labels)
    
    print(f"Test Accuracy: {accuracy:.4f}")
    print(f"\nClassification Report:")
    print(classification_report(test_labels, pred_labels, 
                               target_names=['Negative (1)', 'Neutral (3)', 'Positive (5)']))
    
    return accuracy, pred_labels

## 5. Train Models

We will train two models:
1. **ruT5-base** - Russian T5 model from AI-Forever
2. **mt5-small** - Multilingual T5 model from Google

In [10]:
# Store results
results = {}

# Models to train
models_to_train = [
    ('ai-forever/ruT5-base', 'results/rut5_base'),
    ('google/mt5-small', 'results/mt5_small'),
]

In [11]:
# Train ruT5-base
model_name, output_dir = models_to_train[0]
training_info = train_model(
    model_name=model_name,
    train_texts=train_texts,
    train_labels=train_labels,
    val_texts=val_texts,
    val_labels=val_labels,
    output_dir=output_dir,
    num_epochs=15,
    batch_size=8,
    learning_rate=3e-5
)

# Evaluate on test
test_accuracy, test_preds = evaluate_model(
    training_info['trainer'],
    training_info['tokenizer'],
    test_texts,
    test_labels
)
training_info['test_accuracy'] = test_accuracy
results['ruT5-base'] = training_info


Training model: ai-forever/ruT5-base
Loading tokenizer and model...


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Total parameters: 222,903,552
Trainable parameters: 222,903,552
Preparing datasets...
Starting training...


Epoch,Training Loss,Validation Loss,Accuracy
1,0.5771,0.278089,0.848558
2,0.2205,0.280942,0.896635
3,0.1772,0.242947,0.899038
4,0.1783,0.170717,0.901442
5,0.1208,0.171722,0.913462
6,0.0893,0.228922,0.896635
7,0.0643,0.263997,0.901442
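The stopping point in this log follows from the patience of 3 on validation loss. The sketch below replays the logged validation losses through a simplified early-stopping rule. It is an approximation of the idea, not the internals of `EarlyStoppingCallback`, and `simulate_early_stopping` is a hypothetical helper.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (last_epoch_run, best_epoch), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Validation losses copied from the training log above.
val_losses = [0.278089, 0.280942, 0.242947, 0.170717, 0.171722, 0.228922, 0.263997]
stopped_at, best_epoch = simulate_early_stopping(val_losses, patience=3)
print(stopped_at, best_epoch)  # 7 4
```

Under this rule the best checkpoint is epoch 4, and with `load_best_model_at_end=True` that is the checkpoint expected to be used for test evaluation.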


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].



Training completed!
Total training time: 822.46 seconds (13.71 minutes)
Epochs trained: 7.0

Evaluating on test set...


Test Accuracy: 0.9113

Classification Report:
              precision    recall  f1-score   support

Negative (1)       0.77      0.84      0.80        70
 Neutral (3)       0.50      0.32      0.39        22
Positive (5)       0.96      0.97      0.96       325

    accuracy                           0.91       417
   macro avg       0.74      0.71      0.72       417
weighted avg       0.91      0.91      0.91       417
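The macro averages in this report are unweighted means over the three classes, which is why the weak neutral class pulls them well below the accuracy. The sketch below recomputes them from the rounded per-class values printed above. Because the inputs are already rounded, agreement with the report is only expected to two decimals. `per_class` and `macro_average` are illustrative names.

```python
# Per-class scores copied from the classification report above (already rounded).
per_class = {
    "Negative (1)": {"precision": 0.77, "recall": 0.84, "f1": 0.80},
    "Neutral (3)": {"precision": 0.50, "recall": 0.32, "f1": 0.39},
    "Positive (5)": {"precision": 0.96, "recall": 0.97, "f1": 0.96},
}

def macro_average(metric):
    """Unweighted mean of one metric across classes, rounded to 2 decimals."""
    values = [scores[metric] for scores in per_class.values()]
    return round(sum(values) / len(values), 2)

macro = {metric: macro_average(metric) for metric in ("precision", "recall", "f1")}
print(macro)
```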



In [12]:
# Train mt5-small
model_name, output_dir = models_to_train[1]
training_info = train_model(
    model_name=model_name,
    train_texts=train_texts,
    train_labels=train_labels,
    val_texts=val_texts,
    val_labels=val_labels,
    output_dir=output_dir,
    num_epochs=15,
    batch_size=8,
    learning_rate=3e-5
)

# Evaluate on test
test_accuracy, test_preds = evaluate_model(
    training_info['trainer'],
    training_info['tokenizer'],
    test_texts,
    test_labels
)
training_info['test_accuracy'] = test_accuracy
results['mt5-small'] = training_info


Training model: google/mt5-small
Loading tokenizer and model...
Total parameters: 300,176,768
Trainable parameters: 300,176,768
Preparing datasets...
Starting training...


Epoch,Training Loss,Validation Loss,Accuracy
1,0.0,,0.0
2,0.0,,0.0
3,0.0,,0.0
4,0.0,,0.0



Training completed!
Total training time: 253.52 seconds (4.23 minutes)
Epochs trained: 4.0

Evaluating on test set...


Test Accuracy: 0.1679

Classification Report:
              precision    recall  f1-score   support

Negative (1)       0.17      1.00      0.29        70
 Neutral (3)       0.00      0.00      0.00        22
Positive (5)       0.00      0.00      0.00       325

    accuracy                           0.17       417
   macro avg       0.06      0.33      0.10       417
weighted avg       0.03      0.17      0.05       417



## 6. Results Summary

In [13]:
# Create results table
results_data = []
for model_name, info in results.items():
    results_data.append({
        'Model': model_name,
        'Parameters': f"{info['num_params']:,}",
        'Epochs': info['total_epochs'],
        'Time per Step (s)': f"{info['time_per_step']:.3f}",
        'Total Time (min)': f"{info['total_time_seconds']/60:.2f}",
        'Test Accuracy': f"{info['test_accuracy']:.4f}"
    })

results_df = pd.DataFrame(results_data)
print("\n" + "="*80)
print("RESULTS SUMMARY")
print("="*80)
print(results_df.to_string(index=False))
print("="*80)


RESULTS SUMMARY
    Model  Parameters  Epochs Time per Step (s) Total Time (min) Test Accuracy
ruT5-base 222,903,552     7.0             0.486            13.71        0.9113
mt5-small 300,176,768     4.0             0.262             4.23        0.1679


In [14]:
# Save results to CSV
results_df.to_csv('training_results.csv', index=False)
print("Results saved to training_results.csv")

Results saved to training_results.csv


## 7. Analysis and Conclusions

### Dataset Overview
- **Total samples**: 2,776 reviews with ratings 1, 3, and 5
- **Class distribution**: Highly imbalanced
  - Class 0 (rating 1, negative): 462 samples (16.6%)
  - Class 1 (rating 3, neutral): 150 samples (5.4%)
  - Class 2 (rating 5, positive): 2,164 samples (78.0%)
- **Train/Val/Test split**: 70%/15%/15% with stratification

### Model Comparison

| Aspect | ruT5-base | mt5-small |
|--------|-----------|------------|
| **Architecture** | T5 base (Russian) | mT5 small (Multilingual) |
| **Parameters** | ~220M | ~300M |
| **Pre-training** | Russian text | 101 languages |
| **Time per step (measured)** | 0.486 s | 0.262 s |
| **Test accuracy** | 0.9113 | 0.1679 (training failed) |

### Key Findings

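1. **Performance**: Only ruT5-base produced a usable classifier. It reaches 0.9113 test accuracy (macro F1 0.72), compared with about 0.78 for always predicting the positive class. mt5-small did not train: its training loss was logged as 0.0 and its validation loss as NaN at every epoch. Its 0.1679 test accuracy equals the share of class 0 in the test set (70/417), because `evaluate_model` maps every invalid generation to class 0.

2. **Language-specific vs Multilingual**: 
   - **ruT5-base** is pre-trained on Russian text, which makes it a natural fit for Russian reviews
   - **mt5-small** is multilingual; much of its ~300M parameter count sits in the embedding and output layers for its large multilingual vocabulary, so it likely has less capacity dedicated to Russian
   - Because mt5-small failed to train, this run does not provide a fair comparison between the two

3. **Training Efficiency**:
   - ruT5-base reached its best validation loss (0.1707) at epoch 4; after that, validation loss rose while training loss kept falling, and early stopping ended training after epoch 7 (13.71 minutes)
   - mt5-small was faster per step (0.262 s vs 0.486 s) but stopped after 4 epochs with no valid loss, so its convergence cannot be assessed
   - A likely cause of the mt5-small failure is fp16 mixed precision, which is enabled whenever CUDA is available; rerunning with bf16 or fp32 would confirm this

4. **Class Imbalance Impact**:
   - The heavy imbalance (78% positive class) makes accuracy alone misleading
   - ruT5-base is strong on the positive class (F1 0.96) but weak on the neutral class (recall 0.32, F1 0.39), which has only 150 samples overall and 22 in the test set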
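Given this imbalance, accuracy is easiest to interpret against constant-prediction baselines. The sketch below computes them from the test label distribution printed earlier. `test_counts` is copied by hand and `constant_prediction_accuracy` is a hypothetical helper.

```python
# Test label counts copied from the split output earlier in the notebook.
test_counts = {0: 70, 1: 22, 2: 325}
n_test = sum(test_counts.values())

def constant_prediction_accuracy(predicted_class):
    """Accuracy obtained by predicting the same class for every review."""
    return round(test_counts[predicted_class] / n_test, 4)

majority_class = max(test_counts, key=test_counts.get)
print(majority_class, constant_prediction_accuracy(majority_class))  # 2 0.7794
print(constant_prediction_accuracy(0))  # 0.1679
```

The second value matches the test accuracy reported for mt5-small, which is consistent with every one of its predictions having been counted as class 0.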

### Recommendations

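1. For this task, **ruT5-base** is the only model validated by this run:
   - 0.9113 test accuracy on Russian reviews
   - Pre-training on Russian text is a plausible advantage, although this run cannot confirm it against mt5-small

2. **mt5-small** should be retrained before drawing conclusions (for example with fp16 disabled). If it trains successfully, it may still be attractive for multilingual applications because of:
   - Cross-lingual transfer capabilities
   - Broader language support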

3. To improve results on imbalanced data, consider:
   - Class weighting or oversampling
   - Focal loss for handling class imbalance
   - Data augmentation for minority classes
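Two of these options can be sketched in plain Python: inverse-frequency class weights (the same formula scikit-learn uses for `class_weight='balanced'`) and random oversampling of minority classes. Both helpers are hypothetical and the data is a toy stand-in with the same class counts as the training split. Oversampling should be applied to the training split only, and using class weights with a seq2seq model would additionally need a custom loss, which is not shown here.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in sorted(counts.items())}

def oversample(texts, labels, seed=42):
    """Resample minority classes with replacement up to the majority class size."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target_size = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in sorted(by_class.items()):
        extra = [rng.choice(items) for _ in range(target_size - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target_size)
    return out_texts, out_labels

# Toy data with the same class counts as the training split (323 / 105 / 1515).
toy_labels = [0] * 323 + [1] * 105 + [2] * 1515
toy_texts = [f"review {i}" for i in range(len(toy_labels))]

weights = inverse_frequency_weights(toy_labels)
print({c: round(w, 3) for c, w in weights.items()})

balanced_texts, balanced_labels = oversample(toy_texts, toy_labels)
print(Counter(balanced_labels))
```

With these counts the neutral class receives the largest weight, and after oversampling each class has as many examples as the positive class.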

In [15]:
# Display final results table
print("\nFinal Results Table:")
print(results_df.to_markdown(index=False))


Final Results Table:
| Model     | Parameters   |   Epochs |   Time per Step (s) |   Total Time (min) |   Test Accuracy |
|:----------|:-------------|---------:|--------------------:|-------------------:|----------------:|
| ruT5-base | 222,903,552  |        7 |               0.486 |              13.71 |          0.9113 |
| mt5-small | 300,176,768  |        4 |               0.262 |               4.23 |          0.1679 |


In [17]:
print(f"Models trained: {list(results.keys())}")
print(f"Results saved to: training_results.csv")

Models trained: ['ruT5-base', 'mt5-small']
Results saved to: training_results.csv
