<a href="https://colab.research.google.com/github/SandroMuradashvili/The_Pocket_Professor/blob/main/data_and_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Pocket Professor: Georgian Winemaking

## Data Construction and Model Training

This notebook contains:
1. Dataset construction and validation
2. Data preprocessing
3. Model fine-tuning (Flan-T5-base)
4. Training metrics and evaluation

**Topic**: Georgian winemaking traditions, qvevri methods, grape varieties, and wine culture

**Dataset Size**: 331 instruction-output pairs

**All facts verified against authoritative sources on Georgian winemaking**

## 1. Setup and Dependencies

In [56]:
# Install required packages
!pip install transformers datasets torch accelerate evaluate rouge_score --quiet

import json
import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
import evaluate
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

PyTorch version: 2.9.0+cu128
CUDA available: True


## 2. Load and Validate Dataset

In [57]:
# Load the dataset
with open('dataset.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

print(f"Total examples: {len(data)}")
print(f"\nFirst example:")
print(f"Instruction: {data[0]['instruction']}")
print(f"Output: {data[0]['output'][:200]}...")

Total examples: 331

First example:
Instruction: How old is Georgian winemaking tradition?
Output: Georgian winemaking tradition spans approximately 8,000 years, making Georgia one of the oldest wine-producing regions in the world. Archaeological evidence from sites in the southern Georgian region ...


In [58]:
# Validate data quality
print("Data Quality Checks:")
print("="*50)

# Check for missing fields
missing_instruction = sum(1 for item in data if not item.get('instruction'))
missing_output = sum(1 for item in data if not item.get('output'))
print(f"Missing instructions: {missing_instruction}")
print(f"Missing outputs: {missing_output}")

# Check lengths (measured in characters, not tokens)
instruction_lengths = [len(item['instruction']) for item in data]
output_lengths = [len(item['output']) for item in data]

print(f"\nInstruction length - Mean: {np.mean(instruction_lengths):.0f}, Max: {max(instruction_lengths)}")
print(f"Output length - Mean: {np.mean(output_lengths):.0f}, Max: {max(output_lengths)}")

# Check for duplicates
unique_instructions = len(set(item['instruction'] for item in data))
print(f"\nUnique instructions: {unique_instructions}/{len(data)}")
if unique_instructions < len(data):
    print(f"⚠️  Found {len(data) - unique_instructions} duplicate instructions")

Data Quality Checks:
Missing instructions: 0
Missing outputs: 0

Instruction length - Mean: 56, Max: 101
Output length - Mean: 1051, Max: 1891

Unique instructions: 331/331
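The duplicate check above compares instructions as exact strings, so two questions that differ only in case or spacing count as distinct. A minimal sketch of a stricter check follows, assuming lowercase and whitespace normalization is an acceptable definition of "duplicate"; `validate_pairs` and the sample records are hypothetical and not part of the dataset.

```python
from collections import Counter

def validate_pairs(records):
    """Return a small report on missing fields and repeated instructions."""
    missing_instruction = 0
    missing_output = 0
    normalized = []
    for item in records:
        instruction = (item.get("instruction") or "").strip()
        output = (item.get("output") or "").strip()
        if not instruction:
            missing_instruction += 1
        if not output:
            missing_output += 1
        if instruction:
            # Normalize case and whitespace so near-identical questions collide.
            normalized.append(" ".join(instruction.lower().split()))
    counts = Counter(normalized)
    duplicates = sorted(text for text, n in counts.items() if n > 1)
    return {
        "total": len(records),
        "missing_instruction": missing_instruction,
        "missing_output": missing_output,
        "duplicate_instructions": duplicates,
    }

# Hypothetical records for illustration only.
sample = [
    {"instruction": "What is a qvevri?", "output": "..."},
    {"instruction": "what is a  qvevri?", "output": "..."},
    {"instruction": "What is Saperavi?", "output": ""},
]
report = validate_pairs(sample)
print(report)
```

Calling `validate_pairs(data)` on the loaded list would give the same counts as the cell above whenever no near-duplicates exist.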


## 3. Data Preprocessing and Train/Val Split

In [59]:
# Convert to HuggingFace Dataset
dataset = Dataset.from_list(data)

# Split into train (90%) and validation (10%)
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset_split['train']
val_dataset = dataset_split['test']

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")

Training examples: 297
Validation examples: 34
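The 297/34 split can be reproduced by hand. The sketch below assumes the held-out share is rounded up to a whole example, which is consistent with the counts printed above; `split_sizes` is a hypothetical helper, not a `datasets` function.

```python
import math

def split_sizes(n_examples, test_fraction):
    """Sizes for a split that rounds the held-out share up (assumed behavior)."""
    n_test = math.ceil(n_examples * test_fraction)
    return n_examples - n_test, n_test

n_train, n_val = split_sizes(331, 0.1)
print(n_train, n_val)  # 297 34
```

With only 34 validation examples, each example contributes about 3% of any averaged validation metric.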


## 4. Load Model and Tokenizer

In [60]:
# Load Flan-T5-base model and tokenizer
model_name = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

print(f"Model loaded: {model_name}")
print(f"Model parameters: {model.num_parameters():,}")

Model loaded: google/flan-t5-base
Model parameters: 247,577,856


## 5. Tokenize Data

In [61]:
# Preprocessing function
max_input_length = 512
max_target_length = 512

def preprocess_function(examples):
    # Format as instruction following
    inputs = [f"Answer this question about Georgian wine: {inst}" for inst in examples['instruction']]

    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        truncation=True,
        padding='max_length'
    )

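    # Tokenize targets. Note: padded label positions keep the pad token id here;
    # they are not masked to -100, so they are included in the training loss.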
    labels = tokenizer(
        examples['output'],
        max_length=max_target_length,
        truncation=True,
        padding='max_length'
    )

    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# Apply preprocessing
tokenized_train = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names
)

tokenized_val = val_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=val_dataset.column_names
)

print("Tokenization complete")

Tokenization complete
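Because the targets are padded to `max_length` and assigned to `labels` unchanged, padding positions contribute to the loss. A common remedy is to replace the pad id in the labels with -100, the value PyTorch's cross-entropy loss ignores by default. The sketch below uses toy token ids; `mask_padding` is a hypothetical helper, and the pad id of 0 is an assumption that should be read from `tokenizer.pad_token_id` in practice.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Toy ids for illustration; 0 stands in for the pad id (assumption).
padded = [[71, 1525, 1, 0, 0], [363, 1, 0, 0, 0]]
masked = mask_padding(padded, pad_token_id=0)
print(masked)
```

Inside `preprocess_function`, this would amount to assigning `mask_padding(labels['input_ids'], tokenizer.pad_token_id)` to `model_inputs['labels']`.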


## 6. Setup Training Arguments

In [62]:
# Verify model is trainable
print(f"Model parameters: {model.num_parameters():,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

# Ensure all parameters are trainable
for param in model.parameters():
    param.requires_grad = True

print("✅ All parameters set to trainable")

Model parameters: 247,577,856
Trainable parameters: 247,577,856
✅ All parameters set to trainable
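For context on the memory settings in the next cell, the parameter count can be turned into a rough size estimate. This is a back-of-the-envelope sketch that assumes 4 bytes per value (fp32) and ignores activations, optimizer state, and CUDA overhead; `tensor_gib` is a hypothetical helper.

```python
def tensor_gib(n_values, bytes_per_value):
    """Size in GiB of n_values stored at bytes_per_value each."""
    return n_values * bytes_per_value / 1024**3

n_params = 247_577_856  # parameter count printed above
weights_fp32 = tensor_gib(n_params, 4)
grads_fp32 = tensor_gib(n_params, 4)  # one gradient value per trainable parameter
print(f"weights: {weights_fp32:.2f} GiB, gradients: {grads_fp32:.2f} GiB")
```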


In [63]:
# Low-memory training config
training_args = Seq2SeqTrainingArguments(
    output_dir="./georgian-wine-flan-t5",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=1,  # smallest possible per-step batch
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size: 1 x 8 = 8
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,  # keep only the most recent checkpoint on disk
    predict_with_generate=True,
    fp16=True,  # caution: T5-family checkpoints are prone to overflow in fp16
    push_to_hub=False,
    logging_steps=20,
    save_strategy="epoch",
    load_best_model_at_end=False,  # skip reloading the best checkpoint after training
    warmup_steps=50,  # over 40% of this run's roughly 110 optimizer steps
    logging_first_step=True,
    gradient_checkpointing=True,  # trade extra compute for lower activation memory
    optim="adafactor",  # more memory-efficient optimizer than AdamW
)

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model
)

print("Ultra low-memory training configuration")

Ultra low-memory training configuration
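Two aspects of this configuration are worth checking before training: how many optimizer steps the run actually takes relative to `warmup_steps`, and whether half precision is safe for this model. The sketch below is a rough calculation for a single-device run (the exact rounding inside `Trainer` may differ by a step or two) together with a hypothetical `precision_flags` helper that prefers bf16 and otherwise falls back to fp32, on the assumption that fp16 is best avoided for T5-family models.

```python
import math

def schedule_summary(n_train, per_device_batch, grad_accum, epochs, warmup_steps):
    """Approximate optimizer-step counts for a single-device run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_train / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_share": round(warmup_steps / total_steps, 2),
    }

def precision_flags(cuda_available, bf16_supported):
    """Prefer bf16 where the GPU supports it; otherwise stay in fp32."""
    use_bf16 = bool(cuda_available and bf16_supported)
    return {"bf16": use_bf16, "fp16": False}

summary = schedule_summary(297, 1, 8, 3, 50)
print(summary)
print(precision_flags(cuda_available=True, bf16_supported=False))
```

The resulting dictionary could be unpacked into `Seq2SeqTrainingArguments` in place of `fp16=True`, with the two booleans taken from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`.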


## 7. Setup Evaluation Metrics

In [64]:
# Load ROUGE metric
rouge = evaluate.load('rouge')

def compute_metrics(eval_preds):
    preds, labels = eval_preds

    # If preds is a tuple (often the case with generate), take the first element
    if isinstance(preds, tuple):
        preds = preds[0]

    # Replace -100 in predictions and labels with pad_token_id
    # This prevents the OverflowError in the tokenizer
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decode predictions and labels
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute ROUGE scores
    result = rouge.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True
    )

    return {k: round(v * 100, 2) for k, v in result.items()}

print("Evaluation metrics updated with OverflowError protection")

Evaluation metrics updated with OverflowError protection
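To make the ROUGE-1 numbers easier to interpret, here is a simplified unigram-overlap F1 in plain Python. It is a sketch of the idea only, not the `rouge_score` implementation, which applies its own tokenization and stemming; `unigram_f1` is a hypothetical name, and the reference string is a fragment of the first dataset example.

```python
from collections import Counter

def unigram_f1(prediction, reference):
    """Unigram-overlap F1 on lowercased whitespace tokens (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "Georgian winemaking tradition spans approximately 8,000 years"
score_short = unigram_f1("8,000 years", reference)
score_wrong = unigram_f1("20th-century", reference)
print(round(score_short, 3), round(score_wrong, 3))
```

Even a correct but very short answer scores low on recall against a long reference, so short generations tend to produce low ROUGE values regardless of accuracy.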


## 8. Train the Model

In [65]:
# Enable gradient checkpointing to save memory
# (also requested via gradient_checkpointing=True in the training arguments)
model.gradient_checkpointing_enable()
print("✅ Gradient checkpointing enabled")

✅ Gradient checkpointing enabled


In [66]:
# Initialize trainer
# Note: recent transformers releases take the tokenizer via 'processing_class';
# older releases expect 'tokenizer=' instead.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

print("Trainer initialized with updated metrics function")
print("Starting training...")
print("="*50)

Trainer initialized with updated metrics function
Starting training...


In [67]:
# Clear GPU memory before training
import gc

if torch.cuda.is_available():
    gc.collect()              # drop unreachable Python objects first
    torch.cuda.empty_cache()  # then release cached blocks held by PyTorch
    print("✅ GPU memory cleared")
    print(f"GPU memory allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved(0) / 1024**3:.2f} GB")

✅ GPU memory cleared
GPU memory allocated: 3.94 GB
GPU memory reserved: 9.62 GB


In [68]:
# Train the model
train_result = trainer.train()

print("\nTraining complete!")
print("="*50)
print(f"Training loss: {train_result.training_loss:.4f}")
print(f"Training runtime: {train_result.metrics['train_runtime']:.2f} seconds")

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|---|---|---|---|---|---|---|
| 1 | 0.0 | | 3.63 | 0.47 | 2.95 | 2.97 |
| 2 | 2211689.6 | | 3.63 | 0.47 | 2.95 | 2.97 |
| 3 | 35063584.0 | | 3.63 | 0.47 | 2.95 | 2.97 |


Training complete!
Training loss: 6539562.2281
Training runtime: 406.51 seconds
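The logged losses above (0.0, then values in the millions) are not plausible for a cross-entropy objective, which suggests the run diverged. A small check like the one below can flag this early. It is a sketch under assumptions: `loss_warnings` is a hypothetical helper, and the `max_plausible` threshold of 20.0 is an arbitrary choice set above ln(vocabulary size), roughly 10.4 for a vocabulary of about 32,000 tokens, which is the loss of a uniform guess.

```python
import math

def loss_warnings(losses, max_plausible=20.0):
    """Return (position, reason) pairs for loss values that look wrong."""
    warnings = []
    for position, loss in enumerate(losses, start=1):
        if not math.isfinite(loss):
            warnings.append((position, "non-finite"))
        elif loss == 0.0:
            warnings.append((position, "exactly zero"))
        elif loss > max_plausible:
            warnings.append((position, "implausibly large"))
    return warnings

logged = [0.0, 2211689.6, 35063584.0]  # per-epoch training losses from the table above
flags = loss_warnings(logged)
print(flags)
```

A check of this kind could also be run from a `TrainerCallback.on_log` hook so that training stops as soon as the loss becomes implausible.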


## 9. Save the Model

In [69]:
# Save the fine-tuned model
model.save_pretrained("./georgian-wine-model")
tokenizer.save_pretrained("./georgian-wine-model")

print("Model saved to: ./georgian-wine-model")

Model saved to: ./georgian-wine-model


## 10. Evaluation on Validation Set

In [70]:
# Evaluate on validation set
eval_results = trainer.evaluate()

print("\nValidation Results:")
print("="*50)
for key, value in eval_results.items():
    print(f"{key}: {value:.4f}")


Validation Results:
eval_loss: nan
eval_rouge1: 3.6300
eval_rouge2: 0.4700
eval_rougeL: 2.9500
eval_rougeLsum: 2.9700
eval_runtime: 11.6732
eval_samples_per_second: 2.9130
eval_steps_per_second: 2.9130
epoch: 3.0000


## 11. Test Predictions

In [71]:
# Spot-check the first three raw dataset examples
# (written as a pre-training check; in this notebook it runs after training)
print("Verifying dataset before training:")
print("="*80)
for i in range(3):
    print(f"\nExample {i+1}:")
    print(f"Instruction: {data[i]['instruction']}")
    print(f"Output length: {len(data[i]['output'])} characters")
    print(f"Output preview: {data[i]['output'][:200]}...")
    print("-"*80)

Verifying dataset before training:

Example 1:
Instruction: How old is Georgian winemaking tradition?
Output length: 575 characters
Output preview: Georgian winemaking tradition spans approximately 8,000 years, making Georgia one of the oldest wine-producing regions in the world. Archaeological evidence from sites in the southern Georgian region ...
--------------------------------------------------------------------------------

Example 2:
Instruction: What does UNESCO say about Georgian winemaking?
Output length: 655 characters
Output preview: In 2013, UNESCO added the Ancient Georgian Traditional Qvevri Wine-Making Method to its List of Intangible Cultural Heritage of Humanity. The inscription recognizes not only the technical process of f...
--------------------------------------------------------------------------------

Example 3:
Instruction: What happened to Georgian winemaking during the Soviet era?
Output length: 710 characters
Output preview: The Soviet era (1921–1991) cause

In [73]:
# CRITICAL: Reload the saved model to test it properly
print("Loading the SAVED fine-tuned model (not the in-memory one)...")
print("="*80)

trained_model = AutoModelForSeq2SeqLM.from_pretrained("./georgian-wine-model")
trained_tokenizer = AutoTokenizer.from_pretrained("./georgian-wine-model")

if torch.cuda.is_available():
    trained_model.cuda()

print("Testing FINE-TUNED model:")
print("="*80)

test_questions = [
    "What is a qvevri?",
    "How old is Georgian winemaking?",
    "What is Saperavi?",
    "What is the difference between Kakhetian and Imeretian methods?",
    "What does UNESCO say about Georgian wine?"
]

for question in test_questions:
    input_text = f"Answer this question about Georgian wine: {question}"
    inputs = trained_tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    outputs = trained_model.generate(
        **inputs,
        max_length=512,
        num_beams=4,
        early_stopping=True
    )

    answer = trained_tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(f"\nQ: {question}")
    print(f"A: {answer}")
    print("-"*80)

Loading the SAVED fine-tuned model (not the in-memory one)...


Testing FINE-TUNED model:

Q: What is a qvevri?
A: tequila
--------------------------------------------------------------------------------

Q: How old is Georgian winemaking?
A: 20th-century
--------------------------------------------------------------------------------

Q: What is Saperavi?
A: wine region
--------------------------------------------------------------------------------

Q: What is the difference between Kakhetian and Imeretian methods?
A: Kakhetian and Imeretian methods
--------------------------------------------------------------------------------

Q: What does UNESCO say about Georgian wine?
A: it is one of the world' s best wines
--------------------------------------------------------------------------------
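The answers above can be screened automatically with a keyword smoke test. The sketch below is deliberately crude: `keyword_check` is a hypothetical helper, and the expected keywords are limited to facts visible in the dataset previews in this notebook (8,000 years; the 2013 UNESCO inscription).

```python
def keyword_check(answers, expected_keywords):
    """Map each question to True if its answer contains every expected keyword."""
    results = {}
    for question, keywords in expected_keywords.items():
        answer = answers.get(question, "").lower()
        results[question] = all(k.lower() in answer for k in keywords)
    return results

# Keywords taken from the dataset previews shown in this notebook.
expected = {
    "How old is Georgian winemaking?": ["8,000"],
    "What does UNESCO say about Georgian wine?": ["2013"],
}
# Answers copied from the generation output above.
answers = {
    "How old is Georgian winemaking?": "20th-century",
    "What does UNESCO say about Georgian wine?": "it is one of the world' s best wines",
}
results = keyword_check(answers, expected)
print(results)
```

Both checks fail for the recorded answers, which matches the low ROUGE scores from the validation run.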


## Summary

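This notebook:
1. ✅ Loaded and validated 331 Georgian wine instruction-output pairs (297 training, 34 validation)
2. ⚠️ Ran a 3-epoch fine-tune of google/flan-t5-base, but the recorded run diverged: the logged training loss went from 0.0 to 35,063,584.0 and the validation loss was NaN
3. ⚠️ Evaluated performance using ROUGE metrics: scores stayed flat across all three epochs (ROUGE-1 3.63, ROUGE-2 0.47, ROUGE-L 2.95)
4. ⚠️ Saved the resulting model, whose sample answers in Section 11 are short and incorrect (for example, "tequila" for "What is a qvevri?")

The saved checkpoint from this run should not be treated as a working model. Two likely contributors, to be confirmed by rerunning, are `fp16=True` (T5-family models are prone to overflow in half precision) and label padding that is not masked with -100.

**Next**: Rerun training with these issues addressed, then use `inference.ipynb` to load the model and test it interactively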