<a href="https://colab.research.google.com/github/SandroMuradashvili/The_Pocket_Professor/blob/main/data_and_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Pocket Professor: Georgian Winemaking
## Data Construction and Model Training

**Topic**: Georgian winemaking traditions, qvevri methods, grape varieties, and wine culture

**Dataset Size**: 331 instruction-output pairs

All facts verified against authoritative sources on Georgian winemaking.

## 1. Domain Justification

Georgian winemaking represents one of the world's oldest continuous winemaking traditions, dating back 8,000 years. The use of qvevri (clay vessels) and unique methods like extended skin-contact fermentation for white wines creates a specialized knowledge domain that general-purpose language models lack. This topic is ideal for fine-tuning because it combines technical winemaking knowledge, historical context, regional variations, and cultural practices that require deep expertise to explain accurately. A specialized model can serve as an educational resource for sommeliers, wine enthusiasts, and researchers studying traditional fermentation methods.

## 2. Setup and Dependencies

In [1]:
# Install required packages
!pip install transformers datasets torch accelerate evaluate rouge_score --quiet

import json
import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
import evaluate
import torch
import gc

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
PyTorch version: 2.10.0+cu128
CUDA available: True


## 3. Load and Validate Dataset

In [2]:
# Load the dataset
with open('dataset.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

print(f"Total examples: {len(data)}")
print(f"\nFirst example:")
print(f"Instruction: {data[0]['instruction']}")
print(f"Output: {data[0]['output'][:200]}...")

Total examples: 331

First example:
Instruction: How old is Georgian winemaking tradition?
Output: Georgian winemaking tradition spans approximately 8,000 years, making Georgia one of the oldest wine-producing regions in the world. Archaeological evidence from sites in the southern Georgian region ...


In [3]:
# Data quality checks
print("Data Quality Checks:")
print("="*80)

# Check for missing fields
missing_instruction = sum(1 for item in data if not item.get('instruction'))
missing_output = sum(1 for item in data if not item.get('output'))
print(f"Missing instructions: {missing_instruction}")
print(f"Missing outputs: {missing_output}")

# Length statistics
instruction_lengths = [len(item['instruction']) for item in data]
output_lengths = [len(item['output']) for item in data]

print(f"\nInstruction length - Mean: {np.mean(instruction_lengths):.0f}, Max: {max(instruction_lengths)}")
print(f"Output length - Mean: {np.mean(output_lengths):.0f}, Max: {max(output_lengths)}")

# Check for duplicates
unique_instructions = len(set(item['instruction'] for item in data))
print(f"\nUnique instructions: {unique_instructions}/{len(data)}")
if unique_instructions < len(data):
    print(f"⚠️  Found {len(data) - unique_instructions} duplicate instructions")
else:
    print("✅ No duplicate instructions")

Data Quality Checks:
Missing instructions: 0
Missing outputs: 0

Instruction length - Mean: 56, Max: 101
Output length - Mean: 1051, Max: 1891

Unique instructions: 331/331
✅ No duplicate instructions


## 4. Sources

This dataset was constructed using the following authoritative sources:

1. **National Wine Agency of Georgia** (georgianwine.gov.ge)
2. **Alice Feiring** - Expert on Georgian natural wine and qvevri traditions
3. **Wine Folly** - Georgian wine guides and grape variety information
4. **Decanter Magazine** - Articles on Georgian wine regions and producers
5. **The Oxford Companion to Wine** - Entries on Georgia, qvevri, and indigenous varieties
6. **Academic Papers** - Research on qvevri fermentation and UNESCO documentation

## 5. Train/Val Split

In [4]:
# Convert to HuggingFace Dataset
dataset = Dataset.from_list(data)

# Split into train (90%) and validation (10%)
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset_split['train']
val_dataset = dataset_split['test']

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")

Training examples: 297
Validation examples: 34


## 6. Clear Memory and Load Model Fresh

In [5]:
# Clear any existing models from memory
if 'model' in locals():
    del model
if 'trainer' in locals():
    del trainer

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print(f"✅ GPU memory cleared")
    print(f"GPU memory allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")

✅ GPU memory cleared
GPU memory allocated: 0.00 GB


In [6]:
# Load Flan-T5-base model and tokenizer FRESH
model_name = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

print(f"✅ Model loaded: {model_name}")
print(f"Model parameters: {model.num_parameters():,}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json: 0.00B [00:00, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/282 [00:00<?, ?it/s]



generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

✅ Model loaded: google/flan-t5-base
Model parameters: 247,577,856


## 7. Tokenize Dataset (FIXED)

In [8]:
# Preprocessing function - FIXED for newer transformers
max_input_length = 128
max_target_length = 256

def preprocess_function(examples):
    # Format inputs
    inputs = [f"Answer this question about Georgian wine: {inst}" for inst in examples['instruction']]

    # Tokenize inputs
    model_inputs = tokenizer(
        inputs,
        max_length=max_input_length,
        truncation=True,
        padding=False
    )

    # Tokenize targets (no as_target_tokenizer needed for T5)
    labels = tokenizer(
        text_target=examples['output'],  # <--- FIXED: use text_target parameter
        max_length=max_target_length,
        truncation=True,
        padding=False
    )

    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# Apply preprocessing
tokenized_train = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names
)

tokenized_val = val_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=val_dataset.column_names
)

print("✅ Tokenization complete")

Map:   0%|          | 0/297 [00:00<?, ? examples/s]

Map:   0%|          | 0/34 [00:00<?, ? examples/s]

✅ Tokenization complete


## 8. Training Configuration (FIXED)

In [9]:
# FIXED training configuration
training_args = Seq2SeqTrainingArguments(
    output_dir="./georgian-wine-flan-t5",
    eval_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),
    push_to_hub=False,
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=False,
    warmup_steps=30,
    max_grad_norm=1.0,  # Prevent gradient explosion
)

# Data collator with proper label padding
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    label_pad_token_id=-100  # CRITICAL: ignore padding in loss
)

print("✅ Training configuration:")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Epochs: {training_args.num_train_epochs}")

✅ Training configuration:
  Learning rate: 3e-05
  Batch size: 1
  Gradient accumulation: 8
  Effective batch size: 8
  Epochs: 3


## 9. Initialize Trainer

In [11]:
# Initialize trainer (FIXED - no tokenizer parameter)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
)

print("✅ Trainer initialized")
print("Starting training...")
print("="*80)

✅ Trainer initialized
Starting training...


## 10. Train the Model

In [12]:
# Train the model
train_result = trainer.train()

print("\n✅ Training complete!")
print("="*80)
print(f"Final training loss: {train_result.training_loss:.4f}")
print(f"Training runtime: {train_result.metrics['train_runtime']:.2f} seconds")

# Sanity check
if train_result.training_loss > 10:
    print("⚠️ WARNING: Loss seems high. Check your data.")
elif train_result.training_loss < 0.5:
    print("⚠️ WARNING: Loss seems too low. Model might be overfitting.")
else:
    print("✅ Loss looks normal!")

Epoch,Training Loss,Validation Loss
1,151.726367,
2,2062.282617,
3,0.0,


Writing model shards:   0%|          | 0/1 [00:00<?, ?it/s]

Writing model shards:   0%|          | 0/1 [00:00<?, ?it/s]

Writing model shards:   0%|          | 0/1 [00:00<?, ?it/s]


✅ Training complete!
Final training loss: 32357.8157
Training runtime: 265.32 seconds


## 11. Save Model

In [13]:
# Save the fine-tuned model
model.save_pretrained("./georgian-wine-model")
tokenizer.save_pretrained("./georgian-wine-model")

# Also save the dataset
with open('./georgian-wine-model/dataset.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

print("✅ Model saved to: ./georgian-wine-model")
print("✅ Tokenizer saved to: ./georgian-wine-model")
print("✅ Dataset saved to: ./georgian-wine-model/dataset.json")

Writing model shards:   0%|          | 0/1 [00:00<?, ?it/s]

✅ Model saved to: ./georgian-wine-model
✅ Tokenizer saved to: ./georgian-wine-model
✅ Dataset saved to: ./georgian-wine-model/dataset.json


## 12. Evaluate on Validation Set

In [14]:
# Evaluate
eval_results = trainer.evaluate()

print("\nValidation Results:")
print("="*80)
for key, value in eval_results.items():
    if isinstance(value, float):
        print(f"{key}: {value:.4f}")
    else:
        print(f"{key}: {value}")


Validation Results:
eval_loss: nan
eval_runtime: 1.5119
eval_samples_per_second: 22.4890
eval_steps_per_second: 22.4890
epoch: 3.0000


## 13. Test Fine-Tuned Model

In [15]:
# IMPORTANT: Load the SAVED model (not the in-memory one)
print("Loading saved fine-tuned model for testing...")
print("="*80)

test_model = AutoModelForSeq2SeqLM.from_pretrained("./georgian-wine-model")
test_tokenizer = AutoTokenizer.from_pretrained("./georgian-wine-model")

if torch.cuda.is_available():
    test_model.cuda()

print("✅ Fine-tuned model loaded")
print("\nTesting on sample questions:")
print("="*80)

test_questions = [
    "What is a qvevri?",
    "How old is Georgian winemaking?",
    "What is Saperavi?",
    "What is the difference between Kakhetian and Imeretian methods?",
    "Why is Georgian white wine sometimes orange?"
]

for question in test_questions:
    input_text = f"Answer this question about Georgian wine: {question}"
    inputs = test_tokenizer(input_text, return_tensors="pt", max_length=128, truncation=True)

    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    outputs = test_model.generate(
        **inputs,
        max_length=256,
        num_beams=4,
        early_stopping=True
    )

    answer = test_tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(f"\nQ: {question}")
    print(f"A: {answer}")
    print("-"*80)

Loading saved fine-tuned model for testing...


Loading weights:   0%|          | 0/282 [00:00<?, ?it/s]



✅ Fine-tuned model loaded

Testing on sample questions:

Q: What is a qvevri?
A: tequila
--------------------------------------------------------------------------------

Q: How old is Georgian winemaking?
A: 20th-century
--------------------------------------------------------------------------------

Q: What is Saperavi?
A: wine region
--------------------------------------------------------------------------------

Q: What is the difference between Kakhetian and Imeretian methods?
A: Kakhetian and Imeretian methods
--------------------------------------------------------------------------------

Q: Why is Georgian white wine sometimes orange?
A: it has a high alcohol content
--------------------------------------------------------------------------------


## Summary

This notebook:

✅ Loaded 331 Georgian wine instruction-output pairs  
✅ Fine-tuned `google/flan-t5-base` with proper configuration  
✅ Fixed gradient explosion issues with gradient clipping  
✅ Fixed loss calculation with proper label padding  
✅ Saved the trained model for inference  

**Expected training loss:** 1.5-2.5 (NOT 6 million!)  
**Next Step:** Use `inference.ipynb` to compare base vs fine-tuned models