# The Pocket Professor: Georgian Winemaking
## Data Construction and Model Training

**Topic**: Georgian winemaking traditions, qvevri methods, grape varieties, and wine culture

**Dataset Size**: 331 instruction-output pairs

All facts verified against authoritative sources on Georgian winemaking.

## 1. Domain Justification

Georgian winemaking is one of the world's oldest continuous winemaking traditions, dating back roughly 8,000 years. The use of qvevri (large clay vessels) and methods such as extended skin-contact fermentation for white wines make it a specialized knowledge domain that general-purpose language models often cover only superficially. The topic suits fine-tuning because it combines technical winemaking knowledge, historical context, regional variation, and cultural practice, all of which take real expertise to explain accurately. A specialized model could serve as an educational resource for sommeliers, wine enthusiasts, and researchers studying traditional fermentation methods.

## 2. Sources

This dataset was constructed using the following authoritative sources:

1. **National Wine Agency of Georgia** (georgianwine.gov.ge) - Official government resource
2. **Alice Feiring** - Expert writings on Georgian natural wine and qvevri traditions
3. **Wine Folly** - Georgian wine guides and grape variety information
4. **Decanter Magazine** - Articles on Georgian wine regions and producers
5. **The Oxford Companion to Wine** - Entries on Georgia, qvevri, and indigenous varieties
6. **Academic Papers** - Research on qvevri fermentation and UNESCO documentation

All factual claims were verified against at least one of the sources listed above.

## 3. Setup and Dependencies

In [1]:
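# Require a recent transformers release: the `eval_strategy` argument used in
# the training configuration needs 4.41 or newer (older releases call it `evaluation_strategy`)
!pip install "transformers>=4.41" datasets torch accelerate evaluate rouge_score --quiet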

import json
import numpy as np
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
import torch
import gc

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")


## 4. Load and Validate Dataset

In [2]:
# Load the dataset
with open('dataset.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

print(f"Total examples: {len(data)}")
print(f"\nFirst example:")
print(f"Instruction: {data[0]['instruction']}")
print(f"Output: {data[0]['output'][:200]}...")

Total examples: 331

First example:
Instruction: How old is Georgian winemaking tradition?
Output: Georgian winemaking tradition spans approximately 8,000 years, making Georgia one of the oldest wine-producing regions in the world. Archaeological evidence from sites in the southern Georgian region ...


In [3]:
# Data quality checks
print("Dataset Statistics:")
print("="*80)

instruction_lengths = [len(item['instruction']) for item in data]
output_lengths = [len(item['output']) for item in data]

print(f"Total examples: {len(data)}")
print(f"\nInstruction length:")
print(f"  Mean: {np.mean(instruction_lengths):.0f} characters")
print(f"  Min: {min(instruction_lengths)}, Max: {max(instruction_lengths)}")

print(f"\nOutput length:")
print(f"  Mean: {np.mean(output_lengths):.0f} characters")
print(f"  Min: {min(output_lengths)}, Max: {max(output_lengths)}")

unique_instructions = len(set(item['instruction'] for item in data))
print(f"\nUnique instructions: {unique_instructions}/{len(data)}")

print("\nSample instructions:")
for i in range(5):
    print(f"{i+1}. {data[i]['instruction']}")

Dataset Statistics:
Total examples: 331

Instruction length:
  Mean: 56 characters
  Min: 16, Max: 101

Output length:
  Mean: 1051 characters
  Min: 575, Max: 1891

Unique instructions: 331/331

Sample instructions:
1. How old is Georgian winemaking tradition?
2. What does UNESCO say about Georgian winemaking?
3. What happened to Georgian winemaking during the Soviet era?
4. How did qvevri winemaking revive after the Soviet era?
5. Where is the earliest archaeological evidence of winemaking in Georgia?
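
The statistics above cover lengths and uniqueness but not whether every record has the expected fields. A minimal sketch of such a schema check follows; `find_invalid_records` and the sample records are hypothetical and are not part of the original notebook.

```python
def find_invalid_records(records):
    """Return (index, reason) pairs for records that break the expected schema."""
    problems = []
    for i, item in enumerate(records):
        if not isinstance(item, dict):
            problems.append((i, "not a dict"))
            continue
        for key in ("instruction", "output"):
            value = item.get(key)
            # A usable field is a string with at least one non-space character.
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{key}'"))
    return problems


# Illustrative records only; the real dataset lives in dataset.json.
sample = [
    {"instruction": "What is a qvevri?", "output": "A large clay vessel."},
    {"instruction": "   ", "output": "Text"},
    {"instruction": "What is Saperavi?"},
]
print(find_invalid_records(sample))
```

In the notebook this could be called as `find_invalid_records(data)`, with an empty list meaning every record passed.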


## 5. Train/Validation Split

In [4]:
# Convert to HuggingFace Dataset and split
dataset = Dataset.from_list(data)
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset_split['train']
val_dataset = dataset_split['test']

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")
print(f"Split ratio: {len(train_dataset)/len(val_dataset):.1f}:1")

Training examples: 297
Validation examples: 34
Split ratio: 8.7:1
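
The 297/34 split can be reproduced by hand. The sketch below assumes the validation share is rounded up to a whole example, which matches the counts recorded above; `split_sizes` is a hypothetical helper, not a library function.

```python
import math


def split_sizes(n_examples, test_size):
    """Sizes for a fractional test_size, assuming the test side is rounded up."""
    n_test = math.ceil(n_examples * test_size)
    return n_examples - n_test, n_test


n_train, n_val = split_sizes(331, 0.1)
print(n_train, n_val, round(n_train / n_val, 1))
```

With only 34 validation examples, each example carries about 3% of the validation loss, so small differences between runs should be read with caution.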


## 6. Clear Memory

In [5]:
# Clear any existing models from memory
if 'model' in locals():
    del model
if 'trainer' in locals():
    del trainer

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print(f"✅ GPU memory cleared")
    print(f"GPU memory allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
else:
    print("✅ Memory cleared (CPU mode)")

✅ GPU memory cleared
GPU memory allocated: 0.00 GB


## 7. Load Base Model

In [6]:
# Load Flan-T5-base model and tokenizer
model_name = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

print(f"✅ Model loaded: {model_name}")
print(f"Total parameters: {model.num_parameters():,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

✅ Model loaded: google/flan-t5-base
Total parameters: 247,577,856
Trainable parameters: 247,577,856


## 8. Tokenize Dataset

In [7]:
# Preprocessing function
def preprocess_function(examples):
    # Format inputs with instruction prefix
    inputs = [f"Answer this question about Georgian wine: {q}" for q in examples['instruction']]
    targets = examples['output']

    # Tokenize inputs
    model_inputs = tokenizer(
        inputs,
        max_length=128,
        truncation=True
    )

    # Tokenize targets
    labels = tokenizer(
        targets,
        max_length=256,
        truncation=True
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply preprocessing
tokenized_train = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["instruction", "output"]
)

tokenized_val = val_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["instruction", "output"]
)

print("✅ Tokenization complete")
print(f"Training samples: {len(tokenized_train)}")
print(f"Validation samples: {len(tokenized_val)}")

✅ Tokenization complete
Training samples: 297
Validation samples: 34
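
Targets are capped at 256 tokens, while the outputs average 1,051 characters and reach 1,891, so some answers may be cut off during training. A rough sketch for measuring this follows; `truncation_report` is a hypothetical helper, and the whitespace encoder in the demo is only a stand-in for the real tokenizer.

```python
def truncation_report(texts, encode, max_length):
    """Count how many texts exceed max_length tokens under the given encoder."""
    lengths = [len(encode(t)) for t in texts]
    too_long = sum(1 for n in lengths if n > max_length)
    return {
        "count": len(lengths),
        "longest": max(lengths, default=0),
        "truncated": too_long,
        "truncated_share": too_long / len(lengths) if lengths else 0.0,
    }


# Stand-in encoder for illustration only; in the notebook one might pass
# lambda t: tokenizer(t)["input_ids"] to count real SentencePiece tokens.
demo = ["one two three", "one two three four five six"]
print(truncation_report(demo, str.split, max_length=4))
```

If a large share of targets turns out to be truncated, raising the target `max_length` or shortening the answers would be the options to weigh.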


## 9. Configure Training

In [8]:
# Training configuration
training_args = Seq2SeqTrainingArguments(
    output_dir="./georgian-wine-model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    predict_with_generate=True,
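    # Note: T5-family checkpoints are prone to overflow under fp16;
    # bf16 (where supported) or full fp32 is the safer choice.
    fp16=torch.cuda.is_available(),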
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

# Data collator for dynamic padding
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model
)

print("✅ Training configuration:")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  FP16: {training_args.fp16}")

✅ Training configuration:
  Learning rate: 5e-05
  Batch size: 2
  Epochs: 3
  FP16: True
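
T5-family models are widely reported to overflow under fp16, and the 0.0 training loss and `nan` validation loss recorded in this notebook fit that pattern. One possible way to choose precision flags is sketched below; `precision_kwargs` is a hypothetical helper, while `fp16` and `bf16` are existing training arguments.

```python
def precision_kwargs(cuda_available, bf16_supported):
    """Pick mixed-precision flags, avoiding fp16 entirely."""
    # bf16 keeps the fp32 exponent range, so it is far less prone to overflow.
    use_bf16 = bool(cuda_available and bf16_supported)
    return {"fp16": False, "bf16": use_bf16}


# In the notebook the two inputs could come from torch.cuda.is_available()
# and torch.cuda.is_bf16_supported().
print(precision_kwargs(True, False))  # GPU without bf16: fall back to fp32
print(precision_kwargs(True, True))   # GPU with bf16: use bf16
```

The returned dictionary could be unpacked into `Seq2SeqTrainingArguments(...)` in place of the `fp16=` line. Full fp32 training is slower and uses more memory, which may matter on smaller GPUs.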


## 10. Train the Model

In [9]:
# Initialize trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
)

print("✅ Trainer initialized")
print("="*80)
print("Starting training...")
print("="*80)

✅ Trainer initialized
Starting training...


In [10]:
# Train the model
result = trainer.train()

print("\n" + "="*80)
print("✅ TRAINING COMPLETE")
print("="*80)
print(f"Final training loss: {result.training_loss:.4f}")
print(f"Training runtime: {result.metrics['train_runtime']:.2f} seconds")
print(f"Samples per second: {result.metrics.get('train_samples_per_second', 'N/A')}")

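# Sanity check on loss
if not np.isfinite(result.training_loss) or result.training_loss == 0:
    # An exact zero or NaN/inf points to numerical failure, not to a good fit
    print("\n⚠️  WARNING: Loss is zero or non-finite. Training likely diverged (check fp16).")
elif result.training_loss > 10:
    print("\n⚠️  WARNING: Loss is very high. Training may have failed.")
elif result.training_loss < 0.5:
    print("\n⚠️  WARNING: Loss is very low. Model may be overfitting.")
else:
    print("\n✅ Loss looks reasonable!")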

Epoch,Training Loss,Validation Loss
1,0.0,
2,0.0,
3,0.0,



✅ TRAINING COMPLETE
Final training loss: 0.0000
Training runtime: 289.53 seconds
Samples per second: 3.077



## 11. Evaluate on Validation Set

In [11]:
# Evaluate on validation set
eval_results = trainer.evaluate()

print("Validation Results:")
print("="*80)
for key, value in eval_results.items():
    if isinstance(value, float):
        print(f"{key}: {value:.4f}")
    else:
        print(f"{key}: {value}")

Validation Results:
eval_loss: nan
eval_runtime: 0.9560
eval_samples_per_second: 35.5660
eval_steps_per_second: 17.7830
epoch: 3.0000
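
The evaluation above reports only loss, and the `evaluate` and `rouge_score` packages installed during setup are never used. As a dependency-free stand-in (not ROUGE itself), a unigram-overlap F1 between a generated answer and its reference could be sketched as follows; `unigram_f1` is a hypothetical helper.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """F1 over lowercase whitespace tokens shared by prediction and reference."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("a clay vessel", "a large clay vessel"))
```

Averaging this score over the 34 validation examples would give a rough text-level signal alongside `eval_loss`. It ignores word order and synonyms, so it is best treated as a coarse check.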


## 12. Save the Fine-Tuned Model

In [12]:
# Save model and tokenizer
model.save_pretrained("./georgian-wine-model")
tokenizer.save_pretrained("./georgian-wine-model")

# Also save the dataset for reference
with open('./georgian-wine-model/dataset.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

print("✅ Model saved to: ./georgian-wine-model")
print("✅ Tokenizer saved to: ./georgian-wine-model")
print("✅ Dataset saved to: ./georgian-wine-model/dataset.json")

✅ Model saved to: ./georgian-wine-model
✅ Tokenizer saved to: ./georgian-wine-model
✅ Dataset saved to: ./georgian-wine-model/dataset.json


## 13. Test the Fine-Tuned Model

In [13]:
# Load the saved model for testing
print("Loading saved fine-tuned model for testing...")
print("="*80)

test_model = AutoModelForSeq2SeqLM.from_pretrained("./georgian-wine-model")
test_tokenizer = AutoTokenizer.from_pretrained("./georgian-wine-model")

if torch.cuda.is_available():
    test_model.cuda()

print("✅ Model loaded for testing\n")

Loading saved fine-tuned model for testing...


✅ Model loaded for testing



In [14]:
# Test on sample questions
test_questions = [
    "What is a qvevri?",
    "How old is Georgian winemaking?",
    "What is Saperavi?",
    "What is the difference between Kakhetian and Imeretian methods?",
    "Why is Georgian white wine sometimes orange?",
]

print("Testing Fine-Tuned Model:")
print("="*80)

for question in test_questions:
    input_text = f"Answer this question about Georgian wine: {question}"
    inputs = test_tokenizer(input_text, return_tensors="pt", max_length=128, truncation=True)

    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    outputs = test_model.generate(
        **inputs,
        max_length=256,
        num_beams=4,
        early_stopping=True
    )

    answer = test_tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(f"\nQ: {question}")
    print(f"A: {answer}")
    print("-"*80)

Testing Fine-Tuned Model:

Q: What is a qvevri?
A: tequila
--------------------------------------------------------------------------------

Q: How old is Georgian winemaking?
A: 20th-century
--------------------------------------------------------------------------------

Q: What is Saperavi?
A: wine region
--------------------------------------------------------------------------------

Q: What is the difference between Kakhetian and Imeretian methods?
A: Kakhetian and Imeretian methods
--------------------------------------------------------------------------------

Q: Why is Georgian white wine sometimes orange?
A: it has a high alcohol content
--------------------------------------------------------------------------------
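
The training answers average over 1,000 characters, so answers of one to four words suggest the fine-tuning did not take hold. A small heuristic for flagging such degenerate answers is sketched below; `flag_answer` and its `min_words` threshold are hypothetical, and the check catches only answers that are very short or that repeat the question, not factual errors.

```python
def flag_answer(question, answer, min_words=5):
    """Return a list of reasons why an answer looks degenerate (may be empty)."""
    flags = []
    if len(answer.split()) < min_words:
        flags.append("too short")
    # An answer that is a verbatim fragment of the question adds no information.
    core = answer.lower().strip(" .?")
    if core and core in question.lower():
        flags.append("echoes question")
    return flags


print(flag_answer("What is a qvevri?", "tequila"))
print(flag_answer(
    "What is the difference between Kakhetian and Imeretian methods?",
    "Kakhetian and Imeretian methods",
))
```

An answer such as "it has a high alcohol content" passes this filter despite being wrong, so a comparison against reference answers is still needed.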


## Summary

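This notebook ran the following steps:

✅ Loaded and validated 331 Georgian wine instruction-output pairs  
⚠️ Fine-tuned `google/flan-t5-base` on the dataset, but training loss was logged as 0.0 in every epoch  
⚠️ Evaluated on the validation set, where `eval_loss` was `nan`  
✅ Saved the trained model and tokenizer  
⚠️ Tested the model on sample questions; the answers were short and incorrect (for example, "tequila" for "What is a qvevri?")  

A zero training loss together with a `nan` validation loss indicates that this run diverged numerically, most likely because of fp16 training, so the saved checkpoint should not be treated as a working model.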

**Next Step:** Use `inference.ipynb` to compare base vs fine-tuned model responses side-by-side.