# 🕉️ Sanātana Dharma GPT-2 Training - Clean Dataset
## Professional-Grade Bhagavad Gita Model Training

**Dataset:** 12,002 high-quality examples from 22 renowned scholars  
**Model:** GPT-2 (124M parameters)  
**Platform:** Google Colab Pro  
**Quality Score:** 9.5/10  

---

### 📋 What This Does:
- Trains GPT-2 on clean Bhagavad Gita data (NO truncations, NO copyright text)
- Uses complete Sanskrit verses with accurate translations
- Includes commentaries from 22+ scholars
- Produces production-ready model for Q1 2026 launch

### ⏱️ Expected Time:
- Setup: 2-3 minutes
- Training: ~45-60 minutes (Colab Pro GPU)
- Download: 2-3 minutes

---

## Step 1: Setup Environment

In [None]:
# Install required packages
!pip install -q transformers datasets accelerate

print('Packages installed successfully!')

In [None]:
# Import libraries
import json
import os
from pathlib import Path
import torch
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from datasets import Dataset
import numpy as np

# Check GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')
if device == 'cuda':
    print(f'   GPU: {torch.cuda.get_device_name(0)}')
    print(f'   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB')
else:
    print('WARNING: No GPU detected! Training will be slow.')

## Step 2: Upload Your Clean Dataset

**IMPORTANT:** Upload the file: `data/jsonl/clean_gita_training_dataset.jsonl`

👉 Click the folder icon on the left sidebar → Upload file

In [None]:
# Verify dataset file
import time

DATASET_FILE = 'clean_gita_training_dataset.jsonl'

if os.path.exists(DATASET_FILE):
    print(f'Dataset found: {DATASET_FILE}')
    print(f'File size: {os.path.getsize(DATASET_FILE) / 1e6:.2f} MB')
else:
    print('Please upload: clean_gita_training_dataset.jsonl')
    print('Use the file browser on the left to upload it')

## Step 3: Load and Prepare Dataset

In [None]:
# Load dataset
print('Loading dataset...')
data = []
with open(DATASET_FILE, 'r', encoding='utf-8') as f:
    for line in f:
        try:
            item = json.loads(line.strip())
            data.append(item)
        except:
            continue

print(f'Loaded {len(data)} examples')

# Show statistics
print('\nDataset Statistics:')
print(f'   Total examples: {len(data)}')

tasks = {}
for item in data:
    task = item.get('task', 'unknown')
    tasks[task] = tasks.get(task, 0) + 1

print('\n   Task breakdown:')
for task, count in sorted(tasks.items(), key=lambda x: -x[1]):
    print(f'   - {task}: {count}')

# Show sample
print('\nSample example:')
sample = data[0]
print(f'   Instruction: {sample["instruction"][:80]}...')
print(f'   Response: {sample["response"][:150]}...')
print(f'   Task: {sample["task"]}')
print(f'   Quality: {sample["quality_score"]}/10')

In [None]:
# Prepare training texts
print('Preparing training texts...')

def create_training_text(example):
    instruction = example['instruction']
    response = example['response']
    text = f'Question: {instruction}\nAnswer: {response}<|endoftext|>'
    return text

training_texts = [create_training_text(item) for item in data]

print(f'Prepared {len(training_texts)} training texts')
print(f'\nSample formatted text:')
print(training_texts[0][:300] + '...')

## Step 4: Load Model and Tokenizer

In [None]:
# Load tokenizer
print('Loading GPT-2 tokenizer...')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
print('Tokenizer loaded')

# Load model
print('\nLoading GPT-2 model...')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.to(device)
print(f'Model loaded on {device}')
print(f'   Parameters: {model.num_parameters() / 1e6:.1f}M')

## Step 5: Tokenize Dataset

In [None]:
# Tokenize
print('Tokenizing dataset...')
print('This may take a few minutes...')

def tokenize_function(texts):
    return tokenizer(
        texts,
        truncation=True,
        max_length=512,
        padding='max_length',
        return_tensors='pt'
    )

# Tokenize all texts
encodings = tokenize_function(training_texts)

# Create dataset
dataset_dict = {
    'input_ids': encodings['input_ids'],
    'attention_mask': encodings['attention_mask']
}

train_dataset = Dataset.from_dict(dataset_dict)

print(f'Tokenization complete')
print(f'   Dataset size: {len(train_dataset)} examples')

## Step 6: Configure Training Parameters

In [None]:
# Training configuration optimized for Colab Pro
training_args = TrainingArguments(
    output_dir='./gpt2-gita-clean',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    warmup_steps=500,
    fp16=True,
    logging_steps=50,
    save_steps=500,
    save_total_limit=2,
    evaluation_strategy='no',
    dataloader_num_workers=2,
    report_to='none',
)

print('Training configuration set')
print(f'\nTraining Parameters:')
print(f'   Epochs: {training_args.num_train_epochs}')
print(f'   Batch size: {training_args.per_device_train_batch_size}')
print(f'   Gradient accumulation: {training_args.gradient_accumulation_steps}')
print(f'   Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}')
print(f'   Learning rate: {training_args.learning_rate}')
print(f'   FP16: {training_args.fp16}')
print(f'\nEstimated training time: 45-60 minutes')

## Step 7: Create Trainer and Start Training

In [None]:
# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

print('Trainer created')
print('\nStarting training...\n')
print('=' * 60)

In [None]:
# TRAIN THE MODEL!
trainer.train()

print('\n' + '=' * 60)
print('Training complete!')

## Step 8: Save the Final Model

In [None]:
# Save model and tokenizer
output_dir = './gpt2-gita-final'

print(f'Saving model to {output_dir}...')
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print(f'Model saved to {output_dir}')
print(f'\nContents:')
for file in os.listdir(output_dir):
    size = os.path.getsize(os.path.join(output_dir, file)) / 1e6
    print(f'   - {file} ({size:.2f} MB)')

## Step 9: Test the Model

In [None]:
# Test the trained model
print('Testing the trained model...\n')

def generate_response(question, max_length=400):
    prompt = f'Question: {question}\nAnswer:'
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    outputs = model.generate(
        inputs['input_ids'],
        max_new_tokens=max_length,
        temperature=0.8,
        top_p=0.92,
        top_k=50,
        repetition_penalty=1.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if 'Answer:' in response:
        answer = response.split('Answer:', 1)[1].strip()
        return answer
    return response

# Test questions
test_questions = [
    'What is dharma in Bhagavad Gita?',
    'Explain verse 2.47 of Bhagavad Gita',
    'What does Krishna teach about karma yoga?',
    'What is Chapter 6 about?'
]

for question in test_questions:
    print('=' * 60)
    print(f'Q: {question}')
    print('\nA:', generate_response(question))
    print()

## Step 10: Download the Model

In [None]:
# Create ZIP file for download
import shutil

print('Creating ZIP file for download...')
shutil.make_archive('gpt2-gita-clean-model', 'zip', './gpt2-gita-final')

zip_size = os.path.getsize('gpt2-gita-clean-model.zip') / 1e6
print(f'ZIP created: gpt2-gita-clean-model.zip ({zip_size:.2f} MB)')
print('\nDownload it from the files panel on the left')
print('   Right-click -> Download')

In [None]:
# Optional: Save to Google Drive
from google.colab import drive

try:
    drive.mount('/content/drive')
    drive_path = '/content/drive/MyDrive/gpt2-gita-clean-model'
    print(f'Copying model to Google Drive...')
    shutil.copytree('./gpt2-gita-final', drive_path, dirs_exist_ok=True)
    print(f'Model saved to Google Drive: {drive_path}')
except Exception as e:
    print(f'Could not save to Drive: {e}')
    print('   Use ZIP download instead')

## 🎉 Training Complete!

### What You Got:
✅ **Production-quality GPT-2 model** trained on 12,002 clean examples  
✅ **Complete verses** with accurate translations  
✅ **22 scholarly perspectives** integrated  
✅ **No truncations**, no copyright issues  
✅ **Ready for Q1 2026 launch**  

### Next Steps:
1. **Download** the model (ZIP or from Google Drive)
2. **Extract** to `server/models/gpt2-gita-clean/`
3. **Update** `server/app_gpt2.py` model path
4. **Test** with your API
5. **Deploy** for investors!

---

🕉️ **May this AI serve the dharma well!** 🙏