# AI Rapper System - Google Colab Training

## Setup Instructions:
1. Upload this notebook to Google Colab
2. Go to Runtime > Change runtime type > Select GPU
3. Run each cell in order
4. Download your trained model at the end

**Estimated Time:** 2-4 hours

## Step 1: Upload Training Data

In [None]:
from google.colab import files
import json

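print("📁 Upload your training_lyrics.json file:")
uploaded = files.upload()

# Verify upload: the rest of the notebook expects this exact filename
if 'training_lyrics.json' not in uploaded:
    raise FileNotFoundError("Expected an uploaded file named training_lyrics.json")

with open('training_lyrics.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

print(f"\n✅ Loaded {len(data['training_data'])} verses for training")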
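
The upload cell only confirms that the file parses and has a `training_data` key, while the training script in Step 3 also reads a `prompt` and a `lyrics` field from every entry. As a minimal sketch of that layout, the snippet below writes a tiny sample file and validates it. The helper name `validate_training_data`, the sample filename, and the sample entries are illustrative assumptions, not part of the notebook.

```python
import json

def validate_training_data(data):
    """Return a list of problems found in a parsed training file (empty if none)."""
    problems = []
    entries = data.get("training_data") if isinstance(data, dict) else None
    if not isinstance(entries, list) or not entries:
        return ["top-level 'training_data' must be a non-empty list"]
    for i, item in enumerate(entries):
        for key in ("prompt", "lyrics"):
            value = item.get(key) if isinstance(item, dict) else None
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: missing or empty '{key}'")
    return problems

# Hypothetical sample showing the layout the training script reads
sample = {
    "training_data": [
        {"prompt": "Write motivational bars",
         "lyrics": "Line one of the verse\nLine two of the verse"},
        {"prompt": "Write smooth storytelling bars",
         "lyrics": "Another verse goes here"},
    ]
}

with open("sample_training_lyrics.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, indent=2)

with open("sample_training_lyrics.json", "r", encoding="utf-8") as f:
    print(validate_training_data(json.load(f)))
```

Running the same function on your own parsed file before training can catch entries with a missing field early, rather than hours into the run.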

## Step 2: Install Dependencies

In [None]:
print("📦 Installing required packages...\n")
!pip install -q transformers datasets accelerate torch
print("\n✅ Dependencies installed!")

## Step 3: Prepare Training Script

In [None]:
%%writefile train_model.py
import json
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import Dataset

print("🎤 AI Rapper Training Script\n")

# Load training data
with open('training_lyrics.json', 'r') as f:
    data = json.load(f)

# Prepare training texts
texts = []
for item in data['training_data']:
    text = f"Prompt: {item['prompt']}\n\nLyrics:\n{item['lyrics']}\n\n"
    texts.append(text)

print(f"📊 Prepared {len(texts)} training examples\n")

# Load model and tokenizer
print("🔧 Loading GPT-2-Medium model...")
model_name = "gpt2-medium"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(
    model_name,
    low_cpu_mem_usage=True  # Optimize memory usage
)

# Set pad token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# Tokenize
print("📝 Tokenizing texts...")
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=512)

dataset = Dataset.from_dict({'text': texts})
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=['text'])
tokenized_dataset.set_format('torch')

# Training arguments (optimized for Colab free tier)
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # Simulate larger batch size
    save_steps=500,
    save_total_limit=2,
    logging_steps=50,
    learning_rate=5e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
    fp16=True,  # Use mixed precision for faster training
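    # No evaluation set is used; evaluation is off by default
    report_to="none",  # Avoid logging integrations (e.g. W&B) prompting for a login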
    save_safetensors=True,
)

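# Data collator: copies input_ids into labels (padding masked out) so the model returns a loss
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)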

# Train
print("\n🚀 Starting training (this will take 2-4 hours)...\n")
print("💡 Tip: You can monitor progress in the logs below")
print("📊 Training loss should decrease from ~4.0 to ~2.0-2.5\n")
trainer.train()

# Save model
print("\n💾 Saving trained model...")
model.save_pretrained('./trained_model')
tokenizer.save_pretrained('./trained_model')

# Save generation config with optimized parameters
from transformers import GenerationConfig
generation_config = GenerationConfig(
    max_new_tokens=512,  # Allow full verses
    temperature=0.9,
    top_p=0.95,
    do_sample=True,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
)
generation_config.save_pretrained('./trained_model')

print("\n✅ Training complete!")
print("📁 Model saved to: ./trained_model")
print("🎯 Next: Run the test cell to see your model in action!")
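
The script formats each training example as a `Prompt:` line, a blank line, a `Lyrics:` line, and then the verse. The test cell in Step 5 has to rebuild exactly the same prefix for generation to continue in the learned format. One way to keep the two in sync is to put the template in shared helpers. This is a sketch only: `format_prompt`, `format_example`, and `extract_lyrics` are hypothetical names, not functions the notebook defines.

```python
PROMPT_TEMPLATE = "Prompt: {prompt}\n\nLyrics:\n"

def format_prompt(prompt):
    """Prefix given to the model at generation time."""
    return PROMPT_TEMPLATE.format(prompt=prompt)

def format_example(prompt, lyrics):
    """Full training text: the generation prefix followed by the verse."""
    return format_prompt(prompt) + lyrics + "\n\n"

def extract_lyrics(generated_text):
    """Keep only what follows the first 'Lyrics:' marker."""
    _, marker, rest = generated_text.partition("Lyrics:\n")
    return rest.strip() if marker else generated_text.strip()

example = format_example("Write motivational bars", "Rise up\nStay true")
print(repr(example))
print(extract_lyrics(example))
```

Because `format_example` is built on top of `format_prompt`, any change to the template applies to training and generation together.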

## Step 4: Train the Model (This takes 2-4 hours)

In [None]:
!python train_model.py
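
How long this cell runs depends mostly on the number of optimizer steps, which follows from the dataset size and the settings in `train_model.py`: a per-device batch of 2 with 4 gradient-accumulation steps gives an effective batch of 8 examples per update. The sketch below does that arithmetic for a hypothetical dataset of 1,000 verses. The function name and the dataset size are assumptions, and the exact rounding of the final partial batch varies between `transformers` versions, so treat the result as an estimate.

```python
import math

def estimate_training_steps(num_examples, per_device_batch_size=2,
                            gradient_accumulation_steps=4, num_epochs=3):
    """Approximate optimizer-step counts for a single-GPU run."""
    effective_batch = per_device_batch_size * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
    }

# Hypothetical dataset size; substitute the verse count printed in Step 1
estimate = estimate_training_steps(1000)
print(estimate)

# Compare against the schedule configured in the training script
warmup_steps, save_steps = 100, 500
print(f"Warmup covers {warmup_steps / estimate['total_steps']:.0%} of training")
print(f"Full save_steps intervals reached: {estimate['total_steps'] // save_steps}")
```

For a dataset of this size the run would end before the first `save_steps` boundary, so the explicit `save_pretrained` call at the end of the script is what preserves the model. With a small dataset it may also be worth lowering `warmup_steps` so warmup does not take up a large share of training.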

## Step 5: Test the Model

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load trained model
print("📥 Loading your trained model...")
model = GPT2LMHeadModel.from_pretrained('./trained_model')
tokenizer = GPT2Tokenizer.from_pretrained('./trained_model')
print("✅ Model loaded!\n")

# Generate test lyrics with optimized parameters
test_prompts = [
    "Write aggressive battle rap bars",
    "Write motivational bars",
    "Write smooth storytelling bars"
]

print("🎤 Testing your model with different prompts:\n")
print("="*60)

for i, test_prompt in enumerate(test_prompts, 1):
    prompt = f"Prompt: {test_prompt}\n\nLyrics:\n"
    inputs = tokenizer(prompt, return_tensors='pt')
    
    outputs = model.generate(
        **inputs,  # input_ids plus attention_mask, so padding is handled explicitly
        max_new_tokens=256,  # Full verse length
        temperature=0.9,
        top_p=0.95,
        repetition_penalty=1.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract just the lyrics (everything after the first "Lyrics:\n" marker)
    lyrics = generated_text.split("Lyrics:\n", 1)[-1].strip()
    
    print(f"\n{i}. {test_prompt}")
    print("-"*60)
    print(lyrics)
    print("="*60)

print("\n✅ Model test complete!")
print("📊 Check if the output has:")
print("  - Good rhyme schemes")
print("  - Consistent flow (syllables per line)")
print("  - Complete verses (not truncated)")
print("  - Your unique style")
print("\n💡 If quality is low, you may need more training data or epochs")
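
The "consistent flow" check can be roughed out numerically by counting syllables per line. The sketch below uses a crude vowel-group heuristic, an assumption chosen for brevity that will miscount many English words. It is only meant to flag lines that are far longer or shorter than their neighbours. `count_syllables` and `flow_report` are hypothetical helpers, and the sample lyrics are made up.

```python
import re

def count_syllables(word):
    """Crude estimate: count groups of consecutive vowels, minimum one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flow_report(lyrics):
    """Return (syllable_estimate, line) pairs for each non-empty line."""
    report = []
    for line in lyrics.splitlines():
        words = re.findall(r"[A-Za-z']+", line)
        if words:
            report.append((sum(count_syllables(w) for w in words), line))
    return report

sample_lyrics = "I rise up from the ground\nNever gonna back down"
for syllables, line in flow_report(sample_lyrics):
    print(f"{syllables:>2}  {line}")
```

Passing the `lyrics` string from the test loop to `flow_report` gives a quick view of how even the line lengths are. Judging rhyme and style still needs a human ear.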

## Step 6: Download Your Trained Model

In [None]:
# Zip the model for download
!zip -r trained_model.zip ./trained_model

# Download
from google.colab import files
print("📥 Downloading your trained model...")
files.download('trained_model.zip')
print("\n✅ Download complete! Extract this on your local machine.")
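
Before relying on the download, it can help to confirm the archive contains the files a local loader will look for. The sketch below checks a zip for a list of expected names using only the standard library. The file list is an assumption based on what `save_pretrained` typically writes for a GPT-2 model and tokenizer with safetensors enabled, so adjust it if your output folder differs. `missing_from_zip` is a hypothetical helper.

```python
import os
import zipfile

# Assumed contents of ./trained_model; verify against your own output folder
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "vocab.json",
    "merges.txt",
]

def missing_from_zip(zip_path, expected=EXPECTED_FILES):
    """Return the expected file names that do not appear anywhere in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        present = {os.path.basename(name) for name in archive.namelist()}
    return [name for name in expected if name not in present]

if os.path.exists("trained_model.zip"):
    missing = missing_from_zip("trained_model.zip")
    print("Missing files:", missing if missing else "none")
else:
    print("trained_model.zip not found; run the zip cell first")
```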

## Next Steps:

1. Extract `trained_model.zip` on your computer
2. Place the `trained_model` folder in your project's `models/` directory
3. Update `.env`: `LOCAL_MODEL_PATH=./models/trained_model`
4. Restart your API server
5. Generate lyrics with YOUR trained model!
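
How the API server consumes `LOCAL_MODEL_PATH` is not shown here, so the following is only a sketch of one plausible startup check: read the variable from the environment, fall back to the path from step 3, and fail early if the folder does not look like a saved model. The function name `resolve_model_path` and the `config.json` check are assumptions. Loading the `.env` file into the environment is left to whatever mechanism the project already uses.

```python
import os
from pathlib import Path

DEFAULT_MODEL_PATH = "./models/trained_model"

def resolve_model_path(env=os.environ):
    """Return the model directory from LOCAL_MODEL_PATH after a basic sanity check."""
    model_dir = Path(env.get("LOCAL_MODEL_PATH", DEFAULT_MODEL_PATH))
    if not (model_dir / "config.json").is_file():
        raise FileNotFoundError(
            f"No config.json in {model_dir}; check LOCAL_MODEL_PATH and the extracted folder"
        )
    return model_dir

try:
    print("Using model at:", resolve_model_path())
except FileNotFoundError as err:
    print("Model not ready:", err)
```

The returned directory can then be passed to `GPT2LMHeadModel.from_pretrained` and `GPT2Tokenizer.from_pretrained`, the same calls used in Step 5.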

**Optional:** Convert to GGUF for faster CPU inference:
```bash
# Install llama.cpp tools
# Convert model to GGUF format
```