# 💕 Flan-T5 Romantic Chat Training Pipeline

This notebook is a step-by-step guide to fine-tuning Google's Flan-T5 model on romantic chat conversations.

## 📋 Table of Contents
1. **Install Required Dependencies**
2. **Import Required Libraries**
3. **Load and Explore Dataset**
4. **Data Preprocessing**
5. **Model and Tokenizer Setup**
6. **Dataset Preparation**
7. **Training Configuration**
8. **Trainer Setup**
9. **Model Training**
10. **Model Evaluation**
11. **Save the Trained Model**
12. **Load and Test Saved Model**

---

## Step 1: Install Required Dependencies 📦

First, let's install all the necessary libraries for training Flan-T5:

In [1]:
# Install required packages (scikit-learn provides train_test_split, used in Step 6)
!pip install transformers datasets torch accelerate sentencepiece evaluate rouge_score scikit-learn
!pip install --upgrade huggingface_hub

# Optional: DeepSpeed for distributed or large-model training (not needed for flan-t5-small)
!pip install deepspeed

Collecting transformers
  Downloading transformers-4.52.4-py3-none-any.whl.metadata (38 kB)
Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting torch
  Downloading torch-2.7.0-cp312-none-macosx_11_0_arm64.whl.metadata (29 kB)
Collecting accelerate
  Downloading accelerate-1.7.0-py3-none-any.whl.metadata (19 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (7.7 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metada

## Step 2: Import Required Libraries 📚

Import all necessary libraries for data processing, model training, and evaluation:

In [None]:
import json
import os
import glob
import pandas as pd
import torch
from transformers import (
    T5ForConditionalGeneration,
    T5Tokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback
)
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
import numpy as np
from typing import Dict, List
from tqdm import tqdm

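# Set up device: prefer CUDA, then Apple Silicon (MPS), then CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')
print(f"Using device: {device}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")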

## Step 3: Load and Explore Dataset 🔍

Let's load all the romantic chat data from the LOVE folder and explore its structure:

In [None]:
def load_chat_data(folder_path: str) -> List[Dict]:
    """
    Load all JSON chat data from a folder
    
    Args:
        folder_path: Path to folder containing JSON files
    
    Returns:
        List of chat conversation dictionaries
    """
    all_conversations = []
    
    # Find all JSON files in the folder
    json_files = glob.glob(os.path.join(folder_path, "*.json"))
    
    print(f"Found {len(json_files)} JSON files:")
    for file_path in json_files:
        print(f"  - {os.path.basename(file_path)}")
        
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
                all_conversations.extend(data)
                print(f"    Loaded {len(data)} conversations")
        except Exception as e:
            print(f"    Error loading {file_path}: {e}")
    
    return all_conversations

# Load the romantic chat data
love_folder = "/Users/shaswatraj/Desktop/AI/LOVE"
conversations = load_chat_data(love_folder)

print(f"\n📊 Total conversations loaded: {len(conversations)}")
print(f"\n📋 Sample conversation:")
print(json.dumps(conversations[0], indent=2, ensure_ascii=False))
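
The loader above assumes that every file holds a JSON array of objects, and the later cells read the `input` and `response` keys from each object. The following is a minimal sketch of a validation pass under that assumption; `validate_records` and the sample file are hypothetical and only illustrate the expected shape of the data.

```python
import json
import os
import tempfile
from typing import Dict, List


def validate_records(data) -> List[Dict]:
    """Keep only dict records with non-empty string 'input' and 'response' fields."""
    if not isinstance(data, list):
        return []  # a file that is not a JSON array contributes nothing
    valid = []
    for record in data:
        if not isinstance(record, dict):
            continue
        user_text = record.get("input")
        reply_text = record.get("response")
        if isinstance(user_text, str) and isinstance(reply_text, str):
            if user_text.strip() and reply_text.strip():
                valid.append(record)
    return valid


# Hypothetical sample data, written to a temporary folder for illustration
sample = [
    {"input": "Human: Good morning!", "response": "AI: Good morning, sunshine!"},
    {"input": "Human: Hi"},                      # missing 'response': dropped
    {"input": "   ", "response": "AI: Hello"},   # blank input: dropped
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample_chats.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)

    with open(path, "r", encoding="utf-8") as f:
        records = validate_records(json.load(f))

print(f"Kept {len(records)} of {len(sample)} records")
```

A check like this could run on `data` inside `load_chat_data` before the records are added to the combined list, so malformed files are reported early instead of failing in a later cell.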

In [None]:
# Explore the dataset
def explore_dataset(conversations: List[Dict]):
    """
    Explore and analyze the chat dataset
    """
    print("🔍 Dataset Analysis:")
    print(f"Total conversations: {len(conversations)}")
    
    # Analyze input and response lengths
    input_lengths = []
    response_lengths = []
    
    for conv in conversations:
        if 'input' in conv and 'response' in conv:
            input_lengths.append(len(conv['input']))
            response_lengths.append(len(conv['response']))
    
    print(f"\n📏 Text Length Statistics:")
    print(f"Input - Mean: {np.mean(input_lengths):.1f}, Max: {max(input_lengths)}, Min: {min(input_lengths)}")
    print(f"Response - Mean: {np.mean(response_lengths):.1f}, Max: {max(response_lengths)}, Min: {min(response_lengths)}")
    
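    # Script analysis (basic heuristic)
    # Devanagari occupies the Unicode block U+0900-U+097F. Testing for any
    # non-ASCII character would count emojis as Hindi, and romanized Hinglish
    # is plain ASCII, so it is grouped with English here.
    latin_only_count = 0
    devanagari_only_count = 0
    mixed_count = 0
    
    for conv in conversations:
        text = conv.get('input', '') + ' ' + conv.get('response', '')
        has_devanagari = any('\u0900' <= char <= '\u097F' for char in text)
        has_latin = any(char.isascii() and char.isalpha() for char in text)
        if has_devanagari and has_latin:
            mixed_count += 1
        elif has_devanagari:
            devanagari_only_count += 1
        else:
            latin_only_count += 1
    
    print(f"\n🌐 Language Distribution:")
    print(f"English / romanized Hinglish (Latin script): {latin_only_count}")
    print(f"Hindi (Devanagari script): {devanagari_only_count}")
    print(f"Mixed scripts: {mixed_count}")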
    
    # Show sample conversations
    print(f"\n📝 Sample Conversations:")
    for i, conv in enumerate(conversations[:3]):
        print(f"\nExample {i+1}:")
        print(f"Input: {conv.get('input', 'N/A')}")
        print(f"Response: {conv.get('response', 'N/A')}")

explore_dataset(conversations)

## Step 4: Data Preprocessing 🛠️

Prepare the data for Flan-T5 training by formatting it properly:

In [None]:
def preprocess_conversations(conversations: List[Dict]) -> List[Dict]:
    """
    Preprocess conversations for T5 training format
    
    Args:
        conversations: List of conversation dictionaries
    
    Returns:
        Processed conversations ready for training
    """
    processed_data = []
    
    for conv in conversations:
        if 'input' in conv and 'response' in conv:
            # Extract human input and AI response
            human_input = conv['input']
            ai_response = conv['response']
            
            # Remove "Human: " and "AI: " prefixes if present
            if human_input.startswith("Human: "):
                human_input = human_input[7:]
            if ai_response.startswith("AI: "):
                ai_response = ai_response[4:]
            
            # Format for T5: prepend a short task prefix to the user message.
            # The same prefix must be used again at inference time.
            formatted_input = f"romantic chat: {human_input.strip()}"
            formatted_output = ai_response.strip()
            
            processed_data.append({
                'input_text': formatted_input,
                'target_text': formatted_output
            })
    
    return processed_data

# Preprocess the data
processed_conversations = preprocess_conversations(conversations)
print(f"✅ Processed {len(processed_conversations)} conversations")

# Show example of processed data
print(f"\n📋 Processed Format Example:")
for i, item in enumerate(processed_conversations[:2]):
    print(f"\nExample {i+1}:")
    print(f"Input: {item['input_text']}")
    print(f"Target: {item['target_text']}")
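
Exact duplicate pairs can end up on both sides of the train/validation split and make the validation loss look better than it is. The sketch below removes empty and repeated pairs before splitting. It assumes the `input_text` / `target_text` keys and the `romantic chat:` prefix produced above; `deduplicate_pairs` is a hypothetical helper, not part of the original pipeline.

```python
from typing import Dict, List


def deduplicate_pairs(pairs: List[Dict[str, str]],
                      prefix: str = "romantic chat:") -> List[Dict[str, str]]:
    """Drop pairs with an empty side and exact duplicates, keeping first occurrences."""
    seen = set()
    unique = []
    for pair in pairs:
        source = pair["input_text"].strip()
        target = pair["target_text"].strip()
        # Look at the user message without the task prefix to detect empty inputs
        user_part = source[len(prefix):].strip() if source.startswith(prefix) else source
        if not user_part or not target:
            continue
        key = (source, target)
        if key in seen:
            continue
        seen.add(key)
        unique.append(pair)
    return unique


# Hypothetical examples for illustration
demo_pairs = [
    {"input_text": "romantic chat: Good morning!", "target_text": "Morning, love!"},
    {"input_text": "romantic chat: Good morning!", "target_text": "Morning, love!"},
    {"input_text": "romantic chat: ", "target_text": "Hello?"},
    {"input_text": "romantic chat: Miss you", "target_text": "Miss you more"},
]

cleaned = deduplicate_pairs(demo_pairs)
print(f"{len(demo_pairs)} pairs before, {len(cleaned)} after")
```

Applied to this notebook, the call would be `processed_conversations = deduplicate_pairs(processed_conversations)` right before the dataset split in Step 6.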

## Step 5: Model and Tokenizer Setup 🤖

Load the Flan-T5 model and tokenizer:

In [None]:
# Model configuration
MODEL_NAME = "google/flan-t5-small"  # Start with small for faster training
# Other options: "google/flan-t5-base", "google/flan-t5-large"

print(f"🤖 Loading model: {MODEL_NAME}")

# Load tokenizer
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
print(f"✅ Tokenizer loaded. Vocab size: {tokenizer.vocab_size}")

# Load model
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
model.to(device)
print(f"✅ Model loaded and moved to {device}")
print(f"📊 Model parameters: {model.num_parameters():,}")

# Test tokenization
test_input = "romantic chat: Hello beautiful, how are you? 💕"
test_tokens = tokenizer(test_input, return_tensors="pt")
print(f"\n🧪 Tokenization test:")
print(f"Input: {test_input}")
print(f"Tokens: {test_tokens['input_ids'].shape}")
print(f"Decoded: {tokenizer.decode(test_tokens['input_ids'][0], skip_special_tokens=True)}")

## Step 6: Dataset Preparation 📊

Split the data and create HuggingFace datasets:

In [None]:
def create_datasets(processed_conversations: List[Dict], test_size: float = 0.2):
    """
    Create train and validation datasets
    
    Args:
        processed_conversations: List of processed conversation dictionaries
        test_size: Fraction of data to use for validation
    
    Returns:
        DatasetDict with train and validation splits
    """
    # Split the data
    train_data, val_data = train_test_split(
        processed_conversations, 
        test_size=test_size, 
        random_state=42
    )
    
    print(f"📊 Dataset split:")
    print(f"  Training: {len(train_data)} conversations")
    print(f"  Validation: {len(val_data)} conversations")
    
    # Create HuggingFace datasets
    train_dataset = Dataset.from_list(train_data)
    val_dataset = Dataset.from_list(val_data)
    
    # Combine into DatasetDict
    dataset_dict = DatasetDict({
        'train': train_dataset,
        'validation': val_dataset
    })
    
    return dataset_dict

# Create datasets
datasets = create_datasets(processed_conversations)
print(f"\n✅ Datasets created successfully")
print(f"Train dataset: {len(datasets['train'])} examples")
print(f"Validation dataset: {len(datasets['validation'])} examples")

In [None]:
def tokenize_function(examples):
    """
    Tokenize the input and target texts
    """
    # Tokenize inputs
    model_inputs = tokenizer(
        examples['input_text'],
        max_length=512,
        truncation=True,
        padding=False  # We'll pad later in the data collator
    )
    
    # Tokenize targets (text_target replaces the deprecated as_target_tokenizer())
    labels = tokenizer(
        text_target=examples['target_text'],
        max_length=512,
        truncation=True,
        padding=False
    )
    
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# Apply tokenization
print("🔄 Tokenizing datasets...")
tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=datasets['train'].column_names
)

print("✅ Tokenization complete")
print(f"Tokenized train dataset: {len(tokenized_datasets['train'])} examples")
print(f"Tokenized validation dataset: {len(tokenized_datasets['validation'])} examples")

# Show tokenized example
print(f"\n📋 Tokenized Example:")
example = tokenized_datasets['train'][0]
print(f"Input IDs shape: {np.array(example['input_ids']).shape}")
print(f"Labels shape: {np.array(example['labels']).shape}")
print(f"Decoded input: {tokenizer.decode(example['input_ids'], skip_special_tokens=True)}")
print(f"Decoded label: {tokenizer.decode(example['labels'], skip_special_tokens=True)}")

## Step 7: Training Configuration ⚙️

Set up training arguments and data collator:

In [None]:
# Create output directory
output_dir = "./flan-t5-romantic-chat"
os.makedirs(output_dir, exist_ok=True)

# Training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,  # Start with 3 epochs
    per_device_train_batch_size=4,  # Adjust based on GPU memory
    per_device_eval_batch_size=4,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir=f"{output_dir}/logs",
    logging_steps=50,
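    eval_strategy="steps",  # named `evaluation_strategy` in transformers < 4.41
    eval_steps=200,
    save_strategy="steps",
    save_steps=400,  # must be a multiple of eval_steps when load_best_model_at_end=True
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="none",  # Disable wandb and other integrations (None does not disable them)
    seed=42,
    fp16=torch.cuda.is_available(),  # Mixed precision on GPU; T5 can overflow in fp16, so try bf16 if the loss becomes NaN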
    gradient_accumulation_steps=2,  # Effective batch size = 4 * 2 = 8
    lr_scheduler_type="linear",
    learning_rate=5e-5,
    remove_unused_columns=False,
)

print(f"⚙️ Training Configuration:")
print(f"  Output directory: {output_dir}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Mixed precision (FP16): {training_args.fp16}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
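
With `lr_scheduler_type="linear"` and `warmup_steps=100`, the learning rate rises linearly from zero to `learning_rate` over the first 100 optimizer steps and then decays linearly to zero at the end of training. The sketch below reproduces that shape in plain Python; `linear_schedule_lr` is a hypothetical helper and the total step counts are made-up values for illustration.

```python
def linear_schedule_lr(step: int,
                       base_lr: float = 5e-5,
                       warmup_steps: int = 100,
                       total_steps: int = 1000) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed run length of 1000 optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {linear_schedule_lr(step):.2e}")

# On a small dataset the run can be shorter than the warmup itself.
# With an assumed total of 75 steps the peak rate is never reached:
print(f"last step of a 75-step run: lr = {linear_schedule_lr(74, total_steps=75):.2e}")
```

If the whole run is shorter than `warmup_steps`, the model trains at a reduced learning rate throughout, so it may be worth lowering `warmup_steps` for small datasets.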

In [None]:
# Data collator for dynamic padding
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True,
    return_tensors="pt"
)

print("✅ Data collator configured for dynamic padding")

# Test the data collator
test_batch = [tokenized_datasets['train'][i] for i in range(2)]
collated_batch = data_collator(test_batch)
print(f"\n🧪 Data Collator Test:")
print(f"Batch input_ids shape: {collated_batch['input_ids'].shape}")
print(f"Batch labels shape: {collated_batch['labels'].shape}")
print(f"Batch attention_mask shape: {collated_batch['attention_mask'].shape}")
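
Dynamic padding means each batch is padded only to the length of its own longest sequence. The pure-Python sketch below mimics the idea: inputs are padded with the pad token id (0 for T5) and labels with -100, the value the loss function ignores. `pad_batch` and the token ids are hypothetical; the real `DataCollatorForSeq2Seq` additionally returns tensors and, when given `model`, prepares `decoder_input_ids`.

```python
from typing import Dict, List

PAD_TOKEN_ID = 0      # T5's pad token id
LABEL_PAD_ID = -100   # positions with this label are ignored by the loss


def pad_batch(features: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = len(f["input_ids"])
        n_lab = len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_input - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_input - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_label - n_lab))
    return batch


# Made-up token ids; 1 stands for the end-of-sequence token
features = [
    {"input_ids": [10, 11, 12, 1], "labels": [20, 1]},
    {"input_ids": [13, 1], "labels": [21, 22, 23, 1]},
]

padded = pad_batch(features)
for name, rows in padded.items():
    print(name, rows)
```

Padding labels with -100 instead of the pad token id is what keeps padded positions from contributing to the training loss.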

## Step 8: Trainer Setup 🏋️

Create the Trainer with evaluation metrics:

In [None]:
from evaluate import load

# Load evaluation metrics
rouge_metric = load("rouge")

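def preprocess_logits_for_metrics(logits, labels):
    """
    Reduce logits to token IDs before they are accumulated for metrics.
    
    The plain Trainer hands raw logits to compute_metrics, not generated
    token IDs, so take the argmax here. For T5 the model output is a tuple
    whose first element holds the logits.
    """
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)

def compute_metrics(eval_pred):
    """
    Compute ROUGE metrics for evaluation
    
    Predictions are teacher-forced argmax tokens, not free-running
    generations, so treat the scores as a rough proxy. Seq2SeqTrainer with
    predict_with_generate=True is the alternative for generation-based metrics.
    """
    predictions, labels = eval_pred
    
    # Replace -100 (padding / ignore index) with tokenizer.pad_token_id before decoding
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Compute ROUGE scores
    result = rouge_metric.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True
    )
    
    # Scale ROUGE scores to percentages
    result = {key: value * 100 for key, value in result.items()}
    
    return {
        "rouge1": result["rouge1"],
        "rouge2": result["rouge2"],
        "rougeL": result["rougeL"],
        "rougeLsum": result["rougeLsum"]
    }

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)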

print("✅ Trainer configured successfully")
print(f"📊 Training will start with {len(tokenized_datasets['train'])} training examples")
print(f"📊 Evaluation will use {len(tokenized_datasets['validation'])} validation examples")
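
To make the ROUGE numbers easier to interpret, here is a simplified sketch of what ROUGE-1 measures: the F1 score of the unigram overlap between a prediction and a reference. This is not the `rouge_score` implementation (it uses plain whitespace tokenization and no stemming), and `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it
    # appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# All three predicted words appear in the four-word reference:
# precision = 3/3, recall = 3/4
score = rouge1_f1("i love you", "i love you too")
print(f"Simplified ROUGE-1 F1: {score:.3f}")
```

One point worth verifying for this dataset: to my understanding, the default `rouge_score` tokenizer keeps only lowercase ASCII letters and digits, so Devanagari text and emojis would not contribute to the reported scores. Open-ended chat also has many valid replies, which tends to keep word-overlap scores low even for good responses.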

## Step 9: Model Training 🚀

Start the training process:

In [None]:
# Check model before training
print("🔍 Pre-training model check:")
test_input = "romantic chat: I miss you so much 💔"
# Use model.device: the Trainer may have moved the model (e.g. to MPS on Apple Silicon)
test_encoding = tokenizer(test_input, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **test_encoding,
        max_length=50,
        num_beams=2,
        early_stopping=True
    )
    pre_training_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Input: {test_input}")
    print(f"Pre-training output: {pre_training_output}")

print("\n🚀 Starting training...")
print("This may take a while depending on your hardware.")
print("📊 Training progress will be shown below:")

# Start training
train_result = trainer.train()

print("\n🎉 Training completed!")
print(f"📊 Training Results:")
print(f"  Final loss: {train_result.training_loss:.4f}")
print(f"  Training time: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"  Samples per second: {train_result.metrics['train_samples_per_second']:.2f}")

## Step 10: Model Evaluation 📈

Evaluate the trained model:

In [None]:
# Evaluate the model
print("📊 Evaluating trained model...")
eval_results = trainer.evaluate()

print(f"\n📈 Evaluation Results:")
for key, value in eval_results.items():
    if key.startswith('eval_'):
        metric_name = key.replace('eval_', '')
        print(f"  {metric_name}: {value:.4f}")

# Test the model with sample inputs
print("\n🧪 Testing trained model with sample inputs:")

test_cases = [
    "romantic chat: Good morning beautiful! ☀️",
    "romantic chat: I'm feeling sad today 😢",
    "romantic chat: Tumhe pata hai main tumse kitna pyaar karta hun? 💕",
    "romantic chat: What would you do if I was there with you?",
    "romantic chat: Mujhe tumhari yaad aa rahi hai 💭"
]

for test_input in test_cases:
    # Tokenize input
    inputs = tokenizer(test_input, return_tensors="pt").to(model.device)
    
    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_length=100,
            num_beams=3,
            early_stopping=True,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id
        )
    
    # Decode response
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    print(f"\n💬 Input: {test_input.replace('romantic chat: ', '')}")
    print(f"🤖 Response: {generated_text}")

## Step 11: Save the Trained Model 💾

Save the model and tokenizer for future use:

In [None]:
# Save the trained model and tokenizer
save_directory = "./flan-t5-romantic-chat-final"
os.makedirs(save_directory, exist_ok=True)

print(f"💾 Saving model to {save_directory}...")

# Save model and tokenizer
trainer.save_model(save_directory)
tokenizer.save_pretrained(save_directory)

print("✅ Model and tokenizer saved successfully!")
print(f"📁 Model location: {save_directory}")

# Also save training configuration
config_info = {
    "model_name": MODEL_NAME,
    "training_examples": len(tokenized_datasets['train']),
    "validation_examples": len(tokenized_datasets['validation']),
    "epochs": training_args.num_train_epochs,
    "batch_size": training_args.per_device_train_batch_size,
    "learning_rate": training_args.learning_rate,
    "final_eval_loss": eval_results.get('eval_loss', 'N/A'),
    "final_rouge1": eval_results.get('eval_rouge1', 'N/A')
}

with open(f"{save_directory}/training_info.json", 'w') as f:
    json.dump(config_info, f, indent=2)

print("📋 Training configuration saved to training_info.json")

## Step 12: Load and Test Saved Model 🔄

Demonstrate how to load and use the saved model:

In [None]:
# Function to load and test the saved model
def load_and_test_model(model_path: str):
    """
    Load the saved model and test it
    """
    print(f"🔄 Loading model from {model_path}...")
    
    # Load tokenizer and model
    loaded_tokenizer = T5Tokenizer.from_pretrained(model_path)
    loaded_model = T5ForConditionalGeneration.from_pretrained(model_path)
    loaded_model.to(device)
    
    print("✅ Model loaded successfully!")
    
    return loaded_tokenizer, loaded_model

def generate_romantic_response(model, tokenizer, user_input: str) -> str:
    """
    Generate a romantic response using the trained model
    """
    # Format input for the model
    formatted_input = f"romantic chat: {user_input}"
    
    # Tokenize input
    inputs = tokenizer(formatted_input, return_tensors="pt").to(device)
    
    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_length=150,
            num_beams=4,
            early_stopping=True,
            temperature=0.8,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Decode and return response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Test the loading function
loaded_tokenizer, loaded_model = load_and_test_model(save_directory)

# Scripted test messages (see the Bonus section for an interactive loop)
print("\n💕 Romantic Chat Test:")
print("Sample messages and the model's responses:\n")

test_messages = [
    "Hey beautiful, how are you today?",
    "I'm missing you so much right now", 
    "Tumhe pata hai tum kitne special ho?",
    "What would you do if I surprised you?",
    "I love talking to you"
]

for message in test_messages:
    response = generate_romantic_response(loaded_model, loaded_tokenizer, message)
    print(f"💬 You: {message}")
    print(f"🤖 AI: {response}")
    print("-" * 50)

## 🎉 Training Complete!

### 📊 Summary
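You have fine-tuned a Flan-T5 model on romantic chat conversations. Depending on the quality and coverage of the training data, the model is intended to:

- Generate romantic and flirty responses
- Handle both English and Hindi/Hinglish inputs
- Match the emotional tone of a single message (each input is processed independently, so there is no memory of earlier turns)
- Respond with emojis and emotional expressions, provided the tokenizer vocabulary covers them (the tokenization test in Step 5 shows whether 💕 survives a round trip)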

### 🚀 Next Steps

1. **Improve the Model:**
   - Add more diverse training data
   - Experiment with different model sizes (base, large)
   - Fine-tune hyperparameters
   - Add more evaluation metrics

2. **Deploy the Model:**
   - Create a chatbot interface
   - Build a web application
   - Deploy to cloud platforms
   - Create mobile app integration

3. **Advanced Features:**
   - Add personality consistency
   - Implement conversation memory
   - Add safety filters
   - Create different conversation modes

### 📁 Files Generated
- `./flan-t5-romantic-chat-final/` - Trained model and tokenizer
- `./flan-t5-romantic-chat/` - Training checkpoints and logs
- `./flan-t5-romantic-chat-final/training_info.json` - Training configuration and results

---
**Happy chatting! 💕**

## Bonus: Simple Chat Interface 💬

Create a simple interactive chat interface to test your model:

In [None]:
def interactive_chat():
    """
    Simple interactive chat interface
    """
    print("💕 Welcome to your Romantic AI Chatbot!")
    print("Type 'quit' to exit\n")
    
    while True:
        user_input = input("💬 You: ")
        
        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("🤖 AI: Goodbye my love! 💕")
            break
        
        if user_input.strip():
            response = generate_romantic_response(loaded_model, loaded_tokenizer, user_input)
            print(f"🤖 AI: {response}\n")
        else:
            print("🤖 AI: Please say something, darling! 💖\n")

# Uncomment the line below to start the interactive chat
# interactive_chat()
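
The chat loop above sends each message to the model on its own, so the model has no memory of earlier turns. One possible approach to conversation memory is to fold the last few turns into the input text. The sketch below shows the string formatting only. `build_context` and the `User:` / `Partner:` labels are hypothetical, and the model would need training data in the same format before this helps.

```python
from typing import List, Tuple


def build_context(history: List[Tuple[str, str]],
                  user_input: str,
                  max_turns: int = 2) -> str:
    """Join the last `max_turns` (user, reply) pairs with the new message."""
    recent = history[-max_turns:] if max_turns > 0 else []
    parts = [f"User: {user_msg} Partner: {ai_msg}" for user_msg, ai_msg in recent]
    parts.append(f"User: {user_input}")
    return " ".join(parts)


# Hypothetical conversation so far
history = [
    ("Hi", "Hello love"),
    ("How are you?", "Better now"),
    ("Miss me?", "Always"),
]

context = build_context(history, "Good night", max_turns=2)
print(context)
```

The resulting string could be passed as `user_input` to `generate_romantic_response`, which adds the `romantic chat:` prefix itself. Keeping `max_turns` small matters because inputs longer than the tokenizer's `max_length` are truncated.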

## 💡 Training Tips & Best Practices

### Performance Optimization
- **GPU Memory**: If you run out of GPU memory, reduce `per_device_train_batch_size`
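- **Effective Batch Size**: When you reduce the per-device batch size, increase `gradient_accumulation_steps` proportionally to keep the effective batch size constant (this saves memory but does not speed up training)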
- **Model Size**: Start with `flan-t5-small`, then try `flan-t5-base` or `flan-t5-large`
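
The trade-off between batch size and gradient accumulation is simple arithmetic. The sketch below estimates the effective batch size and the total number of optimizer steps. `training_steps` is a hypothetical helper, the dataset size of 1000 is made up, and the step count is an approximation of what the Trainer computes.

```python
import math
from typing import Tuple


def training_steps(num_examples: int,
                   per_device_batch_size: int,
                   grad_accum_steps: int,
                   num_epochs: int,
                   num_devices: int = 1) -> Tuple[int, int]:
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


# Settings from Step 7 with an assumed 1000 training examples
print(training_steps(1000, per_device_batch_size=4, grad_accum_steps=2, num_epochs=3))

# Halving the batch size and doubling accumulation keeps the same effective batch
print(training_steps(1000, per_device_batch_size=2, grad_accum_steps=4, num_epochs=3))
```

Comparing the total step count with `warmup_steps=100` and `eval_steps=200` from Step 7 shows whether a run is long enough to finish warmup and to be evaluated more than once.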

### Data Quality
- **Diverse Examples**: Include various conversation styles and emotions
- **Language Mix**: Balance English and Hindi/Hinglish examples
- **Quality Control**: Remove inappropriate or low-quality conversations

### Hyperparameter Tuning
- **Learning Rate**: Try values between 1e-5 and 1e-4
- **Epochs**: Monitor validation loss to avoid overfitting
- **Temperature**: A generation-time setting rather than a training hyperparameter; higher values (0.8-1.2) give more varied, creative responses
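
To see why temperature changes how varied the responses are, the sketch below applies a softmax with different temperatures to a made-up set of three token scores. Sampling divides the scores by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. `softmax_with_temperature` is a hypothetical helper.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert scores to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens
logits = [2.0, 1.0, 0.0]

for temperature in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    formatted = ", ".join(f"{p:.2f}" for p in probs)
    print(f"temperature {temperature}: [{formatted}]")
```

At lower temperatures the top-scoring token takes most of the probability mass, so sampled replies become more predictable; at higher temperatures less likely tokens are chosen more often.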

### Evaluation
- **Human Evaluation**: Have humans rate response quality
- **Conversation Flow**: Test multi-turn conversations
- **Safety**: Check for inappropriate content generation

---

*This notebook provides a complete pipeline for training romantic chatbots using Flan-T5. Customize the parameters and data according to your specific needs!*

## 📝 Model Usage Notes

### Loading the Model in Production

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load your trained model
model_path = "./flan-t5-romantic-chat-final"
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

# Generate response
def chat_response(user_input):
    input_text = f"romantic chat: {user_input}"
    inputs = tokenizer(input_text, return_tensors="pt")
    outputs = model.generate(inputs.input_ids, max_length=100, temperature=0.7, do_sample=True)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response
```

### ⚠️ Important Considerations

1. **Content Safety**: Always implement content filters for production use
2. **User Privacy**: Don't store personal conversations without consent
3. **Bias Monitoring**: Regularly check for and mitigate potential biases
4. **Performance**: Monitor response quality and retrain as needed
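
As a starting point for the content-safety item, the sketch below wraps a response function with a simple blocklist check on both the user message and the generated reply. Everything here is hypothetical (`make_safe_responder`, the blocked terms, the stand-in models), and a keyword list is only a first line of defense, not a complete safety system.

```python
import re
from typing import Callable, Iterable


def make_safe_responder(generate: Callable[[str], str],
                        blocked_terms: Iterable[str],
                        fallback: str = "Let's talk about something else.") -> Callable[[str], str]:
    """Wrap `generate` so blocked terms in the input or output yield a fallback reply."""
    terms = [term for term in blocked_terms if term]
    if not terms:
        return generate  # nothing to filter
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )

    def respond(user_input: str) -> str:
        if pattern.search(user_input):
            return fallback
        reply = generate(user_input)
        if pattern.search(reply):
            return fallback
        return reply

    return respond


def echo_model(text: str) -> str:
    """Stand-in for the real model call."""
    return f"You said: {text}"


def leaky_model(text: str) -> str:
    """Stand-in for a model that produces a blocked term on its own."""
    return "My password is hunter2"


safe_echo = make_safe_responder(echo_model, ["password", "address"])
safe_leaky = make_safe_responder(leaky_model, ["password", "address"])

print(safe_echo("Good morning"))
print(safe_echo("What is your password?"))
print(safe_leaky("Tell me a secret"))
```

In this notebook, `generate` could be a small function that calls `chat_response` from the snippet above.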

---

**🎯 Ready to create amazing romantic conversations! Your AI companion is now trained and ready to spread love! 💕**