# 🎯 CoLA Dataset Trainer - Gradio Interface

**Interactive notebook for launching the CoLA (Corpus of Linguistic Acceptability) training interface**

This notebook provides an easy way to:
- Install required dependencies
- Launch the Gradio web interface
- Monitor training progress
- Access the interface from any device

---

## 📦 Step 1: Install Dependencies

First, let's install all the required packages for the CoLA trainer:

In [None]:
# Install required packages
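# psutil is needed by the monitoring cell in Step 6
!pip install -q gradio torch transformers datasets scikit-learn pandas numpy tqdm evaluate accelerate psutil

# This line runs even if pip failed, so check the output above for errors
print("✅ Install command finished")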
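
Because the install cell prints its message whether or not `pip` succeeded, it can help to confirm that each package is actually importable. The cell below is a minimal sketch of that check. It assumes the import names listed in `REQUIRED` (note that `scikit-learn` is imported as `sklearn`) and also checks `psutil`, which the monitoring cell in Step 6 imports. `find_missing` is a helper name made up for this sketch.

```python
import importlib.util

# Map pip distribution names to import names; they differ for scikit-learn.
REQUIRED = {
    "gradio": "gradio",
    "torch": "torch",
    "transformers": "transformers",
    "datasets": "datasets",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "numpy": "numpy",
    "tqdm": "tqdm",
    "evaluate": "evaluate",
    "accelerate": "accelerate",
    "psutil": "psutil",  # used by the monitoring cell in Step 6
}

def find_missing(required=REQUIRED):
    """Return the pip names whose import module cannot be located."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]

missing = find_missing()
if missing:
    print(f"❌ Missing packages: {', '.join(missing)}")
else:
    print("✅ All required packages are importable")
```

`find_spec` only locates a module without importing it, so this check stays fast even for heavy packages such as `torch`.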

## 🔧 Step 2: Check System Information

Let's check what hardware we're working with:

In [None]:
import torch
import platform
import os

print("🖥️  System Information:")
print(f"   Platform: {platform.system()} {platform.release()}")
print(f"   Python: {platform.python_version()}")
print(f"   PyTorch: {torch.__version__}")
print(f"   CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   CUDA Device: {torch.cuda.get_device_name()}")
    print(f"   CUDA Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("   Using CPU for training")

print(f"\n📁 Current Directory: {os.getcwd()}")

## 📊 Step 3: Create Sample CoLA Dataset (Optional)

If you don't have a CoLA dataset ready, let's create a sample one for testing:

In [None]:
import pandas as pd
import json
from pathlib import Path

# Create sample CoLA dataset
sample_data = [
    # Acceptable sentences (label = 1)
    {"sentence": "The cat sat on the mat.", "label": 1},
    {"sentence": "She quickly ran to the store.", "label": 1},
    {"sentence": "I think that he will come tomorrow.", "label": 1},
    {"sentence": "The book on the table is mine.", "label": 1},
    {"sentence": "We went to the movies last night.", "label": 1},
    {"sentence": "The dog barked loudly at the stranger.", "label": 1},
    {"sentence": "Can you help me with this problem?", "label": 1},
    {"sentence": "The weather is beautiful today.", "label": 1},
    {"sentence": "She speaks three languages fluently.", "label": 1},
    {"sentence": "The children are playing in the garden.", "label": 1},
    
    # Unacceptable sentences (label = 0)
    {"sentence": "Cat the sat mat on the.", "label": 0},
    {"sentence": "Quickly she to store the ran.", "label": 0},
    {"sentence": "Think I that tomorrow will he come.", "label": 0},
    {"sentence": "Book the table on mine is the.", "label": 0},
    {"sentence": "Movies we the to last went night.", "label": 0},
    {"sentence": "Barked dog the loudly stranger at the.", "label": 0},
    {"sentence": "Help can with me you problem this?", "label": 0},
    {"sentence": "Weather beautiful the today is.", "label": 0},
    {"sentence": "Languages three speaks fluently she.", "label": 0},
    {"sentence": "Playing children the garden in are the.", "label": 0},
]

# Create data directory if it doesn't exist
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

# Save as CSV
df = pd.DataFrame(sample_data)
csv_path = data_dir / "sample_cola_dataset.csv"
df.to_csv(csv_path, index=False)

# Save as JSON
json_path = data_dir / "sample_cola_dataset.json"
with open(json_path, 'w') as f:
    json.dump(sample_data, f, indent=2)

print("✅ Sample CoLA datasets created:")
print(f"   📁 CSV: {csv_path}")
print(f"   📁 JSON: {json_path}")
print(f"\n📊 Dataset Info:")
print(f"   Total samples: {len(sample_data)}")
print(f"   Acceptable: {sum(1 for x in sample_data if x['label'] == 1)}")
print(f"   Unacceptable: {sum(1 for x in sample_data if x['label'] == 0)}")

# Display preview
print("\n📋 Dataset Preview:")
display(df.head())
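
Before uploading a file, it may be worth checking that it has the shape of the sample above. The sketch below assumes a CSV with `sentence` and `label` columns and labels of 0 or 1, as in the sample dataset; the format that `gradio_cola_trainer` actually requires may differ. `validate_cola_csv` is a hypothetical helper, not part of the trainer.

```python
import csv
from collections import Counter
from pathlib import Path

def validate_cola_csv(path):
    """Check a CSV with `sentence` and `label` columns.

    Returns (label_counts, problems), where problems is a list of messages.
    """
    counts = Counter()
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = {"sentence", "label"} - set(reader.fieldnames or [])
        if missing:
            return counts, [f"missing columns: {sorted(missing)}"]
        for row_no, row in enumerate(reader, start=1):
            sentence = (row["sentence"] or "").strip()
            label = (row["label"] or "").strip()
            if not sentence:
                problems.append(f"row {row_no}: empty sentence")
            if label in {"0", "1"}:
                counts[int(label)] += 1
            else:
                problems.append(f"row {row_no}: label must be 0 or 1, got {label!r}")
    return counts, problems

# Validate the sample file if it has been created in this session
sample_path = Path("data/sample_cola_dataset.csv")
if sample_path.exists():
    counts, problems = validate_cola_csv(sample_path)
    print(f"Acceptable: {counts[1]}, Unacceptable: {counts[0]}")
    for problem in problems:
        print(f"⚠️  {problem}")
```

A heavily skewed label count is worth noticing here, since it affects how the accuracy numbers should be read later.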

## 🚀 Step 4: Launch Gradio Interface

Now let's launch the main Gradio interface for CoLA training:

In [None]:
# Import the CoLA trainer
import sys
import os

# Add current directory to path to import our module
current_dir = os.getcwd()
if current_dir not in sys.path:
    sys.path.append(current_dir)

# Import the gradio interface
try:
    from gradio_cola_trainer import create_interface, cola_trainer
    print("✅ CoLA trainer module imported successfully!")
except ImportError as e:
    print(f"❌ Error importing CoLA trainer: {e}")
    print("Make sure gradio_cola_trainer.py is in the current directory.")

In [None]:
# Launch the Gradio interface
print("🚀 Launching Gradio Interface...")
print("\n📱 The interface will be available at:")
print("   • Local: http://localhost:7860")
print("   • Network: Available to other devices on your network")
print("   • Public: Shareable link will be generated\n")

# Create and launch interface
interface = create_interface()

# Launch with custom settings for notebook environment
interface.launch(
    server_name="0.0.0.0",  # Allow external access
    server_port=7860,       # Standard port
    share=True,             # Create public link
    debug=False,            # Disable debug in notebook
    show_error=True,        # Show errors in interface
    inline=False            # Don't embed in the notebook output (use inbrowser=True to open a browser tab)
)
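
The launch call above combines `server_name="0.0.0.0"` with `share=True`, which makes the interface reachable from outside your machine. The sketch below shows one way to keep a local-only default and require credentials before going public. `build_launch_kwargs` and the `GRADIO_USER` / `GRADIO_PASSWORD` environment variables are hypothetical names chosen for this example; Gradio itself does not read them. The `auth` argument of `launch()` is used here in its `(username, password)` tuple form.

```python
import os

def build_launch_kwargs(public=False):
    """Return keyword arguments for `interface.launch()`.

    GRADIO_USER and GRADIO_PASSWORD are hypothetical variable names
    used only by this sketch.
    """
    kwargs = {
        "server_name": "0.0.0.0" if public else "127.0.0.1",
        "server_port": 7860,
        "share": public,
        "show_error": True,
    }
    if public:
        user = os.environ.get("GRADIO_USER")
        password = os.environ.get("GRADIO_PASSWORD")
        if not (user and password):
            raise ValueError(
                "Set GRADIO_USER and GRADIO_PASSWORD before exposing the interface"
            )
        kwargs["auth"] = (user, password)
    return kwargs

# Example (requires `interface` from the cell above):
# interface.launch(**build_launch_kwargs(public=False))
```

Keeping the settings in one function makes it easy to reuse the same choices when restarting the interface later.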

## 🎯 Step 5: Quick Training Guide

Once the interface is launched, follow these steps:

### 1. **Upload Dataset** 📂
- Go to the "Dataset Upload" tab
- Upload your CSV, TSV, or JSON file in CoLA format
- Or use the sample dataset we created: `data/sample_cola_dataset.csv`
- Verify the dataset preview looks correct

### 2. **Configure Training** ⚙️
- Go to the "Model Training" tab
- Choose model size:
  - **Small**: Fast, good for testing
  - **Base**: Balanced speed/accuracy
  - **Large**: Best accuracy, slower
- Adjust parameters as needed
- Click "Start Training"

### 3. **Test Model** 🧪
- Go to the "Model Testing" tab
- Enter sentences to test (one per line)
- Click "Test Sentences" to see predictions

### 4. **Get Help** ❓
- Check the "Help" tab for detailed instructions
- Find tips for better training results
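
When you compare the predictions from the "Model Testing" tab against known labels, keep in mind that CoLA results are usually reported as the Matthews correlation coefficient (MCC), the metric the GLUE benchmark uses for this task, rather than plain accuracy. The sketch below computes it in pure Python for binary labels. The label lists are made-up illustration data, and whether the trainer itself reports MCC is not something this notebook shows.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Illustration data only: 1 = acceptable, 0 = unacceptable
gold = [1, 1, 0, 0]
predicted = [1, 0, 0, 0]
print(f"MCC: {mcc(gold, predicted):.3f}")
```

MCC ranges from -1 to 1, and a model that always predicts the same label scores 0, which makes it more informative than accuracy on imbalanced data.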

---

## 📊 Step 6: Monitor Training (Optional)

You can monitor system resources (CPU, RAM, and GPU memory) while training runs:

In [None]:
import time
import psutil
import torch
from IPython.display import clear_output

def monitor_system(duration_minutes=5):
    """Monitor system resources during training"""
    end_time = time.time() + (duration_minutes * 60)
    
    while time.time() < end_time:
        clear_output(wait=True)
        
        # CPU and Memory
        cpu_percent = psutil.cpu_percent(interval=1)
        memory = psutil.virtual_memory()
        
        print("📊 System Monitoring:")
        print(f"   CPU Usage: {cpu_percent:.1f}%")
        print(f"   Memory Usage: {memory.percent:.1f}% ({memory.used / 1e9:.1f}GB / {memory.total / 1e9:.1f}GB)")
        
        # GPU memory if available (counts only PyTorch allocations made by this process)
        if torch.cuda.is_available():
            gpu_memory = torch.cuda.get_device_properties(0).total_memory
            gpu_allocated = torch.cuda.memory_allocated()
            gpu_reserved = torch.cuda.memory_reserved()
            
            print(f"   GPU Memory: {gpu_allocated / 1e9:.1f}GB allocated, {gpu_reserved / 1e9:.1f}GB reserved")
            print(f"   GPU Memory Used: {(gpu_allocated / gpu_memory * 100):.1f}% of total")
        
        print(f"\n⏰ Monitoring for {duration_minutes} minutes...")
        print("   (Stop this cell to end monitoring)")
        
        time.sleep(5)

# Uncomment to start monitoring
# monitor_system(duration_minutes=10)
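
The monitor above overwrites its output on every refresh, so nothing is kept for later inspection. If you would rather record readings and summarize them afterwards, a small sampler like the sketch below can do that. `collect_samples` and `summarize` are names made up for this example, and the `probe` argument is any zero-argument function that returns a number.

```python
import time

def collect_samples(probe, interval_seconds=5.0, count=12):
    """Call `probe()` `count` times, `interval_seconds` apart.

    Returns a list of (elapsed_seconds, value) pairs.
    """
    start = time.monotonic()
    samples = []
    for i in range(count):
        samples.append((time.monotonic() - start, probe()))
        if i < count - 1:
            time.sleep(interval_seconds)
    return samples

def summarize(samples):
    """Return the min, max, and mean of the sampled values."""
    values = [value for _, value in samples]
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }

# Example with psutil (uncomment to run for about one minute):
# import psutil
# history = collect_samples(lambda: psutil.virtual_memory().percent)
# print(summarize(history))
```

Passing the probe as a function keeps the sampler independent of `psutil` or `torch`, so the same loop can record CPU load, RAM, or GPU memory.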

## 🔄 Step 7: Restart Interface (If Needed)

If you need to restart the interface with different settings:

In [None]:
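# Stop current interface (if running)
try:
    interface.close()
    print("✅ Previous interface closed")
except NameError:
    # `interface` was never created in this session
    print("ℹ️  No interface was running")
except Exception as e:
    print(f"⚠️  Could not close the previous interface: {e}")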

# Create new interface
interface = create_interface()

# Launch with new settings
interface.launch(
    server_name="0.0.0.0",
    server_port=7861,  # Different port if 7860 is busy
    share=True,
    debug=False,
    inline=False
)
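
Rather than guessing which port is free, you can probe for one first. The sketch below tries to bind a socket on each candidate port and returns the first that succeeds. `find_free_port` is a helper name made up for this example. Note the small race: another process could still take the port between the check and the actual launch.

```python
import socket

def find_free_port(start=7860, attempts=20, host="0.0.0.0"):
    """Return the first port in [start, start + attempts) that can be bound on `host`."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is in use or not permitted; try the next one
            return port
    raise RuntimeError(f"No free port found in {start}-{start + attempts - 1}")

# Example (requires `interface` from the cell above):
# interface.launch(server_name="0.0.0.0", server_port=find_free_port(), share=True)
```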

## 📁 Step 8: File Management

Useful commands for managing your training files:

In [None]:
import os
import glob
from pathlib import Path

# List all files in current directory
print("📁 Current Directory Contents:")
for item in sorted(os.listdir('.')):
    if os.path.isdir(item):
        print(f"   📂 {item}/")
    else:
        print(f"   📄 {item}")

# List data files
print("\n📊 Data Files:")
data_files = glob.glob("data/*")
for file in sorted(data_files):
    print(f"   📄 {file}")

# List any trained models
print("\n🤖 Trained Models:")
model_dirs = glob.glob("cola_model_*")
if model_dirs:
    for model_dir in sorted(model_dirs):
        print(f"   🎯 {model_dir}")
else:
    print("   No trained models found yet")
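
Model checkpoints can take up a lot of disk space, so it can be useful to see how large each model directory is. The sketch below reuses the `cola_model_*` pattern from the cell above and sums the file sizes in each matching directory. `list_model_dirs` and `directory_size_bytes` are helper names made up for this example.

```python
from pathlib import Path

def directory_size_bytes(directory):
    """Total size of all regular files under `directory`, in bytes."""
    return sum(p.stat().st_size for p in Path(directory).rglob("*") if p.is_file())

def list_model_dirs(root=".", pattern="cola_model_*"):
    """Return (name, size_bytes) for each matching directory, sorted by name."""
    dirs = sorted(p for p in Path(root).glob(pattern) if p.is_dir())
    return [(p.name, directory_size_bytes(p)) for p in dirs]

for name, size in list_model_dirs():
    print(f"   🎯 {name}: {size / 1e6:.1f} MB")
```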

## 🧹 Step 9: Cleanup (Optional)

Release the trained model and free CPU and GPU memory:

In [None]:
import gc
import torch

def cleanup_session():
    """Release the trained model and free CPU/GPU memory"""
    
    # Drop references to the model first so its memory can actually be reclaimed
    try:
        cola_trainer.model = None
        cola_trainer.tokenizer = None
        cola_trainer.trainer = None
        print("✅ Model cleared from memory")
    except (NameError, AttributeError):
        # cola_trainer was never imported in this session
        print("ℹ️  No model to clear")
    
    # Garbage collection
    gc.collect()
    print("✅ Garbage collection completed")
    
    # Release cached GPU memory now that the model tensors are gone
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print("✅ GPU memory cleared")

# Uncomment to run cleanup
# cleanup_session()

---

## 🎉 You're All Set!

The Gradio interface is now running and ready for CoLA dataset training. 

**Key Features Available:**
- ✅ Dataset upload and validation
- ✅ Interactive training configuration
- ✅ Real-time model testing
- ✅ Comprehensive help documentation
- ✅ Public sharing capabilities

**Next Steps:**
1. Upload your CoLA dataset or use the sample we created
2. Configure training parameters
3. Start training your model
4. Test the trained model on new sentences

Happy training! 🚀