# COT Synthetic Data Generator - Google Colab

This notebook runs the COT Synthetic Dataset Generator on Google Colab.

**Setup Steps:**
1. Change runtime to GPU (Runtime → Change runtime type → T4 GPU)
2. Run the cells in order, choosing one configuration in section 6
3. Download results using the cell in section 8

**Estimated Time:** 30-60 minutes for 20 seeds

## 1. Clone Repository

In [None]:
# Replace with your repository URL
REPO_URL = "https://github.com/YOUR_USERNAME/YOUR_REPO.git"

!git clone {REPO_URL}

# Change to repository directory (update if different)
%cd YOUR_REPO

# Verify files
!ls -la

## 2. Install Ollama

In [None]:
%%bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama server in background
nohup ollama serve > ollama.log 2>&1 &

# Wait for server to start
sleep 5

# Verify Ollama is running
curl http://localhost:11434/api/tags
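
The fixed `sleep 5` above assumes the server is up within five seconds. A minimal sketch of a polling alternative follows, assuming the same `/api/tags` endpoint; the helper names `probe_ollama` and `wait_until_ready` are hypothetical and not part of the repository.

```python
import time
import urllib.request

# Same endpoint the cell above queries with curl.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def probe_ollama(url=OLLAMA_TAGS_URL, timeout=2.0):
    """Return True if the server answers the request with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, refused connections, and timeouts
        return False


def wait_until_ready(probe=probe_ollama, attempts=30, delay=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, pausing `delay` seconds between tries.

    Returns the 1-based attempt number that succeeded, or None if none did.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None

# In a notebook cell, after starting the server:
#     wait_until_ready()
```

A `None` result suggests looking at `ollama.log` before pulling models.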

## 3. Install Python Dependencies

In [None]:
!pip install -q pyyaml jinja2 pandas pyarrow
print("✓ Dependencies installed")

## 4. Pull Models

**For Colab Free Tier (T4 GPU):** Use smaller models (1.7b, 2b, 4b)

**For Colab Pro (A100):** You can use larger models (8b, 14b)

In [None]:
# Recommended for FREE tier
!ollama pull qwen3:1.7b
!ollama pull gemma2:2b

# Uncomment for Colab Pro or if you want better quality
# !ollama pull qwen3:4b
# !ollama pull deepseek-r1:8b

# List available models
!ollama list
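
To confirm programmatically that the pulls succeeded, the `/api/tags` response can be compared against the tags you expect. This sketch assumes the response body is a JSON object with a `models` list whose entries carry a `name` field; the helper names and the sample payload are illustrative only.

```python
import json


def installed_model_names(tags_payload):
    """Extract model names from a decoded /api/tags response body."""
    return {entry["name"] for entry in tags_payload.get("models", []) if "name" in entry}


def missing_models(tags_payload, wanted):
    """Return the wanted tags absent from the server, preserving their order."""
    installed = installed_model_names(tags_payload)
    return [name for name in wanted if name not in installed]


# Illustrative payload only; a real response carries more fields per model.
sample_payload = json.loads('{"models": [{"name": "qwen3:1.7b"}, {"name": "gemma2:2b"}]}')
wanted = ["qwen3:1.7b", "gemma2:2b", "deepseek-r1:8b"]
print(missing_models(sample_payload, wanted))  # → ['deepseek-r1:8b']
```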

## 5. Test Run (Dry Run)

In [None]:
!python run_pipeline.py --max-seeds 2 --dry-run

## 6. Generate Data

Choose one of the configurations below based on your needs:

In [None]:
# Configuration 1: FAST (Free Tier) - Small model, short context
!python run_pipeline.py \
    --model-strategy fixed \
    --model qwen3-1.7b \
    --ctx-mode fixed \
    --fixed-tokens 1024 \
    --max-seeds 20 \
    --samples-per-seed 2 \
    --output-format both

In [None]:
# Configuration 2: BALANCED (Free Tier) - Medium quality
!python run_pipeline.py \
    --model-strategy random \
    --ctx-mode profile \
    --max-seeds 15 \
    --samples-per-seed 3 \
    --output-format both

In [None]:
# Configuration 3: QUALITY (Colab Pro) - Best results
# Uncomment if you have Colab Pro and pulled larger models
# !python run_pipeline.py \
#     --model-strategy fixed \
#     --model deepseek-r1-8b \
#     --ctx-mode long_cot \
#     --max-seeds 30 \
#     --samples-per-seed 3 \
#     --output-format both
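
If you switch between configurations often, one option is to keep them as data and build the command from a preset. The sketch below copies the flags and values from the three cells above; the preset names and the `build_command` helper are hypothetical.

```python
import shlex

# Flag names and values are copied from the three configuration cells;
# the preset names are hypothetical labels introduced for this sketch.
PRESETS = {
    "fast": {
        "model-strategy": "fixed",
        "model": "qwen3-1.7b",
        "ctx-mode": "fixed",
        "fixed-tokens": 1024,
        "max-seeds": 20,
        "samples-per-seed": 2,
    },
    "balanced": {
        "model-strategy": "random",
        "ctx-mode": "profile",
        "max-seeds": 15,
        "samples-per-seed": 3,
    },
    "quality": {
        "model-strategy": "fixed",
        "model": "deepseek-r1-8b",
        "ctx-mode": "long_cot",
        "max-seeds": 30,
        "samples-per-seed": 3,
    },
}


def build_command(preset, script="run_pipeline.py"):
    """Return the argument list for one preset, ready for subprocess.run."""
    args = ["python", script]
    for flag, value in PRESETS[preset].items():
        args += [f"--{flag}", str(value)]
    # All three configurations write both output formats.
    args += ["--output-format", "both"]
    return args


print(shlex.join(build_command("balanced")))
```

In a notebook cell the list can be passed to `subprocess.run(build_command("fast"), check=True)` instead of using the `!python` form.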

## 7. Check Results

In [None]:
import pandas as pd
import os

# List output files
!ls -lh output/

# Load and preview data
output_dir = 'output'
for file in os.listdir(output_dir):
    if file.endswith('.parquet'):
        df = pd.read_parquet(os.path.join(output_dir, file))
        print(f"\n{'='*60}")
        print(f"File: {file}")
        print(f"Rows: {len(df)}")
        print(f"Columns: {list(df.columns)}")
        print("\nSample:")
        print(df.head(2))
        print(f"{'='*60}")

## 8. Download Results

In [None]:
from google.colab import files
import shutil

# Create zip file
shutil.make_archive('synthetic_data_output', 'zip', 'output')

# Download
files.download('synthetic_data_output.zip')

print("✓ Download started! Check your browser's download folder.")

## 9. (Optional) Resume Generation

If your session disconnects, re-run sections 1-4 (clone, install Ollama, install dependencies, pull models), then use this cell to resume:

In [None]:
!python run_pipeline.py \
    --resume \
    --max-seeds 20 \
    --output-format both

## 10. (Optional) Monitor Ollama Logs

In [None]:
# View last 50 lines of Ollama logs
!tail -n 50 ollama.log

## Troubleshooting

### Ollama Not Working
```python
# Check if Ollama is running (pgrep avoids matching the grep process itself)
!pgrep -a ollama

# Restart Ollama
!pkill ollama
!nohup ollama serve > ollama.log 2>&1 &
!sleep 5
```

### Out of Memory
- Use smaller models (qwen3:1.7b, gemma2:2b)
- Reduce `--fixed-tokens` (to 1024, or to 512 if that still runs out of memory)
- Reduce `--max-seeds`
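
Before changing settings, it can help to see how much GPU memory is actually in use. A minimal sketch, assuming `nvidia-smi` is on the path (as it normally is on a Colab GPU runtime) and reports plain integer MiB values; the helper names are hypothetical.

```python
import subprocess

# Ask nvidia-smi for used and total memory in MiB, one CSV row per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows into a list of (used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory():
    """Run nvidia-smi and return per-GPU memory; [] if the tool is unavailable."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_memory(result.stdout)
```

Calling `gpu_memory()` between runs shows whether a previously loaded model is still holding memory.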

### Session Timeout
- Download results periodically
- Use `--resume` flag to continue
- Consider Colab Pro for longer sessions
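
For periodic downloads, one approach is to write a timestamped archive each time so earlier snapshots are not overwritten. This sketch reuses `shutil.make_archive` from section 8; the `snapshot_output` helper and its default names are assumptions, not part of the repository.

```python
import shutil
import time
from pathlib import Path


def snapshot_output(output_dir="output", dest_dir=".", prefix="synthetic_data_output"):
    """Zip output_dir into dest_dir under a timestamped name; return the zip path."""
    source = Path(output_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"No such output directory: {source}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = Path(dest_dir) / f"{prefix}_{stamp}"
    # make_archive appends ".zip" and returns the full archive path as a string.
    return Path(shutil.make_archive(str(base), "zip", root_dir=str(source)))
```

In Colab, the returned path can be handed to `files.download(str(path))` as in section 8.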