# COT Synthetic Data Generator - Kaggle

This notebook runs the COT Synthetic Dataset Generator on Kaggle.

**Setup Steps:**
1. Settings → Accelerator → GPU T4 x2
2. Settings → Internet → ON
3. Settings → Persistence → ON
4. Add your dataset (if uploaded) or use GitHub
5. Run all cells in order

**Advantages:**
- 2x T4 GPUs (16 GB of VRAM each)
- GPU sessions of up to 12 hours
- Auto-saved outputs
- Better for large batches

**Estimated Time:** 1-2 hours for 50 seeds

## 1. Setup Project Files

In [None]:
import os
import shutil

# ============================================
# OPTION A: From Kaggle Dataset
# ============================================
# 1. Upload your project as a Kaggle dataset first
# 2. Add it to this notebook (Add Data → Your Datasets)
# 3. Update the path below

dataset_path = '/kaggle/input/cot-synthetic-data-generator/'

# Create working directory
!mkdir -p /kaggle/working/project
%cd /kaggle/working/project

# Extract if tar.gz
if os.path.exists(f'{dataset_path}synthetic-data-gen.tar.gz'):
    !tar -xzf {dataset_path}synthetic-data-gen.tar.gz -C /kaggle/working/project
    print("✓ Extracted from tar.gz")
elif os.path.exists(dataset_path):
    # Copy all files from dataset
    !cp -r {dataset_path}* /kaggle/working/project/
    print("✓ Copied from dataset")

# ============================================
# OPTION B: From GitHub
# ============================================
# Uncomment and update with your repo URL
# REPO_URL = "https://github.com/YOUR_USERNAME/YOUR_REPO.git"
# !git clone {REPO_URL} /kaggle/working/project
# %cd /kaggle/working/project

# Verify files
!ls -la
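
Before moving on, it can help to confirm that the copy or clone actually produced the pipeline entry point. The sketch below assumes `run_pipeline.py` sits at the top level of the project directory (later cells invoke it from there); `missing_files` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def missing_files(project_dir, required=("run_pipeline.py",)):
    """Return the names in `required` that are not regular files in `project_dir`."""
    root = Path(project_dir)
    return [name for name in required if not (root / name).is_file()]


# Path taken from the setup cell above; on a machine without this
# directory the check simply reports every required file as missing.
missing = missing_files("/kaggle/working/project")
if missing:
    print(f"Missing from project directory: {missing}")
else:
    print("Project files look complete")
```

If anything is reported missing, re-check `dataset_path` (Option A) or the repository URL (Option B) before continuing.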

## 2. Install Ollama

In [None]:
%%bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama server in background
nohup ollama serve > /kaggle/working/ollama.log 2>&1 &

# Wait for server to start (Kaggle needs more time)
sleep 10

# Verify Ollama is running
curl http://localhost:11434/api/tags || echo "Waiting for Ollama..."
sleep 5
curl http://localhost:11434/api/tags
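
The fixed `sleep` calls above are a guess at how long the server needs. A polling loop is one alternative: it returns as soon as the endpoint answers and gives up after a bounded number of attempts. This is a minimal sketch using only the standard library; `ollama_is_up` and `wait_for` are hypothetical helper names, and the URL is the same `/api/tags` endpoint queried by `curl` above.

```python
import time
import urllib.error
import urllib.request


def ollama_is_up(url="http://localhost:11434/api/tags", timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for(check, attempts=30, delay=1.0):
    """Call `check` up to `attempts` times, sleeping `delay` seconds between failures."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False
```

In a separate Python cell, `wait_for(ollama_is_up)` returns `True` once the server responds; if it returns `False`, `/kaggle/working/ollama.log` is the place to look.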

## 3. Install Python Dependencies

In [None]:
!pip install -q pyyaml jinja2 pandas pyarrow
print("✓ Dependencies installed")

## 4. Pull Models

**Kaggle has 2x T4 GPUs (16 GB of VRAM each)** - You can use larger models than Colab!

In [None]:
# Recommended for Kaggle (good balance of speed and quality)
!ollama pull qwen3:4b          # 2.6 GB - fast and good quality
!ollama pull deepseek-r1:8b    # 4.9 GB - best for reasoning

# Optional: Add more for diversity
# !ollama pull qwen3:8b        # 5.2 GB
# !ollama pull phi4-mini       # 2.5 GB
# !ollama pull gemma2:9b       # 5.4 GB

# List available models
!ollama list
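
A pull can fail silently if the connection drops, so it may be worth checking the installed models programmatically rather than reading the `ollama list` output by eye. The sketch below parses the JSON returned by Ollama's `/api/tags` endpoint, which lists installed models under a `models` key with a `name` field per entry. The helper names are hypothetical.

```python
import json
import urllib.request


def installed_models(tags):
    """Extract the set of model names from a parsed /api/tags response."""
    return {m["name"] for m in tags.get("models", [])}


def missing_models(tags, wanted):
    """Return the entries of `wanted` that are not installed, preserving order."""
    have = installed_models(tags)
    return [name for name in wanted if name not in have]


def fetch_tags(url="http://localhost:11434/api/tags"):
    """Fetch and parse the tag list from a running Ollama server."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

With the server running, `missing_models(fetch_tags(), ["qwen3:4b", "deepseek-r1:8b"])` should return an empty list once both recommended pulls have finished.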

## 5. Test Run (Dry Run)

In [None]:
!python run_pipeline.py --max-seeds 2 --dry-run

## 6. Small Test Generation

In [None]:
# Quick test to verify everything works
!python run_pipeline.py \
    --model-strategy fixed \
    --model qwen3-4b \
    --ctx-mode fixed \
    --fixed-tokens 2048 \
    --max-seeds 5 \
    --samples-per-seed 2 \
    --output-dir /kaggle/working/output

## 7. Full Generation

Choose one configuration based on your goals:

In [None]:
# Configuration 1: MAXIMUM THROUGHPUT - Random models, mixed contexts
!python run_pipeline.py \
    --model-strategy random \
    --ctx-mode profile \
    --max-seeds 100 \
    --samples-per-seed 3 \
    --output-format both \
    --output-dir /kaggle/working/output

In [None]:
# Configuration 2: QUALITY FOCUS - 8B model, long COT
!python run_pipeline.py \
    --model-strategy fixed \
    --model deepseek-r1-8b \
    --ctx-mode long_cot \
    --max-seeds 50 \
    --samples-per-seed 3 \
    --output-format both \
    --output-dir /kaggle/working/output

In [None]:
# Configuration 3: SPEED - 4B model, short context
!python run_pipeline.py \
    --model-strategy fixed \
    --model qwen3-4b \
    --ctx-mode fixed \
    --fixed-tokens 1024 \
    --max-seeds 200 \
    --samples-per-seed 2 \
    --output-format both \
    --output-dir /kaggle/working/output

In [None]:
# Configuration 4: SPECIFIC SKILLS - Focus on reasoning tasks
!python run_pipeline.py \
    --skills RSN-ARITH RSN-LOGIC RSN-CAUSAL \
    --model-strategy fixed \
    --model deepseek-r1-8b \
    --ctx-mode profile \
    --max-seeds 30 \
    --samples-per-seed 3 \
    --output-format both \
    --output-dir /kaggle/working/output
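
The four cells above differ only in a handful of flags, so one way to avoid running the wrong cell by accident is to keep the configurations as data and build the command from a chosen preset. The sketch below uses only flags that appear in the cells above; the preset names and `build_command` are hypothetical and not part of `run_pipeline.py`.

```python
import shlex

# Flag values copied from Configurations 1 and 3 above.
PRESETS = {
    "throughput": {
        "model-strategy": "random",
        "ctx-mode": "profile",
        "max-seeds": 100,
        "samples-per-seed": 3,
    },
    "speed": {
        "model-strategy": "fixed",
        "model": "qwen3-4b",
        "ctx-mode": "fixed",
        "fixed-tokens": 1024,
        "max-seeds": 200,
        "samples-per-seed": 2,
    },
}


def build_command(preset, output_dir="/kaggle/working/output"):
    """Return the argument list for run_pipeline.py under the named preset."""
    args = ["python", "run_pipeline.py"]
    for flag, value in PRESETS[preset].items():
        args += [f"--{flag}", str(value)]
    args += ["--output-format", "both", "--output-dir", output_dir]
    return args


# Print the command for review before running it.
print(shlex.join(build_command("speed")))
```

To execute a preset from Python instead of a `!` cell, pass the list to `subprocess.run(build_command("speed"), check=True)`.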

## 8. Check Results

In [None]:
import pandas as pd
import os

# List output files
!ls -lh /kaggle/working/output/

# Load and preview data
output_dir = '/kaggle/working/output'
if os.path.exists(output_dir):
    for file in os.listdir(output_dir):
        if file.endswith('.parquet'):
            filepath = os.path.join(output_dir, file)
            df = pd.read_parquet(filepath)
            print(f"\n{'='*70}")
            print(f"File: {file}")
            print(f"Rows: {len(df):,}")
            print(f"Columns: {list(df.columns)}")
            print(f"\nFirst 2 samples:")
            print(df.head(2))
            print(f"\nData types:")
            print(df.dtypes)
            print(f"{'='*70}")
else:
    print("No output directory found yet.")

## 9. Download Results

**Kaggle automatically saves files in `/kaggle/working/`**

You can download them from the **Output** tab on the right →

In [None]:
# Create a summary of generated data
import json
import os  # needed below; do not rely on the import from the previous cell

import pandas as pd

summary = {
    "total_samples": 0,
    "files": []
}

output_dir = '/kaggle/working/output'
if os.path.exists(output_dir):
    for file in os.listdir(output_dir):
        if file.endswith('.parquet'):
            filepath = os.path.join(output_dir, file)
            df = pd.read_parquet(filepath)
            summary["total_samples"] += len(df)
            summary["files"].append({
                "name": file,
                "rows": len(df),
                "size_mb": os.path.getsize(filepath) / (1024*1024)
            })

# Save summary
with open('/kaggle/working/generation_summary.json', 'w') as f:
    json.dump(summary, f, indent=2)

print("\n📊 Generation Summary:")
print(json.dumps(summary, indent=2))
print("\n✓ Summary saved to /kaggle/working/generation_summary.json")
print("✓ Download all files from the Output tab →")

## 10. (Optional) Resume Generation

If you need to continue generating more data:

In [None]:
!python run_pipeline.py \
    --resume \
    --max-seeds 50 \
    --output-dir /kaggle/working/output

## 11. (Optional) Push to HuggingFace Hub

In [None]:
# First, install huggingface_hub
!pip install -q huggingface_hub datasets

# Set your HuggingFace token (get from https://huggingface.co/settings/tokens)
HF_TOKEN = "your_token_here"
REPO_NAME = "your-username/dataset-name"

# Push to hub
!python run_pipeline.py \
    --push-to-hub {REPO_NAME} \
    --hf-token {HF_TOKEN} \
    --max-seeds 100
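
Pasting a token into a cell risks leaking it if the notebook is shared or published. One alternative, sketched below, is to read it from an environment variable or from Kaggle's notebook secrets via `kaggle_secrets.UserSecretsClient`. The secret name `HF_TOKEN` is an assumption: you would need to create a secret under that label yourself. `get_hf_token` is a hypothetical helper.

```python
import os


def get_hf_token(secret_name="HF_TOKEN"):
    """Return the token from the environment, else from Kaggle secrets, else None."""
    token = os.environ.get(secret_name)
    if token:
        return token
    try:
        # Only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        return None
    try:
        return UserSecretsClient().get_secret(secret_name)
    except Exception:
        # Secret not defined or not attached to this notebook.
        return None
```

Assigning `HF_TOKEN = get_hf_token()` in place of the literal string keeps the rest of the cell unchanged.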

## 12. Monitor Progress

In [None]:
# Check Ollama logs
!tail -n 50 /kaggle/working/ollama.log

In [None]:
# Check GPU usage
!nvidia-smi

In [None]:
# Check disk usage
!df -h /kaggle/working
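
The same disk check can be done from Python, which is handy for stopping a long run before it fills the working directory. This is a small sketch built on the standard library's `shutil.disk_usage`; the helper names and any threshold you pass in are illustrative choices, not values from the project.

```python
import shutil


def free_gb(path):
    """Return free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def has_room(path, needed_gb):
    """Return True if at least `needed_gb` GiB are free at `path`."""
    return free_gb(path) >= needed_gb
```

For example, `has_room("/kaggle/working", 2)` could gate the start of each batch.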

## Troubleshooting

### Ollama Not Working

In [None]:
# Check if Ollama is running
!ps aux | grep ollama

# Check logs
!cat /kaggle/working/ollama.log

# Restart Ollama
!pkill ollama
!nohup ollama serve > /kaggle/working/ollama.log 2>&1 &
!sleep 10
!curl http://localhost:11434/api/tags

### Internet Not Working

1. Click Settings (gear icon on right)
2. Ensure **Internet** is **ON**
3. Click **Save**
4. Restart the notebook

### Out of Memory

In [None]:
# Use smaller models or reduce context
!python run_pipeline.py \
    --model qwen3-4b \
    --ctx-mode fixed \
    --fixed-tokens 1024 \
    --max-seeds 10 \
    --output-dir /kaggle/working/output

## Performance Tips

1. **Use 4B-8B models** for best speed/quality balance on Kaggle
2. **Batch in chunks** of 50-100 seeds to avoid timeouts
3. **Enable persistence** in settings to auto-save outputs
4. **Monitor GPU quota** - Kaggle gives 30 GPU hours/week
5. **Use `--resume`** to continue from checkpoints
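
Tips 2, 4, and 5 can be combined into a rough plan, sketched below. It splits a seed target into chunks and scales the "1-2 hours for 50 seeds" estimate from the top of this notebook linearly, which is an assumption: real throughput depends on model and context settings. It also assumes that `--max-seeds` combined with `--resume` means additional seeds, as the resume cell suggests. `plan_batches` and `estimate_hours` are hypothetical helpers.

```python
def plan_batches(total_seeds, chunk_size=50):
    """Split `total_seeds` into chunks of at most `chunk_size` seeds."""
    if total_seeds < 0 or chunk_size <= 0:
        raise ValueError("total_seeds must be >= 0 and chunk_size must be > 0")
    full, rest = divmod(total_seeds, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def estimate_hours(total_seeds, hours_per_50=(1.0, 2.0)):
    """Linearly scale the notebook's 1-2 hours per 50 seeds estimate."""
    low, high = hours_per_50
    scale = total_seeds / 50
    return (low * scale, high * scale)


# Print one command per batch; every batch after the first resumes.
for i, seeds in enumerate(plan_batches(120, 50), start=1):
    resume = "" if i == 1 else " --resume"
    print(f"batch {i}: python run_pipeline.py{resume} "
          f"--max-seeds {seeds} --output-dir /kaggle/working/output")
```

Comparing the upper bound from `estimate_hours` against the weekly GPU quota gives a quick sanity check before starting a large run.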

## Recommended Workflow

1. **Test run** (5 seeds) - verify everything works
2. **Small batch** (20-50 seeds) - check quality
3. **Full generation** (100-200 seeds) - production run
4. **Download** from Output tab
5. **Repeat** if needed using `--resume`