# Kaggle GPU Runner - House Prices Project

This notebook runs the house-prices-starter project on Kaggle's GPU infrastructure.

## Setup Instructions

1. **Enable GPU**: In the notebook settings (right panel), set Accelerator to "GPU" (P100 or T4)
2. **Add Competition Data**: Attach the "house-prices-advanced-regression-techniques" competition data as a notebook input
3. **Run Cells**: Execute the cells in order, top to bottom
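
Before running the cells, it can help to confirm that step 2 took effect. The sketch below assumes that Kaggle mounts attached competition data under `/kaggle/input/<competition-slug>`; `list_competition_files` is a hypothetical helper for illustration and is not part of the project.

```python
from pathlib import Path

# Assumption: attached competition data is mounted under /kaggle/input/<slug>.
COMPETITION_SLUG = "house-prices-advanced-regression-techniques"


def list_competition_files(input_root="/kaggle/input", slug=COMPETITION_SLUG):
    """Return sorted CSV file names in the competition folder, or [] if it is missing."""
    data_dir = Path(input_root) / slug
    if not data_dir.is_dir():
        return []
    return sorted(p.name for p in data_dir.glob("*.csv"))


if __name__ == "__main__":
    files = list_competition_files()
    if files:
        print(f"[OK] Competition data found: {files}")
    else:
        print("[WARNING] Competition data not found; check the notebook inputs")
```

An empty result usually means the competition data has not been attached yet, so the later cells would have nothing to read.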

## Workflow

- Cell 1: Clone repository and install dependencies
- Cell 2: Set up Kaggle environment (symlinks, environment checks)
- Cell 3: Verify GPU availability
- Cell 4: Run preprocessing (optional)
- Cell 5: Run model training (GPU-accelerated)
- Cell 6: Verify outputs


In [None]:
# Cell 1: Clone repository and install dependencies
# Update the GitHub URL to match your repository
REPO_URL = "https://github.com/yourusername/house-prices-starter.git"
PROJECT_DIR = "/kaggle/working/project"

# Standard-library imports used in this cell
import os
import shutil
import subprocess

# Remove existing directory if it exists (for re-runs)
if os.path.exists(PROJECT_DIR):
    shutil.rmtree(PROJECT_DIR)

# Clone the repository
print(f"Cloning repository from {REPO_URL}...")
result = subprocess.run(
    ["git", "clone", REPO_URL, PROJECT_DIR],
    capture_output=True,
    text=True
)

if result.returncode == 0:
    print(f"[OK] Repository cloned to {PROJECT_DIR}")
else:
    print(f"[ERROR] Failed to clone repository: {result.stderr}")
    raise RuntimeError("Repository clone failed")

# Change to project directory
os.chdir(PROJECT_DIR)
print(f"Current directory: {os.getcwd()}")

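# Install dependencies with the kernel's own interpreter so the packages
# land in the environment this notebook actually imports from
import sys

print("\nInstalling dependencies...")
result = subprocess.run(
    [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"],
    capture_output=True,
    text=True
)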

if result.returncode == 0:
    print("[OK] Dependencies installed")
else:
    print(f"[WARNING] Some dependencies may have failed: {result.stderr}")

# Add project to Python path
import sys
sys.path.insert(0, PROJECT_DIR)
print(f"\n[OK] Project root added to Python path: {PROJECT_DIR}")


In [None]:
# Cell 2: Set up Kaggle environment
# This creates the symlinks and verifies the environment

from kaggle.remote.setup_kaggle import setup_kaggle_environment

setup_kaggle_environment()


In [None]:
# Cell 3: Verify GPU availability

# Check GPU with nvidia-smi
print("Checking GPU availability...")
print("-" * 70)
!nvidia-smi

print("\n" + "-" * 70)
print("GPU Information via Python:")
print("-" * 70)

from kaggle.remote.gpu_runner import verify_gpu_setup, print_gpu_usage_info

gpu_available = verify_gpu_setup()
print()
print_gpu_usage_info()


## Optional: Run Preprocessing

To regenerate the preprocessed data on Kaggle, uncomment and run the cell below.
Skip it if the preprocessed data already exists or is not needed.
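
One way to decide whether preprocessing is needed is to check for its expected outputs first. This is only a sketch: `missing_outputs` is a hypothetical helper, and the file paths in the usage example are placeholders, since the files written by the project's preprocessing script are not shown here.

```python
from pathlib import Path


def missing_outputs(expected_paths):
    """Return the expected files that do not exist yet; [] means nothing to regenerate."""
    return [str(p) for p in expected_paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Hypothetical output paths; replace them with the files your preprocessing writes.
    expected = [
        "data/processed/train_processed.csv",
        "data/processed/test_processed.csv",
    ]
    missing = missing_outputs(expected)
    if missing:
        print(f"[INFO] Preprocessing needed, missing: {missing}")
    else:
        print("[OK] Preprocessed data already present")
```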


In [None]:
# Cell 4: Run preprocessing (OPTIONAL - uncomment if needed)
# Uncomment the line below to run preprocessing on Kaggle

# %run notebooks/preprocessing/run_preprocessing.py


## Run Model Training

Choose which model to train. The GPU-accelerated models (XGBoost, CatBoost, and LightGBM where its installed build supports it) automatically use the GPU when one is available.
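
How the project's model scripts pick the device is not shown in this notebook. As a rough illustration of what automatic GPU selection can look like, the sketch below detects a GPU with `nvidia-smi -L` and maps the result to each library's device parameter. `gpu_visible` and `device_params` are hypothetical names; the XGBoost `device` parameter assumes XGBoost 2.0 or newer, and LightGBM needs a GPU-enabled build for `device_type="gpu"` to work.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


def device_params(use_gpu):
    """Map a GPU flag to per-library device settings (illustrative, not the project's code)."""
    return {
        "catboost": {"task_type": "GPU" if use_gpu else "CPU"},
        "xgboost": {"device": "cuda" if use_gpu else "cpu"},  # assumes XGBoost >= 2.0
        "lightgbm": {"device_type": "gpu" if use_gpu else "cpu"},  # needs a GPU build
    }


if __name__ == "__main__":
    params = device_params(gpu_visible())
    for library, settings in params.items():
        print(f"{library}: {settings}")
```

The resulting dictionaries could be unpacked into the corresponding model constructors, for example `CatBoostRegressor(**params["catboost"])`, so the same script falls back to CPU when the accelerator is switched off.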


In [None]:
# Cell 5: Run model training (GPU-accelerated)
# Choose one of the following models to train:

# CatBoost (GPU-accelerated)
%run notebooks/Models/9catBoostModel.py

# Uncomment to run other models:
# XGBoost (GPU-accelerated)
# %run notebooks/Models/7XGBoostModel.py

# LightGBM (GPU-accelerated, if supported)
# %run notebooks/Models/8lightGbmModel.py


In [None]:
# Cell 6: Verify outputs

from pathlib import Path

# Check submission files
submissions_dir = Path("/kaggle/working/project/data/submissions")
print("Submission files:")
print("-" * 70)

if submissions_dir.exists():
    submission_files = list(submissions_dir.rglob("*.csv"))
    for file in sorted(submission_files):
        if file.name != "sample_submission.csv":
            size_kb = file.stat().st_size / 1024
            print(f"  {file.name}: {size_kb:.1f} KB")
else:
    print("  No submissions directory found")

# Check runs directory
runs_dir = Path("/kaggle/working/project/runs")
print("\nModel performance log:")
print("-" * 70)
perf_file = runs_dir / "model_performance.csv"
if perf_file.exists():
    print(f"  Found: {perf_file}")
    import pandas as pd
    df = pd.read_csv(perf_file)
    print(f"  Total model runs: {len(df)}")
    if len(df) > 0:
        print(f"  Latest model: {df.iloc[-1]['model']}")
        print(f"  Latest CV RMSE: {df.iloc[-1]['rmse']:.6f}")
else:
    print("  No performance log found")

print("\n" + "=" * 70)
print("[INFO] To download files, use the Output section of the Kaggle notebook's side panel")
print("=" * 70)
