TimPPPP/Datathon---BroadVail

Rice Datathon 2026 - BroadVail Finance Track

RevPAR Growth Prediction & Neighborhood Drivers

Predict which multifamily properties outperform their peers on RevPAR growth and discover which combinations of amenities, neighborhood context, and travel-time definitions best explain that outperformance across different locations and time periods.

Results Summary

Metric                     Value
Best Model                 CatBoost Ensemble (3 models)
CV RMSE                    0.1368
Original CatBoost          0.1725
Baseline RMSE              0.2020 (Ridge)
Improvement vs Baseline    32.3%
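
The improvement figure is the relative RMSE reduction against the Ridge baseline. The short check below reproduces it from the values in the table. It assumes the "Original CatBoost" row is the RMSE of the single, non-ensembled model. The helper name `pct_improvement` is illustrative and not taken from the repository.

```python
def pct_improvement(baseline_rmse: float, model_rmse: float) -> float:
    """Relative RMSE reduction versus the baseline, in percent."""
    return (baseline_rmse - model_rmse) / baseline_rmse * 100.0


# Values copied from the Results Summary table.
ensemble_vs_ridge = pct_improvement(0.2020, 0.1368)
# Assumption: 0.1725 is the RMSE of the single CatBoost model.
single_vs_ridge = pct_improvement(0.2020, 0.1725)

print(f"Ensemble vs Ridge: {ensemble_vs_ridge:.1f}%")  # 32.3%
print(f"Single CatBoost vs Ridge: {single_vs_ridge:.1f}%")  # 14.6%
```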

Key Findings:

  • The submission file is submission.csv in the repository root
  • Property type importance increased 226% post-COVID
  • Building age metrics decreased ~58% in importance post-COVID
  • Only ~20% of properties stayed in the same performance quartile between periods

Quick Start

Prerequisites

  • Docker and Docker Compose installed
  • Dataset files (provided separately)

Setup

  1. Place dataset files in Dataset/

    Dataset/
    ├── master_panel_drv10.csv
    ├── master_panel_drv15.csv
    ├── master_panel_drv30.csv
    ├── scoring.csv
    └── dictionary.csv
    
  2. Build and start Docker environment

    docker-compose build
    docker-compose up

  3. Access Jupyter Lab at http://localhost:8888

Run Notebooks

Execute notebooks in order (00-07) to reproduce the analysis.

Project Structure

├── submission.csv         # Final predictions on scoring.csv
├── data/                  # Processed data (gitignored)
├── Dataset/               # Raw CSV files (gitignored)
├── models/                # Trained models (gitignored)
├── notebooks/             # Analysis notebooks (00-07)
├── reports/
│   ├── figures/           # Visualizations
│   ├── tables/            # Analysis tables
│   ├── final_writeup.md   # Methodology summary
│   └── llm_outputs/       # LLM-generated summaries
├── src/                   # Python modules
├── Dockerfile
└── docker-compose.yml

Notebook Pipeline

#    Notebook                         Description
00   00_data_audit.ipynb              Data schema validation
01   01_research_insights.ipynb       EDA and hypotheses
02   02_feature_engineering.ipynb     Feature pipeline
03   03_baselines.ipynb               Baseline models
04   04_advanced_ml.ipynb             Gradient boosting
05   05_interpretation_story.ipynb    SHAP analysis
06   06_make_submission.ipynb         Generate submission
07   07_writeup.ipynb                 Final report

Methodology

Feature Engineering

  • Density features: amenity counts / trade area size
  • Share features: category proportions
  • PCA: dimensionality reduction on amenity densities
  • Clustering: neighborhood typology via KMeans
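
The four feature families above can be sketched on toy data as follows. This is a minimal illustration and not the code in 02_feature_engineering.ipynb. The column names (`n_restaurants`, `trade_area_sqmi`, and so on), the number of principal components, and the number of clusters are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; these columns are not the real dataset schema.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "n_restaurants": rng.poisson(20, n),
    "n_grocery": rng.poisson(5, n),
    "n_parks": rng.poisson(3, n),
    "trade_area_sqmi": rng.uniform(1.0, 10.0, n),
})
count_cols = ["n_restaurants", "n_grocery", "n_parks"]

# Density features: amenity count divided by trade area size.
density = df[count_cols].div(df["trade_area_sqmi"], axis=0).add_suffix("_density")

# Share features: each category's proportion of all amenities
# (set to 0 when a property has no amenities at all).
total = df[count_cols].sum(axis=1)
share = (
    df[count_cols]
    .div(total.where(total > 0), axis=0)
    .fillna(0.0)
    .add_suffix("_share")
)

# PCA on standardized amenity densities (2 components is an arbitrary choice).
scaled = StandardScaler().fit_transform(density)
pcs = PCA(n_components=2, random_state=0).fit_transform(scaled)

# KMeans neighborhood typology (4 clusters is an arbitrary choice).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)

features = pd.concat([density, share], axis=1)
features["pc1"], features["pc2"] = pcs[:, 0], pcs[:, 1]
features["neighborhood_cluster"] = labels
print(features.head())
```

Fixing `random_state` in PCA and KMeans keeps the derived features repeatable across runs, in line with the competition's determinism rule.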

Model

  • Algorithm: CatBoost gradient boosting
  • Validation: 5-fold GroupKFold by market
  • Interpretation: SHAP values for feature importance
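
Grouping folds by market means no market contributes rows to both the training and validation side of a split, so the CV score reflects performance on unseen markets. The sketch below shows that scheme with scikit-learn's `GroupKFold` on synthetic data. Ridge is used only as a lightweight stand-in for CatBoost so the example runs without extra dependencies. The `markets` array is hypothetical, and averaging per-fold RMSE is an assumption about how the reported CV RMSE was aggregated.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data; the real features and target come from the notebooks.
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 6))
y = X @ rng.normal(size=6) * 0.1 + rng.normal(scale=0.2, size=n)
markets = rng.integers(0, 20, size=n)  # hypothetical market id per property

cv = GroupKFold(n_splits=5)
fold_rmse = []
for train_idx, val_idx in cv.split(X, y, groups=markets):
    # GroupKFold guarantees the two sides share no market.
    assert set(markets[train_idx]).isdisjoint(markets[val_idx])

    model = Ridge(alpha=1.0)  # stand-in; a CatBoost regressor would go here
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    fold_rmse.append(float(np.sqrt(mean_squared_error(y[val_idx], pred))))

cv_rmse = float(np.mean(fold_rmse))
print(f"Per-fold RMSE: {[round(r, 4) for r in fold_rmse]}")
print(f"Mean CV RMSE: {cv_rmse:.4f}")
```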

LLM Integration

Generate an executive summary using the OpenAI API:

export OPENAI_API_KEY="your-key"
python scripts/run_llm_summary.py

The script uses temperature=0 so that outputs are as repeatable as possible, per the competition rule on determinism.

Key Dependencies

  • pandas, numpy, scikit-learn
  • catboost, lightgbm, xgboost
  • shap, matplotlib, seaborn
  • openai (for LLM summaries)

Competition Rules

  1. Models must be repeatable and deterministic
  2. Data is confidential - delete after competition
  3. Only use provided data - no external sources

License

Competition use only. See BroadVail data use notice.
