Predict which multifamily properties outperform their peers on RevPAR growth and discover which combinations of amenities, neighborhood context, and travel-time definitions best explain that outperformance across different locations and time periods.
| Metric | Value |
|---|---|
| Best Model | CatBoost Ensemble (3 models) |
| CV RMSE | 0.1368 |
| Original CatBoost | 0.1725 |
| Baseline RMSE | 0.2020 (Ridge) |
| Improvement vs Baseline | 32.3% |
Key Findings:
- the submission file is named as submission.csv in the root folder
- Property type importance increased 226% post-COVID
- Building age metrics decreased ~58% in importance post-COVID
- Only ~20% of properties maintained same performance quartile between periods
- Docker and Docker Compose installed
- Dataset files (provided separately)
-
Place dataset files in
Dataset/Dataset/ ├── master_panel_drv10.csv ├── master_panel_drv15.csv ├── master_panel_drv30.csv ├── scoring.csv └── dictionary.csv -
Build and start Docker environment
docker-compose build docker-compose up
-
Access Jupyter Lab at http://localhost:8888
Execute notebooks in order (00-07) to reproduce the analysis.
├── submission.csv # Final predictions on scoring.csv
├── data/ # Processed data (gitignored)
├── Dataset/ # Raw CSV files (gitignored)
├── models/ # Trained models (gitignored)
├── notebooks/ # Analysis notebooks (00-07)
├── reports/
│ ├── figures/ # Visualizations
│ ├── tables/ # Analysis tables
│ ├── final_writeup.md # Methodology summary
│ └── llm_outputs/ # LLM-generated summaries
├── src/ # Python modules
├── Dockerfile
└── docker-compose.yml
| # | Notebook | Description |
|---|---|---|
| 00 | 00_data_audit.ipynb |
Data schema validation |
| 01 | 01_research_insights.ipynb |
EDA and hypotheses |
| 02 | 02_feature_engineering.ipynb |
Feature pipeline |
| 03 | 03_baselines.ipynb |
Baseline models |
| 04 | 04_advanced_ml.ipynb |
Gradient boosting |
| 05 | 05_interpretation_story.ipynb |
SHAP analysis |
| 06 | 06_make_submission.ipynb |
Generate submission |
| 07 | 07_writeup.ipynb |
Final report |
- Density features: amenity counts / trade area size
- Share features: category proportions
- PCA: dimensionality reduction on amenity densities
- Clustering: neighborhood typology via KMeans
- Algorithm: CatBoost gradient boosting
- Validation: 5-fold GroupKFold by market
- Interpretation: SHAP values for feature importance
Generate executive summary using OpenAI API:
export OPENAI_API_KEY="your-key"
python scripts/run_llm_summary.pyUses temperature=0 for deterministic outputs per competition rules.
- pandas, numpy, scikit-learn
- catboost, lightgbm, xgboost
- shap, matplotlib, seaborn
- openai (for LLM summaries)
- Models must be repeatable and deterministic
- Data is confidential - delete after competition
- Only use provided data - no external sources
Competition use only. See BroadVail data use notice.