# Water Main Break Prediction Project: Pipeline Blueprint

## 1. 📦 Required Libraries

```python
# Core
import pandas as pd
import numpy as np
import os

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd

# Modeling
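from sklearn.model_selection import (
    GridSearchCV,        # used for hyperparameter tuning in section 6
    RandomizedSearchCV,  # alternative to GridSearchCV for larger grids
    cross_val_score,
    train_test_split,
)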
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Utilities
import joblib
```

## 2. 📥 Data Acquisition
Datasets:
- 2004–2019 Water Main Breaks
- 2021 Water Main Breaks
- Optional: 2022 Water Main Breaks, if its columns and data quality align with the earlier datasets

Actions:
- Download each dataset from the Syracuse Open Data Portal.
    - Download manually, or script it with wget/curl if the portal exposes an API or direct download link.
- Store the files in ./data/raw/.
    - Save them locally in one consistent format (CSV recommended).
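
The scripted option can be sketched in Python with only the standard library. This is a minimal sketch, not the project's actual download script: the function name `download_csv`, the target filename, and any URL passed to it are assumptions, since the blueprint does not list the portal's download links.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen

RAW_DIR = Path("./data/raw")


def download_csv(url: str, filename: str, raw_dir: Path = RAW_DIR) -> Path:
    """Download one dataset into raw_dir, skipping files that already exist.

    `url` must be supplied by the caller (hypothetical here); copy it from
    the dataset's page on the open data portal.
    """
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / filename
    if target.exists():
        # Keep the existing raw file so repeated runs do not re-download it.
        return target
    with urlopen(url) as response, open(target, "wb") as fh:
        shutil.copyfileobj(response, fh)
    return target
```

Skipping files that already exist keeps the raw folder stable between runs; delete a file by hand to force a fresh download.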


## 3. 🧹 Preprocessing (Initial Cleanup)
Goals:
- Standardize column names across all datasets
- Concatenate all years into a single DataFrame
- Remove duplicates and entries with missing critical values (e.g., lat/lon, break date)

Deliverable:
- preprocessed_breaks.csv saved to ./data/processed/
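
The three goals above could look roughly like the sketch below. The alias table and the critical column names (`lat`, `lon`, `break_date`) are assumptions for illustration; the real exports may use different headers, so adjust `COLUMN_ALIASES` after inspecting the raw files.

```python
import pandas as pd

# Hypothetical mapping from raw header variants to one shared schema.
COLUMN_ALIASES = {
    "latitude": "lat",
    "longitude": "lon",
    "date": "break_date",
    "breakdate": "break_date",
}
CRITICAL_COLUMNS = ["lat", "lon", "break_date"]


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Convert headers to snake_case, then apply the alias table."""
    cleaned = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[^a-z0-9]+", "_", regex=True)
        .str.strip("_")
    )
    out = df.copy()
    out.columns = [COLUMN_ALIASES.get(name, name) for name in cleaned]
    return out


def combine_breaks(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate yearly frames, drop duplicates and incomplete rows."""
    combined = pd.concat(
        [standardize_columns(frame) for frame in frames], ignore_index=True
    )
    combined = combined.drop_duplicates()
    combined = combined.dropna(subset=CRITICAL_COLUMNS)
    return combined.reset_index(drop=True)
```

The result can then be written to `./data/processed/` with `DataFrame.to_csv(..., index=False)`.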

## 4. 🧼 Data Cleaning
Goals:
- Handle missing values
- Format dates and extract temporal features (year, month, season)
- Encode categorical variables (e.g., pipe material)
- Generate engineered features (pipe age, winter flag, etc.)

Export:
- `df.to_csv('./data/clean/cleaned_breaks.csv', index=False)`
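
As a rough illustration of the cleaning goals, the sketch below parses dates, derives temporal features, and one-hot encodes pipe material. The column names `break_date`, `install_year`, and `material` are assumed, not confirmed by the datasets, and the season boundaries are a simple meteorological convention.

```python
import pandas as pd

# Meteorological seasons for the northern hemisphere (an assumption).
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with temporal and engineered features added."""
    out = df.copy()
    # Unparseable dates become NaT and are dropped.
    out["break_date"] = pd.to_datetime(out["break_date"], errors="coerce")
    out = out.dropna(subset=["break_date"])

    out["year"] = out["break_date"].dt.year
    out["month"] = out["break_date"].dt.month
    out["season"] = out["month"].map(SEASON_BY_MONTH)
    out["is_winter"] = (out["season"] == "winter").astype(int)

    # Pipe age at the time of the break, in years.
    out["pipe_age"] = out["year"] - out["install_year"]

    # Keep rows with unknown material instead of dropping them.
    out["material"] = out["material"].fillna("unknown")
    return pd.get_dummies(out, columns=["material"], prefix="material", dtype=int)
```

Filling missing materials with an explicit `unknown` category is one choice among several; dropping or imputing those rows may suit the data better once its gaps are understood.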


## 5. 🔬 Exploratory Data Analysis (EDA)
- Visualize break frequency by year, material, and temperature band (if temperature data is available)
- Plot a correlation heatmap of the numeric features
- Plot break clusters on a GIS-style map using geopandas
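
The first two plots might be produced along these lines. This is a sketch under assumptions: it expects `year` and `material` columns (material not yet one-hot encoded) plus some numeric features, and the function names are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def breaks_by_year_and_material(df: pd.DataFrame) -> pd.DataFrame:
    """Count breaks per year (rows) and pipe material (columns)."""
    return pd.crosstab(df["year"], df["material"])


def plot_eda(df: pd.DataFrame):
    """Draw breaks per year by material next to a correlation heatmap."""
    counts = breaks_by_year_and_material(df)
    fig, (ax_counts, ax_corr) = plt.subplots(1, 2, figsize=(12, 4))

    counts.plot(kind="bar", stacked=True, ax=ax_counts)
    ax_counts.set_title("Breaks per year by material")
    ax_counts.set_ylabel("Number of breaks")

    # Correlations are only defined for numeric columns.
    corr = df.select_dtypes("number").corr()
    sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax_corr)
    ax_corr.set_title("Correlation of numeric features")

    fig.tight_layout()
    return fig
```

Calling `plot_eda(df).savefig("eda_overview.png")` would write the figure to disk; the filename is only an example.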

## 6. 🧠 Model Building
Goals:
- Train/test split
- Random Forest baseline model
- Hyperparameter tuning (GridSearchCV or RandomizedSearchCV)
- Feature importance evaluation

Export:
- `joblib.dump(rf_model, './models/random_forest_break_model.pkl')`
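
A minimal sketch of the modeling goals follows. It assumes a feature table `X` and a binary target `y` (1 = break, 0 = no break); the blueprint does not yet define how that target is built from break-only records, so the label is hypothetical. The parameter grid and the F1 scoring choice are placeholders as well.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split


def train_break_model(X: pd.DataFrame, y: pd.Series, random_state: int = 42):
    """Split the data, tune a random forest, and rank feature importances."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )

    # Small illustrative grid; widen it once the baseline works.
    param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10]}
    search = GridSearchCV(
        RandomForestClassifier(class_weight="balanced", random_state=random_state),
        param_grid,
        cv=3,
        scoring="f1",
    )
    search.fit(X_train, y_train)
    model = search.best_estimator_

    importances = pd.Series(
        model.feature_importances_, index=X.columns
    ).sort_values(ascending=False)
    return model, importances, (X_test, y_test)
```

The held-out `(X_test, y_test)` pair is returned so that evaluation uses data the grid search never saw.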


## 7. 📊 Visualization of Results
- Confusion matrix, classification report
- Map overlay of predicted break risk
- Feature importance bar plot

## 8. 🏁 Final Output
- Final clean dataset
- Trained model file
- Jupyter Notebook
- Flask or Streamlit Dashboard
- Whitepaper (Quarto or PDF)
- Blog post (Markdown, Medium format)

## 🔄 Export Logic Between Steps

| Step              | Export File                                | Notes                                             |
|-------------------|--------------------------------------------|---------------------------------------------------|
| Preprocessing     | `./data/processed/preprocessed_breaks.csv` | Merged raw data with standardized columns         |
| Cleaning          | `./data/clean/cleaned_breaks.csv`          | Cleaned dataset ready for modeling and EDA        |
| Modeling          | `./models/random_forest_break_model.pkl`   | Trained ML model saved with `joblib`              |
| Final Dashboard   | *(No export – dynamic)*                    | Uses model + cleaned data for live visualization  |


## 📌 Notes and Observations
- Missing years (e.g., 2020) do not invalidate the model, but the gap should be acknowledged in the write-up.
- Investigate whether the datasets are consistent across the different time spans before relying on the merged data.
- Consider adding an API endpoint or dashboard for demo purposes.