# Predicting Pipe Breaks with Machine Learning: A Civic Infrastructure Case Study

## Project Roadmap

### Phase 1: Initial Exploration

* Load Syracuse water main break dataset.
* Identify available columns: location, break type, date/time.
* Note major missing features: pipe material, diameter, age, depth, soil type, pressure zone, etc.
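
A first pass at this audit can be sketched as below. It is a minimal illustration, not the project's actual loading code: the column names in `EXPECTED_COLUMNS` are hypothetical stand-ins for whatever the real export uses, and the sample rows are made up.

```python
import csv
import io

# Hypothetical column names; the real export may label these differently.
EXPECTED_COLUMNS = {
    "latitude", "longitude", "break_type", "break_date",   # observed fields
    "pipe_material", "diameter", "install_year", "depth",  # engineering attributes
    "soil_type", "pressure_zone",
}

def audit_columns(csv_file):
    """Return (present, missing) expected columns for an open CSV file."""
    reader = csv.DictReader(csv_file)
    header = {name.strip().lower() for name in (reader.fieldnames or [])}
    present = sorted(EXPECTED_COLUMNS & header)
    missing = sorted(EXPECTED_COLUMNS - header)
    return present, missing

# Tiny made-up stand-in for the real file, for illustration only.
sample = io.StringIO(
    "Latitude,Longitude,Break_Type,Break_Date\n"
    "43.0481,-76.1474,Circumferential,2019-01-15\n"
)
present, missing = audit_columns(sample)
print("present:", present)
print("missing:", missing)
```

Passing an open file handle instead of the `io.StringIO` sample would run the same check against the downloaded dataset.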

### Phase 2: Problem Reframing

* Recognize that the lack of critical pipe metadata limits traditional supervised ML.
* Reframe project goal from pipe-level prediction to **spatial risk classification**:

  * "Where are the break-prone zones based on what we *can* observe?"
  * Focus on location-based analysis and external enrichment.
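
One way to make "zones" concrete is to snap each break to a regular latitude/longitude grid and count breaks per cell. The sketch below assumes a cell size of 0.005 degrees (`CELL_SIZE_DEG`, a hypothetical value) and uses made-up coordinates; the roadmap does not specify how zones are defined, so treat this as one possible choice.

```python
import math
from collections import Counter

CELL_SIZE_DEG = 0.005  # assumed cell size (roughly 550 m north-south); tune as needed

def cell_for(lat, lon, size=CELL_SIZE_DEG):
    """Map a coordinate to the integer (row, col) index of its grid cell."""
    return (math.floor(lat / size), math.floor(lon / size))

def breaks_per_cell(points, size=CELL_SIZE_DEG):
    """Count breaks in each grid cell from an iterable of (lat, lon) pairs."""
    return Counter(cell_for(lat, lon, size) for lat, lon in points)

# Made-up coordinates near Syracuse, for illustration only.
points = [
    (43.0481, -76.1474),
    (43.0483, -76.1471),
    (43.0612, -76.1320),
]
counts = breaks_per_cell(points)
for cell, n in counts.most_common():
    print(cell, n)
```

A fixed grid is simple and reproducible, but it ignores street layout and neighborhood boundaries; census tracts or pressure districts would be alternatives if such boundaries are available.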

### Phase 3: Data Enrichment Plan

* Add contextually relevant features from public sources:

  * Approximate infrastructure age from parcel/building data.
  * Soil characteristics from USDA/NRCS Soil Survey.
  * Freeze/thaw weather patterns from NOAA datasets.
  * Zoning or land-use type.
* Engineer categorical features and proxies where direct data is unavailable.
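
As an example of a proxy feature, freeze/thaw activity can be approximated by counting days on which the temperature crossed freezing. The sketch below assumes daily minimum and maximum temperatures already converted to degrees Celsius, and the crossing rule itself is an assumed definition rather than one taken from the dataset or from NOAA. The records are made up.

```python
from collections import Counter
from datetime import date

FREEZING_C = 0.0

def is_freeze_thaw_day(tmin_c, tmax_c):
    """True when the temperature crossed freezing during the day."""
    return tmin_c < FREEZING_C < tmax_c

def freeze_thaw_days_by_month(daily):
    """Count freeze/thaw days per (year, month) from (date, tmin_c, tmax_c) records."""
    counts = Counter()
    for day, tmin_c, tmax_c in daily:
        if is_freeze_thaw_day(tmin_c, tmax_c):
            counts[(day.year, day.month)] += 1
    return counts

# Made-up daily records, for illustration only.
daily = [
    (date(2020, 1, 10), -4.0, 2.5),   # crosses freezing
    (date(2020, 1, 11), -8.0, -1.0),  # stays frozen
    (date(2020, 1, 12), -0.5, 3.0),   # crosses freezing
    (date(2020, 7, 4), 18.0, 29.0),   # stays warm
]
ft_counts = freeze_thaw_days_by_month(daily)
print(dict(ft_counts))
```

The monthly counts can then be joined to break records by year and month, giving each break a weather context even though no pipe-level attributes exist.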

### Phase 4: Exploratory Analysis

* Map break locations to visualize density/hotspots.
* Analyze time trends (by year, month, or season).
* Overlay enriched features to identify potential correlations.
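
The time-trend step can be sketched with simple counters. This assumes break dates arrive as ISO `YYYY-MM-DD` strings, which may not match the real file's format, and uses meteorological seasons (December through February as winter) as an assumed grouping. The dates are made up.

```python
from collections import Counter
from datetime import datetime

# Meteorological seasons, keyed by month number (an assumed grouping).
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

def summarize_breaks(date_strings):
    """Count breaks by year, by month number, and by season."""
    by_year, by_month, by_season = Counter(), Counter(), Counter()
    for text in date_strings:
        day = datetime.strptime(text, "%Y-%m-%d")
        by_year[day.year] += 1
        by_month[day.month] += 1
        by_season[SEASON_BY_MONTH[day.month]] += 1
    return by_year, by_month, by_season

# Made-up break dates, for illustration only.
break_dates = ["2019-01-15", "2019-02-03", "2019-07-21", "2020-01-09"]
by_year, by_month, by_season = summarize_breaks(break_dates)
print(by_year, by_month, by_season, sep="\n")
```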

### Phase 5: Modeling & Classification

* Create classification labels:

  * Binary (e.g., hotspot vs. non-hotspot)
  * Or frequency bands (low/med/high break zones)
* Apply basic models:

  * Logistic regression
  * Decision trees
  * Random forest (as capacity allows)
* Evaluate with a confusion matrix, precision/recall, and AUC.
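
The labeling and evaluation steps can be sketched in plain Python. The thresholds (`HOTSPOT_MIN`, `LOW_MAX`, `HIGH_MIN`) are hypothetical placeholders, not values chosen by the project, and the predictions are made up to exercise the metric code. In practice a library such as scikit-learn provides these metrics; the hand-rolled version only shows what they measure.

```python
HOTSPOT_MIN = 5  # hypothetical: cells with at least this many breaks are hotspots
LOW_MAX = 1      # hypothetical upper bound of the "low" band
HIGH_MIN = 5     # hypothetical lower bound of the "high" band

def hotspot_label(count, threshold=HOTSPOT_MIN):
    """Binary label: 1 for hotspot, 0 otherwise."""
    return 1 if count >= threshold else 0

def frequency_band(count, low_max=LOW_MAX, high_min=HIGH_MIN):
    """Three-way label based on break count."""
    if count <= low_max:
        return "low"
    if count >= high_min:
        return "high"
    return "med"

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    """Precision and recall, returning 0.0 when a denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up per-cell break counts and model predictions, for illustration only.
cell_counts = [0, 1, 2, 5, 9]
y_true = [hotspot_label(c) for c in cell_counts]
bands = [frequency_band(c) for c in cell_counts]
y_pred = [0, 1, 0, 1, 1]
precision, recall = precision_recall(y_true, y_pred)
print(y_true, bands, precision, recall)
```

Because hotspots are likely to be a small share of all cells, precision and recall are usually more informative here than raw accuracy.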

### Phase 6: Dashboard (Optional)

* Build a lightweight Flask dashboard to:

  * Display break heatmaps
  * Show predictions across city zones
  * Provide insight for hypothetical maintenance prioritization

## Initial Notes: Data Exploration Summary

The Syracuse water main break dataset is an open-access civic dataset with major limitations for ML. While it includes key fields such as break type and geolocation, it omits most engineering attributes necessary for classic infrastructure risk modeling:

* **Missing columns:** pipe material, diameter, installation year, depth, surrounding soil, pressure, maintenance history, etc.

This gap forced a pivot: the project scope shifted from predicting break likelihood at the *pipe level* to analyzing *spatial risk zones* using enriched public data and proxy features.

Despite the lack of pipe-specific data, the dataset still enables meaningful civic analysis, especially when combined with enrichment techniques. This reframing offers strong portfolio value and real-world applicability.

---

More detailed documentation will follow, covering exploratory visuals, the data cleaning process, enrichment steps, and modeling logic.

---

*Prepared by Brice Nelson for the Pencils & Python whitepaper series.*
