# Data Ingestion
- Load and inspect raw water main break datasets

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
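# Paths are relative to the notebook's working directory.
# breaks_2019 holds the multi-year 2004-2019 file; the other two are single-year files.
breaks_2019 = pd.read_csv('../data/raw/wm_breaks_2004-2019.csv')
breaks_2021 = pd.read_csv('../data/raw/wm_breaks_2021.csv')
breaks_2022 = pd.read_csv('../data/raw/wm_breaks_2022.csv')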

print('2019 Breaks \n',breaks_2019.head())
print('\n2021 Breaks \n',breaks_2021.head())
print('\n2022 Breaks \n',breaks_2022.head())

2019 Breaks 
               X             Y                fullDate                location  \
0 -8.479826e+06  5.319088e+06  2011/01/14 00:00:00+00       1205 W FAYETTE ST   
1 -8.474632e+06  5.314544e+06  2011/01/14 00:00:00+00     1003 JAMESVILLE AVE   
2 -8.478344e+06  5.316543e+06  2011/01/14 00:00:00+00  PALMER AVE & CHENEY ST   
3 -8.472936e+06  5.319305e+06  2011/01/16 00:00:00+00       2100 E FAYETTE ST   
4 -8.477195e+06  5.320761e+06  2011/01/17 00:00:00+00        206 BUTTERNUT ST   

   leakClass  month  date  weekday    year  week        lon        lat  \
0          0    1.0  14.0      NaN  2011.0   2.0 -76.175575  43.046733   
1          0    1.0  14.0      NaN  2011.0   2.0 -76.128918  43.016895   
2          0    1.0  14.0      NaN  2011.0   2.0 -76.162257  43.030022   
3          0    1.0  16.0      NaN  2011.0   3.0 -76.113683  43.048158   
4          0    1.0  17.0      NaN  2011.0   3.0 -76.151936  43.057713   

   ObjectId  
0         1  
1         2  
2         3 

## Data Exploration Assessment Upon Initial Data Ingestion
- The provided data is missing information needed for analysis (for example, `weekday` is NaN in every row of the 2004-2019 sample shown above)
- The data is not in a standardized format
- Additional columns will need to be added
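
One way to make the assessment above concrete is to tabulate which columns each file actually has. The following is a minimal sketch, not the notebook's own method: `compare_columns` is a hypothetical helper, and the stand-in frames use made-up column lists for the 2021 file, since its real schema is not shown here.

```python
import pandas as pd


def compare_columns(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Build a presence table: one row per column name, one column per dataset."""
    # Union of every column name seen in any of the frames
    all_columns = sorted(set().union(*(df.columns for df in frames.values())))
    presence = {
        name: [col in df.columns for col in all_columns]
        for name, df in frames.items()
    }
    return pd.DataFrame(presence, index=all_columns)


# Stand-in frames for illustration only; the "2021" column names are hypothetical.
frames = {
    "2019": pd.DataFrame(columns=["fullDate", "location", "leakClass"]),
    "2021": pd.DataFrame(columns=["Date", "location"]),
}

report = compare_columns(frames)
# Keep only the columns that are absent from at least one dataset
missing = report[~report.all(axis=1)]
print(missing)
```

In the notebook, the same helper could be called with `{"2019": breaks_2019, "2021": breaks_2021, "2022": breaks_2022}` to see where the schemas diverge.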

## Additional columns to be added
- Unique ID
- Description
- Date
- Full Date
- Time
- Location
- X coordinate
- Y coordinate
- Latitude
- Longitude
- Leak type
- Leak class
- Work order
- Year, month, weekday, week of year (all derived from date)
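
The date-derived columns in the last bullet can be computed from `fullDate`. The sketch below is one possible approach under stated assumptions: `add_date_parts` is a hypothetical helper; it reads only the leading `YYYY/MM/DD` part of the string, assuming the time component carries no information (it is `00:00:00` in every row shown); `weekday` is stored as a day name, which is a guess at the intended encoding; and week numbers use Sunday-start weeks (`%U`), which is consistent with the sample rows above but is not confirmed as the convention used in the source data.

```python
import pandas as pd


def add_date_parts(df: pd.DataFrame, source_col: str = "fullDate") -> pd.DataFrame:
    """Return a copy of df with year, month, date, weekday and week columns."""
    out = df.copy()
    # Assumption: the first 10 characters are always YYYY/MM/DD.
    # errors="coerce" turns anything unparseable into NaT instead of raising.
    parsed = pd.to_datetime(
        out[source_col].str.slice(0, 10), format="%Y/%m/%d", errors="coerce"
    )
    out["year"] = parsed.dt.year
    out["month"] = parsed.dt.month
    out["date"] = parsed.dt.day  # day of month, matching the existing column name
    out["weekday"] = parsed.dt.day_name()
    # Assumption: Sunday-start week of year (%U); days before the first Sunday are week 0.
    out["week"] = pd.to_numeric(parsed.dt.strftime("%U"), errors="coerce")
    return out


# Two fullDate values copied from the sample output above
sample = pd.DataFrame(
    {"fullDate": ["2011/01/14 00:00:00+00", "2011/01/16 00:00:00+00"]}
)
result = add_date_parts(sample)
print(result)
```

Returning a copy keeps the raw frame untouched, so the derivation can be rerun with a different week convention if `%U` turns out not to match.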

## 🔍 External Data Sources to Enrich Pipe Break Modeling

| **Desired Info**        | **Purpose**                                       | **How to Get It**                                                |
|-------------------------|----------------------------------------------------|------------------------------------------------------------------|
| **Weather Data**        | Analyze correlation between freeze/thaw and breaks | [NOAA NCEI API](https://www.ncei.noaa.gov/support/access-data-service-api-user-documentation), [OpenWeatherMap API](https://openweathermap.org/api), [Weather Data from Visual Crossing](https://www.visualcrossing.com/weather-data) |
| **Pipe Material / Age** | Assess infrastructure degradation risk             | City asset registry or utility department GIS (may require FOIA request or internal access) |
| **Soil Type**           | Evaluate corrosiveness or movement-related stress  | [USDA NRCS SSURGO Database](https://www.nrcs.usda.gov/resources/data-and-reports/ssurgo), [USDA Web Soil Survey](https://websoilsurvey.sc.egov.usda.gov/App/HomePage.htm) |
| **Road Classification** | Understand traffic loads over pipe infrastructure  | [TIGER/Line Shapefiles (U.S. Census)](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html), [OpenStreetMap](https://download.geofabrik.de/) |
| **Elevation**           | Analyze pressure-related strain in hilly areas     | [USGS National Map Viewer](https://apps.nationalmap.gov/viewer/), [NASA SRTM DEM Data](https://search.earthdata.nasa.gov/search) |
| **Distance to Hydrant** | Model pressure/flow zones and response logistics   | Local utility GIS, hydrant layers (may be internal or via [Open Data portals](https://data.gov/)) |
| **Land Use / Zoning**   | Evaluate industrial vs. residential stress levels  | [OpenStreetMap Land Use Layers](https://wiki.openstreetmap.org/wiki/Land_use), local planning department GIS portals |
| **Repair Response Time**| Analyze service lag vs. damage severity            | Utility maintenance logs (typically internal or accessed via request) |
