-
Notifications
You must be signed in to change notification settings - Fork 286
The Data
The target variable we are intending to forecast is called cfs which stands for cubic feet per second measured as waterflow passing by a gauge. By design, cfs should be >= 0 , where cfs = 0 implies the water is perfectly still. In certain cases, gauges may malfunction and give a -9999 reading (this is common during winter months). For this reason, the data must be pre-processed to cleanse these anomalies. If the river is partially frozen but still flowing a gauge may generate null values for cfs.
Meta data information is stored in a json file in the following format. Each river json file is named gage_id.json
├────────
│ └── "river"
│ ├── "river_gage_id": str,
| ├── "elevation": float,
| ├── "river_flow_raw_path": "path to gage_id_final.csv",
| ├── "months_retrieved": "mm/dd/yyyy-mm/dd/yyyy",
| ├── "missing_flows_rows_count": int,
| ├── "dam_controlled": "yes/no/unknown"
| ├── "frozen-months": "mm/dd-mm/dd"
| ├── "latitude": float
| ├── "longitude": float
| ├── "flow_min": float
| ├── "flow_max": float
| ├── "city_near": "the closest city that the river is near"
| ├── stations
└── [i]
| | ├── "weather_station_id": str
| | ├── "distance_mi_station": float
| | ├── "temp_max": float
| | ├── "temp_min": float
| | ├── "missing_precip": int
| | ├── "missing_temp": int
Calculating Closest Gage Initially The closest gage is calculated by computing by leveraging a cross join between the meta_gage and the meta_sites table. Then we can compute the haversine distance between the lat_gage/lon_gage and the lat_site/lon_site. This will give us the total distance for between every gage and weather station in the dataset. Next we can run a secondary query
WITH cte AS
( SELECT gage_id, row_id, distance
ROW_NUMBER() OVER (PARTITION BY gage_id
ORDER BY value DESC
)
AS rn
FROM t
)
SELECT gage_id, site_id, distance, rn
FROM cte
WHERE rn <= 3
ORDER BY distance, rn;
Calculating Updates
Datetime Standardization
All date time data has been standardized to the following format
dt = datetime.strptime(x, "%Y-%m-%d %H:%M")