Skip to content
Jane Thompson edited this page Feb 11, 2020 · 8 revisions

Target Variable

The target variable we are intending to forecast is called cfs which stands for cubic feet per second measured as waterflow passing by a gauge. By design, cfs should be >= 0 , where cfs = 0 implies the water is perfectly still. In certain cases, gauges may malfunction and give a -9999 reading (this is common during winter months). For this reason, the data must be pre-processed to cleanse these anomalies. If the river is partially frozen but still flowing a gauge may generate null values for cfs.

Meta Data

Meta data information is stored in a json file in the following format. Each river json file is named gage_id.json

├────────
       │   └── "river"
       │       ├── "river_gage_id": str,
       |       ├── "elevation": float,
       |       ├── "river_flow_raw_path": "path to gage_id_final.csv",
       |       ├── "months_retrieved": "mm/dd/yyyy-mm/dd/yyyy",
       |       ├── "missing_flows_rows_count": int,
       |       ├── "dam_controlled": "yes/no/unknown"
       |       ├── "frozen-months": "mm/dd-mm/dd"
       |       ├── "latitude": float
       |       ├── "longitude": float
       |       ├── "flow_min": float
       |       ├── "flow_max": float
       |       ├── "city_near": "the closest city that the river is near"
       |       ├── stations
                      └── [i]
       |        |       ├── "weather_station_id": str
       |        |       ├── "distance_mi_station": float
       |        |       ├── "temp_max": float
       |        |       ├── "temp_min": float
       |        |       ├── "missing_precip": int
       |        |       ├── "missing_temp": int

Calculating Closest Gage Initially The closest gage is calculated by computing by leveraging a cross join between the meta_gage and the meta_sites table. Then we can compute the haversine distance between the lat_gage/lon_gage and the lat_site/lon_site. This will give us the total distance for between every gage and weather station in the dataset. Next we can run a secondary query

 WITH cte AS
  ( SELECT gage_id, row_id, distance
           ROW_NUMBER() OVER (PARTITION BY gage_id
                              ORDER BY value DESC
                             )
             AS rn
    FROM t
  )
SELECT gage_id, site_id, distance, rn
FROM cte
WHERE rn <= 3
ORDER BY distance, rn;

Calculating Updates

Datetime Standardization

All date time data has been standardized to the following format dt = datetime.strptime(x, "%Y-%m-%d %H:%M")