## Predicting Flood Potential Based on Rainfall for the San Lorenzo River Basin, California

ETL Step 1:  
- Extracting data from Stream Gage USGS 11160500 San Lorenzo at Big Trees for the time period 09/01/2014 to 09/01/2024.

  Stream gage data was downloaded from the [USGS page for station 11160500][USGS for station 11160500]. The record gives river height (gage height) in feet, measured every 15 minutes.

- Extracting precipitation data for the San Lorenzo River watershed for the same time period, 09/01/2014 to 09/01/2024.

  Data was obtained from the [California Data Exchange Center, Department of Water Resources][California Data Exchange Center, Department of Water Sources site] site.
  
  Four stations were identified that report hourly rainfall readings in inches:

| Location         | Code | Elevation (ft) | Latitude  | Longitude   | County     | Agency                                  |
|------------------|------|----------------|-----------|-------------|------------|-----------------------------------------|
| BEN LOMOND (CDF) | BLO  | 2630           | 37.132000 | -122.169998 | SANTA CRUZ | CA Dept of Forestry and Fire Protection |
| SCHULTIES RD     | SCH  | 1400           | 37.132999 | -121.969002 | SANTA CRUZ | Santa Cruz County                       |
| BOULDER CREEK    | BDC  | 800            | 37.141998 | -122.163002 | SANTA CRUZ | Santa Cruz County                       |
| BEN LOMOND       | BLN  | 365            | 37.092999 | -122.074997 | SANTA CRUZ | Santa Cruz County                       |

[USGS for station 11160500]: https://waterdata.usgs.gov/monitoring-location/11160500/#parameterCode=00065&period=P7D&showMedian=false
[California Data Exchange Center, Department of Water Sources site]: https://cdec.water.ca.gov/dynamicapp/wsSensorData

In [3]:
# Import dependencies
import pandas as pd 

Extracting the stream gage data into a pandas DataFrame

In [5]:
# Define the file path 
file_path = 'Resources/BigTrees11160500_9_2014_9_2024.txt'

# Skip the header rows and load the data into a DataFrame
stream = pd.read_csv(file_path, sep='\t', comment='#', skiprows=28, header=0)

# Rename the columns
stream.columns = ['agency', 'site_no', 'datetime', 'time_zone', 'gage_height', 'approval_code']

# Convert the 'datetime' column to datetime type for easier manipulation
stream['datetime'] = pd.to_datetime(stream['datetime'])

# Display the DataFrame
print(stream.head())

  agency   site_no            datetime time_zone  gage_height approval_code
0   USGS  11160500 2014-09-01 00:00:00       PDT         2.69             A
1   USGS  11160500 2014-09-01 00:15:00       PDT         2.69             A
2   USGS  11160500 2014-09-01 00:30:00       PDT         2.69             A
3   USGS  11160500 2014-09-01 00:45:00       PDT         2.69             A
4   USGS  11160500 2014-09-01 01:00:00       PDT         2.69             A
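The load step above depends on the layout of the USGS tab-delimited (RDB) download: `#` comment lines, then a row of column names, then a row of field-format codes such as `5s`, and only then the data. The following is a minimal sketch of that parsing on a tiny in-memory sample, not the notebook's actual file. The sample text and the raw column names (`00000_00065`, `00000_00065_cd`) are made up for illustration, and dropping the format row by position is just one way to handle it.

```python
import io
import pandas as pd

# Made-up miniature of a USGS RDB file (assumption: real files have the same shape)
sample = (
    "# hypothetical comment line\n"
    "# another hypothetical comment line\n"
    "agency_cd\tsite_no\tdatetime\ttz_cd\t00000_00065\t00000_00065_cd\n"
    "5s\t15s\t20d\t6s\t14n\t10s\n"
    "USGS\t11160500\t2014-09-01 00:00\tPDT\t2.69\tA\n"
    "USGS\t11160500\t2014-09-01 00:15\tPDT\t2.69\tA\n"
)

# Fully commented lines are ignored; the first remaining line becomes the header
demo = pd.read_csv(io.StringIO(sample), sep="\t", comment="#", header=0, dtype=str)

# The first parsed row is the field-format row ("5s", "15s", ...), so drop it
demo = demo.iloc[1:].reset_index(drop=True)

# Same column names as used in this notebook
demo.columns = ["agency", "site_no", "datetime", "time_zone", "gage_height", "approval_code"]

# Everything was read as text, so convert the typed columns explicitly
demo["datetime"] = pd.to_datetime(demo["datetime"])
demo["gage_height"] = pd.to_numeric(demo["gage_height"], errors="coerce")

print(demo)
```

Reading everything as text first keeps the format row from silently forcing numeric columns to `object` dtype.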


In [6]:
# Define the path for the new CSV output
output_csv = 'Resources/cleaned_stream_gage_data.csv'

# Save the DataFrame to a new CSV file
stream.to_csv(output_csv, index=False)

print(f"Data has been saved to {output_csv}")

Data has been saved to Resources/cleaned_stream_gage_data.csv


Reduce the stream data to one measurement per hour (the maximum gage height in each hour) to match the hourly rain data

In [8]:
# Convert the 'datetime' column to datetime format if it exists
if 'datetime' in stream.columns:
    stream['datetime'] = pd.to_datetime(stream['datetime'], errors='coerce')

# Extract the date and hour from the 'datetime' column
stream['date_hour'] = stream['datetime'].dt.floor('h')

# Group by the date and hour and get the max gage height for each hour
max_height_per_hour = stream.groupby('date_hour').agg({'gage_height': 'max'}).reset_index()

# Display the new DataFrame with the date, hour, and max height
print(max_height_per_hour.head())



            date_hour  gage_height
0 2014-09-01 00:00:00         2.69
1 2014-09-01 01:00:00         2.69
2 2014-09-01 02:00:00         2.69
3 2014-09-01 03:00:00         2.68
4 2014-09-01 04:00:00         2.68
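As a cross-check on the hourly reduction above, the same result can be sketched with `resample` on a small made-up sample. The values below are invented for illustration only. The two approaches agree when every hour has data, but `resample` also emits a row (with `NaN`) for any hour missing from the record, while `groupby` simply omits it.

```python
import pandas as pd

# Invented 15-minute readings covering two hours
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2014-09-01 00:00", "2014-09-01 00:15",
        "2014-09-01 00:30", "2014-09-01 00:45",
        "2014-09-01 01:00", "2014-09-01 01:15",
    ]),
    "gage_height": [2.69, 2.70, 2.71, 2.68, 2.68, 2.67],
})

# Approach used in this notebook: floor to the hour, then take the max per group
hourly_groupby = toy.groupby(toy["datetime"].dt.floor("h"))["gage_height"].max()

# Alternative: resample on a datetime index
hourly_resample = toy.set_index("datetime")["gage_height"].resample("h").max()

print(hourly_groupby)
print(hourly_resample)
```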




Extracting the rain data into a pandas DataFrame

In [12]:
# Load the Excel file
file_path = 'Resources/BLO_ SCH_ BDC_ BLN_2.xlsx'

# Reading the Excel file to inspect sheet names and general structure
xls = pd.ExcelFile(file_path)

# Display the sheet names to understand how the data is organized
xls.sheet_names



['Sheet1']

In [14]:
# Since the file contains a single sheet 'Sheet1', let's load it and inspect the first few rows to understand the data structure.
rain = pd.read_excel(file_path, sheet_name='Sheet1')

# Drop the 'DATE TIME' column
rain = rain.drop(columns=['DATE TIME'])

# Convert the 'OBS DATE' column to datetime; unparseable values become NaT
rain['OBS DATE'] = pd.to_datetime(rain['OBS DATE'], errors='coerce')

# Display the first few rows of the DataFrame to understand the structure
rain.head()



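|   | STATION_ID | DURATION | SENSOR_NUMBER | SENS_TYPE | OBS DATE            | VALUE | DATA_FLAG | UNITS  |
|---|------------|----------|---------------|-----------|---------------------|-------|-----------|--------|
| 0 | BLO        | H        | 2             | RAIN      | 2014-09-01 00:00:00 | 0.04  |           | INCHES |
| 1 | BLO        | H        | 2             | RAIN      | 2014-09-01 01:00:00 | 0.04  |           | INCHES |
| 2 | BLO        | H        | 2             | RAIN      | 2014-09-01 02:00:00 | 0.04  |           | INCHES |
| 3 | BLO        | H        | 2             | RAIN      | 2014-09-01 03:00:00 | 0.04  |           | INCHES |
| 4 | BLO        | H        | 2             | RAIN      | 2014-09-01 04:00:00 | 0.04  |           | INCHES |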
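Because `errors='coerce'` turns unparseable timestamps into `NaT` without raising, it can be worth counting how many values were lost in the conversion. A minimal sketch on made-up input follows; the series `raw` is hypothetical and stands in for the `OBS DATE` column before conversion.

```python
import pandas as pd

# Hypothetical raw timestamps: two valid, one malformed, one already missing
raw = pd.Series(["2014-09-01 00:00", "2014-09-01 01:00", "not a date", None])

parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present in the input but could not be parsed
n_bad = int(parsed.isna().sum() - raw.isna().sum())

print(f"{n_bad} value(s) could not be parsed as datetimes")
```

A nonzero count would be a cue to inspect the offending rows before dropping or imputing them.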


In [15]:
print(rain.dtypes)

STATION_ID               object
DURATION                 object
SENSOR_NUMBER             int64
SENS_TYPE                object
OBS DATE         datetime64[ns]
VALUE                   float64
DATA_FLAG                object
UNITS                    object
dtype: object
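The rain data is in long format (one row per station per hour), while the hourly stream data has one row per hour. One possible way to line the two up, sketched here on invented toy values, is to pivot the rain readings to one column per station and join on the hourly timestamp. This is an illustrative assumption about a later step, not the notebook's actual method; `rain_toy`, `stream_toy`, `rain_wide`, and `merged` are hypothetical names.

```python
import pandas as pd

hours = pd.to_datetime(["2014-09-01 00:00", "2014-09-01 01:00"])

# Invented long-format rain readings for two of the stations
rain_toy = pd.DataFrame({
    "STATION_ID": ["BLO", "BLO", "SCH", "SCH"],
    "OBS DATE": list(hours) * 2,
    "VALUE": [0.04, 0.04, 0.00, 0.10],
})

# Invented hourly stream data in the same shape as max_height_per_hour
stream_toy = pd.DataFrame({"date_hour": hours, "gage_height": [2.69, 2.70]})

# One column per station, indexed by observation time
rain_wide = rain_toy.pivot_table(
    index="OBS DATE", columns="STATION_ID", values="VALUE", aggfunc="first"
)

# Left join keeps every stream hour, even where rain readings are missing
merged = stream_toy.merge(rain_wide, left_on="date_hour", right_index=True, how="left")

print(merged)
```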
