# Mangrove Monitoring Data Description

This document describes the datasets used to monitor mangrove ecosystems. The data come from USGS and NOAA and are grouped into hydrological, water quality, and tidal parameters.

### Data Location  
- **Site Name:** PUMPKIN RIVER NEAR GOODLAND, FL  
- **Site ID:** 255534081324000  
- **TS_ID:** Internal number representing a time series.
- **Agency ID:** USGS

---

## 1. Hydrological Data
Hydrological data helps monitor **water flow and flooding patterns**, which are crucial for mangrove survival.

**Table Name: `hydrology_data`**

| Column Name                  | Description                                         | Source  | Unit |
|------------------------------|-----------------------------------------------------|---------|------|
| `datetime`                   | Timestamp of the measurement                        | USGS    | UTC  |
| `gage_height`                | Water level measurement                            | USGS    | Feet |
| `discharge_cfs`              | Volume of water flow per second                    | USGS    | Cubic feet per second (cfs) |
| `discharge_tidally_filtered` | Discharge adjusted for tidal effects               | USGS    | Cubic feet per second (cfs) |

### Hydrological Data Parameters
This table provides details of the key hydrological parameters recorded in the dataset.

| **Parameter**                                        | **TS_ID** | **Code** | **Description**                                         | **Qualification Codes**                                  |
|------------------------------------------------------|----------|---------|---------------------------------------------------------|----------------------------------------------------------|
| **Gage height, feet**                                | 33711    | 00065   | Water level measurement in feet                         | P - Provisional data subject to revision                 |
| **Discharge, cubic feet per second**                | 169594   | 00060   | Volume of water flow per second                        | P - Provisional data subject to revision, e - Estimated  |
| **Discharge, tidally filtered, cubic feet per second** | 171552   | 72137   | Discharge adjusted for tidal effects                   | P - Provisional data subject to revision, e - Estimated  |

---

## 2. Water Quality Data
These parameters assess **water conditions**, including salinity, which impacts mangrove health.

**Table Name: `water_quality_data`**

| Column Name              | Description                                      | Source  | Unit |
|--------------------------|--------------------------------------------------|---------|------|
| `datetime`               | Timestamp of the measurement                     | USGS    | UTC  |
| `water_temperature`      | Temperature of water at monitoring depth        | USGS    | Degrees Celsius |
| `salinity`              | Salt concentration in water                      | USGS    | Parts per thousand (ppt) |
| `specific_conductance`   | Conductivity of water, used to estimate salinity | USGS    | Microsiemens per cm (µS/cm) |

### Water Quality Data Parameters
This table provides details of the key water quality parameters recorded in the dataset.

| **Parameter**                                        | **TS_ID** | **Code** | **Description**                                           | **Qualification Codes**                                  |
|------------------------------------------------------|----------|---------|-----------------------------------------------------------|----------------------------------------------------------|
| **Salinity, water, unfiltered, parts per thousand, Near Bottom** | 33713    | 00480   | Salt concentration measured near the bottom               | P - Provisional data subject to revision                 |
| **Specific conductance, water, unfiltered, microsiemens per centimeter at 25°C, Near Bottom** | 223462   | 00095   | Electrical conductivity of water, a proxy for salinity    | P - Provisional data subject to revision                 |
| **Temperature, water, degrees Celsius, Near Bottom** | 33712    | 00010   | Temperature of water at near-bottom depths                | P - Provisional data subject to revision                 |

---

## 3. Tidal & Hydrodynamic Data
These parameters track **tidal influence and inundation** affecting mangrove root systems.

**Table Name: `tidal_data`**

| Column Name                    | Description                                     | Source  | Unit |
|---------------------------------|------------------------------------------------|---------|------|
| `datetime`                      | Timestamp of the measurement                   | USGS    | UTC  |
| `stream_elevation_navd88`        | Elevation of water level above NAVD88 reference | USGS    | Feet |
| `tidal_inundation_patterns`      | Water level fluctuations due to tides         | **Missing** | Feet |

### Tidal & Hydrodynamic Data Parameters
This table provides details of the key tidal and hydrodynamic parameters recorded in the dataset.

| **Parameter**                                         | **TS_ID** | **Code** | **Description**                                       | **Qualification Codes**                                  |
|-------------------------------------------------------|----------|---------|-------------------------------------------------------|----------------------------------------------------------|
| **Stream water level elevation above NAVD 1988, in feet** | 302824   | 63160   | Elevation of water level relative to NAVD 1988        | P - Provisional data subject to revision                 |

---

This document provides the foundational structure for analyzing **mangrove hydrology, water quality, and tidal impacts** based on USGS data.


# Hydrological Data

## EDA

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pytz

In [2]:
gage_height = pd.read_csv("hydrological_data/gage_height_stream_water_levels.tsv",sep="\t")
discharge = pd.read_csv("hydrological_data/discharge.tsv",sep="\t")
discharge_tf = pd.read_csv("hydrological_data/discharge_filtered.tsv",sep="\t")

In [3]:
# Standardizing column names for merging
gage_height = gage_height.rename(columns={"33711_00065": "gage_height", "33711_00065_cd": "gage_height_cd"})
discharge = discharge.rename(columns={"169594_00060": "discharge", "169594_00060_cd": "discharge_cd"})
discharge_tf = discharge_tf.rename(columns={"171552_72137": "discharge_tf", "171552_72137_cd": "discharge_tf_cd"})
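
The raw USGS column names renamed above follow a `<TS_ID>_<parameter code>` pattern (for example `33711_00065` for gage height), with a `_cd` suffix for the qualification-code column. The mapping can be built from those two identifiers instead of being typed by hand. This is a minimal sketch; `usgs_rename_map` is a hypothetical helper introduced here for illustration and is not part of the original notebook.

```python
def usgs_rename_map(ts_id, param_code, name):
    """Build a rename mapping for one USGS time series (hypothetical helper)."""
    raw = f"{ts_id}_{param_code}"
    # Map the value column and its qualification-code companion
    return {raw: name, f"{raw}_cd": f"{name}_cd"}


# Parameter codes are passed as strings so leading zeros survive
gage_map = usgs_rename_map(33711, "00065", "gage_height")
print(gage_map)
```

The result could be passed straight to `DataFrame.rename(columns=...)`, as in the cell above.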

In [4]:
# Standardize timestamps (this function is reused for every USGS dataset in this notebook)
def standardize_timestamps(df):
    """
    Create timezone-aware timestamps, handling EST/EDT correctly
    """
    df = df.copy()
    
    def localize_datetime(row):
        # Create a naive datetime object
        dt = pd.to_datetime(row['datetime'])
        tz_code = row['tz_cd']
        
        # Create naive datetime
        naive_dt = pd.Timestamp(dt.year, dt.month, dt.day, dt.hour, dt.minute, dt.second)
        
        # Apply the specific UTC offset based on the timezone code
        if tz_code == 'EDT':
            # EDT is UTC-4
            return naive_dt.tz_localize(pytz.FixedOffset(-4*60))
        elif tz_code == 'EST':
            # EST is UTC-5
            return naive_dt.tz_localize(pytz.FixedOffset(-5*60))
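        else:
            # Unknown code: the value stays timezone-naive here, so the later
            # pd.to_datetime(..., utc=True) call will treat it as UTC
            return naive_dt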
    
    # Apply the function to create timezone-aware datetimes
    df['datetime_with_tz'] = df.apply(localize_datetime, axis=1)
    
    return df
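
The row-wise `apply` above calls a Python function once per record, which can be slow on tens of thousands of rows. A vectorized alternative is sketched below under two assumptions: `tz_cd` only ever contains `EST` or `EDT`, and it is acceptable to go straight to UTC rather than through fixed-offset timestamps. The function name `standardize_timestamps_vectorized` is hypothetical, and unlike the original it raises on an unknown code instead of leaving the value naive.

```python
import pandas as pd

# Fixed UTC offsets in hours for the two codes handled by the original function
UTC_OFFSET_HOURS = {"EST": -5, "EDT": -4}


def standardize_timestamps_vectorized(df):
    """Return a copy with a UTC 'datetime_with_tz' column (hypothetical variant)."""
    out = df.copy()
    offsets = out["tz_cd"].map(UTC_OFFSET_HOURS)
    if offsets.isna().any():
        unknown = sorted(out.loc[offsets.isna(), "tz_cd"].astype(str).unique())
        raise ValueError(f"Unhandled tz_cd values: {unknown}")
    local = pd.to_datetime(out["datetime"])
    # UTC equals local wall-clock time minus the UTC offset
    utc_naive = local - pd.to_timedelta(offsets, unit="h")
    out["datetime_with_tz"] = utc_naive.dt.tz_localize("UTC")
    return out


demo = pd.DataFrame(
    {"datetime": ["2024-02-28 01:45", "2024-07-01 12:00"], "tz_cd": ["EST", "EDT"]}
)
demo_utc = standardize_timestamps_vectorized(demo)
print(demo_utc)
```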

In [5]:
# Apply timezone standardization to hydrological datasets
gage_height = standardize_timestamps(gage_height)
discharge = standardize_timestamps(discharge)
discharge_tf = standardize_timestamps(discharge_tf)

# Check for duplicates in datetime_with_tz for each dataset
print(f"Duplicates in gage_height: {gage_height['datetime_with_tz'].duplicated().sum()}")
print(f"Duplicates in discharge: {discharge['datetime_with_tz'].duplicated().sum()}")
print(f"Duplicates in discharge_tf: {discharge_tf['datetime_with_tz'].duplicated().sum()}")

Duplicates in gage_height: 0
Duplicates in discharge: 0
Duplicates in discharge_tf: 0


In [6]:
gage_height.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33768 entries, 0 to 33767
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   agency_cd         33768 non-null  object 
 1   site_no           33768 non-null  int64  
 2   datetime          33768 non-null  object 
 3   tz_cd             33768 non-null  object 
 4   gage_height       33768 non-null  float64
 5   gage_height_cd    33768 non-null  object 
 6   datetime_with_tz  33768 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 1.8+ MB


In [7]:
# Drop unnecessary columns
gage_height = gage_height.drop(columns=["datetime", "site_no", "gage_height_cd", "agency_cd", "tz_cd"])
discharge = discharge.drop(columns=["datetime", "site_no", "discharge_cd", "agency_cd", "tz_cd"])
discharge_tf = discharge_tf.drop(columns=["datetime", "site_no", "discharge_tf_cd", "agency_cd", "tz_cd"])

In [8]:
gage_height['datetime_with_tz'] = pd.to_datetime(gage_height['datetime_with_tz'], utc=True)
discharge['datetime_with_tz'] = pd.to_datetime(discharge['datetime_with_tz'], utc=True)
discharge_tf['datetime_with_tz'] = pd.to_datetime(discharge_tf['datetime_with_tz'], utc=True)
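
The conversion above is needed because `datetime_with_tz` comes out of `standardize_timestamps` with `object` dtype: its values can carry different fixed offsets (UTC-5 and UTC-4), which pandas cannot hold in a single timezone-aware datetime column. Passing `utc=True` converts every value to one common timezone. The toy series below is made up for illustration and uses the standard-library `timezone` class in place of `pytz`.

```python
from datetime import timedelta, timezone

import pandas as pd

est = timezone(timedelta(hours=-5))  # fixed offset used for EST
edt = timezone(timedelta(hours=-4))  # fixed offset used for EDT

# Object-dtype series holding timestamps with two different offsets
mixed = pd.Series(
    [pd.Timestamp("2024-03-10 01:45", tz=est), pd.Timestamp("2024-03-10 03:00", tz=edt)],
    dtype=object,
)

# utc=True converts each value to UTC and yields one tz-aware datetime dtype
as_utc = pd.to_datetime(mixed, utc=True)
print(as_utc)
```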

In [9]:
display(gage_height.head())
display(discharge.head())
display(discharge_tf.head())

Unnamed: 0,gage_height,datetime_with_tz
0,7.81,2024-02-27 22:00:00+00:00
1,7.7,2024-02-27 22:15:00+00:00
2,7.57,2024-02-27 22:30:00+00:00
3,7.43,2024-02-27 22:45:00+00:00
4,7.3,2024-02-27 23:00:00+00:00


Unnamed: 0,discharge,datetime_with_tz
0,-59.1,2024-02-28 06:45:00+00:00
1,-64.9,2024-02-28 07:00:00+00:00
2,-67.1,2024-02-28 07:15:00+00:00
3,-67.3,2024-02-28 07:30:00+00:00
4,-85.3,2024-02-28 07:45:00+00:00


Unnamed: 0,discharge_tf,datetime_with_tz
0,0.64,2024-02-28 07:00:00+00:00
1,0.49,2024-02-28 08:00:00+00:00
2,0.37,2024-02-28 09:00:00+00:00
3,0.26,2024-02-28 10:00:00+00:00
4,0.19,2024-02-28 11:00:00+00:00


In [10]:
# info() prints its report and returns None, so it is called directly
# rather than wrapped in display()
gage_height.info()
discharge.info()
discharge_tf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33768 entries, 0 to 33767
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   gage_height       33768 non-null  float64            
 1   datetime_with_tz  33768 non-null  datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), float64(1)
memory usage: 527.8 KB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33743 entries, 0 to 33742
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   discharge         33743 non-null  float64            
 1   datetime_with_tz  33743 non-null  datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), float64(1)
memory usage: 527.4 KB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8249 entries, 0 to 8248
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   discharge_tf      8249 non-null   float64            
 1   datetime_with_tz  8249 non-null   datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), float64(1)
memory usage: 129.0 KB


In [11]:
# # Start with discharge_tf having datetime_with_tz in UTC
# # Create a copy of datetime_with_tz converted to Eastern time for index
# discharge_tf['datetime_eastern'] = discharge_tf['datetime_with_tz'].dt.tz_convert('US/Eastern')
# discharge_tf.set_index('datetime_eastern', inplace=True)

# # Handle any duplicates
# if discharge_tf.index.duplicated().any():
#     discharge_tf = discharge_tf[~discharge_tf.index.duplicated(keep='first')]

# # Resample to 15-minute intervals
# discharge_tf_resampled = discharge_tf.resample('15min').asfreq()

# # Apply cubic interpolation for numeric data
# discharge_tf_resampled['discharge_tf'] = discharge_tf_resampled['discharge_tf'].interpolate(method='cubic')

# # Handle categorical columns with forward fill
# if 'discharge_tf_cd' in discharge_tf_resampled.columns:
#     discharge_tf_resampled['discharge_tf_cd'] = discharge_tf_resampled['discharge_tf_cd'].ffill()

# # Reset index to get the Eastern time as a column
# discharge_tf_resampled.reset_index(inplace=True)

# # Now create new UTC timestamps for the 15-minute intervals
# # We create this from the Eastern time to ensure perfect alignment with the resampled data
# discharge_tf_resampled['datetime_with_tz'] = discharge_tf_resampled['datetime_eastern'].dt.tz_convert('UTC')

# # Select and order the columns you want
# discharge_tf_resampled = discharge_tf_resampled[['datetime_with_tz', 'discharge_tf', 'discharge_tf_cd']]
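
The commented-out cell above outlines upsampling the hourly tidally filtered discharge to 15-minute steps. A minimal working sketch of that idea on made-up values follows. It swaps in linear time interpolation rather than the cubic method named in the comments (cubic interpolation in pandas needs SciPy), and it resamples directly on a UTC index rather than going through Eastern time. Interpolated points are synthetic and should not be read as measurements.

```python
import pandas as pd

# Toy hourly series on a UTC index (values are illustrative only)
hourly = pd.Series(
    [0.0, 4.0, 2.0],
    index=pd.date_range("2024-02-28 07:00", periods=3, freq="60min", tz="UTC"),
    name="discharge_tf",
)

# Insert the missing 15-minute slots as NaN, then fill them linearly in time
quarter_hourly = hourly.resample("15min").asfreq().interpolate(method="time")
print(quarter_hourly)
```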

In [12]:
# discharge_tf_resampled.head()

In [13]:
discharge_tf.head()

Unnamed: 0,discharge_tf,datetime_with_tz
0,0.64,2024-02-28 07:00:00+00:00
1,0.49,2024-02-28 08:00:00+00:00
2,0.37,2024-02-28 09:00:00+00:00
3,0.26,2024-02-28 10:00:00+00:00
4,0.19,2024-02-28 11:00:00+00:00


In [14]:
discharge_tf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8249 entries, 0 to 8248
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   discharge_tf      8249 non-null   float64            
 1   datetime_with_tz  8249 non-null   datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), float64(1)
memory usage: 129.0 KB
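
The row counts reported so far (about 33,700 for gage height and discharge against 8,249 for tidally filtered discharge) are consistent with the 15-minute and hourly steps visible in the `head()` outputs. The sketch below checks the sampling interval directly. `dominant_interval` is a hypothetical helper, and the demo timestamps are made up.

```python
import pandas as pd


def dominant_interval(timestamps):
    """Return the most frequent gap between consecutive sorted timestamps (hypothetical helper)."""
    gaps = pd.Series(timestamps).sort_values().diff().dropna()
    return gaps.mode().iloc[0]


demo_15min = pd.date_range("2024-02-28 07:00", periods=8, freq="15min", tz="UTC")
demo_hourly = pd.date_range("2024-02-28 07:00", periods=8, freq="60min", tz="UTC")

print(dominant_interval(demo_15min), dominant_interval(demo_hourly))
```

On the real data, the call would be `dominant_interval(gage_height["datetime_with_tz"])`, and likewise for the other frames.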


In [15]:
hydro_data = pd.merge(
    gage_height,
    discharge,
    on="datetime_with_tz",
    how="inner"
)

In [16]:
hydrological_data = pd.merge(
    hydro_data,
    discharge_tf,
    on="datetime_with_tz",
    how="inner"
)
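
An inner merge keeps only timestamps present in every input, so the hourly `discharge_tf` series limits the merged result to roughly hourly rows. One way to see how many rows each side contributes is an outer merge with `indicator=True`, sketched here on toy frames (the values and the shortened date ranges are made up for illustration).

```python
import pandas as pd

# 15-minute series: 07:00 through 08:45
left = pd.DataFrame({
    "datetime_with_tz": pd.date_range("2024-02-28 07:00", periods=8, freq="15min", tz="UTC"),
    "gage_height": range(8),
})
# Hourly series: 07:00, 08:00, 09:00
right = pd.DataFrame({
    "datetime_with_tz": pd.date_range("2024-02-28 07:00", periods=3, freq="60min", tz="UTC"),
    "discharge_tf": [0.64, 0.49, 0.37],
})

# The _merge column labels each row as 'both', 'left_only', or 'right_only'
audit = pd.merge(left, right, on="datetime_with_tz", how="outer", indicator=True)
counts = audit["_merge"].value_counts()
print(counts)
```

Rows labelled `both` are the ones an inner merge would keep. The other two labels show what is discarded from each side.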

In [17]:
hydrological_data.tail(20)

Unnamed: 0,gage_height,datetime_with_tz,discharge,discharge_tf
8224,6.64,2025-02-25 00:00:00+00:00,-18.7,-8.99
8225,7.0,2025-02-25 01:00:00+00:00,-61.2,-8.7
8226,7.62,2025-02-25 02:00:00+00:00,-90.1,-8.36
8227,8.17,2025-02-25 03:00:00+00:00,-141.0,-8.0
8228,8.41,2025-02-25 04:00:00+00:00,-86.1,-7.6
8229,8.54,2025-02-25 05:00:00+00:00,-158.0,-7.17
8230,8.17,2025-02-25 06:00:00+00:00,88.8,-6.71
8231,7.61,2025-02-25 07:00:00+00:00,94.5,-6.22
8232,6.92,2025-02-25 08:00:00+00:00,90.7,-5.73
8233,6.23,2025-02-25 09:00:00+00:00,64.3,-5.24


In [18]:
hydrological_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8244 entries, 0 to 8243
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   gage_height       8244 non-null   float64            
 1   datetime_with_tz  8244 non-null   datetime64[ns, UTC]
 2   discharge         8244 non-null   float64            
 3   discharge_tf      8244 non-null   float64            
dtypes: datetime64[ns, UTC](1), float64(3)
memory usage: 257.8 KB


# Water Quality Data

In [19]:
salinity = pd.read_csv("water_quality_data/salinity.tsv",sep="\t")
specific_conductance = pd.read_csv("water_quality_data/specific_conductance.tsv",sep="\t")
temperature = pd.read_csv("water_quality_data/temperature.tsv",sep="\t")

In [20]:
# Standardizing column names for merging
salinity = salinity.rename(columns={"33713_00480": "salinity", "33713_00480_cd": "salinity_cd"})
specific_conductance = specific_conductance.rename(columns={"223462_00095": "specific_conductance", "223462_00095_cd": "specific_conductance_cd"})
temperature = temperature.rename(columns={"33712_00010": "temperature", "33712_00010_cd": "temperature_cd"})

In [21]:
display(salinity.head())
display(specific_conductance.head())
display(temperature.head())

display(salinity.tail())
display(specific_conductance.tail())
display(temperature.tail())

Unnamed: 0,agency_cd,site_no,datetime,tz_cd,salinity,salinity_cd
0,USGS,255534081324000,2024-02-28 01:45,EST,31.0,P
1,USGS,255534081324000,2024-02-28 02:00,EST,31.0,P
2,USGS,255534081324000,2024-02-28 02:15,EST,31.0,P
3,USGS,255534081324000,2024-02-28 02:30,EST,31.0,P
4,USGS,255534081324000,2024-02-28 02:45,EST,31.0,P


Unnamed: 0,agency_cd,site_no,datetime,tz_cd,specific_conductance,specific_conductance_cd
0,USGS,255534081324000,2024-02-28 01:45,EST,47900,P
1,USGS,255534081324000,2024-02-28 02:00,EST,47900,P
2,USGS,255534081324000,2024-02-28 02:15,EST,47800,P
3,USGS,255534081324000,2024-02-28 02:30,EST,47600,P
4,USGS,255534081324000,2024-02-28 02:45,EST,47600,P


Unnamed: 0,agency_cd,site_no,datetime,tz_cd,temperature,temperature_cd
0,USGS,255534081324000,2024-02-28 01:45,EST,22.6,P
1,USGS,255534081324000,2024-02-28 02:00,EST,22.6,P
2,USGS,255534081324000,2024-02-28 02:15,EST,22.6,P
3,USGS,255534081324000,2024-02-28 02:30,EST,22.7,P
4,USGS,255534081324000,2024-02-28 02:45,EST,22.6,P


Unnamed: 0,agency_cd,site_no,datetime,tz_cd,salinity,salinity_cd
32678,USGS,255534081324000,2025-02-27 00:15,EST,36.0,P
32679,USGS,255534081324000,2025-02-27 00:30,EST,36.0,P
32680,USGS,255534081324000,2025-02-27 00:45,EST,36.0,P
32681,USGS,255534081324000,2025-02-27 01:00,EST,36.0,P
32682,USGS,255534081324000,2025-02-27 01:15,EST,36.0,P


Unnamed: 0,agency_cd,site_no,datetime,tz_cd,specific_conductance,specific_conductance_cd
32679,USGS,255534081324000,2025-02-27 00:15,EST,54100,P
32680,USGS,255534081324000,2025-02-27 00:30,EST,54100,P
32681,USGS,255534081324000,2025-02-27 00:45,EST,54000,P
32682,USGS,255534081324000,2025-02-27 01:00,EST,54000,P
32683,USGS,255534081324000,2025-02-27 01:15,EST,54000,P


Unnamed: 0,agency_cd,site_no,datetime,tz_cd,temperature,temperature_cd
33742,USGS,255534081324000,2025-02-27 00:15,EST,23.1,P
33743,USGS,255534081324000,2025-02-27 00:30,EST,23.0,P
33744,USGS,255534081324000,2025-02-27 00:45,EST,23.0,P
33745,USGS,255534081324000,2025-02-27 01:00,EST,22.9,P
33746,USGS,255534081324000,2025-02-27 01:15,EST,22.8,P


In [22]:
# Apply timezone standardization to water quality datasets
salinity = standardize_timestamps(salinity)
specific_conductance = standardize_timestamps(specific_conductance)
temperature = standardize_timestamps(temperature)

# Check for duplicates in datetime_with_tz for each dataset
print(f"Duplicates in salinity: {salinity['datetime_with_tz'].duplicated().sum()}")
print(f"Duplicates in specific_conductance: {specific_conductance['datetime_with_tz'].duplicated().sum()}")
print(f"Duplicates in temperature: {temperature['datetime_with_tz'].duplicated().sum()}")

Duplicates in salinity: 0
Duplicates in specific_conductance: 0
Duplicates in temperature: 0


In [23]:
# Drop unnecessary columns (a list keeps the call consistent with the hydrological cells)
salinity = salinity.drop(columns=['agency_cd', 'site_no', 'salinity_cd', 'datetime', 'tz_cd'])
specific_conductance = specific_conductance.drop(columns=['agency_cd', 'site_no', 'specific_conductance_cd', 'datetime', 'tz_cd'])
temperature = temperature.drop(columns=['agency_cd', 'site_no', 'temperature_cd', 'datetime', 'tz_cd'])

In [24]:
salinity['datetime_with_tz'] = pd.to_datetime(salinity['datetime_with_tz'], utc=True)
specific_conductance['datetime_with_tz'] = pd.to_datetime(specific_conductance['datetime_with_tz'], utc=True)
temperature['datetime_with_tz'] = pd.to_datetime(temperature['datetime_with_tz'], utc=True)

In [25]:
display(salinity.head())
display(specific_conductance.head())
display(temperature.head())

display(salinity.tail())
display(specific_conductance.tail())
display(temperature.tail())

Unnamed: 0,salinity,datetime_with_tz
0,31.0,2024-02-28 06:45:00+00:00
1,31.0,2024-02-28 07:00:00+00:00
2,31.0,2024-02-28 07:15:00+00:00
3,31.0,2024-02-28 07:30:00+00:00
4,31.0,2024-02-28 07:45:00+00:00


Unnamed: 0,specific_conductance,datetime_with_tz
0,47900,2024-02-28 06:45:00+00:00
1,47900,2024-02-28 07:00:00+00:00
2,47800,2024-02-28 07:15:00+00:00
3,47600,2024-02-28 07:30:00+00:00
4,47600,2024-02-28 07:45:00+00:00


Unnamed: 0,temperature,datetime_with_tz
0,22.6,2024-02-28 06:45:00+00:00
1,22.6,2024-02-28 07:00:00+00:00
2,22.6,2024-02-28 07:15:00+00:00
3,22.7,2024-02-28 07:30:00+00:00
4,22.6,2024-02-28 07:45:00+00:00


Unnamed: 0,salinity,datetime_with_tz
32678,36.0,2025-02-27 05:15:00+00:00
32679,36.0,2025-02-27 05:30:00+00:00
32680,36.0,2025-02-27 05:45:00+00:00
32681,36.0,2025-02-27 06:00:00+00:00
32682,36.0,2025-02-27 06:15:00+00:00


Unnamed: 0,specific_conductance,datetime_with_tz
32679,54100,2025-02-27 05:15:00+00:00
32680,54100,2025-02-27 05:30:00+00:00
32681,54000,2025-02-27 05:45:00+00:00
32682,54000,2025-02-27 06:00:00+00:00
32683,54000,2025-02-27 06:15:00+00:00


Unnamed: 0,temperature,datetime_with_tz
33742,23.1,2025-02-27 05:15:00+00:00
33743,23.0,2025-02-27 05:30:00+00:00
33744,23.0,2025-02-27 05:45:00+00:00
33745,22.9,2025-02-27 06:00:00+00:00
33746,22.8,2025-02-27 06:15:00+00:00


In [26]:
saltemp = pd.merge(
    salinity,
    specific_conductance,
    on="datetime_with_tz",
    how="inner"
)

In [27]:
water_quality = pd.merge(
    saltemp,
    temperature,
    on="datetime_with_tz",
    how="inner"
)
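
The two pairwise merges above can also be written as a single fold over a list of frames, which scales more easily when further series are added. This is a sketch only: `merge_on_timestamp` is a hypothetical helper, and the toy frames reuse a few values from the outputs above.

```python
from functools import reduce

import pandas as pd


def merge_on_timestamp(frames, key="datetime_with_tz"):
    """Inner-join a list of DataFrames on a shared timestamp column (hypothetical helper)."""
    return reduce(lambda left, right: pd.merge(left, right, on=key, how="inner"), frames)


ts = pd.date_range("2024-02-28 06:45", periods=3, freq="15min", tz="UTC")
a = pd.DataFrame({"datetime_with_tz": ts, "salinity": [31.0, 31.0, 31.0]})
# This frame lacks the last timestamp, so the inner join drops that row
b = pd.DataFrame({"datetime_with_tz": ts[:2], "specific_conductance": [47900, 47900]})
c = pd.DataFrame({"datetime_with_tz": ts, "temperature": [22.6, 22.6, 22.6]})

combined = merge_on_timestamp([a, b, c])
print(combined)
```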

In [28]:
len(salinity), len(temperature), len(specific_conductance)

(32683, 33747, 32684)

In [29]:
water_quality.head()

Unnamed: 0,salinity,datetime_with_tz,specific_conductance,temperature
0,31.0,2024-02-28 06:45:00+00:00,47900,22.6
1,31.0,2024-02-28 07:00:00+00:00,47900,22.6
2,31.0,2024-02-28 07:15:00+00:00,47800,22.6
3,31.0,2024-02-28 07:30:00+00:00,47600,22.7
4,31.0,2024-02-28 07:45:00+00:00,47600,22.6


In [30]:
water_quality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32681 entries, 0 to 32680
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   salinity              32681 non-null  float64            
 1   datetime_with_tz      32681 non-null  datetime64[ns, UTC]
 2   specific_conductance  32681 non-null  int64              
 3   temperature           32681 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(2), int64(1)
memory usage: 1021.4 KB


# Tidal and Hydrodynamic Data

In [31]:
stream_water_level = pd.read_csv('tidal_hydrodynamic_data/stream_water_level.tsv', sep='\t')

In [32]:
stream_water_level.head()

Unnamed: 0,agency_cd,site_no,datetime,tz_cd,302824_63160,302824_63160_cd
0,USGS,255534081324000,2024-02-28 02:00,EST,-0.9,P
1,USGS,255534081324000,2024-02-28 02:15,EST,-0.68,P
2,USGS,255534081324000,2024-02-28 02:30,EST,-0.44,P
3,USGS,255534081324000,2024-02-28 02:45,EST,-0.22,P
4,USGS,255534081324000,2024-02-28 03:00,EST,0.0,P


In [33]:
# Standardizing column names for merging
stream_water_level = stream_water_level.rename(columns={"302824_63160": "stream_water_level", "302824_63160_cd": "stream_water_level_cd"})

In [34]:
# Apply timezone standardization to Tidal and Hydrodynamic datasets
stream_water_level = standardize_timestamps(stream_water_level)

# Check for duplicates in datetime_with_tz for each dataset
print(f"Duplicates in stream_water_level: {stream_water_level['datetime_with_tz'].duplicated().sum()}")

Duplicates in stream_water_level: 0


In [35]:
stream_water_level = stream_water_level.drop(columns=['agency_cd', 'site_no', 'stream_water_level_cd', 'datetime', 'tz_cd'])

In [36]:
stream_water_level['datetime_with_tz'] = pd.to_datetime(stream_water_level['datetime_with_tz'], utc=True)

In [37]:
display(stream_water_level.head())
display(stream_water_level.tail())

Unnamed: 0,stream_water_level,datetime_with_tz
0,-0.9,2024-02-28 07:00:00+00:00
1,-0.68,2024-02-28 07:15:00+00:00
2,-0.44,2024-02-28 07:30:00+00:00
3,-0.22,2024-02-28 07:45:00+00:00
4,0.0,2024-02-28 08:00:00+00:00


Unnamed: 0,stream_water_level,datetime_with_tz
33763,1.16,2025-02-27 05:15:00+00:00
33764,1.3,2025-02-27 05:30:00+00:00
33765,1.42,2025-02-27 05:45:00+00:00
33766,1.5,2025-02-27 06:00:00+00:00
33767,1.53,2025-02-27 06:15:00+00:00


In [38]:
tide = pd.read_csv("tidal_hydrodynamic_data/tide.csv")

# Rename columns for clarity
tide = tide.rename(columns={
    'Predicted (ft)': 'tide_predicted',
    'Preliminary (ft)': 'tide_preliminary',
    'Verified (ft)': 'tide_verified'
})

In [39]:
tide.head()

Unnamed: 0,Date,Time (GMT),tide_predicted,tide_preliminary,tide_verified
0,2024/02/28,00:00,0.56,-,1.06
1,2024/02/28,01:00,0.248,-,0.63
2,2024/02/28,02:00,0.14,-,0.32
3,2024/02/28,03:00,0.325,-,0.72
4,2024/02/28,04:00,0.826,-,1.26


In [40]:
# Combine date and time into a single datetime column
tide['datetime'] = pd.to_datetime(
    tide['Date'] + ' ' + tide['Time (GMT)'],
    format='%Y/%m/%d %H:%M',
    errors='coerce'  # Handle any parsing errors
)

# Filter out any rows with parsing errors
tide = tide.dropna(subset=['datetime'])

# Since 'Time (GMT)' is already in GMT/UTC, we just need to localize it to UTC timezone
# without converting to Eastern Time
utc = pytz.timezone('UTC')  # Using UTC instead of GMT for consistency

# Localize to UTC timezone
tide['datetime_with_tz'] = tide['datetime'].dt.tz_localize(utc)

tide['tide_verified'] = pd.to_numeric(tide['tide_verified'], errors='coerce')

# Drop unnecessary columns
tide = tide.drop(['Date', 'Time (GMT)', 'datetime'], axis=1)

In [41]:
# The "-" placeholders in tide_verified were already coerced to NaN by
# pd.to_numeric in the previous cell, so no string filter or cast is needed here
tide = tide.drop(columns=['tide_preliminary'])

In [42]:
tide = tide[~tide['tide_verified'].isnull()]
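
The `-` placeholders in the tide file can also be handled at load time by telling `read_csv` to treat them as missing values, which leaves the numeric columns as floats from the start. The sketch below uses a small in-memory sample with the same column headers as the file read above; the second data row is made up to show a missing verified value.

```python
import io

import pandas as pd

sample = io.StringIO(
    "Date,Time (GMT),Predicted (ft),Preliminary (ft),Verified (ft)\n"
    "2024/02/28,00:00,0.56,-,1.06\n"
    "2024/02/28,01:00,0.248,-,-\n"
)

# na_values="-" turns the placeholder into NaN while parsing
tide_demo = pd.read_csv(sample, na_values="-")

# Times are already GMT/UTC, so parse them directly as UTC
tide_demo["datetime_with_tz"] = pd.to_datetime(
    tide_demo["Date"] + " " + tide_demo["Time (GMT)"],
    format="%Y/%m/%d %H:%M",
    utc=True,
)

# Keep only rows that have a verified water level
tide_demo = tide_demo.dropna(subset=["Verified (ft)"])[["datetime_with_tz", "Verified (ft)"]]
print(tide_demo)
```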

In [43]:
tide = tide.drop(columns=['tide_predicted'])

In [44]:
# tide_15min.info()

In [45]:
tide.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8136 entries, 0 to 8135
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   tide_verified     8136 non-null   float64            
 1   datetime_with_tz  8136 non-null   datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), float64(1)
memory usage: 190.7 KB


In [46]:
tidal_hydrodynamic = pd.merge(
    stream_water_level,
    tide,
    on="datetime_with_tz",
    how="inner"
)
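
Because the tide series is hourly, the inner merge above discards three of every four 15-minute water-level records. If the 15-minute resolution matters, one alternative is `pd.merge_asof`, which attaches the most recent tide reading to each water-level row. The sketch below assumes that carrying an hourly value forward for up to 59 minutes is acceptable for the analysis. The toy values are copied from the outputs in this section.

```python
import pandas as pd

levels = pd.DataFrame({
    "datetime_with_tz": pd.date_range("2024-02-28 07:00", periods=5, freq="15min", tz="UTC"),
    "stream_water_level": [-0.9, -0.68, -0.44, -0.22, 0.0],
})
tides = pd.DataFrame({
    "datetime_with_tz": pd.date_range("2024-02-28 07:00", periods=2, freq="60min", tz="UTC"),
    "tide_verified": [2.53, 2.69],
})

# Both inputs must be sorted on the key; 'backward' picks the latest tide
# reading at or before each water-level timestamp, within the tolerance
aligned = pd.merge_asof(
    levels.sort_values("datetime_with_tz"),
    tides.sort_values("datetime_with_tz"),
    on="datetime_with_tz",
    direction="backward",
    tolerance=pd.Timedelta("59min"),
)
print(aligned)
```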

In [47]:
tidal_hydrodynamic.head()

Unnamed: 0,stream_water_level,datetime_with_tz,tide_verified
0,-0.9,2024-02-28 07:00:00+00:00,2.53
1,0.0,2024-02-28 08:00:00+00:00,2.69
2,0.69,2024-02-28 09:00:00+00:00,2.46
3,0.65,2024-02-28 10:00:00+00:00,1.96
4,0.2,2024-02-28 11:00:00+00:00,1.64


In [48]:
tidal_hydrodynamic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7806 entries, 0 to 7805
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   stream_water_level  7806 non-null   float64            
 1   datetime_with_tz    7806 non-null   datetime64[ns, UTC]
 2   tide_verified       7806 non-null   float64            
dtypes: datetime64[ns, UTC](1), float64(2)
memory usage: 183.1 KB


# Meteorological Data

In [49]:
meteor = pd.read_csv('meteorological_data/meteor.csv')

# Rename columns for clarity
meteor = meteor.rename(columns={
    'Wind Speed (kn)': 'wind_speed',
    'Wind Dir (deg)': 'wind_direction',
    'Wind Gust (kn)': 'wind_gust',
    'Air Temp (°F)': 'air_temp',
})

In [50]:
meteor.head()

Unnamed: 0,Date,Time (GMT),wind_speed,wind_direction,wind_gust,air_temp
0,2/28/24,0:00,4.67,97,6.8,72.5
1,2/28/24,1:00,4.28,96,8.36,72.3
2,2/28/24,2:00,3.89,96,7.78,72.3
3,2/28/24,3:00,4.67,110,10.3,72.7
4,2/28/24,4:00,3.69,106,8.94,72.9


In [51]:
# Combine date and time into a single datetime column.
# The format string is inferred from the sample rows shown above (e.g. "2/28/24 0:00");
# stating it explicitly avoids relying on pandas' format inference.
meteor['datetime'] = pd.to_datetime(
    meteor['Date'] + ' ' + meteor['Time (GMT)'],
    format='%m/%d/%y %H:%M',
    errors='coerce'  # Unparseable rows become NaT and are dropped below
)

# Filter out any rows with parsing errors
meteor = meteor.dropna(subset=['datetime'])

# 'Time (GMT)' is already in GMT/UTC, so localize to UTC without converting
meteor['datetime_with_tz'] = meteor['datetime'].dt.tz_localize('UTC')

# Convert meteorological columns to numeric
for col in ['wind_speed', 'wind_direction', 'wind_gust', 'air_temp']:
    if col in meteor.columns:
        meteor[col] = pd.to_numeric(meteor[col], errors='coerce')

# Drop unnecessary columns
meteor = meteor.drop(['Date', 'Time (GMT)', 'datetime'], axis=1)

In [52]:
meteor.head()

Unnamed: 0,wind_speed,wind_direction,wind_gust,air_temp,datetime_with_tz
0,4.67,97.0,6.8,72.5,2024-02-28 00:00:00+00:00
1,4.28,96.0,8.36,72.3,2024-02-28 01:00:00+00:00
2,3.89,96.0,7.78,72.3,2024-02-28 02:00:00+00:00
3,4.67,110.0,10.3,72.7,2024-02-28 03:00:00+00:00
4,3.69,106.0,8.94,72.9,2024-02-28 04:00:00+00:00


In [53]:
meteor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8809 entries, 0 to 8808
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   wind_speed        8563 non-null   float64            
 1   wind_direction    8563 non-null   float64            
 2   wind_gust         8563 non-null   float64            
 3   air_temp          8806 non-null   float64            
 4   datetime_with_tz  8809 non-null   datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), float64(4)
memory usage: 344.2 KB
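
The meteorological columns are in knots and degrees Fahrenheit (per the original headers `Wind Speed (kn)`, `Wind Gust (kn)`, and `Air Temp (°F)`), while the water temperature in this project is in degrees Celsius. If a common unit system is wanted before combining the tables, the conversions can be sketched as below. `add_si_units` and the new column names are hypothetical.

```python
import pandas as pd

KNOT_TO_MS = 1852 / 3600  # one knot is one nautical mile (1852 m) per hour


def add_si_units(df):
    """Return a copy with Celsius and m/s columns added (hypothetical helper and column names)."""
    out = df.copy()
    out["air_temp_c"] = (out["air_temp"] - 32) * 5 / 9
    out["wind_speed_ms"] = out["wind_speed"] * KNOT_TO_MS
    out["wind_gust_ms"] = out["wind_gust"] * KNOT_TO_MS
    return out


# First row of the meteor.head() output above
demo = pd.DataFrame({"wind_speed": [4.67], "wind_gust": [6.8], "air_temp": [72.5]})
converted = add_si_units(demo)
print(converted)
```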


# Soil & Sediment Data

In [54]:
oxygen = pd.read_csv('soil_sediment_data/dissolved_oxygen.tsv', sep='\t')
turbidity = pd.read_csv('soil_sediment_data/turbidity.tsv', sep='\t')
ph = pd.read_csv('soil_sediment_data/ph.tsv', sep='\t')

In [55]:
# Standardizing column names for merging
oxygen = oxygen.rename(columns={"249866_00300": "oxygen", "249866_00300_cd": "oxygen_cd"})
turbidity = turbidity.rename(columns={"249887_63680": "turbidity", "249887_63680_cd": "turbidity_cd"})
ph = ph.rename(columns={"249898_00400": "ph", "249898_00400_cd": "ph_cd"})

In [56]:
# Apply timezone standardization to soil and sediment datasets
oxygen = standardize_timestamps(oxygen)
turbidity = standardize_timestamps(turbidity)
ph = standardize_timestamps(ph)

# Check for duplicates in datetime_with_tz for each dataset
print(f"Duplicates in oxygen: {oxygen['datetime_with_tz'].duplicated().sum()}")
print(f"Duplicates in turbidity: {turbidity['datetime_with_tz'].duplicated().sum()}")
print(f"Duplicates in ph: {ph['datetime_with_tz'].duplicated().sum()}")

Duplicates in oxygen: 0
Duplicates in turbidity: 0
Duplicates in ph: 0


In [57]:
oxygen = oxygen.drop(columns=['agency_cd', 'site_no', 'oxygen_cd', 'datetime', 'tz_cd'])
turbidity = turbidity.drop(columns=['agency_cd', 'site_no', 'turbidity_cd', 'datetime', 'tz_cd'])
ph = ph.drop(columns=['agency_cd', 'site_no', 'ph_cd', 'datetime', 'tz_cd'])

In [58]:
oxygen['datetime_with_tz'] = pd.to_datetime(oxygen['datetime_with_tz'], utc=True)
turbidity['datetime_with_tz'] = pd.to_datetime(turbidity['datetime_with_tz'], utc=True)
ph['datetime_with_tz'] = pd.to_datetime(ph['datetime_with_tz'], utc=True)

In [59]:
display(oxygen.head())
display(turbidity.head())
display(ph.head())

display(oxygen.tail())
display(turbidity.tail())
display(ph.tail())

Unnamed: 0,oxygen,datetime_with_tz
0,9.7,2024-03-01 01:15:00+00:00
1,9.6,2024-03-01 01:30:00+00:00
2,9.7,2024-03-01 01:45:00+00:00
3,9.7,2024-03-01 02:00:00+00:00
4,9.7,2024-03-01 02:15:00+00:00


Unnamed: 0,turbidity,datetime_with_tz
0,6.2,2024-03-01 01:15:00+00:00
1,4.7,2024-03-01 01:30:00+00:00
2,4.8,2024-03-01 01:45:00+00:00
3,4.5,2024-03-01 02:00:00+00:00
4,5.3,2024-03-01 02:15:00+00:00


Unnamed: 0,ph,datetime_with_tz
0,7.8,2024-03-01 01:15:00+00:00
1,7.8,2024-03-01 01:30:00+00:00
2,7.8,2024-03-01 01:45:00+00:00
3,7.9,2024-03-01 02:00:00+00:00
4,7.8,2024-03-01 02:15:00+00:00


Unnamed: 0,oxygen,datetime_with_tz
35018,10.2,2025-02-28 23:15:00+00:00
35019,10.2,2025-02-28 23:30:00+00:00
35020,10.1,2025-02-28 23:45:00+00:00
35021,9.9,2025-03-01 00:00:00+00:00
35022,9.6,2025-03-01 00:15:00+00:00


Unnamed: 0,turbidity,datetime_with_tz
34279,7.0,2025-02-28 23:15:00+00:00
34280,7.0,2025-02-28 23:30:00+00:00
34281,7.2,2025-02-28 23:45:00+00:00
34282,7.3,2025-03-01 00:00:00+00:00
34283,7.4,2025-03-01 00:15:00+00:00


Unnamed: 0,ph,datetime_with_tz
33170,8.4,2025-02-28 23:15:00+00:00
33171,8.4,2025-02-28 23:30:00+00:00
33172,8.4,2025-02-28 23:45:00+00:00
33173,8.3,2025-03-01 00:00:00+00:00
33174,8.3,2025-03-01 00:15:00+00:00


In [60]:
oxtur = pd.merge(
    oxygen,
    turbidity,
    on="datetime_with_tz",
    how="inner"
)

In [61]:
soil_sediment = pd.merge(
    oxtur,
    ph,
    on="datetime_with_tz",
    how="inner"
)

In [62]:
soil_sediment.head()

Unnamed: 0,oxygen,datetime_with_tz,turbidity,ph
0,9.7,2024-03-01 01:15:00+00:00,6.2,7.8
1,9.6,2024-03-01 01:30:00+00:00,4.7,7.8
2,9.7,2024-03-01 01:45:00+00:00,4.8,7.8
3,9.7,2024-03-01 02:00:00+00:00,4.5,7.9
4,9.7,2024-03-01 02:15:00+00:00,5.3,7.8
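
The two chained merges above could also be written as a single fold over a list of frames. The sketch below is one possible formulation, not the notebook's code: the helper `merge_on_timestamp` and the `*_demo` frames are hypothetical, and the values are illustrative. It adds `validate="one_to_one"` so that a duplicated timestamp raises an error instead of silently multiplying rows.

```python
from functools import reduce

import pandas as pd

# Tiny stand-in frames; the values are illustrative, not taken from the dataset.
ts = pd.to_datetime(
    ["2024-03-01 01:15", "2024-03-01 01:30", "2024-03-01 01:45"], utc=True
)
oxygen_demo = pd.DataFrame({"datetime_with_tz": ts, "oxygen": [9.7, 9.6, 9.7]})
turbidity_demo = pd.DataFrame({"datetime_with_tz": ts[:2], "turbidity": [6.2, 4.7]})
ph_demo = pd.DataFrame({"datetime_with_tz": ts[1:], "ph": [7.8, 7.8]})


def merge_on_timestamp(frames, key="datetime_with_tz"):
    """Inner-join the frames pairwise on `key`; fail if any key is duplicated."""
    return reduce(
        lambda left, right: pd.merge(
            left, right, on=key, how="inner", validate="one_to_one"
        ),
        frames,
    )


merged_demo = merge_on_timestamp([oxygen_demo, turbidity_demo, ph_demo])
print(merged_demo)
```

Only the 01:30 timestamp is present in all three demo frames, so the inner join keeps a single row.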


# Predictor Variables

In [63]:
display(hydrological_data.head())
display(water_quality.head())
display(tidal_hydrodynamic.head())
display(meteor.head())
display(soil_sediment.head())

Unnamed: 0,gage_height,datetime_with_tz,discharge,discharge_tf
0,6.22,2024-02-28 07:00:00+00:00,-64.9,0.64
1,7.12,2024-02-28 08:00:00+00:00,-80.2,0.49
2,7.81,2024-02-28 09:00:00+00:00,-67.9,0.37
3,7.77,2024-02-28 10:00:00+00:00,18.2,0.26
4,7.32,2024-02-28 11:00:00+00:00,62.6,0.19


Unnamed: 0,salinity,datetime_with_tz,specific_conductance,temperature
0,31.0,2024-02-28 06:45:00+00:00,47900,22.6
1,31.0,2024-02-28 07:00:00+00:00,47900,22.6
2,31.0,2024-02-28 07:15:00+00:00,47800,22.6
3,31.0,2024-02-28 07:30:00+00:00,47600,22.7
4,31.0,2024-02-28 07:45:00+00:00,47600,22.6


Unnamed: 0,stream_water_level,datetime_with_tz,tide_verified
0,-0.9,2024-02-28 07:00:00+00:00,2.53
1,0.0,2024-02-28 08:00:00+00:00,2.69
2,0.69,2024-02-28 09:00:00+00:00,2.46
3,0.65,2024-02-28 10:00:00+00:00,1.96
4,0.2,2024-02-28 11:00:00+00:00,1.64


Unnamed: 0,wind_speed,wind_direction,wind_gust,air_temp,datetime_with_tz
0,4.67,97.0,6.8,72.5,2024-02-28 00:00:00+00:00
1,4.28,96.0,8.36,72.3,2024-02-28 01:00:00+00:00
2,3.89,96.0,7.78,72.3,2024-02-28 02:00:00+00:00
3,4.67,110.0,10.3,72.7,2024-02-28 03:00:00+00:00
4,3.69,106.0,8.94,72.9,2024-02-28 04:00:00+00:00


Unnamed: 0,oxygen,datetime_with_tz,turbidity,ph
0,9.7,2024-03-01 01:15:00+00:00,6.2,7.8
1,9.6,2024-03-01 01:30:00+00:00,4.7,7.8
2,9.7,2024-03-01 01:45:00+00:00,4.8,7.8
3,9.7,2024-03-01 02:00:00+00:00,4.5,7.9
4,9.7,2024-03-01 02:15:00+00:00,5.3,7.8


In [64]:
# info() prints its summary and returns None, so call it directly rather than through display().
hydrological_data.info()
water_quality.info()
tidal_hydrodynamic.info()
meteor.info()
soil_sediment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8244 entries, 0 to 8243
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   gage_height       8244 non-null   float64            
 1   datetime_with_tz  8244 non-null   datetime64[ns, UTC]
 2   discharge         8244 non-null   float64            
 3   discharge_tf      8244 non-null   float64            
dtypes: datetime64[ns, UTC](1), float64(3)
memory usage: 257.8 KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32681 entries, 0 to 32680
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   salinity              32681 non-null  float64            
 1   datetime_with_tz      32681 non-null  datetime64[ns, UTC]
 2   specific_conductance  32681 non-null  int64              
 3   temperature           32681 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(2), int64(1)
memory usage: 1021.4 KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7806 entries, 0 to 7805
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   stream_water_level  7806 non-null   float64            
 1   datetime_with_tz    7806 non-null   datetime64[ns, UTC]
 2   tide_verified       7806 non-null   float64            
dtypes: datetime64[ns, UTC](1), float64(2)
memory usage: 183.1 KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8809 entries, 0 to 8808
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   wind_speed        8563 non-null   float64            
 1   wind_direction    8563 non-null   float64            
 2   wind_gust         8563 non-null   float64            
 3   air_temp          8806 non-null   float64            
 4   datetime_with_tz  8809 non-null   datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), float64(4)
memory usage: 344.2 KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32460 entries, 0 to 32459
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   oxygen            32460 non-null  float64            
 1   datetime_with_tz  32460 non-null  datetime64[ns, UTC]
 2   turbidity         32460 non-null  float64            
 3   ph                32460 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(3)
memory usage: 1014.5 KB
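
The summaries above show two sampling rates: `hydrological_data`, `tidal_hydrodynamic`, and `meteor` hold roughly 8,000 rows each, while `water_quality` and `soil_sediment` hold roughly 32,000 rows at 15-minute spacing. An inner join on the exact timestamp therefore keeps only the 15-minute readings that fall on the hour. One alternative, shown here as a sketch on made-up values and not as the notebook's method, is to average the 15-minute series to hourly values before merging. The frame `quarter_hourly` is hypothetical.

```python
import pandas as pd

# Eight illustrative 15-minute readings covering two full hours.
ts = pd.date_range("2024-03-01 01:00", periods=8, freq="15min", tz="UTC")
quarter_hourly = pd.DataFrame(
    {"datetime_with_tz": ts, "oxygen": [9.7, 9.6, 9.7, 9.7, 9.5, 9.5, 9.4, 9.2]}
)

# Average each hour's four readings; bins are labelled by the start of the hour.
hourly_mean = (
    quarter_hourly
    .set_index("datetime_with_tz")
    .resample("1h")
    .mean()
    .reset_index()
)
print(hourly_mean)
```

Whether an hourly mean or the on-the-hour reading is the better predictor depends on the modelling goal; the averaged series simply uses all of the available measurements.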

# Target Variables

In [65]:
ndvi = pd.read_csv('ndvi.csv')

ndvi.head()

Unnamed: 0,system:index,date,ndvi,.geo
0,2024_01_01,2024-01-01,0.588521,"{""type"":""MultiPoint"",""coordinates"":[]}"
1,2024_01_03,2024-01-03,0.228747,"{""type"":""MultiPoint"",""coordinates"":[]}"
2,2024_01_04,2024-01-04,0.103793,"{""type"":""MultiPoint"",""coordinates"":[]}"
3,2024_01_05,2024-01-05,0.586527,"{""type"":""MultiPoint"",""coordinates"":[]}"
4,2024_01_06,2024-01-06,-0.008358,"{""type"":""MultiPoint"",""coordinates"":[]}"


In [66]:
ndvi['datetime_with_tz'] = pd.to_datetime(ndvi['date'], utc=True)

In [67]:
ndvi = ndvi.drop(columns=['system:index', '.geo', 'date'])

In [68]:
ndvi['date'] = ndvi['datetime_with_tz'].dt.date

In [69]:
ndvi = ndvi.drop(columns=['datetime_with_tz'])

In [70]:
ndvi.head()

Unnamed: 0,ndvi,date
0,0.588521,2024-01-01
1,0.228747,2024-01-03
2,0.103793,2024-01-04
3,0.586527,2024-01-05
4,-0.008358,2024-01-06


In [71]:
ndvi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 343 entries, 0 to 342
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ndvi    343 non-null    float64
 1   date    343 non-null    object 
dtypes: float64(1), object(1)
memory usage: 5.5+ KB


In [72]:
pred_data = pd.merge(
    hydrological_data,
    water_quality,
    on="datetime_with_tz",
    how="inner"
)

In [73]:
pred_data = pd.merge(
    pred_data,
    tidal_hydrodynamic,
    on="datetime_with_tz",
    how="inner"
)

In [74]:
pred_data = pd.merge(
    pred_data,
    meteor,
    on="datetime_with_tz",
    how="inner"
)

In [75]:
pred_data = pd.merge(
    pred_data,
    soil_sediment,
    on="datetime_with_tz",
    how="inner"
)
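
Each inner merge drops any timestamp that is missing on either side, which is why the merged frame ends up with fewer rows than any single input. A hedged sketch of how such losses could be audited before committing to `how="inner"`, using an outer merge with `indicator=True`. The frames `left` and `right` are hypothetical stand-ins with illustrative values.

```python
import pandas as pd

left = pd.DataFrame({
    "datetime_with_tz": pd.to_datetime(
        ["2024-03-01 01:00", "2024-03-01 02:00", "2024-03-01 03:00"], utc=True
    ),
    "gage_height": [5.9, 5.4, 5.0],
})
right = pd.DataFrame({
    "datetime_with_tz": pd.to_datetime(
        ["2024-03-01 02:00", "2024-03-01 03:00", "2024-03-01 04:00"], utc=True
    ),
    "salinity": [32.0, 32.0, 33.0],
})

# The outer merge keeps every timestamp; the added "_merge" column records
# whether each row came from both frames, only the left, or only the right.
audit = pd.merge(left, right, on="datetime_with_tz", how="outer", indicator=True)
counts = audit["_merge"].value_counts()
print(counts)
```

Rows marked `both` are the ones an inner merge would keep; the `left_only` and `right_only` counts show how much each source loses.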

In [76]:
pred_data['date'] = pred_data['datetime_with_tz'].dt.date

In [77]:
display(pred_data.head())
display(pred_data.tail())

Unnamed: 0,gage_height,datetime_with_tz,discharge,discharge_tf,salinity,specific_conductance,temperature,stream_water_level,tide_verified,wind_speed,wind_direction,wind_gust,air_temp,oxygen,turbidity,ph,date
0,5.86,2024-03-01 02:00:00+00:00,55.3,3.03,32.0,48500,24.8,-1.26,0.4,7.58,78.0,9.72,74.3,9.7,4.5,7.9,2024-03-01
1,5.37,2024-03-01 03:00:00+00:00,30.5,2.91,32.0,49100,24.2,-1.75,0.1,4.47,88.0,9.14,73.9,9.5,5.1,7.8,2024-03-01
2,5.0,2024-03-01 04:00:00+00:00,26.5,2.79,32.0,49500,23.9,-2.12,0.34,5.44,86.0,7.97,74.1,9.3,4.9,7.7,2024-03-01
3,4.77,2024-03-01 05:00:00+00:00,8.69,2.68,33.0,49800,23.6,-2.35,0.86,4.86,79.0,8.36,73.8,9.1,5.6,7.7,2024-03-01
4,4.66,2024-03-01 06:00:00+00:00,3.76,2.57,33.0,50200,23.3,-2.46,1.22,5.83,82.0,8.55,73.9,9.1,3.8,7.7,2024-03-01


Unnamed: 0,gage_height,datetime_with_tz,discharge,discharge_tf,salinity,specific_conductance,temperature,stream_water_level,tide_verified,wind_speed,wind_direction,wind_gust,air_temp,oxygen,turbidity,ph,date
6982,6.27,2025-01-31 19:00:00+00:00,-81.6,-8.78,39.0,59000,24.1,-0.85,2.65,6.8,150.0,10.89,77.0,9.4,7.1,7.9,2025-01-31
6983,7.41,2025-01-31 20:00:00+00:00,-131.0,-8.72,39.0,57900,26.1,0.29,2.62,6.61,144.0,15.36,76.8,9.8,7.1,8.0,2025-01-31
6984,8.12,2025-01-31 21:00:00+00:00,-99.9,-8.64,38.0,57500,25.8,1.0,2.39,8.36,141.0,12.83,76.8,10.2,7.1,8.1,2025-01-31
6985,7.88,2025-01-31 22:00:00+00:00,25.2,-8.54,38.0,57500,25.5,0.76,2.0,6.61,129.0,10.5,76.1,10.6,7.3,8.2,2025-01-31
6986,7.37,2025-01-31 23:00:00+00:00,48.7,-8.41,38.0,57600,25.6,0.25,1.52,3.3,132.0,4.86,75.6,9.7,7.0,8.0,2025-01-31


In [78]:
pred_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6987 entries, 0 to 6986
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   gage_height           6987 non-null   float64            
 1   datetime_with_tz      6987 non-null   datetime64[ns, UTC]
 2   discharge             6987 non-null   float64            
 3   discharge_tf          6987 non-null   float64            
 4   salinity              6987 non-null   float64            
 5   specific_conductance  6987 non-null   int64              
 6   temperature           6987 non-null   float64            
 7   stream_water_level    6987 non-null   float64            
 8   tide_verified         6987 non-null   float64            
 9   wind_speed            6746 non-null   float64            
 10  wind_direction        6746 non-null   float64            
 11  wind_gust             6746 non-null   float64            
 12  air_te

In [79]:
pred_data.head()

Unnamed: 0,gage_height,datetime_with_tz,discharge,discharge_tf,salinity,specific_conductance,temperature,stream_water_level,tide_verified,wind_speed,wind_direction,wind_gust,air_temp,oxygen,turbidity,ph,date
0,5.86,2024-03-01 02:00:00+00:00,55.3,3.03,32.0,48500,24.8,-1.26,0.4,7.58,78.0,9.72,74.3,9.7,4.5,7.9,2024-03-01
1,5.37,2024-03-01 03:00:00+00:00,30.5,2.91,32.0,49100,24.2,-1.75,0.1,4.47,88.0,9.14,73.9,9.5,5.1,7.8,2024-03-01
2,5.0,2024-03-01 04:00:00+00:00,26.5,2.79,32.0,49500,23.9,-2.12,0.34,5.44,86.0,7.97,74.1,9.3,4.9,7.7,2024-03-01
3,4.77,2024-03-01 05:00:00+00:00,8.69,2.68,33.0,49800,23.6,-2.35,0.86,4.86,79.0,8.36,73.8,9.1,5.6,7.7,2024-03-01
4,4.66,2024-03-01 06:00:00+00:00,3.76,2.57,33.0,50200,23.3,-2.46,1.22,5.83,82.0,8.55,73.9,9.1,3.8,7.7,2024-03-01
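
The `info()` output for `pred_data` lists 6746 non-null values out of 6987 rows for the wind columns, so some hourly rows lack wind readings. The sketch below counts the gaps and, as one possible treatment that the notebook itself does not perform, fills isolated gaps by linear interpolation between neighbouring hours. The frame `demo` and its values are illustrative.

```python
import numpy as np
import pandas as pd

ts = pd.date_range("2024-03-01 02:00", periods=5, freq="h", tz="UTC")
demo = pd.DataFrame(
    {"datetime_with_tz": ts, "wind_speed": [7.58, np.nan, 5.44, np.nan, 5.83]}
)

# Number of missing values per column.
missing = demo.isna().sum()
print(missing)

# Linear interpolation treats rows as evenly spaced, which holds for hourly
# data; limit=1 fills at most one consecutive missing value per gap.
filled = demo["wind_speed"].interpolate(method="linear", limit=1)
print(filled)
```

Longer outages are better left missing or handled explicitly, since interpolating across many hours would invent a wind record that was never measured.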


In [80]:
full_data = pd.merge(
    pred_data,
    ndvi,
    on="date",
    how="inner"
)

In [81]:
full_data.head()

Unnamed: 0,gage_height,datetime_with_tz,discharge,discharge_tf,salinity,specific_conductance,temperature,stream_water_level,tide_verified,wind_speed,wind_direction,wind_gust,air_temp,oxygen,turbidity,ph,date,ndvi
0,5.86,2024-03-01 02:00:00+00:00,55.3,3.03,32.0,48500,24.8,-1.26,0.4,7.58,78.0,9.72,74.3,9.7,4.5,7.9,2024-03-01,0.335371
1,5.37,2024-03-01 03:00:00+00:00,30.5,2.91,32.0,49100,24.2,-1.75,0.1,4.47,88.0,9.14,73.9,9.5,5.1,7.8,2024-03-01,0.335371
2,5.0,2024-03-01 04:00:00+00:00,26.5,2.79,32.0,49500,23.9,-2.12,0.34,5.44,86.0,7.97,74.1,9.3,4.9,7.7,2024-03-01,0.335371
3,4.77,2024-03-01 05:00:00+00:00,8.69,2.68,33.0,49800,23.6,-2.35,0.86,4.86,79.0,8.36,73.8,9.1,5.6,7.7,2024-03-01,0.335371
4,4.66,2024-03-01 06:00:00+00:00,3.76,2.57,33.0,50200,23.3,-2.46,1.22,5.83,82.0,8.55,73.9,9.1,3.8,7.7,2024-03-01,0.335371
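
Because `ndvi` holds one value per date while `pred_data` is hourly, this merge repeats each day's NDVI on every hourly row of that date, as the identical `0.335371` values above show. The sketch below illustrates two points under stated assumptions: `validate="many_to_one"` guards against a date appearing twice in the NDVI table, and averaging the predictors per day is one alternative that yields a single row per NDVI observation. The frames `hourly` and `daily_ndvi` are hypothetical, and the second NDVI value is made up.

```python
import datetime as dt

import pandas as pd

hourly = pd.DataFrame({
    "date": [dt.date(2024, 3, 1)] * 3 + [dt.date(2024, 3, 2)] * 2,
    "gage_height": [5.86, 5.37, 5.00, 6.10, 6.30],
})
daily_ndvi = pd.DataFrame({
    "date": [dt.date(2024, 3, 1), dt.date(2024, 3, 2)],
    "ndvi": [0.335371, 0.412000],  # second value is illustrative only
})

# Same shape as the merge above: NDVI is repeated on every hourly row.
# validate raises pandas.errors.MergeError if a date is duplicated in daily_ndvi.
repeated = pd.merge(hourly, daily_ndvi, on="date", how="inner", validate="many_to_one")

# Alternative: collapse the predictors to one row per day before merging.
daily_mean = hourly.groupby("date", as_index=False).mean(numeric_only=True)
one_row_per_day = pd.merge(
    daily_mean, daily_ndvi, on="date", how="inner", validate="one_to_one"
)
print(repeated)
print(one_row_per_day)
```

With the repeated layout, rows from the same day share a target value and are not independent samples, which is worth keeping in mind when splitting the data for model evaluation.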


In [82]:
full_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5864 entries, 0 to 5863
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   gage_height           5864 non-null   float64            
 1   datetime_with_tz      5864 non-null   datetime64[ns, UTC]
 2   discharge             5864 non-null   float64            
 3   discharge_tf          5864 non-null   float64            
 4   salinity              5864 non-null   float64            
 5   specific_conductance  5864 non-null   int64              
 6   temperature           5864 non-null   float64            
 7   stream_water_level    5864 non-null   float64            
 8   tide_verified         5864 non-null   float64            
 9   wind_speed            5663 non-null   float64            
 10  wind_direction        5663 non-null   float64            
 11  wind_gust             5663 non-null   float64            
 12  air_te

In [83]:
full_data.to_csv('mangrove_data.csv', index=False)
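
CSV stores timestamps as text, so the `datetime64[ns, UTC]` dtype of `datetime_with_tz` is not preserved in the saved file. A minimal sketch of restoring it after reading the file back. It uses an in-memory buffer and a small hypothetical frame `demo` instead of `mangrove_data.csv` so that it runs on its own.

```python
import io

import pandas as pd

demo = pd.DataFrame({
    "datetime_with_tz": pd.to_datetime(
        ["2024-03-01 02:00", "2024-03-01 03:00"], utc=True
    ),
    "ndvi": [0.335371, 0.335371],
})

# Round-trip through CSV text, as to_csv/read_csv would do with a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# After reading, the timestamp column holds plain strings.
plain = pd.read_csv(buffer)

# Re-parse the column to get timezone-aware UTC timestamps again.
restored = plain.assign(
    datetime_with_tz=pd.to_datetime(plain["datetime_with_tz"], utc=True)
)
print(plain.dtypes)
print(restored.dtypes)
```

The same re-parsing step would apply to the `date` column if it is needed as a date type rather than as text.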