# Tutorial 2.4 Merge All Preprocessed Data

## 2.4.1 Introduction

This notebook is **Step 4** of the *Predict Future DO Tutorial Series*.

This notebook merges all cleaned, preprocessed inputs for the Predict Future DO Project. All three input files are read from the `CleanedData` folder:

- Water quality by station and date (`CBP_water_quality_surface.csv`)
- 14-day accumulated wind features (`wind.csv`)
- Accumulated discharge from major rivers (`USGS_discharge.csv`)

The final output is a single merged CSV, `CleanedData/final_model_data.csv`, ready for modeling.

In [7]:
import pandas as pd
import os

# Load water quality data
df_wq = pd.read_csv("CleanedData/CBP_water_quality_surface.csv", parse_dates=["SampleDate"])
df_wq = df_wq.rename(columns={"SampleDate": "Date"})

# Load wind data
df_wind = pd.read_csv("CleanedData/wind.csv", index_col=0)
df_wind.index.name = "Date"
df_wind = df_wind.reset_index()
df_wind["Date"] = pd.to_datetime(df_wind["Date"])

# Load discharge data
df_discharge = pd.read_csv("CleanedData/USGS_discharge.csv", parse_dates=["date"])
df_discharge = df_discharge.rename(columns={"date": "Date"})
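
The merge in the next section joins on `Date`, so it helps if the wind and discharge tables have a parsed datetime column with one row per date. The following is a minimal sketch of such a check, not part of the original notebook: `check_daily_table` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def check_daily_table(df: pd.DataFrame, date_col: str = "Date") -> dict:
    """Summarize properties a date-keyed merge relies on (hypothetical helper)."""
    dates = df[date_col]
    return {
        # True if the column was parsed as datetimes rather than left as strings
        "is_datetime": pd.api.types.is_datetime64_any_dtype(dates),
        # Rows whose date could not be parsed or was empty
        "n_missing_dates": int(dates.isna().sum()),
        # Repeated dates would multiply rows in a left merge
        "n_duplicate_dates": int(dates.duplicated().sum()),
    }

# Toy stand-in for a wind or discharge table; the values are made up.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-02"]),
    "flow": [10.0, 12.0, 12.5],
})
report = check_daily_table(toy)
print(report)
```

In this notebook the same helper could be called on `df_wind` and `df_discharge`; a nonzero duplicate count would suggest deduplicating before merging.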


## 2.4.2 Align and Merge Datasets

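- Find the date range common to all three datasets
- Trim each dataset to that range
- Left-merge wind and discharge onto the water quality rows by `Date`
- Fill missing wind sector values with 0 (a missing value means no wind from that direction)
- Drop rows that still have missing values (e.g., missing CHLA or discharge)
- Save the merged table to `CleanedData/final_model_data.csv`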
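
The alignment and merge steps listed above can be tried on a small toy example before running them on the real files. This is only a sketch under assumptions: the three frames and their values are made up, and `validate="many_to_one"` is an optional pandas check (not used in the notebook) that raises an error if the right-hand table has repeated dates.

```python
import pandas as pd

# Toy stand-ins for the three inputs; all values are made up.
wq = pd.DataFrame({
    "Station": ["A", "B", "A", "A"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-05"]),
    "DO": [9.0, 8.5, 8.8, 9.1],
})
wind = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-04"]),
    "NW14": [3.2, 1.1],
})
flow = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-04"]),
    "discharge": [100.0, 110.0, 120.0],
})

# Common range: latest start and earliest end across the three tables.
frames = [wq, wind, flow]
start = max(f["Date"].min() for f in frames)
end = min(f["Date"].max() for f in frames)

# Trim every table to the common range (between() is inclusive on both ends).
wq, wind, flow = [f[f["Date"].between(start, end)] for f in frames]

# Left merges keep every water quality row; unmatched dates become NaN.
merged = wq.merge(wind, on="Date", how="left", validate="many_to_one")
merged = merged.merge(flow, on="Date", how="left", validate="many_to_one")

# A missing wind value is treated as no wind from that sector.
merged["NW14"] = merged["NW14"].fillna(0)
print(merged)
```

Using a left merge keeps the water quality samples as the unit of analysis: each sample gains wind and discharge columns, and no rows are added for dates without a sample.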

In [12]:
# Find common date range
start_date = max(df_wq["Date"].min(), df_wind["Date"].min(), df_discharge["Date"].min())
end_date   = min(df_wq["Date"].max(), df_wind["Date"].max(), df_discharge["Date"].max())
print(f"📅 Common date range: {start_date.date()} to {end_date.date()}")

# Trim all datasets
df_wq = df_wq[(df_wq["Date"] >= start_date) & (df_wq["Date"] <= end_date)]
df_wind = df_wind[(df_wind["Date"] >= start_date) & (df_wind["Date"] <= end_date)]
df_discharge = df_discharge[(df_discharge["Date"] >= start_date) & (df_discharge["Date"] <= end_date)]

# Merge water quality with wind
df_merged = pd.merge(df_wq, df_wind, on="Date", how="left")

# Merge in discharge data
df_merged = pd.merge(df_merged, df_discharge, on="Date", how="left")

# Fill NaNs in wind sectors with 0
wind_cols = [col for col in df_wind.columns if col != "Date"]
df_merged[wind_cols] = df_merged[wind_cols].fillna(0)

# Drop any rows with missing values in the rest (e.g., missing CHLA or discharge)
df_merged = df_merged.dropna()

# Save final merged dataset
os.makedirs("CleanedData", exist_ok=True)
final_path = os.path.join("CleanedData", "final_model_data.csv")
df_merged.to_csv(final_path, index=False)
print(f"✅ Final merged dataset saved to {final_path}")
print("📊 Final shape:", df_merged.shape)
df_merged.head()


📅 Common date range: 1985-11-12 to 2024-12-04
✅ Final merged dataset saved to CleanedData/final_model_data.csv
📊 Final shape: (10426, 28)


| | Station | Date | CHLA | DO | SALINITY | TN | TP | WTEMP | NW14 | W14 | ... | James_acc_flow_60d | James_acc_flow_90d | Potomac_discharge | Potomac_acc_flow_30d | Potomac_acc_flow_60d | Potomac_acc_flow_90d | Susquehanna_discharge | Susquehanna_acc_flow_30d | Susquehanna_acc_flow_60d | Susquehanna_acc_flow_90d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CB2.1 | 1985-11-14 | 5.1 | 9.6 | 2.11 | 1.571 | 0.029 | 13.8 | 0.0 | 5.831037 | ... | 887040.0 | 1149480.0 | 19900 | 970380.0 | 1019563.0 | 1098913.0 | 29200 | 417756.0 | 885760.0 | 1105480.0 |
| 1 | CB2.1 | 1985-12-11 | 1.0 | 13.1 | 0.0 | 1.6883 | 0.0499 | 4.3 | 0.0 | 0.0 | ... | 1165520.0 | 1222290.0 | 13500 | 725600.0 | 1631020.0 | 1680613.0 | 51300 | 2395900.0 | 2761906.0 | 3218620.0 |
| 2 | CB2.1 | 1986-03-12 | 3.4 | 11.7 | 0.0 | 1.977 | 0.062 | 7.1 | 4.447323 | 6.264138 | ... | 418490.0 | 566250.0 | 11400 | 722000.0 | 994900.0 | 1272600.0 | 55800 | 1866800.0 | 3126190.0 | 3988620.0 |
| 3 | CB2.1 | 1986-03-26 | 1.5 | 12.1 | 0.56 | 1.631 | 0.0532 | 7.5 | 0.0 | 0.0 | ... | 569140.0 | 689390.0 | 13700 | 844700.0 | 1481400.0 | 1647100.0 | 78500 | 3289400.0 | 5024200.0 | 5830220.0 |
| 4 | CB2.1 | 1986-04-09 | 13.3 | 10.9 | 0.0 | 1.888 | 0.0473 | 12.8 | 5.143233 | 5.382969 | ... | 558640.0 | 705870.0 | 8320 | 708260.0 | 1454360.0 | 1689560.0 | 42500 | 3086400.0 | 4995400.0 | 6142820.0 |
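
Because `dropna()` removes any row with a missing value in any column, it can be useful to see which columns are responsible before dropping. The sketch below shows one way to do that on a toy frame with made-up values; it is an optional diagnostic, not a step from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy merged table with gaps; the values are made up.
toy = pd.DataFrame({
    "Station": ["A", "A", "B", "B"],
    "CHLA": [5.1, np.nan, 3.4, 1.5],
    "discharge": [19900.0, 13500.0, np.nan, 13700.0],
})

# Count missing values per column, then compare row counts around dropna().
missing_per_column = toy.isna().sum()
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_per_column[missing_per_column > 0])
print(f"dropna() removes {rows_before - rows_after} of {rows_before} rows")
```

Running the same counts on `df_merged` just before its `dropna()` call would show how much of the trimmed water quality data survives into the final file.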
