# Tutorial 2.1. Preprocessing CBP Water Quality Data

## 2.1.1 Introduction

This notebook is **Step 1** of the *Predict Future DO Tutorial Series*.

It preprocesses Chesapeake Bay Program (CBP) water quality data downloaded from the [CBP DataHub](https://datahub.chesapeakebay.net/WaterQuality).

- **Data split:** To avoid download timeouts, the data were downloaded in three regional groups: Upper, Mid, and Lower Bay.
- **Goal:** Extract surface dissolved oxygen (DO), water temperature, salinity, chlorophyll-a (Chla), total nitrogen (TN), and total phosphorus (TP); align them by date and station; and save the cleaned dataset for model training.
- **Files:** Raw data (`Upper.csv`, `Mid.csv`, `Lower.csv`) are in the folder `CBP_RawData`.

> If you need help downloading data from the CBP DataHub, refer to *Tutorial 1*.

---

![CBP Mainstem Station Map](CBP_Mainstem_Station_Map.png)


## 2.1.2 Data Loading & Inspection

In [3]:
import pandas as pd
import numpy as np
import os
from IPython.display import display

# ===  Load all three water quality files ===
folder = "CBP_RawData"
dfs = {}
for fname in ["Upper.csv", "Mid.csv", "Lower.csv"]:
    fpath = os.path.join(folder, fname)
    df = pd.read_csv(fpath)
    dfs[fname] = df
    print(f"\n📂 --- Dataset: {fname} ---")
    
    # --- Show column names ---
    print("🧾 Columns:")
    print(df.columns.tolist())




📂 --- Dataset: Upper.csv ---
🧾 Columns:
['MonitoringStation', 'EventId', 'Cruise', 'Program', 'Project', 'Agency', 'Source', 'Station', 'SampleDate', 'SampleTime', 'TotalDepth', 'UpperPycnocline', 'LowerPycnocline', 'Depth', 'Layer', 'SampleType', 'SampleReplicateType', 'Parameter', 'Qualifier', 'MeasureValue', 'Unit', 'Method', 'Lab', 'Problem', 'PrecisionPC', 'BiasPC', 'Details', 'Latitude', 'Longitude', 'TierLevel']

📂 --- Dataset: Mid.csv ---
🧾 Columns:
['MonitoringStation', 'EventId', 'Cruise', 'Program', 'Project', 'Agency', 'Source', 'Station', 'SampleDate', 'SampleTime', 'TotalDepth', 'UpperPycnocline', 'LowerPycnocline', 'Depth', 'Layer', 'SampleType', 'SampleReplicateType', 'Parameter', 'Qualifier', 'MeasureValue', 'Unit', 'Method', 'Lab', 'Problem', 'PrecisionPC', 'BiasPC', 'Details', 'Latitude', 'Longitude', 'TierLevel']

📂 --- Dataset: Lower.csv ---
🧾 Columns:
['MonitoringStation', 'EventId', 'Cruise', 'Program', 'Project', 'Agency', 'Source', 'Station', 'SampleDate', 'Sa
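
The printed column lists look identical across the regional files, which is what the later concatenation relies on. A minimal sketch of checking this programmatically follows. The helper `columns_match` and the toy frames are hypothetical and stand in for the real `dfs` dictionary.

```python
import pandas as pd


def columns_match(frames):
    """Return True if every DataFrame in `frames` has the same columns, in the same order."""
    column_lists = [list(df.columns) for df in frames.values()]
    return all(cols == column_lists[0] for cols in column_lists)


# Toy stand-ins for the regional files (values are made up for illustration)
toy_frames = {
    "Upper.csv": pd.DataFrame({"Station": ["CB2.1"], "Parameter": ["DO"], "MeasureValue": [7.8]}),
    "Mid.csv": pd.DataFrame({"Station": ["CB5.2"], "Parameter": ["DO"], "MeasureValue": [8.1]}),
}

print("Same schema:", columns_match(toy_frames))
```

In the notebook itself, the same call could be made as `columns_match(dfs)` after the loading loop.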

## 2.1.3 Explore Each Dataset

Check date range, unique parameters, stations, and preview a random day's data at a random station.

In [6]:
for fname, df in dfs.items():
    print(f"\n📂 --- Dataset: {fname} ---")
    try:
        df['SampleDate'] = pd.to_datetime(df['SampleDate'])
    except Exception as e:
        print("⚠️ Couldn't parse dates:", e)

    # Unique sampling dates
    unique_dates = df['SampleDate'].unique()
    print(f"📅 Number of unique sampling dates: {len(unique_dates)}")

    # Date range
    if pd.api.types.is_datetime64_any_dtype(df['SampleDate']):
        print(f"📆 Date range: {df['SampleDate'].min().date()} to {df['SampleDate'].max().date()}")

    # Unique parameters
    unique_params = df['Parameter'].unique()
    print(f"🧪 Parameters measured ({len(unique_params)}): {unique_params}")

    # Unique stations
    if 'Station' in df.columns:
        unique_stations = df['Station'].unique()
        print(f"📍 Number of unique stations: {len(unique_stations)}")
        print(f"📍 Station list:\n{unique_stations}")

    # Show a random station/day
    valid_rows = df[['SampleDate', 'Station']].dropna()
    if not valid_rows.empty:
        random_idx = np.random.choice(valid_rows.index)
        random_date = valid_rows.loc[random_idx, 'SampleDate']
        random_station = valid_rows.loc[random_idx, 'Station']

        subset = df[(df['SampleDate'] == random_date) & (df['Station'] == random_station)]
        print(f"\n🔍 Showing data for Station '{random_station}' on {random_date.date()}:")
        display(subset)


📂 --- Dataset: Upper.csv ---
📅 Number of unique sampling dates: 1342
📆 Date range: 1984-07-11 to 2024-12-04
🧪 Parameters measured (6): ['CHLA' 'DO' 'SALINITY' 'TN' 'TP' 'WTEMP']
📍 Number of unique stations: 10
📍 Station list:
['CB2.1' 'CB2.2' 'CB3.1' 'CB3.2' 'CB3.3C' 'CB3.3E' 'CB3.3W' 'CB4.1C'
 'CB4.1E' 'CB4.1W']

🔍 Showing data for Station 'CB2.2' on 2000-04-27:




Unnamed: 0,MonitoringStation,EventId,Cruise,Program,Project,Agency,Source,Station,SampleDate,SampleTime,...,Unit,Method,Lab,Problem,PrecisionPC,BiasPC,Details,Latitude,Longitude,TierLevel
16252,CB2.2,111004,BAY317,TWQM,MAIN,MDDNR,MDDNR,CB2.2,2000-04-27,8:33:00,...,UG/L,L01,MDHMH,,,,,39.34873,-76.17579,T3
16253,CB2.2,111004,BAY317,TWQM,MAIN,MDDNR,MDDNR,CB2.2,2000-04-27,8:33:00,...,UG/L,L01,MDHMH,,,,,39.34873,-76.17579,T3
16254,CB2.2,111004,BAY317,TWQM,MAIN,MDDNR,MDDNR,CB2.2,2000-04-27,8:33:00,...,UG/L,L01,MDHMH,,,,,39.34873,-76.17579,T3
16255,CB2.2,111004,BAY317,TWQM,MAIN,MDDNR,MDDNR,CB2.2,2000-04-27,8:33:00,...,UG/L,L01,MDHMH,,,,,39.34873,-76.17579,T3
16256,CB2.2,111004,BAY317,TWQM,MAIN,MDDNR,MDDNR,CB2.2,2000-04-27,8:33:00,...,UG/L,L01,MDHMH,,,,,39.34873,-76.17579,T3
19289,CB2.2,111004,BAY317,TWQM,MAIN,MDDNR,MDDNR,CB2.2,2000-04-27,8:33:00,...,MG/L,F01,,,,,,39.34873,-76.17579,T3
19290,CB2.2,111004,BAY317,TWQM,MAIN,MDDNR,MDDNR,CB2.2,2000-04-27,8:33:00,...,MG/L,F01,,,,,,39.34873,-76.17579,T3
19291,CB2.2,111004,BAY317,TWQM,MAIN,MDDNR,MDDNR,CB2.2,2000-04-27,8:33:00,...,MG/L,F01,,,,,,39.34873,-76.17579,T3
19292,CB2.2,111004,BAY317,TWQM,MAIN,MDDNR,MDDNR,CB2.2,2000-04-27,8:33:00,...,MG/L,F01,,,,,,39.34873,-76.17579,T3
19293,CB2.2,111004,BAY317,TWQM,MAIN,MDDNR,MDDNR,CB2.2,2000-04-27,8:33:00,...,MG/L,F01,,,,,,39.34873,-76.17579,T3



📂 --- Dataset: Mid.csv ---
📅 Number of unique sampling dates: 1679
📆 Date range: 1984-07-10 to 2024-12-09
🧪 Parameters measured (6): ['CHLA' 'DO' 'SALINITY' 'TN' 'TP' 'WTEMP']
📍 Number of unique stations: 11
📍 Station list:
['CB4.2C' 'CB4.2E' 'CB4.2W' 'CB4.3C' 'CB4.3E' 'CB4.3W' 'CB5.1' 'CB5.2'
 'CB5.3' 'CB5.4' 'CB5.4W']

🔍 Showing data for Station 'CB5.2' on 1994-10-11:




Unnamed: 0,MonitoringStation,EventId,Cruise,Program,Project,Agency,Source,Station,SampleDate,SampleTime,...,Unit,Method,Lab,Problem,PrecisionPC,BiasPC,Details,Latitude,Longitude,TierLevel
241744,CB5.2,755,BAY208,TWQM,MAIN,MDDNR,MDDNR,CB5.2,1994-10-11,11:25:00,...,UG/L,L01,MDHMH,,,,,38.13705,-76.22787,T3
241745,CB5.2,755,BAY208,TWQM,MAIN,MDDNR,MDDNR,CB5.2,1994-10-11,11:25:00,...,UG/L,L01,MDHMH,,,,,38.13705,-76.22787,T3
241746,CB5.2,755,BAY208,TWQM,MAIN,MDDNR,MDDNR,CB5.2,1994-10-11,11:25:00,...,UG/L,L01,MDHMH,,,,,38.13705,-76.22787,T3
241747,CB5.2,755,BAY208,TWQM,MAIN,MDDNR,MDDNR,CB5.2,1994-10-11,11:25:00,...,UG/L,L01,MDHMH,,,,,38.13705,-76.22787,T3
241748,CB5.2,755,BAY208,TWQM,MAIN,MDDNR,MDDNR,CB5.2,1994-10-11,11:25:00,...,UG/L,L01,MDHMH,,,,,38.13705,-76.22787,T3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289145,CB5.2,755,BAY208,TWQM,MAIN,MDDNR,MDDNR,CB5.2,1994-10-11,11:25:00,...,DEG C,F01,,,,,M Layer,38.13705,-76.22787,T3
289146,CB5.2,755,BAY208,TWQM,MAIN,MDDNR,MDDNR,CB5.2,1994-10-11,11:25:00,...,DEG C,F01,,,,,M Layer,38.13705,-76.22787,T3
289147,CB5.2,755,BAY208,TWQM,MAIN,MDDNR,MDDNR,CB5.2,1994-10-11,11:25:00,...,DEG C,F01,,,,,,38.13705,-76.22787,T3
289148,CB5.2,755,BAY208,TWQM,MAIN,MDDNR,MDDNR,CB5.2,1994-10-11,11:25:00,...,DEG C,F01,,,,,,38.13705,-76.22787,T3



📂 --- Dataset: Lower.csv ---




📅 Number of unique sampling dates: 1703
📆 Date range: 1984-07-11 to 2024-12-16
🧪 Parameters measured (6): ['CHLA' 'DO' 'SALINITY' 'TN' 'TP' 'WTEMP']
📍 Number of unique stations: 15
📍 Station list:
['CB5.5' 'CB6.1' 'CB6.2' 'CB6.3' 'CB6.4' 'CB7.1' 'CB7.1N' 'CB7.2' 'CB7.2E'
 'CB7.3' 'CB7.3E' 'CB7.4' 'CB7.4N' 'CB8.1' 'CB8.1E']

🔍 Showing data for Station 'CB8.1E' on 2018-08-06:


Unnamed: 0,MonitoringStation,EventId,Cruise,Program,Project,Agency,Source,Station,SampleDate,SampleTime,...,Unit,Method,Lab,Problem,PrecisionPC,BiasPC,Details,Latitude,Longitude,TierLevel
410768,CB8.1E,449544,BAY726,TWQM,MAIN,VADEQ,ODU,CB8.1E,2018-08-06,8:45:00,...,UG/L,L01,ODU,,,,,36.94737,-76.03494,T3
410769,CB8.1E,449544,BAY726,TWQM,MAIN,VADEQ,ODU,CB8.1E,2018-08-06,8:45:00,...,UG/L,L01,ODU,,,,,36.94737,-76.03494,T3
417316,CB8.1E,449544,BAY726,TWQM,MAIN,VADEQ,ODU,CB8.1E,2018-08-06,8:45:00,...,MG/L,F04,,,,,,36.94737,-76.03494,T3
417317,CB8.1E,449544,BAY726,TWQM,MAIN,VADEQ,ODU,CB8.1E,2018-08-06,8:45:00,...,MG/L,F04,,,,,,36.94737,-76.03494,T3
417318,CB8.1E,449544,BAY726,TWQM,MAIN,VADEQ,ODU,CB8.1E,2018-08-06,8:45:00,...,MG/L,F04,,,,,,36.94737,-76.03494,T3
417319,CB8.1E,449544,BAY726,TWQM,MAIN,VADEQ,ODU,CB8.1E,2018-08-06,8:45:00,...,MG/L,F04,,,,,,36.94737,-76.03494,T3
417345,CB8.1E,449544,BAY726,TWQM,MAIN,VADEQ,ODU,CB8.1E,2018-08-06,8:45:00,...,MG/L,F04,,,,,,36.94737,-76.03494,T3
417350,CB8.1E,449544,BAY726,TWQM,MAIN,VADEQ,ODU,CB8.1E,2018-08-06,8:45:00,...,MG/L,F04,,,,,,36.94737,-76.03494,T3
417351,CB8.1E,449544,BAY726,TWQM,MAIN,VADEQ,ODU,CB8.1E,2018-08-06,8:45:00,...,MG/L,F04,,,,,,36.94737,-76.03494,T3
417352,CB8.1E,449544,BAY726,TWQM,MAIN,VADEQ,ODU,CB8.1E,2018-08-06,8:45:00,...,MG/L,F04,,,,,,36.94737,-76.03494,T3
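
The exploration cell wraps `pd.to_datetime` in a `try`/`except`, so a single bad value would leave the whole column unparsed. One possible alternative, sketched here on made-up strings, is to coerce unparseable entries to `NaT` and count them. The helper name `parse_dates_report` is hypothetical, and the date format of the real files is not assumed.

```python
import pandas as pd


def parse_dates_report(series):
    """Parse a column of date strings, turning unparseable entries into NaT.

    Returns the parsed Series and the number of values that were present
    in the input but could not be parsed.
    """
    parsed = pd.to_datetime(series, errors="coerce")
    n_bad = int(parsed.isna().sum() - series.isna().sum())
    return parsed, n_bad


# Toy input: two valid dates, one garbage string, one missing value
raw_dates = pd.Series(["1984-07-11", "2024-12-04", "not a date", None])
parsed, n_bad = parse_dates_report(raw_dates)
print(f"Unparseable (non-missing) values: {n_bad}")
print(f"Date range: {parsed.min().date()} to {parsed.max().date()}")
```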


## 2.1.4 Clean and Filter Data

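We combine the three regional datasets, keep only surface samples (those recorded at a depth of 0.5 m), and pivot the measurements so that each row is one station-date pair and each column is one of the parameters used for DO prediction.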

In [9]:
# Combine all regional datasets
df_wq = pd.concat(dfs.values(), ignore_index=True)

# Ensure dates are parsed
df_wq["SampleDate"] = pd.to_datetime(df_wq["SampleDate"])

# Filter for surface data (Depth = 0.5m)
df_surface = df_wq[df_wq["Depth"] == 0.5].copy()

# Pivot table: One row per station-date, columns are parameters
df_pivot = df_surface.pivot_table(
    index=["Station", "SampleDate"],
    columns="Parameter",
    values="MeasureValue",
    aggfunc="mean"  # Average if duplicates
).reset_index()

# Optional: Drop the leftover "Parameter" label on the column axis
df_pivot.columns.name = None

# Preview the final dataset
print("📋 Preview of cleaned water quality data:")
print(df_pivot.head())



📋 Preview of cleaned water quality data:
  Station SampleDate  CHLA   DO  SALINITY      TN     TP  WTEMP
0   CB2.1 1984-07-12  10.7  7.8      0.00  1.8515  0.120   25.9
1   CB2.1 1984-07-25  26.7  7.6      0.00  1.3990  0.086   27.6
2   CB2.1 1984-09-12   6.7  7.1      0.73  1.1870  0.048   22.6
3   CB2.1 1984-09-26  12.0  7.8      1.97  1.0730  0.049   22.0
4   CB2.1 1984-10-10  14.7  8.7      2.89  1.6200  0.046   16.1
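
To make the filter-and-pivot step easier to follow, here is a small sketch on made-up long-format rows. It uses `np.isclose` rather than `==` for the depth comparison. That is an assumption on my part to guard against floating-point noise, not something the cell above does.

```python
import numpy as np
import pandas as pd

# Toy long-format data: one row per measurement (values are made up)
long_df = pd.DataFrame({
    "Station": ["CB2.1"] * 4,
    "SampleDate": pd.to_datetime(["1984-07-12"] * 4),
    "Depth": [0.5, 0.5, 0.5, 3.0],
    "Parameter": ["DO", "DO", "WTEMP", "DO"],
    "MeasureValue": [7.0, 8.0, 25.9, 4.0],
})

# Keep surface rows; the tolerance-based comparison is an illustrative choice
is_surface = np.isclose(long_df["Depth"], 0.5, atol=1e-6)
surface = long_df[is_surface]

# One row per station-date, one column per parameter; duplicates are averaged
wide = surface.pivot_table(
    index=["Station", "SampleDate"],
    columns="Parameter",
    values="MeasureValue",
    aggfunc="mean",
).reset_index()
wide.columns.name = None

print(wide)
```

The two surface DO readings collapse into their mean, and the 3.0 m reading is excluded before the pivot.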


## 2.1.5 Export Cleaned Data

Save the cleaned and pivoted dataset for later use in model training.
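
Before saving, it may be worth checking how complete each parameter column is, since the pivot leaves `NaN` wherever a parameter was not measured on a given station-date. The following is a minimal sketch on toy data. The helper `missing_summary` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df, value_cols):
    """Count and percentage of missing values for each column in `value_cols`."""
    missing = df[value_cols].isna()
    return pd.DataFrame({
        "n_missing": missing.sum(),
        "pct_missing": (missing.mean() * 100).round(1),
    })


# Toy wide-format table (values are made up for illustration)
toy_wide = pd.DataFrame({
    "Station": ["CB2.1"] * 4,
    "SampleDate": pd.to_datetime(["1984-07-12", "1984-07-25", "1984-09-12", "1984-09-26"]),
    "DO": [7.8, np.nan, 7.1, 7.8],
    "TN": [1.85, 1.40, np.nan, np.nan],
})

summary = missing_summary(toy_wide, ["DO", "TN"])
print(summary)
```

On the real table, the call would presumably be `missing_summary(df_pivot, ["CHLA", "DO", "SALINITY", "TN", "TP", "WTEMP"])`.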


In [12]:
import os

# Ensure the folder exists
os.makedirs("CleanedData", exist_ok=True)

# Save to the CleanedData folder
output_path = os.path.join("CleanedData", "CBP_water_quality_surface.csv")
df_pivot.to_csv(output_path, index=False)
print(f"✅ Saved: {output_path}")


✅ Saved: CleanedData/CBP_water_quality_surface.csv
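
When the saved file is read back in a later notebook, `SampleDate` comes back as plain text unless it is parsed again. The round trip is sketched below on a toy table written to a temporary folder. The file name `toy_surface.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy table standing in for the cleaned dataset (values are made up)
toy_clean = pd.DataFrame({
    "Station": ["CB2.1", "CB2.1"],
    "SampleDate": pd.to_datetime(["1984-07-12", "1984-07-25"]),
    "DO": [7.8, 7.6],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_surface.csv")
    toy_clean.to_csv(path, index=False)

    # parse_dates restores the datetime dtype that CSV does not preserve
    reloaded = pd.read_csv(path, parse_dates=["SampleDate"])

print(reloaded.dtypes)
```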


Finally, list the stations in the combined dataset as a sanity check.

In [14]:
# Unique station names
stations = df_wq["Station"].unique()
print(f"🧭 Number of unique stations: {len(stations)}")
print("📍 Station list:")
print(stations)


🧭 Number of unique stations: 36
📍 Station list:
['CB2.1' 'CB2.2' 'CB3.1' 'CB3.2' 'CB3.3C' 'CB3.3E' 'CB3.3W' 'CB4.1C'
 'CB4.1E' 'CB4.1W' 'CB4.2C' 'CB4.2E' 'CB4.2W' 'CB4.3C' 'CB4.3E' 'CB4.3W'
 'CB5.1' 'CB5.2' 'CB5.3' 'CB5.4' 'CB5.4W' 'CB5.5' 'CB6.1' 'CB6.2' 'CB6.3'
 'CB6.4' 'CB7.1' 'CB7.1N' 'CB7.2' 'CB7.2E' 'CB7.3' 'CB7.3E' 'CB7.4'
 'CB7.4N' 'CB8.1' 'CB8.1E']
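
The total of 36 stations agrees with the per-file counts printed earlier (10 + 11 + 15). If the regional origin of each row is needed later, one option is to label rows during concatenation. This is sketched on toy frames below. The `Region` column and the short keys are assumptions and are not present in the original data.

```python
import pandas as pd

# Toy regional frames (station names borrowed from the lists above, rows made up)
toy_regions = {
    "Upper": pd.DataFrame({"Station": ["CB2.1", "CB2.2", "CB2.1"]}),
    "Lower": pd.DataFrame({"Station": ["CB5.5"]}),
}

# Passing a dict to concat uses its keys as an outer index level,
# which is then moved into an ordinary "Region" column
combined = (
    pd.concat(toy_regions, names=["Region", "row"])
    .reset_index(level="Region")
    .reset_index(drop=True)
)

stations_per_region = combined.groupby("Region")["Station"].nunique()
print(stations_per_region)
print("Total unique stations:", combined["Station"].nunique())
```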
