In [1]:
!pip install -r requirements_tutorial3.txt





# Tutorial 3.1 Preprocessing the CBP Chla Data

### 3.1.1 Introduction

This notebook is **Step 1** of the *Predicting Chla from Sentinel-3 OLCI at Chesapeake Bay Tutorial Series*.

In this step, we preprocess Chesapeake Bay Program (CBP) in-situ chlorophyll-a (Chla) monitoring data for use with Sentinel-3 reflectance data.

The raw data file (`MainstemChla.csv`) is stored in the `CBP_RawData/` folder. It was downloaded from the [Chesapeake Bay Program Water Quality Portal](https://datahub.chesapeakebay.net/).
> If you need help downloading data from the CBP Datahub, refer to *Tutorial 1*.


We focus on surface layer (`Layer = 'S'`) Chla measurements from **May 2020 to the present**, which matches the availability period of Sentinel-3 OLCI data.
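
The processing cells in this notebook do not apply a date cutoff themselves, so the May 2020 window presumably comes from how the file was downloaded. If a raw file covers a longer period, a filter along the following lines could restrict it. This is only a sketch on made-up rows: the station labels, dates, values, and the `START_DATE` name are illustrative assumptions, and only the `SampleDate` column name is taken from this notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the CBP export.
toy = pd.DataFrame({
    "Station": ["A", "A", "B"],
    "SampleDate": ["2020-04-15", "2020-05-12", "2021-07-01"],
    "MeasureValue": [3.2, 5.8, 9.1],
})

# Assumed cutoff: the first day of May 2020, per the text above.
START_DATE = pd.Timestamp("2020-05-01")

# Parse the dates, then keep only rows on or after the cutoff.
toy["SampleDate"] = pd.to_datetime(toy["SampleDate"], errors="coerce")
in_window = toy[toy["SampleDate"] >= START_DATE].reset_index(drop=True)

print(in_window)
```

Rows whose date fails to parse become `NaT` and are excluded by the comparison, so they would also drop out at this point.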

This step includes:
- Filtering surface layer samples
- Parsing and cleaning the sampling dates
- Averaging repeated measurements taken at the same station on the same date
- Saving the cleaned dataset for satellite matching

 *A map of the Mainstem stations used in this project is shown below.*


![CBP Mainstem Station Map](CBP_Mainstem_Station_Map.png)

### 3.1.2 Load & Preview Raw Data

In [2]:
import pandas as pd

# === Load raw data ===
file_path = "CBP_RawData/MainstemChla.csv"
df = pd.read_csv(file_path)

# Preview full raw data (no column truncation)
pd.set_option('display.max_columns', None)
print("🔍 Raw data preview:")
display(df.head())



### 3.1.3 Filter and Process Data

In [None]:
# === Filter to surface layer ===
df_s_layer = df[df['Layer'].str.strip() == 'S'].copy()

# === Convert SampleDate to datetime ===
df_s_layer['SampleDate'] = pd.to_datetime(df_s_layer['SampleDate'], errors='coerce')

# === Drop rows with missing Station or SampleDate ===
df_s_layer = df_s_layer.dropna(subset=['SampleDate', 'Station'])

# === Group by Station and SampleDate, take mean of numeric fields ===
grouped_avg = df_s_layer.groupby(['Station', 'SampleDate'], as_index=False).mean(numeric_only=True)

# === Select only relevant columns ===
columns_to_keep = ['Station', 'SampleDate', 'MeasureValue', 'Latitude', 'Longitude']
cleaned_df = grouped_avg[columns_to_keep]
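
To see what the grouping step does, the same filter and `groupby` calls can be run on a tiny hand-made table. The rows below are invented for illustration (they are not CBP values), but the column names match those used in the cell above.

```python
import pandas as pd

# Hypothetical toy data: station "A" has two replicates on the same date.
toy = pd.DataFrame({
    "Station": ["A", "A", "B"],
    "SampleDate": pd.to_datetime(["2021-06-01", "2021-06-01", "2021-06-01"]),
    "Layer": ["S ", "S", "S"],  # trailing space shows why .str.strip() is used
    "MeasureValue": [4.0, 6.0, 10.0],
    "Latitude": [39.0, 39.0, 38.5],
    "Longitude": [-76.2, -76.2, -76.4],
})

# Keep surface-layer rows, then average numeric columns per station and date.
surface = toy[toy["Layer"].str.strip() == "S"]
averaged = surface.groupby(["Station", "SampleDate"], as_index=False).mean(numeric_only=True)

print(averaged)
```

The two replicates for station "A" collapse into a single row whose `MeasureValue` is their mean, while station "B" passes through unchanged. Latitude and longitude are averaged as well, which is harmless as long as a station's coordinates are the same in every row.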

### 3.1.4 Export Cleaned Surface Data

In [None]:
import os

# Create output folder if it doesn't exist
os.makedirs("CleanedData", exist_ok=True)

# Preview final cleaned dataset
print("✅ Final cleaned Chla data:")
display(cleaned_df.head())

# Save to CleanedData
output_path = "CleanedData/averaged_layer_S.csv"
cleaned_df.to_csv(output_path, index=False)
print(f"✅ Saved to '{output_path}'")
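
When the saved file is read back in a later step, `SampleDate` comes back as plain text unless pandas is asked to parse it. The sketch below illustrates this round trip with a made-up table written to a temporary directory, so it does not touch `CleanedData/`. The rows are hypothetical and only the column names and file name follow this notebook.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned table produced above.
toy = pd.DataFrame({
    "Station": ["A", "B"],
    "SampleDate": pd.to_datetime(["2021-06-01", "2021-06-02"]),
    "MeasureValue": [5.0, 10.0],
    "Latitude": [39.0, 38.5],
    "Longitude": [-76.2, -76.4],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "averaged_layer_S.csv")
    toy.to_csv(path, index=False)
    # parse_dates restores SampleDate as datetime instead of text.
    reloaded = pd.read_csv(path, parse_dates=["SampleDate"])

print(reloaded.dtypes)
```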
