# Carbon–Jellyfish Correlation Analysis

This notebook combines the surface dissolved inorganic carbon data retrieved in
[`carbon_data_exploration.ipynb`](DataExploration/carbon_data_exploration.ipynb) with the jellyfish
observations processed in [`jellyfish_plankton_data_exploration.ipynb`](DataExploration/jellyfish_plankton_data_exploration.ipynb).

The goal is to align the two datasets spatially and temporally and to compute the linear
correlation between jellyfish density measurements and the surrounding surface carbon
concentration.


In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
from erddapy import ERDDAP
import xarray as xr
import matplotlib.pyplot as plt


## Load and prepare the jellyfish dataset

We reload the JeDI jellyfish dataset that was explored earlier and repeat the cleaning steps
from that notebook: replace the `nd` placeholders with missing values, coerce the numeric
columns, build a timestamp for each observation, and keep only rows that contain density
information.


In [2]:
data_dir = Path("data")
jellyfish_path = data_dir / "JeDI.csv"

jellyfish = (
    pd.read_csv(jellyfish_path)
    .replace("nd", pd.NA)
)

numeric_cols = [
    "year", "month", "day", "lat", "lon",
    "count_actual", "density", "density_integrated",
    "biovolume", "biovolume_integrated",
    "weight_wet", "weight_dry"
]
for col in numeric_cols:
    jellyfish[col] = pd.to_numeric(jellyfish[col], errors="coerce")

# Drop observations without core spatial or temporal information
jellyfish = jellyfish.dropna(subset=["lat", "lon", "year"]).copy()

# Default a missing month to July (mid-year) and a missing day to the 15th (mid-month)
# before building timestamps
jellyfish["month"] = jellyfish["month"].fillna(7).astype(int)
jellyfish["day"] = jellyfish["day"].fillna(15).astype(int)

jellyfish["observation_time"] = pd.to_datetime(
    {
        "year": jellyfish["year"].astype(int),
        "month": jellyfish["month"],
        "day": jellyfish["day"]
    },
    errors="coerce"
)

jellyfish = jellyfish.dropna(subset=["observation_time", "density"]).copy()
print(f"Prepared {len(jellyfish):,} jellyfish observations with density measurements.")



Prepared 257,600 jellyfish observations with density measurements.
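
The timestamp step can be checked in isolation on a few rows. The sketch below uses made-up toy values (not rows from JeDI) and the same defaults as the cell above; it shows how `pd.to_datetime` assembles dates from `year`/`month`/`day` columns and how `errors="coerce"` turns an impossible date into `NaT`, which the later `dropna` then removes.

```python
import pandas as pd

# Toy rows (hypothetical values): one complete date, one with missing month and day,
# and one impossible calendar date (30 February).
toy = pd.DataFrame(
    {
        "year": [1999, 2005, 2001],
        "month": [3, None, 2],
        "day": [12, None, 30],
    }
)

# Same defaults as the cell above: July for a missing month, the 15th for a missing day.
toy["month"] = toy["month"].fillna(7).astype(int)
toy["day"] = toy["day"].fillna(15).astype(int)

# Assemble timestamps from the three columns; invalid combinations become NaT.
toy["observation_time"] = pd.to_datetime(
    toy[["year", "month", "day"]], errors="coerce"
)
print(toy)
```

Note that filling missing months and days shifts those observations to 15 July or to mid-month, so their timestamps are only approximate.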


## Retrieve the carbon dataset within the jellyfish envelope

We use the same EMODnet ERDDAP endpoint and dataset as in the carbon exploration notebook.
The spatial and temporal query window is restricted to the extent covered by the jellyfish
observations to keep the download manageable. The dataset is further trimmed to surface
measurements (depth = 0 m).


In [6]:
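# `erd` (a configured erddapy ERDDAP client) and `info_url` (the URL of the dataset's
# metadata CSV) are not defined in the cells shown here; they are assumed to come from
# an earlier cell set up as in the carbon exploration notebook.
# 1) Ensure obs times are tz-aware UTC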
jellyfish["observation_time"] = pd.to_datetime(
    jellyfish["observation_time"], utc=True, errors="coerce"
)
obs_start = jellyfish["observation_time"].min()
obs_end   = jellyfish["observation_time"].max()

# 2) Read ERDDAP dataset coverage and force UTC as well
info = pd.read_csv(info_url)

dataset_start = pd.to_datetime(
    info.loc[info["Attribute Name"] == "time_coverage_start", "Value"].item(),
    utc=True
)
dataset_end = pd.to_datetime(
    info.loc[info["Attribute Name"] == "time_coverage_end", "Value"].item(),
    utc=True
)

# 3) Now comparisons are valid (all tz-aware UTC)
query_start = max(dataset_start, obs_start)
query_end   = min(dataset_end, obs_end)
print("obs_start:", obs_start, "obs_end:", obs_end)
print("dataset_start:", dataset_start, "dataset_end:", dataset_end)
print("NaT counts:", jellyfish["observation_time"].isna().sum(), "NaNs in obs_time")

if pd.isna(query_start) or pd.isna(query_end) or query_start > query_end:
    raise ValueError("No valid time overlap between observations and dataset coverage.")

# 4) Spatial envelope (optional: normalize longitudes to ERDDAP's convention)
lat_min, lat_max = float(jellyfish["lat"].min()), float(jellyfish["lat"].max())

# If your longitudes are 0..360 but ERDDAP expects -180..180, normalize every value
# before taking min/max. Wrapping only the two bounds can leave lon_min > lon_max
# (for example, 0..359 would become 0..-1).
lon = jellyfish["lon"]
if lon.max() > 180:
    lon = ((lon + 180) % 360) - 180
lon_min, lon_max = float(lon.min()), float(lon.max())

# 5) Set constraints (RFC3339/ISO8601 w/ Z)
erd.constraints["time>="]   = query_start.strftime("%Y-%m-%dT%H:%M:%SZ")
erd.constraints["time<="]   = query_end.strftime("%Y-%m-%dT%H:%M:%SZ")
erd.constraints["latitude>="]  = lat_min
erd.constraints["latitude<="]  = lat_max
erd.constraints["longitude>="] = lon_min
erd.constraints["longitude<="] = lon_max
erd.constraints["depth="]      = 0  # surface



erd.variables = ["TCO2"]
dataset = erd.to_xarray()
print(dataset)


obs_start: 1934-07-26 00:00:00+00:00 obs_end: 2010-08-20 00:00:00+00:00
dataset_start: 2020-01-01 00:00:00+00:00 dataset_end: 2020-01-01 00:00:00+00:00
NaT counts: 0 NaNs in obs_time


ValueError: No valid time overlap between observations and dataset coverage.
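
The overlap check in the cell above can be sketched as a small standalone helper. The function name `intersect_windows` and the second coverage window are hypothetical and used only for illustration; the first pair of windows copies the values printed above, where the reported coverage (a single instant on 2020-01-01) lies after the last jellyfish observation (2010-08-20).

```python
import pandas as pd


def intersect_windows(a_start, a_end, b_start, b_end):
    """Return the overlapping (start, end) of two closed time windows, or None."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return (start, end) if start <= end else None


# Values copied from the printed output above.
obs = (pd.Timestamp("1934-07-26", tz="UTC"), pd.Timestamp("2010-08-20", tz="UTC"))
coverage = (pd.Timestamp("2020-01-01", tz="UTC"), pd.Timestamp("2020-01-01", tz="UTC"))
print(intersect_windows(*obs, *coverage))

# Hypothetical coverage window, for contrast only.
example_coverage = (
    pd.Timestamp("1990-01-01", tz="UTC"),
    pd.Timestamp("2015-12-31", tz="UTC"),
)
window = intersect_windows(*obs, *example_coverage)
print(window)
```

Returning `None` instead of raising lets the caller decide whether an empty overlap is an error or a reason to re-check the coverage metadata.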

## Sample carbon values at jellyfish observation points

To make the datasets comparable we sample the surface dissolved inorganic carbon field at each
jellyfish observation. The interpolation uses the nearest available grid point in space and
time.


In [None]:
surface_tco2 = dataset["TCO2"].sel(depth=0)

# Build helper arrays for vectorised interpolation
obs_coords = xr.Dataset(
    {
        "time": ("observation", jellyfish["observation_time"].values.astype("datetime64[ns]")),
        "latitude": ("observation", jellyfish["lat"].values),
        "longitude": ("observation", jellyfish["lon"].values),
    }
)

sampled = surface_tco2.interp(
    time=obs_coords["time"],
    latitude=obs_coords["latitude"],
    longitude=obs_coords["longitude"],
    method="nearest"
)

jellyfish["tco2"] = sampled.to_pandas().values

matched = jellyfish.dropna(subset=["tco2", "density"])
print(f"Matched {len(matched):,} observations with carbon samples.")
matched.head()
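
Because the cell above depends on a downloaded dataset, the sampling pattern is easier to verify on a synthetic grid. The sketch below is not the notebook's method verbatim: it uses `DataArray.sel(..., method="nearest")` with indexers that share an `observation` dimension, which performs pointwise nearest-neighbour lookup without needing SciPy. The grid values and observation points are made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic 3 x 3 x 2 grid (time, latitude, longitude) with values 0..17.
times = pd.date_range("2000-01-01", periods=3, freq="D")
lats = np.array([0.0, 1.0, 2.0])
lons = np.array([10.0, 11.0])
grid = xr.DataArray(
    np.arange(18, dtype=float).reshape(3, 3, 2),
    coords={"time": times, "latitude": lats, "longitude": lons},
    dims=("time", "latitude", "longitude"),
    name="TCO2",
)

# Two hypothetical observations that fall between grid points.
obs = pd.DataFrame(
    {
        "time": pd.to_datetime(["2000-01-01 06:00", "2000-01-03 00:00"]),
        "lat": [0.4, 1.8],
        "lon": [10.1, 10.9],
    }
)

# Indexers sharing one dimension select one grid cell per observation
# rather than the outer product of all times, latitudes and longitudes.
sampled = grid.sel(
    time=xr.DataArray(obs["time"].values, dims="observation"),
    latitude=xr.DataArray(obs["lat"].values, dims="observation"),
    longitude=xr.DataArray(obs["lon"].values, dims="observation"),
    method="nearest",
)
obs["tco2"] = sampled.values
print(obs)
```

Nearest-neighbour lookup always returns the closest cell, however far away it is, so on real data it may be worth also recording the distance and time gap to the matched cell.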


## Carbon–jellyfish correlation

With paired measurements available, we compute the Pearson correlation coefficient between
the sampled carbon values and the reported jellyfish densities. The coefficient ranges from
-1 to 1 and captures only the linear part of the relationship.


In [None]:
correlation = matched["tco2"].corr(matched["density"])
print(f"Pearson correlation (TCO₂ vs. jellyfish density): {correlation:.3f}")
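
As a sanity check on what `Series.corr` returns, the coefficient can be computed by hand on a few values. The numbers below are toy data standing in for `matched["tco2"]` and `matched["density"]`, not results from the datasets.

```python
import numpy as np
import pandas as pd

# Toy paired values (hypothetical).
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r: sum of products of deviations from the mean, divided by the
# square root of the product of the summed squared deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Series.corr defaults to method="pearson" and should agree with the manual value.
r_pandas = x.corr(y)
print(f"manual: {r_manual:.3f}, pandas: {r_pandas:.3f}")
```

Both values come out as 0.800 for this toy input.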


### Visualise the relationship

A scatter plot gives an intuition for how the two variables co-vary. The colour scale encodes
the observation year to show potential temporal structure.


In [None]:
plt.figure(figsize=(8, 6))
scatter = plt.scatter(
    matched["tco2"],
    matched["density"],
    c=matched["year"],
    cmap="viridis",
    alpha=0.6,
    edgecolor="k",
    linewidth=0.2
)
plt.colorbar(scatter, label="Observation year")
plt.xlabel("Surface TCO₂ (µmol/kg)")
plt.ylabel("Jellyfish density")
plt.title("Surface carbon vs. jellyfish density")
plt.grid(True, linestyle=":", alpha=0.4)
plt.show()
