# Phase 1: Data Exploration

This notebook explores the structure of the NetCDF files that will be used for the HSI analysis of the Sunda Strait (Selat Sunda).

## Files to analyze:
1. `CHL 21-24.nc` - Chlorophyll-a data
2. `SST 21-24.nc` - Sea Surface Temperature (Kelvin)
3. `SO 21-24.nc` - Salinity data (3D, with a depth dimension)
4. `BatimetriSelatSunda.nc` - Bathymetry data

## 1. Import Libraries

In [2]:
import netCDF4
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Explore CHL Data (Chlorophyll-a)

In [3]:
# Read CHL file
chl_file = '../CHL 21-24.nc'
nc_chl = netCDF4.Dataset(chl_file, 'r')

print("=== CHL File Structure ===")
print(f"\nDimensions: {list(nc_chl.dimensions.keys())}")
print(f"Variables: {list(nc_chl.variables.keys())}")

# Dimension sizes
print("\nDimension sizes:")
for dim_name, dim in nc_chl.dimensions.items():
    print(f"  {dim_name}: {dim.size}")

# Coordinate ranges
lat_chl = nc_chl.variables['latitude'][:]
lon_chl = nc_chl.variables['longitude'][:]
time_chl = nc_chl.variables['time'][:]

print(f"\nLatitude range: {lat_chl.min():.4f} to {lat_chl.max():.4f}")
print(f"Longitude range: {lon_chl.min():.4f} to {lon_chl.max():.4f}")
print(f"Time steps: {len(time_chl)}")

# CHL data statistics
chl_data = nc_chl.variables['CHL']
print(f"\nCHL variable shape: {chl_data.shape}")
print(f"CHL units: {chl_data.units}")
print(f"CHL long_name: {chl_data.long_name}")

# Sample data (first time step)
sample_chl = chl_data[0, :, :]
valid_chl = sample_chl[~np.isnan(sample_chl)]

if len(valid_chl) > 0:
    print(f"\nSample CHL values (first time step):")
    print(f"  Valid points: {len(valid_chl)} / {sample_chl.size}")
    print(f"  Min: {valid_chl.min():.4f} mg/m³")
    print(f"  Max: {valid_chl.max():.4f} mg/m³")
    print(f"  Mean: {valid_chl.mean():.4f} mg/m³")
    print(f"  Median: {np.median(valid_chl):.4f} mg/m³")
    print(f"  Std: {valid_chl.std():.4f} mg/m³")

# Time information
if hasattr(nc_chl.variables['time'], 'units'):
    print(f"\nTime units: {nc_chl.variables['time'].units}")
    time_units = nc_chl.variables['time'].units
    first_time = netCDF4.num2date(time_chl[0], time_units)
    last_time = netCDF4.num2date(time_chl[-1], time_units)
    print(f"Date range: {first_time} to {last_time}")

nc_chl.close()

=== CHL File Structure ===

Dimensions: ['time', 'latitude', 'longitude']
Variables: ['time', 'latitude', 'longitude', 'CHL']

Dimension sizes:
  time: 1461
  latitude: 32
  longitude: 34

Latitude range: -6.7708 to -5.4792
Longitude range: 104.5625 to 105.9375
Time steps: 1461

CHL variable shape: (1461, 32, 34)
CHL units: milligram m-3
CHL long_name: Chlorophyll-a concentration - Mean of the binned pixels

Sample CHL values (first time step):
  Valid points: 807 / 1088
  Min: 0.1089 mg/m³
  Max: 0.6592 mg/m³
  Mean: 0.2632 mg/m³
  Median: 0.2349 mg/m³
  Std: 0.0921 mg/m³

Time units: seconds since 1970-01-01 00:00:00
Date range: 2021-01-01 00:00:00 to 2024-12-31 00:00:00
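
The time axis is stored as `seconds since 1970-01-01 00:00:00` with 1461 steps. A minimal sketch of how those values could be decoded and checked for gaps with NumPy alone is shown below. The helper names `decode_epoch_seconds` and `is_gap_free_daily` are hypothetical, and the input is a synthetic stand-in for the file's `time` variable, not data read from `CHL 21-24.nc`.

```python
import numpy as np

def decode_epoch_seconds(seconds):
    """Convert 'seconds since 1970-01-01 00:00:00' values to datetime64[s]."""
    seconds = np.asarray(seconds, dtype="int64")
    epoch = np.datetime64("1970-01-01T00:00:00", "s")
    return epoch + seconds.astype("timedelta64[s]")

def is_gap_free_daily(times):
    """Return True if consecutive timestamps are exactly one day apart."""
    steps = np.diff(times)
    return bool(np.all(steps == np.timedelta64(1, "D")))

# Synthetic stand-in (assumption): one value per day, 2021-01-01 to 2024-12-31
start = np.datetime64("2021-01-01", "D")
stop = np.datetime64("2025-01-01", "D")  # exclusive upper bound
days = np.arange(start, stop)
seconds = (days - np.datetime64("1970-01-01", "D")).astype("timedelta64[s]").astype("int64")

times = decode_epoch_seconds(seconds)
print(f"Steps: {len(times)}")
print(f"First: {times[0]}, last: {times[-1]}")
print(f"Gap-free daily axis: {is_gap_free_daily(times)}")
```

The step count agrees with the calendar: 2021 to 2023 contribute 365 days each and 2024 is a leap year, so 3 × 365 + 366 = 1461. Applying the same check to the real `time` variable would confirm that no days are missing or duplicated.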


## 3. Explore SST Data (Sea Surface Temperature)

In [4]:
# Read SST file
sst_file = '../SST 21-24.nc'
nc_sst = netCDF4.Dataset(sst_file, 'r')

print("=== SST File Structure ===")
print(f"\nDimensions: {list(nc_sst.dimensions.keys())}")
print(f"Variables: {list(nc_sst.variables.keys())}")

# Dimension sizes
print("\nDimension sizes:")
for dim_name, dim in nc_sst.dimensions.items():
    print(f"  {dim_name}: {dim.size}")

# Coordinate ranges
lat_sst = nc_sst.variables['latitude'][:]
lon_sst = nc_sst.variables['longitude'][:]
time_sst = nc_sst.variables['time'][:]

print(f"\nLatitude range: {lat_sst.min():.4f} to {lat_sst.max():.4f}")
print(f"Longitude range: {lon_sst.min():.4f} to {lon_sst.max():.4f}")
print(f"Time steps: {len(time_sst)}")

# SST data statistics (in Kelvin)
sst_data = nc_sst.variables['analysed_sst']
print(f"\nSST variable shape: {sst_data.shape}")
print(f"SST units: {sst_data.units}")
print(f"SST long_name: {sst_data.long_name}")

# Sample data (first time step)
sample_sst_k = sst_data[0, :, :]
valid_sst_k = sample_sst_k[~np.isnan(sample_sst_k)]

if len(valid_sst_k) > 0:
    print(f"\nSample SST values in Kelvin (first time step):")
    print(f"  Valid points: {len(valid_sst_k)} / {sample_sst_k.size}")
    print(f"  Min: {valid_sst_k.min():.4f} K")
    print(f"  Max: {valid_sst_k.max():.4f} K")
    print(f"  Mean: {valid_sst_k.mean():.4f} K")
    
    # Convert to Celsius
    valid_sst_c = valid_sst_k - 273.15
    print(f"\nSample SST values in Celsius (first time step):")
    print(f"  Min: {valid_sst_c.min():.2f} °C")
    print(f"  Max: {valid_sst_c.max():.2f} °C")
    print(f"  Mean: {valid_sst_c.mean():.2f} °C")

# Time information
if hasattr(nc_sst.variables['time'], 'units'):
    print(f"\nTime units: {nc_sst.variables['time'].units}")
    time_units = nc_sst.variables['time'].units
    first_time = netCDF4.num2date(time_sst[0], time_units)
    last_time = netCDF4.num2date(time_sst[-1], time_units)
    print(f"Date range: {first_time} to {last_time}")

nc_sst.close()

=== SST File Structure ===

Dimensions: ['time', 'latitude', 'longitude']
Variables: ['time', 'latitude', 'longitude', 'analysed_sst']

Dimension sizes:
  time: 1461
  latitude: 27
  longitude: 29

Latitude range: -6.7750 to -5.4750
Longitude range: 104.5750 to 105.9750
Time steps: 1461

SST variable shape: (1461, 27, 29)
SST units: kelvin
SST long_name: Analysed sea surface temperature

Sample SST values in Kelvin (first time step):
  Valid points: 616 / 783
  Min: 301.7200 K
  Max: 302.2700 K
  Mean: 301.9357 K

Sample SST values in Celsius (first time step):
  Min: 28.57 °C
  Max: 29.12 °C
  Mean: 28.79 °C

Time units: seconds since 1970-01-01 00:00:00
Date range: 2021-01-01 00:00:00 to 2024-12-31 00:00:00
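
The cells above select valid cells with `~np.isnan(...)`. Because `netCDF4` can return masked arrays when a variable defines a fill value, a slightly more defensive variant treats both masked cells and NaN/inf values as missing. The sketch below assumes that behaviour; `summarize_valid` is a hypothetical helper and the input array is synthetic.

```python
import numpy as np

def summarize_valid(field):
    """Basic statistics over the valid (unmasked, finite) cells of a field."""
    masked = np.ma.masked_invalid(field)  # masks NaN/inf, keeps any existing mask
    valid = masked.compressed()           # 1D ndarray holding only valid values
    summary = {"valid": int(valid.size), "total": int(masked.size)}
    if valid.size > 0:
        summary.update(
            min=float(valid.min()),
            max=float(valid.max()),
            mean=float(valid.mean()),
            median=float(np.median(valid)),
            std=float(valid.std()),
        )
    return summary

# Synthetic 2x2 field (assumption): one NaN cell and one masked cell
demo = np.ma.array(
    [[1.0, np.nan], [3.0, 5.0]],
    mask=[[False, False], [True, False]],
)
stats = summarize_valid(demo)
print(f"Valid points: {stats['valid']} / {stats['total']}")
print(f"Min: {stats['min']:.4f}, Max: {stats['max']:.4f}, Mean: {stats['mean']:.4f}")
```

The same helper could be called on `sample_chl`, `sample_sst_k`, or `sample_so`, which would also remove the repeated statistics code across the three cells.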


## 4. Explore Salinity Data (SO)

In [5]:
# Read Salinity file
so_file = '../SO 21-24.nc'
nc_so = netCDF4.Dataset(so_file, 'r')

print("=== Salinity File Structure ===")
print(f"\nDimensions: {list(nc_so.dimensions.keys())}")
print(f"Variables: {list(nc_so.variables.keys())}")

# Dimension sizes
print("\nDimension sizes:")
for dim_name, dim in nc_so.dimensions.items():
    print(f"  {dim_name}: {dim.size}")

# Coordinate ranges
lat_so = nc_so.variables['latitude'][:]
lon_so = nc_so.variables['longitude'][:]
time_so = nc_so.variables['time'][:]
depth_so = nc_so.variables['depth'][:]

print(f"\nLatitude range: {lat_so.min():.4f} to {lat_so.max():.4f}")
print(f"Longitude range: {lon_so.min():.4f} to {lon_so.max():.4f}")
print(f"Time steps: {len(time_so)}")
print(f"Depth levels: {len(depth_so)}")
print(f"\nDepth range: {depth_so.min():.2f} to {depth_so.max():.2f} m")
print(f"Surface depth (first level): {depth_so[0]:.2f} m")

# Salinity data statistics
so_data = nc_so.variables['so']
print(f"\nSalinity variable shape: {so_data.shape}")
print(f"Salinity units: {so_data.units}")
print(f"Salinity long_name: {so_data.long_name}")

# Extract surface salinity (first depth level)
surface_so = so_data[:, 0, :, :]  # [time, depth=0, lat, lon]
print(f"\nSurface salinity shape: {surface_so.shape}")

# Sample data (first time step, surface)
sample_so = surface_so[0, :, :]
valid_so = sample_so[~np.isnan(sample_so)]

if len(valid_so) > 0:
    print(f"\nSample Salinity values (first time step, surface):")
    print(f"  Valid points: {len(valid_so)} / {sample_so.size}")
    print(f"  Min: {valid_so.min():.4f} PSU")
    print(f"  Max: {valid_so.max():.4f} PSU")
    print(f"  Mean: {valid_so.mean():.4f} PSU")
    print(f"  Median: {np.median(valid_so):.4f} PSU")
    print(f"  Std: {valid_so.std():.4f} PSU")

# Time information
if hasattr(nc_so.variables['time'], 'units'):
    print(f"\nTime units: {nc_so.variables['time'].units}")
    time_units = nc_so.variables['time'].units
    first_time = netCDF4.num2date(time_so[0], time_units)
    last_time = netCDF4.num2date(time_so[-1], time_units)
    print(f"Date range: {first_time} to {last_time}")

nc_so.close()

=== Salinity File Structure ===

Dimensions: ['time', 'depth', 'latitude', 'longitude']
Variables: ['time', 'depth', 'latitude', 'longitude', 'so']

Dimension sizes:
  time: 1461
  depth: 50
  latitude: 16
  longitude: 17

Latitude range: -6.7500 to -5.5000
Longitude range: 104.5834 to 105.9167
Time steps: 1461
Depth levels: 50

Depth range: 0.49 to 5727.92 m
Surface depth (first level): 0.49 m

Salinity variable shape: (1461, 50, 16, 17)
Salinity units: 1e-3
Salinity long_name: Salinity

Surface salinity shape: (1461, 16, 17)

Sample Salinity values (first time step, surface):
  Valid points: 215 / 272
  Min: 32.5892 PSU
  Max: 33.6970 PSU
  Mean: 33.2822 PSU
  Median: 33.2240 PSU
  Std: 0.1921 PSU

Time units: seconds since 1970-01-01 00:00:00
Date range: 2021-01-01 00:00:00 to 2024-12-31 00:00:00
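
The surface layer is taken above as index 0 of the depth axis, which matches this file (first level at 0.49 m). A small sketch that picks the shallowest level by value instead of by position is given below, in case a file stores depth in a different order. `surface_slice` is a hypothetical helper, and every depth value other than 0.49 m is made up for the demonstration.

```python
import numpy as np

def surface_slice(so, depth):
    """Select the depth level closest to 0 m from a (time, depth, lat, lon) array."""
    depth = np.asarray(depth, dtype=float)
    k = int(np.argmin(np.abs(depth)))  # index of the shallowest level
    return so[:, k, :, :], k, float(depth[k])

# Synthetic example (assumption): 2 time steps, 3 depth levels, 4x5 grid
depth_demo = np.array([0.49, 1.54, 2.65])
so_demo = np.random.default_rng(0).uniform(32.0, 34.0, size=(2, 3, 4, 5))

surface, level_index, level_depth = surface_slice(so_demo, depth_demo)
print(f"Selected level {level_index} at {level_depth:.2f} m, shape {surface.shape}")
```

Slicing a single level from the `netCDF4` variable directly (for example `so_data[:, k, :, :]`) reads only that level from disk, which avoids loading all 50 depth levels.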


## 5. Compare Spatial Coverage

In [6]:
# Reopen files to get coordinates
nc_chl = netCDF4.Dataset(chl_file, 'r')
nc_sst = netCDF4.Dataset(sst_file, 'r')
nc_so = netCDF4.Dataset(so_file, 'r')

lat_chl = nc_chl.variables['latitude'][:]
lon_chl = nc_chl.variables['longitude'][:]
lat_sst = nc_sst.variables['latitude'][:]
lon_sst = nc_sst.variables['longitude'][:]
lat_so = nc_so.variables['latitude'][:]
lon_so = nc_so.variables['longitude'][:]

print("=== Spatial Coverage Comparison ===")
print("\nLatitude ranges:")
print(f"  CHL:     {lat_chl.min():.4f} to {lat_chl.max():.4f}")
print(f"  SST:     {lat_sst.min():.4f} to {lat_sst.max():.4f}")
print(f"  Salinity: {lat_so.min():.4f} to {lat_so.max():.4f}")

print("\nLongitude ranges:")
print(f"  CHL:     {lon_chl.min():.4f} to {lon_chl.max():.4f}")
print(f"  SST:     {lon_sst.min():.4f} to {lon_sst.max():.4f}")
print(f"  Salinity: {lon_so.min():.4f} to {lon_so.max():.4f}")

# Calculate intersection (common area)
lat_min = max(lat_chl.min(), lat_sst.min(), lat_so.min())
lat_max = min(lat_chl.max(), lat_sst.max(), lat_so.max())
lon_min = max(lon_chl.min(), lon_sst.min(), lon_so.min())
lon_max = min(lon_chl.max(), lon_sst.max(), lon_so.max())

print("\n=== Recommended Bounding Box (Intersection) ===")
print(f"Latitude:  {lat_min:.4f} to {lat_max:.4f}")
print(f"Longitude: {lon_min:.4f} to {lon_max:.4f}")

# Calculate resolutions
print("\n=== Spatial Resolutions ===")
if len(lat_chl) > 1:
    res_chl = abs(lat_chl[1] - lat_chl[0])
    print(f"CHL resolution: ~{res_chl:.4f}° (~{res_chl*111:.1f} km)")
if len(lat_sst) > 1:
    res_sst = abs(lat_sst[1] - lat_sst[0])
    print(f"SST resolution: ~{res_sst:.4f}° (~{res_sst*111:.1f} km)")
if len(lat_so) > 1:
    res_so = abs(lat_so[1] - lat_so[0])
    print(f"Salinity resolution: ~{res_so:.4f}° (~{res_so*111:.1f} km)")

nc_chl.close()
nc_sst.close()
nc_so.close()

=== Spatial Coverage Comparison ===

Latitude ranges:
  CHL:     -6.7708 to -5.4792
  SST:     -6.7750 to -5.4750
  Salinity: -6.7500 to -5.5000

Longitude ranges:
  CHL:     104.5625 to 105.9375
  SST:     104.5750 to 105.9750
  Salinity: 104.5834 to 105.9167

=== Recommended Bounding Box (Intersection) ===
Latitude:  -6.7500 to -5.5000
Longitude: 104.5834 to 105.9167

=== Spatial Resolutions ===
CHL resolution: ~0.0417° (~4.6 km)
SST resolution: ~0.0500° (~5.6 km)
Salinity resolution: ~0.0833° (~9.3 km)
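
The intersection and resolution logic above can be condensed into two small helpers. This is a sketch rather than the notebook's own method: `intersect_bounds` and `grid_step_km` are hypothetical names, the ranges are copied from the printed output, and the longitude axis is rebuilt with `np.linspace` instead of being read from the file. The 111 km per degree factor applies to latitude; for longitude the sketch scales it by the cosine of the latitude, which changes the result by under 1% at about 6°S.

```python
import math
import numpy as np

def intersect_bounds(*ranges):
    """Intersect (min, max) ranges; return None if they do not overlap."""
    lo = max(r[0] for r in ranges)
    hi = min(r[1] for r in ranges)
    return (lo, hi) if lo <= hi else None

def grid_step_km(coords, mean_lat_deg=None):
    """Median grid spacing in degrees and km; pass mean_lat_deg for a longitude axis."""
    coords = np.asarray(coords, dtype=float)
    step_deg = float(np.median(np.abs(np.diff(coords))))
    km_per_deg = 111.0
    if mean_lat_deg is not None:
        km_per_deg *= math.cos(math.radians(mean_lat_deg))
    return step_deg, step_deg * km_per_deg

# Ranges copied from the output above, in the order CHL, SST, salinity
lat_box = intersect_bounds((-6.7708, -5.4792), (-6.7750, -5.4750), (-6.7500, -5.5000))
lon_box = intersect_bounds((104.5625, 105.9375), (104.5750, 105.9750), (104.5834, 105.9167))
print(f"Latitude box:  {lat_box}")
print(f"Longitude box: {lon_box}")

# Salinity longitude axis rebuilt from its range and size (assumption: evenly spaced)
lon_so_demo = np.linspace(104.5834, 105.9167, 17)
step_deg, step_km = grid_step_km(lon_so_demo, mean_lat_deg=-6.125)
print(f"Salinity longitude step: {step_deg:.4f} deg, about {step_km:.1f} km")
```

Using the median of all spacings instead of the first two coordinates makes the estimate less sensitive to a single irregular step.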


## 6. Summary & Next Steps

In [7]:
print("=== EXPLORATION SUMMARY ===")
print("\n✅ Data files successfully loaded and analyzed")
print("\nKey Findings:")
print("1. All files have 1461 time steps (daily data)")
print("2. Spatial resolutions differ (need resampling)")
print("3. SST is in Kelvin (needs conversion to Celsius)")
print("4. Salinity has 50 depth levels (need surface extraction)")
print("5. Bounding box intersection calculated")
print("\nNext Steps:")
print("- Phase 2: Data Preprocessing")
print("  - Convert SST: Kelvin → Celsius")
print("  - Extract surface salinity")
print("  - Resample to common grid")
print("  - Crop to bounding box")

=== EXPLORATION SUMMARY ===

✅ Data files successfully loaded and analyzed

Key Findings:
1. All files have 1461 time steps (daily data)
2. Spatial resolutions differ (need resampling)
3. SST is in Kelvin (needs conversion to Celsius)
4. Salinity has 50 depth levels (need surface extraction)
5. Bounding box intersection calculated

Next Steps:
- Phase 2: Data Preprocessing
  - Convert SST: Kelvin → Celsius
  - Extract surface salinity
  - Resample to common grid
  - Crop to bounding box
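
The preprocessing steps listed above could look roughly like the sketch below. It is not the Phase 2 implementation: nearest-neighbour resampling and the use of the coarsest (salinity) grid as the common target are assumptions made here for illustration, the helper names are hypothetical, and all arrays are synthetic.

```python
import numpy as np

def kelvin_to_celsius(sst_k):
    """Convert Kelvin to degrees Celsius, keeping missing cells masked."""
    return np.ma.masked_invalid(sst_k) - 273.15

def crop_indices(coords, lo, hi):
    """Indices of coordinate values that fall inside [lo, hi]."""
    coords = np.asarray(coords, dtype=float)
    return np.where((coords >= lo) & (coords <= hi))[0]

def regrid_nearest(field, src_lat, src_lon, dst_lat, dst_lon):
    """Nearest-neighbour resampling of a 2D (lat, lon) field onto a target grid."""
    src_lat = np.asarray(src_lat, dtype=float)
    src_lon = np.asarray(src_lon, dtype=float)
    dst_lat = np.asarray(dst_lat, dtype=float)
    dst_lon = np.asarray(dst_lon, dtype=float)
    # For each target coordinate, find the index of the closest source coordinate
    iy = np.abs(src_lat[:, None] - dst_lat[None, :]).argmin(axis=0)
    ix = np.abs(src_lon[:, None] - dst_lon[None, :]).argmin(axis=0)
    return field[np.ix_(iy, ix)]

# Synthetic SST-like grid (27 x 29) and salinity-like target grid (16 x 17)
sst_lat = np.linspace(-6.7750, -5.4750, 27)
sst_lon = np.linspace(104.5750, 105.9750, 29)
so_lat = np.linspace(-6.7500, -5.5000, 16)
so_lon = np.linspace(104.5834, 105.9167, 17)

sst_k = np.full((27, 29), 301.72)
sst_k[0, 0] = np.nan  # one missing cell

sst_c = kelvin_to_celsius(sst_k)
sst_on_so_grid = regrid_nearest(sst_c, sst_lat, sst_lon, so_lat, so_lon)
inside = crop_indices(sst_lat, -6.7500, -5.5000)

print(f"Regridded shape: {sst_on_so_grid.shape}")
print(f"SST latitude rows inside the bounding box: {inside.size} of {sst_lat.size}")
```

Nearest-neighbour resampling keeps the original values and masks intact, but bilinear interpolation or block averaging may suit smooth fields such as SST better; that choice is left to Phase 2.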
