# Phase 2: Data Preprocessing

This notebook preprocesses the NetCDF data before the HSI calculation.

## Steps:
1. Load the NetCDF data
2. Convert SST: Kelvin → Celsius
3. Extract surface salinity (depth index 0)
4. Resample to a uniform grid
5. Crop to the Selat Sunda bounding box
6. Handle missing values
7. Save the processed data

## 1. Import Libraries & Setup

In [2]:
import netCDF4
import numpy as np
import pandas as pd
from scipy.interpolate import griddata
from datetime import datetime
import os
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Define Configuration

In [3]:
# Selat Sunda bounding box (from the exploration notebook)
# To be updated after running the exploration notebook
BBOX = {
    'lat_min': -6.7750,   # from SST (widest extent)
    'lat_max': -5.4750,   # from SST
    'lon_min': 104.5625,  # from CHL
    'lon_max': 105.9375   # from CHL
}

# Target grid resolution (degrees)
# Choose a suitable resolution (0.05° ~ 5.5 km or 0.1° ~ 11 km)
TARGET_RESOLUTION = 0.05  # degrees

# File paths
CHL_FILE = '../CHL 21-24.nc'
SST_FILE = '../SST 21-24.nc'
SO_FILE = '../SO 21-24.nc'

# Output directory
OUTPUT_DIR = '../data/processed'
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Bounding Box: {BBOX}")
print(f"Target Resolution: {TARGET_RESOLUTION}° (~{TARGET_RESOLUTION*111:.1f} km)")
print(f"Output directory: {OUTPUT_DIR}")

Bounding Box: {'lat_min': -6.775, 'lat_max': -5.475, 'lon_min': 104.5625, 'lon_max': 105.9375}
Target Resolution: 0.05° (~5.6 km)
Output directory: ../data/processed
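
The 111 km per degree factor used in the print statement applies to latitude; east-west spacing shrinks with the cosine of the latitude. The sketch below estimates both for the bounding box above. The helper name `grid_spacing_km` is hypothetical, and the calculation assumes a spherical Earth with the same 111 km approximation.

```python
import math

KM_PER_DEGREE = 111.0  # same approximation the notebook uses when printing the resolution

def grid_spacing_km(resolution_deg, lat_deg):
    """Return (north-south, east-west) cell size in km for a lat/lon grid."""
    ns_km = resolution_deg * KM_PER_DEGREE
    # A degree of longitude covers less distance away from the equator
    ew_km = ns_km * math.cos(math.radians(lat_deg))
    return ns_km, ew_km

# Middle latitude of the bounding box defined above
mid_lat = (-6.7750 + -5.4750) / 2
ns_km, ew_km = grid_spacing_km(0.05, mid_lat)
print(f"N-S: {ns_km:.2f} km, E-W: {ew_km:.2f} km")
```

At roughly 6° S the two values differ by less than one percent, so treating the cells as square is a reasonable simplification for this region.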


## 3. Create Target Grid

In [4]:
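# Create a uniform grid at the target resolution
# Note: with '+ TARGET_RESOLUTION' as the stop value, np.arange can overshoot
# lat_max / lon_max by up to one cell (compare the printed ranges with BBOX)
lat_grid = np.arange(BBOX['lat_min'], BBOX['lat_max'] + TARGET_RESOLUTION, TARGET_RESOLUTION)
lon_grid = np.arange(BBOX['lon_min'], BBOX['lon_max'] + TARGET_RESOLUTION, TARGET_RESOLUTION)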

# Create meshgrid
lon_mesh, lat_mesh = np.meshgrid(lon_grid, lat_grid)

print(f"Target grid size: {len(lat_grid)} x {len(lon_grid)} = {len(lat_grid) * len(lon_grid)} points")
print(f"Latitude range: {lat_grid.min():.4f} to {lat_grid.max():.4f}")
print(f"Longitude range: {lon_grid.min():.4f} to {lon_grid.max():.4f}")

Target grid size: 28 x 29 = 812 points
Latitude range: -6.7750 to -5.4250
Longitude range: 104.5625 to 105.9625
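
The printed ranges end at -5.4250 and 105.9625, slightly beyond the bounding box limits of -5.4750 and 105.9375. If the grid must stay inside the box, one option is to compute the number of steps explicitly instead of relying on `np.arange` with a float stop value. This is a minimal sketch with a hypothetical helper `make_axis`; it is not what the notebook ran, and using it would change the grid shape downstream.

```python
import numpy as np

def make_axis(start, stop, step):
    """Evenly spaced axis beginning at start that never exceeds stop."""
    # The small tolerance guards against float rounding in the division
    n_steps = int(np.floor((stop - start) / step + 1e-9))
    return start + step * np.arange(n_steps + 1)

lat_axis = make_axis(-6.7750, -5.4750, 0.05)
lon_axis = make_axis(104.5625, 105.9375, 0.05)

print(f"lat: {len(lat_axis)} points, max {lat_axis.max():.4f}")
print(f"lon: {len(lon_axis)} points, max {lon_axis.max():.4f}")
```

With these bounds the axes hold 27 and 28 points rather than 28 and 29, because the longitude span is not an exact multiple of 0.05°.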


## 4. Load & Preprocess CHL Data

In [5]:
print("=== Processing CHL Data ===")
nc_chl = netCDF4.Dataset(CHL_FILE, 'r')

# Get coordinates
lat_chl = nc_chl.variables['latitude'][:]
lon_chl = nc_chl.variables['longitude'][:]
time_chl = nc_chl.variables['time'][:]
chl_data = nc_chl.variables['CHL']

print(f"Original CHL shape: {chl_data.shape}")
print(f"Time steps: {len(time_chl)}")

# Get time units for date conversion
time_units = nc_chl.variables['time'].units
print(f"Time units: {time_units}")

# Create meshgrid for original data
lon_chl_2d, lat_chl_2d = np.meshgrid(lon_chl, lat_chl)

# Flatten for interpolation
points_chl = np.column_stack((lon_chl_2d.ravel(), lat_chl_2d.ravel()))

print(f"CHL data loaded. Ready for resampling.")
nc_chl.close()

=== Processing CHL Data ===
Original CHL shape: (1461, 32, 34)
Time steps: 1461
Time units: seconds since 1970-01-01 00:00:00
CHL data loaded. Ready for resampling.
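
The cell reads the time units but does not convert the values to dates. Because the units are seconds since the Unix epoch, `pandas.to_datetime` can do the conversion directly. The sample values below are hypothetical stand-ins for `nc_chl.variables['time'][:]`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values standing in for nc_chl.variables['time'][:]
time_seconds = np.array([1609459200, 1609545600, 1609632000])

# "seconds since 1970-01-01 00:00:00" is the Unix epoch, so unit='s' applies
dates = pd.to_datetime(time_seconds, unit='s')
print(dates)
```

For files whose units use a different reference date or calendar, `netCDF4.num2date(values, units)` is the more general route.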


## 5. Load & Preprocess SST Data (Convert Kelvin to Celsius)

In [6]:
print("=== Processing SST Data ===")
nc_sst = netCDF4.Dataset(SST_FILE, 'r')

# Get coordinates
lat_sst = nc_sst.variables['latitude'][:]
lon_sst = nc_sst.variables['longitude'][:]
time_sst = nc_sst.variables['time'][:]
sst_data_k = nc_sst.variables['analysed_sst']  # in Kelvin

print(f"Original SST shape: {sst_data_k.shape}")
print(f"Time steps: {len(time_sst)}")

# Create meshgrid for original data
lon_sst_2d, lat_sst_2d = np.meshgrid(lon_sst, lat_sst)

# Flatten for interpolation
points_sst = np.column_stack((lon_sst_2d.ravel(), lat_sst_2d.ravel()))

print(f"SST data loaded. Will convert Kelvin → Celsius during resampling.")
nc_sst.close()

=== Processing SST Data ===
Original SST shape: (1461, 27, 29)
Time steps: 1461
SST data loaded. Will convert Kelvin → Celsius during resampling.
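
The conversion itself is a constant offset of 273.15. netCDF4 can return masked arrays for fields with missing cells, so the sketch below shows the subtraction on a small hypothetical masked field and then turns the masked cells into NaN. The array values are made up for illustration.

```python
import numpy as np

# Hypothetical 2x2 SST field in Kelvin with one missing (masked) cell
sst_k = np.ma.masked_invalid(np.array([[302.15, 301.65], [np.nan, 300.15]]))

# The subtraction keeps the mask; filled() then replaces masked cells with NaN
sst_c = (sst_k - 273.15).filled(np.nan)
print(sst_c)
```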


## 6. Load & Preprocess Salinity Data (Extract Surface Layer)

In [7]:
print("=== Processing Salinity Data ===")
nc_so = netCDF4.Dataset(SO_FILE, 'r')

# Get coordinates
lat_so = nc_so.variables['latitude'][:]
lon_so = nc_so.variables['longitude'][:]
time_so = nc_so.variables['time'][:]
depth_so = nc_so.variables['depth'][:]
so_data = nc_so.variables['so']  # [time, depth, lat, lon]

print(f"Original Salinity shape: {so_data.shape}")
print(f"Depth levels: {len(depth_so)}")
print(f"Surface depth (first level): {depth_so[0]:.2f} m")

# Extract surface salinity (depth index 0)
surface_so = so_data[:, 0, :, :]  # [time, lat, lon]
print(f"Surface salinity shape: {surface_so.shape}")

# Create meshgrid for original data
lon_so_2d, lat_so_2d = np.meshgrid(lon_so, lat_so)

# Flatten for interpolation
points_so = np.column_stack((lon_so_2d.ravel(), lat_so_2d.ravel()))

print(f"Surface salinity extracted. Ready for resampling.")
nc_so.close()

=== Processing Salinity Data ===
Original Salinity shape: (1461, 50, 16, 17)
Depth levels: 50
Surface depth (first level): 0.49 m
Surface salinity shape: (1461, 16, 17)
Surface salinity extracted. Ready for resampling.
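
The cell takes depth index 0 as the surface, which the output confirms is the 0.49 m level. If the depth axis were ordered differently, picking the level closest to 0 m would be safer. The sketch below assumes hypothetical depth values (only 0.49 comes from the output above) and a random salinity cube.

```python
import numpy as np

# Hypothetical depth axis in metres; only the first value is taken from the output above
depth = np.array([0.49, 1.54, 2.65, 3.82])

# Hypothetical salinity cube laid out as [time, depth, lat, lon]
so = np.random.default_rng(0).uniform(30, 34, size=(3, 4, 2, 2))

# Pick the level whose depth is closest to the sea surface
surface_idx = int(np.argmin(np.abs(depth)))
surface = so[:, surface_idx, :, :]
print(surface_idx, depth[surface_idx], surface.shape)
```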


## 7. Resample & Interpolate Data to Target Grid

In [8]:
def resample_to_grid(data_2d, points_orig, lon_target, lat_target, method='nearest'):
    """
    Resample 2D data to the target grid using interpolation.
    
    Parameters:
    - data_2d: 2D array [lat, lon] of the original data
    - points_orig: array [N, 2] of (lon, lat) for the original data
    - lon_target, lat_target: target grid coordinates (2D meshgrid arrays)
    - method: interpolation method ('linear', 'nearest', 'cubic')
    
    Returns:
    - resampled_data: 2D array with the shape of the target grid
    """
    # Flatten target grid
    points_target = np.column_stack((lon_target.ravel(), lat_target.ravel()))
    
    # Flatten original data; netCDF4 may return a masked array, so convert
    # masked cells to NaN before the NaN check below
    values_orig = np.ma.asarray(data_2d).astype(float).filled(np.nan).ravel()
    
    # Remove NaN values
    valid_mask = ~np.isnan(values_orig)
    if np.sum(valid_mask) == 0:
        return np.full(lon_target.shape, np.nan)
    
    points_valid = points_orig[valid_mask]
    values_valid = values_orig[valid_mask]
    
    # Interpolate (fill_value only affects 'linear' and 'cubic')
    values_interp = griddata(
        points_valid,
        values_valid,
        points_target,
        method=method,
        fill_value=np.nan
    )
    
    # Reshape to target grid
    resampled = values_interp.reshape(lon_target.shape)
    
    return resampled

print("Resampling function defined.")

Resampling function defined.
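
The choice of `method` matters at the edges of the source data: `'nearest'` always returns a value, while `'linear'` returns `fill_value` (NaN here) for target points outside the convex hull of the source points. The sketch below shows the difference on a small synthetic grid; all coordinates and values are hypothetical.

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical coarse source grid covering lon/lat 0..1
src_lon, src_lat = np.meshgrid(np.linspace(0, 1, 3), np.linspace(0, 1, 3))
src_points = np.column_stack((src_lon.ravel(), src_lat.ravel()))
src_values = (src_lon + src_lat).ravel()

# Two target points: one inside the source extent, one outside it
targets = np.array([[0.2, 0.2], [1.5, 0.5]])

nearest = griddata(src_points, src_values, targets, method='nearest')
linear = griddata(src_points, src_values, targets, method='linear', fill_value=np.nan)

print("nearest:", nearest)
print("linear: ", linear)
```

This is one reason a nearest-neighbour run can report 100% valid data even where the target grid extends past the source grid.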


# ============================================
# OPTIMIZED VERSION - Process All Time Steps
# ============================================

import time
from multiprocessing import Pool, cpu_count
from functools import partial

# ===== Optimization Settings =====
INTERP_METHOD = 'nearest'  # 'nearest' (fastest, 3-5x), 'linear' (accurate), 'cubic' (slowest)
USE_PARALLEL = True  # Set to False if multiprocessing causes problems
N_WORKERS = min(4, cpu_count())  # Number of worker processes (capped at 4)
CHUNK_SIZE = 100  # Process 100 time steps per batch

print(f"🚀 Optimization settings:")
print(f"  Interpolation method: {INTERP_METHOD} (3-5x faster)")
print(f"  Parallel processing: {USE_PARALLEL} ({N_WORKERS} workers)")
print(f"  Chunk size: {CHUNK_SIZE}")

# ===== Optimized Resampling Function =====
def resample_to_grid_fast(data_2d, points_orig, lon_target, lat_target, method=INTERP_METHOD):
    """Optimized resampling using a faster interpolation method."""
    points_target = np.column_stack((lon_target.ravel(), lat_target.ravel()))
    # netCDF4 may return a masked array; convert masked cells to NaN first
    values_orig = np.ma.asarray(data_2d).astype(float).filled(np.nan).ravel()
    
    valid_mask = ~np.isnan(values_orig)
    if np.sum(valid_mask) == 0:
        return np.full(lon_target.shape, np.nan)
    
    points_valid = points_orig[valid_mask]
    values_valid = values_orig[valid_mask]
    
    values_interp = griddata(
        points_valid,
        values_valid,
        points_target,
        method=method,
        fill_value=np.nan
    )
    
    return values_interp.reshape(lon_target.shape)

# ===== Process Single Time Step (for parallel) =====
def process_time_step(t, chl_data, sst_data_k, so_data, 
                     points_chl, points_sst, points_so,
                     lon_mesh, lat_mesh):
    """Process a single time step (used for parallel processing)."""
    try:
        # CHL
        chl_2d = chl_data[t, :, :]
        chl_resampled = resample_to_grid_fast(chl_2d, points_chl, lon_mesh, lat_mesh)
        
        # SST (convert Kelvin to Celsius)
        sst_2d_k = sst_data_k[t, :, :]
        sst_2d_c = sst_2d_k - 273.15
        sst_resampled = resample_to_grid_fast(sst_2d_c, points_sst, lon_mesh, lat_mesh)
        
        # Salinity (surface)
        so_2d = so_data[t, 0, :, :]
        so_resampled = resample_to_grid_fast(so_2d, points_so, lon_mesh, lat_mesh)
        
        return t, chl_resampled, sst_resampled, so_resampled
    except Exception as e:
        print(f"Error processing time step {t}: {e}")
        return t, None, None, None

# ===== Main Processing =====
SAMPLE_SIZE = None  # None processes everything; set an integer to process a sample

# Reopen files
nc_chl = netCDF4.Dataset(CHL_FILE, 'r')
nc_sst = netCDF4.Dataset(SST_FILE, 'r')
nc_so = netCDF4.Dataset(SO_FILE, 'r')

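# Read the arrays into memory: netCDF4 Variable objects cannot be pickled,
# so they cannot be passed to the worker processes used below
chl_data = nc_chl.variables['CHL'][:]
sst_data_k = nc_sst.variables['analysed_sst'][:]
so_data = nc_so.variables['so'][:, :1, :, :]  # surface level only, depth axis kept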

# Get number of time steps
n_times = len(time_chl)
if SAMPLE_SIZE:
    n_times = min(SAMPLE_SIZE, n_times)

print(f"\nProcessing {n_times} time steps with optimizations...")
start_time = time.time()

# Initialize arrays
processed_chl = np.full((n_times, len(lat_grid), len(lon_grid)), np.nan)
processed_sst = np.full((n_times, len(lat_grid), len(lon_grid)), np.nan)
processed_so = np.full((n_times, len(lat_grid), len(lon_grid)), np.nan)

time_indices = list(range(n_times))

if USE_PARALLEL and N_WORKERS > 1:
    # ===== PARALLEL PROCESSING =====
    print(f"Using parallel processing with {N_WORKERS} workers...")
    
    process_func = partial(
        process_time_step,
        chl_data=chl_data,
        sst_data_k=sst_data_k,
        so_data=so_data,
        points_chl=points_chl,
        points_sst=points_sst,
        points_so=points_so,
        lon_mesh=lon_mesh,
        lat_mesh=lat_mesh
    )
    
    # Process in chunks
    for chunk_start in range(0, n_times, CHUNK_SIZE):
        chunk_end = min(chunk_start + CHUNK_SIZE, n_times)
        chunk_indices = time_indices[chunk_start:chunk_end]
        
        chunk_num = chunk_start//CHUNK_SIZE + 1
        total_chunks = (n_times-1)//CHUNK_SIZE + 1
        print(f"Processing chunk {chunk_num}/{total_chunks} (time steps {chunk_start+1}-{chunk_end})...")
        
        with Pool(processes=N_WORKERS) as pool:
            results = pool.map(process_func, chunk_indices)
        
        # Store results
        for t, chl_res, sst_res, so_res in results:
            if chl_res is not None:
                processed_chl[t, :, :] = chl_res
                processed_sst[t, :, :] = sst_res
                processed_so[t, :, :] = so_res
        
        elapsed = time.time() - start_time
        print(f"  ✓ Chunk {chunk_num} completed in {elapsed:.1f}s")
    
else:
    # ===== SEQUENTIAL PROCESSING (Optimized) =====
    print("Using sequential processing (optimized with 'nearest' method)...")
    
    for t in range(n_times):
        if (t + 1) % 50 == 0 or t == 0:
            elapsed = time.time() - start_time
            rate = (t + 1) / elapsed if elapsed > 0 else 0
            remaining = (n_times - t - 1) / rate if rate > 0 else 0
            print(f"Progress: {t+1}/{n_times} ({100*(t+1)/n_times:.1f}%) | "
                  f"Elapsed: {elapsed:.1f}s | ETA: {remaining:.1f}s")
        
        # CHL
        chl_2d = chl_data[t, :, :]
        processed_chl[t, :, :] = resample_to_grid_fast(chl_2d, points_chl, lon_mesh, lat_mesh)
        
        # SST (convert Kelvin to Celsius)
        sst_2d_k = sst_data_k[t, :, :]
        sst_2d_c = sst_2d_k - 273.15
        processed_sst[t, :, :] = resample_to_grid_fast(sst_2d_c, points_sst, lon_mesh, lat_mesh)
        
        # Salinity (surface)
        so_2d = so_data[t, 0, :, :]
        processed_so[t, :, :] = resample_to_grid_fast(so_2d, points_so, lon_mesh, lat_mesh)

total_time = time.time() - start_time
print(f"\n{'='*60}")
print(f"✅ Processing complete!")
print(f"Total time: {total_time:.1f}s ({total_time/60:.1f} minutes)")
print(f"Average: {total_time/n_times:.2f}s per time step")
print(f"Processed data shape: {processed_chl.shape}")
print(f"{'='*60}")

# Close files
nc_chl.close()
nc_sst.close()
nc_so.close()
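
Since the source and target grids are the same for every time step, a nearest-neighbour lookup can be computed once and reused, instead of calling `griddata` per step. The sketch below uses `scipy.spatial.cKDTree` for that. It is an alternative technique, not the notebook's method, and it assumes the source field has no missing cells (a NaN in the source would simply be copied to the target). The helper names `nearest_indices` and `resample_all` are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_indices(points_src, lon_target, lat_target):
    """Index of the nearest source point for every target grid cell."""
    points_target = np.column_stack((lon_target.ravel(), lat_target.ravel()))
    _, idx = cKDTree(points_src).query(points_target)
    return idx

def resample_all(data_3d, idx, target_shape):
    """Apply precomputed nearest-neighbour indices to [time, lat, lon] data."""
    n_times = data_3d.shape[0]
    flat = data_3d.reshape(n_times, -1)
    return flat[:, idx].reshape((n_times,) + target_shape)

# Hypothetical example: 2 time steps on a 2x2 source grid
src_lon, src_lat = np.meshgrid([0.0, 1.0], [0.0, 1.0])
points_src = np.column_stack((src_lon.ravel(), src_lat.ravel()))
tgt_lon, tgt_lat = np.meshgrid([0.1, 0.9], [0.1, 0.9])
data = np.arange(8, dtype=float).reshape(2, 2, 2)

idx = nearest_indices(points_src, tgt_lon, tgt_lat)
out = resample_all(data, idx, tgt_lon.shape)
print(out.shape)
```

Each target cell here sits next to exactly one source cell, so the resampled array equals the input; on real grids the index array maps many target cells to each source cell.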

In [9]:
# For testing, set SAMPLE_SIZE to an integer (e.g. 10 for the first 10 days)
# Once that works, switch back to all time steps
SAMPLE_SIZE = None  # None processes every time step

# Reopen files
nc_chl = netCDF4.Dataset(CHL_FILE, 'r')
nc_sst = netCDF4.Dataset(SST_FILE, 'r')
nc_so = netCDF4.Dataset(SO_FILE, 'r')

chl_data = nc_chl.variables['CHL']
sst_data_k = nc_sst.variables['analysed_sst']
so_data = nc_so.variables['so']

# Get number of time steps
n_times = len(time_chl)
if SAMPLE_SIZE:
    n_times = min(SAMPLE_SIZE, n_times)

print(f"Processing {n_times} time steps...")

# Initialize arrays for the processed data
processed_chl = np.full((n_times, len(lat_grid), len(lon_grid)), np.nan)
processed_sst = np.full((n_times, len(lat_grid), len(lon_grid)), np.nan)
processed_so = np.full((n_times, len(lat_grid), len(lon_grid)), np.nan)

# Process each time step
for t in range(n_times):
    if (t + 1) % 10 == 0 or t == 0:
        print(f"Processing time step {t+1}/{n_times}...")
    
    # CHL
    chl_2d = chl_data[t, :, :]
    processed_chl[t, :, :] = resample_to_grid(chl_2d, points_chl, lon_mesh, lat_mesh)
    
    # SST (convert Kelvin to Celsius)
    sst_2d_k = sst_data_k[t, :, :]
    sst_2d_c = sst_2d_k - 273.15  # Convert to Celsius
    processed_sst[t, :, :] = resample_to_grid(sst_2d_c, points_sst, lon_mesh, lat_mesh)
    
    # Salinity (surface)
    so_2d = so_data[t, 0, :, :]  # surface layer
    processed_so[t, :, :] = resample_to_grid(so_2d, points_so, lon_mesh, lat_mesh)

print(f"\nProcessing complete!")
print(f"Processed data shape: {processed_chl.shape}")

# Close files
nc_chl.close()
nc_sst.close()
nc_so.close()

Processing 1461 time steps...
Processing time step 1/1461...
Processing time step 10/1461...
Processing time step 20/1461...
Processing time step 30/1461...
Processing time step 40/1461...
Processing time step 50/1461...
Processing time step 60/1461...
Processing time step 70/1461...
Processing time step 80/1461...
Processing time step 90/1461...
Processing time step 100/1461...
Processing time step 110/1461...
Processing time step 120/1461...
Processing time step 130/1461...
Processing time step 140/1461...
Processing time step 150/1461...
Processing time step 160/1461...
Processing time step 170/1461...
Processing time step 180/1461...
Processing time step 190/1461...
Processing time step 200/1461...
Processing time step 210/1461...
Processing time step 220/1461...
Processing time step 230/1461...
Processing time step 240/1461...
Processing time step 250/1461...
Processing time step 260/1461...
Processing time step 270/1461...
Processing time step 280/1461...
Processing time step 290

## 9. Data Quality Check

In [10]:
# Check data quality
print("=== Data Quality Check ===")

# CHL
valid_chl = ~np.isnan(processed_chl)
chl_valid_pct = np.sum(valid_chl) / processed_chl.size * 100
print(f"\nCHL:")
print(f"  Valid data: {chl_valid_pct:.1f}%")
if np.any(valid_chl):
    print(f"  Range: {np.nanmin(processed_chl):.4f} to {np.nanmax(processed_chl):.4f} mg/m³")
    print(f"  Mean: {np.nanmean(processed_chl):.4f} mg/m³")

# SST
valid_sst = ~np.isnan(processed_sst)
sst_valid_pct = np.sum(valid_sst) / processed_sst.size * 100
print(f"\nSST:")
print(f"  Valid data: {sst_valid_pct:.1f}%")
if np.any(valid_sst):
    print(f"  Range: {np.nanmin(processed_sst):.2f} to {np.nanmax(processed_sst):.2f} °C")
    print(f"  Mean: {np.nanmean(processed_sst):.2f} °C")

# Salinity
valid_so = ~np.isnan(processed_so)
so_valid_pct = np.sum(valid_so) / processed_so.size * 100
print(f"\nSalinity:")
print(f"  Valid data: {so_valid_pct:.1f}%")
if np.any(valid_so):
    print(f"  Range: {np.nanmin(processed_so):.2f} to {np.nanmax(processed_so):.2f} PSU")
    print(f"  Mean: {np.nanmean(processed_so):.2f} PSU")

=== Data Quality Check ===

CHL:
  Valid data: 100.0%
  Range: 0.0482 to 36.7248 mg/m³
  Mean: 0.6461 mg/m³

SST:
  Valid data: 100.0%
  Range: 24.78 to 31.10 °C
  Mean: 29.15 °C

Salinity:
  Valid data: 100.0%
  Range: 28.33 to 34.34 PSU
  Mean: 32.32 PSU
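
The three checks above repeat the same logic, so they could be folded into one helper. This is a small sketch with a hypothetical function `summarize`, demonstrated on a made-up array rather than the processed data.

```python
import numpy as np

def summarize(arr):
    """Return valid-data percentage plus NaN-aware min, max and mean."""
    valid = ~np.isnan(arr)
    stats = {"valid_pct": float(valid.sum()) / arr.size * 100}
    if valid.any():
        stats["min"] = float(np.nanmin(arr))
        stats["max"] = float(np.nanmax(arr))
        stats["mean"] = float(np.nanmean(arr))
    return stats

# Hypothetical field with one missing cell
demo = np.array([[1.0, np.nan], [3.0, 4.0]])
demo_stats = summarize(demo)
for key, value in demo_stats.items():
    print(f"{key}: {value:.2f}")
```

In the notebook this could be applied in a loop over `processed_chl`, `processed_sst` and `processed_so`, with the unit label passed alongside each array.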


## 10. Save Processed Data

In [11]:
# Save processed data as numpy arrays
# Format: [time, lat, lon]

np.savez_compressed(
    f"{OUTPUT_DIR}/processed_data.npz",
    chl=processed_chl,
    sst=processed_sst,
    salinity=processed_so,
    lat_grid=lat_grid,
    lon_grid=lon_grid,
    time_indices=np.arange(n_times)
)

print(f"Processed data saved to {OUTPUT_DIR}/processed_data.npz")
print(f"\nData shape: {processed_chl.shape}")
print(f"Grid size: {len(lat_grid)} x {len(lon_grid)}")
print(f"Time steps: {n_times}")

Processed data saved to ../data/processed/processed_data.npz

Data shape: (1461, 28, 29)
Grid size: 28 x 29
Time steps: 1461
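
A later notebook can read the archive back with `np.load`. The sketch below does a round trip on small hypothetical arrays in a temporary directory, so it runs without the real `processed_data.npz`; the key names mirror a subset of those used in the save call above.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-in arrays with the same [time, lat, lon] layout
chl = np.random.default_rng(0).random((2, 3, 4))
lat = np.linspace(-6.0, -5.9, 3)
lon = np.linspace(105.0, 105.15, 4)

path = os.path.join(tempfile.mkdtemp(), "processed_data.npz")
np.savez_compressed(path, chl=chl, lat_grid=lat, lon_grid=lon)

# np.load returns a lazy NpzFile; the context manager closes it afterwards
with np.load(path) as data:
    keys = sorted(data.files)
    chl_loaded = data["chl"]

print(keys, chl_loaded.shape)
```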


## 11. Summary & Next Steps

In [12]:
print("=== PREPROCESSING SUMMARY ===")
print("\n✅ Data preprocessing completed!")
print("\nWhat was done:")
print("1. ✅ SST converted: Kelvin → Celsius")
print("2. ✅ Surface salinity extracted (depth index 0)")
print(f"3. ✅ All data resampled to uniform grid ({TARGET_RESOLUTION}°)")
print("4. ✅ Data cropped to bounding box")
print("5. ✅ Missing values handled (NaN)")
print("\nNext Steps:")
print("- Phase 3: HSI Calculation")
print("  - Calculate HSI_CHL, HSI_SST, HSI_SO")
print("  - Calculate HSI_total = (HSI_CHL × HSI_SST × HSI_SO)^(1/3)")
print(f"\nNote: This run processed {n_times} time steps.")
print("To process only a sample, set SAMPLE_SIZE to an integer in the resampling cell.")

=== PREPROCESSING SUMMARY ===

✅ Data preprocessing completed!

What was done:
1. ✅ SST converted: Kelvin → Celsius
2. ✅ Surface salinity extracted (depth index 0)
3. ✅ All data resampled to uniform grid (0.05°)
4. ✅ Data cropped to bounding box
5. ✅ Missing values handled (NaN)

Next Steps:
- Phase 3: HSI Calculation
  - Calculate HSI_CHL, HSI_SST, HSI_SO
  - Calculate HSI_total = (HSI_CHL × HSI_SST × HSI_SO)^(1/3)

Note: This run processed 1461 time steps.
To process only a sample, set SAMPLE_SIZE to an integer in the resampling cell.
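
The HSI_total formula listed under the next steps is a geometric mean of the three indices. The sketch below evaluates it on hypothetical index values, assuming each index lies in [0, 1]; how the individual indices are derived is left to Phase 3 and is not shown here.

```python
import numpy as np

def hsi_total(hsi_chl, hsi_sst, hsi_so):
    """Geometric mean of three suitability indices, each assumed to lie in [0, 1]."""
    return np.cbrt(hsi_chl * hsi_sst * hsi_so)

# Hypothetical index values for three grid cells
hsi_chl = np.array([1.0, 0.5, 0.0])
hsi_sst = np.array([1.0, 0.5, 0.9])
hsi_so = np.array([1.0, 0.5, 0.8])

total = hsi_total(hsi_chl, hsi_sst, hsi_so)
print(total)
```

A property worth noting is that a single index of zero drives the total to zero, regardless of the other two.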
