# Phase 4: Monthly Aggregation

This notebook aggregates the daily HSI data into monthly means.

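## Steps:
1. Load the HSI data
2. Convert time indices to dates
3. Group the data by year-month
4. Calculate the monthly mean for HSI and for each input parameter
5. Generate one dataset per calendar month (48 for the 1461 daily steps loaded in this run, 2021-01 to 2024-12; the original plan assumed 36 for 2021-2023)
6. Save the monthly aggregated data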
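
The steps above can be sketched end to end on a small synthetic cube. This is a minimal illustration, not the notebook's implementation: the array `daily`, its 2 x 3 grid, and the 90-day length are made up for the example, and the start date simply mirrors the one the notebook assumes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real HSI cube: 90 daily steps on a 2 x 3 grid.
rng = np.random.default_rng(0)
daily = rng.random((90, 2, 3))

# Steps 2-3: map each time index to a calendar-month label such as "2021-01".
dates = pd.date_range("2021-01-01", periods=daily.shape[0], freq="D")
labels = np.asarray(dates.strftime("%Y-%m"), dtype=str)

# Step 4: NaN-aware mean over the time axis, giving one 2D grid per month.
monthly = {
    str(ym): np.nanmean(daily[labels == ym], axis=0)
    for ym in np.unique(labels)
}

print(sorted(monthly))
print(monthly["2021-01"].shape)
```

Ninety days from 2021-01-01 cover January through March, so the dictionary holds three grids, each with the spatial shape of the input.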

## 1. Import Libraries & Load HSI Data

In [9]:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import os
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


In [10]:
# Load HSI data
HSI_DATA_FILE = '../data/processed/hsi_data.npz'

if not os.path.exists(HSI_DATA_FILE):
    raise FileNotFoundError(
        f"HSI data file not found: {HSI_DATA_FILE}. "
        "Please run the HSI calculation notebook first."
    )

data = np.load(HSI_DATA_FILE)

hsi_total = data['hsi_total']
hsi_chl = data['hsi_chl']
hsi_sst = data['hsi_sst']
hsi_so = data['hsi_so']
chl = data['chl']
sst = data['sst']
salinity = data['salinity']
lat_grid = data['lat_grid']
lon_grid = data['lon_grid']

print(f"✅ HSI data loaded successfully!")
print(f"\nData shapes:")
print(f"  HSI_total: {hsi_total.shape}")
print(f"  Grid: {len(lat_grid)} x {len(lon_grid)}")
print(f"  Time steps: {hsi_total.shape[0]}")

✅ HSI data loaded successfully!

Data shapes:
  HSI_total: (1461, 28, 29)
  Grid: 28 x 29
  Time steps: 1461


## 2. Create Date Range

In [11]:
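# Create the date range from an assumed start date of 2021-01-01.
# NOTE: 1461 daily steps is four years (3 x 365 + 366), so with this start
# date the range ends on 2024-12-31, not 2023-12-31.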
start_date = datetime(2021, 1, 1)
n_days = hsi_total.shape[0]

# Generate dates
dates = [start_date + timedelta(days=i) for i in range(n_days)]

# Create a DataFrame for easier grouping
df_dates = pd.DataFrame({
    'date': dates,
    'year': [d.year for d in dates],
    'month': [d.month for d in dates],
    'year_month': [f"{d.year}-{d.month:02d}" for d in dates]
})

print(f"Date range: {dates[0].strftime('%Y-%m-%d')} to {dates[-1].strftime('%Y-%m-%d')}")
print(f"Total days: {len(dates)}")
print(f"\nUnique year-months: {df_dates['year_month'].nunique()}")
print(f"Year range: {df_dates['year'].min()} - {df_dates['year'].max()}")

# Show unique year-months
unique_months = sorted(df_dates['year_month'].unique())
print(f"\nMonths to process: {len(unique_months)}")
print(f"First 5: {unique_months[:5]}")
print(f"Last 5: {unique_months[-5:]}")

Date range: 2021-01-01 to 2024-12-31
Total days: 1461

Unique year-months: 48
Year range: 2021 - 2024

Months to process: 48
First 5: ['2021-01', '2021-02', '2021-03', '2021-04', '2021-05']
Last 5: ['2024-08', '2024-09', '2024-10', '2024-11', '2024-12']
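
The mismatch between the planned 2021-2023 range and the printed 2021-2024 range comes down to day counts. The sketch below works through that arithmetic with `pd.date_range`; the helper name `span_days` is hypothetical, and the 2021-01-01 start date is the notebook's own assumption rather than something read from the data file.

```python
import pandas as pd

def span_days(start, end):
    """Number of daily steps from start to end, inclusive."""
    return len(pd.date_range(start, end, freq="D"))

# Three non-leap years versus four years including leap year 2024.
three_years = span_days("2021-01-01", "2023-12-31")
four_years = span_days("2021-01-01", "2024-12-31")
print(three_years, four_years)

# Rebuild the notebook's date axis from the step count alone.
n_days = 1461
dates = pd.date_range("2021-01-01", periods=n_days, freq="D")
n_months = dates.to_period("M").nunique()
print(dates[-1].date(), n_months)
```

Only the four-year span matches the 1461 steps in the file. A series of 1461 days would also fit 2020-01-01 to 2023-12-31, since 2020 is a leap year, so the start date is worth confirming against the source data.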


## 3. Monthly Aggregation Function

In [12]:
def aggregate_monthly(data_array, year_month_list, method='mean'):
    """
    Aggregate daily data to monthly
    
    Parameters:
    - data_array: 3D array [time, lat, lon]
    - year_month_list: list of year-month strings (e.g., '2021-01')
    - method: 'mean' or 'median'
    
    Returns:
    - monthly_data: dict keyed by year_month, with 2D arrays [lat, lon] as values
    """
    monthly_data = {}
    
    unique_months = sorted(set(year_month_list))
    
    # Convert once so each month's mask is a single vectorized comparison
    ym_array = np.asarray(year_month_list)
    
    for ym in unique_months:
        # Get indices for this month
        month_indices = np.where(ym_array == ym)[0]
        
        if len(month_indices) > 0:
            # Extract data for this month
            month_data = data_array[month_indices, :, :]
            
            # Aggregate (mean or median)
            if method == 'mean':
                aggregated = np.nanmean(month_data, axis=0)
            elif method == 'median':
                aggregated = np.nanmedian(month_data, axis=0)
            else:
                raise ValueError(f"Unknown method: {method}")
            
            monthly_data[ym] = aggregated
    
    return monthly_data

print("✅ Aggregation function defined!")

✅ Aggregation function defined!
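
A tiny hand-checkable case shows what the NaN-aware monthly mean does with missing values. The function `monthly_nanmean` below is a hypothetical compact equivalent written for this example, not the notebook's `aggregate_monthly`, and the input values are invented.

```python
import numpy as np

def monthly_nanmean(data_array, labels):
    """Per-month NaN-aware mean over the time axis (axis 0)."""
    labels = np.asarray(labels)
    return {
        str(ym): np.nanmean(data_array[labels == ym], axis=0)
        for ym in np.unique(labels)
    }

# Four daily steps on a 1 x 2 grid; one January value is missing.
daily = np.array([
    [[1.0, 10.0]],
    [[3.0, np.nan]],
    [[5.0, 20.0]],
    [[7.0, 40.0]],
])
labels = ["2021-01", "2021-01", "2021-02", "2021-02"]

result = monthly_nanmean(daily, labels)
# January: (1 + 3) / 2 = 2, and 10 because the NaN is skipped.
# February: (5 + 7) / 2 = 6, and (20 + 40) / 2 = 30.
print(result["2021-01"], result["2021-02"])
```

Because `np.nanmean` skips missing values, a cell with a single valid day in a month still gets a monthly value, which is worth keeping in mind when interpreting sparsely observed cells.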


## 4. Aggregate All Data to Monthly

In [13]:
import time

print("=== Aggregating data to monthly ===")
print(f"Processing {len(unique_months)} months...")

start_time = time.time()

# Aggregate HSI
print("\nAggregating HSI_total...")
monthly_hsi = aggregate_monthly(hsi_total, df_dates['year_month'].values, method='mean')
print(f"  ✓ HSI_total: {len(monthly_hsi)} months")

print("Aggregating HSI_CHL...")
monthly_hsi_chl = aggregate_monthly(hsi_chl, df_dates['year_month'].values, method='mean')
print(f"  ✓ HSI_CHL: {len(monthly_hsi_chl)} months")

print("Aggregating HSI_SST...")
monthly_hsi_sst = aggregate_monthly(hsi_sst, df_dates['year_month'].values, method='mean')
print(f"  ✓ HSI_SST: {len(monthly_hsi_sst)} months")

print("Aggregating HSI_SO...")
monthly_hsi_so = aggregate_monthly(hsi_so, df_dates['year_month'].values, method='mean')
print(f"  ✓ HSI_SO: {len(monthly_hsi_so)} months")

# Aggregate original parameters
print("\nAggregating CHL...")
monthly_chl = aggregate_monthly(chl, df_dates['year_month'].values, method='mean')
print(f"  ✓ CHL: {len(monthly_chl)} months")

print("Aggregating SST...")
monthly_sst = aggregate_monthly(sst, df_dates['year_month'].values, method='mean')
print(f"  ✓ SST: {len(monthly_sst)} months")

print("Aggregating Salinity...")
monthly_salinity = aggregate_monthly(salinity, df_dates['year_month'].values, method='mean')
print(f"  ✓ Salinity: {len(monthly_salinity)} months")

elapsed = time.time() - start_time
print(f"\n✅ Aggregation complete in {elapsed:.1f}s!")
print(f"\nTotal months processed: {len(monthly_hsi)}")

=== Aggregating data to monthly ===
Processing 48 months...

Aggregating HSI_total...
  ✓ HSI_total: 48 months
Aggregating HSI_CHL...
  ✓ HSI_CHL: 48 months
Aggregating HSI_SST...
  ✓ HSI_SST: 48 months
Aggregating HSI_SO...
  ✓ HSI_SO: 48 months

Aggregating CHL...
  ✓ CHL: 48 months
Aggregating SST...
  ✓ SST: 48 months
Aggregating Salinity...
  ✓ Salinity: 48 months

✅ Aggregation complete in 1.0s!

Total months processed: 48


## 5. Verify Monthly Data

In [14]:
# Check data quality
print("=== Monthly Data Verification ===")

# Check all months are present.
# The expected count is derived from the date index rather than hard-coded:
# the 1461 daily steps span 48 calendar months, not 36 (3 years × 12 months).
expected_months = df_dates['year_month'].nunique()
actual_months = len(monthly_hsi)

print(f"\nExpected months: {expected_months}")
print(f"Actual months: {actual_months}")

if actual_months == expected_months:
    print("✅ All months present!")
else:
    print(f"⚠️  Month count mismatch: {abs(expected_months - actual_months)} "
          f"{'missing' if actual_months < expected_months else 'extra'}")

# Show sample month
sample_month = list(monthly_hsi.keys())[0]
print(f"\nSample month: {sample_month}")
print(f"  HSI shape: {monthly_hsi[sample_month].shape}")
print(f"  HSI range: {np.nanmin(monthly_hsi[sample_month]):.4f} to {np.nanmax(monthly_hsi[sample_month]):.4f}")
print(f"  Valid points: {np.sum(~np.isnan(monthly_hsi[sample_month]))} / {monthly_hsi[sample_month].size}")

# List all months
print(f"\nAll months:")
for i, ym in enumerate(sorted(monthly_hsi.keys()), 1):
    print(f"  {i:2d}. {ym}", end="  ")
    if i % 6 == 0:
        print()  # New line every 6 months

=== Monthly Data Verification ===

Expected months: 48
Actual months: 48
✅ All months present!

Sample month: 2021-01
  HSI shape: (28, 29)
  HSI range: 0.7636 to 0.9380
  Valid points: 812 / 812

All months:
   1. 2021-01     2. 2021-02     3. 2021-03     4. 2021-04     5. 2021-05     6. 2021-06  
   7. 2021-07     8. 2021-08     9. 2021-09    10. 2021-10    11. 2021-11    12. 2021-12  
  13. 2022-01    14. 2022-02    15. 2022-03    16. 2022-04    17. 2022-05    18. 2022-06  
  19. 2022-07    20. 2022-08    21. 2022-09    22. 2022-10    23. 2022-11    24. 2022-12  
  25. 2023-01    26. 2023-02    27. 2023-03    28. 2023-04    29. 2023-05    30. 2023-06  
  31. 2023-07    32. 2023-08    33. 2023-09    34. 2023-10    35. 2023-11    36. 2023-12  
  37. 2024-01    38. 2024-02    39. 2024-03    40. 2024-04    41. 2024-05    42. 2024-06  
  43. 2024-07    44. 2024-08    45. 2024-09    46. 2024-10    47. 2024-11    48. 2024-12  
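
Counting months does not reveal whether each month was built from a full set of days. One possible extra check, sketched here under the assumption of a gap-free daily axis starting 2021-01-01, compares the number of daily steps per month with the calendar length of that month. The helper `incomplete_months` is hypothetical and not part of the notebook.

```python
import calendar
import pandas as pd

def incomplete_months(dates):
    """Return year-month labels whose day count differs from the calendar."""
    labels = pd.Series(dates.strftime("%Y-%m"))
    counts = labels.value_counts().sort_index()
    return [
        ym for ym, n in counts.items()
        if n != calendar.monthrange(int(ym[:4]), int(ym[5:7]))[1]
    ]

# Full four-year axis as assumed by the notebook.
full = pd.date_range("2021-01-01", periods=1461, freq="D")
complete_case = incomplete_months(full)

# Same axis with the last 10 days dropped, to show a partial final month.
truncated_case = incomplete_months(full[:-10])

print(complete_case, truncated_case)
```

With the full axis the list is empty; with the truncated axis only the final month is flagged.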


## 6. Save Monthly Aggregated Data

In [15]:
# Save monthly data
OUTPUT_DIR = '../data/processed'
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Convert dict to arrays for easier saving
# Create arrays: [n_months, lat, lon]
sorted_months = sorted(monthly_hsi.keys())
n_months = len(sorted_months)
n_lat, n_lon = lat_grid.shape[0], lon_grid.shape[0]

# Initialize arrays
monthly_hsi_array = np.full((n_months, n_lat, n_lon), np.nan)
monthly_hsi_chl_array = np.full((n_months, n_lat, n_lon), np.nan)
monthly_hsi_sst_array = np.full((n_months, n_lat, n_lon), np.nan)
monthly_hsi_so_array = np.full((n_months, n_lat, n_lon), np.nan)
monthly_chl_array = np.full((n_months, n_lat, n_lon), np.nan)
monthly_sst_array = np.full((n_months, n_lat, n_lon), np.nan)
monthly_salinity_array = np.full((n_months, n_lat, n_lon), np.nan)

# Fill arrays
for i, ym in enumerate(sorted_months):
    monthly_hsi_array[i, :, :] = monthly_hsi[ym]
    monthly_hsi_chl_array[i, :, :] = monthly_hsi_chl[ym]
    monthly_hsi_sst_array[i, :, :] = monthly_hsi_sst[ym]
    monthly_hsi_so_array[i, :, :] = monthly_hsi_so[ym]
    monthly_chl_array[i, :, :] = monthly_chl[ym]
    monthly_sst_array[i, :, :] = monthly_sst[ym]
    monthly_salinity_array[i, :, :] = monthly_salinity[ym]

# Save
np.savez_compressed(
    f"{OUTPUT_DIR}/monthly_hsi_data.npz",
    hsi_total=monthly_hsi_array,
    hsi_chl=monthly_hsi_chl_array,
    hsi_sst=monthly_hsi_sst_array,
    hsi_so=monthly_hsi_so_array,
    chl=monthly_chl_array,
    sst=monthly_sst_array,
    salinity=monthly_salinity_array,
    lat_grid=lat_grid,
    lon_grid=lon_grid,
    months=sorted_months
)

print(f"✅ Monthly data saved to {OUTPUT_DIR}/monthly_hsi_data.npz")
print(f"\nFile contains:")
print(f"  - hsi_total: {monthly_hsi_array.shape} ({n_months} months)")
print(f"  - hsi_chl, hsi_sst, hsi_so: {monthly_hsi_array.shape}")
print(f"  - chl, sst, salinity: {monthly_hsi_array.shape}")
print(f"  - Grid coordinates (lat_grid, lon_grid)")
print(f"  - months: list of year-month strings")
print(f"\nMonths: {sorted_months[0]} to {sorted_months[-1]}")

✅ Monthly data saved to ../data/processed/monthly_hsi_data.npz

File contains:
  - hsi_total: (48, 28, 29) (48 months)
  - hsi_chl, hsi_sst, hsi_so: (48, 28, 29)
  - chl, sst, salinity: (48, 28, 29)
  - Grid coordinates (lat_grid, lon_grid)
  - months: list of year-month strings

Months: 2021-01 to 2024-12
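
Downstream steps need to map a year-month string back to its slice of the saved cube. The sketch below round-trips a small made-up archive with the same key layout (`hsi_total`, `months`) and looks up one month by label; the file name `monthly_demo.npz` and the array values are placeholders.

```python
import os
import tempfile
import numpy as np

# Made-up archive with the same key layout as the saved monthly file.
months = ["2021-01", "2021-02", "2021-03"]
cube = np.arange(3 * 2 * 2, dtype=float).reshape(3, 2, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "monthly_demo.npz")
    np.savez_compressed(path, hsi_total=cube, months=months)

    # The list of strings is stored as a fixed-width unicode array,
    # so it loads without needing allow_pickle.
    with np.load(path) as archive:
        saved_months = archive["months"].tolist()
        idx = saved_months.index("2021-02")
        feb = archive["hsi_total"][idx]

print(idx, feb.shape)
```

Storing the labels next to the arrays means the time axis can be reconstructed from the file itself instead of from an assumed start date.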


## 7. Summary & Next Steps

In [16]:
print("=== MONTHLY AGGREGATION SUMMARY ===")
print("\n✅ Monthly aggregation completed successfully!")
print("\nWhat was done:")
print("1. ✅ Loaded HSI data")
print(f"2. ✅ Created date range ({sorted_months[0]} to {sorted_months[-1]})")
print("3. ✅ Aggregated daily data to monthly (mean)")
print(f"4. ✅ Generated {n_months} monthly datasets")
print("5. ✅ Saved monthly aggregated data")
print("\nNext Steps:")
print("- Phase 5: GeoJSON Export")
print("  - Convert monthly data to GeoJSON format")
print(f"  - Generate {n_months} GeoJSON files (one per month)")
print("  - Save to data/geojson/ folder")
print("\nOutput file: data/processed/monthly_hsi_data.npz")

=== MONTHLY AGGREGATION SUMMARY ===

✅ Monthly aggregation completed successfully!

What was done:
1. ✅ Loaded HSI data
2. ✅ Created date range (2021-01 to 2024-12)
3. ✅ Aggregated daily data to monthly (mean)
4. ✅ Generated 48 monthly datasets
5. ✅ Saved monthly aggregated data

Next Steps:
- Phase 5: GeoJSON Export
  - Convert monthly data to GeoJSON format
  - Generate 48 GeoJSON files (one per month)
  - Save to data/geojson/ folder

Output file: data/processed/monthly_hsi_data.npz
