# Unsupervised Clustering Methods for European Meteorological Configurations/Patterns


<span style="color: yellow;"> - I am using only a subset of the data for speed (done in -> 1.1); the full dataset will need to be used later (30.07.25)  </span>  
<span style="color: yellow;">- check whether to remove the percentiles in 1.1</span>

In [52]:
import xarray as xr
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from scipy import stats
from scipy.stats import shapiro, jarque_bera, anderson
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.gofplots import qqplot


## 1 Data Loading and Initial Analysis

In [41]:
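try:
    # Open the ERA5 GRIB file as an xarray Dataset (uses the cfgrib engine)
    ds = xr.open_dataset('era5_2000_2004.grib', engine='cfgrib')
    print("Dataset loaded successfully.")
except Exception as e:
    print(f"Error loading dataset: {e}")
    raise  # stop here: every later cell needs `ds`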

Dataset loaded successfully.


In [50]:
print("Overview of the dataset:")
print(f"   • Variables: {list(ds.data_vars.keys())}")
print(f"   • Coordinates: {list(ds.coords.keys())}")

Overview of the dataset:
   • Variables: ['z', 't', 'u', 'v']
   • Coordinates: ['number', 'time', 'step', 'isobaricInhPa', 'latitude', 'longitude', 'valid_time']


In [43]:
# Dimension details
print("Dimension details:")
# ds.sizes maps each dimension name to its length
if 'latitude' in ds.dims:
    print(f"   • Latitude: {ds.sizes['latitude']} points ({ds.latitude.min().values:.1f}° - {ds.latitude.max().values:.1f}°)")
if 'longitude' in ds.dims:
    print(f"   • Longitude: {ds.sizes['longitude']} points ({ds.longitude.min().values:.1f}° - {ds.longitude.max().values:.1f}°)")
if 'time' in ds.dims:
    print(f"   • Time: {ds.sizes['time']} steps ({pd.to_datetime(ds.time.values[0]).strftime('%Y-%m-%d')} - {pd.to_datetime(ds.time.values[-1]).strftime('%Y-%m-%d')})")
if 'isobaricInhPa' in ds.dims:
    print(f"   • Pressure levels: {ds.sizes['isobaricInhPa']} levels ({list(ds.isobaricInhPa.values)} hPa)")

# Variables
print("Variables in the dataset:")
for var in ds.data_vars:
    var_data = ds[var]
    print(f"   • {var}: {var_data.dims} - {var_data.attrs.get('long_name', 'N/A')}")
    print(f"     └─ Units: {var_data.attrs.get('units', 'N/A')}")


Dimension details:
   • Latitude: 201 points (20.0° - 70.0°)
   • Longitude: 321 points (-40.0° - 40.0°)
   • Time: 1827 steps (2000-01-01 - 2004-12-31)
   • Pressure levels: 3 levels ([np.float64(850.0), np.float64(500.0), np.float64(250.0)] hPa)
Variables in the dataset:
   • z: ('time', 'isobaricInhPa', 'latitude', 'longitude') - Geopotential
     └─ Units: m**2 s**-2
   • t: ('time', 'isobaricInhPa', 'latitude', 'longitude') - Temperature
     └─ Units: K
   • u: ('time', 'isobaricInhPa', 'latitude', 'longitude') - U component of wind
     └─ Units: m s**-1
   • v: ('time', 'isobaricInhPa', 'latitude', 'longitude') - V component of wind
     └─ Units: m s**-1
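
Since `z` is a geopotential (m**2 s**-2) rather than a height, it can help to convert it to geopotential height in metres when reading the values. The block below is a minimal sketch of that conversion, assuming the conventional standard gravity of 9.80665 m s**-2; the helper name `geopotential_to_height` and the example values are illustrative and not part of the original notebook.

```python
import numpy as np

G0 = 9.80665  # standard gravity in m s**-2 (conventional constant, an assumption here)


def geopotential_to_height(z):
    """Convert geopotential (m**2 s**-2) to geopotential height (m)."""
    return np.asarray(z, dtype=float) / G0


# Illustrative values only, not taken from the dataset
example_z = np.array([0.0, 9806.65, 53936.575])
example_height = geopotential_to_height(example_z)
print(example_height)
```

In the notebook the same helper could be applied to `ds['z'].values`, or the division could be done directly on the `DataArray`.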


In [44]:
# Total dimensionality
total_spatial_points = 1
for dim in ['latitude', 'longitude']:
    if dim in ds.dims:
        total_spatial_points *= ds.sizes[dim]

# features = variables x pressure levels x grid points
total_features = len(ds.data_vars) * ds.sizes.get('isobaricInhPa', 1) * total_spatial_points
print("DIMENSIONALITY:")
print(f"   • Spatial points: {total_spatial_points}")
print(f"   • Total features per timestep: {total_features:,}")
print(f"   • Temporal samples: {ds.sizes.get('time', 1)}")

DIMENSIONALITY:
   • Spatial points: 64521
   • Total features per timestep: 774,252
   • Temporal samples: 1827


Spatial points: the grid points in space. 
The observed region is divided into a regular grid (0.25° x 0.25°), and the variables are available at every grid point (201 latitudes x 321 longitudes = 64,521 points).

Total features per timestep: the total number of values (features) at each time step, i.e. 4 variables x 3 pressure levels x 64,521 grid points = 774,252.

Temporal samples: the time steps in the dataset, one per day over 5 years (5 x 365 days plus the leap days of 2000 and 2004 = 1827).

The dataset can be reshaped into a matrix for clustering:  
shape = (temporal_samples, total_features_per_timestep)
       = (1827, 774252)  
Each row = one weather map
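
The reshape described above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the notebook's actual preprocessing: the helper `to_feature_matrix` is hypothetical, every variable is assumed to be an array with dimensions (time, level, latitude, longitude), and the toy shapes are deliberately tiny.

```python
import numpy as np


def to_feature_matrix(fields):
    """Flatten {name: array(time, level, lat, lon)} into a (time, features) matrix.

    Hypothetical helper: variables are concatenated column-wise in dict order.
    """
    names = list(fields)
    n_time = fields[names[0]].shape[0]
    blocks = []
    for name in names:
        arr = np.asarray(fields[name])
        if arr.shape[0] != n_time:
            raise ValueError(f"{name} has a different number of time steps")
        # One row per time step, all remaining axes flattened into columns
        blocks.append(arr.reshape(n_time, -1))
    return np.concatenate(blocks, axis=1)


# Toy data: 6 time steps, 3 levels, 5 x 4 grid, 4 variables
rng = np.random.default_rng(0)
toy_shape = (6, 3, 5, 4)
toy_fields = {
    name: rng.normal(size=toy_shape).astype(np.float32)
    for name in ("z", "t", "u", "v")
}
X = to_feature_matrix(toy_fields)
print(X.shape)  # (6, 240): 4 variables x 3 levels x 20 grid points
```

In the notebook this could be called as `to_feature_matrix({v: ds[v].values for v in ds.data_vars})`. Assuming the values are stored as float32, the full (1827, 774252) matrix would need roughly 5.7 GB of memory, which is one reason to consider subsetting or dimensionality reduction before clustering.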


### 1.1 Data Quality Check

In [45]:
# Select a subset of the dataset for a specific time range (TO BE REMOVED LATER)
#ds = ds.sel(time=slice("2000-01-01", "2001-12-31"))

In [46]:
def analyze_missing_values(dataset):
    """Analyzes missing values for each variable"""
    missing_info = {}
    
    for var in dataset.data_vars:
        data = dataset[var]
        total_values = data.size
        missing_count = np.isnan(data.values).sum()
        missing_percent = (missing_count / total_values) * 100
        
        missing_info[var] = {
            'count': missing_count,
            'percentage': missing_percent,
            'total': total_values
        }
    
    return missing_info

print("MISSING VALUES:")
missing_analysis = analyze_missing_values(ds)

for var, info in missing_analysis.items():
    print(f"    {var}: {info['count']:,} missing ({info['percentage']:.2f}%)")

MISSING VALUES:
    z: 0 missing (0.00%)
    t: 0 missing (0.00%)
    u: 0 missing (0.00%)
    v: 0 missing (0.00%)


There are no missing values in the dataset.
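
Note that `np.isnan` flags only NaN, so infinite values would pass the check above unnoticed. As a complementary sketch (the helper `count_non_finite` and the sample array are illustrative, not part of the original notebook), NaN and infinite entries can be counted separately:

```python
import numpy as np


def count_non_finite(values):
    """Count NaN and +/-inf entries separately in an array-like."""
    values = np.asarray(values, dtype=float)
    return {
        "nan": int(np.isnan(values).sum()),
        "inf": int(np.isinf(values).sum()),
    }


# Illustrative sample, not taken from the dataset
sample = np.array([1.0, np.nan, np.inf, -np.inf, 2.5])
non_finite_counts = count_non_finite(sample)
print(non_finite_counts)
```

In the notebook it could be applied per variable, for example `count_non_finite(ds[var].values)`.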

In [47]:
print("STATISTICS:")
for var in ds.data_vars:
    data = ds[var].values
    valid_data = data[~np.isnan(data)]
    
    if len(valid_data) > 0:
        print(f" {var.upper()}:")
        print(f"      • Min: {valid_data.min():.3f}")
        print(f"      • Max: {valid_data.max():.3f}")
        print(f"      • Mean: {valid_data.mean():.3f}")
        print(f"      • Std: {valid_data.std():.3f}")
        print(f"      • Percentiles [25%, 50%, 75%]: {np.percentile(valid_data, [25, 50, 75])}")

STATISTICS:


 Z:
      • Min: 7495.887
      • Max: 108808.312
      • Mean: 57405.691
      • Std: 36090.105
      • Percentiles [25%, 50%, 75%]: [ 15097.1484375  55659.46875   100139.4375   ]
 T:
      • Min: 198.893
      • Max: 308.547
      • Mean: 252.359
      • Std: 24.976
      • Percentiles [25%, 50%, 75%]: [226.2230072  255.91355896 272.93847656]
 U:
      • Min: -63.877
      • Max: 112.808
      • Mean: 8.984
      • Std: 13.904
      • Percentiles [25%, 50%, 75%]: [-0.31930542  6.58242798 16.01531982]
 V:
      • Min: -91.157
      • Max: 89.342
      • Mean: -0.238
      • Std: 11.834
      • Percentiles [25%, 50%, 75%]: [-5.98008728 -0.32400513  5.86112976]
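
The statistics above pool all three pressure levels together, so the large standard deviation of `z` (and its widely spaced percentiles) probably reflects mostly the difference between the 850, 500 and 250 hPa levels rather than day-to-day variability. A per-level summary may be more informative. The sketch below assumes a field with dimensions (time, level, latitude, longitude); the helper `per_level_stats` and the toy values are hypothetical.

```python
import numpy as np


def per_level_stats(field, levels):
    """Summarize a (time, level, lat, lon) field separately for each level."""
    field = np.asarray(field, dtype=float)
    summary = {}
    for i, level in enumerate(levels):
        slab = field[:, i]  # all times and grid points at this level
        summary[float(level)] = {
            "min": float(slab.min()),
            "max": float(slab.max()),
            "mean": float(slab.mean()),
            "std": float(slab.std()),
        }
    return summary


# Toy field: constant within each level, very different between levels
toy_levels = [850.0, 500.0, 250.0]
toy_field = np.stack(
    [np.full((4, 2, 2), value) for value in (10.0, 50.0, 100.0)], axis=1
)
level_stats = per_level_stats(toy_field, toy_levels)
pooled_std = float(toy_field.std())

for level, values in level_stats.items():
    print(f"{level:.0f} hPa: mean={values['mean']:.1f}, std={values['std']:.1f}")
print(f"pooled std: {pooled_std:.1f}")
```

In this toy case each level has zero spread while the pooled standard deviation is large, which is the effect to watch for when reading the pooled numbers. In the notebook the helper could be called as `per_level_stats(ds[var].values, ds.isobaricInhPa.values)`.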


____________

## 2 Preprocessing