# MALLORN TDE Classification - Feature Engineering

**Goal:** Extract a comprehensive set of features from the light curves for TDE classification.

**Key insights from the paper and data exploration:**
- **TDEs are blue** (strong in the u band) → color features are needed
- **TDEs last longer** (~400+ days vs ~100-150 days for SNe) → temporal/duration features are needed
- **TDEs evolve smoothly**, with less variability than AGN → variability features are needed
- **Severe class imbalance** (~5% TDEs; 4.86% in the training set) → must be handled during training

**Feature groups extracted:**
1. **Per-band statistics** (6 bands × 39 features): flux statistics, SNR, detection counts, rise/decay
2. **Color features** (24 features): u-g, u-r, blue_fraction, peak_band flags
3. **Temporal features** (8 features): duration, peak timing, cross-band synchronization
4. **Variability features** (26 features): smoothness, residuals, coefficient of variation
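
The class imbalance noted above is handled in the training notebook, not here. As a rough illustration of its size, the sketch below turns a positive-class ratio into "balanced" class weights so that both classes carry the same total weight. The helper name `balanced_class_weights` is hypothetical, and the ratio 0.0486 is the value this notebook prints for the training set.

```python
def balanced_class_weights(pos_ratio):
    """Return (w_neg, w_pos) so that each class contributes half of the total weight."""
    if not 0.0 < pos_ratio < 1.0:
        raise ValueError("pos_ratio must lie strictly between 0 and 1")
    w_pos = 0.5 / pos_ratio
    w_neg = 0.5 / (1.0 - pos_ratio)
    return w_neg, w_pos


# TDE ratio printed for the training set in this notebook.
w_neg, w_pos = balanced_class_weights(0.0486)
relative_weight = w_pos / w_neg  # how much more one TDE counts than one non-TDE
print(f"w_neg={w_neg:.3f}, w_pos={w_pos:.3f}, ratio={relative_weight:.1f}")
```

The ratio `w_pos / w_neg` equals `(1 - p) / p`, which is the kind of value often passed to a positive-class weight parameter in gradient-boosting libraries. Whether that is the right choice for this task is a modelling decision outside this notebook.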

In [1]:
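import warnings
from pathlib import Path

import numpy as np
import pandas as pd
from scipy import stats
from tqdm import tqdm

# Silences every warning, including NumPy/SciPy numerical warnings; narrow this filter when debugging.
warnings.filterwarnings('ignore')

DATA_DIR = Path('../mallorn-astronomical-classification-challenge')
np.random.seed(42)

BANDS = ['u', 'g', 'r', 'i', 'z', 'y']  # photometric bands, matching the values of the 'Filter' column
N_SPLITS = 20  # light curves are stored in split_01 ... split_20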

## 1. Load Metadata and Configuration

Read the metadata from train_log and test_log. The constants (DATA_DIR, BANDS, N_SPLITS) are set in the cell above.

In [2]:
train_log = pd.read_csv(DATA_DIR / 'train_log.csv')
test_log = pd.read_csv(DATA_DIR / 'test_log.csv')

print(f"Train objects: {len(train_log)}")
print(f"Test objects: {len(test_log)}")
print(f"TDE ratio in train: {train_log['target'].mean():.4f}")
print(f"\nExpected submission rows: {len(test_log)}")

Train objects: 3043
Test objects: 7135
TDE ratio in train: 0.0486

Expected submission rows: 7135


## 2. Define the Feature Extraction Functions

These functions process a light curve and extract the features one group at a time.

In [3]:
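def compute_magnitude(flux, flux_err):
    """Convert flux to magnitude (no zero point applied) with first-order error propagation.

    Fluxes at or below 1e-10 are clipped, so they all map to the same floor magnitude of 25.
    """
    with np.errstate(divide='ignore', invalid='ignore'):
        mag = -2.5 * np.log10(np.maximum(flux, 1e-10))
        # sigma_m = (2.5 / ln 10) * sigma_f / |f|
        mag_err = 2.5 / np.log(10) * flux_err / np.maximum(np.abs(flux), 1e-10)
    return mag, mag_err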


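def extract_band_statistics(flux, flux_err, time):
    """Extract comprehensive statistics for a single band.

    Expects NumPy arrays of equal length, with `time` sorted in ascending order.
    """
    n = len(flux)
    if n == 0:
        return {}
    
    # A point counts as a detection when its signal-to-noise ratio exceeds 3.
    snr = flux / (flux_err + 1e-10)
    detections = snr > 3
    n_det = np.sum(detections)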
    
    feats = {
        'n_obs': n,
        'n_det': n_det,
        'det_frac': n_det / n if n > 0 else 0,
        
        'flux_mean': np.mean(flux),
        'flux_std': np.std(flux),
        'flux_median': np.median(flux),
        'flux_max': np.max(flux),
        'flux_min': np.min(flux),
        'flux_range': np.max(flux) - np.min(flux),
        'flux_iqr': np.percentile(flux, 75) - np.percentile(flux, 25),
        'flux_skew': stats.skew(flux) if n > 2 else 0,
        'flux_kurtosis': stats.kurtosis(flux) if n > 3 else 0,
        
        'flux_p10': np.percentile(flux, 10),
        'flux_p25': np.percentile(flux, 25),
        'flux_p75': np.percentile(flux, 75),
        'flux_p90': np.percentile(flux, 90),
        
        'snr_mean': np.mean(snr),
        'snr_max': np.max(snr),
        'snr_median': np.median(snr),
        'snr_std': np.std(snr),
        
        'err_mean': np.mean(flux_err),
        'err_std': np.std(flux_err),
    }
    
    if n > 1:
        feats['time_span'] = time[-1] - time[0]
        feats['cadence_mean'] = np.mean(np.diff(time))
        feats['cadence_std'] = np.std(np.diff(time))
    else:
        feats['time_span'] = 0
        feats['cadence_mean'] = 0
        feats['cadence_std'] = 0
    
    if n_det > 0:
        det_flux = flux[detections]
        det_time = time[detections]
        
        feats['det_flux_mean'] = np.mean(det_flux)
        feats['det_flux_max'] = np.max(det_flux)
        feats['det_duration'] = det_time[-1] - det_time[0] if len(det_time) > 1 else 0
        
        peak_idx = np.argmax(det_flux)
        feats['peak_flux'] = det_flux[peak_idx]
        feats['peak_time_rel'] = (det_time[peak_idx] - det_time[0]) / (feats['det_duration'] + 1) if feats['det_duration'] > 0 else 0.5
        
        if peak_idx > 0:
            rise_dt = det_time[peak_idx] - det_time[0]
            rise_df = det_flux[peak_idx] - det_flux[0]
            feats['rise_time'] = rise_dt
            feats['rise_rate'] = rise_df / (rise_dt + 1e-10)
        else:
            feats['rise_time'] = 0
            feats['rise_rate'] = 0
        
        if peak_idx < len(det_flux) - 1:
            decay_dt = det_time[-1] - det_time[peak_idx]
            decay_df = det_flux[peak_idx] - det_flux[-1]
            feats['decay_time'] = decay_dt
            feats['decay_rate'] = decay_df / (decay_dt + 1e-10)
        else:
            feats['decay_time'] = 0
            feats['decay_rate'] = 0
        
        if len(det_flux) > 1:
            feats['variability'] = np.std(det_flux) / (np.mean(det_flux) + 1e-10)
            feats['rms'] = np.sqrt(np.mean(det_flux**2))
        else:
            feats['variability'] = 0
            feats['rms'] = det_flux[0] if len(det_flux) > 0 else 0
    else:
        for key in ['det_flux_mean', 'det_flux_max', 'det_duration', 'peak_flux', 
                    'peak_time_rel', 'rise_time', 'rise_rate', 'decay_time', 
                    'decay_rate', 'variability', 'rms']:
            feats[key] = 0
    
    above_mean = flux > np.mean(flux)
    feats['frac_above_mean'] = np.sum(above_mean) / n
    
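    # Linear trend over the whole band (needs at least 3 points).
    if n >= 3:
        try:
            slope, intercept, r_value, p_value, std_err = stats.linregress(time, flux)
            feats['trend_slope'] = slope
            feats['trend_r2'] = r_value**2
        except Exception:
            # e.g. all time stamps identical; a bare `except:` would also swallow KeyboardInterrupt
            feats['trend_slope'] = 0
            feats['trend_r2'] = 0
    else:
        feats['trend_slope'] = 0
        feats['trend_r2'] = 0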
    
    return feats
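
To make the detection mask and the rise/decay rates concrete, here is a minimal standalone sketch of the same idea on a hand-made light curve. The helper `rise_decay_rates` and the sample values are hypothetical; only the SNR > 3 threshold and the rate definitions mirror `extract_band_statistics`.

```python
import numpy as np


def rise_decay_rates(time, flux, flux_err, snr_min=3.0):
    """Rise and decay rates (flux per day) measured on detections only."""
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)
    flux_err = np.asarray(flux_err, dtype=float)

    det = flux / (flux_err + 1e-10) > snr_min
    if not det.any():
        return 0.0, 0.0

    t, f = time[det], flux[det]
    k = int(np.argmax(f))  # index of the peak among detections

    # First detection -> peak, and peak -> last detection.
    rise = (f[k] - f[0]) / (t[k] - t[0] + 1e-10) if k > 0 else 0.0
    decay = (f[k] - f[-1]) / (t[-1] - t[k] + 1e-10) if k < len(f) - 1 else 0.0
    return float(rise), float(decay)


# Hypothetical light curve: the first point (SNR 0.5) is not a detection.
time = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
flux = [0.5, 4.0, 10.0, 8.0, 6.0, 5.0]
flux_err = [1.0] * 6

rise, decay = rise_decay_rates(time, flux, flux_err)
print(f"rise={rise:.3f}, decay={decay:.3f}")
```

Here the rise is measured from the first detection (flux 4 at day 10) to the peak (flux 10 at day 20), and the decay from the peak to the last detection (flux 5 at day 50). Both rates depend only on the end points, so they are sensitive to noise in the first and last detections.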

In [4]:
def extract_color_features(band_data):
    """Extract color features - CRITICAL for TDE identification.
    
    Paper insight: TDEs are blue (strong u-band emission)
    """
    colors = {}
    
    band_fluxes = {}
    band_peak_fluxes = {}
    
    for band in BANDS:
        if band in band_data and len(band_data[band]['flux']) > 0:
            flux = band_data[band]['flux']
            snr = flux / (band_data[band]['flux_err'] + 1e-10)
            det_mask = snr > 3
            
            if np.sum(det_mask) > 0:
                det_flux = flux[det_mask]
                band_fluxes[band] = np.mean(det_flux)
                band_peak_fluxes[band] = np.max(det_flux)
            else:
                band_fluxes[band] = np.mean(flux) if len(flux) > 0 else 0
                band_peak_fluxes[band] = np.max(flux) if len(flux) > 0 else 0
        else:
            band_fluxes[band] = 0
            band_peak_fluxes[band] = 0
    
    color_pairs = [('u', 'g'), ('u', 'r'), ('u', 'i'), ('g', 'r'), ('g', 'i'), ('r', 'i'), ('i', 'z'), ('z', 'y')]
    for b1, b2 in color_pairs:
        if band_fluxes[b1] > 0 and band_fluxes[b2] > 0:
            colors[f'color_{b1}_{b2}'] = -2.5 * np.log10(band_fluxes[b1] / band_fluxes[b2])
            colors[f'color_peak_{b1}_{b2}'] = -2.5 * np.log10(
                (band_peak_fluxes[b1] + 1e-10) / (band_peak_fluxes[b2] + 1e-10)
            )
        else:
            colors[f'color_{b1}_{b2}'] = 0
            colors[f'color_peak_{b1}_{b2}'] = 0
    
    blue_bands = ['u', 'g']
    red_bands = ['r', 'i', 'z', 'y']
    
    blue_flux = sum([band_fluxes[b] for b in blue_bands])
    red_flux = sum([band_fluxes[b] for b in red_bands])
    total_flux = blue_flux + red_flux
    
    colors['blue_fraction'] = blue_flux / (total_flux + 1e-10)
    colors['u_fraction'] = band_fluxes['u'] / (total_flux + 1e-10)
    colors['g_fraction'] = band_fluxes['g'] / (total_flux + 1e-10)
    colors['blue_red_ratio'] = blue_flux / (red_flux + 1e-10)
    
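    # A band without data is stored as 0 and would otherwise "win" the peak comparison
    # whenever every observed band has a non-positive peak, so require a positive maximum.
    max_mean_flux = max(band_fluxes.values())
    max_peak_flux = max(band_peak_fluxes.values())
    has_peak = max_peak_flux > 0
    
    colors['u_dominance'] = band_fluxes['u'] / (max_mean_flux + 1e-10)
    colors['peak_band_is_u'] = 1 if has_peak and band_peak_fluxes['u'] == max_peak_flux else 0
    colors['peak_band_is_g'] = 1 if has_peak and band_peak_fluxes['g'] == max_peak_flux else 0
    colors['peak_band_is_blue'] = 1 if has_peak and max(band_peak_fluxes['u'], band_peak_fluxes['g']) >= max(
        band_peak_fluxes['r'], band_peak_fluxes['i'], band_peak_fluxes['z'], band_peak_fluxes['y']
    ) else 0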
    
    return colors
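
The color features above follow from the flux ratio of two bands: a color index is `-2.5 * log10(f_a / f_b)`, so it is negative when band a is the brighter one. The sketch below is a standalone illustration with a hypothetical helper `flux_color`. Unlike the notebook, which stores 0 for an undefined color, it returns `None`, so an undefined color cannot be confused with two equal fluxes.

```python
import math


def flux_color(flux_a, flux_b):
    """Color index m_a - m_b from two fluxes, or None if either flux is not positive."""
    if flux_a <= 0 or flux_b <= 0:
        return None
    return -2.5 * math.log10(flux_a / flux_b)


# Hypothetical mean fluxes: the u band is twice as bright as the g band.
u_minus_g = flux_color(20.0, 10.0)
equal_flux = flux_color(10.0, 10.0)
undefined = flux_color(0.0, 10.0)

print(u_minus_g, equal_flux, undefined)
```

A factor of two in flux corresponds to roughly -0.75 mag, so a strongly u-dominated object has clearly negative `color_u_g` and `color_u_r` values.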

In [5]:
def extract_temporal_features(band_data):
    """Extract cross-band temporal features.
    
    Paper insight: TDEs have longer duration (~400 days) vs SNe (~100-150 days)
    """
    feats = {}
    
    all_times = []
    all_fluxes = []
    all_det_times = []
    peak_times = {}
    
    for band in BANDS:
        if band in band_data and len(band_data[band]['flux']) > 0:
            flux = band_data[band]['flux']
            time = band_data[band]['time']
            flux_err = band_data[band]['flux_err']
            
            all_times.extend(time)
            all_fluxes.extend(flux)
            
            snr = flux / (flux_err + 1e-10)
            det_mask = snr > 3
            if np.sum(det_mask) > 0:
                det_time = time[det_mask]
                det_flux = flux[det_mask]
                all_det_times.extend(det_time)
                peak_times[band] = det_time[np.argmax(det_flux)]
    
    if len(all_times) > 0:
        feats['total_time_span'] = max(all_times) - min(all_times)
        feats['total_observations'] = len(all_times)
    else:
        feats['total_time_span'] = 0
        feats['total_observations'] = 0
    
    if len(all_det_times) > 0:
        feats['detection_time_span'] = max(all_det_times) - min(all_det_times)
        feats['total_detections'] = len(all_det_times)
    else:
        feats['detection_time_span'] = 0
        feats['total_detections'] = 0
    
    if len(peak_times) >= 2:
        peak_time_values = list(peak_times.values())
        feats['peak_time_spread'] = max(peak_time_values) - min(peak_time_values)
        
        if 'u' in peak_times and 'r' in peak_times:
            feats['peak_delay_u_r'] = peak_times['u'] - peak_times['r']
        else:
            feats['peak_delay_u_r'] = 0
        
        if 'g' in peak_times and 'r' in peak_times:
            feats['peak_delay_g_r'] = peak_times['g'] - peak_times['r']
        else:
            feats['peak_delay_g_r'] = 0
    else:
        feats['peak_time_spread'] = 0
        feats['peak_delay_u_r'] = 0
        feats['peak_delay_g_r'] = 0
    
    n_bands_detected = sum([1 for b in BANDS if b in band_data and 
                           len(band_data[b]['flux']) > 0 and 
                           np.sum(band_data[b]['flux'] / (band_data[b]['flux_err'] + 1e-10) > 3) > 0])
    feats['n_bands_detected'] = n_bands_detected
    
    return feats
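
The cross-band peak delay can be illustrated on two tiny hand-made bands. This is a sketch under the same SNR > 3 detection rule as above. The helper names `detected_peak_time` and `peak_delay` are hypothetical, and a missing peak is reported as `None` instead of the 0 used in `extract_temporal_features`.

```python
import numpy as np


def detected_peak_time(time, flux, flux_err, snr_min=3.0):
    """Time of maximum flux among detections, or None if nothing is detected."""
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)
    flux_err = np.asarray(flux_err, dtype=float)
    det = flux / (flux_err + 1e-10) > snr_min
    if not det.any():
        return None
    return float(time[det][np.argmax(flux[det])])


def peak_delay(band_a, band_b):
    """Peak time of band_a minus peak time of band_b, or None if either is undefined."""
    t_a = detected_peak_time(**band_a)
    t_b = detected_peak_time(**band_b)
    if t_a is None or t_b is None:
        return None
    return t_a - t_b


# Hypothetical bands: u peaks at day 5, r peaks at day 10.
u_band = {'time': [0, 5, 10, 15], 'flux': [2.0, 9.0, 6.0, 4.0], 'flux_err': [1.0] * 4}
r_band = {'time': [0, 5, 10, 15], 'flux': [1.0, 5.0, 8.0, 7.0], 'flux_err': [1.0] * 4}

delay_u_r = peak_delay(u_band, r_band)
print(delay_u_r)
```

A negative `peak_delay_u_r` means the u band peaked before the r band. Because the notebook stores 0 when a band has no detections, a 0 there can mean either simultaneous peaks or a missing band.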

In [6]:
def extract_variability_features(band_data):
    """Extract variability features to distinguish TDEs from AGN.
    
    Paper insight: AGN have stochastic variability, TDEs are smoother
    """
    feats = {}
    
    all_variabilities = []
    
    for band in BANDS:
        if band in band_data and len(band_data[band]['flux']) > 2:
            # Sort by time once so the differences and the fit both use chronological order.
            order = np.argsort(band_data[band]['time'])
            flux = band_data[band]['flux'][order]
            
            diffs = np.diff(flux)
            feats[f'{band}_flux_diff_std'] = np.std(diffs)
            feats[f'{band}_flux_diff_mean'] = np.mean(np.abs(diffs))
            
            if len(flux) > 3:
                try:
                    # Quadratic fit against the observation index, not the MJD,
                    # so gaps in cadence are ignored by this smoothness measure.
                    x = np.arange(len(flux))
                    coeffs = np.polyfit(x, flux, 2)
                    residuals = flux - np.polyval(coeffs, x)
                    feats[f'{band}_residual_std'] = np.std(residuals)
                except Exception:
                    feats[f'{band}_residual_std'] = 0
            else:
                feats[f'{band}_residual_std'] = 0
            
            cv = np.std(flux) / (np.mean(np.abs(flux)) + 1e-10)
            all_variabilities.append(cv)
            feats[f'{band}_cv'] = cv
        else:
            feats[f'{band}_flux_diff_std'] = 0
            feats[f'{band}_flux_diff_mean'] = 0
            feats[f'{band}_residual_std'] = 0
            feats[f'{band}_cv'] = 0
    
    if len(all_variabilities) > 0:
        feats['mean_variability'] = np.mean(all_variabilities)
        feats['max_variability'] = np.max(all_variabilities)
    else:
        feats['mean_variability'] = 0
        feats['max_variability'] = 0
    
    return feats
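
The residual feature above fits the quadratic against the observation index. As a possible variant, not what the notebook does, the sketch below fits against the actual time stamps and compares the two on an unevenly sampled, perfectly quadratic light curve. The helper `residual_std` and the sample values are hypothetical.

```python
import numpy as np


def residual_std(x, flux, degree=2):
    """Standard deviation of the residuals of a polynomial fit of flux against x."""
    x = np.asarray(x, dtype=float)
    flux = np.asarray(flux, dtype=float)
    x = x - x.mean()  # centering keeps the fit well conditioned for large MJD values
    coeffs = np.polyfit(x, flux, degree)
    return float(np.std(flux - np.polyval(coeffs, x)))


# Hypothetical, unevenly sampled light curve that is exactly quadratic in time.
time = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 30.0])
flux = 0.01 * time**2 - 0.2 * time + 5.0

res_time = residual_std(time, flux)                    # fit against time
res_index = residual_std(np.arange(len(time)), flux)   # fit against index, as in the notebook

print(f"time-based: {res_time:.2e}, index-based: {res_index:.2e}")
```

With uneven cadence the index-based fit reports residuals even for a perfectly smooth curve, while the time-based fit does not. Which behaviour is preferable as a classifier feature is an empirical question.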

In [7]:
def extract_features_for_object(obj_id, lc_df):
    """Extract all features for a single object."""
    features = {'object_id': obj_id}
    
    obj_data = lc_df[lc_df['object_id'] == obj_id]
    if len(obj_data) == 0:
        return None
    
    band_data = {}
    for band in BANDS:
        band_df = obj_data[obj_data['Filter'] == band].sort_values('Time (MJD)')
        if len(band_df) > 0:
            band_data[band] = {
                'flux': band_df['Flux'].values,
                'flux_err': band_df['Flux_err'].values,
                'time': band_df['Time (MJD)'].values
            }
    
    for band in BANDS:
        if band in band_data:
            band_feats = extract_band_statistics(
                band_data[band]['flux'],
                band_data[band]['flux_err'],
                band_data[band]['time']
            )
            for key, value in band_feats.items():
                features[f'{band}_{key}'] = value
        else:
            # Band absent: zero-fill. This list must match the 39 keys returned by extract_band_statistics.
            for key in ['n_obs', 'n_det', 'det_frac', 'flux_mean', 'flux_std', 'flux_median',
                       'flux_max', 'flux_min', 'flux_range', 'flux_iqr', 'flux_skew', 'flux_kurtosis',
                       'flux_p10', 'flux_p25', 'flux_p75', 'flux_p90', 'snr_mean', 'snr_max',
                       'snr_median', 'snr_std', 'err_mean', 'err_std', 'time_span', 'cadence_mean',
                       'cadence_std', 'det_flux_mean', 'det_flux_max', 'det_duration', 'peak_flux',
                       'peak_time_rel', 'rise_time', 'rise_rate', 'decay_time', 'decay_rate',
                       'variability', 'rms', 'frac_above_mean', 'trend_slope', 'trend_r2']:
                features[f'{band}_{key}'] = 0
    
    color_feats = extract_color_features(band_data)
    features.update(color_feats)
    
    temporal_feats = extract_temporal_features(band_data)
    features.update(temporal_feats)
    
    var_feats = extract_variability_features(band_data)
    features.update(var_feats)
    
    return features

## 3. Process ALL Training Data (20 splits)

Loop over all 20 splits and extract features for every object in the training set.
**Note:** This can take a few minutes because 3043 objects have to be processed.
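
Part of the run time comes from `extract_features_for_object`, which filters the whole split DataFrame once per object. One possible speed-up, sketched here with a hypothetical helper `iter_object_lightcurves` and toy data, is to group the rows by `object_id` in a single pass. The column names follow the notebook.

```python
import pandas as pd


def iter_object_lightcurves(lc_df, wanted_ids):
    """Yield (object_id, rows) once per wanted object using a single groupby pass."""
    for obj_id, obj_rows in lc_df.groupby('object_id', sort=False):
        if obj_id in wanted_ids:
            yield obj_id, obj_rows


# Toy light-curve table with the notebook's column names (values are made up).
lc_df = pd.DataFrame({
    'object_id': ['a', 'a', 'b', 'c', 'b'],
    'Filter': ['u', 'g', 'u', 'r', 'g'],
    'Time (MJD)': [1.0, 2.0, 1.0, 1.5, 3.0],
    'Flux': [5.0, 6.0, 2.0, 1.0, 2.5],
    'Flux_err': [1.0, 1.0, 0.5, 0.5, 0.5],
})

rows_per_object = {
    obj_id: len(obj_rows)
    for obj_id, obj_rows in iter_object_lightcurves(lc_df, {'a', 'b'})
}
print(rows_per_object)
```

Each `obj_rows` frame could be passed as the `lc_df` argument of `extract_features_for_object`. Its internal `object_id` filter then selects every row, so the extracted features should be unchanged.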

In [8]:
print("="*60)
print("Processing TRAINING data from ALL 20 splits...")
print("="*60)

train_features_list = []
train_object_ids = set(train_log['object_id'].values)

for split_num in range(1, N_SPLITS + 1):
    split_name = f'split_{split_num:02d}'
    lc_file = DATA_DIR / split_name / 'train_full_lightcurves.csv'
    
    if not lc_file.exists():
        print(f"  {split_name}: train file not found, skipping")
        continue
    
    print(f"  Processing {split_name}...")
    lc_df = pd.read_csv(lc_file)
    
    object_ids = lc_df['object_id'].unique()
    object_ids = [oid for oid in object_ids if oid in train_object_ids]
    
    for obj_id in tqdm(object_ids, desc=f"  {split_name}", leave=False):
        feats = extract_features_for_object(obj_id, lc_df)
        if feats is not None:
            train_features_list.append(feats)
    
    print(f"    Processed {len(object_ids)} objects, total so far: {len(train_features_list)}")

train_features = pd.DataFrame(train_features_list)
print(f"\nTotal training features extracted: {len(train_features)}")
print(f"Expected: {len(train_log)}")

Processing TRAINING data from ALL 20 splits...
  Processing split_01...
    Processed 155 objects, total so far: 155
  Processing split_02...
    Processed 170 objects, total so far: 325
  Processing split_03...
    Processed 138 objects, total so far: 463
  Processing split_04...
    Processed 145 objects, total so far: 608
  Processing split_05...
    Processed 165 objects, total so far: 773
  Processing split_06...
    Processed 155 objects, total so far: 928
  Processing split_07...
    Processed 165 objects, total so far: 1093
  Processing split_08...
    Processed 162 objects, total so far: 1255
  Processing split_09...
    Processed 128 objects, total so far: 1383
  Processing split_10...
    Processed 144 objects, total so far: 1527
  Processing split_11...
    Processed 146 objects, total so far: 1673
  Processing split_12...
    Processed 155 objects, total so far: 1828
  Processing split_13...
    Processed 143 objects, total so far: 1971
  Processing split_14...
    Processed 154 objects, total so far: 2125
  Processing split_15...
    Processed 158 objects, total so far: 2283
  Processing split_16...
    Processed 155 objects, total so far: 2438
  Processing split_17...
    Processed 153 objects, total so far: 2591
  Processing split_18...
    Processed 152 objects, total so far: 2743
  Processing split_19...
    Processed 147 objects, total so far: 2890
  Processing split_20...
    Processed 153 objects, total so far: 3043

Total training features extracted: 3043
Expected: 3043


## 4. Process ALL Test Data (20 splits)

Likewise, process all test objects from the 20 splits.
**Note:** With 7135 test objects, this step takes longer than the training pass.

In [9]:
print("\n" + "="*60)
print("Processing TEST data from ALL 20 splits...")
print("="*60)

test_features_list = []
test_object_ids = set(test_log['object_id'].values)

for split_num in range(1, N_SPLITS + 1):
    split_name = f'split_{split_num:02d}'
    lc_file = DATA_DIR / split_name / 'test_full_lightcurves.csv'
    
    if not lc_file.exists():
        print(f"  {split_name}: test file not found, skipping")
        continue
    
    print(f"  Processing {split_name}...")
    lc_df = pd.read_csv(lc_file)
    
    object_ids = lc_df['object_id'].unique()
    object_ids = [oid for oid in object_ids if oid in test_object_ids]
    
    for obj_id in tqdm(object_ids, desc=f"  {split_name}", leave=False):
        feats = extract_features_for_object(obj_id, lc_df)
        if feats is not None:
            test_features_list.append(feats)
    
    print(f"    Processed {len(object_ids)} objects, total so far: {len(test_features_list)}")

test_features = pd.DataFrame(test_features_list)
print(f"\nTotal test features extracted: {len(test_features)}")
print(f"Expected: {len(test_log)}")


Processing TEST data from ALL 20 splits...
  Processing split_01...
    Processed 364 objects, total so far: 364
  Processing split_02...
    Processed 414 objects, total so far: 778
  Processing split_03...
    Processed 338 objects, total so far: 1116
  Processing split_04...
    Processed 332 objects, total so far: 1448
  Processing split_05...
    Processed 375 objects, total so far: 1823
  Processing split_06...
    Processed 374 objects, total so far: 2197
  Processing split_07...
    Processed 398 objects, total so far: 2595
  Processing split_08...
    Processed 387 objects, total so far: 2982
  Processing split_09...
    Processed 289 objects, total so far: 3271
  Processing split_10...
    Processed 331 objects, total so far: 3602
  Processing split_11...
    Processed 325 objects, total so far: 3927
  Processing split_12...
    Processed 353 objects, total so far: 4280
  Processing split_13...
    Processed 379 objects, total so far: 4659
  Processing split_14...
    Processed 351 objects, total so far: 5010
  Processing split_15...
    Processed 342 objects, total so far: 5352
  Processing split_16...
    Processed 354 objects, total so far: 5706
  Processing split_17...
    Processed 351 objects, total so far: 6057
  Processing split_18...
    Processed 345 objects, total so far: 6402
  Processing split_19...
    Processed 375 objects, total so far: 6777
  Processing split_20...
    Processed 358 objects, total so far: 7135

Total test features extracted: 7135
Expected: 7135


## 5. Merge with Metadata (Redshift, EBV)

Merge the features with Z (redshift) and EBV (extinction) from train_log and test_log.
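
A left merge silently produces NaN metadata for any object that is missing from the log, and duplicates rows if the log repeats an `object_id`. The sketch below, on toy frames with made-up values, uses the `validate` and `indicator` options of `DataFrame.merge` to surface both problems. It illustrates an optional safeguard and is not part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins for the feature table and the metadata log (values are made up).
features = pd.DataFrame({'object_id': ['a', 'b', 'c'], 'u_flux_mean': [1.0, 2.0, 3.0]})
log = pd.DataFrame({'object_id': ['a', 'b'], 'Z': [0.1, 0.2], 'EBV': [0.01, 0.02]})

merged = features.merge(
    log,
    on='object_id',
    how='left',
    validate='one_to_one',  # raises MergeError if object_id repeats on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

# Objects that received no metadata from the log.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'object_id'].tolist()
print(unmatched)
```

Dropping the `_merge` column afterwards keeps the feature table unchanged in shape.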

In [10]:
train_features = train_features.merge(
    train_log[['object_id', 'Z', 'EBV', 'target']], 
    on='object_id', 
    how='left'
)

test_features = test_features.merge(
    test_log[['object_id', 'Z', 'Z_err', 'EBV']], 
    on='object_id', 
    how='left'
)

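# Z_err is merged only on the test side, so Z_err and z_relative_err exist in test_features
# but not in train_features; align the columns before training.
if 'Z_err' in test_features.columns:
    test_features['z_relative_err'] = test_features['Z_err'] / (test_features['Z'] + 1e-10)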

print(f"\nTrain shape after merge: {train_features.shape}")
print(f"Test shape after merge: {test_features.shape}")


Train shape after merge: (3043, 296)
Test shape after merge: (7135, 297)


## 6. Handle Missing Values and Infinities

Fill NaN and inf values with 0 so that the data is clean for model training.
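
Before zero-filling, it can help to see which columns contain the non-finite values, because a filled 0 is later indistinguishable from a genuine 0. The following is a minimal sketch with a hypothetical helper `clean_features` on a toy frame. It treats inf as missing, counts the affected cells per column, and then fills them.

```python
import numpy as np
import pandas as pd


def clean_features(df, fill_value=0.0):
    """Return (cleaned copy, per-column counts of NaN/inf cells that were filled)."""
    numeric = df.select_dtypes(include=[np.number]).columns
    cleaned = df.copy()
    # Treat +/-inf as missing so that both cases are counted and filled together.
    cleaned[numeric] = cleaned[numeric].replace([np.inf, -np.inf], np.nan)
    bad_counts = cleaned[numeric].isna().sum()
    cleaned[numeric] = cleaned[numeric].fillna(fill_value)
    return cleaned, bad_counts[bad_counts > 0]


# Toy feature table (values are made up).
toy = pd.DataFrame({
    'object_id': ['a', 'b', 'c'],
    'x': [1.0, np.nan, np.inf],
    'y': [0.5, 0.2, 0.1],
})

cleaned, bad_counts = clean_features(toy)
print(bad_counts.to_dict())
print(cleaned['x'].tolist())
```

Sorting `bad_counts` in descending order shows which features account for most of the missing cells.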

In [11]:
print(f"\nMissing values in train: {train_features.isnull().sum().sum()}")
print(f"Missing values in test: {test_features.isnull().sum().sum()}")

train_features = train_features.fillna(0)
test_features = test_features.fillna(0)

train_features = train_features.replace([np.inf, -np.inf], 0)
test_features = test_features.replace([np.inf, -np.inf], 0)

print(f"After cleaning - Train missing: {train_features.isnull().sum().sum()}")
print(f"After cleaning - Test missing: {test_features.isnull().sum().sum()}")


Missing values in train: 5671
Missing values in test: 12419
After cleaning - Train missing: 0
After cleaning - Test missing: 0


## 7. Check Data Completeness

Verify that every object has been processed and that none is missing.
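
Comparing row counts alone cannot detect an object that was extracted twice while another one is missing. As a stricter optional check, the sketch below compares the ID sets and looks for duplicates. The helper `compare_ids` and the toy IDs are hypothetical.

```python
from collections import Counter


def compare_ids(expected_ids, extracted_ids):
    """Report IDs that are missing, unexpected, or extracted more than once."""
    expected = set(expected_ids)
    counts = Counter(extracted_ids)
    return {
        'missing': sorted(expected - set(counts)),
        'unexpected': sorted(set(counts) - expected),
        'duplicated': sorted(oid for oid, c in counts.items() if c > 1),
    }


# Both lists have three entries, so a length comparison alone would pass.
report = compare_ids(['a', 'b', 'c'], ['a', 'a', 'b'])
print(report)
```

In the notebook this would be called with `train_log['object_id']` and `train_features['object_id']`, and likewise for the test set. An empty list under every key confirms a one-to-one match.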

In [12]:
print("\n" + "="*60)
print("DATA VERIFICATION")
print("="*60)

print(f"\nTrain:")
print(f"  Expected objects: {len(train_log)}")
print(f"  Extracted objects: {len(train_features)}")
print(f"  Match: {'✓' if len(train_features) == len(train_log) else '✗ MISMATCH!'}")

print(f"\nTest:")
print(f"  Expected objects: {len(test_log)}")
print(f"  Extracted objects: {len(test_features)}")
print(f"  Match: {'✓' if len(test_features) == len(test_log) else '✗ MISMATCH!'}")

if len(test_features) != len(test_log):
    print(f"\n  WARNING: Missing {len(test_log) - len(test_features)} test objects!")
    missing = set(test_log['object_id']) - set(test_features['object_id'])
    print(f"  First few missing: {list(missing)[:5]}")


DATA VERIFICATION

Train:
  Expected objects: 3043
  Extracted objects: 3043
  Match: ✓

Test:
  Expected objects: 7135
  Extracted objects: 7135
  Match: ✓


## 8. Save the Features

Save train_features.csv and test_features.csv for use in the training step.

In [13]:
train_features.to_csv('train_features.csv', index=False)
test_features.to_csv('test_features.csv', index=False)

print("\n" + "="*60)
print("FEATURE ENGINEERING COMPLETE")
print("="*60)
print(f"\nSaved files:")
print(f"  - train_features.csv: {train_features.shape}")
print(f"  - test_features.csv: {test_features.shape}")
print(f"\nFeature categories:")
print(f"  - Per-band statistics: 6 bands × 39 features")
print(f"  - Color features: 24 features")
print(f"  - Temporal features: 8 features")
print(f"  - Variability features: 26 features")
print(f"  - Metadata: Z, EBV")
# Subtract the two non-feature columns: object_id and target.
print(f"\nTotal features: {len(train_features.columns) - 2}")
print("\nNext: Run 02_model_training.ipynb")
print("="*60)


FEATURE ENGINEERING COMPLETE

Saved files:
  - train_features.csv: (3043, 296)
  - test_features.csv: (7135, 297)

Feature categories:
  - Per-band statistics: 6 bands × 39 features
  - Color features: 24 features
  - Temporal features: 8 features
  - Variability features: 26 features
  - Metadata: Z, EBV

Total features: 294

Next: Run 02_model_training.ipynb
