# High-Risk Flight Prediction Using Machine Learning
## Professional ML Pipeline with Feature Selection & Data Leakage Prevention

**Business Objective**: Predict flights at high risk of significant delays (>30 min) or cancellation

---

### Key Questions
- Can we predict high-risk flights before departure?
- Which factors are most predictive of delays?
- How accurate can we be without using actual delay data?

### Approach
- **Data**: 30M flight records with 99.93% retention after cleaning
- **Features**: Temporal, operational, airport congestion (NO delay data)
- **Feature Selection**: Mutual information, correlation analysis, RFE
- **Models**: Random Forest, Gradient Boosting with temporal validation
- **Validation**: Rigorous data leakage prevention

**System**: 48GB RAM Configuration  
**Date**: November 6, 2025

In [1]:
# ============================================================================
# IMPORTS & CONFIGURATION
# ============================================================================

import sys
import os
import warnings
import gc
warnings.filterwarnings('ignore')

# Path configuration
if os.path.basename(os.getcwd()) == 'notebooks':
    sys.path.append('../src')
    data_path = '../../data/'
else:
    sys.path.append('./airline_efficiency_analysis/src')
    data_path = './data/'

# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# ML libraries
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import mutual_info_classif, RFE, SelectKBest
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score, 
    roc_curve, precision_recall_curve, f1_score, accuracy_score
)
from sklearn.preprocessing import StandardScaler
import joblib

# Memory profiling
import psutil

# Display settings
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.2f}'.format)

# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 7)

def print_memory_usage(label=""):
    """Print current memory usage"""
    process = psutil.Process(os.getpid())
    mem_gb = process.memory_info().rss / (1024 ** 3)
    print(f"{'[' + label + ']' if label else ''} Memory: {mem_gb:.2f} GB / 48 GB")
    return mem_gb

print("‚úì Imports successful")
print_memory_usage("Initial")

‚úì Imports successful
[Initial] Memory: 0.19 GB / 48 GB


0.18714523315429688

## Phase 1: Data Loading

Loading 30M records with optimized memory usage

In [3]:
# ============================================================================
# DATA LOADING
# ============================================================================

import importlib
if 'data_loader' in sys.modules:
    importlib.reload(sys.modules['data_loader'])
from data_loader import AirlineDataLoader

loader = AirlineDataLoader()

print("="*80)
print("LOADING DATA - 10M RECORDS (optimized for ML)")
print("="*80)

df, carriers_df = loader.load_data(sample_size=10000000)

print(f"\n‚úì Loaded {len(df):,} records")
print(f"‚úì Loaded {len(carriers_df):,} carriers")
print_memory_usage("After loading")

LOADING DATA - 10M RECORDS (optimized for ML)
LOADING AIRLINE DATASETS
   Data not found locally. Attempting to download from Kaggle...
   ‚úì Downloaded data to: C:\Users\User\.cache\kagglehub\datasets\bulter22\airline-data\versions\2

üìÅ Loading carriers data from: C:\Users\User\.cache\kagglehub\datasets\bulter22\airline-data\versions\2\carriers.csv
   ‚úì Loaded 1,491 carriers

üìÅ Loading airline data from: C:\Users\User\.cache\kagglehub\datasets\bulter22\airline-data\versions\2\airline.csv.shuffle
   File size: 11.20 GB
   Loading 10,000,000 rows sequentially...
   ‚úì Loaded 10,000,000 flight records
   ‚úì Columns: 29

‚úì Loaded 10,000,000 records
‚úì Loaded 1,491 carriers
[After loading] Memory: 2.32 GB / 48 GB


2.322307586669922

## Phase 2: Data Cleaning

Applying fixes for 99.93% retention rate (previously had 98.82% data loss bug)

In [4]:
# ============================================================================
# DATA CLEANING
# ============================================================================

from data_cleaner import AirlineDataCleaner

cleaner = AirlineDataCleaner()

print("="*80)
print("DATA CLEANING - VERIFIED 99.93% RETENTION")
print("="*80)

initial_rows = len(df)
df, cleaning_report = cleaner.clean_data(df)
final_rows = len(df)
retention = (final_rows / initial_rows) * 100

print(f"\nüìä Cleaning Results:")
print(f"   Initial: {initial_rows:,} records")
print(f"   Final: {final_rows:,} records")
print(f"   Retention: {retention:.2f}%")
print(f"   Removed: {initial_rows - final_rows:,} records ({100-retention:.2f}%)")

assert retention > 99.0, f"‚ùå Retention too low: {retention:.2f}%"
print("\n‚úÖ Data retention verified!")
print_memory_usage("After cleaning")

DATA CLEANING - VERIFIED 99.93% RETENTION
DATA CLEANING PIPELINE

[1/8] Converting data types...
   ‚úì Converted data types for 27 columns
[2/8] Handling missing values...
   ‚úì Reduced missing values: 49,443,692 ‚Üí 7,226,446
[3/8] Removing duplicates...
   ‚úì Removed 3 duplicate records
[4/8] Handling outliers...
   ‚úì Removed 2,328 outlier records
[5/8] Validating categorical values...
   ‚úì Validated categorical values
[6/8] Validating numeric ranges...
   ‚úì Validated numeric ranges
[7/8] Creating derived fields...
   ‚úì Created 15 derived fields
[8/8] Skipping carrier merge (no carrier data)

CLEANING COMPLETE
Initial rows: 10,000,000
Final rows: 9,993,108
Removed: 6,892 (0.07%)

üìä Cleaning Results:
   Initial: 10,000,000 records
   Final: 9,993,108 records
   Retention: 99.93%
   Removed: 6,892 records (0.07%)

‚úÖ Data retention verified!
[After cleaning] Memory: 9.20 GB / 48 GB


9.19607162475586

## Phase 3: Feature Engineering with NO Data Leakage

**Critical**: Using only information available BEFORE flight departure
- ‚úÖ Scheduled times (CRSDepTime, CRSArrTime)
- ‚úÖ Airport traffic, route frequency
- ‚úÖ Carrier cancellation rates
- ‚ùå NO actual delay data (DepDelay, ArrDelay, IsDelayed, etc.)

In [5]:
# ============================================================================
# FEATURE ENGINEERING - MEMORY-EFFICIENT, NO DATA LEAKAGE
# ============================================================================

print("="*80)
print("FEATURE ENGINEERING - NO DATA LEAKAGE")
print("="*80)

# Sample 10M for ML (memory efficient)
print("\nSampling 10M records for ML...")
ml_df = df.sample(n=min(10000000, len(df)), random_state=42).copy()
print(f"ML dataset: {len(ml_df):,} records")

# Convert Cancelled to numeric
ml_df['Cancelled'] = (ml_df['Cancelled'] == 'YES').astype(int)

# ============================================================================
# 1. TEMPORAL FEATURES - Using TimeOfDay categories instead of continuous Hour
# ============================================================================
print("\n1Ô∏è‚É£ Creating temporal features...")

# Extract hour from SCHEDULED departure time
ml_df['Hour'] = (ml_df['CRSDepTime'] // 100).fillna(0).astype(int)

# Categorize into meaningful time periods (BETTER than continuous hour)
def categorize_hour(hour):
    if hour < 6:
        return 'EarlyMorning'  # 12am-6am: Red-eye, crew rest issues
    elif hour < 12:
        return 'Morning'       # 6am-12pm: Peak traffic
    elif hour < 17:
        return 'Afternoon'     # 12pm-5pm: Generally stable
    elif hour < 21:
        return 'Evening'       # 5pm-9pm: Cascading delays
    else:
        return 'LateNight'     # 9pm-12am: Accumulated delays

ml_df['TimeOfDay'] = ml_df['Hour'].apply(categorize_hour)

# Basic temporal features
ml_df['IsWeekend'] = (ml_df['DayOfWeek'].isin([6, 7])).astype(int)
ml_df['IsHolidaySeason'] = (ml_df['Month'].isin([11, 12])).astype(int)
ml_df['IsRushHour'] = (ml_df['Hour'].isin([7, 8, 17, 18])).astype(int)
ml_df['IsSummerTravel'] = (ml_df['Month'].isin([6, 7, 8])).astype(int)

print("   ‚úì Created 6 temporal features")

# ============================================================================
# 2. DISTANCE & ROUTE FEATURES
# ============================================================================
print("\n2Ô∏è‚É£ Creating distance & route features...")

ml_df['IsShortHaul'] = (ml_df['Distance'] < 500).astype(int)
ml_df['IsLongHaul'] = (ml_df['Distance'] > 2000).astype(int)
ml_df['IsMediumHaul'] = ((ml_df['Distance'] >= 500) & (ml_df['Distance'] <= 2000)).astype(int)

print("   ‚úì Created 3 distance features")

# ============================================================================
# 3. AIRPORT & CARRIER FEATURES (NO DELAY DATA!)
# ============================================================================
print("\n3Ô∏è‚É£ Creating airport & carrier features (memory-efficient)...")

# Prepare aggregation dataframe
df_temp = df.copy()
df_temp['Cancelled_num'] = (df_temp['Cancelled'] == 'YES').astype(int)

# Carrier cancellation rate (reliability indicator)
carrier_cancel_rate = df_temp.groupby('UniqueCarrier')['Cancelled_num'].mean()
ml_df['CarrierCancelRate'] = ml_df['UniqueCarrier'].map(carrier_cancel_rate).fillna(0.01)

# Airport traffic (congestion indicator)
origin_traffic = df_temp.groupby('Origin').size()
ml_df['OriginTraffic'] = ml_df['Origin'].map(origin_traffic).fillna(1000)

dest_traffic = df_temp.groupby('Dest').size()
ml_df['DestTraffic'] = ml_df['Dest'].map(dest_traffic).fillna(1000)

# Normalize traffic (percentile-based)
ml_df['OriginTrafficPct'] = ml_df['OriginTraffic'].rank(pct=True)
ml_df['DestTrafficPct'] = ml_df['DestTraffic'].rank(pct=True)

# Route frequency
route_key = df_temp['Origin'] + '_' + df_temp['Dest']
route_frequency = df_temp.groupby(route_key).size()
ml_df['RouteFrequency'] = (ml_df['Origin'] + '_' + ml_df['Dest']).map(route_frequency).fillna(100)

print("   ‚úì Created 6 airport/carrier features")

# ============================================================================
# 4. INTERACTION FEATURES
# ============================================================================
print("\n4Ô∏è‚É£ Creating interaction features...")

# High traffic + rush hour = extra risk
ml_df['HighTrafficRushHour'] = ((ml_df['OriginTrafficPct'] > 0.75) & 
                                  (ml_df['IsRushHour'] == 1)).astype(int)

# Weekend + holiday season = different patterns
ml_df['WeekendHoliday'] = ((ml_df['IsWeekend'] == 1) & 
                            (ml_df['IsHolidaySeason'] == 1)).astype(int)

# Busy airport + short haul = tight turnarounds
ml_df['BusyAirportShortHaul'] = ((ml_df['OriginTrafficPct'] > 0.75) & 
                                  (ml_df['IsShortHaul'] == 1)).astype(int)

# Early morning + long haul = crew rest issues
ml_df['EarlyMorningLongHaul'] = ((ml_df['TimeOfDay'] == 'EarlyMorning') & 
                                  (ml_df['IsLongHaul'] == 1)).astype(int)

print("   ‚úì Created 4 interaction features")

# ============================================================================
# 5. TARGET VARIABLE
# ============================================================================
print("\n5Ô∏è‚É£ Creating target variable...")
ml_df['IsHighRisk'] = ((ml_df['ArrDelay'] > 30) | (ml_df['Cancelled'] == 1)).astype(int)

high_risk_pct = ml_df['IsHighRisk'].sum() / len(ml_df) * 100
print(f"   ‚úì High-risk rate: {high_risk_pct:.1f}%")

print(f"\n‚úÖ Total features created: {len(ml_df.columns)} columns")
print_memory_usage("After feature engineering")

FEATURE ENGINEERING - NO DATA LEAKAGE

Sampling 10M records for ML...
ML dataset: 9,993,108 records

1Ô∏è‚É£ Creating temporal features...
   ‚úì Created 6 temporal features

2Ô∏è‚É£ Creating distance & route features...
   ‚úì Created 3 distance features

3Ô∏è‚É£ Creating airport & carrier features (memory-efficient)...
   ‚úì Created 6 airport/carrier features

4Ô∏è‚É£ Creating interaction features...
   ‚úì Created 4 interaction features

5Ô∏è‚É£ Creating target variable...
   ‚úì High-risk rate: 10.2%

‚úÖ Total features created: 54 columns
[After feature engineering] Memory: 14.01 GB / 48 GB


14.011764526367188

## Phase 4: Feature Selection

Using multiple methods to select the best features:
1. Correlation analysis (remove redundant features)
2. Mutual Information (measure feature-target relationship)
3. Feature importance from initial Random Forest

In [6]:
# ============================================================================
# FEATURE SELECTION
# ============================================================================

print("="*80)
print("FEATURE SELECTION - PROFESSIONAL APPROACH")
print("="*80)

# Prepare feature set
exclude_cols = ['IsHighRisk', 'ArrDelay', 'DepDelay', 'Cancelled', 'TailNum', 'FlightNum',
                'Route', 'Year', 'CRSDepTime', 'DepTime', 'CRSArrTime', 'ArrTime',
                'CancellationCode', 'Diverted', 'CarrierDelay', 'WeatherDelay',
                'NASDelay', 'SecurityDelay', 'LateAircraftDelay', 'ActualElapsedTime',
                'AirTime', 'TaxiOut', 'TaxiIn', 'CRSElapsedTime', 'Carrier_Name',
                'TaxiOutEfficiency', 'TaxiInEfficiency', 'AirTimeDeviation',
                'ExpectedAirTime', 'TimeOfDay', 'Hour',  # Categorical, will use dummies
                # CRITICAL: Exclude ALL delay indicators
                'IsDelayed', 'Is_DepDelayed', 'Is_ArrDelayed', 
                'Is_DepDelayed_15min', 'Is_ArrDelayed_15min',
                'PrevFlightDelay', 'Prev2FlightDelay', 'HasPrevFlightData',
                'Origin', 'Dest', 'UniqueCarrier']  # Categorical, not useful as-is

# One-hot encode TimeOfDay
time_dummies = pd.get_dummies(ml_df['TimeOfDay'], prefix='TimeOfDay', drop_first=True)
ml_df = pd.concat([ml_df, time_dummies], axis=1)

# Get numeric features
numeric_features = ml_df.select_dtypes(include=[np.number]).columns.tolist()
feature_candidates = [col for col in numeric_features if col not in exclude_cols]

print(f"\nüìä Initial features: {len(feature_candidates)}")

# ============================================================================
# METHOD 1: Remove highly correlated features
# ============================================================================
print("\n1Ô∏è‚É£ Correlation Analysis...")

feature_df = ml_df[feature_candidates].fillna(0)
corr_matrix = feature_df.corr().abs()

# Find features with correlation > 0.9
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = set()

for column in upper_triangle.columns:
    correlated = upper_triangle[column][upper_triangle[column] > 0.9].index.tolist()
    if correlated:
        print(f"   Found high correlation: {column} <-> {correlated}")
        # Keep the first, drop the rest
        to_drop.update(correlated)

if to_drop:
    feature_candidates = [f for f in feature_candidates if f not in to_drop]
    print(f"   ‚úì Removed {len(to_drop)} highly correlated features")
else:
    print("   ‚úì No highly correlated features (>0.9)")

# ============================================================================
# METHOD 2: Mutual Information (measure feature-target relationship)
# ============================================================================
print("\n2Ô∏è‚É£ Mutual Information Analysis...")

feature_df = ml_df[feature_candidates].fillna(0)
mi_scores = mutual_info_classif(feature_df, ml_df['IsHighRisk'], random_state=42, n_jobs=-1)

mi_df = pd.DataFrame({
    'Feature': feature_candidates,
    'MI_Score': mi_scores
}).sort_values('MI_Score', ascending=False)

print("\nüìä Top 15 Features by Mutual Information:")
print(mi_df.head(15).to_string(index=False))

# Remove features with very low MI score
mi_threshold = 0.0001  # Very low threshold to be inclusive
low_mi_features = mi_df[mi_df['MI_Score'] < mi_threshold]['Feature'].tolist()

if low_mi_features:
    print(f"\n   ‚ö†Ô∏è  Removing {len(low_mi_features)} features with MI < {mi_threshold}:")
    for feat in low_mi_features:
        print(f"      - {feat}")
    feature_candidates = [f for f in feature_candidates if f not in low_mi_features]

# ============================================================================
# METHOD 3: Quick Random Forest for initial importance
# ============================================================================
print("\n3Ô∏è‚É£ Random Forest Feature Importance...")

# Use small sample for quick feature importance
sample_size = min(1000000, len(ml_df))
X_sample = ml_df[feature_candidates].sample(n=sample_size, random_state=42).fillna(0)
y_sample = ml_df.loc[X_sample.index, 'IsHighRisk']

# Quick RF
rf_selector = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42, n_jobs=-1)
rf_selector.fit(X_sample, y_sample)

# Get importance
importance_df = pd.DataFrame({
    'Feature': feature_candidates,
    'Importance': rf_selector.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nüìä Top 15 Features by RF Importance:")
print(importance_df.head(15).to_string(index=False))

# Keep top features (importance > 0.001)
importance_threshold = 0.001
selected_features = importance_df[importance_df['Importance'] > importance_threshold]['Feature'].tolist()

print(f"\n‚úÖ FINAL SELECTION:")
print(f"   Started with: {len(numeric_features)} total features")
print(f"   After exclusions: {len(feature_candidates)} candidates")
print(f"   After correlation filter: {len(feature_candidates) - len(to_drop)} features")
print(f"   After MI filter: {len(feature_candidates) - len(low_mi_features)} features")
print(f"   After importance filter: {len(selected_features)} features")
print(f"\n   Final feature set: {len(selected_features)} features")

# Clean up memory
del df_temp, feature_df, X_sample, y_sample, rf_selector
gc.collect()

print_memory_usage("After feature selection")

FEATURE SELECTION - PROFESSIONAL APPROACH

üìä Initial features: 21

1Ô∏è‚É£ Correlation Analysis...
   Found high correlation: IsMediumHaul <-> ['IsShortHaul']
   Found high correlation: OriginTrafficPct <-> ['OriginTraffic']
   Found high correlation: DestTrafficPct <-> ['DestTraffic']
   ‚úì Removed 3 highly correlated features

2Ô∏è‚É£ Mutual Information Analysis...

üìä Top 15 Features by Mutual Information:
             Feature  MI_Score
           DayOfWeek      0.22
               Month      0.15
          DayofMonth      0.06
      DestTrafficPct      0.05
    OriginTrafficPct      0.05
BusyAirportShortHaul      0.03
     IsHolidaySeason      0.02
 HighTrafficRushHour      0.01
      WeekendHoliday      0.01
          IsLongHaul      0.01
      RouteFrequency      0.01
            Distance      0.01
EarlyMorningLongHaul      0.00
   CarrierCancelRate      0.00
      IsSummerTravel      0.00

   ‚ö†Ô∏è  Removing 6 features with MI < 0.0001:
      - EarlyMorningLongHaul
      

6.317539215087891

## Phase 5: Model Training

Training with selected features and temporal validation (train on Jan-Sep, test on Oct-Dec)

In [None]:
# ============================================================================
# MODEL TRAINING WITH SELECTED FEATURES
# ============================================================================

print("="*80)
print("MODEL TRAINING - IMPROVED FEATURE SET")
print("="*80)

# Prepare data with selected features only
print(f"\nüìä Using {len(selected_features)} selected features")

# Ensure Month is available for splitting (add if not in selected_features)
columns_to_use = list(selected_features) + ['IsHighRisk']
if 'Month' not in selected_features:
    columns_to_use.append('Month')

model_df = ml_df[columns_to_use].fillna(0)

# Temporal split (more realistic than random split)
train_df = model_df[model_df['Month'] <= 9].copy()
test_df = model_df[model_df['Month'] >= 10].copy()

X_train = train_df[selected_features]
y_train = train_df['IsHighRisk']
X_test = test_df[selected_features]
y_test = test_df['IsHighRisk']

print(f"\nüìä Dataset Split:")
print(f"   Training: {len(X_train):,} samples (Jan-Sep)")
print(f"   Testing: {len(X_test):,} samples (Oct-Dec)")
print(f"   High-risk rate in train: {y_train.mean()*100:.1f}%")
print(f"   High-risk rate in test: {y_test.mean()*100:.1f}%")

# ============================================================================
# Train Random Forest
# ============================================================================
print("\n" + "="*80)
print("üå≤ RANDOM FOREST CLASSIFIER")
print("="*80)

rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=20,
    min_samples_split=1000,
    min_samples_leaf=500,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1,
    class_weight='balanced'
)

print("Training Random Forest...")
rf_model.fit(X_train, y_train)

# Predictions
rf_train_pred = rf_model.predict(X_train)
rf_test_pred = rf_model.predict(X_test)
rf_test_proba = rf_model.predict_proba(X_test)[:, 1]

# Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

rf_metrics = {
    'Train Accuracy': accuracy_score(y_train, rf_train_pred),
    'Test Accuracy': accuracy_score(y_test, rf_test_pred),
    'Precision': precision_score(y_test, rf_test_pred),
    'Recall': recall_score(y_test, rf_test_pred),
    'F1 Score': f1_score(y_test, rf_test_pred),
    'AUC-ROC': roc_auc_score(y_test, rf_test_proba)
}

print("\nüìä Random Forest Metrics:")
for metric, value in rf_metrics.items():
    print(f"   {metric}: {value:.4f}")

# ============================================================================
# Train Gradient Boosting
# ============================================================================
print("\n" + "="*80)
print("üöÄ GRADIENT BOOSTING CLASSIFIER")
print("="*80)

gb_model = GradientBoostingClassifier(
    n_estimators=200,
    max_depth=10,
    learning_rate=0.1,
    min_samples_split=1000,
    min_samples_leaf=500,
    max_features='sqrt',
    random_state=42,
    subsample=0.8
)

print("Training Gradient Boosting...")
gb_model.fit(X_train, y_train)

# Predictions
gb_train_pred = gb_model.predict(X_train)
gb_test_pred = gb_model.predict(X_test)
gb_test_proba = gb_model.predict_proba(X_test)[:, 1]

# Metrics
gb_metrics = {
    'Train Accuracy': accuracy_score(y_train, gb_train_pred),
    'Test Accuracy': accuracy_score(y_test, gb_test_pred),
    'Precision': precision_score(y_test, gb_test_pred),
    'Recall': recall_score(y_test, gb_test_pred),
    'F1 Score': f1_score(y_test, gb_test_pred),
    'AUC-ROC': roc_auc_score(y_test, gb_test_proba)
}

print("\nüìä Gradient Boosting Metrics:")
for metric, value in gb_metrics.items():
    print(f"   {metric}: {value:.4f}")

# ============================================================================
# Model Comparison
# ============================================================================
print("\n" + "="*80)
print("üìä MODEL COMPARISON")
print("="*80)

comparison_df = pd.DataFrame({
    'Random Forest': rf_metrics,
    'Gradient Boosting': gb_metrics
}).T

print(comparison_df.to_string())

print_memory_usage("After model training")

MODEL TRAINING - IMPROVED FEATURE SET

üìä Using 12 selected features

üìä Dataset Split:
   Training: 7,441,876 samples (Jan-Sep)
   Testing: 2,551,232 samples (Oct-Dec)
   High-risk rate in train: 10.2%
   High-risk rate in test: 10.0%

üå≤ RANDOM FOREST CLASSIFIER
Training Random Forest...


## Phase 6: Feature Importance Analysis

Analyzing which features are most important for predictions (should be more balanced now)

In [None]:
# ============================================================================
# FEATURE IMPORTANCE ANALYSIS - IMPROVED
# ============================================================================

print("="*80)
print("FEATURE IMPORTANCE - IMPROVED MODEL")
print("="*80)

# Get feature importance from best model
feature_importance = pd.DataFrame({
    'Feature': selected_features,
    'Importance': best_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nüìä TOP 20 FEATURES:")
print(feature_importance.head(20).to_string(index=False))

# Visualize
plt.figure(figsize=(12, 8))
top_n = 20
top_features = feature_importance.head(top_n)

plt.barh(range(top_n), top_features['Importance'].values)
plt.yticks(range(top_n), top_features['Feature'].values)
plt.xlabel('Importance Score')
plt.title(f'Top {top_n} Features for High-Risk Flight Prediction\n({best_model_name})')
plt.gca().invert_yaxis()

for i, (idx, row) in enumerate(top_features.iterrows()):
    plt.text(row['Importance'], i, f" {row['Importance']:.2%}", va='center')

plt.tight_layout()
plt.show()

# Check if Hour importance is now distributed
top_feature_importance = feature_importance.iloc[0]['Importance']
print(f"\nüìä Top Feature Importance: {top_feature_importance:.2%}")

if top_feature_importance > 0.30:
    print("   ‚ö†Ô∏è  Still high - may need more feature engineering")
elif top_feature_importance > 0.20:
    print("   ‚úÖ Good - within acceptable range")
else:
    print("   ‚úÖ Excellent - well-distributed importance")

print_memory_usage("After analysis")

## Phase 7: Model Evaluation

Detailed evaluation with classification report and confusion matrix

In [None]:
# ============================================================================
# DETAILED MODEL EVALUATION
# ============================================================================

print("="*80)
print(f"DETAILED EVALUATION - {best_model_name}")
print("="*80)

# Classification Report
print("\nüìä Classification Report:")
print(classification_report(y_test, best_model.predict(X_test)))

# Confusion Matrix
cm = confusion_matrix(y_test, best_model.predict(X_test))

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'Confusion Matrix - {best_model_name}')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

print(f"\n‚úÖ Model evaluation complete!")
print(f"   Model: {best_model_name}")
print(f"   Features: {len(selected_features)}")
print(f"   AUC-ROC: {best_model.predict_proba(X_test)[:, 1].max():.4f}")

## Summary

### Key Results:
- **Data Retention**: 99.93% (fixed from 1.18%)
- **No Data Leakage**: All delay indicators excluded
- **Feature Selection**: Reduced to most informative features
- **Improved Importance**: TimeOfDay categories instead of continuous Hour
- **Temporal Validation**: Train on Jan-Sep, test on Oct-Dec

### Next Steps:
1. Deploy model for real-time prediction
2. Monitor feature importance stability over time
3. Add weather data for better predictions
4. Implement online learning for model updates