# Machine Learning Preprocessing Pipeline for Flight Delay Data

This notebook builds upon the base preprocessing pipeline to create features specifically optimized for traditional machine learning models like XGBoost, Random Forest, etc. These models typically require structured tabular data with well-engineered features, proper encoding, and handling of outliers.

## Key Processing Steps:
1. Loading the base preprocessed data
2. Feature engineering specific to ML models
3. Categorical encoding (one-hot, ordinal, target)
4. Handling outliers
5. Feature selection/importance analysis
6. Feature scaling/normalization
7. Train-test splitting with time-based validation
8. Exporting the processed data for ML model training

In [1]:
# Import required libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
import warnings

# Ignore warnings
warnings.filterwarnings('ignore')

# Configure paths dynamically using relative paths
import os.path as path

# Get the directory of the current notebook
notebook_dir = path.dirname(path.abspath('__file__'))
# Get project root (parent of notebooks directory)
project_root = path.abspath(path.join(notebook_dir, '..', '..'))

# Define paths relative to project root
BASE_PROCESSED_PATH = path.join(project_root, 'data', 'processed', 'base_preprocessed_flights.csv')
ML_PROCESSED_PATH = path.join(project_root, 'data', 'processed','ml_ready_flights', 'ml_ready_flights.csv')
ML_MODEL_PATH = path.join(project_root, 'models')

# Create directories if they don't exist
os.makedirs(os.path.dirname(ML_PROCESSED_PATH), exist_ok=True)
os.makedirs(ML_MODEL_PATH, exist_ok=True)

print(f"Base processed data path: {BASE_PROCESSED_PATH}")
print(f"ML processed data path: {ML_PROCESSED_PATH}")
print(f"ML model path: {ML_MODEL_PATH}")

# Display settings
pd.set_option('display.max_columns', None)
print("Libraries and paths configured.")

Base processed data path: /Users/osx/DataSceince_FL_FR/Forecasting_Flights-DataScience/data/processed/base_preprocessed_flights.csv
ML processed data path: /Users/osx/DataSceince_FL_FR/Forecasting_Flights-DataScience/data/processed/ml_ready_flights/ml_ready_flights.csv
ML model path: /Users/osx/DataSceince_FL_FR/Forecasting_Flights-DataScience/models
Libraries and paths configured.


In [2]:
# Function to load data in chunks
def load_processed_data(file_path, chunk_size=500000):
    """
    Generator function to load preprocessed data in chunks
    """
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        # Convert date columns to datetime
        date_columns = [col for col in chunk.columns if 'DATE' in col.upper()]
        for col in date_columns:
            chunk[col] = pd.to_datetime(chunk[col], errors='coerce')
        
        yield chunk

In [3]:
# Inspect the data
first_chunk = next(load_processed_data(BASE_PROCESSED_PATH))

print(f"Data shape of first chunk: {first_chunk.shape}")
print("\nColumns and data types:")
for col in first_chunk.columns:
    print(f"- {col}: {first_chunk[col].dtype}")

print("\nSample data (first 5 rows):")
display(first_chunk.head())

Data shape of first chunk: (500000, 41)

Columns and data types:
- FL_DATE: datetime64[ns]
- AIRLINE: object
- AIRLINE_DOT: object
- AIRLINE_CODE: object
- DOT_CODE: int64
- FL_NUMBER: int64
- ORIGIN: object
- ORIGIN_CITY: object
- DEST: object
- DEST_CITY: object
- CRS_DEP_TIME: int64
- DEP_TIME: int64
- DEP_DELAY: float64
- TAXI_OUT: float64
- WHEELS_OFF: float64
- WHEELS_ON: float64
- TAXI_IN: float64
- CRS_ARR_TIME: int64
- ARR_TIME: float64
- ARR_DELAY: float64
- CANCELLED: int64
- CANCELLATION_CODE: float64
- DIVERTED: int64
- CRS_ELAPSED_TIME: float64
- ELAPSED_TIME: float64
- AIR_TIME: float64
- DISTANCE: float64
- DELAY_DUE_CARRIER: float64
- DELAY_DUE_WEATHER: float64
- DELAY_DUE_NAS: float64
- DELAY_DUE_SECURITY: float64
- DELAY_DUE_LATE_AIRCRAFT: float64
- YEAR: int64
- QUARTER: int64
- MONTH: int64
- DAY_OF_MONTH: int64
- DAY_OF_WEEK: int64
- SEASON: int64
- IS_HOLIDAY_SEASON: int64
- DEP_HOUR: int64
- TIME_OF_DAY: object

Sample data (first 5 rows):


Unnamed: 0,FL_DATE,AIRLINE,AIRLINE_DOT,AIRLINE_CODE,DOT_CODE,FL_NUMBER,ORIGIN,ORIGIN_CITY,DEST,DEST_CITY,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,CANCELLED,CANCELLATION_CODE,DIVERTED,CRS_ELAPSED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,DELAY_DUE_CARRIER,DELAY_DUE_WEATHER,DELAY_DUE_NAS,DELAY_DUE_SECURITY,DELAY_DUE_LATE_AIRCRAFT,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,SEASON,IS_HOLIDAY_SEASON,DEP_HOUR,TIME_OF_DAY
0,2019-01-09,United Air Lines Inc.,United Air Lines Inc.: UA,UA,19977,1562,FLL,"Fort Lauderdale, FL",EWR,"Newark, NJ",715,711,0.0,19.0,1210.0,1443.0,4.0,901,1447.0,0.0,0,,0,186.0,176.0,153.0,1065.0,,,,,,2019,1,1,9,2,1,0,11,Morning
1,2022-11-19,Delta Air Lines Inc.,Delta Air Lines Inc.: DL,DL,19790,1149,MSP,"Minneapolis, MN",SEA,"Seattle, WA",1280,1274,0.0,9.0,2123.0,2232.0,38.0,1395,2310.0,0.0,0,,0,235.0,236.0,189.0,1399.0,,,,,,2022,4,11,19,5,4,1,21,Evening
2,2022-07-22,United Air Lines Inc.,United Air Lines Inc.: UA,UA,19977,459,DEN,"Denver, CO",MSP,"Minneapolis, MN",594,600,6.0,20.0,1020.0,1247.0,5.0,772,1252.0,0.0,0,,0,118.0,112.0,87.0,680.0,,,,,,2022,3,7,22,4,3,0,9,Morning
3,2023-03-06,Delta Air Lines Inc.,Delta Air Lines Inc.: DL,DL,19790,2295,MSP,"Minneapolis, MN",SFO,"San Francisco, CA",969,968,0.0,27.0,1635.0,1844.0,9.0,1109,1853.0,24.0,0,,0,260.0,285.0,249.0,1589.0,0.0,0.0,24.0,0.0,0.0,2023,1,3,6,0,2,0,16,Afternoon
4,2019-07-31,Southwest Airlines Co.,Southwest Airlines Co.: WN,WN,19393,665,DAL,"Dallas, TX",OKC,"Oklahoma City, OK",610,757,147.0,15.0,1252.0,1328.0,3.0,670,1331.0,141.0,0,,0,60.0,54.0,36.0,181.0,141.0,0.0,0.0,0.0,0.0,2019,3,7,31,2,3,0,10,Morning


## Feature Engineering Specific to Machine Learning Models

We'll define functions to create additional features that are particularly useful for machine learning models. These include:

1. Categorical encoding
2. Route-based features
3. Carrier-based features
4. Temporal patterns
5. Airport congestion metrics

In [4]:
# Create ML-specific features
def create_ml_features(df):
    """
    Create features specifically useful for ML models
    """
    df_featured = df.copy()
    
    # ======== ROUTE AND AIRPORT FEATURES ========
    
    # Create route features if origin and destination are present
    if 'ORIGIN' in df_featured.columns and 'DEST' in df_featured.columns:
        # Create a combined route feature
        df_featured['ROUTE'] = df_featured['ORIGIN'] + '_' + df_featured['DEST']
        
        # Create distance buckets (if distance is available)
        if 'DISTANCE' in df_featured.columns:
            df_featured['DISTANCE_GROUP'] = pd.cut(
                df_featured['DISTANCE'],
                bins=[0, 500, 1000, 1500, 2000, 2500, 5000],
                labels=['Very Short', 'Short', 'Medium Short', 'Medium', 'Medium Long', 'Long']
            )
    
    # ======== TEMPORAL FEATURES ========
    
    # Create cyclical features for time variables
    if 'MONTH' in df_featured.columns:
        # Cyclical encoding of month (1-12)
        df_featured['MONTH_SIN'] = np.sin(2 * np.pi * df_featured['MONTH'] / 12)
        df_featured['MONTH_COS'] = np.cos(2 * np.pi * df_featured['MONTH'] / 12)
    
    if 'DAY_OF_WEEK' in df_featured.columns:
        # Cyclical encoding of day of week (1-7)
        df_featured['DAY_OF_WEEK_SIN'] = np.sin(2 * np.pi * df_featured['DAY_OF_WEEK'] / 7)
        df_featured['DAY_OF_WEEK_COS'] = np.cos(2 * np.pi * df_featured['DAY_OF_WEEK'] / 7)
    
    if 'DEP_HOUR' in df_featured.columns:
        # Cyclical encoding of hour (0-23)
        df_featured['DEP_HOUR_SIN'] = np.sin(2 * np.pi * df_featured['DEP_HOUR'] / 24)
        df_featured['DEP_HOUR_COS'] = np.cos(2 * np.pi * df_featured['DEP_HOUR'] / 24)
    
    # ======== FLIGHT-SPECIFIC FEATURES ========
    
    # Create a departure vs. scheduled departure difference
    if 'DEP_TIME' in df_featured.columns and 'CRS_DEP_TIME' in df_featured.columns:
        try:
            df_featured['DEP_DIFF'] = df_featured['DEP_TIME'] - df_featured['CRS_DEP_TIME']
        except:
            print("Could not calculate DEP_DIFF")
    
    # Combine delay types (if available)
    delay_cols = ['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']
    if all(col in df_featured.columns for col in delay_cols):
        # Fill NaNs with 0 to avoid issues in summation
        for col in delay_cols:
            df_featured[col] = df_featured[col].fillna(0)
            
        # Create delay type features
        df_featured['TOTAL_DELAY'] = df_featured[delay_cols].sum(axis=1)
        
        # Calculate percentage of each delay type
        for col in delay_cols:
            df_featured[f'{col}_RATIO'] = (df_featured[col] / df_featured['TOTAL_DELAY']).fillna(0)
    
    # ======== INTERACTION FEATURES ========
    
    # Create interaction features between important variables
    if 'DAY_OF_WEEK' in df_featured.columns and 'DEP_HOUR' in df_featured.columns:
        df_featured['DAY_HOUR'] = df_featured['DAY_OF_WEEK'].astype(str) + '_' + df_featured['DEP_HOUR'].astype(str)
    
    if 'MONTH' in df_featured.columns and 'DAY_OF_WEEK' in df_featured.columns:
        df_featured['MONTH_DAY'] = df_featured['MONTH'].astype(str) + '_' + df_featured['DAY_OF_WEEK'].astype(str)
    
    return df_featured

In [5]:
# Handle outliers for ML models
def handle_outliers_ml(df, target_col='DEP_DELAY'):
    """
    Handle outliers in a way suitable for ML models
    """
    df_clean = df.copy()
    
    # Define numeric columns that might have outliers
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns.tolist()
    
    # Remove target column from the list if it's there
    if target_col in numeric_cols:
        numeric_cols.remove(target_col)
    
    # Define columns to check for outliers
    outlier_columns = ['DEP_DELAY', 'ARR_DELAY', 'TAXI_OUT', 'TAXI_IN', 
                      'ACTUAL_ELAPSED_TIME', 'AIR_TIME', 'DISTANCE']
    
    # Filter to keep only columns that exist in the dataframe
    outlier_columns = [col for col in outlier_columns if col in df_clean.columns]
    
    # Handle outliers for each column
    for col in outlier_columns:
        # Skip non-numeric columns
        if df_clean[col].dtype not in [np.number]:
            continue
            
        # Skip columns already processed in base pipeline
        if col not in df_clean.columns:
            continue
        
        # Calculate IQR
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1
        
        # Define bounds
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        # Cap outliers (winsorize)
        if col != target_col:  # Don't cap the target variable
            df_clean[col] = df_clean[col].clip(lower_bound, upper_bound)
        else:
            # For the target column, we might want to keep outliers but flag them
            df_clean[f'{col}_IS_OUTLIER'] = ((df_clean[col] < lower_bound) | (df_clean[col] > upper_bound)).astype(int)
    
    return df_clean

In [6]:
# Create ML-ready feature encoders
def create_categorical_encoders(df, categorical_cols=None, target_col='DEP_DELAY'):
    """
    Create encoders for categorical variables
    """
    if categorical_cols is None:
        # Identify categorical columns automatically
        categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    
    # Create a dictionary to store encoders
    encoders = {}
    
    # One-hot encoding for low-cardinality categoricals
    encoders['onehot'] = {}
    
    # Ordinal encoding for high-cardinality categoricals
    encoders['ordinal'] = {}
    
    # Determine cardinality of each categorical
    for col in categorical_cols:
        if col not in df.columns:
            continue
            
        n_unique = df[col].nunique()
        
        # Create appropriate encoder based on cardinality
        if n_unique <= 15:  # Low cardinality
            encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
            encoder.fit(df[[col]])
            encoders['onehot'][col] = encoder
        else:  # High cardinality
            encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
            encoder.fit(df[[col]])
            encoders['ordinal'][col] = encoder
    
    return encoders, categorical_cols

In [7]:
# Apply categorical encoders
def apply_categorical_encodings(df, encoders, categorical_cols):
    """
    Apply the fitted encoders to transform categorical data
    """
    df_encoded = df.copy()
    
    # Apply one-hot encoding
    for col, encoder in encoders['onehot'].items():
        if col in df_encoded.columns:
            # Transform the data
            encoded_array = encoder.transform(df_encoded[[col]])
            
            # Create new column names
            feature_names = [f"{col}_{cat}" for cat in encoder.categories_[0]]
            
            # Create a DataFrame with the encoded values
            encoded_df = pd.DataFrame(encoded_array, columns=feature_names, index=df_encoded.index)
            
            # Concatenate with the original DataFrame
            df_encoded = pd.concat([df_encoded, encoded_df], axis=1)
            
            # Drop the original categorical column
            df_encoded = df_encoded.drop(col, axis=1)
    
    # Apply ordinal encoding
    for col, encoder in encoders['ordinal'].items():
        if col in df_encoded.columns:
            # Transform the data
            encoded_array = encoder.transform(df_encoded[[col]])
            
            # Replace the original column with the encoded values
            df_encoded[col] = encoded_array
    
    return df_encoded

In [8]:
# Feature selection for ML
def select_features_ml(df, target_col='DEP_DELAY', k=20):
    """
    Select top k features based on correlation with target
    """
    # Split data into X and y
    X = df.drop(target_col, axis=1)
    y = df[target_col]
    
    # Keep only numeric columns
    numeric_X = X.select_dtypes(include=[np.number])
    
    # Drop any constant columns
    non_constant_cols = [col for col in numeric_X.columns if numeric_X[col].nunique() > 1]
    numeric_X = numeric_X[non_constant_cols]
    
    # Calculate correlation with target
    correlations = numeric_X.corrwith(y).abs().sort_values(ascending=False)
    
    # Select top k features
    top_features = correlations.nlargest(min(k, len(correlations))).index.tolist()
    
    # Add target column back
    selected_features = top_features + [target_col]
    
    return selected_features, correlations

In [9]:
# Feature scaling for ML models
def scale_features_ml(df, target_col='DEP_DELAY', scaler_type='standard'):
    """
    Scale features for ML models
    """
    df_scaled = df.copy()
    
    # Get numeric columns excluding the target
    numeric_cols = df_scaled.select_dtypes(include=[np.number]).columns.tolist()
    if target_col in numeric_cols:
        numeric_cols.remove(target_col)
    
    # Choose scaler
    if scaler_type == 'standard':
        scaler = StandardScaler()
    elif scaler_type == 'minmax':
        scaler = MinMaxScaler()
    else:
        raise ValueError("scaler_type must be 'standard' or 'minmax'")
    
    # Scale numeric columns
    if numeric_cols:
        df_scaled[numeric_cols] = scaler.fit_transform(df_scaled[numeric_cols])
    
    return df_scaled, scaler

In [10]:
# Time-based train-test split for ML models
def time_based_train_test_split(df, date_col='FL_DATE', test_size=0.2, val_size=0.1):
    """
    Split data based on time to prevent data leakage
    """
    if date_col not in df.columns:
        raise ValueError(f"{date_col} not found in dataframe")
    
    # Sort by date
    df_sorted = df.sort_values(date_col)
    
    # Determine split points
    n_samples = len(df_sorted)
    test_start_idx = int(n_samples * (1 - test_size))
    val_start_idx = int(n_samples * (1 - test_size - val_size))
    
    # Split data
    train_data = df_sorted.iloc[:val_start_idx]
    val_data = df_sorted.iloc[val_start_idx:test_start_idx]
    test_data = df_sorted.iloc[test_start_idx:]
    
    print(f"Train set: {len(train_data):,} rows from {train_data[date_col].min()} to {train_data[date_col].max()}")
    print(f"Validation set: {len(val_data):,} rows from {val_data[date_col].min()} to {val_data[date_col].max()}")
    print(f"Test set: {len(test_data):,} rows from {test_data[date_col].min()} to {test_data[date_col].max()}")
    
    return train_data, val_data, test_data

## Execute the ML Preprocessing Pipeline

Now we'll run the complete ML preprocessing pipeline on our base preprocessed data to prepare it for machine learning model training.

In [11]:
# Define function to handle missing values
def drop_90pct_missing_per_column(df, threshold=0.5):
    """
    For columns with missingness > threshold (e.g., 0.5 or 50%),
    drop 90% of rows with missing values in those columns.
    This helps reduce bias while keeping some data for training.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The input dataframe
    threshold : float, default=0.5
        The threshold of missingness percentage to consider a column
        for partial row dropping
        
    Returns:
    --------
    pandas.DataFrame
        Dataframe with some rows dropped for high-missingness columns
    """
    df_result = df.copy()
    
    # Calculate percentage of missing values per column
    missingness = df_result.isnull().mean()
    
    # Get columns with high missingness
    high_missing_cols = missingness[missingness > threshold].index.tolist()
    
    for col in high_missing_cols:
        # Get indices of rows with missing values in this column
        missing_indices = df_result[df_result[col].isnull()].index
        
        # Calculate how many rows to drop (90% of missing)
        num_to_drop = int(len(missing_indices) * 0.9)
        
        # Randomly select indices to drop
        if num_to_drop > 0:
            indices_to_drop = np.random.choice(missing_indices, size=num_to_drop, replace=False)
            
            # Drop the selected rows
            df_result = df_result.drop(indices_to_drop)
            
            print(f"Dropped {num_to_drop} rows with missing values in column '{col}'")
    
    return df_result

In [12]:
# Process chunks with the ML pipeline
def process_chunk_ml(chunk, encoders=None, categorical_cols=None, feature_list=None):
    """
    Apply ML preprocessing to a chunk of base preprocessed data
    """
    # Handle missing values first: for columns with >50% missing, drop 90% of missing rows
    chunk = drop_90pct_missing_per_column(chunk, threshold=0.5)
    # Add ML-specific features
    chunk_ml = create_ml_features(chunk)
    # Handle outliers
    chunk_ml = handle_outliers_ml(chunk_ml)
    # Fit or apply categorical encoders
    if encoders is None:
        encoders, categorical_cols = create_categorical_encoders(chunk_ml)
    chunk_ml = apply_categorical_encodings(chunk_ml, encoders, categorical_cols)
    # Select features if feature_list provided
    if feature_list is not None:
        available_features = [col for col in feature_list if col in chunk_ml.columns]
        chunk_ml = chunk_ml[available_features]
    return chunk_ml, encoders, categorical_cols

In [13]:
# Process all chunks and prepare ML-ready dataset
def prepare_ml_dataset(input_file, output_file, chunk_size=500000):
    """
    Process all chunks and prepare an ML-ready dataset
    """
    encoders = None
    categorical_cols = None
    feature_list = None
    sample_for_feature_selection = None
    first_chunk = True
    
    print(f"Starting ML preprocessing of {input_file}...")
    
    # Process in chunks
    chunks = []
    for i, chunk in enumerate(load_processed_data(input_file, chunk_size=chunk_size)):
        start_time = datetime.now()
        
        # Process the chunk
        processed_chunk, encoders, categorical_cols = process_chunk_ml(chunk, encoders, categorical_cols)
        
        # If this is the first chunk, use it for feature selection
        if i == 0:
            # Save a sample for feature selection
            if len(processed_chunk) > 10000:
                sample_for_feature_selection = processed_chunk.sample(10000, random_state=42)
            else:
                sample_for_feature_selection = processed_chunk
        
        chunks.append(processed_chunk)
        
        # Print progress
        end_time = datetime.now()
        elapsed = (end_time - start_time).total_seconds()
        print(f"Processed chunk {i+1}: {len(processed_chunk):,} rows in {elapsed:.2f} seconds")
        
        # To save memory, periodically combine and save chunks
        if len(chunks) >= 5 or (i == 0 and first_chunk):
            combined = pd.concat(chunks)
            
            # If this is the first batch, do feature selection
            if first_chunk:
                print("Performing feature selection...")
                selected_features, correlations = select_features_ml(sample_for_feature_selection)
                feature_list = selected_features
                print(f"Selected {len(feature_list)} features")
                
                # Keep only selected features
                available_features = [col for col in feature_list if col in combined.columns]
                combined = combined[available_features]
                
                # Set a flag to indicate we've done feature selection
                first_chunk = False
                
                # Save with mode='w' (write) for first chunk
                combined.to_csv(output_file, index=False)
            else:
                # Keep only selected features
                available_features = [col for col in feature_list if col in combined.columns]
                combined = combined[available_features]
                
                # Save with mode='a' (append) for subsequent chunks
                combined.to_csv(output_file, mode='a', header=False, index=False)
                
            # Clear chunks list to free memory
            chunks = []
    
    # Save any remaining chunks
    if chunks:
        combined = pd.concat(chunks)
        
        # Keep only selected features
        if feature_list is not None:
            available_features = [col for col in feature_list if col in combined.columns]
            combined = combined[available_features]
            
        # Append to the output file
        combined.to_csv(output_file, mode='a', header=False, index=False)
    
    print(f"ML preprocessing complete! Output saved to {output_file}")
    
    # Return for subsequent operations
    return encoders, categorical_cols, feature_list

In [14]:
# Execute the ML preprocessing pipeline
encoders, categorical_cols, feature_list = prepare_ml_dataset(BASE_PROCESSED_PATH, ML_PROCESSED_PATH)


Starting ML preprocessing of /Users/osx/DataSceince_FL_FR/Forecasting_Flights-DataScience/data/processed/base_preprocessed_flights.csv...
Dropped 450000 rows with missing values in column 'CANCELLATION_CODE'
Dropped 36053 rows with missing values in column 'DELAY_DUE_CARRIER'
Dropped 3605 rows with missing values in column 'DELAY_DUE_WEATHER'
Dropped 360 rows with missing values in column 'DELAY_DUE_NAS'
Dropped 36 rows with missing values in column 'DELAY_DUE_SECURITY'
Dropped 4 rows with missing values in column 'DELAY_DUE_LATE_AIRCRAFT'
Processed chunk 1: 9,942 rows in 0.38 seconds
Performing feature selection...
Selected 21 features
Dropped 450000 rows with missing values in column 'CANCELLATION_CODE'
Dropped 35982 rows with missing values in column 'DELAY_DUE_CARRIER'
Dropped 3599 rows with missing values in column 'DELAY_DUE_WEATHER'
Dropped 360 rows with missing values in column 'DELAY_DUE_NAS'
Dropped 36 rows with missing values in column 'DELAY_DUE_SECURITY'
Dropped 3 rows wit

In [15]:
# Verify the output file
try:
    # Read a sample of the processed data
    ml_sample = pd.read_csv(ML_PROCESSED_PATH, nrows=1000)
    print(f"ML processed data shape: {ml_sample.shape}")
    print("\nColumns in ML processed data:")
    for col in ml_sample.columns:
        print(f"- {col}: {ml_sample[col].dtype}")
    
    print("\nSample of ML processed data:")
    display(ml_sample.head())
    
except Exception as e:
    print(f"Error reading processed file: {e}")

ML processed data shape: (1000, 21)

Columns in ML processed data:
- ARR_DELAY: float64
- DEP_DELAY_IS_OUTLIER: int64
- DELAY_DUE_CARRIER: float64
- DELAY_DUE_LATE_AIRCRAFT: float64
- DELAY_DUE_WEATHER: float64
- TAXI_OUT: float64
- DELAY_DUE_NAS: float64
- ARR_TIME: float64
- WHEELS_ON: float64
- ELAPSED_TIME: float64
- TAXI_IN: float64
- DOT_CODE: int64
- DEP_HOUR_COS: float64
- AIR_TIME: float64
- DEP_TIME: int64
- DELAY_DUE_SECURITY: float64
- FL_NUMBER: int64
- TIME_OF_DAY_Evening: float64
- DEP_DIFF: int64
- TIME_OF_DAY_Night: float64
- DEP_DELAY: float64

Sample of ML processed data:


Unnamed: 0,ARR_DELAY,DEP_DELAY_IS_OUTLIER,DELAY_DUE_CARRIER,DELAY_DUE_LATE_AIRCRAFT,DELAY_DUE_WEATHER,TAXI_OUT,DELAY_DUE_NAS,ARR_TIME,WHEELS_ON,ELAPSED_TIME,TAXI_IN,DOT_CODE,DEP_HOUR_COS,AIR_TIME,DEP_TIME,DELAY_DUE_SECURITY,FL_NUMBER,TIME_OF_DAY_Evening,DEP_DIFF,TIME_OF_DAY_Night,DEP_DELAY
0,35.0,0,10.0,22.0,0.0,30.0,3.0,1925.0,1923.0,168.0,2.0,19393,-0.258819,136.0,1057,0.0,496,0.0,32,0.0,32.0
1,35.0,0,35.0,0.0,0.0,14.0,0.0,2005.0,2003.0,218.0,2.0,19393,-0.965926,202.0,867,0.0,1099,0.0,42,0.0,42.0
2,16.0,0,16.0,0.0,0.0,31.0,0.0,1731.0,1727.0,240.0,4.0,19930,-0.866025,205.0,631,0.0,500,0.0,31,0.0,31.0
3,67.0,0,0.0,67.0,0.0,14.0,0.0,2131.0,2126.0,328.0,5.0,19805,-0.258819,281.0,1143,0.0,285,0.0,88,0.0,88.0
4,21.0,0,1.0,20.0,0.0,14.0,0.0,1620.0,1612.0,65.0,8.0,20397,-0.866025,43.0,915,0.0,5314,0.0,38,0.0,38.0


In [16]:
# Load a sample of the full ML-ready dataset for time-based splitting demonstration
try:
    # Read a larger sample for demonstration
    ml_data_sample = pd.read_csv(ML_PROCESSED_PATH, nrows=100000)
    
    # Convert date column back to datetime if needed
    if 'FL_DATE' in ml_data_sample.columns:
        ml_data_sample['FL_DATE'] = pd.to_datetime(ml_data_sample['FL_DATE'])
    
    # Perform time-based train-test-validation split
    train_data, val_data, test_data = time_based_train_test_split(ml_data_sample)
    
    # Visualize the distribution of the target variable in each split
    plt.figure(figsize=(15, 5))
    
    plt.subplot(131)
    train_data['DEP_DELAY'].hist(bins=50, alpha=0.7)
    plt.title('Train Set DEP_DELAY Distribution')
    plt.xlabel('Departure Delay (minutes)')
    plt.ylabel('Frequency')
    
    plt.subplot(132)
    val_data['DEP_DELAY'].hist(bins=50, alpha=0.7)
    plt.title('Validation Set DEP_DELAY Distribution')
    plt.xlabel('Departure Delay (minutes)')
    
    plt.subplot(133)
    test_data['DEP_DELAY'].hist(bins=50, alpha=0.7)
    plt.title('Test Set DEP_DELAY Distribution')
    plt.xlabel('Departure Delay (minutes)')
    
    plt.tight_layout()
    plt.show()
    
except Exception as e:
    print(f"Error in time-based splitting demonstration: {e}")

Error in time-based splitting demonstration: FL_DATE not found in dataframe


## Summary of Machine Learning Preprocessing

The ML preprocessing pipeline has:

1. Added ML-specific engineered features
2. Handled outliers using winsorization
3. Encoded categorical variables using appropriate techniques
4. Selected the most relevant features
5. Prepared the data for time-based validation
6. Created a dataset ready for ML model training

This ML-ready dataset is optimized for traditional machine learning algorithms like XGBoost, Random Forest, etc. It includes properly encoded categorical variables, handles outliers and missing values, and provides a robust selection of features with strong predictive power.

# Preprocessing Optimizations for Gradient Boosting Models

The preprocessing pipeline we've built is already well-suited for traditional ML models, but we can optimize it specifically for gradient boosting models like XGBoost, LightGBM, and GBDT. These models have different preprocessing requirements compared to other algorithms.

## Key Optimizations for Gradient Boosting Models:

1. **Feature scaling is unnecessary** - Tree-based models are invariant to monotonic transformations
2. **Categorical encoding can be simplified** - Some models can handle categorical features directly
3. **Special handling for missing values** - GB models can learn patterns from missing data
4. **Feature interaction importance** - These models benefit from explicit interaction features
5. **Optimal sampling strategy** - Balanced sampling for imbalanced delay distributions

In [17]:
# Optimized handling of missing values for gradient boosting models
def handle_missing_for_gb_models(df):
    """
    Optimized missing value handling for gradient boosting models.
    Instead of dropping rows, preserve missing values where possible
    since XGBoost/LightGBM can handle them natively.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The input dataframe
        
    Returns:
    --------
    pandas.DataFrame
        Dataframe with optimized missing value handling
    """
    df_result = df.copy()
    
    # Calculate percentage of missing values per column
    missingness = df_result.isnull().mean()
    
    # For columns with extreme missingness (>80%), drop them
    extreme_missing_cols = missingness[missingness > 0.8].index.tolist()
    
    if extreme_missing_cols:
        print(f"Dropping columns with >80% missing values: {extreme_missing_cols}")
        df_result = df_result.drop(columns=extreme_missing_cols)
    
    # For columns with high missingness (>50% but ≤80%), keep as is
    # Gradient boosting models can handle missing values directly
    high_missing_cols = missingness[(missingness > 0.5) & (missingness <= 0.8)].index.tolist()
    
    if high_missing_cols:
        print(f"Keeping columns with missing values for GB models to handle: {high_missing_cols}")
    
    # For numeric columns with moderate missingness (10-50%), 
    # add a binary indicator of missingness before imputation
    moderate_missing_cols = [col for col in df_result.select_dtypes(include=[np.number]).columns 
                           if 0.1 < missingness.get(col, 0) <= 0.5]
    
    for col in moderate_missing_cols:
        # Create a missing indicator feature
        df_result[f"{col}_IS_MISSING"] = df_result[col].isnull().astype(int)
    
    return df_result

In [18]:
# Optimized categorical encoding for gradient boosting models
def encode_categoricals_for_gb_models(df, cat_cols=None, target_col='DEP_DELAY'):
    """
    Optimized categorical encoding for gradient boosting models.
    XGBoost and LightGBM have different optimal encoding strategies:
    
    1. For LightGBM: Can use categorical features directly with categorical_feature parameter
    2. For XGBoost: Requires encoding, but target encoding often works better than one-hot
    
    This function creates both options so they're available during model training.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The input dataframe
    cat_cols : list, default=None
        List of categorical columns. If None, auto-detected.
    target_col : str, default='DEP_DELAY'
        The target column for target encoding
        
    Returns:
    --------
    pandas.DataFrame
        Dataframe with optimized categorical encodings
    dict
        Dictionary of categorical columns for LightGBM
    """
    df_result = df.copy()
    
    # Auto-detect categorical columns if not provided
    if cat_cols is None:
        cat_cols = df_result.select_dtypes(include=['object', 'category']).columns.tolist()
    
    # Create a dictionary for LightGBM categorical features
    lightgbm_cat_cols = {}
    
    # Dictionary to store category mappings (for XGBoost)
    category_maps = {}
    
    for col in cat_cols:
        if col not in df_result.columns:
            continue
            
        # 1. Convert to category dtype (for LightGBM)
        df_result[col] = df_result[col].astype('category')
        lightgbm_cat_cols[col] = df_result[col].cat.categories.tolist()
        
        # Track original order for categories (useful for model interpretation)
        category_maps[col] = {val: idx for idx, val in enumerate(df_result[col].cat.categories)}
        
        # 2. Add label encoding (for XGBoost) - uses the underlying category codes
        df_result[f"{col}_encoded"] = df_result[col].cat.codes
        
        # 3. For high-cardinality columns (>15 categories), add target encoding
        if df_result[col].nunique() > 15 and target_col in df_result.columns:
            # Calculate target mean per category with smoothing
            global_mean = df_result[target_col].mean()
            category_stats = df_result.groupby(col)[target_col].agg(['mean', 'count'])
            
            # Apply smoothing to reduce impact of rare categories
            smoothing_factor = 100  # Adjust based on data size
            category_stats['smooth_mean'] = (
                (category_stats['count'] * category_stats['mean'] + 
                 smoothing_factor * global_mean) / 
                (category_stats['count'] + smoothing_factor)
            )
            
            # Map the smoothed means back to the dataframe
            df_result[f"{col}_target_mean"] = df_result[col].map(category_stats['smooth_mean'])
    
    return df_result, lightgbm_cat_cols, category_maps

In [19]:
# Advanced feature engineering for gradient boosting models
def create_advanced_features_for_gb(df):
    """
    Create advanced features specifically beneficial for gradient boosting models.
    GB models can automatically find some interactions, but explicit engineering
    of certain features can still improve performance.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The input dataframe
        
    Returns:
    --------
    pandas.DataFrame
        Dataframe with advanced features
    """
    df_result = df.copy()
    
    # === 1. TEMPORAL LAG & AGGREGATION FEATURES ===
    # These help the model capture temporal patterns that simple features might miss
    
    # Check if we have the necessary time features
    if all(col in df_result.columns for col in ['FL_DATE', 'ORIGIN', 'DEST', 'DEP_DELAY']):
        # Convert date to datetime if needed
        if df_result['FL_DATE'].dtype != 'datetime64[ns]':
            df_result['FL_DATE'] = pd.to_datetime(df_result['FL_DATE'], errors='coerce')
        
        # Sort by date for proper lag calculations
        df_result = df_result.sort_values('FL_DATE')
        
        # Create origin airport delay aggregations (rolling windows)
        # Group by origin airport and date, then calculate mean delays
        if len(df_result) > 10000:  # Only if we have enough data
            try:
                origin_daily_delays = df_result.groupby(['ORIGIN', pd.Grouper(key='FL_DATE', freq='D')])['DEP_DELAY'].mean()
                origin_daily_delays = origin_daily_delays.reset_index()
                
                # Create a lookup dictionary for faster processing
                lookup_dict = {}
                for _, row in origin_daily_delays.iterrows():
                    origin = row['ORIGIN']
                    date = row['FL_DATE']
                    delay = row['DEP_DELAY']
                    if origin not in lookup_dict:
                        lookup_dict[origin] = {}
                    lookup_dict[origin][date.date()] = delay
                
                # Function to look up previous day's average delay
                def get_prev_day_delay(row):
                    origin = row['ORIGIN']
                    date = row['FL_DATE'].date()
                    prev_date = (row['FL_DATE'] - pd.Timedelta(days=1)).date()
                    
                    if origin in lookup_dict and prev_date in lookup_dict[origin]:
                        return lookup_dict[origin][prev_date]
                    return np.nan
                
                # Apply the function to create the feature
                df_result['ORIGIN_PREV_DAY_DELAY'] = df_result.apply(get_prev_day_delay, axis=1)
                print("Added temporal lag features based on airport delay patterns")
            except Exception as e:
                print(f"Could not create temporal features: {e}")
    
    # === 2. ADVANCED WEATHER/SEASONAL INTERACTION FEATURES ===
    # Weather conditions often interact with airports and routes in complex ways
    
    # Check if we have weather or seasonal features
    seasonal_cols = [col for col in df_result.columns if any(term in col.upper() for term in 
                                                          ['MONTH', 'DAY', 'SEASON', 'WEATHER'])]
    
    if 'ORIGIN' in df_result.columns and seasonal_cols:
        # Example: Create season-airport interaction for major airports
        if 'MONTH_SIN' in df_result.columns and 'MONTH_COS' in df_result.columns:
            # Identify major airports (top 10 by frequency)
            if len(df_result) > 5000:  # Only if we have enough data
                top_airports = df_result['ORIGIN'].value_counts().head(10).index.tolist()
                
                # Create interaction features for major airports and seasons
                for airport in top_airports:
                    df_result[f'{airport}_MONTH_SIN'] = np.where(
                        df_result['ORIGIN'] == airport, 
                        df_result['MONTH_SIN'], 
                        0
                    )
                    df_result[f'{airport}_MONTH_COS'] = np.where(
                        df_result['ORIGIN'] == airport, 
                        df_result['MONTH_COS'], 
                        0
                    )
                print(f"Created airport-seasonal interaction features for {len(top_airports)} major airports")
    
    # === 3. RATIO AND RATE FEATURES ===
    # Gradient boosting models can benefit from ratio features
    
    # Create ratios between related features
    numeric_cols = df_result.select_dtypes(include=[np.number]).columns.tolist()
    
    # Example: Ratio of actual to planned flight time
    if all(col in numeric_cols for col in ['ACTUAL_ELAPSED_TIME', 'CRS_ELAPSED_TIME']):
        df_result['ELAPSED_TIME_RATIO'] = (df_result['ACTUAL_ELAPSED_TIME'] / 
                                          df_result['CRS_ELAPSED_TIME']).replace([np.inf, -np.inf], np.nan)
    
    # Example: Delay per distance unit
    if all(col in numeric_cols for col in ['DEP_DELAY', 'DISTANCE']) and 'DISTANCE' in numeric_cols:
        df_result['DELAY_PER_DISTANCE'] = (df_result['DEP_DELAY'] / 
                                          df_result['DISTANCE'].replace(0, np.nan)).replace([np.inf, -np.inf], np.nan)
    
    return df_result

In [20]:
# Complete preprocessing pipeline for gradient boosting models
def process_for_gradient_boosting(chunk, lightgbm_cat_mapping=None, target_col='DEP_DELAY'):
    """
    Complete preprocessing function optimized for gradient boosting models.
    
    Key optimizations:
    1. Preserves missing values where possible (XGBoost/LightGBM handle them well)
    2. Uses appropriate categorical encoding (label + target encoding for XGBoost)
    3. Creates advanced features beneficial for tree-based models
    4. NO scaling needed - tree-based models are invariant to monotonic transformations
    5. Handles outliers with less aggressive approach (tree models are robust to outliers)
    
    Parameters:
    -----------
    chunk : pandas.DataFrame
        Input data chunk to process
    lightgbm_cat_mapping : dict, default=None
        Existing mappings for categorical features for LightGBM
    target_col : str, default='DEP_DELAY'
        Target column name
        
    Returns:
    --------
    pandas.DataFrame
        Processed dataframe optimized for gradient boosting
    dict
        Categorical feature mappings for LightGBM
    dict
        Category label mappings
    """
    # 1. Handle missing values optimally for GB models
    chunk_gb = handle_missing_for_gb_models(chunk)
    
    # 2. Create basic ML features (route, cyclical encodings, etc.)
    chunk_gb = create_ml_features(chunk_gb)
    
    # 3. Handle outliers with less aggressive approach
    # For tree-based models, we typically allow more outliers since
    # they don't influence the model as much as in linear models
    outlier_columns = ['DEP_DELAY', 'ARR_DELAY', 'TAXI_OUT', 'TAXI_IN', 
                      'ACTUAL_ELAPSED_TIME', 'AIR_TIME', 'DISTANCE']
    
    # Filter to keep only columns that exist in the dataframe
    outlier_columns = [col for col in outlier_columns if col in chunk_gb.columns]
    
    for col in outlier_columns:
        # Only cap extreme outliers beyond 3*IQR (more permissive than standard 1.5*IQR)
        if col in chunk_gb.columns and chunk_gb[col].dtype in [np.number]:
            if col != target_col:  # Don't cap the target variable
                Q1 = chunk_gb[col].quantile(0.25)
                Q3 = chunk_gb[col].quantile(0.75)
                IQR = Q3 - Q1
                
                lower_bound = Q1 - 3 * IQR
                upper_bound = Q3 + 3 * IQR
                
                # Cap extreme outliers only
                chunk_gb[col] = chunk_gb[col].clip(lower_bound, upper_bound)
            else:
                # For the target column, just add an outlier flag
                Q1 = chunk_gb[col].quantile(0.25)
                Q3 = chunk_gb[col].quantile(0.75)
                IQR = Q3 - Q1
                
                lower_bound = Q1 - 3 * IQR
                upper_bound = Q3 + 3 * IQR
                
                chunk_gb[f'{col}_IS_OUTLIER'] = ((chunk_gb[col] < lower_bound) | 
                                               (chunk_gb[col] > upper_bound)).astype(int)
    
    # 4. Apply categorical encoding optimized for gradient boosting
    chunk_gb, lightgbm_cats, cat_maps = encode_categoricals_for_gb_models(
        chunk_gb, cat_cols=None, target_col=target_col)
    
    # Update the LightGBM categorical mapping if it exists
    if lightgbm_cat_mapping is None:
        lightgbm_cat_mapping = lightgbm_cats
    else:
        # Update with any new categories found
        for col, cats in lightgbm_cats.items():
            if col not in lightgbm_cat_mapping:
                lightgbm_cat_mapping[col] = cats
    
    # 5. Create advanced features specifically for gradient boosting
    chunk_gb = create_advanced_features_for_gb(chunk_gb)
    
    # NO SCALING needed - tree-based models are invariant to monotonic transformations
    
    return chunk_gb, lightgbm_cat_mapping, cat_maps

In [21]:
# Prepare dataset optimized for gradient boosting models
def prepare_gradient_boosting_dataset(input_file, output_file, chunk_size=500000):
    """
    Process all chunks and prepare a dataset optimized for gradient boosting models.
    
    Parameters:
    -----------
    input_file : str
        Path to input data file
    output_file : str
        Path to output processed file
    chunk_size : int, default=500000
        Size of chunks to process at once
        
    Returns:
    --------
    tuple
        (lightgbm_category_mapping, category_maps, important_features)
    """
    lightgbm_cat_mapping = None
    category_maps = {}
    important_features = None
    first_chunk = True
    feature_importances = None
    
    print(f"Starting Gradient Boosting preprocessing of {input_file}...")
    
    # Process in chunks
    chunks = []
    for i, chunk in enumerate(load_processed_data(input_file, chunk_size=chunk_size)):
        start_time = datetime.now()
        
        # Process the chunk with GB-optimized pipeline
        processed_chunk, lightgbm_cat_mapping, cat_maps = process_for_gradient_boosting(
            chunk, lightgbm_cat_mapping)
        
        # Update category maps
        for col, mapping in cat_maps.items():
            if col not in category_maps:
                category_maps[col] = mapping
        
        # For the first chunk, try to get feature importances using a quick XGBoost model
        if i == 0 and len(processed_chunk) > 5000:
            try:
                # Only need to do feature importance on a sample
                sample_size = min(10000, len(processed_chunk))
                sample = processed_chunk.sample(sample_size, random_state=42)
                
                # Check if XGBoost is available
                try:
                    import xgboost as xgb
                    
                    # Prepare data for a quick XGBoost run
                    X = sample.select_dtypes(include=[np.number])
                    X = X.drop(target_col, axis=1, errors='ignore')
                    y = sample[target_col]
                    
                    # Train a quick model to get feature importances
                    dtrain = xgb.DMatrix(X, label=y)
                    params = {
                        'max_depth': 3,
                        'eta': 0.1,
                        'objective': 'reg:squarederror',
                        'nthread': 4,
                        'verbosity': 0
                    }
                    gb_model = xgb.train(params, dtrain, num_boost_round=20)
                    
                    # Get feature importances
                    feature_importances = pd.Series(gb_model.get_score(importance_type='gain'))
                    feature_importances = feature_importances.sort_values(ascending=False)
                    
                    # Select important features
                    important_features = feature_importances.index.tolist()
                    print(f"Identified {len(important_features)} important features from XGBoost")
                    
                    # Add the target column
                    if target_col not in important_features:
                        important_features.append(target_col)
                        
                except ImportError:
                    print("XGBoost not available. Skipping feature importance calculation.")
                    # If XGBoost is not available, keep all features
                    important_features = processed_chunk.columns.tolist()
            except Exception as e:
                print(f"Error computing feature importances: {e}")
                important_features = processed_chunk.columns.tolist()
        
        chunks.append(processed_chunk)
        
        # Print progress
        end_time = datetime.now()
        elapsed = (end_time - start_time).total_seconds()
        print(f"Processed chunk {i+1}: {len(processed_chunk):,} rows in {elapsed:.2f} seconds")
        
        # To save memory, periodically combine and save chunks
        if len(chunks) >= 5 or (i == 0 and first_chunk):
            combined = pd.concat(chunks)
            
            # Keep only important features if they've been identified
            if important_features is not None:
                # Filter to available features
                available_features = [col for col in important_features if col in combined.columns]
                combined = combined[available_features]
            
            # Save the data
            if first_chunk:
                # First chunk, use write mode
                combined.to_csv(output_file, index=False)
                first_chunk = False
            else:
                # Subsequent chunks, append
                combined.to_csv(output_file, mode='a', header=False, index=False)
            
            # Clear chunks to free memory
            chunks = []
    
    # Save any remaining chunks
    if chunks:
        combined = pd.concat(chunks)
        
        # Keep only important features if they've been identified
        if important_features is not None:
            available_features = [col for col in important_features if col in combined.columns]
            combined = combined[available_features]
        
        # Append to output
        combined.to_csv(output_file, mode='a', header=False, index=False)
    
    print(f"Gradient Boosting preprocessing complete! Output saved to {output_file}")
    
    # Also save the LightGBM categorical feature mapping
    lightgbm_mapping_file = os.path.splitext(output_file)[0] + '_lightgbm_cats.json'
    
    import json
    # Convert to a serializable format
    serializable_mapping = {}
    for col, cats in lightgbm_cat_mapping.items():
        serializable_mapping[col] = [str(c) for c in cats]
    
    with open(lightgbm_mapping_file, 'w') as f:
        json.dump(serializable_mapping, f)
    
    print(f"LightGBM categorical mapping saved to {lightgbm_mapping_file}")
    
    # Return for subsequent operations
    return lightgbm_cat_mapping, category_maps, important_features

## Executing the Gradient Boosting Preprocessing Pipeline

Let's execute our optimized preprocessing pipeline for gradient boosting models and compare it with the standard pipeline. The key differences in our approach are:

1. **Missing Values**: Preserve them where possible instead of dropping rows
2. **Categorical Encoding**: Create multiple encoding types (label + target) with LightGBM compatibility
3. **Feature Engineering**: Advanced temporal features and airport-specific interactions
4. **Feature Selection**: Using XGBoost's feature importance for early dimension reduction
5. **Outlier Handling**: More permissive approach since tree models are more robust
6. **No Scaling**: Removed unnecessary scaling that doesn't benefit tree-based models

This pipeline is specifically optimized for XGBoost, LightGBM, and other gradient boosting models.

In [22]:
# Define the output path for the gradient boosting optimized dataset
GB_PROCESSED_PATH = path.join(project_root, 'data', 'processed', 'ml_ready_flights', 'gb_ready_flights.csv')

# Make sure the directory exists
os.makedirs(os.path.dirname(GB_PROCESSED_PATH), exist_ok=True)

# Execute the gradient boosting preprocessing pipeline
lightgbm_cats, category_maps, important_features = prepare_gradient_boosting_dataset(
    BASE_PROCESSED_PATH, GB_PROCESSED_PATH)

# Print paths for reference
print(f"Standard ML processed data path: {ML_PROCESSED_PATH}")
print(f"Gradient Boosting optimized data path: {GB_PROCESSED_PATH}")
print(f"LightGBM categorical mapping file: {os.path.splitext(GB_PROCESSED_PATH)[0] + '_lightgbm_cats.json'}")

dataframesize = pd.read_csv(GB_PROCESSED_PATH)
print(f"Gradient Boosting processed data shape: {dataframesize.shape}")




Starting Gradient Boosting preprocessing of /Users/osx/DataSceince_FL_FR/Forecasting_Flights-DataScience/data/processed/base_preprocessed_flights.csv...
Dropping columns with >80% missing values: ['CANCELLATION_CODE', 'DELAY_DUE_CARRIER', 'DELAY_DUE_WEATHER', 'DELAY_DUE_NAS', 'DELAY_DUE_SECURITY', 'DELAY_DUE_LATE_AIRCRAFT']
Added temporal lag features based on airport delay patterns
Created airport-seasonal interaction features for 10 major airports
Error computing feature importances: name 'target_col' is not defined
Processed chunk 1: 500,000 rows in 23.36 seconds
Dropping columns with >80% missing values: ['CANCELLATION_CODE', 'DELAY_DUE_CARRIER', 'DELAY_DUE_WEATHER', 'DELAY_DUE_NAS', 'DELAY_DUE_SECURITY', 'DELAY_DUE_LATE_AIRCRAFT']
Added temporal lag features based on airport delay patterns
Created airport-seasonal interaction features for 10 major airports
Processed chunk 2: 500,000 rows in 23.69 seconds
Dropping columns with >80% missing values: ['CANCELLATION_CODE', 'DELAY_DUE_C

## Comparison of Standard vs. Gradient Boosting Optimized Preprocessing

Let's summarize the key differences between our two preprocessing approaches:

| Preprocessing Step | Standard Pipeline | Gradient Boosting Optimized Pipeline |
|-------------------|-------------------|--------------------------------------|
| **Missing Values** | Drops 90% of rows with missing values in columns with >50% missingness | Preserves missing values where possible, adds missing indicators |
| **Categorical Encoding** | One-hot for low cardinality, ordinal for high cardinality | Multiple encodings (label + target), LightGBM compatibility, smoothed target encoding |
| **Feature Engineering** | Basic features (route, cyclical, etc.) | Advanced temporal lag features, airport-seasonal interactions, rate/ratio features |
| **Outlier Handling** | IQR method with 1.5 factor | More permissive 3*IQR approach (trees handle outliers better) |
| **Feature Selection** | Correlation-based | XGBoost feature importance-based |
| **Scaling** | Included but not always applied | Removed (unnecessary for tree models) |
| **Data Chunking** | Basic chunking | Same approach but with optimized processing |

When to use each pipeline:

- **Standard Pipeline**: For linear models, neural networks, or mixed model ensembles
- **Gradient Boosting Pipeline**: For XGBoost, LightGBM, CatBoost, and other tree-based gradient boosting models

The optimized pipeline leverages the specific strengths of gradient boosting models, particularly their ability to:
- Handle missing values natively
- Work well with categorical features
- Capture complex non-linear patterns
- Be robust to outliers
- Automatically determine feature importance