# Deep Learning Preprocessing Pipeline for Flight Delay Data

This notebook builds upon the base preprocessing pipeline to create features specifically optimized for deep learning models such as neural networks. Deep learning models typically require specialized preprocessing including proper normalization, embedding-friendly encodings for categorical variables, and structured data formats compatible with deep learning frameworks.

## Key Processing Steps:
1. Loading the base preprocessed data
2. Feature engineering specific to deep learning models
3. Advanced encoding for categorical variables (embeddings)
4. Data normalization (standardization)
5. Sequence generation for RNNs and LSTMs
6. Data batching and formatting for deep learning frameworks
7. Time-based validation strategy
8. Exporting the processed data for DL model training

In [None]:
# Import required libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from datetime import datetime
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import warnings

# Ignore warnings
warnings.filterwarnings('ignore')

# Configure paths dynamically using relative paths
import os.path as path

# Get the directory of the current notebook
notebook_dir = path.dirname(path.abspath('__file__'))
# Get project root (parent of notebooks directory)
project_root = path.abspath(path.join(notebook_dir, '..', '..'))

# Define paths relative to project root
BASE_PROCESSED_PATH = path.join(project_root, 'data', 'processed', 'base_preprocessed_flights.csv')
DL_PROCESSED_PATH = path.join(project_root, 'data', 'processed', 'dl_ready_flights')
DL_MODEL_PATH = path.join(project_root, 'models', 'dl')

# Create directories if they don't exist
os.makedirs(os.path.dirname(DL_PROCESSED_PATH), exist_ok=True)
os.makedirs(DL_MODEL_PATH, exist_ok=True)

print(f"Base processed data path: {BASE_PROCESSED_PATH}")
print(f"DL processed data path: {DL_PROCESSED_PATH}")
print(f"DL model path: {DL_MODEL_PATH}")

# Display settings
pd.set_option('display.max_columns', None)
print("Libraries and paths configured.")

Libraries and paths configured.


In [14]:
# Function to load data in chunks
def load_processed_data(file_path, chunk_size=500000):
    """
    Generator function to load preprocessed data in chunks
    """
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        # Convert date columns to datetime
        date_columns = [col for col in chunk.columns if 'DATE' in col.upper()]
        for col in date_columns:
            chunk[col] = pd.to_datetime(chunk[col], errors='coerce')
        
        yield chunk

In [15]:
# Inspect the data
first_chunk = next(load_processed_data(BASE_PROCESSED_PATH))

print(f"Data shape of first chunk: {first_chunk.shape}")
print("\nColumns and data types:")
for col in first_chunk.columns:
    print(f"- {col}: {first_chunk[col].dtype}")

print("\nSample data (first 5 rows):")
display(first_chunk.head())

Data shape of first chunk: (500000, 28)

Columns and data types:
- FL_DATE: datetime64[ns]
- ORIGIN: object
- DEST: object
- CRS_DEP_TIME: int64
- DEP_TIME: float64
- DEP_DELAY: float64
- TAXI_OUT: float64
- WHEELS_OFF: float64
- WHEELS_ON: float64
- TAXI_IN: float64
- CRS_ARR_TIME: int64
- ARR_TIME: float64
- ARR_DELAY: float64
- CANCELLED: int64
- CANCELLATION_CODE: object
- DIVERTED: int64
- CRS_ELAPSED_TIME: float64
- AIR_TIME: float64
- DISTANCE: float64
- YEAR: int64
- MONTH: int64
- DAY_OF_MONTH: int64
- DAY_OF_WEEK: int64
- QUARTER: int64
- SEASON: int64
- IS_HOLIDAY_SEASON: int64
- DEP_HOUR: int64
- TIME_OF_DAY: object

Sample data (first 5 rows):


Unnamed: 0,FL_DATE,ORIGIN,DEST,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,CANCELLED,CANCELLATION_CODE,DIVERTED,CRS_ELAPSED_TIME,AIR_TIME,DISTANCE,YEAR,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,QUARTER,SEASON,IS_HOLIDAY_SEASON,DEP_HOUR,TIME_OF_DAY
0,2019-01-09,FLL,EWR,715,1151.0,-4.0,19.0,1210.0,1443.0,4.0,901,1447.0,-14.0,0,,0,186.0,153.0,1065.0,2019,1,9,3,1,1,0,11,Morning
1,2022-11-19,MSP,SEA,1280,2114.0,-6.0,9.0,2123.0,2232.0,38.0,1395,2310.0,-5.0,0,,0,235.0,189.0,1399.0,2022,11,19,6,4,4,1,21,Evening
2,2022-07-22,DEN,MSP,594,1000.0,6.0,20.0,1020.0,1247.0,5.0,772,1252.0,0.0,0,,0,118.0,87.0,680.0,2022,7,22,5,3,3,0,9,Morning
3,2023-03-06,MSP,SFO,969,1608.0,-1.0,27.0,1635.0,1844.0,9.0,1109,1853.0,24.0,0,,0,260.0,249.0,1589.0,2023,3,6,1,1,2,0,16,Afternoon
4,2019-07-31,DAL,OKC,610,1237.0,147.0,15.0,1252.0,1328.0,3.0,670,1331.0,141.0,0,,0,60.0,36.0,181.0,2019,7,31,3,3,3,0,10,Morning


## Feature Engineering Specific to Deep Learning Models

Deep learning models require specific preprocessing approaches. Let's define functions to prepare data for neural networks:

1. Proper categorical encodings for embeddings
2. Feature normalization
3. Sequence data preparation for recurrent networks
4. Structured representations for tabular data

In [16]:
# Create DL-specific features
def create_dl_features(df):
    """
    Create features specifically useful for deep learning models
    """
    df_featured = df.copy()
    
    # ======== TEMPORAL FEATURES FOR SEQUENTIAL MODELS ========
    
    # Sort by date and time if available
    if 'FL_DATE' in df_featured.columns and 'DEP_TIME' in df_featured.columns:
        df_featured = df_featured.sort_values(['FL_DATE', 'DEP_TIME'])
    
    # Create normalized time features (better for neural networks)
    if 'MONTH' in df_featured.columns:
        # Normalize month to [0,1]
        df_featured['MONTH_NORM'] = (df_featured['MONTH'] - 1) / 11
    
    if 'DAY_OF_MONTH' in df_featured.columns:
        # Normalize day to [0,1]
        df_featured['DAY_OF_MONTH_NORM'] = (df_featured['DAY_OF_MONTH'] - 1) / 30
    
    if 'DAY_OF_WEEK' in df_featured.columns:
        # Normalize day of week to [0,1]
        df_featured['DAY_OF_WEEK_NORM'] = (df_featured['DAY_OF_WEEK'] - 1) / 6
    
    if 'DEP_HOUR' in df_featured.columns:
        # Normalize hour to [0,1]
        df_featured['DEP_HOUR_NORM'] = df_featured['DEP_HOUR'] / 23
    
    # Create sine and cosine features for cyclical time variables
    # These are particularly useful for neural networks to understand cyclical patterns
    if 'MONTH' in df_featured.columns:
        df_featured['MONTH_SIN'] = np.sin(2 * np.pi * df_featured['MONTH'] / 12)
        df_featured['MONTH_COS'] = np.cos(2 * np.pi * df_featured['MONTH'] / 12)
    
    if 'DAY_OF_WEEK' in df_featured.columns:
        df_featured['DAY_OF_WEEK_SIN'] = np.sin(2 * np.pi * df_featured['DAY_OF_WEEK'] / 7)
        df_featured['DAY_OF_WEEK_COS'] = np.cos(2 * np.pi * df_featured['DAY_OF_WEEK'] / 7)
    
    if 'DEP_HOUR' in df_featured.columns:
        df_featured['DEP_HOUR_SIN'] = np.sin(2 * np.pi * df_featured['DEP_HOUR'] / 24)
        df_featured['DEP_HOUR_COS'] = np.cos(2 * np.pi * df_featured['DEP_HOUR'] / 24)
    
    # ======== AGGREGATED FEATURES FOR CONTEXT ========
    # Note: In a full pipeline, these would be pre-computed from historical data
    # Here we're calculating them within the chunk as an approximation
    
    if 'ORIGIN' in df_featured.columns:
        # Airport busy level - compute average daily flights per origin
        airport_flights = df_featured.groupby('ORIGIN').size().reset_index(name='ORIGIN_FLIGHTS')
        df_featured = df_featured.merge(airport_flights, on='ORIGIN', how='left')
        
        # Normalize the airport flight count
        max_flights = df_featured['ORIGIN_FLIGHTS'].max()
        if max_flights > 0:  # Avoid division by zero
            df_featured['ORIGIN_FLIGHTS_NORM'] = df_featured['ORIGIN_FLIGHTS'] / max_flights
    
    if 'OP_CARRIER' in df_featured.columns:
        # Carrier size - compute total flights per carrier
        carrier_flights = df_featured.groupby('OP_CARRIER').size().reset_index(name='CARRIER_FLIGHTS')
        df_featured = df_featured.merge(carrier_flights, on='OP_CARRIER', how='left')
        
        # Normalize the carrier flight count
        max_flights = df_featured['CARRIER_FLIGHTS'].max()
        if max_flights > 0:  # Avoid division by zero
            df_featured['CARRIER_FLIGHTS_NORM'] = df_featured['CARRIER_FLIGHTS'] / max_flights
    
    # ======== SEQUENTIAL FEATURES ========
    
    # Code for creating sequential features will be in a separate function
    
    return df_featured

In [17]:
# Create label encoders for categorical variables (for embeddings)
def create_label_encoders(df, categorical_cols=None):
    """
    Create label encoders for categorical variables to be used in embeddings
    """
    if categorical_cols is None:
        # Identify categorical columns automatically
        categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    
    # Create a dictionary to store encoders
    encoders = {}
    vocab_sizes = {}
    
    # Create label encoder for each categorical column
    for col in categorical_cols:
        if col not in df.columns:
            continue
        
        # Create and fit the label encoder
        encoder = LabelEncoder()
        encoder.fit(df[col].astype(str).fillna('UNKNOWN'))
        encoders[col] = encoder
        
        # Store vocabulary size (number of unique categories + 1 for unknown)
        vocab_sizes[col] = len(encoder.classes_) + 1
    
    return encoders, vocab_sizes, categorical_cols

In [18]:
# Apply label encoders and create embedding-ready data
def apply_label_encodings(df, encoders, categorical_cols):
    """
    Apply label encoders and prepare data for embeddings
    """
    df_encoded = df.copy()
    
    # Apply encoding to each categorical column
    for col, encoder in encoders.items():
        if col in df_encoded.columns:
            # Convert to string and fill NAs
            col_data = df_encoded[col].astype(str).fillna('UNKNOWN')
            
            # Handle values not seen during training
            unique_vals = set(col_data.unique())
            known_vals = set(encoder.classes_)
            unknown_vals = unique_vals - known_vals
            
            if unknown_vals:
                # Replace unknown values with a known value (e.g., 'UNKNOWN')
                for val in unknown_vals:
                    col_data = col_data.replace(val, 'UNKNOWN')
            
            # Transform using the encoder
            try:
                df_encoded[f'{col}_ENCODED'] = encoder.transform(col_data)
            except:
                # If transformation fails, use a default value
                print(f"Error encoding {col}, using default values")
                df_encoded[f'{col}_ENCODED'] = 0
            
            # Drop the original column (we keep the encoded version for embeddings)
            df_encoded = df_encoded.drop(col, axis=1)
    
    return df_encoded

In [19]:
# Normalize numeric features for deep learning
def normalize_features_dl(df, exclude_cols=None, scaler=None):
    """
    Normalize numeric features for neural networks
    """
    if exclude_cols is None:
        exclude_cols = []
    
    df_norm = df.copy()
    
    # Get numeric columns excluding those in exclude_cols
    numeric_cols = df_norm.select_dtypes(include=[np.number]).columns.tolist()
    numeric_cols = [col for col in numeric_cols if col not in exclude_cols]
    
    # Fit or transform with scaler
    if scaler is None:
        # First call - fit a new scaler
        scaler = StandardScaler()
        df_norm[numeric_cols] = scaler.fit_transform(df_norm[numeric_cols])
    else:
        # Subsequent calls - use the fitted scaler
        df_norm[numeric_cols] = scaler.transform(df_norm[numeric_cols])
    
    return df_norm, scaler

In [20]:
# Create sequence data for RNN/LSTM models
def create_sequence_data(df, seq_length=10, time_col='FL_DATE', group_cols=None, target_col='DEP_DELAY'):
    """
    Create sequences for RNN/LSTM models
    
    This function creates sequences of data by grouping by specified columns
    and ordering by time. Each sequence will have seq_length steps.
    """
    if group_cols is None:
        # Default grouping by origin airport and carrier
        group_cols = ['ORIGIN', 'OP_CARRIER']
    
    # Filter group_cols to include only columns that exist
    group_cols = [col for col in group_cols if col in df.columns]
    
    # Ensure we have at least one grouping column
    if not group_cols:
        raise ValueError("No valid grouping columns found")
    
    # Sort by group and time
    if time_col in df.columns:
        df_sorted = df.sort_values(group_cols + [time_col])
    else:
        # If no time column, just use the existing order
        df_sorted = df
    
    # Lists to store sequences and targets
    X_sequences = []
    y_values = []
    
    # Group by the specified columns
    for _, group in df_sorted.groupby(group_cols):
        # Skip if group is too small
        if len(group) < seq_length + 1:
            continue
        
        # Extract features (exclude target and time column)
        features = group.drop([target_col], axis=1)
        if time_col in features.columns:
            features = features.drop([time_col], axis=1)
        
        # Extract target
        targets = group[target_col].values
        
        # Create sequences
        for i in range(len(group) - seq_length):
            X_sequences.append(features.iloc[i:i+seq_length].values)
            y_values.append(targets[i+seq_length])
    
    # Convert to numpy arrays
    X_sequences = np.array(X_sequences)
    y_values = np.array(y_values)
    
    return X_sequences, y_values

In [21]:
# Time-based train-test-validation split for deep learning
def time_based_split_dl(df, date_col='FL_DATE', test_size=0.2, val_size=0.1):
    """
    Create a time-based split for deep learning models
    """
    if date_col not in df.columns:
        raise ValueError(f"{date_col} not found in dataframe")
    
    # Sort by date
    df_sorted = df.sort_values(date_col)
    
    # Determine split points
    n_samples = len(df_sorted)
    test_start_idx = int(n_samples * (1 - test_size))
    val_start_idx = int(n_samples * (1 - test_size - val_size))
    
    # Split data
    train_data = df_sorted.iloc[:val_start_idx]
    val_data = df_sorted.iloc[val_start_idx:test_start_idx]
    test_data = df_sorted.iloc[test_start_idx:]
    
    print(f"Train set: {len(train_data):,} rows from {train_data[date_col].min()} to {train_data[date_col].max()}")
    print(f"Validation set: {len(val_data):,} rows from {val_data[date_col].min()} to {val_data[date_col].max()}")
    print(f"Test set: {len(test_data):,} rows from {test_data[date_col].min()} to {test_data[date_col].max()}")
    
    return train_data, val_data, test_data

## Execute the DL Preprocessing Pipeline

Now we'll run the DL preprocessing pipeline on our base preprocessed data. For deep learning, we'll create both:
1. A standard tabular dataset for feedforward neural networks
2. Sequence data for RNN/LSTM models

In [None]:
# Process chunks with the DL pipeline
def process_chunk_dl(chunk, encoders=None, categorical_cols=None, scaler=None, exclude_from_scaling=None):
    """
    Apply DL preprocessing to a chunk of base preprocessed data
    """
    # Handle missing values first: for columns with >50% missing, drop 90% of missing rows
    chunk = drop_90pct_missing_per_column(chunk, threshold=0.5)
    # Add DL-specific features
    chunk_dl = create_dl_features(chunk)
    # Fit or apply label encoders for categorical variables
    if encoders is None:
        # First chunk - create encoders
        encoders, vocab_sizes, categorical_cols = create_label_encoders(chunk_dl)
    # Apply encodings
    chunk_dl = apply_label_encodings(chunk_dl, encoders, categorical_cols)
    # Normalize numeric features
    chunk_dl, scaler = normalize_features_dl(chunk_dl, exclude_from_scaling, scaler)
    return chunk_dl, encoders, categorical_cols, scaler

In [23]:
# Process all chunks and save the DL-ready dataset
def prepare_dl_dataset(input_file, output_dir, chunk_size=500000):
    """
    Process all chunks and prepare DL-ready datasets
    """
    encoders = None
    categorical_cols = None
    scaler = None
    
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Define output paths
    tabular_output = os.path.join(output_dir, 'dl_tabular_data.csv')
    encoders_output = os.path.join(output_dir, 'dl_encoders.pkl')
    scaler_output = os.path.join(output_dir, 'dl_scaler.pkl')
    
    print(f"Starting DL preprocessing of {input_file}...")
    
    # Process in chunks
    chunks = []
    for i, chunk in enumerate(load_processed_data(input_file, chunk_size=chunk_size)):
        start_time = datetime.now()
        
        # Process the chunk
        processed_chunk, encoders, categorical_cols, scaler = process_chunk_dl(
            chunk, encoders, categorical_cols, scaler
        )
        
        chunks.append(processed_chunk)
        
        # Print progress
        end_time = datetime.now()
        elapsed = (end_time - start_time).total_seconds()
        print(f"Processed chunk {i+1}: {len(processed_chunk):,} rows in {elapsed:.2f} seconds")
        
        # To save memory, periodically combine and save chunks
        if len(chunks) >= 5:
            combined = pd.concat(chunks)
            
            # Save with mode='a' (append) after first chunk
            if i <= 5:
                combined.to_csv(tabular_output, index=False)
            else:
                combined.to_csv(tabular_output, mode='a', header=False, index=False)
                
            # Clear chunks list to free memory
            chunks = []
    
    # Save any remaining chunks
    if chunks:
        combined = pd.concat(chunks)
        # Check if file exists
        if os.path.exists(tabular_output):
            combined.to_csv(tabular_output, mode='a', header=False, index=False)
        else:
            combined.to_csv(tabular_output, index=False)
    
    # Save encoders and scaler
    with open(encoders_output, 'wb') as f:
        pickle.dump((encoders, categorical_cols), f)
    
    with open(scaler_output, 'wb') as f:
        pickle.dump(scaler, f)
    
    print(f"DL preprocessing complete!")
    print(f"Tabular data saved to: {tabular_output}")
    print(f"Encoders saved to: {encoders_output}")
    print(f"Scaler saved to: {scaler_output}")
    
    return encoders, categorical_cols, scaler

In [24]:
# Execute the DL preprocessing pipeline
encoders, categorical_cols, scaler = prepare_dl_dataset(BASE_PROCESSED_PATH, DL_PROCESSED_PATH)

Starting DL preprocessing of /Users/osx/flightDelayPIPELINE.2/data/processed/base_preprocessed_flights.csv...
Processed chunk 1: 500,000 rows in 1.27 seconds
Processed chunk 1: 500,000 rows in 1.27 seconds
Processed chunk 2: 500,000 rows in 0.84 seconds
Processed chunk 2: 500,000 rows in 0.84 seconds
Error encoding DEST, using default values
Error encoding DEST, using default values
Processed chunk 3: 500,000 rows in 1.02 seconds
Processed chunk 3: 500,000 rows in 1.02 seconds
Error encoding DEST, using default values
Error encoding DEST, using default values
Processed chunk 4: 500,000 rows in 1.37 seconds
Processed chunk 4: 500,000 rows in 1.37 seconds
Error encoding DEST, using default values
Error encoding DEST, using default values
Processed chunk 5: 500,000 rows in 1.07 seconds
Processed chunk 5: 500,000 rows in 1.07 seconds


OSError: [Errno 28] No space left on device

In [None]:
# Verify the output file and create sample sequences
try:
    # Define paths
    tabular_output = os.path.join(DL_PROCESSED_PATH, 'dl_tabular_data.csv')
    
    # Read a sample of the processed data
    dl_sample = pd.read_csv(tabular_output, nrows=10000)
    print(f"DL processed data shape: {dl_sample.shape}")
    
    print("\nColumns in DL processed data:")
    for col in dl_sample.columns:
        print(f"- {col}: {dl_sample[col].dtype}")
    
    print("\nSample of DL processed data:")
    display(dl_sample.head())
    
    # Convert date column back to datetime if needed
    if 'FL_DATE' in dl_sample.columns:
        dl_sample['FL_DATE'] = pd.to_datetime(dl_sample['FL_DATE'])
    
    # Create a small sequence dataset as an example
    print("\nCreating sample sequences for RNN/LSTM models...")
    X_seq, y_seq = create_sequence_data(dl_sample, seq_length=5)
    
    print(f"Sequence data shape: {X_seq.shape}")
    print(f"Target data shape: {y_seq.shape}")
    print("Sample sequence:")
    display(X_seq[0])
    
except Exception as e:
    print(f"Error verifying output: {e}")

In [None]:
# Create train-test-val split for deep learning (small sample for demonstration)
try:
    # Read a sample of the data
    dl_sample = pd.read_csv(os.path.join(DL_PROCESSED_PATH, 'dl_tabular_data.csv'), nrows=50000)
    
    # Convert date column back to datetime if needed
    if 'FL_DATE' in dl_sample.columns:
        dl_sample['FL_DATE'] = pd.to_datetime(dl_sample['FL_DATE'])
        
        # Create time-based splits
        train_data, val_data, test_data = time_based_split_dl(dl_sample)
        
        # Visualize the target distribution in each split
        plt.figure(figsize=(15, 5))
        
        plt.subplot(131)
        plt.hist(train_data['DEP_DELAY'], bins=50, alpha=0.7)
        plt.title('Train Set - Departure Delay Distribution')
        plt.xlabel('Delay (minutes)')
        
        plt.subplot(132)
        plt.hist(val_data['DEP_DELAY'], bins=50, alpha=0.7)
        plt.title('Validation Set - Departure Delay Distribution')
        plt.xlabel('Delay (minutes)')
        
        plt.subplot(133)
        plt.hist(test_data['DEP_DELAY'], bins=50, alpha=0.7)
        plt.title('Test Set - Departure Delay Distribution')
        plt.xlabel('Delay (minutes)')
        
        plt.tight_layout()
        plt.show()
        
except Exception as e:
    print(f"Error creating time-based splits: {e}")

## Creating Model-Ready Batches for Deep Learning Frameworks

In this section, we provide example code for formatting the preprocessed data into batches suitable for deep learning frameworks. This is a demonstration using a small sample, but the same approach can be applied to the full dataset.

In [None]:
# Example: Create batches for a feed-forward neural network
def create_tabular_batches(df, target_col='DEP_DELAY', batch_size=32):
    """
    Create batches for training a tabular deep learning model
    """
    # Separate features and target
    X = df.drop(columns=[target_col])
    y = df[target_col].values
    
    # Create batches
    n_samples = len(X)
    n_batches = n_samples // batch_size
    
    # Lists to store batches
    X_batches = []
    y_batches = []
    
    # Create batches
    for i in range(n_batches):
        start_idx = i * batch_size
        end_idx = start_idx + batch_size
        
        X_batches.append(X.iloc[start_idx:end_idx].values)
        y_batches.append(y[start_idx:end_idx])
    
    return X_batches, y_batches, X.columns.tolist()

In [None]:
# Demonstration: Create tabular batches from a small sample
try:
    # Create batches from the train set
    if 'train_data' in locals():
        # Remove date column if present
        if 'FL_DATE' in train_data.columns:
            train_features = train_data.drop(columns=['FL_DATE'])
        else:
            train_features = train_data
            
        # Create batches
        X_batches, y_batches, feature_names = create_tabular_batches(train_features, batch_size=32)
        
        print(f"Created {len(X_batches)} batches for training")
        print(f"Batch shape: {X_batches[0].shape}")
        print(f"Target shape: {y_batches[0].shape}")
        
        # Show the first few feature names
        print("\nFirst 10 features:")
        for i, name in enumerate(feature_names[:10]):
            print(f"{i+1}. {name}")
        
except Exception as e:
    print(f"Error creating batches: {e}")

## Summary of Deep Learning Preprocessing

The DL preprocessing pipeline has:

1. Added DL-specific engineered features optimized for neural networks
2. Created normalized time features (cyclic encoding using sin/cos)
3. Applied label encoding to categorical variables for embedding layers
4. Normalized numeric features for better convergence
5. Created sequence data for RNN/LSTM models
6. Implemented time-based train-test-validation splitting
7. Demonstrated batch creation for deep learning frameworks

This DL-ready dataset is optimized for both feedforward neural networks (tabular data) and recurrent neural networks (sequence data). The preprocessing enhances model convergence and provides appropriate formats for embedding layers commonly used in deep learning models for categorical variables.