# 03 - Feature Engineering (Simplified)

This notebook creates new features from existing data:
- Temporal features (hour, day, month, season, etc.)
- Cyclical encoding
- Interaction features  
- Categorical encoding

**No lag or rolling features.** Every feature is derived from values in a single row, which allows single-row predictions.

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder  # StandardScaler was imported but never used
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


## Load Cleaned Data

In [3]:
df = pd.read_csv('../processed_data/steel_data_cleaned.csv', parse_dates=['date'])

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
df.head()

Dataset shape: (35040, 11)

Columns: ['date', 'Usage_kWh', 'Lagging_Current_Reactive.Power_kVarh', 'Leading_Current_Reactive_Power_kVarh', 'CO2(tCO2)', 'Lagging_Current_Power_Factor', 'Leading_Current_Power_Factor', 'NSM', 'WeekStatus', 'Day_of_week', 'Load_Type']


|   | date | Usage_kWh | Lagging_Current_Reactive.Power_kVarh | Leading_Current_Reactive_Power_kVarh | CO2(tCO2) | Lagging_Current_Power_Factor | Leading_Current_Power_Factor | NSM | WeekStatus | Day_of_week | Load_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 00:00:00 | 3.42 | 3.46 | 0.0 | 0.0 | 70.3 | 100.0 | 0 | Weekday | Monday | Light_Load |
| 1 | 2018-01-01 00:15:00 | 3.17 | 2.95 | 0.0 | 0.0 | 73.21 | 100.0 | 900 | Weekday | Monday | Light_Load |
| 2 | 2018-01-01 00:30:00 | 4.0 | 4.46 | 0.0 | 0.0 | 66.77 | 100.0 | 1800 | Weekday | Monday | Light_Load |
| 3 | 2018-01-01 00:45:00 | 3.24 | 3.28 | 0.0 | 0.0 | 70.28 | 100.0 | 2700 | Weekday | Monday | Light_Load |
| 4 | 2018-01-01 01:00:00 | 3.31 | 3.56 | 0.0 | 0.0 | 68.09 | 100.0 | 3600 | Weekday | Monday | Light_Load |


## Extract Temporal Features

In [4]:
# Extract time-based features
df['hour'] = df['date'].dt.hour
df['day'] = df['date'].dt.day
df['month'] = df['date'].dt.month
df['dayofweek'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
df['dayofyear'] = df['date'].dt.dayofyear
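# isocalendar().week returns the nullable UInt32 dtype; cast to a plain int for downstream tools
df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)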

# Create season feature
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

df['season'] = df['month'].apply(get_season)

# Is weekend feature
df['is_weekend'] = (df['dayofweek'] >= 5).astype(int)

# Time of day categories
def get_time_of_day(hour):
    if 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    elif 18 <= hour < 22:
        return 'Evening'
    else:
        return 'Night'

df['time_of_day'] = df['hour'].apply(get_time_of_day)

print("Temporal features created!")
print(f"\nNew features: {['hour', 'day', 'month', 'dayofweek', 'quarter', 'dayofyear', 'weekofyear', 'season', 'is_weekend', 'time_of_day']}")

Temporal features created!

New features: ['hour', 'day', 'month', 'dayofweek', 'quarter', 'dayofyear', 'weekofyear', 'season', 'is_weekend', 'time_of_day']
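
The `apply` calls above run a Python function once per row. The following is a minimal sketch of a vectorized equivalent, assuming the same month and hour boundaries as `get_season` and `get_time_of_day`. The `demo` frame and the `SEASON_BY_MONTH` lookup are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the notebook's `df` has the same `date` column.
demo = pd.DataFrame({"date": pd.to_datetime([
    "2018-01-01 00:00", "2018-04-15 07:30", "2018-07-04 13:00",
    "2018-10-31 19:45", "2018-12-25 22:00",
])})

# Hypothetical lookup table with the same boundaries as get_season().
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}
demo["season"] = demo["date"].dt.month.map(SEASON_BY_MONTH)

# np.select takes the first matching condition; hours matching none are 'Night'.
hour = demo["date"].dt.hour.to_numpy()
demo["time_of_day"] = np.select(
    [(hour >= 6) & (hour < 12), (hour >= 12) & (hour < 18), (hour >= 18) & (hour < 22)],
    ["Morning", "Afternoon", "Evening"],
    default="Night",
)

print(demo)
```

On a dataset of this size either approach is fast enough; the vectorized form mainly matters if the same code is reused on much larger frames.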


## Cyclical Encoding for Time Features

In [5]:
# Encode cyclical features using sine and cosine
# This helps models understand that hour 23 and hour 0 are close

# Hour encoding (24 hours)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

# Day of week encoding (7 days)
df['dayofweek_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dayofweek_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)

# Month encoding (12 months)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Day of year encoding (365 days)
df['dayofyear_sin'] = np.sin(2 * np.pi * df['dayofyear'] / 365)
df['dayofyear_cos'] = np.cos(2 * np.pi * df['dayofyear'] / 365)

print("Cyclical features encoded!")
print(f"Shape after cyclical encoding: {df.shape}")

Cyclical features encoded!
Shape after cyclical encoding: (35040, 29)
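
To see why the sine/cosine pair helps, the sketch below compares distances between hours on the raw scale and on the encoded unit circle. The helper names `encode_hour` and `circle_distance` are made up for this illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def circle_distance(h1, h2):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

d_23_0 = circle_distance(23, 0)   # neighbours across midnight
d_0_1 = circle_distance(0, 1)     # neighbours within the day
d_0_12 = circle_distance(0, 12)   # opposite sides of the clock

print(f"raw |23 - 0| = {abs(23 - 0)}, encoded distance = {d_23_0:.3f}")
print(f"raw |0 - 1|  = {abs(0 - 1)}, encoded distance = {d_0_1:.3f}")
print(f"raw |0 - 12| = {abs(0 - 12)}, encoded distance = {d_0_12:.3f}")
```

On the raw scale, 23:00 and 00:00 are 23 units apart. After encoding they are exactly as close as 00:00 and 01:00, while noon and midnight are the farthest pair.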


In [6]:
# Lag features removed: they depend on earlier rows, which are not available for single-row predictions
print("Skipping lag features to allow single-row predictions")

Skipping lag features to allow single-row predictions


In [7]:
# Rolling features removed: they depend on a window of earlier rows, which is not available for single-row predictions
print("Skipping rolling window features to allow single-row predictions")

Skipping rolling window features to allow single-row predictions
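
The sketch below illustrates why lag features conflict with single-row predictions. It assumes a request that carries exactly one observation; the column name `Usage_kWh_lag1` is hypothetical and is not created anywhere in this notebook.

```python
import pandas as pd

# One incoming observation, as a prediction call might receive it.
single_row = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01 00:15:00"]),
    "Usage_kWh": [3.17],
})

# A lag feature needs the previous row, which a single-row input does not have,
# so shift(1) can only produce NaN here.
single_row["Usage_kWh_lag1"] = single_row["Usage_kWh"].shift(1)

# A row-local feature only needs values from the row itself.
single_row["hour"] = single_row["date"].dt.hour

print(single_row)
```

The lag column is NaN, whereas `hour` is fully defined. Every feature in this notebook is of the second kind.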


In [8]:
# Create interaction features between important variables

# Total reactive power
df['total_reactive_power'] = (df['Lagging_Current_Reactive.Power_kVarh'] + 
                               df['Leading_Current_Reactive_Power_kVarh'])

# Power factor difference
df['power_factor_diff'] = (df['Lagging_Current_Power_Factor'] - 
                           df['Leading_Current_Power_Factor'])

# Average power factor
df['avg_power_factor'] = (df['Lagging_Current_Power_Factor'] + 
                          df['Leading_Current_Power_Factor']) / 2

# Ratio features
df['reactive_power_ratio'] = df['Lagging_Current_Reactive.Power_kVarh'] / (df['Leading_Current_Reactive_Power_kVarh'] + 1e-5)

print("Interaction features created!")
print(f"Shape after interaction features: {df.shape}")

Interaction features created!
Shape after interaction features: (35040, 33)
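
One property of `reactive_power_ratio` is worth checking: the preview rows have a leading reactive power of 0.0, so the `1e-5` offset turns the ratio into a very large number. The sketch below reproduces that effect and shows one possible bounded alternative. The `lagging_share` feature is an assumption for illustration, not something the notebook computes, and the third and fourth rows are made-up values added for contrast.

```python
import pandas as pd

demo = pd.DataFrame({
    "lagging": [3.46, 2.95, 0.0, 0.0],   # first two values come from the data preview
    "leading": [0.0, 0.0, 7.5, 0.0],     # 7.5 is a made-up non-zero example
})

# Ratio as computed in the notebook: dividing by (0 + 1e-5) scales the value by 100,000.
demo["reactive_power_ratio"] = demo["lagging"] / (demo["leading"] + 1e-5)

# Hypothetical alternative: lagging share of total reactive power, bounded in [0, 1].
# Rows where both components are zero would give 0/0 (NaN), so they are set to 0.
total = demo["lagging"] + demo["leading"]
demo["lagging_share"] = (demo["lagging"] / total).where(total > 0, 0.0)

print(demo)
```

Tree-based models are largely insensitive to the scale of a feature, but models that rely on feature magnitudes may be dominated by the inflated ratio unless it is scaled or replaced.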


In [9]:
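# Label encoding for categorical columns
# Note: LabelEncoder assigns codes in alphabetical order, so the integers carry no ordinal meaning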
le_load = LabelEncoder()
df['Load_Type_encoded'] = le_load.fit_transform(df['Load_Type'])

le_week = LabelEncoder()
df['WeekStatus_encoded'] = le_week.fit_transform(df['WeekStatus'])

le_day = LabelEncoder()
df['Day_of_week_encoded'] = le_day.fit_transform(df['Day_of_week'])

le_season = LabelEncoder()
df['season_encoded'] = le_season.fit_transform(df['season'])

le_time = LabelEncoder()
df['time_of_day_encoded'] = le_time.fit_transform(df['time_of_day'])

print("Categorical encoding completed!")
print(f"\nLoad_Type mapping: {dict(zip(le_load.classes_, le_load.transform(le_load.classes_)))}")
print(f"WeekStatus mapping: {dict(zip(le_week.classes_, le_week.transform(le_week.classes_)))}")
print(f"Season mapping: {dict(zip(le_season.classes_, le_season.transform(le_season.classes_)))}")
print(f"Time of day mapping: {dict(zip(le_time.classes_, le_time.transform(le_time.classes_)))}")

Categorical encoding completed!

Load_Type mapping: {'Light_Load': 0, 'Maximum_Load': 1, 'Medium_Load': 2}
WeekStatus mapping: {'Weekday': 0, 'Weekend': 1}
Season mapping: {'Fall': 0, 'Spring': 1, 'Summer': 2, 'Winter': 3}
Time of day mapping: {'Afternoon': 0, 'Evening': 1, 'Morning': 2, 'Night': 3}
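
The printed mappings show that the codes follow alphabetical order: `Maximum_Load` (1) sits between `Light_Load` (0) and `Medium_Load` (2). If an ordered code is wanted, one option is an explicit category order, sketched below. The ranking in `LOAD_ORDER` is an assumption about load intensity and does not come from the notebook.

```python
import pandas as pd

load_type = pd.Series(["Light_Load", "Medium_Load", "Maximum_Load", "Light_Load"])

# Mapping printed by the notebook (alphabetical order of the class names).
ALPHABETICAL = {"Light_Load": 0, "Maximum_Load": 1, "Medium_Load": 2}
alphabetical_codes = load_type.map(ALPHABETICAL).tolist()

# Assumed ranking by load intensity (hypothetical, for illustration only).
LOAD_ORDER = ["Light_Load", "Medium_Load", "Maximum_Load"]
ordinal_codes = pd.Categorical(load_type, categories=LOAD_ORDER, ordered=True).codes.tolist()

print("alphabetical:", alphabetical_codes)
print("ordinal:     ", ordinal_codes)
```

A fixed category list also makes the encoding reproducible for a single new row, because the codes no longer depend on which values happen to be present when the encoder is fitted.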


## One-Hot Encoding for Categorical Features

In [10]:
# Create one-hot encoded versions as well (useful for linear and distance-based models,
# which would otherwise treat the label codes as ordered)
df_encoded = pd.get_dummies(df, columns=['Load_Type', 'WeekStatus', 'season', 'time_of_day'], 
                            prefix=['LoadType', 'Week', 'Season', 'TimeOfDay'])

print(f"Shape after one-hot encoding: {df_encoded.shape}")
print(f"\nNew one-hot encoded columns created: {[col for col in df_encoded.columns if col not in df.columns]}")

Shape after one-hot encoding: (35040, 47)

New one-hot encoded columns created: ['LoadType_Light_Load', 'LoadType_Maximum_Load', 'LoadType_Medium_Load', 'Week_Weekday', 'Week_Weekend', 'Season_Fall', 'Season_Spring', 'Season_Summer', 'Season_Winter', 'TimeOfDay_Afternoon', 'TimeOfDay_Evening', 'TimeOfDay_Morning', 'TimeOfDay_Night']
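
`pd.get_dummies` only creates columns for categories present in its input, which matters for single-row predictions: one row contains one category per column. The sketch below shows one way to align such a row with the training layout using `reindex`. `TRAIN_COLUMNS` is a hand-written subset of the columns listed above, used here only for illustration.

```python
import pandas as pd

# Subset of the one-hot columns produced on the full dataset.
TRAIN_COLUMNS = [
    "LoadType_Light_Load", "LoadType_Maximum_Load", "LoadType_Medium_Load",
    "Week_Weekday", "Week_Weekend",
]

single_row = pd.DataFrame({"Load_Type": ["Medium_Load"], "WeekStatus": ["Weekend"]})

# Only the categories present in this row get a column.
encoded = pd.get_dummies(
    single_row, columns=["Load_Type", "WeekStatus"], prefix=["LoadType", "Week"]
)

# Reindex against the training layout; categories absent from the row become 0.
aligned = encoded.reindex(columns=TRAIN_COLUMNS, fill_value=0).astype(int)

print(list(encoded.columns))
print(aligned)
```

Without the `reindex` step, the single row would have two one-hot columns instead of five, and a model trained on the full layout would reject it or misread it.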


In [11]:
# Check missing values
missing_count = df_encoded.isnull().sum()
print("Columns with missing values:")
print(missing_count[missing_count > 0])

# No lag or rolling columns are created in this notebook, so this list is expected to be empty;
# the guard keeps the fill step a no-op unless such columns are reintroduced
lag_columns = [col for col in df_encoded.columns if 'lag' in col or 'rolling' in col]
if lag_columns:
    df_encoded[lag_columns] = df_encoded[lag_columns].bfill().fillna(0)

print(f"\nMissing values after handling: {df_encoded.isnull().sum().sum()}")

Columns with missing values:
Series([], dtype: int64)

Missing values after handling: 0


In [12]:
print("Feature Engineering Summary:")
print("="*80)
# df already holds the engineered and label-encoded columns, so this is not the raw column count (11)
print(f"Features before one-hot encoding: {len(df.columns)}")
print(f"Features after one-hot encoding: {len(df_encoded.columns)}")
print(f"Total rows: {len(df_encoded)}")
print("\nFeature categories:")
print("- Temporal features: hour, day, month, season, etc.")
print("- Cyclical encodings: sin/cos for hour, day of week, month, day of year")
print(f"- Lag features: {len([c for c in df_encoded.columns if 'lag' in c])} features")
print(f"- Rolling window features: {len([c for c in df_encoded.columns if 'rolling' in c])} features")
print("- Interaction features: total_reactive_power, power_factor_diff, etc.")
print("- Encoded categoricals: Load_Type, WeekStatus, Day_of_week, season, time_of_day")

Feature Engineering Summary:
================================================================================
Features before one-hot encoding: 38
Features after one-hot encoding: 47
Total rows: 35040

Feature categories:
- Temporal features: hour, day, month, season, etc.
- Cyclical encodings: sin/cos for hour, day of week, month, day of year
- Lag features: 0 features
- Rolling window features: 0 features
- Interaction features: total_reactive_power, power_factor_diff, etc.
- Encoded categoricals: Load_Type, WeekStatus, Day_of_week, season, time_of_day


In [13]:
# Save the dataset with engineered features
output_path = '../processed_data/steel_data_featured.csv'
df_encoded.to_csv(output_path, index=False)

print(f"\nEngineered dataset saved to: {output_path}")
print(f"Final shape: {df_encoded.shape}")

# Also save column names for reference
feature_names = df_encoded.columns.tolist()
with open('../processed_data/feature_names.txt', 'w') as f:
    for name in feature_names:
        f.write(f"{name}\n")

print("Feature names saved to: ../processed_data/feature_names.txt")


Engineered dataset saved to: ../processed_data/steel_data_featured.csv
Final shape: (35040, 47)
Feature names saved to: ../processed_data/feature_names.txt
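
The saved feature names can later be used to check that new data has the expected columns in the expected order. The sketch below shows one way to do that. It writes a short stand-in file to a temporary directory so it runs anywhere; the four names and the `new_data` frame are illustrative, not the notebook's full 47-column layout.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for ../processed_data/feature_names.txt, written in the same one-name-per-line format.
feature_names = ["hour", "hour_sin", "hour_cos", "LoadType_Light_Load"]
names_path = Path(tempfile.mkdtemp()) / "feature_names.txt"
names_path.write_text("".join(f"{name}\n" for name in feature_names))

# Read the names back, skipping blank lines.
loaded_names = [line.strip() for line in names_path.read_text().splitlines() if line.strip()]

# New data whose columns arrive in a different order.
new_data = pd.DataFrame({
    "hour_cos": [1.0], "LoadType_Light_Load": [1], "hour": [0], "hour_sin": [0.0],
})

# Fail early if an expected feature is absent, then enforce the saved column order.
missing = [name for name in loaded_names if name not in new_data.columns]
if missing:
    raise ValueError(f"Missing expected features: {missing}")
new_data = new_data[loaded_names]

print(list(new_data.columns))
```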
