# **Feature Engineering** #

In this step, we create meaningful features from the cleaned time-series data to capture temporal patterns, normal behavior, and sudden deviations in energy consumption. Rolling statistics, lag features, and time-based attributes are engineered to provide contextual and historical information to the anomaly detection models. This step enhances the model’s ability to distinguish normal operational behavior from abnormal energy usage.

### **Import Libraries** ###

In [2]:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import joblib


### **Load Data** ###

In [None]:
electricity_df = pd.read_csv("electricity.csv")
weather_df = pd.read_csv("weather.csv")


### **Confirm Timestamp Type** ###

In [5]:
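# Parse timestamps so both frames share a datetime64 join key
electricity_df['timestamp'] = pd.to_datetime(electricity_df['timestamp'])
weather_df['timestamp'] = pd.to_datetime(weather_df['timestamp'])
# Left join keeps every electricity row; if weather_df holds several rows per
# timestamp, each electricity row is repeated once per matching weather row
df = pd.merge(electricity_df, weather_df, on='timestamp', how='left')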

In [6]:
# The merge preserves the datetime64 dtype, so this conversion is a no-op safeguard
df['timestamp'] = pd.to_datetime(df['timestamp'])
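
Because the merge above joins on `timestamp` alone, any timestamp that occurs more than once in the weather table would repeat the matching electricity rows. The following is a minimal sketch of how that could be checked. The frames `elec` and `weather`, the `site_id` and `air_temp` columns, and all values are made-up assumptions for illustration, not the project's data.

```python
import pandas as pd

# Toy stand-ins for electricity_df and weather_df (hypothetical values).
elec = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00"]),
    "building_a": [1.0, 2.0],
})
weather = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-01 00:00"] * 2 + ["2016-01-01 01:00"] * 2),
    "site_id": ["X", "Y", "X", "Y"],   # assumed column: one weather row per site
    "air_temp": [5.0, 7.0, 4.5, 6.5],
})

# Joining on timestamp alone repeats each electricity row once per site.
merged = pd.merge(elec, weather, on="timestamp", how="left")
n_dupes = int(merged["timestamp"].duplicated().sum())
print("rows:", len(merged), "duplicate timestamps:", n_dupes)

# Keeping a single site's weather before merging restores one row per timestamp.
one_site = weather[weather["site_id"] == "X"]
merged_one_site = pd.merge(elec, one_site, on="timestamp", how="left")
print("rows:", len(merged_one_site))
```

If the duplicate count is non-zero on the real data, row-based windows and lags computed later cover less wall-clock time than their row counts suggest.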


In [8]:
df.to_csv(
    r"C:\Users\Priyangaa\OneDrive\Desktop\Project\building-energy-anomaly-detection\results\cleaned_data.csv",
    index=False
)


In [12]:
os.chdir(r"C:\Users\Priyangaa\OneDrive\Desktop\Project\building-energy-anomaly-detection\notebooks")


In [14]:
# Read a single row to inspect the column names without loading the full file
cols = pd.read_csv(
    "../results/cleaned_data.csv",
    nrows=1
).columns

print(cols[:10])     # first ten column names
print("Total columns:", len(cols))


Index(['timestamp', 'Panther_parking_Lorriane', 'Panther_lodging_Cora',
       'Panther_office_Hannah', 'Panther_lodging_Hattie',
       'Panther_education_Teofila', 'Panther_education_Jerome',
       'Panther_retail_Felix', 'Panther_parking_Asia',
       'Panther_education_Misty'],
      dtype='object')
Total columns: 1588



In [17]:
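# Keep the timestamp plus the first building column (Panther_parking_Lorriane)
use_columns = ['timestamp', cols[1]]

In [18]:
# Load only the selected columns; float32 halves the memory of the default float64
df = pd.read_csv(
    "../results/cleaned_data.csv",
    usecols=use_columns,
    dtype={cols[1]: 'float32'}
)

# Use a generic name so later cells do not depend on the building chosen
df = df.rename(columns={cols[1]: 'electricity'})

df.head()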


|   | timestamp | electricity |
|---|---|---|
| 0 | 2016-01-01 00:00:00 | 0.0 |
| 1 | 2016-01-01 00:00:00 | 0.0 |
| 2 | 2016-01-01 00:00:00 | 0.0 |
| 3 | 2016-01-01 00:00:00 | 0.0 |
| 4 | 2016-01-01 00:00:00 | 0.0 |
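
The column-selective load above can be tried on a small in-memory CSV. This is only a sketch: the CSV text and the column names `building_a` and `building_b` are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical three-column CSV standing in for cleaned_data.csv.
csv_text = (
    "timestamp,building_a,building_b\n"
    "2016-01-01 00:00:00,1.5,2.5\n"
    "2016-01-01 01:00:00,1.0,3.0\n"
)

# nrows=0 reads the header only, which is enough to list the column names.
header = pd.read_csv(io.StringIO(csv_text), nrows=0).columns

# Load the timestamp and the first data column, stored as float32.
toy = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["timestamp", header[1]],
    dtype={header[1]: "float32"},
    parse_dates=["timestamp"],
)
toy = toy.rename(columns={header[1]: "electricity"})
print(toy.dtypes)
```

Passing `parse_dates` here folds the later `pd.to_datetime` step into the load. It is shown as an option, not as the notebook's method.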


In [19]:
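# read_csv returns timestamps as strings, so convert them before sorting
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Sort chronologically so rolling windows and lags always look back in time
df = df.sort_values('timestamp').reset_index(drop=True)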


### **Time-Based Features** ###

In [20]:
df['hour'] = df['timestamp'].dt.hour                           # 0-23
df['day_of_week'] = df['timestamp'].dt.dayofweek               # Monday=0 ... Sunday=6
df['month'] = df['timestamp'].dt.month                         # 1-12
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)  # 1 on Saturday/Sunday
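
As a quick check of the weekday encoding used above, the sketch below applies the same accessors to four consecutive dates. The dates are chosen only for illustration. 1 January 2016 was a Friday, which matches the `day_of_week` value of 4 in the notebook output for that date.

```python
import pandas as pd

# Friday, Saturday, Sunday, Monday
ts = pd.Series(pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"]))

dow = ts.dt.dayofweek                      # Monday=0 ... Sunday=6
is_weekend = dow.isin([5, 6]).astype(int)  # 1 for Saturday and Sunday

print(dow.tolist())         # [4, 5, 6, 0]
print(is_weekend.tolist())  # [0, 1, 1, 0]
```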


### **Rolling Statistics** ###

In [21]:
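# Rows per window: 7 days x 24 hourly readings, assuming one row per hour
WINDOW = 168

# Rolling baseline: mean over the current row and the 167 rows before it
df['electricity_rolling_mean'] = (
    df['electricity'].rolling(window=WINDOW).mean()
)

# Rolling spread: sample standard deviation over the same window
df['electricity_rolling_std'] = (
    df['electricity'].rolling(window=WINDOW).std()
)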
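
A window given as a row count equals 7 days only when the data holds exactly one row per hour. The sketch below uses a made-up hourly series to compare the row-based window with a time-based one (`"7D"` on a `DatetimeIndex`), and shows how far 168 rows reach when every timestamp appears twice. The series and its values are illustrative assumptions, not the building data.

```python
import numpy as np
import pandas as pd

# Toy hourly series: 10 days, one reading per hour (values are made up).
idx = pd.date_range("2016-01-01", periods=240, freq="60min")
s = pd.Series(np.arange(240, dtype="float64"), index=idx, name="electricity")

# Row-based window, as in the cell above: always the last 168 rows.
row_mean = s.rolling(window=168).mean()

# Time-based window: every row from the last 7 days, however many there are.
# min_periods=168 mimics the row-based warm-up on clean hourly data.
time_mean = s.rolling("7D", min_periods=168).mean()

# On clean hourly data the two baselines agree.
print(np.allclose(row_mean.dropna(), time_mean.dropna()))

# With every timestamp repeated twice, 168 rows cover far less than 7 days.
s_dup = s.repeat(2)
print(s_dup.index[167] - s_dup.index[0])
```

A time-based window keeps the 7-day meaning even when rows are duplicated or missing, at the cost of requiring a sorted datetime index.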


### **Deviation Feature** ###

In [22]:
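# Z-score-style deviation from the rolling baseline; the 1e-5 term avoids
# division by zero when every value in the window is identical
df['electricity_deviation'] = (
    df['electricity'] - df['electricity_rolling_mean']
) / (df['electricity_rolling_std'] + 1e-5)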
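
To make the deviation score concrete, here is a small worked example with a 4-row window and made-up readings. It also shows a variant that is not used in this notebook and is included only as an assumption for comparison: shifting the baseline by one row so that a spike does not inflate its own mean and standard deviation.

```python
import pandas as pd

x = pd.Series([10.0, 11.0, 9.0, 10.0, 50.0])  # last reading is a spike
W = 4                                         # small window for illustration

mean = x.rolling(window=W).mean()
std = x.rolling(window=W).std()

# Same formula as the cell above: the current reading is part of its own window.
dev = (x - mean) / (std + 1e-5)

# Variant (assumption): baseline built from the previous W rows only.
dev_prev = (x - mean.shift(1)) / (std.shift(1) + 1e-5)

print(round(dev.iloc[4], 2))       # about 1.5
print(round(dev_prev.iloc[4], 1))  # about 49.0
```

Because the spike is included in its own window, it raises both the mean and the standard deviation, which damps the score. Whether that damping is acceptable depends on the downstream model and threshold.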


### **Lag Features** ###

In [23]:
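# Reading from the previous row (1 hour earlier for hourly data)
df['electricity_lag1'] = df['electricity'].shift(1)
# Reading from 24 rows earlier (same hour on the previous day for hourly data)
df['electricity_lag24'] = df['electricity'].shift(24)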
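
`shift(24)` moves values by 24 rows, so it means "24 hours ago" only when consecutive rows are exactly one hour apart. The sketch below, on a made-up two-day hourly frame, shows the lag alongside a simple spacing check that could be run before trusting it.

```python
import pandas as pd

# Toy frame: 48 hourly rows with readings 0..47 (illustrative values).
ts = pd.date_range("2016-01-01", periods=48, freq="60min")
toy = pd.DataFrame({"timestamp": ts, "electricity": [float(i) for i in range(48)]})

toy["electricity_lag24"] = toy["electricity"].shift(24)

# Spacing check: every step between consecutive rows should be exactly 1 hour.
step = toy["timestamp"].diff().dropna()
is_hourly = bool((step == pd.Timedelta(hours=1)).all())

print(is_hourly)
print(toy.loc[24, ["timestamp", "electricity", "electricity_lag24"]].tolist())
```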


### **Handle NaNs** ###

In [24]:
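# Drop rows with a NaN in any column: the first WINDOW - 1 rows have no rolling
# statistics and the first 24 rows have no 24-step lag
df = df.dropna().reset_index(drop=True)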
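
The number of rows removed by this step can be predicted from the window size. A minimal sketch on a made-up frame with no gaps in the raw values:

```python
import numpy as np
import pandas as pd

WINDOW = 168
toy = pd.DataFrame({"electricity": np.ones(200)})  # 200 made-up rows, no gaps

toy["rolling_mean"] = toy["electricity"].rolling(window=WINDOW).mean()
toy["lag24"] = toy["electricity"].shift(24)

rows_before = len(toy)
toy = toy.dropna().reset_index(drop=True)
rows_dropped = rows_before - len(toy)

print(rows_dropped)  # 167, i.e. WINDOW - 1
```

The 24 rows without a 24-step lag fall inside the first 167 rows, so the rolling window alone determines the count. Any additional rows dropped on the real data would point to NaN values in the raw readings.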


### **Final Sanity Check** ###

In [25]:
df.shape


(330305, 11)

In [26]:
df.head()


|   | timestamp | electricity | hour | day_of_week | month | is_weekend | electricity_rolling_mean | electricity_rolling_std | electricity_deviation | electricity_lag1 | electricity_lag24 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 09:00:00 | 0.0 | 9 | 4 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2016-01-01 09:00:00 | 0.0 | 9 | 4 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2016-01-01 09:00:00 | 0.0 | 9 | 4 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2016-01-01 09:00:00 | 0.0 | 9 | 4 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2016-01-01 09:00:00 | 0.0 | 9 | 4 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### **Save Feature-Engineered Data** ###

In [27]:
df.to_csv(
    "../results/feature_engineered_data.csv",
    index=False
)


## **Observations** ##

---> Feature engineering was performed on a single building’s electricity consumption data to optimize memory usage and enable scalable processing.

---> The timestamp column was converted to datetime format and the dataset was sorted chronologically to maintain strict time-series order.

---> Time-based features such as hour of day, day of week, month, and weekend indicator were created to capture operational and occupancy-driven patterns.

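---> Rolling mean and rolling standard deviation were computed over a 168-row window to establish a local baseline of normal energy behavior. The window spans 7 days only when the data holds exactly one row per hour; the `df.head()` outputs above show repeated timestamps, so the effective span in this run may be shorter.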

---> A deviation score was calculated to quantify how much each observation deviates from recent historical consumption, providing a strong signal for anomaly detection.

---> Lag features (shifts of 1 and 24 rows, which correspond to 1 hour and 24 hours for hourly data) were introduced to capture short-term momentum and sudden changes in energy usage.

---> Rows lacking a full rolling window or lag history, which for a 168-row window are the first 167 rows, were dropped with `dropna()`; the remaining rows keep their chronological order.

---> The final feature-engineered dataset has 330,305 rows and 11 columns and is saved to `../results/feature_engineered_data.csv` for use by unsupervised anomaly detection models.

## **Key Findings** ##


---> Time-based features give downstream models the context needed to represent daily and weekly consumption cycles in building energy usage.

---> Rolling mean and rolling standard deviation define a local baseline for normal behavior.

---> The deviation feature scales each reading against that baseline, so sudden spikes and drops stand out as large positive or negative values.

---> Lag features add short-term memory, which helps expose abrupt changes in consumption.

---> No model is trained in this notebook, so the effect of these features on detection quality is not measured here; they are intended to help downstream models separate meaningful anomalies from random noise.