# Feature Engineering - air_12300

This notebook applies feature engineering to the air_12300 dataset, following the same pipeline as air_12318.

## 1. Setup

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture

In [2]:
df = pd.read_csv("../../data/raw/air_12300.csv", on_bad_lines='skip')
df['time'] = pd.to_datetime(df['time'], errors='coerce')
df = df.sort_values('time').reset_index(drop=True)

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

Dataset shape: (509084, 33)

Columns: ['id', 'time', 'epoch', 'air', 'device', 'freq_hertz', 'vlineavg_volt', 'va_volt', 'vb_volt', 'vc_volt', 'vlineper_volt', 'ia_ampere', 'ib_ampere', 'ic_ampere', 'iavg_ampere', 'pftot_none', 'ptotper_percent', 'ptot_watt', 'stotper_percent', 'stot_volt-ampere', 'qtotper_percent', 'qtot_var', 'expvar_var-hour', 'expwh_watt-hour', 'state_none', 'hours_hour', 'servhr_hour', 'servday_day', 'startcount_none', 'oilpress_pascal', 'cooltemp_degree-celsius', 'rpm_revolutions-per-minute', 'vbat_volt']


## 2. ON/OFF Detection

In [3]:
# df was already loaded, parsed, and sorted in cell [2]; re-reading the CSV here is unnecessary.

print(f"Dataset shape: {df.shape}")
print(f"\nOriginal columns: {df.columns.tolist()[:10]}...")

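# Rename columns to match the air_12318 naming convention.
# NOTE: this only relabels columns; no values are converted. Several target names
# imply a different unit or scale than the source (pascal -> Bar, hertz -> Hz*10,
# hour -> sec, watt-hour -> Kwh*10), so confirm units before comparing with air_12318.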
column_mapping = {
    'ptot_watt': 'ptot_W',
    'qtot_var': 'qtot_Var',
    'stot_volt-ampere': 'stot_VA',
    'ia_ampere': 'ia_A',
    'ib_ampere': 'ib_A',
    'ic_ampere': 'ic_A',
    'iavg_ampere': 'iavg_A',
    'va_volt': 'va_V',
    'vb_volt': 'vb_V',
    'vc_volt': 'vc_V',
    'vlineavg_volt': 'vlineavg_V',
    'vlineper_volt': 'vlineper_V',
    'vbat_volt': 'vbat_V',
    'pftot_none': 'pftot_None',
    'cooltemp_degree-celsius': 'temp_Degrees Celsius',
    'oilpress_pascal': 'pressure_Bar',
    'freq_hertz': 'freq_Hz*10',
    'hours_hour': 'hours_sec',
    'expwh_watt-hour': 'expwh_Kwh*10',
    'expvar_var-hour': 'expvar_Kvarh*10',
}

df = df.rename(columns=column_mapping)

print(f"\n✓ Columns renamed to standard format")
print(f"Renamed columns: {df.columns.tolist()[:10]}...")

Dataset shape: (509084, 33)

Original columns: ['id', 'time', 'epoch', 'air', 'device', 'freq_hertz', 'vlineavg_volt', 'va_volt', 'vb_volt', 'vc_volt']...

✓ Columns renamed to standard format
Renamed columns: ['id', 'time', 'epoch', 'air', 'device', 'freq_Hz*10', 'vlineavg_V', 'va_V', 'vb_V', 'vc_V']...


In [4]:
# ON/OFF detection: fit a 2-component GMM on log1p(power).
# Negative and missing power readings are treated as 0 W before the transform.
p = df['ptot_W'].clip(lower=0).fillna(0)
X = np.log1p(p).values.reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
labels = gmm.predict(X)

means = gmm.means_.flatten()
on_cluster = np.argmax(means)

df['is_running_gmm'] = (labels == on_cluster).astype(int)

print("GMM cluster means:", means)
print("ON cluster:", on_cluster)
print(f"✓ GMM-based ON/OFF detection complete")

GMM cluster means: [21.41641413  5.44787643]
ON cluster: 0
✓ GMM-based ON/OFF detection complete
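
The cluster means printed above are in `log1p` units, so it helps to back-transform them with `np.expm1` before interpreting them. For the values above, `expm1(21.4)` is about 2 × 10^9 and `expm1(5.45)` is about 231 in the raw units of `ptot_W`; the first looks very large for watts, so it may be worth checking the column for scaling or sentinel values. The following is a minimal sketch of the same technique on synthetic data (the power values and sample counts are hypothetical, not taken from air_12300):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic power readings in watts (hypothetical): 200 OFF samples near zero,
# followed by 800 ON samples around 5 kW.
p_off = rng.uniform(0, 5, size=200)
p_on = rng.normal(5000, 300, size=800)
p = np.concatenate([p_off, p_on])

# Same transform as the notebook: clip negatives, then log1p.
X = np.log1p(np.clip(p, 0, None)).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
labels = gmm.predict(X)

# The component with the larger mean is taken as the ON state.
on_cluster = int(np.argmax(gmm.means_.ravel()))
is_running = (labels == on_cluster).astype(int)

# Back-transform the means to watts for a unit sanity check.
means_watts = np.expm1(gmm.means_.ravel())
print("Cluster means (W):", means_watts)
print("ON share:", is_running.mean())
```

If the back-transformed ON mean is far outside the generator's rated power, the GMM split is probably separating outliers from normal readings rather than ON from OFF.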


In [5]:
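# Cross-check with currents: a summed phase current above 0.1 A counts as running.
# sum(axis=1) skips NaNs, so rows with all currents missing evaluate to False.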
df['any_current'] = df[['ia_A', 'ib_A', 'ic_A']].sum(axis=1) > 0.1
print(f"✓ Current check complete")

✓ Current check complete


In [6]:
# Final ON/OFF label
df['is_running'] = ((df['is_running_gmm'] == 1) | df['any_current']).astype(int)

print("ON/OFF split:")
print(df['is_running'].value_counts(normalize=True))

ON/OFF split:
is_running
1    0.986315
0    0.013685
Name: proportion, dtype: float64
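
Because the final label is an OR of two signals, it can be useful to see how often they disagree before trusting the combined split. A minimal sketch using `pd.crosstab` on a toy frame (the five rows below are made up for illustration; on the real data the same calls would be applied to `df`):

```python
import pandas as pd

# Toy stand-in for df with the two intermediate labels (hypothetical values).
demo = pd.DataFrame({
    "is_running_gmm": [1, 0, 0, 1, 0],
    "any_current": [True, True, False, True, False],
})

# Rows: GMM label, columns: current check.
agreement = pd.crosstab(demo["is_running_gmm"], demo["any_current"])
print(agreement)

# Share of rows where the two signals disagree.
disagree_share = (demo["is_running_gmm"].astype(bool) != demo["any_current"]).mean()
print(f"Disagreement share: {disagree_share:.2f}")
```

A large off-diagonal count would suggest that one of the two signals dominates the final `is_running` label.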


## 3. Feature Engineering

In [7]:
# Current imbalance
if all(col in df.columns for col in ['ia_A', 'ib_A', 'ic_A']):
    df['current_imbalance'] = df[['ia_A', 'ib_A', 'ic_A']].std(axis=1)
    print("✓ Current imbalance feature created")

✓ Current imbalance feature created


In [8]:
# Voltage imbalance
voltage_cols = ['va_V', 'vb_V', 'vc_V']
if all(col in df.columns for col in voltage_cols):
    df['voltage_imbalance'] = df[voltage_cols].std(axis=1)
    print("✓ Voltage imbalance feature created")

✓ Voltage imbalance feature created


In [9]:
# Power factor anomaly (deviation from ideal = 1)
if 'pftot_None' in df.columns:
    df['pf_anomaly'] = np.abs(1 - df['pftot_None'])
    print("✓ Power factor anomaly feature created")

✓ Power factor anomaly feature created


In [10]:
# Temperature rate of change (row-to-row difference; assumes evenly spaced samples)
temp_cols = [c for c in df.columns if 'temp' in c.lower()]
for col in temp_cols:
    df[f'{col}_roc'] = df[col].diff()

if temp_cols:
    print(f"✓ Temperature rate of change features created ({len(temp_cols)} columns)")

✓ Temperature rate of change features created (1 columns)
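
`diff()` gives the change between consecutive rows, which equals a rate only if the sampling interval is constant. If the data has gaps, one option is to divide by the elapsed time. The sketch below assumes a `time` column of parsed timestamps; the column name `temp_roc_per_min` and the sample values are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data with an irregular gap between the second and third reading.
demo = pd.DataFrame({
    "time": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 00:01", "2024-01-01 00:06"]
    ),
    "temp_Degrees Celsius": [60.0, 61.0, 66.0],
})

# Elapsed minutes between rows; zero gaps become NaN to avoid division by zero.
dt_min = demo["time"].diff().dt.total_seconds() / 60.0
dt_min = dt_min.replace(0, np.nan)

# Plain row difference (what the notebook computes) vs. degrees per minute.
demo["temp_roc"] = demo["temp_Degrees Celsius"].diff()
demo["temp_roc_per_min"] = demo["temp_roc"] / dt_min
print(demo)
```

Here the plain difference jumps from 1 to 5 degrees across the five-minute gap, while the per-minute rate stays at 1.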


In [11]:
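# Fuel rate of change
# NOTE: air_12300 has no column containing 'fuel', so this cell creates no
# features and prints nothing for this dataset.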
fuel_cols = [c for c in df.columns if 'fuel' in c.lower()]
for col in fuel_cols:
    df[f'{col}_roc'] = df[col].diff()

if fuel_cols:
    print(f"✓ Fuel rate of change features created ({len(fuel_cols)} columns)")

In [12]:
# Rolling means & stds
rolling_window = 60  # 60 rows ~ 1h if data is per minute

for col in ['ptot_W', 'ia_A', 'pf_anomaly']:
    if col in df.columns:
        df[f'{col}_rollmean'] = df[col].rolling(window=rolling_window, min_periods=1).mean()
        df[f'{col}_rollstd'] = df[col].rolling(window=rolling_window, min_periods=1).std()

print(f"✓ Rolling statistics created (window={rolling_window})")

✓ Rolling statistics created (window=60)
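
The 60-row window equals one hour only when there is exactly one row per minute. If that does not hold, pandas also supports time-based windows. A minimal sketch on toy data (the timestamps and values are made up; on real data the `time` column must be sorted and free of `NaT`, which `errors='coerce'` can introduce):

```python
import pandas as pd

# Toy series with a 90-minute gap between the second and third reading.
ts = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 00:30",
        "2024-01-01 02:00", "2024-01-01 02:10",
    ]),
    "ptot_W": [100.0, 200.0, 400.0, 600.0],
})

# Row-based window: always averages the last 2 rows, regardless of the gap.
by_rows = ts["ptot_W"].rolling(window=2, min_periods=1).mean()

# Time-based window: averages only the rows within the last 60 minutes.
by_time = ts.rolling("60min", on="time", min_periods=1)["ptot_W"].mean()

print(pd.DataFrame({"time": ts["time"], "by_rows": by_rows, "by_time": by_time}))
```

At 02:00 the row-based mean still mixes in the 00:30 reading, whereas the time-based mean uses only the current one.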


## 4. Verify Features

In [13]:
# Ensure 'time' is in the dataframe
if 'time' not in df.columns:
    print("⚠️ Warning: 'time' column missing, reloading from raw data...")
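    df_raw = pd.read_csv("../../data/raw/air_12300.csv", on_bad_lines='skip')
    df_raw['time'] = pd.to_datetime(df_raw['time'], errors='coerce')
    # Apply the same sort as the load cell so that rows line up with df
    df_raw = df_raw.sort_values('time').reset_index(drop=True)
    df['time'] = df_raw.loc[df.index, 'time'].values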

# Move 'time' to the front
cols = ['time'] + [c for c in df.columns if c != 'time']
df = df[cols]

print("✅ 'time' column preserved in final features:", 'time' in df.columns)
print(f"\nFinal shape: {df.shape}")
print(f"Total columns: {len(df.columns)}")

✅ 'time' column preserved in final features: True

Final shape: (509084, 46)
Total columns: 46


In [14]:
# Display columns
print("Columns before saving:")
print(df.columns.tolist())

Columns before saving:
['time', 'id', 'epoch', 'air', 'device', 'freq_Hz*10', 'vlineavg_V', 'va_V', 'vb_V', 'vc_V', 'vlineper_V', 'ia_A', 'ib_A', 'ic_A', 'iavg_A', 'pftot_None', 'ptotper_percent', 'ptot_W', 'stotper_percent', 'stot_VA', 'qtotper_percent', 'qtot_Var', 'expvar_Kvarh*10', 'expwh_Kwh*10', 'state_none', 'hours_sec', 'servhr_hour', 'servday_day', 'startcount_none', 'pressure_Bar', 'temp_Degrees Celsius', 'rpm_revolutions-per-minute', 'vbat_V', 'is_running_gmm', 'any_current', 'is_running', 'current_imbalance', 'voltage_imbalance', 'pf_anomaly', 'temp_Degrees Celsius_roc', 'ptot_W_rollmean', 'ptot_W_rollstd', 'ia_A_rollmean', 'ia_A_rollstd', 'pf_anomaly_rollmean', 'pf_anomaly_rollstd']


In [15]:
# Check for missing values
missing = df.isnull().sum()
if missing.sum() > 0:
    print("\nMissing values:")
    print(missing[missing > 0])
else:
    print("\n✅ No missing values")


Missing values:
va_V                          3000
vb_V                          3000
vc_V                          3000
vlineper_V                    3000
ia_A                          3000
ib_A                          3000
ic_A                          3000
iavg_A                        3000
pftot_None                    3000
ptotper_percent               3000
ptot_W                        3000
stotper_percent               3000
stot_VA                       3000
qtotper_percent               3000
qtot_Var                      3000
expvar_Kvarh*10               3000
expwh_Kwh*10                  3000
state_none                    3000
hours_sec                     3000
servhr_hour                   3000
servday_day                   3000
startcount_none               3000
pressure_Bar                  3000
temp_Degrees Celsius          3000
rpm_revolutions-per-minute    3000
vbat_V                        3000
current_imbalance             3000
voltage_imbalance             3000
pf_
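
Many columns above report the same missing count, which suggests (but does not prove) that the gaps fall in the same rows. One way to check is to compare "all missing" against "any missing" across the affected columns. This is a sketch on a toy frame with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy frame: two sensor columns share the same missing rows, one is complete.
demo = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=5, freq="min"),
    "ptot_W": [1.0, np.nan, 3.0, np.nan, 5.0],
    "ia_A": [0.1, np.nan, 0.3, np.nan, 0.5],
    "vlineavg_V": [400.0, 401.0, 399.0, 400.0, 402.0],
})

missing = demo.isnull().sum()
cols_with_gaps = missing[missing > 0].index

# If every gap row is missing in all affected columns, the two masks are identical.
rows_all_missing = demo[cols_with_gaps].isna().all(axis=1)
rows_any_missing = demo[cols_with_gaps].isna().any(axis=1)
same_rows = bool((rows_all_missing == rows_any_missing).all())

print("Columns with gaps:", list(cols_with_gaps))
print("Rows missing in all of them:", int(rows_all_missing.sum()))
print("Gaps fall in the same rows:", same_rows)
```

On the real data, derived columns such as the rolling standard deviations have their own `NaN` in the first row, so the check is best restricted to the raw sensor columns.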

## 5. Feature Selection (Optional Analysis)

In [16]:
# Define which columns are actual features
exclude_cols = ['time', 'id', 'epoch', 'air', 'device', 'is_running_gmm', 'any_current']
feature_cols = [c for c in df.columns if c not in exclude_cols]

print(f"Number of features: {len(feature_cols)}")
print(f"\nSample features: {feature_cols[:10]}")

Number of features: 39

Sample features: ['freq_Hz*10', 'vlineavg_V', 'va_V', 'vb_V', 'vc_V', 'vlineper_V', 'ia_A', 'ib_A', 'ic_A', 'iavg_A']


### Feature Explanations

- **voltage_imbalance** = `std(va_V, vb_V, vc_V)`  
  High values indicate phase imbalance, which stresses the generator.

- **current_imbalance** = `std(ia_A, ib_A, ic_A)`  
  Highlights uneven load across phases, possible wiring or load issues.

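- **pf_anomaly** = `|1 - pftot_None|`  
  Distance of the power factor from the ideal value of 1. Large deviations mean more reactive power for the same real power and may point to electrical faults.

- **temp_Degrees Celsius_roc** = row-to-row change in `temp_Degrees Celsius` (`diff()`)  
  Sudden rises may indicate cooling issues or overload.

- **fuel_roc** = change in fuel level (not created here: air_12300 has no fuel columns)  
  Gradual drops = normal usage; sharp rises = refueling events; sharp drops = possible leaks or sensor errors.

- **rolling means / stds** (for `ptot_W`, `ia_A`, and `pf_anomaly`)  
  Summarise the last 60 rows (about 1 h at one sample per minute), capturing trends rather than single data points.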

These features together capture **operational health signals** and form the input for anomaly detection / predictive maintenance models.
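
As a small worked example of the definitions above, the sketch below computes the imbalance and power factor features on two made-up rows. Note that pandas `std` uses the sample standard deviation (`ddof=1`) by default, which is what the notebook's `std(axis=1)` calls produce:

```python
import pandas as pd

# Two hypothetical readings: a balanced row and a row with one phase 6 V high.
demo = pd.DataFrame({
    "va_V": [230.0, 230.0],
    "vb_V": [230.0, 230.0],
    "vc_V": [230.0, 236.0],
    "pftot_None": [1.0, 0.8],
})

# Row-wise sample standard deviation across the three phases.
demo["voltage_imbalance"] = demo[["va_V", "vb_V", "vc_V"]].std(axis=1)

# Absolute distance of the power factor from 1.
demo["pf_anomaly"] = (1 - demo["pftot_None"]).abs()

print(demo[["voltage_imbalance", "pf_anomaly"]])
```

For the second row the phase mean is 232 V, the squared deviations sum to 24, and dividing by `n - 1 = 2` gives a standard deviation of `sqrt(12)`, about 3.46 V.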

## 6. Save Outputs

In [17]:
import os
processed_dir = "../../data/processed"

# Create directory if it doesn't exist
os.makedirs(processed_dir, exist_ok=True)

# Save full dataset
full_path_parquet = os.path.join(processed_dir, "air_12300_features_full.parquet")
full_path_csv = os.path.join(processed_dir, "air_12300_features_full.csv")

df.to_parquet(full_path_parquet, index=False)
df.to_csv(full_path_csv, index=False)

# Save sample
sample_path_parquet = os.path.join(processed_dir, "air_12300_features_sample.parquet")
sample_path_csv = os.path.join(processed_dir, "air_12300_features_sample.csv")

df.head(10000).to_parquet(sample_path_parquet, index=False)
df.head(10000).to_csv(sample_path_csv, index=False)

print("✅ Saved processed features with all engineered columns.")
print(f"\nFiles saved:")
print(f"  - {full_path_parquet}")
print(f"  - {full_path_csv}")
print(f"  - {sample_path_parquet}")
print(f"  - {sample_path_csv}")

✅ Saved processed features with all engineered columns.

Files saved:
  - ../../data/processed\air_12300_features_full.parquet
  - ../../data/processed\air_12300_features_full.csv
  - ../../data/processed\air_12300_features_sample.parquet
  - ../../data/processed\air_12300_features_sample.csv


## Summary

Feature engineering completed for **air_12300**:
- ✅ ON/OFF state detection (GMM on log1p power, combined with a phase-current check)
- ✅ Current imbalance
- ✅ Voltage imbalance
- ✅ Power factor anomaly
- ✅ Temperature rate of change (no fuel columns exist in this dataset, so no fuel features were created)
- ✅ Rolling statistics (60-row window)

**Next step**: Run `AnomalyDetection_12300.ipynb` to detect anomalies using the engineered features.