# Fleet Feature Engineering

## Overview
This notebook performs feature engineering on fleet (heavy equipment) operational data to support:
- **Equipment Failure Prediction**: Predicting heavy equipment breakdowns
- **Port Operability Forecast**: Forecasting port operability from equipment condition
- **Predictive Maintenance**: Data-driven maintenance scheduling

## Data Sources
- `fct_operasional_alat_relatif_2`: Daily operational data for heavy equipment (6,985 records)
- `dim_alat_berat_relatif_2`: Heavy equipment master data (100 units)

## Feature Categories Created
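The counts below are the name-pattern matches reported by the summary cell in section 7. A column can match more than one pattern, and some engineered columns (for example `new_equipment` and `cumulative_hours_7d`) match none.
1. **Age Features** (3): Equipment age in days, months, and years
2. **Usage Features** (7): Utilization rate, usage flags, usage variability
3. **Maintenance Features** (10): Breakdown flags and counts, maintenance activity
4. **Health Features** (7): Component health scores, composite health score
5. **Interaction Features** (8): Risk flags, risk interactions, type risk profile

Total: **38 engineered features** from 21 original columns, as reported by the summary cell. That cell counts all 9 master columns as originals although the merge carries only 4 of them, so measured against the 17 merged columns, the final 59-column dataset holds 42 engineered columns.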

## Key Metrics
- Equipment tracked: 96 units (of 100 in the master table)
- Date range: 2025-07-01 to 2025-10-31
- Mean health score: 73.6/100
- Records in the Critical health category: 4
- Breakdown rate: 5.2% (366 of 6,985 records)

## Output Files
- CSV: `data/processed/fleet_features.csv` (3.25 MB)
- Parquet: `data/feature_store/fleet_features.parquet` (0.49 MB, 6.7x compression)
- Metadata: `data/feature_store/fleet_features_metadata.json`

---

## 1. Setup & Data Loading


In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from pathlib import Path
from datetime import datetime, timedelta
import sys
import os

# Set project root directory
# This notebook lives in notebooks/03_feature_engineering/, two levels below the project root.
# The root is hardcoded for the author's machine; adjust it before running elsewhere.
project_root = Path(r'c:\Users\I5\Documents\asah-2025\capstone-project\minewise_ml')
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")

# Verify data file exists
data_file = project_root / 'data' / 'raw' / 'dataset_rancangan.xlsx'
print(f"Data file exists: {data_file.exists()}")

# Suppress warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

print("\nLibraries imported successfully")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")


Working directory: c:\Users\I5\Documents\asah-2025\capstone-project\minewise_ml
Data file exists: True

Libraries imported successfully
Pandas version: 2.3.3
NumPy version: 2.2.6
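
The hardcoded Windows path in the setup cell ties the notebook to one machine. A minimal sketch of a more portable lookup, assuming the project root can be recognized by its `data/raw` folder. The helper name `find_project_root` and the marker are illustrative and not part of the original notebook:

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data/raw") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    `marker` is an assumed convention (the notebook reads from data/raw);
    change it if the project uses a different layout.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {marker!r}")
```

In the notebook this could be used as `project_root = find_project_root(Path.cwd())` followed by `os.chdir(project_root)`.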


In [2]:
# Load fleet operational data
print("=" * 80)
print("LOADING FLEET DATASETS")
print("=" * 80)

# Load fleet operations
fleet_ops = pd.read_excel('data/raw/dataset_rancangan.xlsx',
                          sheet_name='fct_operasional_alat_relatif_2')
print(f"\nFleet Operations loaded: {fleet_ops.shape}")
print(f"Columns: {list(fleet_ops.columns)}")

# Load equipment master
equipment_master = pd.read_excel('data/raw/dataset_rancangan.xlsx',
                                 sheet_name='dim_alat_berat_relatif_2')
print(f"\nEquipment Master loaded: {equipment_master.shape}")
print(f"Columns: {list(equipment_master.columns)}")

print("\n" + "=" * 80)
print("DATA OVERVIEW")
print("=" * 80)

print(f"\nFleet Operations Preview:")
print(fleet_ops.head(2))

print(f"\nEquipment Master Preview:")
print(equipment_master.head(2))

print(f"\nData Types:")
print(f"Fleet Ops: {fleet_ops.dtypes.value_counts().to_dict()}")
print(f"Equipment Master: {equipment_master.dtypes.value_counts().to_dict()}")


LOADING FLEET DATASETS

Fleet Operations loaded: (6985, 13)
Columns: ['id_record', 'id_alat', 'tanggal_operasi', 'shift', 'jam_mulai', 'jam_selesai', 'status_operasi', 'durasi_jam', 'material_dipindah', 'total_muatan_ton', 'jumlah_ritase', 'id_operator', 'lokasi_kode']

Equipment Master loaded: (100, 9)
Columns: ['id_alat', 'tipe_alat', 'model_alat', 'kapasitas_default_ton', 'departemen', 'lokasi_kode', 'tgl_pembelian', 'umur_tahun', 'kondisi']

DATA OVERVIEW

Fleet Operations Preview:
        id_record   id_alat tanggal_operasi    shift jam_mulai jam_selesai  \
0  OPRREC_0000001  ALAT_047      2025-07-01  Shift 2  14:00:00    17:43:00   
1  OPRREC_0000002  ALAT_047      2025-07-01  Shift 1  06:00:00    11:19:00   

  status_operasi  durasi_ja
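
The operations data covers 96 units while the master table lists 100, so it can help to check the join on `id_alat` before building features. A minimal sketch, using small invented frames in place of the real sheets, that relies on the `validate` and `indicator` parameters of `DataFrame.merge` to catch duplicate master keys and operations rows without a master match:

```python
import pandas as pd

# Toy stand-ins for fleet_ops and equipment_master (all values are invented)
ops = pd.DataFrame({
    "id_alat": ["ALAT_001", "ALAT_001", "ALAT_002", "ALAT_999"],
    "durasi_jam": [5.0, 3.5, 4.0, 2.0],
})
master = pd.DataFrame({
    "id_alat": ["ALAT_001", "ALAT_002", "ALAT_003"],
    "tipe_alat": ["Excavator", "Dump Truck", "Dozer"],
})

merged = ops.merge(
    master,
    on="id_alat",
    how="left",
    validate="many_to_one",  # raises MergeError if master has duplicate id_alat
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Operations rows whose unit is missing from the master table
unmatched_ops = merged.loc[merged["_merge"] == "left_only", "id_alat"].unique().tolist()

# Master units that never appear in the operations table
units_without_ops = sorted(set(master["id_alat"]) - set(ops["id_alat"]))

print("Unmatched operations units:", unmatched_ops)
print("Master units without operations:", units_without_ops)
```

A left merge keeps the row count of the operations table, so `len(merged) == len(ops)` is a cheap additional check.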

## 2. Equipment Age & History Features

In [3]:
print("=" * 80)
print("CREATING EQUIPMENT AGE FEATURES")
print("=" * 80)

# Merge equipment master data with correct column names
fleet_data = fleet_ops.merge(
    equipment_master[['id_alat', 'tgl_pembelian', 'umur_tahun', 'tipe_alat', 'kondisi']],
    on='id_alat',
    how='left'
)
print(f"Data merged successfully: {fleet_data.shape}")

# Calculate equipment age from purchase date
# NOTE: ages are measured against the run date, so they shift on every rerun
fleet_data['tgl_pembelian'] = pd.to_datetime(fleet_data['tgl_pembelian'])
current_date = pd.Timestamp.now()
fleet_data['equipment_age_days'] = (current_date - fleet_data['tgl_pembelian']).dt.days
fleet_data['equipment_age_months'] = fleet_data['equipment_age_days'] / 30.44
fleet_data['equipment_age_years'] = fleet_data['equipment_age_days'] / 365.25

# Fall back to umur_tahun where the purchase date is missing
fleet_data['equipment_age_years'] = fleet_data['equipment_age_years'].fillna(fleet_data['umur_tahun'])

# Age-based risk flags
fleet_data['high_age_risk'] = (fleet_data['equipment_age_years'] > 10).astype(int)
fleet_data['mid_age_risk'] = ((fleet_data['equipment_age_years'] >= 5) &
                               (fleet_data['equipment_age_years'] <= 10)).astype(int)
fleet_data['new_equipment'] = (fleet_data['equipment_age_years'] < 2).astype(int)

print("\nEquipment age features created:")
print(f"  - equipment_age_days")
print(f"  - equipment_age_months")
print(f"  - equipment_age_years")
print(f"  - high_age_risk (>10 years)")
print(f"  - mid_age_risk (5-10 years)")
print(f"  - new_equipment (<2 years)")

print(f"\nAge Distribution:")
print(f"  Mean age: {fleet_data['equipment_age_years'].mean():.1f} years")
print(f"  Median age: {fleet_data['equipment_age_years'].median():.1f} years")
print(f"  Min age: {fleet_data['equipment_age_years'].min():.1f} years")
print(f"  Max age: {fleet_data['equipment_age_years'].max():.1f} years")
print(f"  High risk units: {fleet_data['high_age_risk'].sum()} ({fleet_data['high_age_risk'].mean()*100:.1f}%)")
print(f"  New equipment: {fleet_data['new_equipment'].sum()} ({fleet_data['new_equipment'].mean()*100:.1f}%)")


CREATING EQUIPMENT AGE FEATURES
Data merged successfully: (6985, 17)

Equipment age features created:
  - equipment_age_days
  - equipment_age_months
  - equipment_age_years
  - high_age_risk (>10 years)
  - mid_age_risk (5-10 years)
  - new_equipment (<2 years)

Age Distribution:
  Mean age: 4.7 years
  Median age: 4.2 years
  Min age: 0.1 years
  Max age: 9.9 years
  High risk units: 0 (0.0%)
  New equipment: 1576 (22.6%)
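
Two properties of the age cell are worth noting: ages depend on `pd.Timestamp.now()`, and the three flags leave units aged 2 to 5 years without any flag. A minimal sketch of one alternative, assuming a fixed reference date (the last operation date is used here as an assumption, not the notebook's choice) and `pd.cut` to assign exactly one band per row. The band labels and the toy purchase dates are illustrative:

```python
import pandas as pd

# Invented purchase dates standing in for the master table
purchases = pd.DataFrame({
    "id_alat": ["ALAT_001", "ALAT_002", "ALAT_003", "ALAT_004"],
    "tgl_pembelian": pd.to_datetime(
        ["2025-01-15", "2022-06-01", "2018-03-10", "2013-10-31"]
    ),
})

# Assumed fixed reference date, so reruns give identical ages
reference_date = pd.Timestamp("2025-10-31")

age_days = (reference_date - purchases["tgl_pembelian"]).dt.days
purchases["equipment_age_years"] = age_days / 365.25

# Left-closed bands [0, 2), [2, 5), [5, 10), [10, inf).
# Unlike the notebook's flags, a unit aged exactly 10 years lands in the top band.
purchases["age_band"] = pd.cut(
    purchases["equipment_age_years"],
    bins=[0, 2, 5, 10, float("inf")],
    labels=["new", "established", "mid_age_risk", "high_age_risk"],
    right=False,
)

print(purchases[["id_alat", "equipment_age_years", "age_band"]])
```

One-hot encoding `age_band` (for example with `pd.get_dummies`) would recover flag columns comparable to the original ones.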


## 3. Usage Pattern Features

In [4]:
print("=" * 80)
print("CREATING USAGE PATTERN FEATURES")
print("=" * 80)

# Convert timestamp
fleet_data['tanggal_operasi'] = pd.to_datetime(fleet_data['tanggal_operasi'])

# Sort by equipment and date
fleet_data = fleet_data.sort_values(['id_alat', 'tanggal_operasi'])

# durasi_jam holds the hours logged per record (one shift), not per calendar day
if 'durasi_jam' in fleet_data.columns:
    # Hours worked in this record
    fleet_data['daily_usage_hours'] = fleet_data['durasi_jam']
    
    # Utilization rate: this record's hours as a share of a 24-hour day
    fleet_data['utilization_rate'] = (fleet_data['daily_usage_hours'] / 24).clip(0, 1)
    
    # Usage intensity flags
    fleet_data['high_usage_flag'] = (fleet_data['utilization_rate'] > 0.8).astype(int)
    fleet_data['low_usage_flag'] = (fleet_data['utilization_rate'] < 0.2).astype(int)
    fleet_data['overwork_flag'] = (fleet_data['utilization_rate'] > 0.9).astype(int)
    fleet_data['idle_flag'] = (fleet_data['status_operasi'] == 'Standby').astype(int)
    
    print("\nUsage pattern features created:")
    print("  - daily_usage_hours")
    print("  - utilization_rate (0-1 scale)")
    print("  - high_usage_flag (>80%)")
    print("  - low_usage_flag (<20%)")
    print("  - overwork_flag (>90%)")
    print("  - idle_flag (Standby status)")
    
    # Rolling sums over each unit's last 7 and 30 records (row-count windows, not calendar days)
    fleet_data['cumulative_hours_7d'] = fleet_data.groupby('id_alat')['daily_usage_hours'].transform(
        lambda x: x.rolling(window=7, min_periods=1).sum()
    )
    fleet_data['cumulative_hours_30d'] = fleet_data.groupby('id_alat')['daily_usage_hours'].transform(
        lambda x: x.rolling(window=30, min_periods=1).sum()
    )
    fleet_data['avg_daily_hours_7d'] = fleet_data['cumulative_hours_7d'] / 7
    
    print("  - cumulative_hours_7d")
    print("  - cumulative_hours_30d")
    print("  - avg_daily_hours_7d")
    
    # Rolling standard deviation of hours over the last 7 records (NaN on each unit's first record)
    fleet_data['usage_variance_7d'] = fleet_data.groupby('id_alat')['daily_usage_hours'].transform(
        lambda x: x.rolling(window=7, min_periods=1).std()
    )
    print("  - usage_variance_7d")

print(f"\nUsage Statistics:")
print(f"  Mean utilization: {fleet_data['utilization_rate'].mean():.2%}")
print(f"  High usage operations: {fleet_data['high_usage_flag'].sum()} ({fleet_data['high_usage_flag'].mean()*100:.1f}%)")
print(f"  Overwork situations: {fleet_data['overwork_flag'].sum()} ({fleet_data['overwork_flag'].mean()*100:.1f}%)")


CREATING USAGE PATTERN FEATURES

Usage pattern features created:
  - daily_usage_hours
  - utilization_rate (0-1 scale)
  - high_usage_flag (>80%)
  - low_usage_flag (<20%)
  - overwork_flag (>90%)
  - idle_flag (Standby status)
  - cumulative_hours_7d
  - cumulative_hours_30d
  - avg_daily_hours_7d
  - usage_variance_7d

Usage Statistics:
  Mean utilization: 19.54%
  High usage operations: 0 (0.0%)
  Overwork situations: 0 (0.0%)
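
The `rolling(window=7)` and `rolling(window=30)` calls in the usage cell count rows, and each row is one shift record, so a "7d" window can span far more or fewer than seven calendar days. A minimal sketch of a calendar-based alternative on invented data: records are first collapsed to one row per unit and day, then a `'7D'` offset window is applied unit by unit. The column name `hours_7d_calendar` is illustrative:

```python
import pandas as pd

# Invented shift records for a single unit
records = pd.DataFrame({
    "id_alat": ["ALAT_001"] * 4,
    "tanggal_operasi": pd.to_datetime(
        ["2025-07-01", "2025-07-01", "2025-07-02", "2025-07-10"]
    ),
    "durasi_jam": [2.0, 3.0, 4.0, 6.0],
})

# 1) Collapse shift-level records to one row per unit and calendar day
daily = (
    records.groupby(["id_alat", "tanggal_operasi"], as_index=False)["durasi_jam"]
    .sum()
    .sort_values(["id_alat", "tanggal_operasi"])
)

# 2) Calendar-based 7-day rolling sum, computed unit by unit
parts = []
for unit, group in daily.groupby("id_alat"):
    hours = group.set_index("tanggal_operasi")["durasi_jam"]
    rolled = hours.rolling("7D").sum().rename("hours_7d_calendar").reset_index()
    rolled.insert(0, "id_alat", unit)
    parts.append(rolled)
calendar_7d = pd.concat(parts, ignore_index=True)

# 3) Row-count window of 7 records, as in the cell above, for comparison
row_7 = records.groupby("id_alat")["durasi_jam"].transform(
    lambda x: x.rolling(window=7, min_periods=1).sum()
)

print(calendar_7d)
print(row_7.tolist())
```

On 2025-07-10 the calendar window holds only that day's 6 hours, while the row-count window still includes the three records from early July and reports 15.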


## 4. Maintenance Pattern Features

In [5]:
print("=" * 80)
print("CREATING MAINTENANCE FEATURES")
print("=" * 80)

# Days since the unit's previous record (a proxy for the maintenance cycle; NaN on each unit's first record)
fleet_data['days_since_last_record'] = fleet_data.groupby('id_alat')['tanggal_operasi'].diff().dt.days

# Overdue flags: gaps longer than 14 and 30 days (thresholds assume weekly/biweekly checks)
# NaN gaps compare as False, so a unit's first record is never flagged
fleet_data['overdue_maintenance_flag'] = (fleet_data['days_since_last_record'] > 14).astype(int)
fleet_data['critical_overdue_flag'] = (fleet_data['days_since_last_record'] > 30).astype(int)

# Maintenance activity indicators
fleet_data['maintenance_activity'] = (fleet_data['status_operasi'] == 'Maintenance').astype(int)
fleet_data['maintenance_count_30d'] = fleet_data.groupby('id_alat')['maintenance_activity'].transform(
    lambda x: x.rolling(window=30, min_periods=1).sum()
)

print("\nMaintenance features created:")
print("  - days_since_last_record")
print("  - overdue_maintenance_flag (>14 days)")
print("  - critical_overdue_flag (>30 days)")
print("  - maintenance_activity")
print("  - maintenance_count_30d")

# Operational status features
if 'status_operasi' in fleet_data.columns:
    fleet_data['breakdown_flag'] = (fleet_data['status_operasi'] == 'Breakdown').astype(int)
    fleet_data['standby_flag'] = (fleet_data['status_operasi'] == 'Standby').astype(int)
    fleet_data['operating_flag'] = (fleet_data['status_operasi'] == 'Beroperasi').astype(int)
    
    # Breakdown history with rolling windows
    fleet_data['breakdown_count_7d'] = fleet_data.groupby('id_alat')['breakdown_flag'].transform(
        lambda x: x.rolling(window=7, min_periods=1).sum()
    )
    fleet_data['breakdown_count_30d'] = fleet_data.groupby('id_alat')['breakdown_flag'].transform(
        lambda x: x.rolling(window=30, min_periods=1).sum()
    )
    fleet_data['breakdown_rate_30d'] = fleet_data.groupby('id_alat')['breakdown_flag'].transform(
        lambda x: x.rolling(window=30, min_periods=1).mean()
    )
    
    # Operating rate: share of the unit's last 7 records with status 'Beroperasi'
    fleet_data['operating_rate_7d'] = fleet_data.groupby('id_alat')['operating_flag'].transform(
        lambda x: x.rolling(window=7, min_periods=1).mean()
    )
    
    print("  - breakdown_flag")
    print("  - standby_flag")
    print("  - operating_flag")
    print("  - breakdown_count_7d")
    print("  - breakdown_count_30d")
    print("  - breakdown_rate_30d")
    print("  - operating_rate_7d")

print(f"\nMaintenance Statistics:")
print(f"  Breakdown incidents: {fleet_data['breakdown_flag'].sum()}")
print(f"  Standby records: {fleet_data['standby_flag'].sum()}")
print(f"  Mean operating rate (7d): {fleet_data['operating_rate_7d'].mean():.2%}")


CREATING MAINTENANCE FEATURES

Maintenance features created:
  - days_since_last_record
  - overdue_maintenance_flag (>14 days)
  - critical_overdue_flag (>30 days)
  - maintenance_activity
  - maintenance_count_30d
  - breakdown_flag
  - standby_flag
  - operating_flag
  - breakdown_count_7d
  - breakdown_count_30d
  - breakdown_rate_30d
  - operating_rate_7d

Maintenance Statistics:
  Breakdown incidents: 366
  Standby records: 686
  Mean operating rate (7d): 74.29%
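
`days_since_last_record` measures the gap to the previous record of any status, so it only approximates the maintenance cycle. A minimal sketch of a more direct measure on invented data, assuming `'Maintenance'` rows in `status_operasi` mark actual maintenance events: the maintenance date is carried forward within each unit and subtracted from the record date. The column name `days_since_maintenance` is illustrative:

```python
import pandas as pd

# Invented status log for two units
log = pd.DataFrame({
    "id_alat": ["ALAT_001"] * 3 + ["ALAT_002"] * 2,
    "tanggal_operasi": pd.to_datetime(
        ["2025-07-01", "2025-07-03", "2025-07-10", "2025-07-02", "2025-07-20"]
    ),
    "status_operasi": [
        "Beroperasi", "Maintenance", "Beroperasi", "Maintenance", "Breakdown"
    ],
}).sort_values(["id_alat", "tanggal_operasi"])

# Keep the date only on maintenance rows, then carry it forward within each unit
maintenance_date = log["tanggal_operasi"].where(log["status_operasi"] == "Maintenance")
last_maintenance = maintenance_date.groupby(log["id_alat"]).ffill()

# NaN until a unit's first maintenance record has been seen
log["days_since_maintenance"] = (log["tanggal_operasi"] - last_maintenance).dt.days
log["overdue_maintenance_flag"] = (log["days_since_maintenance"] > 14).astype(int)

print(log)
```

Rows before a unit's first maintenance record stay NaN, which keeps "never maintained in the observed window" distinguishable from "maintained today".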


## 5. Equipment Health Score

In [6]:
print("=" * 80)
print("CREATING EQUIPMENT HEALTH SCORE")
print("=" * 80)

# Component scores (normalized 0-100)

# 1. Age score (newer equipment = higher score)
max_age = fleet_data['equipment_age_years'].max()
if max_age > 0:
    fleet_data['health_score_age'] = 100 * (1 - fleet_data['equipment_age_years'] / max_age)
else:
    fleet_data['health_score_age'] = 100
fleet_data['health_score_age'] = fleet_data['health_score_age'].clip(0, 100)

# 2. Maintenance score (based on maintenance activity)
fleet_data['health_score_maintenance'] = 100 - (fleet_data['overdue_maintenance_flag'] * 50 +
                                                  fleet_data['critical_overdue_flag'] * 50)
fleet_data['health_score_maintenance'] = fleet_data['health_score_maintenance'].clip(0, 100)

# 3. Usage score (optimal usage around 70% utilization)
if 'utilization_rate' in fleet_data.columns:
    # Penalty for both overuse and underuse (optimal = 0.7)
    usage_deviation = abs(fleet_data['utilization_rate'] - 0.7) / 0.7
    fleet_data['health_score_usage'] = 100 * (1 - usage_deviation)
    fleet_data['health_score_usage'] = fleet_data['health_score_usage'].clip(0, 100)
else:
    fleet_data['health_score_usage'] = 100

# 4. Breakdown score (no recent breakdowns = higher score)
if 'breakdown_rate_30d' in fleet_data.columns:
    fleet_data['health_score_breakdown'] = 100 * (1 - fleet_data['breakdown_rate_30d'])
else:
    fleet_data['health_score_breakdown'] = 100

# 5. Operational efficiency score
if 'operating_rate_7d' in fleet_data.columns:
    fleet_data['health_score_efficiency'] = 100 * fleet_data['operating_rate_7d']
else:
    fleet_data['health_score_efficiency'] = 100

# Composite health score (weighted average)
fleet_data['equipment_health_score'] = (
    0.25 * fleet_data['health_score_age'] +
    0.25 * fleet_data['health_score_maintenance'] +
    0.20 * fleet_data['health_score_usage'] +
    0.20 * fleet_data['health_score_breakdown'] +
    0.10 * fleet_data['health_score_efficiency']
)

# Health categories (assigned from the composite score before the condition adjustment below)
fleet_data['health_category'] = pd.cut(
    fleet_data['equipment_health_score'],
    bins=[0, 40, 70, 100],
    labels=['Critical', 'Warning', 'Good']
)

# Equipment condition mapping
condition_scores = {'Baik': 100, 'Cukup': 70, 'Kurang': 40, 'Rusak': 10}
if 'kondisi' in fleet_data.columns:
    fleet_data['condition_score'] = fleet_data['kondisi'].map(condition_scores).fillna(70)
    # Adjust health score with actual condition
    fleet_data['equipment_health_score'] = (
        0.7 * fleet_data['equipment_health_score'] +
        0.3 * fleet_data['condition_score']
    )

print("\nHealth score components created:")
print("  - health_score_age (25% weight)")
print("  - health_score_maintenance (25% weight)")
print("  - health_score_usage (20% weight)")
print("  - health_score_breakdown (20% weight)")
print("  - health_score_efficiency (10% weight)")
print("  - equipment_health_score (composite 0-100)")
print("  - health_category (Critical/Warning/Good)")
print("  - condition_score (from kondisi field)")

print(f"\nHealth Distribution:")
print(fleet_data['health_category'].value_counts())
print(f"\nMean Health Score: {fleet_data['equipment_health_score'].mean():.1f}")
print(f"Critical equipment: {(fleet_data['health_category'] == 'Critical').sum()}")


CREATING EQUIPMENT HEALTH SCORE

Health score components created:
  - health_score_age (25% weight)
  - health_score_maintenance (25% weight)
  - health_score_usage (20% weight)
  - health_score_breakdown (20% weight)
  - health_score_efficiency (10% weight)
  - equipment_health_score (composite 0-100)
  - health_category (Critical/Warning/Good)
  - condition_score (from kondisi field)

Health Distribution:
health_category
Good        3707
Critical       4
Name: count, dtype: int64

Mean Health Score: 73.6
Critical equipment: 4
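
In the health cell, `health_category` is cut from the weighted composite before the 70/30 blend with `condition_score`, so the category and the final `equipment_health_score` can disagree. A worked sketch with one invented row, using the notebook's weights and condition mapping, that categorizes both the pre-blend and the post-blend score. The helper `categorize` and the use of `include_lowest=True` are additions for illustration:

```python
import pandas as pd

WEIGHTS = {
    "age": 0.25, "maintenance": 0.25, "usage": 0.20,
    "breakdown": 0.20, "efficiency": 0.10,
}
CONDITION_SCORES = {"Baik": 100, "Cukup": 70, "Kurang": 40, "Rusak": 10}


def categorize(score: pd.Series) -> pd.Series:
    # include_lowest=True (an addition) keeps a score of exactly 0 categorized
    return pd.cut(
        score,
        bins=[0, 40, 70, 100],
        labels=["Critical", "Warning", "Good"],
        include_lowest=True,
    )


# One invented record with its five component scores (0-100) and condition
rows = pd.DataFrame({
    "age": [50.0], "maintenance": [100.0], "usage": [30.0],
    "breakdown": [90.0], "efficiency": [80.0], "kondisi": ["Baik"],
})

# Weighted composite: 12.5 + 25 + 6 + 18 + 8 = 69.5
composite = sum(weight * rows[name] for name, weight in WEIGHTS.items())

# Blend with the recorded condition: 0.7 * 69.5 + 0.3 * 100 = 78.65
condition = rows["kondisi"].map(CONDITION_SCORES).fillna(70)
final_score = 0.7 * composite + 0.3 * condition

category_before = categorize(composite)    # what the notebook stores
category_after = categorize(final_score)   # category of the reported score

print(composite.iloc[0], final_score.iloc[0])
print(category_before.iloc[0], category_after.iloc[0])
```

Here the stored category would be Warning while the reported score of 78.65 falls in the Good band, which is one reason to assign the category after the blend.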


## 6. Interaction Features

In [7]:
print("=" * 80)
print("CREATING INTERACTION FEATURES")
print("=" * 80)

# Age × Usage interaction (older equipment working harder = higher risk)
if 'utilization_rate' in fleet_data.columns:
    fleet_data['age_usage_interaction'] = (
        fleet_data['equipment_age_years'] * fleet_data['utilization_rate']
    )
    print("\nAge × Usage interaction created")

# Age × Breakdown interaction (older equipment with breakdowns = critical)
if 'breakdown_rate_30d' in fleet_data.columns:
    fleet_data['age_breakdown_risk'] = (
        fleet_data['equipment_age_years'] * fleet_data['breakdown_rate_30d']
    )
    print("Age × Breakdown risk created")

# Overdue maintenance × High usage (critical risk combination)
if 'overdue_maintenance_flag' in fleet_data.columns and 'high_usage_flag' in fleet_data.columns:
    fleet_data['critical_risk_flag'] = (
        (fleet_data['overdue_maintenance_flag'] == 1) & 
        (fleet_data['high_usage_flag'] == 1)
    ).astype(int)
    print("Critical risk flag (overdue + high usage)")

# Low health × High usage (equipment degradation risk)
if 'equipment_health_score' in fleet_data.columns and 'high_usage_flag' in fleet_data.columns:
    fleet_data['degradation_risk_flag'] = (
        (fleet_data['equipment_health_score'] < 50) & 
        (fleet_data['high_usage_flag'] == 1)
    ).astype(int)
    print("Degradation risk flag (low health + high usage)")

# Combined risk score (multi-factor risk assessment)
risk_components = []
weights = []

if 'high_age_risk' in fleet_data.columns:
    risk_components.append(fleet_data['high_age_risk'])
    weights.append(0.25)

if 'overdue_maintenance_flag' in fleet_data.columns:
    risk_components.append(fleet_data['overdue_maintenance_flag'])
    weights.append(0.25)

if 'high_usage_flag' in fleet_data.columns:
    risk_components.append(fleet_data['high_usage_flag'])
    weights.append(0.20)

if 'breakdown_flag' in fleet_data.columns:
    risk_components.append(fleet_data['breakdown_flag'])
    weights.append(0.30)

if len(risk_components) > 0:
    # Normalize weights
    total_weight = sum(weights)
    weights = [w/total_weight for w in weights]
    
    # Calculate weighted risk score
    fleet_data['combined_risk_score'] = sum(w * comp for w, comp in zip(weights, risk_components))
    print("Combined risk score (0-1 scale, weighted)")

# Equipment type risk profiles (breakdown rate per type over the full date range)
if 'tipe_alat' in fleet_data.columns:
    type_breakdown_rate = fleet_data.groupby('tipe_alat')['breakdown_flag'].transform('mean')
    fleet_data['type_risk_profile'] = type_breakdown_rate
    print("Equipment type risk profile")

print("\nAll interaction features created!")
print(f"\nHigh-risk equipment count: {(fleet_data['combined_risk_score'] > 0.5).sum()}")
print(f"Critical risk situations: {fleet_data['critical_risk_flag'].sum()}")


CREATING INTERACTION FEATURES

Age × Usage interaction created
Age × Breakdown risk created
Critical risk flag (overdue + high usage)
Degradation risk flag (low health + high usage)
Combined risk score (0-1 scale, weighted)
Equipment type risk profile

All interaction features created!

High-risk equipment count: 0
Critical risk situations: 0
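
The combined risk score is a weighted sum of four binary flags with weights 0.25, 0.25, 0.20, and 0.30. Since `high_age_risk` and `high_usage_flag` are zero for every record in this dataset (see the age and usage outputs), the highest reachable score is 0.25 + 0.30 = 0.55, and only when an overdue gap and a breakdown fall on the same record. A minimal sketch on invented flags that reproduces the arithmetic; the function name `combined_risk_score` is illustrative:

```python
import pandas as pd

RISK_WEIGHTS = {
    "high_age_risk": 0.25,
    "overdue_maintenance_flag": 0.25,
    "high_usage_flag": 0.20,
    "breakdown_flag": 0.30,
}


def combined_risk_score(flags: pd.DataFrame, weights=RISK_WEIGHTS) -> pd.Series:
    """Weighted sum of the available flags, with weights renormalized to 1."""
    present = {col: w for col, w in weights.items() if col in flags.columns}
    total = sum(present.values())
    score = pd.Series(0.0, index=flags.index)
    for column, weight in present.items():
        score = score + (weight / total) * flags[column]
    return score


# Invented records: breakdown only, overdue + breakdown, nothing flagged
flags = pd.DataFrame({
    "high_age_risk": [0, 0, 0],
    "overdue_maintenance_flag": [0, 1, 0],
    "high_usage_flag": [0, 0, 0],
    "breakdown_flag": [1, 1, 0],
})

scores = combined_risk_score(flags)
print(scores.round(2).tolist())
print("Above 0.5:", int((scores > 0.5).sum()))
```

A threshold of 0.5 therefore never fires on a single flag, which is consistent with the high-risk count of 0 reported above.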


## 7. Feature Summary & Validation

In [8]:
print("=" * 80)
print("FLEET FEATURE ENGINEERING SUMMARY")
print("=" * 80)

# Group columns by name pattern (a column can match several patterns; some engineered columns match none)
age_features = [col for col in fleet_data.columns if 'age' in col and col.startswith('equipment')]
usage_features = [col for col in fleet_data.columns if 'usage' in col or 'utilization' in col]
maintenance_features = [col for col in fleet_data.columns if 'maintenance' in col or 'breakdown' in col]
health_features = [col for col in fleet_data.columns if 'health' in col]
interaction_features = [col for col in fleet_data.columns if 'interaction' in col or 'risk' in col.lower()]

print(f"\nFeature Categories:")
print(f"  Age Features: {len(age_features)}")
print(f"    {age_features[:5]}")
print(f"  Usage Features: {len(usage_features)}")
print(f"    {usage_features[:5]}")
print(f"  Maintenance Features: {len(maintenance_features)}")
print(f"    {maintenance_features[:5]}")
print(f"  Health Features: {len(health_features)}")
print(f"    {health_features[:5]}")
print(f"  Interaction Features: {len(interaction_features)}")
print(f"    {interaction_features[:5]}")

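# Total features
# NOTE: this subtracts all 9 master columns, but the merge carried only 4 of them
# (fleet_data had 17 columns after the merge), so 'Original' is overstated by 4
# and 'Engineered' is understated by 4
original_cols = len(fleet_ops.columns) + len(equipment_master.columns) - 1  # -1 for merge key
total_engineered = len(fleet_data.columns) - original_cols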

print(f"\nTotal Features:")
print(f"  Original: {original_cols}")
print(f"  Engineered: {total_engineered}")
print(f"  Total: {len(fleet_data.columns)}")

# Missing values check (covers only the pattern-matched columns above)
all_engineered = age_features + usage_features + maintenance_features + health_features + interaction_features
all_engineered = list(set(all_engineered))  # Remove duplicates

if all_engineered:
    missing_counts = fleet_data[all_engineered].isnull().sum()
    missing_pct = (missing_counts / len(fleet_data) * 100).round(2)
    
    if missing_counts.sum() > 0:
        print(f"\nMissing Values (top 10):")
        missing_df = pd.DataFrame({
            'Count': missing_counts[missing_counts > 0],
            'Percentage': missing_pct[missing_counts > 0]
        }).sort_values('Count', ascending=False).head(10)
        print(missing_df)
    else:
        print(f"\nNo missing values in engineered features!")

# Data quality check
print(f"\nData Quality:")
print(f"  Duplicate records: {fleet_data.duplicated().sum()}")
print(f"  Records with all features: {fleet_data[all_engineered].notna().all(axis=1).sum()}")

# Data shape
print(f"\nFinal Dataset Shape: {fleet_data.shape}")
print(f"  Rows: {fleet_data.shape[0]:,}")
print(f"  Columns: {fleet_data.shape[1]}")

# Feature value ranges
print(f"\nKey Feature Ranges:")
if 'equipment_health_score' in fleet_data.columns:
    print(f"  Health Score: {fleet_data['equipment_health_score'].min():.1f} - {fleet_data['equipment_health_score'].max():.1f}")
if 'combined_risk_score' in fleet_data.columns:
    print(f"  Risk Score: {fleet_data['combined_risk_score'].min():.3f} - {fleet_data['combined_risk_score'].max():.3f}")
if 'utilization_rate' in fleet_data.columns:
    print(f"  Utilization Rate: {fleet_data['utilization_rate'].min():.2%} - {fleet_data['utilization_rate'].max():.2%}")


FLEET FEATURE ENGINEERING SUMMARY

Feature Categories:
  Age Features: 3
    ['equipment_age_days', 'equipment_age_months', 'equipment_age_years']
  Usage Features: 7
    ['daily_usage_hours', 'utilization_rate', 'high_usage_flag', 'low_usage_flag', 'usage_variance_7d']
  Maintenance Features: 10
    ['overdue_maintenance_flag', 'maintenance_activity', 'maintenance_count_30d', 'breakdown_flag', 'breakdown_count_7d']
  Health Features: 7
    ['health_score_age', 'health_score_maintenance', 'health_score_usage', 'health_score_breakdown', 'health_score_efficiency']
  Interaction Features: 8
    ['high_age_risk', 'mid_age_risk', 'age_usage_interaction', 'age_breakdown_risk', 'critical_risk_flag']

Total Features:
  Original: 21
  Engineered: 38
  Total: 59

Missing Values (top 10):
                   Count  Percentage
usage_variance_7d     96        1.37

Data Quality:
  Duplicate records: 0
  Records with all features: 6889

Final Dataset Shape: (6985, 59)
  Rows: 6,985
  Columns: 59

Key
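
The summary cell identifies engineered features by name patterns, which double-counts some columns and skips others such as `new_equipment` or `days_since_last_record`. A minimal sketch of a pattern-free alternative on invented data: snapshot the column list right after the merge, then treat every later column as engineered. The helper name `engineered_columns` is illustrative:

```python
import pandas as pd


def engineered_columns(source_columns, final_columns):
    """Return the columns of the final frame that were not in the source snapshot."""
    source = set(source_columns)
    return [col for col in final_columns if col not in source]


# Invented stand-in for fleet_data right after the merge
merged = pd.DataFrame({
    "id_alat": ["ALAT_001", "ALAT_002"],
    "durasi_jam": [5.0, None],
    "umur_tahun": [3, 7],
})
source_columns = list(merged.columns)  # snapshot taken before any feature is added

features = merged.copy()
features["utilization_rate"] = features["durasi_jam"] / 24
features["mid_age_risk"] = (features["umur_tahun"] >= 5).astype(int)
features["new_equipment"] = (features["umur_tahun"] < 2).astype(int)

new_cols = engineered_columns(source_columns, features.columns)

# Missing-value check over every engineered column, not just pattern matches
missing = features[new_cols].isna().sum()
missing = missing[missing > 0]

print(f"Source columns: {len(source_columns)}, engineered: {len(new_cols)}")
print(missing)
```

Applied to the notebook, the snapshot would be taken directly after the `fleet_ops.merge(...)` call in section 2.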

## 8. Save Processed Features

In [9]:
print("=" * 80)
print("SAVING FLEET FEATURES")
print("=" * 80)

# Define output paths
output_dir = Path('data/processed')
output_dir.mkdir(parents=True, exist_ok=True)
output_file = output_dir / 'fleet_features.csv'

# Save to CSV
fleet_data.to_csv(output_file, index=False)
print(f"\nFeatures saved to: {output_file.absolute()}")
print(f"  File size: {output_file.stat().st_size / 1024 / 1024:.2f} MB")
print(f"  Records: {len(fleet_data):,}")

# Also save to feature store (parquet for efficiency)
feature_store_dir = Path('data/feature_store')
feature_store_dir.mkdir(parents=True, exist_ok=True)
feature_store_file = feature_store_dir / 'fleet_features.parquet'

fleet_data.to_parquet(feature_store_file, index=False, compression='snappy')
print(f"\nFeatures saved to feature store: {feature_store_file.absolute()}")
print(f"  File size: {feature_store_file.stat().st_size / 1024 / 1024:.2f} MB")
print(f"  Compression ratio: {(output_file.stat().st_size / feature_store_file.stat().st_size):.1f}x")

# Save feature metadata
metadata = {
    'created_at': pd.Timestamp.now().isoformat(),
    'total_records': len(fleet_data),
    'total_features': len(fleet_data.columns),
    'date_range': {
        'start': fleet_data['tanggal_operasi'].min().isoformat(),
        'end': fleet_data['tanggal_operasi'].max().isoformat()
    },
    'equipment_count': fleet_data['id_alat'].nunique(),
    'equipment_types': fleet_data['tipe_alat'].value_counts().to_dict(),
    'health_distribution': fleet_data['health_category'].value_counts().to_dict()
}

import json
metadata_file = feature_store_dir / 'fleet_features_metadata.json'
with open(metadata_file, 'w') as f:
    json.dump(metadata, f, indent=2, default=str)
print(f"\nMetadata saved to: {metadata_file.absolute()}")

print("\n" + "=" * 80)
print("FLEET FEATURE ENGINEERING COMPLETE!")
print("=" * 80)

print("\nDataset Summary:")
print(f"  Equipment tracked: {fleet_data['id_alat'].nunique()}")
print(f"  Date range: {metadata['date_range']['start'][:10]} to {metadata['date_range']['end'][:10]}")
print(f"  Total operations: {len(fleet_data):,}")

print("\nNext Steps:")
print("  1. Train Equipment Failure Prediction model (05_modeling_fleet/)")
print("  2. Train Port Operability Forecast model")
print("  3. Develop predictive maintenance scheduling")

print("\nOutput files:")
print(f"  - CSV: {output_file.absolute()}")
print(f"  - Parquet: {feature_store_file.absolute()}")
print(f"  - Metadata: {metadata_file.absolute()}")


SAVING FLEET FEATURES



Features saved to: c:\Users\I5\Documents\asah-2025\capstone-project\minewise_ml\data\processed\fleet_features.csv
  File size: 3.25 MB
  Records: 6,985

Features saved to feature store: c:\Users\I5\Documents\asah-2025\capstone-project\minewise_ml\data\feature_store\fleet_features.parquet
  File size: 0.49 MB
  Compression ratio: 6.7x

Metadata saved to: c:\Users\I5\Documents\asah-2025\capstone-project\minewise_ml\data\feature_store\fleet_features_metadata.json

FLEET FEATURE ENGINEERING COMPLETE!

Dataset Summary:
  Equipment tracked: 96
  Date range: 2025-07-01 to 2025-10-31
  Total operations: 6,985

Next Steps:
  1. Train Equipment Failure Prediction model (05_modeling_fleet/)
  2. Train Port Operability Forecast model
  3. Develop predictive maintenance scheduling

Output files:
  - CSV: c:\Users\I5\Documents\asah-2025\capstone-project\minewise_ml\data\processed\fleet_features.csv
  - Parquet: c:\Users\I5\Documents\asah-2025\capstone-project\minewise_ml\data\feature_store\fleet_fe