# 🔋 EV Battery Digital Twin - Model Training Notebook

This notebook trains two ML models:
1. **RUL Model**: Predicts Remaining Useful Life (battery cycles left)
2. **Failure Model**: Predicts probability of battery failure

## 📋 Instructions
1. Upload your dataset (CSV format)
2. Run all cells in order
3. Download the trained models (.joblib files) at the end

## 📊 Expected Dataset Format
Your CSV should have these columns:
- `soc` - State of Charge (%)
- `soh` - State of Health (%)
- `voltage` - Battery Voltage (V)
- `current` - Current (A)
- `temperature` - Temperature (°C)
- `speed` - Vehicle Speed (km/h)
- `rul` - Target: Remaining Useful Life (cycles)
- `failure_prob` - Target: Failure Probability (0-1)

**Don't have a dataset?** No problem! This notebook can generate synthetic data for you.
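
Before training on your own file, it can help to confirm that its header matches the column list above. The helper below is a minimal sketch; `EXPECTED_COLUMNS` and `missing_columns` are illustrative names, not part of the notebook.

```python
# Columns the notebook expects, in the order listed above
EXPECTED_COLUMNS = [
    "soc", "soh", "voltage", "current",
    "temperature", "speed", "rul", "failure_prob",
]


def missing_columns(columns):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Example: a header that lacks both target columns
header = ["soc", "soh", "voltage", "current", "temperature", "speed"]
print(missing_columns(header))  # ['rul', 'failure_prob']
```

With a loaded DataFrame, `missing_columns(df.columns)` should return an empty list.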

## 📦 Step 1: Install Required Packages

In [None]:
!pip install -q numpy pandas scikit-learn xgboost matplotlib seaborn joblib

## 📚 Step 2: Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import xgboost as xgb
import joblib
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ All libraries imported successfully!")
print(f"XGBoost version: {xgb.__version__}")
print(f"Pandas version: {pd.__version__}")

## 📁 Step 3: Load Your Dataset

**Option A**: Upload your own CSV file (recommended)

**Option B**: Generate synthetic data (for testing)

In [None]:
# Option A: Load your CSV file
# Set USE_SYNTHETIC_DATA = False below, then uncomment and modify the path to your dataset
# df = pd.read_csv('your_dataset.csv')

# Option B: Generate synthetic data
USE_SYNTHETIC_DATA = True  # Set to False if using your own dataset

if USE_SYNTHETIC_DATA:
    print("🔄 Generating synthetic data...")
    np.random.seed(42)
    n_samples = 10000
    
    # Generate realistic EV battery data
    soc = np.random.uniform(10, 100, n_samples)
    soh = np.random.uniform(70, 100, n_samples)
    voltage = 300 + (soc / 100) * 100 + np.random.normal(0, 5, n_samples)
    current = np.random.uniform(-200, 200, n_samples)
    temperature = np.random.uniform(15, 45, n_samples)
    speed = np.random.uniform(0, 120, n_samples)
    
    # Generate target variables
    # RUL depends on SOH, temperature, and usage patterns
    rul = (soh / 100) * 1000 - (temperature - 25) * 10 + np.random.normal(0, 50, n_samples)
    rul = np.clip(rul, 0, 1000)
    
    # Failure probability inversely related to SOH and RUL
    failure_prob = 1 - (soh / 100) * (rul / 1000) + np.random.normal(0, 0.1, n_samples)
    failure_prob = np.clip(failure_prob, 0, 1)
    
    df = pd.DataFrame({
        'soc': soc,
        'soh': soh,
        'voltage': voltage,
        'current': current,
        'temperature': temperature,
        'speed': speed,
        'rul': rul,
        'failure_prob': failure_prob
    })
    print(f"✅ Generated {len(df):,} synthetic samples")
else:
    # If you're using Colab, uncomment this to upload your file
    # from google.colab import files
    # uploaded = files.upload()
    # df = pd.read_csv(list(uploaded.keys())[0])
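    # Fail early with a clear message if no dataset was loaded
    if 'df' not in globals():
        raise RuntimeError("No dataset loaded: uncomment Option A above or the Colab upload lines.")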

print(f"\n📊 Dataset shape: {df.shape}")
print(f"\n📋 Dataset info:")
df.info()
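
For reference, the synthetic targets generated in the cell above follow the relations below. Here $h$ is `soh`, $T$ is `temperature`, $R$ is `rul`, $p_f$ is `failure_prob`, and $\varepsilon \sim \mathcal{N}(0, 50^2)$ and $\eta \sim \mathcal{N}(0, 0.1^2)$ are the noise terms. The symbols are shorthand introduced here, not names used in the code.

$$
\begin{aligned}
R &= \operatorname{clip}\left(\frac{h}{100} \cdot 1000 - 10\,(T - 25) + \varepsilon,\; 0,\; 1000\right) \\
p_f &= \operatorname{clip}\left(1 - \frac{h}{100} \cdot \frac{R}{1000} + \eta,\; 0,\; 1\right)
\end{aligned}
$$

Ignoring noise, a battery with $h = 95$ at $T = 25$ gives $R = 950$ and $p_f = 1 - 0.95 \times 0.95 = 0.0975$.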

## 🔍 Step 4: Exploratory Data Analysis

In [None]:
# Display first few rows
print("📊 First 5 rows of the dataset:")
display(df.head())

# Statistical summary
print("\n📈 Statistical Summary:")
display(df.describe())

# Check for missing values
print("\n🔍 Missing values:")
missing = df.isnull().sum()
if missing.sum() == 0:
    print("✅ No missing values found!")
else:
    print(missing[missing > 0])

In [None]:
# Visualize feature distributions
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
fig.suptitle('Feature Distributions', fontsize=16, fontweight='bold')

# Pair each column with one subplot; zip stops after 8 panels if the CSV has extra columns
for ax, col in zip(axes.flat, df.columns):
    ax.hist(df[col], bins=50, color='skyblue', edgecolor='black', alpha=0.7)
    ax.set_title(col.upper(), fontweight='bold')
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))
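# Restrict to numeric columns so extra text columns in a custom CSV do not raise an error
correlation = df.select_dtypes('number').corr()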
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

## 🔧 Step 5: Data Preprocessing

In [None]:
# Define features and targets
feature_columns = ['soc', 'soh', 'voltage', 'current', 'temperature', 'speed']
X = df[feature_columns].values

# Two target variables
y_rul = df['rul'].values
y_failure = df['failure_prob'].values

print(f"✅ Features shape: {X.shape}")
print(f"✅ RUL target shape: {y_rul.shape}")
print(f"✅ Failure target shape: {y_failure.shape}")

In [None]:
# Split data (80% train, 20% test)
# A single call keeps the feature rows and both targets aligned
X_train, X_test, y_rul_train, y_rul_test, y_failure_train, y_failure_test = train_test_split(
    X, y_rul, y_failure, test_size=0.2, random_state=42
)

print(f"✅ Training set size: {X_train.shape[0]:,} samples")
print(f"✅ Test set size: {X_test.shape[0]:,} samples")
print(f"✅ Split ratio: {X_train.shape[0]/X.shape[0]*100:.1f}% train / {X_test.shape[0]/X.shape[0]*100:.1f}% test")

In [None]:
# Feature scaling: fit on the training split only, then reuse those statistics for the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Feature scaling completed")
print(f"\nScaled feature means (should be ~0): {X_train_scaled.mean(axis=0).round(4)}")
print(f"Scaled feature stds (should be ~1): {X_train_scaled.std(axis=0).round(4)}")
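
To make the scaling step concrete: `StandardScaler` subtracts each feature's training mean and divides by its training standard deviation. The following is a minimal NumPy sketch on toy data; the arrays are made up for illustration and stand in for `X_train` and `X_test`.

```python
import numpy as np

# Toy single-feature data standing in for X_train / X_test
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics come from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse training statistics; no refit on test data

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value lands outside the training range after scaling, which is expected: the scaler only guarantees zero mean and unit variance on the data it was fitted on.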

## 🤖 Step 6: Train RUL Prediction Model

In [None]:
print("🔄 Training RUL Model (XGBoost)...")
print("=" * 50)

# Configure XGBoost for RUL prediction
rul_model = xgb.XGBRegressor(
    n_estimators=200,
    max_depth=8,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='reg:squarederror',
    random_state=42,
    n_jobs=-1
)

# Train the model
rul_model.fit(
    X_train_scaled, y_rul_train,
    eval_set=[(X_test_scaled, y_rul_test)],
    verbose=False
)

print("✅ RUL Model training completed!")

In [None]:
# Evaluate RUL model
y_rul_pred = rul_model.predict(X_test_scaled)

rul_mae = mean_absolute_error(y_rul_test, y_rul_pred)
rul_rmse = np.sqrt(mean_squared_error(y_rul_test, y_rul_pred))
rul_r2 = r2_score(y_rul_test, y_rul_pred)

print("\n📊 RUL Model Performance:")
print("=" * 50)
print(f"Mean Absolute Error (MAE):  {rul_mae:.2f} cycles")
print(f"Root Mean Squared Error:     {rul_rmse:.2f} cycles")
print(f"R² Score:                    {rul_r2:.4f}")
print("=" * 50)

if rul_r2 > 0.9:
    print("✅ Excellent model performance! (R² > 0.9)")
elif rul_r2 > 0.8:
    print("✅ Good model performance! (R² > 0.8)")
else:
    print("⚠️ Model performance could be improved (R² ≤ 0.8)")
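
The three metrics reported above can be computed by hand, which may help when interpreting them. The function below is a plain-Python sketch of the standard definitions; `regression_metrics` is an illustrative name, and the notebook itself uses the scikit-learn functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


mae, rmse, r2 = regression_metrics([3, 5, 7], [2, 5, 9])
print(mae, round(rmse, 4), r2)  # 1.0 1.291 0.375
```

MAE and RMSE are in the target's units (cycles for RUL), while R² is unitless and compares the model against always predicting the mean.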

In [None]:
# Visualize RUL predictions
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Actual vs Predicted
axes[0].scatter(y_rul_test, y_rul_pred, alpha=0.5, s=20)
axes[0].plot([y_rul_test.min(), y_rul_test.max()], 
             [y_rul_test.min(), y_rul_test.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual RUL (cycles)', fontsize=12)
axes[0].set_ylabel('Predicted RUL (cycles)', fontsize=12)
axes[0].set_title(f'RUL Predictions (R²={rul_r2:.4f})', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residuals
residuals = y_rul_test - y_rul_pred
axes[1].scatter(y_rul_pred, residuals, alpha=0.5, s=20)
axes[1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1].set_xlabel('Predicted RUL (cycles)', fontsize=12)
axes[1].set_ylabel('Residuals (cycles)', fontsize=12)
axes[1].set_title('Residual Plot', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Feature importance for RUL model
plt.figure(figsize=(10, 6))
importance = rul_model.feature_importances_
indices = np.argsort(importance)[::-1]

plt.bar(range(len(importance)), importance[indices], color='skyblue', edgecolor='black')
plt.xticks(range(len(importance)), [feature_columns[i] for i in indices], rotation=45)
plt.xlabel('Features', fontsize=12)
plt.ylabel('Importance', fontsize=12)
plt.title('RUL Model - Feature Importance', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\n🔍 Top 3 Most Important Features for RUL:")
for i in range(3):
    print(f"  {i+1}. {feature_columns[indices[i]]}: {importance[indices[i]]:.4f}")

## 🤖 Step 7: Train Failure Prediction Model

In [None]:
print("🔄 Training Failure Prediction Model (XGBoost)...")
print("=" * 50)

# Configure XGBoost for failure probability prediction
failure_model = xgb.XGBRegressor(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='reg:squarederror',
    random_state=42,
    n_jobs=-1
)

# Train the model
failure_model.fit(
    X_train_scaled, y_failure_train,
    eval_set=[(X_test_scaled, y_failure_test)],
    verbose=False
)

print("✅ Failure Model training completed!")

In [None]:
# Evaluate Failure model
y_failure_pred = failure_model.predict(X_test_scaled)
y_failure_pred = np.clip(y_failure_pred, 0, 1)  # Ensure predictions are in [0, 1]

failure_mae = mean_absolute_error(y_failure_test, y_failure_pred)
failure_rmse = np.sqrt(mean_squared_error(y_failure_test, y_failure_pred))
failure_r2 = r2_score(y_failure_test, y_failure_pred)

print("\n📊 Failure Model Performance:")
print("=" * 50)
print(f"Mean Absolute Error (MAE):  {failure_mae:.4f}")
print(f"Root Mean Squared Error:     {failure_rmse:.4f}")
print(f"R² Score:                    {failure_r2:.4f}")
print("=" * 50)

if failure_r2 > 0.7:
    print("✅ Good model performance! (R² > 0.7)")
elif failure_r2 > 0.5:
    print("✅ Acceptable model performance (R² > 0.5)")
else:
    print("⚠️ Model performance could be improved (R² ≤ 0.5)")

In [None]:
# Visualize Failure predictions
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Actual vs Predicted
axes[0].scatter(y_failure_test, y_failure_pred, alpha=0.5, s=20)
axes[0].plot([0, 1], [0, 1], 'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Failure Probability', fontsize=12)
axes[0].set_ylabel('Predicted Failure Probability', fontsize=12)
axes[0].set_title(f'Failure Predictions (R²={failure_r2:.4f})', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_xlim([0, 1])
axes[0].set_ylim([0, 1])

# Residuals
residuals = y_failure_test - y_failure_pred
axes[1].scatter(y_failure_pred, residuals, alpha=0.5, s=20)
axes[1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1].set_xlabel('Predicted Failure Probability', fontsize=12)
axes[1].set_ylabel('Residuals', fontsize=12)
axes[1].set_title('Residual Plot', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Feature importance for Failure model
plt.figure(figsize=(10, 6))
importance = failure_model.feature_importances_
indices = np.argsort(importance)[::-1]

plt.bar(range(len(importance)), importance[indices], color='salmon', edgecolor='black')
plt.xticks(range(len(importance)), [feature_columns[i] for i in indices], rotation=45)
plt.xlabel('Features', fontsize=12)
plt.ylabel('Importance', fontsize=12)
plt.title('Failure Model - Feature Importance', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\n🔍 Top 3 Most Important Features for Failure:")
for i in range(3):
    print(f"  {i+1}. {feature_columns[indices[i]]}: {importance[indices[i]]:.4f}")

## 💾 Step 8: Save Models and Scaler

In [None]:
# Create models directory
import os
os.makedirs('models', exist_ok=True)

# Save RUL model
rul_model_path = 'models/rul_xgb_model.joblib'
joblib.dump(rul_model, rul_model_path)
print(f"✅ RUL model saved to: {rul_model_path}")

# Save Failure model
failure_model_path = 'models/failure_xgb_model.joblib'
joblib.dump(failure_model, failure_model_path)
print(f"✅ Failure model saved to: {failure_model_path}")

# Save scaler
scaler_path = 'models/scaler.joblib'
joblib.dump(scaler, scaler_path)
print(f"✅ Scaler saved to: {scaler_path}")

# Save feature names
feature_names_path = 'models/feature_names.joblib'
joblib.dump(feature_columns, feature_names_path)
print(f"✅ Feature names saved to: {feature_names_path}")

print("\n" + "=" * 50)
print("🎉 All models saved successfully!")
print("=" * 50)

## 📊 Step 9: Model Summary Report

In [None]:
print("\n" + "="*60)
print("📊 FINAL MODEL SUMMARY REPORT")
print("="*60)

print("\n🔋 RUL Prediction Model (XGBoost)")
print("-" * 60)
print(f"  Model Type:           XGBoost Regressor")
print(f"  Number of Features:   {len(feature_columns)}")
print(f"  Training Samples:     {X_train.shape[0]:,}")
print(f"  Test Samples:         {X_test.shape[0]:,}")
print(f"  MAE:                  {rul_mae:.2f} cycles")
print(f"  RMSE:                 {rul_rmse:.2f} cycles")
print(f"  R² Score:             {rul_r2:.4f}")

print("\n⚠️ Failure Prediction Model (XGBoost)")
print("-" * 60)
print(f"  Model Type:           XGBoost Regressor")
print(f"  Number of Features:   {len(feature_columns)}")
print(f"  Training Samples:     {X_train.shape[0]:,}")
print(f"  Test Samples:         {X_test.shape[0]:,}")
print(f"  MAE:                  {failure_mae:.4f}")
print(f"  RMSE:                 {failure_rmse:.4f}")
print(f"  R² Score:             {failure_r2:.4f}")

print("\n📁 Saved Files:")
print("-" * 60)
print(f"  1. {rul_model_path}")
print(f"  2. {failure_model_path}")
print(f"  3. {scaler_path}")
print(f"  4. {feature_names_path}")

print("\n" + "="*60)
print("✅ Training Complete! Download the models folder.")
print("="*60)

## 🧪 Step 10: Test Predictions on Sample Data

In [None]:
# Test with a sample input
print("🧪 Testing models with sample data...\n")

# Create sample input
sample_data = np.array([[
    85.0,   # soc
    95.0,   # soh
    380.0,  # voltage
    -50.0,  # current
    25.0,   # temperature
    60.0    # speed
]])

# Scale the input
sample_scaled = scaler.transform(sample_data)

# Make predictions
rul_pred = rul_model.predict(sample_scaled)[0]
failure_pred = np.clip(failure_model.predict(sample_scaled)[0], 0, 1)

print("📊 Input Data:")
print("-" * 60)
for i, col in enumerate(feature_columns):
    print(f"  {col:15s}: {sample_data[0][i]:8.2f}")

print("\n🔮 Predictions:")
print("-" * 60)
print(f"  RUL:              {rul_pred:.2f} cycles")
print(f"  Failure Prob:     {failure_pred:.4f} ({failure_pred*100:.2f}%)")

# Health status
if failure_pred < 0.3:
    status = "✅ HEALTHY"
elif failure_pred < 0.6:
    status = "⚠️ WARNING"
else:
    status = "🚨 CRITICAL"
    
print(f"\n  Battery Status:   {status}")
print("=" * 60)
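
The status thresholds used in the cell above (0.3 and 0.6) can be wrapped in a small function so the same rule is reusable elsewhere. This is a sketch; `battery_status` is an illustrative name and the emoji prefixes are omitted.

```python
def battery_status(failure_prob):
    """Map a failure probability in [0, 1] to the status labels used above."""
    if failure_prob < 0.3:
        return "HEALTHY"
    if failure_prob < 0.6:
        return "WARNING"
    return "CRITICAL"


# Boundary values fall into the higher-severity band
for p in (0.10, 0.30, 0.59, 0.60):
    print(p, battery_status(p))
```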

## 📥 Step 11: Download Models (For Google Colab)

Uncomment the code below if you're using Google Colab to download the models.

In [None]:
# Uncomment this cell if running on Google Colab
# from google.colab import files
# import shutil

# # Create a zip file of the models folder
# shutil.make_archive('ev_models', 'zip', 'models')

# # Download the zip file
# files.download('ev_models.zip')

# print("✅ Models downloaded as 'ev_models.zip'")
# print("Extract the zip file and place the .joblib files in your project's 'models' folder")

## 📝 Next Steps

1. **Download** all `.joblib` files from the `models` folder
2. **Copy** them to your project: `C:\Users\pavan\OneDrive\Desktop\EV_BATTER_FINAL\models\`
3. **Run** your predictor:
   ```powershell
   python src\inference\live_predictor.py --interval 5 --write-back
   ```

## 🎯 Files You Need
- `rul_xgb_model.joblib` - RUL prediction model
- `failure_xgb_model.joblib` - Failure prediction model
- `scaler.joblib` - Feature scaler
- `feature_names.joblib` - Feature column names
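
At inference time the four files are used together: the feature names fix the column order, the scaler transforms the row, and each model predicts from the scaled row. The function below is a hedged sketch of that flow, mirroring Step 10. `predict_battery` is a hypothetical helper, not the project's `live_predictor.py`, whose contents are not shown here.

```python
import numpy as np


def predict_battery(features, scaler, rul_model, failure_model, feature_names):
    """Return RUL and failure probability for one reading.

    `features` is a dict keyed by feature name; the other arguments are the
    objects loaded from the four .joblib files.
    """
    # Build the row in the training column order, not the dict's order
    row = np.array([[features[name] for name in feature_names]], dtype=float)
    scaled = scaler.transform(row)
    rul = float(rul_model.predict(scaled)[0])
    # The failure model is a regressor, so clip its output to [0, 1]
    failure_prob = float(np.clip(failure_model.predict(scaled)[0], 0.0, 1.0))
    return {"rul": rul, "failure_prob": failure_prob}
```

In practice each argument would come from `joblib.load` on the corresponding file, for example `joblib.load('models/scaler.joblib')`.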

## 📚 Model Details
- **Algorithm**: XGBoost (Extreme Gradient Boosting)
- **Input Features**: SOC, SOH, Voltage, Current, Temperature, Speed
- **Output 1**: Remaining Useful Life (cycles)
- **Output 2**: Failure Probability (0-1)
- **Framework**: XGBoost's scikit-learn API (`XGBRegressor`), serialized with joblib

---

**🎉 Training Complete! Your models are ready for deployment.**