<div style="text-align: center;">
    <a href="https://www.dataia.eu/">
        <img border="0" src="https://github.com/ramp-kits/lwfa/raw/main/img/DATAIA-h.png" width="90%"></a>
</div>

# RAMP on eVTOL Battery Degradation Prediction

<i>Data Science Challenge</i>

## Introduction

### Electric Vertical Takeoff and Landing (eVTOL) Batteries

Electric Vertical Takeoff and Landing (eVTOL) vehicles represent the forefront of urban air mobility. These aircraft rely heavily on high-performance batteries that must maintain their capacity and reliability over hundreds of charging cycles in demanding conditions. Battery degradation is a critical concern, as it directly impacts vehicle range, payload capacity, and safety.

Battery degradation is influenced by numerous factors including temperature fluctuations, charge/discharge rates, depth of discharge, and overall usage patterns. Predicting how a battery's discharge capacity will change over time is crucial for maintenance scheduling, fleet management, and ensuring safe operations.

### The Challenge

In this RAMP challenge, you will develop models to predict battery degradation based on cycling data. You'll be provided with detailed measurements from battery cycles, including voltage curves, current profiles, temperature readings, and other key parameters. The goal is to predict the future discharge capacity of batteries as they age through multiple cycles.

This challenge has important real-world implications: accurate prediction of battery degradation can improve safety margins, optimize maintenance schedules, and extend the useful life of expensive battery packs in the growing eVTOL industry.

# Exploratory Data Analysis

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
pd.set_option('display.max_columns', None)

## Getting Access to the Data

In [None]:
import problem

X_df, y = problem.get_train_data()

## Understanding the Dataset

Our dataset consists of measurements from multiple batteries at different points in their lifecycle. Each battery undergoes many charge and discharge cycles, with measurements taken throughout each cycle.

Let's examine the structure of our data:

In [None]:
X_df.head()

In [None]:
print(f"Dataset shape: {X_df.shape}")
print(f"Number of unique batteries: {X_df['battery_id'].nunique()}")
print(f"Cycles per battery: {X_df.groupby('battery_id')['cycle_number'].max().mean():.1f} (average)")

## Key Features

Let's understand the key features in our dataset:

- **battery_id**: Unique identifier for each battery
- **cycle_number**: The sequence number of the charge/discharge cycle
- **charge_capacity**: The capacity measured during charging (mAh)
- **discharge_capacity**: The capacity measured during discharging (mAh) - this is our target variable
- **charge_time**: Duration of the charging phase (seconds)
- **discharge_time**: Duration of the discharging phase (seconds)
- **energy_efficiency**: Ratio of discharge energy to charge energy
- **max_voltage**: Maximum cell voltage during the cycle (V)
- **min_voltage**: Minimum cell voltage during the cycle (V)
- **avg_temperature**: Average temperature during the cycle (°C)
- **max_temperature**: Maximum temperature during the cycle (°C)
- **voltage_drop_rate**: Rate of voltage decrease during discharge (V/s)

## Summary Statistics

In [None]:
X_df.describe()

## Capacity Degradation Over Cycles

Let's examine how the discharge capacity of batteries degrades over multiple cycles:

In [None]:
plt.figure(figsize=(12, 6))
for battery_id in X_df['battery_id'].unique()[:5]:  # Plot first 5 batteries
    battery_data = X_df[X_df['battery_id'] == battery_id]
    plt.plot(battery_data['cycle_number'], battery_data['discharge_capacity'], 
             marker='o', markersize=4, linestyle='-', alpha=0.7, label=f'Battery {battery_id}')
    
plt.xlabel('Cycle Number', fontsize=12)
plt.ylabel('Discharge Capacity (mAh)', fontsize=12)
plt.title('Discharge Capacity vs. Cycle Number', fontsize=14)
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

We can see the characteristic capacity fade curve, where capacity decreases more rapidly in early cycles and then stabilizes into a more gradual decline.

## Correlations Between Features

In [None]:
plt.figure(figsize=(14, 10))
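# Restrict to numeric columns: identifiers such as battery_id may be strings,
# and recent pandas versions raise an error when asked to correlate them
correlation_matrix = X_df.select_dtypes(include='number').corr()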
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', linewidths=0.5)
plt.title('Feature Correlation Heatmap', fontsize=14)
plt.tight_layout()
plt.show()
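
A heatmap with many features can be hard to read, so it can help to rank features by the strength of their correlation with the target. The sketch below is illustrative only: it uses a small synthetic frame (the values and the `demo` name are made up) in place of `X_df`, and it assumes the target column is called `discharge_capacity` as in the feature list above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_df (values are invented for illustration only)
rng = np.random.default_rng(0)
n = 200
cycle = np.arange(1, n + 1)
demo = pd.DataFrame({
    "battery_id": ["B1"] * n,  # non-numeric column, excluded from the correlation
    "cycle_number": cycle,
    "avg_temperature": 25 + rng.normal(0, 2, n),
    "discharge_capacity": 3000 - 1.5 * cycle + rng.normal(0, 5, n),
})

# Pearson correlation of every numeric column with the target
corr_with_target = (
    demo.select_dtypes(include="number")
    .corr()["discharge_capacity"]
    .drop("discharge_capacity")
)

# Rank by absolute value so strong negative correlations are not overlooked
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a useful nonlinear dependence.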

## Analyzing Voltage Curves

Let's look at voltage profiles during discharge for a single battery at different points in its lifecycle:

In [None]:
def plot_discharge_curves(battery_id, cycles_to_plot=None):
    """Plot discharge voltage curves for specified cycles of a battery"""
    
    # If no specific cycles requested, choose a representative sample
    if cycles_to_plot is None:
        battery_data = X_df[X_df['battery_id'] == battery_id]
        total_cycles = battery_data['cycle_number'].max()
        cycles_to_plot = [1, int(total_cycles*0.25), int(total_cycles*0.5), 
                           int(total_cycles*0.75), total_cycles]
    
    # Load the raw cycle data for this battery
    # Note: In a real implementation, you'd load the actual voltage curves from files
    # For this example, we'll simulate voltage curves
    plt.figure(figsize=(12, 6))
    
    # Generate simulated discharge voltage curves for illustration
    # (In practice, you'd load the actual data)
    for cycle in cycles_to_plot:
        # Get capacity for this cycle
        cycle_data = X_df[(X_df['battery_id'] == battery_id) & 
                           (X_df['cycle_number'] == cycle)]
        
        if len(cycle_data) == 0:
            continue
        
        # Generate simulated discharge curve
        # In real data, you'd have actual time series 
        capacity = cycle_data['discharge_capacity'].values[0]
        max_v = cycle_data['max_voltage'].values[0]
        min_v = cycle_data['min_voltage'].values[0]
        
        # Simulate a discharge curve
        discharge_time = np.linspace(0, 1, 100)
        # As batteries age, the voltage drops more quickly
        curve_factor = 1 + (cycle/cycles_to_plot[-1])
        voltage = max_v - (max_v-min_v) * np.power(discharge_time, 1/curve_factor)
        
        plt.plot(discharge_time, voltage, label=f'Cycle {cycle} - Cap: {capacity:.1f} mAh')
    
    plt.xlabel('Normalized Discharge Time', fontsize=12)
    plt.ylabel('Cell Voltage (V)', fontsize=12)
    plt.title(f'Battery {battery_id} Discharge Voltage Curves', fontsize=14)
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

# Plot discharge curves for a sample battery
sample_battery_id = X_df['battery_id'].iloc[0]
plot_discharge_curves(sample_battery_id)

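As batteries degrade, their voltage typically drops more quickly during discharge. Keep in mind that the curves plotted above are simulated from the per-cycle summary features for illustration; they are not measured time series. With the raw cycle data, this change in the voltage curve shape is an important indicator of battery health.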

## Temperature Effects on Degradation

In [None]:
plt.figure(figsize=(12, 6))

# Create scatter plot with color based on temperature
scatter = plt.scatter(X_df['cycle_number'], X_df['discharge_capacity'],
                     c=X_df['avg_temperature'], alpha=0.6, cmap='viridis',
                     s=30, edgecolors='none')

plt.colorbar(scatter, label='Average Temperature (°C)')
plt.xlabel('Cycle Number', fontsize=12)
plt.ylabel('Discharge Capacity (mAh)', fontsize=12)
plt.title('Discharge Capacity vs. Cycle Number (colored by temperature)', fontsize=14)
plt.tight_layout()
plt.show()

## Capacity vs. Energy Efficiency

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(X_df['energy_efficiency'], X_df['discharge_capacity'], 
            alpha=0.5, c=X_df['cycle_number'], cmap='coolwarm')

plt.colorbar(label='Cycle Number')
plt.xlabel('Energy Efficiency', fontsize=12)
plt.ylabel('Discharge Capacity (mAh)', fontsize=12)
plt.title('Discharge Capacity vs. Energy Efficiency', fontsize=14)
plt.tight_layout()
plt.show()

# Feature Engineering

Let's develop some features that might be useful for predicting battery degradation:

In [None]:
def engineer_features(df):
    """Create derived features that might help with degradation prediction"""
    
    # Create a copy to avoid modifying the original dataframe
    df = df.copy()
    
    # Calculate capacity retention as percentage of initial capacity
    for battery_id in df['battery_id'].unique():
        battery_data = df[df['battery_id'] == battery_id]
        initial_capacity = battery_data[battery_data['cycle_number'] == 
                                      battery_data['cycle_number'].min()]['discharge_capacity'].values[0]
        df.loc[df['battery_id'] == battery_id, 'capacity_retention'] = \
            df.loc[df['battery_id'] == battery_id, 'discharge_capacity'] / initial_capacity
    
    # Calculate differential features
    for battery_id in df['battery_id'].unique():
        battery_data = df[df['battery_id'] == battery_id].sort_values('cycle_number')
        
        # Calculate capacity change rate (per cycle)
        capacity_change = battery_data['discharge_capacity'].diff() / battery_data['cycle_number'].diff()
        df.loc[battery_data.index, 'capacity_change_rate'] = capacity_change
        
        # Calculate voltage efficiency (min/max ratio)
        df.loc[battery_data.index, 'voltage_efficiency'] = \
            battery_data['min_voltage'] / battery_data['max_voltage']
        
        # Moving averages
        df.loc[battery_data.index, 'discharge_capacity_ma3'] = \
            battery_data['discharge_capacity'].rolling(3, min_periods=1).mean()
    
    # Calculate temperature variation
    df['temp_variation'] = df['max_temperature'] - df['avg_temperature']
    
    # Add polynomial features for cycle number
    df['cycle_number_squared'] = df['cycle_number'] ** 2
    df['cycle_number_cubed'] = df['cycle_number'] ** 3
    df['log_cycle_number'] = np.log1p(df['cycle_number'])
    
    # Add interaction features
    df['temp_cycle_interaction'] = df['avg_temperature'] * df['cycle_number']
    df['charge_time_efficiency'] = df['charge_time'] / df['energy_efficiency']
    
    return df

# Create engineered features
X_df_engineered = engineer_features(X_df)

# Check the new features
new_features = [col for col in X_df_engineered.columns if col not in X_df.columns]
print(f"Newly created features: {new_features}")
X_df_engineered[['cycle_number', 'discharge_capacity'] + new_features].head()
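
The per-battery loops in `engineer_features` can also be written with `groupby`, which is usually shorter and faster on large frames. The following is a minimal sketch on a tiny hand-made frame (the `demo` data is hypothetical), reproducing the retention, change-rate, and moving-average features under the assumption that rows are sorted by battery and cycle first.

```python
import pandas as pd

# Tiny hand-made example: two batteries, three cycles each
demo = pd.DataFrame({
    "battery_id": ["A", "A", "A", "B", "B", "B"],
    "cycle_number": [1, 2, 3, 1, 2, 3],
    "discharge_capacity": [3000.0, 2970.0, 2940.0, 2800.0, 2786.0, 2772.0],
})

# Sort so that "first", diff() and rolling() follow cycle order within a battery
demo = demo.sort_values(["battery_id", "cycle_number"])
capacity = demo.groupby("battery_id")["discharge_capacity"]
cycles = demo.groupby("battery_id")["cycle_number"]

# Capacity of each battery's earliest cycle, broadcast back to all of its rows
demo["capacity_retention"] = demo["discharge_capacity"] / capacity.transform("first")

# Capacity change per cycle; the first cycle of each battery has no predecessor (NaN)
demo["capacity_change_rate"] = capacity.diff() / cycles.diff()

# 3-cycle moving average computed separately for each battery
demo["discharge_capacity_ma3"] = capacity.transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)

print(demo)
```

Because every operation goes through the group object, no value from one battery can spill into the features of another.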

## Visualizing the Derived Features

Let's examine how some of these engineered features relate to the battery degradation:

In [None]:
plt.figure(figsize=(12, 6))
for battery_id in X_df_engineered['battery_id'].unique()[:5]:  # Plot first 5 batteries
    battery_data = X_df_engineered[X_df_engineered['battery_id'] == battery_id]
    plt.plot(battery_data['cycle_number'], battery_data['capacity_retention'], 
             marker='o', markersize=4, linestyle='-', alpha=0.7, label=f'Battery {battery_id}')
    
plt.xlabel('Cycle Number', fontsize=12)
plt.ylabel('Capacity Retention (fraction of initial)', fontsize=12)
plt.title('Capacity Retention vs. Cycle Number', fontsize=14)
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
plt.scatter(X_df_engineered['capacity_change_rate'], X_df_engineered['discharge_capacity'], 
            alpha=0.5, c=X_df_engineered['cycle_number'], cmap='viridis')
plt.colorbar(label='Cycle Number')
plt.xlabel('Capacity Change Rate', fontsize=11)
plt.ylabel('Discharge Capacity (mAh)', fontsize=11)
plt.title('Discharge Capacity vs. Capacity Change Rate', fontsize=13)

plt.subplot(1, 2, 2)
plt.scatter(X_df_engineered['voltage_efficiency'], X_df_engineered['discharge_capacity'], 
            alpha=0.5, c=X_df_engineered['cycle_number'], cmap='viridis')
plt.colorbar(label='Cycle Number')
plt.xlabel('Voltage Efficiency (min/max)', fontsize=11)
plt.ylabel('Discharge Capacity (mAh)', fontsize=11)
plt.title('Discharge Capacity vs. Voltage Efficiency', fontsize=13)

plt.tight_layout()
plt.show()

# Building a Baseline Model

Let's create a simple baseline regression model to predict discharge capacity using our engineered features.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Select features to use in the model
features = [
    'cycle_number', 'charge_capacity', 'energy_efficiency',
    'max_voltage', 'min_voltage', 'avg_temperature',
    'charge_time', 'discharge_time', 'voltage_drop_rate',
    # Engineered features. 'capacity_retention' is deliberately left out:
    # it is computed from discharge_capacity (the target) and would leak it.
    'voltage_efficiency', 'temp_variation',
    'cycle_number_squared', 'log_cycle_number', 'temp_cycle_interaction'
]

# Ensure all selected features are in the dataframe
available_features = [f for f in features if f in X_df_engineered.columns]
print(f"Using {len(available_features)} features: {available_features}")

# Prepare the data
X = X_df_engineered[available_features].fillna(0)  # Simple handling of missing values
y = X_df_engineered['discharge_capacity']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train a random forest model
model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f} mAh")
print(f"R²: {r2:.4f}")

# Plot predicted vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Discharge Capacity (mAh)', fontsize=12)
plt.ylabel('Predicted Discharge Capacity (mAh)', fontsize=12)
plt.title('Random Forest: Actual vs Predicted Discharge Capacity', fontsize=14)
plt.grid(True)
plt.tight_layout()
plt.show()

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': available_features,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance)
plt.title('Feature Importance for Discharge Capacity Prediction', fontsize=14)
plt.tight_layout()
plt.show()
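
A random row-wise split places cycles from the same battery in both the training and the test set, which tends to make scores look better than they would on unseen batteries. One possible alternative, sketched here on synthetic data (the `demo_X` and `demo_y` names and values are made up), is to split by battery with scikit-learn's `GroupShuffleSplit`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic data: 10 batteries with 20 cycles each (illustrative values only)
rng = np.random.default_rng(42)
demo_X = pd.DataFrame({
    "battery_id": np.repeat([f"B{i}" for i in range(10)], 20),
    "cycle_number": np.tile(np.arange(1, 21), 10),
})
demo_y = 3000 - 2.0 * demo_X["cycle_number"] + rng.normal(0, 5, len(demo_X))

# Hold out 20% of the batteries (not 20% of the rows)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(demo_X, demo_y, groups=demo_X["battery_id"])
)

X_tr, X_te = demo_X.iloc[train_idx], demo_X.iloc[test_idx]
y_tr, y_te = demo_y.iloc[train_idx], demo_y.iloc[test_idx]

train_batteries = set(X_tr["battery_id"])
test_batteries = set(X_te["battery_id"])
print(f"Train batteries: {len(train_batteries)}, test batteries: {len(test_batteries)}")
print(f"Shared batteries: {train_batteries & test_batteries}")
```

Whether a grouped split is the right choice depends on how the challenge's test set is built; if it contains later cycles of batteries seen in training, a split by cycle number may be closer to the real task.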

# Time Series Analysis

Since battery degradation is a time series problem (sequences of cycles), let's look at approaches that consider the sequential nature of the data.

In [None]:
# Function to create lagged features for time series prediction
def create_lagged_features(df, lag_features, lags):
    """Create lagged features for time series prediction"""
    df_with_lags = df.copy()
    
    # For each battery, create lagged features
    for battery_id in df['battery_id'].unique():
        battery_data = df[df['battery_id'] == battery_id].sort_values('cycle_number')
        
        for feature in lag_features:
            for lag in lags:
                # Create the lagged feature
                lagged_values = battery_data[feature].shift(lag)
                df_with_lags.loc[battery_data.index, f"{feature}_lag{lag}"] = lagged_values
    
    return df_with_lags

# Create lagged features for important predictors
lag_features = ['discharge_capacity', 'charge_capacity', 'energy_efficiency']
lags = [1, 2, 3]  # Previous cycles

X_df_with_lags = create_lagged_features(X_df_engineered, lag_features, lags)

# Show the lagged features for the first few rows
lag_cols = [col for col in X_df_with_lags.columns if 'lag' in col]
print(f"Created {len(lag_cols)} lagged features: {lag_cols}")
X_df_with_lags[['battery_id', 'cycle_number', 'discharge_capacity'] + lag_cols].head(10)
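
The same lagged features can be built without explicit loops over batteries by combining `groupby` with `shift`. This is a minimal sketch on a hand-made frame (the `demo` data is hypothetical); the point to check is that the first cycles of each battery get `NaN` rather than values from the previous battery.

```python
import pandas as pd

# Hand-made example: battery A has three cycles, battery B has two
demo = pd.DataFrame({
    "battery_id": ["A", "A", "A", "B", "B"],
    "cycle_number": [1, 2, 3, 1, 2],
    "discharge_capacity": [3000.0, 2990.0, 2980.0, 2800.0, 2790.0],
}).sort_values(["battery_id", "cycle_number"])

# shift() is applied within each battery, so lags never cross battery boundaries
for lag in (1, 2):
    demo[f"discharge_capacity_lag{lag}"] = (
        demo.groupby("battery_id")["discharge_capacity"].shift(lag)
    )

print(demo)
```

Rows whose lags are `NaN` need explicit handling before modeling, for example by dropping them or imputing, since the early cycles of a battery have no history to draw on.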

# Preparing the Final Model

Now let's integrate the feature engineering and modeling steps into a complete pipeline.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer

# Define the feature engineering function
def battery_feature_engineering(X):
    """Create features for battery degradation prediction"""
    # Ensure it's a DataFrame
    X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X.copy()
    
    # Basic features
    if 'voltage_efficiency' not in X.columns and 'min_voltage' in X.columns and 'max_voltage' in X.columns:
        X['voltage_efficiency'] = X['min_voltage'] / X['max_voltage']
    
    if 'temp_variation' not in X.columns and 'max_temperature' in X.columns and 'avg_temperature' in X.columns:
        X['temp_variation'] = X['max_temperature'] - X['avg_temperature']
        
    if 'cycle_squared' not in X.columns and 'cycle_number' in X.columns:
        X['cycle_squared'] = X['cycle_number'] ** 2
        X['log_cycle'] = np.log1p(X['cycle_number'])
        
    # Calculate time-based features if available
    if 'charge_time' in X.columns and 'discharge_time' in X.columns:
        X['total_cycle_time'] = X['charge_time'] + X['discharge_time']
        X['charge_discharge_ratio'] = X['charge_time'] / X['discharge_time']
    
    return X

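# Define preprocessing pipeline
def get_preprocessing_pipeline():
    # Raw numeric inputs expected in X
    numeric_features = [
        'charge_capacity', 'energy_efficiency', 'voltage_drop_rate',
        'avg_temperature', 'max_temperature', 'max_voltage', 'min_voltage', 
        'charge_time', 'discharge_time', 'cycle_number'
    ]
    
    # Columns created by battery_feature_engineering from the inputs above
    engineered_features = [
        'voltage_efficiency', 'temp_variation', 'cycle_squared', 'log_cycle',
        'total_cycle_time', 'charge_discharge_ratio'
    ]
    
    # Feature engineering
    feature_engineering = FunctionTransformer(battery_feature_engineering)
    
    # Preprocessing for numerical features
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    
    # Bundle preprocessing steps
    preprocessor = Pipeline(steps=[
        ('feature_eng', feature_engineering),
        ('column_trans', ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, numeric_features + engineered_features)
            ],
            # Drop every other column (e.g. battery_id, or discharge_capacity if
            # it is present in X) so identifiers and the target never reach the model
            remainder='drop'
        ))
    ])
    
    return preprocessor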

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV

# Create the full model pipeline
def build_model(model_type='rf'):
    preprocessor = get_preprocessing_pipeline()
    
    if model_type == 'rf':
        model = RandomForestRegressor(
            n_estimators=100,
            max_depth=15,
            min_samples_split=5,
            random_state=42
        )
    elif model_type == 'gbm':
        model = GradientBoostingRegressor(
            n_estimators=100,
            learning_rate=0.1,
            max_depth=5,
            random_state=42
        )
    else:
        raise ValueError(f"Unknown model type: {model_type}")
    
    # Create full pipeline
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    
    return pipeline

In [None]:
# Evaluate the model
def evaluate_model(pipeline, X_train, y_train, X_test, y_test):
    # Train the model
    pipeline.fit(X_train, y_train)
    
    # Make predictions
    y_pred = pipeline.predict(X_test)
    
    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    print(f"RMSE: {rmse:.4f}")
    print(f"R² Score: {r2:.4f}")
    
    # Plot actual vs predicted
    plt.figure(figsize=(10, 6))
    plt.scatter(y_test, y_pred, alpha=0.5)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--')
    plt.xlabel('Actual Capacity')
    plt.ylabel('Predicted Capacity')
    plt.title('Actual vs Predicted Battery Capacity')
    plt.grid(True)
    plt.show()
    
    return rmse, r2

In [None]:
# Feature importance analysis
def feature_importance_analysis(pipeline, X, top_n=10):
    # X is kept in the signature for compatibility with existing calls
    model = pipeline.named_steps['model']
    
    # Only tree-based models expose feature_importances_
    if not hasattr(model, 'feature_importances_'):
        print("This model does not expose feature_importances_.")
        return
    
    importances = model.feature_importances_
    
    # Take the names from the fitted ColumnTransformer: preprocessing adds and
    # reorders columns, so list(X.columns) would not line up with the importances
    column_trans = pipeline.named_steps['preprocessor'].named_steps['column_trans']
    feature_names = list(column_trans.get_feature_names_out())
    
    # Indices of the top_n most important features, in descending order
    indices = np.argsort(importances)[::-1][:top_n]
    
    # Print feature ranking
    print("Feature ranking:")
    for rank, idx in enumerate(indices, start=1):
        print(f"{rank}. {feature_names[idx]} ({importances[idx]:.4f})")
    
    # Plot feature importances
    plt.figure(figsize=(12, 6))
    plt.title("Feature importances")
    plt.bar(range(len(indices)), importances[indices], align="center")
    plt.xticks(range(len(indices)), 
              [feature_names[i] for i in indices], 
              rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

In [None]:
# Creating a submission
def create_submission(pipeline, X_test, output_file='submission.csv'):
    # Make predictions on test set
    y_pred = pipeline.predict(X_test)
    
    # Create submission DataFrame
    submission = pd.DataFrame({
        'battery_id': X_test['battery_id'] if 'battery_id' in X_test.columns else range(len(X_test)),
        'cycle_number': X_test['cycle_number'],
        'predicted_capacity': y_pred
    })
    
    # Save to CSV
    submission.to_csv(output_file, index=False)
    print(f"Submission saved to {output_file}")
    
    return submission

In [None]:
# Sample workflow
if __name__ == "__main__":
    # 1. Load train/test data
    from problem import get_train_data, get_test_data
    
    X_train, y_train = get_train_data()
    X_test, _ = get_test_data()  # y_test would be None for the actual submission
    
    # 2. Split training data for local validation
    X_train_local, X_val, y_train_local, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42
    )
    
    # 3. Build and train model
    model_pipeline = build_model(model_type='rf')
    
    # 4. Evaluate model
    rmse, r2 = evaluate_model(model_pipeline, X_train_local, y_train_local, X_val, y_val)
    
    # 5. Train on full training set
    model_pipeline.fit(X_train, y_train)
    
    # 6. Feature importance analysis
    feature_importance_analysis(model_pipeline, X_train)
    
    # 7. Create submission file
    submission = create_submission(model_pipeline, X_test)

In [None]:
# Cross-validation performance
def cv_performance(pipeline, X, y, cv=5):
    # Get cross-validation scores
    cv_rmse = -cross_val_score(
        pipeline, X, y, 
        cv=cv, 
        scoring='neg_root_mean_squared_error'
    )
    
    cv_r2 = cross_val_score(
        pipeline, X, y, 
        cv=cv, 
        scoring='r2'
    )
    
    print(f"CV RMSE: {cv_rmse.mean():.4f} (±{cv_rmse.std():.4f})")
    print(f"CV R²: {cv_r2.mean():.4f} (±{cv_r2.std():.4f})")
    
    return cv_rmse, cv_r2
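
With an integer `cv`, `cross_val_score` uses plain K-fold splitting, which can put cycles from one battery in both the training and validation folds. A possible variant, sketched here on synthetic data (the `demo_X`, `demo_y`, and `groups` names and values are made up), passes a `GroupKFold` splitter together with the battery identifiers as `groups`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic data: 6 batteries with 30 cycles each (illustrative values only)
rng = np.random.default_rng(0)
n_batteries, n_cycles = 6, 30
demo_X = pd.DataFrame({
    "cycle_number": np.tile(np.arange(1, n_cycles + 1), n_batteries),
    "avg_temperature": 25 + rng.normal(0, 2, n_batteries * n_cycles),
})
groups = np.repeat(np.arange(n_batteries), n_cycles)  # one label per battery
demo_y = 3000 - 2.0 * demo_X["cycle_number"] + rng.normal(0, 5, len(demo_X))

# Each fold validates on batteries that were never seen during training
cv = GroupKFold(n_splits=3)
scores = -cross_val_score(
    RandomForestRegressor(n_estimators=20, random_state=0),
    demo_X, demo_y,
    groups=groups,
    cv=cv,
    scoring="neg_root_mean_squared_error",
)

print(f"Grouped CV RMSE: {scores.mean():.2f} (±{scores.std():.2f})")
```

The same `cv` object and `groups` argument can be passed to `GridSearchCV.fit`, so hyperparameter tuning can follow the same grouping.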

In [None]:
# Hyperparameter tuning
def tune_hyperparameters(X, y):
    # Define the preprocessing pipeline
    preprocessor = get_preprocessing_pipeline()
    
    # Define the model
    model = RandomForestRegressor(random_state=42)
    
    # Define the parameter grid
    param_grid = {
        'model__n_estimators': [50, 100, 200],
        'model__max_depth': [10, 15, 20],
        'model__min_samples_split': [2, 5, 10]
    }
    
    # Create the full pipeline
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    
    # Create GridSearchCV
    grid_search = GridSearchCV(
        pipeline, param_grid, cv=5,
        scoring='neg_root_mean_squared_error',
        n_jobs=-1, verbose=1
    )
    
    # Fit the grid search
    grid_search.fit(X, y)
    
    # Print the best parameters
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best RMSE: {-grid_search.best_score_:.4f}")
    
    # Return the best model
    return grid_search.best_estimator_