# **04.1 Model Training - Classic ML**

In [None]:
!pip install mlflow

In [2]:
import os

import mlflow
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# Path to the processed, model-ready dataset
path = '/lakehouse/default/Files/data/processed/model_ready_data.csv'

# Load the dataset into a pandas DataFrame
df = pd.read_csv(path)

# Define the feature columns shared by all three targets (the target is set per section below)
features = [
    'VV', 'SA', 'AR', 'WBLR', 'FA', 'TWD', 'ORF', 'ORR', 'GHBHR', 'UBC',
    'CWVWR', 'CWVR', 'TO', 'ATW', 'BWR', 'CWLR', 'FAR', 'WDFR_RATIO', 'CW_WD_IMPACT',
    'WDF', 'WDR', 'OL', 'OW', 'OH', 'WB', 'CW'  
]


The cells below build and evaluate regression models for three targets. The walkthrough uses `CITYE_KWH/100MI`, the city electric consumption of an electric vehicle in kilowatt-hours per 100 miles. The `HIGHWAYE_KWH/100MI` and `COMBE_KWH/100MI` cells repeat the same steps for highway and combined consumption. Each cell does the following:

1. **Feature Selection and Target Definition**:
   - `X = df[features]` selects the features from the DataFrame `df` that will be used to predict the target.
   - `y = df['CITYE_KWH/100MI']` defines the target variable, which in this case is the city electric consumption.

2. **Data Splitting**:
   - The dataset is split into training and testing sets using `train_test_split`, with 80% of the data used for training and 20% for testing. The `random_state` ensures reproducibility.

3. **Feature Scaling**:
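   - `StandardScaler` standardizes each feature to zero mean and unit variance. It is fitted on the training split only and then applied to the test split, so no test-set statistics leak into training. Scaling matters most for `SGDRegressor`, whose gradient updates are sensitive to feature scale; the tree-based models are largely unaffected by it.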

4. **Model Definition and Parameter Grids**:
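   - A list of `(name, estimator, grid)` tuples is defined for `LinearRegression`, `RandomForestRegressor`, `GradientBoostingRegressor`, and `SGDRegressor`. `LinearRegression` has an empty grid. Because the loop in step 5 iterates over grid entries, it produces no runs for `LinearRegression`, which is why the outputs below contain no Linear Regression results.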

5. **Model Training and Evaluation Loop**:
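   - The code sweeps one hyperparameter at a time rather than every combination. Values are set on the same estimator instance, so a setting persists into later sweeps: the `max_depth` runs use the last `n_estimators` value (400), and the `eta0` runs use the last `learning_rate` schedule (`adaptive`). For each value, it:
     - Starts a new MLflow run to track experiments.
     - Sets the model name as a tag for easy identification.
     - Logs the hyperparameter being tested.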
     - Fits the model to the scaled training data.
     - Predicts on the scaled test data and calculates the Mean Squared Error (MSE) as the performance metric.
     - Logs the MSE to MLflow.

6. **Feature Importance Logging**:
   - For models that expose `feature_importances_` (here the random forest and gradient boosting models), the importances are sorted in descending order and logged to MLflow as a CSV artifact. This shows which features are most influential in predicting `CITYE_KWH/100MI`.

7. **Cleanup**:
   - After logging the feature importances, the CSV file is removed from the local filesystem to keep the workspace clean.
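
Step 3 can be illustrated without scikit-learn. The sketch below is a plain-Python stand-in for what `StandardScaler.fit_transform` and `transform` compute for a single feature column. The helper names `fit_standardizer` and `standardize` are hypothetical, and the numbers are made up for illustration. The point it shows is that the mean and standard deviation come from the training split only and are then reused for the test split.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Return (mean, std) estimated from the training values only."""
    # StandardScaler uses the population standard deviation (ddof=0), hence pstdev.
    return fmean(train_values), pstdev(train_values)

def standardize(values, mean, std):
    """Apply (x - mean) / std using previously fitted statistics."""
    return [(x - mean) / std for x in values]

# Illustrative numbers only; they are not taken from the dataset.
train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [5.0, 10.0]

mean, std = fit_standardizer(train_col)
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)  # reuses the training statistics
```

The scaled training column has zero mean and unit variance by construction. The scaled test column generally does not, which is expected.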

Together, these steps train each model, record its test-set MSE, and track every run in MLflow so that results for the three targets can be compared.
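
Because the loop in step 5 iterates over the entries of each grid, a model with an empty grid yields no iterations. A minimal sketch of a sweep expansion that still gives such a model one run with its defaults is shown below. `one_at_a_time` is a hypothetical helper, not part of scikit-learn or MLflow. In the notebook, each yielded dict could be applied to a fresh copy of the estimator, for example with `sklearn.base.clone(model).set_params(**setting)`, which would also stop settings from carrying over between sweeps.

```python
def one_at_a_time(grid):
    """Yield one settings dict per run, varying a single hyperparameter each time.

    An empty grid yields a single empty dict, so the model is still
    trained once with its default hyperparameters.
    """
    if not grid:
        yield {}
        return
    for key, values in grid.items():
        for value in values:
            yield {key: value}

# Grids copied from the notebook's model list.
grids = {
    "Linear Regression": {},
    "Random Forest Regressor": {"n_estimators": [50, 100, 200, 300, 400], "max_depth": [10, None]},
    "Gradient Boosting": {"learning_rate": [0.01, 0.05, 0.1, 0.2]},
}

runs = {name: list(one_at_a_time(grid)) for name, grid in grids.items()}
for name, settings in runs.items():
    print(f"{name}: {len(settings)} run(s)")
```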


## **CITY - KWH/100MI**

In [3]:
# Define features (X) and target (y) for the model
X = df[features]
y = df['CITYE_KWH/100MI']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define a list of models to test
models = [
    ("Linear Regression", LinearRegression(), {}),
    ("Random Forest Regressor", RandomForestRegressor(random_state=42), {'n_estimators': [50, 100, 200, 300, 400], 'max_depth': [10, None]}),
    ("Gradient Boosting", GradientBoostingRegressor(random_state=42), {'learning_rate': [0.01, 0.05, 0.1, 0.2]}),
    ("SGD Regressor", SGDRegressor(random_state=42, max_iter=1000, tol=1e-3), {'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'], 'eta0': [0.01, 0.1]})
]

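# One-at-a-time sweep: each run changes a single hyperparameter on the shared estimator,
# so values set in earlier runs persist. A model with an empty grid is never fitted.
for model_name, model, params in models:
    for key, values in params.items():
        for value in values:
            with mlflow.start_run():
                mlflow.set_tag("Model", model_name)
                # Log and apply the hyperparameter under test
                mlflow.log_params({key: value})
                model.set_params(**{key: value})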

                # Train and evaluate the model
                model.fit(X_train_scaled, y_train)
                y_pred = model.predict(X_test_scaled)
                mse = mean_squared_error(y_test, y_pred)

                # Log metrics
                mlflow.log_metric("MSE", mse)
                print(f"{model_name} with {key} = {value}: MSE = {mse}")

                # Log feature importance for models that support it
                if hasattr(model, 'feature_importances_'):
                    feature_importances = model.feature_importances_
                    sorted_idx = np.argsort(feature_importances)[::-1]  # Sort in descending order
                    
                    # Create a DataFrame to save feature importances
                    fi_df = pd.DataFrame({
                        'Feature': np.array(features)[sorted_idx],
                        'Importance': feature_importances[sorted_idx]
                    })
                    
                    # Save to a CSV file
                    fi_filename = f"feature_importances_{model_name.replace(' ', '_')}.csv"
                    fi_df.to_csv(fi_filename, index=False)
                    
                    # Log the CSV file
                    mlflow.log_artifact(fi_filename)
                    
                    # Confirm that the feature importances were logged
                    print(f"Feature Importances for {model_name} logged.")
                    
                    # Remove the file after logging
                    os.remove(fi_filename)     


2024/03/03 18:35:32 INFO mlflow.tracking.fluent: Experiment with name '04_model_training__classical' does not exist. Creating a new experiment.


Random Forest Regressor with n_estimators = 50: MSE = 968.9814966648876
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with n_estimators = 100: MSE = 972.3113986340308
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with n_estimators = 200: MSE = 973.0871899921927
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with n_estimators = 300: MSE = 972.7406898472404
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with n_estimators = 400: MSE = 973.3300901699828
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with max_depth = 10: MSE = 1215.6750856179408
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with max_depth = None: MSE = 973.3300901699828
Feature Importances for Random Forest Regressor logged.


Gradient Boosting with learning_rate = 0.01: MSE = 1984.6204645025969
Feature Importances for Gradient Boosting logged.


Gradient Boosting with learning_rate = 0.05: MSE = 1547.8553355716422
Feature Importances for Gradient Boosting logged.


Gradient Boosting with learning_rate = 0.1: MSE = 1436.0021202887085
Feature Importances for Gradient Boosting logged.


Gradient Boosting with learning_rate = 0.2: MSE = 1289.552420455714
Feature Importances for Gradient Boosting logged.


SGD Regressor with learning_rate = constant: MSE = 3906.2109274985874




SGD Regressor with learning_rate = optimal: MSE = 70741519397.14809


SGD Regressor with learning_rate = invscaling: MSE = 1869.1039776021546


SGD Regressor with learning_rate = adaptive: MSE = 1864.7917983471873


SGD Regressor with eta0 = 0.01: MSE = 1864.7917983471873


SGD Regressor with eta0 = 0.1: MSE = 1919.6210443915434
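
The training cell above writes each feature-importance CSV to the working directory and deletes it with `os.remove`, a call that is not reached if logging raises. A minimal sketch of an alternative that uses a temporary directory follows. `log_artifact` is passed in as a parameter so the example runs without MLflow; in the notebook it would be `mlflow.log_artifact`. The function name and the importance values are illustrative assumptions.

```python
import csv
import os
import tempfile

def log_feature_importances(names, importances, model_name, log_artifact):
    """Write importances to a temporary CSV, hand it to log_artifact, then clean up."""
    # Sort features by importance, largest first.
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, f"feature_importances_{model_name.replace(' ', '_')}.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["Feature", "Importance"])
            writer.writerows(ranked)
        log_artifact(path)  # the directory is removed afterwards, even if this raises
    return path, ranked

# Stand-in for mlflow.log_artifact that records whether the file existed when logged.
seen = []
path, ranked = log_feature_importances(
    ["VV", "SA", "AR"], [0.2, 0.5, 0.3], "Random Forest Regressor",
    log_artifact=lambda p: seen.append(os.path.exists(p)),
)
```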


## **HIGHWAY - KWH/100MI**

In [4]:
# Define features (X) and target (y) for the model
X = df[features]
y = df['HIGHWAYE_KWH/100MI']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define a list of models to test
models = [
    ("Linear Regression", LinearRegression(), {}),
    ("Random Forest Regressor", RandomForestRegressor(random_state=42), {'n_estimators': [50, 100, 200, 300, 400], 'max_depth': [10, None]}),
    ("Gradient Boosting", GradientBoostingRegressor(random_state=42), {'learning_rate': [0.01, 0.05, 0.1, 0.2]}),
    ("SGD Regressor", SGDRegressor(random_state=42, max_iter=1000, tol=1e-3), {'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'], 'eta0': [0.01, 0.1]})
]

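# One-at-a-time sweep: each run changes a single hyperparameter on the shared estimator,
# so values set in earlier runs persist. A model with an empty grid is never fitted.
for model_name, model, params in models:
    for key, values in params.items():
        for value in values:
            with mlflow.start_run():
                mlflow.set_tag("Model", model_name)
                # Log and apply the hyperparameter under test
                mlflow.log_params({key: value})
                model.set_params(**{key: value})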

                # Train and evaluate the model
                model.fit(X_train_scaled, y_train)
                y_pred = model.predict(X_test_scaled)
                mse = mean_squared_error(y_test, y_pred)

                # Log metrics
                mlflow.log_metric("MSE", mse)
                print(f"{model_name} with {key} = {value}: MSE = {mse}")

                # Log feature importance for models that support it
                if hasattr(model, 'feature_importances_'):
                    feature_importances = model.feature_importances_
                    sorted_idx = np.argsort(feature_importances)[::-1]  # Sort in descending order
                    
                    # Create a DataFrame to save feature importances
                    fi_df = pd.DataFrame({
                        'Feature': np.array(features)[sorted_idx],
                        'Importance': feature_importances[sorted_idx]
                    })
                    
                    # Save to a CSV file
                    fi_filename = f"feature_importances_{model_name.replace(' ', '_')}.csv"
                    fi_df.to_csv(fi_filename, index=False)
                    
                    # Log the CSV file
                    mlflow.log_artifact(fi_filename)
                    
                    # Confirm that the feature importances were logged
                    print(f"Feature Importances for {model_name} logged.")
                    
                    # Remove the file after logging
                    os.remove(fi_filename)     


Random Forest Regressor with n_estimators = 50: MSE = 435.40929576121266
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with n_estimators = 100: MSE = 435.77489636343745
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with n_estimators = 200: MSE = 434.29825486600305
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with n_estimators = 300: MSE = 434.4945241879394
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with n_estimators = 400: MSE = 434.1065331910032
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with max_depth = 10: MSE = 535.7633299525125
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with max_depth = None: MSE = 434.1065331910032
Feature Importances for Random Forest Regressor logged.


Gradient Boosting with learning_rate = 0.01: MSE = 869.9134651762974
Feature Importances for Gradient Boosting logged.


Gradient Boosting with learning_rate = 0.05: MSE = 665.2142384399457
Feature Importances for Gradient Boosting logged.


Gradient Boosting with learning_rate = 0.1: MSE = 610.7049767030059
Feature Importances for Gradient Boosting logged.


Gradient Boosting with learning_rate = 0.2: MSE = 561.8642795573155
Feature Importances for Gradient Boosting logged.


SGD Regressor with learning_rate = constant: MSE = 1655.4145737781585




SGD Regressor with learning_rate = optimal: MSE = 14306975381.964724


SGD Regressor with learning_rate = invscaling: MSE = 814.2685254597732


SGD Regressor with learning_rate = adaptive: MSE = 810.9403820123555


SGD Regressor with eta0 = 0.01: MSE = 810.9403820123555


SGD Regressor with eta0 = 0.1: MSE = 835.4674922911685
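
The printed lines follow a fixed pattern, so they can be parsed to rank runs without opening the MLflow UI. The sketch below works under that assumption, using five lines copied from the output above. `parse_result` is a hypothetical helper.

```python
import re

# Pattern of the result lines printed by the training loop.
LINE_RE = re.compile(r"^(?P<model>.+) with (?P<param>\w+) = (?P<value>\S+): MSE = (?P<mse>\S+)$")

def parse_result(line):
    """Return (model, param, value, mse) for a result line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match["model"], match["param"], match["value"], float(match["mse"])

output = """\
Random Forest Regressor with n_estimators = 400: MSE = 434.1065331910032
Feature Importances for Random Forest Regressor logged.
Gradient Boosting with learning_rate = 0.2: MSE = 561.8642795573155
SGD Regressor with learning_rate = optimal: MSE = 14306975381.964724
SGD Regressor with learning_rate = adaptive: MSE = 810.9403820123555
"""

# Keep only the lines that carry a metric, then pick the run with the lowest MSE.
results = [r for r in map(parse_result, output.splitlines()) if r is not None]
best = min(results, key=lambda r: r[3])
print(f"Lowest MSE: {best[0]} ({best[1]} = {best[2]}), MSE = {best[3]:.2f}")
```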


## **COMBINED - KWH/100MI**

In [5]:
# Define features (X) and target (y) for the model
X = df[features]
y = df['COMBE_KWH/100MI']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define a list of models to test
models = [
    ("Linear Regression", LinearRegression(), {}),
    ("Random Forest Regressor", RandomForestRegressor(random_state=42), {'n_estimators': [50, 100, 200, 300, 400], 'max_depth': [10, None]}),
    ("Gradient Boosting", GradientBoostingRegressor(random_state=42), {'learning_rate': [0.01, 0.05, 0.1, 0.2]}),
    ("SGD Regressor", SGDRegressor(random_state=42, max_iter=1000, tol=1e-3), {'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'], 'eta0': [0.01, 0.1]})
]

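# One-at-a-time sweep: each run changes a single hyperparameter on the shared estimator,
# so values set in earlier runs persist. A model with an empty grid is never fitted.
for model_name, model, params in models:
    for key, values in params.items():
        for value in values:
            with mlflow.start_run():
                mlflow.set_tag("Model", model_name)
                # Log and apply the hyperparameter under test
                mlflow.log_params({key: value})
                model.set_params(**{key: value})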

                # Train and evaluate the model
                model.fit(X_train_scaled, y_train)
                y_pred = model.predict(X_test_scaled)
                mse = mean_squared_error(y_test, y_pred)

                # Log metrics
                mlflow.log_metric("MSE", mse)
                print(f"{model_name} with {key} = {value}: MSE = {mse}")

                # Log feature importance for models that support it
                if hasattr(model, 'feature_importances_'):
                    feature_importances = model.feature_importances_
                    sorted_idx = np.argsort(feature_importances)[::-1]  # Sort in descending order
                    
                    # Create a DataFrame to save feature importances
                    fi_df = pd.DataFrame({
                        'Feature': np.array(features)[sorted_idx],
                        'Importance': feature_importances[sorted_idx]
                    })
                    
                    # Save to a CSV file
                    fi_filename = f"feature_importances_{model_name.replace(' ', '_')}.csv"
                    fi_df.to_csv(fi_filename, index=False)
                    
                    # Log the CSV file
                    mlflow.log_artifact(fi_filename)
                    
                    # Confirm that the feature importances were logged
                    print(f"Feature Importances for {model_name} logged.")
                    
                    # Remove the file after logging
                    os.remove(fi_filename)    


Random Forest Regressor with n_estimators = 50: MSE = 685.6714137870434
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with n_estimators = 100: MSE = 686.6016177972003
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with n_estimators = 200: MSE = 684.8313561021281
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with n_estimators = 300: MSE = 684.5985783977239
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with n_estimators = 400: MSE = 684.4846067595084
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with max_depth = 10: MSE = 843.0844489818288
Feature Importances for Random Forest Regressor logged.


Random Forest Regressor with max_depth = None: MSE = 684.4846067595084
Feature Importances for Random Forest Regressor logged.


Gradient Boosting with learning_rate = 0.01: MSE = 1400.3842932962443
Feature Importances for Gradient Boosting logged.


Gradient Boosting with learning_rate = 0.05: MSE = 1075.4147333396404
Feature Importances for Gradient Boosting logged.


Gradient Boosting with learning_rate = 0.1: MSE = 995.1259402691425
Feature Importances for Gradient Boosting logged.


Gradient Boosting with learning_rate = 0.2: MSE = 911.0501877239703
Feature Importances for Gradient Boosting logged.


SGD Regressor with learning_rate = constant: MSE = 2750.5805042859015




SGD Regressor with learning_rate = optimal: MSE = 140984297553.8446


SGD Regressor with learning_rate = invscaling: MSE = 1306.1154911765425


SGD Regressor with learning_rate = adaptive: MSE = 1302.1742076135108


SGD Regressor with eta0 = 0.01: MSE = 1302.1742076135108


SGD Regressor with eta0 = 0.1: MSE = 1326.4716736680741
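
MSE is expressed in squared units of the target, which makes the figures above hard to read against consumption values. Taking the square root gives RMSE in the target's own units, kWh/100 mi. A short sketch using the lowest MSE printed for each target in this notebook:

```python
import math

# Lowest MSE printed for each target in the outputs above (all Random Forest runs).
best_mse = {
    "CITYE_KWH/100MI": 968.9814966648876,
    "HIGHWAYE_KWH/100MI": 434.1065331910032,
    "COMBE_KWH/100MI": 684.4846067595084,
}

# RMSE has the same units as the target.
best_rmse = {target: math.sqrt(mse) for target, mse in best_mse.items()}

for target, rmse in best_rmse.items():
    print(f"{target}: RMSE = {rmse:.2f}")
```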
