# PART 4: prediction of the amount of electricity produced (regression)

This section develops a model to predict wind farm electricity production from sensor data. We train and tune two regression models, Ridge and Lasso, on the provided data and check whether each one reaches the required minimum test R^2 score of 0.85.

----

In [1]:
from typing import Tuple

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the data
X_train = np.load('X_train.npy')
X_test = np.load('X_test.npy')
y_train = np.load('y_train.npy') 
y_test = np.load('y_test.npy')

# Check the shape of the data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (100, 100)
X_test shape: (100, 100)
y_train shape: (100, 1)
y_test shape: (100, 1)


## Scaling the data

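Scaling is a preprocessing step that puts all features on a comparable scale. Here `StandardScaler` standardizes each feature to zero mean and unit variance, using statistics learned from the training set only. This matters for regularized models such as Ridge and Lasso, because the penalty then treats all coefficients on an equal footing.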


In [3]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
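
As a quick sanity check of what the cell above does, here is a minimal sketch on synthetic data. The arrays and the `_demo` names are hypothetical stand-ins, not the real sensor data. It shows that the training features end up with zero mean and unit standard deviation, while the test features are transformed with the training statistics and are therefore only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sensor data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 5))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(40, 5))

scaler_demo = StandardScaler()
X_train_demo_scaled = scaler_demo.fit_transform(X_train_demo)  # learn mean/std on train only
X_test_demo_scaled = scaler_demo.transform(X_test_demo)        # reuse the train statistics

# Training columns are standardized; test columns are only approximately so
print("train mean per feature:", X_train_demo_scaled.mean(axis=0).round(3))
print("train std per feature: ", X_train_demo_scaled.std(axis=0).round(3))
print("test mean per feature: ", X_test_demo_scaled.mean(axis=0).round(3))
```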

## Training the models

We will start our analysis by comparing the performance of Ridge and Lasso regression models, focusing on their ability to predict electricity production accurately. 
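
For reference, the two models differ only in the penalty they add to the least-squares loss. The objectives below follow the scikit-learn documentation; the symbols $w$ (coefficients), $X$ (feature matrix), $y$ (target) and $n$ (number of training samples) are notation introduced here for illustration.

$$
\text{Ridge:} \quad \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:} \quad \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The L2 penalty shrinks all coefficients towards zero, whereas the L1 penalty can set some of them exactly to zero. Because the loss terms are scaled differently, the `alpha` values of the two models are not directly comparable.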


## Method 1: Ridge

In [4]:
def tune_ridge_regression(X_train: np.ndarray, y_train: np.ndarray, X_test: np.ndarray, y_test: np.ndarray) -> Tuple[Ridge, float]:
    """
    Tunes the alpha hyperparameter of a Ridge Regression model using RandomizedSearchCV and evaluates the tuned model on the test set.
    
    Parameters:
    - X_train: ndarray, feature matrix for training data.
    - y_train: ndarray, labels for training data.
    - X_test: ndarray, feature matrix for test data.
    - y_test: ndarray, labels for test data.
    
    Returns:
    - best_model: The Ridge Regression model with the best found hyperparameters.
    - test_r2_score: R^2 score of the best model on the test data.
    """
    # Define the parameter distribution: 100 log-spaced alpha values from 1e-4 to 1e4
    param_distributions = {
        'alpha': np.logspace(-4, 4, 100)
    }

    # Randomly sample 50 of those alpha values and score each with 5-fold cross-validation
    random_search = RandomizedSearchCV(
        Ridge(), param_distributions, n_iter=50, cv=5,
        scoring='r2', verbose=1, random_state=42
    )
    random_search.fit(X_train, y_train)
    
    best_model = random_search.best_estimator_
    
    # Predict on the test set with the optimized model
    y_test_pred = best_model.predict(X_test)
    
    # Calculate the R^2 score
    test_r2_score = r2_score(y_test, y_test_pred)
    
    return best_model, test_r2_score

In [5]:
# Call the tuning function for Ridge Regression
best_ridge, ridge_test_r2 = tune_ridge_regression(X_train_scaled, y_train, X_test_scaled, y_test)

# Print the best parameters and R^2 score
print("Best Parameters for Ridge Regression:", best_ridge.get_params())
print(f"Ridge Regression Test R^2 Score: {ridge_test_r2}")

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Parameters for Ridge Regression: {'alpha': 10.235310218990268, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'positive': False, 'random_state': None, 'solver': 'auto', 'tol': 0.0001}
Ridge Regression Test R^2 Score: 0.5984068414387737


## Method 2: Lasso

In [6]:
def tune_lasso_regression(X_train: np.ndarray, y_train: np.ndarray, X_test: np.ndarray, y_test: np.ndarray) -> Tuple[Lasso, float]:
    """
    Scales the features, tunes the alpha hyperparameter of a Lasso Regression model using GridSearchCV, and evaluates the tuned model on the test set.
    
    Parameters:
    - X_train: ndarray, feature matrix for training data.
    - y_train: ndarray, labels for training data.
    - X_test: ndarray, feature matrix for test data.
    - y_test: ndarray, labels for test data.
    
    Returns:
    - best_model: The Lasso Regression model with the best found hyperparameters.
    - test_r2_score: R^2 score of the best model on the test data.
    """
    # Since Lasso is sensitive to the scale of the input features, ensure they are scaled
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Define the parameter grid
    param_grid = {
        'alpha': np.logspace(-4, -1, 10)  # Exploring a range of alpha values
    }
    grid_search = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=5, scoring='r2', verbose=1)
    grid_search.fit(X_train_scaled, y_train)
    best_model = grid_search.best_estimator_
    
    # Predict on the test set with the optimized model
    y_test_pred = best_model.predict(X_test_scaled)
    
    # Calculate the R^2 score
    test_r2_score = r2_score(y_test, y_test_pred)
    
    return best_model, test_r2_score

In [7]:
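# Call the tuning function for Lasso Regression (unscaled features: the function scales them itself)
best_lasso_model, lasso_test_r2 = tune_lasso_regression(X_train, y_train, X_test, y_test)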

print("Best Parameters for Lasso Regression:", best_lasso_model.get_params())
print(f"Lasso Regression Test R^2 Score: {lasso_test_r2}")


Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best Parameters for Lasso Regression: {'alpha': 0.01, 'copy_X': True, 'fit_intercept': True, 'max_iter': 10000, 'positive': False, 'precompute': False, 'random_state': None, 'selection': 'cyclic', 'tol': 0.0001, 'warm_start': False}
Lasso Regression Test R^2 Score: 0.8702316928640285
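
In both tuning functions the scaler is fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling. A possible alternative, sketched here on synthetic data (the `_demo` data, the step names `scaler` and `lasso`, and the variable names are illustrative assumptions), is to wrap the scaler and the model in a `Pipeline` so that every fold fits its own scaler. The effect on the scores is often small, and this is not the approach used above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 100 samples, 20 features, 3 informative
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 20))
true_coef = np.zeros(20)
true_coef[:3] = [3.0, -2.0, 1.5]
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=100)

# The scaler is refitted inside each cross-validation fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(max_iter=10000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid_demo = {"lasso__alpha": np.logspace(-4, -1, 10)}
search = GridSearchCV(pipe, param_grid_demo, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print("best alpha:", search.best_params_["lasso__alpha"])
print("mean cross-validated R^2:", round(search.best_score_, 3))
```

With this layout, `search.best_estimator_` is the whole fitted pipeline, so it can be applied to unscaled test features directly.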


## Final overview

After hyperparameter tuning, the Ridge regression reaches a test R^2 score of about 0.60, which does not meet the required minimum of 0.85.

The Lasso model reaches a test R^2 score of about 0.87, which meets the requirement and clearly outperforms Ridge on this dataset.
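
One possible explanation for the gap, which was not verified on this dataset, is that there are as many features as training samples (100 each) and only some of them carry signal. In that situation the L1 penalty of Lasso can discard uninformative features. The sketch below illustrates the idea on synthetic data; the data, the `alpha` values and the `_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in (hypothetical): as many features as samples, few informative
rng = np.random.default_rng(0)
n_samples, n_features = 100, 100
X_demo = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [4.0, -3.0, 2.0, -2.0, 1.0]
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=n_samples)

ridge_demo = Ridge(alpha=10.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.1, max_iter=10000).fit(X_demo, y_demo)

# Ridge shrinks coefficients but keeps them non-zero; Lasso sets many exactly to zero
ridge_nonzero = int(np.sum(ridge_demo.coef_ != 0))
lasso_nonzero = int(np.sum(lasso_demo.coef_ != 0))
print("non-zero Ridge coefficients:", ridge_nonzero)
print("non-zero Lasso coefficients:", lasso_nonzero)
```

The same check can be run on the fitted models from this section, for example with `np.sum(best_lasso_model.coef_ != 0)`, to see how many features the tuned Lasso actually keeps.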

Pol-Antoine Loiseau - Florent Rossignol