**PART 4: Prediction of the amount of electricity produced (Regression)**

*We would like to predict the amount of electricity produced by a wind farm, as a function of the information gathered by a number of physical sensors (e.g. wind speed, temperature, ...)*

In [181]:
# All imports for the notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import KFold, ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error, r2_score
from statistics import mean

In [41]:
# Load the data files, assumed to be in the same folder as the notebook
inputs = np.load("inputs.npy")
labels = np.load("labels.npy")
# Keep only the first column of the labels array as the regression target
labels = labels[:, 0]

**Using Linear Regression**



The model we will use is Linear Regression. <br>Linear Regression fits this problem because we are trying to predict the electricity produced, a continuous quantity, as a function of other parameters (for example wind speed or temperature).<br> So we have a **dependent variable: electricity** and **independent variables: speed/temperature**<br><br>It is also important to note that we cannot use logistic regression here: although it has "regression" in its name, it is used to solve classification problems.
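To make the dependent/independent variable relationship concrete, here is a minimal sketch on synthetic data. The two feature columns, the coefficients and the noise level are all made up for illustration and are not taken from the wind farm dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical sensor readings: column 0 plays the role of wind speed,
# column 1 the role of temperature (assumed values, not the real data)
X_demo = rng.uniform(0.0, 20.0, size=(200, 2))

# Assumed "true" linear relationship, plus a little measurement noise
true_coef = np.array([3.0, -0.5])
true_intercept = 10.0
y_demo = X_demo @ true_coef + true_intercept + rng.normal(0.0, 0.1, size=200)

# Linear Regression estimates one weight per independent variable and an intercept
demo_model = LinearRegression().fit(X_demo, y_demo)
print("Estimated coefficients:", demo_model.coef_)
print("Estimated intercept:", demo_model.intercept_)
```

The fitted `coef_` and `intercept_` should land close to the assumed values, and the predictions are continuous numbers rather than class labels, which is why a regression model is the right family here.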



In [None]:
# Repeat the random train/test split 10 times and average the R2 scores
val = []
for _ in range(10):

    # Create the training and test split, with a test size of 20%
    X_train, X_test, y_train, y_test = train_test_split(
        inputs,
        labels,
        test_size=0.20)

    # Create the Linear Regression model
    model = LinearRegression()

    # Fit the model on the training data
    model.fit(X_train, y_train)

    # Predict on the held-out test data
    predictions = model.predict(X_test)

    # Store this split's R2 score so we can average at the end
    val.append(r2_score(y_test, predictions))

# Print the average coefficient of determination (R2) over the 10 splits
print(f"Average coefficient of determination: {mean(val):.2f}")

We obtain an average R2 score of around 0.70 to 0.80. <br>
In some of our tests we obtained an R2 score as high as 0.85, but this result is not consistent, because the train/test split is random and changes on every run. <br>
The score might be improved by adjusting some parameters, but of the models we tested, this one performed best.
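Because the score depends on the random split, it can help to fix the seeds and report the spread alongside the mean. The sketch below does this on synthetic stand-in data; the helper name `repeated_holdout_r2` and the synthetic arrays are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def repeated_holdout_r2(X, y, n_repeats=10, test_size=0.20):
    """Return one R2 score per seeded random train/test split (hypothetical helper)."""
    scores = []
    for seed in range(n_repeats):
        # A fixed random_state makes each split reproducible between runs
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = LinearRegression().fit(X_tr, y_tr)
        scores.append(r2_score(y_te, model.predict(X_te)))
    return np.array(scores)


# Synthetic stand-in for the sensor data (assumed, for illustration only)
rng = np.random.default_rng(0)
X_synth = rng.normal(size=(300, 3))
y_synth = X_synth @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=300)

scores_a = repeated_holdout_r2(X_synth, y_synth)
scores_b = repeated_holdout_r2(X_synth, y_synth)  # same seeds, same scores
print(f"R2: {scores_a.mean():.2f} +/- {scores_a.std():.2f}")
```

With the notebook's own data, the same helper could be called as `repeated_holdout_r2(inputs, labels)`, which would give the same numbers on every run.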

**Cross validation**

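Now we are going to evaluate the model with cross validation, starting with K-Fold. <br>K-Fold splits the dataset into K folds; each fold serves once as the test set while the model is trained on the remaining K-1 folds. It is mainly used to estimate how well the model performs on new data. <br>We also use ShuffleSplit, which draws a new random train/test split at each iteration.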
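As a small illustration of how K-Fold partitions the data, the sketch below splits a toy array of 6 samples into 3 folds. The toy array is an assumption chosen only to keep the printed indices readable.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset with 6 samples and 2 features (illustrative values only)
X_toy = np.arange(12).reshape(6, 2)

# Without shuffling, KFold cuts the samples into consecutive blocks
kf = KFold(n_splits=3)
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in kf.split(X_toy)]

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"Fold {i}: train on {train_idx}, test on {test_idx}")
```

Every sample appears in exactly one test fold. Since the default `KFold` does not shuffle, data stored in a meaningful order (for example chronologically) is split into contiguous blocks; passing `shuffle=True` changes that.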

In [194]:
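# Initialising the cross validation splitters
kfold = KFold(n_splits=11)
ss = ShuffleSplit(n_splits=15, test_size=0.2, random_state=None)

# Re-creating the model (cross_val_score clones and refits it on every split,
# so this fit on the last train/test split does not affect the scores below)
model = LinearRegression()
model.fit(X_train, y_train)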

LinearRegression()

In [184]:
# Calculating Cross Validation scores with K-Fold
scores = cross_val_score(model, inputs, labels, cv=kfold)

# Displaying results (for a regressor, cross_val_score returns R2 by default, not accuracy)
print(f"We get a mean R2 score of {scores.mean():.2f} with a standard deviation of {scores.std():.2f}.")

We get a mean R2 score of 0.75 with a standard deviation of 0.08.


In [197]:
# Calculating Cross Validation scores with ShuffleSplit
scores = cross_val_score(model, inputs, labels, cv=ss)

# Displaying results (for a regressor, cross_val_score returns R2 by default, not accuracy)
print(f"We get a mean R2 score of {scores.mean():.2f} with a standard deviation of {scores.std():.2f}.")

We get a mean R2 score of 0.74 with a standard deviation of 0.07.


With K-Fold cross validation, we obtain a mean R2 score of 0.75 with a standard deviation of 0.08. <br>
With ShuffleSplit cross validation, the mean R2 score ranges from about 0.69 to 0.74 across runs, with a standard deviation of 0.06 to 0.08. <br>
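The scores returned by `cross_val_score` for a regressor are coefficients of determination (R2), not classification accuracy. A minimal sketch on synthetic data (the arrays below are assumptions, not the wind farm data) shows that the default scoring and an explicit `scoring="r2"` agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumed, for illustration only)
rng = np.random.default_rng(1)
X_synth = rng.normal(size=(200, 4))
y_synth = X_synth @ np.array([1.5, 0.0, -2.0, 0.7]) + rng.normal(scale=1.0, size=200)

# Unshuffled KFold yields the same splits on both calls
cv = KFold(n_splits=5)

# Default scoring uses the estimator's own score method, which is R2 for regressors
default_scores = cross_val_score(LinearRegression(), X_synth, y_synth, cv=cv)

# Requesting R2 explicitly gives the same values and makes the metric obvious
r2_scores = cross_val_score(LinearRegression(), X_synth, y_synth, cv=cv, scoring="r2")

print(np.allclose(default_scores, r2_scores))
```

Passing `scoring="r2"` explicitly is optional, but it documents the metric in the code itself.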

**Conclusion**

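We can conclude that Linear Regression is a suitable model for this prediction task: of the models we tested it performed best, and its R2 scores of around 0.70 to 0.80 are consistent across the repeated train/test splits and both cross validation methods.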