# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**YOUR NAME, YOUR SURNAME**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner working of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [31]:
# Libraries for data loading, data manipulation and data visualisation
import numpy as np
import pandas as pd
import seaborn as sns
import datetime


# Libraries for data preparation and model building
import sklearn as sl
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score  # regression metrics; accuracy_score only applies to classifiers
from sklearn import tree

# Setting global constants to ensure notebook results are reproducible
##PARAMETER_CONSTANT = ###
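
The placeholder above can be filled in along the following lines. This is a minimal sketch: the name `RANDOM_STATE` and the value 42 are assumptions, not part of the starter notebook, and any fixed integer would serve.

```python
import numpy as np

# Hypothetical constant name and value; any fixed integer works.
RANDOM_STATE = 42

# Two generators seeded with the same constant produce identical draws,
# which is what makes a re-run of the notebook reproducible.
rng_a = np.random.default_rng(RANDOM_STATE)
rng_b = np.random.default_rng(RANDOM_STATE)
sample_a = rng_a.normal(size=3)
sample_b = rng_b.normal(size=3)
print(np.allclose(sample_a, sample_b))
```

The same constant can then be passed as `random_state=RANDOM_STATE` to `train_test_split` and to `RandomForestRegressor`, both of which accept that parameter.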

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [2]:
# load the data
df = pd.read_csv("C:/Users/lwazi/Downloads/Advanced-Regression-Starter-Data-3036 Predict/Advanced-Regression-Starter-Data/df_train.csv")

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


In [3]:
# look at data statistics
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0.1,Unnamed: 0,time,Madrid_wind_speed,Valencia_wind_deg,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Madrid_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,Bilbao_wind_deg,Barcelona_wind_speed,Barcelona_wind_deg,Madrid_clouds_all,Seville_wind_speed,Barcelona_rain_1h,Seville_pressure,Seville_rain_1h,Bilbao_snow_3h,Barcelona_pressure,Seville_rain_3h,Madrid_rain_1h,Barcelona_rain_3h,Valencia_snow_3h,Madrid_weather_id,Barcelona_weather_id,Bilbao_pressure,Seville_weather_id,Valencia_pressure,Seville_temp_max,Madrid_pressure,Valencia_temp_max,Valencia_temp,Bilbao_weather_id,Seville_temp,Valencia_humidity,Valencia_temp_min,Barcelona_temp_max,Madrid_temp_max,Barcelona_temp,Bilbao_temp_min,Bilbao_temp,Barcelona_temp_min,Bilbao_temp_max,Seville_temp_min,Madrid_temp,Madrid_temp_min,load_shortfall_3h
0,0,2015-01-01 03:00:00,0.666667,level_5,0.0,0.666667,74.333333,64.0,0.0,1.0,0.0,223.333333,6.333333,42.666667,0.0,3.333333,0.0,sp25,0.0,0.0,1036.333333,0.0,0.0,0.0,0.0,800.0,800.0,1035.0,800.0,1002.666667,274.254667,971.333333,269.888,269.888,800.0,274.254667,75.666667,269.888,281.013,265.938,281.013,269.338615,269.338615,281.013,269.338615,274.254667,265.938,265.938,6715.666667
1,1,2015-01-01 06:00:00,0.333333,level_10,0.0,1.666667,78.333333,64.666667,0.0,1.0,0.0,221.0,4.0,139.0,0.0,3.333333,0.0,sp25,0.0,0.0,1037.333333,0.0,0.0,0.0,0.0,800.0,800.0,1035.666667,800.0,1004.333333,274.945,972.666667,271.728333,271.728333,800.0,274.945,71.0,271.728333,280.561667,266.386667,280.561667,270.376,270.376,280.561667,270.376,274.945,266.386667,266.386667,4171.666667
2,2,2015-01-01 09:00:00,1.0,level_9,0.0,1.0,71.333333,64.333333,0.0,1.0,0.0,214.333333,2.0,326.0,0.0,2.666667,0.0,sp25,0.0,0.0,1038.0,0.0,0.0,0.0,0.0,800.0,800.0,1036.0,800.0,1005.333333,278.792,974.0,278.008667,278.008667,800.0,278.792,65.666667,278.008667,281.583667,272.708667,281.583667,275.027229,275.027229,281.583667,275.027229,278.792,272.708667,272.708667,4274.666667
3,3,2015-01-01 12:00:00,1.0,level_8,0.0,1.0,65.333333,56.333333,0.0,1.0,0.0,199.666667,2.333333,273.0,0.0,4.0,0.0,sp25,0.0,0.0,1037.0,0.0,0.0,0.0,0.0,800.0,800.0,1036.0,800.0,1009.0,285.394,994.666667,284.899552,284.899552,800.0,285.394,54.0,284.899552,283.434104,281.895219,283.434104,281.135063,281.135063,283.434104,281.135063,285.394,281.895219,281.895219,5075.666667
4,4,2015-01-01 15:00:00,1.0,level_7,0.0,1.0,59.0,57.0,2.0,0.333333,0.0,185.0,4.333333,260.0,0.0,3.0,0.0,sp25,0.0,0.0,1035.0,0.0,0.0,0.0,0.0,800.0,800.0,1035.333333,800.0,,285.513719,1035.333333,283.015115,283.015115,800.0,285.513719,58.333333,283.015115,284.213167,280.678437,284.213167,282.252063,282.252063,284.213167,282.252063,285.513719,280.678437,280.678437,6620.666667


In [None]:
# plot relevant feature interactions

In [4]:
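# evaluate correlation
def correlation(dataset, threshold):
    """Return the names of columns whose correlation with an earlier column exceeds threshold."""
    col_corr = set()
    # restrict to numeric columns; string columns such as time cannot be correlated
    corr_matrix = dataset.select_dtypes(include="number").corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            # only positive correlations are flagged; wrap the value in abs() to catch negative ones too
            if corr_matrix.iloc[i, j] > threshold:
                col_corr.add(corr_matrix.columns[i])
    return col_corr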


corr_features = correlation(df,0.7)
len(set(corr_features))

16

In [5]:
corr_features

{'Barcelona_temp',
 'Barcelona_temp_max',
 'Barcelona_temp_min',
 'Bilbao_temp',
 'Bilbao_temp_max',
 'Bilbao_temp_min',
 'Madrid_humidity',
 'Madrid_pressure',
 'Madrid_temp',
 'Madrid_temp_max',
 'Madrid_temp_min',
 'Seville_temp',
 'Seville_temp_min',
 'Valencia_temp',
 'Valencia_temp_max',
 'Valencia_temp_min'}
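
The filter above only flags positive correlations. A possible variant that also flags strong negative correlations is sketched below; the function name `correlated_columns` and the toy frame are illustrative assumptions, not the Spain data.

```python
import numpy as np
import pandas as pd

def correlated_columns(frame: pd.DataFrame, threshold: float) -> set:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep the strict upper triangle so each pair of columns is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return {col for col in upper.columns if (upper[col] > threshold).any()}

# Tiny synthetic frame (hypothetical values).
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with a
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with a
    "d": [1.0, -1.0, 2.0, -2.0, 0.0], # weakly correlated with the others
})
flagged = correlated_columns(demo, 0.7)
print(sorted(flagged))
```

As in the original function, the later column of each correlated pair is the one flagged, so the first column of a group is retained.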

In [6]:
#drop highly correlated features
df = df.drop(corr_features,axis=1)
df.head()

Unnamed: 0.1,Unnamed: 0,time,Madrid_wind_speed,Valencia_wind_deg,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,Bilbao_wind_deg,Barcelona_wind_speed,Barcelona_wind_deg,Madrid_clouds_all,Seville_wind_speed,Barcelona_rain_1h,Seville_pressure,Seville_rain_1h,Bilbao_snow_3h,Barcelona_pressure,Seville_rain_3h,Madrid_rain_1h,Barcelona_rain_3h,Valencia_snow_3h,Madrid_weather_id,Barcelona_weather_id,Bilbao_pressure,Seville_weather_id,Valencia_pressure,Seville_temp_max,Bilbao_weather_id,Valencia_humidity,load_shortfall_3h
0,0,2015-01-01 03:00:00,0.666667,level_5,0.0,0.666667,74.333333,0.0,1.0,0.0,223.333333,6.333333,42.666667,0.0,3.333333,0.0,sp25,0.0,0.0,1036.333333,0.0,0.0,0.0,0.0,800.0,800.0,1035.0,800.0,1002.666667,274.254667,800.0,75.666667,6715.666667
1,1,2015-01-01 06:00:00,0.333333,level_10,0.0,1.666667,78.333333,0.0,1.0,0.0,221.0,4.0,139.0,0.0,3.333333,0.0,sp25,0.0,0.0,1037.333333,0.0,0.0,0.0,0.0,800.0,800.0,1035.666667,800.0,1004.333333,274.945,800.0,71.0,4171.666667
2,2,2015-01-01 09:00:00,1.0,level_9,0.0,1.0,71.333333,0.0,1.0,0.0,214.333333,2.0,326.0,0.0,2.666667,0.0,sp25,0.0,0.0,1038.0,0.0,0.0,0.0,0.0,800.0,800.0,1036.0,800.0,1005.333333,278.792,800.0,65.666667,4274.666667
3,3,2015-01-01 12:00:00,1.0,level_8,0.0,1.0,65.333333,0.0,1.0,0.0,199.666667,2.333333,273.0,0.0,4.0,0.0,sp25,0.0,0.0,1037.0,0.0,0.0,0.0,0.0,800.0,800.0,1036.0,800.0,1009.0,285.394,800.0,54.0,5075.666667
4,4,2015-01-01 15:00:00,1.0,level_7,0.0,1.0,59.0,2.0,0.333333,0.0,185.0,4.333333,260.0,0.0,3.0,0.0,sp25,0.0,0.0,1035.0,0.0,0.0,0.0,0.0,800.0,800.0,1035.333333,800.0,,285.513719,800.0,58.333333,6620.666667


In [None]:
# have a look at feature distribution
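
One way to fill in this cell is a numeric summary of skewness and kurtosis per column. The sketch below uses a hypothetical two-column frame rather than `df`; on the real data the same two calls could be applied to the numeric columns.

```python
import pandas as pd

# Hypothetical data: one symmetric column, one with a single large value.
demo = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [0.0, 0.0, 0.0, 0.0, 10.0],
})

# skew() near 0 suggests a symmetric distribution; large positive values
# point to a long right tail, as is common for rainfall-type features.
summary = pd.DataFrame({"skew": demo.skew(), "kurtosis": demo.kurt()})
print(summary)
```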

In [30]:
df.corr()

Unnamed: 0,Madrid_wind_speed,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,Bilbao_wind_deg,Barcelona_wind_speed,Barcelona_wind_deg,Madrid_clouds_all,Seville_wind_speed,Barcelona_rain_1h,Seville_rain_1h,Bilbao_snow_3h,Barcelona_pressure,Seville_rain_3h,Madrid_rain_1h,Barcelona_rain_3h,Valencia_snow_3h,Madrid_weather_id,Barcelona_weather_id,Bilbao_pressure,Seville_weather_id,Valencia_pressure,Seville_temp_max,Bilbao_weather_id,Valencia_humidity,load_shortfall_3h,Seville_pressure,new_time
Madrid_wind_speed,1.0,0.259719,0.513092,-0.117892,0.244001,0.377854,0.191251,0.27095,0.29464,-0.09538,0.230126,0.434104,0.062758,0.108413,0.071183,0.011134,0.004795,0.150446,-0.014644,0.02166,-0.169358,-0.099582,-0.231747,-0.120014,-0.142737,0.050043,-0.238128,-0.285787,-0.150981,-0.182792,0.186228
Bilbao_rain_1h,0.259719,1.0,0.265864,0.069878,0.370733,0.085398,0.081131,0.27935,0.069997,-0.030723,0.135524,0.140101,0.052558,0.092984,0.09673,0.052458,0.016392,0.187423,-0.001412,0.008269,-0.147768,-0.120618,-0.054814,-0.095723,-0.199341,-0.210323,-0.604616,-0.103868,-0.15251,0.067471,0.054527
Valencia_wind_speed,0.513092,0.265864,1.0,-0.075227,0.210524,0.386478,0.163675,0.248643,0.347966,-0.066071,0.221887,0.316035,0.031804,0.046085,0.115133,0.050282,0.027637,0.093865,-0.037553,0.058629,-0.099056,-0.037605,-0.096374,-0.069092,-0.038234,-0.024045,-0.201379,-0.413017,-0.142791,-0.065082,0.204103
Seville_humidity,-0.117892,0.069878,-0.075227,1.0,0.06168,-0.08818,0.399436,-0.087246,-0.138625,0.164064,0.366602,-0.202449,-0.051022,0.227476,0.023556,0.021599,0.034343,0.164019,0.015555,0.007351,-0.228442,-0.050515,-0.099458,-0.328265,-0.078962,-0.566426,-0.105088,0.464012,-0.16729,0.217941,-0.424982
Bilbao_clouds_all,0.244001,0.370733,0.210524,0.06168,1.0,0.031915,0.046737,0.280154,0.094019,-0.06512,0.109788,0.075066,0.052913,0.04109,0.08018,0.037506,0.009557,0.089281,-0.041013,0.024339,-0.080837,-0.124169,0.000377,-0.033825,-0.067832,-0.102322,-0.536205,-0.129684,-0.127293,-0.038859,-0.023714
Bilbao_wind_speed,0.377854,0.085398,0.386478,-0.08818,0.031915,1.0,0.127344,0.417534,0.275317,-0.018225,0.239326,0.21342,-0.02664,0.07308,-0.001642,0.009572,-0.026037,0.088502,-0.038246,-0.008114,-0.101497,-0.003074,-0.122915,-0.086691,0.049049,0.103342,-0.031661,-0.279825,-0.081602,-0.115875,0.197848
Seville_clouds_all,0.191251,0.081131,0.163675,0.399436,0.046737,0.127344,1.0,0.053482,0.136591,-0.031373,0.552414,0.144119,0.00359,0.408001,0.001718,0.020264,0.08724,0.295499,0.029194,-0.009782,-0.376157,-0.099166,-0.330575,-0.537924,-0.19554,-0.181783,-0.101888,0.097491,-0.091804,-0.094748,-0.017401
Bilbao_wind_deg,0.27095,0.27935,0.248643,-0.087246,0.280154,0.417534,0.053482,1.0,0.177393,-0.015481,0.08504,0.120378,0.026187,0.030082,-0.041314,0.03422,0.006888,0.057058,0.007202,-0.02268,-0.036532,-0.053839,-0.107361,-0.008937,-0.111615,-0.076038,-0.264719,-0.230583,-0.1208,0.041985,0.110457
Barcelona_wind_speed,0.29464,0.069997,0.347966,-0.138625,0.094019,0.275317,0.136591,0.177393,1.0,0.076376,0.147652,0.212193,0.042136,0.105892,0.015752,0.00128,0.058662,0.130751,-0.001722,0.030336,-0.106432,-0.048004,-0.083399,-0.090902,-0.068613,0.152852,-0.064746,-0.24961,-0.103633,-0.113567,0.182538
Barcelona_wind_deg,-0.09538,-0.030723,-0.066071,0.164064,-0.06512,-0.018225,-0.031373,-0.015481,0.076376,1.0,-0.041083,-0.098837,-0.037854,-0.101449,-0.023039,-0.001079,-0.0438,-0.112801,-0.011875,-0.021024,0.095456,0.151534,0.123565,0.068195,0.058359,-0.083393,0.049678,0.045277,-0.116133,0.19935,-0.192949


<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [6]:
df.isnull().sum()/df.shape[0]*100

Unnamed: 0               0.000000
time                     0.000000
Madrid_wind_speed        0.000000
Valencia_wind_deg        0.000000
Bilbao_rain_1h           0.000000
Valencia_wind_speed      0.000000
Seville_humidity         0.000000
Madrid_humidity          0.000000
Bilbao_clouds_all        0.000000
Bilbao_wind_speed        0.000000
Seville_clouds_all       0.000000
Bilbao_wind_deg          0.000000
Barcelona_wind_speed     0.000000
Barcelona_wind_deg       0.000000
Madrid_clouds_all        0.000000
Seville_wind_speed       0.000000
Barcelona_rain_1h        0.000000
Seville_pressure         0.000000
Seville_rain_1h          0.000000
Bilbao_snow_3h           0.000000
Barcelona_pressure       0.000000
Seville_rain_3h          0.000000
Madrid_rain_1h           0.000000
Barcelona_rain_3h        0.000000
Valencia_snow_3h         0.000000
Madrid_weather_id        0.000000
Barcelona_weather_id     0.000000
Bilbao_pressure          0.000000
Seville_weather_id       0.000000
Valencia_press

In [7]:
# impute missing Valencia_pressure values with the column mean
mean = df['Valencia_pressure'].mean()
df['Valencia_pressure'] = df['Valencia_pressure'].fillna(mean)
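
The cell above computes the mean over the whole DataFrame. If the data were split before imputation, the mean could be taken from the training rows only, so that no information from the held-out rows enters the fill value. A minimal sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical pressure readings with two gaps.
demo = pd.DataFrame({"Valencia_pressure": [1002.0, 1004.0, np.nan, 1010.0, np.nan, 1000.0]})
train, test = demo.iloc[:4].copy(), demo.iloc[4:].copy()

# Mean of the observed training values only: (1002 + 1004 + 1010) / 3.
train_mean = train["Valencia_pressure"].mean()
train["Valencia_pressure"] = train["Valencia_pressure"].fillna(train_mean)
test["Valencia_pressure"] = test["Valencia_pressure"].fillna(train_mean)
print(train_mean)
```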

In [8]:
# drop features that will not be used
df.drop('Valencia_wind_deg', axis='columns', inplace=True)  # string labels such as level_5; dropped rather than encoded
df.drop('Unnamed: 0', axis='columns', inplace=True)  # removing Unnamed: 0, as it adds no value
df.head()

Unnamed: 0,time,Madrid_wind_speed,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,Bilbao_wind_deg,Barcelona_wind_speed,Barcelona_wind_deg,Madrid_clouds_all,Seville_wind_speed,Barcelona_rain_1h,Seville_pressure,Seville_rain_1h,Bilbao_snow_3h,Barcelona_pressure,Seville_rain_3h,Madrid_rain_1h,Barcelona_rain_3h,Valencia_snow_3h,Madrid_weather_id,Barcelona_weather_id,Bilbao_pressure,Seville_weather_id,Valencia_pressure,Seville_temp_max,Bilbao_weather_id,Valencia_humidity,load_shortfall_3h
0,2015-01-01 03:00:00,0.666667,0.0,0.666667,74.333333,0.0,1.0,0.0,223.333333,6.333333,42.666667,0.0,3.333333,0.0,sp25,0.0,0.0,1036.333333,0.0,0.0,0.0,0.0,800.0,800.0,1035.0,800.0,1002.666667,274.254667,800.0,75.666667,6715.666667
1,2015-01-01 06:00:00,0.333333,0.0,1.666667,78.333333,0.0,1.0,0.0,221.0,4.0,139.0,0.0,3.333333,0.0,sp25,0.0,0.0,1037.333333,0.0,0.0,0.0,0.0,800.0,800.0,1035.666667,800.0,1004.333333,274.945,800.0,71.0,4171.666667
2,2015-01-01 09:00:00,1.0,0.0,1.0,71.333333,0.0,1.0,0.0,214.333333,2.0,326.0,0.0,2.666667,0.0,sp25,0.0,0.0,1038.0,0.0,0.0,0.0,0.0,800.0,800.0,1036.0,800.0,1005.333333,278.792,800.0,65.666667,4274.666667
3,2015-01-01 12:00:00,1.0,0.0,1.0,65.333333,0.0,1.0,0.0,199.666667,2.333333,273.0,0.0,4.0,0.0,sp25,0.0,0.0,1037.0,0.0,0.0,0.0,0.0,800.0,800.0,1036.0,800.0,1009.0,285.394,800.0,54.0,5075.666667
4,2015-01-01 15:00:00,1.0,0.0,1.0,59.0,2.0,0.333333,0.0,185.0,4.333333,260.0,0.0,3.0,0.0,sp25,0.0,0.0,1035.0,0.0,0.0,0.0,0.0,800.0,800.0,1035.333333,800.0,1012.051407,285.513719,800.0,58.333333,6620.666667
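
`Valencia_wind_deg` was dropped above, but its labels (`level_5`, `level_10`, ...) carry an integer suffix that could be extracted instead. The sketch below uses the five labels shown in the first rows of the data; treating the suffix as an ordinal wind-direction bin is an assumption, not something the dataset documents here.

```python
import pandas as pd

# Labels copied from the first five rows of Valencia_wind_deg.
wind = pd.Series(["level_5", "level_10", "level_9", "level_8", "level_7"])

# Pull out the digits after the prefix and convert them to integers.
wind_level = wind.str.extract(r"(\d+)", expand=False).astype(int)
print(wind_level.tolist())  # [5, 10, 9, 8, 7]
```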


In [9]:
# engineer existing features

# strip the "sp" prefix from the Seville pressure column and convert it to float
df["sp"] = pd.to_numeric(df["Seville_pressure"].str.replace("sp", "")).astype(float)

# drop the original string column
df.drop(["Seville_pressure"], axis=1, inplace=True)

# rename the new numeric column and view the updated table
df.rename({'sp': 'Seville_pressure'}, axis=1, inplace=True)
df.head()
df.head()



Unnamed: 0,time,Madrid_wind_speed,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,Bilbao_wind_deg,Barcelona_wind_speed,Barcelona_wind_deg,Madrid_clouds_all,Seville_wind_speed,Barcelona_rain_1h,Seville_rain_1h,Bilbao_snow_3h,Barcelona_pressure,Seville_rain_3h,Madrid_rain_1h,Barcelona_rain_3h,Valencia_snow_3h,Madrid_weather_id,Barcelona_weather_id,Bilbao_pressure,Seville_weather_id,Valencia_pressure,Seville_temp_max,Bilbao_weather_id,Valencia_humidity,load_shortfall_3h,Seville_pressure
0,2015-01-01 03:00:00,0.666667,0.0,0.666667,74.333333,0.0,1.0,0.0,223.333333,6.333333,42.666667,0.0,3.333333,0.0,0.0,0.0,1036.333333,0.0,0.0,0.0,0.0,800.0,800.0,1035.0,800.0,1002.666667,274.254667,800.0,75.666667,6715.666667,25.0
1,2015-01-01 06:00:00,0.333333,0.0,1.666667,78.333333,0.0,1.0,0.0,221.0,4.0,139.0,0.0,3.333333,0.0,0.0,0.0,1037.333333,0.0,0.0,0.0,0.0,800.0,800.0,1035.666667,800.0,1004.333333,274.945,800.0,71.0,4171.666667,25.0
2,2015-01-01 09:00:00,1.0,0.0,1.0,71.333333,0.0,1.0,0.0,214.333333,2.0,326.0,0.0,2.666667,0.0,0.0,0.0,1038.0,0.0,0.0,0.0,0.0,800.0,800.0,1036.0,800.0,1005.333333,278.792,800.0,65.666667,4274.666667,25.0
3,2015-01-01 12:00:00,1.0,0.0,1.0,65.333333,0.0,1.0,0.0,199.666667,2.333333,273.0,0.0,4.0,0.0,0.0,0.0,1037.0,0.0,0.0,0.0,0.0,800.0,800.0,1036.0,800.0,1009.0,285.394,800.0,54.0,5075.666667,25.0
4,2015-01-01 15:00:00,1.0,0.0,1.0,59.0,2.0,0.333333,0.0,185.0,4.333333,260.0,0.0,3.0,0.0,0.0,0.0,1035.0,0.0,0.0,0.0,0.0,800.0,800.0,1035.333333,800.0,1012.051407,285.513719,800.0,58.333333,6620.666667,25.0


In [10]:
#making time  the index column
df.set_index("time", inplace = True)
df.head()

time,Madrid_wind_speed,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,Bilbao_wind_deg,Barcelona_wind_speed,Barcelona_wind_deg,Madrid_clouds_all,Seville_wind_speed,Barcelona_rain_1h,Seville_rain_1h,Bilbao_snow_3h,Barcelona_pressure,Seville_rain_3h,Madrid_rain_1h,Barcelona_rain_3h,Valencia_snow_3h,Madrid_weather_id,Barcelona_weather_id,Bilbao_pressure,Seville_weather_id,Valencia_pressure,Seville_temp_max,Bilbao_weather_id,Valencia_humidity,load_shortfall_3h,Seville_pressure
2015-01-01 03:00:00,0.666667,0.0,0.666667,74.333333,0.0,1.0,0.0,223.333333,6.333333,42.666667,0.0,3.333333,0.0,0.0,0.0,1036.333333,0.0,0.0,0.0,0.0,800.0,800.0,1035.0,800.0,1002.666667,274.254667,800.0,75.666667,6715.666667,25.0
2015-01-01 06:00:00,0.333333,0.0,1.666667,78.333333,0.0,1.0,0.0,221.0,4.0,139.0,0.0,3.333333,0.0,0.0,0.0,1037.333333,0.0,0.0,0.0,0.0,800.0,800.0,1035.666667,800.0,1004.333333,274.945,800.0,71.0,4171.666667,25.0
2015-01-01 09:00:00,1.0,0.0,1.0,71.333333,0.0,1.0,0.0,214.333333,2.0,326.0,0.0,2.666667,0.0,0.0,0.0,1038.0,0.0,0.0,0.0,0.0,800.0,800.0,1036.0,800.0,1005.333333,278.792,800.0,65.666667,4274.666667,25.0
2015-01-01 12:00:00,1.0,0.0,1.0,65.333333,0.0,1.0,0.0,199.666667,2.333333,273.0,0.0,4.0,0.0,0.0,0.0,1037.0,0.0,0.0,0.0,0.0,800.0,800.0,1036.0,800.0,1009.0,285.394,800.0,54.0,5075.666667,25.0
2015-01-01 15:00:00,1.0,0.0,1.0,59.0,2.0,0.333333,0.0,185.0,4.333333,260.0,0.0,3.0,0.0,0.0,0.0,1035.0,0.0,0.0,0.0,0.0,800.0,800.0,1035.333333,800.0,1012.051407,285.513719,800.0,58.333333,6620.666667,25.0
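
The `time` index is still a string at this point. Since the brief asks for feature engineering, one option is to parse it and derive calendar features. The sketch below works on three timestamps taken from the rows above; the feature names `hour`, `day_of_week` and `month` are illustrative choices.

```python
import pandas as pd

# Timestamps copied from the first rows of the time column.
times = pd.Series(["2015-01-01 03:00:00", "2015-01-01 06:00:00", "2015-01-01 15:00:00"])
parsed = pd.to_datetime(times)

features = pd.DataFrame({
    "hour": parsed.dt.hour,
    "day_of_week": parsed.dt.dayofweek,  # Monday = 0, so Thursday = 3
    "month": parsed.dt.month,
})
print(features)
```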


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8763 entries, 2015-01-01 03:00:00 to 2017-12-31 21:00:00
Data columns (total 30 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Madrid_wind_speed     8763 non-null   float64
 1   Bilbao_rain_1h        8763 non-null   float64
 2   Valencia_wind_speed   8763 non-null   float64
 3   Seville_humidity      8763 non-null   float64
 4   Bilbao_clouds_all     8763 non-null   float64
 5   Bilbao_wind_speed     8763 non-null   float64
 6   Seville_clouds_all    8763 non-null   float64
 7   Bilbao_wind_deg       8763 non-null   float64
 8   Barcelona_wind_speed  8763 non-null   float64
 9   Barcelona_wind_deg    8763 non-null   float64
 10  Madrid_clouds_all     8763 non-null   float64
 11  Seville_wind_speed    8763 non-null   float64
 12  Barcelona_rain_1h     8763 non-null   float64
 13  Seville_rain_1h       8763 non-null   float64
 14  Bilbao_snow_3h        8763 non-null   float6

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the three-hour load shortfall. |

---

In [50]:
# split data
from sklearn.model_selection import train_test_split
X = df.drop('load_shortfall_3h', axis=1)
y = df["load_shortfall_3h"]  # target (response) variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
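
`train_test_split` shuffles rows by default, so observations from 2017 can end up in the training set while earlier ones are held out. For a forecasting task, a chronological split is a common alternative. The following is a sketch on a hypothetical ten-row frame, not a change to the split used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical three-hourly index and values.
idx = pd.date_range("2015-01-01 03:00:00", periods=10, freq=pd.Timedelta(hours=3))
demo = pd.DataFrame(
    {"x": np.arange(10.0), "load_shortfall_3h": np.arange(10.0) * 2},
    index=idx,
)

# First 80% of the rows for training, the most recent 20% for testing.
cutoff = int(len(demo) * 0.8)
train, test = demo.iloc[:cutoff], demo.iloc[cutoff:]
print(len(train), len(test))
```

Passing `shuffle=False` to `train_test_split` gives the same ordering-preserving behaviour.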

In [51]:
# fit a baseline linear regression model on the training data
from sklearn.linear_model import LinearRegression


regressor = LinearRegression(fit_intercept =True)

regressor.fit(X_train,y_train)


LinearRegression()

In [61]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_standardise = pd.DataFrame(X_scaled,columns=X.columns)
X_standardise.head()

Unnamed: 0,Madrid_wind_speed,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,Bilbao_wind_deg,Barcelona_wind_speed,Barcelona_wind_deg,Madrid_clouds_all,Seville_wind_speed,Barcelona_rain_1h,Seville_rain_1h,Bilbao_snow_3h,Barcelona_pressure,Seville_rain_3h,Madrid_rain_1h,Barcelona_rain_3h,Valencia_snow_3h,Madrid_weather_id,Barcelona_weather_id,Bilbao_pressure,Seville_weather_id,Valencia_pressure,Seville_temp_max,Bilbao_weather_id,Valencia_humidity,Seville_pressure
0,-0.950708,-0.362123,-0.796169,0.516117,-1.335491,-0.501451,-0.565065,0.630823,1.932284,-1.660205,-0.694188,0.542975,-0.203099,-0.224278,-0.057269,-0.024277,-0.066278,-0.247776,-0.110037,-0.017312,0.342424,0.385993,1.718219,0.352274,-1.129531,-2.616796,0.649842,0.540928,1.588087
1,-1.130863,-0.362123,-0.381412,0.692953,-1.335491,-0.501451,-0.565065,0.607959,0.63027,-0.578686,-0.694188,0.542975,-0.203099,-0.224278,-0.057269,-0.024206,-0.066278,-0.247776,-0.110037,-0.017312,0.342424,0.385993,1.784583,0.352274,-0.928934,-2.539014,0.649842,0.298645,1.588087
2,-0.770554,-0.362123,-0.657917,0.383491,-1.335491,-0.501451,-0.565065,0.542632,-0.485743,1.520733,-0.694188,0.144442,-0.203099,-0.224278,-0.057269,-0.024158,-0.066278,-0.247776,-0.110037,-0.017312,0.342424,0.385993,1.817765,0.352274,-0.8085757,-2.105564,0.649842,0.02175,1.588087
3,-0.770554,-0.362123,-0.657917,0.118238,-1.335491,-0.501451,-0.565065,0.398912,-0.299741,0.925711,-0.694188,0.941509,-0.203099,-0.224278,-0.057269,-0.024229,-0.066278,-0.247776,-0.110037,-0.017312,0.342424,0.385993,1.817765,0.352274,-0.367262,-1.361703,0.649842,-0.583957,1.588087
4,-0.770554,-0.362123,-0.657917,-0.161751,-1.274045,-0.894581,-0.565065,0.255192,0.816272,0.779762,-0.694188,0.343708,-0.203099,-0.224278,-0.057269,-0.024372,-0.066278,-0.247776,-0.110037,-0.017312,0.342424,0.385993,1.751401,0.352274,2.73663e-13,-1.348214,0.649842,-0.35898,1.588087
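
The scaler above is fitted on all of `X`, including the rows later used for testing, and the scaled frame is not passed to any model. One way to keep scaling and fitting together is a scikit-learn pipeline, which fits the scaler on the training rows only. The data below is synthetic and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with a known linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# fit() learns the scaling statistics from X_tr only, then fits the regression.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)  # R^2 on the held-out rows
print(round(r2, 4))
```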


In [62]:
clf = RandomForestRegressor(n_estimators=10)
clf.fit(X_train, y_train)
# predict() expects the 2D feature matrix; passing the 1D target y_test instead
# raised "ValueError: Expected 2D array, got 1D array instead"
predicted = clf.predict(X_test)
# score() takes (X, y) and returns R^2 for regressors; it has no normalize argument
clf.score(X_test, y_test)

In [63]:
# create one or more ML models
# instantiate a LinearRegression object
# with "fit_intercept=True" the model estimates both the coefficients 'm' and the intercept 'b'
# with "fit_intercept=False" the model estimates only 'm'; the intercept 'b' is fixed at zero
regressor = LinearRegression(fit_intercept =True)
regressor.fit(X_train,y_train)

print('Linear Model Coefficient (m): ', regressor.coef_)
print('Linear Model Coefficient (b): ', regressor.intercept_)

Linear Model Coefficient (m):  [-2.69367945e+02 -7.80722502e+02 -9.30929174e+01 -5.82726524e+01
 -3.56590035e+00 -2.61095619e+01  2.30807687e+00 -2.40278244e+00
 -1.49695658e+02 -6.00815460e+00  7.63219780e+00 -2.31000828e+01
 -2.60405764e+02  8.66479926e+02  1.29121975e+02 -1.12397115e-02
 -4.61079398e+04  5.93257820e+02 -3.31270609e+04 -7.68552589e+03
 -1.68284532e+00  2.62749266e+00 -9.30905010e+00  1.09751948e+00
  2.95036930e+01 -8.50763542e+00  9.73630657e-01  2.26386308e+01
  2.09538918e+01]
Linear Model Coefficient (b):  -4469.442574255472


In [64]:
# evaluate one or more ML models
predicted = regressor.predict(X_test)
type(predicted)
print([predicted])

[array([10054.83917472, 12137.17290168, 11000.64901793, ...,
       12735.29767683, 10968.50945945, 10359.15569755])]


In [65]:
import statsmodels.api as sml
from statsmodels import tools

X_new = tools.add_constant(X)

regressor_OLS = sml.OLS(endog = y,exog =  X_new).fit()

regressor_OLS.summary()

0,1,2,3
Dep. Variable:,load_shortfall_3h,R-squared:,0.105
Model:,OLS,Adj. R-squared:,0.102
Method:,Least Squares,F-statistic:,35.49
Date:,"Wed, 03 Nov 2021",Prob (F-statistic):,2.18e-186
Time:,17:18:23,Log-Likelihood:,-86956.0
No. Observations:,8763,AIC:,174000.0
Df Residuals:,8733,BIC:,174200.0
Df Model:,29,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-6490.2810,8353.550,-0.777,0.437,-2.29e+04,9884.646
Madrid_wind_speed,-252.8650,37.914,-6.669,0.000,-327.185,-178.545
Bilbao_rain_1h,-769.9494,185.820,-4.144,0.000,-1134.201,-405.698
Valencia_wind_speed,-83.9167,29.338,-2.860,0.004,-141.425,-26.408
Seville_humidity,-56.7209,3.704,-15.313,0.000,-63.982,-49.460
Bilbao_clouds_all,-3.7080,2.023,-1.833,0.067,-7.673,0.257
Bilbao_wind_speed,-32.9277,39.559,-0.832,0.405,-110.474,44.618
Seville_clouds_all,2.3739,3.040,0.781,0.435,-3.585,8.332
Bilbao_wind_deg,-2.6406,0.629,-4.200,0.000,-3.873,-1.408

0,1,2,3
Omnibus:,130.794,Durbin-Watson:,0.396
Prob(Omnibus):,0.0,Jarque-Bera (JB):,136.47
Skew:,-0.306,Prob(JB):,2.32e-30
Kurtosis:,3.013,Cond. No.,4130000.0
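
The summary reports a Durbin-Watson statistic of 0.396. Values near 2 suggest no first-order autocorrelation in the residuals, while values toward 0 point to positive autocorrelation, which is plausible for three-hourly time series data. The statistic is the sum of squared successive residual differences divided by the sum of squared residuals. The sketch below uses hand-made residuals; the function name is hypothetical.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    diff = np.diff(residuals)
    return float(np.sum(diff ** 2) / np.sum(residuals ** 2))

# Residuals that flip sign every step: negative autocorrelation, statistic above 2.
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
# Residuals that stay on one side for a while: positive autocorrelation, statistic below 2.
trending = np.array([1.0, 1.0, 1.0, -1.0, -1.0])

print(durbin_watson_stat(alternating))  # 16 / 5 = 3.2
print(durbin_watson_stat(trending))     # 4 / 5 = 0.8
```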


In [66]:
#checking mean squared error
from sklearn.metrics import mean_squared_error 
print("MSE",mean_squared_error(y_test,predicted))

MSE 25095906.31163853


In [67]:
#checking Root Mean Squared Error
print("RMSE",np.sqrt(mean_squared_error(y_test, predicted)))

###ALTERNATIVE METHOD
#import math
#math.sqrt(mean_squared_error(y_test, predicted))###


RMSE 5009.581450744017
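
To make the two metrics concrete, the sketch below computes MSE, RMSE and R² by hand on four hypothetical values, then checks that the RMSE reported above is the square root of the reported MSE.

```python
import numpy as np

# Hypothetical targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred                 # 1, 0, -1, -2
mse = float(np.mean(errors ** 2))        # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = float(np.sqrt(mse))               # same units as the target

# R^2 compares the model's squared error with that of always predicting the mean.
ss_res = float(np.sum(errors ** 2))                      # 6
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # 9 + 1 + 1 + 9 = 20
r2 = 1.0 - ss_res / ss_tot                               # 0.7

# Consistency check on the values printed by the two cells above.
reported_rmse = float(np.sqrt(25095906.31163853))
print(mse, rmse, r2, reported_rmse)
```

Because RMSE is expressed in the same units as `load_shortfall_3h`, it is usually easier to interpret than MSE.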


In [71]:
# accuracy_score is a classification metric and raised a ValueError here;
# for a continuous target, R^2 is the analogous summary score
from sklearn.metrics import r2_score
r2_score(y_test, predicted)

In [69]:
y_test.head()

time
2017-05-10 09:00:00    17533.666667
2016-04-18 15:00:00     4434.000000
2015-06-08 06:00:00    11515.666667
2016-08-18 18:00:00    14179.333333
2017-06-04 15:00:00     6127.000000
Name: load_shortfall_3h, dtype: float64

In [70]:
# score() expects the feature matrix and the true targets, not reshaped predictions;
# for a regressor it returns R^2 on the data passed in
regressor.score(X_test, y_test)

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic
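
If a tree ensemble turns out to be the chosen model, its `feature_importances_` attribute gives a simple starting point for explaining which inputs drive the predictions. The sketch below uses synthetic features named `strong`, `weak` and `noise` (hypothetical names) so the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends heavily on "strong", slightly on "weak",
# and not at all on "noise".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "strong": rng.normal(size=300),
    "weak": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_demo = 5.0 * X_demo["strong"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Importances sum to 1; larger values mean the feature reduced more error in the trees.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

For a non-technical audience, a ranked bar chart of these values is often enough to convey which weather features matter most.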