# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**YOUR NAME, YOUR SURNAME**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of it's renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the countries renewable sources and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner working of the model to a non-technical audience.

Formally the problem statement was given to you, the senior data scientist, by your manager via email reads as follow:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [108]:
# Libraries for data loading, data manipulation and data visulisation
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries for data preparation and model building
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn import metrics

from sklearn.model_selection import train_test_split
# import scaler method from sklearn
from sklearn.preprocessing import StandardScaler

# Libraries for file manipulation
import os

# Setting global constants to ensure notebook results are reproducible
#PARAMETER_CONSTANT = ###

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [2]:
# load the data
test_df = pd.read_csv('df_test.csv', index_col=0)
test_df.head(2)

Unnamed: 0,time,Madrid_wind_speed,Valencia_wind_deg,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Madrid_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,...,Barcelona_temp_max,Madrid_temp_max,Barcelona_temp,Bilbao_temp_min,Bilbao_temp,Barcelona_temp_min,Bilbao_temp_max,Seville_temp_min,Madrid_temp,Madrid_temp_min
8763,2018-01-01 00:00:00,5.0,level_8,0.0,5.0,87.0,71.333333,20.0,3.0,0.0,...,287.816667,280.816667,287.356667,276.15,280.38,286.816667,285.15,283.15,279.866667,279.15
8764,2018-01-01 03:00:00,4.666667,level_8,0.0,5.333333,89.0,78.0,0.0,3.666667,0.0,...,284.816667,280.483333,284.19,277.816667,281.01,283.483333,284.15,281.15,279.193333,278.15


In [3]:
train_df = pd.read_csv('df_train.csv', index_col=0)
train_df.head(2)

Unnamed: 0,time,Madrid_wind_speed,Valencia_wind_deg,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Madrid_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,...,Madrid_temp_max,Barcelona_temp,Bilbao_temp_min,Bilbao_temp,Barcelona_temp_min,Bilbao_temp_max,Seville_temp_min,Madrid_temp,Madrid_temp_min,load_shortfall_3h
0,2015-01-01 03:00:00,0.666667,level_5,0.0,0.666667,74.333333,64.0,0.0,1.0,0.0,...,265.938,281.013,269.338615,269.338615,281.013,269.338615,274.254667,265.938,265.938,6715.666667
1,2015-01-01 06:00:00,0.333333,level_10,0.0,1.666667,78.333333,64.666667,0.0,1.0,0.0,...,266.386667,280.561667,270.376,270.376,280.561667,270.376,274.945,266.386667,266.386667,4171.666667


<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


In [4]:
# look at data statistics

In [5]:
train_df.shape

(8763, 48)

In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8763 entries, 0 to 8762
Data columns (total 48 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   time                  8763 non-null   object 
 1   Madrid_wind_speed     8763 non-null   float64
 2   Valencia_wind_deg     8763 non-null   object 
 3   Bilbao_rain_1h        8763 non-null   float64
 4   Valencia_wind_speed   8763 non-null   float64
 5   Seville_humidity      8763 non-null   float64
 6   Madrid_humidity       8763 non-null   float64
 7   Bilbao_clouds_all     8763 non-null   float64
 8   Bilbao_wind_speed     8763 non-null   float64
 9   Seville_clouds_all    8763 non-null   float64
 10  Bilbao_wind_deg       8763 non-null   float64
 11  Barcelona_wind_speed  8763 non-null   float64
 12  Barcelona_wind_deg    8763 non-null   float64
 13  Madrid_clouds_all     8763 non-null   float64
 14  Seville_wind_speed    8763 non-null   float64
 15  Barcelona_rain_1h    

In [7]:
train_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Madrid_wind_speed,8763.0,2.425729,1.850371,0.0,1.0,2.0,3.333333,13.0
Bilbao_rain_1h,8763.0,0.135753,0.374901,0.0,0.0,0.0,0.1,3.0
Valencia_wind_speed,8763.0,2.586272,2.41119,0.0,1.0,1.666667,3.666667,52.0
Seville_humidity,8763.0,62.658793,22.621226,8.333333,44.333333,65.666667,82.0,100.0
Madrid_humidity,8763.0,57.414717,24.335396,6.333333,36.333333,58.0,78.666667,100.0
Bilbao_clouds_all,8763.0,43.469132,32.551044,0.0,10.0,45.0,75.0,100.0
Bilbao_wind_speed,8763.0,1.850356,1.695888,0.0,0.666667,1.0,2.666667,12.66667
Seville_clouds_all,8763.0,13.714748,24.272482,0.0,0.0,0.0,20.0,97.33333
Bilbao_wind_deg,8763.0,158.957511,102.056299,0.0,73.333333,147.0,234.0,359.3333
Barcelona_wind_speed,8763.0,2.870497,1.792197,0.0,1.666667,2.666667,4.0,12.66667


In [8]:
train_df.isnull().sum()

time                       0
Madrid_wind_speed          0
Valencia_wind_deg          0
Bilbao_rain_1h             0
Valencia_wind_speed        0
Seville_humidity           0
Madrid_humidity            0
Bilbao_clouds_all          0
Bilbao_wind_speed          0
Seville_clouds_all         0
Bilbao_wind_deg            0
Barcelona_wind_speed       0
Barcelona_wind_deg         0
Madrid_clouds_all          0
Seville_wind_speed         0
Barcelona_rain_1h          0
Seville_pressure           0
Seville_rain_1h            0
Bilbao_snow_3h             0
Barcelona_pressure         0
Seville_rain_3h            0
Madrid_rain_1h             0
Barcelona_rain_3h          0
Valencia_snow_3h           0
Madrid_weather_id          0
Barcelona_weather_id       0
Bilbao_pressure            0
Seville_weather_id         0
Valencia_pressure       2068
Seville_temp_max           0
Madrid_pressure            0
Valencia_temp_max          0
Valencia_temp              0
Bilbao_weather_id          0
Seville_temp  

In [9]:
# plot relevant feature interactions

In [10]:
# evaluate correlation

In [11]:
# have a look at feature distributions

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [12]:
# remove missing values/ features

In [13]:
# create new features

In [125]:
train_df[["Valencia_pressure"]] = train_df[["Valencia_pressure"]].fillna(train_df[["Valencia_pressure"]].mean())

In [126]:
train_df.isnull().sum()

time                    0
Madrid_wind_speed       0
Valencia_wind_deg       0
Bilbao_rain_1h          0
Valencia_wind_speed     0
Seville_humidity        0
Madrid_humidity         0
Bilbao_clouds_all       0
Bilbao_wind_speed       0
Seville_clouds_all      0
Bilbao_wind_deg         0
Barcelona_wind_speed    0
Barcelona_wind_deg      0
Madrid_clouds_all       0
Seville_wind_speed      0
Barcelona_rain_1h       0
Seville_pressure        0
Seville_rain_1h         0
Bilbao_snow_3h          0
Barcelona_pressure      0
Seville_rain_3h         0
Madrid_rain_1h          0
Barcelona_rain_3h       0
Valencia_snow_3h        0
Madrid_weather_id       0
Barcelona_weather_id    0
Bilbao_pressure         0
Seville_weather_id      0
Valencia_pressure       0
Seville_temp_max        0
Madrid_pressure         0
Valencia_temp_max       0
Valencia_temp           0
Bilbao_weather_id       0
Seville_temp            0
Valencia_humidity       0
Valencia_temp_min       0
Barcelona_temp_max      0
Madrid_temp_

In [127]:
# engineer existing features
train_df.columns.sort_values()

Index(['Barcelona_pressure', 'Barcelona_rain_1h', 'Barcelona_rain_3h',
       'Barcelona_temp', 'Barcelona_temp_max', 'Barcelona_temp_min',
       'Barcelona_weather_id', 'Barcelona_wind_deg', 'Barcelona_wind_speed',
       'Bilbao_clouds_all', 'Bilbao_pressure', 'Bilbao_rain_1h',
       'Bilbao_snow_3h', 'Bilbao_temp', 'Bilbao_temp_max', 'Bilbao_temp_min',
       'Bilbao_weather_id', 'Bilbao_wind_deg', 'Bilbao_wind_speed',
       'Madrid_clouds_all', 'Madrid_humidity', 'Madrid_pressure',
       'Madrid_rain_1h', 'Madrid_temp', 'Madrid_temp_max', 'Madrid_temp_min',
       'Madrid_weather_id', 'Madrid_wind_speed', 'Seville_clouds_all',
       'Seville_humidity', 'Seville_pressure', 'Seville_rain_1h',
       'Seville_rain_3h', 'Seville_temp', 'Seville_temp_max',
       'Seville_temp_min', 'Seville_weather_id', 'Seville_wind_speed',
       'Valencia_humidity', 'Valencia_pressure', 'Valencia_snow_3h',
       'Valencia_temp', 'Valencia_temp_max', 'Valencia_temp_min',
       'Valencia_wind_d

In [128]:
X = train_df.drop(['load_shortfall_3h', 
                   'Valencia_temp_max',
                   'Valencia_temp_min',
                   'Seville_temp_max',
                   'Seville_temp_min',
                   'Madrid_temp_max', 
                   'Madrid_temp_min',
                   'Bilbao_temp_max',
                   'Bilbao_temp_min',
                   'Barcelona_temp_max',
                   'Barcelona_temp_min',
                   "Valencia_wind_deg", 
                   "Seville_pressure"
                  ], 
                  axis=1)

y = train_df['load_shortfall_3h']

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the thee hour load shortfall. |

---

In [129]:
# split data

In [130]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Save the time column for use later
X_test_time = X_test[['time']]
X_train_time = X_train[['time']]


# remove the time column from the split
X_train = X_train.drop(['time'], axis=1)
X_test = X_test.drop(['time'], axis=1)

In [131]:
# Standardize the model to place all columns in the same scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

In [132]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

In [133]:
# create targets and features dataset

In [134]:
# create one or more ML models

In [135]:
lm = LinearRegression()
lm.fit(X_train, y_train)

LinearRegression()

In [136]:
ridge = Ridge()
ridge.fit(X_train, y_train)

Ridge()

In [137]:
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)

Lasso(alpha=0.01)

In [138]:
# evaluate one or more ML models

In [139]:
pd.DataFrame(lm.coef_, X_train.columns, columns=['Coefficient'])

Unnamed: 0,Coefficient
0,-534.097437
1,-153.327211
2,-163.341264
3,-1112.730184
4,-16.872456
5,-222.365309
6,-42.527122
7,69.818511
8,-251.118764
9,-255.240739


<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [140]:
# Compare model performance

In [141]:
y_predict = lasso.predict(X_test)
y_predict

array([12240.0091214 ,  7284.37107341, 11623.74240663, ...,
        8729.80743436, 11489.19501948, 13062.77024357])

In [142]:
X_test_time['load_shortfall_3h'] = y_test
test_base_df = X_test_time
X_test_time['predicted_load_shortfall_3h'] = y_predict
X_test_time

Unnamed: 0,time,load_shortfall_3h,predicted_load_shortfall_3h
7441,2017-07-19 18:00:00,18097.666667,12240.009121
6355,2017-03-06 00:00:00,12578.000000,7284.371073
1271,2015-06-09 12:00:00,8510.666667,11623.742407
3511,2016-03-15 12:00:00,16473.666667,11070.637366
1821,2015-08-17 06:00:00,8849.666667,8683.166306
...,...,...,...
5430,2016-11-10 09:00:00,12741.666667,8402.502012
1748,2015-08-08 03:00:00,10809.333333,12571.763286
3258,2016-02-12 21:00:00,21262.000000,8729.807434
3801,2016-04-20 18:00:00,9238.333333,11489.195019


In [143]:
# Dictionary of results
import math
results_dict = {'train_model_mse':
                    {
                        "Linear model": math.sqrt(metrics.mean_squared_error(y_train, lm.predict(X_train))),
                        "Ridge": math.sqrt(metrics.mean_squared_error(y_train, ridge.predict(X_train))),
                        "LASSO": math.sqrt(metrics.mean_squared_error(y_train, lasso.predict(X_train)))
                    },
                'test_model_mse':
                    {
                        "Linear model": math.sqrt(metrics.mean_squared_error(y_test, lm.predict(X_test))),
                        "Ridge": math.sqrt(metrics.mean_squared_error(y_test, ridge.predict(X_test))),
                        "LASSO": math.sqrt(metrics.mean_squared_error(y_test, lasso.predict(X_test)))
                    }
                }

In [144]:
# Create dataframe from dictionary
results_df = pd.DataFrame(data=results_dict)
results_df.sort_values("test_model_mse")

# show model from least accurate to most accurate

Unnamed: 0,train_model_mse,test_model_mse
Linear model,4907.81816,4840.349933
LASSO,4907.818161,4840.3502
Ridge,4907.818277,4840.392074


In [149]:
def gen_submission_csv(model, test_df):
    test_df = test_df.fillna(test_df.mean())
    X_test = test_df.drop([ 'time',
                           'Valencia_temp_max',
                           'Valencia_temp_min',
                           'Seville_temp_max',
                           'Seville_temp_min',
                           'Madrid_temp_max', 
                           'Madrid_temp_min',
                           'Bilbao_temp_max',
                           'Bilbao_temp_min',
                           'Barcelona_temp_max',
                           'Barcelona_temp_min',
                           "Valencia_wind_deg",
                           "Seville_pressure"
                          ], 
                          axis=1)
    kaggle = test_df[['time']]
    X_test = scaler.fit_transform(X_test)
    
    kaggle['load_shortfall_3h'] = model.predict(X_test)
    
    if os.path.exists('kaggle.csv'):
        os.remove('kaggle.csv')
        
    kaggle.to_csv('kaggle.csv', index=False, encoding='utf-8')
    return kaggle

In [150]:
gen_submission_csv(lasso, test_df)

  test_df = test_df.fillna(test_df.mean())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  kaggle['load_shortfall_3h'] = model.predict(X_test)


Unnamed: 0,time,load_shortfall_3h
8763,2018-01-01 00:00:00,8777.386281
8764,2018-01-01 03:00:00,8154.708152
8765,2018-01-01 06:00:00,8686.864804
8766,2018-01-01 09:00:00,9401.933050
8767,2018-01-01 12:00:00,8271.431384
...,...,...
11678,2018-12-31 09:00:00,8891.287689
11679,2018-12-31 12:00:00,11008.679688
11680,2018-12-31 15:00:00,12159.227384
11681,2018-12-31 18:00:00,11908.203765


In [None]:
# Choose best model and motivate why it is the best choice

In [None]:
# I added a new column

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic