# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**YOUR NAME, YOUR SURNAME**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable sources and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner working of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>



For this study, we will analyse the Spain Electricity Shortfall dataset. The methodology for this project includes, but is not limited to, exploratory data analysis and model prediction. To perform these tasks, the following libraries were loaded:

+ For data manipulation and analysis, Pandas and Numpy.
+ For data visualization, Matplotlib and Seaborn.
+ For data preparation, model building and evaluation, Scipy and Sklearn.


**The importation of these libraries can be seen below:**

In [2]:
# Libraries for data loading, manipulation and analysis
import numpy as np
import pandas as pd

# Libraries for data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from statsmodels.graphics.correlation import plot_corr

# Libraries for Data Preparation
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Libraries for Model Building
# Example of models that could be used
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

# Libraries for Model Evaluation
from sklearn.metrics import mean_squared_error

# Libraries for handling warnings
import warnings
warnings.filterwarnings('ignore')

# Libraries for Saving Model
import pickle

# Setting global constants to ensure notebook results are reproducible
#PARAMETER_CONSTANT = ###

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>


The data used for this project is located in the `df_train.csv` file. To make it easier to manipulate and analyse, the file was loaded into a Pandas DataFrame using the `pd.read_csv()` function and referred to as `df`. Passing `index_col=False` tells Pandas not to use the first column of the file as the index.


In [7]:
# Loading of the data
df = pd.read_csv("df_train.csv", index_col=False)

To set the maximum number of columns to be displayed, the `pd.set_option()` function was put in place.

In [8]:
# displays unlimited number of columns
pd.set_option("display.max_columns", None)

To prevent unintended changes to the original data, a copy of the dataframe was made with `df.copy()` and referred to as `df_copy`.

In [9]:
# The copy of the dataframe
df_copy = df.copy()

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

Exploratory Data Analysis (EDA) investigates and summarizes the dataset's main characteristics using data visualization methods and statistical analyses.
It gives a better understanding of the variables and the relationships between them.
 

### 3.1 Displaying the Data

The function `.head()` was used to view the first few rows of the dataset.

In [6]:
# Outputs the first 5 rows of the dataset
df_copy.head()

Unnamed: 0.1,Unnamed: 0,time,Madrid_wind_speed,Valencia_wind_deg,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Madrid_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,Bilbao_wind_deg,Barcelona_wind_speed,Barcelona_wind_deg,Madrid_clouds_all,Seville_wind_speed,Barcelona_rain_1h,Seville_pressure,Seville_rain_1h,Bilbao_snow_3h,Barcelona_pressure,Seville_rain_3h,Madrid_rain_1h,Barcelona_rain_3h,Valencia_snow_3h,Madrid_weather_id,Barcelona_weather_id,Bilbao_pressure,Seville_weather_id,Valencia_pressure,Seville_temp_max,Madrid_pressure,Valencia_temp_max,Valencia_temp,Bilbao_weather_id,Seville_temp,Valencia_humidity,Valencia_temp_min,Barcelona_temp_max,Madrid_temp_max,Barcelona_temp,Bilbao_temp_min,Bilbao_temp,Barcelona_temp_min,Bilbao_temp_max,Seville_temp_min,Madrid_temp,Madrid_temp_min,load_shortfall_3h
0,0,2015-01-01 03:00:00,0.666667,level_5,0.0,0.666667,74.333333,64.0,0.0,1.0,0.0,223.333333,6.333333,42.666667,0.0,3.333333,0.0,sp25,0.0,0.0,1036.333333,0.0,0.0,0.0,0.0,800.0,800.0,1035.0,800.0,1002.666667,274.254667,971.333333,269.888,269.888,800.0,274.254667,75.666667,269.888,281.013,265.938,281.013,269.338615,269.338615,281.013,269.338615,274.254667,265.938,265.938,6715.666667
1,1,2015-01-01 06:00:00,0.333333,level_10,0.0,1.666667,78.333333,64.666667,0.0,1.0,0.0,221.0,4.0,139.0,0.0,3.333333,0.0,sp25,0.0,0.0,1037.333333,0.0,0.0,0.0,0.0,800.0,800.0,1035.666667,800.0,1004.333333,274.945,972.666667,271.728333,271.728333,800.0,274.945,71.0,271.728333,280.561667,266.386667,280.561667,270.376,270.376,280.561667,270.376,274.945,266.386667,266.386667,4171.666667
2,2,2015-01-01 09:00:00,1.0,level_9,0.0,1.0,71.333333,64.333333,0.0,1.0,0.0,214.333333,2.0,326.0,0.0,2.666667,0.0,sp25,0.0,0.0,1038.0,0.0,0.0,0.0,0.0,800.0,800.0,1036.0,800.0,1005.333333,278.792,974.0,278.008667,278.008667,800.0,278.792,65.666667,278.008667,281.583667,272.708667,281.583667,275.027229,275.027229,281.583667,275.027229,278.792,272.708667,272.708667,4274.666667
3,3,2015-01-01 12:00:00,1.0,level_8,0.0,1.0,65.333333,56.333333,0.0,1.0,0.0,199.666667,2.333333,273.0,0.0,4.0,0.0,sp25,0.0,0.0,1037.0,0.0,0.0,0.0,0.0,800.0,800.0,1036.0,800.0,1009.0,285.394,994.666667,284.899552,284.899552,800.0,285.394,54.0,284.899552,283.434104,281.895219,283.434104,281.135063,281.135063,283.434104,281.135063,285.394,281.895219,281.895219,5075.666667
4,4,2015-01-01 15:00:00,1.0,level_7,0.0,1.0,59.0,57.0,2.0,0.333333,0.0,185.0,4.333333,260.0,0.0,3.0,0.0,sp25,0.0,0.0,1035.0,0.0,0.0,0.0,0.0,800.0,800.0,1035.333333,800.0,,285.513719,1035.333333,283.015115,283.015115,800.0,285.513719,58.333333,283.015115,284.213167,280.678437,284.213167,282.252063,282.252063,284.213167,282.252063,285.513719,280.678437,280.678437,6620.666667


***


+ **Results** :


***

### 3.2 Data Analysis

The `.shape` attribute returns the number of rows and the number of columns in the dataset.

In [19]:
# Displays the number of rows and columns
df_copy.shape

(8763, 49)

***


+ **Results** : The dataset consists of 8763 rows and 49 columns.


***

The `.info()` method outputs important details about the dataset, including the column names, the data types **(Dtype)** of the columns and the count of non-null values.

In [21]:
# Displays information of the Dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 49 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            8763 non-null   int64  
 1   time                  8763 non-null   object 
 2   Madrid_wind_speed     8763 non-null   float64
 3   Valencia_wind_deg     8763 non-null   object 
 4   Bilbao_rain_1h        8763 non-null   float64
 5   Valencia_wind_speed   8763 non-null   float64
 6   Seville_humidity      8763 non-null   float64
 7   Madrid_humidity       8763 non-null   float64
 8   Bilbao_clouds_all     8763 non-null   float64
 9   Bilbao_wind_speed     8763 non-null   float64
 10  Seville_clouds_all    8763 non-null   float64
 11  Bilbao_wind_deg       8763 non-null   float64
 12  Barcelona_wind_speed  8763 non-null   float64
 13  Barcelona_wind_deg    8763 non-null   float64
 14  Madrid_clouds_all     8763 non-null   float64
 15  Seville_wind_speed   

***


+ **Results** : 


***

### 3.3 Missing Values

Identifying the missing values in the dataset is vital for accurately investigating the relationships between the variables. If missing values are not handled correctly, they can:
+ reduce the power/fit of the model;
+ bias the model.

In [24]:
# Outputs the number of missing values
missing_values = pd.DataFrame()  # holds the per-column missing-value summaries
missing_values['missing_values'] = df_copy.isnull().sum()

missing_values['missing_values']

Unnamed: 0                 0
time                       0
Madrid_wind_speed          0
Valencia_wind_deg          0
Bilbao_rain_1h             0
Valencia_wind_speed        0
Seville_humidity           0
Madrid_humidity            0
Bilbao_clouds_all          0
Bilbao_wind_speed          0
Seville_clouds_all         0
Bilbao_wind_deg            0
Barcelona_wind_speed       0
Barcelona_wind_deg         0
Madrid_clouds_all          0
Seville_wind_speed         0
Barcelona_rain_1h          0
Seville_pressure           0
Seville_rain_1h            0
Bilbao_snow_3h             0
Barcelona_pressure         0
Seville_rain_3h            0
Madrid_rain_1h             0
Barcelona_rain_3h          0
Valencia_snow_3h           0
Madrid_weather_id          0
Barcelona_weather_id       0
Bilbao_pressure            0
Seville_weather_id         0
Valencia_pressure       2068
Seville_temp_max           0
Madrid_pressure            0
Valencia_temp_max          0
Valencia_temp              0
Bilbao_weather

In [30]:
# Displays the percentage of missing values within the dataset
missing_values['missing_values_percent'] = round(df_copy.isnull().sum()/len(df_copy)*100)

missing_values['missing_values_percent']

Unnamed: 0               0.0
time                     0.0
Madrid_wind_speed        0.0
Valencia_wind_deg        0.0
Bilbao_rain_1h           0.0
Valencia_wind_speed      0.0
Seville_humidity         0.0
Madrid_humidity          0.0
Bilbao_clouds_all        0.0
Bilbao_wind_speed        0.0
Seville_clouds_all       0.0
Bilbao_wind_deg          0.0
Barcelona_wind_speed     0.0
Barcelona_wind_deg       0.0
Madrid_clouds_all        0.0
Seville_wind_speed       0.0
Barcelona_rain_1h        0.0
Seville_pressure         0.0
Seville_rain_1h          0.0
Bilbao_snow_3h           0.0
Barcelona_pressure       0.0
Seville_rain_3h          0.0
Madrid_rain_1h           0.0
Barcelona_rain_3h        0.0
Valencia_snow_3h         0.0
Madrid_weather_id        0.0
Barcelona_weather_id     0.0
Bilbao_pressure          0.0
Seville_weather_id       0.0
Valencia_pressure       24.0
Seville_temp_max         0.0
Madrid_pressure          0.0
Valencia_temp_max        0.0
Valencia_temp            0.0
Bilbao_weather

***


+ **Results** :


***

### 3.4 Descriptive Statistics

Statistical analysis is a crucial part of EDA. The `.describe()` method summarizes the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%) and maximum of the numeric variables in our dataframe.

In [32]:
# Displays the summary statistics
df_copy_transposed = df_copy.describe().T

df_copy_transposed

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,8763.0,4381.0,2529.804538,0.0,2190.5,4381.0,6571.5,8762.0
Madrid_wind_speed,8763.0,2.425729,1.850371,0.0,1.0,2.0,3.333333,13.0
Bilbao_rain_1h,8763.0,0.135753,0.374901,0.0,0.0,0.0,0.1,3.0
Valencia_wind_speed,8763.0,2.586272,2.41119,0.0,1.0,1.666667,3.666667,52.0
Seville_humidity,8763.0,62.658793,22.621226,8.333333,44.333333,65.666667,82.0,100.0
Madrid_humidity,8763.0,57.414717,24.335396,6.333333,36.333333,58.0,78.666667,100.0
Bilbao_clouds_all,8763.0,43.469132,32.551044,0.0,10.0,45.0,75.0,100.0
Bilbao_wind_speed,8763.0,1.850356,1.695888,0.0,0.666667,1.0,2.666667,12.66667
Seville_clouds_all,8763.0,13.714748,24.272482,0.0,0.0,0.0,20.0,97.33333
Bilbao_wind_deg,8763.0,158.957511,102.056299,0.0,73.333333,147.0,234.0,359.3333


***


+ **Results** :


***

### 3.5 Kurtosis and Skewness

To give additional information about the distribution of our dataset, we looked at the skewness (`.skew()`) and the kurtosis (`.kurtosis()`) of each numeric column. Skewness measures the asymmetry of a probability distribution, whereas kurtosis describes how light or heavy the tails are (a sign of outliers) compared to a normal distribution.
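
As a small illustration of how these two statistics behave, the sketch below compares a symmetric series with a right-tailed one. The numbers are made up for illustration and are not taken from the dataset. Note that Pandas reports excess kurtosis, so a normal distribution scores 0.

```python
import pandas as pd

# Made-up series, for illustration only (not values from df_train.csv)
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = pd.Series([1.0, 1.0, 1.0, 2.0, 10.0])

# Skewness: close to 0 for symmetric data, positive when the long tail is on the right
sym_skew = symmetric.skew()
right_skew = right_tailed.skew()

# Excess kurtosis: negative for flat, light-tailed data, positive for heavy tails
sym_kurt = symmetric.kurtosis()
right_kurt = right_tailed.kurtosis()

print(sym_skew, right_skew)
print(sym_kurt, right_kurt)
```

Read this way, large positive skewness and kurtosis values in the outputs that follow suggest columns dominated by a few extreme observations.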

In [47]:
# Skewness of the numeric columns (non-numeric columns such as `time` are skipped)
df_copy.skew(numeric_only=True)

Unnamed: 0               0.000000
Madrid_wind_speed        1.441144
Bilbao_rain_1h           5.222802
Valencia_wind_speed      3.499637
Seville_humidity        -0.310175
Madrid_humidity         -0.057378
Bilbao_clouds_all       -0.053085
Bilbao_wind_speed        1.716914
Seville_clouds_all       1.814452
Bilbao_wind_deg          0.226927
Barcelona_wind_speed     1.057331
Barcelona_wind_deg      -0.180001
Madrid_clouds_all        1.246745
Seville_wind_speed       1.151006
Barcelona_rain_1h        8.726988
Seville_rain_1h          8.067341
Bilbao_snow_3h          26.177568
Barcelona_pressure      57.979664
Seville_rain_3h         19.342574
Madrid_rain_1h           7.074308
Barcelona_rain_3h       12.696605
Valencia_snow_3h        63.298084
Madrid_weather_id       -3.107722
Barcelona_weather_id    -2.584011
Bilbao_pressure         -0.999642
Seville_weather_id      -3.275574
Valencia_pressure       -1.705162
Seville_temp_max        -0.033931
Madrid_pressure         -1.850768
Valencia_temp_

***


+ **Results** :


***

In [48]:
# Kurtosis of the numeric columns (non-numeric columns such as `time` are skipped)
df_copy.kurtosis(numeric_only=True)

Unnamed: 0                -1.200000
Madrid_wind_speed          2.036462
Bilbao_rain_1h            32.904656
Valencia_wind_speed       35.645426
Seville_humidity          -1.017983
Madrid_humidity           -1.167537
Bilbao_clouds_all         -1.533417
Bilbao_wind_speed          3.631565
Seville_clouds_all         2.155921
Bilbao_wind_deg           -1.083530
Barcelona_wind_speed       1.493635
Barcelona_wind_deg        -0.959160
Madrid_clouds_all          0.142079
Seville_wind_speed         1.398580
Barcelona_rain_1h        101.578931
Seville_rain_1h           93.840746
Bilbao_snow_3h           806.128471
Barcelona_pressure      3687.564230
Seville_rain_3h          413.136592
Madrid_rain_1h            76.584491
Barcelona_rain_3h        187.800460
Valencia_snow_3h        4089.323165
Madrid_weather_id          9.259047
Barcelona_weather_id       5.701882
Bilbao_pressure            1.825323
Seville_weather_id        10.710308
Valencia_pressure          2.211823
Seville_temp_max          -0

***


+ **Results** :


***

### 3.6 Data Visualization of relevant feature interactions

#### 3.6.1 Plot 1

***


+ **Results** : 


***

#### 3.6.2 Plot 2

***


+ **Results** : 


***

#### 3.6.3 Plot 3

***


+ **Results** : 


***

#### 3.6.4 Plot 4

***


+ **Results** : 


***

#### 3.6.5 Plot 5

***


+ **Results** : 


***

### 3.7 Correlation 

***


+ **Results** : 


***

In [None]:
# have a look at feature distributions

### 3.8 Drop Columns

In [None]:
# Drop unwanted column
df_copy = df_copy.drop(['Unnamed: 0'], axis=1)

***


+ **Results** : 


***

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>



### 4.1 Dealing with Missing Values 

In [None]:
#remove missing values/ features
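
One possible way to handle the gaps in `Valencia_pressure` is median imputation. The sketch below uses a small made-up frame (`toy` is a hypothetical name, and the values are illustrative) rather than `df_copy`; whether the median is the right choice for this column is an assumption that should be checked against its distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_copy; values are illustrative only
toy = pd.DataFrame({"Valencia_pressure": [1002.0, 1004.0, np.nan, 1009.0, np.nan]})

# The median ignores NaN by default and is less sensitive to outliers than the mean
median_pressure = toy["Valencia_pressure"].median()

# Replace every missing entry with the median of the observed values
toy["Valencia_pressure"] = toy["Valencia_pressure"].fillna(median_pressure)

print(toy)
```

The same two lines could be applied to `df_copy` once the imputation strategy has been agreed on.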

***


+ **Results** : 


***

### 4.2 Creating New Features

In [None]:
# create new features
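
Because the `time` column is stored as text (dtype `object`), one candidate set of new features comes from parsing it into a datetime and extracting calendar parts. The sketch below works on a made-up frame (`toy` and the new column names are hypothetical); which parts actually help the model is left to experimentation.

```python
import pandas as pd

# Hypothetical toy frame using the same timestamp format as the `time` column
toy = pd.DataFrame(
    {"time": ["2015-01-01 03:00:00", "2015-01-01 06:00:00", "2015-01-01 09:00:00"]}
)

# Parse the text timestamps into proper datetime values
toy["time"] = pd.to_datetime(toy["time"])

# Derive calendar features (illustrative column names)
toy["hour"] = toy["time"].dt.hour
toy["day_of_week"] = toy["time"].dt.dayofweek  # Monday = 0
toy["month"] = toy["time"].dt.month

print(toy)
```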

***


+ **Results** : 


***

### 4.3 Engineer existing features
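
The columns `Valencia_wind_deg` and `Seville_pressure` hold text labels such as `level_5` and `sp25`. A minimal sketch of one way to make them numeric, assuming the trailing digits carry the useful information, is to extract those digits. The frame `toy` is hypothetical and only mirrors the label format seen in the first rows of the data.

```python
import pandas as pd

# Hypothetical toy frame mirroring the label formats seen in the dataset
toy = pd.DataFrame(
    {
        "Valencia_wind_deg": ["level_5", "level_10", "level_9"],
        "Seville_pressure": ["sp25", "sp25", "sp25"],
    }
)

# Pull out the run of digits in each label and store it as an integer
for column in ["Valencia_wind_deg", "Seville_pressure"]:
    toy[column] = toy[column].str.extract(r"(\d+)", expand=False).astype(int)

print(toy.dtypes)
```

Treating the labels as ordered integers is a modelling choice; one-hot encoding is an alternative if the ordering turns out not to be meaningful.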

***


+ **Results** : 


***

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>


### 5.1 Splitting the Dataset

***


+ **Results** : 


***

### 5.2 Standardization

***


+ **Results** : 


***

### 5.3 Train-Test Split
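
A minimal sketch of one common ordering is shown below: split first, then fit the scaler on the training rows only so that no information from the test rows leaks into the scaling. The data is randomly generated for illustration, and `shuffle=False` is an assumption that suits time-ordered observations rather than a requirement.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-ins for the feature matrix and target (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(10, 2))
y = rng.normal(size=10)

# Keep the time order: the last 20% of rows become the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```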

***


+ **Results** : 


***

### 5.4 Machine Learning Models

#### 5.4.1 Linear Regression

***


+ **Results** : 


***

#### 5.4.2 Decision Tree

***


+ **Results** : 


***

#### 5.4.3 Ridge Regression

***


+ **Results** : 


***

#### 5.4.4 Lasso Regression

***


+ **Results** : 


***

### 5.5 Evaluation of the Machine Learning Models

***


+ **Results** : 


***

In [None]:
# create targets and features dataset

### 5.6 Test the model on the test dataset

In [None]:
#Loading the dataset
#df_test = pd.read_csv('df_test.csv')

***


+ **Results** : 


***

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

Model performance is compared using the root mean squared error (RMSE); a lower RMSE indicates a better fit.
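
As a worked illustration, the RMSE is the square root of the mean of the squared differences between actual and predicted values. The sketch below uses made-up numbers and a hypothetical helper named `rmse`; the same value can be obtained by taking the square root of `mean_squared_error` from Sklearn.

```python
import numpy as np

# Made-up actual and predicted shortfall values, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])


def rmse(actual, predicted):
    """Return the root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


score = rmse(y_true, y_pred)
print(score)  # every prediction is off by 10, so the RMSE is 10.0
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.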

In [None]:
# Compare model performance

***


+ **Results** : 


***

### 6.1 Chosen Model will be tested on the Test Dataset

In [None]:
# Choose best model and motivate why it is the best choice

***


+ **Results** : 


***

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic

***


+ **Conclusion** : 


***