# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**YOUR NAME, YOUR SURNAME**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable sources and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three-hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner workings of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data and Cleaning</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

<a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [1]:
# Libraries for data loading, data manipulation and data visualisation
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
import numpy as np

# Libraries for data preparation and model building
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn import metrics 
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Ridge

from statsmodels.graphics.correlation import plot_corr
from statsmodels.formula.api import ols

from xgboost import XGBRegressor

# Global constant listing the names of the main objects used throughout the notebook
PARAMETER_CONSTANT = ['full_data', 'df_train', 'df_test', 'X', 'y', 'X_train', 'y_train', 'X_test', 'X_valid', 'y_valid']

<a id="two"></a>
## 2. Loading the Data and Cleaning
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [2]:
df_train = pd.read_csv('df_train.csv') # load the training data
df_test = pd.read_csv('df_test.csv')   # load the test data

In [3]:
y = df_train[['load_shortfall_3h']]
X = df_train.drop(['load_shortfall_3h'], axis=1)

In [4]:
# merging the train and test predictors so both can be cleaned in one pass

In [5]:
full_data = pd.concat([X, df_test], axis=0)

##### Overview of the dataset

In [6]:
full_data.head()

|  | Unnamed: 0 | time | Madrid_wind_speed | Valencia_wind_deg | Bilbao_rain_1h | Valencia_wind_speed | Seville_humidity | Madrid_humidity | Bilbao_clouds_all | Bilbao_wind_speed | ... | Barcelona_temp_max | Madrid_temp_max | Barcelona_temp | Bilbao_temp_min | Bilbao_temp | Barcelona_temp_min | Bilbao_temp_max | Seville_temp_min | Madrid_temp | Madrid_temp_min |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2015-01-01 03:00:00 | 0.666667 | level_5 | 0.0 | 0.666667 | 74.333333 | 64.0 | 0.0 | 1.0 | ... | 281.013 | 265.938 | 281.013 | 269.338615 | 269.338615 | 281.013 | 269.338615 | 274.254667 | 265.938 | 265.938 |
| 1 | 1 | 2015-01-01 06:00:00 | 0.333333 | level_10 | 0.0 | 1.666667 | 78.333333 | 64.666667 | 0.0 | 1.0 | ... | 280.561667 | 266.386667 | 280.561667 | 270.376 | 270.376 | 280.561667 | 270.376 | 274.945 | 266.386667 | 266.386667 |
| 2 | 2 | 2015-01-01 09:00:00 | 1.0 | level_9 | 0.0 | 1.0 | 71.333333 | 64.333333 | 0.0 | 1.0 | ... | 281.583667 | 272.708667 | 281.583667 | 275.027229 | 275.027229 | 281.583667 | 275.027229 | 278.792 | 272.708667 | 272.708667 |
| 3 | 3 | 2015-01-01 12:00:00 | 1.0 | level_8 | 0.0 | 1.0 | 65.333333 | 56.333333 | 0.0 | 1.0 | ... | 283.434104 | 281.895219 | 283.434104 | 281.135063 | 281.135063 | 283.434104 | 281.135063 | 285.394 | 281.895219 | 281.895219 |
| 4 | 4 | 2015-01-01 15:00:00 | 1.0 | level_7 | 0.0 | 1.0 | 59.0 | 57.0 | 2.0 | 0.333333 | ... | 284.213167 | 280.678437 | 284.213167 | 282.252063 | 282.252063 | 284.213167 | 282.252063 | 285.513719 | 280.678437 | 280.678437 |


In [7]:
full_data.shape

(11683, 48)

In [8]:
y.head()

|  | load_shortfall_3h |
| --- | --- |
| 0 | 6715.666667 |
| 1 | 4171.666667 |
| 2 | 4274.666667 |
| 3 | 5075.666667 |
| 4 | 6620.666667 |


In [9]:
# checking for non-numeric variables,
# since machine learning models don't work well with object datatypes
print('Columns that need to be dropped or converted into numeric:', [x for x in full_data.select_dtypes('object')])

Columns that need to be dropped or converted into numeric: ['time', 'Valencia_wind_deg', 'Seville_pressure']


In [10]:
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 1 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   load_shortfall_3h  8763 non-null   float64
dtypes: float64(1)
memory usage: 68.6 KB


In [11]:
# understanding the shape of our train dataset (predictors)
print(f'the train set contains {X.shape[0]} data points (rows) and {X.shape[1]} variables (columns)')

the train set contains 8763 data points (rows) and 48 variables (columns)


###### Checking for null value(s) and replacing appropriately

In [12]:
# checking for null values in the target
y.isnull().sum()[y.isnull().sum() > 0]

Series([], dtype: int64)

In [13]:
# checking for null values in the combined train and test predictors
full_data.isnull().sum()[full_data.isnull().sum() > 0]

Valencia_pressure    2522
dtype: int64

In [14]:
# checking the mean, mode and median of the variable (missing values are skipped)
stat = [np.mean(full_data.Valencia_pressure), full_data.Valencia_pressure.mode()[0], full_data.Valencia_pressure.median()]
stat = np.around(stat, 1)
print(f'Mean: {stat[0]}, Mode: {stat[1]}, Median: {stat[2]}')

Mean: 1012.3, Mode: 1018.0, Median: 1015.0


In [15]:
# Replacing the missing values with the median (1015)
full_data['Valencia_pressure'] = full_data['Valencia_pressure'].fillna(1015)
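
The fill value above is hard-coded and was computed over the combined train and test rows. As a minimal sketch of an alternative, assuming we would rather learn the median from the training rows only, `SimpleImputer` (already imported in section 1) can be fitted on one set and applied to the other. The small `train` and `test` frames below are made-up stand-ins, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-ins for the real train/test predictors
train = pd.DataFrame({'Valencia_pressure': [1010.0, np.nan, 1015.0, 1020.0]})
test = pd.DataFrame({'Valencia_pressure': [np.nan, 1012.0]})

# Learn the median from the training rows only, then apply it to both sets
imputer = SimpleImputer(strategy='median')
train_filled = imputer.fit_transform(train[['Valencia_pressure']])
test_filled = imputer.transform(test[['Valencia_pressure']])

train['Valencia_pressure'] = train_filled.ravel()
test['Valencia_pressure'] = test_filled.ravel()
```

Fitting on the training rows alone keeps information from the test set out of the preprocessing step, at the cost of a slightly different fill value.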

In [16]:
#checking AGAIN for 'null' in the full_data
full_data.isnull().sum()[full_data.isnull().sum() > 0]

Series([], dtype: int64)

#### Modifying the Object type columns
##### Converting the time column into features

In [17]:
# Checking the data point scope
full_data[['time']].sample(5)

|  | time |
| --- | --- |
| 7877 | 2017-09-12 06:00:00 |
| 705 | 2015-03-30 18:00:00 |
| 5268 | 2016-10-21 03:00:00 |
| 7383 | 2017-07-12 12:00:00 |
| 829 | 2015-04-15 06:00:00 |


In [18]:
# creating a function that can split the time column 
def convert_time(row):
    date, time = row.split(' ')
    year, month, day = date.split('-')
    hour = time.split(':')[0]
    return year, month, day, hour # we can also return a pd.Series([...]) and not use a zip function later on

In [19]:
# splitting the time column into features
full_data['year'], full_data['month'], full_data['day'], full_data['hour']\
                            = zip(*full_data['time'].map(convert_time)) 

In [20]:
# we need to convert the new features to numeric and drop the old time column
cols = ['year', 'month', 'day', 'hour']
full_data[cols] = full_data[cols].apply(pd.to_numeric, errors='coerce')  # column-wise is much faster than axis=1
full_data.drop('time', axis=1, inplace=True)
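
The string-splitting approach above works, but the same features can also be read from a parsed datetime. The sketch below is one possible alternative, not the method used in this notebook; the `sample` frame is a hypothetical two-row stand-in for `full_data`, and `dayofweek` is an extra feature added purely for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the format of the `time` column
sample = pd.DataFrame({'time': ['2015-01-01 03:00:00', '2017-09-12 06:00:00']})

# Parse once, then read the parts through the .dt accessor
parsed = pd.to_datetime(sample['time'], format='%Y-%m-%d %H:%M:%S')
sample['year'] = parsed.dt.year
sample['month'] = parsed.dt.month
sample['day'] = parsed.dt.day
sample['hour'] = parsed.dt.hour
sample['dayofweek'] = parsed.dt.dayofweek  # Monday=0, Sunday=6
sample = sample.drop(columns='time')
```

Parsing first means the new columns are already numeric, so no separate `pd.to_numeric` step is needed.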

In [21]:
full_data.head()

|  | Unnamed: 0 | Madrid_wind_speed | Valencia_wind_deg | Bilbao_rain_1h | Valencia_wind_speed | Seville_humidity | Madrid_humidity | Bilbao_clouds_all | Bilbao_wind_speed | Seville_clouds_all | ... | Bilbao_temp | Barcelona_temp_min | Bilbao_temp_max | Seville_temp_min | Madrid_temp | Madrid_temp_min | year | month | day | hour |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0.666667 | level_5 | 0.0 | 0.666667 | 74.333333 | 64.0 | 0.0 | 1.0 | 0.0 | ... | 269.338615 | 281.013 | 269.338615 | 274.254667 | 265.938 | 265.938 | 2015 | 1 | 1 | 3 |
| 1 | 1 | 0.333333 | level_10 | 0.0 | 1.666667 | 78.333333 | 64.666667 | 0.0 | 1.0 | 0.0 | ... | 270.376 | 280.561667 | 270.376 | 274.945 | 266.386667 | 266.386667 | 2015 | 1 | 1 | 6 |
| 2 | 2 | 1.0 | level_9 | 0.0 | 1.0 | 71.333333 | 64.333333 | 0.0 | 1.0 | 0.0 | ... | 275.027229 | 281.583667 | 275.027229 | 278.792 | 272.708667 | 272.708667 | 2015 | 1 | 1 | 9 |
| 3 | 3 | 1.0 | level_8 | 0.0 | 1.0 | 65.333333 | 56.333333 | 0.0 | 1.0 | 0.0 | ... | 281.135063 | 283.434104 | 281.135063 | 285.394 | 281.895219 | 281.895219 | 2015 | 1 | 1 | 12 |
| 4 | 4 | 1.0 | level_7 | 0.0 | 1.0 | 59.0 | 57.0 | 2.0 | 0.333333 | 0.0 | ... | 282.252063 | 284.213167 | 282.252063 | 285.513719 | 280.678437 | 280.678437 | 2015 | 1 | 1 | 15 |


###### Convert the Valencia_wind_deg into numeric

In [22]:
# Checking the data point scope
full_data.Valencia_wind_deg.sample(5)

374     level_5
8669    level_6
6314    level_8
162     level_9
7263    level_2
Name: Valencia_wind_deg, dtype: object

This is easy to fix: the values look like numeric levels stored as text (for example `level_5`), so we extract the digits and cast them to integers.

In [23]:
# raw string avoids an invalid-escape warning; expand=False returns a Series
full_data['Valencia_wind_deg'] = full_data['Valencia_wind_deg'].str.extract(r'(\d+)', expand=False).astype('int64')

###### Convert the Seville_pressure into numeric

In [24]:
# Checking the data point scope
full_data.Seville_pressure.sample(5)

5893    sp22
6752     sp5
2222     sp4
421      sp1
8732    sp14
Name: Seville_pressure, dtype: object

`Seville_pressure` follows the same pattern (for example `sp22`), so the same conversion applies.

In [25]:
# raw string avoids an invalid-escape warning; expand=False returns a Series
full_data['Seville_pressure'] = full_data['Seville_pressure'].str.extract(r'(\d+)', expand=False).astype('int64')
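
Because both object columns are handled the same way, the two conversions can be written as one loop. This is a small sketch on a hypothetical `demo` frame whose values copy the style seen in the samples above; it assumes every value contains at least one digit, otherwise the cast to `int64` would fail on the resulting missing value.

```python
import pandas as pd

# Hypothetical values in the same style as the two object columns
demo = pd.DataFrame({
    'Valencia_wind_deg': ['level_5', 'level_10', 'level_9'],
    'Seville_pressure': ['sp22', 'sp5', 'sp14'],
})

for col in ['Valencia_wind_deg', 'Seville_pressure']:
    # Capture the first run of digits and cast it to an integer
    demo[col] = demo[col].str.extract(r'(\d+)', expand=False).astype('int64')
```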

In [26]:
# finally, before splitting the dataset, we need to remove the redundant column 'Unnamed: 0'
full_data = full_data.drop(['Unnamed: 0'], axis=1)

In [27]:
# Splitting the dataset back into train and test after general cleaning of the data set
X = full_data.iloc[:len(y)]
X_test = full_data.iloc[len(y):]

In [28]:
X.shape

(8763, 50)

In [29]:
X_test.shape

(2920, 50)

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


In [None]:
# look at predictors data statistics
X.describe().T

In [None]:
# look at target data statistics
y.describe().T

In [None]:
# plot relevant feature interactions

In [None]:
# checking for skewness and outliers (kurtosis)
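
One way to fill in this step is to tabulate skewness and kurtosis per column. The sketch below runs on synthetic data (the column names `symmetric` and `right_skewed` are invented), and the `|skew| > 1` cut-off is only an illustrative rule of thumb, not a threshold taken from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: one roughly symmetric feature, one right-skewed feature
demo = pd.DataFrame({
    'symmetric': rng.normal(size=5000),
    'right_skewed': rng.exponential(size=5000),
})

summary = pd.DataFrame({
    'skew': demo.skew(),
    'kurtosis': demo.kurtosis(),  # excess kurtosis: about 0 for a normal distribution
})
# Flag features outside the illustrative |skew| > 1 rule of thumb
summary['high_skew'] = summary['skew'].abs() > 1
```

On the real data the same two calls would be `X.skew()` and `X.kurtosis()`; large positive kurtosis hints at heavy tails and possible outliers.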

In [None]:
# evaluate correlation
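
A possible way to evaluate correlation is to compute the Pearson coefficient and its p-value between each feature and the target, using `pearsonr` as imported in section 1. The data and the column names `related` and `unrelated` below are synthetic assumptions made for the sake of a runnable sketch.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n = 500
# Synthetic features: one drives the target, the other is pure noise
features = pd.DataFrame({
    'related': rng.normal(size=n),
    'unrelated': rng.normal(size=n),
})
target = 3 * features['related'] + rng.normal(scale=0.5, size=n)

rows = []
for col in features.columns:
    r, p_value = pearsonr(features[col], target)
    rows.append({'feature': col, 'pearson_r': r, 'p_value': p_value})

# Strongest absolute correlation first
corr_table = (pd.DataFrame(rows)
              .set_index('feature')
              .sort_values('pearson_r', key=abs, ascending=False))
```

Pearson correlation only captures linear association, so a low value does not rule out a non-linear relationship with the shortfall.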

In [None]:
# have a look at feature distributions

In [None]:
# checking for linearity

In [None]:
# checking for Multicollinearity
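
The `head()` output in section 2 shows temperature columns such as `Madrid_temp`, `Madrid_temp_max` and `Madrid_temp_min` holding identical values in the first rows, which suggests multicollinearity is worth checking. A minimal sketch, assuming the variance inflation factor (VIF) is an acceptable diagnostic: for standardised features, the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix. The data below is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000
temp = rng.normal(size=n)
# Synthetic features: temp_max is nearly a copy of temp, humidity is independent
demo = pd.DataFrame({
    'temp': temp,
    'temp_max': temp + rng.normal(scale=0.1, size=n),
    'humidity': rng.normal(size=n),
})

# VIF_j is the j-th diagonal entry of the inverse correlation matrix
corr = demo.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=demo.columns, name='VIF')
```

Values far above 1 (a common rule of thumb uses 5 or 10) indicate that a feature is largely explained by the others. Exactly duplicated columns make the correlation matrix singular, so they should be dropped before inverting.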

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

#### Variable selection through different methods

In [None]:
def time_convert():
    pass

#### Convert the other object variable into numeric

In [None]:
# full_data.Seville_pressure.sample(5)

In [None]:
# full_data.Valencia_wind_deg.sample(5)

#### checking for null and modifying

In [None]:
# full_data.isnull().sum()

In [None]:
# Correlation and significance threshold

In [None]:
# create new features and drop less useful features
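
As one illustration of a new feature, the `hour` and `month` columns created in section 2 are cyclic: hour 21 is closer to hour 0 than to hour 12, which a plain integer does not express. The sketch below shows sine/cosine encoding on a hypothetical `demo` frame. This is a suggested technique, not something the notebook already does, and the new column names are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the same `hour` and `month` column names as section 2
demo = pd.DataFrame({'hour': [0, 3, 12, 21], 'month': [1, 4, 7, 12]})

# Map each cyclic feature onto a circle so 21:00 and 00:00 end up close together
demo['hour_sin'] = np.sin(2 * np.pi * demo['hour'] / 24)
demo['hour_cos'] = np.cos(2 * np.pi * demo['hour'] / 24)
# Months run from 1 to 12, so shift to 0-based before scaling
demo['month_sin'] = np.sin(2 * np.pi * (demo['month'] - 1) / 12)
demo['month_cos'] = np.cos(2 * np.pi * (demo['month'] - 1) / 12)
```

Linear models tend to benefit most from this encoding. Tree-based models can split on the raw integers, so the gain there may be small.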

In [None]:
# engineer existing features

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the three-hour load shortfall. |

---

#### Splitting the full dataset

In [None]:
# Splitting the dataset back into train and test after general cleaning of the data set

In [None]:
# Divide train dataset further into train and validation subsets
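
A minimal sketch of this step with `train_test_split`, on synthetic stand-ins for `X` and `y`. Because the rows are ordered in time, the sketch assumes an ordered split (`shuffle=False`) is preferable so that validation rows come after training rows. The 80/20 ratio is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the cleaned predictors X and target y
X_demo = pd.DataFrame({'feature': np.arange(100)})
y_demo = pd.Series(np.arange(100) * 2.0, name='load_shortfall_3h')

# shuffle=False keeps time order: the last 20% of rows become the validation set
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)
```

A shuffled split with a fixed `random_state` is also common, but it lets the model see observations from later dates than some validation rows.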

In [None]:
# create targets and features dataset

In [None]:
# create one or more ML models

In [None]:
# evaluate one or more ML models
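
The two steps above could look roughly like the following: fit several candidate regressors and score each on held-out rows with RMSE and R². The data is synthetic, the feature names `f1` to `f3` are placeholders, and the hyperparameters are untuned defaults chosen only to keep the sketch quick.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
# Synthetic data with a linear signal in f1 and f2; f3 is noise
X_demo = pd.DataFrame(rng.normal(size=(400, 3)), columns=['f1', 'f2', 'f3'])
y_demo = 5 * X_demo['f1'] - 2 * X_demo['f2'] + rng.normal(scale=0.5, size=400)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    'linear_regression': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)
    rmse = np.sqrt(mean_squared_error(y_va, pred))
    scores[name] = {'rmse': rmse, 'r2': r2_score(y_va, pred)}

# One row per model, best (lowest RMSE) first
results = pd.DataFrame(scores).T.sort_values('rmse')
```

On this deliberately linear toy data the linear model should come out ahead. The ranking on the real shortfall data may well differ.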

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss the chosen method's logic
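
If the chosen model turns out to be a tree ensemble (an assumption, since no model has been selected yet), its impurity-based feature importances give a simple starting point for explaining which inputs drive the predictions. The sketch uses synthetic data with invented feature names `strong`, `weak` and `noise`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
# Synthetic data: the target depends mostly on `strong`, slightly on `weak`
X_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=['strong', 'weak', 'noise'])
y_demo = 4 * X_demo['strong'] + 0.5 * X_demo['weak'] + rng.normal(scale=0.3, size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; larger values mean the feature
# contributed more to reducing prediction error across the trees
importances = (pd.Series(model.feature_importances_, index=X_demo.columns)
               .sort_values(ascending=False))
```

For a non-technical audience, a bar chart of `importances` is usually easier to read than the raw numbers. Note that impurity-based importances can be split across strongly correlated features.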