# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**YOUR NAME, YOUR SURNAME**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable sources and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three-hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision; and
- 7. explain the inner workings of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [1]:
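# Libraries for data loading, data manipulation and data visualisation
import pandas as pd               # for data loading and data manipulation
import seaborn as sns             # for data visualisation
import matplotlib.pyplot as plt   # for data visualisation
from matplotlib import rc

# Libraries for data preparation and model building
import numpy as np                                    # for numerical and scientific computing
from sklearn.linear_model import LinearRegression     # linear regression model
from sklearn.model_selection import train_test_split  # splitting the data
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score  # evaluation metrics
from sklearn.preprocessing import StandardScaler      # feature scaling
from sklearn.feature_selection import SelectKBest, chi2  # feature scoring
from sklearn.svm import SVR                           # support vector regressor
from sklearn.ensemble import RandomForestRegressor    # random forest regressor
from sklearn.tree import DecisionTreeRegressor        # decision tree regressor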

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [2]:
# load the data
df_train = pd.read_csv("https://raw.githubusercontent.com/Explore-AI/load-shortfall-regression-predict-api/master/utils/data/df_train.csv")
df_test = pd.read_csv("https://raw.githubusercontent.com/Explore-AI/load-shortfall-regression-predict-api/master/utils/data/df_test.csv")

In [5]:
print(f'Shape of train df: {df_train.shape}')
print(f'Shape of test df: {df_test.shape}')

Shape of train df: (8763, 49)
Shape of test df: (2920, 48)


<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


In [7]:
df_train.head()  # view the top 5 rows of our train data

In [None]:
df_train.describe() # look at data statistics

In [None]:
df_train.columns  # list all the columns in our training dataset

In [None]:
df_train.info()  # show the dtype and non-null count of each column in our training dataset

In [None]:
df_train.isnull().sum() #Check for null values in our training dataset

In [None]:
# checking the number of null values per column in both the train and test datasets
print(df_train.isnull().sum())
print('\n')
print(df_test.isnull().sum())

In [None]:
df_train.Valencia_pressure.describe()  # exploring the statistics of the feature with null values
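
The null counts above are easier to read when restricted to the affected columns and expressed as a share of rows. A minimal sketch on a toy frame; the values, and the `Madrid_temp` column name, are hypothetical stand-ins for the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train (hypothetical values and column choice)
toy = pd.DataFrame({
    "Valencia_pressure": [1010.0, np.nan, np.nan, 1012.0],
    "Madrid_temp": [280.0, 281.5, 279.0, 283.2],
})

# Count missing values per column and express them as a percentage of rows
missing = toy.isnull().sum()
missing_report = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": 100 * missing / len(toy),
})

# Keep only the columns that actually contain missing values
missing_report = missing_report[missing_report["n_missing"] > 0]
print(missing_report)
```

The same report could be built for `df_train` and `df_test` by swapping in those frames for `toy`.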

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [None]:
# changing the datatype of the time column to datetime
df_train['time'] = pd.to_datetime(df_train['time'])
df_test['time'] = pd.to_datetime(df_test['time'])

In [None]:
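# filling null values with the column mean
# (the result is assigned back rather than using inplace=True on a single column,
# which newer pandas versions warn about)
mean_value_train = df_train['Valencia_pressure'].mean()
df_train['Valencia_pressure'] = df_train['Valencia_pressure'].fillna(mean_value_train)

mean_value_test = df_test['Valencia_pressure'].mean()
df_test['Valencia_pressure'] = df_test['Valencia_pressure'].fillna(mean_value_test)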
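
Note that the cell above fills the test set with its own mean. A common alternative is to learn the fill value from the training data only and reuse it on the test set, so that the test data plays no part in preprocessing. A minimal sketch of that variant, using toy frames (`toy_train` and `toy_test` are hypothetical stand-ins for `df_train` and `df_test`):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_train / df_test (hypothetical values)
toy_train = pd.DataFrame({"Valencia_pressure": [1010.0, np.nan, 1014.0, 1012.0]})
toy_test = pd.DataFrame({"Valencia_pressure": [np.nan, 1008.0]})

# Learn the fill value from the training data only
train_mean = toy_train["Valencia_pressure"].mean()

# Apply the same value to both sets
toy_train["Valencia_pressure"] = toy_train["Valencia_pressure"].fillna(train_mean)
toy_test["Valencia_pressure"] = toy_test["Valencia_pressure"].fillna(train_mean)
```

Whether this matters in practice depends on how different the two distributions are; it is offered as an option, not as the method this notebook uses.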

In [None]:
# check our dataset for null values
df_train.isnull().sum()

In [None]:
# Checking for the presence of outliers with kurtosis (numeric columns only)
df_train.kurtosis(numeric_only=True)

"""
pandas reports excess kurtosis, so a normal distribution scores about 0.
Features with high kurtosis are heavy-tailed, which suggests the presence of outliers.
From the output above, Valencia_wind_speed, Bilbao_rain_1h, Barcelona_rain_1h,
Seville_rain_1h, Madrid_rain_1h and Valencia_snow_3h all appear to contain outliers.
"""
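
Rather than reading the kurtosis values off by eye, the heavy-tailed columns can be flagged programmatically. A minimal sketch on synthetic data; the column names and the cut-off of 3 are assumptions chosen for illustration, not values taken from this notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: one roughly normal column and one with a few extreme values
toy = pd.DataFrame({
    "normal_like": rng.normal(size=1000),
    "spiky": np.concatenate([np.zeros(990), np.full(10, 50.0)]),
})

# pandas returns excess kurtosis (a normal distribution scores about 0)
excess_kurtosis = toy.kurtosis(numeric_only=True)

# Assumed cut-off: treat columns above this value as likely to contain outliers
threshold = 3
flagged = excess_kurtosis[excess_kurtosis > threshold].index.tolist()
print(flagged)
```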

In [None]:
# plotting a histogram of our target variable
sns.histplot(data=df_train, x='load_shortfall_3h',bins=10)
plt.title('Distribution of Load Shortfall 3h');

In [None]:
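# plot heatmap to show correlation between the numeric variables
fig = plt.figure(figsize=(30,25))

# numeric_only=True skips non-numeric columns, which newer pandas versions no longer drop silently
sns.heatmap(df_train.corr(numeric_only=True),annot=True);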

In [None]:
# create new features
# extract year and month values from our time column
df_train['year'] = df_train['time'].dt.year
df_test['year'] = df_test['time'].dt.year

df_train['month'] = df_train['time'].dt.month
df_test['month'] = df_test['time'].dt.month
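
Because the target is a three-hourly shortfall, finer-grained parts of the timestamp may also be worth trying as features. A minimal sketch on a toy frame (the timestamps are hypothetical, and whether `day`, `hour` or `weekday` actually help is something to check during modelling):

```python
import pandas as pd

# Toy stand-in for the time column (hypothetical timestamps)
toy = pd.DataFrame({
    "time": pd.to_datetime(["2015-01-01 03:00:00", "2015-06-15 21:00:00"])
})

# The same .dt accessor used for year and month exposes finer components
toy["year"] = toy["time"].dt.year
toy["month"] = toy["time"].dt.month
toy["day"] = toy["time"].dt.day
toy["hour"] = toy["time"].dt.hour
toy["weekday"] = toy["time"].dt.dayofweek  # Monday=0, Sunday=6
```

Any feature added to the training set needs to be added to the test set in the same way.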

In [None]:
# drop the time column
df_train = df_train.drop('time',axis=1)
df_test = df_test.drop('time',axis=1)
df_train.head(3)

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the three hour load shortfall. |

---

In [None]:
# split the data into features and label
# keep numeric columns only: the scoring function, scaler and models below cannot handle text columns
X = df_train.drop('load_shortfall_3h',axis=1).select_dtypes(include='number')
y = df_train['load_shortfall_3h']

In [None]:
# score each feature against the target
# f_regression is used here in place of chi2, which is designed for
# classification targets and non-negative features
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(f_regression, k=10)
selector.fit(X, y)
featureScores = pd.DataFrame({'Features': X.columns, 'Score': selector.scores_})
top_features = featureScores.sort_values('Score',ascending=False).head(10)
top_features

"""
SelectKBest scores every feature against the target and keeps the k best.
Sorting by score and taking the first ten rows shows the ten most important
features according to the SelectKBest scores.
"""

In [None]:
# split data into train and validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# normalize our numerical data
scaler = StandardScaler()

In [None]:
# fit and transform the data
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

In [None]:
# create the new dataframe with the scaled data
X_train = pd.DataFrame(X_train, columns=X.columns)
X_val = pd.DataFrame(X_val, columns=X.columns)
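
The split, scale and fit steps can also be bundled into a single scikit-learn `Pipeline`, which ensures the scaler is only ever fitted on the training rows. A minimal sketch on synthetic data; `X_toy` and `y_toy` are hypothetical stand-ins for the notebook's `X` and `y`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features and a target that depends linearly on two of them
X_toy = rng.normal(loc=50, scale=10, size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# The pipeline fits the scaler on X_tr, then fits the regressor on the scaled data
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

# score() scales X_va with the training statistics before computing R-squared
val_r2 = model.score(X_va, y_va)
```

The separate cells above achieve the same result; the pipeline mainly reduces the chance of accidentally calling `fit_transform` on validation or test data.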

In [None]:
# create one or more ML models
lr = LinearRegression()

In [None]:
# fit the linear regression model on the training data
lr.fit(X_train, y_train)

In [None]:
# evaluate the linear regressor
lr_pred = lr.predict(X_val)
lr_mse = mean_squared_error(y_val, lr_pred)
lr_r2 = r2_score(y_val, lr_pred)
lr_mae = mean_absolute_error(y_val, lr_pred)
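
To make the three metrics concrete, here is a small worked example on hand-made numbers (the arrays are hypothetical, not model output). It also derives the root mean squared error, which is in the same units as the target and is often easier to interpret than MSE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions; the errors are -1, 0, 1, 0
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors: (1 + 0 + 1 + 0) / 4
rmse = np.sqrt(mse)                        # back in the units of the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors: (1 + 0 + 1 + 0) / 4
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot = 1 - 2 / 20

print(f"MSE: {mse:.3f}  RMSE: {rmse:.3f}  MAE: {mae:.3f}  R2: {r2:.3f}")
```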


In [None]:
# create more ML models. Train, fit, predict and evaluate also the following models:

In [None]:
#support vector model
svm = SVR()

In [None]:
#random forest regressor
rfr = RandomForestRegressor()

In [None]:
#decision tree regressor
dtr = DecisionTreeRegressor()
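
The three models instantiated above still need to be fitted, used for prediction and evaluated. One way to do this without repeating code is to loop over them and collect the metrics in a table. A minimal sketch on synthetic data; the data, the `n_estimators=50` setting and the `random_state` values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled features and the target
X_toy = rng.normal(size=(300, 4))
y_toy = 3.0 * X_toy[:, 0] + X_toy[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

models = {
    "SVR": SVR(),
    "Random forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Decision tree": DecisionTreeRegressor(random_state=42),
}

# Fit each model on the training split and score it on the validation split
rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)
    rows.append({
        "model": name,
        "MSE": mean_squared_error(y_va, pred),
        "MAE": mean_absolute_error(y_va, pred),
        "R2": r2_score(y_va, pred),
    })

# One row per model, best (lowest MSE) first
results = pd.DataFrame(rows).set_index("model").sort_values("MSE")
print(results)
```

In the notebook, `X_tr`, `X_va`, `y_tr` and `y_va` would be replaced by `X_train`, `X_val`, `y_train` and `y_val`.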

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance. The models will be compared on the following metrics:
# - Mean Squared Error
# - Mean Absolute Error
# - R-squared score

In [None]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic
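
If the linear regression turns out to be the chosen model, its coefficients offer a simple starting point for the explanation: on standardised features, each coefficient is the change in the predicted target for a one standard deviation increase in that feature. A minimal sketch on synthetic data; the feature names `feat_a`, `feat_b` and `feat_c` are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic features; the target depends on feat_a and feat_b but not feat_c
toy = pd.DataFrame(rng.normal(size=(500, 3)), columns=["feat_a", "feat_b", "feat_c"])
y_toy = 5.0 * toy["feat_a"] - 2.0 * toy["feat_b"] + rng.normal(scale=0.1, size=500)

# Standardise so that coefficient sizes are comparable across features
X_scaled = StandardScaler().fit_transform(toy)
toy_lr = LinearRegression().fit(X_scaled, y_toy)

# Rank features by the absolute size of their coefficient
coef_table = pd.Series(toy_lr.coef_, index=toy.columns).sort_values(
    key=np.abs, ascending=False
)
print(coef_table)
```

For tree-based models the analogous quantity would be the fitted model's `feature_importances_` attribute, which ranks features but does not indicate the direction of their effect.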