# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I, **Team GM5**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable sources and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner working of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [None]:
# Libraries for data loading, data manipulation and data visualisation
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from statsmodels.graphics.correlation import plot_corr

# Libraries for data preparation and model building
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import r2_score as r2
from scipy.stats import pearsonr


<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [None]:
# load data
train_data = pd.read_csv('df_train.csv')
test_data = pd.read_csv('df_test.csv') 

In [None]:
# view train and test shape
print('Train_data shape: {}'.format(train_data.shape))
print('Test_data shape: {}'.format(test_data.shape))

In [None]:
# view train info
train_data.info()

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


Firstly, note that the 'time' column is stored as an object. The conversion below is left commented out; the column is converted to datetime in Section 4.3.

In [584]:
#train_data['time'] = pd.to_datetime(train_data['time'], utc=True)+timedelta(hours=1) 

Secondly, we preview the train and test sets. The unnamed column is dropped later, in Section 4.3.

In [None]:
# train data preview
train_data.head(3)

In [None]:
# test data preview
test_data.head(3)

Observations:

- It should be noted that in our dataset, "Valencia Pressure" has various missing values. I think we have to find the median or mode so that we are able to fill in the missing values.

- Numerous columns can be removed, such as the min and max columns for the various cities. Could we also add a few new features that combine the various cities? It seems pointless to break down all the cities (and their stats) in relation to the main question.

- Also, research "renewable sources", because they usually comprise solar, geothermal, wind, biomass and hydropower. Many of the columns aren't specifically related to renewable resources, such as humidity, clouds, and pressure.
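The idea of combining the per-city columns could be sketched as follows. This is a minimal illustration on a toy DataFrame with made-up values, assuming the `<City>_<measure>` naming pattern seen in the dataset; the helper `add_national_mean` and the `Spain_mean_` prefix are hypothetical names, not part of the notebook.

```python
import pandas as pd

def add_national_mean(frame, measure):
    """Return a copy with the row-wise mean of every '<City>_<measure>' column."""
    # Assumption: city columns follow the '<City>_<measure>' naming pattern.
    cols = [c for c in frame.columns if c.endswith('_' + measure)]
    out = frame.copy()
    out['Spain_mean_' + measure] = out[cols].mean(axis=1)
    return out

# Toy data with made-up values
toy_cities = pd.DataFrame({
    'Madrid_wind_speed': [1.0, 3.0],
    'Bilbao_wind_speed': [3.0, 5.0],
    'Madrid_humidity': [60.0, 70.0],
})
toy_combined = add_national_mean(toy_cities, 'wind_speed')
print(toy_combined['Spain_mean_wind_speed'].tolist())
```

Whether a plain mean is the right aggregate is an open choice; a weighted mean or a max could equally be tried.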

Next, using the describe function, we will calculate summary statistics of the DataFrame, including but not limited to the mean, standard deviation and percentiles.

In [None]:
train_data.describe().T

Next, we will analyse the correlation between the various numerical variables.

In [None]:
# correlation is only defined for numeric columns, so select those first
corr = train_data.select_dtypes(include='number').corr()
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
plot_corr(corr, xnames=corr.columns, ax=ax)

It should be highlighted that there is a particularly high correlation between the various variables at the bottom right of the correlation matrix.  
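To put names to a highly correlated block like this, one option is to list the feature pairs whose absolute correlation exceeds a threshold. The sketch below uses synthetic columns; the 0.9 threshold and all column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy_corr = pd.DataFrame({
    'temp': base,
    'temp_max': base + rng.normal(scale=0.05, size=200),  # nearly a copy of 'temp'
    'humidity': rng.normal(size=200),                     # unrelated column
})

abs_corr = toy_corr.corr().abs()
# Keep only the upper triangle so that each pair appears once.
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.9]  # arbitrary threshold
print(high_pairs)
```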

In the following section, we shall look into the skewness and kurtosis of our dataset.

- Note: Additional/unnecessary columns should be dropped. 


In [None]:
train_data.select_dtypes(include='number').kurtosis().plot()

Furthermore, as can be gleaned from the below, numerous features are skewed, some positively and some negatively.

In [None]:
plt.figure(figsize = [10,5])
train_data.select_dtypes(include='number').skew(axis=0, skipna=True).plot()

In [None]:
train_data.select_dtypes(include='number').skew()
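As a reading aid for these numbers, the sketch below computes both statistics on two synthetic columns, one symmetric and one right-skewed. The data and column names are made up. pandas reports excess kurtosis (Fisher's definition), so a normal distribution scores about 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo_shape = pd.DataFrame({
    'symmetric': rng.normal(size=5000),          # bell-shaped, skew near 0
    'right_skewed': rng.exponential(size=5000),  # long right tail, positive skew
})

skewness = demo_shape.skew()
excess_kurtosis = demo_shape.kurtosis()
print(skewness.round(2))
print(excess_kurtosis.round(2))
```

A skew far from 0 or a large excess kurtosis flags a feature with a long tail or outliers, which may be worth transforming before fitting a linear model.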

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

#### 4.1 Merge Data
    * Merge test and training data 
    * View head, tail and shape of data

In [None]:
# merge test and training data into one df
df = pd.concat([train_data, test_data])

In [None]:
# view head
df.head(2)

In [None]:
# view tail 
df.tail(2)

In [None]:
# view shape
print('df shape: {}'.format(df.shape))

#### 4.2 Missing Values
    * View missing values
    * Create a copy of df - To be safe
    * Impute missing values
    


In [None]:
# view missing values
df.isnull().sum()

Valencia_pressure and load_shortfall_3h are the only features that contain missing values. However, we will only impute Valencia_pressure, because the missing values in load_shortfall_3h come from the test rows, which have no target once the train and test datasets are merged.

In [None]:
# create copy of df (plain assignment would only create a second name for the same object)
df_new = df.copy()

In [None]:
# impute missing values using mode
df_new['Valencia_pressure'] = df_new['Valencia_pressure'].fillna(df_new['Valencia_pressure'].mode()[0])

In [None]:
# view missing values again - to see if it worked
df_new.isnull().sum()

It worked! Valencia_pressure no longer contains missing values.
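The EDA observations mention the median as an alternative to the mode. The sketch below compares the two fills on a toy series; the pressure values are made up and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy pressure readings with two gaps (made-up values)
toy_pressure = pd.Series([1002.0, np.nan, 1002.0, 1010.0, 1018.0, np.nan, 1020.0])

# mode() returns a Series because ties are possible; [0] takes the first mode
mode_filled = toy_pressure.fillna(toy_pressure.mode()[0])
median_filled = toy_pressure.fillna(toy_pressure.median())

print(mode_filled.tolist())
print(median_filled.tolist())
```

The mode repeats the most common reading, while the median sits in the middle of the observed range, so the two can differ noticeably on a continuous feature.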

#### 4.3 Datatypes
    * View datatypes
    * Process Objects - drop and/or transform relevant features

In [None]:
# view datatypes
df_new.dtypes

##### Features to be processed
    - time
    - Valencia_wind_deg
    - Seville_pressure
    - Unnamed: 0

##### time

In [None]:
# transform time from object into datetime
df_new['time'] = pd.to_datetime(df_new['time'])

In [None]:
# view time transformation
df_new['time'].head(2)

##### Valencia_wind_deg

In [603]:
# view Valencia_wind_deg
df_new['Valencia_wind_deg'].head(2)

0     level_5
1    level_10
Name: Valencia_wind_deg, dtype: object

Transform by extracting the digits from the string. For example, level_5 is a string and its numeric part is 5.

In [604]:
# extract the digits
df_new['Valencia_wind_deg'] = df_new['Valencia_wind_deg'].str.extract(r'(\d+)', expand=False)

In [605]:
# transform Valencia_wind_deg from object into int
df_new['Valencia_wind_deg'] = pd.to_numeric(df_new['Valencia_wind_deg'])

In [606]:
# view Valencia_wind_deg transformation
df_new['Valencia_wind_deg'].head(2)

0     5
1    10
Name: Valencia_wind_deg, dtype: int64

##### Seville_pressure

In [607]:
# view Seville_pressure
df_new['Seville_pressure'].head(2)

0    sp25
1    sp25
Name: Seville_pressure, dtype: object

In [608]:
# view unique values in Seville_pressure
df_new['Seville_pressure'].unique()

array(['sp25', 'sp23', 'sp24', 'sp21', 'sp16', 'sp9', 'sp15', 'sp19',
       'sp22', 'sp11', 'sp8', 'sp4', 'sp6', 'sp13', 'sp17', 'sp20',
       'sp18', 'sp14', 'sp12', 'sp5', 'sp10', 'sp7', 'sp3', 'sp2', 'sp1'],
      dtype=object)

Transform by extracting the digits from the string. For example, sp25 is a string and its numeric part is 25; sp9 yields 9.

In [609]:
# extract the digits
df_new['Seville_pressure'] = df_new['Seville_pressure'].str.extract(r'(\d+)', expand=False)

In [610]:
# transform Seville_pressure from object into int
df_new['Seville_pressure'] = pd.to_numeric(df_new['Seville_pressure'])

In [611]:
# view Seville_pressure transformation 
df_new['Seville_pressure'].head(2)

0    25
1    25
Name: Seville_pressure, dtype: int64

##### Unnamed: 0

In [612]:
# drop Unnamed: 0 and time
df_new = df_new.drop(['Unnamed: 0', 'time'], axis=1) 

Note that the 'time' column has also been removed. It has only been removed for this base model; we will use it later to create more features, which will potentially improve our model.
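One way the `time` column might later be turned into features is through the pandas `.dt` accessor. The sketch below uses three made-up timestamps; the new column names are hypothetical and not taken from the notebook.

```python
import pandas as pd

demo_time = pd.DataFrame({'time': ['2015-01-01 03:00:00',
                                   '2015-06-15 12:00:00',
                                   '2017-12-31 21:00:00']})
demo_time['time'] = pd.to_datetime(demo_time['time'])

# Split the timestamp into numeric parts a regression model can use
demo_time['year'] = demo_time['time'].dt.year
demo_time['month'] = demo_time['time'].dt.month
demo_time['day_of_week'] = demo_time['time'].dt.dayofweek  # Monday=0, Sunday=6
demo_time['hour'] = demo_time['time'].dt.hour
print(demo_time)
```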

#### 4.4 Cleaned Dataset
    * View cleaned dataset
    * View shape

In [613]:
# view cleaned df_new
df_new.head(2)

Unnamed: 0,Madrid_wind_speed,Valencia_wind_deg,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Madrid_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,Bilbao_wind_deg,...,Madrid_temp_max,Barcelona_temp,Bilbao_temp_min,Bilbao_temp,Barcelona_temp_min,Bilbao_temp_max,Seville_temp_min,Madrid_temp,Madrid_temp_min,load_shortfall_3h
0,0.666667,5,0.0,0.666667,74.333333,64.0,0.0,1.0,0.0,223.333333,...,265.938,281.013,269.338615,269.338615,281.013,269.338615,274.254667,265.938,265.938,6715.666667
1,0.333333,10,0.0,1.666667,78.333333,64.666667,0.0,1.0,0.0,221.0,...,266.386667,280.561667,270.376,270.376,280.561667,270.376,274.945,266.386667,266.386667,4171.666667


In [614]:
# view shape 
df_new.shape

(11683, 47)

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the three-hour load shortfall. |

---

#### 5.1 Model Preparation
    * Extract train and test data from previously merged dataset - df_new
    * View x and y head and shape

In [680]:
# Extract x and y
y = df_new[:len(train_data)][['load_shortfall_3h']]
x = df_new[:len(train_data)].drop('load_shortfall_3h', axis=1)

In [681]:
# Extract x_train and x_test - for later 
x_train = df_new[:len(train_data)].drop('load_shortfall_3h', axis=1)
x_test = df_new[len(train_data):].drop('load_shortfall_3h', axis=1)

In [682]:
# view x 
x.head(2)

Unnamed: 0,Madrid_wind_speed,Valencia_wind_deg,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Madrid_humidity,Bilbao_clouds_all,Bilbao_wind_speed,Seville_clouds_all,Bilbao_wind_deg,...,Barcelona_temp_max,Madrid_temp_max,Barcelona_temp,Bilbao_temp_min,Bilbao_temp,Barcelona_temp_min,Bilbao_temp_max,Seville_temp_min,Madrid_temp,Madrid_temp_min
0,0.666667,5,0.0,0.666667,74.333333,64.0,0.0,1.0,0.0,223.333333,...,281.013,265.938,281.013,269.338615,269.338615,281.013,269.338615,274.254667,265.938,265.938
1,0.333333,10,0.0,1.666667,78.333333,64.666667,0.0,1.0,0.0,221.0,...,280.561667,266.386667,280.561667,270.376,270.376,280.561667,270.376,274.945,266.386667,266.386667


In [683]:
# view y
y.head(2)

|   | load_shortfall_3h |
| :--- | :--- |
| 0 | 6715.666667 |
| 1 | 4171.666667 |


In [685]:
# view datasets shape
print('x shape: {}'.format(x.shape))
print('y shape: {}'.format(y.shape))

x shape: (8763, 46)
y shape: (8763, 1)


#### 5.2 Load Model
    * Build model
    * Split data into 3 parts 
        - train set
        - validation set
        - test set
    * View data shape

In [686]:
# build model - Linear Regression
lr = LinearRegression()

In [687]:
# split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

In [688]:
# view shape of data
print('x_train shape: {} - training features'.format(x_train.shape))
print('y_train shape: {} - training target'.format(y_train.shape))
print('x_test shape: {}  - validation features'.format(x_test.shape))
print('y_test shape: {}  - validation target'.format(y_test.shape))

x_train shape: (6572, 46) - training features
y_train shape: (6572, 1) - training target
x_test shape: (2191, 46)  - validation features
y_test shape: (2191, 1)  - validation target
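Section 5.2 names three parts, while `train_test_split` produces two per call. One way to obtain a train, validation and test set from labelled data is to call it twice; the sketch below does this on synthetic arrays. The 60/20/20 proportions, the `random_state` value and all variable names are assumptions for illustration, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Step 1: hold out 20% of the rows as a final test set.
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Step 2: split the remaining 80% into 75% train and 25% validation,
# which gives 60% / 20% of the original rows.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(X_tr.shape, X_val.shape, X_hold.shape)
```

Fixing `random_state` makes the split, and therefore the reported scores, reproducible between runs.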


#### 5.3 Fit model 
    * Fit model
    * Create prediction variable

In [689]:
# fit model 
lr.fit(x_train, y_train)

LinearRegression()

In [690]:
# predict
pred = lr.predict(x_test)

#### 5.4 Evaluate Model
    * Calculate and view prediction performance
        - Root Mean Squared Error (RMSE)
        - r2 score

In [691]:
# root mean squared error (square root of the MSE)
rmse_eval = np.sqrt(MSE(y_test, pred))

In [692]:
# r2 score
r2_eval = r2(y_test, pred)

In [693]:
# view performance 
print('RMSE: {}'.format(rmse_eval))
print('r2: {}'.format(r2_eval))

RMSE: 4886.903828218214
r2: 0.14091258935831086
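The first score above is the square root of the mean squared error, so it is in the same units as the target, and the r2 score is one minus the ratio of residual to total sum of squares. The sketch below checks both relationships against scikit-learn on made-up numbers; the arrays and variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 11.0])

residuals = actual - predicted
rmse_manual = np.sqrt(np.mean(residuals ** 2))
ss_res = np.sum(residuals ** 2)                # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))
r2_sklearn = r2_score(actual, predicted)
print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

Read this way, an r2 of about 0.14 means the base model explains roughly 14% of the variance in the validation target.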


<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [628]:
# Compare model performance

In [629]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [630]:
# discuss chosen methods logic

## 8. Kaggle Submission

##### 8.1 Model Preparation
    * Extract train and test data from previously merged dataset - df_new
    * Fit model and make prediction
    * Create and view prediction variable

In [694]:
# extract x_train and x_test 
x_train = df_new[:len(train_data)].drop('load_shortfall_3h', axis=1)
x_test = df_new[len(train_data):].drop('load_shortfall_3h', axis=1)

In [695]:
# fit model on the full training set
lr.fit(x_train, y)

LinearRegression()

In [696]:
# prediction 
pred = lr.predict(x_test)

In [698]:
# create prediction variable
pred_var = pd.DataFrame(pred, columns=['load_shortfall_3h'])

In [699]:
# view pred_var
pred_var.head(3)

|   | load_shortfall_3h |
| :--- | :--- |
| 0 | 9722.354933 |
| 1 | 8795.99282 |
| 2 | 9902.349678 |


##### 8.2 Output Dataframe - csv
    * Create dataframe
    * Create and view submission csv file

In [703]:
# create output
output = pd.DataFrame({'time': test_data['time']})

In [704]:
# submission
submission = output.join(pred_var)
submission.to_csv('submission.csv', index=False)

In [705]:
# view submission preview 
submission.head(3)

|   | time | load_shortfall_3h |
| :--- | :--- | :--- |
| 0 | 2018-01-01 00:00:00 | 9722.354933 |
| 1 | 2018-01-01 03:00:00 | 8795.99282 |
| 2 | 2018-01-01 06:00:00 | 9902.349678 |
