<a href="https://colab.research.google.com/github/MIrfaanA/load-shortfall-regression-predict-api/blob/master/starter_notebook_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Team 13_2 Regression Predict Student Solution

© Explore Data Science Academy

---
Intro 

### Predict Overview: Spain Electricity Shortfall Challenge

![image.png](attachment:image.png)

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner working of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are.



<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages

In [None]:
# Libraries for data loading, data manipulation and data visualisation
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.pipeline import make_pipeline

# Libraries for data preparation and model building
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

# Model Evaluation
from sklearn.metrics import mean_squared_error

# Setting global constants to ensure notebook results are reproducible
#PARAMETER_CONSTANT = ###
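
The placeholder above is meant to hold a global constant that keeps results reproducible. A minimal sketch of how such a constant could be used follows; the name `RANDOM_STATE`, the value 42, and the toy arrays are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical constant; any fixed integer works.
RANDOM_STATE = 42

# Toy data standing in for the real features and target.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same seed should return identical splits.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=RANDOM_STATE)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=RANDOM_STATE)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Passing the same constant to every estimator that accepts `random_state` (for example `RandomForestRegressor`) would extend the same guarantee to model training.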

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [None]:
# Load the training data
# `data` is a second name for the same DataFrame, not a copy
df = data = pd.read_csv('df_train.csv', index_col=0)

# Make sure that the DataFrame shows all rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Show the shape of the DataFrame
df.shape

(8763, 48)

In [None]:
# Load the test data 
df_test = pd.read_csv('df_test.csv', index_col = 0)
df_test.head()

In [None]:
data.head()

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
  <a class="anchor" id="1.1"></a>
  <a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


In [None]:
# look at data statistics
df.describe()

In [None]:
# plot relevant feature interactions


In [None]:
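# evaluate correlation between the numeric columns only
# (text columns such as `time` cannot be correlated)
corr = df.select_dtypes(include='number').corr()
corr.style.background_gradient(cmap='coolwarm')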

In [None]:
# have a look at the distribution of the target variable
sns.displot(df['load_shortfall_3h'], kind = "kde")
plt.show()
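
A full correlation matrix over dozens of columns is hard to read. One possible complement, sketched here on synthetic data, is to rank the numeric features by the absolute value of their correlation with the target. The helper `rank_by_target_correlation` and the toy column names other than `load_shortfall_3h` are hypothetical.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(frame, target):
    """Absolute Pearson correlation of each numeric feature with the target, largest first."""
    corr = frame.select_dtypes(include='number').corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
toy = pd.DataFrame({
    'strong_feature': signal + rng.normal(scale=0.1, size=200),
    'noise_feature': rng.normal(size=200),
    'label_text': ['a'] * 200,  # non-numeric column, ignored by the helper
    'load_shortfall_3h': signal,
})

ranking = rank_by_target_correlation(toy, 'load_shortfall_3h')
print(ranking)
```

Correlation only captures linear relationships, so a low rank here does not by itself prove a feature is useless.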

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [None]:
# Keep only the digits of each wind-direction label and convert them to numbers
df['Valencia_wind_deg'] = df['Valencia_wind_deg'].str.extract(r'(\d+)', expand=False)
df['Valencia_wind_deg'] = pd.to_numeric(df['Valencia_wind_deg'])


In [None]:
df['Valencia_wind_deg']

In [None]:
# Same treatment for the pressure labels: strip the text prefix, keep the number
df['Seville_pressure'] = df['Seville_pressure'].str.extract(r'(\d+)', expand=False)
df['Seville_pressure'] = pd.to_numeric(df['Seville_pressure'])
df['Seville_pressure']
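
The two cells above turn text labels into numbers by keeping the first run of digits. The sketch below isolates that step on a small Series. The sample values are assumed for illustration (the real columns may use other prefixes), and `extract_number` is a hypothetical helper.

```python
import pandas as pd

def extract_number(series):
    """Pull the first run of digits out of each string and return a numeric Series."""
    digits = series.astype(str).str.extract(r'(\d+)', expand=False)
    # Values without any digits become NaN instead of raising an error.
    return pd.to_numeric(digits, errors='coerce')

# Assumed example labels, not taken from the data set.
raw = pd.Series(['level_5', 'level_10', 'sp25', 'unknown'])
parsed = extract_number(raw)
print(parsed)
```

Using a raw string (`r'(\d+)'`) avoids Python treating `\d` as an invalid escape sequence, and `expand=False` returns a Series rather than a one-column DataFrame.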

In [None]:
df['Valencia_pressure'] = df['Valencia_pressure'].fillna((df['Valencia_pressure'].mean()))
df['Valencia_pressure']

In [None]:
# remove missing values/ features
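
The cell above is still empty. As a rough sketch of what it could contain, the example below counts missing values per column and then fills them with `SimpleImputer`, fitting the imputer on the training rows only so that the same mean can be reused on new data. The toy frame and the column `other_feature` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps in one column.
toy = pd.DataFrame({
    'Valencia_pressure': [1010.0, np.nan, 1014.0, np.nan, 1012.0],
    'other_feature': [280.0, 281.0, 282.0, 283.0, 284.0],
})

# Count missing values, keeping only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]

# Learn the column means from the "training" rows, then apply them to unseen rows.
train_rows, new_rows = toy.iloc[:3], toy.iloc[3:]
imputer = SimpleImputer(strategy='mean')
imputer.fit(train_rows)
filled_new = pd.DataFrame(
    imputer.transform(new_rows), columns=toy.columns, index=new_rows.index
)
print(missing)
print(filled_new)
```

Filling with the mean of the whole column, as the previous cell does, is simpler but lets information from later rows influence earlier ones.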

In [None]:
# separate the target from the features (the raw `time` column is dropped, no new features are created yet)
y = df['load_shortfall_3h']
X = df.drop(['load_shortfall_3h','time'], axis= 1)
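
The cell above discards `time` entirely, although the brief asks for feature engineering. A minimal sketch of one option is to split the timestamp into calendar parts before dropping it. It assumes the `time` column holds parseable date-time strings; `add_time_features` and the toy `wind_speed` column are hypothetical.

```python
import pandas as pd

def add_time_features(frame, column='time'):
    """Replace a timestamp column with year, month, day, hour and day-of-week columns."""
    out = frame.copy()
    stamps = pd.to_datetime(out[column])
    out['year'] = stamps.dt.year
    out['month'] = stamps.dt.month
    out['day'] = stamps.dt.day
    out['hour'] = stamps.dt.hour
    out['dayofweek'] = stamps.dt.dayofweek  # Monday is 0
    return out.drop(columns=[column])

# Toy rows in an assumed timestamp format.
toy = pd.DataFrame({
    'time': ['2015-01-01 03:00:00', '2015-01-01 06:00:00'],
    'wind_speed': [1.0, 2.5],
})
features = add_time_features(toy)
print(features)
```

Whether these columns help is an empirical question; they would need to be added to both the training and the test data before modelling.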

In [None]:
# standardise the features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
X_standardise = pd.DataFrame(X_scaled,columns=X.columns)
X_standardise.head()

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the three-hour load shortfall. |

---

### Dataset Split 

In [None]:
# split the data into training and hold-out samples (fixed seed so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3332, random_state=42)

### Targets and features 

In [None]:
# check the size of the hold-out feature set
X_test.shape
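
In the cells above the scaler is fitted on all rows before the split, so the hold-out rows influence the scaling statistics. A common alternative, sketched here on synthetic data, is to split first and wrap the scaler and the model in one pipeline with `make_pipeline`. The toy arrays and the choice of `Ridge` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target standing in for the real data.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(300, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=300)

# Split BEFORE scaling so the hold-out rows stay unseen.
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The pipeline fits the scaler on the training rows only, then reuses it at predict time.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)

rmse_val = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
print(rmse_val)
```

The same fitted pipeline can later be applied to any new data without repeating the scaling by hand.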

### 5.1 Linear Regression 

In [None]:
# create one or more ML models
#Linear Regression Model

lm = LinearRegression()
lm.fit(X_train,y_train)

y_pred_ln = lm.predict(X_test)

mse = mean_squared_error(y_test,y_pred_ln)
rmse = np.sqrt(mse)
print("Linear Regression: ",rmse)



In [None]:
y_pred_ln

In [None]:
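# Pair the predictions with the test timestamps.
# NOTE: y_pred_ln holds predictions for the hold-out split of df_train, not for df_test,
# so this only lines up by row count; a real submission needs predictions on df_test.
submission = pd.DataFrame()
submission['time'] = df_test['time']
submission['load_shortfall_3h'] = y_pred_ln
submission.head()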

### 5.2 Ridge Regression Model

In [None]:
#Ridge Regression Model

rdg = Ridge()
rdg.fit(X_train,y_train)

y_pred_rdg = rdg.predict(X_test)

mse_rdg = mean_squared_error(y_test,y_pred_rdg)
rmse_rdg = np.sqrt(mse_rdg)
print("Ridge: ",rmse_rdg)

### 5.3 Lasso Regression Model


In [None]:
#Lasso Regression Model

lss = Lasso()
lss.fit(X_train,y_train)

y_pred_lss = lss.predict(X_test)

mse_lss = mean_squared_error(y_test,y_pred_lss)
rmse_lss = np.sqrt(mse_lss)
print("Lasso :",rmse_lss)

### 5.4 Random Forest Regression Model

In [None]:
#Random Forest Regression Model

rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)

y_pred_rfr = rfr.predict(X_test)

mse_rfr = mean_squared_error(y_test,y_pred_rfr)
rmse_rfr = np.sqrt(mse_rfr)
print("RandomForestRegressor: ",rmse_rfr)

### 5.5 XGBoost Model

In [None]:
#XGBoost Model

reg = xgb.XGBRegressor(booster='gbtree',n_estimators = 2000, reg_lambda=1,gamma=0, max_depth = 3)
reg.fit(X_train, y_train)

y_pred_xgb = reg.predict(X_test)

mse_xgb = mean_squared_error(y_test,y_pred_xgb)
rmse_xgb = np.sqrt(mse_xgb)
print("XGBOOST: ",rmse_xgb)



In [None]:
daf = pd.DataFrame(y_pred_xgb, columns = ['load_shortfall_3h'])
daf.head()

In [None]:
y_pred_xgb.shape

### Submission

In [None]:
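# Build the submission frame; the CSV file is written in a later cell.
# NOTE: y_pred_xgb holds predictions for the hold-out split of df_train, not for df_test.
submission = pd.DataFrame()
submission['time'] = df_test['time']
submission['load_shortfall_3h'] = y_pred_xgb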

In [None]:
submission

In [None]:
submission.to_csv('Mfundo_submission.csv', index=False)
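
The submission cells above reuse predictions made on the hold-out split. A submission would normally contain predictions for the rows of the test file itself, prepared with the same steps as the training features. The sketch below shows that flow on toy frames; the frames, the single `wind_speed` feature and the linear model are assumptions, and the real test data would also need the cleaning steps from the Data Engineering section.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for df_train and df_test.
train = pd.DataFrame({
    'time': ['2015-01-01 03:00:00', '2015-01-01 06:00:00', '2015-01-01 09:00:00',
             '2015-01-01 12:00:00', '2015-01-01 15:00:00', '2015-01-01 18:00:00'],
    'wind_speed': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'load_shortfall_3h': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})
test = pd.DataFrame({
    'time': ['2018-01-01 00:00:00', '2018-01-01 03:00:00', '2018-01-01 06:00:00'],
    'wind_speed': [7.0, 8.0, 9.0],
})

# Use the same feature columns, in the same order, for both frames.
feature_cols = [c for c in train.columns if c not in ('time', 'load_shortfall_3h')]

model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(train[feature_cols], train['load_shortfall_3h'])

# One prediction per test row, paired with that row's own timestamp.
submission_demo = pd.DataFrame({
    'time': test['time'].to_numpy(),
    'load_shortfall_3h': model.predict(test[feature_cols]),
})
print(submission_demo)
```

Because each prediction is computed from the matching test row, the result stays correct even if the test file has a different number of rows than the hold-out split.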


<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance


In [None]:
# Choose best model and motivate why it is the best choice


In [None]:
test = pd.read_csv('df_test.csv', index_col = 0)

In [None]:
test.head()

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss the chosen method's logic
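
The brief also asks which features mattered most to the model. For tree ensembles such as `RandomForestRegressor`, one starting point is the fitted `feature_importances_` attribute. The sketch below uses synthetic data in which only `feature_a` drives the target; all column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on feature_a only.
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=['feature_a', 'feature_b', 'feature_c'])
target = 5.0 * toy['feature_a'] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=2)
forest.fit(toy, target)

# Impurity-based importances sum to 1; larger means the feature was used for more useful splits.
importances = pd.Series(forest.feature_importances_, index=toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be biased towards features with many distinct values, so they are best read as a rough guide rather than a definitive ranking.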