# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**YOUR NAME, YOUR SURNAME**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable sources and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner working of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [None]:
# Libraries for data loading, data manipulation and data visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

# Libraries for data preparation and model building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor
from statsmodels.graphics.correlation import plot_corr

# Libraries for model evaluation
from sklearn import metrics  # used for RMSE
from sklearn.metrics import mean_squared_error

# Setting global constants to ensure notebook results are reproducible
# PARAMETER_CONSTANT = ###

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [None]:
df = pd.read_csv("df_train.csv")  # load the training data
df.head()

First, we load the training dataframe so that we can train our model(s). After loading, we preview the first rows with the .head() method; the output shows that the dataframe has 49 columns.

We will later use other methods to analyse our data further.

In [None]:
df_test = pd.read_csv("df_test.csv")
df_test.head()

Next, we load the test dataframe, which will be used to test our model(s), and preview it with the .head() method. The output shows that it has 48 columns.

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


We begin exploring the data by checking every column for null values with the pandas 'info' method.

In [None]:
# look at column data types and non-null counts
df.info()

It is now clear that the 'Valencia_pressure' column has a significant number of null values: **2,068 null values** out of 8,763 rows.

The next step in understanding the data is to use the 'describe' method, which generates descriptive statistics for the numerical columns. These statistics summarise the central tendency, dispersion and shape of each column's distribution, excluding NaN values.

In [None]:
df.describe()

The 'shape' attribute shows the number of rows and the number of columns in the dataset.

After executing the cell we found that the train dataframe has 8763 rows and 49 columns.

In [None]:
#Checking number of rows and columns
df.shape

The 'isnull()' method is applied, followed by 'sum()', to count the null values in each column.  This analysis confirms that the 'Valencia_pressure' column indeed contains 2068 null values.

In [None]:
#Check for Null Values
df.isnull().sum()
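
The raw null counts are easier to judge as a share of all rows; for example, 2,068 of 8,763 rows is roughly 23.6%. The sketch below shows one way to compute that share per column. It uses a small made-up frame (`toy`) rather than the project data, and only the column names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training data
toy = pd.DataFrame({
    "Valencia_pressure": [1012.0, np.nan, 1008.0, np.nan],
    "Madrid_humidity": [64.0, 70.0, 58.0, 61.0],
})

null_counts = toy.isnull().sum()          # nulls per column
null_share = toy.isnull().mean() * 100    # the same information as a percentage of rows

summary = pd.DataFrame({"nulls": null_counts, "percent": null_share})
print(summary[summary["nulls"] > 0])      # show only the columns that have gaps
```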

Further inspection of the "Valencia_pressure" column is done to analyze the values it contains.

In [None]:
#Valencia_pressure has 2068 nulls
df.Valencia_pressure.describe()

Skew will indicate how symmetrical your data is.  

The "skew" method used in the cell below indicates that the columns "Bilbao_weather_id", "Madrid_pressure", "Valencia_pressure", "Barcelona_weather_id", "Madrid_weather_id" have high negative skewness.  

Amongst other columns containing high positive skewness, "Bilbao_snow_3h", "Barcelona_pressure", "Seville_rain_3h", "Valencia_snow_3h" were found to have dramatically high double-digit values for skewness.

In [None]:
#feature distributions
#Check for symmetry: most of our data is fairly symmetrical, but some columns show high positive skew
df.skew(numeric_only=True)

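To gauge the presence of outliers in the data, we use the "kurtosis" method.  pandas reports excess kurtosis (Fisher's definition), for which a normal distribution scores 0, equivalent to 3 on the unadjusted scale.  Values at or below that reference indicate light tails and thus a lack of outliers, while large positive values point to many outliers.

In [None]:
#Checking for outliers: high values indicate many outliers in our data
df.kurtosis(numeric_only=True)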

After executing the kurtosis method, we can identify the columns with the highest kurtosis, and therefore the most extreme outliers, as "Barcelona_rain_1h", "Seville_rain_1h", "Bilbao_snow_3h", "Barcelona_pressure", "Seville_rain_3h", "Madrid_rain_1h", "Barcelona_rain_3h" and "Valencia_snow_3h".
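
When reading kurtosis values from pandas, it helps to know the scale: pandas uses Fisher's definition, so a normal distribution scores close to 0 rather than 3. The sketch below illustrates this on synthetic samples (not the project data); the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic samples, not the project data
normal_like = pd.Series(rng.normal(size=100_000))
heavy_tailed = pd.Series(rng.exponential(size=100_000))

# A normal sample scores near 0; a heavy-tailed sample scores well above it
print("normal sample:", normal_like.kurtosis())
print("exponential sample:", heavy_tailed.kurtosis())
```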

----------------------------------------------------------------------------------------

### Data Visualisation

A boxplot will now be used to display the distribution of the data in terms of the minimum, first quartile, median, third quartile and maximum.  This is commonly referred to as the "five-number summary". The boxplot also shows the outliers in the data and what their values are.

In [None]:
#Standardisation or normalisation can be applied to a feature to adjust the range
sns.boxplot(x='load_shortfall_3h', data=df);
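
The five-number summary and the outlier rule behind a boxplot can also be computed directly. The sketch below assumes the default whisker length of 1.5 times the interquartile range and uses made-up values, not the `load_shortfall_3h` column.

```python
import numpy as np

# Hypothetical values; the last one is deliberately extreme
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
iqr = q3 - q1  # interquartile range: the height of the box

# Points further than 1.5 * IQR from the box are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("five-number summary:", minimum, q1, median, q3, maximum)
print("outliers:", outliers)
```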

In [None]:
sns.violinplot(x='load_shortfall_3h', data=df);

The visualisations below help to communicate graphically the skewness that was measured earlier in this notebook with the "skew()" method.

In [None]:
features = ['Madrid_wind_speed', 'Madrid_humidity', 'Madrid_clouds_all', 'Madrid_rain_1h','Madrid_temp_max','Madrid_temp_min'] # a selection of numerical Madrid features
df[features].hist(figsize=(10,10));

### Analyzing correlation

In the next step of our exploratory data analysis, we analyze the correlation between columns. The corr method is used to find the pairwise correlation of all columns.

All non-numeric columns in the dataframe are ignored in this step.

Note that the correlation of a variable with itself is 1.

In [None]:
# evaluate correlation: determine the pairwise correlation between numeric features
df.corr(numeric_only=True)

"Madrid_temp_max" and "Valencia_temp_max" show high correlation, for example, which may partly be explained by the two cities' proximity: Barcelona, by comparison, is around 300 kilometers further from Madrid than Valencia is.

To see the level of correlation visually, a heatmap is used in the cell below to indicate correlation among variables.

In [None]:
# evaluate correlation
plt.figure(figsize = (45, 25))
heatmap = sns.heatmap(df.corr(numeric_only=True), vmin=-1, vmax=1, cmap="BuPu", annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
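
With this many columns, a ranked list of each feature's correlation with the target can be easier to read than the full matrix. The following is a minimal sketch on made-up data; the column names `feat_up`, `feat_down`, `feat_flat` and `target` are hypothetical.

```python
import pandas as pd

# Made-up data: `target` rises with `feat_up` and falls with `feat_down`
toy = pd.DataFrame({
    "feat_up": [1, 2, 3, 4, 5],
    "feat_down": [10, 8, 6, 4, 2],
    "feat_flat": [3, 1, 4, 1, 5],
    "target": [2, 4, 6, 8, 10],
})

# Correlation of every feature with the target, excluding the target itself
corr_with_target = toy.corr()["target"].drop("target")

# Order by strength (absolute value) while keeping the sign for interpretation
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strongly negative relationships near the top, where a plain descending sort would push them to the bottom.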

In [None]:
# have a look at feature distributions
df.hist(bins = 50, figsize = (40, 30), color = "tab:blue")
plt.show()

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

Analyzing the data has helped to identify data errors. The next step is to update, change, or remove data to correct these issues. Improved data quality provides more accurate, consistent and reliable information for the predictive models.

In the cell below, the 'Valencia_pressure' column's null values are updated and imputed with the mean of the column.

In [None]:
# handle missing values in Valencia_pressure
#Populate null values of Valencia_pressure with the column mean
df['Valencia_pressure']=df['Valencia_pressure'].fillna(df['Valencia_pressure'].mean())

df_test['Valencia_pressure'] = df_test['Valencia_pressure'].fillna(df_test['Valencia_pressure'].mean())
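
The effect of mean imputation is easy to check on a tiny example. The sketch below uses hypothetical pressure readings; it also shows a common variation, which is not what the cell above does, of filling the test gaps with the training mean so that both sets are imputed with the same value.

```python
import numpy as np
import pandas as pd

# Hypothetical pressure readings with gaps
train_toy = pd.Series([1000.0, np.nan, 1020.0], name="Valencia_pressure")
test_toy = pd.Series([np.nan, 1005.0], name="Valencia_pressure")

train_mean = train_toy.mean()             # NaN values are skipped by default
train_filled = train_toy.fillna(train_mean)

# Variation (an assumption, not the notebook's approach): reuse the training mean
test_filled = test_toy.fillna(train_mean)

print(train_filled.tolist())
print(test_filled.tolist())
```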

In [None]:
#Run Test 
df['Valencia_pressure']

df_test['Valencia_pressure']

The "isnull" method is now used on the "Valencia_pressure" column to confirm that the mean values have been substituted in place of the null values.

In [None]:
df.Valencia_pressure.isnull().sum()

df_test.Valencia_pressure.isnull().sum()

Further analysis of the dataframe in the cell below shows the columns that are of "object" type.  This is an important step because it shows the analysis team which columns may need their datatype changed in order to be useful for prediction.

In [None]:
# create new features
#Check data which is in Object datatype
df.select_dtypes(include=['object']).head(5)

df_test.select_dtypes(include=['object']).head(5)

The "time" column contains valuable information that can be used later on.  Pandas is used in the cell below to change its datatype from object to datetime within the dataframe.

In [None]:
# engineer existing features
#Convert object columns, starting with time (to datetime)
df['time'] = pd.to_datetime(df['time'])

df_test['time'] = pd.to_datetime(df_test['time'])
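
Once the column is a datetime, calendar components can be pulled out as numeric features. The sketch below is one possible approach on hypothetical timestamps; the feature names `hour`, `day_of_week` and `month` are assumptions, not columns of the provided data.

```python
import pandas as pd

# Hypothetical three-hourly timestamps, stored as strings like a CSV column
toy = pd.DataFrame({
    "time": ["2015-01-01 03:00:00", "2015-01-01 06:00:00", "2015-07-15 21:00:00"]
})
toy["time"] = pd.to_datetime(toy["time"])

# Assumed feature names; any naming scheme would do
toy["hour"] = toy["time"].dt.hour
toy["day_of_week"] = toy["time"].dt.dayofweek   # Monday=0 ... Sunday=6
toy["month"] = toy["time"].dt.month
print(toy)
```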

The "Valencia_wind_deg" column also contains valuable ordinal data (statistical data that is categorical where the variables have ordered categories).

Python allows us to extract the relevant information (the digits, in this case); transforming the digits into an integer datatype makes the data usable for modelling.

In [None]:
df['Valencia_wind_deg'].head(5)

In [None]:
#Remove 'level_' on Valencia_wind_deg and convert to numeric datatype
df['Valencia_wind_deg'] = df['Valencia_wind_deg'].str.extract(r'(\d+)', expand=False)
df['Valencia_wind_deg'] = pd.to_numeric(df['Valencia_wind_deg'])

df_test['Valencia_wind_deg'] = df_test['Valencia_wind_deg'].str.extract(r'(\d+)', expand=False)
df_test['Valencia_wind_deg'] = pd.to_numeric(df_test['Valencia_wind_deg'])
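
The digit extraction can be tried in isolation on a few labels. The sketch below assumes labels shaped like `level_5` and `sp25`, inferred from the prefixes mentioned in this notebook; the sample values themselves are made up.

```python
import pandas as pd

# Hypothetical labels in the style of the two object columns
wind = pd.Series(["level_5", "level_10", "level_1"])
pressure = pd.Series(["sp25", "sp3", "sp12"])

# expand=False returns a Series; the raw string keeps \d from being read as a string escape
wind_num = pd.to_numeric(wind.str.extract(r"(\d+)", expand=False))
pressure_num = pd.to_numeric(pressure.str.extract(r"(\d+)", expand=False))

print(wind_num.tolist(), pressure_num.tolist())
```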


Python also allows us to extract the relevant information from the "Seville_pressure" column; transforming the digits into an integer datatype makes the data usable for modelling, as can be seen below.

In [None]:
# Showing data for the Seville_pressure column which will be cleaned 
df['Seville_pressure'].head(5)

In [None]:
#Remove 'sp' on Seville_pressure data and convert it to numeric
df['Seville_pressure'] = df['Seville_pressure'].str.extract(r'(\d+)', expand=False)
df['Seville_pressure'] = pd.to_numeric(df['Seville_pressure'])



In [None]:
df_test['Seville_pressure'] = df_test['Seville_pressure'].str.extract(r'(\d+)', expand=False)
df_test['Seville_pressure'] = pd.to_numeric(df_test['Seville_pressure'])

In [None]:
#Drop the redundant 'Unnamed: 0' index column
df = df.drop('Unnamed: 0',axis = 1)

df_test = df_test.drop('Unnamed: 0',axis = 1)


<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the three hour load shortfall. |

---

In [None]:
# split data
X = df.drop(["load_shortfall_3h", "time"], axis = 1) #the feature variables
y = df["load_shortfall_3h"] #the response/target variable, selected by name rather than by position

In [None]:
#split data 
#y = df[:len(df_train)][["load_shortfall_3h"]]
#x = df[:len(df_train)].drop(["load_shortfall_3h"], axis = 1)

The split above stores the features on which we will train our model(s) in the X variable, and the response/target variable that we are trying to predict (load_shortfall_3h) in the y variable.

In [None]:
# create targets and features dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 50) 

In [None]:
#x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2) 

In the train_test_split function, test_size sets the fraction of the data that is held out for testing; it has to be between 0 and 1, and 0.2 is a typical choice.

The rows are shuffled randomly before they are split, and the random state fixes the seed of that shuffle so that the same split is reproduced every time the cell is run.
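
The role of the random state can be checked directly: two calls with the same value return the same rows. The sketch below uses made-up arrays (`X_toy`, `y_toy`) rather than the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # made-up features, 10 rows
y_toy = np.arange(10)                  # made-up target

split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=50)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=50)

# Each call returns X_train, X_test, y_train, y_test; index 3 is y_test
same_test_rows = np.array_equal(split_a[3], split_b[3])
print("test rows:", len(split_a[3]), "| identical split:", same_test_rows)
```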

In [None]:
# create one or more ML models
# Linear regression model

lm = LinearRegression() #create the model
lm.fit(X_train, y_train) #train the model
predict = lm.predict(X_test) #predict on unseen data


In [None]:
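# Ridge model
# Note: this rebinds the name Ridge from the class to the fitted model,
# so the import cell must be re-run before this cell can be executed again

Ridge = Ridge() #create the model
Ridge.fit(X_train, y_train) #train model
R_pred = Ridge.predict(X_test) #predict on unseen data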

In [None]:
#RandomForest model
R_F = RandomForestRegressor(n_estimators=100, random_state=0)
R_F.fit(X_train,y_train)
y_pred = R_F.predict(X_test)

In [None]:
# evaluate one or more ML models

print('Linear model train:', np.sqrt(metrics.mean_squared_error(y_train, lm.predict(X_train))))
print('Linear model test:', np.sqrt(metrics.mean_squared_error(y_test, predict)))

print('Ridge model train:', np.sqrt(metrics.mean_squared_error(y_train, Ridge.predict(X_train))))
print('Ridge model test:', np.sqrt(metrics.mean_squared_error(y_test, R_pred)))

print('RandomForest model train:', np.sqrt(metrics.mean_squared_error(y_train, R_F.predict(X_train))))
print('RandomForest model test:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
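
The metric printed above is the root mean squared error (RMSE): the square root of the average squared difference between the observed and predicted values. The sketch below works it out by hand on made-up numbers and compares the result with scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])   # made-up observations
y_hat = np.array([2.0, 5.0, 9.0])    # made-up predictions

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, RMSE penalises a few large misses more heavily than many small ones, and it is expressed in the same units as the target.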




In [None]:
# evaluate one or more ML models
#Fitting a decision tree on a single pair of variables to explore their relationship
X = df["load_shortfall_3h"] # independent variable
y = df["Madrid_temp_max"] # dependent variable

plt.scatter(X,y) # create scatter plot
plt.title("Madrid Max Temp vs Load Shortfall in 3 hours")
plt.xlabel("load_shortfall_3h")
plt.ylabel("Madrid_temp_max")
plt.show()

In [None]:
#Splitting our data to evaluate the performance of the model
# Convert the Series to a 2-D NumPy column, since scikit-learn expects 2-D features
x_train, x_test, y_train, y_test = train_test_split(X.to_numpy()[:, np.newaxis],y,test_size=0.2,random_state=42)
# Instantiate regression tree model
regr_tree = DecisionTreeRegressor(max_depth=2,random_state=42)
regr_tree.fit(x_train,y_train)

In [None]:
#Plot the Decision tree
plt.figure(figsize=(9,9))
_ = plot_tree(regr_tree, feature_names=['load_shortfall_3h'],  filled=True)

In [None]:
#Evaluating Model Performance
# get predictions for test data (named tree_pred so the RandomForest y_pred is not overwritten)
tree_pred = regr_tree.predict(x_test)

# calculate MSE (true values first, then predictions)
MSE = mean_squared_error(y_test, tree_pred)

# Report RMSE
print("Regression Decision Tree model RMSE is:",np.sqrt(MSE))

In [None]:
#Visualising Model Output
x_domain = np.linspace(min(X), max(X), 100)[:, np.newaxis]
# predict y for every point in x-domain
y_predictions = regr_tree.predict(x_domain)
# plot the regression tree line over data
plt.figure()
plt.scatter(X, y)
plt.plot(x_domain, y_predictions, color="red", label='predictions')
plt.xlabel("load_shortfall_3h")
plt.ylabel("Madrid_temp_max")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

In [None]:
#MLR (multiple linear regression) model
# split predictors and response
X = df.drop(["load_shortfall_3h", "time"], axis = 1) #the feature variables
y = df["load_shortfall_3h"] #the response/target variable, selected by name rather than by position

lm = LinearRegression()

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.20, 
                                                    random_state=1)

In [None]:
lm.fit(X_train, y_train)
# extract model intercept
beta_0 = float(lm.intercept_)
# extract model coeffs
beta_js = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
print("Intercept is", beta_0)
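
A fitted linear model predicts by adding the intercept to the coefficient-weighted sum of the features. The sketch below demonstrates this on synthetic, noise-free data where the true intercept and coefficients are known; `x1`, `x2` and the other names are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_toy = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
y_toy = 1.0 + 2.0 * X_toy["x1"] + 3.0 * X_toy["x2"]   # exact, noise-free relation

model = LinearRegression().fit(X_toy, y_toy)
coefs = pd.Series(model.coef_, index=X_toy.columns)   # label each coefficient by column

# A prediction is the intercept plus the coefficient-weighted sum of the features
manual = model.intercept_ + X_toy.to_numpy() @ model.coef_

print("intercept:", model.intercept_)
print(coefs)
```

Labelling the coefficients with the column names, as `beta_js` does above, avoids having to remember the position of each feature.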

In [None]:
#Show Model
# Look up each coefficient by column name instead of by hard-coded position
panels = [('Bilbao_snow_3h', 'red'), ('Seville_rain_3h', 'green'),
          ('Madrid_temp_max', 'blue'), ('Madrid_wind_speed', 'blue')]

fig, axs = plt.subplots(2, 2, figsize=(9,7))

for ax, (col, colour) in zip(axs.ravel(), panels):
    coef = lm.coef_[X.columns.get_loc(col)]
    ax.scatter(df[col], df["load_shortfall_3h"])
    ax.plot(df[col], lm.intercept_ + coef*df[col], color=colour)
    ax.title.set_text(col + ' vs. load_shortfall_3h')

fig.tight_layout(pad=3.0)

plt.show()

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
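# evaluate one or more ML models
# X, y and the train/test variables were overwritten by the decision tree and MLR cells,
# so rebuild the random_state=50 split on which Ridge and RandomForest were fitted
X = df.drop(["load_shortfall_3h", "time"], axis = 1)
y = df["load_shortfall_3h"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 50)
lm.fit(X_train, y_train) #lm was last fitted on a different split

print('Linear model train:', np.sqrt(metrics.mean_squared_error(y_train, lm.predict(X_train))))
print('Linear model test:', np.sqrt(metrics.mean_squared_error(y_test, lm.predict(X_test))))

print('Ridge model train:', np.sqrt(metrics.mean_squared_error(y_train, Ridge.predict(X_train))))
print('Ridge model test:', np.sqrt(metrics.mean_squared_error(y_test, Ridge.predict(X_test))))

print('RandomForest model train:', np.sqrt(metrics.mean_squared_error(y_train, R_F.predict(X_train))))
print('RandomForest model test:', np.sqrt(metrics.mean_squared_error(y_test, R_F.predict(X_test))))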
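
To compare models side by side, the train and test RMSE values can be collected into one table and sorted by test error. The sketch below is one possible way to do this; `rmse_table` is a hypothetical helper, and the data is synthetic, so the numbers it prints say nothing about the real models.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def rmse_table(models, X_train, y_train, X_test, y_test):
    """Fit each model and return its train/test RMSE (hypothetical helper)."""
    rows = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        rows[name] = {
            "train_rmse": np.sqrt(mean_squared_error(y_train, model.predict(X_train))),
            "test_rmse": np.sqrt(mean_squared_error(y_test, model.predict(X_test))),
        }
    # One row per model, best (lowest) test RMSE first
    return pd.DataFrame(rows).T.sort_values("test_rmse")

# Synthetic linear data with a little noise, standing in for the real features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

table = rmse_table(
    {"linear": LinearRegression(), "tree": DecisionTreeRegressor(random_state=0)},
    X_tr, y_tr, X_te, y_te,
)
print(table)
```

A large gap between a model's train and test RMSE, as an unrestricted decision tree tends to show, is a sign of overfitting and is worth weighing alongside the test RMSE itself.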




In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic
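
If a random forest turns out to be the chosen model, its `feature_importances_` attribute offers one way to show which features drove the predictions. The sketch below is an illustration on synthetic data in which only the made-up `signal` column influences the target; it is not an analysis of the project data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"signal": rng.normal(size=300), "noise": rng.normal(size=300)})
y_toy = 3.0 * X_toy["signal"] + rng.normal(scale=0.1, size=300)  # depends on `signal` only

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Importances sum to 1; a larger value means the feature was more useful for splitting
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```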