<h1 align=center><font size = 5> Course Project - Build a Regression Model in Keras</font></h1>

## Introduction

In this course project, you will build a regression model using the Keras deep learning library. You will then experiment with increasing the number of training epochs and changing the number of hidden layers to see how these parameters affect the model's performance.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
    
1. Download and Clean Dataset
2. Import Keras
3. Section A: Build a baseline model
4. Section B: Normalize the data 
5. Section C: Increase the number of epochs
6. Section D: Increase the number of hidden layers 

</font>
</div>

<hr>

## Download and Clean Dataset

Let's start by importing the <em>pandas</em> and NumPy libraries, along with the scikit-learn helpers used later.

In [69]:
import pandas as pd
import numpy as np
import sklearn.metrics  # explicit submodule import so sklearn.metrics.mean_squared_error is available
from sklearn.model_selection import train_test_split

Let's download the data and read it into a <em>pandas</em> dataframe.

In [3]:
concrete_data = pd.read_csv('https://cocl.us/concrete_data')
concrete_data.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.3 |


In [4]:
concrete_data.shape

(1030, 9)

Let's check the dataset for any missing values.

In [5]:
concrete_data.describe()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| count | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 |
| mean | 281.167864 | 73.895825 | 54.18835 | 181.567282 | 6.20466 | 972.918932 | 773.580485 | 45.662136 | 35.817961 |
| std | 104.506364 | 86.279342 | 63.997004 | 21.354219 | 5.973841 | 77.753954 | 80.17598 | 63.169912 | 16.705742 |
| min | 102.0 | 0.0 | 0.0 | 121.8 | 0.0 | 801.0 | 594.0 | 1.0 | 2.33 |
| 25% | 192.375 | 0.0 | 0.0 | 164.9 | 0.0 | 932.0 | 730.95 | 7.0 | 23.71 |
| 50% | 272.9 | 22.0 | 0.0 | 185.0 | 6.4 | 968.0 | 779.5 | 28.0 | 34.445 |
| 75% | 350.0 | 142.95 | 118.3 | 192.0 | 10.2 | 1029.4 | 824.0 | 56.0 | 46.135 |
| max | 540.0 | 359.4 | 200.1 | 247.0 | 32.2 | 1145.0 | 992.6 | 365.0 | 82.6 |


In [6]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

The data looks very clean and is ready to be used to build our model.

#### Split data into predictors and target

The target variable in this problem is the concrete sample strength. Therefore, our predictors will be all the other columns.

In [7]:
concrete_data_columns = concrete_data.columns

predictors = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = concrete_data['Strength'] # Strength column

<a id="item2"></a>

Let's do a quick sanity check of the predictors dataframe and the target series.

In [8]:
predictors.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 |


In [9]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

In [14]:
n_cols = predictors.shape[1] # number of predictors

## Import Keras

In [15]:
import keras

Let's import the rest of the packages from the Keras library that we will need to build our regression model.

In [16]:
from keras.models import Sequential
from keras.layers import Dense

<hr>

## A. Build a baseline model

Use the Keras library to build a neural network with the following:

- One hidden layer of 10 nodes, and a ReLU activation function

- Use the adam optimizer and the mean squared error as the loss function.

1. Randomly split the data into training and test sets by holding out 30% of the data for testing. You can use the train_test_split helper function from Scikit-learn.

2. Train the model on the training data using 50 epochs.

3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.

4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

5. Report the mean and the standard deviation of the mean squared errors.
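
The five steps above can be sketched as a single loop. This is a minimal sketch rather than the notebook's implementation: `repeated_holdout_mse`, `build_model`, and the stand-in `MeanRegressor` are hypothetical names, the split uses a NumPy permutation instead of `train_test_split`, and a fresh model is built in every repetition so that no weights carry over between splits.

```python
import numpy as np


class MeanRegressor:
    """Hypothetical stand-in with a Keras-like fit/predict interface.

    It always predicts the mean of the training targets, which keeps the
    sketch runnable without Keras.
    """

    def fit(self, X, y, epochs=1, verbose=0):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full((len(X), 1), self.mean_)


def repeated_holdout_mse(build_model, X, y, n_repeats=50, test_size=0.3,
                         epochs=50, seed=0):
    """Repeat steps 1-3 n_repeats times and return the list of test MSEs."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_test = int(round(test_size * len(X)))
    mse_list = []
    for _ in range(n_repeats):
        # Step 1: random split, holding out test_size of the rows
        idx = rng.permutation(len(X))
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        # Step 2: train a freshly built model on the training rows only
        model = build_model()
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        # Step 3: mean squared error on the held-out rows
        y_pred = np.asarray(model.predict(X[test_idx])).ravel()
        mse_list.append(float(np.mean((y[test_idx] - y_pred) ** 2)))
    return mse_list


# Synthetic data with the same number of predictors (8) as the concrete data
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 8))
y_demo = X_demo @ np.arange(1.0, 9.0) + rng.normal(size=100)

# Steps 4 and 5: collect the MSEs, then report their mean and standard deviation
mse_demo = repeated_holdout_mse(MeanRegressor, X_demo, y_demo, n_repeats=5)
print("Mean MSE:", np.mean(mse_demo))
print("Standard Deviation of MSE:", np.std(mse_demo))
```

With Keras installed, passing a model-building function such as `regression_model_A` as `build_model` should train a new network for each split, since the sketch only relies on `fit` and `predict`.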

In [17]:
# define regression model
def regression_model_A():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
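
To make the architecture concrete, the forward pass of such a network can be written out in NumPy. This is a rough sketch with randomly drawn weights, not a description of Keras internals; `relu`, `forward`, `W1`, `b1`, `W2`, and `b2` are illustrative names.

```python
import numpy as np


def relu(z):
    # ReLU keeps positive values and replaces negative ones with zero
    return np.maximum(z, 0.0)


def forward(X, W1, b1, W2, b2):
    """One hidden ReLU layer followed by a linear output unit."""
    hidden = relu(X @ W1 + b1)   # shape (n_samples, 10)
    return hidden @ W2 + b2      # shape (n_samples, 1), no activation


rng = np.random.default_rng(0)
n_cols = 8                                   # number of predictors
W1, b1 = rng.normal(size=(n_cols, 10)), np.zeros(10)
W2, b2 = rng.normal(size=(10, 1)), np.zeros(1)

X_demo = rng.normal(size=(5, n_cols))        # five made-up samples
y_hat = forward(X_demo, W1, b1, W2, b2)
print(y_hat.shape)  # (5, 1)
```

The output layer has no activation because the target, concrete strength, is an unbounded real number.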

### Train and Test

Let's call the function now to create our model.

In [18]:
# build the model
modelA = regression_model_A()

Let's create a function that performs steps 1 to 3 and returns the MSE.

In [76]:
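def calculate_mseA():
    # Step 1: random 70/30 split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3)

    # Step 2: train for 50 epochs
    # modelA is the global model built above, so its weights carry over between calls
    modelA.fit(X_train, y_train, epochs=50, verbose=0)
    
    # Step 3: predict on the held-out test set
    y_pred = modelA.predict(X_test)

    # mean squared error (MSE) between actual and predicted strength
    mse = sklearn.metrics.mean_squared_error(y_test, y_pred)
    return mse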
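
For reference, the mean squared error is the average of the squared differences between targets and predictions. The sketch below computes it by hand in NumPy on made-up numbers; the helper name `mse` is illustrative, and the flattening step assumes predictions arrive as a column vector of shape `(n, 1)`, as Keras returns for a single output unit.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    # Column-vector predictions have shape (n, 1); flatten them so the
    # subtraction is element-wise instead of broadcasting to (n, n)
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))


y_true = [3.0, 5.0, 7.0]
y_pred = [[2.0], [5.0], [9.0]]   # column vector, like model.predict output
print(mse(y_true, y_pred))       # (1 + 0 + 4) / 3
```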

Now we will calculate the MSE 50 times and display the mean and the standard deviation.

In [77]:
mseA_list = []
for i in range(50):
    mseA_list.append(calculate_mseA())

print("Mean MSE: ", np.mean(mseA_list))
print("Standard Deviation of MSE: ", np.std(mseA_list))

Mean MSE:  2501.206649823525
Standard Deviation of MSE:  12012.153709424216


<hr>

## B. Normalize the data

To normalize the data, we subtract each column's mean and divide by its standard deviation.

In [80]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 0.862735 | -1.217079 | -0.279597 |
| 1 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 1.055651 | -1.217079 | -0.279597 |
| 2 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 3.55134 |
| 3 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 5.055221 |
| 4 | -0.790075 | 0.678079 | -0.846733 | 0.488555 | -1.038638 | 0.070492 | 0.647569 | 4.976069 |
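
The normalization used here is a z-score transform, and its effect can be checked on a small array. This is a sketch on made-up values, written in NumPy rather than pandas; note that `DataFrame.std()` in pandas defaults to the sample standard deviation (`ddof=1`), while NumPy's `std` defaults to `ddof=0`, so `ddof=1` is passed explicitly to match.

```python
import numpy as np

# Made-up values standing in for two predictor columns
X = np.array([[540.0, 28.0],
              [332.5, 270.0],
              [198.6, 360.0],
              [266.0, 90.0]])

# ddof=1 gives the sample standard deviation, which is the pandas default
# for DataFrame.std(); NumPy's own default is ddof=0
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(X_norm.mean(axis=0))          # approximately 0 for every column
print(X_norm.std(axis=0, ddof=1))   # approximately 1 for every column
```

After the transform every column is on the same scale, so no single predictor dominates the weighted sums in the first layer just because of its units.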


Let's save the number of predictors to *n_cols* since we will need this number when building our network.

In [81]:
n_cols = predictors_norm.shape[1] # number of predictors

In [82]:
print(n_cols)

8


Repeat Part A but using a normalized version of the data.

Let's create a function to calculate the MSE with the normalized data.

In [83]:
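def calculate_mseB():
    # Step 1: random 70/30 split of the normalized predictors
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)

    # Step 2: train for 50 epochs
    # modelA was already trained in Part A, so training continues from those weights
    modelA.fit(X_train, y_train, epochs=50, verbose=0)
    
    # Step 3: predict on the held-out test set
    y_pred = modelA.predict(X_test)

    # mean squared error (MSE) between actual and predicted strength
    mse = sklearn.metrics.mean_squared_error(y_test, y_pred)
    return mse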

Now we will calculate the MSE 50 times and display the mean and the standard deviation.

In [84]:
mseB_list = []
for i in range(50):
    mseB_list.append(calculate_mseB())

print("Mean MSE: ", np.mean(mseB_list))
print("Standard Deviation of MSE: ", np.std(mseB_list))

Mean MSE:  27.110611535716966
Standard Deviation of MSE:  2.0640459131618103


#### The results show that simply normalizing the data lowers both the mean and the standard deviation of the mean squared errors substantially compared with step A.
##### Step A: 

In [85]:
print("Mean MSE: ", np.mean(mseA_list))
print("Standard Deviation of MSE: ", np.std(mseA_list))

Mean MSE:  2501.206649823525
Standard Deviation of MSE:  12012.153709424216


##### Step B: 

In [86]:
print("Mean MSE: ", np.mean(mseB_list))
print("Standard Deviation of MSE: ", np.std(mseB_list))

Mean MSE:  27.110611535716966
Standard Deviation of MSE:  2.0640459131618103


<hr>

## C. Increase the number of epochs

Repeat Part B but using 100 epochs this time for training.

In [88]:
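def calculate_mseC():
    # Step 1: random 70/30 split of the normalized predictors
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)

    # Step 2: train for 100 epochs
    # modelA continues from the weights left by Parts A and B
    modelA.fit(X_train, y_train, epochs=100, verbose=0)
    
    # Step 3: predict on the held-out test set
    y_pred = modelA.predict(X_test)

    # mean squared error (MSE) between actual and predicted strength
    mse = sklearn.metrics.mean_squared_error(y_test, y_pred)
    return mse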

Now we will calculate the MSE 50 times and display the mean and the standard deviation.

In [89]:
mseC_list = []
for i in range(50):
    mseC_list.append(calculate_mseC())

print("Mean MSE: ", np.mean(mseC_list))
print("Standard Deviation of MSE: ", np.std(mseC_list))

Mean MSE:  26.95706559423414
Standard Deviation of MSE:  2.177341992758622


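#### The results show that increasing the number of epochs lowers the mean of the mean squared errors only slightly compared with step B (26.96 versus 27.11), a difference much smaller than the standard deviation of about 2 in either step.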
##### Step B: 

In [90]:
print("Mean MSE: ", np.mean(mseB_list))
print("Standard Deviation of MSE: ", np.std(mseB_list))

Mean MSE:  27.110611535716966
Standard Deviation of MSE:  2.0640459131618103


##### Step C: 

In [91]:
print("Mean MSE: ", np.mean(mseC_list))
print("Standard Deviation of MSE: ", np.std(mseC_list))

Mean MSE:  26.95706559423414
Standard Deviation of MSE:  2.177341992758622
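
To put the gap between steps B and C in perspective, the difference in mean MSE can be compared with the spread of the runs. The sketch below uses only the numbers printed above; treating the 50 runs as independent draws is an assumption made for the standard-error estimate.

```python
import math

# Values copied from the outputs of steps B and C
mean_b, std_b = 27.110611535716966, 2.0640459131618103
mean_c, std_c = 26.95706559423414, 2.177341992758622
n_runs = 50

diff = mean_b - mean_c
# Standard error of each mean, treating the 50 MSEs as independent draws
# (an assumption; the same model object was reused across runs)
se_b = std_b / math.sqrt(n_runs)
se_c = std_c / math.sqrt(n_runs)

print(f"difference in mean MSE: {diff:.3f}")
print(f"difference / std of B:  {diff / std_b:.3f}")
print(f"standard errors:        {se_b:.3f} (B), {se_c:.3f} (C)")
```

The difference comes out smaller than either standard error, so on these numbers alone the improvement from doubling the epochs is hard to distinguish from run-to-run noise.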


<hr>

## D. Increase the number of hidden layers

Repeat part B but use a neural network with the following instead:

- Three hidden layers, each of 10 nodes and ReLU activation function.

In [92]:
# define regression model
def regression_model_D():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))  # first hidden layer, fed by the n_cols predictors
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
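
One way to see how much larger this network is than the baseline is to count its trainable parameters. A fully connected layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases; the helper name `dense_param_count` below is illustrative.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    for n_units in layer_sizes:
        total += n_inputs * n_units + n_units   # weights + biases
        n_inputs = n_units
    return total


n_cols = 8  # number of predictors
print(dense_param_count(n_cols, [10, 1]))          # one hidden layer: 101
print(dense_param_count(n_cols, [10, 10, 10, 1]))  # three hidden layers: 321
```

The three-hidden-layer model therefore has roughly three times as many parameters to fit on the same data.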

In [93]:
# build the model
modelD = regression_model_D()

Let's create a function to calculate the MSE with the normalized data.

In [94]:
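def calculate_mseD():
    # Step 1: random 70/30 split of the normalized predictors
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)

    # Step 2: train for 50 epochs
    # modelD is built once above, so its weights carry over between calls
    modelD.fit(X_train, y_train, epochs=50, verbose=0)
    
    # Step 3: predict on the held-out test set
    y_pred = modelD.predict(X_test)

    # mean squared error (MSE) between actual and predicted strength
    mse = sklearn.metrics.mean_squared_error(y_test, y_pred)
    return mse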

Now we will calculate the MSE 50 times and display the mean and the standard deviation.

In [95]:
mseD_list = []
for i in range(50):
    mseD_list.append(calculate_mseD())

print("Mean MSE: ", np.mean(mseD_list))
print("Standard Deviation of MSE: ", np.std(mseD_list))

Mean MSE:  33.458512849064526
Standard Deviation of MSE:  11.422137019830704


#### The results show that increasing the number of hidden layers raises both the mean and the standard deviation of the mean squared errors compared with step B, which may be a sign of overfitting.
##### Step B: 

In [96]:
print("Mean MSE: ", np.mean(mseB_list))
print("Standard Deviation of MSE: ", np.std(mseB_list))

Mean MSE:  27.110611535716966
Standard Deviation of MSE:  2.0640459131618103


##### Step D: 

In [97]:
print("Mean MSE: ", np.mean(mseD_list))
print("Standard Deviation of MSE: ", np.std(mseD_list))

Mean MSE:  33.458512849064526
Standard Deviation of MSE:  11.422137019830704


<hr>

Gaizka Albestain Irineo