<h1 align=center><font size = 5>Regression Model with Keras, A Project</font></h1>

## Introduction

In this project we use the concrete data set downloaded from https://cocl.us/concrete_data, build a regression model with the Keras library, and report the findings for the following experiments. <br>
A. Build a baseline model: one hidden layer with 10 nodes, ReLU activation, Adam optimizer, mean squared error loss, 30% of the data held out for testing, 50 epochs. <br>
B. Normalize the data and repeat the process. <br>
C. Increase the epochs to 100 (normalized data) and repeat. <br>
D. Increase the hidden layers to 3 (normalized data, 50 epochs) and repeat.
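
The four experiments differ in only a few settings. The following is a minimal sketch of how they could map onto keyword arguments of the `regression_model` function defined later in this notebook; the names `EXPERIMENTS` and `SHARED` are hypothetical and not part of the original code.

```python
# Per-experiment overrides. The keys match parameters of regression_model;
# the dictionary names themselves are illustrative assumptions.
EXPERIMENTS = {
    "A": {"normalized": False, "epochs": 50, "hidden_layers": 1},
    "B": {"normalized": True, "epochs": 50, "hidden_layers": 1},
    "C": {"normalized": True, "epochs": 100, "hidden_layers": 1},
    "D": {"normalized": True, "epochs": 50, "hidden_layers": 3},
}

# Settings shared by all four experiments.
SHARED = {
    "nodes_per_layer": 10,
    "activation_fn": "relu",
    "optimizer": "adam",
    "loss": "mean_squared_error",
    "test_size": 0.3,
    "runs": 50,
}

for part, overrides in EXPERIMENTS.items():
    # Merge the shared settings with the overrides for this experiment.
    settings = {**SHARED, **overrides}
    print(part, settings)
```

Keeping the overrides next to each other makes it easy to see that only one setting changes between B and C, and between B and D.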


## Download and Clean Dataset

In [3]:
import pandas as pd
import numpy as np

# Download the data to local filesystem.
!wget  https://cocl.us/concrete_data

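# Read the data into a pandas dataframe.
# Note: if a file named concrete_data already exists, wget saves the new
# download as concrete_data.1, and the line below reads the earlier copy.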
concrete_data = pd.read_csv('concrete_data')
concrete_data.head()

--2020-05-24 11:18:53--  https://cocl.us/concrete_data
Resolving cocl.us (cocl.us)... 158.85.108.83, 158.85.108.86, 169.48.113.194
Connecting to cocl.us (cocl.us)|158.85.108.83|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv [following]
--2020-05-24 11:18:55--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58988 (58K) [text/csv]
Saving to: ‘concrete_data.1’


2020-05-24 11:18:56 (1.95 MB/s) - ‘concrete_data.1’ saved [58988/58988]



| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.3 |


Let us look at the dimensions of the data

In [4]:
concrete_data.shape

(1030, 9)

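Check the data for any missing/null values. In the summary statistics, a `count` equal to the number of rows (1030) in every column indicates that no values are missing.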

In [5]:
concrete_data.describe()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| count | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 |
| mean | 281.167864 | 73.895825 | 54.18835 | 181.567282 | 6.20466 | 972.918932 | 773.580485 | 45.662136 | 35.817961 |
| std | 104.506364 | 86.279342 | 63.997004 | 21.354219 | 5.973841 | 77.753954 | 80.17598 | 63.169912 | 16.705742 |
| min | 102.0 | 0.0 | 0.0 | 121.8 | 0.0 | 801.0 | 594.0 | 1.0 | 2.33 |
| 25% | 192.375 | 0.0 | 0.0 | 164.9 | 0.0 | 932.0 | 730.95 | 7.0 | 23.71 |
| 50% | 272.9 | 22.0 | 0.0 | 185.0 | 6.4 | 968.0 | 779.5 | 28.0 | 34.445 |
| 75% | 350.0 | 142.95 | 118.3 | 192.0 | 10.2 | 1029.4 | 824.0 | 56.0 | 46.135 |
| max | 540.0 | 359.4 | 200.1 | 247.0 | 32.2 | 1145.0 | 992.6 | 365.0 | 82.6 |


In [6]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

So, our data looks clean

#### Now, let us prepare our data.

In [7]:
# Split the data into predictors and target.

concrete_data_columns = concrete_data.columns

predictors = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = concrete_data['Strength'] # Strength column

Sanity check on predictors and target.

In [8]:
predictors.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 |


In [9]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

In [12]:
# Also, get our normalized predictors
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 0.862735 | -1.217079 | -0.279597 |
| 1 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 1.055651 | -1.217079 | -0.279597 |
| 2 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 3.55134 |
| 3 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 5.055221 |
| 4 | -0.790075 | 0.678079 | -0.846733 | 0.488555 | -1.038638 | 0.070492 | 0.647569 | 4.976069 |
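
The normalization subtracts each column's mean and divides by its standard deviation (pandas `.std()` uses the sample standard deviation, the same value that `describe()` reports). As a quick hand check, the sketch below recomputes two entries of the first normalized row from the summary statistics shown earlier; the helper name `z_score` is hypothetical.

```python
# Column statistics copied from the describe() output earlier in this notebook.
cement_mean, cement_std = 281.167864, 104.506364
age_mean, age_std = 45.662136, 63.169912


def z_score(value, mean, std):
    """Standardize one value: subtract the column mean, divide by the column std."""
    return (value - mean) / std


# First row of the data set: Cement = 540.0, Age = 28.
cement_z = z_score(540.0, cement_mean, cement_std)
age_z = z_score(28, age_mean, age_std)

print(round(cement_z, 4), round(age_z, 4))  # 2.4767 -0.2796
```

Both values agree with the first row of `predictors_norm.head()`.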


In [14]:
n_cols = predictors_norm.shape[1] # number of predictors

## Build a Model.

In [13]:
# Splitting and eval functions from Scikit Learn.
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Keras for model building.
import keras

from keras.models import Sequential
from keras.layers import Dense

In [17]:
# Define a regression function that can be called based on multiple parameters
# Returns a List with MSE for all the runs

def regression_model(hidden_layers=1, nodes_per_layer=10, activation_fn='relu',
                    optimizer='adam', loss='mean_squared_error', n_cols=1,
                    epochs=50, test_size=0.3, normalized=True, runs=50):
    
    # Predictor dataset plain or normalized.
    X = predictors_norm if normalized else predictors
    y = target
    
    # List to hold Mean Squared Error for all runs.
    mse = []
    
    # Loop over runs and capture the MSE.
    
    for i in range(runs):
        
        #Split Data to Train and Test Set
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size)

        #Create model
        model = Sequential()
        
        # Add hidden layers
        for j in range(hidden_layers):
            if j == 0 : # First time.
                model.add(Dense(nodes_per_layer, activation=activation_fn, input_shape=(n_cols,)))
            else:
                model.add(Dense(nodes_per_layer, activation=activation_fn))
        
        # Output layer.
        model.add(Dense(1))

        #Compile model
        model.compile(optimizer=optimizer, loss=loss)

        #fit the model
        model.fit(X_train, y_train, epochs=epochs, verbose=0)

        #predict output
        y_hat = model.predict(X_test)
    
        ## Append our MSE value to the list, for this run
        mse.append(mean_squared_error(y_test, y_hat))

    # Pass the results to the caller.
    return mse
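
Each run is scored with the mean squared error, the average of the squared differences between the true and predicted strengths. The pure-Python sketch below illustrates that definition; the function name `mean_squared_error_py` and the predicted values are made up for illustration, and the notebook itself uses `sklearn.metrics.mean_squared_error`.

```python
def mean_squared_error_py(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Strength values from the first three rows of the data set,
# paired with hypothetical predictions.
y_true = [79.99, 61.89, 40.27]
y_pred = [75.0, 65.0, 40.0]

print(round(mean_squared_error_py(y_true, y_pred), 2))  # 11.55
```

Because the errors are squared, the result is in squared units of the target, so a single large miss weighs heavily on the score.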

## Answers to the Questions.

In [18]:
### Part A

## Call our regression model with un-normalized data and the rest of the parameters as defaults.
##  Defaults are epochs = 50, runs = 50, test = 30% i.e. 0.3, activation = relu, optimizer = adam
mse = regression_model(n_cols=n_cols, normalized=False)

print('mse Mean: {:.2f}'.format(np.mean(mse)))
print('mse StdDev: {:.2f}'.format(np.std(mse)))

mse Mean: 410.77
mse StdDev: 534.61
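
A standard deviation larger than the mean, for a quantity that cannot be negative, suggests that a few runs produced very large errors. The sketch below uses made-up MSE values (not results from this notebook) to show how a single poorly converged run inflates both the mean and the standard deviation while the median stays stable.

```python
import statistics

# Hypothetical MSE values from five runs; the last run failed to converge.
run_mse = [150.0, 160.0, 155.0, 145.0, 1500.0]

mean_mse = statistics.mean(run_mse)
median_mse = statistics.median(run_mse)
# pstdev divides by n, which matches the default behaviour of np.std.
std_mse = statistics.pstdev(run_mse)

print(f"mean:   {mean_mse:.2f}")
print(f"median: {median_mse:.2f}")
print(f"std:    {std_mse:.2f}")
```

Reporting the median alongside the mean is one way to tell whether a high average reflects most runs or only a few outliers.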


In [19]:
### Part B

## Call our regression model with normalized data and rest of the parameters as defaults.
##  Defaults are epochs = 50, runs = 50, test = 30% i.e. 0.3, activation = relu, optimizer = adam
##  Normalized data.
mse = regression_model(n_cols=n_cols, normalized=True)

print('mse Mean: {:.2f}'.format(np.mean(mse)))
print('mse StdDev: {:.2f}'.format(np.std(mse)))

mse Mean: 354.83
mse StdDev: 90.86


<b> Question: How does the mean of the mean squared errors compare to that from Step A? </b> <br>
<b> Answer: </b><br>
The mean of the mean squared errors over 50 runs decreased by 13.62% with normalized data (from 410.77 to 354.83), and the standard deviation across runs fell from 534.61 to 90.86. <br>
From this we infer that the model trains better, and far more consistently, on normalized data. <br>
This agrees with the common observation that neural networks tend to train more reliably when the inputs are on similar scales.
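
One way to put these numbers in context, which is not part of the original assignment, is to compare them with a trivial model that always predicts the mean strength. Its MSE is approximately the variance of the target, which can be estimated from the standard deviation of `Strength` reported by `describe()`. The variable names below are illustrative.

```python
# Values copied from this notebook's outputs.
strength_std = 16.705742  # std of Strength from describe()
mse_part_a = 410.77       # mean MSE with raw predictors
mse_part_b = 354.83       # mean MSE with normalized predictors

# A model that always predicts the mean strength has an MSE close to the
# variance of the target; the exact value depends on the train/test split.
baseline_mse = strength_std ** 2

for label, value in [("A", mse_part_a), ("B", mse_part_b)]:
    verdict = "better" if value < baseline_mse else "worse"
    print(f"Part {label}: {value:.2f} is {verdict} than the baseline {baseline_mse:.2f}")
```

Under this rough comparison, both results are above the baseline of about 279, which hints that one hidden layer trained for 50 epochs has not yet learned much from the data.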

In [20]:
### Part C

## Call our regression model with normalized data and the rest of the parameters as defaults.
##  Defaults are runs = 50, test = 30% i.e. 0.3, activation = relu, optimizer = adam
##  Normalized data, 100 epochs.
mse = regression_model(n_cols=n_cols, normalized=True, epochs=100)

print('mse Mean: {:.2f}'.format(np.mean(mse)))
print('mse StdDev: {:.2f}'.format(np.std(mse)))

mse Mean: 166.33
mse StdDev: 15.87


<b> Question: How does the mean of the mean squared errors compare to that from Step B? </b> <br>
<b> Answer:</b><br>
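The mean of the mean squared errors decreased by about 53.12% compared to Step B (from 354.83 to 166.33) after doubling the epochs from 50 to 100, and the standard deviation across runs fell from 90.86 to 15.87. <br>
Our inference is that 50 epochs were not enough for this model to converge, so the additional training epochs helped here.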
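
To judge whether the drop from Step B to Step C is larger than the run-to-run noise, a rough check is to compare the difference of the means with its standard error. The sketch below treats the 50 runs as independent samples, which is only an approximation because every run resamples the same data set; the helper name `standard_error` is hypothetical.

```python
import math

runs = 50
# (mean MSE, std of MSE) copied from the outputs of Parts B and C.
results = {"B": (354.83, 90.86), "C": (166.33, 15.87)}


def standard_error(std, n):
    """Approximate standard error of a mean computed from n runs."""
    return std / math.sqrt(n)


se_b = standard_error(results["B"][1], runs)
se_c = standard_error(results["C"][1], runs)

difference = results["B"][0] - results["C"][0]
combined_se = math.sqrt(se_b ** 2 + se_c ** 2)
ratio = difference / combined_se

print(f"difference of means: {difference:.2f}")
print(f"combined standard error: {combined_se:.2f}")
print(f"difference / standard error: {ratio:.1f}")
```

A ratio well above 2 or 3 suggests the improvement is unlikely to be an artifact of the random splits alone.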


In [22]:
### Part D

## Call our regression model with normalized data and rest of the parameters as defaults.
##  Defaults are epochs = 50, runs = 50, test = 30% i.e. 0.3, optimizer = adam
##  3 hidden layers, normalized, relu activation function, 10 nodes per layer
mse = regression_model(n_cols=n_cols, normalized=True, hidden_layers=3, 
                       activation_fn='relu', nodes_per_layer=10)

print('mse Mean: {:.2f}'.format(np.mean(mse)))
print('mse StdDev: {:.2f}'.format(np.std(mse)))

mse Mean: 126.61
mse StdDev: 18.98


<b> Question: How does the mean of the mean squared errors compare to that from Step B? </b> <br>
<b> Answer:</b><br>
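The mean of the mean squared errors decreased by about 64.32% compared to Step B (from 354.83 to 126.61) after increasing the number of hidden layers from 1 to 3, with the epochs kept at 50. <br>
Our inference is that, for this data set, the deeper network fits better within the same number of epochs; this need not hold for every problem.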
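
The percentages quoted in the answers can be reproduced from the reported means. A small sketch, with the helper name `percent_decrease` chosen here for illustration:

```python
def percent_decrease(old, new):
    """Relative decrease from old to new, as a percentage of old."""
    return (old - new) / old * 100


# Mean MSE values copied from the outputs of Parts A to D.
mean_mse = {"A": 410.77, "B": 354.83, "C": 166.33, "D": 126.61}

# Each pair is (reference step, step being compared against it).
comparisons = [("A", "B"), ("B", "C"), ("B", "D")]

for old, new in comparisons:
    change = percent_decrease(mean_mse[old], mean_mse[new])
    print(f"{old} -> {new}: {change:.2f}% lower")
```

This prints decreases of 13.62%, 53.12%, and 64.32% for the three comparisons.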
