# Final Project - Build a Regression Model in Keras
By Watts Dietrich

This is the final project for the course titled "Introduction to Deep Learning & Neural Networks with Keras," a part of the IBM AI Engineering Professional Certificate on Coursera. This project uses neural networks to study a concrete dataset. The dataset contains information on the quantities of materials used in a given batch of concrete, the age of the concrete, and the measured strength of the concrete. The goal of this exercise is to train neural networks to predict the strength of a batch given its other characteristics.

In [20]:
import pandas as pd
import numpy as np
import sklearn
from sklearn import model_selection, metrics  # metrics provides mean_squared_error, used below
import keras
from keras.models import Sequential
from keras.layers import Dense
import statistics

# Part A
The concrete dataset is imported and split once into training and testing sets (70/30). A neural network with one hidden layer of 10 nodes is built and trained on the training set, and its mean squared error (MSE) is calculated on the testing set. Building, training, and scoring are repeated 50 times on the same split. The average and standard deviation of the 50 MSE values are then calculated and displayed.
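
For reference, the mean squared error used throughout is the average of the squared differences between measured and predicted strength. A minimal sketch of that calculation follows; the function name `mse_by_hand` and the three sample values are made up for illustration and are not taken from the dataset.

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical values, for illustration only
y_true_demo = [40.0, 50.0, 60.0]
y_pred_demo = [42.0, 47.0, 60.0]
mse_demo = mse_by_hand(y_true_demo, y_pred_demo)  # (4 + 9 + 0) / 3
print(mse_demo)
```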

In [2]:
# Read dataset into a dataframe
df = pd.read_csv('https://cocl.us/concrete_data')
df.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.3 |


In [3]:
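# Split data into training and testing sets with a 70/30 split. x contains predictors, y the target.
# The columns= keyword replaces the positional axis argument, which recent pandas versions no longer accept.
x = np.array(df.drop(columns=["Strength"]))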
y = np.array(df["Strength"])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size = 0.3)
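
Because `train_test_split` shuffles the rows at random, each run of this cell yields a different split, and therefore different MSE values downstream. `train_test_split` accepts a `random_state` argument that fixes the shuffle. The NumPy-only sketch below shows the same idea of a seeded 70/30 split; `seeded_split` and the seed value 42 are illustrative choices, not part of the original notebook.

```python
import numpy as np

def seeded_split(x, y, test_size=0.3, seed=42):
    """Shuffle row indices with a fixed seed, then cut into train and test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    n_test = int(round(len(x) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Hypothetical stand-in data: 10 rows, 2 predictor columns
x_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)
first = seeded_split(x_demo, y_demo)
second = seeded_split(x_demo, y_demo)
print(first[0].shape, first[1].shape)
```

Calling the function twice with the same seed returns the same rows, which makes later comparisons between experiments easier to interpret.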

In [4]:
# Get number of predictor columns. Used for defining input_shape in the model.
n_cols = x.shape[1]

In [5]:
# Define a function to build a model with one hidden layer of 10 nodes, relu activation function, adam optimizer, mse loss function
def build_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

In [6]:
# List to store mse values of the 50 models
mse_list = []

In [64]:
# Train 50 models. The MSE of each is calculated and appended to mse_list
for i in range(50):
    model = build_model()
    model.fit(x_train, y_train, epochs=50, verbose=0)    
    predictions = model.predict(x_test)
    mse = sklearn.metrics.mean_squared_error(y_test, predictions)
    mse_list.append(mse)   
    print("Model", i, "training complete. The MSE was:", mse)

Model 0 training complete. The MSE was: 193.80246059230902
Model 1 training complete. The MSE was: 109.73453216396886
Model 2 training complete. The MSE was: 221.5516158704487
Model 3 training complete. The MSE was: 1676.1986894496026
Model 4 training complete. The MSE was: 1154.95239368909
Model 5 training complete. The MSE was: 222.05402248378294
Model 6 training complete. The MSE was: 137.4642788191713
Model 7 training complete. The MSE was: 107.21647397059424
Model 8 training complete. The MSE was: 86.37462910069644
Model 9 training complete. The MSE was: 195.65405411115236
Model 10 training complete. The MSE was: 121.78432329303307
Model 11 training complete. The MSE was: 157.644750262606
Model 12 training complete. The MSE was: 352.80734351608805
Model 13 training complete. The MSE was: 114.31753617283162
Model 14 training complete. The MSE was: 255.4921056761277
Model 15 training complete. The MSE was: 268.4489207071769
Model 16 training complete. The MSE was: 169.8226917826427


In [67]:
# Determine mean and standard deviation of the MSE values of the 50 trained models
mse_mean = statistics.mean(mse_list)
mse_stdev = statistics.stdev(mse_list)

In [68]:
# Report the mean and standard deviation of the MSE values for the 50 models
print("The average MSE of the 50 models was", mse_mean, "\nThe standard deviation of the MSE values was", mse_stdev)

The average MSE of the 50 models was 328.69927286622783 
The standard deviation of the MSE values was 362.6143885248151


### The 50 trained models had an average MSE of 328.7, with a standard deviation of 362.6.
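
A standard deviation larger than the mean suggests that a few runs with very large errors dominate the summary (Model 3 and Model 4 in the output above, for example). As a sketch of how to check this, the snippet below compares the mean and the median on the first ten MSE values printed above, rounded to two decimals. Using the median here is a suggestion, not something the original analysis does.

```python
import statistics

# First ten MSE values from the Part A output above, rounded to two decimals
first_ten_mse = [193.80, 109.73, 221.55, 1676.20, 1154.95,
                 222.05, 137.46, 107.22, 86.37, 195.65]

mean_first_ten = statistics.mean(first_ten_mse)
median_first_ten = statistics.median(first_ten_mse)
stdev_first_ten = statistics.stdev(first_ten_mse)

# The median is much less sensitive to the two outlying runs than the mean is
print("mean:", mean_first_ten, "median:", median_first_ten, "stdev:", stdev_first_ten)
```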

# Part B

A repeat of Part A, but using normalized data: each predictor column has its mean subtracted and is then divided by its standard deviation.

In [8]:
# Create a new dataframe with normalized predictor data
df_norm = df.drop(columns=["Strength"])
df_norm = (df_norm - df_norm.mean()) / df_norm.std()
df_norm.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 0.862735 | -1.217079 | -0.279597 |
| 1 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 1.055651 | -1.217079 | -0.279597 |
| 2 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 3.55134 |
| 3 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 5.055221 |
| 4 | -0.790075 | 0.678079 | -0.846733 | 0.488555 | -1.038638 | 0.070492 | 0.647569 | 4.976069 |


In [9]:
# Split the normalized data into training and testing sets
x_norm = np.array(df_norm)
y_norm = np.array(df["Strength"])
x_norm_train, x_norm_test, y_norm_train, y_norm_test = sklearn.model_selection.train_test_split(x_norm, y_norm, test_size = 0.3)
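
Note that the normalization above uses the mean and standard deviation of the full dataset, so the test rows contribute to the scaling statistics. A common alternative is to compute the statistics on the training rows only and reuse them on the test rows. The sketch below shows that variant with NumPy; the function name and the random demo data are hypothetical.

```python
import numpy as np

def standardize_with_train_stats(x_train, x_test):
    """Scale both sets using the mean and standard deviation of the training set only."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0, ddof=1)  # ddof=1 matches the pandas .std() default
    return (x_train - mean) / std, (x_test - mean) / std

# Hypothetical stand-in data with a 70/30 row split and 3 predictor columns
rng = np.random.default_rng(0)
x_train_demo = rng.normal(loc=50.0, scale=10.0, size=(70, 3))
x_test_demo = rng.normal(loc=50.0, scale=10.0, size=(30, 3))
x_train_scaled, x_test_scaled = standardize_with_train_stats(x_train_demo, x_test_demo)
print(x_train_scaled.mean(axis=0))
```

With this approach the scaled training columns have zero mean and unit standard deviation, while the scaled test columns will only be approximately so.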

In [12]:
norm_mse_list = []

In [17]:
# Train 50 models. The MSE of each is calculated and appended to norm_mse_list
for i in range(50):
    model = build_model()
    model.fit(x_norm_train, y_norm_train, epochs=50, verbose=0)    
    norm_predictions = model.predict(x_norm_test)
    norm_mse = sklearn.metrics.mean_squared_error(y_norm_test, norm_predictions)
    norm_mse_list.append(norm_mse)   
    print("Model", i, "training complete. The MSE was:", norm_mse)

Model 0 training complete. The MSE was: 345.60706956271247
Model 1 training complete. The MSE was: 487.5831459534978
Model 2 training complete. The MSE was: 367.97605641035926
Model 3 training complete. The MSE was: 398.79264296377636
Model 4 training complete. The MSE was: 281.05877062727797
Model 5 training complete. The MSE was: 285.4125202200481
Model 6 training complete. The MSE was: 769.5818878322859
Model 7 training complete. The MSE was: 402.835437016738
Model 8 training complete. The MSE was: 246.77655694466088
Model 9 training complete. The MSE was: 416.728996789066
Model 10 training complete. The MSE was: 262.829240586475
Model 11 training complete. The MSE was: 298.76907787696206
Model 12 training complete. The MSE was: 502.5745447086934
Model 13 training complete. The MSE was: 294.6260976987307
Model 14 training complete. The MSE was: 396.07681972490536
Model 15 training complete. The MSE was: 325.0042371817007
Model 16 training complete. The MSE was: 320.0860781407084

In [21]:
# Determine mean and standard deviation of the MSE values of the 50 trained models
norm_mse_mean = statistics.mean(norm_mse_list)
norm_mse_stdev = statistics.stdev(norm_mse_list)

# Report the mean and standard deviation of the MSE values for the 50 models
print("Using normalized data, the average MSE of the 50 models was", norm_mse_mean, "\nThe standard deviation of the MSE values was", norm_mse_stdev)

Using normalized data, the average MSE of the 50 models was 388.0963619362353 
The standard deviation of the MSE values was 110.97330271682691


### Using normalized data, the average MSE of the 50 models rose from 328.7 (un-normalized) to 388.1, an increase of about 18%. Meanwhile, the standard deviation fell from 362.6 to 111.0, a drop of about 69%. Thus, while normalizing the data moderately increased the average MSE, it made the results far more consistent from one model to the next.

# Part C
A repeat of Part B, but with 100-epoch models instead of 50.
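
Each part of this project repeats the same build, fit, predict, and score loop with different settings. One way to avoid the duplication is a helper that takes the model-building function and the number of epochs as arguments. The sketch below is an assumption about how that could look, not code from the original notebook: `run_experiment` is a hypothetical name, and `MeanBaseline` is a stand-in object with Keras-like `fit` and `predict` methods so that the example runs without Keras.

```python
import statistics
import numpy as np

def run_experiment(build_fn, x_train, y_train, x_test, y_test, epochs, n_runs=50):
    """Train n_runs fresh models; return the mean and standard deviation of their test MSE."""
    mse_values = []
    for _ in range(n_runs):
        model = build_fn()
        model.fit(x_train, y_train, epochs=epochs, verbose=0)
        # Flatten (n, 1) predictions to (n,) so they line up with y_test
        predictions = np.asarray(model.predict(x_test)).ravel()
        mse_values.append(float(np.mean((y_test - predictions) ** 2)))
    return statistics.mean(mse_values), statistics.stdev(mse_values)

class MeanBaseline:
    """Hypothetical stand-in for a Keras model: always predicts the training mean."""
    def fit(self, x, y, epochs=1, verbose=0):
        self.mean_ = float(np.mean(y))

    def predict(self, x):
        return np.full((len(x), 1), self.mean_)  # column vector, like Keras output

# Tiny made-up arrays, for illustration only
x_tr, y_tr = np.zeros((3, 2)), np.array([1.0, 2.0, 3.0])
x_te, y_te = np.zeros((2, 2)), np.array([2.0, 4.0])
mean_demo, stdev_demo = run_experiment(MeanBaseline, x_tr, y_tr, x_te, y_te, epochs=1, n_runs=3)
print(mean_demo, stdev_demo)
```

With the notebook's own objects, a call such as `run_experiment(build_model, x_norm_train, y_norm_train, x_norm_test, y_norm_test, epochs=100)` would correspond to the loop in this part.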

In [19]:
epochs_mse_list = []
for i in range(50):
    model = build_model()
    model.fit(x_norm_train, y_norm_train, epochs=100, verbose=0)    
    norm_predictions = model.predict(x_norm_test)
    norm_mse = sklearn.metrics.mean_squared_error(y_norm_test, norm_predictions)
    epochs_mse_list.append(norm_mse)   
    print("Model", i, "training complete. The MSE was:", norm_mse)

Model 0 training complete. The MSE was: 157.71368096510608
Model 1 training complete. The MSE was: 171.65778595746272
Model 2 training complete. The MSE was: 179.2976087026357
Model 3 training complete. The MSE was: 165.33724437423814
Model 4 training complete. The MSE was: 172.7243604240898
Model 5 training complete. The MSE was: 166.65657272833516
Model 6 training complete. The MSE was: 157.96847147975558
Model 7 training complete. The MSE was: 170.96631666334434
Model 8 training complete. The MSE was: 160.25797977372991
Model 9 training complete. The MSE was: 162.8355443291307
Model 10 training complete. The MSE was: 166.68575130720268
Model 11 training complete. The MSE was: 176.96009477931756
Model 12 training complete. The MSE was: 181.34171917967387
Model 13 training complete. The MSE was: 155.7048055540575
Model 14 training complete. The MSE was: 177.322935927021
Model 15 training complete. The MSE was: 186.6529677102264
Model 16 training complete. The MSE was: 161.159091508354

In [22]:
# Determine mean and standard deviation of the MSE values of the 50 trained models
epochs_mse_mean = statistics.mean(epochs_mse_list)
epochs_mse_stdev = statistics.stdev(epochs_mse_list)

# Report the mean and standard deviation of the MSE values for the 50 models
print("Using normalized data, the average MSE of the 50 models was", epochs_mse_mean, "\nThe standard deviation of the MSE values was", epochs_mse_stdev)

Using normalized data, the average MSE of the 50 models was 168.6638302813069 
The standard deviation of the MSE values was 19.408553288277908


### Increasing the number of epochs in each model from 50 to 100 reduced the mean of the MSE values by about 57% and their standard deviation by about 83%.
### Part B (50 epochs):  Mean: 388.1  Std Dev: 111.0
### Part C (100 epochs): Mean: 168.7  Std Dev: 19.41

# Part D
A repeat of Part B, but using a network with three hidden layers instead of just one.

In [23]:
# Define a new function to build a model with three hidden layers of 10 nodes, relu activation function, adam optimizer, mse loss function
def build_layered_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
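
To give a sense of how much larger this network is, the number of trainable parameters in a stack of `Dense` layers can be counted by hand: each layer has one weight per input-output pair plus one bias per output. The sketch below assumes 8 predictor columns, which matches the `df_norm.head()` output above; `dense_param_count` is a hypothetical helper, not a Keras function.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases for fully connected layers applied in sequence."""
    total = 0
    for n_outputs in layer_sizes:
        total += n_inputs * n_outputs + n_outputs  # weights + biases
        n_inputs = n_outputs
    return total

n_cols_assumed = 8  # assumption: eight predictors, as in the tables above
one_hidden = dense_param_count(n_cols_assumed, [10, 1])
three_hidden = dense_param_count(n_cols_assumed, [10, 10, 10, 1])
print(one_hidden, three_hidden)
```

In Keras itself, `model.summary()` reports the same kind of count for a built model.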

In [24]:
# Run the 50 training and testing cycles
layered_mse_list = []
for i in range(50):
    model = build_layered_model()
    model.fit(x_norm_train, y_norm_train, epochs=50, verbose=0)    
    norm_predictions = model.predict(x_norm_test)
    norm_mse = sklearn.metrics.mean_squared_error(y_norm_test, norm_predictions)
    layered_mse_list.append(norm_mse)   
    print("Model", i, "training complete. The MSE was:", norm_mse)

Model 0 training complete. The MSE was: 122.38765899489438
Model 1 training complete. The MSE was: 128.88928024982383
Model 2 training complete. The MSE was: 127.8950908835769
Model 3 training complete. The MSE was: 131.22619151949866
Model 4 training complete. The MSE was: 139.67712863003862
Model 5 training complete. The MSE was: 115.27638589051901
Model 6 training complete. The MSE was: 141.74599099662953
Model 7 training complete. The MSE was: 137.1989957623677
Model 8 training complete. The MSE was: 146.2865685992328
Model 9 training complete. The MSE was: 130.05827981474556
Model 10 training complete. The MSE was: 135.61060388441777
Model 11 training complete. The MSE was: 124.76618593230401
Model 12 training complete. The MSE was: 137.80514452932533
Model 13 training complete. The MSE was: 136.82064701095024
Model 14 training complete. The MSE was: 133.34708542521568
Model 15 training complete. The MSE was: 124.28505937049795
Model 16 training complete. The MSE was: 115.12129941

In [25]:
# Determine mean and standard deviation of the MSE values of the 50 trained models
layered_mse_mean = statistics.mean(layered_mse_list)
layered_mse_stdev = statistics.stdev(layered_mse_list)

# Report the mean and standard deviation of the MSE values for the 50 models
print("Using normalized data, the average MSE of the 50 models was", layered_mse_mean, "\nThe standard deviation of the MSE values was", layered_mse_stdev)

Using normalized data, the average MSE of the 50 models was 131.54327006278828 
The standard deviation of the MSE values was 8.85787871658183


### Increasing the number of hidden layers in each model from one to three substantially reduced both the mean and standard deviation of the MSE values compared with Parts B and C.
### Part B (1 hidden layer, 50 epochs):  Mean: 388.1  Std Dev: 111.0
### Part C (1 hidden layer, 100 epochs): Mean: 168.7  Std Dev: 19.41
### Part D (3 hidden layers, 50 epochs): Mean: 131.5  Std Dev: 8.858
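
Since MSE is in squared units of strength, taking the square root (RMSE) puts the error back on the same scale as the target. The sketch below applies this to the mean MSE values reported above and also computes each part's reduction relative to Part B. Note that the square root of an average MSE is not the same as the average of per-model RMSE values, so these figures are only indicative.

```python
import math

# Mean MSE values as reported in Parts B, C and D
mean_mse = {"Part B": 388.1, "Part C": 168.7, "Part D": 131.5}

rmse = {part: math.sqrt(value) for part, value in mean_mse.items()}
reduction_vs_b = {
    part: 100.0 * (mean_mse["Part B"] - value) / mean_mse["Part B"]
    for part, value in mean_mse.items()
}

for part in mean_mse:
    print(f"{part}: RMSE ~ {rmse[part]:.1f}, reduction vs Part B: {reduction_vs_b[part]:.0f}%")
```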