# Peer-graded Assignment: Build a Regression Model in Keras
by Jacques Jansen van Rensburg

## Download the data file
Since it is the same Data that we used in the Labs Session of Week 3, I just Copy and Paste the url that we used then for convenience sake.
I also know from the Labs session that the Data set is clean and ready to use.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
df.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


## Split the data into predictors and target
The target variable in this problem is the concrete sample strength. Therefore, our predictors will be all the other columns

In [3]:
df_columns = df.columns

predictors = df[df_columns[df_columns != 'Strength']]
target = df['Strength']

cols = predictors.shape[1]

# A. Build a Baseline model

I will now use the  Keras library to build a neural network

In [4]:
import keras
from keras.models import Sequential
from keras.layers import Dense

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


Build a neural network with one hidden layer of 10 nodes, and a ReLU activation function.
And then Use the adam optimizer and the mean squared error as the loss function.

In [5]:
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(cols,)))
    model.add(Dense(1))

    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

1. Randomly split the data into a training and test sets by holding 30% of the data for testing. You can use the train_test_splithelper function from Scikit-learn.

In [6]:
X = predictors
y = target
import sklearn.model_selection as model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.7,test_size=0.3, random_state=101)
print ("X_train: ", X_train)
print ("y_train: ", y_train)
print("X_test: ", X_test)
print ("y_test: ", y_test)

X_train:       Cement  Blast Furnace Slag  Fly Ash  Water  Superplasticizer  \
774   382.0                 0.0      0.0  186.0               0.0   
407   165.0               128.5    132.1  175.1               8.1   
620   254.0                 0.0      0.0  198.0               0.0   
479   446.0                24.0     79.0  162.0              11.6   
530   359.0                19.0    141.0  154.0              10.9   
..      ...                 ...      ...    ...               ...   
575   238.1                 0.0      0.0  185.7               0.0   
973   143.8               136.3    106.2  178.1               7.5   
75    475.0               118.8      0.0  181.1               8.9   
599   339.0                 0.0      0.0  197.0               0.0   
863   288.0               121.0      0.0  177.0               7.0   

     Coarse Aggregate  Fine Aggregate  Age  
774            1111.0           784.0    7  
407            1005.8           746.6    3  
620             968.0     

2. Train the model on the training data using 50 epochs.

3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.

4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

I used a loop to repeat the steps

In [7]:
import statistics 
from sklearn.metrics import mean_squared_error
model = regression_model()
mse_results=[]
for x in range(50):
    # fit the model
    model.fit(X_train, y_train, epochs=50, verbose=0)
    train_results = model.predict( X_test )
    mse_results.append( mean_squared_error(y_test, train_results) )

5. Report the mean and the standard deviation of the mean squared errors.

In [8]:
print("Mean of the list of mean square errors is % s" % (statistics.mean( mse_results ) ))
print("Standard deviation of the list of mean square errors is % s" % ( statistics.stdev( mse_results )))

Mean of the list of mean square errors is 121.50265437726155
Standard deviation of the list of mean square errors is 20.69587590901253


# B. Normalize the data 

Repeat Part A but use a normalized version of the data. 

A way to normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation

In [9]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()

In [10]:
n_cols = predictors_norm.shape[1]

In [11]:
def regression_model_norm():
    # create model
    model_norm = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))

    model.compile(optimizer='adam', loss='mean_squared_error')
    return model_norm

In [12]:
model_norm = regression_model()
mse_results_norm=[]
for x in range(50):
    # fit the model
    model.fit(X_train, y_train, epochs=50, verbose=0)
    train_results_norm = model.predict( X_test )
    mse_results_norm.append( mean_squared_error(y_test, train_results_norm) )

How does the mean of the mean squared errors compare to that from Step A?

In [13]:
print("Mean of the list of mean square errors is % s" % (statistics.mean( mse_results_norm ) ))
print("Standard deviation of the list of mean square errors is % s" % ( statistics.stdev( mse_results_norm )))

Mean of the list of mean square errors is 115.09650107787353
Standard deviation of the list of mean square errors is 6.191113593932303


# C. Increase the number of epochs

Repeat Part B but use 100 epochs this time for training.

In [14]:
def regression_model_c():
    # create model
    model_c = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))

    model.compile(optimizer='adam', loss='mean_squared_error')
    return model_c

In [15]:
model_c = regression_model()
mse_results_c=[]
for x in range(50):
    # fit the model
    model.fit(X_train, y_train, epochs=100, verbose=0)
    train_results_c = model.predict( X_test )
    mse_results_c.append( mean_squared_error(y_test, train_results_c) )

How does the mean of the mean squared errors compare to that from Step B?

In [16]:
print("Mean of the list of mean square errors is % s" % (statistics.mean( mse_results_c ) ))
print("Standard deviation of the list of mean square errors is % s" % ( statistics.stdev( mse_results_c )))

Mean of the list of mean square errors is 83.28499191304228
Standard deviation of the list of mean square errors is 29.22540639830787


# D. Increase the number of hidden layers 

Repeat part B but use a neural network with the following instead:

- Three hidden layers, each of 10 nodes and ReLU activation function.

In [17]:
def regression_model_d():
    # create model
    model_d = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))

    model.compile(optimizer='adam', loss='mean_squared_error')
    return model_d

In [18]:
model_d = regression_model_d()
mse_results_d=[]
for x in range(50):
    # fit the model
    model.fit(X_train, y_train, epochs=50, verbose=0)
    train_results_d = model.predict( X_test )
    mse_results_d.append( mean_squared_error(y_test, train_results_d) )

How does the mean of the mean squared errors compare to that from Step B?

In [19]:
print("Mean of the list of mean square errors is % s" % (statistics.mean( mse_results_norm ) ))
print("Standard deviation of the list of mean square errors is % s" % ( statistics.stdev( mse_results_norm )))

Mean of the list of mean square errors is 115.09650107787353
Standard deviation of the list of mean square errors is 6.191113593932303
