Project Final: Build a Regression Model in Keras

## Download and Clean Dataset


In [59]:
import pandas as pd
import numpy as np

We will be playing around with the same dataset that we used in the videos.

<strong>The dataset describes the compressive strength of different concrete samples as a function of the amounts of the ingredients used to make them. Ingredients include:</strong>

<strong>1. Cement</strong>

<strong>2. Blast Furnace Slag</strong>

<strong>3. Fly Ash</strong>

<strong>4. Water</strong>

<strong>5. Superplasticizer</strong>

<strong>6. Coarse Aggregate</strong>

<strong>7. Fine Aggregate</strong>


Let's download the data and read it into a <em>pandas</em> dataframe.


In [60]:
concrete_data = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


So the first concrete sample contains 540 kg of cement, 0 kg of blast furnace slag, 0 kg of fly ash, 162 kg of water, 2.5 kg of superplasticizer, 1040 kg of coarse aggregate, and 676 kg of fine aggregate per cubic meter of mix. At 28 days old, this mix has a compressive strength of 79.99 MPa.


#### Let's check how many data points we have.


In [61]:
concrete_data.shape

(1030, 9)

So, there are approximately 1000 samples to train our model on. With so few samples, we have to be careful not to overfit the training data.


Let's check the dataset for any missing values.


In [62]:
concrete_data.describe()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [63]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

The data looks very clean and is ready to be used to build our model.
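To make the missing-value check concrete, here is a minimal sketch on a hypothetical toy frame (not the concrete dataset) that does contain gaps. The frame `toy` and its values are assumptions made up for illustration; `isnull`, `any`, and `dropna` are the standard pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries (not the concrete data).
toy = pd.DataFrame({
    "Cement": [540.0, np.nan, 332.5],
    "Water": [162.0, 162.0, np.nan],
    "Strength": [79.99, 61.89, 40.27],
})

missing_per_column = toy.isnull().sum()              # NaN count in each column
rows_with_missing = toy.isnull().any(axis=1).sum()   # rows with at least one NaN
complete_rows = toy.dropna()                         # keep only fully populated rows

print(missing_per_column)
print(rows_with_missing, len(complete_rows))
```

For the concrete data, every count is 0, so no rows need to be dropped or imputed.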


#### Split data into predictors and target


The target variable in this problem is the concrete sample strength. Therefore, our predictors will be all the other columns.


In [64]:
concrete_data_columns = concrete_data.columns

predictors = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = concrete_data['Strength'] # Strength column
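The boolean-mask selection used above can also be written with `DataFrame.drop`, which some readers find easier to scan. The sketch below uses a hypothetical three-column frame to show that both forms give the same predictors and leave the original frame untouched.

```python
import pandas as pd

# Hypothetical miniature frame standing in for concrete_data.
toy = pd.DataFrame({
    "Cement": [540.0, 332.5],
    "Age": [28, 270],
    "Strength": [79.99, 40.27],
})

# Boolean-mask selection, as in the notebook cell.
cols = toy.columns
predictors_mask = toy[cols[cols != "Strength"]]

# Same result with drop(); toy itself is not modified.
predictors_drop = toy.drop(columns=["Strength"])
target_toy = toy["Strength"]

print(predictors_drop.columns.tolist())
```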

<a id="item2"></a>


Let's do a quick sanity check of the predictors and the target dataframes.


In [65]:
predictors.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [66]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

The last step is to normalize the data by subtracting the mean and dividing by the standard deviation of each column.


PART B: Normalized Data

In [67]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069
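The normalization above is a column-wise z-score. As a rough sketch, the same computation in plain NumPy looks like the following; the helper name `standardize` and the toy array are assumptions for illustration. One detail worth knowing: pandas computes the sample standard deviation (`ddof=1`) by default, while NumPy defaults to `ddof=0`, so the argument has to be set explicitly to match.

```python
import numpy as np

def standardize(x, ddof=1):
    """Column-wise z-score: subtract the mean, divide by the standard deviation.

    ddof=1 matches the pandas default for DataFrame.std(); NumPy's own
    default is ddof=0.
    """
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=ddof)

# Toy values (two columns), not the real dataset.
toy = np.array([[540.0, 28.0], [332.5, 270.0], [198.6, 360.0]])
toy_norm = standardize(toy)
print(toy_norm.mean(axis=0))          # approximately 0 for each column
print(toy_norm.std(axis=0, ddof=1))   # approximately 1 for each column

# Reproducing one table entry from the describe() summary above:
# the first sample's Cement value with the reported mean and std.
z_cement = (540.0 - 281.167864) / 104.506364
print(round(z_cement, 4))
```

The last value agrees with the first Cement entry of the normalized table.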


Let's save the number of predictors to *n_cols* since we will need this number when building our network.


In [68]:
n_cols = predictors_norm.shape[1] # number of predictors

## Import Keras


#### Let's go ahead and import the Keras library


In [69]:
import keras

In [70]:
from keras.models import Sequential
from keras.layers import Dense

## Build a Neural Network


In [71]:
# define regression model
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))  # hidden layer: 10 nodes, ReLU activation
    model.add(Dense(1))
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

The above function creates a model that has one hidden layer with 10 neurons and a ReLU activation function. It uses the adam optimizer and the mean squared error as the loss function.
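To see what this architecture amounts to, the sketch below counts its trainable parameters and writes the forward pass in plain NumPy. It assumes the 8 predictors of this notebook; the `forward` function and the random weights are illustrative placeholders, not what Keras trains or stores internally.

```python
import numpy as np

n_cols = 8       # number of predictors in this notebook
n_hidden = 10    # units in the hidden layer

# Trainable parameters: weights plus biases for each Dense layer.
hidden_params = n_cols * n_hidden + n_hidden   # 8*10 + 10
output_params = n_hidden * 1 + 1               # 10*1 + 1
print(hidden_params + output_params)

def forward(x, w1, b1, w2, b2):
    """Forward pass of the same architecture, written with plain NumPy."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # Dense(10) followed by ReLU
    return hidden @ w2 + b2                # Dense(1), linear output

# Random placeholder weights, only to illustrate the shapes involved.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(n_cols, n_hidden)), np.zeros(n_hidden)
w2, b2 = rng.normal(size=(n_hidden, 1)), np.zeros(1)
batch = rng.normal(size=(5, n_cols))
print(forward(batch, w1, b1, w2, b2).shape)
```

The output layer has no activation, which is the usual choice for regression because the predicted strength is an unbounded real number.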

Let's import the train_test_split function from scikit-learn in order to randomly split the data into training and test sets.

In [72]:
from sklearn.model_selection import train_test_split

Split the data into training and test sets, holding out 30% of the data for testing.

In [73]:
X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=42)
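Conceptually, a random 70/30 split shuffles the row indices and cuts the shuffled list in two. The following is a simplified sketch of that idea in NumPy, not scikit-learn's actual implementation; `shuffled_split` is a hypothetical helper, and its random stream differs from the one `train_test_split` uses, so the selected rows will not match.

```python
import numpy as np

def shuffled_split(n_samples, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) from a seeded random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)            # shuffled row indices
    n_test = int(np.ceil(test_size * n_samples))  # size of the held-out part
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffled_split(1030)
print(len(train_idx), len(test_idx))
```

With 1030 samples, this leaves 721 rows for training and 309 for testing.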

## Train and Test the Network


Let's call the function now to create our model.


In [74]:
# build the model
model = regression_model()

Next, we will train the model for 50 epochs.

In [75]:
# fit the model
model.fit(X_train, y_train, epochs=50, verbose=2)

Epoch 1/50
 - 0s - loss: 1579.9570
Epoch 2/50
 - 0s - loss: 1565.7008
Epoch 3/50
 - 0s - loss: 1551.6281
Epoch 4/50
 - 0s - loss: 1537.4554
Epoch 5/50
 - 0s - loss: 1522.9086
Epoch 6/50
 - 0s - loss: 1507.3781
Epoch 7/50
 - 0s - loss: 1491.1979
Epoch 8/50
 - 0s - loss: 1473.7489
Epoch 9/50
 - 0s - loss: 1454.9738
Epoch 10/50
 - 0s - loss: 1434.4955
Epoch 11/50
 - 0s - loss: 1413.0970
Epoch 12/50
 - 0s - loss: 1389.6552
Epoch 13/50
 - 0s - loss: 1364.8069
Epoch 14/50
 - 0s - loss: 1337.7840
Epoch 15/50
 - 0s - loss: 1310.2098
Epoch 16/50
 - 0s - loss: 1280.6390
Epoch 17/50
 - 0s - loss: 1249.5882
Epoch 18/50
 - 0s - loss: 1217.3324
Epoch 19/50
 - 0s - loss: 1183.8649
Epoch 20/50
 - 0s - loss: 1149.8864
Epoch 21/50
 - 0s - loss: 1114.5146
Epoch 22/50
 - 0s - loss: 1077.9253
Epoch 23/50
 - 0s - loss: 1041.6416
Epoch 24/50
 - 0s - loss: 1004.3228
Epoch 25/50
 - 0s - loss: 966.4970
Epoch 26/50
 - 0s - loss: 929.9251
Epoch 27/50
 - 0s - loss: 891.1748
Epoch 28/50
 - 0s - loss: 854.6427
Epoch

<keras.callbacks.History at 0x7fb038550750>

Next we need to evaluate the model on the test data.

In [76]:
loss_val = model.evaluate(X_test, y_test)
y_pred = model.predict(X_test)
loss_val



282.78478717495324

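Now we need to compute the mean squared error between the predicted concrete strength and the actual concrete strength. Because the model was compiled with mean squared error as its loss, the value returned by *model.evaluate* above is already the test MSE, so computing it with scikit-learn serves as a cross-check.

Let's import the mean_squared_error function from scikit-learn.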

In [77]:
from sklearn.metrics import mean_squared_error

In [78]:
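mean_square_error = mean_squared_error(y_test, y_pred)
# mean_square_error is a single number here, so its mean is the value itself
# and its standard deviation is trivially 0.0
mean = np.mean(mean_square_error)
standard_deviation = np.std(mean_square_error)
print(mean, standard_deviation)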

282.7847963665968 0.0
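For reference, the mean squared error is just the average of the squared differences between actual and predicted values. The sketch below computes it by hand with NumPy on made-up numbers; `mse` is a hypothetical helper. It also shows a pitfall when doing this manually: a model with a single output unit returns predictions of shape `(n, 1)`, so they should be flattened before subtracting from a target of shape `(n,)`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error after flattening both inputs to 1-D."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()   # (n, 1) -> (n,)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up values for illustration.
y_true = np.array([79.99, 61.89, 40.27])
y_pred = np.array([[75.0], [60.0], [45.0]])   # column vector, like model.predict output
print(mse(y_true, y_pred))

# Without flattening, broadcasting (3,) against (3, 1) yields a (3, 3) array.
print((y_true - y_pred).shape)
```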


Now repeat the split, training, and evaluation 50 times, collect the 50 mean squared errors in a list, and report their mean and standard deviation.

PART B: Normalized Data

In [79]:
total_mean_squared_errors = 50
epochs = 50
mean_squared_errors = []
for i in range(0, total_mean_squared_errors):
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=i)
    # Note: the same model object is reused, so training continues from the
    # previous iteration's weights rather than starting from scratch.
    model.fit(X_train, y_train, epochs=epochs, verbose=0)
    MSE = model.evaluate(X_test, y_test, verbose=0)
    print("MSE "+str(i+1)+": "+str(MSE))
    y_pred = model.predict(X_test)
    mean_square_error = mean_squared_error(y_test, y_pred)
    mean_squared_errors.append(mean_square_error)

mean_squared_errors = np.array(mean_squared_errors)
mean = np.mean(mean_squared_errors)
standard_deviation = np.std(mean_squared_errors)

print('\n')
print("Below are the mean and standard deviation of " + str(total_mean_squared_errors) + " mean squared errors with normalized data. Total number of epochs for each training run: " + str(epochs) + "\n")
print("Mean: "+str(mean))
print("Standard Deviation: "+str(standard_deviation))

MSE 1: 150.93994876404795
MSE 2: 149.51793974348644
MSE 3: 92.17099211362573
MSE 4: 76.0172979484484
MSE 5: 58.07518315238089
MSE 6: 54.26990792975071
MSE 7: 53.70063016329768
MSE 8: 42.27746948686618
MSE 9: 41.162722245003415
MSE 10: 42.364734291644545
MSE 11: 40.666013834931704
MSE 12: 39.504358995307996
MSE 13: 46.16996561439292
MSE 14: 45.891387211080506
MSE 15: 40.072043483697094
MSE 16: 35.14875637289004
MSE 17: 37.957105050195
MSE 18: 37.80652443722228
MSE 19: 38.893219018831225
MSE 20: 39.029781711911696
MSE 21: 36.52512221351796
MSE 22: 39.84625756470517
MSE 23: 35.53821777294369
MSE 24: 39.4263494423678
MSE 25: 39.83874905533775
MSE 26: 41.60244585549562
MSE 27: 37.27227632590482
MSE 28: 35.79161572996467
MSE 29: 43.51456948931549
MSE 30: 40.246339273298446
MSE 31: 38.218350654281075
MSE 32: 34.94854890027092
MSE 33: 35.43506351952414
MSE 34: 38.304509344995985
MSE 35: 38.600919581539806
MSE 36: 44.21159634389538
MSE 37: 40.610548220020284
MSE 38: 41.81577975078694
MSE 39: 40
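The steady drop in MSE over the first iterations is consistent with the model object being trained again on every pass, so later values reflect accumulated training, including on rows that were test data in earlier splits. If independent runs are wanted, the model would need to be re-created inside the loop, which in this notebook would mean calling `regression_model()` before each `fit`. The sketch below illustrates that pattern without Keras: it swaps in ordinary least squares as a stand-in model and uses synthetic data, both of which are assumptions made only so the example runs on its own.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in for 'build and train a fresh model': ordinary least squares."""
    X1 = np.column_stack([X, np.ones(len(X))])   # append an intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

# Synthetic data, only so the sketch is self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=200)

mse_list = []
for seed in range(50):
    order = np.random.default_rng(seed).permutation(len(X))
    test_idx, train_idx = order[:60], order[60:]     # 30% held out
    coef = fit_linear(X[train_idx], y[train_idx])    # fresh fit, no state carried over
    residual = y[test_idx] - predict_linear(coef, X[test_idx])
    mse_list.append(np.mean(residual ** 2))

mse_arr = np.array(mse_list)
print(mse_arr.mean(), mse_arr.std())
```

With a fresh model per split, the spread of the 50 values reflects variation across splits rather than how long a single model has been trained.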

<a id='item32'></a>
