# Part A

## Download and clean the data

First step: download the dataset and save it as a pandas DataFrame.

In [1]:
import pandas as pd

concrete_data = pd.read_csv("https://cocl.us/concrete_data")
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


Looks good. Next, I'll split the data into predictors (`X`) and the target (`y`). `Strength` is the target variable.

In [2]:
X = concrete_data[concrete_data.columns[concrete_data.columns != "Strength"]]
num_cols = X.shape[1]  # Number of predictors; used later as the network's input size

y = concrete_data["Strength"]

In [3]:
X.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [4]:
y.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64
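
As an aside, the boolean mask on `concrete_data.columns` works, but `DataFrame.drop` may read more directly. Here's a minimal sketch on a small stand-in frame; `toy`, `X_toy`, and `y_toy` are hypothetical names, and the values are a few entries copied from the rows shown above.

```python
import pandas as pd

# Small stand-in frame (hypothetical subset of columns and rows)
toy = pd.DataFrame(
    {
        "Cement": [540.0, 332.5, 198.6],
        "Water": [162.0, 228.0, 192.0],
        "Age": [28, 270, 360],
        "Strength": [79.99, 40.27, 44.30],
    }
)

# drop() returns a new frame without the target; toy itself is unchanged
X_toy = toy.drop(columns="Strength")
y_toy = toy["Strength"]

print(list(X_toy.columns))
print(X_toy.shape[1])  # number of predictor columns
```

Either approach should leave every column except `Strength` in the predictor frame.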

## Build the neural network

Next I'll build a function to create the neural network, with the number of hidden layers as a parameter, since Part D uses three hidden layers.

In [5]:
import keras
from keras.models import Sequential
from keras.layers import Dense


def regression_model(num_hidden_layers):
    model = Sequential()

    # Hidden layers: the first one also declares the input size
    model.add(Dense(10, activation="relu", input_shape=(num_cols,)))
    # Add the remaining hidden layers, if any
    for _ in range(num_hidden_layers - 1):
        model.add(Dense(10, activation="relu"))

    # Output layer
    model.add(Dense(1))

    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

Using TensorFlow backend.
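
To get a feel for how small these networks are, here's a rough sketch that counts the trainable parameters (weights plus biases) in a stack of fully connected layers like the one above. `count_parameters` is a hypothetical helper, not part of Keras, and the input size of 8 assumes the eight predictor columns displayed by `X.head()`.

```python
def count_parameters(num_inputs, num_hidden_layers, hidden_units=10):
    """Count weights and biases for dense hidden layers plus one output node."""
    total = 0
    previous_width = num_inputs
    for _ in range(num_hidden_layers):
        # Each node has one weight per incoming value, plus one bias
        total += previous_width * hidden_units + hidden_units
        previous_width = hidden_units
    # Single output node: one weight per hidden node, plus one bias
    total += previous_width + 1
    return total


one_layer = count_parameters(8, 1)
three_layers = count_parameters(8, 3)
print(one_layer, three_layers)
```

Under those assumptions, the count for a real model could be cross-checked against what `model.summary()` reports.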


## Split, train, evaluate × 50

Now the fun part. I'm going to do the following 50 times:

- Randomly split the data, holding out 30% for testing.
- Train a model on the training data over 50 epochs.
- Evaluate the model on the test data and compute the mean squared error between predicted concrete strength and actual concrete strength.

I'll save each of the 50 mean squared errors into a list.

Come to think of it, I'll define a function for this process, because the other parts of the assignment repeat it with only a few parameters changed.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from statistics import mean, stdev


def process_models(X, y, num_hidden_layers, num_epochs):
    mean_squared_errors = []

    for i in range(50):
        # Split the data
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

        # Train a fresh model on this split
        model = regression_model(num_hidden_layers)
        model.fit(X_train, y_train, epochs=num_epochs, verbose=0)

        # Predict strength for the held-out rows
        predictions = model.predict(X_test)

        # Find and save the mean squared error
        mean_squared_errors.append(mean_squared_error(y_test, predictions))
        print("Run #{} complete".format(i + 1))

    return mean_squared_errors


errors_a = process_models(X, y, 1, 50)

Run #1 complete
Run #2 complete
Run #3 complete
Run #4 complete
Run #5 complete
Run #6 complete
Run #7 complete
Run #8 complete
Run #9 complete
Run #10 complete
Run #11 complete
Run #12 complete
Run #13 complete
Run #14 complete
Run #15 complete
Run #16 complete
Run #17 complete
Run #18 complete
Run #19 complete
Run #20 complete
Run #21 complete
Run #22 complete
Run #23 complete
Run #24 complete
Run #25 complete
Run #26 complete
Run #27 complete
Run #28 complete
Run #29 complete
Run #30 complete
Run #31 complete
Run #32 complete
Run #33 complete
Run #34 complete
Run #35 complete
Run #36 complete
Run #37 complete
Run #38 complete
Run #39 complete
Run #40 complete
Run #41 complete
Run #42 complete
Run #43 complete
Run #44 complete
Run #45 complete
Run #46 complete
Run #47 complete
Run #48 complete
Run #49 complete
Run #50 complete
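
Since every run calls `train_test_split` without a fixed seed, each of the 50 models sees a different split. Here's a minimal standard-library sketch of what a 30% random holdout amounts to. `holdout_indices` is a hypothetical helper, not scikit-learn's implementation, and rounding the test size with `round` is an assumption made for the illustration.

```python
import random


def holdout_indices(num_rows, test_size=0.3, seed=None):
    """Shuffle row indices and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    indices = list(range(num_rows))
    rng.shuffle(indices)
    num_test = round(num_rows * test_size)  # assumption: simple rounding
    test_idx = indices[:num_test]
    train_idx = indices[num_test:]
    return train_idx, test_idx


train_idx, test_idx = holdout_indices(10, seed=0)
print(len(train_idx), len(test_idx))
```

Passing the same `seed` reproduces the same split, while leaving it as `None` gives a fresh shuffle each time, which is the behavior the loop above relies on.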


## Results

I'll make a function for this part, too.

In [7]:
def report_results(mean_squared_errors):
    print(
        "The mean of the mean squared errors is {}".format(
            round(mean(mean_squared_errors), 3)
        )
    )
    print(
        "The standard deviation of the mean squared errors is {}".format(
            round(stdev(mean_squared_errors), 3)
        )
    )


report_results(errors_a)

The mean of the mean squared errors is 380.812
The standard deviation of the mean squared errors is 408.331
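
One detail worth knowing when reading that second number: `statistics.stdev` computes the sample standard deviation (dividing by n - 1), while `statistics.pstdev` computes the population version (dividing by n). A small sketch with made-up values; `toy_errors` is hypothetical and unrelated to the actual results.

```python
from statistics import mean, pstdev, stdev

# Made-up values, chosen so the arithmetic is easy to check by hand
toy_errors = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

toy_mean = mean(toy_errors)
toy_population_sd = pstdev(toy_errors)  # divides by n
toy_sample_sd = stdev(toy_errors)       # divides by n - 1

print(toy_mean, toy_population_sd, round(toy_sample_sd, 3))
```

With 50 runs the two versions differ only slightly, so the choice shouldn't change the overall picture.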


There you have it. I'll admit, those errors look pretty bad: the standard deviation is larger than the mean, so the error swings widely from run to run. I'm curious to see how the changes in the next three parts affect that.