## Predicting Concrete Strength ##

This project aims to accurately predict the compressive strength of concrete from several measurable factors: the mix proportions and the age of the sample.

### Variables ###

- "cement" - Portland cement in kg/m3
- "slag" - Blast furnace slag in kg/m3
- "fly_ash" - Fly ash in kg/m3
- "water" - Water in liters/m3
- "superplasticizer" - Superplasticizer additive in kg/m3
- "coarse_aggregate" - Coarse aggregate (gravel) in kg/m3
- "fine_aggregate" - Fine aggregate (sand) in kg/m3
- "age" - Age of the sample in days
- "strength" - Concrete compressive strength in megapascals (MPa)

All of the variables apart from age and strength are components of the concrete mix. The goal is to find out which combinations of materials produce the strongest concrete.


### Step 1: Connecting to a MongoDB database ###

I chose MongoDB because the data used here is simple: a single CSV file in tabular form.

In [4]:
from pymongo import MongoClient

URL = "mongodb+srv://andreDB:annette@concrete.qzjwh.mongodb.net/?retryWrites=true&w=majority"

client = MongoClient(URL)
db = client.concrete

db.list_collection_names()

['concrete']
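
The connection string above embeds the username and password directly. A minimal sketch of one alternative, assuming the credentials live in environment variables. The variable names `MONGO_USER` and `MONGO_PASSWORD` and the helper `build_mongo_url` are hypothetical and not part of the original notebook:

```python
import os
from urllib.parse import quote_plus


def build_mongo_url(host: str) -> str:
    """Build a mongodb+srv URL from credentials held in environment variables."""
    # MONGO_USER and MONGO_PASSWORD are assumed names; use whatever suits your setup.
    user = quote_plus(os.environ["MONGO_USER"])
    # quote_plus escapes characters such as '@' or ':' that would break the URL.
    password = quote_plus(os.environ["MONGO_PASSWORD"])
    return f"mongodb+srv://{user}:{password}@{host}/?retryWrites=true&w=majority"
```

The returned string could then be passed to `MongoClient` in place of `URL`.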

### Step 2: Extracting data and creating DataFrame ###

This step involves pulling the data with PyMongo and then placing the raw objects into a tabular DataFrame.

This is a necessary step because MongoDB stores each record as a JSON-like document, which PyMongo returns as a Python dictionary rather than as a table row.

In [5]:
import pandas as pd
import numpy as np

data = list(db.concrete.find())

raw = pd.DataFrame(data)
raw.drop('_id', axis=1, inplace=True)
raw.head(3)

| | cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
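
To make the document-to-table conversion concrete, here is a minimal sketch that needs no database connection. The two dictionaries are hand-written stand-ins shaped like the records above, with only a few of the columns, and the `_id` values are placeholders (MongoDB generates real ObjectIds):

```python
import pandas as pd

# Hand-written documents for illustration; "_id" values are placeholders.
docs = [
    {"_id": "placeholder-1", "cement": 540.0, "water": 162.0, "age": 28, "strength": 79.986111},
    {"_id": "placeholder-2", "cement": 332.5, "water": 228.0, "age": 270, "strength": 40.269535},
]

# Each dictionary becomes a row and each key becomes a column.
frame = pd.DataFrame(docs)

# Drop MongoDB's identifier column; errors="ignore" keeps this safe if it is absent.
frame = frame.drop(columns="_id", errors="ignore")

print(frame.columns.tolist())
```

PyMongo can also leave the field out at query time with a projection, for example `db.concrete.find({}, {"_id": 0})`, which would make the `drop` call unnecessary.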


In [6]:
raw.describe()

| | cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| count | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 |
| mean | 281.165631 | 73.895485 | 54.187136 | 181.566359 | 6.203112 | 972.918592 | 773.578883 | 45.662136 | 35.817836 |
| std | 104.507142 | 86.279104 | 63.996469 | 21.355567 | 5.973492 | 77.753818 | 80.175427 | 63.169912 | 16.705679 |
| min | 102.0 | 0.0 | 0.0 | 121.75 | 0.0 | 801.0 | 594.0 | 1.0 | 2.331808 |
| 25% | 192.375 | 0.0 | 0.0 | 164.9 | 0.0 | 932.0 | 730.95 | 7.0 | 23.707115 |
| 50% | 272.9 | 22.0 | 0.0 | 185.0 | 6.35 | 968.0 | 779.51 | 28.0 | 34.442774 |
| 75% | 350.0 | 142.95 | 118.27 | 192.0 | 10.16 | 1029.4 | 824.0 | 56.0 | 46.136287 |
| max | 540.0 | 359.4 | 200.1 | 247.0 | 32.2 | 1145.0 | 992.6 | 365.0 | 82.599225 |


### Step 3: Data exploration ###

To better understand the data, I will look at the distribution of each variable, as well as some correlation maps.

#### Interactive Distribution Charts ####

In [1]:
import sys
sys.path

['/Users/andrejacobs/Desktop/DataProjects/concrete predictor',
 '/Users/andrejacobs/opt/anaconda3/envs/tensor/lib/python39.zip',
 '/Users/andrejacobs/opt/anaconda3/envs/tensor/lib/python3.9',
 '/Users/andrejacobs/opt/anaconda3/envs/tensor/lib/python3.9/lib-dynload',
 '',
 '/Users/andrejacobs/opt/anaconda3/envs/tensor/lib/python3.9/site-packages']

In [2]:
sys.path.append('/Users/andrejacobs/Desktop/DataProjects/helper functions')
sys.path

['/Users/andrejacobs/Desktop/DataProjects/concrete predictor',
 '/Users/andrejacobs/opt/anaconda3/envs/tensor/lib/python39.zip',
 '/Users/andrejacobs/opt/anaconda3/envs/tensor/lib/python3.9',
 '/Users/andrejacobs/opt/anaconda3/envs/tensor/lib/python3.9/lib-dynload',
 '',
 '/Users/andrejacobs/opt/anaconda3/envs/tensor/lib/python3.9/site-packages',
 '/Users/andrejacobs/Desktop/DataProjects/helper functions']

In [7]:
from plotly_class import InteractivePlot

my_plots = InteractivePlot(raw)
fig = my_plots.histogram()
fig.show()

#### Correlation matrix ####

In [8]:
fig_scatterplot = my_plots.scatterplot_one('strength')
fig_scatterplot.show()

In [8]:
raw.corr()

| | cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| cement | 1.0 | -0.275193 | -0.397475 | -0.081544 | 0.092771 | -0.109356 | -0.22272 | 0.081947 | 0.497833 |
| slag | -0.275193 | 1.0 | -0.323569 | 0.107286 | 0.043376 | -0.283998 | -0.281593 | -0.044246 | 0.134824 |
| fly_ash | -0.397475 | -0.323569 | 1.0 | -0.257044 | 0.37734 | -0.009977 | 0.079076 | -0.15437 | -0.105753 |
| water | -0.081544 | 0.107286 | -0.257044 | 1.0 | -0.657464 | -0.182312 | -0.450635 | 0.277604 | -0.289613 |
| superplasticizer | 0.092771 | 0.043376 | 0.37734 | -0.657464 | 1.0 | -0.266303 | 0.222501 | -0.192717 | 0.366102 |
| coarse_aggregate | -0.109356 | -0.283998 | -0.009977 | -0.182312 | -0.266303 | 1.0 | -0.178506 | -0.003016 | -0.164928 |
| fine_aggregate | -0.22272 | -0.281593 | 0.079076 | -0.450635 | 0.222501 | -0.178506 | 1.0 | -0.156094 | -0.167249 |
| age | 0.081947 | -0.044246 | -0.15437 | 0.277604 | -0.192717 | -0.003016 | -0.156094 | 1.0 | 0.328877 |
| strength | 0.497833 | 0.134824 | -0.105753 | -0.289613 | 0.366102 | -0.164928 | -0.167249 | 0.328877 | 1.0 |
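
By default, `raw.corr()` computes the Pearson correlation coefficient for every pair of columns. As a rough sketch of what a single cell in that matrix represents, the coefficient can be computed by hand. The function name `pearson` and the sample numbers are made up for illustration and are not taken from the dataset:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (the covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up numbers for illustration only.
cement = [100.0, 200.0, 300.0, 400.0]
strength = [12.0, 20.0, 35.0, 41.0]
print(round(pearson(cement, strength), 3))
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one. The coefficient only captures linear association, so a low value does not rule out a nonlinear dependence.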


### Preliminary Model ###

For my first model, I am going to use linear regression with four features, because each of these features appears to have an approximately normal distribution.

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

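# Keep the target and the four chosen features
df = raw.copy()
df = df[['strength', 'cement', 'coarse_aggregate', 'fine_aggregate', 'age']]

# Separate the features (x) from the target (y)
x = df.drop('strength', axis=1)
y = df.strength

# Hold out 25% of the rows for testing
xtr, xte, ytr, yte = train_test_split(x, y, test_size=0.25)

# Fit on the training rows; score() returns R^2 on the test rows
model = LinearRegression()
model.fit(xtr, ytr)
model.score(xte, yte)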

0.3677363945671762
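
Because `train_test_split` shuffles the rows randomly, the score above will differ from run to run. A minimal sketch of pinning the split with `random_state`, using small synthetic arrays in place of the real data. The seed value 42 is an arbitrary assumption, not something the original notebook uses:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic arrays stand in for the feature matrix and target.
features = np.arange(40).reshape(20, 2)
target = np.arange(20)

# random_state=42 is an arbitrary seed; any fixed integer gives a repeatable shuffle.
split_a = train_test_split(features, target, test_size=0.25, random_state=42)
split_b = train_test_split(features, target, test_size=0.25, random_state=42)

# Both calls return the same train/test partitions.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

Fixing the seed makes scores comparable between models, since every model is then evaluated on the same held-out rows.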

In [10]:
mape = mean_absolute_percentage_error

In [11]:
mape(yte, model.predict(xte))

0.41464206063742287
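
scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, so the value above corresponds to an average error of roughly 41%. A hand-rolled sketch of the same calculation, with the function name `mape_by_hand` and the sample values made up for illustration:

```python
def mape_by_hand(y_true, y_pred):
    """Mean of |actual - predicted| / |actual|, returned as a fraction."""
    # Assumes no actual value is zero; scikit-learn guards that case with a tiny epsilon.
    errors = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# Made-up strengths in MPa, for illustration only.
actual = [40.0, 50.0]
predicted = [44.0, 45.0]
print(mape_by_hand(actual, predicted))
```

Both predictions here are off by 10% of the actual value, so the result is a fraction of 0.1.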

### Second Model ### 

Let's try a different algorithm. Next, I will try a random forest.
This method trains many decision trees on random subsets of the data and averages their predictions.

In [12]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=200)
model.fit(xtr, ytr)
model.score(xte, yte)

0.8215838260277135

In [13]:
mape(yte, model.predict(xte))

0.1708825883028405
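
To illustrate how a random forest regressor combines its trees, the sketch below fits a small forest on synthetic data and checks that the forest's prediction equals the average of the individual trees' predictions. The data, the seed, and `n_estimators=20` are assumptions chosen to keep the example fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration: 100 samples, 3 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 3))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.5, size=100)

# n_estimators=20 keeps the example fast; the notebook cell above uses 200.
forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

# Each fitted tree is stored in estimators_; average their individual predictions.
tree_predictions = np.array([tree.predict(X) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

# The forest's own prediction matches the per-tree average.
print(np.allclose(manual_average, forest.predict(X)))
```

Averaging many trees, each trained on a different bootstrap sample, tends to reduce the variance of a single deep tree. This may be part of why the forest can capture relationships that a straight-line fit misses.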

### Evaluating first two models ###

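These first two models are evaluated by their R^2 value, which is what `model.score` returns. R^2 measures the proportion of the variance in strength that the model explains: 1 is a perfect fit, and 0 is no better than always predicting the mean.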
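
For scikit-learn regressors, `model.score` returns the coefficient of determination, defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that formula, with the function name `r_squared` and the sample values made up for illustration:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of a baseline that always predicts the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
actual = [20.0, 30.0, 40.0]
print(r_squared(actual, actual))              # perfect predictions
print(r_squared(actual, [30.0, 30.0, 30.0]))  # always predicting the mean
```

The first call gives 1.0 and the second gives 0.0. A model that does worse than the mean baseline gets a negative score.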

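The second model was much better than the first. In the run shown above, its R^2 of 0.82 is more than double the first model's 0.37, and its mean absolute percentage error fell from about 41% to about 17%.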

Next, I will look at ways to further improve the model by using more of the columns and applying some preprocessing to try to normalize the data.

### Building a Model using all of the Data ###

In [14]:
all_data = raw.copy()

x, y = all_data.drop('strength', axis=1), all_data['strength']
xtr, xte, ytr, yte = train_test_split(x, y, test_size=0.3)

model = RandomForestRegressor()

model.fit(xtr, ytr)
print(model.score(xte, yte))

ypred = model.predict(xte)

0.8827828269067742


### Final Mean Absolute Percent Error ###

In [15]:
mape(yte, ypred)

0.1271874977765759