## Machine Learning with Concrete Strength

Concrete strength is affected by factors such as the water-to-cement ratio, raw material quality, the coarse-to-fine aggregate ratio, concrete age, compaction, and the temperature and relative humidity during curing. The data set includes the following information for 1030 concrete samples.

- **Input variables:**
  - Cement: kg/m$^3$ mixture
  - Blast Furnace Slag: kg/m$^3$ mixture
  - Fly Ash: kg/m$^3$ mixture
  - Water: kg/m$^3$ mixture
  - Superplasticizer: kg/m$^3$ mixture
  - Coarse Aggregate: kg/m$^3$ mixture
  - Fine Aggregate: kg/m$^3$ mixture
  - Age: days (1 to 365)
- **Output variable:**
  - Concrete compressive strength: MPa

```python
url = 'https://apmonitor.com/pds/uploads/Main/cement_strength.txt'
```

The full problem statement is on the [Machine Learning for Engineers course website](http://apmonitor.com/pds/index.php/Main/CementStrength).

### Import Packages

```python
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport  # package has since been renamed ydata-profiling
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.feature_selection import SelectKBest, f_classif, f_regression

# Import classifier models
from sklearn.linear_model import LogisticRegression # Logistic Regression
from sklearn.naive_bayes import GaussianNB # Naïve Bayes
from sklearn.linear_model import SGDClassifier # Stochastic Gradient Descent
from sklearn.neighbors import KNeighborsClassifier # K-Nearest Neighbors
from sklearn.tree import DecisionTreeClassifier # Decision Tree
from sklearn.ensemble import RandomForestClassifier # Random Forest
from sklearn.svm import SVC # Support Vector Classifier
from sklearn.neural_network import MLPClassifier # Neural Network

# Import regression models
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn import svm
import statsmodels.api as sm
```

### Import Data

```python
url = 'https://apmonitor.com/pds/uploads/Main/cement_strength.txt'
data = pd.read_csv(url)
```

### Pair Plot of Data

### Divide Data between High and Low Strength
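
The cutoff between high and low strength is not stated here, so the following is only a sketch under an assumption: it uses the median of `csMPa` as the threshold, which gives two classes of nearly equal size. The column name `strength`, the labels `'high'` and `'low'`, and the synthetic values are hypothetical stand-ins for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the csMPa column (real values come from the data file)
rng = np.random.default_rng(0)
data = pd.DataFrame({'csMPa': rng.uniform(2.0, 83.0, size=200)})

# Assumption: the median is the cutoff between low and high strength
threshold = data['csMPa'].median()

# Hypothetical label column: 'high' above the cutoff, 'low' otherwise
data['strength'] = np.where(data['csMPa'] > threshold, 'high', 'low')

print(data['strength'].value_counts())
```

A fixed engineering threshold in MPa could replace the median if the problem statement specifies one.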

## Part 1: Data Visualization and Cleansing

### Summary Statistics

Generate summary information to [statistically describe the data](https://apmonitor.com/pds/index.php/Main/StatisticsMath).
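
One common way to produce such a summary is the pandas `describe` method. The sketch below runs it on a small synthetic frame (the values are made up for illustration); applied to the real `data`, the same call reports the count, mean, standard deviation, and quartiles of every numeric column.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the concrete data
rng = np.random.default_rng(1)
data = pd.DataFrame({
    'cement': rng.uniform(102.0, 540.0, size=50),
    'age': rng.integers(1, 366, size=50),
})

# count, mean, std, min, quartiles, and max for each numeric column
summary = data.describe()
print(summary)
```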

| Statistic | cement | slag | flyash | water | superplasticizer | coarseaggregate | fineaggregate | age | csMPa |
|---|---|---|---|---|---|---|---|---|---|
| count | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 |
| mean | 281.167864 | 73.895825 | 54.18835 | 181.567282 | 6.20466 | 972.918932 | 773.580485 | 45.662136 | 35.817961 |
| std | 104.506364 | 86.279342 | 63.997004 | 21.354219 | 5.973841 | 77.753954 | 80.17598 | 63.169912 | 16.705742 |
| min | 102.0 | 0.0 | 0.0 | 121.8 | 0.0 | 801.0 | 594.0 | 1.0 | 2.33 |
| 25% | 192.375 | 0.0 | 0.0 | 164.9 | 0.0 | 932.0 | 730.95 | 7.0 | 23.71 |
| 50% | 272.9 | 22.0 | 0.0 | 185.0 | 6.4 | 968.0 | 779.5 | 28.0 | 34.445 |
| 75% | 350.0 | 142.95 | 118.3 | 192.0 | 10.2 | 1029.4 | 824.0 | 56.0 | 46.135 |
| max | 540.0 | 359.4 | 200.1 | 247.0 | 32.2 | 1145.0 | 992.6 | 365.0 | 82.6 |



Check for a balanced classification data set (the high-strength and low-strength classes should have about equal counts).
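
A balance check can be sketched with `value_counts`. The label values `'high'` and `'low'` and the counts below are hypothetical; with the real data the series would be the class column created when the samples are divided by strength.

```python
import pandas as pd

# Hypothetical class labels; the real labels come from the strength split
labels = pd.Series(['high'] * 48 + ['low'] * 52, name='strength')

counts = labels.value_counts()                   # samples per class
fractions = labels.value_counts(normalize=True)  # share of each class

print(counts)
print(fractions.round(2))
```

Fractions close to 0.5 for each class indicate a balanced data set, so plain accuracy is a reasonable first metric.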

### Convert String Categories (Text) to Binary (0 or 1)

One-hot encoding translates character labels into a binary representation (0 or 1) for classification. Investigate the data types with `data.dtypes`.
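
As an illustration, the sketch below inspects the data types of a tiny hand-made frame and encodes a text column with `pd.get_dummies`. The column name `strength` and its labels are assumptions made for this example, not names confirmed by the data file.

```python
import pandas as pd

# Tiny hand-made frame; 'strength' is a hypothetical text label column
data = pd.DataFrame({
    'csMPa': [12.5, 40.1, 55.0, 20.3],
    'strength': ['low', 'high', 'high', 'low'],
})
print(data.dtypes)  # text columns are reported as object (or string) dtype

# One-hot encode the text column: one 0/1 column per category
onehot = pd.get_dummies(data['strength'], dtype=int)

# With two classes a single binary column is enough (1 = high, 0 = low)
data['high'] = onehot['high']
print(data)
```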

### Data Cleansing

There is one row that contains an outlier. Identify the outlier with boxplots.

View outliers

Remove rows that contain outliers. 

Verify that the outliers are removed with another box plot.
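
The removal rule is not specified above, so this sketch assumes the same 1.5 × IQR fence that box plot whiskers use by default. The `water` values and the injected outlier are synthetic, and the multiplier `k` is a tunable assumption.

```python
import numpy as np
import pandas as pd

# Synthetic 'water' column with one injected, obviously bad value
rng = np.random.default_rng(2)
values = np.append(rng.uniform(160.0, 200.0, size=100), 5000.0)
data = pd.DataFrame({'water': values})

# Interquartile range fence (assumed rule, matches default box plot whiskers)
q1 = data['water'].quantile(0.25)
q3 = data['water'].quantile(0.75)
iqr = q3 - q1
k = 1.5
keep = data['water'].between(q1 - k * iqr, q3 + k * iqr)

print(data[~keep])                          # view the rows flagged as outliers
clean = data[keep].reset_index(drop=True)   # remove them
print(len(data), '->', len(clean))
```

On real data this fence can also flag legitimate samples in skewed columns, so it is worth viewing the flagged rows before dropping them.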

### Data Correlation

Generate a heat map of the data correlation.
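
A minimal sketch of a correlation heat map follows. It uses synthetic columns with a built-in relationship and matplotlib's `imshow`; seaborn's `heatmap` function is a common alternative. The coefficients used to build the synthetic `csMPa` are arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data: strength rises with cement and falls with water (arbitrary model)
rng = np.random.default_rng(3)
cement = rng.uniform(102.0, 540.0, size=200)
water = rng.uniform(122.0, 247.0, size=200)
strength = 0.1 * cement - 0.3 * water + rng.normal(0.0, 3.0, size=200)
data = pd.DataFrame({'cement': cement, 'water': water, 'csMPa': strength})

corr = data.corr()  # Pearson correlation matrix

fig, ax = plt.subplots()
im = ax.imshow(corr.values, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label='correlation')
fig.tight_layout()
```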

### Data Distributions and Pair Plot

## Part 2: Classification

Train and test a classifier to distinguish between high and low strength concrete. Test at least 8 classifiers of your choice. Recommend a best classifier among the 8 that are tested. 

### Divide Input Features (X) and Output Label (y)

Divide the data into sets that contain the input features (X) and output label (y=`csMPa`). Save data feature columns with `X_names=list(data.columns)` and remove `csMPa` with `X_names.remove('csMPa')`.
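
The split into features and label can be sketched as follows, on a small hand-made frame with only three of the eight feature columns (the values are invented for illustration).

```python
import pandas as pd

# Hand-made stand-in with a subset of the real columns
data = pd.DataFrame({
    'cement': [540.0, 332.5, 198.6, 266.0],
    'water': [162.0, 228.0, 192.0, 228.0],
    'age': [28, 270, 360, 90],
    'csMPa': [1, 1, 0, 0],  # assumed here to already hold a 0/1 class label
})

# All columns except the label are input features
X_names = list(data.columns)
X_names.remove('csMPa')

X = data[X_names].values   # feature matrix, one row per sample
y = data['csMPa'].values   # output label
print(X_names, X.shape, y.shape)
```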

### Data scaling

Scale the input features with a `StandardScaler` or a `MinMaxScaler`. Why do classifiers return an error if the output label is scaled with `StandardScaler`?

Answer: Classifiers expect a categorical label, such as an integer 0 or 1. Scaling the label with `StandardScaler` turns it into continuous real numbers, which classifiers reject, so only the input features should be scaled.
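
The behavior can be reproduced with a short sketch on synthetic data. The feature ranges and the 300 kg/m$^3$ cutoff used to make the labels are arbitrary assumptions; the point is that scaled features work while a scaled label makes `LogisticRegression` raise a `ValueError`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic features (two columns) and an integer 0/1 label
rng = np.random.default_rng(4)
X = rng.uniform([102.0, 122.0], [540.0, 247.0], size=(100, 2))
y = (X[:, 0] > 300.0).astype(int)  # arbitrary cutoff for illustration

# Scale the input features: zero mean and unit variance per column
Xs = StandardScaler().fit_transform(X)

# Scaling the label produces non-integer floats, which classifiers reject
ys = StandardScaler().fit_transform(y.reshape(-1, 1)).ravel()
try:
    LogisticRegression().fit(Xs, ys)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# Fitting with the unscaled integer label works
clf = LogisticRegression().fit(Xs, y)
```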

### Train / Test Split

Randomly select values that split the data into a train (80%) and test (20%) set by using the sklearn `train_test_split` function with `shuffle=True`.
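
A sketch of the 80/20 split on placeholder arrays is shown below. The `random_state` argument is an addition for reproducibility and is not required by the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels (100 samples, 2 features)
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# 80% train, 20% test, shuffled before splitting
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

print(Xtrain.shape, Xtest.shape)
```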

### Evaluate the Best Features

Use `SelectKBest` to evaluate the best features for the classifier.
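
One way to rank features is to fit `SelectKBest` with `k='all'` and read the scores. The sketch below assumes the ANOVA F-test (`f_classif`) as the scoring function and uses synthetic data in which only the first column determines the class, so that column should rank first.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic features; the label depends only on the first column
rng = np.random.default_rng(5)
names = ['cement', 'water', 'age']
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = (X[:, 0] > 0.5).astype(int)

# k='all' keeps every feature and only computes the scores
selector = SelectKBest(score_func=f_classif, k='all').fit(X, y)
scores = pd.Series(selector.scores_, index=names).sort_values(ascending=False)
print(scores)
```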

### Train (fit) and Test Classification with Logistic Regression 

### Train 8 Classifiers

Create 8 classifier objects and train.
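
The loop below is a sketch of this step using the eight classifier types imported earlier. The data come from `make_classification` as a stand-in for the scaled concrete features, and the hyperparameters are defaults or arbitrary choices, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Synthetic two-class data with 8 features (stand-in for the concrete data)
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=0)
X = StandardScaler().fit_transform(X)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Naive Bayes': GaussianNB(),
    'Stochastic Gradient Descent': SGDClassifier(random_state=0),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'Support Vector Classifier': SVC(),
    'Neural Network': MLPClassifier(max_iter=2000, random_state=0),
}

# Train each classifier and record its accuracy on the test set
accuracy = {}
for name, clf in classifiers.items():
    clf.fit(Xtrain, ytrain)
    accuracy[name] = clf.score(Xtest, ytest)
    print(f'{name}: {accuracy[name]:.3f}')
```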

### Classifier Evaluation

Report the confusion matrix on the test set for each classifier and discuss the performance of each. A confusion matrix shows correct classifications (diagonal entries) and incorrect classifications (off-diagonal entries) from the test set.
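
A sketch of the evaluation with `sklearn.metrics.confusion_matrix` follows, using two classifiers and synthetic data for brevity. Rows are the true classes and columns are the predicted classes. In recent scikit-learn versions, `ConfusionMatrixDisplay.from_estimator` can draw the same matrix as a plot.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Synthetic two-class data (stand-in for the concrete data)
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

matrices = {}
for name, clf in classifiers.items():
    clf.fit(Xtrain, ytrain)
    # Rows: true class, columns: predicted class
    matrices[name] = confusion_matrix(ytest, clf.predict(Xtest))
    print(name)
    print(matrices[name])
```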

## Part 3: Regression

Develop a regression model to predict concrete compressive strength (`csMPa`, in MPa). Compare the predicted strength with the measured strength.

### Scale Data

Scale `data` with `StandardScaler` or `MinMaxScaler`. 

### Select Input Features (X) and Output Label (y)

Use the 8 concrete properties as the input features.

- Cement: kg/m$^3$ mixture
- Blast Furnace Slag: kg/m$^3$ mixture
- Fly Ash: kg/m$^3$ mixture
- Water: kg/m$^3$ mixture
- Superplasticizer: kg/m$^3$ mixture
- Coarse Aggregate: kg/m$^3$ mixture
- Fine Aggregate: kg/m$^3$ mixture
- Age: days (1 to 365)

The output label is `csMPa`.

- Concrete Strength (MPa)

Divide the data into sets that contain the input features (X) and output label (y=`csMPa`). Save data feature columns with `X_names=list(data.columns)[0:8]`.

### Select Best Features for Regression

### Split Data

Randomly select values that split the data into a train (80%) and test (20%) set.

### Regression Fit

Use 3 regression methods: Linear Regression, a Neural Network (Deep Learning), and one other regression method of your choice. Discuss the performance of each. Possible regression methods are:

- Linear Regression
- Neural Network (Deep Learning)
- K-Nearest Neighbors
- Support Vector Regressor
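
Fitting three of these methods can be sketched as below. The data come from `make_regression` as a stand-in for the scaled concrete data, K-Nearest Neighbors is an arbitrary pick for the third method, and the network size is an untuned assumption.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data with 8 features, scaled like the exercise data
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)
y = StandardScaler().fit_transform(y.reshape(-1, 1)).ravel()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

models = {
    'Linear Regression': LinearRegression(),
    'Neural Network': MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=2000,
                                   random_state=0),
    'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=5),
}

# Fit each model on the training set and predict the test set
predictions = {}
for name, model in models.items():
    model.fit(Xtrain, ytrain)
    predictions[name] = model.predict(Xtest)
```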

### Validation

Report the coefficient of determination ($R^2$) for the train and test sets.
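
A sketch of the $R^2$ calculation with `r2_score` is shown below on synthetic data with an arbitrary linear relationship. It also checks the test value against the definition $R^2 = 1 - SS_{res}/SS_{tot}$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: linear relationship with noise (arbitrary coefficients)
rng = np.random.default_rng(6)
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.5, size=300)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

model = LinearRegression().fit(Xtrain, ytrain)
r2_train = r2_score(ytrain, model.predict(Xtrain))
r2_test = r2_score(ytest, model.predict(Xtest))

# Same quantity from the definition: 1 - SS_res / SS_tot
pred = model.predict(Xtest)
ss_res = np.sum((ytest - pred) ** 2)
ss_tot = np.sum((ytest - ytest.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f'R2 train: {r2_train:.3f}, R2 test: {r2_test:.3f}')
```

A train value much higher than the test value is a sign of overfitting.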

### Parity Plot

A parity plot is a scatter plot of predicted versus measured values. A parity plot of the training and test data is a good way to see the overall fit of the concrete strength model.
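
A parity plot can be sketched with matplotlib as follows. The data and the linear model are synthetic placeholders; points close to the dashed 45-degree line are well predicted.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for measured strength and its input features
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 3))
y = 35.0 + X @ np.array([10.0, -5.0, 3.0]) + rng.normal(0.0, 4.0, size=300)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)
model = LinearRegression().fit(Xtrain, ytrain)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(ytrain, model.predict(Xtrain), s=10, label='train')
ax.scatter(ytest, model.predict(Xtest), s=10, label='test')

# Parity line: predicted equals measured
lims = [y.min(), y.max()]
ax.plot(lims, lims, 'k--', label='parity')
ax.set_xlabel('Measured csMPa')
ax.set_ylabel('Predicted csMPa')
ax.legend()
fig.tight_layout()
```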

A joint plot shows two variables, with the univariate and joint distributions.