# Can you predict the strength of concrete?

## 📖 Background
You work in the civil engineering department of a major university. You are part of a project testing the strength of concrete samples. 

Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials like fly ash, blast furnace slag, and additives. 

The compressive strength of concrete is a function of components and age, so your team is testing different combinations of ingredients at different time intervals. 

The project leader asked you to find a simple way to estimate strength so that students can predict how a particular sample is expected to perform.

## 💾 The data
The team has already tested more than a thousand samples ([source](https://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength)):

#### Compressive strength data:
- "cement" - Portland cement in kg/m3
- "slag" - Blast furnace slag in kg/m3
- "fly_ash" - Fly ash in kg/m3
- "water" - Water in liters/m3
- "superplasticizer" - Superplasticizer additive in kg/m3
- "coarse_aggregate" - Coarse aggregate (gravel) in kg/m3
- "fine_aggregate" - Fine aggregate (sand) in kg/m3
- "age" - Age of the sample in days
- "strength" - Concrete compressive strength in megapascals (MPa)

***Acknowledgments**: I-Cheng Yeh, "Modeling of strength of high-performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998)*.

# 🖥️ (I) Modeling and Predicting Concrete Strength

 ⭕ In this section, we fit several models to the data and fine-tune them to find the one that best predicts concrete strength. At the end, a user can enter the attributes of a concrete sample to get its estimated strength.

# 1.Data Processing

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv('data/concrete_data.csv')
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data/concrete_data.csv'

In [None]:
df.shape

## 1.1 Data Cleaning

How many duplicated rows are there in the dataset?

In [None]:
duplicate_count = len(df)-len(df.drop_duplicates()) # Original data length minus data length without duplicates

duplicate_count

In [None]:
df.drop_duplicates(inplace=True) # Drop duplicates in place

df.head()

Check for Missing Data

In [None]:
df.isnull().sum().sort_values(ascending=False)

In [None]:
df.shape

Finding & Removing Outliers

In [None]:
# Draw one boxplot per ingredient column to spot outliers
for col in ['cement', 'slag', 'fly_ash', 'water', 'superplasticizer', 'coarse_aggregate', 'fine_aggregate']:
    df[[col]].boxplot()
    plt.show()



Dropping outliers

In [None]:
from scipy import stats
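# Show rows where every column is within 3 standard deviations of its mean.
# The result is not assigned, so df is unchanged; assign it back to drop the outliers.
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]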
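
To make the filtering rule concrete, here is a minimal plain-Python sketch of the same z-score test on hypothetical toy rows. The function name `zscore_filter` and the data are illustrative assumptions, not part of the notebook; the population standard deviation is used because that matches the default of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev


def zscore_filter(rows, threshold=3.0):
    """Keep rows whose every column lies within `threshold` standard deviations."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]  # population std (ddof=0)
    kept = []
    for row in rows:
        z_scores = [abs(v - m) / s for v, m, s in zip(row, means, stds)]
        if all(z < threshold for z in z_scores):
            kept.append(row)
    return kept


# Hypothetical toy data: 20 ordinary rows plus one extreme value in column 0
rows = [(float(i % 5), float(i % 3)) for i in range(20)] + [(100.0, 1.0)]
kept = zscore_filter(rows)
print(len(rows), "rows before,", len(kept), "rows after")
```

The returned list has to be stored (here in `kept`) for the filtering to have any effect on later steps.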

## 1.2 Feature Scaling

In [None]:
df=df.copy()
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler() 
 
# make a copy of dataframe
scaled_features = df.copy()

col_names = ['cement', 'slag', 'fly_ash', 'water','superplasticizer','coarse_aggregate','fine_aggregate','age']
features = scaled_features[col_names]

# Use scaler of choice; here RobustScaler is used
features = scaler.fit_transform(features.values)

scaled_features[col_names] = features
scaled_features
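
As a rough illustration of what robust scaling does with default settings, the sketch below centres a single column on its median and divides by its interquartile range. The helper `robust_scale` and the cement values are hypothetical; the real `RobustScaler` applies this per column and stores the fitted statistics for later use.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    centre = median(values)
    iqr = q3 - q1
    return [(v - centre) / iqr for v in values]


cement = [100.0, 200.0, 300.0, 400.0, 500.0]  # hypothetical values in kg/m3
scaled = robust_scale(cement)
print(scaled)
```

Because the median and IQR barely move when a few extreme values are present, this scaling is less sensitive to outliers than mean and standard deviation scaling.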

In [None]:
X = scaled_features.drop(columns=['strength'])
y = scaled_features['strength']

In [None]:
X

Feature Correlation

In [None]:
import seaborn as sns

corr = scaled_features.corr() # Pearson Correlation

# Heatmap
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns,
        cmap= "YlGnBu")
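
Each cell of the heatmap is a Pearson correlation coefficient between two columns. A minimal sketch of that calculation, using made-up numbers rather than the dataset, might look like this:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical values: strength falls linearly as water rises
water = [150.0, 160.0, 170.0, 180.0]
strength = [50.0, 45.0, 40.0, 35.0]
r = pearson(water, strength)
print(round(r, 3))
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.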


# 2.Modeling

## 2.1 Base Model (XGBRegressor)

Data Splitting

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

Cross-Validation

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import xgboost as xg
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

#model = LinearRegression()
model= xg.XGBRegressor(objective='reg:squarederror')

scores = cross_val_score(model, X_train, y_train, cv=10)

base_model_score = scores.mean()

base_model_score
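
For readers new to cross-validation, the sketch below shows the two ideas behind the score: splitting the training rows into `k` contiguous folds and scoring each fold with the coefficient of determination (R²), which is the default score for scikit-learn style regressors. The helpers `kfold_indices` and `r2` are illustrative stand-ins, not the library's own code.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds without shuffling."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more row
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def r2(y_true, y_pred):
    """Coefficient of determination: 1 is perfect, 0 matches predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


folds = kfold_indices(23, 10)
print([len(f) for f in folds])
print(r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

Each fold serves once as the validation set while the model is trained on the other folds, and the reported score is the mean over the folds.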

KNN Model

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=1)

scores = cross_val_score(knn, X_train, y_train, cv=10)

base_model_score = scores.mean()

base_model_score

In [None]:
from sklearn.model_selection import GridSearchCV

# Instantiate model
model = KNeighborsRegressor()

# Hyperparameter Grid
k_grid = {'n_neighbors' : [1,5,10,20,50]}

# Instantiate Grid Search
grid = GridSearchCV(model, k_grid, n_jobs=-1,  cv = 5)

# Fit data to Grid Search
grid.fit(X_train, y_train)

In [None]:
grid.best_params_

In [None]:
knn = KNeighborsRegressor(n_neighbors=5)

scores = cross_val_score(knn, X_train, y_train, cv=10)

base_model_score = scores.mean()

base_model_score
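
The effect of `n_neighbors` is easier to see in a small sketch. Assuming Euclidean distance and equal weights for all neighbours, a k-nearest-neighbours regressor predicts the average target of the `k` closest training points. The function `knn_predict` and the toy data are hypothetical.

```python
from math import dist


def knn_predict(train_X, train_y, query, k=5):
    """Average the targets of the k training points closest to `query`."""
    by_distance = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical scaled features and strengths
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [10.0, 20.0, 30.0, 90.0]

pred_1 = knn_predict(train_X, train_y, (0.1, 0.1), k=1)  # copies the closest point
pred_3 = knn_predict(train_X, train_y, (0.1, 0.1), k=3)  # averages three points
print(pred_1, pred_3)
```

With `k=1` the prediction simply copies the nearest sample, which tends to be noisy; larger `k` smooths the prediction at the cost of detail, which is why the grid search tries several values.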

XGBRegressor model fine-tuning


In [None]:
model= xg.XGBRegressor(objective='reg:squarederror',n_estimators=1000,learning_rate=0.05)

scores = cross_val_score(model, X_train, y_train, cv=10)

base_model_score = scores.mean()

base_model_score

In [None]:
model= xg.XGBRegressor(objective='reg:squarederror',n_estimators=1000,learning_rate=0.05)
trained_model = model.fit(X_train,y_train)
model.score(X_test,y_test)

We choose the XGBRegressor because it gives the better result.

# 3.Input New Data 

In [None]:
print(df.columns.values.tolist())

## 3.1 Test the sample

In [None]:
sample = [332.5,142.5,0.0,228.0,0.0,932.0,594.0,270]
column_names = ['cement', 'slag', 'fly_ash', 'water', 'superplasticizer', 'coarse_aggregate', 'fine_aggregate', 'age']
new_sample= pd.DataFrame(data = [sample], columns = column_names)

# Scale the input with the scaler fitted on the training features
new_sample[column_names] = scaler.transform(new_sample[column_names].values)
new_sample

In [None]:
model.predict(new_sample)[0]

## 3.2 Entering a new sample

### Please enter your concrete sample properties in order and run the cell

In [None]:
cement= 332.5
slag= 142.5
fly_ash= 0.0
water= 228.0
superplasticizer= 0.0
coarse_aggregate= 932.0
fine_aggregate= 594.0

## 💪 3.4 Run the cell to get the average strength of the concrete sample at 1, 7, 14, and 28 days of age


In [None]:
age = [1, 7, 14, 28]
column_names = ['cement', 'slag', 'fly_ash', 'water', 'superplasticizer', 'coarse_aggregate', 'fine_aggregate', 'age']
prediction_list = []
for item in age:
    sample = [cement, slag, fly_ash, water, superplasticizer, coarse_aggregate, fine_aggregate, item]
    new_sample = pd.DataFrame(data=[sample], columns=column_names)
    # Scale the input with the scaler fitted on the training features
    new_sample[column_names] = scaler.transform(new_sample[column_names].values)
    prediction_list.append(model.predict(new_sample)[0])

# Arithmetic mean of the four predictions
mean_strength = sum(prediction_list) / len(prediction_list)

print ("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
print ("The average strength of your concrete sample at 1, 7, 14, and 28 days of age is", mean_strength , "MPa")
print ("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

# 📈(II) Finding a formula that estimates the compressive strength of concrete

In [None]:
df

In [None]:
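from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# X1_train, X1_test, y1_train and y1_test are not defined earlier in the notebook;
# here they are assumed to come from a split of the unscaled data.
X1_train, X1_test, y1_train, y1_test = train_test_split(
    df.drop(columns=['strength']), df['strength'], test_size=0.33, random_state=0)
# Create a LinearRegression object and fit it to the training data
LR = LinearRegression()
LR.fit(X1_train, y1_train)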


In [None]:
# Make predictions using the testing set
y1_pred = LR.predict(X1_test)


In [None]:
from sklearn.metrics import mean_squared_error, r2_score
# The coefficients
print("Coefficients: \n", LR.coef_)
print("Intercept: \n", LR.intercept_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y1_test,y1_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(y1_test,y1_pred))

In [None]:
import statsmodels.formula.api as smf
model_s = smf.ols('strength ~cement+slag+fly_ash+water+superplasticizer+coarse_aggregate+fine_aggregate+age',  data=df).fit()
print(model_s.summary())

In [None]:
X2 = df.drop(columns=['strength'])
y2 = df['strength']

In [None]:
predicted_streangth = model_s.predict(X2)
residuals = predicted_streangth - y2
residuals

In [None]:
RMSE = (residuals.map(lambda x: x**2).sum() / len(residuals))**0.5
RMSE
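
The RMSE expression above squares each residual, averages the squares, and takes the square root. A small standalone sketch with hypothetical numbers shows the same calculation step by step:

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (here MPa)."""
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


actual = [30.0, 40.0, 50.0]      # hypothetical measured strengths
predicted = [33.0, 36.0, 50.0]   # hypothetical model outputs
error = rmse(actual, predicted)
print(round(error, 2))
```

Because the errors are squared before averaging, the sign of the residual (predicted minus actual, or the reverse) does not affect the result.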

In [None]:
sns.histplot(residuals, kde=True, stat='density', discrete=True)
plt.title('residuals density plot')

In [None]:
sns.histplot(y2, kde=True, stat='density', discrete=True)
sns.histplot(predicted_streangth, kde=True, stat='density', discrete=True)


In [None]:
model_s.predict(X2.iloc[[2]])  # double brackets keep the row as a one-row DataFrame

### Strength Formula:

### Strength = (-17.7481) + 0.1172 ∗ cement + 0.0994 ∗ slag + 0.0856 ∗ fly ash - 0.1526 ∗ water + 0.2834 ∗ superplasticizer + 0.0156 ∗ coarse aggregate + 0.0183 ∗ fine aggregate +  0.1122 ∗ age
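
For students who want to apply the formula without refitting the model, it can be wrapped in a small function. This is a sketch only: the coefficients are copied from the formula above, while the names `COEFFICIENTS`, `INTERCEPT` and `estimate_strength` are illustrative choices. The sample values are the ones used in section 3.1.

```python
# Coefficients copied from the strength formula (units as in the data description)
COEFFICIENTS = {
    "cement": 0.1172,
    "slag": 0.0994,
    "fly_ash": 0.0856,
    "water": -0.1526,
    "superplasticizer": 0.2834,
    "coarse_aggregate": 0.0156,
    "fine_aggregate": 0.0183,
    "age": 0.1122,
}
INTERCEPT = -17.7481


def estimate_strength(**mix):
    """Return the estimated compressive strength in MPa for one mix."""
    return INTERCEPT + sum(COEFFICIENTS[name] * mix[name] for name in COEFFICIENTS)


estimate = estimate_strength(
    cement=332.5, slag=142.5, fly_ash=0.0, water=228.0,
    superplasticizer=0.0, coarse_aggregate=932.0, fine_aggregate=594.0, age=270,
)
print(round(estimate, 1), "MPa")
```

For this sample the formula gives roughly 56.3 MPa. Unlike the models in part (I), the formula takes the raw, unscaled quantities.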

In [None]:
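(((1+7+14+28)/4)*0.1122)-17.7481

## The average strength of the concrete samples at 1, 7, 14, and 28 days of age:

Because the formula is linear in age, the average of the predictions at the four ages equals the prediction at the mean age of 12.5 days, so the age term can be folded into the constant: -17.7481 + 0.1122 ∗ 12.5 = -16.3456.

Strength = (-16.3456) + 0.1172 ∗ cement + 0.0994 ∗ slag + 0.0856 ∗ fly ash - 0.1526 ∗ water + 0.2834 ∗ superplasticizer + 0.0156 ∗ coarse aggregate + 0.0183 ∗ fine aggregate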
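
One way to sanity-check the age-averaged constant, assuming "average" means the arithmetic mean of the predictions at the four ages, is to compare that mean with the prediction at the mean age. The sketch below uses only the intercept and age coefficient from the strength formula, since the other terms are the same at every age; the helper name `age_part` is hypothetical.

```python
INTERCEPT = -17.7481   # from the strength formula
AGE_COEF = 0.1122      # MPa per day, from the strength formula
ages = [1, 7, 14, 28]


def age_part(age):
    """Intercept plus the age term; the mix terms are identical at every age."""
    return INTERCEPT + AGE_COEF * age


mean_of_predictions = sum(age_part(a) for a in ages) / len(ages)
prediction_at_mean_age = age_part(sum(ages) / len(ages))
print(round(mean_of_predictions, 4), round(prediction_at_mean_age, 4))
```

Both values come out at about -16.3456, the constant obtained when the age term is evaluated at the mean age of 12.5 days.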

## ✅ Finish