# Context

Concrete is the most widely used construction material in the world. It is made by combining several components, and the proportions of those components affect its compressive strength. To measure the actual compressive strength (the target label in this dataset), an engineer breaks cylindrical samples in a compression-testing machine; the failure load divided by the cylinder's cross-sectional area gives the compressive strength. Engineers use different kinds of concrete for different building purposes. For example, concrete used for residential buildings should have a strength of at least 2500 psi (17.2 MPa). Concrete has high strength in compression but low strength in tension, which is why engineers use reinforced concrete (usually with steel rebar) to build structures.
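
The strength calculation described above can be sketched in a few lines. This is only an illustration: the helper names, the 150 mm cylinder and the 500 kN failure load are hypothetical values chosen for the example, not taken from the dataset.

```python
import math

PSI_PER_MPA = 145.0377  # 1 MPa is roughly 145.04 psi


def compressive_strength_mpa(failure_load_kn, diameter_mm):
    """Failure load divided by the circular cross-section area, in MPa."""
    area_mm2 = math.pi * (diameter_mm / 2) ** 2
    # kN -> N; N per mm^2 is numerically equal to MPa
    return failure_load_kn * 1000 / area_mm2


def psi_to_mpa(psi):
    """Convert a strength given in psi to MPa."""
    return psi / PSI_PER_MPA


# Hypothetical sample: a 150 mm diameter cylinder that fails at 500 kN
strength = compressive_strength_mpa(500, 150)
print(round(strength, 1))          # about 28.3 MPa
print(round(psi_to_mpa(2500), 1))  # about 17.2 MPa, the residential minimum quoted above
```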

The raw dataset has columns labeled as:

* Cement (component 1) -- quantitative -- kg per m^3 of mixture -- Input Variable
* Blast Furnace Slag (component 2) -- quantitative -- kg per m^3 of mixture -- Input Variable
* Fly Ash (component 3) -- quantitative -- kg per m^3 of mixture -- Input Variable
* Water (component 4) -- quantitative -- kg per m^3 of mixture -- Input Variable
* Superplasticizer (component 5) -- quantitative -- kg per m^3 of mixture -- Input Variable
* Coarse Aggregate (component 6) -- quantitative -- kg per m^3 of mixture -- Input Variable
* Fine Aggregate (component 7) -- quantitative -- kg per m^3 of mixture -- Input Variable
* Age -- quantitative -- days (1 to 365) -- Input Variable
* Concrete compressive strength -- quantitative -- MPa -- Output Variable

# Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn import metrics
from sklearn.ensemble import VotingRegressor
from scipy import stats
from scipy.stats import zscore
from sklearn.metrics import mean_absolute_error, median_absolute_error, r2_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cluster import KMeans
from sklearn.utils import resample


# Functions

In [21]:
def iqr(data, column):
    Q1 = data[column].quantile(q = 0.25)
    Q3 = data[column].quantile(q = 0.75)
    print('1st Quartile (Q1) is: ', Q1)
    print('3rd Quartile (Q3) is: ', Q3)
    print('Interquartile range (IQR) is: ', stats.iqr(data[column]))
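
As a cross-check on the helper above, the IQR is simply Q3 minus Q1. The sketch below uses a hypothetical variant, `iqr_values`, that returns the numbers instead of printing them, and a small made-up sample rather than the real data.

```python
import pandas as pd
from scipy import stats


def iqr_values(data, column):
    """Return (Q1, Q3, IQR) for one column instead of printing them."""
    q1 = data[column].quantile(q=0.25)
    q3 = data[column].quantile(q=0.75)
    return q1, q3, q3 - q1


# Made-up sample values, for illustration only
sample = pd.DataFrame({'cement': [102.0, 192.0, 272.9, 350.0, 540.0]})
q1, q3, spread = iqr_values(sample, 'cement')

# Q3 - Q1 should agree with scipy's result
print(q1, q3, spread, stats.iqr(sample['cement']))
```

Returning the values makes them easy to reuse later, for example when comparing features.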

# Exploratory Data Analysis

In [2]:
df = pd.read_csv('Concrete_Data.csv')

In [3]:
df.head()

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),"Concrete compressive strength(MPa, megapascals)"
0,5400,0,0,1620,25,10400,6760,28,7999
1,5400,0,0,1620,25,10550,6760,28,6189
2,3325,1425,0,2280,0,9320,5940,270,4027
3,3325,1425,0,2280,0,9320,5940,365,4105
4,1986,1324,0,1920,0,9784,8255,360,4430


The decimal separators need to be changed from commas to dots first.

In [4]:
for column in df.columns:
    # convert all columns to string
    df[column] = df[column].astype(str)
    # replace commas with dots (literal match, no regex)
    df[column] = df[column].str.replace(',', '.', regex=False)
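
An alternative worth considering is to let pandas parse comma decimals while reading the file, using the `decimal` parameter of `pd.read_csv`. The sketch below runs on a made-up in-memory sample; the `;` field separator is an assumption and may not match the real `Concrete_Data.csv`.

```python
import io

import pandas as pd

# Made-up two-row sample; the ';' separator is an assumption
raw = "cement;water;strength\n540,0;162,0;79,99\n332,5;228,0;40,27\n"

# decimal=',' tells pandas to treat the comma as the decimal mark
sample = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')
print(sample.dtypes)
```

With this approach the columns arrive as floats directly, so no string replacement or later type conversion is needed.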

In [5]:
df.head()

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),"Concrete compressive strength(MPa, megapascals)"
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


To make the data easier to work with, let's shorten the column names.

In [6]:
df.rename(columns = {'Cement (component 1)(kg in a m^3 mixture)':'cement',
       'Blast Furnace Slag (component 2)(kg in a m^3 mixture)':'slag',
       'Fly Ash (component 3)(kg in a m^3 mixture)':'ash',
       'Water  (component 4)(kg in a m^3 mixture)':'water',
       'Superplasticizer (component 5)(kg in a m^3 mixture)':'superplastic',
       'Coarse Aggregate  (component 6)(kg in a m^3 mixture)':'coarseagg',
       'Fine Aggregate (component 7)(kg in a m^3 mixture)':'fineagg', 
       'Age (day)':'age',
       'Concrete compressive strength(MPa, megapascals) ':'strength'}, inplace = True)
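
Renaming by exact header text depends on matching every space in the original names. A less fragile option, sketched below, is to assign the short names by position. It assumes the columns arrive in the order listed in the Context section; the sample frame and its padded headers are made up.

```python
import pandas as pd

short_names = ['cement', 'slag', 'ash', 'water', 'superplastic',
               'coarseagg', 'fineagg', 'age', 'strength']

# Made-up one-row frame whose headers contain stray spaces
sample = pd.DataFrame(
    [[540.0, 0.0, 0.0, 162.0, 2.5, 1040.0, 676.0, 28, 79.99]],
    columns=[f' column {i}  ' for i in range(9)],
)

# Guard against a silent mismatch before overwriting the headers
if len(short_names) != sample.shape[1]:
    raise ValueError('unexpected number of columns')
sample.columns = short_names
print(list(sample.columns))
```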

In [7]:
df.head()

Unnamed: 0,cement,slag,ash,water,superplastic,coarseagg,fineagg,age,strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


There are 8 independent variables and all the values appear to be numeric; let's check how they are stored.

In [8]:
df.dtypes

cement          object
slag            object
ash             object
water           object
superplastic    object
coarseagg       object
fineagg         object
age             object
strength        object
dtype: object

Every column is stored as a string (`object`), so let's convert each column to a numeric type.

In [9]:
for column in df.columns:
    df[column] = df[column].astype('float64')

In [10]:
df['age'] = df['age'].astype('int64')
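
If some cells might not parse as numbers, `astype` raises an error. A more forgiving sketch uses `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into `NaN` so they show up in the missing-value check. The sample frame and its `'n/a'` entry are made up for illustration.

```python
import pandas as pd

# Made-up string data with one unparseable entry
sample = pd.DataFrame({
    'cement': ['540.0', '332.5'],
    'age': ['28', '270'],
    'strength': ['79.99', 'n/a'],
})

# Convert column by column; bad values become NaN instead of raising
converted = sample.apply(pd.to_numeric, errors='coerce')
print(converted.dtypes)
print(converted.isnull().sum())
```

Columns that contain only whole numbers, such as `age` here, come out as integers without a separate cast.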

In [11]:
df.head()

Unnamed: 0,cement,slag,ash,water,superplastic,coarseagg,fineagg,age,strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


In [12]:
df.dtypes

cement          float64
slag            float64
ash             float64
water           float64
superplastic    float64
coarseagg       float64
fineagg         float64
age               int64
strength        float64
dtype: object

In [13]:
df.shape

(1030, 9)

In [14]:
# checking for missing values
df.isnull().sum()

cement          0
slag            0
ash             0
water           0
superplastic    0
coarseagg       0
fineagg         0
age             0
strength        0
dtype: int64

In [15]:
# main statistics
df.describe().T.round(2)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
cement,1030.0,281.17,104.51,102.0,192.38,272.9,350.0,540.0
slag,1030.0,73.9,86.28,0.0,0.0,22.0,142.95,359.4
ash,1030.0,54.19,64.0,0.0,0.0,0.0,118.3,200.1
water,1030.0,181.57,21.35,121.8,164.9,185.0,192.0,247.0
superplastic,1030.0,6.2,5.97,0.0,0.0,6.4,10.2,32.2
coarseagg,1030.0,972.92,77.75,801.0,932.0,968.0,1029.4,1145.0
fineagg,1030.0,773.58,80.18,594.0,730.95,779.5,824.0,992.6
age,1030.0,45.66,63.17,1.0,7.0,28.0,56.0,365.0
strength,1030.0,35.82,16.71,2.33,23.71,34.44,46.14,82.6


The mean is larger than the median for cement, slag and ash, which suggests these features are right-skewed. Let's investigate them.
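
The direction of skew can be checked numerically: a mean above the median together with a positive `skew()` points to a right-skewed distribution. The sketch below uses a small made-up series, not the real `slag` column.

```python
import pandas as pd

# Made-up values: many zeros and a long tail on the right
sample = pd.Series([0.0, 0.0, 0.0, 0.0, 22.0, 100.0, 150.0, 359.0])

mean_above_median = sample.mean() > sample.median()
skewness = sample.skew()  # sample skewness; positive means a longer right tail

print(mean_above_median, skewness > 0)
```

On the real data, `df[['cement', 'slag', 'ash']].skew()` would give the same kind of check for the three features.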

## Cement

In [20]:
# quartiles
iqr(data = df, column = 'cement')

1st Quartile (Q1) is:  192.375
3rd Quartile (Q3) is:  350.0
Interquartile range (IQR) is:  157.625
