#### Pre-step 1: Import the necessary libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import zscore
import math
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn import svm
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



%matplotlib inline
# sns.set(color_codes=True)
sns.set(style="darkgrid", color_codes=True)

#### Pre-step 2: Load the dataset

In [None]:
#step 2.1: Read the dataset
rdata=pd.read_csv('/kaggle/input/yeh-concret-data/Concrete_Data_Yeh.csv')

### Deliverable -1 (Exploratory data quality report)

#### 1.a Univariate analysis
Univariate analysis: data types and description of the independent attributes, which should include name, meaning, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of the distributions and their tails, missing values, and outliers.

In [None]:
# Step 2.1.1: browse through the first few rows
rdata.head()

##### Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.

##### Dataset:
- The dataset is composed of the actual concrete compressive strength (MPa) for a given mixture under a specific age (days). 
- The observations were determined in a laboratory.
- The data is in raw form (not scaled) and has 8 quantitative input variables and 1 quantitative output variable.


##### The main aim of the project is to create a Model of strength of high performance concrete

In [None]:
# Step 2.2: Understand the shape of the data
shape_data=rdata.shape
print('The shape of the dataframe is',shape_data,'which means there are',shape_data[0],'rows of observations and',shape_data[1],'attributes of age, ingredients and concrete compression strength')

In [None]:
# Step 2.3: Identify Duplicate records in the data 
# It is very important to check and remove data duplicates. 
# Else our model may break or report overly optimistic / pessimistic performance results
dupes=rdata.duplicated()
print(' The number of duplicates in the dataset are:',sum(dupes), '\n')
dupes_record=pd.DataFrame(rdata[dupes])
print(' The duplicate observations are:') 
dupes_record

In [None]:
# step 2.3.1: Remove duplicates from the data
t1data=rdata.copy()
t1data=t1data.drop_duplicates(keep="first")

dupes1=t1data.duplicated()
print(' The number of duplicates in the new dataset are:',sum(dupes1), '\n',
      'Clearly evident that now there are no duplicates in the dataset.')

In [None]:
t1data.columns=['cement','slag','ash','water','superplastic','coarseagg','fineagg','age','strength']

In [None]:
shape_t1data=t1data.shape
print('The shape of the new dataframe is',shape_t1data,'which means there are',shape_t1data[0],'rows of observations and',shape_t1data[1],'attributes of age, ingredients and concrete compression strength')

In [None]:
# Step 2.4: Lets analyze the data types
t1data.info()

Referring to the dataframe summary above, all the columns are numerical, with integer and float data types. There are no null values.<br>

The dataset contains the following variables:

The independent variables are as below:

- Cement : measured in kg per m3 of mixture
- Blast furnace slag : measured in kg per m3 of mixture
- Fly ash : measured in kg per m3 of mixture
- Water : measured in kg per m3 of mixture
- Superplasticizer : measured in kg per m3 of mixture
- Coarse Aggregate : measured in kg per m3 of mixture
- Fine Aggregate : measured in kg per m3 of mixture
- Age : day (1~365)

The dependent variable is :<br>

- Strength: Concrete compressive strength measured in MPa<br>

In [None]:
#EDA 1: lets evaluate statistical details of the dataset. 
cname=t1data.columns
data_desc=t1data.describe().T
data_desc

From the describe summary above: <br>
1. cement: 
    - The range of cement is [102,540] with a median of 265. The mean is more than the median, which means that there could be slight skewness on the right.
    - There might be a few outliers on the right; we will plot a box plot to confirm. <br><br>

2. slag: 
    - The range of slag is [0,359.4] with a median of 20. The mean is more than the median, which means that there could be skewness on the right.
    - There might be potential outliers, which we will identify by plotting a box plot.<br><br>

3. ash: 
    - The range of ash is [0,200.1] with a median of 0. The mean is more than the median, which means that there could be skewness on the right.
    - There might be potential outliers, which we will identify by plotting a box plot.<br><br>

4. water: 
    - The range of water is [121.8,247] with a median of 185.7. The mean is slightly less than the median, which means that there could be skewness on the left.
    - There don't appear to be any outliers. We will confirm by plotting a box plot.<br><br>

5. superplastic: 
    - The range of superplastic is [0,32.2] with a median of 6.1. The mean is almost equal to the median, which means that there might not be any skewness.
    - There don't appear to be any outliers. We will confirm by plotting a box plot.<br><br>

6. coarseagg: 
    - The range of coarseagg is [801, 1145] with a median of 968. The mean is slightly more than the median, which means that there might be skewness on the right.
    - There don't appear to be any outliers. We will confirm by plotting a box plot.<br><br>

7. fineagg: 
    - The range of fineagg is [594,992.6] with a median of 780. The mean is slightly less than the median, which means that there might be skewness on the left.
    - There don't appear to be any outliers. We will confirm by plotting a box plot.<br><br>

8. age: 
    - The range of age is [1,365] with a median of 28. The mean is more than the median, which means that there might be skewness on the right.
    - There might be outliers. We will identify them by plotting a box plot.<br><br>

9. strength: 
    - The range of strength is [2.33,82.6] with a median of 33.8. The mean is more than the median, which means that there might be skewness on the right.
    - There might be outliers. We will identify them by plotting a box plot.<br><br>
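
The mean-versus-median reasoning and the outlier guesses above can also be checked numerically. The following is a minimal sketch on a small made-up frame; the `demo` data and the helper name `skew_and_outlier_summary` are assumptions for illustration and are not part of the notebook. The same helper could be pointed at `t1data`.

```python
import pandas as pd

def skew_and_outlier_summary(df):
    """Per column: mean, median, a rough skew hint and the number of
    points outside the 1.5*IQR box-plot whiskers."""
    rows = []
    for col in df.columns:
        s = df[col]
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        outside = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()
        if s.mean() > s.median():
            hint = "right"   # mean pulled above the median
        elif s.mean() < s.median():
            hint = "left"    # mean pulled below the median
        else:
            hint = "none"
        rows.append({"column": col, "mean": s.mean(), "median": s.median(),
                     "skew_hint": hint, "n_outside_whiskers": int(outside)})
    return pd.DataFrame(rows).set_index("column")

# Hypothetical toy data: one symmetric column and one with a long right tail
demo = pd.DataFrame({"symmetric": [1, 2, 3, 4, 5, 6, 7],
                     "right_tail": [1, 1, 2, 2, 3, 3, 50]})
summary = skew_and_outlier_summary(demo)
print(summary)
```

The mean/median comparison is only a hint; the skewness statistic and the box plots remain the actual checks.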

In [None]:
# Attributes in the Group
Atr1g1='cement'
Atr2g1='slag'
Atr3g1='ash'
Atr4g1='water'
Atr5g1='superplastic'
Atr6g1='coarseagg'
Atr7g1='fineagg'
Atr8g1='age'
Atr9g1='strength'

In [None]:
#EDA 1: Outlier detection using box plots
data=t1data
fig, ax = plt.subplots(1,9,figsize=(38,16)) 
# Pass each column as y so that the box plots are drawn vertically, one subplot per attribute
for i, col in enumerate([Atr1g1,Atr2g1,Atr3g1,Atr4g1,Atr5g1,Atr6g1,Atr7g1,Atr8g1,Atr9g1]):
    sns.boxplot(y=col,data=data,ax=ax[i])

Observation:
- Attributes slag, water, superplastic, fineagg, age and strength have outliers.
- We will work on the outliers in the next section.

In [None]:
data=t1data
#EDA 2: Skewness check
Atr1g1_skew=round(stats.skew(data[Atr1g1]),4)
Atr2g1_skew=round(stats.skew(data[Atr2g1]),4)
Atr3g1_skew=round(stats.skew(data[Atr3g1]),4)
Atr4g1_skew=round(stats.skew(data[Atr4g1]),4)
Atr5g1_skew=round(stats.skew(data[Atr5g1]),4)
Atr6g1_skew=round(stats.skew(data[Atr6g1]),4)
Atr7g1_skew=round(stats.skew(data[Atr7g1]),4)
Atr8g1_skew=round(stats.skew(data[Atr8g1]),4)
Atr9g1_skew=round(stats.skew(data[Atr9g1]),4)

print(' The skewness of',Atr1g1,'is', Atr1g1_skew)
print(' The skewness of',Atr2g1,'is', Atr2g1_skew)
print(' The skewness of',Atr3g1,'is', Atr3g1_skew)
print(' The skewness of',Atr4g1,'is', Atr4g1_skew)
print(' The skewness of',Atr5g1,'is', Atr5g1_skew)
print(' The skewness of',Atr6g1,'is', Atr6g1_skew)
print(' The skewness of',Atr7g1,'is', Atr7g1_skew)
print(' The skewness of',Atr8g1,'is', Atr8g1_skew)
print(' The skewness of',Atr9g1,'is', Atr9g1_skew)

- Attribute: age has high skewness.<br>
- Attributes: cement, slag, superplastic have slight skewness.<br>
- Attributes: ash, water, coarseagg, fineagg, strength have negligible skewness.<br>
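
For reference, `stats.skew` with its default settings returns the biased Fisher-Pearson coefficient, $g_1 = m_3 / m_2^{3/2}$, where $m_k$ is the $k$-th central sample moment. The sketch below recomputes it by hand on synthetic data. The thresholds used for the labels "negligible", "slight" and "high" (0.5 and 1.0) are a common rule of thumb and an assumption here, not cut-offs stated in this notebook.

```python
import numpy as np
from scipy import stats

def sample_skewness(x):
    """Biased Fisher-Pearson skewness: third central moment / second central moment ** 1.5."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m3 = np.mean(d ** 3)
    return m3 / m2 ** 1.5

def skew_label(g1, slight=0.5, high=1.0):
    """Rule-of-thumb label (assumed thresholds, adjust as needed)."""
    a = abs(g1)
    if a < slight:
        return "negligible"
    if a < high:
        return "slight"
    return "high"

# Synthetic right-tailed sample, standing in for a column such as age
rng = np.random.default_rng(0)
right_tailed = rng.exponential(scale=1.0, size=5000)
g1 = sample_skewness(right_tailed)
print(round(g1, 4), skew_label(g1))
```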

In [None]:
##EDA 3: Spread
data=t1data
fig, ax = plt.subplots(1,9,figsize=(16,8)) 
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws a comparable histogram with a density curve on top
for i, col in enumerate([Atr1g1,Atr2g1,Atr3g1,Atr4g1,Atr5g1,Atr6g1,Atr7g1,Atr8g1,Atr9g1]):
    sns.histplot(data[col],kde=True,stat='density',ax=ax[i])

Understanding the distribution of the different attributes:
- cement: It is nearly normally distributed, with a tail on the right-hand side.
- slag: Slag seems to have a bi-modal distribution, which suggests a mixture of 2 Gaussians. There also appears to be a tail on the right, so there might be extreme values or outliers.
- ash: Ash seems to have a bi-modal distribution, which suggests a mixture of 2 Gaussians. There also appears to be a tail on the right, so there might be extreme values or outliers.
- water: Water seems to have 3 Gaussians. There also appears to be a tail on the left, so there might be extreme values or outliers on the left.
- superplastic: Superplastic seems to have a bi-modal distribution, which suggests a mixture of 2 Gaussians. There also appears to be a tail on the right, so there might be extreme values or outliers.
- coarseagg: Coarseagg seems to have a bi-modal distribution, which suggests a mixture of 2 Gaussians. It doesn't seem to have any tail, so this attribute might not have any outliers.
- fineagg: Fineagg seems to have a bi-modal distribution, which suggests a mixture of 2 Gaussians. There also appears to be a slight tail on the right, so there might be extreme values or outliers.
- age: The attribute age seems to have 5 Gaussians. There also appears to be a tail on the right, so there might be extreme values or outliers.
- strength: Strength has a near-normal distribution with a slight tail on the right, indicating the possibility of extreme values or outliers.

#### 1.b Multivariate analysis 
Bi-variate analysis between the predictor variables, and between the predictor variables and the target column. Comment on your findings in terms of their relationship and degree of relation, if any. Presence of leverage points. Visualize the analysis using boxplots and pair plots, histograms or density curves. Select the most appropriate attributes.

In [None]:
# Step 2.6: Lets visually understand if there is any correlation between the independent variables. 
usecols =[i for i in t1data.columns if i != 'strength']
sns.pairplot(t1data,diag_kind='kde');

Observations:
- On the diagonal we have changed the plot to a density plot. The default is a histogram.
- The different plots show how the attributes interact with each other and how they depend on each other.
- It is important to address strong interdependence, since several algorithms (linear regression in particular) assume that the predictors are not strongly correlated with each other. If the attributes are strongly interdependent, the coefficients become unstable and the model may perform sub-optimally in production.
- It appears that there are 2 Gaussians or 2 clusters hidden in the dataset, as the density plots of multiple attributes are bi-modal.
- We will do cluster analysis to understand any hidden patterns or hidden clusters in the dataset. To begin with, we will consider that the dataset has 3-4 clusters (as seen from the pair plot, there are multiple attributes with at least 2 Gaussians).


From the scatter plots between the different attributes, it appears that there isn't any significant correlation between attributes. We will calculate the correlations to confirm this.
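
Pairwise scatter plots can miss interdependence that involves more than two attributes. One standard numeric check is the variance inflation factor (VIF), which for standardized predictors equals the diagonal of the inverse correlation matrix. The sketch below is a hedged illustration on synthetic data; the helper name `vif_from_corr` and the `demo` columns are assumptions and do not appear in this notebook.

```python
import numpy as np
import pandas as pd

def vif_from_corr(df):
    """VIF of each column, read off the diagonal of the inverse correlation matrix.
    A VIF near 1 means the column is barely explained by the other columns."""
    corr = df.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)

# Synthetic predictors: 'a_plus_noise' is almost a copy of 'a', while 'b' is independent
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.1 * rng.normal(size=500)})

vif = vif_from_corr(demo)
print(vif.round(2))
```

Applied to the predictors only (for example `t1data[usecols]`), large values would flag attributes that are close to a linear combination of the others.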


In [None]:
# Step 2.7: let's evaluate the correlation between the different attributes.
# The dependent attribute strength is included in the heatmap so that its
# correlation with each independent attribute is visible as well.
corr=t1data.corr()
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(corr,annot=True,linewidth=0.05,ax=ax, fmt= '.2f');

As observed while analysing the pair plot, there doesn't seem to be very high correlation between the independent attributes. However, the following attributes appear to have some correlation:
- There appears to be some correlation between superplastic and water
- There seems to be some correlation between water and fineagg
- There seems to be some correlation between cement and strength
- There seems to be some correlation between ash and superplastic
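
Reading pairs off a heatmap by eye is error-prone, so the observations above can be cross-checked by ranking the pairs. This is a minimal sketch on synthetic data; `ranked_correlation_pairs` and the `demo` frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def ranked_correlation_pairs(df):
    """Return each unordered column pair once, sorted by absolute Pearson correlation."""
    corr = df.corr()
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # strictly above the diagonal
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Synthetic data: y is strongly (negatively) related to x, z is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=400)
demo = pd.DataFrame({"x": x,
                     "y": -x + 0.5 * rng.normal(size=400),
                     "z": rng.normal(size=400)})

ranked = ranked_correlation_pairs(demo)
print(ranked.round(2))
```

Calling the helper on `t1data` would list the strongest positive and negative pairs at the top.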

Scaling: There are 2 scales in the independent attributes (kg/m3 and days). We will scale the data in the next sections

In [None]:
### Analyzing Dependent variable (Strength) vs Independent variable (cement, age and water)
fig, ax = plt.subplots(figsize=(10,8))
sns.scatterplot(y="strength", x="cement", hue="water", size="age", data=t1data, ax=ax, sizes=(50, 300),
                palette='RdYlGn', alpha=0.9)
ax.set_title("Strength vs Cement, Age, Water")
ax.legend()
plt.show()

Observations:
- Strength correlates positively with cement
- Strength correlates positively with age, though less than with cement
- Older mixes tend to contain more water, as shown by the larger green data points
- Strength correlates negatively with water
- High strength at a low age requires more cement

In [None]:
### Analyzing Dependent variable (Strength) vs Independent variable (FineAgg, Ash, Superplastic)
fig, ax = plt.subplots(figsize=(10,8))
sns.scatterplot(y="strength", x="fineagg", hue="ash", size="superplastic", data=t1data, ax=ax, sizes=(50, 300),
                palette='RdYlBu', alpha=0.9)
ax.set_title("Strength vs FineAgg, Ash, Superplastic")
ax.legend(loc="upper left", bbox_to_anchor=(1,1)) # Moved outside the chart so it doesn't cover any data
plt.show()

Observations:
- Strength doesn't have any clear correlation with ash
- Strength correlates positively with superplastic

#### 1.c Outlier treatment
Pick one strategy to address the presence of outliers and missing values and perform the necessary imputation

In [None]:
# Before proceeding further, let's first create a copy of the dataset.
# We will use SimpleImputer with the median strategy to address the outliers

In [None]:
t2data = t1data.copy()

In [None]:
#EDA 2: Outlier detection using box plots
data=t2data
fig, ax = plt.subplots(1,9,figsize=(38,16)) 
# Pass each column as y so that the box plots are drawn vertically, one subplot per attribute
for i, col in enumerate([Atr1g1,Atr2g1,Atr3g1,Atr4g1,Atr5g1,Atr6g1,Atr7g1,Atr8g1,Atr9g1]):
    sns.boxplot(y=col,data=data,ax=ax[i])

In [None]:
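def outliar_detection(col):
    # Mark values outside the 1.5*IQR whiskers as NaN so that they can be imputed later
    Q1=t2data[col].quantile(0.25)
    Q3=t2data[col].quantile(0.75)
    IQR=Q3-Q1
    Lower_Whisker = Q1-1.5*IQR
    Upper_Whisker = Q3+1.5*IQR
    # Series.where keeps the in-range values and sets the rest to NaN;
    # this avoids chained assignment, which pandas may silently ignore
    t2data[col] = t2data[col].where(t2data[col].between(Lower_Whisker, Upper_Whisker))
    return t2data[col][t2data[col].isnull()]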

In [None]:
for i in usecols:
    outliar_detection(i)

In [None]:
t2data.info()

In [None]:
# Imputing the missing values with median
columns=t2data.columns
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')

# fit_transform learns each column's median and fills the NaN values in one step
t2data=pd.DataFrame(imp_median.fit_transform(t2data),columns=columns)

In [None]:
data=t2data
fig, ax = plt.subplots(1,9,figsize=(38,16)) 
# Box plots after imputation; each column is passed as y so that the plots are vertical
for i, col in enumerate([Atr1g1,Atr2g1,Atr3g1,Atr4g1,Atr5g1,Atr6g1,Atr7g1,Atr8g1,Atr9g1]):
    sns.boxplot(y=col,data=data,ax=ax[i])

When we remove outliers and replace them with the median, the shape of the distribution changes: the spread (and therefore the IQR) becomes tighter, so some points that were previously inside the whiskers are now flagged as new outliers. These new outliers are much closer to the centre than the original ones, so we accept them without modifying them.
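
The effect described above can be reproduced on a tiny made-up sample. The sketch below assumes the same 1.5*IQR whisker rule as the box plots and replaces flagged values with the median of the remaining values, mirroring what a median imputer does after the outliers are set to NaN. The numbers are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

def iqr_outlier_mask(x):
    """True where a value lies outside the 1.5*IQR whiskers."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Hypothetical sample with a right tail
x = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 26, 27, 40, 60], dtype=float)

before = iqr_outlier_mask(x)
print("flagged before imputation:", x[before])      # the two largest values

# Replace the flagged values with the median of the values that were not flagged
imputed = x.copy()
imputed[before] = np.median(x[~before])

after = iqr_outlier_mask(imputed)
print("flagged after imputation:", imputed[after])  # a value that was previously inside the whiskers
```

Here 40 and 60 are flagged first; once they are replaced, the upper quartile drops and 27 falls outside the new, tighter whiskers.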

### Deliverable -2 (Feature Engineering techniques)

#### 2.a Identify opportunities (if any) to create a composite feature, drop a feature etc.

##### 2.a.i Creation of a composite feature.

Feature Engineering: 
- A key component for testing the strength and durability of a concrete mix is the water to cement ratio.
- A lower ratio leads to higher strength and durability.
- Since this attribute doesn't exist in the dataset, we will compute it and add it.

In [None]:
t2data['w/c ratio']=t2data['water']/t2data['cement']

In [None]:
t2data.head()

##### 2.a.ii Deletion of features.
 - We will leverage PCA for dimensionality reduction.
 - The approach we are going to follow is to build 2 sets of models:
     - One without deletion of any attributes and 
     - Second - reducing attributes by leveraging PCA
     - Then we will decide whether to recommend the model with or without dimensionality reduction
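
The two-track comparison described above can be sketched with scikit-learn pipelines. This is an illustrative alternative on synthetic data, not the code used in this notebook: it lets `PCA(n_components=0.95)` pick the number of components that explains at least 95% of the variance, and it fits the scaler and PCA on the training split only. The data from `make_regression` and all variable names are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 9 predictors (8 raw attributes plus the w/c ratio)
X, y = make_regression(n_samples=300, n_features=9, n_informative=5,
                       noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

# Track 1: all attributes. Track 2: PCA keeping enough components for 95% of the variance.
full_model = make_pipeline(StandardScaler(), LinearRegression())
pca_model = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())

full_model.fit(X_tr, y_tr)
pca_model.fit(X_tr, y_tr)

pca_step = pca_model.named_steps["pca"]
n_kept = pca_step.n_components_
explained = pca_step.explained_variance_ratio_.sum()
print("components kept:", n_kept, "explained variance:", round(explained, 3))
print("test R^2, all attributes:", round(full_model.score(X_te, y_te), 3))
print("test R^2, with PCA:", round(pca_model.score(X_te, y_te), 3))
```

Keeping the transformations inside the pipeline means the test split never influences the scaling or the principal components.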

In [None]:
# It is always better to make a copy of the data before applying any transformation on data
t3data=t2data.copy()

In [None]:
t3data_scaled=t3data.apply(zscore)
X_scaled=t3data_scaled.drop('strength',axis=1)

In [None]:
covMatrix = np.cov(X_scaled,rowvar=False)

In [None]:
# Choosing 8 PCA components and fitting on the scaled data.
# The count of 8 is an arbitrary starting point to inspect the variance explained by each component.
# We will finalize the number of components based on how many are required to explain 95% of the variance
pca = PCA(n_components=8)
pca.fit(X_scaled)

In [None]:
#Computing the eigen Values
print(pca.explained_variance_)

In [None]:
#Lets compute the eigen Vectors
print(pca.components_)

In [None]:
print(pca.explained_variance_ratio_)

In [None]:
plt.bar(list(range(1,9)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('Principal component')
plt.show()

In [None]:
plt.step(list(range(1,9)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cum of variation explained')
plt.xlabel('Principal component')
plt.show()

In [None]:
# cumulating explained variance ratio to identify how many principal components are required to explain 95% of the variance
cum_var_exp = np.cumsum(pca.explained_variance_ratio_)
# print("Cumulative Variance Explained", cum_var_exp)
pd.DataFrame(cum_var_exp,columns=['Cumul Variance Explanation'],index=['1','2','3','4','5','6','7','8'])

In [None]:
# 6 components explain over 95% of the variance. Hence we will take 6 components

In [None]:
pca6 = PCA(n_components=6)
pca6.fit(X_scaled)
print(pca6.components_)
print(pca6.explained_variance_ratio_)
Xpca6 = pca6.transform(X_scaled)
Y = t3data_scaled['strength']

In [None]:
X_train_pca, X_test_pca, y_train_pca, y_test_pca=train_test_split(Xpca6,Y,test_size=0.30,random_state=1)

In [None]:
# lets check split of data

print("{0:0.2f}% data is in training set".format((len(X_train_pca)/len(t3data.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test_pca)/len(t3data.index)) * 100))

##### 2.b Decide on complexity of the model, should it be simple linear model in terms of parameters or would a quadratic or higher degree help

We will begin model building with linear regression; based on its performance, we will try other models. We will also try polynomial regression with different degrees. The model building and the selection of the best model will be done in step 3.

We will train the following regression algorithms:
1. Linear Regression
2. SVR
3. Ridge Regression
4. Lasso Regression
5. Polynomial Regression
6. Decision Tree
7. Random Forest
8. Bagging
9. Ada Boost
10. Gradient Boost
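
One hedged way to decide between a linear and a higher-degree model is to compare cross-validated scores across polynomial degrees. The sketch below uses synthetic data with a known quadratic relationship; the data, the candidate degrees and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic target with a squared term and an interaction term
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = 1.0 + X[:, 0] ** 2 + 0.5 * X[:, 0] * X[:, 1] + rng.normal(scale=0.2, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
cv_r2 = {}
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # Mean out-of-fold R^2: unlike the training R^2, it can drop when the model overfits
    cv_r2[degree] = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    print("degree", degree, "mean CV R^2:", round(cv_r2[degree], 3))
```

On this toy data the degree-2 model scores far above the linear one, while degree 3 adds terms without a comparable gain.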

##### 2.c Explore for gaussians. If data is likely to be a mix of gaussians, explore individual clusters and present your findings in terms of the independent attributes and their suitability to predict strength

Referring to the pair plot in section 1:
- It appears that there are 3 to 4 Gaussians in the dataset, since the density plots of multiple variables (slag, ash, superplastic) are bi-modal or multi-modal, which hints at clusters hidden in the data.
- We will do cluster analysis to understand any hidden patterns or clusters in the dataset. To begin with, we will consider that the dataset has 2-6 clusters (as seen from the pair plot, there are at least 2 Gaussians). Then we will finalize the number of clusters using elbow plots.

We will use K-means clustering. We don't know how many clusters to look for; the pair plot hints that the number of clusters is likely to be in the range of 2 to 6. So let's explore that range.
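
An elbow plot can be hard to read when the bend is gradual, so a second criterion is sometimes used alongside it. The sketch below computes both the K-means inertia and the silhouette score over a range of k on synthetic blobs. The blob centres, the range of k and the variable names are assumptions, and the silhouette score is a complementary technique that this notebook does not use.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic clusters, scaled before clustering
X, _ = make_blobs(n_samples=600, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=0.6, random_state=1)
X = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    # inertia_ is the within-cluster sum of squares used in the elbow plot
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))
    print(k, "inertia:", round(results[k][0], 1), "silhouette:", round(results[k][1], 3))

best_k = max(results, key=lambda k: results[k][1])
print("k with the highest silhouette score:", best_k)
```

Scaling before K-means matters because the algorithm is distance-based, so the elbow is best computed on the same scaled data that the final clustering uses.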

In [None]:
kdata=t2data.copy()

In [None]:
# We expect 3 to 4 clusters from visual inspection of the pair plot, hence restricting the search to k = 2 to 5

cluster_range = range( 2, 6 )
cluster_errors = []
for num_clusters in cluster_range:
  clusters = KMeans( num_clusters, n_init = 5)
  clusters.fit(kdata)
  labels = clusters.labels_
  centroids = clusters.cluster_centers_
  cluster_errors.append( clusters.inertia_ )
clusters_df = pd.DataFrame( { "num_clusters":cluster_range, "cluster_errors": cluster_errors } )
clusters_df[0:15]

In [None]:
# Elbow plot to ascertain the number of clusters

plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o" )

In [None]:
# The elbow plot confirms our visual analysis that there are likely 3 good clusters

In [None]:
kdata_z = kdata.apply(zscore)

cluster = KMeans( n_clusters = 3, random_state = 1 )
cluster.fit(kdata_z)

prediction=cluster.predict(kdata_z)
kdata_z["GROUP"] = prediction     # Creating a new column "GROUP" which will hold the cluster id of each record

kdata_z_copy = kdata_z.copy(deep = True)  # Creating a mirror copy for later re-use instead of building repeatedly

In [None]:
centroids = cluster.cluster_centers_
centroids

In [None]:
centroid_df = pd.DataFrame(centroids, columns = list(kdata) )
centroid_df

In [None]:
kdata_z.boxplot(by = 'GROUP',figsize = (40,18), layout = (2,15));

##### We notice that there are outliers. However, we resolved the outliers earlier by replacing them with the median. When outliers are treated this way, the shape of the distribution changes and the spread becomes tighter, which creates new outliers. The new outliers are much closer to the centre than the original ones, so we accept them without modifying them. Hence, we will ignore these outliers.

Let's analyse the variables at cluster level.
At cluster level, we want to understand how strength is impacted by the different attributes.

In [None]:
### strength Vs cement

var = 'cement'

with sns.axes_style("white"):
    # x and y must be passed as keywords in recent seaborn releases
    plot = sns.lmplot(x=var,y='strength',data=kdata_z,hue='GROUP')
plot.set(ylim = (-3,3));

The more horizontal the fitted line is, the weaker the independent variable is at predicting the target variable.

Observation:
   - For cluster 2 (orange line) and cluster 3 (green line), there seems to be some positive relationship between strength and cement.
   - For cluster 1 (blue line), the line is nearly horizontal, which means that strength is only weakly predicted by cement in cluster 1.
   - So cement may not be a good predictor across all 3 clusters.
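
The visual judgement of how flat each fitted line is can be backed by numbers. The sketch below fits a straight line per group with `np.polyfit` and reports the slope and correlation; the helper name `slope_by_group` and the synthetic `demo` frame are assumptions for illustration. On the notebook's data the call would look like `slope_by_group(kdata_z, 'cement', 'strength', 'GROUP')`.

```python
import numpy as np
import pandas as pd

def slope_by_group(df, x, y, group):
    """Least-squares slope and Pearson correlation of y on x within each group."""
    rows = {}
    for g, part in df.groupby(group):
        slope, _intercept = np.polyfit(part[x], part[y], deg=1)
        rows[g] = {"slope": slope, "corr": part[x].corr(part[y]), "n": len(part)}
    return pd.DataFrame.from_dict(rows, orient="index")

# Synthetic data: group 0 has no relationship between x and y, group 1 has slope 1
rng = np.random.default_rng(0)
x = rng.normal(size=400)
group = np.repeat([0, 1], 200)
y = np.where(group == 0, 0.0, 1.0) * x + 0.3 * rng.normal(size=400)
demo = pd.DataFrame({"x": x, "y": y, "GROUP": group})

table = slope_by_group(demo, "x", "y", "GROUP")
print(table.round(3))
```

A slope and correlation close to zero correspond to the nearly horizontal lines in the plots.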

In [None]:
# strength Vs water

var = 'water'

with sns.axes_style("white"):
    # x and y must be passed as keywords in recent seaborn releases
    plot = sns.lmplot(x=var,y='strength',data=kdata_z,hue='GROUP')
plot.set(ylim = (-3,3));

Observation:
   - For group 1 (blue) and group 2 (orange), there seems to be some negative relationship between strength and water.
   - Group 3, represented by the green line, appears to have a positive relationship.
   - So 2 clusters seem to have a negative relationship while 1 cluster seems to have a positive relationship. Hence, water may not be a good predictor across all 3 clusters.

In [None]:
# strength Vs fineagg

var = 'fineagg'

with sns.axes_style("white"):
    # x and y must be passed as keywords in recent seaborn releases
    plot = sns.lmplot(x=var,y='strength',data=kdata_z,hue='GROUP')
plot.set(ylim = (-3,3))

Observation:
   - Group 3 (green line) seems to show a good relationship between strength and fineagg.
   - For group 1 (blue) and group 2 (orange), the lines are nearly horizontal, which means that strength is only weakly predicted by fineagg in groups 1 and 2.
   - So fineagg may not be a good predictor across all 3 clusters.

In [None]:
# strength Vs slag

var = 'slag'

with sns.axes_style("white"):
    # x and y must be passed as keywords in recent seaborn releases
    plot = sns.lmplot(x=var,y='strength',data=kdata_z,hue='GROUP')
plot.set(ylim = (-3,3));

Observation:
   - For all 3 clusters, the lines are nearly horizontal, which means that strength is only weakly predicted by slag in groups 1, 2 and 3.
   - So slag may not be a good predictor for any of the 3 clusters.

In [None]:
# strength Vs ash

var = 'ash'

with sns.axes_style("white"):
    # x and y must be passed as keywords in recent seaborn releases
    plot = sns.lmplot(x=var,y='strength',data=kdata_z,hue='GROUP')
plot.set(ylim = (-3,3));

Observation:
   - Group 3 (green line) seems to show a slight positive relationship between strength and ash.
   - For group 2 (orange), the line is nearly horizontal, which means that strength is only weakly predicted by ash in group 2.
   - For group 1 (blue line), there appears to be a slight negative relationship between strength and ash.
   - So ash may not be a good predictor across all 3 clusters.

In [None]:
# strength Vs Superplasticizer

var = 'superplastic'

with sns.axes_style("white"):
    # x and y must be passed as keywords in recent seaborn releases
    plot = sns.lmplot(x=var,y='strength',data=kdata_z,hue='GROUP')
plot.set(ylim = (-3,3));

Observation:
   - For cluster 2 (orange line) and cluster 3 (green line), there appears to be a slight relationship between superplastic and strength.
   - For cluster 1 (blue), the line is nearly horizontal, which means that strength is only weakly predicted by superplastic in group 1.
   - So superplastic may not be a good predictor across all 3 clusters.

In [None]:
# strength Vs Coarse Aggregate

var = 'coarseagg'

with sns.axes_style("white"):
    # x and y must be passed as keywords in recent seaborn releases
    plot = sns.lmplot(x=var,y='strength',data=kdata_z,hue='GROUP')
plot.set(ylim = (-3,3));

Observation:
   - For all 3 clusters, the lines are nearly horizontal, which means that strength is only weakly predicted by coarseagg in groups 1, 2 and 3.
   - So coarseagg may not be a good predictor for any of the 3 clusters.

In [None]:
# strength Vs age

var = 'age'

with sns.axes_style("white"):
    # x and y must be passed as keywords in recent seaborn releases
    plot = sns.lmplot(x=var,y='strength',data=kdata_z,hue='GROUP')
plot.set(ylim = (-3,3));

Observation:
   - For all 3 clusters, there appears to be some relationship between strength and age. Hence, age can be considered an attribute that can predict strength in all 3 clusters.

### 3. Deliverable -3 (create the model )

#### 3.a Obtain feature importance for the individual features and present your findings

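Section 2.a gave a first view of which directions in the data carry the most variance when we performed principal component analysis; note that this describes the components rather than the predictive importance of the individual features. Another way to identify feature importance is through a decision tree regressor. We will identify feature importance while building the model with the decision tree regressor.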
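
As a preview of the tree-based approach, the sketch below reads `feature_importances_` from a fitted `DecisionTreeRegressor` on synthetic data in which one column dominates the target by construction. The column names `strong`, `weak` and `noise`, the coefficients and the tree depth are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends mostly on 'strong', a little on 'weak', not on 'noise'
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["strong", "weak", "noise"])
y = 3.0 * X["strong"] + 0.5 * X["weak"] + rng.normal(scale=0.1, size=500)

tree = DecisionTreeRegressor(max_depth=4, random_state=1).fit(X, y)

# Impurity-based importances: each feature's share of the total reduction in squared error
importances = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.round(3))
```

The importances sum to 1, so they are best read as relative shares rather than absolute effect sizes.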

In this step, we will build multiple algorithms and then, based on their performance, decide in section 4 which algorithm gives the best results. However, before we proceed further, let's scale the data and then divide it into train and test sets.

From the dataset it is evident that the independent variables are measured on 2 scales, kg/m3 and days. Machine learning algorithms don't recognize the unit of the data, so it isn't prudent to compare kg/m3 with age directly: an attribute with a larger numeric range would dominate. 
10 kg/m3 and 10 days mean different things, but a machine learning algorithm treats both as the same number.<br>

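Scale impacts:
1. Algorithms fitted by gradient descent or with a penalty term, such as Logistic Regression, Linear Regression when fitted by gradient descent, and regularised linear models (Ridge, Lasso)
2. Distance-based algorithms like KNN, K-means and SVM

Scale doesn't impact:
1. Tree-based algorithms like Decision Trees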
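
The contrast above can be demonstrated on synthetic data: the sketch below expresses one feature in a unit 1000 times smaller and compares predictions before and after for a tree and for a distance-based model. `KNeighborsRegressor` stands in for the distance-based family here; the data, the factor of 1000 and the helper name `predictions` are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train = X[:150], X[150:], y[:150]

# Same data, but the first feature is expressed in a unit 1000 times smaller
scale = np.array([1000.0, 1.0])
X_train_big, X_test_big = X_train * scale, X_test * scale

def predictions(model, X_fit, X_new):
    """Fit on the training rows and predict the held-out rows."""
    return model.fit(X_fit, y_train).predict(X_new)

tree_same = np.allclose(
    predictions(DecisionTreeRegressor(max_depth=4, random_state=1), X_train, X_test),
    predictions(DecisionTreeRegressor(max_depth=4, random_state=1), X_train_big, X_test_big))
knn_same = np.allclose(
    predictions(KNeighborsRegressor(n_neighbors=5), X_train, X_test),
    predictions(KNeighborsRegressor(n_neighbors=5), X_train_big, X_test_big))

print("tree predictions unchanged by rescaling:", tree_same)
print("KNN predictions unchanged by rescaling:", knn_same)
```

Tree splits depend only on the ordering of values within a feature, which rescaling preserves, whereas the rescaled feature dominates the distances that KNN relies on.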

In [None]:
# Let's build our regression models.
# Before proceeding further we'll scale the data so that the attributes are comparable.
# Plain linear models are not impacted by scaling; however, regularized models like ridge and lasso are.
# Hence, to be on the safe side, let's scale the data, since we might use regularization.

t2data_scaled=t2data.apply(zscore)
X=t2data_scaled.drop('strength',axis=1)
y = t2data_scaled['strength']

In [None]:

# splitting the data into train and test
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.30,random_state=1)

In [None]:
# lets check split of data
print("{0:0.2f}% data is in training set".format((len(X_train)/len(t2data.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(t2data.index)) * 100))

We will train the following regression algorithms:
1. Linear Regression
2. SVR
3. Ridge Regression
4. Lasso Regression
5. Polynomial Regression
6. Decision Tree
7. Random Forest
8. Bagging
9. Ada Boost
10. Gradient Boost

#### Regression Model 1: Linear Regression

In [None]:
### Building the model with all the attributes

In [None]:
# Fit the model on train data
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

In [None]:
# Let us explore the coefficients for each of the independent attributes

for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[idx]))

In [None]:
intercept = regression_model.intercept_

print("The intercept for our model is {}".format(regression_model.intercept_))

In [None]:
regression_model.score(X_train, y_train)

In [None]:
score_LR= regression_model.score(X_test, y_test)
score_LR

In [None]:
#### Building the model with reduced dimensionality (PCA)

In [None]:
regression_model_pca = LinearRegression()
regression_model_pca.fit(X_train_pca, y_train_pca)

In [None]:
regression_model_pca.coef_

In [None]:
intercept_pca = regression_model_pca.intercept_

print("The intercept for our model is {}".format(regression_model_pca.intercept_))

In [None]:
y_predict_LR_pca = regression_model_pca.predict(X_test_pca)

In [None]:
regression_model_pca.score(X_train_pca, y_train_pca)

In [None]:
score_LR_PCA = regression_model_pca.score(X_test_pca, y_test_pca)
score_LR_PCA

#### Regression Model 2: SVR

In [None]:
### Building the model with all the attributes

In [None]:
clf = svm.SVR()
clf.fit(X_train, y_train)

In [None]:
y_predict_SVR = clf.predict(X_test)

In [None]:
clf.score(X_train, y_train)

In [None]:
score_SVR = clf.score(X_test, y_test)
score_SVR

In [None]:
#### Building the model with reduced dimensionality (PCA)

In [None]:
clf_pca = svm.SVR() 
clf_pca.fit(X_train_pca, y_train_pca)

In [None]:
clf_pca.score(X_train_pca, y_train_pca)

In [None]:
score_SVR_PCA=clf_pca.score(X_test_pca, y_test_pca)
score_SVR_PCA

#### Regression Model 3: Ridge Regression: Regularised Linear Model

In [None]:
### Building the model with all the attributes

In [None]:
ridge = Ridge(alpha=0.3)

In [None]:
ridge.fit(X_train,y_train)
print("Ridge model:",ridge.coef_)

In [None]:
ridge.score(X_train,y_train)

In [None]:
score_ridge = ridge.score(X_test,y_test)
score_ridge

In [None]:
#### Building the model with reduced dimensionality (PCA)

In [None]:
ridge_pca = Ridge(alpha=0.3)

In [None]:
ridge_pca.fit(X_train_pca,y_train_pca)
print("Ridge model:",ridge_pca.coef_)

In [None]:
ridge_pca.score(X_train_pca,y_train_pca)

In [None]:
score_ridge_PCA = ridge_pca.score(X_test_pca,y_test_pca)
score_ridge_PCA

#### Regression Model 4: Lasso Regression - Regularised Linear Model

In [None]:
### Building the model with all the attributes

In [None]:
lasso=Lasso(alpha=0.1)

In [None]:
lasso.fit(X_train,y_train)
print("Lasso Model",lasso.coef_)

In [None]:
lasso.score(X_train,y_train)

In [None]:
score_lasso = lasso.score(X_test,y_test)
score_lasso

In [None]:
#### Building the model with reduced dimensionality (PCA)

In [None]:
lasso_pca=Lasso(alpha=0.1)
lasso_pca.fit(X_train_pca,y_train_pca)
print("Lasso Model",lasso_pca.coef_)

In [None]:
lasso_pca.score(X_train_pca,y_train_pca)

In [None]:
score_lasso_PCA=lasso_pca.score(X_test_pca, y_test_pca)
score_lasso_PCA

#### Regression Model 5: Polynomial Regression

In [None]:
### Building the model with all the attributes

In [None]:
poly = PolynomialFeatures(degree=2, interaction_only=True)
X_train3 = poly.fit_transform(X_train)
# Reuse the transformer fitted on the training data; do not refit on the test data
X_test3 = poly.transform(X_test)

poly_clf = linear_model.LinearRegression()

poly_clf.fit(X_train3, y_train)

y_pred = poly_clf.predict(X_test3)

#print(y_pred)

#In sample (training) R^2 will always improve with the number of variables!
print(poly_clf.score(X_train3, y_train))
score_LR_poly = poly_clf.score(X_test3, y_test)
score_LR_poly

In [None]:
#### Building the model with reduced dimensionality (PCA)

In [None]:
poly_pca = PolynomialFeatures(degree=4, interaction_only=True)
X_train_poly = poly_pca.fit_transform(X_train_pca)
# Reuse the transformer fitted on the training data; do not refit on the test data
X_test_poly = poly_pca.transform(X_test_pca)

poly_clf_pca = linear_model.LinearRegression()

poly_clf_pca.fit(X_train_poly, y_train_pca)

y_pred = poly_clf_pca.predict(X_test_poly)

#print(y_pred)

#In sample (training) R^2 will always improve with the number of variables!
print(poly_clf_pca.score(X_train_poly, y_train_pca))
score_LR_poly_PCA = poly_clf_pca.score(X_test_poly, y_test_pca)
score_LR_poly_PCA
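
The usual convention of fitting a transformer on the training split and only transforming the test split can be enforced by wrapping the feature expansion and the regression in one pipeline. This is a minimal sketch under assumptions: the data is synthetic (`make_regression` with 8 features), and `poly_model` and `n_terms` are hypothetical names that do not appear elsewhere in the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): replace with the notebook's X and y.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# The pipeline fits PolynomialFeatures on the training data only and
# applies the same fitted transformer when scoring or predicting.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True),
    LinearRegression(),
)
poly_model.fit(X_tr, y_tr)

# Number of generated columns: bias + 8 original features + 28 pairwise interactions.
n_terms = poly_model.named_steps["polynomialfeatures"].n_output_features_
print("Expanded feature count:", n_terms)
print("Train R^2:", poly_model.score(X_tr, y_tr))
print("Test R^2:", poly_model.score(X_te, y_te))
```

With `interaction_only=True` the expansion omits squared terms, so the feature count grows with the number of feature pairs rather than with every degree-2 monomial.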

#### Regressor Model 6: Decision Tree Regressor

In [None]:
### Building the model with all the attributes. We will also compute feature importance

In [None]:
regressor = DecisionTreeRegressor(random_state=1,max_depth=5)
regressor.fit(X_train, y_train)

In [None]:
feature_importances = regressor.feature_importances_
feature_names=X_train.columns

In [None]:
summary = {'Features' : feature_names,'Feature Importance' : feature_importances
          }

In [None]:
Feature_Importance_df = pd.DataFrame(summary)
print('The feature importance is:','\n')
Feature_Importance_df

In [None]:
y_pred_DTR = regressor.predict(X_test)
score_DTR= regressor.score(X_test, y_test)
score_DTR

In [None]:
#### Building the model with reduced dimensionality (PCA)

In [None]:
regressor_pca = DecisionTreeRegressor(random_state=1,max_depth=5)
regressor_pca.fit(X_train_pca, y_train_pca)

In [None]:
y_pred_dtr_pca = regressor_pca.predict(X_test_pca)
score_DTR_PCA = regressor_pca.score(X_test_pca, y_test_pca)
score_DTR_PCA

#### Regression Model 7: Random Forest

In [None]:
### Building the model with all the attributes

In [None]:
model_rf = RandomForestRegressor() 
# n_estimators = 50,random_state=1,max_features=3
model_rf = model_rf.fit(X_train, y_train)

In [None]:
y_predict_rf = model_rf.predict(X_test)
score_RF = model_rf.score(X_test, y_test)
score_RF

In [None]:
#### Building the model with reduced dimensionality (PCA)

In [None]:
model_rf_pca = RandomForestRegressor() 
# n_estimators = 50,random_state=1,max_features=3
model_rf_pca = model_rf_pca.fit(X_train_pca, y_train_pca)

In [None]:
y_predict_rf_pca = model_rf_pca.predict(X_test_pca)
score_RF_PCA = model_rf_pca.score(X_test_pca, y_test_pca)
score_RF_PCA

#### Regression Model 8: Bagging Regressor

In [None]:
### Building the model with all the attributes

In [None]:
bgcl = BaggingRegressor()
#n_estimators=50,random_state=1
bgcl = bgcl.fit(X_train, y_train)

In [None]:
y_predict_bag = bgcl.predict(X_test)
score_bag = bgcl.score(X_test , y_test)
score_bag

In [None]:
#### Building the model with reduced dimensionality (PCA)

In [None]:
bgcl_pca = BaggingRegressor()
#n_estimators=50,random_state=1
bgcl_pca = bgcl_pca.fit(X_train_pca, y_train_pca)

In [None]:
y_predict_bag_pca = bgcl_pca.predict(X_test_pca)
score_bag_PCA = bgcl_pca.score(X_test_pca , y_test_pca)
score_bag_PCA

#### Regression Model 9: Ada Boost Regressor

In [None]:
### Building the model with all the attributes

In [None]:
AdaBC = AdaBoostRegressor()
# Optional arguments: n_estimators=50, random_state=1
AdaBC = AdaBC.fit(X_train, y_train)

In [None]:
y_predict_ada = AdaBC.predict(X_test)
score_AdaBC = AdaBC.score(X_test , y_test)
score_AdaBC

In [None]:
#### Building the model with reduced dimensionality (PCA)

In [None]:
AdaBC_pca = AdaBoostRegressor()
# Optional arguments: n_estimators=50, random_state=1
AdaBC_pca = AdaBC_pca.fit(X_train_pca, y_train_pca)

In [None]:
y_predict_ada_pca = AdaBC_pca.predict(X_test_pca)
score_AdaBC_PCA = AdaBC_pca.score(X_test_pca , y_test_pca)
score_AdaBC_PCA

#### Regression Model 10: Gradient Boost Regressor

In [None]:
### Building the model with all the attributes

In [None]:
GraBR = GradientBoostingRegressor()
# Optional arguments: n_estimators=50, random_state=1
GraBR_fit = GraBR.fit(X_train, y_train)
y_predict_GraBR = GraBR.predict(X_test)

In [None]:
## Testing the model on train data
score_GraBR_train = GraBR.score(X_train , y_train)
score_GraBR_train

In [None]:
## Testing the model on the test data
score_GraBR = GraBR.score(X_test , y_test)
score_GraBR

In [None]:
#### Building the model with reduced dimensionality (PCA)

In [None]:
GraBR_pca = GradientBoostingRegressor()
# Optional arguments: n_estimators=50, random_state=1
GraBR_pca = GraBR_pca.fit(X_train_pca, y_train_pca)

In [None]:
y_predict_GraBR_pca = GraBR_pca.predict(X_test_pca)
score_GraBR_PCA = GraBR_pca.score(X_test_pca , y_test_pca)
score_GraBR_PCA

### 4. Deliverable - 4 (Tuning the model)

#### 4.a Identifying Algorithms suitable for this project

We will consolidate all the models built in Deliverable 3 into a table for quick reference and then record our observations.<br>
To do so, we will collect the test score of every model in a dictionary and load it into a data frame.

#### Step 4.a.i: Summarise all the models

In [None]:
summary = {'Score': [score_LR, score_lasso,score_ridge, score_LR_poly, score_SVR, score_DTR,score_RF,score_bag,score_AdaBC, score_GraBR],

                    'Score for models trained with 6 Principal Components': [score_LR_PCA,score_lasso_PCA,score_ridge_PCA,score_LR_poly_PCA, score_SVR_PCA, score_DTR_PCA, score_RF_PCA, score_bag_PCA, score_AdaBC_PCA, score_GraBR_PCA]

                     }

models=['Linear Regression','Lasso','Ridge','Polynomial Regression','SVR', 'Decision Tree Regressor','Random Forest','Bagging','Ada Boost','Gradient Boost']
sum_df = pd.DataFrame(summary, index=models)

In [None]:
sum_df

Observations:

- The table in step 4.a.i captures the test scores (R^2) of all the models we trained. <br>
- As learnt from the case study document, our main objective is to identify a model that predicts the strength of high-performance concrete. The model trained with the Gradient Boost algorithm gives the best results. <br>
- We also computed scores for models trained with 6 principal components to check the impact of reduced dimensionality on performance.
- We noticed a significant dip in the score for Gradient Boost (from 0.897 to 0.82) when we reduced the dimensionality to 6 (from 9).
- The higher score of the model trained on all dimensions outweighs the benefits of reducing dimensionality. Hence, we will go ahead with the Gradient Boost model using all the dimensions.
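
The comparison between a model trained on all dimensions and one trained on 6 principal components can be reproduced in a compact form. This is a sketch under assumptions rather than the notebook's procedure: the data is synthetic (9 features from `make_regression`), scaling is done with `StandardScaler` inside the pipeline, and `full_model`, `pca_model` and `retained` are hypothetical names.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): replace with the notebook's X and y.
X_demo, y_demo = make_regression(n_samples=300, n_features=9, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Model trained on all dimensions.
full_model = GradientBoostingRegressor(random_state=1).fit(X_tr, y_tr)

# Model trained on 6 principal components; scaler and PCA are fitted on the training split only.
pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=6),
    GradientBoostingRegressor(random_state=1),
).fit(X_tr, y_tr)

# Share of the (scaled) training variance kept by the 6 components.
retained = pca_model.named_steps["pca"].explained_variance_ratio_.sum()

print("Variance retained by 6 components:", round(retained, 3))
print("Test R^2, all dimensions:", full_model.score(X_te, y_te))
print("Test R^2, 6 components:", pca_model.score(X_te, y_te))
```

Reporting the retained variance next to the two scores helps to judge whether a drop in R^2 is explained by information discarded in the projection.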

We performed an exhaustive EDA, reduced dimensionality, scaled the data and built multiple models in our endeavour to identify the model that gives the best results. As seen above, the Gradient Boost algorithm gives the best results, and hence we selected it. In the next steps we will perform hyperparameter tuning to identify the parameters that can further enhance the performance of the Gradient Boost algorithm.

#### 4.b Techniques employed to squeeze that extra performance out of the model without making it overfit or underfit

We will use grid search to get that extra performance from the model. Since the best-performing model is Gradient Boost, the technique will be applied to it.

In [None]:
estimator = GradientBoostingRegressor()

In [None]:
estimator.get_params()

In [None]:
estimator=GradientBoostingRegressor()
search_grid={'n_estimators':[100,200,300,400,500,600],'learning_rate':[.001,0.01,.1],'max_depth':[1,2,3,4,5],'subsample':[.5,.75,1],'random_state':[1]}
search=GridSearchCV(estimator=estimator,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=10)

In [None]:
search.fit(X_train,y_train)
search.best_params_

In [None]:
## Creating the Gradient Boosting Regressor with the best parameters

In [None]:
GraBR = GradientBoostingRegressor(learning_rate= 0.1,max_depth= 3,n_estimators= 600,random_state= 1,subsample= 1)

GraBR_fit = GraBR.fit(X_train, y_train)
y_predict_GraBR = GraBR.predict(X_test)

In [None]:
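### Testing on the train data

# Keep the train score in its own variable so that it is not confused with the test score
score_GraBR_train = GraBR.score(X_train , y_train)
score_GraBR_train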

In [None]:
### Testing on the test data
score_GraBR = GraBR.score(X_test , y_test)
score_GraBR

Observation:
- Grid search helps us find the hyperparameter combination that gives the best cross-validated score for the selected model.
- By using the recommended values for learning rate, max depth, n_estimators and subsample, the test score of our model increased from 0.897 to 0.925.
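
Instead of copying the best parameters by hand, the fitted search object can hand back the tuned model directly. This is a minimal sketch assuming synthetic data and a deliberately small grid so that it runs quickly; `small_grid`, `demo_search` and `tuned_model` are hypothetical names, and the grid values are illustrative rather than the ones searched above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): replace with the notebook's X and y.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Deliberately small illustrative grid (assumption) to keep the run short.
small_grid = {"n_estimators": [50, 100], "max_depth": [2, 3], "learning_rate": [0.1]}

demo_search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=1),
    param_grid=small_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
demo_search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on the
# whole training split using the best parameter combination.
tuned_model = demo_search.best_estimator_

print("Best parameters:", demo_search.best_params_)
print("Mean CV MSE of best combination:", -demo_search.best_score_)
print("Test R^2 of tuned model:", tuned_model.score(X_te, y_te))
```

Because the scoring is `neg_mean_squared_error`, `best_score_` is a negated mean squared error, so it is never positive and values closer to zero are better.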

#### 4.c Model performance range at 95% confidence level

In [None]:
scores = cross_val_score(GraBR, X, y, cv=10)
CV_score_acc_GraBR = scores.mean()
CV_score_std_GraBR = scores.std()

print(scores)
# For a regressor, cross_val_score reports R^2 by default, not classification accuracy
print("Mean R^2 score: %.3f%% (std: %.3f%%)" % (CV_score_acc_GraBR*100.0, CV_score_std_GraBR*100.0))

Observation: 
- Cross-validation evaluates a model on held-out folds and thereby estimates its performance on unseen data.
- From the calculation above, the score (R^2) of the Gradient Boost model in the production environment is expected to be about 92.677%, plus or minus one standard deviation.
- At roughly a 95% confidence level (assuming the fold scores are approximately normally distributed), the model score in the production environment is expected to lie in the range 92.677% ± 2 × standard deviation, i.e. [88.719, 96.637], which is quite good and acceptable.
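
The range quoted above follows from the mean and standard deviation of the fold scores. The sketch below shows the arithmetic; `score_range` is a hypothetical helper, and `example_scores` holds made-up illustrative values, not the notebook's output.

```python
import numpy as np

def score_range(fold_scores, n_std=2.0):
    """Return (mean, lower, upper) where the bounds are mean -/+ n_std * std."""
    scores = np.asarray(fold_scores, dtype=float)
    mean = scores.mean()
    std = scores.std()  # population std (ddof=0), the same default as scores.std() above
    return mean, mean - n_std * std, mean + n_std * std

# Illustrative fold scores (hypothetical values, not the notebook's output).
example_scores = [0.90, 0.92, 0.94, 0.93, 0.91]
mean, low, high = score_range(example_scores)
print("Mean: %.3f, range: [%.3f, %.3f]" % (mean, low, high))
```

In the notebook, the call would be `score_range(scores)` with the array returned by `cross_val_score`. The resulting band describes the spread of individual fold scores; an interval for the mean score itself would be narrower, since it uses the standard error (standard deviation divided by the square root of the number of folds).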