<p align="center"><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="260" height="110" /></p>

![image-5.png](attachment:image-5.png)

## CONCRETE COMPRESSIVE-STRENGTH PREDICTION (MACHINE LEARNING - ADVANCED)

**Submitted by NARAYAN V. SHANBHAG**

**GCD Student, INSAID**

**Batch: May 9, 2021**

---
# **Table of Contents**
---

1. [**Introduction**](#Section1)<br>
2. [**Problem Statement**](#Section2)<br>
3. [**Installing & Importing Libraries**](#Section3)<br>
  3.1 [**Installing Libraries**](#Section31)<br>
  3.2 [**Upgrading Libraries**](#Section32)<br>
  3.3 [**Importing Libraries**](#Section33)<br>
4. [**Data Acquisition & Description**](#Section4)<br>
5. [**Data Pre-Profiling**](#Section5)<br>
6. [**Data Pre-Processing**](#Section6)<br>
7. [**Data Post-Profiling**](#Section7)<br>
8. [**Exploratory Data Analysis**](#Section8)<br>
9. [**Summarization**](#Section9)<br>
  9.1 [**Conclusion**](#Section91)<br>
  9.2 [**Actionable Insights**](#Section92)<br>

---

---
<a name = Section1></a>
# **1. Introduction**
---
### Company Introduction - Plastion

### Your client for this project is a major Concrete Producer.

- 1) Their concrete is regarded as one of the best in the business, and the company holds contracts with five of the most well-known real estate companies.
- 2) Recently, they developed a new kind of concrete that requires less water and is stronger and better than the concrete they used to sell.
- 3) They have a few competitors who are also developing new kinds of concrete to launch in the market and win more clients.


### Current Scenario

 - 1) The regular price of concrete per cubic yard is around \$100 to \$200, but due to market inflation the current price has gone down and the company is operating at a loss.
 - 2) The company has developed a new concrete solution that could be a game-changer in the market, but they are unsure of its compressive strength, which is a very important factor in concrete sales.

---
<a name = Section2></a>
# **2. Problem Statement**
---
- This section gives a general introduction to the problem, which many companies face.

### The company process suffers from the following problems:

- 1) The company is under a time crunch to test the compressive strength of the concrete before releasing it to the market.
- 2) Previously, they used manual methods to test the compressive strength of the concrete, which are very time-consuming and inefficient.

The company has hired you as data science consultants. They want to automate the process of predicting the compressive strength of the concrete, based on the materials used.

### Your Role
- You are given a dataset containing materials used in the concrete.
- Your task is to build a regression model using the dataset.

Because there was no machine learning model for this problem in the company, you don’t have a quantifiable win condition. You need to build the best possible model.


### Project Deliverable
   - Deliverable: **Predict the compressive strength of concrete.**
   - Machine Learning Task: **Regression**
   - Target Variable: **csMPa**
   - Win Condition: **N/A (best possible model)**

### Evaluation Metric
   - The model evaluation will be based on the **RMSE** Score.
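
For reference, RMSE is the square root of the mean squared difference between the actual and predicted strengths, $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$, so it is expressed in MPa like the target. A minimal sketch of the calculation follows; the `rmse` helper and the sample values are made up for illustration and are not part of the project data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up strengths in MPa, for illustration only
actual = [30.0, 42.0, 25.0, 55.0]
predicted = [28.0, 45.0, 25.0, 51.0]

# Errors are 2, -3, 0 and 4, so the mean squared error is 29 / 4 = 7.25
score = rmse(actual, predicted)
print(round(score, 4))  # → 2.6926
```

A lower RMSE means the predictions are, on average, closer to the measured strengths.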

# **3. Installing & Importing Libraries**
---
- This section covers installing and importing the libraries that will be required.

### **Installing Libraries**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install -q datascience                                         
# Package that is required by pandas profiling
!pip install -q pandas-profiling                                    
# Library to generate basic statistics about data

# To install more libraries insert your code here..


### **Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the library versions are in sync.

- Make sure not to execute the cells under Installing Libraries and Upgrading Libraries again after restarting the runtime.

In [None]:
!pip install -q --upgrade pandas-profiling                          
# Upgrading pandas profiling to the latest version

### **Importing Libraries**

- You can get a head start with the basic libraries imported in the cell below.

- If you want to import some additional libraries, feel free to do so.

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing package pandas (For Panel Data Analysis)
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis)
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (For Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface to use matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
import scipy as sp                                                  # Importing library for scientific calculations
#-------------------------------------------------------------------------------------------------------------------------------
import warnings
warnings.filterwarnings('ignore')

In [None]:
! pip install sweetviz

In [None]:
! pip install autoviz

# **4. Data Description**




 - The dataset contains materials used in making the concrete.
 
 - The column **csMPa** is the compressive strength of concrete.

 - This is the data that we have to predict the **compressive strength.**
 
#### The dataset is divided into two parts: Train and Test sets.

#### Training Set: 
 - The train set contains **824 rows** and **10 columns**
 - The last column **csMPa** is the **target variable**.

#### Testing Set:
 - The test set contains **206 rows** and **9 columns**.
 - The test set **doesn’t contain** the **csMPa** column.
 - It needs to be predicted for the test set.
 

**Train Set:**

|Records|Features| Target Variable |
|:--|:--|:--|
|**824**|**10**|**csMPa**|

**Test Set:**

|Records|Features|Predicted Variable|
|:--|:--|:--|
|**206**|**9**|**csMPa**|
 
|**ID**|**Feature**|**Description**|
|:--|:--|:--|
|01| Id | Unique identity of each observation.|
|02| cement | Quantity of cement in the mixture, in kg.|
|03| slag | Quantity of slag in the mixture, in kg.|
|04| Flyash | Quantity of fly ash in the mixture, in kg.|
|05| water | Quantity of water in the mixture, in kg.|
|06| superplasticizer | Quantity of superplasticizer in the mixture, in kg.|
|07| coarseaggregate | Quantity of coarse aggregate in the mixture, in kg.|
|08| fineaggregate | Quantity of fine aggregate in the mixture, in kg.|
|09| age | Age of the mixture, in days.|
|10| csMPa | Compressive strength of concrete, in MPa (dependent variable).|

---
<a name = Section4></a>
# **5. Data Acquisition**
---

- This section covers acquiring the data and obtaining some descriptive information from it.

- You could either scrape the data and then continue, or load it from a direct link to the source (generally preferred in most cases).

- You will be working with a direct source to get a head start on your work.

- Before going further, make sure you have a good understanding of the features of the data set (see the feature table above).



In [None]:
#from google.colab import files
#uploaded = files.upload()
#import io
#data = pd.read_csv(io.BytesIO(uploaded['Churn_train.csv']))


In [None]:
# Load the training data from a local CSV file (adjust the path to your environment)
data = pd.read_csv("D://concrete_train.csv", skipinitialspace=True)
print('Data Shape:', data.shape)
data.head()
# Per the data description, the train set has 824 records/rows and 10 features/columns

### **Data Information**
- **In total there are 10 columns/features, including Id, and all of them are NUMERICAL.**
- **Check whether the NUMERICAL columns contain 0s and whether those values are relevant.**


In [None]:
# Column data types and non-null counts
data.info()


### **Data Description**

- **To get a quick description of the data, you can use the `describe` method defined in the pandas library.**
- **It gives the count, mean and standard deviation of each column, along with the five-number summary (minimum, quartiles and maximum).**

In [None]:
data.describe()

In [None]:
data_final = pd.read_csv("D://concrete_test.csv")
print('Data Shape:', data_final.shape)
data_final

In [None]:
data_final.info()

In [None]:
data_final.describe()

---
<a name = Section5></a>
# **6. Data Pre-Profiling**
---

- This section covers generating a profiling report about the data.

- You need to run pandas profiling and draw some observations from the report.

In [None]:
# Generate the profiling report for the train set
profile_Concrete_Compressive_Strength_prediction_Train = ProfileReport(df=data)
profile_Concrete_Compressive_Strength_prediction_Train

**6.1 Data Pre-Profiling for TEST SET**

In [None]:
profile_Concrete_Compressive_Strength_prediction_Test  = ProfileReport(df=data_final)
profile_Concrete_Compressive_Strength_prediction_Test 

---
<a name = Section6></a>
# **7. Data Pre-Processing**
---

- This section covers cleaning and manipulating the raw data for further processing and analysis.

- To turn raw data into clean, structured data, you need to verify and restore its integrity by:
  - **Handling/checking duplicate data for both the TRAIN and TEST data sets**

  - Handling missing data,

  - Handling redundant data,

  - Handling inconsistent data,

  - Handling outliers,

  - Handling typos

- **There are actually NO DUPLICATE RECORDS/ROWS in the data sets**



In [None]:
# Show any duplicated rows in the train set (an empty result means there are none)
data[data.duplicated()]

In [None]:
data_final[data_final.duplicated()]

- **Now check ALL the NUMERICAL COLUMNS for ZERO values and replace/substitute them with appropriate values**

In [None]:
data.isnull().sum()

In [None]:
(data == 0 ).sum(axis = 0)

In [None]:
data_final.isnull().sum()

In [None]:
(data_final == 0 ).sum(axis = 0)
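
The missing-value and zero-value checks above can also be combined into a single table, which makes it easier to judge column by column whether a zero needs attention. The sketch below is illustrative only: the rows are made up and merely reuse a few of the dataset's column names. Keep in mind that a zero for an ingredient such as slag may simply mean it was not used in that mix.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic a few of the dataset's columns
sample = pd.DataFrame({
    "cement": [540.0, 332.5, 198.6, 266.0],
    "slag": [0.0, 142.5, 132.4, 0.0],
    "superplasticizer": [2.5, 0.0, np.nan, 0.0],
    "age": [28, 270, 360, 90],
})

# One row per column: number of missing values and number of exact zeros
summary = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "zeros": (sample == 0).sum(),
})
# Share of rows that are zero, to judge how common the value is
summary["zero_share"] = summary["zeros"] / len(sample)
print(summary)
```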

### **Check the correlation between the features and the target column**

In [None]:
figure = plt.figure(figsize = (20,10))
HeatMap = sns.heatmap(data.corr(),annot = True,cmap = 'coolwarm',vmin = -1, vmax = 1,linecolor = 'black',linewidths = 1)
HeatMap.set_title('Correlation HeatMap', fontdict = {'fontsize':16})

In [None]:
import sweetviz as sv
sweet_report_concrete_strength = sv.analyze(data)
sweet_report_concrete_strength.show_html('Concrete_Strength_Sweet_Report_.html')

### OBSERVATIONS
- Use the `csMPa` row of the correlation heatmap above to see which features correlate most strongly with the target.

In [None]:
data.columns

### Using Gradient Boosting Regressor

In [None]:
# With Pipeline
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
# Drop the Id column (a row identifier, not a material property) and the target
X = data.drop(['Id', 'csMPa'], axis=1)
y = data['csMPa']                                   # 1-D target, as scikit-learn regressors expect
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20, random_state=10)
pipe = Pipeline([
    # ("pt", PowerTransformer()),                   # optional transform, left disabled
    ("gbr", GradientBoostingRegressor(n_estimators=1500, random_state=10)),
])
pipe.fit(Xtrain, ytrain)
print("Training R2")
print(pipe.score(Xtrain, ytrain))
print("Testing R2")
print(pipe.score(Xtest, ytest))
# 10-fold cross-validation on the training split (the default score for a regressor is R2)
scoresdt = cross_val_score(pipe, Xtrain, ytrain, cv=10)
print(scoresdt)
print("Average R2")
print(np.mean(scoresdt))

In [None]:
# Drop the Id column and the target to get the feature matrix
X = data.drop(['Id', 'csMPa'], axis=1)
print(X.columns)
y = data['csMPa']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20, random_state=10)
gb_model = GradientBoostingRegressor(n_estimators=1500, random_state=10)
gb_model.fit(Xtrain, ytrain)
# Predictions on the training and hold-out splits
y_pred_train_gb = gb_model.predict(Xtrain)
y_pred_test_gb = gb_model.predict(Xtest)
RMSE_train_gb = np.sqrt(metrics.mean_squared_error(ytrain, y_pred_train_gb))
r2_train_gb = metrics.r2_score(ytrain, y_pred_train_gb)
print('RMSE for training set in Gradient Boosting is {}'.format(RMSE_train_gb))
print('R2 for training set in Gradient Boosting is {}'.format(r2_train_gb))
RMSE_test_gb = np.sqrt(metrics.mean_squared_error(ytest, y_pred_test_gb))
r2_test_gb = metrics.r2_score(ytest, y_pred_test_gb)
print('RMSE for testing set in Gradient Boosting is {}'.format(RMSE_test_gb))
print('R2 for testing set in Gradient Boosting is {}'.format(r2_test_gb))
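
Since the evaluation metric for this project is RMSE, it can be useful to cross-validate on RMSE as well as on R2. The sketch below shows one way to do that with scikit-learn's `neg_root_mean_squared_error` scorer. It runs on synthetic data from `make_regression` rather than the concrete data set, and the smaller `n_estimators` and `cv` values are assumptions chosen only to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the eight concrete features
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=10)

model = GradientBoostingRegressor(n_estimators=100, random_state=10)

# scikit-learn scorers follow "higher is better", so RMSE is returned negated
neg_rmse = cross_val_score(model, X_demo, y_demo, cv=5,
                           scoring="neg_root_mean_squared_error")
rmse_per_fold = -neg_rmse

print("RMSE per fold:", np.round(rmse_per_fold, 2))
print("Mean RMSE:", rmse_per_fold.mean())
```

Flipping the sign back gives an RMSE per fold that can be compared directly with the hold-out RMSE.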

### Using Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
# Drop the Id column (a row identifier) and the target to get the feature matrix
X = data.drop(['Id', 'csMPa'], axis=1)
print(X.columns)
y = data['csMPa']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20, random_state=10)
pipe = Pipeline([
    ("rf", RandomForestRegressor(n_estimators=200, random_state=10)),
])
pipe.fit(Xtrain, ytrain)
print("Training R2")
print(pipe.score(Xtrain, ytrain))
print("Testing R2")
print(pipe.score(Xtest, ytest))
# 10-fold cross-validation on the training split (the default score for a regressor is R2)
scoresdt = cross_val_score(pipe, Xtrain, ytrain, cv=10)
print(scoresdt)
print("Average R2")
print(np.mean(scoresdt))

In [None]:
# Drop the Id column and the target to get the feature matrix
X = data.drop(['Id', 'csMPa'], axis=1)
print(X.columns)
y = data['csMPa']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20, random_state=10)
rf_model = RandomForestRegressor(n_estimators=200, random_state=10)
#rf_model = RandomForestRegressor(n_estimators = 14, random_state = 42, n_jobs = -1)
rf_model.fit(Xtrain, ytrain)
# Predictions on the training and hold-out splits
y_pred_train_rf = rf_model.predict(Xtrain)
y_pred_test_rf = rf_model.predict(Xtest)
RMSE_train_rf = np.sqrt(metrics.mean_squared_error(ytrain, y_pred_train_rf))
r2_train_rf = metrics.r2_score(ytrain, y_pred_train_rf)
print('RMSE for training set in Random Forest is {}'.format(RMSE_train_rf))
print('R2 for training set in Random Forest is {}'.format(r2_train_rf))
RMSE_test_rf = np.sqrt(metrics.mean_squared_error(ytest, y_pred_test_rf))
r2_test_rf = metrics.r2_score(ytest, y_pred_test_rf)
print('RMSE for testing set in Random Forest is {}'.format(RMSE_test_rf))
print('R2 for testing set in Random Forest is {}'.format(r2_test_rf))

### **5.2 Data Cleaning**

- In this section, we will **remove** columns that are **redundant** for the model.
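
As an illustration of this step, the sketch below drops an identifier column before modelling. The sample rows are made up, and treating `Id` as the redundant column is an assumption based on its description as a unique identity of each observation.

```python
import pandas as pd

# Made-up rows using a few of the dataset's column names
sample = pd.DataFrame({
    "Id": [1, 2, 3],
    "cement": [540.0, 332.5, 198.6],
    "water": [162.0, 228.0, 192.0],
    "csMPa": [79.99, 40.27, 44.30],
})

# Id only labels the rows and carries no information about the mix
redundant_columns = ["Id"]
cleaned = sample.drop(columns=redundant_columns)

print(cleaned.columns.tolist())
```

`drop` returns a new DataFrame, so the original `sample` keeps its `Id` column for later use, for example when building a submission file.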

---
# **6. Exploratory Data Analysis**

#### **Question:** What is the distribution of the **target** feature?

In [None]:
# Plot a displot of the target variable
sns.displot(x='csMPa', data=data, kde=True, aspect=3)

# Add some cosmetics
plt.xticks(size=12)
plt.yticks(size=12)
plt.xlabel(xlabel='Compressive strength (MPa)', size=14)
plt.ylabel(ylabel='Count', size=14)
plt.title(label='Distribution of the target (csMPa)', size=16)
plt.grid(True)                                      # the `b` keyword was removed in recent matplotlib versions

# Display the plot
plt.show()

# **7. Data Post Processing**

### **7.2 Data Splitting**

- Now we will split our data into train set and test set.

- We will keep **80%** data in the **train** set, and **20%** data in the **test** set.
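
A minimal sketch of such an 80/20 split with scikit-learn's `train_test_split` is shown below. The data frame is a synthetic stand-in with 100 random rows and only three feature columns, so the values are for illustration only. `random_state=10` mirrors the seed used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 100 rows, three features and the target
sample = pd.DataFrame({
    "cement": rng.uniform(100, 540, size=100),
    "water": rng.uniform(120, 250, size=100),
    "age": rng.integers(1, 366, size=100),
})
sample["csMPa"] = rng.uniform(5, 80, size=100)

X = sample.drop(columns="csMPa")   # features
y = sample["csMPa"]                # target

# test_size=0.20 keeps 80% of the rows for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=10
)
print(X_train.shape, X_test.shape)
```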

# **8. Model Development & Evaluation**

### **8.1 Model Development & Evaluation without PCA**

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Disabling the chained-assignment (SettingWithCopy) warning
pd.set_option('display.float_format', lambda x: '%.5f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (For Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
import plotly.graph_objs as go                                      # Importing plotly for interactive visualizations
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.preprocessing import StandardScaler                    # Importing to scale the features in the dataset
from sklearn.model_selection import train_test_split                # To properly split the dataset into train and test sets
from sklearn.ensemble import RandomForestRegressor                  # To create a random forest regressor model
from sklearn.linear_model import LinearRegression                   # To create a linear regression model
from sklearn import metrics                                         # Importing to evaluate the model used for regression
from sklearn.decomposition import PCA                               # Importing to create an instance of PCA model
#-------------------------------------------------------------------------------------------------------------------------------
from random import randint                                          # Importing to generate random integers
#-------------------------------------------------------------------------------------------------------------------------------
import time                                                         # For time functionality
import warnings                                                     # Importing warning to disable runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

In [None]:
#With Pipeline
from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import train_test_split
#from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor 
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.naive_bayes import GaussianNB                                 # Naive Bayes classifier (unused here: this is a regression task)
from sklearn.model_selection import cross_val_score, cross_val_predict

# Fit a support vector regressor on the training split (SVR defaults to the RBF kernel)
svm_model = SVR().fit(Xtrain, np.ravel(ytrain))
svm_model

### Dimensionality reduction using PCA

In [None]:
# Perform PCA on Xtrain, keeping enough components to explain 80% of the variance
# (Xtrain is not standardized at this point, so large-valued features dominate)
pca = PCA(n_components=0.80, random_state=0).fit(Xtrain)
#pca = PCA(n_components=0.5, random_state=0).fit(Xtrain)

# Calculate the cumulative explained variance (in %)
var = np.cumsum(np.round(a=pca.explained_variance_ratio_, decimals=3) * 100)

# Initiate an empty figure
fig = go.Figure()

# Add a line trace to the figure, one point per retained component
fig.add_trace(trace=go.Scatter(x=list(range(1, len(var) + 1)),
                               y=var,
                               name='Cumulative Explained Variance',
                               mode='lines+markers'))

# Update the layout with some cosmetics
fig.update_layout(height=500, 
                  width=1000, 
                  title_text='PCA Analysis', 
                  title_x=0.5,
                  xaxis_title='Number of components', 
                  yaxis_title='Explained Variance %')

# Display the figure
fig.show()
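
The number of components needed to reach a target share of explained variance can also be read off numerically instead of from the plot. The sketch below is a hedged illustration on a synthetic 8-feature matrix, not on the concrete data. It standardizes the features first, which is an assumption about the intended preprocessing, and then finds the smallest number of components whose cumulative explained variance reaches 80%.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic 8-feature matrix with very different scales per column
X_demo = rng.normal(size=(200, 8)) * np.array([50, 30, 20, 10, 5, 3, 2, 1])

# Standardize so that no feature dominates purely because of its units
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components to get the full explained-variance curve
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Index of the first entry that reaches 0.80, converted to a component count
n_needed = int(np.searchsorted(cumulative, 0.80) + 1)
print(n_needed, np.round(cumulative, 3))
```

Passing `n_components=0.80` to `PCA` performs the same selection internally; the explicit version only makes the cumulative curve visible.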

In [None]:
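# Keep at most 9 components; PCA cannot return more components than there are features
pca = PCA(n_components=min(9, Xtrain.shape[1]), random_state=0)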

X_pca_train = pca.fit_transform(Xtrain)
X_pca_test = pca.transform(Xtest)

# Printing shape of X_train and X_test
print('Shape of X_train: ', X_pca_train.shape)
print('Shape of X_test: ', X_pca_test.shape)

In [None]:
clfs = [LinearRegression(fit_intercept=True), DecisionTreeRegressor(max_depth = 30),RandomForestRegressor(n_estimators=200,random_state=10)]
for clf in clfs:

  # Extracting model name
  model_name = type(clf).__name__

  # Calculate start time
  start_time = time.time()

  # Train the model
  clf.fit(X_pca_train, ytrain)

  # Make predictions on the trained model
  predictions = clf.predict(X_pca_test)

  # Estimating the model performance
  RMSE = np.sqrt(metrics.mean_squared_error(ytest, predictions))
  R_squared = metrics.r2_score(ytest, predictions)

  # Calculate elapsed time
  elapsed_time = (time.time() - start_time)

  # Display the metrics and the time taken to develop the model
  print('Performance Metrics of', model_name, ':')
  print('[RMSE]:', RMSE, '[R-Squared]:', R_squared, '[Processing Time]:', elapsed_time, 'seconds')
  print('----------------------------------------\n')

## Analyzing the Test File

### Separate the Id column from the feature columns in the test dataset

In [None]:
data_submission = data_final['Id']
data_submission

In [None]:
data_final_model = data_final.drop('Id',axis=1)

### Predict the target values using the Gradient Boosting model trained earlier

In [None]:
y_pred_test_final_gb = gb_model.predict(data_final_model)
y_pred_test_final_gb

### Convert the array into a DataFrame

In [None]:
y_pred_test_final_gb = pd.DataFrame(y_pred_test_final_gb, columns=['csMPa'])

In [None]:
y_pred_test_final_gb

## Prepare the submission file, which should have only two columns: the key/index column (Id) and the target column (csMPa)

In [None]:
submission_file = pd.concat([data_submission,y_pred_test_final_gb], axis = 1)

In [None]:
submission_file.head()

### Write the final data to the submission .csv file without header and index

In [None]:
submission_file.to_csv('D://Concrete_Compressive_Strength_Prediction_Submission.csv', header=False, index=False)

### Thank You !!!