# PRCP-1019-ConcreteStren

# Problem Statement

- Prepare a complete data analysis report on the concrete data.
- Create a machine learning model which can predict the future strength of a concrete mix, based on its constituents’ composition and also the age of the mix.

# Dataset Overview

Concrete, the building block of most construction, is the most important material in civil engineering. Its compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.
The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data is in raw form (not scaled). It has 8 quantitative input variables, 1 quantitative output variable, and 1030 instances (observations).


# Features and Their Descriptions


| **Feature Name**           | **Unit**     | **Description**                                                            |
| -------------------------- | ------------ | -------------------------------------------------------------------------- |
| **Cement**   | kg/m³        | Primary binding material; higher amounts generally increase strength.      |
| **Blast Furnace Slag** | kg/m³        | Supplementary cement material; improves durability and long-term strength. |
| **Fly Ash**  | kg/m³        | Cement substitute; enhances workability and long-term strength.            |
| **Water**    | kg/m³        | Required for hydration; too much reduces strength due to higher porosity.  |
| **Superplasticizer**   | kg/m³        | Chemical admixture; improves workability while keeping water content low.  |
| **Coarse Aggregate**   | kg/m³        | Gravel/crushed stone; contributes to concrete strength and bulk.           |
| **Fine Aggregate**     | kg/m³        | Sand-like material; fills gaps, affects workability and surface finish.    |
| **Age**                    | Days (1–365) | Time since casting; strength increases over time due to ongoing hydration. |
| **Compressive Strength**   | MPa          | Target variable; measures the load-bearing capacity of concrete.           |


## Importing Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Data reading

In [None]:
df = pd.read_csv('/content/concrete.csv')

In [None]:
df.head(10)

In [None]:
df.shape

# Data Cleaning and Preprocessing

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
# Checking for missing values
df.isnull().sum()

In [None]:
# Checking for duplicate rows
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.duplicated().sum()

In [None]:
# Outlier analysis: one box plot per input feature
# (assumes the target is the last column, so it is excluded here)
columns = df.columns.tolist()[:-1]
plt.figure(figsize=(15,15),facecolor='white')
plotnumber = 1
for column in columns:
  if plotnumber<=9:
    plt.subplot(3,3,plotnumber)
    sns.boxplot(x=df[column])
    plt.xlabel(column,fontsize = 20)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
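# Distribution of each input feature
plt.figure(figsize = (25, 20))
plotnumber = 1

for col in columns:
    if plotnumber <= 9:
        ax = plt.subplot(3, 3, plotnumber)
        # histplot replaces distplot, which is deprecated in recent seaborn versions
        sns.histplot(df[col], kde=True, stat='density')
        plt.xlabel(col, fontsize = 15)

    plotnumber += 1

plt.tight_layout()
plt.show()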

As the plots show, some outliers are present in the water, superplastic, and age columns. I will handle these outliers later, in the feature engineering part.
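To put numbers on what the box plots show, the outliers can also be counted with the usual 1.5 × IQR rule (the same rule that box plot whiskers are typically drawn with). The following is a minimal sketch on toy data; the helper name `iqr_outlier_counts` and the sample values are hypothetical and not taken from `concrete.csv`.

```python
import numpy as np
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Toy data (hypothetical values): one extreme water value, constant age
toy = pd.DataFrame({
    "water": [180, 182, 185, 181, 183, 250],
    "age": [28, 28, 28, 28, 28, 28],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

On the real data, `iqr_outlier_counts(df)` would give one count per column, which makes it easier to compare columns than reading nine box plots by eye.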

# EDA

In [None]:
# Univariate analysis
columns = df.columns.tolist()[:-1]
plt.figure(figsize=(15,15),facecolor='white')
plotnumber = 1
for column in columns:
  if plotnumber<=9:
    plt.subplot(3,3,plotnumber)
    sns.histplot(df[column],kde=True)
    plt.xlabel(column,fontsize = 20)
    plt.ylabel('count',fontsize = 20)
    plotnumber+=1
plt.tight_layout()


In [None]:
# Bivariate analysis
# Which features contribute most to predicting the target column?

In [None]:
# Scatter plot of each input feature against the target column
plt.figure(figsize=(25,25),facecolor='white')
plotnumber = 1
for column in columns:
  if plotnumber<=9:
    plt.subplot(3,3,plotnumber)
    sns.scatterplot(x=df[column], y=df['strength'])
    plt.xlabel(column)
    plt.ylabel('strength')
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
# Multivariate analysis

In [None]:
#Correlation Analysis
#Heat map for correlation
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(),annot = True)
plt.show()

**Insights from Correlation Matrix :**

- Cement and Concrete Compressive Strength: The correlation is 0.49(0.50), which is a moderate positive correlation. This indicates that as the amount of cement increases, the compressive strength of the concrete tends to increase as well.

- Water and Superplasticizer: There is a strong negative correlation of -0.65(-0.66). This suggests that the more superplasticizer used, the less water is needed. Superplasticizers are used to enhance the workability of concrete, allowing for a reduction in water content without reducing fluidity.

- Fly Ash and Superplasticizer: These have a correlation of 0.41(0.38), a moderate positive correlation, implying that fly ash and superplasticizer quantities tend to increase together. Fly ash can improve workability and reduce water content, which might be why it's used in conjunction with superplasticizers.

- Fine Aggregate and Water: This pair has a correlation of -0.44(-0.45), indicating a moderate negative correlation. It implies that an increase in the amount of fine aggregate may be associated with a decrease in water content.

- Age and Concrete Compressive Strength: With a correlation of 0.34(0.33), it indicates a positive relationship, albeit not very strong, suggesting that as the concrete ages, its compressive strength tends to increase, which is expected as concrete gains strength over time.

- Blast Furnace Slag and Fly Ash: There's a negative correlation of -0.31(-0.32), which could indicate that in mixtures where blast furnace slag is used, less fly ash is present, and vice versa.

- Water and Cement: The correlation is -0.057(-0.08), which is a very weak negative correlation, suggesting that there is no significant relationship between the amounts of water and cement used in the concrete mix.
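The feature-to-target correlations discussed above can also be read off programmatically instead of from the heat map. This is a small sketch on synthetic data; the helper `rank_by_target_correlation` and the toy columns are assumptions for illustration, and the printed values are not the ones from the concrete dataset.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return feature-target Pearson correlations, strongest (by absolute value) first."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]

# Synthetic stand-in data: strength depends strongly on cement, weakly on water
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "cement": rng.uniform(100, 500, 200),
    "water": rng.uniform(120, 250, 200),
})
toy["strength"] = 0.1 * toy["cement"] - 0.05 * toy["water"] + rng.normal(0, 1, 200)

ranked = rank_by_target_correlation(toy, "strength")
print(ranked)
```

With the notebook's data, the equivalent call would be `rank_by_target_correlation(df, 'strength')`.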

In [None]:
#Pair plot
sns.pairplot(df)
plt.show()

This pairplot visualizes the relationship between each pair of variables in the dataset. Here are the insights I derived from it:

- Distribution of Individual Variables:

 - The histograms along the diagonal show the distribution of each variable. Cement, blast furnace slag, and fly ash display a somewhat right-skewed distribution, indicating a concentration of lower values with some higher outliers.

 - Water, superplasticizer, and age show a near-normal or uniform distribution. Coarse aggregate and fine aggregate are left-skewed, with higher frequencies of larger quantities.

 - Concrete compressive strength appears approximately normally distributed, which suits statistical methods that assume normality.

- Pairwise Relationships: The scatter plots off the diagonal show the relationships between pairs of variables.

 - There are some variables that show a pattern suggesting a correlation, like cement and concrete compressive strength, where an increase in cement seems to be associated with higher compressive strength.

 - Age and compressive strength show a non-linear pattern where strength increases with age up to a certain point before leveling off, which is consistent with the curing process of concrete.

 - The negative relationship between water and superplasticizer is also visible, supporting the idea that superplasticizers are effective in reducing water content while maintaining workability.

 - Some variables show no discernible pattern or relationship, like coarse aggregate with many other components, indicating a lack of correlation.

- Data Density:
 - The density of points within scatter plots varies, with some areas being more densely populated. This indicates the commonality of certain mix proportions in the dataset.
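The skewness read from the diagonal histograms can be cross-checked numerically with `DataFrame.skew()`: values near 0 suggest a roughly symmetric distribution, positive values a right skew, and negative values a left skew. A minimal sketch on synthetic columns (the column names and distributions are illustrative assumptions, not the concrete data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    # Exponential data has a long right tail (theoretical skewness of 2)
    "right_skewed": rng.exponential(scale=10.0, size=2000),
    # Normal data is symmetric (theoretical skewness of 0)
    "symmetric": rng.normal(loc=50.0, scale=5.0, size=2000),
})

skewness = toy.skew()
print(skewness)
```

Running `df.skew()` on the notebook's DataFrame would give one value per column to compare against the visual reading above.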

# Feature engineering

---



In [None]:
# Checking multicollinearity with the variance inflation factor (VIF)
# Rule of thumb:
#   VIF > 5  -> moderate multicollinearity
#   VIF > 10 -> severe multicollinearity (consider removing the variable)
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
X = df.drop('strength',axis=1)
X = add_constant(X)

vif = pd.DataFrame()
vif["Feature"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
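For intuition about what the VIF measures: the VIF of feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the other features plus an intercept. The sketch below computes this by hand with NumPy on synthetic data; `vif_scores` is a hypothetical helper, and it is expected to agree closely with the statsmodels values when a constant is included, though that has not been checked against this dataset.

```python
import numpy as np

def vif_scores(X) -> np.ndarray:
    """VIF per column: 1 / (1 - R^2) from regressing that column on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    scores = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

# Synthetic example: column c is almost a copy of column a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
scores = vif_scores(np.column_stack([a, b, c]))
print(scores)
```

Here the near-duplicate columns get a large VIF while the independent column stays close to 1, which is the pattern the rule of thumb is meant to flag.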


In [None]:
# Splitting the independent features (x) and the dependent target (y)
x = df.drop('strength',axis=1)
y = df['strength']

In [None]:
x.var()

In [None]:
# Log-transform the features to reduce skew.
# np.log1p computes log(1 + x), so zero values do not raise an error.

for column in x.columns:
    x[column] = np.log1p(x[column])
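Adding 1 before taking the log is the same as `np.log1p`, and the transform can be undone with `np.expm1`. A short sketch with made-up values (not rows from the dataset):

```python
import numpy as np

# Hypothetical raw values, including a zero (e.g. a mix with no fly ash)
values = np.array([0.0, 3.0, 28.0, 365.0])

logged = np.log1p(values)     # same as np.log(values + 1)
restored = np.expm1(logged)   # inverse transform: exp(x) - 1

print(logged)
print(restored)
```

The zero stays at zero after the transform, and larger values are compressed far more than small ones, which is why the variances shrink so much in the next cell.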

In [None]:
x.var()

In [None]:
# Checking for outliers after the log transform
# (uses the transformed features in x, not the VIF design matrix X)
plt.figure(figsize = (20, 15))
plotnumber = 1

for col in x.columns:
    if plotnumber <= 8:
        ax = plt.subplot(3, 3, plotnumber)
        sns.boxplot(x=x[col])
        plt.xlabel(col, fontsize = 15)

    plotnumber += 1
plt.tight_layout()
plt.show()

# Scaling

In [None]:
# Splitting the data for training and testing
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [None]:
#StandardScaler scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In [None]:
# After standard scaling, rebuild DataFrames for x_train and x_test
x_train = pd.DataFrame(x_train,columns = x.columns)
x_test  = pd.DataFrame(x_test,columns = x.columns)
print(x_train.head())
print(x_test.head())
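One way to keep the log transform and the scaling together, so that exactly the same steps are applied to the training data, the test data, and any new sample, is a scikit-learn pipeline. This is only a sketch of that idea on random stand-in data; the name `preprocess` is an assumption, and the notebook itself applies the two steps separately.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# log(1 + x) followed by standardization, bundled as one object
preprocess = make_pipeline(FunctionTransformer(np.log1p), StandardScaler())

# Random stand-in data (not the concrete dataset)
rng = np.random.default_rng(42)
train_raw = rng.uniform(0, 500, size=(80, 3))
test_raw = rng.uniform(0, 500, size=(20, 3))

# Statistics (mean, std) are learned from the training split only
train_scaled = preprocess.fit_transform(train_raw)
test_scaled = preprocess.transform(test_raw)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

Fitting on the training split only, as the notebook also does with `sc.fit_transform(x_train)`, avoids leaking test-set statistics into the model.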

# Model Creation and Evaluation

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor,AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error,mean_absolute_error, r2_score



# Models
models = {
    "KNN": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "AdaBoost": AdaBoostRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(),
    "XGBoost": XGBRegressor(),
}

# Train and evaluate each model on the same train/test split
for name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)

    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"{name}:\n\tMSE = {mse:.2f},\n\tMAE = {mae:.2f},\n\tR² = {r2:.2f}\n")


Of all the models we evaluated, XGBRegressor achieves the highest R² score and the lowest errors.
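This ranking comes from a single train/test split, so it can shift with a different `random_state`. A k-fold cross-validated comparison is one way to check how stable it is. The sketch below uses synthetic data from `make_regression` and only two of the models, purely to show the pattern; none of the numbers it prints relate to the concrete dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (not the concrete dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

candidates = {
    "KNN": KNeighborsRegressor(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Same shuffled 5-fold split for every model, so the scores are comparable
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = {
    name: cross_val_score(est, X_demo, y_demo, cv=cv, scoring="r2")
    for name, est in candidates.items()
}

for name, scores in cv_r2.items():
    print(f"{name}: mean R² = {scores.mean():.3f} (std {scores.std():.3f})")
```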

Let's do the Hyperparameter tuning for XGBoostRegressor

In [None]:
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

# Define model
model = XGBRegressor(objective='reg:squarederror', random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=model,
                           param_grid=param_grid,
                           cv=3,
                           scoring='neg_mean_squared_error',
                           verbose=1,
                           n_jobs=-1)

grid_search.fit(x_train, y_train)

# Best hyperparameters and cross-validated score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best cross-validated RMSE:", (-grid_search.best_score_)**0.5)

# Evaluate on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(x_test)
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # RMSE is the square root of MSE
print("Test RMSE:", rmse)

# Evaluate r2 score
r2 = r2_score(y_test, y_pred)
print("R-squared (R² Score):", r2)
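As a reminder of how the two error metrics relate: MSE is the mean of the squared errors, and RMSE is its square root, which brings the value back to the units of the target (MPa here). A tiny worked example with made-up numbers:

```python
import numpy as np

# Hypothetical actual and predicted strengths in MPa
y_true = np.array([30.0, 40.0, 50.0])
y_hat = np.array([33.0, 36.0, 50.0])

errors = y_true - y_hat            # [-3, 4, 0]
mse = np.mean(errors ** 2)         # (9 + 16 + 0) / 3
rmse = np.sqrt(mse)                # back in MPa

print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```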


In [None]:
best_params = grid_search.best_params_

In [None]:
from xgboost import XGBRegressor
model = XGBRegressor(**best_params,objective='reg:squarederror', random_state=42)
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
print('R² score:',r2_score(y_test,y_pred))

In [None]:
pd.DataFrame({'Actual':y_test,'Predicted':y_pred , 'Error':y_test-y_pred})

In [None]:
# Predict the strength of a new mix entered by the user
cement = float(input('Enter the value of cement: '))
slag = float(input('Enter the value of blast_furnace_slag: '))
ash = float(input('Enter the value of fly_ash: '))
water = float(input('Enter the value of water: '))
superplastic = float(input('Enter the value of superplasticizer: '))
coarseagg = float(input('Enter the value of coarse_aggregate: '))
fineagg = float(input('Enter the value of fine_aggregate: '))
age = float(input('Enter the value of age: '))

# Apply the same preprocessing as in training: log(1 + x), then the fitted StandardScaler (sc).
# This assumes the columns of x are in the same order as the values below.
new_val = pd.DataFrame([[cement,slag,ash,water,superplastic,coarseagg,fineagg,age]], columns=x.columns)
new_val = np.log1p(new_val)
new_val = pd.DataFrame(sc.transform(new_val), columns=x.columns)
strength = model.predict(new_val)
print(f"The predicted strength of the concrete is: {strength[0]:.2f} MPa")


# Model Comparison report

We tested 6 regression models to predict concrete strength. Here is the detailed report.

K-Nearest Neighbors (KNN):

- Moderate performance with R² = 0.84 and relatively high MSE (46.68).

- This model may struggle to generalize, especially in datasets with higher dimensionality.

Decision Tree:

- Simple and interpretable model with R² = 0.88. However, it can overfit the training data, which limits its generalization ability compared to ensembles.

Random Forest:

- Strong performer with R² = 0.91, lower error values than Decision Tree and KNN.

- Offers robustness and good generalization due to its ensemble nature.

AdaBoost:

- The weakest model here, with highest MSE (65.21) and lowest R² (0.78).

- Tends to be sensitive to noisy data and outliers in regression tasks.

Gradient Boosting:

- Performs well with R² = 0.89, but not as accurate as Random Forest or XGBoost.

XGBoost:

- Clearly the best performing model, with the lowest MSE (20.39), lowest MAE (2.76), and highest R² (0.93).

- Optimized for both speed and accuracy, XGBoost handles feature importance and regularization effectively, minimizing overfitting while maintaining strong performance.

Hyperparameter Tuning:

- To further improve performance, we tuned the hyperparameters of the XGBoost model with grid search (GridSearchCV). After tuning, the model achieved an improved R² score of 0.944 on the test set, indicating even better predictive capability.
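For a sense of the cost of this search, the number of model fits can be worked out directly from the grid used in the tuning cell: every combination of values is one candidate, and each candidate is fitted once per cross-validation fold. A small sketch of that arithmetic:

```python
from math import prod

# Same grid and fold count as in the tuning cell
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
cv_folds = 3

n_candidates = prod(len(values) for values in param_grid.values())  # 2*3*3*2*2
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 72 216
```

Each extra value added to any one parameter multiplies the total, so the grid grows quickly.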

Result:

- Deploy XGBoost as the primary model for production.

- It demonstrates the best predictive accuracy and generalization capability, making it the most reliable choice for predicting concrete compressive strength in real-world applications.


# Report on Challenges faced

1. **Non-linearity in the Data**

   The main challenge was handling the non-linearity in the data: linear models performed poorly, while tree-based models handled it well without requiring feature scaling. In addition, tuning hyperparameters and managing model complexity, especially with SVR and tree-based methods, was crucial to avoid overfitting and improve performance.

2. **Features on Different Scales**

   **Problem:** Input features like cement, water, and age had vastly different ranges, which could mislead certain models (especially SVR and linear models).

   **Impact:** Poor convergence and lower accuracy.

   **Solution:** We applied StandardScaler to bring all features onto the same scale. This significantly improved the performance of SVR and linear regression.
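To illustrate why differing scales matter for distance-based models, the sketch below compares how much each feature contributes to the squared Euclidean distance between two mixes, before and after dividing by the column standard deviation. The four rows are made-up values chosen only to show the effect; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical mixes: columns are cement (kg/m³) and superplasticizer (kg/m³)
X_toy = np.array([
    [200.0, 0.0],
    [400.0, 20.0],
    [300.0, 10.0],
    [500.0, 5.0],
])
a, b = X_toy[0], X_toy[1]

# Share of the squared distance that comes from cement, on raw values
raw_sq = (a - b) ** 2
raw_share = raw_sq[0] / raw_sq.sum()

# Same share after scaling each column by its standard deviation
std = X_toy.std(axis=0)
scaled_sq = ((a - b) / std) ** 2
scaled_share = scaled_sq[0] / scaled_sq.sum()

print(f"cement share of distance: raw = {raw_share:.3f}, scaled = {scaled_share:.3f}")
```

On the raw values the distance is driven almost entirely by cement, while after scaling both features contribute. Tree-based models split on one feature at a time, so they are not affected in the same way.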
