# Concrete Strength Prediction using Random Forest

This notebook contains my best approach to building a concrete strength prediction model.

## 1. Importing Required Libraries

I am importing pandas and numpy for data manipulation, matplotlib and seaborn for visualization, and scikit-learn and xgboost for machine learning.

In [9]:
# importing required libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV

from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.ensemble import VotingRegressor

import warnings
warnings.filterwarnings('ignore')

# setting style for plots
plt.style.use('seaborn-v0_8')

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load the Dataset

Using pandas to load the training and test datasets from CSV files.

In [10]:
# load the datasets
train_df = pd.read_csv('datasets/train.csv')
test_df = pd.read_csv('datasets/test.csv')
sample_submission = pd.read_csv('datasets/sample_submission.csv')

print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")
print(f"Sample submission selection shape: {sample_submission.shape}")

# Display first few rows of training data
print("\nFirst 5 rows of training data:")
train_df.head()

Training data shape: (5407, 10)
Test data shape: (3605, 9)
Sample submission selection shape: (3605, 2)

First 5 rows of training data:


|   | id | CementComponent | BlastFurnaceSlag | FlyAshComponent | WaterComponent | SuperplasticizerComponent | CoarseAggregateComponent | FineAggregateComponent | AgeInDays | Strength |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 525.0 | 0.0 | 0.0 | 186.0 | 0.0 | 1125.0 | 613.0 | 3 | 10.38 |
| 1 | 1 | 143.0 | 169.0 | 143.0 | 191.0 | 8.0 | 967.0 | 643.0 | 28 | 23.52 |
| 2 | 2 | 289.0 | 134.7 | 0.0 | 185.7 | 0.0 | 1075.0 | 795.3 | 28 | 36.96 |
| 3 | 3 | 304.0 | 76.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 365 | 39.05 |
| 4 | 4 | 157.0 | 236.0 | 0.0 | 192.0 | 0.0 | 935.4 | 781.2 | 90 | 74.19 |
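
Since the test set is expected to carry the same columns as the training set minus the target, a quick schema check can catch a mismatched file early. The sketch below is not part of the original notebook: `check_schema` is a hypothetical helper, and the two small DataFrames are stand-ins for `train_df` and `test_df`.

```python
import pandas as pd


def check_schema(train: pd.DataFrame, test: pd.DataFrame, target: str) -> list:
    """Return the columns shared by train and test, or raise if they differ.

    Hypothetical helper: assumes `target` is the only column absent from `test`.
    """
    expected = [col for col in train.columns if col != target]
    actual = test.columns.tolist()
    if expected != actual:
        raise ValueError(f"Column mismatch: train has {expected}, test has {actual}")
    return expected


# Small stand-in frames; in the notebook these would be train_df and test_df
toy_train = pd.DataFrame(
    {"id": [0, 1], "WaterComponent": [186.0, 191.0], "Strength": [10.38, 23.52]}
)
toy_test = pd.DataFrame({"id": [2, 3], "WaterComponent": [185.7, 228.0]})

shared_columns = check_schema(toy_train, toy_test, target="Strength")
print(shared_columns)
```

In the notebook itself, the equivalent call would presumably be `check_schema(train_df, test_df, target="Strength")`.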


## 3. Data Preprocessing

Checking for missing values and preparing the features and target for the models. All columns are numeric, so no categorical encoding is needed.

In [11]:
# Checking for missing values
print("Missing values in training data:")
print(train_df.isnull().sum())
print("\nMissing values in test data:")
print(test_df.isnull().sum())

# Since there are no missing values, we can proceed with feature selection
# Preparing features and target
X = train_df.drop(['id', 'Strength'], axis=1)
y = train_df['Strength']
X_test = test_df.drop(['id'], axis=1)

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"Test feature matrix shape: {X_test.shape}")

print("\nFeature Columns:")
print(X.columns.tolist())

Missing values in training data:
id                           0
CementComponent              0
BlastFurnaceSlag             0
FlyAshComponent              0
WaterComponent               0
SuperplasticizerComponent    0
CoarseAggregateComponent     0
FineAggregateComponent       0
AgeInDays                    0
Strength                     0
dtype: int64

Missing values in test data:
id                           0
CementComponent              0
BlastFurnaceSlag             0
FlyAshComponent              0
WaterComponent               0
SuperplasticizerComponent    0
CoarseAggregateComponent     0
FineAggregateComponent       0
AgeInDays                    0
dtype: int64

Feature matrix shape: (5407, 8)
Target vector shape: (5407,)
Test feature matrix shape: (3605, 8)

Feature Columns:
['CementComponent', 'BlastFurnaceSlag', 'FlyAshComponent', 'WaterComponent', 'SuperplasticizerComponent', 'CoarseAggregateComponent', 'FineAggregateComponent', 'AgeInDays']
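
The per-column null counts can be condensed into a single report that also flags duplicate rows and non-numeric columns. This is a minimal sketch under assumptions: `quality_report` is a hypothetical helper, and the toy DataFrame (with one deliberate gap and one repeated row) is made up for illustration and is not taken from the dataset.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize basic data-quality counts for a DataFrame (hypothetical helper)."""
    return {
        "missing_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }


# Toy data: row 2 repeats row 1, and the last row has a missing value
toy = pd.DataFrame(
    {
        "CementComponent": [525.0, 143.0, 143.0, np.nan],
        "AgeInDays": [3, 28, 28, 90],
    }
)

report = quality_report(toy)
print(report)
```

An empty `non_numeric_columns` list is what would justify skipping categorical encoding for a dataset like this one.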


In [12]:
# Random Forest is tree-based, so it does not require feature scaling
# We therefore use the original, unscaled features
print("Random Forest can work with original features without scaling.")
print('using original features for Random Forest model.')

# Keeping the original features for Random Forest
X_final = X.copy()
X_test_final = X_test.copy()

print(f"\nTraining features shape: {X_final.shape}")
print(f"Test feature shape: {X_test_final.shape}")

# Display first few rows of features
print('\nFirst 5 rows of features:')
X_final.head()

Random Forest can work with original features without scaling.
using original features for Random Forest model.

Training features shape: (5407, 8)
Test feature shape: (3605, 8)

First 5 rows of features:


|   | CementComponent | BlastFurnaceSlag | FlyAshComponent | WaterComponent | SuperplasticizerComponent | CoarseAggregateComponent | FineAggregateComponent | AgeInDays |
|---|---|---|---|---|---|---|---|---|
| 0 | 525.0 | 0.0 | 0.0 | 186.0 | 0.0 | 1125.0 | 613.0 | 3 |
| 1 | 143.0 | 169.0 | 143.0 | 191.0 | 8.0 | 967.0 | 643.0 | 28 |
| 2 | 289.0 | 134.7 | 0.0 | 185.7 | 0.0 | 1075.0 | 795.3 | 28 |
| 3 | 304.0 | 76.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 365 |
| 4 | 157.0 | 236.0 | 0.0 | 192.0 | 0.0 | 935.4 | 781.2 | 90 |
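
Tree splits depend only on the ordering of feature values, so a rescaling such as standardization should leave a Random Forest's fit essentially unchanged. The sketch below illustrates this on synthetic data, which is an assumption for demonstration and not the concrete dataset. The names `X_demo`, `y_demo`, and the chosen hyperparameters are likewise illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only, not the concrete dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(300, 3))
y_demo = 0.05 * X_demo[:, 0] - 0.02 * X_demo[:, 1] + rng.normal(0, 1, size=300)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit once on the raw features
rf_raw = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
r2_raw = r2_score(y_va, rf_raw.predict(X_va))

# Fit again on standardized features; the scaler is fitted on the training split only
scaler = StandardScaler().fit(X_tr)
rf_scaled = RandomForestRegressor(n_estimators=50, random_state=42).fit(
    scaler.transform(X_tr), y_tr
)
r2_scaled = r2_score(y_va, rf_scaled.predict(scaler.transform(X_va)))

print(f"R2 raw: {r2_raw:.4f}, R2 scaled: {r2_scaled:.4f}")
```

The two scores should agree up to floating-point effects, which is why keeping the unscaled features is a reasonable choice here. A scale-sensitive model, such as a linear model with regularization, would not share this property.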


## 4. Feature Engineering

Adding ratio features and degree-2 polynomial terms for the key components, then splitting the data into training and validation sets.

In [19]:
# Engineering features for both the training and test sets
print("Starting feature engineering...")
print(f"Original feature columns: {X.columns.tolist()}")

# Creating engineered features for training data
X_engineered = X.copy()

# Ratio features
X_engineered['CementToWater'] = X['CementComponent'] / (X['WaterComponent'] + 1e-8)
X_engineered['CementToAge'] = X['CementComponent'] / (X['AgeInDays'] + 1)
X_engineered['TotalBinder'] = X['CementComponent'] + X['BlastFurnaceSlag'] + X['FlyAshComponent']
X_engineered['WaterToBinder'] = X['WaterComponent'] / (X_engineered['TotalBinder'] + 1e-8)

# Polynomial features for key components
poly = PolynomialFeatures(degree=2, include_bias=False)
key_features = ['CementComponent', 'WaterComponent', 'AgeInDays']
poly_features = poly.fit_transform(X[key_features])
# Reuse X's index so that pd.concat aligns rows correctly even if the index is not 0..n-1
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(key_features), index=X.index)

# Removing duplicate columns (keeping only interaction terms and squared terms, not original features)
# Original features are already in X_engineered
poly_interaction_cols = [col for col in poly_df.columns if col not in key_features]
poly_df_filtered = poly_df[poly_interaction_cols]
X_engineered = pd.concat([X_engineered, poly_df_filtered], axis=1)

# Apply same feature engineering to test data
X_test_engineered = X_test.copy()
X_test_engineered['CementToWater'] = X_test['CementComponent'] / (X_test['WaterComponent'] + 1e-8)
X_test_engineered['CementToAge'] = X_test['CementComponent'] / (X_test['AgeInDays'] + 1)
X_test_engineered['TotalBinder'] = X_test['CementComponent'] + X_test['BlastFurnaceSlag'] + X_test['FlyAshComponent']
X_test_engineered['WaterToBinder'] = X_test['WaterComponent'] / (X_test_engineered['TotalBinder'] + 1e-8)

# Apply polynomial features to test data
poly_features_test = poly.transform(X_test[key_features])
# Reuse X_test's index so that pd.concat aligns rows correctly
poly_df_test = pd.DataFrame(poly_features_test, columns=poly.get_feature_names_out(key_features), index=X_test.index)

# Removing duplicate columns for test data too
poly_df_test_filtered = poly_df_test[poly_interaction_cols]
X_test_engineered = pd.concat([X_test_engineered, poly_df_test_filtered], axis=1)

print(f"Engineered training features shape: {X_engineered.shape}")
print(f"Engineered test features shape: {X_test_engineered.shape}")
print(f"New feature columns added: {[col for col in X_engineered.columns if col not in X.columns]}")

# Updating X_final and X_test_final to use engineered features
X_final = X_engineered.copy()
X_test_final = X_test_engineered.copy()

# Holding out 20% of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_final, y, test_size=0.2, random_state=42
)

print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
print(f"Training target shape: {y_train.shape}")
print(f"Validation target shape: {y_val.shape}")

print(f"\nTraining set percentage: {len(X_train) / len(X_final) * 100:.2f}%")
print(f"Validation set percentage: {len(X_val) / len(X_final) * 100:.2f}%")

Starting feature engineering...
Original feature columns: ['CementComponent', 'BlastFurnaceSlag', 'FlyAshComponent', 'WaterComponent', 'SuperplasticizerComponent', 'CoarseAggregateComponent', 'FineAggregateComponent', 'AgeInDays']
Engineered training features shape: (5407, 18)
Engineered test features shape: (3605, 18)
New feature columns added: ['CementToWater', 'CementToAge', 'TotalBinder', 'WaterToBinder', 'CementComponent^2', 'CementComponent WaterComponent', 'CementComponent AgeInDays', 'WaterComponent^2', 'WaterComponent AgeInDays', 'AgeInDays^2']
Training set shape: (4325, 18)
Validation set shape: (1082, 18)
Training target shape: (4325,)
Validation target shape: (1082,)

Training set percentage: 79.99%
Validation set percentage: 20.01%
