# Concrete Strength Prediction using Random Forest

This notebook analyzes concrete strength data and builds a Random Forest model to predict concrete strength from the mix components and the age in days.

## 1. Import Required Libraries

Import pandas, NumPy, Matplotlib, seaborn, and scikit-learn for data manipulation, visualization, and machine learning.

In [1]:
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
import warnings

warnings.filterwarnings('ignore')

# set style for plots
plt.style.use('seaborn-v0_8')

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load the Dataset

Use pandas to load the training and test datasets from CSV files.

In [3]:
# Load the datasets
train_df = pd.read_csv('datasets/train.csv')
test_df = pd.read_csv('datasets/test.csv')
sample_submssion = pd.read_csv('datasets/sample_submission.csv')

print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")
print(f"Sample submission shape: {sample_submssion.shape}")

# Display first few rows of training data
print("\nFirst 5 rows of training data:")
train_df.head()

Training data shape: (5407, 10)
Test data shape: (3605, 9)
Sample submission shape: (3605, 2)

First 5 rows of training data:


| | id | CementComponent | BlastFurnaceSlag | FlyAshComponent | WaterComponent | SuperplasticizerComponent | CoarseAggregateComponent | FineAggregateComponent | AgeInDays | Strength |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 525.0 | 0.0 | 0.0 | 186.0 | 0.0 | 1125.0 | 613.0 | 3 | 10.38 |
| 1 | 1 | 143.0 | 169.0 | 143.0 | 191.0 | 8.0 | 967.0 | 643.0 | 28 | 23.52 |
| 2 | 2 | 289.0 | 134.7 | 0.0 | 185.7 | 0.0 | 1075.0 | 795.3 | 28 | 36.96 |
| 3 | 3 | 304.0 | 76.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 365 | 39.05 |
| 4 | 4 | 157.0 | 236.0 | 0.0 | 192.0 | 0.0 | 935.4 | 781.2 | 90 | 74.19 |


## 5. Data Preprocessing

Check for missing values and prepare the features for the Random Forest model. All columns are numeric, so no categorical encoding is needed.

In [None]:
# for component in ['BlastFurnaceSlag', 'FlyAshComponent', 'SuperplasticizerComponent']:
#     train_df.loc[train_df[component] != 0, f'{component}_used'] = 1
#     train_df[f'{component}_used'] = train_df[f'{component}_used'].fillna(0)
#     train_df[f'{component}_used'] = train_df[f'{component}_used'].astype(int)

#     test_df.loc[test_df[component] != 0, f'{component}_used'] = 1
#     test_df[f'{component}_used'] = test_df[f'{component}_used'].fillna(0)
#     test_df[f'{component}_used'] = test_df[f'{component}_used'].astype(int)
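
The commented-out cell above sketches binary "component used" indicator features. As a rough illustration only, the same idea can be written more compactly with a boolean comparison. The helper name `add_usage_flags` and the toy frame are assumptions made for this example and are not part of the notebook's pipeline; the toy values are copied from the first rows of the training data.

```python
import pandas as pd


def add_usage_flags(df, components):
    """Return a copy of df with a 0/1 `<component>_used` column per component."""
    out = df.copy()
    for component in components:
        # True where the component amount is non-zero, stored as 0/1 integers
        out[f"{component}_used"] = (out[component] != 0).astype(int)
    return out


# Toy rows (hypothetical frame) using values from the head of the training data
toy = pd.DataFrame({
    "BlastFurnaceSlag": [0.0, 169.0, 134.7],
    "FlyAshComponent": [0.0, 143.0, 0.0],
    "SuperplasticizerComponent": [0.0, 8.0, 0.0],
})

flagged = add_usage_flags(toy, list(toy.columns))
print(flagged.filter(like="_used"))
```

Applying one helper to both `train_df` and `test_df` would keep the two frames in sync, which the duplicated statements in the commented-out version do by hand.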

In [4]:
# Check for missing values again
print("Missing values in training data:")
print(train_df.isnull().sum())
print('\nMissing values in test data:')
print(test_df.isnull().sum())

# Since there are no missing values, we proceed to feature selection
# Prepare features and target
X = train_df.drop(['id', 'Strength'], axis=1)
y = train_df['Strength']
X_test = test_df.drop(['id'], axis=1)

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"Test feature matrix shape: {X_test.shape}")

print("\nFeature columns:")
print(X.columns.tolist())

Missing values in training data:
id                           0
CementComponent              0
BlastFurnaceSlag             0
FlyAshComponent              0
WaterComponent               0
SuperplasticizerComponent    0
CoarseAggregateComponent     0
FineAggregateComponent       0
AgeInDays                    0
Strength                     0
dtype: int64

Missing values in test data:
id                           0
CementComponent              0
BlastFurnaceSlag             0
FlyAshComponent              0
WaterComponent               0
SuperplasticizerComponent    0
CoarseAggregateComponent     0
FineAggregateComponent       0
AgeInDays                    0
dtype: int64

Feature matrix shape: (5407, 8)
Target vector shape: (5407,)
Test feature matrix shape: (3605, 8)

Feature columns:
['CementComponent', 'BlastFurnaceSlag', 'FlyAshComponent', 'WaterComponent', 'SuperplasticizerComponent', 'CoarseAggregateComponent', 'FineAggregateComponent', 'AgeInDays']


In [6]:
# Random Forest does not require feature scaling, so the original
# (unscaled) features are used as-is.
print("Random Forest can work with original features without scaling.")
print("Using original features for Random Forest model.")

# Keep the original features for Random Forest
X_final = X.copy()
X_test_final = X_test.copy()

print(f"\nTraining features shape: {X_final.shape}")
print(f"Test features shape: {X_test_final.shape}")

# Display first few rows of features
print("\nFirst 5 rows of features:")
X_final.head()

Random Forest can work with original features without scaling.
Using original features for Random Forest model.

Training features shape: (5407, 8)
Test features shape: (3605, 8)

First 5 rows of features:


| | CementComponent | BlastFurnaceSlag | FlyAshComponent | WaterComponent | SuperplasticizerComponent | CoarseAggregateComponent | FineAggregateComponent | AgeInDays |
|---|---|---|---|---|---|---|---|---|
| 0 | 525.0 | 0.0 | 0.0 | 186.0 | 0.0 | 1125.0 | 613.0 | 3 |
| 1 | 143.0 | 169.0 | 143.0 | 191.0 | 8.0 | 967.0 | 643.0 | 28 |
| 2 | 289.0 | 134.7 | 0.0 | 185.7 | 0.0 | 1075.0 | 795.3 | 28 |
| 3 | 304.0 | 76.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 365 |
| 4 | 157.0 | 236.0 | 0.0 | 192.0 | 0.0 | 935.4 | 781.2 | 90 |


## 6. Split Data into Train and Test Sets

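Split the training data into a training subset and a validation subset for model evaluation. Note that `test_size=0.8` holds out 80% of the rows for validation and trains on only 20%; the more common choice is `test_size=0.2`.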

In [7]:
X_train, X_val, y_train, y_val = train_test_split(
    X_final, y, test_size=0.8, random_state=42
)

print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
print(f"Training target shape: {y_train.shape}")
print(f"Validation target shape: {y_val.shape}")

print(f"\nTraining set percentage: {len(X_train) / len(X_final) * 100:.2f}%")
print(f"Validation set percentage: {len(X_val) / len(X_final) * 100:.2f}%")

Training set shape: (1081, 8)
Validation set shape: (4326, 8)
Training target shape: (1081,)
Validation target shape: (4326,)

Training set percentage: 19.99%
Validation set percentage: 80.01%
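
To make the effect of `test_size` concrete, here is a small sketch that splits a placeholder array with the same number of rows as the training data (5407) using two different values. The arrays are zero-filled stand-ins and `split_sizes` is a name introduced for this example; only the resulting row counts matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 5407  # number of training rows reported above

# Placeholder data: the values are irrelevant, only the row counts are inspected
X_demo = np.zeros((n_rows, 8))
y_demo = np.zeros(n_rows)

split_sizes = {}
for test_size in (0.8, 0.2):
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=42
    )
    split_sizes[test_size] = (len(X_tr), len(X_va))
    print(f"test_size={test_size}: train={len(X_tr)}, validation={len(X_va)}")
```

With `test_size=0.8` the row counts match the output above. With `test_size=0.2` the proportions are reversed, so the model would see roughly four times as many training rows.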


## 7. Train Random Forest Model

Create and train a Random Forest regressor with scikit-learn, then evaluate its performance on the validation data.

In [8]:
# Create and train the Random Forest model
rf_model = RandomForestRegressor(
    n_estimators=100,        # Number of trees
    max_depth=5,             # Maximum depth of trees
    min_samples_split=2,     # Minimum samples required to split
    min_samples_leaf=1,      # Minimum samples required at leaf node
    random_state=42,         # For reproducibility
    n_jobs=-1                # Use all available cores
)

rf_model.fit(X_train, y_train)

print("Random Forest model trained successfully!")
print(f"Number of trees: {rf_model.n_estimators}")
print(f"Feature importance shape: {rf_model.feature_importances_.shape}")

# Make predictions on validation set
y_val_pred = rf_model.predict(X_val)

# Calculate metrics
mse = mean_squared_error(y_val, y_val_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_val, y_val_pred)
r2 = r2_score(y_val, y_val_pred)

print("\nModel Performance on Validation Set:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R² Score: {r2:.4f}")

Random Forest model trained successfully!
Number of trees: 100
Feature importance shape: (8,)

Model Performance on Validation Set:
Mean Squared Error (MSE): 149.1810
Root Mean Squared Error (RMSE): 12.2140
Mean Absolute Error (MAE): 9.4881
R² Score: 0.4447
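
For reference, the four metrics reported above can be computed by hand with NumPy. The sketch below is a minimal illustration on made-up numbers, not the notebook's validation data; the function name `regression_metrics` and the demo arrays are assumptions introduced for this example.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R² from true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)        # average squared error
    rmse = np.sqrt(mse)                  # same units as the target
    mae = np.mean(np.abs(residuals))     # average absolute error

    # R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


# Made-up values, for illustration only
y_true_demo = [10.0, 20.0, 30.0, 40.0]
y_pred_demo = [12.0, 18.0, 33.0, 37.0]

metrics = regression_metrics(y_true_demo, y_pred_demo)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Read this way, the reported R² of 0.4447 means the model accounts for about 44% of the variance in validation-set strength, and the RMSE of 12.2140 is the square root of the reported MSE of 149.1810.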
