# XGBoost Model for Predicting Concrete Compressive Strength

In this notebook, we will create and evaluate an XGBoost model to predict the compressive strength of concrete based on various features. The workflow includes loading the data, preprocessing, feature engineering, hyperparameter tuning using GridSearchCV, and evaluating the model's performance.

---
*Created by: Md. Rafiquzzaman Rafi*

*Date: 27 August, 2024*

---

## 1. Import Libraries

We start by importing the necessary libraries for data manipulation, model training, and evaluation.


In [18]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb

---

## 2. Load the Dataset

Next, we load the dataset containing the concrete mix components and their corresponding compressive strengths.


In [19]:
# Load the dataset
data = pd.read_csv('Concrete_Data.csv')

---

## 3. Rename Columns for Easier Access

We rename the columns to shorter names for easier access and manipulation.


In [None]:
# Rename columns for easier access
data = data.rename(columns={
    'Cement (component 1)(kg in a m^3 mixture)': 'cement',
    'Blast Furnace Slag (component 2)(kg in a m^3 mixture)': 'blast_furnace_slag',
    'Fly Ash (component 3)(kg in a m^3 mixture)': 'fly_ash',
    'Water  (component 4)(kg in a m^3 mixture)': 'water',
    'Superplasticizer (component 5)(kg in a m^3 mixture)': 'superplasticizer',
    'Coarse Aggregate  (component 6)(kg in a m^3 mixture)': 'coarse_aggregate',
    'Fine Aggregate (component 7)(kg in a m^3 mixture)': 'fine_aggregate',
    'Age (day)': 'age',
    'Concrete compressive strength(MPa, megapascals) ': 'compressive_strength'
})
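
One thing worth checking after a rename like this: `DataFrame.rename` silently skips any key that does not match a header exactly, and the original headers contain double spaces and a trailing space. The sketch below uses a small hypothetical frame (not the real CSV) to show how unmatched keys can be listed.

```python
import pandas as pd

# Hypothetical two-column frame standing in for Concrete_Data.csv.
raw = pd.DataFrame({
    "Water  (component 4)(kg in a m^3 mixture)": [162.0, 228.0],
    "Age (day)": [28, 90],
})

rename_map = {
    "Water  (component 4)(kg in a m^3 mixture)": "water",
    "Age (day)": "age",
    # Deliberately absent from `raw` to show the silent no-op.
    "Cement (component 1)(kg in a m^3 mixture)": "cement",
}

renamed = raw.rename(columns=rename_map)

# Keys that never matched a header: rename() skipped them without complaint.
unmatched = [old for old in rename_map if old not in raw.columns]
print(unmatched)
print(list(renamed.columns))
```

Passing `errors="raise"` to `rename` is another option if a hard failure on a mismatched header is preferred.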

---

## 4. Create Additional Features

We create two new features that might help in predicting compressive strength: the ratio of cement to coarse aggregate and the ratio of cement to fine aggregate.


In [None]:
# Create additional features
data['cement_coarse'] = data.cement / data.coarse_aggregate
data['cement_fine'] = data.cement / data.fine_aggregate

---

## 5. Define Features and Target Variable

We separate the features (X) from the target variable (y). The target variable in this case is the compressive strength of the concrete.


In [None]:
# Define features and target variable
X = data.drop(['compressive_strength'], axis=1)
y = data['compressive_strength']

---

## 6. Feature Scaling

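We apply standard scaling so that every feature has zero mean and unit variance. Tree-based models such as XGBoost split on feature thresholds and are largely insensitive to this kind of rescaling, so the step is optional here, though it does no harm. Note that `fit_transform` returns a NumPy array, so the column names are dropped from `X`.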


In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)
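
The cell above fits the scaler on the full dataset, so the test rows contribute to the mean and variance used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on synthetic data; the `*_demo` names and the random features are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the concrete features and target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=300.0, scale=50.0, size=(100, 3))
y_demo = rng.normal(size=100)

# Split first, so the test rows stay unseen during scaling.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # statistics from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # reuse those statistics, no refit
```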

---

## 7. Split Data into Training and Test Sets

We split the data into training and test sets, using 80% of the data for training and 20% for testing.


In [None]:

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

---

## 8. Create the XGBoost Regressor Model

We initialize the XGBoost regressor with some basic parameters. The `objective` is set to `reg:squarederror` as it's a regression problem, and the `eval_metric` is set to `rmse`.


In [20]:
# Create the XGBoost regressor model
model = xgb.XGBRegressor(objective='reg:squarederror', eval_metric='rmse')

---

## 9. Define Hyperparameters for Tuning

We define a grid of hyperparameters that we want to tune using GridSearchCV. This includes the number of estimators, the maximum depth of trees, the learning rate, and the subsample ratio.


In [None]:
# Define hyperparameters to tune
param_grid = {
    'n_estimators': [1000, 2000],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 1.0]
}
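
To get a feel for the cost of this grid before launching the search, the number of candidates can be counted directly. This is a small sketch using only the standard library; the 5 in `n_fits` assumes the 5-fold cross-validation used in the next section.

```python
from itertools import product

param_grid = {
    'n_estimators': [1000, 2000],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 1.0],
}

# Every combination of one value per hyperparameter is one candidate.
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)

# Each candidate is trained once per cross-validation fold.
n_fits = n_candidates * 5
print(n_candidates, n_fits)  # 36 180
```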

---

## 10. Hyperparameter Tuning with GridSearchCV

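We use GridSearchCV to perform an exhaustive search over the specified hyperparameter grid. Each candidate is evaluated using 5-fold cross-validation and scored by negative mean squared error, so a higher (less negative) score is better. The output recorded below reports a single candidate, which suggests the cell was last run with a reduced grid rather than the full grid from the previous section.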


In [21]:
# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1,
)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


In [22]:
grid_search.best_params_

{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 1000, 'subsample': 0.8}
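
Besides `best_params_`, the fitted search object exposes `best_score_`, which under `scoring='neg_mean_squared_error'` is the negated mean cross-validated MSE. The sketch below shows how it might be converted to an RMSE. To keep it fast and self-contained it uses a `Ridge` model on synthetic data as a stand-in for the XGBoost regressor, so the numbers it prints say nothing about the concrete dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (stand-in for the concrete data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

search = GridSearchCV(
    estimator=Ridge(),  # stand-in estimator, not XGBoost
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is negative MSE; flip the sign, then take the square root.
cv_mse = -search.best_score_
cv_rmse = np.sqrt(cv_mse)
print(search.best_params_, round(cv_rmse, 2))
```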

---

## 11. Get the Best Model from Grid Search

After the grid search is complete, we retrieve the model with the best hyperparameters.


In [32]:
# Get the best model from grid search
best_model = grid_search.best_estimator_

---

## 12. Predict on the Test Set

We use the best model to make predictions on the test set.


In [None]:
# Predict on the test set
y_pred = best_model.predict(X_test)

---

## 13. Evaluate the Model

We evaluate the model's performance using Mean Squared Error (MSE) and R² score. MSE is the average squared prediction error, expressed in MPa², and R² is the fraction of the variance in compressive strength that the model explains.


In [33]:
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R^2 Score: {r2:.2f}")

Mean Squared Error: 14.54
R^2 Score: 0.94
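
For reference, both metrics can be computed by hand. The sketch below does so on three made-up strength values (not taken from the dataset), then converts the reported MSE of 14.54 into an RMSE, which is in the same units as the target (MPa). The helper name `mse_and_r2` is hypothetical.

```python
import math

import numpy as np


def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return ss_res / y_true.size, 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
toy_mse, toy_r2 = mse_and_r2([30.0, 40.0, 50.0], [32.0, 38.0, 50.0])
print(round(toy_mse, 4), round(toy_r2, 2))  # 2.6667 0.96

# RMSE corresponding to the MSE reported above.
reported_rmse = math.sqrt(14.54)
print(round(reported_rmse, 2))  # 3.81
```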


---

## 14. Plot Feature Importances

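We plot the feature importances to see which features the model relied on most. By default, `xgb.plot_importance` ranks features by how many times each one is used in a split (`importance_type='weight'`). Because the model was trained on the scaled NumPy array, the plot labels the features `f0`, `f1`, and so on, in the column order of `X`, rather than by name.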


In [None]:

# Plot feature importances
xgb.plot_importance(best_model)
plt.show()
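
When a model is trained on a NumPy array, XGBoost refers to features by index (`f0`, `f1`, and so on). The sketch below shows one way to map such keys back to column names. The `feature_names` order assumes the CSV columns appear in the same order as the rename mapping, followed by the two engineered ratios; in practice the list would be captured from the DataFrame before scaling. The scores are made up, and `label_importances` is a hypothetical helper.

```python
# Assumed column order of X (verify against the real DataFrame).
feature_names = [
    "cement", "blast_furnace_slag", "fly_ash", "water", "superplasticizer",
    "coarse_aggregate", "fine_aggregate", "age", "cement_coarse", "cement_fine",
]


def label_importances(scores, names):
    """Map 'f<index>' keys to column names, sorted from highest to lowest score."""
    labelled = {names[int(key[1:])]: value for key, value in scores.items()}
    return dict(sorted(labelled.items(), key=lambda item: item[1], reverse=True))


# Made-up scores for illustration only; real ones would come from
# best_model.get_booster().get_score(importance_type="weight").
example_scores = {"f0": 120.0, "f7": 310.0, "f3": 95.0}
print(label_importances(example_scores, feature_names))
# {'age': 310.0, 'cement': 120.0, 'water': 95.0}
```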

---

## 15. Save the Model

Finally, we save the trained model to a file using `joblib`, so it can be loaded and used for predictions later without retraining.


In [24]:
import joblib
# Save the model
joblib.dump(best_model, "xgboost_model.pkl")

['xgboost_model.pkl']
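
Because the model was trained on scaled inputs, new data has to pass through the same fitted scaler before prediction, so it may be worth saving the scaler alongside the model. The sketch below round-trips a `StandardScaler` through `joblib` in a temporary directory; the file name `scaler.pkl` and the sample rows are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows standing in for two of the feature columns.
X_demo = np.array([[540.0, 162.0], [332.5, 228.0], [198.6, 192.0]])
demo_scaler = StandardScaler().fit(X_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, "scaler.pkl")  # hypothetical file name
    joblib.dump(demo_scaler, scaler_path)
    restored_scaler = joblib.load(scaler_path)

# The restored scaler should transform new inputs exactly like the original.
new_mix = np.array([[400.0, 180.0]])
same = np.allclose(
    demo_scaler.transform(new_mix), restored_scaler.transform(new_mix)
)
print(same)
```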

# Final Model

This cell combines all the preprocessing steps and the tuned hyperparameters into a single script.

In [31]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb

# Load the dataset
data = pd.read_csv('Concrete_Data.csv')

# Rename columns for easier access
data = data.rename(columns={
    'Cement (component 1)(kg in a m^3 mixture)': 'cement',
    'Blast Furnace Slag (component 2)(kg in a m^3 mixture)': 'blast_furnace_slag',
    'Fly Ash (component 3)(kg in a m^3 mixture)': 'fly_ash',
    'Water  (component 4)(kg in a m^3 mixture)': 'water',
    'Superplasticizer (component 5)(kg in a m^3 mixture)': 'superplasticizer',
    'Coarse Aggregate  (component 6)(kg in a m^3 mixture)': 'coarse_aggregate',
    'Fine Aggregate (component 7)(kg in a m^3 mixture)': 'fine_aggregate',
    'Age (day)': 'age',
    'Concrete compressive strength(MPa, megapascals) ': 'compressive_strength'
})

# Create additional features
data['cement_coarse'] = data.cement / data.coarse_aggregate
data['cement_fine'] = data.cement / data.fine_aggregate

# Define features and target variable
X = data.drop(['compressive_strength'], axis=1)
y = data['compressive_strength']

scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the XGBoost regressor model
model = xgb.XGBRegressor(
    objective='reg:squarederror',
    eval_metric='rmse',
    n_estimators=1000,
    max_depth=3,
    learning_rate=0.1,
    subsample=0.8,
)

model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R^2 Score: {r2:.2f}")

Mean Squared Error: 14.54
R^2 Score: 0.94


---

# Conclusion

In this notebook, we built and evaluated an XGBoost model for predicting concrete compressive strength. The model was tuned using GridSearchCV, and the final model reached an MSE of 14.54 and an R² score of 0.94 on the held-out test set. The feature importances were also visualized to understand the contribution of each feature to the model's predictions.
