**Preprocessing and Model Building**

This cell prepares the raw data for machine learning. It loads the dataset, separates the features from the target variable (`Risk_Level`), and drops the identifier and date columns (`Project_ID`, `Start_Date`, `End_Date`), which are not used for prediction. The categorical variables (`Project_Type`, `Location`, `Weather_Condition`) are then one-hot encoded with `drop_first=True`, which converts them into numeric indicator columns that the GradientBoostingClassifier can process. Finally, the data is split 70/30 into a training set and a testing set with a fixed `random_state`, so that the split is reproducible and the model is evaluated on samples it did not see during training.
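
As a small illustration of what one-hot encoding with `drop_first=True` produces, here is a minimal sketch on a toy frame. The category values and the `Cost` column are hypothetical and are not taken from the dataset; only the `pd.get_dummies` call mirrors the cell below.

```python
import pandas as pd

# Hypothetical toy frame: the category values and the "Cost" column
# are made up for illustration and do not come from the real dataset.
toy = pd.DataFrame({
    "Weather_Condition": ["Sunny", "Rainy", "Cloudy", "Sunny"],
    "Cost": [10.0, 12.5, 9.0, 11.0],
})

# drop_first=True drops the first category in sorted order ("Cloudy").
# That category becomes the implicit baseline: all of its dummies are zero.
encoded = pd.get_dummies(toy, columns=["Weather_Condition"], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

Dropping one level per categorical column avoids perfectly redundant indicator columns. Tree-based models such as gradient boosting do not strictly require this, so it is a design choice rather than a necessity.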

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('bim_ai_civil_engineering_dataset.csv')

# Separate features (X) and target variable (y)
X = df.drop('Risk_Level', axis=1)
y = df['Risk_Level']

# Drop columns that are not useful for prediction
X = X.drop(['Project_ID', 'Start_Date', 'End_Date'], axis=1)

# Identify categorical columns for one-hot encoding
categorical_cols_to_encode = ['Project_Type', 'Location', 'Weather_Condition']

# Perform one-hot encoding on categorical features
X = pd.get_dummies(X, columns=categorical_cols_to_encode, drop_first=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

Training set size: 700 samples
Testing set size: 300 samples
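
The split above is a plain random split. When the risk classes are imbalanced, one common variation is a stratified split, which keeps the class proportions roughly equal in the training and testing sets. The sketch below is not what the cell above does; it uses hypothetical labels and a placeholder feature to show the effect of the `stratify` argument of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: these class counts are invented for illustration.
y_toy = pd.Series(
    ["High"] * 540 + ["Medium"] * 320 + ["Low"] * 140, name="Risk_Level"
)
# Placeholder feature column so there is something to split alongside y_toy.
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

# stratify=y_toy asks the splitter to preserve class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

full_share = y_toy.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full_share, "test": test_share}).round(3))
```

Adopting this in the notebook would change which rows land in the test set, so the reported metrics would no longer match the outputs shown here.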


**Model Training and Evaluation**

This cell trains and evaluates the model. It first initializes a GradientBoostingClassifier (100 estimators, learning rate 0.1, maximum depth 3) and fits it to the prepared training data. The fitted model is then used to make predictions on the unseen testing data. Finally, the model's performance is evaluated in three ways: the overall prediction accuracy, a confusion matrix, and a classification report. The confusion matrix gives a breakdown of correct and incorrect predictions for each risk category, and the classification report lists precision, recall, and F1-score per category.
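
To make the layout of a confusion matrix concrete, here is a minimal sketch on a handful of hypothetical labels (invented for illustration, not taken from the model's predictions). With scikit-learn's `confusion_matrix`, row i is the actual class and column j is the predicted class, so the diagonal holds the correct predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, invented for illustration only.
y_true = ["High", "High", "High", "Low", "Low", "Medium", "Medium", "Medium"]
y_hat = ["High", "High", "Medium", "Low", "Medium", "Medium", "Medium", "High"]

# Explicit label order; this matches the sorted order scikit-learn uses by default.
labels = ["High", "Low", "Medium"]
cm_toy = confusion_matrix(y_true, y_hat, labels=labels)

# Row = actual class, column = predicted class.
for label, row in zip(labels, cm_toy):
    print(f"actual {label:<6} -> {dict(zip(labels, row.tolist()))}")

# The diagonal counts correct predictions, so accuracy = trace / total.
print("accuracy:", cm_toy.trace() / cm_toy.sum())
```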

In [17]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize and train the GradientBoostingClassifier model
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gb_model.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nPrediction Accuracy on Test Set: {accuracy:.4f}")

# Display the confusion matrix
print("\n--- Confusion Matrix ---")
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=gb_model.classes_, yticklabels=gb_model.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Display classification report
print("\n--- Classification Report ---")
print(classification_report(y_test, y_pred))


Prediction Accuracy on Test Set: 0.9433

--- Confusion Matrix ---
[Figure: "Confusion Matrix" heatmap; x-axis: Predicted, y-axis: Actual]

--- Classification Report ---
              precision    recall  f1-score   support

        High       0.96      0.98      0.97       163
         Low       0.95      0.88      0.92        43
      Medium       0.91      0.90      0.91        94

    accuracy                           0.94       300
   macro avg       0.94      0.92      0.93       300
weighted avg       0.94      0.94      0.94       300
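
The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows of the report. The sketch below copies the rounded per-class values from the output above: the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. Because the inputs are already rounded to two decimals, the last digit could differ from scikit-learn's own figures in general, although here they agree.

```python
# Per-class rows copied from the classification report above (rounded to 2 decimals).
report = {
    "High": {"precision": 0.96, "recall": 0.98, "f1": 0.97, "support": 163},
    "Low": {"precision": 0.95, "recall": 0.88, "f1": 0.92, "support": 43},
    "Medium": {"precision": 0.91, "recall": 0.90, "f1": 0.91, "support": 94},
}

total = sum(row["support"] for row in report.values())


def macro(metric):
    """Unweighted mean of a metric over the classes."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted(metric):
    """Mean of a metric over the classes, weighted by class support."""
    return sum(row[metric] * row["support"] for row in report.values()) / total


for metric in ("precision", "recall", "f1"):
    print(f"{metric:<9} macro={macro(metric):.2f} weighted={weighted(metric):.2f}")
```

The support-weighted recall equals the overall accuracy, which is why both show 0.94. The gap between macro recall (0.92) and weighted recall (0.94) reflects the lower recall on the smallest class, Low.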

