Step 1: Install Required Libraries

In [1]:
!pip install shap imbalanced-learn joblib



Step 2: Import Libraries

In [2]:
import pandas as pd
import numpy as np
import shap  # For explainability
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import joblib

Step 3: Load Dataset

In [3]:
# Load dataset
df = pd.read_csv("/content/CESAW_Task_fact.csv")

# Display dataset overview
print("Dataset Preview:")
print(df.head())

# Check for missing values
print("\nMissing Values Summary:")
print(df.isnull().sum())

Dataset Preview:
    task_id            task_type  task_complexity  task_priority  \
0  7bed2185           Bug Fixing                2              3   
1  87d513d7           Bug Fixing                1              2   
2  c8a32d63          Code Review                1              2   
3  a9db64e2  Feature Development                2              3   
4  9e34afcf  Feature Development                2              3   

   estimated_effort_hours team_id  team_experience_level  total_members  \
0                    7.23     T18                      1             18   
1                   13.20      T7                      2              6   
2                   47.28     T44                      1             15   
3                    7.39      T6                      3             13   
4                   28.26     T25                      2             19   

   past_projects_completed          specialization  team_skillset_match  \
0                       22  Mobile App Developme

Step 4: Encode Categorical Variables

Since the dataset has categorical features, they need to be converted into numerical values before training.

In [4]:
# Initialize a dictionary to store label encoders
label_encoders = {}

# Define categorical columns that need encoding
categorical_columns = ['task_type', 'specialization', 'assigned_team', 'team_id']

# Encode each categorical column
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le  # Store encoders for later use

print("\nCategorical Variables Encoded Successfully!")


Categorical Variables Encoded Successfully!


Step 5: Select Features and Target Variable

The dataset is split into independent features (X) and the target variable (y).

In [5]:
# Define features (X) and target variable (y)
X = df.drop(columns=['task_id', 'assigned_team'])
y = df['assigned_team']

print(f"\nFeature Selection Completed: {X.shape[1]} features selected.")


Feature Selection Completed: 11 features selected.


Step 6: Standardize Numerical Features

To ensure consistent scaling across features, we apply standardization.

In [6]:
# Initialize the scaler
scaler = StandardScaler()

# Apply scaling
X_scaled = scaler.fit_transform(X)

print("\nFeature Scaling Completed.")


Feature Scaling Completed.


Step 7: Split Dataset into Training & Testing Sets

I divide the dataset into 80% training and 20% testing to validate model performance.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)

print(f"\nData Split: {X_train.shape[0]} training rows, {X_test.shape[0]} testing rows.")


Data Split: 800 training rows, 200 testing rows.


Step 8: Handle Class Imbalance Using SMOTE.

SMOTE (Synthetic Minority Over-sampling Technique) balances the dataset so that no team is underrepresented.

In [8]:
# Apply SMOTE to balance the dataset
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print("\nClass Imbalance Handled with SMOTE.")


Class Imbalance Handled with SMOTE.


Step 9: Train an Initial XGBoost Model to Get Feature Importance.

XGBoost is trained first to identify the most important features.

In [9]:
# Train an initial XGBoost model
xgb_temp = XGBClassifier(n_estimators=300, max_depth=10, learning_rate=0.1, subsample=0.8, colsample_bytree=0.8, random_state=42)
xgb_temp.fit(X_train_resampled, y_train_resampled)

# Identify the top 10 most important features
feature_importance = xgb_temp.feature_importances_
important_features = np.argsort(feature_importance)[-10:]  # Select indices of top 10 features

print("\nTop 10 Important Features Selected.")


Top 10 Important Features Selected.


Step 10: Train the Final Models Using Selected Features.

Now, I train both XGBoost and RandomForest models using only the most important features.

In [10]:
# Reduce dataset to top 10 selected features
X_train_selected = X_train_resampled[:, important_features]
X_test_selected = X_test[:, important_features]

# Train XGBoost on selected features
xgb = XGBClassifier(n_estimators=300, max_depth=10, learning_rate=0.1, subsample=0.8, colsample_bytree=0.8, random_state=42)
xgb.fit(X_train_selected, y_train_resampled)

# Train Random Forest on the same features
rf = RandomForestClassifier(n_estimators=300, max_depth=30, min_samples_split=5, min_samples_leaf=2, class_weight='balanced', random_state=42)
rf.fit(X_train_selected, y_train_resampled)

print("\nFinal Models Trained on Selected Features.")


Final Models Trained on Selected Features.


Step 11: Train SHAP Explainer for Model Interpretability

SHAP (SHapley Additive Explanations) helps understand which features influence predictions.

In [11]:
# Train SHAP explainer on the trained XGBoost model
explainer = shap.Explainer(xgb, X_train_selected)

print("\nSHAP Explainer Trained for Model Interpretability.")


SHAP Explainer Trained for Model Interpretability.


Step 12: Make Predictions Using Both Models.

I use both models to predict task allocation and apply a majority voting approach.

In [12]:
# Predict using XGBoost
xgb_preds = xgb.predict(X_test_selected)

# Predict using Random Forest
rf_preds = rf.predict(X_test_selected)

# Apply majority voting (rounding ensures valid team assignments)
final_preds = (xgb_preds + rf_preds) // 2

print("\nEnsemble Predictions Completed.")


Ensemble Predictions Completed.


Step 13: Evaluate Model Performance.

The model is evaluated using accuracy, confusion matrix, and classification report.

In [13]:
# Calculate model accuracy
accuracy = accuracy_score(y_test, final_preds)

# Generate confusion matrix and classification report
conf_matrix = confusion_matrix(y_test, final_preds)
classification_rep = classification_report(y_test, final_preds, output_dict=True, zero_division=1)

# Print evaluation results
print("\nModel Evaluation Results:")
print(f"Accuracy: {accuracy:.4f}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_rep)


Model Evaluation Results:
Accuracy: 0.6300
Confusion Matrix:
 [[2 0 0 ... 0 0 0]
 [0 5 0 ... 0 0 0]
 [0 1 4 ... 0 0 0]
 ...
 [0 0 0 ... 6 0 0]
 [0 0 0 ... 1 1 0]
 [0 0 0 ... 0 0 5]]
Classification Report:
 {'0': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2.0}, '1': {'precision': 0.8333333333333334, 'recall': 1.0, 'f1-score': 0.9090909090909091, 'support': 5.0}, '2': {'precision': 1.0, 'recall': 0.8, 'f1-score': 0.8888888888888888, 'support': 5.0}, '3': {'precision': 1.0, 'recall': 0.8, 'f1-score': 0.8888888888888888, 'support': 5.0}, '4': {'precision': 0.5714285714285714, 'recall': 1.0, 'f1-score': 0.7272727272727273, 'support': 4.0}, '5': {'precision': 0.6666666666666666, 'recall': 0.5, 'f1-score': 0.5714285714285714, 'support': 4.0}, '6': {'precision': 1.0, 'recall': 0.75, 'f1-score': 0.8571428571428571, 'support': 4.0}, '7': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 5.0}, '8': {'precision': 1.0, 'recall': 0.75, 'f1-score': 0.8571428571428571, 

Step 14: Save the Trained Models and Preprocessing Tools.

The trained models and preprocessing tools are saved for later use in the API.

In [14]:
feature_names = X.columns[important_features]  # Store selected feature names

joblib.dump(xgb, "task_allocation_xgb.pkl")
joblib.dump(rf, "task_allocation_rf.pkl")
joblib.dump(scaler, "T_scaler.pkl")  # Scaler for future data processing
joblib.dump(label_encoders, "label_encoders.pkl")  # Label encoders for categorical features
joblib.dump(important_features, "selected_features.pkl")  # Save feature selection indices
joblib.dump(explainer, "shap_explainer.pkl")  # Save SHAP explainer

print("\nModels, encoders, and SHAP explainer saved successfully.")


Models, encoders, and SHAP explainer saved successfully.


Step 15: Save Model Performance Metrics.

The accuracy, confusion matrix, and classification report are stored separately for API integration.

In [15]:
joblib.dump(accuracy, "task_allocation_accuracy.pkl")
joblib.dump(conf_matrix, "task_allocation_conf_matrix.pkl")
joblib.dump(classification_rep, "task_allocation_classification_report.pkl")

print("Task Allocation Model Performance Saved.")

Task Allocation Model Performance Saved.
