### Cell 1: Imports
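- This cell imports every library the notebook uses: pandas for data handling, scikit-learn for preprocessing, modeling, and evaluation, and joblib and json for saving the results.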

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
import joblib
import json

print("Libraries imported successfully.")

Libraries imported successfully.


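### Cell 2: Load and Inspect the Data
We load the training data from training_data.csv, then print the first five rows and the column types.

In [9]: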
# Load the dataset
df = pd.read_csv('training_data.csv')

# Display the first 5 rows and data types
print(df.head())
print("\nData Info:")
df.info()

  agent_id        task_type  capability_match  agent_current_load  \
0  agent_2   image_analysis                 0                   2   
1  agent_1  code_generation                 0                   3   
2  agent_2    data_analysis                 0                   3   
3  agent_3   image_analysis                 1                   2   
4  agent_1  code_generation                 0                   2   

   agent_success_rate  task_complexity  duration_ms  
0              0.9873              1.8        15402  
1              0.9414              1.5        13372  
2              0.9027              1.2        12861  
3              0.9012              1.8         4609  
4              0.9003              1.5        12047  

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   agent_id            5000 non-null   obje

### Cell 3: Feature Engineering and Data Preparation
The model needs numeric input. We convert the text-based columns (agent_id, task_type) into a numeric format using one-hot encoding and pass the remaining columns, which are already numeric, through unchanged.

In [10]:
# Define features (X) and target (y)
features = [
    'agent_id', 
    'task_type', 
    'capability_match', 
    'agent_current_load', 
    'agent_success_rate', 
    'task_complexity'
]
target = 'duration_ms'

X = df[features]
y = df[target]

# Create a column transformer to handle one-hot encoding for categorical features
categorical_features = ['agent_id', 'task_type']
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_features),
    remainder='passthrough'
)

print("Feature engineering pipeline created.")

Feature engineering pipeline created.
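
To make the encoding step concrete, here is a minimal sketch on made-up data (the toy frame and the `to_dense` helper are illustrative assumptions, not part of the notebook). It uses the same `OneHotEncoder(handle_unknown='ignore')` and `remainder='passthrough'` settings as above and shows what happens to an agent the encoder never saw during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up for illustration (not taken from training_data.csv)
toy_train = pd.DataFrame({
    "agent_id": ["agent_1", "agent_2", "agent_1"],
    "agent_current_load": [2, 3, 1],
})

toy_preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["agent_id"]),
    remainder="passthrough",
)
toy_preprocessor.fit(toy_train)


def to_dense(matrix):
    """Return a NumPy array whether the transformer produced sparse or dense output."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


# A known agent gets a 1 in its own column; the passthrough column comes last.
known = to_dense(toy_preprocessor.transform(
    pd.DataFrame({"agent_id": ["agent_2"], "agent_current_load": [4]})
))

# An agent unseen during fit is encoded as all zeros instead of raising an error.
unseen = to_dense(toy_preprocessor.transform(
    pd.DataFrame({"agent_id": ["agent_9"], "agent_current_load": [4]})
))

print(known)   # columns: agent_1, agent_2, agent_current_load
print(unseen)
```

With `handle_unknown='ignore'`, a new agent or task type will not crash prediction, but the model receives no category signal for it, so predictions for unseen categories should be treated with caution.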


### Cell 4: Train-Test Split
We split our data into a training set (to teach the model) and a testing set (to see how well it performs on new data).

In [11]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Data split into {len(X_train)} training samples and {len(X_test)} testing samples.")

Data split into 4000 training samples and 1000 testing samples.
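
As a small illustration of what `test_size` and `random_state` do, the sketch below splits ten made-up samples (the `toy_` names are assumptions for this example only). It checks that the sizes follow the 80/20 ratio, that the same seed reproduces the same partition, and that no sample lands in both sets.

```python
from sklearn.model_selection import train_test_split

# Ten made-up samples standing in for the rows of X and y
toy_X = [[i] for i in range(10)]
toy_y = list(range(10))

split_a = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
split_b = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = split_a
print(len(toy_X_train), len(toy_X_test))

# The same seed yields the same partition on every run
same_partition = split_a[3] == split_b[3]

# Training and testing sets never share a sample
overlap = set(toy_y_train) & set(toy_y_test)
print(same_partition, overlap)
```

Fixing `random_state` keeps the evaluation numbers comparable between notebook runs; changing it would give a different, equally valid split.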


### Cell 5: Define and Train the Model
Here we create our Random Forest model and train it on the data. This may take a few seconds.

In [12]:
# Create the full model pipeline, including the preprocessor and the regressor
model_pipeline = make_pipeline(
    preprocessor,
    RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
)

print("Training model...")
model_pipeline.fit(X_train, y_train)
print("Model training complete.")

Training model...
Model training complete.
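
The following sketch, on four made-up rows, shows how a pipeline built with `make_pipeline` bundles the preprocessor and the regressor into one object. The toy data, the durations, and the smaller `n_estimators=10` are assumptions chosen to keep the example fast; they are not the notebook's settings.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows for illustration only
toy = pd.DataFrame({
    "task_type": ["image_analysis", "code_generation", "data_analysis", "image_analysis"],
    "task_complexity": [1.8, 1.5, 1.2, 1.8],
})
toy_durations = [15000, 13000, 12000, 5000]

toy_pipeline = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["task_type"]),
        remainder="passthrough",
    ),
    RandomForestRegressor(n_estimators=10, random_state=0),
)

# fit() fits the encoder first, then trains the forest on the encoded output
toy_pipeline.fit(toy, toy_durations)

# make_pipeline names each step after its lowercased class name
step_names = list(toy_pipeline.named_steps)
print(step_names)

# predict() accepts raw rows; the encoding is applied automatically
toy_prediction = toy_pipeline.predict(toy.head(1))
print(toy_prediction.shape)
```

Because the encoder lives inside the pipeline, it is fitted only on the rows passed to `fit`, and any caller can pass raw, unencoded rows to `predict`.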


### Cell 6: Evaluate Model Performance
Let's see how accurate our model is. We'll predict durations on the test data and check the mean absolute error.

In [13]:
# Make predictions on the test set
predictions = model_pipeline.predict(X_test)

# Calculate the error
mae = mean_absolute_error(y_test, predictions)
print("Model evaluation complete.")
print(f"Mean Absolute Error: {mae:.2f} ms")
print(f"This means our model's predictions are, on average, off by about {mae:.2f} milliseconds.")

Model evaluation complete.
Mean Absolute Error: 415.34 ms
This means our model's predictions are, on average, off by about 415.34 milliseconds.
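
For reference, mean absolute error is the average of the absolute differences between actual and predicted values. The sketch below computes it by hand on made-up durations; the function name and the numbers are illustrative assumptions, and the notebook itself relies on scikit-learn's `mean_absolute_error`.

```python
def mean_absolute_error_by_hand(actual, predicted):
    """Average of the absolute differences between paired values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one value is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up durations in milliseconds
toy_actual = [1000, 2000, 3000, 4000]
toy_predicted = [1100, 1900, 3300, 4000]

# Absolute errors are 100, 100, 300 and 0, so the mean is 500 / 4
toy_mae = mean_absolute_error_by_hand(toy_actual, toy_predicted)
print(f"{toy_mae:.2f} ms")
```

Because the errors are averaged without squaring, a single large miss affects MAE less than it would affect a squared-error metric.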


### Cell 7: Save the Model for Production Use
We save our trained model pipeline to a file so our Flask API can use it to make live predictions.

In [14]:
# Save the entire pipeline (preprocessor + model) to a file
model_filename = 'router_model.joblib'
joblib.dump(model_pipeline, model_filename)

# Also save the column order X was trained on, so the API can build inputs in the same order
model_columns = list(X.columns)
with open('model_columns.json', 'w') as f:
    json.dump(model_columns, f)

print(f"Model pipeline saved to '{model_filename}'")
print("Model columns saved to 'model_columns.json'")

Model pipeline saved to 'router_model.joblib'
Model columns saved to 'model_columns.json'
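
One way the saved column list might be used on the serving side is sketched below. The API code is not shown in this notebook, so `build_feature_frame` and the request payload are hypothetical; the sketch only illustrates turning an incoming dictionary into a one-row DataFrame whose columns match the training order.

```python
import json

import pandas as pd

# Stand-in for the contents of model_columns.json, inlined so the sketch runs without the file
columns = json.loads(
    '["agent_id", "task_type", "capability_match", '
    '"agent_current_load", "agent_success_rate", "task_complexity"]'
)


def build_feature_frame(payload, columns):
    """Hypothetical helper: build a one-row frame with columns in training order."""
    missing = [c for c in columns if c not in payload]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Extra keys in the payload are dropped; column order follows `columns`
    return pd.DataFrame([{c: payload[c] for c in columns}], columns=columns)


# Made-up request body with keys in a different order and one extra field
payload = {
    "task_complexity": 1.5,
    "task_type": "code_generation",
    "agent_id": "agent_1",
    "agent_success_rate": 0.94,
    "agent_current_load": 3,
    "capability_match": 0,
    "request_id": "abc123",
}

frame = build_feature_frame(payload, columns)
print(list(frame.columns))

# In the API, the saved pipeline would then be loaded and applied, for example:
#   model = joblib.load('router_model.joblib')
#   predicted_ms = model.predict(frame)[0]
```

Validating the payload before prediction gives the caller a clear error for a missing feature instead of a failure deep inside the pipeline.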
