# Model Engineering

Approaching a regression problem with multiple targets (also known as multi-output regression) involves similar steps to a single-output regression, but there are some additional considerations. Here's a general approach:

1. **Understand the problem and data**: Understand the problem domain, the meaning of each feature, and the relationship between the features and the targets. Perform exploratory data analysis to get insights into the data.

2. **Preprocess the data**: Clean the data, handle missing values and outliers, and encode categorical variables. Normalize or standardize the data if necessary.

3. **Feature selection/engineering**: Depending on the complexity of the dataset, you might need to perform feature selection or engineer new features that can help improve the model's performance.

4. **Model selection**: Choose a suitable model. Some regression models like linear regression, decision trees, and neural networks can directly handle multiple targets. For models that don't support multi-output regression directly, you can use wrapper methods like `MultiOutputRegressor` in scikit-learn.

5. **Train the model**: Split the data into a training set and a test set. Fit the model to the training data.

6. **Evaluate the model**: Use appropriate metrics to evaluate the model's performance on the test set. For regression tasks, common metrics include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared.

7. **Hyperparameter tuning**: Use methods like grid search or random search to find the optimal hyperparameters for your model.

8. **Interpret the model**: Depending on the model used, interpret the results and understand the relationship between the features and the targets.

9. **Iterate**: Based on the performance and interpretation of the model, you might need to go back to previous steps and make adjustments, such as collecting more data, engineering new features, or trying different models.

Remember, the key to a successful machine learning project is iteration. It's unlikely that you'll get everything perfect on the first try, so be prepared to go through this process multiple times.
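As a rough illustration of the model-selection step, the sketch below contrasts an estimator that accepts a two-column target directly with one that needs the `MultiOutputRegressor` wrapper. The data is synthetic (generated with `make_regression`), and `SVR` is used here only as an example of a single-output model; neither comes from this project.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Synthetic stand-in data: 200 samples, 3 features, 2 targets (illustrative only)
X, y = make_regression(n_samples=200, n_features=3, n_targets=2,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# LinearRegression accepts a 2-D target matrix directly
native = LinearRegression().fit(X_train, y_train)

# SVR is single-output, so the wrapper fits one independent SVR per target column
wrapped = MultiOutputRegressor(SVR(kernel="linear")).fit(X_train, y_train)

native_pred = native.predict(X_test)
wrapped_pred = wrapped.predict(X_test)

print("Native prediction shape:", native_pred.shape)
print("Wrapped prediction shape:", wrapped_pred.shape)
print("Fitted sub-models in wrapper:", len(wrapped.estimators_))
```

Because the wrapper trains each target independently, it cannot exploit correlation between targets; whether that matters depends on the data.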

## Gather Data

In [2]:
# Load Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import mlflow
import azureml
from azureml.core import Workspace, Experiment, Run
import sklearn

from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

import os
import warnings
warnings.filterwarnings('ignore')  # Suppress all warnings to keep notebook output readable (not advisable in production code)

In [3]:
# Load Data
def load_data(train_file_path, test_file_path):
    try:
        train_tka = pd.read_csv(train_file_path)
        test_tka = pd.read_csv(test_file_path)
        return train_tka, test_tka
    except FileNotFoundError:
        print("File not found. Please provide valid file paths.")
        return None, None
    except Exception as e:
        print("An error occurred while loading the data:", str(e))
        return None, None

train_file_path = r'C:\Users\HP\Desktop\Predicting-Component-Sizing-in-Primary-Total-Knee-Arthroplasty-using-Demographic-Variables\Data\ForModeling\tka_data.csv'
test_file_path = r'C:\Users\HP\Desktop\Predicting-Component-Sizing-in-Primary-Total-Knee-Arthroplasty-using-Demographic-Variables\Data\ForModeling\tka_data_test.csv'

# Call the load_data function and capture the returned data in variables
train_tka, test_tka = load_data(train_file_path, test_file_path)

if train_tka is not None and test_tka is not None:
    print('Data Loaded Successfully!')

Data Loaded Successfully!


In [4]:
# Separate features and targets
features = train_tka[['gender', 'height_cm', 'weight_kg']]
targets = train_tka[['femur_dim', 'tibia_dim']]

# Print the shape of features and targets
print("Shape of features:", features.shape)
print("Shape of targets:", targets.shape)


Shape of features: (3299, 3)
Shape of targets: (3299, 2)


In [5]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=8888)
print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))


Training set size: 2639
Testing set size: 660


In [6]:
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


## Model Selection and Training

In [7]:
# Define the base estimators
base_estimators = [
    LinearRegression(),
    Ridge(),
    Lasso(),
    MultiOutputRegressor(DecisionTreeRegressor(random_state=42)),
    MultiOutputRegressor(RandomForestRegressor(n_estimators=100, random_state=42)),
    MultiOutputRegressor(GradientBoostingRegressor(random_state=42)),
    MultiOutputRegressor(XGBRegressor(objective='reg:squarederror', random_state=42))
]

# Initialize a list to store the results
results = []

# Create a pipeline for each base estimator, fit it to the data, make predictions, and evaluate the model
for base_estimator in base_estimators:
    print(f"Training and evaluating {base_estimator.__class__.__name__}...")
    pipeline = Pipeline([
        ('regression', base_estimator)  # Regression step
    ])
    pipeline.fit(X_train_scaled, y_train)
    y_pred = pipeline.predict(X_test_scaled)
    
    # Calculate the metrics
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    
    # Store the results
    results.append([base_estimator.__class__.__name__, rmse, r2, mae])

# Create a DataFrame with the results
results_df = pd.DataFrame(results, columns=['Model', 'RMSE', 'R2 Score', 'MAE'])

# Set 'Model' as the index
results_df.set_index('Model', inplace=True)

results_df

Training and evaluating LinearRegression...
Training and evaluating Ridge...
Training and evaluating Lasso...
Training and evaluating MultiOutputRegressor...
Training and evaluating MultiOutputRegressor...
Training and evaluating MultiOutputRegressor...
Training and evaluating MultiOutputRegressor...


| Model | RMSE | R2 Score | MAE |
|---|---|---|---|
| LinearRegression | 3.301016 | 0.547154 | 2.589017 |
| Ridge | 3.300926 | 0.547178 | 2.588998 |
| Lasso | 3.380034 | 0.523367 | 2.72623 |
| MultiOutputRegressor | 4.319664 | 0.213891 | 3.185699 |
| MultiOutputRegressor | 3.668615 | 0.433396 | 2.802775 |
| MultiOutputRegressor | 3.306121 | 0.544705 | 2.552612 |
| MultiOutputRegressor | 3.524324 | 0.479954 | 2.71329 |

The four `MultiOutputRegressor` rows follow the order of `base_estimators`: `DecisionTreeRegressor`, `RandomForestRegressor`, `GradientBoostingRegressor`, and `XGBRegressor`.
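
The results table labels every wrapped model `MultiOutputRegressor` because the label comes from the class name of the outer object. One possible way to get distinguishable labels is to unwrap the inner estimator when building the name. The helper `display_name` below is hypothetical and not part of the original notebook.

```python
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.tree import DecisionTreeRegressor


def display_name(estimator):
    """Return a readable label, unwrapping MultiOutputRegressor when present."""
    if isinstance(estimator, MultiOutputRegressor):
        # The wrapped model is stored on the `estimator` attribute
        return f"MultiOutput({estimator.estimator.__class__.__name__})"
    return estimator.__class__.__name__


candidates = [
    LinearRegression(),
    MultiOutputRegressor(DecisionTreeRegressor(random_state=42)),
]
labels = [display_name(est) for est in candidates]
print(labels)
```

Using such a label in place of `base_estimator.__class__.__name__` would give each row of the results table a unique index.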


## Hyperparameter Tuning

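Gradient boosting has the lowest Mean Absolute Error (MAE) in the comparison above, so it is the model tuned here. Note that the grid search selects hyperparameters by 5-fold cross-validated mean squared error (`neg_mean_squared_error`), so the reported best score is an MSE rather than an MAE. The tuned model is then evaluated on the held-out split and used to make predictions on new data.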

In [8]:
# Define the base estimator and its parameter grid
estimator = Pipeline([
    ('regression', MultiOutputRegressor(GradientBoostingRegressor(random_state=42)))  # Regression step
])
parameters = {
    'regression__estimator__n_estimators': [50, 100, 200],
    'regression__estimator__learning_rate': [0.01, 0.1, 1.0],
    'regression__estimator__max_depth': [3, 5, 7],
    'regression__estimator__min_samples_leaf': [1, 2, 3]
}

# Create a GridSearchCV, fit it to the data, and print the best parameters and score
print(f"Tuning hyperparameters for {estimator.steps[-1][0]}...")
grid_search = GridSearchCV(estimator, parameters, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {-grid_search.best_score_}")

Tuning hyperparameters for regression...
Best parameters: {'regression__estimator__learning_rate': 0.1, 'regression__estimator__max_depth': 3, 'regression__estimator__min_samples_leaf': 2, 'regression__estimator__n_estimators': 50}
Best score: 11.25777294526315
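
The grid keys follow the pattern `<pipeline step>__estimator__<parameter>`: `regression` is the pipeline step name, `estimator` is the attribute on `MultiOutputRegressor` that holds the wrapped model, and the last part is the wrapped model's own parameter. The sketch below shows the same naming on synthetic data, with the scaler placed inside the pipeline so that it is refit on each cross-validation training fold. This is a variation on the notebook's setup, not what the notebook does, and the grid is deliberately small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with two targets (illustrative only)
X, y = make_regression(n_samples=150, n_features=3, n_targets=2,
                       noise=5.0, random_state=0)

# Scaling is a pipeline step, so each CV fold fits its own scaler
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regression", MultiOutputRegressor(GradientBoostingRegressor(random_state=42))),
])

# step name -> wrapper attribute -> parameter of the wrapped model
param_grid = {
    "regression__estimator__n_estimators": [20, 50],
    "regression__estimator__max_depth": [2, 3],
}

search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

# The scorer is negated MSE, so flip the sign to read it as an MSE
best_cv_mse = -search.best_score_
print("Best parameters:", search.best_params_)
print("Best cross-validated MSE:", best_cv_mse)
```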


## Model Evaluation

In [9]:
# Define the best model with the best hyperparameters
model = MultiOutputRegressor(GradientBoostingRegressor(learning_rate=0.1, n_estimators=50, max_depth=3, min_samples_leaf=2, random_state=42))

# Fit the model to the data
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate the metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print("RMSE:", rmse)
print("R2 Score:", r2)
print("MAE:", mae)


RMSE: 3.293770451756413
R2 Score: 0.5478755067797553
MAE: 2.55470550911337
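
The scores above are averaged over both targets, because scikit-learn's regression metrics default to `multioutput='uniform_average'`. A per-target breakdown can show whether one target is predicted noticeably better than the other. The sketch below uses small made-up arrays (two columns standing in for `femur_dim` and `tibia_dim`), not the study data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values; columns stand in for femur_dim and tibia_dim
y_true = np.array([[60.0, 70.0], [62.0, 72.0], [64.0, 74.0], [66.0, 76.0]])
y_hat = np.array([[61.0, 70.0], [62.0, 74.0], [63.0, 74.0], [66.0, 78.0]])

# multioutput="raw_values" returns one score per target column
mse_per_target = mean_squared_error(y_true, y_hat, multioutput="raw_values")
rmse_per_target = np.sqrt(mse_per_target)
mae_per_target = mean_absolute_error(y_true, y_hat, multioutput="raw_values")
r2_per_target = r2_score(y_true, y_hat, multioutput="raw_values")

for name, rmse_t, mae_t, r2_t in zip(["femur_dim", "tibia_dim"],
                                     rmse_per_target, mae_per_target, r2_per_target):
    print(f"{name}: RMSE={rmse_t:.3f}  MAE={mae_t:.3f}  R2={r2_t:.3f}")
```

Note that the square root of the averaged MSE, as computed in the notebook, is generally not equal to the average of the per-target RMSE values.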


## Make Predictions 

In [10]:
# Select the features and targets from the test_tka dataset
X_test_new = test_tka[['gender', 'height_cm', 'weight_kg']]
y_test_new = test_tka[['femur_dim', 'tibia_dim']]

# Scale the features
X_test_scaled_new = scaler.transform(X_test_new)

# Make predictions using the best model
y_pred_test = model.predict(X_test_scaled_new)

# Calculate the metrics
rmse = np.sqrt(mean_squared_error(y_test_new, y_pred_test))
r2 = r2_score(y_test_new, y_pred_test)
mae = mean_absolute_error(y_test_new, y_pred_test)
print("RMSE:", rmse)
print("R2 Score:", r2)
print("MAE:", mae)

# Add the predictions to test_tka as new columns
test_tka['femur_dim_pred'], test_tka['tibia_dim_pred'] = y_pred_test[:, 0], y_pred_test[:, 1]

# Optionally, save the predictions to a file
predictions_file_path = r'C:\Users\HP\Desktop\Predicting-Component-Sizing-in-Primary-Total-Knee-Arthroplasty-using-Demographic-Variables\Data\Predictions\Predictions.csv'
test_tka.to_csv(predictions_file_path, index=False)


RMSE: 3.5858965365009
R2 Score: 0.5471605917828155
MAE: 2.811895769274299


## Experiment Tracking using MLflow

Point MLflow at the Azure Machine Learning workspace by setting its tracking URI to the workspace's MLflow tracking URI.

In [21]:
# Connect to the workspace
ws = Workspace.from_config()

# Configure MLflow to log to an Azure ML Workspace
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

Create an MLflow experiment, which groups related runs.

In [22]:
# Set the MLflow experiment
experiment_name = "Predicting Component Sizing in Primary Total Knee Arthroplasty using Demographic Variables"
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='', creation_time=1709351654064, experiment_id='c91c0b77-be46-4cdc-a53d-f90e6561d8dc', last_update_time=None, lifecycle_stage='active', name=('Predicting Component Sizing in Primary Total Knee Arthroplasty using '
 'Demographic Variables'), tags={}>

Log the model, the metrics, and the parameters of the best model to the MLflow experiment.

In [None]:
# Start a run
with mlflow.start_run() as run:

    # Gradient boosting is not trained in epochs, so fit once and log each metric once
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

    # Calculate the metrics
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)

    # Log the metrics to the MLflow experiment
    mlflow.log_metric("RMSE", rmse)
    mlflow.log_metric("R2 Score", r2)
    mlflow.log_metric("MAE", mae)

    # Log the model to the MLflow experiment
    mlflow.sklearn.log_model(model, "model")

    # Log the parameters of the best model to the MLflow experiment
    mlflow.log_params(model.get_params())

In [19]:
mlflow_client = mlflow.tracking.MlflowClient()
experiment = mlflow_client.get_experiment_by_name(experiment_name)
