## Task 4 – Train the Final Model Instance

This notebook implements Task 4 of the assignment. It takes the winning algorithm selected in Task 3 (LR_ElasticNet), finds its optimal hyperparameters with GridSearchCV on the full dataset, trains a final model instance with those parameters on all data, and saves the complete pipeline (including preprocessing) for deployment.

### Imports

Imports necessary libraries: `pandas` for data handling, `numpy` for numerical operations, various `scikit-learn` components for preprocessing (`LabelEncoder`, `RobustScaler`), modeling (`LogisticRegression`), pipeline creation (`Pipeline`), imputation (`SimpleImputer`), hyperparameter tuning (`GridSearchCV`, `StratifiedKFold`) and metrics (`make_scorer`, `roc_auc_score`). Also imports `joblib` for saving the final model, `os` for directory operations, and `warnings` to manage output messages.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, RobustScaler # Use the same scaler as in rnCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer, roc_auc_score # Use AUC for finding best params
import joblib # For saving the final model
import os # For creating the models directory
import warnings


### Configuration

Defines configuration variables for the final model training process: the dataset path, target and ID column names, the random state for reproducibility, the scaler (`RobustScaler`) and imputer strategy (`median`) chosen based on previous steps, the number of folds for GridSearchCV (`CV_FOLDS`), the metric for optimization (`roc_auc`), and the directory path and filename for saving the final trained model pipeline.

In [2]:
# --- Configuration ---
DATASET_PATH = '../data/breast_cancer.csv'
TARGET_VARIABLE = 'diagnosis'
ID_COLUMN = 'id'
RANDOM_STATE = 42
SCALER = RobustScaler # Ensure consistency with rnCV
IMPUTER_STRATEGY = 'median' # Ensure consistency
CV_FOLDS = 5 # Number of folds for GridSearchCV
# Define the metric for GridSearchCV optimization (must match inner_cv_metric if desired)
GRIDSEARCH_METRIC = 'roc_auc'
# Define the path to save the final model
MODEL_DIR = '../models'
MODEL_FILENAME = 'final_lr_elasticnet_model.pkl'
MODEL_SAVE_PATH = os.path.join(MODEL_DIR, MODEL_FILENAME)


### 1. Load and Prepare Data

Loads the full dataset from the specified path. Performs the same initial preprocessing steps used previously: drops the ID column, encodes the target variable numerically (0/1), and separates the data into features (`X`) and target (`y`). Using the full dataset ensures the final model is trained on all available samples. Error handling is included for file loading.

In [3]:
# --- 1. Load and Prepare Data ---
print("--- 1. Loading and Preparing Data ---")
try:
    df = pd.read_csv(DATASET_PATH)
    print(f"Dataset loaded successfully from: {DATASET_PATH}")
except FileNotFoundError:
    print(f"Error: Dataset file not found at {DATASET_PATH}")
    exit()
except Exception as e:
    print(f"An error occurred while loading the dataset: {e}")
    exit()

# Drop the ID column
if ID_COLUMN in df.columns:
    df = df.drop(columns=[ID_COLUMN])
    print(f"Dropped ID column: '{ID_COLUMN}'")

# Encode the target variable
if TARGET_VARIABLE in df.columns:
    if df[TARGET_VARIABLE].dtype == 'object':
        le = LabelEncoder()
        df[TARGET_VARIABLE] = le.fit_transform(df[TARGET_VARIABLE])
        print(f"Target variable '{TARGET_VARIABLE}' encoded.")
    # Separate features (X) and target (y) using ALL data
    X = df.drop(TARGET_VARIABLE, axis=1)
    y = df[TARGET_VARIABLE]
    print("Features (X) and target (y) separated using the full dataset.")
else:
    print(f"Error: Target variable '{TARGET_VARIABLE}' not found.")
    exit()


--- 1. Loading and Preparing Data ---
Dataset loaded successfully from: ../data/breast_cancer.csv
Dropped ID column: 'id'
Target variable 'diagnosis' encoded.
Features (X) and target (y) separated using the full dataset.
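
The 0/1 encoding above depends on how `LabelEncoder` orders the classes: it sorts the distinct labels, so the alphabetically first label becomes 0. The sketch below illustrates this on a toy list. The labels `'B'` and `'M'` are an assumption made for illustration; the notebook does not show the raw values of `diagnosis`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw labels: 'B' and 'M' are assumed, not taken from the dataset
labels = ['M', 'B', 'B', 'M', 'B']

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted, so position 0 holds 'B' and position 1 holds 'M'
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)           # {'B': 0, 'M': 1}
print(encoded.tolist())  # [1, 0, 0, 1, 0]
```

Checking `le.classes_` after fitting is a quick way to confirm which class the model treats as positive (1) when reading AUC or predicted probabilities.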


### 2. Define Final Pipeline Structure and Hyperparameter Grid

Defines the structure for the final scikit-learn pipeline, consisting of the median imputer, the `RobustScaler` and the selected winner algorithm (`LR_ElasticNet`). Specifies the hyperparameter grid (`param_grid`) for the `LR_ElasticNet` component to be searched using `GridSearchCV`. Note that parameters within the pipeline are prefixed (e.g., `LR_ElasticNet__C`). An expanded grid compared to Task 3 is used here to allow for more thorough final tuning.

In [4]:
# --- 2. Define Final Pipeline Structure and Hyperparameter Grid ---
print("\n--- 2. Defining Pipeline and Hyperparameter Grid for Winner Algorithm ---")

# Winner algorithm identified from Task 3
winner_algorithm_name = 'LR_ElasticNet'
print(f"Winner algorithm: {winner_algorithm_name}")

# Define the pipeline structure (Imputer -> Scaler -> Estimator)
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy=IMPUTER_STRATEGY)),
    ('scaler', SCALER()),
    (winner_algorithm_name, LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000, random_state=RANDOM_STATE, n_jobs=1))
])

# Define the hyperparameter grid for the winner algorithm.
# This is an expanded version of the grid explored in rnCV (wider C and
# l1_ratio ranges) to allow more thorough final tuning.
param_grid = {
    # Prefix parameters with the pipeline step name (estimator name followed by __)
    f'{winner_algorithm_name}__C': [0.01, 0.1, 1, 10, 50, 100], # Expanded C range slightly
    f'{winner_algorithm_name}__l1_ratio': [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1] # Expanded l1_ratio slightly
}
print("Using hyperparameter grid:")
print(param_grid)



--- 2. Defining Pipeline and Hyperparameter Grid for Winner Algorithm ---
Winner algorithm: LR_ElasticNet
Using hyperparameter grid:
{'LR_ElasticNet__C': [0.01, 0.1, 1, 10, 50, 100], 'LR_ElasticNet__l1_ratio': [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1]}
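
The `<step name>__<parameter>` prefix convention can be checked before running a search. The following is a minimal sketch, assuming the same pipeline structure as above; the names `pipe`, `grid`, and `step_name` are illustrative and not part of the notebook. It verifies that every grid key is a valid pipeline parameter and counts the candidate combinations with `ParameterGrid`.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

step_name = 'LR_ElasticNet'
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler()),
    (step_name, LogisticRegression(penalty='elasticnet', solver='saga',
                                   l1_ratio=0.5, max_iter=10000)),
])

grid = {
    f'{step_name}__C': [0.01, 0.1, 1, 10, 50, 100],
    f'{step_name}__l1_ratio': [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1],
}

# Every grid key must appear among the pipeline's parameter names
valid_names = pipe.get_params().keys()
unknown = [name for name in grid if name not in valid_names]
print("Unknown parameter names:", unknown)

# Number of hyperparameter combinations the search will evaluate (6 * 7)
n_candidates = len(ParameterGrid(grid))
print("Candidates:", n_candidates)

# The same prefixed names work with set_params
pipe.set_params(**{f'{step_name}__C': 0.1})
current_C = pipe.named_steps[step_name].C
```

A misspelled step name would show up in `unknown` here instead of failing later inside the grid search.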


### 3. Find Best Hyperparameters using GridSearchCV

Sets up and executes `GridSearchCV` to find the optimal hyperparameters for the `LR_ElasticNet` model within the defined pipeline. Stratified 5-fold cross-validation (`cv`) is performed on the entire dataset (`X`, `y`), optimizing for the `roc_auc` score. The grid of 6 `C` values and 7 `l1_ratio` values yields 42 candidates, so 210 fits in total. The search is parallelized across available CPU cores (`n_jobs=-1`) and verbosity is set to 1 to display progress. `FutureWarning` and `UserWarning` messages are filtered during the search.

In [5]:
# --- 3. Find Best Hyperparameters using GridSearchCV ---
print(f"\n--- 3. Finding Best Hyperparameters using {CV_FOLDS}-Fold CV ---")

# Suppress warnings during grid search
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)
# from sklearn.exceptions import ConvergenceWarning
# warnings.filterwarnings('ignore', category=ConvergenceWarning)

# Define the cross-validation strategy
cv = StratifiedKFold(n_splits=CV_FOLDS, shuffle=True, random_state=RANDOM_STATE)

# Setup GridSearchCV
# Note: 'roc_auc' needs continuous scores (decision_function or predict_proba);
# LogisticRegression provides both
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    scoring=GRIDSEARCH_METRIC, # Optimize for AUC
    cv=cv,
    n_jobs=-1, # Use all available cores for grid search
    verbose=1 # Set verbosity level (1 or 2 for more details)
)

# Fit GridSearchCV on the entire dataset (X, y)
print("Running GridSearchCV...")
grid_search.fit(X, y)



--- 3. Finding Best Hyperparameters using 5-Fold CV ---
Running GridSearchCV...
Fitting 5 folds for each of 42 candidates, totalling 210 fits
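
To illustrate what the stratified split contributes, the sketch below applies `StratifiedKFold` to synthetic labels and compares each test fold's positive rate with the overall rate. The label counts (63 negatives, 37 positives) are hypothetical and chosen only for illustration; they are not the dataset's class balance.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels (hypothetical counts, for illustration only)
y_demo = np.array([0] * 63 + [1] * 37)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the split

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

overall_rate = y_demo.mean()
fold_rates = [y_demo[test_idx].mean()
              for _, test_idx in cv_demo.split(X_demo, y_demo)]

print(f"Overall positive rate: {overall_rate:.3f}")
for i, rate in enumerate(fold_rates, start=1):
    print(f"Fold {i} positive rate: {rate:.3f}")
```

Each fold keeps roughly the same class proportions as the full data, which keeps the per-fold AUC values comparable.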


### 4. Display Best Hyperparameters

Prints the results obtained from the completed `GridSearchCV`. Displays the best mean cross-validated score achieved during the search (`grid_search.best_score_`) and the corresponding optimal hyperparameter set (`grid_search.best_params_`). Extracts and displays only the estimator-specific parameters for clarity. Because this score was also used to select the hyperparameters, it is a slightly optimistic estimate of performance on unseen data.

In [6]:
# --- 4. Display Best Hyperparameters ---
print("\n--- 4. Best Hyperparameters Found ---")
print(f"Best Score ({GRIDSEARCH_METRIC}): {grid_search.best_score_:.4f}")
print("Best Hyperparameters:")
# Extract only the parameters for the estimator step
best_params_estimator = {k.split('__')[1]: v for k, v in grid_search.best_params_.items()}
print(best_params_estimator)



--- 4. Best Hyperparameters Found ---
Best Score (roc_auc): 0.9944
Best Hyperparameters:
{'C': 0.1, 'l1_ratio': 0.1}
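
Beyond the single best result, `cv_results_` holds the mean and standard deviation of the score for every candidate, which helps judge how much the winner differs from the runners-up. The sketch below is a self-contained illustration, not the notebook's own search: it assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV file, and uses a reduced grid and 3 folds to keep it quick. All `demo_*` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)

demo_pipe = Pipeline([
    ('scaler', RobustScaler()),
    ('clf', LogisticRegression(penalty='elasticnet', solver='saga',
                               l1_ratio=0.5, max_iter=5000, random_state=42)),
])
demo_grid = {'clf__C': [0.1, 1], 'clf__l1_ratio': [0.1, 0.9]}  # reduced grid

demo_search = GridSearchCV(
    demo_pipe,
    demo_grid,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)
ranked = results.sort_values('rank_test_score')[
    ['param_clf__C', 'param_clf__l1_ratio', 'mean_test_score', 'std_test_score']
]
print(ranked.to_string(index=False))
```

If several candidates sit within one standard deviation of the top score, the choice between them is not strongly supported by the data.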


### 5. Train Final Model Instance

Creates the final deployable model pipeline instance. This pipeline uses the same structure (imputer, scaler, estimator) but configures the `LR_ElasticNet` step with the `best_params_estimator` identified by `GridSearchCV` in the previous step. This final pipeline is then trained (`.fit()`) on the **entire dataset** (`X`, `y`) to leverage all available data.

In [7]:
# --- 5. Train Final Model Instance ---
print("\n--- 5. Training Final Model Instance on Full Dataset ---")

# Create the final pipeline with the best hyperparameters found
final_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy=IMPUTER_STRATEGY)),
    ('scaler', SCALER()),
    (winner_algorithm_name, LogisticRegression(
        penalty='elasticnet',
        solver='saga',
        max_iter=10000, # Ensure sufficient iterations
        random_state=RANDOM_STATE,
        n_jobs=1, # Keep n_jobs=1 for final training consistency
        **best_params_estimator # Unpack the best C and l1_ratio
    ))
])

# Train the final pipeline on ALL available data (X, y)
final_pipeline.fit(X, y)
print("Final model pipeline trained successfully.")



--- 5. Training Final Model Instance on Full Dataset ---
Final model pipeline trained successfully.
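
As a side note, `GridSearchCV` uses `refit=True` by default, so after the search it already holds a copy of the pipeline refitted on the full data with the best parameters (`best_estimator_`). The sketch below suggests that this matches a manual refit like the one above. It is an illustration under assumptions: it uses scikit-learn's bundled `load_breast_cancer` data as a stand-in and a reduced grid, and the `demo_*` and `manual` names are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)

demo_pipe = Pipeline([
    ('scaler', RobustScaler()),
    ('clf', LogisticRegression(penalty='elasticnet', solver='saga',
                               l1_ratio=0.5, max_iter=5000, random_state=42)),
])

demo_search = GridSearchCV(
    demo_pipe,
    {'clf__C': [0.1, 1]},  # reduced grid to keep the sketch quick
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
demo_search.fit(X_demo, y_demo)  # refit=True: best_estimator_ is fitted on all data

# Manual refit with the best parameters, mirroring the approach in this section
manual = clone(demo_pipe).set_params(**demo_search.best_params_)
manual.fit(X_demo, y_demo)

same_coefficients = np.allclose(
    demo_search.best_estimator_.named_steps['clf'].coef_,
    manual.named_steps['clf'].coef_,
)
print("Coefficients match:", same_coefficients)
```

Building the final pipeline explicitly, as this notebook does, keeps the final configuration visible in the code; reusing `best_estimator_` would avoid a second fit.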


### 6. Save the Final Model

Saves the fully trained final pipeline object (`final_pipeline`) to the specified path (`MODEL_SAVE_PATH`) using `joblib.dump`. This serialized `.pkl` file encapsulates the fitted imputer, fitted scaler and the trained Logistic Regression model, making it ready for predictions on new data via the `predict.py` script. The cell first ensures the target directory (`../models`) exists, and after saving it restores the default warning behavior.

In [8]:
# --- 6. Save the Final Model ---
print("\n--- 6. Saving Final Model Pipeline ---")

# Create the models directory if it doesn't exist
os.makedirs(MODEL_DIR, exist_ok=True)
print(f"Ensuring directory exists: {MODEL_DIR}")

# Save the entire trained pipeline object
try:
    joblib.dump(final_pipeline, MODEL_SAVE_PATH)
    print(f"Final model pipeline saved successfully to: {MODEL_SAVE_PATH}")
except Exception as e:
    print(f"Error saving model: {e}")

# Reset warnings
warnings.filterwarnings('default')

print("\n--- Task 4 Complete ---")



--- 6. Saving Final Model Pipeline ---
Ensuring directory exists: ../models
Final model pipeline saved successfully to: ../models/final_lr_elasticnet_model.pkl

--- Task 4 Complete ---
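
The following sketch shows a save and load round trip for a fitted pipeline and how the reloaded object handles raw rows, including a missing value. It is a self-contained illustration under assumptions: a small stand-in pipeline trained on scikit-learn's bundled `load_breast_cancer` data, written to a temporary directory. It does not reproduce `predict.py`, whose contents are not shown in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Stand-in data and model (assumptions, for illustration only)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
demo_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(X_demo, y_demo)

# Round trip through a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_pipeline.pkl')
    joblib.dump(demo_pipe, demo_path)
    loaded = joblib.load(demo_path)

# New rows go in unscaled; a missing value is filled by the fitted imputer
new_rows = X_demo[:5].copy()
new_rows[0, 0] = np.nan

predicted_labels = loaded.predict(new_rows)
predicted_proba = loaded.predict_proba(new_rows)[:, 1]
print(predicted_labels)
print(predicted_proba.round(3))
```

Because preprocessing is stored inside the pipeline, the caller only needs to supply the feature columns in the same order used for training.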
