After spending time training and tuning a machine learning model (or a full `pipeline`), you'll often want to save it so you can reuse it later for making predictions on new data without having to retrain it every time. This process is called model persistence.

`Scikit-learn` recommends using the `joblib` library (installed automatically as a dependency of `scikit-learn`) for saving and loading models, as it is generally more efficient than `pickle` for objects containing large `NumPy` arrays. Python's built-in `pickle` module is another option. Because `joblib` is built on top of `pickle`, the same security caveats apply to both.
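To make the comparison concrete, here is a minimal sketch that round-trips the same object through both libraries. The `fake_model` dictionary is a hypothetical stand-in for a fitted estimator, and `compress=3` is an arbitrary choice for illustration, not a recommended setting.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np

# Hypothetical stand-in for a fitted model: a dict holding a large array.
rng = np.random.default_rng(0)
fake_model = {"coef": rng.normal(size=(1000, 100)), "name": "demo"}

with tempfile.TemporaryDirectory() as tmp_dir:
    joblib_path = os.path.join(tmp_dir, "model.joblib")
    pickle_path = os.path.join(tmp_dir, "model.pkl")

    # joblib: one call to save and one to load; compress (0-9) is optional.
    joblib.dump(fake_model, joblib_path, compress=3)
    from_joblib = joblib.load(joblib_path)

    # pickle: same idea, but you open and close the file yourself.
    with open(pickle_path, "wb") as f:
        pickle.dump(fake_model, f)
    with open(pickle_path, "rb") as f:
        from_pickle = pickle.load(f)

    # Record file sizes before the temporary directory is removed.
    sizes = {
        "joblib": os.path.getsize(joblib_path),
        "pickle": os.path.getsize(pickle_path),
    }

# Both libraries should restore the array exactly.
roundtrip_ok = (
    np.array_equal(from_joblib["coef"], fake_model["coef"])
    and np.array_equal(from_pickle["coef"], fake_model["coef"])
)
print(f"Round trip OK: {roundtrip_ok}")
print(f"File sizes in bytes: {sizes}")
```

The file sizes depend on the data and the compression level, so treat the printed numbers as something to measure on your own models rather than a fixed result.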

## Scikit-learn: Model Persistence (Saving & Loading)

This document covers:

* **Training a Model:** Briefly trains a sample `Pipeline` for demonstration.
* **Saving with `joblib`:** Shows how to use `joblib.dump()` to save the trained pipeline object to a file.
* **Loading with `joblib`:** Shows how to use `joblib.load()` to load the pipeline back into memory.
* **Verification:** Demonstrates making predictions with the loaded model to ensure it works as expected.
* **`pickle` Alternative:** Briefly mentions Python's built-in `pickle` module as another option, noting why `joblib` is generally preferred for `sklearn` objects and the security considerations that apply.
* **Considerations:** Highlights important points about library version compatibility and security.
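The version and security considerations can be sketched as a small save/load wrapper. This is one possible approach under stated assumptions, not an official scikit-learn mechanism: the helper names (`file_sha256`, `save_with_metadata`, `load_checked`) and the `.meta.json` sidecar layout are hypothetical. A checksum stored next to the model only catches accidental changes; to guard against tampering, the expected digest would need to come from a trusted location.

```python
import hashlib
import json
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def file_sha256(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def save_with_metadata(model, model_path):
    """Save the model plus a JSON sidecar (hypothetical layout)."""
    joblib.dump(model, model_path)
    metadata = {
        "sklearn_version": sklearn.__version__,
        "joblib_version": joblib.__version__,
        "sha256": file_sha256(model_path),
    }
    with open(model_path + ".meta.json", "w") as f:
        json.dump(metadata, f)
    return metadata


def load_checked(model_path):
    """Load only if the checksum and scikit-learn version match the sidecar."""
    with open(model_path + ".meta.json") as f:
        metadata = json.load(f)
    if file_sha256(model_path) != metadata["sha256"]:
        raise ValueError("Checksum mismatch: file changed since it was saved.")
    if sklearn.__version__ != metadata["sklearn_version"]:
        raise RuntimeError(
            f"Saved with scikit-learn {metadata['sklearn_version']}, "
            f"running {sklearn.__version__}."
        )
    return joblib.load(model_path)


X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200)).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_pipeline.joblib")
    save_with_metadata(pipe, path)
    restored = load_checked(path)
    same_predictions = bool((restored.predict(X) == pipe.predict(X)).all())

    # Simulate a modified file: the checksum check should refuse to load it.
    with open(path, "ab") as f:
        f.write(b"tampered")
    try:
        load_checked(path)
        tamper_detected = False
    except ValueError:
        tamper_detected = True

print(f"Restored model gives identical predictions: {same_predictions}")
print(f"Modified file rejected: {tamper_detected}")
```

Raising on a version mismatch is a deliberately strict choice for this sketch; a warning may be more practical when the versions are known to be compatible.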

---

Saving trained models is essential for deploying machine learning applications or reusing models without retraining.

# Import necessary libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline # Builds a Pipeline with auto-named steps
from sklearn.metrics import accuracy_score
import joblib # For saving and loading models
import os # For file path management

# --- 1. Train a Sample Model (or Pipeline) ---
# We'll train a simple pipeline on the Iris dataset

print("--- Training a Sample Model ---")
iris = load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names

# Split data (optional for this demo, but good practice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create and train a pipeline (Scaler + Logistic Regression)
# Using make_pipeline for simplicity here
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='liblinear', random_state=42)
)

pipe.fit(X_train, y_train)

# Evaluate the trained model (optional check)
y_pred_train = pipe.predict(X_train)
y_pred_test = pipe.predict(X_test)
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)

print(f"Trained Pipeline: {pipe}")
print(f"Accuracy on Training Data: {train_accuracy:.4f}")
print(f"Accuracy on Test Data: {test_accuracy:.4f}")
print("-" * 30)


# --- 2. Saving the Model using joblib ---
# joblib.dump(model_object, filename)

print("--- Saving the Model ---")

# Define filename and directory
output_dir = 'sklearn_models'
os.makedirs(output_dir, exist_ok=True) # No error if the directory already exists
model_filename = os.path.join(output_dir, 'iris_pipeline.joblib')

# Save the entire pipeline object
try:
    joblib.dump(pipe, model_filename)
    print(f"Pipeline successfully saved to: '{model_filename}'")
except Exception as e:
    print(f"Error saving model: {e}")
print("-" * 30)


# --- 3. Loading the Model using joblib ---
# loaded_model = joblib.load(filename)

print("--- Loading the Model ---")

# Check if file exists before loading
if os.path.exists(model_filename):
    try:
        loaded_pipe = joblib.load(model_filename)
        print(f"Pipeline successfully loaded from: '{model_filename}'")
        print(f"Loaded Pipeline object: {loaded_pipe}")

        # --- 4. Verifying the Loaded Model ---
        print("\n--- Verifying Loaded Model ---")
        # Make predictions with the loaded model on the test set
        y_pred_loaded = loaded_pipe.predict(X_test)

        # Compare predictions or evaluate performance
        loaded_accuracy = accuracy_score(y_test, y_pred_loaded)
        print(f"Accuracy of loaded model on Test Data: {loaded_accuracy:.4f}")

        # Check if the accuracy matches the original model's accuracy
        if np.isclose(test_accuracy, loaded_accuracy):
            print("Loaded model performance matches original model performance.")
        else:
            print("Warning: Loaded model performance differs from original.")

        # Example prediction on new data
        # New samples must have the same number of features as the training data (4 for Iris)
        X_new_samples = np.array([[5.1, 3.5, 1.4, 0.2],  # Should be Setosa (0)
                                  [6.7, 3.0, 5.2, 2.3],  # Should be Virginica (2)
                                  [5.9, 3.0, 4.2, 1.5]]) # Should be Versicolor (1)

        new_predictions = loaded_pipe.predict(X_new_samples)
        new_proba = loaded_pipe.predict_proba(X_new_samples) # If model supports it

        print("\nPredictions on new samples:")
        for i, pred in enumerate(new_predictions):
            print(f"  Sample {i+1}: Predicted Class = {pred} ({class_names[pred]})")
        print("\nProbabilities for new samples:\n", new_proba.round(3))

    except Exception as e:
        print(f"Error loading model: {e}")
else:
    print(f"Model file not found: '{model_filename}'")

print("-" * 30)


# --- 5. Using pickle (Alternative) ---
# import pickle
# pickle_filename = os.path.join(output_dir, 'iris_pipeline.pkl')
# # Save
# with open(pickle_filename, 'wb') as f:
#     pickle.dump(pipe, f)
# print(f"\nSaved model using pickle: '{pickle_filename}'")
# # Load
# with open(pickle_filename, 'rb') as f:
#     loaded_pipe_pickle = pickle.load(f)
# print(f"Loaded model using pickle: {loaded_pipe_pickle}")
# # Verify...
# Be aware of potential security risks when loading pickle files from untrusted sources.
print("--- pickle (Alternative) ---")
print("Python's 'pickle' module can also be used, but 'joblib' is generally")
print("preferred for scikit-learn objects due to efficiency with large NumPy arrays.")
print("Exercise caution when loading pickle files from untrusted sources.")
print("-" * 30)


# --- 6. Important Considerations ---
# - Library Versions: Load the model in an environment with the same versions of
#   scikit-learn and its dependencies (numpy, scipy, joblib) that were used to save it.
#   Loading across versions is not supported and can cause errors or unexpected behavior.
# - Security: Only load files from sources you trust. joblib files are built on pickle,
#   so loading either format can execute arbitrary code.
# - Model Updates: Saved models don't automatically update if the underlying libraries change.
#   You may need to retrain (and re-save) models after upgrading.

--- Training a Sample Model ---
Trained Pipeline: Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression',
                 LogisticRegression(random_state=42, solver='liblinear'))])
Accuracy on Training Data: 0.9250
Accuracy on Test Data: 0.8333
------------------------------
--- Saving the Model ---
Pipeline successfully saved to: 'sklearn_models\iris_pipeline.joblib'
------------------------------
--- Loading the Model ---
Pipeline successfully loaded from: 'sklearn_models\iris_pipeline.joblib'
Loaded Pipeline object: Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression',
                 LogisticRegression(random_state=42, solver='liblinear'))])

--- Verifying Loaded Model ---
Accuracy of loaded model on Test Data: 0.8333
Loaded model performance matches original model performance.

Predictions on new samples:
  Sample 1: Predicted Class = 0 (setosa)
  Sample 2: Predicted Class = 2 (virginica)
  Sample 3: Pre