# Module 03: Model Serialization

**Difficulty**: ⭐⭐ Intermediate
**Estimated Time**: 50 minutes
**Prerequisites**: 
- [Module 01: Experiment Tracking with MLflow](01_experiment_tracking_mlflow.ipynb)
- [Module 02: Model Versioning and Registry](02_model_versioning_registry.ipynb)

## Learning Objectives
By the end of this notebook, you will be able to:
1. Understand different model serialization formats and their trade-offs
2. Serialize models using pickle and joblib
3. Export models to ONNX for cross-platform compatibility
4. Optimize model size for deployment
5. Handle version compatibility issues
6. Choose the right serialization format for your use case

## 1. Why Model Serialization Matters

### The Challenge

You've trained a model in a Jupyter notebook:
- **Problem**: How do you save it for later use?
- **Problem**: How do you deploy it to a web server?
- **Problem**: How do you share it with teammates?
- **Problem**: What if production uses a different Python version?

### The Solution: Serialization

**Serialization** converts a Python object (your model) into a format that can be:
- Saved to disk
- Transmitted over a network
- Loaded in different environments
- Used across programming languages (with language-neutral formats such as ONNX)
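
As a minimal illustration of the idea, the sketch below round-trips a plain dictionary through bytes with the standard-library `pickle` module. The `toy_model` dictionary is a hypothetical stand-in for a trained model, not part of this notebook's workflow.

```python
import pickle

# Hypothetical stand-in for a trained model: any picklable object behaves the same way
toy_model = {"weights": [0.25, -1.5, 3.0], "bias": 0.1}

# Serialize: Python object -> bytes (these could be written to disk or sent over a socket)
payload = pickle.dumps(toy_model)

# Deserialize: bytes -> an equal but independent Python object
restored = pickle.loads(payload)

print(type(payload).__name__, len(payload), "bytes")
print("Round trip preserved the object:", restored == toy_model)
```

The restored object is a copy, not the same object in memory, which is what makes it possible to load a model in a different process or on a different machine.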

### Common Serialization Formats

| Format | Best For | Pros | Cons |
|--------|----------|------|------|
| **pickle** | Python-only deployment | Native Python support | Python version sensitive |
| **joblib** | Large numpy arrays | Efficient with sklearn | Python-only |
| **ONNX** | Cross-platform deployment | Language-agnostic | Requires conversion |
| **TensorFlow SavedModel** | TensorFlow models | Production-ready | Framework-specific |
| **PyTorch TorchScript** | PyTorch models | Optimized inference | Framework-specific |

In [None]:
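# Setup: Import all required libraries
import warnings
# Note: this silences every warning, including scikit-learn's version-mismatch
# warning when unpickling; avoid a blanket filter in production code.
warnings.filterwarnings('ignore')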

import os
import pickle
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Machine learning libraries
from sklearn.datasets import make_classification, load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

# Create models directory
models_dir = Path("saved_models")
models_dir.mkdir(exist_ok=True)

print("Setup complete!")
print(f"Models will be saved to: {models_dir.absolute()}")

## 2. Preparing Data and Training Models

Let's train some models to demonstrate different serialization approaches.

In [None]:
# Generate synthetic dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

In [None]:
# Train a simple Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)

# Evaluate
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"✓ Random Forest trained")
print(f"  Accuracy: {accuracy:.4f}")

## 3. Serialization with Pickle

**pickle** is Python's built-in serialization library. It's simple but has limitations.

In [None]:
# Save model using pickle
pickle_path = models_dir / "rf_model.pkl"

with open(pickle_path, 'wb') as f:
    pickle.dump(rf_model, f)

print(f"✓ Model saved with pickle")
print(f"  File: {pickle_path}")
print(f"  Size: {pickle_path.stat().st_size / 1024:.2f} KB")

In [None]:
# Load model using pickle
with open(pickle_path, 'rb') as f:
    loaded_model = pickle.load(f)

# Test loaded model
loaded_predictions = loaded_model.predict(X_test)
loaded_accuracy = accuracy_score(y_test, loaded_predictions)

print(f"✓ Model loaded from pickle")
print(f"  Accuracy: {loaded_accuracy:.4f}")
print(f"  Original == Loaded: {accuracy == loaded_accuracy}")

### Pickle Pros and Cons

**Pros**:
- Simple and built into Python
- Works with any Python object
- Fast for small models

**Cons**:
- **Security risk**: Never unpickle untrusted data (code execution vulnerability)
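- **Python version sensitivity**: Pickles written by a newer Python or with a newer pickle protocol may not load on an older Python
- **Framework version sensitivity**: scikit-learn does not guarantee that a model pickled with one version loads or behaves correctly in another
- No built-in compression; joblib is usually more convenient for objects holding large numpy arrays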
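
To make the security risk concrete, here is a small and harmless sketch of why loading a pickle can run code. The `NotAModel` class is hypothetical; it uses `__reduce__` to tell pickle to call `eval` at load time. A real attack would substitute a harmful callable.

```python
import pickle

class NotAModel:
    """Hypothetical stand-in for a malicious 'model' file."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object's state.
        # An attacker could put any importable callable and arguments here.
        return (eval, ("6 * 7",))

malicious_bytes = pickle.dumps(NotAModel())

# Loading runs the embedded call; no NotAModel instance comes back
result = pickle.loads(malicious_bytes)
print(result)  # prints 42, the value returned by eval
```

The same mechanism applies to joblib files, since joblib builds on pickle.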

## 4. Serialization with Joblib

**joblib** is optimized for objects containing large numpy arrays, making it ideal for sklearn models.

In [None]:
# Save model using joblib
joblib_path = models_dir / "rf_model.joblib"

joblib.dump(rf_model, joblib_path)

print(f"✓ Model saved with joblib")
print(f"  File: {joblib_path}")
print(f"  Size: {joblib_path.stat().st_size / 1024:.2f} KB")

In [None]:
# Load model using joblib
loaded_model_joblib = joblib.load(joblib_path)

# Test loaded model
joblib_predictions = loaded_model_joblib.predict(X_test)
joblib_accuracy = accuracy_score(y_test, joblib_predictions)

print(f"✓ Model loaded from joblib")
print(f"  Accuracy: {joblib_accuracy:.4f}")

In [None]:
# Compare file sizes: pickle vs joblib
pickle_size = pickle_path.stat().st_size / 1024
joblib_size = joblib_path.stat().st_size / 1024

print("File Size Comparison:")
print(f"  Pickle: {pickle_size:.2f} KB")
print(f"  Joblib: {joblib_size:.2f} KB")
print(f"  Difference: {abs(pickle_size - joblib_size):.2f} KB")

# Visualize comparison
fig, ax = plt.subplots(figsize=(8, 6))
formats = ['Pickle', 'Joblib']
sizes = [pickle_size, joblib_size]
colors = ['skyblue', 'lightcoral']

ax.bar(formats, sizes, color=colors, edgecolor='black', alpha=0.7)
ax.set_ylabel('File Size (KB)', fontweight='bold', fontsize=12)
ax.set_title('Model Serialization Format Comparison', fontweight='bold', fontsize=14)
ax.grid(axis='y', alpha=0.3)

for i, v in enumerate(sizes):
    ax.text(i, v + 1, f'{v:.2f} KB', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

### Joblib Pros and Cons

**Pros**:
- Efficient for large numpy arrays
- Built-in compression via the `compress` argument (pickle has none of its own)
- Often faster to save and load models that hold large arrays
- Covered in the scikit-learn model persistence documentation

**Cons**:
- Still Python-only
- Same version sensitivity as pickle
- Same security concerns
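
One way to reduce (not remove) the risk is to record a checksum when you save a model and refuse to load a file whose checksum has changed. The sketch below uses only the standard library; `sha256_of_file` and `load_if_trusted` are hypothetical helper names, and a plain pickle file stands in for a model artifact. A checksum only detects changes relative to a digest you recorded yourself. It does not make a file from an untrusted source safe.

```python
import hashlib
import hmac
import pickle
import tempfile
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def load_if_trusted(path, expected_sha256):
    """Unpickle only if the file matches the digest recorded at save time."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256):
        raise ValueError(f"Checksum mismatch for {path}; refusing to load")
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "toy_model.pkl"
    artifact.write_bytes(pickle.dumps({"coef": [1.0, 2.0]}))
    recorded = sha256_of_file(artifact)  # store this next to the model

    checked_model = load_if_trusted(artifact, recorded)

    # Simulate tampering: the digest no longer matches, so loading is refused
    artifact.write_bytes(artifact.read_bytes() + b"tampered")
    try:
        load_if_trusted(artifact, recorded)
        tamper_detected = False
    except ValueError:
        tamper_detected = True

print(checked_model, tamper_detected)
```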

## 5. Serialization with Compression

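joblib has built-in compression through the `compress` argument of `joblib.dump`. pickle has no compression option of its own, so a pickle file has to be wrapped with a compressor such as `gzip`. The cell below uses joblib.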
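
For pickle, compression typically comes from wrapping the file object with a standard-library compressor. The following is a minimal sketch using `gzip`; the `toy_model` dictionary is a hypothetical stand-in chosen because repetitive data compresses well.

```python
import gzip
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a model; repeated values compress well
toy_model = {"tree_thresholds": [0.5] * 10_000, "n_features": 20}

with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "toy_model.pkl"
    gz_path = Path(tmp) / "toy_model.pkl.gz"

    # Plain pickle
    with open(raw_path, "wb") as f:
        pickle.dump(toy_model, f)

    # Same pickle stream, written through a gzip file object
    with gzip.open(gz_path, "wb") as f:
        pickle.dump(toy_model, f)

    raw_size = raw_path.stat().st_size
    gz_size = gz_path.stat().st_size

    # Loading mirrors saving: open with gzip, then unpickle
    with gzip.open(gz_path, "rb") as f:
        restored = pickle.load(f)

print(f"Plain pickle: {raw_size} bytes, gzip pickle: {gz_size} bytes")
print("Round trip preserved the object:", restored == toy_model)
```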

In [None]:
# Save with different compression levels
compression_levels = [0, 3, 6, 9]  # 0 = no compression, 9 = max compression
results = []

for level in compression_levels:
    compressed_path = models_dir / f"rf_model_compress_{level}.joblib"
    joblib.dump(rf_model, compressed_path, compress=level)
    
    file_size = compressed_path.stat().st_size / 1024
    results.append({
        'compression_level': level,
        'size_kb': file_size
    })
    
    print(f"Compression level {level}: {file_size:.2f} KB")

# Visualize compression impact
results_df = pd.DataFrame(results)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(results_df['compression_level'], results_df['size_kb'], 
        marker='o', linewidth=2, markersize=10, color='darkblue')
ax.set_xlabel('Compression Level', fontweight='bold', fontsize=12)
ax.set_ylabel('File Size (KB)', fontweight='bold', fontsize=12)
ax.set_title('Impact of Compression on Model Size', fontweight='bold', fontsize=14)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate savings
original_size = results_df.loc[results_df['compression_level'] == 0, 'size_kb'].values[0]
max_compressed = results_df.loc[results_df['compression_level'] == 9, 'size_kb'].values[0]
savings_percent = ((original_size - max_compressed) / original_size) * 100

print(f"\n✓ Maximum compression savings: {savings_percent:.1f}%")
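
Besides an integer level, `joblib.dump` also accepts a `(method, level)` tuple for `compress`, which selects the compression algorithm. The sketch below compares three methods on a small hypothetical array that stands in for model state. Which method is best depends on the model, so treat the sizes as something to measure rather than assume.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np

# Hypothetical stand-in for model state: 50,000 float64 values with a repeating pattern
state = {"thresholds": np.tile(np.arange(100, dtype=np.float64), 500)}

sizes = {}
roundtrip_ok = True

with tempfile.TemporaryDirectory() as tmp:
    for method in ("zlib", "gzip", "bz2"):
        path = Path(tmp) / f"state_{method}.joblib"
        # (method, level) picks the algorithm explicitly; level 3 is a moderate setting
        joblib.dump(state, path, compress=(method, 3))
        sizes[method] = path.stat().st_size

        # joblib.load detects the compression method from the file contents
        restored = joblib.load(path)
        roundtrip_ok = roundtrip_ok and np.array_equal(
            restored["thresholds"], state["thresholds"]
        )

for method, size in sizes.items():
    print(f"{method}: {size / 1024:.2f} KB")
print("All round trips preserved the data:", roundtrip_ok)
```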

## 6. Serializing Pipelines

In production, you often need to save preprocessing steps along with the model.

In [None]:
# Create a pipeline with preprocessing and model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Evaluate
pipeline_accuracy = pipeline.score(X_test, y_test)
print(f"✓ Pipeline trained")
print(f"  Accuracy: {pipeline_accuracy:.4f}")

In [None]:
# Save entire pipeline
pipeline_path = models_dir / "pipeline.joblib"
joblib.dump(pipeline, pipeline_path)

print(f"✓ Pipeline saved")
print(f"  Size: {pipeline_path.stat().st_size / 1024:.2f} KB")

# Load and test
loaded_pipeline = joblib.load(pipeline_path)
loaded_pipeline_accuracy = loaded_pipeline.score(X_test, y_test)

print(f"\n✓ Pipeline loaded")
print(f"  Accuracy: {loaded_pipeline_accuracy:.4f}")
print(f"  Includes: {[step[0] for step in loaded_pipeline.steps]}")
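
To see why the preprocessing has to travel with the model, the sketch below saves a full pipeline and, separately, only its classifier, then scores both on raw (unscaled) data. The dataset and the `demo_*` names are hypothetical and chosen so that feature scale matters.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical raw features on a large scale (mean around 1000)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=1000.0, scale=50.0, size=(200, 2))
y_demo = (X_demo[:, 0] > 1000.0).astype(int)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
demo_pipeline.fit(X_demo, y_demo)
original_pred = demo_pipeline.predict(X_demo)

with tempfile.TemporaryDirectory() as tmp:
    full_path = Path(tmp) / "demo_pipeline.joblib"
    clf_path = Path(tmp) / "demo_classifier_only.joblib"

    joblib.dump(demo_pipeline, full_path)
    joblib.dump(demo_pipeline.named_steps["classifier"], clf_path)

    restored_pipeline = joblib.load(full_path)
    restored_classifier = joblib.load(clf_path)

# The pipeline rescales raw input itself; the bare classifier sees unscaled values
demo_full_acc = restored_pipeline.score(X_demo, y_demo)
demo_clf_only_acc = restored_classifier.score(X_demo, y_demo)

print(f"Full pipeline on raw data:   {demo_full_acc:.4f}")
print(f"Classifier only on raw data: {demo_clf_only_acc:.4f}")
```

The bare classifier was trained on standardized values, so feeding it raw features silently degrades its predictions instead of raising an error.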

## 7. ONNX: Cross-Platform Model Format

**ONNX (Open Neural Network Exchange)** allows models to be used across different frameworks and languages.

**Use cases**:
- Deploy Python-trained models to C++/Java applications
- Use GPU acceleration with ONNX Runtime
- Ensure model compatibility across environments

In [None]:
# Install required packages (run once)
# !pip install skl2onnx onnxruntime

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as rt

# Define input type for ONNX conversion
# We need to specify the shape: (None, 20) means any number of samples with 20 features
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]

# Convert sklearn model to ONNX
onnx_model = convert_sklearn(rf_model, initial_types=initial_type)

# Save ONNX model
onnx_path = models_dir / "rf_model.onnx"
with open(onnx_path, "wb") as f:
    f.write(onnx_model.SerializeToString())

print(f"✓ Model converted to ONNX")
print(f"  File: {onnx_path}")
print(f"  Size: {onnx_path.stat().st_size / 1024:.2f} KB")

In [None]:
# Load and use ONNX model
session = rt.InferenceSession(str(onnx_path))

# Get input name (required for ONNX Runtime)
input_name = session.get_inputs()[0].name

# Make predictions
# The model was converted with FloatTensorType, so inputs must be cast to float32.
# The first output holds the predicted labels.
onnx_predictions = session.run(None, {input_name: X_test.astype(np.float32)})[0]

# Calculate accuracy
onnx_accuracy = accuracy_score(y_test, onnx_predictions)

print(f"✓ ONNX model predictions")
print(f"  Accuracy: {onnx_accuracy:.4f}")
print(f"  Same as original: {onnx_accuracy == accuracy}")
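
Because the ONNX session receives float32 inputs while scikit-learn predicted on float64, a few borderline samples may in principle receive different labels. Comparing predictions row by row is more informative than comparing two accuracy values. The sketch below uses a hypothetical helper, `compare_predictions`, with made-up label arrays standing in for the scikit-learn and ONNX Runtime outputs.

```python
import numpy as np

def compare_predictions(reference, candidate):
    """Return the fraction of matching labels and the indices that differ."""
    reference = np.asarray(reference)
    candidate = np.asarray(candidate)
    if reference.shape != candidate.shape:
        raise ValueError("Prediction arrays must have the same shape")
    mismatches = np.flatnonzero(reference != candidate)
    agreement = 1.0 - mismatches.size / reference.size
    return agreement, mismatches

# float64 -> float32 loses precision, which is why borderline samples can flip
x64 = np.array([0.1, 1.0 / 3.0, 1234.56789])
x32 = x64.astype(np.float32)
max_cast_error = np.max(np.abs(x64 - x32.astype(np.float64)))

# Hypothetical label arrays standing in for sklearn and ONNX Runtime outputs
agreement, mismatches = compare_predictions([0, 1, 1, 0], [0, 1, 0, 0])

print(f"Max cast error: {max_cast_error:.2e}")
print(f"Agreement: {agreement:.2%}, differing rows: {mismatches.tolist()}")
```

In the notebook, the same helper could be called with `y_pred` and `onnx_predictions` to list any rows where the two runtimes disagree.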

### ONNX Pros and Cons

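**Pros**:
- Cross-platform and cross-language
- Runtimes such as ONNX Runtime are built for inference and can speed up predictions, depending on the model and hardware
- Reduces dependence on the Python and scikit-learn versions used for training
- Supports GPU acceleration through the runtime

**Cons**:
- Not all sklearn models supported
- Conversion adds complexity
- File size can be larger than a compressed joblib file
- Requires an ONNX-compatible runtime (such as ONNX Runtime) for inference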

## 8. Model Size Optimization

Smaller models mean faster loading and less storage/bandwidth cost.

In [None]:
# Train models with different complexity
model_configs = [
    {"name": "small", "n_estimators": 10, "max_depth": 5},
    {"name": "medium", "n_estimators": 50, "max_depth": 10},
    {"name": "large", "n_estimators": 200, "max_depth": 20},
]

size_results = []

for config in model_configs:
    # Train model
    model = RandomForestClassifier(
        n_estimators=config["n_estimators"],
        max_depth=config["max_depth"],
        random_state=42
    )
    model.fit(X_train, y_train)
    
    # Evaluate (use a separate name so rf_model's earlier `accuracy` is not overwritten)
    model_accuracy = model.score(X_test, y_test)
    
    # Save and measure size
    model_path = models_dir / f"rf_{config['name']}.joblib"
    joblib.dump(model, model_path, compress=3)
    file_size = model_path.stat().st_size / 1024
    
    size_results.append({
        'model': config['name'],
        'n_estimators': config['n_estimators'],
        'max_depth': config['max_depth'],
        'accuracy': model_accuracy,
        'size_kb': file_size
    })

size_df = pd.DataFrame(size_results)
print("Model Size vs Accuracy Trade-off:")
print(size_df.to_string(index=False))

In [None]:
# Visualize size vs accuracy trade-off
fig, ax = plt.subplots(figsize=(10, 6))

scatter = ax.scatter(
    size_df['size_kb'], 
    size_df['accuracy'],
    s=200,
    c=range(len(size_df)),
    cmap='viridis',
    edgecolors='black',
    linewidth=2,
    alpha=0.7
)

# Add labels for each point
for idx, row in size_df.iterrows():
    ax.annotate(
        row['model'],
        (row['size_kb'], row['accuracy']),
        xytext=(10, 5),
        textcoords='offset points',
        fontweight='bold'
    )

ax.set_xlabel('Model Size (KB)', fontweight='bold', fontsize=12)
ax.set_ylabel('Accuracy', fontweight='bold', fontsize=12)
ax.set_title('Model Size vs Accuracy Trade-off', fontweight='bold', fontsize=14)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
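
For a random forest, file size is driven largely by how many tree nodes have to be stored. As a rough sketch (the `total_nodes` helper and the small demo dataset are hypothetical), the fitted trees can be inspected directly through each estimator's `tree_.node_count`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small hypothetical dataset, separate from the notebook's X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

def total_nodes(forest):
    """Sum the node counts of every fitted tree in the forest."""
    return sum(est.tree_.node_count for est in forest.estimators_)

shallow = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
shallow.fit(X_demo, y_demo)

deep = RandomForestClassifier(n_estimators=20, max_depth=None, random_state=0)
deep.fit(X_demo, y_demo)

shallow_nodes = total_nodes(shallow)
deep_nodes = total_nodes(deep)

print(f"Shallow forest: {shallow_nodes} nodes")
print(f"Deep forest:    {deep_nodes} nodes")
```

Limiting `n_estimators` and `max_depth` caps the node count, which is why the "small" configuration above produces a much smaller file.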

## 9. Version Compatibility Best Practices

Avoid the "it works on my machine" problem by tracking versions.

In [None]:
import json
import sys
import sklearn

# Create metadata to save with model
model_metadata = {
    "model_type": "RandomForestClassifier",
    "model_params": rf_model.get_params(),
    # Recomputed here because `accuracy` may have been reassigned in an earlier cell
    "test_accuracy": float(rf_model.score(X_test, y_test)),
    "python_version": sys.version,
    "sklearn_version": sklearn.__version__,
    "numpy_version": np.__version__,
    "training_date": pd.Timestamp.now().isoformat(),
    "n_features": X_train.shape[1],
    "n_classes": len(np.unique(y_train))
}

# Save metadata alongside model
metadata_path = models_dir / "rf_model_metadata.json"
with open(metadata_path, 'w') as f:
    json.dump(model_metadata, f, indent=2)

print("✓ Model metadata saved")
print(f"\nMetadata:")
for key, value in model_metadata.items():
    if key != 'model_params':  # Skip params for brevity
        print(f"  {key}: {value}")
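
Saved metadata is most useful when it is checked at load time. The sketch below shows one possible check; `find_version_mismatches` is a hypothetical helper, and the version strings are made-up values standing in for the contents of the metadata file and the current environment.

```python
import json
import platform

def find_version_mismatches(saved_metadata, current_versions):
    """Return {key: (saved, current)} for every tracked version that differs."""
    return {
        key: (saved_metadata.get(key), current)
        for key, current in current_versions.items()
        if saved_metadata.get(key) != current
    }

# Hypothetical metadata, shaped like the JSON written alongside the model
saved_metadata = json.loads(json.dumps({
    "python_version": platform.python_version(),
    "sklearn_version": "1.3.0",
}))

# In a real loader these would come from the running environment,
# for example sklearn.__version__; the sklearn value here is made up.
current_versions = {
    "python_version": platform.python_version(),
    "sklearn_version": "1.4.2",
}

mismatches = find_version_mismatches(saved_metadata, current_versions)
for key, (saved, current) in mismatches.items():
    print(f"Warning: {key} was {saved} at training time, now {current}")
```

Whether a mismatch should raise an error or only log a warning is a design choice; failing loudly is the safer default for production loaders.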

## 10. Exercises

Practice different serialization techniques.

### Exercise 1: Save and Load with Different Formats

Train a LogisticRegression model and save it using pickle, joblib, and ONNX.

**Requirements**:
1. Train a LogisticRegression model
2. Save it in all three formats
3. Compare file sizes
4. Verify all loaded models produce same predictions

In [None]:
# Exercise 1: Your code here

# YOUR CODE HERE

In [None]:
# Exercise 1 Solution

# Train model
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)
original_accuracy = lr_model.score(X_test, y_test)

print(f"✓ Logistic Regression trained (Accuracy: {original_accuracy:.4f})\n")

# Save with pickle
pickle_lr_path = models_dir / "lr_model.pkl"
with open(pickle_lr_path, 'wb') as f:
    pickle.dump(lr_model, f)
pickle_size = pickle_lr_path.stat().st_size / 1024

# Save with joblib
joblib_lr_path = models_dir / "lr_model.joblib"
joblib.dump(lr_model, joblib_lr_path)
joblib_size = joblib_lr_path.stat().st_size / 1024

# Save with ONNX
onnx_lr = convert_sklearn(lr_model, initial_types=initial_type)
onnx_lr_path = models_dir / "lr_model.onnx"
with open(onnx_lr_path, "wb") as f:
    f.write(onnx_lr.SerializeToString())
onnx_size = onnx_lr_path.stat().st_size / 1024

# Compare sizes
print("File Size Comparison:")
print(f"  Pickle: {pickle_size:.2f} KB")
print(f"  Joblib: {joblib_size:.2f} KB")
print(f"  ONNX: {onnx_size:.2f} KB")

# Verify predictions
with open(pickle_lr_path, 'rb') as f:
    pickle_model = pickle.load(f)
joblib_model = joblib.load(joblib_lr_path)
onnx_session = rt.InferenceSession(str(onnx_lr_path))
onnx_input_name = onnx_session.get_inputs()[0].name  # read from this session, not the RF one
onnx_pred = onnx_session.run(None, {onnx_input_name: X_test.astype(np.float32)})[0]

pickle_pred = pickle_model.predict(X_test)
all_match = (
    np.array_equal(pickle_pred, joblib_model.predict(X_test))
    and np.array_equal(pickle_pred, onnx_pred)
)
print(f"\n✓ All models produce same predictions: {all_match}")

### Exercise 2: Pipeline Serialization

Create a pipeline that combines a preprocessing step with a classifier, then serialize it.

**Requirements**:
1. Create pipeline: StandardScaler → LogisticRegression
2. Train on the dataset
3. Save the entire pipeline
4. Load and verify it works correctly

In [None]:
# Exercise 2 Solution

# Create pipeline
lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Train
lr_pipeline.fit(X_train, y_train)
pipeline_acc = lr_pipeline.score(X_test, y_test)

print(f"✓ Pipeline trained (Accuracy: {pipeline_acc:.4f})")

# Save
pipeline_lr_path = models_dir / "lr_pipeline.joblib"
joblib.dump(lr_pipeline, pipeline_lr_path)
print(f"✓ Pipeline saved ({pipeline_lr_path.stat().st_size / 1024:.2f} KB)")

# Load and verify
loaded_lr_pipeline = joblib.load(pipeline_lr_path)
loaded_acc = loaded_lr_pipeline.score(X_test, y_test)

print(f"✓ Pipeline loaded (Accuracy: {loaded_acc:.4f})")
print(f"  Pipeline steps: {[step[0] for step in loaded_lr_pipeline.steps]}")
print(f"  Accuracy preserved: {pipeline_acc == loaded_acc}")

### Exercise 3: Compression Comparison

Experiment with different compression levels and measure the impact.

**Requirements**:
1. Save the same model with compression levels 0, 3, 6, 9
2. Measure file sizes
3. Time how long it takes to load each
4. Create a visualization comparing size and load time

In [None]:
# Exercise 3 Solution

import time

compression_results = []

for level in [0, 3, 6, 9]:
    # Save with compression
    comp_path = models_dir / f"model_compress_{level}.joblib"
    joblib.dump(rf_model, comp_path, compress=level)
    
    # Measure size
    file_size = comp_path.stat().st_size / 1024
    
    # Measure load time (average of 10 runs); perf_counter is a monotonic, high-resolution timer
    load_times = []
    for _ in range(10):
        start = time.perf_counter()
        _ = joblib.load(comp_path)
        load_times.append(time.perf_counter() - start)
    
    avg_load_time = np.mean(load_times) * 1000  # Convert to ms
    
    compression_results.append({
        'level': level,
        'size_kb': file_size,
        'load_time_ms': avg_load_time
    })

comp_df = pd.DataFrame(compression_results)
print("Compression Analysis:")
print(comp_df.to_string(index=False))

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Size comparison
ax1.plot(comp_df['level'], comp_df['size_kb'], marker='o', linewidth=2, color='darkblue')
ax1.set_xlabel('Compression Level', fontweight='bold')
ax1.set_ylabel('File Size (KB)', fontweight='bold')
ax1.set_title('File Size vs Compression', fontweight='bold')
ax1.grid(alpha=0.3)

# Load time comparison
ax2.plot(comp_df['level'], comp_df['load_time_ms'], marker='o', linewidth=2, color='darkred')
ax2.set_xlabel('Compression Level', fontweight='bold')
ax2.set_ylabel('Load Time (ms)', fontweight='bold')
ax2.set_title('Load Time vs Compression', fontweight='bold')
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 11. Summary

### Key Takeaways

1. **Choose the right format**:
   - **pickle**: Quick prototyping (Python-only)
   - **joblib**: sklearn models (recommended)
   - **ONNX**: Cross-platform deployment

2. **Compression reduces file size**, at the cost of extra CPU time when saving and loading; measure the trade-off for your model (see Exercise 3)

3. **Always serialize the entire pipeline** (preprocessing + model) for production

4. **Track versions** (Python, libraries, training date) to avoid compatibility issues

5. **Model size matters** for deployment costs and latency

6. **Security**: Never unpickle untrusted data

### Decision Guide

**Use pickle/joblib when**:
- Deploying to Python environments only
- You control both training and inference environments
- Quick iteration is important

**Use ONNX when**:
- Deploying to non-Python environments
- Need cross-framework compatibility
- Performance optimization is critical
- Working with edge devices

### What's Next?

In **Module 04**, we'll learn about:
- **Creating ML APIs** with FastAPI
- **Request/response models** with Pydantic
- **Input validation** and error handling
- **Testing API endpoints**

## 12. Additional Resources

### Documentation
- **Joblib Persistence**: https://joblib.readthedocs.io/en/latest/persistence.html
- **sklearn Model Persistence**: https://scikit-learn.org/stable/model_persistence.html
- **ONNX**: https://onnx.ai/
- **skl2onnx**: https://onnx.ai/sklearn-onnx/

### Tutorials
- **ONNX Runtime**: https://onnxruntime.ai/
- **Model Deployment Best Practices**: https://ml-ops.org/

### Advanced Topics
- TensorFlow SavedModel format
- PyTorch TorchScript
- Model quantization for edge deployment
- Model serving with TensorFlow Serving