# Chapter 39: Model Serialization and Storage

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the importance of model serialization for deployment and reproducibility
- Compare different serialization formats (Pickle, Joblib, ONNX, SavedModel) and choose the right one for your use case
- Save and load scikit‑learn models using Pickle and Joblib
- Export TensorFlow/Keras models in the SavedModel format
- Convert models to ONNX for cross‑framework compatibility
- Implement model versioning to track changes over time
- Use a model registry (MLflow, Weights & Biases, custom) to organize and manage models
- Store model metadata (hyperparameters, performance metrics, training data hash) alongside the model
- Manage model artifacts (scalers, encoders, feature lists) as part of the model package
- Choose appropriate storage backends (local filesystem, cloud storage, databases)
- Implement backup and recovery strategies for model artifacts
- Address security considerations (encryption, access control) for stored models
- Adopt best practices for model serialization and storage in production

---

## **39.1 Introduction to Model Serialization**

After training a model, we need to save it to disk so that it can be loaded later for prediction without retraining. This process is called **serialization**; in Python, serializing with the `pickle` module is also called *pickling*. Proper serialization ensures that:

- The model can be deployed to a production environment.
- The model can be shared with other team members.
- The model can be versioned and archived for reproducibility.
- The model can be loaded quickly for inference.

For the NEPSE prediction system, we will have multiple models (one per stock, or a single model for all stocks), and we need to save them along with any preprocessing artifacts (scalers, feature lists) so that predictions can be made consistently.

---

## **39.2 Serialization Formats**

Different libraries use different serialization formats. Choosing the right format depends on the model type, the deployment environment, and compatibility requirements.

### **39.2.1 Pickle**

Python's built‑in `pickle` module can serialize almost any Python object, including scikit‑learn models. It is simple and widely used.

**Advantages:** Simple, no extra dependencies.
**Disadvantages:** Not secure (loading a file can execute arbitrary code), Python‑only, can be slow for large models, and may break across Python or library versions.

```python
import pickle
from sklearn.ensemble import RandomForestRegressor

# Train a model
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)

# Save with pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load later
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Make predictions
predictions = loaded_model.predict(X_test)
```

**Explanation:**  
The model is saved as a binary file. When loading, we must ensure the same environment (library versions) to avoid errors. Pickle is convenient for prototyping but has security implications – never load untrusted pickle files.
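
To make the security warning concrete, here is a minimal sketch of how a pickle payload can run code at load time. The `Payload` class is hypothetical and deliberately harmless: it evaluates an arithmetic expression, where a real attack could call any importable function instead.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle runs code when it is loaded."""

    def __reduce__(self):
        # pickle records "call eval('6 * 7')" instead of the object's state.
        return (eval, ("6 * 7",))


malicious_bytes = pickle.dumps(Payload())

# Loading calls eval(); the result is whatever the expression returns,
# not a Payload instance.
result = pickle.loads(malicious_bytes)
print(type(result).__name__, result)  # int 42
```

Nothing in the file itself marks it as dangerous, which is why the source of a pickle file has to be trusted before it is loaded.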

### **39.2.2 Joblib**

Joblib is part of the scikit‑learn ecosystem and is optimized for saving large NumPy arrays. It is often faster and more efficient than pickle for models with large arrays (e.g., random forests).

```python
import joblib

# Save
joblib.dump(model, 'model.joblib')

# Load
loaded_model = joblib.load('model.joblib')
```

**Explanation:**  
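Joblib is a commonly recommended way to save scikit‑learn models. It uses pickle under the hood, so the same security and version‑compatibility caveats apply, but it handles large NumPy arrays more efficiently and supports optional compression.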

### **39.2.3 TensorFlow SavedModel**

For TensorFlow/Keras models, the native format is **SavedModel**. It saves the model architecture, weights, and training configuration in a directory.

```python
import tensorflow as tf

# Assume we have a Keras model
model = tf.keras.Sequential([...])
model.compile(...)
model.fit(X_train, y_train)

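# Save in SavedModel format (TensorFlow 2.x with Keras 2).
# With Keras 3 (TensorFlow 2.16+), model.save() expects a .keras or .h5
# filename; use model.export('my_model') to write a SavedModel instead.
model.save('my_model')

# Load (Keras 2 behavior; Keras 3 load_model() does not read SavedModel directories)
loaded_model = tf.keras.models.load_model('my_model')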
```

**Explanation:**  
SavedModel is a directory containing `assets/`, `variables/`, and a `saved_model.pb` file. It is portable and can be used with TensorFlow Serving for production deployment.

### **39.2.4 ONNX (Open Neural Network Exchange)**

ONNX is an open format for representing machine learning models. It allows a model trained in one framework to be run somewhere else, for example a scikit‑learn model executed by ONNX Runtime or another specialized inference engine.

```python
import numpy as np  # needed for the float32 cast below
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Convert a scikit-learn model to ONNX
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onx = convert_sklearn(model, initial_types=initial_type)

# Save
with open("model.onnx", "wb") as f:
    f.write(onx.SerializeToString())

# Load and predict with ONNX runtime
import onnxruntime as ort
sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name
pred_onx = sess.run(None, {input_name: X_test.astype(np.float32)})[0]
```

**Explanation:**  
ONNX is useful when you need to deploy models in environments that may not support the original framework (e.g., mobile, edge devices). It also enables optimizations like quantization and graph transformations.

### **39.2.5 Comparison Table**

| Format   | Framework        | Use Case                          | Pros                          | Cons                          |
|----------|------------------|-----------------------------------|-------------------------------|-------------------------------|
| Pickle   | Python (any)     | Quick saving, prototyping         | Simple, built‑in              | Security risk, Python‑only    |
| Joblib   | scikit‑learn     | Large NumPy arrays                | Efficient, fast               | Python‑only                   |
| SavedModel | TensorFlow/Keras | TensorFlow models                 | Complete, TF Serving ready    | TensorFlow‑specific           |
| ONNX     | Cross‑framework  | Model interoperability            | Framework‑agnostic, optimized | Conversion complexity         |

For the NEPSE system, we will likely use Joblib for scikit‑learn models and SavedModel for neural networks. If we need to deploy on edge devices, we might convert to ONNX.

---

## **39.3 Model Versioning**

As we iterate on models, we need to keep track of different versions. Versioning helps with:

- Reproducing past results
- Rolling back to a previous version if a new model performs poorly
- A/B testing different models
- Auditing and compliance

A simple versioning scheme is to include a version number in the filename, e.g., `model_v1.2.joblib`. Better yet, use a model registry that stores metadata along with the model.

### **39.3.1 Manual Versioning**

```python
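import datetime
import os

import joblib

os.makedirs("models", exist_ok=True)  # joblib.dump does not create directories
version = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"models/rf_{version}.joblib"
joblib.dump(model, filename)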
```

This creates files like `rf_20250315_143022.joblib`. It's simple but does not store any metadata.
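
With timestamped filenames, picking the most recent model reduces to sorting names. The following is a minimal sketch; the helper name `latest_model_path` is hypothetical, and the demo uses empty placeholder files rather than real models.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_model_path(model_dir, prefix="rf_") -> Optional[Path]:
    """Return the newest `<prefix><timestamp>.joblib` file, or None if absent."""
    candidates = sorted(Path(model_dir).glob(f"{prefix}*.joblib"))
    # The %Y%m%d_%H%M%S pattern is zero-padded, so name order is time order.
    return candidates[-1] if candidates else None


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("20250314_090000", "20250315_143022", "20250101_235959"):
        (Path(tmp) / f"rf_{stamp}.joblib").touch()
    latest_name = latest_model_path(tmp).name
    empty_result = latest_model_path(tmp, prefix="xgb_")

print(latest_name)  # rf_20250315_143022.joblib
```

This only works because the timestamp format sorts lexicographically; a format such as `%d%m%Y` would not.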

### **39.3.2 Storing Metadata**

Along with the model, save a metadata file (JSON or YAML) containing:

- Model version
- Training date
- Hyperparameters
- Performance metrics (validation RMSE, etc.)
- Feature list
- Hash of training data (to verify exactly which dataset the model was trained on and to detect later changes to it)
- Git commit hash of the code

```python
import hashlib
import json

import pandas as pd

metadata = {
    'version': '1.2.0',
    'date': '2025-03-15',
    'model_type': 'RandomForest',
    'params': model.get_params(),
    'metrics': {'val_rmse': 0.85},
    'features': list(X_train.columns),
    'data_hash': hashlib.md5(pd.util.hash_pandas_object(X_train).values).hexdigest(),
    'git_commit': 'a1b2c3d4'
}

with open('models/rf_v1.2.0_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
```
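
One practical use of the stored feature list is to validate incoming data before prediction. The sketch below assumes the metadata layout shown above; the function name `check_features` and the column names are hypothetical.

```python
import json


def check_features(metadata, incoming_columns):
    """Raise ValueError unless the columns match the training features exactly."""
    expected = metadata["features"]
    if list(incoming_columns) != expected:
        missing = sorted(set(expected) - set(incoming_columns))
        unexpected = sorted(set(incoming_columns) - set(expected))
        raise ValueError(
            f"Feature mismatch: missing={missing}, unexpected={unexpected}, "
            "or the column order differs"
        )


# Hypothetical metadata, as it would look after json.load() on the saved file.
metadata = json.loads('{"version": "1.2.0", "features": ["close", "volume", "rsi_14"]}')

check_features(metadata, ["close", "volume", "rsi_14"])  # passes silently

try:
    check_features(metadata, ["close", "rsi_14"])
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```

The check compares order as well as names, since many models consume features positionally.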

---

## **39.4 Model Registry**

A model registry is a centralized system for storing, versioning, and managing models. It provides a UI/API to track experiments, compare models, and promote models to production.

### **39.4.1 MLflow**

MLflow is an open‑source platform for the machine learning lifecycle. It includes a model registry.

```python
import mlflow
import mlflow.sklearn

# Set tracking URI (e.g., local directory or database)
mlflow.set_tracking_uri("file:./mlruns")

# Start an experiment run
with mlflow.start_run():
    # Log parameters
    mlflow.log_params(model.get_params())
    
    # Log metrics
    mlflow.log_metric("rmse", rmse)
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    # Register the model
    mlflow.register_model(f"runs:/{mlflow.active_run().info.run_id}/model", "NEPSE_RandomForest")
```

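Later, once a registered version has been transitioned to a stage (Staging, Production, or Archived), you can load it by stage name. Note that recent MLflow releases deprecate stages in favor of model version aliases (URIs of the form `models:/<name>@<alias>`), so check which mechanism your MLflow version supports: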

```python
model = mlflow.pyfunc.load_model(model_uri="models:/NEPSE_RandomForest/Production")
```

**Explanation:**  
MLflow tracks runs, logs parameters and metrics, and stores models. The registry allows promoting models through stages.

### **39.4.2 Weights & Biases (W&B)**

W&B is another popular platform that includes model versioning.

```python
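import wandb

run = wandb.init(project="nepse-forecast")
wandb.config.update(model.get_params())
wandb.log({"rmse": rmse})

# Log the saved file as a versioned artifact (the artifact name is illustrative).
# wandb.save() alone only uploads the file to the run.
artifact = wandb.Artifact("nepse-random-forest", type="model")
artifact.add_file("model.joblib")
run.log_artifact(artifact)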
```

### **39.4.3 Custom Registry**

If you cannot use third‑party tools, you can build a simple registry using a database and file storage. For example, store model metadata in a SQLite database and model files in a structured directory.

```python
import sqlite3
import json

conn = sqlite3.connect('model_registry.db')
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS models
             (id INTEGER PRIMARY KEY, name TEXT, version TEXT, 
              path TEXT, metrics TEXT, created_at TIMESTAMP)''')

def register_model(name, version, path, metrics):
    c.execute("INSERT INTO models (name, version, path, metrics, created_at) VALUES (?,?,?,?, datetime('now'))",
              (name, version, path, json.dumps(metrics)))
    conn.commit()
```

This gives you full control but requires more maintenance.
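
A registry is only useful if models can be looked up again. The sketch below pairs the table above with a lookup that returns the entry with the lowest value of a chosen metric. It uses an in-memory database so that it runs on its own; `get_best_model`, the metric name `val_rmse`, and the sample rows are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute(
    """CREATE TABLE IF NOT EXISTS models
       (id INTEGER PRIMARY KEY, name TEXT, version TEXT,
        path TEXT, metrics TEXT, created_at TIMESTAMP)"""
)


def register_model(name, version, path, metrics):
    conn.execute(
        "INSERT INTO models (name, version, path, metrics, created_at) "
        "VALUES (?, ?, ?, ?, datetime('now'))",
        (name, version, path, json.dumps(metrics)),
    )
    conn.commit()


def get_best_model(name, metric="val_rmse"):
    """Return (version, path, metrics) with the lowest value of `metric`."""
    rows = conn.execute(
        "SELECT version, path, metrics FROM models WHERE name = ?", (name,)
    ).fetchall()
    if not rows:
        return None
    parsed = [(v, p, json.loads(m)) for v, p, m in rows]
    return min(parsed, key=lambda row: row[2][metric])


register_model("rf_NEPSE", "1.0.0", "models/NEPSE/v1/model.joblib", {"val_rmse": 0.91})
register_model("rf_NEPSE", "1.1.0", "models/NEPSE/v2/model.joblib", {"val_rmse": 0.85})

best_version, best_path, best_metrics = get_best_model("rf_NEPSE")
print(best_version, best_path)  # 1.1.0 models/NEPSE/v2/model.joblib
```

Storing metrics as a JSON string keeps the schema simple, at the cost of filtering in Python rather than in SQL.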

---

## **39.5 Artifact Management**

Models often depend on artifacts like scalers, encoders, and feature lists. These must be saved alongside the model to ensure consistent preprocessing.

### **39.5.1 Saving Preprocessing Objects**

```python
import json

import joblib
from sklearn.preprocessing import StandardScaler

# Assume we have a scaler fitted on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Save scaler
joblib.dump(scaler, 'models/scaler.joblib')

# Save feature list (could be just the column names)
feature_list = X_train.columns.tolist()
with open('models/features.json', 'w') as f:
    json.dump(feature_list, f)
```

### **39.5.2 Packaging Everything Together**

You can create a single archive (e.g., ZIP) containing the model, scaler, and metadata.

```python
import zipfile

with zipfile.ZipFile('model_package.zip', 'w') as z:
    z.write('models/model.joblib', 'model.joblib')
    z.write('models/scaler.joblib', 'scaler.joblib')
    z.write('models/features.json', 'features.json')
    z.write('models/metadata.json', 'metadata.json')
```

Then, in production, unzip and load each component.
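
The loading side might look like the sketch below. To stay self-contained it first builds a stand-in package holding only the JSON members; a real package would also contain the `.joblib` files, which would be loaded with `joblib.load` after extraction. The set of required member names is an assumption.

```python
import json
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)

    # Build a stand-in package; a real one would also hold the .joblib files.
    package_path = tmp / "model_package.zip"
    with zipfile.ZipFile(package_path, "w") as z:
        z.writestr("features.json", json.dumps(["close", "volume"]))
        z.writestr("metadata.json", json.dumps({"version": "1.2.0"}))

    # Production side: check the contents first, then extract and load.
    required = {"features.json", "metadata.json"}
    with zipfile.ZipFile(package_path) as z:
        missing = required - set(z.namelist())
        if missing:
            raise FileNotFoundError(f"Package is missing: {sorted(missing)}")
        z.extractall(tmp / "unpacked")

    features = json.loads((tmp / "unpacked" / "features.json").read_text())
    metadata = json.loads((tmp / "unpacked" / "metadata.json").read_text())

print(features, metadata["version"])  # ['close', 'volume'] 1.2.0
```

Checking for required members before extraction turns an incomplete package into an immediate, readable error instead of a failure later during prediction.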

### **39.5.3 Using a Pipeline Object**

Scikit‑learn's `Pipeline` can encapsulate preprocessing and the model. Saving the pipeline saves everything in one object.

```python
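import joblib
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler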

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor())
])

pipeline.fit(X_train, y_train)
joblib.dump(pipeline, 'pipeline.joblib')

# Later
loaded_pipeline = joblib.load('pipeline.joblib')
predictions = loaded_pipeline.predict(X_test)
```

**Explanation:**  
The pipeline ensures that the same scaling is applied during inference. This is the recommended approach for scikit‑learn.

---

## **39.6 Storage Backends**

Where you store your models depends on your infrastructure.

### **39.6.1 Local Filesystem**

The simplest option, but it does not scale and is not shared across servers. Use it for development.

### **39.6.2 Network File System (NFS)**

Shared storage accessible by multiple servers. Good for small teams.

### **39.6.3 Cloud Storage**

- **AWS S3:** Scalable, durable, accessible from anywhere.
- **Google Cloud Storage**
- **Azure Blob Storage**

```python
import boto3

s3 = boto3.client('s3')
s3.upload_file('model.joblib', 'my-bucket', 'models/nepse/model.joblib')

# To download
s3.download_file('my-bucket', 'models/nepse/model.joblib', 'model.joblib')
```

### **39.6.4 Database BLOBs**

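You can store models as binary large objects (BLOBs) in a database. This brings models under the database's existing backup and transaction handling, but reading and writing large models can be slower than with file or object storage.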

```python
# Store model in PostgreSQL
import psycopg2
import pickle

conn = psycopg2.connect(...)
cur = conn.cursor()
model_binary = pickle.dumps(model)
# model_data is assumed to be a BYTEA column
cur.execute("INSERT INTO models (name, model_data) VALUES (%s, %s)",
            ('rf_v1', psycopg2.Binary(model_binary)))
conn.commit()
```
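
The snippet above covers only the write path. The sketch below shows a full round trip, using the standard library's `sqlite3` in place of PostgreSQL so that it runs without a server. The table schema is an assumption, and a plain dictionary stands in for a trained model.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY, model_data BLOB)")

# A plain dict stands in for a trained model object.
model = {"type": "RandomForest", "n_estimators": 100}

conn.execute(
    "INSERT INTO models (name, model_data) VALUES (?, ?)",
    ("rf_v1", pickle.dumps(model)),
)
conn.commit()

row = conn.execute(
    "SELECT model_data FROM models WHERE name = ?", ("rf_v1",)
).fetchone()

# Only unpickle BLOBs written by your own trusted pipeline.
restored = pickle.loads(row[0])

print(restored == model)  # True
```

The same pattern applies with `psycopg2`, with `%s` placeholders instead of `?`.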

---

## **39.7 Data Partitioning Strategies**

When you have many models (e.g., one per stock), you need an organizational strategy.

- **One file per model:** `models/rf_NEPSE.joblib`, `models/rf_HRL.joblib`, etc.
- **Subdirectories:** `models/NEPSE/v1/model.joblib`, `models/HRL/v1/model.joblib`
- **Database index:** Store model paths in a database keyed by symbol and version.

For the NEPSE system with many stocks, a directory structure like the following works well:

```
models/
├── NEPSE/
│   ├── v1/
│   │   ├── model.joblib
│   │   └── metadata.json
│   └── v2/
│       ├── model.joblib
│       └── metadata.json
├── HRL/
│   ├── v1/
│   │   ├── model.joblib
│   │   └── metadata.json
...
```

This keeps things organized and allows easy retrieval by symbol and version.
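
Retrieval by symbol and version can be sketched as a small helper that finds the newest version directory. The name `latest_version_dir` is hypothetical, and the demo creates empty directories in a temporary location.

```python
import tempfile
from pathlib import Path


def latest_version_dir(root, symbol):
    """Return the highest-numbered `v<N>` directory under `root/symbol`."""
    versions = [
        d for d in (Path(root) / symbol).glob("v*")
        if d.is_dir() and d.name[1:].isdigit()
    ]
    if not versions:
        raise FileNotFoundError(f"No versions found for {symbol}")
    # Compare numerically so that v10 sorts after v2.
    return max(versions, key=lambda d: int(d.name[1:]))


with tempfile.TemporaryDirectory() as root:
    for version in ("v1", "v2", "v10"):
        (Path(root) / "NEPSE" / version).mkdir(parents=True)
    latest = latest_version_dir(root, "NEPSE")
    model_path = latest / "model.joblib"  # where the model file would live
    latest_name = latest.name

print(latest_name)  # v10
```

A plain string sort would place `v10` before `v2`, so the version number is parsed as an integer before comparison.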

---

## **39.8 Data Archival and Retention**

Not all models need to be kept forever. Define a retention policy:

- Keep all models for a certain period (e.g., 1 year).
- Archive older models to cheaper storage (e.g., S3 Glacier).
- Delete models that are no longer needed.

Automate this with scripts or lifecycle policies on cloud storage.
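
For storage you manage yourself, a retention script can start by listing the files that fall outside the retention window. The sketch below only identifies candidates; whether they are archived or deleted is left to the caller. The function name and the one-year threshold are assumptions.

```python
import os
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_expired(model_root, max_age_days, now=None):
    """Return model files whose modification time is older than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        p for p in Path(model_root).rglob("*.joblib") if p.stat().st_mtime < cutoff
    )


with tempfile.TemporaryDirectory() as root:
    old_file = Path(root) / "NEPSE" / "v1" / "model.joblib"
    new_file = Path(root) / "NEPSE" / "v2" / "model.joblib"
    for f in (old_file, new_file):
        f.parent.mkdir(parents=True)
        f.touch()

    # Backdate v1 by 400 days so it falls outside a one-year window.
    backdated = time.time() - 400 * SECONDS_PER_DAY
    os.utime(old_file, (backdated, backdated))

    expired = [p.relative_to(root).as_posix() for p in find_expired(root, 365)]

print(expired)  # ['NEPSE/v1/model.joblib']
```

Modification time is a rough proxy for model age; a `created_at` field in the metadata or registry would be more reliable if files are ever copied or restored.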

---

## **39.9 Backup and Recovery**

Model files are critical; they represent the result of expensive training. Ensure they are backed up.

- Use version control (Git) for metadata, but not for large model files.
- Regularly back up the model storage location to another region or service.
- Test recovery by restoring from backup periodically.
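
A recovery test can be as simple as comparing a backup against the source, file by file. The sketch below uses a local copy as the "backup" for illustration; `verify_backup` is a hypothetical helper, and the byte strings stand in for real model files.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def verify_backup(source_dir, backup_dir):
    """Return relative paths that are missing from or differ in the backup."""
    problems = []
    for src in Path(source_dir).rglob("*"):
        if src.is_file():
            rel = src.relative_to(source_dir)
            dst = Path(backup_dir) / rel
            # shallow=False compares file contents, not just size and mtime.
            if not dst.is_file() or not filecmp.cmp(src, dst, shallow=False):
                problems.append(rel.as_posix())
    return sorted(problems)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "models"
    backup = Path(tmp) / "backup"
    (source / "NEPSE" / "v1").mkdir(parents=True)
    (source / "NEPSE" / "v1" / "model.joblib").write_bytes(b"model bytes")

    shutil.copytree(source, backup)
    clean_result = verify_backup(source, backup)

    # Simulate a corrupted backup copy.
    (backup / "NEPSE" / "v1" / "model.joblib").write_bytes(b"corrupted")
    corrupted_result = verify_backup(source, backup)

print(clean_result, corrupted_result)  # [] ['NEPSE/v1/model.joblib']
```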

---

## **39.10 Security Considerations**

### **39.10.1 Encryption**

- Encrypt model files at rest, especially if they contain sensitive information (e.g., trained on proprietary data).
- Use S3 server‑side encryption, or encrypt before uploading.

### **39.10.2 Access Control**

- Restrict access to model storage using IAM roles/permissions.
- Use separate buckets/directories for different environments (dev, staging, prod).

### **39.10.3 Model Theft Prevention**

- Models are intellectual property. Limit download access.
- Consider watermarking or encryption if models are distributed to untrusted environments.

### **39.10.4 Secure Deserialization**

- Never load pickle files from untrusted sources. Use safer formats (ONNX, SavedModel) when possible.
- If you must use pickle, ensure the source is trusted and files are integrity‑checked (e.g., using checksums).
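
An integrity check before loading can be sketched with SHA‑256. The helper names are hypothetical, and the byte strings stand in for a real serialized model.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large models need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Raise ValueError if the file's digest does not match the recorded one."""
    actual = sha256_of_file(path)
    if actual != expected_hex:
        raise ValueError(f"Checksum mismatch for {path}: got {actual}")


with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.joblib"
    model_file.write_bytes(b"stand-in for serialized model bytes")

    recorded = sha256_of_file(model_file)  # store this at training time
    verify_checksum(model_file, recorded)  # passes: safe to proceed to load

    model_file.write_bytes(b"tampered")    # simulate modification
    try:
        verify_checksum(model_file, recorded)
        tamper_detected = False
    except ValueError:
        tamper_detected = True

print(tamper_detected)  # True
```

A checksum only helps if the recorded value is kept somewhere an attacker cannot also modify, such as the model registry or a separate metadata store.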

---

## **39.11 Best Practices**

1. **Always save preprocessing artifacts** with the model.
2. **Use pipelines** to encapsulate the entire preprocessing and modeling steps.
3. **Version your models** with semantic versioning or timestamps.
4. **Store metadata** (params, metrics, date) alongside the model.
5. **Use a model registry** for tracking and promoting models.
6. **Back up models** regularly.
7. **Secure model storage** with encryption and access controls.
8. **Test model loading** in a fresh environment to ensure no missing dependencies.
9. **Document the storage structure** in a README.

---

## **39.12 Chapter Summary**

In this chapter, we covered the critical aspects of model serialization and storage for the NEPSE prediction system.

- **Serialization formats:** Pickle (simple), Joblib (efficient for scikit‑learn), SavedModel (TensorFlow), ONNX (cross‑framework).
- **Model versioning:** Use version numbers, timestamps, and metadata to track changes.
- **Model registry:** Tools like MLflow and W&B help organize models and promote them through stages.
- **Artifact management:** Save scalers, encoders, and feature lists alongside the model, preferably in a Pipeline.
- **Storage backends:** Local, network, cloud (S3), or databases.
- **Organizational strategies:** Directory structures for multiple models.
- **Backup and security:** Encrypt, control access, and test recovery.

### **Practical Takeaways for the NEPSE System:**

- For scikit‑learn models, use Joblib to save `Pipeline` objects containing the scaler and model.
- For neural networks, use TensorFlow's SavedModel format.
- Store models in a structured directory: `models/{symbol}/{version}/`.
- Maintain a metadata JSON file in each model directory.
- Use a simple SQLite database as a lightweight model registry if MLflow is overkill.
- Back up the models directory to cloud storage regularly.
- Set up IAM roles to restrict access in production.

With these practices, your models are safely stored, versioned, and ready for deployment. In the next chapter, **Chapter 40: Building Prediction Services**, we will explore how to expose these models via REST APIs and other service interfaces.

---

**End of Chapter 39**

