# `mlarena.utils.io_utils` Demo

This notebook demonstrates the utilities available in the `mlarena.utils.io_utils` module.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

import mlarena.utils.io_utils as iou
from mlarena import PreProcessor, MLPipeline

## 1. `save_object` and `load_object`

The `save_object` and `load_object` functions provide a convenient way to save and load Python objects to/from disk. These functions are particularly useful for:

- Saving trained machine learning models for later use
- Storing intermediate results or processed data
- Archiving complex Python objects (dictionaries, DataFrames, etc.)
- Sharing data between different Python sessions or scripts

The functions support two backends:
- `pickle`: Python's built-in serialization format
- `joblib`: Optimized for large numerical data and scientific computing objects
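
For orientation, the round trip that a pickle backend presumably wraps can be sketched with the standard library alone. This is a minimal sketch, not the `mlarena` implementation: the helper names `dump_with_pickle` and `load_with_pickle` are hypothetical. A joblib backend would call `joblib.dump` and `joblib.load` in the same positions.

```python
import pickle
import tempfile
from pathlib import Path


def dump_with_pickle(obj, path):
    """Serialize obj to path with the standard pickle module (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return path


def load_with_pickle(path):
    """Read back an object written by dump_with_pickle (hypothetical helper)."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Round trip inside a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    saved = dump_with_pickle({"values": [1, 2, 3]}, Path(tmp) / "demo" / "obj.pkl")
    restored = load_with_pickle(saved)

print(restored == {"values": [1, 2, 3]})
```

As with any pickle-based format, files should only be loaded from sources you trust, because unpickling can execute arbitrary code.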

Key features include:
- Streamlined backend handling:
    - With `save_object`: the file extension is set automatically based on the backend
    - With `load_object`: the backend is detected automatically
- Optional date stamping of saved filenames
- Compression support for the joblib backend
- Simple and consistent interface for both saving and loading
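
The filename and backend handling listed above can be illustrated with a small sketch. It assumes that the backend is inferred from the file extension and that the date stamp uses the `YYYY-MM-DD` form seen in this notebook's output; `mlarena` may implement either differently. The names `EXTENSIONS`, `build_path`, and `detect_backend` are hypothetical and not part of the library.

```python
from datetime import date
from pathlib import Path

# Assumed mapping between backend names and file extensions,
# chosen to match the filenames shown in this notebook's output
EXTENSIONS = {"pickle": ".pkl", "joblib": ".joblib"}


def build_path(directory, basename, backend="pickle", use_date=False, today=None):
    """Compose a path such as demo_outputs/data_dict_2025-05-29.pkl (hypothetical helper)."""
    if backend not in EXTENSIONS:
        raise ValueError(f"Unsupported backend: {backend!r}")
    stem = basename
    if use_date:
        stamp = (today or date.today()).isoformat()  # YYYY-MM-DD
        stem = f"{basename}_{stamp}"
    return Path(directory) / f"{stem}{EXTENSIONS[backend]}"


def detect_backend(path):
    """Infer the backend from the file extension (hypothetical helper)."""
    suffix = Path(path).suffix
    for backend, ext in EXTENSIONS.items():
        if suffix == ext:
            return backend
    raise ValueError(f"Cannot infer backend from extension {suffix!r}")


example = build_path("demo_outputs", "sample_df", backend="joblib",
                     use_date=True, today=date(2025, 5, 29))
print(example.name)             # sample_df_2025-05-29.joblib
print(detect_backend(example))  # joblib
```

Keeping the extension mapping in one dictionary means the save and load sides cannot drift apart when a backend is added.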



### 1.1 Save & load a dictionary with the pickle backend

In [2]:
# Create a sample dictionary with mixed data types
data_dict = {
    "name": "example",
    "values": [1, 2, 3, 4, 5],
    "active": True,
    "metadata": {
        "version": "1.0",
        "timestamp": "2024-03-20"
    }
}

# Save the dictionary using pickle backend (default)
filepath1 = iou.save_object(
    data_dict,
    directory="demo_outputs",
    basename="data_dict",
    use_date=True  # Include date in filename
)

# Load the saved dictionary
data_dict_retrieved = iou.load_object(filepath1)

# Verify data integrity
print("Data integrity check:", data_dict == data_dict_retrieved)

Object saved to demo_outputs\data_dict_2025-05-29.pkl
Object loaded from demo_outputs\data_dict_2025-05-29.pkl
Data integrity check: True


### 1.2 Save & load a pandas DataFrame with the joblib backend

In [3]:
# Create sample data
dates = pd.date_range(start='2024-01-01', periods=5)
df = pd.DataFrame({
    'date': dates,
    'category': ['A', 'B', 'C', 'A', 'B'],
    'value': np.random.randn(5),
    'count': np.random.randint(1, 100, 5),
    'is_active': [True, False, True, True, False]
})

# Save the DataFrame using joblib backend with compression
filepath2 = iou.save_object(
    df,
    directory="demo_outputs",
    basename="sample_df",
    backend="joblib",  # Use joblib backend
    compress=True,     # Enable compression
    use_date=True      # Include date in filename
)

# Load the saved DataFrame, the backend is detected automatically
df_retrieved = iou.load_object(filepath2)

# Verify data integrity
print("\nData integrity check:", df.equals(df_retrieved))


Object saved to demo_outputs\sample_df_2025-05-29.joblib
Object loaded from demo_outputs\sample_df_2025-05-29.joblib

Data integrity check: True


### 1.3 Save & load an `MLPipeline` instance

In [4]:

# Load the Titanic dataset from OpenML as pandas objects
titanic = fetch_openml('titanic', version=1, as_frame=True)
X = titanic.data
y = titanic.target.astype(int)
# Drop columns that are not used as features in this demo
X = X.drop(['boat', 'body', 'home.dest', 'ticket', 'cabin', 'name'], axis=1)
# Prepare the features with PreProcessor.mlflow_input_prep
X = PreProcessor.mlflow_input_prep(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ml_pipeline = MLPipeline(
    model=RandomForestClassifier(),
    preprocessor=PreProcessor()
)
# Fit the pipeline on the training split
ml_pipeline.fit(X_train, y_train)

# Save the ML pipeline using joblib backend with compression
pipeline_filepath = iou.save_object(
    ml_pipeline,
    directory="demo_outputs",
    basename="titanic_pipeline",
    backend="joblib",
    compress=True,
    use_date=True
)

# Load the saved pipeline
pipeline_retrieved = iou.load_object(pipeline_filepath)

# Verify pipeline integrity by comparing predictions
original_preds = ml_pipeline.predict(model_input=X_test, context=None)
retrieved_preds = pipeline_retrieved.predict(model_input=X_test, context=None)

print("\nPipeline integrity check:", np.array_equal(original_preds, retrieved_preds))



Object saved to demo_outputs\titanic_pipeline_2025-05-29.joblib
Object loaded from demo_outputs\titanic_pipeline_2025-05-29.joblib

Pipeline integrity check: True
