# __Artifacts, Logging, and Reproducible Workflows__

![Snapshot Preview](../assets/images/snapshot_wallpaper.png)

## __1. Introduction to Snapshots__

### __(a). The Challenge of Reproducibility in AI__

Building an AI pipeline is an iterative process. You start with some implementation—maybe a data preprocessing pipeline and an initial model. You run experiments, evaluate results, tweak hyperparameters, try a different model architecture, retrain, evaluate again, and repeat. This cycle continues until you reach a working solution.

But as your project evolves through these iterations, keeping track of what you've done and what outputs you've generated becomes messy. Here's why:

*   **Messy Files Everywhere**: Scripts, outputs, and data files pile up with each iteration. You lose track of what's important and what state your project is in.
*   **No Version Control for Results**: Code is usually version-controlled with a system like Git, but data outputs, models, and visualizations rarely are. It quickly becomes unclear which script, with which configuration, produced which result.
*   **Can't Recreate Past Experiments**: Want to rerun an experiment from last week? You're now hunting through different config files, trying to remember which hyperparameters you used, which data preprocessing steps were active, and what model checkpoint you started from. Without careful tracking, it's nearly impossible to know.
*   **Inconsistent Logs**: Everyone logs differently (or not at all), making it hard to debug issues or understand what happened during a run.

These problems get worse with each iteration. What starts as a minor inconvenience becomes a major blocker when you need to compare results across experiments, validate findings, or move from development to production.

### __(b). OpenCrate's Solution: The Snapshot API__

Here's what Snapshots give you:

*   **Auto-organized Outputs**: File paths and versions are handled automatically. Everything stays clean and traceable across iterations.
*   **Built-in Logging**: Every run gets logged automatically. No more scattered print statements.
*   **Easy Artifact Management**: Save and load any data type (CSVs, images, models, and more) with simple APIs. OpenCrate handles the messy details.
*   **Version Control for Results**: Back up important files before overwriting them. Experiment freely without fear of losing work.

OpenCrate handles the boring file management stuff so you can focus on your actual work. This guide will show you how to use it to build clean, reproducible pipelines.

## __2. Core Concepts: Snapshots__

### __(a). Understanding Snapshots__

A **Snapshot** is simply a dedicated folder for one execution run of your pipeline. Think of it like a Git commit, but for your results and configurations instead of code. It sounds almost too simple, and it is: a single folder holding your pipeline's configuration along with every output generated during that run addresses each of the pain points mentioned above.

What makes snapshots useful:

*   **Isolation**: Each snapshot is separate. Different experiments don't interfere with each other.
*   **Automatic Versioning**: Snapshots are numbered (v0, v1, v2...). Easy to track what changed when.
*   **Reproducibility**: Everything from a specific run is stored together. You can always go back to see exactly what you had.
*   **Flexibility**: Create snapshots at different stages of your workflow, such as data processing, model training, and evaluation.

### __(b). Initializing a Snapshot__

Use `oc.snapshot.setup()` to create or resume a snapshot. Here are the main parameters:

**`name` (str):** Give your snapshot series a unique name (e.g., "my_experiment").

**`start` (str or int):** Controls which snapshot version to use:
*   `"new"`: Create a fresh snapshot (increments version: v0 → v1 → v2...)
*   `"last"`: Resume from the most recent snapshot
*   `0`, `1`, `2`...: Resume from or create a specific version number

**`tag` (str, optional):** Add a label to the snapshot (e.g., v0:baseline, v1:feature-x). Useful for marking different experiments or configurations.

**`log_level` (str, optional):** How much detail to log. Options: "debug", "info", "warning", "error", "critical". Default is "info".
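To make the `start` options concrete, here is a minimal standard-library sketch of how a folder name could be chosen from `start` and `tag`. It is not OpenCrate's implementation: `version_dir_name`, `parse_version`, and `resolve_version` are hypothetical helpers, and the `v<N>:<tag>` naming is taken from the snapshot paths shown in this guide. The sketch only reports whether the computed folder already exists (which is why the tag has to match when resuming).

```python
from typing import List, Optional, Tuple, Union


def version_dir_name(version: int, tag: Optional[str] = None) -> str:
    """Build a folder name such as 'v0' or 'v0:initial-run'."""
    return f"v{version}:{tag}" if tag else f"v{version}"


def parse_version(dir_name: str) -> int:
    """Extract the integer version from 'v3' or 'v3:some-tag'."""
    return int(dir_name.split(":", 1)[0][1:])


def resolve_version(
    existing: List[str], start: Union[str, int], tag: Optional[str] = None
) -> Tuple[str, bool]:
    """Return (folder_name, resumed) for a given `start` value (hypothetical logic)."""
    versions = [parse_version(name) for name in existing]
    if start == "new":
        # Next free version number: v0 when nothing exists yet.
        version = max(versions) + 1 if versions else 0
    elif start == "last":
        if not versions:
            raise ValueError("there is no snapshot to resume yet")
        version = max(versions)
    elif isinstance(start, int):
        version = start
    else:
        raise ValueError(f"unsupported start value: {start!r}")
    name = version_dir_name(version, tag)
    # The run counts as "resumed" only if this exact folder (tag included) exists.
    return name, name in existing


existing_dirs = ["v0:initial-run"]
resume_result = resolve_version(existing_dirs, "last", tag="initial-run")
new_result = resolve_version(existing_dirs, "new", tag="major-update")
untagged_result = resolve_version(existing_dirs, "last")
```

In this sketch, resuming `"last"` without the tag yields the name `v0`, which does not match the existing `v0:initial-run` folder, so nothing is resumed.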

### __(c). Demonstrating Snapshot Initialization__

Let's create our first snapshot named `snapshot_guide` with the tag `initial-run`.


In [1]:
import opencrate as oc

In [2]:
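# Set up once so there is an active snapshot to reset, clear it, and then set up
# again so the guide starts from a clean v0.
oc.snapshot.setup(name="snapshot_guide", start="new", tag="initial-run")
oc.snapshot.reset(confirm=True)
oc.snapshot.setup(name="snapshot_guide", start="new", tag="initial-run")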

oc.info(
    f"Snapshot with version `{oc.snapshot.version}` and name `{oc.snapshot.version_name}` has been set up at: `{oc.snapshot.dir_path}`"
)
oc.io.show_files_in_dir("snapshots", depth=4)

INFO     Snapshot with version `0` and name `v0:initial-run` has been set up at: `snapshots/snapshot_guide/v0:initial-run`


As you can see, we've created the very first snapshot for our pipeline, named `snapshot_guide`, with version `v0` and tag `initial-run`.

### __(d). Resuming an Existing Snapshot__

Sometimes you need to continue work from where you left off. Use `oc.snapshot.setup()` with `start="last"` to pick up from the most recent snapshot, or `start=<version_number>` to target a specific version.

**Important:** If your snapshot has a tag, you must pass the same tag when resuming. Otherwise, OpenCrate creates a new snapshot without the tag instead of resuming the existing one.

In our example: use `start="last"` to resume the most recent snapshot, or `start=0` to specifically resume v0.


In [1]:
# Notebook is restarted here to simulate a fresh run.

import opencrate as oc

In [2]:
oc.snapshot.setup(name="snapshot_guide", start="last", tag="initial-run")
# Here start="last" works because v0 is the most recent snapshot.
# Alternatively, pass start=0 (an integer) to resume version v0 explicitly.

oc.info(
    f"Resumed Snapshot with version `{oc.snapshot.version}` and name `{oc.snapshot.version_name}` located at: `{oc.snapshot.dir_path}`"
)
oc.io.show_files_in_dir("snapshots", depth=4)

INFO     Resumed Snapshot with version `0` and name `v0:initial-run` located at: `snapshots/snapshot_guide/v0:initial-run`


Notice that two log files, `snapshot_guide.log` and `snapshot_guide.history.log`, were created automatically. We'll talk more about logging in a bit.

### __(e). Creating a New Snapshot Version__

When you hit a major milestone or want to save a stable state before making big changes, create a new snapshot version using `start="new"`. OpenCrate will bump the version number automatically (v0 → v1 → v2...).

**When to create a new version:**

*   **Before major changes**: Save a baseline before experimenting with new features or algorithms
*   **After important updates**: Document results from significant changes (new model architecture, different hyperparameters, etc.)
*   **For clean history**: Keep each major stage of development separate and easy to compare


In [1]:
# Notebook is restarted here to simulate a fresh run.

import opencrate as oc

In [2]:
oc.snapshot.setup(name="snapshot_guide", start="new", tag="major-update")

oc.info(
    f"New Snapshot version `{oc.snapshot.version}` with name `{oc.snapshot.version_name}` has been set up at: `{oc.snapshot.dir_path}`"
)
oc.io.show_files_in_dir("snapshots", depth=4)

INFO     New Snapshot version `1` with name `v1:major-update` has been set up at: `snapshots/snapshot_guide/v1:major-update`


## __3. Integrated Logging for Pipeline Observability__

### __(a). Why Logging Matters__

Good logging is essential for any serious project. Here's why:

*   **Debugging**: Logs show exactly when and where things went wrong
*   **Monitoring**: Track your pipeline's progress and performance
*   **Reproducibility**: Document what happened in each run so you can recreate or verify results
*   **Status Updates**: See what's happening during long-running jobs

Without good logs, you're flying blind. Debugging becomes guesswork and reproducing results becomes impossible.

### __(b). OpenCrate's Logging System__

When you create a snapshot, OpenCrate automatically sets up logging. You get two log files in your snapshot directory:

**`<name>.log`** (e.g., `snapshot_guide.log`): Logs from the **current run only**. Gets overwritten each time you run your pipeline. Perfect for checking what just happened.

**`<name>.history.log`** (e.g., `snapshot_guide.history.log`): **All logs from all runs**, appended over time. Your complete history for this snapshot version. Only created after you've run the pipeline more than once.

This gives you both a clean view of your latest run and a full history when you need it.
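The overwrite-versus-append behavior can be sketched with Python's standard `logging` module: one file handler opened in `"w"` mode and another in `"a"` mode. This is only an illustration of the idea, not OpenCrate's code. `open_run_logger` and `close_run_logger` are hypothetical helpers, and unlike OpenCrate (which creates the history file only after a second run), this sketch writes the history file from the first run.

```python
import logging
import os
import tempfile
from typing import List


def open_run_logger(log_dir: str, name: str) -> logging.Logger:
    """Create a logger that writes to <name>.log (overwrite) and <name>.history.log (append)."""
    logger = logging.Logger(name, level=logging.INFO)
    formatter = logging.Formatter(
        "%(asctime)s - %(levelname)-8s %(message)s", datefmt="%Y-%m-%d %H:%M:%S"
    )
    current = logging.FileHandler(
        os.path.join(log_dir, f"{name}.log"), mode="w", encoding="utf-8"
    )
    history = logging.FileHandler(
        os.path.join(log_dir, f"{name}.history.log"), mode="a", encoding="utf-8"
    )
    for handler in (current, history):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


def close_run_logger(logger: logging.Logger) -> None:
    """Flush and detach all handlers so the files are released."""
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)


def read_lines(path: str) -> List[str]:
    with open(path, encoding="utf-8") as file:
        return file.read().splitlines()


with tempfile.TemporaryDirectory() as log_dir:
    # Simulate two separate runs against the same snapshot folder.
    for message in ("first run", "second run"):
        run_logger = open_run_logger(log_dir, "demo")
        run_logger.info(message)
        close_run_logger(run_logger)
    current_lines = read_lines(os.path.join(log_dir, "demo.log"))
    history_lines = read_lines(os.path.join(log_dir, "demo.history.log"))
```

After the two simulated runs, `current_lines` holds only the second run's message, while `history_lines` holds both.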

### __(c). Logging Levels and Usage__

OpenCrate provides simple logging functions for different situations:

*   `oc.info()`: General updates about what's happening
*   `oc.debug()`: Detailed info for troubleshooting (usually filtered out in production)
*   `oc.warning()`: Something's off but not broken
*   `oc.error()`: Something failed in a specific task
*   `oc.critical()`: Major failure, pipeline might crash
*   `oc.success()`: Confirm an important step completed
*   `oc.exception()`: Use in `try...except` blocks to log full error details with traceback

Just pass a string to any of these functions and OpenCrate handles the rest.
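If you're wondering how a level such as "info" filters out debug messages, or how a `SUCCESS` level can sit alongside the standard ones, the following sketch uses only Python's standard `logging` module. It is an analogy, not OpenCrate's internals: the numeric value `25` for `SUCCESS` and the `make_logger` helper are assumptions made for illustration.

```python
import io
import logging

# Hypothetical numeric value, placed between INFO (20) and WARNING (30).
SUCCESS = 25
logging.addLevelName(SUCCESS, "SUCCESS")


def make_logger(level_name: str, stream: io.StringIO) -> logging.Logger:
    """Build a standalone logger that writes '<LEVEL> <message>' lines to `stream`."""
    logger = logging.Logger("sketch")
    logger.setLevel(level_name.upper())
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)-8s %(message)s"))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
sketch_logger = make_logger("info", buffer)

sketch_logger.debug("hidden at info level")  # below the threshold, so dropped
sketch_logger.info("visible")
sketch_logger.log(SUCCESS, "step finished")

try:
    1 / 0
except ZeroDivisionError:
    # Logs at ERROR level and appends the full traceback.
    sketch_logger.exception("caught a division by zero")

output = buffer.getvalue()
```

With the level set to "info", the debug line never reaches `output`, while the exception entry carries the traceback text.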


In [3]:
oc.info("This is an informational message from the current run.")
oc.debug("Detailed debug information for troubleshooting.")
oc.warning("A potential issue detected, but execution continues.")
oc.error("An error occurred, affecting a part of the pipeline.")
oc.critical("Critical failure: pipeline likely to terminate.")
oc.success("Important step completed successfully!")

try:
    # Simulate an error
    result = 10 / 0
except ZeroDivisionError:
    oc.exception("Caught a division by zero error.")

oc.info("All log messages have been dispatched.")


INFO     This is an informational message from the current run.
ERROR    An error occurred, affecting a part of the pipeline.
CRITICAL Critical failure: pipeline likely to terminate.
SUCCESS  Important step completed successfully!
ERROR    Caught a division by zero error.
Traceback (most recent call last):

  File "/tmp/ipykernel_351658/1530191759.py", line 10, in <module>
    result = 10 / 0

ZeroDivisionError: division by zero
INFO     All log messages have been dispatched.


### __(d). Demonstrating Logging and Log File Analysis__

Let's check the log files OpenCrate created. Notice:

1. The `v0:initial-run` snapshot has two log files: `snapshot_guide.log` (latest run only) and `snapshot_guide.history.log` (all runs, including the latest).
2. The `v1:major-update` snapshot only has `snapshot_guide.log` because it's only been run once.


In [4]:
oc.io.show_files_in_dir("snapshots", depth=4)

Let's compare the logs for `v0:initial-run` to see the difference.


In [5]:
!cat snapshots/snapshot_guide/v0:initial-run/snapshot_guide.log

2025-11-16 11:44:43 - INFO     Resumed Snapshot with version `0` and name `v0:initial-run` located at: `snapshots/snapshot_guide/v0:initial-run`


In [6]:
!cat snapshots/snapshot_guide/v0:initial-run/snapshot_guide.history.log

2025-11-16 11:44:23 - INFO     Snapshot with version `0` and name `v0:initial-run` has been set up at: `snapshots/snapshot_guide/v0:initial-run`
2025-11-16 11:44:43 - INFO     Resumed Snapshot with version `0` and name `v0:initial-run` located at: `snapshots/snapshot_guide/v0:initial-run`


As expected: `snapshot_guide.history.log` has logs from both our current and previous runs, while `snapshot_guide.log` only has the current run. Every time you resume a snapshot, `.log` gets overwritten with fresh logs, while `.history.log` keeps growing with the full timeline.


Quick check: let's look at `v1:major-update/snapshot_guide.log`. It should only have logs from the current run.


In [7]:
!cat snapshots/snapshot_guide/v1:major-update/snapshot_guide.log

2025-11-16 11:45:03 - INFO     New Snapshot version `1` with name `v1:major-update` has been set up at: `snapshots/snapshot_guide/v1:major-update`
2025-11-16 11:45:07 - INFO     This is an informational message from the current run.
2025-11-16 11:45:07 - ERROR    An error occurred, affecting a part of the pipeline.
2025-11-16 11:45:07 - CRITICAL Critical failure: pipeline likely to terminate.
2025-11-16 11:45:07 - SUCCESS  Important step completed successfully!
2025-11-16 11:45:07 - ERROR    Caught a division by zero error.
Traceback (most recent call last):

  File "/tmp/ipykernel_351658/1530191759.py", line 10, in <module>
    result = 10 / 0

ZeroDivisionError: division by zero
2025-11-16 11:45:07 - INFO     All log messages have been dispatched.


Perfect!

## __4. Artifact Management: Saving and Loading Data__

### __(a). What Are Artifacts?__

An **artifact** is any important output from your pipeline that you want to keep. Not temporary files—stuff that matters:

*   **Processed Datasets**: Cleaned data, feature-engineered datasets (e.g., `training_data.csv`)
*   **Models**: Trained weights, saved model files (e.g., `model_v1.pth`, `classifier.pkl`)
*   **Visualizations**: Important plots and charts (e.g., `accuracy_plot.png`, `confusion_matrix.jpg`)
*   **Config Files**: Settings and parameters used during training

OpenCrate handles all the annoying details: file paths, serialization, versioning. You just call `.save()` and `.load()`.
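To give a feel for what a typed artifact handle hides, here is a small standard-library sketch of a JSON artifact that builds its own path, creates folders on demand, and serializes on `.save()`. It is an assumption-laden stand-in, not OpenCrate's implementation: the `JsonArtifactSketch` class and the `jsons` subfolder name are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


class JsonArtifactSketch:
    """Illustrative stand-in for a typed artifact handle (not OpenCrate's class)."""

    def __init__(self, snapshot_dir: Path, name: str) -> None:
        self.name = name
        # Hypothetical layout: one subfolder per artifact type inside the snapshot.
        self.path = snapshot_dir / "jsons" / name

    @property
    def exists(self) -> bool:
        return self.path.is_file()

    def save(self, data: Any) -> None:
        # Create the type folder on first use, then serialize to disk.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(data, indent=2), encoding="utf-8")

    def load(self) -> Any:
        return json.loads(self.path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    artifact = JsonArtifactSketch(Path(tmp), "data.json")
    existed_before = artifact.exists
    artifact.save({"array": [10, 20, 30], "message": "Sample JSON data"})
    existed_after = artifact.exists
    round_trip = artifact.load()
```

The caller never touches a file path or a serializer directly, which is the convenience the artifact API is aiming for.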


### __(b). Built-in Artifact Handlers__

OpenCrate has handlers for common file types. Just pick the right one for your data, give it a name, and call `.save()`. OpenCrate handles the rest.


#### __Data & Configuration Handlers:__

*   `oc.snapshot.json(name)`: Manages Python dictionaries, lists, and other JSON-serializable objects, saving them as `.json` files.
*   `oc.snapshot.yaml(name)`: Ideal for configuration management, handling dictionaries and similar structures as `.yaml` files.
*   `oc.snapshot.csv(name)`: Designed for tabular data, supporting Pandas DataFrames, lists of lists, or NumPy arrays for saving to `.csv` format.
*   `oc.snapshot.text(name)`: A versatile handler for saving any string data to a plain `.txt` file.




#### __Media Handlers:__

*   `oc.snapshot.image(name)`: Handles various image formats, supporting saving and loading from NumPy arrays, PIL Images, or Matplotlib figures. Offers a `lib` parameter for specifying the image processing library (e.g., `"pil"`, `"cv2"`).
*   `oc.snapshot.gif(name)`: Facilitates the creation and loading of animated GIFs from a sequence of images.
*   `oc.snapshot.video(name)`: Manages video files from diverse sources.
*   `oc.snapshot.audio(name)`: Supports audio data from libraries like Torchaudio or Librosa, with options to specify the sampling rate and library.




#### __Machine Learning Model Handlers:__

*   `oc.snapshot.checkpoint(name)`: A powerful handler for saving and loading machine learning model checkpoints. It supports a wide array of popular frameworks and formats, including:
    *   PyTorch (`.pth`, `.pt`, `.safetensors`)
    *   TensorFlow/Keras (`.h5`, `.keras`)
    *   Scikit-learn (`.joblib`, `.pkl`)
    *   And more, typically by handling a dictionary containing model state, optimizer state, and other metadata.




Let's save different file types: JSON, CSV, text, images, audio, and a PyTorch model checkpoint.


In [8]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch

In [9]:
# First we initialize our artifacts based on their handling type

greeting_artifact = oc.snapshot.text("greeting.txt")
data_artifact = oc.snapshot.json("data.json")
config_artifact = oc.snapshot.yaml("config.yaml")
sample_data_artifact = oc.snapshot.csv("sample_data.csv")
sine_artifact = oc.snapshot.image("sine_wave_plot.png")
numpy_image_artifact = oc.snapshot.image("random_numpy_image.jpg")
audio_artifact = oc.snapshot.audio("high_pitch_sine.wav")
custom_model_ckpt_artifact = oc.snapshot.checkpoint("custom_model_checkpoint.pth")

greeting_artifact.save("Hello, OpenCrate Guide!") # saving as plain text
data_artifact.save({"array": [10, 20, 30], "message": "Sample JSON data"}) # saving as JSON
config_artifact.save({"project": "OpenCrate Guide", "version": 1.1, "settings": {"debug_mode": True}}) # saving as YAML
sample_data_artifact.save(pd.DataFrame({"col_a": [100, 200], "col_b": [300, 400]}), index=False) # saving as CSV

figure = plt.figure(figsize=(6, 4))
plt.plot(np.sin(np.linspace(0, 2 * np.pi, 50)))
plt.title("Sine Wave Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
sine_artifact.save(figure) # saving matplotlib figure image
plt.close(figure)

numpy_image = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
numpy_image_artifact.save(numpy_image) # saving numpy array as image

sr = 44100
duration = 3
frequency = 220.0
t = np.linspace(0., duration, int(sr * duration), endpoint=False)
amplitude = 0.3 * np.iinfo(np.int16).max
audio_data = (amplitude * np.sin(2. * np.pi * frequency * t)).astype(np.int16)
audio_artifact.save(audio_data, sr, lib="soundfile")

model = torch.nn.Sequential(
    torch.nn.Linear(20, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 1)
)
optimizer = torch.optim.Adam(lr=0.001, params=model.parameters())

custom_model_ckpt_artifact.save(
    {
        "epoch": 5,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": 0.015,
        "description": "A sample PyTorch model checkpoint after 5 epochs."
    }
)


### __(c). Visualizing Artifact Storage__

OpenCrate automatically organizes artifacts into folders by type. Clean and easy to navigate.


In [None]:
oc.io.show_files_in_dir("snapshots", depth=4, verbose=True)
# Neat trick: pass verbose=True to show_files_in_dir to see file sizes and last-modified times

### __(d). Loading Artifacts__

Loading is just as easy as saving. Call `.load()` and you get your data back in its original Python format. No worrying about file paths or deserialization.


In [11]:
loaded_greeting = greeting_artifact.load()
oc.info(f"Loaded Text: {loaded_greeting}")

loaded_json_data = data_artifact.load()
oc.info(f"Loaded JSON: {loaded_json_data}")

loaded_config = config_artifact.load()
oc.info(f"Loaded YAML Config: {loaded_config}")

loaded_csv_data = sample_data_artifact.load()
oc.info(f"Loaded CSV Data:\n{loaded_csv_data}")

loaded_sine_wave_plot = sine_artifact.load(lib="cv2")
oc.info(f"Loaded Sine Wave Plot (shape): {loaded_sine_wave_plot.shape}")

loaded_numpy_image = numpy_image_artifact.load(lib="cv2")
oc.info(f"Loaded NumPy Image (size): {loaded_numpy_image.size}")

# Audio is loaded in the next cell; specify `lib` if it was not saved with the default library.
# A checkpoint typically loads back as the dictionary it was saved with.
loaded_checkpoint = custom_model_ckpt_artifact.load()
oc.info(f"Loaded Checkpoint Keys: {loaded_checkpoint.keys()}")
oc.info(f"Loaded Checkpoint Description: {loaded_checkpoint['description']}")


INFO     Loaded Text: Hello, OpenCrate Guide!
INFO     Loaded JSON: {'array': [10, 20, 30], 'message': 'Sample JSON data'}
INFO     Loaded YAML Config: {'project': 'OpenCrate Guide', 'settings': {'debug_mode': True}, 'version': 1.1}
INFO     Loaded CSV Data:
   col_a  col_b
0    100    300
1    200    400
INFO     Loaded Sine Wave Plot (shape): (393, 557, 3)
INFO     Loaded NumPy Image (size): 49152
INFO     Loaded Checkpoint Keys: dict_keys(['epoch', 'model_state_dict', 'optimizer_state_dict', 'loss', 'description'])
INFO     Loaded Checkpoint Description: A sample PyTorch model checkpoint after 5 epochs.


In [12]:
loaded_audio = audio_artifact.load(lib="soundfile")

def audio_playback_widget(audio_data, sample_rate, volume=0.1):
    import IPython.display as ipd
    import numpy as np

    audio_data = np.array(audio_data) * volume
    ipd.display(
        ipd.Audio(data=audio_data, rate=sample_rate, autoplay=False, normalize=False)
    )

audio_playback_widget(loaded_audio["data"], loaded_audio["sample_rate"])

### __(e). Advanced Artifact Features__

Beyond basic save/load, artifacts have useful properties and methods:

**Properties:**

*   `.exists`: Returns `True` if the artifact file exists. Use this for conditional logic.
*   `.path`: The full file path where the artifact is stored. Useful when other tools need the path.

**Methods:**

*   `.backup(tag=None)`: Creates a backup copy before you overwrite something important. Add a tag or use automatic timestamps.
*   `.list_backups()`: Shows all backup files for this artifact.
*   `.delete(confirm=False)`: Delete an artifact. Requires `confirm=True` to prevent accidents.

Let's try them out.


In [13]:
oc.info(f"Artifact Name: {custom_model_ckpt_artifact.name}")
oc.info(f"Artifact Type: {custom_model_ckpt_artifact.snapshot_type}")
oc.info(f"Artifact Exists: {custom_model_ckpt_artifact.exists}") # Should be True as we just saved it
oc.info(f"Artifact Path: {custom_model_ckpt_artifact.path}")


INFO     Artifact Name: custom_model_checkpoint.pth
INFO     Artifact Type: checkpoint
INFO     Artifact Exists: True
INFO     Artifact Path: snapshots/snapshot_guide/v1:major-update/checkpoints/custom_model_checkpoint.pth


#### __Creating Backups with `.backup()`__

Before modifying an important artifact, create a backup. This way you can always recover if something goes wrong.

You can tag backups for easy identification, or let OpenCrate use timestamps automatically.
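Conceptually, a backup is a copy of the artifact file with a tag or timestamp spliced into its name. The sketch below shows that idea with the standard library only. `backup_file` is a hypothetical helper, the `<stem>.backup_<label><suffix>` pattern is inferred from the backup names listed in this guide, and the timestamp format here uses hyphens rather than colons as a portability assumption.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_file(path: Path, tag: Optional[str] = None) -> Path:
    """Copy `path` next to itself as '<stem>.backup_<tag-or-timestamp><suffix>'."""
    label = tag if tag is not None else datetime.now().strftime("%H-%M-%S_%d-%b-%Y")
    backup_path = path.with_name(f"{path.stem}.backup_{label}{path.suffix}")
    shutil.copy2(path, backup_path)  # copy2 also preserves file metadata
    return backup_path


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "custom_model_checkpoint.pth"
    original.write_bytes(b"loss=0.015")

    tagged = backup_file(original, tag="initial-version")

    # Overwrite the main file; the backup keeps the earlier content.
    original.write_bytes(b"loss=0.012")

    tagged_name = tagged.name
    tagged_content = tagged.read_bytes()
    backup_names = sorted(p.name for p in Path(tmp).glob("*.backup_*"))
```

Because the backup is a separate file, overwriting the main artifact afterwards leaves the backed-up bytes untouched.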


In [14]:
custom_model_ckpt_artifact.backup(tag="initial-version")
oc.info("Created initial backup with tag 'initial-version'.")

# Simulate some changes and then create another backup
loaded_state = custom_model_ckpt_artifact.load()
loaded_state["loss"] = 0.012 # Simulate a better loss
custom_model_ckpt_artifact.save(loaded_state)
oc.info("Modified and re-saved the main artifact.")

custom_model_ckpt_artifact.backup(tag="improved-loss")
oc.info("Created backup with tag 'improved-loss' after modification.")

# Create a backup without a tag (timestamped)
custom_model_ckpt_artifact.backup()
oc.info("Created a timestamped backup without a specific tag.")

oc.io.show_files_in_dir(os.path.dirname(custom_model_ckpt_artifact.path), verbose=True)


INFO     Created initial backup with tag 'initial-version'.
INFO     Modified and re-saved the main artifact.
INFO     Created backup with tag 'improved-loss' after modification.
INFO     Created a timestamped backup without a specific tag.


#### __Listing and Loading Backups__

Use `.list_backups()` to see all your saved backup versions. Then load any backup just like you'd load a regular artifact.


In [15]:
all_backups = "\n".join(custom_model_ckpt_artifact.list_backups())
oc.info(f"All Backups:\n{all_backups}")

INFO     All Backups:
custom_model_checkpoint.backup_initial-version.pth
custom_model_checkpoint.backup_improved-loss.pth
custom_model_checkpoint.backup_11:46:37_16-Nov-2025.pth


In [16]:
initial_checkpoint_artifact = oc.snapshot.checkpoint("custom_model_checkpoint.backup_initial-version.pth")

if initial_checkpoint_artifact.exists:
    initial_checkpoint = initial_checkpoint_artifact.load()
    oc.info(f"Loaded Initial Version Loss: {initial_checkpoint['loss']}")
else:
    oc.warning("Initial version backup not found.")

INFO     Loaded Initial Version Loss: 0.015


#### __Deleting Artifacts__

Delete old or unnecessary artifacts with `.delete(confirm=True)`. The `confirm=True` requirement prevents accidents.


In [17]:
if all_backups:
    artifact_to_delete_name = all_backups.split("\n")[0] # Let's delete the first backup
    artifact_to_delete = oc.snapshot.checkpoint(artifact_to_delete_name)
    artifact_to_delete.delete(confirm=True)
    oc.info(f"Deleted backup: {artifact_to_delete_name}")
    oc.io.show_files_in_dir(
        os.path.dirname(custom_model_ckpt_artifact.path), verbose=True
    )
else:
    oc.warning("No backups to delete.")

INFO     Deleted backup: custom_model_checkpoint.backup_initial-version.pth


## __5. Extending OpenCrate: Custom Artifact Handlers__

### __(a). Why Custom Handlers?__

OpenCrate has handlers for common formats (CSV, JSON, images, models, etc.), but sometimes you need something specific:

*   Unique file formats in your field
*   Custom data validation or preprocessing
*   Special compression or storage requirements
*   Proprietary data structures

Custom handlers let you save/load any data type while keeping all of OpenCrate's versioning and logging benefits.

### __(b). How to Create a Custom Handler__

Create a Python class with at least two methods: `save()` and `load()`. You can add other methods too (like `reset()` for cleanup).

```python
class BoundingBoxHandler:
    def save(self, bounding_boxes_list):
        # Your save logic here using self.path
        ...
    
    def load(self):
        # Your load logic here using self.path
        ...

bounding_box_artifact = oc.snapshot.labels(
    "bounding_boxes", handler=BoundingBoxHandler
)
```

OpenCrate automatically gives your handler these attributes:

*   `self.path`: Where to save/load the file
*   `self.verbose`: Whether to print detailed logs
*   `self.name`: The artifact name (e.g., "bounding_boxes")
*   `self.snapshot_type`: The handler type (e.g., "labels")

Your `save()` method writes data to `self.path`. Your `load()` method reads from `self.path` and returns the data.
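One way to picture the attribute injection is a small factory that instantiates your class and then sets `path`, `verbose`, `name`, and `snapshot_type` on it before handing it back. The sketch below is purely illustrative and makes no claim about how OpenCrate does this internally: `bind_handler` and `TextNoteHandler` are hypothetical names, and the `<snapshot_dir>/<snapshot_type>/<name>` layout is an assumption based on the paths printed in this guide.

```python
import tempfile
from pathlib import Path


class TextNoteHandler:
    """A user-defined handler: it relies only on attributes injected from outside."""

    def save(self, text: str) -> None:
        Path(self.path).write_text(text, encoding="utf-8")

    def load(self) -> str:
        return Path(self.path).read_text(encoding="utf-8")


def bind_handler(handler_cls, snapshot_dir, snapshot_type, name, verbose=False):
    """Instantiate a handler class and attach the attributes it expects (hypothetical)."""
    handler = handler_cls()
    handler.name = name
    handler.snapshot_type = snapshot_type
    handler.verbose = verbose
    type_dir = Path(snapshot_dir) / snapshot_type
    type_dir.mkdir(parents=True, exist_ok=True)
    handler.path = str(type_dir / name)
    return handler


with tempfile.TemporaryDirectory() as tmp:
    note = bind_handler(TextNoteHandler, tmp, "notes", "note.txt")
    note.save("hello")
    loaded_note = note.load()
    relative_path = Path(note.path).relative_to(tmp)
```

The handler class itself never computes a location; it just reads and writes whatever `self.path` points to.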

Let's see two practical examples.


### __(c). Example 1: Bounding Box Handler__

Say you're doing object detection and want to save bounding box coordinates. Instead of overwriting a single file, let's keep a history by saving each set of boxes as a new numbered file.

This `BoundingBoxHandler` creates files like `bounding_boxes_0.txt`, `bounding_boxes_1.txt`, etc. The `load()` method reads all of them and returns the complete history.
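One detail worth calling out before the full handler: numbered files must be sorted by their integer index, not alphabetically, or the history comes back out of order once you pass ten files. This short sketch shows the difference; `file_index` is a helper name introduced here for illustration, using the same parsing expression as the handler's sort key.

```python
def file_index(file_name: str) -> int:
    """Pull the trailing integer out of names like 'bounding_boxes_12.txt'."""
    return int(file_name.split("_")[-1].split(".")[0])


names = [f"bounding_boxes_{i}.txt" for i in (0, 1, 2, 10, 11)]

# Plain string sorting places '10' and '11' before '2'.
lexicographic = sorted(names)

# Sorting by the parsed index keeps the original save order.
numeric = sorted(names, key=file_index)
```

Zero-padding the index in the file name would be an alternative that makes plain string sorting safe.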


In [18]:
from shutil import rmtree
from typing import Dict, List


class BoundingBoxHandler:
    def save(self, bboxes: List[Dict[str, float]], *args, **kwargs):
        # Ensure the directory exists for storing individual bounding box files
        os.makedirs(self.path, exist_ok=True)

        idx = len(os.listdir(self.path)) # Determine the next index for the file
        file_path = os.path.join(self.path, f"bounding_boxes_{idx}.txt")

        lines = []
        for bbox in bboxes:
            # Format bounding box coordinates into a single line
            line = f"{bbox['x1']} {bbox['y1']} {bbox['x2']} {bbox['y2']}"
            lines.append(line)

        content = '\n'.join(lines)
        oc.io.text.save(content, file_path) # Use OpenCrate's internal text handler to save the file
        # You could also use your own serialization logic here instead of oc.io.text.save

        if self.verbose:
            oc.success(f"Successfully saved {len(bboxes)} bounding boxes to {file_path}")

    def load(self, *args, **kwargs) -> List[List[Dict[str, float]]]:
        if self.verbose:
            oc.info(f"Loading bounding boxes from {self.path}")

        loaded_boxes_history = [] # To store list of lists of bboxes

        if not os.path.exists(self.path):
            if self.verbose:
                oc.warning(f"Bounding box directory not found at {self.path}. Returning empty list.")
            return []

        # List files and sort them numerically to maintain the order of saving
        files_in_dir = oc.io.list_files_in_dir(self.path)
        sorted_files = sorted(files_in_dir, key=lambda x: int(x.split('_')[-1].split('.')[0]))

        for file_name in sorted_files:
            file_path = os.path.join(self.path, file_name)
            content = oc.io.text.load(file_path) # Load content of each bounding box file

            current_bboxes_list = []
            for line in content.strip().split('\n'):
                if line.strip():
                    coords = line.strip().split()
                    if len(coords) == 4:
                        bbox = {
                            'x1': float(coords[0]),
                            'y1': float(coords[1]),
                            'x2': float(coords[2]),
                            'y2': float(coords[3])
                        }
                        current_bboxes_list.append(bbox)
            loaded_boxes_history.append(current_bboxes_list)

        if self.verbose:
            oc.info(f"Successfully loaded {len(loaded_boxes_history)} sets of bounding boxes")

        return loaded_boxes_history

    def reset(self, *args, **kwargs):
        # Custom reset logic to delete the directory and recreate it
        if os.path.exists(self.path):
            rmtree(self.path)
        os.makedirs(self.path, exist_ok=True)
        if self.verbose:
            oc.success(f"Reset bounding box handler at {self.path}")

# Instantiate the custom bounding box artifact handler
bounding_box_artifact = oc.snapshot.labels(
    "bounding_boxes", handler=BoundingBoxHandler, verbose=True
)
oc.info(f"Custom Bounding Box Artifact Handler initialized at: {bounding_box_artifact.path}")

INFO     Custom Bounding Box Artifact Handler initialized at: snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes


In [19]:
boxes1 = [
    {"x1": 10.0, "y1": 20.0, "x2": 150.0, "y2": 200.0},
    {"x1": 50.0, "y1": 60.0, "x2": 180.0, "y2": 250.0},
]
boxes2 = [
    {"x1": 100.0, "y1": 110.0, "x2": 220.0, "y2": 300.0},
]

# Reset the handler to ensure a clean state before saving
bounding_box_artifact.reset()

# Save multiple sets of bounding boxes, each creating a new file
bounding_box_artifact.save(boxes1)
bounding_box_artifact.save(boxes2)

oc.info("Saved multiple sets of bounding boxes using the custom handler.")
oc.io.show_files_in_dir(bounding_box_artifact.path)


[32m[1mSUCCESS  [0m Reset bounding box handler at snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes
[32m[1mSUCCESS  [0m Successfully saved 2 bounding boxes to snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes/bounding_boxes_0.txt
[1mINFO     [0m ✓ 'bounding_boxes' of 'labels' saved successfully at 'snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes'.
[32m[1mSUCCESS  [0m Successfully saved 1 bounding boxes to snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes/bounding_boxes_1.txt
[1mINFO     [0m ✓ 'bounding_boxes' of 'labels' saved successfully at 'snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes'.
[1mINFO     [0m Saved multiple sets of bounding boxes using the custom handler.


In [20]:
loaded_bounding_boxes_history = bounding_box_artifact.load()
oc.info(f"Loaded Bounding Boxes History: {loaded_bounding_boxes_history}")

# You can access individual sets of bounding boxes
oc.info(f"First set of boxes: {loaded_bounding_boxes_history[0]}")
oc.info(f"Second set of boxes: {loaded_bounding_boxes_history[1]}")

INFO     Loading bounding boxes from snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes
INFO     Successfully loaded 2 sets of bounding boxes
INFO     ✓ 'bounding_boxes' of 'labels' loaded successfully from 'snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes'.
INFO     Loaded Bounding Boxes History: [[{'x1': 10.0, 'y1': 20.0, 'x2': 150.0, 'y2': 200.0}, {'x1': 50.0, 'y1': 60.0, 'x2': 180.0, 'y2': 250.0}], [{'x1': 100.0, 'y1': 110.0, 'x2': 220.0, 'y2': 300.0}]]
INFO     First set of boxes: [{'x1': 10.0, 'y1': 20.0, 'x2': 150.0, 'y2': 200.0}, {'x1': 50.0, 'y1': 60.0, 'x2': 180.0, 'y2': 250.0}]
INFO     Second set of boxes: [{'x1': 100.0, 'y1': 110.0, 'x2': 220.0, 'y2': 300.0}]


### __(d). Example 2: Zipped Image Dataset Handler__

Managing hundreds of individual image files is messy. It is usually cleaner to bundle them into a single ZIP file and treat that file as one artifact.

This `ImageZipHandler` saves a list of NumPy arrays (images) as PNGs inside a compressed ZIP archive. When loading, it unpacks them back into NumPy arrays.


In [21]:
import os
import zipfile
from typing import List

import cv2
import numpy as np


class ImageZipHandler:
    def save(self, images: List[np.ndarray], *args, **kwargs):
        if self.verbose:
            oc.info(f"Saving {len(images)} images to {self.path}")

        saved_count = 0  # Count only images that were actually written
        with zipfile.ZipFile(self.path, 'w', zipfile.ZIP_DEFLATED) as zipf:
            for i, img_data in enumerate(images):
                # Encode image to PNG format before adding to zip
                is_success, buffer = cv2.imencode(".png", img_data)
                if not is_success:
                    oc.warning(f"Could not encode image at index {i}")
                    continue
                # Use 4-digit zero padding so names sort in index order
                zipf.writestr(f"image_{i:04d}.png", buffer.tobytes())
                saved_count += 1

        if self.verbose:
            oc.success(f"Successfully saved {saved_count} images to {self.path}")

    def load(self, *args, **kwargs) -> List[np.ndarray]:
        if self.verbose:
            oc.info(f"Loading images from {self.path}")

        images = []
        if not os.path.exists(self.path):
            if self.verbose:
                oc.warning(f"Image zip file not found at {self.path}. Returning empty list.")
            return []

        with zipfile.ZipFile(self.path, 'r') as zipf:
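            # Sort names to ensure consistent loading order
            for file_name in sorted(zipf.namelist()):
                with zipf.open(file_name) as img_file:
                    file_bytes = np.frombuffer(img_file.read(), np.uint8)
                    # IMREAD_COLOR always decodes to 3-channel BGR, so grayscale
                    # or alpha-channel images will not round-trip unchanged
                    img = cv2.imdecode(file_bytes, cv2.IMREAD_COLOR)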
                    if img is not None:
                        images.append(img)
                    else:
                        oc.warning(f"Could not decode image {file_name}")

        if self.verbose:
            oc.info(f"Loaded {len(images)} images from {self.path}")

        return images

# Instantiate the custom image dataset artifact handler
image_dataset_artifact = oc.snapshot.image_archive(
    "images_archive.zip", handler=ImageZipHandler, verbose=True
)
oc.info(f"Custom Image Archive Artifact Handler initialized at: {image_dataset_artifact.path}")

INFO      Custom Image Archive Artifact Handler initialized at: snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip


In [22]:
# Generate some random images for demonstration
random_images = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(50)]

# Save the images using the custom handler
image_dataset_artifact.save(random_images)
oc.info("Saved a collection of random images into a zip archive.")

# Load the images back from the zip archive
loaded_images = image_dataset_artifact.load()
oc.info(f"Loaded {len(loaded_images)} images from the archive. First image shape: {loaded_images[0].shape}")

oc.io.show_files_in_dir(os.path.dirname(image_dataset_artifact.path), verbose=True)


INFO      Saving 50 images to snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
SUCCESS   Successfully saved 50 images to snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
INFO      ✓ 'images_archive.zip' of 'image_archive' saved successfully at 'snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip'.
INFO      Saved a collection of random images into a zip archive.
INFO      Loading images from snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
INFO      Loaded 50 images from snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
INFO      ✓ 'images_archive.zip' of 'image_archive' loaded successfully from 'snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip'.
INFO      Loaded 50 images from the archive. First image shape: (64, 64, 3)
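
The handler's `load` method depends on `sorted(zipf.namelist())` returning the images in the order they were saved, which is why `save` writes zero-padded names. The short sketch below uses only plain Python (no OpenCrate calls) to show what happens without the padding. It also implies a limit: with 4-digit padding the order is only preserved for up to 10,000 images per archive.

```python
# Lexicographic sorting compares names character by character,
# so unpadded indices put "image_10" before "image_2".
unpadded = [f"image_{i}.png" for i in range(12)]
padded = [f"image_{i:04d}.png" for i in range(12)]

print(sorted(unpadded)[:4])
# ['image_0.png', 'image_1.png', 'image_10.png', 'image_11.png']

print(sorted(padded)[:4])
# ['image_0000.png', 'image_0001.png', 'image_0002.png', 'image_0003.png']

# Padded names sort back into their original index order; unpadded ones do not.
assert sorted(padded) == padded
assert sorted(unpadded) != unpadded
```

If an archive might hold more images than the padding width allows, widening the format (for example `{i:06d}`) keeps the same guarantee.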


## __6. Best Practices for Artifact Management__

### __(a). Choose Artifacts Wisely__

Not every file needs to be an artifact. Save what matters and skip the rest. Too many artifacts clutter your snapshots and waste storage.

**Good artifacts:**
*   Final cleaned datasets
*   Trained model checkpoints
*   Important plots and visualizations
*   Config files with key parameters

**Not artifacts:**
*   Temporary cache files
*   Files you can easily regenerate
*   Large raw datasets (unless you're specifically versioning them)

### __(b). Group Related Files__

If your pipeline generates tons of related files, group them into one artifact instead of saving each individually.

**Benefits:**
*   Less clutter
*   Easier to manage (backup, delete, load as a unit)
*   Clearer organization

**How to group:**
*   **Save the whole directory**: Use a custom handler to save an entire folder as one artifact
*   **Compress into an archive**: Bundle files into a ZIP or tar.gz (like our `ImageZipHandler` example)

Example: If you generate 1,300 JSON annotation files, don't create 1,300 artifacts. Either save the parent directory or compress them into `annotations.zip`. Much simpler.
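
As a rough illustration of the compression route, the sketch below bundles a folder of JSON files into a single `annotations.zip` using only the Python standard library. The `bundle_directory` helper and the file names are hypothetical and are not part of the OpenCrate API; the resulting archive is the kind of single file you could then hand to a custom handler.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def bundle_directory(source_dir: Path, archive_path: Path, pattern: str = "*.json") -> int:
    """Compress every file matching `pattern` in `source_dir` into one ZIP archive.

    Hypothetical helper for illustration; returns the number of files bundled.
    """
    files = sorted(source_dir.glob(pattern))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        for file_path in files:
            # Store only the file name so the archive does not embed local paths
            zipf.write(file_path, arcname=file_path.name)
    return len(files)


# Demonstration with a few throwaway annotation files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    annotations_dir = Path(tmp) / "annotations"
    annotations_dir.mkdir()
    for i in range(5):
        (annotations_dir / f"annotation_{i:04d}.json").write_text(json.dumps({"id": i}))

    archive_path = Path(tmp) / "annotations.zip"
    bundled_count = bundle_directory(annotations_dir, archive_path)

    # Read the member list back to confirm everything landed in one archive
    with zipfile.ZipFile(archive_path) as zipf:
        archived_names = zipf.namelist()

print(f"Bundled {bundled_count} files into one archive")
```

The same function works unchanged for 1,300 files; only the directory contents differ.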


## __7. Conclusion__

This guide covered the core pieces you need to use OpenCrate effectively: snapshots, logging, artifacts, and custom handlers.

### __What OpenCrate Gives You__

*   **Reproducibility**: Version your outputs and recreate past experiments easily
*   **Organization**: Auto-organized folders and files. No more mess.
*   **Easy Artifact Handling**: Save and load any data type with simple commands
*   **Safety**: Backup important files before changes. No more accidental overwrites.
*   **Flexibility**: Extend with custom handlers for any file format

OpenCrate takes care of the boring file management stuff so you can focus on actual data science work.

### __Next Steps__

*   **Read the docs**: Check the official OpenCrate documentation for the full API reference
*   **Join the community**: Ask questions, share ideas, contribute
*   **Try it yourself**: Start using OpenCrate in your own projects

Thanks for reading! We hope OpenCrate makes your workflows cleaner and more reproducible.
