# Demo notebook for `richfile`

Welcome to the demo notebook for the `richfile` python package.\
Below you'll find the following chapters:
#### A. Basics of conventions
#### B. Basic examples in python
#### C. Advanced examples in python
#### D. Details


What is `richfile`?

- Primarily, it is a set of **conventions** for saving and loading nested data structures in a human-readable format.
- In practice, it is also a software library / API that implements these conventions.
- The goals of the `richfile` conventions:
    - **Human-readable**: The data saved on disk should be human-readable in a file explorer.
    - **Directory structure**: Hierarchically organized data should be stored in a directory structure.
    - **Versioning**: The data should be insensitive to software version changes. No loss of old data.
    - **Customizable**: The data should be highly customizable.

## A. Basics of `richfile` python conventions
1. richfiles are hierarchically organized directory structures that represent hierarchically organized data structures. (Similar to JSON, HDF5, etc.)
2. Container objects like `list`, `dict`, `set`, `tuple` are represented as directories.
3. Atomic objects like `int`, `float`, arrays, etc. are represented as files.
4. Each directory has a `metadata.richfile` file that contains metadata about the container and its contents. This is a protected filename.
5. Atomic objects are saved and loaded using functions that are specific to that object type. Ideally, these are libraries native to the data type (e.g. `numpy` for arrays). Native python atomic objects like `int`, `str`, etc. are saved as `.json` files.
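
As a rough illustration of conventions 2 and 3, the sketch below sorts Python objects into containers (which would become directories) and atomic objects (which would become files). The helper name `storage_kind` and the constant `CONTAINER_TYPES` are hypothetical and not part of the `richfile` API; they only mirror the list above.

```python
## Hypothetical helper, NOT part of richfile: mirrors conventions 2 and 3 above
CONTAINER_TYPES = (list, dict, set, tuple)

def storage_kind(obj):
    """Return 'directory' for container objects and 'file' for atomic ones."""
    return "directory" if isinstance(obj, CONTAINER_TYPES) else "file"

examples = {
    "some_list": [1, 2, 3],
    "some_float": 4.5,
    "some_str": "hello",
    "some_tuple": (1, 2),
}
for name, value in examples.items():
    print(f"{name}: {storage_kind(value)}")
```

Anything that is not one of the four container types is treated as atomic here, which is a simplification: the actual library decides per registered type.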

#### What does a richfile look like?
Here is an example python object:
```python
some_dict = {
    "some_list_of_int": [1, 2, 3],
    "a_nested_dict": {
        "some_float": 4.5,
        "some_str": "hello",
        "another_dict": {
            "f": None,
            "some_tuple_of_numpy_arrays": (np.array(...), np.array(...)),
            "some_set_of_dicts": {{"a": 1}, {"b": 2}},  ## illustrative only: dicts are unhashable, so real python raises a TypeError here
        }
    }
}
```

And the corresponding `richfile` folder structure:
```
some_dict.richfile (a folder containing the following folder structure)
├── metadata.richfile
├── some_list_of_int.list
|   ├── metadata.richfile
|   ├── 0.int
|   ├── 1.int
|   ├── 2.int
|
├── a_nested_dict.dict
|   ├── metadata.richfile
|   ├── some_float.dict_item
|   |   ├── metadata.richfile
|   |   ├── key.str
|   |   ├── value.float
|   |
|   ├── some_str.dict_item
|   |   ├── metadata.richfile
|   |   ├── key.str
|   |   ├── value.str
|   |
|   ├── another_dict.dict_item
|   |   ├── metadata.richfile
|   |   ├── key.str
|   |   ├── value.dict
|   |   |   ├── metadata.richfile
|   |   |   ├── f.none
|   |   |   ├── some_tuple_of_numpy_arrays.tuple
|   |   |   |   ├── metadata.richfile
|   |   |   |   ├── 0.npy
|   |   |   |   ├── 1.npy
|   |   |   |
|   |   |   ├── some_set_of_dicts.set
|   |   |   |   ├── metadata.richfile
|   |   |   |   ├── 0.dict
|   |   |   |   |   ├── metadata.richfile
|   |   |   |   |   ├── a.dict_item
|   |   |   |   |   |   ├── metadata.richfile
|   |   |   |   |   |   ├── key.str
|   |   |   |   |   |   ├── value.int
|   |   |   |   |   |
|   |   |   |   |
|   |   |   |   ├── 1.dict
|   |   |   |   |   ├── metadata.richfile
|   |   |   |   |   ├── b.dict_item
|   |   |   |   |   |   ├── metadata.richfile
|   |   |   |   |   |   ├── key.str
|   |   |   |   |   |   ├── value.int
|   |   |   |   |   |
|   |   |   |   |
|   |   |   |
|   |   |
|   |
|
```
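
If you want to print a folder structure like the one above for any directory on your machine, a few lines of standard-library Python are enough. This is a generic sketch and not the `richfile` implementation; the function name `format_tree` is made up for this example.

```python
from pathlib import Path

def format_tree(root, prefix=""):
    """Return the lines of a simple text tree for the directory `root`."""
    lines = []
    ## Sort entries by name so the output is stable across runs
    for entry in sorted(Path(root).iterdir(), key=lambda p: p.name):
        lines.append(f"{prefix}├── {entry.name}")
        if entry.is_dir():
            ## Recurse into sub-folders with a deeper prefix
            lines.extend(format_tree(entry, prefix=prefix + "|   "))
    return lines

## Example usage (point it at any existing folder):
## print("\n".join(format_tree("some_dict.richfile")))
```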

## B. Basic examples in python

1. Saving data objects
2. Exploring the saved data
3. Loading the saved data
4. Loading specific elements from the saved data

In [1]:
%load_ext autoreload
%autoreload 2
import richfile as rf

## Set path to save / load a richfile
path = '/home/rich/Desktop/test4/data.richfile'

#### Make a data object
We will make a nested dictionary object and save it as a `richfile` folder.

In [2]:
import numpy as np

## Save a dictionary to a richfile
data = {
    "name": "John Doe",
    "age": 25,
    "address": {
        "street": "1234 Elm St",
        "zip": None
    },
    "siblings": [
        "Jane",
        "Jim"
    ],
    "data": np.array([1,2,3]),
    (1,2,3): "complex key",
}

#### 1. Save the data object

In [3]:
## Save the dictionary to a file
r = rf.RichFile(path=path).save(obj=data)

#### 2. Explore the saved data
The folder was saved as `data.richfile` on disk (at the `path` set above). We will print out the directory structure below.

You can also explore the folder in your file explorer BUT DO NOT MODIFY THE FOLDER CONTENTS. You can copy data out of it, but if you modify the names or contents of the files without also updating the metadata, the data will be corrupted.
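
One cautious way to follow this advice is to poke around in a copy rather than in the original folder. The sketch below uses only the standard library; `copy_richfile` is a hypothetical helper name, not something `richfile` provides.

```python
import shutil
from pathlib import Path

def copy_richfile(path_src, path_dst):
    """Copy an entire richfile folder so the original stays untouched."""
    path_src, path_dst = Path(path_src), Path(path_dst)
    ## Refuse to overwrite anything that already exists
    if path_dst.exists():
        raise FileExistsError(f"Destination already exists: {path_dst}")
    shutil.copytree(path_src, path_dst)
    return path_dst

## Example usage (the destination path is hypothetical):
## copy_richfile(path, '/tmp/data_copy.richfile')
```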

In [4]:
### Prepare the richfile object
r = rf.RichFile(path=path)

print("Object tree in `richfile` directory")
r.view_tree(show_filenames=True)

print("")
print("Directory structure")
r.view_directory_tree()

Object tree in `richfile` directory
Path: /home/rich/Desktop/test4/data.richfile (dict)
├── 'name': value.json  (str)
├── 'age': value.json  (int)
├── 'address': value.dict  (dict)
|    ├── 'street': value.json  (str)
|    ├── 'zip': value.json  (None)
|    
├── 'siblings': value.list  (list)
|    ├── 0.json  (str)
|    ├── 1.json  (str)
|    
├── 'data': value.npy  (numpy_array)
├── '(1, 2, 3)': value.json  (str)


Directory structure
Viewing tree structure of richfile at path: /home/rich/Desktop/test4/data.richfile (dict)
├── name.dict_item (dict_item)
|   ├── key.json (str)
|   ├── value.json (str)
|   
├── age.dict_item (dict_item)
|   ├── key.json (str)
|   ├── value.json (int)
|   
├── address.dict_item (dict_item)
|   ├── key.json (str)
|   ├── value.dict (dict)
|   |   ├── street.dict_item (dict_item)
|   |   |   ├── key.json (str)
|   |   |   ├── value.json (str)
|   |   |   
|   |   ├── zip.dict_item (dict_item)
|   |   |   ├── key.json (str)
|   |   |   ├── value.json (None)

#### 3. Load the saved data
The directory structure of the `data.richfile` directory can be loaded back into a python object.

In [5]:
### Prepare the richfile object
data_2 = rf.RichFile(path=path).load()

## Check if the data is the same
def check_data(d1, d2):
    ## Recurse into containers and compare leaf values with assertions
    if isinstance(d1, dict):
        for k in d1:
            check_data(d1[k], d2[k])
    elif isinstance(d1, list):
        for i in range(len(d1)):
            check_data(d1[i], d2[i])
    elif isinstance(d1, np.ndarray):
        assert np.all(d1 == d2)
    else:
        assert d1 == d2
    return True

print(f"Data is the same: {check_data(data, data_2)}")

Data is the same: True


#### 4. Load specific elements from the saved data
We can also load specific elements from the directory structure without loading the entire directory. You can index into `RichFile` objects directly like python dictionaries and lists. This will create a new `RichFile` object corresponding to the subdirectory you indexed into.

In [6]:
### Prepare the richfile object
r = rf.RichFile(path=path)

### Lazily load a single element from deep in the dictionary by specifying the path
print(f"Original richfile object:")
r.view_tree()
## Make a new richfile object that points to the 'siblings' list inside the data dictionary
r2 = r['siblings'] 

print(f"\nNew richfile object for the 'siblings' dictionary item:")
r2.view_tree()
## Lazily load the second element (index 1) of the 'siblings' list
data2 = r2[1].load()

print(f"\nSecond element of the 'siblings' list:")
print(data2)

Original richfile object:
Path: /home/rich/Desktop/test4/data.richfile (dict)
├── 'name':   (str)
├── 'age':   (int)
├── 'address':   (dict)
|    ├── 'street':   (str)
|    ├── 'zip':   (None)
|    
├── 'siblings':   (list)
|    ├──   (str)
|    ├──   (str)
|    
├── 'data':   (numpy_array)
├── '(1, 2, 3)':   (str)


New richfile object for the 'siblings' dictionary item:
Path: /home/rich/Desktop/test4/data.richfile/siblings.dict_item/value.list (list)
├──   (str)
├──   (str)


Second element of the 'siblings' list:
Jim
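
To make the idea of indexing into a folder more concrete, here is a toy sketch of lazy indexing over the layout printed above (`<key>.dict_item/value.<suffix>` for dict items, `<index>.<suffix>` for list elements). The class `LazyNode` is hypothetical and is not how `richfile` is implemented. In particular, it resolves names by filename pattern instead of reading `metadata.richfile`, it assumes keys contain no glob wildcard characters, and it assumes leaves are plain JSON files.

```python
import json
from pathlib import Path

class LazyNode:
    """Toy stand-in for a lazily indexed folder; NOT the richfile implementation."""

    def __init__(self, path):
        self.path = Path(path)

    def __getitem__(self, key):
        if isinstance(key, int):
            ## List-like folders: elements are named '<index>.<suffix>'
            pattern = f"{key}.*"
        else:
            ## Dict-like folders: items are '<key>.dict_item/value.<suffix>'
            pattern = f"{key}.dict_item/value.*"
        matches = sorted(self.path.glob(pattern))
        if len(matches) != 1:
            raise KeyError(key)
        ## Nothing is read from disk yet; we only build a new path
        return LazyNode(matches[0])

    def load(self):
        ## Only JSON leaves are handled in this sketch
        return json.loads(self.path.read_text())
```

Indexing only composes paths, so the file is opened only when `.load()` is called on the final node.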


## C. Advanced examples in python

1. Calling loading and saving functions with custom arguments
2. Custom saving and loading functions for any object type

#### 1. Calling loading and saving functions with custom arguments

**Loading**: Let's load a numpy array using memory mapping. This requires passing the `mmap_mode` argument to the loading function. This is accomplished by calling the `.set_load_kwargs` method on the `RichFile` object before calling the loading function.

In [7]:
### Prepare the richfile object
r = rf.RichFile(path=path)

### Set the `mmap_mode='r'` for loading objects of the `'numpy_array'` or np.ndarray type
r.set_load_kwargs(type_=np.ndarray, mmap_mode='r')

### Load the numpy array as a memory-mapped array
data = r['data'].load()

print(f"Type of the loaded numpy array: {type(data)}")

Type of the loaded numpy array: <class 'numpy.memmap'>


**Saving**: You can pass custom arguments to saving functions in the same way using the `.set_save_kwargs` method, though there are fewer use cases for this. Remember that every atomic object should survive repeated save/load cycles without any loss of information or change in the object.
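
The round-trip requirement can be checked mechanically. The sketch below is one possible way to do it, using JSON as the example format; `check_round_trip`, `save_json`, and `load_json` are hypothetical helpers written for this notebook, not `richfile` functions.

```python
import json
import tempfile
from pathlib import Path

def save_json(obj, path):
    Path(path).write_text(json.dumps(obj))

def load_json(path):
    return json.loads(Path(path).read_text())

def check_round_trip(obj, function_save, function_load, n_cycles=2):
    """Save and load `obj` n_cycles times; return True if it never changes."""
    current = obj
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(n_cycles):
            path = Path(tmp) / f"cycle_{i}.json"
            function_save(current, path)
            current = function_load(path)
            ## Both the value and the type must be preserved
            if current != obj or type(current) is not type(obj):
                return False
    return True

print(check_round_trip(4.5, save_json, load_json))     ## True: a float survives JSON
print(check_round_trip((1, 2), save_json, load_json))  ## False: the tuple comes back as a list
```

The second call shows why a plain JSON dump would not be a suitable save function for tuples: the type is lost on the first load.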

#### 2. Custom saving and loading functions for any object type

You can register a new type to be used with a `RichFile` object. This is done by calling the `.register_type` function. This is useful for saving and loading custom objects that are not natively supported by `richfile`.

For this example we will register the `sparse.COO` object from the `sparse` library, which is great for saving high-dimensional sparse arrays.

In [10]:
## Make save / load functions for the sparse.COO type
def load_sparseCOO_array(path):
    import sparse
    ## Return the loaded array; without `return` the loader would yield None
    return sparse.load_npz(path)

def save_sparseCOO_array(obj, path):
    import sparse
    sparse.save_npz(path, obj)

## Save a sparse.COO array to a richfile
path_sparse = '/home/rich/Desktop/test4/data_sparse.richfile'

import sparse
data_sparse = {
    "some_sparse_data": sparse.COO(np.random.randint(0, 10, (100, 100))),
}

r = rf.RichFile(path=path_sparse)
## Define the new type for the sparse.COO array
### NOTE: all of these fields are required
r.register_type(
    type_name="sparseCOO_array",
    function_load=load_sparseCOO_array,
    function_save=save_sparseCOO_array,
    object_type=sparse.COO,
    suffix=".npz",
    library="sparse",
)
r.save(obj=data_sparse)

Path: /home/rich/Desktop/test4/data_sparse.richfile (dict)
├── 'some_sparse_data':   (sparseCOO_array)



RichFileHandler(path=/home/rich/Desktop/test4/data_sparse.richfile, check=True), params_load={}), params_save={})
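
For intuition about what registering a type involves, here is a toy registry that accepts the same six fields as the `.register_type` call above. It is a sketch only: `TypeEntry` and `ToyRegistry` are hypothetical names, and the internal bookkeeping of `richfile` may differ. The example registers python's built-in `complex` type with made-up save / load functions.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TypeEntry:
    type_name: str
    function_load: Callable[[str], Any]
    function_save: Callable[[Any, str], None]
    object_type: type
    suffix: str
    library: str

class ToyRegistry:
    """Illustrative only; richfile's internal bookkeeping may differ."""

    def __init__(self):
        self._entries = []

    def register_type(self, **fields):
        ## A missing field raises a TypeError, so all fields are required
        self._entries.append(TypeEntry(**fields))

    def lookup(self, obj):
        ## Most recently registered types take priority in this sketch
        for entry in reversed(self._entries):
            if isinstance(obj, entry.object_type):
                return entry
        raise TypeError(f"No registered type for {type(obj).__name__}")

def save_complex(obj, path):
    with open(path, "w") as f:
        f.write(repr(obj))

def load_complex(path):
    with open(path) as f:
        return complex(f.read())

registry = ToyRegistry()
registry.register_type(
    type_name="complex_number",
    function_load=load_complex,
    function_save=save_complex,
    object_type=complex,
    suffix=".txt",
    library="python",
)
print(registry.lookup(1 + 2j).suffix)
```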

## D. Details
DETAILED PYTHON CONVENTIONS FOR NERDS:

The system is based on the following principles: 
- Each leaf object is saved as a separate file 
- The folder structure mirrors the nested object structure:
    - Lists, tuples, and sets are saved as folders with elements saved as files
      or folders with integer names
    - Dicts are saved as folders with items saved as folders with integer names.
      Dict items are saved as folders containing 2 elements.
- There is a single metadata file for each folder describing the properties of
  each element in the folder
    - The metadata file is a JSON file named "metadata.richfile" and contains
      the following items:
        - "elements": a dictionary with keys that are the names of the files /
          folders in the directory and values that are dictionaries with the
          following items:
            - "type": A string describing the type of the element. The string
              used should be a valid richfile type, as it determines how the
              element is loaded. Examples: "npy_array", "scipy_sparse_array",
              "list", "object", "float", etc.
            - "library": A string describing the library used to save the
              element. Examples: "numpy", "scipy", "python", "json" (for native
              python types), etc.
            - "version": A string describing the version of the library used to
              save the element. This is used to determine how the element is
              loaded. Examples: "1.0.0", "0.1.0", etc.
            - "index": An integer that is used to determine the order of the
              elements when loading them. Example: 0, 1, 2, etc.
        - "type": A string describing the type of the folder. The string used
          should be a valid richfile type, as it determines how the folder is
          loaded. Examples: "list", "dict", "tuple", etc. (Only container-like
          types)
        - "library": A string describing the library used to save the folder.
          Examples: "python"
        - "version": A string describing the version of the library used for
          the container. This is used to determine how the folder is loaded.
          Examples: "3.12", "3.13", etc.
        - "version_richfile": A string describing the version of the richfile
          format used to save the metadata file. Examples: "1.0.0", "0.1.0",
          etc.
- Loading proceeds as follows:
    - enter outer folder
    - load metadata file
    - check that files / folders in the directory match the metadata
    - if folder represents a list, tuple, or set:
        - elements are expected to be named as integers with an appropriate
          suffix: 0.list, 1.npy, 2.dict, 3.npz, 4.json, etc.
        - load each element in the order specified by the metadata index
        - if an element is container-like, enter its folder, load, and package
          it.
    - if folder represents a dict:
        - each item will be saved as a folder containing a single dict item
        - each dict item folder will contain 2 elements: key (0) and value (1)
    - load elements:
        - richfile types (eg. "array", "sparse_array", etc.) are saved and
          loaded using numpy, scipy, etc. as appropriate.
        - an appropriate suffix will be added to the file or folder name.
        - native python types (eg. "float", "int", "str", etc.) are saved as
          JSON files and loaded using the json library.
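
The metadata layout and loading steps described above can be sketched for the simplest case: a flat list of native python values. This is a simplified illustration, not the `richfile` implementation. The function names `save_list_folder` and `load_list_folder` are hypothetical, and the "version" and "version_richfile" fields are omitted for brevity.

```python
import json
from pathlib import Path

def save_list_folder(items, path):
    """Write each item as '<index>.json' plus a 'metadata.richfile' JSON file."""
    path = Path(path)
    path.mkdir(parents=True)
    elements = {}
    for index, item in enumerate(items):
        name = f"{index}.json"
        (path / name).write_text(json.dumps(item))
        elements[name] = {"type": type(item).__name__, "library": "json", "index": index}
    metadata = {"elements": elements, "type": "list", "library": "python"}
    (path / "metadata.richfile").write_text(json.dumps(metadata, indent=2))

def load_list_folder(path):
    """Load a folder written by save_list_folder, following the steps above."""
    path = Path(path)
    metadata = json.loads((path / "metadata.richfile").read_text())
    ## Check that the files on disk match the metadata
    on_disk = {p.name for p in path.iterdir()} - {"metadata.richfile"}
    if on_disk != set(metadata["elements"]):
        raise ValueError("Folder contents do not match metadata")
    ## Load elements in the order given by their 'index' field
    ordered = sorted(metadata["elements"].items(), key=lambda kv: kv[1]["index"])
    return [json.loads((path / name).read_text()) for name, _ in ordered]
```

The consistency check in `load_list_folder` is also why renaming or deleting files by hand breaks loading: the folder no longer matches its metadata.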