This example notebook shows how to train an [image/digit classification](https://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_logistic_regression_mnist.html)
model on the MNIST dataset and store it as a TileDB array. First, let's import what we need.

In [1]:
import glob
import json
import os
import shutil
from pprint import pprint

import matplotlib.pyplot as plt
import numpy as np
import tiledb
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state
from sklearn import tree

from tiledb.ml.models.sklearn import SklearnTileDBModel

Next, we load the data, split it into training and test sets, and apply basic scaling with a standard scaler.

In [2]:
data_home = os.path.join(os.path.pardir, "data")
train_samples = 5000

# Load data from https://www.openml.org/d/554
print('Data fetching...')
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False, data_home=data_home)

Data fetching...


In [3]:
random_state = check_random_state(0)
permutation = random_state.permutation(X.shape[0])
X = X[permutation]
y = y[permutation]
X = X.reshape((X.shape[0], -1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_samples, test_size=10000)

print('Data scaling...')
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Data scaling...
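
To make the scaling step concrete, here is a minimal NumPy sketch of what standardization amounts to: the mean and standard deviation are estimated on the training split only and then reused for the test split. The toy arrays and the variable names are illustrative assumptions, not part of the notebook, and the sketch is not meant to reproduce every detail of `StandardScaler`.

```python
import numpy as np

# Toy data standing in for X_train / X_test (values are made up).
X_train_toy = np.array([[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]])
X_test_toy = np.array([[2.0, 10.0], [6.0, 12.0]])

# Statistics come from the training split only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)
# Constant columns have zero spread; divide by 1 instead to avoid NaNs.
std[std == 0.0] = 1.0

X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std  # reuse the training statistics

print(X_train_scaled.mean(axis=0))
print(X_test_scaled)
```

Reusing the training statistics on the test split keeps information about the test data from leaking into preprocessing.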


We move on by declaring a simple logistic regression classifier, training it, and printing its sparsity and accuracy score.

In [4]:
clf = LogisticRegression(
    C=50. / train_samples, penalty='l1', solver='saga', tol=0.1
)

print('Model fit...')
clf.fit(X_train, y_train)

print('Model score...')
sparsity = np.mean(clf.coef_ == 0) * 100
score = clf.score(X_test, y_test)

print("Sparsity with L1 penalty: %.2f%%" % sparsity)
print("Test score with L1 penalty: %.4f" % score)

Model fit...
Model score...
Sparsity with L1 penalty: 77.93%
Test score with L1 penalty: 0.8298
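
The sparsity figure above is the percentage of coefficients that are exactly zero, and the regularization strength `C=50. / train_samples` evaluates to 0.01 for 5000 training samples. The following small sketch shows the sparsity computation on a made-up coefficient matrix; `coef` here is a hypothetical stand-in for `clf.coef_`.

```python
import numpy as np

# Made-up coefficient matrix: 2 classes x 4 features.
coef = np.array([[0.0, 0.3, 0.0, 0.0],
                 [0.0, 0.0, -1.2, 0.0]])

# `coef == 0` is a boolean mask; its mean is the fraction of exact zeros.
sparsity_pct = np.mean(coef == 0) * 100
print("Sparsity: %.2f%%" % sparsity_pct)
```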


We can now save the trained model as a TileDB array. For information about the structure of a dense
TileDB array in terms of files on disk, please take a look [here](https://docs.tiledb.com/main/concepts/data-format).
At the moment (this will change in the future) we use pickle, which is one of the [most common ways to persist sklearn models](https://scikit-learn.org/stable/modules/model_persistence.html),
to serialize the whole model and save it as a [variable-sized attribute](https://docs.tiledb.com/main/how-to/arrays/writing-arrays/var-length-attributes)
in a TileDB array. We first declare a `SklearnTileDBModel` object (with the corresponding `uri` and `model` arguments) and then save the model as a TileDB array.
Finally, we can save any kind of metadata (in any structure, e.g., list, tuple or dictionary) by passing a dictionary to the `meta` argument.
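
As a rough illustration of the pickle step described above, the sketch below serializes an object to a single bytes blob and restores it. It does not use TileDB or sklearn: `toy_model` is a hypothetical stand-in for a fitted estimator, and how the library stores the resulting bytes internally is not shown here.

```python
import pickle

# Stand-in for a fitted estimator: any picklable Python object behaves the same way.
toy_model = {"coef": [0.0, 0.3, -1.2], "intercept": 0.5}

# Serialize to a single bytes blob; its length varies with the model,
# which is why a variable-sized attribute is a natural fit.
blob = pickle.dumps(toy_model, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize to get an equal object back.
restored = pickle.loads(blob)
print(type(blob).__name__, len(blob), restored == toy_model)
```

As with any pickle-based persistence, only load blobs that come from a trusted source.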

In [5]:
uri = os.path.join(data_home, 'sklearn-mnist-1')
tiledb_model_1 = SklearnTileDBModel(uri=uri, model=clf)

tiledb_model_1.save(meta={'Sparsity_with_L1_penalty': sparsity,
                          'score': score})

The above step creates a TileDB array at the given `uri` (here, under the `data` directory). Let's open our TileDB array model and check its metadata.
Metadata values of type list, dict or tuple are JSON
serialized while saving, i.e., we need `json.loads` to deserialize them.

In [6]:
# Check array directory
pprint(glob.glob(f'{uri}/*'))

# Open in read mode (the default) in order to inspect metadata
print()
model_array_1 = tiledb.open(uri)
for key, value in model_array_1.meta.items():
    if isinstance(value, bytes):
        value = json.loads(value)
    print("Key: {}, Value: {}".format(key, value))

['../data/sklearn-mnist-1/__fragment_meta',
 '../data/sklearn-mnist-1/__meta',
 '../data/sklearn-mnist-1/__fragments',
 '../data/sklearn-mnist-1/__commits',
 '../data/sklearn-mnist-1/__schema']

Key: Sparsity_with_L1_penalty, Value: 77.93367346938776
Key: TILEDB_ML_MODEL_ML_FRAMEWORK, Value: SKLEARN
Key: TILEDB_ML_MODEL_ML_FRAMEWORK_VERSION, Value: 1.0.2
Key: TILEDB_ML_MODEL_PREVIEW, Value: LogisticRegression(C=0.01, penalty='l1', solver='saga', tol=0.1)
Key: TILEDB_ML_MODEL_PYTHON_VERSION, Value: 3.7.13
Key: TILEDB_ML_MODEL_STAGE, Value: STAGING
Key: score, Value: 0.8298
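
The decoding loop above can be paired with a sketch of the encoding side. This is an assumption about the general pattern, not the library's actual code: `encode_meta` and `decode_meta` are hypothetical helpers that JSON-encode container values and leave scalars untouched.

```python
import json

def encode_meta(meta):
    """Hypothetical helper: JSON-encode container values, pass scalars through."""
    encoded = {}
    for key, value in meta.items():
        if isinstance(value, (list, tuple, dict)):
            encoded[key] = json.dumps(value).encode("utf-8")
        else:
            encoded[key] = value
    return encoded

def decode_meta(meta):
    """Hypothetical helper: reverse of encode_meta, mirroring the loop above."""
    return {
        key: json.loads(value) if isinstance(value, bytes) else value
        for key, value in meta.items()
    }

# Made-up metadata with a scalar, a list and a tuple.
raw = {"score": 0.8298, "classes": ["0", "1"], "split": (5000, 10000)}
round_tripped = decode_meta(encode_meta(raw))
print(round_tripped)
```

Note that JSON has no tuple type, so a tuple comes back as a list after the round trip.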


As we can see, the array's metadata contains by default information about the backend we used for training (sklearn),
the sklearn version, the Python version, and the extra metadata about sparsity and test score that we added.
We can load and check any of the aforementioned without having to load the entire model into memory.
Moreover, we can add any kind of extra information to the model's metadata by opening the TileDB array in write mode and adding new keys.

In [7]:
# Open the array in write mode
with tiledb.Array(uri, "w") as A:
    # Existing keys are kept; only the new key is added
    A.meta['new_meta'] = json.dumps(['Any kind of info'])

# Check that everything is there
model_array_1 = tiledb.open(uri)
for key, value in model_array_1.meta.items():
    if isinstance(value, bytes):
        value = json.loads(value)
    print("Key: {}, Value: {}".format(key, value))

Key: Sparsity_with_L1_penalty, Value: 77.93367346938776
Key: TILEDB_ML_MODEL_ML_FRAMEWORK, Value: SKLEARN
Key: TILEDB_ML_MODEL_ML_FRAMEWORK_VERSION, Value: 1.0.2
Key: TILEDB_ML_MODEL_PREVIEW, Value: LogisticRegression(C=0.01, penalty='l1', solver='saga', tol=0.1)
Key: TILEDB_ML_MODEL_PYTHON_VERSION, Value: 3.7.13
Key: TILEDB_ML_MODEL_STAGE, Value: STAGING
Key: new_meta, Value: ["Any kind of info"]
Key: score, Value: 0.8298


Moving on, we can load the trained models for evaluation or retraining, as usual with sklearn models. What is really nice about saving models as TileDB arrays is native versioning based on fragments, as described [here](https://docs.tiledb.com/main/concepts/data-format#immutable-fragments). We can load a model, retrain it with new data and update the already existing TileDB model array with new model parameters and metadata. All information, old and new, will be there and accessible. This is extremely useful when you retrain with new data or try different architectures for the same problem, and you want to keep track of all your experiments without having to store different model instances.

In our case, let's continue training `sklearn-mnist-1` with test data (just for simplicity). After training is done, we can save the model again with `update=True`. The fragment information printed below shows the additional fragment written to the `sklearn-mnist-1` TileDB array, which keeps all versions of the model.

In [8]:
loaded_clf = tiledb_model_1.load()

# Sparsity and score should be the same as in the previous step.
print('Model score...')
sparsity = np.mean(loaded_clf.coef_ == 0) * 100
score = loaded_clf.score(X_test, y_test)

print("Sparsity with L1 penalty: %.2f%%" % sparsity)
print("Test score with L1 penalty: %.4f" % score)


# We retrain with test data just for the sake of simplicity.
print('Model fit...')
loaded_clf.fit(X_test, y_test)

print('Model score...')
sparsity = np.mean(loaded_clf.coef_ == 0) * 100
score = loaded_clf.score(X_test, y_test)

print("Sparsity with L1 penalty: %.2f%%" % sparsity)
print("Test score with L1 penalty: %.4f" % score)


tiledb_model_1 = SklearnTileDBModel(uri=uri, model=loaded_clf)
tiledb_model_1.save(update=True,
                    meta={'Sparsity_with_L1_penalty': sparsity,
                          'score': score})

# Check array directory
print()
pprint(glob.glob(f'{uri}/*'))


# tiledb.array_fragments() requires TileDB-Py version > 0.8.5
fragments_info = tiledb.array_fragments(uri)

print()
print("====== FRAGMENTS  INFO ======")
print("array uri: {}".format(fragments_info.array_uri))
print("number of fragments: {}".format(len(fragments_info)))

for fragment in fragments_info:
    print()
    print("===== FRAGMENT NUMBER {} =====".format(fragment.num))
    print("timestamp range: {}".format(fragment.timestamp_range))
    print(
        "number of unconsolidated metadata: {}".format(
            fragment.unconsolidated_metadata_num
        )
    )
    print("version: {}".format(fragment.version))


Model score...
Sparsity with L1 penalty: 77.93%
Test score with L1 penalty: 0.8298
Model fit...
Model score...
Sparsity with L1 penalty: 44.07%
Test score with L1 penalty: 0.7194

['../data/sklearn-mnist-1/__fragment_meta',
 '../data/sklearn-mnist-1/__meta',
 '../data/sklearn-mnist-1/__fragments',
 '../data/sklearn-mnist-1/__commits',
 '../data/sklearn-mnist-1/__schema']

array uri: ../data/sklearn-mnist-1
number of fragments: 2

===== FRAGMENT NUMBER 0 =====
timestamp range: (1664858394611, 1664858394611)
number of unconsolidated metadata: 2
version: 15

===== FRAGMENT NUMBER 1 =====
timestamp range: (1664858399506, 1664858399506)
number of unconsolidated metadata: 2
version: 15
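
The idea behind fragment-based versioning can be illustrated with a toy, in-memory model. This is only a conceptual sketch and not how TileDB is implemented: `VersionedStore` is a hypothetical class in which each write appends an immutable, timestamped entry, and a read returns the newest entry at or before a requested timestamp. The timestamps and scores are taken from the outputs above.

```python
import bisect

class VersionedStore:
    """Toy append-only store: every write is kept, nothing is overwritten."""

    def __init__(self):
        self._timestamps = []  # kept sorted in increasing order
        self._payloads = []

    def write(self, timestamp, payload):
        # Only accept strictly increasing timestamps to keep the log ordered.
        if self._timestamps and timestamp <= self._timestamps[-1]:
            raise ValueError("timestamps must increase")
        self._timestamps.append(timestamp)
        self._payloads.append(payload)

    def read(self, timestamp=None):
        """Return the newest payload, or the newest one at or before `timestamp`."""
        if not self._timestamps:
            raise LookupError("store is empty")
        if timestamp is None:
            return self._payloads[-1]
        idx = bisect.bisect_right(self._timestamps, timestamp) - 1
        if idx < 0:
            raise LookupError("no version at or before this timestamp")
        return self._payloads[idx]

store = VersionedStore()
store.write(1664858394611, {"score": 0.8298})  # first save
store.write(1664858399506, {"score": 0.7194})  # save after retraining
print(store.read(), store.read(1664858394611))
```

Because earlier entries are never modified, the state at any earlier point in time remains readable after an update.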


Finally, a TileDB feature that is very useful for machine learning models is groups, described
[here](https://docs.tiledb.com/main/concepts/data-format#groups) and [here](https://docs.tiledb.com/main/how-to/object-management#creating-tiledb-groups).
Assume we want to solve the MNIST problem and try several architectures. We can save each architecture
as a separate TileDB array with native versioning each time it is retrained, and then organise all models that solve the same problem (MNIST)
as a TileDB group with any kind of hierarchy. Let's first define a new model architecture, then train a model and save
it as a new TileDB array.

In [9]:
# We declare a Decision Tree classifier
clf = tree.DecisionTreeClassifier()

# Fit the model
print('Fit...')
clf.fit(X_train, y_train)

# Evaluate
score = clf.score(X_test, y_test)
print("Test score: %.4f" % score)

# Declare a SklearnTileDBModel object
uri2 = os.path.join(data_home, 'sklearn-mnist-2')
tiledb_model_2 = SklearnTileDBModel(uri=uri2, model=clf)

# Save model as a TileDB array
tiledb_model_2.save(meta={'score': score})

Fit...
Test score: 0.7755


Now we can create a TileDB group and organise all our MNIST models
(in hierarchies, e.g., sophisticated vs. less sophisticated) as follows.

In [10]:
group = os.path.join(data_home, 'tiledb-sklearn-mnist')
tiledb.group_create(group)
shutil.move(uri, group)
shutil.move(uri2, group)

'../data/tiledb-sklearn-mnist/sklearn-mnist-2'

At any time, we can check and query all the available models for a specific problem like MNIST; their metadata can then be read by opening each array.

In [11]:
tiledb.ls(group, lambda obj_path, obj_type: print(obj_path, obj_type))

file:///home/gsk/projects/TileDB-ML/examples/data/tiledb-sklearn-mnist/sklearn-mnist-1 array
file:///home/gsk/projects/TileDB-ML/examples/data/tiledb-sklearn-mnist/sklearn-mnist-2 array
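
Instead of printing, the callback passed to `tiledb.ls` can collect entries for later filtering. The sketch below does not call TileDB: it invokes a hypothetical `collect` callback by hand with two made-up entries shaped like the output above, so the paths are placeholders.

```python
found = []

def collect(obj_path, obj_type):
    """Callback with the same two-argument shape as the lambda above."""
    found.append((obj_path, obj_type))

# Simulated invocations with placeholder paths; in the notebook one would
# pass `collect` as the second argument of tiledb.ls(group, ...) instead.
collect("file:///tmp/tiledb-sklearn-mnist/sklearn-mnist-1", "array")
collect("file:///tmp/tiledb-sklearn-mnist/sklearn-mnist-2", "array")

# Keep only arrays, i.e. the saved models.
model_uris = [path for path, kind in found if kind == "array"]
print(model_uris)
```

Each collected URI could then be opened to read its metadata, as shown earlier for `sklearn-mnist-1`.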
