# Quickly create error models

Given a trained model, its training dataset, and optionally a validation dataset, Oloren ChemEngine (OCE) can quickly create an error model.

In [1]:
import olorenchemengine as oce
import pandas as pd
import numpy as np
import json
import os

# Build the dataset once, split it 80/10/10 into train/valid/test, and cache it on disk.
if not os.path.exists("lipophilicity_dataset.oce"):
    dataset = oce.DatasetFromCSV("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/Lipophilicity.csv", structure_col="smiles", property_col="exp")
    splitter = oce.RandomSplit(split_proportions=[0.8, 0.1, 0.1])
    dataset = dataset + splitter
    oce.save(dataset, "lipophilicity_dataset.oce")
else:
    dataset = oce.load("lipophilicity_dataset.oce")

# Train the random forest once and cache it; later runs reload the saved model.
if not os.path.exists("lipophilicity_model_rf.oce"):
    model = oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000)
    model.fit(dataset.train_dataset[0], dataset.train_dataset[1])
    oce.save(model, "lipophilicity_model_rf.oce")
else:
    model = oce.load("lipophilicity_model_rf.oce")



The `create_error_model` method creates an error model and stores it in `model.error_model`. It requires the error model and the training dataset; a validation dataset can optionally be passed for fitting the error model. If no validation dataset is given, cross-validation is used to fit the error model instead. Here, we use `oce.SDC` to estimate 80% confidence intervals.

In [2]:
error_model = oce.SDC(ci=0.8)
model.create_error_model(error_model, *dataset.train_dataset, *dataset.valid_dataset)

100%|██████████| 420/420 [01:20<00:00,  5.23it/s]
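
To make the idea of fitting confidence intervals on validation data concrete, here is a minimal sketch in plain Python. It is not how `oce.SDC` works internally: it ignores any per-molecule score and simply picks one half-width that covers the requested fraction of validation residuals. The function name `interval_half_width` and the toy values are hypothetical.

```python
import math

def interval_half_width(y_true, y_pred, ci=0.8):
    """Smallest half-width that covers a fraction `ci` of the absolute residuals."""
    residuals = sorted(abs(t - p) for t, p in zip(y_true, y_pred))
    # Position of the residual sitting at the requested coverage level.
    k = max(math.ceil(ci * len(residuals)) - 1, 0)
    return residuals[k]

# Toy validation labels and predictions (illustrative values only).
y_valid = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y_hat = [1.1, 1.8, 3.3, 3.6, 5.5, 5.4, 7.7, 7.2, 9.9, 9.0]

half_width = interval_half_width(y_valid, y_hat, ci=0.8)
covered = sum(abs(t - p) <= half_width for t, p in zip(y_valid, y_hat))
print(half_width, covered / len(y_valid))
```

An error model refines this by letting the interval width vary per molecule instead of using a single global value.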


The error model is now ready to make predictions.

In [3]:
model.error_model.score(dataset.test_dataset[0])[0:10]

100%|██████████| 420/420 [01:04<00:00,  6.56it/s]


array([1.34667487, 1.63756768, 1.23844928, 1.22823365, 1.13443495,
       1.08317984, 1.76692449, 1.72157803, 1.59796142, 1.29997161])
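
The values above are per-molecule scores. As a rough illustration of what a similarity-based score can look like, the sketch below computes a sum of distance-weighted contributions over Tanimoto distances, using one formulation from the literature. Whether `oce.SDC` uses exactly this weighting, the decay constant `a=3.0`, and the representation of fingerprints as sets of on-bits are all assumptions made for this example.

```python
import math

def tanimoto_distance(a, b):
    """Tanimoto distance between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union

def sdc_score(query, training, a=3.0):
    """Sum of exp(-a * TD / (1 - TD)) over all training fingerprints (assumed form)."""
    score = 0.0
    for fp in training:
        td = tanimoto_distance(query, fp)
        if td < 1.0:  # a fully dissimilar molecule contributes nothing
            score += math.exp(-a * td / (1.0 - td))
    return score

# Hypothetical fingerprints: each set lists the indices of the bits that are on.
training = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
near = {1, 2, 3, 4}   # identical to one training molecule
far = {7, 10, 11}     # shares a single bit with one training molecule

print(sdc_score(near, training), sdc_score(far, training))
```

Under this formulation, a molecule with close neighbours in the training set scores higher than one that resembles nothing the model has seen.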

# Concurrent Error Models

Error models can also be built alongside a model. To build one during model training, pass the error model to `model.fit()` through the `error_model` argument. Here, we again use the `oce.SDC` error model.

In [2]:
model = oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000)
error_model = oce.SDC(ci=0.8)
model.fit(dataset.train_dataset[0], dataset.train_dataset[1], error_model=error_model)

The error model is now built and stored in `model.error_model`. From here, any error model method, such as `.fit()` or `.fit_cv()`, can be run. The error model can also be fitted while running `model.test()` by setting `fit_error_model=True`.

In [3]:
model.test(dataset.valid_dataset[0], dataset.valid_dataset[1], fit_error_model=True)

100%|██████████| 420/420 [00:42<00:00,  9.99it/s]


{'r2': -0.6512297026152631,
 'Spearman': 0.5708690272241712,
 'Explained Variance': 0.11665021730288161,
 'Max Error': 3.151970455658377,
 'Mean Absolute Error': 1.2711863427259735,
 'Mean Squared Error': 2.1369793008176274,
 'Root Mean Squared Error': 1.461841065512126}
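
The negative `r2` alongside a positive Spearman correlation can look surprising. The sketch below recomputes several of these metrics from scratch, assuming the standard definition of `r2` as one minus the ratio of residual to total sum of squares (an assumption about how OCE computes it). The toy data shows one way `r2` turns negative: predictions that rank samples correctly but carry a constant offset.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute common regression metrics with plain Python."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    mse = ss_res / n
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "Max Error": max(abs(e) for e in errors),
        "Mean Absolute Error": sum(abs(e) for e in errors) / n,
        "Mean Squared Error": mse,
        "Root Mean Squared Error": math.sqrt(mse),
    }

y_true = [1.0, 2.0, 3.0, 4.0]
biased = [3.0, 4.0, 5.0, 6.0]  # perfect ordering, constant offset of +2

metrics = regression_metrics(y_true, biased)
print(metrics)
```

Here the ordering is perfect, yet `r2` is below zero because the predictions do worse than always predicting the mean of the true values.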

Finally, if a model contains a fitted error model, setting `return_ci=True` when running `model.predict()` also returns the confidence intervals. Setting `return_vis=True` additionally returns `VisualizeError` objects.

In [4]:
df = model.predict(dataset.test_dataset[0], return_ci=True, return_vis=True)

100%|██████████| 420/420 [00:54<00:00,  7.77it/s]


In [5]:
df.head()

|   | predicted | ci | vis |
|---|-----------|----|-----|
| 0 | 1.447593 | 1.88686 | <olorenchemengine.visualizations.visualization... |
| 1 | 1.057038 | 1.909262 | <olorenchemengine.visualizations.visualization... |
| 2 | 1.488842 | 1.885839 | <olorenchemengine.visualizations.visualization... |
| 3 | 1.323693 | 1.885781 | <olorenchemengine.visualizations.visualization... |
| 4 | 1.256467 | 1.885416 | <olorenchemengine.visualizations.visualization... |
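
One possible way to read the `predicted` and `ci` columns is as a centre and a symmetric half-width. The source does not state this, so treat it as an assumption. Under that reading, the bounds of each interval can be sketched as follows, using the first rows of the output above; the helper `to_interval` is hypothetical.

```python
# First three rows of the output above, copied by hand.
rows = [
    {"predicted": 1.447593, "ci": 1.886860},
    {"predicted": 1.057038, "ci": 1.909262},
    {"predicted": 1.488842, "ci": 1.885839},
]

def to_interval(row):
    """Return (lower, upper), assuming `ci` is a symmetric half-width."""
    return (row["predicted"] - row["ci"], row["predicted"] + row["ci"])

intervals = [to_interval(r) for r in rows]
for lower, upper in intervals:
    print(f"[{lower:.3f}, {upper:.3f}]")
```

With a pandas DataFrame, the same bounds would be `df["predicted"] - df["ci"]` and `df["predicted"] + df["ci"]`.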


In [6]:
df.vis[0].render_ipynb()

# Production-Level Models

Production-level models are trained on the entire dataset. Because no held-out split remains, metrics are computed and the error model is trained and fitted via cross-validation. The entire process is run by calling the `.fit_cv()` method.

In [None]:
model = oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000)
error_model = oce.SDC(ci=0.8)

model.fit_cv(dataset.entire_dataset[0], dataset.entire_dataset[1], error_model=error_model, scoring="r2")

The trained error model is likewise stored in `model.error_model`.
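
The general pattern behind cross-validated fitting can be sketched in plain Python. This is an illustration of the idea, not OCE's implementation: each sample is predicted by a model that never saw it, the resulting out-of-fold residuals are what an error model could be fitted on, and the final model is trained on all samples. The helper names and the trivial mean predictor are stand-ins chosen for this example.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fit_mean(y):
    """Stand-in 'model': always predicts the mean of its training labels."""
    return sum(y) / len(y)

def out_of_fold_residuals(y, k=5):
    """Residual of each sample under a model trained without that sample's fold."""
    residuals = [None] * len(y)
    for held_out in k_fold_indices(len(y), k):
        held = set(held_out)
        train_y = [v for i, v in enumerate(y) if i not in held]
        prediction = fit_mean(train_y)
        for i in held_out:
            residuals[i] = y[i] - prediction
    return residuals

y = [float(v) for v in range(10)]
residuals = out_of_fold_residuals(y, k=5)  # material for fitting an error model
final_prediction = fit_mean(y)             # production model uses every sample
print(residuals, final_prediction)
```

Because every residual comes from a held-out prediction, the error estimates are not flattered by the model having memorised its own training data.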