## Different Python versions

This notebook demonstrates what happens when a model trained under one Python version is used for inference in a Databricks notebook that runs a different Python version.

We have user story [SUP-12708](https://allianzdirect.atlassian.net/browse/SUP-12708) open for the Allianz Direct Data Team to address this.

In [0]:
!python -V

Note that we have already trained a model using Python 3.11 and registered it in MLflow. For clarity, here is the training code that was used (it is not run in this notebook).

Train model: 

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_hastie_10_2

# Data and training
X, y = make_hastie_10_2(random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]
clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train)
```

Log model:

```python
import mlflow
from ds_workmate.mlflow import setup_mlflow_env

# Login to MLflow
setup_mlflow_env()
client = mlflow.MlflowClient(registry_uri="databricks")
experiment = client.get_experiment_by_name("/machine-learning-models/ds_ramen")

with mlflow.start_run(experiment_id=experiment.experiment_id):
    mlflow.sklearn.log_model(
        sk_model=clf,
        artifact_path="model",
        code_paths=None,
        signature=None,  # inferred from input_example
        input_example=X_train.sample(1),
        pyfunc_predict_fn="predict_proba",
        pip_requirements=["scikit-learn==1.5.0"],
        registered_model_name="test_tim",
    )
```

We can load this model back and run inference.

In [0]:
# Step 1: Install mlflow so we can interact with it
!pip install mlflow

dbutils.library.restartPython()

In [0]:
# Generate some sample inference data
import pandas as pd
from sklearn.datasets import make_hastie_10_2
X, y = make_hastie_10_2(n_samples=5, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

In [0]:
import mlflow
mlflow.set_registry_uri("databricks")

mlflow.models.predict(model_uri="models:/test_tim/12", input_data=X)

In [0]:
%pyenv install --list


# Databricks


## Library version management

In [0]:
!python -V

### Training

In [0]:
# For demonstration purposes, we pin the scikit-learn version used for training (the inference section below installs a different one)
%pip install "scikit-learn==1.5.0"
dbutils.library.restartPython()

In [0]:
import mlflow

# Disable MLflow autolog so we don't get unnecessary experiment runs
mlflow.autolog(disable=True)

In [0]:
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_hastie_10_2

# Data (the model is trained in the next cell)
X, y = make_hastie_10_2(random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]


In [0]:
import mlflow


# Login to MLflow
#setup_mlflow_env()
#client = mlflow.MlflowClient(registry_uri="databricks")
#experiment = client.get_experiment_by_name("/machine-learning-models/ds_ramen")

catalog = "dbdemos_aj"
schema = "test_schema"
model_name = "test_model"

mlflow.set_registry_uri("databricks-uc")

with mlflow.start_run() as r:

    # run training 
    clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train)

    # log metrics, parameters, schema


    # log model 
    # AJ: Avoid specifying an ML framework version in `pip_requirements` that differs
    # from the version the model was trained with. MLflow automatically logs the
    # version used for training. If you need a framework version other than the one
    # in the environment, run `%pip install -r requirements.txt` before the training run.
    mlflow.sklearn.log_model(
        sk_model=clf,
        artifact_path="model",
        code_paths=None,
        signature=None,  # inferred from input_example
        input_example=X_train.sample(1),
        pyfunc_predict_fn="predict_proba",
        registered_model_name=f"{catalog}.{schema}.{model_name}",
    )
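
If an explicit `pip_requirements` list is needed anyway, one way to keep it consistent with the training environment is to derive the pins from the packages that are actually installed, rather than typing versions by hand. The following is a minimal sketch using only the standard library; `pin_installed`, `fake_env` and `fake_lookup` are hypothetical names introduced for illustration and are not part of MLflow.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_installed(packages, lookup=version):
    """Return 'name==version' pins for packages installed in this environment.

    `lookup` is injectable so the helper can be tried out without the
    packages being present; by default it queries the running environment.
    """
    pins = []
    for name in packages:
        try:
            pins.append(f"{name}=={lookup(name)}")
        except PackageNotFoundError:
            # Not installed here, so there is no truthful version to pin.
            continue
    return pins


# Hypothetical lookup standing in for an environment with scikit-learn 1.5.0.
fake_env = {"scikit-learn": "1.5.0"}


def fake_lookup(name):
    if name not in fake_env:
        raise PackageNotFoundError(name)
    return fake_env[name]


pins = pin_installed(["scikit-learn", "not-a-real-package"], lookup=fake_lookup)
print(pins)
```

In a training notebook, `pin_installed(["scikit-learn"])` (with the default lookup) would yield a list that could be passed as `pip_requirements`, so the logged pin cannot drift from the version used for fitting.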


### Inference 

Inference environments often differ from the training environment. MLflow lets you either bring the model's dependencies into the inference environment (Option 1) or load the model as a generic Python or Spark function, so that the inference code no longer depends on the ML framework's own API (Option 2).

In [0]:
# Install a scikit-learn version that is different from the training env
%pip install "scikit-learn==1.3.0"
dbutils.library.restartPython()


#### Option 1: Use the MLflow model flavour and install dependencies
With `mlflow.pyfunc.get_model_dependencies` we can download the model's `requirements.txt` to the inference environment, install the listed packages there, and then run the prediction.

In [0]:
import mlflow 
import yaml

catalog = "dbdemos_aj"
schema = "test_schema"
model_name = "test_model"
model_version = 3

config = {
    "model_resources": {
        "catalog": catalog,
        "schema": schema,
        "model_name": model_name,
        "model_version": model_version,
        "model_uri": f"models:/{catalog}.{schema}.{model_name}/{model_version}"
    },
}

try:
    with open("config.yaml", "w") as f:
        yaml.dump(config, f)
except Exception as e:
    print(f"Failed to write YAML file: {e}")
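
The config above stores the model URI alongside the parts it is built from, so the two can drift apart if only one is edited. A small sketch of deriving the URI from the parts instead is shown here; `ModelRef` is a hypothetical helper introduced for illustration, and the `models:/<catalog>.<schema>.<name>/<version>` pattern is the one already used in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRef:
    """Three-level Unity Catalog model name plus a version number."""

    catalog: str
    schema: str
    name: str
    version: int

    @property
    def full_name(self) -> str:
        # Same shape as the registered_model_name used when logging the model.
        return f"{self.catalog}.{self.schema}.{self.name}"

    @property
    def uri(self) -> str:
        return f"models:/{self.full_name}/{self.version}"


ref = ModelRef("dbdemos_aj", "test_schema", "test_model", 3)
print(ref.uri)
```

With this approach only the four parts would need to be written to `config.yaml`, and the URI would be rebuilt on load.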

In [0]:
import mlflow
import yaml

mlflow.set_registry_uri("databricks-uc")

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

# Get requirements.txt from model 
model_dependencies = mlflow.pyfunc.get_model_dependencies(config["model_resources"]["model_uri"])

# Install dependencies
%env MODEL_DEPENDENCIES=$model_dependencies
%pip install -r $MODEL_DEPENDENCIES
dbutils.library.restartPython()
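
Before or after installing, it can be useful to see which pins in the downloaded `requirements.txt` differ from what is currently installed. The sketch below assumes the file contains simple `name==version` lines and skips everything else (ranges, options, URLs); `find_mismatches` and the fake environment are hypothetical and only serve the illustration. In the notebook, the path returned by `get_model_dependencies` could be passed in directly.

```python
import tempfile
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def find_mismatches(requirements_path, lookup=version):
    """Compare 'name==version' pins in a requirements file with the env.

    Returns {name: (pinned, installed_or_None)} for every pin that differs.
    Lines that are not plain '==' pins are skipped.
    """
    mismatches = {}
    for raw in Path(requirements_path).read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line or line.startswith("-"):
            continue
        name, pinned = (part.strip() for part in line.split("==", 1))
        name = name.split("[", 1)[0]              # drop extras such as pkg[extra]
        pinned = pinned.split(";", 1)[0].strip()  # drop environment markers
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Hypothetical requirements file and environment, for illustration only.
fake_env = {"mlflow": "2.14.1", "scikit-learn": "1.3.0"}


def fake_lookup(name):
    if name not in fake_env:
        raise PackageNotFoundError(name)
    return fake_env[name]


with tempfile.TemporaryDirectory() as tmp:
    req = Path(tmp) / "requirements.txt"
    req.write_text("mlflow==2.14.1\nscikit-learn==1.5.0  # training pin\n")
    report = find_mismatches(req, lookup=fake_lookup)

print(report)
```

An empty result after the `%pip install` step would suggest that the inference environment matches the pinned versions.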

In [0]:
import mlflow
import yaml

mlflow.set_registry_uri("databricks-uc")

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

# AJ: Make sure to use the right model flavour when loading the model. 
sklearn_model = mlflow.sklearn.load_model(model_uri=config["model_resources"]["model_uri"])

In [0]:
from sklearn.datasets import make_hastie_10_2
import pandas as pd

# Generate sample inference data
X, y = make_hastie_10_2(random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

# Use sklearn model to predict
sklearn_model.predict(X)

#### Option 2: Use generic mlflow pyfunc flavour

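MLflow can load a model as a generic Python function (`pyfunc`), even when it was logged with a framework-specific flavour. The inference code then calls only the generic `predict` interface and does not need to know which framework produced the model. Note that loading a scikit-learn model this way still unpickles it, so a compatible scikit-learn version has to be installed in the inference environment.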

In [0]:
import mlflow
import yaml

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

pyfunc_model = mlflow.pyfunc.load_model(model_uri=config["model_resources"]["model_uri"])

In [0]:
import pandas as pd
from sklearn.datasets import make_hastie_10_2

# Generate sample inference data
X, y = make_hastie_10_2(random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

# Use the pyfunc model to predict
pyfunc_model.predict(X)
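
One practical difference between the two options: the model was logged with `pyfunc_predict_fn="predict_proba"`, so `pyfunc_model.predict` is expected to return class probabilities, whereas `sklearn_model.predict` returns class labels. The toy sketch below illustrates the idea of a single generic `predict` that dispatches to a configured method. It is a conceptual illustration only, not MLflow's implementation; `GenericModel` and `ToyClassifier` are hypothetical.

```python
class GenericModel:
    """Toy wrapper exposing one predict() that dispatches to a chosen method."""

    def __init__(self, model, predict_fn="predict"):
        # Resolve the target method once, by name, at wrap time.
        self._call = getattr(model, predict_fn)

    def predict(self, data):
        return self._call(data)


class ToyClassifier:
    """Hypothetical stand-in for a fitted binary classifier with labels -1 / 1."""

    def predict(self, rows):
        return [1 if x > 0 else -1 for x in rows]

    def predict_proba(self, rows):
        return [[0.1, 0.9] if x > 0 else [0.9, 0.1] for x in rows]


toy = ToyClassifier()
wrapped = GenericModel(toy, predict_fn="predict_proba")

labels = toy.predict([2.0, -1.0])
probabilities = wrapped.predict([2.0, -1.0])
print(labels)
print(probabilities)
```

Callers of the wrapped model never mention `predict_proba`; the choice was fixed when the model was wrapped, which mirrors how the choice is fixed at logging time in this notebook.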


## Pyenv management

We recommend managing library versions rather than entire Python environments (see the section on library version management above) and upgrading the Databricks Runtime (DBR) if necessary. Below is an example that uses a DBR with an older Python version for inference. Inference works properly as long as we can install the right version of scikit-learn.

In [0]:
!python -V

In [0]:
%pip install mlflow --upgrade
dbutils.library.restartPython()

In [0]:
import mlflow
import yaml

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

mlflow.set_registry_uri("databricks-uc")

# Get requirements.txt from model 
model_dependencies = mlflow.pyfunc.get_model_dependencies(model_uri=config["model_resources"]["model_uri"])

# Install dependencies
%env MODEL_DEPENDENCIES=$model_dependencies
%pip install -r $MODEL_DEPENDENCIES
dbutils.library.restartPython()

In [0]:
import mlflow
import yaml

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

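# The registry URI has to be set again after the Python restart
mlflow.set_registry_uri("databricks-uc")

sklearn_model = mlflow.sklearn.load_model(model_uri=config["model_resources"]["model_uri"])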

In [0]:
from sklearn.datasets import make_hastie_10_2
import pandas as pd

# Generate sample inference data
X, y = make_hastie_10_2(random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

# AJ: The model was trained with a different Python version. As long as we install
# the correct scikit-learn version, inference still runs, even though the inference
# env uses a different Python version. To avoid side effects, keep the training and
# inference Python versions close together.
sklearn_model.predict(X)
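
To make "close together" checkable, the training and inference Python versions can be compared programmatically. The following is a minimal sketch under stated assumptions: version strings are plain numeric (`3.11.4`, not `3.12.0rc1`), and the `max_minor_gap` threshold is an arbitrary choice for this example. MLflow records the training Python version in the model's `MLmodel` file under the `python_function` flavour; how that string is retrieved is left out here.

```python
import sys


def parse_version(text):
    """Turn '3.11.4' into (3, 11, 4); missing parts default to 0."""
    parts = [int(p) for p in text.strip().split(".")[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def compare_python(trained, running=None, max_minor_gap=1):
    """Classify the gap between training and inference Python versions.

    If `running` is omitted, the current interpreter's version is used.
    `max_minor_gap` is an arbitrary threshold chosen for this sketch.
    """
    t = parse_version(trained)
    r = parse_version(running) if running else tuple(sys.version_info[:3])
    if t[0] != r[0]:
        return "major mismatch"
    gap = abs(t[1] - r[1])
    if gap == 0:
        return "same minor"
    return "close" if gap <= max_minor_gap else "far apart"


print(compare_python("3.11.4", "3.10.12"))
print(compare_python("3.11.4", "3.8.10"))
```

Calling `compare_python("3.11.4")` without a second argument compares against the interpreter the notebook is running on.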