When we come up with a new model, we want to ask some questions. What has changed from the previous version of the model to the new one? Is any preprocessing needed? What extra libraries do we need to run the new model?

And what if we face issues when running the new model in production and have to roll back to the old one? Then we need to know where the old model is stored.

When doing an ML task, we use the MLflow Tracking Server to log parameters, metrics, artifacts, and the different model versions.

Once we believe some of those models are fit for production, we register them ("register model") in the MLflow Model Registry.

The MLflow Model Registry is the place where we store production-ready models. So whenever a deployment engineer wants to update the models, they can look at the Model Registry to find the new prod-ready models.

The MLflow Model Registry component is a centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of an MLflow Model. It provides model lineage (which MLflow experiment and run produced the model), model versioning, model aliasing, model tagging, and annotations.

The Model Registry does not deploy models; it only stores the models that are prod ready.

### Understanding How MLflow Saves Data

MLflow has two types of stores:

**Backend Store**
- In this store, MLflow keeps the metadata of experiments, such as parameters, metrics, and tags.
- By default, this is stored on the local filesystem in a folder named `mlruns`. Inside `mlruns` there is one folder per experiment, named after the experiment ID. The default experiment has ID 0, so its folder is `0`; with two experiments you will have two folders, `0` and `1`, where 0 and 1 are the experiment IDs.
- Within each of these folders there is a `meta.yaml` file that records the artifact location, the experiment_id, the experiment name (for folder 0 the name is Default), and so on.
- For each run you create, a uniquely named folder is created inside that experiment's folder (i.e. inside `0`, `1`, ...). This run folder contains the artifacts folder with the model, along with the logged parameters, metrics, and tags.
- We can also configure MLflow to use a SQLAlchemy-compatible database (e.g. SQLite, Postgres) as the backend store.

**Artifacts Store**
- Here MLflow stores all the artifacts: the dataset used for training, the model itself, and the configuration files needed to run the model, such as conda.yaml and requirements.txt.
- Again, by default this is stored on the local filesystem.
- We can also configure MLflow to store artifacts in a remote location such as an Amazon S3 bucket.

### Tracking Experiments with a Local Database

Until now we have used local files. Now we will use a local database, SQLite, and store the tracking information there.

We use the following CLI command: `mlflow ui --port 8080 --backend-store-uri sqlite:///mlruns.db`

For a custom artifact location, we can use:

`mlflow ui --port 8080 --backend-store-uri sqlite:///mlruns.db --default-artifact-root ./artifacts_local`

The above commands assume you run them from the mlops-learning folder, because the backend store and artifact paths are given relative to that folder.

In [2]:
# Set the tracking URI through an environment variable.
# The notebook runs inside 02-mlfow, so the database is referenced as ../mlruns.db
%env MLFLOW_TRACKING_URI=sqlite:///../mlruns.db

env: MLFLOW_TRACKING_URI=sqlite:///../mlruns.db


`../mlruns.db` means: create or use the SQLite database in the parent folder.
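The relative-path behaviour can be checked with a few lines of plain Python. This is only a sketch of how a `sqlite:///` URI maps to a file, not MLflow's own code: `sqlite_uri_to_path` is a hypothetical helper, and `/home/user` and the `notebooks` subfolder are made-up locations. It shows that the URI used by the CLI from the project folder and the `../` URI used from a subfolder point to the same file.

```python
import posixpath
from pathlib import PurePosixPath


def sqlite_uri_to_path(uri: str, working_dir: str) -> PurePosixPath:
    """Return the file a sqlite:/// URI points at, as seen from working_dir."""
    prefix = "sqlite:///"
    if not uri.startswith(prefix):
        raise ValueError(f"not a sqlite file URI: {uri!r}")
    raw_path = uri[len(prefix):]
    # An absolute raw_path (from sqlite:////abs/file.db) replaces working_dir.
    joined = PurePosixPath(working_dir) / raw_path
    # normpath collapses the ".." segment without touching the filesystem.
    return PurePosixPath(posixpath.normpath(str(joined)))


# Hypothetical locations, used only for illustration.
project_dir = "/home/user/mlops-learning"
notebook_dir = "/home/user/mlops-learning/notebooks"

from_cli = sqlite_uri_to_path("sqlite:///mlruns.db", project_dir)
from_notebook = sqlite_uri_to_path("sqlite:///../mlruns.db", notebook_dir)
print(from_cli)       # /home/user/mlops-learning/mlruns.db
print(from_notebook)  # /home/user/mlops-learning/mlruns.db
```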

In [1]:
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:8080")
mlflow.get_tracking_uri()

'http://127.0.0.1:8080'

In [2]:
mlflow.search_experiments()

[<Experiment: artifact_location='/home/topisano/Desktop/projects/mlops-learning/artifacts_local/0', creation_time=1725110293634, experiment_id='0', last_update_time=1725110293634, lifecycle_stage='active', name='Default', tags={}>]

We can see that the artifact_location has changed to the location that we specified.

Now we will start logging information.

In [3]:
import mlflow

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
# from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

mlflow.set_experiment('mysql-experiment')

mlflow.sklearn.autolog()

db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

# Create and train the model.
lr = LinearRegression()
lr.fit(X_train, y_train)

# Use the model to make predictions on the test dataset.
predictions = lr.predict(X_test)

2024/08/31 18:49:20 INFO mlflow.tracking.fluent: Experiment with name 'mysql-experiment' does not exist. Creating a new experiment.
2024/08/31 18:49:20 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'e37d1f74668c41c98c6da5674a07abc4', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2024/08/31 18:49:22 INFO mlflow.tracking._tracking_service.client: 🏃 View run bald-snail-320 at: http://127.0.0.1:8080/#/experiments/1/runs/e37d1f74668c41c98c6da5674a07abc4.
2024/08/31 18:49:22 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://127.0.0.1:8080/#/experiments/1.


In [4]:
mlflow.search_experiments()

[<Experiment: artifact_location='/home/topisano/Desktop/projects/mlops-learning/artifacts_local/1', creation_time=1725110360508, experiment_id='1', last_update_time=1725110360508, lifecycle_stage='active', name='mysql-experiment', tags={}>,
 <Experiment: artifact_location='/home/topisano/Desktop/projects/mlops-learning/artifacts_local/0', creation_time=1725110293634, experiment_id='0', last_update_time=1725110293634, lifecycle_stage='active', name='Default', tags={}>]