M2: Process and Tooling
Objective: Gain hands-on experience with popular MLOps tools and
understand the processes they support.
Tasks:
1. Experiment Tracking:
• Use MLflow to track experiments for a machine learning project.
• Record metrics, parameters, and results of at least three different model
training runs.
2. Data Versioning:
• Use DVC (Data Version Control) to version control a dataset used in your
project.
• Show how to revert to a previous version of the dataset.
Deliverables:
• MLflow experiment logs with different runs and their results.
• A DVC repository showing different versions of the dataset.

In [1]:
# Step 1: Install MLflow
!pip install mlflow


Collecting mlflow
  Obtaining dependency information for mlflow from https://files.pythonhosted.org/packages/76/d9/6ec0b635fcd0eb46d5a671bb5f350defb656a3b925bd780786f9fac0da12/mlflow-2.15.0-py3-none-any.whl.metadata
  Downloading mlflow-2.15.0-py3-none-any.whl.metadata (29 kB)
Collecting mlflow-skinny==2.15.0 (from mlflow)
  Obtaining dependency information for mlflow-skinny==2.15.0 from https://files.pythonhosted.org/packages/a2/0e/4dab2a4a1eba4f05de80cf7f21e5ac166dc4ee338aaf6878d1b3c570c0e4/mlflow_skinny-2.15.0-py3-none-any.whl.metadata
  Downloading mlflow_skinny-2.15.0-py3-none-any.whl.metadata (30 kB)
Collecting alembic!=1.10.0,<2 (from mlflow)
  Obtaining dependency information for alembic!=1.10.0,<2 from https://files.pythonhosted.org/packages/df/ed/c884465c33c25451e4a5cd4acad154c29e5341e3214e220e7f3478aa4b0d/alembic-1.13.2-py3-none-any.whl.metadata
  Downloading alembic-1.13.2-py3-none-any.whl.metadata (7.4 kB)
Collecting docker<8,>=4.0.0 (from mlflow)
  Obtaining dependency in

In [4]:
# Step 2: Set Up MLflow
import mlflow
import mlflow.sklearn

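# Set the MLflow tracking URI. "/mlruns" is an absolute path at the root of the
# filesystem (or of the current drive on Windows); change it to your desired path.
mlflow.set_tracking_uri("/mlruns")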
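
A project-local store is often easier to find than one at the filesystem root. As a minimal sketch, assuming a folder named `mlruns` under the current working directory is wanted, the tracking URI could be built with the standard library. The helper name `local_tracking_uri` is hypothetical; only the `mlflow.set_tracking_uri` call mentioned in the comment comes from the notebook.

```python
from pathlib import Path


def local_tracking_uri(base_dir=".", folder="mlruns"):
    """Return a file:// URI for <base_dir>/<folder> (hypothetical helper)."""
    # resolve() makes the path absolute, which as_uri() requires.
    store = Path(base_dir).resolve() / folder
    return store.as_uri()


tracking_uri = local_tracking_uri()
print(tracking_uri)
# In the notebook this value would then be passed on, e.g.:
# mlflow.set_tracking_uri(tracking_uri)
```

Using the same URI for the tracking calls and for the UI keeps both pointed at one store.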


In [5]:
# Step 3: Define Your Experiment
experiment_name = "Iris_Classification_Experiment"
mlflow.set_experiment(experiment_name)


<Experiment: artifact_location='file:///C:/path/to/mlruns/114706818427449650', creation_time=1722866190330, experiment_id='114706818427449650', last_update_time=1722866190330, lifecycle_stage='active', name='Iris_Classification_Experiment', tags={}>

In [6]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define and train model
with mlflow.start_run():
    rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
    rf.fit(X_train, y_train)
    
    # Predict and evaluate
    predictions = rf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    
    # Log parameters, metrics, and model
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 3)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(rf, "model")

    print(f"Run ID: {mlflow.active_run().info.run_id}")




Run ID: 5a84ed2ed2f048c0a1357971a3f7ea3e


In [7]:
# Run 2: same number of trees as the first run, but deeper trees (max_depth=5)
with mlflow.start_run():
    rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
    rf.fit(X_train, y_train)
    
    # Predict and evaluate
    predictions = rf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    
    # Log parameters, metrics, and model
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(rf, "model")

    print(f"Run ID: {mlflow.active_run().info.run_id}")




Run ID: 65fb702347e142d2a502f458e13d1c6f


In [8]:
# Run 3: more trees (n_estimators=150) and deeper trees (max_depth=7)
with mlflow.start_run():
    rf = RandomForestClassifier(n_estimators=150, max_depth=7, random_state=42)
    rf.fit(X_train, y_train)
    
    # Predict and evaluate
    predictions = rf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    
    # Log parameters, metrics, and model
    mlflow.log_param("n_estimators", 150)
    mlflow.log_param("max_depth", 7)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(rf, "model")

    print(f"Run ID: {mlflow.active_run().info.run_id}")




Run ID: 11ba4c56347f43d6b1d698342e714587


In [9]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import mlflow
import mlflow.sklearn

# Set the MLflow tracking URI
mlflow.set_tracking_uri("/mlruns")  # Change to your desired path

# Define your experiment
experiment_name = "Iris_Classification_Experiment"
mlflow.set_experiment(experiment_name)

# Load Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define different parameter sets
params_list = [
    {"n_estimators": 50, "max_depth": 3},
    {"n_estimators": 100, "max_depth": 5},
    {"n_estimators": 150, "max_depth": 7}
]

for params in params_list:
    with mlflow.start_run():
        rf = RandomForestClassifier(
            n_estimators=params["n_estimators"],
            max_depth=params["max_depth"],
            random_state=42,
        )
        rf.fit(X_train, y_train)
        
        # Predict and evaluate
        predictions = rf.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        
        # Log parameters, metrics, and model
        mlflow.log_param("n_estimators", params["n_estimators"])
        mlflow.log_param("max_depth", params["max_depth"])
        mlflow.log_metric("accuracy", accuracy)
        mlflow.sklearn.log_model(rf, "model")

        print(f"Run ID: {mlflow.active_run().info.run_id}")




Run ID: 21a14de172ee45e38932eb206010e864




Run ID: 0128e6b49afb403c953bae592caa2cb4




Run ID: c2cff3c8c74a472ab8936f720cf3e940


In [10]:
import subprocess

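# Start the MLflow UI in the background (served at http://127.0.0.1:5000 by default).
# Note: `mlflow ui` reads ./mlruns unless --backend-store-uri is given, so pass the
# tracking URI set earlier if the runs do not show up.
subprocess.Popen(["mlflow", "ui"])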


<Popen: returncode: None args: ['mlflow', 'ui']>

In [11]:
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name(experiment_name)
runs = client.search_runs(experiment_ids=[experiment.experiment_id])

for run in runs:
    print(f"Run ID: {run.info.run_id}")
    print(f"Parameters: {run.data.params}")
    print(f"Metrics: {run.data.metrics}")
    print("-" * 80)


Run ID: c2cff3c8c74a472ab8936f720cf3e940
Parameters: {'max_depth': '7', 'n_estimators': '150'}
Metrics: {'accuracy': 1.0}
--------------------------------------------------------------------------------
Run ID: 0128e6b49afb403c953bae592caa2cb4
Parameters: {'max_depth': '5', 'n_estimators': '100'}
Metrics: {'accuracy': 1.0}
--------------------------------------------------------------------------------
Run ID: 21a14de172ee45e38932eb206010e864
Parameters: {'max_depth': '3', 'n_estimators': '50'}
Metrics: {'accuracy': 1.0}
--------------------------------------------------------------------------------
Run ID: 11ba4c56347f43d6b1d698342e714587
Parameters: {'max_depth': '7', 'n_estimators': '150'}
Metrics: {'accuracy': 1.0}
--------------------------------------------------------------------------------
Run ID: 65fb702347e142d2a502f458e13d1c6f
Parameters: {'max_depth': '5', 'n_estimators': '100'}
Metrics: {'accuracy': 1.0}
--------------------------------------------------------------------------------
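
The printed runs can also be compared programmatically. The sketch below assumes the run data has been copied into plain dictionaries shaped like the output above; the `runs` list and the `best_run` helper are hypothetical and not part of the MLflow API. MLflow returns parameter values as strings, so they are converted to integers before comparison. Because every run here reaches the same accuracy, ties are broken in favor of the smallest model, which is only one possible design choice.

```python
# Run records copied from the printed output above (params are strings in MLflow).
runs = [
    {"run_id": "c2cff3c8c74a472ab8936f720cf3e940",
     "params": {"max_depth": "7", "n_estimators": "150"},
     "metrics": {"accuracy": 1.0}},
    {"run_id": "0128e6b49afb403c953bae592caa2cb4",
     "params": {"max_depth": "5", "n_estimators": "100"},
     "metrics": {"accuracy": 1.0}},
    {"run_id": "21a14de172ee45e38932eb206010e864",
     "params": {"max_depth": "3", "n_estimators": "50"},
     "metrics": {"accuracy": 1.0}},
]


def best_run(run_records, metric="accuracy"):
    """Pick the highest metric; break ties by fewest trees, then shallowest depth."""
    def sort_key(run):
        params = run["params"]
        return (
            -run["metrics"][metric],        # higher metric sorts first
            int(params["n_estimators"]),    # then the smaller forest
            int(params["max_depth"]),       # then the shallower trees
        )
    return min(run_records, key=sort_key)


winner = best_run(runs)
print(winner["run_id"], winner["params"])
```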

# Part 2: Data Versioning with DVC

Step 1: Install and Initialize DVC

In [16]:
!pip install dvc




In [17]:
!git init
!dvc init


Reinitialized existing Git repository in C:/Users/user/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


Step 2: Version Control the Dataset

In [18]:
from sklearn.datasets import load_iris
import pandas as pd
import os

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

os.makedirs('data', exist_ok=True)
df.to_csv('data/iris_dataset.csv', index=False)


In [19]:
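!dvc add data/iris_dataset.csv
# DVC writes its ignore rule to data/.gitignore, so stage that file; a bare `.gitignore`
# does not exist at the repository root and makes `git add` fail with a pathspec error.
!git add data/iris_dataset.csv.dvc data/.gitignore
!git commit -m "Add Iris dataset"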



To track the changes with git, run:

	git add 'data\iris_dataset.csv.dvc' 'data\.gitignore'

To enable auto staging, run:

	dvc config core.autostage true


Checking graph

fatal: pathspec '.gitignore' did not match any files


[master (root-commit) e748fea] Add Iris dataset
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore


Step 3: Simulate and Track Changes


In [20]:
# Simulate a data change: shift every class label by 1 and save the result as a second file
df['target'] = df['target'] + 1
df.to_csv('data/iris_dataset_v2.csv', index=False)


In [21]:
!dvc add data/iris_dataset_v2.csv
!git add data/iris_dataset_v2.csv.dvc
!git commit -m "Add updated Iris dataset"



To track the changes with git, run:

	git add 'data\.gitignore' 'data\iris_dataset_v2.csv.dvc'

To enable auto staging, run:

	dvc config core.autostage true


Checking graph



[master 43dd90a] Add updated Iris dataset
 1 file changed, 5 insertions(+)
 create mode 100644 data/iris_dataset_v2.csv.dvc


Step 4: Revert to a Previous Version

In [22]:
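# `dvc checkout` syncs the data files with the .dvc files of the current Git commit.
# To go back to an earlier dataset version, first restore the older .dvc file with Git
# (git checkout <commit> -- <file>.dvc), then run `dvc checkout`.
!dvc checkout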


In [23]:
!dvc status


Data and pipelines are up to date.
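
One way to return to an earlier dataset version is to restore the older `.dvc` file with Git and then let DVC fetch the matching data. The sketch below builds that two-step command sequence. The helper names and the `HEAD~1` revision are assumptions for illustration, and nothing is executed unless `run_commands` is called from inside a Git and DVC repository where that revision contains the `.dvc` file.

```python
import subprocess


def revert_commands(dvc_file, revision="HEAD~1"):
    """Return the Git and DVC commands that restore one tracked file (hypothetical helper)."""
    return [
        # Step 1: bring back the .dvc pointer file as it was at `revision`.
        ["git", "checkout", revision, "--", dvc_file],
        # Step 2: let DVC replace the data file with the version that pointer names.
        ["dvc", "checkout", dvc_file],
    ]


def run_commands(commands):
    """Execute each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


commands = revert_commands("data/iris_dataset.csv.dvc")
for command in commands:
    print(" ".join(command))
```

Keeping command construction separate from execution makes it possible to review the sequence before anything in the workspace is changed.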
