### Name: Riya Shyam Huddar
### Roll no: MDS202431
### Applied Machine Learning Assignment 2

In [None]:
# Install
!pip install mlflow

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [5]:
# Imports
import os
import sys
import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score
import mlflow
import mlflow.sklearn

### Pull Data from Remote Storage

In a separate training environment, we retrieve the latest
version of the dataset splits using `dvc pull`.


This demonstrates decoupling of storage (remote) and compute (local).

In [4]:
import sys

# Pull latest version from remote storage (Google Drive)
!"{sys.executable}" -m dvc pull

Everything is up to date.


### Step 1: Load Dataset and Verify Version

After running `dvc pull`, the train, validation, and test splits are restored based on the currently checked-out Git commit.

The active commit (`ee9794a`) corresponds to the split generated using seed=99.  
All experiments will therefore be trained on this dataset version.  

We can switch to another version later using `git checkout` and `dvc checkout`.
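
The version switch described above can also be wrapped in a small helper. This is a minimal sketch, not part of the original workflow: `switch_data_version` and its `dry_run` flag are hypothetical names introduced for illustration. By default it only returns the commands, so it can be inspected safely outside a Git/DVC repository.

```python
import subprocess
import sys


def switch_data_version(commit, dry_run=True):
    """Build (and optionally run) the commands that restore one dataset version.

    Hypothetical helper: `git checkout` moves the .dvc pointer files to the
    given commit, then `dvc checkout` restores the matching data files.
    """
    commands = [
        ["git", "checkout", commit],
        [sys.executable, "-m", "dvc", "checkout"],
    ]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if a command fails
            subprocess.run(cmd, check=True)
    return commands


# Dry run: print the commands for the seed=99 commit used in this notebook
for cmd in switch_data_version("ee9794a"):
    print(" ".join(cmd))
```

Passing `dry_run=False` would execute both commands in order, which assumes the notebook is running inside the repository with DVC installed.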


In [7]:
!git checkout ee9794a   
!"{sys.executable}" -m dvc checkout

M	.dvc/config
M	prepare.ipynb
M	train.ipynb


HEAD is now at ee9794a First data split version (seed=99)


In [9]:
!git log -1 --oneline

ee9794a First data split version (seed=99)


In [11]:
train = pd.read_csv("train.csv")
val = pd.read_csv("validation.csv")
test = pd.read_csv("test.csv")

In [13]:
print("Train shape:", train.shape)
print("Validation shape:", val.shape)
print("Test shape:", test.shape)

print("\nTrain distribution:")
print(train["target"].value_counts())

print("\nValidation distribution:")
print(val["target"].value_counts())

print("\nTest distribution:")
print(test["target"].value_counts())


Train shape: (3609, 2)
Validation shape: (773, 2)
Test shape: (774, 2)

Train distribution:
target
0    3160
1     449
Name: count, dtype: int64

Validation distribution:
target
0    677
1     96
Name: count, dtype: int64

Test distribution:
target
0    678
1     96
Name: count, dtype: int64


In [15]:
train.head()

Unnamed: 0,message,target
0,i fetch yun or u fetch?,0
1,urgent! you have won a 1 week free membership ...,1
2,i couldn t say no as he is a dying man and i f...,0
3,and smile for me right now as you go and the w...,0
4,wait 4 me in sch i finish ard 5,0


---

### Step 2: Configure MLflow

We configure MLflow to use a local SQLite backend for experiment tracking.
An experiment named `SMS_Spam_Classification` is created (or reused)
to log model runs, parameters, and evaluation metrics.


In [19]:
# Use SQLite backend instead of filesystem
mlflow.set_tracking_uri("sqlite:///mlflow.db")

# Create / set experiment
EXPERIMENT_NAME = "SMS_Spam_Classification"

mlflow.set_experiment(EXPERIMENT_NAME)

print("Tracking URI:", mlflow.get_tracking_uri())
print("Experiment set:", EXPERIMENT_NAME)


Tracking URI: sqlite:///mlflow.db
Experiment set: SMS_Spam_Classification


In [21]:
# Features and labels
X_train = train["message"]
y_train = train["target"]

X_val = val["message"]
y_val = val["target"]

X_test = test["message"]
y_test = test["target"]

### Feature Engineering (TF-IDF)

We convert text messages into numerical feature vectors using TF-IDF.

- The vectorizer is fit only on the training data to prevent data leakage.
- The same fitted vectorizer is used to transform validation and test sets.
- We limit the vocabulary to 5000 features for efficiency.
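
The leakage point in the list above can be illustrated with a simplified, pure-Python sketch. It is not scikit-learn's implementation: it uses whitespace tokenization and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and it omits stop-word removal, L2 normalization, and the `max_features` cap. The function names and the toy messages are assumptions made for illustration.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Count each term once per document
        doc_freq.update(set(doc.lower().split()))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def transform(doc, idf):
    """Weight term counts by the fitted IDF; terms unseen in training are dropped."""
    counts = Counter(doc.lower().split())
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


train_docs = ["free prize now", "see you now", "call me now"]
idf = fit_idf(train_docs)

# "lottery" never appeared in training, so it contributes no feature
val_vector = transform("free lottery now", idf)
print(sorted(val_vector))
```

Because the vocabulary and IDF weights come from the training messages alone, nothing about the validation or test text influences the features the classifier sees during fitting.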


In [23]:
# Initialize vectorizer
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)

----

### Step 3: Train and Track Benchmark Models with MLflow

We train three benchmark models (MultinomialNB, Logistic Regression, and Linear SVM) and use MLflow to log their parameters and validation AUCPR and to register each fitted pipeline.

**Validation AUCPR:**

- MultinomialNB: 0.9717  
- LogisticRegression: 0.9657  
- LinearSVC: 0.9759  

LinearSVC achieved the highest validation AUCPR of the three models.
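
The AUCPR values in this notebook come from `average_precision_score`, which summarizes the precision-recall curve as a recall-weighted sum of precisions. As a hand-checkable sketch, and assuming all scores are distinct (ties are not handled here), this reduces to the mean of the precision measured at the rank of each positive example. The function name and the toy labels and scores below are illustrative only.

```python
def average_precision(y_true, y_scores):
    """Average precision for binary labels, assuming all scores are distinct."""
    # Rank examples from highest to lowest score
    ranked = sorted(zip(y_scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
print(round(average_precision(y_true, y_scores), 4))  # 0.8333
```

In the toy example the two positives sit at ranks 1 and 3, giving precisions 1/1 and 2/3, whose mean is 0.8333.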


In [25]:
# Define benchmark models
models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC()
}

print("Models defined:")
for name in models:
    print("-", name)


Models defined:
- MultinomialNB
- LogisticRegression
- LinearSVC


In [27]:
from sklearn.pipeline import Pipeline

def train_and_log_model(model_name, model, X_train, y_train, X_val, y_val, seed, vectorizer):
    """Fit a vectorizer + classifier pipeline, log it to MLflow, and return validation AUCPR."""
    with mlflow.start_run(run_name=model_name):

        # Create full pipeline (vectorizer + classifier)
        pipeline = Pipeline([
            ("vectorizer", vectorizer),
            ("classifier", model)
        ])
        
      
        # Log metadata for reproducibility
        mlflow.log_param("model_name", model_name)
        mlflow.log_param("data_seed", seed)
        
        # Log classifier hyperparameters
        mlflow.log_params(model.get_params())
        
        # Log vectorizer hyperparameters (prefixed)
        mlflow.log_params({
            f"vec_{k}": v for k, v in vectorizer.get_params().items()
        })
        
       
        # Train Full pipeline on raw text
        pipeline.fit(X_train, y_train)
        
      
        # Compute AUCPR
        if hasattr(pipeline.named_steps["classifier"], "predict_proba"):
            y_scores = pipeline.predict_proba(X_val)[:, 1]
        else:
            y_scores = pipeline.decision_function(X_val)
        
        aucpr = average_precision_score(y_val, y_scores)
        
        # Log metric
        mlflow.log_metric("AUCPR", aucpr)
        
        
        # Log Full pipeline
        mlflow.sklearn.log_model(
            sk_model=pipeline,
            name="model",
            registered_model_name=model_name
        )
        
        print(f"{model_name} AUCPR: {aucpr:.4f}")
    
    return aucpr
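

The scoring step in the function above picks `predict_proba` when the classifier provides it and falls back to `decision_function` otherwise (as LinearSVC requires). The same branch can be factored into one helper, sketched below. `get_scores` and the two stub classes are hypothetical and exist only so the example runs without a fitted model; with a real scikit-learn pipeline the notebook's `[:, 1]` slicing does the same job as the list comprehension here.

```python
def get_scores(model, X):
    """Return continuous positive-class scores from a fitted model."""
    if hasattr(model, "predict_proba"):
        # Column 1 holds the probability of the positive class
        return [row[1] for row in model.predict_proba(X)]
    # Margin-based models expose signed distances instead of probabilities
    return list(model.decision_function(X))


class ProbaStub:
    """Hypothetical stand-in for a classifier exposing predict_proba."""

    def predict_proba(self, X):
        return [[0.9, 0.1] for _ in X]


class MarginStub:
    """Hypothetical stand-in for a classifier exposing only decision_function."""

    def decision_function(self, X):
        return [-1.5 for _ in X]


print(get_scores(ProbaStub(), ["a", "b"]))   # [0.1, 0.1]
print(get_scores(MarginStub(), ["a", "b"]))  # [-1.5, -1.5]
```

Average precision depends only on the ranking of the scores, so probabilities and decision margins can both be passed to it directly.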


In [29]:
# Store AUCPR results
results = {}

# Train and log each model
for name, model in models.items():
    aucpr = train_and_log_model(
        name,
        model,
        X_train,
        y_train,
        X_val,
        y_val,
        seed=99,
        vectorizer=vectorizer,
    )
    results[name] = aucpr

Successfully registered model 'MultinomialNB'.
Created version '1' of model 'MultinomialNB'.


MultinomialNB AUCPR: 0.9717


Successfully registered model 'LogisticRegression'.
Created version '1' of model 'LogisticRegression'.


LogisticRegression AUCPR: 0.9657
LinearSVC AUCPR: 0.9759


Successfully registered model 'LinearSVC'.
Created version '1' of model 'LinearSVC'.


---

### Step 4: Retrieve and Compare Logged Metrics from MLflow

We query MLflow to retrieve all runs under the
`SMS_Spam_Classification` experiment and sort them
by AUCPR in descending order.

This ensures that model comparison is based on
logged experiment data rather than printed outputs.


In [31]:
# Get all runs from this experiment
experiment = mlflow.get_experiment_by_name("SMS_Spam_Classification")

runs = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.AUCPR DESC"]
)

# Display relevant columns
summary = runs[
    ["tags.mlflow.runName", "metrics.AUCPR", "params.data_seed", "params.C", "params.alpha"]
]

print("Model Comparison (Sorted by AUCPR):")
display(summary)


Model Comparison (Sorted by AUCPR):


Unnamed: 0,tags.mlflow.runName,metrics.AUCPR,params.data_seed,params.C,params.alpha
0,LinearSVC,0.975947,99,1.0,
1,MultinomialNB,0.971662,99,,1.0
2,LogisticRegression,0.965655,99,1.0,


### Retrieve AUCPR from Registered Models

Using the MLflow Model Registry, we fetch the latest
registered version of each benchmark model and retrieve
its logged AUCPR metric from the associated run.

This confirms that model performance is reproducible
and traceable through MLflow.


In [33]:
from mlflow.tracking import MlflowClient

client = MlflowClient()

model_names = ["MultinomialNB", "LogisticRegression", "LinearSVC"]

print("Registered Model AUCPR Scores:\n")

for name in model_names:
    
    # Get latest version
    latest_version = client.get_latest_versions(name, stages=None)[0]
    
    run_id = latest_version.run_id
    
    run = client.get_run(run_id)
    
    aucpr = run.data.metrics["AUCPR"]
    
    print(f"{name} (Version {latest_version.version}) AUCPR: {aucpr:.4f}")


Registered Model AUCPR Scores:

MultinomialNB (Version 1) AUCPR: 0.9717
LogisticRegression (Version 1) AUCPR: 0.9657
LinearSVC (Version 1) AUCPR: 0.9759


---

### Step 5: Evaluate Registered Models on the Test Set (AUCPR)

We load the latest registered version of each benchmark model
from the MLflow Model Registry and evaluate performance on
the unseen test set using AUCPR.

**Test Set AUCPR:**

- MultinomialNB: 0.9420  
- LogisticRegression: 0.9483  
- LinearSVC: 0.9610  

LinearSVC achieves the highest test AUCPR, confirming it as the best-performing model on unseen data.


In [37]:
import mlflow.sklearn
from sklearn.metrics import average_precision_score

registered_models = ["MultinomialNB", "LogisticRegression", "LinearSVC"]

print("Test Set Evaluation (Version 1-seed=99):\n")

for model_name in registered_models:
    
    model_uri = f"models:/{model_name}/latest"
    model = mlflow.sklearn.load_model(model_uri)
    
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        scores = model.decision_function(X_test)
    
    aucpr = average_precision_score(y_test, scores)
    
    print(f"{model_name} Test AUCPR: {aucpr:.4f}")


Test Set Evaluation (Version 1-seed=99):

MultinomialNB Test AUCPR: 0.9420
LogisticRegression Test AUCPR: 0.9483
LinearSVC Test AUCPR: 0.9610


----

### Training on Updated Dataset Version (seed=17)

We switch to the dataset generated using seed=17 and
repeat the training and evaluation process under the
same MLflow experiment.

New runs are logged for each model, and the MLflow Model
Registry automatically creates Version 2 of each model,
demonstrating model evolution across dataset versions.


In [39]:
!git checkout 955b170  
!"{sys.executable}" -m dvc checkout

M	.dvc/config
M	prepare.ipynb
M	train.ipynb


Previous HEAD position was ee9794a First data split version (seed=99)
HEAD is now at 955b170 Updated split version (seed=17)


M       test.csv
M       train.csv
M       validation.csv


In [41]:
train = pd.read_csv("train.csv")
val = pd.read_csv("validation.csv")
test = pd.read_csv("test.csv")

print("Train shape:", train.shape)


Train shape: (3609, 2)


In [43]:
!git log -1 --oneline

955b170 Updated split version (seed=17)


In [45]:
X_train = train["message"]
y_train = train["target"]

X_val = val["message"]
y_val = val["target"]

X_test = test["message"]
y_test = test["target"]

vectorizer = TfidfVectorizer(
    stop_words="english",
    max_features=5000
)

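# Note: these pre-transformed matrices are not used below; the pipeline in
# train_and_log_model refits the vectorizer on the raw training text.
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
X_test_vec = vectorizer.transform(X_test)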


In [47]:
# Store AUCPR results
results = {}

# Train and log each model
for name, model in models.items():
    aucpr = train_and_log_model(
        name,
        model,
        X_train,
        y_train,
        X_val,
        y_val,
        seed=17,
        vectorizer=vectorizer,
    )
    results[name] = aucpr

Registered model 'MultinomialNB' already exists. Creating a new version of this model...
Created version '2' of model 'MultinomialNB'.


MultinomialNB AUCPR: 0.9654


Registered model 'LogisticRegression' already exists. Creating a new version of this model...
Created version '2' of model 'LogisticRegression'.


LogisticRegression AUCPR: 0.9625
LinearSVC AUCPR: 0.9719


Registered model 'LinearSVC' already exists. Creating a new version of this model...
Created version '2' of model 'LinearSVC'.


In [49]:
client = MlflowClient()

for name in ["MultinomialNB", "LogisticRegression", "LinearSVC"]:
    latest_version = client.get_latest_versions(name)[0]
    run = client.get_run(latest_version.run_id)
    print(f"{name} (Version {latest_version.version}) AUCPR: {run.data.metrics['AUCPR']:.4f}")


MultinomialNB (Version 2) AUCPR: 0.9654
LogisticRegression (Version 2) AUCPR: 0.9625
LinearSVC (Version 2) AUCPR: 0.9719


In [51]:
from mlflow.tracking import MlflowClient

client = MlflowClient()

print("Validation AUCPR by Model Version:\n")

for name in model_names:
    
    versions = client.search_model_versions(f"name='{name}'")
    
    print(f"{name}:")
    
    # Sort versions numerically
    for v in sorted(versions, key=lambda x: int(x.version)):
        
        run = client.get_run(v.run_id)
        aucpr = run.data.metrics["AUCPR"]
        
        print(f"  Version {v.version} AUCPR: {aucpr:.4f}")
    
    print()


Validation AUCPR by Model Version:

MultinomialNB:
  Version 1 AUCPR: 0.9717
  Version 2 AUCPR: 0.9654

LogisticRegression:
  Version 1 AUCPR: 0.9657
  Version 2 AUCPR: 0.9625

LinearSVC:
  Version 1 AUCPR: 0.9759
  Version 2 AUCPR: 0.9719



In [55]:
import mlflow.sklearn
from sklearn.metrics import average_precision_score
from mlflow.tracking import MlflowClient

client = MlflowClient()

print("Test Set Evaluation (Version 2-seed=17):\n")

for name in model_names:
    
    # Explicitly load Version 2
    model_uri = f"models:/{name}/2"
    model = mlflow.sklearn.load_model(model_uri)
    
    # Get scores
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        scores = model.decision_function(X_test)
    
    aucpr = average_precision_score(y_test, scores)
    
    print(f"{name} (Version 2) Test AUCPR: {aucpr:.4f}")


Test Set Evaluation (Version 2-seed=17):

MultinomialNB (Version 2) Test AUCPR: 0.9519
LogisticRegression (Version 2) Test AUCPR: 0.9574
LinearSVC (Version 2) Test AUCPR: 0.9713


### Test Set AUCPR Comparison Across Dataset Versions

| Model               | Version 1 (seed=99) | Version 2 (seed=17) |
|---------------------|--------------------|--------------------|
| MultinomialNB       | 0.9420             | 0.9519             |
| LogisticRegression  | 0.9483             | 0.9574             |
| LinearSVC           | 0.9610             | 0.9713             |

Across all three models, Version 2 (seed=17) achieves a higher test AUCPR than Version 1 (seed=99).
Each version is evaluated on the test split from its own seed, so the gap reflects differences between the splits as well as between the trained models.

LinearSVC achieves the highest test AUCPR on both dataset versions.
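
The size of the change between versions can be read off the table with a few lines of arithmetic. The sketch below copies the test AUCPR values reported above into a dictionary (the variable names are illustrative) and prints the Version 2 minus Version 1 difference for each model.

```python
# Test AUCPR from the table above: (Version 1, seed=99), (Version 2, seed=17)
test_aucpr = {
    "MultinomialNB": (0.9420, 0.9519),
    "LogisticRegression": (0.9483, 0.9574),
    "LinearSVC": (0.9610, 0.9713),
}

# Difference between versions, rounded to the table's four decimal places
deltas = {name: round(v2 - v1, 4) for name, (v1, v2) in test_aucpr.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

All three differences are close to 0.01, so the shift between dataset versions is of similar size for every model.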


### Final Summary

In this notebook, we implemented end-to-end model version control and experiment tracking using MLflow.

- Three benchmark models were trained: MultinomialNB, Logistic Regression, and LinearSVC.
- Experiments were tracked using a SQLite MLflow backend.
- Validation performance was evaluated using AUCPR.
- Each trained model was registered in the MLflow Model Registry.
- Two dataset versions (seed=99 and seed=17) were evaluated, creating Version 1 and Version 2 of each model.
- Model versions were explicitly retrieved from the registry and compared.
- Final evaluation on the test set showed that LinearSVC achieved the highest AUCPR on both dataset versions.