In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Load the dataset
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Encode labels: LabelEncoder assigns codes in sorted class order, so ham = 0, spam = 1
label_encoder = LabelEncoder()
train["Label"] = label_encoder.fit_transform(train["Label"])
test["Label"] = label_encoder.transform(test["Label"])

# Convert text into TF-IDF vectors
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train["Message"])
X_test = vectorizer.transform(test["Message"])
y_train = train["Label"]
y_test = test["Label"]

In [2]:
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.metrics import average_precision_score

# Set up MLflow
mlflow.set_tracking_uri("file:///AML_Project/mlruns")  # Local tracking
mlflow.set_experiment("SMS_Spam_Classification")

# List of models
models = {
    "Logistic_Regression": LogisticRegression(),
    "Random_Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost": xgb.XGBClassifier(eval_metric="logloss")
}

# Dictionary to store Run IDs
run_ids = {}

# Train and log models
for model_name, model in models.items():
    with mlflow.start_run():
        # Train the model
        model.fit(X_train, y_train)

        # Score the positive class (spam = 1) and compute average precision, used here as AUCPR
        y_probs = model.predict_proba(X_test)[:, 1]
        aucpr = average_precision_score(y_test, y_probs)

        # Log parameters and metrics
        mlflow.log_param("model_type", model_name)
        mlflow.log_metric("AUCPR", aucpr)

        # Log the model
        mlflow.sklearn.log_model(model, model_name)

        # Store Run ID
        run_ids[model_name] = mlflow.active_run().info.run_id

        print(f"Model {model_name} logged with AUCPR: {aucpr:.4f}")

2025/03/07 18:59:08 INFO mlflow.tracking.fluent: Experiment with name 'SMS_Spam_Classification' does not exist. Creating a new experiment.
Model Logistic_Regression logged with AUCPR: 0.9585
Model Random_Forest logged with AUCPR: 0.9723
Model XGBoost logged with AUCPR: 0.9471

In [3]:
for model_name, run_id in run_ids.items():
    # Load the model
    model_uri = f"runs:/{run_id}/{model_name}"
    loaded_model = mlflow.sklearn.load_model(model_uri)

    # Evaluate the model
    y_probs = loaded_model.predict_proba(X_test)[:, 1]
    aucpr = average_precision_score(y_test, y_probs)

    print(f"Loaded Model {model_name} (Run ID: {run_id}) -> AUCPR: {aucpr:.4f}")

Loaded Model Logistic_Regression (Run ID: 37d9ba332ec54e75a33bceb1ca31734e) -> AUCPR: 0.9585
Loaded Model Random_Forest (Run ID: 45ff8afdbea947589ea28a7b2ac46785) -> AUCPR: 0.9723
Loaded Model XGBoost (Run ID: 1c805ceac792478cb209f660a18a3fd4) -> AUCPR: 0.9471




In this notebook, we performed the following steps to train machine learning models and track experiments:

### **1. Loading and Preprocessing the Data**
- The training and test datasets (`train.csv` and `test.csv`) were loaded using `pandas`.
- The target variable (`Label`) was encoded:
  - `ham` → `0`
  - `spam` → `1`
- The text data (`Message`) was converted into numerical features using **TF-IDF Vectorization**:
  - The `TfidfVectorizer` from `scikit-learn` was used to create a sparse matrix of TF-IDF features.
  - The maximum number of features was set to 5000.
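
The preprocessing steps above can be checked on a small scale. The following is a minimal sketch on three made-up messages (hypothetical toy data, not rows from `train.csv`); it shows the label mapping that `LabelEncoder` produces and the shape of the sparse TF-IDF matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only (assumption: not taken from train.csv)
messages = [
    "win a free prize now",
    "are we still meeting today",
    "free entry win cash",
]
labels = ["spam", "ham", "spam"]

# LabelEncoder sorts the class names, so "ham" gets 0 and "spam" gets 1
encoder = LabelEncoder()
y = encoder.fit_transform(labels)
mapping = {cls: int(code) for code, cls in enumerate(encoder.classes_)}

# Same vectorizer settings as the notebook; max_features only caps the vocabulary size
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(messages)  # SciPy sparse matrix, one row per message

print(mapping)   # {'ham': 0, 'spam': 1}
print(X.shape)   # (number of messages, vocabulary size)
```

Inspecting `encoder.classes_` is a cheap way to confirm which class `predict_proba(...)[:, 1]` refers to before computing any metric on it.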

### **2. Training Machine Learning Models**
- Three machine learning models were trained:
  1. **Logistic Regression**:
     - A linear model for binary classification.
  2. **Random Forest**:
     - An ensemble model using 100 decision trees.
  3. **XGBoost**:
     - A gradient-boosted tree ensemble (`XGBClassifier`) with `eval_metric="logloss"` and otherwise default settings.

### **3. Tracking Experiments with MLflow**
- **MLflow** was used to track experiments and log:
  - **Parameters**: Model type (e.g., `Logistic_Regression`).
  - **Metrics**: AUCPR (Area Under the Precision-Recall Curve), computed with `average_precision_score`.
  - **Artifacts**: Trained models.
- The runs were stored in a local file-based tracking store (`file:///AML_Project/mlruns`) under the experiment `SMS_Spam_Classification`.

### **4. Evaluating Model Performance**
- The models were evaluated using the **AUCPR** metric:
  - **Logistic Regression**: AUCPR = 0.9585
  - **Random Forest**: AUCPR = 0.9723
  - **XGBoost**: AUCPR = 0.9471
- The **Random Forest** model had the highest AUCPR, 0.0138 above Logistic Regression and 0.0252 above XGBoost.
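
To make the metric concrete: average precision ranks the test messages by predicted spam probability and averages the precision at each rank where a true spam message appears. The following is a minimal pure-Python sketch on made-up labels and scores (hypothetical values, not the notebook's predictions), assuming all scores are distinct. Under that assumption it matches the sum that scikit-learn's `average_precision_score` computes, $\sum_n (R_n - R_{n-1}) P_n$.

```python
def average_precision(y_true, y_score):
    """Mean of precision@k over the ranks k that hold a positive example.

    Assumes all scores are distinct; tied scores would need to be grouped.
    """
    # Rank examples from highest to lowest score
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / k  # precision among the top-k ranked examples
    return total / n_pos


# Hypothetical toy example: 1 = spam, 0 = ham
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.2]

ap = average_precision(y_true, y_score)
print(f"{ap:.4f}")  # 0.8333
```

Here the two spam messages sit at ranks 1 and 3, giving precisions 1/1 and 2/3, whose mean is 5/6.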

### **5. Loading and Re-evaluating Models**
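- The trained models were loaded back from MLflow via their `runs:/<run_id>/<model_name>` URIs and re-evaluated on the test set. Each reloaded model reproduced its logged AUCPR, which confirms the artifacts were saved correctly.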

### **Key Takeaways**
- The **Random Forest** model achieved the best performance on the test set.
- Using MLflow allowed us to track experiments, log metrics, and store models for reproducibility.
- The trained models can now be deployed for inference or further fine-tuning.

---

This concludes the model training and experiment tracking process.