<h2 align='center'>Codebasics ML Course: MLflow Tutorial</h2>

In [1]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Step 1: Create an imbalanced binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2, n_redundant=8, 
                           weights=[0.9, 0.1], flip_y=0, random_state=42)

np.unique(y, return_counts=True)

(array([0, 1]), array([900, 100]))

The parameters passed to `make_classification()` from `sklearn.datasets`:

1. **`n_samples=1000`**:
   - **Meaning**: Total number of **samples** (rows) to generate.
   - **Here**: The dataset consists of 1000 samples (rows).

2. **`n_features=10`**:
   - **Meaning**: Total number of **features** (columns) to generate.
   - **Here**: The dataset has 10 features (columns).

3. **`n_informative=2`**:
   - **Meaning**: Number of **informative** features, i.e. features that directly determine the target class.
   - **Here**: Only 2 features carry real information for predicting the class.

4. **`n_redundant=8`**:
   - **Meaning**: Number of **redundant** features, generated as linear combinations of the informative features.
   - **Here**: 8 of the 10 features are redundant, so they add no information beyond the 2 informative features.

5. **`weights=[0.9, 0.1]`**:
   - **Meaning**: Proportion of samples assigned to each **target class**. For binary classification it sets the split between the two classes.
   - **Here**: 90% of the samples fall in class 0 (the majority class) and 10% in class 1 (the minority class), which makes the dataset **imbalanced**.

6. **`flip_y=0`**:
   - **Meaning**: Fraction of samples whose **target label** is reassigned at random, which adds **noise** to the labels.
   - **Here**: `flip_y=0` means no labels are reassigned, so the target contains no label noise.

7. **`random_state=42`**:
   - **Meaning**: Sets the **seed** of the random number generator so the output is the same on every run.
   - **Here**: `random_state=42` ensures **reproducibility**: each time you run this code you get the same dataset.

### General Explanation
- `make_classification()` creates a synthetic dataset for **classification** tasks.
- In this example we generate a dataset with **1000 samples** and **10 features**, of which **2 features** are truly informative for prediction and the other **8 features** are linear combinations of those informative features.
- The **class distribution** is imbalanced: 90% of the samples are in class 0 and 10% are in class 1.
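
As a quick check of the redundancy claim, the sketch below regenerates the dataset with the same arguments and measures the numerical rank of the feature matrix. Since the 8 redundant columns are linear combinations of the 2 informative ones, the rank should come out as 2. The names `X_demo` and `y_demo` and the tolerance `1e-8` are choices made for this illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same arguments as the notebook cell above.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=2, n_redundant=8,
    weights=[0.9, 0.1], flip_y=0, random_state=42,
)

# Redundant columns are linear combinations of the informative ones,
# so they do not increase the rank of the matrix.
rank = np.linalg.matrix_rank(X_demo, tol=1e-8)

classes, counts = np.unique(y_demo, return_counts=True)
print("rank of X:", rank)
print("class counts:", dict(zip(classes.tolist(), counts.tolist())))
```

A rank far below the number of columns is a reminder that most of these features are collinear, which can matter for linear models such as logistic regression.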

In [3]:
# # Dataset with label noise (flip_y=0.3)
# X_with_noise, y_with_noise = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, 
#                                                  flip_y=0.3, random_state=42)


# print("\nDataset with label noise (flip_y=0.3):")
# print(X_with_noise)

`flip_y=0.3`: about 30% of the samples have their target label reassigned at random, which introduces noise into the dataset. Some samples that were originally class 0 become class 1 and vice versa. Because the new label is drawn at random, some of the selected samples keep their original class, so the share of labels that actually changes is smaller than 30%.
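
To see the effect of `flip_y` in practice, the sketch below generates the same dataset twice, once without and once with label noise, and counts how many labels differ. It assumes `shuffle=False` (not used in the notebook) so that rows stay in generation order and the two label vectors can be compared position by position. For a balanced binary problem like this one, the changed fraction is expected to be noticeably below 0.3, because a randomly redrawn label can land on the original class.

```python
import numpy as np
from sklearn.datasets import make_classification

# Shared arguments; shuffle=False is an assumption made for this comparison.
common = dict(n_samples=1000, n_features=5, n_informative=2, n_redundant=3,
              shuffle=False, random_state=42)

_, y_clean = make_classification(flip_y=0, **common)
_, y_noisy = make_classification(flip_y=0.3, **common)

# Fraction of positions where the noisy label differs from the clean one.
changed = float(np.mean(y_clean != y_noisy))
print(f"fraction of labels that actually changed: {changed:.3f}")
```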

In [4]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
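
The `stratify=y` argument matters here because the minority class is small. The sketch below repeats the split on a regenerated copy of the dataset and prints the per-class counts, to show that both partitions keep the 90/10 ratio. The names `X_demo`, `y_demo`, `y_tr`, and `y_te` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=2, n_redundant=8,
    weights=[0.9, 0.1], flip_y=0, random_state=42,
)

# stratify=y_demo keeps the class proportions the same in both partitions.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train counts per class:", train_counts.tolist())
print("test counts per class:", test_counts.tolist())
```

The test counts match the `support` column (270 and 30) in the classification reports that follow.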

### Experiment 1: Train Logistic Regression Classifier

In [5]:
log_reg = LogisticRegression(C=1, solver='liblinear')
log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)
print(classification_report(y_test, y_pred_log_reg))

              precision    recall  f1-score   support

           0       0.95      0.96      0.95       270
           1       0.60      0.50      0.55        30

    accuracy                           0.92       300
   macro avg       0.77      0.73      0.75       300
weighted avg       0.91      0.92      0.91       300
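
To make the report easier to read, the sketch below recomputes the class 1 metrics from confusion-matrix counts. The counts are inferred from the rounded values above (a recall of 0.50 on 30 positives implies 15 true positives, and a precision of 0.60 then implies 10 false positives); they are an assumption and were not printed by the notebook.

```python
# Counts for class 1, inferred from the rounded report (assumed values).
tp, fp, fn, tn = 15, 10, 15, 260

precision_1 = tp / (tp + fp)   # share of predicted positives that are correct
recall_1 = tp / (tp + fn)      # share of actual positives that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
```

An accuracy of 0.92 looks strong, yet half of the minority class is missed. This is why the later experiments track recall for class 1 rather than accuracy alone.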



### Experiment 2: Train Random Forest Classifier

In [6]:
rf_clf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
print(classification_report(y_test, y_pred_rf))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       270
           1       0.95      0.70      0.81        30

    accuracy                           0.97       300
   macro avg       0.96      0.85      0.89       300
weighted avg       0.97      0.97      0.96       300



### Experiment 3: Train XGBoost

In [7]:
xgb_clf = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_test)
print(classification_report(y_test, y_pred_xgb))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       270
           1       0.96      0.80      0.87        30

    accuracy                           0.98       300
   macro avg       0.97      0.90      0.93       300
weighted avg       0.98      0.98      0.98       300



### Experiment 4: Handle class imbalance using SMOTETomek and then Train XGBoost

In [8]:
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=42)
X_train_res, y_train_res = smt.fit_resample(X_train, y_train)

np.unique(y_train_res, return_counts=True)

(array([0, 1]), array([619, 619]))
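
`SMOTETomek` combines SMOTE oversampling with Tomek-link cleaning. The sketch below illustrates only the oversampling idea: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. It is a simplified stand-in written for this explanation, not the imbalanced-learn implementation; the function `smote_like_oversample` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (SMOTE-style)."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    # Pick a random base sample and one of its k true neighbours (columns 1..k).
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]
    # Interpolate with a random weight in [0, 1).
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[neigh] - X_min[base])


rng = np.random.default_rng(42)
X_minority = rng.normal(size=(20, 2))  # toy minority-class samples
X_synth = smote_like_oversample(X_minority, n_new=30)
print(X_synth.shape)
```

The Tomek-link step then removes pairs of opposite-class samples that are each other's nearest neighbour. That would be consistent with the resampled counts above ending at 619 per class rather than 630.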

In [9]:
xgb_clf = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_clf.fit(X_train_res, y_train_res)
y_pred_xgb = xgb_clf.predict(X_test)
print(classification_report(y_test, y_pred_xgb))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       270
           1       0.81      0.83      0.82        30

    accuracy                           0.96       300
   macro avg       0.89      0.91      0.90       300
weighted avg       0.96      0.96      0.96       300



<h2 align="center" style="color:blue">Track Experiments Using MLflow</h2>

In [10]:
models = [
    (
        "Logistic Regression",
        {"C": 1, "solver": "liblinear" },
        LogisticRegression(C=1, solver='liblinear'), 
        (X_train, y_train),
        (X_test, y_test)
    ),
    (
        "Random Forest",
        {"n_estimators": 10, "max_depth": 3, "random_state": 42},
        RandomForestClassifier(n_estimators=10, max_depth=3, random_state=42), 
        (X_train, y_train),
        (X_test, y_test)
    ),
    (
        "XGBClassifier",
        {"use_label_encoder": False, "eval_metric": "logloss"},
        XGBClassifier(use_label_encoder=False, eval_metric='logloss'), 
        (X_train, y_train),
        (X_test, y_test)
    ),
    (
        "XGBClassifier With SMOTE",
        {"use_label_encoder": False, "eval_metric": "logloss"},
        XGBClassifier(use_label_encoder=False, eval_metric='logloss'), 
        (X_train_res, y_train_res),
        (X_test, y_test)
    )
]

In [11]:
models

[('Logistic Regression',
  {'C': 1, 'solver': 'liblinear'},
  LogisticRegression(C=1, solver='liblinear'),
  (array([[ 1.18673836,  1.51144074,  0.78490373, ..., -0.61229492,
           -0.13830257, -0.24753395],
          [-1.28810271, -1.03855344, -2.07092052, ...,  0.49607021,
           -1.50376955,  0.62474155],
          [ 1.66393774,  1.55142135,  2.25024183, ..., -0.69955586,
            1.36600648, -0.68290518],
          ...,
          [ 0.43101615,  0.90013637, -0.42606094, ..., -0.32069604,
           -1.01508454,  0.11782042],
          [ 0.79839935,  1.5003473 , -0.45098948, ..., -0.54728572,
           -1.42140103,  0.11944858],
          [ 0.67367695,  1.27538516, -0.39960348, ..., -0.464427  ,
           -1.22522394,  0.10635794]]),
   array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
          0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0,

In [12]:
reports = []

# Train each model on its own training set and collect its evaluation report.
# Unpacking into local names avoids overwriting the global X_train / y_train.
for model_name, params, model, (X_tr, y_tr), (X_te, y_te) in models:
    model.set_params(**params)
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    # output_dict=True returns a nested dict, so individual metrics can be logged later.
    report = classification_report(y_te, y_pred, output_dict=True)
    reports.append(report)
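
The logging cell further down reads values such as `report["1"]["recall"]`, so it helps to see what `classification_report(..., output_dict=True)` returns. The sketch below uses a tiny hand-made label set (`y_true` and `y_hat` are made up for illustration) and prints the top-level keys. Note that the class labels become string keys.

```python
from sklearn.metrics import classification_report

# Toy labels chosen for illustration only.
y_true = [0, 0, 0, 1, 1]
y_hat = [0, 0, 1, 1, 0]

demo_report = classification_report(y_true, y_hat, output_dict=True)

# Per-class entries are keyed by the label as a string; "accuracy" is a plain float.
print(sorted(demo_report.keys()))
print("recall for class 1:", demo_report["1"]["recall"])
print("accuracy:", demo_report["accuracy"])
```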

In [13]:
import mlflow
import mlflow.sklearn
import mlflow.xgboost

In [14]:
# dagshub setup

import dagshub
dagshub.init(repo_owner='Mich', repo_name='MLflow-dagshub', mlflow=True)

In [15]:
# If authentication fails on the first run, set the MLflow tracking environment variables.
# Do not hard-code a real access token in a shared notebook; the value below is a placeholder.
import os
os.environ['MLFLOW_TRACKING_USERNAME'] = 'Mich'
os.environ['MLFLOW_TRACKING_PASSWORD'] = '<your-dagshub-access-token>'
os.environ['MLFLOW_TRACKING_URI'] = 'https://dagshub.com/Mich/MLflow-dagshub.mlflow'

mlflow.set_experiment("Anomaly Detection_Imbalanced Classification")
# mlflow.set_tracking_uri("http://localhost:5000")
# mlflow.tracking.set_tracking_uri("https://dagshub.com/Mich/MLflow-dagshub.mlflow")

# Log one MLflow run per model: parameters, selected metrics, and the fitted model.
for (model_name, params, model, _, _), report in zip(models, reports):
    with mlflow.start_run(run_name=model_name):
        # Log a copy of the params so the dicts stored in `models` are not mutated.
        mlflow.log_params({**params, "model_name": model_name})
        mlflow.log_metrics({
            "accuracy": report["accuracy"],
            "recall_class_1": report["1"]["recall"],
            "recall_class_0": report["0"]["recall"],
            "f1_score_macro": report["macro avg"]["f1-score"]
        })
        # Use the MLflow flavor that matches the model type.
        if "XGB" in model_name:
            mlflow.xgboost.log_model(model, model_name)
        else:
            mlflow.sklearn.log_model(model, model_name)

2024/10/10 02:38:38 INFO mlflow.tracking._tracking_service.client: 🏃 View run Logistic Regression at: https://dagshub.com/Mich/MLflow-dagshub.mlflow/#/experiments/0/runs/5b3765f310654bcfbe949440457950f6.
2024/10/10 02:38:38 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dagshub.com/Mich/MLflow-dagshub.mlflow/#/experiments/0.
2024/10/10 02:38:46 INFO mlflow.tracking._tracking_service.client: 🏃 View run Random Forest at: https://dagshub.com/Mich/MLflow-dagshub.mlflow/#/experiments/0/runs/38a2ce657b4547cb99d3931aecb21bdc.
2024/10/10 02:38:46 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dagshub.com/Mich/MLflow-dagshub.mlflow/#/experiments/0.
2024/10/10 02:38:54 INFO mlflow.tracking._tracking_service.client: 🏃 View run XGBClassifier at: https://dagshub.com/Mich/MLflow-dagshub.mlflow/#/experiments/0/runs/5e28bdc329bb42a29139c5b3d73dda79.
2024/10/10 02:38:54 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https

In [16]:
help(mlflow.xgboost.log_model)

Help on function log_model in module mlflow.xgboost:

log_model(xgb_model, artifact_path, conda_env=None, code_paths=None, registered_model_name=None, signature: mlflow.models.signature.ModelSignature = None, input_example: Union[pandas.core.frame.DataFrame, numpy.ndarray, dict, list, ForwardRef('csr_matrix'), ForwardRef('csc_matrix'), str, bytes, tuple] = None, await_registration_for=300, pip_requirements=None, extra_pip_requirements=None, model_format='xgb', metadata=None, **kwargs)
    Log an XGBoost model as an MLflow artifact for the current run.

    Args:
        xgb_model: XGBoost model (an instance of `xgboost.Booster`_ or models that implement the
            `scikit-learn API`_) to be saved.
        artifact_path: Run-relative artifact path.
        conda_env: Either a dictionary representation of a Conda environment or the path to a conda
                   environment yaml file. If provided, this describes the environment this model should be run in.
                   At 

### Register the model

In [17]:
# result = mlflow.register_model(
#     "runs:/d16076a3ec534311817565e6527539c0/sklearn-model", "sk-learn-random-forest-reg"
# )

In [18]:
model_name = "XGB-SMOTE_AnomalyDetection"
run_id = input("Enter Run ID:")
# The path after the run ID must match the artifact path used when the model was logged,
# i.e. the second argument of mlflow.xgboost.log_model(model, model_name) above.
model_uri = f"runs:/{run_id}/XGBClassifier With SMOTE"
result = mlflow.register_model(model_uri, model_name)

Successfully registered model 'XGB-SMOTE_AnomalyDetection'.
2024/10/10 02:39:15 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: XGB-SMOTE_AnomalyDetection, version 1
Created version '1' of model 'XGB-SMOTE_AnomalyDetection'.


### Load the model
Load the challenger model and run a quick test.

In [19]:
model_name = "XGB-SMOTE_AnomalyDetection"
model_version = 1
# Registered models are addressed as models:/<name>/<version>.
model_uri = f"models:/{model_name}/{model_version}"

loaded_model = mlflow.xgboost.load_model(model_uri)
y_pred = loaded_model.predict(X_test)
y_pred[:4]

array([0, 0, 0, 0])

In [22]:

# Copy version 1 of the development model into the production registered model.
dev_model_uri = "models:/Anomaly-Detection-Model/1"
prod_model = "Anomaly-Detection-Production"

client = mlflow.MlflowClient()
client.copy_model_version(src_model_uri=dev_model_uri,
                          dst_name=prod_model)

Registered model 'Anomaly-Detection-Production' already exists. Creating a new version of this model...
Copied version '1' of model 'Anomaly-Detection-Model' to version '3' of model 'Anomaly-Detection-Production'.


<ModelVersion: aliases=[], creation_timestamp=1728502907222, current_stage='None', description='', last_updated_timestamp=1728502907222, name='Anomaly-Detection-Production', run_id='df0c1058bd69433e88a40b4d58759dc7', run_link='', source='models:/Anomaly-Detection-Model/1', status='READY', status_message='', tags={}, user_id='', version='3'>

In [23]:
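# Load by alias. This assumes the alias "champion" has already been assigned to a version
# of the production model (in the MLflow UI or with client.set_registered_model_alias).
model_uri = f"models:/{prod_model}@champion"

# loaded_model = mlflow.xgboost.load_model(model_uri)
loaded_model = mlflow.sklearn.load_model(model_uri)
y_pred = loaded_model.predict(X_test)
y_pred[:4]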

array([0, 0, 0, 0])
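
The cells above address models through three URI forms: by run and artifact path, by registered name and version, and by registered name and alias. The sketch below collects them in small helper functions as a summary. The helper names are hypothetical (they are not part of MLflow), and the run ID is a placeholder.

```python
def run_uri(run_id: str, artifact_path: str) -> str:
    """URI of a model logged inside a run, used when registering it."""
    return f"runs:/{run_id}/{artifact_path}"


def version_uri(name: str, version: int) -> str:
    """URI of a specific version of a registered model."""
    return f"models:/{name}/{version}"


def alias_uri(name: str, alias: str) -> str:
    """URI of whichever version currently carries the given alias."""
    return f"models:/{name}@{alias}"


print(run_uri("<run_id>", "XGBClassifier With SMOTE"))
print(version_uri("XGB-SMOTE_AnomalyDetection", 1))
print(alias_uri("Anomaly-Detection-Production", "champion"))
```

Loading by alias keeps serving code unchanged when a new version is promoted, since only the alias assignment in the registry has to move.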