# Drug classification [sklearn]
* Multiclass classification of drug type, given a person's health data.
* Reference notebook: <https://www.kaggle.com/code/caesarmario/drug-classification-w-various-ml-models>
* Dataset: <https://www.kaggle.com/datasets/prathamtripathi/drug-classification?datasetId=830916&sortBy=voteCount>

By running this notebook, you’ll create a whole test suite in a few lines of code. The model used here is a support vector classifier (SVC) trained on the drug classification dataset. Feel free to use your own model (tabular, text, or LLM).

You’ll learn how to:
* Detect vulnerabilities by scanning the model
* Generate a test suite with domain-specific tests
* Customize your test suite by loading a test from the Giskard catalog
* Upload your model to the Giskard server to:
    * Compare models to decide which one to promote
    * Debug your tests to diagnose issues
    * Share your results and collect business feedback from your team

## Install Giskard

In [None]:
!pip install giskard

## Import libraries

In [None]:
import os

import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.svm import SVC
from urllib.request import urlretrieve
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from imblearn.pipeline import Pipeline as PipelineImb

import giskard
from giskard import Dataset, Model, GiskardClient, testing

## Define constants

In [None]:
# Constants.
RANDOM_SEED = 0

TARGET_NAME = "Drug"

AGE_BINS = [0, 19, 29, 39, 49, 59, 69, 80]
AGE_CATEGORIES = ['<20s', '20s', '30s', '40s', '50s', '60s', '>60s']

NA_TO_K_BINS = [0, 9, 19, 29, 50]
NA_TO_K_CATEGORIES = ['<10', '10-20', '20-30', '>30']

# Paths.
# Build the URL with "/" explicitly: os.path.join would insert backslashes on Windows.
DATA_URL = "/".join(["ftp://sys.giskard.ai", "pub", "unit_test_resources", "drug_classification_dataset", "drug200.csv"])
DATA_PATH = Path.home() / ".giskard" / "drug_classification_dataset" / "drug200.csv"

## Dataset preparation

### Load and preprocess data

In [None]:
def fetch_from_ftp(url: str, file: Path) -> None:
    """Helper to fetch data from the FTP server."""
    # Create the cache directory if needed (a no-op when it already exists).
    file.parent.mkdir(parents=True, exist_ok=True)

    # Download only once; later runs reuse the cached file.
    if not file.exists():
        print(f"Downloading data from {url}")
        urlretrieve(url, file)

    print("Data was loaded!")


def load_data() -> pd.DataFrame:
    """Load data."""
    fetch_from_ftp(DATA_URL, DATA_PATH)
    df = pd.read_csv(DATA_PATH)
    return df


def bin_numerical(df: pd.DataFrame) -> pd.DataFrame:
    """Perform numerical features binning."""
    def _bin_age(_df: pd.DataFrame) -> pd.DataFrame:
        """Bin age feature."""
        _df.Age = pd.cut(_df.Age, bins=AGE_BINS, labels=AGE_CATEGORIES)
        return _df

    def _bin_na_to_k(_df: pd.DataFrame) -> pd.DataFrame:
        """Bin Na_to_K feature."""
        _df.Na_to_K = pd.cut(_df.Na_to_K, bins=NA_TO_K_BINS, labels=NA_TO_K_CATEGORIES)
        return _df

    df = df.copy()
    df = _bin_age(df)
    df = _bin_na_to_k(df)

    return df

In [None]:
df_drug = load_data()
df_drug = bin_numerical(df_drug)
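
The binning step relies on pd.cut, whose intervals are right-closed by default, so each edge in AGE_BINS acts as an inclusive upper bound. The sketch below is a minimal illustration on made-up ages (the sample values are hypothetical, not taken from the dataset). It also shows that values outside the outermost edges come back as missing instead of raising an error.

```python
import pandas as pd

# Same edges and labels as the constants defined earlier in this notebook.
AGE_BINS = [0, 19, 29, 39, 49, 59, 69, 80]
AGE_CATEGORIES = ['<20s', '20s', '30s', '40s', '50s', '60s', '>60s']

# Hypothetical ages chosen to sit on and around the bin edges.
sample_ages = pd.Series([15, 19, 20, 47, 74])
binned = pd.cut(sample_ages, bins=AGE_BINS, labels=AGE_CATEGORIES)
print(binned.tolist())

# Ages outside (0, 80] fall in no interval and come back as NaN.
out_of_range = pd.cut(pd.Series([0, 85]), bins=AGE_BINS, labels=AGE_CATEGORIES)
print(out_of_range.isna().tolist())
```

An age of 19 lands in '<20s' and 20 lands in '20s', so each label covers the decade it names. If your own data contains values beyond the outer edges, consider widening the bins so that no rows end up with a missing category.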

### Train-test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_drug.drop(TARGET_NAME, axis=1), df_drug.Drug,
                                                    test_size=0.3, random_state=RANDOM_SEED)

### Wrap dataset with Giskard

In [None]:
raw_dataset = pd.concat([X_train, y_train], axis=1)
wrapped_dataset = Dataset(raw_dataset,
                          name="drug_classification_dataset",
                          target=TARGET_NAME,
                          cat_columns=X_test.columns.tolist())

## Train model

In [None]:
pipeline = PipelineImb(steps=[
    ("one_hot_encoder", OneHotEncoder()),
    ("resampler", SMOTE(random_state=RANDOM_SEED)),
    ("classifier", SVC(kernel='linear', max_iter=250, random_state=RANDOM_SEED, probability=True))
])

print("Model training...")
pipeline.fit(X_train, y_train)
print("Model training finished!")

print("Model testing...")
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, but keep the conventional order.
train_metric = accuracy_score(y_train, y_train_pred)
test_metric = accuracy_score(y_test, y_test_pred)
print(f"Train accuracy score: {train_metric:.2f}\n"
      f"Test accuracy score: {test_metric:.2f}")

### Define prediction function

In [None]:
def prediction_function(df: pd.DataFrame) -> np.ndarray:
    return pipeline.predict_proba(df)

### Wrap model with Giskard

In [None]:
wrapped_model = Model(prediction_function,
                      model_type="classification",
                      name="drug_classifier",
                      feature_names=X_train.columns.tolist(),
                      classification_labels=pipeline.classes_)

# Validate wrapped model.
wrapped_y_train_pred = pipeline.classes_[wrapped_model.predict(wrapped_dataset).raw_prediction]
wrapped_train_metric = accuracy_score(y_train, wrapped_y_train_pred)
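
# The validation above indexes pipeline.classes_ with integer class positions.
# A minimal sketch of that mapping follows, assuming the positions are the
# row-wise argmax of the probability matrix returned by the prediction function.
# The label array and probabilities are made up for illustration, and the
# variable names are hypothetical.

```python
import numpy as np

# Illustrative label array, standing in for pipeline.classes_ (assumed values).
classes = np.array(["DrugY", "drugA", "drugB", "drugC", "drugX"])

# Made-up probabilities: one row per sample, one column per class,
# with columns in the same order as the label array.
probabilities = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.10, 0.10, 0.60],
])

# Position of the most probable class in each row.
predicted_indices = probabilities.argmax(axis=1)

# Indexing the label array with those positions yields the predicted labels.
predicted_labels = classes[predicted_indices]
print(predicted_labels.tolist())
```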
print(f"Wrapped Train accuracy score: {wrapped_train_metric:.2f}")

## Scan your model to find vulnerabilities
With the Giskard scan feature, you can detect vulnerabilities in your model, including performance biases, robustness issues, data leakage, stochasticity, underconfidence, ethical issues, and more. For details, refer to our scan documentation.

In [None]:
results = giskard.scan(wrapped_model, wrapped_dataset)

In [None]:
display(results)

## Generate a test suite from the Scan
The objects produced by the scan can be used as fixtures to generate a test suite that integrates domain-specific issues. To create custom tests, refer to the Test your ML Model page.

In [None]:
test_suite = results.generate_test_suite("My first test suite")
test_suite.run()

## Customize your suite by loading objects from the Giskard catalog

The Giskard open-source catalog lets you load:
* Tests such as metamorphic, performance, prediction & data drift, statistical tests, etc.
* Slicing functions such as detectors of toxicity, hate, emotion, etc.
* Transformation functions such as generators of typos, paraphrases, style changes, etc.

For demo purposes, we will load a simple unit test (test_f1) that checks if the test F1 score is above the given threshold. For more examples of tests and functions, refer to the Giskard catalog.

In [None]:
test_suite.add_test(testing.test_f1(model=wrapped_model, dataset=wrapped_dataset, threshold=0.7)).run()
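
To make the idea of an F1 threshold check concrete, here is a minimal, library-free sketch. It is not Giskard's implementation: the function names are hypothetical, and macro averaging is an assumption chosen for illustration, so the catalog test may aggregate the per-class scores differently.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        denominator = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def passes_f1_threshold(y_true, y_pred, threshold=0.7):
    """Return True when the macro F1 score reaches the threshold."""
    return macro_f1(y_true, y_pred) >= threshold


# Made-up labels for illustration only.
y_true = ["drugA", "drugA", "drugB", "drugB"]
y_pred = ["drugA", "drugB", "drugB", "drugB"]
print(round(macro_f1(y_true, y_pred), 3), passes_f1_threshold(y_true, y_pred))
```

In this toy example the per-class scores are 2/3 for drugA and 0.8 for drugB, so the check passes at a threshold of 0.7 but would fail at 0.75.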

## Upload your suite to the Giskard server

Upload your suite to the Giskard server to:
* Compare models to decide which model to promote
* Debug your tests to diagnose the issues
* Create more domain-specific tests that integrate business feedback
* Share your results

In [None]:
# Uploading the test suite automatically saves the model, dataset, tests, and slicing & transformation functions to the Giskard server.
# Create a Giskard client after installing the Giskard server (see documentation).
token = "API_TOKEN"  # Find it in Settings in the Giskard server

client = GiskardClient(
    url="http://localhost:19000",  # URL of your Giskard instance
    token=token
)

my_project = client.create_project("my_project", "PROJECT_NAME", "DESCRIPTION")

# Upload to the current project ✉️
test_suite.upload(client, "my_project")