# Drug classification [sklearn]
* Multiclass classification of drug type, given a person's health data.
* Reference notebook: <https://www.kaggle.com/code/caesarmario/drug-classification-w-various-ml-models>
* Dataset: <https://www.kaggle.com/datasets/prathamtripathi/drug-classification?datasetId=830916&sortBy=voteCount>

## Import libraries

In [11]:
import os

import numpy as np
import pandas as pd
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from imblearn.pipeline import Pipeline as PipelineImb

import giskard
from giskard import Dataset, Model

## Define constants

In [12]:
# Constants.
RANDOM_SEED = 0

TARGET_NAME = "Drug"

AGE_BINS = [0, 19, 29, 39, 49, 59, 69, 80]
AGE_CATEGORIES = ['<20s', '20s', '30s', '40s', '50s', '60s', '>60s']

NA_TO_K_BINS = [0, 9, 19, 29, 50]
NA_TO_K_CATEGORIES = ['<10', '10-20', '20-30', '>30']

# Paths.
PATH_DATA = os.path.join(".", "datasets", "drug_classification_dataset", "drug200.csv")

## Load data

In [13]:
def load_data() -> pd.DataFrame:
    """Load the drug dataset from PATH_DATA into a DataFrame."""
    print("Loading data...")
    df = pd.read_csv(PATH_DATA)
    print("Loading data finished!")
    return df

df_drug = load_data()

Loading data...
Loading data finished!


In [21]:
df_drug

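| | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|---|---|---|---|---|---|
| 0 | 20s | F | HIGH | HIGH | 20-30 | drugY |
| 1 | 40s | M | LOW | HIGH | 10-20 | drugC |
| 2 | 40s | M | LOW | HIGH | 10-20 | drugC |
| 3 | 20s | F | NORMAL | HIGH | <10 | drugX |
| 4 | 60s | F | LOW | HIGH | 10-20 | drugY |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 50s | F | LOW | HIGH | 10-20 | drugC |
| 196 | <20s | M | LOW | HIGH | 10-20 | drugC |
| 197 | 50s | M | NORMAL | HIGH | 10-20 | drugX |
| 198 | 20s | M | NORMAL | NORMAL | 10-20 | drugX |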


## Define preprocessing steps

In [14]:
def bin_numerical(df: pd.DataFrame) -> pd.DataFrame:
    """Perform numerical features binning."""
    def _bin_age(_df: pd.DataFrame) -> pd.DataFrame:
        """Bin age feature."""
        _df.Age = pd.cut(_df.Age, bins=AGE_BINS, labels=AGE_CATEGORIES)
        return _df

    def _bin_na_to_k(_df: pd.DataFrame) -> pd.DataFrame:
        """Bin Na_to_K feature."""
        _df.Na_to_K = pd.cut(_df.Na_to_K, bins=NA_TO_K_BINS, labels=NA_TO_K_CATEGORIES)
        return _df

    df = df.copy()
    df = _bin_age(df)
    df = _bin_na_to_k(df)

    return df

df_drug = bin_numerical(df_drug)
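
As a quick check on the binning step, the sketch below shows how `pd.cut` assigns values using the notebook's `NA_TO_K_BINS`. It relies on pandas' default right-closed intervals, and the sample values are made up for illustration. Because the edges are 9, 19 and 29 rather than 10, 20 and 30, a value between 9 and 10 lands in the bin labelled `'10-20'`.

```python
import pandas as pd

# Bin edges and labels copied from the constants cell.
NA_TO_K_BINS = [0, 9, 19, 29, 50]
NA_TO_K_CATEGORIES = ['<10', '10-20', '20-30', '>30']

# Hypothetical sample values chosen to sit on or near the bin edges.
sample = pd.Series([8.5, 9.0, 9.5, 19.0, 19.5, 38.2, 55.0])

# pd.cut uses right-closed intervals by default: (0, 9], (9, 19], (19, 29], (29, 50].
binned = pd.cut(sample, bins=NA_TO_K_BINS, labels=NA_TO_K_CATEGORIES)

# Values outside the outermost edges (here 55.0) become NaN.
summary = pd.DataFrame({"Na_to_K": sample, "bin": binned})
print(summary)
```

If the labels are meant to match the decade boundaries exactly, the edges or the labels would need adjusting; that is a design decision for the notebook's author rather than something this sketch settles.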

## Train-test split

In [15]:
X_train, X_test, y_train, y_test = train_test_split(df_drug.drop(TARGET_NAME, axis=1), df_drug.Drug,
                                                    test_size=0.3, random_state=RANDOM_SEED)
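
The split above is not stratified, so on a small, imbalanced dataset the class proportions in the test part can drift from those in the full data. A minimal sketch of a stratified alternative follows, using the `stratify` parameter of `train_test_split` on made-up labels (the `toy` frame and its class counts are assumptions, not the notebook's data).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 70 / 20 / 10 rows across three classes.
toy = pd.DataFrame({
    "feature": range(100),
    "Drug": ["drugY"] * 70 + ["drugX"] * 20 + ["drugC"] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop("Drug", axis=1), toy["Drug"],
    test_size=0.3, random_state=0,
    stratify=toy["Drug"],  # preserve class proportions in both parts
)

# Class shares in the test part mirror the full data.
print(y_te.value_counts(normalize=True))
```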

## Build Support Vector Machine classifier

In [16]:
pipeline = PipelineImb(steps=[
    ("one_hot_encoder", OneHotEncoder()),
    ("resampler", SMOTE(random_state=RANDOM_SEED)),
    ("classifier", SVC(kernel='linear', max_iter=250, random_state=RANDOM_SEED, probability=True))
])

print("Model training...")
pipeline.fit(X_train, y_train)
print("Model training finished!")

print("Model testing...")
y_pred = pipeline.predict(X_test)
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged.
metric = accuracy_score(y_test, y_pred)
print(f"Test accuracy score: {metric}")

Model training...
Model training finished!
Model testing...
Test accuracy score: 0.85
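
Accuracy is a single number and can hide weak performance on rare classes, which matters here because the pipeline uses SMOTE to counter class imbalance. The sketch below uses made-up labels rather than the notebook's predictions, and shows how `confusion_matrix` and `balanced_accuracy_score` from scikit-learn break the result down per class.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Hypothetical labels for illustration only; not taken from the notebook's run.
labels = ["drugC", "drugX", "drugY"]
y_true = ["drugY"] * 6 + ["drugX"] * 3 + ["drugC"] * 1
y_hat = ["drugY"] * 6 + ["drugX"] * 3 + ["drugY"] * 1  # the single drugC row is missed

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# Accuracy looks high, while balanced accuracy (mean per-class recall)
# exposes the class that was never predicted correctly.
acc = accuracy_score(y_true, y_hat)
bal_acc = balanced_accuracy_score(y_true, y_hat)
print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
```

The same calls could be applied to `y_test` and `y_pred` from the cell above to see which drug classes account for the errors.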




## Wrap dataset and model and perform scanning

In [22]:
raw_dataset = pd.concat([X_train, y_train], axis=1)
wrapped_dataset = Dataset(raw_dataset,
                          name="drug_classification_dataset",
                          target=TARGET_NAME,
                          cat_columns=X_test.columns.tolist())

Your 'pandas.DataFrame' is successfully wrapped by Giskard's 'Dataset' wrapper class.


In [24]:
wrapped_model = Model(pipeline,
                      model_type="classification",
                      name="drug_classifier",
                      feature_names=X_test.columns.tolist())

Your 'model' is successfully wrapped by Giskard's 'SKLearnModel' wrapper class.


In [25]:
scanning_results = giskard.scan(wrapped_model, wrapped_dataset)

Your 'pandas.DataFrame' is successfully wrapped by Giskard's 'Dataset' wrapper class.
Your model is successfully validated.

In [26]:
display(scanning_results)

Your 'pandas.DataFrame' is successfully wrapped by Giskard's 'Dataset' wrapper class.