# Tabular Classification Example

## Introduction

In this notebook, we'll walk through a detailed example of how you can use Velour to evaluate classifications made on a tabular dataset. We'll use `sklearn`'s breast cancer dataset to make a binary prediction about whether a tumor is malignant or benign based on a table of descriptive features (e.g., mean radius and mean texture).

For a conceptual introduction to Velour, [check out our project overview](https://striveworks.github.io/velour/). For a higher-level example notebook, [check out our "Getting Started" notebook](https://github.com/Striveworks/velour/blob/main/examples/getting_started.ipynb).

Before using this notebook, please ensure that the Velour service is running on your machine (for start-up instructions, [click here](https://striveworks.github.io/velour/getting_started/)). To connect to a non-local instance of Velour, update `client = Client("http://localhost:8000")` in the first code block to point to the correct URL.
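If the Velour URL differs between environments, one option is to read it from an environment variable instead of editing the notebook. The following is a minimal sketch of that idea; the variable name `VELOUR_URL` and the helper `resolve_velour_url` are hypothetical names chosen for illustration and are not part of Velour.

```python
import os

# Local default used throughout this notebook.
DEFAULT_URL = "http://localhost:8000"


def resolve_velour_url(env=os.environ):
    """Return the URL stored in VELOUR_URL (a hypothetical variable name),
    falling back to the local default when it is not set."""
    return env.get("VELOUR_URL", DEFAULT_URL)


velour_url = resolve_velour_url()
print(velour_url)

# The resolved URL would then be passed to the client as in the first code block:
# client = Client(velour_url)
```

With this pattern, the same notebook can target a local or a remote instance without code changes.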

## Defining Our Datasets

We start by fetching our dataset, dividing it into test/train splits, and uploading both sets to Velour.

In [1]:
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

from velour import Dataset, Model, Datum, Annotation, GroundTruth, Prediction, Label
from velour.enums import TaskType
from velour.client import Client

# connect to Velour API
client = Client("http://localhost:8000")



Successfully connected to host at http://localhost:8000/


In [2]:
# load data from sklearn
dset = load_breast_cancer()
dset.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

In [3]:
# split datasets
X, y, target_names = dset["data"], dset["target"], dset["target_names"]
X_train, X_test, y_train, y_test = train_test_split(X, y)

# show an example input
X_train.shape, y_train[:4], target_names

((426, 30), array([0, 1, 0, 0]), array(['malignant', 'benign'], dtype='<U9'))
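The shape `(426, 30)` follows from the split defaults. As a rough check, assuming `train_test_split` holds out 25% of the rows when no sizes are given and rounds the test count up, the arithmetic for the 569-row dataset can be sketched as:

```python
import math

n_samples = 569        # rows in sklearn's breast cancer dataset
test_fraction = 0.25   # assumed default when neither test_size nor train_size is passed

n_test = math.ceil(test_fraction * n_samples)  # test count is rounded up
n_train = n_samples - n_test                   # remaining rows go to training

print(n_train, n_test)  # 426 143
```

Because the split is shuffled randomly, the example labels and the metrics later in this notebook will vary between runs unless a fixed `random_state` is passed to `train_test_split`.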

In [4]:
# create train dataset in Velour
velour_train_dataset = Dataset(client, "breast-cancer-train", delete_if_exists=True)

# create test dataset in Velour
velour_test_dataset = Dataset(client, "breast-cancer-test", delete_if_exists=True)

### Adding GroundTruths to our Dataset

Now that our two datasets exist in Velour, we can add `GroundTruths` to each dataset.

In [5]:
# format training groundtruths
training_groundtruths = [
    GroundTruth(
        datum=Datum(
            uid=f"train{i}",
        ),
        annotations=[
            Annotation(
                task_type=TaskType.CLASSIFICATION,
                labels=[Label(key="class", value=target_names[t])]
            )
        ]
    )
    for i, t in enumerate(y_train)
]

# format testing groundtruths
testing_groundtruths = [
    GroundTruth(
        datum=Datum(
            uid=f"test{i}",
        ),
        annotations=[
            Annotation(
                task_type=TaskType.CLASSIFICATION,
                labels=[Label(key="class", value=target_names[t])]
            )
        ]
    )
    for i, t in enumerate(y_test)
]

# add the training groundtruths
for gt in tqdm(training_groundtruths):
    velour_train_dataset.add_groundtruth(gt)

# add the testing groundtruths
for gt in tqdm(testing_groundtruths):
    velour_test_dataset.add_groundtruth(gt)

100%|██████████| 426/426 [00:07<00:00, 57.55it/s]
100%|██████████| 143/143 [00:02<00:00, 57.80it/s]


### Finalizing Our Datasets

Lastly, we finalize both datasets to prep them for evaluation.

In [6]:
velour_train_dataset.finalize()
velour_test_dataset.finalize()

<Response [200]>

## Defining Our Model

Now that our `Datasets` have been defined, we can describe our model in Velour using the `Model` object.

In [7]:
# fit an sklearn model to our data
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# get predictions on both of our datasets
y_train_probs = pipe.predict_proba(X_train)
y_test_probs = pipe.predict_proba(X_test)

# show an example output
y_train_probs[:4]

array([[1.00000000e+00, 5.42543127e-12],
       [3.10366665e-04, 9.99689633e-01],
       [7.88809064e-01, 2.11190936e-01],
       [9.99195106e-01, 8.04893945e-04]])
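Each row of the output holds one probability per class, with column `j` corresponding to `target_names[j]` (column 0 is `malignant`, column 1 is `benign`). As a small illustration using plain Python lists in place of the NumPy array, the most likely label for each of the four rows above can be recovered like this; `top_label` is a helper name introduced only for this sketch.

```python
target_names = ["malignant", "benign"]

# The four probability rows shown in the output above.
probs = [
    [1.00000000e+00, 5.42543127e-12],
    [3.10366665e-04, 9.99689633e-01],
    [7.88809064e-01, 2.11190936e-01],
    [9.99195106e-01, 8.04893945e-04],
]


def top_label(row):
    """Return the class name whose column has the highest probability."""
    best = max(range(len(row)), key=row.__getitem__)
    return target_names[best]


predicted = [top_label(row) for row in probs]
print(predicted)  # ['malignant', 'benign', 'malignant', 'malignant']
```

These labels agree with the first four training targets shown earlier (`[0, 1, 0, 0]`).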

In [8]:
# create our model in Velour
velour_model = Model(client, "breast-cancer-linear-model", delete_if_exists=True)

### Adding Predictions to Our Model

With our model defined in Velour, we can post predictions for each of our `Datasets` to our `Model` object. Each `Prediction` should contain a list of `Labels` describing the prediction and its associated confidence score. Since we're running a classification task, the confidence scores over all prediction classes should sum to (approximately) 1.
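Before uploading, it can be useful to confirm locally that each row of scores sums to roughly 1. The check below is a minimal sketch; the helper name `scores_sum_to_one` and the tolerance of `1e-6` are assumptions made for illustration, not values taken from Velour.

```python
import math


def scores_sum_to_one(scores, tol=1e-6):
    """Return True when the scores add up to 1 within an absolute tolerance."""
    return math.isclose(sum(scores), 1.0, abs_tol=tol)


# Two rows taken from the predict_proba output shown earlier.
rows = [
    [1.00000000e+00, 5.42543127e-12],
    [7.88809064e-01, 2.11190936e-01],
]

all_valid = all(scores_sum_to_one(row) for row in rows)
print(all_valid)  # True
```

A row such as `[0.7, 0.2]` would fail this check, which usually indicates that a class was dropped or that the scores are not probabilities.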

In [9]:

# define our predictions
training_predictions = [
    Prediction(
        datum=Datum(
            dataset=velour_train_dataset.name,
            uid=f"train{i}",
        ),
        annotations=[
            Annotation(
                task_type=TaskType.CLASSIFICATION,
                labels=[
                    Label(
                        key="class", 
                        value=target_names[j],
                        score=p,
                    )                        
                    for j, p in enumerate(prob)
                ]
            )
        ]
    )
    for i, prob in enumerate(y_train_probs)
]

testing_predictions = [
    Prediction(
        datum=Datum(
            dataset=velour_test_dataset.name,
            uid=f"test{i}",
        ),
        annotations=[
            Annotation(
                task_type=TaskType.CLASSIFICATION,
                labels=[
                    Label(
                        key="class",
                        value=target_names[j],
                        score=p,
                    )                        
                    for j, p in enumerate(prob)
                ]
            )
        ]
    )
    for i, prob in enumerate(y_test_probs)
]

# add the train predictions
for pred in tqdm(training_predictions):
    velour_model.add_prediction(velour_train_dataset, pred)

# add the test predictions
for pred in tqdm(testing_predictions):
    velour_model.add_prediction(velour_test_dataset, pred)

100%|██████████| 426/426 [00:08<00:00, 51.84it/s]
100%|██████████| 143/143 [00:02<00:00, 55.27it/s]
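Note that each `Prediction` reuses the same `uid` string as the corresponding `GroundTruth` (for example, `train0`). Assuming predictions are paired with ground truths through that `uid`, a quick local check that the two sets of identifiers line up might look like the sketch below, where `unmatched_uids` is a hypothetical helper and plain strings stand in for the Velour objects.

```python
n_train = 426  # number of training rows in this notebook

# Rebuild the uid strings the same way the ground truths and predictions do.
groundtruth_uids = [f"train{i}" for i in range(n_train)]
prediction_uids = [f"train{i}" for i in range(n_train)]


def unmatched_uids(gt_uids, pred_uids):
    """Return (uids missing a prediction, uids missing a ground truth)."""
    gt, pred = set(gt_uids), set(pred_uids)
    return sorted(gt - pred), sorted(pred - gt)


missing_predictions, missing_groundtruths = unmatched_uids(
    groundtruth_uids, prediction_uids
)
print(missing_predictions, missing_groundtruths)  # [] []
```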


### Finalizing Our Model

Finally, we finalize our `Model` to prep it for evaluation.

In [10]:
velour_model.finalize_inferences(velour_train_dataset)
velour_model.finalize_inferences(velour_test_dataset)

## Evaluating Performance

With our `Dataset` and `Model` defined, we're ready to evaluate our performance and display the results. Note that we use the `wait_for_completion` method since all evaluations run as background tasks; this method ensures that the evaluation finishes before we display the results.

In [11]:
train_eval_job = velour_model.evaluate_classification(velour_train_dataset)
train_eval_job.wait_for_completion()
results = train_eval_job.results()

results.to_dataframe()

| type | parameters | label | value (dataset: breast-cancer-train) |
|---|---|---|---|
| Accuracy | {"label_key": "class"} | | 0.988263 |
| F1 | "n/a" | class: benign | 0.990403 |
| F1 | "n/a" | class: malignant | 0.984894 |
| Precision | "n/a" | class: benign | 0.984733 |
| Precision | "n/a" | class: malignant | 0.993902 |
| ROCAUC | {"label_key": "class"} | | 0.997411 |
| Recall | "n/a" | class: benign | 0.996139 |
| Recall | "n/a" | class: malignant | 0.976048 |


In [12]:
results.confusion_matrices

[{'label_key': 'class',
  'entries': [{'prediction': 'benign', 'groundtruth': 'benign', 'count': 258},
   {'prediction': 'benign', 'groundtruth': 'malignant', 'count': 4},
   {'prediction': 'malignant', 'groundtruth': 'benign', 'count': 1},
   {'prediction': 'malignant', 'groundtruth': 'malignant', 'count': 163}]}]
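The accuracy, precision, recall, and F1 values reported above can be recomputed by hand from these confusion matrix counts. The following sketch uses the standard definitions of those metrics; the helper names `per_class_metrics` and `accuracy` are introduced here for illustration and are not Velour functions.

```python
# Entries copied from the confusion matrix output above.
entries = [
    {"prediction": "benign", "groundtruth": "benign", "count": 258},
    {"prediction": "benign", "groundtruth": "malignant", "count": 4},
    {"prediction": "malignant", "groundtruth": "benign", "count": 1},
    {"prediction": "malignant", "groundtruth": "malignant", "count": 163},
]


def per_class_metrics(entries, label):
    """Return (precision, recall, f1) for one class.

    Assumes the class was predicted at least once and appears at least
    once in the ground truth, so no denominator is zero.
    """
    tp = sum(
        e["count"]
        for e in entries
        if e["prediction"] == label and e["groundtruth"] == label
    )
    n_predicted = sum(e["count"] for e in entries if e["prediction"] == label)
    n_actual = sum(e["count"] for e in entries if e["groundtruth"] == label)
    precision = tp / n_predicted
    recall = tp / n_actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def accuracy(entries):
    """Return the fraction of examples where prediction equals ground truth."""
    correct = sum(
        e["count"] for e in entries if e["prediction"] == e["groundtruth"]
    )
    total = sum(e["count"] for e in entries)
    return correct / total


for label in ("benign", "malignant"):
    p, r, f = per_class_metrics(entries, label)
    print(label, round(p, 6), round(r, 6), round(f, 6))
print("accuracy", round(accuracy(entries), 6))
```

For example, precision for `benign` is 258 / (258 + 4) and recall is 258 / (258 + 1), which round to the 0.984733 and 0.996139 shown in the results table.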

As a brief sanity check, we can compare Velour's outputs against `sklearn`'s own classification report. The per-class precision, recall, and F1 scores, as well as the overall accuracy, match to six decimal places.

In [13]:
y_train_preds = pipe.predict(X_train)
print(classification_report(y_train, y_train_preds, digits=6, target_names=target_names))

              precision    recall  f1-score   support

   malignant   0.993902  0.976048  0.984894       167
      benign   0.984733  0.996139  0.990403       259

    accuracy                       0.988263       426
   macro avg   0.989318  0.986093  0.987649       426
weighted avg   0.988327  0.988263  0.988244       426
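The `macro avg` and `weighted avg` rows appear only in the `sklearn` report. As a worked example for the precision column, the macro average is the unweighted mean of the per-class values, while the weighted average weights each class by its support. The per-class precisions below are written as fractions of the confusion matrix counts shown earlier.

```python
# Support (number of true examples per class) from the report above.
support = {"malignant": 167, "benign": 259}

# Per-class precision expressed with the confusion matrix counts.
precision = {"malignant": 163 / 164, "benign": 258 / 262}

# Macro average: every class counts equally.
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: each class is weighted by its support.
weighted_avg = sum(precision[c] * support[c] for c in support) / sum(support.values())

print(round(macro_avg, 6), round(weighted_avg, 6))  # 0.989318 0.988327
```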

