# Evaluating Tabular Classifications

## Introduction

In this notebook, we'll walk through a detailed example of how you can use Valor to evaluate classifications made on a tabular dataset. This example uses `sklearn`'s breast cancer dataset to make a binary prediction about whether a tumor is malignant or benign, based on a table of descriptive features such as mean radius and mean texture.

For a conceptual introduction to Valor, [check out our project overview](https://striveworks.github.io/valor/). For a higher-level example notebook, [check out our "Getting Started" notebook](https://github.com/Striveworks/valor/blob/main/examples/getting_started.ipynb).

Before using this notebook, please ensure that the Valor service is running on your machine (for start-up instructions, [click here](https://striveworks.github.io/valor/getting_started/)). To connect to a non-local instance of Valor, update `connect("http://localhost:8000")` in the first code block to point to the correct URL.

## Defining Our Datasets

We start by fetching our dataset, dividing it into test/train splits, and uploading both sets to Valor.

In [1]:
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

from valor import connect, Client, Dataset, Model, Datum, Annotation, GroundTruth, Prediction, Label
from valor.enums import TaskType

# connect to the Valor API
connect("http://localhost:8000")
client = Client()



Successfully connected to host at http://localhost:8000/


In [2]:
# load data from sklearn
dset = load_breast_cancer()
dset.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

In [3]:
# split datasets
X, y, target_names = dset["data"], dset["target"], dset["target_names"]
X_train, X_test, y_train, y_test = train_test_split(X, y)

# show an example input
X_train.shape, y_train[:4], target_names

((426, 30), array([1, 1, 1, 0]), array(['malignant', 'benign'], dtype='<U9'))
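
The `(426, 30)` shape above follows from the defaults of `train_test_split`: when no `test_size` is given, scikit-learn holds out 25% of the rows and rounds the test count up. Below is a minimal sketch of that arithmetic, assuming the dataset's 569 rows. The helper name `split_sizes` is ours for illustration and is not part of scikit-learn.

```python
import math

def split_sizes(n_samples: int, test_fraction: float = 0.25) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(569)
print(n_train, n_test)  # 426 143
```

Because the cell above passes no `random_state`, the rows that land in each split (and so the exact metrics later in this notebook) will vary from run to run.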

In [5]:
# create train dataset in Valor
valor_train_dataset = Dataset.create("breast-cancer-train")

# create test dataset in Valor
valor_test_dataset = Dataset.create("breast-cancer-test")

### Adding GroundTruths to Our Datasets

Now that our two datasets exist in Valor, we can add `GroundTruths` to each dataset.

In [6]:
# format training groundtruths
training_groundtruths = [
    GroundTruth(
        datum=Datum(
            uid=f"train{i}",
        ),
        annotations=[
            Annotation(
                task_type=TaskType.CLASSIFICATION,
                labels=[Label(key="class", value=target_names[t])]
            )
        ]
    )
    for i, t in enumerate(y_train)
]

# format testing groundtruths
testing_groundtruths = [
    GroundTruth(
        datum=Datum(
            uid=f"test{i}",
        ),
        annotations=[
            Annotation(
                task_type=TaskType.CLASSIFICATION,
                labels=[Label(key="class", value=target_names[t])]
            )
        ]
    )
    for i, t in enumerate(y_test)
]

# add the training groundtruths
valor_train_dataset.add_groundtruths(training_groundtruths)

# add the testing groundtruths
valor_test_dataset.add_groundtruths(testing_groundtruths)
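
The label value in each `GroundTruth` comes from indexing `target_names` with the integer target, so `0` maps to `malignant` and `1` to `benign`. The sketch below mirrors that mapping with plain dictionaries standing in for Valor's objects, using the four training targets printed earlier. The dictionary layout and the `to_records` helper are illustrative only and are not Valor's data format.

```python
target_names = ["malignant", "benign"]
y_sample = [1, 1, 1, 0]  # first four training targets printed earlier

def to_records(targets, prefix):
    """Pair each datum uid with its class label, as the comprehensions above do."""
    return [
        {"uid": f"{prefix}{i}", "key": "class", "value": target_names[t]}
        for i, t in enumerate(targets)
    ]

records = to_records(y_sample, "train")
for r in records:
    print(r["uid"], r["value"])
```

The `uid` pattern matters later: the prediction cells reuse the same `train{i}` and `test{i}` identifiers so that each prediction refers to the datum its ground truth was attached to.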

### Finalizing Our Datasets

Lastly, we finalize both datasets to prep them for evaluation.

In [7]:
valor_train_dataset.finalize()
valor_test_dataset.finalize()

<Response [200]>

## Defining Our Model

Now that our `Datasets` have been defined, we can describe our model in Valor using the `Model` object.

In [8]:
# fit an sklearn model to our data
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# get predictions on both of our datasets
y_train_probs = pipe.predict_proba(X_train)
y_test_probs = pipe.predict_proba(X_test)

# show an example output
y_train_probs[:4]

array([[1.03794513e-05, 9.99989621e-01],
       [6.08759201e-03, 9.93912408e-01],
       [9.52455109e-05, 9.99904754e-01],
       [1.45856827e-01, 8.54143173e-01]])
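
Each row returned by `predict_proba` holds one probability per class, in the order of the fitted classifier's classes (here `0` then `1`, i.e. malignant then benign). The sketch below uses the four rows printed above to show how a hard label follows from taking the highest-probability column. It assumes the targets and probabilities shown in this notebook come from the same split.

```python
target_names = ["malignant", "benign"]

# the four probability rows shown in the output above
probs = [
    [1.03794513e-05, 9.99989621e-01],
    [6.08759201e-03, 9.93912408e-01],
    [9.52455109e-05, 9.99904754e-01],
    [1.45856827e-01, 8.54143173e-01],
]
y_true = [1, 1, 1, 0]  # first four training targets printed earlier

# argmax over each row gives the predicted class index
y_pred = [max(range(len(row)), key=row.__getitem__) for row in probs]
predicted_names = [target_names[j] for j in y_pred]

# indices where the predicted class differs from the target
misses = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

print(predicted_names)
print(misses)
```

Under that assumption, the fourth training example is malignant but receives a benign probability of about 0.85, so it is one of the model's few training errors.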

In [9]:
# create our model in Valor
valor_model = Model.create("breast-cancer-linear-model")

### Adding Predictions to Our Model

With our model defined in Valor, we can post predictions for each of our `Datasets` to our `Model` object. Each `Prediction` should contain a list of `Labels` describing the prediction and its associated confidence score. Since we're running a classification task, the confidence scores over all prediction classes should sum to (approximately) 1.
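
One way to check the sum-to-one expectation before uploading is a tolerance-based comparison. This is a minimal sketch with a hypothetical helper, `scores_sum_to_one`, and an arbitrarily chosen tolerance. It is not part of Valor's API.

```python
import math

def scores_sum_to_one(rows, tol=1e-6):
    """Return True if every row of class scores sums to 1 within `tol`."""
    return all(math.isclose(sum(row), 1.0, abs_tol=tol) for row in rows)

# two of the probability rows printed earlier in this notebook
sample_rows = [
    [1.03794513e-05, 9.99989621e-01],
    [6.08759201e-03, 9.93912408e-01],
]

print(scores_sum_to_one(sample_rows))   # True
print(scores_sum_to_one([[0.7, 0.2]]))  # False
```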

In [10]:

# define our predictions
training_predictions = [
    Prediction(
        datum=Datum(
            uid=f"train{i}",
        ),
        annotations=[
            Annotation(
                task_type=TaskType.CLASSIFICATION,
                labels=[
                    Label(
                        key="class",
                        value=target_names[j],
                        score=p,
                    )
                    for j, p in enumerate(prob)
                ]
            )
        ]
    )
    for i, prob in enumerate(y_train_probs)
]

testing_predictions = [
    Prediction(
        datum=Datum(
            uid=f"test{i}",
        ),
        annotations=[
            Annotation(
                task_type=TaskType.CLASSIFICATION,
                labels=[
                    Label(
                        key="class",
                        value=target_names[j],
                        score=p,
                    )
                    for j, p in enumerate(prob)
                ]
            )
        ]
    )
    for i, prob in enumerate(y_test_probs)
]

# add the train predictions
valor_model.add_predictions(valor_train_dataset, training_predictions)

# add the test predictions
valor_model.add_predictions(valor_test_dataset, testing_predictions)

## Evaluating Performance

With our `Dataset` and `Model` defined, we're ready to evaluate our performance and display the results. Note that we use the `wait_for_completion` method since all evaluations run as background tasks; this method ensures that the evaluation finishes before we display the results.
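
The general pattern behind waiting on a background job is polling: ask for the status, sleep, and repeat until the job reports completion or a timeout elapses. The sketch below illustrates that idea with a hypothetical `wait_until_done` helper and a stand-in status function. It is not Valor's implementation of `wait_for_completion`, whose details may differ.

```python
import time

def wait_until_done(get_status, interval=0.01, timeout=5.0):
    """Poll `get_status()` until it returns "done"; raise if `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "done":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        time.sleep(interval)

# a stand-in job that reports "running" twice, then "done"
_states = iter(["running", "running", "done"])
result = wait_until_done(lambda: next(_states))
print(result)
```

A timeout is worth keeping in any loop of this kind so that a stalled job surfaces as an error rather than hanging the notebook.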

In [11]:
train_eval_job = valor_model.evaluate_classification(valor_train_dataset)
train_eval_job.wait_for_completion()
train_eval_job.to_dataframe()

| type | parameters | label | value (evaluation 1) |
|---|---|---|---|
| Accuracy | `{"label_key": "class"}` | | 0.988263 |
| F1 | n/a | class: benign | 0.990758 |
| F1 | n/a | class: malignant | 0.983923 |
| Precision | n/a | class: benign | 0.985294 |
| Precision | n/a | class: malignant | 0.993506 |
| ROCAUC | `{"label_key": "class"}` | | 0.997609 |
| Recall | n/a | class: benign | 0.996283 |
| Recall | n/a | class: malignant | 0.974522 |


In [12]:
train_eval_job.confusion_matrices

[{'label_key': 'class',
  'entries': [{'prediction': 'benign', 'groundtruth': 'benign', 'count': 268},
   {'prediction': 'benign', 'groundtruth': 'malignant', 'count': 4},
   {'prediction': 'malignant', 'groundtruth': 'benign', 'count': 1},
   {'prediction': 'malignant', 'groundtruth': 'malignant', 'count': 153}]}]
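
The per-class metrics reported by the evaluation can be recovered from these confusion-matrix counts alone. The sketch below applies the standard definitions of precision, recall, F1, and accuracy to the entries above. The helper `per_class_metrics` is written for illustration and is not a Valor function.

```python
entries = [
    {"prediction": "benign", "groundtruth": "benign", "count": 268},
    {"prediction": "benign", "groundtruth": "malignant", "count": 4},
    {"prediction": "malignant", "groundtruth": "benign", "count": 1},
    {"prediction": "malignant", "groundtruth": "malignant", "count": 153},
]

def per_class_metrics(entries, label):
    """Compute (precision, recall, F1) for `label` from confusion-matrix entries."""
    tp = sum(e["count"] for e in entries
             if e["prediction"] == label and e["groundtruth"] == label)
    predicted = sum(e["count"] for e in entries if e["prediction"] == label)
    actual = sum(e["count"] for e in entries if e["groundtruth"] == label)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# accuracy is the share of entries on the diagonal
total = sum(e["count"] for e in entries)
correct = sum(e["count"] for e in entries if e["prediction"] == e["groundtruth"])
accuracy = correct / total

for label in ("benign", "malignant"):
    p, r, f = per_class_metrics(entries, label)
    print(f"{label}: precision={p:.6f} recall={r:.6f} f1={f:.6f}")
print(f"accuracy={accuracy:.6f}")
```

For example, malignant precision is 153 / (153 + 1) and malignant recall is 153 / (153 + 4), which round to the 0.993506 and 0.974522 shown in the evaluation output.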

As a brief sanity check, we can compare Valor's outputs against `sklearn`'s own classification report. The per-class precision, recall, and F1 scores, along with the overall accuracy, match Valor's results.

In [13]:
y_train_preds = pipe.predict(X_train)
print(classification_report(y_train, y_train_preds, digits=6, target_names=target_names))

              precision    recall  f1-score   support

   malignant   0.993506  0.974522  0.983923       157
      benign   0.985294  0.996283  0.990758       269

    accuracy                       0.988263       426
   macro avg   0.989400  0.985402  0.987340       426
weighted avg   0.988321  0.988263  0.988239       426
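
The `macro avg` and `weighted avg` rows do not appear in Valor's output above, but they follow directly from the per-class values: the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. A short sketch using the counts from the confusion matrix (the helper `prf` is ours, for illustration):

```python
# (true positives, predicted count, actual count) per class, from the confusion matrix
counts = {
    "malignant": (153, 154, 157),
    "benign": (268, 272, 269),
}

def prf(tp, predicted, actual):
    """Return (precision, recall, F1) from raw counts."""
    precision, recall = tp / predicted, tp / actual
    return precision, recall, 2 * precision * recall / (precision + recall)

scores = {name: prf(*c) for name, c in counts.items()}
support = {name: c[2] for name, c in counts.items()}
total = sum(support.values())

# macro average: unweighted mean over classes
macro = [sum(s[k] for s in scores.values()) / len(scores) for k in range(3)]
# weighted average: mean over classes, weighted by support
weighted = [sum(scores[n][k] * support[n] for n in scores) / total for k in range(3)]

print("macro   ", " ".join(f"{v:.6f}" for v in macro))
print("weighted", " ".join(f"{v:.6f}" for v in weighted))
```

Note that the support-weighted recall equals the overall accuracy (421 / 426), which is why both appear as 0.988263 in the report.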

