# Datalab: A unified way to automatically detect all kinds of issues in datasets.

`Datalab` helps you identify issues in your machine learning datasets that may degrade your model's performance if left unaddressed.
This includes: noisy labels, outliers, (near) duplicates, and other types of problems that commonly occur in real-world data.
`Datalab` utilizes *any* ML model you have already trained for your data to diagnose these issues. It only requires access to either (probabilistic) predictions from your model or its learned representations of the data.

Under the hood, this class calls all the appropriate cleanlab methods for your dataset and provided model outputs, in order to best audit the data and alert you to important issues. This makes it easy to apply many functionalities of this library all within a single line of code. Before turning to other cleanlab functions, we recommend checking whether `Datalab` can address your needs. That said, `Datalab` is primarily for auditing a classification dataset for potential issues, while other cleanlab methods implement a wide variety of other data-centric AI capabilities.


This tutorial demonstrates how to use `Datalab` to identify issues in a (toy) dataset. You can easily replace our demo dataset with your own image/text/tabular/audio/etc dataset, and then run the same code to discover what sort of issues lurk within it!

<div class="alert alert-info">
Quickstart
<br/>
    
Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some `features` as well? Run the code below to examine your dataset for multiple types of issues.

<div  class=markdown markdown="1" style="background:white;margin:16px">  
    
```ipython3 
from cleanlab import Datalab

lab = Datalab(data=your_dataset, label_name="column_name_of_labels")

lab.find_issues(features=your_feature_matrix, pred_probs=your_pred_probs)

lab.report()
```
   
</div>
</div>

## Setup

`Datalab` has additional dependencies that are not included in the standard installation of cleanlab.

To install everything necessary for this tutorial, run:


In [None]:
# !pip install matplotlib
# !pip install git+https://github.com/cleanlab/cleanlab.git#egg=cleanlab[datalab]

Then import the following:

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

from scripts.data_generation import create_data, plot_data
from cleanlab import Datalab

### Loading data

We'll load a toy classification dataset for this tutorial.
The dataset has two numerical features and a label column with three classes.

Our dataset is produced in a dictionary called `data_dict`. The following variables are extracted for convenience:

- `X_train`: A matrix of features for the training set.
- `noisy_labels`: A vector of noisy observed labels for *the training set* (represented as strings).
- `y_train_idx`: A vector of true labels (not available in practice) for the training set (represented as integers).
- `noisy_labels_idx`: Indices of label errors (not available in practice) in the training set (represented as integers).
- `X_out`: A matrix of features for additional examples that are outliers.
  - This also contains a pair of near duplicate examples, which are also in `X_train`.

Below we also print out the label accuracy for the training set (the proportion of examples whose noisy label matches the true label), then plot the features of the datasets colored by the value of the corresponding observed labels. Given labels that do not match the true label are highlighted in red, and outliers are marked with an X.

In [None]:
data_dict = create_data()

X_train, noisy_labels, y_train_idx, noisy_labels_idx, X_out = (
    data_dict[key]
    for key in [
        "X_train", "noisy_labels", "y_train_idx", "noisy_labels_idx", "X_out"
    ]
)

In [None]:
plot_data(X_train, y_train_idx, noisy_labels_idx, X_out)
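
The label accuracy mentioned above can be derived from the index array of label errors. The following is a minimal sketch that uses small stand-in arrays (hypothetical values) in place of the `y_train_idx` and `noisy_labels_idx` returned by `create_data()`:

```python
import numpy as np

# Stand-in values (hypothetical); in the tutorial these come from `data_dict`.
y_train_idx = np.array([0, 1, 2, 1, 0, 2, 1, 0])  # true labels
noisy_labels_idx = np.array([2, 5])                # positions of label errors

# Every example not listed in `noisy_labels_idx` has a correct given label.
n_examples = len(y_train_idx)
label_accuracy = 1 - len(noisy_labels_idx) / n_examples
print(f"Label accuracy: {label_accuracy:.2%}")
```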

In real-world scenarios, you won't know the true labels or the distribution of the features, so we won't use these in this tutorial, except for evaluation purposes.



`Datalab` has several ways of loading the data.
In this case, we'll simply wrap the training features and noisy labels in a dictionary so that we can pass it to `Datalab`.

In [None]:
data = {"X": X_train, "y": noisy_labels}

Other supported data formats for `Datalab` include: [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and [HuggingFace Datasets](https://huggingface.co/docs/datasets/index). `Datalab` works across most data modalities (image, text, tabular, audio, etc). It is intended to find issues that commonly occur in datasets for which you have trained a supervised ML model, regardless of the type of data.
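
As an illustration of the DataFrame format, the sketch below wraps features and labels in a pandas DataFrame. The array values and the column names `feature_0` and `feature_1` are hypothetical stand-ins, and the `Datalab` calls are left as comments so the snippet runs without cleanlab installed.

```python
import numpy as np
import pandas as pd

# Stand-in data (hypothetical values) with the same layout as the tutorial's
# `X_train` (two numeric features) and `noisy_labels` (string labels).
X = np.array([[0.1, 1.2], [0.4, 0.9], [2.3, 2.1], [2.5, 1.8]])
y = np.array(["low", "low", "high", "high"])

# One row per example: feature columns plus a label column.
df = pd.DataFrame({"feature_0": X[:, 0], "feature_1": X[:, 1], "y": y})

# The numeric feature matrix can be recovered from the feature columns.
features = df[["feature_0", "feature_1"]].to_numpy()

# With cleanlab installed, the DataFrame would be passed in place of the dict:
#   lab = Datalab(data=df, label_name="y")
#   lab.find_issues(features=features, pred_probs=pred_probs)
```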

### Get out-of-sample predicted probabilities from a classifier

To detect certain types of issues in classification data (e.g. label errors), `Datalab` relies on predicted class probabilities from a trained model. Ideally, the prediction for each example should be out-of-sample (to avoid overfitting), coming from a copy of the model that was not trained on this example. 

This tutorial uses a simple logistic regression model 
and the `cross_val_predict()` function from scikit-learn to generate out-of-sample predicted class probabilities for every example in the training set. You can replace this with *any* other classifier model and train it with cross-validation to get out-of-sample predictions.

In [None]:
model = LogisticRegression()
pred_probs = cross_val_predict(
    estimator=model, X=data["X"], y=data["y"], cv=5, method="predict_proba"
)
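
Before handing predicted probabilities to any downstream check, it can help to confirm they form a valid probability matrix: one row per example, one column per class, and rows that sum to 1. The helper `check_pred_probs` below is a hypothetical sketch (it is not part of cleanlab), and it assumes every class appears at least once among the labels. Note that scikit-learn orders the `predict_proba` columns by sorted class label.

```python
import numpy as np


def check_pred_probs(pred_probs: np.ndarray, labels: np.ndarray) -> None:
    """Raise ValueError if `pred_probs` is not a valid (N, K) probability matrix."""
    n_classes = len(np.unique(labels))  # assumes every class occurs in `labels`
    expected_shape = (len(labels), n_classes)
    if pred_probs.shape != expected_shape:
        raise ValueError(f"expected shape {expected_shape}, got {pred_probs.shape}")
    if np.any(pred_probs < 0) or np.any(pred_probs > 1):
        raise ValueError("probabilities must lie in [0, 1]")
    if not np.allclose(pred_probs.sum(axis=1), 1.0):
        raise ValueError("each row must sum to 1")


# Stand-in values (hypothetical) for 4 examples and 3 classes.
labels = np.array(["a", "b", "c", "a"])
pred_probs = np.array(
    [
        [0.8, 0.1, 0.1],
        [0.2, 0.7, 0.1],
        [0.1, 0.2, 0.7],
        [0.4, 0.5, 0.1],
    ]
)
check_pred_probs(pred_probs, labels)  # valid input, so nothing is raised
```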

## Use Datalab to find issues in the dataset

We create a `Datalab` object (you should only have one per dataset) and provide it with the data object and the name of the label column in the data object (only classification datasets are currently supported).

All that is needed to audit your data is to call `find_issues()`.
This method accepts various inputs, such as predicted class probabilities and numeric feature representations of the data. The more information you provide here, the more thoroughly `Datalab` will audit your data! Note that `features` should be some numeric representation of each example, either obtained through a preprocessing transformation of your raw data or embeddings from a (pre)trained model. In this case, our data is already entirely numeric so we just provide the features directly.

In [None]:
lab = Datalab(data, label_name="y")
lab.find_issues(pred_probs=pred_probs, features=data["X"])

Now let's review the results of this audit using `report()`.
This provides a high-level summary of each type of issue found in the dataset.

In [None]:
lab.report()

There are several methods to get more details about a particular issue.

For example, `get_summary()` fetches summary statistics regarding how severe each type of issue is overall across the whole dataset.

In [None]:
lab.get_summary("label")

In [None]:
lab.get_summary()

The `get_issues()` method returns information about each individual example: whether or not it exhibits this issue, and a quality score for how severe the issue appears to be. Lower scores indicate more severe instances of the issue, so you can sort by these values to see the most concerning examples in your dataset for each type of issue.

In [None]:
examples_w_issue = (
    lab.get_issues("label")
    .query('is_label_issue')
    .sort_values("label_score")
)
examples_w_issue.head()

Looking at the labels for some of these top-ranked examples, we find their given label was indeed incorrect!
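
Because this toy dataset comes with the true positions of the label errors, the flagged examples can also be scored against them. The sketch below uses hypothetical stand-in index arrays. In the tutorial, the flagged indices would likely be `examples_w_issue.index.to_numpy()` and the true error indices would be `noisy_labels_idx`.

```python
import numpy as np

# Stand-in values (hypothetical) for flagged and truly mislabeled examples.
flagged_idx = np.array([3, 7, 12, 20])
true_error_idx = np.array([3, 7, 15, 20, 31])

correctly_flagged = np.intersect1d(flagged_idx, true_error_idx)
precision = len(correctly_flagged) / len(flagged_idx)  # flagged examples that are real errors
recall = len(correctly_flagged) / len(true_error_idx)  # real errors that were flagged
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

This kind of evaluation is only possible on synthetic or benchmark data where the true labels are known.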

Additional information (statistics, intermediate results, etc) related to a particular issue check can be accessed via `get_info()`.

In [None]:
label_issue_info = lab.get_info("label")  # big dict with tons of miscellaneous information

for k, v in label_issue_info.items():  # only print some of this big dict
    str_v = str(v)
    max_print_chars = 50
    if len(str_v) > max_print_chars:
        str_v = str_v[:max_print_chars] + "..."
    print(f"{k}: {str_v}")

These are also directly accessible as attributes of the `Datalab` instance.

In [None]:
print("Summary:", lab.issue_summary, sep="\n", end="\n\n")

print("Issues:", lab.issues.head(), sep="\n", end="\n\n")

# print("Info:", lab.info)  # Massive dict with all sorts of information, too big to print here

You can see `Datalab` makes it very easy to check your datasets for all sorts of issues that are important to deal with for training robust models.  The remainder of this tutorial covers more advanced functionality to control various aspects of this process.

## Incremental issue search and specifying nondefault arguments

We can call `find_issues` multiple times on a `Datalab` object to detect issues one type at a time.

This is done via the `issue_types` argument, which accepts a dictionary that maps each issue type to a dictionary of nondefault keyword arguments to use when detecting that type of issue (an empty dictionary keeps the defaults). Notice this nondefault call to `find_issues()` updates the output of `report()`.

In [None]:
lab = Datalab(data, label_name="y")
lab.find_issues(pred_probs=pred_probs, issue_types={"label": {}})  # label issues are detected solely based on pred_probs so `features` is unnecessary here
lab.report()

We can check for additional types of issues with the same `Datalab`.

In [None]:
lab.find_issues(pred_probs=pred_probs, features=data["X"], issue_types={"outlier": {}, "near_duplicate": {}})
lab.report()

We can also overwrite previously-executed checks for a type of issue. Here we re-run the detection of outliers, but specify that different non-default settings should be used (in this case, the number of neighbors `k` that each datapoint is compared against to determine whether it is an outlier).
The results from this new detection will replace the original outlier detection results in the updated `Datalab`. You could similarly specify non-default settings for other issue types in the first call to `Datalab.find_issues()`.

In [None]:
lab.find_issues(pred_probs=pred_probs, features=data["X"], issue_types={"outlier": {"k": 80}})
lab.report()
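
To build intuition for how a neighbor count like `k` can influence outlier detection, here is a minimal sketch of a generic distance-based outlier measure: the average distance from each point to its `k` nearest neighbors. This is not cleanlab's implementation. The function name `knn_avg_distance` and the data are hypothetical, and the sketch assumes `k` is smaller than the number of examples.

```python
import numpy as np


def knn_avg_distance(features: np.ndarray, k: int) -> np.ndarray:
    """Average Euclidean distance from each point to its k nearest neighbors."""
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # (N, N) pairwise distances
    np.fill_diagonal(dists, np.inf)             # a point is not its own neighbor
    nearest = np.sort(dists, axis=1)[:, :k]     # k smallest distances per row
    return nearest.mean(axis=1)


rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.5, size=(50, 2))  # dense cluster
far_point = np.array([[10.0, 10.0]])                     # isolated point
features = np.vstack([cluster, far_point])

for k in (5, 20):
    avg_dist = knn_avg_distance(features, k)
    # Larger k averages over more distant neighbors, so distances can only grow.
    print(f"k={k}: most isolated index = {int(np.argmax(avg_dist))}")
```

Since a larger `k` averages over neighbors that are farther away, the measure changes with `k`, and any threshold applied to it may flag a different set of examples.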

You can also increase the verbosity of the `report` to see additional information about the data issues and control how many top-ranked examples are shown for each issue.

In [None]:
lab.report(num_examples=10, verbosity=2)

Notice how the number of flagged outlier issues has changed after specifying different settings to use for outlier detection.

## Adding a custom IssueManager (advanced)

`Datalab` detects pre-defined types of issues for you in one line of code: `find_issues()`. What if you want to check for other custom types of issues along with these pre-defined types, all within the same line of code?

All issue types in `Datalab` are subclasses of cleanlab's `IssueManager` class.
To register a custom issue type for use with `Datalab`, simply make it a subclass of `IssueManager` as well.

The necessary members to implement in the subclass are:

- A class variable called `issue_name` that acts as a unique identifier for the type of issue.
- An instance method called `find_issues` that:
  - Computes a quality score for each example in the dataset (between 0 and 1), in terms of how *unlikely* it is to be an issue (lower scores indicate more severe issues).
  - Flags each example as an issue or not (may be based on thresholding the quality scores).
  - Combines these in a dataframe that is assigned to an `issues` attribute of the `IssueManager`.
  - Defines a summary score for the overall quality of the entire dataset, in terms of this type of issue, and sets this score as part of the `summary` attribute of the `IssueManager`.
  
To demonstrate this, we create an arbitrary issue type that checks the divisibility of an example's index in the dataset by 13.

In [None]:
from cleanlab.datalab.issue_manager import IssueManager
from cleanlab.datalab.factory import register


def scoring_function(idx: int, div: int = 13) -> float:
    if idx == 0:
        # Zero excluded from the divisibility check, gets the highest score
        return 1
    rem = idx % div
    inv_scale = idx // div
    if rem == 0:
        return 0.5 * (1 - np.exp(-0.1*(inv_scale-1)))
    else:
        return 1 - 0.49 * (1 - np.exp(-inv_scale**0.5))*rem/div


@register  # register this issue type for use with Datalab
class SuperstitionIssueManager(IssueManager):
    """A custom issue manager that flags examples whose indices are
    divisible by 13.
    """
    """
    description: str = "Examples with indices that are divisible by 13 may be unlucky."  # Optional
    issue_name: str = "superstition"

    def find_issues(self, div=13, **_) -> None:
        ids = self.datalab.issues.index.to_series()
        issues_mask = ids.apply(lambda idx: idx % div == 0 and idx != 0)
        scores = ids.apply(lambda idx: scoring_function(idx, div))
        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue": issues_mask,
                self.issue_score_key: scores,
            },
        )
        summary_score = 1 - sum(issues_mask) / len(issues_mask)
        self.summary = self.make_summary(score=summary_score)
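
To see how the scores behave, the snippet below evaluates `scoring_function` on a few indices. The definition is repeated from the cell above so the snippet runs on its own, and the chosen indices are arbitrary.

```python
import numpy as np


def scoring_function(idx: int, div: int = 13) -> float:
    # Same definition as in the cell above, repeated for a standalone snippet.
    if idx == 0:
        return 1
    rem = idx % div
    inv_scale = idx // div
    if rem == 0:
        return 0.5 * (1 - np.exp(-0.1 * (inv_scale - 1)))
    else:
        return 1 - 0.49 * (1 - np.exp(-inv_scale**0.5)) * rem / div


for idx in (0, 1, 12, 13, 14, 26, 130):
    print(idx, round(float(scoring_function(idx)), 4))
```

Nonzero multiples of 13 always score below 0.5, starting at 0 for index 13 and rising toward 0.5 for larger multiples. All other indices score above 0.5, so the flagged examples are exactly the lowest-scoring ones.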

Once registered, this `IssueManager` will perform custom issue checks when `find_issues` is called on a `Datalab` instance.

As our `Datalab` instance here already has results from the outlier and near duplicate checks, we perform the custom issue check separately.

In [None]:
lab.find_issues(issue_types={"superstition": {}})
lab.report()

## Save and load Datalab objects

A `Datalab` can be saved to a folder at a specified path. In a future Python process, this path can be used to load the `Datalab` from file back into memory. Your dataset is not saved as part of this process, so you'll need to save/load it separately to keep working with it.

In [None]:
path = "datalab-files"
lab.save(path)

new_lab = Datalab.load(path)
new_lab.report()
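
Since the dataset itself is not saved, one option is to store it next to the `Datalab` files. The sketch below uses NumPy's `.npz` format with stand-in values (hypothetical) for the tutorial's `data` dictionary. The file name `dataset.npz` is also hypothetical, and a temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile

import numpy as np

# Stand-in values (hypothetical) for the tutorial's `data` dictionary.
data = {
    "X": np.array([[0.1, 1.2], [2.3, 2.1], [0.4, 0.9]]),
    "y": np.array(["low", "high", "low"]),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "dataset.npz")
    np.savez(data_path, **data)  # stores one array per dictionary key

    # Later (for example in a new Python process), restore the dictionary.
    with np.load(data_path) as archive:
        restored = {key: archive[key] for key in archive.files}

print(sorted(restored))
```

Depending on your cleanlab version, `Datalab.load` may accept the dataset as an additional argument, so it is worth checking its documentation before reattaching the restored data.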