<a href="https://colab.research.google.com/github/LxYuan0420/nlp/blob/main/notebooks/Text_Data_Cleaning_using_cleanlab_and_SentenceTransformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification with Transformers and Datalab


In this 5-minute quickstart tutorial, we use cleanlab to find potential label errors in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests which can be classified into 10 categories corresponding to the intent of the request. Cleanlab automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), or otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!

**Overview of what we'll do in this tutorial:**

- Use a pretrained transformer model to extract the text embeddings from the customer service requests

- Train a simple Logistic Regression model on the text embeddings to compute out-of-sample predicted probabilities

- Run cleanlab's `Datalab` audit with these predictions and embeddings to identify problems such as label issues, outliers, and near duplicates in the dataset.

<div class="alert alert-info">
Quickstart
<br/>

Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.

<div  class=markdown markdown="1" style="background:white;margin:16px">

```ipython3
from cleanlab import Datalab

lab = Datalab(data=your_dataset, label_name="column_name_of_labels")
lab.find_issues(pred_probs=your_pred_probs, features=your_features)

lab.report()
lab.get_issues()
```

</div>
</div>

## 1. Install required dependencies


You can use `pip` to install all packages required for this tutorial as follows:

```ipython3
!pip install scikit-learn sentence-transformers
!pip install "cleanlab[datalab]"
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git
```

In [None]:
# Package installation (hidden on docs.cleanlab.ai).
# If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)
# Package versions we used: scikit-learn==1.2.0 sentence-transformers==2.2.2

# Maps each importable module name to the pip package that provides it
dependencies = {
    "cleanlab": "cleanlab",
    "sklearn": "scikit-learn",
    "sentence_transformers": "sentence-transformers",
    "datasets": "datasets",
}

import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # disable parallelism to avoid deadlocks with huggingface

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install cleanlab==v2.4.0
    cmd = ' '.join([package for module, package in dependencies.items() if module != "cleanlab"])
    %pip install $cmd
else:
    missing_dependencies = []
    for dependency in dependencies:
        try:
            __import__(dependency)
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")

In [2]:
import re
import string
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

from cleanlab import Datalab

In [3]:
# This cell is hidden from docs.cleanlab.ai

import random
import numpy as np

pd.set_option("display.max_colwidth", None)

SEED = 123456  # for reproducibility
np.random.seed(SEED)
random.seed(SEED)

In [11]:
#!pip install watermark
%reload_ext watermark

In [21]:
!pip freeze | grep -E "trans|clean"

cleanlab==2.4.0
google-cloud-translate==3.11.1
sentence-transformers==2.2.2
transformers==4.30.2


## 2. Load and format the text dataset


In [22]:
data = pd.read_csv("https://s.cleanlab.ai/banking-intent-classification.csv")
data.head()

Unnamed: 0,text,label
0,i accidentally made a payment to a wrong account. what should i do?,cancel_transfer
1,"i no longer want to transfer funds, can we cancel that transaction?",cancel_transfer
2,"cancel my transfer, please.",cancel_transfer
3,i want to revert this mornings transaction.,cancel_transfer
4,i just realised i made the wrong payment yesterday. can you please change it to the right account? it's my rent payment and really really needs to be in the right account by tomorrow,cancel_transfer


In [23]:
raw_texts, labels = data["text"].values, data["label"].values
num_classes = len(set(labels))

print(f"This dataset has {num_classes} classes.")
print(f"Classes: {set(labels)}")

This dataset has 10 classes.
Classes: {'visa_or_mastercard', 'card_about_to_expire', 'apple_pay_or_google_pay', 'beneficiary_not_allowed', 'card_payment_fee_charged', 'change_pin', 'supported_cards_and_currencies', 'getting_spare_card', 'cancel_transfer', 'lost_or_stolen_phone'}


In [24]:
from collections import Counter

print(Counter(labels))

Counter({'card_payment_fee_charged': 138, 'beneficiary_not_allowed': 111, 'cancel_transfer': 109, 'visa_or_mastercard': 98, 'card_about_to_expire': 98, 'supported_cards_and_currencies': 95, 'apple_pay_or_google_pay': 93, 'getting_spare_card': 90, 'change_pin': 88, 'lost_or_stolen_phone': 80})


Let's view the i-th example in the dataset:

In [25]:
i = 1  # change this to view other examples from the dataset
print(f"Example Label: {labels[i]}")
print(f"Example Text: {raw_texts[i]}")

Example Label: cancel_transfer
Example Text: i no longer want to transfer funds, can we cancel that transaction?


The data is stored as two numpy arrays:

1. `raw_texts` stores the customer service request utterances as text
2. `labels` stores the intent categories (labels) for each example

<div class="alert alert-info">
Bringing Your Own Data (BYOD)?

You can easily replace the above with your own text dataset, and continue with the rest of the tutorial.

</div>
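If your data lives in a CSV file, loading it into the same two arrays might look like the sketch below. The column names `utterance` and `intent` are hypothetical, and the in-memory `io.StringIO` object only stands in for your real file path or URL.

```python
import io

import pandas as pd

# Stand-in for your own file; replace io.StringIO(...) with a path or URL.
# The column names "utterance" and "intent" are hypothetical.
csv_file = io.StringIO(
    "utterance,intent\n"
    "how do i reset my password?,account_access\n"
    "my parcel never arrived,delivery_problem\n"
    "i forgot my login details,account_access\n"
)
my_data = pd.read_csv(csv_file)

# Drop rows with a missing text or label so both arrays stay aligned.
my_data = my_data.dropna(subset=["utterance", "intent"])

# The same two variables the rest of the tutorial expects.
raw_texts = my_data["utterance"].values
labels = my_data["intent"].values

print(len(raw_texts), "examples,", len(set(labels)), "classes")
```

Keeping the variable names `raw_texts` and `labels` means the remaining cells should run without further changes.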

Next we convert the text strings into vectors better suited as inputs for our ML models.

We will use numeric representations from a pretrained Transformer model as embeddings of our text. The [Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) library offers simple methods to compute these embeddings for text data. Here, we load the pretrained `electra-small-discriminator` model, and then run our data through the network to extract a vector embedding of each example.

In [26]:
transformer = SentenceTransformer('google/electra-small-discriminator')
text_embeddings = transformer.encode(raw_texts)

Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/google_electra-small-discriminator were not used when initializing ElectraModel: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Our subsequent ML model will directly operate on elements of `text_embeddings` in order to classify the customer service requests.
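To give some intuition for how a variable-length sentence becomes one fixed-length vector, here is a minimal NumPy sketch of masked mean pooling over token vectors. It assumes mean pooling is the aggregation in use (to our understanding, this is what Sentence Transformers falls back to when given a plain Hugging Face checkpoint); the token vectors and attention mask are made-up toy values, not output from the ELECTRA model.

```python
import numpy as np

# Toy token vectors for a batch of 2 sentences, padded to 4 tokens, dimension 3.
token_vectors = np.array(
    [
        [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
        [[1.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 2.0, 4.0], [9.0, 9.0, 9.0]],
    ]
)
# 1 marks a real token, 0 marks padding that must not influence the average.
attention_mask = np.array([[1, 1, 0, 0], [1, 1, 1, 0]])


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens only, one result per sentence."""
    mask = attention_mask[:, :, None].astype(float)  # (batch, tokens, 1)
    summed = (token_vectors * mask).sum(axis=1)  # (batch, dim)
    counts = mask.sum(axis=1)  # (batch, 1)
    return summed / counts


sentence_embeddings = mean_pool(token_vectors, attention_mask)
print(sentence_embeddings.shape)
```

Whatever the sentence length, each row of the result has the same dimension, which is what lets a simple classifier operate on `text_embeddings`.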

## 3. Define a classification model and compute out-of-sample predicted probabilities

A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However, this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and only train the output layer without having to rely on GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted embeddings.

To identify label issues, cleanlab requires a probabilistic prediction from your model for each datapoint. However, these predictions will be _overfit_ (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to only be used with **out-of-sample** predicted class probabilities, i.e. on datapoints held out from the model during training.

Here we obtain out-of-sample predicted class probabilities for every example in our dataset using a Logistic Regression model with cross-validation.
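To make "out-of-sample" concrete, the sketch below spells out the cross-validation loop by hand in NumPy. It is an illustration only: the data is synthetic, the classifier is a simple nearest-centroid stand-in rather than Logistic Regression, and the folds are assigned at random rather than stratified by class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3 well-separated classes, 20 points each, in 2 dimensions.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
y = np.repeat(np.arange(3), 20)
X = centers[y] + rng.normal(scale=0.5, size=(60, 2))


def fit_centroids(X_train, y_train, num_classes):
    # "Training" here is just computing one mean vector per class.
    return np.stack([X_train[y_train == k].mean(axis=0) for k in range(num_classes)])


def predict_proba(centroids, X_test):
    # Softmax over negative squared distances to each class centroid.
    sq_dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -sq_dists
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


num_folds, num_classes = 5, 3
fold_of = rng.permutation(len(X)) % num_folds  # assign each row to exactly one fold
oos_pred_probs = np.zeros((len(X), num_classes))

for fold in range(num_folds):
    held_out = fold_of == fold
    centroids = fit_centroids(X[~held_out], y[~held_out], num_classes)
    # Rows in this fold are scored by a model that never saw them.
    oos_pred_probs[held_out] = predict_proba(centroids, X[held_out])

print(oos_pred_probs.shape)
```

Every row ends up with a probability vector produced by a model fit on the other folds, which is the property cleanlab relies on.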

In [27]:
model = LogisticRegression(max_iter=400)

pred_probs = cross_val_predict(model, text_embeddings, labels, method="predict_proba")

## 4. Use cleanlab to find issues in your dataset

Given feature embeddings and the (out-of-sample) predicted class probabilities obtained from any model you have, cleanlab can quickly help you identify low-quality examples in your dataset.

Here, we use cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we'll simply wrap the raw texts and noisy labels in a dictionary.

In [29]:
data_dict = {"texts": raw_texts, "labels": labels}

All that is needed to audit your data is to call `find_issues()`. We pass in the predicted probabilities and the feature embeddings obtained above, but you do not necessarily need to provide all of this information depending on which types of issues you are interested in. The more inputs you provide, the more types of issues `Datalab` can detect in your data. Using a better model to produce these inputs helps cleanlab estimate issues more accurately.

In [30]:
lab = Datalab(data_dict, label_name="labels")
lab.find_issues(pred_probs=pred_probs, features=text_embeddings)

Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided features ...
Finding near_duplicate issues ...
Audit complete. 87 issues found in the dataset.


After the audit is complete, review the findings using the `report` method:

In [31]:
lab.report()

Here is a summary of the different kinds of issues found in the data:

    issue_type  num_issues
         label          44
       outlier          39
near_duplicate           4

Dataset Information: num_examples: 1000, num_classes: 10


----------------------- label issues -----------------------

About this issue:
	Examples whose given label is estimated to be potentially incorrect
    (e.g. due to annotation error) are flagged as having label issues.
    

Number of examples with this issue: 44
Overall dataset quality in terms of this issue: 0.9560

Examples representing most severe instances of this issue:
     is_label_issue  label_score              given_label           predicted_label
981            True     0.000005     card_about_to_expire  card_payment_fee_charged
974            True     0.000150  beneficiary_not_allowed                change_pin
982            True     0.000219  apple_pay_or_google_pay      card_about_to_expire
990            True     0.000325  apple_pay_o

### Label issues

The report indicates that cleanlab identified many label issues in our dataset. We can see which examples are flagged as likely mislabeled and the label quality score for each example using the `get_issues` method, specifying `label` as an argument to focus on label issues in the data.

In [32]:
label_issues = lab.get_issues("label")
label_issues.head()

Unnamed: 0,is_label_issue,label_score,given_label,predicted_label
0,False,0.806106,cancel_transfer,cancel_transfer
1,False,0.271031,cancel_transfer,cancel_transfer
2,False,0.695446,cancel_transfer,cancel_transfer
3,False,0.179742,cancel_transfer,apple_pay_or_google_pay
4,False,0.822643,cancel_transfer,cancel_transfer


In [33]:
label_issues.shape

(1000, 4)

In [35]:
# a low score means low confidence in the given label, i.e. the example is more likely mislabeled
sorted_label_issues = label_issues.sort_values(["label_score"], ascending=[True])
sorted_label_issues.head()

Unnamed: 0,is_label_issue,label_score,given_label,predicted_label
981,True,5e-06,card_about_to_expire,card_payment_fee_charged
974,True,0.00015,beneficiary_not_allowed,change_pin
982,True,0.000219,apple_pay_or_google_pay,card_about_to_expire
990,True,0.000325,apple_pay_or_google_pay,beneficiary_not_allowed
971,True,0.000511,beneficiary_not_allowed,change_pin


This method returns a dataframe containing a label quality score for each example. These numeric scores lie between 0 and 1, where lower scores indicate examples more likely to be mislabeled. The dataframe also contains a boolean column specifying whether or not each example is identified to have a label issue (indicating it is likely mislabeled).
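For intuition about what such a score can look like, here is a sketch of one simple scoring rule, self-confidence: the out-of-sample probability the model assigns to the given label. This illustrates the idea on toy numbers and is not necessarily the exact computation behind `label_score`; the boolean `is_label_issue` flag involves additional logic that this sketch does not attempt to reproduce.

```python
import numpy as np

# Toy out-of-sample predicted probabilities for 4 examples and 3 classes.
toy_pred_probs = np.array(
    [
        [0.90, 0.05, 0.05],
        [0.10, 0.80, 0.10],
        [0.02, 0.03, 0.95],
        [0.40, 0.35, 0.25],
    ]
)
# Given (possibly noisy) labels as integer class indices.
toy_labels = np.array([0, 1, 0, 0])

# Self-confidence: the probability the model assigns to the given label.
self_confidence = toy_pred_probs[np.arange(len(toy_labels)), toy_labels]
predicted = toy_pred_probs.argmax(axis=1)

# Most suspicious examples first.
ranking = np.argsort(self_confidence)
print(ranking, self_confidence[ranking])
```

Example 2 gets a very low score because the model strongly prefers another class. Example 3 also scores fairly low even though its predicted class matches its given label, which is why a low score on its own does not always mean the label is wrong.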

We can get the subset of examples flagged with label issues, and also sort by label quality score to find the indices of the 55 most likely mislabeled examples in our dataset.

In [52]:
identified_label_issues = label_issues[label_issues["is_label_issue"]]
lowest_quality_labels = label_issues["label_score"].argsort()[:55].to_numpy()

print(
    f"cleanlab found {len(identified_label_issues)} potential label errors in the dataset.\n"
    f"Here are indices of the top 55 most likely errors: \n {lowest_quality_labels}"
)

cleanlab found 44 potential label errors in the dataset.
Here are indices of the top 55 most likely errors: 
 [981 974 982 990 971 997 980 978 989 998 985 987 983 966 119 991 950 964
 994 970 984 992 977 963 652 993 367 962 972 973 959 436 979 976 960 988
 957 958 412 612 699 967 591 965 961 954 547 968  81 480 756 558 999 502
 557]


Let's review some of the most likely label errors.

Here we display the examples identified as the most likely label errors in the dataset, together with their given (original) label and a suggested alternative label from cleanlab.

In [53]:
data_with_suggested_labels = pd.DataFrame(
    {"text": raw_texts, "given_label": labels, "suggested_label": label_issues["predicted_label"], "label_score": label_issues["label_score"]}
)
data_with_suggested_labels.iloc[lowest_quality_labels]

Unnamed: 0,text,given_label,suggested_label,label_score
981,i was charged for getting cash.,card_about_to_expire,card_payment_fee_charged,5e-06
974,can i change my pin on holiday?,beneficiary_not_allowed,change_pin,0.00015
982,will i be sent a new card before mine expires?,apple_pay_or_google_pay,card_about_to_expire,0.000219
990,Connection Timed Out,apple_pay_or_google_pay,beneficiary_not_allowed,0.000325
971,please tell me how to change my pin.,beneficiary_not_allowed,change_pin,0.000511
997,Sprinkle the rainbow sprinkles over the top of the salad and serve immediately. Enjoy your silly Rainbow Surprise Salad!NaN,getting_spare_card,change_pin,0.000709
980,why do i see extra charges for withdrawing my money?,card_about_to_expire,card_payment_fee_charged,0.000947
978,why am i being charge a fee when using an atm?,card_about_to_expire,card_payment_fee_charged,0.001031
989,<p><samp>File not found.<br>Press F1 to continue</samp></p>,supported_cards_and_currencies,beneficiary_not_allowed,0.001195
998,https://github.com/cleanlab/cleanlab,visa_or_mastercard,beneficiary_not_allowed,0.002013


The same dataframe can be used to apply corrections. The cells below show the mechanics on row 3, which is used purely for illustration (cleanlab did not flag it as a label issue): first overwrite its given label with the suggested label, then drop the row altogether. We work on a copy so that later positional lookups on `data_with_suggested_labels` are unaffected.

In [55]:
data_with_suggested_labels.head()

Unnamed: 0,text,given_label,suggested_label,label_score
0,i accidentally made a payment to a wrong account. what should i do?,cancel_transfer,cancel_transfer,0.806106
1,"i no longer want to transfer funds, can we cancel that transaction?",cancel_transfer,cancel_transfer,0.271031
2,"cancel my transfer, please.",cancel_transfer,cancel_transfer,0.695446
3,i want to revert this mornings transaction.,cancel_transfer,apple_pay_or_google_pay,0.179742
4,i just realised i made the wrong payment yesterday. can you please change it to the right account? it's my rent payment and really really needs to be in the right account by tomorrow,cancel_transfer,cancel_transfer,0.822643


In [56]:
# sample code to update a specific row's given label with the suggested label
corrected_data = data_with_suggested_labels.copy()
suggested_label = corrected_data.loc[3, "suggested_label"]
corrected_data.loc[3, "given_label"] = suggested_label

In [57]:
corrected_data.head()

Unnamed: 0,text,given_label,suggested_label,label_score
0,i accidentally made a payment to a wrong account. what should i do?,cancel_transfer,cancel_transfer,0.806106
1,"i no longer want to transfer funds, can we cancel that transaction?",cancel_transfer,cancel_transfer,0.271031
2,"cancel my transfer, please.",cancel_transfer,cancel_transfer,0.695446
3,i want to revert this mornings transaction.,apple_pay_or_google_pay,apple_pay_or_google_pay,0.179742
4,i just realised i made the wrong payment yesterday. can you please change it to the right account? it's my rent payment and really really needs to be in the right account by tomorrow,cancel_transfer,cancel_transfer,0.822643


In [63]:
# sample code to drop a specific row (here index 3) from the dataset
corrected_data = corrected_data.drop([3])

In [64]:
# after the drop, index 3 no longer appears
corrected_data.head()

Unnamed: 0,text,given_label,suggested_label,label_score
0,i accidentally made a payment to a wrong account. what should i do?,cancel_transfer,cancel_transfer,0.806106
1,"i no longer want to transfer funds, can we cancel that transaction?",cancel_transfer,cancel_transfer,0.271031
2,"cancel my transfer, please.",cancel_transfer,cancel_transfer,0.695446
4,i just realised i made the wrong payment yesterday. can you please change it to the right account? it's my rent payment and really really needs to be in the right account by tomorrow,cancel_transfer,cancel_transfer,0.822643
5,i want to cancel a transaction from this morning.,cancel_transfer,cancel_transfer,0.646769

In [44]:
highest_quality_labels = label_issues["label_score"].argsort()[::-1][:15].to_numpy()
data_with_suggested_labels.iloc[highest_quality_labels]

Unnamed: 0,text,given_label,suggested_label,label_score
912,need help with google pay top up,apple_pay_or_google_pay,apple_pay_or_google_pay,0.999051
402,can i choose my visa or mastercard?,visa_or_mastercard,visa_or_mastercard,0.998505
413,visa or mastercard?,visa_or_mastercard,visa_or_mastercard,0.998316
10,cancel a transaction,cancel_transfer,cancel_transfer,0.99804
369,can i choose between mastercard and visa?,visa_or_mastercard,visa_or_mastercard,0.997868
146,there was a fee charged when i paid with my card.,card_payment_fee_charged,card_payment_fee_charged,0.997449
94,i want to cancel the transaction i made earlier,cancel_transfer,cancel_transfer,0.997044
353,can i choose from either visa or mastercard?,visa_or_mastercard,visa_or_mastercard,0.996566
419,can i get a visa instead of a mastercard?,visa_or_mastercard,visa_or_mastercard,0.996501
410,can i get a visa or mastercard?,visa_or_mastercard,visa_or_mastercard,0.996221


The lowest-scoring examples are very clear label errors that cleanlab has identified in this data, in contrast to the highest-scoring examples, whose given labels match their text. Note that the `given_label` of the low-scoring examples does not correctly reflect the intent of these requests; whoever produced this dataset made many mistakes that are important to address before modeling the data.

### Outlier issues

According to the report, our dataset contains some outliers.
We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via `get_issues`. We sort the resulting DataFrame by cleanlab's outlier quality score to see the most severe outliers in our dataset.

In [45]:
outlier_issues = lab.get_issues("outlier")
outlier_issues.sort_values("outlier_score").head()

Unnamed: 0,is_outlier_issue,outlier_score
994,True,0.676322
999,True,0.686193
989,True,0.711223
433,True,0.711974
990,True,0.713793


In [59]:
lowest_quality_outliers = outlier_issues["outlier_score"].argsort()[:35]

data_with_suggested_labels.iloc[lowest_quality_outliers]

Unnamed: 0,text,given_label,suggested_label,label_score
994,(A AND NOT B) OR (C AND NOT D) OR (B AND NOT C AND D),change_pin,beneficiary_not_allowed,0.004902
999,636C65616E6C616220697320617765736F6D6521,cancel_transfer,beneficiary_not_allowed,0.062079
989,<p><samp>File not found.<br>Press F1 to continue</samp></p>,supported_cards_and_currencies,beneficiary_not_allowed,0.001195
433,phone is gone,lost_or_stolen_phone,lost_or_stolen_phone,0.912998
990,Connection Timed Out,apple_pay_or_google_pay,beneficiary_not_allowed,0.000325
81,cancel transaction,cancel_transfer,supported_cards_and_currencies,0.049447
993,Nants ingonyama bagithi baba,card_about_to_expire,beneficiary_not_allowed,0.008978
998,https://github.com/cleanlab/cleanlab,visa_or_mastercard,beneficiary_not_allowed,0.002013
986,Voluptatem velit sit labore modi quiquia ipsum ut,apple_pay_or_google_pay,beneficiary_not_allowed,0.266437
662,payment did not process,beneficiary_not_allowed,beneficiary_not_allowed,0.977302


In [62]:
lowest_quality_outliers.to_numpy()

array([994, 999, 989, 433, 990,  81, 993, 998, 986, 662,  10,  49, 988,
       987, 347, 995, 506, 992, 996,  60, 846, 461, 275, 956, 285, 276,
       189,  69, 967, 906, 912, 235, 387, 871, 828])

We see that cleanlab has identified entries in this dataset that do not appear to be proper customer requests. Outliers in this dataset appear to be out-of-scope customer requests and other nonsensical text that does not belong in an intent classification dataset. Carefully consider whether such outliers may detrimentally affect your data modeling, and consider removing them from the dataset if so.
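One common way to score how typical an example is uses its distance to its nearest neighbors in embedding space. The NumPy sketch below illustrates that idea on synthetic vectors; the cosine distance, the choice `k=5`, and the `exp(-distance)` mapping are assumptions made for illustration, not a reproduction of cleanlab's outlier estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# 30 toy embeddings clustered together, plus one far-away point in the last row.
inliers = rng.normal(loc=0.0, scale=0.1, size=(30, 4)) + np.array([1.0, 1.0, 1.0, 1.0])
far_point = np.array([[-1.0, 2.0, -3.0, 0.5]])
embeddings = np.vstack([inliers, far_point])


def knn_outlier_scores(embeddings, k=5):
    # Cosine distance between every pair of rows.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos_dist = 1.0 - unit @ unit.T
    np.fill_diagonal(cos_dist, np.inf)  # a point is not its own neighbor
    # Average distance to the k nearest neighbors.
    avg_knn_dist = np.sort(cos_dist, axis=1)[:, :k].mean(axis=1)
    # Map distances to (0, 1]: larger distance -> lower (more atypical) score.
    return np.exp(-avg_knn_dist)


scores = knn_outlier_scores(embeddings)
print(int(scores.argmin()))
```

As with `outlier_score` above, lower values mark examples that sit far from everything else in the dataset.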

### Near-duplicate issues

According to the report, our dataset contains some sets of nearly duplicated examples.
We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues`. We sort the resulting DataFrame by cleanlab's near-duplicate quality score to see the text examples in our dataset that are most nearly duplicated.

In [49]:
duplicate_issues = lab.get_issues("near_duplicate")
duplicate_issues.sort_values("near_duplicate_score").head()

Unnamed: 0,is_near_duplicate_issue,near_duplicate_score,near_duplicate_sets,distance_to_nearest_neighbor
160,True,0.006237,"[148, 219, 234, 118, 201, 223, 125, 140, 978, 172]",0.006237
148,True,0.006237,"[160, 219, 234, 223, 140, 118, 201, 125, 229, 978]",0.006237
546,True,0.006485,"[514, 523, 570, 569, 458, 528, 757, 137, 539, 827]",0.006485
514,True,0.006485,"[546, 523, 570, 757, 761, 569, 528, 527, 458, 137]",0.006485
481,False,0.008164,"[475, 493, 486, 849, 466, 845, 434, 795, 468, 834]",0.008165


The results above show which examples cleanlab considers nearly duplicated (rows where `is_near_duplicate_issue == True`). Here, we see that examples 160 and 148 are nearly duplicated, as are examples 546 and 514.

Let's view these examples to see how similar they are.

In [50]:
data.iloc[[160, 148]]

Unnamed: 0,text,label
160,why was i charged an additional fee when paying with card?,card_payment_fee_charged
148,why was i charged an extra fee when paying with card?,card_payment_fee_charged


In [51]:
data.iloc[[546, 514]]

Unnamed: 0,text,label
546,do i have to go to the bank to change my pin?,change_pin
514,do i have to go into the bank to change my pin?,change_pin


We see that these two sets of requests are indeed very similar to one another! Near duplicates in a dataset may have unintended effects on models, so be wary of splitting them across training/test sets, where they can leak information and inflate test performance.
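One way to keep near duplicates on the same side of a split is to merge them into groups and assign whole groups at once. The sketch below does this with a small union-find structure in plain Python, using the two pairs found above; the 30% test fraction and the random seed are arbitrary choices for illustration.

```python
import random

# Near-duplicate pairs identified above, in a dataset of 1000 examples.
near_duplicate_pairs = [(160, 148), (546, 514)]
num_examples = 1000

parent = list(range(num_examples))


def find(i):
    # Follow parent links to the group representative, compressing the path.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(i, j):
    parent[find(i)] = find(j)


for a, b in near_duplicate_pairs:
    union(a, b)

# Collect the members of each group, then assign whole groups to a split.
groups = {}
for i in range(num_examples):
    groups.setdefault(find(i), []).append(i)

rng = random.Random(0)
train_idx, test_idx = [], []
for members in groups.values():
    target = test_idx if rng.random() < 0.3 else train_idx
    target.extend(members)

print(len(train_idx), len(test_idx))
```

Because a group is never divided, examples 160 and 148 always land in the same split, and likewise 546 and 514.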

As demonstrated above, cleanlab can automatically shortlist the most likely issues in your dataset to help you better curate your dataset for subsequent modeling. With this shortlist, you can decide whether to fix these label issues or remove nonsensical or duplicated examples from your dataset to obtain a higher-quality dataset for training your next ML model. cleanlab's issue detection can be run with outputs from *any* type of model you initially trained.
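As a rough sketch of how such a shortlist might be turned into a cleaned dataset, the example below drops outliers and relabels the remaining flagged examples using pandas. The toy dataframes only mimic the columns returned by `get_issues`, and automatically accepting every suggested label is an assumption made for brevity; in practice you would review the flagged examples first.

```python
import pandas as pd

# Toy stand-ins for `data`, `lab.get_issues("label")` and `lab.get_issues("outlier")`.
toy_data = pd.DataFrame(
    {
        "text": ["cancel my transfer", "can i change my pin?", "Connection Timed Out", "visa or mastercard?"],
        "label": ["cancel_transfer", "beneficiary_not_allowed", "apple_pay_or_google_pay", "visa_or_mastercard"],
    }
)
toy_label_issues = pd.DataFrame(
    {
        "is_label_issue": [False, True, True, False],
        "predicted_label": ["cancel_transfer", "change_pin", "beneficiary_not_allowed", "visa_or_mastercard"],
    }
)
toy_outlier_issues = pd.DataFrame({"is_outlier_issue": [False, False, True, False]})

is_outlier = toy_outlier_issues["is_outlier_issue"]
# Relabel only flagged examples that are not outliers; outliers are dropped instead.
to_relabel = toy_label_issues["is_label_issue"] & ~is_outlier

cleaned = toy_data.copy()
cleaned.loc[to_relabel, "label"] = toy_label_issues.loc[to_relabel, "predicted_label"]
cleaned = cleaned[~is_outlier].reset_index(drop=True)

print(cleaned)
```

Working on a copy leaves the original data untouched, so the audit can be rerun and compared against the cleaned version.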
