## Software installation

This lab relies on a few PyPI packages. If you don't have them installed, uncomment and run the following cell:

In [None]:
# !pip install xgboost scikit-learn pandas cleanlab

## Setup and Data Processing

Let's take a look at the dataset used in this lab, a tabular dataset of student grades.

The data includes three exam scores (numerical features), a written note (categorical feature with missing values), and a (noisy) letter grade (categorical label). Our aim is to train a model to classify the grade for each student based on the other features.

In this dataset, 20% of the grade labels are actually incorrect (the `noisy_letter_grade` column). Synthetic noise was added to this dataset for the purpose of this lab. In this lab, we have access to the true letter grade each student should have received (the `letter_grade` column), which we use for evaluating both the underlying accuracy of model predictions and how well our approach detects which data are mislabeled. We are careful to only use these true grades for evaluation, not for model training.

In the real world, you don't have access to the true labels (you only observe the `noisy_letter_grade`, not the true `letter_grade`). So when evaluating models in the real world, you have to be careful to make sure that your test set is free of error (using methods like those covered in this lab, ideally combined with human review).

In [2]:
from xgboost import XGBClassifier
from sklearn import preprocessing
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

df = pd.read_csv("data/student-grades.csv")
df.head()

Unnamed: 0,stud_ID,exam_1,exam_2,exam_3,notes,letter_grade,noisy_letter_grade
0,f48f73,53,77,93,,C,C
1,0bd4e7,81,64,80,great participation +10,B,B
2,e1795d,74,88,97,,B,B
3,cb9d7a,61,94,78,,C,C
4,9acca4,48,90,91,,C,C


In [3]:
df_c = df.copy()
# Transform letter grades and notes to categorical numbers.
# Necessary for XGBoost.
df['letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['letter_grade'])
df['noisy_letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['noisy_letter_grade'])
df['notes'] = preprocessing.LabelEncoder().fit_transform(df["notes"])
df['notes'] = df['notes'].astype('category')

# Split data for evaluation and set test data.
df_train, df_test = train_test_split(df, random_state=0)
df_train.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)
df_train.head()

Unnamed: 0,stud_ID,exam_1,exam_2,exam_3,notes,letter_grade,noisy_letter_grade
0,37fd76,99,59,70,3,3,3
1,018bff,94,41,91,2,1,1
2,b3c9a0,91,74,88,5,1,1
3,076d92,0,79,65,0,4,4
4,68827d,91,98,75,3,2,2


# Get What We Need

To apply confident learning (the technique explained in today's lecture), we need to obtain [**out-of-sample** predicted probabilities](https://docs.cleanlab.ai/stable/tutorials/pred_probs_cross_val.html#out-of-sample-predicted-probabilities) for all of our data. To do this, we can use K-fold cross validation: for each fold, we will train on some subset of our data and get predictions on the rest of the data that was _not_ used for training.

We need to choose a model in order to do this. For this lab, we'll use [XGBoost](https://xgboost.readthedocs.io/), a library implementing gradient-boosted decision trees, a class of model commonly used for tabular data.

In [4]:
# Prepare test data (this will not change across models)
test_data = df_test.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)
test_labels = df_test['letter_grade']

# Prepare training data (remove labels from the dataframe) and labels
train_data = df_train.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)
train_labels = df_train['noisy_letter_grade']

# XGBoost has (experimental) support for categorical features.
# Here we use default hyperparameters for simplicity.
# This model is reused below for the baseline and for the out-of-sample predicted probabilities.
model = XGBClassifier(tree_method="hist", enable_categorical=True)

# Establish Baseline Accuracy

Let's also train our model on the noisy data and evaluate it on our separate test data to establish a baseline to compare our final results with.

In [5]:
# Train model on original, possibly noisy data.
model.fit(train_data, train_labels)

# Evaluate model on test split with ground truth labels.
preds = model.predict(test_data)
acc_original = accuracy_score(preds, test_labels)
print(f"Accuracy with original data: {round(acc_original*100,1)}%")

Accuracy with original data: 79.2%


# Exercise 1: getting out-of-sample predicted probabilities

Compute out-of-sample predicted probabilities for every data point. You can do this manually using for loops and multiple invocations of model training and prediction, or you can use scikit-learn's [cross_val_predict](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html) (if you're using this function, take a look at the documentation, and in particular, the `method=` keyword argument). Let's use 10 folds (`cv=10`) for a balance of accuracy and speed.

In [7]:
# pred_probs should be a Nx5 matrix of out-of-sample predicted probabilities, with N = len(train_data)

pred_probs = cross_val_predict(model, train_data, train_labels, method='predict_proba', cv=10)
pred_probs, pred_probs.shape

(array([[0.00445347, 0.0336567 , 0.1216571 , 0.75258416, 0.08764851],
        [0.8702499 , 0.07702546, 0.03176929, 0.01376026, 0.00719508],
        [0.00350568, 0.88157785, 0.00146268, 0.08411037, 0.02934347],
        ...,
        [0.02096226, 0.02510118, 0.88480276, 0.02610611, 0.04302767],
        [0.01524511, 0.05465487, 0.4861005 , 0.00636221, 0.43763727],
        [0.00533796, 0.3721377 , 0.00089144, 0.06241276, 0.5592202 ]],
       dtype=float32),
 (705, 5))
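The manual route mentioned in the exercise can be sketched as follows. This is a minimal NumPy-only illustration, not the lab's solution: `NearestCentroid` is a hypothetical stand-in classifier (not XGBoost), `out_of_sample_pred_probs` is a helper name invented here, and the data is synthetic. The sketch assumes labels are integers `0..K-1` and that every class appears in each training split. The property to notice is that each row's probabilities come from a model that never saw that row.

```python
import numpy as np


class NearestCentroid:
    """Hypothetical stand-in model: softmax over negative distances to class means."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        # Distance from every row to every class centroid, shape (n_rows, n_classes).
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        logits = -dists
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        exp = np.exp(logits)
        return exp / exp.sum(axis=1, keepdims=True)


def out_of_sample_pred_probs(make_model, X, y, n_folds=10, seed=0):
    """Predict each row with a model trained only on the other folds."""
    n_classes = len(np.unique(y))
    pred_probs = np.zeros((len(X), n_classes))
    shuffled = np.random.default_rng(seed).permutation(len(X))
    for held_out in np.array_split(shuffled, n_folds):
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[held_out] = False  # exclude the held-out fold from training
        model = make_model().fit(X[train_mask], y[train_mask])
        pred_probs[held_out] = model.predict_proba(X[held_out])
    return pred_probs


# Synthetic toy data: 3 well-separated classes, 20 rows each.
rng = np.random.default_rng(0)
y_toy = np.repeat(np.arange(3), 20)
X_toy = rng.normal(size=(60, 2)) + 5.0 * y_toy[:, None]
toy_pred_probs = out_of_sample_pred_probs(NearestCentroid, X_toy, y_toy, n_folds=5)
print(toy_pred_probs.shape)
```

Unlike this plain shuffled split, `cross_val_predict` uses stratified folds for classifiers when `cv` is an integer, which keeps class proportions similar across folds.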

# Finding label issues automatically

We count label issues using confident learning. First, we need to compute class thresholds for the different classes.

# Exercise 2: computing class thresholds

Implement the Confident Learning algorithm for computing class thresholds for the 5 classes. You can refer to slide 26 from today's lecture or see equation 2 in [this paper](https://jair.org/index.php/jair/article/view/12125).

The class threshold for each class is the model's expected (average) self-confidence for each class. In other words, to compute the threshold for a particular class, you can average the predicted probability for that class, for all datapoints that are labeled with that particular class.

In [9]:
pred_probs, train_labels

(array([[0.00445347, 0.0336567 , 0.1216571 , 0.75258416, 0.08764851],
        [0.8702499 , 0.07702546, 0.03176929, 0.01376026, 0.00719508],
        [0.00350568, 0.88157785, 0.00146268, 0.08411037, 0.02934347],
        ...,
        [0.02096226, 0.02510118, 0.88480276, 0.02610611, 0.04302767],
        [0.01524511, 0.05465487, 0.4861005 , 0.00636221, 0.43763727],
        [0.00533796, 0.3721377 , 0.00089144, 0.06241276, 0.5592202 ]],
       dtype=float32),
 0      3
 1      1
 2      1
 3      4
 4      2
       ..
 700    4
 701    1
 702    0
 703    1
 704    4
 Name: noisy_letter_grade, Length: 705, dtype: int32)

In [53]:
sum(train_labels.to_numpy() == 0)

177

In [54]:
def compute_class_thresholds(pred_probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # YOUR CODE HERE
    return np.array([1 / sum(labels == c) * sum(pred_probs[(labels == c), c]) for c in range(5)], dtype = np.float64)

In [55]:
# should be a numpy array of length 5
thresholds = compute_class_thresholds(pred_probs, train_labels.to_numpy())
thresholds, isinstance(thresholds, np.ndarray)

(array([0.70661993, 0.63892676, 0.63531261, 0.45481032, 0.44689162]), True)
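To make the threshold definition concrete, here is a small worked example on made-up numbers (two classes, four data points). The helper name `class_thresholds` and the toy arrays are illustrative assumptions; the computation is the same per-class average of self-confidence described above.

```python
import numpy as np


def class_thresholds(pred_probs, labels):
    """Mean predicted probability of class c over the rows whose given label is c."""
    n_classes = pred_probs.shape[1]
    return np.array([pred_probs[labels == c, c].mean() for c in range(n_classes)])


toy_pred_probs = np.array([
    [0.9, 0.1],
    [0.6, 0.4],
    [0.3, 0.7],
    [0.2, 0.8],
])
toy_labels = np.array([0, 0, 0, 1])

toy_thresholds = class_thresholds(toy_pred_probs, toy_labels)
# Class 0: (0.9 + 0.6 + 0.3) / 3 = 0.6; class 1: 0.8 / 1 = 0.8
print(toy_thresholds)
```

Reading the number of classes from `pred_probs.shape[1]` avoids hard-coding 5, at the cost of assuming every class has at least one labeled example.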

# Exercise 3: constructing the confident joint

Next, we compute the confident joint, a matrix that counts the number of label errors for each noisy label $\tilde{y}$ and true label $y^*$. You can follow the algorithm that we walked through in slide 27 from today's lecture, or see equation 1 in [this paper](https://jair.org/index.php/jair/article/view/12125).

The confident joint C is a K x K matrix (with K = 5 for this dataset), where `C[i][j]` is an estimate of the count of the number of data points with noisy label `i` and true label `j`. From lecture, recall that we put a data point in bin `(i, j)` if its given label is `i`, and its predicted probability for class `j` is above the threshold for class `j` (`thresholds[j]`). Each data point should only go in a single bin; if a data point's predicted probability is above the class threshold for multiple classes, it goes in the bin for which it has the highest predicted probability.

In [56]:
def compute_confident_joint(pred_probs: np.ndarray, labels: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
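    # YOUR CODE HERE
    # Note: this counts a data point in every bin (i, j) whose threshold it exceeds;
    # it does not apply the "highest predicted probability" tie-break described above.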
    
    return np.array([
        [
            sum((pred_probs[:, j] > thresholds[j]) & (labels == i))
            for j in range(5)
        ]
        for i in range(5)
    ])

In [65]:
C = compute_confident_joint(pred_probs, train_labels.to_numpy(), thresholds)
C

array([[116,   3,  12,   5,  14],
       [  7, 109,   3,  24,   7],
       [  0,   3,  77,   6,   8],
       [  2,  25,   5,  56,   4],
       [ 16,  14,  12,   7,  51]])
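The rule that each data point goes in at most one bin (the above-threshold class with the highest predicted probability) can be sketched like this. The function name `confident_joint_single_bin` and the toy inputs are assumptions made for illustration; the list-comprehension version above instead counts a point once for every threshold it exceeds, so the two can differ when a point clears more than one threshold.

```python
import numpy as np


def confident_joint_single_bin(pred_probs, labels, thresholds):
    """Count each data point in at most one bin (given label, confident class)."""
    n_classes = pred_probs.shape[1]
    above = pred_probs > thresholds  # broadcast thresholds across rows
    # Ignore below-threshold classes, then pick the most probable remaining class.
    masked = np.where(above, pred_probs, -np.inf)
    confident_class = masked.argmax(axis=1)
    has_bin = above.any(axis=1)  # rows that clear no threshold are not counted
    C = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(C, (labels[has_bin], confident_class[has_bin]), 1)
    return C


toy_thresholds = np.array([0.4, 0.3, 0.5])
toy_pred_probs = np.array([
    [0.45, 0.35, 0.20],  # clears classes 0 and 1; class 0 is higher
    [0.10, 0.80, 0.10],  # clears class 1 only
    [0.35, 0.25, 0.40],  # clears no threshold, so it is left out
    [0.05, 0.15, 0.80],  # clears class 2 only
    [0.42, 0.50, 0.08],  # clears classes 0 and 1; class 1 is higher
])
toy_labels = np.array([0, 0, 2, 2, 1])

C_toy = confident_joint_single_bin(toy_pred_probs, toy_labels, toy_thresholds)
print(C_toy)
```

In this toy case four of the five points are counted, one per bin, whereas counting every cleared threshold would tally six.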

# Exercise 4: count the number of label issues

Now that we have the confident joint C, we can count the estimated number of label issues in our dataset. Recall that this is the sum of the off-diagonal entries (the cases where we estimate that a label has been flipped).

In [58]:
num_label_issues = sum([sum([C[i][j] for j in range(5) if i != j]) for i in range(5)])
num_label_issues

177

In [59]:
print('Estimated noise rate: {:.1f}%'.format(100*num_label_issues / pred_probs.shape[0]))

Estimated noise rate: 25.1%
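As a quick check of the arithmetic, the off-diagonal sum equals the total of all entries minus the diagonal (the trace). The sketch below re-enters the confident joint printed above by hand; `C_shown` is just a local name for that copy.

```python
import numpy as np

# Confident joint copied from the output above.
C_shown = np.array([
    [116,   3,  12,   5,  14],
    [  7, 109,   3,  24,   7],
    [  0,   3,  77,   6,   8],
    [  2,  25,   5,  56,   4],
    [ 16,  14,  12,   7,  51],
])

# Off-diagonal entries = all entries (586) minus the diagonal (409).
off_diagonal = int(C_shown.sum() - np.trace(C_shown))
noise_rate = off_diagonal / 705  # 705 training data points

print(off_diagonal)
print(f"{100 * noise_rate:.1f}%")
```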


# Exercise 5: filter out label issues

In this lab, our approach to identifying issues is to rank the data points by a score ("self-confidence", the model's predicted probability for a data point's given label) from lowest to highest, and then take the first `num_label_issues` of those: the data points whose given label the model finds least plausible.

First, we want to compute the model's _self-confidence_ for each data point. For a data point `i`, that is `pred_probs[i, labels[i]]`.
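One way to compute this without a Python loop is NumPy integer-array indexing, sketched here on the first three rows of `pred_probs` printed earlier together with their given labels (3, 1, 1). The names `first_rows`, `first_labels`, and `first_self_conf` are local to this example.

```python
import numpy as np

# First three rows of pred_probs and their given labels, copied from the outputs above.
first_rows = np.array([
    [0.00445347, 0.0336567, 0.1216571, 0.75258416, 0.08764851],
    [0.8702499, 0.07702546, 0.03176929, 0.01376026, 0.00719508],
    [0.00350568, 0.88157785, 0.00146268, 0.08411037, 0.02934347],
])
first_labels = np.array([3, 1, 1])

# Pair row i with column first_labels[i]: one probability per data point.
first_self_conf = first_rows[np.arange(len(first_labels)), first_labels]
print(first_self_conf)
```

The result is a NumPy array, which is convenient for the sorting step that follows.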

In [62]:
len(pred_probs)

705

In [68]:
# this should hold 705 probabilities, one per training data point (built here as a Python list)
self_confidences = [pred_probs[i, train_labels[i]] for i in range(len(pred_probs))]
self_confidences

[0.75258416,
 0.07702546,
 0.88157785,
 0.2157897,
 0.15904717,
 0.034208495,
 0.74240506,
 0.98297095,
 0.9822227,
 0.06445531,
 0.61537373,
 0.020835672,
 0.9890243,
 0.9854848,
 0.97718734,
 0.8674533,
 0.002863215,
 0.58845896,
 0.99375397,
 0.72261715,
 0.9926609,
 0.67272645,
 0.8653403,
 0.7672698,
 0.98744756,
 0.86155856,
 0.15069957,
 0.37150386,
 0.68537515,
 0.9935415,
 0.022344857,
 0.6455186,
 0.67208755,
 0.6140305,
 0.05843262,
 0.93352,
 0.93162894,
 0.8131347,
 0.9755539,
 0.0442314,
 0.8802679,
 0.058346037,
 0.016612608,
 0.18220071,
 0.989099,
 0.6160011,
 0.038626477,
 0.5968413,
 0.9951367,
 0.8004122,
 0.4080649,
 0.65225065,
 0.02932973,
 0.9871999,
 0.021419862,
 0.9946413,
 0.99122053,
 0.643335,
 0.13058646,
 0.95004976,
 0.14528996,
 0.06288553,
 0.9863202,
 0.99056494,
 0.9200623,
 0.3081974,
 0.27729222,
 0.9529175,
 0.79340434,
 0.9871585,
 0.9775735,
 0.9441286,
 0.7858582,
 0.90613616,
 0.03042209,
 0.0031981363,
 0.23279856,
 0.9766905,
 0.98974085,
 ...]

Next, we rank the _indices_ of the data points by the self-confidence.

In [78]:
# this should be a numpy array of length 705 of integer indices
# np.argsort sorts in ascending order, so the lowest self-confidence comes first
ranked_indices = np.argsort(self_confidences)
ranked_indices

array([541, 409, 196, 215, 206, 190, 627, 272, 356, 469, 609, 199, 558,
       186, 497, 671, 359, 578, 647, 608, 580, 309,  16, 562, 381, 591,
       368, 111,  75, 322, 473, 404, 628, 640, 510, 182, 365, 681, 398,
       266, 234,  95, 612, 345, 511, 496, 669, 372, 600, 243, 193, 204,
       568, 468, 477, 405,  84, 262, 403, 685,  92, 471, 446, 462, 349,
       252, 524, 304,  80, 603, 531, 695, 120, 265,  42, 615, 123, 544,
       328, 325, 348, 140,  11, 702,  54, 378,  30, 416, 214, 648, 599,
       350, 413, 216, 614, 239, 474, 167, 396,  52, 699, 148,  74, 229,
       255, 585,   5, 175, 279, 476,  46, 129, 226, 169, 494,  82,  39,
       152, 249, 180, 177, 147, 633, 283, 547, 703, 549, 223,  41,  34,
        61, 145, 513, 456,   9, 596, 432, 253,  98, 401, 221, 385, 200,
         1, 100, 296, 679,  81, 291, 595, 553, 388, 384, 361, 355, 545,
       677, 491, 564, 450, 380, 112, 502, 436, 493, 235, 527, 519,  58,
       448, 163, 142, 343, 459, 498,  60,  26, 197,   4, 289, 62

Finally, let's compute the indices of label issues as the top `num_label_issues` items in the `ranked_indices`.

In [79]:
issue_idx = ranked_indices[:int(num_label_issues)]
issue_idx

array([541, 409, 196, 215, 206, 190, 627, 272, 356, 469, 609, 199, 558,
       186, 497, 671, 359, 578, 647, 608, 580, 309,  16, 562, 381, 591,
       368, 111,  75, 322, 473, 404, 628, 640, 510, 182, 365, 681, 398,
       266, 234,  95, 612, 345, 511, 496, 669, 372, 600, 243, 193, 204,
       568, 468, 477, 405,  84, 262, 403, 685,  92, 471, 446, 462, 349,
       252, 524, 304,  80, 603, 531, 695, 120, 265,  42, 615, 123, 544,
       328, 325, 348, 140,  11, 702,  54, 378,  30, 416, 214, 648, 599,
       350, 413, 216, 614, 239, 474, 167, 396,  52, 699, 148,  74, 229,
       255, 585,   5, 175, 279, 476,  46, 129, 226, 169, 494,  82,  39,
       152, 249, 180, 177, 147, 633, 283, 547, 703, 549, 223,  41,  34,
        61, 145, 513, 456,   9, 596, 432, 253,  98, 401, 221, 385, 200,
         1, 100, 296, 679,  81, 291, 595, 553, 388, 384, 361, 355, 545,
       677, 491, 564, 450, 380, 112, 502, 436, 493, 235, 527, 519,  58,
       448, 163, 142, 343, 459, 498,  60,  26], dtype=int64)
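The ranking and selection steps are easy to trace on a tiny made-up example. The values below are illustrative only; the point is that `np.argsort` returns indices in ascending order of score, so slicing the front of the result picks the least confident data points.

```python
import numpy as np

toy_self_confidences = np.array([0.75, 0.08, 0.88, 0.22, 0.16])
toy_num_issues = 2

# Ascending sort: index of the smallest self-confidence comes first.
toy_ranked = np.argsort(toy_self_confidences)
toy_issue_idx = toy_ranked[:toy_num_issues]

print(toy_ranked)      # indices ordered from least to most confident
print(toy_issue_idx)   # the two least confident data points
```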

Let's look at a couple of the highest-ranked data points (most likely to be label issues):

In [80]:
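# Caution: ranked_indices are positions in df_train (after the split and index reset),
# whereas df_c is a copy of the full, unsplit dataframe, so df_c.iloc[...] does not
# select the flagged training rows. Index df_train with these positions instead.
df_c.iloc[ranked_indices[:5]]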

Unnamed: 0,stud_ID,exam_1,exam_2,exam_3,notes,letter_grade,noisy_letter_grade
541,338cae,0,89,90,"cheated on exam, gets 0pts",D,A
409,7cb11e,78,57,85,,C,A
196,d77a5c,89,70,74,,C,C
215,4065e7,96,75,92,great final presentation +10,A,A
206,85b1fe,72,78,69,missed homework frequently -10,D,D
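A caveat about the lookup above: `ranked_indices` are positions within `df_train` (which was shuffled by the split and then re-indexed), while `df_c` was copied before the split. The sketch below, on a made-up six-row frame, shows one possible way to get back to the human-readable rows: remember each training row's original position before resetting the index. The names `full`, `train`, and `orig_row` are hypothetical.

```python
import numpy as np
import pandas as pd

full = pd.DataFrame({
    "stud_ID": ["a", "b", "c", "d", "e", "f"],
    "noisy_letter_grade": ["A", "C", "B", "D", "A", "F"],
})

# Stand-in for a shuffled train split: rows 4, 1, 5, 2 of the full frame.
train = full.iloc[[4, 1, 5, 2]]
orig_row = train.index.to_numpy()      # original row of each training row
train = train.reset_index(drop=True)   # positions 0..3, as in the lab

flagged_positions = np.array([2, 0])   # positions within `train`

mismatched = full.iloc[flagged_positions]            # rows 2 and 0 of `full`
matched = full.iloc[orig_row[flagged_positions]]     # the actual flagged rows

print(mismatched["stud_ID"].tolist())
print(matched["stud_ID"].tolist())
```

Indexing `train` directly with the positions gives the same rows as `matched`, just without access to any columns that exist only in the unsplit copy.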


# How'd We Do?

Let's go a step further and see how we did at automatically identifying which data points are mislabeled. If we take the intersection of the label errors identified by Confident Learning and the true label errors, we see that our approach was able to identify 83% of the label errors correctly (based on predictions from a model that is only 79% accurate).

In [81]:
# Computing percentage of true errors identified. 
true_error_idx = df_train[df_train.letter_grade != df_train.noisy_letter_grade].index.values
cl_acc = len(set(true_error_idx).intersection(set(issue_idx)))/len(true_error_idx)
print(f"Percentage of errors found: {round(cl_acc*100,1)}%")

Percentage of errors found: 82.9%
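The percentage above is the recall of the detector: the share of true errors that were flagged. A complementary number is precision, the share of flagged points that are true errors. A minimal sketch of both on made-up index sets follows; `detection_precision_recall` is a helper name invented for this example.

```python
def detection_precision_recall(flagged, true_errors):
    """Precision and recall of a set of flagged indices against the true error indices."""
    flagged, true_errors = set(flagged), set(true_errors)
    hits = len(flagged & true_errors)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_errors) if true_errors else 0.0
    return precision, recall


toy_flagged = [1, 4, 7, 9]      # indices the method flagged
toy_true_errors = [1, 4, 5]     # indices that are actually mislabeled

toy_precision, toy_recall = detection_precision_recall(toy_flagged, toy_true_errors)
# 2 of 4 flagged points are real errors; 2 of 3 real errors were flagged.
print(toy_precision, toy_recall)
```

In the lab, the same function could be called with `issue_idx` and `true_error_idx`.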


# Train a More Robust Model

Now that we have the indices of potential label errors within our data, let's remove them from our data, retrain our model, and see what improvement we can gain.

Keep in mind that our baseline model from above, trained on the original data using the `noisy_letter_grade` as the prediction label, achieved an accuracy of 79.2%.

Let's use a very simple method to handle these label errors and just **drop them entirely** from the data and retrain our exact same `XGBClassifier`. In a real-world application, a better approach might be to have humans review the issues and _correct_ the labels rather than dropping the data points.

In [82]:
# Remove the label errors found by Confident Learning from the train set
data = df_train.drop(issue_idx)
filtered_labels = data['noisy_letter_grade']
data = data.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)

# Train a more robust classifier with less erroneous data
model = XGBClassifier(tree_method="hist", enable_categorical=True)
model.fit(data, filtered_labels)
# Evaluate on unmodified test set
preds = model.predict(test_data)
acc_clean = accuracy_score(preds, test_labels)
print(f"Accuracy with original data: {round(acc_original*100, 1)}%")
print(f"Accuracy with errors found by Confident Learning removed: {round(acc_clean*100, 1)}%")

# Compute reduction in error.
err = ((1-acc_original)-(1-acc_clean))/(1-acc_original)
print(f"Reduction in error: {round(err*100,1)}%")

Accuracy with original data: 79.2%
Accuracy with errors found by Confident Learning removed: 86.4%
Reduction in error: 34.7%


After removing the suspected label issues, our model's accuracy rises to 86% (from 79% for the original model), which means we **reduced the error rate of the model by 35%**.
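The reduction in error is the drop in error rate relative to the original error rate. The small helper below (a hypothetical name, `relative_error_reduction`) reproduces the calculation from the rounded accuracies. It lands near 34.6% rather than the printed 34.7%, presumably because the cell above works with the unrounded accuracy values.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the original error rate that was eliminated."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before


# Rounded accuracies from the output above: 79.2% and 86.4%.
approx_reduction = relative_error_reduction(0.792, 0.864)
# (0.208 - 0.136) / 0.208 = 0.072 / 0.208
print(f"{100 * approx_reduction:.1f}%")
```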

**Note: throughout this entire process we never changed any code related to model architecture/hyperparameters, training, or data preprocessing! This improvement comes strictly from increasing the quality of our data, which leaves room for further optimizations on the modeling side.**

# Conclusion

For the student grades dataset, we found that simply dropping identified label errors and retraining the model resulted in a 35% reduction in prediction error on our classification problem (with accuracy improving from 79% to 86%).

An implementation of the Confident Learning algorithm (and much more) is available in the [cleanlab](https://github.com/cleanlab/cleanlab) library on GitHub. This is how today's lab assignment can be done in a single line of code with Cleanlab:

In [83]:
from cleanlab.filter import find_label_issues

cl_issue_idx = find_label_issues(train_labels.to_numpy(), pred_probs, return_indices_ranked_by='self_confidence')

In [84]:
# cl_issue_idx holds positions in df_train, so index df_train rather than df_c
df_train.iloc[cl_issue_idx[:5]]

_Advanced topic_: you might notice that the above `cl_issue_idx` differs slightly in length from our `issue_idx`. The reason for this is that we implemented a slightly simplified version of the algorithm in this lab. We skipped a calibration step after computing the confident joint. That step rescales the confident joint so that each row sums to the number of examples with that given label (matching the noisy label prior $p(\tilde{y})$), which also makes the entries add up to the total number of examples. If you're interested in the details of this, see equation 3 and the subsequent explanation in the [paper](https://jair.org/index.php/jair/article/view/12125).
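For intuition, the calibration idea can be sketched in a few lines of NumPy on a made-up 2 x 2 confident joint. This is a simplified illustration under the assumptions stated in the comments, not cleanlab's implementation, and the library's version may differ in details; `calibrate_confident_joint_sketch` is a name invented here.

```python
import numpy as np


def calibrate_confident_joint_sketch(C, label_counts):
    """Rescale each row of C so it sums to the number of examples with that given label."""
    C = C.astype(float)
    row_sums = C.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # assumption: leave all-zero rows as zeros
    return C / row_sums * label_counts[:, None]


toy_C = np.array([
    [8, 2],
    [1, 4],
])
toy_label_counts = np.array([12, 8])  # examples with given label 0 and 1

toy_calibrated = calibrate_confident_joint_sketch(toy_C, toy_label_counts)
# Row 0: [8, 2] / 10 * 12 = [9.6, 2.4]; row 1: [1, 4] / 5 * 8 = [1.6, 6.4]
print(toy_calibrated)
print(toy_calibrated.sum())  # equals the total number of examples, 20
```

Dividing the calibrated matrix by its total would turn the counts into an estimated joint distribution over given and true labels.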