# Confident Learning
This notebook reimplements the main steps of [confident learning](https://www.jair.org/index.php/jair/article/download/12125/26676/) on a toy example and shows how to invoke confident learning from `cleanlab` on a relatively challenging real-world task. It partly follows this [blog](https://zhuanlan.zhihu.com/p/488970312) and [code](https://github.com/Christmas-Wong/paper_project/blob/main/) (note that the `cleanlab` version used in the referenced code is outdated).

### Manual Reimplementation and Toy Example
* Generate toy predicted probabilities and noisy labels.
* Calculate the threshold $t_j$, the expected (average) self-confidence for class $j$.
* Calculate the confident joint matrix $C$.
* Calibrate the confident joint matrix so that it sums to 1.
* Output potentially mislabeled samples.

In [1]:
import cleanlab
import numpy as np
import pandas as pd

np.random.seed(0)

In [2]:
p0 = np.random.uniform(0, 1, size=20)
p0 = np.array([round(i, 1) for i in p0])
p1 = 1 - p0
y  = np.random.randint(0, 2, 20)

df_data = pd.DataFrame(dict(p0=p0, p1=p1, y=y))
df_data

index,p0,p1,y
0,0.5,0.5,0
1,0.7,0.3,1
2,0.6,0.4,0
3,0.5,0.5,1
4,0.4,0.6,1
5,0.6,0.4,1
6,0.4,0.6,1
7,0.9,0.1,1
8,1.0,0.0,0
9,0.4,0.6,1


#### Calculate Thresholds
The threshold $t_j$ is the expected (average) self-confidence for each class $j$, *i.e.,*
$$t_j=\frac{1}{|X_{\tilde{y}=j}|} \sum_{x\in X_{\tilde{y}=j}} \hat{p}(\tilde{y}=j;x,\theta),$$
where $\tilde{y}$ is the given label with noise.

In [3]:
# n[j]: number of samples with given label j.
# t[j]: accumulated, then averaged, self-confidence of class j.
n = [0, 0]
t = [0, 0]
for _, row in df_data.iterrows():
    if row['y'] == 0:
        n[0] += 1
        t[0] += row['p0']
    elif row['y'] == 1:
        n[1] += 1
        t[1] += row['p1']
    else:
        raise ValueError(f"unknown label '{row['y']}'.")
# Average self-confidence per class.
t[0] = t[0] / n[0]
t[1] = t[1] / n[1]
t

[0.6571428571428573, 0.46923076923076923]
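The threshold formula can also be written in vectorized form. The following is a minimal sketch, not part of the original notebook: the function name `class_thresholds` and the toy arrays are hypothetical, and it assumes every class has at least one sample with that given label.

```python
import numpy as np

def class_thresholds(pred_probs, labels):
    """Return t_j, the mean predicted probability of class j over samples labeled j.

    Hypothetical helper for illustration; assumes each class occurs in `labels`.
    """
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels)
    n_classes = pred_probs.shape[1]
    return np.array([pred_probs[labels == j, j].mean() for j in range(n_classes)])

# Hypothetical toy input: 4 samples, 2 classes.
toy_probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]])
toy_labels = np.array([0, 0, 1, 1])
toy_t = class_thresholds(toy_probs, toy_labels)
print(toy_t)  # [0.75 0.55]
```

Here $t_0$ averages `0.9` and `0.6` (the class-0 probabilities of the samples labeled 0), and $t_1$ averages `0.8` and `0.3`.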

#### Calculate Confident Joint Matrix
The confident joint $C_{\tilde{y},y^*}$ estimates $X_{\tilde{y}=i,y^*=j}$, the set of examples with noisy label $i$ that actually have true label $j$, by partitioning $X$ into estimated bins $\hat{X}_{\tilde{y}=i,y^*=j}$. Formally, the confident joint is defined as
$$C_{\tilde{y},y^*}[i][j]:=|\hat{X}_{\tilde{y}=i,y^*=j}|,$$
where,
$$\hat{X}_{\tilde{y}=i,y^*=j}:= \left\{ x\in X_{\tilde{y}=i}: \hat{p}(\tilde{y}=j;x,\theta)\geq t_j, j=\mathop{\arg\max}_{l\in[m]:\hat{p}(\tilde{y}=l;x,\theta)\geq t_l} \hat{p}(\tilde{y}=l;x,\theta) \right\}$$


In [4]:
# Rows ("pred_i") index the given (noisy) label; columns ("true_j") index the estimated true label.
C = pd.DataFrame(dict(true_0=[0, 0], true_1=[0, 0]), index=["pred_0", "pred_1"])

# Note: the strict `>` comparisons skip ties such as p0 == p1 == 0.5,
# whereas the formula uses `>=` and an argmax over the classes that reach their threshold.
for _, row in df_data.iterrows():
    # Estimated true label is `0`.
    if row['p0'] > row['p1'] and row['p0'] > t[0]:
        # given label is `0`.
        if row['y'] == 0:
            C.loc["pred_0", "true_0"] += 1
        # given label is `1`.
        if row['y'] == 1:
            C.loc["pred_1", "true_0"] += 1
    # Estimated true label is `1`.
    if row['p1'] > row['p0'] and row['p1'] > t[1]:
        # given label is `0`.
        if row['y'] == 0:
            C.loc["pred_0", "true_1"] += 1
        # given label is `1`.
        if row['y'] == 1:
            C.loc["pred_1", "true_1"] += 1

C

given label,true_0,true_1
pred_0,4,1
pred_1,4,5
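The set definition above can be sketched in NumPy for any number of classes. This is an illustrative sketch rather than the notebook's or `cleanlab`'s implementation: the name `confident_joint`, the toy arrays, and the toy thresholds are hypothetical. It follows the formula literally, using `>=` and taking the argmax only over classes that reach their threshold, so a tied sample such as `(0.5, 0.5)` is still assigned when exactly one class meets its threshold.

```python
import numpy as np

def confident_joint(pred_probs, labels, thresholds):
    """Count samples by (given label, estimated true label).

    Hypothetical helper: a sample is counted only if at least one class
    probability reaches that class's threshold; its estimated true label is
    the most probable class among those that do.
    """
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels)
    m = pred_probs.shape[1]
    above = pred_probs >= np.asarray(thresholds, dtype=float)
    # Hide probabilities below their class threshold from the argmax.
    masked = np.where(above, pred_probs, -np.inf)
    confident = above.any(axis=1)
    est_true = masked.argmax(axis=1)
    C = np.zeros((m, m), dtype=int)
    # Rows: given label; columns: estimated true label.
    np.add.at(C, (labels[confident], est_true[confident]), 1)
    return C

# Hypothetical toy input: 5 samples, 2 classes, thresholds chosen by hand.
toy_probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.8, 0.2], [0.6, 0.4]])
toy_labels = np.array([0, 0, 1, 1, 0])
toy_C = confident_joint(toy_probs, toy_labels, thresholds=[0.7, 0.45])
print(toy_C)
```

In this toy input the second sample `(0.5, 0.5)` reaches only the class-1 threshold, so it lands in the off-diagonal bin (given 0, estimated 1), while the last sample reaches neither threshold and is left out of the count.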


#### Calibrate Confident Joint Matrix
The confident joint only counts samples whose predicted probability reaches a class threshold, so its row sums no longer match the number of samples per given label. We therefore rescale each row to the original label count $|X_{\tilde{y}=i}|$ and normalize the matrix so that it sums to 1:
$$\hat{Q}_{\tilde{y}=i,y^*=j} = \frac{\frac{C_{\tilde{y}=i,y^*=j}}{\sum_{j'\in[m]} C_{\tilde{y}=i,y^*=j'}} \cdot |X_{\tilde{y}=i}|}{\sum_{i\in[m], j\in[m]} \left( \frac{C_{\tilde{y}=i,y^*=j}}{\sum_{j'\in[m]} C_{\tilde{y}=i,y^*=j'}} \cdot |X_{\tilde{y}=i}| \right)}.$$

In [5]:
Q = C.astype(float)
# Numerator: rescale each given-label row i so that it sums to n[i] = |X_{y~=i}|.
for i, (index, row) in enumerate(C.iterrows()):
    row_sum = row["true_0"] + row["true_1"]  # denominator inside the numerator.
    Q.loc[index, 'true_0'] = n[i] * row["true_0"] / row_sum
    Q.loc[index, 'true_1'] = n[i] * row["true_1"] / row_sum
Q

given label,true_0,true_1
pred_0,5.6,1.4
pred_1,5.777778,7.222222


In [6]:
# Divide by the denominator so that the matrix sums to 1.
Q = Q / Q.values.sum()
assert np.isclose(Q.values.sum(), 1.0)
Q

given label,true_0,true_1
pred_0,0.28,0.07
pred_1,0.288889,0.361111
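The calibration formula can be checked with a short vectorized sketch. The function name `calibrate_joint` is hypothetical; the inputs are the confident joint printed above and the given-label counts of the toy data (7 samples labeled 0 and 13 labeled 1). The sketch assumes every row of $C$ has a nonzero sum.

```python
import numpy as np

def calibrate_joint(C, label_counts):
    """Rescale row i of C to sum to label_counts[i], then normalize to sum to 1.

    Hypothetical helper for illustration; assumes no row of C sums to zero.
    """
    C = np.asarray(C, dtype=float)
    label_counts = np.asarray(label_counts, dtype=float)
    row_sums = C.sum(axis=1, keepdims=True)
    scaled = C / row_sums * label_counts[:, None]
    return scaled / scaled.sum()

Q_sketch = calibrate_joint([[4, 1], [4, 5]], label_counts=[7, 13])
print(Q_sketch)              # approximately [[0.28, 0.07], [0.289, 0.361]]
print(Q_sketch.sum(axis=1))  # approximately [0.35, 0.65], the given-label proportions 7/20 and 13/20
```

A useful sanity check is that each row of the calibrated matrix sums to the proportion of samples carrying that given label.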


#### Invoke Confident Learning from `cleanlab`
In their paper, the authors propose five approaches for finding mislabeled data based on the calibrated confident joint matrix $Q$. For simplicity, I invoke Methods 3 to 5 through `cleanlab` here rather than reimplementing them.

In [7]:
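# Despite its name, `y_true` holds the given (possibly noisy) labels;
# `y_pred` holds the predicted class probabilities, one column per class.
y_true = df_data['y'].values
y_pred = np.c_[df_data['p0'].values, df_data['p1'].values]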

In [8]:
# Method 3: Prune by Class (PBC)
cl_pbc = cleanlab.filter.find_label_issues(
    y_true,
    y_pred,
    filter_by='prune_by_class',
    return_indices_ranked_by='self_confidence'
)
print(f"The index of error samples discriminated by PBC are: {','.join([str(ele) for ele in cl_pbc])}")

# Method 4: Prune by Noise Rate (PBNR)
cl_pbnr = cleanlab.filter.find_label_issues(
    y_true,
    y_pred,
    filter_by='prune_by_noise_rate',
    return_indices_ranked_by='self_confidence'
)
print(f"The index of error samples discriminated by PBNR are: {','.join([str(ele) for ele in cl_pbnr])}")

# Method 5: C+NR
cl_both = cleanlab.filter.find_label_issues(
    y_true,
    y_pred,
    filter_by='both',
    return_indices_ranked_by='self_confidence'
)
print(f"The index of error samples discriminated by C+NR are: {','.join([str(ele) for ele in cl_both])}")

The index of error samples discriminated by PBC are: 7,13,19,15,1,5
The index of error samples discriminated by PBNR are: 7,13,19,15,1,5
The index of error samples discriminated by C+NR are: 7,13,19,15,1,5


In [9]:
# Samples flagged as potentially mislabeled, lowest self-confidence first.
df_data.loc[cl_both]

index,p0,p1,y
7,0.9,0.1,1
13,0.9,0.1,1
19,0.9,0.1,1
15,0.1,0.9,0
1,0.7,0.3,1
5,0.6,0.4,1
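For comparison, the simplest baseline in the paper (Method 1) flags every sample whose most probable class disagrees with its given label. The sketch below implements that baseline in NumPy and orders the flagged samples by ascending self-confidence, which is the ordering visible in the output above. It is not `cleanlab`'s implementation; the function name and the toy arrays are hypothetical.

```python
import numpy as np

def rank_argmax_disagreements(pred_probs, labels):
    """Return indices where argmax prediction != given label, least self-confident first.

    Hypothetical helper for illustration only.
    """
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels)
    # Self-confidence: predicted probability of the given label.
    self_conf = pred_probs[np.arange(len(labels)), labels]
    suspects = np.flatnonzero(pred_probs.argmax(axis=1) != labels)
    order = np.argsort(self_conf[suspects], kind="stable")
    return suspects[order]

# Hypothetical toy input: 4 samples, 2 classes.
toy_probs = np.array([[0.7, 0.3], [0.1, 0.9], [0.9, 0.1], [0.4, 0.6]])
toy_labels = np.array([1, 1, 1, 0])
ranked = rank_argmax_disagreements(toy_probs, toy_labels)
print(ranked)  # [2 0 3]
```

Sample 1 agrees with its given label and is not flagged; the other three are ordered by self-confidence 0.1, 0.3, and 0.4.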


### A Real-World Task
To be completed...