# Section 2: Handling the label noise in datasets

By the end of this section, you will be familiar with:

* Why the label noise transition matrix $T$ matters when handling label noise.

* How to estimate $T$ from a dataset that has only noisy labels.

* How to detect label errors in a dataset.

## 1. Importance of $T$

* Understand the pattern/structure of label noise
* Design robust loss functions
* Help label aggregation (weighted majority vote)

### 1.1 Understand the pattern of label noise

#### Examples from CIFAR-10N

* CIFAR-10N: 
  * 10 classes. 
  * Each image is annotated by 3 independent human workers.
* Aggregated labels: 
  * Take the majority vote of the 3 annotations.
  * Break ties uniformly at random.

<table>
  <tr>
    <td ><img src="./tutorial_imgs/c10_agg.png" width="700"> 

    Figure: Label noise transition matrix of CIFAR-10N.
</td>
    <td>
    * Humans can be very accurate on some classes (ship 97%, horse 96%)<br/>
    * Humans can be inaccurate on other classes (cat 83%, deer 83%)<br/>
    * Human annotations have bias:<br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- Horse-deer is a pair with high similarity, <b>but</b>..<br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- Humans tend to annotate deer as horse: deer &rarr; horse 0.04<br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- Humans tend <b>not</b> to annotate horse as deer: horse &rarr; deer 0.01 <br/>
    </td>
  </tr>
</table> 

#### Examples from CIFAR-100N

* CIFAR-100N: 
  * 20 coarse classes, 100 fine classes. Each coarse class contains 5 fine classes.
  * Each image is annotated by 1 human worker.


<table>
  <tr>
    <td ><img src="tutorial_imgs/c100_coarse.png" width="1100"> 

    Figure: Label noise transition matrix of CIFAR-100N.
</td>
    <td>
    * Humans can be very accurate on some classes<br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- people 94%<br/>
    * Humans can be inaccurate on other classes <br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- medium-sized mammals 47%<br/>
    * Human annotations have bias:<br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- man-made &rarr; natural 0.09<br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- natural &rarr; man-made 0.03 <br/>
    </td>
  </tr>
</table> 

### 1.2 Design robust loss functions

Recall that:
* Feature $X$, noisy label $\widetilde Y$. 
* Model: $\bm f(\cdot)$ (input: $X$; output: a column vector whose $i$-th entry is the predicted probability of label class $i$).
* Loss function: $\ell$.
* Label noise transition matrix $\bm T$ with entries $T_{ij} = \mathbb P(\widetilde Y=j | Y=i)$, and its transpose $\bm T^\top$.

#### Forward loss correction:
$$
\ell^{\rightarrow}(\bm f(X),\widetilde Y):= \ell(\bm T^\top \bm f(X),\widetilde Y).
$$
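
The forward correction can be read as follows: $\bm T^\top \bm f(X)$ turns the model's prediction over clean classes into a prediction over noisy classes, which is then compared with the noisy label. Below is a minimal NumPy sketch of this idea, assuming $\ell$ is the cross-entropy loss; the function name `forward_corrected_ce` and the numbers are illustrative and not taken from any library.

```python
import numpy as np

def forward_corrected_ce(probs, noisy_label, T):
    """Cross-entropy between T^T f(x) and the noisy label.

    probs:       shape (K,), model output f(x) over clean classes
    noisy_label: integer noisy label in {0, ..., K-1}
    T:           shape (K, K), T[i, j] = P(noisy = j | clean = i)
    """
    noisy_probs = T.T @ probs  # predicted distribution over noisy labels
    return -np.log(noisy_probs[noisy_label])

# Illustrative (assumed) values
T = np.array([[0.8, 0.2],
              [0.3, 0.7]])
probs = np.array([0.9, 0.1])   # the model is confident in clean class 0
noisy_probs = T.T @ probs      # P(noisy label) implied by the model and T
loss = forward_corrected_ce(probs, noisy_label=1, T=T)
print(noisy_probs, loss)
```

With these numbers the model still incurs only a moderate loss when the noisy label is 1, because $\bm T$ says that class 0 is labeled as 1 a fifth of the time.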

### 1.3 Helps label aggregation (weighted majority vote)

Intuition:
* Normal majority vote: each labeler has the same weight. 
  * E.g., $\text{MV}(1,1,0) = 1$.
* Weighted majority vote: each vote is weighted by how likely it is to be a mistake. 
  * E.g., if label class 1 is rare, the two votes for class 1 are unreliable, and the single vote for class 0 is reliable, we may have $\text{MV}_\text{Weighted}(1,1,0) = 0$.
  * Condition: $\mathbb P(Y=1) = 0.2$, and all three labelers share $T = \begin{pmatrix} 0.8 & 0.2 \\ 0.7 & 0.3 \end{pmatrix}$, where the entry in row $i$, column $j$ is $\mathbb P(\widetilde Y=j|Y=i)$ for $i, j \in \{0, 1\}$.
  * Probability of label 1: $$   \begin{align*} & \mathbb P(Y=1| \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)  \\ = & \frac{\mathbb P(Y=1)}{\mathbb P( \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)} \cdot \mathbb P(\widetilde Y_1=1|Y=1) \cdot \mathbb P(\widetilde Y_2=1|Y=1) \cdot \mathbb P(\widetilde Y_3=0|Y=1) \\ = & \frac{0.0126}{\mathbb P( \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)}\end{align*}  $$
  * Probability of label 0: $$   \begin{align*} & \mathbb P(Y=0| \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)  \\ = & \frac{\mathbb P(Y=0)}{\mathbb P( \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)} \cdot \mathbb P(\widetilde Y_1=1|Y=0) \cdot \mathbb P(\widetilde Y_2=1|Y=0) \cdot \mathbb P(\widetilde Y_3=0|Y=0) \\ = & \frac{0.0256}{\mathbb P( \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)} \end{align*} $$
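
Both posteriors share the same denominator, so comparing the numerators is enough: $0.0256 > 0.0126$, and the weighted majority vote returns 0 even though two of the three votes are 1. The short NumPy check below reproduces these numbers; the variable names are illustrative.

```python
import numpy as np

p = np.array([0.8, 0.2])        # clean label distribution: P(Y=0), P(Y=1)
T = np.array([[0.8, 0.2],
              [0.7, 0.3]])      # T[i, j] = P(noisy = j | clean = i)
votes = [1, 1, 0]               # noisy labels from the three labelers

# Numerator of Bayes' rule for each candidate clean label i:
# P(Y=i) * prod_k P(noisy_k = votes[k] | Y=i)
joint = p * np.prod(T[:, votes], axis=1)
posterior = joint / joint.sum()
weighted_mv = int(np.argmax(posterior))

print(joint)        # numerators for Y=0 and Y=1
print(posterior)
print(weighted_mv)
```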
  
 

## 2. Estimate $T$

* Naive approach
* Estimate with anchor points
* Estimate with consensus patterns

### 2.1. Naive approach
If we know both the ground-truth labels and the noisy labels, each entry of $T$ can be estimated by counting:
$$ \mathbb P(\widetilde Y=j | Y=i) = \frac{\text{\#Samples with true label } i \text{ and noisy label } j}{\text{\#Samples with true label } i} $$


In [35]:
import numpy as np

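def est_T_naive(clean_labels, noisy_labels, num_classes):
    """Estimate T by counting and print it: T[i][j] = #(true i, noisy j) / #(true i)."""
    T = np.zeros((num_classes, num_classes))
    for i in range(num_classes):
        for j in range(num_classes):
            # Fraction of the samples with true label i that carry noisy label j
            T[i][j] = ((clean_labels == i) * (noisy_labels == j)).sum() / (clean_labels == i).sum()
    matrix_with_brackets = '\n'.join(['[ ' + '\t'.join(map(str, row)) + ' ]' for row in T])

    print(f'''T = [
{matrix_with_brackets}
      ]
            ''')

# Toy example: 8 samples, 2 classes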
clean_labels = np.array([0, 1, 0, 0, 0, 1, 1, 1])
noisy_labels = np.array([1, 1, 0, 0, 1, 1, 1, 0])
est_T_naive(clean_labels, noisy_labels, num_classes=2)


T = [
[ 0.5	0.5 ]
[ 0.25	0.75 ]
      ]
            


### 2.2. Estimate with anchor points

**Definition: (Anchor points)**
A feature $x$ is an anchor point for the class $i$ if $\mathbb P(Y = i|X=x)$ is equal to one or close to one.

* Step 1: Find the anchor points according to model predictions
  * Methods:
  * Results:
    ```python
    labels_of_anchor_points = np.array([0, 1, 0, 0, 0, 1, 1, 1])
    noisy_labels = np.array([1, 1, 0, 0, 1, 1, 1, 0])
    ```
* Step 2: Estimate $\bm T$ with anchor points.

In [36]:
labels_of_anchor_points = np.array([0, 1, 0, 0, 0, 1, 1, 1])
noisy_labels = np.array([1, 1, 0, 0, 1, 1, 1, 0])
est_T_naive(clean_labels=labels_of_anchor_points, noisy_labels=noisy_labels, num_classes=2)

T = [
[ 0.5	0.5 ]
[ 0.25	0.75 ]
      ]
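
One possible way to carry out Step 1 is to keep only the samples on which the model is highly confident and to treat the predicted class as the anchor label. The sketch below is an assumption for illustration, not a method prescribed by this tutorial: `probs` is a hypothetical array of predicted class probabilities, and the threshold `0.95` is arbitrary.

```python
import numpy as np

def find_anchor_points(probs, threshold=0.95):
    """Return (indices, predicted classes) of samples whose top probability >= threshold."""
    confidence = probs.max(axis=1)    # highest predicted probability per sample
    predicted = probs.argmax(axis=1)  # class attaining that probability
    idx = np.flatnonzero(confidence >= threshold)
    return idx, predicted[idx]

# Hypothetical model outputs for 4 samples and 2 classes
probs = np.array([[0.98, 0.02],
                  [0.60, 0.40],
                  [0.03, 0.97],
                  [0.55, 0.45]])
anchor_idx, anchor_labels = find_anchor_points(probs)
print(anchor_idx, anchor_labels)
```

The anchor labels and the noisy labels at `anchor_idx` can then be passed to `est_T_naive` as in Step 2. A stricter threshold gives cleaner anchor labels but fewer anchor points per class.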
            


### 2.3. Estimate with consensus patterns

#### 2.3.1 What are consensus patterns?

<img src="tutorial_imgs/consensus.png" width="700"> 

*Figure: Illustration of high-order consensus patterns.*

* $\widetilde Y_1$: The noisy label of a particular instance $i$.
* $\widetilde Y_2$: The noisy label of instance-$i$'s nearest neighbor.
* $\widetilde Y_3$: The noisy label of instance-$i$'s second nearest neighbor.
  
**Intuition: Consensus patterns encode $\bm T$.**

#### 2.3.2 Condition: $2$-NN label clusterability

**Definition: (Clusterability)**
A dataset $D$ satisfies $k$-NN label clusterability if $\forall n \in [N]$, the feature $x_n$ and its $k$ nearest neighbors $x_{n_1}, \cdots, x_{n_k}$ belong to the same true label class.

<img src="tutorial_imgs/clusterability.png" width="700"> 

*Figure: Illustration of $k$-NN label clusterability.*

**Properties**
* $k_1$-NN label clusterability is *harder* to satisfy than $k_2$-NN label clusterability when $k_1 > k_2$;
* The cluster containing the same clean labels is not required to be a continuum, e.g., two clusters of class "1" can be far away;
* The $k$-NN label clusterability only requires the existence of these feasible points, i.e., specifying the true class is not necessary.
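
Under $2$-NN label clusterability, a consensus pattern for each instance can be collected by looking up the noisy labels of its two nearest neighbors. The following is a minimal brute-force sketch with NumPy, assuming Euclidean distance on the given features; the function name `build_consensus_patterns` and the toy data are illustrative.

```python
import numpy as np

def build_consensus_patterns(features, noisy_labels):
    """Return an (N, 3) array: own noisy label, 1st-NN label, 2nd-NN label."""
    # Pairwise squared Euclidean distances, shape (N, N)
    diff = features[:, None, :] - features[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)             # an instance is not its own neighbor
    nn_idx = np.argsort(dist, axis=1)[:, :2]   # indices of the two nearest neighbors
    return np.column_stack([noisy_labels,
                            noisy_labels[nn_idx[:, 0]],
                            noisy_labels[nn_idx[:, 1]]])

# Toy data: two well-separated clusters of three points each
features = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
                     [5.0, 5.0], [5.1, 5.0], [5.0, 5.3]])
noisy_labels = np.array([0, 0, 1, 1, 1, 0])
patterns = build_consensus_patterns(features, noisy_labels)
print(patterns)
```

The brute-force distance matrix costs $O(N^2)$ memory, so a dedicated nearest-neighbor index would be preferable for large datasets.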

#### 2.3.3 Equations (sketch)

$$
\mathbb P(\widetilde Y_1, \widetilde Y_2, \widetilde Y_3) = \textsf{Func}_3(\bm T, \bm p).
$$

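* LHS: *Numerical* frequencies of consensus patterns, counted from the data
* RHS: *Analytical* expressions (probabilities) in terms of $\bm T$ and the clean label distribution $\bm p$, where $p_i = \mathbb P(Y=i)$

For example, when the three instances share the same true label ($2$-NN label clusterability) and their noisy labels are conditionally independent given that label,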
$$
\begin{align*}
    &\mathbb P(\widetilde Y_1 = \tilde y_1, \widetilde Y_2 = \tilde y_2, \widetilde Y_3 =\tilde y_3 )   = \sum_{i \in [K]} \mathbb P(Y=i) \cdot T_{i, \tilde y_1} \cdot T_{i, \tilde y_2} \cdot T_{i, \tilde y_3}
\end{align*}
$$
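
The right-hand side can be evaluated directly for a given $\bm T$ and $\bm p$. The sketch below uses `np.einsum` to compute the first-, second-, and third-order consensus probabilities; the values of `T` and `p` are illustrative assumptions.

```python
import numpy as np

T = np.array([[0.6, 0.4],
              [0.2, 0.8]])   # T[i, j] = P(noisy = j | clean = i)
p = np.array([0.4, 0.6])     # clean label distribution

# P(Y1 = a)                   = sum_i p_i T[i, a]
first_order = p @ T
# P(Y1 = a, Y2 = b)           = sum_i p_i T[i, a] T[i, b]
second_order = np.einsum('i,ia,ib->ab', p, T, T)
# P(Y1 = a, Y2 = b, Y3 = c)   = sum_i p_i T[i, a] T[i, b] T[i, c]
third_order = np.einsum('i,ia,ib,ic->abc', p, T, T, T)

print(first_order)
print(third_order[0, 0, 0])
```

Summing the third-order tensor over its last axis recovers the second-order one, which is a quick sanity check on any implementation of these equations.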

#### 2.3.4 Code


**LHS - Get numerical counts**

```python
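import torch

# Assumed to be defined elsewhere (not shown in this excerpt):
#   KINDS: number of label classes K
#   consensus_patterns: iterable of integer noisy-label triples (y1, y2, y3)
cnt = [[] for _ in range(3)]
cnt[0] = torch.zeros(KINDS)                # first-order counts of y1
cnt[1] = torch.zeros(KINDS, KINDS)         # second-order counts of (y1, y2)
cnt[2] = torch.zeros(KINDS, KINDS, KINDS)  # third-order counts of (y1, y2, y3)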

for pattern in consensus_patterns:
    cnt[0][pattern[0]] += 1
    cnt[1][pattern[0]][pattern[1]] += 1
    cnt[2][pattern[0]][pattern[1]][pattern[2]] += 1

```



**RHS - Prepare analytical equations**

```python
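import torch

# Assumed to be defined elsewhere (not shown in this excerpt):
#   KINDS: number of label classes K
#   T:     (KINDS, KINDS) tensor, current estimate of the transition matrix
#   P:     2-D tensor with KINDS rows (assumed shape (KINDS, 1)), current estimate of p
#   mode:  -1 fills every row of the third-order tensor; otherwise only row `mode`
c_analytical = [[] for _ in range(3)]
c_analytical[0] = torch.mm(T.transpose(0, 1), P).transpose(0, 1)  # first order: (T^T p)^T
c_analytical[2] = torch.zeros((KINDS, KINDS, KINDS))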

temp33 = torch.tensor([])
for i in range(KINDS):
    Ti = torch.cat((T[:, i:], T[:, :i]), 1)
    temp2 = torch.mm((T * Ti).transpose(0, 1), P)
    c_analytical[1] = torch.cat(
        [c_analytical[1], temp2], 1) if i != 0 else temp2

    for j in range(KINDS):
        Tj = torch.cat((T[:, j:], T[:, :j]), 1)
        temp3 = torch.mm((T * Ti * Tj).transpose(0, 1), P)
        temp33 = torch.cat([temp33, temp3], 1) if j != 0 else temp3
    # reorder the output (KINDS x KINDS x KINDS) so that it matches the layout of the numerical counts
    t3 = []
    for p3 in range(KINDS):
        t3 = torch.cat((temp33[p3, KINDS - p3:], temp33[p3, :KINDS - p3]))
        temp33[p3] = t3
    if mode == -1:
        for r in range(KINDS):
            c_analytical[2][r][(i+r+KINDS) % KINDS] = temp33[r]
    else:
        c_analytical[2][mode][(i + mode + KINDS) % KINDS] = temp33[mode]
```

#### 2.3.5 Package: Docta
```bash
pip install docta.ai
```

**A Toy Example**
* Synthesize a dataset of 1,000 instances
* Binary classification
* Each instance has three noisy labels (given by three independent labelers)
* *Task:* Estimate 
  * the label noise transition matrix $\bm T$ 
  * the clean label distribution $\bm p$. 

Synthesize the dataset (Only synthesize the labels)

In [46]:
import numpy as np
np.random.seed(0)
num_samples = 1000

# Set the label noise transition matrix T
T = [
    [0.6, 0.4],
    [0.2, 0.8],
]

# Set the clean label distribution p
p = [0.4, 0.6]

# Generate clean labels
clean_labels = [0] * int(num_samples * p[0]) + [1] * (num_samples - int(num_samples * p[0]))
np.random.shuffle(clean_labels)

# Generate noisy labels
noisy_labels = []
for i in clean_labels: # each instance has three noisy labels
    noisy_labels.append(np.random.choice([0, 1], size = 3, p=T[i]))

# Get the true (empirical) T and p by counting over the synthesized labels
true_T = np.zeros((2,2))
true_p = np.zeros(2)
for i in range(len(clean_labels)):
    for j in range(len(noisy_labels[0])):
        true_T[clean_labels[i]][noisy_labels[i][j]] += 1
    true_p[clean_labels[i]] += 1
true_T /= np.sum(true_T, 1).reshape(-1,1)
true_p /= np.sum(true_p)

# Set precisions
np.set_printoptions(precision=3)

# Print the True T and p
print(f"The true T is:\n{true_T}")
print(f"The true p is:\n{true_p}")

The true T is:
[[0.603 0.398]
 [0.206 0.794]]
The true p is:
[0.4 0.6]


Naive Approach: Use majority vote to estimate clean labels

In [47]:
from collections import Counter

mv_T = np.zeros((2,2))
mv_p = np.zeros(2)
for i in range(len(clean_labels)):
    # Treat the majority vote of the three noisy labels as the clean label
    mv_label = Counter(noisy_labels[i]).most_common(1)[0][0]
    mv_p[mv_label] += 1
    for j in range(len(noisy_labels[0])):
        mv_T[mv_label][noisy_labels[i][j]] += 1

# Normalize counts into probabilities
mv_T /= np.sum(mv_T, 1).reshape(-1,1)
mv_p /= np.sum(mv_p)

# Print the majority-vote estimates of T and p
print(f"The estimated T by majority vote is:\n{mv_T}")
print(f"The estimated p by majority vote is:\n{mv_p}")

The estimated T by majority vote is:
[[0.763 0.237]
 [0.169 0.831]]
The estimated p by majority vote is:
[0.329 0.671]


Estimate with consensus patterns (by Docta)

In [51]:
from docta.apis import Diagnose
from docta.core.report import Report
from docta.utils.config import Config

# Load config
cfg = Config.fromfile('./config/toy.py')

# Initialize the report
report = Report()

# Build a dataset
class MyDataset:
    def __init__(self, consensus_patterns):
        self.consensus_patterns = consensus_patterns
        self.label = np.asarray(consensus_patterns)[:,0]  # first noisy label of each instance

    def __len__(self):
        return len(self.consensus_patterns)

dataset = MyDataset(noisy_labels)

# Estimate T and p
estimator = Diagnose(cfg, dataset, report = report)
estimator.hoc()

# Print the T and p estimated by Docta
print(f"The estimated T by Docta is:\n{report.diagnose['T']}")
print(f"The estimated p by Docta is:\n{report.diagnose['p_clean'].reshape(-1)}")


Estimating consensus patterns...


  0%|          | 0/50 [00:00<?, ?it/s]

100%|██████████| 50/50 [00:01<00:00, 27.76it/s]


Estimating consensus patterns... [Done]
Use cpu to solve equations


100%|██████████| 1501/1501 [00:02<00:00, 606.47it/s]

Solve equations... [Done]
The estimated T by Docta is:
[[0.589 0.411]
 [0.19  0.81 ]]
The estimated p by Docta is:
[0.421 0.579]
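
To compare the two estimates numerically, one option is the mean absolute error against the true $\bm T$. The sketch below simply copies the three matrices printed above; the helper `mean_abs_error` is an illustrative name, and other error measures would work as well.

```python
import numpy as np

# Values copied from the printed outputs above
true_T = np.array([[0.603, 0.398], [0.206, 0.794]])
mv_T = np.array([[0.763, 0.237], [0.169, 0.831]])
docta_T = np.array([[0.589, 0.411], [0.19, 0.81]])

def mean_abs_error(estimate, reference):
    """Average absolute entry-wise difference between two matrices."""
    return float(np.abs(estimate - reference).mean())

mv_error = mean_abs_error(mv_T, true_T)
docta_error = mean_abs_error(docta_T, true_T)
print(f"Majority vote: {mv_error:.4f}, Docta: {docta_error:.4f}")
```

On this toy example the consensus-pattern estimate is several times closer to the true $\bm T$ than the majority-vote estimate.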





## 3. Detect label errors

* Detect with model confidence
* Detect with sample influence
* Detect with similar features

### 3.1. Detect with model confidence

### 3.2. Detect with sample influence

### 3.3. Detect with similar features
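
As a rough illustration of the idea, the sketch below flags an instance when its noisy label disagrees with the majority vote of the noisy labels of its $k$ nearest neighbors. This is a simplified assumption for illustration only, not Docta's SimiFeat implementation; the function name `knn_vote_detect` and the toy data are hypothetical.

```python
import numpy as np

def knn_vote_detect(features, noisy_labels, k=3):
    """Return indices whose noisy label differs from the k-NN majority vote."""
    # Pairwise squared Euclidean distances, shape (N, N)
    diff = features[:, None, :] - features[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)             # exclude the instance itself
    nn_idx = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbors
    num_classes = int(noisy_labels.max()) + 1
    flagged = []
    for n, neighbors in enumerate(nn_idx):
        votes = np.bincount(noisy_labels[neighbors], minlength=num_classes)
        if votes.argmax() != noisy_labels[n]:
            flagged.append(n)
    return flagged

# Toy data: two clusters of four points; instance 2 carries a wrong label
features = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                     [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]])
noisy_labels = np.array([0, 0, 1, 0, 1, 1, 1, 1])
flagged = knn_vote_detect(features, noisy_labels, k=3)
print(flagged)
```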

### 3.4. Example with Docta


#### 3.4.1 Dataset 
We will adopt the Iris dataset for illustration.

**Basic information**
The Iris data includes three iris species with 50 samples each as well as some properties about each flower. Here is a display of the main features/labels of the iris dataset.

In [52]:
import pandas as pd
base_path = './data/'
clean_iris = pd.read_csv(base_path + 'clean_Iris.csv')
clean_iris.head()

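<table>
  <tr><th>Unnamed: 0</th><th>Id</th><th>SepalLengthCm</th><th>SepalWidthCm</th><th>PetalLengthCm</th><th>PetalWidthCm</th><th>Species</th></tr>
  <tr><td>0</td><td>83</td><td>5.8</td><td>2.7</td><td>3.9</td><td>1.2</td><td>Iris-versicolor</td></tr>
  <tr><td>1</td><td>132</td><td>7.9</td><td>3.8</td><td>6.4</td><td>2.0</td><td>Iris-virginica</td></tr>
  <tr><td>2</td><td>93</td><td>5.8</td><td>2.6</td><td>4.0</td><td>1.2</td><td>Iris-versicolor</td></tr>
  <tr><td>3</td><td>29</td><td>5.2</td><td>3.4</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td></tr>
  <tr><td>4</td><td>12</td><td>4.8</td><td>3.4</td><td>1.6</td><td>0.2</td><td>Iris-setosa</td></tr>
</table>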


The column ``Id`` indicates the raw index of the sample in the [official Iris dataset](https://archive.ics.uci.edu/dataset/53/iris). As displayed in the above table, there are four key components (features) for categorizing the species of the iris flower: ``SepalLengthCm``, ``SepalWidthCm``, ``PetalLengthCm``, ``PetalWidthCm``.

#### 3.4.2 Synthesize label noise 

The following function gives an example pipeline for preparing a dataset for Docta: it encodes the columns numerically and injects synthetic label errors.

In [53]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

def process_csv(file_path, e=0.2):
    """
    Input:
      file_path: path of the raw csv file that you want to process
      e: the probability of flipping each label when simulating label errors
      (in this Iris data, we use the clean label to simulate label errors)
    Output:
      df: the processed DataFrame with label errors;
          each column denotes a kind of feature (converted to numerical values if needed),
          except for the last one, which is the (noisy) target column.
      clean_label: the clean targets, reserved for checking Docta's performance
    """
    # Load your data
    df = pd.read_csv(file_path)

    # (1) Rename the last column to 'clean_target'
    df.rename(columns={df.columns[-1]: 'clean_target'}, inplace=True)

    # (2) If 'clean_target' column is not of integer type, convert it
    if df['clean_target'].dtype != 'int':
        df['clean_target'] = LabelEncoder().fit_transform(df['clean_target'])

    # (3) Convert other columns to numerical values if they are not already
    for col in df.columns[:-1]:  # Exclude the last column
        if df[col].dtype == 'object':  # If the column has text
            # A fresh encoder per column, so this step does not depend on step (2)
            df[col] = LabelEncoder().fit_transform(df[col])  # Convert text to integer

    # (4) Add a new 'target' column holding the noisy labels
    n_unique = df['clean_target'].nunique()
    def generate_target(val):
        rand_val = np.random.random()
        if rand_val < e:
            # Flip to a different class chosen uniformly at random
            new_val = np.random.choice([i for i in range(n_unique) if i != val])
        else:
            new_val = val
        return new_val

    df['target'] = df['clean_target'].apply(generate_target)
    accuracy = (df['target'] == df['clean_target']).mean() * 100
    # Print the realized label error rate
    print(f"Label error rate: {100 - accuracy:.2f}%")
    clean_label = df['clean_target'].tolist()
    # Remove the sample index and the clean label
    df = df.drop(columns=['Id', 'clean_target'])

    return df, clean_label

Synthesize and save the noisy data

In [54]:
noisy_df, clean_label = process_csv(base_path + 'clean_Iris.csv', e=0.25)
noisy_df.to_csv(base_path + 'noisy_Iris.csv', index=False)


Label error rate: 25.33%


Detect label errors in the noisy data with Docta

In [56]:
from docta.apis import DetectLabel
from docta.core.report import Report
from docta.utils.config import Config
from docta.datasets import TabularDataset

# Load config
cfg = Config.fromfile('./config/label_error_tabular.py')

# Load the noisy tabular data
dataset = TabularDataset(root_path=cfg.data_root)
cfg.num_classes = len(np.unique(dataset.label))
print('Tabular-data load finished')

# Initialize the report and detect label errors
report = Report()
detector = DetectLabel(cfg, dataset, report = report)
detector.detect()



Tabular-data load finished
Detecting label errors with simifeat.
Estimating consensus patterns...


  0%|          | 0/50 [00:00<?, ?it/s]

100%|██████████| 50/50 [00:00<00:00, 165.88it/s]


Estimating consensus patterns... [Done]
Use cpu to solve equations


100%|██████████| 1501/1501 [00:04<00:00, 361.71it/s]


Solve equations... [Done]
Use SimiFeat-rank to detect label errors.


100%|██████████| 51/51 [00:00<00:00, 695.54it/s]

[SimiFeat] We find 37 corrupted instances from 150 instances





* Motivation for estimating T
  * Understanding (give a figure from CIFAR-N paper)
  * Give equation for loss correction (refer to next section)
  * Knowing T also helps aggregation (Sigmetrics’15)
* Estimate T
  * Anchor point (equation only, no code)
  * HOC
    * Equation + Figure
    * Example
      * Load an X matrix 100*10
      * Load noisy Y (show ground truth T)
      * Find 2-NN (print an example)
      * Build tensor
      * Solve equation
      * Show transition matrix
* Detection
  * Confident learning (equation + intuition)
  * Influence function (def of influence function)
  * SimiFeat
    * Equation + Figure
    * Example: 2D example
      * Show figure of the data points
      * Use one wrongly labeled sample to show:
        * Find K-NN neighbor
        * Weighted majority vote
        * Ranking + HOC
        * Show suggestion
