# Section 2: Handling label noise in datasets

In this section, you will learn:

* Why the label noise transition matrix $T$ matters when handling label noise.

* How to estimate $T$ from a dataset that has only noisy labels.

* How to detect label errors in a dataset.

## 1. Importance of $T$

* Understand the pattern/structure of label noise
* Design robust loss functions
* Help label aggregation (weighted majority vote)

### 1.1 Understand the pattern of label noise

### Examples from CIFAR-10N

* CIFAR-10N: 
  * 10 classes. 
  * Each image is annotated by 3 independent human workers.
* Aggregated labels: 
  * Take the majority vote of the 3 annotations.
  * Break ties uniformly at random.

<table>
  <tr>
    <td ><img src="tutorial_imgs/c10_agg.png" width="700"> 

    Figure: Label noise transition matrix of CIFAR-10N (aggregated labels).
</td>
    <td>
    * Humans can be very accurate on some classes (ship 97%, horse 96%)<br/>
    * Humans can be inaccurate on other classes (cat 83%, deer 83%)<br/>
    * Human annotations have bias:<br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- Horse and deer are highly similar, <b>but</b> the confusion between them is asymmetric:<br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- Humans tend to annotate deer as horse: deer &rarr; horse 0.04<br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- Humans tend <b>not</b> to annotate horse as deer: horse &rarr; deer 0.01 <br/>
    </td>
  </tr>
</table> 
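
When both clean and noisy labels are available for the same samples, a transition matrix like the one in the figure can be obtained by counting. The sketch below assumes the convention $T_{ij} = P(\widetilde Y = j \mid Y = i)$ (rows index the clean label). The function name `empirical_transition_matrix` and the toy labels are made up for illustration; they are not CIFAR-10N data.

```python
import numpy as np

def empirical_transition_matrix(clean, noisy, num_classes):
    """Row-normalized co-occurrence counts.

    T_hat[i, j] estimates P(noisy label = j | clean label = i).
    """
    counts = np.zeros((num_classes, num_classes))
    # Add 1 to counts[clean[n], noisy[n]] for every sample n.
    np.add.at(counts, (clean, noisy), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Guard against division by zero for classes absent from `clean`.
    return counts / np.maximum(row_sums, 1)

# Toy labels (illustrative only).
clean = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
noisy = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0])

T_hat = empirical_transition_matrix(clean, noisy, num_classes=2)
print(T_hat)
```

The diagonal of `T_hat` is the per-class annotation accuracy, and comparing `T_hat[i, j]` with `T_hat[j, i]` reveals asymmetric confusions such as the deer/horse pair.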

### Examples from CIFAR-100N

* CIFAR-100N: 
  * 20 coarse classes, 100 fine classes. Each coarse class contains 5 fine classes.
  * Each image is annotated by 1 human worker.


<table>
  <tr>
    <td ><img src="tutorial_imgs/c100_coarse.png" width="1100"> 

    Figure: Label noise transition matrix of CIFAR-100N (coarse labels).
</td>
    <td>
    * Humans can be very accurate on some classes<br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- people 94%<br/>
    * Humans can be inaccurate on other classes <br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- medium-sized mammals 47%<br/>
    * Human annotations have bias:<br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- man-made &rarr; natural 0.09<br/>
      &nbsp;&nbsp;&nbsp;&nbsp;- natural &rarr; man-made 0.03 <br/>
    </td>
  </tr>
</table> 

### 1.2 Design robust loss functions

Recall the notation:
* Feature $X$, clean label $Y$, noisy label $\widetilde Y$. 
* Model $\bm f(\cdot)$: takes $X$ as input and outputs a column vector whose $i$-th entry is the predicted probability of label class $i$.
* Loss function $\ell$.
* Label noise transition matrix $\bm T$ with entries $T_{ij} = P(\widetilde Y = j \mid Y = i)$, and its transpose $\bm T^\top$.

#### Forward loss correction:
$$
\ell^{\rightarrow}(\bm f(X),\widetilde Y):= \ell(\bm T^\top \bm f(X),\widetilde Y).
$$
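
The idea is that if $\bm f(X)$ predicts the clean-label distribution, then $\bm T^\top \bm f(X)$ predicts the noisy-label distribution, which can be compared directly with $\widetilde Y$. Below is a minimal NumPy sketch for a single sample, assuming $T_{ij} = P(\widetilde Y = j \mid Y = i)$ and cross-entropy as the base loss $\ell$. The helper names and the toy values of `T` and `f_x` are illustrative assumptions.

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy for one sample, given a vector of class probabilities."""
    return -np.log(probs[label] + 1e-12)

def forward_corrected_loss(f_x, noisy_label, T, loss=cross_entropy):
    """Forward correction: evaluate the loss on T^T f(X) instead of f(X)."""
    noisy_probs = T.T @ f_x  # predicted distribution over noisy labels
    return loss(noisy_probs, noisy_label)

# Toy transition matrix (rows: clean label, columns: noisy label).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])
# Toy model output: predicted clean-label probabilities for one sample.
f_x = np.array([0.2, 0.8])

noisy_probs = T.T @ f_x
print(noisy_probs)
print(forward_corrected_loss(f_x, noisy_label=1, T=T))
```

In training, the same transformation would be applied to each sample's predicted probabilities before the loss is computed, so gradients flow through $\bm T^\top$ back to the model.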

### 1.3 Help label aggregation (weighted majority vote)

Intuition:
* Normal majority vote: each labeler has the same weight. 
  * E.g., $\text{MV}(1,1,0) = 1$.
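* Weighted majority vote: each labeler makes mistakes with some probability, so each vote is weighted by how informative it is. 
  * E.g., if label class 1 is rare, the first two labelers are unreliable, and the third labeler is reliable, we may have $\text{MV}_\text{Weighted}(1,1,0) = 0$.
  * Worked example: $P(Y=1) = 0.2$, and for simplicity every labeler follows the same transition matrix $T = \begin{pmatrix} 0.8 & 0.2 \\ 0.7 & 0.3 \end{pmatrix}$, where $T_{ij} = P(\widetilde Y = j \mid Y = i)$ with classes indexed from 0. The noisy labels are assumed to be conditionally independent given $Y$.
  * Probability of label 1: $$   \begin{align*} & P(Y=1| \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)  \\ = & \frac{P(Y=1)}{P( \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)} \cdot P(\widetilde Y_1=1|Y=1) \cdot P(\widetilde Y_2=1|Y=1) \cdot P(\widetilde Y_3=0|Y=1) \\ = & \frac{0.2 \cdot 0.3 \cdot 0.3 \cdot 0.7}{P( \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)} = \frac{0.0126}{P( \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)}\end{align*}  $$
  * Probability of label 0: $$   \begin{align*} & P(Y=0| \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)  \\ = & \frac{P(Y=0)}{P( \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)} \cdot P(\widetilde Y_1=1|Y=0) \cdot P(\widetilde Y_2=1|Y=0) \cdot P(\widetilde Y_3=0|Y=0) \\ = & \frac{0.8 \cdot 0.2 \cdot 0.2 \cdot 0.8}{P( \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)} = \frac{0.0256}{P( \widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)} \end{align*} $$
  * Both posteriors share the same denominator and $0.0256 > 0.0126$, so the weighted majority vote returns label 0 even though two of the three labelers voted 1.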
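
The calculation above can be reproduced in a few lines of NumPy. This is a sketch under the same assumptions as the worked example: all labelers share one $T$ with $T_{ij} = P(\widetilde Y = j \mid Y = i)$, and their labels are conditionally independent given $Y$. The function name `weighted_majority_vote` is chosen here for illustration.

```python
import numpy as np

def weighted_majority_vote(noisy_labels, prior, T):
    """Return the most probable clean label and the unnormalized posteriors.

    scores[i] = P(Y = i) * prod_k T[i, noisy_labels[k]]
    """
    noisy_labels = np.asarray(noisy_labels)
    # T[:, noisy_labels] has shape (num_classes, num_labelers);
    # multiplying along axis 1 gives the likelihood of the votes under each class.
    likelihood = np.prod(T[:, noisy_labels], axis=1)
    scores = prior * likelihood
    return int(np.argmax(scores)), scores

prior = np.array([0.8, 0.2])          # P(Y=0), P(Y=1)
T = np.array([[0.8, 0.2],
              [0.7, 0.3]])

label, scores = weighted_majority_vote([1, 1, 0], prior, T)
posterior = scores / scores.sum()     # normalize by the shared denominator

print(label)      # weighted vote
print(scores)     # unnormalized posteriors for Y=0 and Y=1
print(posterior)  # normalized posteriors
```

Dividing `scores` by their sum recovers the shared denominator $P(\widetilde Y_1 = 1, \widetilde Y_2 = 1, \widetilde Y_3 = 0)$, so `posterior` holds the actual posterior probabilities.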
  
 

## 2. Estimate $T$

* Naive approach
* Estimate with anchor points
* Estimate with consensus patterns

### 2.1. Naive approach

### 2.2. Estimate with anchor points

### 2.3. Estimate with consensus patterns

## 3. Detect label errors

* Detect with model confidence
* Detect with sample influence
* Detect with similar features


Planned content outline:

* Motivation for estimating $T$
  * Understanding (give a figure from CIFAR-N paper)
  * Give equation for loss correction (refer to next section)
  * Knowing $T$ also helps aggregation (Sigmetrics’15)
* Estimate $T$
  * Anchor point (equation only, no code)
  * HOC
    * Equation + Figure
    * Example
      * Load an $X$ matrix (100 × 10)
      * Load noisy $Y$ (show ground truth $T$)
      * Find 2-NN (print an example)
      * Build tensor
      * Solve equation
      * Show transition matrix
* Detection
  * Confident learning (equation + intuition)
  * Influence function (definition of influence function)
  * SimFeat
    * Equation + Figure
    * Example: 2D example
      * Show figure of the data points
      * Use one wrongly labeled sample to show:
        * Find K-NN neighbors
        * Weighted majority vote
        * Ranking + HOC
        * Show suggestion
