# Weakly Supervised Learning

## Introduction

According to Wikipedia, **supervised learning** is the machine learning task of learning a function that maps an input to an output based on example input-output pairs (i.e., a training set).

Most successful techniques, such as deep learning, require ground-truth labels for a large training data set. In many tasks, however, it is difficult to obtain strong supervision information because the data-labeling process is costly. It is therefore desirable for machine-learning techniques to be able to work with weak supervision. In general, we refer to these tasks as weakly supervised learning (WSL).

## WSL Settings

**(Strongly) supervised learning**: learn a classifier $f: \mathcal{X} \mapsto \mathcal{Y}$ from a training set $D = \{(x_1,y_1), \ldots, (x_m, y_m)\}$, where $\mathcal{X}$ is the feature space and $\mathcal{Y} = \{0,1\}$ for a binary classification task. Furthermore, we assume that $(x_i,y_i)$ are i.i.d. samples generated from some unknown true distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$.

### Three types of WSL tasks

Zhou (2018) categorizes WSL tasks into three types:
- Incomplete supervision
- Inexact supervision
- Inaccurate supervision

## Incomplete supervision

Incomplete supervision concerns the situation in which we are given a small amount of labeled data, which is insufficient to train a good learner, while abundant unlabeled data are available. 

Formally, the task is to learn $f : \mathcal{X} \mapsto \mathcal{Y}$ from a training data set
$D = \{(x_1, y_1), \ldots , (x_l, y_l), x_{l+1}, \ldots, x_{m}\}$, where there are $l$ labeled training examples (i.e., those given with $y_i$) and $u = m - l$ unlabeled instances.

![](./Incomplete.png)

## Incomplete supervision: active and semi-supervised learning

There are two major techniques for incomplete supervision problems:
- **Active learning**: there is an ‘oracle’, such as a human expert, that can be queried to get ground-truth labels for selected unlabeled instances.
- **Semi-supervised learning**: automatically exploit unlabeled data in addition to labeled data to improve learning performance, with no human intervention assumed.
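The contrast between the two techniques can be sketched on a toy 1-D problem. This is only an illustration under stated assumptions: the nearest-centroid classifier, the margin-based uncertainty measure, the self-training (pseudo-labeling) step, and all function names are hypothetical choices made for this example, not methods prescribed by Zhou (2018).

```python
from statistics import mean

def fit_centroids(labeled):
    """Mean feature value of each class, computed from (x, y) pairs."""
    classes = {y for _, y in labeled}
    return {c: mean(x for x, y in labeled if y == c) for c in classes}

def margin(x, centroids):
    """Gap between the distances to the two centroids; small means uncertain."""
    return abs(abs(x - centroids[0]) - abs(x - centroids[1]))

def predict(x, centroids):
    """Label of the nearest centroid."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def active_learning_query(unlabeled, centroids):
    """Active learning: choose the instance whose label we ask the oracle for."""
    return min(unlabeled, key=lambda x: margin(x, centroids))

def self_training_round(labeled, unlabeled, min_margin):
    """Semi-supervised learning: pseudo-label confident instances, no human involved."""
    centroids = fit_centroids(labeled)
    confident = [x for x in unlabeled if margin(x, centroids) >= min_margin]
    return labeled + [(x, predict(x, centroids)) for x in confident]

# Toy data: l = 4 labeled examples and u = 3 unlabeled instances.
labeled = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
unlabeled = [0.5, 4.8, 9.5]

centroids = fit_centroids(labeled)
query = active_learning_query(unlabeled, centroids)
augmented = self_training_round(labeled, unlabeled, min_margin=5.0)
print(query)           # the ambiguous point near the class boundary
print(augmented[-2:])  # pseudo-labeled confident points
```

In this sketch, active learning spends the labeling budget on the ambiguous instance, whereas self-training adds only the instances the current model is already confident about.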

## Inexact supervision

Inexact supervision concerns the situation in which some supervision information is given, but it is not as exact as desired.

Formally, the task is to learn $f : \mathcal{X} \mapsto \mathcal{Y}$ from a training data set $D = \{(X_1,y_1), \ldots, (X_m, y_m)\}$, where $X_i = \{x_{i,1}, \ldots, x_{i,m_i}\} \subseteq \mathcal{X}$ is called a bag, $x_{i,j} \in \mathcal{X}$ is an instance, and $m_i$ is the number of instances in $X_i$.

![](./inexact.png)

## Inexact supervision: multi-instance learning

Multi-instance learning is the case where $X_i$ is a positive bag, i.e. $y_i = 1$, if there exists an instance $x_{i,p}$ that is positive, where the index $p \in \{1, \ldots, m_i\}$ is unknown. The goal is to predict labels for unseen bags. There are also studies that try to identify the key instance that makes a positive bag positive.
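The bag-labeling rule can be sketched in a few lines. The instance-level scores, the `0.5` threshold, and the function names below are assumptions made for illustration; in practice the scores would come from some learned instance-level model.

```python
def bag_label(instance_scores, threshold=0.5):
    """A bag is positive iff at least one of its instances is positive."""
    return int(max(instance_scores) >= threshold)

def key_instance(instance_scores):
    """Index of the highest-scoring instance, a candidate key instance."""
    return max(range(len(instance_scores)), key=instance_scores.__getitem__)

# Hypothetical instance scores in [0, 1] for two bags.
bags = {"bag_a": [0.1, 0.9, 0.2], "bag_b": [0.2, 0.3, 0.1]}
labels = {name: bag_label(scores) for name, scores in bags.items()}
print(labels)
print(key_instance(bags["bag_a"]))
```

Taking the maximum over instances encodes the "there exists a positive instance" condition, and the arg-max gives a simple guess at the key instance.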

Multi-instance learning has been successfully applied to various tasks, including:
- image categorization/retrieval/annotation
- text categorization
- spam detection
- medical diagnosis
- face/object detection 
- object class discovery
- object tracking

## Inaccurate supervision

Inaccurate supervision concerns the situation in which the supervision information is not always ground truth; in other words, some labels may be wrong. In practice, a basic idea is to identify the potentially mislabeled examples and then try to correct them. We refer to this procedure as "label correction". Many WSL methods that address inaccurate supervision fall into this category.
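One simple way to picture label correction is a neighborhood vote: an example whose label disagrees with most of its nearest neighbors is treated as suspect and relabeled. The sketch below assumes 1-D features, `k = 3`, and hypothetical function names; it is one possible heuristic, not the specific procedure of any method cited here.

```python
from collections import Counter

def neighbor_vote(i, xs, ys, k):
    """Majority label among the k nearest neighbours of example i."""
    others = sorted((j for j in range(len(xs)) if j != i),
                    key=lambda j: abs(xs[j] - xs[i]))
    return Counter(ys[j] for j in others[:k]).most_common(1)[0][0]

def correct_labels(xs, ys, k=3):
    """Return (indices of suspect examples, label list after correction)."""
    votes = [neighbor_vote(i, xs, ys, k) for i in range(len(xs))]
    suspects = [i for i, (y, v) in enumerate(zip(ys, votes)) if y != v]
    corrected = list(ys)
    for i in suspects:
        corrected[i] = votes[i]  # replace the suspect label with the vote
    return suspects, corrected

# Two well-separated clusters; the example at index 2 carries a wrong label.
xs = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
ys = [0, 0, 1, 0, 1, 1, 1, 1]
suspects, corrected = correct_labels(xs, ys)
print(suspects, corrected)
```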

![](./Inaccurate.png)

## A typical situation with inaccurate supervision: crowdsourcing

An interesting recent scenario of inaccurate supervision occurs with *crowdsourcing*, a popular paradigm to outsource work to individuals. A famous crowdsourcing system, Amazon Mechanical Turk (AMT), is a market where the user can submit a task, such as annotating images of trees versus non-trees, to be completed by workers in exchange for small monetary payments. 

The workers usually come from a large population, and each of them is presented with multiple tasks. They are usually independent and relatively inexpensive, and they provide labels based on their own judgment. Some workers may be more reliable than others; however, the user usually does not know this in advance because the identities of workers are protected. There may be ‘spammers’ who assign almost random labels to the tasks (e.g. robots pretending to be humans for the monetary payment), or ‘adversaries’ who deliberately give incorrect answers.
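A common baseline for combining crowdsourced labels is majority voting, optionally followed by checking how often each worker agrees with the consensus. The sketch below assumes this baseline; the data layout, worker IDs, and function names are hypothetical, and real systems often use more refined aggregation models.

```python
from collections import Counter

def majority_vote(votes):
    """Most frequent label among the workers' answers for one task."""
    return Counter(votes).most_common(1)[0][0]

def agreement_rate(worker, annotations, consensus):
    """Fraction of a worker's answers that match the consensus label."""
    answered = [task for task, votes in annotations.items() if worker in votes]
    hits = sum(annotations[task][worker] == consensus[task] for task in answered)
    return hits / len(answered)

# annotations[task][worker] = label given by that worker (toy data).
annotations = {
    "img_1": {"w1": "tree", "w2": "tree", "w3": "non-tree"},
    "img_2": {"w1": "non-tree", "w2": "non-tree", "w3": "tree"},
    "img_3": {"w1": "tree", "w2": "tree", "w3": "tree"},
}
consensus = {task: majority_vote(list(votes.values()))
             for task, votes in annotations.items()}
rates = {w: agreement_rate(w, annotations, consensus) for w in ("w1", "w2", "w3")}
print(consensus)
print(rates)
```

A persistently low agreement rate is one rough signal that a worker may be a spammer or an adversary, although with few tasks it can also be plain disagreement.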

## Mixed cases

Though the three types of WSL are discussed separately, in practice they often occur simultaneously. 

![](./WSL_types.png)

# An example of WSL methods in image classification


As mentioned earlier, many different methods exist in the literature of WSL. In this tutorial, we will briefly introduce a simplified version of the method proposed in Section 4 of [Inoue et al. (2017)](https://openaccess.thecvf.com/content_ICCV_2017_workshops/papers/w32/Inoue_Multi-Label_Fashion_Image_ICCV_2017_paper.pdf), which addresses the issue of inaccurate supervision.

## Problem setting

Formally, we have a very large training dataset $T$ consisting of pairs of images $x$ and noisy labels $y$, $T = \{(x_i, y_i), \ldots\}$. Additionally, we have a small dataset $V$ with human-verified labels $v$, $V = \{(x_j, y_j, v_j), \ldots\}$. The number of examples in $T$ is significantly larger than that in $V$. For a $d$-class classification problem, the $y_j$'s and $v_j$'s are $d$-dimensional one-hot variables that encode the class of the $j$-th object according to its noisy label and its verified label, respectively.

## The two-phase deep learning model

The model is designed to jointly learn to generate accurate labels from noisy labels and to learn a more accurate multilabel classifier from the generated labels. In particular, it consists of two convolutional neural networks (for those who are not familiar with CNNs, PyTorch provides [a good tutorial](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html), among others):

- **Phase 1: label correction**
    
    The classifier $g$ in this phase is called the *label cleaning network*. It learns a mapping from noisy labels $y$ to human-verified labels $v$, conditional on the input image. Its output $c$ denotes the cleaned labels. The classifier $g$ has two separate inputs: the noisy labels $y$ and the visual features $f(x)$. Each input is projected into an embedding by a linear layer; the two embeddings are concatenated and then transformed by a hidden linear layer. Finally, $y$ is added to the output through an identity-skip connection, and the result is clipped to $[0, 1]$ so that it remains in the valid label space. For training, the authors minimize the $L_1$ loss.

- **Phase 2: image classification**

    The second classifier $h$ is called the *image classifier*. It learns to predict labels by imitating the first classifier, using only the image as input. It resembles a standard image classifier, except that it learns a mapping from the input image $x$ to the cleaned label $c$ instead of $y$ or $v$. For training, the cross-entropy loss is minimized.
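The Phase 2 objective can be illustrated with a small, framework-free sketch. It assumes a softmax output layer and treats the cleaned label $c$ as the target distribution; both choices, along with the function names and the toy logits, are assumptions made for this example rather than details confirmed by the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, logits):
    """Cross-entropy between a target label vector and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# Hypothetical cleaned label c from Phase 1 and logits h(x) for one image.
c = [0.0, 1.0, 0.0]
logits = [0.0, 2.0, 0.0]
loss = cross_entropy(c, logits)
print(loss)
```

The only difference from ordinary supervised training in this sketch is the target: the loss is computed against the cleaned label $c$ rather than the noisy label $y$.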

## The architectures - label correction network
![](./label_cleaning_net.png)
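
The forward pass of the label cleaning network, as described in Phase 1, can be sketched without any deep learning framework. This is a minimal illustration under stated assumptions: the layer sizes, the parameter names (`W_y`, `W_x`, `W_h`, and the biases), and the choice to let the hidden layer output $d$ values directly are hypothetical, and the weights are set by hand rather than learned. A real implementation would use a framework such as PyTorch.

```python
def linear(weights, bias, vec):
    """Affine map; `weights` is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def clean_labels(y, feats, params):
    """Forward pass of g: embed both inputs, concatenate, transform, skip, clip."""
    y_emb = linear(params["W_y"], params["b_y"], y)      # embed noisy labels y
    x_emb = linear(params["W_x"], params["b_x"], feats)  # embed visual features f(x)
    hidden = linear(params["W_h"], params["b_h"], y_emb + x_emb)  # concatenation
    # Identity-skip connection, then clipping to [0, 1].
    return [min(1.0, max(0.0, yi + hi)) for yi, hi in zip(y, hidden)]

def l1_loss(c, v):
    """L1 distance between cleaned labels c and verified labels v."""
    return sum(abs(ci - vi) for ci, vi in zip(c, v))

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

d, feat_dim, emb_dim = 3, 2, 2  # toy sizes, chosen for illustration
params = {
    "W_y": zeros(emb_dim, d), "b_y": [0.0] * emb_dim,
    "W_x": zeros(emb_dim, feat_dim), "b_x": [0.0] * emb_dim,
    "W_h": zeros(d, 2 * emb_dim), "b_h": [0.0, -1.0, 0.0],
}
y = [1.0, 1.0, 0.0]   # noisy label vector
v = [1.0, 0.0, 0.0]   # human-verified label vector
feats = [0.3, -0.2]   # stand-in for f(x)

c = clean_labels(y, feats, params)
loss = l1_loss(c, v)
print(c, loss)
```

With all weights and biases at zero, the skip connection makes the network return $y$ unchanged, so training only has to learn the correction to the noisy labels.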

## The architectures - image classifier
![](./image_clf.png)

For more details on training, please refer to the original paper.

## References

 - A brief introduction to weakly supervised learning ([Zhou, 2018](https://academic.oup.com/nsr/article/5/1/44/4093912))
 - Multi-Label Fashion Image Classification with Minimal Human Supervision ([Inoue et al., 2017](https://openaccess.thecvf.com/content_ICCV_2017_workshops/papers/w32/Inoue_Multi-Label_Fashion_Image_ICCV_2017_paper.pdf))