## Title
[Learning from few examples: A summary of approaches to Few-Shot Learning](https://arxiv.org/abs/2203.04291)

## Authors and Year
Archit Parnami, Minwoo Lee (2022)

## Abstract
Few-Shot Learning refers to the problem of learning the underlying pattern in the data just from a few training samples. Requiring a large number of data samples, many deep learning solutions suffer from data hunger and extensively high computation time and resources. Furthermore, data is often not available due to not only the nature of the problem or privacy concerns but also the cost of data preparation. Data collection, preprocessing, and labeling are strenuous human tasks. Therefore, few-shot learning that could drastically reduce the turnaround time of building machine learning applications emerges as a low-cost solution. This survey paper comprises a representative list of recently proposed few-shot learning algorithms. Given the learning dynamics and characteristics, the approaches to few-shot learning problems are discussed in the perspectives of meta-learning, transfer learning, and hybrid approaches (i.e., different variations of the few-shot learning problem).

<p align="center">
    <img src="Images/Categorize.jpg" alt="drawing" width="800"/>
</p>

## Introduction
#### History of deep learning and FSL (Few-Shot Learning)

- Deep learning needs a large amount of **labeled** data.   
    - Given enough labeled data, these models perform well in **image classification, machine translation, and speech modeling**.   
   
- Problem : we sometimes want to learn from very few labeled examples.   
    - Data collection and labeling are **strenuous and costly** human tasks.
    - In many fields, data is hard or impossible to acquire for reasons such as privacy and safety.   
      
- **Few-Shot Learning** : the ability of machine learning models to **generalize from few training examples**.

## Background
#### Meta-Learning

- Meta-Learning = Learning to learn   
    - Focuses on **learning priors from previous experiences** that can lead to efficient downstream learning of new tasks.
    - Gathers experience across **multiple similar tasks**, and uses that experience to **solve a new task**.   
   
- Two levels of Meta-Learning
    1. Within tasks : Learning to accurately classify within a **particular dataset**.
    2. Across tasks : Captures the way task structure varies **across target domains**.
   
##### Problem definition; typical supervised learning setting
- Task $\mathcal{T}$ with a dataset $\mathcal{D} = \left\{\left(x_{k},y_{k}\right)\right\}_{k=1}^{n}$ of $n$ data samples.
- Split $\mathcal{D}$ into $\mathcal{D^{train}}$ and $\mathcal{D^{test}}$ such that:
$$
    \mathcal{D^{train}} = \left\{\left(x_{k},y_{k}\right)\right\}_{k=1}^{t} \\   
    \mathcal{D^{test}} = \left\{\left(x_{k},y_{k}\right)\right\}_{k=t+1}^{n} \\   
    \scriptstyle\text{t = number of training samples}
$$
- We optimize parameter $\theta$ on the training set $\mathcal{D^{train}}$, and evaluate its generalization performance on the test set $\mathcal{D^{test}}$.
$$
    y \approx f\left(x;\theta\right) \text{ where } \left(x,y\right) \in \mathcal{D^{test}} \\   
    \theta = \mathrm{arg}\underset{\theta}\min\sum_{\left(x,y\right)\in\mathcal{D^{train}}}\mathcal{L}\left(f\left(x;\theta\right),y\right) \\   
    \scriptstyle\text{y = true label}
$$
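
The supervised setting above can be sketched in a few lines. This is a minimal illustration, not anything taken from the paper: the one-parameter linear model `f`, the squared-error `loss`, the synthetic data, and the 80/20 split are all assumptions chosen so that the arg min over $\theta$ has a closed form.

```python
import random

def f(x, theta):
    """Model f(x; theta): a one-parameter linear map (illustrative assumption)."""
    return theta * x

def loss(y_hat, y):
    """Squared error, standing in for the generic loss L."""
    return (y_hat - y) ** 2

def fit(d_train):
    """Closed-form arg min of the summed squared loss over D^train for f = theta * x."""
    num = sum(x * y for x, y in d_train)
    den = sum(x * x for x, _ in d_train)
    return num / den

def mean_loss(dataset, theta):
    """Average loss of f(.; theta) over a dataset of (x, y) pairs."""
    return sum(loss(f(x, theta), y) for x, y in dataset) / len(dataset)

rng = random.Random(0)

# Synthetic dataset D with n samples: y = 2x plus small Gaussian noise (assumed).
n, t = 100, 80
D = []
for _ in range(n):
    x = rng.uniform(-1.0, 1.0)
    D.append((x, 2.0 * x + rng.gauss(0.0, 0.05)))

# First t samples form D^train, the remaining n - t form D^test.
d_train, d_test = D[:t], D[t:]

theta = fit(d_train)                  # optimize theta on D^train only
test_loss = mean_loss(d_test, theta)  # generalization is measured on D^test
print(theta, test_loss)
```

The point of the sketch is the separation of roles: `fit` only ever sees `d_train`, and `d_test` is used solely to estimate how well the fitted $\theta$ generalizes.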
   

      
##### Problem definition; **Meta-Learning**
- We have a **distribution $p\left(\mathcal{T}\right)$ over tasks $\mathcal{T}$**
- A meta-learner learns from a set of training tasks $\mathcal{T_{i}}\overset{train}{\sim}p\left(\mathcal{T}\right)$ and is evaluated on a set of testing tasks $\mathcal{T_{i}}\overset{test}{\sim}p\left(\mathcal{T}\right)$. Each of these tasks has its own dataset $\mathcal{D_{i}}$ where $\mathcal{D_{i}} = \left\{\mathcal{D_{i}^{train}},\mathcal{D_{i}^{test}}\right\}$.
- Training tasks: $\mathcal{T_{meta-train}}=\left\{\mathcal{T_{1}},\mathcal{T_{2}},\cdots,\mathcal{T_{n}}\right\}$, testing tasks: $\mathcal{T_{meta-test}}=\left\{\mathcal{T_{n+1}},\mathcal{T_{n+2}},\cdots,\mathcal{T_{n+k}}\right\}$ (here $n$ and $k$ count tasks, not data samples).
- Training dataset: $\mathcal{D_{meta-train}}=\left\{\mathcal{D_{1}},\mathcal{D_{2}},\cdots,\mathcal{D_{n}}\right\}$, testing dataset: $\mathcal{D_{meta-test}}=\left\{\mathcal{D_{n+1}},\mathcal{D_{n+2}},\cdots,\mathcal{D_{n+k}}\right\}$.
- Parameters $\theta$ are optimized on $\mathcal{D_{meta-train}}$ and their generalization performance is tested on $\mathcal{D_{meta-test}}$.

$$
    y \approx f\left(\mathcal{D_{i}^{train}},x;\theta\right) \text{ where } \left(x,y\right) \in \mathcal{D_{i}^{test}} \\   
    \mathcal{D_{i}} = \left\{\mathcal{D_{i}^{train}}, \mathcal{D_{i}^{test}}\right\} \text{ where } \mathcal{D_{i}}\in\mathcal{D_{meta-test}} \\   
    \theta = \mathrm{arg}\underset{\theta}\min\sum_{\mathcal{D_{i}}\in\mathcal{D_{meta-train}}}\sum_{\left(x,y\right)\in\mathcal{D_{i}^{test}}}\mathcal{L}\left(f\left(\mathcal{D_{i}^{train}},x;\theta\right),y\right)
$$
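
To make the meta-learning objective concrete, here is a rough sketch under strong simplifying assumptions, none of which come from the paper: $\theta$ is a single scalar that weights two input features, $f\left(\mathcal{D_{i}^{train}},x;\theta\right)$ is a nearest-prototype classifier, the loss is 0-1 error, and a coarse grid search stands in for the arg min. The task generator `sample_task` is hypothetical toy data in which only the first coordinate carries class information.

```python
import random

def embed(x, theta):
    """Feature weighting g(x; theta): theta trades off the two input coordinates."""
    a, b = x
    return (theta * a, (1.0 - theta) * b)

def f(d_train, x, theta):
    """f(D_i^train, x; theta): label of the nearest class prototype."""
    sums = {}  # label -> [sum of first coord, sum of second coord, count]
    for x_s, y_s in d_train:
        e = embed(x_s, theta)
        s = sums.setdefault(y_s, [0.0, 0.0, 0])
        s[0] += e[0]
        s[1] += e[1]
        s[2] += 1
    q = embed(x, theta)

    def sq_dist(label):
        sx, sy, cnt = sums[label]
        return (q[0] - sx / cnt) ** 2 + (q[1] - sy / cnt) ** 2

    return min(sums, key=sq_dist)

def meta_objective(theta, tasks):
    """Sum over tasks D_i and over (x, y) in D_i^test of the 0-1 loss."""
    total = 0
    for d_train, d_test in tasks:
        for x, y in d_test:
            total += int(f(d_train, x, theta) != y)
    return total

def sample_task(rng, shots=1, queries=5):
    """Hypothetical 2-class task: coordinate a carries the class, b is pure noise."""
    offset = rng.uniform(-10.0, 10.0)  # task-specific shift of both classes

    def draw(label):
        a = offset + 4.0 * label + rng.gauss(0.0, 0.5)
        b = rng.gauss(0.0, 3.0)
        return ((a, b), label)

    d_train = [draw(c) for c in (0, 1) for _ in range(shots)]
    d_test = [draw(c) for c in (0, 1) for _ in range(queries)]
    return d_train, d_test

rng = random.Random(0)
meta_train = [sample_task(rng) for _ in range(20)]  # D_meta-train
meta_test = [sample_task(rng) for _ in range(20)]   # D_meta-test

# Grid search over theta stands in for the arg min over D_meta-train.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
best_theta = min(grid, key=lambda th: meta_objective(th, meta_train))
print(best_theta, meta_objective(best_theta, meta_test))
```

Note that $\theta$ is never fitted to any single task: it is chosen so that adapting to each $\mathcal{D_{i}^{train}}$ (here, computing prototypes) works well on the matching $\mathcal{D_{i}^{test}}$, and it is then evaluated on tasks it has not seen.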
   
<p align="center">
    <img src="Images/Meta-learning.jpg" alt="drawing" width="800"/>
</p>

##### How does meta-learning differ from transfer learning / multi-task learning / ensemble learning?
- Transfer learning
    - **Source task** : model is trained on a **single task** in the source domain where **sufficient training data is available**.
    - **Target task** : trained model is retrained or finetuned on **another single task** in the target domain.
    - Knowledge transfer occurs from the source task to the target task.
   
- Multi-task learning
    - Learns multiple tasks **simultaneously**.
    - Starts from **no prior experience**, and optimizes for **solving multiple tasks at the same time**.
   
- Ensemble learning
    - It is the process by which **multiple models are strategically generated and combined** to solve a particular task.


## Few-Shot Learning
#### The Few-Shot Classification Problem; **M-way-K-shot task**
<p align="center">
    <img src="Images/Meta-learning_framework.jpg" alt="drawing" width="800"/>
</p>

- M : number of classes
- K : number of examples per class present in $\mathcal{D^{train}}$
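    - K is usually a small number (e.g., 1, 5, or 10)
- $|\mathcal{D^{train}}| = M \times K$
- Performance is measured by a loss function $\mathcal{L}\left(\hat{y},y\right)$, where $\hat{y}$ is the predicted label and $y$ the true label.
- Usually there is a large dataset, from which M classes are sampled.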
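
Sampling one M-way-K-shot task from a larger labeled dataset might look like the sketch below. The function name `sample_episode`, the `n_query` parameter (number of test examples per class), and the relabeling of the sampled classes to `0..M-1` are assumptions made for illustration. The toy dataset is synthetic.

```python
import random
from collections import defaultdict

def sample_episode(dataset, m_way, k_shot, n_query, rng):
    """Sample one M-way-K-shot task from a list of (x, label) pairs.

    Returns (d_train, d_test): K examples per class for training and
    n_query further examples per class for testing.
    """
    by_class = defaultdict(list)
    for x, label in dataset:
        by_class[label].append(x)

    # Only classes with enough examples for both splits can be sampled.
    eligible = [c for c, xs in by_class.items() if len(xs) >= k_shot + n_query]
    classes = rng.sample(eligible, m_way)

    d_train, d_test = [], []
    for new_label, c in enumerate(classes):
        # Draw without replacement so D^train and D^test never overlap.
        xs = rng.sample(by_class[c], k_shot + n_query)
        d_train += [(x, new_label) for x in xs[:k_shot]]
        d_test += [(x, new_label) for x in xs[k_shot:]]
    return d_train, d_test

# Toy "large dataset": 20 classes with 30 examples each (synthetic placeholders).
dataset = [((c, i), c) for c in range(20) for i in range(30)]

rng = random.Random(0)
d_train, d_test = sample_episode(dataset, m_way=5, k_shot=1, n_query=15, rng=rng)
print(len(d_train), len(d_test))  # prints "5 75": M x K and M x n_query
```

Calling `sample_episode` repeatedly on disjoint sets of classes gives the meta-training and meta-testing tasks; the size of `d_train` is always M x K, matching the definition above.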