# **[One-shot Learning with Memory-Augmented Neural Networks](https://arxiv.org/pdf/1605.06065)**

**One-shot learning** (i.e., learning from a single example) is a **challenge for traditional deep neural networks**, which **require large amounts of data** and **long training times**.

When a **new piece of data is introduced**, these networks have to readjust their **parameters in an inefficient way**, also **risking forgetting** the **knowledge acquired previously** (catastrophic interference problem).

The authors propose **neural networks augmented with external memory**, such as **[Neural Turing Machines (NTM)](https://arxiv.org/pdf/1410.5401)**. These architectures **allow new information to be stored and recalled quickly**, without relying only on slow, **iterative weight updates**. The paper shows that a **neural network with memory can assimilate data** quickly and make accurate predictions after only a few examples.



The **success of modern deep learning** relies on **gradient-based optimization** (gradients indicate how much each of the **network's weights** should be updated) of high-capacity models, which work well in **large-scale supervised tasks** such as **image classification, speech recognition, and games**.

However, these **models need a lot of data and incremental training**, making them unsuitable for situations where rapid learning from a few examples is required ("one-shot learning").

**This kind of flexible adaptation is typical of human beings**: with just **one** example we can **infer new meanings or behaviors**. 
**Conventional neural networks fail in this**, as they have to **readjust the weights for each new piece of data**, with the risk of **catastrophic interference**. For this reason, nonparametric methods are often preferred in low-data contexts.

_The problem of **catastrophic interference** refers to the fact that when **conventional neural networks** learn new data, they can **"forget"** what they have previously learned, a problem that **humans generally do not have** in such a pronounced way._

A possible solution is **[meta-learning](https://machinelearningmastery.com/meta-learning-in-machine-learning/)** (learning algorithms that learn from other learning algorithms), i.e. learning on **two levels**:

1. fast learning within a single task (e.g. classifying images from a few new classes),
2. slow learning across different tasks, accumulating general knowledge about the structure of problems.

**Neural networks with memory**, such as **LSTMs**, have shown **meta-learning capabilities**, but they are **not suitable for contexts** where a lot of **new information needs to be encoded quickly**.
We need a **scalable architecture**, with:
- **Stable**, content-addressable memory.
- A **number of parameters independent** of the memory size.

**Recent models** such as **Neural Turing Machines (NTMs) or Memory Networks** do this. The paper proposes a class of networks called **Memory-Augmented Neural Networks (MANNs)**, which use **external memory**.
**MANNs** are capable of:
- **meta-learning** in tasks with high short- and long-term memory requirements,
- **classifying new Omniglot classes with human-like accuracy** after only a few presentations (Omniglot is a dataset of handwritten characters drawn from many alphabets, named after the online **encyclopedia of writing systems and languages**),
- **estimating complex functions** from a few examples.

Their approach **combines**:
- **slow learning** of useful **representations through gradient descent**,
- and quick storage of **new information via external memory**.


### Meta-Learning Task Methodology

In **meta-learning**, the goal is not simply to minimize a learning cost $( L )$ on a single dataset $( D )$, but to **minimize the expected cost** over a **distribution of datasets** $( p(D) )$:

$$
\theta^* = \arg\min_{\theta} \mathbb{E}_{D \sim p(D)} [L(D; \theta)]
\tag{1}
$$
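In practice, the expectation in Eq. 1 can be approximated by averaging the loss over sampled episodes. The sketch below only illustrates that idea under stated assumptions: the toy task distribution (1-D linear regression with a random slope), the names `sample_dataset` and `episode_loss`, and the grid search over a single scalar parameter are all hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(rng, T=10):
    """Hypothetical p(D): y = a * x with a random slope a ~ N(2, 0.5^2)."""
    a = rng.normal(2.0, 0.5)
    x = rng.uniform(-1.0, 1.0, size=T)
    return x, a * x

def episode_loss(theta, dataset):
    """L(D; theta): mean squared error of the predictor y_hat = theta * x."""
    x, y = dataset
    return np.mean((theta * x - y) ** 2)

# Monte Carlo estimate of E_{D ~ p(D)}[L(D; theta)] over a fixed sample of episodes
datasets = [sample_dataset(rng) for _ in range(500)]
thetas = np.linspace(0.0, 4.0, 41)
expected_loss = [np.mean([episode_loss(th, d) for d in datasets]) for th in thetas]

# The grid point with the lowest average loss approximates theta*
theta_star = thetas[int(np.argmin(expected_loss))]
print(f"theta* is approximately {theta_star:.1f}")
```

A single fixed parameter can only capture what the episodes have in common (here, the average slope). Adapting to each individual episode is the part that the memory is meant to handle.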

To do so, it is essential to structure the task correctly. The authors use an episodic approach: each **episode (or task)** consists of the presentation of a dataset $( D = \{(x_t, y_t)\}_{t=1}^T )$, where:
- $( x_t )$ is the input (e.g. an image),
- $( y_t )$ is the label (classification) or the real value (regression).

However, the label $( y_t )$ is **not shown together with the corresponding input**. Instead, the model receives a time sequence like:

$$
(x_1, \varnothing), (x_2, y_1), (x_3, y_2), \dots, (x_T, y_{T-1})
$$

That is, at time $( t )$, the model sees $( x_t )$ and the label $( y_{t-1} )$, and must **predict $( y_t )$**.

> This means that the network must **memorize the data it saw previously** (for which it just received the label) and use it to build a map between input and label, to be used for the next examples.

Additionally:
- **Labels are shuffled** in each episode, to prevent the model from simply learning to associate classes with fixed symbols in the weights.
- The model must **bind input representations to labels dynamically**, using memory.
- At the first appearance of a class, the model can only guess. But in subsequent presentations, if it has memorized correctly, it can achieve **perfect accuracy**.

This structure forces the model to:
- **meta-learn** to build input-label links regardless of the specific content,
- generalize an **association method** that works on any new class or function, learned in a single episode.
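The episode structure described above (shuffled label assignment, shuffled presentation order, labels delayed by one step) might be built along these lines. This is a minimal sketch, not the authors' data pipeline: the function name `make_episode`, the toy 2-D inputs, and the use of an all-zero vector as the "null" first label are assumptions made for illustration.

```python
import numpy as np

def make_episode(class_samples, num_labels, rng):
    """Build one episode: inputs x_t paired with the previous step's one-hot label.

    class_samples: dict mapping a class name to an array of shape (n, d).
    Returns (inputs, previous_labels, targets).
    """
    names = list(class_samples)
    # Fresh random class -> label-index assignment, valid for this episode only
    label_ids = rng.permutation(num_labels)[: len(names)]
    binding = dict(zip(names, label_ids))

    xs = np.concatenate([class_samples[n] for n in names])
    ys = np.concatenate([[binding[n]] * len(class_samples[n]) for n in names])

    order = rng.permutation(len(xs))  # shuffle the presentation order
    xs, ys = xs[order], ys[order]

    one_hot = np.eye(num_labels)[ys]
    # Delay labels by one step; the first input comes with an all-zero "null" label
    prev = np.vstack([np.zeros(num_labels), one_hot[:-1]])
    return xs, prev, ys

rng = np.random.default_rng(0)
classes = {
    "A": np.array([[1.0, 1.0], [1.0, 2.0]]),
    "B": np.array([[5.0, 5.0], [5.0, 6.0]]),
    "C": np.array([[9.0, 9.0], [8.0, 9.0]]),
}
xs, prev, ys = make_episode(classes, num_labels=3, rng=rng)
model_input = np.hstack([xs, prev])  # what the network would see at each step
print(model_input.shape)  # (6, 5)
```

Because `binding` is redrawn on every call, the same class can receive a different label index in each episode, so the mapping cannot be stored in the weights.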

### Simulation of the Input Sequence for Meta-Learning

The basic idea of the input sequence used in the paper for meta-learning is as follows:

1. **Show the model a sequence of pairs**:
   $$
   (x_t, y_{t-1})
   $$
   - Where $(x_t)$ is the current input, and $(y_{t-1})$ is the true label of the previous input.

2. **Ask the model to predict**:
   $$
   y_t
   $$
   - The goal is for the model to predict the current label $(y_t)$ from $(x_t)$ and from the input-label bindings it has stored during earlier steps of the episode.

This approach helps the model learn patterns and dependencies in the sequence, which is a key concept in meta-learning.

- **Feature** → the info you give to the model (input)
- **Class** → what you want the model to predict (output)

In [2]:
import numpy as np

# Let's assume we have 3 classes: A, B, C
# Each input is a vector of 2 numbers, and each class has 2 examples
class_names = ['A', 'B', 'C']
inputs = np.array([
    [1, 1],  # class A
    [1, 2],  # class A
    [5, 5],  # class B
    [5, 6],  # class B
    [9, 9],  # class C
    [8, 9],  # class C
])

# Corresponding labels
labels = np.array(['A', 'A', 'B', 'B', 'C', 'C'])

# Shuffle the dataset and labels
perm = np.random.permutation(len(inputs))
inputs = inputs[perm]
labels = labels[perm]

# Now we simulate the temporal sequence (x_t, y_{t-1})
# Where the first y is None, as described in the paper
print("Input sequence (x_t, y_{t-1}):\n")
for t in range(len(inputs)):
    x_t = inputs[t]
    y_prev = labels[t-1] if t > 0 else None
    print(f"t={t} -> x_t: {x_t}, y_(t-1): {y_prev}")

# Model task: predict y_t for each x_t
print("\nTarget to predict (y_t):\n")
for t in range(len(inputs)):
    print(f"t={t} -> y_t: {labels[t]}")


Input sequence (x_t, y_{t-1}):

t=0 -> x_t: [8 9], y_(t-1): None
t=1 -> x_t: [5 6], y_(t-1): C
t=2 -> x_t: [5 5], y_(t-1): B
t=3 -> x_t: [9 9], y_(t-1): B
t=4 -> x_t: [1 1], y_(t-1): C
t=5 -> x_t: [1 2], y_(t-1): A

Target to predict (y_t):

t=0 -> y_t: C
t=1 -> y_t: B
t=2 -> y_t: B
t=3 -> y_t: C
t=4 -> y_t: A
t=5 -> y_t: A


- Each **x_t** is a set of **features** → e.g. [5, 6].
- Each **y_t** is a **class** → e.g. "B".

### Interpretation

- At first the model **knows nothing**: $( x_0 )$ is presented without a label → it has to **guess**.
- Then, at each step, it receives a new $( x_t )$ with the **label of the previous step** $( y_{t-1} )$ → it has to use **internal memory** to associate inputs with labels.

### **What really happens in this meta-learning task?**

1. We have an **episode**, i.e. a small learning "story", where:
   - Each input $( x_t )$ is a feature (e.g. image, vector),
   - Each label $( y_t )$ is a class associated with that input (e.g. "A", "B", "C").

2. The **classes change from episode to episode**: the network **cannot learn them in the weights** (they are shuffled on purpose each time).

3. The model sees a sequence like this:

$$
(x_1, \varnothing), (x_2, y_1), (x_3, y_2), \dots, (x_T, y_{T-1})
$$

Where at each step it **receives** the current data $( x_t )$ and the label of the previous $( y_{t-1} )$, but it has to **predict** the current label $( y_t )$.

4. The **real test** is: can the network *memorize on the fly* the pairings $( x \to y )$ and use them to recognize the same class when it sees it again?

For example, the network:
- At time `t=1`, receives `x_1` and no label → it has to **guess**.
- At time `t=2`, receives `x_2 = [1,1]` and `y_1="B"` → it has to predict `y_2="A"` → **it can't know, new class, guess again**.
- At time `t=3`, receives `x_3 = [1,2]` and `y_2="A"` → it can now bind `x_2 = [1,1]` to "A" and, because `x_3` resembles `x_2`, deduce "this looks like A" → **correct prediction**.
- And so on.

All this **without** memorizing A = class 0, B = class 1, etc., because those labels change every episode.
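To make the walkthrough concrete, here is a small sketch of an idealized "memorizer" processing such a sequence. It is not a MANN: a plain nearest-neighbour lookup over stored pairs stands in for the learned memory, and the toy episode is made up for illustration.

```python
import numpy as np

# Fixed toy episode; labels are arbitrary symbols valid for this episode only
xs = np.array([[5.0, 5.0], [1.0, 1.0], [1.0, 2.0], [5.0, 6.0], [9.0, 9.0], [8.0, 9.0]])
ys = ["B", "A", "A", "B", "C", "C"]

memory_x, memory_y = [], []  # stand-in for external memory: stored (input, label) pairs
predictions = []
for t, x_t in enumerate(xs):
    # The label of the previous input arrives now, so bind it to x_{t-1}
    if t > 0:
        memory_x.append(xs[t - 1])
        memory_y.append(ys[t - 1])
    if memory_x:
        # Recall the label of the closest stored input
        dists = np.linalg.norm(np.array(memory_x) - x_t, axis=1)
        predictions.append(memory_y[int(np.argmin(dists))])
    else:
        predictions.append(None)  # nothing stored yet: pure guess

for t, (pred, target) in enumerate(zip(predictions, ys), start=1):
    print(f"t={t}: predicted={pred}, target={target}")
```

The first appearance of each class is mispredicted (or unanswerable), while the second appearance is recalled correctly, which is the pattern the episodic setup is designed to reward.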



### **Memory-Augmented Model – Summary**
#### **Neural Turing Machines (NTM)**

**Neural Turing Machines (NTM)** are a **differentiable realization of MANN** (Memory-Augmented Neural Network). They are composed of:
- A **controller** (a feed-forward network or an LSTM),
- An **external memory module**,
- Several **read/write heads** that interact with the memory.

---

### **What do NTMs do?**

- They allow **writing and reading vectors from external memory** at each time-step.
- They are capable of both **short-term memory** (thanks to external memory) and **long-term memory** (via gradient-updated weights).
- This makes them **well suited to meta-learning**, where you need to:
  - Learn **quickly** (just one exposure),
  - And maintain **general structures** over time.

---

### **How does memory access work?**

When the controller receives an input $( x_t )$, it generates a **key** $( k_t )$, which is used to:
- **Write** to memory $( M_t )$,
- Or **read** from it.

#### **Equation 2 – Cosine Similarity**

To read, we compute the **cosine similarity** between the key $( k_t )$ and each row of memory $( M_t(i) )$:

$$
K(k_t, M_t(i)) = \frac{k_t \cdot M_t(i)}{\|k_t\| \cdot \|M_t(i)\|}
\tag{2}
$$

#### **Equation 3 – Read Weights (softmax)**

This similarity is then normalized with softmax to obtain the **read weights vector** $( w^r_t(i) )$:

$$
w^r_t(i) = \frac{\exp(K(k_t, M_t(i)))}{\sum_j \exp(K(k_t, M_t(j)))}
\tag{3}
$$

#### **Equation 4 – Read from Memory**

The **read** from memory is done by making a **weighted average** of the memory rows with the weights just calculated:

$$
r_t = \sum_i w^r_t(i) \cdot M_t(i)
\tag{4}
$$

### **How is the read vector used?**

The read vector $( r_t )$ is:
- Sent to the **classifier** (e.g. a softmax layer),
- Used as **additional input** for the next state of the controller.

This creates a **continuous loop** of write-read-adapt, which allows the network to:
- Store new data **instantly**,
- Reuse it in real time to make **accurate predictions** even with just one example.
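The two uses of the read vector listed above could look roughly like this. It is only a sketch: the layer sizes, the random weights, and the choice to concatenate the controller state with $( r_t )$ are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, mem_dim, num_classes = 4, 3, 3

# Hypothetical, randomly initialised output weights (a trained model would learn these)
W_out = rng.normal(size=(num_classes, hidden_dim + mem_dim))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

h_t = rng.normal(size=hidden_dim)   # controller state (stand-in values)
r_t = np.array([0.4, 0.15, 0.0])    # read vector retrieved from memory (stand-in values)

# 1) The classifier sees the controller state together with what was read from memory
class_probs = softmax(W_out @ np.concatenate([h_t, r_t]))

# 2) The same read vector is appended to the controller's next input
x_next = rng.normal(size=2)
controller_input_next = np.concatenate([x_next, r_t])

print(class_probs)
print(controller_input_next.shape)
```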


In [2]:
import numpy as np
import pandas as pd

# ================================
# Minimalistic NTM-like simulation
# ================================

# Parameters
dim_feature = 3  # Size of x_t vectors
mem_slots = 5    # Number of memory rows
mem_dim = dim_feature  # Size of memory rows (same as x_t)

# Initialize memory to zeros
M = np.zeros((mem_slots, mem_dim))

# Cosine similarity function
def cosine_similarity(k, m):
    norm_k = np.linalg.norm(k)
    norm_m = np.linalg.norm(m, axis=1)
    dot = m @ k
    return dot / (norm_k * norm_m + 1e-8)  # Avoid division by zero

# Softmax function
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# Input simulation (3 examples)
x_samples = np.array([
    [1.0, 0.0, 0.0],  # Class A
    [0.0, 1.0, 0.0],  # Class B
    [1.0, 0.0, 0.0],  # Same class A (expected recognition)
])

# Step 1: Write the first two examples into memory
M[0] = x_samples[0]  # Write the first vector
M[1] = x_samples[1]  # Write the second vector

# Step 2: Read the third input (should find the first slot)
query = x_samples[2]  # New input similar to the first
similarities = cosine_similarity(query, M)  # Compute similarity with memory
w_read = softmax(similarities)  # Compute read weights
r_t = np.sum(w_read[:, None] * M, axis=0)  # Perform weighted read

# Display memory, query, and read result using pandas
df = pd.DataFrame(M, columns=["dim1", "dim2", "dim3"])
df["read_weight"] = w_read

# Show the DataFrame
print("Simulated NTM Memory (Memory Slots and Read Weights):\n")
print(df)

# Output the read result
print("\nRead Result (r_t):\n", r_t)


Simulated NTM Memory (Memory Slots and Read Weights):

   dim1  dim2  dim3  read_weight
0   1.0   0.0   0.0     0.404610
1   0.0   1.0   0.0     0.148848
2   0.0   0.0   0.0     0.148848
3   0.0   0.0   0.0     0.148848
4   0.0   0.0   0.0     0.148848

Read Result (r_t):
 [0.40460967 0.14884758 0.        ]
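In the output above, the exactly matching slot receives a read weight of only about 0.40, because cosine similarity is bounded in [-1, 1] and the softmax therefore stays fairly flat. The original NTM formulation scales the similarity by a key strength before the softmax. Eq. 3 as summarized here has no such factor, so the `beta` below is an assumption added purely to illustrate how sharpening would change the read.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# Cosine similarities from the example above: one exact match, four non-matches
similarities = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

for beta in (1.0, 5.0, 20.0):  # beta is a hypothetical sharpening factor
    w_read = softmax(beta * similarities)
    print(f"beta={beta:>4}: weight on matching slot = {w_read[0]:.3f}")
```

Larger values of `beta` concentrate the read on the best-matching row, so the read vector gets closer to the stored vector itself.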


### **"Explanation for Dummies"**
A **Neural Turing Machine (NTM)** is a **neural network** that uses an **external memory (a kind of notebook)** to quickly **remember information**.

This memory is a table made up of **rows that store numerical data (called "features")**. When the **model** receives **new input**, it **compares this input to each row in the memory** using a measure called "**cosine similarity**," which indicates **how similar two vectors are**.

**After calculating the similarity**, the model assigns an **importance weight to each row (softmax)**. The **most similar** row will **have a higher weight**. Finally, the model combines all the weighted rows together to produce a final reading from the memory.

In essence, the **NTM can learn quickly from just a few examples** because it **can immediately remember what it sees using the external memory**.

But, having **limited memory (few slots)**, writing a new feature into memory typically means overwriting an existing row, so there is inevitably a risk of **"catastrophic forgetting"**.

To avoid this phenomenon, **Memory-Augmented Neural Network models (such as NTM)** adopt more advanced strategies, for example:

- They write to the rows that have been used least recently **(Least Recently Used - LRU)**.
- They **write to memory locations** chosen through **more sophisticated strategies** (such as gating mechanisms or adaptive writing), which **avoid simple direct replacement**.
- They **combine external memory** (fast and dynamic) with **internal memory** (parameters updated slowly via backpropagation) to **balance short-term and long-term memory**.

In other words, a **real NTM avoids directly and brutally overwriting the most relevant information**, precisely to avoid falling into catastrophic forgetting.

### **Least Recently Used Access (LRUA)**

Previous **Neural Turing Machine (NTM)** models used **two modes to read and write to memory**:

1. **Content-based**: Search for similarity of content.
2. **Location-based**: Access based on location, similar to scrolling on a "tape".

The **location-based mode was useful in sequential tasks** (predicting sequences), but it is **not ideal for tasks that depend on a conjunctive coding of information**, regardless of the order in which it arrives.

The authors therefore introduce **Least Recently Used Access (LRUA)**, a **purely content-based memory writer**.

How it works:
- **Write** to the **least used slot** in memory, thus **preserving recent information already stored**.
- **Update** the most **recently used slot**, potentially **overwriting older information** with more relevant information.



**What is interpolation?**
**Interpolation** is simply a way of **combining two values with a certain weight**.
In practice:
- You **don't choose** option 1 completely or option 2 completely
- You **take a little bit** of option 1 and a little bit of option 2, **in a certain proportion**

In **mathematical terms**
**Interpolation** can be expressed as:

$$\text{result} = \alpha \times \text{option}_1 + (1-\alpha) \times \text{option}_2$$

Where **α (alpha) is a value between 0 and 1** that determines **how much weight to give to option 1 versus option 2**.
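A tiny numeric illustration of this formula follows. The helper name `interpolate` and the example values are made up for this sketch.

```python
def interpolate(option1, option2, alpha):
    """Convex combination: alpha weights option1, (1 - alpha) weights option2."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * option1 + (1 - alpha) * option2

# alpha = 0 returns option2, alpha = 1 returns option1, values in between blend the two
for alpha in (0.0, 0.3, 1.0):
    print(alpha, interpolate(10.0, 20.0, alpha))
```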

In the case of the **LRUA**, the **formula** is of the form:

$$\text{write weights} = g \times \text{most recently used slot weights} + (1-g) \times \text{least used slot weights}$$

Where $g = \sigma(\alpha)$ is the gate value, between 0 and 1:

- If $g = 1$, write only to the **most recently used slot**
- If $g = 0$, write only to the **least used slot**
- If $g = 0.3$, **give 30%** importance to the most recently used slot and **70%** to the least used slot

The parameter α behind the gate is **learned during training**, together with the rest of the network.

**Interpolation** here blends two write strategies:

- **Use** the **least recently used slot**
- **Update** the **most recently used slot**

The result is a set of soft write weights, so in general the new information is added to both slots in proportion to those weights. When the gate is close to 0 or 1, almost everything goes to a single slot: in that limit it is like deciding which post-it to write on, rather than splitting the message between multiple post-its.

## **Mathematical formulas used**

**1. Calculate memory usage weights**:

At each time step, update the memory "usage weights" $( w_t^u )$, which track how often a slot has been used recently:

$$
w_t^u = \gamma w_{t-1}^u + w_t^r + w_t^w
\tag{5}
$$

Where:
- $( \gamma )$ is a decay parameter,
- $( w_t^r )$ are the read weights,
- $( w_t^w )$ are the write weights.
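A minimal sketch of the update in Eq. 5 is shown below. The decay value `gamma=0.95` and the read and write weight vectors are assumptions chosen for illustration.

```python
import numpy as np

def update_usage(w_u_prev, w_r, w_w, gamma=0.95):
    """Eq. 5: decay the old usage, then add the current read and write weights."""
    return gamma * w_u_prev + w_r + w_w

w_u = np.zeros(4)                      # four memory slots, all unused at the start
w_r = np.array([0.7, 0.1, 0.1, 0.1])   # reads mostly hit slot 0
w_w = np.array([0.0, 1.0, 0.0, 0.0])   # writes go to slot 1

for step in range(3):
    w_u = update_usage(w_u, w_r, w_w)
    print(step, np.round(w_u, 3))
```

Slots that keep being read or written accumulate usage, while the usage of untouched slots only decays, which is what later lets the model find the least used slot.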

---

**2. Identify the least used slots**:

To identify the least used slots, select the ones with the lowest usage weights. The function $(m(v,n))$ indicates the nth smallest value of the vector $(v)$:

$$
w_t^{lu}(i) =
\begin{cases}
0, & \text{if } w_t^u(i) > m(w_t^u, n) \\[6pt]
1, & \text{if } w_t^u(i) \leq m(w_t^u, n)
\end{cases}
\tag{6}
$$

The parameter $( n )$ represents the number of reads from memory.
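Eq. 6 can be sketched in a few lines. The function name `least_used_weights` and the usage values are illustrative assumptions.

```python
import numpy as np

def least_used_weights(w_u, n):
    """Eq. 6: 1 for slots whose usage is <= the n-th smallest usage, else 0."""
    nth_smallest = np.sort(w_u)[n - 1]  # m(w_u, n)
    return (w_u <= nth_smallest).astype(float)

w_u = np.array([0.9, 0.2, 0.5, 0.1])
print(least_used_weights(w_u, n=1))  # marks slot 3
print(least_used_weights(w_u, n=2))  # marks slots 1 and 3
```

Note that with tied usage values this definition can mark more than `n` slots.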

---

**3. Final calculation of write weights**:

The write weights $(w_t^w)$ are calculated as a convex combination of the previous read weights and the previously calculated least used weights, using a learnable sigmoidal "gate":

$$
w_t^w = \sigma(\alpha) w_{t-1}^r + \bigl(1 - \sigma(\alpha)\bigr) w_{t-1}^{lu}
\tag{7}
$$

Where:
- $(\sigma(\alpha) = \frac{1}{1 + e^{-\alpha}})$ is a sigmoidal function,
- $(\alpha)$ is a learnable scalar parameter (gate).
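The gated combination in Eq. 7 can be sketched as follows. The example weight vectors and the values of `alpha` are made up, and in a real model `alpha` would be learned rather than set by hand.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def write_weights(w_r_prev, w_lu_prev, alpha):
    """Eq. 7: gate between the previous read weights and the least-used weights."""
    g = sigmoid(alpha)
    return g * w_r_prev + (1.0 - g) * w_lu_prev

w_r_prev = np.array([0.8, 0.1, 0.1, 0.0])   # slot 0 was read most strongly at the last step
w_lu_prev = np.array([0.0, 0.0, 0.0, 1.0])  # slot 3 is the least used

for alpha in (-4.0, 0.0, 4.0):
    print(alpha, np.round(write_weights(w_r_prev, w_lu_prev, alpha), 3))
```

A strongly negative `alpha` sends the write to the least used slot, a strongly positive one sends it to the recently read slots, and `alpha = 0` splits it evenly.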

---

**4. Writing to memory**:

Before writing, the least used slot is zeroed. Then, the memory is updated with the calculated weights:

$$
M_t(i) = M_{t-1}(i) + w_t^w(i) k_t
\tag{8}
$$

Then, the new vector (feature) can be written to the newly emptied slot (the least used one), or the most recently used slot can be updated. If we choose the latter option, the least used memory is simply erased.
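The write step (zeroing followed by Eq. 8) might be sketched like this. For simplicity the sketch assumes a single least-used slot, found with `np.argmin`; the function name `write_memory` and all values are illustrative.

```python
import numpy as np

def write_memory(M_prev, w_u_prev, w_w, k_t):
    """Zero the least-used row, then apply Eq. 8: M_t(i) = M_{t-1}(i) + w_w(i) * k_t."""
    M = M_prev.copy()
    M[np.argmin(w_u_prev)] = 0.0    # erase the least-used slot first
    return M + np.outer(w_w, k_t)   # additive, weighted write of the key

M_prev = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.3, 0.3, 0.3]])
w_u_prev = np.array([0.9, 0.6, 0.1])  # slot 2 is the least used
k_t = np.array([0.0, 0.0, 1.0])

# Write entirely into the freed slot ...
print(write_memory(M_prev, w_u_prev, np.array([0.0, 0.0, 1.0]), k_t))
# ... or entirely onto slot 0 (an update); slot 2 is still erased
print(write_memory(M_prev, w_u_prev, np.array([1.0, 0.0, 0.0]), k_t))
```

The second call shows the case described above: when the write goes to the recently used slot, the least used row is simply cleared.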

### **Explanation for Dummies**

- **External memory** is a table with slots (rows), each containing numerical data (the "features").

- The **Least Recently Used Access (LRUA)** module works like this:
  - When new information (new "features") arrives to be stored, LRUA must decide **where to write it**.
  - To do this, it uses an interpolation (a weighted combination) between:
    1. The memory slot that has been used least recently (so that recent information is not erased immediately).
    2. The memory slot used most recently (to possibly update information already stored with more relevant data).

- This choice is made via a "gate" parameter, which is automatically learned by the neural network itself during training.

Then, the new information is written **either in the least used slot** (thus protecting recent memory), **or in the most recently used slot**, overwriting and updating that information.

In practice, LRUA continuously balances the memory between new information and updating the information already present.

In [3]:
import numpy as np
import pandas as pd

# ===================================
# Simplified LRUA Practical Example
# ===================================

# Memory: 3 slots, each with 3 features
mem_slots = 3
mem_dim = 3

# Initialize random memory for simulation
M = np.random.rand(mem_slots, mem_dim)

# Initial usage weights (simulate how much each slot is used)
usage_weights = np.array([0.9, 0.2, 0.5])  # slot 0 heavily used, slot 1 rarely used

# New feature to store
new_feature = np.array([0.1, 0.7, 0.3])

# Least recently used and most recently used slots
least_used_slot = np.argmin(usage_weights)   # least used (slot with the lowest weight)
most_used_slot = np.argmax(usage_weights)    # most used (slot with the highest weight)

# Sigmoid gate (learned by the model, simulated here)
alpha = 0.4
gate = 1 / (1 + np.exp(-alpha))

# Gate-controlled choice between the two candidate slots.
# Slot indices are labels, not quantities, so they are compared rather than averaged.
# Note: Eq. 7 blends weight vectors softly; this toy example makes a hard choice instead.
# if gate >= 0.5 -> update the most used slot, otherwise write to the least used slot
selected_slot = most_used_slot if gate >= 0.5 else least_used_slot

# Write the new feature into memory
M[selected_slot] = new_feature

# Update the usage weight
usage_weights[selected_slot] = 1.0  # just used, so maximum usage

# Final memory visualization
df_mem = pd.DataFrame(M, columns=["Feature1", "Feature2", "Feature3"])
df_mem["Usage Weights"] = usage_weights
df_mem["Slot Status"] = ["Most Used" if i == most_used_slot else "Least Used" if i == least_used_slot else "Intermediate" for i in range(mem_slots)]

df_mem

| Slot | Feature1 | Feature2 | Feature3 | Usage Weights | Slot Status |
|---|---|---|---|---|---|
| 0 | 0.1 | 0.7 | 0.3 | 1.0 | Most Used |
| 1 | 0.86854 | 0.806575 | 0.660996 | 0.2 | Least Used |
| 2 | 0.639077 | 0.450875 | 0.52584 | 0.5 | Intermediate |


- We have **3 slots** of memory, each with 3 features.
- A **new feature** has been stored.
- The gate value (σ(0.4) ≈ 0.6) favored updating, so the **most recently used slot** has been selected and overwritten with the new feature.

**Final memory result:**

- **Slot 0 ("Most Used")**: updated with the new feature `[0.1, 0.7, 0.3]`, maximum usage weight (1.0).
- **Slot 1 ("Least Used")**: unchanged, lowest usage (0.2).
- **Slot 2 ("Intermediate")**: unchanged, medium usage (0.5).

According to the paper's results, the **Memory-Augmented Neural Network (MANN)** model with the **Least Recently Used Access (LRUA)** module is highly effective for rapid learning tasks (**one-shot learning**):

- It **learns new classes quickly** and with high accuracy.
- In an informal comparison, it outperforms human participants.
- It outperforms conventional models even with complex labels and many classes per episode.

## Key findings of the study:
- Many real-world **problems** require **fast learning** capabilities based on **few examples**. These problems are a challenge for classical deep learning, which usually learns slowly through gradual updates.
- This study approaches the problem using the concept of **meta-learning** ("learning to learn"): gradual and general learning across different tasks ("background knowledge"), combined with a flexible and fast memory to remember specific information from new tasks.
- The main innovation proposed is a particular type of neural network with external memory (**Memory-Augmented Neural Network - MANN**) that is particularly effective for meta-learning. This memory is **separate from the controller network that drives reading and writing**.

### Results obtained:
- The proposed MANN clearly outperformed a traditional **LSTM** in both **classification** and **regression** tasks based on few examples.
- The tasks studied require not only **remembering information**, but also **generalizing** (transferring previous knowledge to new examples), a skill called **"inductive transfer"**.
- MANNs were **very well suited** for these tasks thanks to the combination of a flexible **external memory** and the powerful learning capabilities of deep neural networks.

### Comparison with human learning:
- **Meta-learning** is considered a key component of human intelligence.
- In an informal comparison with human subjects, the **MANN** showed superior performance, even with amounts of information that could easily be managed by human working memory.
- However, when the memory was not emptied between different tasks (episodes), the MANN showed phenomena of **proactive interference**, similar to those observed in human memory (difficulty in remembering new information due to old information).

## Short final summary:
- This study confirms that **MANNs** are effective for **meta-learning** with few examples.
- They show great **generalization** ability and can represent a **valid model for human learning**.
- Interesting problems remain open, such as further optimizing **memory strategies**, exploring a greater variety of tasks and addressing the challenge of active learning.