# “Long Short-Term Memory-Networks for Machine Reading”  
*Cheng, Dong, and Lapata*

Source: https://arxiv.org/pdf/1601.06733

---

## Abstract
The paper introduces Long Short-Term Memory-Networks (LSTMN), a reading simulator that enhances standard LSTMs by replacing the single memory cell with a memory network and incorporating neural attention. This design enables adaptive memory usage, weak induction of token relations, and improved handling of structured input. LSTMN is validated on language modeling, sentiment analysis, and natural language inference, achieving results on par with or surpassing state-of-the-art baselines.

---

## Problems
1. Vanishing/Exploding Gradients – Training RNNs over long sequences remains difficult despite gating mechanisms.  
2. Memory Compression – Standard LSTMs compress entire sequences into a single dense vector, limiting representation for long sequences.  
3. Lack of Structural Awareness – Sequence-level models fail to capture relations and latent structures inherent in natural language.

---

## Proposed Solutions
- Dedicated Memory Slots: Each token is stored in its own memory slot rather than being compressed into one vector.  
- Intra-Attention: Within recurrence, the model links current tokens with relevant past tokens.  
- Multi-Sequence Extension:  
  - Shallow Fusion: Intra-attention within encoder/decoder plus standard inter-attention.  
  - Deep Fusion: Inter-alignment vectors recurrently stored in target memory.

---

## Purpose
To build a general-purpose machine reading simulator that:  
- Incrementally processes text like humans.  
- Memorizes longer contexts.  
- Discovers latent relations.  
- Reasons over shallow structures without explicit supervision.  

---

## Methodology
- Architecture: Replace the LSTM memory cell with a memory tape; use intra-attention to compute token-to-token relations.  

$$ h_t = f\left(W_h h_{t-1} + W_x x_t + b\right) $$

The plain recurrence above conditions only on $h_{t-1}$; LSTMN instead feeds the gates an attention-weighted summary $\tilde{h}_t$ of all earlier hidden states.

- Sequence-to-Sequence Extension:  
  - Shallow Fusion: intra-attention plus inter-attention.  
  - Deep Fusion: store inter-alignment vectors in target memory.  

- Tasks and Datasets:  
  - Language Modeling: Penn Treebank.  
  - Sentiment Analysis: Stanford Sentiment Treebank.  
  - Natural Language Inference: Stanford NLI dataset.  

---

## Results
- Language Modeling:  
  - LSTMN achieved the lowest perplexity among the compared models at each depth.  
  - Single-layer: PPL = 108 (vs LSTM = 115).  
  - Three-layer: PPL = 102 (vs Gated-Feedback LSTM = 107).  

- Sentiment Analysis:  
  - Fine-grained: 47.9% (vs LSTM = 46.4%).  
  - Binary: 87.0% (vs LSTM = 84.9%).  
  - Comparable to strong CNN models.  

- Natural Language Inference:  
  - LSTMN with deep fusion achieved 86.3% accuracy, surpassing LSTM attention-based models.  

---

## Conclusions
- LSTMN addresses memory compression and structural reasoning limitations of LSTMs.  
- The intra-attention mechanism induces soft, undirected lexical relations, improving text comprehension.  
- The approach of integrating memory and attention is general and extendable beyond LSTMs.  

Future Directions:  
- Structured parsing.  
- Relation extraction.  
- Discovering compositionality with weak supervision.  

---


# Mathematical and Statistical Content Summary  
*“Long Short-Term Memory-Networks for Machine Reading” — Cheng, Dong, and Lapata*

---

## 1. Standard LSTM Equations

The LSTM updates its memory and hidden states using gates:

$$
\begin{bmatrix}
i_t \\ f_t \\ o_t \\ \hat{c}_t
\end{bmatrix}
=
\begin{bmatrix}
\sigma \\ \sigma \\ \sigma \\ \tanh
\end{bmatrix}
W \cdot [h_{t-1}, x_t]
$$

- $i_t$ → input gate (controls how much new information enters memory).  
- $f_t$ → forget gate (controls how much past memory is erased).  
- $o_t$ → output gate (controls how much memory is revealed).  
- $\hat{c}_t$ → candidate memory update.  

Memory and hidden state updates:

$$
c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t
$$

$$
h_t = o_t \odot \tanh(c_t)
$$

This mechanism balances remembering, forgetting, and exposing information.
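
To make the gating concrete, here is a minimal Python sketch that applies the update equations element-wise. It assumes the gate pre-activations (the rows of $W \cdot [h_{t-1}, x_t]$) are already computed. The function name `lstm_step` and the example numbers are illustrative and do not come from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(i_pre, f_pre, o_pre, c_hat_pre, c_prev):
    """One element-wise LSTM update from gate pre-activations (assumed given)."""
    i_t = [sigmoid(z) for z in i_pre]            # input gate
    f_t = [sigmoid(z) for z in f_pre]            # forget gate
    o_t = [sigmoid(z) for z in o_pre]            # output gate
    c_hat = [math.tanh(z) for z in c_hat_pre]    # candidate memory
    # c_t = f_t * c_{t-1} + i_t * c_hat_t
    c_t = [f * c + i * ch for f, c, i, ch in zip(f_t, c_prev, i_t, c_hat)]
    # h_t = o_t * tanh(c_t)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Forget gate saturated near 1, input gate near 0:
# the previous memory value is carried over almost unchanged.
h_t, c_t = lstm_step(i_pre=[-20.0], f_pre=[20.0], o_pre=[20.0],
                     c_hat_pre=[3.0], c_prev=[0.5])
```

With these hypothetical inputs `c_t` stays very close to the previous value 0.5, which is the "remembering" case; swapping the signs of `i_pre` and `f_pre` would instead overwrite the memory with the candidate.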

---

## 2. LSTMN Intra-Attention Equations

LSTMN replaces a single compressed memory with multiple slots and uses attention to decide which past tokens matter.

Attention score for past token $i$:

$$
a_{it} = v^T \tanh(W_h h_i + W_x x_t + W_{\tilde{h}} \tilde{h}_{t-1})
$$

Softmax normalization:

$$
s_{it} = \text{softmax}(a_{it})
$$

Weighted memory and hidden summaries:

$$
\begin{bmatrix}
\tilde{h}_t \\ \tilde{c}_t
\end{bmatrix}
= \sum_{i=1}^{t-1} s_{it}
\begin{bmatrix}
h_i \\ c_i
\end{bmatrix}
$$

Updated gates:

$$
\begin{bmatrix}
i_t \\ f_t \\ o_t \\ \hat{c}_t
\end{bmatrix}
=
\begin{bmatrix}
\sigma \\ \sigma \\ \sigma \\ \tanh
\end{bmatrix}
W \cdot [\tilde{h}_t, x_t]
$$

State updates:

$$
c_t = f_t \odot \tilde{c}_t + i_t \odot \hat{c}_t
$$

$$
h_t = o_t \odot \tanh(c_t)
$$

This allows the model to reference multiple past states instead of just the previous one.
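
The summary step can be sketched in plain Python as follows. This is a simplified illustration that assumes the raw scores $a_{it}$ have already been computed; `attend` and the toy tapes are hypothetical names and values, not the authors' code.

```python
import math

def softmax(scores):
    m = max(scores)                               # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, hidden_tape, memory_tape):
    """Return attention weights s_it and the summaries (h_tilde, c_tilde)."""
    weights = softmax(scores)
    dim = len(hidden_tape[0])
    h_tilde = [sum(w * h[d] for w, h in zip(weights, hidden_tape)) for d in range(dim)]
    c_tilde = [sum(w * c[d] for w, c in zip(weights, memory_tape)) for d in range(dim)]
    return weights, h_tilde, c_tilde

# Two past tokens with equal scores: each slot contributes half.
weights, h_tilde, c_tilde = attend(
    scores=[0.0, 0.0],
    hidden_tape=[[1.0, 0.0], [0.0, 1.0]],
    memory_tape=[[2.0, 0.0], [0.0, 4.0]],
)
```

Because the weights are a softmax over all earlier slots, raising one score shifts both summaries toward that token's hidden and memory vectors.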

---

## 3. Multi-layer (Stacked) Attention

For stacked layers:

$$
a_{it}^{k+1} = v^T \tanh(W_h h_i^{k+1} + W_l h_t^k + W_{\tilde{h}} \tilde{h}_{t-1}^{k+1})
$$

This lets higher layers refine attention using deeper representations.

---

## 4. Inter-Attention (Two Sequences)

For encoder-decoder tasks:

$$
b_{jt} = u^T \tanh(W_\gamma \gamma_j + W_x x_t + W_{\tilde{\gamma}} \tilde{\gamma}_{t-1})
$$

$$
p_{jt} = \text{softmax}(b_{jt})
$$

$$
\begin{bmatrix}
\tilde{\gamma}_t \\ \tilde{\alpha}_t
\end{bmatrix}
=
\sum_{j=1}^{m} p_{jt}
\begin{bmatrix}
\gamma_j \\ \alpha_j
\end{bmatrix}
$$

This aligns decoder tokens with encoder tokens.
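
As a rough illustration of the additive scoring used for alignment, the sketch below evaluates $b_{jt}$ for two source slots and normalizes the scores. The identity weight matrices, the vector `u`, and all helper names are assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def additive_score(u, W_g, gamma_j, W_x, x_t, W_prev, gamma_prev):
    """b_jt = u^T tanh(W_g gamma_j + W_x x_t + W_prev gamma_prev)."""
    pre = [a + b + c for a, b, c in zip(matvec(W_g, gamma_j),
                                        matvec(W_x, x_t),
                                        matvec(W_prev, gamma_prev))]
    return sum(ui * math.tanh(p) for ui, p in zip(u, pre))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

identity = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical weights
u = [1.0, 1.0]
x_t = [0.0, 0.0]                          # current target input
gamma_prev = [0.0, 0.0]                   # previous inter-attention summary
source_hidden = [[1.0, 0.0], [0.0, 0.0]]  # encoder hidden slots gamma_1, gamma_2

scores = [additive_score(u, identity, g, identity, x_t, identity, gamma_prev)
          for g in source_hidden]
align_weights = softmax(scores)           # p_jt over the source positions
```

Here the first source slot receives the larger weight because its score is $\tanh(1)$ while the second slot scores 0.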

---

## 5. Fusion Mechanism

Deep fusion integrates inter-attention with intra-sequence updates:

$$
r_t = \sigma(W_r \cdot [\tilde{\gamma}_t, x_t])
$$

$$
c_t = r_t \odot \tilde{\alpha}_t + f_t \odot \tilde{c}_t + i_t \odot \hat{c}_t
$$

$$
h_t = o_t \odot \tanh(c_t)
$$

This merges encoder–decoder alignment with intra-sequence reasoning.
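
The deep-fusion update can be traced with a small numeric sketch. It assumes the gates $f_t$, $i_t$, $o_t$, the candidate $\hat{c}_t$, and both attention summaries are already available; `deep_fusion_update` and the example values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def deep_fusion_update(r_pre, f_t, i_t, o_t, c_hat, c_tilde, alpha_tilde):
    """Element-wise deep-fusion memory and hidden update.

    r_pre stands in for W_r . [gamma_tilde_t, x_t] before the sigmoid.
    """
    r_t = [sigmoid(z) for z in r_pre]
    # c_t = r_t * alpha_tilde_t + f_t * c_tilde_t + i_t * c_hat_t
    c_t = [r * a + f * c + i * ch
           for r, a, f, c, i, ch in zip(r_t, alpha_tilde, f_t, c_tilde, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return r_t, c_t, h_t

r_t, c_t, h_t = deep_fusion_update(
    r_pre=[0.0], f_t=[0.5], i_t=[0.5], o_t=[1.0],
    c_hat=[0.2], c_tilde=[0.4], alpha_tilde=[1.0],
)
# r_t = 0.5, so c_t = 0.5*1.0 + 0.5*0.4 + 0.5*0.2 = 0.8
```

The extra term $r_t \odot \tilde{\alpha}_t$ is the only difference from the intra-attention update: setting the gate to zero recovers the single-sequence LSTMN equations.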

---

## 6. Evaluation Metrics

**Perplexity (PPL):**

$$
PPL = \exp\left(\frac{NLL}{T}\right)
$$

- $NLL$ = negative log-likelihood of test set.  
- $T$ = number of tokens.  
- Lower perplexity = better predictive performance.
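
A short worked example may help here. The sketch below computes perplexity from the probabilities a model assigns to each test token, assuming natural logarithms for the NLL; the probability values are made up for illustration.

```python
import math

def perplexity(token_probs):
    """exp(NLL / T), where NLL is the summed negative natural-log likelihood."""
    nll = -sum(math.log(p) for p in token_probs)
    return math.exp(nll / len(token_probs))

# A model that is uniformly unsure among 4 choices at every step
# has perplexity 4: it is "as confused as" a 4-way coin.
ppl_uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that assigns high probability to the observed tokens scores lower.
ppl_confident = perplexity([0.9, 0.8, 0.95])
```

Read this way, a perplexity of 102 means the model is on average about as uncertain as a uniform choice among 102 words.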

---

## 7. Optimization Methods

- **Stochastic Gradient Descent (SGD):** with learning rate decay and gradient clipping.  
- **Adam Optimizer:** adaptive, momentum-based optimization.  
- **Dropout:** regularization to reduce overfitting.
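
The SGD ingredients listed above can be sketched in a few lines of plain Python. The clipping threshold, learning rate, and decay factor below are hypothetical values for illustration, not the paper's hyperparameters, and norm-based clipping is assumed as the clipping variant.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def sgd_step(params, grads, lr, max_norm):
    """One SGD update using clipped gradients."""
    clipped = clip_by_global_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)]

def decayed_lr(initial_lr, decay, epoch):
    """Exponential learning-rate decay (one of several possible schedules)."""
    return initial_lr * decay ** epoch

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # norm 5 -> rescaled to norm 1
new_params = sgd_step([1.0, 1.0], [3.0, 4.0], lr=0.5, max_norm=1.0)
lr_epoch2 = decayed_lr(0.5, 0.5, 2)
```

Clipping keeps the direction of the gradient and only shortens it, which is why it is a common guard against exploding gradients in recurrent models.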

---

## 8. Statistical Results

- **Penn Treebank (Language Modeling):**  
  Single-layer: LSTM PPL = 115 vs LSTMN PPL = 108. Three-layer: Gated-Feedback LSTM PPL = 107 vs LSTMN PPL = 102.  

- **Sentiment Analysis (SST):**  
  Fine-grained accuracy 47.9% (vs LSTM 46.4%) and binary accuracy 87.0% (vs LSTM 84.9%), comparable to strong CNN models.  

- **Natural Language Inference (SNLI):**  
  LSTMN deep fusion → 86.3% accuracy, competitive with best attention-based models.  

---

## Final Takeaway

The mathematical foundation of the paper is built on:  

- Standard LSTM recurrence equations.  
- Intra-attention for linking tokens within a sequence.  
- Inter-attention for aligning across sequences.  
- Fusion mechanisms for combining the two attentions.  
- Perplexity for language modeling, and accuracy for sentiment analysis and natural language inference.  
- Optimization with SGD/Adam and regularization via dropout.  

Together, these methods enable LSTMN to improve memory usage, capture structure, and match or surpass state-of-the-art baselines on the three evaluated tasks.


# Full System Flow

```
Input Sequence x1...xn
        │
        ▼
 ┌─────────────┐
 │  LSTMN Core │   ← stores token-wise memory slots
 └─────────────┘
        │
        ▼
 ┌─────────────┐
 │ Intra-Attn  │   ← finds relations among tokens
 └─────────────┘
        │
        ▼
 ┌─────────────┐
 │ Encoder     │
 └─────────────┘
        │
        ▼
 ┌─────────────┐
 │ Decoder     │
 │  - Intra-attn over target
 │  - Inter-attn over source
 │  - Fusion (shallow or deep)
 └─────────────┘
        │
        ▼
  Output Sequence (translation, inference, etc.)
```

```python
# ============================================================
# Educational PyTorch Implementation of LSTMN (Cheng et al.)
# Task: Translation (Toy Dataset EN → FR)
# ============================================================

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import Counter
import random, re

# ------------------------------------------------------------
# 1. Toy parallel dataset (English → French)
# ------------------------------------------------------------
train_data = [
    ("i love this movie", "j aime ce film"),
    ("the film was boring", "le film etait ennuyeux"),
    ("wonderful acting", "jeu merveilleux"),
    ("terrible movie", "film terrible"),
    ("a masterpiece", "un chef d oeuvre")
]

dev_data = [
    ("i hated this", "j ai deteste cela"),
    ("great story", "bonne histoire")
]

# ------------------------------------------------------------
# 2. Tokenizer & vocab utils
# ------------------------------------------------------------
def simple_tokenizer(text):
    # Lowercase, replace anything outside the allowed character set with spaces, split.
    text = text.lower()
    text = re.sub(r"[^a-z0-9àâçéèêëîïôûùüÿñæœ]+", " ", text)
    return text.strip().split()

def build_vocab(dataset, side="src", min_freq=1):
    counter = Counter()
    for src, tgt in dataset:
        text = src if side == "src" else tgt
        counter.update(simple_tokenizer(text))
    vocab = {"<pad>":0,"<unk>":1,"<sos>":2,"<eos>":3}
    for w,f in counter.items():
        if f>=min_freq: vocab[w]=len(vocab)
    return vocab

src_vocab = build_vocab(train_data,"src")
tgt_vocab = build_vocab(train_data,"tgt")
# Special tokens have the same indices in both vocabularies,
# so these constants are also used when numericalizing source text.
PAD_IDX, SOS_IDX, EOS_IDX = tgt_vocab["<pad>"], tgt_vocab["<sos>"], tgt_vocab["<eos>"]

def numericalize(text, vocab, max_len=8, add_eos=False):
    # Map tokens to ids, optionally append <eos>, then truncate/pad to max_len.
    toks = simple_tokenizer(text)
    ids = [vocab.get(tok,vocab["<unk>"]) for tok in toks]
    if add_eos: ids.append(EOS_IDX)
    ids = ids[:max_len]
    ids += [PAD_IDX]*(max_len-len(ids))
    return ids

# ------------------------------------------------------------
# 3. LSTMN Cell (intra-attention)
# ------------------------------------------------------------
class LSTMNCell(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.hidden_dim=hidden_dim
        self.gates=nn.Linear(input_dim+hidden_dim,4*hidden_dim)
        self.attn_w_h=nn.Linear(hidden_dim,hidden_dim,bias=False)
        self.attn_w_x=nn.Linear(input_dim,hidden_dim,bias=False)
        self.attn_w_prev=nn.Linear(hidden_dim,hidden_dim,bias=False)
        self.attn_v=nn.Linear(hidden_dim,1,bias=False)
    def forward(self,x_t,H,C):
        # H, C: lists of past hidden / memory slots (the tapes), each of shape (B, hidden_dim).
        if len(H)==0:
            # Empty tape: start from a single all-zero slot.
            h_prev=torch.zeros(x_t.size(0),self.hidden_dim,device=x_t.device)
            c_prev=torch.zeros(x_t.size(0),self.hidden_dim,device=x_t.device)
            H,C=[h_prev],[c_prev]
        h_stack=torch.stack(H,dim=1); c_stack=torch.stack(C,dim=1)
        # Additive attention score for every slot on the tape.
        # Simplification: H[-1] (the latest hidden state) stands in for the
        # previous summary \tilde{h}_{t-1} used in the paper's score.
        attn_scores=self.attn_v(torch.tanh(
            self.attn_w_h(h_stack)+
            self.attn_w_x(x_t).unsqueeze(1)+
            self.attn_w_prev(H[-1]).unsqueeze(1)
        )).squeeze(-1)
        attn_w=F.softmax(attn_scores,dim=1)
        # Attention-weighted summaries of the hidden and memory tapes.
        h_tilde=torch.bmm(attn_w.unsqueeze(1),h_stack).squeeze(1)
        c_tilde=torch.bmm(attn_w.unsqueeze(1),c_stack).squeeze(1)
        gates=self.gates(torch.cat([x_t,h_tilde],dim=1))
        i,f,o,c_hat=gates.chunk(4,dim=1)
        i,f,o=torch.sigmoid(i),torch.sigmoid(f),torch.sigmoid(o)
        c_hat=torch.tanh(c_hat)
        c_t=f*c_tilde+i*c_hat
        h_t=o*torch.tanh(c_t)
        # Return the new states and the tapes extended by one slot.
        return h_t,c_t,H+[h_t],C+[c_t]

# ------------------------------------------------------------
# 4. Encoder–Decoder with LSTMN
# ------------------------------------------------------------
class Encoder(nn.Module):
    def __init__(self,vocab_size,embed_dim,hidden_dim):
        super().__init__()
        self.embed=nn.Embedding(vocab_size,embed_dim)
        self.cell=LSTMNCell(embed_dim,hidden_dim)
    def forward(self,src):
        H,C=[],[]
        for t in range(src.size(1)):
            h,c,H,C=self.cell(self.embed(src[:,t]),H,C)
        return H,C

class Decoder(nn.Module):
    def __init__(self,vocab_size,embed_dim,hidden_dim):
        super().__init__()
        self.embed=nn.Embedding(vocab_size,embed_dim)
        self.cell=LSTMNCell(embed_dim,hidden_dim)
        self.fc=nn.Linear(hidden_dim,vocab_size)
    def forward(self,tgt,H,C,teacher_forcing_ratio=0.5):
        # NOTE: the encoder tapes H, C are accepted but never read below, so this
        # toy decoder is not conditioned on the source (no inter-attention or fusion).
        B,T=tgt.shape; outputs=[]
        h,c,H_d,C_d=[],[],[],[]
        x=self.embed(torch.full((B,),SOS_IDX,dtype=torch.long,device=tgt.device))
        for t in range(T):
            h,c,H_d,C_d=self.cell(x,H_d,C_d)
            out=self.fc(h); outputs.append(out.unsqueeze(1))
            # Next input: gold token tgt[:, t] (teacher forcing) or the model's own prediction.
            teacher_force=random.random()<teacher_forcing_ratio
            top1=out.argmax(1)
            inp=tgt[:,t] if teacher_force else top1
            x=self.embed(inp)
        return torch.cat(outputs,dim=1)

class Seq2Seq(nn.Module):
    def __init__(self,enc,dec):
        super().__init__(); self.enc=enc; self.dec=dec
    def forward(self,src,tgt):
        H,C=self.enc(src)
        return self.dec(tgt,H,C)

# ------------------------------------------------------------
# 5. Training
# ------------------------------------------------------------
device="cuda" if torch.cuda.is_available() else "cpu"
enc=Encoder(len(src_vocab),32,64)
dec=Decoder(len(tgt_vocab),32,64)
model=Seq2Seq(enc,dec).to(device)
optimizer=optim.Adam(model.parameters(),lr=0.01)
criterion=nn.CrossEntropyLoss(ignore_index=PAD_IDX)

def batchify(data,batch_size=2,shuffle=True):
    # Shuffles the list in place, then yields (src, tgt) id tensors per mini-batch.
    if shuffle: random.shuffle(data)
    for i in range(0,len(data),batch_size):
        batch=data[i:i+batch_size]
        src=[numericalize(src,src_vocab,add_eos=True) for src,_ in batch]
        tgt=[numericalize(tgt,tgt_vocab,add_eos=True) for _,tgt in batch]
        yield torch.tensor(src).to(device),torch.tensor(tgt).to(device)

for epoch in range(200):
    model.train(); total_loss=0
    for src,tgt in batchify(train_data):
        optimizer.zero_grad()
        # The decoder runs over tgt[:, :-1]; logits at step t are scored against tgt[:, t+1].
        out=model(src,tgt[:,:-1])
        loss=criterion(out.reshape(-1,out.size(-1)),tgt[:,1:].reshape(-1))
        loss.backward(); optimizer.step()
        total_loss+=loss.item()
    # total_loss is the sum of mini-batch losses for the epoch, not the mean.
    print(f"Epoch {epoch+1}, Loss={total_loss:.4f}")
```

```
Epoch 1, Loss=8.7238
Epoch 2, Loss=7.9351
Epoch 3, Loss=6.8637
Epoch 4, Loss=5.4413
Epoch 5, Loss=5.4554
Epoch 6, Loss=4.7811
Epoch 7, Loss=4.4558
Epoch 8, Loss=4.2947
Epoch 9, Loss=4.1061
Epoch 10, Loss=3.3352
Epoch 11, Loss=3.2487
Epoch 12, Loss=3.4155
Epoch 13, Loss=4.7039
Epoch 14, Loss=3.0020
Epoch 15, Loss=3.7901
Epoch 16, Loss=3.8352
Epoch 17, Loss=2.7812
Epoch 18, Loss=3.1384
Epoch 19, Loss=3.4388
Epoch 20, Loss=3.0954
Epoch 21, Loss=2.5091
Epoch 22, Loss=2.3468
Epoch 23, Loss=2.5982
Epoch 24, Loss=3.1085
Epoch 25, Loss=2.7143
Epoch 26, Loss=2.6720
Epoch 27, Loss=2.1767
Epoch 28, Loss=1.9859
Epoch 29, Loss=2.6051
Epoch 30, Loss=2.5834
Epoch 31, Loss=2.0447
Epoch 32, Loss=3.0147
Epoch 33, Loss=2.2219
Epoch 34, Loss=3.9671
Epoch 35, Loss=2.8967
Epoch 36, Loss=2.4293
Epoch 37, Loss=3.1520
Epoch 38, Loss=2.6283
Epoch 39, Loss=2.4988
Epoch 40, Loss=2.6665
Epoch 41, Loss=1.6662
Epoch 42, Loss=3.9601
Epoch 43, Loss=2.6165
Epoch 44, Loss=3.5123
Epoch 45, Loss=2.8546
Epoch 46, Loss=2.38
```

```python
# ------------------------------------------------------------
# 6. Translation Demo
# ------------------------------------------------------------
def translate(sentence,max_len=8):
    model.eval()
    src=torch.tensor([numericalize(sentence,src_vocab,add_eos=True)]).to(device)
    # The encoder tapes are computed but, as in training, the decoder cell never reads them.
    H,C=model.enc(src)
    x=torch.full((1,),SOS_IDX,dtype=torch.long,device=device)
    H_d,C_d=[],[]
    outputs=[]
    # Greedy decoding: feed back the argmax token until <eos> or max_len steps.
    for _ in range(max_len):
        x_emb=model.dec.embed(x)
        h,c,H_d,C_d=model.dec.cell(x_emb,H_d,C_d)
        out=model.dec.fc(h)
        top1=out.argmax(1)
        if top1.item()==EOS_IDX: break
        outputs.append(top1.item()); x=top1
    inv_vocab={i:w for w,i in tgt_vocab.items()}
    return " ".join(inv_vocab.get(idx,"?") for idx in outputs)

print("EN: i love this movie")
print("FR:", translate("i love this movie"))
```

```
EN: i love this movie
FR: merveilleux
```


# Problem–Solution Mapping: “Long Short-Term Memory-Networks for Machine Reading”  
*Cheng, Dong, and Lapata*

---

| Problem / Research Gap | How It Limits Prior Work | Paper’s Proposed Solution |
|-------------------------|--------------------------|----------------------------|
| **Memory compression in sequence models**: single-vector compression is insufficient for long sequences. | Leads to poor generalization on long inputs and inefficient use of memory on short ones. Prior LSTMs must fit all past context into one dense state. | Replace the single LSTM cell with a **memory network (memory tape)** that stores a slot per token; use **attention** to read adaptively from past slots, avoiding over-compression. |
| **Lack of structural inductive bias**: sequence models process tokens linearly without explicit mechanisms for relations. | Imposes a bias misaligned with language’s inherent structure; models cannot explicitly reason over token–token relations. | Add **intra-attention inside the recurrence** to induce soft, undirected lexical relations among tokens during reading, trained end-to-end without direct supervision. |
| **Markovian state update in standard LSTMs**: next state depends only on current state; “unbounded memory” assumption may fail. | Long-distance information can be lost when the current state is treated as a sufficient summary; harms modeling of long dependencies. | Perform **non-Markov updates** by attending over all prior hidden/memory slots to build adaptive summaries $(\tilde{h}_t, \tilde{c}_t)$ before gating, directly incorporating distant context. |
| **Integrating structure with sequence transduction (two sequences)**: standard encoder–decoder uses only inter-attention. | Decoder may not fully exploit both intra-sequence relations (within target) and inter-sequence alignments (source↔target), limiting fusion of structural cues. | Propose two integrations: **Shallow fusion** (LSTMN for encoder/decoder plus inter-attention), and **Deep fusion** (store inter-alignment in decoder memory via a gate $r_t$). |
| **Need for relation induction without explicit supervision** | Obtaining gold structural annotations (e.g., dependencies) is costly; prior work may rely on external memories that are separate from recurrence. | Internalize memory within recurrence and let **soft attention weakly induce relations** during task training (no direct supervision), strengthening interaction between memory and update. |
| **Training stability of RNNs (vanishing/exploding gradients)** | Classic difficulty in learning long-term dependencies; prior remedies rely on gating/gradient clipping. | Retain **LSTM gating (input/forget/output, candidate)** for stable training while augmenting with attention-based memory reads, preserving proven optimization behavior. |

---

## Scope & Empirical Support  
The authors validate **LSTMN** on:  
- **Language modeling** (Penn Treebank): lower perplexity than LSTMs and gated-feedback models.  
- **Sentiment analysis** (Stanford Sentiment Treebank): higher accuracy than standard LSTMs, competitive with CNNs.  
- **Natural language inference** (SNLI): deep fusion achieves state-of-the-art accuracy (86.3%).  

Overall, LSTMN demonstrates performance comparable to or better than state-of-the-art baselines, and consistently superior to vanilla LSTMs.

---

## References
- Cheng, Jianpeng, Li Dong, and Mirella Lapata. *“Long Short-Term Memory-Networks for Machine Reading.”* EMNLP, 2016.  
- Datasets: Penn Treebank (LM), Stanford Sentiment Treebank (SA), Stanford NLI (NLI).  


# Related Work References

| Author(s) | Year | Title | Venue | Connection to This Paper |
|-----------|------|-------|-------|---------------------------|
| Bahdanau, D., Cho, K., & Bengio, Y. | 2014 | Neural machine translation by jointly learning to align and translate | ICLR | Introduces attention in encoder–decoder models to mitigate memory compression in RNNs; inspires LSTMN’s intra-attention mechanism. |
| Rush, A. M., Chopra, S., & Weston, J. | 2015 | A neural attention model for abstractive sentence summarization | EMNLP | Example of sequence-to-sequence RNNs with attention, motivating LSTMN’s ability to uncover lexical relations. |
| Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., & Blunsom, P. | 2015 | Teaching machines to read and comprehend | NeurIPS | Applies attention-based RNNs for reading comprehension; provides a benchmark task for models like LSTMN. |
| Bengio, Y., Simard, P., & Frasconi, P. | 1994 | Learning long-term dependencies with gradient descent is difficult | IEEE Trans. Neural Networks | Classic paper on vanishing/exploding gradients; motivates gated RNNs and LSTM. |
| Hochreiter, S., & Schmidhuber, J. | 1997 | Long short-term memory | Neural Computation | Foundational LSTM work; LSTMN extends this by integrating memory networks. |
| Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. | 2014 | Learning phrase representations using RNN encoder–decoder for statistical machine translation | EMNLP | Introduces GRUs; influential in advancing gated architectures compared to vanilla RNNs. |
| Koutník, J., Greff, K., Gomez, F., & Schmidhuber, J. | 2014 | A clockwork RNN | ICML | Enhances information flow in recurrent networks; motivates architectural modifications beyond standard LSTMs. |
| Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. | 2015 | Gated feedback recurrent neural networks | ICML | Proposes gating mechanisms to improve RNN memory handling; precursor to richer memory integration. |
| Yao, K., Cohn, T., Vylomova, K., Duh, K., & Dyer, C. | 2015 | Depth-gated recurrent neural networks | arXiv | Introduces depth-gated RNNs; inspires comparison with LSTMN as another LSTM variant. |
| Zaremba, W., & Sutskever, I. | 2014 | Learning to execute | arXiv | Notes RNN memory limitations; contextualizes the need for attention/memory augmentation. |
| Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. | 2013 | Recursive deep models for semantic compositionality over a sentiment treebank | EMNLP | Example of recursive neural networks leveraging structure; contrasts with LSTMN’s soft intra-attention. |
| Das, S., Giles, C. L., & Sun, G. Z. | 1992 | Learning context-free grammars with a recurrent neural network and external stack memory | CogSci | Early attempt to add structural bias with external memory; LSTMN builds on this idea. |
| Weston, J., Chopra, S., & Bordes, A. | 2015 | Memory networks | ICLR | Proposes external memory for reasoning; LSTMN adapts this by embedding memory inside the recurrence. |
| Sukhbaatar, S., Weston, J., Fergus, R., et al. | 2015 | End-to-end memory networks | NeurIPS | Extends memory networks with differentiable addressing; parallels LSTMN’s attention-based memory addressing. |
| Meng, F., Lu, Z., Tu, Z., Li, H., & Liu, Q. | 2015 | A deep memory-based architecture for sequence-to-sequence learning | ICLR Workshop | Applies memory networks to translation; motivates LSTMN’s encoder–decoder integration. |
| Grefenstette, E., Hermann, K. M., Suleyman, M., & Blunsom, P. | 2015 | Learning to transduce with unbounded memory | NeurIPS | Proposes differentiable memory structures; foundational to LSTMN’s memory design. |
| Tran, K., Bisazza, A., & Monz, C. | 2016 | Recurrent memory network for language modeling | NAACL | Combines RNNs with external memory; directly comparable to LSTMN’s approach. |
| Kumar, A., Irsoy, O., Su, J., Bradbury, J., English, R., Pierce, B., Ondruska, P., Gulrajani, I., & Socher, R. | 2016 | Ask me anything: Dynamic memory networks for NLP | ICML | Uses episodic memory modules for QA; related to LSTMN’s memory-driven reading. |
| Xiong, C., Merity, S., & Socher, R. | 2016 | Dynamic memory networks for visual and textual question answering | ICML | Extends memory networks to multimodal tasks; demonstrates broader relevance of memory-based reasoning. |
| Dyer, C., Ballesteros, M., Ling, W., Matthews, A., & Smith, N. A. | 2015 | Transition-based dependency parsing with stack LSTMs | ACL | Uses hard structural decisions in parsing; contrasts with LSTMN’s soft, differentiable attention. |
| Bowman, S. R., Gauthier, J., Rastogi, A., Gupta, R., Manning, C. D., & Potts, C. | 2016 | A fast unified model for parsing and sentence understanding | ACL | Example of shift-reduce neural models with hard decisions; compared against LSTMN’s soft induction. |
| Klein, D., & Manning, C. D. | 2004 | Corpus-based induction of syntactic structure | ACL | Classic work on grammar induction; cited to distinguish LSTMN’s undirected lexical relations from directed head-modifier ones. |
