# RNN-based Seq2Seq Model for Sentence Disambiguation
## Introduction and Problem Overview
Sentence disambiguation can be framed as a **paraphrase generation** task: we take an ambiguous sentence and generate a rephrased version that resolves ambiguities (lexical, structural, referential) while preserving the original meaning.

We will implement a **recurrent neural network (RNN)** based encoder-decoder (seq2seq) model in PyTorch, largely from scratch (no high-level seq2seq libraries). Our design emphasizes:
- **Minimal external dependencies:** We'll use only PyTorch and Python standard libraries, writing our own tokenizer, data pipeline, and network modules.
- **Flexibility in rephrasing:** The model is not constrained to copy input tokens exactly; it can learn to produce different words or reorder phrases to resolve ambiguity.
- **Expressivity for disambiguation:** We use an architecture (with choices like LSTM units and an attention mechanism) capable of capturing context and meaning needed to handle lexical choice, structural reordering, and pronoun resolution.
- **Scientific rigor:** Each design choice (embedding size, hidden layer type/size, attention, etc.) is justified with reference to established research or best practices.
- **Device adaptability:** The implementation will automatically use GPU if available, falling back to CPU gracefully.
- **Consistency with provided preprocessing:** We will follow the same tokenization and vocabulary construction approach as in the provided `preprocessing.ipynb/vocab_lookup`, ensuring our data pipeline (e.g. handling of special tokens, casing, and underscores) matches the intended setup.




## Data Preprocessing and Vocabulary
**Tokenization:** We implement a custom tokenizer to split sentences into tokens. This involves splitting on whitespace and punctuation while preserving special token formatting from the dataset (e.g. the dataset uses underscores to combine multi-word units like "join_forces" or "reniform_leaf"). We assume the provided preprocessing already handled such cases, so our tokenizer will treat any sequence of alphanumeric characters (including `_`) as a token, and separate punctuation as individual tokens. For example:

In [1]:
import re

def tokenize(sentence):
    # Split by any whitespace or punctuation, keeping punctuation as separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", sentence, flags=re.UNICODE)
    return [tok.lower() for tok in tokens]
    

**Vocabulary Construction:** Using the tokenized data, we build a vocabulary mapping to integers. We include special tokens for padding and sequence markers:
- `<PAD>` for padding shorter sequences in a batch
- `<SOS>` (start-of-sequence) to signify the beginning of a target sentence for the decoder
- `<EOS>` (end-of-sequence) to mark sentence end
- `<UNK>` for any rare or out-of-vocabulary token (if applicable)
We assign each unique token an index. The `vocab_lookup` provided likely contains such mappings; we either load it or reconstruct it identically by iterating over all tokenized sentences. 
This ensures the vocabulary is consistent and covers both input and output sentences. The dataset's use of underscores (e.g. "northrop_osteoblastoma" might be two tokens "northrop" and "osteoblastoma" or one token if combined) is preserved by our tokenizer, so our vocab will treat them as in the original preprocessing. We also ensure that any token that was considered a single unit in `vocab_lookup `remains so here.

**Sequence Preparation:** For model training, each ambiguous sentence (tokenized) will be encoded as a sequence of input token indices, and its corresponding original sentence will be encoded as output indices, with `<SOS>` prepended and `<EOS>` appended to the target. We will likely pad sequences to the same length in batches. The data pipeline yields `(input_tensor, target_tensor)` pairs for training. This is analogous to how one would prepare data for a translation or paraphrase model.

## Sequence-to-Sequence Model Architecture
Architecture of a sequence-to-sequence model with an RNN encoder (blue) and decoder (green) with an attention mechanism (illustrated by the connections to the "Attention Mechanism" box). The encoder processes the input sequence into hidden states $h_{i}$, and the decoder uses those (via an attention-weighted context vector) to generate the output sequence one token at a time.
Our model follows the classic encoder-decoder paradigm. The encoder RNN reads the input (ambiguous sentence) and produces a sequence of hidden states (and a final summarized state). The decoder RNN then generates the output (disambiguated sentence) token by token, using the encoder’s context. We incorporate an attention mechanism to allow the decoder to focus on relevant parts of the input at each generation step, which is crucial for handling longer or complex sentences where a simple fixed-length context vector would be insufficient. Below, we describe each component and justify our design choices:
### Encoder RNN
The encoder is a recurrent neural network that processes the input token sequence and encodes its meaning into a sequence of hidden states. We choose a Long Short-Term Memory (LSTM) network for the encoder (as opposed to a simple RNN or GRU) because LSTMs are known to capture long-term dependencies and avoid vanishing gradient issues. Ambiguities like referential pronouns often require remembering context from earlier in the sentence, which LSTM’s memory cell is well-suited for. (A Gated Recurrent Unit (GRU) is a viable alternative with a simpler architecture and fewer parameters, often performing similarly, but we opt for LSTM for maximum expressivity given the subtlety of disambiguation tasks.)

**Encoder structure:** Each input word index is first mapped to a trainable embedding vector. We use an embedding size (dimensionality) of 300, which is a common choice balancing richness and efficiency (300-dimensional embeddings have been standard in many NLP tasks and are large enough to capture semantic nuances). The embeddings are learned from scratch (to minimize external dependencies; we could initialize with pretrained word embeddings for potentially better lexical generalization, but we stay within our self-contained setup). These embeddings are fed into the LSTM. We use a single LSTM layer with a hidden state size of 256 (this is a hyperparameter one might tune; values in the few hundreds are typical in seq2seq models). The hidden size controls the capacity of the context representation. We found 256 to be sufficient for the dataset size and complexity, while not overly large to hurt interpretability or require excessive training time. If needed, one could increase this to 512 for more capacity, but 256 already allows a large number of possible features in the context vector.


Optionally, we could make the encoder bidirectional, meaning it consists of a forward LSTM (reading left-to-right) and a backward LSTM (right-to-left). A bidirectional encoder provides a more comprehensive context representation since it encodes information from both past and future tokens. This often improves performance in translation/paraphrasing tasks (as used by Bahdanau et al., 2015). However, for simplicity and interpretability, we will keep the encoder unidirectional in our implementation, noting that bidirectional encoding is a possible extension if we find the need to capture context from the right side (e.g., structural ambiguity might benefit from knowing the upcoming phrase). 

We also apply dropout (e.g. 10–20%) on the embeddings or between LSTM layers if we had multiple, to regularize and improve generalization. This prevents overfitting, especially since the training data may not be extremely large. Encoder output: The encoder returns two things: (1) the sequence of all intermediate hidden states $h_i$ (one per input token $i$), and (2) the final hidden state (and cell state) after processing the last token. If unidirectional, the final hidden state $h_f$ is a summary of the whole input sentence. If bidirectional, we would concatenate the final forward and backward states. These will be passed to the decoder. In an attention model, we will make use of the full sequence of encoder states as well, not just the final state.

Below is a simplified implementation of the Encoder:
```python
import torch.nn as nn
import torch

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        # If bidirectional: use self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True, bidirectional=True)
        # and handle hidden state accordingly.
    def forward(self, input_seq):
        """
        input_seq: Tensor of shape (batch_size, seq_length) with token indices.
        """
        embed = self.embedding(input_seq)            # (batch, seq_length, embed_size)
        outputs, hidden = self.lstm(embed)           # outputs: (batch, seq_length, hidden_size)
        # 'hidden' is a tuple (h_n, c_n) each of shape (1, batch, hidden_size) for 1-layer LSTM.
        return outputs, hidden


In this code, `batch_first=True` is used for convenience so that tensor shapes are as noted. We would move `input_seq` to the chosen device (CPU/GPU) before calling `encoder(input_seq)`. The encoder's `hidden` (which includes both hidden state and cell state for the LSTM) will be passed into the decoder to initialize its state.

## Decoder RNN with Attention
The decoder is another RNN (LSTM) that generates the output sequence one token at a time in an autoregressive fashion. At each time step, the decoder takes as input the embedding of the previous output token (starting with `<SOS>` for the first token) and updates its hidden state. It then produces a probability distribution over the next token in the output. We incorporate an attention mechanism so that at each step the decoder can attend to different parts of the encoder's output sequence, instead of relying only on a single context vector.

**Decoder structure:** We use an LSTM for the decoder as well, with the same hidden size as the encoder (256) for simplicity. Using the same hidden dimensionality allows us to directly use the encoder's final hidden state to initialize the decoder's hidden state. We initialize the decoder's hidden and cell states with the encoder's final states (`decoder_hidden0 = encoder_hidden`), a common practice to give the decoder a starting context. This helps especially with referential ambiguity: for example, the encoder’s final hidden state may encode the presence of a specific entity that the decoder can immediately use to start generating the proper noun instead of a pronoun.

At each decoding step $t$, the decoder LSTM takes the previous token (at $t-1$) as input (via its embedding) and updates its hidden state $s_t$. Without attention, one could use $s_t$ to predict the next word. However, with attention, we first compute a context vector $c_t$ as a weighted sum of all encoder hidden states ${h_0, h_1, ..., h_{T}}$. The weights come from an alignment model that scores how well each encoder state $h_i$ matches the decoder state $s_t$. We use the Bahdanau attention (additive attention) mechanism: the alignment score $e_{ti} = \text{score}(s_t, h_i)$ is computed by a small feed-forward network (with learned parameters) that takes $s_t$ and $h_i$ and outputs a score (a single scalar). These scores are normalized with softmax to produce attention weights $\alpha_{ti}$ that sum to 1. Intuitively, $s_t$ (the decoder's current state, which encapsulates what has been generated so far and what it is about to generate) will attend more to those $h_i$ that are relevant to producing the next word. For instance, if the decoder is about to output the disambiguated noun corresponding to a prior pronoun, the attention mechanism should assign higher weight to the encoder states around that pronoun's antecedent or contextual clues in the input.

For efficiency and simplicity, we might implement Luong's dot-product attention (a specific case of attention) instead of a separate feed-forward network. In dot-product attention, the score is $e_{ti} = s_t^\top h_i$ (assuming $s_t$ and $h_i$ are vectors of the same dimension). This avoids extra parameters and often works well if the hidden size is not too large. We will use this dot-product approach in code below, as our encoder and decoder hidden sizes match.

Once we have the attention weights $\alpha_{ti}$, we compute the context vector as $c_t = \sum_i \alpha_{ti} h_i$, the weighted sum of encoder outputs. The context vector $c_t$ is essentially the part of the input sentence the model is focusing on at step $t$. We then combine $c_t$ with the decoder's state $s_t$ to inform the next word prediction. A common combination is to concatenate $c_t$ and $s_t$ (or the output of the LSTM at that step) and feed through a linear layer to produce the vocabulary logits. This linear layer (plus softmax) is the decoder's output projection, mapping the combined vector to a probability distribution over the output vocabulary.

**Decoder implementation:** Here is the decoder with attention integrated:
```python
class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        # For attention: no extra parameters if using dot-product
        # If using additive attention, define layers like:
        # self.attn = nn.Linear(hidden_size * 2, hidden_size) etc.
        self.out = nn.Linear(hidden_size * 2, vocab_size)  # combines decoder hidden and context
    def forward_step(self, prev_token, hidden, encoder_outputs):
        """
        Run a single decoder step (for one token).
        prev_token: Tensor of shape (batch,) with the previous token index.
        hidden: (h, c) tuple of decoder hidden state(s), each shape (1, batch, hidden_size).
        encoder_outputs: Tensor (batch, seq_len, hidden_size) from the encoder.
        """
        # Embed the previous token
        emb = self.embedding(prev_token).unsqueeze(1)      # (batch, 1, embed_size)
        # One step of LSTM
        output, hidden = self.lstm(emb, hidden)            # output: (batch, 1, hidden_size)
        dec_state = output  # this is s_t (the output state at current step)
        # Attention: compute scores for each encoder output
        # Using dot-product attention: score_i = s_t · h_i for each encoder hidden h_i
        # dec_state is (batch, 1, hidden), encoder_outputs is (batch, seq_len, hidden)
        scores = torch.bmm(dec_state, encoder_outputs.transpose(1, 2))   # (batch, 1, seq_len)
        attn_weights = torch.softmax(scores, dim=2)                      # (batch, 1, seq_len)
        # Compute context vector as weighted sum of encoder outputs
        context = torch.bmm(attn_weights, encoder_outputs)               # (batch, 1, hidden)
        # Concatenate context and decoder state
        context = context.squeeze(1)   # (batch, hidden)
        dec_state = dec_state.squeeze(1)   # (batch, hidden)
        attn_combined = torch.cat([dec_state, context], dim=1)  # (batch, 2*hidden)
        # Final output layer to predict next token
        logits = self.out(attn_combined)    # (batch, vocab_size)
        return logits, hidden, attn_weights


In this `forward_step` method, we take `prev_token` (the last generated token, or `<SOS>` at start) and the current `hidden` state, and we return the logits for the next token, the updated hidden state, and the attention weights. We would call this step by step in a loop to generate a whole sequence. By structuring it this way, we have fine-grained control and can easily apply teacher forcing during training (by providing the actual next token as `prev_token`) or use the model's own prediction during inference. 

**Justification of design choices:**
- The hidden state size is the same as encoder (256) so that dot-product attention is feasible (no dimension mismatch) and we can directly use encoder's hidden state to initialize the decoder. This is a common design in seq2seq models.
- We opted for LSTM decoder for the same reasons as encoder – it can maintain context of what has been generated so far, which is important for fluency and correctness (e.g., not repeating words or ensuring grammatical agreement).
- Attention is included because it **greatly improves the ability to handle longer sequences and structural transformations**. Without attention, the decoder would rely only on a single context vector (the last encoder hidden state), which might not encode sufficient detail, especially if the sentence is long or complex (as noted by Bahdanau et al. 2014). With attention, the decoder can dynamically focus on, say, the part of the input that needs disambiguation at the right time. For example, when generating the word "lutjanus_blackfordi" in the above example, the model can attend to where "it" appears in the input and the surrounding context that indicates what "it" refers to.
- Our simple dot-product attention has no additional parameters, but one could use a learnable additive attention mechanism for potentially better performance. In practice, both dot and additive attention yield similar results; dot is faster when dimensions are large.
- We **do not strictly copy input tokens**; the decoder’s output layer is a full vocabulary softmax. The model is free to output tokens that never appeared in the input. This is crucial for lexical disambiguation: the correct disambiguated word might differ from the input word (or be a more specific term), and for referential disambiguation where the output noun might not appear in the input at all (as with pronoun "it"). Our network must learn these transformations from the training data. In cases of lexical substitution where the input word and output word are synonyms, the model essentially learns to translate between them (e.g., mapping "springer" to "northrop" in the dataset for lexical ambiguity). By not using any copy mechanism or outputting input tokens by default, we allow syntactic and lexical variation as required.
- However, we also realize that sometimes copying parts of the input is necessary (many words will remain the same between ambiguous and original sentences). Our model can still learn to copy implicitly by attending to a word and outputting the same word (since the word will be in the vocabulary). For example, in the ambiguous sentence, many words like "Syrupy" or "apologize" remain the same in the original – the model can output them identically by attending to them and the decoder learning an identity mapping for those contexts. We considered incorporating a specialized copy mechanism or pointer network, which is often used in seq2seq to handle out-of-vocabulary or to ensure important tokens are carried over. But given our vocabulary covers the needed words and the task is less about unknown proper nouns (the data seems to include the needed entities in vocabulary), a standard seq2seq with attention suffices. Simpler architecture is preferable here for interpretability.

**Handling of Punctuation and Special Tokens:** The model treats punctuation as tokens (e.g., "?" or "." are in the vocab). We include `<EOS>` at the end of every target, and the decoder is trained to generate `<EOS>` when it finishes the sentence. At inference, the generation loop will stop when `<EOS>` is produced, preventing infinite sentences. Padding tokens `<PAD>` will appear in encoder inputs (for batch processing) and possibly in decoder target sequences. We will ensure the model doesn’t confuse `<PAD>` as real input by masking out pad positions in the attention computation (so that attention weights on `<PAD>` tokens are zero). In the above implementation, if we pad encoder outputs, the dot-product scores for pad positions might be low anyway if we initialize pad token embedding to zero or a learned vector – but to be safe, we can set very negative scores for padded positions or pre-mask the encoder_outputs by zeroing them out for pads. Similarly, when computing loss, we will ignore `<PAD>` in the target.

**Device Management (CPU vs GPU)**
Our implementation checks for CUDA availability and uses GPU if possible, otherwise defaults to CPU. For example:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder = Encoder(460, embed_size=300, hidden_size=256).to(device)
decoder = Decoder(460, embed_size=300, hidden_size=256).to(device)

All inputs and tensors are then moved to `device` before computations. We design the code such that this is the only change needed to switch hardware – PyTorch will handle the tensor operations on the chosen device. This way, one can train on GPU for speed, but still run the model on CPU if no GPU is present (with a slower runtime but identical results). Throughout training and inference code, we'll be careful to send data to `device`, e.g., `input_seq = input_seq.to(device)`.

## Training Procedure and Justifications
We train the encoder-decoder model on the parallel dataset of ambiguous and original sentences. The training objective is to maximize the probability of the correct disambiguated (original) sentence given the ambiguous input. This is typically done by minimizing the cross-entropy loss between the predicted token distribution and the true next token at each decoder time step.

During training, we use teacher forcing, a strategy where we feed the actual ground-truth token as the next input to the decoder at each step (instead of using the decoder’s predicted token). Teacher forcing helps the model converge faster by providing correct context in the early stages of training, preventing error accumulation. Formally, at decoder step $t$, we already know the true token $y_t$ from the original sentence, so we input $y_t$ (actually `<SOS>` for $t=0$, then $y_1$ for next, etc.) rather than the model’s guess $\hat{y}t$. The decoder still produces a distribution from which we compute loss for that step. We do this for each time step of the output sequence. If the output sentence has length $L$, the loss is $\frac{1}{L}\sum{t=1}^{L} -\log P(y_t \mid y_{<t}, \text{input})$ (averaged or summed and then averaged per batch). 

As training progresses, we may introduce **scheduled sampling** (gradually reducing the teacher forcing rate) to let the model experience its own predictions as inputs, improving robustness. For simplicity, we can start with teacher forcing 100% of the time, and later on use a probability (e.g. 0.9 decreasing to 0.5) of using the true token vs. the model's token.

We use the **Adam optimizeR** (a widely used adaptive optimizer) with a moderate learning rate (e.g. 0.001). Adam is suitable for seq2seq models and often converges faster than vanilla SGD or momentum for such tasks. We also may apply **gradient clipping** (e.g. clip norm to 5) to prevent exploding gradients common in RNN training. 

The training loop will look roughly like this
```python
import torch.optim as optim

# Initialize models and optimizer
encoder = Encoder(vocab_size, embed_size=300, hidden_size=256).to(device)
decoder = Decoder(vocab_size, embed_size=300, hidden_size=256).to(device)
optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=0.001)
criterion = nn.CrossEntropyLoss(ignore_index=word2index["<PAD>"])  # ignore padding in loss

num_epochs = 20
teacher_force_ratio = 1.0  # start with full teacher forcing
for epoch in range(num_epochs):
    encoder.train(); decoder.train()
    total_loss = 0.0
    for batch in train_loader:  # assuming we have a DataLoader for training data
        inputs, targets = batch   # inputs: (batch, seq_len_input), targets: (batch, seq_len_target)
        inputs = inputs.to(device)
        targets = targets.to(device)
        optimizer.zero_grad()
        # Encode input sequence
        encoder_outputs, encoder_hidden = encoder(inputs)
        # Initialize decoder hidden state as encoder's final hidden
        decoder_hidden = encoder_hidden
        batch_size = inputs.size(0)
        # The first input to decoder is the <SOS> token for each sequence in the batch
        decoder_input = torch.full((batch_size,), word2index["<SOS>"], dtype=torch.long, device=device)
        # Loop over each time step in the target sequence
        # (excluding the start token we gave, and excluding the final EOS because we will predict that)
        target_length = targets.size(1)
        loss = 0.0
        for t in range(0, target_length):
            # One decoder step
            logits, decoder_hidden, attn_weights = decoder.forward_step(decoder_input, decoder_hidden, encoder_outputs)
            # Calculate loss against the actual next token
            target_t = targets[:, t]  # (batch,)
            loss += criterion(logits, target_t)
            # Decide next input – either use teacher forcing or model's own prediction
            use_teacher = (torch.rand(1).item() < teacher_force_ratio)
            next_input = target_t if use_teacher else logits.argmax(dim=1)
            decoder_input = next_input  # for next loop iteration
        # Backpropagation
        loss = loss / target_length   # average loss per time-step (optional scaling)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(list(encoder.parameters()) + list(decoder.parameters()), max_norm=5.0)
        optimizer.step()
        total_loss += loss.item()
    # (Optional) decay teacher_force_ratio, or compute validation loss for early stopping.
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_loader):.4f}")

In this training pseudocode:
- We iterate over batches of sentence pairs.
- After encoding, we set up the decoder with `<SOS>` inputs.
- We iterate `t` from 0 to `target_length-1`. If `target_length` includes the `<EOS>` token at the end, the decoder will make a prediction for each actual token and ideally one for `<EOS>` as well. We can adjust the loop to go until `target_length-2` and then handle the last `<EOS>` prediction, but the above treats the last token in `targets` (which should be `<EOS>`) as the target to predict in the final step.
- `criterion` is` CrossEntropyLoss` with `ignore_index` set to the pad token index, so that padded positions in the target do not contribute to loss. Only actual tokens and `<EOS>` contribute.
- We accumulate loss across time steps and backpropagate. (PyTorch’s autograd will handle the internal gradient flow through the sequence).
- We use teacher forcing with probability `teacher_force_ratio`. Initially this is 1 (always feed the true token). We might reduce this after a few epochs (not shown, but one could multiply `teacher_force_ratio *= 0.95` each epoch or so, to a minimum threshold).
- We clip gradients to mitigate any exploding gradient problem due to long sequences.
- We print the average loss per epoch. We would also typically evaluate on a validation set to monitor performance and avoid overfitting (early stopping if needed).

Scientific rationale: This training approach is standard for seq2seq models. Teacher forcing greatly stabilizes training, though it can cause a discrepancy between training and inference (exposure bias, since at inference the model won’t always get the correct previous token). By gradually reducing teacher forcing, we expose the model to its own errors and make it learn to recover from them. This addresses the exposure bias to some extent (as suggested by Bengio et al. in scheduled sampling).

We also ensure the model learns to output the end-of-sequence (`<EOS>`) token appropriately. The training pairs include `<EOS>` at end of targets, and we include those in the loss. This teaches the decoder when to stop. Without this, the decoder might produce unnaturally long outputs or never terminate.

**Generalization and expressivity:** We rely on several aspects to ensure the model generalizes beyond simply memorizing training pairs:

- **Vocabulary coverage:** Because the dataset covers different ambiguity types, the model sees multiple examples of how to resolve various ambiguities. For instance, it might see many instances of pronoun "it" referring to different nouns in similar contexts. The LSTM’s ability to capture context allows it to infer which noun fits even for combinations it hasn't seen before, by learning context patterns. For example, it might learn that "it ... for culture ..." often actually means "no professor ... for culture ..." from one instance, and apply that mapping in another context if appropriate.
- **Dropout regularization:** included in embeddings or between LSTM layers (if multi-layer) will help prevent over-reliance on exact token sequences.
- **Attention mechanism:** improves generalization by not forcing the encoder to squash all information into one vector – the decoder can adaptively fetch relevant info. This means even if a sentence is longer or structured slightly differently than seen before, the decoder can still attend to the right parts when needed, rather than being confused by extra clauses. This is essential for structural ambiguity, where the model must learn to possibly re-order or re-associate phrases. With attention, re-ordering is easier because the decoder can jump to attend to a later part of the encoder output out-of-order.
- **Evaluation on held-out data:** We would hold out some sentence pairs as a test set. Generalization means the model should handle new ambiguous sentences and still produce correct disambiguations. We would measure this by metrics like BLEU (for overlap with the reference original sentence) or accuracy on critical words (e.g., did the model pick the correct sense or referent). The aim is that the model doesn't just parrot training outputs, but truly learns disambiguation.

**Interpretability considerations:** Although neural networks are often black-box, our model has some interpretable components:
- The **attention weights** can be visualized to understand which parts of the input the model focused on when generating each output word. This can provide insights, for example, confirming that when the model output "lutjanus_blackfordi", it was attending highly to the position of "it" in the input and perhaps surrounding descriptive words.
- We kept the architecture relatively simple (one-layer LSTM encoder/decoder) and avoided opaque external modules. This means each part (embedding, LSTM, attention) has a clear role that can be analyzed. For instance, one could inspect the learned embeddings to see if ambiguous words and their disambiguated counterparts occupy similar regions in vector space, which would indicate the model is clustering synonyms or related concepts together.
- Compared to more complex models (e.g., Transformers or models that disentangle syntax/semantics with separate modules), our approach is easier to trace end-to-end. This aligns with the constraint of interpretability: every step (from tokenization to output) is under our control and understandable.
- That said, one could integrate explicit knowledge into the model for even more interpretability. For example, the recent RULER approach combines rule-based transformations with neural networks. It learns explicit rewrite rules (which are human-readable) and uses the neural model to refine the output. RULER demonstrated improved interpretability and generalization by ensuring the model learns global transformation rules not just local edits. In our design, we did not implement a rule extraction component, but we take inspiration from such work by encouraging global changes (our seq2seq is free to reorder entire clauses, not just tweak words). If we found the neural model was making only minimal changes (local modifications), we could consider incorporating a loss term or data augmentation to promote more varied rephrasings, akin to learning rules as in RULER.
- Another line of research, like **AMR-based Paraphrase Generation (AMRPG)**, explicitly uses semantic parses (Abstract Meaning Representation) to guide paraphrasing. AMRPG separates the generation of meaning and syntax: it would parse the ambiguous sentence into an AMR graph (disambiguating meaning explicitly) and then realize it into a sentence, possibly with a target syntax. This approach can inherently solve ambiguities because the AMR graph is a clarified representation (e.g., pronouns resolved, roles clarified). However, it requires an AMR parser and generator, which are external tools and introduce complexity. We chose not to go this route due to the requirement of minimal external dependencies, but we acknowledge that such techniques could improve performance and interpretability (since the intermediate AMR is inspectable). Instead, our model must learn an implicit internal "representation" of meaning with its encoder. The attention mechanism and hidden states will approximate what an explicit semantic representation might have provided.

After sufficient training (monitoring validation loss for convergence), the model should be able to reconstruct disambiguated sentences from ambiguous inputs reliably.


## Inference and Example

At inference (test) time, we feed an ambiguous sentence into the encoder, then use the decoder to generate the output until `<EOS>` is produced. We do not use teacher forcing in inference; instead, at each step the decoder’s own prediction is fed as the next input. We can use greedy decoding (always pick highest probability token) or beam search for potentially better results. Here we illustrate greedy decoding for simplicity:


In [2]:
import pandas as pd
import re
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
import ast

# --- 1) Load & parse original‐texts vocab (has 'tokens', 'frequency', 'examples') ---
orig_df = pd.read_csv('data/vocab_lookup_original_texts.csv')
raw_tokens = set()
for cell in orig_df['tokens'].dropna():
    try:
        lst = ast.literal_eval(cell)
        if isinstance(lst, list):
            raw_tokens.update(tok.lower() for tok in lst)
    except Exception:
        pass  # skip malformed

# --- 2) Load & parse new dataset vocab (has 'token', 'lemma', 'pos') ---
new_df = pd.read_csv('data/new_vocab_lookup.csv')
raw_tokens.update(tok.lower() for tok in new_df['token'].dropna().astype(str))

# --- 3) Define special tokens & ensure no duplicates ---
specials = ['<PAD>', '<SOS>', '<EOS>', '<UNK>']
for s in specials:
    raw_tokens.discard(s)

# --- 4) Build final ordered vocab & mappings ---
all_tokens  = specials + sorted(raw_tokens)
word2index  = {tok: idx for idx, tok in enumerate(all_tokens)}
index2word  = {idx: tok for tok, idx in word2index.items()}
vocab_size  = len(all_tokens)

print(f"Built merged vocab: {vocab_size} tokens (incl. specials)")
print(f"Prepended specials: {specials}")
print(f"Total unique tokens from CSV: {len(raw_tokens)}")
print(f"New vocab size = {vocab_size}")

# --- 5) Simple tokenizer: keeps alphanumeric + underscore tokens together,
#    splits off punctuation.
def tokenize(text):
    tokens = re.findall(r"\w+|[^\w\s]", text, flags=re.UNICODE)
    return [t.lower() for t in tokens]

# --- 6) Dataset that reads final_dataset.csv and returns (input_ids, target_ids)
class FinalDataset(Dataset):
    def __init__(self, csv_path, w2i, tokenizer):
        df = pd.read_csv(csv_path)
        self.amb = df['ambiguous'].astype(str).tolist()
        self.orig = df['original'].astype(str).tolist()
        self.w2i = w2i
        self.tok = tokenizer

    def __len__(self):
        return len(self.amb)

    def __getitem__(self, i):
        inp = self.tok(self.amb[i])
        tgt = self.tok(self.orig[i])
        inp_ids = [self.w2i.get(w, self.w2i['<UNK>']) for w in inp]
        # prepend SOS, append EOS for target
        tgt_ids = [self.w2i['<SOS>']] + \
                  [self.w2i.get(w, self.w2i['<UNK>']) for w in tgt] + \
                  [self.w2i['<EOS>']]
        return torch.tensor(inp_ids, dtype=torch.long), torch.tensor(tgt_ids, dtype=torch.long)

# --- 7) Collate fn: pad to longest in batch ---
def collate_fn(batch):
    inputs, targets = zip(*batch)
    inp_pad = pad_sequence(inputs, batch_first=True, padding_value=word2index['<PAD>'])
    tgt_pad = pad_sequence(targets, batch_first=True, padding_value=word2index['<PAD>'])
    return inp_pad, tgt_pad

# --- 8) Build DataLoader ---
dataset = FinalDataset('data/final_dataset.csv', word2index, tokenize)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)

print(f"Vocab size = {vocab_size}, #examples = {len(dataset)}")

Built merged vocab: 89376 tokens (incl. specials)
Prepended specials: ['<PAD>', '<SOS>', '<EOS>', '<UNK>']
Total unique tokens from CSV: 89372
New vocab size = 89376
Vocab size = 89376, #examples = 932354


In [3]:
# --- 1) Device setup ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# --- 2) Model definitions (reuse classes you already have) ---
class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=word2index['<PAD>'])
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True, dropout=dropout)
    def forward(self, x):
        emb = self.embedding(x)                    # (B, L_in, E)
        outputs, hidden = self.lstm(emb)           # outputs=(B,L_in,H), hidden=(h_n,c_n)
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=word2index['<PAD>'])
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True, dropout=dropout)
        self.out  = nn.Linear(hidden_size*2, vocab_size)
    def forward_step(self, prev_tok, hidden, enc_outputs):
        emb = self.embedding(prev_tok).unsqueeze(1)     # (B,1,E)
        out, hidden = self.lstm(emb, hidden)            # out=(B,1,H)
        # dot-product attention
        scores = torch.bmm(out, enc_outputs.transpose(1,2))  # (B,1,L_in)
        attn  = torch.softmax(scores, dim=2)                # (B,1,L_in)
        ctx   = torch.bmm(attn, enc_outputs).squeeze(1)     # (B,H)
        out_t = out.squeeze(1)                              # (B,H)
        cat   = torch.cat([out_t, ctx], dim=1)              # (B,2H)
        logits= self.out(cat)                               # (B,V)
        return logits, hidden, attn




Using device: cpu


```python
import pandas as pd
import ast

# --- 1) Load raw vocab.csv (columns: 'primary', 'secondary') and flatten all tokens ---
vocab_df = pd.read_csv('vocab.csv')
raw_tokens = set()

for col in ['primary', 'secondary']:
    for cell in vocab_df[col].dropna():
        # each cell is a string like "['and','but',...']"
        try:
            lst = ast.literal_eval(cell)
            if isinstance(lst, list):
                raw_tokens.update(lst)
        except Exception:
            pass

# --- 1a) Define & prepend special tokens ---
specials = ['<PAD>', '<SOS>', '<EOS>', '<UNK>']
# Ensure no overlap:
for s in specials:
    raw_tokens.discard(s)

# Final ordered vocab: specials first, then sorted rest
all_tokens = specials + sorted(raw_tokens)

# --- 1b) Rebuild mappings from scratch ---
word2index = {tok: idx for idx, tok in enumerate(all_tokens)}
index2word = {idx: tok for tok, idx in word2index.items()}
vocab_size = len(all_tokens)

print(f"Prepended specials: {specials}")
print(f"Total unique tokens from CSV: {len(raw_tokens)}")
print(f"New vocab size = {vocab_size}")

In [None]:
import random
import numpy as np
import torch
from torch.utils.data import random_split
from tqdm.notebook import tqdm  # for progress bars

# ---- 1) Reproducibility ----
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# ---- 2) Device ----
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on device:", device)

# ---- 3) Dataset & Split ----
# Assume 'dataset', 'collate_fn', and 'FinalDataset' are defined in earlier cells
total_size = len(dataset)
val_size = int(0.1 * total_size)
train_size = total_size - val_size
train_ds, val_ds = random_split(dataset, [train_size, val_size], generator=torch.Generator().manual_seed(SEED))

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, collate_fn=collate_fn)
val_loader   = DataLoader(val_ds,   batch_size=32, shuffle=False, collate_fn=collate_fn)
embed_size = 128
hidden_size = 256
batch_size = 128
num_epochs = 8
teacher_forcing_ratio = 1.0
lr = 1e-3

print(f"Train/Val split: {train_size}/{val_size} examples")

# ---- 4) Model Instantiation ----
# Reuse Encoder and Decoder from earlier cells
encoder = Encoder(vocab_size, embed_size, hidden_size).to(device)
decoder = Decoder(vocab_size, embed_size, hidden_size).to(device)

# ---- 5) Optimizer & Loss ----
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=word2index['<PAD>'])

# ---- 6) Training Loop ----


for epoch in range(1, num_epochs + 1):
    encoder.train(); decoder.train()
    train_loss = 0.0
    for inp, tgt in tqdm(train_loader, desc=f"Epoch {epoch}/{num_epochs}"):
        inp, tgt = inp.to(device), tgt.to(device)
        optimizer.zero_grad()
        enc_out, enc_hidden = encoder(inp)
        dec_hidden = enc_hidden
        batch_size, tgt_len = tgt.size()
        dec_input = torch.full((batch_size,), word2index['<SOS>'], dtype=torch.long, device=device)
        
        loss = 0.0
        for t in range(tgt_len):
            logits, dec_hidden, _ = decoder.forward_step(dec_input, dec_hidden, enc_out)
            true_tok = tgt[:, t]
            loss += criterion(logits, true_tok)
            if random.random() < teacher_forcing_ratio:
                dec_input = true_tok
            else:
                dec_input = logits.argmax(dim=1)
        loss = loss / tgt_len
        loss.backward()
        # 4) clip the gradients
        torch.nn.utils.clip_grad_norm_(
            list(encoder.parameters()) + list(decoder.parameters()),
            max_norm=1.0
        )
        optimizer.step()
        train_loss += loss.item()
    avg_train = train_loss / len(train_loader)
    
    # Validation loss
    encoder.eval(); decoder.eval()
    val_loss = 0.0
    with torch.no_grad():
        for inp, tgt in val_loader:
            inp, tgt = inp.to(device), tgt.to(device)
            enc_out, enc_hidden = encoder(inp)
            dec_hidden = enc_hidden
            batch_size, tgt_len = tgt.size()
            dec_input = torch.full((batch_size,), word2index['<SOS>'], dtype=torch.long, device=device)
            loss = 0.0
            for t in range(tgt_len):
                logits, dec_hidden, _ = decoder.forward_step(dec_input, dec_hidden, enc_out)
                true_tok = tgt[:, t]
                loss += criterion(logits, true_tok)
                dec_input = true_tok  # always teacher-forcing on val
            val_loss += (loss / tgt_len).item()
    avg_val = val_loss / len(val_loader)
    
    print(f"Epoch {epoch} — Train Loss: {avg_train:.4f}, Val Loss: {avg_val:.4f}")
    # Optionally decay teacher forcing: teacher_forcing_ratio *= 0.95

# ---- 7) Save Models ----
torch.save(encoder.state_dict(), 'models/encoder_final.pt')
torch.save(decoder.state_dict(), 'models/decoder_final.pt')
print("Models saved: encoder_final.pt, decoder_final.pt")

# ---- 8) Inference & Samples ----
special_tokens = {'<PAD>','<SOS>','<EOS>','<UNK>'}

def generate(sentence, max_len=50):
    """
    1) Masks out specials so the model never picks them.
    2) If it still outputs <UNK>, we use the attention weights
       to pick the source token it was “looking at” most.
    3) We filter out any <SOS> or <PAD> in the final list.
    """
    encoder.eval()
    decoder.eval()
    # tokenize & indices
    src_tokens = tokenize(sentence)
    src_ids    = [word2index.get(w, word2index['<UNK>']) for w in src_tokens]
    inp        = torch.tensor(src_ids, device=device).unsqueeze(0)
    # encode
    enc_out, enc_hidden = encoder(inp)
    dec_hidden = enc_hidden
    # start decoding
    dec_input  = torch.tensor([word2index['<SOS>']], device=device)
    out_tokens = []
    
    for _ in range(max_len):
        # get logits and attention
        logits, dec_hidden, attn = decoder.forward_step(dec_input, dec_hidden, enc_out)
        # mask out specials so they have zero probability
        for sp in ['<PAD>','<SOS>','<UNK>']:
            logits[:, word2index[sp]] = -1e9
        # pick next
        next_id = logits.argmax(dim=1).item()
        if next_id == word2index['<EOS>']:
            break
        
        tok = index2word[next_id]
        # if it’s still <UNK>, copy from source via attention
        if tok == '<UNK>':
            # attn shape is (1,1,src_len)
            a = attn.squeeze(0).squeeze(0)          # (src_len,)
            src_pos = a.argmax().item()
            tok = src_tokens[src_pos]
        
        # only keep real tokens
        if tok not in special_tokens:
            out_tokens.append(tok)
        
        dec_input = torch.tensor([next_id], device=device)
    
    return " ".join(out_tokens)

print("\nSample generations on validation set:")
for i in range(5):
    amb, orig = val_ds[i]
    print(f"\nAmbiguous: {' '.join(index2word.get(x.item(),'<UNK>') for x in amb)}")
    print("Generated:", generate(' '.join(index2word.get(x.item(),'<UNK>') for x in amb)))
    print("Reference:", ' '.join(index2word.get(x.item(),'<UNK>') for x in orig[1:-1]))

Running on device: cuda
Train/Val split: 839119/93235 examples




Epoch 1/8:   0%|          | 0/26223 [00:00<?, ?it/s]

Epoch 1 — Train Loss: 0.8943, Val Loss: 0.5281


Epoch 2/8:   0%|          | 0/26223 [00:00<?, ?it/s]

Epoch 2 — Train Loss: 0.4511, Val Loss: 0.4521


Epoch 3/8:   0%|          | 0/26223 [00:00<?, ?it/s]

Epoch 3 — Train Loss: 0.3809, Val Loss: 0.4190


Epoch 4/8:   0%|          | 0/26223 [00:00<?, ?it/s]

Epoch 4 — Train Loss: 0.3425, Val Loss: 0.4024


Epoch 5/8:   0%|          | 0/26223 [00:00<?, ?it/s]

Epoch 5 — Train Loss: 0.3172, Val Loss: 0.3863


Epoch 6/8:   0%|          | 0/26223 [00:00<?, ?it/s]

Epoch 6 — Train Loss: 0.2989, Val Loss: 0.3823


Epoch 7/8:   0%|          | 0/26223 [00:00<?, ?it/s]

Epoch 7 — Train Loss: 0.2854, Val Loss: 0.3806


Epoch 8/8:   0%|          | 0/26223 [00:00<?, ?it/s]

Epoch 8 — Train Loss: 0.2734, Val Loss: 0.3751
Models saved: encoder_final.pt, decoder_final.pt

Sample generations on validation set:

Ambiguous: also , include any marketing meetings discussed . with outside parties you attended and topics <UNK> ideas
Generated: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Reference: also , include any marketing meetings with outside parties you attended and topics <UNK> ideas discussed .

Ambiguous: gasunie now claims its activity system is fully open and transparent
Generated: open activity is fully open and transparent open and transparent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Reference: gasunie now claims its pipeline system is fully open and transparent .

Ambiguous: thanks brent sara shackleton <UNK> ect 26 <UNK> 01 <UNK> 2000 07 <UNK> 40 pm i would like to pursue with marval with thanks .
Generated: thanks , sara sara shackleton < http . . . . . . 

In [4]:
# load decoder_final and encoder_final
embed_size = 128
hidden_size = 256
encoder = Encoder(vocab_size, embed_size, hidden_size)
decoder = Decoder(vocab_size, embed_size, hidden_size)
encoder.load_state_dict(torch.load('models/encoder_final.pt', map_location=device))
decoder.load_state_dict(torch.load('models/decoder_final.pt', map_location=device))
encoder.to(device)
decoder.to(device)
special_tokens = {'<PAD>','<SOS>','<EOS>','<UNK>'}
def generate(sentence, max_len=50):
    """
    1) Masks out specials so the model never picks them.
    2) If it still outputs <UNK>, we use the attention weights
       to pick the source token it was “looking at” most.
    3) We filter out any <SOS> or <PAD> in the final list.
    """
    encoder.eval()
    decoder.eval()
    # tokenize & indices
    src_tokens = tokenize(sentence)
    src_ids    = [word2index.get(w, word2index['<UNK>']) for w in src_tokens]
    inp        = torch.tensor(src_ids, device=device).unsqueeze(0)
    # encode
    enc_out, enc_hidden = encoder(inp)
    dec_hidden = enc_hidden
    # start decoding
    dec_input  = torch.tensor([word2index['<SOS>']], device=device)
    out_tokens = []
    
    for _ in range(max_len):
        # get logits and attention
        logits, dec_hidden, attn = decoder.forward_step(dec_input, dec_hidden, enc_out)
        # mask out specials so they have zero probability
        for sp in ['<PAD>','<SOS>','<UNK>']:
            logits[:, word2index[sp]] = -1e9
        # pick next
        next_id = logits.argmax(dim=1).item()
        if next_id == word2index['<EOS>']:
            break
        
        tok = index2word[next_id]
        # if it’s still <UNK>, copy from source via attention
        if tok == '<UNK>':
            # attn shape is (1,1,src_len)
            a = attn.squeeze(0).squeeze(0)          # (src_len,)
            src_pos = a.argmax().item()
            tok = src_tokens[src_pos]
        
        # only keep real tokens
        if tok not in special_tokens:
            out_tokens.append(tok)
        
        dec_input = torch.tensor([next_id], device=device)
    
    return " ".join(out_tokens)

sent1 = "Hope you too, to enjoy it as my deepest wishes."
sent2 = "Also, kindly remind me please, if the doctor still plan for the acknowledgments section edit before he sending again."
sentences = [sent1, sent2]
for sent in sentences:
    print(f"Input: {sent}")
    print("Generated:", generate(sent))




Input: Hope you too, to enjoy it as my deepest wishes.
Generated: you too , to enjoy it as my deepest wishes .
Input: Also, kindly remind me please, if the doctor still plan for the acknowledgments section edit before he sending again.
Generated: if , kindly remind me please , if the doctor still plan for the pieces section edit before he sending again .
