# Workflow
## 1. Preprocessing & Vocabulary Construction
#### 1.1 Rule-Based Tokenization
Implement a deterministic tokenizer that splits on whitespace, punctuation, and language-specific clitics, following best practices in NLP pipeline design
#### 1.2 POS Tagging
Use nltk to reliably POS tag the sentences of the text 
#### 1.3 Finite-State Lemmatization
Construct a two-level morphological analyzer based on Koskenniemi’s framework, encoding suffix-stripping and morphophonological alternation rules as finite-state transducers to derive lemmas
#### 1.4 Vocabulary Lookup Table
Aggregate unique lemma–POS pairs into a lexicon, recording token frequencies and contextual usage to constrain later transformations.
## 2. Sentence Generation from Vocabulary
#### 2.1 Template-Driven Sentence Synthesis
- Define a diverse set of syntactically richer templates spanning multiple clause types (e.g., simple transitive, passive, ditransitive, relative clauses, subordinations).
- Represent each template as a shallow dependency skeleton with labeled slots
- Sample lemmas from the vocabulary lookup table using frequency-weighted or stratified selection to fill each slot, and enforce feature unification (gender, number, person) to guarantee agreement.
- Linearize filled skeletons into surface strings via deterministic ordering rules, then score and rank candidates using a pretrained language model or n‑gram extractor to retain the top‑K most fluent instantiations per template.
- Optionally apply semantic compatibility checks—e.g., distributional similarity thresholds or selectional preference models—to filter out semantically anomalous slot combinations.

#### 2.2 Constraint Enforcement
Apply simple agreement and subcategorization checks—using POS sequences and verb valency from the lookup table—to filter out invalid or ungrammatical const
## 3. Seed Sentence Curation
#### 3.1 Lexical Clarity Selection
Filter generated sentences to retain only those where each lemma occurs with a single dominant sense in context, as characterized in Navigli’s survey.
#### 3.2 Syntactic Clarity Selection
Ensure prepositional-phrase attachments are unambiguous by selecting sentences where Hindle & Rooth’s likelihood-ratio criterion for PP attachment strongly favors one parse (e.g., λ ≫ 0).
## 4. Ambiguity Transformation Rules
#### 4.1 Lexical Ambiguity (Hononym Substitution)
Replace unambiguous lemmas with contextually plausible homonyms drawn from the corpus lexicon, ensuring multiple sense readings without changing surrounding syntax
#### 4.2 Structural Ambiguity (PP Attachment Flip)
For sentences containing a PP, swap its attachment site—reattach a PP from its original head (noun vs. verb) to the alternate—to create two distinct parses
#### 4.3 Referential Ambiguity (Pronoun Insertion)
Insert or replace noun phrases with pronouns in Winograd-style schemas, introducing antecedent ambiguity that requires world knowledge or context to resolve
## 5. Optional Model-Based Refinement
Use a reversed syntactically controlled paraphrase network (trained as in Iyyer et al. 2018) to regenerate each ambiguous sentence, conditioning on its original unambiguous parse template to improve fluency and adherence to the target ambiguity pattern

# IMPORTS

In [1]:
import re
import math
import spacy
from collections import defaultdict, Counter
from typing import List, Tuple, Dict
import nltk
import pandas as pd
import random
from nltk.corpus import wordnet as wn, brown
nltk.download('wordnet', quiet=True)
nltk.download('brown', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nlp = spacy.load("en_core_web_sm")
import itertools
from itertools import product
import csv
from tqdm.notebook import tqdm

## Texts we are going to be working with
#### Text 1:

Today is our dragon boat festival, in our Chinese culture, to celebrate it with all safe and great in 
our lives. Hope you too, to enjoy it as my deepest wishes. 
Thank your message to show our words to  the doctor, as his next contract checking, to all of us. 
I got this message to see the approved message. In fact, I have received the message from  the 
professor, to show me, this, a couple of days ago.  I am very appreciated  the full support of the 
professor, for our Springer proceedings publication
#### Text 2:
During our final discuss, I told him about the new submission — the one we were waiting since 
last autumn, but the updates was confusing as it not included the full feedback from reviewer or 
maybe editor?
 Anyway, I believe the team, although bit delay and less communication at recent days, they really 
tried best for paper and cooperation. We should be grateful, I mean all of us, for the acceptance 
and efforts until the Springer link came finally last week, I think.
 Also, kindly remind me please, if the doctor still plan for the acknowledgments section edit before 
he sending again. Because I didn’t see that part final yet, or maybe I missed, I apologize if so.
 Overall, let us make sure all are safe and celebrate the outcome with strong coffee and future 
targets

In [21]:
text1 = (
    "Today is our dragon boat festival, in our Chinese culture, to celebrate it "
    "with all safe and great in our lives. Hope you too, to enjoy it as my "
    "deepest wishes. Thank your message to show our words to the doctor, as his "
    "next contract checking, to all of us. I got this message to see the approved "
    "message. In fact, I have received the message from the professor, to show me, "
    "this, a couple of days ago. I am very appreciated the full support of the "
    "professor, for our Springer proceedings publication"
)
text2 = (
    "During our final discuss, I told him about the new submission — the one we were "
    "waiting since last autumn, but the updates was confusing as it not included the "
    "full feedback from reviewer or maybe editor? Anyway, I believe the team, although "
    "bit delay and less communication at recent days, they really tried best for paper "
    "and cooperation. We should be grateful, I mean all of us, for the acceptance and "
    "efforts until the Springer link came finally last week, I think. Also, kindly remind "
    "me please, if the doctor still plan for the acknowledgments section edit before he "
    "sending again. Because I didn’t see that part final yet, or maybe I missed, I apologize "
    "if so. Overall, let us make sure all are safe and celebrate the outcome with strong "
    "coffee and future targets"
)

# 1. Preprocessing & Vocabulary Construction

## 1.1 Rule Based Tokenizer

In [22]:
# Cell 3
def tokenize(text: str) -> List[str]:
    """
    Tokenizes the input text into a list of tokens.
    """
    text = text.replace('\n', ' ')
    pattern = r"""(
        (?:[A-Za-z]+(?:'t|'re|'ve|'ll|'d|'s|'m))  # Words with clitics
        |[A-Za-z]+                                # Words
        |\d+(?:\.\d+)?                            # Numbers, including decimals
        |[.,!?;:"()\[\]\/\\]                     # Punctuation
    )"""
    return re.findall(pattern, text, flags=re.VERBOSE)

def tokenize_paragraph(paragraph: str) -> List[List[str]]:
    """
    Tokenizes a paragraph into sentences then words.
    """
    sentences = re.split(r'(?<=[\.!?])\s+', paragraph)
    return [tokenize(s) for s in sentences if s]


### This tokenizer:
- **splits on whitespaces** (after normalizing new lines)
- **Handles punctuation** as seperate tokens
- **Isolates English Clitics** (e.g. ’t, ’re, ’ve, ’ll, ’d, ’s, ’m) from their hosts

The function uses a single regular expression so that the behavior is transparent and easy to adapt (for example, to add or remove clitics or punctuation marks).


## 1.2 POS Tagging
#### nltk's `average perceptron tagger` can find the following tags which are plenty for our two paragrapgs
| Tag   | Meaning                                         |
|-------|-------------------------------------------------|
| CC    | Coordinating conjunction                        |
| CD    | Cardinal number                                 |
| DT    | Determiner                                      |
| EX    | Existential “there”                             |
| FW    | Foreign word                                    |
| IN    | Preposition or subordinating conjunction        |
| JJ    | Adjective                                       |
| JJR   | Adjective, comparative                          |
| JJS   | Adjective, superlative                          |
| LS    | List item marker                                |
| MD    | Modal auxiliary                                 |
| NN    | Noun, singular or mass                          |
| NNS   | Noun, plural                                    |
| NNP   | Proper noun, singular                           |
| NNPS  | Proper noun, plural                             |
| PDT   | Predeterminer                                   |
| POS   | Possessive ending                               |
| PRP   | Personal pronoun                                |
| PRP$  | Possessive pronoun                              |
| RB    | Adverb                                          |
| RBR   | Adverb, comparative                             |
| RBS   | Adverb, superlative                             |
| RP    | Particle                                        |
| TO    | “to” (as preposition or infinitive marker)      |
| UH    | Interjection                                    |
| VB    | Verb, base form                                 |
| VBD   | Verb, past tense                                |
| VBG   | Verb, gerund or present participle              |
| VBN   | Verb, past participle                           |
| VBP   | Verb, non-3rd-person singular present            |
| VBZ   | Verb, 3rd-person singular present               |
| .     | Sentence-final punctuation                      |
| ,     | Comma                                           |

In [None]:
texts = [text1, text2]
records = []

for text_id, text in enumerate(texts, start=1):
    for sentence_id, tokens in enumerate(tokenize_paragraph(text), start=1):
        tagged = nltk.pos_tag(tokens)
        for token, pos in tagged:
            records.append({
                'text_id': text_id,
                'sentence_id': sentence_id,
                'token': token,
                'pos': pos
            })

pos_tags_df = pd.DataFrame(records)
pos_tags_df.to_csv('data/pos_tags.csv', index=False)

## 1.3 Finite-State Lemmatization

We implement a **two-level morphological model** in a simplified form:

1. **Suffix rules** (longest first):
   - `-ies` → `-y` (e.g. *studies* → *study*)  
   - `-ves` → `-f` (e.g. *leaves* → *leaf*)  
   - `-es` → `∅` (e.g. *boxes* → *box*)  
   - `-ing` → `∅` (e.g. *running* → *run*)  
   - `-ed` → `∅` (e.g. *jumped* → *jump*)  
   - `-s` → `∅`, except when the word ends in `-ss` (e.g. *glass* remains `glass`)  
2. **Context-free application**, akin to Koskenniemi’s two-level rules, but here executed sequentially in code rather than via a compiled FST[^1][^11].

This cascade covers the most common English inflectional patterns and approximates a finite-state transducer for lemmatization[^2][^5].


In [24]:
def lemmatize_word(word: str) -> str:
    """
    Lemmatizes a word based on simple English rules.
    """
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("ves") and len(word) > 4:
        return word[:-3] + "f"
    if word.endswith("es") and len(word) > 3:
        return word[:-2]
    if word.endswith("ing") and len(word) > 4:
        return word[:-3]
    if word.endswith("ed") and len(word) > 3:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

In [None]:
data = []
texts = [text1, text2]

for text_id, text in enumerate(texts, start=1):
    for sentence_id, tokens in enumerate(tokenize_paragraph(text), start=1):
        for token in tokens:
            lemma = lemmatize_word(token.lower())
            data.append({
                'text_id': text_id,
                'sentence_id': sentence_id,
                'token': token,
                'lemma': lemma
            })

lemmatized_df = pd.DataFrame(data)
lemmatized_df.to_csv('data/lemmatized_tokens.csv', index=False)

## 1.4 Vocabulary Lookup Table
1. **Reading your CSVs**  
   Loads the two pre-existing DataFrames: one with POS tags and one with lemmas.  
2. **Renaming**  
   Aligns the column names (`Original_Token` vs. `token`) so that pandas can match rows correctly.  
3. **Merging**  
   Performs a left-join on `TextID` and the surface token, ensuring every POS-tagged token gets its lemma attached.  
4. **Aggregation**  
   - **`frequency`**: how many times each `(lemma, POS)` pair occurs.  
   - **`examples`**: up to three distinct original tokens showing that lemma+POS in context.  
5. **Saving**  
   Exports the final lexicon to `vocab_lookup.csv`, ready for downstream use in your lexicon-driven transformations.

In [None]:
# 1. Load CSVs
pos_tags_df      = pd.read_csv('pos_tags.csv')
lemmatized_df    = pd.read_csv('lemmatized_tokens.csv')

# 2. Merge on text_id, sentence_id, token
merged_df = pd.merge(
    pos_tags_df,
    lemmatized_df,
    on=['text_id', 'sentence_id', 'token'],
    how='left'
)

# 3. Build vocabulary lookup
vocab_lookup_df = (
    merged_df
      .groupby(['lemma', 'pos'], as_index=False)
      .agg(
           tokens    = ('token', lambda toks: list(pd.unique(toks))),
          frequency=('lemma', 'size'),
          examples=('token', lambda toks: list(pd.unique(toks)[:3]))
      )
)
vocab_lookup_df = vocab_lookup_df[
    ['tokens', 'lemma', 'pos', 'frequency', 'examples']
]
# 4. Save to CSV
vocab_lookup_df.to_csv('data/vocab_lookup.csv', index=False)

# 2. Sentence Generation from Vocabulary

Generating syntactically correct and semantically clear sentences can be achieved by using template-based generation with a controlled vocabulary of part-of-speech (POS) tags. In this approach, we define sentence structures as sequences of POS tags (e.g., Determiner + Noun + Verb + Object) and then populate these structures with actual words that exemplify each POS category. This method is a simple yet powerful form of Natural Language Generation (NLG) that avoids randomness: given the same templates and vocabulary, it will always produce the same sentences (i.e., it is fully deterministic).Such template-based generation is ideal for tasks that require control and consistency over language output, significantly reducing ambiguity and ensuring grammatical correctness. The use of POS tag templates also aligns with the concept of syntactic templates in the literature, which are sequences of POS tags representing sentence patterns. 

## Approach and Template Design
1. **Corpus Selection**  
   - Load sentences from the Brown corpus’s “government” and “learned” categories to approximate a formal/letter register.

2. **Custom Parsing**  
   - Tokenize and POS-tag each sentence using NLTK’s built-in tools, rather than a heavyweight parser.

3. **Tag Filtering**  
   - Define exactly which Penn Treebank tags your generator knows (e.g. `DT`, `NN`, `PRP$`, etc.).
   - Discard any sentence whose POS sequence contains tags outside this whitelist.

4. **Length Constraint**  
   - Only consider sentences between 10 and 18 tokens long, so that templates produce sufficiently rich, paragraph-like seed sentences.

5. **Template Extraction**  
   - Count how often each “clean” POS sequence (template) occurs in the filtered corpus.
   - Select the top 50 most frequent sequences.

6. **Template Formatting**  
   - Build a dictionary where each key is the space-joined tag string (e.g. `"PRP VBD DT NN IN DT JJ NN"`) and each value is the corresponding list of tags.  
   - This map plugs directly into your existing generator loop (`generate_sentences(pattern, vocab_map)`).


---
## Pipeline Implementation in Python
- **Vocabulary Construction:** We define a dictionary mapping each POS tag to a list of example words (lemmas) for that tag. This can be derived from the provided lookup CSV (for reproducibility, we show a hard-coded small selection of words per tag in this example). In a real scenario, one could parse the full lookup table to build this dictionary. Each list is kept in a fixed order so that iteration is deterministic. (We also ensure the words are in the correct case for usage; e.g., pronoun “I” is capitalized because it must always appear as “I”.)
- **Template Definitions:** We define templates as patterns of POS tags. For convenience, each template is represented as a list of tags (or a string of tags). We also store an example template pattern description for clarity. These templates correspond to structures, for instance, ["DT","NN","VBZ","DT","NN"] represents Det + Noun + Verb + Det + Noun.
- **Sentence Generation Function:** A function generate_sentences(pattern) uses the vocabulary to produce all possible sentences for a given tag pattern. It does so by taking the Cartesian product of the word lists for each tag in the pattern (ensuring every combination of choices is covered)
researchgate.net
. For each combination, it joins the words into a sentence, capitalizes the first letter, and adds a period at the end. (Capitalization and punctuation are added as a post-processing step to make the output look like proper English sentences.) Because we iterate in a consistent order (e.g., the order of tags in the template and the order of words in each tag’s list never changes), the generation is deterministic in both content and order of sentences.

In [None]:
raw_sents = [' '.join(sent) for sent in brown.sents(categories=['government', 'learned'])]
# — 2) Homemade parser: tokenize + POS-tag —
def parse_sent(text):
    toks = nltk.word_tokenize(text)
    tags = [tag for (_, tag) in nltk.pos_tag(toks)]
    return toks, tags

# — 3) Define your allowed Penn Treebank tags —
vocab_lookup = pd.read_csv('data/vocab_lookup.csv')
allowed_tags = set(vocab_lookup['pos'].dropna().unique())
allowed_tags |= {'.', ','}

# — 4) Count templates of length 10–18, discarding any containing an unknown tag —
tpl_counts = Counter()
for sent in raw_sents:
    toks, tags = parse_sent(sent)
    L = len(toks)
    if 12 <= L <= 20 and set(tags).issubset(allowed_tags):
        tpl = tuple(tags)
        tpl_counts[tpl] += 1

# — 5) Take the top 50 most frequent “clean” templates —
top_templates = [tpl for tpl, _ in tpl_counts.most_common(50)]

# — 6) Build the dict you’ll feed into your generator —
templates = {
    " ".join(tpl): list(tpl)
    for tpl in top_templates
}

### Vocabulary Coverage & Enrichment

We now split our lookup vocabulary into **primary** and **secondary** pools, then generate two batches of sentences:

1. **Primary (Coverage) Pass**  
   - **Primary** = all lemmas from `vocab_lookup.csv`.  
   - For each primary lemma _ℓ_ of POS tag _t_:  
     1. Pick a random template that contains tag _t_.  
     2. **Force** ℓ into its slot.  
     3. Fill remaining slots with a mix of primary + secondary lemmas.  
   - **Sentences generated**:  
     $$
       \sum_{t}\bigl|\text{primary}_t\bigr|
       \;=\;154
     $$

2. **Secondary (Enrichment) Pass**  
   - **Secondary** = all other WordNet lemmas (minus primaries), pruned to max 20 per tag.  
   - For each of the **50** templates:  
     - Fill **all** slots with secondary lemmas only.  
     - Repeat **500** times per template.  
   - **Sentences generated**:  
     $$
       50 \times 500 \;=\;25{,}000
     $$

---

### Total Sentences

$$
154\;(\text{coverage})\;+\;25{,}000\;(\text{enrichment})
\;=\;25{,}154\quad\text{sentences.}
$$


In [None]:
# --- A) Build primary vs secondary vocab ---

# --- 1) Load your lookup table ⇒ primary_vocab + gather CSV tags ---
vocab_lookup = pd.read_csv('data/vocab_lookup.csv')  # cols: pos, lemma
lookup_map = defaultdict(list)
for _, row in vocab_lookup.iterrows():
    tag, lemma = row['pos'], row['lemma']
    if pd.notna(tag) and pd.notna(lemma):
        lookup_map[tag].append(lemma)
primary_vocab = { tag: sorted(lookup_map[tag]) for tag in lookup_map }
all_tags      = set(primary_vocab.keys())

# --- 2) Full PTB→WordNet mapping for supported tags ---
_full_ptb2wn = {
    'NN': wn.NOUN, 'NNS': wn.NOUN, 'NNP': wn.NOUN, 'NNPS': wn.NOUN,
    'VB': wn.VERB, 'VBD': wn.VERB, 'VBG': wn.VERB, 'VBN': wn.VERB,
    'VBP': wn.VERB, 'VBZ': wn.VERB,
    'JJ': wn.ADJ, 'JJR': wn.ADJ, 'JJS': wn.ADJ,
    'RB': wn.ADV, 'RBR': wn.ADV, 'RBS': wn.ADV
}
ptb2wn = { tag: wn_pos for tag, wn_pos in _full_ptb2wn.items() if tag in all_tags }
brown_map = defaultdict(set)
for sent in brown.sents(categories=['government','learned']):
    toks = nltk.word_tokenize(' '.join(sent))
    for word, tag in nltk.pos_tag(toks):
        if tag in all_tags:
            brown_map[tag].add(word.lower())

# --- 4) Build source_vocab: primary from CSV + secondary from WordNet or Brown ---
source_vocab = {}
for tag in all_tags:
    prim_list = sorted(set(primary_vocab[tag]))
    if tag in ptb2wn:
        all_lemmas = set(wn.all_lemma_names(pos=ptb2wn[tag]))
        prim_set   = set(prim_list) & all_lemmas
        # use WordNet lemmas minus your primaries
        sec_list   = sorted(all_lemmas - prim_set)
    else:
        # use Brown tokens minus your primaries
        sec_list   = sorted(brown_map[tag] - set(prim_list))
    source_vocab[tag] = {
        'primary': sorted(prim_set if tag in ptb2wn else prim_list),
        'secondary': sec_list
    }

# --- 5) Prune secondary lists so they don’t explode ---
def prune_secondary(vmap, max_per_slot=20):
    pruned = {}
    for tag, pools in vmap.items():
        sec = pools['secondary']
        pruned[tag] = {
            'primary': pools['primary'],
            'secondary': (random.sample(sec, max_per_slot)
                          if len(sec) > max_per_slot else sec.copy())
        }
    return pruned

vocab = prune_secondary(source_vocab, max_per_slot=20)
# save to CSV for later use
vocab_df = pd.DataFrame.from_dict(vocab, orient='index')
vocab_df.to_csv('vocab.csv', index=False)
# --- B) Two‐phase sentence generation ---

# Helper to fill a pattern once, forcing a specific word in one slot
def fill_template(pattern, forced_tag=None, forced_word=None):
    slots = []
    for tag in pattern:
        if tag == forced_tag:
            slots.append([forced_word])
        else:
            # mix primary & secondary for the other slots
            pool = vocab[tag]['secondary'] + vocab[tag]['primary']
            slots.append(pool)
    choice = [random.choice(pool) for pool in slots]
    return " ".join(choice).capitalize() + "."

# figure out which tags actually occur in your templates
template_tags = {
    tag
    for pattern in templates.values()
    for tag in pattern
}

# 1) COVERAGE: one sentence per (tag, primary_lemma),
#    but only for tags that appear in at least one template
coverage = []
for tag, pools in vocab.items():
    if tag not in template_tags:
        continue            # skip any tag not in your templates
    for lemma in pools['primary']:
        # now we know there is at least one template that contains this tag
        candidates = [tpl for tpl,p in templates.items() if tag in p]
        tpl        = random.choice(candidates)
        pattern    = templates[tpl]
        sent       = fill_template(pattern, forced_tag=tag, forced_word=lemma)
        coverage.append({'sentence': sent, 'pattern': tpl})

# 2) ENRICHMENT: sample many more with both secondary & primary words
enriched = []
SAMPLES_PER_TEMPLATE = 500
for tpl, pattern in templates.items():
    for _ in range(SAMPLES_PER_TEMPLATE):
        # fill_template(pattern) mixes secondary + primary for every slot
        s = fill_template(pattern)
        enriched.append({'sentence': s, 'pattern': tpl})

# 3) Combine and save
all_sents = coverage + enriched
df = pd.DataFrame(all_sents)
df.to_csv('data/generated_sentences.csv', index=False)

# 3. Seed Sentence Curation


# 3.1 Lexical Clarity Filtering

We require each lemma in a sentence to have one sense (POS) that accounts for ≥ 90 % of its corpus usage:

$$
p_{\max}(L)
= \frac{\max_{s} f(L,s)}{\sum_{s} f(L,s)}
\;\ge\; \theta\;(=0.9).
$$

In [29]:
# === Parameters ===
THETA_LEX  = 0.9
PP_CUTOFF  = 0.9

# Load generated sentences
gen_df = pd.read_csv('generated_sentences.csv')
gen_df['sentence_id'] = gen_df.index

# Load vocab lookup
vocab_df = pd.read_csv('vocab_lookup.csv')

# Build lemma→POS frequency map
lemma_pos = (
    vocab_df
      .groupby(['lemma', 'pos'])['frequency']
      .sum()
      .unstack(fill_value=0)
)
lemma_pos['total'] = lemma_pos.sum(axis=1)
lemma_pos['p_max'] = lemma_pos.max(axis=1) / lemma_pos['total']

# Ambiguous lemmas
ambiguous_lemmas = set(lemma_pos[lemma_pos['p_max'] < THETA_LEX].index)

# Simple tokenizer
tok_re = re.compile(r'\b\w+\b')
def tokenize_sentence(text: str) -> List[str]:
    return tok_re.findall(text.lower())

def is_lexically_clear(sentence: str) -> bool:
    toks = tokenize_sentence(sentence)
    return not any(tok in ambiguous_lemmas for tok in toks)

gen_df['lex_clear'] = gen_df['sentence'].apply(is_lexically_clear)

# 3.2 Syntactic Clarity Filtering (PP Attachment)

For each PP (`IN` + noun) in the **corpus**, we estimate

$$
P(p \mid \text{noun}=N)
\approx \frac{C(N,p)+1}{C(N)+|P|}, 
\quad
P(p \mid \text{verb}=V)
\approx \frac{C(V,p)+1}{C(V)+|P|}
$$

where we’ve added +1 (Laplace) smoothing and `|P|` = number of distinct prepositions.  
Then, in each candidate sentence, if a PP follows a noun *and* a verb, we keep it **only** if

```text
max(P(p|N), P(p|V)) ≥ 0.9

In [None]:
# Load POS tags and lemmatized tokens
pos_df   = pd.read_csv('pos_tags.csv')
lemma_df = pd.read_csv('lemmatized_tokens.csv')

# Merge on text_id, sentence_id, token
cf_df = pos_df.merge(
    lemma_df,
    how='inner',
    on=['text_id', 'sentence_id', 'token']
)

# Prepositions set
PREPS = set(cf_df[cf_df['pos'] == 'IN']['token'].str.lower())
P     = len(PREPS)

# Counters
C_np = defaultdict(int)
C_vp = defaultdict(int)
C_n  = Counter()
C_v  = Counter()

# Populate counts
for sid, group in cf_df.groupby('sentence_id'):
    rows = list(group[['token','pos','lemma']].itertuples(index=False, name=None))
    for tok, ptag, lem in rows:
        if ptag.startswith('NN'):
            C_n[lem] += 1
        elif ptag.startswith('VB'):
            C_v[lem] += 1
    for i, (tok, ptag, lem) in enumerate(rows):
        prep = tok.lower()
        if ptag == 'IN' and prep in PREPS and i > 0:
            prev_tok, prev_pos, prev_lem = rows[i-1]
            if prev_pos.startswith('NN'):
                C_np[(prev_lem, prep)] += 1
            for j in range(i-1, -1, -1):
                if rows[j][1].startswith('VB'):
                    C_vp[(rows[j][2], prep)] += 1
                    break

# Smoothed probability
def P_given(counts, totals, key):
    return (counts.get(key, 0) + 1) / (totals[key[0]] + P)

def is_syntactically_clear(sid: int) -> bool:
    rows = list(
        cf_df[cf_df['sentence_id'] == sid][['token','pos','lemma']]
        .itertuples(index=False, name=None)
    )
    for i, (tok, ptag, lem) in enumerate(rows):
        prep = tok.lower()
        if ptag == 'IN' and prep in PREPS and i > 0:
            prev_tok, prev_pos, prev_lem = rows[i-1]
            pn = (P_given(C_np, C_n, (prev_lem, prep))
                  if prev_pos.startswith('NN') else 0.0)
            verb_lem = next(
                (r_lem for _, r_pos, r_lem in reversed(rows[:i])
                 if r_pos.startswith('VB')),
                None
            )
            pv = (P_given(C_vp, C_v, (verb_lem, prep)) if verb_lem else 0.0)
            if max(pn, pv) < PP_CUTOFF:
                return False
    return True

gen_df['syn_clear'] = gen_df['sentence_id'].apply(is_syntactically_clear)

# Final seed sentences
curated = gen_df.loc[gen_df['lex_clear'] & gen_df['syn_clear'], 'sentence'].tolist()
print("## Curated Seed Sentences\n")
for s in curated:
    print(f"- {s}")

# Save to CSV
curated_df = pd.DataFrame(curated, columns=['sentence'])
curated_df.to_csv('data/seed_sentences.csv', index=False)

## Curated Seed Sentences

- He frap and chuck neither next delay , link along lacustrine junky , and recent frostian submission lighting_fixture ..
- Himself chuck again thi outcome at top poems thi word though us think invite !.
- Timber , hareem foreign_terrorist_organization white_fritillary , support junky , eloquent pageantry mean proceeding to fall_away ..
- Be hypercalcemia standardise this chinese communication out another lacustrine editor today black_person at teacher professor berlage ?.
- One light-heartedly fagot send virus an coral_bean dur rancidness to step_on crotal though themselves put_on_the_line programme ..
- Themselves still lace wait overflow half bit because rancidness to let white_fritillary out one mean tail ?.
- Thi silene_caroliniana muck_about to half anorthic & part , but me partially desensitize one plan !.
- Overleaf illiberally awaken acceptance sunbathe na see acceptance gravity_wave neither acceptance ..
- Provide a hasher anorthic autumn like theor

# 4. Ambiguity

## 4.1 Lexical Ambiguity Transformation
**objective**: Introduce lexical ambiguity by replacing a content word in the sentence with a homonym or polysemous word that fits the context. Lexical ambiguity arises when a word has multiple distinct meanings (i.e. homonymy or polysemy). For example, “organ” can mean a musical instrument or a body part, and “bank” can refer to a financial institution or a river shore. Using such words in a context that doesn’t resolve the sense leads to multiple possible interpretations. We prefer ambiguous words whose different senses are comparably common, so that one reading doesn’t overwhelmingly dominate. This ensures the sentence genuinely supports more than one “sense” reading.

**methodology**: We identify a content lemma in the seed sentence – typically a noun or verb (excluding function words) – and substitute it with a contextually plausible ambiguous word from our corpus-derived vocabulary. To maintain contextual plausibility, the replacement should match the original word’s part of speech and fit the sentence’s syntax and semantics. We leverage the provided (which lists lemmas and their frequencies in the corpus) to pick a substitute that the corpus is familiar with. We also consider words known to have multiple senses in general usage (for example, “paper”, “check”, “bit”, “contract”). The context around the word is checked to avoid strong disambiguating cues.  For instance, if the seed is “The professor fixed the organ in the chapel”, the word “organ” is already ambiguous (pipe organ vs. body organ) but the context “chapel” strongly favors the musical instrument meaning. We might instead substitute a different word or adjust context to keep both meanings plausible.

We implement a function `introduce_lexical_ambiguity(sentence, tokens, pos_tags, vocab)`
1. Parses the sentence into tokens and POS tags (we can use the provided `lemmatized_tokens.csv` and `pos_tags.csv` to retrieve this information).
2. Selects a candidate token for replacement – e.g. the main noun or verb – ensuring it’s not a stopword or punctuation.
3. Looks up the token’s lemma in a prepared list or dictionary of ambiguous lemmas (this list can be compiled from known homonyms or by analyzing the corpus for words with multiple senses). If the token itself is not ambiguous, choose a replacement lemma that is a homonym and fits syntactically.
4. Ensures the replacement does not alter grammatical number or tense. For example, if the original word was plural, the new word should also be plural if possible. Minor morphological adjustments are applied if needed (our vocabulary lookup provides examples which we can use to get the correct surface form).
5. Reconstructs the sentence with the new word in place of the original, preserving the rest of the sentence. The output sentence should still be fluent and grammatical, but now contains a lexically ambiguous element.


In [31]:
def introduce_lexical_ambiguity(sentence, tokens, lemmas, pos_tags, vocab_list):
    candidates = [(lemma, pos, idx) for idx, (lemma, pos) in enumerate(zip(lemmas, pos_tags)) 
                  if pos.startswith(('NN', 'VB', 'JJ'))]
    if not candidates:
        return None
    
    orig_lemma, orig_pos, orig_idx = random.choice(candidates)
    ambiguous_candidates = [w for w, p in vocab_list if p == orig_pos and w != orig_lemma]
    ambiguous_candidates = [w for w in ambiguous_candidates 
                            if len(wn.synsets(w, pos=wn.NOUN if orig_pos.startswith('NN') else 
                                                    wn.VERB if orig_pos.startswith('VB') else wn.ADJ)) > 1]
    replacement = random.choice(ambiguous_candidates) if ambiguous_candidates else None
    if not replacement:
        homonyms = [w for w, p in vocab_list if w == orig_lemma and p != orig_pos]
        replacement = orig_lemma if homonyms else None
    
    if not replacement:
        return None
    
    new_tokens = tokens.copy()
    new_tokens[orig_idx] = replacement
    return ' '.join(new_tokens)

## 4.2 Structural Ambiguity Transformation
**Objective:** Introduce structural ambiguity by altering the attachment of a prepositional phrase (PP) or similar modifier in the sentence. A classic structural ambiguity is PP-attachment ambiguity: a PP can modify either the verb or a noun, leading to different interpretations. For example, “I bought a computer with a GPU” is ambiguous – “with a GPU” could describe the computer’s component (attaching to the noun computer) or the manner of buying (attaching to the verb bought).Syntactically both attachments are possible, and extra knowledge is needed to decide which is intended. Our goal is to create such ambiguities by repositioning or re-scoping existing phrases in the seed sentence. This follows the observation by Hindle & Rooth (1993) that PP attachments can often swing between two plausible heads if the sentence is structured appropriately.
**Methodology**: We search for sentences containing a prepositional phrase, typically identified by a preposition (e.g., with, in, on, by, for, during, before) followed by a noun phrase. If a sentence has no prepositional phrase, we may look for other attachable modifiers (e.g., adverbial clauses) to create ambiguity, but PP is the primary target. Once a PP is found, we determine its current attachment: is it modifying a noun or the verb? This can be done via a dependency parse or by heuristic – for instance, if the PP immediately follows a noun without a comma or pause, it’s likely attached to that noun; if it comes after the verb and object, it may attach to the verb phrase. To introduce ambiguity, we produce a variant where the PP could attach to a different head than in the original. We do not change the linear order of words drastically; instead, we may insert or remove minor function words or punctuation, or slightly reorder phrases, so that the PP’s position allows an alternate parse.
- If originally the PP was part of the noun phrase (i.e., it specified the noun), we try to detach it so it can modify the verb. One approach is to move the PP to the end of the sentence if it wasn’t already. For example, original: “The chef [NP the cake on the plate] dropped.” (unambiguous: the cake on the plate was dropped). We can transform to: “The chef dropped the cake on the plate.” Now “on the plate” could describe where the chef dropped the cake (attached to the verb) or which cake (attached to cake). Another strategy is to remove any relative pronoun or punctuation that fixed the attachment. If the seed had “the cake that was on the plate”, dropping “that was” yields “the cake on the plate”, which is a PP that could ambiguously attach as in the previous example.
- If originally the PP was a verb modifier, we attempt to attach it to a noun. For instance, original: “The astronomer viewed [VP the comet with a telescope].” This is likely verb-attached (viewed with a telescope). We can create a variant “The astronomer viewed the comet with a tail.” where “with a tail” naturally attaches to comet (the comet with a tail) – causing ambiguity if one could also interpret that the viewing was done with a tail, which is silly semantically but grammatically analogous. A more subtle change is if the sentence had multiple nouns, we can reposition the PP immediately after a different noun. For example, “The scientist discussed the results with the professor.” is ambiguous as written (did the discussion happen together with the professor, or were the results that the professor had?). If the original was unambiguous, say “The scientist discussed the professor’s results.” (no PP), we introduce a PP: “The scientist discussed the results with the professor.” to allow two attachments. In practice, many seeds may already have a preposition; we just ensure the phrasing doesn’t disallow one of the readings.

We use a parsing library (e.g. spaCy or NLTK) to help identify the sentence structure. The function` introduce_structural_ambiguity(sentence)` might:
1. Parse the sentence to find prepositional dependency relations or the occurrence of preposition tags (IN) in the POS sequence.
2. If a preposition is found, identify the span of the prepositional phrase (from the preposition to the end of that noun phrase).
3. Determine current attachment: using dependency parse, check the head of the PP. Remember the original head (to keep the original meaning for reference).
4. Construct a new sentence variant where the PP is placed in a position that suggests a different attachment. This may involve rearranging the order of the object and PP, or introducing a slight pause/comma. We ensure the sentence remains fluent. For example:
    - If parse shows `VP -> PP` (verb attaches PP), try attaching to noun: if the verb has a direct object, place the PP right after that noun with no comma (which encourages noun attachment).
    - If parse shows `NP -> PP` (noun attaches PP), move the PP to follow the verb phrase or clause. Possibly add a comma before the PP if needed to indicate it’s not tightly bound to the noun. (However, adding a comma can sometimes resolve ambiguity by clearly separating it – so we use commas sparingly, maybe only if needed for fluency).
5. Return the new sentence. If no PP is present and we cannot create one easily, the function might return `None` for that seed (meaning no structural ambiguous variant was generated). In practice, because we want one structural variant per seed, we could also fabricate a simple PP (like “with something”) at the end of the sentence if grammatically acceptable, but it’s better to work with existing structure to preserve naturalness.


In [32]:
def introduce_structural_ambiguity(sentence):
    doc = nlp(sentence)
    # 1. Try to find an existing PP to reattach
    pp_token = None
    for token in doc:
        if token.dep_ == 'prep' and token.head.dep_ != 'ROOT':
            pp_token = token
            break

    if pp_token:
        # extract PP span text
        pp_span = doc[pp_token.i : pp_token.i + len(list(pp_token.subtree))]
        pp_text = pp_span.text
        head = pp_token.head

        tokens = [t.text for t in doc]
        # noun-attached → move to verb/clause
        if head.pos_ in {'NOUN', 'PROPN'}:
            start, end = pp_token.i, pp_token.i + len(list(pp_token.subtree))
            base = tokens[:start] + tokens[end:]
            # insert before final punctuation
            if base and base[-1] in {'.','?','!'}:
                base.insert(-1, pp_text)
            else:
                base.append(pp_text)
            return ' '.join(base)

        # verb-attached → attach to the direct object
        if head.pos_ == 'VERB':
            dobj = next((c for c in head.children if c.dep_ == 'dobj'), None)
            if dobj:
                dobj_span = doc[dobj.i : dobj.i + len(list(dobj.subtree))]
                out = []
                for tok in doc:
                    out.append(tok.text)
                    if tok.i == dobj_span[-1].i:
                        out.append(pp_text)
                return ' '.join(out)

    # —fallback: no usable PP found or no dobj—
    # 2. Try to build a PP from an existing noun in the sentence
    nouns = [t.text for t in doc if t.pos_ in {'NOUN','PROPN'}]
    if nouns:
        obj = random.choice(nouns)
        return sentence.rstrip(' .?!') + f" with {obj}."

    # 3. Last-resort generic PP list
    fallback_pps = [
        "with enthusiasm",
        "on the table",
        "in the room",
        "by the window",
        "for the first time"
    ]
    choice = random.choice(fallback_pps)
    return sentence.rstrip(' .?!') + f" {choice}."

## 4.3 Referential Ambiguity Transformation
**Objective:** Introduce referential ambiguity by replacing a noun phrase with a pronoun that could refer to more than one entity in the discourse. Referential ambiguity occurs when it’s unclear which entity a pronoun (or possessive adjective) refers to.  In a single sentence context (or short discourse), this often happens if two candidates for the pronoun antecedent exist with matching gender/number. Our transformations will create situations akin to Winograd Schema Challenge examples, where commonsense or additional context is needed to disambiguate the pronoun. For instance, “The town councilors refused to give the demonstrators a permit because they feared violence.” is referentially ambiguous: “they” could refer to the councilors or the demonstrators.
**Methodology:** We find sentences that mention at least two entities (people or objects). We then replace one of the mentions with a pronoun, chosen such that it grammatically could refer to either entity. The classic pattern is a sentence with two noun phrases and then a pronoun later that might point back to either. This can be within one complex sentence or across two simpler sentences. For simplicity, we transform a single sentence when possible. Key considerations:
- **Pronoun selection:** The pronoun must agree in number (singular/plural) and type (person vs thing) with the replaced noun. It should also match the other potential antecedent in these features, to maximize confusion. For example, if the sentence has two singular people “Alice” and “Beth”, replacing “Beth” with “she” yields ambiguity (who is “she”?). If one entity is male and another female, “he” or “she” would not be ambiguous because gender differentiates them; in that case we might use a gender-neutral pronoun “they” or find another sentence. We prefer scenarios where two candidates share grammatical gender or are both inanimate (for “it”).
- **Placement:** Often the ambiguity is strongest when the pronoun is in a subordinate clause or later part of the sentence referring back. For example, “X verbed Y because he ...” or “After he verbed, ...” etc. In our transformations, if a sentence is simple, we might introduce a conjunction or relative clause to accommodate a pronoun. However, since we must derive from the seed sentence, a more straightforward approach is replacing a second mention. For instance, if a seed sentence had “John moved Mark’s notebook.”, we can change it to “John moved his notebook.” Now “his” could refer to John or Mark, creating ambiguity. Another example: original “The editor thanked the reviewer for the feedback.” – we could transform to “The editor thanked the reviewer for his feedback.” (assuming both editor and reviewer are male or unspecified), making it unclear whose feedback it was. This is akin to Winograd schemas where a possessive or pronoun can refer to either party in the sentence

We implement a function `introduce_referential_ambiguity(sentence, tokens, pos_tags)` that:
1. Identifies noun phrases or named entities in the sentence. We can use simple heuristics on POS tags (e.g., sequences like `[DT] JJ* NN` or proper nouns NNP). We particularly look for two distinct NPs that could serve as antecedents. Often these are the subject and object of the sentence or the subject of main clause and subject of a subordinate clause.
2. Chooses one of the noun phrases to replace with a pronoun. A typical strategy is: if the sentence has a subject and an object, replace the object (or object’s head) with a pronoun, provided the subject and object are compatible with the same pronoun form. Alternatively, replace the subject if the sentence structure allows (but replacing the subject pronoun at the very beginning might leave no ambiguity if the object is clearly a different person). Replacing the second entity tends to leave ambiguity about whether the pronoun refers to the first or second.
3. Determine the correct pronoun: If the NP refers to a person (e.g., marked by a title or capitalized name), use “he” or “she” as appropriate (or “they” if gender is unknown, though singular they might be interpreted as plural in some cases, so we use it carefully). If both entities are things, use “it” (singular) or “they” (plural) accordingly. If both are groups or plural, “they” is the natural pronoun and is inherently ambiguous. We also handle possessives: sometimes replacing “X’s [noun]” with “his [noun]” or “their [noun]” can cause ambiguity, as in the earlier example.
4. Ensure that after substitution, the sentence is still grammatical (we may need to adjust the determiner: e.g., “John moved the notebook of Mark” -> “John moved his notebook” might involve restructuring the phrase). Usually, replacing a standalone noun or a possessive noun with a pronoun is straightforward.
5. Output the modified sentence.

In [33]:
def introduce_referential_ambiguity(sentence, tokens, pos_tags):
    noun_indices = [i for i, pos in enumerate(pos_tags) if pos.startswith('NNP') or pos.startswith('NN')]
    if len(noun_indices) < 2:
        return None
    
    i, j = noun_indices[0], noun_indices[1]
    noun1, noun2 = tokens[i], tokens[j]
    
    pronoun = "it"  # Default for abstract nouns in seeded sentences
    if pos_tags[j] in ('NNS', 'NNPS'):
        pronoun = "they"
    elif noun2[0].isupper():
        pronoun = "he"
    
    new_tokens = tokens.copy()
    new_tokens[j] = pronoun
    if j > 0 and pos_tags[j-1] in ('DT', 'PRP$'):
        new_tokens[j-1] = ''
    ambiguous_sentence = ' '.join([t for t in new_tokens if t])
    return ambiguous_sentence

In [None]:
# Load vocab_lookup.csv and create vocab_list
df_vocab = pd.read_csv("data/vocab_lookup.csv")
vocab_list = list(zip(df_vocab['lemma'], df_vocab['pos']))

# Load seed_sentences.csv
df_seeds = pd.read_csv("data/seed_sentences.csv")
seeded_sentences = df_seeds['sentence'].tolist()
output_data = []

# Initialize counters for statistics
stats = {
    "total": len(seeded_sentences),
    "referential": 0,
    "structural": 0,
    "lexical": 0
}

# Create a tqdm progress bar for notebooks
for sentence in tqdm(seeded_sentences, desc="Generating ambiguous sentences", 
                    leave=True, unit="sentence", colour="blue"):
    doc = nlp(sentence)
    tokens = [token.text for token in doc]
    lemmas = [token.lemma_ for token in doc]
    pos_tags = [token.tag_ for token in doc]
    
    # Referential ambiguity
    amb_ref = introduce_referential_ambiguity(sentence, tokens, pos_tags)
    if amb_ref:
        output_data.append({
            'original_sentence': sentence,
            'ambiguity_type': 'referential',
            'ambiguous_sentence': amb_ref
        })
        stats["referential"] += 1
    
    # Structural ambiguity
    amb_struct = introduce_structural_ambiguity(sentence)
    if amb_struct:
        output_data.append({
            'original_sentence': sentence,
            'ambiguity_type': 'structural',
            'ambiguous_sentence': amb_struct 
        })
        stats["structural"] += 1
    amb_lex = introduce_lexical_ambiguity(sentence, tokens, lemmas, pos_tags, vocab_list)
    if amb_lex:
        output_data.append({
            'original_sentence': sentence,
            'ambiguity_type': 'lexical',
            'ambiguous_sentence': amb_lex
        })
        stats["lexical"] += 1
        

# Create DataFrame and save to CSV
df_output = pd.DataFrame(output_data)
df_output.to_csv("data/final_dataset.csv", index=False)

print(f"✅ Ambiguity generation complete!")
print(f"📊 Statistics:")
print(f"   - Processed {stats['total']} seed sentences")
print(f"   - Generated {stats['referential']} sentences with referential ambiguity")
print(f"   - Generated {stats['structural']} sentences with structural ambiguity")
print(f"   - Total ambiguous sentences: {len(output_data)}")
print(f"💾 Results saved to 'final_dataset.csv'")

Generating ambiguous sentences:   0%|          | 0/25126 [00:00<?, ?sentence/s]

✅ Ambiguity generation complete!
📊 Statistics:
   - Processed 25126 seed sentences
   - Generated 24997 sentences with referential ambiguity
   - Generated 25126 sentences with structural ambiguity
   - Total ambiguous sentences: 74997
💾 Results saved to 'final_dataset.csv'


##  Build Wait-K Prefix Dataset
We start from original parallel pairs \((\mathbf{x}, \mathbf{y}) = (x_{1:L_x}, y_{1:L_y})\) and apply a Wait-\(K\) strategy.  For each target position \(t=1,\dots,L_y\), define the number of source tokens seen so far as  
$$
r(t) \;=\;\min\bigl(K + (t-1),\,L_x\bigr).
$$  
We then create prefix-to-next-word examples of the form  
$$
\bigl(x_{1:r(t)},\,y_{t-1}\bigr)\;\mapsto\;y_t,
$$  
where \(y_0\) is the special `<sos>` token.  In practice this yields up to \(L_y\) examples per sentence pair.  Finally, we save all examples to `final_dataset_2.csv` with columns:
- `source_prefix` = $x_{1:r(t)}$
- `prev_target`   = $y_{t-1}$
- `target_word`   = $y_t$

In [1]:
# 1) Load your original parallel corpus CSV
import pandas as pd
df = pd.read_csv('data/final_dataset.csv')

K = 3  # choose your latency budget

rows = []
for _, row in df.iterrows():
    src_tokens = row['ambiguous_sentence'].split()
    tgt_tokens = row['original_sentence'].split()
    Lx, Ly = len(src_tokens), len(tgt_tokens)
    for t in range(1, Ly + 1):
        r = min(K + (t - 1), Lx)
        prev_tok = tgt_tokens[t - 2] if t > 1 else '<sos>'
        rows.append({
            'source_prefix': ' '.join(src_tokens[:r]),
            'prev_target': prev_tok,
            'target_word': tgt_tokens[t - 1]
        })

new_df = pd.DataFrame(rows)
new_df.to_csv('final_dataset_2.csv', index=False)
print(f"Saved {len(new_df)} prefix examples to final_dataset_2.csv")

Saved 1225591 prefix examples to final_dataset_2.csv
