# Creating a Large Dataset for Ambiguity Detection

In this notebook, we create a large dataset of sentence pairs, each consisting of an original unambiguous sentence and an ambiguous counterpart. This dataset will be used to train an RNN encoder/decoder model for sequence-to-sequence sentence reconstruction. The pipeline downloads a large dataset of email-like sentences (the EnronSent Corpus), extracts sentences with 10-20 words, creates a vocabulary lookup table, and introduces ambiguities using the provided functions.

The EnronSent Corpus is a cleaned version of the Enron Email Dataset, containing 96,106 messages across 45 plain text files, totaling 13.8 million words. It is suitable for linguistic analysis and matches the formal, email-like style of your target texts.

## Pipeline Overview
1. **Download and Extract**: Download the EnronSent Corpus and extract the text files.
2. **Sentence Selection**: Extract sentences with 10-20 words (excluding punctuation) and collect tokens, lemmas, and POS tags.
3. **Vocabulary Creation**: Generate a vocabulary lookup table with token, lemma, and POS columns.
4. **Ambiguity Introduction**: Apply lexical, structural, and referential ambiguity functions to create ambiguous counterparts.
5. **Save Outputs**: Save the dataset and vocabulary lookup table as CSV files.


## Step 1: Import Necessary Libraries

We need libraries for downloading the dataset, processing text, performing NLP tasks, and handling data. SpaCy is used for tokenization, lemmatization, POS tagging, and dependency parsing, while NLTK's WordNet is used for synonym lookup in the lexical ambiguity function.


In [1]:

# Import libraries
import requests
import tarfile
import os
import spacy
from tqdm.notebook import tqdm
import pandas as pd
import random
from glob import glob
from nltk.corpus import wordnet as wn

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# Ensure necessary NLTK data is downloaded
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\claza\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\claza\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Step 2: Download and Extract the EnronSent Corpus

We download the EnronSent Corpus from the University of California San Diego. The dataset is a 25MB tar.gz file containing 45 plain text files. After downloading, we extract the archive, which unpacks into the `enronsent` folder that Step 3 reads from.


In [None]:

url = "http://wstyler.ucsd.edu/files/enronsentv1.tar.gz"
filename = "enronsentv1.tar.gz"

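# Download the archive only if it is not already present
if not os.path.exists(filename):
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail early instead of saving an error page
    with open(filename, 'wb') as f:
        f.write(response.content)

# Extract the tar.gz file into the current directory
with tarfile.open(filename, 'r:gz') as tar:
    tar.extractall()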
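
Before starting the slower parsing step, it can help to confirm that extraction produced the files the next step expects. The sketch below exercises the same `enronsent??` glob pattern that Step 3 uses, but on a throwaway directory with made-up file names; the helper name `find_corpus_files` is hypothetical and not part of the original notebook.

```python
import os
import tempfile
from glob import glob

def find_corpus_files(folder):
    """Return sorted paths whose names are 'enronsent' plus exactly two characters."""
    return sorted(glob(os.path.join(folder, "enronsent??")))

# Demonstrate the pattern on a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("enronsent00", "enronsent01", "README"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("placeholder\n")
    found = [os.path.basename(p) for p in find_corpus_files(tmp)]

print(found)  # ['enronsent00', 'enronsent01']
```

On the real data, calling `find_corpus_files("enronsent")` and checking that it returns 45 paths would catch a wrong working directory or a failed extraction early.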




## Step 3: Process Text Files to Select Sentences

We process each text file to extract sentences with 10-20 words (excluding punctuation). For each selected sentence, we store the text, tokens, lemmas, and fine-grained POS tags (e.g., 'NN', 'VB') using SpaCy's `token.tag_` attribute, which matches the Penn Treebank tags required by your ambiguity functions.


In [None]:
# The archive extracts to a folder named 'enronsent'
extracted_folder = "enronsent"
text_files = sorted(glob(os.path.join(extracted_folder, "enronsent??")))

# Process a single file in fixed-size character chunks so each call stays under
# spaCy's default max_length (1,000,000 characters). A sentence that straddles a
# chunk boundary is split in two.
def process_file(file_path, nlp, chunk_size=500_000):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    sentences_data = []
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start+chunk_size]
        doc = nlp(chunk)
        for sent in doc.sents:
            tokens = [t.text for t in sent if not t.is_punct]
            if 10 <= len(tokens) <= 20:
                sentences_data.append({
                    'text': sent.text,
                    'tokens': tokens,
                    'lemmas': [t.lemma_ for t in sent if not t.is_punct],
                    'pos_tags': [t.tag_ for t in sent if not t.is_punct],
                })
    return sentences_data

# Process all files
selected_sentences = []
for file in tqdm(text_files):
    selected_sentences.extend(process_file(file, nlp))

# save the selected sentences to a CSV file
df = pd.DataFrame(selected_sentences)
df.to_csv("enron_sentences.csv", index=False)


  0%|          | 0/45 [00:00<?, ?it/s]
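
The length filter from Step 3 can be illustrated in isolation. The sketch below is a minimal stand-in that assumes pre-split tokens and uses `string.punctuation` in place of spaCy's `token.is_punct` (which also recognizes non-ASCII punctuation); the function names are hypothetical.

```python
import string

def count_words(tokens):
    """Count tokens that are not made up entirely of ASCII punctuation."""
    return sum(
        1 for tok in tokens
        if not all(ch in string.punctuation for ch in tok)
    )

def keep_sentence(tokens, min_words=10, max_words=20):
    """Apply the 10-20 word window, ignoring punctuation tokens."""
    return min_words <= count_words(tokens) <= max_words

too_short = "Please call me .".split()
in_range = ("Please send the revised contract to the legal team "
            "before Friday afternoon , thanks .").split()

print(count_words(too_short), keep_sentence(too_short))  # 3 False
print(count_words(in_range), keep_sentence(in_range))    # 13 True
```

Counting only non-punctuation tokens means a 13-word sentence with two punctuation marks is judged by its 13 words, not its 15 tokens.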

## Step 4: Create Vocabulary Lookup Table

We extract all unique (token, lemma, pos) triples from the selected sentences to create a vocabulary lookup table. The table has columns `token`, `lemma`, and `pos`, matching the structure of your existing `vocab_lookup_original_texts.csv` for easy merging.


In [None]:
# Collect all (token, lemma, pos) triples
vocab_data = []
for sent in selected_sentences:
    for token, lemma, pos in zip(sent['tokens'], sent['lemmas'], sent['pos_tags']):
        vocab_data.append((token, lemma, pos))

# Remove duplicates
unique_vocab = list(set(vocab_data))

# Create DataFrame
vocab_df = pd.DataFrame(unique_vocab, columns=['token', 'lemma', 'pos'])

# Save to CSV (create the output folder if it does not exist yet)
os.makedirs('data', exist_ok=True)
vocab_df.to_csv('data/new_vocab_lookup.csv', index=False)
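
As a small illustration of the deduplication step, the sketch below builds the unique (token, lemma, pos) triples from two hand-tagged toy sentences. The sentences and the helper `unique_triples` are made up for this example. Sorting the set is an optional addition, not something the original cell does; it makes the row order reproducible across runs.

```python
# Two toy sentences with hand-assigned lemmas and Penn Treebank tags.
sentences = [
    {"tokens": ["The", "meeting", "starts"],
     "lemmas": ["the", "meeting", "start"],
     "pos_tags": ["DT", "NN", "VBZ"]},
    {"tokens": ["The", "meeting", "ended"],
     "lemmas": ["the", "meeting", "end"],
     "pos_tags": ["DT", "NN", "VBD"]},
]

def unique_triples(sentences):
    """Collect unique (token, lemma, pos) triples in a stable, sorted order."""
    triples = {
        (tok, lemma, pos)
        for sent in sentences
        for tok, lemma, pos in zip(sent["tokens"], sent["lemmas"], sent["pos_tags"])
    }
    return sorted(triples)

vocab_rows = unique_triples(sentences)
for row in vocab_rows:
    print(row)
```

The six input tokens collapse to four rows because `("The", "the", "DT")` and `("meeting", "meeting", "NN")` occur in both sentences.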


## Step 5: Create Vocabulary List for Ambiguity Introduction

The `vocab_list` is a list of unique (word, pos) pairs used by the `introduce_lexical_ambiguity` function to find replacement words.


In [7]:
# Create vocab_list
vocab_list = list(set((token, pos) for sent in selected_sentences for token, pos in zip(sent['tokens'], sent['pos_tags'])))
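
Because the lexical ambiguity function looks for words that share a POS tag with the word being replaced, one possible optimization is to group the vocabulary by tag once instead of scanning the full list for every sentence. The sketch below shows that idea on a toy vocabulary; `build_pos_index` is a hypothetical helper, and the provided functions do not use it.

```python
from collections import defaultdict

def build_pos_index(vocab_list):
    """Group words by POS tag so same-tag candidates can be looked up directly."""
    index = defaultdict(set)
    for word, pos in vocab_list:
        index[pos].add(word)
    return index

# Toy (word, pos) pairs in the same shape as vocab_list.
toy_vocab = [("bank", "NN"), ("report", "NN"), ("run", "VB"), ("report", "VB")]
pos_index = build_pos_index(toy_vocab)

# Same-tag replacement candidates for the noun "bank".
candidates = sorted(pos_index["NN"] - {"bank"})
print(candidates)  # ['report']
```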


## Step 6: Define Ambiguity Introduction Functions

We define the provided functions for introducing lexical, structural, and referential ambiguities. These functions are used as-is to ensure consistency with your original approach.


In [5]:
# Lexical Ambiguity
def introduce_lexical_ambiguity(sentence, tokens, lemmas, pos_tags, vocab_list):
    candidates = [(lemma, pos, idx) for idx, (lemma, pos) in enumerate(zip(lemmas, pos_tags)) 
                  if pos.startswith(('NN', 'VB', 'JJ'))]
    if not candidates:
        return None
    
    orig_lemma, orig_pos, orig_idx = random.choice(candidates)
    ambiguous_candidates = [w for w, p in vocab_list if p == orig_pos and w != orig_lemma]
    ambiguous_candidates = [w for w in ambiguous_candidates 
                            if len(wn.synsets(w, pos=wn.NOUN if orig_pos.startswith('NN') else 
                                                    wn.VERB if orig_pos.startswith('VB') else wn.ADJ)) > 1]
    replacement = random.choice(ambiguous_candidates) if ambiguous_candidates else None
    if not replacement:
        homonyms = [w for w, p in vocab_list if w == orig_lemma and p != orig_pos]
        replacement = orig_lemma if homonyms else None
    
    if not replacement:
        return None
    
    new_tokens = tokens.copy()
    new_tokens[orig_idx] = replacement
    return ' '.join(new_tokens)

# Structural Ambiguity
def introduce_structural_ambiguity(sentence):
    doc = nlp(sentence)
    # 1. Try to find an existing PP to reattach
    pp_token = None
    for token in doc:
        if token.dep_ == 'prep' and token.head.dep_ != 'ROOT':
            pp_token = token
            break

    if pp_token:
        # extract PP span text
        pp_span = doc[pp_token.i : pp_token.i + len(list(pp_token.subtree))]
        pp_text = pp_span.text
        head = pp_token.head

        tokens = [t.text for t in doc]
        # noun-attached → move to verb/clause
        if head.pos_ in {'NOUN', 'PROPN'}:
            start, end = pp_token.i, pp_token.i + len(list(pp_token.subtree))
            base = tokens[:start] + tokens[end:]
            # insert before final punctuation
            if base and base[-1] in {'.','?','!'}:
                base.insert(-1, pp_text)
            else:
                base.append(pp_text)
            return ' '.join(base)

        # verb-attached → attach to the direct object
        if head.pos_ == 'VERB':
            dobj = next((c for c in head.children if c.dep_ == 'dobj'), None)
            if dobj:
                dobj_span = doc[dobj.i : dobj.i + len(list(dobj.subtree))]
                out = []
                for tok in doc:
                    out.append(tok.text)
                    if tok.i == dobj_span[-1].i:
                        out.append(pp_text)
                return ' '.join(out)

    # —fallback: no usable PP found or no dobj—
    # 2. Try to build a PP from an existing noun in the sentence
    nouns = [t.text for t in doc if t.pos_ in {'NOUN','PROPN'}]
    if nouns:
        obj = random.choice(nouns)
        return sentence.rstrip(' .?!') + f" with {obj}."

    # 3. Last-resort generic PP list
    fallback_pps = [
        "with enthusiasm",
        "on the table",
        "in the room",
        "by the window",
        "for the first time"
    ]
    choice = random.choice(fallback_pps)
    return sentence.rstrip(' .?!') + f" {choice}."

# Referential Ambiguity
def introduce_referential_ambiguity(sentence, tokens, pos_tags):
    noun_indices = [i for i, pos in enumerate(pos_tags) if pos.startswith('NNP') or pos.startswith('NN')]
    if len(noun_indices) < 2:
        return None
    
    i, j = noun_indices[0], noun_indices[1]
    noun1, noun2 = tokens[i], tokens[j]
    
    pronoun = "it"  # Default for abstract nouns in seeded sentences
    if pos_tags[j] in ('NNS', 'NNPS'):
        pronoun = "they"
    elif noun2[0].isupper():
        pronoun = "he"
    
    new_tokens = tokens.copy()
    new_tokens[j] = pronoun
    if j > 0 and pos_tags[j-1] in ('DT', 'PRP$'):
        new_tokens[j-1] = ''
    ambiguous_sentence = ' '.join([t for t in new_tokens if t])
    return ambiguous_sentence
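
To make the behavior of the referential function concrete, the sketch below restates its core logic in compact form and traces it on one hand-tagged sentence. It is an illustrative restatement under a hypothetical name (`replace_second_noun`), not a replacement for the function above, and the example sentence and tags are made up.

```python
def replace_second_noun(tokens, pos_tags):
    """Swap the second noun for a pronoun and drop a preceding determiner."""
    noun_idx = [i for i, tag in enumerate(pos_tags) if tag.startswith("NN")]
    if len(noun_idx) < 2:
        return None  # need at least two nouns to replace the second one
    j = noun_idx[1]
    if pos_tags[j] in ("NNS", "NNPS"):
        pronoun = "they"
    elif tokens[j][0].isupper():
        pronoun = "he"
    else:
        pronoun = "it"
    out = list(tokens)
    out[j] = pronoun
    if j > 0 and pos_tags[j - 1] in ("DT", "PRP$"):
        out[j - 1] = ""  # "the report" becomes "it", not "the it"
    return " ".join(t for t in out if t)

tokens = ["The", "manager", "sent", "the", "report", "to", "the", "client"]
tags = ["DT", "NN", "VBD", "DT", "NN", "TO", "DT", "NN"]

result = replace_second_noun(tokens, tags)
print(result)  # The manager sent it to the client
print(replace_second_noun(["Thanks", "again"], ["NNS", "RB"]))  # None
```

The trace shows that the second noun ("report") and its determiner are replaced by "it", and that sentences with fewer than two nouns yield `None` and are skipped in Step 7.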


## Step 7: Create Dataset with Ambiguous Sentences

For each selected sentence, we apply the lexical, structural, and referential ambiguity functions. Each successful ambiguous sentence is paired with the original, along with the ambiguity type.


In [None]:
output_file = 'data/final_dataset.csv'
# Remove existing file if it exists so headers are written correctly
if os.path.exists(output_file):
    os.remove(output_file)

chunk_size = 100
chunk = []
total_pairs = 0

for i, sent_data in enumerate(tqdm(selected_sentences, desc="Creating dataset")):
    original = sent_data['text']
    tokens = sent_data['tokens']
    lemmas = sent_data['lemmas']
    pos_tags = sent_data['pos_tags']

    # Lexical ambiguity
    ambiguous_lex = introduce_lexical_ambiguity(original, tokens, lemmas, pos_tags, vocab_list)
    if ambiguous_lex and ambiguous_lex != original:
        chunk.append({'original': original, 'ambiguous': ambiguous_lex, 'type': 'lexical'})
        total_pairs += 1

    # Structural ambiguity
    ambiguous_struct = introduce_structural_ambiguity(original)
    if ambiguous_struct and ambiguous_struct != original:
        chunk.append({'original': original, 'ambiguous': ambiguous_struct, 'type': 'structural'})
        total_pairs += 1

    # Referential ambiguity
    ambiguous_ref = introduce_referential_ambiguity(original, tokens, pos_tags)
    if ambiguous_ref and ambiguous_ref != original:
        chunk.append({'original': original, 'ambiguous': ambiguous_ref, 'type': 'referential'})
        total_pairs += 1

    # Every chunk_size records, write to disk and clear memory
    if len(chunk) >= chunk_size:
        df_chunk = pd.DataFrame(chunk)
        header = not os.path.exists(output_file)
        df_chunk.to_csv(output_file, mode='a', index=False, header=header)
        chunk = []

# Write any remaining records
if chunk:
    df_chunk = pd.DataFrame(chunk)
    header = not os.path.exists(output_file)
    df_chunk.to_csv(output_file, mode='a', index=False, header=header)

print(f"Total number of sentence pairs written: {total_pairs}")

Creating dataset:   0%|          | 0/326283 [00:00<?, ?it/s]

Total number of sentence pairs written: 932354
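
The chunked-write pattern used above (append each batch, write the header only once) can be sketched with the standard library alone. The example below is a minimal illustration that uses `csv.DictWriter` instead of pandas and writes to a temporary file; `append_rows` and the toy rows are hypothetical.

```python
import csv
import os
import tempfile

FIELDS = ["original", "ambiguous", "type"]

def append_rows(path, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "pairs.csv")
    # Two separate batches, as if flushed at different points in the loop.
    append_rows(out_path, [{"original": "a b", "ambiguous": "a c", "type": "lexical"}])
    append_rows(out_path, [{"original": "d e", "ambiguous": "d it", "type": "referential"}])
    with open(out_path, newline="", encoding="utf-8") as f:
        lines = list(csv.reader(f))

print(lines)
```

After both batches the file holds one header row followed by two data rows, which is the same invariant the `header = not os.path.exists(output_file)` check maintains in the cell above.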


## Notes

- **Vocabulary Lookup Table**: The `new_vocab_lookup.csv` file contains columns `token`, `lemma`, and `pos` (fine-grained POS tags like 'NN', 'VB'). This matches the structure of your existing `vocab_lookup_original_texts.csv`.
- **Dataset**: The `final_dataset.csv` file (written to the `data/` folder in Step 7) contains columns `original`, `ambiguous`, and `type` (indicating the type of ambiguity introduced).
- **Special Tokens**: The special tokens (`<PAD>`, `<SOS>`, `<EOS>`, `<UNK>`) are not included in this pipeline as they are typically handled during model training.
- **Ambiguity Functions**: The provided functions are used as-is, ensuring consistency with your original approach.
- **Performance**: Processing the entire corpus may be memory-intensive. If issues arise, consider processing files in smaller batches or sampling sentences.

This pipeline can be executed in a Jupyter Notebook to generate the required dataset and vocabulary lookup table. Ensure all libraries are installed and run the cells in order.


## Build Wait-K Prefix Dataset

We start from the parallel pairs $(\mathbf{x}, \mathbf{y}) = (x_{1:L_x}, y_{1:L_y})$ saved in `final_dataset.csv`, where $\mathbf{x}$ is the ambiguous sentence and $\mathbf{y}$ is the original, and apply a Wait-$K$ strategy. For each target position $t=1,\dots,L_y$, define the number of source tokens seen so far as
$$
r(t) \;=\;\min\bigl(K + (t-1),\,L_x\bigr).
$$
We then create prefix-to-next-word examples of the form
$$
\bigl(x_{1:r(t)},\,y_{t-1}\bigr)\;\mapsto\;y_t,
$$
where $y_0$ is the special `<sos>` token. This yields one example per target token, that is, $L_y$ examples per sentence pair. Finally, we save all examples to `final_dataset_2.csv` with columns:

- `source_prefix` = $x_{1:r(t)}$
- `prev_target`   = $y_{t-1}$
- `target_word`   = $y_t$

In [1]:
# 1) Load your original parallel corpus CSV
import pandas as pd
df = pd.read_csv('data/final_dataset.csv')

K = 3  # choose your latency budget

rows = []
for _, row in df.iterrows():
    src_tokens = row['ambiguous'].split()
    tgt_tokens = row['original'].split()
    Lx, Ly = len(src_tokens), len(tgt_tokens)
    for t in range(1, Ly + 1):
        r = min(K + (t - 1), Lx)
        prev_tok = tgt_tokens[t - 2] if t > 1 else '<sos>'
        rows.append({
            'source_prefix': ' '.join(src_tokens[:r]),
            'prev_target': prev_tok,
            'target_word': tgt_tokens[t - 1]
        })

new_df = pd.DataFrame(rows)
new_df.to_csv('data/final_dataset_2.csv', index=False)
print(f"Saved {len(new_df)} prefix examples to final_dataset_2.csv")

Saved 11724190 prefix examples to final_dataset_2.csv
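
A small worked case may make the schedule $r(t)$ easier to check by hand. The sketch below applies the same formula to a toy pair with $K=2$, $L_x=3$, and $L_y=4$; the function name `wait_k_examples` and the placeholder tokens are hypothetical.

```python
def wait_k_examples(src_tokens, tgt_tokens, k, sos="<sos>"):
    """Build (source_prefix, prev_target, target_word) triples under a Wait-K schedule."""
    examples = []
    for t in range(1, len(tgt_tokens) + 1):
        r = min(k + (t - 1), len(src_tokens))  # source tokens visible at step t
        prev = tgt_tokens[t - 2] if t > 1 else sos
        examples.append((src_tokens[:r], prev, tgt_tokens[t - 1]))
    return examples

examples = wait_k_examples(["x1", "x2", "x3"], ["y1", "y2", "y3", "y4"], k=2)
for prefix, prev, nxt in examples:
    print(len(prefix), prev, "->", nxt)
# 2 <sos> -> y1
# 3 y1 -> y2
# 3 y2 -> y3
# 3 y3 -> y4
```

The prefix grows by one source token per step until it reaches the full source length, after which every remaining target word is predicted from the complete source.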
