## Cleaning the non-prototypical arguments dataset from Papadimitriou et al. 2022

In this notebook, we'll be cleaning and preparing the swapped arguments dataset from [Papadimitriou et al. 2022](http://arxiv.org/abs/2203.06204). This dataset contains both natural and systematically swapped English sentences, designed to probe the representation of grammatical roles in language models when lexical expectations alone are insufficient.

By isolating the effect of word order on contextualization, this dataset highlights how models leverage context in critical instances where compositional meaning diverges from lexical expectations. We'll repurpose it as a benchmark for embedding reconstruction models (such as `vec2text`) to evaluate how well they recover non-prototypical sentences.

In [4]:
import os
import requests

DATASET_DIR = "../../datasets/arguments-swapped"

# Create the directory if it doesn't exist
os.makedirs(DATASET_DIR, exist_ok=True)

# URLs of the CSV files
original_url = "https://raw.githubusercontent.com/toizzy/except-when-it-matters/main/data/argument-swapped-original.csv"
swapped_url = "https://raw.githubusercontent.com/toizzy/except-when-it-matters/main/data/argument-swapped-swapped.csv"

# Download and save the original CSV file
response = requests.get(original_url, timeout=30)
response.raise_for_status()  # fail loudly instead of saving an HTTP error page
with open(os.path.join(DATASET_DIR, "original.csv"), "wb") as file:
    file.write(response.content)

# Download and save the swapped CSV file
response = requests.get(swapped_url, timeout=30)
response.raise_for_status()
with open(os.path.join(DATASET_DIR, "swapped.csv"), "wb") as file:
    file.write(response.content)

print("CSV files downloaded and saved successfully.")

CSV files downloaded and saved successfully.


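Unfortunately, the dataset has nonstandard quoting, so let's re-read both files as tab-separated text with quote handling disabled and save them as clean TSV files.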

In [9]:
import pandas as pd
import csv

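def preprocess_csv(file_path):
    # Read the file as tab-separated text with quote handling disabled.
    # header=None treats every line as data, so a header line in the source
    # file (if present) is kept as an ordinary row.
    df = pd.read_csv(file_path, sep='\t', quoting=csv.QUOTE_NONE, escapechar='\\', header=None,
                     names=['id', 'label', 'subj_start', 'subj', 'obj_start', 'obj', 'verb', 'text'])

    # Collapse doubled double quotes in the 'text' column into a single quote character
    df['text'] = df['text'].str.replace('""', '"', regex=False)

    return df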

# Preprocess each CSV file
for file_name in os.listdir(DATASET_DIR):
    if file_name.endswith(".csv"):
        file_path = os.path.join(DATASET_DIR, file_name)
        
        # Preprocess the CSV file
        df = preprocess_csv(file_path)
        
        # Save the preprocessed data as TSV
        output_path = os.path.join(DATASET_DIR, file_name[:-4] + "_preprocessed.tsv")
        df.to_csv(output_path, sep='\t', index=False, header=True, quoting=csv.QUOTE_MINIMAL)


print("Preprocessing completed.")


Preprocessing completed.
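
To see why quote handling matters here, the sketch below parses one made-up tab-separated row twice with Python's standard `csv` module: once with the default quoting rules and once with `csv.QUOTE_NONE` (the constant behind `quoting=3`). The sample row is hypothetical and not taken from the dataset, and pandas' own parser may differ in details, so treat this as an illustration of the idea rather than a reproduction of the cell above.

```python
import csv

# Hypothetical tab-separated row whose text field starts with a double quote
line = '7\t"Great" food, they said\tsaid'

# Default quoting: a leading quote opens a quoted field,
# so the quote characters are consumed by the parser
default_row = next(csv.reader([line], delimiter='\t'))

# QUOTE_NONE: quote characters are treated as ordinary text
literal_row = next(csv.reader([line], delimiter='\t', quoting=csv.QUOTE_NONE))

print(default_row[1])
print(literal_row[1])
```

With quoting disabled the sentence keeps its quotation marks, which is what we want when the quotes are part of the text rather than CSV syntax.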


In [11]:
from datasets import load_dataset

# Load the dataset from the local folder
dataset_dict = load_dataset("csv", data_files={"original": os.path.join(DATASET_DIR, "original_preprocessed.tsv"),
                                               "swapped": os.path.join(DATASET_DIR, "swapped_preprocessed.tsv")},
                            delimiter='\t')

In [48]:
import re
def fix_punctuation(example):
    text = example['text']

    # Normalize spacing around punctuation: no space before, exactly one space after
    text = re.sub(r'\s*([,.:;!?])\s*', r'\1 ', text)
    text = text.replace(' - ', '-')

    # Replace ellipsis with three dots without spaces
    text = re.sub(r'\.\s*\.\s*\.', '...', text)
    text = re.sub(r'(\.\.\.) ', r'\1', text)
    text = re.sub(r'\s*…+\s*', '... ', text)

    # Fix spacing around parentheses
    text = re.sub(r'\s*\(', ' (', text)
    text = re.sub(r'\)\s*', ') ', text)
    text = re.sub(r'\)\s*,', r'),', text)

    # Replace smart quotes with straight quotes
    text = re.sub(r'[“”]', '"', text)
    text = re.sub(r'[‘’]', "'", text)
    # Treat two consecutive single quotes as one double quote
    text = re.sub(r"''", '"', text)

    # Fix spacing before contractions
    text = re.sub(r"\s+'ve\b", "'ve", text)
    text = re.sub(r"\s+n't\b", "n't", text)
    text = re.sub(r"\s+'m\b", "'m", text)
    text = re.sub(r"\s+'ll\b", "'ll", text)

    # Remove faux escaping and fix spacing for possessives and contractions
    text = re.sub(r"\\'\s*s\b", "'s", text)
    text = re.sub(r"\\'\s*s\s+", "'s ", text)
    text = re.sub(r"\\'\s*", "'", text)

    # Remove space after currency symbol
    text = re.sub(r'\$\s+', '$', text)

    # Remove all instances of '\'
    text = text.replace("\\","")
    
    # Remove '<<' and '>>'
    text = re.sub(r'<<|>>', '', text)
    
    # Replace '""' with '"'
    text = re.sub(r'""', '"', text)

    # Remove leading and trailing whitespace
    text = text.strip()

    example['text'] = text
    return example

# Apply the fix_punctuation() function to each dataset in the dataset_dict
dataset_dict = dataset_dict.map(fix_punctuation, num_proc=4)

Map (num_proc=4): 100%|██████████| 487/487 [00:00<00:00, 5109.46 examples/s]
Map (num_proc=4): 100%|██████████| 487/487 [00:00<00:00, 5066.25 examples/s]
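
As a quick check of what the spacing rules do, here is a minimal sketch that applies a subset of the substitutions from `fix_punctuation` to two made-up strings. The helper name `tidy` and both inputs are hypothetical; the regular expressions are copied from the cell above.

```python
import re

def tidy(text):
    # No space before punctuation, exactly one space after
    text = re.sub(r'\s*([,.:;!?])\s*', r'\1 ', text)
    # Collapse a spaced-out ellipsis back into three dots
    text = re.sub(r'\.\s*\.\s*\.', '...', text)
    # Re-attach tokenized contractions
    text = re.sub(r"\s+n't\b", "n't", text)
    text = re.sub(r"\s+'m\b", "'m", text)
    return text.strip()

# Tokenized sentence: detached punctuation and contractions are rejoined
print(tidy("The owner did n't call , and I 'm not surprised ."))
# The owner didn't call, and I'm not surprised.

# Caveat: the punctuation rule also fires inside numbers
print(tidy("3.5 stars"))
# 3. 5 stars
```

The second call shows a side effect worth keeping in mind: because the punctuation rule always emits a trailing space, decimals and times such as `3.5` or `10:30` get split. If the benchmark sentences contain many numbers, a lookahead that skips digits may be worth adding.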


In [49]:
dataset_dict["original"]["text"]

['sentence',
 'Other shops around this city have MUCH NICER and more TRANSPARENT owners.',
 'Hmmm...A person can not call a company, if you have no idea its name (since the designer is unknown...SUPPOSEDLY) , and order a gown without a dress name or style number.',
 '"The worst thing that can happen for any restaurant like Zahav is to have too many people write hyperbolic reviews making claims that " everyone " is going to " love " the food, decor and service. "',
 'The Inn touts a shower with dual shower heads, but only one worked.',
 'Finally a chambermaid stuck her head around the corner from the top of the stairs and told us sternly that we could not be accommodated until 3 M, no exceptions.',
 "But I've done hundreds of dog introductions myself (another place, I don't work here) , and owners can have unrealistic expectations and views of what they see when their dogs meet other dogs.",
 "But I've done hundreds of dog introductions myself (another place, I don't work here) , and ow