# Preprocess Data

The original NL sequences used to build the MARB dataset were retrieved from the enTenTen corpus using Sketch Engine's *Concordance* tool. The tool operates with a fixed context window rather than sentence boundaries, and it returns the matches as a CSV file with *left context*, *match* and *right context* as separate fields. This script was therefore used to clean the matches before continuing with the dataset creation.

In [1]:
import pandas as pd
import os

In [2]:
ententen_dir = '/srv/data/gussodato/thesis/ententen/'

a_person_path = os.path.join(ententen_dir, 'concordance_preloaded_ententen21_tt31_20240226154708.csv')
a_woman_path = os.path.join(ententen_dir, 'concordance_preloaded_ententen21_tt31_20240226154516.csv')
a_man_path = os.path.join(ententen_dir, 'concordance_preloaded_ententen21_tt31_20240226154231.csv')

paths = [(a_person_path, 'person'), (a_woman_path, 'woman'), (a_man_path, 'man')]

### Concordance search results
The first four lines of each concordance file contain information about the search settings:

In [3]:
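for path, person in paths:
    print(f'Search term: {person}')
    print('-'*20)
    # 'utf-8-sig' strips the byte order mark (BOM) at the start of the export, if present
    with open(path, encoding='utf-8-sig') as f:
        # The first four lines hold the search settings
        for _ in range(4):
            print(next(f).strip())
    print()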

Search term: person
--------------------
﻿"corpus","preloaded/ententen21_tt31"
"subcorpus","-"
"concordance size","10000"
"query","Query:[lc=""\ba""] [lc=""person\b""];Random sample:10000"

Search term: woman
--------------------
﻿"corpus","preloaded/ententen21_tt31"
"subcorpus","-"
"concordance size","10000"
"query","Query:[lc=""\ba""] [lc=""woman\b""];Random sample:10000"

Search term: man
--------------------
﻿"corpus","preloaded/ententen21_tt31"
"subcorpus","-"
"concordance size","10000"
"query","Query:[lc=""\ba""] [lc=""man\b""];Random sample:10000"



The rest of each file is a regular CSV table with *Left*, *KWIC* (Key Word In Context) and *Right* as separate fields:

In [4]:
# header=4: the column names are on the line after the four metadata lines
raw_data_example = pd.read_csv(a_person_path, header=4)
raw_data_example.head(5)

Unnamed: 0,Reference,Left,KWIC,Right
0,tm.org,reduces stress and produces more neurological ...,a person,"with seizure disorder could, of course, enjoy ..."
1,hnn.us,"period I study, everyone''s well-being was at ...",a person,"was the colonist or the colonized, the enslave..."
2,hnn.us,is bought up of so-called liberals wanting to ...,a person,might kill 6-20 other people. </s><s> By defin...
3,uh.edu,"<s> When you know how to search your mind, ide...",a person,who can look at a beehive and change the world...
4,hnn.us,on grounds of uncollegiality. </s><s> And thes...,a person,who at times tends to interpret differences ov...
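
The file layout described above (four metadata lines, then a header row and the matches) can also be illustrated without pandas. The following is only a minimal sketch using the standard `csv` module: the `SAMPLE_EXPORT` string, its rows and the `read_concordance` helper are made up for illustration and are not part of the original pipeline.

```python
import csv
import io

# Hypothetical miniature export: four metadata lines, then the header and two rows.
SAMPLE_EXPORT = '''"corpus","preloaded/ententen21_tt31"
"subcorpus","-"
"concordance size","2"
"query","Query:[lc=""a""] [lc=""person""]"
"Reference","Left","KWIC","Right"
"example.org","<s> It takes time to know","a person","well. </s><s> Still,"
"example.com","some earlier text, and then","a person","arrived, as expected"
'''


def read_concordance(handle, n_meta=4):
    """Return (metadata dict, list of row dicts) from a concordance export."""
    reader = csv.reader(handle)
    # The first n_meta rows are key/value pairs describing the search.
    metadata = dict(next(reader) for _ in range(n_meta))
    # The next row is the table header; the remaining rows are the matches.
    header = next(reader)
    rows = [dict(zip(header, row)) for row in reader]
    return metadata, rows


metadata, rows = read_concordance(io.StringIO(SAMPLE_EXPORT))
print(metadata['concordance size'])
print(rows[0]['KWIC'])
```

For a real export, the same helper could be given a file opened with `encoding='utf-8-sig'` and `newline=''`, so that a leading byte order mark and embedded newlines are handled by the `csv` module.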


We need to remove the surplus context and join the search expression with its left and right context. The result is saved to a text file:

In [5]:
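def clean_matches(csv_path, person_word, savedir='/srv/data/gussodato/thesis/ententen/'):
    """
    Remove context outside of sentence boundaries. If no sentence boundary is included in the context,
    cut at the comma furthest from the match instead. Write the result to a file.
    """
    print(f'Cleaning "{person_word}" samples in {csv_path}...')
    df = pd.read_csv(csv_path, header=4)
    # Left context: keep the text after the last tag (e.g. <s>); without a tag, drop everything
    # up to the first comma. The fallback joins the remaining pieces with spaces, so any later
    # commas are dropped too, and a context without a comma becomes empty.
    leftcontexts = [line.split('>')[-1] if '>' in line else ' '.join(line.split(',')[1:]) for line in list(df['Left'])]
    matches = list(df['KWIC'])
    # Right context: keep the text before the first tag (e.g. </s>); without a tag, drop
    # everything after the last comma (same caveats as above).
    rightcontexts = [line.split('<')[0] if '<' in line else ' '.join(line.split(',')[:-1]) for line in list(df['Right'])]

    # One cleaned sequence per match: left context + search expression + right context
    clean_sents = [' '.join(row) for row in zip(leftcontexts, matches, rightcontexts)]

    filename = os.path.join(savedir, person_word+'_clean.txt')
    print(f'Writing to {filename}...')
    with open(filename, 'w', encoding='utf-8') as f:
        for line in clean_sents:
            f.write(line+'\n')
    print('Done!')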

In [6]:
for path, person_word in paths:
    clean_matches(path, person_word, ententen_dir)

Cleaning "person" samples in /srv/data/gussodato/thesis/ententen/concordance_preloaded_ententen21_tt31_20240226154708.csv...
Writing to /srv/data/gussodato/thesis/ententen/person_clean.txt...
Done!
Cleaning "woman" samples in /srv/data/gussodato/thesis/ententen/concordance_preloaded_ententen21_tt31_20240226154516.csv...
Writing to /srv/data/gussodato/thesis/ententen/woman_clean.txt...
Done!
Cleaning "man" samples in /srv/data/gussodato/thesis/ententen/concordance_preloaded_ententen21_tt31_20240226154231.csv...
Writing to /srv/data/gussodato/thesis/ententen/man_clean.txt...
Done!
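
To make the trimming rules easier to inspect, the sketch below applies the same string operations as `clean_matches` to a few hand-made contexts instead of the real concordance files. The helper names (`trim_left`, `trim_right`, `join_match`) and the example strings are hypothetical, and `squeeze` is an optional extra step that is not part of the original script.

```python
def trim_left(context):
    """Keep the text after the last tag; without a tag, drop everything up to the first comma."""
    if '>' in context:
        return context.split('>')[-1]
    return ' '.join(context.split(',')[1:])


def trim_right(context):
    """Keep the text before the first tag; without a tag, drop everything after the last comma."""
    if '<' in context:
        return context.split('<')[0]
    return ' '.join(context.split(',')[:-1])


def join_match(left, kwic, right):
    """Join the trimmed left context, the match and the trimmed right context."""
    return ' '.join([trim_left(left), kwic, trim_right(right)])


def squeeze(text):
    """Hypothetical extra step (not in the notebook): collapse runs of whitespace."""
    return ' '.join(text.split())


with_tags = join_match('of uncollegiality. </s><s> And this is', 'a person',
                       'who disagrees. </s><s> Later on')
without_tags = join_match('some earlier text, and then', 'a person',
                          'arrived, as expected, or so')
no_comma = join_match('plain context', 'a person', 'plain context')

print(repr(with_tags))
print(repr(without_tags))
print(repr(no_comma))
print(squeeze(without_tags))
```

Two properties of the fallback are worth noting: commas inside the kept part are replaced by spaces, and a context with neither a tag nor a comma is dropped entirely. The cleaned lines can also carry leading, trailing or doubled spaces, which a step like `squeeze` would remove if that matters downstream.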


### Done!