## Supplementary code for paper submission: 'Tracing Semantic Variation in Slang'.

This notebook contains the supplementary data pre-processing code for 'Tracing Semantic Variation in Slang'. Since we cannot publicly release all entries from Green's Dictionary of Slang (GDoS) due to copyright terms, this notebook illustrates how we pre-process raw data obtained from https://greensdictofslang.com/ and convert it into a format that can be used to reproduce our experimental results.

Here is a list of the non-standard Python packages you'll need, all of which can be installed with *pip install*:

- numpy
- bs4
- tqdm

In [3]:
import bs4
import pickle
import re
import glob
import numpy as np
from tqdm import trange

In [4]:
from util import GSD_Definition, GSD_Word
from process import process_GSD

For illustration, we include the raw HTML dumps of 3 dictionary entries for the slang word *beast*. Each file is named after the hash tag assigned to the entry by the original dictionary. The original entries can be found on the following webpages:

https://greensdictofslang.com/entry/23sqfua

https://greensdictofslang.com/entry/xzzdtua

https://greensdictofslang.com/entry/3e7vqxq

We first crawl our directory for these hash tags:

In [5]:
word_hash = [s[:-5] for s in glob.glob('*.html')]

In [6]:
print(word_hash)

[]
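
An empty list, as in the output above, means that no `.html` files were found in the current working directory. As a minimal sketch of the same crawl, the snippet below collects the hash tags from an arbitrary directory with `pathlib`. The helper name `list_entry_hashes` and the temporary demo files are hypothetical and only serve to make the example self-contained.

```python
import tempfile
from pathlib import Path

def list_entry_hashes(input_dir="."):
    """Return the sorted file stems (names without '.html') of all HTML dumps."""
    return sorted(p.stem for p in Path(input_dir).glob("*.html"))

# Demo on empty placeholder files named after the three example entries.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("23sqfua.html", "xzzdtua.html", "3e7vqxq.html", "notes.txt"):
        (Path(tmp) / name).touch()
    hashes = list_entry_hashes(tmp)

print(hashes)  # ['23sqfua', '3e7vqxq', 'xzzdtua']
```

Unlike slicing `glob.glob` results with `s[:-5]`, `Path.stem` also drops the directory part, so the same helper works when the dumps live outside the working directory.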


The following pre-processing function then takes a list of hash tags and processes the corresponding HTML files. A pickle file is generated for each word entry. Note that we do not collapse homonyms (i.e., the same word form with multiple word entries) until the actual experiment.

In [7]:
process_GSD(word_hash, input_dir = "", output_dir = "")

0it [00:00, ?it/s]


This should generate 3 pickle files, which we now load for further pre-processing.

In [8]:
data = [pickle.load(open(h+'.pickle', 'rb')) for h in word_hash]
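
The one-liner above leaves each file handle open until it is garbage-collected. A minimal sketch of an equivalent loader that closes every file explicitly might look as follows. The helper `load_entries` and the toy dictionary records are hypothetical stand-ins; the real pickles hold `GSD_Word` objects, so `util.py` must be importable when they are loaded.

```python
import pickle
import tempfile
from pathlib import Path

def load_entries(hashes, input_dir="."):
    """Load one pickled entry per hash tag, closing each file after reading."""
    entries = []
    for h in hashes:
        with open(Path(input_dir) / f"{h}.pickle", "rb") as f:
            entries.append(pickle.load(f))
    return entries

# Round-trip demo with toy records (not real dictionary data).
toy = {"23sqfua": {"word": "beast", "homonym": 1},
       "xzzdtua": {"word": "beast", "homonym": 2}}

with tempfile.TemporaryDirectory() as tmp:
    for h, record in toy.items():
        with open(Path(tmp) / f"{h}.pickle", "wb") as f:
            pickle.dump(record, f)
    loaded = load_entries(list(toy), input_dir=tmp)

print(loaded)
```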

The following code filters the reference entries according to the set of regions that we are interested in (in our case, US and UK). It also tries to automatically extract valid example usage sentences from the reference entries.

In [9]:
regions = ['[US]', '[UK]']
#regions = ['[US]', '[UK]', '[Aus]']
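
To make the region filter concrete, here is a small sketch on toy data. It assumes that each stamp is a tuple whose second element is the region tag, which is how the processing loop accesses it via `s[1]`; the `(year, region)` layout and the quotes are made up for illustration, and the real structure is defined in `util.py`.

```python
regions = ['[US]', '[UK]']

# Hypothetical stamps: (year, region) tuples; only index 1 (the region) matters.
stamps = [(1899, '[UK]'), (1934, '[Aus]'), (1971, '[US]')]
# Hypothetical contexts: one toy quote per stamp.
contexts = {stamps[0]: 'toy quote A',
            stamps[1]: 'toy quote B',
            stamps[2]: 'toy quote C'}

# Keep the stamps from the target regions, then the contexts that belong to them.
kept_stamps = [s for s in stamps if s[1] in regions]
kept_contexts = {s: c for s, c in contexts.items() if s in kept_stamps}

print(kept_stamps)    # [(1899, '[UK]'), (1971, '[US]')]
print(kept_contexts)  # the '[Aus]' entry is dropped
```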

In [10]:
# Raw string: same characters as before, without invalid-escape warnings.
punctuations = r'''!'"#$%&()\*\+,-\./:;<=>?@[\]^_`{|}~'''

re_punc = re.compile(r"["+punctuations+r"]+")
re_space = re.compile(r" +")

re_extract_quote = re.compile(r"[1-9/]+:")
re_extract_quote_all = re.compile(r"[1-9/]+:.*$")

def proc_quote_sent(sent):
    return re_extract_quote.sub(' ', re_extract_quote_all.findall(sent)[0]).strip()

def validate_quote_sent(word, sent):
    tokens = [s.lower() for s in re_space.sub(' ', re_punc.sub('', sent)).split(' ')]
    return word.lower() in tokens

data_proc = []

for i in trange(len(data)):
    w = data[i]
    if w.is_abbr():
        continue
    d_list = []
    for d in w.definitions:
        stamps = d.stamps
        region_set = {s[1] for s in stamps}
        if any(r in region_set for r in regions):
            # Keep only the stamps whose region tag (s[1]) is one of the target regions.
            new_stamps = [s for s in stamps if s[1] in regions]
            new_def = GSD_Definition(d.def_sent)
            new_def.stamps = new_stamps
            new_def.contexts = {key:value for key, value in d.contexts.items() if key in new_stamps}
            d_list.append(new_def)
    if len(d_list) > 0:
        new_word = GSD_Word(w.word.replace("\\xe2\\x80\\x99", "'").replace("\\xe2\\x80\\x98", "'"), w.pos, w.homonym)
        new_word.definitions = d_list
        data_proc.append(new_word)

0it [00:00, ?it/s]
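
The two helper functions defined at the top of the cell, `proc_quote_sent` and `validate_quote_sent`, are easiest to understand on an example. The sketch below re-implements them under two assumptions that are not taken from the original code: the punctuation class is built from `string.punctuation`, and a missing marker returns `None` instead of raising `IndexError`. The input string is invented and does not reproduce the actual GDoS citation format.

```python
import re
import string

re_punc = re.compile("[" + re.escape(string.punctuation) + "]+")
re_space = re.compile(r" +")
re_marker = re.compile(r"[1-9/]+:")          # same pattern as re_extract_quote
re_marker_tail = re.compile(r"[1-9/]+:.*$")  # same pattern as re_extract_quote_all

def extract_quote(sent):
    """Return the text after the marker(s), or None if no marker is present."""
    match = re_marker_tail.search(sent)
    if match is None:
        return None
    return re_marker.sub(' ', match.group(0)).strip()

def contains_word(word, sent):
    """Check whether `word` occurs as a whole token, ignoring case and punctuation."""
    tokens = re_space.sub(' ', re_punc.sub('', sent)).lower().split(' ')
    return word.lower() in tokens

raw = "Toy Source 12/3: He fought like a beast."  # invented example line
quote = extract_quote(raw)

print(quote)                                            # He fought like a beast.
print(contains_word("beast", quote))                    # True
print(contains_word("beast", "What beastly weather!"))  # False
print(extract_quote("no marker here"))                  # None
```

Matching on whole tokens rather than substrings is what keeps *beastly* from counting as a usage of *beast*.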


Here's what the data looks like after pre-processing:

In [11]:
for d in data_proc:
    print(d)

We now save the pre-processed data to be used for experiments. See the notebook *Trace.ipynb* in the code package for how this can be used to reproduce results in our paper.

In [12]:
np.save('GSD_sample_data.npy', data_proc)
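
Because `data_proc` is a list of Python objects, `np.save` stores it as a pickled object array. Reading it back therefore requires `allow_pickle=True`, which is not the default in recent NumPy versions, and the `GSD_Word` and `GSD_Definition` classes must be importable at load time. The round trip below is a minimal sketch on toy dictionaries; the file name and the records are placeholders rather than real data.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for the pre-processed GSD_Word objects.
records = [{"word": "beast", "homonym": 1}, {"word": "beast", "homonym": 2}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_sample_data.npy")
    # Saved as a 1-D object array, one element per record.
    np.save(path, np.array(records, dtype=object))
    # Object arrays are pickled on disk, so loading needs allow_pickle=True.
    loaded = np.load(path, allow_pickle=True).tolist()

print(loaded == records)  # True
```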