# SemCore - Distributional Considerations

In the previous `SemCore.ipynb` notebook we built a dataset for analysis and found a statistical signal for word sense disambiguation in hidden activations.

However, one possible explanation for those results is that the signal merely reflects the more standard (more frequent) senses of a word having a higher baseline correlation.

Because the dataset over-represents those standard senses (almost by definition, since they are the ones that occur most often), this could produce a "false positive". The situation is analogous to sensitivity/specificity considerations.

To investigate this possibility, we will generate a version of the dataset that draws uniformly from the word senses present.
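
Before balancing, it can help to quantify how skewed the senses actually are. The snippet below is a minimal sketch on made-up toy rows: `dominant_sense_share` is a hypothetical helper (not part of the original notebook) that reports, for each word, the fraction of rows taken by its most frequent sense. The `word` and `definition` column names are assumed to match the ones built in the next cell.

```python
import pandas as pd

# Hypothetical toy data for illustration only; the real frame is built below.
toy = pd.DataFrame({
    "word": ["bank"] * 5 + ["model"] * 4,
    "definition": ["financial institution"] * 4 + ["sloping land"]
                  + ["a type of product"] * 2 + ["something to be imitated"] * 2,
})

def dominant_sense_share(frame):
    """For each word, the fraction of rows taken by its most frequent sense."""
    counts = frame.groupby(["word", "definition"]).size()
    per_word = counts.groupby(level="word")
    return per_word.max() / per_word.sum()

share = dominant_sense_share(toy)
print(share)
```

A share close to 1.0 means a single sense dominates the word's examples, which is the situation the uniform resampling is meant to remove.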

In [7]:
import pandas as pd
import xml.etree.ElementTree as ET
from tqdm import tqdm
import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')

semcore_xml = "Data/Raw/semcor.data.xml"
wordnet_keys = "Data/Raw/semcor.gold.key.txt"

# parse XML
tree = ET.parse(semcore_xml)
root = tree.getroot()

all_sentences = []
for text_el in tqdm(root.findall('text')):
    text_id = text_el.get('id')
    for sentence_el in text_el.findall('sentence'):
        sentence_tokens = [
            {
                'tag': token.tag,
                'word': token.text,
                'lemma': token.get('lemma'),
                'pos': token.get('pos'),
                'id': token.get('id')
            }
            for token in sentence_el
        ]
        all_sentences.append({
            'text_id': text_id,
            'sentence_id': sentence_el.get('id'),
            'tokens': sentence_tokens
        })

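def process_sentence(sentence):
    """Return one record per sense-annotated ('instance') token in a sentence."""
    # tokens are joined with '|' so that word_loc can index into sentence.split('|')
    sentence_text = "|".join([token['word'] for token in sentence['tokens']])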
    return [
        {
            'word': token['word'],
            'wordnet_join_id': token['id'],
            'sentence': sentence_text,
            'word_loc': idx
        }
        for idx, token in enumerate(sentence['tokens']) if token['tag'] == 'instance'
    ]

row_data = [record for sentence in tqdm(all_sentences) for record in process_sentence(sentence)]
df = pd.DataFrame(row_data)

# load WordNet keys
data = []
with open(wordnet_keys, "r", encoding="utf-8") as f:
    for line in f:
        local_key, wordnet_id = line.strip().split(" ", 1)
        data.append((local_key, wordnet_id))

wordnet_merge_df = pd.DataFrame(data, columns=["wordnet_join_id", "wordnet"])
# attach the gold sense key to each instance via the shared token id
df = df.merge(wordnet_merge_df, on='wordnet_join_id').drop(columns=['wordnet_join_id'])

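# get WordNet definitions
def def_from_sense_key(key):
    """Return the gloss of the synset that a WordNet sense key points to."""
    # if the field holds several space-separated keys, use the first one
    key = key.split(" ")[0]
    lemma = wn.lemma_from_key(key)
    synset = lemma.synset()
    return synset.definition()

def options_from_key(key):
    """Return the glosses of every synset for the key's lemma, joined by '|'."""
    # the lemma is the part of the sense key before '%'
    word = key.split("%")[0]
    return "|".join([syn.definition() for syn in wn.synsets(word)])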

tqdm.pandas()
df['definition'] = df['wordnet'].progress_apply(def_from_sense_key)
df['definitions'] = df['wordnet'].progress_apply(options_from_key)


[nltk_data] Downloading package wordnet to /home/matt/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
100%|████████████████████████████████████████████████████████████████████████████████| 352/352 [00:00<00:00, 542.35it/s]
100%|██████████████████████████████████████████████████████████████████████████| 37176/37176 [00:00<00:00, 89210.59it/s]
100%|████████████████████████████████████████████████████████████████████████| 226036/226036 [00:18<00:00, 12217.22it/s]
100%|████████████████████████████████████████████████████████████████████████| 226036/226036 [00:07<00:00, 32206.04it/s]


In [8]:
df.head()

Unnamed: 0,word,sentence,word_loc,wordnet,definition,definitions
0,long,How|long|has|it|been|since|you|reviewed|the|ob...,1,long%3:00:02::,primarily temporal sense; being or indicating ...,desire strongly or persistently|primarily temp...
1,been,How|long|has|it|been|since|you|reviewed|the|ob...,4,be%2:42:03::,"have the quality of being; (copula, used with ...",a light strong brittle grey toxic bivalent met...
2,reviewed,How|long|has|it|been|since|you|reviewed|the|ob...,7,review%2:31:00::,look at again; examine again,a new appraisal or evaluation|an essay or arti...
3,objectives,How|long|has|it|been|since|you|reviewed|the|ob...,9,objective%1:09:00::,the goal intended to be attained (and which is...,the goal intended to be attained (and which is...
4,benefit,How|long|has|it|been|since|you|reviewed|the|ob...,12,benefit%1:21:00::,financial assistance in time of need,financial assistance in time of need|something...
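
The `wordnet` column holds WordNet sense keys of the form `lemma%ss_type:lex_filenum:lex_id:head_word:head_id`. As a rough sketch of how to read them, the hypothetical helper `parse_sense_key` below (not part of the original notebook) splits a key into those fields without needing NLTK. The part-of-speech mapping follows the WordNet sense-key convention.

```python
# Part-of-speech codes used in the ss_type field of a WordNet sense key.
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb", 5: "adjective satellite"}

def parse_sense_key(key):
    """Split a sense key into its lemma and lex_sense fields."""
    lemma, lex_sense = key.split("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return {
        "lemma": lemma,
        "pos": SS_TYPES[int(ss_type)],
        "lex_filenum": int(lex_filenum),
        "lex_id": int(lex_id),
        # head_word and head_id are only filled in for adjective satellites
        "head_word": head_word or None,
        "head_id": int(head_id) if head_id else None,
    }

review = parse_sense_key("review%2:31:00::")
worthy = parse_sense_key("model%5:00:00:worthy:00")
print(review)
print(worthy)
```

Reading the `ss_type` field this way makes it easy to see when two rows for the same word differ in part of speech as well as in sense.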


In [9]:
# group by surface form (`word`), then by sense (`definition`), and sample
# the same number of rows from every sense of that word
uniform_dfs = []

for word, word_group in tqdm(df.groupby("word")):
    sense_groups = list(word_group.groupby("definition"))
    
    # Only balance if there are at least two senses
    if len(sense_groups) < 2:
        continue
    
    # Get the min number of samples available for any definition
    min_count = min(len(g) for _, g in sense_groups)
    
    # Sample min_count from each sense group
    for _, sense_group in sense_groups:
        uniform_dfs.append(sense_group.sample(min_count, random_state=42))

# combine all balanced samples
uniform_df = pd.concat(uniform_dfs).reset_index(drop=True)

100%|████████████████████████████████████████████████████████████████████████████| 33657/33657 [01:01<00:00, 543.96it/s]
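
A quick sanity check on the result is to confirm that every word really does have the same number of rows for each of its senses. The sketch below uses a hypothetical helper, `is_balanced`, shown on made-up toy frames; on the real data the call would be `is_balanced(uniform_df)`, assuming the `word` and `definition` columns used above.

```python
import pandas as pd

def is_balanced(frame):
    """True if, for every word, all of its senses have the same row count."""
    counts = frame.groupby(["word", "definition"]).size()
    # a balanced word has exactly one distinct count across its senses
    return bool((counts.groupby(level="word").nunique() == 1).all())

# Hypothetical toy frames for illustration only.
balanced = pd.DataFrame({"word": ["a", "a", "b", "b"],
                         "definition": ["x", "y", "p", "q"]})
skewed = pd.DataFrame({"word": ["a", "a", "a"],
                       "definition": ["x", "x", "y"]})

print(is_balanced(balanced), is_balanced(skewed))
```

Because each sense is downsampled to the size of the word's rarest sense, many words contribute only one row per sense, so the balanced dataset is much smaller than the original.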


In [21]:
uniform_df[uniform_df['word'] == 'model']

Unnamed: 0,word,sentence,word_loc,wordnet,definition,definitions
22386,model,While|there|may|be|several|such|industries|to|...,10,model%1:09:00::,a hypothetical description of a complex entity...,a hypothetical description of a complex entity...
22387,model,The|model|quite|plainly|thought|Michelangelo|c...,1,model%1:18:00::,a person who poses for a photographer or paint...,a hypothetical description of a complex entity...
22388,model,Eichmann|himself|is|a|model|of|how|the|myth|of...,4,model%1:09:02::,a representative form or pattern,a hypothetical description of a complex entity...
22389,model,The|magnetic resonance|absorption|was|detected...,9,model%1:09:03::,a type of product,a hypothetical description of a complex entity...
22390,model,"The|District Courts|,|in|the|framing|of|equita...",17,model%2:36:00::,"form in clay, wax, etc",a hypothetical description of a complex entity...
22391,model,When|carving|he|was|charged|with|spontaneous|e...,28,model%1:06:00::,representation of something (sometimes on a sm...,a hypothetical description of a complex entity...
22392,model,The|Glazer-Fine|Arts|edition|(|Concert-Disc|)|...,9,model%1:09:01::,something to be imitated,a hypothetical description of a complex entity...
22393,model,"Criminals|,|as|well|as|model|citizens|,|exerci...",5,model%5:00:00:worthy:00,worthy of imitation,a hypothetical description of a complex entity...


In [None]:
# shuffle and save the result

uniform_df = uniform_df.sample(frac=1, random_state=42).reset_index(drop=True)

uniform_df.to_csv('Data/Processed/SemCoreProcessedUniform.csv', index=False)