# SemCor dataset

In this notebook we'll break down how the SemCor dataset is organised and check whether it will be useful.

In [2]:
import pandas as pd
import xml.etree.ElementTree as ET
from tqdm import tqdm

In [35]:
semcore_xml  = "Data/Raw/semcor.data.xml"
wordnet_keys = "Data/Raw/semcor.gold.key.txt"

## The XML

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<corpus lang="en" source="semcor">
<text id="d000" source="br-e30">
<sentence id="d000.s000">
<wf lemma="how" pos="ADV">How</wf>
<instance id="d000.s000.t000" lemma="long" pos="ADJ">long</instance>
<wf lemma="have" pos="VERB">has</wf>
<wf lemma="it" pos="PRON">it</wf>
<instance id="d000.s000.t001" lemma="be" pos="VERB">been</instance>
<wf lemma="since" pos="ADP">since</wf>
<wf lemma="you" pos="PRON">you</wf>
<instance id="d000.s000.t002" lemma="review" pos="VERB">reviewed</instance>
<wf lemma="the" pos="DET">the</wf>
<instance id="d000.s000.t003" lemma="objective" pos="NOUN">objectives</instance>
<wf lemma="of" pos="ADP">of</wf>
<wf lemma="you" pos="PRON">your</wf>
<instance id="d000.s000.t004" lemma="benefit" pos="NOUN">benefit</instance>
<wf lemma="and" pos="CONJ">and</wf>
<instance id="d000.s000.t005" lemma="service" pos="NOUN">service</instance>
<instance id="d000.s000.t006" lemma="program" pos="NOUN">program</instance>
<wf lemma="?" pos=".">?</wf>
</sentence>
...
```

The data is hierarchical: corpus -> text -> sentence -> (word form `wf` OR `instance`).

Here we are primarily interested in the `instance` elements, since each has an `id` that acts as a lookup key; a companion file maps these ids to the WordNet sense of the word.

For our purposes we don't care about the POS tags and are only interested in the instances. At the cost of some redundancy (the sentence is repeated for every instance in it), we can convert this XML to a pandas dataframe which stores:
* instance
* source sentence
* wordnet key
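
The conversion described above can be tried on a tiny inline fragment before running it on the full file. This is only a sketch: the `SAMPLE_XML` string and the `flatten_instances` helper are made up for illustration, and the wordnet key column is left out because it comes from the companion file.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-token fragment in the same shape as semcor.data.xml
SAMPLE_XML = """
<corpus lang="en" source="semcor">
<text id="d000" source="br-e30">
<sentence id="d000.s000">
<wf lemma="how" pos="ADV">How</wf>
<instance id="d000.s000.t000" lemma="long" pos="ADJ">long</instance>
</sentence>
</text>
</corpus>
"""


def flatten_instances(xml_string):
    """Return one dict per <instance>, each carrying its full source sentence."""
    root = ET.fromstring(xml_string.strip())
    rows = []
    for sentence_el in root.iter('sentence'):
        # Every child of <sentence> is a token (<wf> or <instance>)
        sentence_text = " ".join(token_el.text for token_el in sentence_el)
        for token_el in sentence_el.findall('instance'):
            rows.append({
                'instance_id': token_el.get('id'),
                'word': token_el.text,
                'sentence': sentence_text,
            })
    return rows


rows = flatten_instances(SAMPLE_XML)
print(rows)
```

The sentence text is duplicated on every row, which is the redundancy mentioned above.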

In [4]:
tree = ET.parse(semcore_xml)
root = tree.getroot()

In [5]:
print(f'{root.tag=}')
print(f'{root.attrib=}')

root.tag='corpus'
root.attrib={'lang': 'en', 'source': 'semcor'}


In [6]:
# Show only the first 10 children of the root element
for child in root[:10]:
    print(f'{child.tag=}', f'{child.attrib=}')

child.tag='text' child.attrib={'id': 'd000', 'source': 'br-e30'}
child.tag='text' child.attrib={'id': 'd001', 'source': 'br-l15'}
child.tag='text' child.attrib={'id': 'd002', 'source': 'br-f16'}
child.tag='text' child.attrib={'id': 'd003', 'source': 'br-j42'}
child.tag='text' child.attrib={'id': 'd004', 'source': 'br-g18'}
child.tag='text' child.attrib={'id': 'd005', 'source': 'br-e26'}
child.tag='text' child.attrib={'id': 'd006', 'source': 'br-f18'}
child.tag='text' child.attrib={'id': 'd007', 'source': 'br-f24'}
child.tag='text' child.attrib={'id': 'd008', 'source': 'br-n17'}
child.tag='text' child.attrib={'id': 'd009', 'source': 'br-h17'}


In [13]:
all_sentences = []

for text_el in tqdm(root.findall('text')):
    text_id = text_el.get('id')

    for sentence_el in text_el.findall('sentence'):
        sentence_id = sentence_el.get('id')
        sentence_tokens = []

        for token_el in sentence_el:
            token_info = {
                'tag': token_el.tag,                # 'wf' or 'instance'
                'word': token_el.text,
                'lemma': token_el.get('lemma'),
                'pos': token_el.get('pos'),
                'id': token_el.get('id')           # only present if tag == 'instance'
            }
            sentence_tokens.append(token_info)

        all_sentences.append({
            'text_id': text_id,
            'sentence_id': sentence_id,
            'tokens': sentence_tokens
        })


print(f"found {len(all_sentences)} sentences")

100%|████████████████████████████████████████████████████████████████████████████████| 352/352 [00:00<00:00, 359.34it/s]

found 37176 sentences





In [31]:
def process_sentence(sentence):
    """Return one record per <instance> token in the sentence."""
    # Rebuild the sentence text by joining the surface forms with spaces
    sentence_text = " ".join([token['word'] for token in sentence['tokens']])

    records = []
    for idx, token in enumerate(sentence['tokens']):
        if token['tag'] == 'instance':
            record = {
                'word'            : token['word'],
                'wordnet_join_id' : token['id'],
                'sentence'        : sentence_text,
                'word_loc'        : idx  # token index within the sentence
            }
            records.append(record)

    return records
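
Note that `word_loc` is a token index, not a character offset, and some tokens contain a space themselves (multi-word entries such as "adds up"), so splitting `sentence` on spaces will not always land on the same token. If a character span is ever needed, one option is to record it while joining. A minimal sketch, where `char_spans` is a hypothetical helper that is not part of this notebook and the token list is made up:

```python
def char_spans(words, sep=" "):
    """Return (start, end) character offsets of each word in sep.join(words)."""
    spans = []
    cursor = 0
    for word in words:
        start = cursor
        end = start + len(word)
        spans.append((start, end))
        cursor = end + len(sep)  # skip over the separator before the next word
    return spans


# Made-up token list; "adds up" is a single token that contains a space
words = ["it", "adds up", "to", "odds"]
sentence_text = " ".join(words)
spans = char_spans(words)

start, end = spans[1]
print(sentence_text[start:end])      # slice recovered from the character span
print(sentence_text.split(" ")[1])   # naive split only recovers part of the token
```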


In [32]:
row_data = [record for sentence in tqdm(all_sentences) for record in process_sentence(sentence)]

100%|██████████████████████████████████████████████████████████████████████████| 37176/37176 [00:00<00:00, 56436.30it/s]


In [33]:
df = pd.DataFrame(row_data)
df.head(5)

| | word | wordnet_join_id | sentence | word_loc |
|---|---|---|---|---|
| 0 | long | d000.s000.t000 | How long has it been since you reviewed the ob... | 1 |
| 1 | been | d000.s000.t001 | How long has it been since you reviewed the ob... | 4 |
| 2 | reviewed | d000.s000.t002 | How long has it been since you reviewed the ob... | 7 |
| 3 | objectives | d000.s000.t003 | How long has it been since you reviewed the ob... | 9 |
| 4 | benefit | d000.s000.t004 | How long has it been since you reviewed the ob... | 12 |


## Merging in the wordnet keys

These link the instance ids to WordNet sense keys, from which the definitions can be looked up.

```csv
d000.s000.t000 long%3:00:02::
d000.s000.t001 be%2:42:03::
d000.s000.t002 review%2:31:00::
d000.s000.t003 objective%1:09:00::
d000.s000.t004 benefit%1:21:00::
d000.s000.t005 service%1:04:07::
```

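Most take this form:

`lemma%ss_type:lex_filenum:lex_id::`

The full format is `lemma%ss_type:lex_filenum:lex_id:head_word:head_id`; the last two fields are empty except for satellite adjectives (`ss_type` 5). Some instances also carry more than one sense key, separated by spaces: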

```
improved%5:00:00:better:00
public%5:00:00:common:02 public%3:00:00:
```

For now I'll merge these in as strings, then see what the wordnet data looks like to decide how to handle the edge cases.
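
To make the key format concrete, here is a small sketch that splits a key into its named fields. The `SenseKey` tuple and both helper functions are hypothetical names introduced for illustration; the `ss_type` codes follow the WordNet sense key documentation.

```python
from typing import NamedTuple

# ss_type codes as listed in the WordNet sense key documentation
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb", 5: "adjective satellite"}


class SenseKey(NamedTuple):
    lemma: str
    ss_type: int
    lex_filenum: int
    lex_id: int
    head_word: str   # empty unless ss_type == 5
    head_id: str     # empty unless ss_type == 5


def parse_sense_key(key):
    """Split one sense key into its fields (hypothetical helper)."""
    lemma, _, lex_sense = key.partition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return SenseKey(lemma, int(ss_type), int(lex_filenum), int(lex_id), head_word, head_id)


def parse_sense_keys(raw):
    """Parse a field that may hold several space-separated sense keys."""
    return [parse_sense_key(key) for key in raw.split()]


print(parse_sense_key("long%3:00:02::"))
print(parse_sense_key("improved%5:00:00:better:00"))
print(parse_sense_keys("service%1:14:05:: service%1:04:00::"))
```

A key with the wrong number of colon-separated fields raises a `ValueError` here, which makes malformed rows easy to spot.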


In [56]:
data = []

with open(wordnet_keys, "r", encoding="utf-8") as f:
    for line in tqdm(f):
        parts = line.strip().split(" ", 1)
        local_key, wordnet_id = parts
        data.append((local_key, wordnet_id))

wordnet_merge_df = pd.DataFrame(data, columns=["wordnet_join_id", "wordnet"])

226036it [00:00, 437342.53it/s]


In [57]:
wordnet_merge_df.iloc[70:75] # check that rows holding several space-separated sense keys are kept intact

| | wordnet_join_id | wordnet |
|---|---|---|
| 70 | d000.s010.t003 | free%3:00:00:: |
| 71 | d000.s010.t004 | buying%1:04:00:: |
| 72 | d000.s010.t005 | service%1:14:05:: service%1:04:00:: |
| 73 | d000.s010.t006 | employee%1:18:00:: |
| 74 | d000.s011.t000 | improvement%1:04:00:: |


In [59]:
# Join on the shared instance id, then drop it since it is no longer needed
df = df.merge(wordnet_merge_df, on='wordnet_join_id').drop(columns=['wordnet_join_id'])

In [60]:
df.head()

| | word | sentence | word_loc | wordnet |
|---|---|---|---|---|
| 0 | long | How long has it been since you reviewed the ob... | 1 | long%3:00:02:: |
| 1 | been | How long has it been since you reviewed the ob... | 4 | be%2:42:03:: |
| 2 | reviewed | How long has it been since you reviewed the ob... | 7 | review%2:31:00:: |
| 3 | objectives | How long has it been since you reviewed the ob... | 9 | objective%1:09:00:: |
| 4 | benefit | How long has it been since you reviewed the ob... | 12 | benefit%1:21:00:: |


## Getting definitions from WordNet

The WordNet data can be downloaded through the `nltk` package, so little manual work is needed here.

In [61]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/matt/nltk_data...


True

In [62]:
from nltk.corpus import wordnet as wn

In [65]:
demo_sense_key = df.iloc[0]["wordnet"]
print(f"{demo_sense_key=}")

demo_sense_key='long%3:00:02::'


In [66]:
lemma = wn.lemma_from_key(demo_sense_key)

synset = lemma.synset()

print(f"Word: {lemma.name()}")
print(f"Synset: {synset.name()}")
print(f"Definition: {synset.definition()}")
print(f"Examples: {synset.examples()}")


Word: long
Synset: long.a.01
Definition: primarily temporal sense; being or indicating a relatively great or greater than average duration or passage of time or a duration as specified
Examples: ['a long life', 'a long boring speech', 'a long time', 'a long friendship', 'a long game', 'long ago', 'an hour long']


In [67]:
def def_from_sense_key(key):
    lemma = wn.lemma_from_key(key)
    synset = lemma.synset()
    return synset.definition()

## Handling edge cases

First let's see how it copes with the satellite adjectives:

In [70]:
def_from_sense_key('improved%5:00:00:better:00')

'become or made better in quality'

Those work. What about when there are two keys?

In [72]:
def_from_sense_key('public%5:00:00:common:02 public%3:00:00:')

ValueError: too many values to unpack (expected 2)

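That throws, presumably because the string now contains two `%` characters and so cannot be split into a single lemma and sense part. Let's see what the two definitions would have been: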

In [74]:
print(f"def 1 is {def_from_sense_key('public%5:00:00:common:02')}")
print(f"def 2 is {def_from_sense_key('public%3:00:00:')}")

def 1 is affecting the people or community as a whole


ValueError: not enough values to unpack (expected 5, got 4)

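The second lookup fails for a different reason: as typed, `public%3:00:00:` ends in a single colon, so it has four colon-separated fields where five are expected (compare the `::` ending of the other keys). Either way, let's just use the first key when there are several.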

In [75]:
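def def_from_sense_key(key):
    """Return the WordNet definition for the first sense key in the string."""
    # Some rows hold several space-separated keys; keep only the first
    key = key.split(" ")[0]
    lemma = wn.lemma_from_key(key)
    synset = lemma.synset()
    return synset.definition()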
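
Taking the first key silently discards the others. An alternative, sketched below, tries each key in turn and skips any that fail to resolve. The `first_resolvable` helper and the `FAKE_DEFINITIONS` table are hypothetical; the table stands in for a lookup such as `def_from_sense_key` so the sketch runs without WordNet, and the set of exceptions to catch is an assumption that would need adjusting for the real lookup.

```python
def first_resolvable(raw, lookup, errors=(ValueError, KeyError)):
    """Return (key, lookup(key)) for the first sense key in `raw` that resolves.

    `raw` may hold several space-separated keys; `lookup` is any callable
    that maps one key to a definition and raises on failure.
    """
    for key in raw.split():
        try:
            return key, lookup(key)
        except errors:
            continue  # malformed or unknown key: try the next one
    return None, None


# Stand-in lookup table so the sketch runs without WordNet installed
FAKE_DEFINITIONS = {
    "public%5:00:00:common:02": "affecting the people or community as a whole",
}

key, definition = first_resolvable(
    "public%3:00:00: public%5:00:00:common:02",
    FAKE_DEFINITIONS.__getitem__,
)
print(key, "->", definition)
```

Whether falling back like this is preferable to always taking the first key depends on how the multi-key rows are meant to be used downstream.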

In [76]:
tqdm.pandas()
df['definition'] = df['wordnet'].progress_apply(def_from_sense_key)

100%|████████████████████████████████████████████████████████████████████████| 226036/226036 [00:22<00:00, 10131.24it/s]


In [78]:
df.sample(5)

| | word | sentence | word_loc | wordnet | definition |
|---|---|---|---|---|---|
| 45421 | feel | You feel him every mile further away . | 1 | feel%2:39:00:: | perceive by a physical sensation, e.g., coming... |
| 72057 | enrichment | It is possible that the idea of enrichment of ... | 7 | enrichment%1:04:00:: | act of making fuller or more meaningful or rew... |
| 115019 | newly | This appears to result from both a reduced amo... | 19 | newly%4:02:00:: | very recently |
| 154614 | adjournment | Before adjournment Monday afternoon , the Sena... | 1 | adjournment%1:04:00:: | the termination of a meeting |
| 204776 | spread | The soldiers themselves cannot stage a success... | 17 | spread%2:35:00:: | distribute or disperse widely |


## Including all the possible definitions

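These will be the options that a model must distinguish between. Note that `wn.synsets(word)` is not restricted by part of speech, so the candidates can include senses from other word classes.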

In [89]:
def options_from_key(key):
    """Return every WordNet definition for the key's lemma, joined with '|'."""
    word = key.split("%")[0]

    defs = []
    for syn in wn.synsets(word):
        this_def = syn.definition()
        # '|' is used as the separator below, so it must not occur in a definition
        if '|' in this_def:
            raise ValueError(f"forbidden character in: {this_def}, for {word}")
        defs.append(this_def)

    return "|".join(defs)

In [90]:
options_from_key('adjournment%1:04:00::')

'the termination of a meeting|the act of postponing to another time or place'

In [91]:
df['definitions'] = df['wordnet'].progress_apply(options_from_key)

100%|████████████████████████████████████████████████████████████████████████| 226036/226036 [00:10<00:00, 21544.58it/s]


In [92]:
df.sample(5)

| | word | sentence | word_loc | wordnet | definition | definitions |
|---|---|---|---|---|---|---|
| 47643 | study | No one will deny that such broad developments ... | 17 | study%1:04:00:: | a detailed critical inspection | a detailed critical inspection\|applying the mi... |
| 204758 | adds up | Even so , it adds up to impossible odds , exce... | 4 | add_up%2:42:01:: | develop into | develop into\|determine the sum of\|add up in nu... |
| 2160 | permanently | Second , they believed it important to determi... | 19 | permanently%4:02:00:: | for a long time without essential change | for a long time without essential change |
| 106607 | plane | Also , planetary gravitational attraction incr... | 10 | plane%1:25:00:: | (mathematics) an unbounded two-dimensional shape | an aircraft that has a fixed wing and is power... |
| 100392 | liked | The dialogue is sharp , witty and candid - typ... | 38 | like%2:37:05:: | find enjoyable or agreeable | a similar kind; ,\|a kind of person\|prefer or w... |


In [94]:
df.to_csv('Data/Processed/SemCoreProcessed.csv', index=False)
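
When the processed file is read back, the `definitions` column has to be split on `|` again to recover the list of candidates. The sketch below shows one way to do that with the standard `csv` module and turns each row into a multiple-choice example. The `load_examples` helper is hypothetical, and the sample row is made up (with a shortened sentence) from values shown above rather than read from the real file.

```python
import csv
import io

# One made-up row in the same column layout as the processed CSV
SAMPLE_CSV = (
    "word,sentence,word_loc,wordnet,definition,definitions\n"
    'adjournment,"Before adjournment Monday afternoon",1,adjournment%1:04:00::,'
    "the termination of a meeting,"
    "the termination of a meeting|the act of postponing to another time or place\n"
)


def load_examples(handle):
    """Yield dicts with the candidate list and the index of the correct one."""
    for row in csv.DictReader(handle):
        options = row["definitions"].split("|")
        yield {
            "word": row["word"],
            "sentence": row["sentence"],
            "options": options,
            # index of the gold definition; raises ValueError if it is missing
            "label": options.index(row["definition"]),
        }


examples = list(load_examples(io.StringIO(SAMPLE_CSV)))
print(examples[0]["label"], examples[0]["options"])
```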