# SemCor dataset

In this notebook we break down how the SemCor dataset is organised and check whether it will be useful.

In [2]:
import pandas as pd
import xml.etree.ElementTree as ET
from tqdm import tqdm

In [3]:
semcore_xml = "Data/Raw/semcor.data.xml"

## The XML

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<corpus lang="en" source="semcor">
<text id="d000" source="br-e30">
<sentence id="d000.s000">
<wf lemma="how" pos="ADV">How</wf>
<instance id="d000.s000.t000" lemma="long" pos="ADJ">long</instance>
<wf lemma="have" pos="VERB">has</wf>
<wf lemma="it" pos="PRON">it</wf>
<instance id="d000.s000.t001" lemma="be" pos="VERB">been</instance>
<wf lemma="since" pos="ADP">since</wf>
<wf lemma="you" pos="PRON">you</wf>
<instance id="d000.s000.t002" lemma="review" pos="VERB">reviewed</instance>
<wf lemma="the" pos="DET">the</wf>
<instance id="d000.s000.t003" lemma="objective" pos="NOUN">objectives</instance>
<wf lemma="of" pos="ADP">of</wf>
<wf lemma="you" pos="PRON">your</wf>
<instance id="d000.s000.t004" lemma="benefit" pos="NOUN">benefit</instance>
<wf lemma="and" pos="CONJ">and</wf>
<instance id="d000.s000.t005" lemma="service" pos="NOUN">service</instance>
<instance id="d000.s000.t006" lemma="program" pos="NOUN">program</instance>
<wf lemma="?" pos=".">?</wf>
</sentence>
...
```

The data is hierarchical: corpus -> text -> sentence -> token, where each token is either a word form (`wf`) or an `instance`.
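
To make the nesting concrete, here is a minimal sketch that walks a shortened copy of the fragment above. The names `fragment` and `rows` are only for illustration; the fragment is parsed from a string so the example runs without the data file.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML fragment shown above (three tokens only).
fragment = """
<corpus lang="en" source="semcor">
<text id="d000" source="br-e30">
<sentence id="d000.s000">
<wf lemma="how" pos="ADV">How</wf>
<instance id="d000.s000.t000" lemma="long" pos="ADJ">long</instance>
<wf lemma="have" pos="VERB">has</wf>
</sentence>
</text>
</corpus>
""".strip()

corpus = ET.fromstring(fragment)

# Walk corpus -> text -> sentence -> token and record one row per token.
rows = []
for text_el in corpus.findall("text"):
    for sentence_el in text_el.findall("sentence"):
        for token_el in sentence_el:
            rows.append((
                text_el.get("id"),
                sentence_el.get("id"),
                token_el.tag,        # 'wf' or 'instance'
                token_el.text,
                token_el.get("id"),  # None for 'wf' elements
            ))

for row in rows:
    print(row)
```

Only the `instance` row carries an `id`; the `wf` rows get `None` in that position.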

We are primarily interested in the `instance` elements, since each carries an `id` that serves as a lookup key: a companion file shipped with the dataset maps these ids to the WordNet sense of the word.
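
The companion file is not shown in this notebook, so the following is only a sketch under an assumed layout: one line per instance, with the instance id followed by one or more whitespace-separated sense keys. The function name `load_sense_keys` and the `KEY_...` values are placeholders, not real entries from the file.

```python
import io

def load_sense_keys(lines):
    """Map each instance id to the list of sense keys that follow it on its line.

    Assumes the layout '<instance_id> <key> [<key> ...]' (an assumption,
    not confirmed by the notebook).
    """
    mapping = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        instance_id, *sense_keys = parts
        mapping[instance_id] = sense_keys
    return mapping

# Placeholder content standing in for the real companion file.
sample = io.StringIO(
    "d000.s000.t000 KEY_LONG\n"
    "d000.s000.t001 KEY_BE_1 KEY_BE_2\n"
)
keys = load_sense_keys(sample)
print(keys["d000.s000.t001"])
```

Keeping a list per id means a line with more than one key would not lose information; if the real file always has exactly one key per line, the list could be collapsed to a string.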

For our purposes we don't need the POS tags, only the instances. At the cost of some redundancy (the sentence is repeated for every instance it contains), we can convert this XML to a pandas DataFrame that stores:
* instance
* source sentence
* wordnet key

In [4]:
tree = ET.parse(semcore_xml)
root = tree.getroot()

In [5]:
print(f'{root.tag=}')
print(f'{root.attrib=}')

root.tag='corpus'
root.attrib={'lang': 'en', 'source': 'semcor'}


In [6]:
# Show the first 10 children of the root element
for child in list(root)[:10]:
    print(f'{child.tag=}', f'{child.attrib=}')

child.tag='text' child.attrib={'id': 'd000', 'source': 'br-e30'}
child.tag='text' child.attrib={'id': 'd001', 'source': 'br-l15'}
child.tag='text' child.attrib={'id': 'd002', 'source': 'br-f16'}
child.tag='text' child.attrib={'id': 'd003', 'source': 'br-j42'}
child.tag='text' child.attrib={'id': 'd004', 'source': 'br-g18'}
child.tag='text' child.attrib={'id': 'd005', 'source': 'br-e26'}
child.tag='text' child.attrib={'id': 'd006', 'source': 'br-f18'}
child.tag='text' child.attrib={'id': 'd007', 'source': 'br-f24'}
child.tag='text' child.attrib={'id': 'd008', 'source': 'br-n17'}
child.tag='text' child.attrib={'id': 'd009', 'source': 'br-h17'}


In [13]:
all_sentences = []

for text_el in tqdm(root.findall('text')):
    text_id = text_el.get('id')

    for sentence_el in text_el.findall('sentence'):
        sentence_id = sentence_el.get('id')
        sentence_tokens = []

        for token_el in sentence_el:
            token_info = {
                'tag': token_el.tag,                # 'wf' or 'instance'
                'word': token_el.text,
                'lemma': token_el.get('lemma'),
                'pos': token_el.get('pos'),
                'id': token_el.get('id')           # only present if tag == 'instance'
            }
            sentence_tokens.append(token_info)

        all_sentences.append({
            'text_id': text_id,
            'sentence_id': sentence_id,
            'tokens': sentence_tokens
        })


print(f"found {len(all_sentences)} sentences")

100%|████████████████████████████████████████████████████████████████████████████████| 352/352 [00:00<00:00, 359.34it/s]

found 37176 sentences





In [None]:
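def process_sentence(sentence):
    """Return one record per sense-annotated instance in a parsed sentence."""
    # Rebuild the sentence text by joining every token with a single space
    sentence_text = " ".join(token['word'] for token in sentence['tokens'])

    records = []
    for idx, token in enumerate(sentence['tokens']):
        # Only 'instance' tokens carry an id that can be joined to WordNet
        if token['tag'] == 'instance':
            record = {
                'word'            : token['word'],
                'wordnet_join_id' : token['id'],
                'sentence'        : sentence_text,
                'word_loc'        : idx    # token index within the sentence
            }
            records.append(record)

    return records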
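
Since `word_loc` is a token index, it only lines up with the space-split sentence text if no token contains a space itself. A small sanity check for that could look like the sketch below; `misaligned_records` is a hypothetical helper, and the sample records are made up to show one passing and two failing cases.

```python
def misaligned_records(records):
    """Return records whose word is not found at word_loc in the space-split sentence."""
    bad = []
    for record in records:
        words = record["sentence"].split(" ")
        loc = record["word_loc"]
        if loc >= len(words) or words[loc] != record["word"]:
            bad.append(record)
    return bad

# Made-up records in the same shape that process_sentence returns.
sample_records = [
    {"word": "long", "sentence": "How long has it", "word_loc": 1},      # aligned
    {"word": "it", "sentence": "How long has it", "word_loc": 2},        # wrong index
    {"word": "New York", "sentence": "in New York now", "word_loc": 1},  # token contains a space
]

bad = misaligned_records(sample_records)
print([record["word"] for record in bad])
```

An empty result on the full dataset would suggest that `sentence.split(" ")[word_loc]` is a safe way to locate each instance later on.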


In [27]:
row_data = [record for sentence in tqdm(all_sentences) for record in process_sentence(sentence)]

100%|██████████████████████████████████████████████████████████████████████████| 37176/37176 [00:00<00:00, 63058.57it/s]


In [30]:
df = pd.DataFrame(row_data)
df.head(5)

|   | word | wordnet_join_id | sentence | word_loc |
|---|------|-----------------|----------|----------|
| 0 | long | d000.s000.t000 | How long has it been since you reviewed the ob... | 1 |
| 1 | been | d000.s000.t001 | How long has it been since you reviewed the ob... | 4 |
| 2 | reviewed | d000.s000.t002 | How long has it been since you reviewed the ob... | 7 |
| 3 | objectives | d000.s000.t003 | How long has it been since you reviewed the ob... | 9 |
| 4 | benefit | d000.s000.t004 | How long has it been since you reviewed the ob... | 12 |
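
The remaining step towards the table described earlier (instance, source sentence, WordNet key) is to attach a key to each record through its id. The sketch below uses plain dictionaries; `attach_sense_keys`, `key_map` and the `KEY_LONG` value are placeholders, and the id field name follows `process_sentence`.

```python
def attach_sense_keys(records, key_map):
    """Return copies of the records with a 'wordnet_key' list added.

    Ids missing from key_map get an empty list so they are easy to spot.
    """
    return [
        {**record, "wordnet_key": key_map.get(record["wordnet_join_id"], [])}
        for record in records
    ]

# Made-up records and a placeholder id -> sense-key mapping.
sample_rows = [
    {"word": "long", "wordnet_join_id": "d000.s000.t000", "word_loc": 1},
    {"word": "been", "wordnet_join_id": "d000.s000.t001", "word_loc": 4},
]
key_map = {"d000.s000.t000": ["KEY_LONG"]}

joined = attach_sense_keys(sample_rows, key_map)
unmatched = [row["wordnet_join_id"] for row in joined if not row["wordnet_key"]]
print(unmatched)
```

Listing the unmatched ids is a cheap way to confirm that every instance has an entry in the companion file. On the DataFrame itself, a similar lookup could be done with `Series.map` on the id column.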
