# Token classification
This notebook shows our approach to data preprocessing. The goal is to have exactly one label per token*, including an "empty" label.
As long as we restrict our data to only one predicate, it should be feasible to determine which other part of the sentence each role connects to.

\* In this step, token refers to the "tokenization" as applied to the PMB, i.e. the tokens in the "en.tok.off" files. E.g., "Alfred Nobel" is one token here.
Our LLM will tokenize our sentence differently, and will create one or more tokens per PMB token. This mapping will be handled later.
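
The mapping from PMB tokens to model sub-tokens could look roughly like the sketch below. It assumes that, for every sub-token, we know the index of the PMB token it came from (Hugging Face fast tokenizers expose this as `word_ids()` when called with `is_split_into_words=True`). The function name `align_labels`, the hand-written `word_ids` list, and the choice to repeat the label on every sub-token are illustrative assumptions, not a fixed design.

```python
def align_labels(word_ids, labels, special_label=-100):
    """Expand one label per PMB token to one label per sub-token.

    word_ids: for each sub-token, the index of its PMB token, or None
              for special tokens such as [CLS] / [SEP].
    """
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            # -100 is the index PyTorch's cross-entropy loss ignores by default
            aligned.append(special_label)
        else:
            aligned.append(labels[word_id])
    return aligned

# Hypothetical split: PMB token 0 became two sub-tokens, token 1 became one.
word_ids = [None, 0, 0, 1, None]
print(align_labels(word_ids, [4, 0]))
```

An alternative is to label only the first sub-token of each PMB token and give the rest `special_label`; which variant works better is something to test later.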

In [None]:
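# ROLES/LABELS: Agent, Location, Patient, Theme, Destination, Result, Stimulus, Experiencer, Co-Theme, Pivot, EMPTY
# Tags: 0=EMPTY, 1=Agent, 2=Location, 3=Patient, 4=Theme, 5=Destination, 6=Result, 7=Stimulus, 8=Experiencer, 9=Co-Theme, 10=Pivot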

# sentence = "A brown dog and a grey dog are fighting in the snow"
# The goal is to generate:
# srl_tags = [1,1,1,1,1,1,1,0,0,2,2,2]
# tokens = ['A', 'brown', 'dog', 'and', 'a', 'grey', 'dog', 'are', 'fighting', 'in', 'the', 'snow']

In [1]:
import re
import os
# Example with one sentence:
# Note: forward slashes for Linux and WSL, backslashes for Windows
# Windows example:
# file_path = r'C:\Users\bikow\Documents\AI\MSc\Computational Semantics\pmb-sample-4.0.0\data\en\gold\p00\d0004'
# WSL example:
file_path = r'/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d0004/'
# file_path = r'/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p01/d2590/' # https://pmb.let.rug.nl/explorer/explore.php?part=01&doc_id=2590&type=der.xml&alignment_language=en
# file_path = r'/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p03/d0766/' # https://pmb.let.rug.nl/explorer/explore.php?part=03&doc_id=0766&type=der.xml&alignment_language=en

In [2]:
# THIS IS THE GOAL
# sentence = "A brown dog and a grey dog are fighting in the snow"
mapping = {"Agent": 1, "Location": 2, "Patient": 3, "Theme": 4, "Destination": 5, "Result": 6, "Stimulus": 7, "Experiencer": 8, "Co-Theme": 9, "Pivot": 10}

sentence = ""
sentence_id = '0'
tokens = []

# Get the tokens from the tokenized sentence file
with open(os.path.join(file_path, "en.tok.off")) as file:
    for line in file:
        tokens.append(line.split(maxsplit = 3)[-1].rstrip())

sentence = ' '.join(tokens)

print(sentence)
print(tokens)

A brown dog and a grey dog are fighting in the snow
['A', 'brown', 'dog', 'and', 'a', 'grey', 'dog', 'are', 'fighting', 'in', 'the', 'snow']


## Our class-based approach
We take the en.parse.tags file and recreate the CCG structure using custom classes.
This allows us to figure out which tokens each semantic role label belongs to.

In [157]:
class CCGNode:
    def __init__(self, category = 'none', rule_type='none', parent=None, level = 0):
        self.category = category # eg s\np or np
        self.rule_type = rule_type # fa or ba or conj
        self.children = []
        self.parent = parent
        self.level = level
        self.isFirstArgument = True
    
    def addChild(self, child):
        if len(self.children) == 1:
            child.isFirstArgument = False
        elif len(self.children) == 2:
            raise Exception(repr(self), 'already has two children')
        child.level = self.level + 1
        self.children.append(child)
    
    def getSibling(self):
        if self.isFirstArgument:
            return self.parent.children[1]
        else:
            return self.parent.children[0]
    
    def assignTag(self, tag, tagFromTokenIdx):
        self.children[0].assignTag(tag, tagFromTokenIdx)
        if len(self.children) > 1:
            self.children[1].assignTag(tag, tagFromTokenIdx)
    
    def getTags(self, mapping = None):
        if len(self.children) == 0:
            return []
        if len(self.children) == 1:
            return self.children[0].getTags(mapping)
        return self.children[0].getTags(mapping) + self.children[1].getTags(mapping)
    
    def getCategories(self, onlyTokens = False):
        if onlyTokens:
            x = []
        else:
            x = [self.category]
        if len(self.children) == 0:
            return x
        if len(self.children) == 1:
            return x + self.children[0].getCategories(onlyTokens)
        return x + self.children[0].getCategories(onlyTokens) + self.children[1].getCategories(onlyTokens)
    
    
    def getTagsFromTokenIdx(self):
        if len(self.children) == 0:
            return []
        if len(self.children) == 1:
            return self.children[0].getTagsFromTokenIdx()
        return self.children[0].getTagsFromTokenIdx() + self.children[1].getTagsFromTokenIdx()
    
    def __repr__(self):
        return ''.join([' ' * self.level, 'CCGNODE', ' ', self.category, ' ', self.rule_type, '\n', '\n'.join([repr(child) for child in self.children])])

class CCGToken:
    def __init__(self, token, category, parent, assignedTag = '', verbnet = None, tokenIdx = 0):
        self.token = token
        self.category = category
        self.parent = parent
        self.assignedTag = assignedTag
        self.verbnet = verbnet if verbnet is not None else [] # avoid a shared mutable default
        self.children = []
        self.level = None
        self.isFirstArgument = True
        self.tokenIdx = tokenIdx
        self.tagFromTokenIdx = None
        
    def getSibling(self):
        if self.isFirstArgument:
            return self.parent.children[1]
        else:
            return self.parent.children[0]
    
    def assignTag(self, tag, tagFromTokenIdx):
        self.assignedTag = tag
        self.tagFromTokenIdx = tagFromTokenIdx
    
    def getTags(self, mapping):
        if mapping is None:
            return [self.assignedTag]
        else:
            if self.assignedTag == '':
                return [0]
            return [mapping[self.assignedTag]]
    
    def getCategories(self, _):
        return [self.category]
    
    def getTagsFromTokenIdx(self):
        return [self.tagFromTokenIdx]
    
    def __repr__(self):
        return ''.join([' ' * self.level, 'CCGTOKEN', ' ', self.token, ' ', self.category, ' ', self.assignedTag, ' ',' '.join(self.verbnet)])
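
The key idea of the two classes is that a tag assigned to an inner node is pushed down to every token below it, and that reading the tags back concatenates the leaves from left to right. The sketch below shows only that idea with two stripped-down, hypothetical classes (`Leaf` and `Node`); it leaves out categories, rule types and the origin bookkeeping of the real classes.

```python
class Leaf:
    """A single token with an (initially empty) tag."""
    def __init__(self, token):
        self.token = token
        self.tag = ''

    def assign(self, tag):
        self.tag = tag

    def tags(self):
        return [self.tag]


class Node:
    """An inner node that only forwards calls to its children."""
    def __init__(self, *children):
        self.children = list(children)

    def assign(self, tag):
        for child in self.children:
            child.assign(tag)

    def tags(self):
        return [t for child in self.children for t in child.tags()]


subject = Node(Leaf('a'), Node(Leaf('brown'), Leaf('dog')))
sentence = Node(subject, Leaf('barks'))
subject.assign('Agent')      # tags all three tokens of the noun phrase
print(sentence.tags())       # ['Agent', 'Agent', 'Agent', '']
```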


In [94]:
def getTokens(file_path):
    tokens = []
    # Get the tokens from the tokenized sentence file
    with open(os.path.join(file_path, "en.tok.off")) as file:
        for line in file:
            token = line.split(maxsplit = 3)[-1].rstrip()
            tokens.append(token)
            if token in ['.', '?', '!']:
                break
        
    return tokens
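
The `split(maxsplit = 3)` call is what keeps multi-word tokens such as "Alfred Nobel" together: only the first three whitespace-separated fields are split off and the rest of the line is kept as the token. The small example below assumes that each line consists of three numeric fields followed by the token text; the sample lines are made up for illustration and not copied from the PMB.

```python
# Made-up lines in the assumed "<field> <field> <field> <token text>" layout.
sample_lines = [
    "0 12 1001 Alfred Nobel\n",
    "13 15 1002 is\n",
]

# Same expression as in getTokens: keep everything after the third field.
tokens = [line.split(maxsplit=3)[-1].rstrip() for line in sample_lines]
print(tokens)  # ['Alfred Nobel', 'is']
```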

In [158]:
def getTree(file_path, tokens):
    tokenIdx = 0
    topNode = None
    currentNode = None
    tokensWithVerbnet = []
    with open(os.path.join(file_path, "en.parse.tags")) as file:
        skipping = True
        for line in file:
            if skipping:
                if line.startswith('ccg'):
                    skipping = False
                    topNode = CCGNode()
                    currentNode = topNode
                continue
            if line == '\n':
                continue
            if line.startswith('ccg'): # Second sentence starts, we ignore this
                return topNode, tokensWithVerbnet
            trimmedLine = line.lstrip()
            nodeType, content = trimmedLine.split('(', 1)
            category = content.split(',')[0]
            level = len(line) - len(trimmedLine)
            while level <= currentNode.level:
                currentNode = currentNode.parent
            if nodeType == 't':
                if category in ['.']:
                    break
                if tokens[tokenIdx] in ['.', '!', '?']:
                    break
                
                vnSplit = content.split("verbnet:")
                if len(vnSplit) == 1:
                    verbnet = []
                else:
                    # It needs to combine to an np. Verbnet tags looking for a n for example,
                    # often describe adjectives and are not relevant for the main predicate
                    searchingFor = re.split(r'[\\\/]', category, maxsplit=1)
                    if len(searchingFor) > 1 and ("np" in searchingFor[1]):
                        verbnetLiteral = vnSplit[1].split(']')[0] + ']'
                        verbnetUnfiltered = eval(verbnetLiteral)
                        for role in verbnetUnfiltered:
                            verbnetCounter[role] = verbnetCounter.get(role, 0) + 1
                        # If first element gets filtered out but not the second, replace with dummy value
                        verbnet = [r if r in mapping.keys() else '' for r in verbnetUnfiltered]
                        # Remove trailing dummy values
                        while (verbnet) and (verbnet[-1] == ''):
                            verbnet.pop()
                    else:
                        verbnet = []
                currentNode.addChild(CCGToken(tokens[tokenIdx], category = category, parent = currentNode, verbnet = verbnet, tokenIdx = tokenIdx))
                if len(verbnet) > 0:
                    tokensWithVerbnet.append(currentNode.children[-1])
                tokenIdx += 1
            else:
                currentNode.addChild(CCGNode(category, nodeType, parent=currentNode, level = level))
                currentNode = currentNode.children[-1]
#             print(topNode)


    return topNode, tokensWithVerbnet
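
To make the line handling in `getTree` easier to follow, the sketch below applies the same string operations to a few lines in isolation. The helper name `parse_line` and the sample lines are hypothetical; the samples only imitate the layout we assume for en.parse.tags (indentation for depth, `type(category, ...` per line, an optional `verbnet:[...]` list). It also uses `ast.literal_eval` instead of `eval`, which is a swapped-in alternative that parses the role list without executing arbitrary code.

```python
import ast

def parse_line(line):
    """Split one tree line into (level, node_type, category, roles)."""
    trimmed = line.lstrip()
    node_type, content = trimmed.split('(', 1)
    category = content.split(',')[0]
    level = len(line) - len(trimmed)  # indentation encodes tree depth
    roles = []
    if "verbnet:" in content:
        literal = content.split("verbnet:")[1].split(']')[0] + ']'
        roles = ast.literal_eval(literal)  # parses the list literal only
    return level, node_type, category, roles

# Made-up lines in the assumed layout, not copied from the PMB.
sample = [
    "  ba(s:dcl,\n",
    "   t(np, 'Kraft', [from:0, to:5]),\n",
    "   t((s:dcl\\np)/np, 'sold', [verbnet:['Agent', 'Theme'], from:6]),\n",
]
for line in sample:
    print(parse_line(line))
```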

In [139]:
def findCorrectLevel(current):
    while (not current.category.endswith('np')): # first application with non-nps
        current = current.parent
    lookingForward = (current.category[-3] == '/')
    if lookingForward:
        while ((not current.isFirstArgument) or current.parent.rule_type != 'fa'):
            current = current.parent
    else:
        while (current.isFirstArgument or current.parent.rule_type != 'ba'):
            current = current.parent
    return current

def assignTags(tokensWithVerbnet):
    for currentTokenWithVerbnet in tokensWithVerbnet:
        verbnet = currentTokenWithVerbnet.verbnet
        currentTokenIdx = round(currentTokenWithVerbnet.tokenIdx + 0.0, 1)
        for verbnetItem in verbnet:
            currentTokenWithVerbnet = findCorrectLevel(currentTokenWithVerbnet)
            sibling = currentTokenWithVerbnet.getSibling()
            sibling.assignTag(verbnetItem, currentTokenIdx)
            currentTokenWithVerbnet = currentTokenWithVerbnet.parent
            currentTokenIdx = round(currentTokenIdx + 0.1, 1)
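
Two conventions in the cell above are easy to miss. First, `findCorrectLevel` reads the search direction from the slash in front of the final `np`: `/np` means the argument is to the right, `\np` means it is to the left. Second, the origin value stored with each tag is the index of the predicate token plus 0.1 for every role already consumed, so 8.0 is the first role of token 8 and 8.1 its second. The helpers below (`argument_direction`, `decode_origin`) are hypothetical names that only restate these two conventions.

```python
def argument_direction(category):
    """Return 'forward' for .../np, 'backward' for ...\\np, otherwise None."""
    if not category.endswith('np') or len(category) < 3:
        return None
    slash = category[-3]
    if slash == '/':
        return 'forward'
    if slash == '\\':
        return 'backward'
    return None

def decode_origin(origin):
    """Split an origin value such as 8.1 into (token index, role position)."""
    token_idx = int(origin)
    role_pos = round((origin - token_idx) * 10)
    return token_idx, role_pos

print(argument_direction('(s:dcl\\np)/np'))  # forward
print(argument_direction('s:dcl\\np'))       # backward
print(decode_origin(8.1))                    # (8, 1)
```

Note that a bare `np` category has no slash to inspect, which is why the sketch returns `None` for it instead of indexing `category[-3]` directly.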

In [146]:
def skipSentence(topNode, tokens):
    allCategories = topNode.getCategories()
    
    # Skip W-questions
    if 's:wq' in allCategories:
        return True
    
    # Skip sentences with "there"
    if 'np:thr' in allCategories:
        return True
    
    # Skip sentences with "let's" / "let us"
    for idx, token in enumerate(tokens):
        if token.lower() == "let":
            if idx + 1 < len(tokens) and tokens[idx + 1] in ["'s", "us"]:
                return True
#         if token == "'s":
#             tokenCategories = topNode.getCategories(onlyTokens = True)
#             if tokenCategories[idx] == 'np':
#                 return True
    
    # Skip sentences that miss a part (like "Think about it")
    if ("/" in topNode.children[0].category) or ("\\" in topNode.children[0].category):
        return True
    
    return False

In [161]:
mapping = {"Agent": 1, "Location": 2, "Patient": 3, "Theme": 4, "Destination": 5, "Result": 6, "Stimulus": 7, "Experiencer": 8, "Co-Theme": 9, "Pivot": 10}

# file_path = r'/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d1382' # Who directed the film "Fail Safe"?
# file_path = r'/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p01/d2590/'
# file_path = r'/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d0004/' # fighting dogs
# file_path = r'/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d2208' # Tom saw a mouse
# file_path = r'/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d0802' # Let's have sushi
file_path = r'/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p06/d3191' 
def getTokensAndLabels(file_path):
    tokens = getTokens(file_path)
    topNode, tokensWithVerbnet = getTree(file_path, tokens)
    if skipSentence(topNode, tokens):
        return None, None, None
    try:
        assignTags(tokensWithVerbnet)
    except AttributeError:
        # Typically the tree walk ran past the root; record the directory and skip it
        skippedDirs.append(file_path)
        return None, None, None
    labels = topNode.getTags(mapping)
    origin = topNode.getTagsFromTokenIdx()
    if tokens[-1] not in ['.', '!', '?']:
        tokens.append('.')
    if len(tokens) > len(labels):
        labels.append(0)
        origin.append(0)
    if len(tokens) != len(labels):
        raise Exception(file_path, 'Length of token and labels does not match up!')
    return tokens, labels, origin
    

tokens, labels, origin = getTokensAndLabels(file_path)
print(tokens)
print(labels)
print(origin)

None
None
None


In [162]:
folder_path = r'/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/' 
verbnetCounter = {}
skippedDirs = []

def createDataset(parent_dir):
    i = 0
    dataset = {'tokens': [], 'labels': [], 'origin': []}
    for subdir, dirs, files in os.walk(parent_dir):
        if not os.path.exists(os.path.join(subdir, 'en.parse.tags')):
            continue
        i += 1
        if (i % 50 == 0):
            print(i)
            print(subdir)
        tokens, labels, origin = getTokensAndLabels(subdir)
        if tokens is None:
            continue
        dataset['tokens'].append(tokens)
        dataset['labels'].append(labels)
        dataset['origin'].append(origin)
    return dataset
        

dataset = createDataset(folder_path)


50
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d0888
100
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d1470
150
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d1699
200
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d1976
250
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d2293
300
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d2581
350
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d3011
400
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d3351
450
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p01/d0875
500
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p01/d1775
550
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/d

4450
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p30/d1818
4500
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p30/d2819
4550
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p31/d0016
4600
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p31/d2101
4650
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p31/d3483
4700
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p32/d2195
4750
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p33/d0838
4800
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p33/d2941
4850
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p34/d1939
4900
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p34/d2944
4950
/mnt/c/Users/perry/Documents/uni/Master/CompSem/project

Exception: ('/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p44/d3393', 'Length of token and labels does not match up!')

In [125]:
# mapping = {"Agent": 1, "Location": 2, "Patient": 3, "Theme": 4, "Destination": 5, "Result": 6, "Stimulus": 7, "Experiencer": 8, "Co-Theme": 9, "Pivot": 10}
print(len(dataset['tokens']))
print(sorted(verbnetCounter.items(), key=lambda x:x[1], reverse = True))

366
[('Theme', 206), ('Agent', 170), ('Attribute', 79), ('Experiencer', 48), ('Patient', 42), ('Stimulus', 42), ('Co-Theme', 41), ('Location', 33), ('Pivot', 22), ('Destination', 19), ('Source', 14), ('Equal', 12), ('Recipient', 11), ('Time', 9), ('PartOf', 9), ('Topic', 8), ('Name', 7), ('Result', 7), ('Value', 7), ('Co-Agent', 6), ('Causer', 6), ('Manner', 6), ('User', 5), ('Instrument', 4), ('Duration', 4), ('Beneficiary', 4), ('AttributeOf', 4), ('Of', 3), ('Colour', 3), ('Creator', 3), ('Content', 3), ('Owner', 2), ('Material', 2), ('SubOf', 2), ('Context', 1), ('Quantity', 1), ('Start', 1), ('InstanceOf', 1), ('Goal', 1), ('Co-Patient', 1), ('Unit', 1), ('Product', 1), ('Frequency', 1)]
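
The role statistics above are collected with a plain dictionary and sorted by hand. A possible alternative, sketched here with made-up role lists, is `collections.Counter`, which handles both the counting and the sorting. The name `role_counter` is hypothetical and not used elsewhere in this notebook.

```python
from collections import Counter

role_counter = Counter()
# Made-up role lists standing in for the verbnet annotations of three tokens.
for roles in (['Agent', 'Theme'], ['Theme'], ['Agent', 'Location']):
    role_counter.update(roles)

# most_common() returns (role, count) pairs sorted by descending count.
print(role_counter.most_common())
```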


In [128]:
print(skippedDirs)

['/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d0867', '/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d0916', '/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d0948', '/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d1468', '/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d1562', '/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d1622', '/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d1686', '/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d1691', '/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d1712', '/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold/p00/d1730', '/mnt/c/Users/perry/Documents/uni/Master/CompSem/project/pmb-4.0.0/data/en/gold

In [51]:
print(dataset)

{'tokens': [['A', 'brown', 'dog', 'and', 'a', 'grey', 'dog', 'are', 'fighting', 'in', 'the', 'snow', '.'], ['A', 'woman', 'in', 'a', 'white', 'dress', 'and', 'a', 'woman', 'in', 'a', 'blue', 'dress', 'are', 'standing', 'on', 'a', 'stage', '.'], ['You', 'are', 'screwed', '.'], ['This', 'is', 'a', 'man-made', 'language', '.'], ['Kraft', 'sold', 'Celestial Seasonings', '.'], ['Anna Politkovskaya', 'was', 'murdered', '.'], ['Alfred Nobel', 'is', 'the', 'inventor', 'of', 'dynamite', '.'], ['The', 'yakuza', 'are', 'the', 'Japanese', 'mafia', '.'], ['Russia', 'fears', 'the', 'system', '.'], ['Bountiful', 'reached', 'San Francisco', 'on', '1', 'November', '1945', '.'], ['Pierce', 'lives', 'near', 'Rossville Blvd', '.'], ['Yunus', 'founded', 'the', 'Grameen Bank', '30', 'years', 'ago', '.'], ['Maria', 'has', 'long', 'hair', '.'], ['How', 'slow', 'you', 'are', '!'], ['The', 'astronauts', 'went', 'up', 'to', 'the', 'moon', 'in', 'a', 'rocket', '.'], ['We', 'stood', 'at', 'the', 'door', 'and', 'wa

In [33]:
from datasets import Dataset

In [52]:
ds = Dataset.from_dict(dataset)


In [53]:
ds[0]

{'tokens': ['A',
  'brown',
  'dog',
  'and',
  'a',
  'grey',
  'dog',
  'are',
  'fighting',
  'in',
  'the',
  'snow',
  '.'],
 'labels': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 2, 0],
 'origin': [8.0,
  8.0,
  8.0,
  8.0,
  8.0,
  8.0,
  8.0,
  None,
  None,
  None,
  9.0,
  9.0,
  0.0]}

In [43]:
# mapping = {"Agent": 1, "Location": 2, "Patient": 3, "Theme": 4, "Topic":5, "Destination": 6, "Result": 7}
for d in ds:
    frmt = "{:>10}"*len(d['tokens'])
    print(frmt.format(*d['tokens']))
    print(frmt.format(*d['labels']))
#     print(frmt.format(*d['origin']))
    print()

         A     brown       dog       and         a      grey       dog       are  fighting        in       the      snow         .
         1         1         1         1         1         1         1         0         0         0         2         2         0

         A     woman        in         a     white     dress       and         a     woman        in         a      blue     dress       are  standing        on         a     stage         .
         4         4         4         4         4         4         4         4         4         4         4         4         4         0         0         0         2         2         0

       You       are   screwed         .
         1         0         0         0

      This        is         a  man-made  language         .
         4         0         0         0         0         0

     Kraft      soldCelestial Seasonings         .
         1         0         4         0

Anna Politkovskaya       was  murdered         .
      

 Everybody     hates        me         .
         0         0         0         0

         I        'm         a    genius         .
         4         0         0         0         0

        He     lives        in       the      city         .
         4         0         0         2         2         0

       The     storm       let        up         .
         3         3         0         0         0

       The      buds     began        to      open         .
         4         4         0         0         0         0

       The     blast destroyedeverything         .
         1         1         0         3         0

         I       was       n't  prepared         .
         0         0         0         0         0

       Tom        is       not      dead         .
         0         0         0         0         0

       The      doll       lay        on       the     floor         .
         4         4         0         0         2         2         0

         I   
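
Tokens longer than ten characters overflow the fixed `{:>10}` column, which is why "sold" and "Celestial Seasonings" run together in the output above. The sketch below sizes every column to its widest entry instead. The function name `format_rows` and the `pad` parameter are hypothetical, and values are passed through `str()` so that `None` entries (as in `origin`) can be printed as well.

```python
def format_rows(tokens, labels, pad=2):
    """Right-align tokens and labels in columns wide enough for both."""
    widths = [max(len(str(t)), len(str(l))) + pad
              for t, l in zip(tokens, labels)]
    token_row = ''.join(f"{str(t):>{w}}" for t, w in zip(tokens, widths))
    label_row = ''.join(f"{str(l):>{w}}" for l, w in zip(labels, widths))
    return token_row, label_row

token_row, label_row = format_rows(
    ['Kraft', 'sold', 'Celestial Seasonings', '.'], [1, 0, 4, 0])
print(token_row)
print(label_row)
```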

In [50]:
ds.save_to_disk("test.hf")

Saving the dataset (0/1 shards):   0%|          | 0/423 [00:00<?, ? examples/s]

In [56]:
from datasets import load_from_disk

In [57]:
test_dataset = load_from_disk("test.hf")

In [58]:
test_dataset

Dataset({
    features: ['tokens', 'labels'],
    num_rows: 423
})