# 2. Find sense references

Now that we have converted all senses to lexunits, we can start looking for these senses in the DutchSemCor corpus, which is what this notebook does. (Strictly speaking they are lexunits, but I will use the two terms interchangeably from now on.)

There are quite a few constants we'll be reusing across notebooks, which is why these are defined in `constants.py`.

In [5]:
from constants import *

In [22]:
import os
import re
import pandas as pd
import json
from glob import glob
# tqdm draws the progress bars used further down
from tqdm.auto import tqdm

## Loading our dataset

First, we load the dataset we created in the previous notebook.

In [9]:
sense_df = pd.read_json("test_senses.json")

## Collecting all lexunits

Lexunits can come from several places: either from a unary record, or from a record consisting of a head and a dependent. For completeness' sake, we will collect examples for all lexunits in the dataset.

In [10]:
lexunits = []
for _, row in sense_df.iterrows():
    # Each of these columns holds a comma-separated string of lexunit IDs;
    # cells that are not strings are skipped
    for column in ("lexunit", "head_lexunit", "dependent_lexunit"):
        if isinstance(row[column], str):
            lexunits += row[column].split(",")

# Remove duplicates
lexunits = list(set(lexunits))

token_find_re = "|".join(lexunits)
token_find_re

'|r_n-27702|r_v-3558|r_n-78986|r_v-6232|r_n-11376|r_n-7920|d_r-192237|r_n-21915|r_n-26194|d_n-133385|r_n-21559|d_n-279448|r_v-6154|r_n-40270|d_n-8623|d_n-540927|d_n-35109|r_n-10514|d_n-94604|d_r-175264|d_n-179791|r_v-5716|r_n-5830|r_n-42004|r_n-44260|r_n-29708|d_n-544095|r_n-39109|r_n-32066|r_v-2310|r_n-19348|r_v-8688|d_n-247355|d_v-418581|r_a-15983|r_n-15069|r_n-6221|r_n-22345|r_v-2602|r_n-16580|d_n-40070|d_n-202214|r_a-8711|r_n-20023|d_n-343136|d_n-181573|r_v-3128|d_n-103580|d_n-42953|r_v-6731|r_n-34613|r_n-18532|d_n-272829|d_n-134289|d_n-244513|d_n-53267|d_n-401393|d_n-174218|r_n-13973|r_a-14265|d_n-176646|r_v-8802|r_n-30526|r_n-40200|d_n-134793|d_v-134948|d_n-133649|d_v-109797|r_n-39082|d_n-69403|d_n-418754|r_a-14397|r_v-4038|d_n-202752|r_v-3908|d_v-27517|r_n-25044|r_v-2836|r_n-9402|r_n-5669|r_n-43864|r_v-4514|d_n-106114|r_n-42024|r_n-16528|r_n-27221|r_n-26392|r_n-27422|r_n-20047|r_v-4671|r_n-25285|r_n-25356|d_a-91722|d_n-420785|d_n-280142|r_n-20797|r_v-3515|r_n-20812|s_00032|r_n-2
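
The pattern above starts with an empty alternative (the leading `|`), which suggests that at least one empty string ended up in `lexunits`. Below is a minimal sketch of a stricter way to build the pattern, assuming empty entries should simply be dropped. `build_token_regex` is a hypothetical helper and not part of the original notebook.

```python
def build_token_regex(lexunits):
    """Join lexunit IDs into one alternation pattern (hypothetical helper)."""
    # Drop empty or whitespace-only entries: an empty alternative can match anywhere
    cleaned = {unit.strip() for unit in lexunits if unit and unit.strip()}
    # Sort so that the pattern is reproducible between runs
    return "|".join(sorted(cleaned))


print(build_token_regex(["r_n-27702", "", "d_n-8623", "r_n-27702"]))
# prints: d_n-8623|r_n-27702
```

Sorting is optional; it only makes the pattern stable across runs, since the order of a `set` is not guaranteed.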

Now we use `grep` to find all lines in the DutchSemCor corpus that point to corpus examples of the senses we're interested in; replace `regex_here` in the command below with the pattern built above. The matching lines are collected in `.match` files, one per corpus file.  
Please **MOVE** these `.match` files to the path specified in `DUTCHSEMCOR_FILTERED_CSV_PATH` afterwards.

In [None]:
!for f in data/dutsemcor/*.xml ; do grep -E -i -w 'regex_here' "$f" -A9999 > "${f%.xml}.match" ; done
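
The move can also be done from Python. This is a minimal sketch, assuming the `.match` files sit next to the XML files they were created from. `move_match_files` is a hypothetical helper and not part of the original notebook.

```python
import shutil
from pathlib import Path


def move_match_files(source_dir, target_dir):
    """Move every .match file from source_dir into target_dir (hypothetical helper)."""
    target = Path(target_dir)
    # Create the target directory if it does not exist yet
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for match_file in sorted(Path(source_dir).glob("*.match")):
        destination = target / match_file.name
        shutil.move(str(match_file), str(destination))
        moved.append(destination)
    return moved
```

In this notebook, the call would look something like `move_match_files("data/dutsemcor", DUTCHSEMCOR_FILTERED_CSV_PATH)`.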

## Checking for dead references

Not every reference in DutchSemCor points to a document we can actually locate. We therefore check, for every single match, whether we can find the referenced file in our dataset. The Lassy hits were provided by Vincent Vandeghinste.

In [12]:
lassy_hits = pd.read_csv(f"{DATA_PATH}/lassy_hits.tsv",
                         sep="\t",
                         names=["docid",
                                "xmlid",
                                "previous_sentence",
                                "current_sentence",
                                "next_sentence"],
                         header=None)
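
Later in this notebook, each Lassy reference is checked against this table by `docid` and `xmlid`. If that check turns out to be slow on a large table, one option is to build a set of ID pairs once and test membership against it. The sketch below uses made-up rows, and `build_hit_index` is a hypothetical helper.

```python
import pandas as pd

# Made-up rows standing in for lassy_hits; these IDs are illustrative only
hits = pd.DataFrame({
    "docid": ["WR-P-P-I-0000000001", "WR-P-P-I-0000000002"],
    "xmlid": ["p.1.s.1.w.1", "p.2.s.3.w.4"],
})


def build_hit_index(frame):
    """Return a set of (docid, xmlid) pairs for fast membership tests."""
    return set(zip(frame["docid"], frame["xmlid"]))


hit_index = build_hit_index(hits)
print(("WR-P-P-I-0000000001", "p.1.s.1.w.1") in hit_index)  # True
print(("WR-P-P-I-0000000001", "p.9.s.9.w.9") in hit_index)  # False
```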

We create a list of all match files we found earlier.

In [18]:
xml_file_paths = glob("{}*.match".format(DUTCHSEMCOR_FILTERED_CSV_PATH))
xml_file_paths.sort()

Now we define a function that looks up the appropriate corpus file for each reference. The reference's information is appended to a JSONL file (one per sense). If the corpus file cannot be found, `lost` is set to `True`.

In [19]:
def find_senses_from_xml(xml_file_path):
    # Check if the file is empty
    if os.stat(xml_file_path).st_size == 0:
        return []
    
    with open(xml_file_path) as reader:
        xml_content = reader.read()
        
    found_tokens = xml_content.split("\n")
        
    # Build a record for every token we found (written to a JSONL file below)
    for found_token in tqdm(found_tokens, leave=False):
        # Sometimes, there are erroneous lines included. If there is no match,
        # just skip to the next line
        token_attributes = re.search(r'lemma="(.*?)" pos="(.*?)" sense="(.*?)" token_id="(.*?)"', found_token)
        if token_attributes is None:
            continue
        
        token_attributes = token_attributes.groups()
        
        # token_id has the following structure:
        # WR-P-P-G-0000070187.p.13.s.5.w.25
        # WR-P-P-G-0000070187 = docid
        # p.13.s.5.w.25 = xmlid
        # We can only search for xmlid. In the corpus result, we can further filter
        # for docid. So we need both. We use regular expressions.
        token_id = token_attributes[3]
        doc_id, xml_id = re.search(r"(.*?)\.(.*)", token_id).groups()
        
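        # "&#195;&#168;" is the two UTF-8 bytes of "è" written as separate
        # character references, so we restore the intended character
        lemma = token_attributes[0].replace("&#195;&#168;", "è")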
        pos = token_attributes[1]
        sense = token_attributes[2]
        
        lost = False
        
        if doc_id.startswith("CGN"):
            cgn_id = re.search(r"_(.*?)$", doc_id).groups()[0]
            plk_filename = f"{SONAR_PATH}{cgn_id}.plk"
                
            if not os.path.exists(plk_filename):
                lost = True
        elif doc_id.startswith("WR-P-P-I"):
            lassy_hit = lassy_hits.loc[(lassy_hits["docid"] == doc_id) & 
                                       (lassy_hits["xmlid"] == xml_id)]
            
            if len(lassy_hit) == 0:
                lost = True
        else:        
            xml_filename = f"{SONAR_PATH}{doc_id}.folia.xml"
        
            if not os.path.exists(xml_filename):
                lost = True
        
        row = { "lemma": lemma,
                "pos": pos,
                "sense": sense,
                "xmlid": xml_id,
                "docid": doc_id,
                "lost": lost }
        
        jsonl_path = f"sense_example_references/{row['sense']}.jsonl"
        with open(jsonl_path, "at") as writer:
            json_raw = json.dumps(row)
            writer.write(f"{json_raw}\n")
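
To make the parsing step concrete, here is a minimal sketch that applies the same attribute pattern to a single line and splits the token ID at its first dot. The sample line is made up for illustration (element name, lemma, and the pairing of values are assumptions); only the `token_id` value is taken from the comment in the function above. `parse_token_line` is a hypothetical helper.

```python
import re

TOKEN_RE = re.compile(r'lemma="(.*?)" pos="(.*?)" sense="(.*?)" token_id="(.*?)"')


def parse_token_line(line):
    """Return the token attributes of one line, or None if the line does not match."""
    match = TOKEN_RE.search(line)
    if match is None:
        return None
    lemma, pos, sense, token_id = match.groups()
    # Everything before the first dot is the docid, the rest is the xmlid
    doc_id, _, xml_id = token_id.partition(".")
    return {"lemma": lemma, "pos": pos, "sense": sense,
            "docid": doc_id, "xmlid": xml_id}


# Hypothetical input line, not copied from the corpus
sample = ('<token lemma="huis" pos="n" sense="r_n-27702" '
          'token_id="WR-P-P-G-0000070187.p.13.s.5.w.25"/>')
parsed = parse_token_line(sample)
print(parsed["docid"])  # WR-P-P-G-0000070187
print(parsed["xmlid"])  # p.13.s.5.w.25
```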

We apply the function to every `.match` file.

In [None]:
for xml_file_path in tqdm(xml_file_paths):
    find_senses_from_xml(xml_file_path)
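
Once the loop has finished, it can be useful to see how many references were marked as lost. This is a minimal sketch, assuming the JSONL files were written to `sense_example_references/` as in the function above. `count_lost` is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def count_lost(jsonl_dir):
    """Count found and lost references over all JSONL files in jsonl_dir."""
    counts = Counter()
    for jsonl_path in Path(jsonl_dir).glob("*.jsonl"):
        with open(jsonl_path, encoding="utf-8") as reader:
            for line in reader:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                counts["lost" if record["lost"] else "found"] += 1
    return counts
```

Calling `count_lost("sense_example_references")` returns a `Counter` with the keys `"found"` and `"lost"`. Because the JSONL files are opened in append mode, rerunning the loop adds the same records again, so the directory should be cleared before a second run.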