# 1. Converting synsets to lexunits

Every pictograph is linked to a synset from the Cornetto database. Later on, however, we will need examples from DutchSemCor, which works at the level of lexunits. In this notebook, we therefore convert all synset ids to lexunit ids.

## Loading the lexunit-synset bridge

First, we import the lexunit-synset bridge, which lets us convert synsets to lexunits.

In [1]:
from picto2vec.lexsynbridge import LexSynBridge

Then, we load our conversion dataset. Loading might take a while, because `LexSynBridge` internally converts the dataset to a Python `dict` to make lookups fast. Once it has loaded, we test the conversion on a single synset.

In [2]:
lexsynbridge = LexSynBridge("data/lex2syn_original.csv")

In [3]:
lexsynbridge.syn2lex("d_n-39469")

'r_n-15326,r_n-41758'
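
To make the idea of a dict-backed bridge concrete, here is a minimal sketch. It is not the actual `LexSynBridge` implementation: the class name `ToyLexSynBridge`, the two-column CSV layout (lexunit id, then synset id), and the toy ids are all assumptions made for illustration. Printing a message and returning `False` on a failed lookup is likewise an assumption about how the real class behaves.

```python
import csv
import io
from collections import defaultdict


class ToyLexSynBridge:
    """Hypothetical stand-in for LexSynBridge; the real class may differ."""

    def __init__(self, csv_file):
        # Build the synset -> lexunits mapping once, so that every later
        # lookup is a single dict access instead of a scan over the file.
        self._syn2lex = defaultdict(list)
        for lexunit_id, synset_id in csv.reader(csv_file):
            self._syn2lex[synset_id].append(lexunit_id)

    def syn2lex(self, synset_id):
        # .get() does not create a new entry for unknown synsets.
        lexunits = self._syn2lex.get(synset_id)
        if not lexunits:
            print(f"{synset_id} to lexunit failed!")
            return False  # assumed failure value
        return ",".join(lexunits)


# Toy data with an assumed column order: lexunit id, synset id.
toy_csv = io.StringIO("r_n-1,d_n-10\nr_n-2,d_n-10\nr_n-3,d_n-20\n")
toy_bridge = ToyLexSynBridge(toy_csv)
toy_result = toy_bridge.syn2lex("d_n-10")
```

Building the mapping up front costs time at load, which would be consistent with the slow start-up noted above, but it keeps each individual conversion cheap.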

## Converting synsets to lexunits

That works! Now, let's load our pictograph dataset and convert every column that contains synsets (`synset`, `head`, and `dependent`) to lexunits.

In [4]:
import pandas as pd
import numpy as np

In [5]:
columns = ["lemma", "synset", "relation", "head", "headrel", "dependent", "deprel", "antonym", "number", "lemma_english", "able"]
sense_df = pd.read_csv("data/sclera.csv", names=columns)
sense_df.head()

Unnamed: 0,lemma,synset,relation,head,headrel,dependent,deprel,antonym,number,lemma_english,able
0,verdrietig,d_n-31553,synonym,,,,,,,"unfortunate,-depressed,-unlucky",
1,ruzie-maken,,,d_v-9067,synonym,d_n-30590,synonym,,,argue,
2,school,d_n-36313,synonym,,,,,school-rood,,school,
3,slecht-2,c_578,synonym,,,,,,,bad-/-not-ok,
4,gebaar-veel,n_a-512478,synonym,,,,,,,gesture-a-lot,


In [6]:
def synsets_to_lexunits(synsets):
    # Convert a comma-separated string of synset ids; empty cells stay NaN.
    if not isinstance(synsets, str):
        return np.nan
    lexunits = [lexsynbridge.syn2lex(synset) for synset in synsets.split(",")]
    # Drop failed conversions, which syn2lex reports as False.
    return ",".join(lexunit for lexunit in lexunits if lexunit != False)

sense_df["lexunit"] = sense_df["synset"].apply(lexsynbridge.syn2lex)
sense_df["head_lexunit"] = sense_df["head"].apply(synsets_to_lexunits)
sense_df["dependent_lexunit"] = sense_df["dependent"].apply(synsets_to_lexunits)
sense_df.head()

d_v-293911 to lexunit failed!
r_n-24688 to lexunit failed!
r_n-5918 to lexunit failed!
c_545200 to lexunit failed!
r_n-23331 to lexunit failed!
r_n-10194 to lexunit failed!
r_a-10906 to lexunit failed!
d_n-323738 to lexunit failed!
d_a9366 to lexunit failed!
d_n40023 to lexunit failed!


Unnamed: 0,lemma,synset,relation,head,headrel,dependent,deprel,antonym,number,lemma_english,able,lexunit,head_lexunit,dependent_lexunit
0,verdrietig,d_n-31553,synonym,,,,,,,"unfortunate,-depressed,-unlucky",,"r_n-39979,d_n-40708,r_n-11597,r_n-11598,r_n-20...",,
1,ruzie-maken,,,d_v-9067,synonym,d_n-30590,synonym,,,argue,,,c_546110,"r_n-32330,r_n-8012,r_n-11053,r_n-16733,d_n-525..."
2,school,d_n-36313,synonym,,,,,school-rood,,school,,r_n-33120,,
3,slecht-2,c_578,synonym,,,,,,,bad-/-not-ok,,"d_a-208278,r_a-15054,r_a-10080",,
4,gebaar-veel,n_a-512478,synonym,,,,,,,gesture-a-lot,,"r_a-11649,r_a-15342,c_545575,r_a-16466,r_a-104...",,


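Some conversions failed, as the log above shows. We don't know why, and it isn't something we can fix here, since most of the databases for Dutch are a bit messy. Two of the failing ids (`d_a9366` and `d_n40023`) lack the hyphen that other `d_` ids have, which may point to typos in the pictograph data. We hope that the missing lexunits won't pose issues later on.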
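
To get a feel for how much data is affected, one could count the rows that have a synset but no lexunit. The sketch below uses a toy frame rather than `sense_df`; it assumes that a failed conversion is stored as `False` in the `lexunit` column and that rows without a synset hold `NaN` there, neither of which is confirmed for the real data. The lemmas and ids are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sense_df; values are invented for illustration.
failure_demo_df = pd.DataFrame({
    "lemma": ["toy-a", "toy-b", "toy-c"],
    "synset": ["d_n-10", "d_n-99", np.nan],
    "lexunit": ["r_n-1,r_n-2", False, np.nan],  # False = assumed failure marker
})

# Only rows that had a synset to begin with can count as failures.
has_synset = failure_demo_df["synset"].notna()
conversion_failed = failure_demo_df["lexunit"].map(lambda value: value is False)
failed_rows = failure_demo_df[has_synset & conversion_failed]

print(f"{len(failed_rows)} of {has_synset.sum()} rows with a synset failed to convert")
```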

## Exporting the new dataset

We export the new dataset as JSON. This will be helpful later on, since indexing into a pandas DataFrame can be rather slow.

In [7]:
sense_df.to_json("test_senses.json", orient="records")
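
As a rough illustration of why the `records` orientation is convenient, the sketch below round-trips a toy frame through JSON and builds a plain `dict` keyed by lemma. The frame `export_demo_df` and the lookup table `by_lemma` are hypothetical names, and keying by lemma is only one possible choice; later notebooks may read the file differently.

```python
import json

import pandas as pd

# Toy frame; the real export contains all columns of sense_df.
export_demo_df = pd.DataFrame({
    "lemma": ["school", "verdrietig"],
    "lexunit": ["r_n-33120", None],
})

# Without a path, to_json returns the JSON text instead of writing a file.
records = json.loads(export_demo_df.to_json(orient="records"))

# One dict per row, so a lookup by lemma is a single dict access.
by_lemma = {record["lemma"]: record for record in records}
```

Missing values are written as JSON `null` and come back as `None`, so downstream code should check for `None` rather than `NaN`.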

That's it for this notebook!