# Create the RDF datasets

## Introduction

In this Jupyter Notebook, we use Python to transform the content of a relational DB into 3 RDF datasets, consisting of:
1. the Lemma Bank (a series of instances of `ontolex:Form` used to lemmatize and index lexical resources): for the code, go to [this section](#the-lemma-bank);
2. a frequency list, reporting the number of attestations of each extracted lemma in the textual resources scanned in the CLARIN VLO: code in [this section](#corpus-frequencies);
3. a lexical resource transformed into LLOD, with its lemmas linked to the URIs of our Lemma Bank (created in Step 1): code in [this section](#a-lexical-resource).

I use the dump of the DB created by our demo tool on Italian, imported into a local MariaDB from the file `CLARINDB.sql.tar.gz` (you can find it in the [data](../../data/) folder). This dump includes the attestations of Italian lemmas in only 14 textual resources.

In order to follow along with the code:
- create a local DB called `CLARINDB`
- import the sql dump (e.g. uncompress it and then run: `mysql -u <YOUR_USER> -p CLARINDB < CLARINDB.sql`)
- create an INI configuration file called `dbconfig.ini` that looks like this:


```ini
[Database]
host = localhost
user = your_username
password = your_password
database = your_database_name
```

... or you can always edit the code in the [DB connection]() section to use e.g. a CSV dump!
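If no database is at hand, the rows can be read from a CSV export instead. The sketch below is only an illustration: the file layout (a header row `id,lemma,upos`) and the file name are assumptions, not something shipped with the dump, and `rows_from_csv` is a hypothetical helper. It returns tuples shaped like the rows fetched from the `it_lemmaBank` table, so the code that builds the graph would not need to change.

```python
import csv
import io


def rows_from_csv(handle):
    """Read (id, lemma, upos) tuples from a CSV file object with a header row."""
    reader = csv.DictReader(handle)
    return [(int(row["id"]), row["lemma"], row["upos"]) for row in reader]


# Stand-in for open('it_lemmaBank.csv', newline='', encoding='utf-8');
# the file name and the column names are hypothetical.
sample = io.StringIO("id,lemma,upos\n1,spiegare,VERB\n5,occupazione,NOUN\n")
res_csv = rows_from_csv(sample)
print(res_csv[0])
```

Casting the id to `int` keeps the tuples comparable with what the DB connector returns.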

### Requirements

- `rdflib`
- a DB connector (here, `mariadb` is used)
- a DB called CLARINDB with the data dumped in `data/CLARINDB.sql.tar.gz`

### DB Connection

Use this code if you have a local DB running with the same schema as the one generated by our L2L demo tool. If you want, you can use the `CLARINDB.sql` dump for Italian (see above for instructions).

In [1]:
import configparser

config = configparser.ConfigParser()
config.read('dbconfig.ini')

# Get values from the 'Database' section
db_host = config.get('Database', 'host')
db_user = config.get('Database', 'user')
db_password = config.get('Database', 'password')
db_database = config.get('Database', 'database')
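`config.read` silently skips files it cannot find, so a misplaced `dbconfig.ini` only shows up later as a `NoSectionError`. A minimal sketch of an early check follows; `load_db_settings` and `REQUIRED_KEYS` are hypothetical names introduced here, and `read_string` merely stands in for reading the real file.

```python
import configparser

REQUIRED_KEYS = ("host", "user", "password", "database")


def load_db_settings(config):
    """Return the [Database] settings as a dict, failing early if any key is missing."""
    if not config.has_section("Database"):
        raise KeyError("missing [Database] section")
    missing = [k for k in REQUIRED_KEYS if not config.has_option("Database", k)]
    if missing:
        raise KeyError(f"missing keys in [Database]: {', '.join(missing)}")
    return {k: config.get("Database", k) for k in REQUIRED_KEYS}


# read_string stands in for config.read('dbconfig.ini'); the values are placeholders
demo = configparser.ConfigParser()
demo.read_string(
    "[Database]\nhost = localhost\nuser = me\npassword = secret\ndatabase = CLARINDB\n"
)
settings = load_db_settings(demo)
print(settings["database"])
```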

I use MariaDB. If you run a different engine, change the DB connector accordingly.

In [2]:
import mariadb

conn = mariadb.connect(
        user=db_user,
        password=db_password,
        host=db_host,
        database=db_database
        )

In [3]:
cur = conn.cursor()

In [4]:
q = 'SELECT * from it_lemmaBank;'
cur.execute(q)
res = cur.fetchall()

In [5]:
res[0]

(1, 'spiegare', 'VERB')

### Setting up `rdflib`

Let us define some widely used namespaces, plus a few shortcuts for the classes and properties we need:

In [6]:
import rdflib
from rdflib import Namespace, URIRef, Graph, Literal, BNode
from rdflib.namespace import DC, RDF, RDFS, DCTERMS, VOID

In [7]:
# Namespaces
marl = Namespace('http://www.gsi.dit.upm.es/ontologies/marl/ns#')

lime = Namespace('http://www.w3.org/ns/lemon/lime#')
ontolex = Namespace("http://www.w3.org/ns/lemon/ontolex#")
frac = Namespace('http://www.w3.org/ns/lemon/frac#')
lexinfo = Namespace('https://www.lexinfo.net/ontology/3.0/lexinfo#')

### URIRefs

# properties
written_rep = ontolex.writtenRep
canonical_form = ontolex.canonicalForm
has_pos = lexinfo['partOfSpeech']

# classes
lexical_entry = ontolex.LexicalEntry
form = ontolex.Form

a = RDF.type


### POS Mapping

We also have to map the [UD POS tags](https://universaldependencies.org/u/pos/index.html) to URIs in [Lexinfo](https://lexinfo.net/).

In [8]:
pos_map = {
    'NOUN': lexinfo['commonNoun'], 
    'ADV': lexinfo['adverb'],
    'ADJ': lexinfo['adjective'],
    'VERB': lexinfo['verb'],
    'PRON': lexinfo['pronoun'],
    'PROPN': lexinfo['properNoun'],
    'ADP': lexinfo['adposition'],
    'AUX': lexinfo['auxiliary'],
    'CCONJ': lexinfo['coordinatingConjunction'],
    'DET': lexinfo['determiner'],
    'NUM': lexinfo['numeral'],
    'PART': lexinfo['particle'],
    'SCONJ': lexinfo['subordinatingConjunction'],
    'PUNCT': lexinfo['punctuation'],
    'SYM': lexinfo['symbol'],
    'X': None
}

## The Lemma Bank

Here we create an RDF file with a dataset holding our collection of lemmas. Each lemma is defined as an instance of `ontolex:Form`, is assigned to a `void:Dataset` for our collection, and is given a basic description (a human-readable label, a written representation, a POS).

First, we define a **URN schema** for the lemmas: this will serve as URI for our objects. I adopt the [CITE2](https://brillpublishers.gitlab.io/documentation-cts/DTS_CITE_Explained.html) architecture, to make the URIs as independent as possible from implementation. In particular:
- `urn:cite2`: specifies the protocol for the URN
- `circselod`: sets the namespace to the CIRCSE service for linguistic linked data
- `l2l.it`: sets the collection component to the Italian section of the `l2l` collection
- `lemma_{nr}`: identifies the object using the DB id

Let's run a quick test:

In [9]:
lmurn = Namespace('urn:cite2:circselod:l2l.it:lemma_')
lmurn['1']

rdflib.term.URIRef('urn:cite2:circselod:l2l.it:lemma_1')
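To make the four components concrete, here is a small sketch that splits one of our URNs back into its parts. It is deliberately simplified: it only handles the five-field shape used in this notebook and is not a full CITE2 parser. `Cite2Urn` and `parse_cite2` are hypothetical names.

```python
from typing import NamedTuple


class Cite2Urn(NamedTuple):
    namespace: str
    collection: str
    object_id: str


def parse_cite2(urn):
    """Split a urn:cite2 identifier into namespace, collection and object id."""
    scheme, protocol, namespace, collection, object_id = urn.split(":")
    if (scheme, protocol) != ("urn", "cite2"):
        raise ValueError(f"not a CITE2 URN: {urn}")
    return Cite2Urn(namespace, collection, object_id)


parsed = parse_cite2("urn:cite2:circselod:l2l.it:lemma_1")
print(parsed)
```

The DB id can then be recovered from `parsed.object_id` by dropping the `lemma_` prefix.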

We initialize an empty `rdflib.Graph` to collect our triples. Then we add the first statements to our graph: a `void:Dataset` that represents our newly created lemma collection.

In [14]:
g = Graph()

# Let's bind the namespaces
g.bind('ontolex', ontolex)
g.bind('dcterms', DCTERMS)
g.bind('void', VOID)
g.bind('lexinfo', lexinfo)

dtset = URIRef('urn:cite2:circselod:l2l.it:lemma_bank')

g.add((dtset, a, VOID.Dataset))
g.add((dtset, RDFS.label, Literal('L2L Lemma Bank of Italian')))
g.add((dtset, DCTERMS.title, Literal('L2L Lemma Bank of Italian')))
g.add((dtset, VOID.vocabulary, URIRef('http://www.w3.org/ns/lemon/ontolex')))

<Graph identifier=Nc54af6d5395e42fcb1a328ecd08e9737 (<class 'rdflib.graph.Graph'>)>

Now let's populate the graph with the DB content

In [15]:
for i, st, pos in res:
    l = lmurn[str(i)]
    g.add((l, a, form))
    g.add((l, written_rep, Literal(st)))
    g.add((l, RDFS.label, Literal(st)))
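    # 'X' maps to None and tags missing from pos_map (e.g. 'INTJ') have no URI: skip the POS triple
    pos_uri = pos_map.get(pos)
    if pos_uri is not None:
        g.add((l, has_pos, pos_uri))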
    g.add((l, DCTERMS.isPartOf, dtset))

In [16]:
len(g)

412214
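A quick arithmetic check on that number: the dataset node carries 4 triples and the loop adds 5 per lemma, so, assuming every row contributed its five distinct triples, the total should be 4 plus a multiple of 5. The sketch below only redoes that division; the constant names are introduced here for illustration.

```python
DATASET_TRIPLES = 4    # type, label, title, vocabulary on the void:Dataset
TRIPLES_PER_LEMMA = 5  # type, writtenRep, label, partOfSpeech, isPartOf

total_triples = 412214
lemma_count, remainder = divmod(total_triples - DATASET_TRIPLES, TRIPLES_PER_LEMMA)
print(lemma_count, remainder)  # prints: 82442 0
```

A remainder of 0 is what we would expect; the lemma count can be compared with the number of rows fetched from `it_lemmaBank`.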

Et voilà! Now let's serialize the dataset as Turtle.

In [25]:
g.serialize('l2l_it_lemmma_bank.ttl')

<Graph identifier=N0fab484eba3b417ab0a87dbb7550bd18 (<class 'rdflib.graph.Graph'>)>

## Corpus Frequencies

We use the candidate [frac](https://github.com/ontolex/frequency-attestation-corpus-information) extension of OntoLex to model the attestations. Our dataset includes only the basic piece of information: that a lexical entry, canonically identified by a lemma from our previous graph (see above), has $n$ attestations in a given textual resource (called a "corpus" by frac).

**TODO**: `frac` also requires the total number of tokens of each corpus to be listed, so that relative frequencies can be calculated. We did not collect this figure when the DB was generated.
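For reference, this is the calculation that the missing token totals would enable. It is a plain sketch, not part of the pipeline: `relative_frequency` is a hypothetical helper and the corpus size in the example is invented.

```python
def relative_frequency(count, corpus_tokens, per=1_000_000):
    """Occurrences per `per` tokens (per million by default)."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return count * per / corpus_tokens


# 7 attestations in a corpus of 250,000 tokens (the corpus size is invented for illustration)
print(relative_frequency(7, 250_000))
```

Normalizing per million tokens makes counts from corpora of very different sizes comparable.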

Here is how, according to the draft page, you register the frequency of all inflected forms of a word in a text using `frac`:

```turtle
# word frequency, over all form variants 
epsd:kalag_strong_v a ontolex:LexicalEntry;
    frac:frequency [
        a frac:Frequency; 
        rdf:value "2398"^^xsd:int; 
        frac:observedIn <http://oracc.museum.upenn.edu/epsd2/pager>
    ] .
```

### URIs for the textual resources

First, let's get the data from the `it_resource_descriptor` table of the DB. Do **remember** that, for reasons of space, the DB dump is limited to only 14 Italian textual resources from those available on CLARIN's [VLO](https://vlo.clarin.eu/?0).

In [19]:
q = 'SELECT * from it_resource_descriptor;'
cur.execute(q)
res = cur.fetchall()

In [20]:
import re

# this regexp should be enough to extract the unique handles of the 14 CLARIN resources in the limited DB dump
hdl_reg = re.compile(r'(https_58__47__47_\S+)_64_format_61_cmdi;')

For future convenience, we create a dictionary where the key is the numeric resource ID in the DB table and the value is the CLARIN handle.

In [22]:
resource_dict = {}

for r in res:
    try:
        resource_dict[str(r[0])] = hdl_reg.findall(r[1])[0].replace('_58_', ':').replace('_47_', "/")
    except IndexError:
        print(f'handle not found: {r[0]}')


In [23]:
resource_dict['2']

'https://hdl.handle.net/20.500.12124/6'
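The escapes handled above look like decimal character codes wrapped in underscores (`_58_` is `:`, `_47_` is `/`, `_64_` is `@`, `_61_` is `=`). Assuming that pattern holds in general, a single substitution could decode any of them. This is a sketch under that assumption; `decode_escapes` is a hypothetical helper and the input string is modelled on the regular expression, not copied from the DB.

```python
import re

ESCAPE = re.compile(r"_(\d+)_")


def decode_escapes(text):
    """Replace every _NN_ escape with the character whose code point is NN."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1))), text)


# Hypothetical descriptor value, modelled on the pattern matched by hdl_reg
raw = "https_58__47__47_hdl.handle.net_47_20.500.12124_47_6_64_format_61_cmdi"
print(decode_escapes(raw))
```

A handle that legitimately contains digits between two underscores would be mangled by this approach, so the two explicit `replace` calls remain the safer choice for the 14 resources in the dump.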

### Frequency data

In [24]:
cur.execute('select id_resource, id_lemma, freq, it_lemmaBank.lemma from it_resource_lemma JOIN it_lemmaBank on id_lemma = it_lemmaBank.id;')
res = cur.fetchall()
len(res)

152239

In [25]:
res[0]

(1, 3049, 7, 'image')

In [17]:
from rdflib import XSD

g = Graph()

# Let's bind the namespaces
g.bind('ontolex', ontolex)
g.bind('dcterms', DCTERMS)
g.bind('void', VOID)
g.bind('frac', frac)
g.bind('xsd', XSD)

dtset = URIRef('urn:cite2:circselod:l2l.it:corpus_frequencies')

g.add((dtset, a, VOID.Dataset))
g.add((dtset, DCTERMS.description, Literal('L2L frequency data of lemmas in some of CLARIN textual resources for Italian')))
g.add((dtset, RDFS.label, Literal('L2L frequency data Italian')))
g.add((dtset, DCTERMS.title, Literal('L2L frequency data Italian')))
g.add((dtset, VOID.vocabulary, URIRef('http://www.w3.org/ns/lemon/ontolex')))


<Graph identifier=Nbf14fb226b0f4b14bd108809bfdc97cd (<class 'rdflib.graph.Graph'>)>

In [19]:
lexurn = Namespace('urn:cite2:circselod:l2l.it:lex_')

In [21]:
for cid, lid, freq, lemmalab in res:
    freqnode = BNode()
    lex = lexurn[str(lid)]
    g.add((lex, a, lexical_entry))
    g.add((lex, canonical_form, lmurn[str(lid)]))
    g.add((lex, RDFS.label, Literal(lemmalab)))
    
    # the frequency blank node
    handle = URIRef(resource_dict[str(cid)])
    g.add((freqnode, a, frac.Frequency))
    g.add((freqnode, RDF.value, Literal(int(freq), datatype=XSD.integer)))
    g.add((freqnode, frac.observedIn, handle))

    g.add((lex, frac.frequency, freqnode))

In [22]:
len(g)

856287

Now let's serialize the graph!

In [23]:
g.serialize('l2l_it_frequencies.ttl')

<Graph identifier=Nbf14fb226b0f4b14bd108809bfdc97cd (<class 'rdflib.graph.Graph'>)>

## A Lexical Resource

As a final step, we also link one of the lexical resources that we identified in our [survey](). We chose the [OpeNER](http://hdl.handle.net/20.500.11752/ILC-73) sentiment lexicon for Italian. Once again, for convenience, the content of the dictionary has already been indexed and lemmatized by our [scraper]() and included in the DB dump (found in `data/CLARINDB.sql.tar.gz`). So, we'll get the data from the DB.

In [10]:
cur.execute('select * from it_lex_res_elements')
res = cur.fetchall()
len(res)

25098

In [11]:
res[0]

(1,
 1,
 'id_0\tdi_cassetta\tadj\t0.333333333333\tnegative',
 'tsv_line',
 '{1}_{2}',
 '{4}')
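The last two columns of this row look like format templates over the tab-separated fields: `{1}_{2}` would give the lemma and POS, `{4}` the polarity. This reading is an assumption on my part; the code further down does not rely on it and indexes the fields directly. Under that assumption, Python's `str.format` can apply the templates as they are:

```python
raw_line = "id_0\tdi_cassetta\tadj\t0.333333333333\tnegative"
key_template = "{1}_{2}"
polarity_template = "{4}"

fields = raw_line.split("\t")
# str.format with positional indices picks the matching fields
entry_key = key_template.format(*fields)
polarity = polarity_template.format(*fields)
print(entry_key, polarity)
```

Reading the templates from the row, rather than hard-coding the indices, would let the same loop handle resources with a different column layout.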

The results, however, are **not** linked to the lemma bank. We'll do that on the fly. Let's transform our lemma bank into a dictionary `lemma_pos:id`

In [12]:
from collections import defaultdict

cur.execute('select * from it_lemmaBank')
lemma_dict = defaultdict(list)
for i,lm,upos in cur.fetchall():
    lemma_dict[f'{lm}_{upos.lower()}'].append(i)

In [18]:
# lemma_dict['sdruppo_noun'] # should be empty list
lemma_dict['occupazione_noun']

[5]

In [17]:
len([d for d in lemma_dict.values() if len(d) > 1])

0

Apparently, there are no duplicate (lemma, POS) pairs.
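One detail worth keeping in mind: indexing a `defaultdict` with a missing key inserts that key with an empty list, so every failed lookup makes the dictionary grow. A small sketch with two rows taken from the outputs above shows how `.get()` avoids this; `lemma_index` is a hypothetical name used to keep the example separate from `lemma_dict`.

```python
from collections import defaultdict

# Two rows taken from the outputs above; the full table comes from it_lemmaBank
rows = [(1, "spiegare", "VERB"), (5, "occupazione", "NOUN")]

lemma_index = defaultdict(list)
for lemma_id, lemma, upos in rows:
    lemma_index[f"{lemma}_{upos.lower()}"].append(lemma_id)

# .get() does not create the missing key, unlike lemma_index['sdruppo_noun']
missing = lemma_index.get("sdruppo_noun", [])
found = lemma_index.get("occupazione_noun", [])
print(found, missing, len(lemma_index))
```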

In [40]:
g = Graph()

# Let's bind the namespaces
g.bind('ontolex', ontolex)
g.bind('dcterms', DCTERMS)
g.bind('void', VOID)
g.bind('lime', lime)
g.bind('opener', 'http://hdl.handle.net/20.500.11752/ILC-73#')
g.bind('marl', marl)

dtset = URIRef('http://hdl.handle.net/20.500.11752/ILC-73')

g.add((dtset, a, lime.Lexicon))
g.add((dtset, RDFS.label, Literal('OpeNER Sentiment Lexicon Italian - LMF')))
g.add((dtset, DCTERMS.title, Literal('OpeNER Sentiment Lexicon Italian - LMF')))

<Graph identifier=N31a33816575641d1b15507c58ed43cb1 (<class 'rdflib.graph.Graph'>)>

Here is what I want the final result to look like:

```turtle
opener:8 a ontolex:LexicalEntry ;
    rdfs:label 'impotente_adj';
    ontolex:canonicalForm <urn:cite2:circselod:l2l.it:lemma_39904> ;
    ontolex:sense [ marl:hasPolarity marl:Negative ] .
```

To model polarity, we use the same strategy adopted for the Latin Affectus polarity lexicon (as documented [here](https://zenodo.org/record/4067813)).

In [41]:
opener = Namespace('http://hdl.handle.net/20.500.11752/ILC-73#')

for e in res:
    entry_id = e[0]
    entry_vals = e[2].split('\t')
    entry_key = f'{entry_vals[1]}_{entry_vals[2]}'
    try:
        entry_lm = lemma_dict[entry_key][0]
    except IndexError:
        continue
    else:
        entry_uri = opener[str(entry_id)]
        g.add((entry_uri, a, ontolex.LexicalEntry))
        g.add((entry_uri, RDFS.label, Literal(entry_key)))
        g.add((entry_uri, ontolex.canonicalForm, lmurn[str(entry_lm)]))

        # sense blank node
        sense = BNode()
        pol = entry_vals[-1].title()
        g.add((sense, marl.hasPolarity, marl[pol]))

        g.add((entry_uri, ontolex.sense, sense))
        g.add((dtset, lime.entry, entry_uri))
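The loop above silently skips entries whose key is not in the lemma bank, so it is worth counting how many were linked. Here is a minimal sketch on toy data: `impotente_adj` and its id come from the Turtle example above, while treating the other two keys as unmatched is purely illustrative.

```python
from collections import Counter

# Toy stand-ins: a tiny lemma index and three lexicon keys
lemma_index = {"impotente_adj": [39904]}
entry_keys = ["impotente_adj", "di_cassetta_adj", "sdruppo_noun"]

outcome = Counter()
unmatched = []
for key in entry_keys:
    if lemma_index.get(key):
        outcome["linked"] += 1
    else:
        outcome["skipped"] += 1
        unmatched.append(key)

print(dict(outcome), unmatched)
```

Keeping the list of unmatched keys makes it easy to spot systematic gaps, such as multiword expressions or POS labels that do not line up with the UD tags.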

In [42]:
g.serialize('l2l_open_ner.ttl')

<Graph identifier=N31a33816575641d1b15507c58ed43cb1 (<class 'rdflib.graph.Graph'>)>