# Explore the RDF data

In this notebook we explore the RDF data generated with the code in [clarindb2lod.ipynb](clarindb2lod.ipynb). The data consist of 3 files (lemma bank, frequency and sentiment lexicon) and are also available in `data/rdf.zip`.

![the lemma "truffatore"](truffatore.png "Truffatore")

(What's in this picture? It is a visualization, created using [this]() Python script, of the graph for the Italian word *truffatore* 'swindler'. The word is labeled as `Negative` in OpeNER and is attested in 2 of the indexed textual resources, with 1 occurrence in each, namely [here](https://hdl.handle.net/20.500.12124/3) and [here](https://hdl.handle.net/11321/737).)

## Using SPARQL

To follow along, you will need access to a triplestore. I suggest you download a copy of [Jena Fuseki]() and run it on your computer. With the default configuration, you will be able to query the triplestore at http://localhost:3030.

Once you have a local instance of `fuseki` running:

1. create a Dataset named `l2l`
2. upload the three ttl files in `data/rdf.zip`

Then we can just use `requests` to send SPARQL queries to the endpoint and read the results into a `pandas.DataFrame`.

(Obviously, you'll need `pandas` to execute some of the cells below)

In [9]:
import requests
import pandas as pd

This is the address of our endpoint (default values). Change it if you have access to a different triplestore, or if you are running Fuseki on a different port.

In [74]:
endpoint = 'http://localhost:3030/l2l/query'

This query gets all the negative words from our LOD version of OpeNER, returning for each one the URI of the lexical entry, the URI of the lemma, the lemma string and the POS.

In [38]:
spql = '''
PREFIX lime: <http://www.w3.org/ns/lemon/lime#>
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX marl: <http://www.gsi.dit.upm.es/ontologies/marl/ns#>
prefix lexinfo: <https://www.lexinfo.net/ontology/3.0/lexinfo#>

select * where {
  ?lex_entry a ontolex:LexicalEntry ;
      ontolex:canonicalForm ?lemma_uri ;
      ontolex:sense/marl:hasPolarity marl:Negative .
  ?lemma_uri ontolex:writtenRep ?lemma_str ;
      lexinfo:partOfSpeech ?pos
}
'''

In [33]:
r = requests.get(endpoint, params={'query': spql, 'format': 'csv'})

In [34]:
from io import StringIO
df = pd.read_csv(StringIO(r.text))
df.head()

Unnamed: 0,lex_entry,lemma_uri,lemma_str,pos
0,http://hdl.handle.net/20.500.11752/ILC-73#1000,urn:cite2:circselod:l2l.it:lemma_49680,autodistruzione,https://www.lexinfo.net/ontology/3.0/lexinfo#c...
1,http://hdl.handle.net/20.500.11752/ILC-73#10001,urn:cite2:circselod:l2l.it:lemma_69892,nuocere,https://www.lexinfo.net/ontology/3.0/lexinfo#verb
2,http://hdl.handle.net/20.500.11752/ILC-73#10004,urn:cite2:circselod:l2l.it:lemma_5376,rada,https://www.lexinfo.net/ontology/3.0/lexinfo#c...
3,http://hdl.handle.net/20.500.11752/ILC-73#1002,urn:cite2:circselod:l2l.it:lemma_5743,svalutazione,https://www.lexinfo.net/ontology/3.0/lexinfo#c...
4,http://hdl.handle.net/20.500.11752/ILC-73#10026,urn:cite2:circselod:l2l.it:lemma_13168,auspicare,https://www.lexinfo.net/ontology/3.0/lexinfo#verb
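The request and parsing steps above are repeated for every query, so they can be folded into one small helper. This is only a sketch: the name `sparql_to_df` and the optional `session` argument are assumptions introduced here, not part of the original notebook. It reuses the same `query` and `format` parameters sent above and adds a status check, so that a failed request raises an error instead of yielding an empty or malformed dataframe.

```python
from io import StringIO

import pandas as pd
import requests


def sparql_to_df(query, endpoint='http://localhost:3030/l2l/query', session=None):
    """Send a SPARQL SELECT query and return the CSV response as a DataFrame.

    `session` is a hypothetical hook: any object exposing `.get(url, params=...)`
    (for example a `requests.Session`) can be passed instead of the module.
    """
    http = session or requests
    response = http.get(endpoint, params={'query': query, 'format': 'csv'})
    response.raise_for_status()  # fail loudly on HTTP errors
    return pd.read_csv(StringIO(response.text))
```

With Fuseki running, `sparql_to_df(spql).head()` should return the same table as the cells above.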


## Use case: the most attested negative and positive lemmas

Let us use the SPARQL endpoint to query all our datasets and get the positive/negative lemmas with the highest combined frequencies across all the lexical resources that we indexed. I am sure there is a smarter way to do that in SPARQL, but we'll just get a long list of lemmas and use `pandas` to sort them out and do the math.

In [41]:
rq = '''PREFIX lime: <http://www.w3.org/ns/lemon/lime#>
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX marl: <http://www.gsi.dit.upm.es/ontologies/marl/ns#>
prefix lexinfo: <https://www.lexinfo.net/ontology/3.0/lexinfo#>
prefix frac: <http://www.w3.org/ns/lemon/frac#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select ?lex ?lab ?pos ?pol ?freq_val ?freq_text where {
  <http://hdl.handle.net/20.500.11752/ILC-73> lime:entry ?lex .
  ?lex ontolex:canonicalForm ?lemma_uri ;
      ontolex:sense/marl:hasPolarity ?pol .
  ?lemma_uri ontolex:writtenRep ?lab ;
      lexinfo:partOfSpeech ?pos .
  ?freq_le ontolex:canonicalForm ?lemma_uri ;
           frac:frequency [ rdf:value ?freq_val ; frac:observedIn ?freq_text ] .
FILTER (?pol IN (marl:Positive, marl:Negative))
}
'''

In [42]:
res = requests.get(endpoint, params={'query': rq, 'format': 'csv'})
res.status_code

200

All good! Now let's create the dataframe.

In [43]:
df = pd.read_csv(StringIO(res.text))
df.head()

Unnamed: 0,lex,lab,pos,pol,freq_val,freq_text
0,http://hdl.handle.net/20.500.11752/ILC-73#10011,anonimo,https://www.lexinfo.net/ontology/3.0/lexinfo#c...,http://www.gsi.dit.upm.es/ontologies/marl/ns#P...,7,https://hdl.handle.net/20.500.12124/3
1,http://hdl.handle.net/20.500.11752/ILC-73#10012,polifonia,https://www.lexinfo.net/ontology/3.0/lexinfo#c...,http://www.gsi.dit.upm.es/ontologies/marl/ns#P...,2,https://hdl.handle.net/20.500.11752/OPEN-980
2,http://hdl.handle.net/20.500.11752/ILC-73#10018,nomenclatura,https://www.lexinfo.net/ontology/3.0/lexinfo#c...,http://www.gsi.dit.upm.es/ontologies/marl/ns#P...,1,https://hdl.handle.net/20.500.11752/OPEN-548
3,http://hdl.handle.net/20.500.11752/ILC-73#10018,nomenclatura,https://www.lexinfo.net/ontology/3.0/lexinfo#c...,http://www.gsi.dit.upm.es/ontologies/marl/ns#P...,2,https://hdl.handle.net/11356/1674
4,http://hdl.handle.net/20.500.11752/ILC-73#10024,invulnerabile,https://www.lexinfo.net/ontology/3.0/lexinfo#a...,http://www.gsi.dit.upm.es/ontologies/marl/ns#P...,4,https://hdl.handle.net/20.500.11752/OPEN-980
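The full URIs in the `pos` and `pol` columns are truncated in the display and make the table hard to scan. One possible way to shorten them, sketched here with a hypothetical helper called `local_name`, is to keep only the part after the `#`. The two sample rows are taken from the first query's output above.

```python
import pandas as pd


def local_name(uri):
    """Return the part of a URI after the last '#' (the whole URI if there is none)."""
    return uri.rsplit('#', 1)[-1]


# Two rows copied from the output of the first query
sample = pd.DataFrame({
    'lab': ['nuocere', 'auspicare'],
    'pos': ['https://www.lexinfo.net/ontology/3.0/lexinfo#verb'] * 2,
})
sample['pos_short'] = sample['pos'].map(local_name)
```

On the real dataframe, the same idea would be `df['pos'].map(local_name)` and `df['pol'].map(local_name)`. Note that the filters used later in this notebook compare against the full URIs, so the shortened values are best stored in new columns.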


By inspecting some of the results, we notice that several generic verbs (like *pensare* 'to think', or *avere* 'to have') are annotated with a sentiment value in OpeNER, although the value was assigned automatically, as can be seen in this fragment:

```xml
<LexicalEntry id="id_6396" partOfSpeech="verb">
      <Lemma writtenForm="avere"/>
      <Sense>
        <Confidence score="0.5" method="automatic"/>
        <Sentiment polarity="negative"/>
        <Domain/>
      </Sense>
    </LexicalEntry>
```

In our calculations we will just use nouns, adjectives and adverbs (we leave verbs out). 

We now create 2 pivot tables: one aggregating the total attestation score for the positive words, one for the negative words.

In [75]:
# We create the filters: negative/positive polarity and every pos but verbs
marl_url = 'http://www.gsi.dit.upm.es/ontologies/marl/ns#'
lexinfo = 'https://www.lexinfo.net/ontology/3.0/lexinfo#'
neg_filt = (df.pol ==  f'{marl_url}Negative') & (df.pos != f'{lexinfo}verb')
pos_filt = (df.pol ==  f'{marl_url}Positive') & (df.pos != f'{lexinfo}verb')

pivot_table_neg = df[neg_filt].pivot_table(
    values='freq_val',  # Values to aggregate
    index=['lex', 'lab', 'pos', 'pol'],  # Columns to group by
    aggfunc='sum'  # Aggregation function (sum in this case)
)

pivot_table_pos = df[pos_filt].pivot_table(
    values='freq_val',  # Values to aggregate
    index=['lex', 'lab', 'pos', 'pol'],  # Columns to group by
    aggfunc='sum'  # Aggregation function (sum in this case)
)
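To make the aggregation concrete, here is a minimal sketch of what `pivot_table` with `aggfunc='sum'` does, run on five rows copied from this notebook's own output (*nomenclatura* and *malfunzionamento*). The variable names are illustrative, and the `pos` and `pol` columns are left out because their values are truncated in the displayed tables.

```python
import pandas as pd

ilc = 'http://hdl.handle.net/20.500.11752/ILC-73'

# One row per (lexical entry, textual resource) pair, as returned by the query
rows = pd.DataFrame({
    'lex': [f'{ilc}#10018'] * 2 + [f'{ilc}#21420'] * 3,
    'lab': ['nomenclatura'] * 2 + ['malfunzionamento'] * 3,
    'freq_text': [
        'https://hdl.handle.net/20.500.11752/OPEN-548',
        'https://hdl.handle.net/11356/1674',
        'https://hdl.handle.net/11321/737',
        'https://hdl.handle.net/20.500.11752/OPEN-534',
        'https://hdl.handle.net/20.500.12124/3',
    ],
    'freq_val': [1, 2, 2, 1, 2],
})

# Sum the per-resource frequencies into one total per lexical entry
totals = rows.pivot_table(values='freq_val', index=['lex', 'lab'], aggfunc='sum')
by_label = totals.reset_index().set_index('lab')['freq_val']
```

The five input rows collapse into two: *nomenclatura* gets a total of 3 (1 + 2) and *malfunzionamento* a total of 5 (2 + 1 + 2).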

And here are the top-10 most frequent **negative** nouns/adjectives. We find words like:
* servizio ('service')
* problema ('problem')
* punto ('point', ???)
* piccolo ('little', 'small')
* bisogno ('need')

...

In [66]:
pivot_table_neg.sort_values('freq_val', ascending=False).head(10)

lex,lab,pos,pol,freq_val
http://hdl.handle.net/20.500.11752/ILC-73#18105,servizio,https://www.lexinfo.net/ontology/3.0/lexinfo#commonNoun,http://www.gsi.dit.upm.es/ontologies/marl/ns#Negative,2701
http://hdl.handle.net/20.500.11752/ILC-73#18444,problema,https://www.lexinfo.net/ontology/3.0/lexinfo#commonNoun,http://www.gsi.dit.upm.es/ontologies/marl/ns#Negative,2308
http://hdl.handle.net/20.500.11752/ILC-73#598,punto,https://www.lexinfo.net/ontology/3.0/lexinfo#commonNoun,http://www.gsi.dit.upm.es/ontologies/marl/ns#Negative,2137
http://hdl.handle.net/20.500.11752/ILC-73#22461,piccolo,https://www.lexinfo.net/ontology/3.0/lexinfo#adjective,http://www.gsi.dit.upm.es/ontologies/marl/ns#Negative,1983
http://hdl.handle.net/20.500.11752/ILC-73#22250,bisogno,https://www.lexinfo.net/ontology/3.0/lexinfo#commonNoun,http://www.gsi.dit.upm.es/ontologies/marl/ns#Negative,1969
http://hdl.handle.net/20.500.11752/ILC-73#16790,seme,https://www.lexinfo.net/ontology/3.0/lexinfo#commonNoun,http://www.gsi.dit.upm.es/ontologies/marl/ns#Negative,1964
http://hdl.handle.net/20.500.11752/ILC-73#7205,senso,https://www.lexinfo.net/ontology/3.0/lexinfo#commonNoun,http://www.gsi.dit.upm.es/ontologies/marl/ns#Negative,1583
http://hdl.handle.net/20.500.11752/ILC-73#13524,orribile,https://www.lexinfo.net/ontology/3.0/lexinfo#adjective,http://www.gsi.dit.upm.es/ontologies/marl/ns#Negative,1213
http://hdl.handle.net/20.500.11752/ILC-73#5849,difficile,https://www.lexinfo.net/ontology/3.0/lexinfo#adjective,http://www.gsi.dit.upm.es/ontologies/marl/ns#Negative,1136
http://hdl.handle.net/20.500.11752/ILC-73#14727,successivo,https://www.lexinfo.net/ontology/3.0/lexinfo#adjective,http://www.gsi.dit.upm.es/ontologies/marl/ns#Negative,1002


And here are the top-10 most frequent **positive** entries, like:
* amico ('friend')
* casa ('house', 'home')
* bello ('nice', 'beautiful')
* buono ('good')
* caro ('dear')

...

In [62]:
pivot_table_pos.sort_values('freq_val', ascending=False).head(10)

lex,lab,pos,pol,freq_val
http://hdl.handle.net/20.500.11752/ILC-73#24469,amico,https://www.lexinfo.net/ontology/3.0/lexinfo#commonNoun,http://www.gsi.dit.upm.es/ontologies/marl/ns#Positive,4378
http://hdl.handle.net/20.500.11752/ILC-73#5396,casa,https://www.lexinfo.net/ontology/3.0/lexinfo#commonNoun,http://www.gsi.dit.upm.es/ontologies/marl/ns#Positive,3659
http://hdl.handle.net/20.500.11752/ILC-73#15983,bello,https://www.lexinfo.net/ontology/3.0/lexinfo#adjective,http://www.gsi.dit.upm.es/ontologies/marl/ns#Positive,3500
http://hdl.handle.net/20.500.11752/ILC-73#8815,buono,https://www.lexinfo.net/ontology/3.0/lexinfo#adjective,http://www.gsi.dit.upm.es/ontologies/marl/ns#Positive,3210
http://hdl.handle.net/20.500.11752/ILC-73#12542,caro,https://www.lexinfo.net/ontology/3.0/lexinfo#adjective,http://www.gsi.dit.upm.es/ontologies/marl/ns#Positive,3175
http://hdl.handle.net/20.500.11752/ILC-73#19436,primo,https://www.lexinfo.net/ontology/3.0/lexinfo#adjective,http://www.gsi.dit.upm.es/ontologies/marl/ns#Positive,3123
http://hdl.handle.net/20.500.11752/ILC-73#5259,legge,https://www.lexinfo.net/ontology/3.0/lexinfo#commonNoun,http://www.gsi.dit.upm.es/ontologies/marl/ns#Positive,2654
http://hdl.handle.net/20.500.11752/ILC-73#17373,vita,https://www.lexinfo.net/ontology/3.0/lexinfo#commonNoun,http://www.gsi.dit.upm.es/ontologies/marl/ns#Positive,2464
http://hdl.handle.net/20.500.11752/ILC-73#7864,decreto,https://www.lexinfo.net/ontology/3.0/lexinfo#commonNoun,http://www.gsi.dit.upm.es/ontologies/marl/ns#Positive,2055
http://hdl.handle.net/20.500.11752/ILC-73#19928,libero,https://www.lexinfo.net/ontology/3.0/lexinfo#adjective,http://www.gsi.dit.upm.es/ontologies/marl/ns#Positive,2016
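The same ranking could probably be computed inside the triplestore with SPARQL 1.1 aggregates (`SUM`, `GROUP BY`, `ORDER BY`) instead of `pandas`. The sketch below only builds the query string and has not been run against this dataset. The function name `top_lemmas_query` is hypothetical, and the query assumes that the `rdf:value` literals are typed as numbers, otherwise `SUM` would not return a total.

```python
PREFIXES = '''PREFIX lime: <http://www.w3.org/ns/lemon/lime#>
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX marl: <http://www.gsi.dit.upm.es/ontologies/marl/ns#>
PREFIX lexinfo: <https://www.lexinfo.net/ontology/3.0/lexinfo#>
PREFIX frac: <http://www.w3.org/ns/lemon/frac#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
'''


def top_lemmas_query(polarity, limit=10):
    """Build a query ranking non-verb lemmas of one polarity by summed frequency."""
    if polarity not in ('Positive', 'Negative'):
        raise ValueError("polarity must be 'Positive' or 'Negative'")
    return PREFIXES + f'''
SELECT ?lex ?lab ?pos (SUM(?freq_val) AS ?total) WHERE {{
  <http://hdl.handle.net/20.500.11752/ILC-73> lime:entry ?lex .
  ?lex ontolex:canonicalForm ?lemma_uri ;
      ontolex:sense/marl:hasPolarity marl:{polarity} .
  ?lemma_uri ontolex:writtenRep ?lab ;
      lexinfo:partOfSpeech ?pos .
  ?freq_le ontolex:canonicalForm ?lemma_uri ;
           frac:frequency [ rdf:value ?freq_val ; frac:observedIn ?freq_text ] .
  FILTER (?pos != lexinfo:verb)
}}
GROUP BY ?lex ?lab ?pos
ORDER BY DESC(?total)
LIMIT {int(limit)}
'''
```

Sending `top_lemmas_query('Negative')` to the endpoint with `requests.get`, as in the cells above, should yield one row per lemma; whether the totals match the pivot tables is worth checking before relying on it.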


What if we want to know where the negative noun *malfunzionamento* ('malfunctioning') occurs?

In [73]:
df[df.lex == 'http://hdl.handle.net/20.500.11752/ILC-73#21420'][['lab', 'freq_text', 'freq_val']]

Unnamed: 0,lab,freq_text,freq_val
7068,malfunzionamento,https://hdl.handle.net/11321/737,2
7069,malfunzionamento,https://hdl.handle.net/20.500.11752/OPEN-534,1
7070,malfunzionamento,https://hdl.handle.net/20.500.12124/3,2
