# Use Smell Embeddings

This notebook shows how to load pre-trained RDF2vec embeddings computed on the [European Olfactory Knowledge Graph](http://data.odeuropa.eu) by [Odeuropa](http://odeuropa.eu).

In [6]:
from os import path
import pandas as pd
from gensim.models import KeyedVectors
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://data.odeuropa.eu/repositories/odeuropa")
sparql.setReturnFormat(JSON)

In [7]:
def label(uri):
    q = '''
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        PREFIX onto: <http://www.ontotext.com/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

        SELECT ?label
        FROM onto:explicit
        WHERE {
            <%s> skos:prefLabel|rdfs:label ?label
        }
    ''' % uri
    sparql.setQuery(q)
    ret = sparql.queryAndConvert()
    data = [l['label'] for l in ret['results']['bindings']]
    if len(data) < 1:
        return "smell"
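    # Untagged labels first, then English, then the other languages alphabetically.
    # The fallback key is a string so that it can be compared with language tags.
    data.sort(key=lambda l: ('aaa' if l['xml:lang'] == 'en' else l['xml:lang']) if 'xml:lang' in l else '')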
    return data[0]['value']

def clean(array):
    return [r for r in array if not pd.isna(r)]
    
def do_nothing(inp):
    return inp

def to_labels(array):
    return [label(r) if str(r).startswith('http') else r for r in array if not pd.isna(r)]
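
The language preference applied inside `label` can be tried without querying the endpoint. The following is a minimal sketch, assuming bindings shaped like the SPARQL JSON results used above (a `value` plus an optional `xml:lang`). `pick_label` and the sample bindings are hypothetical names and data introduced only for illustration.

```python
def pick_label(bindings, default="smell"):
    """Pick one label: untagged first, then English, then other tags alphabetically."""
    if not bindings:
        return default

    def rank(binding):
        lang = binding.get("xml:lang", "")
        # '' (no language tag) sorts before 'aaa' (English), which sorts before most tags.
        return "aaa" if lang == "en" else lang

    return sorted(bindings, key=rank)[0]["value"]


# Hypothetical bindings, shaped like the entries of ret['results']['bindings']
sample = [
    {"type": "literal", "xml:lang": "it", "value": "Incenso"},
    {"type": "literal", "xml:lang": "en", "value": "Incense"},
]
print(pick_label(sample))
print(pick_label([]))
```

Keeping the ranking in a separate function makes it possible to check the fallback and the ordering on hand-written data before sending any query.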

Load all resources

In [19]:
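root = './embeddings'
all_data = pd.read_csv(path.join(root, 'all_data.csv'))

# Embeddings of the controlled-vocabulary terms
voc_emb = KeyedVectors.load(path.join(root, 'voc.kv'))

# NOTE: this path is identical to the one used for voc_emb, so smell_emb holds
# the vocabulary vectors. Point it to the smell embeddings file to query smell URIs.
smell_emb = KeyedVectors.load(path.join(root, 'voc.kv'))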


    
def get(smell, with_labels=False):
    # Collect the distinct, non-null values of each property for one smell URI
    res = all_data[all_data['smell'] == smell]
    proc = to_labels if with_labels else do_nothing
    columns = ['source', 'carrier', 'quality', 'quality_type', 'place',
               'place_type', 'gesture', 'emotion', 'time']
    return {col: proc(clean(res[col].unique().tolist())) for col in columns}
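
What `get` does can be checked on a small frame without loading the real data. This is a minimal sketch, assuming a frame with a `smell` column and one column per property, as in `all_data`. The frame `toy`, the shortened identifiers and the helper `distinct_values` are hypothetical and introduced only for illustration.

```python
import pandas as pd

# Hypothetical miniature version of all_data (identifiers shortened for readability)
toy = pd.DataFrame({
    "smell":  ["s1", "s1", "s2"],
    "source": ["tobacco", "tobacco", None],
    "time":   ["1641", "1641", "1862"],
})


def distinct_values(frame, smell, columns=("source", "time")):
    """Distinct non-null values of each column for the rows of one smell."""
    rows = frame[frame["smell"] == smell]
    return {
        col: [v for v in rows[col].unique().tolist() if not pd.isna(v)]
        for col in columns
    }


print(distinct_values(toy, "s1"))
print(distinct_values(toy, "s2"))
```

A smell spread over several rows collapses to one list per property, and a property without values yields an empty list rather than a missing key.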



The pandas dataframe below contains all values of the most important properties of the graph, smell by smell.

In [20]:
all_data

| Unnamed: 0 | smell | source | carrier | quality | quality_type | place | place_type | gesture | emotion | time |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://data.odeuropa.eu/smell/f37eb716-2bea-58... | http://data.odeuropa.eu/vocabulary/olfactory-o... | http://data.odeuropa.eu/vocabulary/olfactory-o... | | | | | | | 1771 |
| 1 | http://data.odeuropa.eu/smell/f37eb716-2bea-58... | http://data.odeuropa.eu/vocabulary/olfactory-o... | http://data.odeuropa.eu/vocabulary/olfactory-o... | | | | | | | 1771 |
| 2 | http://data.odeuropa.eu/smell/f37eb716-2bea-58... | http://data.odeuropa.eu/vocabulary/olfactory-o... | http://data.odeuropa.eu/vocabulary/olfactory-o... | | | | | | | 1771 |
| 3 | http://data.odeuropa.eu/smell/bce916c1-99b3-54... | | http://data.odeuropa.eu/vocabulary/olfactory-o... | | | | | | | 1566 |
| 4 | http://data.odeuropa.eu/smell/d5d25ebb-79d8-5e... | http://data.odeuropa.eu/vocabulary/olfactory-o... | | | | | | | | 1660 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 241505 | http://data.odeuropa.eu/smell/cebb7b35-b8a4-5c... | | | | | | | | | 1862 |
| 241506 | http://data.odeuropa.eu/smell/1c72f57e-1170-50... | | | | | | | | | 1862 |
| 241507 | http://data.odeuropa.eu/smell/f6463053-02ec-59... | | | | | | | | | 1862 |
| 241508 | http://data.odeuropa.eu/smell/aad93ee8-c53d-5f... | | | | | | | | | 1862 |


In [21]:
all_data.sort_values(by=['source', 'carrier', 'quality', 'place', 'gesture', 'emotion'], axis=0).to_csv('all_data_sorted.csv', index=False)

In [22]:
all_data = pd.read_csv('all_data_sorted.csv')



In [23]:
all_data['smell'].to_csv('all_data_x.csv', index=False)

You can manipulate the pandas dataframe directly or select a particular smell (e.g. http://data.odeuropa.eu/smell/b6cdd9fe-a1a1-5aa3-bf4c-162a5c2d1ead) as follows:

In [24]:
get('http://data.odeuropa.eu/smell/b6cdd9fe-a1a1-5aa3-bf4c-162a5c2d1ead')

{'source': ['http://data.odeuropa.eu/vocabulary/olfactory-objects/151',
  'http://data.odeuropa.eu/vocabulary/olfactory-objects/158',
  'http://data.odeuropa.eu/vocabulary/olfactory-objects/227',
  'http://data.odeuropa.eu/vocabulary/olfactory-objects/505'],
 'carrier': ['http://data.odeuropa.eu/vocabulary/olfactory-objects/451',
  'http://data.odeuropa.eu/vocabulary/olfactory-objects/452',
  'http://data.odeuropa.eu/vocabulary/olfactory-objects/474'],
 'quality': [],
 'quality_type': [],
 'place': [],
 'place_type': [],
 'gesture': [],
 'emotion': [],
 'time': ['1641']}

Optionally, the labels of the terms can be retrieved as well:

In [25]:
get('http://data.odeuropa.eu/smell/b6cdd9fe-a1a1-5aa3-bf4c-162a5c2d1ead', with_labels=True)

{'source': ['Snuff Box', 'Tobacco packaging', 'Tobacco', 'Match'],
 'carrier': ['Glass without stem', 'Jug', 'Ashtray'],
 'quality': [],
 'quality_type': [],
 'place': [],
 'place_type': [],
 'gesture': [],
 'emotion': [],
 'time': ['1641']}
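
Besides selecting a single smell, the dataframe can be filtered directly. The following is a minimal sketch on a hypothetical frame `toy` with shortened identifiers. It assumes, as the outputs above suggest, that the `time` column may hold years as strings, so they are converted before comparison.

```python
import pandas as pd

# Hypothetical miniature version of all_data (identifiers shortened for readability)
toy = pd.DataFrame({
    "smell":  ["s1", "s2", "s3", "s3"],
    "source": ["tobacco", "incense", "tobacco", "tobacco"],
    "time":   ["1641", "1771", "1862", "1862"],
})

# Smells whose source is a given term (duplicate rows collapsed)
tobacco_smells = toy.loc[toy["source"] == "tobacco", "smell"].unique().tolist()

# Convert the years to numbers; values that cannot be parsed become NaN
years = pd.to_numeric(toy["time"], errors="coerce")
after_1700 = toy.loc[years >= 1700, "smell"].unique().tolist()

print(tobacco_smells)
print(after_1700)
```

On the real data, the same pattern applies with full URIs in place of the shortened values.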

We provide two kinds of embeddings.

The vocabulary embeddings `voc_emb` have been computed on the terms belonging to our [controlled vocabularies](http://vocab.odeuropa.eu). In other words, these are the terms you see in the columns of the pandas dataframe.
The number of terms in these embeddings is:

In [26]:
len(voc_emb)

2330

You can look up the 10 elements most similar to a term, for example:

In [27]:
label('http://data.odeuropa.eu/vocabulary/olfactory-objects/269')

'Incense'

In [28]:
res = voc_emb.most_similar('http://data.odeuropa.eu/vocabulary/olfactory-objects/269', topn=10) # incense
['%.4f   %s   %s' % (r[1], r[0], label(r[0])) for r in res]

['0.7305   http://data.odeuropa.eu/vocabulary/olfactory-objects/267   Frankincense',
 '0.6982   http://data.odeuropa.eu/vocabulary/olfactory-objects/431   Amber',
 '0.6808   http://data.odeuropa.eu/vocabulary/olfactory-objects/204   Tolu balm',
 '0.6704   http://data.odeuropa.eu/vocabulary/olfactory-objects/455   Wine Bottle',
 '0.6635   http://data.odeuropa.eu/vocabulary/olfactory-objects/9   Burnt offering',
 '0.6632   http://data.odeuropa.eu/vocabulary/olfactory-objects/205   Peru balm',
 '0.6594   http://data.odeuropa.eu/vocabulary/historical-scent/resinous_frankincense   Frankincense',
 '0.6572   http://data.odeuropa.eu/vocabulary/olfactory-objects/220   Opoponax',
 '0.6520   http://data.odeuropa.eu/vocabulary/olfactory-objects/197   Labdanum',
 '0.6443   http://data.odeuropa.eu/vocabulary/olfactory-objects/456   Carafe']
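
The scores returned by `most_similar` are cosine similarities between embedding vectors. The ranking can be reproduced by hand with a minimal sketch in plain Python. The three-dimensional vectors and their keys are made up for illustration and are not taken from the trained model, and the `most_similar` function defined here is a simplified stand-in, not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, key, topn=10):
    """Rank every other key by cosine similarity to `key`, highest first."""
    query = vectors[key]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != key]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, for illustration only
toy_vectors = {
    "incense":      [1.0, 0.0, 0.0],
    "frankincense": [0.9, 0.1, 0.0],
    "carafe":       [0.0, 1.0, 0.0],
}
for name, score in most_similar(toy_vectors, "incense", topn=2):
    print("%.4f   %s" % (score, name))
```

A score close to 1 means the two vectors point in nearly the same direction, while a score near 0 means they are close to orthogonal.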

In [29]:
len(smell_emb)

2330

The same lookup can be run on `smell_emb` for a single smell, for example:

In [38]:
sm = 'http://data.odeuropa.eu/smell/cde90c8f-3833-53d2-9a20-975c0d78d894'
label(sm)

'odor'

In [39]:
get(sm, with_labels=True)

{'source': [],
 'carrier': [],
 'quality': [],
 'quality_type': [],
 'place': [],
 'place_type': [],
 'gesture': [],
 'emotion': [],
 'time': []}

In [32]:
get(sm)

{'source': ['http://data.odeuropa.eu/vocabulary/olfactory-objects/227'],
 'carrier': [],
 'quality': ['http://data.odeuropa.eu/vocabulary/vdi-hedonic/05n',
  'http://data.odeuropa.eu/vocabulary/vdi-intensity/4'],
 'quality_type': ['http://data.odeuropa.eu/attribute-type/hedonic',
  'http://data.odeuropa.eu/attribute-type/intensity'],
 'place': [],
 'place_type': [],
 'gesture': [],
 'emotion': [],
 'time': ['1874']}

In [37]:
res = smell_emb.most_similar(sm, topn=10)
['%.4f   %s   %s' % (r[1], r[0], label(r[0])) for r in res]

KeyError: "Key 'http://data.odeuropa.eu/smell/5f40ad52-8525-5f41-91cf-8ad024f29df7' not present in vocabulary"
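
The `KeyError` above is raised when the requested key has no vector in the loaded model. A minimal sketch of a guard follows, relying on the `in` membership test that gensim's `KeyedVectors` supports. `similar_or_empty` is a hypothetical helper, and `TinyEmbeddings` is a hypothetical stand-in class used only so that the snippet runs without the embedding files.

```python
def similar_or_empty(emb, key, topn=10):
    """Return emb.most_similar(key, topn=topn), or [] when the key has no vector."""
    if key not in emb:
        return []
    return emb.most_similar(key, topn=topn)


class TinyEmbeddings:
    """Hypothetical stand-in exposing only the two members used above."""

    def __init__(self, neighbours):
        self._neighbours = neighbours

    def __contains__(self, key):
        return key in self._neighbours

    def most_similar(self, key, topn=10):
        return self._neighbours[key][:topn]


emb = TinyEmbeddings({"a": [("b", 0.9), ("c", 0.5)]})
print(similar_or_empty(emb, "a", topn=1))
print(similar_or_empty(emb, "missing"))
```

With the real models, the call would be `similar_or_empty(smell_emb, sm)`, which returns an empty list instead of raising when `sm` is absent.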

In [43]:
get(res[0][0], with_labels=True)

{'source': ['Blood'],
 'carrier': [],
 'quality': ['Light'],
 'quality_type': ['hedonic'],
 'place': [],
 'place_type': [],
 'gesture': [],
 'emotion': [],
 'time': ['1905']}

In [36]:
res[2][0]

'http://data.odeuropa.eu/vocabulary/olfactory-objects/204'