# Use Smell Embeddings

This notebook lets you load pre-trained RDF2vec embeddings computed on the [European Olfactory Knowledge Graph](http://data.odeuropa.eu) by [Odeuropa](http://odeuropa.eu).

In [44]:
from os import path
import pandas as pd
from gensim.models import KeyedVectors
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://data.odeuropa.eu/repositories/odeuropa")
sparql.setReturnFormat(JSON)

In [45]:
def label(uri):
    q = '''
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        PREFIX onto: <http://www.ontotext.com/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

        SELECT ?label
        FROM onto:explicit
        WHERE {
            <%s> skos:prefLabel|rdfs:label ?label
        }
    ''' % uri
    sparql.setQuery(q)
    ret = sparql.queryAndConvert()
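    data = [l['label'] for l in ret['results']['bindings']]
    if len(data) < 1:
        return "smell"  # fallback when the resource has no label
    # Untagged literals sort first (key ''), then English, then the other
    # languages alphabetically. Using '' instead of 0 keeps every sort key a
    # string; mixing int and str keys raises a TypeError on Python 3.
    data.sort(key=lambda l: ('aaa' if l['xml:lang'] == 'en' else l['xml:lang']) if 'xml:lang' in l else '')
    return data[0]['value']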

def clean(array):
    return [r for r in array if not pd.isna(r)]
    
def do_nothing(inp):
    return inp

def to_labels(array):
    return [label(r) if str(r).startswith('http') else r for r in array if not pd.isna(r)]
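
The language preference used in `label` can be tried out offline. The following is a minimal sketch on hand-written bindings: the sample values and the name `pick_label` are illustrative and not taken from the graph, and ranking untagged literals first is an assumption.

```python
def pick_label(bindings, default="smell"):
    """Return the preferred label value from SPARQL JSON bindings."""
    if not bindings:
        return default

    def rank(binding):
        lang = binding.get("xml:lang")
        if lang is None:
            return ""  # untagged literal: ranked first (assumption)
        return "aaa" if lang == "en" else lang  # English before other tags

    return min(bindings, key=rank)["value"]


# Illustrative bindings in the shape returned by a SPARQL JSON endpoint
sample = [
    {"type": "literal", "xml:lang": "nl", "value": "Wierook"},
    {"type": "literal", "xml:lang": "en", "value": "Incense"},
    {"type": "literal", "xml:lang": "fr", "value": "Encens"},
]
print(pick_label(sample))
print(pick_label([]))
```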

Load all resources

In [46]:
root = './embeddings'
all_data = pd.read_csv(path.join(root, 'all_data.csv'))

# Load the text-format embeddings and save a copy in gensim's native format.
voc_emb_path = path.join(root, 'voc/embeddings.txt')
voc_emb = KeyedVectors.load_word2vec_format(voc_emb_path, binary=False, no_header=True)
voc_emb.save('voc_emb.kv')

smell_emb_path = path.join(root, 'smells/embeddings.txt')
smell_emb = KeyedVectors.load_word2vec_format(smell_emb_path, binary=False, no_header=True)
smell_emb.save('smell_emb.kv')


def get(smell, with_labels=False):
    res = all_data[all_data['smell'] == smell]
    proc = to_labels if with_labels else do_nothing
    return {
        'source': proc(clean(res['source'].unique().tolist())),
        'carrier': proc(clean(res['carrier'].unique().tolist())),
        'quality': proc(clean(res['quality'].unique().tolist())),
        'quality_type': proc(clean(res['quality_type'].unique().tolist())),
        'place': proc(clean(res['place'].unique().tolist())),
        'place_type': proc(clean(res['place_type'].unique().tolist())),
        'gesture': proc(clean(res['gesture'].unique().tolist())),
        'emotion': proc(clean(res['emotion'].unique().tolist())),
        'time': proc(clean(res['time'].unique().tolist()))
}
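
The `no_header=True` option passed when loading the embeddings tells gensim that the text file has no leading line with the vocabulary size and dimensionality, so every line holds a key followed by its vector components. As a rough illustration of that layout, a stdlib-only reader could look like the sketch below. The function `read_headerless_vectors` and the sample lines are hypothetical; the real files are loaded with `KeyedVectors.load_word2vec_format`.

```python
import io


def read_headerless_vectors(handle):
    """Parse lines of the form '<key> <v1> <v2> ...' into a dict."""
    vectors = {}
    dim = None
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        key, values = parts[0], [float(x) for x in parts[1:]]
        if dim is None:
            dim = len(values)  # the first vector fixes the dimensionality
        elif len(values) != dim:
            raise ValueError("inconsistent vector size for %s" % key)
        vectors[key] = values
    return vectors


# Made-up sample with two keys and three dimensions
sample = io.StringIO(
    "http://example.org/a 0.1 0.2 0.3\n"
    "http://example.org/b 0.4 0.5 0.6\n"
)
vecs = read_headerless_vectors(sample)
print(len(vecs), len(vecs["http://example.org/a"]))
```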


The pandas dataframe `all_data` contains, smell by smell, all values for the most important properties of the graph:

In [29]:
all_data

Unnamed: 0,smell,source,carrier,quality,quality_type,place,place_type,gesture,emotion,time
0,http://data.odeuropa.eu/smell/f37eb716-2bea-58...,http://data.odeuropa.eu/vocabulary/olfactory-o...,http://data.odeuropa.eu/vocabulary/olfactory-o...,,,,,,,1771
1,http://data.odeuropa.eu/smell/f37eb716-2bea-58...,http://data.odeuropa.eu/vocabulary/olfactory-o...,http://data.odeuropa.eu/vocabulary/olfactory-o...,,,,,,,1771
2,http://data.odeuropa.eu/smell/f37eb716-2bea-58...,http://data.odeuropa.eu/vocabulary/olfactory-o...,http://data.odeuropa.eu/vocabulary/olfactory-o...,,,,,,,1771
3,http://data.odeuropa.eu/smell/bce916c1-99b3-54...,,http://data.odeuropa.eu/vocabulary/olfactory-o...,,,,,,,1566
4,http://data.odeuropa.eu/smell/d5d25ebb-79d8-5e...,http://data.odeuropa.eu/vocabulary/olfactory-o...,,,,,,,,1660
...,...,...,...,...,...,...,...,...,...,...
241505,http://data.odeuropa.eu/smell/cebb7b35-b8a4-5c...,,,,,,,,,1862
241506,http://data.odeuropa.eu/smell/1c72f57e-1170-50...,,,,,,,,,1862
241507,http://data.odeuropa.eu/smell/f6463053-02ec-59...,,,,,,,,,1862
241508,http://data.odeuropa.eu/smell/aad93ee8-c53d-5f...,,,,,,,,,1862


In [5]:
all_data.sort_values(by=['source', 'carrier', 'quality', 'place', 'gesture', 'emotion'], axis=0).to_csv('all_data_sorted.csv', index=False)

In [6]:
all_data = pd.read_csv('all_data_sorted.csv')


In [7]:
all_data['smell'].to_csv('all_data_x.csv', index=False)

You can manipulate the pandas dataframe directly, or select a particular smell (e.g. http://data.odeuropa.eu/smell/b6cdd9fe-a1a1-5aa3-bf4c-162a5c2d1ead ) as follows:

In [8]:
get('http://data.odeuropa.eu/smell/b6cdd9fe-a1a1-5aa3-bf4c-162a5c2d1ead')

{'source': ['http://data.odeuropa.eu/vocabulary/olfactory-objects/151',
  'http://data.odeuropa.eu/vocabulary/olfactory-objects/158',
  'http://data.odeuropa.eu/vocabulary/olfactory-objects/227',
  'http://data.odeuropa.eu/vocabulary/olfactory-objects/505'],
 'carrier': ['http://data.odeuropa.eu/vocabulary/olfactory-objects/451',
  'http://data.odeuropa.eu/vocabulary/olfactory-objects/452',
  'http://data.odeuropa.eu/vocabulary/olfactory-objects/474'],
 'quality': [],
 'quality_type': [],
 'place': [],
 'place_type': [],
 'gesture': [],
 'emotion': [],
 'time': ['1641']}

Optionally, you can also retrieve the labels of the terms:

In [9]:
get('http://data.odeuropa.eu/smell/b6cdd9fe-a1a1-5aa3-bf4c-162a5c2d1ead', with_labels=True)

{'source': ['Snuff Box', 'Tobacco packaging', 'Tobacco', 'Match'],
 'carrier': ['Glass without stem', 'Jug', 'Ashtray'],
 'quality': [],
 'quality_type': [],
 'place': [],
 'place_type': [],
 'gesture': [],
 'emotion': [],
 'time': ['1641']}
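
With `with_labels=True`, every URI triggers a separate SPARQL request through `label`, and the same vocabulary terms recur across many smells. One possible way to avoid repeated requests is to memoize the lookup. The sketch below wraps a stand-in function with `functools.lru_cache`; `fetch_label` and its small dictionary are hypothetical placeholders for the real `label` call, so that the example runs offline.

```python
from functools import lru_cache

# Stand-in for the remote lookup; in the notebook this would be `label(uri)`.
_FAKE_STORE = {
    "http://data.odeuropa.eu/vocabulary/olfactory-objects/227": "Tobacco",
    "http://data.odeuropa.eu/vocabulary/olfactory-objects/269": "Incense",
}
calls = []


def fetch_label(uri):
    calls.append(uri)  # record each simulated request
    return _FAKE_STORE.get(uri, "smell")


@lru_cache(maxsize=None)
def cached_label(uri):
    return fetch_label(uri)


uris = [
    "http://data.odeuropa.eu/vocabulary/olfactory-objects/227",
    "http://data.odeuropa.eu/vocabulary/olfactory-objects/269",
    "http://data.odeuropa.eu/vocabulary/olfactory-objects/227",
]
labels = [cached_label(u) for u in uris]
print(labels)      # three labels are returned
print(len(calls))  # the repeated URI did not trigger a second lookup
```

In the notebook, the same decorator could be applied to `label` itself, since its only argument is a hashable string.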

We provide two kinds of embeddings.

The vocabulary embeddings `voc_emb` have been computed on the terms belonging to our [controlled vocabularies](http://vocab.odeuropa.eu). In other words, these are the terms you see in the columns of the pandas dataframe.
The number of terms in these embeddings is:

In [10]:
len(voc_emb)

2442

You can search for the 10 elements most similar to a given term, for example:

In [47]:
label('http://data.odeuropa.eu/vocabulary/olfactory-objects/269')

'Incense'

In [49]:
res = voc_emb.most_similar('http://data.odeuropa.eu/vocabulary/olfactory-objects/269', topn=10) # incense
['%.4f   %s   %s' % (r[1], r[0], label(r[0])) for r in res]

['0.7755   http://data.odeuropa.eu/vocabulary/olfactory-objects/267   Frankincense',
 '0.7421   http://data.odeuropa.eu/vocabulary/olfactory-objects/399   Reukwerk',
 '0.7351   http://data.odeuropa.eu/vocabulary/olfactory-objects/25   Candle',
 '0.7176   http://data.odeuropa.eu/vocabulary/olfactory-objects/89   Incense Burner',
 '0.7078   http://data.odeuropa.eu/vocabulary/olfactory-objects/97   Lodereindoos',
 '0.7040   http://data.odeuropa.eu/vocabulary/olfactory-objects/262   Mildew',
 '0.7028   http://data.odeuropa.eu/vocabulary/olfactory-objects/207   Castoreum',
 '0.6883   http://data.odeuropa.eu/vocabulary/olfactory-objects/263   Mould',
 '0.6862   http://data.odeuropa.eu/vocabulary/olfactory-objects/251   Incense bowl',
 '0.6765   http://data.odeuropa.eu/vocabulary/olfactory-objects/228   Aqua mirabilis']
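
The scores returned by `most_similar` are cosine similarities between the query vector and the other vectors in the model. To make the ranking concrete, here is a small pure-Python sketch on made-up three-dimensional vectors. The keys, the values, and the helper `top_similar` are illustrative assumptions; gensim applies the same measure to the real embeddings in vectorized form.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_similar(vectors, key, topn=10):
    """Rank all other keys by cosine similarity to `key`, best first."""
    query = vectors[key]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != key]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, not taken from the real embeddings
toy = {
    "incense":      [0.9, 0.1, 0.0],
    "frankincense": [0.8, 0.2, 0.1],
    "candle":       [0.5, 0.5, 0.1],
    "mildew":       [0.0, 0.2, 0.9],
}
ranking = top_similar(toy, "incense", topn=3)
for name, score in ranking:
    print("%.4f   %s" % (score, name))
```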

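The smell embeddings `smell_emb` have been computed on the individual smells of the graph, identified by the URIs in the `smell` column of the dataframe. The number of smells in these embeddings is:

In [25]: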
len(smell_emb)

33937

You can search for the 10 smells most similar to a given smell, for example:

In [30]:
sm = 'http://data.odeuropa.eu/smell/5f40ad52-8525-5f41-91cf-8ad024f29df7'
label(sm)

'smell'

In [31]:
get(sm, with_labels=True)

{'source': ['Tobacco'],
 'carrier': [],
 'quality': ['Light', 'strong'],
 'quality_type': ['hedonic', 'intensity'],
 'place': [],
 'place_type': [],
 'gesture': [],
 'emotion': [],
 'time': ['1874']}

In [32]:
get(sm)

{'source': ['http://data.odeuropa.eu/vocabulary/olfactory-objects/227'],
 'carrier': [],
 'quality': ['http://data.odeuropa.eu/vocabulary/vdi-hedonic/05n',
  'http://data.odeuropa.eu/vocabulary/vdi-intensity/4'],
 'quality_type': ['http://data.odeuropa.eu/attribute-type/hedonic',
  'http://data.odeuropa.eu/attribute-type/intensity'],
 'place': [],
 'place_type': [],
 'gesture': [],
 'emotion': [],
 'time': ['1874']}

In [33]:
res = smell_emb.most_similar(sm, topn=10)
['%.4f   %s   %s' % (r[1], r[0], label(r[0])) for r in res]

['0.9877   http://data.odeuropa.eu/smell/a308a5e1-3f4f-55fe-b86a-b1e08bfc8be2   oozing',
 '0.9860   http://data.odeuropa.eu/smell/3b8285eb-c851-51d2-bb21-6855c15a64f8   odour',
 '0.9860   http://data.odeuropa.eu/smell/551aeb96-ce8e-5e4a-aed1-da04514eb7c0   smell',
 '0.9856   http://data.odeuropa.eu/smell/99f894fe-f7c0-5c6f-9e52-6bcde6b933a3   Breath',
 '0.9853   http://data.odeuropa.eu/smell/f9314647-ae7b-5328-898a-65b852e4f668   perfume',
 '0.9852   http://data.odeuropa.eu/smell/4c7c0457-3407-5d78-9a89-005b47a5e3fa   scent',
 '0.9845   http://data.odeuropa.eu/smell/fce16d7d-12ed-50ac-96ea-b35dc50b9819   smelling',
 '0.9843   http://data.odeuropa.eu/smell/f1bc745e-639d-548e-89a4-22074894042d   Smell',
 '0.9841   http://data.odeuropa.eu/smell/771a6740-e5ca-5738-af47-b7a01ae685e1   smelling',
 '0.9841   http://data.odeuropa.eu/smell/e2106ba6-8eba-5b12-86e4-e330dc3db8db   smell']

In [43]:
get(res[0][0], with_labels=True)

{'source': ['Blood'],
 'carrier': [],
 'quality': ['Light'],
 'quality_type': ['hedonic'],
 'place': [],
 'place_type': [],
 'gesture': [],
 'emotion': [],
 'time': ['1905']}
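
To see what two neighbouring smells have in common, the dictionaries returned by `get` can be compared field by field. A minimal sketch, using the two labelled results shown above as literal input (the helper `shared_properties` is a hypothetical name, not part of the notebook):

```python
def shared_properties(a, b):
    """Return, per field, the values present in both property dicts."""
    common = {}
    for field in a:
        overlap = sorted(set(a[field]) & set(b.get(field, [])))
        if overlap:
            common[field] = overlap
    return common


# Output of get(sm, with_labels=True)
query = {
    'source': ['Tobacco'], 'carrier': [],
    'quality': ['Light', 'strong'],
    'quality_type': ['hedonic', 'intensity'],
    'place': [], 'place_type': [], 'gesture': [], 'emotion': [],
    'time': ['1874'],
}
# Output of get(res[0][0], with_labels=True)
neighbour = {
    'source': ['Blood'], 'carrier': [],
    'quality': ['Light'],
    'quality_type': ['hedonic'],
    'place': [], 'place_type': [], 'gesture': [], 'emotion': [],
    'time': ['1905'],
}
common = shared_properties(query, neighbour)
print(common)  # {'quality': ['Light'], 'quality_type': ['hedonic']}
```

For this pair, the overlap is limited to one quality and one quality type; source and time differ.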

In [10]:
res[2][0]

'http://data.odeuropa.eu/smell/069550d6-f362-5a3d-b2c4-cb4b2e4b99ec'