## Loading pre-trained sense vectors 

To test with word sense embeddings, you can use a pretrained model (sense vectors and sense probabilities). These sense vectors were induced from Wikipedia using word2vec similarities between words in ego-networks. Sense probabilities are stored in a separate file, which is located next to the file with sense vectors.

In [2]:
import sensegram
sense_vectors_fpath = "model/dewiki.txt.clusters.minsize5-1000-sum-score-20.sense_vectors"
sv = sensegram.SenseGram.load_word2vec_format(sense_vectors_fpath, binary=False)

## Getting the list of senses of a word 

Sense probabilities are loaded automatically if the file is placed in the same folder as the sense vectors and named according to the same scheme as our pretrained files.

To examine how many senses were learned for a word, call the `get_senses` function:

In [5]:
word = "Hund"
sv.get_senses(word)

[('Hund#1', 0.976526), ('Hund#2', 0.023474)]

## Sense aware nearest neighbors

The function returns a list of sense names with a probability for each sense. As one can see, our model has learned two senses for the word "Hund".
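The returned list can be processed with plain Python. The following is a minimal sketch that works on the output shown above, copied into the snippet so that it runs without the model files. `split_sense_id` is a hypothetical helper introduced here for illustration and is not part of sensegram; it assumes that sense names always follow the `word#index` pattern seen in the outputs.

```python
# Output of sv.get_senses("Hund") as shown above, copied here so the
# snippet runs without the model files.
senses = [("Hund#1", 0.976526), ("Hund#2", 0.023474)]


def split_sense_id(sense_id):
    """Split 'Hund#1' into ('Hund', 1).

    Hypothetical helper, not part of sensegram. Splitting on the last '#'
    keeps words that contain underscores or other characters intact.
    """
    word, _, index = sense_id.rpartition("#")
    return word, int(index)


# Pick the sense with the highest prior probability.
top_sense, top_prob = max(senses, key=lambda pair: pair[1])

print(len(senses))                # number of senses learned for the word
print(split_sense_id(top_sense))  # word and sense index of the most probable sense
```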

To understand which word sense a sense vector represents, use the `most_similar` function:


In [6]:
word = "Hund"
for sense_id, prob in sv.get_senses(word):
    print(sense_id)
    print("="*20)
    for rsense_id, sim in sv.wv.most_similar(sense_id):
        print("{} {:f}".format(rsense_id, sim))
    print("\n")

Hund#1
Papagei#1 0.948971
Pudel#1 0.948960
Mops#1 0.943027
Dackel#1 0.938603
Teddybär#1 0.934012
Zwilling#1 0.930496
Kater#1 0.929611
Kanarienvogel#1 0.927734
Bernhardiner#1 0.926941
Kauz#1 0.922378


Hund#2
guter_Kamerad#1 0.942419
bunter_Abend#1 0.939721
Petschorin#1 0.938157
hinzukommender#1 0.937122
geplantes_Comeback#1 0.936010
schicksalhafter#1 0.935064
Grosse_Erweckung#3 0.934708
armer_Hund#1 0.934650
Alois_Windisch#2 0.934103
eingezogener_spitzbogiger#1 0.933642




## Word sense disambiguation: loading word embeddings

To use our word sense disambiguation (WSD) mechanism, you also need word vectors or context vectors, depending on the disambiguation strategy. These vectors are located in the ``model`` directory; the word vector files used in this notebook have the extension ``.word_vectors``.

Our WSD mechanism is based on word similarities (`sim`) and requires word vectors to represent context words. In the following, we provide a disambiguation example using the similarity strategy.
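Before turning to the library itself, here is a rough illustration of what a similarity-based strategy can look like. The scoring formula is not spelled out in this notebook, so the sketch assumes that each sense is scored by the cosine similarity between its sense vector and the mean of the context word vectors. The function names and the two-dimensional toy vectors are invented for illustration and are not taken from sensegram.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def pick_sense(sense_vecs, context_vecs):
    """Return (best_sense_id, scores) under the assumed 'sim' scoring."""
    context = mean_vector(context_vecs)
    scores = {sid: cosine(vec, context) for sid, vec in sense_vecs.items()}
    return max(scores, key=scores.get), scores


# Toy 2-d vectors, invented for illustration only.
toy_senses = {"Hund#1": [1.0, 0.0], "Hund#2": [0.0, 1.0]}
toy_context = [[0.9, 0.1], [0.7, 0.3]]  # stand-ins for two context word vectors

best, scores = pick_sense(toy_senses, toy_context)
print(best, scores)
```

In this toy setup the context vectors point mostly along the first axis, so the first sense receives the higher score.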

First, load the word vectors using the gensim library:


In [7]:
from gensim.models import KeyedVectors
word_vectors_fpath = "model/dewiki.txt.word_vectors"
wv = KeyedVectors.load_word2vec_format(word_vectors_fpath, binary=False, unicode_errors="ignore")

Then initialise the WSD object with sense and word vectors:

In [9]:
from wsd import WSD
wsd_model = WSD(sv, wv, window=5, method='sim', filter_ctx=3)

Disambiguation method: sim
Filter context: f = 3


The settings have the following meaning: the model extracts at most `window`*2 words around the target word from the sentence as context (here, up to 10 words) and uses only the three most discriminative context words (`filter_ctx=3`) for disambiguation.
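The window step can be sketched in a few lines. This is a simplified illustration that assumes whitespace tokenisation and a known token position for the target; `context_window` is a hypothetical helper, not the library's implementation. It does not cover the `filter_ctx` step, because this notebook does not define how discriminative context words are selected.

```python
def context_window(tokens, target_index, window=5):
    """Return up to `window` tokens on each side of tokens[target_index].

    The target token itself is excluded, so at most 2 * window tokens
    are returned.
    """
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right


tokens = "Damit ein Hund das Kätzchen aber auch als Rudelmitglied sieht".split()
ctx = context_window(tokens, tokens.index("Hund"), window=5)
print(ctx)
```

Near the start or end of a sentence the window is simply truncated, which is why the description says "at most" `window`*2 words.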

Now you can disambiguate the word "Hund" in the German passage below using the `dis_text` function. As input, it takes a sentence with space-separated tokens, a target word, and the start/end indices of the target word in the given sentence.
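The start/end indices can be computed instead of counted by hand. The sketch below assumes that they are character offsets into the sentence with an exclusive end, as in Python slicing; this notebook does not state the convention, so verify it against the library before relying on it. `find_target_span` is a hypothetical helper introduced here.

```python
def find_target_span(sentence, target):
    """Return (start, end) character offsets of the first occurrence of target.

    Hypothetical helper; `end` is exclusive, as in Python slicing.
    """
    start = sentence.find(target)
    if start == -1:
        raise ValueError("target {!r} not found in sentence".format(target))
    return start, start + len(target)


sentence = "Damit ein Hund das Kätzchen aber auch als Rudelmitglied sieht"
start, end = find_target_span(sentence, "Hund")
print(start, end, sentence[start:end])
```

Under this assumption, offsets 0 to 4 in the passage used in the next cell cover the leading article "Die", while the first "Hund" begins at offset 32, so it is worth checking which span you intend to pass.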


In [10]:
word = "Hund"
context = "Die beste Voraussetzung für die Hund-Katze-Freundschaft ist, dass keiner von beiden in der Vergangenheit unangenehme Erlebnisse mit der anderen Gattung hatte. Am einfachsten ist die ungleiche WG, wenn sich zwei Jungtiere ein Zuhause teilen. Bei erwachsenen Tieren ist es einfacher, wenn sich Miezi in Bellos Haushalt einnistet – nicht umgekehrt, da Hunde Rudeltiere sind. Damit ein Hund das Kätzchen aber auch als Rudelmitglied sieht und nicht als Futter sollten ein paar Regeln beachtet werden"
wsd_model.dis_text(context, word, 0, 4)

('Hund#1', [0.11995293847479865, -0.10629601676025206])
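The result is a pair of the chosen sense and a list of scores. The sketch below pairs the scores with the sense names, using values copied from the outputs above so that it runs without the model. It assumes that the scores are listed in the same order as the senses returned by `get_senses`; the outputs in this notebook are consistent with that, but it is not documented here.

```python
# Values copied from the outputs above so the snippet runs without the model.
senses = [("Hund#1", 0.976526), ("Hund#2", 0.023474)]
result = ("Hund#1", [0.11995293847479865, -0.10629601676025206])

chosen, scores = result

# Assumption: scores[i] belongs to senses[i].
scored = {sense_id: score for (sense_id, _), score in zip(senses, scores)}
best_by_score = max(scored, key=scored.get)

print(scored)
print(best_by_score == chosen)
```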

# SDEWaC corpus

In [44]:
import sensegram
from wsd import WSD
from gensim.models import KeyedVectors


# Input data and paths
sense_vectors_fpath = "model/sdewac-v3.corpus.clusters.minsize5-1000-sum-score-20.sense_vectors"
word_vectors_fpath = "model/sdewac-v3.corpus.word_vectors"
context_words_max = 3 # try changing this parameter to 1, 2, 5, 10, 15 or 20; it may improve the results
context_window_size = 5 # this parameter can also be changed during experiments
word = "Maus"
context = "Die Maus ist ein Eingabegerät (Befehlsgeber) bei Computern. Der allererste Prototyp wurde 1963 nach Zeichnungen von Douglas C. Engelbart gebaut; seit Mitte der 1980er Jahre bildet die Maus für fast alle Computertätigkeiten zusammen mit dem Monitor und der Tastatur eine der wichtigsten Mensch-Maschine-Schnittstellen. Die Entwicklung grafischer Benutzeroberflächen hat die Computermaus zu einem heute praktisch an jedem Desktop-PC verfügbaren Standardeingabegerät gemacht."
ignore_case = True

# Load models (takes a long time)
sv = sensegram.SenseGram.load_word2vec_format(sense_vectors_fpath, binary=False)
wv = KeyedVectors.load_word2vec_format(word_vectors_fpath, binary=False, unicode_errors="ignore")

# Play with the model (this is quick)
print("Probabilities of the senses:\n{}\n\n".format(sv.get_senses(word, ignore_case=ignore_case)))

for sense_id, prob in sv.get_senses(word, ignore_case=ignore_case):
    print(sense_id)
    print("="*20)
    for rsense_id, sim in sv.wv.most_similar(sense_id):
        print("{} {:f}".format(rsense_id, sim))
    print("\n")

# Disambiguate a word in a context
wsd_model = WSD(sv, wv, window=context_window_size, lang="de",
                filter_ctx=context_words_max, ignore_case=ignore_case)    
print(wsd_model.disambiguate(context, word))

Probabilities of the senses:
[('Maus#1', 0.885167), ('Maus#2', 0.114833), ('maus#1', 1.0)]


Maus#1
Wanze#1 0.941433
Mausefalle#1 0.911212
Libelle#1 0.908983
Puppe#2 0.908126
Kakerlake#1 0.907264
Ratte#1 0.904579
Spinne#1 0.904541
Garnele#1 0.904265
Biene#1 0.902711
Schnecke#1 0.902516


Maus#2
Cursortasten#1 0.972510
Pfeiltasten#1 0.964843
Richtungstasten#1 0.959181
Cursor_Tasten#1 0.955643
linken_Maustaste#1 0.945940
Tab_Taste#1 0.944255
gedrückter_Maustaste#2 0.937142
rechten_Maustaste#1 0.935822
Mausbewegung#1 0.935422
Sprungtaste#2 0.933055


maus#1
hp#2 0.970098
gedichte#1 0.954409
galerie#1 0.951155
meinem_bild#3 0.945498
fotografie#1 0.941011
alex#1 0.939576
wolf#1 0.935995
ds#1 0.934641
nachricht#3 0.934026
anfrage#1 0.933526


Disambiguation method: sim
Filter context: f = 3
Context words:
Eingabegerät	0.394
('Maus#2', [0.34084959179662255, 0.52808930616226313, 0.13438711940973358])
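With `ignore_case=True`, the sense list above contains entries for both `Maus` and `maus`. The following sketch groups the printed probabilities by surface form, using values copied from the output so that it runs without the model. It suggests that probabilities are normalised per surface form rather than across all returned senses; this is an observation about this one output, not a documented guarantee.

```python
from collections import defaultdict

# Copied from the "Probabilities of the senses" output above.
senses = [("Maus#1", 0.885167), ("Maus#2", 0.114833), ("maus#1", 1.0)]

# Sum the probabilities of all senses that share a surface form.
totals = defaultdict(float)
for sense_id, prob in senses:
    surface_form = sense_id.rpartition("#")[0]
    totals[surface_form] += prob

for form, total in totals.items():
    print("{} {:.6f}".format(form, total))
```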
