# Semantics

# WordNet

NLTK provides a convenient interface to the WordNet data (included in `nltk.corpus`). Let's see how to use it.

First, we import the corpus:

In [None]:
from nltk.corpus import wordnet as wn

We can use the `wn` object to get the synsets of a given word.

For instance, these are the synsets related to the word `dog`:

In [None]:
wn.synsets('dog')

Some words have synsets for more than one part of speech (POS)

In [None]:
wn.synsets('fish')

We can filter by POS tag

In [None]:
wn.synsets('fish', pos=wn.VERB)

There are more senses than one might expect. To tell them apart, we can print the definition of each synset

In [None]:
for synset in wn.synsets('fish'):
    print(synset)
    print(synset.definition())
    print(" ")

Or examples for each synset

In [None]:
for synset in wn.synsets('dog'):
    print(synset)
    print(synset.examples())
    print(" ")

Or lemmas related to the synsets

In [None]:
for synset in wn.synsets('dog'):
    print(synset)
    print(synset.lemma_names())
    print(" ")

A cool feature of the NLTK WordNet corpus is that it gives access to the **Open Multilingual WordNet**.

It is useful, for instance, to get the lemmas of a given synset in other languages through the function `lemma_names`

In [None]:
import nltk
nltk.download('omw')  # on recent NLTK releases the resource is named 'omw-1.4'

In [None]:
for synset in wn.synsets('dog'):
    print(synset)
    print(synset.lemma_names(lang="ita"))
    print(" ")

Let's focus on the first dog synset.

We can access its relationships (`hypernyms`, `hyponyms`, `holonyms`, ...)

In [None]:
dog = wn.synset('dog.n.01')
print(dog.hypernyms())
print("---------------")
print(dog.hyponyms())
print("---------------")
print(dog.member_holonyms())

Note that some relations, such as antonymy, are defined by WordNet only over lemmas

In [None]:
good = wn.synset('good.a.01')
good.lemmas()[0].antonyms()

NLTK also implements the **path-based similarity** that we explained in class, through the function `path_similarity`. It returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. A score of 1 represents identity, i.e. comparing a sense with itself returns 1.



In [None]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')

In [None]:
dog.path_similarity(cat)

In [None]:
hit.path_similarity(slap)
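
To make the score more concrete, here is a minimal sketch of the computation on a hand-made toy taxonomy. The `HYPERNYM` dictionary and the helper names are hypothetical and not part of NLTK. The sketch assumes the usual definition, where the similarity is 1 / (1 + length of the shortest is-a path).

```python
# Hand-made toy taxonomy (child -> parent); this is NOT the real WordNet graph.
HYPERNYM = {
    "dog": "canine",
    "cat": "feline",
    "canine": "carnivore",
    "feline": "carnivore",
    "carnivore": "animal",
}

def ancestors_with_distance(node):
    """Map the node and each of its ancestors to the number of is-a edges from the node."""
    distances = {}
    steps = 0
    while node is not None:
        distances[node] = steps
        node = HYPERNYM.get(node)
        steps += 1
    return distances

def toy_path_similarity(a, b):
    """Return 1 / (1 + shortest path length), or None if the senses are not connected."""
    dist_a = ancestors_with_distance(a)
    dist_b = ancestors_with_distance(b)
    common = set(dist_a) & set(dist_b)
    if not common:
        return None
    shortest = min(dist_a[node] + dist_b[node] for node in common)
    return 1.0 / (1 + shortest)

print(toy_path_similarity("dog", "dog"))  # 1.0 (path of length 0)
print(toy_path_similarity("dog", "cat"))  # 0.2 (dog-canine-carnivore-feline-cat, 4 edges)
```

The longer the path between two senses, the closer the score gets to 0.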

NLTK also implements [IC-based](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.2199) similarity, where IC stands for information content. To use it, load an information content file from the `wordnet_ic` corpus and pass it to the `res_similarity` function (Resnik similarity).

In [None]:
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
dog.res_similarity(cat, brown_ic)
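
As a rough illustration of what `res_similarity` measures: the information content of a concept is the negative log of its probability, and the Resnik score of two senses is the IC of their most informative common ancestor. The sketch below uses a toy taxonomy with made-up counts; all names and numbers are assumptions for illustration only.

```python
import math

# Toy is-a taxonomy (child -> parent) and made-up concept counts.
# A concept's count includes the counts of everything below it.
HYPERNYM = {
    "dog": "canine",
    "cat": "feline",
    "canine": "carnivore",
    "feline": "carnivore",
    "carnivore": "animal",
}
COUNTS = {"animal": 100, "carnivore": 40, "canine": 10,
          "feline": 10, "dog": 8, "cat": 6}

def information_content(concept):
    """IC(c) = -log p(c), with p(c) estimated from the toy counts."""
    return -math.log(COUNTS[concept] / COUNTS["animal"])

def ancestors(concept):
    """Return the concept together with all of its hypernyms."""
    found = set()
    while concept is not None:
        found.add(concept)
        concept = HYPERNYM.get(concept)
    return found

def toy_res_similarity(a, b):
    """Resnik similarity: the highest IC among the shared ancestors."""
    return max(information_content(c) for c in ancestors(a) & ancestors(b))

# The most informative shared ancestor of "dog" and "cat" is "carnivore",
# so the score is IC("carnivore") = -log(0.4).
print(toy_res_similarity("dog", "cat"))
```

Because the probabilities come from corpus counts, the same pair of senses can get different scores with different IC files.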

If you prefer, you can also build your own IC dictionary from any corpus. This is very useful if you want to compute the similarity between words based on data specific to your task.

In [None]:
from nltk.corpus import genesis
genesis_ic = wn.ic(genesis, False, 0.0)
dog.res_similarity(cat, genesis_ic)

# PMI

In addition to the thesaurus-based metrics, we can also create similarity functions based on distributional algorithms; that is, words that appear in similar contexts are expected to be similar.

In particular, in class we presented Pointwise Mutual Information (PMI) as a measure to assess the similarity of two words based on their contexts.



Finding words that appear in the same contexts is actually quite easy with NLTK's `Text.similar()` function. This function takes a word w, finds all contexts w1 w w2, then finds all words w' that appear in the same context, i.e. w1 w' w2. (You can find the implementation online at http://nltk.org/nltk/text.py)


In [None]:
import nltk 

text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('man')
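
The idea behind `Text.similar()` can be sketched in a few lines of plain Python. This is a simplified illustration with a hypothetical helper (`similar_words`), not NLTK's implementation; in particular, it ranks candidates simply by the number of shared contexts, which may differ from NLTK's own ranking.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target):
    """Return the words that share at least one (left, right) context with target."""
    # Collect the set of (previous word, next word) contexts of every word.
    contexts = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))

    target_contexts = contexts.get(target, set())
    shared = Counter()
    for word, word_contexts in contexts.items():
        overlap = len(word_contexts & target_contexts)
        if word != target and overlap > 0:
            shared[word] = overlap
    # Words with the most shared contexts come first.
    return [word for word, _ in shared.most_common()]

tokens = "the man ran . the woman ran . the dog barked .".split()
print(similar_words(tokens, "man"))  # ['woman'], both occur in the context "the _ ran"
```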

We can also use PMI to compute similarities between words.

To that end, I have defined a function that takes two words, a dictionary `unigram_freq` with the frequency of each word, and another dictionary `bigram_freq` with the count of each pair of words in the corpus.

With these two dicts we can compute the joint probability of each pair of words (the number of times they appear together divided by the total count of word pairs) and, finally, compute the PMI as the base-2 logarithm of the ratio between the joint probability and the product of the marginal probabilities of each word. The ratio is floored at 0.0005 so that pairs that never co-occur do not lead to log(0).

In [None]:
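import math

def pmi(word1, word2, unigram_freq, bigram_freq):
    # Marginal probabilities: word count over the total unigram count
    total_unigrams = sum(unigram_freq.values())
    marginal_word1 = float(unigram_freq[word1]) / total_unigrams
    marginal_word2 = float(unigram_freq[word2]) / total_unigrams
    # Joint probability: pair count over the total pair count
    joint_w1_w2 = float(bigram_freq[(word1, word2)]) / sum(bigram_freq.values())
    # Floor the ratio at 0.0005 so that unseen pairs do not produce log(0)
    ratio = max(0.0005, joint_w1_w2 / (marginal_word1 * marginal_word2))
    return round(math.log(ratio, 2), 2)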
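
A small worked example with made-up counts may help to check the arithmetic. The helper below (`toy_pmi`) is a compact, hypothetical restatement of the same computation, with the same formula and the same 0.0005 floor; the counts are invented for illustration.

```python
import math

def toy_pmi(word1, word2, unigram_freq, bigram_freq):
    """log2 of p(w1, w2) / (p(w1) * p(w2)), with the ratio floored at 0.0005."""
    total_unigrams = sum(unigram_freq.values())
    total_bigrams = sum(bigram_freq.values())
    p1 = unigram_freq[word1] / total_unigrams
    p2 = unigram_freq[word2] / total_unigrams
    joint = bigram_freq.get((word1, word2), 0) / total_bigrams
    return round(math.log2(max(0.0005, joint / (p1 * p2))), 2)

# Invented counts: 16 word occurrences and 4 pair occurrences in total.
unigram_counts = {"new": 4, "york": 2, "the": 10}
bigram_counts = {("new", "york"): 2, ("the", "new"): 2}

# p(new) = 4/16, p(york) = 2/16, p(new, york) = 2/4
# ratio = 0.5 / (0.25 * 0.125) = 16, and log2(16) = 4
print(toy_pmi("new", "york", unigram_counts, bigram_counts))  # 4.0

# A pair that never co-occurs falls back to the floor and gets a negative score.
print(toy_pmi("york", "the", unigram_counts, bigram_counts))
```

A positive PMI means the pair co-occurs more often than it would if the two words were independent; a negative one means it co-occurs less often.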

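NLTK has a `collocations` package that makes it quite easy to count each pair of words. Note that with `window_size = 20` a pair is counted whenever the two words occur within the same 20-word window, not only when they are adjacent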

In [None]:
bigrams = nltk.collocations.BigramCollocationFinder.from_words(nltk.corpus.brown.words(categories='news'), window_size = 20)
bigrams.apply_freq_filter(20)
bigrams_freq = bigrams.ngram_fd
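
The effect of `window_size` can be illustrated with a small pure-Python counter. This is only a sketch under the assumption that each word is paired with the words that follow it inside the window; `window_pair_counts` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter

def window_pair_counts(tokens, window_size=2):
    """Count (w1, w2) pairs where w2 follows w1 within the same window."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        # With window_size=2 only the next word is paired with w1.
        for w2 in tokens[i + 1:i + window_size]:
            counts[(w1, w2)] += 1
    return counts

tokens = ["a", "b", "c"]
print(window_pair_counts(tokens, window_size=2))  # adjacent pairs only
print(window_pair_counts(tokens, window_size=3))  # also counts the pair ("a", "c")
```

A wider window captures looser associations between words, at the cost of counting many more pairs.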

We use the `FreqDist` class (which we have already seen) to compute the individual frequencies of the words

In [None]:
unigrams = nltk.FreqDist( nltk.corpus.brown.words(categories="news"))
unigrams_freq = {unigram:freq for unigram, freq in unigrams.items() if freq >= 20}

We can now use the defined `pmi` function to compute the PMI similarity of two words.
Let us see some examples.

In [None]:
pmi(u"day", u"night", unigrams_freq, bigrams_freq)

In [None]:
pmi(u"per", u"cent", unigrams_freq, bigrams_freq)

In [None]:
pmi(u"day", u"administration", unigrams_freq, bigrams_freq)

In [None]:
pmi(u"government", u"administration", unigrams_freq, bigrams_freq)

NLTK also provides some useful classes inside the `collocations` package to automatically compute this PMI-based similarity. 

In [None]:
import nltk
from nltk.collocations import *

Collocations are multi-word expressions whose words commonly co-occur. For example, the top ten bigram collocations in the news category of the Brown corpus, as measured by Pointwise Mutual Information (using the `bigram_measures.pmi` function), are listed below.

In [None]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(nltk.corpus.brown.words(categories='news'), window_size = 20)
finder.nbest(bigram_measures.pmi, 10)

While these words are highly collocated, the expressions are also very infrequent. It is therefore useful to apply filters, such as ignoring all bigrams that occur fewer than 20 times in the corpus and removing stopwords and very short words:


In [None]:
finder.apply_freq_filter(20)
ignored_words = nltk.corpus.stopwords.words('english')
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
finder.nbest(bigram_measures.pmi, 25)

# Word2vec

Using Word2vec in Python is quite straightforward thanks to the `gensim` package (https://radimrehurek.com/gensim/), which includes a module dedicated to Word2vec that lets you train your own embeddings from a dataset.

For more information on the generation of embeddings with this package, you can follow this tutorial: http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.WszQTXVuZhE

## Training word2vec

The following code creates two embedding models, one for the `brown` corpus and one for the `movie_reviews` dataset.

In [None]:
from gensim.models import Word2Vec
from nltk.corpus import brown, movie_reviews

In [None]:
# gensim >= 4.0 uses `vector_size`; on older releases pass `size=100` instead
b = Word2Vec(brown.sents(), hs=1, negative=0, vector_size=100, window=5, min_count=5, workers=4)

In [None]:
mr = Word2Vec(movie_reviews.sents())

Once the models are trained, we can compute similarities between words. Try different words and check the differences between the models.

In [None]:
b.wv.most_similar('man', topn=5)

In [None]:
mr.wv.most_similar('man', topn=5)

In [None]:
b.wv.most_similar('movie', topn=5)

In [None]:
mr.wv.most_similar('movie', topn=5)

In [None]:
print(mr.wv.similarity('man', 'woman'))
print(b.wv.similarity('man', 'woman'))

In [None]:
print(mr.wv.similarity('man', 'car'))
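
The `similarity` method returns the cosine similarity between the two word vectors. A minimal sketch with made-up 3-dimensional vectors (the trained embeddings have many more dimensions, and the numbers below are invented):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors, for illustration only.
man = [1.0, 0.0, 0.0]
woman = [1.0, 1.0, 0.0]
car = [0.0, 0.0, 1.0]

print(cosine_similarity(man, man))    # 1.0, identical direction
print(cosine_similarity(man, car))    # 0.0, orthogonal vectors
print(cosine_similarity(man, woman))  # between 0 and 1
```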

In [None]:
b.wv.doesnt_match("automobile car dinner".split())

In [None]:
print(b.wv.most_similar(positive=['father','doctor'], negative=['mother']))

# Pre-trained Word2Vec

First of all, you need to download the pre-trained model from [https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). This model was trained on billions of words and has a vocabulary of 3 million words. The file is around 1.6GB and is provided as a compressed ".gz" file (you need to decompress it first).

In [None]:
import gensim.models.keyedvectors as word2vec

Load the model. This may take a minute or two.

In [None]:
# path to the downloaded (decompressed) file
# adapt it to your system; the following would be my path
#path = r"C:\Users\Idafen Santana Perez\Downloads\GoogleNews-vectors-negative300.bin\GoogleNews-vectors-negative300.bin"

# raw string, so that the backslashes are not treated as escape characters
path = r"YOUR_SYSTEM_PATH\GoogleNews-vectors-negative300.bin"

#loading the downloaded model
model = word2vec.KeyedVectors.load_word2vec_format(path, binary=True) 

Get the vector of the word 'cat'

In [None]:
cat = model['cat']
print(cat[:20])

Let's try the king-man+woman operation. Both 'king' and 'woman' are positive terms, while 'man' is a negative one.

This analogy can be read as "Man is to King what Woman is to ______"

In [None]:
print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1))
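
To see why this works, here is a minimal sketch of the vector arithmetic on made-up 2-dimensional vectors. The `VECTORS` table and `toy_analogy` are hypothetical; the sketch also skips details of the real method (as far as I know, gensim normalizes the vectors before combining them), so treat it as an illustration of the idea only.

```python
import math

# Invented 2-D vectors: the first axis stands for "royalty", the second for "gender".
VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    target = [0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, VECTORS[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, VECTORS[word])]
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in VECTORS if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))

# woman + king - man = [1.0, -1.0], which points in the same direction as "queen"
print(toy_analogy(positive=["woman", "king"], negative=["man"]))  # queen
```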

Try now doctor-father+mother

In [None]:
print(model.most_similar(positive=['mother', 'doctor'], negative=['father'], topn=1))

And now a bit of geography

In [None]:
print(model.most_similar(positive=['Spain', 'Paris'], negative=['France'], topn=1))

In [None]:
print(model.most_similar(positive=['Madrid', 'Tenerife'], negative=['Gran_Canaria'], topn=1))

Test similarity metrics

In [None]:
print(model.similarity('woman', 'man'))
print(model.similarity('car', 'man'))
print(model.similarity('fridge', 'man'))
print(model.similarity('fridge', 'woman'))

We can also check how typos relate

In [None]:
print(model.most_similar(positive=['because','teh'],negative=['the']))

And how other words, such as group nouns for animals or names, are somehow analogous

In [None]:
print(model.most_similar(positive=['fish','flock'],negative=['birds']))

In [None]:
print(model.most_similar(positive=['Jennifer','he'],negative=['she']))