# Semantics

## WordNet

NLTK provides a useful WordNet interface to play with the WordNet data (included into the `nltk.corpus`). Let's see how to use it

First we import the corpus

In [1]:
from nltk.corpus import wordnet as wn

We can use the `wn` object to get the synsets of a given word.

For instance, thise are the synsets related to the word `dog`

In [2]:
wn.synsets('dog')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

Much more than expected. In order to make sense of each sense, we can plot their definition

In [5]:
for synset in wn.synsets('dog'):
    print synset
    print synset.definition()
    print " "

Synset('dog.n.01')
a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
 
Synset('frump.n.01')
a dull unattractive unpleasant girl or woman
 
Synset('dog.n.03')
informal term for a man
 
Synset('cad.n.01')
someone who is morally reprehensible
 
Synset('frank.n.02')
a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
 
Synset('pawl.n.01')
a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
 
Synset('andiron.n.01')
metal supports for logs in a fireplace
 
Synset('chase.v.01')
go after with the intent to catch
 


Or examples for each synset

In [6]:
for synset in wn.synsets('dog'):
    print synset
    print synset.examples()
    print " "

Synset('dog.n.01')
[u'the dog barked all night']
 
Synset('frump.n.01')
[u'she got a reputation as a frump', u"she's a real dog"]
 
Synset('dog.n.03')
[u'you lucky dog']
 
Synset('cad.n.01')
[u'you dirty dog']
 
Synset('frank.n.02')
[]
 
Synset('pawl.n.01')
[]
 
Synset('andiron.n.01')
[u'the andirons were too hot to touch']
 
Synset('chase.v.01')
[u'The policeman chased the mugger down the alley', u'the dog chased the rabbit']
 


Or lemmas related to the synsets

In [8]:
for synset in wn.synsets('dog'):
    print synset
    print synset.lemma_names()
    print " "

Synset('dog.n.01')
[u'dog', u'domestic_dog', u'Canis_familiaris']
 
Synset('frump.n.01')
[u'frump', u'dog']
 
Synset('dog.n.03')
[u'dog']
 
Synset('cad.n.01')
[u'cad', u'bounder', u'blackguard', u'dog', u'hound', u'heel']
 
Synset('frank.n.02')
[u'frank', u'frankfurter', u'hotdog', u'hot_dog', u'dog', u'wiener', u'wienerwurst', u'weenie']
 
Synset('pawl.n.01')
[u'pawl', u'detent', u'click', u'dog']
 
Synset('andiron.n.01')
[u'andiron', u'firedog', u'dog', u'dog-iron']
 
Synset('chase.v.01')
[u'chase', u'chase_after', u'trail', u'tail', u'tag', u'give_chase', u'dog', u'go_after', u'track']
 


A cool feature of the NLTK WordNet corpus is that it gives access to the **Open Multilingual WordNet**.

It is useful to, for instance, get the lemmas in another languages a given synset, through the function `lemma_names`

In [10]:
for synset in wn.synsets('dog'):
    print synset
    print synset.lemma_names(lang="spa")
    print " "

Synset('dog.n.01')
[u'can', u'perro']
 
Synset('frump.n.01')
[]
 
Synset('dog.n.03')
[]
 
Synset('cad.n.01')
[]
 
Synset('frank.n.02')
[u'frankfurt']
 
Synset('pawl.n.01')
[]
 
Synset('andiron.n.01')
[]
 
Synset('chase.v.01')
[u'rastrear']
 


Let's focus on the first dog sysnet.

We can access to its relationships (`hypernyms`, `hyponyms`, `holonyms`, ...)

In [12]:
dog = wn.synset('dog.n.01')
print dog.hypernyms()
print dog.hyponyms()
print dog.member_holonyms()

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
[Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')]
[Synset('canis.n.01'), Synset('pack.n.06')]


Note that some relations are defined by WordNet only over Lemmas

In [13]:
good = wn.synset('good.a.01')
good.lemmas()[0].antonyms()

[Lemma('bad.a.01.bad')]

NLTK also has implemented the **path-based similarity** function that we explained in class by means of the function `path_similarity`. It returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hypnoym) taxonomy. A score of 1 represents identity i.e. comparing a sense with itself will return 1.



In [15]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')

In [16]:
dog.path_similarity(cat)

0.2

In [17]:
hit.path_similarity(slap)

0.14285714285714285

It has also the IC-based similarity. To that end you have to load  an information content file from the `wordnet_ic` corpus and then use this information with the `res_similarity` function to compute the IC-based similarity.

In [18]:
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
dog.res_similarity(cat, brown_ic)

7.911666509036577

If you prefer it, you can also train you own IC dictionary from any corpus. This is very useful if you want to compute the similary between words based on some particular data that you have for a given task.

In [19]:
from nltk.corpus import genesis
genesis_ic = wn.ic(genesis, False, 0.0)
dog.res_similarity(cat, genesis_ic)

7.204023991374843

## PMI

In addition to the thesaurus-based metrics, we can also create similarity functions based on Distributional algorithms; that is, words that appear in similar contexts are expected to be similar.

In particular, in class we presented the Point-wise Mutual Information as a measure the set the similarity of two words based on their contexts. 



Find words that appear in the same context is actually quite easy by using NLTK's `Text.similar()` function. This function takes a word w, finds all contexts w1w w2, then finds all words w' that appear in the same context, i.e. w1w' w2. (You can find the implementation online at http://nltk.org/nltk/text.py)


In [17]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')

man time day year car moment world family house country child boy
state job way war girl place word work


We can also use the `PMI` to compute similarities between words.

To that end, I have defined a function that takes two words, a dictionary with the frequency of the words `unigram_freq` and another dictionary `bigram_freq` with the count of each pair of words in the corpus. 

With these two dicts we can compute the joint probability of each pair of words (calculated as the fraction of the number of times they appear together and the total frequency of pairs of words) and, finally, compute the PMI as the fraction of the joint probability and the product of the marginal probabilites of each word.

In [80]:
def pmi(word1, word2, unigram_freq, bigram_freq):
    import math
    marginal_word1 = float(unigram_freq[word1]) / sum(unigram_freq.values())
    marginal_word2 = float(unigram_freq[word2]) / sum(unigram_freq.values())
    joint_w1_w2 = float(bigram_freq[(word1, word2)])/sum(bigram_freq.values())
    pmi = round(math.log(max(0.0005,joint_w1_w2/(marginal_word1*marginal_word2)),2),2)
    return pmi

NLTK has a package `collocations` that makes quite easy to compute the count of each pair of words

In [28]:
bigrams = nltk.collocations.BigramCollocationFinder.from_words(nltk.corpus.brown.words(categories='news'), window_size = 20)
finder.apply_freq_filter(20)
bigrams_freq = bigrams.ngram_fd

We use the `FreqDist` function (which we already knew) to compute the individual frequencies of the words

In [33]:
unigrams = nltk.FreqDist( nltk.corpus.brown.words(categories="news"))
unigrams_freq = {unigram:freq for unigram, freq in unigrams.items() if freq >= 20}

We can now use the defined `pmi` function to compute the PMI similarity of two words.
Let us see some examples.

In [64]:
pmi(u"day", u"night", unigrams_freq, bigrams_freq)

0.14

In [82]:
pmi(u"per", u"cent", unigrams_freq, bigrams_freq)

5.4

In [81]:
pmi(u"day", u"administration", unigrams_freq, bigrams_freq)

-10.97

In [74]:
pmi(u"government", u"administration", unigrams_freq, bigrams_freq)

1.67

NLTK also provides some useful classes inside the `collocations` package to automatically compute this PMI-based similarity. 

In [3]:
import nltk
from nltk.collocations import *

Collocations are expressions of multiple words which commonly co-occur. For example, the top ten bigram collocations in Brown news corpus are listed below, as measured using Pointwise Mutual Information (by using the `bigram_measures.pmi` function).

In [4]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(nltk.corpus.brown.words(categories='news'), window_size = 20)
finder.nbest(bigram_measures.pmi, 10)

[(u'$1,500,000', u'Rhine-Westphalia'),
 (u'$1,800', u'cadet'),
 (u'$1,800', u'termination'),
 (u'$1.4', u'subsidies'),
 (u'$1.5', u'$12.7'),
 (u'$10,000-per-year', u'French-born'),
 (u'$10,000-per-year', u'Holders'),
 (u'$10,000-per-year', u"d'hotel"),
 (u'$10,000-per-year', u'maitre'),
 (u'$100,000', u'kidnapping')]

While these words are highly collocated, the expressions are also very infrequent. Therefore it is useful to apply filters, such as ignoring all bigrams which occur less than 20 times in the corpus and removing the stopwords:


In [8]:
finder.apply_freq_filter(20)
ignored_words = nltk.corpus.stopwords.words('english')
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
finder.nbest(bigram_measures.pmi, 25)

[(u'per', u'cent'),
 (u'United', u'States'),
 (u'Mantle', u'Maris'),
 (u'New', u'York'),
 (u'White', u'House'),
 (u'sales', u'tax'),
 (u'home', u'runs'),
 (u'President', u'Kennedy'),
 (u'Mrs.', u'Mrs.'),
 (u'Mrs.', u'Robert'),
 (u'Robert', u'Mrs.'),
 (u'Mrs.', u'Jr.'),
 (u'Mrs.', u'William'),
 (u'Jr.', u'Mrs.'),
 (u'last', u'week'),
 (u'last', u'night'),
 (u'Mrs.', u'John'),
 (u'last', u'year'),
 (u'Mr.', u'Mrs.'),
 (u'John', u'Mrs.'),
 (u'Mr.', u'Mr.'),
 (u'Mrs.', u'Mr.'),
 (u'said', u'would'),
 (u'would', u'would'),
 (u'Mr.', u'said')]

## Word2vec

Using Wor2vec in Python is in fact quite straightforward thanks to the package `gensim` (https://radimrehurek.com/gensim/), which has a package focused on Word2vec where you can create your own embeddings from a dataset.

For more information on the generation of embeddings with this package, you can follow this tutorial: http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.WszQTXVuZhE

The following code creates two embeddings model, one for the brown corpus and one for the movie_reviews dataset. 

In [86]:
from gensim.models import Word2Vec
from nltk.corpus import brown, movie_reviews

In [87]:
b = Word2Vec(brown.sents())

In [88]:
mr = Word2Vec(movie_reviews.sents())

Once trained the models, we can compute similarities between words. Try different words and check the differences between the models.

In [119]:
b.wv.most_similar('man', topn=5)

[(u'woman', 0.8780252933502197),
 (u'girl', 0.8740298748016357),
 (u'boy', 0.8312857747077942),
 (u'young', 0.7791036367416382),
 (u'child', 0.7733349800109863)]

In [120]:
mr.wv.most_similar('man', topn=5)

[(u'woman', 0.8990576267242432),
 (u'girl', 0.8326300382614136),
 (u'boy', 0.8317219018936157),
 (u'child', 0.7928438186645508),
 (u'killer', 0.7519564032554626)]

In [127]:
b.wv.most_similar('movie', topn=5)

[(u'Queen', 0.959496259689331),
 (u'Faith', 0.9590216875076294),
 (u'seasonal', 0.9546026587486267),
 (u"town's", 0.9513021111488342),
 (u'boom', 0.9510686993598938)]

In [128]:
mr.wv.most_similar('movie', topn=5)

[(u'film', 0.9492945075035095),
 (u'picture', 0.8660645484924316),
 (u'sequel', 0.785641074180603),
 (u'case', 0.7478125095367432),
 (u'thing', 0.6979601383209229)]