# Lab-Assignment 2: Use Word Embeddings to extend Wordnet

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

The assignment is assessed as PASS/NO-PASS. In general, we value a critical analysis of running code more than just showing that you can create or run the code. So if you have successfully carried out the instructed commands in a notebook, you are not done yet. We want you to analyse and understand what the code is doing with language and text. Be critical, think about how to explain what you observe, and write this down in the notebook after running the code. We will clarify in the assignment when we expect this from you.

You can do the assignment as a group, but make sure that you understand the coding and can carry it out yourself. You need these skills for your final assignment, which is graded. Feedback will be given at the group level.

In this assignment you:

<ol>
    <li>Use NLTK WordNet to get all the organ synsets and lemmas in the <b>body</b> sense of the word.</li>
    <li>Use Wiki2vec to get the words most similar to organ in the body meaning.</li>
    <li>Use Leipzig embeddings to get the words most similar to organ in the body meaning.</li>
    <li>Check which of these are types of organs in WordNet and which are not.</li>
    <li>Check if the ones that are not listed as types of organs are still related through WordNet-path-similarity.</li>
    <li>Evaluate these as organs to be added to WordNet.</li>
    <li>Check if the ones that are not in WordNet as types of organs could be added as such.</li>
    <li>Compare and discuss the Wiki2vec and Leipzig results.</li>
</ol>

We assume you have analysed the following notebooks:

* Lab2.1.NLTK_wordnet
* Lab2.2 Using_Wordembeddings
* Lab2.3 Creating_Wordembeddings

This notebook should show how you used NLTK WordNet and applied embeddings to your words. Insert Markdown cells to explain your steps and discuss any problems or issues. Once you have generated the output and explanation, convert this notebook to PDF using the menu "File/Export Notebook As".

Submit a PDF of your notebook on Canvas, including the output and your answers. Include the group name and the names of the students who worked on the assignment. If you fail to produce the PDF, submit the notebook after running the cells.

## 2.1 WordNet coverage over hyponyms:

Get the WordNet synset for an organ in a body. Use the **word_similarity_wordnet_path** function from the Wordnet notebook to get the synset. Answer the following questions:

* Which synset in WordNet is the organ inside a body? What is the similarity to "body"?
* Which synset in WordNet is the organ as instrument? What is the similarity to "instrument"?


In [1]:
from nltk.corpus import wordnet as wn

In [3]:
def word_similarity_wordnet_path(w1, w2):
    top_sim_score = 0    
    top_sim_synset_w1 = ""
    top_sim_synset_w2 = ""
    for s1 in wn.synsets(w1, 'n'):
        for s2 in wn.synsets(w2, 'n'):
            sim = s1.path_similarity(s2)
            # path_similarity can return None when no path exists; skip those pairs
            if sim is not None and sim > top_sim_score:
                top_sim_score = sim
                top_sim_synset_w1 = s1
                top_sim_synset_w2 = s2
    return top_sim_synset_w1, top_sim_synset_w2, top_sim_score

In [4]:
all_organ_synsets = wn.synsets('organ')

In [7]:
print('Organ synsets:')
for synset in all_organ_synsets:
    print(synset, synset.definition())

Organ synsets:
Synset('organ.n.01') a fully differentiated structural and functional unit in an animal that is specialized for some particular function
Synset('organ.n.02') a government agency or instrument devoted to the performance of some specific function
Synset('electric_organ.n.01') (music) an electronic simulation of a pipe organ
Synset('organ.n.04') a periodical that is published by a special interest group
Synset('organ.n.05') wind instrument whose sound is produced by means of pipes arranged in sets supplied with air from a bellows and controlled from a large complex musical keyboard
Synset('harmonium.n.01') a free-reed instrument in which air is forced through the reeds by bellows


In [9]:
print(word_similarity_wordnet_path('organ', 'body'))
print(word_similarity_wordnet_path('organ', 'instrument'))

(Synset('organ.n.01'), Synset('torso.n.01'), 0.3333333333333333)
(Synset('electric_organ.n.01'), Synset('musical_instrument.n.01'), 0.3333333333333333)


#### Question 1.
<i> The WordNet synset for the organ in a body is Synset('organ.n.01'), with the definition: "a fully differentiated structural and functional unit in an animal that is specialized for some particular function". Its best path similarity with the word 'body' is 0.333, reached for Synset('torso.n.01').

There are multiple synsets for organ as an instrument, namely Synset('electric_organ.n.01'), with the definition: "(music) an electronic simulation of a pipe organ", Synset('organ.n.05'), with the definition: "wind instrument whose sound is produced by means of pipes arranged in sets supplied with air from a bellows and controlled from a large complex musical keyboard", and Synset('harmonium.n.01'), with the definition: "a free-reed instrument in which air is forced through the reeds by bellows". The best similarity with "instrument" is for Synset('electric_organ.n.01'), which scores 0.333 against Synset('musical_instrument.n.01'). </i>

Get all types of body-organs from WordNet. Use the **get_hyponym_family** function from the Wordnet notebook to get the family.

* How many synsets are in the body-organ family?

In [10]:
def get_hyponym_family (parent):
    family=[]
    children = parent.hyponyms()
    if children:
        family = family + children
        for child in children:
            grand_children = get_hyponym_family(child)
            if grand_children:
                family = family + grand_children
    return family

In [13]:
body_organ = wn.synset('organ.n.01')
body_organ_family = get_hyponym_family(body_organ)
print(len(body_organ_family))

293


#### Question 2.
<i> There are 293 synsets in the body-organ family. </i>

Use the **get_lemmas_from_wordnet_family** function to get all synonyms for the family. 

* How many synonyms do you get?

In [15]:
def get_lemmas_from_wordnet_family(wnfamily, language):
    lemmas = []
    for synset in wnfamily:
        slemmas = synset.lemma_names(language)
        for slemma in slemmas:
            lemmas.append(slemma)
    return lemmas

In [17]:
print(len(get_lemmas_from_wordnet_family(body_organ_family, "eng")))

582


#### Question 3.
<i> There are 582 synonyms for the synsets in the family body-organ. </i>

## 2.2 Get related words through word embeddings


### 2.2.1 Wiki2Vec embeddings:

Download the Wiki2Vec embeddings for your target language and load the model using the Gensim package. If you cannot load the complete model, load part of it.
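Loading only part of the model could look like the sketch below. It assumes the download has been unpacked to a word2vec-format text file; the file name in the usage comment and the helper name `load_partial_vectors` are hypothetical. The `limit` argument of gensim's `KeyedVectors.load_word2vec_format` reads only the first N vectors of the file.

```python
def load_partial_vectors(path, limit=200_000):
    """Load only the first `limit` vectors from a word2vec-format text file."""
    # Imported inside the function so the cell can be defined
    # even before gensim is installed.
    from gensim.models import KeyedVectors

    # binary=False assumes a plain-text vector file (an assumption about your download).
    return KeyedVectors.load_word2vec_format(path, binary=False, limit=limit)


# Hypothetical usage; adjust the path to where you stored the file:
# wiki_model = load_partial_vectors("enwiki_20180420_100d.txt", limit=200_000)
```

A smaller `limit` loads faster and uses less memory, but words that fall outside the first N lines cannot be looked up, so check that your query words are still in the loaded vocabulary.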

In [4]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import numpy as np

Get a list of the 50 words that are most similar to ```organ in a body```. 
Experiment with the positive and negative parameter options to include body-organs and exclude music-organs.
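As a starting point for that experiment, the sketch below wraps gensim's `most_similar` call. The helper name `body_organ_neighbours` and the particular query words are assumptions for illustration, not a recommended answer; whether these exact keys exist in your model, and which combination works best, is for you to find out.

```python
def body_organ_neighbours(model, positive=("organ", "body"), negative=("music",), topn=50):
    """Return the `topn` nearest neighbours of a query steered towards the body sense."""
    # Words in `positive` pull the query vector towards their meaning,
    # words in `negative` push it away from theirs.
    return model.most_similar(
        positive=list(positive),
        negative=list(negative),
        topn=topn,
    )


# Hypothetical usage with a loaded KeyedVectors model:
# results = body_organ_neighbours(wiki_model)
# results = body_organ_neighbours(wiki_model, positive=("organ", "liver"), negative=("piano",))
```

Each result is a `(word, score)` tuple, so the list can be passed on unchanged to the filtering steps that follow.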

In [5]:
#[CODE GOES HERE]

* How many of these words are also a body-organ in WordNet?

To answer this question you need to do the following:

* Create an empty list for ```new_words``` and an empty list for ```wordnet_words```.
* Create a for loop over all the results you get from wiki2vec using the **most_similar** function.
* Note that each result is a tuple with the word and its score.
* Get the word and the score from each result, e.g. ```word=result[0]``` and ```score=result[1]```. You need to remove the prefix ```ENTITY/``` from the wiki2vec word. Be careful with the **strip** function: ```strip('ENTITY/')``` removes any of these characters from both ends of the string rather than the prefix as a whole (```'ENTITY/Eye'``` becomes ```'ye'```), so use e.g. ```result[0].replace('ENTITY/', '')``` instead.
* If the word is among the WordNet lemmas for the body-organs, store it in the ```wordnet_words``` list.
* If not, store it in the ```new_words``` list.
* After completing the loop, print the length of each result list: ```new_words``` and ```wordnet_words```.
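The steps above can be sketched as follows. The helper names, the lowercasing, and the replacement of spaces by underscores (WordNet lemma names use underscores in multiword lemmas) are assumptions added for illustration, and the example data is made up rather than taken from a real model.

```python
def strip_entity_prefix(token, prefix="ENTITY/"):
    """Remove the wiki2vec entity prefix as a whole, if present."""
    # Slicing after startswith removes the exact prefix; str.strip would
    # instead remove any of the prefix characters from both ends.
    return token[len(prefix):] if token.startswith(prefix) else token


def split_by_wordnet(results, wordnet_lemmas):
    """Split (word, score) results into words known and unknown to the lemma list."""
    lemma_set = {lemma.lower() for lemma in wordnet_lemmas}
    wordnet_words, new_words = [], []
    for token, score in results:
        word = strip_entity_prefix(token)
        # Assumption: compare case-insensitively and with underscores for spaces.
        key = word.lower().replace(" ", "_")
        if key in lemma_set:
            wordnet_words.append(word)
        else:
            new_words.append(word)
    return wordnet_words, new_words


# Made-up toy data, only to show the shape of the input and output.
toy_results = [("liver", 0.71), ("ENTITY/Eye", 0.65), ("transplant", 0.50)]
toy_lemmas = ["liver", "eye", "pancreas"]
wordnet_words, new_words = split_by_wordnet(toy_results, toy_lemmas)
print(len(wordnet_words), len(new_words))
```

In the notebook, `toy_results` would be replaced by the output of **most_similar** and `toy_lemmas` by the output of **get_lemmas_from_wordnet_family**.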


In [6]:
#[CODE GOES HERE]

#### 4. [your answers go here]

Now make another for loop over all the words in new_words and try to get the **WordNet-path** similarity to "organ" with the **word_similarity_wordnet_path** function. If the similarity score is higher than zero, ```if sim>0:```, output the synsets and the score. Answer the following questions:

* How many of the new words are still in Wordnet but not in the body-organ family?
* How are these (in WordNet but not in the body-organ family) related to the first meaning of organ (body)? Consider similarity and hypernym path.
* How many of the new words outside the body-family (inside and outside Wordnet) would really be types of organs in your opinion and how many are not?
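One possible shape for this loop is sketched below. The helper name `wordnet_related` is hypothetical; it takes the similarity function as an argument, so in the notebook you would pass in **word_similarity_wordnet_path**, which returns the two best-matching synsets and their score.

```python
def wordnet_related(new_words, similarity_fn, target="organ"):
    """Split words into those with a WordNet path to `target` and those without."""
    related, not_found = [], []
    for word in new_words:
        synset_word, synset_target, sim = similarity_fn(word, target)
        if sim > 0:
            # Keep the synsets so you can check which sense of the target matched.
            related.append((word, synset_word, synset_target, sim))
        else:
            not_found.append(word)
    return related, not_found


# Hypothetical usage in the notebook:
# related, not_found = wordnet_related(new_words, word_similarity_wordnet_path)
# for word, s1, s2, sim in related:
#     print(word, s1, s2, round(sim, 3))
```

Because **word_similarity_wordnet_path** picks the best-scoring pair over all noun synsets of both words, the matched synset for "organ" is not necessarily `organ.n.01`; inspecting the returned synsets shows whether the relation runs through the body sense or through another sense.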


In [7]:
#[CODE GOES HERE]

#### 5. [your answers go here]

* How is it possible that words are related through a WordNet path to ```organ.n.01``` but are not in the organ family?
* Do you consider Wiki2vec as a good resource for extending English WordNet? Motivate your answer!

#### 6. [your answers go here]

### 2.2.2 Embeddings from the Leipzig corpus:

Download a text corpus from the Leipzig Corpora Collection.

Build an embedding model from that corpus as explained in the notebook **Lab2.3.Creating_Wordembeddings**, or load the model from disk if you have already built and saved it.
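If you build the model yourself, the corpus first has to be turned into lists of tokens. The sketch below assumes the sentences file has one `id<TAB>sentence` pair per line and uses a deliberately simple lowercase word tokenizer; both are assumptions, and the Lab2.3 notebook may preprocess the text differently. The file name in the usage comment is a hypothetical example.

```python
import re

TOKEN_RE = re.compile(r"\w+")


def read_leipzig_sentences(lines):
    """Turn 'id<TAB>sentence' lines into lists of lowercase tokens."""
    sentences = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Keep everything after the first tab; fall back to the whole line if there is none.
        text = line.split("\t", 1)[-1]
        tokens = TOKEN_RE.findall(text.lower())
        if tokens:
            sentences.append(tokens)
    return sentences


# Hypothetical usage; the parameter values are placeholders to experiment with,
# and `vector_size` is the gensim 4 name (older versions call it `size`):
# with open("eng_news_2016_100K-sentences.txt", encoding="utf8") as f:
#     sentences = read_leipzig_sentences(f)
# from gensim.models import Word2Vec
# leipzig_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5)
# leipzig_model.save("leipzig.model")
```

A model trained this way stores its vectors in `leipzig_model.wv`, which offers the same **most_similar** function as the Wiki2vec vectors.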

Get the 50 words that are most similar to body-organ in the same way as you did for the Wiki2vec embeddings.
Answer the following questions, as you did for the Wiki2vec embeddings.

* How many of the similar words from the Leipzig model are in the organ-family and how many are not?
* How many of the new words are still in Wordnet but not in the body-organ family?
* How many are related through a WordNet path to the first meaning of organ (body)?
* How many of these are also types of organs in your opinion and how many are not?


In [103]:
#[CODE GOES HERE]

#### 7. [your answers go here]

* Is there a difference between the lists given by Wiki2vec and the Leipzig model? Can you explain why?
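To support your discussion with numbers, the two neighbour lists can be compared as sets. This is a minimal sketch with a hypothetical helper name; lowercasing before comparison and the Jaccard overlap as a summary figure are choices made here for illustration, and the example words are made up.

```python
def compare_neighbour_lists(words_a, words_b):
    """Return shared words, words unique to each list, and the Jaccard overlap."""
    set_a = {word.lower() for word in words_a}
    set_b = {word.lower() for word in words_b}
    shared = sorted(set_a & set_b)
    only_a = sorted(set_a - set_b)
    only_b = sorted(set_b - set_a)
    union = set_a | set_b
    # Jaccard overlap: size of the intersection divided by size of the union.
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, only_a, only_b, jaccard


# Made-up toy lists, only to show the output format.
shared, only_wiki, only_leipzig, jaccard = compare_neighbour_lists(
    ["liver", "kidney", "spleen"], ["kidney", "piano", "Liver"]
)
print(shared, only_wiki, only_leipzig, jaccard)
```

The numbers only show that the lists differ; explaining why still requires looking at the corpora behind the two models.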

#### 8. [your answers go here]

## End of assignment