### Approach

- Take the most frequent term in the definitions; it is assumed to be the genus
- Remove stopwords and lemmatize the definitions
- Collect the whole hyponym subtree of the genus
- Take the definitions (glosses) of the synsets found among the hyponyms
- Compare the WordNet definitions with the list of given definitions
- Return the synset whose definition is most similar to those in the list
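
The comparison and ranking steps above can be sketched on toy data. This is only an illustration: `rank_by_overlap`, `toy_glosses` and `toy_key_words` are hypothetical names and values, and matching on whole tokens is one possible choice of similarity, not necessarily the one used in the cells below.

```python
def rank_by_overlap(key_words, glosses, top_n=3):
    """Rank candidate words by how many key words appear as whole tokens in their gloss."""
    keys = set(key_words)
    ranked = []
    for word, gloss in glosses.items():
        # Crude tokenization: lowercase, drop semicolons, split on whitespace
        tokens = set(gloss.lower().replace(";", " ").split())
        shared = sorted(keys & tokens)
        ranked.append((len(shared), word, shared))
    # Highest score first; ties are broken alphabetically by word
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked[:top_n]


# Toy glosses and key words chosen for illustration only (assumption)
toy_glosses = {
    "wonder": "the feeling aroused by something strange and surprising",
    "brick": "rectangular block of clay baked by the sun or in a kiln",
    "anger": "a strong emotion; a feeling that is oriented toward some grievance",
}
toy_key_words = ["feeling", "something", "emotion", "range"]

best = rank_by_overlap(toy_key_words, toy_glosses)
for score, word, shared in best:
    print(score, word, shared)
```

With whole-token matching, a key word such as `range` does not count as a hit inside `strange`.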

### Imports

In [1]:
from collections import Counter
import random

from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

### Methods

In [2]:
def get_text_from_file(path):
    '''
    Read a file and, after removing all stopwords, return one list of
    tokens per row.
    '''
    rows = []
    stop_words = set(stopwords.words('english'))
    # The with-block closes the file, so no explicit close() is needed
    with open(path, 'r') as f:
        for row in f:
            filtered_s = [w for w in word_tokenize(row) if w.lower() not in stop_words]
            # Join the tokens back into a string before gensim re-tokenizes it
            rows.append(simple_preprocess(' '.join(filtered_s), deacc=True))
    return rows

def get_most_freq_words(text, nword):
    '''
    Given a list of tokenized rows, return the nword most frequent words
    in each row of the document as (word, count) pairs.
    '''
    genus = []
    for row in text:
        c = Counter()
        c.update(row)
        genus.append(c.most_common(nword))
    return genus

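def get_hypos(word):
    '''
    Return the lemma names of all the hyponyms (transitive closure)
    of the first synset of a word.
    '''
    syn = get_synset(word)
    if syn is None:
        # The word is not in WordNet, so it has no hyponyms
        return []
    hypo_list = list({w for s in syn.closure(lambda s: s.hyponyms()) for w in s.lemma_names()})
    return hypo_list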

def get_hypers(word):
    '''
    Return the direct hypernyms of the first synset of a word.
    '''
    syn = get_synset(word)
    if syn is None:
        return []
    return syn.hypernyms()

def get_synset(word):
    '''
    Return the first synset of a word, or None if the word has no synsets.
    '''
    synsets = wn.synsets(word)
    if synsets:
        return synsets[0]
    return None
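
To make clear what "all the hyponyms" means in `get_hypos`, here is a minimal sketch of a transitive closure over a toy hyponym graph. The function `hyponym_closure` and the dictionary `toy_hyponyms` are hypothetical and only mimic the idea; the notebook itself relies on the NLTK `closure` method on synsets.

```python
from collections import deque


def hyponym_closure(root, hyponyms):
    """Return every node reachable from root via hyponym links, breadth-first."""
    ordered = []
    seen = set()
    queue = deque(hyponyms.get(root, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # a node can be reachable through more than one path
        seen.add(node)
        ordered.append(node)
        queue.extend(hyponyms.get(node, []))
    return ordered


# Toy taxonomy (assumption): each key maps to its direct hyponyms
toy_hyponyms = {
    "feeling": ["emotion", "pleasure"],
    "emotion": ["anger", "fear"],
    "anger": ["fury"],
}

subtree = hyponym_closure("feeling", toy_hyponyms)
print(subtree)
```

A leaf such as `"fury"` yields an empty list, which is the same situation `get_hypos` faces when a genus candidate has no hyponyms.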


### Pre-processing data and find the genus

In [3]:
file = get_text_from_file('../res/def.csv')

genus = get_most_freq_words(file, 3)

print(genus)

[[('feeling', 11), ('human', 8), ('feel', 8)], [('human', 26), ('person', 5), ('homo', 5)], [('someone', 14), ('feeling', 7), ('anger', 7)], [('used', 22), ('object', 15), ('material', 13)]]
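
As a small illustration of how the frequency counts turn into genus candidates, the sketch below applies `Counter.most_common` to toy rows. The helper `top_words` and the data in `toy_rows` are hypothetical, not taken from `def.csv`.

```python
from collections import Counter


def top_words(rows, n):
    """Return the n most frequent (word, count) pairs for each tokenized row."""
    return [Counter(row).most_common(n) for row in rows]


# Toy tokenized rows (assumption), one row per concept
toy_rows = [
    ["feeling", "human", "feeling", "state", "feeling", "human"],
    ["material", "used", "used", "building"],
]

candidates = top_words(toy_rows, 2)
# Naive choice: take the single most frequent word of each row as genus
genus_per_row = [row[0][0] for row in candidates]
print(candidates)
print(genus_per_row)
```

In the second toy row the most frequent word is `used`, which is not a useful genus; raw frequency alone can pick a generic word.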


I pick the most relevant words in the definitions by hand and use them as genus:
- Emotions = feeling
- Person = human
- Revenge = anger
- Brick = construction

In [4]:
genus_list = ["feeling", "person", "anger", "material"]

Let's try with 3 genus candidates per concept to improve accuracy:

In [5]:
# Keep only the words, dropping the counts
genus_list = [[word for word, _ in row] for row in genus]

genus_list

[['feeling', 'human', 'feel'],
 ['human', 'person', 'homo'],
 ['someone', 'feeling', 'anger'],
 ['used', 'object', 'material']]

### Main

In [8]:
# Extract the 20 most frequent words in each set of definitions
key_words_defs = get_most_freq_words(file, 20)

for i in range(len(genus_list)):
    
    # Top 20 words used in the definitions
    key_row = []
    for el in key_words_defs[i]:
        key_row.append(el[0])
    
    #! Version with a single genus
    # Get the hyponyms of the genus and find the definition of the hyponyms
    # hypo_list = get_hypos(genus_list[i])
    # print(hypo_list)
    # hypo_def = []
    # for hypo in hypo_list:
    #     hypo_def.append((hypo, get_synset(hypo).definition()))
    
    #! Version with multiple genus candidates
    hypo_list, hypo_def = [], []
    for el in genus_list[i]:
        hypo_list.append(get_hypos(el))
        
    hypo_list = [x for xs in hypo_list for x in xs]
    for hypo in hypo_list:
        hypo_def.append((hypo, get_synset(hypo).definition()))
    
    # Compare our definitions (def.csv file) with the definitions of the hyponyms
    res = []
    for wndef in hypo_def: # (hyponym, WordNet definition) pairs
        score = 0
        imp_words = []
        for key_word in key_row: # Frequent words from our definitions
            # Substring test: 'range' also matches inside 'strange'
            if key_word in wndef[1]:
                score += 1
                imp_words.append(key_word)

        # Store score, hyponym, matched key words and definition
        res.append((score, wndef[0], imp_words, wndef[1]))
        
    sorted_list = sorted(res, key=lambda x: x[0])
    sorted_res = list(reversed(sorted_list))
    print("\t\t\t",genus_list[i])
    # Slicing avoids an IndexError when fewer than 5 candidates exist
    for item in sorted_res[:5]:
        print(f'Word: *{item[1]}*, score: *{item[0]}*, the key words are *{item[2]}* and the definition is *{item[3]}*')

			 ['feeling', 'human', 'feel']
Word: *painfulness*, score: *4*, the key words are *['feeling', 'feel', 'emotion', 'mental']* and the definition is *emotional distress; a fundamental feeling that people try to avoid*
Word: *wonder*, score: *4*, the key words are *['feeling', 'feel', 'something', 'range']* and the definition is *the feeling aroused by something strange and surprising*
Word: *wonderment*, score: *4*, the key words are *['feeling', 'feel', 'something', 'range']* and the definition is *the feeling aroused by something strange and surprising*
Word: *appetence*, score: *3*, the key words are *['feeling', 'feel', 'something']* and the definition is *a feeling of craving something; ; - Granville Hicks*
Word: *unpleasantness*, score: *3*, the key words are *['feeling', 'feel', 'state']* and the definition is *the feeling caused by disagreeable stimuli; one pole of a continuum of states of feeling*
			 ['human', 'person', 'homo']
Word: *Algonquin*, score: *3*, the key words are