# Exercise 3
The task is to implement Patrick Hanks's theory of valency. Starting from a corpus of your choice and a specific verb (ideally neither too frequent or generic nor too rare), the idea is to build possible semantic clusters together with their frequencies. For example, given the verb "to see" with valency v = 2, a syntactic parser (e.g. spaCy) can be used to collect the fillers for the subj and obj roles of the verb, which are then converted into semantic types. A frequent cluster for "to see" might pair subj = noun.person with obj = noun.artifact. The corpus should contain at least a few hundred instances of the verb.
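As a rough illustration of the clustering idea, the sketch below counts (subject type, object type) pairs. The filler pairs and the `TOY_TYPES` mapping are invented for the example; in the notebook the fillers come from the spaCy parse and the types from WordNet.

```python
from collections import Counter

# Hypothetical filler -> semantic type mapping (a stand-in for a WordNet lookup).
TOY_TYPES = {
    "girl": "noun.person",
    "doctor": "noun.person",
    "car": "noun.artifact",
    "bridge": "noun.artifact",
    "dog": "noun.animal",
}

def count_clusters(filler_pairs, types=TOY_TYPES):
    """Count (subject type, object type) pairs, skipping unknown fillers."""
    clusters = Counter()
    for subj, obj in filler_pairs:
        if subj in types and obj in types:
            clusters[(types[subj], types[obj])] += 1
    return clusters

# Invented (subject, object) fillers for "to see".
pairs = [("girl", "car"), ("doctor", "bridge"), ("dog", "car"), ("girl", "dog")]
clusters = count_clusters(pairs)
for (subj_type, obj_type), freq in clusters.most_common():
    print(subj_type, obj_type, freq)
```

Keying the counter on the pair of types, rather than on each role separately, is what makes the result a cluster of co-occurring argument types.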

In [47]:
import re
import spacy
from nltk.corpus import wordnet

nlp = spacy.load("en_core_web_sm")

In [48]:
VERB="hit"
CATEGORY_HEIGHT=2
MIN_SENTENCES=4

file_path = 'Corpus3_HIT'

In [49]:
def parse_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        testo = file.read()
    # Each line of the corpus file is treated as one sentence,
    # so the text is split on newline characters
    phrases = re.split('\n', testo)
    return phrases

def tokenize_sentences(dataset):
    tokenized_dataset=[]
    for sentence in dataset:
        tokenized_dataset.append(nlp(sentence))
    return tokenized_dataset

dataset = parse_file(file_path)
dataset = tokenize_sentences(dataset)
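One thing to watch in the splitting step: splitting on `\n` can yield empty strings when the file ends with a newline or contains blank lines, and those entries would then be counted as sentences. A minimal sketch of a stricter splitter using only the standard library (the helper name `split_nonempty_lines` and the sample text are assumptions, not part of the notebook):

```python
import re

def split_nonempty_lines(text):
    """Split text into lines, dropping lines that are empty or whitespace-only."""
    return [line.strip() for line in text.splitlines() if line.strip()]

sample = "He hit the ball.\n\nThe news hit the headlines.\n"
print(re.split('\n', sample))        # includes empty strings
print(split_nonempty_lines(sample))  # only the two sentences
```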

In [50]:
# Given the valency-2 verb, extract its subj and obj; if either one cannot be identified, the sentence is discarded
def extract_verb_parameters(tokenized_dataset, verb):
    verb_parameters=[]
    for sentence in tokenized_dataset:
        subject=None
        complement_object=None
        for token in sentence:
            if token.head.lemma_.lower() == verb:
                if token.dep_ == "nsubj":
                    subject = token.text
                elif token.dep_ == "dobj":
                    complement_object = token.text

        if subject and complement_object:
            verb_parameters.append((subject,complement_object,sentence))
    return verb_parameters

verb_parameters=extract_verb_parameters(dataset,VERB)
print("Total sentences: ",len(dataset),"Number of sentences parsed correctly: ",len(verb_parameters))

Total sentences:  199 Number of sentences parsed correctly:  193


In [51]:
# Climb the WordNet hypernym hierarchy from a synset until its max_depth() is at most CATEGORY_HEIGHT
def climb_to_category(synset):
    while synset.max_depth() > CATEGORY_HEIGHT:
        hypernyms = synset.hypernyms()
        if len(hypernyms) == 0:
            break
        # Follow the first listed hypernym
        synset = hypernyms[0]
    return synset

def find_meaning_groups(verb_parameters):
    meanings = {}
    for combination in verb_parameters:
        subj = combination[0]
        obj = combination[1]
        # Look up the WordNet senses of the subject and the object
        subj_meaning = wordnet.synsets(subj)
        obj_meaning = wordnet.synsets(obj)
        if len(subj_meaning) > 0 and len(obj_meaning) > 0:
            # Keep only the first listed sense of each, then generalize it
            subj_meaning = climb_to_category(subj_meaning[0])
            obj_meaning = climb_to_category(obj_meaning[0])
            # Key the group by the two generalized senses and add this instance to it
            combination_name = subj_meaning.name() + "--" + obj_meaning.name()
            meanings.setdefault(combination_name, []).append(combination)
    return meanings
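The upward walk can be illustrated without WordNet on a toy hypernym tree. In this sketch `TOY_HYPERNYM` and `depth` are invented stand-ins for `hypernyms()` and `max_depth()` (every toy node has a single parent, so its depth is unambiguous); it only shows how the height threshold controls how coarse the resulting category is.

```python
# Toy child -> parent map; a simplified stand-in for WordNet hypernym links (invented).
TOY_HYPERNYM = {
    "entity": None,
    "object": "entity",
    "artifact": "object",
    "vehicle": "artifact",
    "car": "vehicle",
    "living_thing": "object",
    "person": "living_thing",
}

def depth(node):
    """Number of hypernym steps from node up to the root."""
    steps = 0
    while TOY_HYPERNYM[node] is not None:
        node = TOY_HYPERNYM[node]
        steps += 1
    return steps

def climb(node, height):
    """Follow parent links until the node's depth is at most height."""
    while depth(node) > height and TOY_HYPERNYM[node] is not None:
        node = TOY_HYPERNYM[node]
    return node

for height in (1, 2, 3):
    print(height, climb("car", height), climb("person", height))
```

With a low threshold both fillers collapse into the same broad category; raising it keeps them apart, at the cost of more and smaller clusters.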

## Number of meanings for the verb

In [52]:
meanings=find_meaning_groups(verb_parameters)
print("Number of meanings (using CATEGORY_HEIGHT =",str(CATEGORY_HEIGHT),"): ",len(meanings))

Number of meanings (using CATEGORY_HEIGHT = 2 ):  36
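Since the exercise asks for clusters together with their frequency, the grouping can also be ranked by relative frequency. A sketch, assuming a dictionary shaped like `meanings` (cluster name mapped to a list of instances); the toy data and the helper name `rank_clusters` are assumptions:

```python
def rank_clusters(groups):
    """Return (name, count, share) tuples sorted by descending count."""
    total = sum(len(items) for items in groups.values())
    ranked = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(name, len(items), len(items) / total) for name, items in ranked]

# Invented data with the same shape as `meanings`.
toy_meanings = {
    "communication.n.02--communication.n.02": ["s4"],
    "person.n.01--object.n.01": ["s1", "s2", "s3"],
}
for name, count, share in rank_clusters(toy_meanings):
    print(f"{name}: {count} ({share:.0%})")
```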


## Show all extracted meanings
Print each meaning that has at least MIN_SENTENCES sentences, with its sentence count and the sentences themselves, followed by the average number of sentences per meaning.

In [53]:
avg=0
for key in meanings:
    if MIN_SENTENCES <= len(meanings[key]):
        print("MEANING: ",key)
        print("NUMBER OF SENTENCES: ",len(meanings[key]))
        print("SENTENCES: ",meanings[key])
        print("")
    avg += len(meanings[key])
avg=avg/len(meanings)
print("AVERAGE NUMBER OF SENTENCES PER MEANING: ",avg)

MEANING:  communication.n.02--communication.n.02
NUMBER OF SENTENCES:  5
SENTENCES:  [('jokes', 'note', The comedian's jokes hit a sour note, and the audience responded with awkward silence.), ('statement', 'headlines', The actor's controversial statement hit the headlines, sparking a media frenzy.), ('news', 'headlines', The news of the scandal hit the headlines, dominating the media's attention.), ('news', 'headlines', The news of the scandal hit the headlines, dominating the media's attention for weeks.), ('news', 'headlines', The news hit the headlines, dominating newspaper front pages and capturing public attention.)]

MEANING:  communication.n.02--object.n.01
NUMBER OF SENTENCES:  4
SENTENCES:  [('novel', 'shelves', The novel hit the shelves and quickly became a bestseller, captivating readers worldwide.), ('news', 'mill', The news hit the rumor mill, sparking speculation and gossip among friends and colleagues.), ('news', 'media', The news of the scandal hit social media, sparki