# Lab 3

The task is to implement Patrick Hanks's theory of verb valency. Starting from a corpus of your choice and a specific verb (preferably not too frequent or generic, but not rare either), the idea is to build possible semantic clusters together with their frequencies. For example, given the verb "to see" with valency 2 and a syntactic parser (e.g. spaCy), one can collect the fillers for the verb's subj and obj roles and then convert them into semantic types. A frequent cluster for "to see" might pair subj = noun.person with obj = noun.artifact. The corpus should contain at least a few hundred instances of the verb.

In [45]:
import pandas as pd
import spacy
import string
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords

## Helper functions

### Sentence normalization

In [46]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def normalize(sentence):
    tokens = nltk.word_tokenize(sentence)
    tokens = [token for token in tokens if token not in string.punctuation]  # remove punctuation
    tokens = [token.lower() for token in tokens]  # lowercase every token
    tokens = [token for token in tokens if token not in stop_words]  # remove stop words
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # lemmatize
    
    return tokens
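
One detail of the punctuation filter in `normalize` is worth illustrating: `token not in string.punctuation` is a substring test against the punctuation string, so multi-character tokens such as `...` or the two-character opening and closing quote tokens produced by `nltk.word_tokenize` are not removed. The following is a minimal sketch of a character-wise check; `is_punctuation` is a hypothetical helper and not part of the original notebook.

```python
import string

def is_punctuation(token):
    """Return True when every character of the token is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

tokens = ["``", "see", "...", "''", ",", "vice-versa"]

# Substring test as used in normalize: only tokens that occur verbatim
# inside string.punctuation (e.g. a single comma) are dropped.
substring_filtered = [t for t in tokens if t not in string.punctuation]

# Character-wise test: multi-character punctuation tokens are dropped too.
charwise_filtered = [t for t in tokens if not is_punctuation(t)]

print(substring_filtered)
print(charwise_filtered)
```

Hyphenated words such as `vice-versa` survive both filters, since they contain non-punctuation characters.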

### Simplified Lesk algorithm

In [47]:
def simplified_lesk(word, context):
    best_sense = None
    max_overlap = 0

    for sense in wn.synsets(word):
        signature = set(normalize(sense.definition())).union(set(normalize(' '.join(sense.examples()))))
        overlap = len(context.intersection(signature))
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense
    
    return best_sense
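
To make the overlap criterion concrete, here is a self-contained sketch that applies the same logic as `simplified_lesk` to a hand-written sense inventory instead of WordNet. The sense labels and glosses are invented for illustration, and `toy_simplified_lesk` is a hypothetical name.

```python
def toy_simplified_lesk(word, context, inventory):
    """Pick the sense whose signature shares the most words with the context."""
    best_sense, max_overlap = None, 0
    for sense, gloss in inventory.get(word, []):
        signature = set(gloss.lower().split())
        overlap = len(context & signature)
        # Strict comparison: ties keep the earlier sense, zero overlap keeps None.
        if overlap > max_overlap:
            best_sense, max_overlap = sense, overlap
    return best_sense

# Hypothetical two-sense inventory for "bank" (invented glosses).
INVENTORY = {
    "bank": [
        ("bank#river", "sloping land beside a river or lake"),
        ("bank#finance", "institution that accepts money deposits and lends money"),
    ]
}

context = {"she", "deposits", "money", "bank"}
print(toy_simplified_lesk("bank", context, INVENTORY))
print(toy_simplified_lesk("bank", {"unrelated", "words"}, INVENTORY))
```

As in the function above, a word whose senses share nothing with the context yields `None`, which is why `generate` has to skip such fillers.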

## Generating verb structures with valency 2 and the subj and dobj slots

In [74]:
def generate(corpus):
    '''
    For each occurrence of the verb "see", consider only valency 2 and fill the nsubj and dobj slots.
    '''
    parser = spacy.load('en_core_web_sm')
    verb_structures = []
    verb_clusters = []
    phrase_clusters = []

    for sentence in corpus['sentence']:
        sentence = sentence.replace('<s>', '').replace('</s>', '').replace('"', '').replace("'", '')[1:]
        parsified_sentence = parser(sentence)

        for token in parsified_sentence:
            if token.pos_ == 'VERB' and token.lemma_ == 'see':
                nsubj = None
                dobj = None
                for child in token.children:
                    if child.dep_ == 'nsubj':
                        nsubj = child.text
                    elif child.dep_ == 'dobj':
                        dobj = child.text

                if nsubj is not None and dobj is not None:
                    synset_nsubj = simplified_lesk(nsubj, set(normalize(sentence)))
                    synset_dobj = simplified_lesk(dobj, set(normalize(sentence)))

                    if synset_nsubj is not None and synset_dobj is not None:
                        # lexname() looks like 'noun.person'; keep the part after the dot
                        semtype_nsubj = synset_nsubj.lexname().split('.')[1]
                        semtype_dobj = synset_dobj.lexname().split('.')[1]

                        verb_structures.append((nsubj, synset_nsubj, dobj, synset_dobj))
                        verb_clusters.append((semtype_nsubj, semtype_dobj))
                        phrase_clusters.append((semtype_nsubj, semtype_dobj, sentence))

    return set(verb_structures), set(verb_clusters), set(phrase_clusters)
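
The task statement asks for clusters together with their frequency, whereas `generate` returns sets, so repeated (subject type, object type) pairs collapse into one entry. A minimal sketch of how frequencies could be counted, assuming the list of pairs is kept before the `set()` conversion; `cluster_frequencies` is a hypothetical helper and the sample pairs are invented.

```python
from collections import Counter

def cluster_frequencies(pairs):
    """Return (cluster, count, relative frequency) triples, most frequent first."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return [(cluster, n, n / total) for cluster, n in counts.most_common()]

# Invented (subject type, object type) pairs standing in for verb_clusters.
sample_pairs = [
    ("person", "artifact"),
    ("person", "event"),
    ("person", "artifact"),
    ("group", "possession"),
]

for cluster, n, rel in cluster_frequencies(sample_pairs):
    print(cluster, n, f"{rel:.2f}")
```

With real data, the same function could be applied to the `verb_clusters` list inside `generate` before it is turned into a set.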

## Printing the collection

In [71]:
def print_structures(structures):
    '''
    For each structure, print nsubj and dobj with their synsets
    (labels are printed in Italian: "soggetto" = subject, "oggetto" = object).
    '''
    i = 0
    for nsubj, syn_nsubj, dobj, syn_dobj in structures:
        print('Struttura ' + str(i))
        print('\t soggetto: ' + nsubj + ' - ' + str(syn_nsubj))
        print('\t oggetto: ' + dobj + ' - ' + str(syn_dobj))
        i += 1
        print('\n')

def print_clusters(clusters):
    '''
    For each cluster, print the semantic types of its subject ("soggetto") and object ("oggetto").
    '''
    i = 0
    for nsubj_semtype, dobj_semtype in clusters:
        print('Cluster ' + str(i))
        print('\t SemType soggetto: ' + nsubj_semtype)
        print('\t SemType oggetto: ' + dobj_semtype)
        i += 1
        print('\n')

## MAIN

In [75]:
corpus = pd.read_csv('english_wikipedia_sentence.csv')
structures, clusters, phrase_clusters = generate(corpus)

In [76]:
print_structures(structures)

Struttura 0
	 soggetto: BC - Synset('bc.r.01')
	 oggetto: appearance - Synset('appearance.n.05')


Struttura 1
	 soggetto: crops - Synset('crop.n.03')
	 oggetto: subsidies - Synset('subsidy.n.01')


Struttura 2
	 soggetto: Crescent - Synset('crescent.n.01')
	 oggetto: domestication - Synset('domestication.n.03')


Struttura 3
	 soggetto: I - Synset('iodine.n.01')
	 oggetto: face - Synset('face.n.01')


Struttura 4
	 soggetto: one - Synset('one.n.01')
	 oggetto: projection - Synset('projection.n.02')


Struttura 5
	 soggetto: workers - Synset('worker.n.01')
	 oggetto: success - Synset('success.n.01')


Struttura 6
	 soggetto: Lord - Synset('lord.v.01')
	 oggetto: number - Synset('number.n.01')


Struttura 7
	 soggetto: philosophy - Synset('philosophy.n.03')
	 oggetto: struggle - Synset('struggle.n.01')


Struttura 8
	 soggetto: Boys - Synset('male_child.n.01')
	 oggetto: children - Synset('child.n.01')


Struttura 9
	 soggetto: Ages - Synset('age.n.01')
	 oggetto: improvements - Synset(
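
The listing above shows fillers such as `I` resolved to `iodine.n.01` and `Lord` resolved to a verb synset: the lookup is neither restricted to nouns nor aware of pronouns. One possible mitigation, sketched here under assumptions, is to map personal pronouns directly to the `person` semantic type and run disambiguation only on the remaining fillers. `PERSON_PRONOUNS` and `semantic_type` are hypothetical names, and `disambiguate` stands for any function that returns a semantic type or `None`. Separately, `wn.synsets` accepts a `pos` argument (e.g. `pos=wn.NOUN`), which would keep verb and adverb senses out of the candidate list.

```python
# Hypothetical pronoun list; "it" and "they" are left out because their
# referent is not necessarily a person.
PERSON_PRONOUNS = {"i", "you", "he", "she", "we", "me", "him", "her", "us"}

def semantic_type(filler, disambiguate):
    """Return 'person' for personal pronouns, else defer to the disambiguator."""
    if filler.lower() in PERSON_PRONOUNS:
        return "person"
    return disambiguate(filler)

# Stub disambiguator used only for this illustration.
def stub_disambiguate(word):
    return {"face": "body", "children": "person"}.get(word.lower())

print(semantic_type("I", stub_disambiguate))
print(semantic_type("face", stub_disambiguate))
print(semantic_type("xyz", stub_disambiguate))
```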

In [77]:
print_clusters(clusters)

Cluster 0
	 SemType soggetto: attribute
	 SemType oggetto: act


Cluster 1
	 SemType soggetto: quantity
	 SemType oggetto: communication


Cluster 2
	 SemType soggetto: all
	 SemType oggetto: act


Cluster 3
	 SemType soggetto: cognition
	 SemType oggetto: act


Cluster 4
	 SemType soggetto: shape
	 SemType oggetto: act


Cluster 5
	 SemType soggetto: substance
	 SemType oggetto: body


Cluster 6
	 SemType soggetto: shape
	 SemType oggetto: event


Cluster 7
	 SemType soggetto: person
	 SemType oggetto: event


Cluster 8
	 SemType soggetto: person
	 SemType oggetto: person


Cluster 9
	 SemType soggetto: group
	 SemType oggetto: possession


Cluster 10
	 SemType soggetto: social
	 SemType oggetto: attribute




In [78]:
phrase_clusters

{('all',
  'act',
  'The period 2700–2300 BC saw the first appearance of the Sumerian abacus, a table of successive columns which delimited the successive orders of magnitude of their sexagesimal number system. '),
 ('attribute',
  'act',
  'The Middle Ages saw significant improvements in the agricultural techniques and technology. '),
 ('cognition',
  'act',
  'In essence, the philosophy sees anarchist struggle as a necessary component of feminist struggle and vice-versa. '),
 ('group',
  'possession',
  'Despite this progress, certain crops, such as cotton, still see subsidies in developed countries artificially deflating global prices, causing hardship in developing countries with non-subsidized farmers. '),
 ('person',
  'event',
  'Many workers and activists saw Bolshevik success as setting an example; Communist parties grew at the expense of anarchism and other socialist movements. '),
 ('person',
  'person',
  'His Boys & Girls Club sees 2,000 children throughout the year and bo