## Automatic Summarisation using NASARI
### Assignment 3, TLN
#### Mario Scapellato

NASARI is a resource made of vectors that are sense embeddings rather than word embeddings: each vector describes a sense (a concept) instead of a word form.

In [281]:
import string
from nltk.corpus import stopwords
import nltk

In [282]:
NASARI = 'utils/NASARI_vectors/dd-small-nasari-15.txt'
NASARI_PATH = 'utils/NASARI_vectors/dd-nasari-link_210419.txt'

Reads the NASARI file and returns its content as a dictionary indexed by lemma.

In [283]:
def read_nasari(path):
    nasari = {}
    with open(path, 'r', encoding="utf8") as file:
        for row in file:
            line_splitted = row.strip().split(";")
            dict_entry = {}

            # Skip the first two fields (BabelNet synset id and lemma); the rest are features
            for term in line_splitted[2:]:
                # Each feature is written as "term_score", e.g. "serotonin_1841.0"
                term_score = term.split("_")
                if len(term_score) > 1:
                    # The score is kept as a string; only the feature order is used later
                    dict_entry[term_score[0]] = term_score[1]

            # Index by lowercased lemma: {lemma: {term: score, term: score, ...}}
            nasari[line_splitted[1].lower()] = dict_entry

    return nasari

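As a minimal sketch of what one entry looks like after parsing, the snippet below applies the same splitting idea to a single line. The sample line, its BabelNet id and the helper name `parse_nasari_line` are made up for illustration. Unlike `read_nasari`, this sketch converts scores to floats and splits each feature on its last underscore only.

```python
# Illustrative sketch: parse one NASARI-style line into (lemma, {term: score}).
# The sample line and the helper name are hypothetical, not taken from the real file.

def parse_nasari_line(row):
    fields = row.strip().split(";")
    lemma = fields[1].lower()          # second field is the lemma
    features = {}
    for term in fields[2:]:            # remaining fields look like "term_score"
        name, sep, score = term.rpartition("_")
        if sep:                        # skip fields without a "_" separator
            features[name] = float(score)
    return lemma, features


sample = "bn:00000000n;Serotonin;serotonin_1841.0;receptor_612.5;brain_98.25\n"
lemma, features = parse_nasari_line(sample)
print(lemma)      # serotonin
print(features)   # {'serotonin': 1841.0, 'receptor': 612.5, 'brain': 98.25}
```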
Reads a document and returns its non-empty lines (the title followed by the paragraphs) as a list.

In [284]:
def read_doc(path):
    document = []
    with open(path, 'r', encoding="utf8") as file:
        for row in file:
            # Skip comment lines, i.e. lines starting with "#"
            if not row.lstrip().startswith('#'):
                # Drop the trailing newline without touching the last character of the text
                row = row.rstrip('\n')
                if row != '':
                    document.append(row)
    return document

Computes the rank (1-based position) of a feature within a NASARI vector. This helper is used to compute the weighted overlap between NASARI vectors.

In [285]:
# Rank of a feature shared by both vectors
def calculate_rank(vector, nasari_vector):
    # `vector` is a feature (a key); `nasari_vector` is the ordered list of a vector's features
    for i, feature in enumerate(nasari_vector):
        if feature == vector:
            # 1-based position of the feature in the list
            return i + 1

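Implementation of the Weighted Overlap (WO) between two NASARI vectors $v_1$ and $v_2$, where $O$ is the set of features they share and $rank(q, v)$ is the position of feature $q$ in $v$:
$$
  WO(v_1,v_2) = \frac{\sum_{q \in O} (rank(q, v_1) + rank(q, v_2))^{-1}}{\sum_{i=1}^{|O|} (2i)^{-1}}
$$

The higher the WO, the more similar the two vectors are.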

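To make the formula concrete, here is a small sketch that evaluates it on toy vectors. It assumes each vector is a plain list of features ordered by decreasing weight; the feature names and the function name `weighted_overlap` are invented for the example. Two shared features with ranks (1, 2) and (2, 1) give a numerator of 1/3 + 1/3 and a denominator of 1/2 + 1/4, hence WO = 8/9.

```python
# Illustrative sketch of the Weighted Overlap formula on two toy vectors.
# Vectors are lists of features ordered by decreasing weight (rank 1 first).

def weighted_overlap(v1, v2):
    shared = set(v1) & set(v2)
    if not shared:
        return 0.0
    # Numerator: inverse of the summed 1-based ranks of each shared feature
    num = sum(1 / ((v1.index(q) + 1) + (v2.index(q) + 1)) for q in shared)
    # Denominator: best case, shared features occupy ranks 1..|O| in both vectors
    den = sum(1 / (2 * i) for i in range(1, len(shared) + 1))
    return num / den


v1 = ["art", "painter", "museum"]
v2 = ["painter", "art", "canvas"]

print(round(weighted_overlap(v1, v1), 4))   # 1.0
print(round(weighted_overlap(v1, v2), 4))   # 0.8889
print(weighted_overlap(v1, ["virus"]))      # 0.0
```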
In [286]:
# Weighted overlap between two NASARI vectors (dicts of {term: score})
def overlap(nasari_vector_1, nasari_vector_2):
    overlap_keys = nasari_vector_1.keys() & nasari_vector_2.keys()

    if len(overlap_keys) != 0:
        # Ranks follow dict insertion order, i.e. the order of the features in the NASARI file
        features_1 = list(nasari_vector_1)
        features_2 = list(nasari_vector_2)

        # Numerator: sum of the inverse rank sums over the shared features
        rank_acc = sum(1 / (calculate_rank(q, features_1) + calculate_rank(q, features_2))
                       for q in overlap_keys)

        # Denominator: the best case, where the shared features rank 1..|O| in both vectors
        den = sum(1 / (2 * i) for i in range(1, len(overlap_keys) + 1))

        return rank_acc / den

    return 0



A bag-of-words approach: given a text, it returns the set of its lemmatized words after removing stop words and punctuation.

In [287]:
def bag_of_word_approach(text):
    """
    :param text: input text
    :return: BoW representation of the text, as a set of lemmas.
    """

    # Lowercase the text, tokenize it, then remove stop words and punctuation
    text = text.lower()

    stop_words = set(stopwords.words('english'))
    wordnet_lemmatizer = nltk.WordNetLemmatizer()

    # Text tokenization
    tokens = nltk.word_tokenize(text)

    # Remove stop words and punctuation tokens
    tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]

    return set(wordnet_lemmatizer.lemmatize(token) for token in tokens)

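The following sketch shows the same idea without NLTK, so it runs with no corpus downloads. It is a stand-in rather than the notebook's method: it uses a regular expression instead of `nltk.word_tokenize`, a tiny hand-written stop-word set instead of the NLTK list, and no lemmatization. The name `simple_bag_of_words` and the sample sentence are assumptions made for the example.

```python
# Illustrative stand-in for the NLTK pipeline: regex tokenization and a tiny
# hard-coded stop-word list, with no lemmatization. Not the notebook's method.
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "was", "to"}  # hypothetical subset

def simple_bag_of_words(text):
    # Lowercase, then keep alphabetic runs only; this drops punctuation and digits
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}


print(sorted(simple_bag_of_words("The life and art of Andy Warhol.")))
# ['andy', 'art', 'life', 'warhol']
```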
Builds the list of NASARI vectors for the words of the document title, which is taken as the topic.

In [288]:
def get_topic_from_title(document, nasari):
    """
    :param document: input document
    :param nasari: Nasari dictionary
    :return: a list of Nasari vectors.
    """

    title = document[0]  # the title is the first line of the document

    # Topic words computed with the BoW approach
    topic = bag_of_word_approach(title)

    # NB: nasari_vectors is a list of dicts [{term: score, ...}, ...]
    nasari_vectors = []

    for word in topic:  # for each word of the topic
        if word in nasari:  # if the word has an entry in the NASARI dictionary
            nasari_vectors.append(nasari[word])  # collect its NASARI vector

    return nasari_vectors


Builds the list of NASARI vectors for the words of a text. It works like the previous function, applied to a paragraph instead of the title.

In [289]:
# NASARI vectors of a text: same approach as above, applied to the text instead of the title
def text_to_nasari(text, nasari):
    """
    :param text: the input text, as a string
    :param nasari: Nasari dictionary
    :return: list of Nasari vectors.
    """

    tokens = bag_of_word_approach(text)

    nasari_vectors = []

    for word in tokens:
        if word in nasari:
            nasari_vectors.append(nasari[word])

    return nasari_vectors

Given the list of paragraphs of a document, computes how many of them are kept for a given reduction percentage.

In [290]:
# Number of paragraphs to keep after reducing the document by `percentage` percent
def calculate_lines_to_keep(doc_paragraphs, percentage):
    return len(doc_paragraphs) - int(round((percentage / 100) * len(doc_paragraphs), 0))

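A quick worked example of this arithmetic may help. The sketch below takes the paragraph count directly instead of a list, and the name `lines_to_keep` and the sample counts are chosen for illustration only. Note that Python's `round` rounds exact halves to the nearest even integer, so removing 2.5 paragraphs means removing 2.

```python
# Worked example of the "paragraphs to keep" arithmetic.

def lines_to_keep(n_paragraphs, percentage):
    # Paragraphs removed = percentage of the total, rounded to the nearest integer
    removed = int(round((percentage / 100) * n_paragraphs, 0))
    return n_paragraphs - removed


print(lines_to_keep(10, 10))   # 9
print(lines_to_keep(17, 30))   # 12  (5.1 rounds to 5 removed)
print(lines_to_keep(10, 25))   # 8   (2.5 rounds to 2: halves go to the even integer)
```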
Given a list of paragraphs annotated with overlap scores, compute the summarized text.

In [291]:
# Reduce the document to the given number of paragraphs
def reduce_document(doc_paragraphs_overlaps, lines_to_keep):
    """
    :param doc_paragraphs_overlaps: list of (index, overlap score, text) tuples, one per paragraph
    :param lines_to_keep: number of paragraphs to keep
    :return: reduced document as a list of strings, title first
    """
    # Keep the paragraphs with the highest weighted overlap
    document_sorted = sorted(doc_paragraphs_overlaps, key=lambda x: x[1], reverse=True)
    reduced_document = document_sorted[:lines_to_keep]

    # Re-sort the kept paragraphs by index; with reverse=True they come out in
    # reverse document order (last paragraph first)
    reduced_document = sorted(reduced_document, key=lambda x: x[0], reverse=True)

    # Obtain the text
    reduced_document_text = [x[2] for x in reduced_document]

    # Add the title; `document` is the notebook-level variable set by the calling cell
    reduced_document_text = [document[0]] + reduced_document_text

    return reduced_document_text

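The selection step can be sketched on toy data as follows. This is a variant, not the cell's exact behaviour: it sorts the kept paragraphs by ascending index so that they return in their original reading order, whereas `reduce_document` sorts with `reverse=True`. The function name `select_top_k`, the scores and the texts are invented.

```python
# Illustrative sketch: keep the k best-scoring paragraphs, then restore document order.
# Tuples are (index, score, text); the data below is made up.

def select_top_k(scored_paragraphs, k):
    best = sorted(scored_paragraphs, key=lambda p: p[1], reverse=True)[:k]
    # Ascending index puts the survivors back in their original order
    return [text for _, _, text in sorted(best, key=lambda p: p[0])]


paragraphs = [
    (0, 0.10, "intro"),
    (1, 0.45, "early life"),
    (2, 0.05, "trivia"),
    (3, 0.30, "legacy"),
]
print(select_top_k(paragraphs, 2))   # ['early life', 'legacy']
```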
Summarizes a document with the title-based method: each paragraph is scored by how much its words overlap with the topic extracted from the title.


In [292]:
def summarization(document, nasari, percentage):

    # Obtain the topic vectors from the title
    topics = get_topic_from_title(document, nasari)
    doc_paragraphs = []

    # For each paragraph in the document (the title is skipped)
    for i, doc_paragraph in enumerate(document[1:]):

        # NASARI representation of the paragraph: one vector per known word
        nasari_text_par = text_to_nasari(doc_paragraph, nasari)

        paragraph_weighted_overlap = 0  # overlap of the current paragraph

        # `word` is the NASARI vector of a paragraph word: {term: score, ...}
        for word in nasari_text_par:
            topic_weighted_overlap = 0  # overlap of the current word with the topics

            for topic in topics:
                # Accumulate the WO between the word and each topic vector
                topic_weighted_overlap += overlap(word, topic)

            # Mean WO over the topic vectors
            if topic_weighted_overlap != 0:
                topic_weighted_overlap /= len(topics)

            # Cumulative WO of the paragraph
            paragraph_weighted_overlap += topic_weighted_overlap

        # Paragraphs with no NASARI vector are left out of the candidates
        if len(nasari_text_par) != 0:
            # Mean WO of the paragraph
            paragraph_weighted_overlap /= len(nasari_text_par)
            # Store a tuple with the paragraph's index, WO and text
            doc_paragraphs.append((i, paragraph_weighted_overlap, doc_paragraph))

    # Number of paragraphs to keep
    lines_to_keep = calculate_lines_to_keep(doc_paragraphs, percentage)

    # Finally, build the summary
    reduced_document = reduce_document(doc_paragraphs, lines_to_keep)

    return reduced_document

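The scoring scheme is a mean of means: for every word of a paragraph, average its similarity to all topic vectors, then average those values over the words. The sketch below isolates that step. To stay self-contained it plugs in a toy Jaccard index where the notebook uses the weighted overlap, and the names `paragraph_score` and `jaccard`, as well as the sample sets, are assumptions.

```python
# Illustrative sketch of the scoring scheme: average, over the words of a
# paragraph, of each word's mean similarity to the topic vectors.
# `similarity` is any function of two vectors; a toy Jaccard index is used here.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def paragraph_score(word_vectors, topic_vectors, similarity=jaccard):
    if not word_vectors or not topic_vectors:
        return 0.0
    per_word = [
        sum(similarity(w, t) for t in topic_vectors) / len(topic_vectors)
        for w in word_vectors
    ]
    return sum(per_word) / len(per_word)


topics = [{"art", "paint"}, {"pop", "music"}]
words = [{"art", "paint"}, {"virus"}]
print(paragraph_score(words, topics))   # 0.25
```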
In [293]:
def write_text(text_summ, path):
    # Write each paragraph of the summary followed by a blank line
    with open(path, "w", encoding="utf8") as f:
        for paragraph in text_summ:
            f.write(paragraph)
            f.write("\n\n")

Runs all the functions defined above.
It also computes BLEU and ROUGE scores to measure how similar each summary is to its original document.

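As a rough intuition for the ROUGE-1 numbers, the sketch below computes unigram recall, precision and F1 on sets of whitespace tokens. It is a simplification in the spirit of ROUGE-1, not the `rouge` package's implementation, and the function name and sentences are invented. It shows why an extractive summary, which only reuses words of the original, reaches a unigram precision of 1.0 while recall drops as text is removed.

```python
# Illustrative sketch of unigram overlap in the spirit of ROUGE-1, computed on
# sets of whitespace tokens. A simplification, not the `rouge` package's code.

def unigram_overlap(summary, reference):
    s, r = set(summary.split()), set(reference.split())
    common = s & r
    recall = len(common) / len(r)        # share of reference words kept
    precision = len(common) / len(s)     # share of summary words found in the reference
    f1 = 2 * precision * recall / (precision + recall) if common else 0.0
    return recall, precision, f1


reference = "warhol was an american artist and film director"
summary = "warhol was an american artist"      # extractive: a subset of the reference
r, p, f = unigram_overlap(summary, reference)
print(r, p, round(f, 3))   # 0.625 1.0 0.769
```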
In [294]:
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge

docs = ['Andy-Warhol.txt', 'Ebola-virus-disease.txt', 'Life-indoors.txt', 'Napoleon-wiki.txt', 'Trump-wall.txt']

nasari = read_nasari(NASARI)
rouge = Rouge()
compression = [10, 20, 30]  # reduction percentages: 10%, 20%, 30%

for doc in docs:
    print('**************************** {}*******************************'.format(doc))
    text_path = 'texts_to_summarize/' + doc

    document = read_doc(text_path)

    for i in compression:  # score the summary at each reduction percentage
        summary = summarization(document, nasari, i)

        print(i, "% reduction",)

        # Unigram BLEU. Both arguments are lists of paragraphs, so each paragraph counts as one token.
        print("BLEU score: ", sentence_bleu([document], summary, weights=(1, 0, 0, 0)))

        # ROUGE-1, ROUGE-2 and ROUGE-L: recall, precision and F1
        rouge_scores = rouge.get_scores(' '.join(summary), ' '.join(document))
        print("ROUGE scores: ", rouge_scores)

    print('\n')
print()

# Write the reduced documents to the "texts_summarized" folder
for doc in docs:
    text_path = 'texts_to_summarize/' + doc

    document = read_doc(text_path)
    for i in compression:
        text_summ_path = './texts_summarized/' + str(i) + '_' + doc
        summ = summarization(document, nasari, i)
        write_text(summ, text_summ_path)

print("write completed")

**************************** Andy-Warhol.txt*******************************
10 % reduction
BLEU score:  0.8948393168143697
ROUGE scores:  [{'rouge-1': {'r': 0.8813559322033898, 'p': 1.0, 'f': 0.9369369319568218}, 'rouge-2': {'r': 0.8511415525114155, 'p': 0.982086406743941, 'f': 0.9119373727163127}, 'rouge-l': {'r': 0.8813559322033898, 'p': 1.0, 'f': 0.9369369319568218}}]
20 % reduction
BLEU score:  0.7788007830714049
ROUGE scores:  [{'rouge-1': {'r': 0.7812018489984591, 'p': 1.0, 'f': 0.8771626248332306}, 'rouge-2': {'r': 0.7296803652968037, 'p': 0.9815724815724816, 'f': 0.837087475464543}, 'rouge-l': {'r': 0.7812018489984591, 'p': 1.0, 'f': 0.8771626248332306}}]
30 % reduction
BLEU score:  0.6514390575310556
ROUGE scores:  [{'rouge-1': {'r': 0.724191063174114, 'p': 1.0, 'f': 0.8400357413299089}, 'rouge-2': {'r': 0.6621004566210046, 'p': 0.9823848238482384, 'f': 0.7910529139021558}, 'rouge-l': {'r': 0.724191063174114, 'p': 1.0, 'f': 0.8400357413299089}}]

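The BLEU values above for Andy-Warhol.txt can be read through BLEU's brevity penalty, BP = exp(1 - r/c), where r is the reference length and c the candidate length. Since `sentence_bleu` receives lists of paragraphs, each paragraph behaves as one token; every summary paragraph also occurs in the original, so the unigram precision is 1 and the score reduces to the penalty. The sketch below reproduces the three values under the assumption that the summaries keep 90%, 80% and 70% of the entries; the absolute counts used are hypothetical.

```python
# Illustrative sketch of BLEU's brevity penalty, BP = exp(1 - r/c) when the
# candidate (length c) is shorter than the reference (length r), else 1.
import math

def brevity_penalty(ref_len, cand_len):
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)


# Lengths are counted in paragraphs; the counts below are hypothetical,
# chosen only so that the candidate keeps 90%, 80% and 70% of the reference.
for ref_len, cand_len in [(10, 9), (10, 8), (10, 7)]:
    print(cand_len / ref_len, round(brevity_penalty(ref_len, cand_len), 4))
# 0.9 0.8948
# 0.8 0.7788
# 0.7 0.6514
```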

**************************** 