# NLP project


Project 21: Automatic Summarization  

We consider a structured document containing a title, an abstract and a set of subsections. We would like to build a text summarizer that tracks the important keywords in the document. The first step is therefore to identify these keywords.  

In [None]:
%pip install --upgrade pip

In [None]:
!pip list
# check that these are installed: lxml, html5lib, requests, selenium, webdriver-manager
# further instructions in Task 1

In [None]:
# if nltk is missing, install it -> !pip install nltk
import nltk
nltk.download("stopwords")
#from nltk.cluster.util import cosine_distance

## TASK 1
Assume the initial input is given as an HTML document (choose an example of your own). We hypothesize that the important keywords are initially contained in the words of the title, the abstract and possibly the titles of the sections and subsections of the document. Suggest a simple Python script that takes an HTML document as input and outputs the lists of words in the title, the abstract and the section/subsection titles.
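
As a dependency-free illustration of the idea, the sketch below collects the words of the title, abstract and section headings from an inline HTML string using only the standard library's `html.parser`. The class `HeadingCollector`, the sample HTML and the choice of `h1` as title and `h2`-`h4` as section headings are assumptions made for this example; the `abstract-text` class name is the one used on the IEEE page scraped in this notebook.

```python
import re
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect words found in <h1>, <h2>-<h4> and the abstract <div>."""

    def __init__(self, abstract_class="abstract-text"):
        super().__init__()
        self.abstract_class = abstract_class
        self.words = {"title": [], "abstract": [], "sections": []}
        self._key = None    # which word list is currently being filled
        self._tag = None    # tag that opened the current region
        self._depth = 0     # nesting depth of that tag inside the region

    def handle_starttag(self, tag, attrs):
        if self._key is not None:
            # Already inside a region: only track nesting of the same tag.
            if tag == self._tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1":
            self._key = "title"
        elif tag in ("h2", "h3", "h4"):
            self._key = "sections"
        elif tag == "div" and self.abstract_class in classes:
            self._key = "abstract"
        if self._key is not None:
            self._tag, self._depth = tag, 1

    def handle_endtag(self, tag):
        if self._key is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._key = self._tag = None

    def handle_data(self, data):
        if self._key is not None:
            self.words[self._key].extend(re.findall(r"[A-Za-z]+", data))


sample_html = """
<html><body>
  <h1 class="document-title">Automatic Text Summarization</h1>
  <div class="abstract-text"><div>We study keyword tracking.</div></div>
  <div id="article">
    <h2>Introduction</h2><p>Body text is ignored here.</p>
    <h3>Related Work</h3>
  </div>
</body></html>
"""

collector = HeadingCollector()
collector.feed(sample_html)
word_lists = collector.words
for name, words in word_lists.items():
    print(name, words)
```

This only works for static HTML; pages that build their content with JavaScript still need a browser-driven approach such as Selenium.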

In [None]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from nltk.tokenize import sent_tokenize
import time
# Not every site returns the full HTML body to a plain request; Selenium seems to work on more of them.
# pip install -U selenium
# pip install webdriver-manager
# If you use Anaconda and the install does not work, also try: conda update pip

# Collect title, subtitles, abstract and body text from html file.
# Print out titles and abstract and construct one string based on
# the elements.

def _convertHtmlToStr(elements):
    # Join the text of all multi-word elements into one string, ending each
    # element with a full stop so that sent_tokenize can split on it.
    text = ""
    for element in elements:
        if len(element.text.split()) > 1:
            text += element.text
            if not text.endswith("."):
                text += "."
            text += " "
    sentences = sent_tokenize(text)
    return text, len(sentences)

def scrape_article(url):
    article = ""
    # Selenium 4 takes the driver path through a Service object.
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get(url)

    # Wait for article to fully load
    time.sleep(3)

    soup = BeautifulSoup(driver.page_source, 'lxml')
    strElement = ""
    countTitle, countAbstract, countH2, countH3, countH4, countP = 0, 0, 0, 0, 0, 0

    strElement, countTitle = _convertHtmlToStr(soup.find("h1", {"class": "document-title"}))
    print("Title:\n{}\n\n".format(strElement))
    article += strElement
    article += ". "
    strElement, countAbstract = _convertHtmlToStr(soup.find("div", {"class": "abstract-text"}))
    print("Abstract:\n{}\n\n".format(strElement))
    article += strElement

    articleHtmlBody = soup.find("div", {"id": "article"})
    if articleHtmlBody is None:
        raise ValueError("No <div id='article'> element found on the page")

    strElement, countH2 = _convertHtmlToStr(articleHtmlBody.find_all("h2"))
    print("Section titles:\n{}\n\n".format(strElement))
    article += strElement
    strElement, countH3 = _convertHtmlToStr(articleHtmlBody.find_all("h3"))
    print("Subsection titles:\n{}\n\n".format(strElement))
    article += strElement
    strElement, countH4 = _convertHtmlToStr(articleHtmlBody.find_all("h4"))
    print("Subsubsection titles:\n{}\n\n".format(strElement))
    article += strElement
    strElement, countP = _convertHtmlToStr(articleHtmlBody.find_all("p"))
    article += strElement
    countP += 1

    driver.close()

    counts = [countTitle, countAbstract, countH2, countH3, countH4, countP]
    return article, counts

url = "https://ieeexplore.ieee.org/document/6809191"
article, counts = scrape_article(url)

In [None]:
import yake

In [None]:
# Keyword search and analysis

language = "en"
max_ngram_size = 2
deduplication_threshold = 0.9
numOfKeywords = 50  # originally 10

custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
keywords = custom_kw_extractor.extract_keywords(article)


## TASK 2
Write a simple Python script that outputs the histogram of word frequencies in the document, excluding stopwords (see the examples in the online NLTK book). Use the spaCy named-entity tagger to identify person and organization named entities in the document.
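
The counting step can be illustrated without any third-party package. The sketch below is a minimal example, assuming a tiny hand-written stopword set (`STOPWORDS`) and a made-up sample sentence; the notebook itself relies on NLTK's English stopword list and plots with matplotlib.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption for this example only).
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}


def word_frequencies(text, stopwords=STOPWORDS):
    """Count lowercase alphabetic tokens that are not stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


sample = "The summary of the document is a summary of keywords in the document."
freqs = word_frequencies(sample)

# Text-only histogram: one '#' per occurrence.
for word, count in freqs.most_common():
    print(f"{word:<10} {'#' * count}")
```

Lowercasing before filtering avoids having to add capitalized variants such as "The" to the stopword list by hand.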

In [None]:
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import numpy as np
from nltk.tokenize import word_tokenize
from collections import Counter

all_stopwords = stopwords.words('english')
all_stopwords.append('The')

In [None]:
text_tokens = word_tokenize(article)
tokens_without_sw = [word for word in text_tokens if word.isalpha() and word not in all_stopwords]
#print(filtered_sentence)

In [None]:

### Count the word frequencies and plot the 20 most common words ###
wordCounts = Counter(tokens_without_sw).most_common(20)

words, occurrences = zip(*wordCounts)
fig, ax = plt.subplots(figsize=(18,5))
ax.bar(np.arange(len(words)), occurrences, align='center')
ax.set_xticks(np.arange(len(words)))
ax.set_xticklabels(words, rotation='vertical')
ax.set_ylabel('Word count')
ax.set_xlabel('Word')
plt.show()

In [None]:
#Use SpaCy to identify person-named entities and organization-named entities
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm

# tips: https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

#Identifying person and organization-named entities
wordsInStr = ""
for word in tokens_without_sw:
    wordsInStr += word
    wordsInStr += " "
    
nlp = en_core_web_sm.load()
doc = nlp(wordsInStr)

#Print only ORG or PERSON labeled entities
if doc.ents:
    for ent in doc.ents:
        if ent.label_ == "ORG" or ent.label_ == "PERSON":
            print(ent.text+ " - " + ent.label_)
else:
    print("No named entities found.")

## TASK 3

We would like the summary to contain frequent words (excluding stopwords) and as many named entities as possible. For this purpose, use the following heuristic to construct the summarizer. First, treat each sentence of the document as an individual sub-document. Use a TF-IDF vectorizer to output the TF-IDF score of each word of each sentence (after an initial preprocessing and WordNet lemmatization stage). Then consider only the sentences that contain person or organization named entities and use a similar approach to output the TF-IDF score of the named entities in each sentence. Finally, construct the weight of sentence $S$ as a weighted sum:
<br>
$$S_{weight}=\sum_{w\in S}W_{TfIdf}+2\sum_{NM\in S}NM_{TfIdf}+POS_S$$
<br>
where $W_{TfIdf}$ is the TF-IDF score of word $w$, $NM_{TfIdf}$ is the TF-IDF score of named entity $NM$ in sentence $S$, and $POS_S$ is the weight associated with the location of the sentence. The location weight is maximal (1) if the sentence is located in the title of the document, 0.5 if it is located in the title of one of the subsections, 0.25 if it is located in the title of one of the subsubsections, 0.1 if it is located in a representative object of the document, and 0 if it is located only in the main text. Make sure to normalize the term TF-IDF and named-entity TF-IDF weights, and suggest a script that implements the preceding accordingly, so that the summary contains the 10 sentences with the highest $S_{weight}$ scores.  
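
To make the arithmetic concrete, here is a small worked example of the weighted sum on three hypothetical sentences. The raw scores are invented for illustration, and `min_max` and `sentence_weights` are names chosen for this sketch; min-max scaling is assumed as the normalization, which is also what the notebook's `MinMaxScaler` does.

```python
def min_max(values):
    """Scale values linearly to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def sentence_weights(term_scores, entity_scores, position_weights):
    """S_weight = normalized term score + 2 * normalized entity score + POS_S."""
    terms = min_max(term_scores)
    entities = min_max(entity_scores)
    return [t + 2 * e + p for t, e, p in zip(terms, entities, position_weights)]


# Hypothetical per-sentence sums of TF-IDF scores (made-up numbers).
term_scores = [1.0, 3.0, 2.0]
entity_scores = [0.0, 0.5, 1.0]
# Sentence 0 is the document title, the other two are main text.
position_weights = [1.0, 0.0, 0.0]

s_weights = sentence_weights(term_scores, entity_scores, position_weights)
ranking = sorted(range(len(s_weights)), key=lambda i: s_weights[i], reverse=True)
print(s_weights)  # [1.0, 2.0, 2.5]
print(ranking)    # [2, 1, 0]
```

In this toy case the doubled entity term lets the third sentence outrank the title despite the title's location bonus.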


In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd

In [None]:
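# Location weights POS_S, keyed by the 1-based position in `counts`
# ([title, abstract, h2, h3, h4, p] as returned by scrape_article).
weights = {
    "1": 1,      # document title
    "2": 0.1,    # abstract (representative object)
    "3": 0.5,    # section titles (h2)
    "4": 0.25,   # subsection titles (h3)
    "else": 0.0  # everything else (h4 titles and main text)
}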

In [None]:
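def _calculateFullScores(sentenceScores, namedEntityScores, counts):
    scaler = MinMaxScaler()
    weightList= []

    if len(counts) > 0:
        # Work on a copy so the caller's list is not modified.
        counts = list(counts)
        # Without h2 headings, drop that level so the following levels move up one weight.
        if counts[2] == 0:
            counts.pop(2)
    else:
        # Plain text without structure: every sentence gets the "else" weight.
        counts = [0, 0, 0, 0, len(sentenceScores)]

    for i in range(len(counts)):
        for _ in range(counts[i]):
            if i > 3:
                weightList.append(weights["else"])
            else:
                weightList.append(weights[str(i+1)])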

    df = pd.DataFrame({
        "Weights": weightList,
        "SentenceScores": sentenceScores,
        "EntityScores": namedEntityScores,
    })

    df[["SentencesScaled"]] = scaler.fit_transform(df[["SentenceScores"]])
    df[["EntitiesScaled"]] = scaler.fit_transform(df[["EntityScores"]])
    df["S_weight"] = df["SentencesScaled"] + (2 * df["EntitiesScaled"]) + df["Weights"]
    return df["S_weight"].tolist()


def _getNamedEntities(article):
    nlp = en_core_web_sm.load()
    doc = nlp(article)
    namedEntities = []
    
    for ent in doc.ents:
        if ent.label_ == "ORG" or ent.label_ == "PERSON":
            namedEntities.append(ent.text)

    return namedEntities


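def _getSentencesWithMaxWeights(weights, sentences, numberOfSentences):
    arr = np.array(weights)
    # argsort is ascending, so the last `numberOfSentences` indexes are the
    # highest weights, ordered from lowest to highest (top sentence last).
    indexes = np.argsort(arr)[-numberOfSentences:]
    sentences = np.array(sentences)
    return sentences[indexes]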


def _preProcess(document):
    stopwords = list(set(nltk.corpus.stopwords.words('english')))
    WN_lemmatizer = WordNetLemmatizer()
    sentences = sent_tokenize(document)
    processedSentences = []
    tokens = []

    for sentence in sentences:
        words = word_tokenize(sentence)
        words = [WN_lemmatizer.lemmatize(word, pos="v") for word in words]

        # get rid of numbers and Stopwords
        words = [word for word in words if word.isalpha() and word not in stopwords]
        processedSentences.append(' '.join(word for word in words))
        tokens.extend(words)

    return processedSentences, tokens


def _tfidfScores(corpus, sentences):
    # An empty corpus (e.g. no named entities found) has no vocabulary to fit.
    if len(corpus) == 0:
        return [0.0] * len(sentences)

    tfidf = TfidfVectorizer()
    fittedVectorizer = tfidf.fit(corpus)
    vectors = fittedVectorizer.transform(sentences).toarray()

    # Sentence score = sum of the TF-IDF weights of its terms.
    return vectors.sum(axis=1).tolist()

In [None]:
def findTopSentences(document, numberOfSentences, isUrl):
    sentences, tokens = _preProcess(document)
    sentenceTfidfScores = _tfidfScores(tokens, sentences)
    namedEntitiesTfidfScores = _tfidfScores(_getNamedEntities(document), sentences)
    time.sleep(0.1)
    SWeight = []
    if isUrl:
        SWeight = _calculateFullScores(sentenceTfidfScores, namedEntitiesTfidfScores, counts)
    else:
        SWeight = _calculateFullScores(sentenceTfidfScores, namedEntitiesTfidfScores, [])
    topSentences = _getSentencesWithMaxWeights(SWeight, sent_tokenize(document), numberOfSentences)
    return list(topSentences)

topSentences = findTopSentences(article, 10, True)
for sentence in topSentences:
    print("{}\n".format(sentence))


## TASK 4
Test the above approach on the Opinosis dataset, available at https://kavita-ganesan.com/opinosis-opinion-dataset/#.YVw6J5ozY2x, and record the corresponding ROUGE-2 and ROUGE-3 evaluation scores. 
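
For reference, ROUGE-N compares the word n-grams of a candidate summary (the peer) with those of a gold summary (the model). The sketch below is a generic, self-contained version written for illustration; it is not the `rougescore` package's implementation, and the function names and sample sentences are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(peer_tokens, model_tokens, n):
    """Return (precision, recall, F1) of clipped n-gram overlap."""
    peer, model = ngrams(peer_tokens, n), ngrams(model_tokens, n)
    overlap = sum((peer & model).values())  # each n-gram counted at most min(peer, model) times
    precision = overlap / max(sum(peer.values()), 1)
    recall = overlap / max(sum(model.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


peer = "the gps is very accurate".split()
model = "the gps is accurate and easy".split()

p2, r2, f2 = rouge_n(peer, model, 2)
p3, r3, f3 = rouge_n(peer, model, 3)
print("ROUGE-2:", p2, r2, round(f2, 4))
print("ROUGE-3:", round(p3, 4), r3, round(f3, 4))
```

Note that this sketch expects token lists. If raw strings were passed instead, the slicing in `ngrams` would yield character n-grams, which inflates the scores considerably.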

In [None]:
# ROUGE-2 and ROUGE-3 scoring
#https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460
#pip install git+https://github.com/bdusell/rougescore.git

In [None]:
import rougescore as rouge

def getRouge(peer, model):
    rougeBi = rouge.rouge_2(peer, model, 1)
    rougeTri = rouge.rouge_3(peer, model, 1)
    
    return rougeBi,rougeTri

In [None]:
def read_folder(dir):
    # Read every file in the folder into one string per file.
    topic = []

    for file in sorted(os.listdir(dir)):
        with open(os.path.join(dir, file)) as f:
            topic.append(" ".join(f.readlines()))

    return topic

def create_model(dir):
    # One list of gold summaries per topic folder; sorted so that the order is reproducible.
    model = []

    for folder in sorted(os.listdir(os.path.join(dir, "summaries-gold"))):
        gold = read_folder(os.path.join(dir, "summaries-gold", folder))
        model.append(gold)

    return model

In [None]:
def summarize_topics(dir):
    # Summarize every topic file with the Task 3 summarizer (10 sentences each).
    list_summary = []
    for file in sorted(os.listdir(os.path.join(dir, "topics"))):
        with open(os.path.join(dir, "topics", file)) as f:
            doc = " ".join(f.readlines())

        list_summary.append(findTopSentences(doc, 10, False))

    return list_summary


In [None]:
import os

directory = "C:/Users/Markus/Documents/studies/NLP/data/Opinosis_dataset/"

# The function has its own name so that re-running this cell does not overwrite it.
summary = summarize_topics(directory)
model = create_model(directory)

In [None]:
#lause = ""
#for senctence in summary:
#    lause += sentence

#malli = ["The battery life of the ipod nano is very short. It seems to continue", "using battery even when the ipod is not in use, otherwise, it's a great product."]
#result = rouge.rouge_2(lause, model, 1)
#print(result)

In [None]:
#summary = ['It has worked well for local driving giving accurate directions for roads and streets .', ",  Very Accurate but with one small glitch I found ,  I'll explain in the CONS\n This is a great GPS, it is so easy to use and it is always accurate .", 'The Garmin is loaded with very accurate maps that generally know the roads in even the remotest areas .', "I used it the day I bought it,   and then this morning, and as soon as it comes on it is  ready to navigate  The only downfall of this product, and the only reason I did not give it 5 stars is the fact that the speed limit it displays for the road you are on isn't 100% accurate .", 'Depending on what you are using it for, it is a nice adjunct to a travel trip and the directions are accurate and usually the quickest, but not always .', "I'm really glad I bought it though, and like the easy to read graphics, the voice used to tell you the name of the street you are to turn on, the uncannily accurate estimates of mileage and time of arrival at your destination .", 'My new Garmin 255w had very Easy Set Up, Accurate Directions to locations, User Friendly Unit to anyone in my vehicle who tried it .', 'In closing, this is a fantastic GPS with some very nice features and is very accurate in directions .', 'but for the most part, we find that the Garmin software provides accurate directions, whereever we intend to go .', '0 out of 5 stars Inexpensive, accurate, plenty of features, August 6, 2009\n  The only glitch I have found so far is that the speed limits are not 100% accurate, although the GPS, amazingly, is able to very accurately tell you how fast your vehicle is moving .']
#summary = " ".join(summary)
#print(type(summary))
#model = ['This unit is generally quite accurate.  \n Set-up and usage are considered to be very easy. \n The maps can be updated, and tend to be reliable.\n', "The Garmin seems to be generally very accurate.\n It's easy to use with an intuitive interface.", 'It is very accurate, even in destination time.\n', 'Very accurate with travel and destination time.\n Negatives are not accurate with speed limits and rural roads.', 'Its accurate, fast and its simple operations make this a for sure buy.']
#bi, tri = getRouge(summary, model)

In [None]:
list_score = []

for i in range(len(summary)):
    summary_str = " ".join(summary[i])
    bi, tri = getRouge(summary_str, model[i])
    list_score.append((bi,tri))

In [None]:
import dataframe_image as dfi

df = pd.DataFrame(list_score, columns=['rouge2', 'rouge3'])
df.index.name = 'topic number'
df.loc['mean'] = df.mean()
df
#dfi.export(df, 'dataframe.png')

## TASK 5

[x] We would like to improve the summarization by taking into account the diversity among the sentences, in the sense that we would like to minimize redundancy among them. For this purpose, we shall use the sentence-to-sentence semantic similarity introduced in the NLP lab. 

[x] Next, instead of recording only the 10 sentences with the highest $S_{weight}$ scores, we shall record the top 20 sentences in terms of $S_{weight}$ scores. The selection of the top 10 sentences among these 20 then proceeds as follows. 

[x] First, order the 20 sentences in decreasing order of their $S_{weight}$ scores, say S1, S2, …, S20 (where S1 is the top-ranked and S20 the 20th-ranked sentence). 

[x] Second, we assume that S1 is always included in the summary, and we then look for the other sentences among S2 to S20 to include in the summary. 

[x] Calculate the sentence-to-sentence similarity Sim(S1, Si) for i = 2 to 20. The sentence Sj that yields the minimum similarity with S1 is included in the summary. 

[x] Next, for each of the remaining sentences Sk (with k different from 1 and j), calculate the sentence similarity with Sj. The sentence Sp that yields the minimum value of “Sim(Sp, S1) + Sim(Sp, Sj)” is included in the summary (note: the quantity Sim(Sp, S1) was already calculated in the previous step).  

[x] Similarly, in the next phase, select a sentence Sl (l different from 1, j and p) that minimizes “Sim(Sl, S1) + Sim(Sl, Sj) + Sim(Sl, Sp)”, and so on. 

[x] Stop once 10 sentences have been included in the summary. 

[ ] Suggest a script that implements this process and illustrate how it works on the example you chose in 1).
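
The greedy procedure described above can be sketched in a few lines. This is a minimal illustration that assumes the input list is already ordered by decreasing $S_{weight}$ and swaps in a simple word-overlap (Jaccard) similarity for the semantic similarity from the lab; `jaccard`, `select_diverse` and the sample sentences are names and data invented for this example.

```python
def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a stand-in for a semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_diverse(ranked, k, sim=jaccard):
    """Greedily pick k sentences from `ranked` (sorted by decreasing S_weight).

    The top sentence is always kept. Each further pick minimizes the summed
    similarity to all sentences picked so far.
    """
    picked = [ranked[0]]
    remaining = list(ranked[1:])
    totals = [0.0] * len(remaining)  # running sum of similarities to the picked sentences
    while remaining and len(picked) < k:
        last = picked[-1]
        # Only the similarity to the newest pick is new; earlier terms are reused.
        totals = [t + sim(s, last) for t, s in zip(totals, remaining)]
        best = min(range(len(remaining)), key=totals.__getitem__)
        picked.append(remaining.pop(best))
        totals.pop(best)
    return picked


ranked = [
    "the battery life is short",
    "the battery life is very short",
    "the screen is bright",
    "maps are accurate",
]
picked = select_diverse(ranked, 3)
print(picked)
```

Keeping a running total per candidate means each round computes only one new similarity per remaining sentence, as the note in the task description suggests.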

In [None]:
# Notes to myself so that I can keep track of the instructions
# 1. Create a list of the 20 sentences with the highest S_weight scores (s1, s2, s3, ..., s20)
# 2. s1 is the first sentence of the summary
#    2.1 remove s1 from the list
# 3. Compare the remaining sentences with s1. The sentence least similar to s1 is added to the summary and called s(j)
#    3.1 remove s(j) from the list
# 4. Compare the remaining sentences with s(j); again the lowest value is added to the summary. The added sentence is s(p)
#    4.1 remove the sentence

In [None]:
#print(len(sentences))

In [None]:
#download larger pipeline package for spaCy
#python -m spacy download en_core_web_lg # more accurate but about 770 MB

#python -m spacy download en_core_web_sm # much smaller but not as accurate

In [None]:
# Determine s1
def find_first_sentence(sentences):
    picked_sentences = []

    # s1 is taken from the end of the list, so the list should be ordered by
    # increasing S_weight (highest-weighted sentence last).
    # Remove the choice from the candidate list and add it to the summary list.
    s1 = sentences.pop()
    picked_sentences.append(s1)

    return picked_sentences

In [None]:
# Select the remaining nine sentences
def sentence_to_sentence(sentences):
    # list of similarity scores
    sim_score = []
    picked_sentences = find_first_sentence(sentences)
    # loop until 10 sentences have been picked (or no candidates are left);
    # uses the spaCy pipeline `nlp` loaded in the Task 2 cell
    while len(picked_sentences) < 10 and len(sentences) > 0:
        sim_score.clear()

        for sentence in sentences:
            nlp_sentence = nlp(str(sentence))
            score = 0

            for p_sentence in picked_sentences:
                # compare the two sentences
                nlp_p_sentence = nlp(str(p_sentence))

                score += nlp_p_sentence.similarity(nlp_sentence)

            sim_score.append(score)

        # pick the candidate with the lowest summed similarity to the summary so far
        min_value = min(sim_score)
        min_index = sim_score.index(min_value)

        picked_sentences.append(sentences.pop(min_index))
    return picked_sentences


In [None]:
picked_sentences = sentence_to_sentence(findTopSentences(article, 20, True))

In [None]:
print("Summarized text")
print(picked_sentences)

## TASK 6

We would like to base the choice of keywords not on histogram frequency but on the open-source RAKE (https://www.airpair.com/nlp/keyword-extraction-tutorial). Repeat the previous process, selecting the sentences that are associated with the first ten keywords generated by RAKE. Comment on the quality of this summarizer based on your observations.
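
As background, RAKE splits the text into candidate phrases at stopwords and punctuation, scores each word by its degree divided by its frequency, and scores a phrase as the sum of its word scores. The sketch below is a simplified re-implementation for illustration only, not the code of the RAKE-tutorial package; the short stopword set, the function names and the sample text are assumptions.

```python
import re
from collections import defaultdict

# Short illustrative stopword list (an assumption; RAKE ships a much longer one).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "over", "the", "to"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of non-stopwords, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z]+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Return (phrase, score) pairs sorted by decreasing score, then name."""
    phrases = candidate_phrases(text)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


text = "Linear constraints over natural numbers. Strict linear inequations are compatible."
ranked_keywords = rake_scores(text)
for phrase, score in ranked_keywords:
    print(f"{score:4.1f}  {phrase}")
```

Because the phrase score is a sum over words, longer phrases tend to rank higher, which is worth keeping in mind when commenting on the quality of the resulting summary.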

In [None]:
# The setup file in the repo could not be installed on Windows without a fix
#git clone https://github.com/zelandiya/RAKE-tutorial
#cd RAKE-tutorial

# Before installing, open setup.py and remove the slash (/) from the paths:
#package_dir={'nlp_rake': './'} and
#package_data={'nlp_rake': ['data/']}

# I changed the name "nlp-rake" to just "rake" in the setup file.

# The screenshot setup_korjaus in the GitHub repo shows the fix, after which installing the package works
#python setup.py install 



In [None]:
# I installed the module in a different location than the Jupyter server path; this fixes the import path
import sys 
sys.path.append("C:/Users/Markus/Documents/studies/NLP/RAKE-tutorial")

In [None]:
import rake 
import operator

In [None]:
# Adjust the path; the file is available in the GitHub repo
rake_object = rake.Rake("C:/Users/Markus/Documents/studies/NLP/NLP/SmartStoplist.txt", 5, 3, 4) 

In [None]:
#sample_file = open("C:/NLP/RAKE-tutorial/data/docs/fao_test/w2167e.txt", 'r') # set the text you want to process
#text = sample_file.read()
sentenceList = rake.split_sentences(article)


In [None]:
keywords = rake_object.run(article)
#print("Keywords:", keywords[0:10])  # the first 10
# Take the ten highest-ranked keyword phrases (fewer if RAKE returns fewer).
keywords_topten = [keyword[0] for keyword in keywords[:10]]

print(keywords_topten)

In [None]:
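# extract the sentences that contain the largest number of the top-ten keywords
dct = {}
for sentence in sentenceList:
    # compare in lowercase so that capitalization in the sentence does not hide a match
    dct[sentence] = sum(1 for word in keywords_topten if word.lower() in sentence.lower())

max_hits = max(dct.values())
rake_sentences = [key for key, value in dct.items() if value == max_hits]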


print("\n".join(rake_sentences))

In [None]:
#Comparing results
print("Sentences in original text: {}, summarized amount: {}".format(len(sentenceList),len(rake_sentences)))

In [None]:
#Comment on results:

## TASK 7

It is also suggested to explore alternative implementations that provide a larger number of summarization approaches, such as https://github.com/miso-belica/sumy. Show how each of the implemented summarizers behaves when given the same document you used in the previous case.

In [None]:
#https://github.com/miso-belica/sumy
#pip install sumy

In [None]:
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as LSASummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer as LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer as LuhnSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

LANGUAGE = "english"
SENTENCES_COUNT = 10

In [None]:
def sumySummarize(article):
    stemmer = Stemmer(LANGUAGE)
    summarizers = [LexRankSummarizer(stemmer), LSASummarizer(stemmer), LuhnSummarizer(stemmer)]
    parser = PlaintextParser.from_string(article, Tokenizer(LANGUAGE))
    results = []
    
    for summarizer in summarizers:
        summarizer.stop_words = get_stop_words(LANGUAGE)
        sentences = []
        for sentence in summarizer(parser.document, SENTENCES_COUNT):
            sentences.append(str(sentence))
        results.append(sentences)
    
    return results

sumySentences = sumySummarize(article)
for sentences in sumySentences:
    print("{}\n\n".format(sentences))

## TASK 8

Now we would like to compare the above summarizers, namely those in 3), 5) and 7), on a new dataset constructed as follows. First, select an Elsevier journal of your choice and pick 10 papers that are highly ranked in the journal according to citation index (the papers should be well structured, containing an Abstract, an Introduction and a Conclusion). 

For each of the ten papers, consider the Introduction as the main document to which the summarizer is applied, and consider the Abstract and the Conclusion as two golden summaries of the document that you can use for assessment with ROUGE-1 and ROUGE-2. 

Report the evaluation score of each summarizer in a table. 

In [None]:
# ROUGE-1 and ROUGE-2 scoring code

In [None]:
files = []
golds = []
for i in range(1,11):
    with open('C:/Users/Markus/Documents/studies/NLP/NLP/Data/Task8_articles/article{}.txt'.format(i), encoding="utf8") as f:
        text = f.readlines()
        text = " ".join(text)
        res = text.split("\n \n")
        files.append(res[1])
        temp = []
        temp.append(res[0])
        temp.append(res[2])
        golds.append(temp)


In [None]:
def summarize_introduction(introduction):
    task_3_result = findTopSentences(introduction, 10, False)
    task_5_result = sentence_to_sentence(findTopSentences(introduction, 20, False))
    task_7_results = sumySummarize(introduction)
    result = [task_3_result, task_5_result, task_7_results[0], task_7_results[1], task_7_results[2]]
    #result = [task_5_result]
    return result

#peer str(summary), model list(gold_summaries)
def calculate_rouge(peer, model):

    rougeUno = rouge.rouge_1(peer, model, 1)
    rougeBi = rouge.rouge_2(peer, model, 1)

    return rougeUno,rougeBi

task8_rouge_scores = []

for i in range(10):
    summarization_results = summarize_introduction(files[i])
    results = []
    for result in summarization_results:
        result_str = " ".join(result)
        rouge_score = calculate_rouge(result_str, golds[i])
        results.append(rouge_score)
    task8_rouge_scores.append(results)
        
print(task8_rouge_scores)


In [4]:
score_transpose = []

for i in task8_rouge_scores:
    new_list = []
    
    for a in i:
        new_list.append(a[0])
        new_list.append(a[1])
        
    score_transpose.append(new_list)

In [6]:
col = pd.MultiIndex.from_arrays([['Sweight','Sweight', 'Sentence to sentence','Sentence to sentence', 'LexRank','LexRank', 'LSA','LSA', 'Luhn','Luhn'],
                                ["Rouge 1", "Rouge 2","Rouge 1", "Rouge 2","Rouge 1", "Rouge 2", "Rouge 1", "Rouge 2", "Rouge 1", "Rouge 2"]])
data = pd.DataFrame(score_transpose, columns=col)


data.index += 1
data.loc['mean'] = data.mean()

data

| Paper | Sweight Rouge 1 | Sweight Rouge 2 | Sentence to sentence Rouge 1 | Sentence to sentence Rouge 2 | LexRank Rouge 1 | LexRank Rouge 2 | LSA Rouge 1 | LSA Rouge 2 | Luhn Rouge 1 | Luhn Rouge 2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.786417 | 0.72171 | 0.836462 | 0.764451 | 0.88178 | 0.770568 | 0.793777 | 0.7442 | 0.778105 | 0.723498 |
| 2 | 0.814556 | 0.719457 | 0.865453 | 0.752024 | 0.913803 | 0.790072 | 0.928253 | 0.804403 | 0.865105 | 0.770384 |
| 3 | 0.772098 | 0.636921 | 0.816362 | 0.664568 | 0.753766 | 0.657744 | 0.736401 | 0.62233 | 0.726089 | 0.644539 |
| 4 | 0.849633 | 0.761986 | 0.942759 | 0.824017 | 0.847242 | 0.762476 | 0.866905 | 0.791071 | 0.789841 | 0.714029 |
| 5 | 0.766896 | 0.647851 | 0.825667 | 0.706805 | 0.847481 | 0.726174 | 0.761547 | 0.652283 | 0.668436 | 0.569714 |
| 6 | 0.490887 | 0.43044 | 0.522996 | 0.452512 | 0.609344 | 0.52374 | 0.576275 | 0.497893 | 0.440459 | 0.395546 |
| 7 | 0.710258 | 0.626785 | 0.727599 | 0.650646 | 0.787852 | 0.70793 | 0.71845 | 0.645504 | 0.738662 | 0.655506 |
| 8 | 0.930017 | 0.829865 | 0.962865 | 0.850697 | 0.944962 | 0.850816 | 0.955183 | 0.845333 | 0.950058 | 0.848635 |
| 9 | 0.805026 | 0.737957 | 0.873221 | 0.789644 | 0.916003 | 0.81558 | 0.844484 | 0.773928 | 0.846774 | 0.769874 |
| 10 | 0.821067 | 0.732715 | 0.821953 | 0.725861 | 0.863125 | 0.776735 | 0.844484 | 0.773928 | 0.824069 | 0.75139 |


In [7]:
data.mean()

Sweight               Rouge 1    0.774686
                      Rouge 2    0.684569
Sentence to sentence  Rouge 1    0.819533
                      Rouge 2    0.718122
LexRank               Rouge 1    0.836536
                      Rouge 2    0.738183
LSA                   Rouge 1    0.802576
                      Rouge 2    0.715087
Luhn                  Rouge 1    0.762760
                      Rouge 2    0.684312
dtype: float64

## TASK 9

Design a simple GUI that allows the user to input a text or a link to a document to be summarized, and that outputs the summary produced by the approach in 3) and by the algorithms implemented in 7).

In [None]:
# run simpleGUI.py