# NLP project


Project 21: Automatic Summarization  

We shall consider a structured document containing a title, an abstract and a set of subsections. We would like to build a text summarizer that tracks important keywords in the document. For this purpose, the first step is to identify these keywords.  

In [None]:
%pip install --upgrade pip

In [None]:
!pip list
# check that these are installed: lxml, html5lib, requests, selenium, webdriver-manager
# further instructions in Task 1

In [None]:
# if nltk is not found, install it -> !pip install nltk
import nltk
nltk.download("stopwords")
nltk.download("punkt")  # needed by sent_tokenize, which Task 1 already uses
#from nltk.cluster.util import cosine_distance

## TASK 1
Assume the initial input is given as an HTML document (choose an example of your own). We hypothesize that important keywords are initially contained in the words of the title, the abstract and possibly the titles of the sections and subsections of the document. Suggest a simple Python script that takes an HTML document as input and outputs the lists of words in the title, the abstract and the section/subsection titles.
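As a dependency-free illustration of the idea, the sketch below uses only the standard library's `html.parser` on a small made-up page. The sample HTML, the `HeadingCollector` class and the choice of class names (`document-title`, `abstract-text`, mirroring the ones used in the cell below) are assumptions for illustration; real pages may nest elements differently, and nested `div` elements inside the abstract would need extra handling.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; the class names mirror the ones used in this notebook.
SAMPLE_HTML = """
<html><body>
<h1 class="document-title">Automatic Text Summarization</h1>
<div class="abstract-text">We study keyword-based summarization.</div>
<div id="article">
<h2>Introduction</h2><p>Body text here.</p>
<h3>Related Work</h3><p>More body text.</p>
</div>
</body></html>
"""


class HeadingCollector(HTMLParser):
    """Collect the words found in the title, abstract and heading elements."""

    def __init__(self):
        super().__init__()
        self.words = {"title": [], "abstract": [], "sections": []}
        self._target = None  # bucket that receives the current text, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and "document-title" in classes:
            self._target = "title"
        elif tag == "div" and "abstract-text" in classes:
            self._target = "abstract"
        elif tag in ("h2", "h3", "h4"):
            self._target = "sections"

    def handle_endtag(self, tag):
        # Stop collecting when a heading or div closes.
        if tag in ("h1", "h2", "h3", "h4", "div"):
            self._target = None

    def handle_data(self, data):
        if self._target:
            # Keep alphabetic words (hyphens allowed), dropping punctuation.
            self.words[self._target].extend(re.findall(r"[A-Za-z][A-Za-z-]*", data))


collector = HeadingCollector()
collector.feed(SAMPLE_HTML)
print(collector.words)
```

Body paragraphs are ignored on purpose, since only the title, abstract and headings are hypothesized to carry the initial keywords.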

In [None]:
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from nltk.tokenize import sent_tokenize
import time
# Not all sites return the full HTML body with a plain request; Selenium seems to work on more of them.
# pip install -U selenium
# pip install webdriver-manager
# if you use Anaconda and it does not work, also try: $ conda update pip

# Collect title, subtitles, abstract and body text from html file.
# Print out titles and abstract and construct one string based on
# the elements.

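def _convertHtmlToStr(elements):
    # Join the text of all elements that contain more than one word into a
    # single string, ending each with a period so that sent_tokenize can
    # split them. Returns the string and its sentence count.
    text = ""  # renamed from `str` to avoid shadowing the built-in
    for element in elements:
        if len(element.text.split()) > 1:
            text += element.text
            if not text.endswith("."):
                text += "."
            text += " "
    sentences = sent_tokenize(text)
    return text, len(sentences)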

url = "https://ieeexplore.ieee.org/document/6809191"
article = ""
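# Selenium 4 takes the driver path through a Service object.
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))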
driver.get(url)

# Wait for article to fully load
time.sleep(3)

soup = BeautifulSoup(driver.page_source, 'lxml')
strElement = ""
countTitle, countAbstract, countH2, countH3, countH4, countP = 0, 0, 0, 0, 0, 0

strElement, countTitle = _convertHtmlToStr(soup.find("h1", {"class": "document-title"}))
print("Title:\n{}\n\n".format(strElement))
article += strElement
article += ". "
strElement, countAbstract = _convertHtmlToStr(soup.find("div", {"class": "abstract-text"}))
print("Abstract:\n{}\n\n".format(strElement))
article += strElement

articleHtmlBody = soup.find("div", {"id": "article"})
if articleHtmlBody is None:
    raise ValueError("Article body (div with id 'article') not found in the page")

strElement, countH2 = _convertHtmlToStr(articleHtmlBody.find_all("h2"))
print("Section titles:\n{}\n\n".format(strElement))
article += strElement
strElement, countH3 = _convertHtmlToStr(articleHtmlBody.find_all("h3"))
print("Subsection titles:\n{}\n\n".format(strElement))
article += strElement
strElement, countH4 = _convertHtmlToStr(articleHtmlBody.find_all("h4"))
print("Subsubsection titles:\n{}\n\n".format(strElement))
article += strElement
strElement, countP = _convertHtmlToStr(articleHtmlBody.find_all("p"))
article += strElement
countP += 1

driver.close()

counts = [countTitle, countAbstract, countH2, countH3, countH4, countP]

In [None]:
import yake

In [None]:
#Keyword search and analysis

w_extractor = yake.KeywordExtractor()

language = "en"
max_ngram_size = 2
deduplication_threshold = 0.9
numOfKeywords = 50  # originally 10

custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
keywords = custom_kw_extractor.extract_keywords(article)

for kw in keywords:
    print(kw)


## TASK 2
Write a simple Python script that outputs the histogram of word frequencies in the document, excluding stopwords (see the examples in the online NLTK book). Use the spaCy named-entity tagger to identify person and organization named entities in the document.
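The counting step can be illustrated without any external library. The sketch below is only a toy version under stated assumptions: the stopword set is a tiny hand-picked list (the cells below use NLTK's English list), the sample sentence is made up, and the histogram is printed as text instead of being plotted. It does not cover the named-entity part.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on"}


def word_frequencies(text):
    """Count lowercase alphabetic tokens that are not stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


sample = "The cat sat on the mat. The cat is a happy cat."
freqs = word_frequencies(sample)

# Text histogram: one '#' per occurrence, most frequent first.
for word, count in freqs.most_common(3):
    print(f"{word:<8}{'#' * count}")
```

Lowercasing before filtering means that "The" and "the" are treated alike, so no capitalized variants need to be appended to the stopword list.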

In [None]:
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import numpy as np
from nltk.tokenize import word_tokenize
from collections import Counter

all_stopwords = stopwords.words('english')
all_stopwords.append('The')

In [None]:
text_tokens = word_tokenize(article)
tokens_without_sw = [word for word in text_tokens if word.isalpha() and word not in all_stopwords]

print(tokens_without_sw)
#print(filtered_sentence)

In [None]:

### Count the occurrences of every word and plot the 20 most frequent ###
wordCounts = Counter(tokens_without_sw).most_common()

print(wordCounts)

wordCounts = wordCounts[0:20]

words, occurrences = zip(*wordCounts)
fig, ax = plt.subplots(figsize=(18,5))
plt.bar(np.arange(len(words)), occurrences, align='center')
plt.xticks(np.arange(len(words)), words, rotation='vertical')
plt.ylabel('Word count')
plt.xlabel('Word')
plt.show()

In [None]:
#Use SpaCy to identify person-named entities and organization-named entities
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm

# hint: https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

#Identifying person and organization-named entities
wordsInStr = ""
for word in tokens_without_sw:
    wordsInStr += word
    wordsInStr += " "
    
nlp = en_core_web_sm.load()
doc = nlp(wordsInStr)

#Print only ORG or PERSON labeled entities
if doc.ents:
    for ent in doc.ents:
        if ent.label_ == "ORG" or ent.label_ == "PERSON":
            print(ent.text+ " - " + ent.label_)
else:
    print("No named entities found.")

## TASK 3

We would like the summary to contain frequent wording (excluding stopwords) and as many named entities as possible. For this purpose, use the following heuristic to construct the summarizer. First, treat each sentence of the document as an individual sub-document. Use a TF-IDF vectorizer to output the TF-IDF score of each word of each sentence (after an initial preprocessing and WordNet lemmatization stage). Then consider only the sentences that contain person or organization named entities and use a similar approach to output the TF-IDF score of the named entities in each sentence. Finally, construct the weight of sentence $S$ as a weighted sum:
<br>
$$S_{weight}=\sum_{w\in S}W_{TfiDf}+2\sum_{NM\in S}NM_{TfiDf}+POS_s$$
<br>
where $W_{TfiDf}$ is the TF-IDF score of word $w$ in sentence $S$, $NM_{TfiDf}$ is the TF-IDF score of named entity $NM$ in sentence $S$, and $POS_s$ is the weight associated with the location of the sentence. The location weight is maximal (1) if the sentence is located in the title of the document, 0.5 if in the title of a subsection, 0.25 if in the title of a subsubsection, 0.1 if in a representative object of the document, and 0 if only in the main text. Make sure to normalize the word and named-entity TF-IDF weights, and suggest a script that implements the preceding so that the summary contains the 10 sentences with the highest $S_{weight}$ scores.  
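To make the formula concrete, here is a small worked example in plain Python. All numbers are made up for illustration, and min-max scaling is assumed as the normalization (the same choice as `MinMaxScaler` in the cells below); `min_max` is a hypothetical helper, not part of any library.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy per-sentence sums (made-up numbers, not results from the notebook).
word_tfidf = [2.0, 4.0, 3.0]      # sum of word TF-IDF scores per sentence
entity_tfidf = [0.0, 1.0, 0.5]    # sum of named-entity TF-IDF scores per sentence
position = [1.0, 0.0, 0.5]        # title, main text, subsection title

# S_weight = normalized word score + 2 * normalized entity score + position weight
s_weight = [
    w + 2 * e + p
    for w, e, p in zip(min_max(word_tfidf), min_max(entity_tfidf), position)
]
print(s_weight)  # [1.0, 3.0, 2.0]
```

In this toy case the main-text sentence wins despite its position weight of 0, because the entity term counts double after normalization.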


In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd

In [None]:
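# Location weights, keyed by the 1-based position in `counts`:
# 1 = title, 2 = abstract (representative object), 3 = section titles (h2),
# 4 = subsection titles (h3); everything else (h4, body text) gets 0.
weights = {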
    "1": 1,
    "2": 0.1,
    "3": 0.5,
    "4": 0.25,
    "else": 0.0
}

In [None]:
def _calculateFullScores(sentenceScores, namedEntityScores, counts):
    scaler = MinMaxScaler()
    weightList= []

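    if len(counts) > 0:
        counts = list(counts)  # work on a copy so the caller's list is not modified
        if counts[2] == 0:
            counts.pop(2)  # no section titles found: drop the empty slot
    else:
        counts = [0, 0, 0, 0, len(sentenceScores)]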

    for i in range(len(counts)):
        for j in range(counts[i]):
            if i > 3:
                weightList.append(weights["else"])
            else:
                weightList.append(weights[str(i+1)])

    df = pd.DataFrame({
        "Weights": weightList,
        "SentenceScores": sentenceScores,
        "EntityScores": namedEntityScores,
    })

    df[["SentencesScaled"]] = scaler.fit_transform(df[["SentenceScores"]])
    df[["EntitiesScaled"]] = scaler.fit_transform(df[["EntityScores"]])
    df["S_weight"] = df["SentencesScaled"] + (2 * df["EntitiesScaled"]) + df["Weights"]

    return df["S_weight"].tolist()


def _getNamedEntities(article):
    nlp = en_core_web_sm.load()
    doc = nlp(article)
    namedEntities = []
    
    for ent in doc.ents:
        if ent.label_ == "ORG" or ent.label_ == "PERSON":
                namedEntities.append(ent.text)

    return namedEntities


def _getSentencesWithMaxWeights(weights, sentences, numberOfSentences):
    arr = np.array(weights)
    indexes = np.argpartition(arr, -numberOfSentences)[-numberOfSentences:]
    sentences = np.array(sentences)
    return sentences[indexes]


def _preProcess(document):
    stopwords = list(set(nltk.corpus.stopwords.words('english')))
    WN_lemmatizer = WordNetLemmatizer()
    sentences = sent_tokenize(document)
    processedSentences = []
    tokens = []

    for sentence in sentences:
        words = word_tokenize(sentence)
        words = [WN_lemmatizer.lemmatize(word, pos="v") for word in words]

        # get rid of numbers and Stopwords
        words = [word for word in words if word.isalpha() and word not in stopwords]
        processedSentences.append(' '.join(word for word in words))
        tokens.extend(words)

    return processedSentences, tokens


def _tfidfScores(corpus, sentences):
    # Fit the vocabulary on `corpus`, then score each sentence as the sum of
    # the TF-IDF values of the vocabulary terms it contains.
    tfidf = TfidfVectorizer()
    fittedVectorizer = tfidf.fit(corpus)
    vectors = fittedVectorizer.transform(sentences).toarray()
    return vectors.sum(axis=1).tolist()

In [None]:
def findTopSentences(document, numberOfSentences, isUrl):
    sentences, tokens = _preProcess(document)
    sentenceTfidfScores = _tfidfScores(tokens, sentences)
    namedEntitiesTfidfScores = _tfidfScores(_getNamedEntities(document), sentences)
    if isUrl:
        SWeight = _calculateFullScores(sentenceTfidfScores, namedEntitiesTfidfScores, counts)
    else:
        SWeight = _calculateFullScores(sentenceTfidfScores, namedEntitiesTfidfScores, [])
    topSentences = _getSentencesWithMaxWeights(SWeight, sent_tokenize(document), numberOfSentences)
    return topSentences

topSentences = findTopSentences(article, 10, True)
for sentence in topSentences:
    print("{}\n".format(sentence))

In [None]:
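# Adjust the path to your local copy of the Opinosis topic file.
# The with-block closes the file automatically, so no explicit close() is needed.
with open("C:/Users/Markus/Documents/studies/NLP/NLP/accuracy_garmin_nuvi_255W_gps.data") as f:
    testDoc = " ".join(f.readlines())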

s = findTopSentences(testDoc, 20, False)
print(s)


## TASK 4
Test the above approach with the Opinosis dataset available at https://kavita-ganesan.com/opinosis-opinion-dataset/#.YVw6J5ozY2x, and record the corresponding ROUGE-2 and ROUGE-3 evaluation scores. 
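Before relying on a library, it can help to see what ROUGE-N actually counts. The following is a minimal sketch, assuming whitespace tokenization and a single reference; `ngram_counts` and `rouge_n` are hypothetical helper names, and the two sentences are made up. It is not a replacement for the ROUGE packages used in the cells below.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(peer, model, n):
    """Return (recall, precision) of the clipped n-gram overlap of two token lists."""
    peer_counts = ngram_counts(peer, n)
    model_counts = ngram_counts(model, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((peer_counts & model_counts).values())
    recall = overlap / max(sum(model_counts.values()), 1)
    precision = overlap / max(sum(peer_counts.values()), 1)
    return recall, precision


peer = "the voice is clear and loud".split()
model = "the voice is very clear and loud".split()
print(rouge_n(peer, model, 2))  # 4 shared bigrams: recall 4/6, precision 4/5
print(rouge_n(peer, model, 3))  # 2 shared trigrams: recall 2/5, precision 2/4
```

Recall divides the overlap by the number of n-grams in the reference, precision by the number in the generated summary, which is why a very long peer summary tends to score well on recall and poorly on precision.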

In [None]:
# ROUGE-2 and ROUGE-3 scoring
# https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460


In [None]:
#pip install git+https://github.com/bdusell/rougescore.git

'''
In ROUGE, a "peer" summary produced by a machine summarization system is compared against 
one or more hand-written "model" summaries and then assigned a score from 0 to 1. This score is the
F-measure of recall vs. precision, and the evaluator can adjust a parameter α to control whether this 
score favors recall (does the peer summary contain all of the information in the model summaries?) 
or precision (does the peer summary contain only information in the model summaries?). 

When α ≈ 0, this score favors recall; when α ≈ 1, it favors precision. 
In the DUC conferences, α was set to 0, and a hard length limit was imposed on generated summaries. 
The original ROUGE implementation uses α = 0.5 by default.
'''



In [None]:
# test text
peer = """
, and is very, very accurate .
 but for the most part, we find that the Garmin software provides accurate directions, whereever we intend to go .
 This function is not accurate if you don't leave it in battery mode say, when you stop at the Cracker Barrell for lunch and to play one of those trangle games with the tees .
 It provides immediate alternatives if the route from the online map program was inaccurate or blocked by an obstacle .
 I've used other GPS units, as well as GPS built into cars   and to this day NOTHING beats the accuracy of a Garmin GPS .
 It got me from point A to point B with 100% accuracy everytime .
 It has yet to disappoint, getting me everywhere with 100% accuracy .
0 out of 5 stars Honest, accurate review, , PLEASE READ !
 Aside from that, every destination I've thrown at has been 100% accurate .
In closing, this is a fantastic GPS with some very nice features and is very accurate in directions .
 Plus, I've always heard that there are  quirks  with any GPS being accurate, having POIs, etc .
 DESTINATION TIME, , This is pretty accurate too .
 But, it's always very accurate .
 The map is pretty accurate and the Point of interest database also is good .
 Most of the times, this info was very accurate .
I've even used it in the  pedestrian  mode, and it's amazing how accurate it is .
  ONLY  is only accurate when an ad says,  Top sirloin steak, ONLY $1 .
 The most accurate review stated that these machines are adjunct to a good map and signs on the interstate .
 The directions are highly accurate down to a  T  .
 Depending on what you are using it for, it is a nice adjunct to a travel trip and the directions are accurate and usually the quickest, but not always .
 The screen is easy to see, the voice tells you where you are and it's very accurate .
 It was accurate to the minute when it told me when I would arrive home .
0 out of 5 stars GPS Navigator doesn't navigate accurately on a straight road .
 I was familiar with the streets and only used the Nuvi to get an accurate arrival time estimate .
 but after that it is very easy and quite accurate to use .
 The accuracy at this point is very good .
While the 255W routing seems generally accurate and logical, on my first use I discovered that it does have some errors in its internal map .
 Bottom line is I wanted a unit that is accurate and had reliable satellite connection .
 I've used it around town and find it to be extremely accurate .
I found the maps to be inaccurate at first, but after I updated them from Garmin's website everything is golden .
 A lot of my friends' addresses are inaccurate by any GPS .
 It loads quickly, have pretty accurate directions, and can recalculate quickly when I miss a turn .
 Because the accuracy is good to the street address level, it may not be able to guide you to the exact location if your destination is inside a shopping mall .
I updated to the latest 2010 map soon after I received the unit, so the map is accurate to me .
 I was blown away at the accuracy and routing capability this thing had .
 I used it the day I bought it,   and then this morning, and as soon as it comes on it is  ready to navigate  The only downfall of this product, and the only reason I did not give it 5 stars is the fact that the speed limit it displays for the road you are on isn't 100% accurate .
 If your looking for a nice, accurate GPS for not so much money, got with this one .
0 out of 5 stars Inexpensive, accurate, plenty of features, August 6, 2009
 The only glitch I have found so far is that the speed limits are not 100% accurate, although the GPS, amazingly, is able to very accurately tell you how fast your vehicle is moving .
 I was a little disappointed in the inaccuracy of the posted speed limit, as I'm guilty of not paying close enough attention to those signs, especially w  interstate speed traps that are constantly changing up and down .
 The closest one that gives the most accurate route that I usually take is the Navigon .
 After 2 weeks, it has yet to make a mistake, and is always completely accurate ,  even to the point of telling me which side of the street my destination is on .
 It has worked well for local driving giving accurate directions for roads and streets .
The estimated time to arrival does not seem to calculate the travelling time accurately .
Accuracy is as good as any other unit, they all sometimes tell you you have arrived when you haven't, or continue to tell you to turn when you're already there .
 Accuracy is determined by the maps .
 Less traveled rural roads will not be accurate on any unit .
 Accuracy is within a few yards .
What the 255w does best is find a street address, business, point of interest, hospital or airport and give you turn, by, turn directions with amazing accuracy .
 The Garmin is loaded with very accurate maps that generally know the roads in even the remotest areas .
I'm really glad I bought it though, and like the easy to read graphics, the voice used to tell you the name of the street you are to turn on, the uncannily accurate estimates of mileage and time of arrival at your destination .
My new Garmin 255w had very Easy Set Up, Accurate Directions to locations, User Friendly Unit to anyone in my vehicle who tried it .
 I had a GPS 10, years ago when I owned a boat that was difficult to use and with very poor accuracy so I had assumed that the road GPS wasn't any better .
 Practiced visiting places I already knew to see how accurate the directions and maps would be .
 Easy to use, excellent accuracy, nice and intuitive interface .
 The directions provided have all been quite accurate thus far .
,  Very Accurate but with one small glitch I found ,  I'll explain in the CONS
This is a great GPS, it is so easy to use and it is always accurate .
Very easy to operate and pretty accurate as well, only led me astray once and that was in northern Maine where roads are few and paved ones fewer .
 Easy to use and amazed at how accurate this item is .
To date it's been a very easy to use and accurate .
 Mounted really easily and has been very accurate .
 seems to be rather accurate .
 It was accurate on determing original directions and recalculating when necessary .
Highly accurate, POIs are great .
 I can't believe how accurate and detailed the information estimated time of arrival,speed limits along the way,and detailed map of my route, to name a few .
 Speed of calculation, accuracy, and simplicity of operation are top notch .
"""


model = [
    """The voice is a bit robotic.
    The voice is very clear and loud enough.""",
    """Voice is clear and sweet.
    Voice commands are kindly fantastic.""",
    """The voice is very clear and loud.""",
    """The voices sound robotic.
    TTS mode is the most problematic.""",
    """255W garmin gps has more than 750 voices but the most of them sound like robots."""
]

In [None]:
import rougescore as rouge

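# rougescore compares sequences of tokens, so split the texts on whitespace first
# (assumption: plain whitespace tokenization is adequate here).
peerTokens = peer.split()
modelTokens = [m.split() for m in model]

rougeBi = rouge.rouge_2(peerTokens, modelTokens, 1)
print("Rouge 2: ", rougeBi)
rougeTri = rouge.rouge_3(peerTokens, modelTokens, 1)
print("Rouge 3: ", rougeTri)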

In [None]:
# compare the results with a second scorer
from rouge_score import rouge_scorer

In [None]:
scorer = rouge_scorer.RougeScorer(['rouge2', 'rouge3'], use_stemmer=False)
# RougeScorer.score expects (target, prediction): the reference comes first.
scores = scorer.score("The voice is a bit robotic. The voice is very clear and loud enough.", peer)

print(scores)

## TASK 5

[x] We would like to improve the summarization by taking into account the diversity among the sentences, in the sense that we would like to minimize redundancy among them. For this purpose, we shall use the sentence-to-sentence semantic similarity introduced in the NLP lab. 

[x] Next, instead of recording only the 10 sentences with the highest $S_{weight}$ scores, we shall record the top 20 sentences in terms of $S_{weight}$ scores. The selection of the top 10 sentences among these 20 then proceeds as follows. 

[x] First, order the 20 sentences in decreasing order of their $S_{weight}$ scores, say S1, S2, …, S20 (where S1 is the top-ranked and S20 the 20th-ranked sentence). 

[x] Second, we shall assume that S1 is always included in the summary; we then attempt to find the other sentences among S2 to S20 to be included in the summary. 

[x] Calculate the sentence-to-sentence similarity Sim(S1, Si) for i = 2 to 20; the sentence Sj that yields the minimum similarity with S1 is then included in the summary. 

[x] Next, for each of the remaining sentences Sk (with k different from 1 and j), calculate the sentence similarity with Sj. The sentence Sp that yields the minimum value of “Sim(Sp, S1) + Sim(Sp, Sj)” is then included in the summary (note: the quantity Sim(Sp, S1) was already calculated in the previous step).  

[x] Similarly, in the next phase, select a sentence Sl (l different from 1, j and p) that minimizes “Sim(Sl, S1) + Sim(Sl, Sj) + Sim(Sl, Sp)”, and so on. 

[x] Stop once 10 sentences have been included in the summary. 

[ ] Suggest a script that implements this process and illustrate how it works on the example you chose in 1).
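The steps above can be sketched as a greedy loop that keeps a running sum of similarities for every remaining candidate, so each Sim value is computed only once. In this sketch the Jaccard word overlap is a stand-in for the semantic similarity from the lab (an assumption made to keep the example self-contained); `jaccard`, `select_diverse` and the sample sentences are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a stand-in for the semantic similarity."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def select_diverse(ranked, k, sim=jaccard):
    """Greedy selection from sentences ranked by decreasing S_weight."""
    picked = [ranked[0]]                 # S1 is always kept
    remaining = list(ranked[1:])
    totals = [0.0] * len(remaining)      # running sum of Sim(candidate, picked)
    while remaining and len(picked) < k:
        last = picked[-1]
        # Only the similarity to the most recently picked sentence is new.
        totals = [t + sim(s, last) for t, s in zip(totals, remaining)]
        best = min(range(len(remaining)), key=totals.__getitem__)
        picked.append(remaining.pop(best))
        totals.pop(best)
    return picked


ranked = [
    "the gps is very accurate",
    "the gps is accurate",
    "the screen is easy to read",
    "battery life could be better",
]
print(select_diverse(ranked, 3))
```

With these toy sentences the near-duplicate "the gps is accurate" is left out even though it ranks second, which is the redundancy reduction the task asks for. Passing a spaCy-based function as `sim` would switch to the semantic similarity.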

In [None]:
# Notes to myself so that I can keep track of the instructions:
# 1. Create a list of the 20 sentences with the highest S_weight scores (s1, s2, s3, ..., s20)
# 2. s1 is the first sentence of the summary
#     2.1 remove s1 from the list
# 3. Compare the remaining sentences with s1. The sentence least similar to s1 is added to the summary and called s(j)
#     3.1 remove s(j) from the list
# 4. Compare the remaining sentences with s(j); again the lowest value is added to the summary. The added sentence is s(p)
#     4.1 remove the sentence

In [None]:
sentences = [
    "There have been days when I wished to be separated from my body, but today wasn’t one of those days.",
    "There are no heroes in a punk rock band.",
    "In the end, he realized he could see sound and hear words.",
    "She had a habit of taking showers in lemonade.",
    "He hated that he loved what she hated about hate.",
    "Mr. Montoya knows the way to the bakery even though he's never been there.",
    "Karen realized the only way she was getting into heaven was to cheat.",
    "Thirty years later, she still thought it was okay to put the toilet paper roll under rather than over.",
    "He appeared to be confusingly perplexed.",
    "Sometimes you have to just give up and win by cheating.",
    "It's never been my responsibility to glaze the donuts.",
    "Mom didn’t understand why no one else wanted a hot tub full of jello.",
    "He poured rocks in the dungeon of his mind.",
    "He is good at eating pickles and telling women about his emotional problems.",
    "He picked up trash in his spare time to dump in his neighbor's yard.",
    "I thought red would have felt warmer in summer but I didn't think about the equator.",
    "The family’s excitement over going to Disneyland was crazier than she anticipated.",
    "You're good at English when you know the difference between a man eating chicken and a man-eating chicken.",
    "With the high wind warning",
    "This made him feel like an old-style rootbeer float smells."
            ]

In [None]:
print(len(sentences))

In [None]:
# download a larger pipeline package for spaCy
!python -m spacy download en_core_web_lg  # more accurate, but about 770 MB

!python -m spacy download en_core_web_sm  # much smaller, but less accurate

In [None]:
# Determine s1
picked_sentences = []

# choose the spaCy pipeline
nlp = spacy.load("en_core_web_lg")
#nlp = spacy.load("en_core_web_md")

# Find the first sentence, i.e. the one with the highest S_weight.
# The list is assumed to be sorted by decreasing S_weight, so take the first item;
# otherwise compute S_weight here and pick the maximum.
s1 = sentences[0]

# remove the choice from the candidate list and add it to the summary list
picked_sentences.append(s1)
sentences.remove(s1)

print(picked_sentences)

In [None]:
# Selection of the remaining nine sentences

# list of similarity scores
sim_score = []

# loop until 10 sentences have been picked (or no candidates are left)
while len(picked_sentences) < 10 and sentences:
    sim_score.clear()
    
    for sentence in sentences:
        nlp_sentence = nlp(sentence)
        score = 0

        for p_sentence in picked_sentences:
            # compare the two sentences
            nlp_p_sentence = nlp(p_sentence)

            score += nlp_p_sentence.similarity(nlp_sentence)
                
        sim_score.append(score)
        
        
    print(sim_score)
    min_value = min(sim_score)
    min_index = sim_score.index(min_value)   

    print("Sentences left in the list: " + str(len(sentences)))
    print("Smallest value: " + str(min_value))
    print(sentences[min_index])

    # pop by index so that an identical sentence earlier in the list is not removed instead
    picked_sentences.append(sentences.pop(min_index))



In [None]:
print("Summarized text")
print(picked_sentences)

## TASK 6

We would like to choose the keywords not from the histogram frequencies but using the open-source RAKE (https://www.airpair.com/nlp/keyword-extraction-tutorial). Repeat the previous process, selecting the sentences that are associated with the first ten keywords generated by RAKE. Comment on the quality of this summarizer based on your observations.
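For orientation, the core of RAKE can be sketched in a few lines: candidate phrases are runs of consecutive non-stopwords, each word is scored by degree divided by frequency, and a phrase scores the sum of its words. The version below is a simplified illustration, not the RAKE-tutorial package used in the cells that follow; the stoplist is a tiny hand-picked set (an assumption), and `candidate_phrases`, `rake_scores` and the sample text are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stoplist (an assumption); the notebook uses SmartStoplist.txt.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def candidate_phrases(text):
    """Split text into runs of consecutive non-stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(text):
    """Score each phrase as the sum of deg(w) / freq(w) over its words."""
    phrases = candidate_phrases(text)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scores = {}
    for phrase in phrases:
        scores[" ".join(phrase)] = sum(degree[w] / freq[w] for w in phrase)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sample = "Keyword extraction is useful. Automatic keyword extraction of documents is fast."
for phrase, score in rake_scores(sample):
    print(f"{score:5.2f}  {phrase}")
```

Because the score rewards words that appear in long phrases, multi-word keywords such as "automatic keyword extraction" rank above single words, which is worth keeping in mind when commenting on the quality of the resulting summary.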

In [None]:
# The setup file in the repo could not be installed on Windows without a fix
!git clone https://github.com/zelandiya/RAKE-tutorial
%cd RAKE-tutorial

# Before installing, open setup.py and remove the slash (/) from the paths:
# package_dir={'nlp_rake': './'} and
# package_data={'nlp_rake': ['data/']}

# I changed the name "nlp-rake" to just "rake" in the setup file.

# the screenshot setup_korjaus is available on GitHub; with that fix the package installs
!python setup.py install



In [None]:
# I installed the module in a different location than the Jupyter server path; this fixed the path
#import sys 
#sys.path.append("C:/NLP/RAKE-tutorial")

In [None]:
import rake 
import operator

In [None]:
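# Fix the path; the file can be found on GitHub.
# Arguments after the stoplist: minimum characters per word (5), maximum words
# per phrase (3), minimum keyword frequency in the text (4).
rake_object = rake.Rake("C:/NLP/RAKE-tutorial/data/stoplists/SmartStoplist.txt", 5, 3, 4) 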

In [None]:
# set the text you want to process
with open("C:/NLP/RAKE-tutorial/data/docs/fao_test/w2167e.txt", 'r') as sample_file:
    text = sample_file.read()
sentenceList = rake.split_sentences(text)
print(sentenceList[0:1])


In [None]:
keywords = rake_object.run(text)
#print("Keywords:", keywords[0:10])  # the first 10
# Keep the ten best-scoring keywords (fewer if RAKE returns fewer).
keywords_topten = [keyword for keyword, score in keywords[:10]]
    
print(keywords_topten)

In [None]:
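# Extract sentences using the keywords: count how many of the top keywords each sentence contains
dct = {}
for sentence in sentenceList:
    dct[sentence] = sum(1 for word in keywords_topten if word in sentence)

# keep the sentences that contain the largest number of keywords
max_hits = max(dct.values())
rake_sentences = [key for key, value in dct.items() if value == max_hits]


print("\n".join(rake_sentences))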

In [None]:
#Comparing results
print("Sentences in original text: {}, summarized amount: {}".format(len(sentenceList),len(rake_sentences)))

In [None]:
# Comment on results:

## TASK 7

It is also suggested to explore alternative implementations that offer a larger number of summarization approaches, such as sumy (https://github.com/miso-belica/sumy). Show how each of the implemented summarizers behaves when given the same document you used in the previous case.

In [None]:
#https://github.com/miso-belica/sumy
#pip install sumy

In [None]:
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as LSASummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer as LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer as LuhnSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

LANGUAGE = "english"
SENTENCES_COUNT = 10

In [None]:
def sumySummarize(article):
    stemmer = Stemmer(LANGUAGE)
    summarizers = [LexRankSummarizer(stemmer), LSASummarizer(stemmer), LuhnSummarizer(stemmer)]
    parser = PlaintextParser.from_string(article, Tokenizer(LANGUAGE))
    results = []
    
    for summarizer in summarizers:
        summarizer.stop_words = get_stop_words(LANGUAGE)
        sentences = []
        for sentence in summarizer(parser.document, SENTENCES_COUNT):
            sentences.append(sentence)
        results.append(sentences)
    
    return results

sumySentences = sumySummarize(article)
for sentences in sumySentences:
    print("{}\n\n".format(sentences))

## TASK 8

Now we would like to compare the above summarizers and those in 3), 5) and 7) on a new dataset constructed as follows. First, select an Elsevier journal of your choice and pick 10 papers that rank highly in the journal according to the citation index (the papers should be well structured and contain an Abstract, an Introduction and a Conclusion). 

For each of the ten papers, take the Introduction as the main document to summarize, and use the Abstract and the Conclusion as two gold-standard summaries of the document for assessment with ROUGE-1 and ROUGE-2. 

Report in a table the evaluation score of each summarizer. 
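One possible shape for the reporting step is sketched below. It assumes the recall variant of ROUGE-N with whitespace tokenization, averaged over the two reference summaries. The texts and summarizer names are placeholders, not results from this project, and `rouge_recall` is a hypothetical helper; in practice the ROUGE packages from Task 4 could be plugged in instead.

```python
from collections import Counter


def rouge_recall(summary, reference, n):
    """Clipped n-gram recall of a summary against one reference (whitespace tokens)."""
    def grams(text):
        tokens = text.lower().split()
        return Counter(zip(*(tokens[i:] for i in range(n))))
    ref = grams(reference)
    overlap = sum((grams(summary) & ref).values())
    return overlap / max(sum(ref.values()), 1)


# Placeholder texts (made up); in the task these come from the ten selected papers.
references = {"abstract": "we propose a fast summarizer",
              "conclusion": "the summarizer is fast and simple"}
summaries = {"tfidf": "we propose a summarizer",
             "lexrank": "the summarizer is simple"}

rows = []
for name, summary in summaries.items():
    # Average each score over the two gold-standard summaries.
    r1 = sum(rouge_recall(summary, ref, 1) for ref in references.values()) / len(references)
    r2 = sum(rouge_recall(summary, ref, 2) for ref in references.values()) / len(references)
    rows.append((name, r1, r2))

print(f"{'Summarizer':<12}{'ROUGE-1':>9}{'ROUGE-2':>9}")
for name, r1, r2 in rows:
    print(f"{name:<12}{r1:>9.3f}{r2:>9.3f}")
```

For the full experiment, the same loop would run once per paper and the per-paper rows would be averaged before printing the table.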

In [None]:
# ROUGE-1 and ROUGE-2 scoring code


## TASK 9

Design a simple GUI that allows the user to input a text or a link to a document to be summarized, and that outputs the summary produced by the approach in 3) and by the algorithms implemented in 7).

In [None]:
# run simpleGUI.py