Sources:
- https://towardsdatascience.com/text-summarization-using-tf-idf-e64a0644ace3
- https://medium.com/@ashins1997/text-summarization-f2542bc6a167

**Steps of the summarization process below:**

1. Split the text into sentences
2. Preprocess the text (remove stop words and punctuation)
3. Build a matrix of the word frequencies in each sentence
4. Build a matrix of the relevance (term frequency, TF) of the words in each sentence
5. Count how many sentences contain each word
6. Compute TF-IDF to determine the relevance of each word in a sentence
7. Compute the scores of the sentences
8. Determine a threshold for the sentence scores to judge each sentence's relevance within the whole text
9. Generate the summary
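
The steps above can be sketched on a toy input. This is a minimal, dependency-free sketch under simplifying assumptions: whitespace splitting stands in for the NLTK tokenizers, there is no stop-word removal or stemming, and the names `toy_sentences`, `tf`, `idf`, `sentence_score` and `cutoff` are illustrative only and do not appear in the notebook.

```python
import math

# Hypothetical toy input: three already lowercased "sentences" (documents).
toy_sentences = [
    "troops leave afghanistan",
    "officials discuss troops",
    "talks continue",
]
tokenized = [s.split() for s in toy_sentences]
n_docs = len(tokenized)

def tf(word, tokens):
    # Share of the sentence's tokens that are `word`.
    return tokens.count(word) / len(tokens)

def idf(word):
    # log10 of (number of sentences / number of sentences containing the word).
    containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log10(n_docs / containing)

def sentence_score(tokens):
    # Average TF-IDF over the distinct words of the sentence.
    words = set(tokens)
    return sum(tf(w, tokens) * idf(w) for w in words) / len(words)

scores = [sentence_score(tokens) for tokens in tokenized]
cutoff = sum(scores) / len(scores)  # average sentence score as threshold
summary = [s for s, sc in zip(toy_sentences, scores) if sc >= cutoff]
print(summary)
```

On this toy input only the last sentence reaches the average score: it is short and both of its words occur in no other sentence, so its average TF-IDF is the highest.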


In [303]:
from nltk import sent_tokenize, word_tokenize, PorterStemmer
from nltk.corpus import stopwords
import math

In [304]:
text_to_sum = "WASHINGTON - The Trump administration has ordered the military to start withdrawing roughly 7,000 troops from Afghanistan in the coming months, two defense officials said Thursday, an abrupt shift in the 17-year-old war there and a decision that stunned Afghan officials, who said they had not been briefed on the plans. President Trump made the decision to pull the troops - about half the number the United States has in Afghanistan now - at the same time he decided to pull American forces out of Syria, one official said. The announcement came hours after Jim Mattis, the secretary of defense, said that he would resign from his position at the end of February after disagreeing with the president over his approach to policy in the Middle East. The whirlwind of troop withdrawals and the resignation of Mr. Mattis leave a murky picture for what is next in the United States’ longest war, and they come as Afghanistan has been troubled by spasms of violence afflicting the capital, Kabul, and other important areas.  The United States has also been conducting talks with representatives of the Taliban, in what officials have described as discussions that could lead to formal talks to end the conflict. Senior Afghan officials and Western diplomats in Kabul woke up to the shock of the news on Friday morning, and many of them braced for chaos ahead.  Several Afghan officials, often in the loop on security planning and decision-making, said they had received no indication in recent days that the Americans would pull troops out.  The fear that Mr. Trump might take impulsive actions, however, often loomed in the background of discussions with the United States, they said. They saw the abrupt decision as a further sign that voices from the ground were lacking in the debate over the war and that with Mr. Mattis’s resignation, Afghanistan had lost one of the last influential voices in Washington who channeled the reality of the conflict into the White House’s deliberations. The president long campaigned on bringing troops home, but in 2017, at the request of Mr. Mattis, he begrudgingly pledged an additional 4,000 troops to the Afghan campaign to try to hasten an end to the conflict. Though Pentagon officials have said the influx of forces - coupled with a more aggressive air campaign - was helping the war effort, Afghan forces continued to take nearly unsustainable levels of casualties and lose ground to the Taliban. The renewed American effort in 2017 was the first step in ensuring Afghan forces could become more independent without a set timeline for a withdrawal.  But with plans to quickly reduce the number of American troops in the country, it is unclear if the Afghans can hold their own against an increasingly aggressive Taliban. Currently, American airstrikes are at levels not seen since the height of the war, when tens of thousands of American troops were spread throughout the country.  That air support, officials say, consists mostly of propping up Afghan troops while they try to hold territory from a resurgent Taliban."

### Preprocessing

In [305]:
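def text_preprocessing(sentences):
    """Return the lowercased, stemmed, alphanumeric tokens of `sentences` without stop words."""
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    processed_words = []
    for sent in sentences:
        words = word_tokenize(sent)
        # Keep only purely alphanumeric tokens (this drops punctuation as well as
        # tokens containing hyphens or commas), then lowercase and stem them
        words = [ps.stem(word.lower()) for word in words if word.isalnum()]
        # Note: the stop-word filter runs on the stemmed forms, so a stop word whose
        # stem differs from its surface form can slip through
        processed_words += [word for word in words if word not in stop_words]
    return processed_words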

### Create Matrices

In [306]:
def freq_matrix(sentences: list) -> dict:
    """Map each sentence (keyed by its first 15 characters) to its relative word frequencies."""
    matrix = {}
    for sent in sentences:
        # Total token count of the sentence, including punctuation and stop words
        words_count = len(word_tokenize(sent))
        processed_words = text_preprocessing([sent])

        # Count how often each processed word occurs in the sentence
        word_freq = {}
        for word in processed_words:
            word_freq[word] = word_freq.get(word, 0) + 1

        # Relative frequency of each word within the sentence
        freq_table = {word: count / words_count for word, count in word_freq.items()}

        # Skip sentences without any processed words (avoids empty tables downstream)
        if freq_table:
            matrix[sent[:15]] = freq_table

    return matrix
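
Keying the matrix by `sent[:15]` assumes that no two sentences share their first 15 characters. The following sketch, with two made-up sentences that are not taken from the article text, shows how such a collision would arise and how an index-based key (an alternative, not what the notebook does) would avoid it.

```python
# Hypothetical sentences that share the same first 15 characters.
sentences = [
    "The United States has begun talks.",
    "The United States is withdrawing troops.",
]

# Prefix keys as used in the notebook: both sentences map to "The United Stat".
prefix_keys = {sent[:15] for sent in sentences}

# Alternative: key each sentence by its position in the list.
index_keys = {i for i, _ in enumerate(sentences)}

print(len(prefix_keys), len(index_keys))
```

With prefix keys the second sentence would overwrite the first one's entry, so only one of the two would be scored.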
        

In [307]:
def tf_matrix(freq_matrix):
    """Divide each relative frequency by the number of distinct words in its sentence."""
    matrix = {}
    for sent, f_table in freq_matrix.items():
        # f_table already holds relative frequencies, so this normalizes a second time
        count_words_in_sentence = len(f_table)
        matrix[sent] = {word: freq / count_words_in_sentence
                        for word, freq in f_table.items()}

    return matrix
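
To make the two normalization steps of `freq_matrix` and `tf_matrix` concrete, here is a small arithmetic sketch. The numbers are hypothetical and chosen only for illustration.

```python
# Hypothetical sentence: 10 tokens in total, 6 distinct words after
# preprocessing, and one word that occurs twice.
total_tokens = 10
distinct_words = 6
occurrences = 2

relative_freq = occurrences / total_tokens   # value stored by freq_matrix
tf_value = relative_freq / distinct_words    # value stored by tf_matrix

print(relative_freq, round(tf_value, 4))
```

The textbook term frequency would be the first value alone; the second division makes the TF values used here smaller for sentences with many distinct words.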

In [308]:
def sentence_of_word(freq_matrix):
    """Count, for each word, the number of sentences in which it occurs."""
    word_per_doc_table = {}
    for f_table in freq_matrix.values():
        for word in f_table:
            word_per_doc_table[word] = word_per_doc_table.get(word, 0) + 1

    return word_per_doc_table

In [309]:
def idf_matrix(freq_matrix, count_doc_per_words, total_documents):
    idf_matrix = {}
    for sent, f_table in freq_matrix.items():
        idf_table = {}
        for word in f_table.keys():
            idf_table[word] = math.log10(total_documents / float(count_doc_per_words[word]))
        idf_matrix[sent] = idf_table

    return idf_matrix
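
A short worked example may help to read the IDF values. It uses the same `math.log10` formula as `idf_matrix`, but the sentence count and the per-word counts are invented for illustration.

```python
import math

total_sentences = 10  # hypothetical number of sentences in a text

# Hypothetical counts: in how many sentences each word occurs.
sentences_containing = {"taliban": 1, "troops": 5, "said": 10}

idf = {
    word: math.log10(total_sentences / count)
    for word, count in sentences_containing.items()
}

for word, value in idf.items():
    print(f"{word}: {value:.3f}")
```

A word that occurs in a single sentence gets the highest weight, while a word that occurs in every sentence gets an IDF of zero and therefore contributes nothing to any sentence score.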

In [310]:
def tf_idf_matrix(tf_matrix, idf_matrix):
    """Multiply TF and IDF for every word of every sentence."""
    matrix = {}
    for sent, tf_table in tf_matrix.items():
        # Look up the IDF values by key instead of relying on matching dict order
        idf_table = idf_matrix[sent]
        matrix[sent] = {word: tf * idf_table[word] for word, tf in tf_table.items()}

    return matrix

In [311]:
def sentences_score(tf_idf_matrix) -> dict:
    sentenceValue = {}
    for sent, f_table in tf_idf_matrix.items():
        total_score_per_sentence = 0
        count_words_in_sentence = len(f_table)
        for word, score in f_table.items():
            total_score_per_sentence += score
        sentenceValue[sent] = total_score_per_sentence / count_words_in_sentence

    return sentenceValue

In [312]:
def threshold(sentenceValue) -> float:
    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    # Average score of a sentence in the original text
    average = sumValues / len(sentenceValue)
    print("threshold: ", average)

    return average

In [313]:
# def create_summary(sentences, sentenceValue, threshold):
#     sentence_count = 0
#     summary = ''
#     for sentence in sentences:
#         if sentence[:15] in sentenceValue and sentenceValue[sentence[:15]] >= (threshold):
#             summary += " " + sentence
#             sentence_count += 1

#     return summary

In [314]:
def create_summary(sentences, sentenceValue, length_percent):
    """Join all sentences whose score is at least `length_percent` times the average score."""
    summary = ''
    sumValues = sum(sentenceValue.values())
    cutoff = length_percent * sumValues / len(sentenceValue)

    for sentence in sentences:
        # Scores are keyed by the first 15 characters of each sentence
        if sentence[:15] in sentenceValue and sentenceValue[sentence[:15]] >= cutoff:
            summary += " " + sentence

    return summary
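
The effect of the `length_percent` factor can be seen on a handful of made-up scores. The helper `select` below is a hypothetical stand-in that applies the same cutoff rule as `create_summary` but returns the selected sentences as a list.

```python
# Hypothetical sentence scores keyed by sentence text.
toy_scores = {
    "Sentence A.": 0.40,
    "Sentence B.": 0.25,
    "Sentence C.": 0.12,
    "Sentence D.": 0.03,
}

def select(scores, factor):
    # Keep sentences whose score reaches `factor` times the mean score.
    cutoff = factor * sum(scores.values()) / len(scores)
    return [sent for sent, score in scores.items() if score >= cutoff]

for factor in (0.5, 1.0, 1.5):
    print(factor, select(toy_scores, factor))
```

A smaller factor lowers the cutoff and yields a longer summary, a larger factor keeps only the highest-scoring sentences. The factor scales the average score, so it is not the fraction of sentences that end up in the summary.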

In [315]:
# 1 Split the text into sentences
sentences = sent_tokenize(text_to_sum)
total_documents = len(sentences)
#print(sentences)

# 2 Create the frequency matrix of the words in each sentence
# (results get their own names so the functions are not overwritten when the cell is re-run)
freq_mat = freq_matrix(sentences)
#print(freq_mat)

# 3 Calculate the term frequency (TF) and generate a matrix.
# TF is how often a word appears in a document divided by the number of words in that document.
tf_mat = tf_matrix(freq_mat)
#print(tf_mat)

# 4 Count in how many sentences (documents) each word occurs
count_doc_per_words = sentence_of_word(freq_mat)
#print(count_doc_per_words)

# 5 Calculate the inverse document frequency (IDF) and generate a matrix.
# IDF measures how rare a word is across the documents.
idf_mat = idf_matrix(freq_mat, count_doc_per_words, total_documents)
#print(idf_mat)

# 6 Calculate TF-IDF and generate a matrix
tf_idf_mat = tf_idf_matrix(tf_mat, idf_mat)
#print(tf_idf_mat)

# 7 Score the sentences
sentence_scores = sentences_score(tf_idf_mat)
#print(sentence_scores)

# 8 Find the threshold (average sentence score)
avg_threshold = threshold(sentence_scores)
#print(avg_threshold)

# 9 Generate the summary
summary = create_summary(sentences, sentence_scores, 0.5)
# summary = create_summary(sentences, sentence_scores, 0.6 * avg_threshold)
print(summary)

threshold:  0.0016219022848881256
 President Trump made the decision to pull the troops - about half the number the United States has in Afghanistan now - at the same time he decided to pull American forces out of Syria, one official said. The announcement came hours after Jim Mattis, the secretary of defense, said that he would resign from his position at the end of February after disagreeing with the president over his approach to policy in the Middle East. The United States has also been conducting talks with representatives of the Taliban, in what officials have described as discussions that could lead to formal talks to end the conflict. Senior Afghan officials and Western diplomats in Kabul woke up to the shock of the news on Friday morning, and many of them braced for chaos ahead. Several Afghan officials, often in the loop on security planning and decision-making, said they had received no indication in recent days that the Americans would pull troops out. The fear that Mr. Tru