# **Text Summarization**

---
https://blog.floydhub.com/gentle-introduction-to-text-summarization-in-machine-learning/



Group members
1. ฉัตริน เหมรา 6242017426
2. บรรณวิชญ์ เอี่ยมสวัสดิ์ 6242051726
3. ยศวัจน์ อู่สิริมณีชัย 6242079326
4. รมย์รดา ลาดมะโรง 6242080926
5. ศรณ์ พงษ์นริศร 6242089626
6. สพล เชวงโชติ 6242097626
7. คุณานนท์ วิมุตติไชย 6242011626

**Step 1: Preparing the data**

If the article is pulled from the internet, follow the steps below. If you already have the text data, you can skip this first step.


In [None]:
from bs4 import BeautifulSoup
import urllib.request

# Fetch the raw HTML from the URL
fetched_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/20th_century')

article_read = fetched_data.read()

# Parse the HTML and store the parsed document in a variable
article_parsed = BeautifulSoup(article_read, 'html.parser')

# Collect all <p> tags
paragraphs = article_parsed.find_all('p')

article_content = ''

# Loop through the paragraphs and append their text to the variable
for p in paragraphs:
    article_content += p.text

In [None]:
print(article_content)

The 20th (twentieth) century began on
January 1, 1901 (MCMI), and ended on December 31, 2000 (MM).[1] The 20th century was dominated by significant events that defined the modern era: Spanish flu pandemic, World War I and World War II, nuclear weapons, nuclear power and space exploration, nationalism and decolonization, technological advances, and the Cold War and post-Cold War conflicts. These reshaped the political and social structure of the globe.
The 20th century saw a massive transformation of humanity's relationship with the natural world. Global population, sea level rise, and ecological collapses increased while competition for land and dwindling resources accelerated deforestation, water depletion, and the mass extinction of many of the world's species and decline in the population of others. Man-made global warming increased the risk of extreme weather conditions.
Additional themes include intergovernmental organizations and cultural homogenization through developments in em
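
If you already have the HTML as a local string, the same "keep only the `<p>` text" idea can be tried without a network call. The sketch below is an illustration only: it uses the standard library's `html.parser` in place of BeautifulSoup, and `ParagraphCollector`, `extract_paragraph_text`, and `sample_html` are hypothetical names that do not appear in the notebook.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # greater than 0 while inside a <p> element
        self.paragraphs = []   # one string per <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Text outside <p> (for example headings) is ignored
        if self._depth > 0:
            self.paragraphs[-1] += data


def extract_paragraph_text(html: str) -> str:
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    # Paragraphs are joined without a separator, like the loop above
    return "".join(collector.paragraphs)


sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p><p>Second one.</p>"
    "</body></html>"
)
sample_text = extract_paragraph_text(sample_html)
print(sample_text)
```

Because the paragraphs are concatenated directly, the heading text is dropped and the text of nested tags such as `<b>` is kept, which is the same shape of result as `article_content` above.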

**Step 2: Processing the data**

Prepare the data to build a dictionary that holds the important words and the score of each word.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def _create_dictionary_table(text_string) -> dict:

    # Load the English stop words that should be skipped
    stop_words = set(stopwords.words("english"))

    # Split the text into tokens (words and punctuation)
    words = word_tokenize(text_string)

    # Stemmer for reducing words to their root form
    stem = PorterStemmer()

    # Count how often each stemmed token that is not a stop word occurs
    frequency_table = dict()
    for wd in words:
        wd = stem.stem(wd)
        if wd in stop_words:
            continue
        if wd in frequency_table:
            frequency_table[wd] += 1
        else:
            frequency_table[wd] = 1

    return frequency_table

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def check_max_frequency(frequency_table) -> float:
    # Return the largest value in the table (0 if the table is empty)
    max_frequency = 0
    for wd in frequency_table:
        if frequency_table[wd] > max_frequency:
            max_frequency = frequency_table[wd]

    return max_frequency

In [None]:
def _create_weight_dictionary_table(text_string, max_frequency) -> dict:

    # Load the English stop words that should be skipped
    stop_words = set(stopwords.words("english"))

    # Split the text into tokens (words and punctuation)
    words = word_tokenize(text_string)

    # Stemmer for reducing words to their root form
    stem = PorterStemmer()

    # Accumulate 1 / max_frequency per occurrence, so each weight
    # ends up as the word's count divided by the highest count
    frequency_table = dict()
    for wd in words:
        wd = stem.stem(wd)
        if wd in stop_words:
            continue
        if wd in frequency_table:
            frequency_table[wd] += 1 / max_frequency
        else:
            frequency_table[wd] = 1 / max_frequency

    return frequency_table

In [None]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
stem = PorterStemmer()
examples = ['wants', 'hated', 'eating', 'died', 'see', 'went', 'dying']
stem_examples = [stem.stem(example) for example in examples]
print(stem_examples)

['want', 'hate', 'eat', 'die', 'see', 'went', 'die']


In [None]:
frequency_table = _create_dictionary_table(article_content)

In [None]:
print(frequency_table)

{'20th': 20, '(': 12, 'twentieth': 2, ')': 12, 'centuri': 45, 'began': 4, 'januari': 3, '1': 6, ',': 248, '1901': 3, 'mcmi': 1, 'end': 15, 'decemb': 3, '31': 3, '2000': 3, 'MM': 1, '.': 127, '[': 41, ']': 41, 'wa': 35, 'domin': 3, 'signific': 4, 'event': 1, 'defin': 1, 'modern': 3, 'era': 1, ':': 2, 'spanish': 1, 'flu': 1, 'pandem': 1, 'world': 48, 'war': 39, 'I': 4, 'II': 8, 'nuclear': 10, 'weapon': 6, 'power': 12, 'space': 4, 'explor': 3, 'nation': 17, 'decolon': 3, 'technolog': 21, 'advanc': 8, 'cold': 5, 'post-cold': 1, 'conflict': 8, 'reshap': 1, 'polit': 5, 'social': 1, 'structur': 1, 'globe': 1, 'saw': 2, 'massiv': 2, 'transform': 2, 'human': 10, "'s": 21, 'relationship': 1, 'natur': 3, 'global': 17, 'popul': 12, 'sea': 1, 'level': 2, 'rise': 3, 'ecolog': 2, 'collaps': 3, 'increas': 4, 'competit': 3, 'land': 1, 'dwindl': 1, 'resourc': 3, 'acceler': 2, 'deforest': 1, 'water': 1, 'deplet': 1, 'mass': 3, 'extinct': 2, 'mani': 13, 'speci': 1, 'declin': 2, 'man-mad': 1, 'warm': 2, 'r

In [None]:
max_frequency = check_max_frequency(frequency_table)
print(max_frequency)

248


In [None]:
weight_frequency_table = _create_weight_dictionary_table(article_content, max_frequency)
print(weight_frequency_table)

{'20th': 0.08064516129032254, '(': 0.04838709677419353, 'twentieth': 0.008064516129032258, ')': 0.04838709677419353, 'centuri': 0.18145161290322567, 'began': 0.016129032258064516, 'januari': 0.012096774193548387, '1': 0.024193548387096774, ',': 0.9999999999999992, '1901': 0.012096774193548387, 'mcmi': 0.004032258064516129, 'end': 0.06048387096774191, 'decemb': 0.012096774193548387, '31': 0.012096774193548387, '2000': 0.012096774193548387, 'MM': 0.004032258064516129, '.': 0.512096774193548, '[': 0.16532258064516117, ']': 0.16532258064516117, 'wa': 0.14112903225806442, 'domin': 0.012096774193548387, 'signific': 0.016129032258064516, 'event': 0.004032258064516129, 'defin': 0.004032258064516129, 'modern': 0.012096774193548387, 'era': 0.004032258064516129, ':': 0.008064516129032258, 'spanish': 0.004032258064516129, 'flu': 0.004032258064516129, 'pandem': 0.004032258064516129, 'world': 0.19354838709677405, 'war': 0.15725806451612892, 'I': 0.016129032258064516, 'II': 0.03225806451612903, 'nucl

In [None]:
check_max_weight = check_max_frequency(weight_frequency_table)
print(check_max_weight)

0.9999999999999992
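
The weighting idea from this step can be checked by hand on a tiny input. The following is a minimal sketch under simplifying assumptions: it uses no NLTK, does no stemming, strips punctuation with plain string methods, and relies on a hypothetical mini stop-word list (`STOP_WORDS`). The helper name `word_weights` is also an assumption made for illustration.

```python
from collections import Counter

# Hypothetical mini stop-word list, only for this illustration
STOP_WORDS = {"the", "a", "of", "and", "is", "was", "in"}


def word_weights(text: str) -> dict:
    """Return each non-stop word's count divided by the highest count."""
    # Whitespace split, then strip surrounding punctuation and lowercase
    tokens = [t.strip(".,;:!?()[]\"'").lower() for t in text.split()]
    words = [t for t in tokens if t and t not in STOP_WORDS]
    counts = Counter(words)
    if not counts:
        return {}
    max_count = max(counts.values())
    # Divide once per word instead of accumulating 1 / max_count
    return {word: count / max_count for word, count in counts.items()}


weights = word_weights("The war ended. The war was long, and the peace was short.")
print(weights)
```

Here "war" occurs twice and every other kept word once, so "war" gets weight 1.0 and the rest get 0.5. Dividing the final count once gives exactly 1.0 for the most frequent word, whereas adding `1 / max_frequency` repeatedly accumulates rounding error, which is why the cell above prints 0.9999999999999992.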


**Step 3: Tokenizing the article into sentences**

Split the content into individual sentences and store them in the variable `sentences`, using nltk's built-in function `sent_tokenize`.

In [None]:
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(article_content)

In [None]:
print(sentences)

['The 20th (twentieth) century began on\nJanuary 1, 1901 (MCMI), and ended on December 31, 2000 (MM).', '[1] The 20th century was dominated by significant events that defined the modern era: Spanish flu pandemic, World War I and World War II, nuclear weapons, nuclear power and space exploration, nationalism and decolonization, technological advances, and the Cold War and post-Cold War conflicts.', 'These reshaped the political and social structure of the globe.', "The 20th century saw a massive transformation of humanity's relationship with the natural world.", "Global population, sea level rise, and ecological collapses increased while competition for land and dwindling resources accelerated deforestation, water depletion, and the mass extinction of many of the world's species and decline in the population of others.", 'Man-made global warming increased the risk of extreme weather conditions.', 'Additional themes include intergovernmental organizations and cultural homogenization thro
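
To get a feel for what sentence tokenization produces, here is a rough stand-in that splits after `.`, `!`, or `?` when whitespace follows. It is only a sketch and is not how `sent_tokenize` works internally. The name `naive_sent_split` is hypothetical.

```python
import re


def naive_sent_split(text: str) -> list:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty pieces (for example when the input is empty)
    return [p for p in parts if p]


sample = "The century began in 1901. It ended in 2000! Was it long?"
sample_sentences = naive_sent_split(sample)
print(sample_sentences)
```

A rule this simple would also split after abbreviations such as "Dr." in "Dr. Smith", which is one reason to prefer `sent_tokenize` for real text.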

**Step 4: Finding the weighted frequencies of the sentences**

Build an algorithm that computes the weight of each sentence, to find which sentences carry the main ideas of the article.
To keep long sentences from receiving higher weights than short ones, divide each sentence's weight by the number of words in that sentence, not counting its stop words.

In [None]:
def _calculate_sentence_scores(sentences, weight_frequency_table) -> dict:

    # Score each sentence by the weights of the table entries it contains.
    # Sentences are keyed by their first 7 characters.
    sentence_weight = dict()

    for sentence in sentences:
        sentence_key = sentence[:7]
        sentence_lower = sentence.lower()
        sentence_wordcount_without_stop_words = 0
        for word_weight in weight_frequency_table:
            # Substring match against the lowercased sentence
            if word_weight in sentence_lower:
                sentence_wordcount_without_stop_words += 1
                if sentence_key in sentence_weight:
                    sentence_weight[sentence_key] += weight_frequency_table[word_weight]
                else:
                    sentence_weight[sentence_key] = weight_frequency_table[word_weight]

        # Normalize by the number of matched entries so long sentences
        # are not favored; sentences with no matches are left unscored
        if sentence_wordcount_without_stop_words > 0:
            sentence_weight[sentence_key] = sentence_weight[sentence_key] / sentence_wordcount_without_stop_words

    return sentence_weight

In [None]:
sentence_scores = _calculate_sentence_scores(sentences, weight_frequency_table)

In [None]:
print(sentence_scores)

{'The 20t': 0.06442195458404072, '[1] The': 0.07389112903225802, 'These r': 0.07891705069124419, 'Global ': 0.06955645161290319, 'Man-mad': 0.0732009925558312, 'Additio': 0.0686443932411674, 'Automob': 0.09724857685009487, 'Great a': 0.10201612903225801, 'The rep': 0.10282258064516119, 'The Mar': 0.04217523975588494, 'Through': 0.07056451612903222, 'The dis': 0.06143887945670625, 'It took': 0.061400293255131924, '[6][7][': 0.07997311827956984, 'Penicil': 0.04349951124144677, '[citati': 0.08419316481914765, 'Trade i': 0.06761786600496279, 'Until t': 0.05942275042444819, '[9]\nThe': 0.11028225806451608, '[10] Th': 0.10526315789473684, 'It was ': 0.09020213657310433, 'Unlike ': 0.09953310696095075, 'The cen': 0.10614143920595531, 'Nationa': 0.06622678396871944, 'Terms l': 0.11370967741935478, 'Scienti': 0.08479059083897789, 'Horses ': 0.07610887096774194, 'These d': 0.06690561529271209, 'Humans ': 0.17016129032258057, 'Mass me': 0.07459677419354836, 'Advance': 0.0481854838709677, 'Rapid t

In [None]:
max_weight_sentence = check_max_frequency(sentence_scores)
print(max_weight_sentence)

0.1883064516129031
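
The normalization described in this step can be worked through on two short sentences. The sketch below is a variant for illustration, not the notebook's function: it matches whole tokens instead of substrings and keys each score by the sentence's position. `score_sentences` and all `demo_*` values are hypothetical.

```python
def score_sentences(sentences: list, weights: dict, stop_words: set) -> dict:
    """Map sentence index -> average weight of its non-stop words."""
    scores = {}
    for index, sentence in enumerate(sentences):
        tokens = [t.strip(".,;:!?()[]\"'").lower() for t in sentence.split()]
        content_words = [t for t in tokens if t and t not in stop_words]
        if not content_words:
            continue  # nothing to score in this sentence
        # Words missing from the weight table contribute 0
        total = sum(weights.get(word, 0.0) for word in content_words)
        scores[index] = total / len(content_words)
    return scores


demo_weights = {"war": 1.0, "ended": 0.5, "peace": 0.5}
demo_stop_words = {"the", "was"}
demo_sentences = ["The war ended.", "The peace was short."]
demo_scores = score_sentences(demo_sentences, demo_weights, demo_stop_words)
print(demo_scores)
```

The first sentence scores (1.0 + 0.5) / 2 = 0.75 and the second (0.5 + 0) / 2 = 0.25. Keying by index keeps sentences that share their first 7 characters apart, such as the two sentences in the output above that both begin with "The 20th".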


**Step 5: Calculating the threshold of the sentences**

Create a threshold from the average of the sentence scores.

In [None]:
def _calculate_average_score(sentence_weight) -> float:
   
    # Calculating the average score for the sentences
    sum_values = 0
    for entry in sentence_weight:
        sum_values += sentence_weight[entry]

    # Getting sentence average value from source text
    average_score = (sum_values / len(sentence_weight))

    return average_score

In [None]:
threshold = _calculate_average_score(sentence_scores)

In [None]:
print(threshold)

0.0874855199331064


**Step 6: Getting the summary**

Summarize the article using the variables created above.

In [None]:
def _get_article_summary(sentences, sentence_weight, threshold):
    article_summary = ''

    # Keep, in original order, every sentence whose score reaches the threshold
    for sentence in sentences:
        if sentence[:7] in sentence_weight and sentence_weight[sentence[:7]] >= (threshold):
            article_summary += " " + sentence

    return article_summary

In [None]:
article_summary = _get_article_summary(sentences, sentence_scores, 1.5 * threshold)

In [None]:
print(article_summary)

 Humans explored space for the first time, taking their first footsteps on the Moon. However, these same wars resulted in the destruction of the imperial system. The victorious Bolsheviks then established the Soviet Union, the world's first communist state. At the beginning of the period, the British Empire was the world's most powerful nation,[13] having acted as the world's policeman for the past century. In total, World War II left some 60 million people dead. At the beginning of the century, strong discrimination based on race and sex was significant in most societies. The world was undergoing its second major period of globalization; the first, which started in the 18th century, having been terminated by World War I. Since the US was in a dominant position, a major part of the process was Americanization. Terrorism, dictatorship, and the spread of nuclear weapons were pressing global issues. The world was still blighted by small-scale wars and other violent conflicts, fueled by co
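
The effect of the threshold multiplier is easy to see on a toy example. This is a minimal sketch with made-up scores keyed by sentence position. `select_summary` and the `demo_*` values are hypothetical and are not the notebook's data.

```python
def select_summary(sentences: list, scores: dict, threshold: float) -> str:
    """Join, in original order, the sentences whose score reaches the threshold."""
    chosen = [s for i, s in enumerate(sentences) if scores.get(i, 0.0) >= threshold]
    return " ".join(chosen)


demo_sentences = ["Wars reshaped the globe.", "It rained.", "Nuclear power emerged."]
demo_scores = {0: 0.75, 1: 0.25, 2: 0.5}

# Threshold = average sentence score, as in Step 5
demo_threshold = sum(demo_scores.values()) / len(demo_scores)

summary_at_average = select_summary(demo_sentences, demo_scores, demo_threshold)
summary_strict = select_summary(demo_sentences, demo_scores, 1.5 * demo_threshold)
print(summary_at_average)
print(summary_strict)
```

With the average (0.5) as the cutoff, two sentences survive. Raising the cutoff to 1.5 times the average (0.75) keeps only the highest-scoring one, so a larger multiplier yields a shorter summary.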