<a href="https://colab.research.google.com/github/AnetaKovacheva/text_summarization/blob/main/Text_Summarization_in_English_and_Bulgarian.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text summarization

This Notebook explains what text summarization is, and how the extractive approach works. Several examples are provided, both with short and long texts in English and Bulgarian.

Text summarization is an NLP task for "producing a concise and fluent summary while preserving key information and overall meaning" ([ref](https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70)). 

There are two types of summarization: *abstractive* and *extractive*. Abstractive methods select words based on semantic understanding, even if those words do not appear in the source document. They aim to express the important material in a new way: they interpret and examine the text using advanced natural language techniques in order to generate a new, shorter text that conveys the most critical information from the original. This is comparable to the way a human reads an article or blog post and then summarizes it in their own words.

Extractive methods attempt to summarize articles by selecting a subset of the original words or sentences that retain the most important points. This approach assigns a weight to each sentence and builds the summary from the highest-weighted ones. Different algorithms and techniques are used to define the weights and to rank the sentences by importance and by similarity to one another.

Abstractive summarization is harder and has been studied less than the extractive approach, since it requires a deeper understanding of the text.

Purely extractive summaries often give better results than automatic abstractive summaries. This is because abstractive summarization methods must cope with problems such as semantic representation, inference and natural language generation, which are relatively harder than data-driven approaches such as sentence extraction ([ref](https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70)).

When it comes to extractive summarization, it is good to understand the term *cosine similarity*. It is a measure of similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them. Sentences can be represented as vectors (for example, of word counts), and cosine similarity can then be used to measure how similar two sentences are; the examples below are inspired by [this](https://www.mygreatlearning.com/blog/text-summarization-in-python/) article. A cosine of 1 (angle equal to 0) indicates that the two vectors point in the same direction, i.e. the sentences are maximally similar, while for word-count vectors a cosine of 0 (angle of 90°) indicates that the sentences share no words.
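
As a minimal sketch of the idea, the block below computes $\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$ for two sentences represented as raw word-count vectors. The function name `cosine_similarity`, the whitespace-based splitting, and the example sentences are assumptions made for illustration; they are not taken from the referenced article or from the cells that follow.

```python
import math
from collections import Counter


def cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity between two sentences using raw word counts."""
    # Bag-of-words vectors: word -> count (naive lowercase whitespace split).
    vec_a = Counter(sentence_a.lower().split())
    vec_b = Counter(sentence_b.lower().split())

    # Dot product over the words the two sentences share.
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count ** 2 for count in vec_a.values()))
    norm_b = math.sqrt(sum(count ** 2 for count in vec_b.values()))

    if norm_a == 0 or norm_b == 0:
        # Cosine is undefined for a zero vector; returning 0.0 is a convention here.
        return 0.0
    return dot / (norm_a * norm_b)


same = cosine_similarity("a model must look for patterns", "a model must look for patterns")
related = cosine_similarity("a model must look for patterns", "a model must learn a mapping")
unrelated = cosine_similarity("a model must look for patterns", "common techniques include clustering")
print(same, related, unrelated)
```

Identical sentences score (up to floating-point rounding) 1, sentences with some shared words score between 0 and 1, and sentences with no shared words score 0.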

I use texts (short and longer) from DeepAI and *The Guardian* for English, and from *Mediapool* and *Dnevnik* for Bulgarian, to show how extractive summarization works.

## Imports

In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

from pprint import pprint

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. Load texts

Shorter texts are stored directly in variables; longer ones are loaded from files. Four different texts are used in this exercise.

The first one (short, in English) explains what Unsupervised Machine Learning is ([source](https://deepai.org/machine-learning-glossary-and-terms/unsupervised-learning)).

The second one (long, in English) is an [article](https://www.theguardian.com/environment/2022/aug/20/un-seeks-plan-to-beat-plastic-nurdles-the-tiny-scourges-of-the-oceans) from *The Guardian* about the UN's intention to tackle plastics in the oceans.

The third text (short, in Bulgarian) is from [Mediapool](https://www.mediapool.bg/gartsiya-izliza-ot-nadzora-na-kreditorite-si-news338989.html) and is about Greece's economy no longer being monitored by its creditors.

Finally, the last text is from [Dnevnik](https://www.dnevnik.bg/sviat/2022/08/21/4382081_vuv_velikobritaniia_durvetata_veche_reagirat_vse_edno/?ref=home_NaiNovoto). It is longer, in Bulgarian, and tells a story about trees in the UK that behave as if it were already autumn.

In [2]:
text_1 = "Unsupervised learning is a kind of machine learning where a model must look for patterns in a dataset with no labels and with minimal human supervision. This is in contrast to supervised learning techniques, such as classification or regression, where a model is given a training set of inputs and a set of observations, and must learn a mapping from the inputs to the observations. In unsupervised learning, only the inputs are available, and a model must look for interesting patterns in the data. Another name for unsupervised learning is knowledge discovery. Common unsupervised learning techniques include clustering, and dimensionality reduction."

In [3]:
text_1

'Unsupervised learning is a kind of machine learning where a model must look for patterns in a dataset with no labels and with minimal human supervision. This is in contrast to supervised learning techniques, such as classification or regression, where a model is given a training set of inputs and a set of observations, and must learn a mapping from the inputs to the observations. In unsupervised learning, only the inputs are available, and a model must look for interesting patterns in the data. Another name for unsupervised learning is knowledge discovery. Common unsupervised learning techniques include clustering, and dimensionality reduction.'

In [4]:
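# Explicit encoding (assumes the file is saved as UTF-8) avoids platform-dependent defaults
with open("un_beat_plastics.txt", "r", encoding="utf-8") as f:
  text_2 = f.read()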

In [5]:
text_2

'Maritime authorities are considering stricter controls on the ocean transport of billions of plastic pellets known as nurdles after a series of spillages around the world. Campaigners warn that nurdles are one of the most common micro-plastic pollutants in the seas, washing up on beaches from New Zealand to Cornwall. The multicoloured pellets produced by petrochemical companies are used as building blocks for plastic products, from bags to bottles and piping. Billions of nurdles washed up in Sri Lanka in May last year after the container ship X-Press Pearl caught fire and sank in the Indian Ocean. The United Nations said the spillage of about 1,680 tonnes of nurdles was the worst maritime disaster in Sri Lanka’s history, with one official saying the spillage was like a “cluster bomb”. The International Maritime Organization, a UN agency, has asked pollution experts to examine the options for “reducing the environmental risk associated with the maritime transport of plastic pellets (nu

In [6]:
text_3 = "Гърция излиза от надзора на международните кредитори след повече от десетилетие на ограничения във финансовия сектор. 'Добрите показатели на гръцката икономика и успешните реформи позволяват свалянето на строгия финансов контрол', съобщи финансовият министър Христос Стайкурас, цитиран от БНР.  В следствие на икономическата криза доходите на гърците намаляха с една четвърт, а много хора останаха без работа. Преди това Гърция беше натрупала огромен външен дълг, който дълги години бе прикриван от националната статистка. Финансовата криза от 2009 година обаче оголи всички проблеми и Гърция трябваше да бъде спасявана от фалит. Финансовата дисциплина на страната позволи възстановяването да започне. Експертите посочват, че последните години Гърция е променила икономическия модел и е успяла да привлече инвестиции, за да обслужва кредитите си. Преди това трябваше да бъде проведени болезнени решения като приватизация на редица държавни активи, които предизвикаха масови протести."

In [7]:
text_3

"Гърция излиза от надзора на международните кредитори след повече от десетилетие на ограничения във финансовия сектор. 'Добрите показатели на гръцката икономика и успешните реформи позволяват свалянето на строгия финансов контрол', съобщи финансовият министър Христос Стайкурас, цитиран от БНР.  В следствие на икономическата криза доходите на гърците намаляха с една четвърт, а много хора останаха без работа. Преди това Гърция беше натрупала огромен външен дълг, който дълги години бе прикриван от националната статистка. Финансовата криза от 2009 година обаче оголи всички проблеми и Гърция трябваше да бъде спасявана от фалит. Финансовата дисциплина на страната позволи възстановяването да започне. Експертите посочват, че последните години Гърция е променила икономическия модел и е успяла да привлече инвестиции, за да обслужва кредитите си. Преди това трябваше да бъде проведени болезнени решения като приватизация на редица държавни активи, които предизвикаха масови протести."

In [8]:
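# Explicit encoding (assumes the file is saved as UTF-8) so the Cyrillic text is read correctly
with open("trees_uk_bg.txt", "r", encoding="utf-8") as f:
  text_4 = f.read()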

In [9]:
text_4

'Горещата вълна и сушата са тласнали дърветата в обширни части от Великобритания в режим на оцеляване, като листата окапват или променят цвета си в резултат на стрес. Учените наричат това "фалшива есен" и предупреждават, че някои дървета може да умрат в резултат на това, съобщи Би Би Си. Кестенявите листа и ранното опадане на листата са признаци, появяващи се при смяната на сезоните Но дните са много по-дълги, отколкото в ранната есен, за да започнат тези естествени есенни процеси. Дърветата реагират така, защото са стресирани, казва Лий Хънт, старши съветник по градинарството в Кралското градинарско дружество. Той казва, че през всичките му 45 години това е една от най-тежките години, които е виждал по отношение на щетите по дърветата в провинцията. Особено са пострадали по-младите дървета без достатъчно дълбока и обширна коренова система, а тези, засадени на бедна почва покрай пътищата, могат да изсъхнат и да умрат. Дърветата, които са загубили само няколко листа с малко пожълтяване,

## 2. Text preprocessing

Text preprocessing involves tokenization (i.e., splitting the text into words and punctuation marks) and removing stop words. Stop words are words that carry little meaningful information on their own, such as *i*, *you*, *this*, *that*, *what*, etc. English stop words are included in the NLTK library, but Bulgarian ones are not yet. For this reason, a list of Bulgarian stop words (available [here](https://github.com/stopwords-iso/stopwords-bg/blob/master/stopwords-bg.txt)) is stored in a variable.

In [10]:
stop_words_bg = {'а', 'автентичен', 'аз', 'ако', 'ала', 'бе', 'без', 'беше', 'бивш', 'бивша', 'бившо', 'бил',
    'била', 'били', 'било', 'благодаря', 'близо', 'бъдат', 'бъде', 'бяха', 'в', 'вас', 'ваш', 'ваша', 'вероятно',
    'вече', 'взема', 'ви', 'вие', 'винаги', 'внимава', 'време', 'все', 'всеки', 'всички', 'всичко', 'всяка',
    'във', 'въпреки', 'върху', 'г', 'ги', 'главен', 'главна', 'главно', 'глас', 'го', 'година', 'години',
    'годишен', 'д', 'да', 'дали', 'два', 'двама', 'двамата', 'две', 'двете', 'ден', 'днес', 'дни', 'до', 'добра',
    'добре', 'добро', 'добър', 'докато', 'докога', 'дори', 'досега', 'доста', 'друг', 'друга', 'други', 'е',
    'евтин', 'едва', 'един', 'една', 'еднаква', 'еднакви', 'еднакъв', 'едно', 'екип', 'ето', 'живот', 'за',
    'забавям', 'зад', 'заедно', 'заради', 'засега', 'заспал', 'затова', 'защо', 'защото', 'и', 'из', 'или', 'им',
    'има', 'имат', 'иска', 'й', 'каза', 'как', 'каква', 'какво', 'както', 'какъв', 'като', 'кога', 'когато', 'което',
    'които', 'кой', 'който', 'колко', 'която', 'къде', 'където', 'към', 'лесен', 'лесно', 'ли', 'лош', 'м', 'май',
    'малко', 'ме', 'между', 'мек', 'мен', 'месец', 'ми', 'много', 'мнозина', 'мога', 'могат', 'може', 'мокър', 'моля',
    'момента', 'му', 'н', 'на', 'над', 'назад', 'най', 'направи', 'напред', 'например', 'нас', 'не', 'него', 'нещо',
    'нея', 'ни', 'ние', 'никой', 'нито', 'нищо', 'но', 'нов', 'нова', 'нови', 'новина', 'някои', 'някой', 'няколко',
    'няма', 'обаче', 'около', 'освен', 'особено', 'от', 'отгоре', 'отново', 'още', 'пак', 'по', 'повече', 'повечето',
    'под', 'поне', 'поради', 'после', 'почти', 'прави', 'пред', 'преди', 'през', 'при', 'пък', 'първата', 'първи',
    'първо', 'пъти', 'равен', 'равна', 'с', 'са', 'сам', 'само', 'се', 'сега', 'си', 'син', 'скоро', 'след', 'следващ',
    'сме', 'смях', 'според', 'сред', 'срещу', 'сте', 'съм', 'със', 'също', 'т', 'т.н.', 'тази', 'така', 'такива',
    'такъв', 'там', 'твой', 'те', 'тези', 'ти', 'то', 'това', 'тогава', 'този', 'той', 'толкова', 'точно', 'три',
    'трябва', 'тук', 'тъй', 'тя', 'тях', 'у', 'утре', 'харесва', 'хиляди', 'ч', 'часа', 'че', 'често', 'чрез', 'ще',
    'щом', 'юмрук', 'я', 'як'}

In [11]:
stop_words_en = set(stopwords.words("english"))

The texts are tokenized by applying the function below.

In [12]:
def tokenize_text(text):
  """
  Splits a text into tokens.
  Args: Text / strings
  Returns tokens
  """
  return word_tokenize(text)

Tokenized texts are stored in variables named `words_1` to `words_4`. One of the tokenized texts is shown below.

In [13]:
words_1 = tokenize_text(text_1)
words_2 = tokenize_text(text_2)
words_3 = tokenize_text(text_3)
words_4 = tokenize_text(text_4)
print(words_1)

['Unsupervised', 'learning', 'is', 'a', 'kind', 'of', 'machine', 'learning', 'where', 'a', 'model', 'must', 'look', 'for', 'patterns', 'in', 'a', 'dataset', 'with', 'no', 'labels', 'and', 'with', 'minimal', 'human', 'supervision', '.', 'This', 'is', 'in', 'contrast', 'to', 'supervised', 'learning', 'techniques', ',', 'such', 'as', 'classification', 'or', 'regression', ',', 'where', 'a', 'model', 'is', 'given', 'a', 'training', 'set', 'of', 'inputs', 'and', 'a', 'set', 'of', 'observations', ',', 'and', 'must', 'learn', 'a', 'mapping', 'from', 'the', 'inputs', 'to', 'the', 'observations', '.', 'In', 'unsupervised', 'learning', ',', 'only', 'the', 'inputs', 'are', 'available', ',', 'and', 'a', 'model', 'must', 'look', 'for', 'interesting', 'patterns', 'in', 'the', 'data', '.', 'Another', 'name', 'for', 'unsupervised', 'learning', 'is', 'knowledge', 'discovery', '.', 'Common', 'unsupervised', 'learning', 'techniques', 'include', 'clustering', ',', 'and', 'dimensionality', 'reduction', '.']
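
The token list above still contains punctuation marks and stop words. The sketch below shows one way such tokens could be filtered out. To keep it runnable without NLTK, it uses a small hand-picked stop-word set (`sample_stop_words`, a hypothetical name and an assumption) and the tokens of the first sentence copied from the output above; the notebook itself relies on the full `stop_words_en` set.

```python
# Hypothetical, hand-picked subset of English stop words for illustration only;
# the notebook itself uses NLTK's full list in stop_words_en.
sample_stop_words = {"is", "a", "of", "where", "for", "in", "with", "no", "and", "must"}

# The first sentence of text_1 as tokens, copied from the output above.
tokens = ['Unsupervised', 'learning', 'is', 'a', 'kind', 'of', 'machine', 'learning',
          'where', 'a', 'model', 'must', 'look', 'for', 'patterns', 'in', 'a', 'dataset',
          'with', 'no', 'labels', 'and', 'with', 'minimal', 'human', 'supervision', '.']

content_words = [
    token.lower()
    for token in tokens
    # Keep tokens that contain at least one letter or digit (drops "." and ",")
    # and that are not in the stop-word set.
    if any(ch.isalnum() for ch in token) and token.lower() not in sample_stop_words
]
print(content_words)
```

Checking for at least one alphanumeric character, rather than calling `str.isalpha()` on the whole token, keeps hyphenated tokens such as "X-Press" while still dropping pure punctuation.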

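Shorter texts have around 100-150 tokens, whereas the longer ones have between 300 and 500. Note that these counts include punctuation marks, since the tokenizer returns them as separate tokens.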

In [14]:
print(f"Number of words in text 1: {len(words_1)}")
print(f"Number of words in text 2: {len(words_2)}")
print(f"Number of words in text 3: {len(words_3)}")
print(f"Number of words in text 4: {len(words_4)}")

Number of words in text 1: 112
Number of words in text 2: 308
Number of words in text 3: 152
Number of words in text 4: 491


## 3. Create frequency table

The score of each word is kept in a table. It contains the number of times each word (outside the stop-word list) appears in the text. Punctuation tokens are not filtered out, so they are counted as well. The first function makes the frequency table for English texts, and the second for Bulgarian ones.

In [15]:
def make_frequency_table_en(words):
  """
  Creates table with words and their frequency in a tokenized text in English
  Args: Tokenized text
  Returns the frequency table
  """
  freq_table = dict()

  for word in words:
    word = word.lower()
    if word in stop_words_en:
      continue
    if word in freq_table:
      freq_table[word] +=1
    else:
      freq_table[word] = 1
      
  return freq_table

In [16]:
def make_frequency_table_bg(words):
  """
  Creates table with words and their frequency in a tokenized text in Bulgarian
  Args: Tokenized text
  Returns the frequency table
  """
  freq_table = dict()

  for word in words:
    word = word.lower()
    if word in stop_words_bg:
      continue
    if word in freq_table:
      freq_table[word] +=1
    else:
      freq_table[word] = 1
      
  return freq_table

Now, frequency tables are compiled for all four tokenized texts.

In [17]:
freq_table_1 = make_frequency_table_en(words_1)
freq_table_2 = make_frequency_table_en(words_2)
freq_table_3 = make_frequency_table_bg(words_3)
freq_table_4 = make_frequency_table_bg(words_4)
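
The two functions above differ only in the stop-word set they consult. As a sketch of an alternative, the same counting could be done by a single function that takes the stop words as a parameter and uses `collections.Counter`. The name `make_frequency_table` and the demo inputs are assumptions for illustration, not part of the original notebook.

```python
from collections import Counter


def make_frequency_table(words, stop_words):
    """Count lowercase tokens that are not in the given stop-word set."""
    return Counter(
        word.lower() for word in words if word.lower() not in stop_words
    )


# Tiny made-up inputs, only to show the call shape.
demo_tokens = ["Learning", "is", "a", "kind", "of", "learning", "."]
demo_stop_words = {"is", "a", "of"}
demo_table = make_frequency_table(demo_tokens, demo_stop_words)
print(demo_table)
```

With the notebook's variables, the calls would look like `make_frequency_table(words_1, stop_words_en)` and `make_frequency_table(words_3, stop_words_bg)`. As in the original functions, punctuation tokens such as `"."` are still counted.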

## 4. Create dictionary with scores of each sentence and compute values

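The next step is to compute and track the score of each sentence. To that end, the whole text is tokenized into sentences. The score of a sentence is then the sum of the text-wide frequencies of all entries in the frequency table that occur in that sentence. Note that `word in sentence.lower()` is a substring test, so an entry such as "learn" is also counted when the sentence contains "learning", and punctuation entries such as "." contribute to the score too.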

In [18]:
def compute_sentence_score(text, freq_table):
  """
  Tokenizes text into sentences and computes sentence score
  Args: Text, Frequency table
  Returns: tokenized sentences, and dictionary with each sentence score (value)
  """
  sentences = sent_tokenize(text)
  sentence_value = dict()

  for sentence in sentences:
    for word, freq in freq_table.items():
      if word in sentence.lower():
        if sentence in sentence_value:
          sentence_value[sentence] += freq
        else:
          sentence_value[sentence] = freq

  return sentences, sentence_value

Sentence scores are computed for all texts by applying the function above. The scores assigned to the sentences in the first text are displayed below.

In [19]:
sentences_1, sentence_value_1 = compute_sentence_score(text_1, freq_table_1)
sentences_2, sentence_value_2 = compute_sentence_score(text_2, freq_table_2)
sentences_3, sentence_value_3 = compute_sentence_score(text_3, freq_table_3)
sentences_4, sentence_value_4 = compute_sentence_score(text_4, freq_table_4)

In [20]:
sentence_value_1

{'Unsupervised learning is a kind of machine learning where a model must look for patterns in a dataset with no labels and with minimal human supervision.': 37,
 'This is in contrast to supervised learning techniques, such as classification or regression, where a model is given a training set of inputs and a set of observations, and must learn a mapping from the inputs to the observations.': 40,
 'In unsupervised learning, only the inputs are available, and a model must look for interesting patterns in the data.': 39,
 'Another name for unsupervised learning is knowledge discovery.': 21,
 'Common unsupervised learning techniques include clustering, and dimensionality reduction.': 30}
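
The scores above come from substring matching: `word in sentence.lower()` is true for "learn" inside "learning" and for "supervised" inside "unsupervised". The sketch below shows a token-based variant that only counts whole words. It is an illustration under stated assumptions, not the notebook's method: the function name `score_sentences_by_tokens`, the regular expression used as a stand-in tokenizer, and the demo data are all made up here, and this variant would produce different scores from the ones printed above.

```python
import re


def score_sentences_by_tokens(sentences, freq_table):
    """Score each sentence by summing the text-wide frequency of its distinct words."""
    scores = {}
    for sentence in sentences:
        # \w+ is a simple stand-in for a real tokenizer; it also matches Cyrillic letters.
        distinct_words = set(re.findall(r"\w+", sentence.lower()))
        scores[sentence] = sum(freq_table.get(word, 0) for word in distinct_words)
    return scores


# Made-up frequency table and sentences for illustration.
demo_freq = {"learning": 3, "learn": 1, "model": 2}
demo_sentences = ["Learning needs a model.", "A model can learn."]
demo_scores = score_sentences_by_tokens(demo_sentences, demo_freq)
print(demo_scores)
# {'Learning needs a model.': 5, 'A model can learn.': 3}
```

With substring matching the first demo sentence would score 6 instead of 5, because "learn" would also match inside "learning". Unlike the function above, this variant assigns a score of 0 to sentences without any matching word instead of leaving them out of the dictionary.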

The sentence values are used to compute the average sentence score for a text. The function below performs this computation and truncates the result to an integer.

In [21]:
def compute_avg_values(sentence_value):
  """
  Computes the average value (score) for a sentence from the original text
  Args: Values of all sentences in a text
  Returns: the average value for a sentence.
  """
  sum_values = 0
  for sentence in sentence_value:
    sum_values += sentence_value[sentence]
  
  average = int(sum_values / len(sentence_value))

  return average

The average value is computed for all four texts and printed below.

In [22]:
avg_value_1 = compute_avg_values(sentence_value_1)
avg_value_2 = compute_avg_values(sentence_value_2)
avg_value_3 = compute_avg_values(sentence_value_3)
avg_value_4 = compute_avg_values(sentence_value_4)

print(f"The average value for text 1 is: {avg_value_1}")
print(f"The average value for text 2 is: {avg_value_2}")
print(f"The average value for text 3 is: {avg_value_3}")
print(f"The average value for text 4 is: {avg_value_4}")

The average value for text 1 is: 33
The average value for text 2 is: 45
The average value for text 3 is: 25
The average value for text 4 is: 66
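
As a worked check of the first result, the sketch below recomputes the average for text 1 from the five sentence scores shown earlier. The variable names are assumptions for illustration; `statistics.mean` is used only to show the exact value before truncation.

```python
from statistics import mean

# Sentence scores for text 1, copied from the output of sentence_value_1.
scores_text_1 = [37, 40, 39, 21, 30]

exact_average = mean(scores_text_1)      # 167 / 5 = 33.4
truncated_average = int(exact_average)   # int() drops the fractional part
print(exact_average, truncated_average)
```

The truncated value, 33, matches the average printed for text 1.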


## 5. Produce text summary

The summary is built from those original sentences whose score is greater than the average value increased by 20%. The function below checks which sentences meet this criterion and concatenates them into a single string.

In [23]:
def text_summarization(sentences, sentence_value, avg_value):
  """
  Generates a summary from the sentences whose score exceeds 1.2 times the average
  Args: sentences: text tokenized into sentences
        sentence_value: Sentence score
        avg_value: Average value for a sentence
  Returns: summarized text
  """
  summary = ''
  for sentence in sentences:
    if (sentence in sentence_value) and (sentence_value[sentence] > (1.2 * avg_value)):
      summary += " " + sentence

  return summary
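
To illustrate how the 20% margin affects the selection, the sketch below applies the same comparison to the scores of text 1, with the multiplier exposed as a parameter. The function `select_sentences`, its `factor` parameter, and the short labels `S1` to `S5` (standing in for the five sentences, in order) are assumptions for illustration and are not part of the notebook.

```python
# Sentence scores for text 1, copied from the output of sentence_value_1
# (sentences abbreviated to labels S1..S5 in their original order).
scores = {"S1": 37, "S2": 40, "S3": 39, "S4": 21, "S5": 30}
avg_value = 33  # truncated average printed by the notebook for text 1


def select_sentences(scores, avg_value, factor=1.2):
    """Return the labels whose score exceeds factor * avg_value, in original order."""
    threshold = factor * avg_value
    return [label for label, score in scores.items() if score > threshold]


strict = select_sentences(scores, avg_value)               # threshold 39.6
lenient = select_sentences(scores, avg_value, factor=1.0)  # threshold 33.0
print(strict)   # ['S2']
print(lenient)  # ['S1', 'S2', 'S3']
```

With the default factor only the second sentence passes. The selection is sensitive to small changes: with the untruncated average of 33.4, the threshold would be 40.08 and no sentence of text 1 would be selected.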

In [24]:
summary_1 = text_summarization(sentences_1, sentence_value_1, avg_value_1)
summary_2 = text_summarization(sentences_2, sentence_value_2, avg_value_2)
summary_3 = text_summarization(sentences_3, sentence_value_3, avg_value_3)
summary_4 = text_summarization(sentences_4, sentence_value_4, avg_value_4)

The summarized texts are printed below.

In [25]:
summary_1

' This is in contrast to supervised learning techniques, such as classification or regression, where a model is given a training set of inputs and a set of observations, and must learn a mapping from the inputs to the observations.'

In [26]:
summary_2

' The United Nations said the spillage of about 1,680 tonnes of nurdles was the worst maritime disaster in Sri Lanka’s history, with one official saying the spillage was like a “cluster bomb”. The International Maritime Organization, a UN agency, has asked pollution experts to examine the options for “reducing the environmental risk associated with the maritime transport of plastic pellets (nurdles)”. In a submission by Sri Lanka to the IMO after the X-Press Pearl sinking, officials said: “The incident has resulted in deaths of marine species such as turtles, whales and dolphins.'

In [27]:
summary_3

" 'Добрите показатели на гръцката икономика и успешните реформи позволяват свалянето на строгия финансов контрол', съобщи финансовият министър Христос Стайкурас, цитиран от БНР."

In [28]:
summary_4

' Горещата вълна и сушата са тласнали дърветата в обширни части от Великобритания в режим на оцеляване, като листата окапват или променят цвета си в резултат на стрес. Трудно е да се предскажат дългосрочните последици от сушата, но експертите по екология смятат, че седмици на изсъхнали пасища и твърда като скала почва в голяма част от Южна Англия ще окажат голямо влияние върху дивата природа. "Тези растения осигуряват жизненоважно местообитание за насекоми и риби и загубата им от екосистемата причинява големи промени нагоре по хранителната верига", казва д-р Майк Боус от Британския център за екология и хидрология.'