# **Text Summarization using NLP**


**What is text summarization?**

Text summarization is the process of distilling the most important information from a source text.

**Why automatic text summarization?**



1.   Summaries reduce reading time.
2.   When researching documents, summaries make the selection process easier.
3.   Automatic summarization improves the effectiveness of indexing.
4.   Automatic summarization algorithms can be less biased than human summarizers.
5.   Personalized summaries are useful in question-answering systems as they provide personalized information.
6.   Using automatic or semi-automatic summarization systems enables commercial abstract services to increase the number of text documents they are able to process.





# **Types of summarization**

![alt text](https://drive.google.com/uc?id=1AqwSGEpi3vzAOLVt_5XXRXokZHvcn43B)



**How to do text summarization**


*   Text cleaning
*   Sentence tokenization
*   Word tokenization
*   Word-frequency table
*   Summarization 
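
As a rough orientation, the five steps can be sketched end to end with only the standard library. This is a minimal sketch under stated assumptions, not the spaCy pipeline used in this notebook: the `summarize` function, its `ratio` parameter, the regex-based tokenizers, and the tiny `STOP_WORDS` set are all hypothetical stand-ins for illustration.

```python
import re
from collections import Counter
from heapq import nlargest

# Hypothetical tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "it", "on"}


def summarize(text, ratio=0.3):
    # 1. Text cleaning: collapse runs of whitespace (including newlines).
    cleaned = re.sub(r"\s+", " ", text).strip()

    # 2. Sentence tokenization: naive split after '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]

    # 3. Word tokenization: lowercase alphabetic tokens, stop words removed.
    def words(sentence):
        return [w for w in re.findall(r"[a-z']+", sentence.lower())
                if w not in STOP_WORDS]

    # 4. Word-frequency table, normalized by the most frequent word.
    counts = Counter(w for s in sentences for w in words(s))
    if not counts:
        return ""
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}

    # 5. Summarization: score each sentence and keep the highest-scoring ones.
    scores = {s: sum(weights[w] for w in words(s)) for s in sentences}
    k = max(1, int(len(sentences) * ratio))
    return " ".join(nlargest(k, scores, key=scores.get))


sample = "Cats chase mice. Cats and mice chase cats. Dogs bark."
print(summarize(sample))
```

The regex sentence splitter is far cruder than a trained sentence segmenter; it is only here to keep the sketch dependency-free.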
 
 

**Text variable**

In [1]:
text = """
ABSTRACT
In recent years, people are seeking for a solution to improve text
summarization for Thai language. Although several solutions such
as PageRank, Graph Rank, Latent Semantic Analysis (LSA)
models, etc., have been proposed, research results in Thai text
summarization were restricted due to limited corpus in Thai
language with complex grammar. This paper applied a text
summarization system for Thai travel news based on keyword
scored in Thai language by extracting the most relevant sentences
from the original document. We compared LSA and Non-negative
Matrix Factorization (NMF) to find the algorithm that is suitable
with Thai travel news. The suitable compression rates for Generic
Sentence Relevance score (GRS) and K-means clustering were also
evaluated. From these experiments, we concluded that keyword
scored calculation by LSA with sentence selection by GRS is the
best algorithm for summarizing Thai Travel News, compared with
human with the best compression rate of 20%.
CCS Concepts
• Information systems ➝ Information retrieval ➝ Retrieval
tasks and goals➝ Summarization
Keywords
Text summarization; extractive summarization; non-negative
matrix factorization
1. INTRODUCTION
Daily newspaper has abundant of data that users do not have
enough time for reading them. It is difficult to identify the relevant
information to satisfy the information needed by users. Automatic
summarization can reduce the problem of information overloading
and it has been proposed previously in English and other languages.
However, there were only a few research results in Thai text
summarization due to the lack of corpus in Thai language and the
complicated grammar.
Text Summarization [1] is a technique for summarizing the content
of the documents. It consists of three steps: 1) create an
intermediate representation of the input text, 2) calculate score for
the sentences based on the concepts, and 3) choose important
sentences to be included in the summary. Text summarization can
be divided into 2 approaches. The first approach is the extractive
summarization, which relies on a method for extracting words and
searching for keywords from the original document. The second
approach is the abstractive summarization, which analyzes words
by linguistic principles with transcription or interpretation from the
original document. This approach implies more effective and
accurate summary than the extractive methods. However, with the
lack of Thai corpus, we chose to apply an extractive summarization
method for Thai text summarization.
This research focused on the sentence extraction function based on
keyword score calculation then selecting important sentences based
on the Generic Sentence Relevance score (GRS), calculated from
Latent Semantic Analysis (LSA) and Non-negative Matrix
Factorization (NMF). We also tried using K-means clustering for
document summarization. In this experiment, we compared 5
models for 5 rounds with Thai travel news using the compression
rates of 20%, 30% and 40% and reported the rate and method that
produced the best result from the experiment.
2. RELATED WORKS
In recent years, several models in Thai Text summarization have
been introduced. Suwanno, N. et al. [2] proposed a Thai text
summarization that extracted a paragraph from a document based
on Thai compound nouns, term frequency method, and headline
score for generating a summary. Chongsuntornsri, A., et al. [3]
proposed a new approach for Text summarization in Thai based on
content- and graph-based with the use of Topic Sensitive PageRank
algorithm for summarizing and ranking of text segments.
Jaruskulchai C., et al. [4] proposed a method to summarize
documents by extracting important sentences from combining the
specific properties (Local Property) and the overall properties
(Global Property) of the sentences. The overall properties were
based on the relationship between sentences in the document. From
their experiments, the summarization of the industrial news got
60% precision, 44% recall, and 50.9% F-measure, the general news
got the 51.8% precision, 38.5% recall, and 43.1% F-measure while
the fashion magazines got 53.0% precision, 33.0% recall, and
40.4% F-measure.
"""



# Let's Get Started with spaCy

In [2]:
# pip install -U spacy
# python -m spacy download en_core_web_sm

In [3]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation



In [4]:
stopwords = list(STOP_WORDS)

In [5]:
nlp = spacy.load('en_core_web_sm')
# On some systems the model must be installed separately with these commands:
# !pip3 install -U spacy
# !python3 -m spacy download en_core_web_sm
# Run both commands above to install spaCy and the en_core_web_sm model.

In [6]:
doc = nlp(text)

In [7]:
tokens = [token.text for token in doc]
print(tokens)

['\n', 'Maria', 'Sharapova', 'has', 'basically', 'no', 'friends', 'as', 'tennis', 'players', 'on', 'the', 'WTA', 'Tour', '.', 'The', 'Russian', 'player', 'has', 'no', 'problems', 'in', 'openly', 'speaking', 'about', 'it', 'and', 'in', 'a', 'recent', 'interview', 'she', 'said', ':', "'", 'I', 'do', "n't", 'really', 'hide', 'any', 'feelings', 'too', 'much', '.', '\n', 'I', 'think', 'everyone', 'knows', 'this', 'is', 'my', 'job', 'here', '.', 'When', 'I', "'m", 'on', 'the', 'courts', 'or', 'when', 'I', "'m", 'on', 'the', 'court', 'playing', ',', 'I', "'m", 'a', 'competitor', 'and', 'I', 'want', 'to', 'beat', 'every', 'single', 'person', 'whether', 'they', "'re", 'in', 'the', 'locker', 'room', 'or', 'across', 'the', 'net', '.', '\n', 'So', 'I', "'m", 'not', 'the', 'one', 'to', 'strike', 'up', 'a', 'conversation', 'about', 'the', 'weather', 'and', 'know', 'that', 'in', 'the', 'next', 'few', 'minutes', 'I', 'have', 'to', 'go', 'and', 'try', 'to', 'win', 'a', 'tennis', 'match', '.', '\n', 'I'

In [8]:
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'

In [9]:
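word_frequencies = {}
for word in doc:
  # Skip stop words and punctuation (including the newline added above).
  if word.text.lower() not in stopwords:
    if word.text.lower() not in punctuation:
      # Keys keep the token's original casing, e.g. 'Maria' and 'WTA'.
      if word.text not in word_frequencies:
        word_frequencies[word.text] = 1
      else:
        word_frequencies[word.text] += 1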
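
The same kind of table can be built with `collections.Counter`. The sketch below is an illustration on a plain list of strings rather than spaCy tokens; the `count_words` helper and the sample `tokens` are hypothetical. Unlike the loop above, it stores keys in lowercase, which is an assumption made here so that "Tennis" and "tennis" are counted together.

```python
from collections import Counter
from string import punctuation


def count_words(tokens, stopwords):
    """Count lowercase tokens, skipping stop words, whitespace and
    single-character punctuation."""
    skip = set(punctuation)
    return Counter(
        t.lower()
        for t in tokens
        if t.strip() and t.lower() not in stopwords and t not in skip
    )


tokens = ["Tennis", "players", "play", "tennis", ",", "and", "\n", "Players", "rest", "."]
freq = count_words(tokens, stopwords={"and"})
print(freq.most_common(2))
```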

In [10]:
print(word_frequencies)

{'Maria': 1, 'Sharapova': 1, 'basically': 1, 'friends': 5, 'tennis': 6, 'players': 6, 'WTA': 1, 'Tour': 1, 'Russian': 1, 'player': 2, 'problems': 1, 'openly': 1, 'speaking': 1, 'recent': 1, 'interview': 1, 'said': 2, 'hide': 1, 'feelings': 1, 'think': 4, 'knows': 1, 'job': 1, 'courts': 2, 'court': 1, 'playing': 1, 'competitor': 1, 'want': 1, 'beat': 1, 'single': 1, 'person': 2, 'locker': 1, 'room': 1, 'net': 1, 'strike': 1, 'conversation': 1, 'weather': 1, 'know': 1, 'minutes': 1, 'try': 1, 'win': 1, 'match': 1, 'pretty': 1, 'competitive': 1, 'girl': 1, 'hellos': 1, 'sending': 1, 'flowers': 1, 'Uhm': 1, 'friendly': 1, 'close': 2, 'lot': 2, 'away': 1, 'strategic': 1, 'different': 4, 'men': 1, 'tour': 2, 'women': 1, 'sport': 1, 'mean': 1, 'categorized': 1, 'going': 1, 'interests': 2, 'completely': 1, 'jobs': 1, 'met': 1, 'parts': 1, 'life': 1, 'thinks': 1, 'greatest': 1, 'ultimately': 1, 'small': 1, 'things': 1, 'interested': 1}


In [11]:
max_frequency = max(word_frequencies.values())

In [12]:
max_frequency

6

In [13]:
for word in word_frequencies.keys():
  word_frequencies[word] = word_frequencies[word]/max_frequency
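
Dividing by the maximum count rescales every weight into the range (0, 1], with the most frequent word at exactly 1.0. A small worked example, using four of the counts printed above and a dict comprehension that leaves the raw counts untouched (the names `counts` and `normalized` are illustrative only):

```python
# Four of the raw counts shown in the output above.
counts = {"tennis": 6, "friends": 5, "player": 2, "Maria": 1}

max_count = max(counts.values())

# Build a new dict so the raw counts stay available for inspection.
normalized = {word: count / max_count for word, count in counts.items()}
print(normalized)
```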

In [14]:
print(word_frequencies)

{'Maria': 0.16666666666666666, 'Sharapova': 0.16666666666666666, 'basically': 0.16666666666666666, 'friends': 0.8333333333333334, 'tennis': 1.0, 'players': 1.0, 'WTA': 0.16666666666666666, 'Tour': 0.16666666666666666, 'Russian': 0.16666666666666666, 'player': 0.3333333333333333, 'problems': 0.16666666666666666, 'openly': 0.16666666666666666, 'speaking': 0.16666666666666666, 'recent': 0.16666666666666666, 'interview': 0.16666666666666666, 'said': 0.3333333333333333, 'hide': 0.16666666666666666, 'feelings': 0.16666666666666666, 'think': 0.6666666666666666, 'knows': 0.16666666666666666, 'job': 0.16666666666666666, 'courts': 0.3333333333333333, 'court': 0.16666666666666666, 'playing': 0.16666666666666666, 'competitor': 0.16666666666666666, 'want': 0.16666666666666666, 'beat': 0.16666666666666666, 'single': 0.16666666666666666, 'person': 0.3333333333333333, 'locker': 0.16666666666666666, 'room': 0.16666666666666666, 'net': 0.16666666666666666, 'strike': 0.16666666666666666, 'conversation': 

In [15]:
sentence_tokens = [sent for sent in doc.sents]
print(sentence_tokens)

[
Maria Sharapova has basically no friends as tennis players on the WTA Tour., The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. 
, I think everyone knows this is my job here., When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.
, So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. 
, I'm a pretty competitive girl., I say my hellos, but I'm not sending any players flowers as well., Uhm, I'm not really friendly or close to many players.
, I have not a lot of friends away from the courts.', When she said she is not really close to a lot of players, is that something strategic that she is doing?, Is it different on the men's tour than the women's tour?, ', No, not at all.
, I think just beca

In [16]:
sentence_scores = {}
for sent in sentence_tokens:
  for word in sent:
    # The lookup is lowercased, so keys stored with capitals (e.g. 'Maria') never match.
    if word.text.lower() in word_frequencies:
      if sent not in sentence_scores:
        sentence_scores[sent] = word_frequencies[word.text.lower()]
      else:
        sentence_scores[sent] += word_frequencies[word.text.lower()]
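
The loop above looks tokens up by `word.text.lower()`, which only matches keys that were stored in lowercase. A minimal sketch of a case-consistent variant follows, assuming the weights are keyed by lowercase words. It works on plain lists of strings, and `score_sentences` and the sample data are hypothetical names used only for illustration.

```python
def score_sentences(sentences, weights):
    """Sum the weight of every known word in each sentence.

    `sentences` is a list of token lists; `weights` maps lowercase
    words to their normalized frequency.
    """
    scores = {}
    for index, tokens in enumerate(sentences):
        total = sum(weights.get(token.lower(), 0.0) for token in tokens)
        if total > 0:  # sentences with no known word get no entry
            scores[index] = total
    return scores


weights = {"tennis": 1.0, "players": 1.0, "friends": 0.5}
sentences = [
    ["Tennis", "players", "have", "friends", "."],
    ["The", "weather", "is", "nice", "."],
    ["Friends", "matter", "."],
]
scores = score_sentences(sentences, weights)
print(scores)
```

Keying the result by sentence index rather than by the sentence itself avoids relying on the sentence object being hashable and keeps duplicate sentences apart.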


In [17]:
sentence_scores

{
 Maria Sharapova has basically no friends as tennis players on the WTA Tour.: 3.3333333333333335,
 The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. : 1.8333333333333333,
 I think everyone knows this is my job here.: 0.9999999999999999,
 When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.: 2.1666666666666665,
 So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. : 2.333333333333333,
 I'm a pretty competitive girl.: 0.5,
 I say my hellos, but I'm not sending any players flowers as well.: 1.5,
 Uhm, I'm not really friendly or close to many players.: 1.5,
 I have not a lot of friends away from the courts.': 1.6666666666666667,
 When she said she is not really close to a lot of players, is t

In [18]:
from heapq import nlargest

In [19]:
# Keep roughly 30% of the sentences; int() truncates toward zero.
select_length = int(len(sentence_tokens)*0.3)
select_length

5

In [20]:
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
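
Passing a dict to `heapq.nlargest` iterates over its keys, and `key=scores.get` ranks those keys by their values. The toy example below illustrates this with made-up scores. The `max(1, ...)` guard is an added assumption, not part of the notebook: for a text with three or fewer sentences, `int(n * 0.3)` is 0 and the summary would be empty.

```python
from heapq import nlargest

# Made-up scores, for illustration only.
scores = {"sentence A": 3.3, "sentence B": 0.5, "sentence C": 2.3, "sentence D": 1.8}

# int() truncates, so guard against selecting zero sentences (an added assumption).
k = max(1, int(len(scores) * 0.3))

top_k = nlargest(k, scores, key=scores.get)
top_two = nlargest(2, scores, key=scores.get)

# Sorting in descending order and slicing yields the same keys here.
same_as_sorted = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_k, top_two)
```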

In [21]:
summary

[I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players. ,
 I think everyone just thinks because we're tennis players we should be the greatest of friends.,
 
 Maria Sharapova has basically no friends as tennis players on the WTA Tour.,
 I have friends that have completely different jobs and interests, and I've met them in very different parts of my life.,
 So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. ]

In [22]:
# Each item in summary is a sentence (a spaCy Span); .text gives its string.
final_summary = [sent.text for sent in summary]

In [23]:
summary = ' '.join(final_summary)
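
`nlargest` returns sentences ordered by score, so the joined summary does not follow the order of the source text. If reading order matters, one option is to select the top sentence positions and sort them before joining. This is a sketch of that idea on plain strings; `summarize_in_order` and the sample data are hypothetical.

```python
from heapq import nlargest


def summarize_in_order(sentences, scores, k):
    """Pick the k highest-scoring sentences, then restore document order."""
    top = nlargest(k, range(len(sentences)), key=lambda i: scores[i])
    return " ".join(sentences[i] for i in sorted(top))


sentences = ["First point.", "Minor aside.", "Key finding.", "Closing remark."]
scores = [2.0, 0.5, 3.0, 1.0]
ordered = summarize_in_order(sentences, scores, k=2)
print(ordered)
```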

In [24]:
print(text)


Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. 
I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.
So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. 
I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players.
I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all.
I think just because you're in the sa

In [25]:
print(summary)

I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players. 
 I think everyone just thinks because we're tennis players we should be the greatest of friends. 
Maria Sharapova has basically no friends as tennis players on the WTA Tour. I have friends that have completely different jobs and interests, and I've met them in very different parts of my life.
 So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. 

