# **Text Summarization using NLP**


**What is text summarization?**

Text summarization is the process of distilling the most important information from a source text.

**Why automatic text summarization?**



1.   Summaries reduce reading time.
2.   When researching documents, summaries make the selection process easier.
3.   Automatic summarization improves the effectiveness of indexing.
4.   Automatic summarization algorithms are less biased than human summarizers.
5.   Personalized summaries are useful in question-answering systems as they provide personalized information.
6.   Using automatic or semi-automatic summarization systems enables commercial abstract services to increase the number of text documents they are able to process.





# **Types of summarization**

![alt text](https://drive.google.com/uc?id=1AqwSGEpi3vzAOLVt_5XXRXokZHvcn43B)



**How to do text summarization**


*   Text cleaning
*   Sentence tokenization
*   Word tokenization
*   Word-frequency table
*   Summarization 
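
The five steps above can be sketched end to end with only the standard library. This is a simplified stand-in, not the spaCy-based code used in this notebook: the `summarize` function, the tiny `STOP` list, the regular expressions and the `ratio` parameter are all assumptions made for illustration.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption; spaCy ships a much larger one).
STOP = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "it", "for", "on"}

def summarize(text, ratio=0.3):
    # 1. Text cleaning: collapse runs of whitespace, including hard line breaks.
    cleaned = " ".join(text.split())
    # 2. Sentence tokenization: naive split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", cleaned)
    # 3. Word tokenization: lowercase alphabetic runs only.
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    # 4. Word-frequency table over non-stop words.
    freq = Counter(w for words in tokenized for w in words if w not in STOP)
    # 5. Summarization: score each sentence, keep the top fraction in reading order.
    scores = {i: sum(freq[w] for w in words) for i, words in enumerate(tokenized)}
    k = max(1, int(len(sentences) * ratio))
    top = sorted(nlargest(k, scores, key=scores.get))
    return " ".join(sentences[i] for i in top)
```

Calling `summarize(some_text, ratio=0.5)` would keep roughly half of the sentences. The naive sentence splitter breaks on abbreviations such as "et al.", which is one reason a real tokenizer is used in the cells that follow.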
 
 

  **Text variable**








In [1]:
text = """
ABSTRACT
In recent years, people are seeking for a solution to improve text
summarization for Thai language. Although several solutions such
as PageRank, Graph Rank, Latent Semantic Analysis (LSA)
models, etc., have been proposed, research results in Thai text
summarization were restricted due to limited corpus in Thai
language with complex grammar. This paper applied a text
summarization system for Thai travel news based on keyword
scored in Thai language by extracting the most relevant sentences
from the original document. We compared LSA and Non-negative
Matrix Factorization (NMF) to find the algorithm that is suitable
with Thai travel news. The suitable compression rates for Generic
Sentence Relevance score (GRS) and K-means clustering were also
evaluated. From these experiments, we concluded that keyword
scored calculation by LSA with sentence selection by GRS is the
best algorithm for summarizing Thai Travel News, compared with
human with the best compression rate of 20%.
CCS Concepts
• Information systems ➝ Information retrieval ➝ Retrieval
tasks and goals➝ Summarization
Keywords
Text summarization; extractive summarization; non-negative
matrix factorization
1. INTRODUCTION
Daily newspaper has abundant of data that users do not have
enough time for reading them. It is difficult to identify the relevant
information to satisfy the information needed by users. Automatic
summarization can reduce the problem of information overloading
and it has been proposed previously in English and other languages.
However, there were only a few research results in Thai text
summarization due to the lack of corpus in Thai language and the
complicated grammar.
Text Summarization [1] is a technique for summarizing the content
of the documents. It consists of three steps: 1) create an
intermediate representation of the input text, 2) calculate score for
the sentences based on the concepts, and 3) choose important
sentences to be included in the summary. Text summarization can
be divided into 2 approaches. The first approach is the extractive
summarization, which relies on a method for extracting words and
searching for keywords from the original document. The second
approach is the abstractive summarization, which analyzes words
by linguistic principles with transcription or interpretation from the
original document. This approach implies more effective and
accurate summary than the extractive methods. However, with the
lack of Thai corpus, we chose to apply an extractive summarization
method for Thai text summarization.
This research focused on the sentence extraction function based on
keyword score calculation then selecting important sentences based
on the Generic Sentence Relevance score (GRS), calculated from
Latent Semantic Analysis (LSA) and Non-negative Matrix
Factorization (NMF). We also tried using K-means clustering for
document summarization. In this experiment, we compared 5
models for 5 rounds with Thai travel news using the compression
rates of 20%, 30% and 40% and reported the rate and method that
produced the best result from the experiment.
2. RELATED WORKS
In recent years, several models in Thai Text summarization have
been introduced. Suwanno, N. et al. [2] proposed a Thai text
summarization that extracted a paragraph from a document based
on Thai compound nouns, term frequency method, and headline
score for generating a summary. Chongsuntornsri, A., et al. [3]
proposed a new approach for Text summarization in Thai based on
content- and graph-based with the use of Topic Sensitive PageRank
algorithm for summarizing and ranking of text segments.
Jaruskulchai C., et al. [4] proposed a method to summarize
documents by extracting important sentences from combining the
specific properties (Local Property) and the overall properties
(Global Property) of the sentences. The overall properties were
based on the relationship between sentences in the document. From
their experiments, the summarization of the industrial news got
60% precision, 44% recall, and 50.9% F-measure, the general news
got the 51.8% precision, 38.5% recall, and 43.1% F-measure while
the fashion magazines got 53.0% precision, 33.0% recall, and
40.4% F-measure.
"""



# Let's Get Started with SpaCy

In [2]:
# pip install -U spacy
# python -m spacy download en_core_web_sm

In [3]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [4]:
stopwords = list(STOP_WORDS)

In [5]:
nlp = spacy.load('en_core_web_sm')
# On some systems you need to install the package and the model separately with these commands:
# !pip3 install -U spacy
# !python3 -m spacy download en_core_web_sm
# Run both commands above to install spaCy and the en_core_web_sm model.

In [6]:
doc = nlp(text)

In [7]:
tokens = [token.text for token in doc]
print(tokens)

['\n', 'ABSTRACT', '\n', 'In', 'recent', 'years', ',', 'people', 'are', 'seeking', 'for', 'a', 'solution', 'to', 'improve', 'text', '\n', 'summarization', 'for', 'Thai', 'language', '.', 'Although', 'several', 'solutions', 'such', '\n', 'as', 'PageRank', ',', 'Graph', 'Rank', ',', 'Latent', 'Semantic', 'Analysis', '(', 'LSA', ')', '\n', 'models', ',', 'etc', '.', ',', 'have', 'been', 'proposed', ',', 'research', 'results', 'in', 'Thai', 'text', '\n', 'summarization', 'were', 'restricted', 'due', 'to', 'limited', 'corpus', 'in', 'Thai', '\n', 'language', 'with', 'complex', 'grammar', '.', 'This', 'paper', 'applied', 'a', 'text', '\n', 'summarization', 'system', 'for', 'Thai', 'travel', 'news', 'based', 'on', 'keyword', '\n', 'scored', 'in', 'Thai', 'language', 'by', 'extracting', 'the', 'most', 'relevant', 'sentences', '\n', 'from', 'the', 'original', 'document', '.', 'We', 'compared', 'LSA', 'and', 'Non', '-', 'negative', '\n', 'Matrix', 'Factorization', '(', 'NMF', ')', 'to', 'find', 

In [8]:
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'
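
Because `punctuation` is a plain string, an expression like `token_text in punctuation` is a substring test. A token made of two newlines is not a substring of it, and an empty string always is. A minimal sketch of a per-character check is shown here; `PUNCT_CHARS` and `is_punct_token` are hypothetical names, not part of spaCy or of this notebook.

```python
from string import punctuation

# Set of single characters to drop, plus the newline character.
PUNCT_CHARS = set(punctuation) | {"\n"}

def is_punct_token(token_text):
    # True only when the token is non-empty and every character in it
    # is a punctuation mark or a newline.
    return bool(token_text) and all(ch in PUNCT_CHARS for ch in token_text)
```

With this helper, `"\n\n"` and `"..."` count as punctuation while `"K-means"` does not.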

In [9]:
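word_frequencies = {}
for word in doc:
  # Skip stop words, punctuation marks and newline tokens.
  if word.text.lower() not in stopwords:
    if word.text.lower() not in punctuation:
      # Keys keep the token's original casing, so 'Travel' and 'travel' are counted separately.
      if word.text not in word_frequencies.keys():
        word_frequencies[word.text] = 1
      else:
        word_frequencies[word.text] += 1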

In [10]:
print(word_frequencies)

{'ABSTRACT': 1, 'recent': 2, 'years': 2, 'people': 1, 'seeking': 1, 'solution': 1, 'improve': 1, 'text': 8, 'summarization': 17, 'Thai': 16, 'language': 4, 'solutions': 1, 'PageRank': 2, 'Graph': 1, 'Rank': 1, 'Latent': 2, 'Semantic': 2, 'Analysis': 2, 'LSA': 4, 'models': 3, 'etc': 1, 'proposed': 5, 'research': 3, 'results': 2, 'restricted': 1, 'limited': 1, 'corpus': 3, 'complex': 1, 'grammar': 2, 'paper': 1, 'applied': 1, 'system': 1, 'travel': 3, 'news': 5, 'based': 8, 'keyword': 3, 'scored': 2, 'extracting': 3, 'relevant': 2, 'sentences': 7, 'original': 3, 'document': 6, 'compared': 3, 'Non': 2, 'negative': 3, 'Matrix': 2, 'Factorization': 2, 'NMF': 2, 'find': 1, 'algorithm': 3, 'suitable': 2, 'compression': 3, 'rates': 2, 'Generic': 2, 'Sentence': 2, 'Relevance': 2, 'score': 5, 'GRS': 3, 'K': 2, 'means': 2, 'clustering': 2, 'evaluated': 1, 'experiments': 2, 'concluded': 1, 'calculation': 2, 'sentence': 2, 'selection': 1, 'best': 3, 'summarizing': 3, 'Travel': 1, 'News': 1, 'human'
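
The table above is keyed by the original casing, which is why `'travel'` and `'Travel'` appear as separate entries. One possible variant lowercases every token first so that each word has a single key. This is a sketch under assumptions: `build_word_frequencies` is a hypothetical helper that takes plain strings (for example `[t.text for t in doc]`) rather than spaCy tokens.

```python
from collections import Counter

def build_word_frequencies(tokens, stopwords, punct_chars):
    # Lowercase once so that "Thai" and "thai" share one key.
    lowered = (t.lower() for t in tokens)
    return Counter(
        t for t in lowered
        if t.strip()                                   # drop whitespace-only tokens
        and t not in stopwords                         # drop stop words
        and not all(ch in punct_chars for ch in t)     # drop pure punctuation
    )
```

`Counter` also provides `most_common(n)`, which is handy for inspecting the highest-frequency words.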

In [11]:
max_frequency = max(word_frequencies.values())

In [12]:
max_frequency

17

In [13]:
for word in word_frequencies.keys():
  word_frequencies[word] = word_frequencies[word]/max_frequency

In [14]:
print(word_frequencies)

{'ABSTRACT': 0.058823529411764705, 'recent': 0.11764705882352941, 'years': 0.11764705882352941, 'people': 0.058823529411764705, 'seeking': 0.058823529411764705, 'solution': 0.058823529411764705, 'improve': 0.058823529411764705, 'text': 0.47058823529411764, 'summarization': 1.0, 'Thai': 0.9411764705882353, 'language': 0.23529411764705882, 'solutions': 0.058823529411764705, 'PageRank': 0.11764705882352941, 'Graph': 0.058823529411764705, 'Rank': 0.058823529411764705, 'Latent': 0.11764705882352941, 'Semantic': 0.11764705882352941, 'Analysis': 0.11764705882352941, 'LSA': 0.23529411764705882, 'models': 0.17647058823529413, 'etc': 0.058823529411764705, 'proposed': 0.29411764705882354, 'research': 0.17647058823529413, 'results': 0.11764705882352941, 'restricted': 0.058823529411764705, 'limited': 0.058823529411764705, 'corpus': 0.17647058823529413, 'complex': 0.058823529411764705, 'grammar': 0.11764705882352941, 'paper': 0.058823529411764705, 'applied': 0.058823529411764705, 'system': 0.0588235
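
The normalization step divides each count by the largest count, so the most frequent word gets a weight of 1.0 and every other word falls between 0 and 1. A small sketch that returns a new dictionary instead of modifying the table in place (the function name `normalize_frequencies` is an assumption):

```python
def normalize_frequencies(word_frequencies):
    # An empty table has no maximum, so return an empty result.
    if not word_frequencies:
        return {}
    # Divide every count by the largest count so the top word scores 1.0.
    max_frequency = max(word_frequencies.values())
    return {word: count / max_frequency for word, count in word_frequencies.items()}
```

For the counts shown earlier, `'summarization'` (17 occurrences) maps to 1.0 and `'Thai'` (16 occurrences) maps to 16/17.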

In [15]:
sentence_tokens = [sent for sent in doc.sents]
print(sentence_tokens)

[
ABSTRACT
In recent years, people are seeking for a solution to improve text
summarization for Thai language., Although several solutions such
as PageRank, Graph Rank, Latent Semantic Analysis (LSA)
models, etc., have been proposed, research results in Thai text
summarization were restricted due to limited corpus in Thai
language with complex grammar., This paper applied a text
summarization system for Thai travel news based on keyword
scored in Thai language by extracting the most relevant sentences
from the original document., We compared LSA and Non-negative
Matrix Factorization (NMF) to find the algorithm that is suitable
with Thai travel news., The suitable compression rates for Generic
Sentence Relevance score (GRS) and K-means clustering were also
evaluated., From these experiments, we concluded that keyword
scored calculation by LSA with sentence selection by GRS is the
best algorithm for summarizing Thai Travel News, compared with
human with the best compression rate of 20%.


In [16]:
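sentence_scores = {}
for sent in sentence_tokens:
  for word in sent:
    # The lookup uses the lowercased token, so a word adds to the score only if
    # its lowercased form is a key in word_frequencies (capitalized keys such as
    # 'Thai' are never matched).
    if word.text.lower() in word_frequencies.keys():
      if sent not in sentence_scores.keys():
        sentence_scores[sent] = word_frequencies[word.text.lower()]
      else:
        sentence_scores[sent] += word_frequencies[word.text.lower()]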


In [17]:
sentence_scores

{
 ABSTRACT
 In recent years, people are seeking for a solution to improve text
 summarization for Thai language.: 2.176470588235294,
 Although several solutions such
 as PageRank, Graph Rank, Latent Semantic Analysis (LSA)
 models, etc., have been proposed, research results in Thai text
 summarization were restricted due to limited corpus in Thai
 language with complex grammar.: 3.117647058823529,
 This paper applied a text
 summarization system for Thai travel news based on keyword
 scored in Thai language by extracting the most relevant sentences
 from the original document.: 4.352941176470588,
 We compared LSA and Non-negative
 Matrix Factorization (NMF) to find the algorithm that is suitable
 with Thai travel news.: 1.3529411764705883,
 The suitable compression rates for Generic
 Sentence Relevance score (GRS) and K-means clustering were also
 evaluated.: 1.1176470588235294,
 From these experiments, we concluded that keyword
 scored calculation by LSA with sentence selection by GR
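
A sentence's score is the sum of the normalized frequencies of the words it contains. The idea can be sketched on plain lists of strings, assuming a frequency table whose keys are all lowercase; `score_sentences` is a hypothetical helper, and sentences are identified by their position in the document rather than by spaCy spans.

```python
def score_sentences(sentences, word_frequencies):
    # sentences: list of token lists; word_frequencies: lowercase word -> weight.
    scores = {}
    for index, tokens in enumerate(sentences):
        # Unknown words (stop words, punctuation) contribute nothing.
        total = sum(word_frequencies.get(tok.lower(), 0.0) for tok in tokens)
        if total > 0:
            scores[index] = total
    return scores
```

Long sentences tend to score higher with a plain sum; dividing by the number of tokens is one possible way to offset that, at the cost of favoring very short sentences.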

In [18]:
from heapq import nlargest

In [19]:
select_length = int(len(sentence_tokens)*0.3)
select_length

9
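
With a ratio of 0.3, a document of one to three sentences yields `int(n * 0.3) == 0` and therefore an empty summary. A guarded version might look like this; `select_count` is a hypothetical helper and the minimum of one sentence is a design assumption.

```python
def select_count(n_sentences, ratio=0.3):
    # Nothing to select from an empty document.
    if n_sentences <= 0:
        return 0
    # Keep at least one sentence and never more than the document has.
    return min(n_sentences, max(1, int(n_sentences * ratio)))
```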

In [20]:
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)

In [21]:
summary

[CCS Concepts
 • Information systems ➝ Information retrieval ➝ Retrieval
 tasks and goals➝ Summarization
 Keywords
 Text summarization; extractive summarization; non-negative
 matrix factorization
 1.,
 From
 their experiments, the summarization of the industrial news got
 60% precision, 44% recall, and 50.9% F-measure, the general news
 got the 51.8% precision, 38.5% recall, and 43.1% F-measure while
 the fashion magazines got 53.0% precision, 33.0% recall, and
 40.4% F-measure.,
 This paper applied a text
 summarization system for Thai travel news based on keyword
 scored in Thai language by extracting the most relevant sentences
 from the original document.,
 [3]
 proposed a new approach for Text summarization in Thai based on
 content- and graph-based with the use of Topic Sensitive PageRank
 algorithm for summarizing and ranking of text segments.,
 [2] proposed a Thai text
 summarization that extracted a paragraph from a document based
 on Thai compound nouns, term frequency metho
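
`nlargest` returns the selected sentences from highest score to lowest, which is why the summary above does not follow the order of the source text. If reading order matters, one option is to score sentences by their position and sort the selected positions afterwards. A sketch, with `top_sentences_in_order` as a hypothetical helper:

```python
from heapq import nlargest

def top_sentences_in_order(scores, k):
    # scores maps a sentence's position in the document to its score.
    best = nlargest(k, scores, key=scores.get)
    # nlargest returns the highest score first; sorting restores reading order.
    return sorted(best)
```

The returned positions can then be used to index back into the list of sentences.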

In [22]:
# Each item in summary is a sentence span, so take its text.
final_summary = [sent.text for sent in summary]

In [23]:
summary = ' '.join(final_summary)
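
The source text contains hard line breaks, and they carry over into each sentence's text. If a single-line summary is preferred, whitespace can be collapsed before joining. A minimal sketch, assuming the sentences are available as plain strings (`join_sentences` is a hypothetical helper):

```python
def join_sentences(sentence_texts):
    # split() with no arguments breaks on any whitespace run, including "\n",
    # so each sentence is rebuilt with single spaces before concatenation.
    return " ".join(" ".join(s.split()) for s in sentence_texts)
```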

In [24]:
print(text)


ABSTRACT
In recent years, people are seeking for a solution to improve text
summarization for Thai language. Although several solutions such
as PageRank, Graph Rank, Latent Semantic Analysis (LSA)
models, etc., have been proposed, research results in Thai text
summarization were restricted due to limited corpus in Thai
language with complex grammar. This paper applied a text
summarization system for Thai travel news based on keyword
scored in Thai language by extracting the most relevant sentences
from the original document. We compared LSA and Non-negative
Matrix Factorization (NMF) to find the algorithm that is suitable
with Thai travel news. The suitable compression rates for Generic
Sentence Relevance score (GRS) and K-means clustering were also
evaluated. From these experiments, we concluded that keyword
scored calculation by LSA with sentence selection by GRS is the
best algorithm for summarizing Thai Travel News, compared with
human with the best compression rate of 20%.
CCS Co

In [25]:
print(summary)

CCS Concepts
• Information systems ➝ Information retrieval ➝ Retrieval
tasks and goals➝ Summarization
Keywords
Text summarization; extractive summarization; non-negative
matrix factorization
1. From
their experiments, the summarization of the industrial news got
60% precision, 44% recall, and 50.9% F-measure, the general news
got the 51.8% precision, 38.5% recall, and 43.1% F-measure while
the fashion magazines got 53.0% precision, 33.0% recall, and
40.4% F-measure.
 This paper applied a text
summarization system for Thai travel news based on keyword
scored in Thai language by extracting the most relevant sentences
from the original document. [3]
proposed a new approach for Text summarization in Thai based on
content- and graph-based with the use of Topic Sensitive PageRank
algorithm for summarizing and ranking of text segments.
 [2] proposed a Thai text
summarization that extracted a paragraph from a document based
on Thai compound nouns, term frequency method, and headline
score for 