# PyData Cyprus #5 Meetup: A Gentle Introduction to Text Summarization

Automatic text summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. **The main idea of summarization is to find a subset of data which contains the “information” of the entire set**. Such techniques are widely used in industry today.

In this demo, we will review several text summarization techniques and how to apply them in Python.

![title](img/picture.png)

## Extractive Text Summarization Techniques With sumy

ref: https://github.com/miso-belica/sumy

In [None]:
# uncomment to install, if missing: !pip install sumy

**Extractive** text summarization techniques build a summary by picking portions of the original text, unlike **abstractive** techniques, which conceptualize a summary and paraphrase it.

In [None]:
# Plain-text parser, since the input is a plain-text file
from sumy.parsers.plaintext import PlaintextParser

# Tokenizer that splits the text into sentences and words
from sumy.nlp.tokenizers import Tokenizer

BBC Dataset: http://mlg.ucd.ie/datasets/bbc.html

In [80]:
# Path to a plain-text article from the BBC news dataset
filename = "datasetbbc/tech/004.txt"

# Print the article; the context manager closes the file afterwards
with open(filename, "r") as file:
    for line in file:
        print(line)

Digital guru floats sub-$100 PC



Nicholas Negroponte, chairman and founder of MIT's Media Labs, says he is developing a laptop PC that will go on sale for less than $100 (£53).



He told the BBC World Service programme Go Digital he hoped it would become an education tool in developing countries. He said one laptop per child could be " very important to the development of not just that child but now the whole family, village and neighbourhood". He said the child could use the laptop like a text book. He described the device as a stripped down laptop, which would run a Linux-based operating system, "We have to get the display down to below $20, to do this we need to rear project the image rather than using an ordinary flat panel.



"The second trick is to get rid of the fat , if you can skinny it down you can gain speed and the ability to use smaller processors and slower memory." The device will probably be exported as a kit of parts to be assembled locally to keep costs down. Mr N

In [None]:
parser = PlaintextParser.from_file(filename, Tokenizer("english"))

### 1. Lex Rank

This is a graph-based text summarizer. The main idea is that sentences "recommend" other similar sentences to the reader, so a sentence that is similar to many others is likely to be important.

ref: https://github.com/wikibusiness/lexrank
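
To make the "recommendation" idea concrete, here is a minimal pure-Python sketch. It is not sumy's implementation. It assumes plain word counts instead of TF-IDF weights, a hypothetical similarity threshold of 0.1, and a handful of made-up toy sentences loosely based on the article above. The names `tokenize`, `cosine_similarity` and `lexrank_scores` are illustrative only.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(bag_a[w] * bag_b[w] for w in bag_a.keys() & bag_b.keys())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Score sentences by how strongly their neighbours "recommend" them."""
    bags = [Counter(tokenize(s)) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Link two different sentences when their similarity exceeds the threshold.
    linked = [
        [i != j and cosine_similarity(bags[i], bags[j]) > threshold for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in linked]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Every sentence splits its current score evenly among its neighbours.
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] / degree[j] for j in range(n) if linked[j][i])
            for i in range(n)
        ]
    return scores


# Hypothetical toy sentences (an assumption, not part of the dataset).
toy_sentences = [
    "Negroponte wants cheap laptops for children.",
    "Cheap laptops need cheaper displays.",
    "Children could use them as text books.",
    "Nokia sells many phones.",
]
toy_scores = lexrank_scores(toy_sentences)
best = max(range(len(toy_sentences)), key=toy_scores.__getitem__)
print(toy_sentences[best])
```

In this toy graph the first sentence shares words with two others, so it collects the most "recommendations", while the unrelated last sentence ends up with the lowest score.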

In [81]:
from sumy.summarizers.lex_rank import LexRankSummarizer 
summarizer = LexRankSummarizer()

# Summarize the document in 2 sentences
summary = summarizer(parser.document, 2)
for sentence in summary:
    print(sentence)

Nicholas Negroponte, chairman and founder of MIT's Media Labs, says he is developing a laptop PC that will go on sale for less than $100 (£53).
He said one laptop per child could be " very important to the development of not just that child but now the whole family, village and neighbourhood".


### 2. Luhn

Luhn is one of the earliest summarization algorithms, proposed by Hans Peter Luhn, the IBM researcher it is named after. It scores sentences based on the frequency of the most important words they contain.
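
The scoring idea can be sketched in a few lines of plain Python. This is a simplified illustration, not sumy's code. It assumes a tiny hand-written stop-word list, treats every word that occurs at least twice as "significant", and allows at most four insignificant words between two significant ones inside a cluster. The helper names and toy sentences are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, far from complete).
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "in", "and", "for", "it", "on"}


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def significant_words(sentences, min_count=2):
    """Non-stop words that occur at least min_count times in the document."""
    counts = Counter(w for s in sentences for w in tokenize(s) if w not in STOP_WORDS)
    return {word for word, count in counts.items() if count >= min_count}


def luhn_score(sentence, significant, max_gap=4):
    """Best cluster score: (significant words in cluster)^2 / cluster length."""
    positions = [i for i, w in enumerate(tokenize(sentence)) if w in significant]
    best = 0.0
    start = 0  # index into positions where the current cluster begins
    for k in range(1, len(positions) + 1):
        # Close the cluster at the end, or when the gap to the next
        # significant word is wider than max_gap.
        if k == len(positions) or positions[k] - positions[k - 1] - 1 > max_gap:
            count = k - start
            span = positions[k - 1] - positions[start] + 1
            best = max(best, count * count / span)
            start = k
    return best


# Hypothetical toy document (an assumption, not part of the dataset).
toy_document = [
    "The laptop is a cheap laptop for children.",
    "Children in many countries have no laptop.",
    "The weather was pleasant.",
]
keywords = significant_words(toy_document)
luhn_scores = [luhn_score(s, keywords) for s in toy_document]
for sentence, score in zip(toy_document, luhn_scores):
    print(round(score, 2), sentence)
```

Sentences with dense clusters of frequent words score highest, and a sentence with no significant words scores zero.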

In [82]:
from sumy.summarizers.luhn import LuhnSummarizer
summarizer_1 = LuhnSummarizer()
summary_1 = summarizer_1(parser.document, 2)
for sentence in summary_1:
    print(sentence)

"Nokia make 200 million cell phones a year, so for us to claim we're going to make 200 million laptops is a big number, but we're not talking about doing it in three or five years, we're talking about months."
That's for five or six years, so if we can distribute and sell laptops in quantities of one million or more to ministries of education that's cheaper and the marketing overheads go away."


### 3. LSA

Latent semantic analysis (LSA) is an unsupervised method of summarization: it combines term frequency techniques with singular value decomposition (SVD) to summarize texts. It is one of the more recently proposed summarization techniques.
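
As a rough illustration of how term frequencies and SVD fit together, the sketch below builds a term-by-sentence count matrix and approximates only its first right singular vector with power iteration, using nothing beyond the standard library. This is a deliberate simplification: a full LSA summarizer, including the one in sumy, works with several singular vectors and usually relies on a linear algebra library. The function names and toy sentences are hypothetical.

```python
import math
import re


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_sentence_matrix(sentences):
    """Rows are terms, columns are sentences, entries are term counts."""
    tokenized = [tokenize(s) for s in sentences]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    return [[tokens.count(term) for tokens in tokenized] for term in vocabulary]


def dominant_topic_weights(matrix, iterations=100):
    """Approximate the first right singular vector via power iteration on A^T A."""
    n = len(matrix[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # u = A v: one value per term
        u = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
        # w = A^T u: one value per sentence
        w = [sum(matrix[t][j] * u[t] for t in range(len(matrix))) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


# Hypothetical toy sentences (an assumption, not part of the dataset).
toy_sentences = [
    "The cheap laptop helps children learn.",
    "Children learn with a cheap laptop.",
    "The display must be cheap.",
    "Nokia sells phones.",
]
weights = dominant_topic_weights(term_sentence_matrix(toy_sentences))
ranking = sorted(range(len(toy_sentences)), key=lambda i: abs(weights[i]), reverse=True)
print(toy_sentences[ranking[0]])
```

The sentence with the largest weight is the one most aligned with the dominant "topic" of the toy document; the unrelated last sentence receives a weight close to zero.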

In [83]:
from sumy.summarizers.lsa import LsaSummarizer
summarizer_2 = LsaSummarizer()
summary_2 = summarizer_2(parser.document, 2)
for sentence in summary_2:
    print(sentence)

Nicholas Negroponte, chairman and founder of MIT's Media Labs, says he is developing a laptop PC that will go on sale for less than $100 (£53).
Mr Negroponte wants the laptops to become more common than mobile phones but conceded this was ambitious.


### 4. Text Rank

TextRank is a graph-based ranking technique that can be used both to summarize a document and to extract keywords from it.
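
For sentence ranking, the graph can be sketched as follows: sentences are nodes, edges are weighted by word overlap normalised by the logarithm of the sentence lengths, and a PageRank-style update with a damping factor of 0.85 propagates the scores. This is a minimal sketch under those assumptions, without stop-word removal or stemming, and it is not sumy's implementation. The function names and toy sentences are hypothetical.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def overlap_similarity(words_a, words_b):
    """Number of shared words, normalised by the log of both sentence lengths."""
    if not words_a or not words_b:
        return 0.0
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if denominator == 0:  # both sentences contain a single word
        return 0.0
    return len(set(words_a) & set(words_b)) / denominator


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence similarity graph."""
    words = [tokenize(s) for s in sentences]
    n = len(words)
    weight = [
        [overlap_similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    total_out = [sum(row) for row in weight]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence passes its score to neighbours in proportion to edge weight.
        scores = [
            (1 - damping)
            + damping * sum(
                weight[j][i] / total_out[j] * scores[j]
                for j in range(n)
                if weight[j][i] > 0
            )
            for i in range(n)
        ]
    return scores


# Hypothetical toy sentences (an assumption, not part of the dataset).
toy_sentences = [
    "Negroponte wants cheap laptops for children.",
    "Cheap laptops need cheaper displays.",
    "Children could use them as text books.",
    "Nokia sells many phones.",
]
toy_scores = textrank_scores(toy_sentences)
ranking = sorted(range(len(toy_sentences)), key=toy_scores.__getitem__, reverse=True)
for index in ranking:
    print(round(toy_scores[index], 2), toy_sentences[index])
```

Unlike a thresholded graph, the edge weights matter here: the second sentence shares two words with the first and therefore receives a larger share of its score than the third sentence, which shares only one.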

In [84]:
from sumy.summarizers.text_rank import TextRankSummarizer
summarizer_3 = TextRankSummarizer()
summary_3 = summarizer_3(parser.document, 2)
for sentence in summary_3:
    print(sentence)

He described the device as a stripped down laptop, which would run a Linux-based operating system, "We have to get the display down to below $20, to do this we need to rear project the image rather than using an ordinary flat panel.
"The second trick is to get rid of the fat , if you can skinny it down you can gain speed and the ability to use smaller processors and slower memory."


We have sampled just a few of the available algorithms. The results are reasonable and can help a reader get a general understanding of long texts and their contents. Libraries such as sumy make it quite easy to summarize documents, but it is also important for the engineer to understand the underlying statistics and mathematics of each algorithm in order to see which one suits the task best.

**Read more:**

- https://medium.com/@ondenyi.eric/extractive-text-summarization-techniques-with-sumy-3d3b127a0a32
- https://rare-technologies.com/text-summarization-in-python-extractive-vs-abstractive-techniques-revisited/
- https://github.com/miso-belica/sumy
    