
Let's do text summarization with a library (sumy). In a real application I would summarize the documents with several different methods, compare the results, and then choose the one that works best.

In [None]:
!pip install sumy
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

import pandas as pd
import textwrap

import nltk
nltk.download('punkt')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# Get the data

!wget https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

--2022-08-14 14:45:04--  https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv
Resolving lazyprogrammer.me (lazyprogrammer.me)... 104.21.23.210, 172.67.213.166, 2606:4700:3030::ac43:d5a6, ...
Connecting to lazyprogrammer.me (lazyprogrammer.me)|104.21.23.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5085081 (4.8M) [text/csv]
Saving to: ‘bbc_text_cls.csv.5’


2022-08-14 14:45:04 (227 MB/s) - ‘bbc_text_cls.csv.5’ saved [5085081/5085081]



In [None]:
df = pd.read_csv("bbc_text_cls.csv")

In [None]:
# Sample a single business article; that one article is my whole dataset here
doc = df[df["labels"] == "business"]["text"].sample(random_state=42)

In [None]:
summarizer = TextRankSummarizer()

# Split off the first line of the article (the headline) and parse only the body
parser = PlaintextParser.from_string(doc.iloc[0].split("\n", 1)[1], Tokenizer("english"))

summary = summarizer(parser.document, sentences_count = 5)
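
A note on the `split("\n", 1)[1]` call above: it keeps everything after the first newline, which works as long as the article really has a headline on its first line. The sketch below shows the same idea on a made-up article string (the text and the helper name `split_title_body` are my own, not part of the dataset or of sumy). It uses `str.partition`, which does not raise an `IndexError` when there is no newline at all.

```python
def split_title_body(article):
    """Split an article into (headline, body) at the first newline.

    Assumes the headline is the first line. If there is no newline,
    the whole text is treated as the headline and the body is empty.
    """
    title, _, body = article.partition("\n")
    return title.strip(), body.strip()


# Made-up example text, only for illustration
article = "UK retail sales fall in December\n\nRetail sales dropped by 1%.\nAnalysts were surprised."

title, body = split_title_body(article)
print("TITLE:", title)
print("BODY:", body)
```

With the notebook's data this would be called as `split_title_body(doc.iloc[0])`, and only the body would go into the parser.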

In [None]:
summary # Should be similar but not necessarily equal to the previous approach

(<Sentence: Retail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said.>,
 <Sentence: The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%.>,
 <Sentence: The ONS echoed an earlier caution from Bank of England governor Mervyn King not to read too much into the poor December figures.>,
 <Sentence: Some analysts put a positive gloss on the figures, pointing out that the non-seasonally-adjusted figures showed a performance comparable with 2003.>,
 <Sentence: The November-December jump last year was roughly comparable with recent averages, although some way below the serious booms seen in the 1990s.>)

In [None]:
for s in summary:
    print(textwrap.fill(str(s), replace_whitespace=False, fix_sentence_endings=True))

Retail sales dropped by 1% on the month in December, after a 0.6% rise
in November, the Office for National Statistics (ONS) said.
The ONS revised the annual 2004 rate of growth down from the 5.9%
estimated in November to 3.2%.
The ONS echoed an earlier caution from Bank of England governor Mervyn
King not to read too much into the poor December figures.
Some analysts put a positive gloss on the figures, pointing out that
the non-seasonally-adjusted figures showed a performance comparable
with 2003.
The November-December jump last year was roughly comparable with
recent averages, although some way below the serious booms seen in the
1990s.
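
To get a feeling for what a TextRank-style summarizer does, here is a minimal pure-Python sketch. It is a simplified version of the general idea (sentence graph, word-overlap similarity, PageRank-style scoring) and not sumy's actual implementation. The sentence splitter, the function names and the toy text are all my own assumptions.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Shared words, normalized by the log of both sentence lengths."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_summary(text, sentences_count=2, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration and keep the top ones."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    # Weighted similarity graph without self-loops
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(weights[j][i] / out_sums[j] * scores[j]
                       for j in range(n) if out_sums[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    # Pick the best-scoring sentences, then restore document order
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:sentences_count]
    return [sentences[i] for i in sorted(top)]


toy_text = (
    "Retail sales fell in December. "
    "Retail sales fell more than analysts expected in December. "
    "The weather was cold. "
    "Analysts expected retail sales to rise."
)
for sentence in textrank_summary(toy_text, sentences_count=2):
    print(sentence)
```

The sentence about the weather shares no words with the others, so it only gets the baseline score and is never selected. Library implementations differ in tokenization, stop words and stemming, so their rankings will not match this sketch exactly.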


In [None]:
# Another summarizer: LSA. It picks different sentences than TextRank, as expected.

summarizer_lsa = LsaSummarizer()

parser_lsa = PlaintextParser.from_string(doc.iloc[0].split("\n", 1)[1], Tokenizer("english"))

summary_lsa = summarizer_lsa(parser_lsa.document, sentences_count = 5)

for s in summary_lsa:
    print(textwrap.fill(str(s), replace_whitespace=False, fix_sentence_endings=True))

UK retail sales fell in December, failing to meet expectations and
making it by some counts the worst Christmas since 1981.
Morrisons, Woolworths, House of Fraser, Marks & Spencer and Big Food
all said that the festive period was disappointing.
And a British Retail Consortium survey found that Christmas 2004 was
the worst for 10 years.
Yet, other retailers - including HMV, Monsoon, Jessops, Body Shop and
Tesco - reported that festive sales were well up on last year.
Investec chief economist Philip Shaw said he did not expect the poor
retail figures to have any immediate effect on interest rates.
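
Instead of comparing the two summaries by eye, a small helper can count how many sentences they share. This is only a sketch: `sentence_overlap` is a name I made up, and the toy lists below stand in for real summaries. It converts every item with `str()`, so it should also accept the sentence objects that sumy returns.

```python
def sentence_overlap(summary_a, summary_b):
    """Return the shared sentences and the Jaccard index of two summaries."""
    a = {str(s).strip() for s in summary_a}
    b = {str(s).strip() for s in summary_b}
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Toy summaries, only for illustration
first = ["Sales fell.", "Rates were unchanged.", "Analysts were surprised."]
second = ["Rates were unchanged.", "Analysts were surprised.", "Shops blamed the weather."]

shared, jaccard = sentence_overlap(first, second)
print(sorted(shared))
print(jaccard)
```

In this notebook the call would be `sentence_overlap(summary, summary_lsa)`. Judging from the two printed outputs above, TextRank and LSA have no sentence in common for this article, so the Jaccard index would be 0.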


In [None]:
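# Try gensim's summarizer and see how much the result changes
# Note: gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0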
from gensim.summarization.summarizer import summarize
summary = summarize(doc.iloc[0].split("\n", 1)[1])
print(textwrap.fill(summary, replace_whitespace=False, fix_sentence_endings=True))

Retail sales dropped by 1% on the month in December, after a 0.6% rise
in November, the Office for National Statistics (ONS) said.
The ONS
echoed an earlier caution from Bank of England governor Mervyn King
not to read too much into the poor December figures.
"The retail sales
figures are very weak, but as Bank of England governor Mervyn King
indicated last night, you don't really get an accurate impression of
Christmas trading until about Easter," said Mr Shaw.
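
The odd line breaks in the output above ("The ONS" alone on a line) look like a wrapping artifact. I assume the gensim summary is one string with the sentences separated by newlines. With `replace_whitespace=False`, `textwrap.fill` keeps those newlines inside the text while still counting them toward the line width. A possible fix, sketched below with a made-up helper `wrap_lines`, is to wrap each line on its own and join the results.

```python
import textwrap


def wrap_lines(text, width=70):
    """Wrap every newline-separated line separately, skipping empty lines."""
    return "\n".join(
        textwrap.fill(line, width=width, fix_sentence_endings=True)
        for line in text.splitlines()
        if line.strip()
    )


# Made-up two-sentence "summary", only for illustration
demo = (
    "Retail sales dropped more than expected in December, the statistics office said.\n"
    "Analysts warned not to read too much into one month."
)
print(wrap_lines(demo, width=40))
```

Each sentence now starts on a fresh line and no line is longer than `width`. In the notebook this would be `print(wrap_lines(summary))`.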
