## Extractive Summarization

Useful links:

* [Automatic summarising: factors and directions (1999)](https://www.cl.cam.ac.uk/archive/ksj21/ksjdigipapers/summbook99.pdf) - newcomers to summarization should start here. Contains definitions and reviews different approaches and goals of summarization.
* [Text Summarization in Python: Extractive vs. Abstractive techniques revisited (2017)](https://rare-technologies.com/text-summarization-in-python-extractive-vs-abstractive-techniques-revisited/) - overview of summarization techniques in Python
* [Lectures on Summarization Techniques (2015?)](https://www.youtube.com/watch?v=N5N-HCUE3G4) from an old Coursera NLP course - describes many summarization techniques with an emphasis on research; most of them don't have Python implementations.
* [Centroid-based Text Summarization through Compositionality of Word Embeddings](http://www.aclweb.org/anthology/W17-1003) - interesting article that uses word embeddings to replace the Bag of Words representation used in an older article. Has a remarkably [good implementation](https://github.com/gaetangate/text-summarizer) (it worked out of the box, which is uncommon for academic implementations; I just added setup.py to make it pip-installable).

## TextRank

## Notes

* **PyTextRank** - awkward, arcane API; doesn't expose a simple function call the way gensim and summa do
* **sumy** - requires setting up a pipeline (doesn't work directly on raw strings)
* **pyteaser** - Python 2 only

In [1]:
import pandas as pd

import gensim
import summa
import text_summarizer
import nltk

from sklearn.datasets import fetch_20newsgroups


In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/kuba/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/kuba/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
articles = fetch_20newsgroups()['data'][:1000]

In [4]:
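# Drop the first five lines of each post (they typically hold header fields) and flatten the rest into one line
article_texts = [article.split('\n', maxsplit=5)[5].replace('\n', ' ') for article in articles]

# Keep only texts with more than two '. ' sentence breaks, so there is something to summarize
article_texts = [article for article in article_texts if article.count('. ') > 2]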

article_texts = pd.Series(article_texts)
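
The cleaning step above indexes `[5]` directly, which would raise an `IndexError` on a post with fewer than six lines. A minimal sketch of the same logic as plain functions that tolerate short inputs; the names `clean_article` and `has_enough_sentences` are made up for illustration and are not part of the notebook:

```python
def clean_article(article, n_header_lines=5):
    """Drop the first n_header_lines lines and flatten the rest into one line."""
    parts = article.split('\n', maxsplit=n_header_lines)
    # Posts shorter than the header length yield an empty body instead of an error
    body = parts[n_header_lines] if len(parts) > n_header_lines else ''
    return body.replace('\n', ' ')


def has_enough_sentences(text, min_breaks=3):
    """Same filter as `text.count('. ') > 2`, with the threshold named."""
    return text.count('. ') >= min_breaks


sample = "From: a\nSubject: b\nLines: 3\nOrg: c\n\nOne. Two. Three. Four."
print(clean_article(sample))
```

The filter can then be applied as `[t for t in map(clean_article, articles) if has_enough_sentences(t)]`.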

In [5]:
len(article_texts)

862

In [6]:
article_texts[0]

'  I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is  all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail.  Thanks, - IL    ---- brought to you by your neighborhood Lerxst ----     '

In [7]:
%%time

summa_summaries = article_texts.apply(summa.summarizer.summarize)

CPU times: user 17 s, sys: 464 ms, total: 17.5 s
Wall time: 12.6 s


In [8]:
%%time

# Note: gensim.summarization was removed in gensim 4.0, so this cell needs gensim < 4.0
gensim_summaries = article_texts.apply(gensim.summarization.summarize)

CPU times: user 13.8 s, sys: 319 ms, total: 14.1 s
Wall time: 10.5 s
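
For intuition about what a TextRank-style summarizer does, here is a minimal pure-Python sketch: sentences become graph nodes, edge weights come from word overlap, and a weighted PageRank iteration ranks the sentences. This is an illustration under simplifying assumptions (naive regex sentence splitting, no stop-word removal or stemming), not the code that summa or gensim actually run; the function names are invented for this example.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def overlap_similarity(a, b):
    """Shared words, normalized by the log of both sentence vocabularies."""
    words_a = set(re.findall(r'\w+', a.lower()))
    words_b = set(re.findall(r'\w+', b.lower()))
    if not words_a or not words_b:
        return 0.0
    denom = math.log(len(words_a)) + math.log(len(words_b))
    return len(words_a & words_b) / denom if denom > 0 else 0.0


def textrank_summarize(text, n_sentences=2, damping=0.85, n_iter=50):
    """Return the n_sentences highest-ranked sentences in document order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= n_sentences:
        return sentences
    # Symmetric similarity graph without self-loops
    weights = [[overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in weights]
    scores = [1.0] * n
    # Weighted PageRank: each sentence passes its score on in proportion to edge weight
    for _ in range(n_iter):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


text = ("The cat sat on the mat. The cat chased the mouse on the mat. "
        "The mouse ran from the cat. Stocks fell sharply today.")
print(textrank_summarize(text, n_sentences=2))
```

A sentence that shares no words with the rest of the text (the last one in the example) gets no incoming score and ends up at the bottom of the ranking.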


## Centroid-based summarization

In [9]:
centroid_bow_summarizer = text_summarizer.CentroidBOWSummarizer(preprocess_type='regex')

In [10]:
%%time

centroid_bow_summaries = article_texts.apply(centroid_bow_summarizer.summarize)

CPU times: user 5.92 s, sys: 7.86 ms, total: 5.93 s
Wall time: 5.91 s
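
The idea behind the Bag of Words centroid approach can be sketched in a few lines: build a tf-idf vector per sentence, sum them into a centroid, and keep the sentences closest to that centroid by cosine similarity. The sketch below is a simplified illustration and not the `text_summarizer` implementation; it skips stop-word removal, centroid thresholding and redundancy filtering, and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    return re.findall(r'\w+', sentence.lower())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid_summarize(sentences, n_sentences=2):
    """Return the n_sentences closest to the tf-idf centroid, in document order."""
    n = len(sentences)
    if n <= n_sentences:
        return list(sentences)
    tokenized = [tokenize(s) for s in sentences]
    # Each sentence counts as one "document" for the idf statistics
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
               for tokens in tokenized]
    # The centroid is the sum of all sentence vectors
    centroid = {}
    for vec in vectors:
        for term, weight in vec.items():
            centroid[term] = centroid.get(term, 0.0) + weight
    scores = [cosine(vec, centroid) for vec in vectors]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse on the mat.",
    "The mouse ran from the cat.",
    "Stocks fell sharply today.",
]
print(centroid_summarize(sentences, n_sentences=2))
```

The word-embedding variant keeps the same ranking step but represents the centroid and the sentences as sums of word vectors instead of sparse tf-idf vectors.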


In [12]:
embedding_model = text_summarizer.centroid_word_embeddings.load_gensim_embedding_model('glove-wiki-gigaword-50')



In [13]:
%%time

centroid_word_embedding_summarizer = text_summarizer.CentroidWordEmbeddingsSummarizer(embedding_model, preprocess_type='regex')

CPU times: user 1.92 ms, sys: 159 µs, total: 2.08 ms
Wall time: 6.24 ms


In [14]:
%%time

centroid_word_embedding_summaries = article_texts.apply(centroid_word_embedding_summarizer.summarize)

CPU times: user 9.06 s, sys: 22.3 ms, total: 9.08 s
Wall time: 9.09 s
