# Workshop - Summarizing text extractively

In this task, we will examine a classic application of TF-IDF for extractive text summarization.

## Text summarization

Text summarization is a typical natural language processing task that aims to extract relevant information from a given text. There are two main approaches to this task:

* **Extractive text summarization**: retrieves the chunks of the original text (sentences, sections, or segments) that are most likely to summarize its content. The summary consists of passages copied verbatim from the input. For example:

  > Input: "Alan Mathison Turing OBE FRS (/ˈtjʊərɪŋ/; June 23, 1912 – June 7, 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher, and theoretical biologist.[6][7] Turing had a major influence on the development of theoretical computer science, providing a formalization of the concepts of algorithm and computation with the Turing machine, which can be considered a model of a general-purpose computer.[8][9][10] Turing is widely regarded as the father of theoretical computer science and artificial intelligence.[11]"

  > Output: "Alan Mathison Turing OBE FRS (/ˈtjʊərɪŋ/; June 23, 1912 – June 7, 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher, and theoretical biologist."
* **Abstractive text summarization**: synthesizes the content of the text, so the summary does not have to consist of passages taken from it. It involves automatically generating new, coherent text that conveys the same information.

## Required libraries

This task must be solved using the following dependencies:

In [None]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spacy

from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer

## Data

Let's define the text we are going to process:

In [None]:
text = """Geoffrey Everest Hinton CC FRS FRSC[11] (born 6 December 1947) is a British-Canadian cognitive psychologist and computer scientist, most noted for his work on artificial neural networks. Since 2013, he has divided his time working for Google (Google Brain) and the University of Toronto. In 2017, he co-founded and became the Chief Scientific Advisor of the Vector Institute in Toronto.[12][13]
With David Rumelhart and Ronald J. Williams, Hinton was co-author of a highly cited paper published in 1986 that popularized the backpropagation algorithm for training multi-layer neural networks,[14] although they were not the first to propose the approach.[15] Hinton is viewed as a leading figure in the deep learning community.[16][17][18][19][20] The dramatic image-recognition milestone of the AlexNet designed in collaboration with his students Alex Krizhevsky[21] and Ilya Sutskever for the ImageNet challenge 2012[22] was a breakthrough in the field of computer vision.[23]
Hinton received the 2018 Turing Award, together with Yoshua Bengio and Yann LeCun, for their work on deep learning.[24] They are sometimes referred to as the "Godfathers of AI" and "Godfathers of Deep Learning",[25][26] and have continued to give public talks together.[27]
After his Ph.D. he worked at the University of Sussex, and (after difficulty finding funding in Britain)[29] the University of California, San Diego, and Carnegie Mellon University.[1] He was the founding director of the Gatsby Charitable Foundation Computational Neuroscience Unit at University College London,[1] and is currently[30] a professor in the computer science department at the University of Toronto. He holds a Canada Research Chair in Machine Learning, and is currently an advisor for the Learning in Machines & Brains program at the Canadian Institute for Advanced Research. Hinton taught a free online course on Neural Networks on the education platform Coursera in 2012.[31] Hinton joined Google in March 2013 when his company, DNNresearch Inc., was acquired. He is planning to "divide his time between his university research and his work at Google".[32]
Hinton's research investigates ways of using neural networks for machine learning, memory, perception and symbol processing. He has authored or co-authored over 200 peer reviewed publications.[2][33]
While Hinton was a professor at Carnegie Mellon University (1982–1987), David E. Rumelhart and Hinton and Ronald J. Williams applied the backpropagation algorithm to multi-layer neural networks. Their experiments showed that such networks can learn useful internal representations of data.[14] In an interview of 2018,[34] Hinton said that "David E. Rumelhart came up with the basic idea of backpropagation, so it's his invention." Although this work was important in popularizing backpropagation, it was not the first to suggest the approach.[15] Reverse-mode automatic differentiation, of which backpropagation is a special case, was proposed by Seppo Linnainmaa in 1970, and Paul Werbos proposed to use it to train neural networks in 1974.[15]
During the same period, Hinton co-invented Boltzmann machines with David Ackley and Terry Sejnowski.[35] His other contributions to neural network research include distributed representations, time delay neural network, mixtures of experts, Helmholtz machines and Product of Experts. In 2007 Hinton coauthored an unsupervised learning paper titled Unsupervised learning of image transformations.[36] An accessible introduction to Geoffrey Hinton's research can be found in his articles in Scientific American in September 1992 and October 1993.[37]
In October and November 2017 respectively, Hinton published two open access research papers[38][39] on the theme of capsule neural networks, which according to Hinton are "finally something that works well."[40]
Notable former PhD students and postdoctoral researchers from his group include Peter Dayan,[41] Sam Roweis,[41] Richard Zemel,[3][6] Brendan Frey,[7] Radford M. Neal,[8] Ruslan Salakhutdinov,[9] Ilya Sutskever,[10] Yann LeCun[42] and Zoubin Ghahramani.
"""

## **Define the NLP pipeline**

Define the `spacy` pipeline needed for this task:

In [None]:
# Your code here
spacy.cli.download("en_core_web_sm")
nlp = spacy.load('en_core_web_sm')

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


## **1. Tokenize the document**

Build a list of sentences using `spacy`:

In [None]:
# Your code here

## **2. Preprocess the sentences**

Implement the `preprocess` function to clean up each sentence:

* Convert each word to lowercase.
* Remove special characters (punctuation marks and numbers).
* Remove line breaks, tabs, and repeated spaces.
* Remove sentences that end up empty.

In [None]:
# Your code here
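
As a starting point, the cleaning rules above could be sketched as follows. This is a minimal sketch, not the reference solution: it assumes `preprocess` takes a list of strings and returns a list of strings, and that keeping only the ASCII letters `a-z` is acceptable (accented letters would also be dropped).

```python
import re


def preprocess(sentences):
    """Clean a list of sentences (assumed signature: list of str -> list of str)."""
    cleaned = []
    for sentence in sentences:
        s = sentence.lower()
        # Replace everything that is not a lowercase ASCII letter or whitespace,
        # which drops punctuation, digits, and citation marks such as [12].
        s = re.sub(r"[^a-z\s]", " ", s)
        # Collapse line breaks, tabs, and repeated spaces into single spaces.
        s = re.sub(r"\s+", " ", s).strip()
        # Skip sentences that are empty after cleaning.
        if s:
            cleaned.append(s)
    return cleaned


example = ["Hinton joined Google in March 2013!\n", "  [12][13] ", "Deep\tLearning."]
print(preprocess(example))
```

Because empty sentences are dropped, the cleaned list can be shorter than the input. If the original sentences are needed later to print the summary, it may be worth keeping the two lists aligned.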

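## **3. Build a TF-IDF representation**

Build a TF-IDF representation of the sentences using `sklearn`, treating each sentence as a document.

Try different `TfidfVectorizer` settings, including:

* With and without IDF weighting (`use_idf`).
* With and without sublinear term-frequency scaling (`sublinear_tf`).
* Different normalizations (`norm` set to `None`, `'l1'`, or `'l2'`).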

In [None]:
# Your code here
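
One possible way to compare vectorizer settings is to fit one `TfidfVectorizer` per configuration and keep the resulting matrices side by side. The sketch below uses three short, made-up sentences rather than the workshop text, and the configuration names (`default`, `no_idf`, and so on) are illustrative labels only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-preprocessed sentences used only for illustration.
sentences = [
    "hinton works on neural networks",
    "neural networks learn internal representations",
    "hinton received the turing award",
]

# Each entry overrides some TfidfVectorizer defaults
# (use_idf=True, sublinear_tf=False, norm="l2").
settings = {
    "default": {},
    "no_idf": {"use_idf": False},
    "sublinear": {"sublinear_tf": True},
    "l1_norm": {"norm": "l1"},
    "no_norm": {"norm": None},
}

matrices = {}
for name, params in settings.items():
    vectorizer = TfidfVectorizer(**params)
    # Rows are sentences, columns are vocabulary terms.
    matrices[name] = vectorizer.fit_transform(sentences)
    print(name, matrices[name].shape)
```

With `norm="l2"` every row has unit Euclidean length, so long and short sentences become comparable. With `norm=None` the raw weights are kept, which tends to favor longer sentences once the rows are aggregated.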

## **4. Show the number of sentences and the vocabulary size**

In [None]:
# Your code here

## **5. Display the TF-IDF representation as a pandas DataFrame**

In [None]:
# Your code here

## **6. Estimate the importance of each sentence in the text**

Try different aggregation functions (sum, mean, std, var, min, max) to reduce the TF-IDF vector of each sentence to a single score:

In [None]:
# Your code here
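
The aggregation step can be sketched with NumPy by applying each function along the term axis. The small matrix below is invented for illustration and stands in for a dense TF-IDF matrix with one row per sentence.

```python
import numpy as np

# Hypothetical dense TF-IDF matrix: 3 sentences (rows) x 3 terms (columns).
tfidf = np.array([
    [0.0, 0.6, 0.8],
    [0.5, 0.5, 0.0],
    [1.0, 0.0, 0.0],
])

aggregations = {
    "sum": np.sum,
    "mean": np.mean,
    "std": np.std,
    "var": np.var,
    "min": np.min,
    "max": np.max,
}

# axis=1 collapses the term dimension, leaving one score per sentence.
scores = {name: fn(tfidf, axis=1) for name, fn in aggregations.items()}

for name, values in scores.items():
    print(f"{name:>4}: {np.round(values, 3)}")
```

Since almost no sentence contains every term in the vocabulary, `min` is usually zero for all rows and rarely helps to rank sentences.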

## **7. Identify the most important sentences in the text**

Find the 10 most important sentences in the text. After filtering, the selected sentences must keep the order in which they appear in the original text.

In [None]:
# Your code here
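
One way to select the top sentences while preserving document order is to pick the indices of the highest scores and then sort those indices before indexing. The sentences and scores below are placeholders, and `k` is set to 3 only to keep the example short.

```python
import numpy as np

# Placeholder sentences and scores, in original document order.
sentences = ["s0", "s1", "s2", "s3", "s4"]
scores = np.array([0.2, 0.7, 0.1, 0.9, 0.5])

k = min(3, len(sentences))

# Indices of the k highest scores, from most to least important.
top_indices = np.argsort(scores)[::-1][:k]

# Sorting the indices restores the order of appearance in the text.
ordered_indices = np.sort(top_indices)

summary = [sentences[i] for i in ordered_indices]
print(summary)
```

Here the ranking by score is `s3`, `s1`, `s4`, but the summary lists them as `s1`, `s3`, `s4`, matching their position in the text.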

## **8. Try other preprocessing techniques or representation variations to improve results**

In [None]:
# Your code here
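
Two variations that may be worth trying are stop-word removal and adding bigrams, both available as `TfidfVectorizer` parameters. The sketch below only compares the resulting vocabularies on two made-up sentences; whether either option improves the summary has to be checked on the actual text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-preprocessed sentences used only for illustration.
sentences = [
    "the network learns the representation",
    "the award honours the work on deep learning",
]

# Baseline: unigrams, no stop-word filtering.
plain = TfidfVectorizer().fit(sentences)

# Variation 1: drop scikit-learn's built-in English stop words.
filtered = TfidfVectorizer(stop_words="english").fit(sentences)

# Variation 2: use unigrams and bigrams as features.
bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(sentences)

print("plain:   ", sorted(plain.vocabulary_))
print("filtered:", sorted(filtered.vocabulary_))
print("bigrams: ", len(bigrams.vocabulary_), "features")
```

Lemmatizing tokens with `spacy` before vectorizing is another option along the same lines.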