# Workshop - Summarizing text extractively

In this task, we will examine a classic application of TF-IDF for extractive text summarization.

## Document retrieval


## Text summarization

Text summarization is a common natural language processing task that aims to condense a text into its most relevant information. There are two main approaches to this task:

* **Extractive text summarization**: This approach selects the chunks of the original text (sentences, sections, or segments) that best summarize its content and returns them verbatim. For example:

  > Input: "Alan Mathison Turing OBE FRS (/ˈtjʊərɪŋ/; June 23, 1912 – June 7, 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher, and theoretical biologist.[6][7] Turing had a major influence on the development of theoretical computer science, providing a formalization of the concepts of algorithm and computation with the Turing machine, which can be considered a model of a general-purpose computer.[8][9][10] Turing is widely regarded as the father of theoretical computer science and artificial intelligence.[11]"

  > Output: "Alan Mathison Turing OBE FRS (/ˈtjʊərɪŋ/; June 23, 1912 – June 7, 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher, and theoretical biologist."
* **Abstractive text summarization**: This approach synthesizes the content, so the summary does not have to be made of passages taken from the original text. It involves automatically generating new, coherent text.

## Required libraries

This task must be solved using the following dependencies:

In [21]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spacy

from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer

## Data

Let's define the text we are going to process:

In [22]:
text = """Geoffrey Everest Hinton CC FRS FRSC[11] (born 6 December 1947) is a British-Canadian cognitive psychologist and computer scientist, most noted for his work on artificial neural networks. Since 2013, he has divided his time working for Google (Google Brain) and the University of Toronto. In 2017, he co-founded and became the Chief Scientific Advisor of the Vector Institute in Toronto.[12][13]
With David Rumelhart and Ronald J. Williams, Hinton was co-author of a highly cited paper published in 1986 that popularized the backpropagation algorithm for training multi-layer neural networks,[14] although they were not the first to propose the approach.[15] Hinton is viewed as a leading figure in the deep learning community.[16][17][18][19][20] The dramatic image-recognition milestone of the AlexNet designed in collaboration with his students Alex Krizhevsky[21] and Ilya Sutskever for the ImageNet challenge 2012[22] was a breakthrough in the field of computer vision.[23]
Hinton received the 2018 Turing Award, together with Yoshua Bengio and Yann LeCun, for their work on deep learning.[24] They are sometimes referred to as the "Godfathers of AI" and "Godfathers of Deep Learning",[25][26] and have continued to give public talks together.[27]
After his Ph.D. he worked at the University of Sussex, and (after difficulty finding funding in Britain)[29] the University of California, San Diego, and Carnegie Mellon University.[1] He was the founding director of the Gatsby Charitable Foundation Computational Neuroscience Unit at University College London,[1] and is currently[30] a professor in the computer science department at the University of Toronto. He holds a Canada Research Chair in Machine Learning, and is currently an advisor for the Learning in Machines & Brains program at the Canadian Institute for Advanced Research. Hinton taught a free online course on Neural Networks on the education platform Coursera in 2012.[31] Hinton joined Google in March 2013 when his company, DNNresearch Inc., was acquired. He is planning to "divide his time between his university research and his work at Google".[32]
Hinton's research investigates ways of using neural networks for machine learning, memory, perception and symbol processing. He has authored or co-authored over 200 peer reviewed publications.[2][33]
While Hinton was a professor at Carnegie Mellon University (1982–1987), David E. Rumelhart and Hinton and Ronald J. Williams applied the backpropagation algorithm to multi-layer neural networks. Their experiments showed that such networks can learn useful internal representations of data.[14] In an interview of 2018,[34] Hinton said that "David E. Rumelhart came up with the basic idea of backpropagation, so it's his invention." Although this work was important in popularizing backpropagation, it was not the first to suggest the approach.[15] Reverse-mode automatic differentiation, of which backpropagation is a special case, was proposed by Seppo Linnainmaa in 1970, and Paul Werbos proposed to use it to train neural networks in 1974.[15]
During the same period, Hinton co-invented Boltzmann machines with David Ackley and Terry Sejnowski.[35] His other contributions to neural network research include distributed representations, time delay neural network, mixtures of experts, Helmholtz machines and Product of Experts. In 2007 Hinton coauthored an unsupervised learning paper titled Unsupervised learning of image transformations.[36] An accessible introduction to Geoffrey Hinton's research can be found in his articles in Scientific American in September 1992 and October 1993.[37]
In October and November 2017 respectively, Hinton published two open access research papers[38][39] on the theme of capsule neural networks, which according to Hinton are "finally something that works well."[40]
Notable former PhD students and postdoctoral researchers from his group include Peter Dayan,[41] Sam Roweis,[41] Richard Zemel,[3][6] Brendan Frey,[7] Radford M. Neal,[8] Ruslan Salakhutdinov,[9] Ilya Sutskever,[10] Yann LeCun[42] and Zoubin Ghahramani.
"""

## **Define the NLP pipeline**

Define the steps needed to solve the task with `spacy`:

In [23]:
# Your code here
spacy.cli.download("en_core_web_sm")
nlp = spacy.load('en_core_web_sm')

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## **1. Tokenize the document**

Build a list of phrases using `spacy`:

In [24]:
doc = nlp(text)  # Run the spaCy pipeline on the text; the resulting Doc holds the linguistic annotations
phrases = [sent.text for sent in doc.sents]  # Collect the text of each sentence detected by spaCy
display(phrases[:2])  # Display the first 2 phrases as an example

['Geoffrey Everest Hinton CC FRS FRSC[11] (born 6 December 1947) is a British-Canadian cognitive psychologist and computer scientist, most noted for his work on artificial neural networks.',
 'Since 2013, he has divided his time working for Google (Google Brain) and the University of Toronto.']

## **2. Preprocess the sentences**

Implement the `preprocess` function to clean up the text:

* Remove special characters (punctuation marks and numbers)
* Convert each word to lowercase.
* Remove empty sentences
* Remove line breaks, tabs, and repeated spaces.

In [25]:
def preprocess(text):
    # Remove special characters (punctuation marks and numbers)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert each word to lowercase.
    text = text.lower()
    # Remove line breaks, tabs, and repeated spaces.
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the preprocess function to each phrase
processed_phrases = [preprocess(phrase) for phrase in phrases]

# Remove empty sentences
processed_phrases = [phrase for phrase in processed_phrases if phrase]

display(processed_phrases[:5]) # Display the first 5 processed phrases as an example

['geoffrey everest hinton cc frs frsc born december is a britishcanadian cognitive psychologist and computer scientist most noted for his work on artificial neural networks',
 'since he has divided his time working for google google brain and the university of toronto',
 'in he cofounded and became the chief scientific advisor of the vector institute in toronto with david rumelhart and ronald j williams hinton was coauthor of a highly cited paper published in that popularized the backpropagation algorithm for training multilayer neural networks although they were not the first to propose the approach hinton is viewed as a leading figure in the deep learning community the dramatic imagerecognition milestone of the alexnet designed in collaboration with his students alex krizhevsky and ilya sutskever for the imagenet challenge was a breakthrough in the field of computer vision hinton received the turing award together with yoshua bengio and yann lecun for their work on deep learning they
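
The third processed phrase above is very long because several sentences, and even several paragraphs, of the source ended up in a single phrase. One thing that may help, offered here as an untested assumption rather than a guaranteed fix, is to remove the bracketed citation markers (such as `[12][13]`) before running the spaCy pipeline, since they sit right after the sentence-ending period. The helper name `strip_citations` is hypothetical:

```python
import re

# Matches bracketed numeric citation markers such as [12] or [1]
CITATION_RE = re.compile(r"\[\d+\]")


def strip_citations(raw_text):
    """Remove markers like [12] while leaving the rest of the text untouched."""
    return CITATION_RE.sub("", raw_text)


sample = "Vector Institute in Toronto.[12][13]\nWith David Rumelhart, Hinton was co-author."
cleaned = strip_citations(sample)
print(cleaned)
```

The cleaned string could then be passed to `nlp(...)` in place of `text`. Whether the sentence boundaries actually improve has to be checked on the resulting `phrases` list.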

## **3. Build a TF-IDF representation**

Build a TF-IDF representation using `sklearn`:

Try different vectorizer settings, including:

* With and without idf weighting.
* With and without sublinear scaling.
* Different normalizations (None, l1, l2)

In [26]:
# Initialize the TF-IDF Vectorizer
# We can experiment with different settings here:
# - use_idf=True/False: With or without inverse document frequency weighting
# - sublinear_tf=True/False: Apply sublinear TF scaling (1 + log(tf))
# - norm: Normalization to apply ('l1', 'l2', or None)

tfidf_vectorizer = TfidfVectorizer(
    use_idf=True,        # Example: using IDF weighting
    sublinear_tf=True,   # Example: applying sublinear TF scaling
    norm='l2'            # Example: using L2 normalization
)

# Fit and transform the processed phrases
tfidf_matrix = tfidf_vectorizer.fit_transform(processed_phrases)

# The tfidf_matrix is a sparse matrix. We can convert it to a dense array if needed,
# but it's generally more memory efficient to work with the sparse matrix.
# tfidf_dense = tfidf_matrix.toarray()

print("TF-IDF matrix shape:", tfidf_matrix.shape)

TF-IDF matrix shape: (17, 284)
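
To make the three settings concrete, here is a minimal pure-Python sketch of the weighting they control. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` that `TfidfVectorizer` applies with its default `smooth_idf=True`, and it splits on whitespace, which is a simplification (the default `sklearn` tokenizer also drops single-character tokens). The function name `tfidf_rows` is hypothetical and not part of any library:

```python
import math
from collections import Counter


def tfidf_rows(docs, use_idf=True, sublinear_tf=False, norm="l2"):
    """Return (vocabulary, rows) where each row holds one weight per vocabulary term."""
    tokenized = [doc.split() for doc in docs]  # simplification: whitespace tokens
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = []
        for term in vocab:
            tf = counts[term]
            if tf and sublinear_tf:
                tf = 1 + math.log(tf)  # dampen repeated terms
            row.append(tf * (idf[term] if use_idf else 1.0))
        if norm == "l2":
            scale = math.sqrt(sum(v * v for v in row))
        elif norm == "l1":
            scale = sum(abs(v) for v in row)
        else:
            scale = 1.0  # norm=None keeps the raw weights
        rows.append([v / scale if scale else 0.0 for v in row])
    return vocab, rows


docs = ["neural networks learn", "neural networks", "deep learning"]
vocab, raw = tfidf_rows(docs, norm=None)
_, unit = tfidf_rows(docs, norm="l2")
print(vocab)
print([round(v, 4) for v in raw[0]])
print([round(v, 4) for v in unit[0]])
```

Terms that occur in fewer documents (`learn`) receive a larger weight than terms shared across documents (`neural`, `networks`), and the `l2` option rescales each row to unit length.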


## **4. Show the number of sentences and vocabulary size**

In [27]:
# Get the number of processed sentences
num_sentences = len(processed_phrases)
# Get the vocabulary size from the TF-IDF vectorizer
vocabulary_size = len(tfidf_vectorizer.get_feature_names_out())

# Display the number of sentences and vocabulary size
print(f"Number of sentences: {num_sentences}")
print(f"Vocabulary size: {vocabulary_size}")

Number of sentences: 17
Vocabulary size: 284


## **5. Display the TF-IDF representation as a pandas dataframe**

In [28]:
# Get the feature names (words) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert the sparse TF-IDF matrix to a dense array and then to a pandas DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

# Display the first few rows of the TF-IDF DataFrame
display(tfidf_df.head())

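| | access | accessible | according | ackley | acquired | advanced | advisor | after | ai | alex | ... | williams | with | work | worked | working | works | yann | yoshua | zemel | zoubin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.169803 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.312586 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.073727 | 0.142961 | 0.084435 | 0.084435 | ... | 0.073727 | 0.138781 | 0.060237 | 0.084435 | 0.0 | 0.0 | 0.073727 | 0.084435 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.234557 | 0.204811 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |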
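
Because most entries of each row are zero, it can be easier to inspect only the highest-weighted terms of a sentence. The sketch below uses a hypothetical helper, `top_terms`, on a four-word slice of the vocabulary with weights copied from rows 0 and 2 of the table above:

```python
def top_terms(weights, vocabulary, n=3):
    """Return the n highest-weighted (term, weight) pairs, skipping zero weights."""
    pairs = [(term, w) for term, w in zip(vocabulary, weights) if w > 0]
    # sorted() is stable, so tied terms keep their vocabulary order
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


# Small slice of the vocabulary and the matching weights from the table above
vocabulary = ["advisor", "work", "working", "yann"]
row_0 = [0.0, 0.169803, 0.0, 0.0]
row_2 = [0.073727, 0.060237, 0.0, 0.073727]

print(top_terms(row_0, vocabulary))
print(top_terms(row_2, vocabulary, n=2))
```

On the full DataFrame, `tfidf_df.iloc[i].nlargest(n)` gives a comparable view for sentence `i`.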


## **6. Estimate the importance of each sentence in the text**

Try different aggregation functions (sum, mean, std, var, min, max) to obtain a single number that represents each sentence (each row, or document, of the TF-IDF matrix):

In [29]:
# Estimate sentence importance using different aggregation functions

# Sum of TF-IDF scores for each sentence
sentence_importance_sum = tfidf_df.sum(axis=1)

# Mean of TF-IDF scores for each sentence
sentence_importance_mean = tfidf_df.mean(axis=1)

# Standard deviation of TF-IDF scores for each sentence
sentence_importance_std = tfidf_df.std(axis=1)

# Variance of TF-IDF scores for each sentence
sentence_importance_var = tfidf_df.var(axis=1)

# Minimum TF-IDF score for each sentence
sentence_importance_min = tfidf_df.min(axis=1)

# Maximum TF-IDF score for each sentence
sentence_importance_max = tfidf_df.max(axis=1)

# Display the importance scores for each aggregation method
print("Sentence Importance (Sum):")
display(sentence_importance_sum.head())

print("\nSentence Importance (Mean):")
display(sentence_importance_mean.head())

print("\nSentence Importance (Standard Deviation):")
display(sentence_importance_std.head())

print("\nSentence Importance (Variance):")
display(sentence_importance_var.head())

print("\nSentence Importance (Minimum):")
display(sentence_importance_min.head())

print("\nSentence Importance (Maximum):")
display(sentence_importance_max.head())

Sentence Importance (Sum):

| | 0 |
|---|---|
| 0 | 4.755047 |
| 1 | 3.702337 |
| 2 | 10.497836 |
| 3 | 4.602754 |
| 4 | 3.44781 |

Sentence Importance (Mean):

| | 0 |
|---|---|
| 0 | 0.016743 |
| 1 | 0.013036 |
| 2 | 0.036964 |
| 3 | 0.016207 |
| 4 | 0.01214 |

Sentence Importance (Standard Deviation):

| | 0 |
|---|---|
| 0 | 0.057028 |
| 1 | 0.057992 |
| 2 | 0.046501 |
| 3 | 0.057184 |
| 4 | 0.058186 |

Sentence Importance (Variance):

| | 0 |
|---|---|
| 0 | 0.003252 |
| 1 | 0.003363 |
| 2 | 0.002162 |
| 3 | 0.00327 |
| 4 | 0.003386 |

Sentence Importance (Minimum):

| | 0 |
|---|---|
| 0 | 0.0 |
| 1 | 0.0 |
| 2 | 0.0 |
| 3 | 0.0 |
| 4 | 0.0 |

Sentence Importance (Maximum):

| | 0 |
|---|---|
| 0 | 0.238015 |
| 1 | 0.414514 |
| 2 | 0.177197 |
| 3 | 0.283323 |
| 4 | 0.385978 |
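
A few properties of these aggregations follow directly from how the rows are built, and they matter when choosing one. The minimum is 0.0 whenever a sentence does not contain every vocabulary word, so it cannot rank sentences. The mean is the sum divided by the vocabulary size (for sentence 0, 4.755047 / 284 ≈ 0.016743), so both produce the same ranking. With `l2` normalization, the sum of a row with `k` non-zero terms is at most `sqrt(k)`, which means the sum tends to favor sentences with many distinct words. The toy rows below are invented for illustration only:

```python
import math


def l2_normalize(row):
    """Scale a row to unit Euclidean length (rows of zeros are returned unchanged)."""
    length = math.sqrt(sum(v * v for v in row))
    return [v / length for v in row] if length else list(row)


vocab_size = 8
# Toy rows: equal raw weights, differing only in the number of distinct terms
short = l2_normalize([1.0] * 2 + [0.0] * 6)   # 2 distinct terms
long_ = l2_normalize([1.0] * 6 + [0.0] * 2)   # 6 distinct terms

for name, row in [("short", short), ("long", long_)]:
    total = sum(row)
    print(name,
          "sum:", round(total, 4),
          "mean:", round(total / vocab_size, 4),
          "min:", min(row),
          "max:", round(max(row), 4))
```

The longer row gets the larger sum and mean, while the shorter row gets the larger maximum. This matches the pattern in the outputs above, where the very long sentence 2 has the highest sum and the lowest maximum.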


## **7. Identify the most important sentences in the text**

Find the 10 most important sentences in the text. After selecting them, restore the order in which they appear in the original text.

In [30]:
# Combine original phrases with their importance scores (using sum in this example)
sentence_scores = pd.DataFrame({'phrase': phrases, 'importance': sentence_importance_sum})

# Sort sentences by importance in descending order
sorted_sentences = sentence_scores.sort_values(by='importance', ascending=False)

# Select the top 10 most important sentences
top_10_sentences = sorted_sentences.head(10)

# Get the indices of the top 10 sentences in the original order
top_10_indices = top_10_sentences.index.tolist()

# Sort the indices to maintain the original order
top_10_indices.sort()

# Get the top 10 sentences in their original order
summary_sentences = [phrases[i] for i in top_10_indices]

print("Top 10 Most Important Sentences (in original order):")

Top 10 Most Important Sentences (in original order):


In [31]:
# Display the summary sentences
for sentence in summary_sentences:
    print(f"- {sentence}")

- Geoffrey Everest Hinton CC FRS FRSC[11] (born 6 December 1947) is a British-Canadian cognitive psychologist and computer scientist, most noted for his work on artificial neural networks.
- In 2017, he co-founded and became the Chief Scientific Advisor of the Vector Institute in Toronto.[12][13]
With David Rumelhart and Ronald J. Williams, Hinton was co-author of a highly cited paper published in 1986 that popularized the backpropagation algorithm for training multi-layer neural networks,[14] although they were not the first to propose the approach.[15] Hinton is viewed as a leading figure in the deep learning community.[16][17][18][19][20] The dramatic image-recognition milestone of the AlexNet designed in collaboration with his students Alex Krizhevsky[21] and Ilya Sutskever for the ImageNet challenge 2012[22] was a breakthrough in the field of computer vision.[23]
Hinton received the 2018 Turing Award, together with Yoshua Bengio and Yann LeCun, for their work on deep learning.[24]
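
The select-then-reorder step can be isolated in a small function, which makes it easy to reuse with any of the aggregation scores. This is a minimal sketch with a hypothetical helper name, `top_k_in_order`, applied to the five sum scores printed in step 6:

```python
def top_k_in_order(scores, k):
    """Return the positions of the k highest scores, sorted by position."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Sum scores of the first five sentences, copied from step 6
scores = [4.755047, 3.702337, 10.497836, 4.602754, 3.44781]
print(top_k_in_order(scores, 3))
```

With pandas, `sentence_importance_sum.nlargest(10).index.sort_values()` should give the same kind of result in one expression.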

## **8. Try other preprocessing techniques or representation variations to improve results**

In [32]:
# This cell is for experimenting with different preprocessing techniques or TF-IDF variations

# Example 1: Stop word removal and lemmatization using spaCy
# This is typically applied after initial tokenization and before building the TF-IDF

def preprocess_with_stopwords_lemma(text):
    doc = nlp(text) # Process the text with spaCy
    # Remove stop words and lemmatize tokens (convert words to their base form)
    # Filter out stop words and punctuation
    processed_tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(processed_tokens)

# Apply the new preprocessing function to the original phrases
processed_phrases_v2 = [preprocess_with_stopwords_lemma(phrase) for phrase in phrases]
# Remove any empty strings that might result from preprocessing
processed_phrases_v2 = [phrase for phrase in processed_phrases_v2 if phrase]

print("Processed phrases with stop word removal and lemmatization:")
display(processed_phrases_v2[:5])

# Example 2: Experimenting with TF-IDF Vectorizer settings
# Re-run the TF-IDF vectorization and subsequent steps with a new vectorizer

tfidf_vectorizer_v2 = TfidfVectorizer(
    use_idf=False,        # Example: without Inverse Document Frequency weighting
    sublinear_tf=False,   # Example: without applying sublinear TF scaling (1 + log(tf))
    norm=None,            # Example: without normalization ('l1' or 'l2')
    # max_df=0.95,        # Ignore terms that appear in more than 95% of the documents
    # min_df=2            # Ignore terms that appear in fewer than 2 documents
)

# Fit and transform either the original processed_phrases or processed_phrases_v2
tfidf_matrix_v2 = tfidf_vectorizer_v2.fit_transform(processed_phrases)  # Or processed_phrases_v2 to use the new preprocessing

print("\nTF-IDF matrix shape (v2):", tfidf_matrix_v2.shape)

# Remember to re-run the subsequent cells (showing sentence importance, displaying as dataframe, etc.)
# after making changes in this cell to see the effect of the variations.

Processed phrases with stop word removal and lemmatization:


['Geoffrey Everest Hinton CC FRS frsc[11 bear 6 December 1947 british canadian cognitive psychologist computer scientist note work artificial neural network',
 '2013 divide time work Google Google Brain University Toronto',
 '2017 co found Chief Scientific Advisor Vector Institute Toronto.[12][13 \n David Rumelhart Ronald J. Williams Hinton co author highly cite paper publish 1986 popularize backpropagation algorithm train multi layer neural networks,[14 propose approach.[15 Hinton view lead figure deep learning community.[16][17][18][19][20 dramatic image recognition milestone AlexNet design collaboration student Alex Krizhevsky[21 Ilya Sutskever ImageNet challenge 2012[22 breakthrough field computer vision.[23 \n Hinton receive 2018 Turing Award Yoshua Bengio Yann LeCun work deep learning.[24 refer Godfathers AI Godfathers Deep learning",[25][26 continue public talk together.[27 \n ph.d. work University Sussex difficulty find funding britain)[29 University California San Diego Carneg


TF-IDF matrix shape (v2): (17, 284)
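
The shape is unchanged because `use_idf`, `sublinear_tf`, and `norm` only alter the weights, not the vocabulary, and the same `processed_phrases` were used. To judge whether a variation actually changes the summary, one option is to compare the sets of selected sentence indices. The sketch below uses the Jaccard overlap for that; the helper name `selection_overlap` and the example index lists are hypothetical:

```python
def selection_overlap(indices_a, indices_b):
    """Jaccard overlap between two selections of sentence indices (1.0 = identical)."""
    a, b = set(indices_a), set(indices_b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical selections produced by two different configurations
baseline = [0, 2, 3]
variant = [0, 2, 4]
print(selection_overlap(baseline, variant))
```

A value close to 1.0 suggests the variation barely affects which sentences are chosen. A lower value is a cue to read both summaries and decide which one is better.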
