
Building a Text Summarizer

Importing required libraries

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en import English
import numpy as np

Set up a blank spaCy English pipeline for sentence tokenization

In [3]:
# Create a blank spaCy English pipeline and add the rule-based sentencizer
nlp = English()
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x7d32c6e3ea40>
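
The `sentencizer` component is rule-based: it marks sentence boundaries at punctuation such as `.`, `!` and `?` rather than relying on a trained parser. The snippet below is only a rough approximation of that idea using a regular expression, not spaCy's implementation; `naive_split_sentences` is a hypothetical helper and the sample text is made up.

```python
import re

def naive_split_sentences(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

demo_split = naive_split_sentences("He joined Punch. He was knighted in 1893! Why?")
print(demo_split)
```

A splitter this simple would also break on abbreviations such as "Dr. Smith", which is one reason to prefer a maintained component over a hand-written rule.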

In [4]:
text_corpus = """
Google celebrated British illustrator and artist Sir John Tenniel's
200th birth anniversary with a doodle on February 28. An acclaimed
Victorian painter, Tenniel is celebrated for his illustrations for
Lewis Carroll's Alice's Adventures in Wonderland and Through the Looking-Glass.
Tenniel was born in Bayswater, West London in 1820. At the age of 20, Tenniel
received a major eye injury and eventually, lost sight in his right eye.
From a very early age, Tenniel was appreciated as a humorist and soon after,
also cultured his talent for scholarly caricature.
His first illustration was for Samuel Carter Hall's The Book of British
Ballads in 1842. Eight years later, he joined the historic weekly magazine
Punch as a political cartoonist. Lewis Carroll noticed Tenniel's distinct style
of work and in 1864, approached the artist to illustrate his book, Alice's
Adventures in Wonderland. This association marked Carroll and Tenniel's creative
partnership and continued with Through the Looking Glass in 1872. "The result:
a series of classic characters, such as Alice and the Cheshire Cat, as depicted
in the Doodle artwork's rendition of their iconic meeting-characters who, along
with many others, remain beloved by readers of all ages to this day," the Google
Doodle page says. After working with Lewis Carroll, Tenniel resumed his work with
Punch. For his work, Tenniel also received a knighthood in 1893.
Sir John Tenniel died on February 25, 1914. He was 93.
"""

Create a spaCy document for sentence-level tokenization

In [6]:
# Collapse line breaks and repeated whitespace, then split the text into sentences
doc = nlp(" ".join(text_corpus.split()))
sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
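
Joining the lines of a multi-line string needs a little care: deleting newlines outright fuses the last word of one line with the first word of the next unless every line happens to end in a space. A minimal sketch of the difference, using a made-up two-line string (`sample`, `fused` and `normalized` are illustrative names, not part of the notebook):

```python
# Hypothetical two-line input with no trailing space before the newline.
sample = "Tenniel was born in\nBayswater, West London."

# Deleting newlines glues "in" and "Bayswater" together.
fused = sample.replace("\n", "")

# Splitting on any whitespace and re-joining with single spaces keeps
# word boundaries and also collapses repeated spaces.
normalized = " ".join(sample.split())

print(fused)
print(normalized)
```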



In [7]:

print("Sentences are: \n", sentences)

Sentences are: 
 ["Google celebrated British illustrator and artist Sir John Tenniel's 200th birth anniversary with a doodle on February 28.", "An acclaimed Victorian painter, Tenniel is celebrated for his illustrations for Lewis Carroll's Alice's Adventures in Wonderland and Through the Looking-Glass.", 'Tenniel was born in Bayswater, West London in 1820.', 'At the age of 20, Tenniel received a major eye injury and eventually, lost sight in his right eye.', 'From a very early age, Tenniel was appreciated as a humorist and soon after, also cultured his talent for scholarly caricature.', "His first illustration was for Samuel Carter Hall's The Book of British Ballads in 1842.", 'Eight years later, he joined the historic weekly magazine Punch as a political cartoonist.', "Lewis Carroll noticed Tenniel's distinct style of work and in 1864, approached the artist to illustrate his book, Alice's Adventures in Wonderland.", 'This association marked Carroll and Tenniel\'s creative partnership 


Creating sentence organizer

In [8]:
# Map each sentence to its position in the text so that the top-scoring
# sentences can later be put back in their original order
sentence_organizer = {sentence: index for index, sentence in enumerate(sentences)}
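
One thing to keep in mind with a text-keyed dictionary: if the same sentence occurs twice, the later index silently overwrites the earlier one. A small sketch with made-up sentences (the `demo_*` names are illustrative) shows the effect and an index-based alternative:

```python
demo_sentences = ["He was 93.", "Tenniel joined Punch.", "He was 93."]

# Text-keyed mapping: the duplicated sentence keeps only its last index.
demo_organizer = {sentence: index for index, sentence in enumerate(demo_sentences)}
print(demo_organizer)

# Keeping (index, sentence) pairs preserves every position.
demo_pairs = list(enumerate(demo_sentences))
print(demo_pairs)
```

For the article used here every sentence is distinct, so the dictionary approach works as intended.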


Peeking into our sentence organizer

In [9]:
print("Our sentence organizer: \n", sentence_organizer)

Our sentence organizer: 
 {"Google celebrated British illustrator and artist Sir John Tenniel's 200th birth anniversary with a doodle on February 28.": 0, "An acclaimed Victorian painter, Tenniel is celebrated for his illustrations for Lewis Carroll's Alice's Adventures in Wonderland and Through the Looking-Glass.": 1, 'Tenniel was born in Bayswater, West London in 1820.': 2, 'At the age of 20, Tenniel received a major eye injury and eventually, lost sight in his right eye.': 3, 'From a very early age, Tenniel was appreciated as a humorist and soon after, also cultured his talent for scholarly caricature.': 4, "His first illustration was for Samuel Carter Hall's The Book of British Ballads in 1842.": 5, 'Eight years later, he joined the historic weekly magazine Punch as a political cartoonist.': 6, "Lewis Carroll noticed Tenniel's distinct style of work and in 1864, approached the artist to illustrate his book, Alice's Adventures in Wonderland.": 7, 'This association marked Carroll and

In [12]:
# Configure the TF-IDF vectorizer: word n-grams from unigrams to trigrams,
# English stop words removed, terms kept only if they occur in at least
# two sentences (min_df=2), with smoothed IDF and sublinear TF scaling
tf_idf_vectorizer = TfidfVectorizer(min_df=2, max_features=None,
                                    strip_accents='unicode',
                                    analyzer='word',
                                    token_pattern=r'\w{1,}',
                                    ngram_range=(1, 3),
                                    use_idf=True, smooth_idf=True,
                                    sublinear_tf=True, stop_words='english')
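
For reference, here is a rough pure-Python sketch of the weight these settings produce for a single term, following the formulas given in the scikit-learn documentation: with `sublinear_tf=True` the raw count `tf` becomes `1 + ln(tf)`, and with `smooth_idf=True` the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. The function name and the example counts are made up for illustration, and the final L2 normalization of each row (the default `norm='l2'`) is left out.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """Unnormalized TF-IDF weight for one term in one document.

    tf: raw count of the term in the document (must be > 0)
    df: number of documents that contain the term
    n_docs: total number of documents
    """
    sublinear_tf = 1.0 + math.log(tf)                      # sublinear_tf=True
    smooth_idf = math.log((1 + n_docs) / (1 + df)) + 1.0   # smooth_idf=True
    return sublinear_tf * smooth_idf

# Hypothetical counts: a term seen once, present in 2 of 10 sentences.
rare = tfidf_weight(tf=1, df=2, n_docs=10)
# A term seen once, but present in all 10 sentences.
common = tfidf_weight(tf=1, df=10, n_docs=10)
print(rare, common)
```

A term that appears in every sentence still gets a weight of 1 rather than 0, which is the effect of the `+ 1` in the IDF formula.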


In [13]:

# Fit the TF-IDF vectorizer, treating each sentence as one document
tf_idf_vectorizer.fit(sentences)

In [14]:

# Transforming our sentences to TF-IDF vectors
sentence_vectors = tf_idf_vectorizer.transform(sentences)

Performing sentence scoring

In [15]:
# Score each sentence by summing its TF-IDF weights
sentence_scores = np.array(sentence_vectors.sum(axis=1)).ravel()

# Sanity check: there should be exactly one score per sentence
print(len(sentences) == len(sentence_scores))

True
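
Because `TfidfVectorizer` L2-normalizes each row by default, summing a row does not give every sentence the same total. A sentence whose k retained terms all carry equal weight ends up with a sum of sqrt(k), so sentences with more in-vocabulary terms tend to score higher. A small pure-Python sketch of that arithmetic (the helper name and the toy weights are illustrative):

```python
import math

def l2_normalized_sum(weights):
    """Sum of a weight vector after scaling it to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        return 0.0  # a sentence with no retained terms scores zero
    return sum(w / norm for w in weights)

one_term = l2_normalized_sum([1.0])
four_terms = l2_normalized_sum([1.0, 1.0, 1.0, 1.0])
print(one_term, four_terms)
```

This length bias is worth remembering when reading the ranking: a high score here partly reflects how many shared terms a sentence contains, not only how distinctive they are.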


In [16]:
# Getting top-n sentences
N = 3
top_n_sentences = [sentences[ind] for ind in np.argsort(sentence_scores, axis=0)[::-1][:N]]

Performing final summarization

In [17]:
# Pair each selected sentence with its original position using sentence_organizer
mapped_top_n_sentences = [(sentence,sentence_organizer[sentence]) for sentence in top_n_sentences]
print("Our top-N sentences with their indices: \n")
for element in mapped_top_n_sentences:
    print(element)

# Ordering our top-n sentences in their original ordering
mapped_top_n_sentences = sorted(mapped_top_n_sentences, key = lambda x: x[1])
ordered_scored_sentences = [element[0] for element in mapped_top_n_sentences]

# Our final summary
summary = " ".join(ordered_scored_sentences)

Our top-N sentences with their indices: 

("An acclaimed Victorian painter, Tenniel is celebrated for his illustrations for Lewis Carroll's Alice's Adventures in Wonderland and Through the Looking-Glass.", 1)
("Lewis Carroll noticed Tenniel's distinct style of work and in 1864, approached the artist to illustrate his book, Alice's Adventures in Wonderland.", 7)
("Google celebrated British illustrator and artist Sir John Tenniel's 200th birth anniversary with a doodle on February 28.", 0)


In [18]:
print("Summary: \n", summary)


Summary: 
 Google celebrated British illustrator and artist Sir John Tenniel's 200th birth anniversary with a doodle on February 28. An acclaimed Victorian painter, Tenniel is celebrated for his illustrations for Lewis Carroll's Alice's Adventures in Wonderland and Through the Looking-Glass. Lewis Carroll noticed Tenniel's distinct style of work and in 1864, approached the artist to illustrate his book, Alice's Adventures in Wonderland.
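
The selection and reordering steps above can also be done on indices alone, which sidesteps the text-keyed dictionary. The following is a sketch of that variation, not the notebook's own method; `summarize_from_scores` and the toy inputs are hypothetical names and data.

```python
def summarize_from_scores(sentences, scores, n):
    """Join the n highest-scoring sentences in their original order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Indices sorted by descending score; ties keep document order
    # because Python's sort is stable.
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    # Keep the top n, then sort the indices to restore document order.
    chosen = sorted(ranked[:n])
    return " ".join(sentences[i] for i in chosen)

toy_sentences = ["A first.", "B second.", "C third.", "D fourth."]
toy_scores = [0.2, 0.9, 0.1, 0.7]
toy_summary = summarize_from_scores(toy_sentences, toy_scores, n=2)
print(toy_summary)
```

With the notebook's variables, the equivalent call would be `summarize_from_scores(sentences, sentence_scores, N)`.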
