


## Text Summarization

## Aim: To create a short, clear, concise and fluent summary of a long text document while preserving the meaning of the original text.



## Loading the text document

In [None]:
text = "/content/drive/MyDrive/Colab Notebooks/express.txt"
with open(text, "r", encoding="utf-8") as f:
  original_text = f.read()

In [None]:
file = open( "/content/drive/MyDrive/Colab Notebooks/express.txt", "r")

<_io.TextIOWrapper name='/content/drive/MyDrive/Colab Notebooks/express.txt' mode='r' encoding='UTF-8'>

# Extractive Summarization Techniques

# 1. Frequency Method

In [None]:
import pandas as pd
import numpy as np
import nltk
import itertools

In [None]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

## Frequency of Each Word in the document

In [None]:
stopwords1 = set(stopwords.words("english"))
words = word_tokenize(original_text)
# Count every lowercased token that is not a stopword
freqTable = {}
for word in words:
  word = word.lower()
  if word in stopwords1:
    continue
  freqTable[word] = freqTable.get(word, 0) + 1
# Display only the first 10 entries (in insertion order)
freq = dict(itertools.islice(freqTable.items(), 10))
print("Frequency of each word in the document:\n")
freq



Frequency of each word in the document:



{'indian': 1,
 'railways': 3,
 'introduced': 2,
 'two': 5,
 'new': 3,
 'vande': 7,
 'bharat': 7,
 'express': 4,
 'trains': 8,
 'mumbai-solapur': 4}
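
The counting loop above can also be written with `collections.Counter`. The following is a minimal sketch, not the notebook's code: it assumes a simple regex tokenizer and a tiny illustrative stopword list in place of NLTK's `word_tokenize` and English stopword list, and `word_frequencies` is a hypothetical helper name.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "an", "and", "on", "of", "has", "is", "to", "in"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Count lowercased word tokens, skipping stopwords and punctuation."""
    # Keeps hyphenated words such as "mumbai-solapur" as a single token
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

sample = "Indian Railways has introduced two new trains. The trains run on two routes."
freq_table = word_frequencies(sample)
print(freq_table.most_common(3))
```

Because the regex only matches letters and digits, punctuation tokens never enter the table, which a plain `word_tokenize` loop would otherwise count.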

## Sentence Values

In [None]:
sentences_tokens = sent_tokenize(original_text)
# Score each sentence by summing the frequencies of the words it contains.
# Note: `word in sentence.lower()` is a substring test, so a short word
# also matches inside a longer one.
sentenceValue = {}
for sentence in sentences_tokens:
  for word, freq in freqTable.items():
    if word in sentence.lower():
      sentenceValue[sentence] = sentenceValue.get(sentence, 0) + freq
# Average score over all scored sentences, used later as the summary threshold
sumValues = sum(sentenceValue.values())
average = int(sumValues / len(sentenceValue))
sentenceValue

{'Indian Railways has introduced two new Vande Bharat Express trains on the Mumbai-Solapur and Mumbai-Sainagar Shirdi routes.': 81,
 'This new upgraded version of the nation’s first indigenous semi-high-speed train will offer superior comfort and enhanced rail travel experience for passengers.': 47,
 'To amuse travellers during their journey, the new advanced Vande Bharat Express 2.0 is offering the popular board game “Snakes and Ladders” for passengers.': 99,
 'This will be available only on two trains: Mumbai-Solapur and Mumbai-Sainagar Shirdi.': 49,
 'The snakes and ladders game is a very popular game worldwide.': 34,
 'It requires a maximum of four players and a minimum of two players to play.': 30,
 'However, in these new-age trains, the railways expect that the game will engage passengers and rekindle the memories of the person playing the board game.': 85,
 'The authorities have devised the game in such a way that it looks like the route of the blue and white color train.': 37,
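
The scores above come from a substring test, so a word such as "two" also counts for a sentence that only contains "network". A minimal sketch of token-based scoring follows; it assumes a simple regex tokenizer, and `score_sentences` is a hypothetical helper, not the notebook's code.

```python
import re

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase words, hyphenated words kept whole
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def score_sentences(sentences, freq_table):
    """Sum the frequencies of the distinct known words that occur as whole tokens."""
    scores = {}
    for sentence in sentences:
        tokens = set(tokenize(sentence))
        scores[sentence] = sum(freq_table.get(tok, 0) for tok in tokens)
    return scores

freq_table = {"two": 5, "trains": 8, "game": 6}
sentences = ["Two trains were introduced.", "The network is large."]
scores = score_sentences(sentences, freq_table)
print(scores)
```

As in the loop above, each word contributes its frequency at most once per sentence; only the matching rule differs.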


## Summary

In [None]:
summary = ''
# Keep the sentences whose score is more than 1.5 times the average score
for sentence in sentences_tokens:
  if sentence in sentenceValue and sentenceValue[sentence] > 1.5 * average:
    summary += sentence
summary.split(".")

['To amuse travellers during their journey, the new advanced Vande Bharat Express 2',
 '0 is offering the popular board game “Snakes and Ladders” for passengers',
 'In Mumbai- Sainagar Shirdi and Mumbai-Solapur Vande Bharat Express, the board game will start at Chhatrapati Shivaji Maharaj Terminus (CSMT) and the ladders are replaced with Vande Bharat trains',
 'If you land on the board where the train does not halts, you will go down in snakes and if you land at the halt will jump through the Vande Bharat (ladder) leapfrogging to higher rows\nInaugurating two Vande Bharat trains on Friday, Prime Minister Narendra Modi has called it a “grand picture of modern India”',
 'Apart from these, the trains have automatic plug doors, touch-free sliding doors, revolving seats in executive class, 32 inches passenger information and infotainment system in every coach, emergency lighting in each coach, emergency talk-back units, better heat ventilation, UV lamp for the germ-free supply of air, speci
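
In the output above, splitting the summary on "." breaks "Vande Bharat Express 2.0" into two entries. One way to avoid this is to keep the selected sentences in a list instead of concatenating and re-splitting them. A minimal sketch, assuming a hypothetical helper `select_sentences` and a floating-point average (the notebook truncates the average with `int`):

```python
def select_sentences(sentences, scores, factor=1.5):
    """Return, in document order, the sentences scoring above factor * average."""
    if not scores:
        return []
    average = sum(scores.values()) / len(scores)
    return [s for s in sentences if scores.get(s, 0) > factor * average]

sentences = [
    "Version 2.0 adds a board game.",
    "It is popular.",
    "Two players can play.",
]
scores = {sentences[0]: 90, sentences[1]: 20, sentences[2]: 25}
selected = select_sentences(sentences, scores)
summary_text = " ".join(selected)
print(selected)
```

Joining with a space also keeps adjacent sentences from running together in the final summary string.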

# 2. TextRank Algorithm


In [None]:
# gensim.summarization was removed in gensim 4.0, so this cell needs gensim 3.x
import gensim
from gensim.summarization import summarize

In [None]:
# With no arguments, summarize keeps about 20% of the sentences (ratio=0.2)
summary = summarize(original_text)
summary

'Indian Railways has introduced two new Vande Bharat Express trains on the Mumbai-Solapur and Mumbai-Sainagar Shirdi routes.\nTo amuse travellers during their journey, the new advanced Vande Bharat Express 2.0 is offering the popular board game “Snakes and Ladders” for passengers.\nIn Mumbai- Sainagar Shirdi and Mumbai-Solapur Vande Bharat Express, the board game will start at Chhatrapati Shivaji Maharaj Terminus (CSMT) and the ladders are replaced with Vande Bharat trains.'

In [None]:
# ratio is the fraction of the original sentences to keep
summary_by_ratio = summarize(original_text, ratio=0.1)
summary_by_ratio

'Indian Railways has introduced two new Vande Bharat Express trains on the Mumbai-Solapur and Mumbai-Sainagar Shirdi routes.'

In [None]:
# word_count caps the summary length in words
summary_by_word_count = summarize(original_text, word_count=50)
summary_by_word_count

'Indian Railways has introduced two new Vande Bharat Express trains on the Mumbai-Solapur and Mumbai-Sainagar Shirdi routes.\nTo amuse travellers during their journey, the new advanced Vande Bharat Express 2.0 is offering the popular board game “Snakes and Ladders” for passengers.'
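
TextRank ranks sentences with a PageRank-style iteration over a sentence-similarity graph. The sketch below illustrates that idea in pure Python. It is not gensim's implementation: it assumes a regex tokenizer and the word-overlap similarity from the original TextRank paper, and all function names are illustrative.

```python
import math
import re

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase alphanumeric runs
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_similarity(a, b):
    """Number of shared words, normalised by the log lengths of both sentences."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(set(a) & set(b)) / denom if denom > 0 else 0.0

def textrank_scores(sentences, damping=0.85, iterations=50):
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Weighted, undirected similarity graph without self-loops
    weights = [[overlap_similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours in proportion to edge weight
        scores = [(1 - damping) + damping * sum(
                      weights[j][i] / out_sum[j] * scores[j]
                      for j in range(n) if out_sum[j] > 0)
                  for i in range(n)]
    return scores

def textrank_summary(sentences, top_n=1):
    scores = textrank_scores(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]  # restore document order

sentences = [
    "The train offers a board game",
    "The board game engages passengers on the train",
    "Passengers like comfort",
]
print(textrank_summary(sentences, top_n=1))
```

The second sentence shares words with both of the others, so it collects the most score and is selected.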

# 3. Using Sumy Library

In [None]:
%pip install sumy

In [None]:
import sumy



## 1. LexRank

### LexRank is a graph-based method: a sentence that is similar to many other sentences is treated as recommended by them and is therefore ranked higher.



In [None]:
from sumy.parsers.plaintext import PlaintextParser  # Parses plain text into a document of sentences
from sumy.nlp.tokenizers import Tokenizer

In [None]:
from sumy.summarizers.lex_rank import LexRankSummarizer

In [None]:
my_parser = PlaintextParser.from_string(original_text, Tokenizer('english'))

In [None]:
lex_rank_summarizer = LexRankSummarizer()
lexrank_summary = lex_rank_summarizer(my_parser.document, sentences_count=3)
for sentence in lexrank_summary:
  print(sentence)

Indian Railways has introduced two new Vande Bharat Express trains on the Mumbai-Solapur and Mumbai-Sainagar Shirdi routes.
However, in these new-age trains, the railways expect that the game will engage passengers and rekindle the memories of the person playing the board game.
In Mumbai- Sainagar Shirdi and Mumbai-Solapur Vande Bharat Express, the board game will start at Chhatrapati Shivaji Maharaj Terminus (CSMT) and the ladders are replaced with Vande Bharat trains.
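
The "recommendation" idea behind LexRank can be sketched by counting, for each sentence, how many other sentences are similar to it. The code below computes this degree centrality from TF-IDF cosine similarity. It is a simplification and not sumy's implementation: LexRank proper runs a PageRank-style power iteration on the same graph, and the tokenizer, threshold and function names here are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase alphanumeric runs
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(sentences):
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # sentence frequency per term
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}   # +1 keeps ubiquitous terms non-zero
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def degree_centrality(sentences, threshold=0.1):
    """For each sentence, count the other sentences whose similarity exceeds the threshold."""
    vecs = tfidf_vectors(sentences)
    n = len(vecs)
    return [sum(1 for j in range(n) if j != i and cosine(vecs[i], vecs[j]) > threshold)
            for i in range(n)]

sentences = [
    "The train has a board game",
    "The board game engages passengers",
    "Passengers enjoy comfort on the train",
    "Mangoes are sweet",
]
centrality = degree_centrality(sentences)
print(centrality)
```

The unrelated last sentence has no similar neighbours, so it receives no recommendations and would rank last.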


## 2. LSA (Latent Semantic Analysis)
### * Unsupervised learning algorithm
### * Based on singular value decomposition (SVD) of the term-sentence matrix


In [None]:
from sumy.summarizers.lsa import LsaSummarizer

In [None]:
lsa_summarizer = LsaSummarizer()
lsa_summary = lsa_summarizer(my_parser.document, sentences_count=3)


for sentence in lsa_summary:
    print(sentence)

Indian Railways has introduced two new Vande Bharat Express trains on the Mumbai-Solapur and Mumbai-Sainagar Shirdi routes.
This new upgraded version of the nation’s first indigenous semi-high-speed train will offer superior comfort and enhanced rail travel experience for passengers.
Keeping this in mind, the railways have added several more features to these two trains.
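
To illustrate how SVD can rank sentences, the sketch below builds a term-sentence count matrix with NumPy, keeps the strongest latent topics, and scores each sentence by its weighted length in that topic space. This is one common scoring rule chosen for illustration and is not necessarily what sumy's `LsaSummarizer` does internally; the tokenizer, the `topics` parameter and the function name are assumptions.

```python
import re
import numpy as np

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase alphanumeric runs
    return re.findall(r"[a-z0-9]+", text.lower())

def lsa_scores(sentences, topics=1):
    vocab = sorted({t for s in sentences for t in tokenize(s)})
    index = {t: i for i, t in enumerate(vocab)}
    # Rows are terms, columns are sentences
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for t in tokenize(s):
            A[index[t], j] += 1.0
    # A = U @ diag(sigma) @ Vt; each row of Vt describes one latent topic
    _, sigma, vt = np.linalg.svd(A, full_matrices=False)
    k = min(topics, len(sigma))
    # Sentence score: length of its topic vector, weighted by the singular values
    return np.sqrt(((sigma[:k, None] * vt[:k]) ** 2).sum(axis=0))

sentences = ["train game board", "train game passengers", "mango fruit"]
scores = lsa_scores(sentences, topics=1)
print(scores)
```

With a single topic, the two sentences about the train and the game dominate, and the unrelated third sentence scores close to zero.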


## 3. Luhn
### * Based on word frequency (a heuristic in the spirit of TF-IDF)
### * It is useful when neither very infrequent words nor very frequent words (stopwords) are significant.

In [None]:
from sumy.summarizers.luhn import LuhnSummarizer

In [None]:
luhn_summarizer = LuhnSummarizer()
luhn_summary = luhn_summarizer(my_parser.document, sentences_count=3)


for sentence in luhn_summary:
  print(sentence)

Indian Railways has introduced two new Vande Bharat Express trains on the Mumbai-Solapur and Mumbai-Sainagar Shirdi routes.
If you land on the board where the train does not halts, you will go down in snakes and if you land at the halt will jump through the Vande Bharat (ladder) leapfrogging to higher rows Inaugurating two Vande Bharat trains on Friday, Prime Minister Narendra Modi has called it a “grand picture of modern India”.
Both Mumbai-Solapur and Mumbai-Sainagar Shirdi Vande Bharat Express trains will climb 1 in 37 gradient ghat section without banker engine in Bhor ghat i.e. Khandala-Lonavala section and in Thul ghat i.e. Kasara ghat respectively.
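
The idea of ignoring both rare words and stopwords can be sketched as follows. This is a simplified Luhn-style score, not sumy's implementation: words are "significant" if they are not stopwords and occur at least `min_count` times, and a sentence is scored by the squared number of significant words divided by the span between the first and last of them (Luhn's original method scores clusters within the sentence). The stopword list, tokenizer and names are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption)
STOPWORDS = {"the", "a", "is", "on", "are"}

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase alphanumeric runs
    return re.findall(r"[a-z0-9]+", text.lower())

def luhn_scores(sentences, min_count=2, stopwords=STOPWORDS):
    tokenized = [tokenize(s) for s in sentences]
    counts = Counter(t for toks in tokenized for t in toks if t not in stopwords)
    # Significant words: neither stopwords nor too rare
    significant = {t for t, c in counts.items() if c >= min_count}
    scores = []
    for toks in tokenized:
        positions = [i for i, t in enumerate(toks) if t in significant]
        if not positions:
            scores.append(0.0)
            continue
        span = positions[-1] - positions[0] + 1
        scores.append(len(positions) ** 2 / span)
    return scores

sentences = [
    "The train offers a board game",
    "The board game is on the train",
    "Mangoes are sweet",
]
scores = luhn_scores(sentences)
print(scores)
```

Sentences whose significant words sit close together score higher than sentences where the same words are spread out.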


# Abstractive Summarization

# BART
## BART uses a standard sequence-to-sequence (seq2seq/machine translation) architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
## "BartForConditionalGeneration" is used for conditional generation tasks such as summarization.

## "bart-large-cnn" is a pretrained BART model fine-tuned on the CNN/Daily Mail dataset for summarization.

In [None]:
%pip install transformers

In [None]:
import transformers

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

In [None]:
tokenizer=BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model=BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')


In [None]:
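# BART accepts at most 1024 input tokens, so longer documents are truncated
inputs = tokenizer.batch_encode_plus([original_text], max_length=1024, truncation=True, return_tensors='pt')
summary_ids = model.generate(inputs['input_ids'], early_stopping=True)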



In [None]:
bart_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
bart_summary

'Indian Railways has introduced two new Vande Bharat Express trains on the Mumbai-Solapur and Mumbai-Sainagar Shirdi routes. The trains will offer superior comfort and enhanced rail travel experience for passengers. Prime Minister Narendra Modi has called it a ‘grand picture of modern India’'
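
BART accepts at most 1024 input tokens, so a document much longer than this article would need to be split before summarization. The sketch below groups whole sentences into chunks under a word budget; each chunk could then be summarized separately. It is only an assumption about how one might prepare long input: `chunk_sentences` is a hypothetical helper, the sentence splitter is a naive regex, and a word count is only a rough stand-in for the model's token count.

```python
import re

def chunk_sentences(text, max_words=400):
    """Group whole sentences into chunks of at most max_words words."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = chunk_sentences("One two three. Four five six. Seven eight.", max_words=6)
print(chunks)
```

A single sentence longer than `max_words` still becomes its own chunk, so truncation at the tokenizer remains a useful safeguard.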

## Example 2:

In [None]:
%pip install docx2txt

In [None]:
import docx2txt

In [None]:
my_text = docx2txt.process("/content/drive/MyDrive/Colab Notebooks/budget_speech.docx")

In [None]:
summary_by_ratio = summarize(my_text, ratio=0.01)
summary_by_ratio

'The components of the scheme will include not only financial support but also access to advanced skill training, knowledge of modern digital techniques and efficient green technologies, brand promotion, linkage with local and global markets, digital payments, and social security.\nThis will enable inclusive, farmer-centric solutions through relevant information services for crop planning and health, improved access to farm inputs, credit, and insurance, help for crop estimation, market intelligence, and support for growth of agri-tech industry and start-ups.\nWe will launch a new sub-scheme of PM Matsya Sampada Yojana with targeted investment of ` 6,000 crore to further enable activities of fishermen, fish vendors, and micro & small enterprises, improve value chain efficiencies, and expand the market.\nTo further deepen domestic value addition in manufacture of mobile phones, I propose to provide relief in customs duty on import of certain parts and inputs like camera lens and continu

# Conclusion:
### We used a newspaper article to demonstrate text summarization. Text summarization can be implemented with various methods provided by libraries such as NLTK, gensim, sumy and transformers. Each of these methods results in a different summary. Thus, given any large document, a short summary of the text can be created.