# Text Summarization using Extractive and Abstractive methods

## Text summarization using the Extractive method

In this notebook we use latent semantic analysis (LSA) for extractive summarization.

### Importing the required packages

* Pandas and NumPy for data manipulation and retrieval, and for creating data frames.
* Matplotlib for visualizing trends in the data.
* TruncatedSVD (truncated singular value decomposition) for dimensionality reduction. It is applied to the matrix produced by the term frequency-inverse document frequency (TF-IDF) vectorizer to surface the most heavily weighted terms.
* Regular expressions for preprocessing the textual data.
* The stopwords corpus from nltk for removing stopwords from the corpus.

In [6]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

import matplotlib.pyplot as plt

import re
import string

import sklearn as sk
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

import nltk 
from nltk.corpus import stopwords

### Preprocessing the corpus

* The function below preprocesses the corpus by removing punctuation, lowercasing the characters, tokenizing the words, and removing the stopwords.
* Preprocessing keeps the most informative terms in the document, which makes the later analysis easier to read and interpret.

In [7]:
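# Build the stopword set once; requires nltk.download('stopwords') on first use.
STOP_WORDS = set(stopwords.words('english'))

def text_preprocess(text):
    # Lowercase the text and drop punctuation characters.
    no_punct = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning).
    tokenized = re.split(r'\W+', no_punct)
    # Drop stopwords and empty tokens, then rejoin into one string.
    return " ".join([word for word in tokenized if word and word not in STOP_WORDS])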
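
To make the effect of these steps concrete, here is a minimal sketch on a single sentence. It assumes a tiny hand-picked stopword set (`DEMO_STOPWORDS`, hypothetical) in place of nltk's English list so that it runs without any downloads, and `demo_preprocess` is an illustrative name rather than part of the notebook.

```python
import re
import string

# Hypothetical miniature stopword list, used only to keep this sketch
# self-contained; the notebook uses nltk's full English list instead.
DEMO_STOPWORDS = {"the", "is", "a", "of", "and", "into"}

def demo_preprocess(text, stop_words=DEMO_STOPWORDS):
    # Lowercase and remove punctuation characters.
    no_punct = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Tokenize on runs of non-word characters.
    tokens = re.split(r"\W+", no_punct)
    # Drop empty tokens and stopwords.
    return " ".join(t for t in tokens if t and t not in stop_words)

sample = "Text summarization is the task of condensing a document, and it saves time!"
print(demo_preprocess(sample))
# text summarization task condensing document it saves time
```

With nltk's full list more words would be removed (for example "it"), so the exact output depends on the stopword list used.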

### Reading the data.

* Reading the data and following the preprocessing steps.

In [8]:
# Use a context manager so the file handle is closed after reading.
with open('/content/drive/MyDrive/Colab Notebooks/Python Data science /Python Scripts/Refactored_Py_DS_ML_Bootcamp-master/Natural language processing/document.txt', 'r') as f:
    corpus = f.read()

In [9]:
# One row per line of the file; the last six lines are excluded.
corpus_list = corpus.split('\n')[:-6]
dataset = DataFrame(corpus_list, columns=['Text'])
dataset['Preprocessed'] = dataset['Text'].apply(text_preprocess)

### Vectorizing the data

In [10]:
tfidf = TfidfVectorizer()
vectors = tfidf.fit(dataset['Preprocessed'].tolist())
vectorized = tfidf.transform(dataset['Preprocessed'].tolist())
vectorized.todense()

matrix([[0.        , 0.10724106, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.13578852],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])
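
The weights in this matrix can be reproduced by hand on a toy corpus. The sketch below is an illustration only: it assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is my understanding of the TfidfVectorizer defaults, and it tokenizes with a plain whitespace split instead of the vectorizer's own token pattern. The name `tfidf_rows` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(w for doc in tokenized for w in set(doc))
    # Smoothed inverse document frequency (assumed default behaviour).
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["neural network text", "text summary", "text model network"])
for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first toy document, "neural" occurs in one document only and so receives a larger weight than "text", which occurs in all three.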

### Singular Value Decomposition (Latent semantic analysis).

In [11]:
svd = TruncatedSVD(n_components=30, n_iter=150)

decomposed = svd.fit_transform(vectorized)

svd.components_

array([[ 0.11061634,  0.01339401,  0.00489212, ...,  0.00489212,
         0.00489212,  0.01647105],
       [-0.12230496, -0.00932177, -0.00284044, ..., -0.00284044,
        -0.00284044,  0.02483275],
       [-0.01879132,  0.01345226, -0.0012168 , ..., -0.0012168 ,
        -0.0012168 , -0.01941212],
       ...,
       [-0.0311394 ,  0.03773901, -0.00457225, ..., -0.00457225,
        -0.00457225,  0.02410171],
       [ 0.00108786,  0.00508928, -0.01583101, ..., -0.01583101,
        -0.01583101, -0.06957262],
       [ 0.01534356, -0.0080803 , -0.00505606, ..., -0.00505606,
        -0.00505606,  0.00987365]])

In [12]:
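# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
terms = vectors.get_feature_names_out()

for i, comp in enumerate(svd.components_):
  terms_in_comp = zip(terms, comp)
  # Keep the ten terms with the largest weight in this component.
  sorted_comp = sorted(terms_in_comp, key=lambda x: x[1], reverse=True)[:10]
  print(f'Concept {i + 1}:')
  for term, _ in sorted_comp:
    print(term)
  print()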

Concept 1:
text
neural
summarization
network
data
recurrent
words
information
networks
new

Concept 2:
network
neural
recurrent
feedback
problem
using
networks
gradient
loops
output

Concept 3:
length
data
order
idea
sequences
sequenced
recordings
networks
varying
architecture

Concept 4:
extractive
abstractive
model
ways
summary
words
proposed
rulings
creates
expressions

Concept 5:
words
order
lyrics
much
come
advanced
new
normal
used
topic

Concept 6:
model
proposed
numerous
texts
task
long
si
sequence
idea
paragraphs

Concept 7:
sequences
may
want
input
additionally
affair
indeed
machine
operation
use

Concept 8:
model
proposed
demonstrate
sequential
separate
text
architecture
dealt
new
used

Concept 9:
learning
information
use
separate
sequence
additionally
affair
indeed
machine
operation

Concept 10:
demanded
perform
numerous
true
bumps
cannot
capture
challenging
computations
dont

Concept 11:
demonstrate
model
gradient
problem
vanishing
either
times
lstm
assessed
assessment

...
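
The concepts describe terms, while an extractive summary needs sentences. One common way to bridge the two, shown here as a hedged sketch rather than as the method this notebook uses, is to score each sentence by the length of its row in the reduced matrix (the array returned by `fit_transform`) and keep the highest-scoring sentences in their original order. The function name `top_sentences` and the toy data are assumptions made for illustration.

```python
import numpy as np

def top_sentences(sentences, doc_topic, k=2):
    """Pick the k sentences whose reduced-space vectors have the largest norm."""
    # One score per sentence: Euclidean norm of its row in the reduced space.
    scores = np.linalg.norm(doc_topic, axis=1)
    # Indices of the k best scores, re-sorted to preserve document order.
    top_idx = np.sort(np.argsort(scores)[::-1][:k])
    return [sentences[i] for i in top_idx]

# Toy stand-ins for dataset['Text'] and the `decomposed` array.
toy_sentences = [
    "Recurrent networks process sequences.",
    "This is filler.",
    "Summaries can be extractive or abstractive.",
    "More filler.",
]
toy_doc_topic = np.array([
    [0.90, 0.10],
    [0.10, 0.05],
    [0.20, 0.80],
    [0.05, 0.02],
])

print(top_sentences(toy_sentences, toy_doc_topic, k=2))
```

In the notebook the equivalent call would be something like `top_sentences(dataset['Text'].tolist(), decomposed, k=5)`, with `k` chosen to match the desired summary length.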



## Text summarization using the Abstractive method

In this notebook we use BART for abstractive summarization. A pretrained BART model is available through the Hugging Face Transformers package.

### Installing the required package

The package used is Transformers from the Hugging Face community, which provides a wide variety of pretrained models for NLP and deep learning.

In [13]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


### Importing the required packages.

* We import the BART classes from the transformers package.
* For data preprocessing we import the regular expressions package.

BartForConditionalGeneration.from_pretrained() loads the pretrained model from the Hugging Face hub by its model name.

We then create the matching tokenizer from the same pretrained checkpoint. It encodes the text into token ids and decodes the generated ids back into text.


In [14]:
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
import re

In [15]:
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

In [16]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

### Functions for the required summarization.

* preprocess_data - Takes parameters: Document.
* segment_data - Takes parameters: integer n, Document. 
* pipeline - Takes parameters: integer n, segmented data. 

preprocess_data removes the line breaks and special characters from the corpus.

In [17]:
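def preprocess_data(corpus: str):
  # Split on line breaks, then rejoin with spaces so words on adjacent lines do not merge.
  doc_list = re.split(r'\n+', corpus)
  long_text = " ".join(doc_list)
  # Replace every run of non-word characters (punctuation, whitespace) with one space.
  preprocessed_data = " ".join(re.split(r'\W+', long_text))
  return preprocessed_data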

segment_data divides the corpus into equal segments for tokenizing. The pretrained model accepts at most 1024 input tokens, so the input passed to the tokenizer must be limited. We therefore split the data into n equal partitions, where n is the input parameter.

In [18]:
def segment_data(n: int, data: str):
  # Character length of each segment; any remainder (len(data) % n) at the end is dropped.
  length = len(data) // n
  # Slice n consecutive, equally sized character windows (a window may cut a word in half).
  segmented_data = [data[length * i: (i + 1) * length] for i in range(n)]
  return segmented_data
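
Because segment_data slices by character count, a segment boundary can fall inside a word, and any trailing characters beyond n equal slices are not included. A possible alternative, sketched here under the assumption that splitting on whitespace is acceptable for the input, partitions by words instead. The name `segment_words` is hypothetical and is not used elsewhere in the notebook.

```python
def segment_words(n: int, data: str):
    """Split data into n segments of nearly equal word count, keeping every word."""
    words = data.split()
    base, extra = divmod(len(words), n)
    segments, start = [], 0
    for i in range(n):
        # The first `extra` segments take one additional word each.
        end = start + base + (1 if i < extra else 0)
        segments.append(" ".join(words[start:end]))
        start = end
    return segments

print(segment_words(3, "one two three four five six seven"))
# ['one two three', 'four five', 'six seven']
```

Word counts are still only a rough proxy for token counts, so a segment can exceed the model's input limit even when the word counts are balanced.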

The pipeline function generates a summary for each segment of the corpus.

In [19]:
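def pipeline(n: int, data):
  summaries = []
  for i in range(n):
    # Truncate to the model's 1024-token input limit so an oversized segment does not fail.
    tokens = tokenizer([data[i]], return_tensors='pt', truncation=True, max_length=1024)
    id_generation = model.generate(tokens['input_ids'], max_length=500, early_stopping=False)
    # Decode each generated id sequence back into text.
    summaries.append([tokenizer.decode(ids, skip_special_tokens=True) for ids in id_generation])

  return summaries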

The comparision function prints a detailed comparison of the original data and the summarized data.

In [20]:
def comparision(n, original_data, summarized_data):
  # Lengths are character counts of each segment and of its first generated summary.
  print("\nLength of Segmented Documents vs Length of Summaries")
  for i in range(n):
    print(f"Document - {i + 1} ---- {len(original_data[i])}\t\tSummary - {i + 1} ---- {len(summarized_data[i][0])}")
  return original_data, "\n", summarized_data

### Reading the document

* The input document is a text file stored in Google Drive. It should be around 2000 words long.

* The data then passes through the series of functions above to summarize the text.
* The execution time of the pipeline function depends on the number of segments specified (execution takes around 1 - 3 minutes).

In [21]:
# Use a context manager so the file handle is closed after reading.
with open('/content/drive/MyDrive/Colab Notebooks/Python Data science /Python Scripts/Refactored_Py_DS_ML_Bootcamp-master/Natural language processing/document.txt', 'r') as f:
    document = f.read()

In [22]:
preprocessed = preprocess_data(document)
segmented_data = segment_data(n=3, data=preprocessed)
summarized = pipeline(n=3, data=segmented_data)
comparision(n=3, original_data=segmented_data, summarized_data=summarized)


Length of Segmented Documents vs Length of Summaries
Document - 1 ---- 4095		Summary - 1 ---- 533
Document - 2 ---- 4095		Summary - 2 ---- 546
Document - 3 ---- 4095		Summary - 3 ---- 698
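
The character counts above can be turned into compression ratios to see how much each segment was shortened. This is a small sketch using only the lengths printed above; the helper name `compression_ratios` is an assumption for illustration.

```python
def compression_ratios(original_lengths, summary_lengths):
    """Summary length divided by original length, rounded to three decimals."""
    return [round(s / o, 3) for o, s in zip(original_lengths, summary_lengths)]

# Character counts taken from the printed comparison.
ratios = compression_ratios([4095, 4095, 4095], [533, 546, 698])
print(ratios)
# [0.13, 0.133, 0.17]
```

Each summary is therefore roughly 13 to 17 percent of the length of its segment, measured in characters.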


(['Text summarization is an NLP grounded fashion which involves converting large number of paragraphs into simple accessible judgment done by involving various grammatical connections sentence matching etc Analogy of the original paragraphs or texts is maintained Let s say that we are reading a review or newspaper or some kind of research paper where you come across huge paragraphs This becomes a tedious task for anyone who just want to focus on the main context of what they are reading and save their time NLP involves creation of analogy and generating the texts This method of generating the texts should be ensured that verbatim is not occurring Verbatim is basically called as the words that already have been used in the original paragraph or text should not be used again We can ensure the percentage of the words that can be used in creating the summary By text summarization we easily filter out the main context of any content provided and take meaning full actions with that summary A