# Building a Text Summarizer

## Importing required libraries

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en import English
import numpy as np

## Load a blank spaCy pipeline for sentence tokenization

In [None]:
nlp = English()
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x7f82111638c0>

In [None]:
text_corpus = """
“Yalla!” It is a lion’s roar of victory that has become synonymous with Tunisian tennis star Ons Jabeur, who went thundering into the Wimbledon final on Thursday when she defeated Tatjana Maria.

It took three sets for the 27-year-old to achieve the historic feat of becoming the first woman of north African or Arab descent to reach a grand slam final.

In the six years since Jabeur broke into the world’s Top 100 in 2016, the Tunisian star’s ascent has seen a flurry of firsts. In October last year she became the first Arab player – man or woman – in the world’s Top 10. This year she won the Madrid Open, becoming the first Arab or north African woman to win a WTA 1000 event.

On Saturday, the new world No 2 faces Elena Rybakina of Kazakhstan in the match of a lifetime, as she plays for her first Wimbledon title.

Against a backdrop of looming economic collapse in Tunisia, a country bitterly divided over its future political direction, Jabeur’s success has galvanised a nation with limited historical interest in tennis.

Amid the harsh political reality the country faces, the tennis star, christened the “Minister of Happiness” by the press, has been more than needed.

At Tennis Club de Tunis, the oldest tennis club in Tunis, established in 1923, where Jabeur trains, the courts are unusually empty.

Many have travelled home to celebrate Saturday’s Eid al-Adha, the festival of sacrifice celebrated throughout the Muslim world after the hajj pilgrimage, to mark Abraham’s willingness to sacrifice his son for God.

But Saturday’s celebration will be markedly different. The number of young Tunisians playing at Tennis Club de Tunis has risen sharply since Jabeur’s ascent.

“Subscriptions to the school of tennis have exploded over the last two years, even through the lockdown and the pandemic,” said Sammi Baccar, the club’s sports director, pointing to a long list of young champions the club has produced. “I have never known tennis be this popular in Tunisia.”

At Tennis Club de Tunis, the female champion isn’t simply a distant figure known only through television. She is one of them, drawing spectators to her like moths to the flame whenever she plays.

“It’s very important to see that she hasn’t changed a little bit because of the fame,” said Baccar, perched on a plastic chair. “She loves all the little kids. My kids, she always remembers their names and the details of their lives.”

Leaving the ladder down for others to climb after her is a personal ambition of Jabeur, as she revealed in an interview with the Guardian last month. “I see myself like I’m on a mission,” she said.

“I tell myself I chose to do this. Let’s say, I chose to inspire people. I chose to be the person that I am. I want to share my experience one day and really get more and more generations here.”

Over the years, tennis in Tunisia has grown from an elite sport to one that is routinely played and watched with interest in cafes across the country, a shift freelance sports journalist Souhail Khmira attributes directly to Jabeur’s success.

“Lots of Tunisian women have always competed in athletics at the highest level,” Khmira said.

“Sporting success isn’t just for men. However, Ons has shed more light on that. She’s walked in the footsteps of all those women and paved the way for a lot of young girls to follow. They’ve watched her, this middle class girl, and the obstacles she faced and overcame.”

Playing inside Tennis Club de Tunis’s maze of courts, 14-year-old Sarah Boughzala will be among the millions of Tunisian teenagers who will be watching the final tomorrow, and an admirer of one of Jabeur’s favourite techniques. “Her drop shot is very good. She knows how to use it to win,” she said.

To the millions of young Tunisian girls watching, the message Jabeur sends is resoundingly clear. “She’s very inspiring and one day I want to be just like her,” says 12-year-old Fatma Hamdouni, smiling at the mention of Jabeur’s name. “She can do anything she wants. She just needs to believe in herself.”"""

## Result / Summary

In [None]:
print("Summary: \n", summary)

Summary: 
 The number of young Tunisians playing at Tennis Club de Tunis has risen sharply since Jabeur’s ascent. ”Over the years, tennis in Tunisia has grown from an elite sport to one that is routinely played and watched with interest in cafes across the country, a shift freelance sports journalist Souhail Khmira attributes directly to Jabeur’s success. ”Playing inside Tennis Club de Tunis’s maze of courts, 14-year-old Sarah Boughzala will be among the millions of Tunisian teenagers who will be watching the final tomorrow, and an admirer of one of Jabeur’s favourite techniques. “


## Creating a function summarizer

In [None]:
print("Summarizer Result: \n", summarizer(text=text_corpus, tokenizer=nlp, max_sent_in_summary=3))

Summarizer Result: 
 The number of young Tunisians playing at Tennis Club de Tunis has risen sharply since Jabeur’s ascent. ”Over the years, tennis in Tunisia has grown from an elite sport to one that is routinely played and watched with interest in cafes across the country, a shift freelance sports journalist Souhail Khmira attributes directly to Jabeur’s success. ”Playing inside Tennis Club de Tunis’s maze of courts, 14-year-old Sarah Boughzala will be among the millions of Tunisian teenagers who will be watching the final tomorrow, and an admirer of one of Jabeur’s favourite techniques. “
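The definition of `summarizer` is not shown above. A minimal sketch of one way such a function could work, assuming it scores sentences by TF-IDF, follows. It uses only the standard library (a naive regex sentence splitter and a hand-rolled smoothed TF-IDF) in place of spaCy and `TfidfVectorizer`, so the names `split_sentences` and `tfidf_summarizer` and the exact scoring are assumptions, not the original implementation.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tfidf_summarizer(text, max_sent_in_summary=3):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

    # Document frequency: in how many sentences does each word occur?
    n = len(sentences)
    df = Counter(word for tokens in tokenized for word in set(tokens))

    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        # Length-normalised term frequency times a smoothed IDF.
        scores.append(sum(
            (count / len(tokens)) * (math.log((1 + n) / (1 + df[word])) + 1)
            for word, count in tf.items()
        ))

    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:max_sent_in_summary]
    return " ".join(sentences[i] for i in sorted(top))
```

Sentences made of words that are rare across the document score highest; re-sorting the selected indices keeps the summary in reading order.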


# Pretrained text summarization models 

## Extractive

### TextRank

In [None]:
# gensim.summarization was removed in gensim 4.0, so this cell needs gensim < 4.0
import gensim
from gensim.summarization import summarize

In [None]:
# Passing the text corpus to summarizer
# When both ratio and word_count are given, ratio is ignored, so only word_count is passed
short_summary = summarize(text_corpus,word_count=300)
print(short_summary)

“Yalla!” It is a lion’s roar of victory that has become synonymous with Tunisian tennis star Ons Jabeur, who went thundering into the Wimbledon final on Thursday when she defeated Tatjana Maria.
It took three sets for the 27-year-old to achieve the historic feat of becoming the first woman of north African or Arab descent to reach a grand slam final.
In the six years since Jabeur broke into the world’s Top 100 in 2016, the Tunisian star’s ascent has seen a flurry of firsts.
Against a backdrop of looming economic collapse in Tunisia, a country bitterly divided over its future political direction, Jabeur’s success has galvanised a nation with limited historical interest in tennis.
Amid the harsh political reality the country faces, the tennis star, christened the “Minister of Happiness” by the press, has been more than needed.
The number of young Tunisians playing at Tennis Club de Tunis has risen sharply since Jabeur’s ascent.
“Subscriptions to the school of tennis have exploded over th
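To make the idea behind TextRank concrete, here is a minimal standard-library sketch. It assumes the original TextRank formulation (word-overlap similarity normalised by log sentence lengths, then PageRank-style iteration), which is not necessarily what gensim implements internally. The function name `textrank_summary` and its parameters are hypothetical.

```python
import math
import re


def textrank_summary(text, top_n=2, damping=0.85, iterations=50):
    """Rank sentences with a PageRank-style iteration over a similarity graph."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(sentences)
    if n == 0:
        return []

    def similarity(a, b):
        # Shared words, normalised by the log of both sentence sizes.
        if len(a) < 2 or len(b) < 2:
            return 0.0
        return len(a & b) / (math.log(len(a)) + math.log(len(b)))

    weights = [
        [similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]

    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with the rest of the text receives only the baseline score `1 - damping`, so it is the last to be selected.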

### Text Summarization with Sumy


In [None]:
!pip install sumy


In [None]:
import sumy 


In [None]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Importing the parser and tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
# Import the LexRank summarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
# Initializing the parser
my_parser = PlaintextParser.from_string(text_corpus,Tokenizer('english'))

#### LexRank

In [None]:
# Creating a summary of 10 sentences.
lex_rank_summarizer = LexRankSummarizer()
lexrank_summary = lex_rank_summarizer(my_parser.document,sentences_count=10)
# Printing the summary
for sentence in lexrank_summary:
  print(sentence)

In the six years since Jabeur broke into the world’s Top 100 in 2016, the Tunisian star’s ascent has seen a flurry of firsts.
In October last year she became the first Arab player – man or woman – in the world’s Top 10.
This year she won the Madrid Open, becoming the first Arab or north African woman to win a WTA 1000 event.
Against a backdrop of looming economic collapse in Tunisia, a country bitterly divided over its future political direction, Jabeur’s success has galvanised a nation with limited historical interest in tennis.
At Tennis Club de Tunis, the female champion isn’t simply a distant figure known only through television.
“I see myself like I’m on a mission,” she said.
“I tell myself I chose to do this.
She’s walked in the footsteps of all those women and paved the way for a lot of young girls to follow.
Playing inside Tennis Club de Tunis’s maze of courts, 14-year-old Sarah Boughzala will be among the millions of Tunisian teenagers who will be watching the final tomorrow, 

#### French

In [None]:
text='''En la recevant au Palais présidentiel de Carthage, avec son entraîneur et son mari qui est aussi son préparateur physique, le chef de l'Etat a voulu la remercier avec cette décoration prestigieuse pour avoir hissé «haut le drapeau du pays dans les évènements sportifs internationaux».

«Elle a été l'ambassadrice de la Tunisie» et a montré «la capacité de notre jeunesse à briller dans tous les domaines quand les moyens et les conditions nécessaires lui sont fournies», selon un communiqué officiel.

«Félicitations pour cette réussite et les prochaines réussites. Vous donnez une image aux jeunes Tunisiens et de la femme tunisienne qui relève tous les défis», lui a dit le président Kais Saïed, en lui décernant la médaille.

Jabeur, surnommée «ministre du bonheur» en Tunisie, est devenue la première Arabe et Africaine à atteindre la finale d'un tournoi du Grand Chelem, après avoir battu l'Allemande Maria Tatiana en demi-finale à Wimbledon.'''
my_parser_fr = PlaintextParser.from_string(text,Tokenizer('french'))
# Creating a summary of 3 sentences.
lex_rank_summarizer = LexRankSummarizer()
lexrank_summary = lex_rank_summarizer(my_parser_fr.document,sentences_count=3)
# Printing the summary
for sentence in lexrank_summary:
  print(sentence)

En la recevant au Palais présidentiel de Carthage, avec son entraîneur et son mari qui est aussi son préparateur physique, le chef de l'Etat a voulu la remercier avec cette décoration prestigieuse pour avoir hissé «haut le drapeau du pays dans les évènements sportifs internationaux».
«Elle a été l'ambassadrice de la Tunisie» et a montré «la capacité de notre jeunesse à briller dans tous les domaines quand les moyens et les conditions nécessaires lui sont fournies», selon un communiqué officiel.
«Félicitations pour cette réussite et les prochaines réussites.


In [None]:
str(sentence) in text

True

#### LSA (Latent semantic analysis)

In [None]:
#Import the Summarizer
from sumy.summarizers.lsa import LsaSummarizer

In [None]:
# creating the summarizer
lsa_summarizer=LsaSummarizer()
lsa_summary=lsa_summarizer(my_parser.document,sentences_count=20)
# Printing the summary
for sentence in lsa_summary:
    print(sentence)

“Yalla!” It is a lion’s roar of victory that has become synonymous with Tunisian tennis star Ons Jabeur, who went thundering into the Wimbledon final on Thursday when she defeated Tatjana Maria.
It took three sets for the 27-year-old to achieve the historic feat of becoming the first woman of north African or Arab descent to reach a grand slam final.
In October last year she became the first Arab player – man or woman – in the world’s Top 10.
This year she won the Madrid Open, becoming the first Arab or north African woman to win a WTA 1000 event.
On Saturday, the new world No 2 faces Elena Rybakina of Kazakhstan in the match of a lifetime, as she plays for her first Wimbledon title.
Against a backdrop of looming economic collapse in Tunisia, a country bitterly divided over its future political direction, Jabeur’s success has galvanised a nation with limited historical interest in tennis.
At Tennis Club de Tunis, the oldest tennis club in Tunis, established in 1923, where Jabeur trains

#### Luhn

In [None]:
# Import the summarizer
from sumy.summarizers.luhn import LuhnSummarizer

In [None]:
#  Creating the summarizer
luhn_summarizer=LuhnSummarizer()
luhn_summary=luhn_summarizer(my_parser.document,sentences_count=20)
# Printing the summary
for sentence in luhn_summary:
    print(sentence)

“Yalla!” It is a lion’s roar of victory that has become synonymous with Tunisian tennis star Ons Jabeur, who went thundering into the Wimbledon final on Thursday when she defeated Tatjana Maria.
It took three sets for the 27-year-old to achieve the historic feat of becoming the first woman of north African or Arab descent to reach a grand slam final.
In the six years since Jabeur broke into the world’s Top 100 in 2016, the Tunisian star’s ascent has seen a flurry of firsts.
In October last year she became the first Arab player – man or woman – in the world’s Top 10.
This year she won the Madrid Open, becoming the first Arab or north African woman to win a WTA 1000 event.
On Saturday, the new world No 2 faces Elena Rybakina of Kazakhstan in the match of a lifetime, as she plays for her first Wimbledon title.
Against a backdrop of looming economic collapse in Tunisia, a country bitterly divided over its future political direction, Jabeur’s success has galvanised a nation with limited his
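The scoring idea behind Luhn's method can be sketched in a few lines. This is a simplified version under stated assumptions: a word is "significant" if it occurs at least `min_count` times, and each sentence is treated as a single cluster spanning its first to last significant word. sumy's `LuhnSummarizer` is more elaborate, and the name `luhn_scores` is hypothetical.

```python
import re
from collections import Counter


def luhn_scores(sentences, stop_words=frozenset(), min_count=2):
    """Score each sentence as (significant words)^2 / span of the cluster."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    counts = Counter(
        w for tokens in tokenized for w in tokens if w not in stop_words
    )
    significant = {w for w, c in counts.items() if c >= min_count}

    scores = []
    for tokens in tokenized:
        positions = [i for i, w in enumerate(tokens) if w in significant]
        if not positions:
            scores.append(0.0)
            continue
        # Span runs from the first to the last significant word, inclusive.
        span = positions[-1] - positions[0] + 1
        scores.append(len(positions) ** 2 / span)
    return scores
```

Sentences that pack many frequent words into a short span score highest, which is why dense, topic-heavy sentences tend to be selected.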

#### KL-Sum

In [None]:
from sumy.summarizers.kl import KLSummarizer
# Instantiating the KLSummarizer
kl_summarizer=KLSummarizer()
kl_summary=kl_summarizer(my_parser.document,sentences_count=20)

# Printing the summary
for sentence in kl_summary:
    print(sentence)

It took three sets for the 27-year-old to achieve the historic feat of becoming the first woman of north African or Arab descent to reach a grand slam final.
In the six years since Jabeur broke into the world’s Top 100 in 2016, the Tunisian star’s ascent has seen a flurry of firsts.
Against a backdrop of looming economic collapse in Tunisia, a country bitterly divided over its future political direction, Jabeur’s success has galvanised a nation with limited historical interest in tennis.
Amid the harsh political reality the country faces, the tennis star, christened the “Minister of Happiness” by the press, has been more than needed.
At Tennis Club de Tunis, the oldest tennis club in Tunis, established in 1923, where Jabeur trains, the courts are unusually empty.
Many have travelled home to celebrate Saturday’s Eid al-Adha, the festival of sacrifice celebrated throughout the Muslim world after the hajj pilgrimage, to mark Abraham’s willingness to sacrifice his son for God.
But Saturday

In [None]:
# Check that the last printed sentence appears verbatim in the source text
str(sentence) in text_corpus
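KL-Sum picks sentences so that the word distribution of the summary stays close to that of the whole document. The following is a minimal greedy sketch of that idea, assuming additive smoothing and lower-cased alphabetic tokens; `kl_divergence` and `kl_sum` are hypothetical names and this is not sumy's implementation.

```python
import math
import re
from collections import Counter


def _tokens(sentence):
    return re.findall(r"[a-z]+", sentence.lower())


def kl_divergence(doc_freq, summary_counts, smoothing=0.01):
    """KL(document || summary) over the document vocabulary."""
    total = sum(summary_counts.values()) + smoothing * len(doc_freq)
    return sum(
        p * math.log(p / ((summary_counts[w] + smoothing) / total))
        for w, p in doc_freq.items()
    )


def kl_sum(sentences, sentences_count=2):
    """Greedily add the sentence that minimises the divergence at each step."""
    doc_counts = Counter(w for s in sentences for w in _tokens(s))
    n_words = sum(doc_counts.values())
    if n_words == 0:
        return []
    doc_freq = {w: c / n_words for w, c in doc_counts.items()}

    chosen, summary_counts = [], Counter()
    remaining = list(range(len(sentences)))
    while remaining and len(chosen) < sentences_count:
        best = min(
            remaining,
            key=lambda i: kl_divergence(
                doc_freq, summary_counts + Counter(_tokens(sentences[i]))
            ),
        )
        chosen.append(best)
        remaining.remove(best)
        summary_counts += Counter(_tokens(sentences[best]))
    return [sentences[i] for i in sorted(chosen)]
```

Because each step rewards covering words the summary still lacks, the method tends to prefer sentences that add new vocabulary over ones that repeat what is already selected.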

## Abstractive Text Summarization

In [None]:
!pip install transformers

### T5

In [None]:
#import
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration


In [None]:
!pip install sentencepiece

**Note: always restart the runtime after installing `sentencepiece`; `T5Tokenizer` depends on it.**

In [None]:
# Instantiating the model and tokenizer 
my_model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')

In [None]:
# Prepending the task prefix "summarize: " (with a trailing space) to the raw text
text = "summarize: " + text_corpus

In [None]:
# encoding the input text
input_ids=tokenizer.encode(text, return_tensors='pt', max_length=512,truncation=True)
# Generating summary ids
summary_ids = my_model.generate(input_ids)
summary_ids

In [None]:
# Decoding the tensor (dropping special tokens such as <pad> and </s>) and printing the summary.
t5_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(t5_summary)

### BART

In [None]:
# Importing the model
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig
# Loading the model and tokenizer for bart-large-cnn

tokenizer=BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model=BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

In [None]:
# Encoding the inputs (truncated to BART's 1024-token limit) and passing them to model.generate()
inputs = tokenizer.batch_encode_plus([text_corpus],return_tensors='pt',max_length=1024,truncation=True)
summary_ids = model.generate(inputs['input_ids'], early_stopping=True)

**Note: the cell above takes a long time to run.**

In [None]:
# Decoding and printing the summary
bart_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(bart_summary)

### GPT-2 

In [None]:
# Importing model and tokenizer
from transformers import GPT2Tokenizer,GPT2LMHeadModel
# Instantiating the model and tokenizer with gpt-2
tokenizer=GPT2Tokenizer.from_pretrained('gpt2')
model=GPT2LMHeadModel.from_pretrained('gpt2')

In [None]:
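# Encoding text to get input ids & pass them to model.generate()
inputs=tokenizer.batch_encode_plus([text_corpus],return_tensors='pt',max_length=300,truncation=True)
# max_new_tokens bounds the continuation; the default max_length of 20 is shorter than the 300-token prompt
summary_ids=model.generate(inputs['input_ids'],max_new_tokens=60,pad_token_id=tokenizer.eos_token_id)
summary_ids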

In [None]:
# Decoding and printing summary

GPT_summary=tokenizer.decode(summary_ids[0],skip_special_tokens=True)
print(GPT_summary)

# Articles Data

## kaggle env installation

In [None]:
! pip install -q kaggle

In [None]:
!ls

sample_data


In [None]:
from google.colab import files

files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [None]:
! mkdir -p ~/.kaggle

In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json
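The three shell steps above (create the directory, copy `kaggle.json`, restrict its permissions) can also be done from Python without the credentials ever appearing in a cell output. This is a minimal sketch; the function name `write_kaggle_credentials` and its arguments are hypothetical, and where the username and key come from is left to the caller.

```python
import json
from pathlib import Path


def write_kaggle_credentials(username, key, config_dir):
    """Write kaggle.json into config_dir with owner-only (0o600) permissions."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    path = config_dir / "kaggle.json"
    path.write_text(json.dumps({"username": username, "key": key}))
    # Equivalent of `chmod 600`: readable and writable by the owner only.
    path.chmod(0o600)
    return path
```

For the setup in this notebook, `config_dir` would be `Path.home() / ".kaggle"`.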

In [None]:
! kaggle datasets list

ref                                                             title                                            size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------------  ----------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
akshaydattatraykhare/diabetes-dataset                           Diabetes Dataset                                  9KB  2022-10-06 08:55:25          10106        324  1.0              
whenamancodes/covid-19-coronavirus-pandemic-dataset             COVID -19 Coronavirus Pandemic Dataset           11KB  2022-09-30 04:05:11           8103        260  1.0              
stetsondone/video-game-sales-by-genre                           Video Game Sales by Genre                        12KB  2022-10-31 17:56:01            545         23  1.0              
hasibalmuzdadid/global-air-pollution-dataset                    Global Air Pollu

In [None]:
#! kaggle datasets download -d gowrishankarp/newspaper-text-summarization-cnn-dailymail

In [None]:
! mkdir ./data_articles

In [None]:
#! unzip newspaper-text-summarization-cnn-dailymail.zip

In [None]:
! kaggle datasets download -d snapcrack/all-the-news

Downloading all-the-news.zip to /content
 94% 230M/244M [00:01<00:00, 186MB/s]
100% 244M/244M [00:01<00:00, 170MB/s]


In [None]:
! unzip /content/all-the-news.zip

Archive:  /content/all-the-news.zip
  inflating: articles1.csv           
  inflating: articles2.csv           
  inflating: articles3.csv           


## imports

In [None]:
import pandas as pd
import re  # regex library; syntax: re.sub(pattern, repl, string, count=0, flags=0)
import nltk
from nltk.corpus import stopwords
import torch

## data cleaning


In [None]:
df_1 = pd.read_csv("/content/articles1.csv")

In [None]:
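# Drop rows without content or title, then exact duplicates,
# then any row that still has a missing value in another column
df_1.dropna(subset=['content'], inplace = True)
df_1.dropna(subset=['title'], inplace = True)
df_1.drop_duplicates(inplace = True)
df_1.dropna(inplace=True)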

In [None]:
# Drop articles shorter than 1000 characters in one vectorized step
df_1.drop(df_1[df_1["content"].str.len() < 1000].index, inplace=True)

In [None]:
#df_1['year'] = df_1['year'].astype(int)
#df_1['month'] = df_1['month'].astype(int)
df_1.drop(["url","year","month","Unnamed: 0"],inplace=True,axis=1)

In [None]:
# Keep only the article id and its text
df_2=df_1[["id","content"]]
d2=df_1.values.tolist()
# Character length of every article, sorted ascending
df=df_2.content.str.len().sort_values()

In [None]:
df

42038         1
45537         2
45341         2
45639         2
44325         2
          ...  
18159     73958
6407      85948
7174      96682
22223    119764
17533    149346
Name: content, Length: 41394, dtype: int64

In [None]:
df_2.content[45537]

In [None]:
l=df_1['content'].values.tolist()
l.sort(key=len)


In [None]:
len(l)

## text cleaning

In [None]:

contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",

                           "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",

                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",

                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",

                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",

                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",

                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",

                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",

                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",

                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",

                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",

                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",

                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",

                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",

                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",

                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",

                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",

                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",

                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",

                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",

                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",

                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",

                           "you're": "you are", "you've": "you have"}

In [None]:
text=l[1000]

In [None]:
text

'’SpaceX made history on Friday by landing its Falcon 9 rocket back on a barge in the middle of the Atlantic Ocean.’ ’The Falcon 9 launched out of Cape Canaveral, Florida.’ ”The goal of the launch was to send the Dragon cargo spacecraft up to the International Space Station (ISS) where, among other things, it’ll drop off an   module.” ’’ ’’ ”Here’s a   of the landing. Practically hit the center!” ’’ ’After that, the Falcon 9 rocket aimed to stick the landing on a barge in the Atlantic Ocean, called ”Of Course I Still Love You.” ().’ ’After four failed attempts, it was successful.’ ’’ ’’ ’’ ’’ ’SpaceX made history on Friday by landing its. ..’'

### Normalization

In [None]:
#text=re.sub(r"(http[s]?\://\S+)|:|[0-9]|[\"\-|\(\)_—\[\]”“…\.,\?!=\+{}\$\^]|([#@]\S+)|\n", "",text)
text = text.lower()
#Normalize curly apostrophes so the straight-apostrophe keys of contraction_mapping can match
text = re.sub(r"[’‘]", "'", text)
#Remove any text inside the parenthesis()
text = re.sub(r'\([^)]*\)','', text)
#contraction mapping
text = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in text.split(" ")])
#Remove possessive 's
text = re.sub(r"'s\b","",text)
#Keep letters only
text = re.sub("[^a-zA-Z]", " ", text)


In [None]:
text=re.sub(r"\s+|['’]",' ',text)

### Stop Words

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
stop_words = set(stopwords.words('english'))

In [None]:
tokens = [w for w in text.split() if w not in stop_words]

In [None]:
#removing short words (fewer than 3 characters)
long_words=[]
for i in tokens:
    if len(i)>=3:
        long_words.append(i)
text=(" ".join(long_words)).strip()
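The normalization, stop-word, and short-word steps above can be gathered into one reusable function. The sketch below is self-contained, so it uses tiny illustrative `CONTRACTIONS` and `STOP_WORDS` sets (assumptions) instead of the full `contraction_mapping` dict and NLTK's list. It also straightens curly apostrophes before the contraction lookup, a step assumed here so that text such as `it’ll` can match straight-apostrophe keys. The name `clean_text` is hypothetical.

```python
import re

# Illustrative subsets only (assumptions); pass the full contraction_mapping
# and NLTK stop-word set when running inside the notebook.
CONTRACTIONS = {"it'll": "it will", "here's": "here is", "can't": "cannot"}
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "to", "it", "will", "is", "here"}


def clean_text(text, contractions=CONTRACTIONS, stop_words=STOP_WORDS, min_len=3):
    """Lower-case, expand contractions, keep letters, drop stop and short words."""
    text = text.lower()
    text = re.sub(r"[’‘]", "'", text)      # straighten curly apostrophes
    text = re.sub(r"\([^)]*\)", "", text)  # drop text inside parentheses
    text = " ".join(contractions.get(t, t) for t in text.split())
    text = re.sub(r"'s\b", "", text)       # drop possessive 's
    text = re.sub(r"[^a-z]", " ", text)    # keep letters only
    tokens = [w for w in text.split() if w not in stop_words and len(w) >= min_len]
    return " ".join(tokens)
```

Wrapping the steps this way makes it straightforward to apply the same cleaning to every article, for example with `df_1["content"].map(clean_text)`.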

## Summarization

### BART (good, but limited to short input texts ☹)

In [None]:
# Importing the pipeline helper
from transformers import pipeline
# Loading the DistilBART model and tokenizer fine-tuned on CNN/DailyMail
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", tokenizer="sshleifer/distilbart-cnn-12-6", framework="pt")
# The pipeline returns a list with one dict per input; keep it and print the summary text
distilbart_summary = summarizer(text, min_length=50)
print(distilbart_summary[0]["summary_text"])
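One common way around the input-length limit is to split a long article into chunks, summarize each chunk, and join the partial summaries. The sketch below assumes a simple word-count split; `chunk_text`, `summarize_long`, and the default of 400 words are hypothetical choices, and word counts only approximate the model's token limit. The summarizer is passed in as a plain callable so the block runs on its own.

```python
def chunk_text(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long(text, summarize_fn, max_words=400):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_text(text, max_words))
```

With the pipeline from the cell above, `summarize_fn` could be `lambda chunk: summarizer(chunk)[0]["summary_text"]`. Splitting on sentence boundaries instead of raw word counts would likely give cleaner chunks, at the cost of a slightly longer helper.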

### BERT

In [None]:
!pip install bert-extractive-summarizer

In [None]:
!pip install spacy

In [None]:
from summarizer import Summarizer,TransformerSummarizer
bert_model = Summarizer()  


In [None]:
bert_summary = bert_model(text,num_sentences=5,min_length=60)
bert_summary

'spacex made history friday landing falcon rocket back barge middle atlantic ocean falcon launched cape canaveral florida goal launch send dragon cargo spacecraft international space station among things drop module landing practically hit center falcon rocket aimed stick landing barge atlantic ocean called course still love four failed attempts successful spacex made history friday landing'