## Import necessary packages

In [1]:
import pandas as pd
import numpy as np
import re

pd.set_option("display.max_colwidth", 200)


## Import data

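The Kaggle News Summary dataset is described as consisting of 4515 examples with the fields Author_name, Headlines, Url of Article, Short text, and Complete Article. The file loaded here, `news_summary_more.csv`, has only two columns: `headlines` and `text`.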

[News Summary dataset on Kaggle](https://www.kaggle.com/datasets/sunnysai12345/news-summary?select=news_summary_more.csv)

In [2]:
df = pd.read_csv("news_summary_more.csv")
df.head()

|   | headlines | text |
|---|---|---|
| 0 | upGrad learner switches to career in ML & Al with 90% salary hike | Saurav Kant, an alumnus of upGrad and IIIT-B's PG Program in Machine learning and Artificial Intelligence, was a Sr Systems Engineer at Infosys with almost 5 years of work experience. The program ... |
| 1 | Delhi techie wins free food from Swiggy for one year on CRED | Kunal Shah's credit card bill payment platform, CRED, gave users a chance to win free food from Swiggy for one year. Pranav Kaushik, a Delhi techie, bagged this reward after spending 2000 CRED coi... |
| 2 | New Zealand end Rohit Sharma-led India's 12-match winning streak | New Zealand defeated India by 8 wickets in the fourth ODI at Hamilton on Thursday to win their first match of the five-match ODI series. India lost an international match under Rohit Sharma's capt... |
| 3 | Aegon life iTerm insurance plan helps customers save tax | With Aegon Life iTerm Insurance plan, customers can enjoy tax benefits on your premiums paid and save up to ₹46,800^ on taxes. The plan provides life cover up to the age of 100 years. Also, cust... |
| 4 | Have known Hirani for yrs, what if MeToo claims are not true: Sonam | Speaking about the sexual harassment allegations against Rajkumar Hirani, Sonam Kapoor said, "I've known Hirani for many years...What if it's not true, the [#MeToo] movement will get derailed." "I... |
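Before cleaning, it can help to confirm the shape of the frame and check for missing values, since `re.sub` raises a `TypeError` when it receives a `NaN` instead of a string. The sketch below uses a tiny stand-in frame with made-up rows (an assumption, not the real file) so that it runs without the download; `sample_df` and `clean_df` are illustrative names.

```python
import io

import pandas as pd

# Tiny stand-in for news_summary_more.csv: same two columns, made-up rows.
# The second row deliberately has an empty `text` field.
csv_text = io.StringIO(
    "headlines,text\n"
    "Sample headline one,Sample article text one.\n"
    "Sample headline two,\n"
)
sample_df = pd.read_csv(csv_text)

# Number of rows and columns.
print(sample_df.shape)

# Missing values per column; an empty CSV field is read as NaN.
print(sample_df.isna().sum())

# Drop rows where either column is missing before any string cleaning.
clean_df = sample_df.dropna(subset=["headlines", "text"]).reset_index(drop=True)
print(len(clean_df))
```

With the real data, the same three checks would be run on `df` directly.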


## Text Preprocessing

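**Steps**
- Replace punctuation and special characters (including apostrophes) with spaces
- Convert to lower case

Numbers are kept, and the resulting runs of spaces are not collapsed.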

In [3]:
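def preprocess(text):
  # replace punctuation and special characters with a space
  text = re.sub(r'[_"\-;%()|+&=*.,!?:#$@\[\]/]', ' ', text)
  # replace apostrophes with a space
  text = re.sub(r'\'', ' ', text)
  # convert to lower case
  text = text.lower()
  return text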
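To see what the function does to a single string, here is a small self-contained check on a made-up input. It repeats the same two substitutions as the cell above. `collapse_spaces` is a hypothetical extra step, not part of the notebook, showing one way to squeeze the leftover runs of spaces if that is wanted.

```python
import re


def preprocess(text):
    # Same two substitutions and lower-casing as the notebook cell.
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'\'', ' ', text)
    return text.lower()


def collapse_spaces(text):
    # Hypothetical extra step: squeeze whitespace runs and trim both ends.
    return " ".join(text.split())


raw = "IIIT-B's PG Program: 90% hike!"
cleaned = preprocess(raw)

# repr() makes the doubled and trailing spaces visible.
print(repr(cleaned))
print(repr(collapse_spaces(cleaned)))
```

Each removed character becomes one space, so a punctuation mark next to an existing space leaves two spaces in a row, and digits such as `90` survive.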

In [4]:
preprocessed_headlines = []
cleaned_text = []

# cleaning headlines
for z in df.headlines:
  preprocessed_headlines.append(preprocess(z))

# cleaning text
for t in df.text:
  cleaned_text.append(preprocess(t))

print(preprocessed_headlines[0])
print('===============================')
print(cleaned_text[0])

upgrad learner switches to career in ml   al with 90  salary hike
===============================
saurav kant  an alumnus of upgrad and iiit b s pg program in machine learning and artificial intelligence  was a sr systems engineer at infosys with almost 5 years of work experience  the program and upgrad s 360 degree career support helped him transition to a data scientist at tech mahindra with 90  salary hike  upgrad s online power learning has powered 3 lakh  careers 
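The two loops above could also be written with pandas string methods, which apply the same regex to a whole column at once. This is a sketch on a small made-up `Series`, not the notebook's own code. `PUNCT` is an assumed name for a single pattern that merges the two substitutions, and the loop version is included only to show that both approaches agree.

```python
import re

import pandas as pd

# One character class covering the punctuation set and the apostrophe.
PUNCT = r"""[_"\-;%()|+&=*.,!?:#$@\[\]/']"""


def preprocess(text):
    # Two-step version, as in the notebook.
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'\'', ' ', text)
    return text.lower()


sample = pd.Series(["ML & AI: 90% hike!", "Rohit Sharma-led India's streak"])

# Column-wise version: regex replace, then lower-case.
vectorized = sample.str.replace(PUNCT, " ", regex=True).str.lower()

# Row-by-row version for comparison.
looped = [preprocess(t) for t in sample]

print(vectorized.tolist() == looped)
```

On the real frame this would be applied to `df["headlines"]` and `df["text"]` after handling any missing values.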


## Model Building

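To install the required packages, run:
- `pip install transformers`
- `pip install torch` (the model is loaded as a PyTorch model, and the tokenizer call uses `return_tensors='pt'`)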

In [5]:
from transformers import BartTokenizer, BartForConditionalGeneration



In [6]:

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")



In [7]:
def summerized_sentencesinput(input_text, max_len, min_len):
  # tokenize, truncating inputs longer than 1024 tokens
  inputs = tokenizer([input_text], return_tensors='pt', max_length=1024, truncation=True)

  # generate the summary with beam search
  summary_ids = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    num_beams=4,
    max_length=int(max_len),
    min_length=int(min_len),
    early_stopping=True,
  )

  # decode the generated token ids back into text
  summary = [tokenizer.decode(i, skip_special_tokens=True, clean_up_tokenization_spaces=False) for i in summary_ids]

  return summary
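The function above handles one article per call. To summarize several articles, one option is to tokenize them together with padding and generate in small batches. The sketch below is an assumption about how that could look, not part of the notebook: `summarize_batch` and `batch_size` are hypothetical names, and the tokenizer and model are passed in as arguments so the helper does not depend on globals.

```python
def summarize_batch(texts, tokenizer, model, max_len=100, min_len=10, batch_size=4):
    """Summarize a list of strings in chunks of `batch_size` (illustrative helper)."""
    summaries = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])

        # Pad to the longest item in the batch and truncate at 1024 tokens.
        encoded = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=1024,
        )

        # The attention mask tells the model which positions are padding.
        summary_ids = model.generate(
            input_ids=encoded["input_ids"],
            attention_mask=encoded["attention_mask"],
            num_beams=4,
            max_length=int(max_len),
            min_length=int(min_len),
            early_stopping=True,
        )

        # Decode every sequence in the batch back into text.
        summaries.extend(
            tokenizer.batch_decode(
                summary_ids,
                skip_special_tokens=True,
                clean_up_tokenization_spaces=False,
            )
        )
    return summaries


# Example call with the objects loaded earlier in the notebook:
# summarize_batch(cleaned_text[:8], tokenizer, model)
```

Batch size is a trade-off between speed and memory, and a value as small as 4 is only a cautious starting point for a model of this size on CPU.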


In [27]:
cleaned_text[0]

'saurav kant  an alumnus of upgrad and iiit b s pg program in machine learning and artificial intelligence  was a sr systems engineer at infosys with almost 5 years of work experience  the program and upgrad s 360 degree career support helped him transition to a data scientist at tech mahindra with 90  salary hike  upgrad s online power learning has powered 3 lakh  careers '

In [29]:
input_text = cleaned_text[0]

print("Summarized text: \n",summerized_sentencesinput(input_text, 100, 10))

Summarized text: 
 ['saurav kant is an alumnus of upgrad and iiit b s pg program in machine learning and artificial intelligence. He is now a data scientist at tech mahindra with 90 salary hike.']
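A quick way to read this result is to compare the summary with its input at the word level. The sketch below, using the two strings shown above, computes how much shorter the summary is and which words the model introduced that do not occur in the input. `tokens`, `ratio`, and `novel` are illustrative names, and this is a rough sanity check rather than an evaluation metric.

```python
import re

# Input and output copied from the cells above.
source = (
    "saurav kant  an alumnus of upgrad and iiit b s pg program in machine learning "
    "and artificial intelligence  was a sr systems engineer at infosys with almost 5 "
    "years of work experience  the program and upgrad s 360 degree career support "
    "helped him transition to a data scientist at tech mahindra with 90  salary hike  "
    "upgrad s online power learning has powered 3 lakh  careers "
)
summary = (
    "saurav kant is an alumnus of upgrad and iiit b s pg program in machine learning "
    "and artificial intelligence. He is now a data scientist at tech mahindra with 90 "
    "salary hike."
)


def tokens(text):
    # Lower-case alphanumeric words; punctuation and spacing are ignored.
    return re.findall(r"[a-z0-9]+", text.lower())


# Summary length as a fraction of the input length, in words.
ratio = len(tokens(summary)) / len(tokens(source))

# Words that appear in the summary but nowhere in the input.
novel = sorted(set(tokens(summary)) - set(tokens(source)))

print(f"length ratio: {ratio:.2f}")
print("novel words:", novel)
```

A short list of novel words suggests the summary mostly reuses phrases from the input, which is typical when the input is already a short, single-paragraph text.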


---