# News Summarizer


1. Create an end-to-end solution that performs extractive text summarization of newspapers. (Bonus points for SOTA methods)
2. Create an end-to-end solution that performs abstractive text summarization of newspapers. (Bonus points for SOTA methods)
3. Test out all solutions using newspaper articles from recent times.
4. Benchmark both solutions for latency, accuracy (subjective), and usefulness.
5. Serve both models using TensorFlow Serving (or any other form of serving).
6. Try to optimize the latency of both solutions.

Dataset: find newspapers here: https://github.com/codelucas/newspaper
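
Task 4 asks for a latency benchmark of both summarizers. A minimal sketch of such a harness follows. It assumes each solution can be wrapped as a plain function that takes article text and returns a summary. The names `benchmark_latency` and `lead_two_sentences` are hypothetical, and the placeholder summarizer only stands in for a real model.

```python
import statistics
import time


def benchmark_latency(summarize, articles, warmup=1):
    """Time `summarize` on each article and return latency statistics in seconds."""
    # Warm-up calls keep one-off costs (model loading, caching) out of the timings.
    for text in articles[:warmup]:
        summarize(text)

    latencies = []
    for text in articles:
        start = time.perf_counter()
        summarize(text)
        latencies.append(time.perf_counter() - start)

    return {
        "n": len(latencies),
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def lead_two_sentences(text):
    # Placeholder summarizer for illustration only: keep the first two sentences.
    return ". ".join(text.split(". ")[:2])


stats = benchmark_latency(lead_two_sentences, ["One. Two. Three.", "A. B. C. D."])
print(stats)
```

Running the same article list through the extractive and the abstractive model gives directly comparable numbers. The accuracy and usefulness judgments remain subjective and are not covered by this helper.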


### Setup google drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cwd = "/content/drive/MyDrive/Project"
dataset_path = cwd + "/" + "news_articles_dataset_10k.csv"
!ls {cwd}

 bert_scratch.pth		 test_bdf.json
 data_df_with_sent_count.csv	 test_df
 Ipynb				 test_doc_dict.pkl
 news_articles_10k.csv		 test_sents_list.pkl
 news_articles_dataset_10k.csv	 train_bdf.csv
 news_articles_dataset.csv	 train_bdf.json
'News Summarizer Info.gdoc'	 train_df
 old				 train_doc_dict.pkl
 test_bdf.csv			 train_sents_list.pkl


### Install necessary packages

In [None]:
# Install the most recent version of TensorFlow to use the improved
# masking support for `tf.keras.layers.MultiHeadAttention`.
!apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2
!pip uninstall -y -q tensorflow keras tensorflow-estimator tensorflow-text
!pip install -q tensorflow_datasets
!pip install -q -U tensorflow-text 
!pip install -q -U tensorflow

#Install newspaper library
!pip install -q -U newspaper3k

# Install the Hugging Face transformers library
!pip install -q -U transformers
!pip install -q -U sentencepiece
!pip install -q -U rouge-score
!pip install -q -U tensorflow_hub
!pip install -q -U datasets

# Download the large English language model en_core_web_lg using spaCy
!python -m spacy download en_core_web_lg

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following packages will be REMOVED:
  libcudnn8-dev
The following held packages will be changed:
  libcudnn8
The following packages will be DOWNGRADED:
  libcudnn8
0 upgraded, 0 newly installed, 1 downgraded, 1 to remove and 20 not upgraded.
Need to get 430 MB of archives.
After this operation, 1,392 MB disk space will be freed.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  libcudnn8 8.1.0.77-1+cuda11.2 [430 MB]
Fetched 430 MB in 7s (63.8 MB/s)
(Reading database ... 123941 files and directories currently installed.)
Removing libcudnn8-dev (8.1.1.33-1+cuda11.2) ...
update-alternatives: removing manually selected alternative - switching libcudnn to auto mode
(Reading database ... 123918 files and directories currently ins

#### Wrap the cell output

In [None]:
# wrap the output in colab cells
from IPython.display import HTML, display
from IPython.display import Image 

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)


## 1. Import the required libraries

In [None]:
import os
import pickle
from tqdm.notebook import tqdm

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Import tensorflow library
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Model
from tensorflow.keras import Sequential
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras import losses
from tensorflow.keras import callbacks
from tensorflow.keras import metrics

In [None]:
# Import torch library
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader 

In [None]:
# Import TensorFlow companion packages (datasets, hub, text)
import tensorflow_datasets as tfds
import tensorflow_hub as hub
import tensorflow_text as tf_text

In [None]:
import transformers
from transformers import AutoTokenizer, AutoModel
from transformers import pipeline, set_seed
transformers.utils.move_cache()

Moving 0 files to the new cache system


0it [00:00, ?it/s]

In [None]:
import newspaper
from newspaper import Config
from newspaper import Article

In [None]:
# Import nlp libraries
import nltk
import spacy
from nltk.corpus import stopwords
from string import punctuation
from rouge_score import rouge_scorer 

In [None]:
# download nltk models
nltk.download("stopwords")
nltk.download("punkt")  

# load nlp model en_core_web_lg
nlp = spacy.load('en_core_web_lg')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## 2. Load the dataset

### 2.1 Download the dataset
Download news articles from different sources using the newspaper library.

In [None]:
# websites for scraping news
urls = ['https://www.euronews.com'
, "https://www.cnn.com"
, "https://www.news18.com"
, "https://eng.bharattimes.co.in/"
, 'https://www.marketwatch.com'
, "https://timesofindia.indiatimes.com/"
, "https://indiatimes.com/"
, "https://economictimes.indiatimes.com/"
, "https://www.businessinsider.in/"
, "https://zeenews.india.com/"
, "https://www.livemint.com/"
, "https://edition.cnn.com/"
, "https://cnn.com/"
, "https://www.newslaundry.com/"
, "https://scroll.in/"
, "https://www.financialexpress.com/"
, "https://www.businesstoday.in/"
, "https://www.foxnews.com/"
, "https://www.moneycontrol.com/"
, "https://www.afaqs.com/"
, "https://yourstory.com/"
, "https://www.nytimes.com/"
, "https://www.cnbc.com/"
, "https://www.cnet.com/"
, "https://techcrunch.com/"
, "https://www.ndtv.com/"
, "https://www.nasdaq.com/"
, "https://www.businessinsider.in/"
, "https://www.washingtonpost.com/"
, "https://www.bbc.co.uk/"
, "https://www.forbes.com/"
, "https://www.cbsnews.com"
, "https://www.wsj.com"
, "https://www.usatoday.com"
, "https://www.dailymail.co.uk"
, "https://www.theguardian.com"
, "https://www.euromonitor.com"
, "https://www.republicworld.com"
, "https://www.einnews.com"
, "https://www.thehindu.com"
, "https://www.dnaindia.com"
, "https://www.japantimes.co.jp"
, "https://www.abplive.com"
, "https://www.newsweek.com"
, "https://www.firstpost.com"
, "https://www.thehill.com"
, "https://www.indianexpress.com"
, "https://www.businessinsider.com"
, "https://www.dw.com"
, "https://www.sfgate.com"
, "https://www.thetimes.co.uk"
, "https://www.nzherald.co.nz/"
, "https://www.newindianexpress.com/"
, "https://www.thejakartapost.com"
, "https://www.voanews.com"
, "https://www.business-standard.com"
, "https://www.aljazeera.com/"
, "https://www.deccanherald.com"
, "https://www.scmp.com"
]

print(f"Total websites in the dataset: {len(urls)}")

Total websites in the dataset: 59
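
The list above includes `https://www.businessinsider.in/` twice and several `cnn.com` variants, so the count of 59 is not a count of distinct sites. A minimal sketch of deduplicating sources before scraping is shown below. The helpers `normalize_source` and `dedupe_sources` are hypothetical, and treating a subdomain such as `edition.cnn.com` as a separate source is an assumption.

```python
from urllib.parse import urlsplit


def normalize_source(url):
    """Reduce a source URL to a comparable key: lowercase host without 'www.'."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def dedupe_sources(source_urls):
    """Keep the first URL seen for each normalized host, preserving order."""
    seen = set()
    unique = []
    for url in source_urls:
        key = normalize_source(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


sample = [
    "https://www.cnn.com",
    "https://edition.cnn.com/",
    "https://cnn.com/",
    "https://www.businessinsider.in/",
    "https://www.businessinsider.in/",
]
print(dedupe_sources(sample))
```

Deduplicating up front avoids downloading the same site twice, which is one way the same article can end up in the dataset more than once.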


In [None]:
# Builds the dataset of news articles by downloading them with the newspaper library.
# Note: this class rebinds the name `Dataset` imported from torch.utils.data above.
class Dataset:
    def __init__(self, sources, df=None):
        self.sources = sources
        self.total_links = 0
        self.total_scraped = 0

        # Configuration for downloading articles
        # self.USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
        self.USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'

        self.config = Config()
        self.config.browser_user_agent = self.USER_AGENT
        self.config.request_timeout = 15
        self.columns = ["id",
                    "title",
                    "text",
                    "summary", 
                    "authors",
                    "url",
                    "publish_date",
                    "language",
                    "keywords",
            ]


    # create the data dictionary with the column names as keys
    def get_data_dict(self, data):
        return {x: y for x, y in zip(self.columns, data)}
    
    # creates the dataframe
    def create_df(self, data=None, is_save=False):
        if data:
            data = [self.get_data_dict(data)]

        df = pd.DataFrame(data=data, columns=self.columns)

        if is_save:
            df.to_csv(self.save_path)
            return
        return df
    
    
    # scrape the website and extract all text data
    # and store data into pandas dataframe
    def build(self, save_path="./articles_dataset.csv", verbose=False):

        # create the df if not exists
        self.save_path = save_path
        if not os.path.exists(self.save_path):
            self.create_df(is_save=True)

        # `tqdm` was imported as `from tqdm.notebook import tqdm`, so call it directly
        for url in tqdm(self.sources):
            source = newspaper.Source(url, config=self.config)
            source.download()
            source.parse()              #taking text from the html pages 
            source.set_categories()
            source.download_categories()  # we'll download the categories from the website 
            source.parse_categories()

            # source.set_feeds()
            # source.download_feeds()
            # source.parse_feeds()

            source.generate_articles(1300)

            curr_count = len(source.articles)
            self.total_links += curr_count

            if verbose:
                print(url)
                print(curr_count, "URLs found")
                print("Downloading dataset...")
            
            source.download_articles(4)
            source.parse_articles()
            
            for ind, a in enumerate(source.articles):
                if verbose:
                    print(f"({ind+1}/{curr_count}) Scraping... ", a.url)
                
                # a.download() # download article
                # a.parse() # parse the html
                a.nlp() # use nlp

                if a.is_valid_body() and a.meta_lang == "en":
                    self.total_scraped += 1

                    if verbose:
                        print(f"({ind+1}/{curr_count}) Adding..", url)

                    keywords = ", ".join(a.keywords)
                    authors = ", ".join(a.authors)

                    # str(None) is the truthy string "None", so test the raw
                    # value before converting it.
                    data = [self.total_scraped, 
                            a.title or None,
                            a.text or None,
                            a.summary or None, 
                            authors or None,
                            str(a.url),
                            str(a.publish_date) if a.publish_date else None,
                            a.meta_lang or None,
                            keywords
                    ]

                    curr_df = self.create_df(data)

                    curr_df.to_csv(save_path, mode="a", header=not os.path.exists(save_path))

                else:
                    print("Scraping Failed", a.url)
                
                # if ind == 10:
                #     break
            print(f"\n\nTotal {self.total_scraped} Articles scraped so far...") 

        print(f"\n\nTotal {self.total_scraped} Articles scraped")

    # read the csv data file using pandas
    def get_df(self):
        return pd.read_csv(self.save_path)
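
The `build` method appends each article to the CSV as it is scraped, rather than holding everything in memory until the end. A minimal sketch of that append-as-you-go pattern with the standard `csv` module is shown below, as an alternative illustration and not the notebook's own implementation. The shortened `COLUMNS` list, the `append_row` helper, and the sample rows are all hypothetical.

```python
import csv
import os
import tempfile

COLUMNS = ["id", "title", "text"]  # shortened, hypothetical column list


def append_row(path, row):
    """Append one record to a CSV file, writing the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "articles.csv")
    append_row(path, {"id": 1, "title": "First", "text": "Body one"})
    append_row(path, {"id": 2, "title": "Second", "text": "Body, with comma"})
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(rows)
```

Writing one row at a time means an interrupted scrape keeps the rows collected so far, and the header check keeps a rerun from inserting a second header line in the middle of the file.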
    

In [None]:
# `cwd` already holds the absolute Drive path, so only the file name is appended
dataset_path = os.path.join(cwd, "news_articles_dataset.csv")

### 2.2 Load the dataset and preprocess

In [None]:
# load the dataset
df = pd.read_csv(dataset_path)
df.head()

| Unnamed: 0 | id | title | text | summary | authors | url | publish_date | language | keywords |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Czech voters turn to the far-right for answers... | The Czech Republic is facing one of the highes... | The Czech Republic is facing one of the highes... | Czech Mother, Dezider Galbavy, Czech Pensioner... | https://www.euronews.com/2022/10/13/czech-vote... | 2022-10-13 00:00:00 | en | voters, crisis, bryan, rising, social, spd, en... |
| 1 | 1 | This 1373km long undersea cable will bring 'gr... | Greece is in embarking on one of Europe’s most... | Greece is in embarking on one of Europe’s most... | Ioannis Karydas, Ceo Of Copelouzos Group | https://www.euronews.com/2022/09/17/this-1373k... | 2022-09-17 00:00:00 | en | greece, electricity, long, energy, mw, europes... |
| 2 | 2 | Inside the US facility where 199 'legally dead... | On the wall full of patients' portraits is Mat... | According to More, the facility currently acco... | Dr Arthur Caplan, Professor Of Bioethics, New ... | https://www.euronews.com/next/2022/10/14/insid... | 2022-10-14 00:00:00 | en | inside, legally, await, caplan, blood, pets, f... |
| 3 | 3 | Feeling supersonic: Felix Baumgartner on 10 ye... | 10 years ago on 14 October 2012, Felix Baumgar... | To celebrate the 10th year anniversary of the ... | Felix Baumgartner, Austria'S Record Breaking S... | https://www.euronews.com/culture/2022/10/14/fe... | 2022-10-14 00:00:00 | en | skydive, felix, moon, im, lot, world, bull, su... |
| 4 | 4 | Lithuania wants German troops 'for marriage, n... | Politicians in Lithuania are locked in an argu... | Politicians in Lithuania are locked in an argu... | Gitanas Nausėda, President Of Lithuania | https://www.euronews.com/2022/10/14/lithuania-... | 2022-10-14 00:00:00 | en | president, good, stand, wants, brigade, german... |
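
The preview contains an `Unnamed: 0` column, which is what pandas produces when a CSV that was written with its index is read back. A minimal sketch of a first cleanup pass is shown below. It runs on a tiny made-up frame with the same layout. Dropping rows without body text and rows with duplicate text is an assumption about what preprocessing is wanted here, not something the notebook has shown yet.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the scraped CSV (made-up rows, same layout).
raw_csv = io.StringIO(
    "Unnamed: 0,id,title,text\n"
    "0,0,Title A,Body A\n"
    "0,1,Title B,Body B\n"
    "0,2,Title B,Body B\n"
    "0,3,Title C,\n"
)

sample_df = pd.read_csv(raw_csv)

clean_df = (
    sample_df
    .drop(columns=["Unnamed: 0"])          # leftover index written by to_csv
    .dropna(subset=["text"])               # articles without body text
    .drop_duplicates(subset=["text"])      # same article scraped twice
    .reset_index(drop=True)
)
print(clean_df)
```

The same chain could be applied to `df` once the real file is loaded. Passing `index=False` to `to_csv` when writing would avoid creating the extra column in the first place.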
