## 1. Extracting an article as a text file from a webpage

For this purpose, we will use BeatifulSoup library. It is Python library that is used for web scraping purposes to pull the data out of HTML and XML files. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.
We will provide the URL of the webpage and output its content into a text file titled 'article.txt'.

In [1]:
import urllib.request
from bs4 import BeautifulSoup
  
# Pass url of the webpage and the path where you want to save it as a text file)
urllib.request.urlretrieve("https://en.wikipedia.org/wiki/The_Hunger_Games",
                           "webpage.txt")

# Open the file in reading mode  
file = open("webpage.txt", "r", encoding="UTF-8")

# Read the file
contents = file.read()

# Parse it through an HTML parser provided by BeautifulSOup library
soup = BeautifulSoup(contents, 'html.parser')
  
# Open a new empty file
f = open("article.txt", "w", encoding="UTF-8")
  
# traverse paragraphs from soup and save in the empty file
text=""
for data in soup.find_all("p"):
    text = text+data.get_text()
    f.writelines(text)
  
f.close()

## 2. Text Summarization

We will be using the Transformers library for text summarization. It provides general-purpose architectures for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with a number of pretrained models in many languages.

Install the required libraries if not already installed using the command : pip install transformers torch sentencepiece

In [2]:
from transformers import pipeline

# Open the article in read mode
f = open("article.txt", "r", encoding="utf8")
to_tokenize = f.read()
to_tokenize = to_tokenize[:1024]

In our project, we are using HuggingFace module which has performed finetuning with the CNN/DailyMail summarization dataset, providing us with pretrained models.

## Using DistilBART: 

DistilBART is same as BART but smaller. BART (Bidirectional and Auto-Regressive Transformers) combines a bidirectional BERT-like (Bidirectional Encoder Representations from Transformers) encoder with a GPT-like decoder (Generative Pre-Training), allowing us to benefit from BERT bidirectionality while being able to generate text, which is not one of BERT’s key benefits. Using the BART architecture, we can finetune a model for summarization. 

In [3]:
# Initialize the HuggingFace summarization pipeline
summarizer = pipeline("summarization")
summarized = summarizer(to_tokenize, min_length=75, max_length=200)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [4]:
# Print summarized text
print(summarized)

[{'summary_text': ' The Hunger Games is a series of young adult dystopian novels written by American author Suzanne Collins . The first three novels are titled "The Hunger Games" and "Catching Fire" Each was adapted for film, with the film adaptation of Mocking Games split into two feature-length films . The series is set in the Hunger Games universe, and the first two novels were both New York Times best sellers .'}]


In [5]:
# Store summarized text in a file
summaryBART = open("summaryBART.txt", "w")
summaryBART.write(str(summarized)) # converted the list to string
summaryBART.close()

## Using T5:
T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format. T5 converts all NLP problems into a text-to-text format. It is trained using teacher forcing, we always need an input sequence and a corresponding target sequence. 

In [6]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + to_tokenize, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(inputs["input_ids"], max_length=200, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))

<pad> the series is set in the Hunger Games universe. the first three novels are a trilogy following teenage protagonist Katniss Everdeen. each was adapted for film, establishing the film series.</s>


In [7]:
summaryT5 = open("summaryT5.txt", "w")
summaryT5.write(tokenizer.decode(outputs[0]))
summaryT5.close()

## Conclusion
We used BeautifulSoup to retrieve textual data from a URL and then used HuggingFace to summarize that text succintly. We used both the distilled BART and T5 models.
Our BART model seems to have provided a better summary of the article.