# Automated Web Content Summarization using GPT-4 and BART

This Jupyter Notebook outlines the steps involved in creating a project that
interacts with a large language model (GPT-4) to retrieve a relevant website,
extracts and processes the content from that website, and then summarizes the
content using the `facebook/bart-large-cnn` summarizer model from Hugging Face.




# Installing Necessary Libraries

In [None]:
!pip install transformers



In [None]:
from transformers import pipeline
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

In [None]:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
!pip install llm



# Interacting with GPT-4 Model using the `llm` Python Library
The first step is to query an OpenAI GPT-4-family model (`gpt-4o` in the code below) through the `llm` Python library. The prompt asks for the single best webpage about the biography of the former footballer Diego Maradona, and the response is expected to contain the URL of that page.

In [None]:
import llm

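I removed my API key from the notebook for security reasons. Hashing the key would not help here, because a hashed key can no longer be used to authenticate; in future I will load the key from an environment variable or a secret store so that it is usable for the task but never written into the notebook.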

In [None]:
model = llm.get_model("gpt-4o")
model.key = ''
response = model.prompt('find the best webpage about diego maradona returning only the webpage most relevant to his biography')
response.text()

'The best webpage for a comprehensive biography of Diego Maradona is likely to be his Wikipedia page. Wikipedia often provides detailed, well-sourced information about the lives of notable individuals, including their early life, career milestones, achievements, and personal challenges. You can find his biography at the following link:\n\n[Diego Maradona - Wikipedia](https://en.wikipedia.org/wiki/Diego_Maradona)\n\nThis page is frequently updated and contains extensive information about his career as a footballer, his achievements, controversies, and his impact on the sport and popular culture.'
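
Rather than leaving `model.key` blank in the notebook, the key can be read at run time. The following is a minimal sketch, assuming the key is stored in an environment variable named `OPENAI_API_KEY`; the helper name `load_api_key` is hypothetical and is not part of the `llm` library.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption; use whatever name your setup defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set or is empty")
    return key

# Hypothetical usage with the model object from the cell above:
# model.key = load_api_key()
```

Failing loudly when the variable is missing avoids sending a request with an empty key and getting a less obvious authentication error later.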

# Extracting the Website URL using Regular Expressions
The next step is to extract the URL from the model's response using Python's `re` module. Because the model returns the link in Markdown form, `[title](url)`, the simple pattern `https?://\S+` also captures the closing parenthesis, which is stripped in a separate step before the URL is used to fetch and parse the webpage.

In [None]:
import re

def extract_url(text):
    # Regular expression to match URLs
    url_pattern = re.compile(r'(https?://\S+)')
    # Find all matches in the text
    urls = url_pattern.findall(text)
    return urls

# Example usage
url = extract_url(response.text())
print(url)


['https://en.wikipedia.org/wiki/Diego_Maradona)']


In [None]:
# Take the first match and drop the trailing ")" captured from the Markdown link
url = url[0][:-1]
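
Slicing off the last character only works when the URL really is followed by exactly one stray `)`. As a hedged alternative, the sketch below trims trailing punctuation and unbalanced closing parentheses, and returns `None` when the response contains no URL. The function name `extract_first_url` is hypothetical, and the set of characters it trims is an assumption about what typically follows a link in model output.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+')

def extract_first_url(text):
    """Return the first URL in `text` without trailing punctuation, or None."""
    match = URL_PATTERN.search(text)
    if match is None:
        return None
    url = match.group(0)
    while True:
        # Drop sentence punctuation and quotes that follow the URL
        trimmed = url.rstrip('.,;:!?\'"')
        # Drop a closing parenthesis only if it has no opening partner in the URL
        if trimmed.endswith(')') and trimmed.count(')') > trimmed.count('('):
            trimmed = trimmed[:-1]
        if trimmed == url:
            return url
        url = trimmed

reply = "See [Diego Maradona - Wikipedia](https://en.wikipedia.org/wiki/Diego_Maradona) for details."
print(extract_first_url(reply))
```

Counting parentheses keeps URLs whose path legitimately ends in a balanced `(...)` intact while still removing the Markdown delimiter.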


# Accessing, Parsing, and Preprocessing the Webpage
In this step, the extracted URL is used to fetch the webpage. The HTML is parsed with BeautifulSoup, taking the text inside `<p>` tags, which typically hold the main body of the article, together with the section headers. The text is then preprocessed for summarization: bracketed reference markers such as `[1]` are removed, whitespace is normalized, and the text is split into chunks of about 500 words.

In [None]:
def fetch_and_parse_wikipedia(url):
    """Fetch a Wikipedia article and return its paragraph and header text as one string."""
    try:
        # Fetch the raw HTML of the page
        webpage = urlopen(url).read()

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(webpage, 'lxml')

        # Extract the text of every paragraph
        paras = [paragraph.text for paragraph in soup.find_all('p')]

        # Extract the section headers (this list is empty if the page has no 'mw-headline' spans)
        headers = [header.text for header in soup.find_all('span', attrs={'class': 'mw-headline'})]

        # Interleave paragraphs and headers pairwise (paragraph i, then header i).
        # Note: this pairing does not reproduce the order in which they appear on the page.
        combined_text = []
        min_length = min(len(paras), len(headers))
        for i in range(min_length):
            combined_text.append(paras[i])
            combined_text.append(headers[i])

        # Append whatever is left over from the longer of the two lists
        combined_text.extend(paras[min_length:])
        combined_text.extend(headers[min_length:])

        text = ' '.join(combined_text)

        # Remove bracketed reference markers such as [1]
        text = re.sub(r"\[.*?\]+", '', text)

        return text
    except Exception as e:
        # Report the failure and signal it to the caller by returning None
        print(f"Error: {e} for URL: {url}")
        return None
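
The function above pairs paragraph *i* with header *i*, which does not match the order in which they appear on the page. One possible way to keep page order is to read headings and paragraphs in a single pass. The sketch below does this with the standard library's `html.parser` instead of BeautifulSoup, so that it runs without extra dependencies. The class and function names are hypothetical, and matching `<h2>`/`<h3>` tags rather than the `mw-headline` class is an assumption about the page markup.

```python
from html.parser import HTMLParser

class OrderedTextExtractor(HTMLParser):
    """Collect the text of <p>, <h2> and <h3> elements in document order."""

    TARGET_TAGS = {"p", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks, in page order
        self._current = None  # text pieces of the block being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in self.TARGET_TAGS:
            self._current = []

    def handle_data(self, data):
        # Keep text only while inside a target element (including nested tags)
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in self.TARGET_TAGS and self._current is not None:
            text = "".join(self._current).strip()
            if text:
                self.blocks.append(text)
            self._current = None

def extract_in_order(html):
    """Return heading and paragraph texts in the order they occur in `html`."""
    parser = OrderedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks

sample_html = ("<h2>Early years</h2><p>Born in <b>1960</b>.</p>"
               "<h2>Club career</h2><p>Debut in 1976.</p>")
print(extract_in_order(sample_html))
```

Joining the returned blocks with spaces gives a string that can be passed to `preprocess_text` in the same way as the output of `fetch_and_parse_wikipedia`.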


In [None]:
def preprocess_text(text):
    # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def chunk_text(text, chunk_size=500):
    # Split the text into chunks of at most `chunk_size` words
    words = text.split(' ')
    chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks
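
`chunk_text` cuts every 500 words, so a chunk can end in the middle of a sentence. A hedged variant is sketched below that fills each chunk with whole sentences up to a word budget. The function name `chunk_by_sentences` is hypothetical, and the sentence splitter is a simple regular expression that will mis-split on abbreviations such as "U.S.".

```python
import re

def chunk_by_sentences(text, max_words=500):
    """Group whole sentences into chunks of at most `max_words` words.

    A single sentence longer than `max_words` is kept as its own chunk.
    """
    # Split after '.', '!' or '?' when followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk if this sentence would exceed the budget
        if current and count + n_words > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(' '.join(current))
    return chunks

print(chunk_by_sentences("One two three. Four five. Six seven eight nine.", max_words=5))
```

Keeping sentences intact gives the summarizer complete statements at chunk boundaries, at the cost of chunks that vary in length.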

# Summarizing the Text using the BART Model
The final step summarizes the preprocessed text with the `facebook/bart-large-cnn` model from the Hugging Face Transformers library. Each chunk is summarized separately (40 to 100 tokens per summary), and the chunk summaries are joined into the final summary, which gives a condensed version of the original article.

In [None]:
page_content = fetch_and_parse_wikipedia(url)

if page_content:
    page_content = preprocess_text(page_content)
    text_chunks = chunk_text(page_content, chunk_size=500)

    # Reuse the `summarizer` pipeline created earlier in the notebook instead of loading the model again
    summaries = []

    for chunk in text_chunks:
        summary = summarizer(chunk, max_length=100, min_length=40, do_sample=False)
        summaries.append(summary[0]['summary_text'])

    final_summary = ' '.join(summaries)
    print(final_summary)
else:
    print("Failed to fetch or parse the page.")

Your max_length is set to 100, but your input_length is only 79. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=39)


Diego Armando Maradona (Spanish: ; 30 October 1960 – 25 November 2020) was an Argentine professional football player and manager. He was one of the two joint winners of the FIFA Player of the 20th Century award. He played for Argentinos Juniors, Barcelona, Napoli, Sevilla and Newell's Old Boys during his club career. In his international career with Argentina, he earned 91 caps and scored 34 goals. He became the coach of Argentina's national Diego Armando Maradona was born on 30 October 1960 in Lanús, Buenos Aires Province. He was the youngest player in the history of Argentine Primera División club Dorados. At age eight, he was spotted by a talent scout while playing in his local club Estrella Roja. He made his professional debut for Argentinos Juniors, 10 days before his 16th birthday, versus Talleres de Córdoba. Maradona spent five years at Argentinos Juniors, from 1976 to 1981, scoring 115 goals in 167 appearances before his US$4 million transfer to Boca Juniors. After the 1982 Wor
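
The warning in the output above appears because the last chunk is shorter than the fixed `max_length` of 100. One possible way to avoid it is to scale the length limits to the size of each chunk. The sketch below uses the word count as a rough stand-in for the token count, which is an assumption (the pipeline measures length in tokens); the helper name `summary_lengths` and the 0.5 ratio are hypothetical choices.

```python
def summary_lengths(chunk, max_cap=100, min_cap=40, ratio=0.5):
    """Pick (max_length, min_length) for a chunk based on its word count.

    Word count only approximates the tokenizer's token count.
    """
    n_words = len(chunk.split())
    # Aim for a summary about `ratio` times the input, within [8, max_cap]
    max_length = max(8, min(max_cap, int(n_words * ratio)))
    # Keep min_length safely below max_length
    min_length = min(min_cap, max_length // 2)
    return max_length, min_length

short_chunk = " ".join(["word"] * 79)
print(summary_lengths(short_chunk))

# Hypothetical usage inside the summarization loop:
# max_len, min_len = summary_lengths(chunk)
# summary = summarizer(chunk, max_length=max_len, min_length=min_len, do_sample=False)
```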

# Conclusion

This project demonstrates how a large language model can be used to locate relevant web content, which is then scraped, preprocessed, and summarized. The same sequence of URL retrieval, parsing, chunking, and summarization applies to other use cases that require summaries of long-form web content.

# Additional Considerations

*   Error Handling: Fetching the webpage can fail (for example, when the page is not accessible or the URL is incorrect), so the fetch is wrapped in error handling and the failure is reported instead of stopping the notebook.
*   Edge Cases: I had to consider scenarios where the model does not return a valid URL or where the content of the webpage is not well suited to summarization.
*   Performance Optimization: When dealing with large articles, I had to chunk the text before passing it to the summarizer, both to stay within the model's input length limit and to avoid memory issues.
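
For the edge case of an invalid or unexpected URL, one option is to check the URL before fetching it. The following is a minimal sketch using `urllib.parse`; the helper name `is_safe_url` and the allow-list containing only `en.wikipedia.org` are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hosts the notebook is willing to fetch (an illustrative assumption)
ALLOWED_HOSTS = {"en.wikipedia.org"}

def is_safe_url(url, allowed_hosts=ALLOWED_HOSTS):
    """Return True only for https URLs whose host is on the allow-list."""
    if not url:
        return False
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in allowed_hosts

print(is_safe_url("https://en.wikipedia.org/wiki/Diego_Maradona"))
print(is_safe_url("not a url"))
```

Checking the host as well as the scheme means that a URL the model invents for an unrelated site is rejected before any request is made.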



