
# **Build a News Articles Summarizer**

# **Introduction**
Staying up to date with the news matters, but reading multiple articles is time-consuming. To save time and get a quick overview of the important points, let's develop a News Articles Summarizer application using ChatGPT and LangChain. With this tool, we can scrape online articles, extract their titles and text, and generate concise summaries. In this lesson, we will walk through the workflow of constructing the summarizer, applying the concepts discussed in earlier lessons to a real-world scenario.

# **Workflow for Building a News Articles Summarizer**
Here’s what we are going to do in this project.



And here are the steps described in more detail:


1. Install required libraries: To get started, ensure you have the necessary libraries installed: **requests, newspaper3k, and langchain**.
2. Scrape articles: Use **the requests library to scrape** the content of the target news articles from their respective URLs.
3. Extract titles and text: Employ the **newspaper library to parse the scraped HTML** and extract the titles and text of the articles.
4. Preprocess the text: **Clean and preprocess the extracted texts** to make them suitable for input to ChatGPT.
5. Generate summaries: **Utilize ChatGPT to summarize** the extracted articles' text concisely.
6. Output the results: **Present the summaries** along with the original titles, allowing users to grasp the main points of each article quickly.
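The preprocessing in step 4 is not spelled out further in this lesson, so here is a minimal sketch of what it might look like. The helper name `preprocess_text` and the `max_chars` cap are assumptions made for illustration; the right cleaning rules and length limit depend on your articles and on the model's context window.

```python
import re


def preprocess_text(text: str, max_chars: int = 12000) -> str:
    """Normalize whitespace and cap the length of an article (hypothetical helper)."""
    # Normalize Windows/old-Mac line endings to "\n"
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into a single paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim the ends, then cap the length (max_chars is an assumed, arbitrary limit)
    return text.strip()[:max_chars]


cleaned = preprocess_text("Hello   world\n\n\n\nNext\tline  ")
print(repr(cleaned))
```

A character cap is only a rough proxy for the model's token limit; a token-based cap would be more precise.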


In [None]:
pip install langchain==0.1.4 deeplake openai==1.10.0 tiktoken



In [None]:
!pip install python-dotenv



In [None]:
from dotenv import load_dotenv

load_dotenv('/content/APIKeys.env')

True

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_= load_dotenv(find_dotenv())

OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
GenerativeAIActiveLoop = os.environ['GenerativeAIActiveLoop']

We picked the URL of a news article to summarize. The following code fetches the article from its URL using the requests library with a custom User-Agent header. It then extracts the title and text of the article using the newspaper library.

In [None]:
pip install -q newspaper3k


In [None]:
import requests
from newspaper import Article

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
}

article_url = "https://www.artificialintelligence-news.com/2022/01/25/meta-claims-new-ai-supercomputer-will-set-records/"

session = requests.Session()

try:
    response = session.get(article_url, headers=headers, timeout=10)

    if response.status_code == 200:
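        article = Article(article_url)
        # reuse the HTML fetched with our custom headers instead of downloading it again
        article.download(input_html=response.text)
        article.parse()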

        print(f"Title: {article.title}")
        print(f"Text: {article.text}")

    else:
        print(f"Failed to fetch article at {article_url}")
except Exception as e:
    print(f"Error occurred while fetching article at {article_url}: {e}")

Title: Meta claims its new AI supercomputer will set records
Text: Ryan is a senior editor at TechForge Media with over a decade of experience covering the latest technology and interviewing leading industry figures. He can often be sighted at tech conferences with a strong coffee in one hand and a laptop in the other. If it's geeky, he’s probably into it. Find him on Twitter (@Gadget_Ry) or Mastodon (@gadgetry@techhub.social)

Meta (formerly Facebook) has unveiled an AI supercomputer that it claims will be the world’s fastest.

The supercomputer is called the AI Research SuperCluster (RSC) and is yet to be fully complete. However, Meta’s researchers have already begun using it for training large natural language processing (NLP) and computer vision models.

RSC is set to be fully built in mid-2022. Meta says that it will be the fastest in the world once complete and the aim is for it to be capable of training models with trillions of parameters.

“We hope RSC will help us build entire
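The cell above handles a single URL. If you want to process several articles, one possible pattern is to loop over the URLs and collect successes and failures separately, so that one bad link does not stop the run. In this sketch, `fetch_articles` and `fake_fetch` are hypothetical names; `fake_fetch` is an offline stand-in, and a real fetcher would wrap the requests and newspaper code shown above.

```python
from typing import Callable, Dict, List, Tuple


def fetch_articles(
    urls: List[str],
    fetch_one: Callable[[str], Dict[str, str]],
) -> Tuple[List[Dict[str, str]], List[Tuple[str, str]]]:
    """Fetch every URL, returning (articles, failures)."""
    articles: List[Dict[str, str]] = []
    failures: List[Tuple[str, str]] = []
    for url in urls:
        try:
            articles.append(fetch_one(url))
        except Exception as exc:
            # Record the failure and keep going with the remaining URLs
            failures.append((url, str(exc)))
    return articles, failures


def fake_fetch(url: str) -> Dict[str, str]:
    """Offline stand-in for a real fetcher (assumption for this sketch)."""
    if "bad" in url:
        raise ValueError("HTTP 404")
    return {"url": url, "title": "Example title", "text": "Example text"}


articles, failures = fetch_articles(
    ["https://example.com/a", "https://example.com/bad"], fake_fetch
)
print(len(articles), "fetched;", len(failures), "failed")
```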

The next code imports the HumanMessage schema class from LangChain, which represents a user message in chat-based tasks. It then sets up the prompt and fills it with the article's content. The ChatOpenAI instance, with a temperature of 0 for controlled response generation, is created in a later cell.

In [None]:
from langchain.schema import HumanMessage

# we get the article data from the scraping part
article_title = article.title
article_text = article.text


# prepare template for prompt
template = """You are a very good assistant that summarizes online articles.

Here's the article you want to summarize.

==================
Title: {article_title}

{article_text}
==================

Write a summary of the previous article.
"""


prompt = template.format(article_title=article.title, article_text=article.text)

messages = [HumanMessage(content=prompt)]

The **HumanMessage** is a structured data format representing user messages within the chat-based interaction framework. The **ChatOpenAI class** is used to interact with the AI model, while the **HumanMessage schema** provides a standardized representation of user messages. **The template consists of placeholders** for the **article's title and content**, which are substituted with the actual article_title and article_text. This approach streamlines the creation of **dynamic prompts**: you define a template with placeholders once and replace them with actual data when needed.
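The placeholder substitution described here is plain Python string formatting, so its behavior can be checked without calling any model. The short sketch below uses a made-up title and body (both are assumptions) to show two details worth knowing: substituted values are inserted verbatim even if they contain braces, and a missing value raises an error rather than leaving a gap.

```python
template = """Title: {article_title}

{article_text}
"""

# str.format replaces each {name} placeholder with the matching keyword argument
prompt = template.format(
    article_title="Example title",
    article_text="Body with {braces} is inserted verbatim.",
)
print(prompt)

# Leaving out a placeholder value raises KeyError instead of silently leaving a gap
try:
    template.format(article_title="Example title")
except KeyError as err:
    missing = err.args[0]
    print(f"Missing value for placeholder: {missing}")
```

Note that literal braces written in the template itself, as opposed to the substituted values, would need to be doubled (`{{` and `}}`).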

In [None]:
from langchain.chat_models import ChatOpenAI

# load the model
chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

With the model loaded and the temperature set to 0, we call the chat instance to generate a summary, passing a list with a single HumanMessage object that contains the formatted prompt. The AI model processes this prompt and returns a concise summary:

In [None]:
# generate summary
summary = chat(messages)
summary

AIMessage(content="Meta (formerly Facebook) has unveiled a new AI supercomputer called the AI Research SuperCluster (RSC) that is set to be the world's fastest once fully built in mid-2022. The supercomputer will be capable of training models with trillions of parameters and is expected to be 20x faster than Meta's current clusters. Meta aims to use RSC for tasks such as real-time voice translations and identifying harmful content on its platforms. The supercomputer was designed with security and privacy controls in mind to allow Meta to use real-world examples from its production systems in training.")

If we want a bulleted list, we can modify the prompt accordingly:

In [None]:
# prepare template for prompt
template = """You are an advanced AI assistant that summarizes online articles into bulleted lists.

Here's the article you need to summarize.

==================
Title: {article_title}

{article_text}
==================

Now, provide a summarized version of the article in a bulleted list format.
"""


# format prompt
prompt = template.format(article_title=article.title, article_text=article.text)

# generate summary
summary = chat([HumanMessage(content=prompt)])
print(summary.content)

- Meta (formerly Facebook) has unveiled an AI supercomputer called the AI Research SuperCluster (RSC) that it claims will be the world's fastest.
- RSC is still under construction but Meta's researchers have already started using it for training large NLP and computer vision models.
- Once fully built in mid-2022, RSC is expected to be capable of training models with trillions of parameters and be 20x faster than Meta's current clusters.
- Meta aims to use RSC to develop AI systems for applications like real-time voice translations and AR games in the metaverse.
- RSC is designed with security and privacy controls to allow Meta to use real-world data from its production systems for training.
- Meta believes that RSC's performance, reliability, security, and privacy features are unprecedented at such a scale.
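The reply above is still one string. If you need the individual bullets, a quick (and fairly brittle) option is to split the text yourself, as sketched below. The helper name `bullets_to_list` and the sample text are assumptions, and the function only recognizes lines that start with `- `, `* `, or `• `. The Output Parsers section later in this lesson shows a more structured approach.

```python
from typing import List


def bullets_to_list(summary_text: str) -> List[str]:
    """Split a '- item' style summary into a list of bullet strings."""
    bullets = []
    for line in summary_text.splitlines():
        line = line.strip()
        # Keep only lines that look like bullets and drop the two-character marker
        if line.startswith(("- ", "* ", "• ")):
            bullets.append(line[2:].strip())
    return bullets


sample = """- First point.
- Second point.
- Third point."""

points = bullets_to_list(sample)
print(len(points))
print(points[0])
```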


In [None]:
import requests
from newspaper import Article

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
}

article_url = "https://www.cnbc.com/2024/02/16/tech-and-ai-companies-sign-accord-to-combat-election-related-deepfakes.html"

session = requests.Session()

try:
    response = session.get(article_url, headers=headers, timeout=10)

    if response.status_code == 200:
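        article = Article(article_url)
        # reuse the HTML fetched with our custom headers instead of downloading it again
        article.download(input_html=response.text)
        article.parse()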

        print(f"Title: {article.title}")
        print(f"Text: {article.text}")

    else:
        print(f"Failed to fetch article at {article_url}")
except Exception as e:
    print(f"Error occurred while fetching article at {article_url}: {e}")




Title: Microsoft, Google, Amazon and tech peers sign pact to combat election-related misinformation
Text: Sam Altman, CEO of OpenAI, attends the 54th annual meeting of the World Economic Forum, in Davos, Switzerland, on Jan. 18, 2024.

A group of 20 leading tech companies on Friday announced a joint commitment to combat AI misinformation in this year's elections.

The industry is specifically targeting deepfakes, which can use deceptive audio, video and images to mimic key stakeholders in democratic elections or to provide false voting information.

Microsoft , Meta , Google , Amazon , IBM , Adobe and chip designer Arm all signed the accord. Artificial intelligence startups OpenAI, Anthropic and Stability AI also joined the group, alongside social media companies such as Snap , TikTok and X.

Tech platforms are preparing for a huge year of elections around the world that affect upward of four billion people in more than 40 countries. The rise of AI-generated content has led to serious 

In [None]:
# prepare template for prompt
template = """You are an advanced AI assistant that summarizes online articles into bulleted lists.

Here's the article you need to summarize.

==================
Title: {article_title}

{article_text}
==================

Now, provide a summarized version of the article in a bulleted list format.
"""


# format prompt
prompt = template.format(article_title=article.title, article_text=article.text)

# generate summary
summary = chat([HumanMessage(content=prompt)])
print(summary.content)

- Meta (formerly Facebook) has unveiled an AI supercomputer called the AI Research SuperCluster (RSC) that it claims will be the world's fastest.
- RSC is still under construction but is already being used for training large NLP and computer vision models.
- Meta aims for RSC to be capable of training models with trillions of parameters and to power AI systems for real-time voice translations and AR applications in the metaverse.
- Meta expects RSC to be 20x faster than its current clusters, 9x faster at running the NVIDIA Collective Communication Library, and 3x faster at training large-scale NLP workflows.
- RSC will enable models with tens of billions of parameters to finish training in three weeks compared to nine weeks prior to RSC.
- RSC was designed with security and privacy controls to allow Meta to use real-world data from its production systems for training AI models.
- Meta believes this is the first time performance, reliability, security, and privacy have been tackled at s


# Instruct the model - Ask for the summary in French

In [None]:
# prepare template for prompt
template = """You are an advanced AI assistant that summarizes online articles into bulleted lists in French.

Here's the article you need to summarize.

==================
Title: {article_title}

{article_text}
==================

Now, provide a summarized version of the article in a bulleted list format.
"""


# format prompt
prompt = template.format(article_title=article.title, article_text=article.text)

# generate summary
summary = chat([HumanMessage(content=prompt)])
print(summary.content)

- Meta (anciennement Facebook) a dévoilé un superordinateur d'IA appelé le AI Research SuperCluster (RSC) qui sera le plus rapide au monde une fois terminé en 2022.
- Le RSC est utilisé pour former de grands modèles de traitement du langage naturel (NLP) et de vision par ordinateur.
- Meta espère que le RSC permettra de construire de nouveaux systèmes d'IA pour des applications telles que les traductions vocales en temps réel et les jeux en réalité augmentée.
- Le RSC devrait être 20 fois plus rapide que les clusters actuels de Meta, 9 fois plus rapide pour exécuter la NVIDIA Collective Communication Library (NCCL) et 3 fois plus rapide pour former des flux de travail NLP à grande échelle.
- Meta utilise le RSC pour avancer dans la recherche sur des tâches vitales telles que l'identification de contenus nuisibles sur ses plateformes en utilisant de vraies données.
- Le RSC a été conçu avec des contrôles de sécurité et de confidentialité pour permettre à Meta d'utiliser des exemples du 
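Since the English and French prompts differ only in the requested language, the template could be wrapped in a small function that takes the language as a parameter. This is a sketch under that assumption: `build_summary_prompt` and its `language` argument are hypothetical names, and the closing instruction repeats the language, which the lesson's template does not do.

```python
def build_summary_prompt(article_title: str, article_text: str,
                         language: str = "English") -> str:
    """Build a bulleted-summary prompt for the given output language."""
    template = """You are an advanced AI assistant that summarizes online articles into bulleted lists in {language}.

Here's the article you need to summarize.

==================
Title: {article_title}

{article_text}
==================

Now, provide a summarized version of the article in a bulleted list format, written in {language}.
"""
    return template.format(
        language=language,
        article_title=article_title,
        article_text=article_text,
    )


french_prompt = build_summary_prompt("Example title", "Example body.", language="French")
print(french_prompt)
```

The returned string can be passed to the chat model in the same way as before, wrapped in a HumanMessage.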

# **Output Parsers**
Now, let's improve the previous section by using Output Parsers. The Pydantic output parser in LangChain offers a flexible way to shape the outputs from language models according to pre-defined schemas. When used alongside prompt templates, it enables more structured interactions with language models, making it easier to extract and work with the information provided by the model.

The prompt template includes the format instructions from our parser, which guide the language model to produce the output in the desired structured format. The idea is to demonstrate how you could use the **PydanticOutputParser** class to receive the output as a List that holds each bullet point instead of a single string. The advantage of having a list is that you can loop through the results or index a specific item.

The **PydanticOutputParser wrapper** creates a parser that converts the model's string output into a data structure. The custom **ArticleSummary** class, which inherits from Pydantic's **BaseModel class**, will be used to parse the model's output.

We define the schema to present a **title** along with a **summary variable** that represents a **list of strings**, using the **Field object**. The **description argument** describes what each variable must represent and helps the model produce it. Our custom class also includes a **validator function** to ensure that the generated output contains at least three bullet points.

In [None]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import field_validator
from pydantic import BaseModel, Field
from typing import List

# create output parser
class ArticleSummary(BaseModel):
    title: str = Field(description="Title of the article")
    summary: List[str] = Field(description="Bulleted list summary of the article")

    # validate that the generated summary has at least three lines
    # (the validator must live inside the class body to take effect)
    @field_validator("summary")
    @classmethod
    def has_three_or_more_lines(cls, list_of_lines):
        if len(list_of_lines) < 3:
            raise ValueError("Generated summary has less than three bullet points!")
        return list_of_lines

# set up output parser
parser = PydanticOutputParser(pydantic_object=ArticleSummary)

In [None]:
from langchain.prompts import PromptTemplate


# create prompt template
# notice that we are specifying the "partial_variables" parameter
template = """
You are a very good assistant that summarizes online articles.

Here's the article you want to summarize.

==================
Title: {article_title}

{article_text}
==================

{format_instructions}
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["article_title", "article_text"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

# Format the prompt using the article title and text obtained from scraping
formatted_prompt = prompt.format_prompt(article_title=article_title, article_text=article_text)

Lastly, **the gpt-3.5-turbo-instruct model with the temperature set to 0.0** is initialized, which makes the output close to deterministic, favoring the most likely outcome over randomness/creativity. The parser object then converts the string output from the model into the defined schema using the .parse() method.
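To make the effect of temperature more concrete, the sketch below applies a temperature-scaled softmax to three made-up token scores. This is the textbook formulation, shown as an illustration rather than a description of how OpenAI's API is implemented; the `logits` values are invented. As the temperature falls, almost all probability mass moves to the highest-scoring token, which is why a very low temperature behaves like always picking the most likely option.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens

for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The formula divides by the temperature, so it cannot be evaluated at exactly 0; a setting of 0 is best read as the limiting case in which the top-scoring token is always chosen.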

In [None]:
from langchain.llms import OpenAI

# instantiate model class
model = OpenAI(model_name="gpt-3.5-turbo-instruct", temperature=0.0)

# Use the model to generate a summary
output = model(formatted_prompt.to_string())

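# Parse the output into the Pydantic model.
# The split keeps only the text up to the first closing "]} of the JSON object,
# dropping anything the model may have appended after it.
parsed_output = parser.parse(output.split("\"]}")[0] + "\"]}")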
parsed_output

ArticleSummary(title='Meta claims its new AI supercomputer will set records', summary=["Meta (formerly Facebook) has unveiled an AI supercomputer called the AI Research SuperCluster (RSC) that is set to be the world's fastest once completed in mid-2022.", "The RSC is already being used by Meta's researchers for training large NLP and computer vision models.", 'Meta hopes that the RSC will pave the way for building technologies for the metaverse and will be 20x faster than their current V100-based clusters.', 'The RSC is also estimated to be 9x faster at running the NVIDIA Collective Communication Library (NCCL) and 3x faster at training large-scale NLP workflows.', "Meta's previous AI research infrastructure only used open source and publicly-available datasets, but the RSC was designed with security and privacy controls in mind to allow for the use of real-world data from their production systems.", 'This will enable Meta to advance research for tasks such as identifying harmful conte

The Pydantic output parser is a powerful method for molding and structuring the output from language models. It uses the Pydantic library, known for its data validation capabilities, to define and enforce data schemas for the model's output.
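Completion models sometimes add text before or after the JSON that the format instructions ask for, which is why the parsing cell above trims the raw output before handing it to the parser. A slightly more general way to isolate the JSON is sketched below using only the standard library. `extract_json_object` is a hypothetical helper, not part of LangChain, and it assumes the reply contains exactly one JSON object.

```python
import json


def extract_json_object(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' as JSON."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("No JSON object found in model output")
    return json.loads(text[start:end + 1])


# Invented model reply with extra text around the JSON
raw_output = (
    'Here is the summary:\n'
    '{"title": "Example", "summary": ["a", "b", "c"]}\n'
    'Hope this helps!'
)
data = extract_json_object(raw_output)
print(data["summary"][1])
```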

# **This is a recap of what we did:**



* We defined a **Pydantic data structure** named **ArticleSummary**. This model serves as a blueprint for the desired structure of the generated article summary. It comprises fields for the title and the summary, which is expected to be a list of strings representing bullet points.
* We incorporated a **validator** within this model to ensure the summary comprises at least three points, thereby maintaining a certain level of detail in the summarization.
* We then instantiated a **parser object** using our ArticleSummary class. This parser ensures that the output generated by the language model aligns with the **defined structure** of our custom schema.
* To direct the language model's output, we created the **prompt template**. The template instructs the model to act as an assistant that summarizes online articles, and it incorporates the parser's format instructions.
* In short, output parsers let us specify the desired format of the model's output, which makes it easier to extract meaningful information from the model's responses.
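The validation rule in this recap can be tried out without calling a model. The sketch below uses a standard-library dataclass as a stand-in for the Pydantic model, so it is an illustration of the same rule rather than the lesson's actual parser; the class name `SummarySketch` is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SummarySketch:
    """Standard-library stand-in for the lesson's Pydantic model."""
    title: str
    summary: List[str]

    def __post_init__(self):
        # Same rule as the lesson's validator: at least three bullet points
        if len(self.summary) < 3:
            raise ValueError("Generated summary has less than three bullet points!")


ok = SummarySketch(title="Example", summary=["one", "two", "three"])

# Because the summary is a list, it can be looped over or indexed directly
for i, point in enumerate(ok.summary, start=1):
    print(i, point)

try:
    SummarySketch(title="Example", summary=["only one"])
except ValueError as err:
    error_message = str(err)
    print("Rejected:", error_message)
```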

# **Conclusion**
We've successfully built our ***News Articles Summarizer*** by leveraging the **potential of PromptTemplates and OutputParsers**, showing the prompt-handling capabilities of LangChain. The Pydantic output parser is a powerful method for molding and structuring the output from language models. It uses the Pydantic library, known for its data validation capabilities, to define and enforce data schemas for the model's output.

We defined a Pydantic model named "ArticleSummary". This model serves as a blueprint for the desired structure of the generated article summary. It comprises fields for the title and the summary, which is expected to be a list of strings representing bullet points. Importantly, we incorporated a validator within this model to ensure the summary comprises at least three points, thereby maintaining a certain level of detail in the summarization.

We then instantiated a PydanticOutputParser, passing it the "ArticleSummary" model. This parser plays a crucial role in ensuring the output generated by the language model aligns with the structure outlined in the "ArticleSummary" model.

A good understanding of prompt and output design nuances equips you to customize the model to produce results that perfectly match your specific requirements.