In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# News summarization with LangChain agents and Vertex AI PaLM text models

## Overview

Integrating Large Language Models (LLMs) with external knowledge sources or applications is common practice in real-world business scenarios. In the paper [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/pdf/2210.03629.pdf), researchers from Google Research and Princeton University introduced ReAct, a paradigm that combines reasoning and acting by allowing LLMs to interact with external environments to gather additional information.

[LangChain](https://python.langchain.com/en/latest/index.html) is a framework that allows you to create applications powered by language models. It includes extensive support for [agents](https://python.langchain.com/en/latest/modules/agents.html) based on the ReAct concepts. LangChain agents use [tools](https://python.langchain.com/en/latest/modules/agents/tools.html) to interact with external systems. A tool is a component that performs a specific task, such as retrieving data from an external search engine. LangChain includes a number of predefined tools, such as tools for interacting with Google Search, Wikipedia, ArXiv, SQL databases, and many more. You can also define your own tools.

This notebook illustrates how to use LangChain agents with Vertex AI PaLM text models and custom tools. You will build an agent that can help you discover the most popular Google Search terms and analyze news articles related to those terms. An agent like that could be beneficial in a variety of business situations, including marketing, political analysis, and more.

The custom tools you will develop give the agent access to information from the [Google Trends dataset](https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/google-search-trends) and [the GDELT database](https://www.gdeltproject.org/).
The Google Trends dataset contains the top 25 overall and top 25 rising queries from Google Trends in the past 30 days. The dataset is hosted on Google BigQuery as part of the Google Cloud Datasets initiative.

The GDELT Project, which is supported by [Google Jigsaw](https://jigsaw.google.com/), monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages. The GDELT database is free to use and accessible via a variety of interfaces, including Google BigQuery and the REST API. In this notebook, we will be using the REST API.

The notebook is structured as follows:
- You will begin by installing the necessary packages and configuring your GCP environment.
- Next, you will define and test the custom LangChain tools around the Google Trends dataset and the GDELT API.
- Finally, you will experiment with using the tools with a few different types of LangChain agents.

## Install pre-requisites
Install the following Python packages.

In [None]:
! pip install -U google-cloud-aiplatform
! pip install -U langchain
! pip install -U python-dateutil
! pip install -U newspaper3k

---

#### ⚠️ Do not forget to RESTART THE RUNTIME before continuing.

---

## Configure Google Cloud environment settings

Set the following constants to reflect your GCP environment.
- `PROJECT_ID`: Your Google Cloud Project ID.
- `REGION`: The region to use for Vertex AI

In [None]:
PROJECT_ID = '<YOUR PROJECT ID HERE>'
REGION = 'us-central1'

Initialize Vertex SDK.

In [None]:
import vertexai

vertexai.init(project=PROJECT_ID, location=REGION)

## Define custom LangChain tools

LangChain tools allow LangChain agents to communicate with other systems. For more information on how to build and use them, please refer to the [Tools getting started documentation](https://python.langchain.com/en/latest/modules/agents/tools/getting_started.html).


This section of the notebook defines two custom tools:
- The Google Trends dataset tool allows an agent to retrieve a list of top-ranked search keywords on a given date. The tool retrieves this information from the Google Trends BigQuery dataset.
- The GDELT tool allows an agent to retrieve news articles that best match a set of keywords, on a given date, and with the given tone. The tool uses [the GDELT API](https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/) to retrieve the articles' metadata and content.

There are a few options for implementing LangChain tools, including using the `Tool` dataclass, subclassing the `BaseTool` class, or using the `@tool` decorator. Because their logic is relatively complex, both tools are implemented by subclassing `BaseTool`.


### Google Trends dataset tool 

The tool extracts a list of the most popular search terms on a specific date within the last 30 days. The tool is a wrapper around the Google Trends BigQuery dataset. The tool expects to receive a JSON object as input, with the following format:


```
{
    "date": "05-16-2023"
}
```

Google Trends only stores the top search terms for the previous 30 days, so the date you provide must be within the following range: `[current_date - 30, current_date - 1]`. If you provide a date outside of this range, the tool will return an empty list of keywords.


The tool can handle a variety of date formats to account for the different ways that an LLM might return dates.
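
The date handling described above can be sketched with the standard library alone. This is a minimal illustration, not the tool's implementation: the helper name `normalize_date` and the `DATE_FORMATS` list are hypothetical, and the tool defined below relies on `dateutil` rather than a fixed list of formats. The window check follows the range stated above.

```python
from datetime import date, datetime, timedelta
from typing import Optional

# Hypothetical list of formats an LLM might emit; dateutil accepts many more.
DATE_FORMATS = ("%Y-%m-%d", "%m-%d-%Y", "%m/%d/%Y", "%B %d, %Y")


def normalize_date(raw: str, today: Optional[date] = None) -> str:
    """Return raw as YYYY-MM-DD, or "" if it is unparseable or out of range."""
    today = today or date.today()
    parsed = None
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt).date()
            break
        except ValueError:
            continue
    if parsed is None:
        return ""
    # Accept only dates in [today - 30, today - 1].
    earliest = today - timedelta(days=30)
    latest = today - timedelta(days=1)
    if not (earliest <= parsed <= latest):
        return ""
    return parsed.isoformat()


reference_day = date(2023, 5, 20)  # fixed "today" so the example is repeatable
print(normalize_date("05-16-2023", reference_day))    # 2023-05-16
print(normalize_date("May 16, 2023", reference_day))  # 2023-05-16
print(normalize_date("2023-01-01", reference_day))    # empty string: out of range
```

Returning an empty string for out-of-range dates mirrors the behavior described above, where the tool answers with an empty list of keywords instead of raising an error.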


In [None]:
import json

from datetime import date, timedelta, time, datetime
from typing import Any, Dict, List, Optional, ClassVar, Tuple
from google.cloud import bigquery
from dateutil.parser import parse as parse_date
from langchain.tools import BaseTool


class QueryGoogleTrendsDatasetTool(BaseTool):
    name = "query_google_trends"
    description = """Useful for when you need to find top search terms on a given date. 
    Input is a JSON object that has the field date.
    The date must be in the following format: YYYY-MM-DD.  
    """

    client: Any
    project_id: str 
    location: str = 'US'

    def _retrieve_top_terms(self, date:str):
        """Retrieves top terms from BigQuery for a given date."""

        query = f"""SELECT term, rank FROM `bigquery-public-data.google_trends.top_terms`
            WHERE refresh_date = '{date}'
            GROUP BY 1,2
                ORDER by rank ASC
            """
        query_job = self.client.query(
            query=query,
            location=self.location)
        df = query_job.to_dataframe()
        return df

    def _parse_date(self, json_params_str: str):
        """Retrieves a date from the JSON input parameters
        and normalizes it to the format required by BigQuery."""

        params = json.loads(json_params_str)

        if 'date' in params:
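            try:
                dt = parse_date(params['date'])
                dt = dt.date()
            except (ValueError, OverflowError, TypeError):
                # Fall back to today, which is outside the valid range below.
                dt = date.today()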
        else:
            dt = date.today()

        if dt >= date.today() or dt <= date.today() - timedelta(days=30):
            dt_str = ""
        else:
            dt_str = dt.strftime('%Y-%m-%d')

        return dt_str

    def _run(self,
             json_params_str: str):
        """Return the keywords of the top-ranked search term as a JSON list."""

        refresh_date = self._parse_date(json_params_str)
        terms = json.dumps([])
        if refresh_date:
            df = self._retrieve_top_terms(refresh_date)
            if not df.empty:
                terms = df.loc[0].values[0]
                terms = json.dumps(terms.split(' '))

        return terms

    def _arun(self, json_params: str):
        raise NotImplementedError("This tool does not support async")

Let's run a quick test.

In [None]:
google_trends_tool = QueryGoogleTrendsDatasetTool(
    project_id=PROJECT_ID,
    client=bigquery.Client(project=PROJECT_ID)) # type: ignore
date_str = (date.today() - timedelta(days=10)).strftime('%Y-%m-%d')
input_params = f'{{"date": "{date_str}"}}'
google_trends_tool.run(input_params)

### GDELT Search tool 

The `GDELT Search tool` enables an agent to obtain information about articles that match a set of keywords. The tool takes a JSON object as input, with the following format:


```
{
    "date": "05-16-2023",
    "keywords": ["Real", "Madrid"],
    "tone": "positive"
}
```

The tool will search for articles published between the dates `date - time_window_days` and `date + time_window_days`, where `time_window_days` is a configurable parameter. The `tone` field can be `positive`, `negative`, or `unknown`. Please refer to the [GDELT documentation](https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/#:~:text=theme%3ATERROR-,Tone,-.%20Allows%20you%20to) for more information on the tone setting. We define `positive` as a tone value greater than 5 and `negative` as a tone value less than -5. If the tone field is set to `unknown`, it will not be used in the GDELT API query.

Besides `time_window_days`, other configuration parameters control the results, including the maximum number of returned records (`max_records`) and the maximum distance between the keywords in an article (`n_near_words`).
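
To make the time window and tone mapping concrete, here is a small sketch that turns the JSON fields into request parameters. The function name `build_gdelt_params` is hypothetical, and the fixed `sourcecountry` and `sourcelang` values are assumptions taken from the retriever's defaults; the retriever defined below is the actual implementation.

```python
from datetime import date, datetime, timedelta
from typing import Dict, List


def build_gdelt_params(keywords: List[str], event_date: date, tone: str = "unknown",
                       time_window_days: int = 3, n_near_words: int = 50) -> Dict[str, str]:
    """Build query parameters for an article-list request (illustrative only)."""
    # Map the tone label to a tone clause; anything else means no tone filter.
    tone_filter = {"positive": "tone>5", "negative": "tone<-5"}.get(tone, "")
    phrase = " ".join(keywords)
    parts = [f'near{n_near_words}:"{phrase}"', "sourcecountry:US", "sourcelang:english"]
    if tone_filter:
        parts.append(tone_filter)
    # The search window is symmetric around the requested date.
    start = datetime.combine(event_date - timedelta(days=time_window_days), datetime.min.time())
    end = datetime.combine(event_date + timedelta(days=time_window_days), datetime.min.time())
    return {
        "query": " ".join(parts),
        "mode": "ArtList",
        "format": "json",
        "startdatetime": start.strftime("%Y%m%d%H%M%S"),
        "enddatetime": end.strftime("%Y%m%d%H%M%S"),
    }


params = build_gdelt_params(["Real", "Madrid"], date(2023, 5, 16), "positive")
print(params["query"])          # near50:"Real Madrid" sourcecountry:US sourcelang:english tone>5
print(params["startdatetime"])  # 20230513000000
print(params["enddatetime"])    # 20230519000000
```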

The tool returns a text that is a compilation of article titles and article summaries for the found articles.

The tool is made up of two components:
- A retriever that searches for articles that match the input and retrieves the metadata and content of the articles. It uses the GDELT 2.0 DOC API and is implemented as a [LangChain retriever](https://docs.langchain.com/docs/components/indexing/retriever).
- A summarizer that is a [LangChain LLM chain](https://docs.langchain.com/docs/components/chains/llm-chain) that summarizes the content of the articles.

The tool first runs the retriever to get a list of LangChain documents with the content and metadata of the articles. It then runs the summarizing chain on each article content to create a list of summaries. Finally, it combines the article titles and the summaries into a single piece of text to return as an output.
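
The retrieve, summarize, and combine flow can be sketched without LangChain. Everything here is a stand-in: `Doc` is a hypothetical substitute for a LangChain document, and `first_sentence` is a stub that replaces the LLM summarizing chain so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    title: str
    content: str


def search_and_summarize(docs: List[Doc], summarize: Callable[[str], str],
                         max_chars: int = 3000) -> str:
    """Summarize each article and compile titles and summaries into one text."""
    if not docs:
        return "Could not retrieve any articles."
    records = []
    for doc in docs:
        # Truncate long articles before summarizing to bound the prompt size.
        summary = summarize(doc.content[:max_chars])
        records.append(f"Article title: {doc.title}\nArticle summary: {summary}")
    return "\n---\n".join(records)


def first_sentence(text: str) -> str:
    """Stub summarizer that keeps the first sentence; the real tool calls an LLM."""
    return text.split(". ")[0].rstrip(".") + "."


docs = [Doc("Seahawks win", "The Seahawks won on Sunday. Fans celebrated downtown.")]
result = search_and_summarize(docs, first_sentence)
print(result)
```

Passing the summarizer in as a callable keeps the retrieval and formatting logic testable without a model call.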



Let's start by defining the retriever.

In [None]:
import requests
from newspaper import Article
from newspaper import ArticleException
from langchain.schema import BaseRetriever, Document

class GDELTRetriever(BaseRetriever):
    gdelt_api_url: ClassVar[str]='https://api.gdeltproject.org/api/v2/doc/doc'
    mode: str='ArtList'
    format: str='json'
    max_records: int=10
    n_near_words: int=50
    source_country: str='US'
    source_lang: str='english'
    time_window_days: int=3

    def _get_articles_info(self, keywords: str, startdatetime: datetime, enddatetime: datetime, tone: str) -> Dict:
        """Retrieves article links and article metadata from GDELT API."""

        startdatetime_str = startdatetime.strftime('%Y%m%d%H%M%S')
        enddatetime_str = enddatetime.strftime('%Y%m%d%H%M%S')
        if tone == 'positive':
            tone = 'tone>5'
        elif tone == 'negative':
            tone = 'tone<-5'
        else:
            tone = ''

        query = f'near{self.n_near_words}:"{keywords}" '
        query += f'sourcecountry:{self.source_country} sourcelang:{self.source_lang} '
        if tone:
            query += f'{tone}'
        else:
            # Drop the trailing space left after the sourcelang clause.
            query = query[:-1]
        params = {'query': query,
                  'format': self.format,
                  'mode': self.mode,
                  'maxrecords': str(self.max_records),
                  'startdatetime': startdatetime_str,
                  'enddatetime': enddatetime_str}

        response = requests.get(self.gdelt_api_url, params=params)
        response.raise_for_status()
        return response.json()

    def _parse_article(self, url: str) -> str:
        """Retrieves and scrapes an article from a given URL."""
        article = Article(url)

        try:
            article.download()
            article.parse()
        except ArticleException:
            return ""
        else:
            return article.text

    def _get_documents(self, articles: Dict) -> List[Document]:
        """Converts a list of articles into a list of LangChain documents."""
        documents = []
        unique_docs = set()
        
        if not articles:
            return documents

        # The response may lack the 'articles' key when nothing matches.
        for article in articles.get('articles', []):
            parsed_article = self._parse_article(article['url'])
            if parsed_article and (article['title'] not in unique_docs):
                unique_docs.add(article['title'])
                document = Document(
                    page_content=parsed_article,
                    metadata={
                      'title': article['title'],
                      'url': article['url'],
                      'domain': article['domain'],
                      'date': article['seendate']
                  }
                )
                documents.append(document)
        return documents

    def get_relevant_documents(self, query: str, filters: Optional[Dict[str, object]] = None) -> List[Document]:
        """Retrieves relevant documents from GDELT API for a given query."""

        if filters and filters.get('date'):
            event_date = filters['date']
        else:
            event_date = date.today()
        if filters and filters.get('tone'):
            tone = filters['tone']
        else:
            tone = ''
        startdatetime = datetime.combine(event_date - timedelta(days=self.time_window_days), datetime.min.time())
        enddatetime = datetime.combine(event_date + timedelta(days=self.time_window_days), datetime.min.time())
        articles = self._get_articles_info(query, startdatetime, enddatetime, tone)
        documents = self._get_documents(articles)

        return documents

    async def aget_relevant_documents(self, query: str) -> List[Document]:
        raise NotImplementedError("Async interface to GDELT not implemented")

And run a quick test on it.

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

retriever = GDELTRetriever(max_records=5)

keywords = "Seattle Seahawks"
filters = {
    "date": date(2022, 2, 2),
    "tone": "positive"
}
articles = retriever.get_relevant_documents(
    query=keywords,
    filters=filters
)

for article in articles:
    print(article.page_content[0:300])
    pp.pprint(article.metadata)
    print('----------------------------')

We can now define the LangChain tool, which encapsulates the retriever and the summarizer. 


In [None]:
from __future__ import annotations

from dateutil.parser import parse as parse_date
from langchain.prompts import PromptTemplate
from langchain.prompts.base import BasePromptTemplate
from langchain.chains.llm import LLMChain 
from langchain.tools import BaseTool


class GDELTSearchTool(BaseTool):
    name = "gdelt_search"
    description = """
    Useful for when you need to find news or articles matching a list of terms.
    Input is a JSON object that has date, keywords, and  tone fields. 
    The keywords field is a JSON list of terms to use for search.
    The date field is a date to use for search. The date format is YYYY-MM-DD.
    The tone field can be positive, negative or unknown. If you are not sure what the tone is use unknown.
    """

    failed_to_retrieve_message: str = "Could not retrieve any articles."

    gdelt_retriever: GDELTRetriever
    summarize_chain: LLMChain
    max_words: int = 3000

    def _parse_params(self, json_params_str: str) -> Tuple[str, Dict]:
        """Parse input parameters."""
        query = ""
        filters = {}

        params = json.loads(json_params_str)

        if ('keywords' in params) and (isinstance(params['keywords'], list)):
            query = ' '.join(params['keywords'])

        if 'date' in params:
            filters['date'] = parse_date(params['date']).date()

        if 'tone' in params and params['tone'] in ['positive', 'negative']:
            filters['tone'] = params['tone']

        return query, filters

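    def _summarize_articles(self, articles: List[Document]) -> List[str]:
        """Summarize each article with the LLM chain."""
        summaries = []
        for article in articles:
            # Note: the slice truncates by characters, not words, despite the name max_words.
            summary = self.summarize_chain.run(article.page_content[0:self.max_words])
            summaries.append(summary)
        return summaries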


    def _prepare_output(self, articles: List[Document], summaries: List[str]) -> str:
        """Prepare output returned by the tool."""
        
        response = ""
        for document, summary in zip(articles, summaries):
            record = f"""
            -------------------------------------------------------
            Article title: {document.metadata['title']}
            Article summary: {summary}
            """
            response += record

        return response   
    
    def _run(self, json_params_str: str):
        query, filters = self._parse_params(json_params_str)

        if not query:
            return self.failed_to_retrieve_message

        articles = self.gdelt_retriever.get_relevant_documents(
            query=query,
            filters=filters
        )
  
        if not articles:
            return self.failed_to_retrieve_message
        
        summaries = self._summarize_articles(articles)
        response = self._prepare_output(articles, summaries)

        return response

    def _arun(self, json_params: str):
        raise NotImplementedError("This tool does not support async")

Let's create an instance of the tool and run a quick test.

First you need to create a summarizing LLMChain.

In [None]:
from langchain.llms import VertexAI

llm = VertexAI(
    model_name='text-bison@001',
    max_output_tokens=1024,
    temperature=0.0,
    top_p=0.8,
    top_k=40,
    verbose=True,
)

prompt_template = """Write a brief summary of the following article:

Article:
{article}

Summary:
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["article"])
llm_chain = LLMChain(llm=llm, prompt=prompt)

Next, create the instance of the GDELT retriever.

In [None]:
retriever = GDELTRetriever(max_records=5)

Use the retriever and the summarizing chain to create an instance of the tool.

In [None]:
gdelt_tool = GDELTSearchTool(summarize_chain=llm_chain, gdelt_retriever=retriever)

To run the test, let's search for news about the Seattle Seahawks published around February 3, 2023, with a positive tone.

In [None]:
print(gdelt_tool._run('{"keywords": ["Seattle", "Seahawks"], "date": "2023-02-03", "tone": "positive"}'))

The tool returned a single piece of text that compiles article titles and article summaries for all matched articles.

We now have all the components required to experiment with LangChain agents.

## Experiment with LangChain agents 

Let's start with a `zero-shot-react-description` agent.

A `zero-shot-react-description` agent acts only on the current request and has no memory. It is based on the ReAct framework and uses ReAct-style prompts when communicating with an LLM. It decides which tool to use solely from the tools' descriptions, so writing clear, accurate tool descriptions is critical to the agent's performance.
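
As a rough illustration of the ReAct loop, the sketch below parses a ReAct-style completion and dispatches it to a tool by name. The regular expression and the `parse_action` and `dispatch` helpers are hypothetical simplifications for this notebook, not LangChain's own output parser.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Matches an "Action:" line followed by an "Action Input:" line.
ACTION_PATTERN = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)


def parse_action(llm_output: str) -> Optional[Tuple[str, str]]:
    """Extract (tool_name, tool_input) from a ReAct-style completion, if present."""
    match = ACTION_PATTERN.search(llm_output)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def dispatch(llm_output: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """Run the requested tool and return its observation."""
    parsed = parse_action(llm_output)
    if parsed is None:
        return "No action requested."
    name, tool_input = parsed
    if name not in tools:
        return f"Unknown tool: {name}"
    return tools[name](tool_input)


completion = (
    "Thought: I need the top search terms for that date.\n"
    "Action: query_google_trends\n"
    'Action Input: {"date": "2023-11-13"}'
)
# Stub tool that echoes its input instead of querying BigQuery.
stub_tools = {"query_google_trends": lambda tool_input: f"called with {tool_input}"}
print(dispatch(completion, stub_tools))
```

In a full loop, the observation would be appended to the prompt and the model called again until it produces a final answer.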

Note:

Large language models (LLMs) are not aware of the current date. As a result, if you use relative time references (e.g., yesterday, two days from now) in your interactions, the model may hallucinate the current date. This can be addressed by configuring the agent to use yet another external tool, but for the purposes of this notebook, we will use absolute date references in the examples.
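
One possible shape for such a tool is sketched below. The function name, tool name, and description are hypothetical; the dictionary only mimics the metadata an agent reads when choosing a tool. To use it with an agent, the function would need to be wrapped in a LangChain tool, for example with the `Tool` class and its `name`, `func`, and `description` arguments, which is left out here so the sketch runs without LangChain.

```python
from datetime import date


def get_current_date(_: str = "") -> str:
    """Return today's date as YYYY-MM-DD; the input argument is ignored."""
    return date.today().isoformat()


# Hypothetical tool metadata; the description is what guides tool selection.
current_date_tool = {
    "name": "current_date",
    "description": "Useful for when you need to know today's date. Input is ignored.",
    "func": get_current_date,
}

print(current_date_tool["func"](""))
```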


In [None]:
from langchain.agents import initialize_agent
from langchain.agents import AgentType

tools = [google_trends_tool, gdelt_tool]
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True, intermediate_results=True)

Let's display the default prompt that is used by the agent.

In [None]:
print(agent.agent.llm_chain.prompt.template)

The prompt, as you can see, merges the tool descriptions from the tools we defined earlier in the notebook with the ReAct style instructions.
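
The merging step can be approximated in a few lines. This is a simplified sketch loosely modeled on the printed template, with a hypothetical `build_react_prompt` helper and abbreviated instructions; the exact wording LangChain uses is the one printed above.

```python
from typing import Dict


def build_react_prompt(tool_descriptions: Dict[str, str], question: str) -> str:
    """Render tool descriptions and ReAct-style instructions into one prompt."""
    tool_lines = "\n".join(f"{name}: {desc}" for name, desc in tool_descriptions.items())
    tool_names = ", ".join(tool_descriptions)
    return "\n".join([
        "Answer the following question as best you can. You have access to these tools:",
        "",
        tool_lines,
        "",
        "Use the following format:",
        "Thought: you should always think about what to do",
        f"Action: the action to take, should be one of [{tool_names}]",
        "Action Input: the input to the action",
        "Observation: the result of the action",
        "Final Answer: the final answer to the original question",
        "",
        f"Question: {question}",
        "Thought:",
    ])


prompt_text = build_react_prompt(
    {
        "query_google_trends": "Useful for when you need to find top search terms on a given date.",
        "gdelt_search": "Useful for when you need to find news or articles matching a list of terms.",
    },
    "What were the top search terms on 2023-11-13?",
)
print(prompt_text)
```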

Let's now run a few queries.

In [None]:
# Google Trends dataset in BigQuery only stores data from the past month.
# Change to a valid date.
agent("What were the top trending news on November 13th 2023?")

The agent utilized the Google Trends tool to obtain the response. Note how the agent properly formatted the input to the tool as a JSON object, in accordance with the tool's instructions.

Let's try another question.

In [None]:
agent("Get and summarize news articles referring to Seattle Seahawks published on 06-13-2023?")

The agent was able to answer this question using the GDELT tool. Once more, notice the correctly formatted JSON input. The agent used the GDELT tool to get summaries of each document and then summarized the articles using the LLM.

Let's try a more complex question now.


In [None]:
agent("Find the news articles for the top trending topics on November 13th, 2023.")

The agent used both the Google Trends and the GDELT tool to answer this question.

Now, let's try a composite request combining a question and an imperative instruction.

In [None]:
agent("What were the top trending search terms on November 13th, 2023? Find and summarize the news articles for them.")

### Conversational Agent

Another type of LangChain agent is the `conversational-react-description` agent. This agent has a memory that allows it to keep track of context across multiple questions.
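
Conceptually, a buffer memory stores earlier turns and replays them in each new prompt. The sketch below is a simplified, hypothetical stand-in (`SketchBufferMemory` and `build_prompt` are not LangChain APIs), and the stored answer is a placeholder, not real model output.

```python
from typing import List, Tuple


class SketchBufferMemory:
    """Hypothetical minimal chat memory: stores turns and renders them as text."""

    def __init__(self) -> None:
        self.turns: List[Tuple[str, str]] = []

    def save_turn(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    def render(self) -> str:
        return "\n".join(f"Human: {human}\nAI: {ai}" for human, ai in self.turns)


def build_prompt(memory: SketchBufferMemory, new_input: str) -> str:
    """Prepend the stored history so the model can resolve references like 'them'."""
    return f"Previous conversation history:\n{memory.render()}\n\nNew input: {new_input}"


sketch_memory = SketchBufferMemory()
sketch_memory.save_turn("What were the top trending keywords?",
                        "<placeholder answer listing keywords>")
followup_prompt = build_prompt(sketch_memory, "Can you summarize the news articles for them?")
print(followup_prompt)
```

Because the history travels with every request, a follow-up question can refer to earlier answers without restating them.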



In [None]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history")
agent = initialize_agent(tools, llm, agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION, verbose=True, memory=memory)

Let's display the default prompt used by the agent.

In [None]:
print(agent.agent.llm_chain.prompt.template) # type: ignore

Let's fix the default prompt to refer to a Google model.

In [None]:
PREFIX = """Assistant is a large language model trained by Google.

Assistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.

Assistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, Assistant is able to generate its own text based on the input it receives, allowing it to engage in discussions and provide explanations and descriptions on a wide range of topics.

Overall, Assistant is a powerful tool that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics. Whether you need help with a specific question or just want to have a conversation about a particular topic, Assistant is here to assist.

TOOLS:
------

Assistant has access to the following tools:"""
FORMAT_INSTRUCTIONS = """To use a tool, please use the following format:

```
Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
```

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

```
Thought: Do I need to use a tool? No
{ai_prefix}: [your response here]
```"""

SUFFIX = """Begin!

Previous conversation history:
{chat_history}

New input: {input}
{agent_scratchpad}"""

new_prompt = agent.agent.create_prompt(
    tools=tools,
    prefix=PREFIX,
    suffix=SUFFIX,
    format_instructions=FORMAT_INSTRUCTIONS,
)

agent.agent.llm_chain.prompt = new_prompt

print(agent.agent.llm_chain.prompt.template)

Let's start with the question about the trending keywords.

In [None]:
agent("What were the top trending keywords on November 13th, 2023?")

We will now ask a follow-up question that assumes that the agent has already answered the previous question.


In [None]:
agent("Can you summarize the news articles for them?")

## Review

This notebook showed how to use Vertex AI PaLM text models with LangChain ReAct agents and custom tools. You learned how to:
- Create LangChain components like Tools and Retrievers
- Configure LangChain zero-shot and conversational agents
- Modify the default prompts used by the agents
- Interact with agents using both simple and complex queries
