# Notebook 2: Using Web Search with RAG

This notebook shows how to use web search as a retriever for RAG, using the Google Search API and DuckDuckGo. The LLM is not used in this notebook; you can see the full process in a later notebook.

We will consider:
* Methods using Google Search
* Methods using DuckDuckGo

Note: The Google Custom Search Engine (CSE) used in this notebook is restricted to usatoday.com articles. You can set up your own CSE, configured however you wish, using the links provided below.

# Import libraries and set the company name and query

In [58]:
import pandas as pd
import os
import re
import numpy as np
from datetime import datetime
import requests

# To use with the router
from sklearn.metrics.pairwise import cosine_similarity

company_name="Tesla"

In [5]:
query = "What is the latest news about Elon Musk's pay package? site:www.usatoday.com"

# Google Search

In this section, we provide two approaches to using Google Search. Please note that both require an [API key](https://developers.google.com/custom-search/v1/overview) and a [Custom Search Engine ID](https://programmablesearchengine.google.com/).

* Method 1: Use [Langchain's GoogleSearchAPIWrapper](https://python.langchain.com/docs/integrations/tools/google_search)
* Method 2: Use [Google's API](https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list) directly

More comprehensive scraping of search results is also possible using [Serper](https://serpapi.com/) and the corresponding integration from [Langchain](https://python.langchain.com/docs/integrations/tools/google_serper), but this notebook keeps to high-level results.

In [6]:
# Using API KEY names used by Langchain's GoogleSearchAPIWrapper for simplicity.
# Indexing os.environ directly raises a KeyError naming the missing variable if a key is not set.
os.environ["GOOGLE_CSE_ID"] = os.environ["GOOGLE_SEARCH_USAT_ID"]
os.environ["GOOGLE_API_KEY"] = os.environ["GOOGLE_SEARCH_API_KEY"]

### Method 1: [Langchain's GoogleSearchAPIWrapper](https://python.langchain.com/docs/integrations/tools/google_search)

In [8]:
from langchain.tools import Tool
from langchain_community.utilities import GoogleSearchAPIWrapper

search = GoogleSearchAPIWrapper()

tool = Tool(
    name="google_search",
    description="Search Google for recent results.",
    func=search.run,
)

In [9]:
results = tool.run(query)
results

'Jan 31, 2024 ... They must negotiate a new pay package that has the approval of shareholders and of Musk, who recently demanded an increase in his ownership\xa0... Jan 30, 2024 ... Elon Musk\'s $55 billion pay package at Tesla was struck down by a Delaware judge after a shareholder challenged it as excessive. Feb 12, 2024 ... 30, Delaware Chancellor Kathaleen St. Jude McCormick invalidated the pay package that Tesla established for Musk in 2018, ruling that the\xa0... Nov 15, 2022 ... Elon Musk pay package at Tesla challenged in court. By RANDALL CHASE AP ... new compensation plan to help finance his dream of colonizing Mars. Feb 2, 2024 ... ... Elon Musk\'s hefty pay package earlier this week. Musk said the Austin-based company will hold a shareholder vote to determine if Tesla will\xa0... Nov 16, 2022 ... In the past three weeks, Musk has laid off thousands of workers at Twitter, abruptly ended remote work and fired employees who criticized\xa0... Jul 1, 2023 ... New and unverified 

In [None]:
def google_string_to_list(results):
    # Split the results into snippets at each date, such as "Feb 12, 2024"
    date_pattern = r"(\w{3} \d{1,2}, \d{4})"
    parts = re.split(date_pattern, results)

    results_list = []
    # Any text before the first date is kept as its own entry
    if parts[0].strip():
        results_list.append(parts[0].strip("... "))
    # After the leading text, parts alternates: date, snippet, date, snippet, ...
    for date, snippet in zip(parts[1::2], parts[2::2]):
        results_list.append(date + ":" + snippet)
    return results_list

In [49]:
results_list = google_string_to_list(results)
results_list


['Jan 31, 2024: ... They must negotiate a new pay package that has the approval of shareholders and of Musk, who recently demanded an increase in his ownership\xa0... ',
 "Jan 30, 2024: ... Elon Musk's $55 billion pay package at Tesla was struck down by a Delaware judge after a shareholder challenged it as excessive. ",
 'Feb 12, 2024: ... 30, Delaware Chancellor Kathaleen St. Jude McCormick invalidated the pay package that Tesla established for Musk in 2018, ruling that the\xa0... ',
 'Nov 15, 2022: ... Elon Musk pay package at Tesla challenged in court. By RANDALL CHASE AP ... new compensation plan to help finance his dream of colonizing Mars. ',
 "Feb 2, 2024: ... ... Elon Musk's hefty pay package earlier this week. Musk said the Austin-based company will hold a shareholder vote to determine if Tesla will\xa0... ",
 'Nov 16, 2022: ... In the past three weeks, Musk has laid off thousands of workers at Twitter, abruptly ended remote work and fired employees who criticized\xa0... ',
 '
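Because each entry begins with a date, the snippets can be ordered by recency before they are used as context. The following is a minimal sketch, not part of the original notebook: `sort_snippets_by_date` is a hypothetical helper, it assumes entries in the `Mon D, YYYY: text` shape produced above, and it assumes an English locale for the month abbreviations. The sample strings are shortened versions of the output shown above.

```python
from datetime import datetime

def sort_snippets_by_date(snippets):
    """Sort 'Mon D, YYYY: text' strings newest first; undated entries go last."""
    dated, undated = [], []
    for snippet in snippets:
        # Everything before the first colon should be the date
        date_part, _, _ = snippet.partition(":")
        try:
            parsed = datetime.strptime(date_part.strip(), "%b %d, %Y")
        except ValueError:
            # No leading date could be parsed, so keep the entry but rank it last
            undated.append(snippet)
            continue
        dated.append((parsed, snippet))
    dated.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in dated] + undated

sample = [
    "Jan 31, 2024: ... They must negotiate a new pay package ...",
    "Feb 12, 2024: ... 30, Delaware Chancellor Kathaleen St. Jude McCormick ...",
    "Nov 15, 2022: ... Elon Musk pay package at Tesla challenged in court.",
    "text without a leading date",
]
sorted_sample = sort_snippets_by_date(sample)
for snippet in sorted_sample:
    print(snippet)
```

Sorting here only reorders the strings; the snippets themselves are left unchanged.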

### Method 2: Use [Google's API](https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list) directly

In [83]:
def get_search_results_list(query, num_results=10):
    """
    Get search results from the Google Custom Search API.
    Returns a list of response dictionaries, one per page of up to 10 results.
    """
    url = "https://www.googleapis.com/customsearch/v1"
    data_list = []
    # `start` is the 1-based index of the first result on each page
    for page_start in range(1, num_results + 1, 10):
        params = {
            "key": os.environ["GOOGLE_API_KEY"],
            "cx": os.environ["GOOGLE_CSE_ID"],
            "q": query,
            "start": page_start,
        }
        # Passing params lets requests URL-encode the query text
        data = requests.get(url, params=params).json()
        data_list.append(data)

    return data_list

def format_search_results(data_list):
    """
    Returns formatted search results to use as context for LLM.

    Parameters:
    - data_list: Response dictionary, or list of response dictionaries,
      each containing search result items under the "items" key.
    """
    # Accept a single response dictionary as well as a list of them
    if isinstance(data_list, dict):
        data_list = [data_list]

    results = ""
    for data in data_list:
        search_items = data.get("items", [])
        for search_item in search_items:
            # Initialize default values for optional fields
            title = search_item.get("title", "N/A")
            snippet = search_item.get("snippet", "N/A")
            link = search_item.get("link", "N/A")

            # Attempt to extract the long description, handling missing data gracefully
            long_description = "N/A"
            try:
                long_description = search_item["pagemap"]["metatags"][0].get("og:description", "N/A")
            except (KeyError, IndexError):
                pass  # If the path to the long description does not exist, keep it as "N/A"
            results += f"Title: {title}\nDescription: {snippet}\nLong description: {long_description}\nURL: {link}\n\n"
    # Drop the trailing blank line
    return results[:-2]

In [85]:
results_list = get_search_results_list(query, num_results=10)
print(format_search_results(results_list))

Title: Elon Musk's $55 billion pay package voided. What will Tesla do next?
Description: Jan 31, 2024 ... They must negotiate a new pay package that has the approval of shareholders and of Musk, who recently demanded an increase in his ownership ...
Long description: Why a Delaware judge canceled the $55.8 billion Tesla pay package that helped make Elon Musk the world’s richest person.
URL: https://www.usatoday.com/story/money/2024/01/31/why-musk-tesla-pay-package-struck-down/72429357007/

Title: Judge Voids Elon Musk's $55 Billion Tesla Pay Package
Description: Jan 30, 2024 ... Elon Musk's $55 billion pay package at Tesla was struck down by a Delaware judge after a shareholder challenged it as excessive.
Long description: Elon Musk's $55 billion pay package at Tesla was struck down by a Delaware judge after a shareholder challenged it as excessive. Ed Ludlow reports.
URL: https://www.usatoday.com/videos/news/2024/01/30/judge-voids-elon-musks-55-billion-tesla-pay-package/72413711007/
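The paging loop in `get_search_results_list` always requests whole pages, so asking for 15 results can return up to 20. The sketch below shows one way to plan the requests so that the last page is trimmed. It is an illustration under stated assumptions: `plan_pages` is a hypothetical helper, and the limits of 10 results per request (the `num` parameter) and 100 results per query reflect my reading of the Custom Search API reference and should be checked there before relying on them.

```python
def plan_pages(num_results, page_size=10, max_total=100):
    """Return (start, num) pairs covering the first num_results results.

    page_size and max_total are assumed API limits, not values from the notebook.
    """
    if num_results < 1:
        return []
    capped = min(num_results, max_total)
    pages = []
    for start in range(1, capped + 1, page_size):
        # The last page may need fewer than page_size results
        num = min(page_size, capped - start + 1)
        pages.append((start, num))
    return pages

print(plan_pages(25))
```

Each pair could then be sent as the `start` and `num` request parameters in place of the fixed `start` used above.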



# DuckDuckGo

Depending on what you are looking for, you may want to use different calls from the [DuckDuckGo python package](https://pypi.org/project/duckduckgo-search/#1-text---text-search-by-duckduckgocom). Here are two of its API routes:

* Method 1: Answers API route
* Method 2: Text API route

In [87]:
from duckduckgo_search import DDGS
print(query)

What is the latest news about Elon Musk's pay package? site:www.usatoday.com


In [4]:
answers = []

with DDGS() as ddgs:
    for r in ddgs.answers(company_name):
        answers.append(r)





In [5]:
answers

[{'icon': None,
  'text': 'Tesla, Inc. Tesla, Inc. is an American multinational automotive and clean energy company headquartered in...',
  'topic': None,
  'url': 'https://duckduckgo.com/Tesla%2C_Inc.'},
 {'icon': 'https://duckduckgo.com/i/a9c448ae.jpeg',
  'text': 'Nikola Tesla A Serbian-American inventor, electrical engineer, mechanical engineer, and futurist.',
  'topic': None,
  'url': 'https://duckduckgo.com/Nikola_Tesla'},
 {'icon': 'https://duckduckgo.com/i/de3f3ee7.jpg',
  'text': 'Tesla (unit) The unit of magnetic flux density in the International System of Units.',
  'topic': None,
  'url': 'https://duckduckgo.com/Tesla_(unit)'},
 {'icon': None,
  'text': 'Tesla a.s. TESLA a.s. is a Czech manufacturer and supplier of special radio communication and security...',
  'topic': 'Companies and organizations',
  'url': 'https://duckduckgo.com/Tesla_a.s.'},
 {'icon': None,
  'text': 'Tesla Electric Light and Manufacturing An electric lighting company in Rahway, New Jersey that opera

In [89]:
results = []
with DDGS() as ddgs:
    for r in ddgs.text(query, region='wt-wt', safesearch='moderate', timelimit='y', max_results=10):
        results.append(r)

results

[{'title': "Elon Musk's $55 billion pay package voided. What will Tesla do next?",
  'href': 'https://www.usatoday.com/story/money/2024/01/31/why-musk-tesla-pay-package-struck-down/72429357007/',
  'body': 'A Delaware judge this week voided the $55.8 billion Tesla pay package that helped make Musk the world\'s richest person, calling it an "unfathomable sum" that was unfair to shareholders ...'},
 {'title': 'Elon Musk says Tesla shareholders to vote on incorporating in Texas',
  'href': 'https://www.usatoday.com/story/money/2024/02/01/elon-musk-tesla-delaware-texas/72440851007/',
  'body': "Tuesday evening, a Delaware judge Tuesday overturned Elon Musk's $55.8 billion Tesla pay package, siding with a shareholder who argued the company breached its fiduciary duties by awarding Musk ..."},
 {'title': "Judge Voids Elon Musk's $55 Billion Tesla Pay Package - USA TODAY",
  'href': 'https://www.usatoday.com/videos/news/2024/01/30/judge-voids-elon-musks-55-billion-tesla-pay-package/7241371100

In [90]:
results[0]

{'title': "Elon Musk's $55 billion pay package voided. What will Tesla do next?",
 'href': 'https://www.usatoday.com/story/money/2024/01/31/why-musk-tesla-pay-package-struck-down/72429357007/',
 'body': 'A Delaware judge this week voided the $55.8 billion Tesla pay package that helped make Musk the world\'s richest person, calling it an "unfathomable sum" that was unfair to shareholders ...'}
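To use these results as LLM context in the same way as the Google results, each dictionary can be flattened into a text block. This is a minimal sketch rather than part of the original notebook: `format_ddg_results` is a hypothetical helper that assumes the `title`, `href`, and `body` keys seen in the output above. The second sample entry deliberately omits `body` to show the fallback value.

```python
def format_ddg_results(results):
    """Format DuckDuckGo text results as a context string for an LLM."""
    blocks = []
    for item in results:
        # Fall back to "N/A" when a field is missing
        title = item.get("title", "N/A")
        body = item.get("body", "N/A")
        href = item.get("href", "N/A")
        blocks.append(f"Title: {title}\nDescription: {body}\nURL: {href}")
    # Separate results with a blank line, with no trailing separator
    return "\n\n".join(blocks)

sample_results = [
    {
        "title": "Elon Musk's $55 billion pay package voided. What will Tesla do next?",
        "href": "https://www.usatoday.com/story/money/2024/01/31/why-musk-tesla-pay-package-struck-down/72429357007/",
        "body": "A Delaware judge this week voided the $55.8 billion Tesla pay package that helped make Musk the world's richest person, calling it an \"unfathomable sum\" that was unfair to shareholders ...",
    },
    {
        # "body" is left out on purpose to demonstrate the default
        "title": "Elon Musk says Tesla shareholders to vote on incorporating in Texas",
        "href": "https://www.usatoday.com/story/money/2024/02/01/elon-musk-tesla-delaware-texas/72440851007/",
    },
]
context = format_ddg_results(sample_results)
print(context)
```

Keeping the field labels the same as in `format_search_results` means that a downstream prompt does not need to know which search engine supplied the context.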