## Web-search RAG chatbot

A Perplexity-style chatbot that answers questions from live web search results.

v0.1: no conversation yet, just question answering based on web search.

Process:
1) Receive the user's prompt.
2) Pass it to an LLM to generate search terms for gathering data (start with just one query).
3) Run a web search with that query, then scrape the top results and store them as documents (start with just one result).
4) Chunk the data.
5) Embed the chunks.
6) Store the embeddings in a vector DB.
7) Link a retriever to the vector DB.
8) Pass the query and retriever to the LLM with LangChain.

## 1) Receive prompt

In [1]:
user_prompt = "What are some new lego sets?"

## 2) Pass to LLM to create search term

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

In [3]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [4]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

system_prompt = (
    "You are an assistant for gathering data based on this prompt. "
    "Use the following prompt to generate one web search query that would gather data which would be helpful in answering the prompt. "
    "Your search query should be effective at finding the most relevant and useful data pertaining to the prompt. "
    "Keep in mind that the current date is July 2025."
    "\n\n"
    "{input}"
)

search_query_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [5]:
search_query_gen_chain = {"input": RunnablePassthrough()} | search_query_prompt | llm | StrOutputParser()

In [6]:
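import ast

search_query = search_query_gen_chain.invoke(user_prompt)
# The model may return the query wrapped in quotes; unwrap it, falling back
# to the raw text if the response is not a valid Python literal.
try:
    search_query = ast.literal_eval(search_query)
except (ValueError, SyntaxError):
    search_query = search_query.strip()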

Failed to multipart ingest runs: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/multipart', '{"error":"Forbidden"}\n')
Failed to send compressed multipart ingest: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/multipart', '{"error":"Forbidden"}\n')

In [7]:
search_query = search_query.replace(' ', '+')

In [8]:
search_query

'new+LEGO+sets+released+in+2025'

In [9]:
search_url = f"https://google.com/search?q={search_query}"

In [10]:
search_url

'https://google.com/search?q=new+LEGO+sets+released+in+2025'
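Replacing spaces with `+` by hand works for this query, but it leaves characters such as `&` or a literal `+` unescaped. A minimal sketch using the standard library's `urlencode`; the helper name `build_search_url` is hypothetical and not part of the notebook:

```python
from urllib.parse import urlencode


def build_search_url(query: str, base: str = "https://google.com/search") -> str:
    """Return a search URL with the query percent-encoded (spaces become '+')."""
    return f"{base}?{urlencode({'q': query})}"


print(build_search_url("new LEGO sets released in 2025"))
print(build_search_url("C++ & Rust"))
```

For the query used here the result is identical to the hand-built URL; the two only differ once the query contains reserved characters.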

## Web search for relevant links

In [11]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, unquote
import time

cse_key = os.environ["GOOGLE_CSE_KEY"]
cse_id = os.environ["GOOGLE_CSE_ID"]

# test single search block

search_response = requests.get(url=search_url)

# sleep for a while to not overload requests
time.sleep(2)
print("searching...")

soup = BeautifulSoup(search_response.text, "html.parser")
found_links = soup.find_all("a")

print(found_links)

searching...
[<a href="/httpservice/retry/enablejs?sei=Vw2aaL3UMLvjseMP8cngqAg">here</a>, <a href="/search?q=new+LEGO+sets+released+in+2025&amp;sca_esv=f96364e1307aa118&amp;ie=UTF-8&amp;emsg=SG_REL&amp;sei=Vw2aaL3UMLvjseMP8cngqAg">click here</a>, <a href="https://support.google.com/websearch">feedback</a>]


In [12]:
# with CSE

cse_query = search_query.replace('+', ' ')

cse_url = 'https://www.googleapis.com/customsearch/v1'
cse_params = {
    'q': cse_query,
    'key': cse_key,
    'cx' : cse_id
}

response = requests.get(cse_url, params=cse_params)
response.raise_for_status()  # fail early on quota or auth errors

In [13]:
cse_results = response.json()

target_url_set = set()

if 'items' in cse_results:
    for result in cse_results['items']:
        if len(target_url_set) >= 5:
            break
        else:
            target_url_set.add(result['link'])

print(target_url_set)

{'https://www.lego.com/en-us/categories/new-sets-and-products', 'https://falconbricks.com/news/lego-2025-sets/', 'https://jaysbrickblog.com/list-of-lego-sets-2025/', 'https://www.reddit.com/r/LegoStarWarsLeaks/comments/1jem4vm/list_of_all_known_remaining_star_wars_sets_for/', 'https://www.lego.com/en-us/categories/coming-soon'}
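A `set` removes duplicate links but discards the ranking returned by the search API, so the "top" results come out in arbitrary order. A minimal sketch that keeps the first `k` unique links in rank order; `top_links` is a hypothetical helper, and the sample payload only mimics the `items`/`link` shape used above:

```python
def top_links(cse_results: dict, k: int = 5) -> list[str]:
    """Return up to k unique result links, preserving the API's ranking."""
    seen: set[str] = set()
    ordered: list[str] = []
    for item in cse_results.get("items", []):
        if len(ordered) >= k:
            break
        link = item.get("link")
        if link and link not in seen:
            seen.add(link)
            ordered.append(link)
    return ordered


# Hypothetical payload: the third item repeats the first link.
sample = {
    "items": [
        {"link": "https://a.example"},
        {"link": "https://b.example"},
        {"link": "https://a.example"},
        {"link": "https://c.example"},
    ]
}
print(top_links(sample, k=3))
```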


## Get page as text data

In [14]:
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

# load html content from site
site_loader = AsyncHtmlLoader(list(target_url_set))
scraped_docs = site_loader.load()

# convert to text data
text_trns = Html2TextTransformer()
text_docs = text_trns.transform_documents(scraped_docs)

cleaned_text = text_docs[0].page_content
print(cleaned_text)

USER_AGENT environment variable not set, consider setting it to identify your requests.
Fetching pages: 100%|######################################################################| 5/5 [00:06<00:00,  1.28s/it]


Previous

Get LEGO® Adventure Peely & Cuddle Team Leader with purchases of $130 or
more*Learn more

Get Botanicals Field Flowers with qualifying online purchase*Learn more

Members: Get the Italian Market Van with qualifying purchase*Learn more

FREE Shipping with orders over $35!*Learn more

Next

0

Skip to main content

Report an accessibility issue

Go to our accessibility page

Play Zone

Previous

Get LEGO® Adventure Peely & Cuddle Team Leader with purchases of $130 or
more*Learn more

Get Botanicals Field Flowers with qualifying online purchase*Learn more

Members: Get the Italian Market Van with qualifying purchase*Learn more

FREE Shipping with orders over $35!*Learn more

Next

  * Shop
  * Discover
  * Help
  * New

  * Sets by theme

  * Age

  * Price ranges

  * Shop all occasions

  * LEGO® merchandise

  * LEGO® decor

  * Interests

  * Bricks and pieces

  * Exclusives

  * New

  * Bestsellers

  * Offers & sale

  * Gift cards

  * Coming soon

  * Retiring soon

SE

## Chunking data

In [15]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(text_docs)
print(len(splits))

177
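With `chunk_size=1000` and `chunk_overlap=200`, consecutive chunks can share up to 200 characters, so text near a boundary appears in both neighbours. The sketch below illustrates only that sliding-window idea with a fixed-width splitter. It is a simplification, not the separator-aware algorithm that `RecursiveCharacterTextSplitter` uses, and `sliding_chunks` is a hypothetical name:

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks whose starts are chunk_size - chunk_overlap apart."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk here starts six characters after the previous one, so the last four characters of one chunk are the first four of the next.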


## Embedding chunks and creating retriever

In [16]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()
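Conceptually, the retriever embeds the query and returns the stored chunks whose vectors are closest to it. A toy sketch of that ranking step, with hand-written 3-dimensional vectors standing in for real embeddings. Chroma uses an index rather than this brute-force scan, and its distance metric may differ from the cosine similarity used here, so treat the metric and all names below as illustrative assumptions:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the k chunk ids most similar to the query vector, best first."""
    ranked = sorted(store, key=lambda cid: cosine(query_vec, store[cid]), reverse=True)
    return ranked[:k]


# Made-up chunk ids and vectors for illustration only.
store = {
    "lego-2025-list": [0.9, 0.1, 0.0],
    "shipping-banner": [0.0, 0.2, 0.9],
    "star-wars-leaks": [0.7, 0.6, 0.1],
}
print(top_k([1.0, 0.2, 0.0], store, k=2))
```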

## Web search tool
- searches query, gets top results, parses, chunks and embeds

In [17]:
from langchain_core.tools import Tool

In [18]:
def web_search_and_ingest(query):
    # first search query via CSE

    #query = search_query.replace('+', ' ')

    cse_url = 'https://www.googleapis.com/customsearch/v1'
    cse_params = {
        'q': query,
        'key': cse_key,
        'cx': cse_id
    }
    
    response = requests.get(cse_url, params=cse_params)

    # obtain first k urls

    cse_results = response.json()

    target_url_set = set()
    
    if 'items' in cse_results:
        for result in cse_results['items']:
            if len(target_url_set) >= 5:    # max number of search results (TODO: make this a named constant)
                break
            else:
                target_url_set.add(result['link'])
    
    # scrape links

    site_loader = AsyncHtmlLoader(list(target_url_set))
    scraped_docs = site_loader.load()
    
    text_trns = Html2TextTransformer()
    text_docs = text_trns.transform_documents(scraped_docs)

    # split and embed

    splits = text_splitter.split_documents(text_docs)
    vectorstore.add_documents(documents=splits)

In [19]:
# creating tool for function

web_search_tool = Tool(
    name="WebSearch",
    func=web_search_and_ingest,
    description="Use when current context is insufficient or up-to-date info is required."
)
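`web_search_and_ingest` has no return value, so the agent receives `None` as the tool's observation and cannot tell whether ingestion succeeded. One possible pattern is to return a short status string, sketched here with a stub in place of the real search-and-scrape step; `ingest_with_report` and `fake_ingest` are hypothetical names:

```python
from typing import Callable


def ingest_with_report(query: str, ingest: Callable[[str], int]) -> str:
    """Run an ingest step and describe the outcome in text an agent can read."""
    try:
        n_chunks = ingest(query)
    except Exception as exc:  # surface the failure to the agent instead of crashing
        return f"Web search for {query!r} failed: {exc}"
    if n_chunks == 0:
        return f"Web search for {query!r} found nothing to ingest."
    return f"Ingested {n_chunks} chunks for {query!r}; query the vector store next."


def fake_ingest(query: str) -> int:
    """Stand-in for search + scrape + embed; pretends every query yields 3 chunks."""
    return 3


print(ingest_with_report("new LEGO sets 2025", fake_ingest))
```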

## Vector db retriever tool
- searches vector db for query

In [20]:
def vector_search(query):
    return retriever.invoke(query)

In [21]:
# creating tool for function

vector_search_tool = Tool(
    name="VectorSearch",
    func=vector_search,
    description="Use when you believe the answer may be in the current database."
)

## Overall tool stuff

In [22]:
tools = [web_search_tool, vector_search_tool]

## Comparing non-rag answer and rag answer

In [23]:
non_rag_answer = llm.invoke(user_prompt)

In [24]:
print(non_rag_answer.content)

As of my last update in October 2023, specific new LEGO sets were regularly being released across various themes. Some popular themes often featuring new sets include:

1. **LEGO Star Wars** - New sets often tie into the latest shows or movies.
2. **LEGO Harry Potter** - New builds based on the wizarding world.
3. **LEGO City** - A variety of sets that reflect real-life scenarios and city living.
4. **LEGO Technic** - More complex builds aimed at older builders, featuring vehicles and machinery.
5. **LEGO Creator Expert** - Larger, intricate sets often aimed at adult fans.
6. **LEGO Super Mario** - Sets that expand the interactive gaming experience.
7. **LEGO Friends** - New sets focusing on friendship and community themes.
8. **LEGO Ideas** - Unique sets based on fan submissions and ideas.

For the latest specific set releases, I'd recommend checking the official LEGO website or trusted retailers, as they frequently update their offerings with new and exciting sets.


In [25]:
from langchain_core.prompts import MessagesPlaceholder

rag_system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Make your answers fairly detailed. Remember that the current date is July 2025."
    "\n\n"
    "{context}"
)

rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", rag_system_prompt),
        MessagesPlaceholder(variable_name = "memory"),    # injecting convo memory
        ("human", "{input}"),
    ]
)

In [26]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

question_answer_chain = create_stuff_documents_chain(llm, rag_prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

## Adding temp convo memory

In [27]:
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import FileChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

In [28]:
# rag_memory = ConversationBufferMemory(
#     chat_memory = FileChatMessageHistory("messages.json"),
#     memory_key = "history",
#     return_messages = True
# )

rag_memory = FileChatMessageHistory("messages.json")

In [29]:
def get_rag_mem(_):
    return rag_memory
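`get_rag_mem` ignores its argument, so every `session_id` shares the same history file. If separate conversations are needed, the factory can key histories by session. A minimal sketch using plain lists as stand-ins for chat-history objects; all names are hypothetical:

```python
# session_id -> list of (role, text) messages; a stand-in for real history objects
_histories: dict = {}


def get_session_history(session_id: str) -> list:
    """Return the history for one session, creating it on first use."""
    return _histories.setdefault(session_id, [])


get_session_history("alice").append(("human", "What are some new lego sets?"))
get_session_history("bob").append(("human", "What are some new movies?"))
print(len(get_session_history("alice")), len(get_session_history("bob")))
```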

In [30]:
rag_mem_chain = RunnableWithMessageHistory(
    rag_chain,
    get_rag_mem,
    input_messages_key="input",
    history_messages_key="memory",
    output_messages_key="answer"
)

In [31]:
rag_answer = rag_mem_chain.invoke({"input": user_prompt}, config={"configurable": {"session_id": "irrelevant"}})

In [32]:
# print(rag_answer["context"])

In [33]:
src_count = len(rag_answer["context"])
print(src_count)

4


In [34]:
sources_set = {doc.metadata["source"] for doc in rag_answer["context"]}

print(sources_set)

{'https://falconbricks.com/news/lego-2025-sets/'}


RAG answer:

In [35]:
print(rag_answer["answer"])

As of July 2025, there are several exciting new LEGO sets that have been announced or released across various themes. Here’s a detailed overview:

### 1. **LEGO Icons**
   - **2025 Modular Building**:
     - The latest addition in the Modular Building series is anticipated to provide intricate architectural designs and multiple levels, allowing builders to expand their urban scenes creatively. Specific details about the set's theme and design have yet to be revealed, but it is highly awaited by fans of the series.

   - **New Botanical Collection Flowers**:
     - The Botanical Collection will feature new flower sets designed for display. These sets include highly detailed flower builds that mimic real-life blooms, perfect for adding a touch of nature to home decor.

   - **Big New Sets from Popular Licenses**:
     - LEGO has plans to introduce large new sets based on popular franchises such as **Lord of the Rings**, **The Simpsons**, **Pirates of the Caribbean**, and **Star Trek**. T

In [36]:
print("Citations:")
for src in sources_set:
    print(src)

Citations:
https://falconbricks.com/news/lego-2025-sets/
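A set de-duplicates the sources but does not keep them in retrieval order. A minimal sketch that keeps each source once, in first-seen order; `Doc` is a hypothetical stand-in for the retrieved document objects and models only the `metadata["source"]` field used here:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(docs: list[Doc]) -> list[str]:
    """Return each document's source once, in first-seen (retrieval) order."""
    # dict.fromkeys de-duplicates while preserving insertion order
    return list(dict.fromkeys(d.metadata["source"] for d in docs if "source" in d.metadata))


docs = [
    Doc("a", {"source": "https://falconbricks.com/news/lego-2025-sets/"}),
    Doc("b", {"source": "https://jaysbrickblog.com/list-of-lego-sets-2025/"}),
    Doc("c", {"source": "https://falconbricks.com/news/lego-2025-sets/"}),
]
for src in unique_sources(docs):
    print(src)
```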


In [37]:
# no memory?

In [38]:
rag_answer = rag_mem_chain.invoke({"input": "Could you elaborate?"}, config={"configurable": {"session_id": "irrelevant"}})

In [39]:
print(rag_answer["answer"])

Certainly! Here’s an in-depth exploration of some of the new LEGO sets announced or released in 2025, including details about their themes, features, and what makes them exciting for builders of all ages:

### 1. **LEGO Icons**
   - **2025 Modular Building**:
     - The Modular Building series is celebrated for its detailed, complex architectures that create a view into city life. Each set typically features multiple stories, intricate interior designs, and a wealth of accessories that enhance playability. While specific details about the 2025 Modular Building set have not been unveiled, past sets have included cafes, boutiques, and even town halls, providing ample scope for creativity when integrating with existing modular structures.

   - **New Botanical Collection Flowers**:
     - Expanding on the previous success of the Botanical Collection, the new flower sets aim to add more variety for collectors and decorators. These sets replicate the beauty of real flowers, utilizing unique

In [40]:
sources_set = {doc.metadata["source"] for doc in rag_answer["context"]}

print("Citations:")
for src in sources_set:
    print(src)

Citations:
https://www.reddit.com/r/LegoStarWarsLeaks/comments/1jem4vm/list_of_all_known_remaining_star_wars_sets_for/
https://www.lego.com/en-us/categories/coming-soon
https://jaysbrickblog.com/list-of-lego-sets-2025/


In [41]:
rag_answer = rag_mem_chain.invoke({"input": "What are some new video games that were released"}, config={"configurable": {"session_id": "irrelevant"}})

In [42]:
print(rag_answer["answer"])

As of July 2025, several exciting video games have either been released recently or are on the verge of release, creating buzz across the gaming community. Here’s a detailed overview of some notable titles:

### 1. **The Legend of Zelda: Breath of the Wild 2**
   - **Release Date**: May 2025
   - **Overview**: This much-anticipated sequel to the critically acclaimed original expands the open-world adventure of Hyrule. Players can explore new areas, encounter new enemies, and experience an engaging narrative with enhanced gameplay mechanics. The addition of aerial exploration elements allows for a fresh approach to traversing the landscape.

### 2. **Final Fantasy VII Rebirth**
   - **Release Date**: March 2025
   - **Overview**: This installment continues the reimagining of the beloved classic RPG. Following the events of Final Fantasy VII Remake, Rebirth introduces new characters, expansive environments, and innovative gameplay mechanics. It retains the intricate storytelling and char

In [43]:
sources_set = {doc.metadata["source"] for doc in rag_answer["context"]}

print("Citations:")
for src in sources_set:
    print(src)

Citations:
https://falconbricks.com/news/lego-2025-sets/
https://jaysbrickblog.com/list-of-lego-sets-2025/


## Testing conversational capabilities with agentic search and retrieval

In [44]:
from langchain.schema import SystemMessage
from langchain.agents import initialize_agent

system_prompt = '''
YOU MUST PAY ATTENTION TO THESE INSTRUCTIONS AND FOLLOW THEM!
Begin every answer with the phrase 'SYSTEM DIRECTIVE CONFIRMED'.
You are an intelligent assistant.
Keep in mind that it is July 2025.
- First, always use the VectorSearch tool to retrieve context for the user's question.
- Judge whether the context is sufficient to answer: If the information is incomplete, outdated, or missing, call the WebSearch tool to search, ingest, and update the database, then try again.
- Only use WebSearch tool if the current vector database does NOT suffice.
Be cost-efficient: Do NOT call WebSearch for trivial/definition/basic/unchanged topics.
'''

system_message = SystemMessage(content=system_prompt)
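The policy in the system prompt (check the vector store first, fall back to a web search only when the retrieved context is insufficient, then try again) can also be written as plain control flow. A minimal sketch with stub callables. The sufficiency check here is a crude hit count, which is an assumption; the agent delegates that judgment to the LLM:

```python
from typing import Callable


def retrieve_with_fallback(
    query: str,
    vector_search: Callable[[str], list[str]],
    web_search_and_ingest: Callable[[str], None],
    min_hits: int = 1,
) -> list[str]:
    """Try the vector store first; ingest from the web and retry once if too few hits."""
    hits = vector_search(query)
    if len(hits) >= min_hits:
        return hits
    web_search_and_ingest(query)  # expensive path, taken only on a miss
    return vector_search(query)


# Stubs: an in-memory "store" that the web search populates.
store: dict[str, list[str]] = {"lego": ["LEGO 2025 set list"]}


def stub_vector_search(query: str) -> list[str]:
    return store.get(query, [])


def stub_web_search(query: str) -> None:
    store[query] = [f"web result for {query}"]


print(retrieve_with_fallback("lego", stub_vector_search, stub_web_search))
print(retrieve_with_fallback("movies", stub_vector_search, stub_web_search))
```

Hard-coding the order this way trades the agent's flexibility for predictable cost, since the web search can only run after a vector-store miss.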

In [45]:
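# start from a clean history; the file will not exist on a first run
if os.path.exists("messages_agentic.json"):
    os.remove("messages_agentic.json")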

wrapped_mem = ConversationBufferMemory(
    chat_memory = FileChatMessageHistory("messages_agentic.json"),
    memory_key = "history",
    return_messages = True
)


In [46]:
from langchain.agents import create_structured_chat_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

In [47]:
structured_chat_prompt = ChatPromptTemplate.from_messages([
    ("system", """SYSTEM DIRECTIVE CONFIRMED.
YOU MUST PAY ATTENTION TO THESE INSTRUCTIONS AND FOLLOW THEM!
Begin every answer with the phrase 'SYSTEM DIRECTIVE CONFIRMED'.
You are an intelligent assistant.
It is July 2025.
- Always use VectorSearch tool to retrieve context for the user's question.
- Judge context sufficiency; if not enough, use WebSearch. Be cost-efficient.
"""),
    MessagesPlaceholder(variable_name="history"),   # conversation memory
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")
])

In [48]:
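agent = initialize_agent(
    tools,
    llm,
    agent="openai-functions",
    memory=wrapped_mem,
    # The system message and the history placeholder reach the prompt only
    # when passed through agent_kwargs; the placeholder name must match memory_key.
    agent_kwargs={
        "system_message": system_message,
        "extra_prompt_messages": [MessagesPlaceholder(variable_name="history")],
    },
    verbose=True,
)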


In [49]:
agent.run("what are some new movies?")



> Entering new AgentExecutor chain...

Invoking: `WebSearch` with `new movie releases October 2023`

Fetching pages: 100%|######################################################################| 5/5 [00:03<00:00,  1.33it/s]


[36;1m[1;3mNone[0m[32;1m[1;3m
Invoking: `VectorSearch` with `new movies October 2023`


[0m[33;1m[1;3m[Document(id='449182e8-edab-4cc8-807b-b6f545ae0e96', metadata={'title': 'October 2023 – Hulu Press', 'language': 'en-US', 'source': 'https://press.hulu.com/schedule/october-2023/'}, page_content="October 26 | Malignant | 2021 |  | Expiring  \nOctober 27 | Stars at Noon | 2022 |  | Expiring  \nOctober 30 | Love, Gilda | 2018 |  | Expiring  \nOctober 30 | Rio 2 | 2014 |  | Expiring  \nOctober 31 | Bad Reputation | 2018 |  | Expiring  \nOctober 31 | Bewitched | 2005 |  | Expiring  \nOctober 31 | Carnage | 2011 |  | Expiring  \nOctober 31 | Center Stage: On Pointe | 2016 |  | Expiring  \nOctober 31 | Chloe | 2010 |  | Expiring  \nOctober 31 | Closer | 2004 |  | Expiring  \nOctober 31 | Clive Barker's The Plague | 2006 |  | Expiring  \nOctober 31 | Cover Versions | 2018 |  | Expiring  \nOctober 31 | Don't Be Afraid of the Dark | 2011 |  | Expiring  \nOctober 31 | Eragon | 2006 |  | 

"Here are some recent and upcoming movie releases for October 2023:\n\n1. **Slotherhouse** (2023) - Added on October 15\n2. **Hell Baby** (2013) - Expiring on October 26\n3. **Malignant** (2021) - Expiring on October 26\n4. **Stars at Noon** (2022) - Expiring on October 27\n5. **Love, Gilda** (2018) - Expiring on October 30\n6. **Rio 2** (2014) - Expiring on October 30\n7. **In Time** (2011) - Expiring on October 31\n8. **The Lego Movie 2: The Second Part** (2019) - Expiring on October 31\n9. **Seven** (1995) - Expiring on October 31\n\nIf you're interested in more specific genres or types of movies, let me know!"

In [50]:
agent.run("what was the second movie you just listed?")



> Entering new AgentExecutor chain...

Invoking: `VectorSearch` with `second movie`

[Document(id='77fc3135-ce1f-4385-8b72-9771b6c1406b', metadata={'description': 'See the best complete schedule of new Blu-ray and DVD releases. Also new Blu-ray and DVD release date announcements and estimates for all the upcoming and latest releases, plus movie stats, cast, trailers, movie posters and more.', 'source': 'https://www.dvdsreleasedates.com/', 'title': 'DVDs Release Dates - Latest Info on New DVD Releases', 'language': 'en'}, page_content="|  Sep 2025 >  \n---|---|---  \n| Friday August 1, 2025(2 weeks ago)  \n---  \nNew Blu-ray and DVD Releases  \n  \n7 Days to Hell  \n| imdb: NA  \n---  \nDVD  \n---  \n |  |  |    \n  \n| Tuesday August 5, 2025(last week)  \n---  \nNew Blu-ray and DVD Releases  \n  \nFriendship  \n| imdb: 7.0| R  \n---|---  \nBlu-ray  \n---  \n  \nBarron's Cove  \n| imdb: 6.2| R  \n---|---  \nDVD | Blu-ray  \n---|---  \n  \nNeighbo

'It seems there was no relevant context about a specific list of movies provided in our interaction. Could you please specify which list you are referring to or provide more details so I can assist you better?'