## Web-search RAG chatbot

Like Perplexity.

v0.1 - no conversation, just question answering based on web search

Process:
1) receive the prompt
2) pass it to an LLM to generate suitable search terms for gathering data (start with just one query)
3) run a web search with that query, take the top results, scrape them and store them as documents (start with just one result)
4) chunk the data
5) embed the chunks
6) store them in a vector DB
7) link a retriever to the vector DB
8) pass the query and the retriever to the LLM with LangChain

## 1) Receive prompt

In [1]:
user_prompt = "What are some new lego sets?"

## 2) Pass to LLM to create search term

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

In [3]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [4]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# adjacent string literals are concatenated as-is, so each sentence needs its own trailing space
system_prompt = (
    "You are an assistant for gathering data based on this prompt. "
    "Use the following prompt to generate one web search query that would gather data which would be helpful in answering the prompt. "
    "Your search query should be effective at finding the most relevant and useful data pertaining to the prompt. "
    "Keep in mind that the current date is July 2025."
    "\n\n"
    "{input}"
)

search_query_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [5]:
search_query_gen_chain = {"input": RunnablePassthrough()} | search_query_prompt | llm | StrOutputParser()

In [6]:
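import ast

search_query = search_query_gen_chain.invoke(user_prompt)
try:  # the model usually wraps the query in quotes; unwrap it when it does
    search_query = str(ast.literal_eval(search_query))
except (ValueError, SyntaxError):  # unquoted output: keep the raw text
    search_query = search_query.strip()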

Failed to multipart ingest runs: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/multipart', '{"error":"Forbidden"}\n')
Failed to send compressed multipart ingest: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/multipart', '{"error":"Forbidden"}\n')

In [7]:
search_query = search_query.replace(' ', '+')

In [8]:
search_query

'new+LEGO+sets+released+in+2025'

In [9]:
search_url = f"https://google.com/search?q={search_query}"

In [10]:
search_url

'https://google.com/search?q=new+LEGO+sets+released+in+2025'
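
Replacing spaces with `+` by hand works for this query, but it would not escape characters such as `&` or `#`. A minimal sketch that uses the standard library's `urllib.parse.urlencode` instead; `build_search_url` is a hypothetical helper name, not something defined in this notebook:

```python
from urllib.parse import urlencode


def build_search_url(query: str, base: str = "https://google.com/search") -> str:
    """Return a search URL with the query percent-encoded."""
    # urlencode applies quote_plus: spaces become '+', reserved characters are escaped
    return f"{base}?{urlencode({'q': query})}"


print(build_search_url("new LEGO sets released in 2025"))
print(build_search_url("LEGO Technic & Icons"))
```

For the query above this should give the same URL as the manual `replace`, while a query containing `&` no longer splits into a second URL parameter.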

## Web search for relevant links

In [11]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, unquote
import time

cse_key = os.environ["GOOGLE_CSE_KEY"]
cse_id = os.environ["GOOGLE_CSE_ID"]

# test single search block

search_response = requests.get(url=search_url)

# pause briefly so repeated runs don't send requests too quickly
time.sleep(2)
print("searching...")

soup = BeautifulSoup(search_response.text, "html.parser")
found_links = soup.find_all("a")

print(found_links)

searching...
[<a href="/httpservice/retry/enablejs?sei=3w6KaJDNH9S_juMP2tmIwQo">here</a>, <a href="/search?q=new+LEGO+sets+released+in+2025&amp;sca_esv=27b91714de888794&amp;ie=UTF-8&amp;emsg=SG_REL&amp;sei=3w6KaJDNH9S_juMP2tmIwQo">click here</a>, <a href="https://support.google.com/websearch">feedback</a>]


In [12]:
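# with CSE (Google Custom Search JSON API), since scraping the results page directly only returns the "enable JavaScript" page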

cse_query = search_query.replace('+', ' ')

cse_url = 'https://www.googleapis.com/customsearch/v1'
cse_params = {
    'q': cse_query,
    'key': cse_key,
    'cx' : cse_id
}

response = requests.get(cse_url, params=cse_params)

In [13]:
cse_results = response.json()

target_url_set = set()

if 'items' in cse_results:
    for result in cse_results['items']:
        if len(target_url_set) >= 5:
            break
        else:
            target_url_set.add(result['link'])

print(target_url_set)

{'https://www.reddit.com/r/legolotrfans/comments/1dbrchi/next_lego_lotr_sets_and_when_and_any_romours_for/', 'https://jaysbrickblog.com/list-of-lego-sets-2025/', 'https://falconbricks.com/news/lego-2025-sets/', 'https://www.lego.com/en-us/categories/coming-soon', 'https://falconbricks.com/news/lego-june-2025-sets/'}
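
Collecting the links in a `set` removes duplicates but also discards the ranking order returned by the search API, which is why the printed URLs appear shuffled. A small sketch of an order-preserving alternative; `top_links` is a hypothetical helper, and the sample response below is made up for illustration:

```python
def top_links(cse_results: dict, k: int = 5) -> list[str]:
    """Return up to k unique result links, in the order the API ranked them."""
    links = (item["link"] for item in cse_results.get("items", []) if "link" in item)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(links))[:k]


# made-up response with a duplicate link and an item that has no link
sample = {
    "items": [
        {"link": "https://a.example/1"},
        {"link": "https://b.example/2"},
        {"link": "https://a.example/1"},
        {"title": "no link here"},
    ]
}
print(top_links(sample, k=5))
```

Keeping the order means `text_docs[0]` would correspond to the top-ranked result rather than an arbitrary one.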


## Get page as text data

In [14]:
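# the loaders and transformers now live in langchain_community; the langchain.* paths are deprecated
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer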

# load html content from site
site_loader = AsyncHtmlLoader(list(target_url_set))
scraped_docs = site_loader.load()

# convert to text data
text_trns = Html2TextTransformer()
text_docs = text_trns.transform_documents(scraped_docs)

cleaned_text = text_docs[0].page_content
print(cleaned_text)

USER_AGENT environment variable not set, consider setting it to identify your requests.
Fetching pages: 100%|######################################################| 5/5 [00:03<00:00,  1.35it/s]


Skip to main content

Open menu Open navigation Go to Reddit Home

r/legolotrfans A chip A close button

Get App Get the Reddit app  Log In Log in to Reddit

Expand user menu Open settings menu

Go to legolotrfans

r/legolotrfans

r/legolotrfans

r/legolotrfans is for fans of LEGO Lotr, Hobbit, and anything else Tolkien!

* * *

Members Online

•

createScientist

#  Next Lego lotr Sets? and When? and any romours for 2025?

Just bought rivendell for 685$ Aud,( rrp $800), i have a feeling if the
expensive lotr icon continue we will get Minas Tirith, like that ideas set
that had 10K supporter but didnt get approved? It would make for a stunning
set!.

Anyone know of any upcoming Lotr romours for sets in 2025 or the near future?
and Lastly will smaller and affordable Lotr sets make a return?(and when if
so?). Although there is a huge controversy for Rings of Power, I personally
hope everyone watches it cause it might help make lego a return to a full Lego
Lotr sets waves. Sorry for the ty

## Chunking data

In [15]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(text_docs)
print(len(splits))

174
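
With `chunk_size=1000` and `chunk_overlap=200`, consecutive chunks can share up to 200 characters, so a sentence cut at a chunk boundary still appears whole in one of them. The sketch below is a simplified fixed-window illustration of that idea only; `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs, lines, and spaces, so its real chunks will differ. `sliding_chunks` is a hypothetical name:

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```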


## Embedding chunks and creating retriever

In [16]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()
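
Conceptually, the retriever embeds the query and returns the stored chunks whose vectors are closest to it. The toy sketch below illustrates this with hand-written 3-dimensional vectors and cosine similarity; it is an assumption-laden illustration, not how Chroma is implemented (the store uses an index, and its configured distance metric may not be cosine). `cosine`, `top_k`, and the vectors are all made up:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """Return the texts of the k stored vectors most similar to query_vec."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# toy "embeddings": first axis ~ LEGO, last axis ~ movies
index = [
    ("LEGO Technic 2025", [0.9, 0.1, 0.0]),
    ("Movie releases", [0.0, 0.2, 0.9]),
    ("LEGO Icons modular", [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], index, k=2))
```

Note that a top-k search always returns the k nearest chunks, even when none of them is actually relevant to the query.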

## Web search tool
- searches query, gets top results, parses, chunks and embeds

In [17]:
from langchain.agents import Tool

In [18]:
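def web_search_and_ingest(query):
    """Search the web for `query` via CSE, scrape the top results, and add their chunks to the shared vector store.

    Returns None; the ingested text is read back through the retriever.
    """
    # first search query via CSE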

    cse_url = 'https://www.googleapis.com/customsearch/v1'
    cse_params = {
        'q': query,
        'key': cse_key,
        'cx': cse_id
    }
    
    response = requests.get(cse_url, params=cse_params)

    # obtain first k urls

    cse_results = response.json()

    target_url_set = set()
    
    if 'items' in cse_results:
        for result in cse_results['items']:
            if len(target_url_set) >= 5:    # max number of search results (TODO: make this a parameter)
                break
            else:
                target_url_set.add(result['link'])
    
    # scrape links

    site_loader = AsyncHtmlLoader(list(target_url_set))
    scraped_docs = site_loader.load()
    
    text_trns = Html2TextTransformer()
    text_docs = text_trns.transform_documents(scraped_docs)

    # split and embed

    splits = text_splitter.split_documents(text_docs)
    vectorstore.add_documents(documents=splits)

In [19]:
# creating tool for function

web_search_tool = Tool(
    name="WebSearch",
    func=web_search_and_ingest,
    description="Use when current context is insufficient or up-to-date info is required."
)

## Vector db retriever tool
- searches vector db for query

In [20]:
def vector_search(query):
    return retriever.invoke(query)

In [21]:
# creating tool for function

vector_search_tool = Tool(
    name="VectorSearch",
    func=vector_search,
    description="Use when you believe the answer may be in the current database."
)

## Combined tool list

In [22]:
tools = [web_search_tool, vector_search_tool]

## Comparing non-rag answer and rag answer

In [23]:
non_rag_answer = llm.invoke(user_prompt)

In [24]:
print(non_rag_answer.content)

As of my last update in October 2023, I don't have access to real-time data on newly released LEGO sets. However, LEGO typically releases new sets regularly throughout the year, including seasonal sets, themes based on popular movies and franchises, and original LEGO themes.

For the most current LEGO sets, I recommend checking the official LEGO website or major retailers that carry LEGO products. They often feature the latest releases, including promotional items and exclusive sets. Additionally, fan sites and LEGO enthusiast communities frequently share information about new releases and upcoming sets.


In [25]:
from langchain_core.prompts import MessagesPlaceholder

rag_system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Make your answers fairly detailed. Remember that the current date is July 2025."
    "\n\n"
    "{context}"
)

rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", rag_system_prompt),
        MessagesPlaceholder(variable_name = "memory"),    # injecting convo memory
        ("human", "{input}"),
    ]
)

In [26]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

question_answer_chain = create_stuff_documents_chain(llm, rag_prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

## Adding temp convo memory

In [27]:
from langchain_core.prompts import HumanMessagePromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import FileChatMessageHistory  # current home of this class
from langchain_core.runnables.history import RunnableWithMessageHistory

In [28]:
# rag_memory = ConversationBufferMemory(
#     chat_memory = FileChatMessageHistory("messages.json"),
#     memory_key = "history",
#     return_messages = True
# )

rag_memory = FileChatMessageHistory("messages.json")

In [29]:
def get_rag_mem(_):
    return rag_memory

In [30]:
rag_mem_chain = RunnableWithMessageHistory(
    rag_chain,
    get_rag_mem,
    input_messages_key="input",
    history_messages_key="memory",
    output_messages_key="answer"
)
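
Because `get_rag_mem` ignores its argument, every `session_id` shares the same `messages.json` history. If separate conversations are ever needed, the factory could cache one history per session. A minimal sketch of that pattern, using a plain list as a stand-in for the history object; `SessionStore` is a hypothetical name, and in this notebook the factory passed to it would presumably construct a `FileChatMessageHistory` per session instead:

```python
from typing import Callable, Dict, Generic, TypeVar

H = TypeVar("H")


class SessionStore(Generic[H]):
    """Callable that returns one cached history object per session id."""

    def __init__(self, make_history: Callable[[str], H]) -> None:
        self._make_history = make_history
        self._histories: Dict[str, H] = {}

    def __call__(self, session_id: str) -> H:
        # create the history lazily on first use, then always return the same object
        if session_id not in self._histories:
            self._histories[session_id] = self._make_history(session_id)
        return self._histories[session_id]


# stand-in: each session's "history" is just a list of messages
get_history = SessionStore(lambda session_id: [])
get_history("alice").append("What are some new lego sets?")
print(get_history("alice"), get_history("bob"))
```

An instance like `get_history` has the same one-argument shape as `get_rag_mem`, so it could take its place as the history factory.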

In [31]:
rag_answer = rag_mem_chain.invoke({"input": user_prompt}, config={"configurable": {"session_id": "irrelevant"}})

In [32]:
# print(rag_answer["context"])

In [33]:
src_count = len(rag_answer["context"])
print(src_count)

4


In [34]:
sources_set = set()

for i in range(src_count):
    sources_set.add(rag_answer["context"][i].metadata["source"])

print(sources_set)

{'https://falconbricks.com/news/lego-2025-sets/'}
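
The same citation-collecting loop is repeated after each question below, so it could be factored into a helper. A small sketch that also keeps the sources in retrieval order and tolerates documents without a `source` entry; `collect_sources` is a hypothetical name, and `Doc` is only a stand-in for the retrieved document objects so the example runs on its own:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def collect_sources(docs) -> list[str]:
    """Return the unique 'source' values of docs, in first-seen order."""
    sources = (doc.metadata.get("source") for doc in docs)
    # skip documents with no source, then de-duplicate while preserving order
    return list(dict.fromkeys(s for s in sources if s))


docs = [
    Doc("chunk 1", {"source": "https://falconbricks.com/news/lego-2025-sets/"}),
    Doc("chunk 2", {"source": "https://falconbricks.com/news/lego-2025-sets/"}),
    Doc("chunk 3", {}),
]
print(collect_sources(docs))
```

In the notebook this would be called as `collect_sources(rag_answer["context"])`.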


RAG answer:

In [35]:
print(rag_answer["answer"])

As of July 2025, several new LEGO sets have been announced or released across various themes. Here’s a detailed look at some of the sets:

### 1. **LEGO Icons**
   - **2025 Modular Building**
     - A highly anticipated addition to the Modular Building series, this set features a detailed design that encourages customization and integration with previous modular sets.

   - **Botanical Collection Flowers**
     - New sets in the Botanical Collection are being released, which include more intricate flower builds, perfect for those who appreciate both LEGO and home decor.

   - **New Licenses**
     - Exciting large sets are anticipated based on popular franchises such as **Lord of the Rings, The Simpsons, Pirates of the Caribbean,** and **Star Trek**.

### 2. **LEGO Technic**
   - **New LEGO Technic Sets**
     - A total of **19 new LEGO Technic sets** are rumored for 2025, including two significant **Formula 1 cars** that are likely to appeal to motorsport fans.

### 3. **LEGO Star War

In [36]:
print("Citations:")
for src in sources_set:
    print(src)

Citations:
https://falconbricks.com/news/lego-2025-sets/


In [37]:
# no memory?

In [38]:
rag_answer = rag_mem_chain.invoke({"input": "Could you elaborate?"}, config={"configurable": {"session_id": "irrelevant"}})

In [39]:
print(rag_answer["answer"])

Of course! Here's a more detailed breakdown of some notable new LEGO sets that are either recently released or slated for release in 2025:

### 1. **LEGO Icons**
   - **2025 Modular Building**: 
     - While specific details have yet to be fully disclosed, this new modular building is expected to include intricate architectural features that align with previous modular sets. Design elements typically seen in these sets might include a mix of residential and commercial features, along with minifigures that add to the storyline of the set.
   - **Botanical Collection Flowers**:
     - This collection, which focuses on plants and flowers, continues to expand with new sets that provide intricate builds. These sets are designed for display and often feature highly detailed parts that mimic real flowers, making them an excellent choice for home decor.

### 2. **LEGO Technic**
   - **New LEGO Technic Sets (2025)**: 
     - A total of **19 new sets** are anticipated within the LEGO Technic lin

In [40]:
src_count = len(rag_answer["context"])

sources_set = set()

for i in range(src_count):
    sources_set.add(rag_answer["context"][i].metadata["source"])

print("Citations:")
for src in sources_set:
    print(src)

Citations:
https://jaysbrickblog.com/list-of-lego-sets-2025/
https://falconbricks.com/news/lego-2025-sets/
https://www.lego.com/en-us/categories/coming-soon


In [41]:
rag_answer = rag_mem_chain.invoke({"input": "What are some new video games that were released"}, config={"configurable": {"session_id": "irrelevant"}})

In [42]:
print(rag_answer["answer"])

As of July 2025, several exciting video games have recently been released or are scheduled for release that are generating buzz among gamers. Here’s a detailed overview of some of the notable titles:

### 1. **The Legend of Zelda: Breath of the Wild 2**
   - **Release Date**: May 2025
   - **Overview**: This highly anticipated sequel builds on the critically acclaimed original game, expanding the open world of Hyrule with new locations, gameplay mechanics, and an engaging narrative. Players can explore the skies above Hyrule, solve puzzles, and encounter new enemies and allies, all in stunning graphics indicative of the advancements made since the first game.

### 2. **Final Fantasy VII Rebirth**
   - **Release Date**: March 2025
   - **Overview**: This game continues the reimagined story of Cloud Strife and his companions, following the events of the first part, Final Fantasy VII Remake. Players can expect new gameplay mechanics, redefined characters, and expansive environments that e

In [43]:
src_count = len(rag_answer["context"])

sources_set = set()

for i in range(src_count):
    sources_set.add(rag_answer["context"][i].metadata["source"])

print("Citations:")
for src in sources_set:
    print(src)

Citations:
https://falconbricks.com/news/lego-2025-sets/
https://falconbricks.com/news/lego-june-2025-sets/
https://www.reddit.com/r/legolotrfans/comments/1dbrchi/next_lego_lotr_sets_and_when_and_any_romours_for/


## Testing conversational capabilities with agentic search and retrieval

In [54]:
from langchain.schema import SystemMessage
from langchain.agents import initialize_agent

system_prompt = '''
You are an intelligent assistant.
Keep in mind that it is July 2025.
- First, always use the VectorSearch tool to retrieve context for the user's question.
- Judge whether the context is sufficient to answer: if the information is incomplete, outdated, or missing, call the WebSearch tool to search, ingest, and update the database, then try again.
- Only use the WebSearch tool if the current vector database does NOT suffice.
- Be cost-efficient: do NOT call WebSearch for trivial, definitional, basic, or unchanged topics.
'''

system_message = SystemMessage(content=system_prompt)
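
The policy described in the system prompt (retrieve first, search the web only if the context falls short, then retrieve again) can be written out as plain control flow. The sketch below is a deterministic illustration under stated assumptions, not what the agent does internally: in the notebook the LLM itself decides which tool to call. `retrieve_with_fallback` and `is_sufficient` are hypothetical names, and the fake tools exist only to make the example runnable:

```python
from typing import Callable, List


def retrieve_with_fallback(
    query: str,
    vector_search: Callable[[str], List[str]],
    web_search: Callable[[str], None],
    is_sufficient: Callable[[List[str]], bool],
) -> List[str]:
    """Try the vector store first; ingest from the web and retry only if needed."""
    docs = vector_search(query)
    if is_sufficient(docs):
        return docs
    web_search(query)            # ingests new documents into the store
    return vector_search(query)  # second attempt over the updated store


# fake tools backed by an in-memory list
store: List[str] = ["LEGO Technic sets 2025"]
calls: List[str] = []

def fake_vector_search(q: str) -> List[str]:
    return [d for d in store if q.lower() in d.lower()]

def fake_web_search(q: str) -> None:
    calls.append(q)
    store.append(f"Web result about {q}")

print(retrieve_with_fallback("lego", fake_vector_search, fake_web_search, bool))
print(retrieve_with_fallback("movies", fake_vector_search, fake_web_search, bool))
```

Here the sufficiency check is simply "at least one document came back"; a real check would need some measure of relevance or freshness.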

In [65]:
wrapped_mem = ConversationBufferMemory(
    chat_memory = FileChatMessageHistory("messages_agentic.json"),
    memory_key = "history",
    return_messages = True
)

In [66]:
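agent = initialize_agent(
    tools,
    llm,
    agent = "openai-functions",
    memory=wrapped_mem,
    # prompt options go through agent_kwargs; a bare system_message= argument may never reach the agent prompt
    agent_kwargs={
        "system_message": system_message,
        "extra_prompt_messages": [MessagesPlaceholder(variable_name="history")],  # exposes wrapped_mem to the prompt
    },
    verbose=True
)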

In [67]:
agent.run("what are some new movies?")



> Entering new AgentExecutor chain...

Invoking: `WebSearch` with `new movie releases October 2023`


Fetching pages: 100%|######################################################| 5/5 [00:08<00:00,  1.67s/it]


[36;1m[1;3mNone[0m[32;1m[1;3m
Invoking: `VectorSearch` with `new movies 2023`


[0m[33;1m[1;3m[Document(id='a3bbd1a2-89b2-4db5-aab3-7ff7521f64ca', metadata={'language': 'en', 'source': 'https://en.wikipedia.org/wiki/List_of_American_films_of_2023', 'title': 'List of American films of 2023 - Wikipedia'}, page_content='2010s  \n2010 2011 2012 2013 2014  \n2015 2016 2017 2018 2019  \n2020s  \n2020 2021 2022 2023 2024  \n2025 2026 2027 2028  \n  \nThe following is a list of American films released in 2023. The year featured\na diverse array of cinematic productions, ranging from major studio\nblockbusters to independent and streaming platform releases. Notable films\nincluded _Oppenheimer_ , a biographical drama directed by Christopher Nolan\nexploring the life of physicist J. Robert Oppenheimer; _Barbie_ , a fantasy\ncomedy starring Margot Robbie and Ryan Gosling; _The Super Mario Bros. Movie_\n, the first billion dollar movie to be based off of a video game; _Guardians\nof the Ga

"Here are some notable movies released in 2023:\n\n1. **Oppenheimer** - A biographical drama directed by Christopher Nolan that explores the life of physicist J. Robert Oppenheimer.\n2. **Barbie** - A fantasy comedy starring Margot Robbie and Ryan Gosling.\n3. **The Super Mario Bros. Movie** - An animated film based on the popular video game franchise, which became the first billion-dollar movie based on a video game.\n4. **Guardians of the Galaxy Vol. 3** - The latest installment in the Marvel Cinematic Universe, directed by James Gunn.\n5. **John Wick: Chapter 4** - The fourth film in the action franchise, featuring Keanu Reeves as the titular assassin.\n\nThese films represent a mix of blockbuster hits and are from various genres, reflecting the diversity in cinema for the year. If you're looking for specific genres or more recent releases, let me know!"