## Web-search RAG chatbot

like perplexity

v0.1 - no conversation, just question and answer based on web search

process:
1) receive prompt
2) pass to LLM to create appropriate search terms for gathering data - start with just one
3) perform web search with that, get top results and scrape and store as doc - start with just one result
4) chunk data
5) embed chunks
6) store in vectordb
7) link retriever to vecdb
8) pass query with retriever to llm w/ langchain

## 1) Receive prompt

In [71]:
user_prompt = "What are some new lego sets?"

## 2) Pass to LLM to create search term

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

In [3]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [72]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

system_prompt = (
    "You are an assistant for gathering data based on this prompt."
    "Use the following prompt to generate one web search query that would gather data which would be helpful in answering the prompt."
    "Your search query should be effective at finding the most relevant and useful data pertaining to the prompt."
    "Keep in mind that the current date is July 2025"
    "\n\n"
    "{input}"
)

search_query_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [73]:
search_query_gen_chain = {"input": RunnablePassthrough()} | search_query_prompt | llm | StrOutputParser()

In [74]:
import ast

search_query = search_query_gen_chain.invoke(user_prompt)
search_query = ast.literal_eval(search_query)

In [75]:
search_query = search_query.replace(' ', '+')

In [76]:
search_query

'new+LEGO+sets+released+July+2025'

In [77]:
search_url = f"https://google.com/search?q={search_query}"

In [78]:
search_url

'https://google.com/search?q=new+LEGO+sets+released+July+2025'

## Web search for relevant links

In [79]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, unquote
import time

cse_key = os.environ["GOOGLE_CSE_KEY"]
cse_id = os.environ["GOOGLE_CSE_ID"]

# test single search block

search_response = requests.get(url=search_url)

# sleep for a while to not overload requests
time.sleep(2)
print("searching...")

soup = BeautifulSoup(search_response.text, "html.parser")
found_links = soup.find_all("a")

print(found_links)

searching...
[<a href="/httpservice/retry/enablejs?sei=N2KCaMe7LMOhseMPqfDU0Ak">here</a>, <a href="/search?q=new+LEGO+sets+released+July+2025&amp;sca_esv=c29e0acc1e59cdbc&amp;ie=UTF-8&amp;emsg=SG_REL&amp;sei=N2KCaMe7LMOhseMPqfDU0Ak">click here</a>, <a href="https://support.google.com/websearch">feedback</a>]


In [80]:
# with CSE

cse_query = search_query.replace('+', ' ')

cse_url = 'https://www.googleapis.com/customsearch/v1'
cse_params = {
    'q': cse_query,
    'key': cse_key,
    'cx' : cse_id
}

response = requests.get(cse_url, cse_params)

In [81]:
cse_results = response.json()

target_url_set = set()

if 'items' in cse_results:
    for result in cse_results['items']:
        if len(target_url_set) >= 5:
            break
        else:
            target_url_set.add(result['link'])

print(target_url_set)

{'https://jaysbrickblog.com/news/buying-guide-every-new-lego-set-releasing-july-2025/', 'https://www.ign.com/articles/new-lego-sets-for-july-2025', 'https://www.lego.com/en-us/categories/coming-soon', 'https://brickbanter.com/category/2025/july-2025/every-new-lego-set-releasing-in-july-2025/', 'https://falconbricks.com/news/lego-july-2025-sets/'}


## Get page as text data

In [82]:
from langchain.document_loaders import AsyncHtmlLoader
from langchain.document_transformers import Html2TextTransformer

# load html content from site
site_loader = AsyncHtmlLoader(list(target_url_set))
scraped_docs = site_loader.load()

# convert to text data
text_trns = Html2TextTransformer()
text_docs = text_trns.transform_documents(scraped_docs)

cleaned_text = text_docs[0].page_content
print(cleaned_text)

Fetching pages: 100%|######################################################| 5/5 [00:04<00:00,  1.14it/s]


  * About Me
  * LEGO Reviews
  * News
  * 2025 LEGO Sets
  * 2024 Advent Calendars

SUBSCRIBE

* ## Subscribe for updates

Enter your email address here to receive updates about new posts from Jay's
Brick Blog - straight to your inbox!

Email Address

Subscribe

Join 6,848 other subscribers

  * Reviews
  * News
  * LEGO Minifig Scanner

  * About Me
  * LEGO Reviews
  * News
  * 2025 LEGO Sets
  * 2024 Advent Calendars

  * Reviews
  * News
  * LEGO Minifig Scanner

# Buying Guide: Every new LEGO set releasing in July 2025

June 28, 2025 | 0

LEGO fans get a slight reprieve in July 2025 with a relatively small number of
highly anticipated releases.

LEGO fans will be able to look forward to the launch of **10375 How to Train
Your Dragon: Toothless** , **10357 Shelby Cobra 427 S/C** , as well as **43008
Nike Dunk** , which are the biggest ticket items.

Star Wars fans can also look forward to the very fun **75428 Battle Droid With
STAP**!

For fans of LEGO Ideas, the **40788 Friendly 

## Chunking data

In [83]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(text_docs)
print(len(splits))

59


## Embedding chunks and creating retriever

In [84]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

## Comparing non-rag answer and rag answer

In [85]:
non_rag_answer = llm.invoke(user_prompt)

In [86]:
print(non_rag_answer.content)

As of my last knowledge update in October 2023, some of the new LEGO sets released around that time included:

1. **LEGO Icons - Eiffel Tower**: A highly detailed and large-scale model of the famous Paris landmark.
2. **LEGO Star Wars - The Mandalorian Forge**: A set featuring iconic scenes and characters from "The Mandalorian" series.
3. **LEGO Creator Expert - Titanic**: A massive set replicating the iconic ocean liner with intricate detailing.
4. **LEGO Harry Potter - Hogwarts Magical Trunk**: A playset that combines various elements from the Harry Potter universe, including different character builds and magical accessories.
5. **LEGO Super Mario - Character Packs Series 6**: Includes new characters to enhance Super Mario Adventures.
6. **LEGO Friends - Botanical Garden**: A detailed and colorful set featuring a garden and various plant species.

For the very latest releases, I recommend checking the official LEGO website or major retailers, as new sets are frequently launched thro

In [87]:
rag_system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Make your answers fairly detailed. Remember that the current date is July 2025"
    "\n\n"
    "{context}"
)

rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", rag_system_prompt),
        MessagesPlaceholder(variable_name = "memory"),    # injecting convo memory
        ("human", "{input}"),
    ]
)

In [88]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

question_answer_chain = create_stuff_documents_chain(llm, rag_prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

## Adding temp convo memory

In [89]:
from langchain_core.prompts import HumanMessagePromptTemplate
from langchain_core.prompts import MessagesPlaceholder
from langchain.memory import ConversationBufferMemory
from langchain.memory import FileChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

In [90]:
# rag_memory = ConversationBufferMemory(
#     chat_memory = FileChatMessageHistory("messages.json"),
#     memory_key = "history",
#     return_messages = True
# )

rag_memory = FileChatMessageHistory("messages.json")

In [91]:
def get_rag_mem(_):
    return rag_memory

In [92]:
rag_mem_chain = RunnableWithMessageHistory(
    rag_chain,
    get_rag_mem,
    input_messages_key="input",
    history_messages_key="memory"
)

In [93]:
rag_answer = rag_mem_chain.invoke({"input": user_prompt}, config={"configurable": {"session_id": "irrelevant"}})

Error in RootListenersTracer.on_chain_end callback: KeyError('output')


In [32]:
# print(rag_answer["context"])

[Document(id='cb9556b4-b775-4d0d-8ad5-90e74a03fa48', metadata={'language': 'en-US', 'title': '2025 Movies: Release Dates For Most Anticipated Films', 'source': 'https://deadline.com/lists/2025-movies/'}, page_content="**RELATED:The Movies That Have Made More Than $1 Billion At The Box Office**\n\nJanuary kicked it off with combination of theatrical releases like the\nChristopher Abbott-starring _Wolf Man_ and streamer offerings including\nNetflix's _Back In Action_ starring Jamie Foxx and Cameron Diaz as well as\nPrime Video's _You 're Cordially Invited_ starring Will Ferrell and Reese\nWitherspoon. February marked the first of the Marvel films releasing in 2025\n-- _Captain America: Brave New World_ -- as well as the return of Renée\nZellweger's Bridget Jones in the franchise's fourth installment _Bridget\nJones: Mad About the Boy_ , which will be streaming on Peacock just in time\nfor Valentine's Day.\n\n**RELATED:29 Of The Most Anticipated New And Returning TV Shows Of 2025**\n\nScr

In [94]:
src_count = len(rag_answer["context"])
print(src_count)

4


In [95]:
sources_set = set()

for i in range(src_count):
    sources_set.add(rag_answer["context"][i].metadata["source"])

print(sources_set)

{'https://brickbanter.com/category/2025/july-2025/every-new-lego-set-releasing-in-july-2025/', 'https://falconbricks.com/news/lego-july-2025-sets/', 'https://jaysbrickblog.com/news/buying-guide-every-new-lego-set-releasing-july-2025/', 'https://www.ign.com/articles/new-lego-sets-for-july-2025'}


Rag Answer:

In [96]:
print(rag_answer["answer"])

As of July 2025, several new LEGO sets have been released, including:

1. **LEGO Icons - How to Train Your Dragon: Toothless** (Set 10375) - Priced at $69.99, this set celebrates the beloved character Toothless from the "How to Train Your Dragon" series.

2. **LEGO Icons - Shelby Cobra 427 S/C** (Set 10357) - This classic car model is available for $159.99 and captures the iconic design of the Shelby Cobra.

3. **LEGO Nike Dunk** (Set 43008) - A part of the new collaboration between LEGO and Nike, this set is being sold for $99.99.

4. **LEGO Star Wars - Battle Droid with STAP** (Set 75428) - This set is part of the extensive Star Wars collection and features a Battle Droid along with a STAP. 

5. **LEGO Marvel - Miles Morales’ Mask** (Set 76329) - This LEGO set focuses on the character Miles Morales from the Marvel universe.

Additionally, other notable sets that have also been mentioned include:

- **LEGO Super Mario: Mario Kart – Mario & Standard Kart** - Priced at $169.99.
- **LEGO

In [97]:
print("Citations:")
for src in sources_set:
    print(src)

Citations:
https://brickbanter.com/category/2025/july-2025/every-new-lego-set-releasing-in-july-2025/
https://falconbricks.com/news/lego-july-2025-sets/
https://jaysbrickblog.com/news/buying-guide-every-new-lego-set-releasing-july-2025/
https://www.ign.com/articles/new-lego-sets-for-july-2025


In [98]:
# no memory?