## Web-search RAG chatbot

Like Perplexity.

v0.1 - no conversation, just question and answer based on web search

Process:
1) Receive the user prompt.
2) Pass it to an LLM to generate a suitable web search query (start with just one).
3) Run the web search, take the top results, then scrape and store each page as a document (start with just one result).
4) Chunk the documents.
5) Embed the chunks.
6) Store the embeddings in a vector DB.
7) Link a retriever to the vector DB.
8) Pass the query and retriever to the LLM with LangChain.
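
The eight steps can be read as a chain of functions. The sketch below is only a shape for that chain: every stage is a hypothetical callable passed in by the caller (none of these names come from LangChain), so it shows the data flow rather than any real implementation.

```python
from typing import Callable, List


def answer_with_web_rag(
    prompt: str,
    make_query: Callable[[str], str],                 # step 2: prompt -> search query
    search: Callable[[str], List[str]],               # step 3: query -> result URLs
    scrape: Callable[[str], str],                     # step 3: URL -> page text
    chunk: Callable[[str], List[str]],                # step 4: text -> chunks
    retrieve: Callable[[List[str], str], List[str]],  # steps 5-7: chunks + question -> context
    generate: Callable[[str, List[str]], str],        # step 8: question + context -> answer
) -> str:
    """Run the steps in order; all stages are placeholders supplied by the caller."""
    query = make_query(prompt)
    urls = search(query)
    chunks: List[str] = []
    for url in urls:
        chunks.extend(chunk(scrape(url)))
    context = retrieve(chunks, prompt)
    return generate(prompt, context)
```

Passing stubs for each stage makes it possible to test the wiring before any API key is involved.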

## 1) Receive prompt

In [1]:
user_prompt = "What are some new lego sets?"

## 2) Pass to LLM to create search term

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

In [3]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [4]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

system_prompt = (
    "You are an assistant for gathering data based on this prompt. "
    "Use the following prompt to generate one web search query that would gather data which would be helpful in answering the prompt. "
    "Your search query should be effective at finding the most relevant and useful data pertaining to the prompt. "
    "Keep in mind that the current date is July 2025."
    "\n\n"
    "{input}"
)

search_query_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [5]:
search_query_gen_chain = {"input": RunnablePassthrough()} | search_query_prompt | llm | StrOutputParser()

In [6]:
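search_query = search_query_gen_chain.invoke(user_prompt)
# The model tends to wrap the query in quotes. Strip them directly instead of
# using ast.literal_eval, which raises if the reply is not a valid Python literal.
search_query = search_query.strip().strip("\"'")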

Failed to multipart ingest runs: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/multipart', '{"error":"Forbidden"}\n')

In [7]:
search_query = search_query.replace(' ', '+')

In [8]:
search_query

'new+Lego+sets+released+in+2025'

In [9]:
search_url = f"https://google.com/search?q={search_query}"

In [10]:
search_url

'https://google.com/search?q=new+Lego+sets+released+in+2025'
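
Replacing spaces with `+` works for this query, but a query containing characters such as `&` or `#` would produce a broken URL. A minimal sketch of a more defensive version, using only the standard library; the helper names `clean_search_query` and `build_search_url` are made up for illustration:

```python
from urllib.parse import quote_plus


def clean_search_query(raw: str) -> str:
    """Trim whitespace and one pair of matching surrounding quotes."""
    text = raw.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1].strip()
    return text


def build_search_url(query: str, base: str = "https://google.com/search") -> str:
    """URL-encode the query (spaces become '+') and append it as the q parameter."""
    return f"{base}?q={quote_plus(query)}"


raw_output = '"new Lego sets released in 2025"'  # example of a quoted LLM reply
print(build_search_url(clean_search_query(raw_output)))
# prints https://google.com/search?q=new+Lego+sets+released+in+2025
```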

## Web search for relevant links

In [11]:
import requests
from bs4 import BeautifulSoup
import time

cse_key = os.environ["GOOGLE_CSE_KEY"]
cse_id = os.environ["GOOGLE_CSE_ID"]

# Test: request the Google results page directly and list the links on it.
# The output below suggests Google serves an "enable JavaScript" page to
# plain requests, which is why the next cell switches to the CSE API.
print("searching...")
search_response = requests.get(url=search_url, timeout=10)

# pause briefly so repeated runs do not send requests too quickly
time.sleep(2)

soup = BeautifulSoup(search_response.text, "html.parser")
found_links = soup.find_all("a")

print(found_links)

searching...
[<a href="/httpservice/retry/enablejs?sei=2xWGaLmbC8SgnesPhICkKQ">here</a>, <a href="/search?q=new+Lego+sets+released+in+2025&amp;sca_esv=29c485ae8b21612f&amp;ie=UTF-8&amp;emsg=SG_REL&amp;sei=2xWGaLmbC8SgnesPhICkKQ">click here</a>, <a href="https://support.google.com/websearch">feedback</a>]


In [12]:
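# Same search through the Google Custom Search JSON API (CSE)

cse_query = search_query.replace('+', ' ')

cse_url = 'https://www.googleapis.com/customsearch/v1'
cse_params = {
    'q': cse_query,
    'key': cse_key,
    'cx': cse_id
}

response = requests.get(cse_url, params=cse_params, timeout=10)
response.raise_for_status()  # fail early on quota or auth errors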

In [13]:
cse_results = response.json()

target_url_set = set()

if 'items' in cse_results:
    for result in cse_results['items']:
        if len(target_url_set) >= 5:
            break
        else:
            target_url_set.add(result['link'])

print(target_url_set)

{'https://jaysbrickblog.com/list-of-lego-sets-2025/', 'https://www.lego.com/en-us/categories/coming-soon', 'https://falconbricks.com/news/lego-2025-sets/', 'https://www.lego.com/en-us/categories/new-sets-and-products', 'https://www.reddit.com/r/Legoleak/comments/1l7fbcw/every_upcoming_lego_ideas_set_details_june_2025/'}
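
A set removes duplicate links but does not keep the order in which the search engine ranked them. If ranking order matters later (for example, to scrape only the first result), a list-based variant is one option. This is a sketch that assumes only the `items` / `link` fields used in the cell above; `top_links` is a hypothetical helper name and the sample response is invented.

```python
from typing import Any, Dict, List


def top_links(cse_results: Dict[str, Any], limit: int = 5) -> List[str]:
    """Return up to `limit` unique result links, in ranking order."""
    links: List[str] = []
    for item in cse_results.get("items", []):
        if len(links) >= limit:
            break
        link = item.get("link")
        if link and link not in links:
            links.append(link)
    return links


# Invented sample response with one duplicate link
sample = {
    "items": [
        {"link": "https://a.example/1"},
        {"link": "https://b.example/2"},
        {"link": "https://a.example/1"},
    ]
}
print(top_links(sample))  # ['https://a.example/1', 'https://b.example/2']
```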


## Get page as text data

In [14]:
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

# load html content from site
site_loader = AsyncHtmlLoader(list(target_url_set))
scraped_docs = site_loader.load()

# convert to text data
text_trns = Html2TextTransformer()
text_docs = text_trns.transform_documents(scraped_docs)

cleaned_text = text_docs[0].page_content
print(cleaned_text)

USER_AGENT environment variable not set, consider setting it to identify your requests.
Fetching pages: 100%|######################################################| 5/5 [00:03<00:00,  1.39it/s]


  * About Me
  * LEGO Reviews
  * News
  * 2025 LEGO Sets
  * 2024 Advent Calendars

SUBSCRIBE

* ## Subscribe for updates

Enter your email address here to receive updates about new posts from Jay's
Brick Blog - straight to your inbox!

Email Address

Subscribe

Join 6,862 other subscribers

  * Reviews
  * News
  * LEGO Minifig Scanner

  * About Me
  * LEGO Reviews
  * News
  * 2025 LEGO Sets
  * 2024 Advent Calendars

  * Reviews
  * News
  * LEGO Minifig Scanner

Here's a complete list of all upcoming **new 2025 LEGO sets** , with set
numbers, names and more sourced from around the web!

Here are all the new LEGO sets that will launch worldwide in 2025! Click the
links below to jump straight to the sections.

  * **LEGO Botanicals 2025**
  * **LEGO Ideas June 2025**
  * **LEGO Icons June 2025**
  * **LEGO Star Wars June 2025**
  * **LEGO City June 2025**
  * **LEGO Bluey 2025**
  * **LEGO Friends June 2025**
  * **LEGO Animal Crossing 2025**
  * **LEGO Disney June 2025**
  * **LEG
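
The scraped text above repeats the site's navigation menu several times, and those lines would end up in the chunks. One simple cleanup, sketched here with a hypothetical `drop_repeated_lines` helper, is to keep only the first occurrence of each non-blank line. This is a blunt heuristic: it also removes legitimate lines that happen to repeat in the article body.

```python
def drop_repeated_lines(text: str) -> str:
    """Keep the first occurrence of each non-blank line; blank lines pass through."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip()
        if key and key in seen:
            continue  # already seen this menu item or heading
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


page = "  * News\n  * Reviews\n\n  * News\nBody text"
print(drop_repeated_lines(page))
```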

## Chunking data

In [15]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(text_docs)
print(len(splits))

105
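
To see what `chunk_size=1000` and `chunk_overlap=200` mean, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); it only illustrates the window and overlap arithmetic. `sliding_chunks` is a hypothetical name.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With the notebook's settings, each window advances 800 characters, so a 2,500-character page would yield three chunks in this simplified scheme.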


## Embedding chunks and creating retriever

In [16]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()
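
Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are closest to it. The toy sketch below uses hand-written 2-D vectors and cosine similarity purely for illustration; the real embeddings come from `OpenAIEmbeddings`, and the distance metric actually used depends on how the vector store is configured. `cosine` and `top_k` are hypothetical helpers.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          indexed: List[Tuple[str, Sequence[float]]],
          k: int = 4) -> List[str]:
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(indexed, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Invented toy index: (chunk text, embedding)
index = [("lego", [1.0, 0.0]), ("games", [0.0, 1.0]), ("mixed", [1.0, 1.0])]
print(top_k([1.0, 0.1], index, k=2))  # ['lego', 'mixed']
```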

## Comparing non-RAG answer and RAG answer

In [17]:
non_rag_answer = llm.invoke(user_prompt)

In [18]:
print(non_rag_answer.content)

As of my last update, I don't have access to real-time data, including the latest LEGO sets. However, LEGO frequently releases new sets, including themes like LEGO City, LEGO Creator, LEGO Star Wars, LEGO Harry Potter, and various licensed themes. To find the most current releases, I recommend visiting the official LEGO website or checking out fan sites and forums dedicated to LEGO, where new product announcements are regularly shared. If you're looking for a specific theme or type of set, let me know, and I can provide more information based on previous releases!


In [19]:
from langchain_core.prompts import MessagesPlaceholder

rag_system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Make your answers fairly detailed. Remember that the current date is July 2025."
    "\n\n"
    "{context}"
)

rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", rag_system_prompt),
        MessagesPlaceholder(variable_name = "memory"),    # injecting convo memory
        ("human", "{input}"),
    ]
)

In [20]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

question_answer_chain = create_stuff_documents_chain(llm, rag_prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

## Adding temporary conversation memory

In [21]:
from langchain_community.chat_message_histories import FileChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

In [22]:
# rag_memory = ConversationBufferMemory(
#     chat_memory = FileChatMessageHistory("messages.json"),
#     memory_key = "history",
#     return_messages = True
# )

rag_memory = FileChatMessageHistory("messages.json")

In [23]:
def get_rag_mem(_session_id):
    """Return the single shared history; the session id is ignored."""
    return rag_memory

In [24]:
rag_mem_chain = RunnableWithMessageHistory(
    rag_chain,
    get_rag_mem,
    input_messages_key="input",
    history_messages_key="memory",
    output_messages_key="answer"
)

In [25]:
rag_answer = rag_mem_chain.invoke({"input": user_prompt}, config={"configurable": {"session_id": "irrelevant"}})
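
Because `get_rag_mem` ignores its argument, every `session_id` shares the same `messages.json` history. If separate conversations are wanted later, one possible shape is a getter that caches one history object per session id. The sketch below uses a plain list as a stand-in history so it runs without LangChain; `make_history_getter` is a hypothetical helper, and wiring it to a per-session `FileChatMessageHistory` file is an assumption, not something the notebook does.

```python
from typing import Callable, Dict, TypeVar

H = TypeVar("H")


def make_history_getter(factory: Callable[[str], H]) -> Callable[[str], H]:
    """Build a function that returns one cached history object per session id."""
    cache: Dict[str, H] = {}

    def get_history(session_id: str) -> H:
        if session_id not in cache:
            cache[session_id] = factory(session_id)  # create on first use
        return cache[session_id]

    return get_history


# Stand-in factory: a plain list per session (assumption for illustration only)
get_history = make_history_getter(lambda session_id: [])
get_history("alice").append("What are some new lego sets?")
print(get_history("alice"), get_history("bob"))
```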

In [26]:
# print(rag_answer["context"])

In [27]:
src_count = len(rag_answer["context"])
print(src_count)

4


In [28]:
sources_set = {doc.metadata["source"] for doc in rag_answer["context"]}

print(sources_set)

{'https://jaysbrickblog.com/list-of-lego-sets-2025/', 'https://www.lego.com/en-us/categories/coming-soon', 'https://www.lego.com/en-us/categories/new-sets-and-products'}
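
The citation logic is repeated after each question in this notebook, so it may be worth a small helper. The sketch below also keeps the sources in retrieval order, which a set does not. `Doc` is a hypothetical stand-in for the retrieved document objects (only `metadata["source"]` is assumed), and `unique_sources` is a made-up name.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Doc:
    """Stand-in for a retrieved document that carries a metadata dict."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def unique_sources(docs: Iterable[Doc]) -> List[str]:
    """Return each document's metadata['source'] once, in retrieval order."""
    ordered = dict.fromkeys(
        doc.metadata["source"] for doc in docs if "source" in doc.metadata
    )
    return list(ordered)


docs = [
    Doc("a", {"source": "https://a.example"}),
    Doc("b", {"source": "https://b.example"}),
    Doc("c", {"source": "https://a.example"}),
    Doc("d"),  # no source recorded
]
print(unique_sources(docs))  # ['https://a.example', 'https://b.example']
```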


RAG answer:

In [29]:
print(rag_answer["answer"])

As of July 2025, some of the new LEGO sets that have been released or are set to be released include:

1. **LEGO Star Wars**
   - **75400 Plo Koon’s Jedi Starfighter Microfighter** – released on June 1, 2025.
   - **75411 Darth Maul Mech** – released on June 1, 2025.
   - **75412 Death Trooper & Night Trooper Battle Pack** – released on June 1, 2025.

2. **LEGO Friends Sets for 2025** – New sets are available that are part of the LEGO Friends collection.

3. **LEGO DUPLO Sets for 2025** – New sets have been added to the LEGO DUPLO collection.

4. **LEGO City Sets for 2025** – The latest LEGO City sets are now available.

5. **How to Train Your Dragon: Toothless** – A set available for $69.99.

There are more sets being introduced regularly, so fans can look forward to a variety of options in different themes.


In [30]:
print("Citations:")
for src in sources_set:
    print(src)

Citations:
https://jaysbrickblog.com/list-of-lego-sets-2025/
https://www.lego.com/en-us/categories/coming-soon
https://www.lego.com/en-us/categories/new-sets-and-products


In [31]:
# does the chain remember the previous turn? ask a follow-up to check

In [32]:
rag_answer = rag_mem_chain.invoke({"input": "Could you elaborate?"}, config={"configurable": {"session_id": "irrelevant"}})

In [33]:
print(rag_answer["answer"])

Certainly! Here’s a more detailed overview of some recent LEGO sets, particularly focusing on themes and specifics:

### 1. **LEGO Star Wars**
   - **75400 Plo Koon’s Jedi Starfighter Microfighter**
     - This compact version of Plo Koon's starfighter includes a mini-figure of Plo Koon himself and is designed for younger builders or anyone who enjoys a playful take on iconic starfighters from the Star Wars universe. Released on June 1, 2025, it captures the essence of the Jedi starfighter with simplified building techniques.

   - **75411 Darth Maul Mech**
     - Also released on June 1, 2025, this set features a mech designed in the style of Darth Maul. It includes articulation and weapons, allowing for dynamic play. The set is aimed at both collectors and younger fans who enjoy building and engaging with villainous characters from the Star Wars series.

   - **75412 Death Trooper & Night Trooper Battle Pack**
     - Released on the same date, this battle pack comes with multiple min

In [34]:
sources_set = {doc.metadata["source"] for doc in rag_answer["context"]}

print("Citations:")
for src in sources_set:
    print(src)

Citations:
https://www.reddit.com/r/Legoleak/comments/1l7fbcw/every_upcoming_lego_ideas_set_details_june_2025/
https://jaysbrickblog.com/list-of-lego-sets-2025/
https://falconbricks.com/news/lego-2025-sets/


In [35]:
rag_answer = rag_mem_chain.invoke({"input": "What are some new video games that were released"}, config={"configurable": {"session_id": "irrelevant"}})

In [36]:
print(rag_answer["answer"])

As of July 2025, there are several notable video games that have been released recently or are on the horizon:

1. **The Legend of Zelda: Breath of the Wild 2**
   - Released in May 2025, this much-anticipated sequel builds upon the open-world gameplay of its predecessor, adding new locations, mechanics, and story elements that expand the Legend of Zelda universe.

2. **Final Fantasy VII Rebirth**
   - The follow-up to the critically acclaimed Final Fantasy VII Remake, Rebirth was released in March 2025. It continues the reimagined journey of Cloud Strife and his companions and introduces new gameplay mechanics and story arcs.

3. **Star Wars: Knights of the Old Republic Remake**
   - After many delays, the remake was officially released in April 2025. This updated version of the classic RPG features enhanced graphics and gameplay, allowing players to experience the iconic storyline with modern advancements in technology.

4. **Spider-Man 2**
   - Released in June 2025, this game featu

In [37]:
sources_set = {doc.metadata["source"] for doc in rag_answer["context"]}

print("Citations:")
for src in sources_set:
    print(src)

Citations:
https://jaysbrickblog.com/list-of-lego-sets-2025/
https://www.lego.com/en-us/categories/new-sets-and-products
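
The citations for the video-game question are LEGO pages, since the vector store was built only from the LEGO search. That suggests the answer was not grounded in the retrieved context. One possible safeguard, sketched here under assumptions, is to drop low-scoring chunks and decline when nothing relevant remains. The scores, the `0.5` threshold, and both helper names are invented for illustration, and the sketch assumes higher scores mean more relevant (some vector stores return distances instead).

```python
from typing import Callable, List, Tuple


def filter_by_score(scored_chunks: List[Tuple[str, float]], min_score: float = 0.5) -> List[str]:
    """Keep only chunks whose relevance score reaches min_score."""
    return [text for text, score in scored_chunks if score >= min_score]


def answer_or_decline(
    question: str,
    scored_chunks: List[Tuple[str, float]],
    generate: Callable[[str, List[str]], str],
    min_score: float = 0.5,
) -> str:
    """Call generate only when at least one chunk is relevant enough."""
    context = filter_by_score(scored_chunks, min_score)
    if not context:
        return "I don't know: no retrieved context was relevant enough."
    return generate(question, context)


# Invented scores: an off-topic question against LEGO-only chunks
off_topic = [("LEGO Star Wars June 2025", 0.31), ("LEGO City June 2025", 0.28)]
print(answer_or_decline("new video games?", off_topic, lambda q, ctx: "answer"))
```

Another option would be to rerun the search and indexing steps for each new question instead of reusing the first question's index.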
