## Web-search RAG chatbot

Similar to Perplexity.

v0.1: no conversation history, just single-turn question answering based on a web search.

Process:
1) Receive the user prompt.
2) Pass it to an LLM to generate suitable search terms for gathering data (start with just one query).
3) Run a web search with that query, then scrape the top results and store them as documents (start with just one result).
4) Chunk the data.
5) Embed the chunks.
6) Store the embeddings in a vector DB.
7) Link a retriever to the vector DB.
8) Pass the query and the retriever to the LLM with LangChain.
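
As a rough orientation, the steps above can be sketched as one function that wires the stages together. This is only a skeleton under assumptions: `answer_from_web` and all of its callable parameters are hypothetical names, and each stage is injected so the real LLM, scraper, and vector store can be swapped in later.

```python
from typing import Callable, List


def answer_from_web(
    prompt: str,
    make_query: Callable[[str], str],
    fetch_page_text: Callable[[str], str],
    split: Callable[[str], List[str]],
    retrieve: Callable[[List[str], str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Hypothetical skeleton of the v0.1 pipeline; every stage is a stand-in."""
    query = make_query(prompt)           # step 2: LLM writes a search query
    page_text = fetch_page_text(query)   # step 3: search, then scrape one result
    chunks = split(page_text)            # step 4: chunk the scraped text
    context = retrieve(chunks, prompt)   # steps 5-7: embed, store, retrieve
    return generate(prompt, context)     # step 8: answer using the context
```

Keeping the stages as plain callables makes it easy to test the flow with stubs before any API key or network access is involved.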

## 1) Receive prompt

In [26]:
user_prompt = "What is the latest Star Wars movie or show?"

## 2) Pass to LLM to create search term

In [3]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

In [4]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [5]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Adjacent string literals are concatenated as-is, so each sentence needs a trailing space.
system_prompt = (
    "You are an assistant for gathering data based on this prompt. "
    "Use the following prompt to generate one web search query that would gather data which would be helpful in answering the prompt. "
    "Your search query should be effective at finding the most relevant and useful data pertaining to the prompt. "
    "Keep in mind that the current date is July 2025."
    "\n\n"
    "{input}"
)

search_query_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [6]:
search_query_gen_chain = {"input": RunnablePassthrough()} | search_query_prompt | llm | StrOutputParser()

In [27]:
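search_query = search_query_gen_chain.invoke(user_prompt)
# The model often wraps the query in quotes. Strip them directly rather than
# using ast.literal_eval, which raises on unquoted text.
search_query = search_query.strip().strip("\"'")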

In [28]:
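from urllib.parse import quote_plus

# quote_plus turns spaces into '+' and also escapes characters such as '&' or '#'.
search_query = quote_plus(search_query)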

In [29]:
search_query

'latest+Star+Wars+movie+or+show+release+July+2025'

In [30]:
google_search_url = f"https://www.google.com/search?q={search_query}"

In [31]:
google_search_url

'https://www.google.com/search?q=latest+Star+Wars+movie+or+show+release+July+2025'

## TODO
Rerun the search step until it returns good results.
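
One possible shape for this TODO is a bounded retry loop. The sketch below is an assumption, not the notebook's implementation: `search_until_good`, `make_query`, and `find_result` are hypothetical names, and "good" is reduced to "a usable URL was found".

```python
from typing import Callable, Optional


def search_until_good(
    prompt: str,
    make_query: Callable[[str, int], str],
    find_result: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate the search query until a usable result URL turns up.

    make_query receives the attempt number so it can ask the LLM for a
    different phrasing on each retry. Returns None if every attempt fails.
    """
    for attempt in range(max_attempts):
        query = make_query(prompt, attempt)
        url = find_result(query)
        if url:  # treat any non-empty URL as a good result
            return url
    return None
```

Capping the attempts avoids looping forever (and paying for LLM calls) when no query yields a usable page.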

In [38]:
import requests

# A timeout keeps the cell from hanging if the request stalls.
search_response = requests.get(url=google_search_url, timeout=10)
search_response

<Response [200]>

In [39]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(search_response.text, "html.parser")
found_links = soup.find_all("a")

In [40]:
found_links

[<a href="/?sa=X&amp;ved=0ahUKEwiCl8SOjMSOAxWBwzgGHTtCBSsQOwgC"><span class="V6gwVd">G</span><span class="iWkuvd">o</span><span class="cDrQ7">o</span><span class="V6gwVd">g</span><span class="ntlR9">l</span><span class="iWkuvd tJ3Myc">e</span></a>,
 <a href="/search?q=latest+Star+Wars+movie+or+show+release+July+2025&amp;sca_esv=ed31745f76a805c8&amp;ie=UTF-8&amp;gbv=1&amp;sei=cQd5aMKaFIGH4-EPu4SV2AI">here</a>,
 <a class="eZt8xd" href="/search?q=latest+Star+Wars+movie+or+show+release+July+2025&amp;sca_esv=ed31745f76a805c8&amp;ie=UTF-8&amp;tbm=nws&amp;source=lnms&amp;sa=X&amp;ved=0ahUKEwiCl8SOjMSOAxWBwzgGHTtCBSsQ_AUIBigB">News</a>,
 <a class="eZt8xd" href="/search?q=latest+Star+Wars+movie+or+show+release+July+2025&amp;sca_esv=ed31745f76a805c8&amp;ie=UTF-8&amp;tbm=isch&amp;source=lnms&amp;sa=X&amp;ved=0ahUKEwiCl8SOjMSOAxWBwzgGHTtCBSsQ_AUIBygC">Images</a>,
 <a href="/url?q=https://maps.google.com/maps%3Fq%3Dlatest%2BStar%2BWars%2Bmovie%2Bor%2Bshow%2Brelease%2BJuly%2B2025%26um%3D1%26ie%3DUTF

In [41]:
from urllib.parse import urlparse, unquote

target_url = ''

for link in found_links:
    # Some anchors have no href; .get avoids a KeyError.
    href = link.get('href', '')
    if not href.startswith('/url?q='):
        continue

    # Exclude irrelevant results such as Maps links.
    # Toy version for now; replace with a more robust process (e.g. a blacklist text file) later.
    candidate = unquote(href.split('/url?q=')[1].split('&')[0])
    domain = urlparse(candidate).netloc.lower()
    if domain.endswith('google.com'):
        continue  # not a real organic result

    # Only assign after the check, so a skipped Google link never leaks through.
    target_url = candidate
    break

print(target_url)

https://in.ign.com/obi-wan-kenobi/118620/feature/upcoming-and-next-star-wars-movies-and-tv-shows-2022-release-dates-and-beyond
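
The comment in the cell above mentions moving to a blacklist text file later. A minimal sketch of that idea follows, using only the standard library. The file format (one domain per line, `#` for comments) and the helper names `load_blocklist`, `is_blocked`, and `first_organic_url` are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlparse


def load_blocklist(lines):
    """Parse blocklist lines: one domain per line, '#' starts a comment."""
    domains = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip().lower()
        if entry:
            domains.add(entry)
    return domains


def is_blocked(url, blocklist):
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocklist)


def first_organic_url(hrefs, blocklist):
    """Return the first '/url?q=...' target that is not blocked, else None."""
    for href in hrefs:
        if not href.startswith("/url?"):
            continue
        # parse_qs also percent-decodes the target URL.
        targets = parse_qs(urlparse(href).query).get("q", [])
        if targets and not is_blocked(targets[0], blocklist):
            return targets[0]
    return None
```

With a real file this could be called as `load_blocklist(open("blocklist.txt"))`, where the filename is hypothetical. Matching on `"." + domain` rather than a bare suffix keeps an unrelated host such as `notgoogle.com` from being filtered out.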


## Get page as text data

In [42]:
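# These classes live in langchain_community; the langchain.* import paths are deprecated.
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer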

# load html content from site
site_loader = AsyncHtmlLoader(target_url)
scraped_docs = site_loader.load()

# convert to text data
text_trns = Html2TextTransformer()
text_docs = text_trns.transform_documents(scraped_docs)

cleaned_text = text_docs[0].page_content
print(cleaned_text)

Fetching pages: 100%|######################################################| 1/1 [00:00<00:00,  8.08it/s]


IGN Logo

India

  * Home
  * IGNdia
  * News
  * Reviews
  * Interviews
  * Features
  * Lists
  * Opinions & Columns
  * Guides
  * Wiki
  * Videos
  * Gaming
  * Tech
  * Anime and Manga
  * Science
  * Comics
  * Follow IGN India
  * More

  * Search

Home

### Reviews

  * All Reviews
  * Game Reviews
  * Movie Reviews
  * TV Reviews
  * Comic Reviews
  * Tech Reviews

Home

### Gaming

  * PC
  * PlayStation
  * Xbox
  * Mobile
  * Esports
  * Nintendo Switch

Home

### Anime and Manga

  * All Anime Coverage
  * Cosplay
  * IGN Otaku Update

Home

### Follow IGN India

  * Instagram
  * Facebook
  * Twitter/X
  * YouTube

Home

### More

  * About IGN India
  * Contact the Editorial Staff
  * Advertise
  * Press
  * User Agreement
  * Privacy Policy
  * Cookie Policy
  * RSS

  * IGN India is operated by Fork Media Group Pvt. Ltd. under license from IGN Entertainment and its affiliates.

Change Region United States United Kingdom Australia Africa Adria
Serbian/Croatian Adria Slo

## Chunking data

In [43]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(text_docs)
print(len(splits))

19
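
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers breaking on separators such as paragraphs and newlines); `sliding_window_chunks` is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far the window advances each time
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval at the cost of storing some text twice.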


## Embedding chunks and creating retriever

In [44]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

## Comparing non-RAG and RAG answers

In [59]:
non_rag_answer = llm.invoke(user_prompt)

In [61]:
print(non_rag_answer.content)

As of October 2023, the latest Star Wars series is *Ahsoka*, which premiered on Disney+ in August 2023. The most recent movie in the Star Wars franchise is *Rogue Squadron*, which was previously scheduled but has faced delays and has not yet been released. Always check official sources for the most up-to-date information, as new content may be announced or released after my last update.


In [54]:
rag_system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Keep your answer relatively concise. Remember that the current date is July 2025."
    "\n\n"
    "{context}"
)

rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", rag_system_prompt),
        ("human", "{input}"),
    ]
)

In [55]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

question_answer_chain = create_stuff_documents_chain(llm, rag_prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

In [56]:
rag_answer = rag_chain.invoke({"input": user_prompt})

In [57]:
print(rag_answer["answer"])

The latest Star Wars show released is "Skeleton Crew," which debuted on December 2, 2024.
