## Web-search RAG chatbot

A minimal web-search RAG chatbot in the style of Perplexity.

v0.1: no conversation history, just single-turn question answering grounded in web search results.

Process:
1) receive the user prompt
2) pass it to an LLM to generate a search query for gathering data (start with just one query)
3) run a web search with that query, then scrape the top results and store them as documents (the code below keeps up to five links)
4) chunk the documents
5) embed the chunks
6) store the embeddings in a vector DB
7) create a retriever over the vector DB
8) pass the user prompt and the retriever to the LLM with LangChain
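
The steps above can be read as one function that threads the prompt through each stage. The sketch below is only an outline of that data flow: the name `answer_with_web_rag` and every callable it takes are hypothetical placeholders, not part of LangChain or of this notebook, and steps 5 to 7 are collapsed into a single `retrieve` callable.

```python
from typing import Callable, Sequence


def answer_with_web_rag(
    prompt: str,
    make_query: Callable[[str], str],
    search: Callable[[str], Sequence[str]],
    scrape: Callable[[str], str],
    chunk: Callable[[str], Sequence[str]],
    retrieve: Callable[[Sequence[str], str], Sequence[str]],
    generate: Callable[[str, Sequence[str]], str],
) -> str:
    """Hypothetical outline of the pipeline; each stage is injected as a callable."""
    query = make_query(prompt)                            # step 2: prompt -> search query
    urls = search(query)                                  # step 3: search query -> result URLs
    pages = [scrape(url) for url in urls]                 # step 3: URL -> page text
    chunks = [c for page in pages for c in chunk(page)]   # step 4: page text -> chunks
    context = retrieve(chunks, prompt)                    # steps 5-7: embed, store, retrieve
    return generate(prompt, context)                      # step 8: answer from prompt + context
```

Keeping each stage behind a callable makes it easy to swap a toy implementation (for example, a single search result) for a more robust one later without touching the rest of the flow.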

## 1) Receive prompt

In [66]:
user_prompt = "What are some popular new movies?"

## 2) Pass to LLM to create search term

In [33]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

In [34]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [35]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Adjacent string literals are concatenated as-is, so each sentence needs a trailing space.
system_prompt = (
    "You are an assistant for gathering data based on this prompt. "
    "Use the following prompt to generate one web search query that would gather data which would be helpful in answering the prompt. "
    "Your search query should be effective at finding the most relevant and useful data pertaining to the prompt. "
    "Keep in mind that the current date is July 2025."
    "\n\n"
    "{input}"
)

search_query_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [36]:
search_query_gen_chain = {"input": RunnablePassthrough()} | search_query_prompt | llm | StrOutputParser()

In [67]:
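# The model usually wraps the query in quotes. Strip them directly instead of using
# ast.literal_eval, which raises if the output is not a valid Python literal.
search_query = search_query_gen_chain.invoke(user_prompt)
search_query = search_query.strip().strip("\"'")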

In [68]:
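from urllib.parse import quote_plus

# quote_plus turns spaces into '+' and also escapes characters such as '&' and '#'
search_query = quote_plus(search_query)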

In [69]:
search_query

'popular+movies+released+in+2024+and+2025'

In [70]:
google_search_url = f"https://www.google.com/search?q={search_query}"

In [71]:
google_search_url

'https://www.google.com/search?q=popular+movies+released+in+2024+and+2025'

## Web search for relevant links

In [72]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, unquote
import time

MAX_RESULTS = 5
MAX_ATTEMPTS = 3  # bounded retries so an empty or blocked page cannot loop forever

target_url_set = set()

for attempt in range(MAX_ATTEMPTS):
    search_response = requests.get(url=google_search_url, timeout=10)

    soup = BeautifulSoup(search_response.text, "html.parser")
    # href=True skips anchors that have no href attribute (avoids a KeyError below)
    found_links = soup.find_all("a", href=True)

    for link in found_links:
        if len(target_url_set) >= MAX_RESULTS:
            break

        href = link["href"]
        if href.startswith("/url?q="):
            # exclude irrelevant findings like maps etc.
            # toy version for now, replace with a more robust process like a blacklist text file later
            target_url = unquote(href.split("/url?q=")[1].split("&")[0])
            domain = urlparse(target_url).netloc.lower()
            if domain.endswith("google.com") or domain.endswith("youtube.com"):
                continue  # not a real organic result

            target_url_set.add(target_url)

    if target_url_set:
        break

    # pause before retrying so we do not overload the search endpoint
    time.sleep(2)

print(target_url_set)

{'https://www.imdb.com/year/2025', 'https://editorial.rottentomatoes.com/guide/best-movies-of-2019/', 'https://www.imdb.com/list/ls540805315/', 'https://editorial.rottentomatoes.com/guide/best-new-rom-coms-romance-movies/', 'https://editorial.rottentomatoes.com/guide/best-new-movies/'}
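
Because the links are collected in a set, the ranking of the search results is lost, and the string-splitting approach depends on the exact shape of the `/url?q=` links. A possible alternative, sketched here with only the standard library, parses the query string and returns an ordered list. The function name `extract_result_urls` and its parameters are hypothetical, and the assumption that results arrive as `/url?q=<target>&...` links is taken from the output shown in this notebook, not from any documented Google format.

```python
from urllib.parse import urlparse, parse_qs


def extract_result_urls(hrefs, limit=5, blocked_suffixes=("google.com", "youtube.com")):
    """Return up to `limit` distinct target URLs, in the order they were found."""
    results = []
    for href in hrefs:
        parsed = urlparse(href)
        if parsed.path != "/url":
            continue  # navigation links such as /search?... are not results
        # parse_qs decodes percent-escapes, so no separate unquote() is needed
        targets = parse_qs(parsed.query).get("q", [])
        if not targets:
            continue
        target = targets[0]
        domain = urlparse(target).netloc.lower()
        # match the domain itself or any subdomain, but not e.g. "notgoogle.com"
        if any(domain == s or domain.endswith("." + s) for s in blocked_suffixes):
            continue
        if target not in results:
            results.append(target)
        if len(results) >= limit:
            break
    return results
```

It could be called as `extract_result_urls(a["href"] for a in soup.find_all("a", href=True))`. Keeping the rank order makes it possible to prefer the top result when only one page is scraped.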


In [73]:
for link in found_links:
    print(link)

<a href="/?sa=X&amp;ved=0ahUKEwiA3Pi0ycaOAxUq1TgGHZFoLs4QOwgC"><span class="V6gwVd">G</span><span class="iWkuvd">o</span><span class="cDrQ7">o</span><span class="V6gwVd">g</span><span class="ntlR9">l</span><span class="iWkuvd tJ3Myc">e</span></a>
<a href="/search?q=popular+movies+released+in+2024+and+2025&amp;sca_esv=b3f6a4e4ace9f962&amp;ie=UTF-8&amp;gbv=1&amp;sei=J1R6aMCsOKqq4-EPkdG58Qw">here</a>
<a class="eZt8xd" href="/search?q=popular+movies+released+in+2024+and+2025&amp;sca_esv=b3f6a4e4ace9f962&amp;ie=UTF-8&amp;tbm=nws&amp;source=lnms&amp;sa=X&amp;ved=0ahUKEwiA3Pi0ycaOAxUq1TgGHZFoLs4Q_AUIBigB">News</a>
<a class="eZt8xd" href="/search?q=popular+movies+released+in+2024+and+2025&amp;sca_esv=b3f6a4e4ace9f962&amp;ie=UTF-8&amp;tbm=isch&amp;source=lnms&amp;sa=X&amp;ved=0ahUKEwiA3Pi0ycaOAxUq1TgGHZFoLs4Q_AUIBygC">Images</a>
<a href="/url?q=https://maps.google.com/maps%3Fq%3Dpopular%2Bmovies%2Breleased%2Bin%2B2024%2Band%2B2025%26um%3D1%26ie%3DUTF-8%26ved%3D1t:200713%26ictx%3D111&amp;opi=899

## Get page as text data

In [74]:
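# these integrations live in the langchain_community package in current LangChain releases
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer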

# load html content from site
site_loader = AsyncHtmlLoader(list(target_url_set))
scraped_docs = site_loader.load()

# convert to text data
text_trns = Html2TextTransformer()
text_docs = text_trns.transform_documents(scraped_docs)

cleaned_text = text_docs[0].page_content
print(cleaned_text)

Fetching pages: 100%|######################################################| 5/5 [00:04<00:00,  1.09it/s]


Menu

Movies

Release calendarTop 250 moviesMost popular moviesBrowse movies by genreTop box
officeShowtimes & ticketsMovie newsIndia movie spotlight

TV shows

What's on TV & streamingTop 250 TV showsMost popular TV showsBrowse TV shows
by genreTV news

Watch

What to watchLatest trailersIMDb OriginalsIMDb PicksIMDb SpotlightFamily
entertainment guideIMDb Podcasts

Awards & events

EmmysSuperheroes GuideSan Diego Comic-ConSummer Watch GuideBest Of 2025 So
FarDisability Pride MonthSTARmeter AwardsAwards CentralFestival CentralAll
events

Celebs

Born todayMost popular celebsCelebrity news

Community

Help centerContributor zonePolls

For industry professionals

* Language

English (United States)

  * Language
  * 

  * Fully supported
  *   * English (United States)
  * 
Partially supported

  *   * Français (Canada)
  * Français (France)
  * Deutsch (Deutschland)
  * हिंदी (भारत)
  * Italiano (Italia)
  * Português (Brasil)
  * Español (España)
  * Español (México)

AllAll

Watchlist

## Chunking data

In [75]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(text_docs)
print(len(splits))

217
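
To make the two splitter parameters concrete, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and lines); the function `sliding_chunks` is a hypothetical helper that only illustrates what `chunk_size` and `chunk_overlap` mean.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.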


## Embedding chunks and creating retriever

In [76]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()
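
Conceptually, the retriever embeds the question and returns the chunks whose embeddings are closest to it. The toy sketch below shows that idea with plain Python lists and cosine similarity. It is an assumption-laden illustration: the helper names are hypothetical, the vectors are made up, and Chroma's actual distance metric and index structure are configurable and are not reproduced here. The default `k=4` mirrors the four context documents this notebook retrieves.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=4):
    """Return the indices of the k chunk vectors most similar to the query vector."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]
```

A real vector store avoids scoring every chunk by using an approximate nearest-neighbour index, which matters once the collection grows well beyond a few hundred chunks.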

## Comparing non-RAG and RAG answers

In [77]:
non_rag_answer = llm.invoke(user_prompt)

In [78]:
print(non_rag_answer.content)

As of October 2023, here are some popular new movies that have generated buzz:

1. **Killers of the Flower Moon** - Directed by Martin Scorsese and starring Leonardo DiCaprio and Robert De Niro, this crime drama explores the Osage murders in the 1920s.

2. **Dune: Part Two** - The highly anticipated sequel to Denis Villeneuve's adaptation of Frank Herbert's epic sci-fi novel continues the story of Paul Atreides.

3. **Marvel's The Marvels** - This superhero film features Captain Marvel, Ms. Marvel, and Monica Rambeau as they team up to face a new threat.

4. **Napoleon** - Directed by Ridley Scott, this historical drama stars Joaquin Phoenix as the infamous French leader Napoleon Bonaparte.

5. **Five Nights at Freddy's** - Based on the popular video game series, this horror film brings the animatronic horrors to life.

6. **The Hunger Games: The Ballad of Songbirds and Snakes** - A prequel to the original Hunger Games trilogy, focusing on a young Coriolanus Snow.

7. **Wish** - An ani

In [86]:
rag_system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Make your answers fairly detailed. Remember that the current date is July 2025."
    "\n\n"
    "{context}"
)

rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", rag_system_prompt),
        ("human", "{input}"),
    ]
)

In [87]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

question_answer_chain = create_stuff_documents_chain(llm, rag_prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
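
The "stuff" strategy simply concatenates the retrieved chunks and substitutes them into the `{context}` placeholder of the system prompt. The sketch below imitates that step with plain strings so the mechanics are visible. `stuff_context` and `build_messages` are hypothetical helpers, not LangChain functions, and the blank-line separator is an assumption about a sensible default.

```python
def stuff_context(chunks, separator="\n\n"):
    """Join retrieved chunk texts into one context string."""
    return separator.join(chunks)


def build_messages(system_template, question, chunks):
    """Fill the {context} slot of the system template and pair it with the question."""
    system_text = system_template.format(context=stuff_context(chunks))
    return [("system", system_text), ("human", question)]
```

Since every retrieved chunk is pasted into a single prompt, the number of chunks times the chunk size has to fit comfortably inside the model's context window.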

In [88]:
rag_answer = rag_chain.invoke({"input": user_prompt})

In [89]:
print(rag_answer["context"])

[Document(id='07989160-fb62-43d6-8982-9208526ddf35', metadata={'title': 'Advanced search', 'source': 'https://www.imdb.com/year/2025', 'language': 'en-US', 'description': ''}, page_content='* * *\n\n* * *\n\nIn theaters\n\n* * *\n\nShow all titles\n\nIn theaters near you\n\nIn favorite theaters\n\nIn theaters with online ticketing (US only)\n\n* * *\n\n* * *\n\nAdult titles\n\n* * *\n\nExclude\n\nInclude\n\n* * *\n\n1-50 of 11,722\n\nSort byPopularityPopularityA-ZUser ratingNumber of ratingsUS box\nofficeRuntimeYearRelease date\n\n  * ### 1\\. Superman\n\n20252h 9mPG-1368Metascore\n\n7.6 (118K)Rate\n\nMark as watched\n\nSuperman must reconcile his alien Kryptonian heritage with his human\nupbringing as reporter Clark Kent. As the embodiment of truth, justice and the\nhuman way he soon finds himself in a world that views these as old-fashioned.\n\n  * ### 2\\. Jurassic World: Rebirth\n\n20252h 13mPG-1350Metascore\n\n6.2 (58K)Rate\n\nMark as watched\n\nFive years post-Jurassic World: Dom

In [90]:
src_count = len(rag_answer["context"])
print(src_count)

4


In [91]:
# collect the distinct source URLs of the retrieved chunks
sources_set = {doc.metadata["source"] for doc in rag_answer["context"]}

print(sources_set)

{'https://editorial.rottentomatoes.com/guide/best-new-rom-coms-romance-movies/', 'https://www.imdb.com/year/2025', 'https://editorial.rottentomatoes.com/guide/best-new-movies/', 'https://editorial.rottentomatoes.com/guide/best-movies-of-2019/'}


RAG answer:

In [92]:
print(rag_answer["answer"])

As of July 2025, some popular new movies in theaters include:

1. **Superman** - This film has received a positive reception, boasting a Metascore of 68 and a user rating of 7.6 based on 118K ratings. It explores Superman's struggle to reconcile his Kryptonian heritage with his human upbringing as Clark Kent, delivering themes of truth, justice, and humanity.

2. **Jurassic World: Rebirth** - Continuing the Jurassic franchise, this movie has a runtime of 2 hours and 13 minutes and is rated PG-13. It follows an expedition five years after "Jurassic World: Dominion," focusing on the extraction of DNA from prehistoric creatures for a medical breakthrough. It has a Metascore of 50 and a user rating of 6.2 based on 58K ratings.

3. **F1: The Movie** - A documentary-style movie running 2 hours and 35 minutes, F1: The Movie is rated PG-13 and has a Metascore of 68 and a user rating of 7.9 from 110K ratings. It likely delves into the world of Formula 1 racing, though specific details about its

In [93]:
print("Citations:")
for src in sources_set:
    print(src)

Citations:
https://editorial.rottentomatoes.com/guide/best-new-rom-coms-romance-movies/
https://www.imdb.com/year/2025
https://editorial.rottentomatoes.com/guide/best-new-movies/
https://editorial.rottentomatoes.com/guide/best-movies-of-2019/
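
A set drops the order in which the sources were retrieved, so the citation list above is in arbitrary order. One possible refinement, sketched below, deduplicates while keeping the retrieval order and numbers the citations. The helper names are hypothetical, and plain dictionaries with a `"source"` key stand in for the `metadata` of the retrieved documents.

```python
def ordered_sources(metadata_list):
    """Deduplicate source URLs while keeping first-seen (retrieval) order."""
    # dict.fromkeys preserves insertion order and ignores repeated keys
    return list(dict.fromkeys(meta["source"] for meta in metadata_list))


def format_citations(sources):
    """Render sources as a numbered citation block."""
    lines = ["Citations:"]
    lines.extend(f"[{n}] {src}" for n, src in enumerate(sources, start=1))
    return "\n".join(lines)
```

With the chain output from this notebook it could be used as `format_citations(ordered_sources([d.metadata for d in rag_answer["context"]]))`.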
