## Web-search RAG chatbot

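A minimal Perplexity-style chatbot: answer a question with retrieval-augmented generation (RAG) over live web search results.

v0.1: no conversation history, just a single question and an answer grounded in web search.

Process:
1) Receive the user prompt.
2) Pass it to an LLM to generate search queries for gathering data (start with just one query).
3) Run a web search with that query, then scrape the top results and store them as documents (start with just one result).
4) Chunk the documents.
5) Embed the chunks.
6) Store the embeddings in a vector database.
7) Create a retriever over the vector database.
8) Pass the query and the retriever to the LLM with LangChain.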
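
The steps above can be read as one function composition. The sketch below is only an outline under stated assumptions: `Pipeline` and its field names are hypothetical, each stage is a plain callable, and steps 5 to 7 are collapsed into a single `retrieve` stage. The notebook cells that follow fill in real versions of each stage.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pipeline:
    """Hypothetical container for the stages in the process list."""
    make_query: Callable[[str], str]                   # step 2: prompt -> search query
    search: Callable[[str], List[str]]                 # step 3: query -> result URLs
    scrape: Callable[[str], str]                       # step 3: URL -> page text
    chunk: Callable[[str], List[str]]                  # step 4: text -> chunks
    retrieve: Callable[[str, List[str]], List[str]]    # steps 5-7: pick relevant chunks
    answer: Callable[[str, List[str]], str]            # step 8: prompt + context -> answer

    def run(self, prompt: str, max_results: int = 1) -> str:
        query = self.make_query(prompt)
        urls = self.search(query)[:max_results]
        # Flatten the chunks of every scraped page into one list.
        chunks = [c for url in urls for c in self.chunk(self.scrape(url))]
        context = self.retrieve(prompt, chunks)
        return self.answer(prompt, context)
```

Keeping each stage swappable makes it easy to start with one query and one result, then widen later without touching `run`.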

## 1) Receive prompt

In [1]:
user_prompt = "What are some popular new movies?"

## 2) Pass to LLM to create search term

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

In [3]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [4]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Adjacent string literals are concatenated as-is, so each sentence needs
# its own trailing space.
system_prompt = (
    "You are an assistant for gathering data based on this prompt. "
    "Use the following prompt to generate one web search query that would gather data which would be helpful in answering the prompt. "
    "Your search query should be effective at finding the most relevant and useful data pertaining to the prompt. "
    "Keep in mind that the current date is July 2025."
    "\n\n"
    "{input}"
)

search_query_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [5]:
search_query_gen_chain = {"input": RunnablePassthrough()} | search_query_prompt | llm | StrOutputParser()

In [6]:
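search_query = search_query_gen_chain.invoke(user_prompt)
# The model may or may not wrap the query in quotes. Strip them directly;
# ast.literal_eval raises an error when the reply is unquoted text.
search_query = search_query.strip().strip("\"'")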

Failed to multipart ingest runs: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/multipart', '{"error":"Forbidden"}\n')
Failed to send compressed multipart ingest: langsmith.utils.LangSmithError: Failed to POST https://api.smith.langchain.com/runs/multipart in LangSmith API. HTTPError('403 Client Error: Forbidden for url: https://api.smith.langchain.com/runs/multipart', '{"error":"Forbidden"}\n')

In [7]:
search_query = search_query.replace(' ', '+')

In [8]:
search_query

'popular+new+movies+released+in+2025'

In [9]:
search_url = f"https://google.com/search?q={search_query}"

In [10]:
search_url

'https://google.com/search?q=popular+new+movies+released+in+2025'
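
Replacing spaces with `+` works for this query, but a query containing characters such as `&` or `#` would produce a broken URL. A minimal sketch of a safer alternative, using the standard library's `urllib.parse.quote_plus`; the helper name `build_search_url` is hypothetical:

```python
from urllib.parse import quote_plus


def build_search_url(query: str, base: str = "https://google.com/search") -> str:
    """Percent-encode the query; quote_plus also turns spaces into '+'."""
    return f"{base}?q={quote_plus(query)}"


print(build_search_url("popular new movies released in 2025"))
# → https://google.com/search?q=popular+new+movies+released+in+2025
```

Note that the CSE cell later converts `+` back to spaces; with percent-encoding it would be simpler to keep the raw query and encode only when building the URL.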

## Web search for relevant links

In [12]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, unquote
import time

cse_key = os.environ["GOOGLE_CSE_KEY"]
cse_id = os.environ["GOOGLE_CSE_ID"]

# Test a single direct search request (no API).
print("searching...")
search_response = requests.get(url=search_url, timeout=10)

# Pause briefly so repeated runs do not flood the server with requests.
time.sleep(2)

soup = BeautifulSoup(search_response.text, "html.parser")
found_links = soup.find_all("a")

print(found_links)

searching...
[<a href="/httpservice/retry/enablejs?sei=ZZ1_aN2-LqyTseMPob-fGQ">here</a>, <a href="/search?q=popular+new+movies+released+in+2025&amp;sca_esv=397bf5706f45ee43&amp;ie=UTF-8&amp;emsg=SG_REL&amp;sei=ZZ1_aN2-LqyTseMPob-fGQ">click here</a>, <a href="https://support.google.com/websearch">feedback</a>]
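
The output above contains only three links, one of them to a `/httpservice/retry/enablejs` endpoint, which suggests the response was a "please enable JavaScript" page rather than search results. A rough sketch for detecting that case with the standard library's `html.parser`; treating the `enablejs` link as the signal is an assumption based on this single response, and `LinkCollector` and `looks_blocked` are hypothetical names:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def looks_blocked(html: str) -> bool:
    """Heuristic (assumption): the page links to the enablejs retry endpoint."""
    parser = LinkCollector()
    parser.feed(html)
    return any("/httpservice/retry/enablejs" in h for h in parser.hrefs)
```

When `looks_blocked(search_response.text)` is true, falling back to a search API, as the next cell does, avoids scraping the results page at all.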


In [20]:
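# Same search through the Google Custom Search JSON API (CSE).

# The API takes the plain query; requests handles the URL encoding.
cse_query = search_query.replace('+', ' ')

cse_url = 'https://www.googleapis.com/customsearch/v1'
cse_params = {
    'q': cse_query,
    'key': cse_key,
    'cx': cse_id
}

response = requests.get(cse_url, params=cse_params, timeout=10)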

In [23]:
cse_results = response.json()

target_url_set = set()

# Keep at most five unique result links.
for result in cse_results.get('items', []):
    if len(target_url_set) >= 5:
        break
    target_url_set.add(result['link'])

print(target_url_set)

{'https://editorial.rottentomatoes.com/guide/best-new-movies/', 'https://www.imdb.com/calendar/', 'https://editorial.rottentomatoes.com/article/the-most-anticipated-movies-of-2025/', 'https://deadline.com/lists/2025-movies/', 'https://www.imdb.com/list/ls566987552/'}
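
A set removes duplicates but loses the ranking order of the results, as the printed output shows. If rank matters (for example when starting with just the top result), a list-based variant is one option. This is a sketch: `top_links` is a hypothetical helper, and it assumes only the `items` and `link` fields already used above.

```python
def top_links(cse_results: dict, limit: int = 5) -> list:
    """Return up to `limit` unique result links, preserving rank order."""
    links = []
    for item in cse_results.get("items", []):
        link = item.get("link")
        if link and link not in links:
            links.append(link)
        if len(links) == limit:
            break
    return links
```

With this helper, `top_links(cse_results, limit=1)` would give the single highest-ranked page.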


## Get page as text data

In [24]:
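# These loaders and transformers live in the langchain_community package;
# importing them from `langchain` is deprecated.
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer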

# load html content from site
site_loader = AsyncHtmlLoader(list(target_url_set))
scraped_docs = site_loader.load()

# convert to text data
text_trns = Html2TextTransformer()
text_docs = text_trns.transform_documents(scraped_docs)

cleaned_text = text_docs[0].page_content
print(cleaned_text)

USER_AGENT environment variable not set, consider setting it to identify your requests.
Fetching pages: 100%|######################################################| 5/5 [00:02<00:00,  1.82it/s]


  *   * 

Home

Top Box Office

Tickets & Showtimes

DVD & Streaming

TV

News

SIGN UP LOG IN

Jester McGree

Logout

**About Rotten Tomatoes ®** **Critics**

__

__

  * __Home
  * __Box Office
  * __TV
  * __DVD
  * __MORE

  * News
  * __ __

Movies

TV Shows

Fanstore

News

Showtimes

  * Trending on RT

# Best New Movies of 2025, Ranked by Tomatometer

(Photo by Netflix/Courtesy Everett Collection. KPOP DEMON HUNTERS.)

* * *

**The latest:** In an auspicious debut for the new DCU, _**Superman **_is
Certified Fresh! Also in the mix: Nick Offerman political thriller
_**Sovereign**_ , and Netflix animation sensation _**KPop Demon Hunters**_.  
  
July kicks off with the promise of **40 Acres** , a **Cloud** , plus a
**Familiar Touch**.  
  
We've just added **_Deep Cover_** , **_F1 The Movie_** , **_Sorry, Baby_** ,
**_The Queen of My Dreams_** , **_Elio_** , and **_28 Years Later_**.

* * *

Welcome to the best new movies of 2025! (If you're looking for the previous
big list of 2

## Chunking data

In [25]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(text_docs)
print(len(splits))

194
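
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break at separators such as paragraph and line boundaries); it only illustrates how consecutive chunks share an overlapping tail. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Cut text into fixed-size windows that overlap by `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.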


## Embedding chunks and creating retriever

In [26]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()
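
Conceptually, the retriever embeds the query and returns the stored chunks whose vectors are closest to it. The toy sketch below shows that idea with cosine similarity in plain Python. It is an illustration only: the vectors are made up, `top_k` is a hypothetical helper, and the distance metric and index that the vector store actually uses are not shown in this notebook and may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in ranked[:k]]


docs = [[1, 0], [0, 1], [1, 1], [-1, 0]]
print(top_k([1, 0], docs, k=2))
# → [0, 2]
```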

## Comparing the non-RAG and RAG answers

In [27]:
non_rag_answer = llm.invoke(user_prompt)

In [28]:
print(non_rag_answer.content)

As of October 2023, some popular new movies that have garnered attention include:

1. **Killers of the Flower Moon** - Directed by Martin Scorsese, this film explores the Osage Indian murders and the birth of the FBI.
  
2. **Dune: Part Two** - The sequel to the acclaimed adaptation of Frank Herbert's science fiction novel, continuing the story of Paul Atreides.

3. **The Marvels** - This superhero film is a sequel to "Captain Marvel" and features Carol Danvers teaming up with other heroes.

4. **Napoleon** - Directed by Ridley Scott, this historical drama stars Joaquin Phoenix as the iconic French leader.

5. **The Hunger Games: The Ballad of Songbirds and Snakes** - A prequel to the original Hunger Games series, exploring the origins of the games and the character of Coriolanus Snow.

6. **Aquaman and the Lost Kingdom** - The sequel to the first Aquaman film, continuing the adventures of Arthur Curry.

7. **Wonka** - A new take on the classic story, focusing on Willy Wonka's early ye

In [29]:
rag_system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Make your answers fairly detailed. Remember that the current date is July 2025."
    "\n\n"
    "{context}"
)

rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", rag_system_prompt),
        ("human", "{input}"),
    ]
)

In [30]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

question_answer_chain = create_stuff_documents_chain(llm, rag_prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

In [31]:
rag_answer = rag_chain.invoke({"input": user_prompt})

In [32]:
print(rag_answer["context"])

[Document(id='cb9556b4-b775-4d0d-8ad5-90e74a03fa48', metadata={'language': 'en-US', 'title': '2025 Movies: Release Dates For Most Anticipated Films', 'source': 'https://deadline.com/lists/2025-movies/'}, page_content="**RELATED:The Movies That Have Made More Than $1 Billion At The Box Office**\n\nJanuary kicked it off with combination of theatrical releases like the\nChristopher Abbott-starring _Wolf Man_ and streamer offerings including\nNetflix's _Back In Action_ starring Jamie Foxx and Cameron Diaz as well as\nPrime Video's _You 're Cordially Invited_ starring Will Ferrell and Reese\nWitherspoon. February marked the first of the Marvel films releasing in 2025\n-- _Captain America: Brave New World_ -- as well as the return of Renée\nZellweger's Bridget Jones in the franchise's fourth installment _Bridget\nJones: Mad About the Boy_ , which will be streaming on Peacock just in time\nfor Valentine's Day.\n\n**RELATED:29 Of The Most Anticipated New And Returning TV Shows Of 2025**\n\nScr

In [33]:
src_count = len(rag_answer["context"])
print(src_count)

4


In [34]:
# Collect the unique source URLs of the retrieved chunks.
sources_set = {doc.metadata["source"] for doc in rag_answer["context"]}

print(sources_set)

{'https://deadline.com/lists/2025-movies/', 'https://editorial.rottentomatoes.com/guide/best-new-movies/', 'https://editorial.rottentomatoes.com/article/the-most-anticipated-movies-of-2025/'}
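
A set gives unique sources but in arbitrary order. If citations should follow retrieval order (most relevant chunk first), `dict.fromkeys` is one way to deduplicate while keeping order. In this sketch `Doc` is a stand-in for the retrieved document objects, of which only the `metadata` attribute is assumed, and `unique_sources` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document; only `metadata` is used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(docs) -> list:
    """Return source URLs in first-seen order, skipping docs without one."""
    return list(
        dict.fromkeys(d.metadata["source"] for d in docs if "source" in d.metadata)
    )
```

Applied to the run above, `unique_sources(rag_answer["context"])` would list each cited page once, in the order the retriever returned it.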


RAG answer:

In [35]:
print(rag_answer["answer"])

As of July 2025, some popular new movies that have been released or are anticipated include:

1. **Jurassic World: Rebirth** - Released on July 2, 2025, this film is part of the popular Jurassic franchise and follows the exciting adventures within a world where dinosaurs still roam.

2. **Captain America: Brave New World** - Released in February 2025, this Marvel installment continues the story of Captain America, contributing to the overall Marvel Cinematic Universe.

3. **Bridget Jones: Mad About the Boy** - Featuring Renée Zellweger reprising her role as Bridget Jones, this film is streaming on Peacock and was released just in time for Valentine's Day 2025.

4. **The Fantastic Four: First Steps** - Set to debut later in July 2025, this film introduces the beloved superhero team in a new narrative.

5. **Freakier Friday** - Scheduled for August 2025, this film is a fresh take on the concept of body switching, which has been popular in filmmaking.

These films span a variety of genres

In [36]:
print("Citations:")
for src in sources_set:
    print(src)

Citations:
https://deadline.com/lists/2025-movies/
https://editorial.rottentomatoes.com/guide/best-new-movies/
https://editorial.rottentomatoes.com/article/the-most-anticipated-movies-of-2025/
