<a href="https://colab.research.google.com/github/HirunaD/LangChain/blob/main/07_2_hHybrid_Search_Rag_using_Langchain_and_OpenAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hybrid Search RAG Using LangChain and OpenAI**

In [1]:
!pip install pypdf -q
!pip install langchain -q
!pip install langchain_community -q
!pip install langchain_openai -q
!pip install langchain_chroma -q
!pip install rank_bm25 -q

In [2]:
# Import necessary libraries
import os
from google.colab import userdata

**Initialize OpenAI LLM**

In [4]:
from langchain_openai import ChatOpenAI

# Set OpenAI API key
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

# Initialize the ChatOpenAI model
llm = ChatOpenAI(
    model="gpt-4.1-nano",
    temperature=0
)

**Initialize Embedding Model**

In [5]:
from langchain_openai import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

**Load PDF Document**

In [6]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/codeprolk.pdf")

docs = loader.load()

**Split Documents into Chunks**

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunks of up to 250 characters, with 30 characters shared between neighbours
splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=30)

chunks = splitter.split_documents(docs)

In [8]:
len(chunks)

33
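
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, which this hypothetical `sliding_window_chunks` helper ignores.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each piece starts with the last three characters of the previous one, which is what lets a sentence that straddles a boundary appear whole in at least one chunk.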

**Create Semantic Search Retriever**

In [9]:
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(chunks, embedding_model)

vectorstore_retreiver = vectorstore.as_retriever(search_kwargs={"k": 2})

In [10]:
vectorstore_retreiver

VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_chroma.vectorstores.Chroma object at 0x7d53bcac2ad0>, search_kwargs={'k': 2})
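
Conceptually, the semantic retriever embeds the query and returns the `k` chunks whose vectors lie closest to it. The sketch below uses hand-made 3-dimensional vectors in place of real embeddings. The vectors, the chunk labels, and the `top_k_by_cosine` helper are illustrative assumptions. Chroma's actual index and distance metric depend on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k vectors most similar to query_vec."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" (made up for illustration)
toy_vectors = {
    "youtube channel": [0.9, 0.1, 0.0],
    "community events": [0.6, 0.6, 0.1],
    "pricing": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]
nearest = top_k_by_cosine(query, toy_vectors, k=2)
print(nearest)
```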

**Create Keyword Search Retriever**

In [11]:
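# BM25Retriever is implemented in langchain_community, so import it from there directly
from langchain_community.retrievers import BM25Retriever

keyword_retriever = BM25Retriever.from_documents(chunks)

# Return the top 2 chunks per query, matching the semantic retriever
keyword_retriever.k = 2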

In [12]:
keyword_retriever

BM25Retriever(vectorizer=<rank_bm25.BM25Okapi object at 0x7d53b8385890>, k=2)
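
BM25 ranks chunks by how often the query terms occur in them, giving more weight to rare terms and normalizing for chunk length. The following is a minimal plain-Python sketch of Okapi BM25 scoring, not the `rank_bm25` implementation. The lowercased whitespace tokenization, the `k1` and `b` values, the `+1` inside the IDF logarithm, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in corpus against query with a basic Okapi BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    "the youtube channel hosts popular videos",
    "the platform offers consultation services",
    "webinars and hackathons engage the community",
]
scores = bm25_scores("popular videos", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

Because scoring depends on exact token matches, a query phrased with synonyms can miss relevant chunks. Pairing BM25 with the semantic retriever is meant to cover that gap.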

**Create Hybrid Search Retriever**

In [13]:
from langchain.retrievers import EnsembleRetriever

ensemble_retriever = EnsembleRetriever(
    retrievers=[vectorstore_retreiver, keyword_retriever],
    weights=[0.5, 0.5],
)

In [14]:
ensemble_retriever

EnsembleRetriever(retrievers=[VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_chroma.vectorstores.Chroma object at 0x7d53bcac2ad0>, search_kwargs={'k': 2}), BM25Retriever(vectorizer=<rank_bm25.BM25Okapi object at 0x7d53b8385890>, k=2)], weights=[0.5, 0.5])
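
`EnsembleRetriever` is documented as merging its retrievers' rankings with weighted Reciprocal Rank Fusion (RRF): each document earns `weight / (c + rank)` from every list it appears in, and the sums decide the final order. The sketch below illustrates that idea on plain string IDs. The helper name `weighted_rrf`, the labels `A`, `B`, `C`, and the constant `c=60` are assumptions for illustration, not LangChain's code.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # sorted() is stable, so documents with equal scores keep first-seen order
    return sorted(scores, key=scores.get, reverse=True)


semantic_hits = ["A", "C"]  # ranked output of a semantic retriever
keyword_hits = ["A", "B"]   # ranked output of a keyword retriever
fused = weighted_rrf([semantic_hits, keyword_hits], weights=[0.5, 0.5])
print(fused)
```

A document returned by both retrievers (`A`) accumulates score from both lists and rises to the top. Documents found by only one retriever follow, so with `k=2` per retriever the fused list can hold between two and four unique chunks.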

**Define Prompt Template**

In [15]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Define a message template for the chatbot
message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

# Create a chat prompt template from the message
prompt = ChatPromptTemplate.from_messages([("human", message)])
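
The `{context}` placeholder receives whatever the retriever returns. When a list of `Document` objects is passed in directly, it is rendered with Python's default string conversion, so metadata ends up in the prompt alongside the text. A common pattern is to join only the `page_content` fields first. The sketch below assumes a hypothetical `format_docs` helper and uses a stand-in `FakeDocument` class so that it runs without LangChain installed.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [
    FakeDocument("First chunk.", {"page": 0}),
    FakeDocument("Second chunk.", {"page": 1}),
]
context = format_docs(retrieved)
print(context)
```

In a chain, a helper like this could be piped after the retriever, for example `"context": ensemble_retriever | format_docs`, so the model sees clean text only.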

**Create RAG Chain with Hybrid Search**

In [16]:
chain = (
    {
        "context": ensemble_retriever,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
)

**Invoke RAG Chain with Example Questions**

In [21]:
response = chain.invoke("what is about the document?")

print(response.content)

The document is about CodePRO LK, an educational platform that was established in response to the challenges of the COVID-19 pandemic, focusing on remote learning and digital skills. It discusses the platform's vision, its efforts to collaborate with industry experts, educational institutions, and tech companies, and its initiatives to provide learners with resources and opportunities to stay ahead in the evolving tech landscape. The document also highlights the platform's support services, including consultation and its YouTube channel as an extension of its educational offerings.


In [22]:
for doc in keyword_retriever.invoke("what are the popular videos in codeprolk"):
    print(doc.page_content)
    print("---------------------")

appreciation and sharing how the videos have assisted them in their learning journeys. 
Impact 
The CodePRO LK YouTube channel has played a significant role in democratizing tech
---------------------
industry, ensuring that learners are well-prepared for real-world challenges. 
Enhanced Learning Tools 
The platform plans to integrate more interactive and adaptive learning tools to personalize the
---------------------


In [23]:
for doc in vectorstore_retreiver.invoke("what are the popular videos in codeprolk"):
    print(doc.page_content)
    print("---------------------")

appreciation and sharing how the videos have assisted them in their learning journeys. 
Impact 
The CodePRO LK YouTube channel has played a significant role in democratizing tech
---------------------
CodePRO LK is committed to strengthening its community through regular engagement 
activities such as webinars, live coding sessions, hackathons, and tech talks. These events
---------------------


In [24]:
for doc in ensemble_retriever.invoke("what are the popular videos in codeprolk"):
    print(doc.page_content)
    print("---------------------")

appreciation and sharing how the videos have assisted them in their learning journeys. 
Impact 
The CodePRO LK YouTube channel has played a significant role in democratizing tech
---------------------
CodePRO LK is committed to strengthening its community through regular engagement 
activities such as webinars, live coding sessions, hackathons, and tech talks. These events
---------------------
industry, ensuring that learners are well-prepared for real-world challenges. 
Enhanced Learning Tools 
The platform plans to integrate more interactive and adaptive learning tools to personalize the
---------------------
