## RAG-Powered Knowledge-Based Assistant

### Import Libraries  

In [1]:
# Standard library
import os
import warnings

# Third-party
from dotenv import load_dotenv
from PyPDF2 import PdfReader, PdfWriter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai.chat_models import ChatGoogleGenerativeAI
from langchain_google_genai.llms import GoogleGenerativeAI
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

warnings.filterwarnings("ignore")

# Read API keys and model settings from a local .env file
load_dotenv()
Google_api_key = os.getenv("GOOGLE_API_KEY")
HUGGINGFACE_API_KEY = os.getenv("HUGGINGFACE_API_KEY")
GEMINI_MODEL = os.getenv("GEMINI_MODEL")
LLM_PROVIDER = os.getenv("LLM_PROVIDER")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL")


  from .autonotebook import tqdm as notebook_tqdm

All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  from google.generativeai.caching import CachedContent  # type: ignore[import]


In [2]:
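def load_pdf(pdf_path: str) -> str:
    """Return the text of every page in the PDF, or "" if it cannot be read."""
    try:
        reader = PdfReader(pdf_path)
        # Guard against pages that yield no text, so the join never fails on an empty value
        return "".join((page.extract_text() or "") + "\n" for page in reader.pages)
    except Exception as e:
        print(f"Error loading PDF {pdf_path}: {e}")
        return ""

simple_pdf = load_pdf("../data/sample.pdf")

# Only report success when some text was actually extracted
if simple_pdf:
    print("Document loaded successfully!")
print(f"Total length: {len(simple_pdf)} characters")
# Splitting on "." gives only a rough sentence count
print(f"Approximate sentence count: {len(simple_pdf.split('.'))}")


Document loaded successfully!
Total length: 1960 characters
Approximate sentence count: 16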


In [3]:
# Split the text into chunks (sizes are measured in characters because length_function=len)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=100,
    length_function=len,
)
chunks = text_splitter.split_text(simple_pdf)
print(f"Total chunks created: {len(chunks)}")

for i, chunk in enumerate(chunks, 1):
    print(f"--- Chunk {i} ---")
    print(chunk)
    print("\n")

Total chunks created: 6
--- Chunk 1 ---
Page 1: Introduction to Artificial Intelligence
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are
programmed to think like humans and mimic their actions. The term may also be applied to any
machine that exhibits traits associated with a human mind such as learning and problem-solving. AI
is widely used today in applications such as recommendation systems, voice assistants, fraud


--- Chunk 2 ---
is widely used today in applications such as recommendation systems, voice assistants, fraud
detection, and autonomous vehicles.
Machine Learning is a subset of AI that focuses on building systems that learn from data. Instead of
being explicitly programmed, these systems improve their performance as they are exposed to
more data.


--- Chunk 3 ---
Page 2: Natural Language Processing and Embeddings
Natural Language Processing (NLP) is a field of AI that gives machines the ability to read,
understand, and de
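
The repeated line at the end of Chunk 1 and the start of Chunk 2 comes from `chunk_overlap`. The sketch below illustrates the overlap idea with a plain fixed-size sliding window. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on separators such as newlines); `sliding_window_chunks` and the demo values are hypothetical names chosen for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(demo_text, chunk_size=10, chunk_overlap=4)
for piece in demo_chunks:
    print(piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last four characters of one piece reappear at the start of the next. That shared text is what keeps a sentence that straddles a boundary retrievable from either chunk.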

In [4]:
# Initialize the embedding model (HuggingFaceEmbeddings was imported in the first cell)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

print("✅ Embedding model loaded!")

# Test: embed every chunk, then inspect the first chunk's vector
sample_chunk = chunks
sample_embedding = embeddings.embed_documents(sample_chunk)  # one vector per chunk

print(f"\n📝 Original text: {sample_chunk}")
# Index [0] picks the first chunk's vector; slicing the outer list would return whole vectors
print(f"\n🔢 Embedding (first 10 numbers): {sample_embedding[0][:10]}")
print(f"📊 Embedding size: {len(sample_embedding[0])} dimensions")

✅ Embedding model loaded!

📝 Original text: ['Page 1: Introduction to Artificial Intelligence\nArtificial Intelligence (AI) refers to the simulation of human intelligence in machines that are\nprogrammed to think like humans and mimic their actions. The term may also be applied to any\nmachine that exhibits traits associated with a human mind such as learning and problem-solving. AI\nis widely used today in applications such as recommendation systems, voice assistants, fraud', 'is widely used today in applications such as recommendation systems, voice assistants, fraud\ndetection, and autonomous vehicles.\nMachine Learning is a subset of AI that focuses on building systems that learn from data. Instead of\nbeing explicitly programmed, these systems improve their performance as they are exposed to\nmore data.', 'Page 2: Natural Language Processing and Embeddings\nNatural Language Processing (NLP) is a field of AI that gives machines the ability to read,\nunderstand, and derive mean
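
Once each chunk is a list of floats, "similar meaning" becomes "vectors pointing in a similar direction". As a minimal sketch of that idea, the block below computes cosine similarity by hand on made-up three-dimensional vectors. The vectors `toy_ml`, `toy_ai`, and `toy_cooking` are invented for illustration and are not real model output.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (real ones have hundreds of dimensions)
toy_ml = [0.9, 0.1, 0.0]
toy_ai = [0.8, 0.3, 0.1]
toy_cooking = [0.0, 0.2, 0.9]

print("ml vs ai:     ", cosine_similarity(toy_ml, toy_ai))
print("ml vs cooking:", cosine_similarity(toy_ml, toy_cooking))
```

The two related toy vectors score much closer to 1.0 than the unrelated pair, which is the property a vector store relies on when it ranks chunks against a query.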

In [5]:
# Create Vector Database
vector_store = Chroma.from_texts(
    texts=chunks,
    embedding=embeddings,   # ✅ correct keyword
    collection_name="example_collection",
    persist_directory="../chroma_db"
)

print("Vector database created!")
print(f"Stored {vector_store._collection.count()} chunks") 

Vector database created!
Stored 6 chunks


In [9]:
# similarity search test

# ask a query
query = "What is Machine Learning?"

print(f"\n Query: {query}")
print("finding similar chunks...")

similar_docs = vector_store.similarity_search(query, k=3)

print(f"Found {len(similar_docs)} relevant chunks:\n")

for i, doc in enumerate(similar_docs, 1):
    print(f"\n--- Similar Chunk {i} ---")
    print(doc.page_content) 


 Query: What is Machine Learning?
finding similar chunks...
Found 3 relevant chunks:


--- Similar Chunk 1 ---
is widely used today in applications such as recommendation systems, voice assistants, fraud
detection, and autonomous vehicles.
Machine Learning is a subset of AI that focuses on building systems that learn from data. Instead of
being explicitly programmed, these systems improve their performance as they are exposed to
more data.

--- Similar Chunk 2 ---
Page 1: Introduction to Artificial Intelligence
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are
programmed to think like humans and mimic their actions. The term may also be applied to any
machine that exhibits traits associated with a human mind such as learning and problem-solving. AI
is widely used today in applications such as recommendation systems, voice assistants, fraud

--- Similar Chunk 3 ---
Page 2: Natural Language Processing and Embeddings
Natural Language Process
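
Conceptually, `similarity_search(query, k=3)` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force scan over a tiny in-memory list. It is an illustration under assumptions, not Chroma's implementation: real vector stores typically use an index instead of comparing against every vector, and `top_k`, `toy_store`, and the two-dimensional vectors are hypothetical.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance: smaller means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1])  # ascending: closest first
    return scored[:k]


# Hypothetical store of (chunk text, embedding) pairs
toy_store = [
    ("chunk about machine learning", [0.9, 0.1]),
    ("chunk about AI in general", [0.7, 0.4]),
    ("chunk about NLP", [0.2, 0.9]),
    ("chunk about vector databases", [0.1, 0.6]),
]
toy_query = [1.0, 0.0]  # stands in for the embedded question

nearest = top_k(toy_query, toy_store, k=3)
for text, distance in nearest:
    print(f"{distance:.2f}  {text}")
```

Note that `k` chunks always come back, even when only one is truly relevant, which matches the search output above where the third result is about NLP rather than machine learning.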

In [14]:
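def simple_qa(query: str) -> str:
    # Retrieval only: fetch the two most similar chunks (no LLM is involved yet)
    results = vector_store.similarity_search(query, k=2)

    # Return the raw chunk text as the "answer"
    answer = "\n\n".join([doc.page_content for doc in results])
    return answer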

query = [
    "What is Machine Learning?",
    "Explain the concept of Artificial Intelligence.",
    "How does Natural Language Processing work?"
]
for q in query:
    print(f"\n\n--- Query: {q} ---")
    answer = simple_qa(q) 
    print(f"Answer: {answer}")



--- Query: What is Machine Learning? ---
Answer: is widely used today in applications such as recommendation systems, voice assistants, fraud
detection, and autonomous vehicles.
Machine Learning is a subset of AI that focuses on building systems that learn from data. Instead of
being explicitly programmed, these systems improve their performance as they are exposed to
more data.

Page 1: Introduction to Artificial Intelligence
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are
programmed to think like humans and mimic their actions. The term may also be applied to any
machine that exhibits traits associated with a human mind such as learning and problem-solving. AI
is widely used today in applications such as recommendation systems, voice assistants, fraud


--- Query: Explain the concept of Artificial Intelligence. ---
Answer: Page 1: Introduction to Artificial Intelligence
Artificial Intelligence (AI) refers to the simulation of human i

In [16]:
# Answer questions with the LLM, using the retrieved chunks as context

# Initialize the LLM once
llm = ChatGoogleGenerativeAI(
    model=GEMINI_MODEL,
    temperature=0,
    max_output_tokens=1024,
    api_key=Google_api_key
)

# Create the RetrievalQA chain once and reuse it for every query
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # place all retrieved chunks into a single prompt
    retriever=vector_store.as_retriever(),
    return_source_documents=False
)

def answer_with_llm(query: str) -> str:
    # invoke() returns a dict; the generated answer is stored under "result"
    return qa_chain.invoke({"query": query})["result"]

query = [
    "What is Machine Learning?",
    "Explain the concept of Artificial Intelligence.",
    "How does Natural Language Processing work?"
]
for q in query:
    print(f"\n\n--- Query: {q} ---")
    answer = answer_with_llm(q)
    print(f"Answer: {answer}")



--- Query: What is Machine Learning? ---
Answer: Machine Learning is a subset of AI that focuses on building systems that learn from data. These systems improve their performance as they are exposed to more data, instead of being explicitly programmed.


--- Query: Explain the concept of Artificial Intelligence. ---
Answer: Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. It can also be applied to any machine that exhibits traits associated with a human mind, such as learning and problem-solving.


--- Query: How does Natural Language Processing work? ---
Answer: Natural Language Processing (NLP) is a field of AI that gives machines the ability to read, understand, and derive meaning from human languages. One of the most important techniques in modern NLP is the use of embeddings. Embeddings convert words, sentences, or documents into numerical vectors that capture semantic meanin
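
The `chain_type="stuff"` setting means the retrieved chunks are concatenated and placed ("stuffed") into one prompt together with the question. The sketch below shows roughly what that assembly looks like. The template wording and the helper `build_stuffed_prompt` are assumptions made for illustration, not the prompt LangChain actually sends; the second toy chunk is shortened from the sample document.

```python
def build_stuffed_prompt(question: str, context_chunks: list[str]) -> str:
    """Join all retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks
toy_chunks = [
    "Machine Learning is a subset of AI that focuses on building systems that learn from data.",
    "AI is widely used today in applications such as recommendation systems.",
]

prompt = build_stuffed_prompt("What is Machine Learning?", toy_chunks)
print(prompt)
```

Because every chunk goes into a single prompt, this approach is simple but only works while the combined chunks fit in the model's context window, which is easily the case for the six short chunks here.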

In [18]:
# Let's see what's happening behind the scenes

import numpy as np

def analyze_search(query):
    """Show how similarity search works"""

    print(f"🔍 Analyzing query: '{query}'\n")

    # Get query embedding
    query_embedding = embeddings.embed_query(query)
    print(f"1️⃣ Query converted to {len(query_embedding)} numbers")
    print(f"   First 5 numbers: {query_embedding[:5]}\n")

    # Search; each result carries a distance score (lower means more similar)
    results = vector_store.similarity_search_with_score(query, k=3)

    print("2️⃣ Comparing with all chunks in database...\n")
    print("3️⃣ Top 3 most similar chunks:\n")

    for i, (doc, score) in enumerate(results):
        print(f"   Rank {i+1} | Distance (lower is closer): {score:.4f}")
        print(f"   Text: {doc.page_content[:80]}...")
        print()

    return results

# Try it
analyze_search("What is machine learning?")

🔍 Analyzing query: 'What is machine learning?'

1️⃣ Query converted to 384 numbers
   First 5 numbers: [-0.01995455101132393, 0.009877976030111313, 0.010249646380543709, 0.029553720727562904, 0.027186432853341103]

2️⃣ Comparing with all chunks in database...

3️⃣ Top 3 most similar chunks:

   Rank 1 | Distance (lower is closer): 0.5879
   Text: is widely used today in applications such as recommendation systems, voice assis...

   Rank 2 | Distance (lower is closer): 0.9423
   Text: Page 1: Introduction to Artificial Intelligence
Artificial Intelligence (AI) ref...

   Rank 3 | Distance (lower is closer): 1.2039
   Text: Page 2: Natural Language Processing and Embeddings
Natural Language Processing (...



[(Document(metadata={}, page_content='is widely used today in applications such as recommendation systems, voice assistants, fraud\ndetection, and autonomous vehicles.\nMachine Learning is a subset of AI that focuses on building systems that learn from data. Instead of\nbeing explicitly programmed, these systems improve their performance as they are exposed to\nmore data.'),
  0.5878883600234985),
 (Document(metadata={}, page_content='Page 1: Introduction to Artificial Intelligence\nArtificial Intelligence (AI) refers to the simulation of human intelligence in machines that are\nprogrammed to think like humans and mimic their actions. The term may also be applied to any\nmachine that exhibits traits associated with a human mind such as learning and problem-solving. AI\nis widely used today in applications such as recommendation systems, voice assistants, fraud'),
  0.9422617554664612),
 (Document(metadata={}, page_content='Page 2: Natural Language Processing and Embeddings\nNatural Lan
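
The scores returned above grow as the match gets worse (0.5879 for the best chunk, 1.2039 for the third), which is how a distance behaves rather than a similarity. One possible reading, assuming the collection uses squared Euclidean distance and the embeddings are unit-length (neither is confirmed by this notebook), is that each score equals `2 - 2 * cosine_similarity`. The sketch below checks that identity on hypothetical toy vectors.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical vectors, normalized so that cosine similarity is just the dot product
vec_a = normalize([3.0, 4.0])
vec_b = normalize([4.0, 3.0])

cos_sim = dot(vec_a, vec_b)
dist = squared_l2(vec_a, vec_b)

print(f"cosine similarity:    {cos_sim:.4f}")
print(f"squared L2 distance:  {dist:.4f}")
print(f"2 - 2 * cosine:       {2 - 2 * cos_sim:.4f}")
```

Under that assumption, a score of 0 would mean identical direction, 2 would mean unrelated (orthogonal) vectors, and values up to 4 are possible for opposite vectors, so a score above 1 such as 1.2039 is not an error.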