<a href="https://colab.research.google.com/github/Camrahd/Gen-AI-Projects/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1. Install dependencies**

In [None]:
!pip install -q \
  langchain==0.3.0 \
  langchain-openai==0.3.0 \
  langchain-community==0.3.0 \
  langchain-text-splitters==0.3.0 \
  faiss-cpu \
  pypdf \
  tiktoken
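
The splitting step later in this notebook uses `chunk_size=1000` and `chunk_overlap=150`. As a rough illustration of what those two parameters mean, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally (that class tries to break on the listed separators first). The function name `chunk_with_overlap` and the sample string are made up for this example.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters.

    Hypothetical helper for illustration only; not part of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy input
chunks = chunk_with_overlap(sample, chunk_size=12, chunk_overlap=4)
for c in chunks:
    print(len(c), c)
```

Each chunk repeats the last 4 characters of the previous one, so a sentence that straddles a boundary is still fully visible in at least one chunk.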

In [None]:
# === STEP 2: Import dependencies ===

import os
from google.colab import userdata, files

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.chains import RetrievalQA


# === STEP 3: Load OpenAI API key from Colab Secrets ===
"""
In Colab:
- Open the 🔐 "Secrets" panel (left sidebar) → "Add new secret"
- Name: OPENAI_API_KEY
- Value: your actual OpenAI API key
- Enable notebook access for the secret
"""
try:
    api_key = userdata.get("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("Empty OPENAI_API_KEY.")
    # Set the environment variable only after validating the value:
    # os.environ accepts strings only, so a missing value must be caught first.
    os.environ["OPENAI_API_KEY"] = api_key
    print("✅ API key loaded from Colab secrets!")
except Exception as e:
    print("❌ Could not load OPENAI_API_KEY from Colab secrets.")
    print("Error:", e)
    raise


# === STEP 4: Upload PDF/TXT files ===
print("\n📄 Upload one or more PDF/TXT files:")
uploaded = files.upload()
file_names = list(uploaded.keys())
print("✅ Uploaded files:", file_names)

# === STEP 5: Load ALL Documents ===

all_docs = []

for file_path in file_names:
    print(f"\n📥 Loading: {file_path}")
    if file_path.lower().endswith(".pdf"):
        loader = PyPDFLoader(file_path)
    elif file_path.lower().endswith(".txt"):
        loader = TextLoader(file_path, encoding="utf-8")
    else:
        print(f"⛔ Skipping {file_path}: only .pdf or .txt supported.")
        continue

    docs = loader.load()
    print(f"   → Loaded {len(docs)} page(s)/document(s)")
    all_docs.extend(docs)

print(f"\n✅ Total documents/pages loaded: {len(all_docs)}")


# === STEP 6: Split into Chunks ===

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", " ", ""],
)

texts = splitter.split_documents(all_docs)
print(f"✅ Split into {len(texts)} chunks")


# === STEP 7: Create Embeddings & FAISS Vector Store ===

print("\n🔍 Creating embeddings and FAISS vector store...")

embeddings = OpenAIEmbeddings()

# Fail early with a clear message if every upload was skipped or empty.
if not texts:
    raise ValueError("No text chunks to index; upload at least one non-empty .pdf or .txt file.")

# Build the index directly from the chunked Documents:
# page_content is embedded and metadata is stored alongside each vector.
vectorstore = FAISS.from_documents(texts, embeddings)

print("✅ Vector DB (FAISS) ready!")


# === STEP 8: Build RetrievalQA Chain (Simple RAG) ===
"""
We use:
- retriever = FAISS index (semantic search over chunks)
- LLM = ChatOpenAI (chat completion model)
- chain = RetrievalQA: it retrieves top-k chunks and "stuffs" them into the prompt.
"""

llm = ChatOpenAI(
    model="gpt-4o-mini",   # or "gpt-4o", "gpt-3.5-turbo", etc.
    temperature=0,
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # simplest: concatenate retrieved chunks into context
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,  # optional: to inspect which chunks were used
)

print("\n🤖 RAG Bot is READY! Type questions about your document.")
print("Type 'exit' or 'quit' to stop.\n")


# === STEP 9: Interactive Loop ===
while True:
    query = input("You: ").strip()
    if query.lower() in ["exit", "quit", "bye"]:
        print("RAG: Goodbye! 👋")
        break

    if not query:
        continue

    try:
        result = qa.invoke({"query": query})
        answer = result["result"]
        print(f"\nRAG: {answer}\n")

        # Optional: show which document chunks were used
        # (helpful for debugging / transparency)
        print("--- Sources ---")
        for i, doc in enumerate(result["source_documents"], start=1):
            src = doc.metadata.get("source", "N/A")
            page = doc.metadata.get("page", "N/A")
            print(f"[{i}] source={src}, page={page}")
        print("---------------\n")

    except Exception as e:
        print(f"❌ Error: {e}\n")

✅ API key loaded from Colab secrets!

📄 Upload one or more PDF/TXT files:

Saving Aetna_Tutorial.pdf to Aetna_Tutorial (2).pdf
Saving Educosys_Agentic_Hackathon_Guidelines.pdf to Educosys_Agentic_Hackathon_Guidelines (5).pdf
✅ Uploaded files: ['Aetna_Tutorial (2).pdf', 'Educosys_Agentic_Hackathon_Guidelines (5).pdf']

📥 Loading: Aetna_Tutorial (2).pdf
   → Loaded 3 page(s)/document(s)

📥 Loading: Educosys_Agentic_Hackathon_Guidelines (5).pdf
   → Loaded 2 page(s)/document(s)

✅ Total documents/pages loaded: 5
✅ Split into 6 chunks

🔍 Creating embeddings and FAISS vector store...
✅ Vector DB (FAISS) ready!

🤖 RAG Bot is READY! Type questions about your document.
Type 'exit' or 'quit' to stop.

You: what is hackathon about?

RAG: The hackathon organized by Educosys is focused on building Agentic Applications that demonstrate intelligent, autonomous, or context-aware behavior. Participants are encouraged to create applications targeting real-world use cases such as productivity, education, automation, customer service, or creative tools. 
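
The chain above retrieves the top-k chunks and "stuffs" them into a single prompt. To make that idea concrete without calling any API, here is a minimal pure-Python sketch. Everything in it is an assumption for illustration: the word-count `embed` function stands in for a real embedding model, cosine similarity stands in for whatever metric the FAISS index uses, the prompt template is not LangChain's actual template, and the toy chunks are invented rather than taken from the uploaded PDFs.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by similarity to the query and keep the k best."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_stuffed_prompt(query: str, context_chunks: list[str]) -> str:
    """'Stuff' strategy: concatenate all retrieved chunks into one prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


# Invented toy chunks, not content from the uploaded files.
chunks = [
    "FAISS stores vectors and finds nearest neighbours.",
    "A text splitter cuts documents into overlapping chunks.",
    "The chat model writes an answer from the supplied context.",
]
query = "How does a text splitter cut documents?"
top = retrieve_top_k(query, chunks, k=2)
prompt = build_stuffed_prompt(query, top)
print(prompt)
```

The string printed at the end is what a chat model would receive in a "stuff"-style setup. Because every retrieved chunk is pasted in verbatim, `k` and the chunk size together bound how much of the model's context window the prompt consumes.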