<a href="https://colab.research.google.com/github/AbdullahHasan0/OpenAI-RAG-QA/blob/main/rag_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Install Libraries

In [1]:
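!pip install langchain openai langchain_community pypdf chromadb gradio

* langchain → Core library for building RAG pipelines, chains, and LLM apps.

* openai → Official client for the OpenAI API.

* langchain_community → Community-maintained integrations (document loaders, vector stores, embeddings, etc.).

* pypdf → PDF parser used by PyPDFLoader.

* chromadb → Vector database that backs the Chroma vector store.

* gradio → Builds the web interface at the end of the notebook.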
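After installing, it can help to confirm that every package is importable before running the rest of the notebook. The sketch below is one way to do that with the standard library only; `missing_modules` and the `REQUIRED` list are illustrative names, not part of the original notebook.

```python
import importlib.util

def missing_modules(module_names):
    """Return the names from module_names that cannot be imported in this runtime."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names (not pip names) that this notebook relies on
REQUIRED = ["langchain", "openai", "langchain_community", "pypdf", "chromadb", "gradio"]

missing = missing_modules(REQUIRED)
print("Missing modules:", missing or "none")
```

If the list is not empty, re-run the install cell (and restart the runtime if Colab asks for it).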

In [2]:
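from openai import OpenAI  # raw OpenAI SDK client; NOTE: this name is rebound by the LangChain import below
import os

from langchain_community.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI  # LangChain LLM wrapper; from here on `OpenAI` means this class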
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQAWithSourcesChain

from google.colab import files

import gradio as gr


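* from openai import OpenAI → Core OpenAI API client. The later from langchain.llms import OpenAI reuses the same name, so after this cell OpenAI refers to the LangChain wrapper.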

* import os → Access environment variables, file paths, etc.

* LangChain imports:

    * OpenAIEmbeddings → Converts text into embeddings using OpenAI.

    * OpenAI → LLM wrapper for OpenAI models.

    * Chroma → Vector database for storing and searching embeddings.

    * RecursiveCharacterTextSplitter → Splits documents into chunks intelligently with overlap.

    * PyPDFLoader → Reads PDF files and converts to text.

    * RetrievalQAWithSourcesChain → Builds a RAG pipeline that also provides source references.

* from google.colab import files → Uploads a PDF interactively from your computer.

* import gradio as gr → Builds the interactive web interface.

## Function to connect OpenAI API

In [3]:
from google.colab import userdata

# 'OPENAI_API_KEY' is the name of the secret saved in Colab Secrets; change it if yours is named differently
KEY = userdata.get('OPENAI_API_KEY')



def connect_openai(api_key):
    """Build the OpenAI LLM object; return None if the key is missing or setup fails."""
    if not api_key:
        print("API key not found. Please add your OpenAI key in Colab Secrets or set OPENAI_API_KEY.")
        return None

    print("Connecting to OpenAI API...")
    try:
        # `OpenAI` is the LangChain wrapper here (the last `OpenAI` imported above)
        openai_client = OpenAI(openai_api_key=api_key)
        print("Connected to OpenAI API")
        return openai_client
    except Exception as e:
        print(f"Error connecting to OpenAI: {e}")
        return None

# Usage
llm_client = connect_openai(KEY)

Connecting to OpenAI API...
Connected to OpenAI API


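* Reads your API key from Colab Secrets.
* If the key exists → builds the OpenAI LLM object and returns it. No request is sent at this point, so a wrong key typically shows up only on the first real call.
* If not → prints a reminder to save the key first and returns None.
* Catches setup errors and returns None instead of crashing the notebook.
* llm_client is now available. The later cells build their own embeddings and LLM objects from KEY.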


#### NOTE
1. Open the Secrets panel in Colab (the key icon in the left sidebar, or search for “Secrets”).

2. Add a secret named OPENAI_API_KEY with your OpenAI key as the value, and enable notebook access for it.

3. This keeps the key out of the notebook itself.
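If the same notebook also needs to run outside Colab, the key lookup can fall back to an environment variable. This is a minimal sketch under that assumption; `resolve_api_key` is a hypothetical helper, not something the notebook defines.

```python
import os

def resolve_api_key(name="OPENAI_API_KEY", env=None):
    """Return the key from Colab Secrets if available, else from the environment, else None."""
    try:
        from google.colab import userdata  # only importable inside Colab
        key = userdata.get(name)
    except Exception:
        # Not running in Colab, secret missing, or notebook access not granted
        key = None
    env = os.environ if env is None else env
    key = key or env.get(name)
    return key.strip() if key else None

KEY_FROM_ANYWHERE = resolve_api_key()
print("Key found" if KEY_FROM_ANYWHERE else "Key not found")
```

The `env` parameter exists only so the function can be exercised with a plain dictionary instead of the real environment.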

## Loading PDF

In [16]:
def load_pdf_colab():
    """
    Let the user pick a PDF from their computer in Colab,
    then load it using PyPDFLoader.

    Returns (file_path, raw_documents), or (None, None) on failure.
    """

    uploaded = files.upload()  # Opens file picker
    if not uploaded:
        print("No file uploaded.")
        return None, None  # keep the return shape consistent for tuple unpacking

    # Get the uploaded file name
    file_path = list(uploaded.keys())[0]
    print(f"Loading PDF from {file_path}...")

    try:
        loader = PyPDFLoader(file_path)
        raw_documents = loader.load()
        print(f"Loaded {len(raw_documents)} pages from the PDF")
        print("-"*10)
        print(raw_documents[0].page_content[:500])
        return file_path, raw_documents

    except Exception as e:
        print(f"Error loading PDF: {e}")
        return None, None

file_path, raw_docs = load_pdf_colab()

Saving Abdullah_Hasan_AI _RESUME.pdf to Abdullah_Hasan_AI _RESUME (3).pdf
Loading PDF from Abdullah_Hasan_AI _RESUME (3).pdf...
Loaded 2 pages from the PDF
----------
Syed Abdullah Hasan 
AI/ML Engineer   
Karachi, Pakistan | Ph: +923228220707 | abdullahhasan1045@gmail.com | LinkedIn: Abdullah Hasan | 
Github: Abdullah Hasan 
  
Professional Summary 
Aspiring AI/ML Engineer with hands-on experience in machine learning, deep learning, computer vision, 
NLP, and LLM-based applications. Skilled in Python, TensorFlow, PyTorch, Scikit-learn, and LangChain. 
Experienced in building end-to-end AI solutions including medical image classification, sentiment analysis, 


* Opens a file picker in Colab → you select the PDF.

* Automatically gets the file name of the uploaded PDF.

* Uses PyPDFLoader to read the PDF and load one document per page.

* Prints number of pages and previews the first 500 characters.

* Returns the file name and the list of pages (raw_documents), ready for chunking and embeddings.
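The loader above takes the first uploaded file whatever its type. A small guard can pick the first file that actually looks like a PDF. The sketch assumes `uploaded` is the dictionary returned by `files.upload()` (file name mapped to file bytes); `pick_pdf` is a hypothetical helper.

```python
def pick_pdf(uploaded):
    """Return the first uploaded file name ending in .pdf (case-insensitive), or None."""
    for name in uploaded:
        if name.lower().endswith(".pdf"):
            return name
    return None

# Stand-in for the dictionary returned by files.upload()
example_upload = {"notes.txt": b"hello", "Resume (3).PDF": b"%PDF-1.7"}
print(pick_pdf(example_upload))  # Resume (3).PDF
```

This only checks the extension, so a mislabeled file would still reach PyPDFLoader and fail there.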

## Splitting Document (Chunking)

In [5]:
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 150

def split_documents(raw_documents, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP):
    """
    Split PDF pages into smaller text chunks for embeddings.
    """
    print("\nSplitting the loaded document into smaller chunks...")

    # RecursiveCharacterTextSplitter is already imported at the top of the notebook
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )

    documents = text_splitter.split_documents(raw_documents)

    if not documents:
        raise ValueError("Error: Splitting resulted in zero documents")

    print(f"Document split into {len(documents)} chunks.")
    return documents

documents = split_documents(raw_docs, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)


Splitting the loaded document into smaller chunks...
Document split into 5 chunks.


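* Whole pages are long and mix several topics, which makes retrieval less precise → we split them into smaller chunks.

* chunk_size=1000 → each chunk has up to 1000 characters.

* chunk_overlap=150 → up to 150 characters are shared between consecutive chunks so context is not lost at the boundaries.

* RecursiveCharacterTextSplitter → tries to split on paragraph breaks first, then line breaks, then spaces, and only then between characters, so chunks tend to end at natural boundaries.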

* Returns a list of chunks ready for embeddings.

* Prints how many chunks the document was split into.
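To see how chunk_size and chunk_overlap interact, here is a deliberately simplified fixed-window splitter. It is not the algorithm RecursiveCharacterTextSplitter uses (it ignores paragraph and word boundaries); `chunk_text` is an illustrative name.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=150):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap doing its job on a small scale.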

In [6]:
documents

[Document(metadata={'producer': 'Microsoft® Word 2021', 'creator': 'Microsoft® Word 2021', 'creationdate': '2025-08-18T10:36:43+05:00', 'author': 'Ashir Afzal', 'moddate': '2025-08-18T10:36:43+05:00', 'source': 'Abdullah_Hasan_AI _RESUME (2).pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content='Syed Abdullah Hasan \nAI/ML Engineer   \nKarachi, Pakistan | Ph: +923228220707 | abdullahhasan1045@gmail.com | LinkedIn: Abdullah Hasan | \nGithub: Abdullah Hasan \n  \nProfessional Summary \nAspiring AI/ML Engineer with hands-on experience in machine learning, deep learning, computer vision, \nNLP, and LLM-based applications. Skilled in Python, TensorFlow, PyTorch, Scikit-learn, and LangChain. \nExperienced in building end-to-end AI solutions including medical image classification, sentiment analysis, \ncontextual QA systems, and real-time computer vision pipelines. Passionate about applying technical skills \nto real-world projects and growing in collaborative, high-performance 

In [7]:
## Let's display an example chunk
print("\n--- Example chunk: (Chunk 0) ---")
print(documents[0].page_content)
print("\n--- Metadata for Chunk 0 ---")
doc_source = documents[0].metadata['source'].split("/")[-1]
print(doc_source)


--- Example chunk: (Chunk 0) ---
Syed Abdullah Hasan 
AI/ML Engineer   
Karachi, Pakistan | Ph: +923228220707 | abdullahhasan1045@gmail.com | LinkedIn: Abdullah Hasan | 
Github: Abdullah Hasan 
  
Professional Summary 
Aspiring AI/ML Engineer with hands-on experience in machine learning, deep learning, computer vision, 
NLP, and LLM-based applications. Skilled in Python, TensorFlow, PyTorch, Scikit-learn, and LangChain. 
Experienced in building end-to-end AI solutions including medical image classification, sentiment analysis, 
contextual QA systems, and real-time computer vision pipelines. Passionate about applying technical skills 
to real-world projects and growing in collaborative, high-performance teams. 
  
Professional Experience   
Data Science Fellow – Bytewise Limited                                                                                                       
June 2024 – Sep 2024  
• Expertise in data cleaning, scaling, encoding, and building various machine learni

## Initializing Embeddings


In [8]:
def create_vector_store(documents):
    """
    Create embeddings for document chunks and store them in Chroma vector database.
    """

    print("Initializing OpenAI Embeddings model...")
    embeddings = OpenAIEmbeddings(openai_api_key=KEY)
    print("OpenAI Embeddings model initialized.")

    print("\nCreating ChromaDB vector store...")
    vector_store = Chroma.from_documents(documents=documents, embedding=embeddings)

    # Verify the number of items in the store
    vector_count = vector_store._collection.count()
    print(f"ChromaDB vector store created with {vector_count} items.")

    if vector_count == 0:
        print("Warning: Vector store creation resulted in 0 items.")

    return vector_store

vector_store = create_vector_store(documents)

Initializing OpenAI Embeddings model...
OpenAI Embeddings model initialized.

Creating ChromaDB vector store...
ChromaDB vector store created with 5 items.


* Embeddings: Converts each chunk of text into a numeric vector using OpenAIEmbeddings.

* Vector Store (ChromaDB): Stores all embeddings for fast semantic search/retrieval.

* _collection.count() → checks how many vectors were stored. _collection is a private attribute of the Chroma wrapper, so treat this as a sanity check only.

* Returns vector_store → now you can use it for RAG queries.

* Prints messages so you know each step is working.
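Semantic search boils down to comparing vectors. The sketch below uses cosine similarity on made-up 3-dimensional vectors to show the idea. Cosine similarity is one common choice; the metric Chroma actually applies depends on how the collection is configured, and real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy "embeddings" invented for illustration
store = {
    "skills": [0.9, 0.1, 0.0],
    "projects": [0.1, 0.9, 0.0],
}
query = [0.8, 0.2, 0.0]

best = max(store, key=lambda name: cosine_similarity(query, store[name]))
print(best)  # skills
```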

## Retrieval

In [9]:
def test_similarity_search(vector_store, query, top_k=2):
    """
    Test the vector store by finding documents similar to a query.
    """
    print("\n--- Testing Similarity Search in vector store ---")
    print(f"Searching for documents similar to: '{query}'")

    try:
        similar_docs = vector_store.similarity_search(query, k=top_k)
        print(f"Found {len(similar_docs)} similar documents.")

        for i, doc in enumerate(similar_docs):
            print(f"\n--- Document {i+1} ---")
            content_snippet = doc.page_content[:700].strip() + "..."  # first 700 characters
            source = doc.metadata.get('source', 'Unknown').split("/")[-1]
            print(f"Content snippet: {content_snippet}")
            print(f"Source: {source}")

        return similar_docs

    except Exception as e:
        print(f"Error occurred while searching for similar documents: {e}")
        return []

test_docs = test_similarity_search(vector_store, "what is sick leave policy", top_k=2)



--- Testing Similarity Search in vector store ---
Searching for documents similar to: 'what is sick leave policy'
Found 2 similar documents.

--- Document 1 ---
Content snippet: • Data Science by Plus W 株式会社 
  
Technical Skills: 
Programming & Libraries: Python, NumPy, Pandas, Scikit-learn, TensorFlow, PyTorch, LangChain 
ML & AI Techniques: Regression, Classification, Clustering, CNNs, Transfer Learning, Attention 
Mechanisms, LLMs 
Data Handling: Data preprocessing, Feature engineering, SQL, NoSQL 
Visualization: Matplotlib, Seaborn, Power BI 
Deployment & Tools: Flask, Streamlit, Gradio, Jupyter Notebooks, Google Colab, Git 
Soft Skills: Fast learner, proactive team player, growth-oriented mindset 
  
Projects 
Brain Hemorrhage Detection (Final Year Project) 
• Built an end-to-end CNN pipeline using EfficientNetB2 for multiclass classification of 5 brain 
hemorr...
Source: Abdullah_Hasan_AI _RESUME (2).pdf

--- Document 2 ---
Content snippet: Syed Abdullah Hasan 
AI/ML Engineer   

* Lets you query your vector store with a question.

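* similarity_search() → returns the k chunks whose embeddings are closest to the query embedding. It returns k results even when nothing in the document is relevant, as with the sick-leave query above, which is run against a resume.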

* Prints the first 700 characters of each chunk as a preview.

* Prints the source filename for reference.

* Returns the list of similar documents so you can use them in a RAG chain or further processing.
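If your vector store can return a score with each hit, a threshold helps drop weak matches before they reach the LLM. This is a sketch under assumptions: the scores are invented, higher means more similar, and `filter_hits` is a hypothetical helper. Some stores return distances instead (lower is better), in which case the comparison must be flipped.

```python
def filter_hits(scored_hits, min_score):
    """Keep (text, score) pairs whose score is at least min_score, best first."""
    kept = [hit for hit in scored_hits if hit[1] >= min_score]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)

# Made-up scores for two retrieved chunks
hits = [("Technical Skills ...", 0.62), ("Professional Summary ...", 0.71)]

print(filter_hits(hits, min_score=0.75))  # []
print(filter_hits(hits, min_score=0.65))  # [('Professional Summary ...', 0.71)]
```

An empty result is useful information: it tells the application to answer "not found in the document" rather than pass unrelated text to the model.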

## Building & Testing the RAG Chain Using LangChain

In [11]:
def create_qa_chain(vector_store, temperature=0, k=2):
    """
    Create a RAG (Retrieval-Augmented Generation) QA chain using OpenAI LLM and a vector store.
    """

    # Step 1: Configure retriever
    retriever = vector_store.as_retriever(search_kwargs={"k": k})
    print("Retrieval configured successfully from vector store.")

    # Step 2: Configure LLM
    llm = OpenAI(temperature=temperature, openai_api_key=KEY)
    print("OpenAI LLM successfully configured.")

    # Step 3: Create RetrievalQAWithSourcesChain
    qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=False,
        verbose=True
    )
    print("RetrievalQAWithSourcesChain created")

    return qa_chain

qa_chain = create_qa_chain(vector_store, temperature=0, k=2)


Retrieval configured successfully from vector store.
OpenAI LLM successfully configured.
RetrievalQAWithSourcesChain created


* Retriever: Gets the top k relevant chunks from the vector store for any query.

* LLM: OpenAI model that generates answers using the retrieved chunks.

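* RAG Chain: Combines retriever + LLM → now your QA system can answer questions with context from your PDF.

* chain_type="stuff" → puts ("stuffs") all retrieved chunks into a single prompt.

* return_source_documents=False → the result contains only the answer and sources, not the full source documents (optional).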

* verbose=True → prints what’s happening internally for debugging.

* Returns qa_chain → ready to answer questions.
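Conceptually, the "stuff" approach concatenates the retrieved chunks and their sources into one prompt and sends that to the LLM. The sketch below shows the idea with an invented template; it is not the prompt LangChain actually uses, and `build_stuff_prompt` is a hypothetical name.

```python
def build_stuff_prompt(question, chunks):
    """Build one prompt from all chunks. chunks is a list of (source, text) pairs."""
    context = "\n\n".join(
        f"Content: {text}\nSource: {source}" for source, text in chunks
    )
    return (
        "Answer the question using only the context below and list the sources you used.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

retrieved = [
    ("resume.pdf", "Skilled in Python, TensorFlow, PyTorch."),
    ("resume.pdf", "Built an end-to-end CNN pipeline."),
]
print(build_stuff_prompt("Which frameworks are mentioned?", retrieved))
```

Because everything goes into one prompt, k and chunk_size together decide how much text the model sees, so both must fit within the model's context window.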

In [12]:
def test_rag_chain(qa_chain, query):
    """
    Run a query through the RAG QA chain and display answer and sources.
    """
    print("\n--- Testing the Full RAG Chain ---")
    print(f"Query: {query}")

    try:
        result = qa_chain.invoke({"question": query})

        print("\n--- Answer ---")
        print(result.get("answer", "No answer generated."))

        print("\n--- Sources ---")
        print(result.get("sources", "No sources found."))

        # Optional: show source document snippets if available
        if "source_documents" in result and result["source_documents"]:
            print("\n--- Source Document Details ---")
            for i, doc in enumerate(result["source_documents"]):
                content_snippet = doc.page_content[:250].strip() + "..."
                print(f"Doc {i+1}: {content_snippet}")

        return result

    except Exception as e:
        print(f"Error occurred: {e}")
        return None

result = test_rag_chain(qa_chain, "Information about sick leaves")



--- Testing the Full RAG Chain ---
Query: Information about sick leaves


> Entering new RetrievalQAWithSourcesChain chain...

> Finished chain.

--- Answer ---
 The professional experience includes sick leaves.


--- Sources ---
Abdullah_Hasan_AI _RESUME (2).pdf


* Sends your query ("Information about sick leaves") to the RAG QA chain.

* Prints the generated answer from OpenAI.

* Prints source references (if available).

* Shows the first 250 characters of each source document, but only if the chain was created with return_source_documents=True.

* Handles errors safely → notebook won’t crash if something goes wrong.

* Returns the result dictionary → you can access answer, sources, or source_documents programmatically.
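In the run above, sources comes back as a plain string. When several files are indexed, it may list more than one name, so it can be handy to normalize it into a list. The sketch assumes a comma-separated string (the format is produced by the LLM and is not guaranteed, and file names containing commas would be split); `split_sources` and the sample `result` are illustrative.

```python
def split_sources(sources):
    """Turn a sources string such as 'a.pdf, b.pdf' into a de-duplicated list, keeping order."""
    seen = []
    for part in sources.split(","):
        name = part.strip()
        if name and name not in seen:
            seen.append(name)
    return seen

# Stand-in for the dictionary returned by qa_chain.invoke(...)
result = {"answer": " Example answer.\n", "sources": "resume.pdf, resume.pdf, notes.pdf"}

print(result["answer"].strip())          # Example answer.
print(split_sources(result["sources"]))  # ['resume.pdf', 'notes.pdf']
```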

In [17]:
def ask_document(user_query):

    """Run the user query through the RAG chain and return (answer, sources) for the two Gradio outputs."""

    print(f"\nProcessing Gradio query: '{user_query}'")
    if not user_query or user_query.strip() == "":
        print("Empty query received. Returning prompt for valid input.")
        # Every path returns two values because the interface has two output boxes
        return "Please enter a valid query.", ""

    try:
        result = qa_chain.invoke({"question": user_query})

        answer = result.get("answer", "I couldn't find an answer in the provided document.")
        sources = result.get("sources", "No specific sources identified.")

        if sources == file_path:
            sources = f"Retrieved from: {file_path}"

        elif isinstance(sources, list):
            sources = ", ".join(list(set(sources)))

        print(f" --> Answer generated: {answer[:100].strip()}...")
        print(f" --> Sources Identified: {sources}")

        return answer.strip(), sources

    except Exception as e:
        print(f"Error occurred while processing query: {e}")
        return "An error occurred while processing your query.", ""


## Creating Gradio Interface

In [18]:
def launch_gradio_interface(qa_chain):
    """
    Launch a Gradio interface to ask questions to the HR document QA system.
    Note: the qa_chain argument is not used directly; ask_document reads the global qa_chain.
    """

    print("\nSetting up Gradio Interface...")

    with gr.Blocks(theme=gr.themes.Ocean(), title="Document QA Assistant") as demo:
        gr.Markdown(
            """
            ## Document QA Assistant
            Ask questions about your HR document and get answers with sources.
            """
        )

        # Input Component
        question_input = gr.Textbox(
            label="Ask a question about the document",
            placeholder="Type your question here...",
            lines=2
        )

        # Output Components
        with gr.Row():
            answer_output = gr.Textbox(
                label="Answer",
                placeholder="The answer will be displayed here...",
                lines=4,
                interactive=False
            )

            sources_output = gr.Textbox(
                label="Sources",
                placeholder="The sources will be displayed here...",
                lines=2,
                interactive=False
            )

        # Buttons
        with gr.Row():
            submit_button = gr.Button("Submit", variant="primary")
            clear_button = gr.ClearButton(
                components=[question_input, answer_output, sources_output],
                value="Clear All"
            )


        # Connect submit button to QA function
        submit_button.click(
            fn=ask_document,
            inputs=question_input,
            outputs=[answer_output, sources_output]
        )

    print("Gradio Interface setup complete. Launching the app...")
    demo.launch()

launch_gradio_interface(qa_chain)



Setting up Gradio Interface...
Gradio Interface setup complete. Launching the app...
It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://6b2c432bdde5a7f251.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
