<a href="https://colab.research.google.com/github/AbdullahHasan0/AI-Powered-PDF-Question-Answering-RAG-Pipeline-/blob/main/rag_pipeline_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Pipeline Using Hugging Face

## Install Dependencies

In [1]:
# Install necessary libraries for Hugging Face RAG pipeline
!pip install -q langchain_community
!pip install -q chromadb
!pip install -q tiktoken
!pip install -q PyPDF
!pip install -q langchain_huggingface
!pip install -q langchain_chroma
!pip install -q bitsandbytes accelerate

## Imports & Hugging Face Login Support

In [2]:
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
from langchain.llms import HuggingFaceHub
from langchain_chroma import Chroma  # single import; the langchain_community version was shadowed by this one
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQAWithSourcesChain
from huggingface_hub import notebook_login


### 1. Transformers

`AutoModel`, `AutoModelForCausalLM`, `AutoTokenizer` → load models and tokenizers for embeddings or RAG.

`pipeline` → quick wrapper for generation or embedding tasks.

`BitsAndBytesConfig` → optimize model loading (quantization, GPU-friendly).
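
As a rough illustration of why quantization matters, the sketch below estimates the memory needed for model weights alone at different precisions. The parameter count of about 7.24 billion for a 7B-class model is an assumption used for illustration, and the helper name `weight_memory_gb` is hypothetical. Real usage is higher because of activations, the KV cache, and quantization metadata.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory (in GB, 1 GB = 1e9 bytes) for model weights alone."""
    return n_params * bits_per_param / 8 / 1e9

# Assumed parameter count for a 7B-class model (illustrative only).
N_PARAMS = 7.24e9

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(N_PARAMS, bits):.2f} GB")
```

Under these assumptions, 4-bit weights take about a quarter of the float16 footprint, which is what makes a 7B model practical on a single Colab GPU.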

### 2. LangChain / Hugging Face integration

`HuggingFaceEmbeddings` → generate embeddings from HF models.

`HuggingFacePipeline` → wrap HF pipeline as an LLM for LangChain.

`HuggingFaceHub` → access models hosted on Hugging Face Hub.

### 3. Vector Stores & QA Chain

`Chroma` → vector database for embeddings.

`RecursiveCharacterTextSplitter` → split PDFs into chunks.

`PyPDFLoader` → load PDFs.

`RetrievalQAWithSourcesChain` → combine retriever + LLM for RAG with source references.

### 4. Hugging Face Login

`notebook_login()` → authenticate with Hugging Face Hub, which is required for private or gated models.

## Loading PDF


In [3]:
def load_pdf():
    """
    Allows user to pick a PDF file in Colab and loads its content using PyPDFLoader.
    Returns a (file_path, raw_documents) tuple, or (None, None) on failure,
    so the caller can always unpack two values.
    """
    try:
        from google.colab import files
    except ImportError:
        print("File picker only works in Colab. Provide FILE_PATH manually in other environments.")
        return None, None

    print("Please upload your PDF document:")
    uploaded = files.upload()

    if not uploaded:
        print("No file uploaded.")
        return None, None

    # Get the first uploaded file
    file_path = list(uploaded.keys())[0]
    print(f"Loading PDF from {file_path}...")

    try:
        loader = PyPDFLoader(file_path)
        raw_documents = loader.load()
        print(f"Loaded {len(raw_documents)} pages from the PDF")
        print("-"*10)
        print(raw_documents[0].page_content[:500])  # preview first 500 characters
        return file_path, raw_documents

    except Exception as e:
        print(f"Error loading PDF: {e}")
        return None, None

file_path, raw_documents = load_pdf()


Please upload your PDF document:


Saving Abdullah_Hasan_AI _RESUME.pdf to Abdullah_Hasan_AI _RESUME.pdf
Loading PDF from Abdullah_Hasan_AI _RESUME.pdf...
Loaded 2 pages from the PDF
----------
Syed Abdullah Hasan 
AI/ML Engineer   
Karachi, Pakistan | Ph: +923228220707 | abdullahhasan1045@gmail.com | LinkedIn: Abdullah Hasan | 
Github: Abdullah Hasan 
  
Professional Summary 
Aspiring AI/ML Engineer with hands-on experience in machine learning, deep learning, computer vision, 
NLP, and LLM-based applications. Skilled in Python, TensorFlow, PyTorch, Scikit-learn, and LangChain. 
Experienced in building end-to-end AI solutions including medical image classification, sentiment analysis, 


* Uses Colab file picker to upload a PDF at runtime.

* Loads the PDF using PyPDFLoader → returns the file path and the list of pages (raw_documents).

* Prints a preview of the first 500 characters.

* Handles errors if the PDF can’t be loaded.

* Is general-purpose: it works with any PDF you upload.
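
Outside Colab the file picker is unavailable, so a path has to be supplied by hand. Below is a minimal sketch of a path check that could run before handing the file to `PyPDFLoader`. The helper name `validate_pdf_path` is hypothetical and not part of the notebook.

```python
from pathlib import Path

def validate_pdf_path(file_path: str) -> str:
    """Return the path as a string if it points to an existing .pdf file."""
    path = Path(file_path)
    # Check the extension first so obviously wrong inputs fail fast
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return str(path)
```

The returned string can then be passed to `PyPDFLoader(...)` in the same way `load_pdf()` passes the uploaded file name.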

## Splitting Document (Chunking)


In [4]:
def split_documents_into_chunks(raw_documents, chunk_size=1000, chunk_overlap=150):
    """
    Splits loaded PDF documents into smaller text chunks for embedding.

    Parameters:
        raw_documents (list): List of loaded PDF pages.
        chunk_size (int): Maximum characters per chunk.
        chunk_overlap (int): Overlap between chunks to preserve context.

    Returns:
        list: List of chunked documents.
    """
    if not raw_documents:
        print("No documents to split.")
        return []

    print("\nSplitting the loaded document into smaller chunks...")

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )

    documents = text_splitter.split_documents(raw_documents)

    if not documents:
        raise ValueError("Error: Splitting resulted in zero documents")

    print(f"Document split into {len(documents)} chunks.")
    return documents

documents = split_documents_into_chunks(raw_documents)




Splitting the loaded document into smaller chunks...
Document split into 5 chunks.


* Takes loaded PDF pages (raw_documents) as input.

* Splits them into smaller chunks for embeddings (controlled by chunk_size and chunk_overlap).

* Overlap keeps context between chunks so answers aren’t cut off.

* Returns list of chunked documents, ready for embedding or vector store.

* Raises an error if splitting fails or returns zero chunks.
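
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraphs and newlines, so its chunks will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=150):
    """Cut text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the final character of the previous one, which is the overlap that keeps a sentence from being cut without context.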

## Create Vector Store (Embeddings)

In [5]:
def create_vector_store(documents, model_name="BAAI/bge-large-en-v1.5"):
    """
    Creates embeddings using a Hugging Face model and stores them in ChromaDB.

    Parameters:
        documents (list): List of chunked documents.
        model_name (str): Hugging Face embeddings model name.

    Returns:
        Chroma: Chroma vector store containing embeddings.
    """
    if not documents:
        print("No documents provided for vector store creation.")
        return None

    print("Initializing HuggingFace Embeddings model...")
    embedding = HuggingFaceEmbeddings(model_name=model_name)
    print(f"HuggingFace Embeddings model '{model_name}' initialized")

    print("\nCreating ChromaDB vector store...")
    vector_store = Chroma.from_documents(documents=documents, embedding=embedding)

    # Verify number of items
    vector_count = vector_store._collection.count()
    print(f"ChromaDB vector store created with {vector_count} items.")

    if vector_count == 0:
        print("Warning: Vector store creation resulted in 0 items")

    return vector_store

vector_store = create_vector_store(documents)

Initializing HuggingFace Embeddings model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


HuggingFace Embeddings model 'BAAI/bge-large-en-v1.5' initialized

Creating ChromaDB vector store...
ChromaDB vector store created with 5 items.


* Loads a Hugging Face embeddings model (BAAI/bge-large-en-v1.5 by default).

* Converts chunked documents into embeddings.

* Stores embeddings in Chroma vector store for fast similarity search.

* Prints the number of vectors stored.

* Handles empty input gracefully.
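
Conceptually, a vector store keeps one record per chunk: the embedding vector, the original text, and its metadata. The toy sketch below shows that idea with a hashed bag-of-words "embedding" in plain Python. It is only an illustration under that assumption: the real notebook uses a neural embedding model and Chroma, and the names `toy_embed` and `ToyVectorStore` are hypothetical.

```python
import math
import zlib

def toy_embed(text: str, dim: int = 16) -> list:
    """Hash each word into one of `dim` buckets, then L2-normalize the counts."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

class ToyVectorStore:
    """Keeps (vector, text, metadata) records in a plain list."""

    def __init__(self):
        self.records = []

    def add_texts(self, texts, metadatas):
        for text, meta in zip(texts, metadatas):
            self.records.append((toy_embed(text), text, meta))

    def count(self) -> int:
        return len(self.records)

# Hypothetical sample chunks, for illustration only
store = ToyVectorStore()
store.add_texts(
    ["Skilled in Python and PyTorch", "Bachelor of Science in Computer Science"],
    [{"source": "resume.pdf", "page": 0}, {"source": "resume.pdf", "page": 1}],
)
print(store.count())  # 2
```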

## Similarity Search Function


In [6]:
def test_similarity_search(vector_store, query, k=2):
    """
    Performs a similarity search on the vector store for a given query.

    Parameters:
        vector_store (Chroma): The Chroma vector store containing embeddings.
        query (str): The search query string.
        k (int): Number of similar documents to retrieve.

    Returns:
        list: List of similar documents.
    """
    if not vector_store:
        print("Vector store not found. Please create it first.")
        return []

    print(f"\n--- Testing Similarity Search for: '{query}' ---")
    try:
        similar_docs = vector_store.similarity_search(query, k=k)
        print(f"Found {len(similar_docs)} similar documents.")

        for i, doc in enumerate(similar_docs):
            print(f"\n--- Document {i+1} ---")
            content_snippet = doc.page_content[:700].strip() + "..."
            source = doc.metadata['source'].split("/")[-1] if 'source' in doc.metadata else "Unknown"
            print(f"Content snippet: {content_snippet}")
            print(f"Source: {source}")

        return similar_docs

    except Exception as e:
        print(f"Error occurred while searching for similar documents: {e}")
        return []

test_docs = test_similarity_search(vector_store, "what is sick leave policy")




--- Testing Similarity Search for: 'what is sick leave policy' ---
Found 2 similar documents.

--- Document 1 ---
Content snippet: June 2024 – Sep 2024  
• Expertise in data cleaning, scaling, encoding, and building various machine learning models, 
including neural networks and CNNs. 
• Partnered with teams to create innovative solutions for complex real-world challenges. 
• Committed to advancing machine learning skills through practical experience and ongoing 
education. 
• Fellow of the Month – June’24, Bytewise Limited  
  
Academic Qualification   
University Of Karachi - UBIT                                                                                                      2021 – 2024 
• Bachelors of Science in Computer Science  
  
Course Certifications:  
• Certified Associate Data Scientist by DataCamp 
• Da...
Source: Abdullah_Hasan_AI _RESUME.pdf

--- Document 2 ---
Content snippet: Syed Abdullah Hasan 
AI/ML Engineer   
Karachi, Pakistan | Ph: +923228220707 | abdullahha

* Takes vector store and query as input.

* Performs a semantic similarity search and returns the top k documents.

* Prints a preview (first 700 characters) of each matched document.

* Prints source file name if available.

* Returns the list of similar documents for further processing.

* Handles errors gracefully.
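
The ranking step behind a similarity search can be sketched with cosine similarity over small hand-written vectors. This is an illustration only: the distance metric and index that Chroma actually uses are configurable and may differ, and the function names here are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-dimensional "embeddings", for illustration only
doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
for index, score in top_k([1.0, 0.1], doc_vecs, k=2):
    print(index, round(score, 3))
```

Note that a top-k search always returns k results, even when nothing in the store is relevant. That is why the query about a sick leave policy above still returns two chunks from a resume.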

## Hugging Face Login + LLM Setup

In [8]:
def setup_huggingface_llm(model_name="mistralai/Mistral-7B-Instruct-v0.2", max_new_tokens=512):
    """
    Logs into Hugging Face Hub and sets up a 4-bit quantized Hugging Face LLM pipeline.

    Parameters:
        model_name (str): HF model name to load.
        max_new_tokens (int): Maximum tokens for generation.

    Returns:
        llm (HuggingFacePipeline): LangChain wrapper around HF pipeline.
    """
    # Login. notebook_login() only renders a widget and returns None,
    # so the status printed below is always None.
    login_status = notebook_login()
    print(f"HuggingFace Login Status: {login_status}")
    print("If model loading fails with an authentication error, log in and rerun this cell.")

    # 4-bit quantization config for memory efficiency.
    # Double quantization and the quant type are 4-bit options in
    # BitsAndBytesConfig; there are no bnb_8bit_* equivalents.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="float16"
    )

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Load quantized model; device_map="auto" places it on the available GPU
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto"
    )

    # Create text-generation pipeline.
    # Generation settings are set here so they apply at generation time.
    generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=max_new_tokens,
        do_sample=True
    )

    # Wrap for LangChain
    llm = HuggingFacePipeline(pipeline=generator)

    print(f"HuggingFace LLM '{model_name}' successfully configured with 4-bit quantization.")
    return llm

llm = setup_huggingface_llm()

HuggingFace Login Status: None


Device set to use cuda:0


HuggingFace LLM 'mistralai/Mistral-7B-Instruct-v0.2' successfully configured with 4-bit quantization.


## Configure Retriever

In [9]:
# Configure retriever from vector store
retriever = vector_store.as_retriever(search_kwargs={"k": 2})
print("Retriever configured successfully from vector store.")

Retriever configured successfully from vector store.


* Converts the Chroma vector store into a retriever.

* k=2 → fetches the top 2 most relevant chunks for each query.

* This retriever will be passed to the RAG QA chain.

## Setup RAG QA Chain with Custom Prompt

In [10]:
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Explicit prompt for small models
custom_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are a helpful assistant. Answer the question using ONLY the information in the context below.\n"
        "If the context does not contain the answer, say you don't know.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\n\nAnswer concisely:"
    )
)

# Create the Retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",  # suitable for short docs
    chain_type_kwargs={"prompt": custom_prompt},
    return_source_documents=True,
    verbose=True
)

print("Retrieval QA chain created with custom prompt.")


Retrieval QA chain created with custom prompt.


* The custom prompt reduces hallucination by instructing the model to answer only from the retrieved context and to say it doesn’t know otherwise.

* RetrievalQA.from_chain_type combines:

    * LLM → Hugging Face pipeline

    * Retriever → vector store chunks

    * Prompt template → how the answer is generated

* return_source_documents=True → allows you to show which chunk the answer came from.

* chain_type="stuff" → simple strategy for small/medium documents.
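
The "stuff" strategy amounts to pasting every retrieved chunk into a single prompt. The plain-Python sketch below mimics that step with the same template text as the notebook. The helper name `build_stuff_prompt`, the sample chunks, and the `"\n\n"` separator are assumptions made for illustration.

```python
PROMPT_TEMPLATE = (
    "You are a helpful assistant. Answer the question using ONLY the information in the context below.\n"
    "If the context does not contain the answer, say you don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\n\nAnswer concisely:"
)

def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Hypothetical retrieved chunks, for illustration only
prompt = build_stuff_prompt(
    ["Skilled in Python and PyTorch.", "BS in Computer Science, 2021-2024."],
    "What are the candidate's skills?",
)
print(prompt)
```

Because all chunks share one prompt, this approach only works while k chunks of `chunk_size` characters fit in the model's context window.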

## Clean RAG Output Function

In [11]:
def cleaned_answer(result):
    """
    Cleans the output of the RetrievalQA chain by extracting the answer part.

    Parameters:
        result (dict): Output dictionary from qa_chain.invoke().

    Returns:
        str: Cleaned answer text.
    """
    # The pipeline echoes the prompt, so keep only the text after the last
    # "Answer concisely:" marker (no trailing space, matching the prompt template)
    result_safe = result['result'].split('Answer concisely:')[-1].strip()
    return result_safe
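
The text-generation pipeline returns the prompt followed by the completion, so the answer has to be cut out after the prompt's final line. Here is a small sketch of that extraction on a hand-written result dictionary. The function name `extract_answer` and the sample text are hypothetical.

```python
ANSWER_MARKER = "Answer concisely:"

def extract_answer(result: dict) -> str:
    """Return the text after the last prompt marker, or the full text if the marker is absent."""
    text = result["result"]
    # rpartition splits on the LAST occurrence of the marker
    _, found, tail = text.rpartition(ANSWER_MARKER)
    return tail.strip() if found else text.strip()

sample = {"result": "Context:\n...\n\nQuestion: skills?\n\nAnswer concisely: Python and PyTorch."}
print(extract_answer(sample))  # Python and PyTorch.
```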


## Gradio Query Handler Function

In [40]:
def ask_document(user_query):
    """
    Processes the user query using the Hugging Face RAG chain and returns formatted results.

    Parameters:
        user_query (str): The question input by the user via Gradio.

    Returns:
        tuple: (answer, sources) - Answer string and source references.
    """
    print(f"\nProcessing Gradio query: '{user_query}'")

    # Handle empty query
    if not user_query or user_query.strip() == "":
        print("Empty query received. Returning prompt for valid input.")
        return "Please enter a valid query.", ""

    try:
        # Run the RAG chain
        result = qa_chain.invoke({"query": user_query})
        print(result)


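        # Extract answer and sources
        answer = cleaned_answer(result)
        # Collect unique source paths from all retrieved chunks, keeping order
        source_docs = result.get('source_documents', [])
        unique_sources = list(dict.fromkeys(
            doc.metadata.get('source', 'Unknown') for doc in source_docs
        ))

        # Format sources nicely
        if not unique_sources:
            sources = "No sources found."
        elif unique_sources == [file_path]:
            sources = f"Retrieved from: {file_path}"
        else:
            sources = ", ".join(unique_sources)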

        print(f" --> Answer generated: {answer[:100].strip()}...")
        print(f" --> Sources Identified: {sources}")

        return answer.strip(),sources

    except Exception as e:
        print(f"Error occurred while processing query: {e}")
        return "An error occurred while processing your query.", ""
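
The source string could also name the pages the answer came from. The sketch below assumes each retrieved document carries a zero-based `page` number in its metadata, as `PyPDFLoader` stores it. The helper name `format_sources` is hypothetical, and `SimpleNamespace` objects stand in for LangChain documents.

```python
from types import SimpleNamespace

def format_sources(source_documents):
    """Build a deduplicated 'file (page N)' list, preserving retrieval order."""
    labels = []
    for doc in source_documents:
        source = doc.metadata.get("source", "Unknown")
        page = doc.metadata.get("page")
        # Convert the assumed zero-based page index to a human-friendly number
        label = f"{source} (page {page + 1})" if isinstance(page, int) else source
        if label not in labels:
            labels.append(label)
    return ", ".join(labels) if labels else "No sources found."

# Stand-in documents, for illustration only
docs = [
    SimpleNamespace(metadata={"source": "resume.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "resume.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "resume.pdf", "page": 1}),
]
print(format_sources(docs))  # resume.pdf (page 1), resume.pdf (page 2)
```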


In [38]:
ask_document("candidate skills")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Processing Gradio query: 'candidate skills'


[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m
{'query': 'candidate skills', 'result': "You are a helpful assistant. Answer the question using ONLY the information in the context below.\nIf the context does not contain the answer, say you don't know.\n\nContext:\nJune 2024 – Sep 2024  \n• Expertise in data cleaning, scaling, encoding, and building various machine learning models, \nincluding neural networks and CNNs. \n• Partnered with teams to create innovative solutions for complex real-world challenges. \n• Committed to advancing machine learning skills through practical experience and ongoing \neducation. \n• Fellow of the Month – June’24, Bytewise Limited  \n  \nAcademic Qualification   \nUniversity Of Karachi - UBIT                                                                                                      2021 – 2024 \n• Bachelors of Science in Computer Science  \n  \nCourse Certifications:  \n• Cert

"The candidate has expertise in data cleaning, scaling, encoding, and building various machine learning models including neural networks and CNNs. They have experience partnering with teams to create innovative solutions for complex challenges and are committed to advancing their machine learning skills. Their academic background includes a Bachelor's degree in Computer Science. They hold certifications in data science and various machine learning techniques. Their technical skills include proficiency in Python, NumPy, Pandas, Scikit-learn, TensorFlow, PyTorch, and LangChain, as well as ML & AI techniques such as regression, classification, clustering, CNNs, transfer learning, attention mechanisms, and LLMs. They have experience in data handling with data preprocessing, feature engineering, SQL, and NoSQL, and in visualization with Matplotlib, Seaborn, and Power BI. Their deployment & tools skills include Flask, Streamlit, Gradio, Jupyter Notebooks, Google Colab, and Git. They possess 


## Gradio Interface Function

In [41]:
import gradio as gr

def launch_gradio_interface(qa_chain):
    """
    Launch a Gradio interface to interact with the document QA system.

    Parameters:
        qa_chain: The RetrievalQA chain to handle queries (currently unused here;
            ask_document() reads the global qa_chain).
    """
    print("\nSetting up Gradio Interface...")

    with gr.Blocks(theme=gr.themes.Ocean(), title="Document QA Assistant") as demo:
        gr.Markdown(
            """
            ## Document QA Assistant
            Ask questions about your uploaded document and get answers with sources.
            """
        )

        # Input component
        question_input = gr.Textbox(
            label="Ask a question about the document",
            placeholder="Type your question here...",
            lines=2
        )

        # Output components
        with gr.Row():
            answer_output = gr.Textbox(
                label="Answer",
                placeholder="The answer will be displayed here...",
                lines=4,
                interactive=False
            )

            sources_output = gr.Textbox(
                label="Sources",
                placeholder="The sources will be displayed here...",
                lines=2,
                interactive=False
            )

        # Buttons
        with gr.Row():
            submit_button = gr.Button("Submit", variant="primary")
            clear_button = gr.ClearButton(
                components=[question_input, answer_output, sources_output],
                value="Clear All"
            )


        # Connect submit button to QA function
        submit_button.click(
            fn=ask_document,
            inputs=question_input,
            outputs=[answer_output, sources_output]
        )

    print("Gradio Interface setup complete. Launching the app...")
    demo.launch()

launch_gradio_interface(qa_chain)


Setting up Gradio Interface...
Gradio Interface setup complete. Launching the app...
It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://1155fc4e195ae5500e.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
