<a href="https://colab.research.google.com/github/Nadine-kassir/RAG-Project/blob/main/Rag_assistant_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install langchain langchain_community

In [2]:
import os
import numpy as np
from langchain_community.embeddings import HuggingFaceBgeEmbeddings, OpenAIEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain.chains.combine_documents.stuff import create_stuff_documents_chain


In [3]:
# Load PDF documents
data_folder = "data"

# Make sure folder exists
if not os.path.exists(data_folder):
    os.makedirs(data_folder)

# Load all PDFs from the folder
loader = PyPDFDirectoryLoader(data_folder)
documents = loader.load()

print(f"Loaded {len(documents)} documents from '{data_folder}'")


Loaded 0 documents from 'data'


In [4]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-6.1.2-py3-none-any.whl.metadata (7.1 kB)
Downloading pypdf-6.1.2-py3-none-any.whl (323 kB)
Installing collected packages: pypdf
Successfully installed pypdf-6.1.2


In [31]:
#  Load all PDFs from the "data" folder
loader = PyPDFDirectoryLoader("./data")

docs_before_split = loader.load()

In [32]:
# Split text into smaller chunks
text_splitter =  RecursiveCharacterTextSplitter(
    chunk_size =500,
    chunk_overlap = 50
)
docs_after_split = text_splitter.split_documents(docs_before_split)

In [33]:
len(docs_after_split[0].page_content)

489
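
The effect of `chunk_size` and `chunk_overlap` can be pictured with a much simpler fixed-width chunker. This is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, which is why the first chunk above is 489 characters rather than exactly 500. `sliding_chunks` is a hypothetical helper, not a LangChain function.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# Toy input: 1200 digits, so the window boundaries are easy to check
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])          # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True: neighbours share 50 characters
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.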

The PDF files have to be uploaded to the Colab environment before they can be loaded; the first loading cell reported 0 documents, which is what the loader returns when `data` contains no PDFs.

To upload your PDF files through the Colab interface:

1.  **Open the file browser:** Click the folder icon in the left sidebar of the Colab notebook to open the file browser pane.
2.  **Navigate to the target directory:** To upload into a specific folder (such as `data`), open that folder in the file browser. If it does not exist yet, create it first (right-click in the file browser and select "New folder").
3.  **Upload your files:** From the target directory, either:
    *   Drag and drop the PDF files from your computer into the file browser pane.
    *   Right-click in the file browser pane, select "Upload", and choose the PDF files from your computer.

After uploading, make sure the path passed to `PyPDFDirectoryLoader` matches the upload location (e.g., `./data/`), then re-run the cells starting from the document-loading step.
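
A quick way to confirm that the upload worked before re-running the loader is to list the PDFs in the folder. A minimal sketch, assuming the files sit directly inside `data`; `list_pdfs` is a hypothetical helper and not part of LangChain.

```python
from pathlib import Path

def list_pdfs(folder="data"):
    """Return the PDF files directly inside `folder`, sorted by path."""
    path = Path(folder)
    if not path.is_dir():
        return []  # folder missing: nothing has been uploaded yet
    return sorted(
        p for p in path.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

pdfs = list_pdfs("data")
print(f"Found {len(pdfs)} PDF file(s) in 'data'")
for pdf in pdfs:
    print(" -", pdf.name)
```

If this prints zero files, the loader will also return zero documents, so it is worth checking before building the index.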

In [36]:
print(f"Number of documents loaded: {len(docs_before_split)}")

Number of documents loaded: 59


In [37]:
def avg_doc_length(docs):
    """Average number of characters per document (integer division)."""
    return sum(len(doc.page_content) for doc in docs) // len(docs)

avg_char_before_split = avg_doc_length(docs_before_split)
avg_char_after_split = avg_doc_length(docs_after_split)
print(f'before split: {avg_char_before_split}')
print(f'after split: {avg_char_after_split}')

before split: 5043
after split: 444


In [38]:
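# Hugging Face embedding model.
# all-MiniLM-L6-v2 is not a BGE model, so use the generic HuggingFaceEmbeddings
# wrapper; HuggingFaceBgeEmbeddings prepends a BGE-specific instruction to queries.
from langchain_community.embeddings import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda"},  # needs a GPU runtime; use "cpu" otherwise
    encode_kwargs={"normalize_embeddings": True},  # scale vectors to unit length
)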
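
Setting `normalize_embeddings` to `True` scales every embedding to unit length. For unit vectors the dot product equals the cosine similarity, and the squared Euclidean distance is `2 - 2 * cosine`, so ranking by distance and ranking by cosine give the same order. The sketch below checks this with toy 2-D vectors in plain Python; the helper names are illustrative, and whether the FAISS index in this notebook ranks by Euclidean distance is an assumption.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([3.0, 4.0])   # -> [0.6, 0.8]
b = normalize([1.0, 0.0])   # -> [1.0, 0.0]
cosine = dot(a, b)

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(round(cosine, 6), round(squared_l2(a, b), 6), round(2 - 2 * cosine, 6))
```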

In [18]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0


In [39]:
vectorstore = FAISS.from_documents(docs_after_split, huggingface_embeddings)

In [40]:
query = "What are the treatments for breast cancer in young women?"

In [41]:
relevant_documents = vectorstore.similarity_search(query)

In [42]:
print(relevant_documents[1].page_content)


breast changes associated with pregnancy, which can make 
it difficult to distinguish a breast mass in a pregnant 
woman,
18 resulting in PABC patients tending to receive a 
more advanced diagnosis.17,19
PABC treatment
Breast cancer treatment can vary with cancer stage, 
hormone receptor and other biomarker status as well as 
general health status.
20 The main treatments for breast 
cancer include surgery, radiation therapy and systemic 
treatments.14 However, when a breast cancer patient is


In [43]:
# Turn the FAISS vector store into a retriever that returns the 3 most similar chunks.
# The RAG chain built later uses a retriever with the same settings.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})
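
Conceptually, `search_kwargs={"k": 3}` asks for the three chunks whose embeddings are closest to the query embedding. A brute-force sketch of that selection, using toy vectors and a hypothetical `top_k` helper (FAISS uses an index structure rather than this linear scan):

```python
import heapq

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors with the highest dot product to query_vec."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    return heapq.nlargest(k, range(len(doc_vecs)), key=scores.__getitem__)

# Toy unit vectors standing in for chunk embeddings
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [0.6, 0.8], [-1.0, 0.0]]
print(top_k([1.0, 0.0], doc_vecs, k=3))  # [0, 2, 3]
```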

In [44]:
query = "What are the main risk factors for breast cancer?"
docs = retriever.invoke(query)  # get_relevant_documents() is deprecated in favor of invoke()

for i, doc in enumerate(docs):
    print(f"\nDocument {i+1}:\n")
    print(doc.page_content[:500])  # show first 500 chars



Document 1:

developing breast cancer. Modifiable risk factors may be changed or avoided and include 
obesity, a sedentary lifestyle and exposure to exogenous hormones. Factors such as a 
person’s genetic predisposition and aging are non-modifiable and are unavoidable (Table 1).
referral pathway
In the UK there are two major referral pathways for patients with suspected breast cancer. 
Approximately 52% of cases of breast cancer are diagnosed via referrals from primary

Document 2:

a breast cancer diagnosis.
If the ﬁrst-degree relative was diagnosed before age 40 years, the risk
of developing breast cancer increases by a factor of two.
Increase risk of breast cancer in women with one or more ﬁrst-degree
relatives diagnosed with prostate cancer.
[107–110]
Table 1. Factors Associated with Risk of Breast Cancer Development
aRisk factors currently under study are not shown in this table.

Document 3:

cords; Table 1 ) [133] in its probabilistic assessment of breast cancer risk. The multi



In [45]:

# Connect a Hugging Face model that generates answers from the retrieved context.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

pipe = pipeline(
    "text-generation",
    model=model_id,
    max_new_tokens=512,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.3,
    repetition_penalty=1.1
)

llm = HuggingFacePipeline(pipeline=pipe)


Device set to use cuda:0


In [46]:
prompt_template = """
You are an expert medical assistant.
Use the following pieces of context to answer the question accurately and clearly.
If you don't find enough information, say "I don't have enough data to answer confidently."

Context:
{context}

Question: {question}

IMPORTANT: You must wrap your entire answer inside <results></results> tags.

Format your response exactly as follows:
<results>
[Your answer here]
</results>

Answer:
"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

In [47]:
# Create the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # put all retrieved chunks into a single prompt
    retriever=vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
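
With `chain_type="stuff"`, the retrieved chunks are concatenated and substituted into the `{context}` slot of the prompt, and the user's question fills `{question}`. The sketch below mimics that step with plain string formatting on a shortened template. The `"\n\n"` separator, the `build_stuff_prompt` helper, and the two example chunks are assumptions for illustration, not the chain's actual internals.

```python
# Shortened stand-in for the notebook's prompt template
template = (
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

def build_stuff_prompt(chunks, question):
    """Join all chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)

# Illustrative chunk texts, not actual retriever output
chunks = [
    "Modifiable risk factors include obesity and a sedentary lifestyle.",
    "Genetic predisposition and aging are non-modifiable risk factors.",
]
prompt = build_stuff_prompt(chunks, "What are the main risk factors for breast cancer?")
print(prompt)
```

Because every chunk lands in one prompt, `k` and `chunk_size` together determine how much of the model's context window the retrieved text consumes.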

In [48]:
def extract_results(response):
    """Extract content from <results> tags - uses the LAST occurrence of each tag"""
    start_tag = '<results>'
    end_tag = '</results>'

    # rfind() returns the index of the last occurrence, or -1 if the tag is absent
    last_start = response.rfind(start_tag)
    end_idx = response.rfind(end_tag)

    if last_start != -1 and end_idx != -1:
        start_idx = last_start + len(start_tag)

        if end_idx > start_idx:
            content = response[start_idx:end_idx].strip()
            print(f"DEBUG: Extracted from last occurrence, length: {len(content)}")
            return content

    # Fallback: no usable tag pair, so return the whole response
    return response.strip()

def query_rag_fixed(query):
    """Fixed version that handles multiple tag occurrences"""
    try:
        result = qa_chain.invoke({"query": query})
        full_result = result.get('result', 'No answer found.')

        # Extract answer (will use LAST occurrence)
        answer = extract_results(full_result)

        # Display
        print("\n" + "=" * 80)
        print("ANSWER:")
        print(answer)
        print("=" * 80)

        # Sources
        print("\n📚 SOURCES:")
        source_docs = result.get('source_documents', [])
        sources = {}
        for doc in source_docs:
            src = doc.metadata.get('source', 'Unknown')
            page = doc.metadata.get('page', 'N/A')
            sources.setdefault(src, []).append(page)

        for i, (src, pages) in enumerate(sources.items(), 1):
            page_list = ', '.join(map(str, sorted(set(pages))))
            print(f"  {i}. {src} (Pages: {page_list})")

        print("=" * 80)

        return answer

    except Exception as e:
        print(f"❌ Error: {str(e)}")
        return None

# Test
query = "What are the latest treatments for breast cancer?"
answer = query_rag_fixed(query)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


DEBUG: Extracted from last occurrence, length: 834

ANSWER:
The latest treatments for breast cancer include systemic treatments such as chemotherapy, hormone therapy, and targeted therapies. Systemic treatments are used to destroy cancer cells that have spread beyond the breast or lymph nodes. Chemotherapy uses powerful drugs to kill rapidly dividing cells, including cancer cells. Hormone therapy targets specific receptors on cancer cells that respond to certain hormones. Targeted therapies use monoclonal antibodies or other substances to target specific molecules involved in cancer growth and progression. One promising area of research is immunotherapies, such as immune checkpoint blockade (ICB), which can help the body's own immune system recognize and attack cancer cells. However, more research is needed before these treatments become widely available for breast cancer patients.

📚 SOURCES:
  1. data/TheHistoryOfEarlyBreastCancerTreatment.pdf (Pages: 0)
  2. data/Pregnancy and Breas
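
The extraction targets the last `<results>` pair because the prompt template itself contains a `<results>` placeholder, so a response that echoes the prompt has more than one pair. As an alternative sketch, the same idea can be written with a regular expression; `extract_last_results` and the `echoed` string are hypothetical examples, not output from the model.

```python
import re

# Non-greedy match so each <results>...</results> pair is captured separately
RESULTS_RE = re.compile(r"<results>(.*?)</results>", re.DOTALL)

def extract_last_results(response):
    """Return the content of the last <results> pair, or the stripped response."""
    matches = RESULTS_RE.findall(response)
    return matches[-1].strip() if matches else response.strip()

# Hypothetical response: echoed prompt placeholder followed by the real answer
echoed = (
    "Format your response exactly as follows:\n"
    "<results>\n[Your answer here]\n</results>\n\n"
    "Answer:\n<results>\nSurgery, radiation therapy and systemic treatments.\n</results>"
)
print(extract_last_results(echoed))
```

Note that if the model omits its closing tag, this version falls back to the placeholder pair from the prompt, so checking the extracted text against `[Your answer here]` may be worthwhile.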

In [49]:
import gradio as gr
import re

# Predefined extraction function - finds LAST occurrence of <results> tags
def extract_results(response):
    """Extract content from LAST occurrence of <results> tags"""
    start_tag = '<results>'
    end_tag = '</results>'

    # Find LAST occurrence using rfind()
    end_idx = response.rfind(end_tag)

    if end_idx != -1:
        # Find the <results> that comes before this </results>
        start_idx = response.rfind(start_tag, 0, end_idx)

        if start_idx != -1:
            content = response[start_idx + len(start_tag):end_idx].strip()
            if content:  # Make sure it's not empty
                return content

    # Fallback: return full response if no tags found
    return response.strip()

# Wrap RAG chatbot in a function
def ask_bot(question):
    if not question or question.strip() == "":
        return "Please enter a question."

    try:
        # Use invoke method for the RAG chain
        result = qa_chain.invoke({"query": question})

        # Get the full result
        full_result = result.get('result', 'No answer found.')

        # Extract clean answer using predefined function
        clean_answer = extract_results(full_result)

        # Get sources
        source_docs = result.get('source_documents', [])
        if source_docs:
            sources_text = "\n\n📚 **Sources:**\n"
            sources = {}
            for doc in source_docs:
                src = doc.metadata.get('source', 'Unknown')
                page = doc.metadata.get('page', 'N/A')
                sources.setdefault(src, []).append(page)

            for i, (src, pages) in enumerate(sources.items(), 1):
                page_list = ', '.join(map(str, sorted(set(pages))))
                sources_text += f"{i}. {src} (Pages: {page_list})\n"

            return clean_answer + sources_text
        else:
            return clean_answer

    except Exception as e:
        return f"❌ Error: {str(e)}\n\nPlease try rephrasing your question."

# Create Gradio interface with enhanced features
interface = gr.Interface(
    fn=ask_bot,
    inputs=gr.Textbox(
        label="Your Question",
        placeholder="e.g., What are the risk factors for breast cancer?",
        lines=3
    ),
    outputs=gr.Textbox(
        label="Answer",
        lines=12
    ),
    title="🩺 Breast Cancer RAG Chatbot",
    description="Ask questions about breast cancer based on the loaded medical PDFs. Answers are generated from passages retrieved from those PDFs and list their sources.",
    examples=[
        ["What are the main types of breast cancer?"],
        ["What are the early symptoms of breast cancer?"],
        ["What screening methods are recommended?"],
        ["What are the treatment options available?"]
    ],
    theme=gr.themes.Soft(),
    allow_flagging="never"
)

# Launch with options
interface.launch(
    share=True,
    server_name="0.0.0.0",  # Listen on all network interfaces
    server_port=7861         # Fixed port (Gradio's default is 7860)
)



Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://d83f1d39c3ae700d4c.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


