<a href="https://colab.research.google.com/github/Dias-lezdo/med-embbed/blob/main/medEmbbed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
# ----------------------------
# 1. Install Required Libraries
# ----------------------------

# Install Python libraries
!pip install -U pdfplumber
!pip install -U PyMuPDF
!pip install -U sentence-transformers
!pip install -U faiss-cpu
!pip install -U nltk
!pip install -U pytesseract
!pip install -U pdf2image
!pip install -U PyPDF2
!pip install -U requests

# Install system-level dependencies
!sudo apt-get update
!sudo apt-get install -y tesseract-ocr libtesseract-dev
!sudo apt-get install -y poppler-utils

# Download NLTK data
import nltk
nltk.download('punkt')


Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.8.1
    Uninstalling nltk-3.8.1:
      Successfully uninstalled nltk-3.8.1
Successfully installed nltk-3.9.1


Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.13
Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Downloading pdf2image-1.17.0-py3-none-any.whl (11 kB)
Installing collected packages: pdf2image
Successfully installed pdf2image-1.17.0
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:3 https://developer.downl

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
# ----------------------------
# 2. Import Libraries
# ----------------------------

import os
import pdfplumber
import fitz  # PyMuPDF
from sentence_transformers import SentenceTransformer
import faiss
from nltk.tokenize import sent_tokenize
import nltk
import PyPDF2
import pytesseract
from pdf2image import convert_from_path

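# Download NLTK tokenizer data.
# NLTK 3.9+ loads 'punkt_tab' for sent_tokenize, so fetch it alongside 'punkt'.
nltk.download('punkt')
nltk.download('punkt_tab')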


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

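The chunking step in the next cell packs whole sentences into chunks of roughly `max_tokens` words and repeats the last `overlap` words of each finished chunk at the start of the next one. The sketch below illustrates that idea on a toy string using only the standard library. It is not the notebook's implementation: `split_sentences` is a naive regex splitter standing in for NLTK's `sent_tokenize`, and `chunk_words`, `max_words`, and the sample text are hypothetical names and values chosen for illustration.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_words(text, max_words=8, overlap=2):
    """Pack sentences into chunks of about max_words words, repeating the
    last `overlap` words of each finished chunk at the start of the next."""
    chunks, current = [], []
    for sentence in split_sentences(text):
        words = sentence.split()
        # Close the current chunk when adding this sentence would exceed the budget
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as context
            current = current[-overlap:] if overlap > 0 else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Made-up sample text with sentences of 3, 4, 4 and 6 words
sample = ("One two three. Four five six seven. Eight nine ten eleven. "
          "Twelve thirteen fourteen fifteen sixteen seventeen.")
for i, chunk in enumerate(chunk_words(sample), start=1):
    print(i, chunk)
```

Because the overlap words are added on top of the next sentence, a chunk can run slightly past the word budget. The same holds for the notebook's `chunk_text`, which is why `max_tokens` is best read as an approximate limit.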
In [10]:
# ----------------------------
# 3. Define PDF Path
# ----------------------------

# Define the path to your PDF file
pdf_path = '/content/brain_injury.pdf'  # <-- Replace with your actual PDF path

# Verify that the PDF exists
if not os.path.exists(pdf_path):
    raise FileNotFoundError(f"PDF not found at {pdf_path}. Please check the path and ensure the file exists.")
else:
    print(f"PDF found at {pdf_path}")

# ----------------------------
# 4. Check if PDF is Encrypted
# ----------------------------

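def check_and_decrypt_pdf(path):
    """Return True if the PDF is unencrypted or opens with an empty password."""
    with open(path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        if not reader.is_encrypted:
            print("PDF is not encrypted.")
            return True
        print("PDF is encrypted. Attempting to decrypt with an empty password...")
        try:
            # decrypt() reports a rejected password through a falsy return
            # value (0 / NOT_DECRYPTED) instead of raising an exception
            result = reader.decrypt('')
        except Exception as e:
            print(f"Failed to decrypt PDF: {e}")
            return False
        if not result:
            print("Failed to decrypt PDF: the empty password was not accepted.")
            return False
        print("PDF decrypted successfully.")
        return True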

# Check and decrypt if necessary
is_accessible = check_and_decrypt_pdf(pdf_path)
if not is_accessible:
    raise Exception("Cannot access encrypted PDF without a password.")

# ----------------------------
# 5. Extract Text from PDF
# ----------------------------

def extract_text_pdfplumber(path):
    print("Extracting text using pdfplumber...")
    all_text = ""
    try:
        with pdfplumber.open(path) as pdf:
            for page_number, page in enumerate(pdf.pages, start=1):
                text = page.extract_text()
                if text:
                    all_text += text + "\n"
                else:
                    print(f"No text found on page {page_number} using pdfplumber.")
    except Exception as e:
        print(f"Error extracting text with pdfplumber: {e}")
    return all_text

def extract_text_pymupdf(path):
    print("Extracting text using PyMuPDF...")
    all_text = ""
    try:
        # Open as a context manager so the document is closed afterwards
        with fitz.open(path) as doc:
            for page_number, page in enumerate(doc, start=1):
                text = page.get_text()
                if text.strip():
                    all_text += text + "\n"
                else:
                    print(f"No text found on page {page_number} using PyMuPDF.")
    except Exception as e:
        print(f"Error extracting text with PyMuPDF: {e}")
    return all_text

def extract_text_ocr(path):
    print("Extracting text using OCR...")
    all_text = ""
    try:
        pages = convert_from_path(path, dpi=300)
        for page_number, page in enumerate(pages, start=1):
            text = pytesseract.image_to_string(page)
            if text.strip():
                all_text += text + "\n"
            else:
                print(f"No text found on page {page_number} via OCR.")
    except Exception as e:
        print(f"Error during OCR extraction: {e}")
    return all_text

# Attempt extraction with pdfplumber
all_text = extract_text_pdfplumber(pdf_path)

# If no text found, try PyMuPDF
if not all_text.strip():
    print("No text extracted using pdfplumber. Trying PyMuPDF...")
    all_text = extract_text_pymupdf(pdf_path)

# If still no text, try OCR
if not all_text.strip():
    print("No text extracted using PyMuPDF. Attempting OCR...")
    all_text = extract_text_ocr(pdf_path)

# Final check
if not all_text.strip():
    raise Exception("Failed to extract any text from the PDF using all methods.")

print("\n--- Extracted Text Sample ---\n")
print(all_text[:1000])  # Print first 1000 characters

# ----------------------------
# 6. Preprocess and Chunk the Text
# ----------------------------

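def chunk_text(text, max_tokens=500, overlap=50):
    """Group sentences into chunks of roughly max_tokens words with word overlap."""
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # Approximate token count by word count
        current_length = len(current_chunk.split())
        sentence_length = len(sentence.split())
        if current_length + sentence_length <= max_tokens:
            current_chunk += " " + sentence
        else:
            # Nothing has accumulated yet when the first sentence is over-long;
            # skip the append so no empty chunk is stored
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            # Start new chunk with overlap. Guard overlap=0, because
            # words[-0:] would return the whole previous chunk
            overlap_words = current_chunk.split()[-overlap:] if overlap > 0 else []
            current_chunk = ' '.join(overlap_words) + " " + sentence
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks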

# Chunk the text
print("\nChunking the text...")
text_chunks = chunk_text(all_text, max_tokens=500, overlap=50)
print(f"Total Chunks: {len(text_chunks)}")

if text_chunks:
    print("\n--- First Chunk ---\n")
    print(text_chunks[0])
else:
    raise Exception("No text chunks were created. Ensure that text extraction was successful.")

# ----------------------------
# 7. Encode Text Chunks
# ----------------------------

def encode_chunks(chunks, model_name="abhinand/MedEmbed-base-v0.1"):
    print("\nLoading SentenceTransformer model...")
    model = SentenceTransformer(model_name)
    print("Encoding text chunks...")
    embeddings = model.encode(chunks, convert_to_tensor=True)
    embeddings_np = embeddings.cpu().numpy()
    print(f"Embeddings shape: {embeddings_np.shape}")
    return embeddings_np, model

# Encode the chunks
embeddings_np, model = encode_chunks(text_chunks, model_name="abhinand/MedEmbed-base-v0.1")  # Using MedEmbed model

# ----------------------------
# 8. Initialize FAISS Index
# ----------------------------

def initialize_faiss(embeddings, use_gpu=False):
    print("\nInitializing FAISS index...")
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)  # exact search; reported distances are squared L2
    print("Adding embeddings to the FAISS index...")
    index.add(embeddings)
    print(f"Total vectors in FAISS index: {index.ntotal}")

    if use_gpu and faiss.get_num_gpus() > 0:
        print("Using GPU for FAISS index.")
        res = faiss.StandardGpuResources()
        index = faiss.index_cpu_to_gpu(res, 0, index)
    else:
        print("Using CPU for FAISS index.")

    return index

# Initialize FAISS, moving the index to a GPU only if FAISS reports one
use_gpu = faiss.get_num_gpus() > 0
index = initialize_faiss(embeddings_np, use_gpu=use_gpu)

# ----------------------------
# 9. Define Query Function
# ----------------------------

def query_index(query, model, index, chunks, top_k=3):
    print(f"\nEncoding the query: '{query}'")
    query_embedding = model.encode([query], convert_to_tensor=True).cpu().numpy()

    print("Searching the FAISS index...")
    distances, indices = index.search(query_embedding, top_k)

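    results = []
    # FAISS pads the result with index -1 when the index holds fewer than
    # top_k vectors; skip those so chunks[-1] is not returned by mistake
    for distance, idx in zip(distances[0], indices[0]):
        if idx < 0:
            continue
        results.append((distance, chunks[idx]))
    return results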

# ----------------------------
# 10. Execute an Example Query
# ----------------------------

# Define your query
user_query = "What are the common signs and symptoms that indicate a person may have sustained a TBI?"

# Query the index
results = query_index(user_query, model, index, text_chunks, top_k=3)

# Display the results
print("\n--- Top Relevant Chunks ---\n")
for i, (distance, chunk) in enumerate(results, start=1):
    print(f"Result {i}:")
    print(f"Distance: {distance}")
    print(f"Chunk: {chunk}\n")

PDF found at /content/brain_injury.pdf
PDF is not encrypted.
Extracting text using pdfplumber...
No text found on page 1 using pdfplumber.
No text found on page 2 using pdfplumber.
No text found on page 3 using pdfplumber.
No text found on page 4 using pdfplumber.
No text found on page 5 using pdfplumber.
No text found on page 6 using pdfplumber.
No text found on page 7 using pdfplumber.
No text found on page 8 using pdfplumber.
No text extracted using pdfplumber. Trying PyMuPDF...
Extracting text using PyMuPDF...
No text found on page 1 using PyMuPDF.
No text found on page 2 using PyMuPDF.
No text found on page 3 using PyMuPDF.
No text found on page 4 using PyMuPDF.
No text found on page 5 using PyMuPDF.
No text found on page 6 using PyMuPDF.
No text found on page 7 using PyMuPDF.
No text found on page 8 using PyMuPDF.
No text extracted using PyMuPDF. Attempting OCR...
Extracting text using OCR...

--- Extracted Text Sample ---

National Institute of
Neurological Disorders
and Stroke



Encoding text chunks...
Embeddings shape: (11, 768)

Initializing FAISS index...
Adding embeddings to the FAISS index...
Total vectors in FAISS index: 11
Using CPU for FAISS index.

Encoding the query: 'What are the common signs and symptoms that indicate a person may have sustained a TBI?'
Searching the FAISS index...

--- Top Relevant Chunks ---

Result 1:
Distance: 0.40895509719848633
Chunk: Some accidents or trauma can cause both penetrating and non-penetrating TBI in the same person. Signs and symptoms of traumatic brain injury Headache, dizziness, confusion, and fatigue tend to start immediately after an injury but resolve over time. Emotional symptoms such as frustration and irritability tend to develop during recovery. Seek immediate medical attention if the person experiences any of the following symptoms, especially within the first 24 hours
after an injury to the head:

Physical symptoms of TBI

Headache

Convulsions or seizures

Blurred or double vision

Unequal eye pupil s
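
The `Distance` values printed above come from `faiss.IndexFlatL2`, which reports squared Euclidean distance, so a smaller value means a closer match. If the embeddings are unit-normalized, squared L2 distance and cosine similarity are related by d² = 2 − 2·cos. That is an assumption here: the notebook does not normalize the vectors, and whether the model already does so has not been checked. The sketch below verifies the identity on made-up 3-dimensional vectors using only the standard library. The helper names are hypothetical and the vectors are not real embeddings.

```python
import math

def normalize(v):
    # Scale a vector to unit length
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    # Squared Euclidean distance, the metric IndexFlatL2 reports
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity computed on unit-length copies of the inputs
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy vectors standing in for the 768-dimensional embeddings (made-up values)
query = normalize([1.0, 2.0, 2.0])
chunk = normalize([2.0, 1.0, 2.0])

d2 = squared_l2(query, chunk)
cos = cosine(query, chunk)
print(f"squared L2: {d2:.4f}  cosine: {cos:.4f}  2 - 2*cos: {2 - 2 * cos:.4f}")
```

Only if that normalization assumption holds would a distance of about 0.409 correspond to a cosine similarity of about 0.80. To have FAISS return cosine similarity directly, one option is to normalize the vectors with `faiss.normalize_L2` and index them with `faiss.IndexFlatIP`, in which case larger scores mean closer matches.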