In [5]:
pip install pymupdf transformers faiss-cpu torch sentence-transformers

Collecting pymupdf
  Downloading PyMuPDF-1.23.7-cp310-none-manylinux2014_x86_64.whl (4.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.4/4.4 MB[0m [31m15.5 MB/s[0m eta [36m0:00:00[0m
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m17.6/17.6 MB[0m [31m35.6 MB/s[0m eta [36m0:00:00[0m
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m9.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting PyMuPDFb==1.23.7 (from pymupdf)
  Downloading PyMuPDFb-1.23.7-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (30.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.6/30.6 MB[0m [31m26.9 MB/s[0m eta [36m0:00:00[0m
Collecting sentenc

1**This line installs the required Python packages using the pip package manager.**


In [11]:
import fitz
import numpy as np
import faiss
import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer


2**These lines import necessary libraries and modules for working with PDFs, numerical operations, Faiss for similarity search, PyTorch for deep learning, and various models for natural language processing.**

In [12]:
def extract_text_from_pdf(pdf_path):
    with fitz.open(pdf_path) as doc:
        text = ""
        for page in doc:
            text += page.get_text()
    return text


3**Defines a function extract_text_from_pdf that takes a PDF file path as input and returns the extracted text using the PyMuPDF library (fitz).**

In [13]:
# Hugging Face Transformer
tokenizer_hf = AutoTokenizer.from_pretrained("bert-base-uncased")
model_hf = AutoModel.from_pretrained("bert-base-uncased")


4**Initializes a tokenizer and model from the Hugging Face Transformers library using the BERT model architecture.**

In [14]:
def create_hf_embeddings(text):
    inputs = tokenizer_hf(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model_hf(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()


5**Defines a function create_hf_embeddings that takes a text as input, tokenizes it, and obtains embeddings using the Hugging Face model. The embeddings are the mean of the last hidden states.**

In [15]:
# Sentence Transformer
model_st = SentenceTransformer('all-MiniLM-L6-v2')


6**Initializes a Sentence Transformer model using the MiniLM architecture.**


In [16]:
def create_st_embeddings(text):
    return model_st.encode(text)


7**Defines a function create_st_embeddings that takes a text as input and obtains embeddings using the Sentence Transformer model.**

In [17]:
pdf_paths = [
    "/content/sample1.pdf",
    "/content/sample2.pdf",
    "/content/sample3.pdf",
    "/content/sample4.pdf",
    "/content/sample5.pdf"
]


8**Defines a list of file paths for PDF documents.**


In [18]:
texts = [extract_text_from_pdf(pdf) for pdf in pdf_paths]
embeddings_hf = [create_hf_embeddings(text) for text in texts]
embeddings_st = [create_st_embeddings(text) for text in texts]


9**Extracts text from each PDF and computes embeddings using both Hugging Face and Sentence Transformer models.**

In [19]:
embeddings_flat_hf = np.vstack(embeddings_hf)
embeddings_flat_st = np.vstack(embeddings_st)


10**Stacks the embeddings vertically to create matrices of embeddings for each model**

In [20]:
def save_embeddings(embeddings, file_name):
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings)
    faiss.write_index(index, file_name)


11**Defines a function save_embeddings that takes embeddings and a file name as input, creates a Faiss index, adds embeddings to the index, and writes the index to a file.**

In [21]:
save_embeddings(embeddings_flat_hf, "embeddings_hf.index")
save_embeddings(embeddings_flat_st, "embeddings_st.index")


11**Saves the embeddings to Faiss index files.**


In [22]:
def load_index(file_name):
    return faiss.read_index(file_name)


12**Defines a function load_index that reads a Faiss index from a file.**


In [23]:
def search_index(index, embedding):
    _, I = index.search(embedding, k=1)
    return I[0][0]


13**Defines a function search_index that takes a Faiss index and an embedding, performs a similarity search, and returns the index of the nearest neighbor.**


In [24]:
def answer_question(question, texts, model, tokenizer=None):
    if tokenizer:
        embedding = create_hf_embeddings(question)
    else:
        embedding = model.encode(question)
    embedding = embedding.reshape(1, -1)
    idx = search_index(load_index("embeddings_hf.index" if tokenizer else "embeddings_st.index"), embedding)
    return texts[idx]


14**Defines a function answer_question that takes a question, a list of texts, a model, and an optional tokenizer. It computes the embedding of the question using either Hugging Face or Sentence Transformer, performs a similarity search, and returns the answer.**


In [25]:
questions = [
    "Outline the key tenets of sustainable development."
"Explain the functioning of quantum computing."
"Provide a historical overview of the Roman Empire."
]


15**Defines a list of example questions.**

In [26]:
# Compare answers from both models
for question in questions:
    print(f"Question: {question}")
    print("Answer using Hugging Face model:", answer_question(question, texts, model_hf, tokenizer_hf))
    print("Answer using Sentence Transformer model:", answer_question(question, texts, model_st))
    print("\n" + "-"*50 + "\n")


Question: Outline the key tenets of sustainable development.Explain the functioning of quantum computing.Provide a historical overview of the Roman Empire.
Answer using Hugging Face model: Description: Consult for laparoscopic gastric bypass.. 
medical_specialty: Bariatrics 
sample_name : Laparoscopic Gastric Bypass Consult - 1  
transcription:  
HISTORY OF PRESENT ILLNESS: , I have seen ABC today.  He is a very pleasant gentleman who is 42 
years old, 344 pounds.  He is 5'9".  He has a BMI of 51.  He has been overweight for ten years since 
the age of 33, at his highest he was 358 pounds, at his lowest 260.  He is pursuing surgical attempts 
of weight loss to feel good, get healthy, and begin to exercise again.  He wants to be able to exercise 
and play volleyball.  Physically, he is sluggish.  He gets tired quickly.  He does not go out often.  When 
he loses weight he always regains it and he gains back more than he lost.  His biggest weight loss is 
25 pounds and it was three months