<a href="https://colab.research.google.com/github/SarahSamehh/Rag_Model/blob/main/Medical_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Medical Text Processing and Summarization Pipeline

This notebook implements a retrieval pipeline for medical text: it extracts text from a PDF, cleans and chunks it, embeds the chunks, indexes them with FAISS, retrieves the chunks most relevant to a query, and summarizes them with Pegasus. The following sections give the code for each step and the rationale behind specific choices.

## Table of Contents
1. [Imports](#scrollTo=OMQ5svYUBmxi)
2. [Text Extraction](#scrollTo=bckgEgjTCEpr)
3. [Text Cleaning](#scrollTo=LaZKMXdkPNN0)
4. [Content-Based Chunking](#scrollTo=p7yX7rJenV5v)
5. [Semantic Chunking](#scrollTo=_m2EZf8xBnW6)
6. [FAISS Indexing](#scrollTo=1zOTiLSYYh5k)
7. [Retrieving Relevant Chunks](#scrollTo=WK8WsLwlYmhD)
8. [Summarization](#scrollTo=jSN8soTjZhNs)
9. [Answer Generation](#scrollTo=kio6T1rTZjSU)
10. [Evaluation Metrics](#scrollTo=yPOslx1iaTBE)
11. [Model Saving](#scrollTo=8o4R7ui8tv08)

In [None]:
!pip install faiss-cpu # FAISS: vector similarity search library
!pip install transformers



In [None]:
!pip install PyPDF2 sentence-transformers chromadb langchain


In [None]:

import PyPDF2  # PyPDF2 extracts text from PDF files

# Function to extract text from a PDF file
def extract_text_from_pdf(pdf_file):
    """
    Extracts all text from the given PDF file.

    Args:
        pdf_file (str): Path to the PDF file.

    Returns:
        str: The extracted text from the PDF.
    """
    page_texts = []  # Extracted text, one entry per page
    with open(pdf_file, "rb") as file:  # PDFs must be opened in binary mode
        reader = PyPDF2.PdfReader(file)  # Initialize the PDF reader
        for page in reader.pages:  # Iterate over the pages directly
            # Defensive guard: treat a page with no extractable text as empty
            page_texts.append(page.extract_text() or "")
    return "".join(page_texts)  # Concatenate the pages into one string

# Example: extract text from your PDF
pdf_file_path = "/content/Medical_book.pdf"  # Path to your PDF file
pdf_text = extract_text_from_pdf(pdf_file_path)

# Save the extracted text to a text file (UTF-8 so non-ASCII characters are written safely)
with open("medical-text.txt", "w", encoding="utf-8") as f:
    f.write(pdf_text)


In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # 1. Normalize whitespace and case
    text = re.sub(r'\s+', ' ', text)  # Collapse newlines and repeated spaces into a single space
    text = text.lower()               # Convert to lowercase

    # 2. Tokenization (whitespace split)
    tokens = text.split()

    # 3. Remove stop words and lemmatize.
    # Note: `word.isalpha()` keeps only purely alphabetic tokens, so numbers and any token
    # with attached punctuation (e.g. "fever.") are dropped, along with sentence boundaries.
    cleaned_tokens = [
        lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and word.isalpha()
    ]

    # 4. Join cleaned tokens back into a string
    cleaned_text = ' '.join(cleaned_tokens)
    return cleaned_text

# Load the extracted text from your .txt file
file_path = '/content/medical-text.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    raw_text = file.read()

# Clean the text
cleaned_text = clean_text(raw_text)

# Save the cleaned text.
# Note: this is the same path as `file_path`, so the raw extracted text is overwritten.
cleaned_file_path = '/content/medical-text.txt'
with open(cleaned_file_path, 'w', encoding='utf-8') as cleaned_file:
    cleaned_file.write(cleaned_text)

print("Text cleaning completed. Cleaned text saved to:", cleaned_file_path)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Text cleaning completed. Cleaned text saved to: /content/medical-text.txt
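
One side effect of the `word.isalpha()` filter in `clean_text` is that any token with attached punctuation (for example `fever.`) is discarded, so the cleaned text loses those words together with its sentence boundaries. This may also affect later sentence-based chunking if the cleaned file is reused. The sketch below shows the effect and one possible alternative that strips punctuation before filtering. It uses a tiny hypothetical stop-word set (`STOP_WORDS`) in place of NLTK's list and skips lemmatization so that it runs without downloads; `clean_tokens_strict` and `clean_tokens_lenient` are illustrative names, not part of the notebook.

```python
import string

# Hypothetical stand-in for NLTK's English stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "a", "of", "is", "and", "in"}

def clean_tokens_strict(text):
    """Mirrors the filter used in clean_text: keep purely alphabetic non-stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS and w.isalpha()]

def clean_tokens_lenient(text):
    """Strips surrounding punctuation first, so 'fever.' is kept as 'fever'."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if word and word not in STOP_WORDS and word.isalpha():
            tokens.append(word)
    return tokens

sample = "Fever is common. The patient reports headache, nausea and fatigue."
strict = clean_tokens_strict(sample)
lenient = clean_tokens_lenient(sample)
print(strict)   # words that carried punctuation are missing
print(lenient)  # the same words survive once punctuation is stripped
```

Whether the lenient variant is preferable depends on the downstream step: embedding models generally work on natural text, so aggressive cleaning is not always needed before chunking.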


In [None]:

################################ Basic chunking ##################################
import nltk # Importing Natural Language Toolkit (nltk) package
nltk.download('punkt') # Downloading the necessary Tokenizer Model (Punkt for sentence Tokenization)
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize # Importing sentence Tokenization function

def content_based_chunking(text, max_chunk_size=500, overlap_size=100):
    """
    Splits the text into chunks along sentence boundaries.

    Parameters:
        text: The input text to be chunked.
        max_chunk_size: Target maximum number of characters per chunk. The default of 500
            gives each chunk enough context while staying within the token limits of
            transformer models. A chunk can exceed this only when a single sentence
            (plus the overlap) is longer than the limit.
        overlap_size: Number of characters copied from the end of the previous chunk to the
            start of the next one. The default of 100 preserves continuity so that context
            spanning a chunk boundary is not lost.

    Returns:
        A list of text chunks.
    """

    sentences = sent_tokenize(text)  # Split the text into sentences so chunks break on sentence boundaries
    chunks = []  # Completed chunks
    current_chunk = ""  # Chunk currently being built

    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= max_chunk_size:  # The sentence still fits
            current_chunk += sentence + " "
        else:  # Adding the sentence would exceed the limit
            if current_chunk.strip():  # Avoid an empty chunk when the very first sentence is too long
                chunks.append(current_chunk.strip())
            # Take the last `overlap_size` characters of the finished chunk as overlap.
            # (`s[-0:]` would return the whole string, so handle overlap_size == 0 explicitly.)
            overlap_chunk = current_chunk[-overlap_size:].strip() if overlap_size > 0 else ""
            # Separate overlap and sentence with a space so words are not glued together
            current_chunk = (overlap_chunk + " " + sentence).strip() + " "

    if current_chunk.strip():  # Keep any leftover text as the final chunk
        chunks.append(current_chunk.strip())

    return chunks  # Return the final list of chunks

# Apply Content-Based chunking
chunks = content_based_chunking(pdf_text, max_chunk_size=550, overlap_size=100)
# max_chunk_size = 550: We slightly increased the chunk size to 550 for flexibility. This allows for slightly longer chunks,
# but still stays within a safe range for models that handle ~512 tokens.


print(f"Number of chunks: {len(chunks)}") # printing the number of created chunks

with open('dynamic_overlap_chunks_paraphrase_mpnet_base_v2.txt', 'w') as f: # open a new file to store the chunks
    for i, chunk in enumerate(chunks): # Looping through the chunks
        f.write(f"Chunk {i+1}:\n{chunk}\n\n")  # Save each chunk to the file with its corresponding number
print("Chunks saved to dynamic_overlap_chunks_paraphrase_mpnet_base_v2.txt") # Notify that the chunks have been saved


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Number of chunks: 6804
Chunks saved to dynamic_overlap_chunks_paraphrase_mpnet_base_v2.txt
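
The overlap behavior is easier to see on a few short sentences. The following is a minimal sketch of the same greedy packing idea, assuming a naive regex sentence splitter (`split_sentences`) in place of `nltk.sent_tokenize` and much smaller limits than the notebook uses; `chunk_with_overlap` is a hypothetical name for this illustration.

```python
import re

def split_sentences(text):
    """Naive regex splitter used only for this sketch (the notebook uses nltk.sent_tokenize)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_with_overlap(text, max_chunk_size=60, overlap_size=15):
    """Greedy sentence packing; each new chunk starts with the tail of the previous one."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = (current + " " + sentence).strip()
        if len(candidate) <= max_chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            # Character-based overlap: the last overlap_size characters of the finished chunk.
            tail = current[-overlap_size:].strip() if overlap_size > 0 else ""
            current = (tail + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks

text = ("Asthma narrows the airways. It causes wheezing. "
        "Inhalers relieve symptoms. Triggers vary widely.")
chunks = chunk_with_overlap(text)
for i, chunk in enumerate(chunks, start=1):
    print(i, repr(chunk))
```

Because the overlap is measured in characters, the start of a chunk may fall in the middle of a word. Taking the overlap in whole sentences instead would avoid that, at the cost of a less predictable overlap length.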


In [None]:
# Reload the saved text. If the cleaning cell above was run, this file holds the cleaned text.
with open("/content/medical-text.txt", "r", encoding="utf-8") as f:
    pdf_text = f.read()


In [None]:
pdf_text



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import torch # Importing PyTorch to use for tensor operations and model handling
from transformers import AutoTokenizer, AutoModel # Importing classes for pre-trained model and tokenizer from Hugging Face
from nltk.tokenize import sent_tokenize # Importing sentence tokenization function from nltk
import numpy as np # Importing numpy for array manipulations
from sklearn.metrics.pairwise import cosine_similarity # importing cosine similarity from sklearn for semantic similarity measurement

# Load the pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-mpnet-base-v2')
# The tokenizer converts sentences into token IDs compatible with the model.
# 'paraphrase-mpnet-base-v2' is trained for semantic similarity and paraphrase detection,
# which makes it well suited to producing sentence-level embeddings.

model = AutoModel.from_pretrained('sentence-transformers/paraphrase-mpnet-base-v2')
# The model maps tokenized input to embeddings in which semantically similar
# sentences are represented by vectors that lie closer together.

# Function to get sentence embeddings
def get_sentence_embedding(sentence):
    """
    Gets the embedding (vector representation) of a given sentence using a pre-trained language model.

    Args:
        sentence : The input sentence.

    Returns:
        np.ndarray: The sentence embedding as a NumPy array.
    """

    # Tokenize the sentence.
    # return_tensors='pt': return PyTorch tensors, as the model expects.
    # padding=True: pad to the longest sequence in the batch (a no-op for a single sentence).
    # truncation=True: truncate inputs longer than the model's maximum length.
    inputs = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True)

    with torch.no_grad():  # Inference only: disabling gradients saves memory and computation
        outputs = model(**inputs)  # Forward pass; no weights are updated

    # Mean pooling: average the token embeddings into a single vector for the sentence.
    # last_hidden_state has shape (batch_size, sequence_length, hidden_size);
    # averaging over dim=1 gives (batch_size, hidden_size).
    embeddings = outputs.last_hidden_state.mean(dim=1)

    return embeddings.squeeze().numpy()  # Drop the batch dimension and convert to a NumPy array

# Function for semantic chunking based on embedding similarity
def semantic_chunking(text, max_chunk_size=500, threshold=0.85):
    """
    Splits the input text into semantically coherent chunks based on cosine similarity.

    Parameters:
    - text : The input text to be chunked.
    - max_chunk_size : Maximum number of characters allowed in a chunk.
    - threshold : Cosine similarity a sentence must reach to join the current chunk.

    Returns:
    - List[str]: A list of semantically coherent chunks of text.
    """
    sentences = sent_tokenize(text)  # Split the input text into individual sentences
    if not sentences:  # Empty input: nothing to chunk
        return []

    chunks = []  # Resulting chunks
    current_chunk = sentences[0]  # Start with the first sentence as the initial chunk
    current_chunk_emb = get_sentence_embedding(current_chunk)  # Embedding of the first sentence

    for sentence in sentences[1:]:  # Iterate over the remaining sentences
        sentence_emb = get_sentence_embedding(sentence)
        # Cosine similarity between the current chunk's embedding and the new sentence's embedding
        similarity = cosine_similarity([current_chunk_emb], [sentence_emb])[0][0]

        # Add the sentence only if it is similar enough and the chunk stays within the size limit
        if similarity >= threshold and len(current_chunk) + len(sentence) <= max_chunk_size:
            current_chunk += " " + sentence  # Append the sentence to the current chunk
            # Blend the chunk embedding with the new sentence's embedding so it evolves as
            # sentences are added. This pairwise average weights recent sentences more
            # heavily; it is not the mean over all sentences in the chunk.
            current_chunk_emb = (current_chunk_emb + sentence_emb) / 2
        else:  # Similarity below the threshold, or the size limit would be exceeded
            chunks.append(current_chunk.strip())  # Close the current chunk
            current_chunk = sentence  # Start a new chunk with the current sentence
            current_chunk_emb = sentence_emb  # Reuse the embedding already computed above

    if current_chunk:  # Add the last chunk
        chunks.append(current_chunk.strip())

    return chunks  # Return the list of chunks

# Example usage (assuming pdf_text is defined)
chunks = semantic_chunking(pdf_text, max_chunk_size=500, threshold=0.85)
# As discussed, `max_chunk_size=500` ensures manageable chunk sizes, and `threshold=0.85` ensures chunks are semantically cohesive.

print(f"Number of chunks: {len(chunks)}") # Print the number of chunks created

# Save the chunks into a text file
with open('semantic_chunks_paraphrase_mpnet_base_v2.txt', 'w') as f: # Open a text file to save the chunks
    for i, chunk in enumerate(chunks): # Iterate over the chunks
        f.write(f"Chunk {i+1}:\n{chunk}\n\n") # Write each chunk to the file, labeling them by chunk number
print("Chunks saved to semantic_chunks_paraphrase_mpnet_base_v2.txt") # Print a confirmation message after saving the chunks


Number of chunks: 1
Chunks saved to semantic_chunks_paraphrase_mpnet_base_v2.txt
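
The grouping logic of `semantic_chunking` can be exercised without downloading a model. The sketch below replaces `get_sentence_embedding` with a hypothetical bag-of-words embedding (`toy_embedding`) over a small fixed vocabulary, and it keeps a true running mean of the sentence vectors in a chunk rather than the pairwise average used above. Both choices are assumptions made for illustration, and the threshold of 0.5 suits the toy vectors only.

```python
import math

VOCAB = ["heart", "blood", "pressure", "skin", "rash", "itch"]  # hypothetical toy vocabulary

def toy_embedding(sentence):
    """Counts vocabulary words; a stand-in for a real sentence embedding."""
    words = sentence.lower().replace(".", "").split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold=0.5, max_chunk_size=500):
    """Greedily groups consecutive sentences whose embedding is close to the chunk centroid."""
    if not sentences:
        return []
    chunks = []
    current = [sentences[0]]
    vec_sum = toy_embedding(sentences[0])
    for sentence in sentences[1:]:
        emb = toy_embedding(sentence)
        centroid = [x / len(current) for x in vec_sum]  # mean of the sentence vectors so far
        fits = len(" ".join(current)) + 1 + len(sentence) <= max_chunk_size
        if cosine(centroid, emb) >= threshold and fits:
            current.append(sentence)
            vec_sum = [a + b for a, b in zip(vec_sum, emb)]
        else:
            chunks.append(" ".join(current))
            current, vec_sum = [sentence], emb
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "High blood pressure strains the heart.",
    "Blood pressure is measured at the heart level.",
    "A skin rash can itch.",
    "The rash may spread across the skin.",
]
result = semantic_chunks(sentences)
print(result)
```

The two cardiovascular sentences and the two dermatology sentences end up in separate chunks because the topic change drops the similarity below the threshold.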


In [None]:
# Save the chunks into a text file
with open('semantic_chunks_paraphrase_mpnet_base_v2.txt', 'w') as f:
    for i, chunk in enumerate(chunks):
        f.write(f"Chunk {i+1}:\n{chunk}\n\n")
print("Chunks saved to semantic_chunks_paraphrase_mpnet_base_v2.txt")


Chunks saved to semantic_chunks_paraphrase_mpnet_base_v2.txt


In [None]:
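# Load chunks from the saved file.
# Note: splitting on blank lines keeps the "Chunk N:" label inside each entry
# and leaves a trailing empty string after the final separator.
with open('/content/semantic_chunks_paraphrase_mpnet_base_v2.txt', 'r') as f:
    loaded_chunks = f.read().split("\n\n")  # Chunks are separated by a double newline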

print(f"Loaded {len(loaded_chunks)} chunks")


Loaded 2 chunks
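
Splitting the saved file on blank lines keeps each `Chunk N:` label inside the chunk text and leaves an empty string after the final separator, which is consistent with one chunk being reported as two. A possible loader, sketched under the assumption that the file uses exactly the `Chunk {i}:\n{chunk}\n\n` layout written by the save cell, strips the labels and drops empty entries. The name `parse_chunk_file` is hypothetical.

```python
import re

def parse_chunk_file(content):
    """Parses text written as 'Chunk N:' + newline + chunk + blank line into a list of chunks."""
    # Split on the label lines themselves, so blank lines inside a chunk do not break it apart.
    parts = re.split(r"(?m)^Chunk \d+:\n", content)
    return [part.strip() for part in parts if part.strip()]

saved = "Chunk 1:\nfirst chunk text\n\nChunk 2:\nsecond chunk\n\n"
naive = saved.split("\n\n")       # the approach used in the cell above
parsed = parse_chunk_file(saved)  # labels removed, no empty entries
print(len(naive), naive)
print(len(parsed), parsed)
```

Keeping labels and empty strings out of `loaded_chunks` matters for the index as well, since every list entry is embedded and becomes a searchable vector.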


In [None]:

import faiss  # FAISS is a library for efficient similarity search and clustering of dense vectors.
import numpy as np  # NumPy is used for handling numerical data, especially arrays and matrices.
from transformers import AutoTokenizer, AutoModel  # Importing tokenizer and model from Hugging Face transformers.
import torch  # PyTorch is used as the deep learning framework for handling the model.

# Load the pre-trained PubMedBERT model and tokenizer from Hugging Face for generating embeddings specific to biomedical data.
# 'NeuML/pubmedbert-base-embeddings' is a model fine-tuned for generating sentence embeddings for biomedical text.
tokenizer = AutoTokenizer.from_pretrained('NeuML/pubmedbert-base-embeddings')
model = AutoModel.from_pretrained('NeuML/pubmedbert-base-embeddings')

# Function to get the embedding of a text chunk (vector representation of the chunk).
def get_chunk_embedding(chunk):
    # Tokenize the chunk into input IDs the model understands.
    # `return_tensors='pt'`: return PyTorch tensors, which the model requires.
    # `padding=True`: pad to the longest sequence in the batch (a no-op for a single chunk).
    # `truncation=True`, `max_length=512`: truncate chunks longer than 512 tokens, the model's input limit,
    # to prevent errors with long chunks.
    inputs = tokenizer(chunk, return_tensors='pt', padding=True, truncation=True, max_length=512)

    # Using `torch.no_grad()` to prevent gradient computation as we're doing inference, not training.
    with torch.no_grad():
        # Pass the tokenized input through the model to get the hidden states (embeddings) of each token.
        outputs = model(**inputs)

    # Perform mean pooling: we take the mean of all the token embeddings to create a single embedding for the chunk.
    # This averages the embeddings across the sequence (dim=1 refers to averaging across tokens in the sentence).
    embedding = outputs.last_hidden_state.mean(dim=1)

    # Convert the tensor to a NumPy array and remove unnecessary dimensions using `squeeze()`.
    return embedding.squeeze().numpy()

# `loaded_chunks` is the list of text chunks loaded above.
# Compute one embedding per chunk. FAISS expects a float32 array of shape (n_chunks, d).
chunk_embeddings = np.array([get_chunk_embedding(chunk) for chunk in loaded_chunks], dtype='float32')

# FAISS (Facebook AI Similarity Search) index initialization.
# This index allows efficient similarity searches among the embeddings.

# `chunk_embeddings.shape[1]` is the dimensionality `d` of the embeddings
# (for PubMedBERT, this is typically 768). Reading it from the array avoids hard-coding it.
d = chunk_embeddings.shape[1]

# `faiss.IndexFlatL2(d)` is an exact (brute-force) index that ranks vectors by
# squared L2 (Euclidean) distance in `d` dimensions. It requires no training step.
index = faiss.IndexFlatL2(d)

# Add the chunk embeddings to the FAISS index.
# This allows us to perform similarity searches on these embeddings later.
index.add(chunk_embeddings)

# Print confirmation of index creation with the number of chunks.
print(f"FAISS index with {len(chunk_embeddings)} chunks created.")
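
`IndexFlatL2` performs an exhaustive search: it compares the query against every stored vector and returns the `k` smallest squared Euclidean distances together with the positions of the matching vectors. The pure-Python sketch below mirrors that behavior on a few 2-D vectors so the ranking is easy to verify by hand. `flat_l2_search` is an illustrative helper, not a FAISS function.

```python
def flat_l2_search(vectors, query, k):
    """Returns (distances, ids) of the k vectors closest to query by squared L2 distance."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first; ties are broken by position
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]
distances, ids = flat_l2_search(vectors, query=[1.0, 1.0], k=2)
print(distances, ids)
```

The returned ids index into the original list of vectors, which is why the notebook can map them straight back to entries of `loaded_chunks`.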


In [None]:
faiss.write_index(index, "faiss_index_file_chunking.index")
print("FAISS index saved.")


In [None]:
index = faiss.read_index("/content/drive/MyDrive/faiss_index_file_chunking.index")
print("FAISS index loaded.")

FAISS index loaded.


In [None]:

import faiss  # FAISS is used for efficient similarity search and clustering of dense vectors.
import numpy as np  # NumPy is used to handle arrays and matrices, especially for numerical data.
from transformers import AutoTokenizer, AutoModel  # AutoTokenizer and AutoModel are used to load pre-trained tokenizers and models from Hugging Face.
import torch  # PyTorch is used as the framework for handling the model's input and inference.

# Load a general-purpose sentence embedding model.
# 'sentence-transformers/all-mpnet-base-v2' is fine-tuned for sentence embeddings and is
# widely used for tasks like semantic search and clustering.
# Note: this cell rebinds `tokenizer`, `model`, and `get_chunk_embedding`. Queries must be embedded
# with the same model that produced the vectors stored in `index`; otherwise the distances are not meaningful.
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

# Function to compute the embedding for a text chunk.
def get_chunk_embedding(chunk):
    # Tokenize the chunk.
    # `return_tensors='pt'`: return PyTorch tensors, which the model requires.
    # `padding=True`: pad to the longest sequence in the batch (a no-op for a single chunk).
    # `truncation=True`: truncate chunks longer than the model's maximum input length.
    inputs = tokenizer(chunk, return_tensors='pt', padding=True, truncation=True)

    # Disable gradient calculation using `torch.no_grad()` because we are only doing inference, not training.
    # This reduces memory usage and speeds up computation.
    with torch.no_grad():
        # Pass the tokenized input through the pre-trained model to generate embeddings.
        outputs = model(**inputs)

    # Mean pooling is used to generate a single embedding for the chunk.
    # `outputs.last_hidden_state` contains the embeddings for each token in the chunk.
    # We take the mean of these embeddings across the token dimension (`dim=1`) to produce one vector that represents the entire chunk.
    embedding = outputs.last_hidden_state.mean(dim=1)

    # Convert the PyTorch tensor to a NumPy array for easier handling and remove extra dimensions using `squeeze()`.
    return embedding.squeeze().numpy()
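
Plain `.mean(dim=1)` is exact for a single unpadded input, as in `get_chunk_embedding`. If several chunks were encoded in one padded batch, the padding positions would be averaged in as well, and a common remedy is to weight each token by the attention mask. The sketch below shows that arithmetic on nested Python lists instead of tensors, using made-up values; `masked_mean_pool` is a hypothetical helper.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Averages token vectors per sequence, ignoring positions where the mask is 0."""
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        hidden_size = len(tokens[0])
        total = [0.0] * hidden_size
        for vec, keep in zip(tokens, mask):
            if keep:
                total = [t + v for t, v in zip(total, vec)]
        count = max(sum(mask), 1)  # avoid division by zero for an all-padding row
        pooled.append([t / count for t in total])
    return pooled

# Batch of 2 sequences, 3 token positions, hidden size 2.
token_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 2.0], [4.0, 6.0], [9.0, 9.0]],  # last position is padding
]
attention_mask = [[1, 1, 1], [1, 1, 0]]
pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)
```

Without the mask, the padding vector in the second sequence would pull its average away from the two real tokens.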



In [None]:


import faiss
import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
from sentence_transformers import SentenceTransformer

# ... (load_chunks_and_create_index and get_chunk_embedding functions remain the same) ...

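# Function to retrieve the chunks most relevant to a query
def retrieve_relevant_chunks(query, k=5):
    """Returns the text of the k chunks nearest to the query, joined into one string."""
    # The query must be embedded with the same model that built `index`.
    query_embedding = get_chunk_embedding(query).reshape(1, -1).astype('float32')
    distances, indices = index.search(query_embedding, k)  # k nearest neighbors by L2 distance
    # FAISS returns -1 when fewer than k vectors are available, so filter out invalid ids.
    retrieved_chunks = [loaded_chunks[i] for i in indices[0] if 0 <= i < len(loaded_chunks)]
    return " ".join(retrieved_chunks)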


# Load Pegasus tokenizer and model for summarization and answer generation
tokenizer_summarizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")
model_summarizer = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-xsum")


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
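def summarize_context(context):
    """Summarizes the given context using the Pegasus model."""
    # Truncate to the model's maximum input length instead of a hard-coded value.
    max_input_tokens = model_summarizer.config.max_position_embeddings
    # Note: the "summarize: " prefix is a T5 convention; Pegasus was not trained with task prefixes.
    inputs = tokenizer_summarizer.encode("summarize: " + context, return_tensors="pt", max_length=max_input_tokens, truncation=True)
    # Beam search (4 beams) producing a summary of 40 to 150 tokens
    summary_ids = model_summarizer.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = tokenizer_summarizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary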

In [None]:
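def generate_answer(query, context, temperature=0.3):
    """Generates an answer to the query based on the context using the Pegasus model.

    Note: google/pegasus-xsum is fine-tuned for summarization, so answers produced
    from this question/context prompt should be checked carefully.
    """
    input_text = f"Question: {query}\nContext: {context}\nAnswer:"
    # Truncate to the model's maximum input length instead of a hard-coded value.
    max_input_tokens = model_summarizer.config.max_position_embeddings
    inputs = tokenizer_summarizer.encode(input_text, return_tensors='pt', max_length=max_input_tokens, truncation=True)
    outputs = model_summarizer.generate(
        inputs,
        max_new_tokens=150,
        do_sample=True,  # temperature only takes effect when sampling is enabled
        temperature=temperature,
        pad_token_id=tokenizer_summarizer.eos_token_id,
        num_beams=3,
        early_stopping=True
    )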
    return tokenizer_summarizer.decode(outputs[0], skip_special_tokens=True)
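
In `generate`, `temperature` rescales the model's next-token scores before they are turned into probabilities, and it only matters when tokens are sampled. The sketch below applies a temperature-scaled softmax to three made-up scores to show that values below 1 concentrate probability on the highest-scoring token. `softmax_with_temperature` is an illustrative helper, not a Transformers function.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Converts scores to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up next-token scores
probs_default = softmax_with_temperature(logits, 1.0)
probs_sharp = softmax_with_temperature(logits, 0.3)
print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_sharp])
```

A low temperature such as the default 0.3 therefore makes sampled output close to greedy decoding, which is usually the safer choice for factual medical answers.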

In [None]:
# Define the user's query about the symptoms of ADHD
query = "what is the symptoms of ADHD?"

# Retrieve relevant chunks of text that relate to the query
# The `retrieve_relevant_chunks` function uses a similarity search to find the most relevant pieces of text
# from the previously indexed chunks based on the user's query.
context = retrieve_relevant_chunks(query)

# This context will then be used in further processing, such as summarization or answer generation.
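
`retrieve_relevant_chunks` joins all retrieved chunks into one string, and the summarizer then truncates whatever exceeds its input limit, so lower-ranked chunks may be cut off partway. One option, sketched here as an assumption rather than as part of the notebook, is to add whole chunks in rank order until a budget is reached. The budget is counted in whitespace-separated words as a rough proxy for tokens; `pack_chunks` and `max_words` are hypothetical names, and the example chunks are made up.

```python
def pack_chunks(ranked_chunks, max_words=50):
    """Keeps whole chunks in rank order until the word budget would be exceeded."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        n_words = len(chunk.split())
        if used + n_words > max_words:
            break
        selected.append(chunk)
        used += n_words
    return " ".join(selected)

ranked = [
    "ADHD symptoms include inattention and hyperactivity.",
    "Impulsivity is also commonly reported.",
    "Diagnosis usually involves behavioral assessment over time.",
]
packed_context = pack_chunks(ranked, max_words=12)
print(packed_context)
```

A tokenizer-based count would be more precise than a word count; the word budget simply keeps the sketch free of model downloads.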


In [None]:
# Summarize the relevant context retrieved based on the user's query
summary = summarize_context(context)

# Print the summary to the console
print("Summary:", summary)


Summary: Abnormal results The doctor will inform the woman of her specific increased risk as compared to the “normal” risk of a stan-dard case. stan-dard cases are more common in women with a family history of breast cancer and those with a family history of ovarian cancer.


In [None]:
!pip install gradio


In [None]:
!pip install PyPDF2


In [None]:

import re

import faiss
import gradio as gr
import nltk
import numpy as np
import pandas as pd
import PyPDF2
import torch
from google.colab import drive
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Mount Google Drive
drive.mount('/content/drive')

# --- Load Chunks and FAISS Index ---
# (Assuming you have already generated and saved the FAISS index and chunks as in your original code)

# Load the FAISS index
index = faiss.read_index("/content/drive/MyDrive/faiss_index_file_chunking.index")
print("FAISS index loaded.")

# Load the chunks
with open('/content/drive/MyDrive/final_chunks.txt', 'r') as f:
    loaded_chunks = f.read().split("\n\n")

print(f"Loaded {len(loaded_chunks)} chunks")


# ---  Embedding Models ---
# Pre-trained PubMedBERT model for biomedical text embeddings
tokenizer_pubmed = AutoTokenizer.from_pretrained('NeuML/pubmedbert-base-embeddings')
model_pubmed = AutoModel.from_pretrained('NeuML/pubmedbert-base-embeddings')

# Pre-trained sentence transformers model for general-purpose embeddings
tokenizer_allmpnet = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model_allmpnet = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

# Pegasus model for summarization and question answering
tokenizer_summarizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")
model_summarizer = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-xsum")

# --- Helper Functions (retained with slight modifications) ---

def get_chunk_embedding(chunk, model_choice='pubmed'):
    """Embed a single text with the selected model by mean-pooling its token states."""
    if model_choice == 'pubmed':
        tokenizer = tokenizer_pubmed
        model = model_pubmed
    elif model_choice == 'allmpnet':
        tokenizer = tokenizer_allmpnet
        model = model_allmpnet
    else:
        raise ValueError("Invalid model choice. Select 'pubmed' or 'allmpnet'")

    inputs = tokenizer(chunk, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool over the token dimension; a single input has no padding tokens to mask out
    embedding = outputs.last_hidden_state.mean(dim=1)
    return embedding.squeeze().numpy()


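def retrieve_relevant_chunks(query, model_choice='pubmed'):
    # The query must be embedded with the same model that produced the vectors stored in `index`
    query_embedding = get_chunk_embedding(query, model_choice).reshape(1, -1)
    D, I = index.search(query_embedding, 5)  # Search for 5 nearest neighbors
    # FAISS pads missing results with -1, so keep only valid chunk positions
    retrieved_chunks = [loaded_chunks[i] for i in I[0] if 0 <= i < len(loaded_chunks)]
    return " ".join(retrieved_chunks)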


def summarize_context(context):
    # pegasus-xsum is configured for 512 input positions, so longer contexts are truncated.
    # Pegasus does not use a T5-style "summarize: " task prefix, so the raw context is passed.
    inputs = tokenizer_summarizer.encode(context, return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model_summarizer.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = tokenizer_summarizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary


def generate_answer(query, context, temperature=0.3):
    input_text = f"Question: {query}\nContext: {context}\nAnswer:"
    # pegasus-xsum is configured for 512 input positions, so longer prompts are truncated
    inputs = tokenizer_summarizer.encode(input_text, return_tensors='pt', max_length=512, truncation=True)
    # Beam search is deterministic; `temperature` only takes effect with do_sample=True,
    # so it is kept in the signature for compatibility but not passed to generate()
    outputs = model_summarizer.generate(inputs, max_new_tokens=150, num_beams=3, early_stopping=True)
    return tokenizer_summarizer.decode(outputs[0], skip_special_tokens=True)


# --- Gradio Interface ---
def qa_interface(query, model_select):
    context = retrieve_relevant_chunks(query, model_choice=model_select)
    summary = summarize_context(context)
    answer = generate_answer(query, context)  # Using Pegasus for answer generation
    return summary, answer

iface = gr.Interface(
    fn=qa_interface,
    inputs=[
        gr.Textbox(label="Enter your query"),
        gr.Radio(["pubmed", "allmpnet"], label="Select Embedding Model", value="pubmed")
    ],
    outputs=[
        gr.Textbox(label="Summary"),
        gr.Textbox(label="Answer")
    ],
    title="Medical Q&A System",
    description="Ask questions about the medical document."
)


iface.launch()



Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://6f842c3f741778b0e0.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
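
`get_chunk_embedding` in the cell above averages every token state, which is fine for one text at a time because no padding is added. If the function were extended to embed several chunks per call, padding tokens would leak into the average. The sketch below illustrates mask-aware mean pooling on plain Python lists; `masked_mean_pool` and the toy vectors are hypothetical names and values used only for illustration, not part of the notebook.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    if not token_vectors:
        raise ValueError("token_vectors is empty")
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for j, value in enumerate(vector):
                totals[j] += value
            kept += 1
    if kept == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / kept for total in totals]


# Toy sequence: two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)  # the padding vector is ignored
```

With tensors, the same idea is usually expressed by multiplying `last_hidden_state` by the expanded attention mask and dividing by the mask sum.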


In [None]:
# Summarize the context retrieved for the user's query "what are the symptoms of ADHD"
# (this run used the all-MiniLM embedding model and the T5 summarizer)
summary = summarize_context(context)

# Print the summary to the console
print("Summary:", summary)


In [None]:
!pip install nltk rouge-score


In [None]:
!pip install rouge

In [None]:
import nltk  # Import the Natural Language Toolkit for text processing
from nltk.translate.bleu_score import sentence_bleu  # Import the BLEU score function
from rouge import Rouge  # Import the ROUGE library for evaluating summarization

# Ensure you have downloaded the necessary NLTK resources
nltk.download('punkt')  # Download the punkt tokenizer models for sentence splitting

def evaluate_response(generated_answer, reference_answer):
    # Tokenize the generated and reference answers
    reference_tokens = nltk.word_tokenize(reference_answer)  # Tokenize the reference answer into words
    generated_tokens = nltk.word_tokenize(generated_answer)  # Tokenize the generated answer into words

    # Calculate BLEU score
    bleu_score = sentence_bleu([reference_tokens], generated_tokens)  # Compute BLEU score

    # Calculate ROUGE score
    rouge = Rouge()  # Initialize the ROUGE evaluator
    rouge_scores = rouge.get_scores(generated_answer, reference_answer)[0]  # Compute ROUGE scores
    rouge_score = rouge_scores['rouge-l']['f']  # Extract the F1 score for ROUGE-L

    # Return scores
    return {
        'bleu_score': bleu_score,  # Return the BLEU score
        'rouge_score': rouge_score,  # Return the ROUGE score
    }

# Example usage
generated_answer = summary  # Replace with the generated answer from the model
reference_answer = context  # Replace with the reference answer
scores = evaluate_response(generated_answer, reference_answer)  # Evaluate the generated answer
print(f"BLEU Score: {scores['bleu_score']}")  # Print the BLEU score
print(f"ROUGE Score: {scores['rouge_score']}")  # Print the ROUGE score
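
To make the ROUGE-L number above easier to interpret, here is a minimal pure-Python sketch of an LCS-based ROUGE-L F1 score. It is an illustration under simplifying assumptions (lowercased whitespace tokenization, equal weighting of precision and recall), so its values may differ from those returned by the `rouge` package; `lcs_length` and `rouge_l_f1` are hypothetical helper names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """F1 of LCS-based precision and recall over whitespace tokens."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


example_score = rouge_l_f1("the cat sat", "the cat sat on the mat")
```

Because recall divides by the reference length, scoring a short summary against the full retrieved context (as in the example usage above) pushes the score down even when every summary token appears in the context.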

In [None]:
# Save sentence-transformer model and tokenizer
tokenizer.save_pretrained("sentence-transformer-model")
model.save_pretrained("sentence-transformer-model")
print("Sentence transformer model saved.")

# Save summarization model and tokenizer
tokenizer_summarizer.save_pretrained("summarization-model")
model_summarizer.save_pretrained("summarization-model")
print("Summarization model saved.")


In [None]:
np.save("chunk_embeddings.npy", chunk_embeddings)


In [None]:
model.save_pretrained('faq_model_allminilm_')
tokenizer.save_pretrained('faq_model_allminilm_tokenizer')
faiss.write_index(index, 'faiss_index_allminilm.index')
print("RAG model and FAISS index saved successfully.")
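
The loading cell earlier reads `final_chunks.txt` and splits it on blank lines, which would split any chunk that itself contains a blank line and shift the chunk positions relative to the FAISS index. One possible alternative, sketched here as an assumption rather than the notebook's method, is to store the chunks as a JSON list so the round trip preserves them exactly. The file name `final_chunks.json` and the helpers `save_chunks` and `load_chunks` are hypothetical.

```python
import json
import os
import tempfile


def save_chunks(chunks, path):
    """Write the list of chunk strings as a JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False)


def load_chunks(path):
    """Read the chunk list back in its original order."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Toy chunks; the first one contains a blank line on purpose.
chunks = ["First chunk.\n\nStill the first chunk.", "Second chunk."]
path = os.path.join(tempfile.mkdtemp(), "final_chunks.json")  # hypothetical file name
save_chunks(chunks, path)
restored = load_chunks(path)

# For comparison: joining and splitting on blank lines changes the chunk count.
naive = "\n\n".join(chunks).split("\n\n")
```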

In [None]:
from huggingface_hub import login

# Log in to the Hugging Face Hub. Calling login() without arguments prompts for the
# token, so the secret is never stored in the notebook.
login()
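
For non-interactive runs, one option is to read the token from an environment variable instead of writing it into a cell. The helper below is a minimal sketch; `get_hf_token` is a hypothetical name, and the use of `HF_TOKEN` as the variable name is an assumption about how the environment is configured.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the given environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "set it before running this cell."
        )
    return token


# Intended usage (left commented so the sketch runs without huggingface_hub):
# from huggingface_hub import login
# login(token=get_hf_token())
```

A token that has already been committed to a shared notebook should be treated as compromised and revoked in the Hugging Face account settings.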


In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load your model and tokenizer (this can be a fine-tuned model as well)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

# Save the model and tokenizer to a directory
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")


In [None]:
from huggingface_hub import HfApi

# Create a new model repository
api = HfApi()
api.create_repo(repo_id="Medical-RAG-Model", private=True)  # Set `private=False` if you want a public repo


In [None]:
from huggingface_hub import HfApi

# Upload the saved model to your Hugging Face repository
api.upload_folder(
    folder_path="./my_model",  # Path to your saved model directory
    repo_id="Abdelrahman-Hassan-1/Medical-RAG-Model",  # Replace with your Hugging Face username and repo name
    commit_message="Upload my medical rag model from Colab"
)
