<a id="1"></a>
# <div style="text-align:center; border-radius:15px 50px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#0073e6; overflow:hidden"><b> LLM Gemma - Covid19</b></div>

<div style="text-align: center;">
  <img src="https://img.freepik.com/vetores-gratis/banner-do-surto-de-coronavirus-covid-19-com-celulas-virais_1017-24631.jpg?t=st=1724384629~exp=1724388229~hmac=5c31ce6c796ba544054e1e57d15f76f07b9190bbc62387b38b46c4c5355933f3&w=740" alt="Banner do Surto de Coronavirus" />
</div>

In [None]:
# Installing packages
!pip install transformers
!pip install sentence_transformers
!pip install faiss-cpu
!pip install torch
!pip install PyPDF2
!pip install nltk

In [None]:
# Importing libraries

# Importing system
import os
import faiss

# Importing documents
import PyPDF2

# Importing csv and math libraries
import numpy as np
import pandas as pd

# Importing LLM libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer

In [None]:
# Importing natural language processing libraries
import nltk
from nltk.tokenize import sent_tokenize

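# Downloading the NLTK Punkt sentence tokenizer data
nltk.download('punkt')
# Recent NLTK releases look for 'punkt_tab' instead; requesting both is harmless
nltk.download('punkt_tab')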

In [None]:
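# Authentication with Hugging Face
import os
# Read the access token from the environment instead of hardcoding it in the notebook.
# Set HUGGING_FACE_ACCESS_TOKEN before running this cell.
HUGGING_FACE_ACCESS_TOKEN = os.environ['HUGGING_FACE_ACCESS_TOKEN']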

In [None]:
# File path for PDF processing
pdf_path = "/kaggle/input/acetic-acidas-antiviral-agent/US8957108.pdf"

# Initialize the DataFrame to store paths, text chunks, and embeddings
df_documents = pd.DataFrame(columns=['path', 'text_chunks', 'embeddings'])

In [None]:
from transformers import AutoModelForCausalLM

# Model parameters
model_name = 'google/gemma-2-2b-it'

# Load the pre-trained model with specified configurations
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.float16,
                                             token=HUGGING_FACE_ACCESS_TOKEN).to('cuda')
model

### Documentation for LLM Gemma Model Implementation

This section documents how the Gemma model is loaded with the `AutoModelForCausalLM` class from the Hugging Face Transformers library and moved to a CUDA-enabled GPU for hardware acceleration.

#### 1. **Model Name**
- `model_name`: The specific model used is `'google/gemma-2-2b-it'`, a pre-trained large language model by Google. The `-it` suffix marks the instruction-tuned variant, which is trained to follow prompts and answer questions.

#### 2. **Loading the Model**
- The model is loaded using the `AutoModelForCausalLM.from_pretrained()` method. This function retrieves the pre-trained model based on the specified model name and configures it for causal language modeling tasks.

#### 3. **Torch Data Type**
- `torch_dtype=torch.float16`: The model utilizes 16-bit floating point precision (`float16`) for faster computation and reduced memory usage on the GPU, which is particularly useful for large models.

#### 4. **Token Authentication**
- `token=HUGGING_FACE_ACCESS_TOKEN`: Gemma is a gated model on the Hugging Face Hub, so downloading it requires an access token from an account that has accepted the model's license.

#### 5. **Deploying to CUDA**
- `.to('cuda')`: Moves the model's weights to a CUDA-enabled GPU so that inference runs there, which is typically much faster than CPU execution for a model of this size. The call fails if no CUDA device is available.

#### 6. **Summary**
This setup loads the instruction-tuned Gemma LLM and places it on a GPU. Using `float16` precision halves the memory needed for the weights compared with `float32`, and running on CUDA gives faster inference than running on the CPU.
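To make the `float16` point concrete, here is a rough back-of-the-envelope sketch of the memory needed to hold the weights at two precisions. The parameter count is an assumed round figure (about 2.6 billion) rather than a value taken from this notebook, and `estimate_weight_memory_gib` is a hypothetical helper. Activations and the generation cache add to the total.

```python
# Rough estimate of the memory needed just to store model weights.
# ASSUMPTION: ~2.6 billion parameters is an approximate figure used for illustration.
ASSUMED_NUM_PARAMS = 2_600_000_000

# Bytes used per parameter for each floating point format
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def estimate_weight_memory_gib(num_params, dtype):
    """Return the weight storage size in GiB for the given dtype name."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


for dtype in ("float32", "float16"):
    size = estimate_weight_memory_gib(ASSUMED_NUM_PARAMS, dtype)
    print(f"{dtype}: ~{size:.1f} GiB for weights alone")
```

Whatever the exact parameter count, the ratio holds: `float16` needs half the bytes per weight that `float32` does.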

In [None]:
# Load the tokenizer with the specified token
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HUGGING_FACE_ACCESS_TOKEN)

#### 1. **Tokenizer Initialization**
- `tokenizer`: The tokenizer is initialized using the `AutoTokenizer.from_pretrained()` method from the Hugging Face Transformers library.

#### 2. **Model Name**
- `model_name`: The same model name `'google/gemma-2-2b-it'` is used to ensure that the tokenizer is compatible with the pre-trained model.

#### 3. **Token Authentication**
- `token=HUGGING_FACE_ACCESS_TOKEN`: Similar to the model, the tokenizer also requires access to the Hugging Face API, authenticated via an API token. This token grants access to the tokenizer associated with the specific model.

#### 4. **Summary**
The tokenizer converts text into the token IDs the model consumes and converts generated IDs back into text. Loading it with `AutoTokenizer` under the same model name and access token as the model ensures that both use the same vocabulary.
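The encode and decode round trip can be illustrated with a toy word-level tokenizer. This is only a sketch: `ToyTokenizer` and its vocabulary are hypothetical, and Gemma's real tokenizer uses a learned subword vocabulary rather than whole words.

```python
class ToyTokenizer:
    """Hypothetical word-level tokenizer for illustration; NOT the Gemma tokenizer."""

    def __init__(self, corpus):
        # Id 0 is reserved for words that are not in the vocabulary
        self.token_to_id = {"<unk>": 0}
        for word in sorted(set(corpus.split())):
            self.token_to_id[word] = len(self.token_to_id)
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, text):
        """Map each whitespace-separated word to its integer id."""
        return [self.token_to_id.get(word, 0) for word in text.split()]

    def decode(self, ids):
        """Map integer ids back to words and join them with spaces."""
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer("the vaccine targets the spike protein")
ids = toy.encode("the spike protein")
print(ids)              # integer ids, one per word
print(toy.decode(ids))  # text recovered from the ids
```

The model only ever sees the integer ids, which is why the tokenizer and the model must share the same vocabulary.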

In [None]:
from sentence_transformers import SentenceTransformer

# Initialize the sentence encoder with the specified model
encoder = SentenceTransformer('all-MiniLM-L6-v2')

#### 1. **Sentence Encoder Initialization**
- `encoder`: The encoder is initialized using the `SentenceTransformer` class from the SentenceTransformers library.

#### 2. **Model Name**
- `'all-MiniLM-L6-v2'`: A compact sentence-embedding model built on a 6-layer MiniLM network. It maps each input text to a 384-dimensional vector and offers a good balance between embedding quality and computational cost.

#### 3. **Usage**
The encoder is used to convert sentences or texts into dense vector representations (embeddings). These embeddings capture the semantic meaning of the text and can be used for tasks such as similarity comparison, clustering, or as input features for machine learning models.

#### 4. **Summary**
Initializing the encoder with `'all-MiniLM-L6-v2'` gives a fast way to turn text chunks and queries into vectors that capture their semantic content. In this notebook those vectors are what the FAISS index stores and searches.
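As a minimal sketch of what "similarity comparison" between embeddings means, the snippet below computes cosine similarity in plain Python. The three-dimensional vectors are made up for illustration; real `'all-MiniLM-L6-v2'` embeddings have 384 dimensions and come from `encoder.encode(...)`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# ASSUMPTION: hand-written toy vectors standing in for real sentence embeddings
emb_vaccine = [0.9, 0.1, 0.0]
emb_immunization = [0.8, 0.2, 0.1]
emb_weather = [0.0, 0.1, 0.9]

print(cosine_similarity(emb_vaccine, emb_immunization))  # close to 1: similar
print(cosine_similarity(emb_vaccine, emb_weather))       # close to 0: unrelated
```

Texts with related meaning end up with vectors that point in similar directions, which is what makes nearest-neighbor search over embeddings useful for retrieval.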

In [None]:
# Initialize the DataFrame
documents_df = pd.DataFrame(columns=['file_path', 'text_segments', 'embeddings'])

In [None]:
def get_text_from_pdf(file_path):
    try:
        with open(file_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            extracted_text = "".join([page.extract_text() for page in pdf_reader.pages])
        return extracted_text
    except Exception as error:
        print(f"Error reading {file_path}: {error}")
        return ""

def divide_text_into_segments(full_text, segment_size=1000):
    sentence_list = sent_tokenize(full_text)
    text_segments = []
    current_segment = ""

    for sentence in sentence_list:
        if len(current_segment) + len(sentence) <= segment_size:
            current_segment += sentence + " "
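        else:
            # Skip the empty segment produced when the very first sentence exceeds segment_size
            if current_segment:
                text_segments.append(current_segment.strip())
            current_segment = sentence + " "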

    if current_segment:
        text_segments.append(current_segment.strip())

    return text_segments
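
The sentence-packing logic can be tried on a small example without NLTK. In this sketch `pack_sentences` is a hypothetical helper that mirrors the greedy packing loop (with a guard against empty chunks), and the regular expression is a crude stand-in for `sent_tokenize`, not a replacement for it.

```python
import re


def pack_sentences(sentences, max_chunk_size):
    """Greedily pack whole sentences into chunks of at most max_chunk_size characters."""
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) <= max_chunk_size:
            current += sentence + " "
        else:
            # Close the current chunk (if any) and start a new one
            if current:
                chunks.append(current.strip())
            current = sentence + " "
    if current:
        chunks.append(current.strip())
    return chunks


text = "Alpha beta. Gamma delta epsilon. Zeta. Eta theta iota kappa."
# ASSUMPTION: naive split on sentence-ending punctuation followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)
chunks = pack_sentences(sentences, max_chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Sentences are never split, so a single sentence longer than the limit becomes its own oversized chunk.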

In [None]:
# Define the input path to the directory or file
# Replace with your actual path
input_path = "/kaggle/input/acetic-acidas-antiviral-agent/US8957108.pdf" 

In [None]:
# Function to extract text from a PDF
def extract_text_from_pdf(pdf_path):
    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            text = "".join([page.extract_text() for page in reader.pages])
        return text
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
        return ""

# Function to split text into chunks
def split_text_into_chunks(text, max_chunk_size=1000):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= max_chunk_size:
            current_chunk += sentence + " "
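        else:
            # Skip the empty chunk produced when the very first sentence exceeds max_chunk_size
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "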

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

# Load the sentence transformer model
encoder = SentenceTransformer('all-MiniLM-L6-v2')

# Directory or file path for PDF processing
pdf_directory = "/kaggle/input/article-a-serological-assay-to-detect-sarscov2/2020.03.17.20037713v1.full.pdf"

# Initialize the DataFrame to store paths, text chunks, and embeddings
df_documents = pd.DataFrame(columns=['path', 'text_chunks', 'embeddings'])

# Check if the path is a directory or a file
if os.path.isdir(pdf_directory):
    # If it's a directory, iterate over the files in it
    for filename in os.listdir(pdf_directory):
        if filename.endswith(".pdf"):
            print(filename)
            pdf_path = os.path.join(pdf_directory, filename)
            text = extract_text_from_pdf(pdf_path)
            chunks = split_text_into_chunks(text)
            document_embeddings = encoder.encode(chunks)
            new_row = pd.DataFrame({'path': [pdf_path], 'text_chunks': [chunks], 'embeddings': [document_embeddings]})
            df_documents = pd.concat([df_documents, new_row], ignore_index=True)
elif os.path.isfile(pdf_directory):
    # If it's a file, process it directly
    pdf_path = pdf_directory
    print(pdf_path)
    text = extract_text_from_pdf(pdf_path)
    chunks = split_text_into_chunks(text)
    document_embeddings = encoder.encode(chunks)
    new_row = pd.DataFrame({'path': [pdf_path], 'text_chunks': [chunks], 'embeddings': [document_embeddings]})
    df_documents = pd.concat([df_documents, new_row], ignore_index=True)
else:
    # If the path is neither a directory nor a file, print an error message
    print(f"{pdf_directory} is neither a valid directory nor a file.")

# Display the resulting DataFrame
df_documents

In [None]:
# Create a FAISS index from all document embeddings

# Stack all embeddings from the DataFrame into a single numpy array
all_embeddings = np.vstack(df_documents['embeddings'].tolist())

# Determine the dimensionality of the embeddings
dimension = all_embeddings.shape[1]

# Initialize a FAISS index with L2 (Euclidean) distance metric
index = faiss.IndexFlatL2(dimension)

# Add all embeddings to the FAISS index
index.add(all_embeddings)
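
`faiss.IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. The pure-Python sketch below shows the same idea on a handful of toy vectors; `brute_force_search` is a hypothetical helper for illustration and is far slower than FAISS on real data.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(vectors, query, top_k):
    """Return (distances, indices) of the top_k stored vectors closest to the query."""
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    top = scored[:top_k]
    return [dist for dist, _ in top], [i for _, i in top]


# ASSUMPTION: tiny 2-d vectors standing in for 384-d chunk embeddings
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = brute_force_search(vectors, query=[0.9, 0.1], top_k=2)
print(indices)    # positions of the nearest stored vectors, closest first
print(distances)  # matching squared L2 distances, smallest first
```

The returned positions refer to the order in which vectors were added, which is why the retrieval code has to map them back to a document and a chunk.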


In [None]:
def find_most_similar_segments(search_query, top_k=20):
    """
    Find the text segments most similar to the search query using the FAISS index.

    Parameters:
    search_query (str): The query string to search for similar segments.
    top_k (int): The number of top similar segments to retrieve. Default is 20.

    Returns:
    list: A list of dictionaries containing the document path, the text segment,
          and the squared L2 distance (lower means more similar).
    """
    query_embedding = encoder.encode([search_query])
    distances, indices = index.search(query_embedding, top_k)
    similar_segments = []
    # The index was built from df_documents, so segments must be looked up in that DataFrame
    total_segments = sum(len(chunks) for chunks in df_documents['text_chunks'])

    for i, idx in enumerate(indices[0]):
        # FAISS pads results with -1 when fewer than top_k vectors are indexed
        if 0 <= idx < total_segments:
            doc_idx = 0
            segment_idx = idx
            # Walk through the documents until the flat index falls inside one of them
            while segment_idx >= len(df_documents['text_chunks'].iloc[doc_idx]):
                segment_idx -= len(df_documents['text_chunks'].iloc[doc_idx])
                doc_idx += 1
            similar_segments.append({
                'document': df_documents['path'].iloc[doc_idx],
                'segment': df_documents['text_chunks'].iloc[doc_idx][segment_idx],
                'distance': distances[0][i]
            })
    return similar_segments

def generate_answer(search_query, context_text, max_length=1000):
    """
    Generate an answer to a query based on the provided context using a pre-trained language model.

    Parameters:
    search_query (str): The query string to generate an answer for.
    context_text (str): The context text to base the answer on.
    max_length (int): The maximum length of the generated answer. Default is 1000 tokens.

    Returns:
    str: The generated answer.
    """
    prompt = f"Context: {context_text}\n\nQuestion: {search_query}\n\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_length, num_return_sequences=1)

    # Decode only the newly generated tokens. Searching the decoded text for "Answer:"
    # would cut at the wrong place whenever the context itself contains that string.
    prompt_length = inputs["input_ids"].shape[1]
    generated_answer = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True).strip()

    return generated_answer

def search_documents(search_query):
    """
    Search for the most relevant document segments and generate an answer to the query.

    Parameters:
    search_query (str): The query string to search for and generate a response to.

    Returns:
    tuple: A tuple containing the generated answer and the list of similar segments.
    """
    similar_segments = find_most_similar_segments(search_query)
    # Replace line breaks with spaces so that words split across lines are not glued together
    context_text = " ".join([result['segment'].replace("\n", " ") for result in similar_segments])
    generated_answer = generate_answer(search_query, context_text)
    return generated_answer, similar_segments
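
Because the FAISS index stores the chunks of all documents in one flat sequence, each search hit has to be mapped back to a document and a position inside it. The sketch below shows one way to do that with cumulative offsets and binary search; `locate_chunk` and the chunk counts are hypothetical and only illustrate the bookkeeping.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_chunk(flat_index, chunks_per_document):
    """Map a flat index position to (document index, chunk index within that document)."""
    # ends[i] is the total number of chunks in documents 0..i
    ends = list(accumulate(chunks_per_document))
    total = ends[-1] if ends else 0
    if not 0 <= flat_index < total:
        raise IndexError(f"flat index {flat_index} is outside 0..{total - 1}")
    # First document whose cumulative end is strictly greater than flat_index
    doc_idx = bisect_right(ends, flat_index)
    start = ends[doc_idx - 1] if doc_idx > 0 else 0
    return doc_idx, flat_index - start


# ASSUMPTION: three documents with 3, 2 and 4 chunks respectively
counts = [3, 2, 4]
for flat in (0, 2, 3, 8):
    print(flat, "->", locate_chunk(flat, counts))
```

Rejecting out-of-range positions matters in practice, since a search for more neighbors than there are indexed vectors returns `-1` placeholders.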


# Questions 

In [None]:
# Define the search query
search_query = "vaccine proteins vaccine for covid19"

# Run the search and generate an answer
generated_answer, relevant_segments = search_documents(search_query)

# Print the query and the generated answer
print(f"Query: {search_query}\n\n-----\n")
print(f"Generated answer: {generated_answer}\n\n-----\n")

# Print the relevant segments
print("Relevant segments:")
for segment in relevant_segments:
    print(f"Document: {segment['document']}")
    print(f"Segment: {segment['segment']}".replace("\n", ""))
    print(f"Distance: {segment['distance']}")
    print()


In [None]:
# Define the search query
search_query = "vaccine SARS-CoV-2"

# Run the search and generate an answer
generated_answer, relevant_segments = search_documents(search_query)

# Print the query and the generated answer
print(f"Query: {search_query}\n\n-----\n")
print(f"Generated answer: {generated_answer}\n\n-----\n")

# Print the relevant segments
print("Relevant segments:")
for segment in relevant_segments:
    print(f"Document: {segment['document']}")
    print(f"Segment: {segment['segment']}".replace("\n", ""))
    print(f"Distance: {segment['distance']}")
    print()


In [None]:
# Define the search query
search_query = "covid"

# Run the search and generate an answer
generated_answer, relevant_segments = search_documents(search_query)

# Print the query and the generated answer
print(f"Query: {search_query}\n\n-----\n")
print(f"Generated answer: {generated_answer}\n\n-----\n")

# Print the relevant segments
print("Relevant segments:")
for segment in relevant_segments:
    print(f"Document: {segment['document']}")
    print(f"Segment: {segment['segment']}".replace("\n", ""))
    print(f"Distance: {segment['distance']}")
    print()


In [None]:
# Define the search query
search_query = "covid 19 vaccine proteins"

# Run the search and generate an answer
generated_answer, relevant_segments = search_documents(search_query)

# Print the query and the generated answer
print(f"Query: {search_query}\n\n-----\n")
print(f"Generated answer: {generated_answer}\n\n-----\n")

# Print the relevant segments
print("Relevant segments:")
for segment in relevant_segments:
    print(f"Document: {segment['document']}")
    print(f"Segment: {segment['segment']}".replace("\n", ""))
    print(f"Distance: {segment['distance']}")
    print()


# Reference

Carraro, F. (2024). *AI RAG PDF Search in Multiple Documents using Gemma 2 2B on Colab* [GitHub repository]. GitHub. https://github.com/fabriciocarraro/AI_RAG_PDF_Search_in_multiple_documents_using_Gemma_2_2B_on_Colab