# Work Methodology


# Observations from our documents:
- They are chemical safety documents containing images, tables, structured text, and varied formatting (bold, italics, headers).
- Headers and footers are noise and should be excluded.
- They are written in English.

## Multi-structure PDFs

We work with multi-structure PDFs. A multi-structure PDF may include:

* Continuous text (paragraphs).
* Tables with organized data.
* Images with embedded text (requires OCR).
* Headers, titles, sections.
* Margins, notes, columns, footers.

These elements can vary greatly between documents, so we need a modular and flexible approach.

# Objective:
Extract the useful and structured content from each PDF, ignoring noise (headers, footers), preserving tables, and respecting the textual hierarchy (titles, sections, etc.).
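
One way to approach the "ignore headers and footers" part of this objective, before any LLM call, is to look for lines that recur at the top or bottom of most pages. The sketch below is illustrative only and is not part of the pipeline in this notebook; the function names, the `min_fraction` threshold, and the sample pages are assumptions made for the example.

```python
from collections import Counter

def find_repeated_lines(pages, min_fraction=0.6, edge_lines=2):
    """Return lines that recur at the top or bottom of most pages."""
    counts = Counter()
    for page_text in pages:
        lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
        # Only the first and last few lines of a page are header/footer candidates
        candidates = set(lines[:edge_lines] + lines[-edge_lines:])
        counts.update(candidates)
    # Require at least two occurrences so a single page never defines a header
    threshold = max(2, int(len(pages) * min_fraction))
    return {line for line, n in counts.items() if n >= threshold}

def strip_repeated_lines(pages, repeated):
    """Drop the repeated lines from every page."""
    return [
        "\n".join(ln for ln in page.splitlines() if ln.strip() not in repeated)
        for page in pages
    ]

# Hypothetical page texts, standing in for the output of a PDF text extractor
pages = [
    "ACME Safety Data Sheet\nSection 1: Identification\nProduct: Cleaner\nRevision 2.1",
    "ACME Safety Data Sheet\nSection 2: Hazards\nFlammable liquid\nRevision 2.1",
    "ACME Safety Data Sheet\nSection 3: Composition\nEthanol 10%\nRevision 2.1",
]
repeated = find_repeated_lines(pages)
cleaned = strip_repeated_lines(pages, repeated)
```

Lines that change on every page, such as "Page 1" and "Page 2", are not caught by exact matching and would need a pattern-based rule.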

# Strategy

1. Convert PDFs to Markdown using an OpenAI model (GPT-4o mini)
2. Standardize the output as LangChain `Document` objects
3. Split into chunks
4. Generate embeddings for the chunks
5. Create a vector database with Chroma

# Considerations
- Extraction of structured metadata for sources, citations, and authorities
- GraphRAG
- Images
- Chunking by tokens or structure
- Which embedding to use (OpenAI or others)

# Justification of decisions
- LangChain integrates with the major embedding providers, such as OpenAI, Cohere, and HuggingFace. Each provider is exposed as an embeddings class with two methods: `embed_documents` for embedding the texts to index and `embed_query` for embedding the search query.
- GPT-4o mini is a smaller, more affordable version of OpenAI's GPT-4o model, offering a balance between performance and cost.
- Chroma is an open-source embedding database designed to support applications built on large language models (LLMs). It stores texts, converts them into vectors, and performs efficient similarity searches. It integrates easily with tools like LangChain (in Python and JavaScript) and LlamaIndex.
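
To make the two ideas above concrete (an embedding class with one method for documents and one for queries, and a similarity search over the stored vectors), here is a minimal sketch in plain Python. `ToyEmbeddings` is a hypothetical bag-of-words stand-in, not a LangChain class, and `top_k` only imitates what a vector store such as Chroma does at a much larger scale.

```python
import math
from collections import Counter

class ToyEmbeddings:
    """Hypothetical stand-in exposing the two-method embedding interface."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def _embed(self, text):
        # Count how often each vocabulary word appears in the text
        counts = Counter(text.lower().split())
        return [float(counts[word]) for word in self.vocabulary]

    def embed_documents(self, texts):
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        return self._embed(text)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=1):
    """Return the indices of the k vectors most similar to the query."""
    scored = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in scored[:k]]

docs = ["wear gloves and goggles", "store away from heat", "flammable liquid and vapor"]
model = ToyEmbeddings(["gloves", "goggles", "heat", "flammable", "store", "vapor"])
doc_vecs = model.embed_documents(docs)
best = top_k(model.embed_query("which gloves should I wear"), doc_vecs, k=1)
```

Here `best` points at the first document, the only one that shares a vocabulary word with the query.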

# Embedding Selection

## Prediction-based:

### OpenAI Embeddings:
- text-embedding-3-small
- text-embedding-3-large

### HuggingFace / SentenceTransformers
- all-MiniLM-L6-v2
- all-mpnet-base-v2
- bge-base-en
- intfloat/e5-large-v2
- Other models from the E5, BGE, and GTE families

## Frequency-based
- Count Vector / Tf-idf Vector
- Co-occurrence matrices
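
As a point of comparison with the prediction-based models, a frequency-based representation can be computed without any trained model. The following is a minimal TF-IDF sketch using whitespace tokenization and an unsmoothed `log(N / df)` weight; both choices are simplifying assumptions, and the function name is invented for this example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocabulary}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        vectors.append([tf[tok] / length * idf[tok] for tok in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = tfidf_vectors([
    "flammable liquid",
    "corrosive liquid",
])
```

The term "liquid" appears in both documents, so its weight is zero, while "flammable" and "corrosive" carry the weight that separates the two.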

## Ad-hoc embeddings
Embeddings specifically trained for a particular domain or task.

## Evaluate
- Recall / MRR: How well it retrieves relevant documents
- Embedding size: Vector dimensionality, which affects storage cost and query latency
- Inference time: Important if processing many documents
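
The retrieval metrics above can be computed from a small labeled set of queries. The sketch below assumes each evaluation item is a pair of a ranked list of retrieved document IDs and a set of relevant IDs; the function names and sample data are illustrative, not taken from this project.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_reciprocal_rank(results):
    """Average of 1/rank of the first relevant result for each query."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0

# Hypothetical evaluation runs: (ranked results, relevant documents)
runs = [
    (["a.md", "b.md", "c.md"], {"b.md"}),
    (["d.md", "e.md", "f.md"], {"d.md", "f.md"}),
]
recall = recall_at_k(runs[1][0], runs[1][1], k=2)
mrr = mean_reciprocal_rank(runs)
```

Running the same labeled queries against each candidate embedding model gives directly comparable numbers.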







# Install dependencies

In [None]:
!pip install openai python-docx pdfplumber
!pip install pymupdf
!pip install -U langchain-community
!pip install chromadb
!pip install langchain_openai

# Import libraries

In [None]:
# Required imports
import os
import re
import pickle
import docx
import fitz  # PyMuPDF
import pdfplumber
from google.colab import drive
import openai
import tiktoken
# Official OpenAI client; the LangChain LLM wrapper of the same name is not used here
from openai import OpenAI
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

In [None]:
# Configure the API key
API_KEY = ""
os.environ["OPENAI_API_KEY"] = API_KEY
client = OpenAI(api_key=API_KEY)

# Load multiple PDFs

## Load documents from a folder in Drive

In [None]:
# Mount Google Drive in Colab
drive.mount('/content/drive')

In [None]:
# Storage paths
pdf_folder = "/content/drive/MyDrive/TFM/Grupo_1_RAG_Chemical_Safety/Notebooks/Safety Data Sheets/"
output_folder = "/content/drive/MyDrive/TFM/Grupo_1_RAG_Chemical_Safety/Notebooks/output_md_openai/"
output_DB_Chroma = "/content/drive/MyDrive/TFM/Grupo_1_RAG_Chemical_Safety/Notebooks/Chroma_DB/"

# Clean documents



In [None]:
def extract_pdf_text(pdf_path):
    # Opens a PDF and extracts the text from all its pages
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    return text

# Process all documents

In [None]:
from openai import OpenAI
client = OpenAI()

def convert_pdfs_to_markdown(pdf_folder, output_folder, pages_per_block=5):
    # Converts all PDFs in a folder into structured Markdown files using LLM
    os.makedirs(output_folder, exist_ok=True)

    for filename in os.listdir(pdf_folder):
        if not filename.lower().endswith(".pdf"):
            continue

        pdf_path = os.path.join(pdf_folder, filename)
        output_path = os.path.join(output_folder, os.path.splitext(filename)[0] + ".md")

        print(f"📄 Processing: {filename}")
        try:
            doc = fitz.open(pdf_path)
            markdown_blocks = []

            total_pages = len(doc)
            for start_page in range(0, total_pages, pages_per_block):
                end_page = min(start_page + pages_per_block, total_pages)

                # Extract concatenated text from the pages in the block
                block_text = ""
                for p in range(start_page, end_page):
                    block_text += doc[p].get_text() + "\n\n"

                # Skip empty blocks
                if not block_text.strip():
                    print(f"Block pages {start_page+1} to {end_page} empty, skipped.")
                    continue

                prompt = f"""
                You are an expert in data science and document conversion.
                Convert the following text extracted from pages {start_page+1} to {end_page} of a PDF document into clean,
                well-structured Markdown suitable for ingestion into a Retrieval-Augmented Generation (RAG) system.
                It is essential to preserve **all relevant information** without omitting any section of the pages.

                Requirements:
                - Maintain the original hierarchical structure of headings using Markdown syntax (#, ##, ###, etc.) as accurately as possible.
                - Correctly format lists, tables, and any logical content structures.
                - Do NOT include Markdown code blocks (```markdown``` or any other code syntax).
                - Do NOT add or retain titles like # Safety Data Sheet.
                - Ensure **no content is omitted**, especially near the beginning of the pages.
                - Keep the output in **clean, clear, and readable Markdown format**.
                - Write all output in **English**.
                - Ignore text styling such as underlines or colored fonts; treat all text as plain.
                - For any images referenced on the pages, insert the image filename as a placeholder in the Markdown at the appropriate location.
                - Remove any page numbers, headers, or footers such as "Page 1", "Page 2", etc.

                Original text from pages {start_page+1} to {end_page}:{block_text}
                """

                response = client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[
                        {
                            "role": "system",
                            "content": "You are an expert in data science and precise document conversion to Markdown for RAG systems. Preserve semantic integrity and remove irrelevant noise."
                        },
                        {"role": "user", "content": prompt},
                    ],
                    temperature=0.0,
                )

                markdown_block = response.choices[0].message.content.strip()
                markdown_blocks.append(markdown_block)
                print(f" Block pages {start_page+1} to {end_page} processed.")

            if markdown_blocks:
                full_markdown = "\n\n---\n\n".join(markdown_blocks)
            else:
                full_markdown = "_Document empty or no relevant text found._"

            with open(output_path, "w", encoding="utf-8") as f:
                f.write(full_markdown)
            print(f"Full file saved at: {output_path}")

        except Exception as e:
            print(f"Error processing {filename}: {e}")

In [None]:
convert_pdfs_to_markdown(pdf_folder, output_folder)
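
To see how `pages_per_block` groups pages before each LLM call, the block boundaries can be computed on their own. This is a small sketch of the same `range`/`min` logic used in the conversion loop; the helper name `page_blocks` is an assumption made for illustration.

```python
def page_blocks(total_pages, pages_per_block=5):
    """Return (start, end) page index pairs, 0-based with end exclusive."""
    return [
        (start, min(start + pages_per_block, total_pages))
        for start in range(0, total_pages, pages_per_block)
    ]

blocks = page_blocks(12, pages_per_block=5)
# Human-readable, 1-based labels like those printed by the conversion loop
labels = [f"pages {start + 1} to {end}" for start, end in blocks]
```

A 12-page document therefore produces three requests, the last covering only pages 11 to 12.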

# Standardize output as LangChain Document

In [None]:
def load_markdown_as_documents(markdown_folder):
    # Loads all Markdown files from a folder and returns them as LangChain Documents
    documents = []
    for filename in os.listdir(markdown_folder):
        if filename.endswith(".md"):
            filepath = os.path.join(markdown_folder, filename)
            with open(filepath, "r", encoding="utf-8") as f:
                text = f.read()
                documents.append(Document(
                    page_content=text,
                    metadata={"source": filename}
                ))
    return documents

In [None]:
docs = load_markdown_as_documents(output_folder)

# Split into chunks

In [None]:
encoding = tiktoken.encoding_for_model("gpt-4")

In [None]:
# Chunks by Markdown structure
def split_into_md_chunks(docs):
    # Splits documents into chunks based on Markdown headers
    docs_chunks_md = []

    # Set splitting criteria
    headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
        return_each_line=False,
        strip_headers=False
    )

    # Process documents and split into chunks
    for doc_index, doc in enumerate(docs):
        chunks = splitter.split_text(doc.page_content)
        for i, chunk in enumerate(chunks):
            chunk.metadata.update(doc.metadata)
            docs_chunks_md.append(chunk)
            token_count = len(encoding.encode(chunk.page_content))
            char_length = len(chunk.page_content)

            print(f"Doc {doc_index} - Chunk {i} - Tokens: {token_count} - Characters: {char_length}")
    return docs_chunks_md

In [None]:
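# Token-based chunks
def split_into_token_chunks(docs, chunk_size=500, chunk_overlap=100):
    # Splits documents into chunks whose size is measured in tokens.
    # from_tiktoken_encoder makes chunk_size and chunk_overlap count tokens;
    # the plain constructor would count characters instead.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        model_name="gpt-4",
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        add_start_index=True,
    )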
    docs_chunks = []
    for doc_index, doc in enumerate(docs):
        chunks = splitter.split_documents([doc])
        for i, chunk in enumerate(chunks):
            token_count = len(encoding.encode(chunk.page_content))
            char_length = len(chunk.page_content)
            print(f"Doc {doc_index} - Chunk {i} - Tokens: {token_count} - Characters: {char_length}")
            docs_chunks.append(chunk)
    return docs_chunks
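
The interaction between `chunk_size` and `chunk_overlap` is easier to see on a toy example. The sketch below uses a plain sliding window over a list of tokens; it is a simplification, since the LangChain splitter also tries to break on separators such as paragraphs and sentences. The function name and the whitespace tokenization are assumptions.

```python
def sliding_window_chunks(tokens, chunk_size=500, chunk_overlap=100):
    """Split a token list into windows that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = "keep away from heat sparks open flames and hot surfaces".split()
chunks = sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=1)
```

Each chunk starts with the last token of the previous one, so a sentence cut at a boundary keeps some context on both sides.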

In [None]:
chunks = split_into_token_chunks(docs)

In [None]:
# Visual inspection of chunks
def inspect_chunks(docs_chunks_md, chunk_index):
    # Displays information and content of a specific Markdown chunk
    chunk = docs_chunks_md[chunk_index]
    print(f"Chunk {chunk_index}")
    print(f"Original document: {chunk.metadata.get('source', 'unknown')}")
    print(f"Tokens: {len(encoding.encode(chunk.page_content))}")
    print(f"Characters: {len(chunk.page_content)}")
    print(f"\nContent:\n{'='*80}\n{chunk.page_content}\n{'='*80}\n")

In [None]:
inspect_chunks(chunks, 0)

# Convert chunks to embeddings and create Vector Database

In [None]:
# Declare the embedding model; the same model must be used for indexing and for querying
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [None]:
def create_embedding_db(docs_chunks):
    # Build and persist a vector database from document chunks
    vectorstore = Chroma.from_documents(
        documents=docs_chunks,
        embedding=embeddings,
        persist_directory=output_DB_Chroma
    )
    vectorstore.persist()
    return vectorstore


In [None]:
create_embedding_db(chunks)

# RAG: Retrieve + Ask



In [None]:
# query = "3-IN-1-All-Purpose-Cleaner"
query = "What are the hazard statements and recommended safety precautions for handling the 3-IN-1-All-Purpose-Cleaner, including personal protection?"

## Simple retrieval

In [None]:
# Load the Chroma DB
db = Chroma(persist_directory=output_DB_Chroma, embedding_function=embeddings)

In [None]:
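# Create a retriever using similarity search (set search_type to "mmr" for more diverse results)
k = 3
fetch_k = 20  # candidate pool size; only meaningful for MMR search
search_type = "similarity"  # or "mmr"
# metadata_filter = {"product_name": "3-IN-1-All-Purpose-Cleaner"}
search_kwargs = {"k": k}
if search_type == "mmr":
    search_kwargs["fetch_k"] = fetch_k
# To filter by metadata, pass the filter through search_kwargs:
# search_kwargs["filter"] = metadata_filter
retriever = db.as_retriever(search_type=search_type, search_kwargs=search_kwargs)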
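
The `mmr` option mentioned for `search_type` stands for maximal marginal relevance, which trades relevance to the query against redundancy among the results already chosen. The following is a minimal sketch of that idea on toy 2-D vectors, not LangChain's implementation; the function names and vectors are assumptions, and `lambda_mult` plays the role of the relevance/diversity weight.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedily pick k indices, penalizing similarity to earlier picks."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already selected
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
diverse = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.3)
relevant_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
```

With a low `lambda_mult` the near-duplicate second vector is skipped in favor of the more distinct third one; with `lambda_mult=1.0` the selection reduces to plain similarity ranking. In a retriever, `fetch_k` would set how many candidates enter this re-ranking step.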

In [None]:
# Retrieve relevant context
# relevant_documents = retriever.get_relevant_documents(query)
relevant_documents = retriever.invoke(query)

# Display documents
for i, doc in enumerate(relevant_documents):
    print(f"\n - Option {i+1}")
    print(f"Source document: {doc.metadata.get('source', 'unknown')}")
    print(f"Detected relevant content:\n{doc.page_content}\n")

# Create input context
context = "\n\n---\n\n".join(doc.page_content for doc in relevant_documents)


## Retrieval connected to the OpenAI API

In [None]:
# Prompt structure for the model
prompt = f"""
You are a laboratory assistant expert in safety documentation for chemical products. Based on the following extracted documentation,
respond clearly and accurately to the user's question. Use only the information provided in the context.
If the answer is not present, respond with: "The information is not available in the documentation."
User question: {query}
Context: {context}
"""

In [None]:
# Send to the model
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a chemical safety assistant. Respond in English with clarity and accuracy."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.2,
)


In [None]:
# Display response
print(response.choices[0].message.content.strip())