# **Gen AI Intensive Course Capstone 2025Q1**
# **AI HealthDecode**
# Juan Romero
https://github.com/Jromero12/geanai_course_25

## **Problem Definition** <a id='title1'></a>

- Why is this problem important to solve?

Advances in medicine, nutrition, and technology have significantly increased life expectancy worldwide. However, our goal today isn’t simply to live longer—it’s to remain healthy, both physically and mentally, throughout those extra years. As populations age, future generations will inherit societies where older adults continue to contribute actively. This shift creates sustainability pressures on health‑care systems and pension schemes, but it also unlocks rich opportunities for intergenerational collaboration and innovation in health, work, and technology.

Preventive health programs—including regular check‑ups and the use of personal data from wearables and wellness apps—empower individuals to spot risks early and adjust habits before chronic diseases develop.

In many Latin American and Caribbean (LAC) countries, lab results are still stored on paper or in fragmented, poorly managed digital systems. As a result, compiling a patient’s complete history often means tracking down records from multiple labs or printing stacks of reports—many of which get lost or become illegible. AI HealthDecode addresses these challenges by centralizing and translating laboratory data into clear, actionable insights, ensuring both patients and clinicians have seamless access to a patient’s full diagnostic history.



### **The objective:** <a id='subtitle1_2'></a>

AI HealthDecode tackles this problem with generative AI. I developed a Retrieval‑Augmented Generation (RAG) pipeline that interprets laboratory test results, with these core objectives:

- **Bring Generative AI into Everyday Health:**
AI HealthDecode isn’t just a backend engine—it lives at the point of care and in patients’ hands, delivering instant, conversational insights.

- **Centralize & Annotate Lab Data in Plain Language:**
Raw laboratory tables can be cryptic even for busy clinicians. AI HealthDecode aggregates every result into one unified repository and generates clear, jargon‑free explanations for non‑specialists.

- **Empower Clinicians with Contextual Insights & Pattern Detection:**
Rather than replacing medical judgment, AI HealthDecode amplifies it—providing longitudinal trend charts, flagging outliers, and surfacing relevant guidelines so doctors can intervene earlier.

- **Enable Patients & Foster a Preventive‑Health Mindset:**
True transformation happens when patients own their health journey. AI HealthDecode translates data into actionable advice, delivers personalized reminders, and builds health literacy to sustain long‑term well‑being.

By weaving together human‑centric AI, transparent communication, clinician collaboration, and patient empowerment, AI HealthDecode drives earlier detection, sharper decision‑making, and ultimately longer, healthier lives.

## **Setup**

Start by installing and importing the Python SDK.

## **Remove unused conflicting packages** 

In [None]:
!pip uninstall -qqy jupyterlab 
!pip uninstall -qqy chromadb
!pip uninstall -qqy langchain-community

In [None]:
!pip install -U -q "google-genai==1.7.0"

In [None]:
%%bash
# upgrade pip quietly
pip install --upgrade pip > /dev/null 2>&1

# install both packages, discarding all stdout/stderr
pip install chromadb langchain-community > /dev/null 2>&1 || true

## **Import necessary libraries**

In [None]:
import re
from datetime import datetime

import chromadb
import pandas as pd
from chromadb import Documents, EmbeddingFunction, Embeddings
from google import genai
from google.api_core import retry
from google.genai import types
from IPython.display import Markdown
from langchain.docstore.document import Document
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


genai.__version__

## **Set up API key**

In [None]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
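
Outside Kaggle, `kaggle_secrets` is not available. Here is a minimal sketch of a fallback, assuming the key is exported as an environment variable named `GOOGLE_API_KEY`. The helper name `load_api_key` is hypothetical and not part of the notebook.

```python
import os


def load_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from Kaggle secrets, or from the environment as a fallback."""
    try:
        # Only importable inside a Kaggle kernel.
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        # Assumption: outside Kaggle the key lives in an environment variable.
        key = os.environ.get(name)
        if not key:
            raise RuntimeError(f"Set the {name} environment variable first.")
        return key
    return UserSecretsClient().get_secret(name)
```

With this in place, `GOOGLE_API_KEY = load_api_key()` would work both on Kaggle and on a local machine.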

## **Explore available models**
This notebook uses the [`embedContent`](https://ai.google.dev/api/embeddings#method:-models.embedcontent) API method to compute embeddings. The cell below lists the available models that support it.

In [None]:
client = genai.Client(api_key=GOOGLE_API_KEY)

for m in client.models.list():
    if "embedContent" in m.supported_actions:
        print(m.name)

## **Data**

In [None]:
%%bash
for file in lab_file1.pdf lab_file2.pdf lab_file3.pdf; do
  wget -q "https://raw.githubusercontent.com/Jromero12/geanai_course_25/main/$file"
done

In [None]:
pdf_paths = ['lab_file1.pdf', 'lab_file2.pdf', 'lab_file3.pdf']
all_docs = [
    doc
    for path in pdf_paths
    for doc in PyPDFLoader(path).load()
]
print(f"Loaded {len(all_docs)} pages total.")

In [None]:
records = []

for page in all_docs:
    text = page.page_content

    # 1. Flexible date regex: allows spaces between year digits
    m = re.search(
        r'FECHA\s+DE\s+REGISTRO\s*[:\s-]*'   # header
        r'([0-3]?\d)\s*/\s*'                 # day
        r'([01]?\d)\s*/\s*'                  # month
        r'(\d(?:\s*\d){3})',                 # year (4 digits, spaces allowed)
        text,
        flags=re.IGNORECASE
    )
    if m:
        day_str, mon_str, year_str = m.group(1), m.group(2), m.group(3)
        # remove any spaces in the captured year
        year_str = re.sub(r'\s+', '', year_str)  # "202 3" → "2023"
        
        # build a datetime
        dt = datetime(
            year=int(year_str),
            month=int(mon_str),
            day=int(day_str)
        )
        year_month = dt.strftime('%Y%m')    # e.g. "202303" or "202504"
        day        = dt.strftime('%d')      # e.g. "21"
    else:
        year_month = day = None

    # 2. Lab-test regex: test name, optional method, numeric result, units (e.g. mg/dL), reference range
    pattern = re.findall(
        r'(?P<test>[A-Z0-9 \-\(\)]+)\s+'
        r'(?P<method>[\w\s\-óÓéÉíÍúÚñÑ]+)?\*?\s+'
        r'(?P<result>\d+\.?\d*)\s+'
        r'(?P<units>\w+\s*/\s*\w+)\s+'
        r'(?P<range>[\d\.]+\s*-\s*[\d\.]+)',
        text
    )
    if not pattern:
        continue

    # 3. Build DataFrame for this page, attach date parts
    df_page = pd.DataFrame(
        pattern,
        columns=["Exam/Test", "Method", "Result", "Units", "Reference Range"]
    )
    df_page["YearMonth"] = year_month
    df_page["Day"]       = day

    records.append(df_page)

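# 4. Concatenate the per-page frames, strip whitespace, and order the columns
if not records:
    raise ValueError("No lab results were parsed; check the regexes against the extracted page text.")
df = pd.concat(records, ignore_index=True)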
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
df = df[
    ["YearMonth", "Day", "Exam/Test", "Method", "Result", "Units", "Reference Range"]
]

print(df)
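
To see what the two patterns above capture, here is a small self-contained check. The sample strings are made up for illustration (they are not taken from the lab PDFs), and `parse_date` is a hypothetical helper name.

```python
import re
from datetime import datetime

DATE_RE = re.compile(
    r'FECHA\s+DE\s+REGISTRO\s*[:\s-]*'
    r'([0-3]?\d)\s*/\s*([01]?\d)\s*/\s*(\d(?:\s*\d){3})',
    flags=re.IGNORECASE,
)
LAB_RE = re.compile(
    r'(?P<test>[A-Z0-9 \-\(\)]+)\s+'
    r'(?P<method>[\w\s\-óÓéÉíÍúÚñÑ]+)?\*?\s+'
    r'(?P<result>\d+\.?\d*)\s+'
    r'(?P<units>\w+\s*/\s*\w+)\s+'
    r'(?P<range>[\d\.]+\s*-\s*[\d\.]+)'
)


def parse_date(text):
    """Return (YearMonth, Day) strings, or (None, None) if no date header is found."""
    m = DATE_RE.search(text)
    if not m:
        return None, None
    day, month, year = m.groups()
    year = re.sub(r'\s+', '', year)  # PDF extraction may split the year, e.g. "202 3"
    dt = datetime(int(year), int(month), int(day))
    return dt.strftime('%Y%m'), dt.strftime('%d')


# Hypothetical sample strings, written to resemble an extracted report.
header = "FECHA DE REGISTRO: 21/03/202 3"
row = "GLUCOSA Enzimático 92.0 mg/dL 70 - 100"

year_month, day = parse_date(header)
rows = LAB_RE.findall(row)
print(year_month, day)  # 202303 21
print(rows)             # [('GLUCOSA', 'Enzimático', '92.0', 'mg/dL', '70 - 100')]
```

Because the test-name character class also accepts digits and spaces, applying the lab pattern to multi-line text can attach digits from a preceding line to the test name. Matching line by line is one way to limit that.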

In [None]:
# 1. Serialize each row as a pipe-delimited line (YearMonth | Day | test | method | result | range) and join the lines
text_data = "\n".join(
    df.apply(
        lambda row: (
            f"{row['YearMonth']} | {row['Day']} | "
            f"{row['Exam/Test']} | {row['Method'] or ''} | "
            f"{row['Result']} {row['Units']} | {row['Reference Range']}"
        ),
        axis=1
    )
)

# 2. Wrap as a single LangChain Document
data_filtrado = [Document(page_content=text_data)]

# 3. Define and apply the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500,
    chunk_overlap=300,
    length_function=len
)
documents = text_splitter.split_documents(data_filtrado)

# 4. Output
print(f"Generated {len(documents)} chunks")
for i, doc in enumerate(documents, start=1):
    print(f"\n--- Chunk {i} ---\n{doc.page_content}\n")
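
The `chunk_size` and `chunk_overlap` parameters control how much text each fragment holds and how much it shares with its neighbor. The sketch below illustrates that idea with a plain sliding window. It is a simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to split on separators such as newlines), and `chunk_with_overlap` is a hypothetical name.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a value that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.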

## **Creating the embedding database with ChromaDB**

In [None]:
# Define a retry policy. A complex query can trigger several consecutive model calls,
# so retry automatically when the API reports a quota limit (429) or is unavailable (503).

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
  genai.models.Models.generate_content = retry.Retry(
      predicate=is_retriable)(genai.models.Models.generate_content)
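
For readers unfamiliar with predicate-based retries, the sketch below shows the general idea using only the standard library. It is a simplified illustration, not how `google.api_core.retry.Retry` is implemented. `retry_if`, `QuotaError`, and the delay values are all hypothetical.

```python
import time
from functools import wraps


class QuotaError(Exception):
    """Hypothetical error carrying an HTTP-like status code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def retry_if(predicate, attempts=4, initial_delay=0.01, multiplier=2.0):
    """Retry the wrapped function while predicate(exc) is true, with exponential backoff."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Re-raise non-retriable errors, or give up after the last attempt.
                    if not predicate(exc) or attempt == attempts:
                        raise
                    time.sleep(delay)
                    delay *= multiplier
        return wrapper
    return decorator


is_retriable = lambda e: isinstance(e, QuotaError) and e.code in {429, 503}
calls = {"count": 0}


@retry_if(is_retriable)
def flaky_call():
    """Fail twice with a 429, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise QuotaError(429)
    return "ok"


result = flaky_call()
print(result, calls["count"])  # ok 3
```

`functools.wraps` sets a `__wrapped__` attribute on the wrapper, which is the kind of marker the `hasattr` check in the cell above relies on to avoid wrapping the method twice.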

In [None]:
# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})


class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents, or queries
    document_mode = True

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]

Now create a [Chroma database client](https://docs.trychroma.com/getting-started) that uses the `GeminiEmbeddingFunction` and populate the database with the documents you defined above.

In [None]:
DB_NAME = "googlecardb"

embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)

db.add(documents=[doc.page_content for doc in documents], ids=[str(i) for i in range(len(documents))])

## **Retrieval: Find relevant documents**

To search the Chroma database, call the `query` method. Note that you also switch to the `retrieval_query` mode of embedding generation.
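
Conceptually, a vector query embeds the question and ranks the stored chunks by how close their vectors are to it. The toy sketch below illustrates this with hand-written 3-dimensional vectors and cosine similarity. The vectors, the texts, and the `top_k` helper are made up, and Chroma's actual index and default distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k stored texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up embeddings; real ones come from the embedding model and have many more dimensions.
store = [
    ("glucose results", [0.9, 0.1, 0.0]),
    ("cholesterol results", [0.1, 0.9, 0.0]),
    ("appointment notes", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.3, 0.0]

print(top_k(query_vec, store))  # ['glucose results', 'cholesterol results']
```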

In [None]:
# Switch to query mode when generating embeddings.
embed_fn.document_mode = False

# Search the Chroma DB using the specified query.
query = "Can you give me an explanation of the test results? Compare the periods"

result = db.query(query_texts=[query], n_results=5)
[all_passages] = result["documents"]

query_oneline = query.replace("\n", " ")

prompt = f"""
You are an assistant specialized in interpreting clinical lab results commonly requested by doctors. Your task is to help users understand these results clearly, especially if they have no background in health or medicine.

Explain each concept using simple language, avoid technical jargon, and maintain a friendly and approachable tone. If any of the text is unrelated to the user’s question, feel free to ignore it.

Since this assistant will be primarily used in Latin American countries, respond in Spanish, but also provide the explanation in English for those who may need it.

Always end with the following disclaimer:
Important: These are general recommendations only. It’s best to consult a doctor or a registered dietitian. They can provide a personalized treatment or nutrition plan based on your specific needs and medical history. Don’t hesitate to ask them if you have more questions!

PREGUNTA: {query_oneline}
"""


# Add the retrieved documents to the prompt.
for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"PASSAGE: {passage_oneline}\n"

In [None]:
answer = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt)

Markdown(answer.text)
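
The retrieval-and-prompt steps above are repeated for each question later in the notebook. One way to avoid the duplication is a small helper like the sketch below. `build_prompt` and the shortened `INSTRUCTIONS` text are hypothetical; the full instructions from the cell above would go in their place.

```python
INSTRUCTIONS = (
    "You are an assistant specialized in interpreting clinical lab results. "
    "Explain them in simple language, in Spanish and then in English."
)  # Shortened placeholder, not the notebook's full prompt.


def build_prompt(query: str, passages: list[str], instructions: str = INSTRUCTIONS) -> str:
    """Assemble the instructions, the question, and one PASSAGE line per retrieved chunk."""
    query_oneline = query.replace("\n", " ")
    lines = [instructions, "", f"PREGUNTA: {query_oneline}"]
    for passage in passages:
        passage_oneline = passage.replace("\n", " ")
        lines.append(f"PASSAGE: {passage_oneline}")
    return "\n".join(lines) + "\n"


prompt = build_prompt(
    "What does my glucose result mean?",
    ["202303 | 21 | GLUCOSA | Enzimático | 92.0 mg/dL | 70 - 100"],  # made-up row
)
print(prompt)
```

With such a helper, each question cell reduces to a `db.query` call, `build_prompt(query, all_passages)`, and `client.models.generate_content`.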

In [None]:
# Search the Chroma DB using the specified query.
query = "Can you give me some dietary recommendations based on the latest analyzed results?"

result = db.query(query_texts=[query], n_results=5)
[all_passages] = result["documents"]

query_oneline = query.replace("\n", " ")

prompt = f"""
You are an assistant specialized in interpreting clinical lab results commonly requested by doctors. Your task is to help users understand these results clearly, especially if they have no background in health or medicine.

Explain each concept using simple language, avoid technical jargon, and maintain a friendly and approachable tone. If any of the text is unrelated to the user’s question, feel free to ignore it.

Since this assistant will be primarily used in Latin American countries, respond in Spanish, but also provide the explanation in English for those who may need it.

Always end with the following disclaimer:
Important: These are general recommendations only. It’s best to consult a doctor or a registered dietitian. They can provide a personalized treatment or nutrition plan based on your specific needs and medical history. Don’t hesitate to ask them if you have more questions!

PREGUNTA: {query_oneline}
"""


# Add the retrieved documents to the prompt.
for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"PASSAGE: {passage_oneline}\n"

answer = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt)

Markdown(answer.text)

In [None]:
# Search the Chroma DB using the specified query.
query = "Which specialist do you recommend I book to get a deeper interpretation of my analysis and guidance on the next steps?"

result = db.query(query_texts=[query], n_results=5)
[all_passages] = result["documents"]

query_oneline = query.replace("\n", " ")

prompt = f"""You are an assistant specialized in interpreting clinical lab results commonly requested by doctors. Your task is to help users understand these results clearly, especially if they have no background in health or medicine.

Explain each concept using simple language, avoid technical jargon, and maintain a friendly and approachable tone. If any of the text is unrelated to the user’s question, feel free to ignore it.

Since this assistant will be primarily used in Latin American countries, respond in Spanish, but also provide the explanation in English for those who may need it.

Always end with the following disclaimer:
Important: These are general recommendations only. It’s best to consult a doctor or a registered dietitian. They can provide a personalized treatment or nutrition plan based on your specific needs and medical history. Don’t hesitate to ask them if you have more questions!


PREGUNTA: {query_oneline}
"""


# Add the retrieved documents to the prompt.
for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"PASSAGE: {passage_oneline}\n"

answer = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt)

Markdown(answer.text)