# 0. Setting up the Project

In [1]:
from agno.models.azure import AzureOpenAI
from agno.agent import Agent
import os

model_name = "gpt-4.1-mini"
api_version = "2025-04-01-preview"
# Fill in your Azure OpenAI endpoint and API key before running.
endpoint = ""
api_key = ""

os.environ["AZURE_OPENAI_API_KEY"] = api_key
os.environ["AZURE_OPENAI_ENDPOINT"] = endpoint
os.environ["OPENAI_API_VERSION"] = api_version

In [2]:
agent = Agent(
    model=AzureOpenAI(id=model_name, temperature=0),
    description="You are an enthusiastic news reporter with a flair for storytelling!",
)
res = agent.run("Tell me about a breaking news story from Aachen.")
print(res.content)

Breaking News from Aachen! 

This morning, Aachen witnessed an extraordinary event as a historic medieval festival kicked off in the city center, drawing crowds from across Europe. The festival, celebrating Aachen’s rich heritage, features traditional crafts, music, and reenactments of medieval life. Streets are alive with vibrant costumes, knights in armor, and artisans demonstrating age-old techniques. 

City officials expressed excitement about the boost to local tourism and the opportunity to showcase Aachen’s unique history. Visitors can enjoy everything from jousting tournaments to authentic medieval cuisine over the next week. Stay tuned for more updates as the festival unfolds!


## Resources

* Agno: https://docs.agno.com/introduction
* Heart Failure paper: https://link.springer.com/article/10.1007/s10741-021-10105-w

# 1. Embeddings

In [3]:
from agno.embedder.azure_openai import AzureOpenAIEmbedder

embeddings = AzureOpenAIEmbedder(id='text-embedding-3-small').get_embedding(
    "Center for Computational Life Sciences"
)

print(f"Embeddings: {embeddings[:10]}")
print(f"Dimensions: {len(embeddings)}")

Embeddings: [0.016539961, -0.025933826, 0.048515484, 0.03760145, -0.04732014, -0.002327353, -0.010972504, 0.029467896, -0.014123281, -0.006298308]
Dimensions: 1536


### Tasks
* What is the word embedding of your name?
* Look up how token embeddings and positional embeddings are combined into the model input
* What other embedding models are there? How do they differ?
* Find out the general structure of embedding models
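
As a starting point for the second task, the sketch below adds a sinusoidal positional encoding (the scheme from the original Transformer paper) to toy token embeddings. The token table and its values are made up for illustration, and this is not a description of how `text-embedding-3-small` works internally, which is not documented here.

```python
import math


def sinusoidal_position(pos: int, dim: int) -> list[float]:
    """Sinusoidal positional encoding for a single position."""
    vec = []
    for i in range(dim):
        # Even and odd indices of a pair share the same frequency.
        angle = pos / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


# Hypothetical toy token embeddings (made-up values, dimension 4).
token_table = {
    "heart": [0.1, 0.2, 0.3, 0.4],
    "failure": [0.5, -0.1, 0.0, 0.2],
}


def embed_sequence(tokens: list[str]) -> list[list[float]]:
    """Element-wise sum of token embedding and positional encoding."""
    dim = len(next(iter(token_table.values())))
    return [
        [t + p for t, p in zip(token_table[tok], sinusoidal_position(pos, dim))]
        for pos, tok in enumerate(tokens)
    ]


inputs = embed_sequence(["heart", "failure"])
for tok, vec in zip(["heart", "failure"], inputs):
    print(tok, [round(v, 3) for v in vec])
```

Because the position is added to the token vector, the same word at two different positions yields two different input vectors.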


# 2. Setting up the vector database and knowledge base

In [66]:
from agno.vectordb.chroma import ChromaDb

vector_db = ChromaDb(
    collection="ccls",
    embedder=AzureOpenAIEmbedder(id='text-embedding-3-small')
)
print(vector_db.collection_name)

ccls


## Knowledge Base

In [67]:
from agno.knowledge.pdf import PDFKnowledgeBase, PDFReader

knowledge_base = PDFKnowledgeBase(
    path="heart_failure_review.pdf",
    vector_db=vector_db,
    reader=PDFReader(chunk=True),
)
print(knowledge_base.vector_db)

<agno.vectordb.chroma.chromadb.ChromaDb object at 0x31e618750>


### Tasks
* How do vector databases retrieve relevant embeddings?
* Which vector databases are widely used? How do they differ?
* What are the advantages and disadvantages of setting up a vector database in a Docker container?
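
For the first task, a minimal sketch of the underlying idea: rank stored vectors by cosine similarity to the query vector and return the top k. The three-dimensional vectors and document names are toy values chosen for illustration. This brute-force scan is not how ChromaDb is implemented; production vector databases typically rely on approximate nearest-neighbor indexes to avoid comparing against every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings" (made-up values) keyed by a hypothetical chunk id.
store = {
    "biomarkers": [0.9, 0.1, 0.0],
    "comorbidities": [0.1, 0.9, 0.0],
    "festival": [0.0, 0.1, 0.9],
}
query = [0.8, 0.3, 0.0]

results = top_k(query, store, k=2)
print(results)
```

The `k` argument plays the same role as `num_documents` in the knowledge base configuration used later in the Retrieval section.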

# 3. Chunking

## Fixed Size chunking

In [49]:
print(knowledge_base.chunking_strategy)
chunks = list(knowledge_base.document_lists)[0]
print(len(chunks))
for chunk in chunks[:3]:
    print(f"{chunk.content[:100]}")
    print(f"Chunk metadata: {chunk.meta_data}")

<agno.document.chunking.fixed.FixedSizeChunking object at 0x31e6b31d0>


29
Vol.:(0123456789)1 3 h ttps://doi.org/10.1007/s10741-021-10105-w Biomarkers for the diagnosis and ma
Chunk metadata: {'page': 1, 'chunk': 1, 'chunk_size': 4137}
1 3 help refine the management of patients with HF and further improve their prognosis. Characterist
Chunk metadata: {'page': 2, 'chunk': 1, 'chunk_size': 4895}
1 3 Diagnosis In the Breathing Not Properly study, which included 1586 patients admitted to the emer
Chunk metadata: {'page': 3, 'chunk': 1, 'chunk_size': 2036}


In [52]:
# Let's check out the first chunk
print(chunks[0].content)
print(len(chunks[0].content))

Vol.:(0123456789)1 3 h ttps://doi.org/10.1007/s10741-021-10105-w Biomarkers for the diagnosis and management of heart failure Vincenzo Castiglione1 · Alberto Aimo1,2 · Giuseppe Vergaro1,2 · Luigi Saccaro1 · Claudio Passino1,2 · Michele Emdin1,2 Accepted: 6 April 2021 © The Author(s) 2021 Abstract Heart failure (HF) is a significant cause of morbidity and mortality worldwide. Circulating biomarkers reflecting pathophysi- ological pathways involved in HF development and progression may assist clinicians in early diagnosis and management of HF patients. Natriuretic peptides (NPs) are cardioprotective hormones released by cardiomyocytes in response to pressure or volume overload. The roles of B-type NP (BNP) and N-terminal pro-B-type NP (NT-proBNP) for diagnosis and risk stratification in HF have been extensively demonstrated, and these biomarkers are emerging tools for population screening and as guides to the start of treatment in subclinical HF. On the contrary, conflicting evidence exi

## Document chunking

In [53]:
from agno.document.chunking.document import DocumentChunking

pdf_knowledge_base = PDFKnowledgeBase(
    path="heart_failure_review.pdf",
    vector_db=vector_db,
    reader=PDFReader(chunk=True),
    chunking_strategy=DocumentChunking()
)

print(pdf_knowledge_base.chunking_strategy)
chunks = list(pdf_knowledge_base.document_lists)[0]
print(len(chunks))
for chunk in chunks[:3]:
    print(f"{chunk.content[:100]}")
    print(f"Chunk metadata: {chunk.meta_data}")

<agno.document.chunking.document.DocumentChunking object at 0x31e4a2110>


19
Vol.:(0123456789)1 3
h
ttps://doi.org/10.1007/s10741-021-10105-w
Biomarkers for the diagnosis and ma
Chunk metadata: {'page': 1}
1 3
help refine the management of patients with HF and further 
improve their prognosis.
Characteris
Chunk metadata: {'page': 2}
1 3
Diagnosis
In the Breathing Not Properly study, which included 1586 
patients admitted to the eme
Chunk metadata: {'page': 3}


## Agentic chunking

In [None]:
from agno.document.chunking.agentic import AgenticChunking

pdf_knowledge_base = PDFKnowledgeBase(
    path="heart_failure_review.pdf",
    vector_db=vector_db,
    reader=PDFReader(chunk=True),
    chunking_strategy=AgenticChunking()
)

print(pdf_knowledge_base.chunking_strategy)
chunks = list(pdf_knowledge_base.document_lists)[0]
print(len(chunks))
for chunk in chunks[:3]:
    print(f"{chunk.content[:100]}")
    print(f"Chunk metadata: {chunk.meta_data}")

### Tasks
* What is the default chunk size for FixedSizeChunking (look up Agno docs)?
* Why does FixedSizeChunking have the Overlap parameter? Why would you need chunk overlap?
* Which chunking strategy takes the longest?
* Think of use cases for the different chunking techniques
* (Difficult) As you can see, the chunks start with the characters "1 3" (in the first chunk, right after "Vol.:(0123456789)"). Why?
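
To experiment with the overlap question, here is a minimal sketch of character-based fixed-size chunking with overlap. It is a simplified stand-in written for this exercise, not Agno's `FixedSizeChunking` implementation, and the function name and parameters are assumptions.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears in full in one of the two chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last three characters of the previous one. A larger overlap preserves more context across boundaries but also produces more chunks to embed and store.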

# 4. Retrieval

## Simple retrieval

In [80]:
knowledge_base = PDFKnowledgeBase(
    path="heart_failure_review.pdf",
    vector_db=vector_db,
    num_documents=5,
    reader=PDFReader(chunk=True),
)
knowledge_base.load()

In [81]:
agent = Agent(
    model=AzureOpenAI(id=model_name),
    knowledge=knowledge_base,
    search_knowledge=True,
)
res = agent.run("What is the main topic of the paper?")
print(res.content)

The main topic of the paper is about heart failure, specifically focusing on the pathophysiological pathways involved in heart failure, and the various biomarkers associated with different aspects such as neurohormonal activation, myocardial injury, cardiac remodeling, inflammation, oxidative stress, and comorbidities. The paper discusses the predictive and prognostic roles of these biomarkers in heart failure outcomes.


## Using instructions

In [73]:
agent = Agent(
    model=AzureOpenAI(id=model_name, temperature=0),
    knowledge=knowledge_base,
    instructions=[
        "Cite the exact phrases and section titles from the sources in your response.",
        "Use enumerations to organize your response.",
        "Do not write any other text than the response.",
    ],
    search_knowledge=True,
)
agent.knowledge.load()

res = agent.run("List the comorbidities of heart failure.")
print(res.content)


The comorbidities of heart failure include:

1. Kidney dysfunction:
   - Creatinine, azotaemia, and GFR are used to monitor effects of HF therapies and are prognostic markers.
   - Cystatin C may be an outcome predictor in acute and chronic HF.
   - Biomarkers of renal damage such as neutrophil gelatinase-associated lipocalin (NGAL), kidney injury molecule-1 (KIM-1), and N-acetyl-β-(D)-glucosaminidase (NAG) are associated with HF outcomes.
   - Fibroblast growth factor 23 (FGF-23) is linked to cardiac hypertrophy and HF and may have prognostic significance in HFrEF or HFpEF.
   (Source: Section titled "Comorbidities")

2. Liver dysfunction:
   - Common in advanced HF due to venous congestion from right ventricular dysfunction.
   - Elevation of transaminases, bilirubin, and hypoalbuminemia stratify risk in acute or chronic HF.
   (Source: Section titled "Comorbidities")

3. Iron deficiency and anemia:
   - Anemia is a predictor of adverse outcomes in acute and chronic HF.
   - Iron def

## Using structured outputs

In [75]:
from pydantic import BaseModel, Field

class Comorbidity(BaseModel):
    comorbidity: str = Field(description="Comorbidity")
    symptoms: list[str] = Field(description="Symptoms of the comorbidity")
    treatment: list[str] = Field(description="Treatment for the comorbidity")
    prevalence: float = Field(description="Prevalence of the comorbidity")

class ComorbiditiesList(BaseModel):
    comorbidities: list[Comorbidity] = Field(description="List of comorbidities")
    
agent = Agent(
    model=AzureOpenAI(id=model_name),
    knowledge=knowledge_base,
    instructions=[
        "Cite the exact phrases and section titles from the sources in your response.",
        "Use enumerations to organize your response.",
        "Do not write any other text than the response.",
    ],
    response_model=ComorbiditiesList,
    search_knowledge=True,
)
agent.knowledge.load()

res = agent.run("List the comorbidities of heart failure.")
res.content.comorbidities

[Comorbidity(comorbidity='Kidney dysfunction', symptoms=['Elevated creatinine', 'Azotemia', 'Reduced glomerular filtration rate (GFR)'], treatment=['Monitoring with creatinine, azotemia, and GFR', 'Managing renal function in HF therapy especially diuretics'], prevalence=0.0),
 Comorbidity(comorbidity='Liver dysfunction', symptoms=['Elevated transaminases', 'Elevated bilirubin', 'Hypoalbuminemia'], treatment=['Risk stratification based on liver function tests', 'Treat underlying HF to relieve venous congestion'], prevalence=0.0),
 Comorbidity(comorbidity='Iron deficiency and anemia', symptoms=['Worsening HF symptoms', 'Fatigue', 'Poor exercise capacity'], treatment=['Correction of iron deficiency with iron supplementation', 'Management of anemia'], prevalence=40.0),
 Comorbidity(comorbidity='Thyroid dysfunction', symptoms=['Subclinical hypothyroidism', 'Low T3 syndrome', 'Worsened HF symptoms'], treatment=['Assessment and management of thyroid function abnormalities'], prevalence=0.0),


### Tasks
* Try out more advanced, detailed questions. Where does the LLM struggle?
* How can you change how many chunks are retrieved? What are the advantages and disadvantages of retrieving more chunks?
* Design other structured output models (e.g. Biomarkers) and have the agent return them
* Structured outputs are useful when a pipeline needs to feed the LLM output reliably into the next step. Can you think of ways to use LLM outputs in data analysis / ML pipelines?
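
One possible answer to the last task is to flatten the structured output into rows for tabular analysis. The sketch below works on plain dictionaries shaped like the `Comorbidity` model so that it runs without a model call. The names, symptoms, and prevalence values are copied from the output above, and the `n_symptoms` column is an illustrative feature, not something the notebook defines.

```python
import csv
import io

# Records shaped like the Comorbidity model; values copied from the agent output above.
records = [
    {"comorbidity": "Kidney dysfunction",
     "symptoms": ["Elevated creatinine", "Azotemia",
                  "Reduced glomerular filtration rate (GFR)"],
     "prevalence": 0.0},
    {"comorbidity": "Liver dysfunction",
     "symptoms": ["Elevated transaminases", "Elevated bilirubin", "Hypoalbuminemia"],
     "prevalence": 0.0},
    {"comorbidity": "Iron deficiency and anemia",
     "symptoms": ["Worsening HF symptoms", "Fatigue", "Poor exercise capacity"],
     "prevalence": 40.0},
    {"comorbidity": "Thyroid dysfunction",
     "symptoms": ["Subclinical hypothyroidism", "Low T3 syndrome",
                  "Worsened HF symptoms"],
     "prevalence": 0.0},
]


def to_rows(items: list[dict]) -> list[dict]:
    """Flatten nested records into one flat row per comorbidity."""
    return [
        {
            "comorbidity": item["comorbidity"],
            "n_symptoms": len(item["symptoms"]),
            "prevalence": item["prevalence"],
        }
        for item in items
    ]


rows = to_rows(records)

# Serialize to CSV text so the rows can be handed to any tabular tool.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["comorbidity", "n_symptoms", "prevalence"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

With the actual agent response, each Pydantic object can be turned into such a dictionary first (in Pydantic v2, via `model_dump()`). Note that the prevalence values in the output above are reported inconsistently (0.0 for most entries, 40.0 for one), which is worth checking before using them downstream.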

## Using other Knowledge Sources

### ArXiv

In [82]:
from agno.knowledge.arxiv import ArxivKnowledgeBase

knowledge_base = ArxivKnowledgeBase(
    queries=["heart failure"],
    vector_db=vector_db,
)
knowledge_base.load()

agent = Agent(
    model=AzureOpenAI(id=model_name),
    knowledge=knowledge_base,
    search_knowledge=True,
)
agent.knowledge.load()

res = agent.run("What is the latest research on heart failure?")
print(res.content)

The latest research on heart failure highlights several important areas:

1. Heart failure (HF) remains a major cause of morbidity and mortality globally, affecting around 64 million people worldwide, with increasing prevalence due to aging populations and comorbidities.

2. HF is classified based on left ventricular ejection fraction (LVEF) into:
   - HF with preserved ejection fraction (HFpEF, LVEF ≥ 50%)
   - HF with mid-range ejection fraction (HFmrEF, LVEF 40–49%)
   - HF with reduced ejection fraction (HFrEF, LVEF < 40%)

3. HFpEF is often related to diastolic dysfunction and comorbidities such as obesity and chronic kidney disease, while HFrEF typically results from systolic dysfunction due to direct heart damage.

4. Neuroendocrine imbalances involving the sympathetic nervous system and the renin-angiotensin-aldosterone system play a central role in HFrEF pathophysiology.

5. Pharmacological treatments that have improved HF survival include beta-blockers, ACE inhibitors, minera

In [83]:
for doc in list(knowledge_base.document_lists)[0]:
    print(doc.name)
    print(doc.meta_data)
    print(doc.content[:100])
    print("-"*100)

Predicting Heart Failure with Attention Learning Techniques Utilizing Cardiovascular Data
{'pdf_url': 'http://arxiv.org/pdf/2407.08289v1', 'article_links': 'http://arxiv.org/abs/2407.08289v1, http://arxiv.org/pdf/2407.08289v1'}
Cardiovascular diseases (CVDs) encompass a group of disorders affecting the
heart and blood vessels,
----------------------------------------------------------------------------------------------------
Leveraging Natural Learning Processing to Uncover Themes in Clinical Notes of Patients Admitted for Heart Failure
{'pdf_url': 'http://arxiv.org/pdf/2204.07074v1', 'article_links': 'http://arxiv.org/abs/2204.07074v1, http://arxiv.org/pdf/2204.07074v1'}
Heart failure occurs when the heart is not able to pump blood and oxygen to
support other organs in 
----------------------------------------------------------------------------------------------------
Automated Identification of Drug-Drug Interactions in Pediatric Congestive Heart Failure Patients
{'pdf_url': 'http:

### Websites

In [84]:
url = "https://pmc.ncbi.nlm.nih.gov/articles/PMC1955040/"

from agno.knowledge.website import WebsiteKnowledgeBase

knowledge_base = WebsiteKnowledgeBase(
    url=url,
    vector_db=vector_db,
)
knowledge_base.load()

agent = Agent(
    model=AzureOpenAI(id=model_name),
    knowledge=knowledge_base,
    search_knowledge=True,
)
agent.knowledge.load()

res = agent.run("What are preventive measures for heart failure?")
print(res.content)

Preventive measures for heart failure mainly revolve around managing risk factors and optimizing medical therapy to prevent the development or worsening of left ventricular dysfunction. Key preventive strategies include:

1. Screening and Early Detection: Use of biomarkers, particularly natriuretic peptides (BNP and NT-proBNP), is recommended to screen individuals at risk of heart failure. These biomarkers help in early diagnosis and risk stratification.

2. Control of Comorbid Conditions: Managing conditions that contribute to heart failure such as obesity, chronic kidney disease, chronic obstructive pulmonary disease, and anemia is crucial. 

3. Pharmacological Therapies: Use of medications including beta-blockers, angiotensin-converting enzyme (ACE) inhibitors or angiotensin receptor blockers, mineralocorticoid receptor antagonists, neprilysin inhibitors, and sodium-glucose co-transporter 2 inhibitors have shown survival benefits in heart failure patients.

4. Lifestyle Modification

### CSV files

In [85]:
import pandas as pd
heart_df = pd.read_csv("heart.csv")
heart_df.head()

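|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |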


In [87]:
from agno.knowledge.csv import CSVKnowledgeBase

knowledge_base = CSVKnowledgeBase(
    path="heart.csv",
    vector_db=vector_db,
)

agent = Agent(
    model=AzureOpenAI(id=model_name),
    knowledge=knowledge_base,
    search_knowledge=True,
)
agent.knowledge.load()
res = agent.run("What are the dataset columns?")
print(res.content)

The dataset columns include various clinical and demographic features relevant to cardiovascular and heart failure studies. Here are some of the key columns observed:

- Age (e.g., 130, 35, 61)
- Sex (M for male, F for female)
- Chest pain type/classification (e.g., ATA, ASY, NAP, TA)
- Resting blood pressure (e.g., 122, 148, 114)
- Serum cholesterol (e.g., 192, 203, 318)
- Fasting blood sugar (0 or 1)
- Resting electrocardiographic results (e.g., Normal, ST, LVH)
- Maximum heart rate achieved (e.g., 174, 161, 140)
- Exercise-induced angina (Y or N)
- ST depression induced by exercise relative to rest (e.g., 0, 4.4)
- Slope of the peak exercise ST segment (e.g., Up, Flat, Down)
- Heart disease presence (0 or 1)

These columns represent patient demographics, clinical measurements, test results, and diagnostic labels that help in heart disease or heart failure prediction studies. If you want a more specific or different dataset's columns, please specify.


### Tasks
* Use a keyword representative of your research area for the ArXiv search and ask a question related to your topic
* How is the text scraped from the website?
* Remove the target column of the heart CSV file (e.g. heart_df.drop("HeartDisease", axis=1, inplace=True)) and load only the first five rows (heart_df.head()). How good is the agent at predicting the target column?