# Summarizing with LLMs

This notebook demonstrates how to summarize the output of OAK commands with LLMs, using the [datasette LLM command line tool](https://llm.datasette.io/en/stable/).

See also:

- [How to use LLMs with OAK](https://incatools.github.io/ontology-access-kit/howtos/use-llms.html)


## Install the LLM command line tool

```
pip install llm
```

You may also want to install plugins for your models of choice:

```
pip install llm-deepseek
```

## Summarize outputs

You can pipe any output you like to `llm`. For example, consider this OAK query to get the definitions of all kinds of hearts in Uberon:

In [6]:
!runoak -i sqlite:obo:uberon definitions .sub "circulatory organ"

id	label	definition
UBERON:0000948	heart	A myogenic muscular circulatory organ found in the vertebrate cardiovascular system composed of chambers of cardiac muscle. It is the primary circulatory organ.
UBERON:0007100	primary circulatory organ	A hollow, muscular organ, which, by contracting rhythmically, keeps up the circulation of the blood or analogs[GO,modified].
UBERON:0015202	lymph heart	A circulatory organ that is reponsible for pumping lymph throughout the body.
UBERON:0015227	peristaltic circulatory vessel	A vessel down which passes a wave of muscular contraction, that forces the flow of haemolymphatic fluid.
UBERON:0015228	circulatory organ	A hollow, muscular organ, which, by contracting rhythmically, contributes to the circulation of lymph, blood or analogs. Examples: a chambered vertebrate heart; the tubular peristaltic heart of ascidians; the dorsal vessel of an insect; the lymoh heart of a reptile.
UBERON:0015229	accessory circulatory organ	A circulatory organ that is not r

In [7]:
!runoak -i sqlite:obo:uberon definitions .sub "circulatory organ" | llm -m 4o -s "give a summary of these terms and critical comments on definitions"

This dataset contains definitions and critical comments on various anatomical terms related to circulatory and lymphatic organs. Here's a summary of the terms listed:

1. **Heart (UBERON:0000948):** Defined as a myogenic muscular organ in vertebrates, responsible for circulating blood through its chambers of cardiac muscle. It is characterized as the primary circulatory organ.

2. **Primary Circulatory Organ (UBERON:0007100):** Described as a hollow, muscular organ that rhythmically contracts to maintain blood circulation. This definition emphasizes the functional role of the heart or equivalent structures in different organisms.

3. **Lymph Heart (UBERON:0015202):** A type of circulatory organ whose main function is to pump lymph throughout the body, highlighting its role in the lymphatic system rather than the blood circulatory system.

4. **Peristaltic Circulatory Vessel (UBERON:0015227):** A vessel that uses waves of muscular contraction to move haemolymphatic fluid, commonly found 
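
The same pipeline can also be driven from Python instead of a shell escape. The following is a minimal sketch, assuming `runoak` and `llm` are both on your `PATH` and a model key is configured. The helper names `oak_definitions_cmd`, `llm_cmd`, and `pipe` are hypothetical and are not part of OAK or `llm`; the flags are only those used in the cells above.

```python
import subprocess


def oak_definitions_cmd(ontology: str, term: str) -> list[str]:
    """Build the `runoak -i ONTOLOGY definitions .sub TERM` command (hypothetical helper)."""
    return ["runoak", "-i", ontology, "definitions", ".sub", term]


def llm_cmd(model: str, system_prompt: str) -> list[str]:
    """Build the `llm -m MODEL -s PROMPT` command (hypothetical helper)."""
    return ["llm", "-m", model, "-s", system_prompt]


def pipe(producer: list[str], consumer: list[str]) -> str:
    """Run the equivalent of `producer | consumer` and return the consumer's stdout."""
    produced = subprocess.run(producer, capture_output=True, text=True, check=True)
    consumed = subprocess.run(
        consumer, input=produced.stdout, capture_output=True, text=True, check=True
    )
    return consumed.stdout


# Example call (needs runoak, llm, and network access, so it is left commented out):
# print(pipe(
#     oak_definitions_cmd("sqlite:obo:uberon", "circulatory organ"),
#     llm_cmd("4o", "give a summary of these terms and critical comments on definitions"),
# ))
```

Keeping the two commands as argument lists avoids shell quoting problems when the term or the prompt contains spaces.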

## Templates

The `llm` tool lets you define reusable templates, so you do not have to retype the system prompt each time. Create or edit one with:

`llm templates edit summarize-definitions`

Then in your editor:

```yaml
system: give a summary of these terms and critical comments on definitions
```
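
If you would rather create the template without opening an editor, one option is to write the YAML file yourself. This is a sketch under an assumption: that `llm` stores each template as `<name>.yaml` in the directory printed by `llm templates path`. The function `write_system_template` is a hypothetical helper, not part of `llm`.

```python
import json
from pathlib import Path


def write_system_template(templates_dir: Path, name: str, system_prompt: str) -> Path:
    """Write `<name>.yaml` with a single `system:` key and return its path."""
    templates_dir.mkdir(parents=True, exist_ok=True)
    path = templates_dir / f"{name}.yaml"
    # A JSON string literal is also a valid YAML double-quoted scalar,
    # so json.dumps gives safe quoting without requiring a YAML library.
    quoted = json.dumps(system_prompt, ensure_ascii=False)
    path.write_text(f"system: {quoted}\n", encoding="utf-8")
    return path


# Example (assumption: replace the path with the output of `llm templates path`):
# write_system_template(
#     Path("/path/printed/by/llm-templates-path"),
#     "summarize-definitions",
#     "give a summary of these terms and critical comments on definitions",
# )
```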

In [8]:
!runoak -i sqlite:obo:uberon definitions .sub "circulatory organ" | llm -m 4o -t summarize-definitions


This dataset provides definitions for various anatomical terms related to circulatory organs across different species, especially focusing on aspects of their structure and functions.

1. **Heart (UBERON:0000948)**: Defined as a myogenic muscular organ in vertebrates responsible for circulating blood, it's depicted as the primary organ in the cardiovascular system. Critical insight could involve the need for clarification on variations in structure and function across different vertebrate species.

2. **Primary Circulatory Organ (UBERON:0007100)**: Essentially an organ responsible for keeping blood or similar substances circulating via rhythmic contractions. The definition emphasizes its hollow and muscular nature. Critiques might focus on the broad definition that necessitates specifying how it differs from accessory organs in terms of function.

3. **Lymph Heart (UBERON:0015202)**: A specialized organ pumping lymph, reflecting its distinct role from blood-circulating hearts. The defini

## Gene summaries

Create a template for summarizing gene annotations:

`llm templates edit summarize-gaf-for-gene` 

```yaml
system: I will provide you with GAF for a gene. Summarize the function of the gene.
  Give one short description a biologist would understand.
  You may weave together multiple terms where there is redundancy.
  You should aim to be faithful to the GAF, but be aware that mistakes and over-annotation happen.
  If you see things that are unlikely, you can omit these.
  You may also produce some commentary at the end
  (e.g. 'the GAF showed annotation to X but this contradicts what is known about the gene').
  Do not focus on the evidence, or names, or IDs, or metadata about the annotation,
  just write the biological narrative.
  The exception is if this is really relevant (e.g. you may call into question a very old annotation if it
  does not make sense).
  Be aware that historically there has been over-annotation with experimental codes, for example, phenotypes from downstream effects.
  These are less relevant, and you should focus on the core activity, cellular process, and localization.
  You may however choose to briefly summarize phenotypic annotations (e.g. the role of G in process P has downstream effects E1, ...).
  Use your judgment to explain the story biologically rather than simply regurgitating terms.
  Note that the IBA code (inferred from biological aspect of ancestor) reflects high quality annotations in many species because these terms
  have been reviewed in a phylogenetic context and checked for over-annotation.
  But note that IBAs may sometimes be less complete, especially for organism-specific knowledge.
  Use your own biological knowledge.
  If aspects of the model are not clear, or you think there are errors, then at the end of your summary report on problems or anything that was not clear.
```

In [9]:
!runoak -i amigo:NCBITaxon:9606 associations -p i,p -H  --expand GO:0009229 

# Query IDs: GO:0009229
# Ontology closure predicates: rdfs:subClassOf, BFO:0000050
#
# The results include a round of expansion
#
subject	predicate	object	property_values	subject_label	predicate_label	object_label	negated	publications	evidence_type	supporting_objects	primary_knowledge_source	aggregator_knowledge_source	subject_closure	subject_closure_label	object_closure	object_closure_label	comments
UniProtKB:Q9BZV2	biolink:related_to	GO:0009229		SLC19A3	None	thiamine diphosphate biosynthetic process	False	GO_REF:0000107	IEA		Ensembl	infores:go					
UniProtKB:Q9H3S4	biolink:related_to	GO:0009229		TPK1	None	thiamine diphosphate biosynthetic process	False	GO_REF:0000041	IEA		UniProt	infores:go					
UniProtKB:Q9H3S4	biolink:related_to	GO:0009229		TPK1	None	thiamine diphosphate biosynthetic process	False	PMID:11342111	IDA		UniProt	infores:go					
UniProtKB:Q9H3S4	biolink:related_to	GO:0009229		TPK1	None	thiamine diphosphate biosynthetic process	False	PMID:38547260	IDA		UniProt	infores:go
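
The `summarize-gaf-for-gene` template is worded for a single gene, whereas this query returns rows for more than one. One possible approach, sketched here in Python, is to group the rows by the `subject_label` column so that each gene can be summarized separately. `group_rows_by_gene` is a hypothetical helper; it assumes tab-separated output with `#` comment lines and a header row, as in the output shown here.

```python
import csv
import io
from collections import defaultdict


def group_rows_by_gene(tsv_text: str) -> dict[str, list[dict[str, str]]]:
    """Group association rows by `subject_label`, skipping blank and `#` comment lines."""
    data_lines = [
        line for line in tsv_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    # QUOTE_NONE: treat quote characters in labels as ordinary text.
    reader = csv.DictReader(
        io.StringIO("\n".join(data_lines)), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    groups = defaultdict(list)
    for row in reader:
        groups[row["subject_label"]].append(row)
    return dict(groups)
```

Each group can then be serialized back to TSV and piped to `llm -t summarize-gaf-for-gene` one gene at a time.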

In [10]:
!runoak -i amigo:NCBITaxon:9606 associations -p i,p -H  --expand GO:0009229 | llm -m 4o -t summarize-gaf-for-gene

The gene annotated with the process "thiamine diphosphate biosynthetic process" is involved in the synthesis of thiamine diphosphate (TDP), a coenzyme form of thiamine (vitamin B1) critical for various enzymatic reactions. Here's a summary of the functions of related gene products:

1. **TPK1 (Thiamine Pyrophosphokinase 1)**: TPK1 plays a direct role in the thiamine diphosphate biosynthetic process by catalyzing the conversion of thiamine (vitamin B1) into thiamine diphosphate. This enzyme exhibits thiamine diphosphokinase activity, utilizing ATP in the phosphorylation process. It is also involved in the regulation of acetyl-CoA biosynthesis from pyruvate, a crucial step in energy metabolism. It predominantly localizes in the cytosol.

2. **SLC19A2 and SLC19A3 (Thiamine Transporters)**: These are integral membrane proteins that primarily facilitate the transmembrane transport of thiamine and its derivatives. They exhibit thiamine transmembrane transporter activity and localize to the plasma