# Summarizing with LLMs

This notebook demonstrates how to summarize the output of OAK commands with LLMs, using the [datasette LLM command line tool](https://llm.datasette.io/en/stable/).

See also:

- [How to use LLMs with OAK](https://incatools.github.io/ontology-access-kit/howtos/use-llms.html)


## Install the LLM command line tool

```
pip install llm
```

You may also want to install plugins for your models of choice:

```
pip install llm-deepseek
```

## Summarize outputs

You can pipe any output you like to `llm`. For example, consider this OAK query to get the definitions of all kinds of hearts in Uberon, shown first on its own and then piped into `llm` with a system prompt (`-s`):

In [1]:
!runoak -i sqlite:obo:uberon definitions .sub "circulatory organ"

id	label	definition
UBERON:0000948	heart	A myogenic muscular circulatory organ found in the vertebrate cardiovascular system composed of chambers of cardiac muscle. It is the primary circulatory organ.
UBERON:0007100	primary circulatory organ	A hollow, muscular organ, which, by contracting rhythmically, keeps up the circulation of the blood or analogs[GO,modified].
UBERON:0015202	lymph heart	A circulatory organ that is reponsible for pumping lymph throughout the body.
UBERON:0015227	peristaltic circulatory vessel	A vessel down which passes a wave of muscular contraction, that forces the flow of haemolymphatic fluid.
UBERON:0015228	circulatory organ	A hollow, muscular organ, which, by contracting rhythmically, contributes to the circulation of lymph, blood or analogs. Examples: a chambered vertebrate heart; the tubular peristaltic heart of ascidians; the dorsal vessel of an insect; the lymoh heart of a reptile.
UBERON:0015229	accessory circulatory organ	A circulatory organ that is

In [2]:
!runoak -i sqlite:obo:uberon definitions .sub "circulatory organ" | llm -m 4o -s "give a summary of these terms and critical comments on definitions"

The terms provided mainly describe various structures within the circulatory system, both in vertebrates and invertebrates, with a focus on definitions from anatomical and biological perspectives. Here is a summary of the terms along with critical comments on their definitions:

1. **Heart (UBERON:0000948):** Defined as a myogenic muscular organ in vertebrates, responsible for circulating blood through cardiac muscle chambers. The term appropriately highlights the heart’s primary function and structural characteristics.

2. **Primary Circulatory Organ (UBERON:0007100):** Described as a hollow, muscular organ that maintains blood circulation through rhythmic contractions. The definition is clear, albeit a bit redundant with the general concept of a "heart."

3. **Lymph Heart (UBERON:0015202):** Defined as an organ pumping lymph throughout the body. The definition clearly states its function, but it might benefit from specifying its presence in particular animal groups.

4. **Per
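The same pipe can also be driven from Python, which may be handy outside a notebook. The following is a minimal sketch rather than anything provided by OAK or `llm`: `pipe_to` is a hypothetical helper, and actually running the two commands assumes `runoak` and `llm` are on your `PATH` and that an API key is configured.

```python
import subprocess


def pipe_to(producer_cmd, consumer_cmd):
    """Run producer_cmd, feed its stdout to consumer_cmd's stdin, return the consumer's stdout."""
    # check=True raises CalledProcessError if either command exits non-zero
    produced = subprocess.run(producer_cmd, capture_output=True, text=True, check=True)
    consumed = subprocess.run(
        consumer_cmd, input=produced.stdout, capture_output=True, text=True, check=True
    )
    return consumed.stdout


# The two commands from the cells above, as argument lists (no shell quoting needed)
OAK_CMD = ["runoak", "-i", "sqlite:obo:uberon", "definitions", ".sub", "circulatory organ"]
LLM_CMD = ["llm", "-m", "4o", "-s",
           "give a summary of these terms and critical comments on definitions"]

# Not executed here because it needs network access and an API key:
# summary = pipe_to(OAK_CMD, LLM_CMD)
```

Note that this version buffers the whole OAK output in memory before handing it to `llm`, which is fine for small tables like the one above but is not a streaming pipe.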

## Templates

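The `llm` tool lets you save a reusable prompt as a named template, which you can later select with `-t`. To create or edit one, run:

`llm templates edit summarize-definitions`

Then, in the editor that opens, enter: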

```yaml
system: give a summary of these terms and critical comments on definitions
```
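If you would rather script template creation than open an editor, something like the following may work. It is a sketch under assumptions: `render_template` and `write_template` are hypothetical helpers, not part of `llm`, and the target directory is assumed to be the one printed by `llm templates path`, where templates live as `<name>.yaml` files.

```python
import textwrap
from pathlib import Path


def render_template(system_prompt):
    """Return template YAML with the system prompt as a literal block scalar."""
    # Indent every non-blank line by two spaces so it sits under 'system: |'
    body = textwrap.indent(system_prompt.strip(), "  ")
    return f"system: |\n{body}\n"


def write_template(directory, name, system_prompt):
    """Write <name>.yaml into directory and return the path that was written."""
    path = Path(directory) / f"{name}.yaml"
    path.write_text(render_template(system_prompt), encoding="utf-8")
    return path
```

Using a block scalar (`|`) keeps long, multi-line prompts readable and avoids YAML quoting problems if the prompt contains characters such as `:` or `#`.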

In [3]:
!runoak -i sqlite:obo:uberon definitions .sub "circulatory organ" | llm -m 4o -t summarize-definitions


The dataset provides definitions for various types of circulatory organs and structures within the UBERON ontology, which is a comprehensive multi-species anatomy ontology encompassing multiple biological domains. Here is a summary of each term included:

1. **Heart (UBERON:0000948)**: Described as a myogenic muscular organ within the vertebrate cardiovascular system. It is the primary circulatory organ that functions by moving blood throughout the body via chambers of cardiac muscle.

2. **Primary Circulatory Organ (UBERON:0007100)**: A hollow, muscular organ that maintains blood circulation through rhythmic contractions. The definition is adapted from the Gene Ontology (GO).

3. **Lymph Heart (UBERON:0015202)**: A circulatory organ tasked with pumping lymphatic fluid throughout the body. It serves a function analogous to the heart but specific to lymph circulation.

4. **Peristaltic Circulatory Vessel (UBERON:0015227)**: Described as a vessel in which muscular contractions se

## Gene summaries

Create a template for summarizing gene annotations:

`llm templates edit summarize-gaf-for-gene` 

```yaml
system: I will provide you with GAF for a gene. Summarize the function of the gene.
  Give one short description a biologist would understand.
  You may weave together multiple terms where there is redundancy.
  You should aim to be faithful to the GAF, but be aware that mistakes and over-annotation happen.
  If you see things that are unlikely, you can omit these.
  You may also produce some commentary at the end
  (e.g. 'the GAF showed annotation to X but this contradicts what is known about the gene')
  Do not focus on the evidence, or names, or IDs, or metadata about the annotation,
  just write the biological narrative.
  The exception is if this is really relevant (e.g. you may call into question a very old annotation if it
  does not make sense).
  Be aware that historically there has been over-annotation with experimental codes, for example, phenotypes from downstream effects.
  These are less relevant, and you should focus on the core activity, cellular process, and localization.
  You may however choose to briefly summarize phenotypic annotations (e.g. the role of G in process P has downstream effects E1, ...).
  Use your judgment to explain the story biologically rather than simply regurgitating terms.
  Note that the IBA code (Inferred from Biological aspect of Ancestor) reflects high quality annotations in many species because these terms
  have been reviewed in a phylogenetic context and checked for over-annotation.
  But note that IBAs may sometimes be less complete, especially for organism-specific knowledge.
  Use your own biological knowledge.
  If aspects of the model are not clear, or you think there are errors, then at the end of your summary report on problems or anything that was not clear.
```

In [8]:
!runoak -i amigo:NCBITaxon:9606 associations -p i,p -H  --expand GO:0009229 

# Query IDs: GO:0009229
# Ontology closure predicates: rdfs:subClassOf, BFO:0000050
#
# The results include a round of expansion
#
subject	predicate	object	property_values	subject_label	predicate_label	object_label	negated	publications	evidence_type	supporting_objects	primary_knowledge_source	aggregator_knowledge_source	subject_closure	subject_closure_label	object_closure	object_closure_label	comments
UniProtKB:Q9BZV2	biolink:related_to	GO:0009229		SLC19A3	None	thiamine diphosphate biosynthetic process	False	GO_REF:0000107	IEA		Ensembl	infores:go					
UniProtKB:Q9H3S4	biolink:related_to	GO:0009229		TPK1	None	thiamine diphosphate biosynthetic process	False	GO_REF:0000041	IEA		UniProt	infores:go					
UniProtKB:Q9H3S4	biolink:related_to	GO:0009229		TPK1	None	thiamine diphosphate biosynthetic process	False	PMID:11342111	IDA		UniProt	infores:go					
UniProtKB:Q9H3S4	biolink:related_to	GO:0009229		TPK1	None	thiamine diphosphate biosynthetic process	False	PMID:38547260	IDA		UniProt	i

In [11]:
!runoak -i amigo:NCBITaxon:9606 associations -p i,p -H  --expand GO:0009229 | llm -m 4o -t summarize-gaf-for-gene

The gene products associated with the thiamine diphosphate biosynthetic process (GO:0009229) involve several proteins including SLC19A3, TPK1, SLC25A19, SLC19A2, and THTPA, each contributing to the synthesis and transport of thiamine-related compounds in different capacities.

SLC19A3 is primarily involved in the transport of thiamine and pyridoxine across the plasma membrane, displaying thiamine transmembrane transporter activity. It facilitates thiamine transmembrane transport and is located on the plasma membrane. This protein also plays a role in thiamine transport in general, contributing to the overall metabolic process of thiamine-containing compounds.

TPK1 (thiamine pyrophosphate kinase) catalyzes the ATP-dependent phosphorylation of thiamine to form thiamine diphosphate, which is the active form of thiamine used as a coenzyme in various enzymatic reactions. It has been detected in the cytosol and also possesses kinase and ATP binding activity.

SLC25A19 is responsible f
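The query above returns associations for several genes at once, whereas the template is worded for a single gene. One possible refinement is to split the table by gene and summarize each part separately. The sketch below is only an illustration: `split_by_gene` is a hypothetical helper, and it assumes the tab-separated layout shown above, with `#` comment lines followed by a header row containing a `subject_label` column.

```python
from collections import defaultdict


def split_by_gene(tsv_text, gene_column="subject_label"):
    """Split a runoak associations table into one TSV string per gene."""
    # Drop blank lines and the '#' comment preamble
    lines = [ln for ln in tsv_text.splitlines() if ln.strip() and not ln.startswith("#")]
    if not lines:
        return {}
    header, rows = lines[0], lines[1:]
    col = header.split("\t").index(gene_column)

    grouped = defaultdict(list)
    for row in rows:
        fields = row.split("\t")
        # Short rows fall into an empty-string group instead of raising
        gene = fields[col] if col < len(fields) else ""
        grouped[gene].append(row)

    # Repeat the header in each part so every chunk is a self-describing table
    return {gene: "\n".join([header, *gene_rows]) + "\n" for gene, gene_rows in grouped.items()}
```

Each value in the returned dictionary can then be passed to `llm -m 4o -t summarize-gaf-for-gene` on standard input, giving one summary per gene.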