This is a simple walkthrough of how the code works.

First, we look at the two main components of the pipeline: the `context_shortener` and the `form_filler`.
The `context_shortener` is responsible for the retrieval part: it reduces the context from the full document to the most relevant chunks.
The `form_filler` takes that context and a form, and fills in the form.

## Context Shortening

In [1]:
import context_shortening



First, we chunk some text.

In [None]:
# some sample text from wikipedia
document = """Melanoma is the most dangerous type of skin cancer; it develops from the melanin-producing cells known as melanocytes.[1] It typically occurs in the skin, but may rarely occur in the mouth, intestines, or eye (uveal melanoma).[1][2]"""
chunks = context_shortening.chunk_by_headeres_and_clean(document, 50, 0, False)
for i, chunk in enumerate(chunks):
    print(i,":",chunk.text)

0 : Melanoma is the most dangerous type of skin
1 : cancer; it develops from the melanin-producing
2 : cells known as melanocytes.[1] It typically
3 : occurs in the skin, but may rarely occur in the
4 : mouth, intestines, or eye (uveal melanoma).[1][2]
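
The splitting logic itself lives inside `context_shortening` and is not shown in this walkthrough. As a rough illustration of what a character-based chunk size means, the sketch below greedily packs whitespace-separated words into chunks that stay under a character limit. The function name `greedy_word_chunks` and the strict length limit are assumptions made for this example, not the library's implementation.

```python
def greedy_word_chunks(text: str, chunk_size: int) -> list[str]:
    """Pack whitespace-separated words into chunks of fewer than chunk_size characters.

    A single word longer than chunk_size is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and len(candidate) >= chunk_size:
            # Adding the word would reach the limit: close the current chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Melanoma is the most dangerous type of skin cancer; it develops from the melanin-producing cells"
for i, chunk in enumerate(greedy_word_chunks(sample, 50)):
    print(i, ":", chunk)
```

Splitting on whitespace keeps words intact, which is why the chunks printed above have uneven lengths.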


Next, we need a pydantic model so that the `context_shortener` knows which fields to look for.

In [3]:
import pydantic
from typing import Literal
class Disease(pydantic.BaseModel):

    Disease_name: str = pydantic.Field(description="Name of disease")
    Body_part: Literal["Leg", "Arm", "Skin"] = pydantic.Field(description="Affected body part")


Now let's try out some context shorteners.

In [4]:
description_based_retriever = context_shortening.Retrieval(
    chunk_info_to_compare="direct", # this means we use the chunk directly, as opposed to "keybert" which generates n keywords per chunk.
    field_info_to_compare="description",
    include_choice_every=1,
    embedding_model_id="all-MiniLM-L6-v2",
    n_keywords=1,
    top_k=3, # find k most relevant chunks, and put into one string
    chunk_size=50,
    chunk_overlap=0,
    pydantic_form=Disease,
)
description_based_retriever.set_document(document)
disease_name_context = description_based_retriever(answer_field_name = "Disease_name")
print(disease_name_context)

mouth, intestines, or eye (uveal melanoma).[1][2]
...
occurs in the skin, but may rarely occur in the
...
Melanoma is the most dangerous type of skin


In [5]:
body_part_context = description_based_retriever(answer_field_name = "Body_part")
print(body_part_context)

occurs in the skin, but may rarely occur in the
...
mouth, intestines, or eye (uveal melanoma).[1][2]
...
Melanoma is the most dangerous type of skin
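
Conceptually, the retriever compares each chunk with information about the requested field (here, the field description), ranks the chunks by similarity, and joins the `top_k` best ones into one string. The sketch below illustrates that idea with a toy bag-of-words vector in place of the `all-MiniLM-L6-v2` sentence embedding, so it runs without downloading a model. The names `bow_vector`, `cosine`, and `top_k_context` are hypothetical, and the `...` separator simply mimics the printed output above; this is not the actual `Retrieval` implementation.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k_context(chunks: list[str], query: str, top_k: int) -> str:
    """Rank chunks by similarity to the query and join the top_k best ones."""
    query_vec = bow_vector(query)
    ranked = sorted(chunks, key=lambda c: cosine(bow_vector(c), query_vec), reverse=True)
    return "\n...\n".join(ranked[:top_k])


chunks = [
    "Melanoma is the most dangerous type of skin",
    "cancer; it develops from the melanin-producing",
    "occurs in the skin, but may rarely occur in the",
]
print(top_k_context(chunks, query="skin", top_k=2))
```

A word-count vector only matches identical words, whereas a sentence-embedding model can also rank chunks that are semantically related to the query without sharing any words with it.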


The "choices" (choice-list) alternative requires that every field in the `pydantic_form` is a `Literal`.

In [6]:
import pydantic
from typing import Literal
class LiteralDisease(pydantic.BaseModel):

    Disease_name: Literal["Carcinoma", "Melanoma", "Medulloblastoma"] = pydantic.Field(description="Name of disease")
    Body_part: Literal["Leg", "Arm", "Skin"] = pydantic.Field(description="Affected body part")




choices_based_retriever = context_shortening.Retrieval(
    chunk_info_to_compare="direct", # this means we use the chunk directly, as opposed to "keybert" which generates n keywords per chunk.
    field_info_to_compare="choices",
    include_choice_every=1,
    embedding_model_id="all-MiniLM-L6-v2",
    n_keywords=1,
    top_k=2,
    chunk_size=50,
    chunk_overlap=0,
    pydantic_form=LiteralDisease,
)

choices_based_retriever.set_document(document)

context = choices_based_retriever(answer_field_name = "Body_part")
print(context)

Melanoma is the most dangerous type of skin
...
occurs in the skin, but may rarely occur in the
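
The requirement that a choices-based form contains only `Literal` fields can be checked up front. The sketch below inspects class annotations with the standard `typing` helpers and collects the allowed choices per field. It uses plain annotated classes so that it runs without pydantic; reading `__annotations__` from a pydantic model in the same way is an assumption, and `literal_choices` is a hypothetical helper, not part of `context_shortening`.

```python
from typing import Literal, get_args, get_origin


def literal_choices(form: type) -> dict[str, tuple]:
    """Map each annotated field to its Literal choices; raise if a field is not a Literal."""
    choices = {}
    for name, annotation in form.__annotations__.items():
        if get_origin(annotation) is not Literal:
            raise TypeError(f"Field {name!r} is not a Literal: {annotation!r}")
        choices[name] = get_args(annotation)
    return choices


class LiteralForm:
    # Plain class standing in for a form whose fields are all Literal.
    Disease_name: Literal["Carcinoma", "Melanoma", "Medulloblastoma"]
    Body_part: Literal["Leg", "Arm", "Skin"]


class MixedForm:
    Disease_name: str  # free text, so there is no choice list to compare against
    Body_part: Literal["Leg", "Arm", "Skin"]


print(literal_choices(LiteralForm))
```

Calling `literal_choices(MixedForm)` raises a `TypeError`, because a free-text field has no choice list to compare chunks against.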


## Form Filling


First, we load an LLM: Llama 3.1 8B Instruct with GPTQ INT4 quantization.

In [7]:
import dspy
import outlines
model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"
dspy_model = dspy.HFModel(model = model_id, hf_device_map = "cuda:0")

hf_model = dspy_model.model
hf_tokenizer = dspy_model.tokenizer

# set some dspy model options
#dspy_model.kwargs["max_tokens"]=args.max_tokens
dspy_model.drop_prompt_from_output = True

# define outlines llm and sampler
outlines_llm = outlines.models.Transformers(model=hf_model, tokenizer=hf_tokenizer)
outlines_sampler = outlines.samplers.GreedySampler()


CUDA extension not installed.
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.18s/it]
Some weights of the model checkpoint at hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 were not used when initializing LlamaForCausalLM: ['model.layers.0.mlp.down_proj.bias', 'model.layers.0.mlp.gate_proj.bias', 'model.layers.0.mlp.up_proj.bias', ...

In [8]:
import form_filling

form_filler = form_filling.SequentialFormFiller(
    outlines_llm=outlines_llm,
    outlines_sampler=outlines_sampler,
    pydantic_form=Disease,
)

Compiling FSM index for all state transitions: 100%|██████████| 3/3 [00:00<00:00,  5.15it/s]


In [9]:
filled_form = form_filler.forward(description_based_retriever)

print(filled_form)

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


Disease_name='Malignant melanoma' Body_part='Skin'
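
The name `SequentialFormFiller` suggests that the form is filled one field at a time, with a field-specific context retrieved for each field. The sketch below illustrates such a loop under that assumption. `fill_form_sequentially`, `get_context`, and `answer` are hypothetical names, and the toy answer function stands in for the constrained LLM generation; this is not the code in `form_filling`.

```python
from typing import Callable


def fill_form_sequentially(
    field_names: list[str],
    get_context: Callable[[str], str],
    answer: Callable[[str, str], str],
) -> dict[str, str]:
    """Fill one field at a time, retrieving a field-specific context for each."""
    filled = {}
    for name in field_names:
        context = get_context(name)           # e.g. retriever(answer_field_name=name)
        filled[name] = answer(name, context)  # e.g. a constrained LLM generation
    return filled


# Stand-ins so the sketch runs without a model or a retriever.
contexts = {
    "Disease_name": "Melanoma is the most dangerous type of skin",
    "Body_part": "occurs in the skin, but may rarely occur in the",
}


def toy_answer(field_name: str, context: str) -> str:
    # Toy rule instead of an LLM: first word for the name, keyword lookup for the body part.
    if field_name == "Disease_name":
        return context.split()[0]
    return "Skin" if "skin" in context.lower() else "Leg"


filled = fill_form_sequentially(["Disease_name", "Body_part"], contexts.__getitem__, toy_answer)
print(filled)
```

Retrieving a separate context per field keeps each prompt short, which matches the purpose of the context shortener described at the top.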


## Evaluation

`Disease_name` is a `str` (not a `Literal`), so it is evaluated with a similarity score.
`Body_part` is a `Literal`, so it scores 1 if the prediction is correct and 0 otherwise.

In [14]:
labels = {"Disease_name":["melanoma"], "Body_part":["Skin"]}

import evaluation
scores = evaluation.score_general_prediction(labels, filled_form)
print(scores)

{'Disease_name': 0.8594872219577656, 'Body_part': 1.0}


For comparison, we make up some other labels:

In [25]:
labels = {"Disease_name":["Cancer"], "Body_part":["Leg"]}

import evaluation
scores = evaluation.score_general_prediction(labels, filled_form)
print(scores)

{'Disease_name': 0.6777031254235824, 'Body_part': 0.0}
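
The two scoring rules used in this section can be sketched as follows. This is not the code behind `evaluation.score_general_prediction`: the function `score_field` is hypothetical, and `difflib.SequenceMatcher` serves as a simple string-similarity stand-in, so its numbers will differ from the scores printed above. Taking the best score over all accepted labels is also an assumption.

```python
from difflib import SequenceMatcher


def score_field(labels: list[str], prediction: str, is_literal: bool) -> float:
    """Score one field: exact match for Literal fields, string similarity otherwise."""
    if is_literal:
        return 1.0 if prediction in labels else 0.0
    # Free-text field: best character-level similarity against any accepted label.
    # default=0.0 covers an empty label list (an assumption about how to treat it).
    return max(
        (SequenceMatcher(None, prediction.lower(), label.lower()).ratio() for label in labels),
        default=0.0,
    )


labels = {"Disease_name": ["melanoma"], "Body_part": ["Skin"]}
prediction = {"Disease_name": "Malignant melanoma", "Body_part": "Skin"}
is_literal = {"Disease_name": False, "Body_part": True}

scores = {
    name: score_field(labels[name], prediction[name], is_literal[name])
    for name in labels
}
print(scores)
```

The free-text score rewards partial overlap ("Malignant melanoma" against "melanoma"), while the `Literal` score is all or nothing.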


## Dataset Loader

In [11]:
import dataset_loader
documents, labels = dataset_loader.load_arxpr_data(max_amount=8, version="2_25", mode="train")

In [12]:
documents

{'26925227': 'BioC-API\ncollection.key\nCC BY\n[version 2; referees: 2 approved]\nRNA-seq quantification gene expression transcriptomics\nThis is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.\nsurname:Soneson;given-names:Charlotte\nsurname:Love;given-names:Michael I.\nsurname:Robinson;given-names:Mark D.\nIn version 2 of the manuscript, we have reworded and improved the clarity of the text in several places, to better convey the differences between the contrasted approaches and clarify the questions that are addressed by each analysis. We have also expanded the supplementary material with additional analyses of accuracy of abundance estimates among paralogous genes, and added more detailed descriptions of aspects such as the calculation of each abundance measure and the generation of the incomplete annotation files

In [13]:
labels

{'26925227': {'hardware_4': ['illumina hiseq 2000'],
  'organism_part_5': [],
  'experimental_designs_10': ['case control design'],
  'assay_by_molecule_14': ['rna assay'],
  'study_type_18': ['rna-seq of coding rna']},
 '25435910': {'hardware_4': [],
  'organism_part_5': [],
  'experimental_designs_10': [],
  'assay_by_molecule_14': ['rna assay'],
  'study_type_18': ['rna-seq of coding rna']},
 '23671666': {'hardware_4': [],
  'organism_part_5': [],
  'experimental_designs_10': [],
  'assay_by_molecule_14': ['rna assay'],
  'study_type_18': ['transcription profiling by array']},
 '23079210': {'hardware_4': [],
  'organism_part_5': [],
  'experimental_designs_10': [],
  'assay_by_molecule_14': [],
  'study_type_18': ['transcription profiling by rt-pcr']},
 '29980666': {'hardware_4': ['illumina hiseq 2000'],
  'organism_part_5': [],
  'experimental_designs_10': [],
  'assay_by_molecule_14': [],
  'study_type_18': []},
 '21299862': {'hardware_4': [],
  'organism_part_5': [],
  'experimen

## load_modules and run_modules


For these, take a look at (or run) `main.py`.
`load_modules` loads a set of documents and labels, as well as a `context_shortener` and a `form_filler`, as specified in the arguments.
`run_modules` iterates through the documents, evaluates each filled form, and stores the scores in wandb/weave. This requires login details, so create your own wandb project if you want to use it.
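
The loop described above can be sketched roughly as follows, reusing the calls shown earlier in this walkthrough (`set_document`, `forward`, and a scoring function). `run_pipeline` and the stub classes are hypothetical stand-ins so that the example runs without a model; this is not the code in `main.py`, and the wandb/weave logging is left out.

```python
from typing import Callable


def run_pipeline(documents: dict, labels: dict, retriever, form_filler, score_fn: Callable) -> dict:
    """Fill and score one form per document; return the scores keyed by document id."""
    per_document = {}
    for doc_id, text in documents.items():
        retriever.set_document(text)                  # point the retriever at this document
        filled_form = form_filler.forward(retriever)  # fill the form using retrieved context
        per_document[doc_id] = score_fn(labels[doc_id], filled_form)
    return per_document


# Minimal stand-ins so the sketch runs; they are not the real classes.
class StubRetriever:
    def set_document(self, text: str) -> None:
        self.text = text


class StubFormFiller:
    def forward(self, retriever: StubRetriever) -> dict:
        return {"Body_part": "Skin" if "skin" in retriever.text.lower() else "Leg"}


def exact_match(label: dict, filled_form: dict) -> dict:
    """Score 1.0 when the predicted value is among the accepted labels, else 0.0."""
    return {name: float(filled_form[name] in accepted) for name, accepted in label.items()}


documents = {"a": "Melanoma occurs in the skin.", "b": "A fracture of the femur."}
labels = {"a": {"Body_part": ["Skin"]}, "b": {"Body_part": ["Arm"]}}
results = run_pipeline(documents, labels, StubRetriever(), StubFormFiller(), exact_match)
print(results)
```

Passing the retriever, form filler, and scoring function as arguments mirrors the idea that `load_modules` selects these components from the command-line arguments.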