Install via uv:
uv sync
To run the extraction pipeline:
uv run python3 src/sac_finetuning/data_analysis/cli/run_data_analysis.py --configuration_path <config> --extract_version <extract_version>
The pipeline takes the papers as PDF files, converts them to Markdown, passes them to an LLM, and saves the extracted information as JSON. At the end, the train/val/test splits are saved as JSONL.
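The last step of the pipeline, saving train/val/test splits as JSONL, can be sketched as follows. This is only an illustration under stated assumptions: the helper names `split_records` and `write_jsonl`, the 80/10/10 ratios, and the record fields are hypothetical and are not taken from the repository.

```python
import json
import random
import tempfile
from pathlib import Path


def split_records(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle records deterministically and cut them into train/val/test lists.

    The fractions and the seed are illustrative assumptions, not the
    values used by the actual pipeline.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "val": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }


def write_jsonl(records, path):
    """Write one JSON object per line (the JSONL convention)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Hypothetical extracted records; the real schema may differ.
    extracted = [
        {"paper_id": i, "synthesis_procedure": f"procedure {i}"}
        for i in range(20)
    ]
    # A temporary directory keeps the demo from touching the repository.
    with tempfile.TemporaryDirectory() as out_dir:
        for name, subset in split_records(extracted).items():
            write_jsonl(subset, Path(out_dir) / f"{name}.jsonl")
            print(name, len(subset))
```

Fixing the seed keeps the split reproducible across runs, so that evaluation numbers stay comparable between extraction versions.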
The configuration file is of the following format:
converter:
  data_folder: ./src/sac_finetuning/data/SAC/ # folder containing the paper PDFs (one subfolder per journal)
  save_dir: ./src/sac_finetuning/data/converted
  save_on_processing: True
  verbose: False
extractor:
  model: <model_name>
  base_url: <base_url>
  save_dir: ./src/sac_finetuning/data/extracted
  max_concurrency: 180
  temperature: 0
  seed: 42
  timeout: 700
  chunk_size: 25000
  chunk_overlap: 3000
prompt:
free: |
You are an assistant that extracts detailed information from Single Atom Catalysts (SACs) academic papers. DO NOT rephrase, summarise or reformulate the text. DO NOT structure the procedures in steps. DO LEAVE the extracted information verbatim are they are presented in the paper.
Your task is to extract all the passages, verbatim, as presented in the paper, related to Single Atom Catalysts syntheses including the following information:
- State if the paper is experimental or computational.
- The catalysis type that catalysts are going to be used for.
- The atomic species used as Single Atom Catalysts. NOTE: sometimes some procedures are the same for multiple atoms, so you would find the Single Atom Catalyst denoted as M@SUPPORT_MATERIAL, where M is a placeholder and can be replaced by the atoms presented in the paper.
- The support materials on which the Single Atom Catalysts are synthesized. SACs are usually reported as follows: ATOMIC_SPECIES@SUPPORT_MATERIAL.
- Any precursor syntheses needed to proceed with the Single Atom Catalyst syntheses. Report the syntheses if and only if the procedures are explicitely given in the text.
- The Single Atom Catalysts syntheses as reported in the paper, without the characterisation.
Input Paper:
{paper}
chunks: |
You are an assistant that aggregates the chunks of extracted information into a single extracted information.
Chunks:
{extracted_chunks}
structured: |
You are an helpful assistant that receives unstructured information about Single Atom Catalyst extracted from academic paper, and structures them in the following format:
The syntheses procedure of the Single Atom Catalysts usually need the synthesis of some precursors. Please report the `synthesis_procedure` as:
'
Precursor synthesis: ...
SAC synthesis: ...
'
Whenever the synthesis procedure is stated to be similar to any other reported procedure, report the latter adapting accordingly.
{format_instructions}
Extracted Information:
{extracted_data}
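The `chunk_size` and `chunk_overlap` settings in the configuration above suggest that long papers are split into overlapping windows before being sent to the LLM with the `free` prompt, and that the per-chunk results are then merged with the `chunks` prompt. The sketch below shows one way such windowing could work. It is an assumption, not the repository's implementation: the function name `chunk_text` is hypothetical, and measuring sizes in characters (rather than tokens) is a guess.

```python
def chunk_text(text, chunk_size=25000, chunk_overlap=3000):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Measuring in
    characters is an assumption; the pipeline may count tokens instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    # Hypothetical stand-in for a converted Markdown paper.
    paper = "x" * 50000
    # Shortened template; only the {paper} placeholder comes from the config.
    template = "Input Paper:\n{paper}"
    prompts = [template.format(paper=chunk) for chunk in chunk_text(paper)]
    print(len(prompts))
```

The overlap means a synthesis procedure that straddles a chunk boundary still appears whole in at least one window, provided it is shorter than `chunk_overlap`.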
Similarly, the extraction can be evaluated on a small set of labeled data:
uv run python3 src/sac_finetuning/data_analysis/cli/run_extraction_evaluation.py --configuration_path <config> --extract_version <extract_version> --tune <True,False>
When tune is True, the evaluation runs on half of the labeled data, which can be used to tune the extraction.
The configuration for the extraction evaluation is the same as for the extraction pipeline, with the following section added:
evaluation:
embedding_model: facebook/bart-base
ground_truth: src/sac_finetuning/data/ground_truth/wiley_elsevier
eval_dir: src/sac_finetuning/data/eval
return_dataframe: False
tune_percentage: 0
standardise: False
info_to_add: [atom, support_material, synthesis_method, catalysis_type]
predicted_key: extracted
llm_as_judge_config:
llm_judging_path: src/sac_finetuning/data/eval/extracted
llm_name: <model_name>
base_url: <base_url>
prompt: |
You are an expert on Single Atom Catalysts. You will be provided with a ground truth synthesis procedure of a Single Atom Catalyst and the synthesis procedure extracted by a language model from a paper. Your job is to evaluate how close the synthesis procedures are. Do consider only the synthesis step and ignore non-synthesis details. Some predicted procedures can have different atom/support_material with respect to the ground truth. If they are the only thing that changes, the procedure is correct and it does not have to take in consideration as an error. The info such as which atom to use and which support material, together with other usual information about the kind of synthesis we desire, are detailed below.
Ignore spelling errors and different symbols, such as degrees Celsius and the multiplication dot etc..., that do not affect the substance of the procedure.
Synthesis Info
{synthesis_info}
Ground Truth:
{ground_truth}
Extracted:
{extracted}
{format_instructions}
Fine-tune models via torchrun:
# TODO: before training need to adjust paths in src/sac_finetuning/configs/modeling/example.yaml
torchrun --nproc_per_node {NUMBER_OF_GPUS} src/sac_finetuning/modeling/cli/lora_finetuning.py --pipeline_configuration_path src/sac_finetuning/configs/modeling/example.yaml --deepspeed src/sac_finetuning/configs/modeling/ds_config.json
All fine-tuned models are available on the HF Hub here.
All LoRA adapters presented in the paper starting from ibm-granite/granite-3.2-8b-instruct are shipped alongside the repository and are available here.
Run the application (only supports LoRA adapters):
# NOTE: adapter used can be adjusted in the script
uv run python3 src/sac_finetuning/app/client.py