# Synthesis procedure extraction with LLMs from a parsed PDF

We will use [DSPy](https://github.com/stanfordNLP/dspy) for prompting LLMs. DSPy is a framework for creating "prompting architectures/modules" and "signatures" for steering the behavior of LLMs. Our end goal is to find an optimal configuration of LLM *types*, modules, signatures, prompts, and evaluation metrics. The purpose of this notebook is to show the basic functionality of DSPy in the context of this project.

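The workflow is:

1. Configure DSPy with an LLM.
2. Load the markdown text extracted from a PDF.
3. Build a simple DSPy program.
4. Extract the synthesis paragraph from the text.
5. Extract a structured synthesis procedure from that paragraph.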

In [None]:
# Configure DSPy with an LLM
from llm_synthesis.utils import configure_dspy

configure_dspy(
    lm="gpt-4o-mini",
    model_kwargs={"temperature": 0.0},
)

## Load extracted markdown data

The markdown text loaded here is the output of the PDF extraction notebook, cf. `notebooks/pdf_extraction.ipynb`.

In [None]:
# Option 1: read from a local file
txt_file = "<path_to_save_extracted_text>"

# Option 2: read from Google Cloud Storage
# (running this line overrides the local path set above)
txt_file = "gs://<bucket-name>/<path_to_save_extracted_text>"

In [None]:
txt_file = (
    "/Users/mlederbau/llm-synthesis/data/txt_papers/mistral/Zhou_2023_constructing.txt"
)

In [None]:
# Load and display the data
from IPython.display import Markdown
from llm_synthesis.services.storage.local_file_storage import LocalFileStorage

storage = LocalFileStorage()
markdown_text = storage.read_bytes(file_path=txt_file).decode("utf-8")

Markdown(markdown_text)
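
The loading step above boils down to reading the file as bytes and decoding it as UTF-8. As a minimal sketch of that idea, assuming nothing about how `LocalFileStorage` is implemented internally, the same result can be obtained with the standard library. The helper name `read_markdown` is hypothetical and not part of `llm_synthesis`.

```python
from pathlib import Path


def read_markdown(file_path: str) -> str:
    """Read a local text file as bytes and decode it as UTF-8.

    Hypothetical helper for illustration; it only handles local paths,
    not `gs://` URIs.
    """
    if file_path.startswith("gs://"):
        raise ValueError("This sketch only supports local file paths.")
    return Path(file_path).read_bytes().decode("utf-8")
```

Decoding explicitly as UTF-8 keeps non-ASCII characters such as units and chemical symbols intact regardless of the platform's default encoding.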

## Building simple programs
A "program" in DSPy combines a module (in this case: `dspy.Predict`) with a signature (in this case: `question->answer`). The signature names the inputs and outputs, and the module determines how the LLM is prompted to produce them. Calling the resulting module `predictor` with `question=<insert-question-here>` returns a `Prediction` object that contains `answer` as a key.

In [None]:
import dspy

predictor = dspy.Predict("question->answer")

predictor(question="who are you?")
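
To make the shorthand concrete: in the string `question->answer`, the input fields are named on the left of the arrow and the output fields on the right. The sketch below is not DSPy's own parser. `split_signature` is a hypothetical helper that only illustrates how such a string maps to field names, assuming plain comma-separated names without type annotations.

```python
def split_signature(signature: str) -> tuple[list[str], list[str]]:
    """Split 'in1, in2 -> out1' into input and output field names."""
    if signature.count("->") != 1:
        raise ValueError(f"Expected exactly one '->' in {signature!r}")
    inputs, outputs = signature.split("->")

    def names(part: str) -> list[str]:
        # Drop surrounding whitespace and ignore empty entries.
        return [name.strip() for name in part.split(",") if name.strip()]

    return names(inputs), names(outputs)


print(split_signature("question->answer"))
print(split_signature("context, question -> answer"))
```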

In [None]:
### Extract synthesis paragraph
from llm_synthesis.extraction.text import ParagraphParser

paragraph_parser = ParagraphParser()
prediction = paragraph_parser(publication_text=markdown_text)
synthesis_paragraph = prediction["synthesis_paragraphs"]
synthesis_paragraph

In [None]:
### Extract structured synthesis procedure
from llm_synthesis.extraction.synthesis import StructuredSynthesisParser

structure_parser = StructuredSynthesisParser()
prediction = structure_parser(synthesis_procedure=synthesis_paragraph)
structured_synthesis_procedure = prediction["structured_synthesis_procedure"]
structured_synthesis_procedure
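
The two cells above form a two-stage pipeline: the paragraph parser narrows the full text down to the synthesis paragraphs, and the structure parser turns that text into a structured record. As a sketch of the data flow only, the stand-in functions below (`fake_paragraph_parser`, `fake_structure_parser`) and the `run_pipeline` helper are hypothetical and make no LLM calls. Only the keyword arguments and result keys are taken from the cells above.

```python
from typing import Any, Callable

Parser = Callable[..., dict[str, Any]]


def run_pipeline(text: str, paragraph_parser: Parser, structure_parser: Parser) -> Any:
    """Chain the two extraction stages the same way the notebook cells do."""
    paragraphs = paragraph_parser(publication_text=text)["synthesis_paragraphs"]
    result = structure_parser(synthesis_procedure=paragraphs)
    return result["structured_synthesis_procedure"]


# Hypothetical stand-ins so the sketch runs without an LLM.
def fake_paragraph_parser(publication_text: str) -> dict[str, Any]:
    # Keep only lines mentioning "synthesi" as a crude placeholder heuristic.
    kept = [
        line for line in publication_text.splitlines() if "synthesi" in line.lower()
    ]
    return {"synthesis_paragraphs": "\n".join(kept)}


def fake_structure_parser(synthesis_procedure: str) -> dict[str, Any]:
    # Wrap the text unchanged; a real parser would return a structured object.
    return {"structured_synthesis_procedure": {"raw_text": synthesis_procedure}}


demo_text = "Introduction\nThe sample was synthesized at 500 C.\nResults"
print(run_pipeline(demo_text, fake_paragraph_parser, fake_structure_parser))
```

Passing the parsers in as arguments makes it easy to swap in different modules or signatures when comparing configurations.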

In [None]:
print(f"Entry ID: {structured_synthesis_procedure.id}")
print("\n")
print(f"Compound Name: {structured_synthesis_procedure.target_compound}")
print("\n")
print(f"Materials: {structured_synthesis_procedure.materials}")
print("\n")
print(f"Steps: {structured_synthesis_procedure.steps}")
print("\n")
print("Conditions:")
for process_step in structured_synthesis_procedure.steps:
    print(f"  {process_step.action}:")
    print(f"    Materials: {process_step.materials}")
    print(f"    Conditions: {process_step.conditions}")
    print("\n")
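
The printing cell above relies on the attributes `id`, `target_compound`, `materials`, and `steps`, with each step exposing `action`, `materials`, and `conditions`. The sketch below mirrors just those attributes with plain dataclasses to show how the same report could be built as a single string. The class names, field types, and example values are assumptions for illustration and not the project's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str
    materials: list[str] = field(default_factory=list)
    conditions: dict[str, str] = field(default_factory=dict)


@dataclass
class Procedure:
    id: str
    target_compound: str
    materials: list[str]
    steps: list[Step]


def format_procedure(proc: Procedure) -> str:
    """Render the same fields the notebook prints, as one string."""
    lines = [
        f"Entry ID: {proc.id}",
        f"Compound Name: {proc.target_compound}",
        f"Materials: {', '.join(proc.materials)}",
        "Steps:",
    ]
    for step in proc.steps:
        lines.append(f"  {step.action}:")
        lines.append(f"    Materials: {', '.join(step.materials)}")
        lines.append(f"    Conditions: {step.conditions}")
    return "\n".join(lines)


# Made-up example values, not taken from any paper.
example = Procedure(
    id="demo-1",
    target_compound="ExampleOxide",
    materials=["precursor A", "precursor B"],
    steps=[Step("mix", ["precursor A", "precursor B"], {"duration": "1 h"})],
)
print(format_procedure(example))
```

Returning a string instead of printing directly makes the report easy to log or compare across runs.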