# Exploration of {Guidance}:
* Llama-index guidance pydantic program
* Microsoft guidance

---
CONCLUSIONS:
* Llama-index guidance pydantic program discarded because the LLM did not output the correct JSON schema, even with Mixtral. # -> OutputParserException: Failed to parse pydantic object from guidance program. Probably the LLM failed to produce data with right json schema
* Native guidance with mistral_Q6 gives unstable output, especially with lists, probably due to quantization of the model.
* Extracting data to JSON works well for `select` from a list of choices and for generation.
* Content of the generated output from longer text or a scientific article varies. Output can be improved with prompting: prompt engineering, chain-of-thought, few-shot examples.
---

---
## Llama-index guidance pydantic program
https://docs.llamaindex.ai/en/stable/examples/output_parsing/guidance_pydantic_program/?h=guidance

In [1]:
from pydantic import BaseModel
from typing import List
from llama_index.program.guidance import GuidancePydanticProgram

In [None]:
import guidance
#model = guidance.llms.Transformers()
mixtral = guidance.models.LlamaCppChat("/home/dorota/models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf") # https://guidance.readthedocs.io/en/latest/api.html
# mistral_Q4 = guidance.models.LlamaCppChat("/home/dorota/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf") # https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf?download=true
# mistral_Q6 = guidance.models.LlamaCppChat("/home/dorota/models/mistral-7b-instruct-v0.2.Q6_K.gguf") # https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q6_K.gguf?download=true

In [None]:
class Book(BaseModel):
    author: str
    title: str
    pub_year: int


class BookShelf(BaseModel):
    genre: str
    books: List[Book]

In [None]:
program = GuidancePydanticProgram(
    output_cls=BookShelf,
    prompt_template_str=(
        "Generate a bookshelf with 3 books within a genre with an author, a title and a publication year. Using"
        " genre '{{query_str}}' as inspiration. Parse output as json"
    ),
    guidance_llm=mixtral,
    verbose=True,

)

In [None]:
output = program(query_str="Science Fiction", tools_str='')

# -> OutputParserException: Failed to parse pydantic object from guidance program. Probably the LLM failed to produce data with right json schema

In [None]:
output
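
When the program raises `OutputParserException`, it can help to look at the raw model text outside llama-index and see which part of the schema is off. The following is a minimal, stdlib-only sketch, not part of llama-index or guidance: `check_bookshelf` is a hypothetical helper, the key names (`genre`, `books`, `author`, `title`, `pub_year`) are assumed from the prompt, and `raw` is a made-up string rather than real model output.

```python
import json

# Hypothetical expected keys and types per book, assumed from the prompt.
BOOK_FIELDS = {"author": str, "title": str, "pub_year": int}


def check_bookshelf(raw: str) -> list[str]:
    """Return human-readable problems found in `raw` (empty list if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} at position {exc.pos}"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]

    problems = []
    if not isinstance(data.get("genre"), str):
        problems.append("'genre' is missing or not a string")
    books = data.get("books")
    if not isinstance(books, list):
        return problems + ["'books' is missing or not a list"]
    for i, book in enumerate(books):
        if not isinstance(book, dict):
            problems.append(f"books[{i}] is not an object")
            continue
        for field, expected in BOOK_FIELDS.items():
            if not isinstance(book.get(field), expected):
                problems.append(
                    f"books[{i}].{field} is missing or not {expected.__name__}"
                )
    return problems


# Made-up example: the year is a string, which an int field would reject.
raw = (
    '{"genre": "Science Fiction", "books": '
    '[{"author": "Frank Herbert", "title": "Dune", "pub_year": "1965"}]}'
)
print(check_bookshelf(raw))
```

A check like this only narrows down where the output deviates; it does not make the model follow the schema.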

---
## {Guidance}

https://github.com/guidance-ai/guidance/blob/main/notebooks/tutorials/intro_to_guidance.ipynb

In [62]:
import guidance
from guidance import models, gen

# mistral = guidance.models.LlamaCpp("/home/dorota/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=4096)
mistral = guidance.models.LlamaCpp("/home/dorota/models/mistral-7b-instruct-v0.2.Q6_K.gguf", n_gpu_layers=-1)

In [15]:
lm = mistral + "Who won the last Kentucky derby and by how much?"
lm + gen(max_tokens=20)

In [2]:
mistral + '''\
Q: Who won the last Kentucky derby and by how much?
A:''' + gen(stop="Q:")

# -> expected answer most of the time

In [7]:
# using f strings and naming elements of the output

query = "Who won the last Kentucky derby and by how much?"
lm = mistral + f'''\
Q: {query}
A: {gen(name='answer', stop="Q:")}'''

# -> expected answer

lm['answer']

"The last Kentucky Derby was held on May 1, 2021, and the winner was Medina Spirit, trained by Bob Baffert and ridden by John Velazquez. Medina Spirit won by 0.5 lengths over Mandaloun. However, it's important to note that Medina Spirit's victory is under investigation due to a positive drug test. The results of that investigation are still pending."

In [6]:
# function encapsulation

@guidance
def qa_bot(lm, query):
    lm = lm + f'''\
    Q: {query}
    A: {gen(name="answer", stop="Q:")}'''
    return lm

query = "Who won the last Kentucky derby and by how much?"
mistral + qa_bot(query) # note we don't pass the `lm` arg here (that will get passed during execution when it gets added to the model)

In [5]:
# selecting among alternatives

from guidance import select

query = "Who won the last Kentucky derby and by how much?"

mistral + f'''\
Q: {query}
Now I will choose to either SEARCH the web or RESPOND.
Choice: {select(["SEARCH", "RESPOND"], name="choice")}''' 

In [13]:
@guidance
def qa_bot(lm, query):
    lm += f'''\
    Q: {query}
    Now I will choose to either SEARCH the web or RESPOND.
    Choice: {select(["SEARCH", "RESPOND"], name="choice")}
    '''
    if lm["choice"] == "SEARCH":
        lm += "A: I don't know, Google it!"
    else:
        lm += f'A: {gen(stop="Q:", name="answer")}'
    return lm

query = "Who won the last Kentucky derby and by how much?"
mistral + qa_bot(query)

In [24]:
# generating lists

query = "Who won the last Kentucky derby and by how much?"

lm = mistral + f'''\
Q: {query}
Now I will choose to either SEARCH the web or RESPOND.
Choice: {select(["SEARCH", "RESPOND"], name="choice")}
'''
if lm["choice"] == "SEARCH":
    lm += "Here are 3 search queries:\n"
    for i in range(3):
        lm += f'''{i+1}. "{gen(stop='"', name="queries", temperature=1.0, list_append=True)}"\n'''

In [63]:
# https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration

from guidance import substring

# define a set of possible statements
text = 'guidance is awesome. guidance is so great. guidance is the best thing since sliced bread.'

# force the model to make an exact quote
mistral + f'Here is a true statement about the guidance library: "{substring(text)}"'

# -> no substring is selected

---
## Guidance -> json
https://github.com/guidance-ai/guidance/blob/main/notebooks/guaranteeing_valid_syntax.ipynb \
https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration

In [12]:
import guidance
from guidance import models, gen, select

lm = guidance.models.LlamaCpp("/home/dorota/models/mistral-7b-instruct-v0.2.Q6_K.gguf", n_gpu_layers=-1)

In [19]:
# we can pre-define valid option sets
sample_weapons = ["sword", "axe", "mace", "spear", "bow", "crossbow"]
sample_armor = ["leather", "chainmail", "plate"]

# define a re-usable "guidance function" that we can use below
@guidance
def quoted_list(lm, name, n):
    for i in range(n):
        if i > 0:
            lm += ", "
        lm += '"' + gen(name, list_append=True, stop='"') + '"'
    return lm

@guidance
def generate_character(
    lm,
    character_one_liner,
    weapons: list[str] = sample_weapons,
    armour: list[str] = sample_armor,
    n_items: int = 3
):
    lm += f'''\
    {{
        "description" : "{character_one_liner}",
        "name" : "{gen("character_name", stop='"')}",
        "age" : {gen("age", regex="[0-9]+")},
        "armour" : "{select(armour, name="armor")}",
        "weapon" : "{select(weapons, name="weapon")}",
        "class" : "{gen("character_class", stop='"')}",
        "mantra" : "{gen("mantra", stop='"')}",
        "strength" : {gen("strength", regex="[0-9]+")},
        "quest_items" : [{quoted_list("quest_items", n_items)}]
    }}'''
    return lm


generation = lm + generate_character("A quick and nimble fighter")

In [53]:
import json

gen_json = json.loads(str(generation))
print(f"Loaded json:\n{json.dumps(gen_json, indent=4)}")

Loaded json:
{
    "description": "A quick and nimble fighter",
    "name": "Rogue",
    "age": 25,
    "armour": "leather",
    "weapon": "sword",
    "class": "../../../classes/Rogue.js",
    "mantra": ",",
    "strength": 10,
    "quest_items": [
        "key",
        "",
        "map"
    ]
}


In [54]:
type(generation), type(gen_json)

(guidance.models.llama_cpp._llama_cpp.LlamaCpp, dict)

In [46]:
generation['weapon']

'sword'
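
The loaded JSON above is syntactically valid, yet several values are of little use (a file path as `class`, a lone comma as `mantra`, an empty quest item). A rough way to flag such fields after parsing is sketched below. `find_suspect_fields` is a hypothetical helper and its rules are ad hoc heuristics chosen for this one output, not a general quality check.

```python
def find_suspect_fields(record: dict) -> dict:
    """Map each suspicious key to a short reason (heuristics only)."""
    suspects = {}
    for key, value in record.items():
        # Treat scalars and lists uniformly.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if not isinstance(item, str):
                continue
            if not any(ch.isalnum() for ch in item):
                suspects[key] = "empty or punctuation-only string"
            elif "../" in item or item.endswith((".js", ".txt", ".py")):
                suspects[key] = "looks like a file path"
    return suspects


# The parsed output shown above, copied as a literal.
gen_json = {
    "description": "A quick and nimble fighter",
    "name": "Rogue",
    "age": 25,
    "armour": "leather",
    "weapon": "sword",
    "class": "../../../classes/Rogue.js",
    "mantra": ",",
    "strength": 10,
    "quest_items": ["key", "", "map"],
}
print(find_suspect_fields(gen_json))
```

Fields constrained with `select` or `regex` pass untouched here, while the free-form `gen` fields are the ones that get flagged.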

---
## POC extract data -> json

In [5]:
import guidance
from guidance import models, gen, select
import json

# default n_ctx=512 (max is 16384); add echo=False to hide the displayed generation output
lm = guidance.models.LlamaCpp("/home/dorota/models/mistral-7b-instruct-v0.2.Q6_K.gguf", n_gpu_layers=-1, n_ctx=0)

### Structured input

In [3]:
recepie = """Ingredients:
• 1 cup all-purpose flozfgvzdfgur
• 3 tablespoons granulated sugar
• 1 teaspoon baking powder
• 1/2 teaspoon baking soda
• 1/2 teaspoon salt
• 1 cup milk
• 1 egg
• 3 tablespoons unsalted butter, melted
• Toppings of your choice (fresh fruit, whipped cream, syrup, etc.)"""

In [6]:
@guidance
def extract_alergens(lm,recepie):
    lm += f'''\
    Indicate if the recipe contains allergens. The recipe is delimited by (start) and (stop).
    Output 'yes' if the allergen can be found or 'no' if the allergen cannot be found. Generated output should be a string.
    --------------------------------------(start)
    {recepie}
    (stop) -------------------------------------- 
    {{
        "eggs" : "{gen("eggs", stop='"')}",
        "milk" : "{gen("milk", stop='"')}",
        "flour" : "{gen("flour", stop='"')}",
        "soy" : "{gen("soy", stop='"')}"
    }}'''
    return lm


generation = lm + extract_alergens(recepie)

# 1 cup all-purpose flozfgvzdfgur is still interpreted as flour
# 1 cup of cream is still interpreted as milk but 1 cookie is not

# gen() results in unstable output
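
When free-form `gen()` is kept for yes/no fields, the captured strings can be normalised after the fact. This is a small sketch under the assumption that deviations are limited to casing, whitespace, and stray punctuation; `normalize_yes_no` is a hypothetical helper, and anything it cannot map is returned as `None` instead of being guessed.

```python
from typing import Optional


def normalize_yes_no(raw: str) -> Optional[str]:
    """Map a free-form answer to 'yes' or 'no'; return None if unclear."""
    cleaned = raw.strip().strip("'\".,;:!").lower()
    if cleaned in {"yes", "y", "true"}:
        return "yes"
    if cleaned in {"no", "n", "false"}:
        return "no"
    return None


for answer in [" Yes.", "'no'", "maybe", "yes, it contains eggs"]:
    print(repr(answer), "->", normalize_yes_no(answer))
```

Constraining the field with `select` avoids the need for this step altogether, since the model can only emit one of the listed options.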

In [7]:
@guidance
def extract_alergens(lm,recepie):
    lm += f'''\
    Indicate if the recipe contains the ingredients. The recipe is delimited by (start) and (stop).
    Output 'yes' if the ingredient can be found in the recipe or 'no' if the ingredient cannot be found in the recipe.
    This is example output:
     ```json
    {{
        "egg" : "yes",
        "milk" : "yes",
        "sugar" : "yes",
        "sand" : "no"
    }}```
    --------------------------------------(start)
    {recepie}
    (stop) -------------------------------------- 
    ```json
    {{
        "egg" : "{gen("eggs", stop='"')}",
        "milk" : "{gen("milk", stop='"')}",
        "flour" : "{gen("flour", stop='"')}",
        "soy" : "{gen("soy", stop='"')}"
    }}```'''
    return lm


generation = lm + extract_alergens(recepie)

# unpredictable output. few-shot lowered quality

In [5]:
@guidance
def extract_alergens(lm, recepie):
    lm += f'''\
    Indicate if the recipe contains allergens. The recipe is delimited by (start) and (stop).
    --------------------------------------(start)
    {recepie}
    (stop)--------------------------------------
    {{
        "egg" : "{select(options=['yes', 'no'], name='egg')}",
        "milk" : "{select(options=['yes', 'no'], name='milk')}",
        "flour" : "{select(options=['yes', 'no'], name='flour')}",
        "gluten" : "{select(options=['yes', 'no'], name='gluten')}",
        "soy" : "{select(options=['yes', 'no'], name='soy')}",
        "egg_amount" : {gen("egg_amount", regex="[0-9]+")},
        "soy_amount" : {gen("soy_amount", regex="[0-9]+")}
    }}'''
    return lm

generation = lm + extract_alergens(recepie)
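
Hand-written JSON templates are easy to get subtly wrong: a missing comma between fields or a trailing comma before the closing brace makes the result unparseable even when every generated value is fine. A quick check is to run the final text, or the template filled with placeholder values, through `json.loads`. The sketch below uses made-up strings, and `json_error` is a hypothetical helper name.

```python
import json
from typing import Optional


def json_error(text: str) -> Optional[str]:
    """Return None if `text` parses as JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"{exc.msg} (line {exc.lineno}, column {exc.colno})"
    return None


# Made-up examples of the two common template mistakes.
missing_comma = '{\n  "egg_amount" : 1\n  "soy_amount" : 0\n}'
trailing_comma = '{\n  "soy" : "no",\n}'
well_formed = '{\n  "egg_amount" : 1,\n  "soy_amount" : 0\n}'

for label, text in [
    ("missing comma", missing_comma),
    ("trailing comma", trailing_comma),
    ("well formed", well_formed),
]:
    print(label, "->", json_error(text))
```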

In [9]:
@guidance
def extract_alergens(lm, recepie):
    lm += f'''\
    Indicate if the recipe contains allergens. An allergen can, for example, be egg or milk. The recipe is delimited by (start) and (stop). Only select 'yes' if the allergen is explicitly mentioned in the recipe.
    --------------------------------------(start)
    {recepie}
    (stop)--------------------------------------
    generated_output:
    {{
        "egg" : "{select(options=['yes', 'no'], name='egg')}",
        "milk" : "{select(options=['yes', 'no'], name='milk')}",
        "flour" : "{select(options=['yes', 'no'], name='flour')}",
        "gluten" : "{select(options=['yes', 'no'], name='gluten')}",
        "soy" : "{select(options=['yes', 'no'], name='soy')}",
        "egg_amount" : {gen("egg_amount", regex="[0-9]+")},
        "soy_amount" : {gen("soy_amount", regex="[0-9]+")}
    }}
    
    Check if generated output is correct.
    corrected_generated_output:
    {{
        "egg" : "{select(options=['yes', 'no'], name='egg')}",
        "milk" : "{select(options=['yes', 'no'], name='milk')}",
        "flour" : "{select(options=['yes', 'no'], name='flour')}",
        "gluten" : "{select(options=['yes', 'no'], name='gluten')}",
        "soy" : "{select(options=['yes', 'no'], name='soy')}",
        "egg_amount" : {gen("egg_amount", regex="[0-9]+")},
        "soy_amount" : {gen("soy_amount", regex="[0-9]+")}
    }}'''
    return lm

generation = lm + extract_alergens(recepie)

# incorrect output with more prompting, but corrected in step 2

### Text input

In [26]:
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York."""
# A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
# Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
# In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
# Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
# 2010 marriage license application, according to court documents.
# Prosecutors said the marriages were part of an immigration scam.
# On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
# After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
# Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
# All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
# Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
# Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
# Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
# Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
# If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
# """

@guidance
def summarize(lm, text):
    lm += f'''\
    Generate a title and summary for the text. Text is delimited by (start) and (stop). Generated title and summary must be strings.
    ---------------------------------------------------------(start)
    {text}
    (stop)----------------------------------------------------------

    structured_output:
    {{
        "text_title" : "{gen(name="title", stop='"')}",
        "text_summary" : "{gen(name="summary", stop='"')}"
    }}'''
    return lm

generation = lm + summarize(ARTICLE)

#TODO: restrict output value to only the json output


In [18]:
# write structured output part of response to json string and further to dict
import json

gen_json_string = str(generation).split('structured_output:\n')[-1]
gen_dict = json.loads(gen_json_string)
print(json.dumps(gen_dict, indent=4))

{
    "text_title": "Young Woman Gets Married in Westchester County, New York",
    "text_summary": ": At the age of 23, Liana Barrientos tied the knot in Westchester County, New York."
}
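
Splitting on `'structured_output:\n'` works as long as that marker appears exactly once and the JSON is the last thing in the text. As one possible approach to the TODO above, the sketch below scans the whole string for parseable JSON objects with `json.JSONDecoder.raw_decode` and keeps the last one. `extract_json_objects` is a hypothetical helper and the `text` value is a made-up stand-in for `str(generation)`.

```python
import json


def extract_json_objects(text: str) -> list:
    """Return every top-level JSON object that can be parsed out of `text`."""
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not JSON here, try the next brace
            continue
        objects.append(obj)
        pos = end  # skip past the object just parsed
    return objects


# Made-up stand-in for str(generation): prompt text followed by the JSON part.
text = (
    "Generate a title and summary. Braces like {this} are skipped.\n"
    "structured_output:\n"
    "    {\n"
    '        "text_title" : "T",\n'
    '        "text_summary" : "S"\n'
    "    }"
)
gen_dict = extract_json_objects(text)[-1]
print(json.dumps(gen_dict, indent=4))
```

Taking the last object also copes with prompts that contain an example JSON block before the generated one.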


### pdf article as input

In [8]:
from pypdf import PdfReader 
  
reader = PdfReader('/home/dorota/LLM-diploma-project/00_concept_tests/data/40001_2023_Article_1364.pdf') 
num_pages = len(reader.pages)
TEXT = ""
for page_num in range(1): #change to range(num_pages) for whole document
    page = reader.pages[page_num]  
    TEXT += page.extract_text()
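
The loop above reads only the first page. If more pages are read, the prompt can outgrow the context window set through `n_ctx`. A crude guard is to cut the text to a character budget before building the prompt. The sketch assumes roughly 4 characters per token, which is a rule of thumb and not a measured property of the Mistral tokenizer; `truncate_to_budget` and its parameters are hypothetical.

```python
def truncate_to_budget(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Cut `text` to about `max_tokens` tokens, breaking at a space if possible."""
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    cut = text.rfind(" ", 0, max_chars + 1)
    if cut <= 0:
        cut = max_chars  # no space found, fall back to a hard cut
    return text[:cut].rstrip()


sample = "alpha beta gamma delta"
print(truncate_to_budget(sample, max_tokens=3))
```

The budget should leave room for the instructions and for the generated fields, so `max_tokens` needs to be well below the configured context size.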

In [20]:
@guidance
def get_metadata(lm, text):
    lm += f'''\
    Article is delimited by (start) and (stop). Extract title, authors and publication year. Generate 5 keywords based on Abstract, do not use keywords in article. Summarize Abstract. Generate research area.
    Here is an example of output:
    {{
        "title" : "Breast cancer: A review of risk factors and diagnosis",
        "authors" : "Anna Carlsson, Gertrude Xiang, Emmanuelle Obegau",
        "publication_year": 2000,
        "key_words" : "breast cancer, diagnosis, mortality, prevention, risk factors",
        "summary" : "This review has underscored the multifaceted nature of breast cancer, highlighting various risk factors and diagnostic approaches crucial in understanding and managing this disease.",
        "research_area" : "breast cancer"
    }}
    ---------------------------------------------------------(start)
    {text}
    (stop)----------------------------------------------------------

    structured_output:
    {{
        "title" : "{gen(name="title", stop='"')}",
        "authors" : "{gen(name="authors", stop='"')}",
        "publication_year": {gen("publication_year", regex="[0-9]+")},
        "key_words" : "{gen(name="key_words", stop='"')}",
        "summary" : "{gen(name="summary", stop='"')}",
        "research_area" : "{gen(name="research_area", stop='"')}"
    }}'''
    return lm

generation = lm + get_metadata(TEXT)

In [9]:
@guidance
def get_metadata(lm, text):
    lm += f'''\
    Article is delimited by (start) and (stop). Extract title, authors and publication year. Generate 5 keywords based on Abstract, do not use keywords in article. Summarize Abstract. Generate research area.

    (start)
    {text}
    (stop)

    JSON output:
    {{
        "title" : "{gen(name="title", stop='"')}",
        "authors" : "{gen(name="authors", stop='"')}",
        "publication_year": {gen("publication_year", regex="[0-9]+")},
        "key_words" : "{gen(name="key_words", stop='"')}",
        "summary" : "{gen(name="summary", stop='"')}",
        "research_area" : "{gen(name="research_area", stop='"')}",
        "quality_score" : "{select(options=['GOOD', 'BAD', 'EXCELLENT'], name="quality")}",
        "quality_reason" : "{gen(name="quality_reason", stop='"')}"
    }}'''
    return lm

generation = lm + get_metadata(TEXT)

print(f'''\
      {{
        "title" : "{generation["title"]}",
        "authors" : "{generation["authors"]}",
        "publication_year": "{generation["publication_year"]}",
        "key_words" : "{generation["key_words"]}",
        "summary" : "{generation["summary"]}",
        "research_area" : "{generation["research_area"]}",
        "quality_score" : "{generation["quality"]}",
        "quality_reason" : "{generation["quality_reason"]}"
      }}
''')


      {
        "title" : "--Visualization of breast cancer -related protein synthesis from the perspective of bibliometric analysis--",
        "authors" : "../../authors/Xu_et_al.txt",
        "publication_year": "2023",
        "key_words" : "../../keywords/Breast cancer, Bibliometric analysis, Protein synthesis, Introduction, Complex disease, Risk factors, Development and progression, Overexpression, Factors, Genetic mutations, Hormonal imbalances, Lifestyle choices, Age, Gender, Inherited risk, Reproductive history, Exposure to chemicals and radiation, Key factors, Diagnosis and treatment, Sustained research, Cutting-edge therapies, Biology of breast cancer.",
        "summary" : ";Breast cancer is the most common cancer in women worldwide and the number of patients increased year by year. It is a complex disease that can be caused by a variety of factors, including genetic mutations, hormonal imbalances, and lifestyle choices. One of the key factors in the development and progres