# LMQL

CONTENT: Investigation of LMQL with mistral-7b-instruct-Q6 for extracting metadata from scientific articles, inspired by https://lmql.ai/blog/ and https://www.timlrx.com/blog/generating-structured-output-from-llms#lmql

RESULTS AND COMMENTS:
* set max_len=32000 in @lmql.query(model=llm, verbose=False, max_len=32000) or lmql.run(..., model=llm, max_len=32000) to be able to run whole documents/articles and longer queries
* LMQL does not support Pydantic, but @dataclass can be used
* built-in constraints are limited in function (creating custom constraints is out of scope for the current project)
* constraints on each field have to be generated by programmatically creating a query string from a Pydantic class or a tuple
* a type constraint for the whole output can be set as a type-specific dataclass; each field is then constrained only by prompting the LLM
* LMQL does not seem to have been maintained since Nov 2023
* output is sensitive to prompting and constraints -> more prompting tests should be done for fully satisfactory output
* reproducible results
* relatively slow (25 s for a 1-page scientific article, 2 min 30 s for a full 21-page PDF article)
* adding constraints lowers runtime

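The runtimes quoted above are wall-clock times per cell. A minimal sketch of how such timings could be collected consistently across runs is shown here; the helper name `timed` and the `timings` dict are hypothetical and not part of the original notebook, and `time.sleep` merely stands in for the awaited LMQL call.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Store the wall-clock duration of the enclosed block in timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

timings = {}
with timed("first_page", timings):
    # Stand-in for e.g.: result = await get_metadata(TEXT, field_prompting)
    time.sleep(0.01)

print({label: f"{seconds:.2f} s" for label, seconds in timings.items()})
```

In a notebook cell, the `await` call can be placed directly inside the `with` block.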

https://lmql.ai/docs/latest/lib/integrations/llama_index.html - outdated example with regard to query/query_engine \
https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp/


NOTE: to run the notebook:
1. Remove 'local:' from llm = lmql.model("local:llama.cpp:/home/dorota/models/mistral-7b-instruct-v0.2.Q6_K.gguf", tokenizer="mistralai/Mistral-7B-Instruct-v0.2")
2. Start a service in a terminal with: lmql serve-model llama.cpp:/home/dorota/models/mistral-7b-instruct-v0.2.Q6_K.gguf --verbose True --n_gpu_layers 20 --n_ctx 0


In [1]:
import lmql
from llama_index.core import GPTVectorStoreIndex, VectorStoreIndex, SimpleDirectoryReader, ServiceContext, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from transformers import AutoTokenizer



In [2]:
# llama.cpp endpoint: https://lmql.ai/docs/models/llama.cpp.html#running-without-a-model-server
# tokenizer.model from https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/tree/main

llm = lmql.model("llama.cpp:/home/dorota/models/mistral-7b-instruct-v0.2.Q6_K.gguf", tokenizer="mistralai/Mistral-7B-Instruct-v0.2", n_gpu_layers=10, n_ctx=0, verbose=False) 

## @dataclass and @lmql.query() syntax
The only LMQL constraint here is the output type (a @dataclass). Individual fields carry no LMQL constraints; they are steered only by instructions to the LLM in the prompt string.

### Tutorial example without provided context to LLM

In [3]:
import lmql
from dataclasses import dataclass

@dataclass
class Ingredient:
    name: str
    weight_in_grams: int

@dataclass
class Recipe:
    recipe_name: str
    servings: int
    ingredient1: Ingredient
    ingredient2: Ingredient
    ingredient3: Ingredient
    ingredient4: Ingredient
    ingredient5: Ingredient
    ingredient6: Ingredient
    ingredient7: Ingredient
    ingredient8: Ingredient
    # list not supported...

@lmql.query(model=llm)
async def spaghetti():
    '''lmql
    "Spaghetti bolognese recipe for a family of 4."
    "[RECIPE_DATA]\\n" where type(RECIPE_DATA) is Recipe
    return RECIPE_DATA
    '''

result = await spaghetti()

#---------------------------------------------------------------------------
# {'recipe_name': 'Spaghetti Bolognese',
#  'servings': 4,
#  'ingredient1': Ingredient(name='Spaghetti', weight_in_grams=450),
#  'ingredient2': Ingredient(name='Olive Oil', weight_in_grams=2),
#  'ingredient3': Ingredient(name='Onion', weight_in_grams=150),
#  'ingredient4': Ingredient(name='Garlic', weight_in_grams=2),
#  'ingredient5': Ingredient(name='Carrots', weight_in_grams=100),
#  'ingredient6': Ingredient(name='Celery', weight_in_grams=50),
#  'ingredient7': Ingredient(name='Ground Beef', weight_in_grams=400),
#  'ingredient8': Ingredient(name='Tomato Sauce', weight_in_grams=800)}

In [8]:
result.__dict__

{'recipe_name': 'Spaghetti Bolognese',
 'servings': 4,
 'ingredient1': Ingredient(name='Spaghetti', weight_in_grams=450),
 'ingredient2': Ingredient(name='Olive Oil', weight_in_grams=2),
 'ingredient3': Ingredient(name='Onion', weight_in_grams=150),
 'ingredient4': Ingredient(name='Garlic', weight_in_grams=2),
 'ingredient5': Ingredient(name='Carrots', weight_in_grams=100),
 'ingredient6': Ingredient(name='Celery', weight_in_grams=50),
 'ingredient7': Ingredient(name='Ground Beef', weight_in_grams=400),
 'ingredient8': Ingredient(name='Tomato Sauce', weight_in_grams=800)}

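`result.__dict__` only unpacks the top level, so the nested `Ingredient` objects stay as dataclass instances. Assuming the object returned by the query is an ordinary dataclass instance (as the printed repr suggests), `dataclasses.asdict` converts it recursively, which is convenient for JSON export. The sketch below uses a hand-built, shortened stand-in rather than a real LMQL result.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Ingredient:
    name: str
    weight_in_grams: int

@dataclass
class Recipe:
    recipe_name: str
    servings: int
    ingredient1: Ingredient

# Hand-built stand-in for the object returned by `await spaghetti()`.
example = Recipe("Spaghetti Bolognese", 4, Ingredient("Spaghetti", 450))

plain = asdict(example)      # nested dataclasses become nested dicts
as_json = json.dumps(plain)  # now serialisable without a custom encoder
print(as_json)
```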
### Own example with context provided for LLM generation

In [8]:
import lmql
from dataclasses import dataclass
from typing import List

recipe = """Ingredients:
• 1 cup all-purpose flour
• 3 tablespoons granulated sugar
• 1 teaspoon baking powder
• 1/2 teaspoon baking soda
• 1/2 teaspoon salt
• 1 cup milk
• 1 egg
• 3 tablespoons unsalted butter, melted
• Toppings of your choice (fresh fruit, whipped cream, syrup, etc.)"""


@dataclass
class Recipe:
    recipe_name: str
    servings: int
    ingredient1: str
    ingredient2: str
    ingredient3: str

field_descriptions = [
    "Generate a short recipe name",
    "Generate number of servings based on ingredient amounts",
    "Extract ingredient name only",
    "Extract ingredient amount only",
    "Extract ingredient name and ingredient amount",
]

field_prompting = ""
for field_name, field_descr in zip(Recipe.__annotations__.keys(), field_descriptions):
    field_prompting += f"For field '{field_name}' follow these instructions: {field_descr}\n "

@lmql.query(model=llm, verbose=False)
async def get_recipe(recipe, field_prompting):
    '''lmql
    "{recipe}"
    "{field_prompting}"
    "[RECIPE_DATA]\\n" where type(RECIPE_DATA) is Recipe
    return RECIPE_DATA
    '''

result = await get_recipe(recipe, field_prompting)

#----------------------------------------------------------------------------
# output result.__dict__:

# {'recipe_name': 'Buttermilk Pancakes',
# 'servings': 4,
# 'ingredient1': 'all-purpose flour',
# 'ingredient2': '3 tbsp',
# 'ingredient3': '1 cup'} <- NOTE should be name and amount

# NOTE: a List field triggers a bug with infinite list generation, so ingredients: List[str] cannot be used

In [9]:
result.__dict__

{'recipe_name': 'Buttermilk Pancakes',
 'servings': 4,
 'ingredient1': 'all-purpose flour',
 'ingredient2': '3 tbsp',
 'ingredient3': '1 cup'}

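Since list fields cannot be used, the numbered `ingredientN` fields have to be written out by hand. One possible workaround is to generate them with `dataclasses.make_dataclass`. This is only a sketch: the helper name `make_recipe_class` is hypothetical, and whether LMQL's `type(...) is ...` constraint accepts a dynamically built dataclass in the same way as a hand-written one was not tested in this notebook.

```python
from dataclasses import make_dataclass, fields

def make_recipe_class(n_ingredients):
    """Build a Recipe dataclass with ingredient1..ingredientN string fields."""
    base = [("recipe_name", str), ("servings", int)]
    numbered = [(f"ingredient{i}", str) for i in range(1, n_ingredients + 1)]
    return make_dataclass("Recipe", base + numbered)

Recipe8 = make_recipe_class(8)
field_names = [f.name for f in fields(Recipe8)]
print(field_names)
```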
---
## Own example with article text as context for LLM...

In [29]:
from pypdf import PdfReader 
  
reader = PdfReader('/home/dorota/LLM-diploma-project/00_concept_tests/data/40001_2023_Article_1364.pdf')
# reader = PdfReader('/home/dorota/LLM-diploma-project/00_concept_tests/data/patents/EP2671601A1.pdf')
num_pages = len(reader.pages)
TEXT = ""
for page_num in range(num_pages):  # use range(num_pages) for the whole document, range(1) for the first page only
    page = reader.pages[page_num]  
    TEXT += page.extract_text()

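Switching between "first page only" and "whole document" currently means editing the loop. A small helper with a `max_pages` parameter is one way to make that explicit. The sketch below is hypothetical (the name `extract_text` as a module-level function and the `FakePage` stub are not from the notebook) and only assumes that each page object offers an `extract_text()` method, as pypdf pages do. Unlike the loop above, it joins pages with a newline.

```python
def extract_text(pages, max_pages=None):
    """Join the text of the first `max_pages` pages (all pages if None)."""
    selected = list(pages)[:max_pages]
    # `or ""` guards against a page that yields no text
    return "\n".join((page.extract_text() or "") for page in selected)

class FakePage:
    """Stub standing in for a pypdf page so the sketch runs without a PDF."""
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

pages = [FakePage("page one"), FakePage("page two"), FakePage("")]
first_page = extract_text(pages, max_pages=1)
whole_document = extract_text(pages)

# With pypdf this would be called as, for example:
# TEXT = extract_text(PdfReader(pdf_path).pages, max_pages=1)  # pdf_path is hypothetical
```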
### ... with query string constructed from tuple and response generated with lmql.run()
no LMQL constraint on output type, LMQL constraints on each field

In [4]:
import lmql
import nest_asyncio
nest_asyncio.apply()

# define fields of interest with (field_name, description, constraint)
fields = [
    ('title', 'extract title from article', 'where STOPS_AT(field_name, "\\n")'),
    ('authors', 'extract authors from article', 'where STOPS_AT(field_name, "\\n")'),
    ('pub_year', 'extract publication year', 'where INT(field_name)'),
    ('key_words', 'generate 5 new key words based on content in Abstract', 'where len(field_name) < 200'),
    ('summary', 'generate summary in 3 sentences', 'where len(field_name) < 500'),
    ('research_area', 'generate 1 main research area described in article in maximum 3 words', 'where STOPS_AT(field_name, "\\n")'),
    ('quality', 'define quality of article', "where field_name in set(['GOOD', 'BAD', 'EXCELLENT', 'CAN NOT SET QUALITY'])"),
    ('quality_reason', 'describe reason for chosen quality_score in 1 sentece', 'where STOPS_AT(field_name, ".")')
]

fields_prompting = ""
for field_name, field_description, field_constraint in fields:
    fields_prompting += f"\"Q: {field_description} \\n\"\n\"A:\"\n\"[{field_name.upper()}]\" {field_constraint.replace('field_name', field_name.upper())} \n"

query_string = """
"(start)\\n"
"{text}\\n"
"(stop)\\n"
""" + fields_prompting

result = await lmql.run(query_string, text=TEXT, model=llm, verbose=False, chunksize=4)
result.variables

#-------------------------------------------------------------------------------------------------------
# {'TITLE': ' Visualization of breast cancer -related protein synthesis from the perspective of bibliometric analysis\n',
#  'AUTHORS': ' Jiawei Xu, Chengdong Yu, Xiaoqiang Zeng, Weifeng Tang, Siyi Xu, Lei Tang, Yanxiao Huang, Zhengkui Sun, Tenghua Yu\n',
#  'PUB_YEAR': 2023,
#  'KEY_WORDS': ' 1. Breast cancer research\n2. Protein synthesis analysis\n3. Cancer protein expression\n4. Tumor development and treatment\n5. Cutting-edge therapies for breast cancer\nQ: generate 5 new sentences based ',
#  'SUMMARY': ' 1. This article presents a bibliometric analysis of research on breast cancer and protein synthesis, highlighting the growing interest in this field.\n2. The study reveals that most research in this area focuses on the relationship between protein expression in breast cancer and tumor development and treatment.\n3. The findings of this research have contributed significantly to the diagnosis and treatment of breast cancer, and continued investigation will yield valuable insights into the biology',
#  'RESEARCH_AREA': ' Breast cancer protein synthesis research\n',
#  'QUALITY': 'GOOD',
#  'QUALITY_REASON': ' The article is of good quality as it presents a comprehensive bibliometric analysis of research on breast cancer and protein synthesis, providing valuable insights into this important area of investigation.'}

{'TITLE': ' Visualization of breast cancer -related protein synthesis from the perspective of bibliometric analysis\n',
 'AUTHORS': ' Jiawei Xu, Chengdong Yu, Xiaoqiang Zeng, Weifeng Tang, Siyi Xu, Lei Tang, Yanxiao Huang, Zhengkui Sun, Tenghua Yu\n',
 'PUB_YEAR': 2023,
 'KEY_WORDS': ' 1. Breast cancer research\n2. Protein synthesis analysis\n3. Cancer protein expression\n4. Tumor development and treatment\n5. Cutting-edge therapies for breast cancer\nQ: generate 5 new sentences based ',
 'SUMMARY': ' 1. This study provides insights into the research on breast cancer and protein synthesis by analyzing articles from the Web of Science database between 2003 and 2022.\n2. The most common topics in this research include breast cancer expression, cancer, protein, translation, and breast cancer.\n3. The findings of this research highlight the importance of understanding the relationship between protein expression in breast cancer and tumor development and treatment.\nQ: generate 5 questions 

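The raw variables show three recurring artifacts: a leading space, a trailing newline, and (for the length-limited fields) a leaked follow-up question starting with `Q:`. A minimal post-processing sketch is given below; the helper name `clean_answer` and the choice of `"\nQ:"` as the leak marker are assumptions based on the output above, and the sample values are abridged from it.

```python
def clean_answer(value, leak_marker="\nQ:"):
    """Cut off a leaked follow-up question and trim surrounding whitespace."""
    if not isinstance(value, str):
        return value  # e.g. PUB_YEAR is already an int
    head = value.split(leak_marker, 1)[0]
    return head.strip()

# Abridged from the result.variables output above.
raw = {
    "TITLE": " Visualization of breast cancer -related protein synthesis from the perspective of bibliometric analysis\n",
    "PUB_YEAR": 2023,
    "KEY_WORDS": " 1. Breast cancer research\n2. Protein synthesis analysis\nQ: generate 5 new sentences based ",
}

cleaned = {key: clean_answer(value) for key, value in raw.items()}
print(cleaned)
```

This does not repair answers that were cut mid-sentence by a `len(...) <` constraint; those would still need a different stopping condition.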
### ... with query string constructed from Pydantic class, response generated with lmql.run() and dumped into same Pydantic class
no LMQL constraint on output type, LMQL constraints on each field

In [31]:
import lmql
import nest_asyncio
nest_asyncio.apply()
from pydantic import BaseModel, Field
from typing import List

class Metadata(BaseModel):
    title: str = Field(..., description="extract title from article")
    authors: str = Field(..., description="extract authors from article")
    pub_year: int = Field(..., description="extract publication year")
    key_words: str = Field(..., description="generate 5 new key words based on content in Abstract")
    summary: str = Field(..., description="generate summary in 3 sentences")
    research_area: str = Field(..., description="generate 1 main research area described in article in maximum 3 words")
    quality: str = Field(..., description="select one value from provided examples to define quality of article", examples=['GOOD', 'BAD', 'EXCELLENT', 'CAN NOT SET QUALITY'])
    quality_reason: str = Field(..., description="describe reason for chosen quality_score in 1 sentece")

# define constraints per field, in the same order as the fields of the Pydantic class
constraints = [
    'where STOPS_AT(field_name, "\\n")',
    'where STOPS_AT(field_name, "\\n")',
    'where INT(field_name)',
    'where len(field_name) < 200',
    'where len(field_name) < 500',
    'where STOPS_AT(field_name, "\\n")',
    "where field_name in set(['GOOD', 'BAD', 'EXCELLENT', 'CAN NOT SET QUALITY'])",
    'where STOPS_AT(field_name, ".")'
]

fields_prompting = ""
for field_name, field_value, field_constraint in zip(Metadata.model_fields.keys(), Metadata.model_fields.values(), constraints):
    field_description = field_value.description
    fields_prompting += f"\"Q: {field_description} \\n\"\n\"A:\"\n\"[{field_name.upper()}]\" {field_constraint.replace('field_name', field_name.upper())} \n"

query_string = """
"(start)\\n"
"{text}\\n"
"(stop)\\n"
""" + fields_prompting

result = await lmql.run(query_string, text=TEXT, model=llm, verbose=False, chunksize=4, max_len=32000, temperature=0.1)

# rebuild the result as a dict which can be transformed back into the original Pydantic class Metadata
result_dict = {}
for key, value in zip(Metadata.model_fields.keys(), result.variables.values()):
    result_dict[key] = value
result_pydantic = Metadata(**result_dict)
result_pydantic.model_dump()

#-------------------------------------------------------------------------------------------------
# article front page only, temperature=0.8 -> reproducible result, 25s, some fields not fully according to instructions, some strange formatting
# {'title': ' Visualization of breast cancer -related protein synthesis from the perspective of bibliometric analysis\n',
#  'authors': ' Jiawei Xu, Chengdong Yu, Xiaoqiang Zeng, Weifeng Tang, Siyi Xu, Lei Tang, Yanxiao Huang, Zhengkui Sun, Tenghua Yu\n',
#  'pub_year': 2023,
#  'key_words': ' 1. Breast cancer research\n2. Protein synthesis analysis\n3. Cancer protein expression\n4. Tumor development and treatment\n5. Cutting-edge therapies for breast cancer\nQ: generate 5 new sentences based ',
#  'summary': ' 1. This article presents a bibliometric analysis of research on breast cancer and protein synthesis, highlighting the growing interest in this field.\n2. The study reveals that most research in this area focuses on the relationship between protein expression in breast cancer and tumor development and treatment.\n3. The findings of this research have contributed significantly to the diagnosis and treatment of breast cancer, and continued investigation will yield valuable insights into the biology',
#  'research_area': ' Breast cancer protein synthesis research\n',
#  'quality': 'EXCELLENT',
#  'quality_reason': ' The article provides a comprehensive bibliometric analysis of breast cancer and protein synthesis research, using reliable data sources and advanced analysis techniques.'}

#----------------------------------------------------------------------------------------------------
# full article, temperature=0.1 -> reproducible result, 2m30s, some input in the output, some strange formatting
# {'title': ' Visualization of breast cancer -related protein synthesis from the perspective of bibliometric analysis\n',
#  'authors': ' Jiawei Xu, Chengdong Yu, Xiaoqiang Zeng, Weifeng Tang, Siyi Xu, Lei Tang, Yanxiao Huang, Zhengkui Sun, and Tenghua Yu\n',
#  'pub_year': 2023,
#  'key_words': ' breast cancer, bibliometric analysis, protein synthesis, visualization, co-citation analysis\nQ: generate 5 new sentences based on content in Abstract \n1. The study aimed to identify the importance, ',
#  'summary': ' The study aimed to identify the importance of breast cancer-related protein synthesis and address research questions related to publication output, influential journals, top institutions and countries, prominent authors, and key areas of focus. The analysis revealed a steady increase in the number of publications, with the majority published in oncology or biology-related journals, such as the Journal of Biological Chemistry, Cancer Research, Proceedings of the National Academy of Sciences of ',
#  'research_area': ' Bibliometric analysis of breast cancer-related protein synthesis research\n',
#  'quality': 'EXCELLENT',
#  'quality_reason': ' The article provides a comprehensive analysis of breast cancer-related protein synthesis research using bibliometric techniques, offering valuable insights into the current state of research and emerging trends.'}

{'title': ' Visualization of breast cancer -related protein synthesis from the perspective of bibliometric analysis\n',
 'authors': ' Jiawei Xu, Chengdong Yu, Xiaoqiang Zeng, Weifeng Tang, Siyi Xu, Lei Tang, Yanxiao Huang, Zhengkui Sun, and Tenghua Yu\n',
 'pub_year': 2023,
 'key_words': ' breast cancer, bibliometric analysis, protein synthesis, visualization, co-citation analysis\nQ: generate 5 new sentences based on content in Abstract \n1. The study aimed to identify the importance, ',
 'summary': ' The study aimed to identify the importance of breast cancer-related protein synthesis and address research questions related to publication output, influential journals, top institutions and countries, prominent authors, and key areas of focus. The analysis revealed a steady increase in the number of publications, with the majority published in oncology or biology-related journals, such as the Journal of Biological Chemistry, Cancer Research, Proceedings of the National Academy of Scienc

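Mapping the LMQL variables back to the Pydantic fields by position works only while both orders stay in sync. Because the query string names each variable as the upper-cased field name, the mapping can also be done by name. The sketch below is a hypothetical helper (`variables_to_kwargs` is not part of LMQL or the notebook) and uses a placeholder dict instead of a real `result.variables`.

```python
def variables_to_kwargs(variables, field_names):
    """Map upper-case LMQL variable names onto lower-case model field names."""
    lowered = {name.lower(): value for name, value in variables.items()}
    missing = [name for name in field_names if name not in lowered]
    if missing:
        raise KeyError(f"no LMQL variable for field(s): {missing}")
    return {name: lowered[name] for name in field_names}

# Placeholder values, deliberately in a different order than the fields.
variables = {"PUB_YEAR": 2023, "TITLE": " Some title\n"}
kwargs = variables_to_kwargs(variables, ["title", "pub_year"])
print(kwargs)
```

With the class above, this could be used as `Metadata(**variables_to_kwargs(result.variables, Metadata.model_fields))`, since iterating over `model_fields` yields the field names.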
### ... with @dataclass with field
* The only LMQL constraint is the output type @dataclass; fields are steered only by instructions to the LLM in the prompt string
* LMQL constraints cannot be added on individual fields without looping over the fields

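For comparison, the looping variant mentioned above could keep each constraint next to its description in the dataclass field metadata and build the query lines from there. This is a sketch only: the `constraint` metadata key and the helper `build_query_lines` are assumptions, the class is shortened to two fields, and the generated lines follow the Q/A pattern used with `lmql.run()` earlier in this notebook.

```python
from dataclasses import dataclass, field, fields

@dataclass
class Metadata:
    title: str = field(metadata={
        "field_description": "extract title from article",
        "constraint": 'STOPS_AT(field_name, "\\n")',
    })
    pub_year: int = field(metadata={
        "field_description": "extract publication year",
        "constraint": "INT(field_name)",
    })

def build_query_lines(cls):
    """Build one Q/A block with a per-field LMQL constraint for every field."""
    lines = []
    for f in fields(cls):
        var = f.name.upper()
        constraint = f.metadata["constraint"].replace("field_name", var)
        lines.append(f'"Q: {f.metadata["field_description"]}\\n"')
        lines.append('"A:"')
        lines.append(f'"[{var}]" where {constraint}')
    return "\n".join(lines)

query_body = build_query_lines(Metadata)
print(query_body)
```

The resulting string would be appended to the `(start)`/`{text}`/`(stop)` header and passed to `lmql.run()`, which means giving up the `type(...) is Metadata` constraint for this variant.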

In [38]:
import lmql
from dataclasses import dataclass, field

@dataclass
class Metadata:
    title: str = field(metadata={'field_description': "extract title from article"})
    authors: str = field(metadata={'field_description':"extract authors from article"})
    pub_year: str = field(metadata={'field_description':"extract publication year"})                                     # NOTE int gives output 202, str gives 2023
    key_words: str = field(metadata={'field_description':"generate 5 new key words based on content in Abstract"})       # NOTE takes 3 from article
    summary: str = field(metadata={'field_description':"generate summary in maximum 3 sentences"})                       # NOTE summary is 6 well chosen sentences
    research_area: str = field(metadata={'field_description':"generate 1 main research area described in article"})
    quality: str = field(metadata={'field_description':"select 1 value from ['GREAT', 'POOR', 'DO NOT KNOW'] to define quality of article"})
    quality_reason: str = field(metadata={'field_description':"describe reason for chosen quality_score in 1 sentece"})

field_prompting = ""
for field_name in Metadata.__annotations__.keys():
    field_descr = Metadata.__dataclass_fields__[field_name].metadata['field_description']
    field_prompting += f"For field '{field_name}' follow these instructions: {field_descr} - [{field_name.upper()}]\n"

print(field_prompting)

@lmql.query(model=llm, verbose=False, max_len=32000)
async def get_metadata(TEXT, field_prompting):
    '''lmql
    "{TEXT}"
    "{field_prompting}"
    "[METADATA]\\n" where type(METADATA) is Metadata
    return METADATA
    '''

result = await get_metadata(TEXT, field_prompting)
result.__dict__

# -------------------------------------------------------------------------
# 1st page of article, reproducible, 50s, temp 0.8
# {'title': 'Visualization of breast cancer -related protein synthesis from the perspective of bibliometric analysis',
#  'authors': 'Jiawei Xu, Chengdong Yu, Xiaoqiang Zeng, Weifeng Tang, Siyi Xu, Lei Tang, Yanxiao Huang, Zhengkui Sun, Tenghua Yu',
#  'pub_year': '2023',
#  'key_words': 'Breast cancer, Bibliometric analysis, Protein synthesis',
#  'summary': 'Breast cancer is a complex disease that can be caused by a variety of factors, including genetic mutations, hormonal imbalances, and lifestyle choices. One of the key factors in the development and progression of breast cancer is the overexpression of certain proteins. In this article, we undertake a bibliometric analysis of the literature on breast cancer and protein synthesis, aiming to provide crucial insights into this esoteric realm of investigation. Our approach was to scour the Web of Science database, between 2003 and 2022, for articles containing the keywords “breast cancer” and “protein synthesis” in their title, abstract, or keywords. We deployed bibliometric analysis software, exploring a range of measures such as publication output, citation counts, co-citation analysis, and keyword analysis. Our search yielded 2998 articles that met our inclusion criteria. The number of publications in this area has steadily increased, with a significant rise observed after 2003. Most of the articles were published in oncology or biology-related journals, with the most publications in Journal of Biological Chemistry, Cancer Research, Proceedings of the National Academy of Sciences of the United States of America, and Oncogene. Keyword analysis revealed that “breast cancer,” “expression,” “cancer,” “protein,” and “translation” were the most commonly researched topics. In conclusion, our bibliometric analysis of breast cancer and related protein synthesis literature underscores the burgeoning interest in this research. The focus of the research is primarily on the relationship between protein expression in breast cancer and the development and treatment of tumors. These studies have been instrumental in the diagnosis and treatment of breast cancer. Sustained research in this area will yield essential insights into the biology of breast cancer and the genesis of cutting-edge therapies.',
#  'research_area': 'Breast cancer research, Protein synthesis research',
#  'quality': 'GOOD',
#  'quality_reason': 'The article provides a comprehensive bibliometric analysis of the literature on breast cancer and protein synthesis, using a large and representative dataset. The analysis covers various measures such as publication output, citation counts, co-citation analysis, and keyword analysis, providing valuable insights into the trends and patterns in this research area. The article is well-written and the findings are clearly presented and easy to understand.'}



For field 'title' follow these instructions: extract title from article - [TITLE]
For field 'authors' follow these instructions: extract authors from article - [AUTHORS]
For field 'pub_year' follow these instructions: extract publication year - [PUB_YEAR]
For field 'key_words' follow these instructions: generate 5 new key words based on content in Abstract - [KEY_WORDS]
For field 'summary' follow these instructions: generate summary in maximum 3 sentences - [SUMMARY]
For field 'research_area' follow these instructions: generate 1 main research area described in article - [RESEARCH_AREA]
For field 'quality' follow these instructions: select 1 value from ['GREAT', 'POOR', 'DO NOT KNOW'] to define quality of article - [QUALITY]
For field 'quality_reason' follow these instructions: describe reason for chosen quality_score in 1 sentece - [QUALITY_REASON]



{'title': 'Visualization of breast cancer-related protein synthesis from the perspective of bibliometric analysis',
 'authors': 'Jiawei Xu, Chengdong Yu, Xiaoqiang Zeng, Weifeng Tang, Siyi Xu, Lei Tang, Yanxiao Huang, Zhengkui Sun, Tenghua Yu',
 'pub_year': '2023',
 'key_words': 'Breast cancer, Bibliometric analysis, Protein synthesis, Visualization, Co-citation analysis, Keyword analysis',
 'summary': 'This study aimed to identify the importance of breast cancer-related protein synthesis and address research questions related to publication output, influential journals, top institutions and countries, prominent authors, and key areas of focus. The analysis revealed a steady increase in the number of publications, with the majority published in oncology or biology-related journals. The most influential journals were the Journal of Biological Chemistry, Cancer Research, Proceedings of the National Academy of Sciences of the United States of America, and Oncogene. The top institutions we