## Building Evaluation Data 

This data, manually extracted from patents, will be used to evaluate the model and the prompts.

In [251]:
import json

def write_json_objects_to_file(json_objects, file_path):
    """
    Writes a list of JSON objects to a file.
    
    Args:
        json_objects (list): List of JSON objects.
        file_path (str): File path to write the JSON objects to.
    """
    with open(file_path, 'w') as file:
        json.dump(json_objects, file)
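
To reuse the evaluation examples in a later session, the file has to be read back. The sketch below shows one way to do that; `read_json_objects_from_file` is a hypothetical companion helper (it is not part of the original notebook), and the round trip runs on a throwaway temporary file rather than on `Data/evaluation_examples.json`.

```python
import json
import os
import tempfile

def read_json_objects_from_file(file_path):
    """Read back a list of JSON objects that was written with json.dump."""
    with open(file_path, "r", encoding="utf-8") as file:
        return json.load(file)

# Round-trip check on a temporary file (illustrative sample data)
sample = [{
    "text": "a crystallite size of between about 10 and 20 nm",
    "measurments": [{"measurment": "crystallite size", "unit": "nm", "value": "10 to 20"}],
}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "evaluation_examples.json")
    with open(path, "w", encoding="utf-8") as file:
        json.dump(sample, file)
    loaded = read_json_objects_from_file(path)
```

If the round trip works, `loaded` compares equal to `sample`.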

In [249]:
evaluation_data_json = [{
    "text": """According to one embodiment, the nitrogen oxide storage
                materials comprise ceria particles having alkaline earth
                oxides, for example, baria, Supported on the particles, the
                ceria having a crystallite size of between about 10 and 20 nm
                and the alkaline earth oxides having a crystallite size of
                between about 20 and 40 nm. Other suitable alkaline earth
                oxides include oxides of Mg, Sr. and Ca. In certain embodi
                ments, the composite particles have a BET surface area of
                between about 30 and 80 m/g.""",
    "measurments": [
                    {"measurment":"ceria crystallite size", "unit": "nm", "value": "10 to 20"}, 
                    {"measurment":"alkaline earth oxides crystallite size", "unit": "nm", "value": "20 to 40"}, 
                    {"measurment":"composite particles BET surface area", "unit": "m/g", "value": "30 to 80"},
                   ]
},

    {"text": """Desirably, the refractory metal oxide support will have a
    surface area of between about 5 and about 350 m?g, and more
    particularly between about 100 and 200 m/g. Typically, the
    Support will be present on the coated Substrate in the amount
    of about 1.5 to about 7.0 g/in, for example between about 3
    and 6 g/in. A suitable support material for the precious metal
    is alumina, which may be doped with one or more other
    materials. Alumina having a BET surface area of about 200
    m/g""",
     "measurments": [
                    {"measurment":"refractory metal oxide support surface area", "unit": "m^2/g", "value": "100 to 200"}, 
                    {"measurment":"coated Substrate support", "unit": "g/in", "value": "1.5 to 7.0"}, 
                    {"measurment":"Alumina BET surface area", "unit": "m/g", "value": "200"},
                   ]
    },
{"text":"""In one or more embodiments of the present invention the
catalytic component preferably comprises a precious metal
component, i.e., a platinum group metal component. Suitable
precious metal components include platinum, palladium,
rhodium and mixtures thereof. The catalytic component will
typically be present in an amount of about 20 to about 200
g/ft, more specifically, about 60 to 120 g/ft.""",
"measurments":[
    {"measurment":"catalytic component", "unit": "g/ft", "value": "60 to 120"}
    
]},
{"text": """BaCO, and CeO, were intimately mixed and finely dis
persed in a weight ratio of between about 1:3 and about 1:5.
Cerium oxide having a BET surface area of between about
50-150 m/g was mixed with a solution of barium acetate
such that the BaCO/CeO composite had a BaCO3 content
of about 10-30 wt %. After mixing, the suspension of soluble
barium acetate and CeO2 was then spray-dried at a tempera
ture of between about 90° C. and 120° C. to obtain a solid
mixture of barium acetate and ceria.""",
"measurments": [
                {"measurment":"BaCO and CeO weight ratio", "unit": "weight ratio", "value": "1:3 to 1:5"},
                {"measurment":"Cerium oxide surface area", "unit": "m/g", "value": "50 to 150"},
                {"measurment":"BaCO3 content in BaCO/CeO composite", "unit": "wt", "value": "10 to 30"}
               ]},
{"text":"""According to the data in Tables I and II, an as-prepared BET
surface area between 40-60 m/g and a ceria crystal size
between about 10- and 20 nm and a BaCO crystallite size of
between about 20- and 40 nm yielded the best performance
after aging.""",
"measurments": [
{
"measurment": "BET surface area",
"unit": "m/g",
"value": "between 40-60"
},
{
"measurment": "ceria crystal size",
"unit": "nm",
"value": "between about 10- and 20"
},
{
"measurment": "BaCO crystallite size",
"unit": "nm",
"value": "between about 20- and 40"
}
]},
{"text": """The method of claim 1, wherein the nitrogen oxide
storage material has a surface area between about 30 and 80 m^2/g.""",
"measurments": [
    {"measurment": "nitrogen oxide surface area", "unit":"m^2/g", "value": "30 and 80"}
]},
{
    "text":"""In a favorite realization the bioproducto is useful for the control of plant fungi. In a materialization of the invention the polypeptide included into the bio-product in the concentration range among 1 to 9 g/ml""",
    "measurments": [
        {
        "measurment": "polypeptide concentration",
        "unit": "g/ml",
        "value": "1 to 9"
        }
        ]
    },
{
    "text": "The starting RCR mixture contains about 0.01 ng&#x2dc;10 &#x3bc;g of the RCR-ready RNA/mRNA templates, about 0.1&#x2dc;50 U of isolated coronaviral RdRp/helicase, and a proper amount of rNTPs in 1&#xd7;transcription buffer. RdRp/helicase is either an RdRp enzyme with an additional RNA unwinding activity or a mixture of RdRp and helicase.",
    "measurments":[
          {
            "measurment": "RCR-ready RNA/mRNA templates",
            "unit": "ng-µg",
            "value": "0.01-10"
          },
          {
            "measurment": "Isolated coronaviral RdRp/helicase",
            "unit": "U (units)",
            "value": "0.1-50"
          }
        ]

}]

output_format = """Your output should be a list of JSON objects. Good output format: [{"measurment":"string", "unit":"string", "value": "string"},
{"measurment":"string", "unit":"string", "value": "string"}]"""
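
The prompt-building code further down looks up the key `measurments` and formats each entry with `measurment`, `unit` and `value`, so an example that uses different key names would fail at that point. A quick consistency check can catch this earlier. The following is only a sketch: `find_schema_problems` and `REQUIRED_KEYS` are hypothetical names, and the two sample examples are made up for illustration.

```python
REQUIRED_KEYS = {"measurment", "unit", "value"}  # spelling follows the keys used in this notebook

def find_schema_problems(examples, list_key="measurments"):
    """Return human-readable descriptions of examples that break the expected layout."""
    problems = []
    for index, example in enumerate(examples):
        if "text" not in example:
            problems.append(f"example {index}: missing 'text'")
        if list_key not in example:
            problems.append(f"example {index}: missing '{list_key}'")
            continue
        for position, entry in enumerate(example[list_key]):
            missing = REQUIRED_KEYS - set(entry)
            if missing:
                problems.append(f"example {index}, entry {position}: missing {sorted(missing)}")
    return problems

# Illustrative data: the second example uses a different spelling for its keys
good = {"text": "t", "measurments": [{"measurment": "size", "unit": "nm", "value": "10 to 20"}]}
bad = {"text": "t", "measurements": [{"measurement": "size", "unit": "nm", "value": "1"}]}
problems = find_schema_problems([good, bad])
```

Running the same function over `evaluation_data_json` before writing the file would list any entries that need attention.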

In [None]:
prompt_example_1 = 'You are given a text from chimestry research patent, extract all the measurments mentinoed in the text.\nExtract Measurment Details, Unit, Value or Value ranges. \n\nYour output should follow JSON format.\n'

In [252]:
write_json_objects_to_file(evaluation_data_json, "Data/evaluation_examples.json")

## Setting up OpenAI and Azure

In [117]:
import os
from dotenv import load_dotenv

load_dotenv("env.env")


True

In [118]:
import os
import requests
import json
import openai

openai.api_key = os.getenv("AZURE_OPENAI_KEY")
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT") # your endpoint should look like the following https://YOUR_RESOURCE_NAME.openai.azure.com/
openai.api_type = 'azure'
openai.api_version = '2023-05-15' # this may change in the future

deployment_name='Omar Challange' #This will correspond to the custom name you chose for your deployment when you deployed a model. 

#
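
If `env.env` is missing or incomplete, `os.getenv` silently returns `None` and the failure only shows up at the first API call. A small guard can make that visible earlier. This is a sketch under the assumption that only the two variables used above are required; `missing_env_vars` is a hypothetical helper.

```python
import os

REQUIRED_VARS = ("AZURE_OPENAI_KEY", "AZURE_OPENAI_ENDPOINT")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping instead of the real environment
missing = missing_env_vars(env={"AZURE_OPENAI_KEY": "placeholder-key"})
```

Calling `missing_env_vars()` without arguments checks the real process environment after `load_dotenv` has run.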

## Building the Prompt

### Example 1

In [147]:
from langchain.embeddings import OpenAIEmbeddings

In [148]:
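# Note: `engine` is not an accepted field on OpenAIEmbeddings, hence the error below;
# the class exposes `model` and `deployment` instead.
OpenAIEmbeddings(engine="text-embedding-ada-002")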

ValidationError: 1 validation error for OpenAIEmbeddings
engine
  extra fields not permitted (type=value_error.extra)

In [208]:
from langchain.chat_models import ChatOpenAI, AzureChatOpenAI
from langchain import PromptTemplate, LLMChain
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import AIMessage, HumanMessage, SystemMessage

In [224]:
def concat_string_with_dicts(string, dict_list):
    """Append each dict to `string` as ' {key: value, ...}', separated by commas."""
    result = string
    for dictionary in dict_list:
        dict_str = ", ".join(f"{key}: {value}" for key, value in dictionary.items())
        result += " {" + dict_str + "},"
    return result.rstrip(',')  # Remove trailing comma

def format_dicts_to_string(data_list):
    """Render each measurement dict as one '"measurment": ..., "unit": ..., "value": ...' line.

    The lines carry no surrounding braces, so the result is not valid JSON on its own.
    """
    formatted_strings = []
    for data in data_list:
        formatted_string = '"measurment": "{measurment}", "unit": "{unit}", "value": "{value}"'.format(**data)
        formatted_strings.append(formatted_string)
    return ',\n'.join(formatted_strings)
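
`format_dicts_to_string` emits the key/value pairs without braces, so the few-shot answers are not valid JSON. One possible alternative is to serialize with `json.dumps` and double the braces, since the templates in this notebook use the f-string format, where a literal `{` must be written as `{{`. The sketch below is an assumption-laden illustration: `format_dicts_to_json_template` is a hypothetical name, and plain `str.format` stands in for the template rendering step.

```python
import json

def format_dicts_to_json_template(data_list):
    """Serialize measurements as JSON and escape braces for f-string style templates."""
    json_text = json.dumps(data_list)
    return json_text.replace("{", "{{").replace("}", "}}")

measurements = [{"measurment": "ceria crystallite size", "unit": "nm", "value": "10 to 20"}]
template_text = format_dicts_to_json_template(measurements)

# str.format() collapses the doubled braces, as an f-string style template would
rendered = template_text.format()
parsed = json.loads(rendered)
```

After rendering, the text parses back into the original list, which is what a JSON-formatted few-shot answer needs.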

In [236]:
import json

def build_few_shots_prombt(prombt, examples):
    """Build a few-shot chat prompt: system message, instruction, example pairs, then the input slot."""
    system_message = "You are a helpful assistant you extract measurments from research patent."
    system_message_prompt = SystemMessagePromptTemplate.from_template(system_message)

    first_message = HumanMessagePromptTemplate.from_template(prombt)

    final_prombt = [system_message_prompt, first_message]
    # Each example becomes a human message (the text) followed by an AI message (the expected answer)
    for example in examples:
        final_prombt.append(HumanMessagePromptTemplate.from_template(f"text: '''{example['text']}'''"))
        final_prombt.append(AIMessagePromptTemplate.from_template(format_dicts_to_string(example['measurments'])))

    # The last message leaves {output_format} and {input_text} open to be filled in at run time
    final_prombt.append(HumanMessagePromptTemplate.from_template("\nOutput format:{output_format}:\n '''{input_text}'''"))

    return ChatPromptTemplate.from_messages(final_prombt)

In [237]:
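mes = evaluation_data_json[0]["measurments"]  # measurements of the first evaluation example
ms_dic = format_dicts_to_string(mes)
print(ms_dic)
# The formatted entries have no braces, so wrapping them in [] does not yield valid JSON
json.loads('['+ms_dic+']')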


"measurment": "ceria crystallite size", "unit": "nm", "value": "10 to 20",
"measurment": "alkaline earth oxides crystallite size", "unit": "nm", "value": "20 to 40",
"measurment": "composite particles BET surface area", "unit": "m/g", "value": "30 to 80"


JSONDecodeError: Expecting ',' delimiter: line 1 column 14 (char 13)

In [220]:
json.dumps(mes)

'[{"measurment": "ceria crystallite size", "unit": "nm", "value": "10 to 20"}, {"measurment": "alkaline earth oxides crystallite size", "unit": "nm", "value": "20 to 40"}, {"measurment": "composite particles BET surface area", "unit": "m/g", "value": "30 to 80"}]'

In [238]:
input_text = """In one or more embodiments of the present invention the
catalytic component preferably comprises a precious metal
component, i.e., a platinum group metal component. Suitable
precious metal components include platinum, palladium,
rhodium and mixtures thereof. The catalytic component will
typically be present in an amount of about 20 to about 200
g/ft, more specifically, about 60 to 120 g/ft."""
print(build_few_shots_prombt(prompt_example_1, evaluation_data_json))

input_variables=['input_text', 'output_format'] output_parser=None partial_variables={} messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], output_parser=None, partial_variables={}, template='You are a helpful assistant you extract measurments from research patent.', template_format='f-string', validate_template=True), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], output_parser=None, partial_variables={}, template='You are given a text from chimestry research patent, extract all the measurments mentinoed in the text.\nExtract Measurment Details, Unit, Value or Value ranges. \n\nYour output should follow JSON format.\n', template_format='f-string', validate_template=True), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], output_parser=None, partial_variables={}, template="text: '''According to one embodiment, the nitrogen oxide storage\n                materials comprise 

In [239]:
# chat = ChatOpenAI(engine='gpt-35-turbo')

# The openai.* settings were already applied in the setup cell above;
# the chat model below receives its Azure configuration explicitly.
BASE_URL = os.getenv("AZURE_OPENAI_ENDPOINT")  # should look like https://YOUR_RESOURCE_NAME.openai.azure.com/
API_KEY = os.getenv("AZURE_OPENAI_KEY")
DEPLOYMENT_NAME = 'gpt-35-turbo'  # custom name chosen when the model was deployed
chat = AzureChatOpenAI(
    openai_api_base=BASE_URL,
    openai_api_version="2023-03-15-preview",
    deployment_name=DEPLOYMENT_NAME,
    openai_api_key=API_KEY,
    openai_api_type="azure",
)

In [245]:
chat_prompt = build_few_shots_prombt(prompt_example_1, evaluation_data_json[0:2])

chain = LLMChain(llm=chat, prompt=chat_prompt)
# get a chat completion from the formatted messages


In [256]:
prompt_example_1

'You are given a text from chimestry research patent, extract all the measurments mentinoed in the text.\nExtract Measurment Details, Unit, Value or Value ranges. \n\nYour output should follow JSON format.\n'

In [257]:
chat.get_num_tokens(prompt_example_1)

47

In [246]:
chat_prompt

ChatPromptTemplate(input_variables=['input_text', 'output_format'], output_parser=None, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], output_parser=None, partial_variables={}, template='You are a helpful assistant you extract measurments from research patent.', template_format='f-string', validate_template=True), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], output_parser=None, partial_variables={}, template='You are given a text from chimestry research patent, extract all the measurments mentinoed in the text.\nExtract Measurment Details, Unit, Value or Value ranges. \n\nYour output should follow JSON format.\n', template_format='f-string', validate_template=True), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], output_parser=None, partial_variables={}, template="text: '''According to one embodiment, the nitrogen oxide storage\n             

In [136]:
result = chain.run(input_text=input_text, output_format=output_format)

In [137]:
result

'measurment: catalytic component amount, unit: g/ft, value: 20 to 200\nmeasurment: catalytic component amount, unit: g/ft, value: 60 to 120'

In [247]:
result = chain.run(input_text=input_text, output_format=output_format)
result

'[{"measurment": "catalytic component amount", "unit": "g/ft", "value": "20 to 200"},\n{"measurment": "catalytic component amount", "unit": "g/ft", "value": "60 to 120"}]'
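
Once the model returns JSON, its answer can be compared against the manually extracted measurements. The sketch below is one simple way to do that, not an established metric for this task: it matches on exact `(unit, value)` pairs, ignores the free-text measurement name, and `score_extraction` is a hypothetical helper. The inputs are the result shown above and the evaluation entry for the same passage.

```python
import json

def score_extraction(predicted_json, expected):
    """Return (precision, recall) based on exact (unit, value) matches."""
    predicted = json.loads(predicted_json)
    predicted_pairs = {(m["unit"], m["value"]) for m in predicted}
    expected_pairs = {(m["unit"], m["value"]) for m in expected}
    true_positives = len(predicted_pairs & expected_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(expected_pairs) if expected_pairs else 0.0
    return precision, recall

result = '[{"measurment": "catalytic component amount", "unit": "g/ft", "value": "20 to 200"},\n{"measurment": "catalytic component amount", "unit": "g/ft", "value": "60 to 120"}]'
expected = [{"measurment": "catalytic component", "unit": "g/ft", "value": "60 to 120"}]
precision, recall = score_extraction(result, expected)
print(precision, recall)  # 0.5 1.0
```

Here the model also reports the broader "20 to 200" range, which the evaluation entry does not list, so recall is perfect while precision is halved. Exact string matching is strict; values such as "30 and 80" versus "30 to 80" would need normalization first.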

## Output parser

In [209]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, validator
from typing import List

In [233]:
# Define your desired data structure.
class Measurments(BaseModel):
    measurments: List[dict] = Field(description="list of measurments found in text")



# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=Measurments)

# prompt = PromptTemplate(
#     template="Answer the user query.\n{format_instructions}\n{query}\n",
#     input_variables=["query"],
#     partial_variables={"format_instructions": parser.get_format_instructions()},
# )

# _input = prompt.format_prompt(query=joke_query)

# output = model(_input.to_string())

# parser.parse(output)

In [234]:
parser.get_format_instructions()

'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{"properties": {"measurments": {"title": "Measurments", "description": "list of measurments found in text", "type": "array", "items": {"type": "object"}}}, "required": ["measurments"]}\n```'
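
The schema above types each measurement as a free-form object, so entries with missing or misspelled keys would still pass. A stricter check can be sketched with the standard library alone; the `Measurement` dataclass and `parse_measurements` below are hypothetical stand-ins used for illustration and are not part of the Pydantic parser defined in this notebook.

```python
import json
from dataclasses import dataclass

@dataclass
class Measurement:
    measurment: str  # spelling follows the key used throughout this notebook
    unit: str
    value: str

def parse_measurements(raw_json):
    """Parse '{"measurments": [...]}' into Measurement objects.

    Measurement(**entry) raises TypeError when an entry has missing or unexpected keys.
    """
    payload = json.loads(raw_json)
    return [Measurement(**entry) for entry in payload["measurments"]]

raw = '{"measurments": [{"measurment": "BET surface area", "unit": "m/g", "value": "40 to 60"}]}'
parsed = parse_measurements(raw)
```

The same field-level typing could be expressed in the Pydantic model by replacing `List[dict]` with a list of a nested model, which would also make the generated format instructions more specific.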

## Let's try sending a file

In [141]:
from langchain.document_loaders import TextLoader
text_loader = TextLoader("Data/US8022010.txt")
text = text_loader.load()
text

[Document(page_content='U\n presence of excess oxygen. In this case, the catalyst/NOX \n sorbent is effective for storing NOx. As in the case of the gasoline partial lean burn application, after the NOX storage \n mode, a transient rich condition must be utilized to releasef \n reduce the stored NOx to nitrogen. In the case of the diesel engine, this transient reducing condition will require unique \n engine calibration or injection of a diesel fuel into the exhaust \n to create the next reducing environment. \n NOX storage (sorbent) components including alkaline \n earth metal oxides, such as oxides of Mg, Ca,Srand Ba, alkali \n metal oxides such as oxides of Li, Na, K, Rb and Cs, and rare \n earth metal oxides such as oxides of Ce, La, Pr and Nd in \n combination with precious metal catalysts such as platinum \n dispersed on an alumina Support have been used in the puri \n fication of exhaust gas from an internal combustion engine. \n For NOx storage, baria is usually preferred becau

In [142]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
chunk_size = 500
chunk_overlap = 0
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
all_splits = text_splitter.split_documents(text)
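
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph breaks, newlines and spaces; `split_fixed` is a hypothetical helper that only illustrates how the two parameters interact.

```python
def split_fixed(text, chunk_size=500, chunk_overlap=0):
    """Cut text into windows of chunk_size characters, each sharing chunk_overlap characters with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

With `chunk_overlap = 0`, as in the cell above, a measurement that straddles a chunk boundary can be cut in two, so a small overlap may be worth trying.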

In [143]:
all_splits

[Document(page_content='U\n presence of excess oxygen. In this case, the catalyst/NOX \n sorbent is effective for storing NOx. As in the case of the gasoline partial lean burn application, after the NOX storage \n mode, a transient rich condition must be utilized to releasef \n reduce the stored NOx to nitrogen. In the case of the diesel engine, this transient reducing condition will require unique \n engine calibration or injection of a diesel fuel into the exhaust \n to create the next reducing environment.', metadata={'source': 'Data/US8022010.txt'}),
 Document(page_content='NOX storage (sorbent) components including alkaline \n earth metal oxides, such as oxides of Mg, Ca,Srand Ba, alkali \n metal oxides such as oxides of Li, Na, K, Rb and Cs, and rare \n earth metal oxides such as oxides of Ce, La, Pr and Nd in \n combination with precious metal catalysts such as platinum \n dispersed on an alumina Support have been used in the puri \n fication of exhaust gas from an internal comb

In [150]:
all_splits[0].page_content

'U\n presence of excess oxygen. In this case, the catalyst/NOX \n sorbent is effective for storing NOx. As in the case of the gasoline partial lean burn application, after the NOX storage \n mode, a transient rich condition must be utilized to releasef \n reduce the stored NOx to nitrogen. In the case of the diesel engine, this transient reducing condition will require unique \n engine calibration or injection of a diesel fuel into the exhaust \n to create the next reducing environment.'

In [159]:
results_all_splits = []

for chunk in all_splits[20:30]:
    # The prompt template expects both input_text and output_format
    result = chain.run(input_text=chunk.page_content, output_format=output_format)
    results_all_splits.append(result)

In [253]:
results_all_splits

['measurment: catalytic component, unit: g/ft, value: 20 to 200\nmeasurment: catalytic component, unit: g/ft, value: 60 to 120',
 'measurment: catalytic component, unit: g/ft, value: 20 to 200\nmeasurment: catalytic component, unit: g/ft, value: 60 to 120\n',
 'measurment: catalytic component, unit: g/ft, value: 20 to 200\nmeasurment: catalytic component, unit: g/ft, value: 60 to 120',
 'measurment: catalytic component amount, unit: g/ft, value: 20 to 200\nmeasurment: catalytic component amount, unit: g/ft, value: 60 to 120',
 'measurment: catalytic component amount, unit: g/ft, value: 20 to 200\nmeasurment: catalytic component amount, unit: g/ft, value: 60 to 120',
 'measurment: catalytic component, unit: g/ft, value: 20 to 200\nmeasurment: catalytic component, unit: g/ft, value: 60 to 120',
 'measurment: catalytic component amount, unit: g/ft, value: 20 to 200\nmeasurment: catalytic component amount, unit: g/ft, value: 60 to 120',
 'measurment: catalytic component, unit: g/ft, value:

In [167]:
'{'+results_all_splits[0]+'}'

'{measurment: catalytic component, unit: g/ft, value: 20 to 200\nmeasurment: catalytic component, unit: g/ft, value: 60 to 120}'
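
Wrapping this plain-text result in braces does not produce JSON, because the keys and values are unquoted and each measurement sits on its own line. If results in this format need to be reused, they can be parsed line by line. The sketch below assumes every line follows the `measurment: ..., unit: ..., value: ...` pattern seen in the outputs above; `parse_plain_result` is a hypothetical helper, and lines that do not match are skipped.

```python
import re

LINE_PATTERN = re.compile(r"measurment:\s*(.+?),\s*unit:\s*(.+?),\s*value:\s*(.+)")

def parse_plain_result(result):
    """Turn 'measurment: x, unit: y, value: z' lines into a list of dicts."""
    measurements = []
    for line in result.splitlines():
        match = LINE_PATTERN.fullmatch(line.strip())
        if match:
            name, unit, value = match.groups()
            measurements.append({"measurment": name, "unit": unit, "value": value})
    return measurements

result = 'measurment: catalytic component, unit: g/ft, value: 20 to 200\nmeasurment: catalytic component, unit: g/ft, value: 60 to 120'
parsed = parse_plain_result(result)
```

The resulting list has the same shape as the `measurments` entries in the evaluation data, so it can be compared against them directly.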