# 1. Data / Entity Extraction: Azure OpenAI

## 1.0 BackGround
- tokens :  
    - short or common words :  single token e.g hello, bye
    - long or uncommon words : hamburger (ham, bur, ger)
- In-context learning
    - Text Completion : Generate and edit text (Prompt Based)
    - Embedding : Search, Classify and compare text
        - Search for chemicals, classify into chemical, measurement unit, lower bound, upper bound.  
    - Prompt + Doc_text. Models are not retrained but give prediction based on the context included in the prompt.
- Few-Shot training:
    - No. of examples : 0-100 depending on how many can fit in prompt length.
    - Performs less accurate than fine-tuned model.
- Models :
    - GPT-4 models are in preview.
    - GPT-3 are known as Davinci, Curie, Baggage & Ada.
    - Naming :
        - {capability}-{family}[-{input-type}]-{identifier}
        - {code/text} - {ada/ baggage / curie / davinci} - {embeddings} - {version} 
- GPT-3 models:  high performance / low speed -> low performance / high speed
    - text-davinci-003 : 12288 dim, Complex intent, cause and effect, summarization 
    - text-curie-001 : 4096 dim, Language translation, complex classification, text sentiment, summarization
    - text-babbage-001 : 2048 dim, Moderate classification, semantic search classification
    - text-ada-001: 1024 dimension, Parsing text, simple classification, address correction, keywords

- Embeddings:
    - Similarity : similarity between two text
    - Text Search: measure whether long doc are relevant to short search  query.
    - Code Search:
- Model Availiability :
    - https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models
    - text-davinci-003	Base Model Region: East US, West Europe    
    - Available with Completion API(text-in, text-out) request only and not with Chat completion API (gpt-4).
    - text-embedding-ada-002 (version 2) : East US, South Central US

- Model Limitations:
    - resources per region per azure subscription: 3
    - Davinci 
        -  120 request per minute
        -  40,000 Tokens per minute per model
    - Ada : 
        - 300 request per minute
        - 120,000 tokens per minute

- Deployment :
    - https://learn.microsoft.com/en-us/azure/cognitive-services/openai/overview
    - Create resource or instance of the service in your azure subscription
    - Deploy a model (using the Deployment APIs)
    - Make API call and generate text

Ref: 
https://github.com/Azure-Samples/openai/tree/main/Basic_Samples/Completions



In [10]:
import os
import requests
import json
import openai
from openai_keys import *

openai.api_key = OPENAI_KEY_EXTERNAL
openai.api_base = OPENAI_API_BASE_EXTERNAL
openai.api_type = "azure"
openai.api_version = "2022-12-01"  # this may change in the future

deployment_name = "text-davinci-003"
deployment_name_sm = "text-embedding-ada-002"


def get_completion(prompt, model_name, max_tokens=16):
    try:
        # Create a completion for the provided prompt and parameters
        # To know more about the parameters, checkout this documentation: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/reference
        completion = openai.Completion.create(prompt=prompt, temperature=0, max_tokens=max_tokens, engine=model_name)

        # Here indicating if the response is filtered
        if completion.choices[0].finish_reason == "content_filter":
            print("The generated content is filtered.")

        return completion

    except openai.error.APIError as e:
        # Handle API error here, e.g. retry or log
        print(f"OpenAI API returned an API Error: {e}")

    except openai.error.AuthenticationError as e:
        # Handle Authentication error here, e.g. invalid API key
        print(f"OpenAI API returned an Authentication Error: {e}")

    except openai.error.APIConnectionError as e:
        # Handle connection error here
        print(f"Failed to connect to OpenAI API: {e}")

    except openai.error.InvalidRequestError as e:
        # Handle connection error here
        print(f"Invalid Request Error: {e}")

    except openai.error.RateLimitError as e:
        # Handle rate limit error
        print(f"OpenAI API request exceeded rate limit: {e}")

    except openai.error.ServiceUnavailableError as e:
        # Handle Service Unavailable error
        print(f"Service Unavailable: {e}")

    except openai.error.Timeout as e:
        # Handle request timeout
        print(f"Request timed out: {e}")


def get_cost_estimate(model, no_of_tokens, return_numeric_cost=False):
    # Per 1k cost
    model_per_1k_cost = {
        "text-babbage-001": 0.0005,
        "text-embedding-ada-002": 0.0004,  # cl100k_base	max-token = 8191
        "text-ada-001": 0.0004,
        "text-curie-001": 0.002,
        "gpt-3.5-turbo": 0.002,
        "gpt-3.5-turbo-0301": 0.002,
        "text-davinci-003": 0.02,
        "text-davinci-002": 0.02,
        "gpt-4": 0.06,
        "gpt-4-0314": 0.06,
        "gpt-4-32k": 0.12,
        "gpt-4-32k-0314": 1.12,
    }
    cost = round((no_of_tokens * model_per_1k_cost[model] / (1000)), 4)
    if return_numeric_cost:
        return cost

    return f"{cost} cents"


def get_completion_stats(completion):
    cb = completion["usage"]
    estimated_total_cost = get_cost_estimate(completion["model"], cb.total_tokens)
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    # print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): {estimated_total_cost}")


def get_token_estimate(text):
    """Rough token estimation logic:
    Better solution to do is to get  exact tokenisation count by  using gpt or similar other word-piece meal tokenisation.
    """
    token_estimate = int(len(text.split()))

    if token_estimate < 1000:
        return int(token_estimate + 0.75 * token_estimate)
    else:
        return int(token_estimate + 0.40 * token_estimate)


def print_completion_as_json(completion):
    json_measurement_info = json.loads(completion.choices[0].text.strip(" \n"))
    print(json.dumps(json_measurement_info, indent=2))


def pprint(json_measurement_info):
    print(json.dumps(json_measurement_info, indent=2))


t = "The BaCO3 that was produced had a crystallite dimension ranging from approximately 20 to 40 nm."
print(f" Token Count Estimate : {get_token_estimate(t)}")

completion = get_completion("GPT Completion Test  ", deployment_name)
print("GPT Completion Test : ", completion.choices[0].text.strip(" \n"))

get_completion_stats(completion)


 Token Count Estimate : 28
GPT Completion Test :  1. What is the capital of the United States?

Answer
Total Tokens: 22
Prompt Tokens: 6
Completion Tokens: 16
Total Cost (USD): 0.0004 cents


# 2.0 Prompt Definition / tuning

### 2.1 Entity Type Annotation : Process
1. Use Business Domain Knowledge
2. Use GPT models  to get candidate entity_type and iteratively test them out to find best match.
    - e.g What are the list of good entity_type for  chemical  "BaCO<sub>3 </sub>" in the following sentence ?
        - "The BaCO<sub>3 </sub>and CeO<sub>2 </sub>crystallites formed particles with a size of between about 5 and 50 microns."


In [11]:
get_good_entity_type_list_prompt = """What are the list of good  entity_type for   chemical  "BaCO<sub>3 </sub>"  in the following sentence ?
\n"The BaCO<sub>3 </sub>and CeO<sub>2 </sub>crystallites formed particles with a size of between about 5 and 50 microns." """
# GEL : Good Entity Type List
GEL_completion = get_completion(get_good_entity_type_list_prompt, deployment_name, max_tokens=4000 )   
get_completion_stats(GEL_completion)
print(f"\n Entity Type Annotation Question : {get_good_entity_type_list_prompt}")
print(GEL_completion.choices[0].text.strip(" \n"))    


Total Tokens: 105
Prompt Tokens: 71
Completion Tokens: 34
Total Cost (USD): 0.0021 cents

 Entity Type Annotation Question : What are the list of good  entity_type for   chemical  "BaCO<sub>3 </sub>"  in the following sentence ?

"The BaCO<sub>3 </sub>and CeO<sub>2 </sub>crystallites formed particles with a size of between about 5 and 50 microns." 
1. Chemical Compound 
2. Mineral 
3. Particle 
4. Crystal 
5. Oxide 
6. Element


### 2.2 Prompt Tuning for Entity Type: 
- Ran extraction over small random patent sample with entity_type variations
- & Choose best.
- Because of the time limitation, for this iteration, we have restricted the entity tuning to unit test cases only. Diff variations tried  is available in the raw notebook form but has been eliminated in this final notebook because it will make the notebook too long with raw intermediate steps.


### 2.2.1 Prompt Tuning : Options & Selection
1. Unit Test Based :
For the current iteration, the prompt tuning was limited to unit test capturing 5 different scenarios due to the time limitation.  Some of the failures encountered during the prompt tuning were
    - Failure to extract multiple chemicals in a sentence.
    - Failure to extract two chemicals with measurement values in same sentence.
        - e.g Sodium Hydroxide  and HCL were used and their corresponding concentration and and weight respectively  were 0.9 M and 9 percent respectively.
        - Extracted = chemicals : [Sodium Hydroxide, HCL].
    - Failure to identify each chemical  and their measurements individually.

2. Small Random Sample Based :
It is preferred to test the fine tuning of the prompt tuning based on random patent sample. However due to time limitation,  this step has been deferred.
3. Singular measurement value recorded  in both low & high _value instead of extra additional attribute. Because  because it will increased prompt length and completion space and hence the ROI is not justified given we can easily infer singular value when low==high 


##### 2.2.2 Prompt Tuning : Future Work : 
1. Define Extraction template in Json (easy for user to define, modify template) rather than string based
2. Try using only one example to reduce the prompt size and see its effect.


In [12]:

extraction_template = "\n```TypeScript"+\
""+\
"\nmeasurement and values: { // Measurement  information of Chemicals "+\
"\n chemical compound list: { // List of chemical or chemical's molecular formulae"+\
"\n  chemical compound: string // A Single Chemical's molecular formulae or chemical name"+\
"\n }"+\
"\ndimension: string // The mechanical property of the chemical being measured."+\
"\nmeasurement unit: string // Unit of measurement for the chemical."+\
"\nmeasured value low: string // Lower range value for the chemical compound."+\
"\nmeasured value high: string // Higher range value for the chemical compound."+\
"\n}"+\
"\n```"


extraction_samples = [ 
                    {
                    'example' : "The BaCO<sub>3 </sub>and CeO<sub>2 </sub>crystallites formed particles with a size of between about 5 and 50 microns.",
                    'extract': {"measurement and values" : 
                                        {"chemical compound list"              : [
                                                {'chemical compound': "BaCO<sub>3"},
                                                {'chemical compound': "CeO<sub>3"}
                                                ],
                                        "dimension"  : "crystallite size",
                                        "measurement unit"       : "nm",
                                        "measured value low"     : "20",
                                        "measured value high"    : "40"
                                        }
                                }
                    },                    
                    {
                #     'example' : "The BaCO<sub>3</sub>  is 5 percent.  Sodium hypochlorite dimension is 0.004 %",
                'example' : "Sodium Hydroxide  and HCL were used and their "+\
                            "corresponding concentration and and weight respectively "+\
                             " were 2 %  and 0.9 percent respectively",
                    'extract' : {"measurement and values" :                                
                                        [{"chemical compound list"   : [{"chemical compound": "Sodium Hydroxide"}],
                                          "dimension"  : "concentration",
                                          "measurement unit"       : "%",
                                          "measured value low"     : "2",
                                          "measured value high"    : "2"                                                        
                                        },
                                        {"chemical compound list"   : [{"chemical compound": "HCL"}],
                                          "dimension"  : "weight",
                                          "measurement unit"       : "percent",
                                          "measured value low"     : "0.9",
                                          "measured value high"    : "0.9"                                                        
                                        }
                                        ]
                                } 
                    },
            ]


        
# Instructions for LLM to perform a task
task_prompt = "Your goal is to extract structured information from the user's "+\
"input that matches the form described below. When extracting information "+\
"please make sure it matches the type information exactly. Do not add any attributes "+\
"that do not appear in the schema shown below."

output_instruction_prompt = "\nPlease output the extracted information in JSON format. "+\
"Do not output anything except for the extracted information. "+\
"Do not add any clarifying information. "+\
"Do not add any fields that are not in the schema. "+\
"If the text contains attributes that do not appear in the schema, please ignore them. "+\
"All output must be in JSON format and follow the schema specified above. "


def get_fewshot_prompt(extraction_samples):
    fewshot_prompt = "\nFollowing are some examples."
    for extraction_sample in extraction_samples:
        # print(extraction_sample)
        fewshot_prompt += "\nInput: " + extraction_sample["example"]
        fewshot_prompt += "\nOutput: " + json.dumps(extraction_sample["extract"]) + ""
    return fewshot_prompt


def get_data_extraction_prompt(input_doc_chunk):
    fewshot_prompt = (
        task_prompt
        + extraction_template
        + get_fewshot_prompt(extraction_samples)
        + output_instruction_prompt
        + "\n Please output the extracted information for the following"
        + "\nInput: "
        + input_doc_chunk
        + "\nOutput: "
    )
    return fewshot_prompt


def get_completion_from_text(text, max_tokens=None):
    measurement_prompt = get_data_extraction_prompt(text)
    if max_tokens is None:
        max_tokens = 4096 - get_token_estimate(measurement_prompt)

    print(max_tokens)

    completion = get_completion(measurement_prompt, deployment_name, max_tokens=max_tokens)
    get_completion_stats(completion)
    json_measurement_info = json.loads(completion.choices[0].text.strip(" \n"))
    return json_measurement_info


print(get_data_extraction_prompt("TESTING  PROMPT"))


Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.
```TypeScript
measurement and values: { // Measurement  information of Chemicals 
 chemical compound list: { // List of chemical or chemical's molecular formulae
  chemical compound: string // A Single Chemical's molecular formulae or chemical name
 }
dimension: string // The mechanical property of the chemical being measured.
measurement unit: string // Unit of measurement for the chemical.
measured value low: string // Lower range value for the chemical compound.
measured value high: string // Higher range value for the chemical compound.
}
```
Following are some examples.
Input: The BaCO<sub>3 </sub>and CeO<sub>2 </sub>crystallites formed particles with a size of between about 5 and 50 microns.
Output: {"measurement and v

### 2.2.3 Prompt Tuning based on Unit Tests :
- 2.1 In-Sentence Test:
    - 2.1.1 Single Chemical Info In Single Sentence Test :  All measurement  values including  a chemical names is in same sentence.
    - 2.1.2 Multi Chemical Info in  Single Sentence Test  :  Sentence contains multiple chemical names and their combined  single measurement value.
    - 2.1.3 Multi Chemical Multi-measurement In Single Sentence Test  :  Sentence contains multiple chemical names, with each chemical containing their corresponding measurement value.        
- 2.2 Inter-Sentence Test:
    - 2.2.1 Single Chemical Info across Multiple Sentences
    - 2.2.2 Multi Chemical Info across Multiple sentences
    - 2.2.3 Multi Chemical in Paragraph

#### 2.2.3.1 Single Chemical In Single Sentence Test : PASS

In [166]:
test_text = "The BaCO3 that was produced had a crystallite dimension ranging from approximately 20 to 40 nm."
measurement_json_str = get_completion_from_text(test_text)
pprint(measurement_json_str)


3524
Total Tokens: 616
Prompt Tokens: 553
Completion Tokens: 63
Total Cost (USD): 0.0123 cents
{
  "measurement and values": {
    "chemical compound list": [
      {
        "chemical compound": "BaCO<sub>3"
      }
    ],
    "dimension": "crystallite size",
    "measurement unit": "nm",
    "measured value low": "20",
    "measured value high": "40"
  }
}


#### 2.2.3.2 Multi Chemical In Single Sentence Test : PASS
- Multiple measurement of the chemicals available  in a single sentence.

In [68]:
MCSS_extraction_test_text = "Sodium Hydroxide  and HCL combined had the concentration of 10 % ."
MCSS_measurement_json_str = get_completion_from_text(MCSS_extraction_test_text)
pprint(MCSS_measurement_json_str)

3531
Total Tokens: 618
Prompt Tokens: 549
Completion Tokens: 69
Total Cost (USD): 0.0124 cents
{
  "measurement and values": [
    {
      "chemical compound list": [
        {
          "chemical compound": "Sodium Hydroxide"
        },
        {
          "chemical compound": "HCL"
        }
      ],
      "dimension": "concentration",
      "measurement unit": "%",
      "measured value low": "10",
      "measured value high": "10"
    }
  ]
}


In [62]:
MCSS_extraction_test_text = "Sodium Hydroxide with concentraion of 0.5 M was used and Hcl's weight percent was 5."
MCSS_measurement_json_str = get_completion_from_text(MCSS_extraction_test_text)
pprint(MCSS_measurement_json_str)

3528
Total Tokens: 672
Prompt Tokens: 557
Completion Tokens: 115
Total Cost (USD): 0.0134 cents
{
  "measurement and values": [
    {
      "chemical compound list": [
        {
          "chemical compound": "Sodium Hydroxide"
        }
      ],
      "dimension": "concentration",
      "measurement unit": "M",
      "measured value low": "0.5",
      "measured value high": "0.5"
    },
    {
      "chemical compound list": [
        {
          "chemical compound": "HCL"
        }
      ],
      "dimension": "weight",
      "measurement unit": "percent",
      "measured value low": "5",
      "measured value high": "5"
    }
  ]
}


#### 2.2.3.3 Multi Chemical In Single Complex Sentence Test : PASS
-  Sentence contains multiple chemical names, with each chemical containing their corresponding measurement value.
-  Sodium Hydroxide with concentraion of 0.5 M was used and NaOH 's weight percent was 5.

In [69]:
MCMM_single_sentence_extraction_test_text = (
    "Sodium Hydroxide  and HCL each had respective mass and  density of 0.9 M and 9 percent respectively."
)
MCMM_measurement_json_str = get_completion_from_text(MCMM_single_sentence_extraction_test_text)
pprint(MCMM_measurement_json_str)


3522
Total Tokens: 672
Prompt Tokens: 560
Completion Tokens: 112
Total Cost (USD): 0.0134 cents
{
  "measurement and values": [
    {
      "chemical compound list": [
        {
          "chemical compound": "Sodium Hydroxide"
        }
      ],
      "dimension": "mass",
      "measurement unit": "M",
      "measured value low": "0.9",
      "measured value high": "0.9"
    },
    {
      "chemical compound list": [
        {
          "chemical compound": "HCL"
        }
      ],
      "dimension": "density",
      "measurement unit": "percent",
      "measured value low": "9",
      "measured value high": "9"
    }
  ]
}


##### 2.2.3.4 Inter-Sentence Test: Single Chemical Info in Multiple Sentences : PASS
-  Chemical information is contained in multiple sentences.

In [70]:
SCMS_extraction_test_text = (
    "The chemical used was Sodium Hydroxide. "
    + "We looked at its crystallite size. "
    + "The low value was 0.4 nm and upper value was 20 nm. "
)
SCMS_measurement_json = get_completion_from_text(SCMS_extraction_test_text)
pprint(SCMS_measurement_json)


3510
Total Tokens: 631
Prompt Tokens: 566
Completion Tokens: 65
Total Cost (USD): 0.0126 cents
{
  "measurement and values": {
    "chemical compound list": [
      {
        "chemical compound": "Sodium Hydroxide"
      }
    ],
    "dimension": "crystallite size",
    "measurement unit": "nm",
    "measured value low": "0.4",
    "measured value high": "20"
  }
}


#### 2.2.3.5 Multi Chemical in Multiple Sentences : PASS
- Multiple chemical Information in multiple sentences.

In [71]:
MCMS_extraction_test_text = (
    "The chemical used was Sodium Hydroxide. "
    + "We looked at its crystallite size. "
    + "The low value was 0.4 nm and upper value was 20 nm. "
    + "The other chemical that was used was HCL. "
    + "It's concentration was 0.03 ppm."
)
MCMS_measurement_json = get_completion_from_text(MCMS_extraction_test_text)
pprint(MCMS_measurement_json)


3487
Total Tokens: 705
Prompt Tokens: 584
Completion Tokens: 121
Total Cost (USD): 0.0141 cents
{
  "measurement and values": [
    {
      "chemical compound list": [
        {
          "chemical compound": "Sodium Hydroxide"
        }
      ],
      "dimension": "crystallite size",
      "measurement unit": "nm",
      "measured value low": "0.4",
      "measured value high": "20"
    },
    {
      "chemical compound list": [
        {
          "chemical compound": "HCL"
        }
      ],
      "dimension": "concentration",
      "measurement unit": "ppm",
      "measured value low": "0.03",
      "measured value high": "0.03"
    }
  ]
}


#### 2.2.3.6 Multi Chemical in Paragraph : PASS
- Multiple chemical Information in paragraph with multiple sentences in between the chemical and their measurement value.

In [72]:
MCP_extraction_test_text = (
    "Two chemicals were used  Sodium Hydroxide and HCL. "
    + "Sodium hydroxide is commonly known as lye or caustic soda, used in the production of soaps, detergents, and paper, as well as in the manufacturing of textiles and various chemical processes. "
    + "Hydrochloric acid is used in the production of PVC, as well as in the pickling of steel and the cleaning of various surfaces."
    + "We looked at the crystallite size  and concentration of those chemicals respectively "
    + " and observed them  between  0.4 nm and 20 nm, and 0.03 ppm respectively. "
)
MCP_measurement_json = get_completion_from_text(MCP_extraction_test_text, max_tokens=4096 - 696 - 100)
pprint(MCP_measurement_json)


3300
Total Tokens: 777
Prompt Tokens: 656
Completion Tokens: 121
Total Cost (USD): 0.0155 cents
{
  "measurement and values": [
    {
      "chemical compound list": [
        {
          "chemical compound": "Sodium Hydroxide"
        }
      ],
      "dimension": "crystallite size",
      "measurement unit": "nm",
      "measured value low": "0.4",
      "measured value high": "20"
    },
    {
      "chemical compound list": [
        {
          "chemical compound": "HCL"
        }
      ],
      "dimension": "concentration",
      "measurement unit": "ppm",
      "measured value low": "0.03",
      "measured value high": "0.03"
    }
  ]
}


# 3.0 GPT Models for large Documents: 
- Limited Token allowed : max 4097  allowed.

### 3.1 Large Document Handling Options & Selection
0. Map-Reduce:  BF  : DONE
    - Chunk Document into N chunks.
    - For each chunk, perform extraction.

1. Narrow Search Space :
- 1.0 GPT Document Filter : DONE
    - Filter documents using text from from fields :  title, abstract
    - Class found by asking OpenAI ?
    - Pass title, abstract  with prompt: Categorise documents into classes:  "Chemical Literature or Chemistry Literature"
- 1.1 : Field Based Filter : DONE
    - name vs chemical industry related person classificaiton
        - <inventor>, parties, <applicants>, <deceased-inventor>
        - <orgname>BASF Corporation</orgname>
        - prompt GPT for classification linked in profile or google search result vs  "Chemist or Chemistry Professional"
        - 
- 1.2 Spacy:     
    - Extract candidate relevant chemicals. 
    - Feed only those para or sentence containing those  chemicals to Completion

- 1.3 Text- embedding:  TESTED & Excluded
    - Use text-embedding to filter only the sentences or paragraph containing chemical names.
    - Feed only those sentences or para into GPT completion





### 3.2 Split Large File : Split all-in-one patent into multiple indv patent files

In [13]:
def split_patent_file_into_indv_patent_files(large_patent_filepath):
    output_file_name_prefix = "Data/basf_ipg110920/patent_{}.xml"
    file_num = 1

    with open(large_patent_filepath) as f:
        lines = f.readlines()
    indv_filetext = " ".join(lines).split('<?xml version="1.0" encoding="UTF-8"?>')

    for file_num, xml in enumerate(indv_filetext):
        f = open(output_file_name_prefix.format(file_num), "w")
        f.write('<?xml version="1.0" encoding="UTF-8"?>' + xml)
        f.close()


large_patent_filepath = "Data/basf_ipg110920.xml"
split_patent_file_into_indv_patent_files(large_patent_filepath)

print("COMPLETED : Splitting patent file into individual patent files")


COMPLETED : Splitting patent file into individual patent files


### 3.3 Filter Patents :  Narrow Search Space
- Patent Classification (Prompt Based)
        - Read indv patent one at a time 
    - Filter only the patent that is related to chemistry.
- Patent Kind Exclusion / Inclusion : <Kind>
    - Exclude:
        - P - Plant Patent Grant issued prior to January 2, 2001
            - 210765:<kind>P2</kind>, 211172:<kind>P2</kind>, 211496:<kind>P2</kind>, 211783:<kind>P2</kind>, 4661824:<kind>P1</kind>
        - PLT -  indicates plant patent
            <main-classification>PLT420</main-classification>
        - S1 : Design Patent  : Ornamental design, including shape, color, texture & surface ornamentation. Commonly used in fashion, furniture
            - <kind>S1</kind>
    - Can not Exclude : Kind
        - H : SIR can contain computer aided design  of chemical compounds or simulation of chemical reaction. Hence not excluded
        - T : Defensive publication. Excluded because it can include information on chemical compounds, such as their chemical structure, physical and chemical properties, and methods for their preparation and characterization.
        - A1 - Utility Patent Grant (Chemistry literature)
        - B - Reexamination Certificate issued prior to January 2, 2001
        - E - Reissue Patent	



        




## 4. Input Data & PreProcessing
- Read Patent File 
- Extract Relevenat fields from the xml

In [13]:
from os import listdir
from os.path import isfile, join
from bs4 import BeautifulSoup
import pandas as pd
import xml.etree.ElementTree as ET
from xml.etree import ElementTree


def get_all_files(directory):
    patent_files = [join(directory, f) for f in listdir(directory) if isfile(join(directory, f))]
    return patent_files


def get_text_only_from_xml_element(patent_file, xml_element_path):
    try:
        tree = ElementTree.parse(patent_file)
        root = tree.getroot()
        xml_elements = root.find(xml_element_path)
        if len(xml_elements.text.strip()) > 0:
            return xml_elements.text

        text = ""
        for elem in xml_elements:
            if elem.text:
                text += elem.text + " \n"
            else:
                print("Failed extracting text from element for ", elem)
        return text
    except:
        return False


def get_text_and_sub_xml_from_xml(filepath, tag):
    with open(filepath, "r") as f:
        patent_file = f.read()

    soup = BeautifulSoup(patent_file, "xml")
    text_including_xml = soup.findAll(tag)
    return text_including_xml


def get_brief_description(description_text):
    """Description always starts with Brief Description. 80% of the time, There is "DETAILED DESCRIPTION" text separating brief and detailed description"""
    if "DETAILED DESCRIPTION" in description_text:
        return description_text.split("DETAILED DESCRIPTION")[0]

    # If "DETAILED DESCRIPTION" is not avilable as a separator, tree first 10000 chars are brief description
    return description_text[:10000]


patent_file = "Data/basf_ipg110920/patent_2606.xml"
cur_patent_kind = get_text_only_from_xml_element(
    patent_file, "./us-bibliographic-data-grant/publication-reference/document-id/kind"
)
invention_title = get_text_only_from_xml_element(patent_file, "./us-bibliographic-data-grant/invention-title")
abstract_text = get_text_only_from_xml_element(patent_file, "./abstract")
invention_claim_text = str(get_text_and_sub_xml_from_xml(patent_file, "claims")[0])
# description = get_text_only_from_xml_element(patent_file, './description')
# brief_description = get_brief_description(description)

print(f"\n--- Document Kind: {cur_patent_kind}")
print(f"\n--- Invention title: {invention_title}")
print(f"\n--- Abstract_text : {abstract_text}")
print(f"\n--- Invention Claim text : {invention_claim_text[:300]}")
# print(f"\n--- Brief_description: {brief_description[:300]}")

all_patent_files = get_all_files("Data/basf_ipg110920/")
all_patent_files[:4]



--- Document Kind: B1

--- Invention title: Apparatus for installation of electrical floor boxes

--- Abstract_text : An electrical floor box assembly for installation in a floor structure includes an electrical floor box having a plurality of sidewalls and at least one clamp device attached to a sidewall so that the floor box may be mounted to a raised floor structure or leveled atop a support surface using the clamp device. The clamp device includes a clamp body, a threaded rod mounted for rotation along a longitudinal axis within the clamp body, and a clamp arm threadingly engaging the threaded rod so that it is movable along the rod in association with rotation thereof and can be moved into engagement with the undersurface of the floor structure to secure the floor box in place. The clamp device may further include a leveling subassembly for installation of the electrical floor box onto a support surface prior to construction of the floor around the leveled floor box. 


--- Inven

['Data/basf_ipg110920/patent_2606.xml',
 'Data/basf_ipg110920/patent_3280.xml',
 'Data/basf_ipg110920/patent_4464.xml',
 'Data/basf_ipg110920/patent_2249.xml']

### 4.1 Data PreProcessing : Patent XMl 
- Get Patent Text :
- Get clean text where possible. If xml follows nested  and non-uniform structure across patents, then extract text + non-uniform xml. This non-standardised xml will be cleaned later with xml tag removal step.
- Excluded for the current iteration : 
    - Tables
    - Maths
    - Tables Excluded for the current iteratrion. Separate table  specific  parser or extraction logic to be developed  in future.
-

In [19]:
import re


def get_all_tags(text):
    regex = r"<[^\s^>]*"
    matches = re.finditer(regex, text, re.MULTILINE)
    matching_tag = set()
    for matchNum, match in enumerate(matches, start=1):
        matching_tag.add(match.group())

    return matching_tag


def clean_xml_section(text):
    remove_sections = {
        "remove_table_regex": r"<table(.|\n)*?</table>",
        "remove_maths_regex": r"<maths(.|\n)*?</maths>",
        "remove_heading_start_tag_regex": r"<heading(.|\n)*?>",
        "remove_figref_start_tag_regex": r"<figref(.|\n)*?>",
        "remove_description_start_tag_regex": r"<description(.|\n)*?>",
        "remove_description-of-drawings_start_tag_regex": r"<description-of-drawings(.|\n)*?>",
        "remove_<?_start_tag_regex": r"<\?.*>",
        "remove_p_start_tag_regex": r"<p(.|\n)*?>",
        "remove_heading_end_tag_regex": r"</heading>",
        "remove_p_end_tag_regex": r"</p>",
        "remove_figref_end_tag_regex": r"</figref>",
        "remove_description_end_tag_regex": r"</description>",
        "remove_description-of-drawings_end_tag_regex": r"</description-of-drawings>",
        "remove_table_end_tag_regex": r"</table>",
        "remove_table_end_tag_regex1": r"</tables>",
        "remove_br_end_tag_regex": r"<br/>",
    }
    for _, remove_regex in remove_sections.items():
        text = re.sub(remove_regex, "", text, 0, re.MULTILINE)

    return text


chemical_patent_file = "Data/basf_ipg110920/patent_2275.xml"
basf_chemical_description_text = str(get_text_and_sub_xml_from_xml(chemical_patent_file, "description")[0])
basf_chemical_description_text = clean_xml_section(basf_chemical_description_text)

remaining_xml_tags = get_all_tags(basf_chemical_description_text)
print(remaining_xml_tags)


{'<i', '</sup', '</i', '</sub', '<sub', '<sup'}


## 5. Zero-Shot Document Classifier 

### 5.1 Chemical Literature Classifier : GPT Based
- Classify if a literature is a chemical literature based on the given text.
- FewShot Classifier
- Document Classifier to be used for Exclusion because there are 5000+ indv patents in a zip file to process
- Candidate Field that can be used for classification:
    - Keyword based
    - <invention-title id="d2e53">NOx storage materials and traps resistant to thermal aging</invention-title>
    - abstract
    - ?BRFSUM description="Brief Summary" end="lead"?
    - <heading id="h-0014" level="1">NOx Storage Capacity Testing</heading>
    - <claims id="claims"> - n claims. Claims could have measurements.
        <claim id="CLM-00001" num="00001">
        <claim-text>1. A method of making a nitrogen oxide storage material, comprising mixing a solution of barium with ceria, spray drying the solution of barium mixed with ceria to obtain a solid mixture of barium and ceria, and heating the solid mixture to obtain a material comprising ceria particles having barium supported thereon.</claim-text>
    - <?brief-description-of-drawings description="Brief Description of Drawings" end="lead"?>    
    - <orgname>BASF Corporation</orgname>
    - </citation>
        <othercit>Piacentini, M. et al., &#x201c;Supported Pt-Ba NO<sub>x </sub>storage-reduction catalysts: Influence of support and Ba loading on stability and storage efficiency of Ba-containing species&#x201d;, <i>Applied Catalysis B: Environmental 66</i>, (2006), 126-136 pgs.</othercit>
        

### 5.1.1 Prompt Label Extraction

In [15]:
get_good_classification_prompt = (
    """What is a good prompt for classification into "Chemical Literature" & "Not Chemical Literature" class. """
)
# GEL : Good Entity Type List
classification_completion = get_completion(get_good_classification_prompt, deployment_name, max_tokens=4000)
get_completion_stats(classification_completion)
print(f"\n Question : {get_good_classification_prompt}")
print(classification_completion.choices[0].text.strip(" \n"))


Total Tokens: 39
Prompt Tokens: 22
Completion Tokens: 17
Total Cost (USD): 0.0008 cents

 Question : What is a good prompt for classification into "Chemical Literature" & "Not Chemical Literature" class. 
"Classify the following article as either Chemical Literature or Not Chemical Literature."


### 5.1.2 Zero-Shot Document Classifier Test : PASS
- Unit Test

In [16]:
# Instructions for LLM to perform a task
classification_task_prompt = (
    "Your goal is to classify the user's input as either 'Chemical Literature' or 'Not Chemical Literature' . "
)

classification_template = (
    "\n```TypeScript"
    + ""
    + "\n{"
    + '\n"class": string // Either "Chemical Literature" or "Not Chemical Literature" class .'
    + '\n"confidence": float // Confidence percentage for the above class .'
    + "\n}"
    + "\n```"
)
classification_output_instruction_prompt = (
    "\nPlease output the class in JSON format. "
    + "The output must follow the schema specified above exactly.Do not add any extra information. "
)


def get_classification_from_text(text, max_tokens=4096 - 1000):
    prompt = (
        classification_task_prompt
        + classification_template
        + classification_output_instruction_prompt
        + "\n Please output the class in JSON format for the following"
        + "\n "
        + text
    )
    completion = get_completion(prompt, deployment_name, max_tokens=max_tokens)
    get_completion_stats(completion)
    print(completion.choices[0].text.strip(" \n"))
    json_measurement_info = json.loads(completion.choices[0].text.replace("```", "").strip(" \n"))
    return json_measurement_info


chemical_patent_file = "Data/basf_ipg110920/patent_2275.xml"
chemical_abstract_text = get_text_only_from_xml_element(chemical_patent_file, "./abstract")
non_chemical_patent_file = "Data/basf_ipg110920/patent_411.xml"
non_chemical_abstract_text = get_text_only_from_xml_element(non_chemical_patent_file, "./abstract")

print(f"chemical_abstract_text Classification :  {get_classification_from_text(chemical_abstract_text, 4096-500)}")
print(
    f"non_chemical_abstract_text Classification :  {get_classification_from_text(non_chemical_abstract_text, 4096-500)}"
)


Total Tokens: 257
Prompt Tokens: 233
Completion Tokens: 24
Total Cost (USD): 0.0051 cents
```
{
"class": "Chemical Literature",
"confidence": 100
}
```
chemical_abstract_text Classification :  {'class': 'Chemical Literature', 'confidence': 100}
Total Tokens: 137
Prompt Tokens: 113
Completion Tokens: 24
Total Cost (USD): 0.0027 cents
```
{
"class": "Not Chemical Literature",
"confidence": 100
}
```
non_chemical_abstract_text Classification :  {'class': 'Not Chemical Literature', 'confidence': 100}


## 6. Cost Estimate: Extraction ($1000)
- Estimating cost of extraction for all patents in the zip file
- Extraction as-is without filter Estimate : $1000 dollor
    - Total Cost (USD): $0.2 (0.0487* 4) cents per patent


### 6.1 Cost Reduction Estimate:  Classifier ($25 - $200)
- Classification Cost Estimate using different variations : $200
    - all attributes (title + abstract + invention_claim + brief_description ) : 0.04 per patent  = 5000*0.04 = $200 
    - abstract only: 25$
    - invention_title+ abstract  = 0.005*5000 = 25$
    - invention_title+ abstract + claim_text = 0.04*5000 = 200$
    - 3000 word brief description per request.




In [24]:
chemical_patent_file = "Data/basf_ipg110920/patent_2275.xml"
invention_title = get_text_only_from_xml_element(chemical_patent_file, "./us-bibliographic-data-grant/invention-title")
abstract_text = get_text_only_from_xml_element(chemical_patent_file, "./abstract")
invention_claim_text = str(get_text_and_sub_xml_from_xml(chemical_patent_file, "claims")[0])
description = get_text_only_from_xml_element(chemical_patent_file, "./description")
brief_description = get_brief_description(description)


chemical_brief_desc_text = invention_title + " " + abstract_text + " " + invention_claim_text + " " + brief_description
chemical_brief_desc_text = chemical_brief_desc_text
# get_token_estimate(text[:3900])
print(
    f"chemical_abstract_text Classification :  {get_classification_from_text(chemical_brief_desc_text, max_tokens=30)}"
)
print(f"CLASSIFICATION COST ESTIMATE per patent @ 0.02 per patent = {5000*0.07} dollors")


Total Tokens: 3565
Prompt Tokens: 3547
Completion Tokens: 18
Total Cost (USD): 0.0713 cents
{
"class": "Chemical Literature",
"confidence": 100
}
chemical_abstract_text Classification :  {'class': 'Chemical Literature', 'confidence': 100}
CLASSIFICATION COST ESTIMATE per patent @ 0.02 per patent = 350.00000000000006 dollors


In [399]:
invention_title = get_text_only_from_xml_element(chemical_patent_file, "./us-bibliographic-data-grant/invention-title")
abstract_text = get_text_only_from_xml_element(chemical_patent_file, "./abstract")
invention_claim_text = str(get_text_and_sub_xml_from_xml(chemical_patent_file, "claims")[0])
description = get_text_only_from_xml_element(chemical_patent_file, "./description")
brief_description = get_brief_description(description)


print(f"_abstract_text Classification :  {get_classification_from_text(invention_title+ abstract_text , 30)}")


xml_elements.text  NOx storage materials and traps resistant to thermal aging
Total Tokens: 257
Prompt Tokens: 243
Completion Tokens: 14
Total Cost (USD): 0.0051 cents
{"class": "Chemical Literature", "confidence": 100}
_abstract_text Classification :  {'class': 'Chemical Literature', 'confidence': 100}


In [400]:
print(
    f"_abstract_text Classification :  {get_classification_from_text(invention_title+ abstract_text+invention_claim_text , 30)}"
)


Total Tokens: 2026
Prompt Tokens: 2007
Completion Tokens: 19
Total Cost (USD): 0.0405 cents
{
"class": "Chemical Literature",
"confidence": 100
}
_abstract_text Classification :  {'class': 'Chemical Literature', 'confidence': 100}


### 6.2 Extraction Cost Estimate per Patent
- Sample Basf 1 File  & its Cost Estimate

In [387]:
print(f"No. of words in BASF chemical description : {len(basf_chemical_description_text.split())}")
print(f"No of Characters : {len(basf_chemical_description_text)}")
print(f"Estimated no. of tokens : {get_token_estimate( basf_chemical_description_text)}")


basf_measurement_json = get_completion_from_text(basf_chemical_description_text[:12000], max_tokens=1000)
pprint(basf_measurement_json)


1000
Total Tokens: 3204
Prompt Tokens: 3083
Completion Tokens: 121
Total Cost (USD): 0.0641 cents
{
  "measurement and values": [
    {
      "chemical compound list": [
        {
          "chemical compound": "BaCO<sub>3"
        }
      ],
      "dimension": "crystallite size",
      "measurement unit": "nm",
      "measured value low": "10",
      "measured value high": "20"
    },
    {
      "chemical compound list": [
        {
          "chemical compound": "CeO<sub>2"
        }
      ],
      "dimension": "crystallite size",
      "measurement unit": "nm",
      "measured value low": "20",
      "measured value high": "40"
    }
  ]
}


In [389]:
basf_measurement_json = get_completion_from_text(basf_chemical_description_text[12000:20000], max_tokens=1000)
pprint(basf_measurement_json)

1000
Total Tokens: 2641
Prompt Tokens: 2323
Completion Tokens: 318
Total Cost (USD): 0.0528 cents
{
  "measurement and values": [
    {
      "chemical compound list": [
        {
          "chemical compound": "BaCO<sub>3"
        }
      ],
      "dimension": "NOx storage material",
      "measurement unit": "g/ft<sup>3</sup>",
      "measured value low": "20",
      "measured value high": "200"
    },
    {
      "chemical compound list": [
        {
          "chemical compound": "CeO<sub>2"
        }
      ],
      "dimension": "surface area",
      "measurement unit": "m<sup>2</sup>/g",
      "measured value low": "5",
      "measured value high": "350"
    },
    {
      "chemical compound list": [
        {
          "chemical compound": "ZrO<sub>2"
        }
      ],
      "dimension": "doping percentage",
      "measurement unit": "%",
      "measured value low": "10",
      "measured value high": "30"
    },
    {
      "chemical compound list": [
        {
          "chemic

In [390]:
basf_measurement_json = get_completion_from_text(basf_chemical_description_text[20000:30000], max_tokens=1000)
pprint(basf_measurement_json)


1000
Total Tokens: 3393
Prompt Tokens: 2960
Completion Tokens: 433
Total Cost (USD): 0.0679 cents
{
  "measurement and values": [
    {
      "chemical compound list": [
        {
          "chemical compound": "BaCO<sub>3"
        },
        {
          "chemical compound": "CeO<sub>2"
        }
      ],
      "dimension": "crystallite size",
      "measurement unit": "nm",
      "measured value low": "20",
      "measured value high": "50"
    },
    {
      "chemical compound list": [
        {
          "chemical compound": "Pt"
        },
        {
          "chemical compound": "Rh"
        }
      ],
      "dimension": "concentration",
      "measurement unit": "wt %",
      "measured value low": "1.8",
      "measured value high": "1.8"
    },
    {
      "chemical compound list": [
        {
          "chemical compound": "Pd"
        }
      ],
      "dimension": "concentration",
      "measurement unit": "wt %",
      "measured value low": "1.4",
      "measured value high":

In [391]:
basf_measurement_json = get_completion_from_text(basf_chemical_description_text[30000:], max_tokens=1000)
pprint(basf_measurement_json)

1000
Total Tokens: 2434
Prompt Tokens: 1940
Completion Tokens: 494
Total Cost (USD): 0.0487 cents
{
  "measurement and values": [
    {
      "chemical compound list": [
        {
          "chemical compound": "BaCO<sub>3"
        }
      ],
      "dimension": "crystallite size",
      "measurement unit": "nm",
      "measured value low": "20",
      "measured value high": "40"
    },
    {
      "chemical compound list": [
        {
          "chemical compound": "CeO<sub>2"
        }
      ],
      "dimension": "crystallite size",
      "measurement unit": "nm",
      "measured value low": "10",
      "measured value high": "20"
    },
    {
      "chemical compound list": [
        {
          "chemical compound": "BaCO<sub>3"
        }
      ],
      "dimension": "crystallite size",
      "measurement unit": "nm",
      "measured value low": "20",
      "measured value high": "150"
    },
    {
      "chemical compound list": [
        {
          "chemical compound": "CeO<sub>2"


## 7. Cost Optimisation : How to Lower 1000$ bill ?
- Questions:
    - 1. Do we have to run the extraction to be run for all the 5000+ patents contained in the zip file?
    - 2. What is the acceptable cost range for the task?
- Answers Obtained : 
    - Sample run acceptable
    - Max 250 USD.

### 7.1 Cost Optimation : Filter + 10 Sample Chemical related Patents
1. Kind Code Filter :  DONE : 414 Filtered
2. IPC Classification Based Filter :  See Following Cells
3. Document Classifier : Built but not used for now  because run across 10-100 sample is sufficient
4. Embedding based  zero-shot classification  : 
    - Initial Development did not promise good result. More effort required and hence was stashed, as sample run was acceptable.
    - Deferred as 10-100 patent run is Ok, as per the email
    - Process
        - Prompt Label Tuning to choose "chemical compound" & "non-chemical" representation. To Maximise score Difference between chemical and non-chemical when used later
        - Document Chunking & classification
        - Only Use Document Chunks containing "Chemical Compound" information

Ref : https://github.com/openai/openai-cookbook/blob/main/examples/Zero-shot_classification_with_embeddings.ipynb




## 8. Data Extraction

### 8.1 Document Filter : : IPC Code + Kind tag
- Document Kind based
    - P,P1, P2,S1 are non chemical patents
    - Excerpt for kind code
        - P - Plant Patent  e.g 210765:\<kind\>P2</kind>, 211172:<kind>P2</kind>, 211496:<kind>P2</kind>, 211783:<kind>P2</kind>, 4661824:<kind>P1</kind>
        - S1 : Design Patent  : Ornamental design, including shape, color, texture & surface ornamentation. Commonly used in fashion, furniture e.g <kind>S1</kind>
- IPC Code Excerpt : 
    - Section Title – The section title is to be considered as a very broad indication of the contents of the section. The eight sections are entitled as follows:
        - A HUMAN NECESSITIES
        - B PERFORMING OPERATIONS; TRANSPORTING
        - C CHEMISTRY; METALLURGY
        - D TEXTILES; PAPER
        - E FIXED CONSTRUCTIONS
        - F MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS;
        - BLASTING
        - G PHYSICS
        - H ELECTRICITY

Ref: https://www.wipo.int/edocs/pubdocs/en/wipo-guide-ipc-2023-en-guide-to-the-international-patent-classification-2023.pdf

In [169]:
def is_non_chemical_literature_based_on_kind_code(patent_file_path):
    # doc_kind = patent_xml_root.find('./us-bibliographic-data-grant/publication-reference/document-id/kind').text
    cur_patent_doc_kind = get_text_only_from_xml_element(
        patent_file_path, "./us-bibliographic-data-grant/publication-reference/document-id/kind"
    )
    plant_patent_kind = ["P", "P1", "P2"]
    design_patent_kind = ["S1"]
    if cur_patent_doc_kind in plant_patent_kind + design_patent_kind:
        return True

    return False


def is_chemical_literature_based_on_ipc_code(patent_file_path):
    ipc_classification_code = set()
    tree = ElementTree.parse(patent_file_path)
    root = tree.getroot()

    cur_patent_doc_ipc_codes = root.find("./us-bibliographic-data-grant/classifications-ipcr")
    for ipc_code in cur_patent_doc_ipc_codes:
        ipc_classification_code.add(ipc_code.find("section").text)

    ipc_chemistry_metalulrgy_code = "C"
    if ipc_chemistry_metalulrgy_code in ipc_classification_code:
        return True

    return False


def is_chemical_literature_related_patent_brief_description_classification(patent_file_path):
    """Classifier to classify if a patent is chemical literature  based on brief description."""

    # invention_title = get_text_only_from_xml_element(patent_file_path, './us-bibliographic-data-grant/invention-title')
    # abstract_text = get_text_only_from_xml_element(patent_file_path, './abstract')
    # invention_claim_text = str(get_text_and_sub_xml_from_xml(patent_file_path, 'claims')[0])
    # description = get_text_only_from_xml_element(patent_file_path, './description')
    # brief_description = get_brief_description(description)
    # return True
    pass


def classify_patent(all_patent_files):
    patent_classification_df = pd.DataFrame(columns=["patent_file_name", "title", "classification", "comment"])
    for patent_file_path in all_patent_files:
        try:
            tree = ET.parse(patent_file_path)
            xml_root = tree.getroot()

            if is_non_chemical_literature_based_on_kind_code(patent_file_path):
                invention_title = get_text_only_from_xml_element(
                    patent_file_path, "./us-bibliographic-data-grant/invention-title"
                )
                patent_classification_df.loc[len(patent_classification_df)] = [
                    patent_file_path,
                    invention_title,
                    "Not Chemical Literature",
                    "Kind Code Based",
                ]
                continue

            if is_chemical_literature_based_on_ipc_code(patent_file_path):
                invention_title = get_text_only_from_xml_element(
                    patent_file_path, "./us-bibliographic-data-grant/invention-title"
                )
                patent_classification_df.loc[len(patent_classification_df)] = [
                    patent_file_path,
                    invention_title,
                    "Chemical Literature",
                    "IPC Code Based",
                ]
                continue

            # if not invention_title:
            #     print(f'Failed XML parsing for Invention title  for {patent_file_path}')

            patent_classification_df.loc[len(patent_classification_df)] = [patent_file_path, "", "", ""]
        except:
            patent_classification_df.loc[len(patent_classification_df)] = [
                patent_file_path,
                "",
                "",
                "XML Parsing Error",
            ]
            # print("Failed Processing Patent File ", patent_file_path)
            continue

    return patent_classification_df


all_patent_files = get_all_files("Data/basf_ipg110920/")
patent_classification_df = classify_patent(all_patent_files)
patent_classification_df.head()


Unnamed: 0,patent_file_name,title,classification,comment
0,Data/basf_ipg110920/patent_2606.xml,,,
1,Data/basf_ipg110920/patent_3280.xml,,,
2,Data/basf_ipg110920/patent_4464.xml,,,
3,Data/basf_ipg110920/patent_2249.xml,,,
4,Data/basf_ipg110920/patent_2622.xml,,,


 <!-- <invention-title id="d2e53">NOx storage materials and traps resistant to thermal aging</invention-title>
- abstract
- ?BRFSUM description="Brief Summary" end="lead"?
- <heading id="h-0014" level="1">NOx Storage Capacity Testing</heading>
- <claims id="claims"> - n claims. Claims could have measurements.
    <claim id="CLM-00001" num="00001">
    <claim-text>1. A method of making a nitrogen oxide storage material, comprising mixing a solution of barium with ceria, spray drying the solution of barium mixed with ceria to obtain a solid mixture of barium and ceria, and heating the solid mixture to obtain a material comprising ceria particles having barium supported thereon.</claim-text>
- <?brief-description-of-drawings description="Brief Description of Drawings" end="lead"?>    
- <orgname>BASF Corporation</orgname>
- </citation>
    <othercit>Piacentini, M. et al., &#x201c;Supported Pt-Ba NO<sub>x </sub>storage-reduction catalysts: Influence of support and Ba loading on stability and storage efficiency of Ba-containing species&#x201d;, <i>Applied Catalysis B: Environmental 66</i>, (2006), 126-136 pgs.</othercit>
         -->

In [170]:
non_chemical_patents = patent_classification_df[patent_classification_df.classification == "Not Chemical Literature"]
chemical_patents = patent_classification_df[patent_classification_df.classification == "Chemical Literature"]
no_status_chemical_patents = patent_classification_df[patent_classification_df.classification == ""]


print(f"Exclude Patent : No. of Non-Chemical Literature Patent : {len(non_chemical_patents)}")
print(f"No. of Chemical Literature Patent : {len(chemical_patents)}")
print(f"No Status Found for  Literature Patent Count : {len(no_status_chemical_patents)}")


Exclude Patent : No. of Non-Chemical Literature Patent : 414
No. of Chemical Literature Patent : 439
No Status Found for  Literature Patent Count : 4266


### 8.2 Chemical Patent Filter : Random 50  + BASF example one
- Sampling 50 Chemical Patents to focus extraction on them only.

In [107]:
chemical_patents['patent_file_name'].sample(50)

2056    Data/basf_ipg110920/patent_2413.xml
4669    Data/basf_ipg110920/patent_2395.xml
103     Data/basf_ipg110920/patent_2414.xml
1064    Data/basf_ipg110920/patent_2399.xml
1213    Data/basf_ipg110920/patent_1843.xml
1157    Data/basf_ipg110920/patent_2487.xml
2916    Data/basf_ipg110920/patent_2475.xml
2472    Data/basf_ipg110920/patent_2106.xml
728     Data/basf_ipg110920/patent_2109.xml
2524    Data/basf_ipg110920/patent_2420.xml
2939    Data/basf_ipg110920/patent_2579.xml
2370    Data/basf_ipg110920/patent_2503.xml
4818    Data/basf_ipg110920/patent_2153.xml
334      Data/basf_ipg110920/patent_619.xml
4645    Data/basf_ipg110920/patent_2520.xml
2073    Data/basf_ipg110920/patent_2513.xml
116     Data/basf_ipg110920/patent_1953.xml
687     Data/basf_ipg110920/patent_2467.xml
2846    Data/basf_ipg110920/patent_2585.xml
3128    Data/basf_ipg110920/patent_1803.xml
2094    Data/basf_ipg110920/patent_1886.xml
2593    Data/basf_ipg110920/patent_2129.xml
2787    Data/basf_ipg110920/pate

### 8.2 Data Extraction for N files :
- Create  extraction method for a file.
- Calculate Document Chunk amount.
     - Document_token ~ 4096 - 1000(completion_token) - prompt_token

#### 8.2.1  Document Chunker for Extraction:
- Document Chunker required because GPT model have token restrictions
- Vanilla Document Chunker implemented.
- Intelligent Document Chunker : ToDo Later
    - ChunK by proper paragraph, so that  we do not pass on improperly broken paragraph  or sentences.
    - incremental paragraph addition to document chunk until  document_chunk_limit is reached.

In [173]:
import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")


def get_token_count(text):
    return len(tokenizer.encode(text))


print(f"Token count for 'This is testing' : {get_token_count('This is testing. ')}")

chemical_patent_file = "Data/basf_ipg110920/patent_2275.xml"
basf_chemical_description_text = str(get_text_and_sub_xml_from_xml(chemical_patent_file, "description")[0])
basf_chemical_description_text = clean_xml_section(basf_chemical_description_text)

remaining_xml_tags = get_all_tags(basf_chemical_description_text)
print("Validate if non-expected xml tags still is in description text ", remaining_xml_tags)


extraction_prompt = get_data_extraction_prompt("TESTING  PROMPT")
print(f"You have {get_token_count(extraction_prompt)} token in LLM inmput prompt.")
print(f"Your completion prompt token estimate is 1000.")

print(f"You can provide approximately provide 12K characters as a document chunk input")
print(
    f"Because 1000(completion_token)+ 504(Prompt Token) + (12K ~ 2452)Document_token = 3956 token must be within 4097 max token limit"
)


You have 504 token in LLM inmput prompt.
Your completion prompt token estimate is 1000.
You can provide approximately provide 12K characters as a document chunk input
Because 1000(completion_token)+ 504(Prompt Token) + (12K ~ 2452)Document_token = 3956 token must be within 4097 max token limit


In [299]:
def append_measurement_values(cur_doc_measurement, new_measurement_json):
    try:
        if type(new_measurement_json["measurement and values"]) is dict:
            cur_doc_measurement["measurement and values"].append(new_measurement_json["measurement and values"])
        else:
            cur_doc_measurement["measurement and values"] += new_measurement_json["measurement and values"]
    except:
        print("Failed appending measurement values  ", new_measurement_json)

    return cur_doc_measurement


def write_json_to_file(json_text, output_file_path):
    json_obj = json.dumps(json_text, indent=4)

    with open(output_file_path, "w") as outfile:
        outfile.write(json_obj)


def merge_and_write_json_outputs_into_single_output_file(
    measurement_json_arr, doc_number, invention_title, output_file_path
):
    cur_doc_measurement = {}
    cur_doc_measurement["meta"] = {"doc_number": doc_number, "invention_title": invention_title}
    cur_doc_measurement["measurement and values"] = []

    for each_measurement_output in measurement_json_arr:
        cur_doc_measurement = append_measurement_values(cur_doc_measurement, each_measurement_output)

    write_json_to_file(cur_doc_measurement, output_file_path)


def get_all_measurement_values_from_patent_file(chemical_patent_file, deployment_name, doc_span=10000):
    print(f"Extracting Entity for {chemical_patent_file} :")
    total_cost_per_file = 0
    all_measurement_json = []

    # Get all  text within the description tag & clean it
    cur_chemical_description_text = str(get_text_and_sub_xml_from_xml(chemical_patent_file, "description")[0])
    cur_chemical_description_text = clean_xml_section(cur_chemical_description_text)
    remaining_xml_tags = get_all_tags(cur_chemical_description_text)
    # print(f"{}remaining_xml_tags")

    cur_doc_len = len(cur_chemical_description_text)
    for doc_end in range(0, cur_doc_len, doc_span):
        print(f" ... Doc Chunk start-end : {doc_end} - {doc_end+doc_span}", end="")
        measurement_prompt = None
        measurement_completion = None
        json_measurement_info = None

        cur_doc_chunk = cur_chemical_description_text[doc_end : doc_end + doc_span]
        measurement_prompt = get_data_extraction_prompt(cur_doc_chunk)
        measurement_completion = get_completion(measurement_prompt, deployment_name, max_tokens=1000)
        json_measurement_info = json.loads(measurement_completion.choices[0].text.strip(" \n"))
        all_measurement_json.append(json_measurement_info)

        # HouseKeeping Tasks
        if measurement_completion["usage"].completion_tokens > 900:
            print(f"Nearing  Completion Token limit for {chemical_patent_file} chunk {doc_end} to {doc_end+doc_span}")
        estimated_cost = get_cost_estimate(
            measurement_completion["model"], measurement_completion["usage"].total_tokens, return_numeric_cost=True
        )
        print(f". Estimated Cost : {estimated_cost}")
        total_cost_per_file += float(estimated_cost)

    print(f"Total Cost (USD) for patent File: {total_cost_per_file}")
    return all_measurement_json


chemical_patent_file = "Data/basf_ipg110920/patent_2275.xml"
measurement_values = get_all_measurement_values_from_patent_file(chemical_patent_file, deployment_name=deployment_name)
doc_number = get_text_only_from_xml_element(
    chemical_patent_file, "./us-bibliographic-data-grant/publication-reference/document-id/doc-number"
)
invention_title = get_text_only_from_xml_element(chemical_patent_file, "./us-bibliographic-data-grant/invention-title")
output_file_name = chemical_patent_file.split("/")[-1].split(".")[0]

merge_and_write_json_outputs_into_single_output_file(
    measurement_values,
    doc_number=doc_number,
    invention_title=invention_title,
    output_file_path="Data/output/" + output_file_name + ".json",
)


### 8.3 Data Extraction Test : Sample File
- Testing Single Patent File for Data Extraction Single File


In [300]:
openai.api_key = OPENAI_KEY_EXTERNAL


def extract_and_write_json_to_output_dir(chemical_patent_file, doc_span=7000):
    print(f" ... EXTRACTING : Data from {chemical_patent_file}")
    doc_number = get_text_only_from_xml_element(
        chemical_patent_file, "./us-bibliographic-data-grant/publication-reference/document-id/doc-number"
    )
    invention_title = get_text_only_from_xml_element(
        chemical_patent_file, "./us-bibliographic-data-grant/invention-title"
    )
    measurement_values = get_all_measurement_values_from_patent_file(
        chemical_patent_file, deployment_name=deployment_name, doc_span=doc_span
    )
    output_file_name = chemical_patent_file.split("/")[-1].split(".")[0]

    merge_and_write_json_outputs_into_single_output_file(
        measurement_values,
        doc_number=doc_number,
        invention_title=invention_title,
        output_file_path="Data/output/" + output_file_name + ".json",
    )
    print("Data Extracted to " + "Data/output/" + output_file_name + ".json")


extract_and_write_json_to_output_dir("Data/basf_ipg110920/basf_only/patent_3476.xml", doc_span=4000)


 ... EXTRACTING : Data from Data/basf_ipg110920/basf_only/patent_3476.xml
Extracting Entity for Data/basf_ipg110920/basf_only/patent_3476.xml :
 ... Doc Chunk start-end : 0 - 4000. Estimated Cost : 0.0265
 ... Doc Chunk start-end : 4000 - 8000. Estimated Cost : 0.0286
 ... Doc Chunk start-end : 8000 - 12000. Estimated Cost : 0.028
 ... Doc Chunk start-end : 12000 - 16000. Estimated Cost : 0.0282
 ... Doc Chunk start-end : 16000 - 20000. Estimated Cost : 0.0346
 ... Doc Chunk start-end : 20000 - 24000. Estimated Cost : 0.0387
 ... Doc Chunk start-end : 24000 - 28000. Estimated Cost : 0.0456
 ... Doc Chunk start-end : 28000 - 32000. Estimated Cost : 0.0425
 ... Doc Chunk start-end : 32000 - 36000. Estimated Cost : 0.0361
 ... Doc Chunk start-end : 36000 - 40000. Estimated Cost : 0.032
 ... Doc Chunk start-end : 40000 - 44000. Estimated Cost : 0.0373
 ... Doc Chunk start-end : 44000 - 48000. Estimated Cost : 0.0334
 ... Doc Chunk start-end : 48000 - 52000. Estimated Cost : 0.0333
Total Co

### 8.4 Data Extraction: Random 10 Chemical Files

In [285]:
import random


random.seed(45)
directory = "Data/basf_ipg110920/Random_50_Chemical_Patents/"
all_chemical_file = get_all_files(directory)
random_sample_chemical_files = random.sample(all_chemical_file, 10)
print(f"Random 10 files : {random_sample_chemical_files}")


Random 10 files : ['Data/basf_ipg110920/Random_50_Chemical_Patents/patent_1843.xml', 'Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2542.xml', 'Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2129.xml', 'Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2348.xml', 'Data/basf_ipg110920/Random_50_Chemical_Patents/patent_1695.xml', 'Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2087.xml', 'Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2400.xml', 'Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2093.xml', 'Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2143.xml', 'Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2539.xml']


In [287]:
for rand_chemical_file in random_sample_chemical_files:
    try:
        extract_and_write_json_to_output_dir(rand_chemical_file)
    except:
        print(
            "Skipping Extraction for {rand_chemical_file} because of edge case handling failure. To  investigate and fix later."
        )


 ... EXTRACTING : Data from Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2539.xml
Extracting Entity for Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2539.xml : Doc Chunk start-end : 0 - 7000. Estimated Cost : 0.0459
Extracting Entity for Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2539.xml : Doc Chunk start-end : 7000 - 14000. Estimated Cost : 0.0481
Extracting Entity for Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2539.xml : Doc Chunk start-end : 14000 - 21000. Estimated Cost : 0.0537
Extracting Entity for Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2539.xml : Doc Chunk start-end : 21000 - 28000Nearing  Completion Token limit for Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2539.xml chunk 21000 to 28000
. Estimated Cost : 0.0691
Extracting Entity for Data/basf_ipg110920/Random_50_Chemical_Patents/patent_2539.xml : Doc Chunk start-end : 28000 - 35000. Estimated Cost : 0.0559
Extracting Entity for Data/basf_ipg110920/Random_50_Che

### 8.5 Data Extracting: BASF patents
- 

In [292]:
basf_patents = get_all_files("Data/basf_ipg110920/basf_only/")

for basf_patent in basf_patents[:5]:
    try:
        extract_and_write_json_to_output_dir(basf_patent, doc_span=3000)
    except:
        print(
            "Skipping Extraction for {basf_patent} because of edge case handling failure. To  investigate and fix later."
        )


 ... EXTRACTING : Data from Data/basf_ipg110920/basf_only/patent_2397.xml
Extracting Entity for Data/basf_ipg110920/basf_only/patent_2397.xml : Doc Chunk start-end : 0 - 5000. Estimated Cost : 0.0486
Extracting Entity for Data/basf_ipg110920/basf_only/patent_2397.xml : Doc Chunk start-end : 5000 - 10000. Estimated Cost : 0.0559
Extracting Entity for Data/basf_ipg110920/basf_only/patent_2397.xml : Doc Chunk start-end : 10000 - 15000. Estimated Cost : 0.0559
Extracting Entity for Data/basf_ipg110920/basf_only/patent_2397.xml : Doc Chunk start-end : 15000 - 20000Skipping Extraction for {basf_patent} because of edge case handling failure. To  investigate and fix later.
 ... EXTRACTING : Data from Data/basf_ipg110920/basf_only/patent_1843.xml
Extracting Entity for Data/basf_ipg110920/basf_only/patent_1843.xml : Doc Chunk start-end : 0 - 5000. Estimated Cost : 0.034
Extracting Entity for Data/basf_ipg110920/basf_only/patent_1843.xml : Doc Chunk start-end : 5000 - 10000. Estimated Cost : 0.03

# 9. Evaluation:
- Evaluation will be be done by  selecting a random BASF patent and manually  inspecting the data because we do not have a labelled dataset to perform evaluation against.
- However during the fine-tuning stage, with training and validation data, we can easily evaluate the Entity Extraction. We will see it in the next notebook
- Alternative approach
    - Precision of Entity Extraction  : An alternative approach can be to compare the  inter-cluster similariry across each attribute. e.g  How close together are the the dimension attribute captured by the model. This will give us a sense of how accurately an entity is being captured. 
        - e.g Entity Captured (m, cm, men) 
        - m, cm are close together hence True values. men is not close to m, cm hence False Postive.
    - Recall : For recall of the entity capture, we can look at the  total count of entity_values captured and use it as a proxy.

        



In [5]:
import random

random.seed(52)
basf_patents = get_all_files("Data/basf_ipg110920/basf_only/")
print(f"Patent Randomly Chosen for Discussion : {random.choice(basf_patents)}")


Patent Randomly Chosen for Discussion : Data/basf_ipg110920/basf_only/patent_3476.xml


# 10. Improvement: 
### 10.1. Improvement Option 1 : Document Text Based
-  Proper Text Formatting and Chunking. Para Based  chunking over character based  periodic interval chunking.
- Accuracy over different sections e.g Extraction is generally observed better over examples compared to other desciption body.
- Identify & prioritise sections, if available.
- Get Extraction confidence score from the model and take necessary measure accordingly.
- Post Extraction Filter
- Future Work / Improvement

### 10.2 Improvement Option 2 :  Model Parameter Based
- Top_p : WE want deterministic or  entity that are in the text hence use lower value of top_p.\
- Future Work / Improvement

### 10.3 Improvment Option 3: Prompt  Based
- Few Iteration of prompt tuning was done earlier based on unit tests.
- More iteration of prompt and example  to be done next, dased on sample data extraction evaluation.  
- Prompt Tuning was covered and done earlier for unit tests as well. Hence for this step, we will move to Fine Tuning.

### 10.4 Improvement Option 4: Fine Tuning :See   NEXT notebook
- See other notebook langchain-finetuning.ipynb
- 1 pass of fine tuning with 20 examples  is done in the next notebook as a technology demonstrator.
- More examples, ideally > 200 is recommended by OpenAI and is to be done  next as future work.
- More iteration of prompt and example  to be done next, dased on sample data extraction evaluation.  
- Fine Tuning
- Why ?
    - Better results compared to simple prompt design/ tuning only.
    - Can train on more examples
    - Lower latency 
    - No need to provide examples in prompt.
    - No need for detailed instruction on how to extract data. Only single input prompt & its desired completion output required.
- FineTuning :
    - Create Train & validation data.
    - Recommended : At least 200 examples. Rough guideline, Double Dataset ~ linear increase in model performance.
    




In [None]:
{
    "prompt":"<any text, for example news article>\n\n###\n\n", 
    "completion":" <list of entities, separated by a newline> END"
}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
