# Data Problem
Given an XML file that contains many patents, we follow these steps:

* Split the patents.

* We filter the patents using the <section> element to include only Chemistry patents.

* We extract the text from the filtered patents, using the text in the "description", "abstract", and "claims" elements, then write all of it to a TXT file for later use.

* We further clean the text by removing unwanted characters.


In [1]:
import xml.etree.ElementTree as ET

xmll_file_path = "Data/ipa221229.xml"

In [None]:

# Read the XML file
with open(xmll_file_path, "r") as file:
    xml_content = file.read()

# Split the XML file into separate documents
xml_documents = xml_content.split("<?xml version=")

# Remove the empty first element
xml_documents = xml_documents[1:]

# Parse each XML document individually
for i, xml_doc in enumerate(xml_documents):
    # Add back the XML declaration
    xml_doc = "<?xml version=" + xml_doc
    
    # Parse the XML document
    root = ET.fromstring(xml_doc)
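    # The first <section> element holds the classification section letter
    closels_lvl = root.findall(".//section")
    # Guard against a missing element or empty text before comparing
    section = (closels_lvl[0].text or "").strip() if closels_lvl else ""
    if section.lower() == "c":  # "C" is the chemistry patent category according to the documentation
        print(section)
        # Write the chemistry patent to its own file; the file is closed automatically
        with open(f"Data/ipa221229/ipa221229{section}_{i}.xml", "w") as file1:
            file1.write(xml_doc)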

## Extracting text from patents and writing it to a TXT file for later use

In [1]:
import os
import html

number_patents = 20
directory = 'Data/ipa221229/'  # Replace with the path to your directory that contains filtered XML patents.

In [3]:

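def extract_text_from_xml_patents(patents_number, directory):
    # Take the first `patents_number` files in the directory
    file_list = os.listdir(directory)[:patents_number]

    text = []

    for file in file_list:
        # Build the path from the `directory` argument rather than a hard-coded path
        file_name = os.path.join(directory, file)

        # extract_text_from_xml is a helper that is not shown in this notebook
        text.append(extract_text_from_xml(file_name))

    # Write one patent per line, decoding HTML entities such as &amp;
    with open(f'Data/patents_text_{patents_number}_patents.txt', 'w', encoding='utf-8') as f:
        for line in text:
            f.write(html.unescape(line))
            f.write('\n')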

In [4]:
extract_text_from_xml_patents(number_patents, directory)
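
The helper `extract_text_from_xml` called above is not shown in this notebook. The block below is a minimal sketch of what such a helper might look like, not the project's actual implementation. It assumes the text lives in elements named `abstract`, `description`, and `claims` (as listed in the steps at the top), and the sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Element names taken from the steps above; their nesting in real patent XML is an assumption.
TEXT_ELEMENTS = ("abstract", "description", "claims")


def extract_text_from_xml_string(xml_string):
    """Return the text of the abstract, description and claims as a single line."""
    root = ET.fromstring(xml_string)
    parts = []
    for tag in TEXT_ELEMENTS:
        for element in root.iter(tag):
            # itertext() walks nested tags, so inline markup such as <b> is flattened
            parts.append("".join(element.itertext()))
    # Collapse newlines and repeated whitespace into single spaces
    return " ".join(" ".join(parts).split())


def extract_text_from_xml(file_name):
    """Hypothetical file-based wrapper matching the call used in the notebook."""
    with open(file_name, "r", encoding="utf-8") as f:
        return extract_text_from_xml_string(f.read())


# Made-up sample document, for illustration only
sample = """<patent>
  <abstract><p>A catalyst with a surface area of <b>200</b> m2/g.</p></abstract>
  <description><p>Heated to 50 C.</p></description>
  <claims><claim>A coated substrate.</claim></claims>
  <drawings>ignored</drawings>
</patent>"""

print(extract_text_from_xml_string(sample))
```

Collapsing whitespace here also covers part of the "removing unwanted characters" step; any further cleaning rules would depend on the data.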

# Prompt Evaluation

**Evaluation problem** 
For the prompt evaluation, we follow two paradigms:
#### Context Evaluation
This evaluation answers two questions: does the LLM understand the prompt context, and does it understand the provided demonstrations?
To answer them, we use the examples in the prompt as inputs and check whether GPT returns the expected output. For a prompt to be considered, it needs to score 100% accuracy in this evaluation.
#### Test set Evaluation
After the context evaluation, how can we make sure that the prompt is not overfitting to its examples? We manually choose a test set (inputs and ground truth) from the provided patents and use it as the evaluation set for all prompts. Using the same test set for every prompt makes the results comparable.
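
The context-evaluation gate described above can be sketched as follows. This is only an illustration: how `PromptEvaluator` actually scores predictions is not shown in this notebook, so the sketch assumes exact-match comparison per example. The function names, the example data, and the two stand-in "models" are all hypothetical.

```python
def context_accuracy(examples, predict):
    """Fraction of prompt examples whose prediction equals the expected output."""
    if not examples:
        raise ValueError("at least one example is required")
    correct = sum(1 for ex in examples if predict(ex["input"]) == ex["output"])
    return correct / len(examples)


def passes_context_evaluation(examples, predict):
    # A prompt is only considered further if it reproduces every demonstration
    return context_accuracy(examples, predict) == 1.0


# Made-up demonstrations in the shape of the prompt's output format
examples = [
    {"input": "heated to 50 C",
     "output": [{"measurement": "temperature", "unit": "C", "value": "50"}]},
    {"input": "no numbers here", "output": []},
]


def memorising_model(text):
    """Stand-in for an LLM call that reproduces every demonstration."""
    lookup = {ex["input"]: ex["output"] for ex in examples}
    return lookup.get(text, [])


def empty_model(text):
    """Stand-in for an LLM call that never finds a measurement."""
    return []


print(passes_context_evaluation(examples, memorising_model))  # True
print(passes_context_evaluation(examples, empty_model))       # False
```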


In [5]:
from prompt_evaluation import PromptEvaluator

### Evaluating Prompts in the Prompts Directory

Given a Prompts directory that contains more than one prompt, we evaluate the prompts and compare the results so that we can choose one of them to parse our patents.

In [4]:
prompt_evaluator = PromptEvaluator()
test_result, context_test_results = prompt_evaluator.evaluate_prompts()

test_result

Unnamed: 0,total_overall_accuracy,total_true_predictions,total_miss_predictions,total_false_predictions,number_of_tokens,prompt_id
0,0.96875,18,0,1,101,0
1,0.96875,18,0,1,103,1
2,0.96875,18,0,1,111,2
3,0.90625,17,0,2,112,3


In [5]:
context_test_results

Unnamed: 0,total_overall_accuracy,total_true_predictions,total_miss_predictions,total_false_predictions,number_of_tokens,prompt_id
0,1.0,8,0,0,101,0
1,1.0,8,0,0,103,1
2,1.0,8,0,0,111,2
3,1.0,8,0,0,112,3


#### Results discussion

Since this is only a prototype, and the examples should ultimately be chosen by domain experts, I included 4 prompts to compare. The evaluation results are close: all four prompts score 1.0 on the context evaluation, and test-set accuracy ranges from 0.90625 to 0.96875. Let's choose prompt number 003 to parse our patents.

# Parsing Patents 

In [1]:
from basf_measurement_parser import BASFMeasurementParser
from prompt_evaluation import *
from prompt_builder import *


### Using the prompt builder to build the prompt

 

In [2]:
chat_prompt, prompt, examples = PromptBuilder.build_prompt_from_dir("003")  # We use prompt 003; change this to any prompt ID

In [3]:
prompt

{'text': "You are given a text from chemistry research patent, extract all the measurements and it's context mentioned in the text. Only Extract Measurement context, Unit and Value ranges.\nOnly extract measurements that have value and units.\nIf you find no measurements, return [].",
 'output_format': "Your output should follow be a list of json objects, good output format : [ { 'measurement ' : 'string ', 'unit ' : 'string ', 'value ' : 'string ' }, { 'measurement ' : 'string ', 'unit ' : 'string ', 'value ' : 'string ' }"}

## Then we use the Measurements Parser to parse the text

During this process, every step is logged to a log file. The log file is in the logging directory, and its name includes the prompt ID and the chunk size. For example, with prompt_id = '003' and a chunk size of 6000, the log file should be named **"parsing_logs_003_6000_logs.log"**.

In [6]:
measurement_parser = BASFMeasurementParser(prompt=chat_prompt, prompt_id="003", chunk_size=6000)


In [None]:
text_to_be_parser = "Data/text_data_ipa221229_5_patents.txt" # We use first 5 patents as a test 

# We pass the output format to be used with the prompt; the output format instructions should be stored with the prompt file
patents_text_results = measurement_parser.parse_txt_by_chunks(file_path=text_to_be_parser, output_format=prompt["output_format"])
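
How `parse_txt_by_chunks` splits the text is not shown in this notebook. As a rough sketch of chunking in general, the block below assumes `chunk_size` counts characters (it could equally be tokens in the real parser) and cuts at a space where possible so that words are not split. The function name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text, chunk_size):
    """Split text into pieces of at most chunk_size characters, cutting at spaces when possible."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer to cut at the last space inside the current window
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        start = end
    # Drop pieces that were only whitespace
    return [chunk for chunk in chunks if chunk]


print(split_into_chunks("aaa bbb ccc ddd", 8))  # ['aaa bbb', 'ccc ddd']
```

Each chunk would then be sent to the LLM together with the prompt and the output format instructions.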


#### Parsing Results in Structured Format

In [40]:
patents_text_results[0]

Unnamed: 0,measurement,unit,value
0,amplicon pacbio sequencing coverage,%,90
1,molecular lop replicate strain typings concord...,%,10
2,paralel testing betwen molecular lop and tradi...,%,98.2
3,analytical sensitivity/specificity heat-inacti...,%,96.23
4,analytical sensitivity/specificity heat-inacti...,%,9.97
...,...,...,...
113,pipe strength,mpa,267 to 282
114,pipe defect,,
115,yield strength after 96-h imersion,mpa,367 to 418
116,"tensile strength in 50 c., 70%",mpa,8 to 34


#### Faulty Predictions 

In [19]:
print(f"{len(patents_text_results[1])} predictions with bad JSON format\n")
patents_text_results[1]

44 predictions with bad JSON format



['[{"measurement": "refractory metal oxide support surface area", "unit": "m^2/g", "value": "100 to 200"},\n{"measurement": "coated Substrate support", "unit": "g/in", "value": "1.5 to 7.0"},\n{"measurement": "Alumina BET surface area", "unit": "m/g", "value": "200"}]',
 '[{"measurement": "nucleic acid sequencing", "unit": "", "value": ""}, {"measurement": "rt-pcr", "unit": "", "value": ""}, {"measurement": "pacbio sequencing", "unit": "", "value": ""}, {"measurement": "geographic location identification", "unit": "", "value": ""}, {"measurement": "barcode linkage", "unit": "", "value": ""}, {"measurement": "prevalence tracking", "unit": "", "value": ""}, {"measurement": "correlation of variants with infectivity and disease severity", "unit": "", "value": ""}, {"measurement": "data deposition", "unit": "", "value": ""}, {"measurement": "cdc database", "unit": "", "value": ""}]',
 '[ \n{ "measurement": "surface area", "unit": "m^2/g", "value": "5 to 350" },\n{ "measurement": "coated Sub

Correct parsing results are saved in an Excel file in the output directory, under the name **parsing_results_{prompt_id}_{chunk_size}.xlsx**

Outputs with an incorrect format are also saved in the output directory, under the name **parsing_results_{prompt_id}_{chunk_size}_false_json**, for further processing.
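
The notebook does not show how the parser decides that a prediction has a bad JSON format. One plausible approach, sketched here under that assumption, is to try `json.loads` on each raw model output and keep anything that is not a list of objects for later inspection. The function name `split_predictions` is hypothetical, and the second sample string is made up.

```python
import json


def split_predictions(raw_outputs):
    """Separate raw model outputs into parsed records and strings that are not a valid JSON list."""
    records, faulty = [], []
    for raw in raw_outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # Not valid JSON at all: keep the raw string for further processing
            faulty.append(raw)
            continue
        if isinstance(parsed, list) and all(isinstance(item, dict) for item in parsed):
            records.extend(parsed)
        else:
            # Valid JSON, but not the expected list of objects
            faulty.append(raw)
    return records, faulty


outputs = [
    '[{"measurement": "pipe strength", "unit": "mpa", "value": "267 to 282"}]',
    "[{'measurement': 'surface area', 'unit': 'm^2/g'",  # single quotes and truncated
    "[] (No measurements found)",                         # trailing text after the list
]
records, faulty = split_predictions(outputs)
print(len(records), len(faulty))  # 1 2
```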

In [42]:
import pandas as pd 
excel_output = pd.read_excel('output/parsing_results_003_6000.xlsx')
excel_output

Unnamed: 0.1,Unnamed: 0,measurement,unit,value
0,0,amplicon pacbio sequencing coverage,%,90
1,1,molecular lop replicate strain typings concord...,%,10
2,2,paralel testing betwen molecular lop and tradi...,%,98.2
3,3,analytical sensitivity/specificity heat-inacti...,%,96.23
4,4,analytical sensitivity/specificity heat-inacti...,%,9.97
...,...,...,...,...
113,113,pipe strength,mpa,267 to 282
114,114,pipe defect,,
115,115,yield strength after 96-h imersion,mpa,367 to 418
116,116,"tensile strength in 50 c., 70%",mpa,8 to 34


## Parsing using a Document Retriever

Some of the predictions are irrelevant: we pass a large number of documents to the LLM, but not all of them are relevant to the task.

#### Solution

We use a document retriever with embeddings. Before sending the documents to the LLM, we embed them and use the retriever with a prompt to select only the relevant documents (chunks).
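
To illustrate the idea only, the sketch below ranks chunks by cosine similarity to a query and keeps the top few. It uses a toy bag-of-words vector as a stand-in for a real embedding model; the embedding model, retriever, and query that the parser actually uses are not shown in this notebook, so every name and string here is an assumption.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(chunks, query, top_k):
    """Return the top_k chunks most similar to the query, most similar first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(embed(chunk), query_vec), reverse=True)
    return ranked[:top_k]


# Made-up chunks and retrieval query, for illustration only
chunks = [
    "The catalyst has a BET surface area of 200 m2/g.",
    "The invention relates to methods of tracking viral variants.",
    "The pipe has a yield strength of 367 to 418 MPa.",
]
query = "measurement value unit surface area strength MPa"

for chunk in retrieve(chunks, query, top_k=2):
    print(chunk)
```

Only the retrieved chunks would then be sent to the LLM, which is why far fewer requests are made.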

In [4]:
measurement_parser = BASFMeasurementParser(prompt=chat_prompt, prompt_id="003", chunk_size=6000)
text_to_be_parser = "Data/text_data_ipa221229_5_patents.txt" # We use first 5 patents as a test 

# We pass the output format to be used with the prompt; the output format instructions should be stored with the prompt file
patents_text_results = measurement_parser.parse_txt_by_chunks(file_path=text_to_be_parser, output_format=prompt["output_format"], use_retriever=True)


Results for Parsing with the retriever 

In [5]:
patents_text_results[0]

Unnamed: 0,measurement,unit,value
0,sample surface area,m^2/g,5 and about 350
1,coated Substrate support,g/in,1.5 to 7.0
2,Alumina BET surface area,m/g,200
3,nitrogen oxide surface area,m^2/g,30 and 80
4,polypeptide concentration,g/ml,1 to 9
5,surface area,m^2/g,"13.5 or more, preferably 14.0 or more, more pr..."
6,yield strength,MPa,"230 or more, preferably 250 or more"
7,tensile strength,MPa,"380 or more, preferably 40 or more"


In [6]:
patents_text_results[1]

['[] (No measurements found)']

#### Results Discussion

The number of chunks sent to the LLM is much lower: normally we send 56 chunks, but with the retriever we send only 4.

The extracted measurements seem more relevant. There are far fewer of them (8 rows instead of 117), but they are more reliable.