This notebook simulates the process a user goes through when using the `arkham` modules.

The process begins by transforming the PDF file and storing the result on disk:

# OCR Module

Load the class that will be used:

In [1]:
%reload_ext autoreload
%autoreload 2

from arkham.OCRmodule import ocr_parse_file

In [3]:
# Instantiate the class
file = ocr_parse_file(file_path='CONTRATO_AP000000718.pdf')

# Read the file
file.load_file(method='fast', file_language='spa')

# Write the file and keep the path of the parsed output
file_path = file.write_file()

PDF text is not extractable. Cannot use the fast partitioning strategy. Falling back to partitioning with the ocr_only strategy.


And just like that, the PDF is parsed. We can take a look at the result below:

In [None]:
# Use a context manager so the file is closed after reading
with open(file_path, 'r') as f:
    print(f.read())



|Page 1

CONTRATO MAESTRO NUMERO AP000000718 DE ARRENDAMIENTO DE BIENES MUEBLES (EN LO SUCESIVO DENOMINADO EL “ARRENDAMIENTO MAESTRO”) QUE CELEBRAN POR UNA PARTE AB2C LEASING DE MEXICO, SOCIEDAD ANÓNIMA PROMOTORA DE INVERSIÓN DE CAPITAL VARIABLE. (EL “ARRENDADOR”, REPRESENTADA POR MARÍA ISABEL BOLIO' MONTERO Y PABLO ENRIQUE ROMERO GONZÁLEZ , POR OTRA PARTE LA EMPRESA; CRANE SUPPLIES SERVICES S.A. de C.V. REPRESENTADA POR ÓSCAR ALBERTO ISLAS MENDOZA ( “EL ARRENDATARIO” ), POR OTRA PARTE: EN LO PERSONAL Y POR SU PROPIO DERECHO, OSCAR ALBERTO ISLAS MENDOZA (COMO “EL OBLIGADO SOLIDARIO”), POR ULTIMO EN LO PERSONAL Y POR SU PROPIO DERECHO OSCAR ALBERTO ISLAS MENDOZA, COMO (EL DEPOSITARIO”) DE ACUERDO CON LAS SIGUIENTES DECLARACIONES Y CLAUSULAS.
DECLARACIONES E. El Arrendador declara, representa y garantiza que:
a. Es una Sociedad Anónima Promotora de Inversión de Capital Variable debidamente coristituida. bajo el nombre de Boston Leasing México, S.A. de C.Y., de conformidad a las léyes d
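Since the parsed text marks the start of a page with a line such as `|Page 1`, it can be handy to split the output by page before inspecting it. The following is only a minimal sketch: it assumes every page begins with a `|Page N` line, which the output above suggests but does not guarantee, and `split_pages` is a hypothetical helper, not part of `arkham`.

```python
import re


def split_pages(text):
    """Split parsed text into a {page_number: page_text} dictionary.

    Assumption: each page starts with a line of the form '|Page N'.
    """
    parts = re.split(r"^\|Page (\d+)[ \t]*$", text, flags=re.MULTILINE)
    # With one capture group, re.split returns
    # [text_before_first_marker, number, body, number, body, ...]
    return {int(num): body.strip() for num, body in zip(parts[1::2], parts[2::2])}


# Small made-up sample that mimics the layout of the parsed file.
sample = "|Page 1\n\nCONTRATO MAESTRO\n|Page 2\n\nDECLARACIONES\n"
pages = split_pages(sample)
print(sorted(pages))
```

With the real file, `split_pages(open(file_path).read())` would give one entry per page, provided the marker assumption holds.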

Now that the module has parsed the PDF file into a simpler txt file, we can use our chat assistant to extract information from it.

# Q&A Module

Once the file has been parsed, we can ask questions about it. Let's first instantiate the module:

In [1]:
from arkham.QAmodule import QA_assistant
from dotenv import load_dotenv
import evaluate
load_dotenv()

  from .autonotebook import tqdm as notebook_tqdm


True

In [2]:
falcon_tuned = QA_assistant(file_path='for_evaluation.txt', model='Falcon7b-Tuned')
# This method should only be executed once; to execute it again, restart your kernel
falcon_assistant = falcon_tuned.get_querier()

Embedding on device: cuda


Loading checkpoint shards: 100%|██████████| 2/2 [00:14<00:00,  7.35s/it]


As you can see, to initialize the QA assistant all you need to do is pass the txt file and the model of your choice. Then store the querier in a variable and you're good to go!

In [3]:
output = falcon_assistant("Give me a sumary of the file")
output['result']



'\nThe file is a contract between an Arrendatario and an Arrendador. The Arrendatario is\n\ncontracted to pay the Arrendador a sumary of $100,000.00. The Arrendatario is\n\ncontracted to pay the Arrendador a sumary of $100,000.00. The Arrendatario is\n\ncontracted to pay the Arrendador a sumary of $100,000.'

You can repeat the same steps with the regular Falcon 7b-Instruct or GPT:

In [2]:
falcon_instruct = QA_assistant(file_path='for_evaluation.txt', model='Falcon7b-Instruct')
# This method should only be executed once; to execute it again, restart your kernel
falcon_assistant = falcon_instruct.get_querier()

Embedding on device: cuda


Loading checkpoint shards: 100%|██████████| 2/2 [00:13<00:00,  6.99s/it]


In [3]:
output = falcon_assistant("Give me a sumary of the file")
output['result']



'\nThe file is a contract between an Arrendatario and an Arrendador. The Arrendatario is\n\ncontracted to pay the Arrendador a sumary of $100,000.00. The Arrendatario is\n\ncontracted to pay the Arrendador a sumary of $100,000.00. The Arrendatario is\n\ncontracted to pay the Arrendador a sumary of $100,000.'

In [3]:
gpt_module = QA_assistant(file_path='for_evaluation.txt', model='GPT3.5')
# This method should only be executed once; to execute it again, restart your kernel
gpt_assistant = gpt_module.get_querier()

In [4]:
gpt_assistant('Dame un resumen del texto')

Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2


' El Arrendatario debe reintegrar las cantidades erogadas por el seguro contratado por el Arrendador en un plazo de 30 días a partir de la fecha en que sean cubiertos. Si no se cumple con esto, se aplicarán los intereses moratorios establecidos en la cláusula Vigésima posterior.'
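Judging from the cells above, the Falcon querier returns a dictionary with a `result` key, while the GPT querier returns a plain string. A small helper can hide that difference. This is a sketch under that assumption only: `extract_answer` and the two `fake_*` queriers are hypothetical stand-ins for illustration, not part of `arkham`.

```python
def extract_answer(output):
    """Return the answer text from a querier output.

    Assumption: the output is either a plain string or a dict
    holding the answer under the 'result' key.
    """
    if isinstance(output, dict):
        return output["result"].strip()
    return str(output).strip()


# Hypothetical stand-ins that mimic the two output shapes seen above.
def fake_dict_querier(question):
    return {"result": "\nThe file is a contract."}


def fake_str_querier(question):
    return " El Arrendatario debe reintegrar las cantidades."


print(extract_answer(fake_dict_querier("Give me a summary")))
print(extract_answer(fake_str_querier("Dame un resumen")))
```

With the real queriers, the same call would look like `extract_answer(falcon_assistant(question))` or `extract_answer(gpt_assistant(question))`.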

Using these outputs, we can compute the ROUGE metric for each one:

In [4]:
rouge = evaluate.load('rouge')

In [6]:
predictions = ["\nThe file is a contract between an Arrendatario and an Arrendador. The Arrendatario is\n\ncontracted to pay the Arrendador a sumary of $100,000.00"]
references = ["De no cumplir el Arrendatario con lo mencionado en el párrafo anterior, el Arrendador contratará el seguro correspondiente por los riesgos que deba amparar y por cualesquiera otros que estime convenientes, por cuenta del Arrendatario, quien deberá. reintegrar las cantidades erogadas por tal concepto en un término que.no excederá de 30 (treinta) días, a partir de la-fecha én que sean cubiertos o de lo contrario se aplicarán los intereses moratorios establecidos en la cláusula Vigésima posterior."]
results = rouge.compute(predictions=predictions, references=references)
results

{'rouge1': 0.07339449541284404,
 'rouge2': 0.0,
 'rougeL': 0.07339449541284404,
 'rougeLsum': 0.07339449541284404}

In [11]:
predictions = ["El Arrendatario debe reintegrar las cantidades erogadas por el seguro contratado por el Arrendador en un plazo de 30 días a partir de la fecha en que sean cubiertos. Si no se cumple con esto, se aplicarán los intereses moratorios establecidos en la cláusula Vigésima posterior."]
references = ["De no cumplir el Arrendatario con lo mencionado en el párrafo anterior, el Arrendador contratará el seguro correspondiente por los riesgos que deba amparar y por cualesquiera otros que estime convenientes, por cuenta del Arrendatario, quien deberá. reintegrar las cantidades erogadas por tal concepto en un término que.no excederá de 30 (treinta) días, a partir de la-fecha én que sean cubiertos o de lo contrario se aplicarán los intereses moratorios establecidos en la cláusula Vigésima posterior."]
results = rouge.compute(predictions=predictions, references=references)
results

{'rouge1': 0.6417910447761195,
 'rouge2': 0.4545454545454545,
 'rougeL': 0.5223880597014926,
 'rougeLsum': 0.5223880597014926}
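To read the two results side by side, the score dictionaries can be laid out as a small text table. The sketch below copies the numbers from the two outputs above; `compare_scores` is a hypothetical helper written for this illustration, and the column labels are just names chosen here.

```python
# Scores copied from the two `rouge.compute` outputs above.
falcon_scores = {
    "rouge1": 0.07339449541284404,
    "rouge2": 0.0,
    "rougeL": 0.07339449541284404,
    "rougeLsum": 0.07339449541284404,
}
gpt_scores = {
    "rouge1": 0.6417910447761195,
    "rouge2": 0.4545454545454545,
    "rougeL": 0.5223880597014926,
    "rougeLsum": 0.5223880597014926,
}


def compare_scores(scores_by_model):
    """Return text rows with one line per metric and one column per model."""
    models = list(scores_by_model)
    metrics = list(next(iter(scores_by_model.values())))
    rows = [f"{'metric':<10}" + "".join(f"{m:>12}" for m in models)]
    for metric in metrics:
        values = "".join(f"{scores_by_model[m][metric]:>12.4f}" for m in models)
        rows.append(f"{metric:<10}" + values)
    return rows


for line in compare_scores({"Falcon7b": falcon_scores, "GPT3.5": gpt_scores}):
    print(line)
```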

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a metric that evaluates the overlap between two word chains. In general, ROUGE-N is the fraction of N-grams (chains of N consecutive words) in the reference chain that also appear in the measured chain, i.e.

$$\text{ROUGE-N} = \frac{\text{number of N-grams that appear in both word chains}}{\text{number of N-grams in the baseline word chain}}$$
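The formula can be worked through by hand in plain Python. This is a minimal sketch, assuming lowercase whitespace tokenization; the `evaluate` implementation applies its own tokenization and, as far as I can tell, reports an F-measure (combining this recall with the matching precision) rather than the pure recall above, so its numbers will not match this sketch exactly. The function names `ngrams` and `rouge_n` are chosen here for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """Return a Counter of the word n-grams in `text` (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Return (recall, precision, f1) of the n-gram overlap between two texts."""
    pred_counts = ngrams(prediction, n)
    ref_counts = ngrams(reference, n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(pred_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# Toy example: 3 of the 6 reference unigrams appear in the prediction.
recall, precision, f1 = rouge_n("the cat sat", "the cat sat on the mat", n=1)
print(recall, precision, f1)
```

In the toy example the recall is 3/6 = 0.5 (the formula above), the precision is 3/3 = 1.0, and the F-measure is 2/3. The gap between recall and F-measure shows how the relative lengths of the two chains move the score.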

Hence, the metric measures how well a model summarizes a text: the lower the score, the more the model's output deviates from the reference text, which tends to happen when the model is confronted with complex or long chains.

With that said, it's clear that, regarding generalization and summarization in this test, the `Falcon7b` model has a lot of room for improvement, while the `gpt3.5` model performs relatively well, considering that the $64\%$ score is heavily biased by the length of the baseline chain. Also notice that this metric is sensitive to language (here the Falcon answer is in English while the reference is in Spanish), so results on a more robust dataset may vary.