# HITL-SSC Workflow Extraction

This stage of the workflow processes the scientific corpus created in the previous step and extracts information from it.

This stage consists of three main tasks: data model definition, data identification, and data validation.

**Task 1: Defining Data Model**

In this context, a data model is a structured set of properties that serves as the input for querying the corpus.

In this task, you define the elements that should be extracted from the corpus.

Use the field below to manage the key properties: you can add or delete properties, and include them in or exclude them from the extraction process.

To add a property, type it into the text field and click the "Add Option" button. Add one property at a time.

To delete properties from the list, tick the checkboxes of the properties to delete and click the "Delete Selected Options" button.

**Ticking** a checkbox includes the corresponding property in the extraction process.

<div class="alert alert-block alert-info">
<b>Note:</b> The <b>Title</b> property is initialized and should always be included in the extraction process.
</div>
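
The data model managed by the widget can be pictured as a mapping from property names to an "include" flag. The following is a minimal sketch in plain Python, not the implementation of `DynamicCheckboxList`. The class `DataModel` and its methods `add`, `delete`, `set_included`, and `selected` are hypothetical names used only to illustrate the add, delete, include, and ignore operations described above, under the assumption that `Title` is always kept.

```python
# Hypothetical stand-in for the widget's state; this is NOT the corpus.widgets API.
class DataModel:
    REQUIRED = "Title"  # assumption: Title is always part of the extraction

    def __init__(self):
        # Maps property name -> True if the property is included in the extraction.
        self.properties = {self.REQUIRED: True}

    def add(self, name, include=True):
        name = name.strip()
        if name:
            # Adding an existing property leaves its current flag unchanged.
            self.properties.setdefault(name, include)

    def delete(self, name):
        # The required property cannot be removed.
        if name != self.REQUIRED:
            self.properties.pop(name, None)

    def set_included(self, name, include):
        if name in self.properties and name != self.REQUIRED:
            self.properties[name] = include

    def selected(self):
        return [name for name, included in self.properties.items() if included]


model = DataModel()
for prop in ["Authors", "Keywords", "Research question", "Conclusion"]:
    model.add(prop)
model.set_included("Conclusion", False)  # stays in the list, but is ignored
model.delete("Title")                     # no effect: Title is required
print(model.selected())
# → ['Title', 'Authors', 'Keywords', 'Research question']
```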




In [1]:
from corpus.widgets.widgets_util import DynamicCheckboxList
dynamic_checkbox_list = DynamicCheckboxList()
dynamic_checkbox_list.display_interface()

Text(value='', placeholder='Enter a new option')

Button(description='Add Option', style=ButtonStyle())

Button(description='Delete Selected Options', style=ButtonStyle())

VBox(children=(Checkbox(value=False, description='Title'),))

(Optional) **Task 1.2: Defining formatting rules**

This task is optional. Use it if you wish to define formatting rules for the extracted information.

**If you do not wish to define any formatting rules, skip to Task 2.**

When rules are defined, the extraction process takes the specified formats into account and formats the generated information accordingly.

Below are examples of properties and how their information could be formatted. Use them as a guide for this task:

   - Example property: <b>Keywords</b>
   - Example rule: Keywords should be separated by '-'.

   - Example property: <b>Evaluation approach</b>
   - Example rule: The evaluation approach should be 1 sentence.

   - Example property: <b>Conclusion</b>
   - Example rule: The conclusion should be 3 sentences at most.
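
To make the effect of these rules concrete, the sketch below shows one way the rules could be held as a property-to-rule mapping and joined into a block of text for the extraction prompt. It is a simplified illustration rather than the notebook's exact code: the helper name `rules_to_prompt_block` is hypothetical, and the rule texts are the examples listed above. Properties left without a rule contribute nothing.

```python
# Example rules taken from the list above; an empty string means "no rule".
rules = {
    "Title": "",
    "Keywords": "Keywords should be separated by '-'.",
    "Evaluation approach": "The evaluation approach should be 1 sentence.",
    "Conclusion": "The conclusion should be 3 sentences at most.",
}


def rules_to_prompt_block(rules):
    """Join the non-empty rules, one per paragraph (hypothetical helper)."""
    return "\n\n".join(rule.strip() for rule in rules.values() if rule.strip())


prompt_block = rules_to_prompt_block(rules)
print(prompt_block)
```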







In [41]:
import ipywidgets as widgets
from IPython.display import display

elements = dynamic_checkbox_list.get_selected_options()

input_widgets = {}

rows = []

header_row = widgets.HBox([
    widgets.Label('Property', layout=widgets.Layout(width='200px')),
    widgets.Label('Formatting Rule', layout=widgets.Layout(width='600px'))
])
rows.append(header_row)

for element in elements:
    element_label = widgets.Label(element, layout=widgets.Layout(width='200px'))
    text_input = widgets.Text(layout=widgets.Layout(width='600px'))
    input_widgets[element] = text_input
    row = widgets.HBox([element_label, text_input])
    rows.append(row)

table = widgets.VBox(rows)

display(table)



VBox(children=(HBox(children=(Label(value='Property', layout=Layout(width='200px')), Label(value='Formatting R…

In [42]:
def extract_from_table():
    return {element: widget.value for element, widget in input_widgets.items()}

pairs = extract_from_table()

print("Extracted Pairs:", pairs)

Extracted Pairs: {'Title': '', 'Authors': "Authors should be separated by ', '", 'Keywords': "Keywords should be separated by ' - '", 'Research question': ''}


In [44]:
non_empty_values = "\n\n".join(value for value in pairs.values() if value)
print(non_empty_values)

Authors should be separated by ', '

Keywords should be separated by ' - '


**Task 2: Data identification**

After defining the data model, the next task is to prompt an LLM agent using the above specifications.

For this task, you only need to specify the name of the folder that contains the PDF corpus. No further input is required.

The generated information is then displayed for validation.

Enter the name of the folder containing the PDF documents in the text field below.
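
Before sending documents to the model, it can help to check that the folder exists and to see which PDF files it contains. The sketch below is one possible way to do this with the standard library. The helper `list_pdfs` is a hypothetical name, and the temporary folder with its file names exists only to make the example runnable.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder_name):
    """Return the PDF files found directly in folder_name, sorted by path."""
    folder = Path(folder_name)
    if not folder.is_dir():
        raise FileNotFoundError(f"Folder not found: {folder}")
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demonstration on a throwaway folder (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.pdf", "B.PDF", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [p.name for p in list_pdfs(tmp)]

print(found)
```

The suffix comparison is case-insensitive, so both `.pdf` and `.PDF` files are picked up, while other files are ignored.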

In [20]:
import ipywidgets as widgets
from IPython.display import display

input_widget = widgets.Text(
    value='',
    placeholder='Folder name here',
    disabled=False
)
def save_input(change):
    global folder_name
    folder_name = change['new']

input_widget.observe(save_input, names='value')

display(input_widget)

Text(value='', placeholder='Folder name here')

In [84]:
from mistralai import Mistral
import os
from corpus.embedding.document_util import DocumentUtil
from dotenv import load_dotenv

load_dotenv()

non_empty_values = "\n\n".join(value for value in pairs.values() if value)

folder_path = input_widget.value
results = []
data = " - ".join(dynamic_checkbox_list.get_selected_options())

for filename in os.listdir(folder_path):
    if filename.lower().endswith('.pdf'):
        file_path = os.path.join(folder_path, filename)
        
        print(f"Processing PDF file: {file_path}")



        test_doc = DocumentUtil.get_text_without_references(DocumentUtil, file_path)
        
        prompt = f"""
            You are an information extraction system specialized in retrieving specified key information from scientific texts.
            
            Extract and provide the following key details from the given text: {data}. 
            
            
            Use only the information available in the Scientific Text, without any external knowledge. If a piece of information is not found, return 'NOT FOUND' for that specific key.
            
            For each item, provide the extracted information along with an associated confidence score.
            
            You will respond only with a JSON object where each field is structured as follows:
            
            "FieldName": {{ "Information": "<extracted information>", "Confidence": <confidence score> }}
            Do not include explanations or extra text.
            
            Some information should be formatted following these rules: 
            
            {non_empty_values}
            
            Scientific Text:
            {test_doc}
        """
        
        # The API key is read from the MISTRALAI environment variable (loaded from .env above).
        s = Mistral(
            api_key=os.getenv('MISTRALAI'),
        )

        # Note: the full run takes around 8 minutes for a corpus of around 30 papers.
        res = s.chat.complete(
            model="mistral-large-latest",
            temperature=0,
            messages=[
                {
                    "content": prompt,
                    "role": "user",
                },
            ],
            response_format={
                "type": "json_object",
            },
        )

        # Skip the document if no response was returned.
        if res is not None:
            content = res.choices[0].message.content
            print(content)
            results.append(content)

Processing PDF file: zotero_pdfs\4KKE293P.pdf
{
  "Title": {
    "Information": "A NLP-Oriented Methodology to Enhance Event Log Quality",
    "Confidence": 1.0
  },
  "Authors": {
    "Information": "Belén Ramos-Gutiérrez, F. Javier Ortega, Ángel Jesús Varela-Vaca, María Teresa Gómez-López, Moe Thandar Wynn",
    "Confidence": 1.0
  },
  "Keywords": {
    "Information": "Natural Language Processing - Event log quality - Process mining",
    "Confidence": 1.0
  },
  "Research question": {
    "Information": "What are the most suitable NLP techniques to use for relabelling an event log in order to improve its quality?",
    "Confidence": 1.0
  }
}
Processing PDF file: zotero_pdfs\4XTLX385.pdf
{
  "Title": {
    "Information": "An Empirical Investigation of the Intuitiveness of Process Landscape Designs",
    "Confidence": 1.0
  },
  "Authors": {
    "Information": "Gregor Polanˇciˇc, Pavlo Brin, Lucineia Heloisa Thom, Encarna Sosa, Mateja Kocbek Bule",
    "Confidence": 1.0
  },
  "Keyw

{
  "Title": {
    "Information": "Exception Handling in the Context of Fragment-Based Case Management",
    "Confidence": 1.0
  },
  "Authors": {
    "Information": "Kerstin Andree, Sven Ihde, Luise Pufahl",
    "Confidence": 1.0
  },
  "Keywords": {
    "Information": "Case management - Exception handling - Business process modeling - Flexible process automation",
    "Confidence": 1.0
  },
  "Research question": {
    "Information": "NOT FOUND",
    "Confidence": 0.0
  }
}
Processing PDF file: zotero_pdfs\FWSAQ63Q.pdf
{
  "Title": {
    "Information": "Chatting About Processes in Digital Factories: A Model-Based Approach",
    "Confidence": 1.0
  },
  "Authors": {
    "Information": "Donya Rooein, Devis Bianchini, Francesco Leotta, Massimo Mecella, Paolo Paolini, Barbara Pernici",
    "Confidence": 1.0
  },
  "Keywords": {
    "Information": "Conversational user interface - Chatbot modelling - Digital factory - Industry 4.0",
    "Confidence": 1.0
  },
  "Research question": {
    "

{
  "Title": {
    "Information": "Automated Planning for Supporting Knowledge-Intensive Processes",
    "Confidence": 1.0
  },
  "Authors": {
    "Information": "Sheila Katherine Venero, Julio Cesar dos Reis, Bradley Schmerl, Cecília Mary Fischer Rubira, Leonardo Montecchi",
    "Confidence": 1.0
  },
  "Keywords": {
    "Information": "Knowledge-intensive process - Business process modeling - Case management - Automated planning - Markov Decision Process - Business process management systems",
    "Confidence": 1.0
  },
  "Research question": {
    "Information": "NOT FOUND",
    "Confidence": 0.0
  }
}
Processing PDF file: zotero_pdfs\T822ESXR.pdf
{
  "Title": {
    "Information": "Model Consolidation: A process modelling method combining Process Mining and Business Process Modelling",
    "Confidence": 1.0
  },
  "Authors": {
    "Information": "Ornela Çela, Agnès Front, Dominique Rieu",
    "Confidence": 1.0
  },
  "Keywords": {
    "Information": "Process analysis - business proc

In [85]:
print(len(results))

33
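
Since each entry in `results` is a raw string returned by the model, it may be worth checking that every response parses as JSON and follows the requested `"FieldName": {"Information": ..., "Confidence": ...}` structure before validating it. The sketch below shows one way to do that. The helper `parse_response` is a hypothetical name, and the sample strings are invented for illustration and are not taken from the corpus.

```python
import json


def parse_response(raw, expected_fields):
    """Parse one model response and return (record, problems)."""
    try:
        record = json.loads(raw)
    except (TypeError, json.JSONDecodeError) as exc:
        return None, [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return None, ["top-level value is not a JSON object"]

    problems = []
    for field in expected_fields:
        entry = record.get(field)
        if not isinstance(entry, dict):
            problems.append(f"missing field: {field}")
        elif "Information" not in entry or "Confidence" not in entry:
            problems.append(f"incomplete field: {field}")
    return record, problems


fields = ["Title", "Keywords"]

# Invented sample responses: one well-formed, one cut off mid-string.
good = (
    '{"Title": {"Information": "A", "Confidence": 1.0}, '
    '"Keywords": {"Information": "NOT FOUND", "Confidence": 0.0}}'
)
truncated = '{"Title": {"Information": "A", "Confi'

for raw in (good, truncated):
    record, problems = parse_response(raw, fields)
    print("ok" if record is not None and not problems else problems)
```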


**Task 3: Data validation**


In [96]:
import json
import pandas as pd
import ipywidgets as widgets
from ipydatagrid import DataGrid, TextRenderer, VegaExpr
from IPython.display import display

parsed_data = []
for json_string in results:
    try:
        parsed_data.append(json.loads(json_string))
    except TypeError:
        parsed_data.append(json_string)
    except json.JSONDecodeError as e:
        print("Invalid JSON string:", e)

def flatten_json(json_obj):
    flattened = {}
    for key, value in json_obj.items():
        flattened[f"{key} - Information"] = value["Information"]
        flattened[f"{key} - Confidence"] = value["Confidence"]
    return flattened

flattened_data = [flatten_json(json_obj) for json_obj in parsed_data]

df = pd.DataFrame(flattened_data)
df["Selected"] = True 
columns = ["Selected"] + [col for col in df.columns if col != "Selected"]
df = df[columns]

renderer = TextRenderer(
    background_color=VegaExpr(
        "cell.value == 'NOT FOUND' ? 'red' : 'white'"
    )
)

huge_datagrid = DataGrid(
    df,
    base_row_size=30,
    base_column_size=150,
    layout={"height": "400px", "width": "100%"},
    default_renderer=renderer,
    editable=True
)

checkboxes = [
    widgets.Checkbox(value=True, layout=widgets.Layout(margin="0 0 0 5px")) for _ in range(len(df))
]
checkbox_column = widgets.VBox(
    checkboxes, layout=widgets.Layout(align_items="stretch", margin="0 5px 0 0")
)
def validate_and_save(_):
    # Keep only the rows still marked as selected in the (possibly edited) grid.
    updated_df = huge_datagrid.data
    filtered_df = updated_df[updated_df["Selected"]]
    filtered_df = filtered_df.drop(columns=["Selected"])
    filtered_df.to_csv("filtered_data.csv", index=False)
    print("Filtered data saved to 'filtered_data.csv'")

validate_button = widgets.Button(description="Validate & Save")
validate_button.on_click(validate_and_save)

layout = widgets.VBox([huge_datagrid, validate_button], layout=widgets.Layout(gap="10px"))



display(layout)




VBox(children=(DataGrid(auto_fit_params={'area': 'all', 'padding': 30, 'numCols': None}, base_column_size=150,…

Filtered data saved to 'filtered_data.csv'


In [97]:
# Drop rows for which no title was found.
filtered_df = df[df["Title - Information"] != "NOT FOUND"]
filtered_df = filtered_df[filtered_df["Title - Information"].notna()]

confidence_columns = [col for col in filtered_df.columns if "Confidence" in col]

# Keep rows with at least one non-zero confidence; copy to avoid writing into a view of df.
filtered_df = filtered_df[(filtered_df[confidence_columns] > 0).any(axis=1)].copy()

for column in confidence_columns:
    info_column = column.replace("Confidence", "Information")
    # Blank out the information wherever the reported confidence is zero.
    filtered_df.loc[filtered_df[column] == 0, info_column] = ""
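
The filtering rule applied in the cell above can also be read row by row: a row is kept only if a title was found and at least one confidence is above zero, and any information reported with zero confidence is emptied. The sketch below restates that rule on plain dictionaries, without pandas. The helper names `keep_row` and `blank_zero_confidence` are hypothetical, and the sample rows are invented for illustration.

```python
NOT_FOUND = "NOT FOUND"
CONF = " - Confidence"
INFO = " - Information"


def keep_row(row):
    """Keep rows with a found title and at least one positive confidence."""
    title = row.get("Title" + INFO)
    confidences = [v for k, v in row.items() if k.endswith(CONF)]
    return title not in (None, NOT_FOUND) and any(c > 0 for c in confidences)


def blank_zero_confidence(row):
    """Return a copy of row with Information emptied where Confidence is 0."""
    cleaned = dict(row)
    for column, value in row.items():
        if column.endswith(CONF) and value == 0:
            cleaned[column[: -len(CONF)] + INFO] = ""
    return cleaned


# Invented sample rows in the flattened "<field> - Information/Confidence" layout.
rows = [
    {"Title - Information": "Paper A", "Title - Confidence": 1.0,
     "Research question - Information": "NOT FOUND",
     "Research question - Confidence": 0.0},
    {"Title - Information": "NOT FOUND", "Title - Confidence": 0.0,
     "Research question - Information": "NOT FOUND",
     "Research question - Confidence": 0.0},
]

cleaned_rows = [blank_zero_confidence(r) for r in rows if keep_row(r)]
print(cleaned_rows)
```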

In [100]:
#filtered_df = filtered_df.drop(columns=[col for col in df.columns if "Confidence" in col])
filtered_df = filtered_df.drop(columns=["Selected"])

file_path = "filtered_dataframe12.csv" 
filtered_df.to_csv(file_path, index=False)

print(f"The filtered DataFrame has been saved to {file_path}")

The filtered DataFrame has been saved to filtered_dataframe12.csv


In [101]:
import ipywidgets as widgets
from IPython.display import display, HTML

javascript_functions = {False: "hide()", True: "show()"}
button_descriptions  = {False: "Show code", True: "Hide code"}


def toggle_code(state):

    """
    Toggles the JavaScript show()/hide() function on the div.input element.
    """

    output_string = "<script>$(\"div.input\").{}</script>"
    output_args   = (javascript_functions[state],)
    output        = output_string.format(*output_args)

    display(HTML(output))


def button_action(value):

    """
    Calls the toggle_code function and updates the button description.
    """

    state = value.new

    toggle_code(state)

    value.owner.description = button_descriptions[state]


state = False
toggle_code(state)

button = widgets.ToggleButton(state, description = button_descriptions[state])
button.observe(button_action, "value")

display(button)

ToggleButton(value=False, description='Show code')