# HITL-SCC_Workflow Stage 2: Knowledge Extraction

If you have reached this stage, you already have a corpus of PDFs stored locally.

The goal now is to structure the extraction process around your research interest.

This means establishing a structured data model that serves as a template for extracting key information from scientific literature. The model outlines the specific fields of knowledge to be extracted, such as titles, keywords, methods, metrics, and evaluation approaches.

**Task 1: Define Data model**

A data model in this context is a structured set of properties that serves as the input for querying the corpus.

In this task, the goal is to define the elements that should be extracted from the corpus.

Use the controls below to manage the properties you wish to add, delete, include, or ignore in the extraction process.

To add a property, type it into the text field and click the "Add Option" button. Add one property at a time.

To delete properties from the list, tick the checkboxes of the properties to remove and click the "Delete Selected Options" button.

Ticking a checkbox includes the corresponding property in the extraction process.

<div class="alert alert-block alert-info"> <b>Note:</b> The Title property is initialized by default and should always be included in the extraction process.</div>
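For orientation, the data model described above can be thought of as a list of property names, each flagged as included or not. The sketch below is only an illustration of that idea: the `Property` class and the `selected_properties` helper are hypothetical names and are not part of `corpus.widgets`.

```python
from dataclasses import dataclass


@dataclass
class Property:
    """One entry of the data model (hypothetical helper, for illustration only)."""
    name: str
    included: bool = False


def selected_properties(properties):
    """Return the names of included properties; 'Title' is always kept."""
    return [p.name for p in properties if p.included or p.name == "Title"]


# Example model: "Methods" is listed but not ticked, so it is ignored.
model = [
    Property("Title", included=True),
    Property("Theme", included=True),
    Property("Methods"),
    Property("Evaluation Approach", included=True),
]

selected = selected_properties(model)
print(selected)
```

The resulting list plays the same role as the selection returned by the checkbox widget in the next cell.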

In [131]:
from corpus.widgets.widgets_util import DynamicCheckboxList
dynamic_checkbox_list = DynamicCheckboxList()
dynamic_checkbox_list.display_interface()

Text(value='', placeholder='Enter a new option')

Button(description='Add Option', style=ButtonStyle())

Button(description='Delete Selected Options', style=ButtonStyle())

VBox(children=(Checkbox(value=False, description='Title'), Checkbox(value=False, description='Theme'), Checkbo…

In [117]:
data = {
    # Keys must not contain trailing spaces, otherwise lookups by name fail.
    "columns": {
        "title": "paper:title",
        "authors": "paper:authors",
        "publication_month": "paper:publication_month",
        "publication_year": "paper:publication_year",
        "research_field": "paper:research_field",
        "doi": "paper:doi",
        "url": "paper:url",
        "published_in": "paper:published_in",
        "research_problem": "contribution:research_problem",
        "extraction_method": "contribution:extraction_method"
    },
    "rules": {
        "authors": "Authors should be separated by a ;"
    }
}

elements = dynamic_checkbox_list.get_selected_options()

# Build one prompt line per selected property, appending its rule when one exists.
output_lines = []
for element in elements:
    element_lower = element.lower()
    if element_lower in data["rules"]:
        rule_message = data["rules"][element_lower]
        output_lines.append(f"{element}: {rule_message}")
    else:
        output_lines.append(element)

result_string = "\n".join(output_lines)

print(result_string)


Title
Theme
Evaluation Approach


**Task 2: Information extraction**

In this phase, the workflow leverages a pre-trained language model to automatically extract information from the corpus created in the previous stage. This step is crucial for transforming raw textual data into a format aligned with the data model you have chosen.

To start the process, click the "Query and display results" button below.
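Each response from the model is expected to be a JSON object whose fields carry an `Information` value and a `Confidence` score. The following is a minimal sketch of how one such response could be parsed and checked. The `parse_extraction` helper and the `min_confidence` threshold are hypothetical, and the 0 to 1 confidence scale is an assumption, since the prompt does not fix a scale.

```python
import json


def parse_extraction(raw, min_confidence=0.5):
    """Parse one JSON response; return (values, fields needing review).

    Hypothetical helper: the threshold and the 0-1 scale are assumptions.
    """
    parsed = json.loads(raw)
    values, flagged = {}, []
    for field, entry in parsed.items():
        # Tolerate entries that do not follow the expected object shape.
        if not isinstance(entry, dict):
            values[field] = entry
            flagged.append(field)
            continue
        info = entry.get("Information", "NOT FOUND")
        conf = entry.get("Confidence")
        values[field] = info
        # Flag missing values, missing scores, and low-confidence values.
        if info == "NOT FOUND" or not isinstance(conf, (int, float)) or conf < min_confidence:
            flagged.append(field)
    return values, flagged


sample = (
    '{"Title": {"Information": "A Survey", "Confidence": 0.95}, '
    '"Theme": {"Information": "NOT FOUND", "Confidence": 0.0}}'
)
values, flagged = parse_extraction(sample)
print(values)
print(flagged)
```

Flagged fields are natural candidates for manual review in the validation step.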

In [134]:
from mistralai import Mistral
import os
from corpus.embedding.document_util import DocumentUtil
from dotenv import load_dotenv
import ipywidgets as widgets
from IPython.display import display, clear_output
import json
import pandas as pd
from ipydatagrid import DataGrid, TextRenderer, VegaExpr

load_dotenv()

def process_pdfs(folder_path, result_string):
    global results
    results = []

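    # List the PDFs once so the progress bar maximum and the loop use the same files.
    pdf_files = [f for f in os.listdir(folder_path) if f.lower().endswith('.pdf')]

    progress_bar = widgets.IntProgress(
        value=0,
        min=0,
        max=len(pdf_files),
        description='Processing:',
        bar_style='info',
        orientation='horizontal'
    )
    display(progress_bar)

    # Enumerate only the PDFs so that idx + 1 is the number of processed documents.
    for idx, filename in enumerate(pdf_files):
        if filename.lower().endswith('.pdf'):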
            file_path = os.path.join(folder_path, filename)            
            test_doc = DocumentUtil.get_text_without_references(DocumentUtil, file_path)
            
            prompt = f"""
                You are an information extraction system specialized in retrieving specified key information from scientific texts.
                
                Extract and provide the following key details from the given text: 
                
                {result_string}. 
                
                Use only the information available in the Scientific Text, without any external knowledge. If information is not found, return 'NOT FOUND' for that specific key.
                
                For each item, provide the extracted information along with an associated confidence score.
                
                You will respond only with a JSON object where each field is structured as follows:
                
                "FieldName": {{ "Information": "<extracted information>", "Confidence": <confidence score> }}
                Do not include explanations or extra text. Keep the sentences short and precise.
                
                Scientific Text:
                {test_doc}
            """
            
            s = Mistral(
                api_key=os.getenv('MISTRALAI'),
            )
            
            res = s.chat.complete(
                model="mistral-large-latest",
                temperature=0,
                messages=[
                    {
                        "content": prompt,
                        "role": "user",
                    },
                ],
                response_format={"type": "json_object"}
            )
            
            if res is not None:
                results.append(res.choices[0].message.content)
            
            progress_bar.value = idx + 1
    
    clear_output(wait=True)
    print("Processing completed.")
    return results

# A single button starts the extraction (an earlier definition was immediately overwritten).
button = widgets.Button(description="Query and display results", layout=widgets.Layout(width="200px"))
output = widgets.Output()

display(button, output)

def button_click(b):
    with output:
        folder_path = "corpus_result"
        results = process_pdfs(folder_path, result_string)


button.on_click(button_click)


Button(description='Query and display results', layout=Layout(width='200px'), style=ButtonStyle())

Output()

**Task 3: Display and Validate**

The **"Display results"** button below shows a grid with the information extracted from the corpus. In this step, you can review, edit, and validate individual cells and rows of the table.

To deselect or reselect rows, use the first column, "Selected". Double-clicking a cell in that column displays a checkbox. Deselecting a row excludes the whole paper from the rest of the process.
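The effect of the "Selected" column can be illustrated with plain pandas, independent of the grid widget. This is a small sketch on made-up rows (the titles and themes are placeholders, not extracted data): deselected papers are dropped and `NOT FOUND` cells become missing values.

```python
import pandas as pd

# Placeholder rows standing in for the extracted grid contents.
df = pd.DataFrame({
    "Selected": [True, False, True],
    "paper:title": ["Paper A", "Paper B", "Paper C"],
    "Theme": ["NLP", "NOT FOUND", "NOT FOUND"],
})

# Keep only the selected rows, then drop the helper column.
mask = df["Selected"].fillna(False).astype(bool)
kept = df[mask].drop(columns=["Selected"])

# Turn the 'NOT FOUND' marker into a proper missing value.
kept = kept.replace("NOT FOUND", pd.NA)
print(kept)
```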

In [137]:
import json
import pandas as pd
from ipydatagrid import DataGrid, TextRenderer, VegaExpr
from IPython.display import display
import ipywidgets as widgets

column_mapping = {
    "title": "paper:title",
    "authors": "paper:authors",
    "publication_month": "paper:publication_month",
    "publication_year": "paper:publication_year",
    "research_field": "paper:research_field",
    "doi": "paper:doi",
    "url": "paper:url",
    "published_in": "paper:published_in",
    "research_problem": "contribution:research_problem",
    "extraction_method": "contribution:extraction_method"
}

parsed_data = []
for json_string in results:
    try:
        parsed_data.append(json.loads(json_string))
    except TypeError:
        parsed_data.append(json_string)
    except json.JSONDecodeError as e:
        print("Invalid JSON string:", e)

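def flatten_json(json_obj):
    """Map each field to its column name and keep only the extracted value."""
    flattened = {}
    for key, value in json_obj.items():
        mapped_key = column_mapping.get(key.lower(), key)
        # Fall back to the raw value if the model did not return the expected object.
        flattened[mapped_key] = value.get("Information", "NOT FOUND") if isinstance(value, dict) else value
    return flattened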

flattened_data = [flatten_json(json_obj) for json_obj in parsed_data]

df = pd.DataFrame(flattened_data)
df["Selected"] = True 
columns = ["Selected"] + [col for col in df.columns if col != "Selected"]
df = df[columns]

csv_file = "doi_list.csv"
additional_column = pd.read_csv(csv_file)

df = pd.concat([df, additional_column], axis=1)

renderer = TextRenderer(
    background_color=VegaExpr(
        "cell.value == 'NOT FOUND' ? 'red' : 'white'"
    )
)

huge_datagrid = DataGrid(
    df,
    base_row_size=30,
    base_column_size=150,
    layout={"height": "400px", "width": "100%"},
    default_renderer=renderer,
    editable=True
)

checkboxes = [
    widgets.Checkbox(value=True, layout=widgets.Layout(margin="0 0 0 5px")) for _ in range(len(df))
]
checkbox_column = widgets.VBox(
    checkboxes, layout=widgets.Layout(align_items="stretch", margin="0 5px 0 0")
)
def validate_and_save(_):
    updated_df = huge_datagrid.data
    # Treat missing "Selected" values as deselected so the boolean mask is valid.
    filtered_df = updated_df[updated_df["Selected"].fillna(False)]
    filtered_df = filtered_df.drop(columns=["Selected"])
    filtered_df = filtered_df.replace("NOT FOUND", pd.NA)
    filtered_df.to_csv("filtered_data.csv", index=False)
    print("Filtered data saved to 'filtered_data.csv'")

validate_button = widgets.Button(description="Validate & Save")
validate_button.on_click(validate_and_save)

layout = widgets.VBox([huge_datagrid], layout=widgets.Layout(gap="10px"))


button = widgets.Button(description="Display results", layout=widgets.Layout(width="200px"))
output_task3 = widgets.Output()

display(button, output_task3)

def button_click(b):
    with output_task3:
        display(layout)



button.on_click(button_click)

Button(description='Display results', layout=Layout(width='200px'), style=ButtonStyle())

Output()

The **"Validate & Save"** button below saves the selected rows of the grid to a CSV file.

In [139]:
output_task4 = widgets.Output()

def validate_and_save(_):
    with output_task4:
        updated_df = huge_datagrid.data 
        filtered_df = updated_df[updated_df["Selected"].fillna(False)]
        filtered_df = filtered_df.drop(columns=["Selected"])
        filtered_df = filtered_df.replace("NOT FOUND", pd.NA)
        filtered_df.to_csv("filtered_data.csv", index=False)
        print("Filtered data saved to 'filtered_data.csv'")

validate_button = widgets.Button(description="Validate & Save")
validate_button.on_click(validate_and_save)

display(validate_button, output_task4)

Button(description='Validate & Save', style=ButtonStyle())

Output()

In [140]:
import pandas as pd
import requests
import json
import ipywidgets as widgets
from IPython.display import display

api_url = "https://labs.tib.eu/falcon/falcon2/api"
headers = {"Content-Type": "application/json"}

csv_path = "filtered_data.csv"
output_path = "to_import_orkg.csv"

def extract_entities(text):
    """Return the surface forms of the Wikidata entities Falcon finds in `text`."""
    payload = {"text": text}
    response = requests.post(f"{api_url}?mode=long", headers=headers, data=json.dumps(payload))
    if response.status_code == 200:
        result = response.json()
        entities = result.get("entities_wikidata", [])
        return [entity["surface form"] for entity in entities]
    else:
        print(f"Falcon was unable to extract an entity (status {response.status_code}).")
        return []

def process_csv_and_extract_entities(_):
    output_task5.clear_output() 
    with output_task5:
        try:
            df = pd.read_csv(csv_path)
            new_data = df.copy()

            for column in df.columns:
                if ":" not in column.lower():
                    for index, cell in enumerate(df[column]):
                        if pd.notna(cell):
                            entities = extract_entities(str(cell))
                            if entities:
                                new_data.loc[index, column] = entities[0]

                                for i, entity in enumerate(entities[1:], start=1):
                                    new_col_name = f"{column.lower()}({i})"
                                    insert_pos = new_data.columns.get_loc(column) + i

                                    if new_col_name not in new_data.columns:
                                        new_data.insert(insert_pos, new_col_name, "")

                                    new_data.loc[index, new_col_name] = entity

            # Normalize all column names to lowercase.
            new_data.columns = [col.lower() for col in new_data.columns]

            # Move the last column to the front of the table.
            columns = [new_data.columns[-1]] + list(new_data.columns[:-1])
            new_data = new_data.reindex(columns=columns)

            new_data.to_csv(output_path, index=False)
            print(f"Entity extraction completed. Results saved to {output_path}")
        except Exception as e:
            print(f"An error occurred: {e}")

button = widgets.Button(description="Process CSV and Extract Entities", layout=widgets.Layout(width="300px"))
output_task5 = widgets.Output()

button.on_click(process_csv_and_extract_entities)

display(button, output_task5)


Button(description='Process CSV and Extract Entities', layout=Layout(width='300px'), style=ButtonStyle())

Output()
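The cell above writes the first entity found in a cell back into the original column and places any further entities in numbered columns such as `theme(1)`. The sketch below reproduces only that column-expansion logic on made-up rows. It does not call Falcon: `fake_extract_entities` is a hypothetical stand-in that splits on commas, and `expand_entities` is an illustrative name.

```python
import pandas as pd


def fake_extract_entities(text):
    # Hypothetical stand-in for the Falcon request: split the text on commas.
    return [part.strip() for part in text.split(",") if part.strip()]


def expand_entities(df, extract):
    """Spread the entities of each unmapped cell over numbered columns."""
    out = df.copy()
    for column in df.columns:
        if ":" in column:
            continue  # mapped columns such as "paper:title" are left untouched
        for index, cell in df[column].items():
            if pd.isna(cell):
                continue
            entities = extract(str(cell))
            if not entities:
                continue
            # The first entity replaces the original cell value.
            out.loc[index, column] = entities[0]
            # Further entities go to "column(1)", "column(2)", and so on.
            for i, entity in enumerate(entities[1:], start=1):
                new_col = f"{column.lower()}({i})"
                if new_col not in out.columns:
                    out.insert(out.columns.get_loc(column) + i, new_col, "")
                out.loc[index, new_col] = entity
    return out


# Placeholder rows, not real extraction results.
df = pd.DataFrame({
    "paper:title": ["Paper A", "Paper B"],
    "Theme": ["NLP, knowledge graphs", "retrieval"],
})
expanded = expand_entities(df, fake_extract_entities)
print(expanded)
```

Rows with fewer entities keep an empty string in the extra columns, which matches the default value used when a new column is inserted.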