# HITL-SSC Import 

In this stage of the workflow, the goal is to prepare and structure the data for import into ORKG.

We provide tasks that bring the CSV file generated in the previous stage into a format suitable for import.

For this, we follow the [formatting rules of ORKG](https://orkg.org/help-center/article/16/Import_CSV_files_in_ORKG).

Each row in the import CSV file represents a paper, and each column represents a key piece of data about that paper.
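
As a rough illustration of this layout, the sketch below builds a two-row table with pandas and prints it as CSV. The titles, author names, and years are made-up placeholders, not data from the workflow; only the `paper:` column labels come from the ORKG formatting rules.

```python
import io
import pandas as pd

# Placeholder rows: one paper per row, one ORKG column label per column.
example = pd.DataFrame(
    {
        "paper:title": ["Example paper A", "Example paper B"],
        "paper:authors": ["Author One", "Author Two"],
        "paper:publication_year": [2023, 2024],
    }
)

buffer = io.StringIO()
example.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
csv_text = buffer.getvalue()
print(csv_text)
```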

We present the task **Format CSV**, which creates a new CSV file ready for import based on the information extracted in the previous stage.

The goal is to manually map each extracted piece of key information to its respective column label.



In [16]:
import pandas as pd

existing_file_path = "filtered_dataframe.csv"  
existing_df = pd.read_csv(existing_file_path)

new_columns = {
    "paper:title": "Title - Information",
    "paper:authors": "Authors - Information",
    "Keywords": "Keywords - Information",
    "contribution:research_problem": "Research question - Information"
}

new_df = pd.DataFrame()

for new_col, existing_col in new_columns.items():
    if existing_col in existing_df.columns:
        new_df[new_col] = existing_df[existing_col]
    else:
        print(f"Warning: {existing_col} does not exist in the original CSV")

new_file_path = "new_mapped_dataframe.csv"
new_df.to_csv(new_file_path, index=False)

print(f"The new mapped DataFrame has been saved to {new_file_path}")


The new mapped DataFrame has been saved to new_mapped_dataframe.csv


The title column is required for the ORKG import and must be mapped accordingly.

In this task, you manually enter the label that corresponds to each key element.

Required columns

**paper:title** Title of the paper

Optional columns

**paper:authors** Paper authors

**paper:publication_month** Numeric value of the publication month

**paper:publication_year** Numeric value of the publication year

**paper:research_field** Research field ID

**paper:doi** The DOI of the paper (e.g. 10.1145/3360901.3364435)

**paper:url** The URL of the paper (in case no DOI is provided, the URL is displayed instead)

**paper:published_in** The venue of the paper

**contribution:research_problem** A research problem

**contribution:extraction_method** The extraction method of the contribution.

It is also possible to add additional columns for key values; their column names will be used as predicates in the ORKG.
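
A quick check before importing can catch a missing title column early. The sketch below is one possible way to do this, not part of the original workflow: the helper name `check_orkg_columns` is hypothetical, and the sets of known labels are copied from the list above rather than fetched from ORKG.

```python
import pandas as pd

REQUIRED_COLUMNS = {"paper:title"}
OPTIONAL_COLUMNS = {
    "paper:authors",
    "paper:publication_month",
    "paper:publication_year",
    "paper:research_field",
    "paper:doi",
    "paper:url",
    "paper:published_in",
    "contribution:research_problem",
    "contribution:extraction_method",
}


def check_orkg_columns(frame):
    """Return (missing required columns, columns that would act as predicates)."""
    columns = set(frame.columns)
    missing = sorted(REQUIRED_COLUMNS - columns)
    # Any label outside the known lists is treated as an additional predicate column.
    predicates = sorted(columns - REQUIRED_COLUMNS - OPTIONAL_COLUMNS)
    return missing, predicates


# Hypothetical sample with one additional column, "Keywords".
sample = pd.DataFrame(columns=["paper:title", "paper:authors", "Keywords"])
missing, predicates = check_orkg_columns(sample)
print("Missing required columns:", missing)
print("Additional predicate columns:", predicates)
```

An empty list of missing columns means the file has the one label the import requires; the second list shows which columns would be imported under their own names.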

In [19]:
import pandas as pd
import ipywidgets as widgets
from IPython.display import display

csv_path = "filtered_dataframe.csv" 
df = pd.read_csv(csv_path)

existing_columns = df.columns

column_mapping = {}
rows = []

header = widgets.HBox([
    widgets.Label("Existing Column", layout=widgets.Layout(width='200px')),
    widgets.Label("New Column Name", layout=widgets.Layout(width='300px'))
])
rows.append(header)

for col in existing_columns:
    label = widgets.Label(col, layout=widgets.Layout(width='200px'))
    text_input = widgets.Text(layout=widgets.Layout(width='300px'))
    column_mapping[col] = text_input
    rows.append(widgets.HBox([label, text_input]))

table = widgets.VBox(rows)
display(table)

save_button = widgets.Button(description="Save Mappings")
output = widgets.Output()

def save_mappings(_):
    with output:
        output.clear_output()
        mappings = {col: widget.value for col, widget in column_mapping.items()}
        print("Column Mappings:")
        for existing_col, new_col in mappings.items():
            print(f"  {existing_col} -> {new_col}")
        
        pd.DataFrame.from_dict(mappings, orient="index", columns=["New Column Name"]).to_csv("column_mappings.csv")
        print("\nMappings saved to 'column_mappings.csv'.")

save_button.on_click(save_mappings)

display(save_button, output)


VBox(children=(HBox(children=(Label(value='Existing Column', layout=Layout(width='200px')), Label(value='New C…

Button(description='Save Mappings', style=ButtonStyle())

Output()
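
The cell above only saves the entered labels to `column_mappings.csv`; it does not apply them to the data. One possible way to apply them is sketched below, under the assumption that a blank entry means the column should be dropped. The in-memory text stands in for the saved file, and the column names and row values are placeholders.

```python
import io
import pandas as pd

# Stand-in for the contents of column_mappings.csv (first column: existing name).
mappings_csv = io.StringIO(
    ",New Column Name\n"
    "Title - Information,paper:title\n"
    "Authors - Information,paper:authors\n"
    "Notes,\n"  # left blank in the form, so this column is dropped
)
mapping = pd.read_csv(mappings_csv, index_col=0)["New Column Name"].dropna()

# Placeholder data with the same column names as the extracted file.
extracted = pd.DataFrame(
    {
        "Title - Information": ["Example paper A"],
        "Authors - Information": ["Author One"],
        "Notes": ["not needed"],
    }
)

# Keep only the mapped columns, then rename them to their new labels.
mapped = extracted[list(mapping.index)].rename(columns=mapping.to_dict())
print(mapped.columns.tolist())
```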

In [11]:
import pandas as pd
import requests
import json

file_path = "new_mapped_dataframe.csv"
df = pd.read_csv(file_path)

df.head()


Unnamed: 0,paper:title,paper:authors,Keywords,contribution:research_problem
0,A NLP-Oriented Methodology to Enhance Event Lo...,"Belén Ramos-Gutiérrez, F. Javier Ortega, Ángel...",Natural Language Processing - Event log qualit...,What are the most suitable NLP techniques to u...
1,An Empirical Investigation of the Intuitivenes...,"Gregor Polanˇciˇc, Pavlo Brin, Lucineia Helois...",Process landscape - Diagram - Semantic transpa...,RQ1: Are common landscape designs semantically...
2,Secure Multi-party Computation for Inter-organ...,"Gamal Elkoumy, Stephan A. Fahrenkrog-Petersen,...",Process mining - Privacy - Secure multi-party ...,
3,Factors Impacting Successful BPMS Adoption and...,"Ashley Koopman, Lisa F. Seymour",Business Process Management · Business Process...,How do organisational factors affect successfu...
4,Just Tell Me: Prompt Engineering in Business P...,"Kiran Busch, Alexander Rochlitzer, Diana Sola,...",,


In [12]:
def extract_entities(text, api_url="https://labs.tib.eu/falcon/falcon2/api"):
    """
    Sends the text to the Falcon 2.0 API and returns the linked Wikidata entities.
    Returns an empty list if the request fails.
    """

    headers = {
        "Content-Type": "application/json"
    }
    payload = {
        "text": text
    }
    try:
        response = requests.post(f"{api_url}?mode=long", headers=headers, data=json.dumps(payload))
        if response.status_code == 200:
            result = response.json()
            # Fall back to an empty list when the response has no Wikidata entities.
            entities = result.get("entities_wikidata", [])
            return entities
        else:
            print(f"API Error: {response.status_code} - {response.text}")
            return []
    except Exception as e:
        print(f"Exception occurred: {e}")
        return []


In [13]:

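# Columns whose text is sent to the entity linker.
columns_to_process = ["Keywords"]

for col in columns_to_process:
    # The extracted entities overwrite the original column; empty cells yield an empty list.
    df[col] = df[col].apply(lambda x: extract_entities(x) if pd.notnull(x) else [])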


In [15]:
new_file_path = "new_mapped_dataframe1.csv" 
df.to_csv(new_file_path, index=False)

print(f"The new mapped DataFrame has been saved to {new_file_path}")

The new mapped DataFrame has been saved to new_mapped_dataframe1.csv
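
Note that `to_csv` writes a Python list as its string representation, so the entity lists come back as plain strings when the file is read again. If the lists are needed later, one option is to store each cell as a JSON string and decode it on load. The sketch below uses made-up entity values; nothing is assumed about the structure of the Falcon response beyond it being JSON-serialisable, and whether ORKG accepts such values on import is not covered here.

```python
import io
import json
import pandas as pd

# Placeholder entity lists standing in for the output of extract_entities.
frame = pd.DataFrame(
    {
        "paper:title": ["Example paper A", "Example paper B"],
        "Keywords": [[{"label": "process mining"}], []],
    }
)

# Encode each list as JSON text before writing.
frame["Keywords"] = frame["Keywords"].apply(json.dumps)
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Decode the JSON text again after reading.
buffer.seek(0)
restored = pd.read_csv(buffer)
restored["Keywords"] = restored["Keywords"].apply(json.loads)
print(restored["Keywords"].tolist())
```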


In [18]:
import ipywidgets as widgets
from IPython.display import display, HTML

javascript_functions = {False: "hide()", True: "show()"}
button_descriptions  = {False: "Show code", True: "Hide code"}


def toggle_code(state):

    """
    Toggles the JavaScript show()/hide() function on the div.input element.
    """

    output_string = "<script>$(\"div.input\").{}</script>"
    output_args   = (javascript_functions[state],)
    output        = output_string.format(*output_args)

    display(HTML(output))


def button_action(value):

    """
    Calls the toggle_code function and updates the button description.
    """

    state = value.new

    toggle_code(state)

    value.owner.description = button_descriptions[state]


state = False
toggle_code(state)

button = widgets.ToggleButton(state, description = button_descriptions[state])
button.observe(button_action, "value")

display(button)

ToggleButton(value=False, description='Show code')