# HITL-SCC_Workflow Stage 1: Corpus Setup

This stage is the first step of the Scientific Content Creation Workflow, a user-interactive pipeline for systematically extracting knowledge from a corpus of scientific literature.

It is meant to give you better insight into the key contents of the corpus.

This notebook provides a step-by-step, instruction-based approach: from setting up the corpus, through extracting and representing the knowledge relevant to your research interest, to creating a ready-to-import CSV file for the ORKG.

The first stage supports you in retrieving a set of relevant literature.

Based on your research interest, it helps you formalize effective search queries.

<div class="alert alert-block alert-info"> <b>Note:</b> Completing the tasks in this stage creates the scientific corpus.
</div>

**Task 1: Formalize research interest**

Begin by entering your research interest in the provided text area. This could be a concise statement summarizing your field of study or a specific problem you are investigating.

An LLM assists you in generating search keywords from this statement.

Click the **'Generate Keywords'** button to let the integrated LLM analyze your input and produce a list of keywords relevant to your research interest.

The generated keywords will help define your search queries needed for the next task.
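Because the LLM is asked to answer in bullet points, its reply is a single block of text. The following is a minimal sketch of how that reply could be turned into a clean Python list before building a query. The helper name `parse_keywords` and the sample reply are assumptions made for illustration; they are not part of the notebook.

```python
import re

def parse_keywords(llm_response: str) -> list[str]:
    """Turn a bullet-point LLM reply into a de-duplicated keyword list."""
    keywords = []
    seen = set()
    for line in llm_response.splitlines():
        # Strip a leading bullet marker such as "-", "*", "•" or "1." plus whitespace
        cleaned = re.sub(r"^\s*(?:[-*•]+|\d+[.)])\s*", "", line).strip()
        # Drop Markdown bold markers the model may add around a keyword
        cleaned = cleaned.replace("**", "")
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            keywords.append(cleaned)
    return keywords

# Hypothetical reply, for illustration only
sample_reply = (
    "- knowledge graphs\n"
    "* Information extraction\n"
    "• knowledge graphs\n"
    "1. scholarly communication"
)
print(parse_keywords(sample_reply))
```

Duplicates are compared case-insensitively, and the first spelling encountered is kept.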

In [7]:
import os
import ipywidgets as widgets
from IPython.display import display
from mistralai import Mistral 
from dotenv import load_dotenv

load_dotenv()

research_interest_input = widgets.Textarea(
    value='', 
    placeholder='Enter your formalized research interest here...',
    description='Research Interest:',
    layout=widgets.Layout(width='80%', height='80px')
)

generate_button = widgets.Button(
    description='Generate keywords',
    button_style='success',
    tooltip='Click to generate keywords based on your research interest',
    icon='search'
)

output_task1 = widgets.Output()

s = Mistral(api_key=os.getenv('MISTRALAI'))

def generate_queries(_):

    with output_task1:
        output_task1.clear_output()
        research_interest = research_interest_input.value.strip()
        if not research_interest:
            print("Please provide a valid research interest.")
            return
        # Show the loading indicator only once the input is known to be valid
        loading_label = widgets.Label(value="⏳ Loading...")
        display(loading_label)
        
        prompt = (
            f"You are an expert researcher. Based on the following research interest, "
            f"generate a list of keywords that could help retrieve relevant academic papers. "
            f"You will respond ONLY in bullet points as keywords. "
            f"Research Interest: {research_interest}"
        )
        
        try:
            res = s.chat.complete(
                model="mistral-large-latest",
                messages=[
                    {"content": prompt, "role": "user"}
                ]
            )
            
            generated_queries = res.choices[0].message.content
            # clear_output is not imported in this cell; clear the Output widget directly
            output_task1.clear_output()
            if not generated_queries:
                print("No keywords were generated. Please refine your research interest.")
            else:
                
                print("Generated keywords:")
                print(generated_queries)
        
        except Exception as e:
            print(f"An error occurred while generating keywords: {e}")

generate_button.on_click(generate_queries)

display(research_interest_input, generate_button, output_task1)



**Task 2: Define search query**

Using the keywords generated in the previous step, construct a search query and input it into the text field labeled **'Search Query'**.

Feel free to modify or enhance the search query based on your specific needs or additional insights.
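One common way to combine keywords is to join synonyms with `OR` and distinct concepts with `AND`. The sketch below illustrates this under the assumption that the chosen source accepts Boolean operators and quoted phrases, which should be checked for each source. `build_query` is a hypothetical helper, not part of the notebook.

```python
def build_query(keyword_groups: list[list[str]]) -> str:
    """Join synonyms with OR inside a group and join the groups with AND."""
    clauses = []
    for group in keyword_groups:
        # Quote multi-word terms so they are searched as phrases
        terms = [f'"{term}"' if " " in term else term for term in group if term.strip()]
        if not terms:
            continue
        # Parenthesize only groups that contain more than one term
        clauses.append("(" + " OR ".join(terms) + ")" if len(terms) > 1 else terms[0])
    return " AND ".join(clauses)

query = build_query([["knowledge graph", "ontology"], ["extraction"]])
print(query)  # ("knowledge graph" OR ontology) AND extraction
```

The resulting string can be pasted into the search query field and edited by hand.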

In [10]:
import ipywidgets as widgets
from IPython.display import display

papers_search_query=''

input_widget = widgets.Text(
    value='',
    placeholder='Your search query here',
    disabled=False
)
def save_input(change):
    global papers_search_query
    papers_search_query = change['new']

input_widget.observe(save_input, names='value')

display(input_widget)


**Task 3: Define search limit size**

The next task is to define how many publications to retrieve.

Adjust the search limit using the interactive slider. This allows you to control the number of results retrieved from each source.

Move the slider to select a number that reflects the scope of your search. The default is 20 documents per source.

In [12]:
from corpus.widgets.widgets_util import DocumentLimitSlider
slider_widget = DocumentLimitSlider()
get_value_func = slider_widget.display_slider()



**Task 4: Define sources**

Select the sources you wish to query using the checkboxes provided below. Available sources may include databases such as **arXiv**, **Semantic Scholar**, and **Google Scholar**.


In [14]:
from corpus.widgets.widgets_util import DynamicSourceList
dynamic_checkbox_list = DynamicSourceList()
dynamic_checkbox_list.display_interface()


**Task 5: Choose Whether to Display Only Results with Available PDFs or Include Abstract Results**

Use the provided toggle buttons below to decide how comprehensive your results should be: select whether you want to view only elements with available PDFs or include those with abstracts only.

If you prioritize full access to documents, choose the option to display only results with PDFs. This ensures that all retrieved items can be downloaded and read in full.

Alternatively, if abstracts are sufficient for your initial exploration or if PDFs are unavailable, select the option to include results with abstracts only.
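The effect of the two options can be sketched as a filter over result rows. The example assumes each row has the layout `[title, authors, abstract, date, pdf_link]`, which is how results are collected in Task 6. The function name `filter_by_availability` and the sample rows are made up for illustration, and dropping rows that have neither a PDF nor an abstract is a design choice of this sketch.

```python
def filter_by_availability(rows, include_abstract_only=False):
    """Filter result rows shaped like [title, authors, abstract, date, pdf_link]."""
    if include_abstract_only:
        # "Abstracts and PDF": keep rows that offer a PDF link or at least an abstract
        return [row for row in rows if row[4] is not None or row[2]]
    # "PDF only": keep rows that have a PDF link
    return [row for row in rows if row[4] is not None]

# Made-up rows for illustration only
sample_rows = [
    ["Paper A", "Author One", "Abstract A", "2024-01-01", "https://example.org/a.pdf"],
    ["Paper B", "Author Two", "Abstract B", None, None],
    ["Paper C", "Author Three", None, None, None],
]
print([row[0] for row in filter_by_availability(sample_rows)])
print([row[0] for row in filter_by_availability(sample_rows, include_abstract_only=True)])
```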

In [16]:
import ipywidgets as widgets
from IPython.display import display

toggle_include_abstracts = widgets.ToggleButton(
    value=False,
    description="Abstracts and PDF",
    tooltip="Include abstracts when no PDF is found"
)

toggle_only_pdfs = widgets.ToggleButton(
    value=True, 
    description="PDF only",
    tooltip="Include sources only if PDFs are available"
)

def on_toggle_change(change):
    if change['owner'] == toggle_include_abstracts and change['new']:
        toggle_only_pdfs.value = False
    elif change['owner'] == toggle_only_pdfs and change['new']:
        toggle_include_abstracts.value = False

toggle_include_abstracts.observe(on_toggle_change, names='value')
toggle_only_pdfs.observe(on_toggle_change, names='value')

display(toggle_only_pdfs, toggle_include_abstracts)



**Task 6: Display results**

Once all inputs have been configured (research interest, search query, sources, search limit, and result type preference), start the search by clicking the "Query and display results" button.

Carefully review the retrieved items and use the checkboxes to mark the ones most relevant to your research.

In [18]:
from semanticscholar import SemanticScholar
from corpus.util.arxiv_api import Arxiv
from serpapi import GoogleScholarSearch
import ipywidgets as widgets
from IPython.display import display, clear_output
import os
from dotenv import load_dotenv

load_dotenv()


def truncate_text(text, limit):
    if len(text) > limit:
        return text[:limit] + "..."
    return text

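def process_and_display_sources(query, sources, max_results=None):
    global results
    # Read the slider when the search runs; a default argument would be frozen at definition time
    if max_results is None:
        max_results = slider_widget.get_current_value()
    results = []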
    progress_bar = widgets.IntProgress(
        value=0,
        min=0,
        max=len(sources),
        description='Progress:',
        bar_style='info',
        orientation='horizontal'
    )
    display(progress_bar)

    for idx, source in enumerate(sources):
        if source == "Arxiv":
            arxiv_instance = Arxiv()
            results += arxiv_instance.query_arxiv(query, max_results, "relevance", True)
        
        elif source == "Semantic Scholar":
            sch = SemanticScholar()
            smscholar = sch.search_paper(query, limit=max_results)
            for i in smscholar.items:
                results.append([
                    i.title, 
                    ", ".join(author['name'] for author in i.authors), 
                    i.abstract, 
                    i.publicationDate, 
                    None if i.openAccessPdf is None else i.openAccessPdf.get('url')
                ])
        
        elif source == "Google Scholar":
            params = {
                "api_key": os.getenv('GOOGLESCHOLAR'),
                "engine": "google_scholar",
                "hl": "en",
                "q": query,
                "num":max_results
            }
            search = GoogleScholarSearch(params)
            gsresult = search.get_dict()
            for i in gsresult.get("organic_results", []):
                info = i.get("publication_info", {})
                if info.get("authors"):
                    authors = "; ".join(author["name"] for author in info["authors"])
                else:
                    # "summary" is a plain string; joining it would separate every character
                    authors = info.get("summary", "")
                resources = i.get("resources")
                results.append([
                    i["title"],
                    authors,
                    i.get("snippet"),
                    None,
                    # Use the first resource link when one is available
                    resources[0]["link"] if resources else None
                ])
        
        else:
            print(f"Unknown source specified: {source}")
        
        progress_bar.value = idx + 1
    
    clear_output(wait=True)

    accordion = widgets.Accordion(
        layout=widgets.Layout(width='900px', overflow='hidden', white_space='nowrap', text_overflow='ellipsis')
    )
    global checkboxes
    checkboxes = []
    accordion_children = []

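    # Apply the Task 5 choice: drop entries without a PDF link unless "Abstracts and PDF" is selected.
    # `results` is reassigned so the checkbox positions match the indexes used in the download step.
    if not toggle_include_abstracts.value:
        results = [item for item in results if item[4] is not None]
    filter_results = results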

    for i, result in enumerate(filter_results):
        title, authors, abstract, date, link = result
        
        checkbox = widgets.Checkbox(value=False, layout=widgets.Layout(margin='7.4px 0px'))
        checkboxes.append(checkbox)
        
        text_widget = widgets.HTMLMath(
            value=f"""
            <h4>{title}</h4>
            <p><strong>Abstract:</strong> {abstract or 'No abstract available.'}</p>
            <p><strong>Authors:</strong> {authors}</p>
            <p><strong>Published:</strong> {date or 'Date not available'}</p>
                    <p>
            <a href="{link}" target="_blank" style="color: blue; text-decoration: underline;">
                Read more
            </a>
        </p>
            """
        )
        section_content = widgets.VBox([text_widget])
        accordion_children.append(section_content)

    accordion.children = tuple(accordion_children)


    for i, result in enumerate(filter_results):
        accordion.set_title(i, truncate_text(result[0],130))

    checkbox_container = widgets.VBox(
        checkboxes,
        layout=widgets.Layout(align_items='flex-start', margin='0px')
    )
    
    accordion.layout.margin = '0px 0px 0px -180px'
    layout = widgets.HBox([checkbox_container, accordion], layout=widgets.Layout(spacing='0px'))

    display(layout)


button_query = widgets.Button(description="Query and display results", layout=widgets.Layout(width="200px"))
output_query = widgets.Output()

display(button_query, output_query)

def button_click(b):
    with output_query:
        sources = dynamic_checkbox_list.get_selected_options()
        query = papers_search_query
        process_and_display_sources(query, sources)
        


button_query.on_click(button_click)



**Task 6 (continued): Click the Button to Download Selected Papers**

Once your selection is complete, click the 'Download Selected Papers' button provided below.

In [20]:
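from IPython.display import display
import ipywidgets as widgets
import os
import re
import requests

save_folder = "./corpus_result/"
# Create the target folder up front so saving a PDF cannot fail on a missing directory
os.makedirs(save_folder, exist_ok=True)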


def sanitize_filename(filename):
    sanitized = re.sub(r'[^\w\-_\.]', '_', filename)
    return sanitized

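def download_pdf(name, url, save_path):
    try:
        # A timeout keeps one unresponsive server from blocking the whole batch
        response = requests.get(url, timeout=30)
        if response.status_code == 200:
            with open(save_path, "wb") as f:
                f.write(response.content)
            # Report success only after the file has been written
            print(f"'{name}' downloaded successfully.")
        else:
            print(f"Failed to download {name}. Status code: {response.status_code}")
    except (requests.RequestException, OSError):
        print(f"{name}: No PDF accessible to download")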



button = widgets.Button(description="Download Selected Papers")
output_task6 = widgets.Output()

display(button, output_task6)

def on_button_clicked(b):
    with output_task6:
        output_task6.clear_output()
        selected_indexes = [i for i, checkbox in enumerate(checkboxes) if checkbox.value]
        for index in selected_indexes:
            file_name = f"{sanitize_filename(results[index][0])}.pdf"
            save_path = f"{save_folder}{file_name}"
            download_pdf(results[index][0], results[index][4], save_path)
        # Keep the per-file messages visible so failed downloads can be spotted
        download_label = widgets.Label(value="✅ Task completed!")
        display(download_label)


button.on_click(on_button_clicked)


<div class="alert alert-block alert-info"> <b>Note:</b> After you click the 'Download Selected Papers' button, the selected papers are downloaded and added to your scientific corpus in the folder "corpus_result". You can revisit the folder and go through the PDF files if you wish.</div>

**(Optional) Task 7: Retry the Process for Additional Papers**

If you want to expand or refine your corpus further, you can revisit the previous steps to adjust your search parameters. This includes updating the search query, changing the search limit, or selecting different sources. To do so, rerun all cells from the start by clicking the "Fast Forward" button in the notebook tab bar.

**Task 8: Digital Object Identifier**

This task automates extracting the DOI (Digital Object Identifier) of each downloaded paper. The DOIs are saved in the CSV file "doi_list.csv".

This task is crucial to facilitate identifying the papers in later steps.
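As an illustration of how a later step might consume this file, the sketch below reads the identifiers back with Python's `csv` module. It assumes the single `paper:doi` header column written by the cell that follows. `load_dois` is a hypothetical helper, and the demo uses a temporary file with placeholder identifiers rather than real papers.

```python
import csv
import os
import tempfile

def load_dois(csv_path="doi_list.csv"):
    """Read the 'paper:doi' column and skip empty entries."""
    with open(csv_path, newline="") as csv_file:
        reader = csv.DictReader(csv_file)
        return [
            row["paper:doi"].strip()
            for row in reader
            if (row.get("paper:doi") or "").strip()
        ]

# Demo with a temporary file; the identifiers are placeholders, not real DOIs
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "doi_list.csv")
    with open(demo_path, "w", newline="") as demo_file:
        writer = csv.writer(demo_file)
        writer.writerows([["paper:doi"], ["10.1234/example.1"], [""], ["10.1234/example.2"]])
    demo_dois = load_dois(demo_path)

print(demo_dois)
```

Empty entries are skipped so that papers without a detected identifier do not appear as blank items downstream.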

In [24]:

import ipywidgets as widgets
from IPython.display import display, HTML
import pdf2doi
import csv

pdf2doi.config.set('verbose', False)

def process_pdfs_and_save_doi(_):
    output_task8.clear_output()
    with output_task8:
        try:
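            extraction = pdf2doi.pdf2doi('./corpus_result/')
            # Skip files for which no identifier was found, so the CSV has no empty rows
            doi_list = [entry["identifier"] for entry in extraction if entry["identifier"]]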

            output_file = "doi_list.csv"
            with open(output_file, mode='w', newline='') as csv_file:
                writer = csv.writer(csv_file)
                writer.writerow(["paper:doi"])
                writer.writerows([[doi] for doi in doi_list])

            print(f"DOIs saved to {output_file}")
            stage1_note = """
                <div style="border: 2px solid gold; border-radius: 10px; padding: 10px; background-color: #FFD700; color: black; font-weight: bold; text-align: center;">
                    This stage is complete. Please proceed to the next stage.
                    </div>
            """
            display(HTML(stage1_note))
        except Exception as e:
            print(f"An error occurred: {e}")

button = widgets.Button(description="Save DOIs")
output_task8 = widgets.Output()

button.on_click(process_pdfs_and_save_doi)

display(button, output_task8)

