# HITL-SCC_Workflow Iteration I

**This iteration represents a first step of the Scientific Content Creation Workflow.**

The Scientific Content Creation Workflow is a user-interactive pipeline for systematically extracting knowledge from a corpus of scientific literature. It is intended to give the user better insight into the key contents of the corpus.

This notebook provides a step-by-step, instruction-based approach, from setting up the corpus to extracting and representing the knowledge relevant to the user.

It is designed to support the user in retrieving a set of relevant literature and converting the knowledge it contains into scientific knowledge graphs.

This iterative process requires little to no prior programming knowledge.

![setup](<MA003.jpg>)
![knowledge](<Frame9.jpg>)
![publish](<Frame10.jpg>)

In [None]:
# Run this cell only the first time, to install the requirements.
#pip install -r requirements.txt && pip install -e .

The initial step is to formulate a search query that aligns with the desired research objective. In this task, we can use a large language model (LLM) to either refine a research question, or extract relevant keywords that will be used in the process of querying scientific databases. 
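The optional LLM step can be prepared with a small prompt template. The following is a minimal sketch only: the helper name `build_keyword_prompt`, its parameters, and the prompt wording are assumptions made for illustration and are not part of this notebook's tooling. The resulting string would be passed to whichever LLM the user has access to.

```python
def build_keyword_prompt(research_question: str, max_keywords: int = 5) -> str:
    """Return a prompt asking an LLM for search keywords (hypothetical helper)."""
    question = research_question.strip()
    if not question:
        raise ValueError("research_question must not be empty")
    return (
        "You are assisting with a scientific literature search.\n"
        f"Research question: {question}\n"
        f"List at most {max_keywords} concise search keywords, "
        "separated by commas, suitable for querying arxiv.org."
    )


# Example research question (illustrative only).
prompt = build_keyword_prompt(
    "How can large language models support knowledge extraction?"
)
print(prompt)
```

The keywords returned by the LLM can then be used as the search query in the cells below.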

The second step is to enter the search query, which is then passed to the scraping modules. This step is the core of this iteration.

The tools used in this iteration are:
- **LLM** (optional, for formulating the search query)
- Modified [RESP](https://github.com/monk1337/resp) arXiv module




Enter the formulated research question in the next cell. An example is already provided.

In [2]:
# For now, we are just using keywords.
papers_search_query = 'large language models for effective knowledge extraction'

The next cell calls the arXiv module, which scrapes the papers that arxiv.org returns for the search query. The number of papers retrieved is governed by the arguments of the call (here 50 and max_pages = 1).

The result is saved in the variable "arxiv_result", a dataframe with three columns: the title of the paper, a link to the paper, and a link to download the PDF file of the paper.

The cell also displays an overview of the dataframe.

In [4]:
from util.arxiv_api import Arxiv

ap = Arxiv()
arxiv_result = ap.arxiv(papers_search_query, 50, max_pages=1)
arxiv_result.head()

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.45s/it]
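For orientation, the same kind of keyword search can also be expressed as a request to the public arXiv API. The sketch below only builds the request URL and does not contact the network. It is not how the modified RESP module above works internally; the function name `build_arxiv_api_url` and its defaults are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

# Base endpoint of the public arXiv API.
ARXIV_API_URL = "http://export.arxiv.org/api/query"


def build_arxiv_api_url(query, max_results=50, start=0):
    """Build an arXiv API query URL that searches all fields for `query`."""
    params = {
        "search_query": f"all:{query}",
        "start": start,
        "max_results": max_results,
    }
    # urlencode percent-encodes ':' and turns spaces into '+'.
    return f"{ARXIV_API_URL}?{urlencode(params)}"


url = build_arxiv_api_url(
    "large language models for effective knowledge extraction"
)
print(url)
```

The API answers such a request with an Atom feed, which would still have to be parsed into titles and links before it could replace the dataframe used in this notebook.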


The next cell saves the titles and the respective PDF links in two variables. The titles are used later to name each downloaded PDF file after its paper.

In [10]:
pdf_links = arxiv_result["pdf_link"]
titles = arxiv_result["title"]

Running the next cell creates a directory called "pdfs_corpus", in which the PDF files will be saved. It also defines the helper functions used for downloading.

In [11]:
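import os
import re
from urllib.parse import quote

import requests

download_dir = "pdfs_corpus"
os.makedirs(download_dir, exist_ok=True)

# Characters that are not allowed in file names on common file systems.
invalid_char_re = re.compile(r'[<>:"/\\|?*]')


def sanitize_filename(filename):
    """Replace characters that are invalid in file names with underscores."""
    return invalid_char_re.sub('_', filename)


def download_pdf(url, save_name):
    """Download the PDF at `url` into `download_dir` as `save_name`."""
    try:
        if not save_name.endswith(".pdf"):
            save_name += ".pdf"

        save_name = sanitize_filename(save_name)
        save_path = os.path.join(download_dir, save_name)

        # Percent-encode unsafe characters in the URL, keeping ':' and '/'.
        encoded_url = quote(url, safe=":/")

        # Stream the response so large files are not held in memory at once.
        # The 30-second timeout keeps a stalled request from hanging the cell.
        response = requests.get(encoded_url, stream=True, timeout=30)
        response.raise_for_status()

        if 'application/pdf' not in response.headers.get('Content-Type', ''):
            raise ValueError(f"URL does not point to a PDF: {url}")

        with open(save_path, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

    except Exception as e:
        # Report the failure and continue with the next paper.
        print(f"Failed to download {url}: {e}")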
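
To see what the file name sanitization does to a paper title, the following sketch repeats the regular expression from the cell above and applies it to a made-up title. The `truncate_filename` helper and its 150-character limit are assumptions added for illustration; the notebook itself does not shorten long titles.

```python
import os
import re

# Same pattern as in the download cell.
invalid_char_re = re.compile(r'[<>:"/\\|?*]')


def sanitize_filename(filename):
    """Replace characters that are invalid in file names with underscores."""
    return invalid_char_re.sub('_', filename)


def truncate_filename(name, max_len=150):
    """Shorten `name` to `max_len` characters, keeping its extension (hypothetical)."""
    stem, ext = os.path.splitext(name)
    return stem[:max_len - len(ext)] + ext


example_title = "Large Language Models: A Survey?"  # made-up title
safe_name = sanitize_filename(example_title + ".pdf")
print(safe_name)  # Large Language Models_ A Survey_.pdf

long_name = truncate_filename("a" * 300 + ".pdf")
print(len(long_name))  # 150
```

Very long titles can exceed the file name limit of some file systems, which is why a truncation step like the one sketched here may be worth adding.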


Running the next cell iterates through the dataframe and downloads each paper, saving it under its title in the folder created by the cell above.

In [12]:
for url, custom_name in zip(pdf_links, titles):
    download_pdf(url, custom_name)
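
Because failed downloads are only printed and not raised, it can help to check afterwards which papers are missing from the folder. The sketch below is one possible way to do this, under the assumption that files are named exactly as in the download cell. The helper names `expected_filename` and `missing_downloads` are hypothetical, and the demonstration uses a temporary folder with made-up titles instead of the real corpus.

```python
import os
import re
import tempfile

invalid_char_re = re.compile(r'[<>:"/\\|?*]')


def expected_filename(title):
    """Return the file name the download cell would use for `title`."""
    name = title if title.endswith(".pdf") else title + ".pdf"
    return invalid_char_re.sub("_", name)


def missing_downloads(titles, download_dir="pdfs_corpus"):
    """Return the titles that have no matching PDF file in `download_dir`."""
    present = set(os.listdir(download_dir)) if os.path.isdir(download_dir) else set()
    return [title for title in titles if expected_filename(title) not in present]


# Demonstration with a temporary folder and made-up titles.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, expected_filename("Paper A: Intro")), "wb").close()
    result = missing_downloads(["Paper A: Intro", "Paper B"], tmp)

print(result)  # ['Paper B']
```

In the notebook, calling `missing_downloads(titles)` after the download loop would list the papers that need to be retried or fetched manually.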

The last step in this iteration creates the "text_corpus". 
Running the next cell creates a corpus of text files from the downloaded pdf files. 

In [21]:
import os

import pdfplumber

pdf_directory = "pdfs_corpus"
text_directory = "text_corpus"

os.makedirs(text_directory, exist_ok=True)

for filename in os.listdir(pdf_directory):
    if filename.endswith(".pdf"):
        pdf_path = os.path.join(pdf_directory, filename)
        with pdfplumber.open(pdf_path) as pdf:
            # extract_text() may yield no text (e.g. for scanned pages),
            # so fall back to an empty string; join pages with newlines.
            full_text = "\n".join(page.extract_text() or "" for page in pdf.pages)

        text_filename = os.path.splitext(filename)[0] + ".txt"
        text_path = os.path.join(text_directory, text_filename)

        with open(text_path, "w", encoding="utf-8") as text_file:
            text_file.write(full_text)

print("PDF to text conversion completed!")

PDF to text conversion completed!
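
After the conversion, a quick sanity check of the text corpus can reveal files that contain no text, which typically happens for scanned PDFs without a text layer. The following is a minimal sketch; the helper name `corpus_summary` is an assumption, and the demonstration runs on a temporary folder with two made-up files instead of the real "text_corpus".

```python
import os
import tempfile


def corpus_summary(text_directory):
    """Map each .txt file name in `text_directory` to its word count."""
    counts = {}
    for filename in sorted(os.listdir(text_directory)):
        if filename.endswith(".txt"):
            path = os.path.join(text_directory, filename)
            with open(path, encoding="utf-8") as text_file:
                counts[filename] = len(text_file.read().split())
    return counts


# Demonstration with a temporary folder and two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.txt", "three short words"), ("b.txt", "")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as text_file:
            text_file.write(body)
    summary = corpus_summary(tmp)

empty_files = [name for name, count in summary.items() if count == 0]
print(summary)      # {'a.txt': 3, 'b.txt': 0}
print(empty_files)  # ['b.txt']
```

In the notebook, `corpus_summary("text_corpus")` would give the word count of every converted paper.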
