# HITL_Workflow Iteration I

**This iteration represents the first step of the scientific content creation workflow.**

![setup](<setup.png>)
![knowledge](<knowledge.png>)
![publish](<publish.png>)

The initial step is to formulate a search query that aligns with the desired research objective. Here, a large language model (LLM) can be used either to refine a research question or to extract relevant keywords for querying scientific databases.

The second step is inputting the search query, which is then fed to the different scraping modules. This step is the core of this iteration.

The tools used in this iteration are:
- **LLM** (optional, for formulating the search query)
- Modified [RESP](https://github.com/monk1337/resp) arXiv module




Enter the formulated research question in the next cell. An example research question is already provided.

In [None]:
papers_search_query = 'How can large language models be utilized for effective knowledge extraction?'
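
Where no LLM is at hand, the keyword-extraction option mentioned above can be roughly approximated with a plain stopword filter. The following is only a minimal sketch of that fallback, not the LLM-based approach; the function name `extract_keywords` and the stopword list are assumptions made for illustration.

```python
import re

# Small illustrative stopword list (an assumption; extend as needed).
STOPWORDS = {
    "how", "can", "be", "for", "the", "a", "an", "of", "to", "in",
    "is", "are", "and", "or", "what", "which", "with", "on", "do", "does",
}

def extract_keywords(question):
    """Return the lowercase non-stopword tokens of a question, in order, without duplicates."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    seen = set()
    keywords = []
    for token in tokens:
        if token not in STOPWORDS and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords

example_query = "How can large language models be utilized for effective knowledge extraction?"
keywords = extract_keywords(example_query)
print(" ".join(keywords))
# → large language models utilized effective knowledge extraction
```

The joined keyword string could be assigned to `papers_search_query` in place of the full question if a keyword-style query is preferred.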

The next cell calls the arXiv module, which scrapes 100 papers returned by submitting the search query to arxiv.org.

The result is saved in the variable "arxiv_result". This variable is a dataframe with three columns: the title of the paper, a link to the paper, and a second link for downloading the PDF file of the paper.

The cell also displays an overview of the dataframe.

In [None]:
from arxiv_api import Arxiv

# Query arXiv with the research question and collect the first page of results.
ap = Arxiv()
arxiv_result = ap.arxiv(papers_search_query, max_pages=1)
arxiv_result.head()

The next cell saves the titles and the respective PDF links in two variables. The titles are used later to name each downloaded PDF file after its paper.

In [None]:
pdf_links = arxiv_result["pdf_link"]
titles = arxiv_result["title"]

Running the next cell creates a directory called "pdfs_corpus", in which the PDF files will be saved, and defines the helper functions used for downloading.

In [None]:
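import os
import re
from urllib.parse import quote

import requests

# Directory that will hold the downloaded PDF files.
download_dir = "pdfs_corpus"
os.makedirs(download_dir, exist_ok=True)

# Characters that are not allowed in file names on common filesystems.
invalid_char_re = re.compile(r'[<>:"/\\|?*]')

def sanitize_filename(filename):
    """Replace characters that are invalid in file names with underscores."""
    return invalid_char_re.sub('_', filename)

def download_pdf(url, save_name):
    """Download the PDF at `url` into `download_dir` as `save_name`."""
    try:
        if not save_name.endswith(".pdf"):
            save_name += ".pdf"

        save_name = sanitize_filename(save_name)
        save_path = os.path.join(download_dir, save_name)

        # Percent-encode unsafe characters (e.g. spaces) in the URL.
        encoded_url = quote(url, safe=":/")

        # Stream the response so large files are not held in memory at once;
        # the timeout keeps a stalled connection from blocking the loop.
        response = requests.get(encoded_url, stream=True, timeout=30)
        response.raise_for_status()

        if 'application/pdf' not in response.headers.get('Content-Type', ''):
            raise ValueError(f"URL does not point to a PDF: {url}")

        with open(save_path, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

    except Exception as e:
        print(f"Failed to download {url}: {e}")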
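
Scraped titles can contain line breaks or be very long, which plain character replacement does not handle. The sketch below shows one possible way to build a file name from a title; `build_pdf_filename` and its `max_stem_length` parameter are hypothetical and not part of the notebook, and the regular expression simply repeats the character class used above.

```python
import re

# Same character class as the notebook's invalid_char_re.
invalid_char_re = re.compile(r'[<>:"/\\|?*]')

def build_pdf_filename(title, max_stem_length=150):
    """Turn a paper title into a filesystem-friendly PDF file name.

    max_stem_length is an illustrative limit, not taken from the notebook.
    """
    stem = invalid_char_re.sub("_", title)
    # Collapse newlines, tabs, and repeated spaces that scraped titles may contain.
    stem = " ".join(stem.split())
    # Shorten long titles; many filesystems limit file names to 255 bytes.
    stem = stem[:max_stem_length].rstrip()
    return stem + ".pdf"

print(build_pdf_filename("What is an LLM? A survey: methods/tools"))
# → What is an LLM_ A survey_ methods_tools.pdf
```

Replacing invalid characters with underscores keeps the title readable, at the cost that two titles differing only in those characters would map to the same file name.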


Running this cell iterates through the dataframe and downloads each paper, saving it under its title in the folder created in the cell above.

In [None]:
for url, custom_name in zip(pdf_links, titles):
    download_pdf(url, custom_name)
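
Re-running the loop downloads every paper again. One possible refinement, sketched here under assumptions, is to skip files that already exist in the target folder. The helper `download_missing` and the stand-in `fake_download` are hypothetical; in the notebook, the `download` argument would be a function that fetches `url` and writes it to `path`.

```python
import os
import tempfile

def download_missing(links, names, target_dir, download):
    """Call download(url, path) for every paper whose file is not yet in target_dir.

    Returns the number of files skipped because they already existed.
    """
    os.makedirs(target_dir, exist_ok=True)
    skipped = 0
    for url, name in zip(links, names):
        path = os.path.join(target_dir, name)
        if os.path.exists(path):
            skipped += 1
            continue
        download(url, path)
    return skipped

def fake_download(url, path):
    # Stand-in for a real HTTP download so the sketch runs offline.
    with open(path, "wb") as file:
        file.write(b"%PDF-1.4 placeholder")

with tempfile.TemporaryDirectory() as tmp:
    links = ["https://example.org/a.pdf", "https://example.org/b.pdf"]
    names = ["a.pdf", "b.pdf"]
    first_run = download_missing(links, names, tmp, fake_download)
    second_run = download_missing(links, names, tmp, fake_download)
    print(first_run, second_run)  # → 0 2
```

Passing the downloader in as an argument keeps the skip logic testable without network access.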