# HITL-SCC_Workflow Iteration I

This iteration is the first step of the Scientific Content Creation Workflow, a user-interactive pipeline for systematically extracting knowledge from a corpus of scientific literature. It is meant to give the user better insight into the key contents of the corpus.

This notebook provides a step-by-step, instruction-based approach, from setting up the corpus to extracting and representing the knowledge relevant to the user.

The first task is to support the user in the process of retrieving a set of relevant literature, and to better represent the knowledge for the user.

This iterative process requires little to no prior programming knowledge.

![setup](<media/MA003.jpg>)
![knowledge](<media/Frame9.jpg>)
![publish](<media/Frame10.jpg>)

For the described process, this notebook runs tools and functions that are specifically implemented to query and scrape different digital libraries. A requirements file (requirements.txt) is predefined to install all necessary packages. 

This requires a Jupyter environment that runs any version of Python 3.

To install the packages, run the next cell once. If you have already run it on your machine, skip it and do not run it again.

In [None]:
pip install -r requirements.txt && pip install -e .

**Two approaches are provided for building a corpus of PDFs. The first is for the case where you do not yet have a set of scientific literature, the second for the case where you do. Note that either step 1 or step 2 should be used to build the corpus, not both. If you already have a set of papers in a Zotero collection, please skip to step 2.**

**1. Corpus mining**

The initial step is to formulate a search query that aligns with the desired research objective. In this task, we can use a large language model (LLM) to extract relevant keywords that will be used in the process of querying scientific databases. 

The second step is inputting the search query, which is then passed to the different scraping tools. This step is the core of this iteration.

The tools used in this step are:
- **LLM** (optional, for formulating the search query)
- Modified **[RESP](https://github.com/monk1337/resp)** Arxiv module
- **Semantic Scholar API**
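The optional LLM step can be as simple as sending a prompt that asks for keywords. The sketch below only builds such a prompt and cleans up a comma-separated reply. The names `build_keyword_prompt` and `parse_keywords` are hypothetical and not part of this notebook's `util` package, and the call to an actual LLM is left to whichever service you use.

```python
from typing import List


def build_keyword_prompt(research_objective: str, max_keywords: int = 5) -> str:
    """Return a prompt asking an LLM for search keywords (hypothetical helper)."""
    return (
        f"Extract at most {max_keywords} search keywords for querying "
        "scientific databases from the research objective below. "
        "Answer with a comma-separated list only.\n\n"
        f"Research objective: {research_objective.strip()}"
    )


def parse_keywords(llm_reply: str) -> List[str]:
    """Split a comma-separated reply into clean, de-duplicated keywords."""
    keywords = []
    for part in llm_reply.split(","):
        keyword = part.strip().lower()
        if keyword and keyword not in keywords:
            keywords.append(keyword)
    return keywords


prompt = build_keyword_prompt("How can LLMs support knowledge extraction?")
# The reply below is a made-up example of what an LLM might return.
example_reply = "large language models, knowledge extraction, Large Language Models"
print(" ".join(parse_keywords(example_reply)))
```

The joined keywords can then be reviewed by the user before being used as the search query.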




The user formulates a search query and pastes it between the single quotation marks in the next cell, where it is assigned to the variable **papers_search_query**.

An example that can be used as a search query is: *large language models for effective knowledge extraction*

In [4]:
papers_search_query = 'large language models for effective knowledge extraction'

The next cell lets the user set a size limit for the corpus. The variable **limit** holds the maximum number of papers to download from each source.

Note that many search results do not provide open access to a PDF, so fewer files than **limit** may be downloaded.

The predefined value is 50 (at most 50 PDFs from each source). The user can modify this value.

In [5]:
limit = 50
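Because not every search hit has an open-access PDF, the number of downloaded files is usually smaller than `limit`. The following is a minimal sketch of that filtering, assuming the response shape of the Semantic Scholar Graph API paper search endpoint with the `openAccessPdf` field requested. It is not the code in `util.semanticscholar_util`, whose internals are not shown in this notebook, and the helper names are illustrative.

```python
from urllib.parse import urlencode

# Public Semantic Scholar Graph API paper search endpoint.
SEARCH_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query, limit):
    """Build a search URL that also requests each paper's open-access PDF link."""
    params = {"query": query, "limit": limit, "fields": "title,openAccessPdf"}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def open_access_pdf_urls(response_json):
    """Return the PDF URLs of all papers in a search response that have one."""
    urls = []
    for paper in response_json.get("data", []):
        # The field is null when no open-access PDF is known.
        pdf_info = paper.get("openAccessPdf") or {}
        if pdf_info.get("url"):
            urls.append(pdf_info["url"])
    return urls


# Hand-written stand-in for a parsed API response (not real search results).
sample_response = {
    "data": [
        {"title": "Paper A", "openAccessPdf": {"url": "https://example.org/a.pdf"}},
        {"title": "Paper B", "openAccessPdf": None},
    ]
}
print(build_search_url("knowledge extraction", 50))
print(len(open_access_pdf_urls(sample_response)), "of",
      len(sample_response["data"]), "papers have a PDF")
```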

Next, we define the sources of our corpus. Multiple digital libraries and scientific databases can be accessed and queried.

Here we create a list of sources that the user can adapt. Note that the source names are given in quotation marks and separated by a comma ",", as in the example below. The elements of the list determine which sources, and how many, are taken into consideration.

**Note:** This version only supports querying **Arxiv** and **Semantic Scholar**. Later versions will include further sources.

Current possible entries for the list: 
- "Arxiv"
- "Semantic Scholar"
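
Since only the two names above are recognized, a quick check before downloading can catch typos such as a lowercase source name. This is a small sketch; `SUPPORTED_SOURCES` and `unknown_sources` are names introduced here for illustration.

```python
SUPPORTED_SOURCES = ("Arxiv", "Semantic Scholar")


def unknown_sources(sources):
    """Return the entries that do not exactly match a supported source name."""
    return [source for source in sources if source not in SUPPORTED_SOURCES]


example_sources = ["Arxiv", "semantic scholar"]
print(unknown_sources(example_sources))  # → ['semantic scholar']
```

An empty result means every entry will be handled by the download cell.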

In [6]:
sources = ["Semantic Scholar"]

In [7]:
from util.arxiv_api import Arxiv
from util.semanticscholar_util import SemanticScholar

for source in sources:
    if source == "Arxiv":
        arxiv_instance = Arxiv()
        arxiv_instance.download_pdf(papers_search_query, limit)
            
    elif source == "Semantic Scholar":
        semanticscholar_instance = SemanticScholar()
        semanticscholar_instance.download_pdfs(papers_search_query, limit)
    else:
        print("Unknown Identifier specified in the sources")


Querying Semantic Scholar...
Number of papers with PDF: 18
Semantic Scholar was successfully queried...


**2. Corpus extraction from Zotero**

This approach is for the case where we already have a Zotero collection that we want to investigate. It creates a corpus of PDFs saved on the user's local machine. For this, we use Zotero's API.

This step requires a Zotero API key, the library ID, the library type, and the collection's ID. This information can be found or set up in your personal Zotero account.
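
The cell for this step reads these four values from environment variables, loaded from a `.env` file via `python-dotenv`. A missing value may only surface later as a failed API call, so it can help to check up front. A minimal sketch, using the variable names from that cell; `missing_settings` is a hypothetical helper.

```python
import os

REQUIRED_SETTINGS = ("ZOTERO_API_KEY", "LIBRARY_ID", "LIBRARY_TYPE", "COLLECTION")


def missing_settings(environment=None):
    """Return the names of required settings that are unset or empty."""
    if environment is None:
        environment = os.environ
    return [name for name in REQUIRED_SETTINGS if not environment.get(name)]


# Example with a hand-written mapping instead of the real environment.
example_env = {
    "ZOTERO_API_KEY": "placeholder",
    "LIBRARY_ID": "placeholder",
    "LIBRARY_TYPE": "user",
}
print(missing_settings(example_env))  # → ['COLLECTION']
```

Calling `missing_settings()` without arguments after `load_dotenv()` checks the real environment.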

The tools used in this step are:
- **Zotero** and **Zotero API**
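
An item's `url` field may point to a landing page rather than to the PDF itself, which is one plausible reason for failed downloads. The sketch below shows one way to tell the two apart, by checking for the `%PDF-` signature that PDF files begin with. It is an illustration only: how `ZoteroUtil.download_pdf` decides success is not shown in this notebook, and `looks_like_pdf` is a hypothetical helper.

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the downloaded bytes start with the PDF file signature."""
    # Leading whitespace is skipped before comparing the first five bytes.
    return content.lstrip()[:5] == b"%PDF-"


print(looks_like_pdf(b"%PDF-1.7\n..."))          # → True
print(looks_like_pdf(b"<!DOCTYPE html><html>"))  # → False
```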



In [4]:
import os

from dotenv import load_dotenv
from pyzotero import zotero

from util.zotero_util import ZoteroUtil

load_dotenv()

API_KEY = os.getenv('ZOTERO_API_KEY')
LIBRARY_ID = os.getenv('LIBRARY_ID')
LIBRARY_TYPE = os.getenv('LIBRARY_TYPE')
COLLECTIONS = os.getenv('COLLECTION')


zot = zotero.Zotero(LIBRARY_ID, LIBRARY_TYPE, API_KEY)

download_directory = 'zotero_pdfs'
os.makedirs(download_directory, exist_ok=True)

items = zot.collection_items(COLLECTIONS)
for item in items:
    if 'url' in item['data']:
        ZoteroUtil.download_pdf(item['data']['url'], item['data']['key'])
        
        
print("Download completed. Found documents were downloaded")
            


PDF Downloaded.
PDF Downloaded.
PDF Downloaded.
PDF Downloaded.
PDF Downloaded.
PDF Downloaded.
PDF Downloaded.
PDF Downloaded.
PDF Downloaded.
PDF Downloaded.
PDF Downloaded.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF Downloaded.
PDF Downloaded.
PDF Downloaded.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF failed to download.
PDF Downloaded.
PDF Downloaded.
PDF Downloaded.
PDF failed to download.
PDF fail

**3. From PDF to Text** 

This step turns the PDF files into text files that can be processed and passed as input to the later steps.

We start from a corpus of PDF files and aim for a folder of files with the extension `.txt`.

The tools used in this step are:
- **PDF Plumber** (`pdfplumber`)
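
As a rough illustration of what the conversion involves, the sketch below extracts text page by page with `pdfplumber` and writes one `.txt` file per PDF. It is not the implementation of `PdfUtil`, which lives in `util.pdf_util`. The function names are hypothetical, and `pdfplumber` must be installed for `pdf_to_text` to run.

```python
from pathlib import Path


def txt_path_for(pdf_path, output_dir):
    """Map e.g. zotero_pdfs/ABC.pdf to <output_dir>/ABC.txt."""
    return Path(output_dir) / (Path(pdf_path).stem + ".txt")


def pdf_to_text(pdf_path):
    """Extract the text of every page and join the pages with newlines."""
    import pdfplumber  # imported here so the path helper works without it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None or an empty string for pages without text.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def convert_folder(folder_path, output_path):
    """Convert every .pdf file in folder_path to a .txt file in output_path."""
    Path(output_path).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(folder_path).glob("*.pdf")):
        target = txt_path_for(pdf_path, output_path)
        target.write_text(pdf_to_text(pdf_path), encoding="utf-8")
        print(f"Converted {pdf_path.name} to {target.name}")
```

Scanned PDFs without a text layer would yield empty text files with this approach, since no OCR is involved.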

In [1]:
from util.pdf_util import PdfUtil

folder_path = 'zotero_pdfs'
output_path = 'text_files'


# Convert all PDFs in the folder
PdfUtil.convert_pdfs_in_folder(PdfUtil, folder_path, output_path)





PDF to text conversion completed!
Converted and saved 2UZS965Y.pdf to 2UZS965Y.txt
PDF to text conversion completed!
Converted and saved 4KKE293P.pdf to 4KKE293P.txt
PDF to text conversion completed!
Converted and saved 4XTLX385.pdf to 4XTLX385.txt
PDF to text conversion completed!
Converted and saved 56FCMXLU.pdf to 56FCMXLU.txt
PDF to text conversion completed!
Converted and saved 5F2YMTBS.pdf to 5F2YMTBS.txt
PDF to text conversion completed!
Converted and saved 5R8BBVAT.pdf to 5R8BBVAT.txt
PDF to text conversion completed!
Converted and saved 6ZNP9N2R.pdf to 6ZNP9N2R.txt
PDF to text conversion completed!
Converted and saved 7MWFJGJ8.pdf to 7MWFJGJ8.txt
PDF to text conversion completed!
Converted and saved 7NK8VXTJ.pdf to 7NK8VXTJ.txt
PDF to text conversion completed!
Converted and saved 9P4JSU6X.pdf to 9P4JSU6X.txt
PDF to text conversion completed!
Converted and saved B7VCHQ4F.pdf to B7VCHQ4F.txt
PDF to text conversion completed!
Converted and saved BF5URTZH.pdf to BF5URTZH.txt
PDF 