# HITL-SCC_Workflow Iteration I_2

This step of the iteration processes the extracted text and prepares the input for the large language model (LLM).
It is the core of the retrieval-augmented generation (RAG) application. Here, we supply the content of the paper as context so that the model
can extract the relevant parts of the paper and perform context-based analysis. This step also creates a data model based on user input.

![rag_pipeline](<media/rag_pipeline.jpg>)


We list the tools used in this part: 
-  **Ollama**

Ollama is a service that provides easy access to large language models and the other tools needed for embedding, computing, and generating text.
In this workflow, it allows us to communicate with the corpus.

**1. Defining Data Model**

This step allows the user to define the data model. 
In this context, a data model is a structured set of properties that serves as the input for communicating with the corpus. 
The properties can include parts of a paper as well as specific values relevant to the user. 

For now, we define a data model as a list that contains one or more of the following properties: 
- Title
- Theme
- Keywords
- Task
- Evaluation Approach
- Future Directions
- Theories
- Dataset


In [8]:
from widgets.widgets_util import DynamicCheckboxList
dynamic_checkbox_list = DynamicCheckboxList()
dynamic_checkbox_list.display_interface()

Text(value='', placeholder='Enter a new option')

Button(description='Add Option', style=ButtonStyle())

Button(description='Delete Selected Options', style=ButtonStyle())

VBox(children=(Checkbox(value=False, description='Title'), Checkbox(value=False, description='Theme'), Checkbo…

In [12]:
print(dynamic_checkbox_list.get_selected_options())

['Title', 'Keywords', 'Evaluation Approach']
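
The selected properties can then drive the later steps. The following is a minimal sketch, assuming the selection is a plain list of strings as printed above; `build_key_values_line` and `build_queries` are hypothetical helper names that are not part of the notebook, and the query wording is only an illustration.

```python
# Hypothetical helpers for illustration; they are not part of the notebook's modules.
selected_options = ["Title", "Keywords", "Evaluation Approach"]  # as returned by get_selected_options()


def build_key_values_line(options):
    """Join the selected properties in the 'A - B - C.' style of the KEY VALUES prompt section."""
    return " - ".join(options) + "."


def build_queries(options):
    """Create one retrieval query per selected property (assumed wording)."""
    return {option: f"{option} of the research paper" for option in options}


key_values_line = build_key_values_line(selected_options)
queries = build_queries(selected_options)

print(key_values_line)
for prop, query in queries.items():
    print(f"{prop}: {query}")
```

Deriving both the prompt section and the retrieval queries from the same list keeps them in sync with what the user selected.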


**2. Text chunking**

In this part, we process the text of the documents to create a meaningful starting point for communicating with them.

To better identify the information present in the text, a semantic chunking method splits the text into smaller parts. 

The result of this step is a set of semantically connected units of text that are easier to process.


![rag_pipeline](<media/semantic.jpg>)
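
The idea of semantic chunking can be sketched as follows. This is a simplified illustration, not the notebook's implementation: it assumes a toy bag-of-words vector in place of a real embedding model (which would come from a service such as Ollama), and the `threshold` value and function names are hypothetical. Consecutive sentences stay in the same chunk while their similarity is at or above the threshold.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text, threshold=0.2):
    """Start a new chunk whenever a sentence is dissimilar to the one before it."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) >= threshold:
            chunks[-1].append(curr)  # same topic: extend the current chunk
        else:
            chunks.append([curr])    # topic shift: open a new chunk
    return [" ".join(chunk) for chunk in chunks]


sample = (
    "The model is trained on a large dataset. "
    "The dataset contains the training examples for the model. "
    "Future work will explore user studies. "
    "User studies will involve expert annotators."
)
for chunk in semantic_chunks(sample):
    print(chunk)
```

With the sample text, the first two sentences form one chunk and the last two form another, because the vocabulary overlap drops to zero between the second and third sentence.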


**3. Similarity search & Prompting**

This part consists of retrieving the information related to the provided data model. 
In this step, we search for the top k chunks of text returned by a vector search between each chunk and the respective query created from the data model.

The top k chunks are then included in the prompt given to the large language model to provide context.
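
The retrieval step described above can be sketched as a ranking of chunks by cosine similarity to the query. This is a minimal illustration under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, and `top_k_chunks`, `build_prompt`, and the prompt layout are hypothetical names and choices, not the notebook's code.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query, chunks, k=2):
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(key_value, context_chunks):
    """Place the retrieved chunks in the prompt as context (assumed layout)."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"CONTEXT:\n{context}\nKEY VALUE: {key_value}\nANSWER:"


chunks = [
    "We evaluate the approach with a user study of twelve participants.",
    "The dataset consists of annotated abstracts.",
    "Related work covers topic modeling.",
]
best = top_k_chunks("evaluation approach user study", chunks, k=1)
print(build_prompt("Evaluation Approach", best))
```

Only the retrieved chunks reach the model in this variant, which keeps the prompt short compared with passing the full document text.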

Running the code cell below outputs the LLM-generated text, which represents the data identified in the paper.

In [None]:
# Synchronous Example
from mistralai import Mistral
import os
from embedding.document_util import DocumentUtil
folder_path = 'zotero_pdfs'
results = []

for filename in os.listdir(folder_path):
    if filename.lower().endswith('.pdf'):
        file_path = os.path.join(folder_path, filename)
        
        print(f"Processing PDF file: {file_path}")

        # Extract the paper text with the reference section removed.
        test_doc = DocumentUtil.get_text_without_references(DocumentUtil, file_path)
        
        prompt = f"""
        INSTRUCTIONS:
        You are a tool that extracts information from a given document based on Key values. The given document represents a research paper.
        Given the DOCUMENT below and the KEY VALUES, and using no prior knowledge, extract the respective information.
        Your answer should contain the extracted information without further explanation.
        If the information is not present in the text, return NOT FOUND.
        The answer should be formatted in the following way: 
        ### Key value : Information
        ----------------------------------------------
        DOCUMENT: 
        {test_doc}
        ----------------------------------------------
        KEY VALUES:
            Title - Keywords - Evaluation approach - Conclusion - Future directions - Theories - Dataset.
        ----------------------------------------------
        ANSWER: 
        """
        
        # Read the API key from the environment instead of hardcoding it in the notebook.
        s = Mistral(
            api_key=os.environ["MISTRAL_API_KEY"],
        )
        
        res = s.chat.complete(model="mistral-large-latest", messages=[
            {
                "content": prompt,
                "role": "user",
            },
        ])
        
        # The whole loop runs for around 8 minutes on a corpus of around 30 papers.
        print(res.choices[0].message.content)
        results.append(res.choices[0].message.content)

In [None]:
import pandas as pd
import re

df_list = []
for result in results:
    lines = result.splitlines()
    example_values = []
    information_list = []
        
    for line in lines:
        match = re.match(r"^###\s*(.*?)[:-]\s*(.*)$", line)
        if match:
            example_value = match.group(1).strip()
            information = match.group(2).strip()
            example_values.append(example_value)
            information_list.append(information)
        
    # Create one DataFrame per paper from the parsed key/value pairs
    df = pd.DataFrame({'Key value': example_values, 'Information': information_list})
    df_list.append(df)
    print(df)
    print("--------------------")
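
The parsing logic above can be checked in isolation on a small sample. The sketch below reuses the same regular expression; the response text is made up for illustration and `parse_response` is a hypothetical helper name.

```python
import re

# Same pattern as in the parsing cell: "### <key> : <information>" or "### <key> - <information>".
LINE_PATTERN = re.compile(r"^###\s*(.*?)[:-]\s*(.*)$")

# Made-up response text, for illustration only.
sample_response = """### Title : A Survey of Things
### Keywords : survey, things
Some stray line without the marker
### Dataset - NOT FOUND"""


def parse_response(text):
    """Return (key, information) pairs for every line that matches the expected format."""
    pairs = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            pairs.append((match.group(1).strip(), match.group(2).strip()))
    return pairs


pairs = parse_response(sample_response)
for key, information in pairs:
    print(f"{key!r} -> {information!r}")
```

Because the pattern splits at the first `:` or `-`, a key that itself contains a hyphen is cut at that hyphen, so hyphen-free key names are the safer choice with this format.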

In [None]:
folder_path = "zotero_pdfs"

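# Keep only the PDF files, mirroring the filter used when the papers were processed,
# so that the file names line up with the rows of the combined table.
file_names = [
    f for f in os.listdir(folder_path)
    if f.lower().endswith('.pdf') and os.path.isfile(os.path.join(folder_path, f))
]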

transformed_dataframes = []

for df in df_list:
    pivoted_df = df.set_index('Key value').T
    transformed_dataframes.append(pivoted_df)

combined_df = pd.concat(transformed_dataframes, ignore_index=True)

combined_df['File Name'] = file_names
columns = ['File Name'] + [col for col in combined_df.columns if col != 'File Name']
combined_df = combined_df[columns]

combined_df.to_csv('combined_output_with_filenames.csv', index=False)

In [7]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')