# HITL-SCC_Workflow Iteration I_2

This step of the iteration processes the extracted text and prepares the input for the large language model (LLM).
It is the core of the retrieval-augmented generation (RAG) application: the text of the paper serves as context, which lets the model
extract the relevant parts of the paper and perform context-based analysis. This step also creates a data model based on user input.

![rag_pipeline](<media/rag_pipeline.jpg>)


This part uses the following tool:
- **Ollama**

Ollama is a service that provides easy access to large language models and to the tools needed for computing embeddings and generating text.
In this workflow, it allows us to communicate with the corpus.
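As a rough illustration of how a client can talk to a local Ollama server, the sketch below builds requests for Ollama's REST endpoints `/api/embeddings` and `/api/generate`. It is not the code used by this workflow's helper modules. The helper names (`build_request`, `embed`, `generate`) and the model names are assumptions chosen for the example, and the server is assumed to listen on Ollama's default local address.

```python
import json
import urllib.request

# Assumption: Ollama runs locally on its default port.
OLLAMA_URL = "http://localhost:11434"


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for an Ollama endpoint (hypothetical helper)."""
    return urllib.request.Request(
        f"{OLLAMA_URL}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str = "nomic-embed-text") -> list:
    """Return the embedding vector for `text`. The model name is an assumption."""
    req = build_request("/api/embeddings", {"model": model, "prompt": text})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]


def generate(prompt: str, model: str = "llama3") -> str:
    """Return the generated text for `prompt`. The model name is an assumption."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = build_request("/api/generate", payload)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


# Building a request does not contact the server; only embed() and generate() do.
req = build_request("/api/generate", {"model": "llama3", "prompt": "Hello", "stream": False})
print(req.full_url)
```

Calling `embed` or `generate` requires a running Ollama server with the chosen model already pulled.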

**1. Defining Data Model**

This step allows the user to define the data model.
In this context, a data model is a structured set of properties that serves as the input for communicating with the corpus.
The properties can cover parts of a paper as well as specific values relevant to the user.

For now, we define a data model as a list that contains one or more of the following properties:
- Title
- Theme
- Keywords
- Task
- Evaluation Approach
- Future Directions
- Theories
- Dataset
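One possible way to represent such a data model in code is sketched below. The `DataModel` class and its methods are hypothetical and are not part of the workflow's `widgets` module. The query wording mirrors the query string used later in this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Default properties from the list above; the user may add or remove entries.
DEFAULT_PROPERTIES = [
    "Title", "Theme", "Keywords", "Task",
    "Evaluation Approach", "Future Directions", "Theories", "Dataset",
]


@dataclass
class DataModel:
    """Hypothetical container for the properties selected by the user."""
    properties: List[str] = field(default_factory=lambda: list(DEFAULT_PROPERTIES))

    def add(self, name: str) -> None:
        """Add a property unless it is empty or already present."""
        name = name.strip()
        if name and name not in self.properties:
            self.properties.append(name)

    def remove(self, name: str) -> None:
        """Remove a property if it is present."""
        if name in self.properties:
            self.properties.remove(name)

    def queries(self) -> Dict[str, str]:
        """Map each property to a retrieval query string."""
        return {p: f"{p} of this paper:" for p in self.properties}


model = DataModel()
model.add("Limitations")   # user-defined property
model.remove("Theme")      # property the user is not interested in
print(model.queries()["Evaluation Approach"])
```

Each property yields one query, so the retrieval step can be repeated per property.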


In [4]:
from widgets.widgets_util import DynamicCheckboxList

# Show a checkbox list of data model properties; options can be added or deleted.
dynamic_checkbox_list = DynamicCheckboxList()
dynamic_checkbox_list.display_interface()

**2. Text chunking**

In this part, we process the text of the documents to create a meaningful starting point for communicating with them.

To better identify the information present in the text, a semantic chunking method splits it into smaller parts.

The result of this step is a set of semantically connected units of text that are easier to process.
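The idea can be sketched as follows: embed consecutive sentences, measure the cosine distance between neighbours, and start a new chunk wherever the distance exceeds a threshold. This is a minimal sketch under stated assumptions, not the implementation in `ChunksUtil`. The function names, the fixed threshold of 0.5, and the two-dimensional toy embeddings are all illustrative.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def semantic_chunks(sentences: List[str],
                    embeddings: List[Sequence[float]],
                    threshold: float = 0.5) -> List[str]:
    """Group consecutive sentences; split where neighbours are far apart."""
    distances = [cosine_distance(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:          # semantic break after sentence i
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    if start < len(sentences):            # remaining sentences form the last chunk
        chunks.append(" ".join(sentences[start:]))
    return chunks


sentences = [
    "We propose a method.",
    "The method uses embeddings.",
    "We evaluate on two datasets.",
    "Results improve over baselines.",
]
# Toy embeddings (assumption): the first two and the last two sentences are similar.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

chunks = semantic_chunks(sentences, embeddings)
for chunk in chunks:
    print(chunk)
```

In practice the threshold is often derived from the distribution of distances (for example a percentile) rather than fixed, and the embeddings come from a model instead of being written by hand.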


![semantic_chunking](<media/semantic.jpg>)


**3. Similarity search & Prompting**

This part retrieves the information related to the provided data model.
In this step, a vector search compares each chunk with the query created from the corresponding data model property and returns the top k most similar chunks.

The top k chunks are then included in the prompt given to the large language model to provide context.
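The retrieval and prompt construction described above can be illustrated with a small, self-contained sketch. It ranks chunks by cosine similarity to a query vector and places the best k chunks in a prompt. The function names, the prompt wording, and the toy vectors are assumptions for illustration and do not reproduce `EmbeddingUtil.compute_top_k_ollama` or `Prompts.get_prompt_2`.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0 else dot / norm


def top_k_chunks(query_vec: Sequence[float],
                 chunk_vecs: List[Sequence[float]],
                 chunks: List[str],
                 k: int = 2) -> List[str]:
    """Return the k chunks whose vectors are most similar to the query."""
    ranked = sorted(zip(chunks, chunk_vecs),
                    key=lambda pair: cosine_similarity(query_vec, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(context_chunks: List[str], question: str) -> str:
    """Combine the retrieved chunks and the question into one prompt."""
    context = "\n\n".join(context_chunks)
    return ("Use only the context below to answer the question.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")


chunks = [
    "The approach is evaluated on two processes.",
    "Related work covers process mining.",
    "We measure precision and recall.",
]
# Toy vectors (assumption); real vectors would come from an embedding model.
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query_vec = [1.0, 0.0, 0.0]

top = top_k_chunks(query_vec, chunk_vecs, chunks, k=2)
prompt = build_prompt(top, "What is the Evaluation Approach in this paper?")
print(prompt)
```

Restricting the prompt to the top k chunks keeps the context short and focused on the property being queried.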

Running the code cells below outputs the LLM-generated text, which represents the data identified in the paper.

In [5]:
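from embedding.document_util import DocumentUtil

# Split the document text located under './' into sentences.
splitted_text = DocumentUtil.text_splitter(DocumentUtil, './')
# Keep each sentence together with its position in the text.
sentences = [{'sentence': x, 'index': i} for i, x in enumerate(splitted_text)]
# Combine the sentences into the units that are embedded in the next cell.
comb_sentences = DocumentUtil.combine_sentences(sentences)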

In [7]:
from chunking.chunks_util import ChunksUtil
from embedding.embedding_util import EmbeddingUtil

# Embed every combined sentence with Ollama and attach the vector to its entry.
embeddings = EmbeddingUtil.ollama_embed_combined_sentences(EmbeddingUtil, comb_sentences)
for i, sentence in enumerate(comb_sentences):
    sentence['combined_sentence_embedding'] = embeddings[i]
# Use the cosine distances to decide where the text is split into chunks.
distances, sentences = ChunksUtil.calculate_cosine_distances(comb_sentences)
split_distances = ChunksUtil.get_split_indices(distances)
chunks_final = ChunksUtil.split_using_distances(split_distances, sentences)

In [3]:
from embedding.embedding_util import EmbeddingUtil
from llms.llm_util import LlmUtil
from prompting.prompts import Prompts

# Retrieve the chunks most similar to the query for one data model property.
top_k_chunks = EmbeddingUtil.compute_top_k_ollama(EmbeddingUtil, chunks_final, "Evaluation Approach of this paper:")
# Build the prompt from the retrieved chunks and the question, then query the LLM.
prompt = Prompts.get_prompt_2(Prompts, top_k_chunks, "What is the Evaluation Approach in this paper ?")
response = LlmUtil.prompt_ollama(prompt)

Evaluation of two processes, namely the online shop process described in [16] and the hotel service process from the PET data set [2].


In [1]:
pip list

Package                      Version
---------------------------- --------------
absl-py                      2.1.0
aiohappyeyeballs             2.4.0
aiohttp                      3.10.5
aiosignal                    1.3.1
annotated-types              0.7.0
anyio                        4.4.0
argon2-cffi                  23.1.0
argon2-cffi-bindings         21.2.0
arrow                        1.3.0
asttokens                    2.4.1
astunparse                   1.6.3
async-lru                    2.0.4
async-timeout                4.0.3
attrs                        24.2.0
babel                        2.16.0
beautifulsoup4               4.12.3
bibtexparser                 1.4.1
bleach                       6.1.0
blis                         0.7.11
cachetools                   5.5.0
catalogue                    2.0.10
certifi                      2024.8.30
cffi                         1.17.1
charset-normalizer           3.3.2
click                        8.1.7
cloudpathlib                 0.

In [2]:
!jupyter --version

Selected Jupyter core packages...
IPython          : 8.27.0
ipykernel        : 6.29.5
ipywidgets       : 8.1.5
jupyter_client   : 8.6.2
jupyter_core     : 5.7.2
jupyter_server   : 2.14.2
jupyterlab       : 4.2.5
nbclient         : 0.10.0
nbconvert        : 7.16.4
nbformat         : 5.10.4
notebook         : 7.2.2
qtconsole        : not installed
traitlets        : 5.14.3
