# AI reviewing
An AI-powered tool to support the literature review process.
This notebook implements *Retrieval-Augmented Generation* (RAG), a technique that lets a generative model fetch information from external sources (in this case, files on the local system).
Leveraging RAG, the notebook allows the user to query a paper and gather insights without having to read the full text.
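
To make the idea concrete, here is a minimal, dependency-free sketch of the two RAG stages: retrieve the passages most related to a question, then build an augmented prompt for the model. The passages, the robot name `RoboX`, and the helpers `retrieve` and `build_prompt` are invented for illustration; the word-overlap ranking is only a stand-in for the embedding-based retrieval used later in the notebook.

```python
def retrieve(query, passages, k=2):
    """Toy retriever: rank passages by the number of words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, passages):
    """Augment the question with the retrieved passages."""
    context = "\n".join(passages)
    return "Context:\n{}\n\nQuestion: {}\nAnswer:".format(context, query)


# Invented passages, standing in for chunks of a paper
passages = [
    "The robot used in the study is called RoboX.",
    "Participants were recruited from a nursing home.",
    "The weather was sunny during the experiments.",
]
question = "What is the robot called?"
prompt = build_prompt(question, retrieve(question, passages, k=1))
print(prompt)  # this text is what would be sent to the language model
```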

Two different pipelines are shown in this notebook:
* [Step-by-step execution](#step-by-step-execution)<br>Run each step of the pipeline and inspect its intermediate output to understand what is going on under the hood.

* [Batch execution](#batch-execution)<br>Run the pipeline on a set of references. Meant to be used in production.

In [None]:
# Start by clearing the workspace. Previous execution of the present notebook
# may have initialized models that occupy a large space in the computer memory.
# Execute the cell for a fresh start.
%reset -f

Define the paths to the models used in the application, namely:
* the path to the model to generate embeddings
* the path to the large language model that will answer queries

You can manually define the paths, or set them to `None`. In the latter case, a file dialog will pop up to let you pick the file by browsing the filesystem.
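
The "use the given path, otherwise ask the user" logic can be sketched as a small helper. The name `resolve_path` and its `picker` argument are hypothetical; in the notebook the picker would be a file dialog function such as `filedialog.askopenfilename`, while here a stand-in lambda is used so that the example runs without a GUI.

```python
import os


def resolve_path(path, picker):
    """Return `path` as an absolute path, asking `picker` for one when `path` is None."""
    if path is None:
        path = picker()  # e.g. a file dialog; expected to return '' when cancelled
    if not path:
        raise ValueError("No file was selected")
    return os.path.abspath(path)


# Stand-in picker that pretends the user selected a (hypothetical) model file
chosen = resolve_path(None, picker=lambda: "models/example.gguf")
print(chosen)
```

Raising an error when nothing is selected makes the problem visible immediately, rather than later when the model is loaded.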

To install `gpt4all` and get the needed models, you can follow the official [GPT4All Python SDK documentation](https://docs.gpt4all.io/gpt4all_python/home.html).

In [None]:
import os
import tkinter as tk # for filedialogs
from tkinter import filedialog

# Start by defining paths to relevant components of the application, namely:
#   * the path to the model to generate embeddings
EMBED_MODEL = "C:\\Users\\pozzi\\gpt4all\\resources\\nomic-embed-text-v1.5.f16.gguf"
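if EMBED_MODEL is None:
    # askopenfilename returns the selected path ('' if the dialog is cancelled)
    # without leaving an open file handle behind
    selected = filedialog.askopenfilename()
    if selected:
        EMBED_MODEL = os.path.abspath(selected)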

#   * the path to the large language model
LLM_MODEL = "C:\\Users\\pozzi\\AppData\\Local\\nomic.ai\\GPT4All\\Meta-Llama-3-8B-Instruct.Q4_0.gguf"
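if LLM_MODEL is None:
    # askopenfilename returns the selected path ('' if the dialog is cancelled)
    # without leaving an open file handle behind
    selected = filedialog.askopenfilename()
    if selected:
        LLM_MODEL = os.path.abspath(selected)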

## Step-by-step execution
This is meant as an instructive example to better understand how the pipeline works. Each step is manually triggered by the user.

If you already know how RAG works, and/or have previous experience with these tools, you can collapse this section and move to the [Batch Execution section](ai_reviewing.ipynb#batch-execution).

You can manually define the path to the PDF file to inspect, or set it to `None`. In the latter case, a file dialog will pop up to let you pick the file by browsing the filesystem.

In [None]:
# The PDF file to process
# Must be:
#    * valid filepath to PDF file
#    * `None` (opens a filedialog to select the file)
FILE_PATH = None 
#FILE_PATH = "C:\\Users\\pozzi\\OneDrive - Politecnico di Milano\\review_article\\code\\refs\\_to_review_short\\files\\1875\\Maroto-Gómez et al. - 2023 - Bio-inspired Cognitive Decision-making to Personal.pdf" 

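if FILE_PATH is None:
    # askopenfilename returns the selected path ('' if the dialog is cancelled)
    selected = filedialog.askopenfilename(filetypes=[("PDF files", "*.pdf")])
    if selected:
        FILE_PATH = os.path.abspath(selected)
if FILE_PATH is None:
    raise ValueError("No file selected. Set FILE_PATH or pick a PDF file from the dialog")

# Check the file extension (case-insensitive)
extension = os.path.splitext(FILE_PATH)[-1]
if extension.lower() != '.pdf':
    raise ValueError("Unsupported file extension {}. Only PDF files are supported".format(extension))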

### Document loading and splitting
The PDF document is loaded and parsed to extract the actual content. The content is then split into smaller chunks.
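
The effect of the chunk size and of the overlap between consecutive chunks can be illustrated with a simplified splitter. This is only a sketch: the hypothetical `split_with_overlap` below cuts fixed-width windows, whereas the splitter used in this notebook also tries to break the text at natural separators.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters sharing `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # Store the start index alongside the chunk, similarly to `add_start_index=True`
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# [(0, 'abcd'), (2, 'cdef'), (4, 'efgh'), (6, 'ghij')]
```

The overlap means that a sentence falling across a boundary still appears whole in at least one chunk, at the cost of storing some text twice.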

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF file and read its content
loader = PyMuPDFLoader(FILE_PATH)
docs = loader.load()

# Split the content into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512,
                                               chunk_overlap=128, 
                                               add_start_index=True
                                              )
chunks = text_splitter.split_documents(docs)
if not chunks:
    raise ValueError('Document at {} produced no chunks'.format(FILE_PATH))
else:
    print("A sample from the middle of your file:\n\n" +
          "{}".format(chunks[len(chunks)//2].page_content)
          )

### Generate embeddings
To later feed a text document into the LLM, we want to store its *meaning*. However, computers work with numbers rather than raw text.

Document embeddings convert a document's text into high-dimensional vectors that capture its meaning. The result is a numerical representation of the text, where similar contents (i.e. `chunks`) have similar representations.
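
As a toy illustration of "similar contents have similar representations", the sketch below represents each text by its word counts and compares the vectors with cosine similarity. The word-count `embed` function is only a stand-in for illustration: a real embedding model is a trained neural network that also captures synonyms and context.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a sparse vector of word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (1 = same direction, 0 = unrelated)."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


first = embed("the robot greets the user")
related = embed("the robot talks to the user")
unrelated = embed("statistical analysis of questionnaire scores")

print(cosine_similarity(first, related))    # high: many shared words
print(cosine_similarity(first, unrelated))  # 0.0: no shared words
```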

In [None]:
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.vectorstores import FAISS

# Initialize the embedding model
if os.path.isfile(EMBED_MODEL):
    allow_download=False
else:
    print('{} is not a valid file in the local system.\nGPT4All will try to download the model from gpt4all.io'.format(EMBED_MODEL))
    allow_download=True
model_kwargs = {'allow_download':allow_download}    
embed_model = GPT4AllEmbeddings(model_name=EMBED_MODEL,
                                device="cpu",
                                gpt4all_kwargs=model_kwargs
                                )

# Initialize the vector store (i.e. database) to store the embeddings
db = FAISS.from_documents(documents=chunks, 
                          embedding=embed_model
                          )

The generated embeddings can now be exploited to evaluate the similarity between contents by computing a numerical distance.<br>The cell below just prints the first chunk returned by the retriever. You may observe that it is not necessarily the best one to answer the query. For this reason, it is advisable to set the `k` parameter (the number of chunks returned) to a number > 1. Keep in mind that increasing `k` will also increase the computational effort.
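
The retriever in this notebook uses the `"mmr"` (maximal marginal relevance) search type, which trades relevance to the query against redundancy among the selected chunks. The sketch below is a simplified, pure-Python illustration of that idea and not the library's implementation; the function names and the 2-D vectors are invented, while `k`, `fetch_k`, and `lambda_mult` mirror the parameters of the same name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k, fetch_k, lambda_mult):
    """Return the indexes of k documents balancing relevance and diversity."""
    # 1. Shortlist the fetch_k documents most similar to the query
    by_relevance = sorted(range(len(doc_vecs)),
                          key=lambda i: cosine(query_vec, doc_vecs[i]),
                          reverse=True)
    candidates = by_relevance[:fetch_k]
    selected = []
    # 2. Greedily add the candidate with the best relevance/redundancy trade-off
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.11], [0.6, -0.8]]  # documents 0 and 1 are near-duplicates

print(mmr(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5))  # [0, 2]
print(mmr(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` only relevance counts, so the near-duplicate is returned; lower values favor chunks that add new information.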

In [None]:
# Retrieve chunk(s) of document(s) that are relevant to the query
query = 'What is the robot name?'
retriever = db.as_retriever(search_type="mmr", #"similarity",
                            search_kwargs={"k":6,
                                            "fetch_k":20,
                                            "lambda_mult":0.5,
                                            #"filter":{"source":FILE_PATH}
                                        })
print(retriever.invoke(query)[0].page_content)

### Retrieval-Augmented Generation
Exploit the above steps to feed relevant context to a Large Language Model, giving it access to the contents of your interest.

Here a [chain](https://python.langchain.com/docs/how_to/sequence/) including the retriever, the prompt, the LLM, and the output parser is defined. The output of each component is fed as input to the following one.
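
Conceptually, a chain is a composition of functions. The sketch below mimics this with plain Python callables; all the names (`make_chain`, `fake_retriever`, `fake_llm`, ...) are invented stand-ins and not part of LangChain, where the same composition is written with the `|` operator.

```python
from functools import reduce


def make_chain(*steps):
    """Compose steps so that the output of each one is the input of the next."""
    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


def fake_retriever(question):
    # Stand-in for the retriever: attaches a fixed, invented context to the question
    return {"context": "The robot is called RoboX.", "question": question}


def fill_prompt(inputs):
    # Stand-in for the prompt template
    return "Context: {context}\nQuestion: {question}\nAnswer:".format(**inputs)


def fake_llm(prompt_text):
    # Stand-in for the language model: returns a fixed, untidy string
    return "  [model answer]\n"


def parse_output(raw):
    # Stand-in for the output parser: tidies up the raw model output
    return raw.strip()


chain = make_chain(fake_retriever, fill_prompt, fake_llm, parse_output)
print(chain("What is the robot name?"))  # [model answer]
```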


In [None]:
from langchain_community.llms import GPT4All
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """You are a helpful research assistant. Please answer based only on the following context: {context}. If you don't know something, just say you don't know.
Question: {question}
Answer:
"""

prompt = PromptTemplate.from_template(template)
# Reuse the path defined at the top of the notebook
local_path = LLM_MODEL
if 'llm' in locals():
    # NOTE. Attempting to create a new instance of the model (e.g. if you want
    # to play with the params) might fail due to lack of memory. Before
    # instantiating, verify that no other model exists.
    del llm
llm = GPT4All(model=local_path,
              device="gpu",
              max_tokens=2048,
              n_predict=4096,
              temp=0.7,
              top_p=0.4,
              top_k=40,
              repeat_last_n=64,
              repeat_penalty=1.18
              )

# Put everything together
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

if 'rag_chain' in locals():
    # NOTE. Attempting to create a new instance of the chain might fail due to
    # lack of memory. Before instantiating, verify that no other chain exists.
    del rag_chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Define your query and give it to the RAG to get the answer based on the provided context.

In [None]:
query = """What is the robot name?"""

print(query)
output = rag_chain.invoke(query)
print(output)

## Batch execution
After having understood how the pipeline works, you can leverage RAG to help with literature reviewing.
> **NOTE.** It is advisable to start from scratch, clearing all the variables. To do so, run the first two cells and just skip the **Step-by-step execution** section.


Select the path to the directory where the BIB file and the article PDFs are saved.

In [None]:
#   * the path to the folder where the Zotero references have been exported
REFS_PATH = "C:\\Users\\pozzi\\OneDrive - Politecnico di Milano\\review_article\\code\\refs\\_full_text"
if REFS_PATH is None:
    # `selected_dir` avoids shadowing the built-in `dir`
    selected_dir = filedialog.askdirectory(initialdir='./refs')
    if selected_dir:
        REFS_PATH = os.path.abspath(selected_dir)

### Initialize useful objects
Create the instance of the LLM that will be used for the RAG application, and define the prompt.

In [None]:
import os

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import GPT4All
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Get ready to split the content into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512,
                                               chunk_overlap=128, 
                                               add_start_index=True
                                              )

# Initialize the embedding model
if os.path.isfile(EMBED_MODEL):
    allow_download=False
else:
    print('{} is not a valid file in the local system.\nGPT4All will try to download the model from gpt4all.io'.format(EMBED_MODEL))
    allow_download=True
model_kwargs = {'allow_download':allow_download}    
embed_model = GPT4AllEmbeddings(model_name=EMBED_MODEL,
                                device="cpu",
                                gpt4all_kwargs=model_kwargs
                                )

# Initialize the LLM model
template = """I am an engineer using a large language model for scientific paper summarization. Given the context:\n\n{context}\n\nplease answer the question below. If you don't know something, just say you don't know.
Question: {question}
Answer:
"""

prompt = PromptTemplate.from_template(template)
if 'llm' in locals():
    # NOTE. Attempting to create a new instance of the model (e.g. if you want
    # to play with the params) might fail due to lack of memory. Before
    # instantiating, verify that no other model exists.
    del llm
llm = GPT4All(model=LLM_MODEL,
              device="gpu",
              max_tokens=4096,
              n_predict=4096,
              temp=0.7,
              top_p=0.4,
              top_k=40,
              repeat_last_n=64,
              repeat_penalty=1.18
             )

# Specify what you want the model to do
# NOTE. This string is replaced by the dictionary defined in the
# "Query definition" section.
query = """I would like you to generate a digest of the paper {}, highlighting:
    * **Robot.** Characteristics of the device used in the study.
    * **Application.** What the robot is used for.
    * **Population.** Demographics (total number, age, health status) of the subjects taking part in the experimental campaign.
    * **Duration.** Number of interactions of each subject with the robot. Duration of each interaction.
    * **Setting.** Where the experimental campaign is conducted (nursing home, hospital, retirement village, private homes…).
    * **Interaction.** How the user can interact with the robot, and vice versa, during the task (e.g. vocal, touch screen, controller, gestures…).
    * **Outcome measures.** How the application is evaluated in terms of: (i) technical accuracy, (ii) user experience.
    * **Results.** Quantitative evaluation of the outcome measures.

    Please return max 500 words per each bullet point.
    """

n_features = query.count('* **')

### Query definition
Define the query to be passed to the RAG.

The query should be formatted as a dictionary (see example below). 

In [None]:
query = {
    'intro': 'I would like you to generate a digest of the paper, highlighting the following features:',
    'features': [
        # (<feature name>, <question string>, <Excel file column>)
        ('Robot', 'What are the characteristics of the robot used in the study?', 'F'),
        ('Application', 'What is the robot used for?', 'G'),
        ('Population', 'How many people participated in the testing? What is their age? Do they suffer from special health conditions?', 'H'),
        ('Duration', 'How many times does each user interact with the robot? How long does each interaction take?', 'I'),
        ('Setting', 'Where is the experimental campaign conducted?', 'J'),
        ('Interaction', 'How can the user interact with the robot, and vice versa, during the task? E.g. vocal, touch screen, controller, gestures...', 'K'),
        ('Outcome measures', 'Which metrics are used to evaluate the application in terms of technical accuracy? Which metrics are used to evaluate the application in terms of user experience?', 'L'),
        ('Results', 'What are the quantitative results? What are the qualitative results of the study?', 'M')
    ],
    'constraints': 'Please answer in max 500 words',
    'outro': 'per each bullet point'
}

### Generate a reference list from an exported Zotero library 
See [README.md](README.md) for instructions on how to create and export your library.
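
The reference list relies on the `file` field of each BibTeX entry to locate the PDF. The sketch below isolates that step with the same regular expression used in this notebook. The sample field value is an assumption about how Zotero writes a single attachment when files are exported, and the helper name `pdf_path_from_file_field` is hypothetical; entries with several attachments would need extra handling.

```python
import os
import re


def pdf_path_from_file_field(file_field, refs_path):
    """Return the PDF path stored in a BibTeX `file` field, or None if absent."""
    # Keep what lies between ':files/' and ':application' (the MIME type)
    match = re.search(r"(?<=:files/)(.*)(?=:application)", file_field)
    if match is None:
        return None
    return os.path.normpath(os.path.join(refs_path, "files", match.group()))


# Assumed shape of the field: <description>:files/<id>/<name>.pdf:<MIME type>
sample_field = "Full Text PDF:files/12/example paper.pdf:application/pdf"
print(pdf_path_from_file_field(sample_field, "refs"))
print(pdf_path_from_file_field("", "refs"))  # None: no attachment listed
```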

> **NOTE.** The cell below has to be executed only once per collection. Once the Excel file has been created, you should skip the execution of the cell and move directly to [Generate digest](#generate-digest).

In [None]:
import re

from pybtex.database.input import bibtex
import openpyxl

def row_col_to_cell(row, col):
    # Convert 1-based (row, col) indexes to an Excel cell name, e.g. (1, 1) -> 'A1'
    if col < 1 or col > 26:
        raise ValueError('The specified `col` value {} cannot be converted to a single capital letter (valid range: 1-26)'.format(col))
    return chr(col + 64) + str(row)

# Read the BibTeX file
# NOTE. The BibTeX file MUST have the same name as the folder that contains it.
refs_folder_name = os.path.basename(os.path.normpath(REFS_PATH))
bib_parser = bibtex.Parser()
bib_data = bib_parser.parse_file(os.path.join(REFS_PATH,
                                              refs_folder_name + '.bib'
                                              ))
# Write the refs to an Excel file
wb = openpyxl.Workbook()
ws = wb.active # get the active sheet in the workbook
ws.title = 'References'
# Create the header of the Excel table
header = ['Key', 'Author', 'Title', 'Year', 'PDF']
for col in range(1, 5+1): # cells in Excel are 1-based
    ws[row_col_to_cell(row=1, col=col)] = header[col-1]
# Populate the table
row = 2
for key in bib_data.entries.keys():
    # Write the key value
    ws[row_col_to_cell(row=row, col=1)] = key
    entry = bib_data.entries[key]
    # Get and write the authors' names to one cell
    # (entries without an `author` field yield an empty string)
    authors_string = ' and '.join(str(author)
                                  for author in entry.persons.get('author', []))
    ws[row_col_to_cell(row=row, col=2)] = authors_string
    # Get and write the paper title and year of publication
    ws[row_col_to_cell(row=row, col=3)] = entry.fields.get('title', '')
    ws[row_col_to_cell(row=row, col=4)] = entry.fields.get('year', '')
    # Get the file location within `REFS_PATH\files`
    pdf_rel_path = re.search(r"(?<=:files/)(.*)(?=:application)",
                             entry.fields.get('file', '')
                             )
    if pdf_rel_path: # i.e. if there is a match from regex
        pdf_abs_path = os.path.normpath(os.path.join(REFS_PATH,
                                                    'files',
                                                    pdf_rel_path.group()
                                                    ))
        ws[row_col_to_cell(row=row, col=5)] = pdf_abs_path
    row += 1 # move to the row below

# Save the produced Excel file
if os.path.isfile(os.path.join(REFS_PATH, refs_folder_name + '.xlsx')):
    # If the file already exists, append a string to the filename to avoid overwriting a useful file. 
    print('WARNING. The file {} already exists.'.format(refs_folder_name + '.xlsx'))
    refs_folder_name += '_new'
    print('The new file has been saved as {} to avoid overwriting.\nManually rename the files before proceeding.'.format(refs_folder_name + '.xlsx'))
wb.save(os.path.join(REFS_PATH,
                     refs_folder_name + '.xlsx'
                    ))

### Generate digest
Apply the query to the references listed in the selected Excel file.

Use the cell below to set the row indexes (1-based) of the Excel file to process.
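
When all the features are merged into a single query (the `SINGLE_QUERY` option of this notebook), the model is asked to answer in bullet points with bold feature names, and the answer is split on the `**` markers. The sketch below shows that parsing on an invented answer; the helper `split_features` is hypothetical, and it assumes the model follows the requested layout, which is not guaranteed.

```python
def split_features(output, n_features):
    """Extract the text following each bold feature name in a bullet-point answer."""
    # Splitting on '**' alternates between feature names (odd indexes)
    # and the text around them (even indexes)
    fields = output.split('**')
    values = []
    for i in range(n_features):
        # Raises IndexError if the answer contains fewer features than expected
        text = fields[(i + 1) * 2]
        # Drop surrounding whitespace and the bullet marker of the next item
        values.append(text.strip().rstrip('*').strip())
    return values


# Invented answer in the expected layout
answer = "* **Robot.** A small humanoid.\n* **Setting.** A nursing home."
print(split_features(answer, n_features=2))
# ['A small humanoid.', 'A nursing home.']
```

If the answer contains fewer features than expected, an `IndexError` is raised, which the caller can catch in order to store the raw answer for manual parsing.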

In [None]:
MIN_ROW = 0     # enter the desired value
MAX_ROW = 10    # enter the desired value

SINGLE_QUERY = False    
# If True, all the `features` in the `query` are merged into a single string, organized in bullet points. This provides faster, but less accurate responses.
# If False, each feature is passed as a single query. This involves N (where N is equal to `len(query['features'])`) calls to the RAG chain per each paper. The response time should be roughly N times slower, but the provided responses should be more accurate.
if SINGLE_QUERY:
    query_string = query['intro'] + '\n'
    for feature, question, __ in query['features']:
        query_string += '* **{}.** {}\n'.format(feature, question)
    query_string += query['constraints'] + ' ' + query['outro']
    query_list = [query_string]
else:
    query_list = [question for __, question, __ in query['features']]

In [None]:
import re
import time

import openpyxl
from IPython.display import clear_output

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Load the Excel file
# NOTE. The Excel file MUST be named as the folder that contains it.
refs_folder_name = os.path.basename(os.path.normpath(REFS_PATH))
refs_filepath = os.path.join(REFS_PATH,
                             refs_folder_name + '.xlsx'
                            )
wb = openpyxl.load_workbook(filename=refs_filepath)
ws = wb.active # get the active sheet in the workbook

MIN_ROW = max(2, MIN_ROW)
MAX_ROW = min(ws.max_row, MAX_ROW)
for row_cnt, row in enumerate(ws.iter_rows(min_row=MIN_ROW, max_row=MAX_ROW)):
    print("Processing ref {}/{}".format(row_cnt+1, MAX_ROW-MIN_ROW+1))
    clear_output(wait=True)
    
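    # Load the PDF file and read its content
    pdf_path = ws['E' + str(row_cnt + MIN_ROW)].value
    if not pdf_path or not os.path.isfile(pdf_path):
        continue    # no PDF file is available for this reference
    loader = PyMuPDFLoader(pdf_path)
    docs = loader.load()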

    # Split the content into chunks
    chunks = text_splitter.split_documents(docs)
    if not chunks:
        continue    # the PDF is likely generated as an image and it is thus
                    # impossible to extract content from it.

    # Create the vector store (i.e. database) to store the embeddings
    db = FAISS.from_documents(documents=chunks, 
                            embedding=embed_model
                            )
    # Get ready to retrieve relevant sources from the vector store
    retriever = db.as_retriever(search_type="mmr", #"similarity",
                                search_kwargs={"k":6,
                                                "fetch_k":20,
                                                "lambda_mult":0.5,
                                                #"filter":{"source":FILE_PATH}
                                            })
    
    # Create and call the RAG chain
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    
    first_column = ord('F')
    n_features = len(query['features'])
    row_idx = row_cnt + MIN_ROW
    for i, q in enumerate(query_list):
        # Call the RAG chain
        output = rag_chain.invoke(q)
        print(output)
        # Save one feature per Excel column
        if SINGLE_QUERY:
            # Parse the output to fill in the Excel file: the text following
            # each bold feature name is found at the even indexes of `fields`
            fields = output.split('**')
            try:
                for j in range(n_features):
                    ws[chr(first_column + j) + str(row_idx)] = fields[(j+1)*2]
            except IndexError:
                # If the output does not contain all the features, save the
                # raw output for later manual parsing
                ws[chr(first_column + n_features) + str(row_idx)] = output
        else:
            ws[chr(first_column + i) + str(row_idx)] = output
    del rag_chain

    if row_cnt%10 == 0 and not row_cnt == 0:
        # Save the output every 10 items processed
        wb.save(refs_filepath)

wb.save(refs_filepath)
    
    
    