# aisaac benchmarking quickstart guide

Welcome to the interactive testing notebook for `aisaac`, An Intelligent Screening Assistant for Academic Content. This notebook guides you through testing `aisaac` under various configurations, with a focus on its integration with different large language models (LLMs).

## Objective

This notebook aims to:
- Demonstrate how to perform basic and advanced tests on `aisaac`.
- Guide users through creating their own tests for custom configurations.
- Offer executable examples that illustrate `aisaac`'s compatibility and performance with a variety of LLMs.

## Getting Started

Before we dive into the testing procedures, please ensure you have the following prerequisites satisfied:

1. **Python Installation**: Ensure you have Python >= 3.9 installed on your system. This notebook was tested with Python 3.10.

2. **Project Setup**: If you haven't already, clone the `aisaac` repository and navigate to its root directory:

    ```bash
    git clone <repository-url>
    cd aisaac
    ```

3. **Install Dependencies**: Install the required dependencies by running the following command:

    ```bash
    pip install -r requirements.txt
    ```

    Also make sure that the LLMs you want to test are available in one of the following ways:

    - installed locally in Ollama, or
    - served from the cloud via Xinference.
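
The prerequisites above can be checked from Python before starting. The following is a minimal sketch and not part of `aisaac`: the helper name `check_environment` is hypothetical, and it assumes that a local Ollama installation exposes an `ollama` executable on the `PATH`. It does not check models served via Xinference.

```python
import shutil
import sys


def check_environment(min_version=(3, 9)):
    """Return simple prerequisite checks (hypothetical helper, not part of aisaac)."""
    return {
        # Compare the running interpreter against the minimum version from this guide
        "python_ok": sys.version_info >= min_version,
        # Assumption: a local Ollama install provides an `ollama` executable on PATH
        "ollama_on_path": shutil.which("ollama") is not None,
    }


print(check_environment())
```

If `ollama_on_path` is `False` and you rely on cloud-hosted models, this result can be ignored.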

# 1. Setup

Set up the imports.

In [8]:
from aisaac.aisaac.core.screener import Screener
from aisaac.aisaac.utils.context_manager import ContextManager

from aisaac.aisaac.core.criteria_optimizer import CriteriaOptimizer
from aisaac.aisaac.core.evaluator import Evaluator

import time
import csv

Initializing the 'utils' package


Set up the variables.

In [9]:
performance_file = "aisaac_performance.csv"
testable_llms = ["mistral:latest", "gemma:2b", "gemma:7b", "llama2:7b", "mixtral:8x7b-text-v0.1-q2_K"]
# Screening checkpoints: each name maps to a True/False instruction.
standard_checkpoints = {
    "Thyroid Cancer Types":
        "If the study involves any type of thyroid cancer such as Papillary TC, Follicular TC, Medullary TC, Poorly Differentiated TC, Anaplastic TC, Hürthle cell carcinoma, or Non-invasive follicular thyroid neoplasm with papillary-like nuclear features (NIFTP), then return True. However, if the study involves non-malignant entities or conditions other than the specified types of thyroid cancer, then return False.",

    "Study Population":
        "If the study is conducted on human subjects, then return True. Otherwise, if the study is conducted on non-human organisms such as animals, or uses cell lines, return False.",

    "Study Type":
        "If the article is a conference abstract, review, study without results (like a protocol), or model-based study, or if it investigates specific pathways or genetic alterations only, then return False. Otherwise, return True.",

    "DNA alterations":
        "If the study is an original paper reporting on DNA alterations in thyroid cancer, then return True. Otherwise, return False.",

    "Methodology":
        "If the study uses Whole exome sequencing (WES), Whole genome sequencing (WGS), Next generation sequencing (NGS), Sanger sequencing, Custom panel, or Microarray analysis, then return True. Otherwise, return False.",

    "TCGA or RNA/protein":
        "If the study involves computational analysis of TCGA data or any previously reported studies, or if it reports RNA/protein sequencing only, then return False. Otherwise, return True.",
}
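
When writing your own checkpoints, it may help to verify that every instruction names both outcomes, as the standard checkpoints do. The sketch below is only an illustration: `find_incomplete_checkpoints` is a hypothetical helper, and the rule that each text should mention both "return True" and "return False" is an assumption drawn from the wording of the checkpoints above, not a documented `aisaac` requirement.

```python
def find_incomplete_checkpoints(checkpoints):
    """Return the names of checkpoints that do not mention both outcomes."""
    incomplete = []
    for name, text in checkpoints.items():
        lowered = text.lower()
        # Assumption: a complete checkpoint states when to return True and when to return False
        if "return true" not in lowered or "return false" not in lowered:
            incomplete.append(name)
    return incomplete


example = {
    "Study Population": "If the study is conducted on human subjects, then return True. Otherwise, return False.",
    "Broken": "Return True if the study is about thyroid cancer.",
}
print(find_incomplete_checkpoints(example))  # ['Broken']
```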

Initialize the Document Data Manager and the Vector Data Manager once up front, so that this does not have to be repeated for every test.
All Data Managers share the same binary cache (`bin`), which means that the tests cannot be run in parallel.

In [10]:
initial_cm = ContextManager()
initial_cm.set_config('BIN_PATH', "bin")
initial_cm.get_document_data_manager().update_global_data()
# The document stores only have to be created once per storage location:
# uncomment the next line for the first run, then comment it out again.
# initial_cm.get_vector_data_manager().create_document_stores()

FileNotFoundError: [Errno 2] No such file or directory: 'aisaac/bin/doc_mngr.pkl'
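
The `FileNotFoundError` above reports that `doc_mngr.pkl` could not be found under `aisaac/bin`. A small pre-flight check can make this kind of failure easier to diagnose. This is a sketch under assumptions: `missing_cache_files` is a hypothetical helper, and the file and directory names are taken from the error message rather than from `aisaac`'s documentation.

```python
from pathlib import Path


def missing_cache_files(bin_dir, expected=("doc_mngr.pkl",)):
    """Return the expected cache files that are absent from bin_dir."""
    bin_path = Path(bin_dir)
    return [name for name in expected if not (bin_path / name).is_file()]


# The directory is the one named in the error message; adjust it to your setup.
missing = missing_cache_files("aisaac/bin")
if missing:
    print(f"Missing cache files: {missing}")
```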

# 2. Support Methods

Set up your testing configuration here.

In [ ]:
def set_up_cm(current_llm, round_name, checkpoints):
    """Create a ContextManager configured for one benchmark run."""
    setup_cm = ContextManager()
    setup_cm.set_config('RAG_MODEL', current_llm)
    # The result file name combines the test round and the model tag
    setup_cm.set_config('RESULT_FILE', f"results_{round_name}_{current_llm}")
    setup_cm.set_config('FEATURE_IMPORTANCE_THRESHOLD', -0.1)
    setup_cm.set_config('BIN_PATH', "bin")
    setup_cm.set_config('CHECKPOINT_DICTIONARY', checkpoints)
    setup_cm.set_config('LOGGING_LEVEL', "ERROR")
    setup_cm.get_document_data_manager()
    return setup_cm
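
Model tags such as `mistral:latest` contain a colon, which is not allowed in file names on Windows. If `RESULT_FILE` ends up as a file name on such a system (an assumption, since this notebook does not show how `aisaac` uses the value), a sanitized name may be safer. The helper `safe_result_name` below is hypothetical and only sketches the idea.

```python
import re


def safe_result_name(round_name, llm):
    """Build a result name containing only letters, digits, '.', '_' and '-'."""
    # Replace every run of other characters (such as ':') with a single underscore
    safe_llm = re.sub(r"[^A-Za-z0-9._-]+", "_", llm)
    return f"results_{round_name}_{safe_llm}"


print(safe_result_name("initial", "mistral:latest"))  # results_initial_mistral_latest
```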

The benchmarking process consists of screening, evaluation, and optimization. The function below covers screening and evaluation; optimization happens in the main test loop.

In [3]:
def run_eval(cm):
    """Screen with the given configuration, evaluate, and log the result.

    Returns the feature importance (third element of the evaluation).
    """
    # Time the screening step, including construction of the Screener
    start_time = time.perf_counter()
    screener = Screener(cm)
    screener.do_screening()
    elapsed_time = time.perf_counter() - start_time

    evaluator = Evaluator(cm)
    evaluation = evaluator.get_full_evaluation()
    # Append one row per run: result file, evaluation, screening time in seconds, checkpoints
    with open(performance_file, 'a', newline='') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow([cm.get_config('RESULT_FILE'), evaluation, elapsed_time, cm.get_config('CHECKPOINT_DICTIONARY')])
    return evaluation[2]  # feature importance
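
Because `run_eval` appends rows without a header, reading the performance file back requires knowing the column order. The sketch below assumes the four-column layout written above (result file, evaluation, elapsed seconds, checkpoints). `load_performance` is a hypothetical helper, and the demonstration row contains made-up values, not real results.

```python
import csv
import os
import tempfile


def load_performance(path):
    """Read rows written as [result file, evaluation, elapsed seconds, checkpoints]."""
    runs = []
    with open(path, newline="") as csvfile:
        for record in csv.reader(csvfile):
            if len(record) != 4:
                continue  # skip blank or malformed rows
            result_file, evaluation, elapsed, checkpoints = record
            runs.append({
                "result_file": result_file,
                "evaluation": evaluation,    # stored as text by csv.writer
                "elapsed_seconds": float(elapsed),
                "checkpoints": checkpoints,  # also stored as text
            })
    return runs


# Demonstration with a made-up row in a temporary file (not a real benchmark result)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_performance.csv")
    with open(demo_path, "w", newline="") as csvfile:
        csv.writer(csvfile).writerow(["results_initial_demo", "(0.5, 0.5, {})", 12.5, {"A": "x"}])
    runs = load_performance(demo_path)

print(runs[0]["result_file"], runs[0]["elapsed_seconds"])
```

The evaluation and checkpoint columns come back as plain strings, because `csv.writer` stores non-string values using their string representation.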

To keep track of progress, set the `total_iterations` variable to the number of tests you are going to run.

In [7]:
counter = 0
total_iterations = len(testable_llms)*3

def update_counter(counter):
    counter += 1
    print(f"\n\n\n\nThis is iteration No. {counter} out of {total_iterations} total iterations\n")
    return counter

# 3. Main Test

This test runs the benchmarking process for every LLM in the `testable_llms` list. Each LLM is benchmarked three times: once with the initial checkpoints and twice with improved versions of the checkpoints, which are generated by the `CriteriaOptimizer` class.

Feel free to create your own test loop, for example one that compares different similarity search thresholds.

In [ ]:
for llm in testable_llms:
    # standard checkpoints
    counter = update_counter(counter)
    cm = set_up_cm(llm, "initial", standard_checkpoints)
    feature_importance = run_eval(cm)

    # automatically improved checkpoints
    counter = update_counter(counter)
    criteria_optimizer = CriteriaOptimizer(cm)
    new_checkpoints = criteria_optimizer.automated_feature_improvement(feature_importance)
    cm = set_up_cm(llm, "improvauto", new_checkpoints)
    run_eval(cm)

    # context-discriminative improved checkpoints
    counter = update_counter(counter)
    criteria_optimizer = CriteriaOptimizer(cm)
    new_checkpoints = criteria_optimizer.automated_feature_improvement(feature_importance)
    cm = set_up_cm(llm, "improvcont", new_checkpoints)
    run_eval(cm)
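
For a custom test loop over several parameters, it can help to build the list of runs first and iterate over it afterwards. The following sketch only plans the runs and does not call `aisaac`. `build_run_plan` and the `thr…` round names are hypothetical, and the threshold values are arbitrary examples.

```python
from itertools import product


def build_run_plan(llms, thresholds):
    """Return one entry per (LLM, threshold) combination."""
    plan = []
    for llm, threshold in product(llms, thresholds):
        plan.append({
            "llm": llm,
            "threshold": threshold,
            # Hypothetical round name used to tell the result files apart
            "round_name": f"thr{threshold}",
        })
    return plan


plan = build_run_plan(["mistral:latest", "gemma:2b"], [0.5, 0.7])
for run in plan:
    print(run["llm"], run["round_name"])
print(len(plan))  # 4
```

Each entry could then be passed to `set_up_cm(run["llm"], run["round_name"], standard_checkpoints)`, followed by a `set_config` call for the threshold. The configuration key for the threshold is not shown in this notebook, so it has to be looked up in `aisaac`'s configuration. With this approach, `total_iterations` can be set to `len(plan)`.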

# 4. Conclusion
Your insights are invaluable to us, so please consider sharing your findings.

If you encounter any issues, have suggestions for improvements, or want to discuss your testing experiences, please reach out through the project's repository issues or contact channels.