# CI/CD in Generative AI Development Series
## LLM Stability Testing with Python unit testing frameworks 

This notebook shows how to run stability tests with the unit testing frameworks available in Python. Stability testing matters for Generative AI applications, especially those that call hosted LLM APIs such as OpenAI, Anthropic, Gemini, or Bedrock. The models behind these APIs are frequently updated or hot-swapped, which can change the behaviour of the production applications that depend on them. Consumers of these APIs should therefore run automated stability tests to catch any drift in LLM behaviour or output. In that sense, stability testing acts as the canary in the coal mine for critical Generative AI applications.

## Scenario

A process automation application uses an LLM to ingest bills submitted by users and extract data from them for upload to a back-office application. In this scenario, the stability of the GPT-4o endpoint means its ability to keep extracting the correct information from the bills over time. To make sure the LLM's performance does not degrade, developers need tests that measure it against an evaluation dataset and raise an alert when those tests fail. This is very similar to unit testing in CI/CD for software development.

### Input Data

For this showcase, we are using the Kaggle dataset provided by the user ------. To test the stability of our LLM, we have created an evaluation dataset that pairs each input file with its expected output.

In [1]:
import json
import pymupdf4llm
import pandas as pd
from openai import OpenAI

In [2]:
llm_test_data = pd.DataFrame({'filename':['chowderhut_20231005_011.pdf','shell_20231005_003.pdf',
                                          'beerhouse_20231209_005.pdf','dennys_20231209_004.pdf',
                                          'cafemason_20231005_009.pdf','topgolf_20231209_011.pdf',
                                          'yellow_20231209_008.pdf'],
                              'output':[{'category': 'food', 'total_bill_amount': 21.15},
                                        {'category': 'transport', 'total_bill_amount': 28.32},
                                        {'category': 'food', 'total_bill_amount': 41.72},
                                        {'category': 'food', 'total_bill_amount': 58.44},
                                        {'category': 'food', 'total_bill_amount': 32.59},
                                        {'category': 'other', 'total_bill_amount': 155.68},
                                        {'category': 'transport', 'total_bill_amount': 43.02},
                                       ]
                             })
llm_test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   filename  7 non-null      object
 1   output    7 non-null      object
dtypes: object(2)
memory usage: 244.0+ bytes


In [3]:
llm_test_data.head()

| | filename | output |
|---|---|---|
| 0 | chowderhut_20231005_011.pdf | {'category': 'food', 'total_bill_amount': 21.15} |
| 1 | shell_20231005_003.pdf | {'category': 'transport', 'total_bill_amount':... |
| 2 | beerhouse_20231209_005.pdf | {'category': 'food', 'total_bill_amount': 41.72} |
| 3 | dennys_20231209_004.pdf | {'category': 'food', 'total_bill_amount': 58.44} |
| 4 | cafemason_20231005_009.pdf | {'category': 'food', 'total_bill_amount': 32.59} |


### LLM Function
This is the LLM completion function under test. It simulates the endpoint that ingests the text of a bill submitted for reimbursement and returns the bill amount and the bill category. The function uses the GPT-4o model, but this notebook can be used with any other LLM API or a self-hosted LLM as well.

In [4]:
# Load the API key from a local JSON file that stores it under the "key" field
with open('./openai_key.json') as f:
    key = json.load(f)['key']

def prompt_llm(bill_text, key):
    """Send the extracted bill text to GPT-4o and return the raw response string."""
    client = OpenAI(api_key=key)
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an AI assistant, skilled in helping "
             "extract billing information from invoices and bills to help in reimbursement processes."},
            {"role": "user", "content": "Using the following extracted information from a bill in triple quotes, generate "
             "total_bill_amount, category which is one of food, transport, other as output. "
             "Output only in json format as {\"total_bill_amount\":, \"category\":}. "
             "\n Bill Extract: \n \n \n '''" + bill_text + "'''"}
        ],
        seed=42,  # fixed seed asks the API for best-effort reproducible sampling
    )
    return completion.choices[0].message.content
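
`prompt_llm` returns a raw string, and the model may surround the JSON with extra text, which is why the tests in this notebook slice the response between the first `{` and the last `}` before parsing it. As a minimal sketch, that slicing could live in one helper that also fails with a clear message when no JSON object is present. The name `parse_llm_json` and the sample response are assumptions made for illustration; they are not part of the original notebook.

```python
import json


def parse_llm_json(response: str) -> dict:
    """Parse the span from the first '{' to the last '}' of an LLM response.

    Hypothetical helper: mirrors the slicing used in the tests, but raises a
    clear error when the response contains no JSON object at all.
    """
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"No JSON object found in response: {response!r}")
    return json.loads(response[start:end + 1])


# Made-up response with chatter around the JSON (no API call is made)
sample = 'Here is the result: {"total_bill_amount": 21.15, "category": "food"} Done.'
parsed = parse_llm_json(sample)
print(parsed)
```

A response that contains braces but invalid JSON would still raise `json.JSONDecodeError`, so a test using this helper fails loudly instead of silently passing.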

### Stability Tests using Python Unittest

Python's [unittest library](https://docs.python.org/3/library/unittest.html) is the de facto unit testing library and ships with the standard Python distribution. It takes a class-based approach: tests are written as methods of a class that subclasses [TestCase](https://docs.python.org/3/library/unittest.html#unittest.TestCase), and each test method's name must start with "test" to be discovered.

In our scenario, we will include three test cases with varying levels of stability checks on the LLM API. The tests will be executed in the required pipelines to ensure that the LLM is stable and no interventions are required. 

In [5]:
import unittest

class LLMStability(unittest.TestCase):
        
    def test_one_row(self):
        # Testing a single row
        row = llm_test_data.iloc[0]
        md_text = pymupdf4llm.to_markdown("./bills/"+row['filename'])
        response = prompt_llm(md_text, key)
        response_json = json.loads(response[response.find('{'):response.rfind('}')+1])
        self.assertDictEqual(response_json, {'category': 'food', 'total_bill_amount': 21.15})

    def test_llm_output_sample(self):
        # Testing a sample of rows
        for i,row in llm_test_data.head(3).iterrows():
            md_text = pymupdf4llm.to_markdown("./bills/"+row['filename'])
            response = prompt_llm(md_text, key)
            response_json = json.loads(response[response.find('{'):response.rfind('}')+1])
            self.assertDictEqual(response_json, row['output'])
    
    def test_llm_output_full(self):
        # Testing entire evaluation dataset and checking threshold of at least 50% match rate
        EXPECTED_MATCH_RATE = 0.5
        rows_correct = 0
        for i, row in llm_test_data.iterrows():
            md_text = pymupdf4llm.to_markdown("./bills/" + row['filename'])
            response = prompt_llm(md_text, key)
            # Parse only the span between the first '{' and the last '}'
            response_json = json.loads(response[response.find('{'):response.rfind('}') + 1])
            if response_json == row['output']:
                rows_correct += 1
        self.assertGreaterEqual(rows_correct / len(llm_test_data), EXPECTED_MATCH_RATE)
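
The tests above compare dictionaries exactly, so a response that returns the amount as the string `"21.15"` counts as a mismatch. One possible alternative, sketched here rather than taken from the notebook, compares the category exactly and the amount within a small tolerance before computing the match rate. The helper names `outputs_match` and `match_rate`, the tolerance of 0.005, and the canned responses are all assumptions for illustration.

```python
import math


def outputs_match(actual: dict, expected: dict, tol: float = 0.005) -> bool:
    """Compare category exactly and the bill amount within a small tolerance."""
    try:
        amount = float(actual["total_bill_amount"])
    except (KeyError, TypeError, ValueError):
        # Missing field or a value that cannot be read as a number
        return False
    return (
        actual.get("category") == expected["category"]
        and math.isclose(amount, expected["total_bill_amount"], abs_tol=tol)
    )


def match_rate(actuals: list, expecteds: list) -> float:
    """Fraction of rows whose actual output matches the expected output."""
    matches = sum(outputs_match(a, e) for a, e in zip(actuals, expecteds))
    return matches / len(expecteds)


# Made-up responses standing in for real LLM output (no API call is made)
expected = [
    {"category": "food", "total_bill_amount": 21.15},
    {"category": "transport", "total_bill_amount": 28.32},
]
actual = [
    {"category": "food", "total_bill_amount": "21.15"},  # amount returned as a string
    {"category": "other", "total_bill_amount": 28.32},   # wrong category
]
print(match_rate(actual, expected))  # 0.5
```

Whether a tolerant comparison is appropriate depends on the application: for reimbursement amounts, a strict comparison may be the safer default.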

#### Running the Tests

In [6]:
result = unittest.main(argv=[''], verbosity=2, exit=False)

test_llm_output_full (__main__.LLMStability.test_llm_output_full) ... ok
test_llm_output_sample (__main__.LLMStability.test_llm_output_sample) ... ok
test_one_row (__main__.LLMStability.test_one_row) ... ok

----------------------------------------------------------------------
Ran 3 tests in 10.813s

OK


#### Test Results
All the stability tests pass, so the LLM is stable and no intervention is required. To see what the output looks like when a test fails, change `EXPECTED_MATCH_RATE` to 0.99.

In [7]:
print('LLM Stability Test Results using UNITTESTS Library \n' + '-' * 30 +
      '\nTotal Tests Executed: {}\nTests Failed: {}\nTest Execution Run Success: {}'.format(
          result.result.testsRun,
          len(result.result.failures),
          result.result.wasSuccessful()))

LLM Stability Test Results using UNITTESTS Library 
------------------------------
Total Tests Executed: 3
Tests Failed: 0
Test Execution Run Success: True


### Stability Tests using Python Pytest

[Pytest](https://docs.pytest.org/en/stable/) is a very popular testing framework, valued for its plain `assert` statements and its fixtures. Some developers prefer it over the unittest library because unittest's class-based API was modelled on Java's JUnit, whereas pytest tests can be plain functions. Both frameworks are robust and general-purpose and have been used in commercial-grade applications of all kinds, so the choice comes down to the developer's preference.

We will implement the same stability tests with pytest to contrast the two implementation styles.

In [8]:
#!pip install ipytest

In [9]:
import ipytest
ipytest.autoconfig()

In [10]:
def test_one_row_pt():
    # Testing a single row
    row = llm_test_data.iloc[0]
    md_text = pymupdf4llm.to_markdown("./bills/"+row['filename'])
    response = prompt_llm(md_text, key)
    response_json = json.loads(response[response.find('{'):response.rfind('}')+1])
    assert response_json == {'category': 'food', 'total_bill_amount': 21.15} 

def test_llm_output_sample_pt():
    # Testing a sample of rows
    for i,row in llm_test_data.head(3).iterrows():
        md_text = pymupdf4llm.to_markdown("./bills/"+row['filename'])
        response = prompt_llm(md_text, key)
        response_json = json.loads(response[response.find('{'):response.rfind('}')+1])
        assert response_json == row['output']

def test_llm_output_full_pt():
    # Testing entire evaluation dataset and checking threshold of at least 50% match rate
    EXPECTED_MATCH_RATE = 0.5
    rows_correct = 0
    for i, row in llm_test_data.iterrows():
        md_text = pymupdf4llm.to_markdown("./bills/" + row['filename'])
        response = prompt_llm(md_text, key)
        # Parse only the span between the first '{' and the last '}'
        response_json = json.loads(response[response.find('{'):response.rfind('}') + 1])
        if response_json == row['output']:
            rows_correct += 1
    # ">=" matches "at least 50%" and the assertGreaterEqual check in the unittest version
    assert rows_correct / len(llm_test_data) >= EXPECTED_MATCH_RATE
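
The loop in `test_llm_output_sample_pt` stops at the first row that fails, so later rows are never checked. One option, sketched here as an assumption rather than as the notebook's approach, is `pytest.mark.parametrize`, which turns each row into its own test case with its own pass or fail line. The `CANNED_RESPONSES` dictionary is a hypothetical stand-in for calling `prompt_llm`, so the sketch runs without an API key.

```python
import pytest

# Hypothetical canned responses keyed by filename; a real test would call
# prompt_llm on the parsed PDF instead of looking up a stored dict.
CANNED_RESPONSES = {
    "chowderhut_20231005_011.pdf": {"category": "food", "total_bill_amount": 21.15},
    "shell_20231005_003.pdf": {"category": "transport", "total_bill_amount": 28.32},
}

EXPECTED_ROWS = [
    ("chowderhut_20231005_011.pdf", {"category": "food", "total_bill_amount": 21.15}),
    ("shell_20231005_003.pdf", {"category": "transport", "total_bill_amount": 28.32}),
]


@pytest.mark.parametrize("filename, expected", EXPECTED_ROWS)
def test_row_matches(filename, expected):
    # Each (filename, expected) pair is collected and reported as its own test
    assert CANNED_RESPONSES[filename] == expected
```

With real LLM calls, each parametrized case makes its own API request, so the cost is the same as the loop but the report shows exactly which bills failed.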

#### Running the Tests

Note: pytest runs 6 tests instead of 3. This is because pytest also collects tests written as `unittest.TestCase` classes and follows the same "test" naming convention as the unittest library, so it picks up the `LLMStability` tests defined earlier. This allows easy migration of existing unit tests to pytest.

In [11]:
result_pt = ipytest.run('-vv')

platform darwin -- Python 3.11.9, pytest-8.3.3, pluggy-1.5.0 -- /opt/anaconda3/envs/devenv311/bin/python
cachedir: .pytest_cache
rootdir: /Users/y2ee201/Documents/evolveailabs/blogs/stability
plugins: anyio-4.4.0
collecting ... collected 6 items

t_2cfefb81a85948f0b46123638742dec6.py::LLMStability::test_llm_output_full PASSED             [ 16%]
t_2cfefb81a85948f0b46123638742dec6.py::LLMStability::test_llm_output_sample PASSED           [ 33%]
t_2cfefb81a85948f0b46123638742dec6.py::LLMStability::test_one_row PASSED                     [ 50%]
t_2cfefb81a85948f0b46123638742dec6.py::test_one_row_pt PASSED                                [ 66%]
t_2cfefb81a85948f0b46123638742dec6.py::test_llm_output_sample_pt PASSED                      [ 83%]
t_2cfefb81a85948f0b46123638742dec6.py::test_llm_output_full_pt PASSED                        [100%]

../../../../../../opt/anaconda3/envs

### Integration into Pipelines

Both unittest and pytest integrate easily into pipelines. If you already have a CI/CD pipeline in a tool such as GitHub Actions or Azure DevOps, move the tests into `.py` files and add a pipeline step that runs them with `python -m unittest` or `pytest`. Both commands exit with a non-zero status when a test fails, which marks the pipeline run as failed.

## Conclusion

Given the rising adoption of LLMs and Generative AI for consumer and commercial use, it is crucial to test the stability of LLMs frequently to avoid surprises in automation pipelines that support critical operations such as payment reconciliation, data processing, and customer support. Combining popular unit testing frameworks with CI/CD pipelines is one of the quickest ways to add LLM stability testing to your workflows.

We at Evolve AI Labs would love to help you on your transformative journey with Generative AI.