## Running end-to-end experiments!

In this notebook, we:

1. Unpack our dataset (i.e. METCLOUD Q&A). We have two: the first is the 119 questions, which we treat as our __test__ set, and the second is the ~2400-question dataset, which we use for some RAG experiments later on.
2. [OPTIONAL] Create a vector database from the selected dataset; this is for the RAG experiments.
3. Use a `QuestionAnswering` class to ask our LLM each question in the dataset. We have six different LLMs to try, all from Ollama.
4. Use a `Verifier` to measure how accurate the responses generated by `QuestionAnswering` are; we do this using two different LLMs.
5. Fin! We still need to plot the results, which we do in a separate notebook.

In [8]:
%load_ext autoreload
%autoreload 2

In [9]:
import pandas as pd
import ollama
from pathlib import Path
from src.qa import QuestionAnswering
from src.vectordb import ChromaDB
from src.verifier import Verifier



In [18]:
#this is the path, i.e. the location where our datasets are stored on our PC
data_path = Path('/data/')

#now we set our 'experiment' parameters
mode      = 'test'                          #this means we use the test set i.e. 119 questions.
rerank    = False                            #set this to True to use RERANKING when doing RAG
emb_model = "all-MiniLM-L6-v2"              #this is the RAG embedding model

#now we load the dataset using pandas
if mode == "test":
    data_df = pd.read_csv(data_path/'metcloud-with-id.csv')
else:
    data_df = pd.read_csv(data_path/'METCLOUD_training.csv')

#this just prints how many questions we have in our dataframe
print('data shape:',data_df.shape)

#we now create a vector database using our ChromaDB class that we wrote
#firstly, we create a cache folder to store our embedding model
#then we pass our dataset and chroma will automatically embed the questions
chroma_cache = data_path/f'chroma_cache/chromadb_{mode}_{emb_model}'
chroma_db    = ChromaDB(chroma_cache,
                        data_df,
                        embedding_model = emb_model)

#create directories to store our generated answers.
core_dir = data_path/'Benchmarking'
save_dir = core_dir/f'dataset_{mode}_emb_{emb_model}_rerank_{rerank}'
save_dir.mkdir(exist_ok=True,parents=True)

#the questions we ask always come from the test set, whichever dataset built the vector database
data_df = pd.read_csv(data_path/'metcloud-with-id.csv')

print('mode:',mode)
print('embedding_model:',emb_model)
print('rerank:',rerank)


data shape: (119, 4)




mode: test
embedding_model: all-MiniLM-L6-v2
rerank: False
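
The answers folder above encodes the three experiment parameters in its name, and the verification step later in this notebook has to arrive at the same folder. One way to keep the two in sync is a small helper. This is only a sketch: `run_dir` is a hypothetical name, not something in `src`, and the layout simply mirrors the f-string used in the cell above.

```python
from pathlib import Path

def run_dir(root, stage, mode, emb_model, rerank):
    """Return root/stage/dataset_{mode}_emb_{emb_model}_rerank_{rerank}.

    Hypothetical helper: it mirrors the folder naming used in the cell above.
    """
    return Path(root) / stage / f"dataset_{mode}_emb_{emb_model}_rerank_{rerank}"

# same parameters as the run above
print(run_dir("/data", "Benchmarking", "test", "all-MiniLM-L6-v2", False))
```

Calling it with `"Verification"` as the stage and the same parameters gives the matching results folder.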


Quickly check that the vector database is working as expected: ask a question and see whether we get an appropriate Q&A pair back.

In [19]:
chroma_db.retrieve('what is metcloud',k=1)

["METCLOUD is a secure sovereign cloud service provider that specializes in offering digital modernization through advanced cybersecurity and artificial intelligence. It is designed to support businesses in adopting next-generation technologies for cloud computing and cybersecurity, ensuring that they stay secure, effective, and efficient. METCLOUD's approach is tailored to meet the unique needs of businesses, with a focus on a people-first strategy. The platform is scalable, making it suitable for small to medium-sized enterprises, and it has been recognized for its excellence in the field, including being named the Cybersecurity Firm of the Year by Finance Monthly in the 2021 FinTech Awards."]
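
For intuition about what `retrieve` is doing, here is a toy sketch of nearest-neighbour retrieval. It is not the `ChromaDB` class from `src.vectordb`: word-count vectors stand in for the `all-MiniLM-L6-v2` embeddings, and `embed`, `cosine`, `toy_retrieve` and the two sample documents are made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a bag-of-words count vector (NOT a real sentence embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_retrieve(query, documents, k=1):
    """Return the k documents most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# made-up documents, for illustration only
docs = [
    "metcloud is a secure sovereign cloud service provider",
    "the office is closed on public holidays",
]
print(toy_retrieve("what is metcloud", docs, k=1))
```

A real embedding model also matches paraphrases that share no words with the query, which word counts cannot do.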

### Now we run the Question and Answering loop!
In this cell, we do the following:
1. Define a list of open-source models available on Ollama.
2. Write a for loop to go through each model.
3. 'Pull' the model; this downloads it if we don't already have it.
4. Create a `QuestionAnswering` object.
5. Process the data.
6. Ask each question in the dataset and store the results in the `save_dir` we set earlier. This can use a vector database if we provide one.

In [14]:
finetuned = [i['name'] for i in ollama.list()['models'] if 'metcloud' in i['name']]
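
The comprehension above assumes every entry returned by `ollama.list()` exposes a `name` key. Depending on the installed version of the `ollama` Python client, the tag may be reported under `model` instead (an assumption worth checking against your version), so a filter that tolerates both keys is a little more robust. `pick_models` below is a hypothetical helper that works on plain dictionaries, so it runs without an Ollama server; the sample listing reuses model tags that appear in this notebook's output.

```python
def pick_models(entries, keyword):
    """Return the tags of all entries whose tag contains `keyword`.

    Assumes each entry is a mapping with the tag under 'name' or 'model'.
    """
    tags = []
    for entry in entries:
        tag = entry.get("name") or entry.get("model") or ""
        if keyword in tag:
            tags.append(tag)
    return tags

# sample listing in the shape the cell above expects
listing = [
    {"name": "metcloud_1epoch_Qwen2-7B-instruct-bnb-4bit:latest"},
    {"model": "metcloud_1epoch_Phi-3-mini-4k-instruct-bnb-4bit:latest"},
    {"name": "llama3.1:latest"},
]
print(pick_models(listing, "metcloud"))
```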

In [None]:
#models = ['phi3','mistral','gemma2','llama3','llama3.1','qwen2']
models = finetuned

#loop through each of the models in models
for model in models:
  #download the model if we don't have it (the fine-tuned models are already local, so the pull is commented out)
  #ollama.pull(model)
  print('MODEL:',model)

  #create question/answering class
  qa = QuestionAnswering(model=model)

  #process the dataset
  qa.process_dataset(data_df)

  #ask all questions, saving responses to a .csv file in save_dir
  qa.ask_all_questions(save_dir,
                       vector_db=chroma_db,
                       rerank = rerank)

MODEL: metcloud_1epoch_Qwen2-7B-instruct-bnb-4bit:latest


100%|██████████| 119/119 [07:32<00:00,  3.81s/it]


False
MODEL: metcloud_1epoch_Phi-3-mini-4k-instruct-bnb-4bit:latest


  0%|          | 0/119 [00:00<?, ?it/s]

In [16]:
!ollama serve

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Error: listen tcp 127.0.0.1:11434: bind: address already in use
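
The `address already in use` error means something, most likely an Ollama server that is already running, is listening on `127.0.0.1:11434`, so a second `ollama serve` is not needed. A quick check before launching is to try connecting to the port. This is a sketch using only the standard library; `port_in_use` is a hypothetical helper, and the host and port are taken from the error message above.

```python
import socket

def port_in_use(host="127.0.0.1", port=11434, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0

if port_in_use():
    print("port 11434 is taken: a server is probably already running")
else:
    print("port 11434 is free: `ollama serve` should be able to bind to it")
```

The tokenizers warning in the same output is separate: as the message itself suggests, it can be silenced by setting the `TOKENIZERS_PARALLELISM` environment variable.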


### Now we do verification, i.e. how good were the LLM responses?

1. Grab the answers generated in the previous cell.
2. Define a set of 'verification' models; in this case, llama3.1 and gemma2.
3. Loop through these models, creating a verifier for each.
4. Loop through the generated answers and use the verifier to check whether they were good.
5. Save the results.

In [9]:
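# We first pick which set to verify
rerank = True
mode   = 'test'
emb_model = "all-MiniLM-L6-v2"
data_path = Path('/data/')

#point save_dir at the answers generated with these settings (same layout as the QA cell),
#otherwise we would read whatever save_dir was left over from an earlier cell
save_dir = data_path/'Benchmarking'/f'dataset_{mode}_emb_{emb_model}_rerank_{rerank}'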

#now we define our verifier models i.e vmodels
vmodels = ['llama3.1','gemma2']

#this is the list of tested models
models = ['gemma2','phi3','qwen2','llama3','llama3.1','mistral']

#we loop through our verifier models; the inner loop below goes through the tested models
for vmodel in vmodels:
  #makes sure we definitely have the verifier model  
  ollama.pull(vmodel)

  #now we create a verifier class, using our verifier model
  verifier = Verifier(vmodel)

  #'this' is just a directory where we will store our results
  this  = f'dataset_{mode}_emb_{emb_model}_rerank_{rerank}'
  this  = data_path/f'Verification/{this}'
  this.mkdir(exist_ok=True,parents=True)

  #loop through model folders in the question answering directory 
  for model in models:

    #create a directory in our verifier directory to store these results
    pth = this/f'{vmodel}'
    pth.mkdir(exist_ok=True,parents=True)

    #load the csv containing all of the results
    answer_df = pd.read_csv(f'{save_dir}/{model}/all_questions.csv')

    #get verifier to see how good the responses were
    verifier.judge_all_questions(answer_df,model,pth)

100%|██████████| 119/119 [04:09<00:00,  2.10s/it]
100%|██████████| 119/119 [03:37<00:00,  1.82s/it]
100%|██████████| 119/119 [04:54<00:00,  2.47s/it]
100%|██████████| 119/119 [04:32<00:00,  2.29s/it]
100%|██████████| 119/119 [04:37<00:00,  2.34s/it]
100%|██████████| 119/119 [04:43<00:00,  2.38s/it]
100%|██████████| 119/119 [08:02<00:00,  4.06s/it]
100%|██████████| 119/119 [06:40<00:00,  3.36s/it]
100%|██████████| 119/119 [08:20<00:00,  4.21s/it]
100%|██████████| 119/119 [07:56<00:00,  4.00s/it]
100%|██████████| 119/119 [07:57<00:00,  4.01s/it]
100%|██████████| 119/119 [08:04<00:00,  4.07s/it]
