<a href="https://colab.research.google.com/github/NoamMichael/Comparing-Confidence-in-LLMs/blob/main/LSAT_Benchmarking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# This Notebook will test all models on the formatted LSAT-AR dataset
# I have no issue running multiple API clients simultaneously. However, running
# local models is pretty memory intensive so I can only run one at a time.

# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
#                 TO DO LIST:
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
# (X) Implement GetRAC() for Anthropic and Gemini Models
# (X) Fix format_df() function (Daniel's implementation)
# (X) Fix how system prompts work (maybe make a metaclass var?)
#       --Honestly works fine now as a global var. I don't see a need to change it.
# (X) Make an init_models() function:
#     def init_models(models_dict):
#         "Do some stuff"
#         return models_list
# (X) Clean up notebook
#  O  GetRAC for open models.
#  O  test_models_sequential_by_question for open models
#     ---Only looking at a few open models. May be better to split into a different notebook as they can't run in parallel.
#  O  Fix up some of the methodology to be more in line with previous work.




## Stopped at 133 / 1620
## ---Find out how to get a better Gini coefficient in the confidence distribution. (Ended up improving the system prompt.)
##
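
The note above refers to a Gini coefficient over the models' stated confidences. The notebook does not show how that number was computed, so the following is only a minimal sketch of one common definition (the rank-weighted formula over sorted values). The function name `gini` and the sample lists are illustrative assumptions, not taken from the benchmark.

```python
def gini(values):
    """Gini coefficient of non-negative values (0.0 means all values are equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(rank * x) / (n * total) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical stated confidences for four questions
uniform = [0.9, 0.9, 0.9, 0.9]     # the model always reports the same confidence
spread = [0.2, 0.5, 0.8, 0.95]     # the model varies its confidence

print(gini(uniform), gini(spread))
```

Under this definition a model that reports the same confidence on every question scores 0, and the score grows as the stated confidences become more unequal.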




## Initialize:

In [None]:
## Pip Installs

In [None]:
%pip install anthropic
%pip install openai
%pip install tqdm

Collecting anthropic
  Downloading anthropic-0.54.0-py3-none-any.whl.metadata (25 kB)
Downloading anthropic-0.54.0-py3-none-any.whl (288 kB)
Installing collected packages: anthropic
Successfully installed anthropic-0.54.0


In [None]:
## Imports
import pandas as pd
import numpy as np
import json
import time
import random
import torch
import matplotlib.pyplot as plt
from transformers import (AutoTokenizer,
                        AutoModelForCausalLM,
                        BitsAndBytesConfig,
                        pipeline)
import warnings
import openai
import anthropic
import os
import google.generativeai as genai
from abc import ABC, abstractmethod
from tqdm.notebook import tqdm
from google.colab import userdata


warnings.filterwarnings('ignore')
os.environ["HF_HUB_VERBOSITY"] = "critical"

## Closed Models

In [None]:


max_tokens = 150

class ClosedModel(ABC):
  @abstractmethod
  def generate(self, prompt: str, system:str = "")-> str:
        """
        Abstract method to generate a response from the language model.
        """
        pass
  @abstractmethod
  def __init__(self, name, api_key):
    self.name = name
    self.key = api_key
    self.results = pd.DataFrame(columns = ['Question ID','Question', 'Answer', 'Reasoning', 'Error'])
    pass

  @abstractmethod
  def client(self):
    pass

  @abstractmethod
  def GetRAC(self, prompt: str) -> tuple[str, str]: ## Get Reasoning, Answer and Confidence (signature matches the subclasses)
    pass

class GPTmodel(ClosedModel):
  def __init__(self, name, api_key):
    self.name = name
    self.key = api_key
    self.results = pd.DataFrame(columns = ['Question ID','Question', 'Answer', 'Reasoning', 'Error'])
  def client(self):
    # Initialize the OpenAI client with the API key
    self.client = openai.OpenAI(api_key=self.key)


  def generate(self, prompt: str, system: str = "") -> str:
    # Use the new client-based API call
    response = self.client.chat.completions.create(
        model=self.name,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        temperature=0,
        max_tokens= max_tokens
    )
    # Access the content from the new response object structure
    return response.choices[0].message.content
  def GetRAC(self, prompt: str)-> tuple[str, str]: ## Get Reasoning Answer Confidence
    ## For context here's the two system prompts:
    ## System prompt 1:
    '''
    Given the following question, analyze the options, and provide a concise reasoning for your selected answer. Your reasoning should not exceed 100 words. After your explanation, clearly state your answer by choosing one of the options listed (A, B, C, D, or E).

    Question: ${Question}
    Options:
    A) ${Option A}
    B) ${Option B}
    C) ${Option C}
    D) ${Option D}
    E) ${Option E}

    Please provide your reasoning first, limited to 100 words, and then conclusively state only your selected answer using the corresponding letter (A, B, C, D, or E).
    Reasoning: <Your concise reasoning here. Max 100 words>
    '''

    ## System prompt 2:
    '''
    Based on your reasoning, select the most likely answer choice and estimate the probability that each option is correct. Express your uncertainty by assigning probabilities between 0.0 and 1.0 in a JSON format. The probabilities should sum to 1.0. For example:

    {
    'Answer': <Your answer choice here, as a single letter and nothing else.>
    'A': <Probability choice A is correct. As a float from 0.0 to 1.0>,
    'B': <Probability choice B is correct. As a float from 0.0 to 1.0>,
    'C': <Probability choice C is correct. As a float from 0.0 to 1.0>,
    'D': <Probability choice D is correct. As a float from 0.0 to 1.0>,
    'E': <Probability choice E is correct. As a float from 0.0 to 1.0>
    }
    '''
    # Access global system prompts
    global sys_prompt1, sys_prompt2
    ## Get the reasoning
    reasoning = self.generate(prompt, sys_prompt1)
    ## Get the answer and confidence
    answer_confidence = self.generate(prompt + reasoning + sys_prompt2, sys_prompt2)

    return reasoning, answer_confidence

class AnthropicModel(ClosedModel):
  def __init__(self, name, api_key):
    self.name = name
    self.key = api_key
    self.results = pd.DataFrame(columns = ['Question ID','Question', 'Answer', 'Reasoning', 'Error'])
  def client(self):
    # Initialize the Anthropic client with the API key
    self.client = anthropic.Anthropic(api_key=self.key)

  def generate(self, prompt: str, system: str = "") -> str:
    # The messages list should only contain user and assistant roles
    messages = [{"role": "user", "content": prompt}]

    # Use the Anthropic client to create a message
    # Pass the system message as a top-level 'system' parameter.
    # When it is empty, omit it with NOT_GIVEN: the SDK types 'system' as a
    # string (or content blocks), not None.
    message = self.client.messages.create(
        model=self.name,
        max_tokens=max_tokens, # Shared module-level limit; could be made an instance variable
        messages=messages,
        system=system if system else anthropic.NOT_GIVEN
    )
    # Access the content from the response object
    return message.content[0].text

  def GetRAC(self, prompt: str)-> tuple[str, str]: ## Get Reasoning Answer Confidence
    # Access global system prompts
    global sys_prompt1, sys_prompt2
    ## Get the reasoning
    reasoning = self.generate(prompt, sys_prompt1)
    ## Get the answer and confidence
    answer_confidence = self.generate(prompt + reasoning + sys_prompt2, sys_prompt2)

    return reasoning, answer_confidence
class GeminiModel(ClosedModel):
  def __init__(self, name, api_key):
    self.name = name
    self.key = api_key
    self.results = pd.DataFrame(columns = ['Question ID','Question', 'Answer', 'Reasoning', 'Error'])

  def client(self):
    # Initialize the google.generativeai client with the API key

    genai.configure(api_key=self.key)
    self.model = genai.GenerativeModel(model_name=self.name)

  def generate(self, prompt: str, system: str = "") -> str:
    # Build the content list, including the system message if provided
    contents = [{"role": "user", "parts": [prompt]}]
    if system:
        contents = [{"role": "user", "parts": [system]}] + contents

    # Use the Gemini model to generate content
    response = self.model.generate_content(contents)

    # Access the content from the response object
    return response.text

  def GetRAC(self, prompt: str)-> tuple[str, str]: ## Get Reasoning Answer Confidence
    # Access global system prompts
    global sys_prompt1, sys_prompt2
    ## Get the reasoning
    reasoning = self.generate(prompt, sys_prompt1)
    ## Get the answer and confidence
    answer_confidence = self.generate(prompt + reasoning + sys_prompt2, sys_prompt2)

    return reasoning, answer_confidence
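
The three `GetRAC` methods above share one two-step flow: ask for reasoning first, then send the question, the reasoning and the second prompt together to get the answer and probabilities. A minimal offline sketch of that flow follows. `EchoModel` and its canned replies are hypothetical stand-ins so the flow can be checked without API keys, and the prompts are passed as arguments here instead of being read from globals.

```python
class EchoModel:
    """Hypothetical offline stand-in for a ClosedModel subclass (makes no API calls)."""

    def __init__(self, name):
        self.name = name
        self.calls = []  # records every (prompt, system) pair sent to generate()

    def generate(self, prompt, system=""):
        self.calls.append((prompt, system))
        # Canned replies: reasoning on the first call, a JSON-like answer afterwards
        if len(self.calls) == 1:
            return "B fits every constraint. "
        return "{'A': 0.1, 'B': 0.6, 'C': 0.1, 'D': 0.1, 'E': 0.1, 'Answer': 'B'}"

    def GetRAC(self, prompt, sys_prompt1, sys_prompt2):
        # Step 1: reasoning only
        reasoning = self.generate(prompt, sys_prompt1)
        # Step 2: question + reasoning + confidence instructions
        answer_confidence = self.generate(prompt + reasoning + sys_prompt2, sys_prompt2)
        return reasoning, answer_confidence


model = EchoModel("echo")
reasoning, answer = model.GetRAC("Q: ...? ", "Reason briefly.", "Give probabilities.")
print(model.calls[1][0])  # the second request carries the first reply
```

Recording the calls makes it easy to confirm that the second request really contains the reasoning returned by the first.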






### Define Functions:

In [None]:
## Define Functions:
def init_models(models_dict,
                test_prompt = "Zdzisław Beksiński was",
                test_system = "You are a helpful assistant.", open = False):
  print('Initializing Closed Models:')
  models = []
  for model_type in models_dict:
      print(f'{model_type}:')
      api_key_name = models_dict[model_type]['api_key_name']
      api_key = userdata.get(api_key_name)
      print(f'  Key Name: {models_dict[model_type]["api_key_name"]}')
      for model_name in models_dict[model_type]['models']:
        # Instantiate the correct subclass based on model_type
        if model_type == 'GPT':
            my_model = GPTmodel(name = model_name, api_key = api_key)
            my_model.client()
        elif model_type == 'Claude':
            my_model = AnthropicModel(name = model_name, api_key = api_key)
            my_model.client()
        elif model_type == 'Gemini':
            my_model = GeminiModel(name = model_name, api_key = api_key)
            my_model.client()
        elif open:
            # OpenModel is abstract; LlamaModel is its concrete Hugging Face subclass
            my_model = LlamaModel(name = model_name, key = api_key)

        else:
            # Handle unexpected model types if necessary
            print(f"Warning: Unknown model type {model_type}. Skipping.")
            continue # Skip to the next model name if the type is unknown

        models.append(my_model)
        print(f'    {model_name}')


  print(f'Models Initialized: {len(models)}')
  print(f'Model locations:\n{models}')

  print('-'*42)
  print('Testing all closed models:')
  print(f'Test prompt: {test_prompt}')
  print(f'Test system: {test_system}')

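  for model in models:
    print(f'\nTesting model: {model.name}')
    if open:
      print(model.generate(test_prompt))
    else:
      # Closed models take the system message as a second argument
      print(model.generate(test_prompt, test_system))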
  return models

def format_df(df):

  ## %%%%%%%%%%%%%%
  ## I need to fix how formatting is done for some questions. As Daniel pointed out,
  ## some questions only have 4 options, not 5.
  ## %%%%%%%%%%%%%%

  ## Takes in a dataframe in the form:
  ## | Question Number | Question | Option A | Option B | ... | Correct Answer Letter |
  ## |     (Int)       |     (Str)     |  (Str)   |  (Str)   |     |       (Char)          |
  ##
  ## Returns a dataframe in the form:
  ## | Question Num | Full Prompt 1 | Full Prompt 2 |
  ## |    (Int)     |    (Str)      |    (Str)      |

  columns = df.columns
  num_options = columns.str.contains('Option').astype(int).sum()

  #----------------------------------------------------------------------------#
  ## Check if DF is formatted properly
  error_text = f'''Make sure dataframe is in following format:
  | Question Number | Question | Option A | Option B | ... | Correct Answer Letter |
  |     (Int)       |     (Str)     |  (Str)   |  (Str)   |     |       (Char)          |

  The current format of Dataframe is: {columns}
  '''
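  required_columns = ['Question Number', 'Question', 'Correct Answer Letter']
  missing_columns = [col for col in required_columns if col not in columns]
  if num_options < 2 or missing_columns:
    raise ValueError(error_text)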

  #----------------------------------------------------------------------------#
  ## Initialize Output dataframe:
  header = ['Question Num', 'Full Prompt 1', 'Full Prompt 2']
  output_df = pd.DataFrame(columns = header)

  #----------------------------------------------------------------------------#

  ## Format questions for benchmark
  letters = ['A', 'B', 'C', 'D', 'E']
  options = ['Option A', 'Option B', 'Option C', 'Option D', 'Option E']

  for i in range(len(df)):
    question = df['Question'][i]

    sys_prompt_temp1 = sys_prompt1
    sys_prompt_temp2 = sys_prompt2
    ## Reformat system prompts to fit the number of options in this question.
    ## A missing option is read from the CSV as NaN. Options are dropped from
    ## the end (E, then D, then C), stopping at the first one that is present.
    ## The replacements only change a prompt that contains these exact strings.
    shorter_choice_lists = {
        'E': ('(A, B, C, D, or E)', '(A, B, C, or D)'),
        'D': ('(A, B, C, or D)', '(A, B, or C)'),
        'C': ('(A, B, or C)', '(A or B)'),
    }
    for letter, (old_list, new_list) in shorter_choice_lists.items():
      column = f'Option {letter}'
      if column in df.columns and not pd.isna(df[column][i]):
        break
      placeholder = f'{letter}) ${{Option {letter}}}'
      sys_prompt_temp1 = sys_prompt_temp1.replace(old_list, new_list).replace(placeholder, '')
      sys_prompt_temp2 = sys_prompt_temp2.replace(old_list, new_list).replace(placeholder, '')


    option_text = df[options[:num_options]].iloc[i].to_list()
    ## Prompt for specific question
    new_prompt = sys_prompt_temp1.replace('${Question}', question)
    for j in range(num_options): ## This for loop allows for dynamic question amounts
        new_prompt = new_prompt.replace(f'${{Option {letters[j]}}}', str(option_text[j]))


    ## Add formatted prompts.
    ## Note that this is formatted to llama so changes may be needed down the line.
    prompts1 = (new_prompt.split('<Your concise reasoning here. Max 100 words>')[0]) ## Specific prompt for question

    prompts2 = (sys_prompt_temp2) ## Generic prompt for question confidence
    output_df.loc[i] = [df['Question Number'].iloc[i], prompts1, prompts2]

  return output_df

def test_models_sequential_by_question(df, models, debug=False, start = 0):
    """
    Tests a list of models on a given dataset sequentially,
    iterating through questions and then models for each question.
    Includes a debug mode to process at most 10 questions.

    Args:
        df (pd.DataFrame): The dataset containing questions and prompts.
        models (list): A list of initialized model objects.
        debug (bool): If True, only process 10 questions beginning at `start`.
        start (int): Positional index of the first question to process,
            used to resume an interrupted run.
    """
    # Note: results are reset even when resuming with start > 0, so each
    # model's CSV is rewritten with only the rows from this run.
    print("Clearing previous results for each model...")
    for model in models:
        model.results = pd.DataFrame(columns=['Question ID', 'Question', 'Answer', 'Reasoning', 'Error'])
        print(f"  Cleared results for {model.name}")
    print("Starting sequential testing (by question)...")

    # Determine the range of questions to process
    end = min(start + 10, len(df)) if debug else len(df)
    num_questions_to_process = max(end - start, 0)

    # Iterate over questions first
    for index, row in tqdm(df.iloc[start:end].iterrows(), total=num_questions_to_process, desc="Processing Questions"):
        question_num = row['Question Num']
        prompt = row['Full Prompt 1']
        # I don't love this implementation, but unless my logic is flawed in
        # some edge case it should work, and I don't want to rewrite GetRAC in
        # each subclass to take in the dataframe.
        global sys_prompt2
        sys_prompt2 = row['Full Prompt 2']

        print(f"\nProcessing Question {question_num}")

        # Iterate over models for the current question
        for model in models:

            try:
                print(f"  Testing with model: {model.name}")
                # Call GetRAC and add the result to the model's self.results
                reasoning, answer_confidence = model.GetRAC(prompt=prompt)

                # Add the results to the model's self.results DataFrame
                new_row = pd.DataFrame([{
                    'Question ID': question_num,
                    'Question': prompt,
                    'Answer': answer_confidence,
                    'Reasoning': reasoning,
                    'Error': False
                }])
                model.results = pd.concat([model.results, new_row], ignore_index=True)
                filename = f"{model.name.replace('/', '_').replace('-', '_')}_test_results.csv"
                model.results.to_csv(filename, index=False)
            except Exception as e:
                print(f"  Error testing {model.name} on Question {question_num}: {e}")
                # Optionally add an error entry to the results
                error_row = pd.DataFrame([{
                    'Question ID': question_num,
                    'Question': prompt,
                    'Answer': f"Error: {e}",
                    'Reasoning': f"Error: {e}",
                    'Error': True
                }])
                model.results = pd.concat([model.results, error_row], ignore_index=True)
                filename = f"{model.name.replace('/', '_').replace('-', '_')}_test_results.csv"
                model.results.to_csv(filename, index=False)
    print("\nSequential testing complete.")

    # After processing all questions, save the results for each model
    for model in models:
        filename = f"{model.name.replace('/', '_').replace('-', '_')}_test_results.csv"
        model.results.to_csv(filename, index=False)
        print(f"Results for {model.name} saved to '{filename}'")
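
`format_df` adapts the prompt to questions with fewer than five options by deleting placeholders from a fixed five-option template. As a point of comparison, here is a minimal sketch of the opposite approach: building the option lines and the "(A, B, C, or D)" text from whichever options are present. `build_prompt` and its wording are illustrative assumptions, not the notebook's implementation, and missing options are represented as `None` rather than the NaN values pandas produces.

```python
def build_prompt(question, options):
    """Build a prompt for a question with 2 to 5 options (missing ones are None)."""
    letters = "ABCDE"
    present = [(letter, text) for letter, text in zip(letters, options) if text is not None]
    if len(present) < 2:
        raise ValueError("need at least two options")

    names = [letter for letter, _ in present]
    # "A or B" for two options, "A, B, C, or D" style for three or more
    if len(names) == 2:
        choice_text = " or ".join(names)
    else:
        choice_text = ", ".join(names[:-1]) + ", or " + names[-1]

    lines = [f"Question: {question}", "Options:"]
    lines += [f"{letter}) {text}" for letter, text in present]
    lines.append(f"State your answer as one of ({choice_text}).")
    return "\n".join(lines)


prompt = build_prompt("Which seat is third?", ["Ann", "Bo", "Cy", "Di", None])
print(prompt)
```

Generating the lines avoids leaving blank lines or stale letters behind when an option is absent.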

### Model Playground:

In [None]:
## Playground for Llama
hf_llama_token = userdata.get('hf_llama_token')
test_name = 'meta-llama/Llama-3.1-8B-Instruct'
test_key = hf_llama_token
test_prompt = "Zdzisław Beksiński was"

test_model = LlamaModel(name = test_name, key = test_key)  # OpenModel is abstract and cannot be instantiated

In [None]:
print(test_model.generate(test_prompt))
test_model.GetTokens(test_prompt)

In [None]:
## Playground for GPT

gpt_4_key = userdata.get('gpt_api_key')
test_name = 'gpt-4'
test_key = gpt_4_key
test_prompt = "Zdzisław Beksiński was"
test_system = "You are a helpful assistant."

my_gpt = GPTmodel(name = test_name, api_key = test_key)
my_gpt.client()
my_gpt.generate(test_prompt, test_system)


In [None]:
## Playground for Claude

claude_key = userdata.get('claude_api_key')
test_name = 'claude-3-haiku-20240307'
test_key = claude_key
test_prompt = "Zdzisław Beksiński was"
test_system = "You are a helpful assistant."

my_claude = AnthropicModel(name = test_name, api_key = test_key)
my_claude.client()
my_claude.generate(test_prompt, test_system)

In [None]:
## Playground for Gemini

gemini_api_key = userdata.get('gemini_api_key') # Assuming you stored your key in Userdata
test_name = "gemini-2.0-flash" # Or another Gemini model name like 'gemini-1.5-flash'
test_key = gemini_api_key
test_prompt = "Zdzisław Beksiński was"
test_system = "You are a helpful assistant." # Optional system message

my_gemini = GeminiModel(name = test_name, api_key = test_key)
my_gemini.client()
my_gemini.generate(test_prompt, test_system)


### Run Benchmarking for Closed Models:

In [None]:
## Import Dataset
print('-' *42)
file_path = '/content/lsat_ar_test_formatted.csv'
print(f'Importing Dataset: {file_path}')
dataset = pd.read_csv(file_path)
dataset.head()

## Edit System Prompts
print('-' *42)
print('Editing System Prompts:')

sys_prompt1 = '''
Given the following question, analyze the options, and provide a concise reasoning for your selected answer. Your reasoning should not exceed 100 words. After your explanation, clearly state your answer by choosing one of the options listed (A, B, C, D, or E).

Question: ${Question}
Options:
A) ${Option A}
B) ${Option B}
C) ${Option C}
D) ${Option D}
E) ${Option E}

Please provide your reasoning first, limited to 100 words, and consider how certain you should be of your answer.
Reasoning: <Your concise reasoning here. Max 100 words>
'''
sys_prompt2 = '''
Based on the reasoning above, Provide the best answer and the likelihood that each option is correct from 0.0 to 1.0 in a JSON format. The probabilities should sum to 1.0. For example:

{
'A': <Probability choice A is correct. As a float from 0.0 to 1.0>,
'B': <Probability choice B is correct. As a float from 0.0 to 1.0>,
'C': <Probability choice C is correct. As a float from 0.0 to 1.0>,
'D': <Probability choice D is correct. As a float from 0.0 to 1.0>,
'E': <Probability choice E is correct. As a float from 0.0 to 1.0>,
'Answer': <Your answer choice here, as a single letter and nothing else.>
}

All options have a non-zero probability of being correct. No option should have a probability of 0 or 1.
Be modest about your certainty.  Do not provide any additional reasoning.

Response:

'''



print('System Prompts:')
print(f'  {sys_prompt1}')
print(f'  {sys_prompt2}')



## Format DF
print('-' *42)
print('Formatting Dataset:')
new_dataset = format_df(dataset)
print(' Successfully Formatted Dataset')
print('New Dataset:')
display(new_dataset.head())




## Initialize Models
print('-' *42)
print('Initializing Models:')
my_closed_models = {
    'GPT': {
        'api_key_name': 'gpt_api_key', # Name of the key to retrieve from userdata
        'models': [
            'gpt-4',
            'gpt-3.5-turbo'
        ]
    },
    'Claude': {
        'api_key_name': 'claude_api_key', # Name of the key to retrieve from userdata
        'models': [
            'claude-3-7-sonnet-20250219',
            'claude-3-haiku-20240307'
        ]
    },
    'Gemini': {
        'api_key_name': 'gemini_api_key', # Name of the key to retrieve from userdata
        'models': [
            'gemini-1.5-flash',
            'gemini-2.5-pro-preview-06-05'
        ]
    }
}

closed_models = init_models(my_closed_models)
print(' Successfully Initialized Models')

## Test Models on LSAT
print('-' *42)
print('Testing Models:')

test_models_sequential_by_question(new_dataset, closed_models, debug=False, start = 106)


In [None]:
## Make final version of data:
import re
import ast


def extract_json(string):
    error_dict = {
                'A': -1,
                'B': -1,
                'C': -1,
                'D': -1,
                'E': -1,
                'Answer': 'Error'
              }
    string = str(string).replace("```json", "").replace("```", "")
    match = re.search(r'({.*Answer.*})', string, re.DOTALL)
    if match:
      try:
        return ast.literal_eval(match.group(match.lastindex))
      except Exception as e:
          print('Error with answer string:')
          print(e)
          print(f'String:\n{string}')

          return error_dict

    else:
        return error_dict

def clean_data(file_location):
    model_data = pd.read_csv(file_location)

    new_model_data = model_data.copy().drop_duplicates(keep = 'first', subset = 'Question ID').reset_index(drop=True)

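    ## Look up the correct answer by question number rather than by row position,
    ## so partial or resumed result files still line up with the dataset.
    answer_key = dataset.drop_duplicates(subset='Question Number').set_index('Question Number')['Correct Answer Letter']
    new_model_data['Correct Answer Letter'] = new_model_data['Question ID'].map(answer_key)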



    ## Extract the JSON out of the raw answer
    new_model_data['AnswerJSON'] = new_model_data['Answer'].apply(extract_json)

    ## Get the stated probability
    for letter in ['A', 'B', 'C', 'D', 'E']:
        new_model_data[f'Stated Prob {letter}'] = new_model_data['AnswerJSON'].apply(lambda x, key=letter: x.get(key, None)).astype('float')
    new_model_data['Stated Answer'] = new_model_data['AnswerJSON'].apply(lambda x: x.get('Answer', None))

    new_model_data['Correct'] = new_model_data['Stated Answer'].str.strip() == new_model_data['Correct Answer Letter'].str.strip()
    new_model_data.drop(columns=['AnswerJSON'], inplace=True)
    return new_model_data


folder_path = '/content/combined_old' # Replace with your actual folder path
for filename in os.listdir(folder_path):
        if filename == '.ipynb_checkpoints': ## Ignore .ipynb_checkpoints
          continue
        # Process each file or directory
        print(filename)
        new_file_name = filename.replace('CB_', '').replace('.csv', '')

        full_file_path = os.path.join(folder_path, filename)
        new_data = clean_data(full_file_path)
        new_data.to_csv('/content/LSAT_' + new_file_name + '.csv' , index=False)
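
The second system prompt asks for probabilities that sum to 1.0 with no option at exactly 0 or 1, but the cleaning step above only extracts the values. A minimal sketch of how a parsed reply could also be checked against those two rules is shown below. `parse_reply`, `probabilities_ok` and the tolerance of 0.01 are assumptions for illustration and are not part of the notebook.

```python
import ast
import re

FENCE = "`" * 3  # markdown code fence, built indirectly to keep this example readable


def parse_reply(reply):
    """Return the dict embedded in a model reply, or None if none can be parsed."""
    text = str(reply).replace(FENCE + "json", "").replace(FENCE, "")
    match = re.search(r"\{.*Answer.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        parsed = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    return parsed if isinstance(parsed, dict) else None


def probabilities_ok(parsed, letters="ABCDE", tol=0.01):
    """True if each letter has a probability strictly between 0 and 1 and they sum to about 1."""
    probs = [parsed.get(letter) for letter in letters]
    if any(not isinstance(p, (int, float)) for p in probs):
        return False
    return all(0 < p < 1 for p in probs) and abs(sum(probs) - 1.0) <= tol


reply = FENCE + "json\n{'A': 0.1, 'B': 0.6, 'C': 0.1, 'D': 0.1, 'E': 0.1, 'Answer': 'B'}\n" + FENCE
parsed = parse_reply(reply)
print(parsed["Answer"], probabilities_ok(parsed))
```

A flag like this could be stored as an extra column so that malformed distributions are filtered or counted rather than silently mixed into the results.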




## Open Models:

### Define Classes


In [None]:
## Define Classes
max_tokens = 250


class OpenModel(ABC):
  @abstractmethod
  def __init__(self, name, key, MaxTokens = max_tokens):
    pass

  @abstractmethod
  def generate(self, prompt):
    pass

  @abstractmethod
  def GetTokens(self, prompt):
    pass

  @abstractmethod
  def GetRAC(self, prompt):
    pass

class LlamaModel(OpenModel): ## This class is built around Hugging Face methods
  def __init__(self, name, key, MaxTokens = 150):
    self.name = name
    self.key = key
    self.MaxTokens = MaxTokens
    print(f"Downloading Tokenizer for {self.name}")
    self.tokenizer = AutoTokenizer.from_pretrained(self.name,token = self.key) ## Import Tokenizer
    print(f"Downloading Model Weights for {self.name}")
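    ## Import Model. torch.uint8 is not a floating-point dtype for model weights, so,
    ## assuming 8-bit weights were intended, load them through BitsAndBytesConfig
    ## (requires the bitsandbytes package).
    self.model = AutoModelForCausalLM.from_pretrained(self.name, token = self.key, device_map="auto", quantization_config=BitsAndBytesConfig(load_in_8bit=True))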
    self.results = pd.DataFrame(columns = ['Question ID','Question', 'Answer', 'Reasoning', 'Token Probability', 'Error'])
    ## Make text generation pipeline
    self.pipeline = pipeline(
    "text-generation",
    model = self.model,
    tokenizer = self.tokenizer,
    do_sample = False,
    max_new_tokens = self.MaxTokens,
    eos_token_id = self.tokenizer.eos_token_id,
    pad_token_id = self.tokenizer.eos_token_id,
    device_map="auto",
    transformers_version="4.37.0",
    )


  def generate(self, prompt):
    ## NOTE: the chat-formatted new_prompt below is built but never sent; the raw
    ## prompt goes to the pipeline. GetRAC already wraps its prompt in the Llama
    ## chat template before calling generate().
    new_prompt = ("<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
                  + 'You are a helpful assistant. When prompted for a response give your reasoning and answer to the user. Signify your answer with this response:\n"Therefore, the correct answer is:\n<|eot_id|>"'
                  + "\n<|eot_id|><|start_header_id|>user<|end_header_id|>"
                  + prompt
                  + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>")
    return self.pipeline(prompt)[0]['generated_text'].replace(prompt, '')

  def GetTokens(self, prompt: str):
    ## Get Answer:
    batch = self.tokenizer(prompt, return_tensors= "pt").to(self.model.device)  ## Follow wherever device_map placed the model
    with torch.no_grad():
        outputs = self.model(**batch)
    ## Get Token Probabilites
    logits = outputs.logits

    ## Apply softmax to the logits to get probabilities
    probs = torch.softmax(logits[0, -1], dim=0)

    ##Get the top k token indices and their probabilities
    top_k_probs, top_k_indices = torch.topk(probs, 100, sorted =True)

    ## Convert token indices to tokens
    top_k_tokens = [self.tokenizer.decode([token_id]) for token_id in top_k_indices]

    ## Convert probabilities to list of floats
    top_k_probs = top_k_probs.tolist()                  #list of probabilities

    ## Create a Pandas Series with tokens as index and probabilities as values
    global logit_series
    logit_series = pd.Series(top_k_probs, index=top_k_tokens)

    ## Sort the series by values in descending order
    logit_series = logit_series.sort_values(ascending=False)
    ## Get the answer Letter
    target_tokens = [' A', ' B', ' C', ' D', ' E', 'A', 'B', 'C', 'D', 'E']

    only_target_tokens = logit_series[logit_series.index.isin(target_tokens)]
    best_answer = only_target_tokens.index[0]

    ## Format logit series
    logit_series.index.name = "Token"
    logit_series.name = "Probability"
    return str(logit_series.to_dict()), best_answer.strip()

  def GetRAC(self, prompt: str)-> tuple[str, str]: ## Get Reasoning Answer/Confidence
    ## Get the reasoning
    new_prompt = ("<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
                  + sys_prompt1
                  + "\n<|eot_id|><|start_header_id|>user<|end_header_id|>"
                  + prompt
                  + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>")
    reasoning = self.generate(new_prompt).replace(new_prompt, '')
    ## Get the answer and confidence
    answer_confidence = self.generate(new_prompt
                                      + reasoning
                                      + '<|eot_id|><|start_header_id|>user<|end_header_id|>'
                                      + sys_prompt2
                                      + '<|eot_id|><|start_header_id|>assistant<|end_header_id|>' )

    return reasoning, answer_confidence
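
`GetTokens` turns the logits for the next token into probabilities and then looks for answer letters among the most likely tokens, with and without a leading space. The sketch below shows that idea on a toy dictionary of logits so it runs without downloading a model. It is a variation rather than a copy of the method: it sums the probability of the `' A'` and `'A'` variants per letter, whereas `GetTokens` picks the single most likely target token. The function name and the logit values are made up.

```python
import math


def answer_letter_probs(logits, letters=("A", "B", "C", "D", "E")):
    """Softmax toy next-token logits, then pool ' A' and 'A' style variants per letter."""
    peak = max(logits.values())
    # Subtract the maximum before exponentiating for numerical stability
    exps = {token: math.exp(value - peak) for token, value in logits.items()}
    total = sum(exps.values())
    probs = {token: e / total for token, e in exps.items()}

    pooled = {letter: 0.0 for letter in letters}
    for token, p in probs.items():
        stripped = token.strip()
        if stripped in pooled:
            pooled[stripped] += p

    best = max(pooled, key=pooled.get)
    return pooled, best


# Hypothetical logits for the token that follows "{'Answer': '"
toy_logits = {" B": 3.0, "B": 1.0, " A": 2.0, "The": 0.5, " C": -1.0}
pooled, best = answer_letter_probs(toy_logits)
print(best, pooled)
```

The pooled values do not sum to 1 because some probability mass falls on tokens that are not answer letters, which is also true of the top-100 token table that `GetTokens` returns.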

### Define Functions:

In [None]:
sys_prompt2 = '''
Based on the reasoning above, Provide the best answer and the likelihood that each option is correct from 0.0 to 1.0 in a JSON format. The probabilities should sum to 1.0. For example:

{
'Answer': <Your answer choice here, as a single letter and nothing else>,
'A': <Probability choice A is correct. As a float from 0.0 to 1.0>,
'B': <Probability choice B is correct. As a float from 0.0 to 1.0>,
'C': <Probability choice C is correct. As a float from 0.0 to 1.0>,
'D': <Probability choice D is correct. As a float from 0.0 to 1.0>,
'E': <Probability choice E is correct. As a float from 0.0 to 1.0>
}<|eot_id|>

All options have a non-zero probability of being correct. No option should have a probability of 0 or 1.
Be modest about your certainty.  Do not provide any additional reasoning.

Response: '''
from tqdm.notebook import tqdm
def test_open_model_lsat(model, df, debug = False):
  model.results = pd.DataFrame(columns=['Question ID',
                                        'Question',
                                        'Stated Confidence',
                                        'Stated Answer',
                                        'Reasoning',
                                        'Error',
                                        'Token Probability'
                                        ])
  if debug:
    length = 3
  else:
    length = len(df)
  start = 0

  for i in tqdm(range(length), total=length, desc="Processing Questions"):
  #for i in range(length):
    try:
      ## Question Information
      question_num = df['Question Number'][i]
      question_text = df['Question'][i]
      reasoning_prompt = question_text + ' what?\nReasoning: '


      ## Get reasoning from model
      model_reasoning = model.generate(reasoning_prompt)
      ## Get Answer from model

      answer_prompt = (reasoning_prompt
                      + '<|eot_id|><|start_header_id|>assistant<|end_header_id|>'
                      + model_reasoning
                      + '\n<|eot_id|><|start_header_id|>user<|end_header_id|>'
                      + sys_prompt2
                      + '<|eot_id|><|start_header_id|>assistant<|end_header_id|>'
                      + "{'Answer': '"
                      )
      model_tokens, answer_letter = model.GetTokens(answer_prompt)

      ## Get JSON formatted answer and confidence
      answer_JSON_prompt = answer_prompt + answer_letter + "',"
      model_confidence_JSON = "{'Answer': '" + answer_letter + "'," + model.generate(answer_JSON_prompt)

      new_row = pd.DataFrame([{
          'Question ID': question_num,
          'Question': question_text,
          'Stated Confidence': model_confidence_JSON,
          'Stated Answer': answer_letter,
          'Reasoning': model_reasoning,
          'Error': False,
          'Token Probability': model_tokens,
          'Correct Answer Letter': df['Correct Answer Letter'][i]
      }])
    except Exception as e:
      new_row = pd.DataFrame([{
          'Question ID': question_num,
          'Question': question_text,
          'Stated Confidence': 'ERROR',
          'Stated Answer': 'ERROR',
          'Reasoning': str(e),  # store the error message rather than the exception object
          'Error': True,
          'Token Probability': 'ERROR',
          'Correct Answer Letter': df['Correct Answer Letter'][i]
      }])
    # DataFrame._append is a private pandas method; pd.concat is the public equivalent
    model.results = pd.concat([model.results, new_row], ignore_index=True)

  return model.results
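
The `Stated Confidence` column stores the raw string produced by the model. That string follows the single-quoted, dict-like layout requested in `sys_prompt2`, so it is not strict JSON. Below is a minimal sketch of how such a string might be parsed afterwards, assuming the model closes the braces and emits plain float literals. `parse_stated_confidence` and `CHOICES` are hypothetical names, not part of the notebook.

```python
import ast

CHOICES = ['A', 'B', 'C', 'D', 'E']  # assumed answer letters, taken from sys_prompt2

def parse_stated_confidence(raw):
    """Parse a "{'Answer': 'A', 'A': 0.5, ...}" string into (answer, probs).

    Returns (None, None) when the string cannot be parsed.
    """
    try:
        # Keep only the first {...} span in case the model kept generating
        start = raw.index('{')
        end = raw.index('}', start) + 1
        parsed = ast.literal_eval(raw[start:end])
        if not isinstance(parsed, dict):
            return None, None
        probs = {c: float(parsed[c]) for c in CHOICES if c in parsed}
    except (ValueError, SyntaxError, AttributeError, TypeError):
        return None, None
    return parsed.get('Answer'), probs
```

Rows that failed (where the column holds `'ERROR'`) come back as `(None, None)`, so they can be filtered out before any calibration analysis.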



#### Llama playground

In [None]:
## method without functions:
import json
import os
import random
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from google.colab import drive, userdata
from transformers import (AutoTokenizer,
                        AutoModelForCausalLM,
                        BitsAndBytesConfig,
                        pipeline)
from transformers.utils import logging

logging.set_verbosity_error()
warnings.filterwarnings('ignore')
drive.mount('/content/drive')


sys_prompt2 = '''
Based on the reasoning above, Provide the best answer and the likelihood that each option is correct from 0.0 to 1.0 in a JSON format. The probabilities should sum to 1.0. For example:

{
'Answer': <Your answer choice here, as a single letter and nothing else>,
'A': <Probability choice A is correct. As a float from 0.0 to 1.0>,
'B': <Probability choice B is correct. As a float from 0.0 to 1.0>,
'C': <Probability choice C is correct. As a float from 0.0 to 1.0>,
'D': <Probability choice D is correct. As a float from 0.0 to 1.0>,
'E': <Probability choice E is correct. As a float from 0.0 to 1.0>
}<|eot_id|>

All options have a non-zero probability of being correct. No option should have a probability of 0 or 1.
Be modest about your certainty.  Do not provide any additional reasoning.

Response: '''

## Get llama:

HF_TOKEN = userdata.get('hf_llama_token')
model_name = 'meta-llama/Llama-4-Scout-17B-16E-Instruct'


my_llama = LlamaModel(name = model_name, key = HF_TOKEN, MaxTokens = 250)


print('-' *42)
file_path = '/content/lsat_ar_test_formatted.csv'
print(f'Importing Dataset: {file_path}')
dataset = pd.read_csv(file_path)
dataset.head()


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Downloading Tokenizer for meta-llama/Llama-4-Scout-17B-16E-Instruct


Downloading Model Weights for meta-llama/Llama-4-Scout-17B-16E-Instruct


Fetching 50 files:   0%|          | 0/50 [00:00<?, ?it/s]

RuntimeError: Data processing error: CAS service error : IO Error: No space left on device (os error 28)
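
The download fails because the checkpoint does not fit on the runtime's disk: the 50 weight shards come to roughly 217 GB, judging by the sizes reported during the download. Below is a minimal sketch of a pre-flight check that could run before the download starts. `has_free_space` and `REQUIRED_BYTES` are hypothetical names, and the size is an estimate rather than an official figure.

```python
import shutil

def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes

# Assumed figure: ~217 GB, estimated from the shard sizes reported during the download
REQUIRED_BYTES = 217 * 10**9

if not has_free_space('.', REQUIRED_BYTES):
    print('Not enough free disk space for the full checkpoint; '
          'consider a smaller model or a larger disk.')
```

Checking before the download avoids spending time fetching shards that will be discarded when the disk fills up.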

In [None]:
sys_prompt2 = '''
Based on the reasoning above, Provide the best answer and the likelihood that each option is correct from 0.0 to 1.0 in a JSON format. The probabilities should sum to 1.0. For example:

{
'Answer': <Your answer choice here, as a single letter and nothing else>,
'A': <Probability choice A is correct. As a float from 0.0 to 1.0>,
'B': <Probability choice B is correct. As a float from 0.0 to 1.0>,
'C': <Probability choice C is correct. As a float from 0.0 to 1.0>,
'D': <Probability choice D is correct. As a float from 0.0 to 1.0>,
'E': <Probability choice E is correct. As a float from 0.0 to 1.0>
}<|eot_id|>

All options have a non-zero probability of being correct. No option should have a probability of 0 or 1.
Be modest about your certainty.  Do not provide any additional reasoning.

Response: '''
llama_results = test_open_model_lsat(my_llama, dataset, debug = False)
llama_results

Mounted at /content/drive


In [None]:

lsat_folder_path = '/content/drive/MyDrive/LSAT'


if not os.path.exists(lsat_folder_path):
    os.makedirs(lsat_folder_path)
    print(f"Created folder: {lsat_folder_path}")

# Define the full path for the CSV file
output_filename = f'LSAT_{model_name}'
output_filename = output_filename.replace('/', '_').replace('-','_').replace('.','_') + '.csv'
output_path = os.path.join(lsat_folder_path, output_filename)

# Save the DataFrame to the specified path
llama_results.to_csv(output_path, index=False)

print(f"Successfully saved llama_results to {output_path}")
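
Once the results are saved, a quick sanity check is to score the stated answers against the answer key. Below is a minimal sketch, assuming the column names written by `test_open_model_lsat` (`Stated Answer`, `Correct Answer Letter`, `Error`). `summarize_results` is a hypothetical helper that works on plain row dictionaries.

```python
def summarize_results(records):
    """Return (n_scored, n_errors, accuracy) for an iterable of result rows (dicts)."""
    n_scored = n_errors = n_correct = 0
    for row in records:
        if row['Error']:
            # Rows that raised during generation are counted but not scored
            n_errors += 1
            continue
        n_scored += 1
        stated = str(row['Stated Answer']).strip().upper()
        correct = str(row['Correct Answer Letter']).strip().upper()
        n_correct += stated == correct
    accuracy = n_correct / n_scored if n_scored else float('nan')
    return n_scored, n_errors, accuracy
```

With the DataFrame from this notebook, it could be called as `summarize_results(llama_results.to_dict('records'))`.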