# LLM ensembling

This simple ensemble method works as follows:
1. Train $N$ different models ($N=2$ in our case), each of which takes neural activity as input and outputs a single best candidate sentence.
2. Use a Large Language Model (LLM) to score the $N$ candidate sentences and choose the one with the highest LLM score.

We demonstrate step 2 in this notebook. Note that
- The ensembling step requires only the candidate sentences from each model as input.
- There are two different LLMs used in this method.
    - The first LLM is used for the language model decoding step within each model (OPT6.7B in our case).
    - The second LLM is used for ensembling (the Llama2-7B-chat model in our case).
- While we have not experimented with these model choices extensively, our initial experiments suggest that models that are optimized for conversations (such as Llama2 chat) might work well for the ensembling step (i.e. the second LLM).
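
The two steps above can be sketched for a general $N$ as follows. This is a minimal illustration rather than the notebook's own code: `pick_best` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for the LLM score computed later in this notebook.

```python
from typing import Callable, Sequence


def pick_best(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest score (the first one wins ties)."""
    return max(candidates, key=score_fn)


def toy_score(sentence: str) -> float:
    # Hypothetical stand-in for an LLM score:
    # subtract one point for every word outside a tiny vocabulary.
    vocab = {"i", "really", "would", "like", "to", "see", "them", "do", "well"}
    return -float(sum(word not in vocab for word in sentence.split()))


candidates = [
    "we would like to see them too well",       # candidate from model 1
    "i really would like to see them do well",  # candidate from model 2
]
best = pick_best(candidates, toy_score)
print(best)
```

In the rest of the notebook, the scoring function is the sentence log-probability under Llama2-7B-chat instead of a vocabulary check.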

In [1]:
import numpy as np
import os
import sys
import torch

In [2]:
os.environ['CURL_CA_BUNDLE'] = ''

In [3]:
parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.append(parent_dir)

from NeuralDecoder.neuralDecoder.utils.lmDecoderUtils import _cer_and_wer as cer_and_wer

Download the model and specify its path in `MODEL_CACHE_DIR` below. Llama-2-7b-chat-hf can be downloaded from Hugging Face [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).

In [4]:
# LLM path, we use Llama2-7b-chat here
MODEL_NAME = 'meta-llama/Llama-2-7b-chat-hf'
MODEL_CACHE_DIR = '/home/user/LLM/Llama-2-7b-chat'

# True test sentences
target_test_file = './samples/target_test.txt'
# Decoded test sentences from model 1 (11.71% test WER) and model 2 (12.19% test WER)
model1_test_file = './samples/model1_test.txt'
model2_test_file = './samples/model2_test.txt'

# Decoded competition sentences from the same two models (11.71% and 12.19% test WER)
model1_competition_file = './samples/model1_competition.txt'
model2_competition_file = './samples/model2_competition.txt'

# Output file
output_file = './samples/Llama2chat_ensemble.txt'

## Load an LLM for ensembling

We use the Llama2-7B-chat model since it is optimized for dialogue.

In [5]:
def build_llm(modelName=None, cacheDir=None, device='auto', load_in_8bit=False):
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(modelName, cache_dir=cacheDir)
    model = AutoModelForCausalLM.from_pretrained(modelName, cache_dir=cacheDir,
                                                 device_map=device, load_in_8bit=load_in_8bit)

    # The Llama 2 tokenizer has no pad token by default, so reuse EOS for padding
    tokenizer.padding_side = "right"
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer

In [6]:
llm_model, tokenizer = build_llm(modelName=MODEL_NAME, cacheDir=MODEL_CACHE_DIR)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

We've detected an older driver with an RTX 4000 series GPU. These drivers have issues with P2P. This can affect the multi-gpu inference when using accelerate device_map.Please make sure to update your driver to the latest version which resolves this.


## Test data

### Load test data

In [7]:
def load_sentence(file):
    sentence = []
    with open(file, 'r') as f:
        for s in f.readlines():
            s = s.strip('\n')
            sentence.append(s)
    return sentence

In [8]:
target_test = load_sentence(target_test_file)
model1_test = load_sentence(model1_test_file)
model2_test = load_sentence(model2_test_file)

### Calculate LLM score

In [9]:
def cal_llm_score(sentences):
    """Return the LLM log-probability of each sentence (higher means more likely)."""
    scores = []

    for sentence in sentences:
        # One sentence per forward pass, so the batch size B is 1
        inputs = tokenizer(sentence, return_tensors='pt', padding=True)
        with torch.no_grad():
            outputs = llm_model(**inputs)
            logProbs = torch.nn.functional.log_softmax(outputs['logits'].float(), -1).cpu().numpy()
        B, T, _ = logProbs.shape
        for i in range(B):
            n_tokens = int(inputs['attention_mask'][i].sum())
            newLMScore = 0.
            # Logits at position j - 1 predict token j, so start from j = 1
            for j in range(1, n_tokens):
                newLMScore += logProbs[i, j - 1, int(inputs['input_ids'][i, j])]
            # Append inside the batch loop so every sequence gets a score
            scores.append(newLMScore)

    return scores

In [10]:
model1_test_score = cal_llm_score(model1_test)
model2_test_score = cal_llm_score(model2_test)
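
The score computed by `cal_llm_score` is the sum of next-token log-probabilities, $\sum_{j=1}^{T-1} \log p(x_j \mid x_{<j})$, where $x_0$ is the first token produced by the tokenizer. The sketch below reproduces that sum on made-up logits with plain Python so the indexing is easy to follow. `log_softmax`, `sequence_log_prob`, and the toy numbers are illustrative assumptions, not part of the notebook.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_log_prob(logits: List[List[float]], token_ids: List[int]) -> float:
    """Sum of log p(token j | tokens before j) for j = 1 .. T-1.

    logits[t] holds the next-token logits after reading tokens 0..t,
    so the prediction for token j is read from row j - 1.
    """
    total = 0.0
    for j in range(1, len(token_ids)):
        total += log_softmax(logits[j - 1])[token_ids[j]]
    return total


# Toy 3-token sequence over a 2-word vocabulary (made-up numbers).
toy_logits = [[0.0, 0.0], [0.0, math.log(3.0)], [1.0, 1.0]]
toy_ids = [0, 1, 1]
score = sequence_log_prob(toy_logits, toy_ids)
print(score)  # equals log(0.5 * 0.75)
```

The last row of logits is never used, because it predicts a token that comes after the end of the sequence. Since the score is a sum over tokens, it is not normalized by sentence length.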

### Pick a sentence with the highest LLM score

In [11]:
def pick_sentence(sentence1, sentence2, score1, score2):
    """
    Args:
        sentence1 (list): decoded sentences using model1
        sentence2 (list): decoded sentences using model2
        score1 (list): LLM score for sentence1
        score2 (list): LLM score for sentence2

    Returns:
        decoded (list): sentences picked based on LLM score
        pick (list): source of each output: "model 1", "model 2", or " " when the two scores are equal (e.g. both models output the same sentence)
    """
    
    decoded = []
    pick = []

    assert len(sentence1)==len(sentence2)
    for s in range(len(sentence1)):
        if score1[s] > score2[s]:
            decoded.append(sentence1[s])
            pick.append("model 1")
        else:
            decoded.append(sentence2[s])
            if score1[s] == score2[s]:
                pick.append(" ")
            else:
                pick.append("model 2")

    return decoded, pick

In [12]:
decoded_test, pick_test = pick_sentence(model1_test, model2_test, model1_test_score, model2_test_score)

print("Test output")
print("#Outputs picked from model 1:", sum(np.array(pick_test)=="model 1"))
print("#Outputs picked from model 2:", sum(np.array(pick_test)=="model 2"))

Test output
#Outputs picked from model 1: 110
#Outputs picked from model 2: 91


### Example outputs

In [13]:
def format_output(index):
    separator = '-' * 100 + '\n'
    output = ''
    sentence_num = index + 1
    target_output = target_test[index]
    model_outputs = [
        {'model_name': 'Model 1', 'output': model1_test[index], 'score': model1_test_score[index]},
        {'model_name': 'Model 2', 'output': model2_test[index], 'score': model2_test_score[index]},
    ]
    final_output = decoded_test[index]
    output += separator
    output += f'Sentence: {sentence_num}\n'
    output += separator
    output += f'Target output : {target_output}\n'
    output += separator
    for model_output in model_outputs:
        model_name = model_output['model_name']
        text = model_output['output']
        score = model_output['score']
        output += f'{model_name} output: {text:<50} (LLM score: {score:.1f})\n'
    output += separator
    output += f'Final output  : {final_output}\n'
    output += separator
    output += '\n'
    print(output)

In [14]:
# An example where the model 1 output was correct and the LLM correctly chose model 1
format_output(151)

----------------------------------------------------------------------------------------------------
Sentence: 152
----------------------------------------------------------------------------------------------------
Target output : i'm away from my other son during those hours
----------------------------------------------------------------------------------------------------
Model 1 output: i'm away from my other son during those hours      (LLM score: -60.7)
Model 2 output: i'm really from my other son doing those hours     (LLM score: -78.7)
----------------------------------------------------------------------------------------------------
Final output  : i'm away from my other son during those hours
----------------------------------------------------------------------------------------------------




In [15]:
# An example where the model 2 output was correct and the LLM correctly chose model 2
format_output(330)

----------------------------------------------------------------------------------------------------
Sentence: 331
----------------------------------------------------------------------------------------------------
Target output : i really would like to see them do well
----------------------------------------------------------------------------------------------------
Model 1 output: we would like to see them too well                 (LLM score: -45.6)
Model 2 output: i really would like to see them do well            (LLM score: -39.6)
----------------------------------------------------------------------------------------------------
Final output  : i really would like to see them do well
----------------------------------------------------------------------------------------------------




In [16]:
# An example where neither model was correct, but the LLM chose the sentence with the lower WER
format_output(270)

----------------------------------------------------------------------------------------------------
Sentence: 271
----------------------------------------------------------------------------------------------------
Target output : can you speak more about this system of treasury decentralization
----------------------------------------------------------------------------------------------------
Model 1 output: then you make more about this system of treasury us inflation (LLM score: -83.0)
Model 2 output: then you pick more about this system of treasury decentralization (LLM score: -74.6)
----------------------------------------------------------------------------------------------------
Final output  : then you pick more about this system of treasury decentralization
----------------------------------------------------------------------------------------------------




In [17]:
# An example where model 1 was correct but the LLM chose model 2 (ensembling led to a worse result)
format_output(337)

----------------------------------------------------------------------------------------------------
Sentence: 338
----------------------------------------------------------------------------------------------------
Target output : anything like that we participated
----------------------------------------------------------------------------------------------------
Model 1 output: anything like that we participated                 (LLM score: -38.6)
Model 2 output: anything like that we must abide                   (LLM score: -38.4)
----------------------------------------------------------------------------------------------------
Final output  : anything like that we must abide
----------------------------------------------------------------------------------------------------




### Evaluation

In [18]:
_, wer1 = cer_and_wer(model1_test, target_test, outputType='speech_sil', returnCI=False)
_, wer2 = cer_and_wer(model2_test, target_test, outputType='speech_sil', returnCI=False)
_, ensemble_wer = cer_and_wer(decoded_test, target_test, outputType='speech_sil', returnCI=False)

In [19]:
print(f"WER before LLM ensembling: {(wer1*100):.2f}% (model 1), {(wer2*100):.2f}% (model 2)")
print(f"WER after LLM ensembling: {(ensemble_wer*100):.2f}%")

WER before LLM ensembling: 11.71% (model 1), 12.19% (model 2)
WER after LLM ensembling: 11.30%
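
For readers without access to the `NeuralDecoder` utilities, the following is a minimal sketch of a corpus-level word error rate: total word-level edit distance divided by the total number of reference words. It assumes the standard WER definition; the imported `_cer_and_wer` may preprocess text or aggregate differently, and `word_edit_distance` and `corpus_wer` are hypothetical names.

```python
from typing import List, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # drop a reference word
                curr[j - 1] + 1,         # add a hypothesis word
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def corpus_wer(refs: List[str], hyps: List[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = sum(word_edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    n_words = sum(len(r.split()) for r in refs)
    return errors / n_words


# Sentence 331 from the examples above: target vs. the model 1 output.
refs = ["i really would like to see them do well"]
hyps = ["we would like to see them too well"]
wer = corpus_wer(refs, hyps)
print(f"{wer * 100:.2f}%")
```

Here the model 1 output has three word errors against a nine-word target, so the WER of this single sentence is 3/9.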


## Competition data

### Load competition data

In [20]:
model1_competition = load_sentence(model1_competition_file)
model2_competition = load_sentence(model2_competition_file)

### Calculate LLM score

In [21]:
model1_competition_score = cal_llm_score(model1_competition)
model2_competition_score = cal_llm_score(model2_competition)

### Pick a sentence with the highest LLM score

In [22]:
decoded_competition, pick_competition = pick_sentence(model1_competition, model2_competition, model1_competition_score, model2_competition_score)


print("Competition output")
print("#Outputs picked from model 1:", sum(np.array(pick_competition)=="model 1"))
print("#Outputs picked from model 2:", sum(np.array(pick_competition)=="model 2"))

Competition output
#Outputs picked from model 1: 186
#Outputs picked from model 2: 139


### Save the result

In [23]:
with open(output_file, 'w') as f:
    for sentence in decoded_competition:
        f.write(sentence + '\n')