# LLM-Blender usage examples

## Loading blender (quick start)

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import llm_blender
blender = llm_blender.Blender()
# blender.loadranker("llm-blender/pair-ranker") # load ranker checkpoint
blender.loadranker("OpenAssistant/reward-model-deberta-v3-large-v2") # load ranker checkpoint
# blender.loadfuser("llm-blender/gen_fuser_3b") # load fuser checkpoint if you want to use pre-trained fuser; or you can use ranker only

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 78.92it/s]



Using Other model


## (Or) Loading blender (detailed config)

In [2]:
import llm_blender
ranker_config = llm_blender.RankerConfig()
ranker_config.ranker_type = "pairranker" # only supports pairranker now.
ranker_config.model_type = "deberta"
ranker_config.model_name = "microsoft/deberta-v3-large" # ranker backbone
ranker_config.load_checkpoint = "llm-blender/pair-ranker" # hugging face hub model path or your local ranker checkpoint <your checkpoint path>
ranker_config.cache_dir = "./hf_models" # hugging face model cache dir
ranker_config.source_maxlength = 128
ranker_config.candidate_maxlength = 128
ranker_config.n_tasks = 1 # number of signals used to train the ranker. This checkpoint was trained with BARTScore only, hence 1.
fuser_config = llm_blender.GenFuserConfig()
fuser_config.model_name = "llm-blender/gen_fuser_3b" # our pre-trained fuser
fuser_config.cache_dir = "./hf_models"
fuser_config.max_length = 1024
fuser_config.candidate_maxlength = 128
blender_config = llm_blender.BlenderConfig()
blender_config.device = "cuda" # blender ranker and fuser device
blender = llm_blender.Blender(blender_config, ranker_config, fuser_config)



[2023-10-22 22:52:42,511] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Using DeBERTa model


Downloading (…)d383a67230/README.md: 100%|██████████| 5.87k/5.87k [00:00<00:00, 353kB/s]
Fetching 12 files: 100%|██████████| 12/12 [00:00<00:00, 22.19it/s]


## Using LLM-Blender for ranking
With the `rank` function, LLM-Blender ranks the candidates through pairwise comparisons and returns the ranks. We also show how closely the ranker's ranks correlate with ChatGPT's ranks.
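To make the idea of ranking through pairwise comparisons concrete, here is a minimal sketch in plain Python. It assumes a hypothetical `prefer(a, b)` function that returns `True` when candidate `a` is judged better than `b`, and it aggregates the comparisons by counting wins. This is an illustration only; the aggregation LLM-Blender actually uses may differ.

```python
from itertools import combinations

def ranks_from_pairwise(candidates, prefer):
    """Rank candidates by their number of pairwise wins (1 = best).

    `prefer(a, b)` is a hypothetical stand-in for a pairwise ranker; it
    returns True when `a` is judged better than `b`.
    """
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if prefer(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Sort indices by descending win count; ties keep their original order.
    order = sorted(range(len(candidates)), key=lambda k: -wins[k])
    ranks = [0] * len(candidates)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks

# Toy comparison for illustration: the longer string wins.
toy_ranks = ranks_from_pairwise(["bb", "a", "cccc", "ddd"],
                                lambda a, b: len(a) > len(b))
print(toy_ranks)  # [3, 4, 1, 2]
```

The returned list follows the same layout as the `ranks` printed later in this notebook: position `j` holds the rank of candidate `j`.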

In [2]:
import datasets
import json
from llm_blender.gpt_eval.cor_eval import COR_MAPS
from llm_blender.gpt_eval.utils import get_ranks_from_chatgpt_cmps
mixinstruct_test = datasets.load_dataset("llm-blender/mix-instruct", split="test", streaming=True)
few_examples = list(mixinstruct_test.take(8))
# parse cmp_results and drop examples that have no ChatGPT comparison results
for ex in few_examples:
    ex['cmp_results'] = json.loads(ex['cmp_results'])
few_examples = [x for x in few_examples if x['cmp_results']]
insts = [x['instruction'] for x in few_examples]
inputs = [x['input'] for x in few_examples]
candidates_texts = [[cand['text'] for cand in x['candidates']] for x in few_examples]
print("Example:")
print("Instruction 1:\n", insts[0])
print("Input 1:\n", inputs[0])
print("Candidate 1 for input 1:\n")
print(candidates_texts[0][0])


Example:
Instruction 1:
 
Input 1:
 I've always wondered what the difference is between a skeptic and a denier.
Candidate 1 for input 1:

A skeptic is someone who doubts or expresses doubt about a claim or idea without being dismissive of it. They are open-minded and approach evidence with an open mind, searching for reasonable explanations and evidence to support their beliefs.

A denier, on the other hand, is someone who actively works to deny or ignore evidence that contradicts their beliefs. They are often characterized by a closed mind and an unwillingness to consider alternative perspectives. They may also use rhetoric or false claims to try to discredit the evidence.


In [3]:
ranks = blender.rank(inputs, candidates_texts, instructions=insts, return_scores=False, batch_size=2)

Ranking candidates: 100%|██████████| 4/4 [00:00<00:00,  5.06it/s]


In [4]:
print("Ranks for input 1:", ranks[0]) # ranks of candidates for input 1
# Ranks for input 1: [ 1 11  4  9 12  5  2  8  6  3 10  7]

Ranks for input 1: [ 1 11  4  9 12  5  2  8  6  3 10  7]


In [5]:
import numpy as np
llm_ranks_map, gpt_cmp_results = get_ranks_from_chatgpt_cmps(few_examples)
gpt_ranks = np.array(list(llm_ranks_map.values())).T
print("Correlation with ChatGPT")
print("------------------------")
for cor_name, cor_func in COR_MAPS.items():
    print(cor_name, cor_func(ranks, gpt_ranks))

Correlation with ChatGPT
------------------------
pearson 0.5611269554680965
spearman 0.39947955756051595
spearman_footrule 24.25
set_based 0.6543252465127465
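As a rough guide to reading these numbers, the sketch below computes Pearson, Spearman, and Spearman footrule for a single pair of rank vectors using only the standard library. The function names and the toy data are assumptions for illustration; the functions in `COR_MAPS` may handle ties, batching, and averaging differently.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(rank_x, rank_y):
    """For tie-free rank vectors, Spearman's rho is Pearson on the ranks."""
    return pearson(rank_x, rank_y)

def spearman_footrule(rank_x, rank_y):
    """Sum of absolute rank differences: a distance, so lower is better."""
    return sum(abs(a - b) for a, b in zip(rank_x, rank_y))

# Toy rank vectors (not taken from the dataset).
model_ranks = [1, 2, 3, 4]
reference_ranks = [1, 3, 2, 4]
rho = spearman(model_ranks, reference_ranks)
footrule = spearman_footrule(model_ranks, reference_ranks)
print(round(rho, 4), footrule)  # 0.8 2
```

Note that the first two metrics are correlations (higher means closer agreement), whereas the footrule is a distance.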


## Using LLM-Blender to directly compare two candidates

In [6]:
candidates_A = [x['candidates'][0]['text'] for x in few_examples]
candidates_B = [x['candidates'][1]['text'] for x in few_examples]
comparison_results = blender.compare(inputs, candidates_A, candidates_B, instructions=insts, batch_size=2)

Ranking candidates: 100%|██████████| 4/4 [00:00<00:00, 13.53it/s]


In [7]:
print("comparison_results:", comparison_results)
# whether candidate A is better than candidate B for each input

comparison_results: [ True  True False  True False  True  True  True]
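One simple way to summarize the boolean results is a win rate for candidate A. The sketch below copies the values printed above into a plain Python list; in the notebook, `comparison_results` appears to be a NumPy boolean array, for which `sum` and `len` work the same way.

```python
# Values copied from the output above (True means A is preferred over B).
printed_results = [True, True, False, True, False, True, True, True]

wins_a = sum(printed_results)             # True counts as 1
win_rate_a = wins_a / len(printed_results)
print(f"A preferred in {wins_a}/{len(printed_results)} inputs ({win_rate_a:.0%})")
# A preferred in 6/8 inputs (75%)
```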


## Using LLM-Blender for fused generation
We show that fusing the top-ranked candidates selected by the ranker can produce higher-quality outputs.

In [8]:
from llm_blender.blender.blender_utils import get_topk_candidates_from_ranks
topk_candidates = get_topk_candidates_from_ranks(ranks, candidates_texts, top_k=3)
fuse_generations = blender.fuse(inputs, topk_candidates, instructions=insts, batch_size=2)
print("fuse_generations for input 1:", fuse_generations[0])

Fusing candidates: 100%|██████████| 4/4 [00:28<00:00,  7.04s/it]

fuse_generations for input 1: A skeptic is someone who questions the validity of a claim or idea, while a denier is someone who dismisses or ignores evidence that contradicts their beliefs. Skeptics approach claims with an open mind and seek evidence to support or refute them, while denier's are more likely to dismiss or ignore evidence that contradicts their beliefs.
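The top-k selection step can be pictured with the sketch below. `topk_by_rank` is a hypothetical helper written for illustration, not the library's `get_topk_candidates_from_ranks`; it assumes that rank 1 denotes the best candidate, which matches the ranks printed earlier.

```python
def topk_by_rank(ranks, candidates, top_k=3):
    """Hypothetical helper: keep the top_k best-ranked candidates per input.

    Assumes ranks[i][j] is the rank of candidates[i][j] and 1 is the best.
    """
    selected = []
    for example_ranks, example_cands in zip(ranks, candidates):
        # Candidate indices ordered from best rank to worst rank.
        order = sorted(range(len(example_cands)), key=lambda j: example_ranks[j])
        selected.append([example_cands[j] for j in order[:top_k]])
    return selected

# Ranks copied from "Ranks for input 1" above; candidate names are placeholders.
toy_ranks = [[1, 11, 4, 9, 12, 5, 2, 8, 6, 3, 10, 7]]
toy_candidates = [[f"cand_{j}" for j in range(12)]]
top3 = topk_by_rank(toy_ranks, toy_candidates, top_k=3)
print(top3)  # [['cand_0', 'cand_6', 'cand_9']]
```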





In [9]:
# Or rank and fuse in a single call
fuse_generations, ranks = blender.rank_and_fuse(inputs, candidates_texts, instructions=insts, return_scores=False, batch_size=2, top_k=3)

Ranking candidates: 100%|██████████| 4/4 [00:15<00:00,  4.00s/it]
Fusing candidates: 100%|██████████| 4/4 [00:28<00:00,  7.12s/it]


In [10]:
from llm_blender.common.evaluation import overall_eval
metrics = ['bartscore']
targets = [x['output'] for x in few_examples]
scores = overall_eval(fuse_generations, targets, metrics)

print("Fusion Scores")
for key, value in scores.items():
    print("  ", key+":", np.mean(value))

print("LLM Scores")
llms = [x['model'] for x in few_examples[0]['candidates']]
llm_scores_map = {llm: {metric: [] for metric in metrics} for llm in llms}
for ex in few_examples:
    for cand in ex['candidates']:
        for metric in metrics:
            llm_scores_map[cand['model']][metric].append(cand['scores'][metric])
for i, (llm, scores_map) in enumerate(llm_scores_map.items()):
    print(f"{i} {llm}")
    for metric, llm_scores in scores_map.items():
        print("  ", metric+":", "{:.4f}".format(np.mean(llm_scores)))


Evaluating bartscore: 100%|██████████| 8/8 [00:00<00:00, 50.29it/s]

Fusion Scores
   bartscore: -3.4856293499469757
LLM Scores
0 oasst-sft-4-pythia-12b-epoch-3.5
   bartscore: -3.8071
1 koala-7B-HF
   bartscore: -4.5505
2 alpaca-native
   bartscore: -4.2063
3 llama-7b-hf-baize-lora-bf16
   bartscore: -3.9364
4 flan-t5-xxl
   bartscore: -4.9341
5 stablelm-tuned-alpha-7b
   bartscore: -4.4329
6 vicuna-13b-1.1
   bartscore: -4.2022
7 dolly-v2-12b
   bartscore: -4.4400
8 moss-moon-003-sft
   bartscore: -3.5876
9 chatglm-6b
   bartscore: -3.7075
10 mpt-7b
   bartscore: -4.1353
11 mpt-7b-instruct
   bartscore: -4.2827
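To read these numbers at a glance, the sketch below compares the fused BARTScore with each single-model mean, using values copied (and rounded) from the output above. BARTScore is a log-likelihood, so higher (less negative) is better. The variable names are illustrative, and with only 8 examples the comparison is a small-sample illustration rather than a benchmark.

```python
# Mean BARTScore values copied from the output above.
fusion_bartscore = -3.4856
llm_bartscores = {
    "oasst-sft-4-pythia-12b-epoch-3.5": -3.8071,
    "koala-7B-HF": -4.5505,
    "alpaca-native": -4.2063,
    "llama-7b-hf-baize-lora-bf16": -3.9364,
    "flan-t5-xxl": -4.9341,
    "stablelm-tuned-alpha-7b": -4.4329,
    "vicuna-13b-1.1": -4.2022,
    "dolly-v2-12b": -4.4400,
    "moss-moon-003-sft": -3.5876,
    "chatglm-6b": -3.7075,
    "mpt-7b": -4.1353,
    "mpt-7b-instruct": -4.2827,
}

# Best single model and how many models the fused output outscores.
best_llm = max(llm_bartscores, key=llm_bartscores.get)
margin = fusion_bartscore - llm_bartscores[best_llm]
beaten = [name for name, score in llm_bartscores.items() if fusion_bartscore > score]
print(f"best single LLM: {best_llm} ({llm_bartscores[best_llm]:.4f})")
print(f"fusion margin over best: {margin:.4f}; beats {len(beaten)}/{len(llm_bartscores)} LLMs")
```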



