<a href="https://colab.research.google.com/github/akashmathur-2212/LLMs-playground/blob/main/LlamaIndex-applications/Advanced-RAG/reranker_models_evaluation/LLM_rerankers_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reranker models

This notebook demonstrates two-stage retrieval. First, use `embedding-based` retrieval with a high `top-k` value
to maximize recall and collect a large set of candidate nodes. Then, use a reranker model
to select, for each query, the nodes that are actually relevant.
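
The two-stage idea can be sketched without any external library. The snippet below is only an illustration under stated assumptions: the bag-of-words cosine similarity stands in for a real embedding model, the Jaccard overlap stands in for a real reranker, and the names `first_stage`, `second_stage`, and the sample documents are hypothetical and not part of this notebook.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (toy stand-in for a dense embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def first_stage(query, docs, top_k):
    """Recall-oriented pass: rank every document by vector similarity."""
    q = bow(query)
    return sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)[:top_k]


def second_stage(query, candidates, top_n):
    """Precision-oriented pass: rescore only the candidates (toy Jaccard scorer)."""
    q_terms = set(query.lower().split())

    def score(doc):
        d_terms = set(doc.lower().split())
        return len(q_terms & d_terms) / len(q_terms | d_terms)

    return sorted(candidates, key=score, reverse=True)[:top_n]


docs = [
    "instruction tuning is a fine-tuning method for language models",
    "the weather today is sunny and warm",
    "fine-tuning methods include transfer learning and alignment tuning",
    "benchmarks for evaluating language models",
]
query = "fine-tuning methods for language models"

candidates = first_stage(query, docs, top_k=3)   # wide net
final = second_stage(query, candidates, top_n=1)  # narrow selection
print(final)
```

The first pass scores every document cheaply, while the second pass runs the costlier scorer on the short candidate list only, which is why a high `top-k` in stage one stays affordable.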

In [1]:
!pip install -qqq llama-index llama-hub cohere langchain openai accelerate==0.21.0 bitsandbytes==0.40.2 transformers sentence_transformers InstructorEmbedding

In [12]:
pip install llama-index-postprocessor-cohere-rerank

Collecting llama-index-postprocessor-cohere-rerank
  Downloading llama_index_postprocessor_cohere_rerank-0.1.2-py3-none-any.whl.metadata (720 bytes)
Downloading llama_index_postprocessor_cohere_rerank-0.1.2-py3-none-any.whl (2.7 kB)
Installing collected packages: llama-index-postprocessor-cohere-rerank
Successfully installed llama-index-postprocessor-cohere-rerank-0.1.2
Note: you may need to restart the kernel to use updated packages.


In [32]:
pip install llama-index-embeddings-huggingface

Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.1.4-py3-none-any.whl.metadata (806 bytes)
Downloading llama_index_embeddings_huggingface-0.1.4-py3-none-any.whl (7.7 kB)
Installing collected packages: llama-index-embeddings-huggingface
Successfully installed llama-index-embeddings-huggingface-0.1.4
Note: you may need to restart the kernel to use updated packages.


In [25]:
pip install llama-index-llms-ollama

Collecting llama-index-llms-ollama
  Downloading llama_index_llms_ollama-0.1.2-py3-none-any.whl.metadata (636 bytes)
Downloading llama_index_llms_ollama-0.1.2-py3-none-any.whl (3.2 kB)
Installing collected packages: llama-index-llms-ollama
Successfully installed llama-index-llms-ollama-0.1.2
Note: you may need to restart the kernel to use updated packages.


In [2]:
from pathlib import Path
import pandas as pd
from llama_index.core.prompts import PromptTemplate
from llama_index.core import download_loader, VectorStoreIndex, ServiceContext
from langchain.embeddings import HuggingFaceInstructEmbeddings

from IPython.display import display, HTML
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.schema import QueryBundle
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core.indices.postprocessor import SentenceTransformerRerank
from llama_index.llms.ollama import Ollama

# Setup

1. In this section we work with the LLM overview paper and create an initial set of nodes (chunk size 1000, as set in the node parser cell).
2. We use an open-source LLM served through Ollama.
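
To make the notion of a chunk size concrete, here is a minimal sketch of fixed-size chunking. It is a simplification and not how the node parser used in this notebook works internally: it counts whitespace-separated words rather than tokens and ignores sentence boundaries. The function name `chunk_words` and its `overlap` parameter are hypothetical.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words. Words are whitespace-separated,
    which is an assumption made for this sketch only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Smaller chunks give the retriever more, finer-grained candidates; larger chunks keep more context per node but make each similarity score less specific.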

# Load Data

In [3]:
PDFReader = download_loader("PDFReader")
loader = PDFReader()
docs = loader.load_data(file=Path("data/LLM_Overview_Paper.pdf"))

In [4]:
from llama_index.core.node_parser import SimpleNodeParser
node_parser = SimpleNodeParser.from_defaults(chunk_size=1000)
nodes = node_parser.get_nodes_from_documents(docs)

In [24]:
len(nodes)

119

# Models

## LLM - Ollama

In [21]:
cohere_api_key = "REPLACE WITH YOUR COHERE_API_KEY"

llm = Ollama(model="mistral")

## Embedding Model

In [19]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name='BAAI/bge-small-en-v1.5')


## Configure Index and Retriever

In [20]:
# ServiceContext
service_context = ServiceContext.from_defaults(llm=llm,
                                               embed_model=embed_model
                                               )

# index
vector_index = VectorStoreIndex(
    nodes, service_context=service_context
)

# configure retriever
retriever = VectorIndexRetriever(
    index=vector_index,
    similarity_top_k=10,
    service_context=service_context)

## Initialize Re-rankers

In [22]:
# Define the rerankers to compare; the "None" string marks the no-reranker baseline
RERANKERS = {
    "WithoutReranker": "None",
    "CohereRerank": CohereRerank(api_key=cohere_api_key, top_n=5),
    "bge-reranker-base": SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=5),
    "bge-reranker-large": SentenceTransformerRerank(model="BAAI/bge-reranker-large", top_n=5)
}
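
The comparison loop only relies on each reranker exposing a `postprocess_nodes` call that returns at most `top_n` rescored nodes. The sketch below mimics that call shape with a toy keyword scorer so the flow can be run offline. `ToyNode` and `KeywordReranker` are hypothetical stand-ins, the query is passed as a plain string instead of a query bundle, and the baseline is marked with `None` rather than the `"None"` string used above.

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    """Hypothetical stand-in for a retrieved node with a score."""
    text: str
    score: float


class KeywordReranker:
    """Toy reranker: scores nodes by the number of query terms they contain."""

    def __init__(self, top_n):
        self.top_n = top_n

    def postprocess_nodes(self, nodes, query):
        terms = set(query.lower().split())
        rescored = [
            ToyNode(n.text, float(len(terms & set(n.text.lower().split()))))
            for n in nodes
        ]
        rescored.sort(key=lambda n: n.score, reverse=True)
        return rescored[: self.top_n]  # keep only the best top_n nodes


rerankers = {"WithoutReranker": None, "keyword": KeywordReranker(top_n=2)}

candidates = [
    ToyNode("retrieval with embeddings", 0.9),
    ToyNode("fine-tuning methods overview", 0.8),
    ToyNode("instruction tuning is a fine-tuning method", 0.7),
]
query = "fine-tuning methods"

results = {}
for name, reranker in rerankers.items():
    # The baseline keeps the retriever's order; rerankers rescore and truncate.
    results[name] = (
        candidates if reranker is None else reranker.postprocess_nodes(candidates, query)
    )
    print(name, [n.text for n in results[name]])
```

Note that the reranker replaces the retriever's similarity score with its own, so scores from different rerankers are not directly comparable; only the resulting order is.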

## Retrieval Comparisons

In [36]:
import pandas as pd
from IPython.display import display, HTML
from copy import deepcopy

# QueryBundle, retriever, and RERANKERS are defined in earlier cells

def retrieve_and_process_nodes(query_str, reranker):
    """Retrieve candidates for the query, then rerank them unless reranker is "None"."""
    query_bundle = QueryBundle(query_str)
    retrieved_nodes = retriever.retrieve(query_bundle)
    # The "None" string is the no-reranker baseline entry in RERANKERS.
    if reranker != "None":
        retrieved_nodes = reranker.postprocess_nodes(retrieved_nodes, query_bundle)
    return retrieved_nodes

def get_retrieved_nodes(query_str, reranker):
    """Thin alias for retrieve_and_process_nodes."""
    return retrieve_and_process_nodes(query_str, reranker)

def pretty_print(df):
    """Render a DataFrame as HTML without row or column truncation."""
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    display(HTML(df.to_html().replace("\\n", "<br>")))

def visualize_retrieved_nodes(nodes) -> None:
    """Show each retrieved node's score and text (newlines flattened) in a table."""
    result_dicts = [{"Score": node.score, "Text": node.node.get_text().replace("\n", " ")} for node in nodes]
    pretty_print(pd.DataFrame(result_dicts))
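
Reading the tables side by side makes it hard to see how far a reranker moved each node. One possible complement, sketched here under assumptions, is to compare positions programmatically. The helper `rank_shifts` is hypothetical and works on plain identifier strings; how those identifiers are taken from the retrieved nodes is left open.

```python
def rank_shifts(baseline_ids, reranked_ids):
    """Return (node_id, old_rank, new_rank) for every reranked node.

    Ranks are 0-based. old_rank is None when the node is absent from the
    baseline list.
    """
    baseline_rank = {node_id: i for i, node_id in enumerate(baseline_ids)}
    return [
        (node_id, baseline_rank.get(node_id), new_rank)
        for new_rank, node_id in enumerate(reranked_ids)
    ]


# Illustrative identifiers only, not taken from the notebook's output.
baseline = ["a", "b", "c", "d", "e"]
reranked = ["d", "a", "c"]

shifts = rank_shifts(baseline, reranked)
for node_id, old_rank, new_rank in shifts:
    print(f"{node_id}: {old_rank} -> {new_rank}")
```

A node that jumps from a low baseline position to the top of the reranked list is a sign that the embedding similarity alone under-ranked it for this query.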

In [37]:
query_str = "What are theLLM finetuning Methods mentionned in the paper?"

for rerank_name, reranker in RERANKERS.items():
    print(f"Running Evaluation for Reranker: {rerank_name}")
    retrieved_nodes = retrieve_and_process_nodes(query_str, reranker)
    print(f"Visualize Retrieved Nodes for Reranker: {rerank_name}")
    visualize_retrieved_nodes(retrieved_nodes)

Running Evaluation for Reranker: WithoutReranker
Visualize Retrieved Nodes for Reranker: WithoutReranker


Unnamed: 0,Score,Text
0,0.715059,"10: This example illustrates the PanGu-Parchitecture, as depicted in the image sourced from [128].B. Fine-Tuned LLMs Pre-trained LLMs have excellent generalization abilities to unseen tasks. However, because they are generally trained with the objective of next token prediction, LLMs have limited capacity to follow user intent and are prone to generate unethical, toxic or inaccurate responses [20]. For their effective utilization, LLMs are fine-tuned to follow instructions [16], [17], [92] and generate safe responses [20], which also results in increasing zero-shot, few-shot, and cross-task generaliza- tion [92], [16], [18], with minimal compute increment, e.g., 0.2% of the total pre-training for PaLM 540B [16]. We review various fine-tuned LLMs and strategies for effective fine-tuning in this section. 1. Instruction-Tuning with Manually Created Datasets: Numerous hand-crafted instruction-tuning datasets with different design choices are proposed in the literature to instruction-tune LLMs. The performance of fine-tuned LLMs depends on multiple factors, such as dataset, instruction diversity, prompting templates, model size, and training"
1,0.705562,"PREPRINT 15 4. Continue Pre-Training: Although fine-tuning boosts a model’s performance, it leads to catastrophic forgetting of previously learned information. Concatenating fine-tuning data with a few randomly selected pre-training samples in every iteration avoids network forgetting [170], [141]. This is also effective in adapting LLMs for cases where fine-tuning data is small and the original capacity is to be maintained. Prompt- based continued pre-training (PCP) [171] trains the model with text and instructions related to tasks and then finally instruction-tunes the model for downstream tasks. 5. Sample Efficiency: While fine-tuning data is generally many-fold smaller than the pre-training data, it still has to be large enough for acceptable performance [16], [92], [18] and requires proportional computing resources. To study the effects on performance with less data, existing literature [172], [173] finds that the models trained on lesser data can out- perform models trained with more data. In [172], 25% of the total downstream data is found enough for state-of-the- art performance. Selecting coreset-based 0.5% of the total instruction-tuning data improves the model performance by 2% in [173], as compared to the complete data tuning. Less is more for alignment (LIMA) [174] uses only 1000 carefully created demonstrations to fine-tune the model and has achieved comparable performance to GPT-4. C. Increasing Context Window LLMs are trained with limited context windows due to expensive attention and high memory requirements. A model trained on limited sequence lengths fails to generalize to unseen lengths at inference time [175], [49]. Alternatively, LLMs with ALiBi [65] positional encodings can perform zero- shot length extrapolation. However, ALiBi has less expres- sive power [66] and inferior performance on multiple bench- marks [46], and many LLMs use RoPE positional embedding that is unable to perform zero-shot extrapolation. 
A larger con- text length has benefits such as a better understanding of longer documents, more samples in in-context learning, execution of bigger reasoning processes, etc. Expanding context length during fine-tuning is slow, inefficient, and computationally expensive [49]. Therefore, researchers employ various context window extrapolation techniques discussed below. Position Interpolation: Rather than extrapolating, [49] shows that interpolating position encodings within the pre-trained context window are more effective. The work demonstrates that only 1000 steps of fine-tuning are enough to achieve better results on larger windows without performance loss compared to the original context size. Giraffe [46] uses power scaling in RoPE, and YaRN [47] proposed NTK-aware interpolation. Efficient Attention Mechanism: Dense global attention is one of the major constraints in training larger context win- dow LLMs. Using efficient attention variants, such as local, sparse, and dilated attention, reduces the computation cost significantly. LongT5 [48] proposes transient global attention (TGlobal), applying attention to local and global tokens (win- dowing token averaging). The model replaces attention in T5 [10] with TGlobal attention, pre-trains the model on 4098 sequence length, fine-tunes on larger window sizes, as largeas 16k, and improves task performance with longer inputs. This shows the extrapolation ability of TGlobal attention with only fine-tuning. COLT5 [176] uses two branches, one with lightweight and the other with heavyweight attention and feed-forward layers. All tokens are processed from the lightweight branch, and only important tokens are routed to the heavyweight branch. LongNet [177] replaces standard attention with dilated attention, expanding sequence length to 1 billion tokens. 
LongLoRA [178] proposes shift-short attention, used during fine-tuning to reduce dense attention costs, while the model during inference can use dense attention and achieve similar performance as full attention fine-tuning. Extrapolation without Training: LM-Infinite [175] and par- allel context windows (PCW) [179] show length extrapolation is possible using pre-trained LLMs. LM-Infinite suggested Λ- shaped attention applied within the original context window limits."
2,0.702324,"PREPRINT 26 TABLE IX: Categorized evaluation datasets used in evaluating LLMs. Type Datasets/Benchmarks Multi-Task MMLU [296], SuperGLUE [2], BIG-bench [297], GLUE [298], BBH [297], CUGE [299], ZeroCLUE [300], FewCLUE [301], Blended Skill Talk [302], HELM [303], KLUE-STS [304] Language Understanding CoQA [305], WiC [306], Wikitext103 [307], PG19 [308], LCQMC [309], QQP [310], WinoGender [311], CB [312], FinRE [313], SanWen [314], AFQMC [300], BQ Corpus [315], CNSS [316], CKBQA 13 [317], CLUENER [300], Weibo [318], AQuA [319], OntoNotes [320], HeadQA [321], Twitter Dataset [322] Story Cloze and Sentence CompletionStoryCloze [323], LAMBADA [324], LCSTS [325], AdGen [326], E2E [327], CHID [328], CHID-FC [301] Physical Knowledge and World UnderstandingPIQA [329], TriviaQA [330], ARC [331], ARC-Easy [331], ARC-Challenge [331], PROST [332], Open- BookQA [333], WebNLG [334], DogWhistle Insider & Outsider [335] Contextual Language UnderstandingRACE [336], RACE-Middle [336], RACE-High [336], QuAC [337], StrategyQA [338], Quiz Bowl [339], cMedQA [340], cMedQA2 [341], MATINF-QA [342] Commonsense Reasoning WinoGrande [343], HellaSwag [344], COPA [345], WSC [346], CSQA [347], SIQA [348], C3[349], CLUEWSC2020 [300], CLUEWSC [300], CLUEWSC-FC [301], ReCoRD [350] Reading Comprehension SQuAD [351], BoolQ [352], SQUADv2 [353], DROP [354], RTE [355], WebQA [356], CMRC2017 [357], CMRC2018 [358], CMRC2019 [359], COTE-BD [360], COTE-DP [360], COTE-MFW [360], MultiRC [361], Natural Questions [362], CNSE [316], DRCD [363], DuReader [364], Dureader robust [365], DuReader-QG [364], SciQ [366], Sogou-log [367], Dureader robust -QG [365], QA4MRE [368], KorQuAD 1.0 [369], CAIL2018-Task1 & Task2 [370] Mathematical Reasoning MATH [371], Math23k [372], GSM8K [373], MathQA [374], MGSM [375], MultiArith [376], ASDiv [377], MAWPS [378], SV AMP [379] Problem Solving HumanEval [130], DS-1000 [380], MBPP [381], APPS [371], CodeContests [131] Natural Language Inference & Logical 
ReasoningANLI [382], MNLI-m [383], MNLI-mm [383],QNLI [351], WNLI [346], OCNLI [300], CMNLI [300], ANLI R1 [382], ANLI R2 [382], ANLI R3 [382], HANS [384], OCNLI-FC [301], LogiQA [385], StrategyQA [338] Cross-Lingual Understanding MLQA [386], XNLI [387], PAWS-X [388], XSum [389], XCOPA [390], XWinograd [391], TyDiQA-GoldP [392], MLSum [393] Truthfulness and Fact Checking TruthfulQA [394], MultiFC [395], Fact Checking on Fever [396] Biases and Ethics in AI ETHOS [397], StereoSet [398], BBQ [399], Winobias [400], CrowS-Pairs [401] Toxicity RealToxicityPrompts [402], CivilComments toxicity classification [403] Language Translation WMT [404], WMT20 [405], WMT20-enzh [405],"
3,0.696482,"PREPRINT 6 Fig. 6: A basic flow diagram depicting various stages of LLMs from pre-training to prompting/utilization. Prompting LLMs to generate responses is possible at different training stages like pre-training, instruction-tuning, or alignment tuning. and utilization. An example of different training stages and inference in LLMs is shown in Figure 6. In this paper, we refer alignment-tuning to aligning with human preferences, while occasionally the literature uses the term alignment for different purposes. 1. Pre-Training: In the very first stage, the model is trained in a self-supervised manner on a large corpus to predict the next tokens given the input. The design choices of LLMs vary from encoder-decoder to decoder-only architectures with dif- ferent building blocks and loss functions in sections II-E, II-D, II-J. 2. Fine-Tuning: There are different styles to fine-tune an LLM. This section briefly discusses fine-tuning approaches. Transfer Learning: The pre-trained LLMs perform well for various tasks [6], [15]. But to improve the performance for a downstream task, pre-trained models are fine-tuned with the task-specific data [10], [11], known as transfer learning. Instruction-tuning: To enable a model to respond to user queries effectively, the pre-trained model is fine-tuned on instruction formatted data i.e., instruction and an input-output pair. Instructions generally comprise multi-task data in plain natural language, guiding the model to respond according to the prompt and the input. This type of fine-tuning improveszero-shot generalization and downstream task performance. Details on formatting instruction data and its various styles are available in [16], [50], [92]. Alignment-tuning: LLMs are prone to generate false, biased, and harmful text. To make them helpful, honest, and harmless models are aligned using human feedback. 
Alignment involves asking LLMs to generate unexpected responses and then updating their parameters to avoid such responses [20], [21], [93]. It ensures LLMs operate according to human intentions and values. A model is defined to be an “aligned” model if the model fulfills three criteria of helpful, honest, and harmless or “HHH” [94]. Researchers employ reinforcement learning with human feed- back (RLHF) [95] for model alignment. In RLHF, a fine- tuned model on demonstrations is further trained with reward modeling (RM) and reinforcement learning (RL), shown in Figure 6. Below we briefly discuss RM and RL pipelines in RLHF. Reward modeling: trains a model to rank generated responses according to human preferences using a classification objec- tive. To train the classifier humans annotate LLMs generated responses based on HHH criteria."
4,0.69422,"PREPRINT 13 TABLE II: Key insights and findings from the study of instruction-tuned Large Language Models. Models Findings & Insights T0•Multi-task prompting enables zero-shot generalization and outperforms baselines •Even a single prompt per dataset task is enough to improve performance WebGPT•The answer quality of LLMs can be further improved with human feedback. •To aid the model in effectively filtering and utilizing relevant information, human labelers play a crucial role in answering questions regarding the usefulness of the retrieved documents. •Interacting a fine-tuned language model with a text-based web-browsing environment can improve end-to-end retrieval and synthesis via imitation learning and reinforcement learning. •Generating answers with references can make labelers easily judge the factual accuracy of answers. Tk-INSTRUCT•Instruction tuning leads to a stronger generalization of unseen tasks •More tasks improve generalization whereas only increasing task instances does not help •Supervised trained models are better than generalized models •Models pre-trained with instructions and examples perform well for different types of inputs mT0 and BLOOMZ•Instruction tuning enables zero-shot generalization to the tasks never seen before •Multi-lingual training leads to even better zero-shot generalization for both English and non-English •Training on machine-translated prompts improves performance for held-out tasks with non-English prompts •English only fine-tuning on multilingual pre-trained language model is enough to generalize to other pre-trained language tasks OPT-IML•Task size sampling to create a batch with most of the task examples is important for better performance •Only example proportional sampling is not enough, training datasets/benchmarks should also be proportional for better generalization/performance •Fully held-out and partially supervised tasks performance improves by scaling tasks or categories whereas fully supervised tasks 
have no effect •Including small amounts i.e. 5% of pretraining data during fine-tuning is effective •Only 1% reasoning data improves the performance, adding more deteriorates performance •Adding dialogue data makes the performance worse Flan•Finetuning with CoT improves performance on held-out tasks •Fine-tuning along with CoT data improves reasoning abilities •CoT tuning improves zero-shot reasoning •Performance improves with more tasks •Instruction fine-tuning improves usability which otherwise is challenging for pre-trained models •Improving the model’s performance with instruction tuning is compute-efficient •Multitask prompting enables zero-shot generalization abilities in LLM Sparrow•The judgments of labelers and the alignments with defined rules can help the model generate better responses. •Good dialogue goals can be broken down into detailed natural language rules for the agent and the raters. •The combination of reinforcement learning (RL) with reranking yields optimal performance in terms of preference win rates and resilience against adversarial probing. WizardCoder•Fine-tuning with re-written instruction-tuning data into a complex set improves the performance significantly LLaMA-2-Chat•Model learns to write safe responses with fine-tuning on safe demonstrations, while additional RLHF step further improves model safety and make it less prone to jailbreak attacks LIMA•Less high quality data is enough for fine-tuned model generalization objectives. Keeping this in view, diverse fine-tuned models have emerged in the literature using manually created datasets. The models T0 [17] and mT0 (multi-lingual) [143] employ templates to convert existing datasets into prompt datasets. They have shown improvements in generalization to zero-shot and held-out tasks. Tk-Instruct [18] fine-tuned the T5 model with in-context instructions to study generalization on unseen tasks when given in-context instructions during test time. 
The model outperformed Instruct-GPT, despite being smaller in size, i.e., 11B parameters as compared to 175B of GPT-3. Increasing Tasks and Prompt Setups: Zero-shot and few- shot performance improves significantly by expanding task collection and prompt styles. OPT-IML [92] and Flan [16] curated larger 2k and 1.8k task datasets, respectively. While increasing task size alone is not enough, OPT-IML and Flan add more prompting setups in their datasets, zero-shot, few-shot, and CoT. In continuation, CoT Collection [96] fine-tunes Flan-T5 further on 1.88M CoT samples."
5,0.688342,"Huang, A. M. Dai, S. Tong, D. Lepikhin, Y . Xu, M. Krikun, Y . Zhou, A. W. Yu, O. Firat et al. , “Glam: Efficient scaling of language models with mixture-of-experts,” in International Conference on Machine Learning . PMLR, 2022, pp. 5547–5569. 9, 21, 23 [117] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538 , 2017. 9, 21 [118] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” The Journal of Machine Learning Research , vol. 23, no. 1, pp. 5232–5270, 2022. 9 [119] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al. , “Training compute-optimal large language models,” arXiv preprint arXiv:2203.15556 , 2022. 9, 23, 26 [120] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza, H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky et al. , “Alexatm 20b: Few-shot learning using a large-scale multilingual seq2seq model,” arXiv preprint arXiv:2208.01448 , 2022. 9, 20, 21, 22, 23 [121] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al. , “Palm 2 technical report,” arXiv preprint arXiv:2305.10403 , 2023. 9, 23 [122] Y . Tay, J. Wei, H. W. Chung, V . Q. Tran, D. R. So, S. Shakeri, X. Gar- cia, H. S. Zheng, J. Rao, A. Chowdhery et al. , “Transcending scaling laws with 0.1% extra compute,” arXiv preprint arXiv:2210.11399 , 2022. 9, 21, 23 [123] Y . Tay, M. Dehghani, V . Q. Tran, X. Garcia, J. Wei, X. Wang, H. W. Chung, D. Bahri, T. Schuster, S. Zheng et al. , “Ul2: Unifying language learning paradigms,” in The Eleventh International Conference on Learning Representations , 2022. 9, 21, 22, 23 [124] Z. Du, Y . Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. 
Tang, “Glm: General language model pretraining with autoregressive blank infilling,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , 2022, pp. 320– 335. 9 [125] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al."
6,0.6875,"IEEE, 2020, pp. 1–16. 2, 4, 5, 21 [38] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig, “Towards a unified view of parameter-efficient transfer learning,” arXiv preprint arXiv:2110.04366 , 2021. 2, 18, 19 [39] Z. Hu, Y . Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. Lee, L. Bing, and S. Poria, “Llm-adapters: An adapter family for parameter-efficient fine- tuning of large language models,” arXiv preprint arXiv:2304.01933 , 2023. 2, 18 [40] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” arXiv preprint arXiv:2104.08691 , 2021. 2, 8, 18[41] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” arXiv preprint arXiv:2101.00190 , 2021. 2, 18 [42] X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the structural pruning of large language models,” arXiv preprint arXiv:2305.11627 , 2023. 2, 19 [43] R. Xu, F. Luo, C. Wang, B. Chang, J. Huang, S. Huang, and F. Huang, “From dense to sparse: Contrastive pruning for better pre-trained language model compression,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 36, no. 10, 2022, pp. 11 547–11 555. 2, 19 [44] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” in ICML , ser. Proceedings of Machine Learn- ing Research, vol. 202. PMLR, 2023, pp. 38 087–38 099. 2, 18 [45] C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, and N. Wong, “Compression of generative pre-trained language models via quantization,” arXiv preprint arXiv:2203.10705 , 2022. 2, 18 [46] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, and S. Naidu, “Giraffe: Adventures in expanding context lengths in llms,” arXiv preprint arXiv:2308.10882 , 2023. 2, 15 [47] B. Peng, J. Quesnelle, H. Fan, and E. Shippole, “Yarn: Efficient context window extension of large language models,” arXiv preprint arXiv:2309.00071 , 2023. 2, 15 [48] M. 
Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y .-H. Sung, and Y . Yang, “Longt5: Efficient text-to-text transformer for long sequences,” arXiv preprint arXiv:2112.07916 , 2021. 2, 15 [49] S. Chen, S. Wong, L. Chen, and Y . Tian, “Extending context window of large language models via positional interpolation,” arXiv preprint arXiv:2306.15595 , 2023. 2, 15 [50] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Dong et al."
7,0.686608,"9, 21, 22, 23 [124] Z. Du, Y . Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang, “Glm: General language model pretraining with autoregressive blank infilling,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , 2022, pp. 320– 335. 9 [125] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al. , “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971 , 2023. 9, 20, 23 [126] M. N. Rabe and C. Staats, “Self-attention does not need o(n2) memory,” arXiv preprint arXiv:2112.05682 , 2021. 9 [127] V . A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation in large transformer models,” Proceedings of Machine Learning and Systems , vol. 5, 2023. 10[128] X. Ren, P. Zhou, X. Meng, X. Huang, Y . Wang, W. Wang, P. Li, X. Zhang, A. Podolskiy, G. Arshinov et al. , “Pangu-P: Towards trillion parameter language model with sparse heterogeneous computing,” arXiv preprint arXiv:2303.10845 , 2023. 10, 12, 21, 23 [129] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y . Zhou, S. Savarese, and C. Xiong, “Codegen: An open large language model for code with multi-turn program synthesis,” arXiv preprint arXiv:2203.13474 , 2022. 10, 20, 23, 25 [130] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman et al. , “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374 , 2021. 10, 23, 26, 28 [131] Y . Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al. , “Competition- level code generation with alphacode,” Science , vol. 378, no. 6624, pp. 1092–1097, 2022. 10, 21, 23, 26 [132] N. Shazeer, “Fast transformer decoding: One write-head is all you need,” arXiv preprint arXiv:1911.02150 , 2019. 
10 [133] R. Y . Pang and H. He, “Text generation by learning from demonstra- tions,” arXiv preprint arXiv:2009.07839 , 2020. 10 [134] R. Dabre and A. Fujita, “Softmax tempering for training neural machine translation models,” arXiv preprint arXiv:2009.09372 , 2020. 10 [135] Y . Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” arXiv preprint arXiv:2109.00859 , 2021."
8,0.684946,"18 [238] Y . Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, and J. Gao, “Adamix: Mixture-of-adapter for parameter-efficient tuning of large language models,” arXiv preprint arXiv:2205.12410 , vol. 1, no. 2, p. 4, 2022. 18 [239] E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685 , 2021. 18, 19, 20 [240] X. Liu, K. Ji, Y . Fu, W. Tam, Z. Du, Z. Yang, and J. Tang, “P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , 2022, pp. 61– 68. 18 [241] A. Razdaibiedina, Y . Mao, R. Hou, M. Khabsa, M. Lewis, and A. Almahairi, “Progressive prompts: Continual learning for language models,” arXiv preprint arXiv:2301.12314 , 2023. 18 [242] Z.-R. Zhang, C. Tan, H. Xu, C. Wang, J. Huang, and S. Huang, “Towards adaptive prefix tuning for parameter-efficient language model fine-tuning,” arXiv preprint arXiv:2305.15212 , 2023. 18 [243] E. B. Zaken, S. Ravfogel, and Y . Goldberg, “Bitfit: Simple parameter- efficient fine-tuning for transformer-based masked language-models,” arXiv preprint arXiv:2106.10199 , 2021. 18"
9,0.684925,"14 [170] T. Scialom, T. Chakrabarty, and S. Muresan, “Fine-tuned language models are continual learners,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , 2022, pp. 6107–6122. 15 [171] Z. Shi and A. Lipani, “Don’t stop pretraining? make prompt-based fine-tuning powerful learner,” arXiv preprint arXiv:2305.01711 , 2023. 15 [172] H. Gupta, S. A. Sawant, S. Mishra, M. Nakamura, A. Mitra, S. Mashetty, and C. Baral, “Instruction tuned models are quick learn- ers,” arXiv preprint arXiv:2306.05539 , 2023. 15 [173] H. Chen, Y . Zhang, Q. Zhang, H. Yang, X. Hu, X. Ma, Y . Yanggong, and J. Zhao, “Maybe only 0.5% data is needed: A preliminary exploration of low training data instruction tuning,” arXiv preprint arXiv:2305.09246 , 2023. 15 [174] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y . Mao, X. Ma, A. Efrat, P. Yu, L. Yu et al. , “Lima: Less is more for alignment,” arXiv preprint arXiv:2305.11206 , 2023. 15, 23, 25[175] C. Han, Q. Wang, W. Xiong, Y . Chen, H. Ji, and S. Wang, “Lm-infinite: Simple on-the-fly length generalization for large language models,” arXiv preprint arXiv:2308.16137 , 2023. 15 [176] J. Ainslie, T. Lei, M. de Jong, S. Ontañón, S. Brahma, Y . Zemlyan- skiy, D. Uthus, M. Guo, J. Lee-Thorp, Y . Tay et al. , “Colt5: Faster long-range transformers with conditional computation,” arXiv preprint arXiv:2303.09752 , 2023. 15 [177] J. Ding, S. Ma, L. Dong, X. Zhang, S. Huang, W. Wang, and F. Wei, “Longnet: Scaling transformers to 1,000,000,000 tokens,” arXiv preprint arXiv:2307.02486 , 2023. 15 [178] Y . Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia, “Longlora: Efficient fine-tuning of long-context large language models,” arXiv preprint arXiv:2309.12307 , 2023. 15 [179] N. Ratner, Y . Levine, Y . Belinkov, O. Ram, I. Magar, O. Abend, E. Karpas, A. Shashua, K. Leyton-Brown, and Y . 
Shoham, “Parallel context windows for large language models,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , 2023, pp. 6383–6402. 15 [180] W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, and F. Wei, “Augmenting language models with long-term memory,” arXiv preprint arXiv:2306.07174 , 2023. 15 [181] X. Xu, Z. Gou, W. Wu, Z.-Y . Niu, H. Wu, H. Wang, and S. Wang, “Long time no see! open-domain conversation with long-term persona memory,” arXiv preprint arXiv:2203.05797 , 2022."


Running Evaluation for Reranker: CohereRerank
Visualize Retrieved Nodes for Reranker: CohereRerank


Rank,Score,Text
0,0.986726,"PREPRINT 6 Fig. 6: A basic flow diagram depicting various stages of LLMs from pre-training to prompting/utilization. Prompting LLMs to generate responses is possible at different training stages like pre-training, instruction-tuning, or alignment tuning. and utilization. An example of different training stages and inference in LLMs is shown in Figure 6. In this paper, we refer alignment-tuning to aligning with human preferences, while occasionally the literature uses the term alignment for different purposes. 1. Pre-Training: In the very first stage, the model is trained in a self-supervised manner on a large corpus to predict the next tokens given the input. The design choices of LLMs vary from encoder-decoder to decoder-only architectures with dif- ferent building blocks and loss functions in sections II-E, II-D, II-J. 2. Fine-Tuning: There are different styles to fine-tune an LLM. This section briefly discusses fine-tuning approaches. Transfer Learning: The pre-trained LLMs perform well for various tasks [6], [15]. But to improve the performance for a downstream task, pre-trained models are fine-tuned with the task-specific data [10], [11], known as transfer learning. Instruction-tuning: To enable a model to respond to user queries effectively, the pre-trained model is fine-tuned on instruction formatted data i.e., instruction and an input-output pair. Instructions generally comprise multi-task data in plain natural language, guiding the model to respond according to the prompt and the input. This type of fine-tuning improveszero-shot generalization and downstream task performance. Details on formatting instruction data and its various styles are available in [16], [50], [92]. Alignment-tuning: LLMs are prone to generate false, biased, and harmful text. To make them helpful, honest, and harmless models are aligned using human feedback. 
Alignment involves asking LLMs to generate unexpected responses and then updating their parameters to avoid such responses [20], [21], [93]. It ensures LLMs operate according to human intentions and values. A model is defined to be an “aligned” model if the model fulfills three criteria of helpful, honest, and harmless or “HHH” [94]. Researchers employ reinforcement learning with human feed- back (RLHF) [95] for model alignment. In RLHF, a fine- tuned model on demonstrations is further trained with reward modeling (RM) and reinforcement learning (RL), shown in Figure 6. Below we briefly discuss RM and RL pipelines in RLHF. Reward modeling: trains a model to rank generated responses according to human preferences using a classification objec- tive. To train the classifier humans annotate LLMs generated responses based on HHH criteria."
1,0.982691,"10: This example illustrates the PanGu-Parchitecture, as depicted in the image sourced from [128].B. Fine-Tuned LLMs Pre-trained LLMs have excellent generalization abilities to unseen tasks. However, because they are generally trained with the objective of next token prediction, LLMs have limited capacity to follow user intent and are prone to generate unethical, toxic or inaccurate responses [20]. For their effective utilization, LLMs are fine-tuned to follow instructions [16], [17], [92] and generate safe responses [20], which also results in increasing zero-shot, few-shot, and cross-task generaliza- tion [92], [16], [18], with minimal compute increment, e.g., 0.2% of the total pre-training for PaLM 540B [16]. We review various fine-tuned LLMs and strategies for effective fine-tuning in this section. 1. Instruction-Tuning with Manually Created Datasets: Numerous hand-crafted instruction-tuning datasets with different design choices are proposed in the literature to instruction-tune LLMs. The performance of fine-tuned LLMs depends on multiple factors, such as dataset, instruction diversity, prompting templates, model size, and training"
2,0.939137,"PREPRINT 15 4. Continue Pre-Training: Although fine-tuning boosts a model’s performance, it leads to catastrophic forgetting of previously learned information. Concatenating fine-tuning data with a few randomly selected pre-training samples in every iteration avoids network forgetting [170], [141]. This is also effective in adapting LLMs for cases where fine-tuning data is small and the original capacity is to be maintained. Prompt- based continued pre-training (PCP) [171] trains the model with text and instructions related to tasks and then finally instruction-tunes the model for downstream tasks. 5. Sample Efficiency: While fine-tuning data is generally many-fold smaller than the pre-training data, it still has to be large enough for acceptable performance [16], [92], [18] and requires proportional computing resources. To study the effects on performance with less data, existing literature [172], [173] finds that the models trained on lesser data can out- perform models trained with more data. In [172], 25% of the total downstream data is found enough for state-of-the- art performance. Selecting coreset-based 0.5% of the total instruction-tuning data improves the model performance by 2% in [173], as compared to the complete data tuning. Less is more for alignment (LIMA) [174] uses only 1000 carefully created demonstrations to fine-tune the model and has achieved comparable performance to GPT-4. C. Increasing Context Window LLMs are trained with limited context windows due to expensive attention and high memory requirements. A model trained on limited sequence lengths fails to generalize to unseen lengths at inference time [175], [49]. Alternatively, LLMs with ALiBi [65] positional encodings can perform zero- shot length extrapolation. However, ALiBi has less expres- sive power [66] and inferior performance on multiple bench- marks [46], and many LLMs use RoPE positional embedding that is unable to perform zero-shot extrapolation. 
A larger con- text length has benefits such as a better understanding of longer documents, more samples in in-context learning, execution of bigger reasoning processes, etc. Expanding context length during fine-tuning is slow, inefficient, and computationally expensive [49]. Therefore, researchers employ various context window extrapolation techniques discussed below. Position Interpolation: Rather than extrapolating, [49] shows that interpolating position encodings within the pre-trained context window are more effective. The work demonstrates that only 1000 steps of fine-tuning are enough to achieve better results on larger windows without performance loss compared to the original context size. Giraffe [46] uses power scaling in RoPE, and YaRN [47] proposed NTK-aware interpolation. Efficient Attention Mechanism: Dense global attention is one of the major constraints in training larger context win- dow LLMs. Using efficient attention variants, such as local, sparse, and dilated attention, reduces the computation cost significantly. LongT5 [48] proposes transient global attention (TGlobal), applying attention to local and global tokens (win- dowing token averaging). The model replaces attention in T5 [10] with TGlobal attention, pre-trains the model on 4098 sequence length, fine-tunes on larger window sizes, as largeas 16k, and improves task performance with longer inputs. This shows the extrapolation ability of TGlobal attention with only fine-tuning. COLT5 [176] uses two branches, one with lightweight and the other with heavyweight attention and feed-forward layers. All tokens are processed from the lightweight branch, and only important tokens are routed to the heavyweight branch. LongNet [177] replaces standard attention with dilated attention, expanding sequence length to 1 billion tokens. 
LongLoRA [178] proposes shift-short attention, used during fine-tuning to reduce dense attention costs, while the model during inference can use dense attention and achieve similar performance as full attention fine-tuning. Extrapolation without Training: LM-Infinite [175] and par- allel context windows (PCW) [179] show length extrapolation is possible using pre-trained LLMs. LM-Infinite suggested Λ- shaped attention applied within the original context window limits."
3,0.936168,"PREPRINT 13 TABLE II: Key insights and findings from the study of instruction-tuned Large Language Models. Models Findings & Insights T0•Multi-task prompting enables zero-shot generalization and outperforms baselines •Even a single prompt per dataset task is enough to improve performance WebGPT•The answer quality of LLMs can be further improved with human feedback. •To aid the model in effectively filtering and utilizing relevant information, human labelers play a crucial role in answering questions regarding the usefulness of the retrieved documents. •Interacting a fine-tuned language model with a text-based web-browsing environment can improve end-to-end retrieval and synthesis via imitation learning and reinforcement learning. •Generating answers with references can make labelers easily judge the factual accuracy of answers. Tk-INSTRUCT•Instruction tuning leads to a stronger generalization of unseen tasks •More tasks improve generalization whereas only increasing task instances does not help •Supervised trained models are better than generalized models •Models pre-trained with instructions and examples perform well for different types of inputs mT0 and BLOOMZ•Instruction tuning enables zero-shot generalization to the tasks never seen before •Multi-lingual training leads to even better zero-shot generalization for both English and non-English •Training on machine-translated prompts improves performance for held-out tasks with non-English prompts •English only fine-tuning on multilingual pre-trained language model is enough to generalize to other pre-trained language tasks OPT-IML•Task size sampling to create a batch with most of the task examples is important for better performance •Only example proportional sampling is not enough, training datasets/benchmarks should also be proportional for better generalization/performance •Fully held-out and partially supervised tasks performance improves by scaling tasks or categories whereas fully supervised 
tasks have no effect •Including small amounts i.e. 5% of pretraining data during fine-tuning is effective •Only 1% reasoning data improves the performance, adding more deteriorates performance •Adding dialogue data makes the performance worse Flan•Finetuning with CoT improves performance on held-out tasks •Fine-tuning along with CoT data improves reasoning abilities •CoT tuning improves zero-shot reasoning •Performance improves with more tasks •Instruction fine-tuning improves usability which otherwise is challenging for pre-trained models •Improving the model’s performance with instruction tuning is compute-efficient •Multitask prompting enables zero-shot generalization abilities in LLM Sparrow•The judgments of labelers and the alignments with defined rules can help the model generate better responses. •Good dialogue goals can be broken down into detailed natural language rules for the agent and the raters. •The combination of reinforcement learning (RL) with reranking yields optimal performance in terms of preference win rates and resilience against adversarial probing. WizardCoder•Fine-tuning with re-written instruction-tuning data into a complex set improves the performance significantly LLaMA-2-Chat•Model learns to write safe responses with fine-tuning on safe demonstrations, while additional RLHF step further improves model safety and make it less prone to jailbreak attacks LIMA•Less high quality data is enough for fine-tuned model generalization objectives. Keeping this in view, diverse fine-tuned models have emerged in the literature using manually created datasets. The models T0 [17] and mT0 (multi-lingual) [143] employ templates to convert existing datasets into prompt datasets. They have shown improvements in generalization to zero-shot and held-out tasks. Tk-Instruct [18] fine-tuned the T5 model with in-context instructions to study generalization on unseen tasks when given in-context instructions during test time. 
The model outperformed Instruct-GPT, despite being smaller in size, i.e., 11B parameters as compared to 175B of GPT-3. Increasing Tasks and Prompt Setups: Zero-shot and few- shot performance improves significantly by expanding task collection and prompt styles. OPT-IML [92] and Flan [16] curated larger 2k and 1.8k task datasets, respectively. While increasing task size alone is not enough, OPT-IML and Flan add more prompting setups in their datasets, zero-shot, few-shot, and CoT. In continuation, CoT Collection [96] fine-tunes Flan-T5 further on 1.88M CoT samples."
4,0.839997,"14 [170] T. Scialom, T. Chakrabarty, and S. Muresan, “Fine-tuned language models are continual learners,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , 2022, pp. 6107–6122. 15 [171] Z. Shi and A. Lipani, “Don’t stop pretraining? make prompt-based fine-tuning powerful learner,” arXiv preprint arXiv:2305.01711 , 2023. 15 [172] H. Gupta, S. A. Sawant, S. Mishra, M. Nakamura, A. Mitra, S. Mashetty, and C. Baral, “Instruction tuned models are quick learn- ers,” arXiv preprint arXiv:2306.05539 , 2023. 15 [173] H. Chen, Y . Zhang, Q. Zhang, H. Yang, X. Hu, X. Ma, Y . Yanggong, and J. Zhao, “Maybe only 0.5% data is needed: A preliminary exploration of low training data instruction tuning,” arXiv preprint arXiv:2305.09246 , 2023. 15 [174] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y . Mao, X. Ma, A. Efrat, P. Yu, L. Yu et al. , “Lima: Less is more for alignment,” arXiv preprint arXiv:2305.11206 , 2023. 15, 23, 25[175] C. Han, Q. Wang, W. Xiong, Y . Chen, H. Ji, and S. Wang, “Lm-infinite: Simple on-the-fly length generalization for large language models,” arXiv preprint arXiv:2308.16137 , 2023. 15 [176] J. Ainslie, T. Lei, M. de Jong, S. Ontañón, S. Brahma, Y . Zemlyan- skiy, D. Uthus, M. Guo, J. Lee-Thorp, Y . Tay et al. , “Colt5: Faster long-range transformers with conditional computation,” arXiv preprint arXiv:2303.09752 , 2023. 15 [177] J. Ding, S. Ma, L. Dong, X. Zhang, S. Huang, W. Wang, and F. Wei, “Longnet: Scaling transformers to 1,000,000,000 tokens,” arXiv preprint arXiv:2307.02486 , 2023. 15 [178] Y . Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia, “Longlora: Efficient fine-tuning of long-context large language models,” arXiv preprint arXiv:2309.12307 , 2023. 15 [179] N. Ratner, Y . Levine, Y . Belinkov, O. Ram, I. Magar, O. Abend, E. Karpas, A. Shashua, K. Leyton-Brown, and Y . 
Shoham, “Parallel context windows for large language models,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , 2023, pp. 6383–6402. 15 [180] W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, and F. Wei, “Augmenting language models with long-term memory,” arXiv preprint arXiv:2306.07174 , 2023. 15 [181] X. Xu, Z. Gou, W. Wu, Z.-Y . Niu, H. Wu, H. Wang, and S. Wang, “Long time no see! open-domain conversation with long-term persona memory,” arXiv preprint arXiv:2203.05797 , 2022."


Running Evaluation for Reranker: bge-reranker-base



Visualize Retrieved Nodes for Reranker: bge-reranker-base


Rank,Score,Text
0,0.834917,"PREPRINT 6 Fig. 6: A basic flow diagram depicting various stages of LLMs from pre-training to prompting/utilization. Prompting LLMs to generate responses is possible at different training stages like pre-training, instruction-tuning, or alignment tuning. and utilization. An example of different training stages and inference in LLMs is shown in Figure 6. In this paper, we refer alignment-tuning to aligning with human preferences, while occasionally the literature uses the term alignment for different purposes. 1. Pre-Training: In the very first stage, the model is trained in a self-supervised manner on a large corpus to predict the next tokens given the input. The design choices of LLMs vary from encoder-decoder to decoder-only architectures with dif- ferent building blocks and loss functions in sections II-E, II-D, II-J. 2. Fine-Tuning: There are different styles to fine-tune an LLM. This section briefly discusses fine-tuning approaches. Transfer Learning: The pre-trained LLMs perform well for various tasks [6], [15]. But to improve the performance for a downstream task, pre-trained models are fine-tuned with the task-specific data [10], [11], known as transfer learning. Instruction-tuning: To enable a model to respond to user queries effectively, the pre-trained model is fine-tuned on instruction formatted data i.e., instruction and an input-output pair. Instructions generally comprise multi-task data in plain natural language, guiding the model to respond according to the prompt and the input. This type of fine-tuning improveszero-shot generalization and downstream task performance. Details on formatting instruction data and its various styles are available in [16], [50], [92]. Alignment-tuning: LLMs are prone to generate false, biased, and harmful text. To make them helpful, honest, and harmless models are aligned using human feedback. 
Alignment involves asking LLMs to generate unexpected responses and then updating their parameters to avoid such responses [20], [21], [93]. It ensures LLMs operate according to human intentions and values. A model is defined to be an “aligned” model if the model fulfills three criteria of helpful, honest, and harmless or “HHH” [94]. Researchers employ reinforcement learning with human feed- back (RLHF) [95] for model alignment. In RLHF, a fine- tuned model on demonstrations is further trained with reward modeling (RM) and reinforcement learning (RL), shown in Figure 6. Below we briefly discuss RM and RL pipelines in RLHF. Reward modeling: trains a model to rank generated responses according to human preferences using a classification objec- tive. To train the classifier humans annotate LLMs generated responses based on HHH criteria."
1,0.189269,"10: This example illustrates the PanGu-Parchitecture, as depicted in the image sourced from [128].B. Fine-Tuned LLMs Pre-trained LLMs have excellent generalization abilities to unseen tasks. However, because they are generally trained with the objective of next token prediction, LLMs have limited capacity to follow user intent and are prone to generate unethical, toxic or inaccurate responses [20]. For their effective utilization, LLMs are fine-tuned to follow instructions [16], [17], [92] and generate safe responses [20], which also results in increasing zero-shot, few-shot, and cross-task generaliza- tion [92], [16], [18], with minimal compute increment, e.g., 0.2% of the total pre-training for PaLM 540B [16]. We review various fine-tuned LLMs and strategies for effective fine-tuning in this section. 1. Instruction-Tuning with Manually Created Datasets: Numerous hand-crafted instruction-tuning datasets with different design choices are proposed in the literature to instruction-tune LLMs. The performance of fine-tuned LLMs depends on multiple factors, such as dataset, instruction diversity, prompting templates, model size, and training"
2,0.184576,"IEEE, 2020, pp. 1–16. 2, 4, 5, 21 [38] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig, “Towards a unified view of parameter-efficient transfer learning,” arXiv preprint arXiv:2110.04366 , 2021. 2, 18, 19 [39] Z. Hu, Y . Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. Lee, L. Bing, and S. Poria, “Llm-adapters: An adapter family for parameter-efficient fine- tuning of large language models,” arXiv preprint arXiv:2304.01933 , 2023. 2, 18 [40] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” arXiv preprint arXiv:2104.08691 , 2021. 2, 8, 18[41] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” arXiv preprint arXiv:2101.00190 , 2021. 2, 18 [42] X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the structural pruning of large language models,” arXiv preprint arXiv:2305.11627 , 2023. 2, 19 [43] R. Xu, F. Luo, C. Wang, B. Chang, J. Huang, S. Huang, and F. Huang, “From dense to sparse: Contrastive pruning for better pre-trained language model compression,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 36, no. 10, 2022, pp. 11 547–11 555. 2, 19 [44] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” in ICML , ser. Proceedings of Machine Learn- ing Research, vol. 202. PMLR, 2023, pp. 38 087–38 099. 2, 18 [45] C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, and N. Wong, “Compression of generative pre-trained language models via quantization,” arXiv preprint arXiv:2203.10705 , 2022. 2, 18 [46] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, and S. Naidu, “Giraffe: Adventures in expanding context lengths in llms,” arXiv preprint arXiv:2308.10882 , 2023. 2, 15 [47] B. Peng, J. Quesnelle, H. Fan, and E. Shippole, “Yarn: Efficient context window extension of large language models,” arXiv preprint arXiv:2309.00071 , 2023. 2, 15 [48] M. 
Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y .-H. Sung, and Y . Yang, “Longt5: Efficient text-to-text transformer for long sequences,” arXiv preprint arXiv:2112.07916 , 2021. 2, 15 [49] S. Chen, S. Wong, L. Chen, and Y . Tian, “Extending context window of large language models via positional interpolation,” arXiv preprint arXiv:2306.15595 , 2023. 2, 15 [50] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Dong et al."
3,0.08793,"PREPRINT 13 TABLE II: Key insights and findings from the study of instruction-tuned Large Language Models. Models Findings & Insights T0•Multi-task prompting enables zero-shot generalization and outperforms baselines •Even a single prompt per dataset task is enough to improve performance WebGPT•The answer quality of LLMs can be further improved with human feedback. •To aid the model in effectively filtering and utilizing relevant information, human labelers play a crucial role in answering questions regarding the usefulness of the retrieved documents. •Interacting a fine-tuned language model with a text-based web-browsing environment can improve end-to-end retrieval and synthesis via imitation learning and reinforcement learning. •Generating answers with references can make labelers easily judge the factual accuracy of answers. Tk-INSTRUCT•Instruction tuning leads to a stronger generalization of unseen tasks •More tasks improve generalization whereas only increasing task instances does not help •Supervised trained models are better than generalized models •Models pre-trained with instructions and examples perform well for different types of inputs mT0 and BLOOMZ•Instruction tuning enables zero-shot generalization to the tasks never seen before •Multi-lingual training leads to even better zero-shot generalization for both English and non-English •Training on machine-translated prompts improves performance for held-out tasks with non-English prompts •English only fine-tuning on multilingual pre-trained language model is enough to generalize to other pre-trained language tasks OPT-IML•Task size sampling to create a batch with most of the task examples is important for better performance •Only example proportional sampling is not enough, training datasets/benchmarks should also be proportional for better generalization/performance •Fully held-out and partially supervised tasks performance improves by scaling tasks or categories whereas fully supervised tasks 
have no effect •Including small amounts i.e. 5% of pretraining data during fine-tuning is effective •Only 1% reasoning data improves the performance, adding more deteriorates performance •Adding dialogue data makes the performance worse Flan•Finetuning with CoT improves performance on held-out tasks •Fine-tuning along with CoT data improves reasoning abilities •CoT tuning improves zero-shot reasoning •Performance improves with more tasks •Instruction fine-tuning improves usability which otherwise is challenging for pre-trained models •Improving the model’s performance with instruction tuning is compute-efficient •Multitask prompting enables zero-shot generalization abilities in LLM Sparrow•The judgments of labelers and the alignments with defined rules can help the model generate better responses. •Good dialogue goals can be broken down into detailed natural language rules for the agent and the raters. •The combination of reinforcement learning (RL) with reranking yields optimal performance in terms of preference win rates and resilience against adversarial probing. WizardCoder•Fine-tuning with re-written instruction-tuning data into a complex set improves the performance significantly LLaMA-2-Chat•Model learns to write safe responses with fine-tuning on safe demonstrations, while additional RLHF step further improves model safety and make it less prone to jailbreak attacks LIMA•Less high quality data is enough for fine-tuned model generalization objectives. Keeping this in view, diverse fine-tuned models have emerged in the literature using manually created datasets. The models T0 [17] and mT0 (multi-lingual) [143] employ templates to convert existing datasets into prompt datasets. They have shown improvements in generalization to zero-shot and held-out tasks. Tk-Instruct [18] fine-tuned the T5 model with in-context instructions to study generalization on unseen tasks when given in-context instructions during test time. 
The model outperformed Instruct-GPT, despite being smaller in size, i.e., 11B parameters as compared to 175B of GPT-3. Increasing Tasks and Prompt Setups: Zero-shot and few- shot performance improves significantly by expanding task collection and prompt styles. OPT-IML [92] and Flan [16] curated larger 2k and 1.8k task datasets, respectively. While increasing task size alone is not enough, OPT-IML and Flan add more prompting setups in their datasets, zero-shot, few-shot, and CoT. In continuation, CoT Collection [96] fine-tunes Flan-T5 further on 1.88M CoT samples."
4,0.013631,"18 [238] Y . Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, and J. Gao, “Adamix: Mixture-of-adapter for parameter-efficient tuning of large language models,” arXiv preprint arXiv:2205.12410 , vol. 1, no. 2, p. 4, 2022. 18 [239] E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685 , 2021. 18, 19, 20 [240] X. Liu, K. Ji, Y . Fu, W. Tam, Z. Du, Z. Yang, and J. Tang, “P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , 2022, pp. 61– 68. 18 [241] A. Razdaibiedina, Y . Mao, R. Hou, M. Khabsa, M. Lewis, and A. Almahairi, “Progressive prompts: Continual learning for language models,” arXiv preprint arXiv:2301.12314 , 2023. 18 [242] Z.-R. Zhang, C. Tan, H. Xu, C. Wang, J. Huang, and S. Huang, “Towards adaptive prefix tuning for parameter-efficient language model fine-tuning,” arXiv preprint arXiv:2305.15212 , 2023. 18 [243] E. B. Zaken, S. Ravfogel, and Y . Goldberg, “Bitfit: Simple parameter- efficient fine-tuning for transformer-based masked language-models,” arXiv preprint arXiv:2106.10199 , 2021. 18"
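
The two tables above rank nearly the same nodes but assign very different scores. A minimal sketch of how one might summarize that difference, using only the scores printed above: the helper names `top_gap` and `spread` are hypothetical and not part of the notebook or of LlamaIndex. Scores from different rerankers come from different models and are not calibrated against each other, so the sketch compares each reranker only with itself.

```python
# Top-5 scores copied from the CohereRerank and bge-reranker-base tables above.
scores = {
    "CohereRerank": [0.986726, 0.982691, 0.939137, 0.936168, 0.839997],
    "bge-reranker-base": [0.834917, 0.189269, 0.184576, 0.08793, 0.013631],
}


def top_gap(values):
    """Difference between the best and second-best score (hypothetical helper)."""
    ordered = sorted(values, reverse=True)
    return ordered[0] - ordered[1]


def spread(values):
    """Difference between the highest and lowest score (hypothetical helper)."""
    return max(values) - min(values)


# Summarize each reranker on its own scale; do not compare raw scores across models.
summary = {
    name: {"top_gap": round(top_gap(vals), 6), "spread": round(spread(vals), 6)}
    for name, vals in scores.items()
}

for name, stats in summary.items():
    print(f"{name}: top_gap={stats['top_gap']}, spread={stats['spread']}")
```

A small gap between the first and second node suggests the reranker sees several candidates as similarly relevant, while a large gap suggests it singles out one node. Either pattern says nothing on its own about which reranker retrieves better; that still needs labeled evaluation.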


Running Evaluation for Reranker: bge-reranker-large



Visualize Retrieved Nodes for Reranker: bge-reranker-large


Rank,Score,Text
0,0.787194,"PREPRINT 6 Fig. 6: A basic flow diagram depicting various stages of LLMs from pre-training to prompting/utilization. Prompting LLMs to generate responses is possible at different training stages like pre-training, instruction-tuning, or alignment tuning. and utilization. An example of different training stages and inference in LLMs is shown in Figure 6. In this paper, we refer alignment-tuning to aligning with human preferences, while occasionally the literature uses the term alignment for different purposes. 1. Pre-Training: In the very first stage, the model is trained in a self-supervised manner on a large corpus to predict the next tokens given the input. The design choices of LLMs vary from encoder-decoder to decoder-only architectures with dif- ferent building blocks and loss functions in sections II-E, II-D, II-J. 2. Fine-Tuning: There are different styles to fine-tune an LLM. This section briefly discusses fine-tuning approaches. Transfer Learning: The pre-trained LLMs perform well for various tasks [6], [15]. But to improve the performance for a downstream task, pre-trained models are fine-tuned with the task-specific data [10], [11], known as transfer learning. Instruction-tuning: To enable a model to respond to user queries effectively, the pre-trained model is fine-tuned on instruction formatted data i.e., instruction and an input-output pair. Instructions generally comprise multi-task data in plain natural language, guiding the model to respond according to the prompt and the input. This type of fine-tuning improveszero-shot generalization and downstream task performance. Details on formatting instruction data and its various styles are available in [16], [50], [92]. Alignment-tuning: LLMs are prone to generate false, biased, and harmful text. To make them helpful, honest, and harmless models are aligned using human feedback. 
Alignment involves asking LLMs to generate unexpected responses and then updating their parameters to avoid such responses [20], [21], [93]. It ensures LLMs operate according to human intentions and values. A model is defined to be an “aligned” model if the model fulfills three criteria of helpful, honest, and harmless or “HHH” [94]. Researchers employ reinforcement learning with human feed- back (RLHF) [95] for model alignment. In RLHF, a fine- tuned model on demonstrations is further trained with reward modeling (RM) and reinforcement learning (RL), shown in Figure 6. Below we briefly discuss RM and RL pipelines in RLHF. Reward modeling: trains a model to rank generated responses according to human preferences using a classification objec- tive. To train the classifier humans annotate LLMs generated responses based on HHH criteria."
1,0.759965,"10: This example illustrates the PanGu-Parchitecture, as depicted in the image sourced from [128].B. Fine-Tuned LLMs Pre-trained LLMs have excellent generalization abilities to unseen tasks. However, because they are generally trained with the objective of next token prediction, LLMs have limited capacity to follow user intent and are prone to generate unethical, toxic or inaccurate responses [20]. For their effective utilization, LLMs are fine-tuned to follow instructions [16], [17], [92] and generate safe responses [20], which also results in increasing zero-shot, few-shot, and cross-task generaliza- tion [92], [16], [18], with minimal compute increment, e.g., 0.2% of the total pre-training for PaLM 540B [16]. We review various fine-tuned LLMs and strategies for effective fine-tuning in this section. 1. Instruction-Tuning with Manually Created Datasets: Numerous hand-crafted instruction-tuning datasets with different design choices are proposed in the literature to instruction-tune LLMs. The performance of fine-tuned LLMs depends on multiple factors, such as dataset, instruction diversity, prompting templates, model size, and training"
2,0.34045,"PREPRINT 13 TABLE II: Key insights and findings from the study of instruction-tuned Large Language Models. Models Findings & Insights T0•Multi-task prompting enables zero-shot generalization and outperforms baselines •Even a single prompt per dataset task is enough to improve performance WebGPT•The answer quality of LLMs can be further improved with human feedback. •To aid the model in effectively filtering and utilizing relevant information, human labelers play a crucial role in answering questions regarding the usefulness of the retrieved documents. •Interacting a fine-tuned language model with a text-based web-browsing environment can improve end-to-end retrieval and synthesis via imitation learning and reinforcement learning. •Generating answers with references can make labelers easily judge the factual accuracy of answers. Tk-INSTRUCT•Instruction tuning leads to a stronger generalization of unseen tasks •More tasks improve generalization whereas only increasing task instances does not help •Supervised trained models are better than generalized models •Models pre-trained with instructions and examples perform well for different types of inputs mT0 and BLOOMZ•Instruction tuning enables zero-shot generalization to the tasks never seen before •Multi-lingual training leads to even better zero-shot generalization for both English and non-English •Training on machine-translated prompts improves performance for held-out tasks with non-English prompts •English only fine-tuning on multilingual pre-trained language model is enough to generalize to other pre-trained language tasks OPT-IML•Task size sampling to create a batch with most of the task examples is important for better performance •Only example proportional sampling is not enough, training datasets/benchmarks should also be proportional for better generalization/performance •Fully held-out and partially supervised tasks performance improves by scaling tasks or categories whereas fully supervised tasks 
have no effect •Including small amounts i.e. 5% of pretraining data during fine-tuning is effective •Only 1% reasoning data improves the performance, adding more deteriorates performance •Adding dialogue data makes the performance worse Flan•Finetuning with CoT improves performance on held-out tasks •Fine-tuning along with CoT data improves reasoning abilities •CoT tuning improves zero-shot reasoning •Performance improves with more tasks •Instruction fine-tuning improves usability which otherwise is challenging for pre-trained models •Improving the model’s performance with instruction tuning is compute-efficient •Multitask prompting enables zero-shot generalization abilities in LLM Sparrow•The judgments of labelers and the alignments with defined rules can help the model generate better responses. •Good dialogue goals can be broken down into detailed natural language rules for the agent and the raters. •The combination of reinforcement learning (RL) with reranking yields optimal performance in terms of preference win rates and resilience against adversarial probing. WizardCoder•Fine-tuning with re-written instruction-tuning data into a complex set improves the performance significantly LLaMA-2-Chat•Model learns to write safe responses with fine-tuning on safe demonstrations, while additional RLHF step further improves model safety and make it less prone to jailbreak attacks LIMA•Less high quality data is enough for fine-tuned model generalization objectives. Keeping this in view, diverse fine-tuned models have emerged in the literature using manually created datasets. The models T0 [17] and mT0 (multi-lingual) [143] employ templates to convert existing datasets into prompt datasets. They have shown improvements in generalization to zero-shot and held-out tasks. Tk-Instruct [18] fine-tuned the T5 model with in-context instructions to study generalization on unseen tasks when given in-context instructions during test time. 
The model outperformed Instruct-GPT, despite being smaller in size, i.e., 11B parameters as compared to 175B of GPT-3. Increasing Tasks and Prompt Setups: Zero-shot and few- shot performance improves significantly by expanding task collection and prompt styles. OPT-IML [92] and Flan [16] curated larger 2k and 1.8k task datasets, respectively. While increasing task size alone is not enough, OPT-IML and Flan add more prompting setups in their datasets, zero-shot, few-shot, and CoT. In continuation, CoT Collection [96] fine-tunes Flan-T5 further on 1.88M CoT samples."
3,0.260049,"18 [238] Y . Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, and J. Gao, “Adamix: Mixture-of-adapter for parameter-efficient tuning of large language models,” arXiv preprint arXiv:2205.12410 , vol. 1, no. 2, p. 4, 2022. 18 [239] E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685 , 2021. 18, 19, 20 [240] X. Liu, K. Ji, Y . Fu, W. Tam, Z. Du, Z. Yang, and J. Tang, “P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , 2022, pp. 61– 68. 18 [241] A. Razdaibiedina, Y . Mao, R. Hou, M. Khabsa, M. Lewis, and A. Almahairi, “Progressive prompts: Continual learning for language models,” arXiv preprint arXiv:2301.12314 , 2023. 18 [242] Z.-R. Zhang, C. Tan, H. Xu, C. Wang, J. Huang, and S. Huang, “Towards adaptive prefix tuning for parameter-efficient language model fine-tuning,” arXiv preprint arXiv:2305.15212 , 2023. 18 [243] E. B. Zaken, S. Ravfogel, and Y . Goldberg, “Bitfit: Simple parameter- efficient fine-tuning for transformer-based masked language-models,” arXiv preprint arXiv:2106.10199 , 2021. 18"
4,0.213883,"PREPRINT 15 4. Continue Pre-Training: Although fine-tuning boosts a model’s performance, it leads to catastrophic forgetting of previously learned information. Concatenating fine-tuning data with a few randomly selected pre-training samples in every iteration avoids network forgetting [170], [141]. This is also effective in adapting LLMs for cases where fine-tuning data is small and the original capacity is to be maintained. Prompt- based continued pre-training (PCP) [171] trains the model with text and instructions related to tasks and then finally instruction-tunes the model for downstream tasks. 5. Sample Efficiency: While fine-tuning data is generally many-fold smaller than the pre-training data, it still has to be large enough for acceptable performance [16], [92], [18] and requires proportional computing resources. To study the effects on performance with less data, existing literature [172], [173] finds that the models trained on lesser data can out- perform models trained with more data. In [172], 25% of the total downstream data is found enough for state-of-the- art performance. Selecting coreset-based 0.5% of the total instruction-tuning data improves the model performance by 2% in [173], as compared to the complete data tuning. Less is more for alignment (LIMA) [174] uses only 1000 carefully created demonstrations to fine-tune the model and has achieved comparable performance to GPT-4. C. Increasing Context Window LLMs are trained with limited context windows due to expensive attention and high memory requirements. A model trained on limited sequence lengths fails to generalize to unseen lengths at inference time [175], [49]. Alternatively, LLMs with ALiBi [65] positional encodings can perform zero- shot length extrapolation. However, ALiBi has less expres- sive power [66] and inferior performance on multiple bench- marks [46], and many LLMs use RoPE positional embedding that is unable to perform zero-shot extrapolation. 
A larger con- text length has benefits such as a better understanding of longer documents, more samples in in-context learning, execution of bigger reasoning processes, etc. Expanding context length during fine-tuning is slow, inefficient, and computationally expensive [49]. Therefore, researchers employ various context window extrapolation techniques discussed below. Position Interpolation: Rather than extrapolating, [49] shows that interpolating position encodings within the pre-trained context window are more effective. The work demonstrates that only 1000 steps of fine-tuning are enough to achieve better results on larger windows without performance loss compared to the original context size. Giraffe [46] uses power scaling in RoPE, and YaRN [47] proposed NTK-aware interpolation. Efficient Attention Mechanism: Dense global attention is one of the major constraints in training larger context win- dow LLMs. Using efficient attention variants, such as local, sparse, and dilated attention, reduces the computation cost significantly. LongT5 [48] proposes transient global attention (TGlobal), applying attention to local and global tokens (win- dowing token averaging). The model replaces attention in T5 [10] with TGlobal attention, pre-trains the model on 4098 sequence length, fine-tunes on larger window sizes, as largeas 16k, and improves task performance with longer inputs. This shows the extrapolation ability of TGlobal attention with only fine-tuning. COLT5 [176] uses two branches, one with lightweight and the other with heavyweight attention and feed-forward layers. All tokens are processed from the lightweight branch, and only important tokens are routed to the heavyweight branch. LongNet [177] replaces standard attention with dilated attention, expanding sequence length to 1 billion tokens. 
LongLoRA [178] proposes shift-short attention, used during fine-tuning to reduce dense attention costs, while the model during inference can use dense attention and achieve similar performance as full attention fine-tuning. Extrapolation without Training: LM-Infinite [175] and par- allel context windows (PCW) [179] show length extrapolation is possible using pre-trained LLMs. LM-Infinite suggested Λ- shaped attention applied within the original context window limits."


# Evaluation

Next, we use `RetrieverEvaluator` to measure the quality of a retriever module.

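## Evaluation metrics

1. Hit rate: the fraction of queries for which an expected chunk appears among the retrieved results.
2. MRR (mean reciprocal rank): the average, over all queries, of 1/rank of the first expected chunk in the retrieved results.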
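
To make the two metrics concrete, here is a minimal pure-Python sketch of how they can be computed from ranked retrieval results. The function names and the toy IDs are illustrative assumptions and are not part of LlamaIndex.

```python
def hit_rate(retrieved_ids, expected_ids):
    """Return 1.0 if any expected ID appears in the retrieved list, else 0.0."""
    return 1.0 if any(i in expected_ids for i in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids, expected_ids):
    """Return 1/rank of the first expected ID (ranks start at 1), or 0.0 if none is found."""
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id in expected_ids:
            return 1.0 / rank
    return 0.0


# Toy data (hypothetical): each entry is (ranked retrieved IDs, expected IDs).
runs = [
    (["a", "b", "c"], ["b"]),  # expected chunk at rank 2
    (["d", "e"], ["d"]),       # expected chunk at rank 1
    (["f", "g"], ["z"]),       # expected chunk not retrieved
]

# Average the per-query scores to get the dataset-level metrics.
mean_hit_rate = sum(hit_rate(r, e) for r, e in runs) / len(runs)
mean_mrr = sum(reciprocal_rank(r, e) for r, e in runs) / len(runs)
print(mean_hit_rate, mean_mrr)
```

Hit rate only records whether an expected chunk was found at all, while MRR also rewards placing it near the top of the list.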


## Build an Evaluation dataset of (query, context) pairs
Here we build a simple evaluation dataset over the existing text corpus.

We use `generate_question_context_pairs` to generate a set of (question, context) pairs over a given unstructured text corpus. It uses an LLM to auto-generate questions from each context chunk. We will use the `Mistral` LLM to generate the question-context pairs.

We get back an `EmbeddingQAFinetuneDataset` object. At a high level, it contains a set of IDs mapping to queries and relevant doc chunks, as well as the corpus itself.
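
As a rough mental model, the dataset can be pictured as three dictionaries: `queries`, `relevant_docs`, and `corpus`. The sketch below mimics that layout with plain dicts; the IDs, the texts, and the helper `contexts_for_query` are hypothetical and only illustrate how the mappings relate to each other.

```python
# Hypothetical toy data mirroring the three mappings of the dataset object.
queries = {"q1": "What does LoRA stand for?"}
relevant_docs = {"q1": ["n1"]}  # query ID -> IDs of the relevant chunks
corpus = {
    "n1": "LoRA: low-rank adaptation of large language models.",
    "n2": "ALiBi positional encodings allow length extrapolation.",
}


def contexts_for_query(query_id, relevant_docs, corpus):
    """Return the text of every chunk marked as relevant for the given query ID."""
    return [corpus[node_id] for node_id in relevant_docs[query_id]]


for query_id, question in queries.items():
    print(question, "->", contexts_for_query(query_id, relevant_docs, corpus))
```

The real object exposes the same three attributes, which the cells below index by ID.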

In [7]:
qa_generate_prompt_tmpl = """\
Context:

---------------------
{context_str}
---------------------

Task: As a Professor, your objective is to create {num_questions_per_chunk} questions for an upcoming quiz or examination. The questions should be varied and cover the provided context thoroughly. They should not include options or start with Q1/Q2. Ensure the questions are restricted to the context information provided.
"""
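
The template has two placeholders, `{context_str}` and `{num_questions_per_chunk}`. The sketch below fills them with plain `str.format` to show what a rendered prompt looks like. The template is an abbreviated copy and the context sentence is made up; whether the library substitutes the values in exactly this way is an assumption.

```python
# Abbreviated copy of the template above, kept short for illustration.
template = """\
Context:

---------------------
{context_str}
---------------------

Task: As a Professor, your objective is to create {num_questions_per_chunk} questions for an upcoming quiz or examination.
"""

# Fill both placeholders; the context sentence is a made-up example chunk.
prompt = template.format(
    context_str="LoRA adapts large models with low-rank updates.",
    num_questions_per_chunk=2,
)
print(prompt)
```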

In [7]:
import random

from llama_index.core.evaluation import (
    RetrieverEvaluator,
    generate_question_context_pairs,
)
from llama_index.llms.ollama import Ollama

# Sample 5 nodes at random to keep question generation fast
nodes_filter = random.sample(nodes, 5)

# LLM used to generate the questions
llm = Ollama(model="mistral", request_timeout=300.0)

qa_dataset = generate_question_context_pairs(
    nodes_filter,
    llm=llm,
    num_questions_per_chunk=2,
    qa_generate_prompt_tmpl=qa_generate_prompt_tmpl,
)

100%|██████████| 5/5 [02:36<00:00, 31.30s/it]


In [8]:
len(qa_dataset.corpus.keys())

5

In [10]:
qa_dataset

EmbeddingQAFinetuneDataset(queries={'543f75c8-5dce-49df-9d65-c6928363da64': 'Question 1: Which techniques were used in the training of UL2 model to improve its performance for downstream tasks? please specify the denoisers and their respective percentages.', '93567bf7-86e6-43e0-b0c9-db3d83d93c01': 'Question 2: Compare and contrast the training methods of PaLM-2, 1.20 U-PaLM, and GLM-130B in terms of their pre-training objectives, architecture, and improvements over baseline models on various NLP tasks.', '71b9a7a7-573b-49d1-886a-e0d7fe2736d2': 'Which pre-trained language models (LLMs) have been employed for question-answering (QA), classification (Clf), natural language inference (NLI), machine translation (MT), reading comprehension (RC), commonsense reasoning (CR), mathematical reasoning (MR), and memorization (Mem.) tasks, as illustrated in Table X?', '19bcc273-e9ec-4ee4-a4fb-f85365637705': 'In which benchmark is the BIG-benchmark model with SuperGLUE used for evaluating QA, Clf, NL

In [14]:
# Look up one of the generated questions by its query ID
qa_dataset.queries['9596503c-800b-44e9-bc52-530501823fe6']

'How do the approaches presented in "Bloom," "OPT," "Palm," "Scaling instruction-finetuned language models," and "Multitask prompted training" differ from each other in terms of scaling language modeling?'

In [15]:
# IDs of the chunks that are relevant to this query
qa_dataset.relevant_docs['9596503c-800b-44e9-bc52-530501823fe6']

['25372f52-a0a3-4409-b8f5-0c429e4e073d']

In [16]:
# Text of the relevant chunk, looked up in the corpus
qa_dataset.corpus['25372f52-a0a3-4409-b8f5-0c429e4e073d']

', “Bloom: A 176b-\nparameter open-access multilingual language model,” arXiv preprint\narXiv:2211.05100 , 2022. 2, 4, 9, 10, 20, 21, 23, 27\n[14] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen,\nC. Dewan, M. Diab, X. Li, X. V . Lin et al. , “Opt: Open pre-trained\ntransformer language models,” arXiv preprint arXiv:2205.01068 , 2022.\n2, 8, 10, 21, 23\n[15] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts,\nP. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al. , “Palm: Scaling\nlanguage modeling with pathways,” arXiv preprint arXiv:2204.02311 ,\n2022. 2, 6, 9, 10, 20, 21, 22, 23\n[16] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, E. Li,\nX. Wang, M. Dehghani, S. Brahma et al. , “Scaling instruction-finetuned\nlanguage models,” arXiv preprint arXiv:2210.11416 , 2022. 2, 6, 7, 12,\n13, 15, 20, 21, 23, 25, 28\n[17] V . Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai,\nA. Chaffin, A. Stiegler, T. L. Scao, A. Raja et al. , “Mul

In [None]:
# Try it out on a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[1]
sample_expected = qa_dataset.relevant_docs[sample_id]
#sample_expected = qa_dataset.corpus['4d7593c4-85b5-420d-944d-1ffe07bacf0f']

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)
# A notebook already runs an event loop, so await the async variant
# instead of the blocking evaluate() call
eval_result = await retriever_evaluator.aevaluate(sample_query, sample_expected)
print(eval_result)

In [38]:
sample_query

'Question 2: Compare and contrast the training methods of PaLM-2, 1.20 U-PaLM, and GLM-130B in terms of their pre-training objectives, architecture, and improvements over baseline models on various NLP tasks.'

In [25]:
def get_sample_data(qa_dataset):
    """
    Extracts a sample query and its expected results from the QA dataset.
    
    Args:
        qa_dataset (object): The question-answering dataset object.
        
    Returns:
        tuple: A tuple containing the sample ID, sample query, and expected results.
    """
    # Extracting the first item from the dataset's queries
    sample_id, sample_query = list(qa_dataset.queries.items())[0]
    sample_expected = qa_dataset.relevant_docs[sample_id]
    return sample_id, sample_query, sample_expected

def evaluate_retriever(retriever, sample_query, sample_expected):
    """
    Evaluates a retriever model against a sample query using specified metrics.
    
    Args:
        retriever (object): The retriever model to evaluate.
        sample_query (str): The sample query to evaluate.
        sample_expected (list): The expected results for the sample query.
        
    Returns:
        RetrievalEvalResult | None: The evaluation result with MRR and hit-rate scores, or None if evaluation fails.
    """
    retriever_evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever
    )
    try:
        eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
        return eval_result
    except Exception as e:
        print(f"Evaluation failed: {e}")
        return None

def main():
    # Assuming qa_dataset and retriever are already defined
    sample_id, sample_query, sample_expected = get_sample_data(qa_dataset)
    eval_result = evaluate_retriever(retriever, sample_query, sample_expected)
    if eval_result:
        print(eval_result)

if __name__ == "__main__":
    main()


Evaluation failed: asyncio.run() cannot be called from a running event loop


  return None
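
The failure above is the error `asyncio.run()` raises when it is called while an event loop is already running, which is the case inside a Jupyter kernel. The sketch below reproduces the situation using only the standard library. `aevaluate_stub` and `evaluate_stub` are hypothetical stand-ins for an async evaluation call and its blocking wrapper; they are not LlamaIndex functions.

```python
import asyncio


async def aevaluate_stub(query):
    """Hypothetical async evaluation call that returns a fixed score."""
    await asyncio.sleep(0)
    return {"query": query, "hit_rate": 1.0}


def evaluate_stub(query):
    """Blocking wrapper that starts its own event loop via asyncio.run()."""
    coro = aevaluate_stub(query)
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        raise


async def notebook_cell():
    """Simulates code executing inside an already-running loop, as in Jupyter."""
    try:
        evaluate_stub("sample query")
        sync_error = None
    except RuntimeError as exc:
        sync_error = str(exc)
    # Awaiting the coroutine directly works because no second loop is started.
    result = await aevaluate_stub("sample query")
    return sync_error, result


sync_error, result = asyncio.run(notebook_cell())
print(sync_error)
print(result)
```

In the notebook itself, two commonly used ways around this are to `await retriever_evaluator.aevaluate(sample_query, sample_expected)` directly in a cell, or to call `nest_asyncio.apply()` once before using the blocking `evaluate()` method.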
