# Contents
- [Introduction]()
- [Testset generation]()
- [Build RAG with llama-index]()
- [Tracing using Phoenix]()
- [Evaluation]()
- [Embedding analysis]()
- [Conclusion]()

TODO

- ✅ Launch Phoenix
- ✅ Index corpus
- ✅ Instrument LlamaIndex
- ✅ Run LlamaIndex application
- ✅ Look at UI, understand traces
- ❌ Export and flatten span data into a phoenix dataset
- ✅ Instrument LangChain
- ✅ Run evaluations
- ✅ Show evaluations
- ✅ Map root span IDs over Ragas evaluations, create `SpanEvaluations`, use the `add_evaluations` API to attach evaluations to the trace dataset, and launch Phoenix with that trace dataset
- ✅ Display traces with annotated evaluations
- ❌ Export using the query DSL to get a primary dataset for visualizing embeddings. Alternatively, call `get_spans_dataframe` on the `px.TraceDataset`, whichever is simpler.
- ❌ Wrangle out the corpus dataset from LlamaIndex in-memory vector store
- ❌ Re-launch Phoenix with primary and corpus datasets

## Introduction

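In this notebook, we generate a synthetic test set with Ragas, build a RAG pipeline with LlamaIndex, trace it with Phoenix, evaluate the results, and analyze the embeddings.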

In [None]:
!pip install ragas pypdf arize-phoenix llama-index pandas

In [1]:
import pandas as pd

# Display the complete contents of dataframe cells.
pd.set_option("display.max_colwidth", None)

## Synthetic test data generation

Follow the instructions [here](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage) to install `git-lfs`.

In [None]:
! git lfs install
! git clone https://huggingface.co/datasets/explodinggradients/prompt-engineering-papers

In [2]:
from llama_index import SimpleDirectoryReader

dir_path = "./prompt-engineering-papers"
reader = SimpleDirectoryReader(dir_path, num_files_limit=2)
documents = reader.load_data()

In [3]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# generator with openai models
generator = TestsetGenerator.with_openai()

# set question type distribution
distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

# generate testset
testset = generator.generate_with_llamaindex_docs(
    documents, test_size=10, distributions=distribution
)
test_df = testset.to_pandas()
test_df.head()


Unnamed: 0,question,contexts,ground_truth,evolution_type,episode_done
0,How does the lattice width of F relate to the diameter of F(M)?,"[1miisapathofminimallength, then ∥u′−v′∥ ≤r·∥M∥\nand the claim follows from diam( F(M))≥distF(M)(u′,v′) =r. □\nRemark 3.2. LetF ⊂Zdbe a normal set. For all l∈ {−1,0,1}dandu,v∈ Fwe have\n(u−v)Tl≤ ∥u−v∥1and thus width l(F) := max {(u−v)Tl:u,v∈ F} ≤ max{∥u−v∥1:\nu,v∈ F}. Suppose that u′,v′∈ Fare such that ∥u′−v′∥1= max{∥u−v∥1:u,v∈ F}and\nletl′\ni:= sign(u′\ni−v′\ni) fori∈[d], then\n∥u′−v′∥1= (u′−v′)T·l′≤widthl′(F)≤max{∥u−v∥1:u,v∈ F}=∥u′−v′∥1.\nThelattice width ofFis width( F) := minl∈Zdwidthl(F) and thus Lemma 3.1gives\n∥M∥1·diam(F(M))≥width(F).]",The lattice width of F is related to the diameter of F(M) by the inequality ∥M∥1·diam(F(M))≥width(F).,simple,True
1,How is distillation used to improve the reasoning ability of language models?,"[\nlanguage models such as T5-XXL. The distillationis achieved by finetuning the small model on the\nchain-of-thought data (Wei et al., 2022c) generated\nby a large teacher model. Although promising per-\nformance is achieved, the improvements are likely\ntask-dependent. Further investigation on improv-\ning the reasoning ability by learning from larger\nLLMs could be an interesting direction.\n11.3 ICL Robustness\nPrevious studies have shown that ICL performance\nis extremely unstable, from random guess to SOTA,\nand can be sensitive to many factors, including\ndemonstration permutation, demonstration format,\netc. (Zhao et al., 2021; Lu et al., 2022). The robust-\nness of ICL is a critical yet challenging problem.\nHowever, most of the existing methods fall into\nthe dilemma of accuracy and robustness (Chen\net al., 2022c), or even at the cost of sacrificing\ninference efficiency. To effectively improve the\nrobustness of ICL, we need deeper analysis of the\nworking mechanism of the ICL. We believe that\nthe analysis of the robustness of the ICL from a\nmore theoretical perspective rather than an empir-\nical perspective can highlight future research on\nmore robust ICL.\n11.4 ICL Efficiency and Scalability\nICL necessitates prepending a significant number\nof demonstrations within the context. However, it\npresents two challenges: (1) the quantity of demon-\nstrations is constrained by the maximum input\nlength of LMs, which is significantly fewer com-\npared to fine-tuning (scalability); (2) as the number\nof demonstrations increases, the computation cost\nbecomes higher due to the quadratic complexity of\nattention mechanism (efficiency). Previous work in\n§5 focused on exploring how to achieve better ICL\nperformance using a limited number of demonstra-\ntions and proposed several demonstration design-\ning strategies. 
Scaling ICL to more demonstrations\nand improving its efficiency remains a challenging\ntask.\nRecently, some works have been proposed to ad-\ndress the issues of scalability and efficiency of ICL.\nEfforts were made to optimize prompting strate-\ngies with structured prompting (Hao et al., 2022b),\ndemonstration ensembling (Khalifa et al., 2023),\ndynamic prompting]",Distillation is used to improve the reasoning ability of language models by finetuning a small model on the chain-of-thought data generated by a large teacher model.,simple,True
2,How do Transformers utilize task recognition ability and implicit empirical risk minimization for implementing a proper function class?,"[ Trans-\nformers can implement a proper function class\nthrough implicit empirical risk minimization for\nthe demonstrations. Pan et al. (2023) decoupled the\nICL ability into task recognition ability and task\nlearning ability, and further showed how they uti-lize demonstrations. From an information-theoretic\nperspective, Hahn and Goyal (2023) showed an er-\nror bound for ICL under linguistically motivated\nassumptions to explain how next-token prediction\ncan bring about the ICL ability. Si et al. (2023)\nfound that large language models exhibit prior fea-\nture biases and showed a way to use intervention\nto avoid unintended features in ICL.\nAnother series of work attempted to build con-\nnections between ICL and gradient descent. Tak-\ning linear regression as a starting point, Akyürek\net al. (2022) found that Transformer-based in-\ncontext learners can implement standard finetun-\ning algorithms implicitly, and von Oswald et al.\n(2022) showed that linear attention-only Transform-\ners with hand-constructed parameters and mod-\nels learned by gradient descent are highly related.\nBased on softmax regression, Li et al. (2023e)\nfound that self-attention-only Transformers showed\nsimilarity with models learned by gradient-descent.\nDai et al. (2022) figured out a dual form between\nTransformer attention and gradient descent and fur-\nther proposed to understand ICL as implicit fine-\ntuning. Further, they compared GPT-based ICL\nand explicit finetuning on real tasks and found that\nICL indeed behaves similarly to finetuning from\nmultiple perspectives.\nFunctional Components Focusing on specific\nfunctional modules, Olsson et al. (2022) found that\nthere exist some induction heads in Transformers\nthat copy previous patterns to complete the next\ntoken. 
Further, they expanded the function of in-\nduction heads to more abstract pattern matching\nand completion, which may implement ICL. Wang\net al. (2023b) focused on the information flow in\nTransformers and found that during the ICL pro-\ncess, demonstration label words serves as anchors,\nwhich aggregates and distributes key information\nfor the final prediction.\n3Takeaway :(1) Knowing and considering\n]","Transformers utilize task recognition ability and implicit empirical risk minimization for implementing a proper function class by decoupling the ICL ability into task recognition ability and task learning ability, and utilizing demonstrations.",simple,True
3,How does the METALM model contribute to enhancing the In-Context Learning ability in multi-modal tasks?,"[ improving the results.\n9.2 Multi-Modal In-Context Learning\nIn the vision-language area, Tsimpoukelli et al.\n(2021) utilize a vision encoder to represent an im-\nage as a prefix embedding sequence that is aligned\nwith a frozen language model after training on the\npaired image-caption dataset. The resulting model,\nFrozen, is capable of performing multi-modal few-\nshot learning. Further, Alayrac et al. (2022) in-\ntroduce Flamingo, which combines a vision en-\ncoder with LLMs and adopts LLMs as the general\ninterface to perform in-context learning on many\nmulti-modal tasks. They show that training on\nlarge-scale multi-modal web corpora with arbitrar-\nily interleaved text and images is key to endowing\nthem with in-context few-shot learning capabili-\nties. Kosmos-1 (Huang et al., 2023b) is another\nmulti-modal LLMs and demonstrates promising\nzero-shot, few-shot, and even multimodal chain-\nof-thought prompting abilities. Hao et al. (2022a)\npresent METALM, a general-purpose interface to\nmodels across tasks and modalities. With a semi-\ncausal language modeling objective, METALM is\npretrained and exhibits strong ICL performance\nacross various vision-language tasks.\nIt is natural to further enhance the ICL ability]","The contribution of the METALM model to enhancing the In-Context Learning ability in multi-modal tasks is not mentioned in the given context. Therefore, the answer is -1.",simple,True
4,How does symmetry of a symmetric random walk relate to the paths in a graph?,"[ these paths, then Γ is a set of canonical paths . Let for any edge e∈E,\nΓe:={p∈Γ :e∈p}be the set of paths from Γ that use e. Now, let H:V×V→[0,1] be a\nsymmetric random walk on Gand deﬁne\nρ(Γ,H) :=max{|p|:p∈Γ}\n|V|·max\n{u,v}∈E|Γ{u,v}|\nH(u,v).\nObserve that symmetry of His needed to make ρ(Γ,H) well-deﬁned. This can be used to\nprove the following upper bound on the second largest eigenv alue.\nLemma 5.7. LetGbe a graph, Hbe a symmetric random walk on G, andΓbe a set of\ncanonical paths in G. Thenλ2(H)≤1−1\nρ(Γ,H).\nProof.The stationary distribution of His the uniform distribution and thus the statement\nis a direct consequence of [23, Theorem 5], since ρ(Γ,H) is an upper bound on the constant\ndeﬁned in [23, equation 4]. □\nTheorem 5.8. LetF ⊂Zdbe ﬁnite and let M:={m1,...,m k} ⊂Zdbe an augmenting\nMarkov basis. Let πbe the uniform and fbe a positive distribution on FandMrespectively.\nFori∈[k], letri:= max{|RF,mi(u)|:u∈ F}and suppose that r1≥r2≥ ··· ≥rk. Then\nλ(Hπ,f\nM,F)≤1−|F|·min(f)\nAM(F)·AM(F)!·3AM(F)−1·2|M|·r1r2···rAM(F).\nProof.Choose for any distinct u,v∈ Fan augmenting path pu,vof minimal length in Fc(M)\nand let Γ be the collection of all these paths. Let u+µmk]",The answer to the question is not present in the given context.,simple,True
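
Some generated rows carry a ground truth that is empty or that states the answer is not in the context, so it can help to check the test set before using it for evaluation. The following is a minimal sketch, not part of the original notebook: `sample_df` is a hypothetical stand-in for `test_df`, and `summarize_testset` is an illustrative helper whose phrase matching is an assumption based on the wording seen in the output above.

```python
import pandas as pd

# Hypothetical stand-in for `test_df`; the real frame comes from testset.to_pandas().
sample_df = pd.DataFrame(
    {
        "question": ["q1", "q2", "q3", "q4"],
        "ground_truth": [
            "An answer grounded in the context.",
            "The answer to the question is not present in the given context.",
            None,
            "Another grounded answer.",
        ],
        "evolution_type": ["simple", "simple", "reasoning", "multi_context"],
    }
)


def summarize_testset(df):
    """Return observed evolution-type shares and a mask of rows without a usable ground truth."""
    # Share of each question type, to compare against the requested distribution.
    shares = df["evolution_type"].value_counts(normalize=True).to_dict()

    # Treat missing, blank, or "not in context" ground truths as unusable.
    # The matched phrases are assumptions taken from the sample output.
    gt = df["ground_truth"].fillna("")
    pattern = "not present in the given context|not mentioned in the given context"
    unusable = (gt.str.strip() == "") | gt.str.contains(pattern, case=False, regex=True)
    return shares, unusable


shares, unusable = summarize_testset(sample_df)
usable_df = sample_df[~unusable]
print(shares)
print(usable_df[["question", "ground_truth"]])
```

Whether to drop such rows or keep them as negative examples is a design choice; the sketch only makes them visible.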


## Build RAG with llama-index

In [4]:
import phoenix as px

session = px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


In [5]:
import llama_index

llama_index.set_global_handler("arize_phoenix")

In [6]:
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.embeddings import OpenAIEmbedding
from datasets import Dataset
from tqdm.auto import tqdm


def build_query_engine(documents):
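    # The embedding model is configured on the ServiceContext, which is where
    # the index reads it from.
    service_context = ServiceContext.from_defaults(
        chunk_size=512,
        embed_model=OpenAIEmbedding(),
    )
    vector_index = VectorStoreIndex.from_documents(
        documents,
        service_context=service_context,
    )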

    query_engine = vector_index.as_query_engine(similarity_top_k=2)
    return query_engine


def generate_response(query_engine, question):
    response = query_engine.query(question)
    return {
        "answer": response.response,
        "contexts": [c.node.get_content() for c in response.source_nodes],
    }


# Run every test question through the query engine sequentially (not
# asynchronously) and collect the results in the column layout Ragas expects.
def generate_ragas_dataset(query_engine, test_df):
    test_questions = test_df["question"].tolist()
    responses = [generate_response(query_engine, q) for q in tqdm(test_questions)]

    dataset_dict = {
        "question": test_questions,
        "answer": [response["answer"] for response in responses],
        "contexts": [response["contexts"] for response in responses],
        "ground_truth": test_df["ground_truth"].values.tolist(),
    }
    ds = Dataset.from_dict(dataset_dict)
    return ds
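
To make the resulting column layout concrete without calling OpenAI, the sketch below runs the same extraction logic against a stub engine. Everything prefixed with `Stub`, the helper `to_record`, and the sample questions are hypothetical and exist only for illustration; the real notebook uses the LlamaIndex query engine built above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StubNode:
    text: str

    def get_content(self) -> str:
        return self.text


@dataclass
class StubNodeWithScore:
    node: StubNode


@dataclass
class StubResponse:
    response: str
    source_nodes: List[StubNodeWithScore] = field(default_factory=list)


class StubQueryEngine:
    """Hypothetical engine that returns a canned answer with two retrieved chunks."""

    def query(self, question: str) -> StubResponse:
        nodes = [
            StubNodeWithScore(StubNode(f"chunk {i} for: {question}")) for i in range(2)
        ]
        return StubResponse(response=f"answer to: {question}", source_nodes=nodes)


def to_record(engine, question: str) -> Dict[str, object]:
    # Same extraction as generate_response: the answer text plus the
    # content of every retrieved source node.
    response = engine.query(question)
    return {
        "answer": response.response,
        "contexts": [c.node.get_content() for c in response.source_nodes],
    }


questions = ["What is ICL?", "What is a Markov basis?"]
records = [to_record(StubQueryEngine(), q) for q in questions]

# One entry per question in every column; `contexts` is a list of lists of strings.
dataset_dict = {
    "question": questions,
    "answer": [r["answer"] for r in records],
    "contexts": [r["contexts"] for r in records],
    "ground_truth": ["reference answer 1", "reference answer 2"],
}
print(dataset_dict["contexts"][0])
```

The point to notice is that each row's `contexts` holds as many strings as retrieved nodes, which corresponds to `similarity_top_k=2` in the real engine.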

In [7]:
query_engine = build_query_engine(documents)
ragas_eval_dataset = generate_ragas_dataset(query_engine, test_df)
pd.DataFrame(ragas_eval_dataset)


Unnamed: 0,question,answer,contexts,ground_truth
0,How does the lattice width of F relate to the diameter of F(M)?,"The lattice width of F, denoted as width(F), is defined as the minimum of the lattice widths of F in all dimensions. On the other hand, the diameter of F(M), denoted as diam(F(M)), is a lower bound on the shortest path length between any two elements in F(M). \n\nBased on the given context, it can be inferred that the lattice width of F is related to the diameter of F(M) through the inequality: \n\nwidth(F) ≤ diam(F(M)) / ∥M∥\n\nThis means that the lattice width of F is bounded above by the diameter of F(M) divided by the maximum norm of the elements in M.","[□\nRemark 3.2. LetF ⊂Zdbe a normal set. For all l∈ {−1,0,1}dandu,v∈ Fwe have\n(u−v)Tl≤ ∥u−v∥1and thus width l(F) := max {(u−v)Tl:u,v∈ F} ≤ max{∥u−v∥1:\nu,v∈ F}. Suppose that u′,v′∈ Fare such that ∥u′−v′∥1= max{∥u−v∥1:u,v∈ F}and\nletl′\ni:= sign(u′\ni−v′\ni) fori∈[d], then\n∥u′−v′∥1= (u′−v′)T·l′≤widthl′(F)≤max{∥u−v∥1:u,v∈ F}=∥u′−v′∥1.\nThelattice width ofFis width( F) := minl∈Zdwidthl(F) and thus Lemma 3.1gives\n∥M∥1·diam(F(M))≥width(F)., 3.Bounds on the diameter\nIn general knowledge of the diameter of the graph underlying a Markov chain can provide\ninformation about the mixing time. For random walks on ﬁber g raphs, the chains which we\nconsider, the underlying graph coincides with the ﬁber grap h. In this section, we determine\nlower and upper bounds on the diameter of ﬁber graphs and thei r compressed counterparts.\nFor a ﬁnite set M ⊂Zdand any norm ∥·∥onRd, let∥M∥:= max m∈M∥m∥.\nLemma 3.1. LetF ⊂ZdandM ⊂Zdbe ﬁnite sets, then\ndiam(F(M))≥1\n∥M∥·max{∥u−v∥:u,v∈ F}.\nProof.IfF(M) is not connected, then the statement holds trivially, so as sume that Mis a\nMarkov basis for F. Letu′,v′∈ Fsuch that ∥u′−v′∥= max{∥u−v∥:u,v∈ F}and let\nm1,...,m r∈ Msothatu′=v′+∑r\ni=1miisapathofminimallength, then ∥u′−v′∥ ≤r·∥M∥\nand the claim follows from diam( F(M))≥distF(M)(u′,v′) =r. □\nRemark 3.2. LetF ⊂Zdbe a normal set. 
For all l∈ {−1,0,1}dandu,v∈ Fwe have\n(u−v)Tl≤ ∥u−v∥1and thus width l(F) := max {(u−v)Tl:u,v∈ F} ≤ max{∥u−v∥1:\nu,v∈ F}.]",The lattice width of F is related to the diameter of F(M) by the inequality ∥M∥1·diam(F(M))≥width(F).
1,How is distillation used to improve the reasoning ability of language models?,"Distillation is used to improve the reasoning ability of language models by transferring the reasoning ability from larger models to smaller models. This is achieved by finetuning the small model on the chain-of-thought data generated by a large teacher model. The goal is to distill the reasoning ability of the larger model into the smaller model, allowing for improved performance in reasoning tasks. However, it is important to note that the effectiveness of this approach may vary depending on the specific task.","[11.1 New Pretraining Strategies\nAs investigated by Shin et al. (2022b), language\nmodel objectives are not equal to ICL abilities. Re-\nsearchers have proposed to bridge the gap between\npretraining objectives and ICL through interme-\ndiate tuning before inference (Section 4), which\nshows promising performance improvements. To\ntake it further, tailored pretraining objectives and\nmetrics for ICL have the potential to raise LLMs\nwith superior ICl capabilities.\n11.2 ICL Ability Distillation\nPrevious studies have shown that in-context learn-\ning for reasoning tasks emerges as the scale of\ncomputation and parameter exceed a certain thresh-\nold (Wei et al., 2022b). Transferring the ICL ability\nto smaller models could facilitate the model deploy-\nment greatly. Magister et al. (2022) showed that it\nis possible to distill the reasoning ability to small\nlanguage models such as T5-XXL. The distillationis achieved by finetuning the small model on the\nchain-of-thought data (Wei et al., 2022c) generated\nby a large teacher model. Although promising per-\nformance is achieved, the improvements are likely\ntask-dependent. 
Further investigation on improv-\ning the reasoning ability by learning from larger\nLLMs could be an interesting direction.\n11.3 ICL Robustness\nPrevious studies have shown that ICL performance\nis extremely unstable, from random guess to SOTA,\nand can be sensitive to many factors, including\ndemonstration permutation, demonstration format,\netc. (Zhao et al., 2021; Lu et al., 2022). The robust-\nness of ICL is a critical yet challenging problem.\nHowever, most of the existing methods fall into\nthe dilemma of accuracy and robustness (Chen\net al., 2022c), or even at the cost of sacrificing\ninference efficiency. To effectively improve the\nrobustness of ICL, we need deeper analysis of the\nworking mechanism of the ICL. We believe that\nthe analysis of the robustness of the ICL from a\nmore theoretical perspective rather than an empir-\nical perspective can highlight future research on\nmore robust ICL., CoRR , abs/2210.03350.\nShuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang\nChen, Yunzhi Yao, Shumin Deng, Chuanqi Tan,\nFei Huang, and Huajun Chen. 2022. Reason-\ning with language model prompting: A survey.\narXiv preprint arXiv:2212.09597 .\nAlec Radford, Jeff Wu, Rewon Child, David Luan,\nDario Amodei, and Ilya Sutskever. 2019. Lan-\nguage models are unsupervised multitask learn-\ners.\nPranav Rajpurkar, Robin Jia, and Percy Liang.\n2018. Know what you don’t know: Unanswer-\nable questions for SQuAD. In Proc. of ACL ,\npages 784–789, Melbourne, Australia. Associa-\ntion for Computational Linguistics.\nOri Ram, Yoav Levine, Itay Dalmedigos, Dor\nMuhlgay, Amnon Shashua, Kevin Leyton-\nBrown, and Yoav Shoham. 2023. In-context\nretrieval-augmented language models. arXiv\npreprint arXiv:2302.00083 .\nOhad Rubin, Jonathan Herzig, and Jonathan Be-\nrant. 2022. Learning to retrieve prompts for\nin-context learning. 
In Proceedings of the 2022\nConference of the North American Chapter of\nthe Association for Computational Linguistics:Human Language Technologies , pages 2655–\n2671, Seattle, United States. Association for\nComputational Linguistics.\nAbulhair Saparov and He He. 2022. Language\nmodels are greedy reasoners: A systematic for-\nmal analysis of chain-of-thought. ArXiv preprint ,\nabs/2210.01240.\nFreda Shi, Mirac Suzgun, Markus Freitag,\nXuezhi Wang, Suraj Srivats, Soroush V osoughi,\nHyung Won Chung, Yi Tay, Sebastian Ruder,\nDenny Zhou, et al. 2022. Language models are\nmultilingual chain-of-thought reasoners.]",Distillation is used to improve the reasoning ability of language models by finetuning a small model on the chain-of-thought data generated by a large teacher model.
2,How do Transformers utilize task recognition ability and implicit empirical risk minimization for implementing a proper function class?,"Transformers utilize task recognition ability and implicit empirical risk minimization to implement a proper function class by leveraging demonstrations. This means that Transformers are able to recognize different tasks and learn from them, and they minimize the empirical risk associated with these tasks to generalize and implement a proper function class. By doing so, Transformers can effectively encode learning algorithms and achieve comparable error rates to traditional estimators.","[7.2 Understanding Why ICL Works\nDistribution of Training Data Concentrating on\nthe pretraining data, Chan et al. (2022) showed that\nthe ICL ability is driven by data distributional prop-\nerties. They found that the ICL ability emerges\nwhen the training data have examples appearing in\nclusters and have enough rare classes. Xie et al.\n(2022) explained ICL as implicit Bayesian infer-\nence and constructed a synthetic dataset to prove\nthat the ICL ability emerges when the pretraining\ndistribution follows a mixture of hidden Markov\nmodels.\nLearning Mechanism By learning linear func-\ntions, Garg et al. (2022) proved that Transformers\ncould encode effective learning algorithms to learn\nunseen linear functions according to demonstra-\ntion samples. They also found that the learning\nalgorithm encoded in an ICL model can achieve\na comparable error to that from a least squares\nestimator. Li et al. (2023g) abstracted ICL as an\nalgorithm learning problem and showed that Trans-\nformers can implement a proper function class\nthrough implicit empirical risk minimization for\nthe demonstrations. Pan et al. (2023) decoupled the\nICL ability into task recognition ability and task\nlearning ability, and further showed how they uti-lize demonstrations. 
From an information-theoretic\nperspective, Hahn and Goyal (2023) showed an er-\nror bound for ICL under linguistically motivated\nassumptions to explain how next-token prediction\ncan bring about the ICL ability. Si et al. (2023)\nfound that large language models exhibit prior fea-\nture biases and showed a way to use intervention\nto avoid unintended features in ICL.\nAnother series of work attempted to build con-\nnections between ICL and gradient descent. Tak-\ning linear regression as a starting point, Akyürek\net al. (2022) found that Transformer-based in-\ncontext learners can implement standard finetun-\ning algorithms implicitly, and von Oswald et al.\n(2022) showed that linear attention-only Transform-\ners with hand-constructed parameters and mod-\nels learned by gradient descent are highly related.\nBased on softmax regression, Li et al., Functional Components Focusing on specific\nfunctional modules, Olsson et al. (2022) found that\nthere exist some induction heads in Transformers\nthat copy previous patterns to complete the next\ntoken. Further, they expanded the function of in-\nduction heads to more abstract pattern matching\nand completion, which may implement ICL. Wang\net al. (2023b) focused on the information flow in\nTransformers and found that during the ICL pro-\ncess, demonstration label words serves as anchors,\nwhich aggregates and distributes key information\nfor the final prediction.\n3Takeaway :(1) Knowing and considering\nhow ICL works can help us improve the ICL per-\nformance, and the factors that strongly correlate to\nICL performance are listed in Table 3. (2) Although\nsome analytical studies have taken a preliminary\nstep to explain ICL, most of them are limited to\nsimple tasks and small models. Extending analysis\non extensive tasks and large models may be the\nnext step to be considered. 
In addition, among ex-\nisting work, explaining ICL with gradient descent\nseems to be a reasonable, general, and promising\ndirection for future research. If we build clear con-]","Transformers utilize task recognition ability and implicit empirical risk minimization for implementing a proper function class by decoupling the ICL ability into task recognition ability and task learning ability, and utilizing demonstrations."
3,How does the METALM model contribute to enhancing the In-Context Learning ability in multi-modal tasks?,"The METALM model contributes to enhancing the In-Context Learning (ICL) ability in multi-modal tasks by exhibiting strong ICL performance across various vision-language tasks. It achieves this through its general-purpose interface to models across tasks and modalities, along with its semi-causal language modeling objective. The model is pretrained and has demonstrated promising results in zero-shot, few-shot, and even multimodal chain-of-thought prompting abilities.","[To\naddress this, Zhang et al. (2023a) investigate two\napproaches: (1) an unsupervised retriever that se-\nlects nearest samples using an off-the-shelf model,\nand (2) a supervised method training an additional\nretriever model to maximize ICL performance. The\nretrieved samples notably enhance performance, ex-\nhibiting semantic similarity to the query and closer\ncontextual alignment regarding viewpoint, back-\nground, and appearance. Except for the prompt\nretrieval, Sun et al. (2023) further explore a prompt\nfusion technique for improving the results.\n9.2 Multi-Modal In-Context Learning\nIn the vision-language area, Tsimpoukelli et al.\n(2021) utilize a vision encoder to represent an im-\nage as a prefix embedding sequence that is aligned\nwith a frozen language model after training on the\npaired image-caption dataset. The resulting model,\nFrozen, is capable of performing multi-modal few-\nshot learning. Further, Alayrac et al. (2022) in-\ntroduce Flamingo, which combines a vision en-\ncoder with LLMs and adopts LLMs as the general\ninterface to perform in-context learning on many\nmulti-modal tasks. They show that training on\nlarge-scale multi-modal web corpora with arbitrar-\nily interleaved text and images is key to endowing\nthem with in-context few-shot learning capabili-\nties. 
Kosmos-1 (Huang et al., 2023b) is another\nmulti-modal LLMs and demonstrates promising\nzero-shot, few-shot, and even multimodal chain-\nof-thought prompting abilities. Hao et al. (2022a)\npresent METALM, a general-purpose interface to\nmodels across tasks and modalities. With a semi-\ncausal language modeling objective, METALM is\npretrained and exhibits strong ICL performance\nacross various vision-language tasks.\nIt is natural to further enhance the ICL ability, with instruction tuning, and the idea is also ex-\nplored in the multi-modal scenarios as well. Re-\ncent explorations first generate instruction tuning\ndatasets transforming existing vision-language task\ndataset (Xu et al., 2022; Li et al., 2023a) or with\npower LLMs such as GPT-4 (Liu et al., 2023; Zhu\net al., 2023a) , and connect LLMs with powerful vi-\nsion foundational models such as BLIP-2 (Li et al.,\n2023c) on these multi-modal datasets (Zhu et al.,\n2023a; Dai et al., 2023).\n9.3 Speech In-Context Learning\nIn the speech area, Wang et al. (2023a) treated text-\nto-speech synthesis as a language modeling task.\nThey use audio codec codes as an intermediate rep-\nresentation and propose the first TTS framework\nwith strong in-context learning capability. Subse-\nquently, V ALLE-X (Zhang et al., 2023b) extend the\nidea to multi-lingual scenarios, demonstrating su-\nperior performance in zero-shot cross-lingual text-\nto-speech synthesis and zero-shot speech-to-speech\ntranslation tasks.\n3Takeaway :(1) Recent studies have explored\nin-context learning beyond natural language with\npromising results. Properly formatted data (e.g.,\ninterleaved image-text datasets for vision-language\ntasks) and architecture designs are key factors\nfor activating the potential of in-context learning.\nExploring it in a more complex structured space\nsuch as for graph data is challenging and promis-\ning (Huang et al., 2023a). 
(2) Findings in textual\nin-context learning demonstration design and selec-\ntion cannot be trivially transferred to other modal-\nities. Domain-specific investigation is required to\nfully leverage the potential of in-context learning\nin various modalities.]","The contribution of the METALM model to enhancing the In-Context Learning ability in multi-modal tasks is not mentioned in the given context. Therefore, the answer is -1."
4,How does symmetry of a symmetric random walk relate to the paths in a graph?,"Symmetry of a symmetric random walk is needed to make the concept of ρ(Γ,H) well-defined. This concept is used to prove an upper bound on the second largest eigenvalue of the random walk. The paths in the graph are related to this symmetry through the set Γ, which consists of all augmenting paths of minimal length in Fc(M), where F is a finite set and M is an augmenting Markov basis. The collection of these paths is used to bound the size of Γ{u,v}, where u and v are distinct vertices in F.","[HEAT-BATH RANDOM WALKS WITH MARKOV BASES 3\n2.Graphs and statistics\nWe ﬁrst introduce the statistical framework in which this pa per lives and recall important\naspects of the interplay between graphs and statistics. A random walk on a graph G= (V,E)\nis a map H:V×V→[0,1] such that for all v∈V,∑\nu∈VH(v,u) = 1 and such that\nH(v,u) = 0 if{v,u} ̸∈E. When there is no ambiguity, we represent a random walk as an\n|V| × |V|-matrix, for example when it is clear how the elements of Vare ordered. Fix a\nrandom walk HonG. ThenHisirreducible if for all v,u∈Vthere exists t∈Nsuch that\nHt(v,u)>0. The random walk Hisreversible if there exists a mass function µ:V→[0,1]\nsuch that µ(u)·H(u,v) =µ(v)·H(v,u) for allu,v∈Vandsymmetric ifHis a symmetric\nmap. A mass function π:V→[0,1] is astationary distribution ofHifπ◦ H=π. For\nsymmetric random walks, the uniform distribution on Vis always a stationary distribution.\nIf|V|=n, then we denote the eigenvalues of Hby 1 =λ1(H)≥λ2(H)≥ ··· ≥λn(H)≥ −1\nand we write λ(H) := max {λ2(H),−λn(H)}for thesecond largest eigenvalue modulus ofH.\nAny irreducible random walk has a unique stationary distrib ution [14, Corollary 1.17] and\nλ(H)∈[0,1] measures the convergence rate: the smaller λ(H), the faster the convergence.\nThe aim of this paper is to study random walks on lattice point s that use a set of moves., Observe that symmetry of His needed to make ρ(Γ,H) well-deﬁned. 
This can be used to\nprove the following upper bound on the second largest eigenv alue.\nLemma 5.7. LetGbe a graph, Hbe a symmetric random walk on G, andΓbe a set of\ncanonical paths in G. Thenλ2(H)≤1−1\nρ(Γ,H).\nProof.The stationary distribution of His the uniform distribution and thus the statement\nis a direct consequence of [23, Theorem 5], since ρ(Γ,H) is an upper bound on the constant\ndeﬁned in [23, equation 4]. □\nTheorem 5.8. LetF ⊂Zdbe ﬁnite and let M:={m1,...,m k} ⊂Zdbe an augmenting\nMarkov basis. Let πbe the uniform and fbe a positive distribution on FandMrespectively.\nFori∈[k], letri:= max{|RF,mi(u)|:u∈ F}and suppose that r1≥r2≥ ··· ≥rk. Then\nλ(Hπ,f\nM,F)≤1−|F|·min(f)\nAM(F)·AM(F)!·3AM(F)−1·2|M|·r1r2···rAM(F).\nProof.Choose for any distinct u,v∈ Fan augmenting path pu,vof minimal length in Fc(M)\nand let Γ be the collection of all these paths. Let u+µmk=vbe an edge in Fc(M), then\nour goal is to bound |Γ{u,v}|from above. Let S:={S⊆[r] :|S| ≤ AM(F),k∈S}and take\nany path px,y∈Γ{u,v}. Then there exists S:={i1,...,is}withs:=|S| ≤ AM(F) such that\nx+∑s\nk=1λikmik=y.]",The answer to the question is not present in the given context.
5,"How is the concept of canonical paths connected to a symmetric random walk, according to the provided context?","The concept of canonical paths is connected to a symmetric random walk in the context by defining a set of paths that are used in the random walk. These paths are called canonical paths and are represented by the set Γ. The symmetric random walk H is then defined based on these canonical paths. The concept of canonical paths is used to calculate the value ρ(Γ,H), which is used in Lemma 5.7 to provide an upper bound on the second largest eigenvalue of the random walk.","[By assumption, there exists\nan augmenting path from wtow+cvusing only relements from M. Put diﬀerently, the\nelementcvfromVcan be represented by a linear combination of rvectors from M. Sincev\nwas chosen arbitrarily, Lemma 5.4implies dim( P) = dim( V)≤r. □\nRemark 5.6. It is a consequence from Proposition 5.5that for any matrix A∈Zm×dwith\nkerZ(A)∩Nd={0}and an augmenting Markov basis M, there exists F ∈ P Asuch that\nAM(F)≥dim(ker Z(A)).\nLet us now shortly recall the framework from [23] which is nec essary to prove our main\ntheorem. Let G= (V,E) be a graph. For any ordered pair of distinct nodes ( x,y)∈V×V,\nletpx,y⊆Ebe a path from xtoyinGand let Γ := {px,y: (x,y)∈V×V,x̸=y}be\nthe collection of these paths, then Γ is a set of canonical paths . Let for any edge e∈E,\nΓe:={p∈Γ :e∈p}be the set of paths from Γ that use e. Now, let H:V×V→[0,1] be a\nsymmetric random walk on Gand deﬁne\nρ(Γ,H) :=max{|p|:p∈Γ}\n|V|·max\n{u,v}∈E|Γ{u,v}|\nH(u,v).\nObserve that symmetry of His needed to make ρ(Γ,H) well-deﬁned. This can be used to\nprove the following upper bound on the second largest eigenv alue.\nLemma 5.7. LetGbe a graph, Hbe a symmetric random walk on G, andΓbe a set of\ncanonical paths in G. 
Thenλ2(H)≤1−1\nρ(Γ,H)., HEAT-BATH RANDOM WALKS WITH MARKOV BASES 3\n2.Graphs and statistics\nWe ﬁrst introduce the statistical framework in which this pa per lives and recall important\naspects of the interplay between graphs and statistics. A random walk on a graph G= (V,E)\nis a map H:V×V→[0,1] such that for all v∈V,∑\nu∈VH(v,u) = 1 and such that\nH(v,u) = 0 if{v,u} ̸∈E. When there is no ambiguity, we represent a random walk as an\n|V| × |V|-matrix, for example when it is clear how the elements of Vare ordered. Fix a\nrandom walk HonG. ThenHisirreducible if for all v,u∈Vthere exists t∈Nsuch that\nHt(v,u)>0. The random walk Hisreversible if there exists a mass function µ:V→[0,1]\nsuch that µ(u)·H(u,v) =µ(v)·H(v,u) for allu,v∈Vandsymmetric ifHis a symmetric\nmap. A mass function π:V→[0,1] is astationary distribution ofHifπ◦ H=π. For\nsymmetric random walks, the uniform distribution on Vis always a stationary distribution.\nIf|V|=n, then we denote the eigenvalues of Hby 1 =λ1(H)≥λ2(H)≥ ··· ≥λn(H)≥ −1\nand we write λ(H) := max {λ2(H),−λn(H)}for thesecond largest eigenvalue modulus ofH.\nAny irreducible random walk has a unique stationary distrib ution [14, Corollary 1.17] and\nλ(H)∈[0,1] measures the convergence rate: the smaller λ(H), the faster the convergence.\nThe aim of this paper is to study random walks on lattice point s that use a set of moves.]",
6,"How is the length of the ray RF,m(v) in Algorithm 1 related to the number of rows of A in a normal set F={u∈Zd:Au≤b} given in H-representation?","The length of the ray RF,m(v) in Algorithm 1 is not directly related to the number of rows of A in a normal set F={u∈Zd:Au≤b} given in H-representation. The length of the ray RF,m(v) depends on the specific Markov move m and the vector v in the set F. It is determined by the intersection of the ray R with the set F along the Markov move m. The number of rows of A in the H-representation of F represents the number of hyperplanes that define the set F.","[The longest ray through Falong\nvectors of MisRF,M:= argmax {|RF,m(u)|:m∈ M,u∈ F}.\nCorollary 5.10. Let(Fi)i∈Nbe a sequence of ﬁnite sets in Zdand letπibe the uniform\ndistribution on Fi. LetM ⊂Zdbe an augmenting Markov basis for FiwithAM(Fi)≤\ndim(Fi)and suppose that (|RFi,M|)dim(Fi))i∈N∈ O(|Fi|)i∈N. Then for any positive mass\nfunction f:M →[0,1], there exists ǫ >0such that λ(Hπi,f\nFi,M)≤1−ǫfor alli∈N.\nProof.This is a straightforward application of Theorem 5.8. □\nCorollary 5.11. LetP ⊂Zdbe a polytope, Fi:= (i· P)∩Zdfori∈N, and let πibe the\nuniform distribution on Fi. Suppose that M ⊂Zdis an augmenting Markov basis {Fi:i∈N}\nsuch that AM(Fi)≤dim(P)for alli∈N. Then for any positive mass function f:M →[0,1],\nthere exists ǫ >0such that λ(Hπi,f\nFi,M)≤1−ǫfor alli∈N.\nProof.Letr:= dim(P). Weﬁrstshowthat( |RFi,M|)i∈N∈ O(i)i∈N. WriteM={m1,...,m k}\nand denote by li:= max{|(u+mi·Z)∩P|:u∈ P}be the length of the longest ray through\nthe polytope Palongmi. It suﬃces to prove that i·(lk+ 1) is an upper bound on the\nlength of any ray along mkthrough Fi., Sincexandx′are in the same connected component VaofF(M), letyi0,...,yir∈ F\nbe the nodes on a minimal path in Fc(M) withyi0=xandyir=x′. For any s∈[r],yis\nandyis−1are contained in the same ray Rks\ntscoming from a Markov move mks. 
In particular,\nRts−1\nks−1∩Rks\nts̸=∅and due to our observation made above λi\nj=λk1\nt1=λk2\nt2=···=λkr\ntr=λi\nj′\nwhich ﬁnishes the proof. □\nDeﬁnition 4.8. LetF ⊂ZdandM ⊂Zdbe ﬁnite sets and M′⊆ M. LetVbe the set of\nconnected components of F(M\M′) andRbetheset of all rays through Falongall elements\nofM′. Theray matrix ofF(M) alongM′isAF(M,M′) := (|R∩V|)R∈R,V∈V∈NR×V.\nExample 4.9. LetF= [3]×[3],M={e1,e2,e1+e2}, andM′={e1,e2}. ThenF(M\M′)\nhas ﬁve connected components and the ray matrix of F(M) alongM′is\nAF(M,M′) =\n1 1 1 0 0\n0 1 1 1 0\n0 0 1 1 1\n0 0 1 1 1\n0 1 1 1 0\n1 1 1 0 0\n.\nRemark 4.10.]",
7,"""What's the new paradigm in natural language processing that enables large language models to learn tasks with few examples and its relation to in-context learning?""","The new paradigm in natural language processing that enables large language models to learn tasks with few examples is called in-context learning (ICL). In-context learning allows language models to learn and generalize from a small number of demonstration examples, rather than relying on extensive fine-tuning or training on large datasets. This approach aims to leverage the intrinsic capabilities of large language models without the need for extensive task-specific training. In-context learning enables models to perform well on a wide range of tasks, even those for which they have not been explicitly trained.","[(2020) found that GPT-\n3 can achieve results comparable to state-of-the-\nart (SOTA) finetuning performance on COPA and\nReCoRD, but still falls behind finetuning on most\nNLU tasks. Hao et al. (2022b) showed the po-\ntential of scaling up the number of demonstration\nexamples. However, the improvement brought by\nscaling is very limited. At present, compared to\nfinetuning, there still remains some room for ICL\nto reach on traditional NLP tasks.\n8.2 New Challenging Tasks\nIn the era of large language models with in-context\nlearning capabilities, researchers are more inter-\nested in evaluating the intrinsic capabilities of large\nlanguage models without downstream task finetun-\ning (Bommasani et al., 2021).\nTo explore the capability limitations of LLM on\nvarious tasks, Srivastava et al. (2022) proposed\nthe BIG-Bench (Srivastava et al., 2022), a large\nbenchmark covering a large range of tasks, includ-\ning linguistics, chemistry, biology, social behav-\nior, and beyond. The best models have already\noutperformed the average reported human-rater\nresults on 65% of the BIG-Bench tasks throughICL (Suzgun et al., 2022). 
To further explore tasks\nactually unsolvable by current language models,\nSuzgun et al. (2022) proposed a more challenging\nICL benchmark, BIG-Bench Hard (BBH). BBH in-\ncludes 23 unsolved tasks, constructed by selecting\nchallenging tasks where the state-of-art model per-\nformances are far below the human performances.\nBesides, researchers are searching for inverse scal-\ning tasks,1that is, tasks where model performance\nreduces when scaling up the model size. Such\ntasks also highlight potential issues with the cur-\nrent paradigm of ICL. To further probe the model\ngeneralization ability, Iyer et al. (2022) proposed\nOPT-IML Bench, consisting of 2000 NLP tasks\nfrom 8 existing benchmarks, especially benchmark\nfor ICL on held-out categories., In Proc. of ICLR . OpenRe-\nview.net.\nOr Honovich, Uri Shaham, Samuel R. Bowman,\nand Omer Levy. 2022. Instruction induction:\nFrom few examples to natural language task de-\nscriptions. CoRR , abs/2205.10782.\nQian Huang, Hongyu Ren, Peng Chen, Gre-\ngor Kržmanc, Daniel Zeng, Percy Liang, and\nJure Leskovec. 2023a. Prodigy: Enabling in-\ncontext learning over graphs. arXiv preprint\narXiv:2305.12600 .\nShaohan Huang, Li Dong, Wenhui Wang, Yaru\nHao, Saksham Singhal, Shuming Ma, Tengchao\nLv, Lei Cui, Owais Khan Mohammed, Qiang\nLiu, et al. 2023b. Language is not all you\nneed: Aligning perception with language models.\narXiv preprint arXiv:2302.14045 .\nSrinivasan Iyer, Xi Victoria Lin, Ramakanth Pa-\nsunuru, Todor Mihaylov, Daniel Simig, Ping\nYu, Kurt Shuster, Tianlu Wang, Qing Liu,\nPunit Singh Koura, Xian Li, Brian O’Horo,\nGabriel Pereyra, Jeff Wang, Christopher Dewan,\nAsli Celikyilmaz, Luke Zettlemoyer, and Ves\nStoyanov. 2022. Opt-iml: Scaling language\nmodel instruction meta learning through the lens\nof generalization.Muhammad Khalifa, Lajanugen Logeswaran,\nMoontae Lee, Honglak Lee, and Lu Wang.\n2023. Exploring demonstration ensembling for\nin-context learning. 
In ICLR 2023 Workshop on\nMathematical and Empirical Understanding of\nFoundation Models .\nHanieh Khorashadizadeh, Nandana Mihindukula-\nsooriya, Sanju Tiwari, Jinghua Groppe, and Sven\nGroppe. 2023. Exploring in-context learning\ncapabilities of foundation models for generat-\ning knowledge graphs from text.]",The new paradigm in natural language processing that enables large language models to learn tasks with few examples is called in-context learning (ICL). In-context learning refers to the ability of LLMs to make predictions based on contexts augmented with a few examples.
8,"""What are the potential applications of ICL in data engineering and how can it improve knowledge graph construction?""","ICL has the potential to be widely applied in data engineering. It offers benefits such as generating high-quality data at a low cost compared to human annotation or noisy automatic annotation methods. In the context of data engineering, ICL can significantly improve knowledge graph construction. It has been demonstrated that ICL can enhance the automatic construction and completion of knowledge graphs, leading to a reduction in manual costs with minimal engineering effort. Leveraging the capabilities of ICL in various data engineering applications can yield significant benefits, including improving the state of the art in knowledge graph construction.","[Moreover, ICL offers potential for popular meth-ods such as meta-learning and instruction-tuning.\nChen et al. (2022d) applied ICL to meta-learning,\nadapting to new tasks with frozen model parame-\nters, thus addressing the complex nested optimiza-\ntion issue. (Ye et al., 2023b) enhanced zero-shot\ntask generalization performance for both pretrained\nand instruction-finetuned models by applying in-\ncontext learning to instruction learning.\nSpecifically, we explore several emerging and\nprevalent applications of ICL, showcasing their\npotential in the following paragraphs.\nData Engineering ICL has manifested the po-\ntential to be widely applied in data engineering.\nBenefiting from the strong ICL ability, it costs 50%\nto 96% less to use labels from GPT-3 than using la-\nbels from humans for data annotation. Combining\npseudo labels from GPT-3 with human labels leads\nto even better performance at a small cost (Wang\net al., 2021). In more complex scenarios, such as\nknowledge graph construction, Khorashadizadeh\net al. 
(2023) has demonstrated that ICL has the po-\ntential to significantly improve the state of the art of\nautomatic construction and completion of knowl-\nedge graphs, resulting in a reduction in manual\ncosts with minimal engineering effort. Therefore,\nleveraging the capabilities of ICL in various data\nengineering applications can yield significant bene-\nfits. Compared to human annotation (e.g., crowd-\nsourcing) or noisy automatic annotation (e.g., dis-\ntant supervision), ICL generates relatively high\nquality data at a low cost. However, how to use ICL\nfor data annotation remains an open question. For\nexample, Ding et al. (2022) performed a compre-\nhensive analysis and found that generation-based\nmethods are more cost-effective in using GPT-3\nthan annotating unlabeled data via ICL.\nModel Augmentating The context-flexible na-\nture of ICL demonstrates significant potential to\nenhance retrieval-augmented methods., Therefore,\nleveraging the capabilities of ICL in various data\nengineering applications can yield significant bene-\nfits. Compared to human annotation (e.g., crowd-\nsourcing) or noisy automatic annotation (e.g., dis-\ntant supervision), ICL generates relatively high\nquality data at a low cost. However, how to use ICL\nfor data annotation remains an open question. For\nexample, Ding et al. (2022) performed a compre-\nhensive analysis and found that generation-based\nmethods are more cost-effective in using GPT-3\nthan annotating unlabeled data via ICL.\nModel Augmentating The context-flexible na-\nture of ICL demonstrates significant potential to\nenhance retrieval-augmented methods. By keep-\ning the LM architecture unchanged and prepend-\ning grounding documents to the input, in-context\nRALMRam et al. (2023) effectively utilizes off-\nthe-shelf general-purpose retrievers, resulting in\nsubstantial LM gains across various model sizes\nand diverse corpora. Furthermore, ICL for retrieval\nalso exhibits the potential to improve safety. 
In ad-\ndition to efficiency and flexibility, ICL also shows\npotential in safety (Panda et al., 2023), (Meade\net al., 2023) use ICL for retrieved demonstrations\nto steer a model towards safer generations, reduc-]",
9,"What does the survey on advanced in-context learning techniques cover, including strategies for training and designing demonstrations?","The survey on advanced in-context learning techniques covers strategies for training and designing demonstrations. This includes training strategies, demonstration designing strategies, evaluation datasets and resources, as well as related analytical studies.","[ment aims to improve the scalability and efficiency\nof ICL. As LMs continue to scale up, exploring\nways to effectively and efficiently utilize a larger\nnumber of demonstrations in ICL remains an ongo-\ning area of research.\n12 Conclusion\nIn this paper, we survey the existing ICL literature\nand provide an extensive review of advanced ICL\ntechniques, including training strategies, demon-\nstration designing strategies, evaluation datasets\nand resources, as well as related analytical studies.\nFurthermore, we highlight critical challenges and\npotential directions for future research. To the best\nof our knowledge, this is the first survey about ICL.\nWe hope this survey can highlight the current re-\nsearch status of ICL and shed light on future work\non this promising paradigm.\nReferences\nEkin Akyürek, Dale Schuurmans, Jacob An-\ndreas, Tengyu Ma, and Denny Zhou. 2022.\nWhat learning algorithm is in-context learn-\ning? investigations with linear models. CoRR ,\nabs/2211.15661.\nJean-Baptiste Alayrac, Jeff Donahue, Pauline Luc,\nAntoine Miech, Iain Barr, Yana Hasson, Karel\nLenc, Arthur Mensch, Katherine Millican, Mal-\ncolm Reynolds, et al. 2022. Flamingo: a vi-\nsual language model for few-shot learning. Ad-\nvances in Neural Information Processing Sys-\ntems, 35:23716–23736.\nShengnan An, Zeqi Lin, Qiang Fu, Bei Chen,\nNanning Zheng, Jian-Guang Lou, and Dong-\nmei Zhang. 2023. How do in-context exam-\nples affect compositional generalization? CoRR ,\nabs/2305.04835.\nAmir Bar, Yossi Gandelsman, Trevor Darrell,\nAmir Globerson, and Alexei Efros. 2022. 
Vi-\nsual prompting via image inpainting. Ad-\nvances in Neural Information Processing Sys-\ntems, 35:25005–25017.\nRichard Bellman. 1957. A markovian decision\nprocess. Journal of mathematics and mechanics ,\npages 679–684., (3) The perfor-\nmance advancement made by warmup encounters\na plateau when increasingly scaling up the training\ndata. This phenomenon appears both in supervised\nin-context training and self-supervised in-context\ntraining, indicating that LLMs only need a small\namount of data to adapt to learn from the context\nduring warmup.\n5 Demonstration Designing\nMany studies have shown that the performance\nof ICL strongly relies on the demonstration sur-\nface, including demonstration format, the order of\ndemonstration examples, and so on (Zhao et al.,\n2021; Lu et al., 2022). As demonstrations play a vi-\ntal role in ICL, in this section, we survey demonstra-\ntion designing strategies and classify them into two\ngroups: demonstration organization and demonstra-\ntion formatting, as shown in Table 1.\n5.1 Demonstration Organization\nGiven a pool of training examples, demonstration\norganization focuses on how to select a subset of\nexamples and the order of the selected examples.\n5.1.1 Demonstration Selection\nDemonstrations selection aims to answer a funda-\nmental question: Which examples are good exam-ples for ICL? We classify related studies into two\ncategories, including unsupervised methods based\non pre-defined metrics and supervised methods.\nUnsupervised Method Liu et al. (2022) showed\nthat selecting the closest neighbors as the in-context\nexamples is a good solution. The distance metrics\nare pre-defined L2 distance or cosine-similarity\ndistance based on sentence embeddings. They\nproposed KATE, a kNN-based unsupervised re-\ntriever for selecting in-context examples. 
In addi-\ntion to distance metrics, mutual information is also\na valuable selection metric (Sorensen et al., 2022).\nSimilarly, k-NN cross-lingual demonstrations can\nbe retrieved for multi-lingual ICL (Tanwar et al.,\n2023) to strengthen source-target language align-\nment. The advantage of mutual information is that\nit does not require labeled examples and specific\nLLMs. In addition, Gonen et al. (2022) attempted\nto choose prompts with low perplexity. Levy et al.]",The survey on advanced in-context learning techniques covers strategies for training and designing demonstrations.


In [8]:
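import phoenix as px
from phoenix.session.evaluation import get_qa_with_reference

# Connect to the running Phoenix session and pull each traced query together
# with its response and the retrieved reference text, indexed by span ID.
client = px.Client()
spans_dataframe = get_qa_with_reference(client)
spans_dataframe.head()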

context.span_id,input,output,reference
bf7d55a697974b32beb362984e39e1b4,"What does the survey on advanced in-context learning techniques cover, including strategies for training and designing demonstrations?","The survey on advanced in-context learning techniques covers strategies for training and designing demonstrations. This includes training strategies, demonstration designing strategies, evaluation datasets and resources, as well as related analytical studies.","ment aims to improve the scalability and efficiency\nof ICL. As LMs continue to scale up, exploring\nways to effectively and efficiently utilize a larger\nnumber of demonstrations in ICL remains an ongo-\ning area of research.\n12 Conclusion\nIn this paper, we survey the existing ICL literature\nand provide an extensive review of advanced ICL\ntechniques, including training strategies, demon-\nstration designing strategies, evaluation datasets\nand resources, as well as related analytical studies.\nFurthermore, we highlight critical challenges and\npotential directions for future research. To the best\nof our knowledge, this is the first survey about ICL.\nWe hope this survey can highlight the current re-\nsearch status of ICL and shed light on future work\non this promising paradigm.\nReferences\nEkin Akyürek, Dale Schuurmans, Jacob An-\ndreas, Tengyu Ma, and Denny Zhou. 2022.\nWhat learning algorithm is in-context learn-\ning? investigations with linear models. CoRR ,\nabs/2211.15661.\nJean-Baptiste Alayrac, Jeff Donahue, Pauline Luc,\nAntoine Miech, Iain Barr, Yana Hasson, Karel\nLenc, Arthur Mensch, Katherine Millican, Mal-\ncolm Reynolds, et al. 2022. Flamingo: a vi-\nsual language model for few-shot learning. Ad-\nvances in Neural Information Processing Sys-\ntems, 35:23716–23736.\nShengnan An, Zeqi Lin, Qiang Fu, Bei Chen,\nNanning Zheng, Jian-Guang Lou, and Dong-\nmei Zhang. 2023. How do in-context exam-\nples affect compositional generalization? 
CoRR ,\nabs/2305.04835.\nAmir Bar, Yossi Gandelsman, Trevor Darrell,\nAmir Globerson, and Alexei Efros. 2022. Vi-\nsual prompting via image inpainting. Ad-\nvances in Neural Information Processing Sys-\ntems, 35:25005–25017.\nRichard Bellman. 1957. A markovian decision\nprocess. Journal of mathematics and mechanics ,\npages 679–684.\n\n(3) The perfor-\nmance advancement made by warmup encounters\na plateau when increasingly scaling up the training\ndata. This phenomenon appears both in supervised\nin-context training and self-supervised in-context\ntraining, indicating that LLMs only need a small\namount of data to adapt to learn from the context\nduring warmup.\n5 Demonstration Designing\nMany studies have shown that the performance\nof ICL strongly relies on the demonstration sur-\nface, including demonstration format, the order of\ndemonstration examples, and so on (Zhao et al.,\n2021; Lu et al., 2022). As demonstrations play a vi-\ntal role in ICL, in this section, we survey demonstra-\ntion designing strategies and classify them into two\ngroups: demonstration organization and demonstra-\ntion formatting, as shown in Table 1.\n5.1 Demonstration Organization\nGiven a pool of training examples, demonstration\norganization focuses on how to select a subset of\nexamples and the order of the selected examples.\n5.1.1 Demonstration Selection\nDemonstrations selection aims to answer a funda-\nmental question: Which examples are good exam-ples for ICL? We classify related studies into two\ncategories, including unsupervised methods based\non pre-defined metrics and supervised methods.\nUnsupervised Method Liu et al. (2022) showed\nthat selecting the closest neighbors as the in-context\nexamples is a good solution. The distance metrics\nare pre-defined L2 distance or cosine-similarity\ndistance based on sentence embeddings. They\nproposed KATE, a kNN-based unsupervised re-\ntriever for selecting in-context examples. 
In addi-\ntion to distance metrics, mutual information is also\na valuable selection metric (Sorensen et al., 2022).\nSimilarly, k-NN cross-lingual demonstrations can\nbe retrieved for multi-lingual ICL (Tanwar et al.,\n2023) to strengthen source-target language align-\nment. The advantage of mutual information is that\nit does not require labeled examples and specific\nLLMs. In addition, Gonen et al. (2022) attempted\nto choose prompts with low perplexity. Levy et al."
5f0c5f7f0da342a4bcc229544c0962fa,"""What are the potential applications of ICL in data engineering and how can it improve knowledge graph construction?""","ICL has the potential to be widely applied in data engineering. It offers benefits such as generating high-quality data at a low cost compared to human annotation or noisy automatic annotation methods. In the context of data engineering, ICL can significantly improve knowledge graph construction. It has been demonstrated that ICL can enhance the automatic construction and completion of knowledge graphs, leading to a reduction in manual costs with minimal engineering effort. Leveraging the capabilities of ICL in various data engineering applications can yield significant benefits, including improving the state of the art in knowledge graph construction.","Moreover, ICL offers potential for popular meth-ods such as meta-learning and instruction-tuning.\nChen et al. (2022d) applied ICL to meta-learning,\nadapting to new tasks with frozen model parame-\nters, thus addressing the complex nested optimiza-\ntion issue. (Ye et al., 2023b) enhanced zero-shot\ntask generalization performance for both pretrained\nand instruction-finetuned models by applying in-\ncontext learning to instruction learning.\nSpecifically, we explore several emerging and\nprevalent applications of ICL, showcasing their\npotential in the following paragraphs.\nData Engineering ICL has manifested the po-\ntential to be widely applied in data engineering.\nBenefiting from the strong ICL ability, it costs 50%\nto 96% less to use labels from GPT-3 than using la-\nbels from humans for data annotation. Combining\npseudo labels from GPT-3 with human labels leads\nto even better performance at a small cost (Wang\net al., 2021). In more complex scenarios, such as\nknowledge graph construction, Khorashadizadeh\net al. 
(2023) has demonstrated that ICL has the po-\ntential to significantly improve the state of the art of\nautomatic construction and completion of knowl-\nedge graphs, resulting in a reduction in manual\ncosts with minimal engineering effort. Therefore,\nleveraging the capabilities of ICL in various data\nengineering applications can yield significant bene-\nfits. Compared to human annotation (e.g., crowd-\nsourcing) or noisy automatic annotation (e.g., dis-\ntant supervision), ICL generates relatively high\nquality data at a low cost. However, how to use ICL\nfor data annotation remains an open question. For\nexample, Ding et al. (2022) performed a compre-\nhensive analysis and found that generation-based\nmethods are more cost-effective in using GPT-3\nthan annotating unlabeled data via ICL.\nModel Augmentating The context-flexible na-\nture of ICL demonstrates significant potential to\nenhance retrieval-augmented methods.\n\nTherefore,\nleveraging the capabilities of ICL in various data\nengineering applications can yield significant bene-\nfits. Compared to human annotation (e.g., crowd-\nsourcing) or noisy automatic annotation (e.g., dis-\ntant supervision), ICL generates relatively high\nquality data at a low cost. However, how to use ICL\nfor data annotation remains an open question. For\nexample, Ding et al. (2022) performed a compre-\nhensive analysis and found that generation-based\nmethods are more cost-effective in using GPT-3\nthan annotating unlabeled data via ICL.\nModel Augmentating The context-flexible na-\nture of ICL demonstrates significant potential to\nenhance retrieval-augmented methods. By keep-\ning the LM architecture unchanged and prepend-\ning grounding documents to the input, in-context\nRALMRam et al. (2023) effectively utilizes off-\nthe-shelf general-purpose retrievers, resulting in\nsubstantial LM gains across various model sizes\nand diverse corpora. Furthermore, ICL for retrieval\nalso exhibits the potential to improve safety. 
In ad-\ndition to efficiency and flexibility, ICL also shows\npotential in safety (Panda et al., 2023), (Meade\net al., 2023) use ICL for retrieved demonstrations\nto steer a model towards safer generations, reduc-"
5b2ec97561fc44e19e7e809860ed49b0,"""What's the new paradigm in natural language processing that enables large language models to learn tasks with few examples and its relation to in-context learning?""","The new paradigm in natural language processing that enables large language models to learn tasks with few examples is called in-context learning (ICL). In-context learning allows language models to learn and generalize from a small number of demonstration examples, rather than relying on extensive fine-tuning or training on large datasets. This approach aims to leverage the intrinsic capabilities of large language models without the need for extensive task-specific training. In-context learning enables models to perform well on a wide range of tasks, even those for which they have not been explicitly trained.","(2020) found that GPT-\n3 can achieve results comparable to state-of-the-\nart (SOTA) finetuning performance on COPA and\nReCoRD, but still falls behind finetuning on most\nNLU tasks. Hao et al. (2022b) showed the po-\ntential of scaling up the number of demonstration\nexamples. However, the improvement brought by\nscaling is very limited. At present, compared to\nfinetuning, there still remains some room for ICL\nto reach on traditional NLP tasks.\n8.2 New Challenging Tasks\nIn the era of large language models with in-context\nlearning capabilities, researchers are more inter-\nested in evaluating the intrinsic capabilities of large\nlanguage models without downstream task finetun-\ning (Bommasani et al., 2021).\nTo explore the capability limitations of LLM on\nvarious tasks, Srivastava et al. (2022) proposed\nthe BIG-Bench (Srivastava et al., 2022), a large\nbenchmark covering a large range of tasks, includ-\ning linguistics, chemistry, biology, social behav-\nior, and beyond. The best models have already\noutperformed the average reported human-rater\nresults on 65% of the BIG-Bench tasks throughICL (Suzgun et al., 2022). 
To further explore tasks\nactually unsolvable by current language models,\nSuzgun et al. (2022) proposed a more challenging\nICL benchmark, BIG-Bench Hard (BBH). BBH in-\ncludes 23 unsolved tasks, constructed by selecting\nchallenging tasks where the state-of-art model per-\nformances are far below the human performances.\nBesides, researchers are searching for inverse scal-\ning tasks,1that is, tasks where model performance\nreduces when scaling up the model size. Such\ntasks also highlight potential issues with the cur-\nrent paradigm of ICL. To further probe the model\ngeneralization ability, Iyer et al. (2022) proposed\nOPT-IML Bench, consisting of 2000 NLP tasks\nfrom 8 existing benchmarks, especially benchmark\nfor ICL on held-out categories.\n\nIn Proc. of ICLR . OpenRe-\nview.net.\nOr Honovich, Uri Shaham, Samuel R. Bowman,\nand Omer Levy. 2022. Instruction induction:\nFrom few examples to natural language task de-\nscriptions. CoRR , abs/2205.10782.\nQian Huang, Hongyu Ren, Peng Chen, Gre-\ngor Kržmanc, Daniel Zeng, Percy Liang, and\nJure Leskovec. 2023a. Prodigy: Enabling in-\ncontext learning over graphs. arXiv preprint\narXiv:2305.12600 .\nShaohan Huang, Li Dong, Wenhui Wang, Yaru\nHao, Saksham Singhal, Shuming Ma, Tengchao\nLv, Lei Cui, Owais Khan Mohammed, Qiang\nLiu, et al. 2023b. Language is not all you\nneed: Aligning perception with language models.\narXiv preprint arXiv:2302.14045 .\nSrinivasan Iyer, Xi Victoria Lin, Ramakanth Pa-\nsunuru, Todor Mihaylov, Daniel Simig, Ping\nYu, Kurt Shuster, Tianlu Wang, Qing Liu,\nPunit Singh Koura, Xian Li, Brian O’Horo,\nGabriel Pereyra, Jeff Wang, Christopher Dewan,\nAsli Celikyilmaz, Luke Zettlemoyer, and Ves\nStoyanov. 2022. Opt-iml: Scaling language\nmodel instruction meta learning through the lens\nof generalization.Muhammad Khalifa, Lajanugen Logeswaran,\nMoontae Lee, Honglak Lee, and Lu Wang.\n2023. Exploring demonstration ensembling for\nin-context learning. 
In ICLR 2023 Workshop on\nMathematical and Empirical Understanding of\nFoundation Models .\nHanieh Khorashadizadeh, Nandana Mihindukula-\nsooriya, Sanju Tiwari, Jinghua Groppe, and Sven\nGroppe. 2023. Exploring in-context learning\ncapabilities of foundation models for generat-\ning knowledge graphs from text."
ed867e82c64446d6b3950f6f90e17316,"How is the length of the ray RF,m(v) in Algorithm 1 related to the number of rows of A in a normal set F={u∈Zd:Au≤b} given in H-representation?","The length of the ray RF,m(v) in Algorithm 1 is not directly related to the number of rows of A in a normal set F={u∈Zd:Au≤b} given in H-representation. The length of the ray RF,m(v) depends on the specific Markov move m and the vector v in the set F. It is determined by the intersection of the ray R with the set F along the Markov move m. The number of rows of A in the H-representation of F represents the number of hyperplanes that define the set F.","The longest ray through Falong\nvectors of MisRF,M:= argmax {|RF,m(u)|:m∈ M,u∈ F}.\nCorollary 5.10. Let(Fi)i∈Nbe a sequence of ﬁnite sets in Zdand letπibe the uniform\ndistribution on Fi. LetM ⊂Zdbe an augmenting Markov basis for FiwithAM(Fi)≤\ndim(Fi)and suppose that (|RFi,M|)dim(Fi))i∈N∈ O(|Fi|)i∈N. Then for any positive mass\nfunction f:M →[0,1], there exists ǫ >0such that λ(Hπi,f\nFi,M)≤1−ǫfor alli∈N.\nProof.This is a straightforward application of Theorem 5.8. □\nCorollary 5.11. LetP ⊂Zdbe a polytope, Fi:= (i· P)∩Zdfori∈N, and let πibe the\nuniform distribution on Fi. Suppose that M ⊂Zdis an augmenting Markov basis {Fi:i∈N}\nsuch that AM(Fi)≤dim(P)for alli∈N. Then for any positive mass function f:M →[0,1],\nthere exists ǫ >0such that λ(Hπi,f\nFi,M)≤1−ǫfor alli∈N.\nProof.Letr:= dim(P). Weﬁrstshowthat( |RFi,M|)i∈N∈ O(i)i∈N. WriteM={m1,...,m k}\nand denote by li:= max{|(u+mi·Z)∩P|:u∈ P}be the length of the longest ray through\nthe polytope Palongmi. It suﬃces to prove that i·(lk+ 1) is an upper bound on the\nlength of any ray along mkthrough Fi.\n\nSincexandx′are in the same connected component VaofF(M), letyi0,...,yir∈ F\nbe the nodes on a minimal path in Fc(M) withyi0=xandyir=x′. For any s∈[r],yis\nandyis−1are contained in the same ray Rks\ntscoming from a Markov move mks. 
In particular,\nRts−1\nks−1∩Rks\nts̸=∅and due to our observation made above λi\nj=λk1\nt1=λk2\nt2=···=λkr\ntr=λi\nj′\nwhich ﬁnishes the proof. □\nDeﬁnition 4.8. LetF ⊂ZdandM ⊂Zdbe ﬁnite sets and M′⊆ M. LetVbe the set of\nconnected components of F(M\M′) andRbetheset of all rays through Falongall elements\nofM′. Theray matrix ofF(M) alongM′isAF(M,M′) := (|R∩V|)R∈R,V∈V∈NR×V.\nExample 4.9. LetF= [3]×[3],M={e1,e2,e1+e2}, andM′={e1,e2}. ThenF(M\M′)\nhas ﬁve connected components and the ray matrix of F(M) alongM′is\nAF(M,M′) =\n1 1 1 0 0\n0 1 1 1 0\n0 0 1 1 1\n0 0 1 1 1\n0 1 1 1 0\n1 1 1 0 0\n.\nRemark 4.10."
a9594f63496b44f5bf72fb5019f1c5fa,"How is the concept of canonical paths connected to a symmetric random walk, according to the provided context?","The concept of canonical paths is connected to a symmetric random walk in the context by defining a set of paths that are used in the random walk. These paths are called canonical paths and are represented by the set Γ. The symmetric random walk H is then defined based on these canonical paths. The concept of canonical paths is used to calculate the value ρ(Γ,H), which is used in Lemma 5.7 to provide an upper bound on the second largest eigenvalue of the random walk.","By assumption, there exists\nan augmenting path from wtow+cvusing only relements from M. Put diﬀerently, the\nelementcvfromVcan be represented by a linear combination of rvectors from M. Sincev\nwas chosen arbitrarily, Lemma 5.4implies dim( P) = dim( V)≤r. □\nRemark 5.6. It is a consequence from Proposition 5.5that for any matrix A∈Zm×dwith\nkerZ(A)∩Nd={0}and an augmenting Markov basis M, there exists F ∈ P Asuch that\nAM(F)≥dim(ker Z(A)).\nLet us now shortly recall the framework from [23] which is nec essary to prove our main\ntheorem. Let G= (V,E) be a graph. For any ordered pair of distinct nodes ( x,y)∈V×V,\nletpx,y⊆Ebe a path from xtoyinGand let Γ := {px,y: (x,y)∈V×V,x̸=y}be\nthe collection of these paths, then Γ is a set of canonical paths . Let for any edge e∈E,\nΓe:={p∈Γ :e∈p}be the set of paths from Γ that use e. Now, let H:V×V→[0,1] be a\nsymmetric random walk on Gand deﬁne\nρ(Γ,H) :=max{|p|:p∈Γ}\n|V|·max\n{u,v}∈E|Γ{u,v}|\nH(u,v).\nObserve that symmetry of His needed to make ρ(Γ,H) well-deﬁned. This can be used to\nprove the following upper bound on the second largest eigenv alue.\nLemma 5.7. LetGbe a graph, Hbe a symmetric random walk on G, andΓbe a set of\ncanonical paths in G. 
Thenλ2(H)≤1−1\nρ(Γ,H).\n\nHEAT-BATH RANDOM WALKS WITH MARKOV BASES 3\n2.Graphs and statistics\nWe ﬁrst introduce the statistical framework in which this pa per lives and recall important\naspects of the interplay between graphs and statistics. A random walk on a graph G= (V,E)\nis a map H:V×V→[0,1] such that for all v∈V,∑\nu∈VH(v,u) = 1 and such that\nH(v,u) = 0 if{v,u} ̸∈E. When there is no ambiguity, we represent a random walk as an\n|V| × |V|-matrix, for example when it is clear how the elements of Vare ordered. Fix a\nrandom walk HonG. ThenHisirreducible if for all v,u∈Vthere exists t∈Nsuch that\nHt(v,u)>0. The random walk Hisreversible if there exists a mass function µ:V→[0,1]\nsuch that µ(u)·H(u,v) =µ(v)·H(v,u) for allu,v∈Vandsymmetric ifHis a symmetric\nmap. A mass function π:V→[0,1] is astationary distribution ofHifπ◦ H=π. For\nsymmetric random walks, the uniform distribution on Vis always a stationary distribution.\nIf|V|=n, then we denote the eigenvalues of Hby 1 =λ1(H)≥λ2(H)≥ ··· ≥λn(H)≥ −1\nand we write λ(H) := max {λ2(H),−λn(H)}for thesecond largest eigenvalue modulus ofH.\nAny irreducible random walk has a unique stationary distrib ution [14, Corollary 1.17] and\nλ(H)∈[0,1] measures the convergence rate: the smaller λ(H), the faster the convergence.\nThe aim of this paper is to study random walks on lattice point s that use a set of moves."


In [9]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_correctness,
    context_recall,
    context_precision,
)

In [10]:
from phoenix.trace.langchain import OpenInferenceTracer

tracer = OpenInferenceTracer()

In [11]:
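# Score every row with the four Ragas metrics. Passing the tracer as a callback
# traces the LangChain calls that Ragas makes during evaluation.
evaluation_result = evaluate(
    dataset=ragas_eval_dataset,
    metrics=[faithfulness, answer_correctness, context_recall, context_precision],
    callbacks=[tracer],
)
# One row per evaluated question, one column per metric.
eval_scores_df = pd.DataFrame(evaluation_result.scores)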

Invalid prediction format. Expected a list of dictionaries with keys 'TP', 'FP', 'FN'


In [23]:
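eval_data_df = pd.DataFrame(evaluation_result.dataset)
# Sanity check: the evaluated questions must match the span inputs in reverse order.
assert eval_data_df.question.to_list() == list(
    reversed(spans_dataframe.input.to_list())
)
# Re-index the scores by span ID (reversed to match) so each score attaches to its span.
eval_scores_df.index = pd.Index(
    list(reversed(spans_dataframe.index.to_list())), name=spans_dataframe.index.name
)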
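
The re-indexing step above can be illustrated on toy data. This is only a sketch: the span IDs, questions, scores, and the index name `context.span_id` are made up for illustration, and it assumes (as the assertion in the cell above checks) that the spans come back in the reverse of the order in which the questions were evaluated.

```python
import pandas as pd

# Hypothetical stand-in for the exported spans, in reverse question order.
spans = pd.DataFrame(
    {"input": ["q3", "q2", "q1"]},
    index=pd.Index(["span-c", "span-b", "span-a"], name="context.span_id"),
)

# Hypothetical evaluation output: questions in the order they were evaluated.
questions = ["q1", "q2", "q3"]
scores = pd.DataFrame({"faithfulness": [0.9, 0.5, 1.0]})

# Confirm the two orderings are mirror images before relabelling anything.
assert questions == list(reversed(spans["input"].to_list()))

# Relabel the score rows with the span IDs, reversed to match the questions.
scores.index = pd.Index(
    list(reversed(spans.index.to_list())), name=spans.index.name
)
```

After this, `scores.loc["span-a"]` holds the score for `q1`. If the assertion fails, the two frames are not aligned and the scores would be attached to the wrong spans.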

In [26]:
from phoenix.trace import SpanEvaluations

# Log each metric as its own set of span evaluations. Phoenix reads the
# value from a column named "score".
for eval_name in eval_scores_df.columns:
    evals_df = eval_scores_df[[eval_name]].rename(columns={eval_name: "score"})
    evals = SpanEvaluations(eval_name, evals_df)
    px.log_evaluations(evals)
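
To see what each iteration of the loop above hands to `SpanEvaluations`, the following sketch splits a small, made-up score table into one single-column frame per metric. It uses only pandas. The span IDs, scores, and the index name `span_id` are hypothetical.

```python
import pandas as pd

# Hypothetical scores, already indexed by span ID.
scores = pd.DataFrame(
    {"faithfulness": [0.9, 0.5], "context_recall": [1.0, 0.0]},
    index=pd.Index(["span-a", "span-b"], name="span_id"),
)

# One frame per metric, each with a single column renamed to "score".
per_metric = {
    name: scores[[name]].rename(columns={name: "score"})
    for name in scores.columns
}
```

Each value in `per_metric` keeps the span-ID index, so every score stays tied to its span.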


![](../../_static/imgs/arize-tracing2.gif)

## Embedding analysis
TBD:
- cluster queries
- color each data point based on question type?
- display average score for each cluster
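
A rough, dependency-free prototype of the first and last items is sketched below. Everything in it is an assumption made for illustration: the 2-D embeddings and scores are made up, and the tiny `kmeans` helper stands in for whatever clustering method is eventually chosen. A real analysis would more likely use the query embeddings exported from Phoenix and a library clusterer.

```python
from statistics import mean

# Hypothetical 2-D query embeddings with one evaluation score per query.
embeddings = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
scores = [0.9, 0.7, 0.4, 0.2]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain k-means, seeded with the first k points so runs are repeatable."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(mean(axis) for axis in zip(*members))
    return labels


labels = kmeans(embeddings, k=2)

# Average evaluation score per cluster.
cluster_scores = {}
for label, score in zip(labels, scores):
    cluster_scores.setdefault(label, []).append(score)
average_score = {label: mean(values) for label, values in cluster_scores.items()}
```

Colouring points by question type would need a per-query type label alongside each embedding, which this sketch leaves out.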