# Contents
- [Introduction]()
- [Testset generation]()
- [Build RAG with llama-index]()
- [Tracing using Phoenix]()
- [Evaluation]()
- [Embedding analysis]()
- [Conclusion]()

## Introduction

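In this notebook, we generate a synthetic test set with Ragas, build a RAG pipeline with llama-index, trace it with Phoenix, and evaluate the pipeline with Ragas metrics.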

## Synthetic test data generation

In [17]:
! git lfs install
! git clone https://huggingface.co/datasets/explodinggradients/prompt-engineering-papers

Updated git hooks.
Git LFS initialized.
Cloning into 'prompt-engineering-papers'...
remote: Enumerating objects: 69, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (65/65), done.
remote: Total 69 (delta 0), reused 0 (delta 0), pack-reused 4
Unpacking objects: 100% (69/69), 14.95 MiB | 10.15 MiB/s, done.
Filtering content: 100% (31/31), 111.86 MiB | 21.30 MiB/s, done.


In [20]:
from llama_index import SimpleDirectoryReader

In [21]:
dir_path = "./prompt-engineering-papers"
reader = SimpleDirectoryReader(dir_path,num_files_limit=2)
documents = reader.load_data()

In [3]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# generator with openai models
generator = TestsetGenerator.with_openai()

# set question type distribution
distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

# generate testset
testset = generator.generate_with_llamaindex_docs(documents, test_size=10, distributions=distribution)

embedding nodes:   0%|          | 0/222 [00:00<?, ?it/s]

Generating:   0%|          | 0/10 [00:00<?, ?it/s]
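As a rough illustration of what the distribution above asks for, the sketch below converts the shares into per-type question counts for `test_size=10`. The helper `expected_counts` is hypothetical and uses plain largest-remainder rounding; Ragas may allocate counts differently internally, so treat the result as an approximation of the intended mix rather than a guarantee.

```python
def expected_counts(distribution, test_size):
    """Hypothetical helper: split test_size across question types by share.

    Uses largest-remainder rounding so the counts always sum to test_size.
    This is an assumption for illustration, not Ragas's own allocation logic.
    """
    raw = {name: share * test_size for name, share in distribution.items()}
    counts = {name: int(value) for name, value in raw.items()}
    leftover = test_size - sum(counts.values())
    # Hand the remaining questions to the types with the largest fractional part.
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Same shares as in the cell above, keyed by name instead of evolution object.
distribution = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}
counts = expected_counts(distribution, test_size=10)
print(counts)
```

With an odd split such as 2.5 questions per type, one of the tied types has to receive the extra question, which is why the observed mix in a generated test set can differ slightly from the requested shares.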

In [22]:
test_df = testset.to_pandas()
test_df.to_csv("ragas_testdata.csv")
test_df.head()

| Unnamed: 0 | question | contexts | ground_truth | evolution_type | episode_done |
|---|---|---|---|---|---|
| 0 | How does the lattice width of a normal set rel... | [1miisapathofminimallength, then ∥u′−v′∥ ≤r·∥M... | The relationship between the lattice width of ... | simple | True |
| 1 | What are some strategies proposed to enhance t... | [ parameter adap-\ntation to learn the best mo... | Some strategies proposed to enhance the in-con... | simple | True |
| 2 | How does the information-theoretic perspective... | [ Trans-\nformers can implement a proper funct... | The information-theoretic perspective explains... | simple | True |
| 3 | How does the use of ICL in data engineering ap... | [ 2023; He et al., 2023) and\ntext-to-SQL (Pou... | The answer is not present in the context. | simple | True |
| 4 | How does the proof utilize the concept of a Ma... | [16 CAPRICE STANLEY AND TOBIAS WINDISCH\nProof... | The answer to the question is not present in t... | simple | True |


## Build RAG with llama-index

In [5]:
import phoenix as px
session = px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


In [6]:
import llama_index
llama_index.set_global_handler("arize_phoenix")

In [28]:
import nest_asyncio
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.embeddings import OpenAIEmbedding
import pandas as pd
from datasets import Dataset

nest_asyncio.apply()



def build_query_engine(documents):
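    # Configure both the chunk size and the embedding model on the ServiceContext
    service_context = ServiceContext.from_defaults(
        chunk_size=512,
        embed_model=OpenAIEmbedding(),
    )
    vector_index = VectorStoreIndex.from_documents(
        documents, service_context=service_context
    )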

    query_engine = vector_index.as_query_engine(similarity_top_k=2)
    return query_engine


def generate_response(query_engine, question):
    response = query_engine.query(question)
    return {
        "answer": response.response,
        "contexts": [c.node.get_content() for c in response.source_nodes],
    }

# Query the engine sequentially (no async) for each test question and
# collect the results in the column layout that Ragas expects
def generate_ragas_dataset(query_engine, test_df):
    test_questions = test_df["question"].values.tolist()
    responses = [generate_response(query_engine, q) for q in test_questions]

    dataset_dict = {
        "question": test_questions,
        "answer": [response["answer"] for response in responses],
        "contexts": [response["contexts"] for response in responses],
        "ground_truth": test_df["ground_truth"].values.tolist(),
    }
    ds = Dataset.from_dict(dataset_dict)
    return ds
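
The record layout can be checked without calling the OpenAI API by running the per-question step against a stub. The sketch below is only an illustration: `StubQueryEngine` and the placeholder questions and ground truths are hypothetical, and the stub mimics just the two attributes the notebook reads (`response` and `source_nodes`).

```python
from types import SimpleNamespace


class StubQueryEngine:
    """Hypothetical stand-in for a llama-index query engine (makes no API calls)."""

    def query(self, question):
        node = SimpleNamespace(get_content=lambda: f"context for: {question}")
        return SimpleNamespace(
            response=f"answer to: {question}",
            source_nodes=[SimpleNamespace(node=node)],
        )


def generate_response(query_engine, question):
    # Same logic as the notebook's helper: keep the answer text and the
    # text of every retrieved source node.
    response = query_engine.query(question)
    return {
        "answer": response.response,
        "contexts": [c.node.get_content() for c in response.source_nodes],
    }


# Placeholder inputs, not taken from the generated test set.
questions = ["What is in-context learning?", "What is instruction tuning?"]
ground_truths = ["placeholder ground truth 1", "placeholder ground truth 2"]

responses = [generate_response(StubQueryEngine(), q) for q in questions]
dataset_dict = {
    "question": questions,
    "answer": [r["answer"] for r in responses],
    "contexts": [r["contexts"] for r in responses],  # one list of strings per row
    "ground_truth": ground_truths,
}
print({column: len(values) for column, values in dataset_dict.items()})
```

Every column has one entry per question, and `contexts` is a list of strings for each row, which is the shape passed to `Dataset.from_dict` in the notebook.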

In [35]:
generate_response(query_engine,test_df["question"][1])

{'answer': 'Strategies proposed to enhance the in-context learning capability of language models include instruction tuning, generating instruction tuning datasets, connecting language models with powerful vision foundational models, and using proper data formatting and architecture designs. Additionally, in the speech area, treating text-to-speech synthesis as a language modeling task and using intermediate representations such as audio codec codes have been proposed to enhance in-context learning capability.',
 'contexts': ['with instruction tuning, and the idea is also ex-\nplored in the multi-modal scenarios as well. Re-\ncent explorations first generate instruction tuning\ndatasets transforming existing vision-language task\ndataset (Xu et al., 2022; Li et al., 2023a) or with\npower LLMs such as GPT-4 (Liu et al., 2023; Zhu\net al., 2023a) , and connect LLMs with powerful vi-\nsion foundational models such as BLIP-2 (Li et al.,\n2023c) on these multi-modal datasets (Zhu et al.,\n2

In [29]:
query_engine = build_query_engine(documents)
ragas_eval_dataset = generate_ragas_dataset(query_engine, test_df)

![](../../_static/imgs/arize-tracing1.gif)

In [10]:
ragas_eval_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 10
})

## Evaluation

In [11]:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness, context_recall, context_precision

In [12]:
from phoenix.trace.langchain import OpenInferenceTracer
tracer = OpenInferenceTracer()

In [13]:
ragas_scores = evaluate(dataset=ragas_eval_dataset, 
                        metrics=[faithfulness, answer_correctness, context_recall, context_precision],
                        callbacks=[tracer])

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

In [14]:
ragas_scores

{'faithfulness': 1.0000, 'answer_correctness': 0.3806, 'context_recall': 0.5500}

![](../../_static/imgs/arize-tracing2.gif)

## Embedding analysis
TBD:
- cluster queries
- color each data point based on question type?
- display average score for each cluster
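
One possible shape for the planned analysis is sketched below, using only the standard library. Everything in it is an assumption for illustration: the 2-D vectors stand in for real query embeddings, the scores are made-up placeholders rather than results from this notebook, and the small k-means routine is a toy substitute for whatever clustering method is eventually chosen.

```python
from statistics import mean


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Toy k-means: centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [mean(dim) for dim in zip(*members)]
    return labels


# Placeholder data: toy "query embeddings" and made-up per-question scores.
embeddings = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [4.9, 5.0],
]
scores = [0.9, 0.8, 1.0, 0.2, 0.3, 0.4]

labels = kmeans(embeddings, k=2)

# Average score per cluster.
cluster_scores = {
    c: mean(s for s, label in zip(scores, labels) if label == c)
    for c in sorted(set(labels))
}
for cluster, avg in cluster_scores.items():
    print(f"cluster {cluster}: average score {avg:.2f}")
```

With real data, the embeddings would come from the embedding model used for retrieval and the scores from the per-question evaluation results; clusters with a low average would point to groups of queries the pipeline handles poorly.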