# Phoenix Notebook: Demo Llama-Index Tracing

All tracing fixtures generated from this notebook are stored on GCS in the `arize-phoenix-assets/traces` folder as `demo_llama_index_rag_(name).parquet`: [link here](https://console.cloud.google.com/storage/browser/arize-phoenix-assets/traces)

## Setup

Install libraries

In [None]:
!pip install -qq arize-phoenix llama-index "openai>=1" gcsfs nest_asyncio langchain langchain-community cohere llama-index-postprocessor-cohere-rerank

Set up environment variables and enter API keys


In [None]:
import os
from getpass import getpass

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

if not (cohere_api_key := os.getenv("COHERE_API_KEY")):
    cohere_api_key = getpass("🔑 Enter your Cohere API key: ")
os.environ["COHERE_API_KEY"] = cohere_api_key

## Launch Phoenix and Instrumentation

In [None]:
import phoenix as px

session = px.launch_app()

In [None]:
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

from phoenix.otel import register

tracer_provider = register(endpoint="http://127.0.0.1:6006/v1/traces")
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

## Parse Phoenix Documentation into Llama-Index Documents

Imports

In [None]:
import json
import logging
import sys

In [None]:
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio
import pandas as pd
from langchain_community.document_loaders import GitbookLoader
from llama_index.core import Document, VectorStoreIndex

nest_asyncio.apply()
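
To make the comment above concrete, here is a minimal standard-library sketch of the failure that `nest_asyncio` works around. It does not use `nest_asyncio` itself; `inner` and `outer` are hypothetical names, and `outer` stands in for a notebook cell that is already running inside an event loop. Run it as a plain script, not inside a notebook.

```python
import asyncio


async def inner() -> int:
    # Stand-in for any coroutine a library might want to run synchronously.
    return 42


async def outer() -> str:
    # We are already inside a running loop here, much like a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

The printed message is the error that libraries calling `asyncio.run` internally would hit inside Jupyter without the patch applied.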

Phoenix tracing is already enabled via `LlamaIndexInstrumentor` in the instrumentation cell above.

Phoenix uses OpenInference traces, an open-source standard for capturing and storing LLM application traces that lets LLM applications integrate with LLM observability solutions such as Phoenix.

In [None]:
"""
Fetches the Arize documentation from Gitbook and serializes it into LangChain format.
"""


def load_gitbook_docs(docs_url: str):
    """Loads documents from a Gitbook URL.

    Args:
        docs_url (str): URL to Gitbook docs.

    Returns:
        List[LangChainDocument]: List of documents in LangChain format.
    """
    loader = GitbookLoader(
        docs_url,
        load_all_paths=True,
    )
    return loader.load()


logging.basicConfig(level=logging.INFO, stream=sys.stdout)

# Fetch documentation
docs_url = "https://docs.arize.com/phoenix"
embedding_model_name = "text-embedding-ada-002"
docs = load_gitbook_docs(docs_url)

In [None]:
# Convert each LangChain document into a Llama-Index Document
documents = [Document(metadata=doc.metadata, text=doc.page_content) for doc in docs]

In [None]:
documents[0].metadata

In [None]:
# Convert documents to a JSON serializable format (if needed)
documents_json = [doc.to_dict() for doc in documents]

# Save documents to a JSON file
with open("demo_llama_index_documents.json", "w") as file:
    json.dump(documents_json, file, indent=4)

## Set Up VectorStore and Query Engine

In [None]:
from llama_index.core.node_parser import SentenceSplitter

# Build index with a chunk_size of 1024
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=250)
nodes = splitter.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)
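
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a simplified sliding-window sketch over a token list. It is not the `SentenceSplitter` algorithm, which also respects sentence boundaries and metadata; `chunk_with_overlap` is a hypothetical helper, and the tiny sizes are chosen only so the output is easy to read.

```python
from typing import List


def chunk_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so neighbouring chunks repeat a little context at their boundary.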

Build a QueryEngine and set up a Cohere reranker

In [None]:
from llama_index.postprocessor.cohere_rerank import CohereRerank

cohere_api_key = os.environ["COHERE_API_KEY"]
cohere_rerank = CohereRerank(api_key=cohere_api_key, top_n=2)

query_engine = vector_index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[cohere_rerank],
)
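
The query engine above retrieves five candidates by similarity and then keeps the two that the reranker scores highest. The sketch below shows that two-stage pattern in plain Python. The scoring functions `word_overlap` and `phrase_match` are toy stand-ins, not how embeddings or Cohere's reranker score text, and `retrieve_then_rerank` is a hypothetical helper.

```python
from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    candidates: List[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    similarity_top_k: int = 5,
    top_n: int = 2,
) -> List[str]:
    """Keep the top-k by a cheap score, then the top-n by a second score."""
    shortlist = sorted(
        candidates, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:similarity_top_k]
    reranked = sorted(
        shortlist, key=lambda doc: rerank_score(query, doc), reverse=True
    )
    return reranked[:top_n]


def word_overlap(query: str, doc: str) -> float:
    # Toy first-stage score: number of shared lowercase words.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    # Toy second-stage score: 1.0 if the exact phrase appears in the document.
    return 1.0 if query.lower() in doc.lower() else 0.0


docs = [
    "phoenix can launch in a notebook",
    "to launch phoenix call launch_app",
    "evals score retrieved documents",
    "phoenix traces llm applications",
    "launch settings for the app",
    "cohere reranks search results",
    "install phoenix with pip",
]
result = retrieve_then_rerank("launch phoenix", docs, word_overlap, phrase_match)
print(result)
```

The first stage is cheap and broad, while the second stage is applied only to the shortlist, which is the usual reason for retrieving more candidates than are finally kept.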

## Import Questions

In [None]:
host = "https://storage.googleapis.com/"
bucket = "arize-phoenix-assets"
prefix = "traces"
url = f"{host}{bucket}/{prefix}"

In [None]:
questions_file = "demo_llama_index_rag_questions.parquet"
questions_df = pd.read_parquet(f"{url}/{questions_file}")
questions_df

## Generate Answers for All Questions

Start querying

In [None]:
# Loop over the questions and generate the answers
for _, row in questions_df.iterrows():
    question = row["Prompt/ Question"]
    response_vector = query_engine.query(question)
    print(f"Question: {question}\nAnswer: {response_vector.response}\n")

## OPTIONAL: Remove index spans

Indexing steps, such as document embedding and node parsing, may also be instrumented and show up as spans. To remove them:

1. Query spans from Phoenix without indexing spans and save as `demo_traces.parquet`
2. Clear all spans generated from this notebook (manually)
3. Log the same traces back to Phoenix without the indexing spans

Step 1: Query spans from Phoenix without indexing spans and save as `demo_traces.parquet`

In [None]:
from phoenix.trace.dsl import SpanQuery

traces = px.Client().query_spans(
    SpanQuery().where(
        "name != 'BaseEmbedding.get_text_embedding_batch' and name != 'MetadataAwareTextSplitter._parse_nodes' and name != 'SentenceSplitter.split_text_metadata_aware'"
    ),
    limit=5000,
    timeout=100,
)
traces.to_parquet("demo_traces.parquet")
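
The long filter string in the cell above can also be assembled from a list of span names, which is easier to extend if other indexing spans appear. This is a small sketch in plain Python; `exclude_span_names` and `INDEXING_SPAN_NAMES` are hypothetical names, and the result is just the string that would be passed to `SpanQuery().where(...)`.

```python
from typing import Iterable


def exclude_span_names(names: Iterable[str]) -> str:
    """Build a filter condition that excludes every span name in names."""
    return " and ".join(f"name != '{name}'" for name in names)


# The three indexing span names filtered out in the cell above.
INDEXING_SPAN_NAMES = [
    "BaseEmbedding.get_text_embedding_batch",
    "MetadataAwareTextSplitter._parse_nodes",
    "SentenceSplitter.split_text_metadata_aware",
]

condition = exclude_span_names(INDEXING_SPAN_NAMES)
print(condition)
```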

Step 2: Clear all spans manually on Phoenix

In [None]:
session.view()

Step 3: Log the same traces back to Phoenix without the indexing spans

In [None]:
from phoenix import TraceDataset

px.Client().log_traces(TraceDataset(pd.read_parquet("demo_traces.parquet")))

Now the indexing spans are removed and we're left with the traces we want.

## Phoenix Evals

In [None]:
from phoenix.session.evaluation import get_retrieved_documents

retrieved_documents_df = get_retrieved_documents(px.Client())
retrieved_documents_df

In [None]:
from phoenix.session.evaluation import get_qa_with_reference

queries_df = get_qa_with_reference(px.Client())
queries_df

Let's now use Phoenix's LLM Evals to evaluate the relevance of the retrieved documents with regard to the query.

Note that we've turned on `explanations`, which prompts the LLM to explain its reasoning. This can be useful for debugging and for figuring out potential corrective actions.

In [None]:
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
)

eval_model = OpenAIModel(model="gpt-4")
relevance_evaluator = RelevanceEvaluator(eval_model)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_evaluator = QAEvaluator(eval_model)

Document relevance evaluations

In [None]:
retrieved_documents_relevance_df = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=retrieved_documents_df,
    provide_explanation=True,
    concurrency=20,
)[0]
retrieved_documents_relevance_df

Hallucination and QA-correctness evaluations

In [None]:
hallucination_eval_df, qa_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_evaluator],
    provide_explanation=True,
    concurrency=20,
)
hallucination_eval_df

In [None]:
qa_eval_df
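
A quick way to summarize an evaluation is to compute the share of each label. The sketch below does this with the standard library on made-up sample labels. `label_fractions` is a hypothetical helper, and it assumes the evaluation results expose one label per row (for example a `label` column in the dataframes above).

```python
from collections import Counter
from typing import Dict, Iterable


def label_fractions(labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of rows carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration only, not results from this notebook.
sample_labels = ["factual", "factual", "hallucinated", "factual"]
fractions = label_fractions(sample_labels)
print(fractions)
```

Under the same assumption, something like `label_fractions(hallucination_eval_df["label"])` would give the summary for the real results.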

## Log the Evaluations into Phoenix

In [None]:
from phoenix.trace import DocumentEvaluations, SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_eval_df),
    DocumentEvaluations(
        eval_name="Retrieval Relevance", dataframe=retrieved_documents_relevance_df
    ),
)

In [None]:
session.view()

## Save the Traces and Evaluations

Save all spans and evals as parquet fixtures

In [None]:
import os

# Specify and Create the Directory for Trace Dataset
directory = "fixtures"
os.makedirs(directory, exist_ok=True)

# Save the Trace Dataset (set limit to above 2000)
trace_id = px.Client().get_trace_dataset(limit=5000, timeout=60).save(directory=directory)

Save LLM spans as fixtures for dataset usage

In [None]:
from phoenix.trace.dsl import SpanQuery

llm_open_ai = px.Client().query_spans(
    SpanQuery().where("span_kind == 'LLM' and name == 'OpenAI.chat'")
)

llm_predict = px.Client().query_spans(
    SpanQuery().where("span_kind == 'LLM' and name == 'LLM.predict'")
)

all_llm = px.Client().query_spans(SpanQuery().where("span_kind == 'LLM'"))

llm_open_ai.to_parquet("fixtures/demo_llama_index_llm_open_ai.parquet")
llm_predict.to_parquet("fixtures/demo_llama_index_llm_predict.parquet")
all_llm.to_parquet("fixtures/demo_llama_index_llm_all_spans.parquet")

OPTIONAL: Delete the traces in Phoenix and import them again to check their validity, if necessary

In [None]:
from phoenix import TraceDataset
from phoenix.trace import DocumentEvaluations, SpanEvaluations

px.Client().log_traces(
    TraceDataset(pd.read_parquet("fixtures/demo_llama_index_rag_traces.parquet"))
)

retrieved_documents_relevance_df = pd.read_parquet(
    "fixtures/demo_llama_index_rag_doc_relevance_eval.parquet"
)
qa_eval_df = pd.read_parquet(
    "fixtures/demo_llama_index_rag_qa_correctness_eval.parquet"
)
hallucination_eval_df = pd.read_parquet(
    "fixtures/demo_llama_index_rag_hallucination_eval.parquet"
)

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_eval_df),
    DocumentEvaluations(
        eval_name="Retrieval Relevance", dataframe=retrieved_documents_relevance_df
    ),
)

Now we have finished generating llama-index RAG QA traces and evals and have them saved as fixtures!