<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Code Functionality Evals</h1>


This Eval tests whether code is written correctly: that it is free of bugs and syntax errors and accomplishes the functionality you want.

The purpose of this notebook is:

- to evaluate the performance of the code functionality Eval
- to provide an experimental framework for users to iterate and improve on the default classification template.

## Install Dependencies and Import Libraries

In [None]:
!pip install -qq arize-phoenix  "openai>=1" ipython matplotlib pycm scikit-learn tiktoken nest_asyncio

ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use `nest_asyncio`. `nest_asyncio` globally patches `asyncio` to enable event loops to be re-entrant. This is not required for non-notebook environments.

Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical.
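To illustrate why concurrent submission helps, here is a minimal sketch using only the standard library. It is not Phoenix's implementation: `fake_request`, `classify_all`, and the simulated latency are hypothetical stand-ins, and the semaphore only mimics the idea of a concurrency limit. It is written as a plain script; inside a notebook, `asyncio.run` needs the `nest_asyncio` patch described above.

```python
import asyncio


async def fake_request(i, semaphore, stats):
    # Hypothetical stand-in for one LLM call; the sleep simulates network latency.
    async with semaphore:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)
        stats["in_flight"] -= 1
        return f"label-{i}"


async def classify_all(n_rows, concurrency):
    # The semaphore caps how many simulated requests run at the same time.
    semaphore = asyncio.Semaphore(concurrency)
    stats = {"in_flight": 0, "peak": 0}
    tasks = (fake_request(i, semaphore, stats) for i in range(n_rows))
    # gather returns results in submission order, regardless of completion order.
    results = await asyncio.gather(*tasks)
    return results, stats["peak"]


results, peak = asyncio.run(classify_all(n_rows=10, concurrency=3))
print(results[:3], "peak in-flight requests:", peak)
```

With a limit of 3, up to three simulated requests overlap instead of running one after another, which is where the speedup comes from.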

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
import os
from getpass import getpass

import matplotlib.pyplot as plt
import pandas as pd
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

from phoenix.evals import (
    CODE_FUNCTIONALITY_PROMPT_RAILS_MAP,
    # To Add templates
    CODE_FUNCTIONALITY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

pd.set_option("display.max_colwidth", None)

## Download Benchmark Dataset

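Load the benchmark dataset of Python code samples. Each row pairs a coding instruction (`coding_instruction`) with a code sample (`code`) and a human-validated label (`is_well_coded`).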

In [None]:
df = pd.read_csv(
    "https://storage.googleapis.com/arize-assets/phoenix/evals/code-functionality/validated_python_code_samples_2.csv"
)

df.head()




## Display Code Functionality Classification Template

View the default template used to classify code functionality. You can tweak this template and evaluate its performance relative to the default.

In [None]:
print(CODE_FUNCTIONALITY_PROMPT_TEMPLATE)

The template variables are:

- **coding_instruction:** The instruction describing what the code is supposed to do
- **code:** The code to evaluate
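As a rough illustration of how these variables are filled in, the sketch below formats one row into a prompt. It assumes curly-brace placeholders named after the dataframe columns; `TOY_TEMPLATE` and the example row are hypothetical and are not the text of `CODE_FUNCTIONALITY_PROMPT_TEMPLATE`.

```python
# Hypothetical stand-in template; print CODE_FUNCTIONALITY_PROMPT_TEMPLATE to see the real one.
TOY_TEMPLATE = (
    "Instruction: {coding_instruction}\n"
    "Code:\n{code}\n"
    "Is the code correct?"
)

# One made-up row with the two template variables as keys.
row = {
    "coding_instruction": "Return the sum of two numbers.",
    "code": "def add(a, b):\n    return a + b",
}

# Each placeholder is replaced by the value of the matching key.
prompt = TOY_TEMPLATE.format(**row)
print(prompt)
```

If you write your own template, keeping the same variable names lets it be used with the same dataframe columns.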


## Configure the LLM

Configure your OpenAI API key.

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

## Benchmark Dataset Sample
Sample size determines run time. We recommend iterating on a small sample (100 samples) first, then increasing to a larger test set.
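A minimal sketch of drawing a fixed-size sample with pandas follows. The toy dataframe stands in for `df`, and both the variable names and `random_state=42` are assumptions chosen for illustration; the seed simply makes the sample reproducible between runs.

```python
import pandas as pd

# Toy stand-in for the benchmark dataframe, with the same column names.
toy_df = pd.DataFrame(
    {
        "coding_instruction": [f"instruction {i}" for i in range(500)],
        "code": [f"print({i})" for i in range(500)],
        "is_well_coded": [i % 2 == 0 for i in range(500)],
    }
)

# Draw 100 rows at random; a fixed seed keeps the sample stable across runs.
sample_df = toy_df.sample(n=100, random_state=42).reset_index(drop=True)
print(len(sample_df))
```

Resetting the index keeps row positions aligned if you later merge the sample with classification results by index.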

## LLM Evals: Code Functionality Classifications GPT-4
Run the code functionality Eval against a subset of the data.
Instantiate the LLM and set parameters.

In [None]:
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

In [None]:
model("Hello world, this is a test if you are working?")

## Run Code Functionality Classifications

Run code functionality classifications against a subset of the data.

In [None]:
# The rails constrain the LLM output to the specific values defined by the template.
# They strip extraneous text such as ",,," or "..." and
# ensure that the binary value expected by the template is returned.
rails = list(CODE_FUNCTIONALITY_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,
    template=CODE_FUNCTIONALITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()
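To make the idea of rails concrete, here is a simplified sketch of snapping raw model output to an allowed value. It is not Phoenix's implementation: `snap_to_rails`, the `"NOT_PARSABLE"` fallback, and the placeholder rail values `"bug_free"` and `"is_bug"` are assumptions for illustration. Print `rails` in the notebook to see the actual values.

```python
import re


def snap_to_rails(raw_output, rails, fallback="NOT_PARSABLE"):
    """Return the single rail found in raw_output, or the fallback otherwise."""
    cleaned = raw_output.lower()
    # Whole-word matching ignores surrounding punctuation such as ",,," or "...".
    matches = [
        rail
        for rail in rails
        if re.search(rf"\b{re.escape(rail.lower())}\b", cleaned)
    ]
    # Ambiguous output (no rail, or more than one) is treated as unparsable.
    return matches[0] if len(matches) == 1 else fallback


toy_rails = ["bug_free", "is_bug"]  # placeholder values, assumed for this sketch

print(snap_to_rails("bug_free...", toy_rails))
print(snap_to_rails(" is_bug,,,", toy_rails))
print(snap_to_rails("I am not sure", toy_rails))
```

Treating ambiguous output as unparsable, rather than guessing, keeps noisy responses from silently inflating either class.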

## Evaluate Classifications

Evaluate the predictions against human-labeled ground-truth code functionality labels.

In [None]:
true_labels = df["is_well_coded"].map(CODE_FUNCTIONALITY_PROMPT_RAILS_MAP).tolist()

print(classification_report(true_labels, relevance_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=relevance_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)
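For readers who want to see what the per-label numbers in the classification report mean, the sketch below computes precision, recall, and F1 for one label by hand. The helper name `precision_recall_f1`, the toy label lists, and the placeholder label values are assumptions for illustration; in the notebook, `classification_report` does this work.

```python
def precision_recall_f1(true_labels, predicted_labels, positive_label):
    """Compute precision, recall, and F1 for one label from two label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    # Count true positives, false positives, and false negatives for the label.
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: one "bug_free" sample is misclassified as "is_bug".
toy_true = ["bug_free", "bug_free", "is_bug", "is_bug", "bug_free"]
toy_pred = ["bug_free", "is_bug", "is_bug", "is_bug", "bug_free"]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred, "bug_free")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here every "bug_free" prediction is correct, but one truly "bug_free" sample is missed, so precision is perfect while recall is lower.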

## Classifications with explanations

When evaluating a dataset for code functionality, it can be useful to know why the LLM classified a code sample the way it did. The following code block runs `llm_classify` with explanations turned on so that we can inspect the reasoning behind each classification. There is a speed tradeoff, since more tokens are generated, but the explanations can be highly informative when troubleshooting.

In [None]:
small_df_sample = df.copy().sample(n=5).reset_index(drop=True)
relevance_classifications_df = llm_classify(
    dataframe=small_df_sample,
    template=CODE_FUNCTIONALITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=True,
    concurrency=20,
)

In [None]:
# Let's view the data
merged_df = pd.merge(
    small_df_sample, relevance_classifications_df, left_index=True, right_index=True
)
merged_df[["coding_instruction", "code", "label", "explanation"]].head()
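When troubleshooting, it can help to look only at rows where the LLM label disagrees with the human label. The sketch below shows one way to do that with pandas on toy data. The rails map values, the toy rows, and the `true_label` column name are assumptions for illustration; in the notebook the same filter could be applied to `merged_df` using `CODE_FUNCTIONALITY_PROMPT_RAILS_MAP`.

```python
import pandas as pd

# Assumed mapping from the boolean ground truth to label strings (placeholder values).
toy_rails_map = {True: "bug_free", False: "is_bug"}

# Made-up merged results: ground truth, LLM label, and LLM explanation per row.
toy_merged_df = pd.DataFrame(
    {
        "code": [
            "def add(a, b): return a + b",
            "def sub(a, b): return a + b",
            "print('hi'",
        ],
        "is_well_coded": [True, False, False],
        "label": ["bug_free", "bug_free", "is_bug"],
        "explanation": [
            "Adds the two arguments as instructed.",
            "The function looks syntactically valid.",
            "The call is missing a closing parenthesis.",
        ],
    }
)

# Convert the ground truth to the same label strings the LLM returns.
toy_merged_df["true_label"] = toy_merged_df["is_well_coded"].map(toy_rails_map)

# Keep only the rows where the LLM and the human label disagree.
disagreements = toy_merged_df[toy_merged_df["label"] != toy_merged_df["true_label"]]
print(disagreements[["code", "true_label", "label", "explanation"]])
```

Reading the explanations for these rows is often the quickest way to spot what a template revision should address.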

## LLM Evals: Code Functionality Classifications GPT-3.5 Turbo
Run the code functionality Eval against a subset of the data using GPT-3.5. GPT-3.5 can significantly speed up the classification process. However, there are tradeoffs, as we will see below.

In [None]:
model = OpenAIModel(model_name="gpt-3.5-turbo", temperature=0.0, request_timeout=20)

In [None]:
rails = list(CODE_FUNCTIONALITY_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,
    template=CODE_FUNCTIONALITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()

In [None]:
true_labels = df["is_well_coded"].map(CODE_FUNCTIONALITY_PROMPT_RAILS_MAP).tolist()

print(classification_report(true_labels, relevance_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=relevance_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)

## Preview: Running with GPT-4 Turbo

In [None]:
model = OpenAIModel(model_name="gpt-4-1106-preview")
classifications = llm_classify(
    dataframe=df,
    template=CODE_FUNCTIONALITY_PROMPT_TEMPLATE,
    model=model,
    rails=list(CODE_FUNCTIONALITY_PROMPT_RAILS_MAP.values()),
    concurrency=20,
)["label"].tolist()

In [None]:
true_labels = df["is_well_coded"].map(CODE_FUNCTIONALITY_PROMPT_RAILS_MAP).tolist()

print(classification_report(true_labels, classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)