<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Code Readability Evals</h1>

Arize provides tooling to evaluate LLM applications, including tools to determine whether code generated by LLM applications is readable or unreadable.

The purpose of this notebook is:

- to evaluate the performance of an LLM-assisted approach to classifying generated code as readable or unreadable using datasets with ground-truth labels
- to provide an experimental framework for users to iterate and improve on the default classification template.

## Install Dependencies and Import Libraries

In [None]:
!pip install -qq "arize-phoenix[experimental]" ipython matplotlib openai pycm scikit-learn

In [None]:
import os
from getpass import getpass

import matplotlib.pyplot as plt
import openai
import pandas as pd
from phoenix.experimental.evals import (
    CODE_READABILITY_PROMPT_RAILS_MAP,
    CODE_READABILITY_PROMPT_TEMPLATE_STR,
    OpenAiModel,
    download_benchmark_dataset,
    llm_eval_binary,
)
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

pd.set_option("display.max_colwidth", None)

## Download Benchmark Dataset

We'll evaluate an evaluation system, consisting of an LLM, its settings, and an evaluation prompt template, against a benchmark dataset of readable and unreadable code with ground-truth labels. Currently supported datasets for this task include:

- openai_humaneval_with_readability

In [None]:
dataset_name = "openai_humaneval_with_readability"
df = download_benchmark_dataset(task="code-readability-classification", dataset_name=dataset_name)
df.head()

## Display Binary Readability Classification Template

View the default template used to classify readability. You can tweak this template and evaluate its performance relative to the default.

In [None]:
print(CODE_READABILITY_PROMPT_TEMPLATE_STR)

The template variables are:

- **query:** the coding task asked by a user
- **code:** implementation of the coding task
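
To make the two variables concrete, the sketch below fills a stand-in template using Python's `str.format`. The `toy_template` string and the example row are hypothetical and written only for illustration; the real wording lives in `CODE_READABILITY_PROMPT_TEMPLATE_STR`, and Phoenix's own substitution mechanism may differ.

```python
# Hypothetical stand-in template; NOT the Phoenix default template.
toy_template = (
    "Task:\n{query}\n\n"
    "Implementation:\n{code}\n\n"
    "Is the implementation readable? Answer 'readable' or 'unreadable'."
)

# A made-up row with the two columns the template expects.
row = {
    "query": "Return the sum of a list of integers.",
    "code": "def total(xs):\n    return sum(xs)",
}

# Each {placeholder} is replaced by the matching value from the row.
prompt = toy_template.format(**row)
print(prompt)
```

This is also why the DataFrame columns need to carry the same names as the template variables before classification runs.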

## Configure the LLM

Configure your OpenAI API key.

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

Instantiate the LLM and set parameters.

In [None]:
model = OpenAiModel(
    model_name="gpt-4",
    temperature=0.0,
)

## Run Readability Classifications

Run readability classifications against a subset of the data.

In [None]:
df = df.sample(n=100).reset_index(drop=True)
df = df.rename(
    columns={"prompt": "query", "solution": "code"},
)
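
Note that `df.sample(n=100)` draws a different subset on each run, so metrics can shift between runs. If you want comparable runs while iterating on the template, one option is pandas' `random_state` argument. The sketch below uses a toy DataFrame, an arbitrary seed, and a hypothetical helper `sample_and_rename`; none of these come from the original notebook.

```python
import pandas as pd

# Toy stand-in for the benchmark DataFrame (made-up rows).
toy_df = pd.DataFrame(
    {
        "prompt": [f"task {i}" for i in range(10)],
        "solution": [f"code {i}" for i in range(10)],
    }
)

SEED = 42  # arbitrary; any fixed integer gives a repeatable sample


def sample_and_rename(frame, n, seed):
    """Draw n rows with a fixed seed, then rename columns to the template variables."""
    sampled = frame.sample(n=n, random_state=seed).reset_index(drop=True)
    return sampled.rename(columns={"prompt": "query", "solution": "code"})


first = sample_and_rename(toy_df, n=4, seed=SEED)
second = sample_and_rename(toy_df, n=4, seed=SEED)

# Same seed, same rows in the same order.
print(first.equals(second))
```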

In [None]:
rails = list(CODE_READABILITY_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_eval_binary(
    dataframe=df,
    template=CODE_READABILITY_PROMPT_TEMPLATE_STR,
    model=model,
    rails=rails,
)

## Evaluate Classifications

Evaluate the predictions against human-labeled ground-truth readability labels.

In [None]:
true_labels = df["readable"].map(CODE_READABILITY_PROMPT_RAILS_MAP).tolist()
predicted_labels = relevance_classifications

print(classification_report(true_labels, predicted_labels, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=predicted_labels, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
);
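
To help interpret the numbers in the report, here is a minimal, dependency-free sketch of how precision and recall for one class are derived from two label lists. The labels are made up for illustration and are not benchmark results; `precision_recall` is a hypothetical helper, not part of scikit-learn or Phoenix.

```python
def precision_recall(true_labels, predicted_labels, positive):
    """Compute precision and recall for one class, treating it as the positive label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels, for illustration only.
toy_true = ["readable", "readable", "unreadable", "unreadable", "readable"]
toy_pred = ["readable", "unreadable", "unreadable", "readable", "readable"]

precision, recall = precision_recall(toy_true, toy_pred, positive="readable")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "of the snippets the LLM called readable, how many really were?", while recall answers "of the truly readable snippets, how many did the LLM catch?".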

## Inspecting evaluations

Because the evals are binary classifications, we can easily sample a few rows where the evals deviated from the ground truth and inspect the actual code in those cases.

In [None]:
df["predicted"] = predicted_labels
# inspect instances where the ground truth was unreadable but the LLM evaluated the code as readable
filtered_df = df.query('readable == False and predicted == "readable"')

# inspect first 5 rows that meet this condition
result = filtered_df.head(5)
result
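
The query in the cell above selects a single direction of disagreement. As a sketch, both directions can be separated with boolean masks. The toy DataFrame below is made up, and it assumes, as in this notebook, that `readable` holds booleans and `predicted` holds the strings `"readable"` and `"unreadable"`.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame(
    {
        "code": ["a", "b", "c", "d"],
        "readable": [True, False, True, False],
        "predicted": ["readable", "readable", "unreadable", "unreadable"],
    }
)

llm_says_readable = toy["predicted"] == "readable"

# Ground truth unreadable, but the LLM said readable.
false_readable = toy[~toy["readable"] & llm_says_readable]
# Ground truth readable, but the LLM said unreadable.
false_unreadable = toy[toy["readable"] & ~llm_says_readable]

print(false_readable["code"].tolist(), false_unreadable["code"].tolist())
```

Reading a handful of rows from each group can suggest which way the template is biased before you change its wording.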