# Collect GPT Chats, Embed Them, and Explore Them in Phoenix

## Load ChatGPT

The following example analyzes a dataset of prompt/response pairs collected from GPT-3.5 (ChatGPT). The data was collected with the OpenAI Python API, using the code at the end of this notebook, and can be analyzed in Phoenix. This notebook:

* Imports a dataset of previously generated prompt/response pairs 
* Loads the dataset into Phoenix for analysis 
* Shows how the data was collected using the Python interface of GPT


In [None]:
import pandas as pd
import json

In [None]:
conversations_df = pd.read_csv(
    "https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/dataframe_llm_gpt.csv"
)

In [None]:
import numpy as np
import ast
import re


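def string_to_array(s):
    # Vectors are stored as strings in the CSV; extract every number,
    # including scientific notation such as 1e-05, and rebuild the array.
    numbers = re.findall(r"[-+]?(?:\d*\.\d+|\d+\.?)(?:[eE][-+]?\d+)?", s)
    return np.array([float(num) for num in numbers])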

In [None]:
conversations_df["prompt_vector"] = conversations_df["prompt_vector"].apply(string_to_array)
conversations_df["response_vector"] = conversations_df["response_vector"].apply(string_to_array)

In [None]:
conversations_df
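
As a quick sanity check on the parsing step, the sketch below runs a regular expression that also accepts scientific notation over a made-up stringified vector. The sample string and the `parse_vector` helper name are assumptions for illustration and are not taken from the dataset. It returns a plain Python list rather than a NumPy array so that it runs without extra dependencies.

```python
import re

# Matches plain decimals, integers, and scientific notation such as 3e-05.
NUMBER_PATTERN = re.compile(r"[-+]?(?:\d*\.\d+|\d+\.?)(?:[eE][-+]?\d+)?")


def parse_vector(s):
    """Parse a stringified vector (hypothetical helper) into a list of floats."""
    return [float(num) for num in NUMBER_PATTERN.findall(s)]


# Made-up example of how an array may look after a CSV round trip
sample = "[ 0.12 -0.5\n  3e-05 7.]"
vector = parse_vector(sample)
print(vector)
```

Without the exponent part of the pattern, a value such as `3e-05` would be split into two separate numbers, which would silently change the vector length.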

Install Arize to use the embedding generators provided by the SDK's generators package.

In [None]:
!pip install arize

In [None]:
!pip install 'arize[AutoEmbeddings]'

In [None]:
import arize
from arize.pandas.embeddings import EmbeddingGenerator, UseCases

generator = EmbeddingGenerator.from_use_case(
    use_case=UseCases.NLP.SEQUENCE_CLASSIFICATION,
    model_name="distilbert-base-uncased",
    tokenizer_max_length=512,
    batch_size=100,
)

Generate embeddings for the prompt and response columns.

In [None]:
# Very fast on a GPU (seconds) but can take 2-3 minutes on a CPU.
# Embeddings are generated only if the vector columns are not already present.
conversations_df = conversations_df.reset_index(drop=True)
if not all(col in conversations_df.columns for col in ["prompt_vector", "response_vector"]):
    conversations_df["prompt_vector"] = generator.generate_embeddings(
        text_col=conversations_df["prompt"]
    )
    conversations_df["response_vector"] = generator.generate_embeddings(
        text_col=conversations_df["response"]
    )

**Install Phoenix**

In [None]:
!pip install arize-phoenix

In [None]:
import phoenix as px

# Define a Schema() object for Phoenix to pick up data from the correct columns for logging
schema = px.Schema(
    feature_column_names=[
        "step",
        "conversation_id",
        "api_call_duration",
        "response_len",
        "prompt_len",
    ],
    prompt_column_names=px.EmbeddingColumnNames(
        vector_column_name="prompt_vector", raw_data_column_name="prompt"
    ),
    response_column_names=px.EmbeddingColumnNames(
        vector_column_name="response_vector", raw_data_column_name="response"
    ),
)
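
Before handing the dataframe to Phoenix, it can help to confirm that every column named in the schema actually exists. The sketch below is a hypothetical helper (`missing_columns` is not part of the Phoenix API) that works on plain lists of column names. In the notebook, you would pass `conversations_df.columns` as the second argument.

```python
def missing_columns(required, available):
    """Return the required column names that are absent, in order (hypothetical helper)."""
    available = set(available)
    return [col for col in required if col not in available]


# Column names referenced by the schema in this notebook
schema_columns = [
    "step", "conversation_id", "api_call_duration", "response_len", "prompt_len",
    "prompt_vector", "prompt", "response_vector", "response",
]

# Stand-in for list(conversations_df.columns); "response_len" is left out on purpose
example_columns = [c for c in schema_columns if c != "response_len"]
print(missing_columns(schema_columns, example_columns))
```

An empty result means every schema column is present.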

In [None]:
# Create the dataset from the conversation dataframe and schema
conv_ds = px.Dataset(conversations_df, schema, "production")

In [None]:
# Click the link below to open a Phoenix view of the ChatGPT data
px.launch_app(conv_ds)

**Collecting GPT Prompt & Response Data**

The dataframe analyzed in the previous section was collected using the code below.

In [None]:
!pip install openai

In [None]:
import openai

openai.api_key = "YOUR_OPEN_AI_KEY"

In [None]:
import time
import uuid

messages = []

In [None]:
print("This is the ChatGPT interface. Type in a prompt!")
print("The prompt/response data will be concatenated with the conversations_df dataframe.")
print("This cell can be run many times to concatenate more data.")
print("To exit, type CTRL-D")


def count_words(text):
    words = text.split()
    return len(words)


try:
    data = {
        "prompt": [],
        "response": [],
        "step": [],
        "conversation_id": [],
        "prediction_id": [],
        "api_call_duration": [],
        "prompt_len": [],
        "response_len": [],
    }
    # This ID identifies a single conversation thread
    conversation_id = uuid.uuid4().hex[:4]
    step = 0
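    while True:
        message = input("Prompt : ")
        if not message:
            # Skip empty input; otherwise `chat` would be undefined or stale
            continue
        messages.append(
            {"role": "user", "content": message},
        )
        start_time = time.time()
        # Legacy interface: openai.ChatCompletion requires openai<1.0
        chat = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
        end_time = time.time()
        reply = chat.choices[0].message.content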
        data["prediction_id"].append(str(uuid.uuid4())[:20])
        data["prompt"].append(message)
        data["response"].append(reply)
        data["step"].append(step)
        data["conversation_id"].append(conversation_id)
        data["api_call_duration"].append(end_time - start_time)
        data["prompt_len"].append(
            count_words(message)
        )  # Words / not tokens, but just a simple example
        data["response_len"].append(
            count_words(reply)
        )  # Words / not tokens, but just a simple example
        print(str(end_time - start_time))
        step += 1
        print(f"ChatGPT Response: {reply}")
        messages.append({"role": "assistant", "content": reply})
except Exception as e:
    print("Exiting Chat")
    print(str(e))
    df = pd.DataFrame(data)
    conversations_df = pd.concat([conversations_df, df])
    messages = []

In [None]:
conversations_df = conversations_df.reset_index(drop=True)

The example above is for testing purposes only; application-specific integrations will look different.
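
One way an application-specific integration might differ is by separating record building from the API call, so the logging logic can be reused and tested without network access. The sketch below is a minimal illustration under that assumption. `build_record` is a hypothetical helper, and the prompt, reply, and timing values are made up. The column names mirror those used in the collection loop.

```python
import uuid


def count_words(text):
    # Words, not tokens; a simple approximation as in the notebook
    return len(text.split())


def build_record(prompt, reply, step, conversation_id, duration):
    """Build one row matching the columns used in the notebook (hypothetical helper)."""
    return {
        "prompt": prompt,
        "response": reply,
        "step": step,
        "conversation_id": conversation_id,
        "prediction_id": str(uuid.uuid4())[:20],
        "api_call_duration": duration,
        "prompt_len": count_words(prompt),
        "response_len": count_words(reply),
    }


# Made-up values standing in for a real API call
record = build_record("What is Phoenix?", "A sample reply here.", 0, "ab12", 0.8)
print(record["prompt_len"], record["response_len"])
```

Rows built this way can be collected in a list and converted with `pd.DataFrame(rows)` before concatenating them with `conversations_df`.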