## Generate Responses and Embeddings with Dolly and Pythia on the Alpaca Dataset

This notebook generates responses to prompts from the Alpaca dataset using Dolly and Pythia.

Pythia is a family of decoder-only language models from [EleutherAI](https://www.eleuther.ai/). Check out their models on the Hugging Face Hub [here](https://huggingface.co/EleutherAI). Dolly is a version of Pythia that Databricks has fine-tuned on an instruction-following dataset similar to Alpaca.

Note: This notebook requires at least 12.1 GB of CPU memory. If you're using Colab, you'll need Colab Pro, and you should set the runtime to use a GPU and high RAM. Unfortunately, the free version of Colab only provides 10 GB of RAM, which isn't enough.

Let's get started. Install dependencies.

In [None]:
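# openai is pinned below 1.0 because the evaluation cell uses the legacy openai.ChatCompletion interface
!pip install -q accelerate arize-phoenix datasets "openai<1" git+https://github.com/huggingface/transformers.git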

Set your OpenAI API key.

In [None]:
import openai

openai.api_key = "your key here"
assert openai.api_key != "your key here", "Set your key"

In [None]:
from dataclasses import replace
import datetime
import locale
import re
import time
import uuid

from datasets import load_dataset
import openai
import pandas as pd
import phoenix as px
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

locale.getpreferredencoding = lambda: "UTF-8"  # This resolves a Colab bug that occurs sometimes.

Download a model (Dolly or Pythia) from Hugging Face and load it onto your device.

In [None]:
# choose a model type by commenting and uncommenting the following lines
model_type = "databricks/dolly-v2-3b"
# model_type = "EleutherAI/pythia-2.8b"

tokenizer = AutoTokenizer.from_pretrained(model_type, padding_side="left")
# Check if GPU is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model (the tokenizer was loaded above)
model = AutoModelForCausalLM.from_pretrained(model_type, device_map="auto")

# Move the model to the device
model.to(device)

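Look up the special tokens for `### Response:` and `### End` in the tokenizer so that generation stops once the model emits `### End`. If the tokenizer defines no such special tokens, `generate_kwargs` stays empty and generation runs until `max_length`.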

In [None]:
RESPONSE_KEY = "### Response:"
END_KEY = "### End"
tokenizer_response_key = next(
    (token for token in tokenizer.additional_special_tokens if token.startswith(RESPONSE_KEY)), None
)

generate_kwargs = {}
response_key_token_id = None
end_key_token_id = None
if tokenizer_response_key:
    try:
        response_key_token_id = tokenizer.encode(tokenizer_response_key)
        end_key_token_id = tokenizer.encode(END_KEY)

        # Ensure generation stops once it generates "### End"
        generate_kwargs["eos_token_id"] = end_key_token_id
    except ValueError:
        pass

Define a function that takes an entire text string consisting of many paragraphs and capitalizes the sentence or paragraph we are embedding, wrapping it in angle brackets.

**Example:**

"This is an example response from an LLM. The response contains multiple sentences. <WE ARE EMBEDDING THIS SENTENCE, WHICH IS WHY IT IS CAPITALIZED.> Hopefully this is clear."

In [None]:
def capitalize_sentence_in_text(generated_text, target_sentence):
    # Find the target sentence in the generated text
    sentence_start = generated_text.find(target_sentence)

    # Check if the target sentence is found in the generated text
    if sentence_start == -1:
        print("The target sentence was not found in the generated text.")
        return generated_text

    sentence_end = sentence_start + len(target_sentence)

    # Capitalize the target sentence
    capitalized_sentence = generated_text[sentence_start:sentence_end].upper()

    # Replace the target sentence with its capitalized version
    capitalized_text = (
        generated_text[:sentence_start]
        + "<"
        + capitalized_sentence
        + ">"
        + generated_text[sentence_end:]
    )

    return capitalized_text
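As a quick check of the behavior described above, the snippet below applies the same find-and-uppercase logic to a short string. It is a minimal, self-contained sketch: `mark_span` is a hypothetical stand-in that mirrors `capitalize_sentence_in_text` so the example runs on its own.

```python
def mark_span(text: str, target: str) -> str:
    """Uppercase the first occurrence of `target` in `text` and wrap it in <>."""
    start = text.find(target)
    if start == -1:
        # Target not present: return the text unchanged
        return text
    end = start + len(target)
    return text[:start] + "<" + text[start:end].upper() + ">" + text[end:]


text = "First sentence. We are embedding this sentence. Last sentence."
marked = mark_span(text, "We are embedding this sentence.")
print(marked)
```

Because `str.find` returns the first match, only the first occurrence is marked when the same sentence appears more than once in the response.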

In [None]:
# Takes a list of token IDs and splits it into paragraphs wherever two newline tokens occur in a row.
# Returns a list of lists, where each inner list holds the token IDs of one paragraph.
def split_paragraphs(generated_ids, tokenizer):
    # Define the newline token ID
    newline_token_id = tokenizer.encode("\n")[0]

    # Split the tokens into paragraphs
    paragraphs = []
    paragraph = []
    newline_count = 0
    total_tokens = 0
    for token in generated_ids:
        if token == newline_token_id:
            newline_count += 1
        else:
            newline_count = 0
        total_tokens += 1
        paragraph.append(token)

        if newline_count == 2:
            paragraphs.append(paragraph)
            paragraph = []
            newline_count = 0

    if paragraph:
        paragraphs.append(paragraph)
    print("Total Tokens")
    print(total_tokens)
    return paragraphs
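To make the splitting rule concrete, here is a minimal sketch on toy token IDs. The value `NEWLINE_ID = 0` and the name `split_on_double_newline` are assumptions made for illustration; the notebook's function obtains the real newline ID from the tokenizer.

```python
NEWLINE_ID = 0  # hypothetical token ID standing in for "\n"


def split_on_double_newline(token_ids, newline_id=NEWLINE_ID):
    """Split token IDs into paragraphs, closing one after two consecutive newline tokens."""
    paragraphs, current, newline_run = [], [], 0
    for token in token_ids:
        # Count consecutive newline tokens; any other token resets the run
        newline_run = newline_run + 1 if token == newline_id else 0
        current.append(token)
        if newline_run == 2:
            paragraphs.append(current)
            current, newline_run = [], 0
    if current:
        paragraphs.append(current)
    return paragraphs


toy_ids = [5, 6, 0, 0, 7, 0, 8, 9]
paragraphs = split_on_double_newline(toy_ids)
print(paragraphs)  # [[5, 6, 0, 0], [7, 0, 8, 9]]
```

The newline tokens stay inside the paragraph they close, so the paragraph lengths add up to the input length. This sketch assumes a blank line is encoded as two separate newline tokens; a tokenizer that merges `"\n\n"` into a single token would not trigger the split.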

Define a function to embed the prompt. The function takes the prompt text and returns the average of the last-layer hidden states over all prompt tokens.

In [None]:
def create_prompt_embedding(prompt, model):
    # Tokenize the prompt
    print(prompt)
    prompt_inputs = tokenizer(prompt, return_tensors="pt")

    # Move the input to the appropriate device
    prompt_inputs = {k: v.to(device) for k, v in prompt_inputs.items()}

    # Pass the prompt through the model
    prompt_output = model(**prompt_inputs, output_hidden_states=True)

    # Extract the hidden states
    prompt_hidden_states = prompt_output.hidden_states

    # The last hidden state is usually used as the embedding for the sequence
    # It has shape [batch_size, sequence_length, hidden_size]
    # To get an embedding for the entire sequence, you might average over the sequence length dimension
    prompt_embedding = prompt_hidden_states[-1][0].detach().cpu().mean(dim=0)
    # Convert the prompt_embedding tensor to a NumPy array
    prompt_embedding_np = prompt_embedding.numpy()
    return prompt_embedding_np
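The averaging step can be illustrated without a model. The sketch below uses plain Python lists in place of tensors; `mean_pool` is a hypothetical helper that does what `.mean(dim=0)` does on a `[sequence_length, hidden_size]` tensor.

```python
def mean_pool(token_vectors):
    """Average a [sequence_length][hidden_size] list of vectors over the sequence axis."""
    seq_len = len(token_vectors)
    hidden_size = len(token_vectors[0])
    return [
        sum(vec[j] for vec in token_vectors) / seq_len
        for j in range(hidden_size)
    ]


# Toy "last hidden state" for a 3-token prompt with hidden_size 2
last_hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
embedding = mean_pool(last_hidden)
print(embedding)  # [3.0, 4.0]
```

The result has one value per hidden dimension, regardless of how many tokens the prompt contains.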

Define a function that takes the generated token IDs and their hidden states and returns a dataframe with one row per response paragraph longer than 10 tokens. Each row holds the paragraph text and its embedding, the full response text, and the prompt text and prompt embedding.

In [None]:
def create_conversation_embeddings(
    prompt_len, generated_ids, hidden_states, model, tokenizer, prompt, prompt_category
):
    # Find sentence boundaries based on tokenized output
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    sentences = split_paragraphs(generated_ids[prompt_len:], tokenizer)
    # sentence_boundary_pattern = r'(?<=[\.\?!])\s|\n'
    # paragraph_boundary_pattern = r'\n\s*\n'
    # sentences = re.split(paragraph_boundary_pattern, generated_text)
    print("Total Hidden Length")
    print(len(hidden_states))
    print("Number of Paragraphs")
    print(len(sentences))

    # Compute sentence embeddings by averaging the embeddings of each token in a sentence
    sentence_embeddings = []
    sentences_texts = []
    capitialized_paragraphs = []
    response_text = []
    start = 0
    for sentence in sentences:
        tokenized_sentence = sentence
        text_sentence = tokenizer.decode(sentence, skip_special_tokens=True)
        # Average all the hidden states for the tokens for the sentence
        print("Len of paragraph")
        print(len(tokenized_sentence))
        num_tokens = len(tokenized_sentence)
        hidden_size = hidden_states[0][-1].shape[-1]

        # Initialize a tensor to store the sum of the hidden states
        sum_hidden_states = torch.zeros((hidden_size))

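        # Iterate through the tokens and sum up the hidden states from the last layer
        for i in range(num_tokens):
            # Take the last-layer hidden state at the final position of each generation step.
            # The first step covers the whole prompt, so index the last position, not the first.
            sum_hidden_states += hidden_states[start + i][-1][0][-1].cpu()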
        # Divide by the number of tokens to get the average
        sentence_embedding = sum_hidden_states / num_tokens

        # Only grab paragraphs above 10 tokens.
        if num_tokens > 10:
            sentence_embeddings.append(sentence_embedding)
            sentences_texts.append(text_sentence)
            capitialized_paragraphs.append(
                capitalize_sentence_in_text(generated_text, text_sentence)
            )
            response_text.append(generated_text)

        start += num_tokens

    prompt_embedding = create_prompt_embedding(prompt, model)

    # Convert sentence_embeddings to NumPy arrays on the CPU
    cpu_sentence_embeddings = [embedding.cpu().numpy() for embedding in sentence_embeddings]

    uid = str(uuid.uuid4())[:20]

    # Create a list of UIDs for each row
    uids = [uid] * len(sentences_texts)

    # Create a list of prompts for each row
    prompts = [prompt] * len(sentences_texts)

    prompt_category_list = [prompt_category] * len(sentences_texts)

    prompt_embedding_list = [prompt_embedding] * len(sentences_texts)
    print("prompt embedding")
    print(prompt_embedding_list)
    print(prompt_embedding)
    # Create a DataFrame from the lists of cpu_sentence_embeddings, sentence_texts, capitalized_texts, and UIDs
    data = {
        "conversation_id": uids,
        "prompt": prompts,
        "prompt_embedding": prompt_embedding_list,
        "response_paragraph": sentences_texts,
        "response_capitalized": capitialized_paragraphs,
        "response_text": response_text,
        "paragraph_embedding": cpu_sentence_embeddings,
        "prompt_category": prompt_category_list,
    }
    df = pd.DataFrame(data)
    return df
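The core of the function above is a running offset into the per-token hidden states: each paragraph consumes as many vectors as it has tokens, and short paragraphs are skipped without losing alignment. A minimal sketch with plain lists follows; `pool_paragraphs` and its `min_tokens` parameter are hypothetical names, and the notebook itself keeps only paragraphs with more than 10 tokens.

```python
def pool_paragraphs(step_vectors, paragraph_lengths, min_tokens=1):
    """Average per-token vectors over consecutive paragraph spans."""
    embeddings, start = [], 0
    for length in paragraph_lengths:
        span = step_vectors[start:start + length]
        if length >= min_tokens:
            hidden_size = len(span[0])
            embeddings.append(
                [sum(vec[j] for vec in span) / length for j in range(hidden_size)]
            )
        # Advance the offset even when the paragraph is skipped
        start += length
    return embeddings


# One toy vector per generated token (hidden_size 2), paragraphs of 2 and 3 tokens
steps = [[0.0, 2.0], [2.0, 4.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
pooled = pool_paragraphs(steps, [2, 3])
print(pooled)  # [[1.0, 3.0], [2.0, 2.0]]
```

Calling `pool_paragraphs(steps, [2, 3], min_tokens=3)` drops the first paragraph but still averages the correct three vectors for the second.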

In [None]:
column_names = [
    "conversation_id",
    "prompt",
    "response_paragraph",
    "response_capitalized",
    "paragraph_embedding",
    "prompt_category",
]
df1 = pd.DataFrame(columns=column_names)

Load the Alpaca dataset and format prompts from the data.

In [None]:
alpaca_df = load_dataset("tatsu-lab/alpaca", split="train").to_pandas()
alpaca_df["prompt"] = (
    "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n"
    + "### Instruction:\n"
    + alpaca_df["instruction"]
    + "\n### Input:\n"
    + alpaca_df["input"]
    + "\n### Response:\n"
)
alpaca_df
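For a single row, the prompt template can be sketched as a plain function. `format_prompt` is a hypothetical helper, and the newline placed after the instruction and after the input is an assumption made here so that each section header starts on its own line.

```python
HEADER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
)


def format_prompt(instruction: str, input_text: str = "") -> str:
    """Build an Alpaca-style prompt string from one instruction and optional input."""
    return (
        HEADER
        + "### Instruction:\n" + instruction + "\n"
        + "### Input:\n" + input_text + "\n"
        + "### Response:\n"
    )


prompt = format_prompt("Give three tips for staying healthy.")
print(prompt)
```

Rows with an empty `input` field still get an `### Input:` header followed by a blank line, which matches the column-wise concatenation used in the notebook.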

In [None]:
sample_alpaca_df = alpaca_df[:60]

Iterate through each prompt, generate a response using the model, and create an embedding from the response.

In [None]:
for index, row in sample_alpaca_df.iterrows():
    print("Index " + str(index))

    generated_inputs = tokenizer(row["prompt"], return_tensors="pt")
    input_ids = generated_inputs["input_ids"].to(device)
    attention_mask = generated_inputs["attention_mask"].to(device)
    prompt_len = len(input_ids[0])

    model_data_output = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.9,
        attention_mask=attention_mask,
        max_length=250,
        output_hidden_states=True,
        return_dict_in_generate=True,
        pad_token_id=tokenizer.pad_token_id,
        **generate_kwargs,
    )
    generated_text = tokenizer.decode(model_data_output.sequences[0])
    print(generated_text)
    df_2 = create_conversation_embeddings(
        prompt_len,
        model_data_output.sequences[0],
        model_data_output.hidden_states,
        model,
        tokenizer,
        row["prompt"],
        "",
    )
    # Concatenate the empty DataFrame with the new DataFrame
    df1 = pd.concat([df1, df_2], ignore_index=True)

Evaluate each prompt and response-paragraph pair with a call to the OpenAI API.

In [None]:
def evaluate_question_answer_pair(question, answer):
    prompt = f"You are an evaluation model that evaluates the accuracy of the response of question and answer pairs. Please score the result from 0-1 based on how good the answer is, where 0 is the worst. Question:\n{question}\nAnswer:\n{answer}\nPlease return a number from 0-1:"

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(prompt)
    print(response.choices[0])
    time.sleep(2)  # avoid rate-limiting from the OpenAI API
    gpt_response = response.choices[0].message.content.strip()
    return gpt_response


def evaluate_response_for_dataframe(df):
    df["evals"] = df.apply(
        lambda row: evaluate_question_answer_pair(row["prompt"], row["response_paragraph"]), axis=1
    )
    return df


post_eval_df = evaluate_response_for_dataframe(df1)
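The fixed `time.sleep(2)` above is one way to stay under a rate limit. An alternative, sketched here and not part of the notebook, is to retry a failed call with exponentially growing delays. `call_with_backoff` is a hypothetical helper; catching every `Exception` is a simplification, and in practice you would catch only the client's rate-limit error.

```python
import time


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Re-raise once the final attempt has failed
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Toy usage: a function that fails twice before succeeding
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "0.8"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # 0.8 [1.0, 2.0]
```

Passing `sleep=delays.append` keeps the example instantaneous; with the default `time.sleep` the same call would wait 1 second and then 2 seconds before succeeding.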

Convert the `evals` column to float and replace non-numeric values with 0.

In [None]:
post_eval_df["evals"] = pd.to_numeric(post_eval_df["evals"], errors="coerce").fillna(0)
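Note that `pd.to_numeric(..., errors="coerce")` turns any reply that is not a bare number, such as `"Score: 0.5"`, into `NaN` and therefore into 0. If that proves too strict, one possible alternative (an assumption, not what the notebook does) is to extract the first number from the reply. `parse_score` below is a hypothetical helper.

```python
import re


def parse_score(reply: str, default: float = 0.0) -> float:
    """Return the first number in `reply` if it lies in [0, 1], else `default`."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return default
    value = float(match.group())
    return value if 0.0 <= value <= 1.0 else default


scores = [parse_score(r) for r in ["0.8", "Score: 0.5", "N/A", "7"]]
print(scores)  # [0.8, 0.5, 0.0, 0.0]
```

Out-of-range numbers fall back to the default, which keeps the mean score within the 0 to 1 scale requested in the evaluation prompt.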

Calculate the mean evaluation score.

In [None]:
post_eval_df["evals"].mean()

Save a copy of the dataframe, with the embeddings serialized as strings, to a timestamped CSV file.

In [None]:
# get a formatted timestamp
now = datetime.datetime.now()
timestamp = now.strftime("%Y-%m-%d_%H-%M-%S")

save_df = post_eval_df.copy()
save_df["prompt_embedding_vec"] = save_df["prompt_embedding"].apply(lambda x: str(x.tolist()))
save_df["paragraph_embedding_vec"] = save_df["paragraph_embedding"].apply(lambda x: str(x.tolist()))

# Create the file name with date and time and save
file_name = f'{model_type.split("/")[1]}_{timestamp}'
file_name += ".csv"
save_df.to_csv(file_name, index=False)
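Because the `*_vec` columns store each embedding as the string form of a Python list, they need to be parsed when the CSV is read back. The sketch below shows the round trip with the standard library only, using an in-memory buffer and a toy embedding in place of the real file.

```python
import ast
import csv
import io

# Toy row: an embedding stored as the string form of a Python list
embedding = [0.25, -1.5, 3.0]
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["prompt", "prompt_embedding_vec"])
writer.writerow(["hello", str(embedding)])

# Read the CSV text back and parse the stringified list into floats
buffer.seek(0)
rows = list(csv.DictReader(buffer))
restored = ast.literal_eval(rows[0]["prompt_embedding_vec"])
print(restored)  # [0.25, -1.5, 3.0]
```

With pandas, the same parsing can be done by applying `ast.literal_eval` to the column after `pd.read_csv`.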

Optionally launch Phoenix to visualize your embedding data and get a sanity check.

In [None]:
schema = px.Schema(
    prompt_column_names=px.EmbeddingColumnNames(
        raw_data_column_name="prompt", vector_column_name="prompt_embedding"
    ),
    response_column_names=px.EmbeddingColumnNames(
        raw_data_column_name="response_paragraph", vector_column_name="paragraph_embedding"
    ),
    tag_column_names=[
        "prompt_category",
        "conversation_id",
        "response_capitalized",
        "response_text",
    ],
)

In [None]:
model_name = model_type.split("/")[1]
ds = px.Dataset(dataframe=post_eval_df, schema=schema, name=model_name)

In [None]:
session = px.launch_app(ds)