# Custom QA-bot Project

The data selected for the project is `character_descriptions.csv`, which contains character descriptions together with each character's medium and setting (location). The dataset consists mostly of free-text descriptions whose information spans multiple sentences, which makes it well suited to GPT models. A chatbot user can ask a range of questions about these characters and get answers grounded in the dataset rather than in the model's general knowledge. This is the primary reason for selecting this dataset.

## Data Wrangling


In [2]:
import os
import pandas as pd

In [3]:
os.listdir("./data")

['.ipynb_checkpoints',
 'nyc_food_scrap_drop_off_sites.csv',
 'character_descriptions.csv',
 '2023_fashion_trends.csv']

In [25]:
raw_data = pd.read_csv("./data/character_descriptions.csv")

In [29]:
raw_data.head()

Unnamed: 0,Name,Description,Medium,Setting,text
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England,"A young woman in her early 20s, Emily is an as..."
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England,"A middle-aged man in his 40s, Jack is a succes..."
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England,"A woman in her late 30s, Alice is a warm and n..."
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England,"A man in his 50s, Tom is a retired soldier and..."
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England,"A woman in her mid-20s, Sarah is a free-spirit..."


In [77]:
# Join the description, medium, and setting into a single "text" column
raw_data["text"] = raw_data["Description"] + "This charactor is featured in " + raw_data["Medium"] + " and location setting at "+raw_data["Setting"]
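
The concatenation above puts no separator between the description and the appended sentence, so the two run together in the resulting text. A minimal sketch of the same join with explicit separators follows. The `build_text` helper and its output wording are illustrative assumptions, not part of the notebook.

```python
def build_text(description: str, medium: str, setting: str) -> str:
    """Join one character's fields into a single passage with clean spacing.

    Hypothetical helper: the "Medium: ... Setting: ..." wording is an
    illustrative choice, not the notebook's original phrasing.
    """
    return f"{description.strip()} Medium: {medium.strip()}. Setting: {setting.strip()}."


# Example with values taken from the dataset preview
print(build_text("He's married to Alice.", "Play", "England"))
```

With pandas, the helper could be applied row by row, for example `raw_data.apply(lambda r: build_text(r["Description"], r["Medium"], r["Setting"]), axis=1)`.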

In [78]:
df = raw_data["text"]
df.iloc[0]

"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.This charactor is featured in Play and location setting at England"

In [79]:
# Keep only the first 25 rows (one character per row)
df = pd.DataFrame(df[:25], columns=["text"])
df.head()

Unnamed: 0,text
0,"A young woman in her early 20s, Emily is an as..."
1,"A middle-aged man in his 40s, Jack is a succes..."
2,"A woman in her late 30s, Alice is a warm and n..."
3,"A man in his 50s, Tom is a retired soldier and..."
4,"A woman in her mid-20s, Sarah is a free-spirit..."


## Query Answering

Create a basic chat function using `gpt-3.5-turbo-instruct`, without any grounding in the dataset.

In [38]:
import openai
openai.api_key = ""  # add your own API key here before running

In [39]:
#direct chat with model without grounding
def basic_chat(query):
    prompt = f"""
    Question: {query}
    Answer:
    """
    answer = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=150
    )["choices"][0]["text"].strip()
    print(answer)

## Create RAG-based question answering

#### Steps

1. Embedding creation
2. Similarity search
3. Adding the retrieved context to the prompt
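
The three steps can be sketched end to end without any API call. In the sketch below the "embedding" is only a set of lowercase words and the distance is a Jaccard distance; both are stand-ins chosen so the example runs offline, not the embedding model or metric used in this notebook. All function names are hypothetical.

```python
import re


def toy_embed(text):
    """Stand-in embedding: the set of lowercase words (NOT a real embedding model)."""
    return frozenset(re.findall(r"[a-z']+", text.lower()))


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; smaller means more similar."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


def build_grounded_prompt(question, rows):
    # 1. Embedding creation: one vector per row, plus one for the question
    row_vectors = [toy_embed(row) for row in rows]
    question_vector = toy_embed(question)
    # 2. Similarity search: pick the row closest to the question
    best = min(range(len(rows)),
               key=lambda i: jaccard_distance(question_vector, row_vectors[i]))
    # 3. Add the retrieved row to the prompt as context
    return f"Context: {rows[best]}\n\nQuestion: {question}\nAnswer:"


rows = ["Emily is an aspiring actress.", "Jack is a successful businessman."]
print(build_grounded_prompt("Who is Jack?", rows))
```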

In [80]:
#generate embeddings
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 25
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,"A young woman in her early 20s, Emily is an as...","[-0.009494630619883537, -0.014824042096734047,..."
1,"A middle-aged man in his 40s, Jack is a succes...","[0.011266863904893398, -0.019644448533654213, ..."
2,"A woman in her late 30s, Alice is a warm and n...","[0.0067794108763337135, -0.010819331742823124,..."
3,"A man in his 50s, Tom is a retired soldier and...","[0.01814238354563713, -0.015497720800340176, -..."
4,"A woman in her mid-20s, Sarah is a free-spirit...","[-0.012886534444987774, -0.024789365008473396,..."
5,"A man in his early 30s, George is a charming a...","[-0.01977045275270939, -0.015234489925205708, ..."
6,"A woman in her late 20s, Rachel is a shy and i...","[-0.003998512867838144, -0.010049359872937202,..."
7,"A man in his 60s, John is a retired professor ...","[0.021669458597898483, -0.014908378012478352, ..."
8,"A middle-aged Latina woman in her 40s, Maria i...","[-0.006386805325746536, -0.00761336600407958, ..."
9,"A young African American man in his early 20s,...","[0.0062502180226147175, -0.026581238955259323,..."
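
The batching loop in the embedding cell can be factored into a small helper that is testable without an API key. This is a sketch: `embed_in_batches` and the `embed_batch` callable are hypothetical names, and the fake embedder below only stands in for the real embedding request.

```python
def embed_in_batches(texts, embed_batch, batch_size=25):
    """Call `embed_batch` on consecutive slices of `texts`, keeping row order.

    `embed_batch` is any callable mapping a list of strings to a list of
    vectors of the same length (for example, a wrapper around an API call).
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embedder returned a different number of vectors than inputs")
        vectors.extend(batch_vectors)
    return vectors


def fake_embed(batch):
    """Stand-in embedder: a one-dimensional 'vector' holding the text length."""
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
# → [[1.0], [2.0], [3.0]]
```

The length check guards against silently misaligned rows when the result is assigned back to the dataframe column.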


In [81]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns the text in dataframe 
    which is most relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    # Return only the top row, since each row describes a separate character
    return df_copy.iloc[0].text
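
The `openai.embeddings_utils` module imported above belongs to the pre-1.0 `openai` package and appears to have been dropped from the 1.x releases. The cosine-distance ranking it provides is small enough to write by hand. The sketch below uses plain Python lists and assumes non-zero vectors; `cosine_distance` and `rank_by_distance` are hypothetical names.

```python
import math


def cosine_distance(u, v):
    """1 - cos(u, v); 0 means same direction, 2 means opposite. Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_distance(query_vector, row_vectors):
    """Return row indices ordered from most to least relevant (ascending distance)."""
    distances = [cosine_distance(query_vector, v) for v in row_vectors]
    return sorted(range(len(row_vectors)), key=distances.__getitem__)


print(rank_by_distance([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]))
# → [1, 0, 2]
```

Taking the first index of the ranking reproduces the "most relevant row only" behaviour of `get_rows_by_relevance`.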

In [82]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
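    # Create a tokenizer that is designed to align with our embeddings.
    # Note: the tokenizer and max_token_count are not used yet; only the
    # single most relevant row is added as context, so no budget is enforced.
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Prompt template with placeholders for the context and the question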
    prompt_template = """
    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context: 

    {}

    ---

    Question: {}
    Answer:"""

    context = get_rows_by_relevance(question, df)
    return prompt_template.format(context, question)
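
`create_prompt` accepts `max_token_count` but never applies it. If more than one row were used as context, the budget could be enforced roughly as sketched below. `fit_context` is a hypothetical helper, and the default whitespace token counter is only an approximation so the example runs offline; with `tiktoken`, `count_tokens=lambda s: len(tokenizer.encode(s))` could be passed instead.

```python
def approx_token_count(text):
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_context(rows, question, max_token_count, count_tokens=approx_token_count):
    """Keep rows (already sorted by relevance) until the token budget is spent."""
    used = count_tokens(question)
    kept = []
    for row in rows:
        cost = count_tokens(row)
        if used + cost > max_token_count:
            break  # the next row would exceed the budget
        kept.append(row)
        used += cost
    return "\n\n###\n\n".join(kept)


rows = ["one two three", "four five", "six seven eight nine"]
print(fit_context(rows, "who is it", max_token_count=8))
```

This sketch ignores the tokens of the prompt template itself; a fuller version would count those as well.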
    

In [83]:
print(create_prompt("tell me about the charactor Emily", df, 100))


    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context: 

    A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.This charactor is featured in Play and location setting at England

    ---

    Question: tell me about the charactor Emily
    Answer:


In [110]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def custom_chat(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

## Performance Demonstration


### Question 1

In [96]:
query_1 = "Explain personality of Emily in two sentences?"

In [97]:
#basic prompting
basic_chat(query_1)

Emily is a complex and enigmatic character whose experiences and relationships shape her into a reclusive, controlling, and possibly disturbed person who is deeply attached to her family's past.


In [111]:
#chat with RAG/grounding
custom_chat(query_1, df)

'Emily has a bubbly personality and a quick wit, but she also struggles with self-doubt and insecurity.'

In [99]:
#check ground truth
df.iloc[0].text

"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.This charactor is featured in Play and location setting at England"

### Question 2

In [124]:
query_2 = "Jack is married to whom?"

In [125]:
#basic prompting
basic_chat(query_2)

There is not enough information given to answer this question. We need to know more about Jack and his relationships in order to identify his spouse.


In [127]:
#chat with RAG/grounding
custom_chat(query_2, df)

'Alice'

In [130]:
#check ground truth
df.iloc[1].text

"A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.This charactor is featured in Play and location setting at England"

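#### Observations

1. With RAG/grounding, the context retrieved through embedding similarity lets the model answer both questions correctly from the custom dataset, whereas the ungrounded model either describes a different Emily (Question 1) or reports that it lacks information (Question 2).
2. Each grounded response was validated by cross-checking it against the relevant row of the dataset.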
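
The manual cross-check could be complemented by a crude automated one. The sketch below measures the fraction of answer words that also occur in the source row. `grounding_ratio` is a hypothetical helper and word overlap is only a rough proxy for correctness, not the validation method used above.

```python
import re


def grounding_ratio(answer, source_text):
    """Fraction of distinct answer words that also appear in the source text."""
    def words(text):
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    answer_words = words(answer)
    if not answer_words:
        return 0.0  # an empty answer counts as ungrounded
    return len(answer_words & words(source_text)) / len(answer_words)


source = "He's married to Alice."
print(grounding_ratio("Alice", source))  # → 1.0
print(grounding_ratio("Bob", source))    # → 0.0
```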