# Custom Chatbot Project

I want to build a chatbot application that answers questions about the rapper Eminem. The LLM (gpt-3.5-turbo-instruct) has training data only up to September 2021, so it lacks recent information, especially about events from the past year. I therefore use retrieval-augmented generation (RAG) to add up-to-date context to the custom query before sending it to the LLM.

I chose the Eminem Wikipedia page as the data source because it has rich, factual content that covers most related questions. Since the content is a web page, I use requests + BeautifulSoup to prepare the data.

Wikipedia source: https://en.wikipedia.org/wiki/Eminem

In [125]:
# Environment variables
OPENAI_API_KEY = 'YOUR_API_KEY'

# URLs and file paths
SOURCE_URL = 'https://en.wikipedia.org/wiki/Eminem'
CSV_FILEPATH_WITH_EMBEDDINGS = './eminem_embeddings.csv'

# OpenAI Models
EMBEDDING_MODEL = 'text-embedding-ada-002'
COMPLETION_MODEL = 'gpt-3.5-turbo-instruct'

# Batch size for processing
BATCH_SIZE = 25

import requests
from bs4 import BeautifulSoup
import pandas as pd 

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [126]:
# Making a GET request
response = requests.get(SOURCE_URL)

# print the status code
# print(response.status_code)

# print the content of the response
# print(response.content)

# Save the raw HTML locally so the page can be re-parsed without another request
with open("eminem.html", mode='wb') as file:
    file.write(response.content)
    
with open("eminem.html", encoding='utf-8') as fp:
    eminem_content = BeautifulSoup(fp, 'html.parser')

In [127]:
# Collect the text of every <p> element
items = [item.text.strip() for item in eminem_content.find_all('p')]
items = list(filter(lambda item: (len(item) != 0), items))
df = pd.DataFrame()
df['text'] = items
df

| | text |
|---|---|
| 0 | Marshall Bruce Mathers III (born October 17, 1... |
| 1 | After the release of his debut album Infinite ... |
| 2 | Eminem was also a member of the hip-hop groups... |
| 3 | Eminem is among the best-selling music artists... |
| 4 | Marshall Bruce Mathers III was born on October... |
| ... | ... |
| 89 | Eminem has also been included and ranked in se... |
| 90 | Solo studio albums |
| 91 | D12 studio albums |
| 92 | As a headliner |
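The last rows above ("Solo studio albums", "D12 studio albums", "As a headliner") look like list headings rather than real paragraphs, so they add little context for retrieval. A minimal sketch of one way to drop them before embedding, assuming a simple word-count threshold is good enough; the helper name `drop_short_fragments` and the threshold of 5 words are hypothetical choices, not part of the original notebook:

```python
def drop_short_fragments(paragraphs, min_words=5):
    """Keep only paragraphs with at least `min_words` whitespace-separated words."""
    return [p for p in paragraphs if len(p.split()) >= min_words]


# Sample rows copied (still truncated) from the dataframe output above
sample = [
    "Eminem is among the best-selling music artists...",
    "Solo studio albums",
    "D12 studio albums",
    "As a headliner",
]

kept = drop_short_fragments(sample)
print(kept)
```

In the notebook, the same filter could be applied to `items` before building the dataframe.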


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [128]:
# Setup API creds
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = OPENAI_API_KEY

In [129]:
batch_size = BATCH_SIZE  # reuse the batch size from the configuration cell
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL
    )
    
    embeddings.extend([data['embedding'] for data in response['data']])

df['embeddings'] = embeddings
df

| | text | embeddings |
|---|---|---|
| 0 | Marshall Bruce Mathers III (born October 17, 1... | [-0.032345566898584366, -0.026841014623641968,... |
| 1 | After the release of his debut album Infinite ... | [-0.015536091290414333, -0.019841734319925308,... |
| 2 | Eminem was also a member of the hip-hop groups... | [-0.01941017061471939, -0.04326850548386574, -... |
| 3 | Eminem is among the best-selling music artists... | [-0.029969148337841034, -0.030934298411011696,... |
| 4 | Marshall Bruce Mathers III was born on October... | [-0.02432360127568245, -0.026493549346923828, ... |
| ... | ... | ... |
| 89 | Eminem has also been included and ranked in se... | [-0.026019081473350525, -0.029423678293824196,... |
| 90 | Solo studio albums | [-0.027343137189745903, -0.017208876088261604,... |
| 91 | D12 studio albums | [-0.020676232874393463, -0.01961451768875122, ... |
| 92 | As a headliner | [-0.018161434680223465, -0.005976084619760513,... |
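The batching loop above can be isolated into a small helper, which makes it easy to check that every row ends up with exactly one embedding. This is only a sketch: `iter_batches` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in that returns dummy vectors instead of calling the OpenAI API.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(texts):
    """Stand-in for the embedding call: returns one dummy 2-d vector per text."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


# 93 placeholder rows, matching the row count of the dataframe above
texts = [f"paragraph {i}" for i in range(93)]

embeddings = []
for batch in iter_batches(texts, 50):
    vectors = fake_embed(batch)
    # One vector per input text, in the same order
    assert len(vectors) == len(batch)
    embeddings.extend(vectors)

print(len(embeddings))
```

With 93 rows and a batch size of 50, this makes two calls (50 rows, then 43).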


In [130]:
df.to_csv(CSV_FILEPATH_WITH_EMBEDDINGS)
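One thing to watch when reloading this file: CSV stores each embedding as text, so the `embeddings` column comes back as strings and has to be parsed into lists of floats before computing distances. The sketch below shows the round trip with only the standard library (an in-memory buffer instead of a real file, and a made-up 3-value embedding). With pandas, the equivalent would likely be `pd.read_csv(CSV_FILEPATH_WITH_EMBEDDINGS, index_col=0)` followed by `df['embeddings'].apply(ast.literal_eval)`.

```python
import ast
import csv
import io

# Write one row the way a CSV export would: the list becomes its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["Solo studio albums", str([0.1, -0.2, 0.3])])  # made-up vector

# Read it back: every field is now a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["embeddings"]

# Parse the string back into a list of floats
restored = ast.literal_eval(raw)
print(type(raw).__name__, type(restored).__name__)
```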

In [131]:
# Create tokenizer as well as prompt template
# question - original user prompt
# df - sorted data frame by ascending distance
# max_token_count - max allowed tokens to send to OpenAI
import tiktoken
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
        Answer the question based on the context below, and if the question
        can't be answered based on the context, say "I don't know"

        Context: 

        {}

        ---

        Question: {}
        Answer:
    """
    
    if df is None:
        return prompt_template.format('No context', question)
    
    print(f"Question has this many tokens: {len(tokenizer.encode(question))}")
    print(f"Prompt template has this many tokens: {len(tokenizer.encode(prompt_template))}")
    remaining_token_count = max_token_count - len(tokenizer.encode(question)) - len(tokenizer.encode(prompt_template))
    print(f"Remaining token count is: {remaining_token_count}")
    text_ary = []
    for text in df['text'].tolist():
        if remaining_token_count > len(tokenizer.encode(text)):
            remaining_token_count = remaining_token_count - len(tokenizer.encode(text))
            text_ary.append(text)
        else:
            break
    
    context = "\n\n###\n\n".join(text_ary)
    return prompt_template.format(context, question)
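The context-selection loop in `create_prompt` is a greedy fill: it walks the rows from closest to farthest and stops at the first one that does not fit the remaining budget. A minimal sketch of just that step, assuming a whitespace word count as a stand-in for the real `tiktoken` tokenizer; `count_tokens` and `pack_context` are hypothetical names:

```python
def count_tokens(text):
    # Assumption: whitespace-separated words stand in for real tokens
    return len(text.split())


def pack_context(texts, budget):
    """Greedily take texts in order until the next one would not fit the budget."""
    selected = []
    remaining = budget
    for text in texts:
        cost = count_tokens(text)
        if cost >= remaining:
            # Same rule as create_prompt: stop at the first text that does not fit
            break
        selected.append(text)
        remaining -= cost
    return selected, remaining


ranked = ["a b c", "d e f g", "h i j k l m", "n"]
selected, remaining = pack_context(ranked, budget=8)
print(selected, remaining)
```

Note that the loop stops at the first oversized row, so a later, shorter row is never considered. That keeps the most relevant rows together but can leave part of the budget unused.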

In [132]:
from openai.embeddings_utils import get_embedding
from openai.embeddings_utils import distances_from_embeddings

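def sort_df_by_asc_distance(q, df):
    """Return a copy of df sorted by cosine distance to the question (closest first)."""
    q_embedding = get_embedding(q, engine=EMBEDDING_MODEL)
    distances = distances_from_embeddings(q_embedding, df['embeddings'].tolist(), distance_metric='cosine')
    # assign() returns a new dataframe, so the caller's df is not modified
    return df.assign(distances=distances).sort_values(by='distances')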
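To make the ranking step concrete, here is a small pure-Python sketch of cosine distance, defined as 1 minus the cosine similarity, which as far as I know is what the `'cosine'` metric computes here. The 2-d vectors are made up for illustration; real `text-embedding-ada-002` embeddings have many more dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
docs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Smallest distance first, like sort_values(by='distances')
ranked = sorted(docs, key=lambda name: cosine_distance(query, docs[name]))
print(ranked)
```

The distance depends only on direction, not on vector length, so `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]`.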


In [133]:
import openai

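def get_response_from_openai(q, df, use_template):
    """
    Query the Completion model in one of three modes:
    - use_template=False: send the raw question as the prompt
    - use_template=True and df is None: use the prompt template without context
    - use_template=True and df given: use the template with retrieved context (RAG)
    """
    if not use_template:
        return openai.Completion.create(model=COMPLETION_MODEL, prompt=q, max_tokens=2000)
    if df is None:
        custom_prompt = create_prompt(q, None, 1500)
    else:
        # Rank the rows by relevance, then pack as many as fit into 1500 tokens
        sorted_df = sort_df_by_asc_distance(q, df)
        # print(sorted_df.head())
        custom_prompt = create_prompt(q, sorted_df, 1500)
        # print(custom_prompt)
    return openai.Completion.create(model=COMPLETION_MODEL, prompt=custom_prompt, max_tokens=2000)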
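The prompt and the completion share one context window, so the 1500-token prompt budget plus `max_tokens=2000` has to stay within the model's limit. A minimal sketch of that check, assuming a 4096-token window for the completion model (please verify this against the current model documentation); `completion_budget` is a hypothetical helper, not part of the OpenAI library:

```python
CONTEXT_WINDOW = 4096  # assumption: context size of the completion model


def completion_budget(prompt_tokens, requested_max_tokens, context_window=CONTEXT_WINDOW):
    """Return the largest max_tokens value that still fits next to the prompt."""
    available = context_window - prompt_tokens
    if available <= 0:
        raise ValueError("Prompt alone exceeds the context window")
    return min(requested_max_tokens, available)


# The settings used above: up to 1500 prompt tokens, 2000 completion tokens
print(completion_budget(1500, 2000))

# A longer prompt would force a smaller completion
print(completion_budget(3000, 2000))
```

Under this assumption the notebook's settings fit, since 1500 + 2000 is below 4096.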
    


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [135]:
q_1 = '''What year did Eminem's Shady Records sign Ez Mil in a direct joint deal with Aftermath Entertainment and Interscope Records?'''
response_1 = get_response_from_openai(q_1, df, True)
response_2 = get_response_from_openai(q_1, None, True)
response_3 = get_response_from_openai(q_1, None, False)
# df
print(f"#### With RAG, the response is #### \n {response_1['choices'][0]['text']} \n\n")
print(f"#### Without RAG, the response is #### \n {response_2['choices'][0]['text']} \n\n")
print(f"#### Without RAG without template, the response is #### \n {response_3['choices'][0]['text']} \n\n")

Question has this many tokens: 27
Prompt template has this many tokens: 50
Remaining token count is: 1423
#### With RAG, the response is #### 
 
        In July 2023. 


#### Without RAG, the response is #### 
  


#### Without RAG without template, the response is #### 
 
Eminem's Shady Records did not sign Ez Mil in a direct joint deal with Aftermath Entertainment and Interscope Records. This information is not accurate and there is no evidence to support it. 
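The answers above come back with leading whitespace, and the template-only answer is empty, which makes the comparison harder to read. A small sketch of a helper that tidies this up, assuming the response has the same `['choices'][0]['text']` shape used in the print statements; `extract_answer` and the placeholder text are hypothetical, and the dictionaries below are hand-made stand-ins rather than real API responses:

```python
def extract_answer(response, placeholder="(empty response)"):
    """Return the stripped completion text, or a placeholder if it is blank."""
    text = response["choices"][0]["text"].strip()
    return text if text else placeholder


# Hand-made stand-ins that mimic the shape of the responses above
with_rag = {"choices": [{"text": "\n        In July 2023. \n"}]}
template_only = {"choices": [{"text": "  \n"}]}

print(extract_answer(with_rag))
print(extract_answer(template_only))
```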




### Question 2

In [136]:
q_2 = '''What year did Eminem appear alongside Roger Goodell at the opening ceremony of the NFL draft in Detroit?'''

In [None]:
response_2_1 = get_response_from_openai(q_2, df, True)
response_2_2 = get_response_from_openai(q_2, None, True)
response_2_3 = get_response_from_openai(q_2, None, False)
# df
print(f"#### With RAG, the response is #### \n {response_2_1['choices'][0]['text']} \n\n")
print(f"#### Without RAG, the response is #### \n {response_2_2['choices'][0]['text']} \n\n")
print(f"#### Without RAG without template, the response is #### \n {response_2_3['choices'][0]['text']} \n\n")