## **🚀 Custom Chatbot Project**

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

- In this project, we use a dataset of characters from "Frieren: Beyond Journey's End," a Japanese manga series. The data comes from the English Wikipedia page "List of Frieren characters."

- We chose this dataset because our goal is a tool that acts like an expert on the characters in this specific series. Each character has its own text description, so the page splits naturally into one row per character.

- To improve the tool's ability to answer questions, we use retrieval-augmented generation (RAG).

- This technique adds relevant context from the dataset to each question before it is sent to the model, which helps the model give more accurate answers about the characters in "Frieren: Beyond Journey's End."
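
As a minimal sketch of the retrieval idea, assume every dataset row already has an embedding vector. The three-dimensional vectors, the question vector, and the `top_k_rows` helper below are made up for illustration; real embeddings come from an embedding model and have many more dimensions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_rows(
    question_vec: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the texts of the k rows most similar to the question vector."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(question_vec, row[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy data: (text, embedding) pairs with hypothetical 3-d vectors.
rows = [
    ("Frieren is an elven mage.", [0.9, 0.1, 0.0]),
    ("Stark is a young warrior.", [0.0, 0.2, 0.9]),
    ("Fern is Frieren's apprentice.", [0.7, 0.6, 0.1]),
]
question_vec = [1.0, 0.2, 0.0]  # hypothetical embedding of the question

# The retrieved texts are placed in front of the question.
context = "\n".join(top_k_rows(question_vec, rows, k=2))
prompt = f"Context:\n{context}\n\nQuestion: Who is Frieren?\nAnswer:"
print(prompt)
```

The key point is that the text of the best-matching rows goes into the prompt; the vectors are only used to decide which rows to include.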

In [1]:
import openai

In [2]:
# openai api key
OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"

# URL and file paths
SOURCE_URL = 'https://en.wikipedia.org/wiki/List_of_Frieren_characters'
HTML_PAGE_FILEPATH = './html_page.html'
CSV_FILEPATH_WITH_EMBEDDINGS = './frieren_wiki_with_embeddings.csv'


### **Data Wrangling**

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [3]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from typing import List, Union

In [4]:
# Helper function to fetch an HTML page from a URL
def fetch_html_page(url: str) -> bytes:
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
    }
    # A timeout keeps the request from hanging indefinitely
    response = requests.get(url, headers=headers, timeout=30)
    # Raise requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content

# Fetch first, then save, so a failed request does not leave an empty file behind
html_page = fetch_html_page(SOURCE_URL)
with open(HTML_PAGE_FILEPATH, mode='wb') as html_file:
    html_file.write(html_page)

In [5]:
def extract_characters_from_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    characters = []

    # Find all <dl> tags
    dl_elements = soup.find_all('dl')

    for dl in dl_elements:
        dt_elements = dl.find_all('dt')
        for dt in dt_elements:
            character_name = dt.text.strip()
            dd = dt.find_next_sibling('dd')

            if dd:
                next_dd = dd.find_next_sibling('dd')
                if next_dd:
                    character_description = next_dd.text.strip()
                else:
                    character_description = dd.text.strip()
            else:
                # No <dd> follows this <dt>, so no description is available
                character_description = ""

            characters.append((character_name, character_description))

    df = pd.DataFrame(characters, columns=['Character', 'Description'])

    return df

with open(HTML_PAGE_FILEPATH, mode='rb') as html_file:
    html_content = html_file.read()

characters_df = extract_characters_from_html(html_content)
characters_df

| | Character | Description |
|---|---|---|
| 0 | Frieren (フリーレン, Furīren)[a][b] | An elven mage who was a member of the Hero Par... |
| 1 | Fern (フェルン, Ferun)[e] | Frieren's apprentice. She is a war orphan from... |
| 2 | Stark (シュタルク, Shutaruku)[f] | A young warrior who Eisen raised, who serves a... |
| 3 | Himmel (ヒンメル, Hinmeru)[g] | A human member of the Hero Party. He was the h... |
| 4 | Heiter (ハイター, Haitā)[h] | A human member of the Hero Party, who was an a... |
| ... | ... | ... |
| 65 | The Hero of the South | Fass is a Dwarf Himmel's Party encountered, an... |
| 66 | Fass (ファス, Fasu)[bm] | Milliarde is an old acquaintance and fellow el... |
| 67 | Milliarde (ミリアルデ, Miriarude)[bn] | Glück was the feudal lord of the Fortified Cit... |
| 68 | Glück (グリュック, Guryukku)[bo] | Lektüre is the daughter of Glück and the late ... |
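
In the output above, the last rows look shifted by one: the description next to "The Hero of the South" begins with "Fass is a Dwarf", and the one next to "Fass" begins with "Milliarde is". A likely cause (an assumption, not verified against the page) is that `dd.find_next_sibling('dd')` can skip past the next `<dt>` and pick up a `<dd>` that belongs to the following character. The sketch below shows one way to pair each term only with the `<dd>` entries that precede the next `<dt>`. It works on a flat list of `(tag, text)` tuples, which could be built from `dl.find_all(['dt', 'dd'], recursive=False)`. The helper name `pair_terms_with_descriptions` and the sample items are hypothetical.

```python
from typing import List, Optional, Tuple


def pair_terms_with_descriptions(items: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Pair each 'dt' with the last 'dd' that appears before the next 'dt'."""
    pairs: List[Tuple[str, str]] = []
    current_term: Optional[str] = None
    current_desc = ""
    for tag, text in items:
        if tag == "dt":
            # A new term starts: store the previous one, if any.
            if current_term is not None:
                pairs.append((current_term, current_desc))
            current_term, current_desc = text, ""
        elif tag == "dd" and current_term is not None:
            # A later <dd> overwrites an earlier one for the same term.
            current_desc = text
    if current_term is not None:
        pairs.append((current_term, current_desc))
    return pairs


# Made-up stand-in for the page structure; real text comes from the HTML.
items = [
    ("dt", "Character A"),
    ("dd", "Extra line about A"),
    ("dd", "Description of A."),
    ("dt", "Character B"),
    ("dd", "Description of B."),
    ("dt", "Character C"),
    ("dd", "Description of C."),
]
for name, description in pair_terms_with_descriptions(items):
    print(name, "->", description)
```

Like the original function, this keeps the last `<dd>` when a term has more than one, but it never borrows a description from the next character.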


### **Creating an Embeddings Index for our Chatbot**

In [7]:
openai.api_key = OPENAI_API_KEY

In [8]:
pd.reset_option('display.max_colwidth')
pd.reset_option('display.max_rows')

In [9]:
def get_embeddings(prompt: Union[str, List[str]]) -> List[List[float]]:
    """
    Retrieves embeddings from the OpenAI API for the given prompt(s) using the
    'text-embedding-3-small' model.

    Note: `openai.Embedding.create` is the legacy interface of the `openai`
    package (versions before 1.0).
    """
    
    response = openai.Embedding.create(
        input=prompt if isinstance(prompt, list) else [prompt],
        model='text-embedding-3-small'
    )
    return [row.embedding for row in response.data]


In [10]:
def create_embeddings(df, text_column, batch_size=32):
    embeddings_output = []
    for idx in range(0, len(df), batch_size):
        batch = df.iloc[idx:idx+batch_size][text_column].tolist()
        embeddings = get_embeddings(batch)
        embeddings_output.extend(embeddings)
    return embeddings_output


In [11]:
characters_df['embedding'] = create_embeddings(characters_df, text_column='Description')
characters_df.to_csv(CSV_FILEPATH_WITH_EMBEDDINGS, sep=',', index=False)

In [12]:
characters_df.head()

| | Character | Description | embedding |
|---|---|---|---|
| 0 | Frieren (フリーレン, Furīren)[a][b] | An elven mage who was a member of the Hero Par... | [-0.016321850940585136, -0.0030223201029002666... |
| 1 | Fern (フェルン, Ferun)[e] | Frieren's apprentice. She is a war orphan from... | [-0.024075113236904144, -0.012648157775402069,... |
| 2 | Stark (シュタルク, Shutaruku)[f] | A young warrior who Eisen raised, who serves a... | [0.007023664657026529, 0.005538128782063723, -... |
| 3 | Himmel (ヒンメル, Hinmeru)[g] | A human member of the Hero Party. He was the h... | [-0.01922447793185711, -0.022054746747016907, ... |
| 4 | Heiter (ハイター, Haitā)[h] | A human member of the Hero Party, who was an a... | [-0.013876520097255707, -0.007876708172261715,... |


### **Custom Query Completion**

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [96]:
def handle_question_without_context(prompt):
    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=50,
    )
    return response.choices[0].text.strip()

In [108]:
def handle_question_with_context(prompt: str, embeddings: List[List[float]]) -> str:
    context = " ".join([" ".join(map(str, emb)) for emb in embeddings])
    
    if len(context.split()) > 60:
        context = " ".join(context.split()[:60])

    prompt_with_context = f"{context}\n\n{prompt}"
    
    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt_with_context,
        max_tokens=50
    )
    return response.choices[0].text.strip()
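
Note that the function above joins the embedding numbers themselves into the context string, so the model receives the first 60 floats and no character description. A minimal sketch of an alternative follows, which builds the context from description text under a word budget. `build_prompt` and its parameters are hypothetical names, and the sketch assumes the descriptions have already been ranked by relevance to the question (for example by comparing `get_embeddings(question)` with the stored embeddings).

```python
from typing import List


def build_prompt(
    question: str,
    ranked_descriptions: List[str],
    max_context_words: int = 60,
) -> str:
    """Add whole descriptions, most relevant first, until the word budget is used up."""
    selected: List[str] = []
    words_used = 0
    for description in ranked_descriptions:
        n_words = len(description.split())
        if words_used + n_words > max_context_words:
            break  # The next description would exceed the budget.
        selected.append(description)
        words_used += n_words
    context = "\n".join(selected)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Short example texts, assumed to be sorted from most to least relevant.
ranked = [
    "Frieren is an elven mage who was a member of the Hero Party.",
    "Fern is Frieren's apprentice.",
    "Stark is a young warrior who Eisen raised.",
]
prompt = build_prompt("Who is Frieren?", ranked, max_context_words=18)
print(prompt)
```

Stopping at whole descriptions avoids cutting a sentence in half, at the cost of sometimes leaving part of the budget unused. The resulting string could then be passed as `prompt` to the completion call.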


### **Custom Performance Demonstration**

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

In [61]:
import json

df = pd.read_csv(CSV_FILEPATH_WITH_EMBEDDINGS)

# Embeddings are stored in the CSV as strings such as "[0.1, -0.2, ...]"; parse them back into lists of floats
df['embedding'] = df['embedding'].apply(json.loads)
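
The parsing step above is needed because CSV has no list type: each embedding is written as its string representation and comes back as a plain string when the file is read. A small standard-library sketch of that round trip follows. The two rows are made up, and `json.loads` is used as one possible parser.

```python
import csv
import io
import json

# Write two made-up rows; each list is stored as its string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Character", "embedding"])
writer.writerow(["A", str([0.1, -0.2, 0.3])])
writer.writerow(["B", str([1e-05, 2.0, -3.5])])

# Read the rows back: every field is now a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(type(rows[0]["embedding"]))

# Parse each string back into a list of floats.
embeddings = [json.loads(row["embedding"]) for row in rows]
print(embeddings)
```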

#### **Question 1**

In [109]:
q1 = "Who is Frieren? Please answer briefly."

print('Answer without Context: \n', handle_question_without_context(q1))
print('\nAnswer with Context: \n', handle_question_with_context(q1, df['embedding'].tolist()))

Answer without Context: 
 Frieren is a character from the manga series "The Girl From the Other Side: Siúil, A Rún." She is a powerful sorceress who has mastered various types of magic and has lived for a long time. She

Answer with Context: 
 Frieren is a character from the Japanese manga series "The Witch of the Water Town." She is a powerful ice witch who has lived for hundreds of years with her older sister, Lien. Despite her cold exterior, Frieren is kind


#### **Question 2**

In [110]:
q2 = "What is Fern's background? How did she become Frieren's apprentice?"

print('Answer without Context: \n', handle_question_without_context(q2))
print('\nAnswer with Context: \n', handle_question_with_context(q2, df['embedding'].tolist()))

Answer without Context: 
 Fern is a human girl who lived in a small village in the mountains with her family. She was raised in a harsh environment, constantly struggling for survival. Her father was a skilled hunter, and her mother was a herbalist who taught Fern

Answer with Context: 
 Fern's background is largely shrouded in mystery. She is an ice spirit and does not have a human past like Frieren. However, she does mention that she has been around for a long time and has been through many battles


#### **Question 3**

In [111]:
q3 = "What role does Himmel play in the story? What are his relationships with Frieren?"

print('Answer without Context: \n', handle_question_without_context(q3))
print('\nAnswer with Context: \n', handle_question_with_context(q3, df['embedding'].tolist()))

Answer without Context: 
 Himmel is a major character in the manga series "Frieren at the Funeral." He is a powerful mage and the strongest member of the funeral procession led by Frieren's father. Himmel initially appears as a stoic and distant

Answer with Context: 
 Himmel is a major supporting character in the story. He is a skilled knight and a member of Frieren's adventuring party. Although he is not related to Frieren by blood, he considers her to be like a younger sister


#### **✍🏻 Analysis of Model Responses**

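- **Question 1**

    Both answers attribute Frieren to a different series than the one in the dataset ("The Girl From the Other Side: Siúil, A Rún" without context, "The Witch of the Water Town" with context), so neither is correct. The context could not help here: `handle_question_with_context` builds it from the first 60 embedding values, which are numbers, and not from the description text.

- **Question 2**

    The answer with context calls Fern an ice spirit, whereas the dataset describes her as a war orphan and Frieren's apprentice. The 60-word cap on the context is one limitation, but the larger problem is the same as in Question 1: the context contains no character descriptions at all.

- **Question 3**

    The answer with context is closer to the dataset in that it places Himmel in the same party as Frieren. Because the context held no text about him, this is more likely the model's own guess than a result of retrieval, and the remaining details (such as the sibling-like relationship) should be checked against the dataset before being trusted.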
