# Custom Chatbot Project

This notebook wraps the openai client to demonstrate Retrieval Augmented Generation used to create a custom 'chatbot'.

The dataset is the Wikipedia overview page for Voice Therapy. I chose it because the topic is personally meaningful to me, and because a model can be assumed to have a baseline of common knowledge about it but little actual detail.

For example, I might expect a more advanced model to give the general idea of what voice therapy is, who might use it, and even to name some techniques, but I would not expect it to offer tailored advice on next steps for someone who is having trouble with their voice and wishes to explore options.

In a production environment, you could imagine a similar design with access to the individual topic pages linked from this summary page. It could support someone undergoing voice training, or help them query the available options and understand which would suit them and which would not.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [30]:
pip install openai==0.28 tiktoken

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.34.0
    Uninstalling openai-1.34.0:
      Successfully uninstalled openai-1.34.0
Successfully installed openai-0.28.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
from typing import List

import requests

import openai
import tiktoken

import pandas as pd

In [3]:
api_key = 'YOUR API KEY'
openai.api_key = api_key

In [3]:
data_url = 'https://en.wikipedia.org/w/api.php'
page_name = 'Voice_therapy'
csv_name = 'dataset-voice-therapy.csv'
completion_model = 'gpt-3.5-turbo-instruct'
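# embedding_model = 'text-embedding-ada-002'
# Note: 'cl100k_base' is a tiktoken encoding name rather than an embedding model;
# it is used below only to count tokens.
embedding_model = 'cl100k_base'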
query_params = {
    'format': 'json',
    'prop': 'extracts',
    'action': 'query',
    'exlimit': 1,
    'titles': page_name,
    'explaintext': 1,
    'formatversion': 2,
}

In [4]:
response = requests.get(data_url, params=query_params)

In [5]:
print(response.json())
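df = pd.DataFrame()
# Keep non-empty lines that are not '== Section ==' headings; each remaining line is one paragraph.
extract = response.json()['query']['pages'][0]['extract']
df['text'] = [line for line in extract.split('\n') if line != '' and not line.startswith('==')]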
df.to_csv(csv_name)
df

{'batchcomplete': True, 'query': {'normalized': [{'fromencoded': False, 'from': 'Voice_therapy', 'to': 'Voice therapy'}], 'pages': [{'pageid': 24797922, 'ns': 0, 'title': 'Voice therapy', 'extract': 'Voice therapy consists of techniques and procedures that target vocal parameters, such as vocal fold closure, pitch, volume, and quality. This therapy is provided by speech-language pathologists and is primarily used to aid in the management of voice disorders, or for altering the overall quality of voice, as in the case of transgender voice therapy. Vocal pedagogy is a related field to alter voice for the purpose of singing. Voice therapy may also serve to teach preventive measures such as vocal hygiene and other safe speaking or singing practices.\n\n\n== Orientations ==\nThere are several orientations towards management in voice therapy. The approach taken to voice therapy varies between individuals, as no set treatment method applies for all individuals. The specific method of treatmen

Unnamed: 0,text
0,Voice therapy consists of techniques and proce...
1,There are several orientations towards managem...
2,Symptomatic voice therapy aims to directly or ...
3,Physiologic voice therapy may be adopted when ...
4,Hygienic voice therapy involves modifying or e...
...,...
59,Direct Treatment Methods: Direct treatment met...
60,Vocal Function Exercises: designed to improve ...
61,Resonance Therapy: modified form of Resonance ...
62,Semiocclusion of the Vocal Tract: methods that...
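
The row filter above can be tried without calling the Wikipedia API. The following is a minimal sketch under stated assumptions: `extract_to_rows` is a hypothetical helper name, and `sample_extract` is a made-up, shortened string in the same shape as the API extract, not the real page text.

```python
from typing import List


def extract_to_rows(extract: str) -> List[str]:
    """Split a plain-text extract into rows, dropping blank lines and '== Heading ==' lines."""
    return [
        line
        for line in extract.split('\n')
        if line != '' and not line.startswith('==')
    ]


# Hypothetical sample: two paragraphs separated by blank lines and a section heading.
sample_extract = (
    'First paragraph about voice therapy.\n'
    '\n'
    '\n'
    '== Orientations ==\n'
    'Second paragraph about orientations.'
)

rows = extract_to_rows(sample_extract)
print(rows)
```

Each surviving line becomes one row of the `"text"` column, so the row count follows the paragraph structure of the page rather than a fixed chunk size.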


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [6]:
# Copied from Retrieval Augmented Generation codealong
prompt_template = '''
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:'''
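
The two `{}` placeholders are filled positionally: context first, question second. As a rough sketch, the snippet below repeats the template and joins the context rows with the same `###` separator that `run_query` uses. `build_prompt` and `CONTEXT_SEPARATOR` are hypothetical names introduced only for this illustration, and the rows are placeholders.

```python
from typing import List

# Same shape as the notebook's template: two positional placeholders.
prompt_template = '''
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:'''

# Separator placed between context rows (hypothetical constant name).
CONTEXT_SEPARATOR = '\n\n\n###\n\n\n'


def build_prompt(question: str, context_rows: List[str]) -> str:
    """Fill the template with the joined context rows and the question."""
    return prompt_template.format(CONTEXT_SEPARATOR.join(context_rows), question)


prompt = build_prompt('What is voice therapy?', ['Row one.', 'Row two.'])
print(prompt)
```

With an empty list of rows the context section is simply left blank, which is how the baseline (context-free) queries later in the notebook are produced.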

In [7]:
tokeniser = tiktoken.get_encoding(embedding_model)

def prepare_context(prompt: str = '', full_context: List[str] = []):
    '''
    Selects leading rows of the context so that the whole prompt stays within a fixed token budget.

    params
    ---
    prompt {string}: The question passed by the user.
    full_context {string[]}: The context dataset as a list of strings.

    returns
    ---
    {string[]} The leading rows of the context that fit within the token budget.
    '''
    context = []
    max_token_count = 1000
    current_token_count = len(tokeniser.encode(prompt_template)) + len(tokeniser.encode(prompt))

    for text in full_context:
        text_token_count = len(tokeniser.encode(text))
        current_token_count += text_token_count
        
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return context

In [25]:
def run_query(prompt, context=[]):
    '''
    Wrapper function for the OpenAI client.

    params
    ---
    prompt {string}: The question for the model.
    context {string[]}: Context material for the context section of the prompt.

    returns
    ---
    {OpenAIObject | string} The first completion choice, or the error message as a string if the request fails.
    '''
    context_parsed = prepare_context(prompt, context)
    formatted_prompt = prompt_template.format('\n\n\n###\n\n\n'.join(context_parsed), prompt)
    print(formatted_prompt)
    try:
        result = openai.Completion.create(
            model=completion_model,
            prompt=formatted_prompt,
            max_tokens=32,  # answers longer than 32 tokens are cut off (finish_reason "length")
        )
        return result['choices'][0]
    except Exception as ex:
        return str(ex)

In [26]:
prepare_context('some example question', df['text'].values)

['Voice therapy consists of techniques and procedures that target vocal parameters, such as vocal fold closure, pitch, volume, and quality. This therapy is provided by speech-language pathologists and is primarily used to aid in the management of voice disorders, or for altering the overall quality of voice, as in the case of transgender voice therapy. Vocal pedagogy is a related field to alter voice for the purpose of singing. Voice therapy may also serve to teach preventive measures such as vocal hygiene and other safe speaking or singing practices.',
 'There are several orientations towards management in voice therapy. The approach taken to voice therapy varies between individuals, as no set treatment method applies for all individuals. The specific method of treatment should consider the type and severity of the disorder, as well as individual qualities such as personal and cultural characteristics. Some common orientations are described below.',
 'Symptomatic voice therapy aims to
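
Because `prepare_context` takes rows in document order until the budget runs out, a question about a later paragraph may never see the relevant row. One possible refinement is to order the rows by relevance before trimming. The sketch below is only an illustration under loud assumptions: it scores rows by plain word overlap with the question (a stand-in, not embedding-based retrieval), counts tokens by whitespace splitting instead of `tokeniser.encode`, and uses hypothetical helper names and made-up rows.

```python
import re
from typing import Callable, List, Set


def word_set(text: str) -> Set[str]:
    """Lower-cased alphabetic words in the text."""
    return set(re.findall(r'[a-z]+', text.lower()))


def rank_by_overlap(question: str, rows: List[str]) -> List[str]:
    """Order rows by how many distinct question words they share, most first."""
    question_words = word_set(question)
    # sorted() is stable, so rows with equal scores keep their document order.
    return sorted(rows, key=lambda row: len(question_words & word_set(row)), reverse=True)


def fit_to_budget(
    rows: List[str],
    budget: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Take rows in order until adding the next one would exceed the budget."""
    selected: List[str] = []
    used = 0
    for row in rows:
        cost = count_tokens(row)
        if used + cost > budget:
            break
        selected.append(row)
        used += cost
    return selected


# Made-up rows standing in for df['text'].values.
rows = [
    'General overview of voice therapy.',
    'Notes on hygienic voice therapy.',
    'Notes on symptomatic voice therapy and its aims.',
]
question = 'What is symptomatic voice therapy?'

ranked = rank_by_overlap(question, rows)
selected = fit_to_budget(ranked, budget=13)
print(selected)
```

In the notebook itself, `count_tokens` could be swapped for `lambda text: len(tokeniser.encode(text))` so that the budget is measured in the same units as `max_token_count`.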

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [19]:
question_1 = 'What is Symptomatic Voice Therapy?'

In [20]:
answer_1_base = run_query(question_1, [])


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 



---

Question: What is Symptomatic Voice Therapy?
Answer:


In [28]:
answer_1_base

<OpenAIObject at 0x270d67f7720> JSON: {
  "text": " I don't know",
  "index": 0,
  "logprobs": null,
  "finish_reason": "stop"
}

In [22]:
answer_1_contextful = run_query(question_1, df['text'].values)


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Voice therapy consists of techniques and procedures that target vocal parameters, such as vocal fold closure, pitch, volume, and quality. This therapy is provided by speech-language pathologists and is primarily used to aid in the management of voice disorders, or for altering the overall quality of voice, as in the case of transgender voice therapy. Vocal pedagogy is a related field to alter voice for the purpose of singing. Voice therapy may also serve to teach preventive measures such as vocal hygiene and other safe speaking or singing practices.


###


There are several orientations towards management in voice therapy. The approach taken to voice therapy varies between individuals, as no set treatment method applies for all individuals. The specific method of treatment should consider the type and severity of the disorder, as well as individua

In [27]:
answer_1_contextful

<OpenAIObject at 0x270d6137860> JSON: {
  "text": " Symptomatic voice therapy is a type of voice therapy that aims to directly or",
  "index": 0,
  "logprobs": null,
  "finish_reason": "length"
}

### Question 2

In [29]:
question_2 = 'What are the main targets of accent mehtods used in Physiologic voice therapy?'

In [30]:
answer_2_base = run_query(question_2, [])


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 



---

Question: What are the main targets of accent mehtods used in Physiologic voice therapy?
Answer:


In [31]:
answer_2_base

<OpenAIObject at 0x270d61390e0> JSON: {
  "text": " \n\nI don't know.",
  "index": 0,
  "logprobs": null,
  "finish_reason": "stop"
}

In [32]:
answer_2_contextful = run_query(question_2, df['text'].values)


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Voice therapy consists of techniques and procedures that target vocal parameters, such as vocal fold closure, pitch, volume, and quality. This therapy is provided by speech-language pathologists and is primarily used to aid in the management of voice disorders, or for altering the overall quality of voice, as in the case of transgender voice therapy. Vocal pedagogy is a related field to alter voice for the purpose of singing. Voice therapy may also serve to teach preventive measures such as vocal hygiene and other safe speaking or singing practices.


###


There are several orientations towards management in voice therapy. The approach taken to voice therapy varies between individuals, as no set treatment method applies for all individuals. The specific method of treatment should consider the type and severity of the disorder, as well as individua

In [33]:
answer_2_contextful

<OpenAIObject at 0x270d6137950> JSON: {
  "text": " \n\nThe main targets of accent methods in Physiologic voice therapy are to increase pulmonary output, reduce tension in muscles, reduce glottis waste, and",
  "index": 0,
  "logprobs": null,
  "finish_reason": "length"
}
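
Both contextful answers above end with `"finish_reason": "length"`, meaning the completion hit the `max_tokens=32` limit set in `run_query` and stopped mid-sentence. A small helper can make that visible when comparing answers. This is a sketch only: `summarise_choice` is a hypothetical name, and it assumes the choice object supports dictionary-style access, as the `result['choices'][0]` indexing in `run_query` suggests. Plain dicts copied from the outputs above stand in for the `OpenAIObject` instances.

```python
from typing import Any, Dict, Mapping


def summarise_choice(choice: Mapping[str, Any]) -> Dict[str, Any]:
    """Return the stripped answer text plus a flag for answers cut off by max_tokens."""
    return {
        'text': choice['text'].strip(),
        'truncated': choice.get('finish_reason') == 'length',
    }


# Values copied from the Question 1 outputs; dicts stand in for OpenAIObject.
base = {'text': " I don't know", 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}
contextful = {
    'text': ' Symptomatic voice therapy is a type of voice therapy that aims to directly or',
    'index': 0,
    'logprobs': None,
    'finish_reason': 'length',
}

for label, choice in (('base', base), ('contextful', contextful)):
    print(label, summarise_choice(choice))
```

If the flag is set, raising `max_tokens` in `run_query` would likely allow the answer to finish, at the cost of a longer completion.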