# Custom Chatbot Project

I have chosen the Diabetic Retinopathy Dataset because it provides a well-structured collection of medical records, clinical images, and diagnostic labels related to Diabetic Retinopathy (DR). This dataset is appropriate for developing an NLP-based chatbot as it contains key medical terms, symptoms, risk factors, and treatment options that can help the chatbot provide accurate, context-aware responses to user queries.

Additionally, this dataset supports the integration of natural language processing (NLP) techniques for understanding and classifying medical questions, enabling the chatbot to provide relevant, evidence-based information to patients, caregivers, and healthcare professionals. The dataset ensures that responses are informed by real-world clinical data, making it a reliable foundation for automating medical Q&A related to Diabetic Retinopathy.

## Data Wrangling

The cells below fetch the plain-text extract of Wikipedia's Diabetic Retinopathy page, drop empty lines and section headers, and store each remaining paragraph as one row of a DataFrame under the "text" column. This keeps the data simple for chatbot retrieval and text processing.

In [137]:
import requests, re
import pandas as pd
import numpy as np
# from scipy import spatial
from typing import List, Union, Dict
from scipy.spatial.distance import cosine
# from datasets import Dataset

# Get the plain-text extract of the Wikipedia page for diabetic retinopathy
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "diabetic retinopathy",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()

paragraph = response_dict["query"]["pages"][0]["extract"].split("\n")

In [138]:
# Remove empty lines and lines containing headers (== Header ==)
cleaned_lines = [line for line in paragraph if line.strip() and not re.match(r"^==.*==$", line.strip())]

# Convert to DataFrame
df = pd.DataFrame(cleaned_lines, columns=["text"])

df.head()

Unnamed: 0,text
0,Diabetic retinopathy (also known as diabetic e...
1,Diabetic retinopathy affects up to 80 percent ...
2,Nearly all people with diabetes develop some d...
3,Around half of people with diabetic retinopath...
4,"The repeated processes of blood vessel growth,..."
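
To make the filtering step concrete, here is a minimal sketch that applies the same rule to a small hand-written extract. The sample text and the helper name `clean_extract` are assumptions for illustration and are not part of the notebook; only the regular expression is taken from the cleaning cell.

```python
import re
from typing import List

# Same pattern as the cleaning cell: a line that starts and ends with "==" is a section header
HEADER_PATTERN = re.compile(r"^==.*==$")

def clean_extract(extract: str) -> List[str]:
    """Split a plain-text extract into paragraphs, dropping blank lines and headers.

    Hypothetical helper, for illustration only.
    """
    lines = extract.split("\n")
    return [
        line for line in lines
        if line.strip() and not HEADER_PATTERN.match(line.strip())
    ]

# Made-up sample that mimics the shape of a Wikipedia plain-text extract
sample = (
    "Intro paragraph.\n\n\n"
    "== Signs and symptoms ==\n"
    "Symptom paragraph.\n"
    "=== Subsection ===\n"
    "More text."
)
paragraphs = clean_extract(sample)
print(paragraphs)
```

Because the pattern only requires the line to begin and end with `==`, it also catches deeper heading levels such as `=== Subsection ===`.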


## Saving our Dataset
Having done all this hard work, the last thing we want is to reconstruct our dataset from scratch at run time.
To avoid that, save the DataFrame to CSV, then convert it to a `Dataset` object and call its `.save_to_disk()` method.

In [139]:
# save dataset in csv
df.to_csv('data/dataset.csv')

In [114]:
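# Hugging Face `datasets` provides Dataset; the import is commented out in the first cell
from datasets import Dataset

data = Dataset.from_pandas(df)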

In [115]:
data.save_to_disk("data.hf")

In [140]:
# Import OpenAI library
import os
import openai
from openai.embeddings_utils import get_embedding
# Set the base URL for the OpenAI API (Vocareum's endpoint)
openai.api_base = "https://openai.vocareum.com/v1"
# Read the API key from the environment instead of hard-coding it in the notebook
# (assumes the key has been exported as OPENAI_API_KEY)
openai.api_key = os.environ["OPENAI_API_KEY"]

In [144]:
# OpenAI parameters
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = 'gpt-3.5-turbo-instruct'

# Batch size for processing
BATCH_SIZE = 100
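
The batch size controls how many rows are sent in each embeddings request. As a rough sketch of the slicing logic only (no API calls), the hypothetical helper `iter_batches` below splits a list of texts into consecutive chunks; the helper name and the sample texts are assumptions for illustration.

```python
from typing import Iterator, List

def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Seven made-up paragraphs split into batches of three
texts = [f"paragraph {i}" for i in range(7)]
batches = list(iter_batches(texts, batch_size=3))
print([len(batch) for batch in batches])
```

The last batch is simply shorter when the number of rows is not a multiple of the batch size, so no padding is needed.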

In [145]:
embeddings = []
for i in range(0, len(df), BATCH_SIZE):
    # Send one batch of text rows to the OpenAI model to get embeddings
    response = openai.Embedding.create(
        input = df.iloc[i:i + BATCH_SIZE]["text"].tolist(),
        engine = EMBEDDING_MODEL_NAME
    )
    print(response)
    # Add this batch's embeddings to the list, preserving row order
    embeddings.extend([item["embedding"] for item in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings

{
  "data": [
    {
      "embedding": [
        -0.010677504353225231,
        -0.025032196193933487,
        0.0028557199984788895,
        -0.021516507491469383,
        -0.0002577757113613188,
        0.02334267646074295,
        0.008975563570857048,
        -0.01029239408671856,
        -0.018038088455796242,
        -0.00043208489660173655,
        0.003245489438995719,
        0.019963640719652176,
        -0.013677640818059444,
        0.004903949797153473,
        -0.0021196617744863033,
        0.009497326798737049,
        0.040001820772886276,
        0.006087233778089285,
        0.016087688505649567,
        -0.012441559694707394,
        -0.03962913155555725,
        0.015764692798256874,
        -0.01188873965293169,
        -0.0024100476875901222,
        0.013367068022489548,
        -0.003341767005622387,
        -0.005453664343804121,
        -0.03284621611237526,
        -0.006689745467156172,
        -0.02088293805718422,
        0.001319159404374659,
        -0.

In [147]:
df.head(20)

Unnamed: 0,text,embeddings
0,Diabetic retinopathy (also known as diabetic e...,"[-0.010677504353225231, -0.025032196193933487,..."
1,Diabetic retinopathy affects up to 80 percent ...,"[-0.003970709629356861, -0.035033438354730606,..."
2,Nearly all people with diabetes develop some d...,"[-0.015914591029286385, -0.02492091991007328, ..."
3,Around half of people with diabetic retinopath...,"[-0.006666772998869419, -0.02953506074845791, ..."
4,"The repeated processes of blood vessel growth,...","[-0.004785932134836912, -0.01526925154030323, ..."
5,Diabetic retinopathy is typically diagnosed by...,"[-0.018960656598210335, -0.015124611556529999,..."
6,The same guidelines separately divide macular ...,"[-0.000526395917404443, 0.02458757720887661, 0..."
7,Fluorescein angiography is used by retina spec...,"[-0.021063480526208878, -0.010775918141007423,..."
8,"Due to the lack of symptoms, most people with ...","[-9.7348602139391e-05, -0.022609487175941467, ..."
9,"Iceland, Ireland, and the United Kingdom are t...","[-0.003574279835447669, -0.027453580871224403,..."


In [148]:
# save dataset in csv
df.to_csv('data/datasetWithEmbeddings.csv')
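
One caveat when saving embeddings to CSV: each list is written as its string representation, so it has to be parsed back into floats after reloading. The sketch below shows one way to do that with the standard library. The helper name `parse_embedding` is an assumption for illustration, and the sample value is shortened from the first row shown earlier.

```python
import ast
from typing import List

def parse_embedding(cell: str) -> List[float]:
    """Convert a CSV cell such as "[0.1, -0.2]" back into a list of floats.

    Hypothetical helper, for illustration only.
    """
    values = ast.literal_eval(cell)  # safely evaluates the list literal
    return [float(value) for value in values]

# A shortened embedding as it would appear in the CSV file
cell = "[-0.010677504353225231, -0.025032196193933487, 0.0028557199984788895]"
embedding = parse_embedding(cell)
print(len(embedding), type(embedding[0]).__name__)
```

With pandas, such a helper could be passed to `pd.read_csv` through `converters={"embeddings": parse_embedding}`, and `index_col=0` would restore the saved index instead of producing an `Unnamed: 0` column.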

## Custom Query Completion

In the cells below, I compose a custom query using my chosen dataset and retrieve results from an OpenAI `Completion` model.

In [48]:
# Function to get embeddings from OpenAI API
def get_embeddings(prompt: Union[str, List[str]], embedding_model: str) -> List[List[float]]:
    """
    Retrieves embeddings from OpenAI API for the given prompt using the specified embedding model.

    Args:
        prompt (Union[str, List[str]]): Input prompt or list of prompts.
        embedding_model (str): Name of the embedding model to use.

    Returns:
        List[List[float]]: List of embeddings for the input prompt(s).
    """
    

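    # Wrap a single string in a list so the API always receives a batch,
    # and use the model passed by the caller rather than the global default
    response = openai.Embedding.create(
        input=prompt if isinstance(prompt, list) else [prompt],
        model=embedding_model
    )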
    return [row.embedding for row in response.data]

# Function to create embeddings for DataFrame
def create_embeddings(df: pd.DataFrame, embedding_model_name: str = EMBEDDING_MODEL_NAME, batch_size: int = BATCH_SIZE) -> List[List[float]]:
    """
    Creates embeddings for the text data in the DataFrame using the specified embedding model.

    Args:
        df (pd.DataFrame): DataFrame containing text data.
        embedding_model_name (str): Name of the embedding model to use.
        batch_size (int): Size of batches for processing.

    Returns:
        List[List[float]]: List of embeddings corresponding to the text data.
    """
    embeddings_output = []
    for idx in range(0, len(df), batch_size):
        batch = df.iloc[idx:idx+batch_size]['text'].tolist()
        embeddings = get_embeddings(batch, embedding_model_name)
        embeddings_output.extend(embeddings)
    return embeddings_output

In [151]:
def generateSimplePrompt(question: str) -> List[str]:
    """
    Builds a simple prompt for asking a question.

    Args:
        question (str): The question to include in the prompt.

    Returns:
        List[str]: A one-element list containing the question, sent to the model as the prompt with no added context.
    """
    return [question]

def buildCustomPrompt(question: str, database_df: pd.DataFrame) -> str:
    """
    Builds a custom prompt including context for asking a question based on a database DataFrame.

    Args:
        question (str): The question to include in the prompt.
        database_df (pd.DataFrame): The DataFrame containing the database of facts.

    Returns:
        str: A single prompt string containing the retrieved context, the fallback instruction, and the question.
    """
    
    context = '\n\n'.join(generateCustomContext(question, database_df))

    prompt = f"""
    Answer the question based on the following context:

    {context}

    If the question cannot be answered based on the provided context, say:
    "Sorry, Please rephrase your input or provide more details."

    Question: {question}
    """
    
    return prompt
    

def generateCustomContext(question: str, database_df: pd.DataFrame, n: int = 5) -> List[str]:
    """
    Builds a custom context for a given question based on the closest facts from a database DataFrame.

    Args:
        question (str): The question for which the context is being built.
        database_df (pd.DataFrame): The DataFrame containing the database of facts.
        n (int): The number of closest facts to include in the context.

    Returns:
        List[str]: A list of closest facts to the question.
    """
    question_embedding = get_embeddings(question, EMBEDDING_MODEL_NAME)[0]
    
    df = database_df.copy()
    df["distances"] = df['embeddings'].apply(lambda embedding: cosine(embedding, question_embedding))

    df.sort_values("distances", ascending=True, inplace=True)
    return df.iloc[:n]['text'].tolist()

def getResponse(prompt: Union[str, List[str]], client=openai, model_name: str = COMPLETION_MODEL_NAME) -> str:
    """
    Handles a question prompt by generating a response using the specified model.

    Args:
        prompt (Union[str, List[str]]): The prompt text, or a one-element list of prompt text, to send to the model.
        client: The `openai` module, kept for call-site compatibility; the function calls `openai` directly.
        model_name (str): The name of the completion model to use.

    Returns:
        str: The response generated by the model.
    """
    print('COMPLETION_MODEL: ', model_name)
    
#     print('prompt: ', prompt)
    response = openai.Completion.create(
        model=model_name,
        prompt=prompt,
        max_tokens=100
    )

#     print(response)
    return response["choices"][0]["text"]
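
The retrieval step in `generateCustomContext` ranks rows by cosine distance, which is 1 minus the cosine similarity of two vectors, and keeps the `n` closest. Here is a minimal pure-Python sketch of that ranking on toy 2-dimensional vectors. The helper names `cosine_distance` and `closest_texts` and the sample data are assumptions for illustration; the notebook itself relies on `scipy.spatial.distance.cosine` and on real embeddings.

```python
import math
from typing import List, Sequence, Tuple

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(angle between a and b); 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def closest_texts(
    query: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    n: int = 2,
) -> List[str]:
    """Sort (text, embedding) pairs by distance to the query and keep the n closest texts."""
    ranked = sorted(rows, key=lambda row: cosine_distance(row[1], query))
    return [text for text, _ in ranked[:n]]

# Toy "embeddings": real ones from the embedding model have far more dimensions
rows = [
    ("about symptoms", [1.0, 0.0]),
    ("about screening", [0.0, 1.0]),
    ("about risk factors", [0.7, 0.7]),
]
query = [0.9, 0.1]
top = closest_texts(query, rows, n=2)
print(top)
```

Sorting in ascending order matters here: a smaller cosine distance means a closer match, which is why the notebook calls `sort_values` with `ascending=True`.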

## Custom Performance Demonstration

The cells below demonstrate the performance of my custom query using two questions. For each question, I show the answer from a basic `Completion` model query as well as the answer from my custom query.

### Question 1

In [152]:
# Define the question
question1 = 'What is Diabetic Retinopathy, and how does it develop?'

# Print answer without context
print('Q & A without Context: \n', getResponse(generateSimplePrompt(question1), openai), '\n')

# Print answer with context
print('Q & A with Context: \n', getResponse(buildCustomPrompt(question1, df), openai))

COMPLETION_MODEL:  gpt-3.5-turbo-instruct
Q & A without Context: 
  Diabetic retinopathy is a complication of diabetes that affects the eyes. It is caused by damage to the small blood vessels in the retina, the light-sensitive tissue at the back of the eye. High blood sugar levels can cause the blood vessels to swell, leak, or become blocked, resulting in reduced blood flow to the retina. This can lead to vision problems or even blindness if left untreated.

As diabetes progresses, the damage to the blood vessels can worsen. The body may try to compensate 

COMPLETION_MODEL:  gpt-3.5-turbo-instruct
Q & A with Context: 
 
Diabetic Retinopathy is a medical condition caused by prolonged high blood glucose that damages the small blood vessels and neurons in the retina. It develops in stages, starting with changes in retinal arteries and dysfunction of retinal neurons, followed by dysfunction of the outer retina and changes in visual function. Later, this leads to thickening of the basement

### Question 2

In [154]:
# Define the question
question2 = 'Who is at the highest risk of developing Diabetic Retinopathy?'

# Print answer without context
print('Q & A without Context: \n', getResponse(generateSimplePrompt(question2), openai), '\n')

# Print answer with context
print('Q & A with Context: \n', getResponse(buildCustomPrompt(question2, df), openai))

COMPLETION_MODEL:  gpt-3.5-turbo-instruct
Q & A without Context: 
 

Individuals who have had diabetes for a long time, have poorly controlled blood sugar levels, high blood pressure, high cholesterol, are pregnant, or have a family history of diabetic retinopathy are at the highest risk for developing this condition. 

COMPLETION_MODEL:  gpt-3.5-turbo-instruct
Q & A with Context: 
 
Adults who have had type 1 or type 2 diabetes for 20 years or more are at the highest risk of developing diabetic retinopathy.


### Interactive Chatbot

In [155]:
# Interactive loop for user input
while True:
    question = input("\nAsk a question (or type 'exit' to quit): ").strip()
    
    if question.lower() == "exit":
        print("Goodbye! 👋")
        break

    # Build a context-augmented prompt from the closest facts, then query the model
    prompt = buildCustomPrompt(question, df)
    response = getResponse(prompt, openai)
    
    print("\n🤖 AI Response:", response)


Ask a question (or type 'exit' to quit): Who is at the highest risk of developing Diabetic Retinopathy?
COMPLETION_MODEL:  gpt-3.5-turbo-instruct

🤖 AI Response: 

Individuals with poorly controlled diabetes, specifically those who have had diabetes for a long time, are at the highest risk of developing diabetic retinopathy. Other risk factors include high blood pressure, high cholesterol, pregnancy, and smoking. Additionally, individuals with type 1 diabetes have a higher risk than those with type 2 diabetes.

Ask a question (or type 'exit' to quit): exit
Goodbye! 👋
