# Custom Chatbot Project

The medical sector has not yet used Artificial Intelligence models to the extent that is possible. One gap is a patient's "first contact with the medical knowledge base": an initial overview of a presumed medical issue before seeing a specialist. The opposite direction is also worth exploring. For example, a family doctor could use a customized chatbot to compare a patient's symptoms against a medical assumption, and a conversation with a customized model may be a route to informed decision-making.

The chosen dataset is a Wikipedia scrape of the **Chagas disease** article, which the final questions will query directly. Another potentially useful customization would be a resource for a specific symptom instead of a disease, but this experiment investigates the single disease named above.


One can imagine a library of disease-specific customized chatbots for large-scale implementation.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import os
import openai

# Read the API key from the environment instead of hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

In [2]:
import pandas as pd
import numpy as np
import requests
import urllib.parse

In [3]:
# Define the Wikipedia API endpoint
endpoint = 'https://en.wikipedia.org/w/api.php'

# Define the URL of the Wikipedia page
url = 'https://en.wikipedia.org/wiki/Chagas_disease'

# Extract the page title from the URL
title = urllib.parse.unquote(url.split('/')[-1])

# Define the parameters for the API request
params = {
    'action': 'query',
    'format': 'json',
    'titles': title,
    'prop': 'info',
    'inprop': 'pageid'
}

# Make the API request and get the JSON response
response = requests.get(endpoint, params=params).json()

# Programmatic extraction of the page ID from the response - alternatively, the page ID can be found
# in the Wikipedia page's "Page information"
page_id = str(list(response['query']['pages'].values())[0]['pageid'])

In [4]:
# Using the page id to get api.php response for further wrangling:

# Define the URL for the Wikipedia API
url = 'https://en.wikipedia.org/w/api.php'

# Define the parameters for the API request
params = {
    'action': 'query',
    'format': 'json',
    'prop': 'extracts',
    'explaintext': '',
    'exsectionformat': 'wiki', # section headings are returned as "== Heading ==" so they can be filtered out later
    'redirects': 1, # resolve redirects automatically
    'pageids': page_id # Replace with the page ID you want to access
}

# Make the API request and get the JSON response
response = requests.get(url, params=params).json()

# Extract the page content from the response
page_content = response['query']['pages'][page_id]['extract'].split("\n")

# Print the page content
print(page_content)

['Chagas disease, also known as American trypanosomiasis, is a tropical parasitic disease caused by Trypanosoma cruzi. It is spread mostly by insects in the subfamily Triatominae, known as "kissing bugs". The symptoms change over the course of the infection. In the early stage, symptoms are typically either not present or mild, and may include fever, swollen lymph nodes, headaches, or swelling at the site of the bite. After four to eight weeks, untreated individuals enter the chronic phase of disease, which in most cases does not result in further symptoms. Up to 45% of people with chronic infections develop heart disease 10–30 years after the initial illness, which can lead to heart failure. Digestive complications, including an enlarged esophagus or an enlarged colon, may also occur in up to 21% of people, and up to 10% of people may experience nerve damage.T. cruzi is commonly spread to humans and other mammals by the bite of a kissing bug. The disease may also be spread through blo

In [5]:
# Load into a dataframe and wrangle it into a form that can be fed to the language model
df = pd.DataFrame()
# naming column "text" by convention
df["text"] = page_content


In [6]:
# clean up steps - adhered to the lecture content
# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]
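To make the filter above concrete, here is a minimal sketch on a hypothetical list of lines. The sample strings are invented for illustration and are not taken from the scrape. It uses plain Python rather than pandas, but applies the same two conditions: the line is non-empty and does not start with `==`.

```python
# Hypothetical sample lines, shaped like a plain-text Wikipedia extract
sample_lines = [
    "Chagas disease is a tropical parasitic disease.",
    "",
    "== Signs and symptoms ==",
    "The acute stage is often symptom-free.",
    "=== Complications ===",
]

def keep_line(line):
    """Mirror the dataframe filter: drop empty lines and wiki-style headings."""
    return len(line) > 0 and not line.startswith("==")

cleaned = [line for line in sample_lines if keep_line(line)]
print(cleaned)
```

Sub-headings such as `=== Complications ===` are removed as well, because they also start with `==`.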

In [7]:
# Inspect the number of rows and characters per row to evaluate the amount and quality of data
num = df["text"].str.len()
print("No. of characters in each row in the text column:")
print(num)
print()
num_tot = num.sum()
print("Total no. of characters in the text column:")
print(num_tot)

No. of characters in each row in the text column:
0     2231
1      609
6     3994
11     849
16    2430
21    1670
26    2874
31    3059
36    2304
40    1416
45    3066
49    1562
54    3328
61     890
65    1219
69      39
70      30
71      84
79      24
80      58
81      67
82     113
Name: text, dtype: int64

Total no. of characters in the text column:
31916


**Observation**: The overall number of characters in the text should result in a token count similar to the example in the case study, ≈ 43K tokens. However, the characters are distributed over only 22 rows, which is on the lower end of the spectrum for effective unsupervised learning. The composition of the text nevertheless appears appropriate in terms of context, so we proceed to create embeddings from it.

## Creating Embeddings

In [8]:
# We use the single available text embedding model from OpenAI - an alternative approach would be to create embeddings
# locally, which would be of interest for private data such as medical data or proprietary information.

EmbeddingModel = "text-embedding-ada-002"

# trying the same batch size as in the case study
batch_size = 100

embeddings = []

for i in range(0, len(df), batch_size):
    # Send to OpenAI model to get embeddings
    resp = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EmbeddingModel
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in resp["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

|    | text | embeddings |
|---:|:-----|:-----------|
| 0 | Chagas disease, also known as American trypano... | [-0.008199987933039665, 0.015350688248872757, ... |
| 1 | It is estimated that 6.5 million people, mostl... | [0.0045480369590222836, 0.0056409514509141445,... |
| 6 | Chagas disease occurs in two stages: an acute ... | [-0.013317729346454144, 0.012180065736174583, ... |
| 11 | Chagas disease is caused by infection with the... | [-0.0014543388970196247, 0.018097007647156715,... |
| 16 | T. cruzi can be transmitted by various triatom... | [2.0406326939337305e-07, 0.012577449902892113,... |
| 21 | In the acute phase of the disease, signs and s... | [-0.02887132205069065, 0.0075886547565460205, ... |
| 26 | The presence of T. cruzi in the blood is diagn... | [-0.016381796449422836, 0.01853901706635952, 0... |
| 31 | Efforts to prevent Chagas disease have largely... | [-0.019528387114405632, 0.008026555180549622, ... |
| 36 | Chagas disease is managed using antiparasitic ... | [-0.01676325686275959, 0.014513242989778519, 0... |
| 40 | In the chronic stage, treatment involves manag... | [-0.016756875440478325, 0.0055229151621460915,... |
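The batching loop above can be illustrated without calling the API. The following is a minimal sketch, assuming a hypothetical `fake_embed` stand-in that returns one short vector per input text; `batch_slices` is likewise a helper introduced only for this illustration. It shows that with 22 rows and a batch size of 100, a single request covers the whole dataset.

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, stop) index pairs that cover n_rows in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

def fake_embed(texts):
    """Hypothetical stand-in for the embedding API: one short vector per text."""
    return [[float(len(t))] for t in texts]

texts = [f"row {i}" for i in range(22)]  # 22 rows, as in this dataset

embeddings = []
for start, stop in batch_slices(len(texts), batch_size=100):
    embeddings.extend(fake_embed(texts[start:stop]))

print(list(batch_slices(22, 100)))  # a single batch covers all rows
print(list(batch_slices(22, 10)))   # a smaller batch size needs three requests
print(len(embeddings))              # one embedding per row
```

The batch size only starts to matter once the dataset grows beyond the number of inputs sent in one request.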


## Checkpoint

In [9]:
# Creating embeddings through the OpenAI API costs credits, hence the embeddings are saved to file for use at a later date
# instead of calculating them again

df.to_csv("embeddings.csv", index=False)

In [10]:
# Check content of directory 
!ls

data  embeddings.csv  project.ipynb


In [11]:
# Load from checkpoint - having column format reconstructed with literal eval
from ast import literal_eval

EmbeddingModel = "text-embedding-ada-002"
df = pd.read_csv("embeddings.csv")


df['embeddings'] = df['embeddings'].apply(literal_eval)

In [12]:
# Visual check of reloaded dataframe from checkpoint:
df.head()

|    | text | embeddings |
|---:|:-----|:-----------|
| 0 | Chagas disease, also known as American trypano... | [-0.008199987933039665, 0.015350688248872757, ... |
| 1 | It is estimated that 6.5 million people, mostl... | [0.0045480369590222836, 0.0056409514509141445,... |
| 2 | Chagas disease occurs in two stages: an acute ... | [-0.013317729346454144, 0.012180065736174583, ... |
| 3 | Chagas disease is caused by infection with the... | [-0.0014543388970196247, 0.018097007647156715,... |
| 4 | T. cruzi can be transmitted by various triatom... | [2.0406326939337305e-07, 0.012577449902892113,... |
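The reason `literal_eval` is needed when loading the checkpoint can be shown with a small round trip. This is a minimal sketch using the standard-library `csv` module and an in-memory buffer instead of pandas and a file; the two rows and their three-element vectors are made up for illustration.

```python
import csv
import io
from ast import literal_eval

# Hypothetical two-row table with short made-up embedding vectors
rows = [
    {"text": "first row", "embeddings": [0.1, -0.2, 0.3]},
    {"text": "second row", "embeddings": [0.0, 0.5, -0.5]},
]

# Write to an in-memory CSV; each list is stored as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every value is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
type_after_load = type(loaded[0]["embeddings"])
print(type_after_load)

# literal_eval parses each string back into a list of floats
for row in loaded:
    row["embeddings"] = literal_eval(row["embeddings"])
print(loaded[0]["embeddings"])
```

Without this conversion, the distance calculation in the next section would receive strings instead of numeric vectors.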


## Custom Query Completion

Compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model.

### Using cosine similarity as method to calculate distances for relevance

In [13]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EmbeddingModel)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [14]:
# Show the calculated distances for question 1 as a demonstration
get_rows_sorted_by_relevance("What are common symptoms of Chagas Disease?", df)

|    | text | embeddings | distances |
|---:|:-----|:-----------|----------:|
| 2 | Chagas disease occurs in two stages: an acute ... | [-0.013317729346454144, 0.012180065736174583, ... | 0.084001 |
| 0 | Chagas disease, also known as American trypano... | [-0.008199987933039665, 0.015350688248872757, ... | 0.09189 |
| 5 | In the acute phase of the disease, signs and s... | [-0.02887132205069065, 0.0075886547565460205, ... | 0.114001 |
| 10 | In 2019, an estimated 6.5 million people world... | [-0.009837660007178783, 0.007542857900261879, ... | 0.12111 |
| 3 | Chagas disease is caused by infection with the... | [-0.0014543388970196247, 0.018097007647156715,... | 0.122268 |
| 1 | It is estimated that 6.5 million people, mostl... | [0.0045480369590222836, 0.0056409514509141445,... | 0.124746 |
| 11 | Though Chagas is traditionally considered a di... | [0.004176131449639797, 0.005235458258539438, 0... | 0.13182 |
| 9 | In the chronic stage, treatment involves manag... | [-0.016756875440478325, 0.0055229151621460915,... | 0.133538 |
| 14 | As of 2018, standard diagnostic tests for Chag... | [-0.028471719473600388, 0.02480226941406727, 0... | 0.134231 |
| 19 | Chagas information at the U.S. Centers for Dis... | [0.0028333619702607393, 0.020676061511039734, ... | 0.135157 |


**Observation**: Manual review suggests that the distances between the embedding vectors are calculated correctly.
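For reference, cosine distance is conventionally defined as 1 minus the cosine similarity of two vectors. The sketch below computes it by hand on hypothetical three-dimensional vectors (the real `text-embedding-ada-002` embeddings have 1536 dimensions) and ranks the rows in ascending order of distance, as the function above does. The `cosine_distance` helper is written for this illustration and is not the library's implementation.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical vectors standing in for a question and three text rows
question = [1.0, 0.0, 0.0]
rows = {
    "same direction": [2.0, 0.0, 0.0],
    "partly related": [1.0, 1.0, 0.0],
    "orthogonal": [0.0, 0.0, 3.0],
}

# Shorter distance = more relevant, so sort in ascending order
ranked = sorted(rows, key=lambda name: cosine_distance(question, rows[name]))
for name in ranked:
    print(name, round(cosine_distance(question, rows[name]), 3))
```

Because the measure depends only on direction, the vector pointing the same way as the question has distance 0 regardless of its length.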

### Custom Prompt Creation
Tokenization with `tiktoken` to create a custom prompt from the question, which provides context to the OpenAI completion model.

In [16]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

In [17]:
# Question 1: demonstration of the generated prompt
print(create_prompt("What are common symptoms of Chagas Disease?", df, 1000))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Chagas disease occurs in two stages: an acute stage, which develops one to two weeks after the insect bite, and a chronic stage, which develops over many years. The acute stage is often symptom-free. When present, the symptoms are typically minor and not specific to any particular disease. Signs and symptoms include fever, malaise, headache, and enlargement of the liver, spleen, and lymph nodes. Sometimes, people develop a swollen nodule at the site of infection, which is called "Romaña's sign" if it is on the eyelid, or a "chagoma" if it is elsewhere on the skin. In rare cases (less than 1–5%), infected individuals develop severe acute disease, which can involve inflammation of the heart muscle, fluid accumulation around the heart, and inflammation of the brain and surrounding tissues, and may be life-threatening. The acute phase typically lasts f

**Observation**: Because each row holds a long string of text, as noted above, the token limit in this demo of the `create_prompt()` function must be set higher than in the case study. The custom answer function resolves this automatically, since its control flow fills the model's context up to the **max token** limit programmatically.
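The effect of long rows on the token budget can be sketched in isolation. The example below is a simplified illustration, assuming a whitespace word count as a stand-in for the `tiktoken` tokenizer; `count_tokens`, `fill_context`, and the sample rows are hypothetical. It follows the same greedy rule as `create_prompt()`: rows are added in relevance order until the next one would exceed the limit.

```python
def count_tokens(text):
    """Assumption: approximate tokens by whitespace-separated words."""
    return len(text.split())

def fill_context(ranked_texts, base_tokens, max_tokens):
    """Add texts in relevance order until the next one would exceed the budget."""
    used = base_tokens
    context = []
    for text in ranked_texts:
        used += count_tokens(text)
        if used > max_tokens:
            break
        context.append(text)
    return context

# Hypothetical rows, already sorted from most to least relevant
ranked_texts = [
    "one two three four five",   # 5 "tokens"
    "six seven eight",           # 3 "tokens"
    "nine ten eleven twelve",    # 4 "tokens"
]

print(fill_context(ranked_texts, base_tokens=2, max_tokens=10))
print(fill_context(ranked_texts, base_tokens=2, max_tokens=6))
```

With the smaller limit, even the most relevant row does not fit and the context stays empty, which is why a dataset with long rows needs a larger token limit.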

### Completion
Incorporates the custom prompt created with the `create_prompt` function and sends it to the completion model for an answer.

In [18]:
COMPLETION_MODEL_NAME = "text-davinci-003"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
        

## Custom Performance Demonstration

Showing the answer from a basic `Completion` model query as well as the answer from the custom query.

### Question 1 - **What are common symptoms of Chagas Disease?**
This is an example of a question likely to be asked by prospective patients.

In [20]:
# initial general language model
chagas_disease_symptoms = """
Question: "What are common symptoms of Chagas Disease?"
Answer:
"""
initial_chagas_disease_symptoms_answer = openai.Completion.create(
    model="text-davinci-003",
    prompt=chagas_disease_symptoms,
    max_tokens=150
)["choices"][0]["text"].strip()

In [19]:
# Customized model
custom_chagas_disease_symptom_answer = answer_question("What are common symptoms of Chagas Disease?", df)

### Question 2 - How can Chagas Disease be diagnosed and what treatments are available if tested positive?
This is an example of a question that could be asked by a physician, albeit simplified.

In [23]:
# initial general language model
chagas_diagnosis_treatment_prompt = """
Question: "How can Chagas Disease be diagnosed and what treatments are available if tested positive?"
Answer:
"""
initial_chagas_diagnosis_treatment_answer = openai.Completion.create(
    model="text-davinci-003",
    prompt=chagas_diagnosis_treatment_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()

In [29]:
# Customized model
custom_chagas_disease_diagnosis_treatment_answer = answer_question("How can Chagas Disease be diagnosed and what treatments are available if tested positive?", df)

### Evaluation of answers

In [30]:
# Questions 1 and 2:

print(f"""
"What are common symptoms of Chagas Disease?"

ORIGINAL ANSWER: {initial_chagas_disease_symptoms_answer}

CUSTOM ANSWER:   {custom_chagas_disease_symptom_answer}

"How can Chagas Disease be diagnosed and what treatments are available if tested positive?"

ORIGINAL ANSWER: {initial_chagas_diagnosis_treatment_answer}

CUSTOM ANSWER:   {custom_chagas_disease_diagnosis_treatment_answer}
""")


"What are common symptoms of Chagas Disease?"

ORIGINAL ANSWER: Common symptoms of Chagas Disease include fever, fatigue, body aches, loss of appetite, and swollen lymph nodes. In some cases, more severe symptoms can develop, such as facial swelling, difficulty breathing, an enlarged heart, or stroke.

CUSTOM ANSWER:   Common symptoms of Chagas Disease include fever, malaise, headache, enlargement of the liver, spleen and lymph nodes, Romaña's sign (a swollen nodule at the site of infection, particularly on the eyelid) and chagoma (swollen nodule elsewhere on the skin). In chronic Chagas Disease, heart palpitations, arrhythmias, heart failure, thromboembolism, chest pain, digestive issues, swelling of the esophagus or colon, constipation, numbness and altered reflexes or movement are common symptoms.

"How can Chagas Disease be diagnosed and what treatments are available if tested positive?"

ORIGINAL ANSWER: Chagas Disease can be diagnosed through a blood test which can detect the pr

## Conclusion

This initial exploration yields significantly more detailed answers about the disease than the initial, uncustomized model. In this example, it shows the power of customization for scientific purposes.

The data source for our context was "just" a Wikipedia page. If we were to change the source of information for the embeddings to, e.g., [NCBI's PubMed](https://pubmed.ncbi.nlm.nih.gov/), using an automated script to extract text from relevant publications for a query (e.g. a disease), we could readily generate a chatbot for cutting-edge scientific evaluation of **primary literature**.
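As a possible starting point for such a script, the sketch below builds a search request for the NCBI E-utilities ESearch endpoint, which returns PubMed IDs matching a query. It only constructs the URL and makes no network request. The helper name `build_pubmed_search_url` and the default of 20 results are assumptions made for this illustration, and retrieving the actual abstracts would need a further step that is not shown here.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (returns PubMed IDs matching a query)
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_pubmed_search_url(term, max_results=20):
    """Build an ESearch request URL for a disease term (hypothetical helper)."""
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # free-text query, e.g. a disease name
        "retmode": "json",      # ask for a JSON response
        "retmax": max_results,  # maximum number of IDs to return
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"

url = build_pubmed_search_url("Chagas disease")
print(url)
```

The text obtained for the returned IDs could then be loaded into the `"text"` column in the same way as the Wikipedia paragraphs above.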