# DAKTARI - THE AI MEDICAL CHATBOT

# PROJECT SUMMARY

# 1. BUSINESS UNDERSTANDING

# 2. BUSINESS PROBLEM

# 3. OBJECTIVES

## 3.1 Main objective

## 3.2 Specific objective

## 3.3 Research Questions

## 3.4 Metric of success

# 4. DATA UNDERSTANDING

## 4.1 Data Limitation 

# 5. DATA EXPLORATION

## 5.1 Loading a Dataset

In [24]:

# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
# pip install huggingface_hub  (needed so pandas can read hf:// paths)


In [3]:
df = pd.read_parquet("hf://datasets/DrBenjamin/ai-medical-chatbot/dialogues.parquet")
df.head()



| | Description | Patient | Doctor |
|---|---|---|---|
| 0 | Q. What does abutment of the nerve root mean? | Hi doctor,I am just wondering what is abutting... | Hi. I have gone through your query with dilige... |
| 1 | Q. What should I do to reduce my weight gained... | Hi doctor, I am a 22-year-old female who was d... | Hi. You have really done well with the hypothy... |
| 2 | Q. I have started to get lots of acne on my fa... | Hi doctor! I used to have clear skin but since... | Hi there Acne has multifactorial etiology. Onl... |
| 3 | Q. Why do I have uncomfortable feeling between... | Hello doctor,I am having an uncomfortable feel... | Hello. The popping and discomfort what you fel... |
| 4 | Q. My symptoms after intercourse threatns me e... | Hello doctor,Before two years had sex with a c... | Hello. The HIV test uses a finger prick blood ... |


In [4]:
df.shape

(256916, 3)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 256916 entries, 0 to 256915
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Description  256916 non-null  object
 1   Patient      256916 non-null  object
 2   Doctor       256916 non-null  object
dtypes: object(3)
memory usage: 5.9+ MB


In [6]:
df.isnull().sum()

Description    0
Patient        0
Doctor         0
dtype: int64

Since the data has no missing values, we can move on to cleaning the text.

In [9]:
df.duplicated().sum()

np.int64(10378)
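
The count above reports 10,378 fully duplicated rows, but the cells shown here do not remove them. A minimal sketch of one way to handle this, assuming exact duplicates are not wanted, is given below. The toy frame and its values are invented for illustration; only `duplicated` and `drop_duplicates` are the actual pandas calls.

```python
import pandas as pd

# Toy stand-in for the dialogue data; the rows are invented for illustration.
toy = pd.DataFrame({
    "Description": ["Q. A?", "Q. A?", "Q. B?"],
    "Patient": ["p1", "p1", "p2"],
    "Doctor": ["d1", "d1", "d2"],
})

# Count rows that repeat an earlier row in every column.
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, deduped.shape)
```

If this step is used, it is probably easiest to run it while the columns still hold plain strings, since the later tokenization step turns them into lists, which pandas cannot hash for duplicate checks.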

## 5.2 Data Cleaning

In [10]:
import nltk # natural language toolkit
import re # regular expressions
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import FreqDist
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
# Download necessary resources
nltk.download('punkt') 
nltk.download('stopwords') 
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### 5.2.1 Text preprocessing

In [12]:
df.columns

Index(['Description', 'Patient', 'Doctor'], dtype='object')

**Tokenization**

Let's break each text down into a list of word **tokens**.

In [13]:
# tokenization
df['Description'] = df['Description'].apply(nltk.word_tokenize)
df['Patient'] = df['Patient'].apply(nltk.word_tokenize)
df['Doctor'] = df['Doctor'].apply(nltk.word_tokenize)
df.head()

| | Description | Patient | Doctor |
|---|---|---|---|
| 0 | [Q, ., What, does, abutment, of, the, nerve, r... | [Hi, doctor, ,, I, am, just, wondering, what, ... | [Hi, ., I, have, gone, through, your, query, w... |
| 1 | [Q, ., What, should, I, do, to, reduce, my, we... | [Hi, doctor, ,, I, am, a, 22-year-old, female,... | [Hi, ., You, have, really, done, well, with, t... |
| 2 | [Q., I, have, started, to, get, lots, of, acne... | [Hi, doctor, !, I, used, to, have, clear, skin... | [Hi, there, Acne, has, multifactorial, etiolog... |
| 3 | [Q, ., Why, do, I, have, uncomfortable, feelin... | [Hello, doctor, ,, I, am, having, an, uncomfor... | [Hello, ., The, popping, and, discomfort, what... |
| 4 | [Q, ., My, symptoms, after, intercourse, threa... | [Hello, doctor, ,, Before, two, years, had, se... | [Hello, ., The, HIV, test, uses, a, finger, pr... |
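
In the tokenized output, every `Description` starts with a `Q.` marker, and it is split inconsistently (`Q`, `.` in most rows, `Q.` in row 2). One possible tweak, not part of the original notebook, is to strip that marker from the raw string before tokenizing. The helper name `strip_question_prefix` is hypothetical.

```python
import re

# Matches a leading "Q." marker (any case) plus surrounding whitespace.
Q_PREFIX = re.compile(r"^\s*Q\.\s*", flags=re.IGNORECASE)

def strip_question_prefix(text: str) -> str:
    """Remove a leading 'Q.' marker; text without the marker is returned unchanged."""
    return Q_PREFIX.sub("", text)

print(strip_question_prefix("Q. What does abutment of the nerve root mean?"))
```

Applied to the raw column, this would look like `df['Description'].apply(strip_question_prefix)` before the tokenization cell runs.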


**Lowercasing**

Let's convert our tokenized text to **lowercase** to ensure consistency in our data.

In [14]:
# Convert all tokens to lowercase
df['Description'] = df['Description'].apply(lambda x: [word.lower() for word in x])
df['Patient'] = df['Patient'].apply(lambda x: [word.lower() for word in x])
df['Doctor'] = df['Doctor'].apply(lambda x: [word.lower() for word in x])
df.head()

| | Description | Patient | Doctor |
|---|---|---|---|
| 0 | [q, ., what, does, abutment, of, the, nerve, r... | [hi, doctor, ,, i, am, just, wondering, what, ... | [hi, ., i, have, gone, through, your, query, w... |
| 1 | [q, ., what, should, i, do, to, reduce, my, we... | [hi, doctor, ,, i, am, a, 22-year-old, female,... | [hi, ., you, have, really, done, well, with, t... |
| 2 | [q., i, have, started, to, get, lots, of, acne... | [hi, doctor, !, i, used, to, have, clear, skin... | [hi, there, acne, has, multifactorial, etiolog... |
| 3 | [q, ., why, do, i, have, uncomfortable, feelin... | [hello, doctor, ,, i, am, having, an, uncomfor... | [hello, ., the, popping, and, discomfort, what... |
| 4 | [q, ., my, symptoms, after, intercourse, threa... | [hello, doctor, ,, before, two, years, had, se... | [hello, ., the, hiv, test, uses, a, finger, pr... |


**Stopword Removal**

**Stopwords** such as *the*, *is*, and *and* carry little meaningful information, so removing them reduces the noise in our data.

In [19]:
stop_words = set(stopwords.words('english'))

# Remove stopwords
for col in ['Description', 'Patient', 'Doctor']:
    df[col] = df[col].apply(lambda x: [word for word in x if word not in stop_words])

df[['Description', 'Patient', 'Doctor']].head()

| | Description | Patient | Doctor |
|---|---|---|---|
| 0 | [q, ., abutment, nerve, root, mean, ?] | [hi, doctor, ,, wondering, abutting, abutment,... | [hi, ., gone, query, diligence, would, like, k... |
| 1 | [q, ., reduce, weight, gained, due, genetic, h... | [hi, doctor, ,, 22-year-old, female, diagnosed... | [hi, ., really, done, well, hypothyroidism, pr... |
| 2 | [q., started, get, lots, acne, face, ,, partic... | [hi, doctor, !, used, clear, skin, since, move... | [hi, acne, multifactorial, etiology, ., acne, ... |
| 3 | [q, ., uncomfortable, feeling, middle, spine, ... | [hello, doctor, ,, uncomfortable, feeling, mid... | [hello, ., popping, discomfort, felt, either, ... |
| 4 | [q, ., symptoms, intercourse, threatns, even, ... | [hello, doctor, ,, two, years, sex, call, girl... | [hello, ., hiv, test, uses, finger, prick, blo... |
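
One caveat with the standard NLTK English stopword list is that it includes negations such as *no*, *not*, and *nor*, which can flip the meaning of a medical statement ("I do not have a fever"). The sketch below shows one way to keep them, assuming that is desirable for this project. `STOP_WORDS` is a small hand-written stand-in, not the real NLTK list, and `NEGATIONS` and `remove_stopwords` are hypothetical names.

```python
# Small illustrative stand-in for stopwords.words('english'); not the full NLTK list.
STOP_WORDS = {"i", "do", "have", "the", "a", "no", "not", "nor"}

# Words we choose to protect from removal (an assumption, adjust as needed).
NEGATIONS = {"no", "not", "nor"}

def remove_stopwords(tokens, stop_words=STOP_WORDS, keep=NEGATIONS):
    """Drop stopwords from a token list, except those listed in `keep`."""
    drop = set(stop_words) - set(keep)
    return [tok for tok in tokens if tok not in drop]

tokens = ["i", "do", "not", "have", "a", "fever"]
print(remove_stopwords(tokens))           # negation survives
print(remove_stopwords(tokens, keep=()))  # behaves like the plain filter
```

In the notebook, the same idea would amount to subtracting the protected words from `stop_words` before the filtering loop.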


**Punctuation Removal**

Next we remove punctuation marks such as `!`, `.` and `,` because they rarely add semantic meaning in our text analysis.

In [20]:
# Punctuation removal: strip non-word characters, then drop tokens left empty
def remove_punctuation(tokens):
    stripped = (re.sub(r'[^\w\s]', '', word) for word in tokens)
    return [word for word in stripped if word]

df['Description'] = df['Description'].apply(remove_punctuation)
df['Patient'] = df['Patient'].apply(remove_punctuation)
df['Doctor'] = df['Doctor'].apply(remove_punctuation)
df.head()

| | Description | Patient | Doctor |
|---|---|---|---|
| 0 | [q, abutment, nerve, root, mean] | [hi, doctor, wondering, abutting, abutment, ne... | [hi, gone, query, diligence, would, like, know... |
| 1 | [q, reduce, weight, gained, due, genetic, hypo... | [hi, doctor, 22yearold, female, diagnosed, hyp... | [hi, really, done, well, hypothyroidism, probl... |
| 2 | [q, started, get, lots, acne, face, particular... | [hi, doctor, used, clear, skin, since, moved, ... | [hi, acne, multifactorial, etiology, acne, soa... |
| 3 | [q, uncomfortable, feeling, middle, spine, lef... | [hello, doctor, uncomfortable, feeling, middle... | [hello, popping, discomfort, felt, either, imp... |
| 4 | [q, symptoms, intercourse, threatns, even, neg... | [hello, doctor, two, years, sex, call, girl, d... | [hello, hiv, test, uses, finger, prick, blood,... |
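
Because punctuation is deleted inside each token, hyphenated tokens are fused: `22-year-old` becomes `22yearold` in the output above. If separate tokens are preferred, one option is to split on hyphens before stripping the remaining punctuation. This is a sketch of an alternative, not the notebook's method, and the function name is hypothetical.

```python
import re

# Any character that is neither a word character nor whitespace.
NON_WORD = re.compile(r"[^\w\s]")

def remove_punctuation_split_hyphens(tokens):
    """Split tokens on hyphens, strip other punctuation, and drop empty pieces."""
    cleaned = []
    for token in tokens:
        for part in token.split("-"):
            part = NON_WORD.sub("", part)
            if part:
                cleaned.append(part)
    return cleaned

print(remove_punctuation_split_hyphens(["hi", ",", "22-year-old", "female", "!"]))
```

Whether `22`, `year`, `old` or the fused `22yearold` is more useful depends on the downstream model, so this is a judgment call rather than a fix.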


**Stemming and Lemmatization**

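**Stemming** reduces words to a root form by stripping suffixes, e.g. *running* to *run*.

**Lemmatization** goes further and maps words to their dictionary form using linguistic context, e.g. *better* to *good*.
Both steps help unify variations of the same word. The cell below applies lemmatization only.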

In [21]:
# Lemmatization (WordNetLemmatizer treats every token as a noun unless a POS tag is passed)
lemmatizer = WordNetLemmatizer()
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

df['Description'] = df['Description'].apply(lemmatize_tokens)
df['Patient'] = df['Patient'].apply(lemmatize_tokens)
df['Doctor'] = df['Doctor'].apply(lemmatize_tokens)
df.head()

| | Description | Patient | Doctor |
|---|---|---|---|
| 0 | [q, abutment, nerve, root, mean] | [hi, doctor, wondering, abutting, abutment, ne... | [hi, gone, query, diligence, would, like, know... |
| 1 | [q, reduce, weight, gained, due, genetic, hypo... | [hi, doctor, 22yearold, female, diagnosed, hyp... | [hi, really, done, well, hypothyroidism, probl... |
| 2 | [q, started, get, lot, acne, face, particularl... | [hi, doctor, used, clear, skin, since, moved, ... | [hi, acne, multifactorial, etiology, acne, soa... |
| 3 | [q, uncomfortable, feeling, middle, spine, lef... | [hello, doctor, uncomfortable, feeling, middle... | [hello, popping, discomfort, felt, either, imp... |
| 4 | [q, symptom, intercourse, threatns, even, nega... | [hello, doctor, two, year, sex, call, girl, da... | [hello, hiv, test, us, finger, prick, blood, s... |
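
In row 4 of the lemmatized output, the verb `uses` has become `us`. This is consistent with `WordNetLemmatizer.lemmatize` defaulting to the noun part of speech when no POS is given. A possible refinement, sketched here under the assumption that POS tagging is acceptable at this data size, is to pass a POS hint per token. The helpers `penn_to_wordnet` and `lemmatize_tagged` are hypothetical names, and the demo uses a stub in place of the real lemmatizer so the block runs without NLTK data.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBZ') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun is the fallback, as in WordNetLemmatizer

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs with any callable of the form (word, pos) -> lemma."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# Stub lemmatizer for demonstration only: it just shows which POS was passed.
demo = lemmatize_tagged([("uses", "VBZ"), ("years", "NNS")], lambda w, pos: f"{w}/{pos}")
print(demo)
```

With NLTK, the real callable would be `lemmatizer.lemmatize` and the pairs would come from `nltk.pos_tag(tokens)`, which needs an additional tagger resource that the download cell above does not fetch. Tagging is usually more reliable on full sentences, so running it after stopword removal may reduce accuracy.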


**Text Normalization**

Finally, we normalize the text by joining the processed tokens back into sentences.

In [25]:
# Rejoin tokens into cleaned text
for col in ['Description', 'Patient', 'Doctor']:
    df[col + '_cleaned'] = df[col].apply(lambda x: ' '.join(x))

df[['Description_cleaned', 'Patient_cleaned', 'Doctor_cleaned']].head()

| | Description_cleaned | Patient_cleaned | Doctor_cleaned |
|---|---|---|---|
| 0 | q abutment nerve root mean | hi doctor wondering abutting abutment nerve ro... | hi gone query diligence would like know help i... |
| 1 | q reduce weight gained due genetic hypothyroidism | hi doctor 22yearold female diagnosed hypothyro... | hi really done well hypothyroidism problem lev... |
| 2 | q started get lot acne face particularly foreh... | hi doctor used clear skin since moved new plac... | hi acne multifactorial etiology acne soap impr... |
| 3 | q uncomfortable feeling middle spine left shou... | hello doctor uncomfortable feeling middle spin... | hello popping discomfort felt either improper ... |
| 4 | q symptom intercourse threatns even negative h... | hello doctor two year sex call girl dark locat... | hello hiv test us finger prick blood sample re... |
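
The cells above overwrite `Description`, `Patient`, and `Doctor` at each step, so the raw text is lost and the tokenization cell cannot be re-run on the already tokenized columns. One alternative, shown as a sketch rather than the notebook's approach, is a single function that takes a raw string and returns the cleaned string. The name `clean_text` and its defaults are hypothetical: `str.split`, an empty stopword set, and an identity lemmatizer are stand-ins so the block runs without NLTK.

```python
import re

# Any character that is neither a word character nor whitespace.
NON_WORD = re.compile(r"[^\w\s]")

def clean_text(text, tokenize=str.split, stop_words=frozenset(), lemmatize=lambda w: w):
    """Tokenize, lowercase, drop stopwords, strip punctuation, lemmatize, rejoin."""
    tokens = [tok.lower() for tok in tokenize(text)]
    tokens = [tok for tok in tokens if tok not in stop_words]
    tokens = [NON_WORD.sub("", tok) for tok in tokens]
    tokens = [lemmatize(tok) for tok in tokens if tok]  # skip tokens left empty
    return " ".join(tokens)

print(clean_text("Hi doctor, I have a fever!", stop_words={"i", "have", "a"}))
```

With the notebook's objects, a call such as `df['Patient'].apply(lambda t: clean_text(t, nltk.word_tokenize, stop_words, lemmatizer.lemmatize))` on the raw string column would write into `Patient_cleaned` while leaving `Patient` untouched.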
