In [2]:
import pandas as pd
from ydata_profiling import ProfileReport

# Problem Exploration

Doctors have the complicated task of identifying disease(s) a patient may have from information they can collect. This information can come in the form of:

- Reported symptoms from patients (from a clinical interview).
- Clinical history.
- Conducting a physical exam.
- Conducting diagnostic tests:
  - Biopsy.
  - Colonoscopy.
  - CT scan.
  - Electrocardiogram (ECG).
  - ...
- Consulting with other clinicians.

## Problem as a ML Optimisation Objective

**Primary Goal:** Given a binary-encoded vector of symptoms (a multi-hot vector with a 1 for each symptom present), find the most likely disease the patient has.

**Future Goal:** Given the same binary-encoded symptom vector, produce the set of diseases (possibly of size 1) that the patient is likely to have. This may require additional data, especially for evaluation.
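
To make the input format concrete, here is a minimal sketch of how a list of reported symptoms could be turned into such a vector. The four-symptom vocabulary and the `encode_symptoms` helper are illustrative assumptions, not part of the dataset or of any library; the real dataset has far more symptom columns.

```python
import numpy as np

# Hypothetical, shortened symptom vocabulary (names borrowed from the dataset columns).
SYMPTOMS = ["itching", "skin_rash", "nodal_skin_eruptions", "continuous_sneezing"]


def encode_symptoms(reported, vocabulary=SYMPTOMS):
    """Return a 0/1 vector with a 1 at each position whose symptom was reported."""
    reported = set(reported)
    unknown = reported - set(vocabulary)
    if unknown:
        # Mirrors the assumption that patients only report known symptoms.
        raise ValueError(f"Unknown symptoms: {sorted(unknown)}")
    return np.array([1 if s in reported else 0 for s in vocabulary], dtype=np.int8)


vector = encode_symptoms(["itching", "nodal_skin_eruptions"])
print(vector)  # [1 0 1 0]
```

Several positions can be 1 at once, which is why the vector is better thought of as multi-hot than as strictly one-hot.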

## What does Success Look Like?

Today's CAD (Computer-Aided Diagnosis) systems have been reported to achieve a hit rate of up to 90%, where hit rate is the sensitivity, TP / (TP + FN). In other words, of the patients who actually have a given disease, up to 90% are correctly identified.
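
As a small worked example of the sensitivity formula, the sketch below computes it from raw counts and, for a multi-class setting like this one, per disease from lists of true and predicted labels. The function names and the numbers are made up for illustration.

```python
from collections import Counter


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Sensitivity (recall) = TP / (TP + FN)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("No positive cases: sensitivity is undefined.")
    return true_positives / total


def per_class_sensitivity(y_true, y_pred):
    """Sensitivity per label: fraction of each true label that was predicted correctly."""
    positives = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / count for label, count in positives.items()}


# Illustrative numbers only: 90 diseased patients detected, 10 missed.
print(sensitivity(90, 10))  # 0.9

y_true = ["GERD", "GERD", "Allergy", "Allergy"]
y_pred = ["GERD", "Allergy", "Allergy", "Allergy"]
print(per_class_sensitivity(y_true, y_pred))  # {'GERD': 0.5, 'Allergy': 1.0}
```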

Models that have been used include:
- KNN
- ANN
- Decision Trees and Random Forests
- Genetic Algorithms
- Naive Bayes

References:
- [Computer-aided diagnosis (Wikipedia)](https://en.wikipedia.org/wiki/Computer-aided_diagnosis)
- [Computer-aided diagnosis systems: a comparative study of classical machine learning versus deep learning-based approaches](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10205571/)

## Assumptions

- The patient has a disease in our set of diseases.
- All patients have a disease. There is no "no disease" prognosis.
- The patient is only experiencing symptoms in our set of symptoms.
- An accurate diagnosis only requires knowing whether each symptom is present (a binary yes/no), not a measure of its extent or severity.
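
The binary-symptom assumption can be checked against the data once it is loaded. The following is a minimal sketch on a toy frame; `toy_df` and `non_binary_columns` are hypothetical names, and the values are made up.

```python
import pandas as pd

# Toy stand-in for the symptom table; column names mirror the dataset, values are invented.
toy_df = pd.DataFrame({
    "itching": [1, 0, 1],
    "skin_rash": [1, 1, 0],
    "prognosis": ["Fungal infection", "Fungal infection", "Allergy"],
})


def non_binary_columns(df, target="prognosis"):
    """Return the feature columns that contain any value other than 0 or 1."""
    features = df.drop(columns=[target])
    is_binary = features.isin([0, 1]).all()  # one boolean per column
    return list(is_binary[~is_binary].index)


print(non_binary_columns(toy_df))  # []
```

A missing value is neither 0 nor 1, so a column containing NaN would also be reported by this check.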

In [3]:
train_df = pd.read_csv("./dataset/Training.csv")
train_df.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis,Unnamed: 133
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
1,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
4,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
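
The preview above shows a trailing `Unnamed: 133` column with no visible values. Assuming it is entirely empty (this often happens when CSV rows end with a trailing comma, though that has not been verified here), one way to drop it is sketched below on a toy frame rather than on `train_df` itself.

```python
import numpy as np
import pandas as pd

# Toy frame reproducing the pattern: a trailing column that is entirely NaN.
toy_df = pd.DataFrame({
    "itching": [1, 0],
    "prognosis": ["Fungal infection", "Allergy"],
    "Unnamed: 133": [np.nan, np.nan],
})

# Drop only the columns in which every value is missing.
cleaned = toy_df.dropna(axis="columns", how="all")
print(list(cleaned.columns))  # ['itching', 'prognosis']
```

Using `how="all"` leaves any column with at least one real value in place, so it is safer than dropping by name if the column index ever changes.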


In [4]:
len(train_df)

4920

In [5]:
test_df = pd.read_csv("./dataset/Testing.csv")
test_df.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
1,0,0,0,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Allergy
2,0,0,0,0,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,GERD
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Chronic cholestasis
4,1,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,Drug Reaction


In [6]:
len(test_df)

42

In [7]:
print("Number of unique diseases (train):", train_df['prognosis'].nunique())

Number of unique diseases (train): 41


In [8]:
print("Number of unique diseases (test):", test_df['prognosis'].nunique())

Number of unique diseases (test): 41


In [9]:
print("Non common prognosis:", set(train_df['prognosis']) ^ set(test_df['prognosis']))

Non common prognosis: set()
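
The `^` operator above is the set symmetric difference: it returns the labels that appear in exactly one of the two sets, so an empty result means both splits cover the same diseases. A small sketch with made-up label sets shows what a mismatch would look like and how to tell which side is missing a label.

```python
# Made-up label sets for illustration; not the real dataset contents.
train_labels = {"GERD", "Allergy", "Fungal infection"}
test_labels = {"GERD", "Allergy", "Malaria"}

only_in_one = sorted(train_labels ^ test_labels)    # in exactly one split
only_in_train = sorted(train_labels - test_labels)  # never evaluated
only_in_test = sorted(test_labels - train_labels)   # never seen in training

print(only_in_one)    # ['Fungal infection', 'Malaria']
print(only_in_train)  # ['Fungal infection']
print(only_in_test)   # ['Malaria']
```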


In [11]:
isGenerateReport = False

In [12]:
if isGenerateReport:
    # Generating the profiling report is slow, so it is gated behind the flag above.
    profile = ProfileReport(train_df, title="Symptoms Prognosis Dataset Report", explorative=True)
    profile.to_file("symptoms-prognosis-dataset-report.html")

# Model Building

## Feature Engineering

- Using an LLM to categorise symptoms by:
  - How each symptom is detected (self-reported, or via a diagnostic test such as an ECG or CT scan).
- Correlating symptoms with one another to try to find unidentified symptoms.
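
The symptom-correlation idea can be sketched as follows. On 0/1 columns, the Pearson correlation returned by `DataFrame.corr()` is the phi coefficient, so strongly correlated pairs can be listed directly. The toy data, the `correlated_pairs` helper, and the 0.8 threshold are all assumptions for illustration.

```python
import pandas as pd

# Toy binary symptom table; values are invented.
toy_symptoms = pd.DataFrame({
    "itching":   [1, 1, 0, 0, 1, 0],
    "skin_rash": [1, 1, 0, 0, 1, 0],
    "chills":    [0, 0, 1, 1, 0, 1],
    "fatigue":   [1, 0, 1, 0, 1, 0],
})

corr = toy_symptoms.corr()  # Pearson correlation between every pair of columns


def correlated_pairs(corr_matrix, threshold=0.8):
    """List (symptom_a, symptom_b, r) for each pair with |r| >= threshold."""
    pairs = []
    cols = list(corr_matrix.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr_matrix.loc[a, b]
            # NaN (from constant columns) fails this comparison and is skipped.
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 2)))
    return pairs


for a, b, r in correlated_pairs(corr):
    print(f"{a} ~ {b}: r = {r}")
```

In this toy table `itching` and `skin_rash` always co-occur, while `chills` appears exactly when they do not, so all three pairs are reported; `fatigue` is only weakly related and is left out.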

## Model Selection

Classification Models:
- CART Decision Tree
- Random Forest
- Multinomial Logistic Regression
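
As a rough sketch of how the three candidate model families (decision tree, random forest, logistic regression) could be compared, the example below fits scikit-learn implementations on synthetic binary data. The data, the three made-up "diseases", the 5% noise rate, and the hyperparameters are all assumptions; this is not the notebook's actual training code, and scikit-learn is an extra dependency not imported elsewhere in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 3 made-up diseases, each marked by 2 of 6 binary symptoms.
rng = np.random.default_rng(0)
signatures = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
])
y = rng.integers(0, 3, size=300)
X = signatures[y].copy()

# Flip roughly 5% of entries to simulate noisy symptom reporting.
noise = rng.random(X.shape) < 0.05
X = np.where(noise, 1 - X, X)

# Simple holdout split: first 240 rows for training, last 60 for testing.
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the holdout rows
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Accuracy is used here only because it is the default `score` for these classifiers; given the sensitivity target discussed earlier, per-disease recall would be a more relevant metric on the real data.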