# AI-DRIVEN OCULAR DISEASE DETECTION

# 1. Business Understanding

## 1.1 Project Background

Ocular diseases such as Diabetic Retinopathy (DR), Glaucoma, and Cataracts represent a significant and growing global health burden. These conditions are leading causes of preventable blindness worldwide. The key to preventing vision loss is early and accurate detection.

Currently, diagnosis relies on a manual examination of retinal fundus images by highly trained ophthalmologists. This process, while effective, faces several critical challenges:  

- **Scalability & Accessibility:** There is a global shortage of ophthalmologists, particularly in remote and underserved regions. This creates a severe bottleneck, leading to long wait times for screenings and delayed diagnoses.  

- **Time-Consuming & Repetitive:** Manual screening is a time-intensive task that consumes a significant portion of a specialist's day, much of which is spent reviewing normal, healthy eye scans.  

- **Human Factor:** The diagnostic process is subject to human fatigue and inter-observer variability, which can lead to inconsistent or missed findings.  

The convergence of deep learning, particularly in computer vision, and the increased availability of digital fundus imagery combined with patient metadata presents a transformative opportunity to address these challenges.

## 1.2 Problem Statement

The current manual screening process for ocular diseases is inefficient, unscalable, and inaccessible to large parts of the population, leading to preventable vision loss due to late detection.  

Healthcare providers require a tool that can automate the initial screening process. This tool must analyze a retinal fundus image and accurately identify the presence of multiple potential pathologies simultaneously, leveraging all available patient information for a more holistic assessment.  

This project addresses the need for an assistive tool by tackling this as a multi-label classification problem, where a single image can be flagged for one or more diseases, informed by patient demographics and comorbidities.
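As a small illustration of what "multi-label" means here, the sketch below encodes a set of disease codes as a multi-hot vector and decodes it again. It uses the single-letter codes and column order of the dataset's eight label columns (`N`, `D`, `G`, `C`, `A`, `H`, `M`, `O`); the helper names `to_multi_hot` and `from_multi_hot` are hypothetical and introduced only for this example.

```python
# Order of the eight label columns as they appear in the dataset.
CLASSES = ["N", "D", "G", "C", "A", "H", "M", "O"]


def to_multi_hot(labels):
    """Turn a collection of label codes into an 8-element 0/1 vector."""
    unknown = set(labels) - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown label codes: {sorted(unknown)}")
    return [1 if code in labels else 0 for code in CLASSES]


def from_multi_hot(vector):
    """Recover the label codes whose position is set to 1."""
    return [code for code, flag in zip(CLASSES, vector) if flag]


# One image can carry several labels at once.
print(to_multi_hot(["D", "O"]))
print(from_multi_hot([0, 0, 0, 0, 1, 1, 0, 0]))
```

Unlike a single-label (softmax) setup, more than one position can be 1 in the same vector, which is why each class later gets its own independent probability.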

## 1.3 Project Objectives

The primary objective of this project is to develop and deploy a proof-of-concept Clinical Decision Support System (CDSS) for ophthalmologists and general practitioners. The system will use a deep learning model that combines retinal fundus images with structured patient data (age, known medical history) to serve as an automated, first-pass screening tool.

The specific, measurable objectives are:  

- **To Develop a Multi-Modal Model:** Build, train, and validate a fused model combining a Convolutional Neural Network (CNN) for image analysis with a classifier for structured patient metadata (e.g., Age, Hypertension status). From a single fundus image and its supporting data, the model must accurately predict eight categories (one normal class and seven disease groups): Normal, Diabetic Retinopathy, Glaucoma, Cataract, Age-related Macular Degeneration (AMD), Hypertension, Myopia, and Other abnormalities.  

- **To Prioritize Triage:** The model will act as a triage assistant to help clinicians prioritize patient caseloads by flagging high-risk images for immediate review.  

- **To Enhance Efficiency:** Automate screening of healthy/normal scans to reduce manual review burden on specialists, allowing them to focus on complex diagnoses and treatment.  

- **To Deploy an Accessible Tool:** Deploy the trained model as an interactive web application where users can upload retinal images and input patient features (age, comorbidities) to receive clear, probabilistic multi-label outputs.
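To make the fusion idea in the first objective more concrete, here is a minimal forward-pass sketch in plain NumPy. A stand-in image embedding is concatenated with a small metadata vector and passed through one linear layer with an independent sigmoid per class. Everything specific in it is an assumption made for illustration: the embedding size, the random (untrained) weights, the age scaling, and the `fuse_and_predict` name. It is not the project's actual architecture; in a real model the embedding would come from a trained CNN.

```python
import numpy as np

CLASSES = ["N", "D", "G", "C", "A", "H", "M", "O"]  # the 8 label columns


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fuse_and_predict(image_embedding, metadata, weights, bias):
    """Late fusion: concatenate both feature vectors, then one linear layer.

    Returns one independent probability per class (multi-label), so the
    outputs are not required to sum to 1.
    """
    fused = np.concatenate([image_embedding, metadata])
    return sigmoid(fused @ weights + bias)


rng = np.random.default_rng(0)
embedding_dim = 16  # hypothetical size of the CNN embedding
image_embedding = rng.normal(size=embedding_dim)  # stand-in for CNN output
metadata = np.array([69 / 100.0, 1.0])  # [scaled age, sex as 0/1] (assumed encoding)

# Untrained, random parameters: the printed probabilities are meaningless
# and only demonstrate the shapes involved.
weights = rng.normal(scale=0.1, size=(embedding_dim + metadata.size, len(CLASSES)))
bias = np.zeros(len(CLASSES))

probs = fuse_and_predict(image_embedding, metadata, weights, bias)
for name, p in zip(CLASSES, probs):
    print(f"{name}: {p:.3f}")
```

Concatenating features before a shared output layer is only one possible fusion strategy; it is shown here because it is the simplest to read.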

## 1.4 Business Success Criteria

This academic project will be evaluated on both its technical performance and practical utility.

- **Primary Technical Metric (Multi-Modal Performance):** Mean Area Under the Receiver Operating Characteristic Curve (AUC-ROC) across all 8 classes, demonstrating the performance gain from incorporating structured patient data.  
  - *Target:* Mean AUC-ROC \( \geq 0.90 \) on the hold-out test set.  
  - *Rationale:* Effectively measures the ability to distinguish positive and negative cases, even for rare classes.  

- **Secondary Technical Metric:** Per-class F1-Score, Precision, and Recall to transparently show performance on common vs. rare conditions.  

- **Deployment & Utility Metric:** Successful deployment of a functional web-based application. Users must be able to upload fundus images and enter the mandatory metadata (age and at least one comorbidity such as Hypertension), then receive a human-readable probabilistic output for all 8 categories, demonstrating the tool's value as a CDSS.
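The technical metrics above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the project's evaluation code: the scores are made up, the 0.5 decision threshold is an assumption, and the helper names `auc_roc` and `precision_recall_f1` are hypothetical. AUC is computed with the rank-sum (Mann-Whitney U) identity so the example depends only on NumPy and pandas; in practice scikit-learn's `roc_auc_score` and `precision_recall_fscore_support` would be the more usual choice.

```python
import numpy as np
import pandas as pd


def auc_roc(y_true, y_score):
    """AUC via the rank-sum identity; tied scores receive average ranks."""
    y_true = np.asarray(y_true)
    ranks = pd.Series(y_score).rank(method="average").to_numpy()
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        return np.nan  # undefined when only one label value is present
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def precision_recall_f1(y_true, y_pred):
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 6 images, 2 classes (columns), made-up scores.
y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 0], [1, 1], [0, 0]])
y_score = np.array(
    [[0.9, 0.2], [0.3, 0.8], [0.7, 0.4], [0.4, 0.1], [0.6, 0.3], [0.2, 0.5]]
)

n_classes = y_true.shape[1]
per_class_auc = [auc_roc(y_true[:, k], y_score[:, k]) for k in range(n_classes)]
mean_auc = float(np.nanmean(per_class_auc))  # macro average over classes

y_pred = (y_score >= 0.5).astype(int)  # assumed threshold
per_class_prf = [precision_recall_f1(y_true[:, k], y_pred[:, k]) for k in range(n_classes)]

print("per-class AUC:", per_class_auc, "mean:", mean_auc)
print("per-class (precision, recall, F1):", per_class_prf)
```

Averaging the per-class AUC values gives every class equal weight, so rare classes influence the mean as much as common ones; the per-class precision, recall, and F1 then show where a high mean hides a weak class.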




# 2. Initial Data Exploration / Data Understanding

In [1]:
# IMPORT RELEVANT LIBRARIES
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [4]:
# load the dataset
DF = pd.read_csv('full_df.csv')
DF.head()

Unnamed: 0,ID,Patient Age,Patient Sex,Left-Fundus,Right-Fundus,Left-Diagnostic Keywords,Right-Diagnostic Keywords,N,D,G,C,A,H,M,O,filepath,labels,target,filename
0,0,69,Female,0_left.jpg,0_right.jpg,cataract,normal fundus,0,0,0,1,0,0,0,0,../input/ocular-disease-recognition-odir5k/ODI...,['N'],"[1, 0, 0, 0, 0, 0, 0, 0]",0_right.jpg
1,1,57,Male,1_left.jpg,1_right.jpg,normal fundus,normal fundus,1,0,0,0,0,0,0,0,../input/ocular-disease-recognition-odir5k/ODI...,['N'],"[1, 0, 0, 0, 0, 0, 0, 0]",1_right.jpg
2,2,42,Male,2_left.jpg,2_right.jpg,laser spot，moderate non proliferative retinopathy,moderate non proliferative retinopathy,0,1,0,0,0,0,0,1,../input/ocular-disease-recognition-odir5k/ODI...,['D'],"[0, 1, 0, 0, 0, 0, 0, 0]",2_right.jpg
3,4,53,Male,4_left.jpg,4_right.jpg,macular epiretinal membrane,mild nonproliferative retinopathy,0,1,0,0,0,0,0,1,../input/ocular-disease-recognition-odir5k/ODI...,['D'],"[0, 1, 0, 0, 0, 0, 0, 0]",4_right.jpg
4,5,50,Female,5_left.jpg,5_right.jpg,moderate non proliferative retinopathy,moderate non proliferative retinopathy,0,1,0,0,0,0,0,0,../input/ocular-disease-recognition-odir5k/ODI...,['D'],"[0, 1, 0, 0, 0, 0, 0, 0]",5_right.jpg


In [5]:
# check tail
DF.tail()

Unnamed: 0,ID,Patient Age,Patient Sex,Left-Fundus,Right-Fundus,Left-Diagnostic Keywords,Right-Diagnostic Keywords,N,D,G,C,A,H,M,O,filepath,labels,target,filename
6387,4686,63,Male,4686_left.jpg,4686_right.jpg,severe nonproliferative retinopathy,proliferative diabetic retinopathy,0,1,0,0,0,0,0,0,../input/ocular-disease-recognition-odir5k/ODI...,['D'],"[0, 1, 0, 0, 0, 0, 0, 0]",4686_left.jpg
6388,4688,42,Male,4688_left.jpg,4688_right.jpg,moderate non proliferative retinopathy,moderate non proliferative retinopathy,0,1,0,0,0,0,0,0,../input/ocular-disease-recognition-odir5k/ODI...,['D'],"[0, 1, 0, 0, 0, 0, 0, 0]",4688_left.jpg
6389,4689,54,Male,4689_left.jpg,4689_right.jpg,mild nonproliferative retinopathy,normal fundus,0,1,0,0,0,0,0,0,../input/ocular-disease-recognition-odir5k/ODI...,['D'],"[0, 1, 0, 0, 0, 0, 0, 0]",4689_left.jpg
6390,4690,57,Male,4690_left.jpg,4690_right.jpg,mild nonproliferative retinopathy,mild nonproliferative retinopathy,0,1,0,0,0,0,0,0,../input/ocular-disease-recognition-odir5k/ODI...,['D'],"[0, 1, 0, 0, 0, 0, 0, 0]",4690_left.jpg
6391,4784,58,Male,4784_left.jpg,4784_right.jpg,hypertensive retinopathy，age-related macular d...,hypertensive retinopathy，age-related macular d...,0,0,0,0,1,1,0,0,../input/ocular-disease-recognition-odir5k/ODI...,['H'],"[0, 0, 0, 0, 0, 1, 0, 0]",4784_left.jpg


In [16]:
# shape of dataset
print(f" This dataset has {DF.shape[0]} observations and {DF.shape[1]} variables")

 This dataset has 6392 observations and 19 variables


In [9]:
# Get metadata
DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6392 entries, 0 to 6391
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   ID                         6392 non-null   int64 
 1   Patient Age                6392 non-null   int64 
 2   Patient Sex                6392 non-null   object
 3   Left-Fundus                6392 non-null   object
 4   Right-Fundus               6392 non-null   object
 5   Left-Diagnostic Keywords   6392 non-null   object
 6   Right-Diagnostic Keywords  6392 non-null   object
 7   N                          6392 non-null   int64 
 8   D                          6392 non-null   int64 
 9   G                          6392 non-null   int64 
 10  C                          6392 non-null   int64 
 11  A                          6392 non-null   int64 
 12  H                          6392 non-null   int64 
 13  M                          6392 non-null   int64 
 14  O       

In [10]:
# check null values
DF.isna().sum()

ID                           0
Patient Age                  0
Patient Sex                  0
Left-Fundus                  0
Right-Fundus                 0
Left-Diagnostic Keywords     0
Right-Diagnostic Keywords    0
N                            0
D                            0
G                            0
C                            0
A                            0
H                            0
M                            0
O                            0
filepath                     0
labels                       0
target                       0
filename                     0
dtype: int64

In [12]:
# duplicates
DF.duplicated().sum()

0

In [13]:
# Statistical information numeric
DF.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,6392.0,2271.150814,1417.559018,0.0,920.75,2419.5,3294.0,4784.0
Patient Age,6392.0,57.857947,11.727737,1.0,51.0,59.0,66.0,91.0
N,6392.0,0.328692,0.469775,0.0,0.0,0.0,1.0,1.0
D,6392.0,0.332134,0.471016,0.0,0.0,0.0,1.0,1.0
G,6392.0,0.062109,0.241372,0.0,0.0,0.0,0.0,1.0
C,6392.0,0.062891,0.242786,0.0,0.0,0.0,0.0,1.0
A,6392.0,0.049906,0.217768,0.0,0.0,0.0,0.0,1.0
H,6392.0,0.031758,0.17537,0.0,0.0,0.0,0.0,1.0
M,6392.0,0.047872,0.213513,0.0,0.0,0.0,0.0,1.0
O,6392.0,0.248436,0.432139,0.0,0.0,0.0,0.0,1.0


In [14]:
# Statistical information categorical
DF.describe(include='O').T

Unnamed: 0,count,unique,top,freq
Patient Sex,6392,2,Male,3424
Left-Fundus,6392,3358,0_left.jpg,2
Right-Fundus,6392,3358,0_right.jpg,2
Left-Diagnostic Keywords,6392,196,normal fundus,2796
Right-Diagnostic Keywords,6392,205,normal fundus,2705
filepath,6392,6392,../input/ocular-disease-recognition-odir5k/ODI...,1
labels,6392,8,['N'],2873
target,6392,8,"[1, 0, 0, 0, 0, 0, 0, 0]",2873
filename,6392,6392,0_right.jpg,1


In [15]:
# check columns
DF.columns

Index(['ID', 'Patient Age', 'Patient Sex', 'Left-Fundus', 'Right-Fundus',
       'Left-Diagnostic Keywords', 'Right-Diagnostic Keywords', 'N', 'D', 'G',
       'C', 'A', 'H', 'M', 'O', 'filepath', 'labels', 'target', 'filename'],
      dtype='object')

In [23]:
# Explore value counts for each column
for column in DF.columns:
    print(f"Value counts for column '{column}':")
    print(DF[column].value_counts())
    print("\n")

Value counts for column 'ID':
ID
0       2
2985    2
2987    2
2988    2
2989    2
       ..
516     1
518     1
528     1
548     1
4659    1
Name: count, Length: 3358, dtype: int64


Value counts for column 'Patient Age':
Patient Age
56    294
60    285
54    277
62    265
65    252
     ... 
15      2
19      2
14      2
91      2
17      2
Name: count, Length: 75, dtype: int64


Value counts for column 'Patient Sex':
Patient Sex
Male      3424
Female    2968
Name: count, dtype: int64


Value counts for column 'Left-Fundus':
Left-Fundus
0_left.jpg       2
2985_left.jpg    2
2987_left.jpg    2
2988_left.jpg    2
2989_left.jpg    2
                ..
516_left.jpg     1
518_left.jpg     1
528_left.jpg     1
548_left.jpg     1
4659_left.jpg    1
Name: count, Length: 3358, dtype: int64


Value counts for column 'Right-Fundus':
Right-Fundus
0_right.jpg       2
2985_right.jpg    2
2987_right.jpg    2
2988_right.jpg    2
2989_right.jpg    2
                 ..
516_right.jpg     1
518_right.

### Dataset Overview

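- **Total Images**: 6,392 eye photos (one row per image)
- **Total Patients**: 3,358 unique patient IDs
- **Left/Right Eyes**: Every row lists both a left and a right fundus filename; each ID appears in one or two rows
- **Most Common Condition**: By per-image label, Normal (`['N']`, 2,873 images); note that the `D` flag column is set slightly more often (2,123 rows) than the `N` flag column (2,101 rows)
- **Key Diseases**: Diabetic Retinopathy, Cataract, Glaucoma
- **Age Range**: 1-91 years per `describe()` (median 59; half of the rows fall between 51 and 66)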


## Dataset Column Descriptions

| Column Name | Description | Key Insights |
|-------------|-------------|-------------|
| **ID** | Patient identification number | • 3,358 unique patients<br>• Some patients have 2 entries |
| **Patient Age** | Age of patients | • Range: 1-91 years<br>• Most common: 56, 60, 54 years |
| **Patient Sex** | Gender of patients | • Male: 3,424<br>• Female: 2,968 |
| **Left-Fundus** | Left eye image filename | • Format: `ID_left.jpg`<br>• 3,358 unique values |
| **Right-Fundus** | Right eye image filename | • Format: `ID_right.jpg`<br>• 3,358 unique values |
| **Left-Diagnostic Keywords** | Doctor's notes for left eye | • 196 unique conditions<br>• Most common: "normal fundus" (2,796) |
| **Right-Diagnostic Keywords** | Doctor's notes for right eye | • 205 unique conditions<br>• Most common: "normal fundus" (2,705) |
| **N** | Normal (healthy) | • Normal (`N` = 1): 2,101<br>• Not normal (`N` = 0): 4,291 |
| **D** | Diabetic Retinopathy | • Without: 4,269<br>• With: 2,123 |
| **G** | Glaucoma | • Without: 5,995<br>• With: 397 |
| **C** | Cataract | • Without: 5,990<br>• With: 402 |
| **A** | Age-related Macular Degeneration | • Without: 6,073<br>• With: 319 |
| **H** | Hypertension | • Without: 6,189<br>• With: 203 |
| **M** | Myopia | • Without: 6,086<br>• With: 306 |
| **O** | Other abnormalities | • Without: 4,804<br>• With: 1,588 |
| **filepath** | Full image file path | • 6,392 unique paths<br>• Training images location |
| **labels** | Disease labels as text | • Most common: ['N'] = Normal (2,873) |
| **target** | Disease labels as binary array | • [1,0,0,0,0,0,0,0] = Normal<br>• [0,1,0,0,0,0,0,0] = Diabetic Retinopathy |
| **filename** | Image filename only | • 6,392 unique filenames |
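The "With" and "Without" counts for the eight flag columns can be derived directly from the data rather than read off `value_counts()` output. The sketch below does this on a four-row toy frame whose flag values are copied from the `head()`/`tail()` output above; the `flag_summary` helper is a hypothetical name used only for this example.

```python
import pandas as pd

LABEL_COLS = ["N", "D", "G", "C", "A", "H", "M", "O"]

# Four rows copied from the head()/tail() output (flag columns only).
toy = pd.DataFrame(
    [
        [0, 0, 0, 1, 0, 0, 0, 0],  # ID 0
        [1, 0, 0, 0, 0, 0, 0, 0],  # ID 1
        [0, 1, 0, 0, 0, 0, 0, 1],  # ID 2
        [0, 0, 0, 0, 1, 1, 0, 0],  # ID 4784
    ],
    columns=LABEL_COLS,
)


def flag_summary(df):
    """Rows with and without each flag set, one line per label column."""
    with_flag = df[LABEL_COLS].sum()
    return pd.DataFrame({"with": with_flag, "without": len(df) - with_flag})


summary = flag_summary(toy)
flags_per_row = toy[LABEL_COLS].sum(axis=1)  # values above 1 mark multi-label rows

print(summary)
print(flags_per_row.tolist())
```

On the full dataset the same call would be `flag_summary(DF)`. The per-row sum is a quick way to see how many rows carry more than one flag, which matters for the multi-label framing.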

# Data Understanding: Key Findings

In this phase, we performed an initial investigation of the **full_df.csv** file to understand our data before preparing it for modeling.

---

## Initial State of the DataFrame

### Complete Data
The dataset is **complete**: `DF.isna().sum()` reports **zero missing values** in every column, and `DF.duplicated().sum()` reports no fully duplicated rows.

### Good Columns
- **Patient Age** and the **8 disease columns** (`N`, `D`, `G`, `C`, `A`, `H`, `M`, `O`) were already clean.  
- These were stored as **integers**, which is ideal for modeling.

### Problem Columns
Three columns need cleaning and transformation:
- **target**:  
  - Contains values such as `"[1, 0, 0, ...]"`.  
  - Although each value looks like a list, it is stored as a string (dtype `object`).  
  - It needs to be converted into a **real list of integers**.
  
- **labels**:  
  - Stored as strings like `['N']`, not as actual Python lists.  
  - This column must be parsed into an appropriate list format.

- **Patient Sex**:  
  - Stored as a text column containing `"Male"` and `"Female"`.  
  - For modeling, it should be **encoded into numeric values**:
    - `0` for Male
    - `1` for Female

---

## Conclusion

The **raw dataset** is complete, but three columns (`target`, `labels`, and `Patient Sex`) are stored as text, which prevents direct use in a modeling pipeline.

The next step, **Data Preparation**, will focus on fixing these data type issues by:
1. Converting the **target** and **labels** strings into actual Python lists.  
2. Converting the **Patient Sex** column from text to numeric encoding.  
3. Creating a final, cleaned DataFrame that is ready for model training.
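The preparation steps listed above could look roughly like the following. This is a minimal sketch, assuming the string columns always contain valid Python list literals; the two-row `raw` frame uses values copied from the `head()` output, and the `prepare` function name is hypothetical. `ast.literal_eval` is used because it parses literals without executing code, unlike `eval`.

```python
import ast

import pandas as pd

# Two rows shaped like the real file (values copied from the head() output).
raw = pd.DataFrame(
    {
        "Patient Age": [69, 42],
        "Patient Sex": ["Female", "Male"],
        "labels": ["['N']", "['D']"],
        "target": ["[1, 0, 0, 0, 0, 0, 0, 0]", "[0, 1, 0, 0, 0, 0, 0, 0]"],
    }
)


def prepare(df):
    """Return a copy with list-like strings parsed and sex encoded as 0/1."""
    out = df.copy()
    # Parse the string representations into real Python lists.
    out["target"] = out["target"].apply(ast.literal_eval)
    out["labels"] = out["labels"].apply(ast.literal_eval)
    # Encoding stated in the findings: 0 for Male, 1 for Female.
    out["Patient Sex"] = out["Patient Sex"].map({"Male": 0, "Female": 1})
    return out


clean = prepare(raw)
print(clean.dtypes)
print(clean.loc[0, "target"], clean.loc[0, "labels"], clean.loc[0, "Patient Sex"])
```

Working on a copy keeps the raw frame untouched, so the original strings remain available if the parsing needs to be rechecked. Any value of `Patient Sex` other than `"Male"` or `"Female"` would map to `NaN`, which is worth asserting against on the full data.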