# Glaucoma Detection Dataset

## Purpose
> ##### --> A dataset containing all the required fields to build AI/ML models to detect glaucoma

## Description
> ##### --> 10,000 patient records for glaucoma detection, with the following fields:

        -   Patient ID
        -   Age
        -   Gender
        -   Visual Acuity Measurements
        -   Intraocular Pressure (IOP)
        -   Cup-to-Disc Ratio (CDR)
        -   Family History
        -   Medical History
        -   Medication Usage
        -   Visual Field Test Results
        -   Optical Coherence Tomography (OCT) Results
        -   Pachymetry
        -   Cataract Status
        -   Angle Closure Status
        -   Visual Symptoms
        -   Diagnosis
        -   Glaucoma Type

In [1]:
# Import Libraries

import pandas as pd
import numpy as np

In [2]:
# Read The File
raw_df=pd.read_csv('../data/raw/glaucoma_dataset.csv')

In [3]:
df=raw_df.copy()
df.head()

Unnamed: 0,Patient ID,Age,Gender,Visual Acuity Measurements,Intraocular Pressure (IOP),Cup-to-Disc Ratio (CDR),Family History,Medical History,Medication Usage,Visual Field Test Results,Optical Coherence Tomography (OCT) Results,Pachymetry,Cataract Status,Angle Closure Status,Visual Symptoms,Diagnosis,Glaucoma Type
0,62431,69,Male,LogMAR 0.1,19.46,0.42,No,Diabetes,"Amoxicillin, Lisinopril, Omeprazole, Atorvasta...","Sensitivity: 0.54, Specificity: 0.75","RNFL Thickness: 86.48 µm, GCC Thickness: 64.14...",541.51,Present,Open,"Tunnel vision, Eye pain, Nausea",No Glaucoma,Primary Open-Angle Glaucoma
1,68125,69,Female,LogMAR 0.1,18.39,0.72,No,Hypertension,"Lisinopril, Amoxicillin, Atorvastatin, Ibuprof...","Sensitivity: 0.72, Specificity: 0.88","RNFL Thickness: 96.88 µm, GCC Thickness: 56.48...",552.77,Absent,Open,"Redness in the eye, Vision loss, Tunnel vision",No Glaucoma,Juvenile Glaucoma
2,63329,67,Female,20/40,23.65,0.72,No,Hypertension,"Amoxicillin, Ibuprofen, Metformin, Atorvastati...","Sensitivity: 0.56, Specificity: 0.8","RNFL Thickness: 89.81 µm, GCC Thickness: 59.05...",573.65,Absent,Closed,"Halos around lights, Vision loss, Redness in t...",No Glaucoma,Juvenile Glaucoma
3,47174,23,Male,LogMAR 0.0,18.04,0.61,No,,"Ibuprofen, Aspirin","Sensitivity: 0.6, Specificity: 0.93","RNFL Thickness: 87.25 µm, GCC Thickness: 63.98...",590.67,Absent,Closed,"Nausea, Nausea, Halos around lights",No Glaucoma,Congenital Glaucoma
4,67361,21,Male,LogMAR 0.1,15.87,0.3,No,Diabetes,"Amoxicillin, Omeprazole, Aspirin, Ibuprofen, A...","Sensitivity: 0.82, Specificity: 0.9","RNFL Thickness: 82.61 µm, GCC Thickness: 66.01...",588.41,Absent,Closed,"Eye pain, Eye pain, Tunnel vision",No Glaucoma,Primary Open-Angle Glaucoma


# EDA (Exploratory Data Analysis)

In [4]:
# Shape as (rows, columns)
df.shape

(10000, 17)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 17 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Patient ID                                  10000 non-null  int64  
 1   Age                                         10000 non-null  int64  
 2   Gender                                      10000 non-null  object 
 3   Visual Acuity Measurements                  10000 non-null  object 
 4   Intraocular Pressure (IOP)                  10000 non-null  float64
 5   Cup-to-Disc Ratio (CDR)                     10000 non-null  float64
 6   Family History                              10000 non-null  object 
 7   Medical History                             7453 non-null   object 
 8   Medication Usage                            8769 non-null   object 
 9   Visual Field Test Results                   10000 non-null  object 
 10  Optical Coh

In [6]:
# The Percentage of Nulls
df.isnull().sum()/len(df)*100

Patient ID                                     0.00
Age                                            0.00
Gender                                         0.00
Visual Acuity Measurements                     0.00
Intraocular Pressure (IOP)                     0.00
Cup-to-Disc Ratio (CDR)                        0.00
Family History                                 0.00
Medical History                               25.47
Medication Usage                              12.31
Visual Field Test Results                      0.00
Optical Coherence Tomography (OCT) Results     0.00
Pachymetry                                     0.00
Cataract Status                                0.00
Angle Closure Status                           0.00
Visual Symptoms                                0.00
Diagnosis                                      0.00
Glaucoma Type                                  0.00
dtype: float64
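The same percentages can be obtained from the mean of the boolean null mask, which avoids dividing by `len(df)` by hand. The sketch below uses a small hypothetical frame (`toy`) rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "age": [69, 23, 21, 67],
    "medical_history": ["Diabetes", np.nan, "Diabetes", np.nan],
})

# isna() yields booleans; their column-wise mean is the fraction of missing values
null_pct = toy.isna().mean().mul(100).round(2)
print(null_pct)
```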

In [7]:
df.columns

Index(['Patient ID', 'Age', 'Gender', 'Visual Acuity Measurements',
       'Intraocular Pressure (IOP)', 'Cup-to-Disc Ratio (CDR)',
       'Family History', 'Medical History', 'Medication Usage',
       'Visual Field Test Results',
       'Optical Coherence Tomography (OCT) Results', 'Pachymetry',
       'Cataract Status', 'Angle Closure Status', 'Visual Symptoms',
       'Diagnosis', 'Glaucoma Type'],
      dtype='object')

In [8]:
df['Patient ID'].nunique()

10000

## Data Cleaning

In [9]:
df.columns=df.columns.str.lower().str.replace(' ','_')

In [10]:
df.columns

Index(['patient_id', 'age', 'gender', 'visual_acuity_measurements',
       'intraocular_pressure_(iop)', 'cup-to-disc_ratio_(cdr)',
       'family_history', 'medical_history', 'medication_usage',
       'visual_field_test_results',
       'optical_coherence_tomography_(oct)_results', 'pachymetry',
       'cataract_status', 'angle_closure_status', 'visual_symptoms',
       'diagnosis', 'glaucoma_type'],
      dtype='object')

In [11]:
df.columns=['patient_id', 'age', 'gender', 'visual_acuity_measurements',
       'iop', 'cdr',
       'family_history', 'medical_history', 'medication_usage',
       'visual_field_test_results',
       'oct_results', 'pachymetry',
       'cataract_status', 'angle_closure_status', 'visual_symptoms',
       'diagnosis', 'glaucoma_type']
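Assigning a full list to `df.columns` depends on the column order staying the same. As a possible alternative, a rename mapping touches only the columns it names. The sketch below runs on a hypothetical empty frame (`toy`) with four of the original headers.

```python
import pandas as pd

# Hypothetical frame with a few of the original column headers
toy = pd.DataFrame(columns=[
    "Patient ID",
    "Intraocular Pressure (IOP)",
    "Cup-to-Disc Ratio (CDR)",
    "Optical Coherence Tomography (OCT) Results",
])

# Same normalization as in the notebook: lower-case, spaces to underscores
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# Shorten only the long names; columns not listed here keep their names
short_names = {
    "intraocular_pressure_(iop)": "iop",
    "cup-to-disc_ratio_(cdr)": "cdr",
    "optical_coherence_tomography_(oct)_results": "oct_results",
}
toy = toy.rename(columns=short_names)
print(list(toy.columns))
```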

In [12]:
df.age.max()

90

In [13]:
df.gender.unique()

array(['Male', 'Female'], dtype=object)

In [14]:
df.visual_acuity_measurements.unique()

array(['LogMAR 0.1', '20/40', 'LogMAR 0.0', '20/20'], dtype=object)

In [15]:
df.visual_acuity_measurements[2]

'20/40'

In [16]:
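def correct_visual_acuity(x):
    # Convert Snellen fractions to their LogMAR equivalents (20/20 -> 0.0, 20/40 -> 0.3)
    if x['visual_acuity_measurements']=='20/20':
        return 'LogMAR 0.0'
    elif x['visual_acuity_measurements']=='20/40':
        return 'LogMAR 0.3'
    else:
        # Value is already expressed in LogMAR; keep it unchanged
        return x['visual_acuity_measurements']
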

In [17]:
# Normalize every value to 'LogMAR <number>', then keep only the numeric part
df['visual_acuity_measurements']=df.apply(correct_visual_acuity,axis=1).apply(lambda y:y.split(' ')[1])

> The unit of visual_acuity_measurements is LogMAR

In [18]:
df.visual_acuity_measurements=df.visual_acuity_measurements.astype(float)

In [19]:
df.visual_acuity_measurements[0]

0.1
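The hard-coded mapping in `correct_visual_acuity` covers the two Snellen values present in this dataset. For other fractions, LogMAR is the base-10 logarithm of the Snellen denominator divided by the numerator. A minimal sketch of that conversion follows; the helper names `snellen_to_logmar` and `to_logmar` are hypothetical and not part of the notebook.

```python
import math


def snellen_to_logmar(snellen: str) -> float:
    """Convert a Snellen fraction such as '20/40' to LogMAR, rounded to 2 decimals."""
    numerator, denominator = (float(part) for part in snellen.split("/"))
    return round(math.log10(denominator / numerator), 2)


def to_logmar(value: str) -> float:
    """Accept either 'LogMAR <number>' or a Snellen fraction."""
    if value.startswith("LogMAR"):
        return float(value.split()[1])
    return snellen_to_logmar(value)


for raw in ["LogMAR 0.1", "20/40", "LogMAR 0.0", "20/20"]:
    print(raw, "->", to_logmar(raw))
```

This reproduces the notebook's values for 20/20 (0.0) and 20/40 (0.3).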

In [20]:
df['iop']

0       19.46
1       18.39
2       23.65
3       18.04
4       15.87
        ...  
9995    22.83
9996    11.72
9997    10.67
9998    23.37
9999    22.90
Name: iop, Length: 10000, dtype: float64

In [21]:
df.cdr.min(),df.cdr.max()

(0.3, 0.8)

In [22]:
df.cdr.unique()

array([0.42, 0.72, 0.61, 0.3 , 0.58, 0.39, 0.46, 0.52, 0.37, 0.71, 0.55,
       0.41, 0.69, 0.43, 0.47, 0.38, 0.33, 0.75, 0.54, 0.68, 0.48, 0.65,
       0.64, 0.7 , 0.36, 0.6 , 0.57, 0.49, 0.51, 0.74, 0.34, 0.35, 0.4 ,
       0.78, 0.77, 0.56, 0.62, 0.67, 0.53, 0.66, 0.63, 0.59, 0.73, 0.32,
       0.76, 0.44, 0.8 , 0.45, 0.5 , 0.79, 0.31])

In [23]:
df.family_history.unique()

array(['No', 'Yes'], dtype=object)

In [24]:
df.medical_history.unique()

array(['Diabetes', 'Hypertension', nan, 'Glaucoma in family'],
      dtype=object)

In [25]:
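# Assign the result back: inplace fillna on a column accessor may not update df under pandas Copy-on-Write
df['medical_history']=df['medical_history'].fillna('other')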

In [26]:
df.isnull().sum()

patient_id                       0
age                              0
gender                           0
visual_acuity_measurements       0
iop                              0
cdr                              0
family_history                   0
medical_history                  0
medication_usage              1231
visual_field_test_results        0
oct_results                      0
pachymetry                       0
cataract_status                  0
angle_closure_status             0
visual_symptoms                  0
diagnosis                        0
glaucoma_type                    0
dtype: int64

In [27]:
df.medication_usage.nunique()

4079

In [28]:
df.medication_usage.mode()

0    Amoxicillin
Name: medication_usage, dtype: object

In [29]:
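# Assign the result back: inplace fillna on a column accessor may not update df under pandas Copy-on-Write
df['medication_usage']=df['medication_usage'].fillna('Amoxicillin')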

In [30]:
df.isnull().sum()

patient_id                    0
age                           0
gender                        0
visual_acuity_measurements    0
iop                           0
cdr                           0
family_history                0
medical_history               0
medication_usage              0
visual_field_test_results     0
oct_results                   0
pachymetry                    0
cataract_status               0
angle_closure_status          0
visual_symptoms               0
diagnosis                     0
glaucoma_type                 0
dtype: int64

In [31]:
df.visual_field_test_results

0       Sensitivity: 0.54, Specificity: 0.75
1       Sensitivity: 0.72, Specificity: 0.88
2        Sensitivity: 0.56, Specificity: 0.8
3        Sensitivity: 0.6, Specificity: 0.93
4        Sensitivity: 0.82, Specificity: 0.9
                        ...                 
9995    Sensitivity: 0.81, Specificity: 0.97
9996     Sensitivity: 0.7, Specificity: 0.97
9997      Sensitivity: 0.8, Specificity: 0.9
9998     Sensitivity: 0.68, Specificity: 0.9
9999    Sensitivity: 0.87, Specificity: 0.73
Name: visual_field_test_results, Length: 10000, dtype: object

In [32]:
df['sensitivity']=df.visual_field_test_results.apply(lambda x:x.split(',')[0]).apply(lambda y:y.split(':')[1]).astype(float)

In [33]:
df['specificity']=df.visual_field_test_results.apply(lambda x:x.split(',')[1]).apply(lambda y:y.split(':')[1]).astype(float)

In [34]:
df.drop('visual_field_test_results',axis=1,inplace=True)
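The two chained `split` calls can also be written as a single regular expression with named groups, which yields both columns in one pass. This is only a sketch on a hypothetical two-row frame (`toy`); the pattern assumes every row follows the `Sensitivity: x, Specificity: y` layout shown above.

```python
import pandas as pd

# Hypothetical sample in the same format as visual_field_test_results
toy = pd.DataFrame({"visual_field_test_results": [
    "Sensitivity: 0.54, Specificity: 0.75",
    "Sensitivity: 0.6, Specificity: 0.93",
]})

# Named groups become the column names of the extracted frame
pattern = (
    r"Sensitivity:\s*(?P<sensitivity>[\d.]+),"
    r"\s*Specificity:\s*(?P<specificity>[\d.]+)"
)
parsed = toy["visual_field_test_results"].str.extract(pattern).astype(float)

# Attach the numeric columns and drop the raw text column
toy = toy.join(parsed).drop(columns="visual_field_test_results")
print(toy)
```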

In [35]:
df.head()

Unnamed: 0,patient_id,age,gender,visual_acuity_measurements,iop,cdr,family_history,medical_history,medication_usage,oct_results,pachymetry,cataract_status,angle_closure_status,visual_symptoms,diagnosis,glaucoma_type,sensitivity,specificity
0,62431,69,Male,0.1,19.46,0.42,No,Diabetes,"Amoxicillin, Lisinopril, Omeprazole, Atorvasta...","RNFL Thickness: 86.48 µm, GCC Thickness: 64.14...",541.51,Present,Open,"Tunnel vision, Eye pain, Nausea",No Glaucoma,Primary Open-Angle Glaucoma,0.54,0.75
1,68125,69,Female,0.1,18.39,0.72,No,Hypertension,"Lisinopril, Amoxicillin, Atorvastatin, Ibuprof...","RNFL Thickness: 96.88 µm, GCC Thickness: 56.48...",552.77,Absent,Open,"Redness in the eye, Vision loss, Tunnel vision",No Glaucoma,Juvenile Glaucoma,0.72,0.88
2,63329,67,Female,0.3,23.65,0.72,No,Hypertension,"Amoxicillin, Ibuprofen, Metformin, Atorvastati...","RNFL Thickness: 89.81 µm, GCC Thickness: 59.05...",573.65,Absent,Closed,"Halos around lights, Vision loss, Redness in t...",No Glaucoma,Juvenile Glaucoma,0.56,0.8
3,47174,23,Male,0.0,18.04,0.61,No,other,"Ibuprofen, Aspirin","RNFL Thickness: 87.25 µm, GCC Thickness: 63.98...",590.67,Absent,Closed,"Nausea, Nausea, Halos around lights",No Glaucoma,Congenital Glaucoma,0.6,0.93
4,67361,21,Male,0.1,15.87,0.3,No,Diabetes,"Amoxicillin, Omeprazole, Aspirin, Ibuprofen, A...","RNFL Thickness: 82.61 µm, GCC Thickness: 66.01...",588.41,Absent,Closed,"Eye pain, Eye pain, Tunnel vision",No Glaucoma,Primary Open-Angle Glaucoma,0.82,0.9


In [36]:
df.oct_results[1]

'RNFL Thickness: 96.88 µm, GCC Thickness: 56.48 µm, Retinal Volume: 5.69 mm³, Macular Thickness: 261.48 µm'

> The thickness values in OCT Results (RNFL, GCC, Macular) are in micrometers (µm); Retinal Volume is in mm³

In [37]:
df['oct_results'].apply(lambda x:x.split(',')[0]).apply(lambda y:y.split(':')[1]).apply(lambda z:z.split(' ')[1]).astype(float)


0       86.48
1       96.88
2       89.81
3       87.25
4       82.61
        ...  
9995    83.41
9996    83.04
9997    95.93
9998    92.84
9999    88.77
Name: oct_results, Length: 10000, dtype: float64

In [38]:
df['rnfl_thickness']=df['oct_results'].apply(lambda x:x.split(',')[0]).apply(lambda y:y.split(':')[1]).apply(lambda z:z.split(' ')[1]).astype(float)

df['gcc_thickness']=df['oct_results'].apply(lambda x:x.split(',')[1]).apply(lambda y:y.split(':')[1]).apply(lambda z:z.split(' ')[1]).astype(float)
df['retinal_volume']=df['oct_results'].apply(lambda x:x.split(',')[2]).apply(lambda y:y.split(':')[1]).apply(lambda z:z.split(' ')[1]).astype(float)
df['macular_thickness']=df['oct_results'].apply(lambda x:x.split(',')[3]).apply(lambda y:y.split(':')[1]).apply(lambda z:z.split(' ')[1]).astype(float)


In [39]:
df.drop('oct_results',axis=1,inplace=True)
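The four near-identical extraction lines can be collapsed into one parser that handles every `name: value unit` pair in the string. The sketch below assumes that layout holds for all rows; `parse_oct` is a hypothetical helper, and `toy` is a one-row stand-in for the real column.

```python
import pandas as pd

# Hypothetical sample in the same format as oct_results
toy = pd.Series([
    "RNFL Thickness: 96.88 µm, GCC Thickness: 56.48 µm, "
    "Retinal Volume: 5.69 mm³, Macular Thickness: 261.48 µm"
], name="oct_results")


def parse_oct(text: str) -> dict:
    """Split 'Name: value unit' pairs into {snake_case_name: float_value}."""
    fields = {}
    for part in text.split(","):
        name, raw = part.split(":")
        key = name.strip().lower().replace(" ", "_")
        # raw looks like ' 96.88 µm'; the first token is the number
        fields[key] = float(raw.split()[0])
    return fields


# One dict per row, expanded into one column per measurement
parsed = pd.DataFrame(toy.map(parse_oct).tolist(), index=toy.index)
print(parsed)
```

Because the column names come from the text itself, a fifth measurement would show up as a new column without further code.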

In [40]:
df.pachymetry.min(),df.pachymetry.max()

(500.01, 599.99)

In [41]:
df.cataract_status.unique()

array(['Present', 'Absent'], dtype=object)

In [42]:
df.angle_closure_status.unique()

array(['Open', 'Closed'], dtype=object)

In [43]:
df.visual_symptoms

0                         Tunnel vision, Eye pain, Nausea
1          Redness in the eye, Vision loss, Tunnel vision
2       Halos around lights, Vision loss, Redness in t...
3                     Nausea, Nausea, Halos around lights
4                       Eye pain, Eye pain, Tunnel vision
                              ...                        
9995                    Eye pain, Eye pain, Tunnel vision
9996              Eye pain, Halos around lights, Vomiting
9997                Vision loss, Vomiting, Blurred vision
9998                  Halos around lights, Nausea, Nausea
9999              Eye pain, Eye pain, Halos around lights
Name: visual_symptoms, Length: 10000, dtype: object

In [44]:
df.diagnosis.unique()

array(['No Glaucoma', 'Glaucoma'], dtype=object)

In [45]:
df.glaucoma_type.unique()

array(['Primary Open-Angle Glaucoma', 'Juvenile Glaucoma',
       'Congenital Glaucoma', 'Normal-Tension Glaucoma',
       'Angle-Closure Glaucoma', 'Secondary Glaucoma'], dtype=object)

In [46]:
df.columns

Index(['patient_id', 'age', 'gender', 'visual_acuity_measurements', 'iop',
       'cdr', 'family_history', 'medical_history', 'medication_usage',
       'pachymetry', 'cataract_status', 'angle_closure_status',
       'visual_symptoms', 'diagnosis', 'glaucoma_type', 'sensitivity',
       'specificity', 'rnfl_thickness', 'gcc_thickness', 'retinal_volume',
       'macular_thickness'],
      dtype='object')

In [47]:
df.head()

Unnamed: 0,patient_id,age,gender,visual_acuity_measurements,iop,cdr,family_history,medical_history,medication_usage,pachymetry,...,angle_closure_status,visual_symptoms,diagnosis,glaucoma_type,sensitivity,specificity,rnfl_thickness,gcc_thickness,retinal_volume,macular_thickness
0,62431,69,Male,0.1,19.46,0.42,No,Diabetes,"Amoxicillin, Lisinopril, Omeprazole, Atorvasta...",541.51,...,Open,"Tunnel vision, Eye pain, Nausea",No Glaucoma,Primary Open-Angle Glaucoma,0.54,0.75,86.48,64.14,5.63,283.67
1,68125,69,Female,0.1,18.39,0.72,No,Hypertension,"Lisinopril, Amoxicillin, Atorvastatin, Ibuprof...",552.77,...,Open,"Redness in the eye, Vision loss, Tunnel vision",No Glaucoma,Juvenile Glaucoma,0.72,0.88,96.88,56.48,5.69,261.48
2,63329,67,Female,0.3,23.65,0.72,No,Hypertension,"Amoxicillin, Ibuprofen, Metformin, Atorvastati...",573.65,...,Closed,"Halos around lights, Vision loss, Redness in t...",No Glaucoma,Juvenile Glaucoma,0.56,0.8,89.81,59.05,5.96,282.34
3,47174,23,Male,0.0,18.04,0.61,No,other,"Ibuprofen, Aspirin",590.67,...,Closed,"Nausea, Nausea, Halos around lights",No Glaucoma,Congenital Glaucoma,0.6,0.93,87.25,63.98,6.44,262.86
4,67361,21,Male,0.1,15.87,0.3,No,Diabetes,"Amoxicillin, Omeprazole, Aspirin, Ibuprofen, A...",588.41,...,Closed,"Eye pain, Eye pain, Tunnel vision",No Glaucoma,Primary Open-Angle Glaucoma,0.82,0.9,82.61,66.01,6.16,261.78


In [48]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
patient_id,10000.0,50002.1688,28939.82498,4.0,24660.25,50091.5,74829.25,99992.0
age,10000.0,53.8722,21.127563,18.0,36.0,54.0,72.0,90.0
visual_acuity_measurements,10000.0,0.09844,0.121684,0.0,0.0,0.0,0.1,0.3
iop,10000.0,17.507527,4.356101,10.0,13.76,17.485,21.3,25.0
cdr,10000.0,0.548437,0.144326,0.3,0.42,0.55,0.67,0.8
pachymetry,10000.0,549.733974,28.902741,500.01,524.59,549.335,574.9725,599.99
sensitivity,10000.0,0.750052,0.143988,0.5,0.63,0.75,0.87,1.0
specificity,10000.0,0.850237,0.086709,0.7,0.77,0.85,0.92,1.0
rnfl_thickness,10000.0,87.433765,7.176537,75.01,81.28,87.38,93.64,100.0
gcc_thickness,10000.0,62.514676,4.333931,55.0,58.82,62.52,66.24,70.0


In [49]:
df.to_csv('../data/processed/EDA & Cleaning.csv')
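By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again without `index_col`. The sketch below shows the difference on a hypothetical two-row frame written to a temporary directory; whether to keep the index in the processed file is a choice for the downstream notebooks.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame; file names are placeholders
toy = pd.DataFrame({"patient_id": [62431, 68125], "iop": [19.46, 18.39]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                   # index written as first column
    toy.to_csv(without_index, index=False)   # data columns only

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```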