
In [1]:
import pandas as pd

In [21]:
df = pd.read_csv('NHIS_eye_dataset.csv')
df.sample(5)

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,DataSource,Topic,Category,Question,Response,Age,...,ResponseID,DataValueTypeID,AgeID,SexID,RaceEthnicityID,RiskFactorID,RiskFactorResponseID,GeoLocation,Geographic Level,StateAbbreviation
608,2016,2017,US,National,NHIS,Eye Health Conditions,Self-Report Age Related Macular Degeneration,Percentage of adults ever told by a doctor or ...,Yes,40-64 years,...,RYES,CRDPREV,AGE4064,GM,BLK,RFHT,RFYES,,National,US
8783,2016,2017,US,National,NHIS,Service Utilization,Eye Protection,Proportion of adults who participate in activi...,Refused,85 years and older,...,RRF,CRDPREV,AGE85PLUS,GM,BLK,RFHT,RFNO,,National,US
9072,2016,2017,US,National,NHIS,Service Utilization,Eye Protection,Proportion of adults who participate in activi...,Some of the time,18 years and older,...,RSOMT,CRDPREV,AGE18PLUS,GF,ASN,RFDM,RFYES,,National,US
2361,2016,2017,US,National,NHIS,Eye Health Conditions,Self-Report Diabetic Retinopathy,Percentage of adults ever told by a doctor or ...,Yes,40-64 years,...,RYES,CRDPREV,AGE4064,GALL,BLK,RFDM,RFNO,,National,US
8714,2016,2017,US,National,NHIS,Service Utilization,Eye Protection,Proportion of adults who participate in activi...,Refused,65-84 years,...,RRF,CRDPREV,AGE6584,GM,WHT,RFAPAR,RFTOT,,National,US


Dropping every row that contains a null value would wipe out a large portion of the dataset, so the cleaning steps below are more selective.
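
To see how much a blanket `dropna()` would cost, it can help to compare it against a targeted drop. The sketch below uses a small made-up frame (the `toy` name and its values are illustrative assumptions, not rows from the NHIS file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the NHIS data; values are made up for illustration.
toy = pd.DataFrame({
    "Question": ["q1", "q2", "q3", "q4"],
    "Data_Value": [2.74, np.nan, 5.25, 48.5],
    "Numerator": [np.nan, np.nan, np.nan, np.nan],  # entirely empty column
})

# Share of missing values per column (0.0 = complete, 1.0 = fully empty).
missing_share = toy.isna().mean()

# Rows that survive a blanket dropna() versus one restricted to Data_Value.
rows_blanket = len(toy.dropna())
rows_targeted = len(toy.dropna(subset=["Data_Value"]))

print(missing_share)
print(rows_blanket, rows_targeted)  # 0 3
```

A single fully empty column is enough for the blanket drop to remove every row, which is why empty columns are removed first and rows are dropped only on the column that matters.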

In [42]:
# Drop columns that have zero data
df = df.drop(columns=['Numerator', 'GeoLocation'])
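
Rather than hard-coding the names, the fully empty columns could also be detected programmatically. This is a minimal sketch on a made-up frame; it assumes `Numerator` and `GeoLocation` are empty in every row, as the comment in the cell above suggests:

```python
import numpy as np
import pandas as pd

# Toy frame; the two empty columns mirror the names dropped above.
toy = pd.DataFrame({
    "Data_Value": [2.74, 5.25],
    "Numerator": [np.nan, np.nan],
    "GeoLocation": [np.nan, np.nan],
})

# A column is "empty" when every one of its values is null.
empty_cols = toy.columns[toy.isna().all()].tolist()

cleaned = toy.drop(columns=empty_cols)
print(empty_cols)                # ['Numerator', 'GeoLocation']
print(cleaned.columns.tolist())  # ['Data_Value']
```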

In [13]:
df.isna().sum()

Unnamed: 0,0
YearStart,0
YearEnd,0
LocationAbbr,0
LocationDesc,0
DataSource,0
Topic,0
Category,0
Question,0
Response,0
Age,0


In [43]:
# Drop rows only if the actual results (Data_Value) are missing
df_model = df.dropna(subset=['Data_Value']).copy()

1. What is Data_Value?

The Data_Value column is the primary metric of this dataset. It represents the statistical result for a specific survey question (e.g., "Difficulty seeing even when wearing glasses").

  * Type: Usually a percentage (Prevalence) or a crude rate.

  * Context: It must be interpreted alongside Data_Value_Unit (%) and Data_Value_Type (e.g., Age-adjusted Prevalence).
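
Because values of different types are not directly comparable, one option is to check which type and unit combinations are present and then restrict the data to a single type. The sketch below uses a made-up frame; the label `"Crude Prevalence"` is an assumption for illustration and should be replaced with whatever labels the real `Data_Value_Type` column contains:

```python
import pandas as pd

# Toy frame; type labels and values are illustrative assumptions.
toy = pd.DataFrame({
    "Data_Value": [2.74, 2.34, 5.25, 48.5],
    "Data_Value_Unit": ["%", "%", "%", "%"],
    "Data_Value_Type": [
        "Crude Prevalence",
        "Age-adjusted Prevalence",
        "Crude Prevalence",
        "Crude Prevalence",
    ],
})

# How many rows fall under each (type, unit) combination?
type_counts = toy.groupby(["Data_Value_Type", "Data_Value_Unit"]).size()
print(type_counts)

# Keep one value type so the target is measured on a single scale.
crude_only = toy[toy["Data_Value_Type"] == "Crude Prevalence"]
print(len(crude_only))  # 3
```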

In [44]:
# Inspect the unique values of Data_Value
df_model['Data_Value'].unique()

array([ 2.74,  2.34,  5.25, ..., 48.5 , 46.5 , 39.6 ])

The target variable `Data_Value` is a population-level estimate, such as the prevalence of an eye condition. The model uses demographic factors, risk factors, geographic location, and temporal information to predict how that estimate varies.

In [32]:
df_model.shape[0]

11295

In [45]:
df_model.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11295 entries, 0 to 23740
Data columns (total 35 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   YearStart                   11295 non-null  int64  
 1   YearEnd                     11295 non-null  int64  
 2   LocationAbbr                11295 non-null  object 
 3   LocationDesc                11295 non-null  object 
 4   DataSource                  11295 non-null  object 
 5   Topic                       11295 non-null  object 
 6   Category                    11295 non-null  object 
 7   Question                    11295 non-null  object 
 8   Response                    11295 non-null  object 
 9   Age                         11295 non-null  object 
 10  Sex                         11295 non-null  object 
 11  RaceEthnicity               11295 non-null  object 
 12  RiskFactor                  11295 non-null  object 
 13  RiskFactorResponse          11295 no

In [47]:
# Keep only the target (Data_Value) and the features; drop the rest (footnotes, empty columns)
cols_to_keep = ['YearStart', 'LocationAbbr', 'Age', 'Sex', 'RaceEthnicity',
                'Question', 'Data_Value', 'RiskFactorID', 'RiskFactorResponseID']
df_model = df_model[cols_to_keep]

To predict prevalence or health outcomes, we need a mix of demographics (who they are), geography (where they are), and medical context (what other risks they have).

Each selected column plays a specific role in the machine learning model:

#### The Target Variable

* Data_Value: This is the Label, the percentage or rate the model is trained to predict. Without it, the model has no "answer key" to learn from.

#### The Features (Predictors)
* Age, Sex, RaceEthnicity: These are the Primary Demographic Predictors. Most eye conditions (like cataracts or macular degeneration) depend heavily on age, and prevalence can also vary by sex and by race or ethnicity.

* RiskFactorID & RiskFactorResponseID: These are the Clinical Predictors. Risk factors such as diabetes or smoking status are high-impact drivers of eye health. These columns help the model account for why one group might have more vision loss than another group of the same age.

* Question: This is the Context Feature. Since the dataset covers multiple eye conditions and survey questions, this column tells the model which one a given Data_Value refers to.

* LocationAbbr: This is the Environmental Predictor. It captures geographic disparities, such as differences in state-level healthcare access or regional environmental factors.

* YearStart: This is the Temporal Predictor. It helps the model account for trends over time, such as improvements in medical technology or changes in survey methods.
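
Before training, it may be worth checking that each predictor actually varies, since the sampled rows in this notebook all show `LocationAbbr` as `US` and `YearStart` as `2016` (this may or may not hold for the full dataset). The sketch below, on a made-up frame, separates the target, drops constant columns, and one-hot encodes the remaining categorical features with `pd.get_dummies`; it is one possible preparation step, not necessarily the one used later in this project:

```python
import pandas as pd

# Toy frame with a subset of the selected columns; values are illustrative.
toy = pd.DataFrame({
    "YearStart": [2016, 2016, 2016],
    "LocationAbbr": ["US", "US", "US"],
    "Age": ["40-64 years", "65-84 years", "85 years and older"],
    "Sex": ["Male", "Female", "Both sexes"],
    "Data_Value": [2.74, 5.83, 53.95],
})

# Separate the label from the predictors.
y = toy["Data_Value"]
X = toy.drop(columns=["Data_Value"])

# A column with a single distinct value cannot help the model discriminate.
constant_cols = [c for c in X.columns if X[c].nunique() == 1]
X = X.drop(columns=constant_cols)

# One-hot encode the remaining categorical columns.
X_encoded = pd.get_dummies(X, columns=["Age", "Sex"])

print(constant_cols)    # ['YearStart', 'LocationAbbr']
print(X_encoded.shape)  # (3, 6)
```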

In [48]:
df_model.sample(5)

Unnamed: 0,YearStart,LocationAbbr,Age,Sex,RaceEthnicity,Question,Data_Value,RiskFactorID,RiskFactorResponseID
17215,2016,US,65-84 years,Both sexes,"White, non-Hispanic",Percentage of adults who even when wearing gla...,5.83,RFHT,RFNO
9785,2016,US,0-17 years,Both sexes,"White, non-Hispanic",Percentage of children (ages 6-17) who partici...,12.08,RFAPAR,RFTOT
22206,2016,US,85 years and older,Both sexes,All races,Percentage of adults who even when wearing gla...,0.28,RFSM,RFNEV
18335,2016,US,85 years and older,Both sexes,All races,Percentage of adults who even when wearing gla...,53.95,RFHT,RFYES
12000,2016,US,18 years and older,Female,North American Native,Percentage of adults who currently wear eyegla...,81.35,RFSM,RFFORM


In [50]:
df_model.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11295 entries, 0 to 23740
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   YearStart             11295 non-null  int64  
 1   LocationAbbr          11295 non-null  object 
 2   Age                   11295 non-null  object 
 3   Sex                   11295 non-null  object 
 4   RaceEthnicity         11295 non-null  object 
 5   Question              11295 non-null  object 
 6   Data_Value            11295 non-null  float64
 7   RiskFactorID          11295 non-null  object 
 8   RiskFactorResponseID  11295 non-null  object 
dtypes: float64(1), int64(1), object(7)
memory usage: 882.4+ KB
