
# Assignment 3: Create a Predictive Model

This assignment focuses on building a predictive model and conducting a subsequent bias analysis using a selected biased dataset. The primary goal is to predict a specific outcome based on demographic data and then analyze any potential biases present in the model's performance due to the characteristics of the data.

To handle this I took these steps:

1.  **Data Loading and Initial Analysis:** Loading the dataset and performing an initial examination of its structure and content.
2.  **Data Cleaning and Preprocessing:** Cleaning the data to handle missing values and prepare it for modeling. This includes addressing specific data irregularities identified in relevant features.
3.  **Predictive Model Creation:** Developing a predictive model, specifically a Random Forest Classifier, which is a suitable binary classification model for this task.
4.  **Model Evaluation and Bias Analysis:** Evaluating the performance of the trained model using appropriate metrics and analyzing potential biases in its predictions, considering the known characteristics and potential limitations of the dataset.

## Data Loading, Imports, and Initial Analysis

To begin, mount Google Drive to access the data:

In [84]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [85]:
%cd /content/gdrive/My\ Drive/Vanderbilt\ Class\ Misc/"Special Topics AI"
%ls

/content/gdrive/My Drive/Vanderbilt Class Misc/Special Topics AI
'Assignment 1 Special Topics AI.gdoc'   biased_demographic.csv


Importing tools:

In [86]:
import pandas as pd

from imblearn.over_sampling import RandomOverSampler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.metrics import classification_report, confusion_matrix

The dataset stores its columns and values as codes, so I use a data dictionary to interpret them. The dictionary is based on the 2013-2014 Demographics Variable List published by the Centers for Disease Control and Prevention (CDC) through the National Health and Nutrition Examination Survey (NHANES). The variables listed below correspond to the columns of the 2013-2014 demographic dataset (demographic.csv), available from Kaggle and the CDC.

The dataset can be found here: https://www.kaggle.com/datasets/cdc/national-health-and-nutrition-examination-survey/data?select=demographic.csv

The variable names and definitions were obtained from the official CDC NHANES documentation, accessible at:
https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Demographics&CycleBeginYear=2013

In [87]:
code_data = [
    {'Code': 'SEQN', 'Definition': 'Respondent sequence number'},
    {'Code': 'SDDSRVYR', 'Definition': 'Data release cycle'},
    {'Code': 'RIDSTATR', 'Definition': 'Interview and examination status of the participant'},
    {'Code': 'RIAGENDR', 'Definition': 'Gender of the participant'},
    {'Code': 'RIDAGEYR', 'Definition': 'Age in years at screening (topcoded at 80)'},
    {'Code': 'RIDAGEMN', 'Definition': 'Age in months at screening (for participants aged 24 months or younger)'},
    {'Code': 'RIDRETH1', 'Definition': 'Race/Hispanic origin (Recode)'},
    {'Code': 'RIDRETH3', 'Definition': 'Race/Hispanic origin with Non-Hispanic Asian category'},
    {'Code': 'RIDEXMON', 'Definition': 'Six-month time period when the examination was performed'},
    {'Code': 'RIDEXAGM', 'Definition': 'Age in months at examination (for participants aged 19 years or younger)'},
    {'Code': 'DMQMILIZ', 'Definition': 'Ever served on active duty in U.S. Armed Forces, Reserves, or National Guard'},
    {'Code': 'DMQADFC', 'Definition': 'Ever served in a foreign country during armed conflict or on a humanitarian mission'},
    {'Code': 'DMDBORN4', 'Definition': 'Country of birth'},
    {'Code': 'DMDCITZN', 'Definition': 'Citizenship status'},
    {'Code': 'DMDYRSUS', 'Definition': 'Length of time in the U.S.'},
    {'Code': 'DMDEDUC3', 'Definition': 'Education level for participants aged 6–19 years'},
    {'Code': 'DMDEDUC2', 'Definition': 'Education level for participants aged 20 years and older'},
    {'Code': 'DMDMARTL', 'Definition': 'Marital status'},
    {'Code': 'RIDEXPRG', 'Definition': 'Pregnancy status at MEC exam (for females aged 20–44 years)'},
    {'Code': 'SIALANG', 'Definition': 'Language of the Sample Person interview'},
    {'Code': 'SIAPROXY', 'Definition': 'Was a proxy respondent used in the Sample Person interview?'},
    {'Code': 'SIAINTRP', 'Definition': 'Was an interpreter used in the Sample Person interview?'},
    {'Code': 'FIALANG', 'Definition': 'Language of the Family interview'},
    {'Code': 'FIAPROXY', 'Definition': 'Was a proxy respondent used in the Family interview?'},
    {'Code': 'FIAINTRP', 'Definition': 'Was an interpreter used in the Family interview?'},
    {'Code': 'MIALANG', 'Definition': 'Language of the MEC interview'},
    {'Code': 'MIAPROXY', 'Definition': 'Was a proxy respondent used in the MEC interview?'},
    {'Code': 'MIAINTRP', 'Definition': 'Was an interpreter used in the MEC interview?'},
    {'Code': 'AIALANGA', 'Definition': 'Language of the Audio Computer-Assisted Self-Interview (ACASI)'},
    {'Code': 'DMDHHSIZ', 'Definition': 'Total number of people in the household'},
    {'Code': 'DMDHHSZA', 'Definition': 'Number of children aged 0–5 years in the household'},
    {'Code': 'DMDHHSZB', 'Definition': 'Number of children aged 6–17 years in the household'},
    {'Code': 'DMDHHSZE', 'Definition': 'Number of adults aged 60 years or older in the household'},
    {'Code': 'DMDHRGND', 'Definition': 'Gender of the household reference person'},
    {'Code': 'DMDHRAGE', 'Definition': 'Age of the household reference person'},
    {'Code': 'DMDHRBR4', 'Definition': 'Country of birth of the household reference person'},
    {'Code': 'DMDHREDU', 'Definition': 'Education level of the household reference person'},
    {'Code': 'DMDHRMAR', 'Definition': 'Marital status of the household reference person'},
    {'Code': 'DMDHSEDU', 'Definition': 'Education level of the spouse of the household reference person'},
    {'Code': 'WTINT2YR', 'Definition': 'Full sample 2-year interview weight'},
    {'Code': 'WTMEC2YR', 'Definition': 'Full sample 2-year MEC exam weight'},
    {'Code': 'SDMVPSU', 'Definition': 'Masked variance unit pseudo-PSU variable for variance estimation'},
    {'Code': 'SDMVSTRA', 'Definition': 'Masked variance unit pseudo-stratum variable for variance estimation'},
    {'Code': 'INDHHIN2', 'Definition': 'Annual household income'},
    {'Code': 'INDFMIN2', 'Definition': 'Annual family income'},
    {'Code': 'INDFMPIR', 'Definition': 'Family income-to-poverty ratio'}
]

code_df = pd.DataFrame(code_data)
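
As a small convenience, the dictionary can be turned into a lookup that translates a column code into its definition. The sketch below is only illustrative: `describe_code` is a hypothetical helper name, and `sample_codes` is a three-row excerpt of `code_data` repeated so the snippet runs on its own.

```python
import pandas as pd

# Three-row excerpt of code_data, repeated so the snippet is self-contained
sample_codes = [
    {'Code': 'RIAGENDR', 'Definition': 'Gender of the participant'},
    {'Code': 'DMDHHSIZ', 'Definition': 'Total number of people in the household'},
    {'Code': 'INDHHIN2', 'Definition': 'Annual household income'},
]
sample_code_df = pd.DataFrame(sample_codes)

# Index by code so a definition can be fetched by column name
code_lookup = sample_code_df.set_index('Code')['Definition']

def describe_code(code):
    """Return the definition for a column code, or a placeholder if it is unknown."""
    return code_lookup.get(code, 'Unknown code')

print(describe_code('DMDHHSIZ'))
```

The same pattern should work with the full `code_df` in place of `sample_code_df`.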

Load the file:

In [88]:
file_path = 'biased_demographic.csv'

df = pd.read_csv(file_path)

print("Successfully loaded the data.")
print("\n--- First 5 rows of the DataFrame ---")
print(df.head())

print("\n--- DataFrame Info (Column names, data types, non-null counts) ---")
df.info()


Successfully loaded the data.

--- First 5 rows of the DataFrame ---
    SEQN  SDDSRVYR  RIDSTATR  RIAGENDR  RIDAGEYR  RIDAGEMN  RIDRETH1  \
0  73557         8         2         1        69       NaN         4   
1  73558         8         2         1        54       NaN         3   
2  73559         8         2         1        72       NaN         3   
3  73560         8         2         1         9       NaN         3   
4  73561         8         2         2        73       NaN         3   

   RIDRETH3  RIDEXMON  RIDEXAGM  ...  DMDHREDU  DMDHRMAR  DMDHSEDU  \
0         4       1.0       NaN  ...       3.0       4.0       NaN   
1         3       1.0       NaN  ...       3.0       1.0       1.0   
2         3       2.0       NaN  ...       4.0       1.0       3.0   
3         3       1.0     119.0  ...       3.0       1.0       4.0   
4         3       1.0       NaN  ...       5.0       1.0       5.0   

       WTINT2YR      WTMEC2YR  SDMVPSU  SDMVSTRA  INDHHIN2  INDFMIN2  INDFMPI

### Data Cleaning

Here we are selecting the desired features from the dataset for the predictive model and cleaning the data.

In [89]:
# Columns used for the predictive model (DMDHHSIZ is the target and is dropped from the inputs later)
predictive_features = [
    "RIAGENDR",   # Gender of the participant
    "RIDRETH1",   # Race/Hispanic origin (Recode)
    "DMDCITZN",   # Citizenship status
    "DMDHHSIZ",   # Total number of people in the household
    "DMDEDUC2",   # Education level for participants aged 20 years and older
    "DMDMARTL",   # Marital status
    "INDHHIN2",   # Annual household income
    "DMDBORN4",   # Country of Birth
    "SIALANG"     # Language of the Sample Person interview
]

# Target feature: total number of people in the household
target_feature = "DMDHHSIZ"
# The target is binarized later: 1 if the household has more than 2 people, 0 if it has 2 or fewer.

# 'INDHHIN2' (annual household income) requires cleaning because two of its codes
# overlap with the other income brackets, which causes prediction issues.


Additionally, we clean the 'INDHHIN2' (Annual household income) feature. Based on our previous analysis and the data dictionary ([https://wwwn.cdc.gov/nchs/Data/Nhanes/Public/2013/DataFiles/DEMO_H.htm#INDFMIN2](https://wwwn.cdc.gov/nchs/Data/Nhanes/Public/2013/DataFiles/DEMO_H.htm#INDFMIN2)), codes 12 and 13 represent "$20,000 and over" and "Under $20,000", respectively. These coarse categories overlap with the finer income brackets used by the other codes. To improve the accuracy and clarity of this feature for modeling, we remove rows where 'INDHHIN2' has these values.

The remaining features are relatively clean, but there is a notable amount of missing data in the 'DMDEDUC2' (Education level for participants aged 20 years and older) and 'DMDMARTL' (Marital status) categories. These missing values will need to be handled through imputation, as addressed in a later step of this notebook.

In [90]:
selected_data = df[predictive_features].copy()

# Remove rows where 'INDHHIN2' is 12 or 13. According to the data dictionary, these codes
# are coarse categories that overlap with the other income brackets, potentially
# causing prediction issues. Rows with a missing 'INDHHIN2' are kept and imputed later.
# https://wwwn.cdc.gov/nchs/Data/Nhanes/Public/2013/DataFiles/DEMO_H.htm#INDFMIN2
selected_data = selected_data[~selected_data['INDHHIN2'].isin([12.0, 13.0])].copy()

print("\n--- Features selected for prediction ---")
print(selected_data.head())


--- Features selected for prediction ---
   RIAGENDR  RIDRETH1  DMDCITZN  DMDHHSIZ  DMDEDUC2  DMDMARTL  INDHHIN2  \
0         1         4       1.0         3       3.0       4.0       4.0   
1         1         3       1.0         4       3.0       1.0       7.0   
2         1         3       1.0         2       4.0       1.0      10.0   
3         1         3       1.0         4       NaN       NaN       9.0   
4         2         3       1.0         2       5.0       1.0      15.0   

   DMDBORN4  SIALANG  
0         1        1  
1         1        1  
2         1        1  
3         1        1  
4         1        1  
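
To make the filtering step concrete, the sketch below applies the same `isin` mask to a toy column. The values are made up for illustration and are not taken from the dataset. It also shows that rows with a missing income code are not matched by the mask, so they survive the filter.

```python
import pandas as pd

# Toy INDHHIN2 values (made up for illustration, not taken from the dataset)
toy = pd.DataFrame({'INDHHIN2': [4.0, 12.0, 7.0, 13.0, None, 15.0]})

overlapping_codes = [12.0, 13.0]
# True for rows whose income code is one of the overlapping categories
mask = toy['INDHHIN2'].isin(overlapping_codes)

print("Rows flagged for removal:", int(mask.sum()))

# Keep everything that is not flagged; NaN is never matched by isin here
filtered = toy[~mask]
print("Rows kept:", len(filtered))
print("Missing values kept:", int(filtered['INDHHIN2'].isna().sum()))
```

Checking `mask.sum()` before dropping rows is a cheap way to see how much data the cleaning step discards.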


Here I am copying the dataset and checking how much missing data is present before the imputation.

In [91]:
# Copy the dataset so the cleaned selection stays unchanged while we impute
final_data = selected_data.copy()
print("Missing values before imputation:")
print(final_data.isnull().sum())

Missing values before imputation:
RIAGENDR       0
RIDRETH1       0
DMDCITZN       4
DMDHHSIZ       0
DMDEDUC2    4218
DMDMARTL    4218
INDHHIN2     133
DMDBORN4       0
SIALANG        0
dtype: int64


Because the model fails on undefined values, I impute the missing values (NaNs). The target feature is excluded from imputation so that no artificial values enter the target column. Since the features are mostly categorical, I use SimpleImputer with the 'most_frequent' strategy.

In [92]:
# Impute missing values before splitting the data
imputer = SimpleImputer(strategy='most_frequent')

# Identify features to impute (all columns except the target)
features_to_impute = [col for col in final_data.columns if col != target_feature]

# Apply imputation to the selected features
final_data[features_to_impute] = imputer.fit_transform(final_data[features_to_impute])

print("Missing values after imputation:")
print(final_data.isnull().sum())

Missing values after imputation:
RIAGENDR    0
RIDRETH1    0
DMDCITZN    0
DMDHHSIZ    0
DMDEDUC2    0
DMDMARTL    0
INDHHIN2    0
DMDBORN4    0
SIALANG     0
dtype: int64
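
One caveat with this ordering: fitting the imputer on the full dataset before the train/test split lets test rows influence the imputed values. A possible alternative, sketched below on made-up data, is to split first and fit the imputer on the training rows only. This is an assumption about how the step could be restructured, not what this notebook does; the names `toy`, `toy_target` and `train_imputer` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up categorical codes with gaps (not values from the NHANES file)
toy = pd.DataFrame({
    'DMDEDUC2': [3.0, 4.0, np.nan, 3.0, 5.0, np.nan, 3.0, 4.0],
    'DMDMARTL': [1.0, np.nan, 1.0, 4.0, 1.0, 1.0, np.nan, 1.0],
})
toy_target = pd.Series([1, 0, 1, 1, 0, 1, 0, 1])

# Split first so the test rows stay unseen during imputation
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_target, test_size=0.25, random_state=42
)

# Learn the most frequent value per column from the training rows only
train_imputer = SimpleImputer(strategy='most_frequent')
X_tr_imputed = pd.DataFrame(
    train_imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
# Reuse the training statistics on the test rows
X_te_imputed = pd.DataFrame(
    train_imputer.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(X_tr_imputed.isnull().sum().sum(), X_te_imputed.isnull().sum().sum())
```

Placing the imputer inside a scikit-learn `Pipeline` achieves the same effect, because the pipeline fits each step on the data passed to `fit` only.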


## Create a Predictive Model

Here I am setting up a predictive model pipeline for a random forest model.

In [93]:
# Prepare the data for the model:
# X holds every column except the target column.
# y is the target column converted to a binary label: 1 if the value exceeds the threshold, 0 otherwise.
# Since we are predicting households with more than 2 people, the threshold defaults to 2.
def prepare_data(df, target_column, threshold=2, test_size=0.2, random_state=42):
    X = df.drop(columns=[target_column])
    y = (pd.to_numeric(df[target_column], errors='coerce') > threshold).astype(int)
    return train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)

# Creates a train/test split for data pipeline and drops the target_feature column from the X
(X_train, X_test, y_train, y_test) = prepare_data(final_data, target_feature)

def train_pipeline(X_train, y_train, X_test, y_test):
    # All features except the target are categorical
    categorical_features = X_train.columns.tolist()
    numerical_features = [] # No numerical features to scale

    # Address class imbalance with RandomOverSampler (imported at the top of the notebook),
    # which duplicates minority-class rows and therefore works with categorical codes
    ros = RandomOverSampler(random_state=42)
    X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)

    # The preprocessor imputes any remaining missing categorical values and one-hot encodes
    # the features. No scaling is applied because there are no numerical features.
    preprocessor = ColumnTransformer(
        transformers=[
            ('cat', Pipeline([
                ('imputer', SimpleImputer(strategy='most_frequent')), # Impute missing categorical values with the most frequent
                ('onehot', OneHotEncoder(handle_unknown='ignore')) # One-hot encode categorical features
            ]), categorical_features)
        ],
        remainder='passthrough' # Keep any other columns (none in this case)
    )

    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier())
    ])

    pipeline.fit(X_train_resampled, y_train_resampled)

    return pipeline

randomForestPipeline = train_pipeline(X_train, y_train, X_test, y_test)
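
To illustrate what random oversampling does to the training set, here is a hand-rolled sketch in pandas. It is not imblearn's implementation, only the same idea on made-up data: every original row is kept, and minority-class rows are duplicated at random until each class matches the majority count.

```python
import pandas as pd

# Made-up training data: class 1 is the majority, as in this notebook
toy_train = pd.DataFrame({
    'feature': [1, 2, 3, 4, 5, 6, 7, 8],
    'label':   [1, 1, 1, 1, 1, 1, 0, 0],
})

majority_size = int(toy_train['label'].value_counts().max())

parts = []
for label, group in toy_train.groupby('label'):
    parts.append(group)  # keep every original row
    n_extra = majority_size - len(group)
    if n_extra > 0:
        # Duplicate randomly chosen rows of the smaller class
        parts.append(group.sample(n=n_extra, replace=True, random_state=42))

balanced = pd.concat(parts)
print(balanced['label'].value_counts().to_dict())
```

Only the training split is resampled. The test set keeps its original class proportions so that the reported metrics reflect the real distribution.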

Evaluating the Random Forest model on the test set:

In [94]:
y_pred = randomForestPipeline.predict(X_test)

print("\n--- Test Results from RandomForestPipeline ---")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


--- Test Results from RandomForestPipeline ---

Classification Report:
              precision    recall  f1-score   support

           0       0.49      0.62      0.55       501
           1       0.86      0.78      0.81      1443

    accuracy                           0.74      1944
   macro avg       0.67      0.70      0.68      1944
weighted avg       0.76      0.74      0.75      1944


Confusion Matrix:
[[ 313  188]
 [ 323 1120]]


In this confusion matrix (rows are actual classes, columns are predicted classes):

*   True Negatives (313): The model correctly predicted that the household had 2 or fewer people.
*   False Positives (188): The model incorrectly predicted that the household had more than 2 people when they actually had 2 or fewer people.
*   False Negatives (323): The model incorrectly predicted that the household had 2 or fewer people when they actually had more than 2 people.
*   True Positives (1120): The model correctly predicted that the household had more than 2 people.
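
The four counts above are enough to reproduce the per-class numbers in the classification report. The sketch below recomputes them by hand; `prf` is a hypothetical helper name introduced for illustration.

```python
# Counts copied from the confusion matrix above
tn, fp, fn, tp = 313, 188, 323, 1120

def prf(hits, false_alarms, misses):
    """Precision, recall and F1 for one class from its counts."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1 (more than 2 people) is the positive class
p1, r1, f1_1 = prf(tp, fp, fn)
# For Class 0 the roles swap: TN are its hits, FN its false alarms, FP its misses
p0, r0, f1_0 = prf(tn, fn, fp)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"Class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"Class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"Accuracy: {accuracy:.2f}")
```

The printed values (0.49, 0.62, 0.55 for Class 0; 0.86, 0.78, 0.81 for Class 1; accuracy 0.74) agree with the report.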

## Model Analysis

The model performs significantly better at predicting Class 1 (households > 2 people) than Class 0 (households <= 2 people), as indicated by the higher precision, recall, and F1-score for Class 1. Quantitatively, the F1-score for Class 1 is 0.81, compared to 0.55 for Class 0, highlighting a substantial difference in the model's ability to correctly identify instances of each class while balancing precision and recall.

The lower precision for Class 0 (0.49) means that when the model predicts a household has 2 or fewer people, it is wrong about half the time: 323 of its 636 Class 0 predictions were households that actually have more than 2 people.

The class imbalance (1443 instances of Class 1 vs. 501 instances of Class 0 in the test set) likely contributes to the model's better performance on the majority class (Class 1). RandomOverSampler was used on the training data to mitigate this. It is intended to improve recall for Class 0, but a notable performance gap between the classes remains.

The overall accuracy of 0.74 seems reasonable, and the weighted averages of precision, recall, and F1-score (0.74-0.76) are close to it because they are dominated by the majority class. The macro averages (0.67-0.70), which weight both classes equally, give a more balanced view and are lower, showing that performance on the minority class pulls the results down. This suggests that model performance could improve with more data on households of 2 or fewer people.
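
The difference between the two averaging schemes in the report can be checked by hand. The sketch below recomputes the macro and weighted F1-scores from the confusion-matrix counts: the macro average treats both classes equally, while the weighted average scales each class by its support.

```python
# Counts copied from the confusion matrix above
tn, fp, fn, tp = 313, 188, 323, 1120

# F1 = 2*hits / (2*hits + false alarms + misses), computed per class
f1_class0 = 2 * tn / (2 * tn + fn + fp)
f1_class1 = 2 * tp / (2 * tp + fp + fn)

# Support is the number of actual instances of each class
support0 = tn + fp
support1 = fn + tp

macro_f1 = (f1_class0 + f1_class1) / 2
weighted_f1 = (f1_class0 * support0 + f1_class1 * support1) / (support0 + support1)

print(f"macro F1={macro_f1:.2f} weighted F1={weighted_f1:.2f}")
```

This gives 0.68 for the macro average and 0.75 for the weighted average, matching the report. The gap arises because Class 1 accounts for 1443 of the 1944 test rows.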

**Impact of Data Bias:**

This dataset from a U.S. national survey (NHANES) likely contains inherent biases reflecting the survey's sampling methodology and the population it aims to represent. Larger American families appear to be overrepresented, which leads to a notable class imbalance in which smaller households (Class 0) are the minority. Conversely, smaller families and potentially non-American individuals may be underrepresented.

This uneven data distribution significantly impacts the model's performance. The model demonstrates higher proficiency in identifying patterns within the overrepresented group (larger households) due to having more data to learn from. Consequently, it is less accurate in predicting for the underrepresented group (smaller households), contributing to the observed differences in precision, recall, and F1-score between the classes.

Furthermore, features like 'DMDEDUC2' (Education level) and 'DMDMARTL' (Marital status) had a substantial amount of missing data (4218 rows each). These gaps are probably not random: per the data dictionary, 'DMDEDUC2' is recorded only for participants aged 20 years and older, so the missing rows likely correspond to younger participants. While 'most_frequent' imputation was used to handle these missing values, this strategy may have created problems of its own. It can reinforce patterns present in the majority of the data, masking variations or unique characteristics within underrepresented subgroups. This could affect how the model learns from these features and influence the fairness or accuracy of predictions for different demographic segments.
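
One way to check whether predictions differ across demographic segments is to compute the metrics per subgroup. The sketch below uses made-up labels and predictions. In this notebook the inputs would presumably come from a column of `X_test` (for example 'RIAGENDR'), `y_test` and `y_pred`, but that wiring is an assumption and has not been run here.

```python
import pandas as pd

# Made-up groups, labels and predictions for illustration only
results = pd.DataFrame({
    'group':  [1, 1, 1, 1, 2, 2, 2, 2],
    'y_true': [1, 1, 0, 0, 1, 1, 0, 0],
    'y_pred': [1, 1, 0, 1, 1, 0, 0, 0],
})

rows = []
for group, frame in results.groupby('group'):
    # Share of correct predictions within the subgroup
    accuracy = (frame['y_true'] == frame['y_pred']).mean()
    # Share of actual Class 1 rows that the model found
    positives = frame[frame['y_true'] == 1]
    recall_class1 = (positives['y_pred'] == 1).mean()
    rows.append({
        'group': group,
        'n': len(frame),
        'accuracy': accuracy,
        'recall_class1': recall_class1,
    })

per_group = pd.DataFrame(rows).set_index('group')
print(per_group)
```

In this toy data both groups have the same accuracy, but Class 1 recall differs between them. That is the kind of gap an overall metric would hide.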

To improve model performance, it would help to add more features to the data and reduce the amount of missing data.