## Leslie English

## Capstone Project - October 2024

### UC Berkeley Professional Certificate in Machine Learning and Artificial Intelligence

## Project Overview

GitHub.com Repository: https://github.com/LeslieSeattle/UCB-ML-AI-Capstone-Preliminary

#### UW Project: 
In 2023, a team of University of Washington, Seattle (UW) researchers and Seattle-area scientists published a paper titled “ExplaiNAble BioLogical Age (ENABL Age): an artificial intelligence framework for interpretable biological age” (The Lancet Healthy Longevity, Volume 4, Issue 12, E711-E723, December 2023, Open Access. Wei Qiu, MSc ∙ Hugh Chen, PhD ∙ Prof Matt Kaeberlein, PhD ∙ Prof Su-In Lee, PhD  ∙ https://www.thelancet.com/journals/lanhl/article/PIIS2666-7568(23)00189-7/fulltext). The paper explained the artificial intelligence model they built to create an age clock that estimates a person’s biological age based upon medical, health, and lifestyle data provided by the person. 

#### UW Data: 
The UW model used twenty features selected from approximately a thousand available features drawn from two dataset sources: (1) the UK Biobank, and (2) National Health and Nutrition Examination Survey (NHANES) program of the National Center for Health Statistics (NCHS) within the Centers for Disease Control and Prevention (CDC). https://www.cdc.gov/nchs/nhanes/about_nhanes.htm

#### Capstone Research Question: 
The purpose of this UCB capstone project is to return to the original NHANES database from which the UW model selected its twenty features, and build a new ML/AI model with a different combination of twenty features, to determine if a more accurate age clock can be created through a different feature selection strategy. Additionally, multiple regression models will be evaluated.

#### Capstone Rationale: 
If people can obtain detailed information about the specific aspects of their physical health that are having the most detrimental effect upon their longevity, then they are empowered to make the lifestyle choices necessary to bring their biological age into closer alignment with their chronological age, and to live a longer and healthier life.

#### Capstone Data: 
This project will only select features from the NHANES program dataset, and will not draw from data in the UK Biobank. Among the NHANES features, this project will only use laboratory bloodwork data, and will exclude the use of data collected from (1) Physical Examinations, (2) Lifestyle Questionnaires, and (3) Demographic Questionnaires.

#### Capstone Methodology: 

   ##### 1. Clean Data:     
    Remove from the NHANES dataset all features that contain physical examination data, lifestyle data, and demographic data (except age, which is the target).
   ##### 2. Select Features: 
    Perform feature selection using GridSearchCV, to narrow the dataset down to twenty features. 
   ##### 3. Build Models: 
    Build the following models, each optimized using GridSearchCV for hyperparameter tuning, possibly employing Ensemble Techniques:
            a.	Linear Regression 
            b.	Ridge Regression
            c.	Lasso Regression
            d.	Decision Tree
            e.	Random Forest
            f.	Neural Network
   ##### 4. Evaluate Model Performance: 
    Train and test each model separately on both the UW data and the Capstone data, to see which dataset results in better model performance based upon:
            a.	mean squared error (MSE), and
            b.	adjusted R2.
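
The adjusted R2 metric named in step 4 can be derived from the ordinary R2 score. The sketch below is illustrative only: the helper name `adjusted_r2` and the toy numbers are assumptions, not part of the project code. It uses scikit-learn's `r2_score` for the unadjusted value and applies the adjustment by hand.

```python
from sklearn.metrics import r2_score


def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    n is the number of observations and p is the number of features
    used by the model that produced y_pred.
    """
    n_samples = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Toy values, made up for illustration only
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.1, 7.2, 8.9, 11.0]
print(adjusted_r2(y_true, y_pred, n_features=2))
```

Because the adjustment penalizes the number of features, it allows a fairer comparison between models that use different feature counts.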


In [1]:
## Import Python libraries and functions
import pandas as pd

# Normalize the data
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer

# Implement test-train dataset split
from sklearn.model_selection import train_test_split

In [2]:
# Display the result of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### IMPORT DATA SETS

In [3]:
# 20 features used in UW project
Data_UW_20 = pd.read_csv('Data_UW_20.csv')

# 218 features from NHANES dataset
Data_NH_200 = pd.read_csv('Data_NHANES_200.csv')

## STEP 1:  CLEAN & PREPARE DATA

### UW DATASET FEATURES - PREPARE DATA

In [4]:
# The UW data set features include data categories of: demographics, patient self-reporting questionnaire,
# physical examination by a medical professional, and laboratory bloodwork results. 
Data_UW_20.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47279 entries, 0 to 47278
Data columns (total 27 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Demographics_Age                              47279 non-null  float64
 1   Laboratory_BloodLeadSI                        47279 non-null  float64
 2   Laboratory_RedCellDistributionWidth           47279 non-null  float64
 3   Questionnaire_GeneralHealth                   47279 non-null  int64  
 4   Laboratory_UrineAlbumin                       47279 non-null  float64
 5   Demographics_IncomeRatio                      47279 non-null  float64
 6   Examination_ArmCircum                         47279 non-null  float64
 7   Laboratory_BloodCadmium                       47279 non-null  float64
 8   Laboratory_AlbuminSI                          47279 non-null  float64
 9   Questionnaire_SpecialHealthCareEquipment_2.0  47279 non-null 

In [5]:
# Change name of target column from "Demographics_Age" to "Age", for consistency with Capstone dataset
UWdf = Data_UW_20.copy()
UWdf.rename(columns={"Demographics_Age": "Age"}, inplace=True)

In [6]:
# Drop all columns for year_label, mortstat, permth
UWdf.drop(list(UWdf.filter(regex='year_label|mortstat|permth')), axis=1, inplace=True)
UWdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47279 entries, 0 to 47278
Data columns (total 20 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Age                                           47279 non-null  float64
 1   Laboratory_BloodLeadSI                        47279 non-null  float64
 2   Laboratory_RedCellDistributionWidth           47279 non-null  float64
 3   Questionnaire_GeneralHealth                   47279 non-null  int64  
 4   Laboratory_UrineAlbumin                       47279 non-null  float64
 5   Demographics_IncomeRatio                      47279 non-null  float64
 6   Examination_ArmCircum                         47279 non-null  float64
 7   Laboratory_BloodCadmium                       47279 non-null  float64
 8   Laboratory_AlbuminSI                          47279 non-null  float64
 9   Questionnaire_SpecialHealthCareEquipment_2.0  47279 non-null 

In [7]:
# Split X, y
UWX = UWdf.drop('Age', axis=1)
UWy = UWdf['Age']

# Split test, train
UWX_train, UWX_test, UWy_train, UWy_test = train_test_split(UWX, UWy, test_size=0.2)

# Normalize the data
scaler = StandardScaler()
UWX_train_scaled = scaler.fit_transform(UWX_train)
UWX_test_scaled = scaler.transform(UWX_test)

#### UW dataset is now defined by:
UWX_train_scaled, UWX_test_scaled, UWy_train, UWy_test

### CAPSTONE (NHANES) DATASET - PREPARE DATA

### For the new Capstone project feature set drawn from the NHANES dataset, only laboratory data will be included, so several categories of less reliable data (patient self-reported questionnaires, physical examinations, etc.) will be removed.

In [9]:
# Change name of target column from "Demographics_Age" to "Age", 
# so it is retained when demographic data is removed from the dataset.
NHdf = Data_NH_200.copy()
NHdf.rename(columns={"Demographics_Age": "Age"}, inplace=True)

In [10]:
# Drop all columns for Dietary, Questionnaire, Examination, Demographics, year_label, mortstat, permth
NHdf.drop(list(NHdf.filter(regex='Dietary|Questionnaire|Examination|Demographics|year_label|mortstat|permth')), 
          axis=1, inplace=True)

In [11]:
NHdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47279 entries, 0 to 47278
Data columns (total 58 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Laboratory_WhiteBloodCellCount              47279 non-null  float64
 1   Laboratory_Cotinine                         47279 non-null  float64
 2   Laboratory_Sodium                           47279 non-null  float64
 3   Laboratory_MeanCellVolume                   47279 non-null  float64
 4   Laboratory_CholesterolSI                    47279 non-null  float64
 5   Laboratory_HepB                             47279 non-null  int64  
 6   Laboratory_Monocyte                         47279 non-null  float64
 7   Laboratory_UricAcid                         47279 non-null  float64
 8   Laboratory_BloodCadmium                     47279 non-null  float64
 9   Laboratory_SegmentedNeutrophils             47279 non-null  float64
 10  Laboratory

In [12]:
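# Attempt to remove the "Laboratory_" string from the beginning of each feature name
### DOES NOT WORK ###
# DataFrame.replace() substitutes cell values, not column labels, so the feature names are unchanged
# (see the info() output below). The returned DataFrame is also not assigned back to NHdf.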
NHdf.replace(to_replace="Laboratory_", value="AA")
NHdf.info()

Unnamed: 0,Laboratory_WhiteBloodCellCount,Laboratory_Cotinine,Laboratory_Sodium,Laboratory_MeanCellVolume,Laboratory_CholesterolSI,Laboratory_HepB,Laboratory_Monocyte,Laboratory_UricAcid,Laboratory_BloodCadmium,Laboratory_SegmentedNeutrophils,...,Laboratory_HepBSurfaceAntibody,Laboratory_LymphocytePercent,Laboratory_Hemoglobin,Laboratory_CRP,Laboratory_AlkalinePhosphatase,Laboratory_MeanCellHemoglobinConcentration,Laboratory_Hematocrit,Laboratory_Triglyceride,Laboratory_HDLCholesterol,Laboratory_LDLCholesterol
0,7.60,0.035000,144.100,88.500,5.25000,2,0.5,6.100,1.7800,5.1,...,2,21.100,14.100,3.600,62.00,33.600,41.800,1.29800,1.3900,3.52000
1,5.90,1.810000,137.500,84.900,7.16000,2,0.4,6.800,3.5600,3.1,...,2,37.800,14.500,0.800,63.00,33.300,43.600,3.85000,1.0800,4.34000
2,4.90,0.090000,143.200,87.400,6.31000,2,0.4,4.300,5.3400,2.1,...,2,46.800,13.400,0.400,75.00,33.300,40.200,0.64400,2.7300,3.28000
3,4.60,374.580000,140.900,92.300,3.49000,2,0.5,6.000,7.1200,2.0,...,2,41.300,15.400,1.200,86.00,33.500,46.200,0.47400,1.3100,2.07000
4,10.20,0.340000,141.300,83.500,3.90000,2,0.9,5.700,3.5600,6.5,...,2,23.700,16.000,1.900,63.00,33.300,48.100,1.58100,0.9800,2.30000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47274,7.70,2.710000,138.000,99.200,4.91300,2,0.8,4.200,2.2465,3.9,...,2,37.600,14.300,3.805,64.00,35.500,40.100,1.19700,1.2700,3.00000
47275,6.10,0.036000,141.000,93.800,4.00800,2,0.6,6.100,1.7800,4.2,...,2,19.000,11.800,3.211,43.00,33.900,34.800,1.52400,1.3200,2.05399
47276,6.88,0.059119,138.987,88.952,4.99583,2,0.5,5.446,1.3846,4.1,...,2,30.613,14.101,1.233,63.67,34.008,41.402,1.37278,1.1356,3.17951
47277,5.10,0.011000,143.000,88.900,5.04300,2,0.3,5.300,1.4122,2.9,...,2,31.900,14.900,1.132,75.00,32.900,45.100,0.85800,1.4200,3.07700


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47279 entries, 0 to 47278
Data columns (total 58 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Laboratory_WhiteBloodCellCount              47279 non-null  float64
 1   Laboratory_Cotinine                         47279 non-null  float64
 2   Laboratory_Sodium                           47279 non-null  float64
 3   Laboratory_MeanCellVolume                   47279 non-null  float64
 4   Laboratory_CholesterolSI                    47279 non-null  float64
 5   Laboratory_HepB                             47279 non-null  int64  
 6   Laboratory_Monocyte                         47279 non-null  float64
 7   Laboratory_UricAcid                         47279 non-null  float64
 8   Laboratory_BloodCadmium                     47279 non-null  float64
 9   Laboratory_SegmentedNeutrophils             47279 non-null  float64
 10  Laboratory
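
Column labels can be changed through the `columns` index rather than through `DataFrame.replace()`. The following is a minimal sketch on a toy DataFrame; the name `toy` and its values are made up for illustration. Applying the same call to `NHdf` is an assumption about how the notebook could proceed, not something it currently does.

```python
import pandas as pd

# Toy DataFrame that mimics the NHANES column naming pattern
toy = pd.DataFrame({
    "Age": [34.0, 61.0],
    "Laboratory_Sodium": [144.1, 137.5],
    "Laboratory_UricAcid": [6.1, 6.8],
})

# Strip the prefix from the column labels; "^" anchors the match to the
# start of each label so only a leading "Laboratory_" is removed.
toy.columns = toy.columns.str.replace("^Laboratory_", "", regex=True)

print(list(toy.columns))  # ['Age', 'Sodium', 'UricAcid']
```

Columns without the prefix, such as `Age`, pass through unchanged.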

#### Define Target, Test-Train Split, Normalize 

In [13]:
# Split X, y
NHX = NHdf.drop('Age', axis=1)
NHy = NHdf['Age']

# Split test, train
NHX_train, NHX_test, NHy_train, NHy_test = train_test_split(NHX, NHy, test_size=0.2)

# Normalize the data
scaler = StandardScaler()
NHX_train_scaled = scaler.fit_transform(NHX_train)
NHX_test_scaled = scaler.transform(NHX_test)

NHX_train_scaled

array([[ 0.16325376, -0.45464436, -0.94819997, ..., -0.75754809,
         0.174565  ,  0.19926254],
       [-0.05749133, -0.45490268, -0.01590804, ..., -0.00558582,
        -0.54345364, -0.32946278],
       [ 0.07495573, -0.45473602,  2.1594398 , ..., -0.15725398,
        -1.21018523, -0.93461094],
       ...,
       [-1.07291876, -0.45510267, -0.06030289, ..., -0.70155844,
         0.6361484 ,  0.93647105],
       [-0.76387563, -0.45510267, -1.39214851, ..., -0.84543058,
         1.04644477, -0.31949214],
       [-0.27823642, -0.45498601, -0.94819997, ..., -0.68596638,
         1.63624579, -0.13936789]])

### Capstone data set is now defined by:  NHX_train_scaled, NHX_test_scaled, NHy_train, NHy_test

## STEP 2:  CAPSTONE (NHANES) DATASET - FEATURE SELECTION 

### Open question: should feature selection be performed before the X, y split and the test/train split?
#### Does the answer depend on which feature selection method is used?
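
One common answer to the question above is to split first and then fit the selector on the training portion only, so that the held-out test rows cannot influence which features are chosen. The sketch below shows that ordering on synthetic data from `make_regression`. The data, step names, and feature counts are all assumptions for illustration and are not the NHANES setup.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the NHANES data)
X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# 1. Split first, so the test rows stay unseen during selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 2. Scaling, selection, and the model are all fit on training data only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SequentialFeatureSelector(LinearRegression(),
                                         n_features_to_select=4)),
    ("linreg", LinearRegression()),
])
pipe.fit(X_train, y_train)

# 3. The fitted pipeline applies the same scaling and selection to test data
n_selected = int(pipe.named_steps["select"].get_support().sum())
test_r2 = pipe.score(X_test, y_test)
print(n_selected, round(test_r2, 3))
```

Wrapping the steps in a `Pipeline` keeps this ordering intact under cross-validation as well, whichever selector is plugged into the `select` step.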


In [14]:
# GridSearchCV Strategy
# MEDIUM -- Requires Membership -- https://medium.com/pythoneers/how-to-perform-feature-selection-with-gridsearchcv-in-sklearn-in-python-eb20a18f290b
# STACK OVERFLOW -- Public -- https://stackoverflow.com/questions/55609339/how-to-perform-feature-selection-with-gridsearchcv-in-sklearn-in-python
# KAGGLE -- Public -- https://www.kaggle.com/code/antoreepjana/sklearn-pipelines-gridsearch-feature-selection

### OPTION 1: SEQUENTIAL FEATURE SELECTION
#### From Codio 9.1, Problem 3

In [15]:
# Sequential Feature Selection -- FORWARD

# Use the SequentialFeatureSelector function, forward direction, with a LinearRegression estimator to select: 
    # 20 features (actually 19 features and 1 target, like UW did) 
    # from the NHdf dataset of 58 features (actually 57 features and 1 target).
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
selector = SequentialFeatureSelector(estimator=LinearRegression(), n_features_to_select=19)

# Use the fit_transform function on selector to fit it on the scaled training data and keep only the selected columns. Creates numpy.ndarray.
best_features = selector.fit_transform(NHX_train_scaled, NHy_train)

# Assign transformed features best_features to a DataFrame, with columns equal to selector.get_feature_names_out().
# Because the selector was fit on a NumPy array, the column names are positional placeholders (x0, x1, ...).
NH20df = pd.DataFrame(best_features, columns=selector.get_feature_names_out())
NH20df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37823 entries, 0 to 37822
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      37823 non-null  float64
 1   x3      37823 non-null  float64
 2   x4      37823 non-null  float64
 3   x8      37823 non-null  float64
 4   x10     37823 non-null  float64
 5   x12     37823 non-null  float64
 6   x13     37823 non-null  float64
 7   x18     37823 non-null  float64
 8   x19     37823 non-null  float64
 9   x21     37823 non-null  float64
 10  x23     37823 non-null  float64
 11  x24     37823 non-null  float64
 12  x29     37823 non-null  float64
 13  x35     37823 non-null  float64
 14  x36     37823 non-null  float64
 15  x41     37823 non-null  float64
 16  x46     37823 non-null  float64
 17  x47     37823 non-null  float64
 18  x56     37823 non-null  float64
dtypes: float64(19)
memory usage: 5.5 MB


In [16]:
# Sequential Feature Selection -- BACKWARD

# Use the SequentialFeatureSelector function, backward direction, with a LinearRegression estimator to select: 
    # 20 features (actually 19 features and 1 target, like UW did) 
    # from the NHdf dataset of 58 features (actually 57 features and 1 target).
selector_back = SequentialFeatureSelector(estimator=LinearRegression(), direction="backward", n_features_to_select=19)

# Use the fit_transform function on selector_back to fit it on the scaled training data. Creates numpy.ndarray.
best_features_back = selector_back.fit_transform(NHX_train_scaled, NHy_train)

# Assign transformed features best_features_back to a DataFrame, with columns equal to
# selector_back.get_feature_names_out(). Taking the names from the forward selector instead
# would label the backward results with the forward selector's features.
NH20df_back = pd.DataFrame(best_features_back, columns=selector_back.get_feature_names_out())
NH20df_back.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37823 entries, 0 to 37822
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      37823 non-null  float64
 1   x3      37823 non-null  float64
 2   x4      37823 non-null  float64
 3   x8      37823 non-null  float64
 4   x10     37823 non-null  float64
 5   x12     37823 non-null  float64
 6   x13     37823 non-null  float64
 7   x18     37823 non-null  float64
 8   x19     37823 non-null  float64
 9   x21     37823 non-null  float64
 10  x23     37823 non-null  float64
 11  x24     37823 non-null  float64
 12  x29     37823 non-null  float64
 13  x35     37823 non-null  float64
 14  x36     37823 non-null  float64
 15  x41     37823 non-null  float64
 16  x46     37823 non-null  float64
 17  x47     37823 non-null  float64
 18  x56     37823 non-null  float64
dtypes: float64(19)
memory usage: 5.5 MB


### OPTION 2: Grid Search

In [None]:
######### SAMPLE ONLY #############

# https://www.geeksforgeeks.org/performing-feature-selection-with-gridsearchcv-in-sklearn/

######### SAMPLE ONLY #############

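# Imports needed by this sample
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Define the classifier
clf = RandomForestClassifier(random_state=42, class_weight="balanced")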

# Define the feature selector
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')

# Create a pipeline
pipeline = Pipeline([
    ('feature_selection', rfecv),
    ('classification', clf)
])
# Define the parameter grid
param_grid = {
    'classification__n_estimators': [200, 500],
    'classification__max_features': ['sqrt', 'log2'],  # 'auto' omitted: scikit-learn 1.3 and later no longer accept it
    'classification__max_depth': [4, 5, 6, 7, 8],
    'classification__criterion': ['gini', 'entropy']
}
# Define the GridSearchCV
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=StratifiedKFold(10), scoring='roc_auc_ovr', n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Get the best parameters and score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)

######### SAMPLE ONLY #############
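
The sample above is a classification example, while this project predicts age, which is a regression target. The sketch below is a hypothetical regression analogue: it swaps in `RFE` with a `Ridge` estimator and lets `GridSearchCV` tune the number of selected features as a hyperparameter. The synthetic data, step names, and grid values are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the NHANES data)
X, y = make_regression(n_samples=300, n_features=12, n_informative=5,
                       noise=5.0, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("feature_selection", RFE(estimator=Ridge())),
    ("regression", Ridge()),
])

# The number of features to keep is searched alongside the Ridge penalty
param_grid = {
    "feature_selection__n_features_to_select": [3, 5, 8],
    "regression__alpha": [0.1, 1.0, 10.0],
}

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

print("Best parameters found: ", grid.best_params_)
print("Best cross-validation MSE: ", -grid.best_score_)
```

For the Capstone data, the grid for `n_features_to_select` could be fixed at a single value to match the twenty-feature target.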

## STEP 3: BUILD MODELS

### MODEL 1: Linear Regression
#### From Codio 9.1, Problem 4

In [None]:
# Create a Pipeline object with steps column_selector to select 19 features (the forward selector defined above), 
# and linreg to build a LinearRegression estimator.

from sklearn.pipeline import Pipeline
LRpipe = Pipeline([
    ("column_selector", selector),
    ("linreg", LinearRegression())
])

# Use the fit function to train the pipeline on the scaled training data.
# NHX_train_scaled, NHX_test_scaled, NHy_train, NHy_test
LRpipe.fit(NHX_train_scaled, NHy_train)

# Use the predict function to calculate the predictions on the scaled training data.
# The pipeline was fit on scaled data, so predictions must use scaled data as well.
LRtrain_preds = LRpipe.predict(NHX_train_scaled)
# Use the predict function to calculate the predictions on the scaled test data.
LRtest_preds = LRpipe.predict(NHX_test_scaled)

# Use the mean_squared_error function to calculate the MSE between NHy_train and LRtrain_preds. 
from sklearn.metrics import mean_squared_error
LRtrain_mse = mean_squared_error(NHy_train, LRtrain_preds)
# Use the mean_squared_error function to calculate the MSE between NHy_test and LRtest_preds. 
LRtest_mse = mean_squared_error(NHy_test, LRtest_preds)


# Answer check
print(f'Train MSE: {LRtrain_mse: .2f}')
print(f'Test MSE: {LRtest_mse: .2f}')
LRpipe


### Open question: should GridSearchCV be used to tune hyperparameters on every model?

### MODEL 2: Ridge Regression

#### From Codio 9.3, Problem 3

In [None]:
from sklearn.linear_model import Ridge

# EXPLORING DIFFERENT ALPHA VALUES
alphas = [0.001, 1.0, 10.0, 100.0]

coef_list = []
# Define a for loop to iterate over the list alphas to create and train different Ridge models.
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(NHX_train_scaled, NHy_train)
    # Append the coefficients of the Ridge model as a list to coef_list.
    coef_list.append(list(ridge.coef_))

len(coef_list)
print('For alpha = 100 we have the following coefficients:')
# NHX_train_scaled is a NumPy array, so the feature names are taken from the NHX DataFrame.
list(zip(NHX.columns, coef_list[-1]))
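
Rather than only inspecting coefficients for each alpha, the same list of alphas could be searched with cross-validation. This is a minimal sketch on synthetic data; the data and variable names are assumptions, and the alpha grid is copied from the cell above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the NHANES data)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Same alpha values as explored above
param_grid = {"alpha": [0.001, 1.0, 10.0, 100.0]}

search = GridSearchCV(Ridge(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]
# The scorer returns negative MSE, so negate it to report MSE
test_mse = -search.score(X_test, y_test)

print(f"Best alpha: {best_alpha}")
print(f"Test MSE: {test_mse: .2f}")
```

`GridSearchCV` refits the best model on the full training split by default, so `search` can be used directly for prediction afterwards.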

### MODEL 3: Lasso Regression

### MODEL 4: Decision Tree

### MODEL 5: Random Forest

### MODEL 6: Neural Network
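
Models 3 through 6 are not yet implemented in this notebook. As a placeholder, the sketch below shows one way they could be trained and compared on test MSE in a single loop. The synthetic data and every hyperparameter value are arbitrary assumptions for illustration, not tuned choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the NHANES data)
X, y = make_regression(n_samples=400, n_features=10, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Hyperparameters are placeholders, not tuned values
models = {
    "Lasso": Lasso(alpha=0.1),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Neural Network": MLPRegressor(hidden_layer_sizes=(32,), max_iter=500,
                                   random_state=0),
}

test_mse = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test_s))
    print(f"{name}: Test MSE = {test_mse[name]: .2f}")
```

Each entry in `models` could later be wrapped in `GridSearchCV`, as planned in the methodology.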

### Handling Data Sets That Are Too Large

**NOTE:** File Size Problem

I tried to upload the following two data sets to GitHub:

    Data_UW_20 (20 features, 47279 observations)
    Data_NH_200 (223 features, 47279 observations)

but both data files were too large for GitHub to accept.

Therefore, I will have to perform feature selection, and probably some reduction in the number of observations 
in each dataset, locally on my computer. Then, I will upload to GitHub the reduced size data sets:

    Data_UW_20_R --- with reduced number of observations
    Data_NH_20_R --- with reduced number of observations, and features reduced from 223 down to 20

When I upload my Jupyter Notebook to GitHub, I will leave the feature selection code and observation reduction code in the Jupyter Notebook, but will have it commented out so that the Jupyter Notebook code can be run. The active Jupyter Notebook code will begin with Baseline Modeling, and running the initial Regression models on the two sets of data. 
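
Reducing the number of observations can be done with a reproducible random sample. The sketch below uses a synthetic DataFrame in place of the real data; the 25% sampling fraction is an arbitrary assumption, and the output file name in the comment follows the naming described above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a large dataset
rng = np.random.default_rng(0)
full = pd.DataFrame(rng.normal(size=(1000, 5)),
                    columns=[f"f{i}" for i in range(5)])

# Keep a reproducible random 25% of the rows (fraction chosen arbitrarily)
reduced = full.sample(frac=0.25, random_state=0).reset_index(drop=True)

# Compare serialized sizes without writing anything to disk
full_bytes = len(full.to_csv(index=False).encode("utf-8"))
reduced_bytes = len(reduced.to_csv(index=False).encode("utf-8"))
print(len(full), len(reduced), full_bytes, reduced_bytes)

# To save the reduced file for upload, something like:
# reduced.to_csv("Data_NH_20_R.csv", index=False)
```

Fixing `random_state` means the same subset is drawn each time, so results from the reduced file stay reproducible.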