<h3>Unsupervised Learning: Dimensionality Reduction, FAMD</h3>
The purpose of this notebook is to read in the CDC survey data and try different dimensionality reduction methods on it to see if we can reduce the complexity of the dataset. The dataset has ~300 columns, which we first reduce to 30 manually chosen columns. Although 30 columns are less complex than 300, we want to reduce the complexity even further.

In [None]:
#Installations needed, please uncomment and run if needed
#zipfile and types ship with Python; plotly.express, plotly.graph_objects and matplotlib.pyplot
#are installed as part of plotly and matplotlib
'''! pip install pandas
! pip install numpy
! pip install plotly==5.5
! pip install matplotlib
! pip install scikit-learn
! pip install imbalanced-learn
! pip install tqdm
! pip install prince'''

In [None]:
import pandas as pd
import numpy as np
import zipfile 
from prince import FAMD, MCA
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split 
from imblearn.over_sampling import SMOTENC
from sklearn.utils import resample
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score

In [None]:
%%capture
# tqdm_notebook is deprecated; the notebook progress bar now lives in tqdm.notebook
from tqdm.notebook import tqdm
tqdm.pandas()

In [None]:
#Set random_state for reproducible results
RANDOM_SEED = 42

In [None]:
# Read in DataFrame
zf = zipfile.ZipFile('ny.csv.zip') 
zf.namelist() 
df = pd.read_csv(zf.open('ny.csv'),  encoding = 'cp1252')

The dataset is fairly clean. However, in some columns the majority of the values are missing, so we start by dropping those columns.
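As a minimal sketch of the idea, the toy example below computes the fraction of missing values per column and keeps only the columns under a cutoff. The column names and the 0.5 cutoff are assumptions made for illustration; they are not taken from the survey data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (hypothetical column names)
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, "Yes"],
    "half_missing": ["No", np.nan, "Yes", np.nan],
    "complete": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values in each column
missing_fraction = toy.isna().mean()

# Keep only the columns whose missing fraction does not exceed the cutoff
max_missing = 0.5  # assumed cutoff for this sketch
kept = toy.loc[:, missing_fraction <= max_missing]

print(missing_fraction.to_dict())
print(list(kept.columns))
```

Inspecting `missing_fraction` before dropping makes it easier to see how sensitive the remaining column set is to the chosen cutoff.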

In [None]:
# Use df_clean for cleaning
df_clean = df.copy()

# Replace 'Not asked or Missing' and 'Data do not meet the criteria for statistical reliability, 
# data quality or confidentiality (data are suppressed)' with NA
missing_labels = ['Not asked or Missing',
                  'Data do not meet the criteria for statistical reliability, data quality or confidentiality (data are suppressed)']
df_clean = df_clean.replace(missing_labels, np.nan)
    
# Drop columns with more than 50% missing values 
# (thresh is the minimum number of non-missing values a column needs to be kept)
df_clean.dropna(axis = 1, thresh = len(df_clean) * .50, inplace = True)

First we want to try manually selecting columns from the dataset based on literature that discusses which lifestyle questions are generally attributed to an increased risk of heart attacks. The sources used are listed in the next code block.

In [None]:
# Select relevant columns related to heart disease by utilizing resources detailing factors of heart disease
# shorturl.at/oqwF5 - Behavioral risk factors of coronary artery disease: A paired matched case control study
# shorturl.at/cpAXZ - Strategies to prevent heart disease
# shorturl.at/gpwAR - Top five habits that harm the heart
# shorturl.at/mtJUZ - 9 Common Habits That Are Bad for Your Heart

# Characteristics
# 1. SEXVAR - Sex - (Male or Female)
# 2. _IMPAGE - Age - (Age 65 or older, Age 55 - 64, Age 45 - 54, Age 35 - 44, Age 25 - 34, Age 18 - 24)
# 3. _IMPRACE - Race - (White, Non-Hispanic, Hispanic, Black, Non-Hispanic, Other race, Non-Hispanic, Asian, Non-Hispanic,
# American Indian/Alaskan Native, Non-Hispanic)
# 4. VETERAN3 - Former veteran status - (Yes, No, Refused, Don't know/Not sure)
# 5. WTKG3 - Weight in KG - (Continuous value)
# 6. _IMPMRTL - Marital status - (Married, Never Married, Divorced, Widowed, A member of an unmarried couple, 
# Separated)
# 7. _RFBMI5 - Overweight or Obese - (Yes, No, Don’t know/Refused/Missing)


# Health 
# 8. HLTHPLN1 - Has Healthcare Coverage - (Yes, No, Don't know/Not sure, Refused)
# 9. ADDEPEV3 - Diagnosed with depression - (Yes, No, Don't know/Not sure, Refused)
# 10. DIABETE4 - Diagnosed with diabetes - (Yes, Yes, but female told only during pregnancy, 
# No, pre-diabetes or borderline diabetes, No, Don't know/Not sure, Refused)
# 11. RMVTETH4 - Number of teeth removed - (All, 6 or more, but not all, 1 to 5, None, Don't know/Not sure, Refused)
# 12. _PHYS14D - Number of days physical health not well - (Zero days when physical health not good,     
# 1-13 days when physical health not good, 14+ days when physical health not good, Don’t know/Refused/Missing)                 
# 13. _MENT14D - Number of days mental health not well - (Zero days when mental health not good,
# 1-13 days when mental health not good, 14+ days when mental health not good, Don’t know/Refused/Missing)
# 14. _TOTINDA - Physical activity - (Had physical activity or exercise, No physical activity or exercise in last 30 days,     
# Don’t know/Refused/Missing)       
# 15. PDIABTST - User has gotten a test for high blood sugar in past 3 years - (Yes, No, Don't know/Not sure, Refused)
# 16. PREDIAB1 - Diagnosed as prediabetic - (Yes, Yes, during pregnancy, Don't know/Not Sure, Refused, No)
# 17. _RFHLTH - General health - (Good or Better Health, Fair or Poor Health, Don’t know/Not Sure Or Refused/Missing)
# 18. BPHIGH4 - Told they have high blood pressure - (Yes, Told borderline high or pre-hypertensive, 
# Yes, but female told only during pregnancy, Don't Know/Not Sure, Refused, No) 

# Lifestyle
# 19. CHECKUP1 - Length since last checkup - (Within past year (anytime less than 12 months ago), 
# Within past 2 years (1 year but less than 2 years ago), Within past 5 years (2 years but less than 5 years ago), 
# 5 or more years ago, Don’t know/Not sure, Never, Refused)
# 20. LASTDEN4 - Last visited dentist - (Within past year (anytime less than 12 months ago), 
# Within past 2 years (1 year but less than 2 years ago), Within past 5 years (2 years but less than 5 years ago), 
# 5 or more years ago, Don’t know/Not sure, Never, Refused)
# 21. FLUSHOT7 - Whether someone has taken the flu shot - (Yes, No, Don't know/Not sure, Refused)
# 22. _RFSEAT3 - Seatbelt wearing status - (Always Wear Seat Belt, Don’t Always Wear Seat Belt,
# Don’t know/Not Sure Or Refused/Missing)

# Socioeconomic status
# 23. _IMPEDUC - Education - (College 4 years or more (College graduate), 
# College 1 year to 3 years (Some college or technical school), Grade 12 or GED (High school graduate), 
# Grades 9 through 11 (Some high school), Grades 1 through 8 (Elementary), Never attended school or only kindergarten)
# 24. EMPLOY1 - 
# 25. _INCOMG - Income level - ($50,000 or more, Don’t know/Not sure/Missing, $15,000 to less than $25,000,   
# $35,000 to less than $50,000, $25,000 to less than $35,000, Less than $15,000)
# 26. _METSTAT - Whether they live in a metropolitan - (1, 2)

# Tobacco, Alcohol
# 27. USENOW3 - Use of smokeless tobacco - (Not at all, Some days, Every day, Refused, Don’t know/Not Sure) 
# 28. ECIGARET - E-cigarette usage - (Yes, No, Don't know/Not sure, Refused)
# 29. _SMOKER3 - Smoking status - (Current smoker - now smokes every day, Current smoker - now smokes some days,
# Former smoker, Never smoked, Don’t know/Refused/Missing)
# 30. _RFBING5 - Binge drinking status - (Yes, No, Don’t know/Refused/Missing)                

# Columns to keep - Response variable
# 31. CVDINFR4 - Ever diagnosed with heart attack - (Yes, No, Don't know/Not sure, Refused)
# 32. CVDCRHD4 - Ever diagnosed with angina/ coronary heart disease - (Yes, No, Don't know/Not sure, Refused)

# For now we will predict heart disease
df_clean_columns = df_clean[['SEXVAR', '_IMPAGE', '_IMPRACE', 'VETERAN3', 'WTKG3', '_IMPMRTL', '_RFBMI5', 
                             'HLTHPLN1', 'ADDEPEV3', 'DIABETE4', 'RMVTETH4', '_PHYS14D', '_MENT14D', '_TOTINDA',
                             'PDIABTST', 'PREDIAB1', '_RFHLTH', 'BPHIGH4', 'CHECKUP1', 'LASTDEN4', 'FLUSHOT7', 
                             '_RFSEAT3', '_IMPEDUC', 'EMPLOY1', '_INCOMG', '_METSTAT', 'USENOW3', 'ECIGARET',
                             '_SMOKER3', '_RFBING5', 'CVDCRHD4']]

In [None]:
# Drop all missing values
df_cleaned = df_clean_columns.dropna(axis = 0).reset_index(drop = True)

# Drop all rows that are Don't know/Not sure or Refused for column we are predicting
df_cleaned = df_cleaned.loc[(df_cleaned['CVDCRHD4'] == 'No') | (df_cleaned['CVDCRHD4'] == 'Yes')]

# Split into X and y
X = df_cleaned.loc[:, df_cleaned.columns != 'CVDCRHD4']
y = df_cleaned['CVDCRHD4']

# Split the data into training and test data set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3,random_state = 42)

Handling Imbalanced Data <br />
Our data is very imbalanced. Only about 4% of the respondents said that they have experienced a heart attack before. Because of this, we need to balance our data. The two methods we will try are SMOTE and upsampling. We will try both and see which produces better scores when our models are built later.
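To make the upsampling idea concrete, here is a small sketch on a toy frame (the `feature` and `label` columns are made up for illustration). It resamples the minority class with replacement until both classes have the same number of rows, which is the effect we want on the training split.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: 8 majority rows ("No") and 2 minority rows ("Yes")
toy = pd.DataFrame({
    "feature": range(10),
    "label": ["No"] * 8 + ["Yes"] * 2,
})
majority = toy[toy["label"] == "No"]
minority = toy[toy["label"] == "Yes"]

# Draw minority rows with replacement until the class sizes match
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=42)

balanced = pd.concat([majority, minority_upsampled])
counts = balanced["label"].value_counts().to_dict()
print(counts)
```

Upsampling only duplicates existing minority rows, whereas SMOTE synthesizes new ones, so the two methods can lead to different model behavior even when the class counts end up identical.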

In [None]:
#SMOTE using imblearn library

os = SMOTENC(categorical_features = [0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 27, 28, 29], random_state = 0)
os_data_X , os_data_y = os.fit_resample(X_train, y_train)

#Upsampling using resample
#create two different dataframe of majority and minority class 
# Copy X_train so that adding the target column below cannot modify X_train itself
training_data = X_train.copy()
training_data['CVDCRHD4'] = y_train
df_majority = training_data[(training_data['CVDCRHD4']=='No')] 
df_minority = training_data[(training_data['CVDCRHD4']=='Yes')] 
# upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,    # sample with replacement
                                 n_samples= len(df_majority), # to match majority class
                                 random_state=42)  # reproducible results
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_minority_upsampled, df_majority])
X_train_upsampled = df_upsampled.drop(columns = 'CVDCRHD4')
y_train_upsampled = df_upsampled['CVDCRHD4']

In [None]:
# Find optimal number of components for FAMD
rows = []

for num_components in range(1, 151):
    
    # Fit FAMD with the current number of components
    famd = FAMD(n_components = num_components, n_iter = 3, random_state = 42)
    famd.fit(X_train)
    
    # Total explained variance across the fitted components
    explained_variance = famd.explained_inertia_.sum()
    
    # Collect the result (DataFrame.append was removed in pandas 2.0)
    rows.append({'num_components': num_components, 'explained_variance' : explained_variance})
    
optimal_components = pd.DataFrame(rows, columns = ['num_components', 'explained_variance'])

fig = px.scatter(optimal_components, x = 'num_components', y = 'explained_variance')
fig.show()

# Show the row(s) with the maximum explained variance
optimal_components.loc[optimal_components['explained_variance'] == optimal_components['explained_variance'].max()]

In [None]:
# Initialize FAMD
famd = FAMD(n_components = 120, n_iter = 3, random_state = 42)
famd.fit_transform(X_train)

famd_explained_variance = famd.explained_inertia_
df_famd_explained = pd.DataFrame(famd_explained_variance)
df_famd_explained['component'] = list(range(1, 121))
df_famd_explained.columns = ['explained_variance', 'component']

In [None]:
# Explained variance for each component

fig = px.bar(df_famd_explained, x = 'component', y = 'explained_variance')
fig

In [None]:
# Correlations between the input columns and each FAMD component (computed once)
famd_correlations = famd.column_correlations(X_train)

def top_correlations(correlations, component, n = 10):
    # Return the n correlations with the largest absolute value for one component
    top_labels = correlations[component].abs().sort_values(ascending=False).index[:n]
    return correlations[component].loc[top_labels]

famd_correlations_comp_0_top_10 = top_correlations(famd_correlations, 0)
famd_correlations_top_0_labels_10 = list(famd_correlations_comp_0_top_10.index.values)
famd_correlations_top_0_values_10 = list(famd_correlations_comp_0_top_10.values)

famd_correlations_comp_1_top_10 = top_correlations(famd_correlations, 1)
famd_correlations_top_1_labels_10 = list(famd_correlations_comp_1_top_10.index.values)
famd_correlations_top_1_values_10 = list(famd_correlations_comp_1_top_10.values)
famd_correlations_comp_1_top_10

In [None]:
labels_0 = ['COMPUTED WEIGHT IN KILOGRAMS', 'OVERWEIGHT OR OBESE CALCULATED VARIABLE - YES', 'OVERWEIGHT OR OBESE CALCULATED VARIABLE - NO', 'SEX OF RESPONDENT - MALE', 'SEX OF RESPONDENT - FEMALE', 'TOLD YOU HAVE HIGH BP - YES', 'TOLD YOU HAVE HIGH BP - NO', 'TOLD YOU HAVE PREDIABETES - YES', 'TOLD YOU HAVE PREDIABETES - NO', 'HAD TEST FOR HIGH BLOOD SUGAR IN LAST 3 YEARS - YES']
fig = go.Figure(data=go.Heatmap(
        z=[famd_correlations_top_0_values_10],
        x=labels_0,
        y=['Component 0'],
        colorscale='Viridis'))
fig.show()

In [None]:
labels_1 = ['COMPUTED WEIGHT IN KILOGRAMS', 'OVERWEIGHT OR OBESE CALCULATED VARIABLE - YES', 'OVERWEIGHT OR OBESE CALCULATED VARIABLE - NO', 'SEX OF RESPONDENT - MALE', 'SEX OF RESPONDENT - FEMALE', 'TOLD YOU HAVE HIGH BP - YES', 'TOLD YOU HAVE HIGH BP - NO', 'TOLD YOU HAVE PREDIABETES - YES', 'TOLD YOU HAVE PREDIABETES - NO', 'HAD TEST FOR HIGH BLOOD SUGAR IN LAST 3 YEARS - YES']
fig = go.Figure(data=go.Heatmap(
        z=[famd_correlations_top_1_values_10],
        x=labels_1,
        y=['Component 1'],
        colorscale='Viridis'))
fig.show()

In [None]:
#Transform SMOTE and upsampled training data
#Reuse the FAMD fitted on X_train above so that every dataset is projected onto the same components
X_train_os = famd.transform(os_data_X)
X_train_upsampled_transformed = famd.transform(X_train_upsampled)
X_train_imabalanced = famd.transform(X_train)

Now that we have our training data transformed using FAMD, let's see what kind of supervised learning scores we get when trying to classify people as at risk of heart attack or not at risk of heart attack. We use both the data balanced with SMOTE and the data balanced with upsampling to see which performs better.
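When reduced features feed a classifier, the training and test rows generally need to be projected onto the same components. The sketch below illustrates that pattern with scikit-learn's `PCA` as a stand-in for FAMD (an assumption made so the example runs on purely numeric toy data): the reducer is fitted on the training rows only and then reused for the test rows.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic numeric data standing in for the survey features
rng = np.random.default_rng(0)
X_train_toy = rng.normal(size=(100, 5))
X_test_toy = rng.normal(size=(20, 5))

reducer = PCA(n_components=2, random_state=0)

# Learn the components from the training data only
train_reduced = reducer.fit_transform(X_train_toy)

# Reuse the same components for the test data (no refitting)
test_reduced = reducer.transform(X_test_toy)

print(train_reduced.shape, test_reduced.shape)
```

Refitting the reducer on the test set would produce a different set of axes, so a model trained on the training projection would be scored on coordinates it has never seen.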

In [None]:
#Helper functions to get model scores
def get_performance_scores(y_pred, y_true):
    f1 = f1_score(y_true, y_pred, average='macro')
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='macro')
    recall = recall_score(y_true, y_pred, average='macro')
    return [f1, accuracy, precision, recall]

def print_performance_scores(scores):
    print("Accuracy Score = " + str(scores[1]))
    print("Precision Score = " + str(scores[2]))
    print("Recall Score = " + str(scores[3]))
    print("F1 Score = " + str(scores[0]))

In [None]:
# Project the test data with the already fitted FAMD instead of refitting it on the test set
X_test_transformed = famd.transform(X_test)

#Logistic Regression Using Original Dataset
clf_lr = LogisticRegression(random_state = RANDOM_SEED, penalty='l2', class_weight = {'No': 0.75, 'Yes': 0.25},  C=0.25).fit(X_train_imabalanced, y_train)
train_preds = (clf_lr.predict_proba(X_test_transformed)[:,1] >= 0.8).astype(int)
train_preds = pd.DataFrame(train_preds, columns = ['val'])
train_preds = train_preds['val'].replace(to_replace = [0, 1], value = ['No', 'Yes'])

original_lr_scores = get_performance_scores(train_preds, y_test)

#Logistic Regression Using SMOTE
clf_lr = LogisticRegression(random_state = RANDOM_SEED, penalty='l2', class_weight = {'No': 0.75, 'Yes': 0.25},  C=0.25).fit(X_train_os, os_data_y.values.ravel())
train_preds = (clf_lr.predict_proba(X_test_transformed)[:,1] >= 0.8).astype(int)
train_preds = pd.DataFrame(train_preds, columns = ['val'])
train_preds = train_preds['val'].replace(to_replace = [0, 1], value = ['No', 'Yes'])

smote_lr_scores = get_performance_scores(train_preds, y_test)

#Logistic Regression using upsampled data
clf_lr = LogisticRegression(random_state = RANDOM_SEED, penalty='l2', class_weight = {'No': 0.75, 'Yes': 0.25},  C=0.25).fit(X_train_upsampled_transformed, y_train_upsampled)
train_preds = (clf_lr.predict_proba(X_test_transformed)[:,1] >= 0.8).astype(int)
train_preds = pd.DataFrame(train_preds, columns = ['val'])
train_preds = train_preds['val'].replace(to_replace = [0, 1], value = ['No', 'Yes'])

upsampled_lr_scores = get_performance_scores(train_preds, y_test)
original_lr_scores.insert(0, 'Imbalanced')
smote_lr_scores.insert(0, 'Balanced Using SMOTE')
upsampled_lr_scores.insert(0, 'Upsampled using resample')

In [None]:
#Random Forest Using Original Data
random_forest = RandomForestClassifier(n_estimators= 200, min_samples_split= 2, min_samples_leaf = 1, max_depth = None, bootstrap = False, random_state = RANDOM_SEED )
random_forest.fit(X_train_imabalanced, y_train)

y_pred = random_forest.predict(X_test_transformed)
    
original_rf_scores = get_performance_scores(y_pred, y_test)

#Random Forest Using SMOTE Data
random_forest = RandomForestClassifier(n_estimators= 200, min_samples_split= 2, min_samples_leaf = 1, max_depth = None, bootstrap = False, random_state = RANDOM_SEED )
random_forest.fit(X_train_os, os_data_y.values.ravel())

y_pred = random_forest.predict(X_test_transformed)
    
smote_rf_scores = get_performance_scores(y_pred, y_test)

#Random Forest Using Upsampled Data
random_forest = RandomForestClassifier(n_estimators= 200, min_samples_split= 2, min_samples_leaf = 1, max_depth = None, bootstrap = False, random_state = RANDOM_SEED )
random_forest.fit(X_train_upsampled_transformed, y_train_upsampled)

y_pred = random_forest.predict(X_test_transformed)
    
upsampled_rf_scores = get_performance_scores(y_pred, y_test)
original_rf_scores.insert(0, 'Imbalanced')
smote_rf_scores.insert(0, 'Balanced Using SMOTE')
upsampled_rf_scores.insert(0, 'Upsampled using resample')

In [None]:
#SVC Using Original dataset
svc_model = SVC(C=100, gamma=0.1, kernel='poly', random_state=RANDOM_SEED).fit(X_train_imabalanced, y_train)
y_pred = svc_model.predict(X_test_transformed)
original_svc_scores = get_performance_scores(y_pred, y_test)

#SVC Using SMOTE dataset
svc_model = SVC(C=100, gamma=0.1, kernel='poly', random_state=RANDOM_SEED).fit(X_train_os, os_data_y.values.ravel())
y_pred = svc_model.predict(X_test_transformed)
smote_svc_scores = get_performance_scores(y_pred, y_test)

#SVC Using Upsampled dataset
svc_model = SVC(C=100, gamma=0.1, kernel='poly', random_state=RANDOM_SEED).fit(X_train_upsampled_transformed, y_train_upsampled)
y_pred = svc_model.predict(X_test_transformed)
upsampled_svc_scores = get_performance_scores(y_pred, y_test)

original_svc_scores.insert(0, 'Imbalanced')
smote_svc_scores.insert(0, 'Balanced Using SMOTE')
upsampled_svc_scores.insert(0, 'Upsampled using resample')

In [None]:
scores_columns = ["Balancing Type", "F1 Score", "Accuracy Score", "Precision Score", "Recall Score"]
scores_rows = ["Logistic Regression", "Logistic Regression", "Logistic Regression", "Random Forest", "Random Forest", "Random Forest", "SVC", "SVC", "SVC"]
scores = [original_lr_scores, smote_lr_scores, upsampled_lr_scores, original_rf_scores, smote_rf_scores, upsampled_rf_scores, original_svc_scores, smote_svc_scores, upsampled_svc_scores]

scores_df = pd.DataFrame(data=scores, index=scores_rows, columns=scores_columns)
scores_df

In [None]:
fig = px.bar(scores_df, x=scores_df.index, y="F1 Score",
             color='Balancing Type', barmode='group', title="F1 Score By Model And Balance Type", text_auto='.3', 
             labels={"index": "Supervised Learning Model", "Balancing Type": "Balancing Method"},
             height=400)
fig.update_layout(title_text='F1 Score By Model And Balancing Method', title_x=0.5)
fig.update_yaxes(range=[0, 1])
fig.show()

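Since the logistic regression model performed the best, let's look at the feature importance for that model. Because the model was trained on FAMD components, the coefficients rank components rather than the original survey columns.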
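As a rough illustration of ranking by coefficient magnitude, the sketch below fits a logistic regression on synthetic data in which one column dominates the label, then sorts the absolute coefficients. The data, the labeling rule, and the `component_i` names are invented for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic features; the label depends strongly on column 2 and weakly on column 0
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (3.0 * X_toy[:, 2] + 0.5 * X_toy[:, 0] > 0).astype(int)

clf = LogisticRegression(random_state=0).fit(X_toy, y_toy)

# Pair each (hypothetical) component name with its absolute coefficient and sort
names = [f"component_{i}" for i in range(X_toy.shape[1])]
ranked = sorted(zip(names, np.abs(clf.coef_[0])), key=lambda kv: kv[1], reverse=True)

for name, weight in ranked:
    print(name, round(float(weight), 3))
```

Comparing coefficient magnitudes is only meaningful when the inputs are on comparable scales, which is worth keeping in mind when reading the chart.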

In [None]:
#Get feature importance
from operator import itemgetter
column_arr = X_train_upsampled_transformed.columns
features_coeff = dict(zip(column_arr, clf_lr.coef_[0]))
features_coeff_abs = {str(key) : abs(val) for key, val in features_coeff.items()}
features_coeff_abs_top_10 = dict(sorted(features_coeff_abs.items(), key = itemgetter(1), reverse = True)[:10])
print(features_coeff_abs_top_10)
features_dict = {'Components': list(features_coeff_abs_top_10.keys()), 'coeffs': list(features_coeff_abs_top_10.values())}
fig = px.bar(features_dict, x='Components', y='coeffs')
fig.show()

Overall, the performance is better when we upsample the data using the resample function than when we use SMOTE. We can try upsampling with a different selection of initial columns for FAMD to see if the results change. One way to select columns differently is to calculate which columns have the highest correlation with the target variable.
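The toy example below sketches this selection step: categorical columns are converted to integer category codes and then ranked by absolute correlation with the target. The column names and values are made up. Note that category codes impose an arbitrary ordering on the levels, so the ranking is a heuristic rather than a rigorous measure of association.

```python
import pandas as pd

# Toy categorical data (hypothetical columns)
toy = pd.DataFrame({
    "smoker": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "region": ["A", "B", "B", "A", "A", "B"],
    "target": ["Yes", "No", "Yes", "No", "Yes", "No"],
})

# Encode every column as integer category codes
encoded = toy.apply(lambda col: col.astype("category").cat.codes)

# Absolute correlation of each feature with the target, highest first
corr = (encoded.drop(columns="target")
               .corrwith(encoded["target"])
               .abs()
               .sort_values(ascending=False))
print(corr)
```

Columns that are near-duplicates of the target will sit at the top of such a ranking and usually need to be removed by hand to avoid leakage.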

In [None]:
# Encode categorical variables as numeric to calculate correlations
df_clean_categorical = df_clean.copy()
cols = list(df_clean_categorical.columns)
for col in cols:
    if str(df_clean_categorical[col].dtype) == 'object':
        df_clean_categorical[col] = df_clean_categorical[col].astype('category').cat.codes

# Preview the encoded frame
df_clean_categorical.head()

In [None]:
#Correlate every column with the target to find which features to use for FAMD
df_clean_corr = df_clean_categorical.corrwith(df_clean_categorical["CVDCRHD4"])
df_clean_corr_abs = df_clean_corr.abs()
df_clean_corr_abs.sort_values(inplace=True, ascending=False)
df_clean_corr_abs

In [None]:
feature_list = list(df_clean_corr_abs[0:100].keys())
feature_list.remove('CVDINFR4')
feature_list.remove('_MICHD')
feature_list

df_clean_columns = df_clean[feature_list]

In [None]:
from sklearn.model_selection import train_test_split 

# Drop all missing values
df_cleaned = df_clean_columns.dropna(axis = 0).reset_index(drop = True)

# Drop all rows that are Don't know/Not sure or Refused for column we are predicting
df_cleaned = df_cleaned.loc[(df_cleaned['CVDCRHD4'] == 'No') | (df_cleaned['CVDCRHD4'] == 'Yes')]

# Split into X and y
X = df_cleaned.loc[:, df_cleaned.columns != 'CVDCRHD4']
y = df_cleaned['CVDCRHD4']

# Split the data into training and test data set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3,random_state = 42)

In [None]:
# DummyClassifier ignores the input features, so the untransformed data can be passed directly
dummy_clf = DummyClassifier(strategy = 'uniform', random_state = RANDOM_SEED).fit(X_train, y_train)
y_pred = dummy_clf.predict(X_test)
print_performance_scores(get_performance_scores(y_pred, y_test))

In [None]:
famd = FAMD(n_components = 120, n_iter = 3, random_state = 42)
X_train_transformed = famd.fit_transform(X_train)
# Project the test data onto the components learned from the training data
X_test_transformed = famd.transform(X_test)
X_train_transformed.head()

In [None]:
# Copy X_train so that adding the target column below cannot modify X_train itself
training_data = X_train.copy()
training_data['CVDCRHD4'] = y_train
df_majority = training_data[(training_data['CVDCRHD4']=='No')] 
df_minority = training_data[(training_data['CVDCRHD4']=='Yes')] 
# upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,    # sample with replacement
                                 n_samples= len(df_majority), # to match majority class
                                 random_state=42)  # reproducible results
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_minority_upsampled, df_majority])
X_train_upsampled = df_upsampled.drop(columns = 'CVDCRHD4')
y_train_upsampled = df_upsampled['CVDCRHD4']

In [None]:
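# Fit FAMD on the upsampled training data and project the test data onto the same components
X_train_upsampled_transformed = famd.fit_transform(X_train_upsampled)
X_test_transformed = famd.transform(X_test)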
clf_lr = LogisticRegression(random_state = RANDOM_SEED, penalty='l2', class_weight = {'No': 0.75, 'Yes': 0.25},  C=0.25).fit(X_train_upsampled_transformed, y_train_upsampled)
train_preds = (clf_lr.predict_proba(X_test_transformed)[:,1] >= 0.8).astype(int)
train_preds = pd.DataFrame(train_preds, columns = ['val'])
train_preds = train_preds['val'].replace(to_replace = [0, 1], value = ['No', 'Yes'])

upsampled_lr_scores_correlated = get_performance_scores(train_preds, y_test)
print_performance_scores(upsampled_lr_scores_correlated)

In [None]:
#Get feature importance
from operator import itemgetter
column_arr = X_train_upsampled_transformed.columns
features_coeff = dict(zip(column_arr, clf_lr.coef_[0]))
features_coeff_abs = {str(key) : abs(val) for key, val in features_coeff.items()}
features_coeff_abs_top_10 = dict(sorted(features_coeff_abs.items(), key = itemgetter(1), reverse = True)[:10])
print(features_coeff_abs_top_10)
features_dict = {'Components': list(features_coeff_abs_top_10.keys()), 'coeffs': list(features_coeff_abs_top_10.values())}
fig = px.bar(features_dict, x='Components', y='coeffs', text_auto='.3')
fig.update_layout(title_text='Feature Importance: Top 10 Feature Coefficients', title_x=0.5)
fig.show()

In [None]:
famd_correlations_comp_7 = famd.column_correlations(X_train)[7]
famd_correlations_comp_7_abs = famd_correlations_comp_7.abs()
famd_correlations_comp_7_top_20_labels = list(famd_correlations_comp_7_abs.sort_values(ascending=False)[0:10].index.values)
famd_correlations_comp_7_top_10 = famd_correlations_comp_7.loc[famd_correlations_comp_7_top_20_labels]
famd_correlations_top_10_labels_7 = list(famd_correlations_comp_7_top_10.index.values)
famd_correlations_top_10_values_7 = list(famd_correlations_comp_7_top_10.values)



famd_correlations_comp_5 = famd.column_correlations(X_train)[5]
famd_correlations_comp_5_abs = famd_correlations_comp_5.abs()
famd_correlations_comp_5_top_20_labels = list(famd_correlations_comp_5_abs.sort_values(ascending=False)[0:10].index.values)
famd_correlations_comp_5_top_10 = famd_correlations_comp_5.loc[famd_correlations_comp_5_top_20_labels]
famd_correlations_top_10_labels_5 = list(famd_correlations_comp_5_top_10.index.values)
famd_correlations_top_10_values_5 = list(famd_correlations_comp_5_top_10.values)


famd_correlations_comp_10 = famd.column_correlations(X_train)[10]
famd_correlations_comp_10_abs = famd_correlations_comp_10.abs()
famd_correlations_comp_10_top_20_labels = list(famd_correlations_comp_10_abs.sort_values(ascending=False)[0:10].index.values)
famd_correlations_comp_10_top_10 = famd_correlations_comp_10.loc[famd_correlations_comp_10_top_20_labels]
famd_correlations_top_10_labels_10 = list(famd_correlations_comp_10_top_10.index.values)
famd_correlations_top_10_values_10 = list(famd_correlations_comp_10_top_10.values)


In [None]:
labels_7 = ['ANNUAL SEQUENCE NUMBER', 'PRIMARY SAMPLING UNIT', 'QUESTIONNAIRE VERSION IDENTIFIER', 'FINAL WEIGHT: LAND-LINE AND CELL-PHONE DATA', 'Final Adjusted Weight for Content on Two of Three Splits','COMPUTED WEIGHT IN KILOGRAMS', 'OVERWEIGHT OR OBESE - OBESE', 'AGE TOLD HAD CANCER - >= 97', 'AGE TOLD HAD CANCER - MISSING', 'OVERWEIGHT OR OBESE CALCULATED VARIABLE - No']
fig = go.Figure(data=go.Heatmap(
        z=[famd_correlations_top_10_values_7],
        x=labels_7,
        y=['Component 7'],
        colorscale='Viridis'))
fig.show()

In [None]:
labels_5 = ['COMPUTED WEIGHT IN KILOGRAMS', 'COMPUTED BODY MASS INDEX CATEGORIES - OBESE', 'OVERWEIGHT OR OBESE CALCULATED VARIABLE - Yes', 'OVERWEIGHT OR OBESE CALCULATED VARIABLE - No', 'COMPUTED BODY MASS INDEX CATEGORIES - NORMAL WEIGHT', 'ANNUAL SEQUENCE NUMBER', 'PRIMARY SAMPLING UNIT', 'QUESTIONNAIRE VERSION IDENTIFIER', 'TOLD YOU HAVE DIABETES - No', 'NUMBER OF DAYS PHYSICAL HEALTH NOT GOOD - None']
fig = go.Figure(data=go.Heatmap(
        z=[famd_correlations_top_10_values_5],
        x=labels_5,
        y=['Component 5'],
        colorscale='Viridis'))
fig.show()

In [None]:
random_forest = RandomForestClassifier(random_state = RANDOM_SEED, n_estimators= 200, min_samples_split= 2, min_samples_leaf = 1, max_depth = None, bootstrap = False, class_weight={'No': 0.25, 'Yes': 0.75})
random_forest.fit(X_train_upsampled_transformed, y_train_upsampled)

y_pred = random_forest.predict(X_test_transformed)
    
upsampled_rf_scores_correlated = get_performance_scores(y_pred, y_test)
print_performance_scores(upsampled_rf_scores_correlated)

The next 3 cells contain our grid searches for hyperparameter tuning. This code is commented out because it takes a long time to run and we have already tuned the models.
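For readers who want to see the mechanics without the long run time, here is a small-scale sketch of a grid search scored with macro F1. It uses synthetic data from `make_classification` and a deliberately tiny grid, both of which are assumptions for illustration and not the settings used to tune our models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic classification problem
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

# Candidate regularization strengths (illustrative values only)
param_grid = {"C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42),
                      param_grid=param_grid,
                      scoring="f1_macro",
                      cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

A single call to `fit` already evaluates every parameter combination across all folds, so the search does not need to be repeated to cover the grid.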

In [None]:
'''from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf_clf = RandomForestClassifier(random_state=RANDOM_SEED).fit(X_train_upsampled_transformed, y_train_upsampled)

grid_values = {'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
 'max_features': ['sqrt'],  # 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
 }

rand_search_clf = RandomizedSearchCV(estimator = rf_clf, param_distributions = grid_values, n_iter = 10, cv = 3, verbose=2, scoring='f1_macro', random_state=RANDOM_SEED, n_jobs = -1)

my_list = list(range(100))
for x in tqdm(my_list):
    rand_search_clf.fit(X_train_upsampled_transformed[0:10000], y_train_upsampled[0:10000])
print(rand_search_clf.best_estimator_)'''

In [None]:

'''from sklearn.model_selection import RandomizedSearchCV

param_grid = {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001],'kernel': ['rbf', 'poly', 'sigmoid']}

# Randomized search over the SVC hyperparameters
grid = RandomizedSearchCV(SVC(random_state=RANDOM_SEED),param_grid,refit=True,verbose=2, scoring='f1_macro', random_state=RANDOM_SEED)

# Fit once on a subset; X and y must come from the same (upsampled) training set
grid.fit(X_train_upsampled_transformed[0:10000],y_train_upsampled[0:10000])

print(grid.best_estimator_)'''

In [None]:
'''#Do grid search for hyperparameter tuning
clf = LogisticRegression(random_state = RANDOM_SEED, solver='liblinear')
grid_values = {'penalty': ['l1', 'l2'],'C':[0.001,.009,0.01,.09,1,5,10,25]}
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values,scoring = 'f1_macro')
my_list = list(range(100))
for x in tqdm(my_list):
    grid_clf_acc.fit(X_train_transformed, y_train)

y_pred_acc = grid_clf_acc.predict(X_test_transformed)

print_performance_scores(get_performance_scores(y_pred_acc, y_test))'''

In [None]:
svc_model = SVC(C=100, gamma=0.1, kernel='poly', random_state=RANDOM_SEED).fit(X_train_upsampled_transformed, y_train_upsampled)
y_pred = svc_model.predict(X_test_transformed)
upsampled_svc_scores_correlated = get_performance_scores(y_pred, y_test)
print_performance_scores(upsampled_svc_scores_correlated)

In [None]:
upsampled_lr_scores.remove("Upsampled using resample")
upsampled_rf_scores.remove("Upsampled using resample")
upsampled_svc_scores.remove("Upsampled using resample")
upsampled_lr_scores.insert(0, "Manually Selected Columns")
upsampled_rf_scores.insert(0, "Manually Selected Columns")
upsampled_svc_scores.insert(0, "Manually Selected Columns")
upsampled_lr_scores_correlated.insert(0, "Most Correlated Columns")
upsampled_rf_scores_correlated.insert(0, "Most Correlated Columns")
upsampled_svc_scores_correlated.insert(0, "Most Correlated Columns")
scores_columns = ["Input Columns", "F1 Score", "Accuracy Score", "Precision Score", "Recall Score"]
scores_rows = ["Logistic Regression", "Logistic Regression", "Random Forest", "Random Forest", "SVC", "SVC"]
scores = [upsampled_lr_scores, upsampled_lr_scores_correlated, upsampled_rf_scores, upsampled_rf_scores_correlated, upsampled_svc_scores, upsampled_svc_scores_correlated]
print(scores)
scores_df = pd.DataFrame(data=scores, index=scores_rows, columns=scores_columns)
scores_df

fig = px.bar(scores_df, x=scores_df.index, y="F1 Score",
             color='Input Columns', barmode='group', title="F1 Score By Model And Column Selection Method", text_auto='.3', 
             labels={"index": "Supervised Learning Model", "Input Columns": "Initial Column Selection"},
             height=400)
fig.update_layout(title_text='F1 Score By Model And Column Selection Method', title_x=0.5)
fig.update_yaxes(range=[0, 1])
fig.show()