## Question to Answer: 
Can we classify the heart disease risk of each county in the USA based on demographic, socio-economic, and health-measure data? 

## Dataset: 
Three datasets, all keyed by US state and county:  
1) Dataset on heart disease mortality rate 
2) Dataset on socio-economic and health measures data by county 
3) Dataset on population 

## Machine Learning Model: 
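We chose a Random Forest Classifier. It trains many decision trees, each on a bootstrap sample of the rows and with a random subset of features considered at each split, then combines their votes, which typically improves accuracy over a single tree. 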
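
As a rough illustration of the two sources of randomness in a random forest, the sketch below fits a small forest on synthetic data. The data, the variable names, and the hyperparameter values are assumptions for demonstration only; they are not the notebook's dataset or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 4 classes, 20 features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=8,
    n_classes=4, random_state=0,
)

forest = RandomForestClassifier(
    n_estimators=25,      # number of trees that vote
    max_features="sqrt",  # random subset of features tried at each split
    bootstrap=True,       # each tree sees a bootstrap sample of the rows
    random_state=0,
)
forest.fit(X_demo, y_demo)

n_trees = len(forest.estimators_)
features_per_split = forest.estimators_[0].max_features_
print(n_trees, features_per_split)
```

With 20 features, `max_features="sqrt"` means each split considers 4 candidate features.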

### Building the workflow 

1) Preprocess the data.  Based on the mortality rate in each county, we classified the data into four classes: 
- Class 1 is the counties in the first quartile of the mortality rate distribution 
- Class 2 is the counties that XXX 
- Class 3 XXX 
- Class 4 is the counties above the third quartile of the mortality rate distribution 

2) Merge the datasets and encode them where necessary 

3) Use the mortality class (`Rate_Level`) as the target 

4) Split, scale, and fit the data 

5) Predict and assess the model 
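
The notebook does not show how `Rate_Level` was derived. Assuming the four classes are meant as quartile bins of the mortality rate, step 1 could be sketched with `pd.qcut` as below. The synthetic rates are placeholders, and the actual cut points used for the real data may differ.

```python
import numpy as np
import pandas as pd

# Placeholder mortality rates (assumption): 200 synthetic counties.
rng = np.random.default_rng(0)
rates = pd.Series(rng.uniform(150, 600, size=200), name="Rate_100000")

# Quartile binning: label 1 is the lowest quarter, label 4 the highest.
rate_level = pd.qcut(rates, q=4, labels=[1, 2, 3, 4]).astype(int)

counts = rate_level.value_counts().sort_index()
print(counts)
```

True quartile bins give classes of equal size, which is a quick way to check whether a given labelling was produced this way.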

In [83]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [75]:
# Import the first dataset (heart disease mortality by county)
df_mortality = pd.read_csv("Data/Heart_Disease_Mortality_County.csv")
df_mortality.head()

Unnamed: 0,State,County,Rate_100000,Rate_Level
0,AK,Aleutians East,165.0,1
1,AK,Aleutians West,261.8,1
2,AK,Anchorage,261.733333,1
3,AK,Bethel,321.322222,2
4,AK,Bristol Bay,0.0,1


In [76]:
#Import second dataset
df_features = pd.read_csv("Data/Features_County.csv")
df_features.head()

Unnamed: 0,State,County,Percent_Fair_or_Poor_Health,Average_Number_of_Physically_Unhealthy_Days,Average_Number_of_Mentally_Unhealthy_Days,Percent_Smokers,Percent_Adults_with_Obesity,Food_Environment_Index,Percent_Physically_Inactive,Percent_With_Access_to_Exercise_Opportunities,...,Social_Association_Rate,Violent_Crime_Rate,Polution_Average_Daily_PM2.5,Presence_of_Water_Violation,Percent_Severe_Housing_Problems,Percent_Drive_Alone_to_Work,Percent_Long_Commute,Percent_Adults_with_Diabetes,Percent_Limited_Access_to_Healthy_Foods,Median_Household_Income
0,Alabama,Autauga,21,4.7,4.7,18,33,7.2,35,69,...,12.1,272,11.7,0,15,87,40,11,12,59338
1,Alabama,Baldwin,18,4.2,4.3,17,31,8.0,27,74,...,10.2,204,10.3,0,14,84,42,11,5,57588
2,Alabama,Barbour,30,5.4,5.2,22,42,5.6,24,53,...,7.5,414,11.5,0,15,83,32,18,11,34382
3,Alabama,Bibb,19,4.6,4.6,19,38,7.8,34,16,...,8.4,89,11.2,0,10,85,50,15,3,46064
4,Alabama,Blount,22,4.9,4.9,19,34,8.4,30,16,...,8.4,483,11.7,0,11,86,59,17,3,50412


In [77]:
df_state = pd.read_csv("Data/State_symbol.csv")
df_state.head()

Unnamed: 0,StateName,Symbol
0,Alaska,AK
1,Alabama,AL
2,Arkansas,AR
3,Arizona,AZ
4,California,CA


In [84]:
statename = df_state["StateName"].values.tolist()


In [85]:
symbol = df_state["Symbol"].values.tolist()


In [86]:
df_features.replace(to_replace = statename, value = symbol, inplace = True)
df_features.tail()

Unnamed: 0,State,County,Percent_Fair_or_Poor_Health,Average_Number_of_Physically_Unhealthy_Days,Average_Number_of_Mentally_Unhealthy_Days,Percent_Smokers,Percent_Adults_with_Obesity,Food_Environment_Index,Percent_Physically_Inactive,Percent_With_Access_to_Exercise_Opportunities,...,Social_Association_Rate,Violent_Crime_Rate,Polution_Average_Daily_PM2.5,Presence_of_Water_Violation,Percent_Severe_Housing_Problems,Percent_Drive_Alone_to_Work,Percent_Long_Commute,Percent_Adults_with_Diabetes,Percent_Limited_Access_to_Healthy_Foods,Median_Household_Income
3137,WY,Sweetwater,15,3.4,3.6,18,30,7.7,25,90,...,10.3,300,5.1,1,10,76,18,9,11,73315
3138,WY,Teton,12,3.0,3.2,15,12,8.2,12,100,...,16.3,0,4.9,1,17,67,14,2,7,99087
3139,WY,Uinta,16,3.6,3.7,17,36,7.4,27,84,...,2.9,71,5.9,1,11,77,19,11,10,63401
3140,WY,Washakie,16,3.6,3.7,17,29,8.3,28,83,...,16.1,78,4.8,0,10,79,7,12,4,55190
3141,WY,Weston,14,3.5,3.7,17,33,7.9,27,63,...,13.0,157,4.1,0,14,74,24,9,4,54319
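
One thing worth checking in the cell above: `DataFrame.replace` applies to every column, so a county whose name equals a state name (for example a "Washington" county) would also be rewritten to the state symbol. The sketch below uses a tiny made-up frame to show the effect and a column-restricted alternative with `Series.map`. The demo frames and the `state_to_symbol` name are assumptions, not part of the notebook.

```python
import pandas as pd

# Tiny made-up lookup and feature frames (assumption).
df_state_demo = pd.DataFrame(
    {"StateName": ["Alabama", "Washington"], "Symbol": ["AL", "WA"]}
)
df_demo = pd.DataFrame(
    {"State": ["Alabama", "Washington"], "County": ["Washington", "King"]}
)

names = df_state_demo["StateName"].tolist()
symbols = df_state_demo["Symbol"].tolist()

# Whole-frame replace: also rewrites the county called "Washington".
whole_frame = df_demo.replace(to_replace=names, value=symbols)

# Column-restricted mapping: only the State column is touched.
state_to_symbol = dict(zip(names, symbols))
state_only = df_demo.assign(
    State=df_demo["State"].map(state_to_symbol).fillna(df_demo["State"])
)

print(whole_frame)
print(state_only)
```

If the real data contains such counties, they would stop matching in the later merge, so the restricted mapping may be the safer choice.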


In [91]:
# Merge the feature and mortality datasets on state and county (inner join by default)
df_combined1 = df_features.merge(df_mortality, on=["State", "County"])
df_combined1.head()

Unnamed: 0,State,County,Percent_Fair_or_Poor_Health,Average_Number_of_Physically_Unhealthy_Days,Average_Number_of_Mentally_Unhealthy_Days,Percent_Smokers,Percent_Adults_with_Obesity,Food_Environment_Index,Percent_Physically_Inactive,Percent_With_Access_to_Exercise_Opportunities,...,Polution_Average_Daily_PM2.5,Presence_of_Water_Violation,Percent_Severe_Housing_Problems,Percent_Drive_Alone_to_Work,Percent_Long_Commute,Percent_Adults_with_Diabetes,Percent_Limited_Access_to_Healthy_Foods,Median_Household_Income,Rate_100000,Rate_Level
0,AL,Autauga,21,4.7,4.7,18,33,7.2,35,69,...,11.7,0,15,87,40,11,12,59338,422.022222,4
1,AL,Baldwin,18,4.2,4.3,17,31,8.0,27,74,...,10.3,0,14,84,42,11,5,57588,321.570588,2
2,AL,Barbour,30,5.4,5.2,22,42,5.6,24,53,...,11.5,0,15,83,32,18,11,34382,461.144444,4
3,AL,Bibb,19,4.6,4.6,19,38,7.8,34,16,...,11.2,0,10,85,50,15,3,46064,393.036364,3
4,AL,Blount,22,4.9,4.9,19,34,8.4,30,16,...,11.7,0,11,86,59,17,3,50412,387.481818,3


In [92]:
df_combined1.shape

(3010, 33)
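
An inner merge silently drops counties that appear in only one dataset. A minimal way to audit this, sketched on made-up frames, is an outer merge with `indicator=True`. The frame contents below are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for the feature and mortality frames (assumption).
left = pd.DataFrame({
    "State": ["AL", "AL", "AK"],
    "County": ["Autauga", "Baldwin", "Bethel"],
    "Percent_Smokers": [18, 17, 20],
})
right = pd.DataFrame({
    "State": ["AL", "AK", "AK"],
    "County": ["Autauga", "Bethel", "Anchorage"],
    "Rate_Level": [4, 2, 1],
})

# The _merge column reports "left_only", "right_only", or "both" per row.
audit = left.merge(right, on=["State", "County"], how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["State", "County", "_merge"]])
```

Listing the unmatched keys usually makes naming mismatches between the sources easy to spot.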

In [93]:
df_combined1.dtypes

State                                             object
County                                            object
Percent_Fair_or_Poor_Health                        int64
Average_Number_of_Physically_Unhealthy_Days      float64
Average_Number_of_Mentally_Unhealthy_Days        float64
Percent_Smokers                                    int64
Percent_Adults_with_Obesity                        int64
Food_Environment_Index                           float64
Percent_Physically_Inactive                        int64
Percent_With_Access_to_Exercise_Opportunities      int64
Percent_Excessive_Drinking                         int64
Percent_Uninsured                                  int64
Primary_Care_Physicians_Rate                       int64
Dentist_Rate                                       int64
Mental_Health_Provider_Rate                        int64
Preventable_Hospitalization_Rate                   int64
Percent_Vaccinated                                 int64
High_School_Graduation_Rate    

In [96]:
# Drop "County" (an identifier) and "Rate_Level" (the target) from the features.
# "State" is dropped in the next cell.
X = df_combined1.copy()
X = X.drop(columns=["County", "Rate_Level"])
X.head()

Unnamed: 0,State,Percent_Fair_or_Poor_Health,Average_Number_of_Physically_Unhealthy_Days,Average_Number_of_Mentally_Unhealthy_Days,Percent_Smokers,Percent_Adults_with_Obesity,Food_Environment_Index,Percent_Physically_Inactive,Percent_With_Access_to_Exercise_Opportunities,Percent_Excessive_Drinking,...,Violent_Crime_Rate,Polution_Average_Daily_PM2.5,Presence_of_Water_Violation,Percent_Severe_Housing_Problems,Percent_Drive_Alone_to_Work,Percent_Long_Commute,Percent_Adults_with_Diabetes,Percent_Limited_Access_to_Healthy_Foods,Median_Household_Income,Rate_100000
0,AL,21,4.7,4.7,18,33,7.2,35,69,15,...,272,11.7,0,15,87,40,11,12,59338,422.022222
1,AL,18,4.2,4.3,17,31,8.0,27,74,18,...,204,10.3,0,14,84,42,11,5,57588,321.570588
2,AL,30,5.4,5.2,22,42,5.6,24,53,13,...,414,11.5,0,15,83,32,18,11,34382,461.144444
3,AL,19,4.6,4.6,19,38,7.8,34,16,16,...,89,11.2,0,10,85,50,15,3,46064,393.036364
4,AL,22,4.9,4.9,19,34,8.4,30,16,14,...,483,11.7,0,11,86,59,17,3,50412,387.481818


In [97]:
X = X.drop(columns="State")

In [98]:
y = df_combined1["Rate_Level"].to_numpy()
y[:5]

array([4, 2, 4, 3, 3])

In [99]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)
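
Because the four classes are not equally frequent, one option is to pass `stratify=y` so the train and test sets keep the same class proportions. The sketch below shows the effect on made-up labels; the class counts are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with unequal class sizes (assumption).
y_demo = np.repeat([1, 2, 3, 4], [40, 60, 40, 60])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Default test_size is 0.25; stratify keeps class shares equal across splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=78, stratify=y_demo
)

test_counts = np.bincount(y_te)[1:]  # counts for classes 1..4
print(test_counts)
```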

In [100]:
# Creating a StandardScaler instance.
scaler = StandardScaler()
# Fitting the Standard Scaler with the training data.
X_scaler = scaler.fit(X_train)

# Scaling the data.
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
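
The fit-on-train, transform-both pattern above can also be expressed as a scikit-learn pipeline, which keeps the scaler and the model together. This is a sketch on synthetic data, not the notebook's code. Note that tree-based models are generally insensitive to feature scaling, so the scaler here mainly matters if the classifier is later swapped for a scale-sensitive one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=4, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=78)

# The scaler is fitted on the training data only when pipeline.fit is called.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=78),
)
pipeline.fit(X_tr, y_tr)

score = pipeline.score(X_te, y_te)
print(f"Hold-out accuracy: {score:.3f}")
```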

In [101]:
# Create a random forest classifier.
rf_model = RandomForestClassifier(n_estimators=128, random_state=78) 

In [102]:
rf_model = rf_model.fit(X_train_scaled, y_train)

In [103]:
predictions = rf_model.predict(X_test_scaled)

In [104]:
cm = confusion_matrix(y_test, predictions)

# Create a DataFrame from the confusion matrix.
cm_df = pd.DataFrame(
    cm, index=["Actual 1", "Actual 2", "Actual 3", "Actual 4"], columns=["Predicted 1", "Predicted 2", "Predicted 3", "Predicted 4"])

cm_df

Unnamed: 0,Predicted 1,Predicted 2,Predicted 3,Predicted 4
Actual 1,152,0,0,0
Actual 2,1,232,1,0
Actual 3,0,0,159,0
Actual 4,0,0,0,208


In [105]:
acc_score = accuracy_score(y_test, predictions)
display(cm_df)
print(f"Accuracy Score : {acc_score}")
print("Classification Report")
print(classification_report(y_test, predictions))

Unnamed: 0,Predicted 1,Predicted 2,Predicted 3,Predicted 4
Actual 1,152,0,0,0
Actual 2,1,232,1,0
Actual 3,0,0,159,0
Actual 4,0,0,0,208


Accuracy Score : 0.99734395750332
Classification Report
              precision    recall  f1-score   support

           1       0.99      1.00      1.00       152
           2       1.00      0.99      1.00       234
           3       0.99      1.00      1.00       159
           4       1.00      1.00      1.00       208

    accuracy                           1.00       753
   macro avg       1.00      1.00      1.00       753
weighted avg       1.00      1.00      1.00       753



In [106]:
# Rank features by importance, highest first
importances = rf_model.feature_importances_
sorted(zip(importances, X.columns), reverse=True)

[(0.5426816953256371, 'Rate_100000'),
 (0.04494279843413491, 'Median_Household_Income'),
 (0.030903580026320054, 'Percent_Smokers'),
 (0.02705525504770238, 'Average_Number_of_Physically_Unhealthy_Days'),
 (0.023907250238436345, 'Percent_Fair_or_Poor_Health'),
 (0.02296458016843048, 'Average_Number_of_Mentally_Unhealthy_Days'),
 (0.02159925383873855, 'Percent_Physically_Inactive'),
 (0.01907676095128801, 'Polution_Average_Daily_PM2.5'),
 (0.01895734780046619, 'Percent_With_Access_to_Exercise_Opportunities'),
 (0.01823535955559056, 'Preventable_Hospitalization_Rate'),
 (0.017701430614602942, 'Percent_Some_College'),
 (0.013939604811628453, 'Social_Association_Rate'),
 (0.013726924339023954, 'Percent_Drive_Alone_to_Work'),
 (0.013335227782572855, 'Percent_Excessive_Drinking'),
 (0.013192985796274828, 'Mental_Health_Provider_Rate'),
 (0.01226510088676698, 'Dentist_Rate'),
 (0.01207381296516316, 'Percent_Vaccinated'),
 (0.012026686675192529, 'Percent_Unemployed'),
 (0.012013740359444974, 'P
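
The near-perfect accuracy and the dominant importance of `Rate_100000` are what target leakage looks like: the class label is derived from that very column. The sketch below reproduces the effect on synthetic data in which every other feature is pure noise. All names and values in it are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Five noise features plus a rate column; the label is binned from the rate.
demo = pd.DataFrame(
    rng.normal(size=(400, 5)), columns=[f"noise_{i}" for i in range(5)]
)
demo["Rate_100000"] = rng.uniform(150, 600, size=400)
demo["Rate_Level"] = pd.qcut(
    demo["Rate_100000"], q=4, labels=[1, 2, 3, 4]
).astype(int)


def holdout_accuracy(feature_frame, target):
    """Fit a forest on a train split and return accuracy on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        feature_frame, target, random_state=78
    )
    model = RandomForestClassifier(n_estimators=50, random_state=78)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)


target = demo["Rate_Level"]
acc_with_rate = holdout_accuracy(demo.drop(columns=["Rate_Level"]), target)
acc_without_rate = holdout_accuracy(
    demo.drop(columns=["Rate_Level", "Rate_100000"]), target
)
print(acc_with_rate, acc_without_rate)
```

With the leaking column the model scores far higher than without it, even though the remaining features carry no signal at all in this toy setup.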

In [107]:
# Drop the actual mortality rate: Rate_Level is derived from it, so keeping it leaks the target
X_droprate = X.drop(columns="Rate_100000")
X_droprate_train, X_droprate_test, y_train, y_test = train_test_split(X_droprate, y, random_state=78)

In [108]:
X_scaler = scaler.fit(X_droprate_train)

# Scaling the data.
X_droprate_train_scaled = X_scaler.transform(X_droprate_train)
X_droprate_test_scaled = X_scaler.transform(X_droprate_test)

In [109]:
# Note: fit() retrains rf_model in place and returns it, so rf_model_droprate is the same object
rf_model_droprate = rf_model.fit(X_droprate_train_scaled, y_train)

In [110]:
predictions_droprate = rf_model_droprate.predict(X_droprate_test_scaled)
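
In scikit-learn, `fit` returns the estimator itself, so assigning its result to a new name creates an alias rather than a second model. If separate models are wanted for comparison, `sklearn.base.clone` gives an unfitted copy with the same hyperparameters. A minimal sketch on synthetic data, with all variable names being assumptions:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

base_model = RandomForestClassifier(n_estimators=10, random_state=78)

# fit() returns self: both names point at the same fitted estimator.
alias = base_model.fit(X_demo, y_demo)

# clone() copies the hyperparameters into a new, unfitted estimator.
reduced_model = clone(base_model).fit(X_demo[:, :3], y_demo)

print(alias is base_model, reduced_model is base_model)
print(base_model.n_features_in_, reduced_model.n_features_in_)
```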

In [111]:
cm_droprate = confusion_matrix(y_test, predictions_droprate)

cm_droprate_df = pd.DataFrame(
    cm_droprate, index=["Actual 1", "Actual 2", "Actual 3", "Actual 4"], columns=["Predicted 1", "Predicted 2", "Predicted 3", "Predicted 4"])

cm_droprate_df

Unnamed: 0,Predicted 1,Predicted 2,Predicted 3,Predicted 4
Actual 1,85,59,4,4
Actual 2,42,150,22,20
Actual 3,8,53,50,48
Actual 4,0,27,30,151


In [112]:
# Pair importances with the columns the model was actually trained on (Rate_100000 excluded)
importances = rf_model_droprate.feature_importances_
sorted(zip(importances, X_droprate.columns), reverse=True)

[(0.07818205762412815, 'Median_Household_Income'),
 (0.05073832033555747, 'Percent_Smokers'),
 (0.049961842800323196, 'Average_Number_of_Physically_Unhealthy_Days'),
 (0.04563593655482101, 'Percent_Physically_Inactive'),
 (0.045411110782341425, 'Polution_Average_Daily_PM2.5'),
 (0.04264350767006409, 'Preventable_Hospitalization_Rate'),
 (0.04015186571233087, 'Average_Number_of_Mentally_Unhealthy_Days'),
 (0.03873025106523598, 'Percent_With_Access_to_Exercise_Opportunities'),
 (0.037212564699645914, 'Percent_Some_College'),
 (0.03613588750417899, 'Percent_Fair_or_Poor_Health'),
 (0.035035392236059434, 'Social_Association_Rate'),
 (0.03378983722433926, 'Mental_Health_Provider_Rate'),
 (0.032401796976197474, 'Percent_Drive_Alone_to_Work'),
 (0.03197330529347901, 'Violent_Crime_Rate'),
 (0.03152031240654293, 'Percent_Vaccinated'),
 (0.03132987416255882, 'Dentist_Rate'),
 (0.030031773383937296, 'Percent_Unemployed'),
 (0.029840198991358848, 'Primary_Care_Physicians_Rate'),
 (0.0297554157996
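
A labelled `pd.Series` is one convenient way to keep importances attached to the right feature names and to sort or slice them. The sketch below uses a made-up frame with one informative column; the column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Made-up features (assumption): only "signal" determines the label.
X_demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y_demo = (X_demo["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=78)
model.fit(X_demo, y_demo)

# Index the importances by the columns of the frame used for training.
importances = pd.Series(
    model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances sum to 1, so each value can be read as a share of the total.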

In [113]:
acc_score_droprate = accuracy_score(y_test, predictions_droprate)
display(cm_droprate_df)
print(f"Accuracy Score : {acc_score_droprate}")
print("Classification Report")
print(classification_report(y_test, predictions_droprate))

Unnamed: 0,Predicted 1,Predicted 2,Predicted 3,Predicted 4
Actual 1,85,59,4,4
Actual 2,42,150,22,20
Actual 3,8,53,50,48
Actual 4,0,27,30,151


Accuracy Score : 0.5790172642762285
Classification Report
              precision    recall  f1-score   support

           1       0.63      0.56      0.59       152
           2       0.52      0.64      0.57       234
           3       0.47      0.31      0.38       159
           4       0.68      0.73      0.70       208

    accuracy                           0.58       753
   macro avg       0.57      0.56      0.56       753
weighted avg       0.58      0.58      0.57       753
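
The figures in the report can be recomputed directly from the confusion matrix above, which is a useful sanity check: accuracy is the diagonal sum over the total, recall divides the diagonal by row sums, and precision divides it by column sums.

```python
import numpy as np

# Confusion matrix from the output above (rows = actual, columns = predicted).
cm = np.array([
    [85, 59, 4, 4],
    [42, 150, 22, 20],
    [8, 53, 50, 48],
    [0, 27, 30, 151],
])

accuracy = np.trace(cm) / cm.sum()          # 436 correct out of 753
recall = np.diag(cm) / cm.sum(axis=1)       # per actual class
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class

print(f"accuracy  : {accuracy:.4f}")
print("recall    :", np.round(recall, 2))
print("precision :", np.round(precision, 2))
```

Class 3 has the lowest recall: only 50 of its 159 test counties are recovered, with most of the misses landing in the neighbouring classes 2 and 4.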



In [114]:
df_population = pd.read_csv("Data/Population_State_County.csv")
df_population.head()

Unnamed: 0,State,Area_Name,POP_ESTIMATE
0,AL,Autauga County,55869
1,AL,Baldwin County,223234
2,AL,Barbour County,24686
3,AL,Bibb County,22394
4,AL,Blount County,57826


In [115]:
df_population.dtypes

State           object
Area_Name       object
POP_ESTIMATE    object
dtype: object

In [116]:
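# Strip the " County" suffix and the thousands separators so keys and counts match the other datasets
df_population['Area_Name'] = df_population["Area_Name"].str.replace(' County', "", regex=False)
df_population["POP_ESTIMATE"] = df_population["POP_ESTIMATE"].str.replace(',', '', regex=False)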
df_population.head()

Unnamed: 0,State,Area_Name,POP_ESTIMATE
0,AL,Autauga,55869
1,AL,Baldwin,223234
2,AL,Barbour,24686
3,AL,Bibb,22394
4,AL,Blount,57826
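
Not every county equivalent ends in " County": Louisiana uses parishes and Alaska uses boroughs and census areas, for example. If the population file contains such suffixes (an assumption, since the file is not shown in full), a single regular expression could strip several at once. The suffix list below is illustrative, not exhaustive.

```python
import pandas as pd

# Example area names (assumption about what the file may contain).
names = pd.Series([
    "Autauga County",
    "Acadia Parish",
    "Bethel Census Area",
    "Anchorage Municipality",
    "Baltimore city",
])

# Remove one trailing suffix from the illustrative list; other names pass through.
suffix_pattern = r"\s+(County|Parish|Borough|Census Area|Municipality)$"
cleaned = names.str.replace(suffix_pattern, "", regex=True)
print(cleaned.tolist())
```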


In [117]:
df_population["POP_ESTIMATE"] = df_population["POP_ESTIMATE"].fillna(0)
df_population["POP_ESTIMATE"] = df_population["POP_ESTIMATE"].astype("int")

In [118]:
df_population.dtypes

State           object
Area_Name       object
POP_ESTIMATE     int64
dtype: object
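
An alternative to `fillna(0)` followed by `astype("int")` is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into missing values instead of zeros. Whether zero or missing is the better encoding depends on how the column is used; the sketch below only shows the mechanics on made-up strings.

```python
import pandas as pd

# Made-up raw values (assumption): thousands separators, a gap, and a bad entry.
raw = pd.Series(["55,869", "223,234", None, "n/a"], name="POP_ESTIMATE")

# Remove separators, then coerce anything non-numeric to NaN.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# The nullable Int64 dtype keeps integers and missing values side by side.
population = cleaned.astype("Int64")
print(population)
```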

In [120]:
df_combined = df_combined1.merge(df_population, how = "inner", left_on = ["State","County"], right_on = ["State","Area_Name"])
df_combined.head()

Unnamed: 0,State,County,Percent_Fair_or_Poor_Health,Average_Number_of_Physically_Unhealthy_Days,Average_Number_of_Mentally_Unhealthy_Days,Percent_Smokers,Percent_Adults_with_Obesity,Food_Environment_Index,Percent_Physically_Inactive,Percent_With_Access_to_Exercise_Opportunities,...,Percent_Severe_Housing_Problems,Percent_Drive_Alone_to_Work,Percent_Long_Commute,Percent_Adults_with_Diabetes,Percent_Limited_Access_to_Healthy_Foods,Median_Household_Income,Rate_100000,Rate_Level,Area_Name,POP_ESTIMATE
0,AL,Autauga,21,4.7,4.7,18,33,7.2,35,69,...,15,87,40,11,12,59338,422.022222,4,Autauga,55869
1,AL,Baldwin,18,4.2,4.3,17,31,8.0,27,74,...,14,84,42,11,5,57588,321.570588,2,Baldwin,223234
2,AL,Barbour,30,5.4,5.2,22,42,5.6,24,53,...,15,83,32,18,11,34382,461.144444,4,Barbour,24686
3,AL,Bibb,19,4.6,4.6,19,38,7.8,34,16,...,10,85,50,15,3,46064,393.036364,3,Bibb,22394
4,AL,Blount,22,4.9,4.9,19,34,8.4,30,16,...,11,86,59,17,3,50412,387.481818,3,Blount,57826


In [121]:
df_combined.shape

(2946, 35)

In [122]:
df_combined.replace(to_replace = statename, value = symbol, inplace = True)
df_combined.tail()

Unnamed: 0,State,County,Percent_Fair_or_Poor_Health,Average_Number_of_Physically_Unhealthy_Days,Average_Number_of_Mentally_Unhealthy_Days,Percent_Smokers,Percent_Adults_with_Obesity,Food_Environment_Index,Percent_Physically_Inactive,Percent_With_Access_to_Exercise_Opportunities,...,Percent_Severe_Housing_Problems,Percent_Drive_Alone_to_Work,Percent_Long_Commute,Percent_Adults_with_Diabetes,Percent_Limited_Access_to_Healthy_Foods,Median_Household_Income,Rate_100000,Rate_Level,Area_Name,POP_ESTIMATE
2941,WY,Sweetwater,15,3.4,3.6,18,30,7.7,25,90,...,10,76,18,9,11,73315,392.575,3,Sweetwater,42343
2942,WY,Teton,12,3.0,3.2,15,12,8.2,12,100,...,17,67,14,2,7,99087,186.1125,1,Teton,23464
2943,WY,Uinta,16,3.6,3.7,17,36,7.4,27,84,...,11,77,19,11,10,63401,321.914286,2,Uinta,20226
2944,WY,Washakie,16,3.6,3.7,17,29,8.3,28,83,...,10,79,7,12,4,55190,304.2875,2,Washakie,7805
2945,WY,Weston,14,3.5,3.7,17,33,7.9,27,63,...,14,74,24,9,4,54319,359.85,3,Weston,6927


In [127]:
# Features: drop identifiers, the target, and the leaking mortality rate
X_new = df_combined.copy()
X_new = X_new.drop(columns=["State", "County", "Rate_Level", "Rate_100000"])
y_new = df_combined["Rate_Level"].to_numpy()

In [133]:
X_new = X_new.drop(columns="Area_Name")

In [134]:
X_new_train, X_new_test, y_new_train, y_new_test = train_test_split(X_new, y_new, random_state=78)

In [135]:
X_new.dtypes

Percent_Fair_or_Poor_Health                        int64
Average_Number_of_Physically_Unhealthy_Days      float64
Average_Number_of_Mentally_Unhealthy_Days        float64
Percent_Smokers                                    int64
Percent_Adults_with_Obesity                        int64
Food_Environment_Index                           float64
Percent_Physically_Inactive                        int64
Percent_With_Access_to_Exercise_Opportunities      int64
Percent_Excessive_Drinking                         int64
Percent_Uninsured                                  int64
Primary_Care_Physicians_Rate                       int64
Dentist_Rate                                       int64
Mental_Health_Provider_Rate                        int64
Preventable_Hospitalization_Rate                   int64
Percent_Vaccinated                                 int64
High_School_Graduation_Rate                        int64
Percent_Some_College                               int64
Percent_Unemployed             

In [136]:
X_new_scaler = scaler.fit(X_new_train)

# Scaling the data.
X_new_train_scaled = X_new_scaler.transform(X_new_train)
X_new_test_scaled = X_new_scaler.transform(X_new_test)

In [137]:
# Note: fit() retrains rf_model in place and returns it, so rf_model_new is the same object
rf_model_new = rf_model.fit(X_new_train_scaled, y_new_train)

In [138]:
predictions_new = rf_model_new.predict(X_new_test_scaled)

In [139]:
cm_new = confusion_matrix(y_new_test, predictions_new)

cm_new_df = pd.DataFrame(
    cm_new, index=["Actual 1", "Actual 2", "Actual 3", "Actual 4"], columns=["Predicted 1", "Predicted 2", "Predicted 3", "Predicted 4"])

cm_new_df

Unnamed: 0,Predicted 1,Predicted 2,Predicted 3,Predicted 4
Actual 1,92,88,3,5
Actual 2,27,129,21,12
Actual 3,3,55,42,51
Actual 4,3,29,27,150


In [140]:
acc_score_new = accuracy_score(y_new_test, predictions_new)
display(cm_new_df)
print(f"Accuracy Score : {acc_score_new}")
print("Classification Report")
print(classification_report(y_new_test, predictions_new))

Unnamed: 0,Predicted 1,Predicted 2,Predicted 3,Predicted 4
Actual 1,92,88,3,5
Actual 2,27,129,21,12
Actual 3,3,55,42,51
Actual 4,3,29,27,150


Accuracy Score : 0.5603799185888738
Classification Report
              precision    recall  f1-score   support

           1       0.74      0.49      0.59       188
           2       0.43      0.68      0.53       189
           3       0.45      0.28      0.34       151
           4       0.69      0.72      0.70       209

    accuracy                           0.56       737
   macro avg       0.58      0.54      0.54       737
weighted avg       0.59      0.56      0.55       737
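
Since the four classes are ordered, it can be informative to ask how often a prediction is at most one class away from the truth. The sketch below computes that from the confusion matrix above; "within-one accuracy" is an extra diagnostic introduced here, not a metric reported by the notebook.

```python
import numpy as np

# Confusion matrix from the output above (rows = actual, columns = predicted).
cm = np.array([
    [92, 88, 3, 5],
    [27, 129, 21, 12],
    [3, 55, 42, 51],
    [3, 29, 27, 150],
])

# Row and column index grids let us select cells by class distance.
actual, predicted = np.indices(cm.shape)
exact = cm[actual == predicted].sum()
within_one = cm[np.abs(actual - predicted) <= 1].sum()
total = cm.sum()

print(f"exact accuracy      : {exact / total:.4f}")
print(f"within-one accuracy : {within_one / total:.4f}")
```

Most errors fall in an adjacent class, which suggests the model separates low-risk from high-risk counties better than the exact accuracy alone indicates.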

