<a href="https://colab.research.google.com/github/Kianjputnam/project_chd/blob/main/Chd_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Import libraries**
******

In [2]:
import pandas as pd
import numpy as np

##**Load and inspect data**
*********

In [1]:
! git clone https://www.github.com/Kianjputnam/project_chd

Cloning into 'project_chd'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 13 (delta 3), reused 10 (delta 3), pack-reused 2
Receiving objects: 100% (13/13), 638.57 KiB | 4.95 MiB/s, done.
Resolving deltas: 100% (3/3), done.


In [5]:
chd_test = pd.read_csv("./project_chd/fhs_test.csv")

In [7]:
chd_train = pd.read_csv("./project_chd/fhs_train.csv")

##**Test cleaning**
**********

Predictive models generally do not handle missing (NA) values well, so I made the following decisions:

* Categorical variables: dropped rows with NA values and mapped the integer codes to readable category labels in new columns. The coded columns stay available for the models, and the labeled columns are used for visuals.
* Numerical variables: replaced NA values with the column median.

Dropping the rows with NA values in categorical variables removed only about 3% of the testing data (36 of 1,060 rows).
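The policy above can be sketched as a single helper. This is a minimal illustration on a toy frame, not the cells this notebook actually runs; the names `clean_frame`, `categorical_cols`, and `numeric_cols` are hypothetical and introduced only for this example.

```python
import numpy as np
import pandas as pd

def clean_frame(df, categorical_cols, numeric_cols):
    """Drop rows with missing categorical values, then median-fill numeric columns."""
    # Work on a copy so the caller's frame is left untouched
    out = df.dropna(subset=categorical_cols).copy()
    for col in numeric_cols:
        # The median is computed on the rows that survived the drop
        out[col] = out[col].fillna(out[col].median())
    return out

# Toy data standing in for the real CSV (values are made up)
toy = pd.DataFrame({
    "education": [1.0, np.nan, 3.0, 2.0],
    "BPMeds": [0.0, 0.0, np.nan, 1.0],
    "glucose": [70.0, 80.0, 90.0, np.nan],
})
cleaned = clean_frame(toy, ["education", "BPMeds"], ["glucose"])
print(cleaned)
```

Because the rows are dropped first, the medians are taken over the remaining rows only, which matches the order of the cells further down.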

###Exploration
****************

In [6]:
chd_test.head()

Unnamed: 0.1,Unnamed: 0,sex,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,674,0,58,1.0,1,20.0,0.0,0,0,0,,126.0,77.0,30.08,78.0,,0
1,4070,0,51,3.0,0,0.0,0.0,0,0,0,264.0,135.0,83.0,26.68,60.0,74.0,0
2,3150,0,44,2.0,1,9.0,0.0,0,1,0,,147.5,96.0,30.57,78.0,,1
3,1695,0,40,2.0,1,20.0,0.0,0,0,0,271.0,138.5,88.0,27.24,80.0,,1
4,2692,1,58,2.0,1,20.0,0.0,0,0,0,207.0,110.0,80.0,23.55,78.0,78.0,0


In [10]:
print(chd_test.shape)
print(chd_test.dtypes)

(1060, 17)
Unnamed: 0           int64
sex                  int64
age                  int64
education          float64
currentSmoker        int64
cigsPerDay         float64
BPMeds             float64
prevalentStroke      int64
prevalentHyp         int64
diabetes             int64
totChol            float64
sysBP              float64
diaBP              float64
BMI                float64
heartRate          float64
glucose            float64
TenYearCHD           int64
dtype: object
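Columns such as `education` and `BPMeds` hold integer codes but are stored as `float64`, which in pandas is typically a sign that they contain NaN. A quick way to confirm this for every column at once is a missing-value summary. The sketch below uses a small made-up frame in place of `chd_test`; `missing` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for chd_test (values are made up)
toy = pd.DataFrame({
    "education": [1.0, np.nan, 3.0, 2.0],
    "glucose": [np.nan, 74.0, np.nan, 78.0],
    "sex": [0, 0, 1, 1],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})

# Keep only the columns that actually have gaps, largest first
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```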


###Cleaning
**********

#####sex
*******

Decision: no na's to drop, mapped the 0/1 codes to categories

In [57]:
chd_test['sex'].value_counts()

sex
0    617
1    443
Name: count, dtype: int64

In [58]:
chd_test['sex'].isna().sum()

0

In [95]:
# Map numerical values to categories
sex_map = {0: 'female', 1: 'male'}

# Create a new column 'sex_category' based on the mapping
chd_test['sex_category'] = chd_test['sex'].map(sex_map)

#####age
*********

Decision: no na's to drop, no changes to be made

In [17]:
chd_test['age'].value_counts()

age
40    54
44    50
46    45
51    45
39    45
43    43
42    41
50    40
57    39
41    38
45    37
52    36
53    35
47    34
48    34
59    33
55    32
49    32
38    32
56    30
37    29
54    29
64    27
58    26
63    25
62    24
36    24
61    23
60    21
67    14
35    12
66     9
68     8
65     7
34     4
69     3
Name: count, dtype: int64

In [60]:
chd_test['age'].isna().sum()

0

#####education
**********

Decision: dropped na's, mapped the 1-4 education codes to categories

In [18]:
chd_test['education'].value_counts()

education
1.0    410
2.0    304
3.0    194
4.0    132
Name: count, dtype: int64

In [62]:
chd_test['education'].isna().sum()

20

In [96]:
chd_test.dropna(subset=['education'], inplace=True)

In [99]:
# Map numerical values to categories
education_map = {1: 'some HS', 2: 'HS/GED', 3: 'some college/vocation', 4: 'college'}

# Create a new column based on the mapping
chd_test['edu_category'] = chd_test['education'].map(education_map)
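`Series.map` with a dictionary returns NaN for any value that is not a key, so it is worth checking that nothing was left unmapped. Since the education levels have a natural order, the labels can also be stored as an ordered categorical. The sketch below reuses `education_map` from the cell above on a made-up series; `codes`, `labels`, `n_unmapped`, and `edu_ordered` are hypothetical names, and the ordered categorical is an optional extra, not something this notebook does.

```python
import pandas as pd

education_map = {1: "some HS", 2: "HS/GED", 3: "some college/vocation", 4: "college"}

# Toy education codes, stored as floats like the real column
codes = pd.Series([1.0, 2.0, 4.0, 3.0, 1.0], name="education")

labels = codes.map(education_map)

# Any code missing from the dictionary would show up here as NaN
n_unmapped = int(labels.isna().sum())
print("unmapped values:", n_unmapped)

# Optional: keep the low-to-high ordering of the levels
edu_ordered = pd.Categorical(
    labels, categories=list(education_map.values()), ordered=True
)
print(edu_ordered)
```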

#####currentSmoker
**********

Decision: no na's, mapped the 0/1 codes to categorical

In [19]:
chd_test['currentSmoker'].value_counts()

currentSmoker
1    534
0    526
Name: count, dtype: int64

In [63]:
chd_test['currentSmoker'].isna().sum()

0

In [100]:
# Map numerical values to categories
smoker_map = {0: 'no smoking', 1: 'smoking'}

# Create a new column based on the mapping
chd_test['smoker_category'] = chd_test['currentSmoker'].map(smoker_map)

#####cigsPerDay
**************

Decision: filled na's with the median value

In [20]:
chd_test['cigsPerDay'].value_counts()

cigsPerDay
0.0     526
20.0    206
15.0     51
30.0     46
9.0      36
10.0     35
3.0      27
5.0      26
25.0     22
43.0     16
1.0      14
40.0     13
35.0      6
4.0       5
2.0       5
17.0      3
7.0       3
6.0       3
13.0      2
45.0      2
60.0      2
8.0       1
12.0      1
50.0      1
18.0      1
23.0      1
14.0      1
Name: count, dtype: int64

In [64]:
chd_test['cigsPerDay'].isna().sum()

5

In [101]:
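median_value = chd_test['cigsPerDay'].median()

# Replace missing values with the median. Assign the result back:
# fillna(inplace=True) on a selected column may not update chd_test in newer pandas.
chd_test['cigsPerDay'] = chd_test['cigsPerDay'].fillna(median_value)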

#####BPMeds
***********

Decision: removed na's, mapped the 0/1 codes to categorical

In [21]:
chd_test['BPMeds'].value_counts()

BPMeds
0.0    1013
1.0      31
Name: count, dtype: int64

In [65]:
chd_test['BPMeds'].isna().sum()

16

In [103]:
chd_test.dropna(subset=['BPMeds'], inplace=True)

In [104]:
# Map numerical values to categories
BPMeds_map = {0: 'no meds', 1: 'on meds'}

# Create a new column based on the mapping
chd_test['BPMeds_category'] = chd_test['BPMeds'].map(BPMeds_map)

#####prevalentStroke
****************

Decision: no na's, mapped the 0/1 codes to categorical

In [22]:
chd_test['prevalentStroke'].value_counts()

prevalentStroke
0    1056
1       4
Name: count, dtype: int64

In [66]:
chd_test['prevalentStroke'].isna().sum()

0

In [105]:
# Map numerical values to categories
stroke_map = {0: 'no stroke', 1: 'had stroke'}

# Create a new column based on the mapping
chd_test['stroke_category'] = chd_test['prevalentStroke'].map(stroke_map)

#####prevalentHyp
*************

Decision: no na's, mapped the 0/1 codes to categorical

In [23]:
chd_test['prevalentHyp'].value_counts()

prevalentHyp
0    764
1    296
Name: count, dtype: int64

In [67]:
chd_test['prevalentHyp'].isna().sum()

0

In [106]:
# Map numerical values to categories
hyp_map = {0: 'no hyp', 1: 'hyp'}

# Create a new column based on the mapping
chd_test['hyp_category'] = chd_test['prevalentHyp'].map(hyp_map)

#####diabetes
*************

Decision: no na's, mapped the 0/1 codes to categorical

In [24]:
chd_test['diabetes'].value_counts()

diabetes
0    1034
1      26
Name: count, dtype: int64

In [68]:
chd_test['diabetes'].isna().sum()

0

In [107]:
# Map numerical values to categories
diabetes_map = {0: 'no diabetes', 1: 'diabetes'}

# Create a new column based on the mapping
chd_test['diabetes_category'] = chd_test['diabetes'].map(diabetes_map)

#####totChol
************

Decision: replace na's with median value

In [25]:
chd_test['totChol'].value_counts()

totChol
240.0    22
220.0    20
226.0    16
250.0    16
210.0    16
         ..
382.0     1
314.0     1
322.0     1
327.0     1
184.0     1
Name: count, Length: 196, dtype: int64

In [69]:
chd_test['totChol'].isna().sum()

11

In [108]:
median_value_chol = chd_test['totChol'].median()

# Replace missing values with the median (assign back instead of in-place fillna)
chd_test['totChol'] = chd_test['totChol'].fillna(median_value_chol)

#####sysBP
***********

Decision: no na's, no mapping needed

In [26]:
chd_test['sysBP'].value_counts()

sysBP
120.0    37
130.0    31
119.0    29
110.0    24
122.0    23
         ..
164.5     1
135.5     1
186.5     1
191.0     1
145.5     1
Name: count, Length: 167, dtype: int64

In [70]:
chd_test['sysBP'].isna().sum()

0

#####diaBP
***********

Decision: no na's, no mapping needed

In [27]:
chd_test['diaBP'].value_counts()

diaBP
80.0     64
82.0     46
78.0     41
70.0     38
72.0     36
         ..
88.5      1
103.5     1
69.5      1
100.5     1
58.0      1
Name: count, Length: 117, dtype: int64

In [71]:
chd_test['diaBP'].isna().sum()

0

#####BMI
**************

Decision: replace na's with median value

In [28]:
chd_test['BMI'].value_counts()

BMI
22.19    8
25.38    7
25.94    7
22.18    6
22.54    6
        ..
31.50    1
24.36    1
30.46    1
20.19    1
42.15    1
Name: count, Length: 666, dtype: int64

In [72]:
chd_test['BMI'].isna().sum()

4

In [109]:
median_value_BMI = chd_test['BMI'].median()

# Replace missing values with the median (assign back instead of in-place fillna)
chd_test['BMI'] = chd_test['BMI'].fillna(median_value_BMI)

#####heartRate
***********

Decision: replace na with median value

In [29]:
chd_test['heartRate'].value_counts()

heartRate
75.0     135
80.0     112
70.0      79
85.0      58
60.0      57
72.0      53
65.0      51
90.0      43
68.0      37
78.0      26
67.0      23
62.0      22
100.0     22
95.0      21
63.0      20
64.0      20
83.0      19
66.0      17
58.0      16
88.0      15
77.0      15
69.0      14
96.0      14
76.0      13
82.0      11
73.0      11
55.0      10
71.0       9
87.0       9
92.0       9
79.0       9
56.0       7
94.0       7
74.0       7
110.0      7
84.0       6
52.0       6
105.0      5
57.0       5
86.0       5
81.0       5
50.0       3
102.0      3
108.0      3
45.0       2
98.0       2
91.0       2
48.0       2
107.0      1
61.0       1
59.0       1
122.0      1
104.0      1
103.0      1
112.0      1
101.0      1
89.0       1
143.0      1
125.0      1
120.0      1
Name: count, dtype: int64

In [73]:
chd_test['heartRate'].isna().sum()

1

In [110]:
median_value_hr = chd_test['heartRate'].median()

# Replace missing values with the median (assign back instead of in-place fillna)
chd_test['heartRate'] = chd_test['heartRate'].fillna(median_value_hr)

#####glucose
**************

Decision: converted na's to median value

In [30]:
chd_test['glucose'].value_counts()

glucose
76.0     47
75.0     46
83.0     41
77.0     41
78.0     40
         ..
143.0     1
163.0     1
119.0     1
205.0     1
145.0     1
Name: count, Length: 90, dtype: int64

In [74]:
chd_test['glucose'].isna().sum()

103

In [111]:
median_value_gluc = chd_test['glucose'].median()

# Replace missing values with the median (assign back instead of in-place fillna)
chd_test['glucose'] = chd_test['glucose'].fillna(median_value_gluc)
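The cells above compute each median from `chd_test` itself. A common alternative, shown here only as a sketch and not as what this notebook does, is to compute the medians on the training data and reuse them for the test data, so that the test set does not inform the preprocessing. The frames and the names `train`, `test`, `train_medians`, and `test_filled` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for chd_train and chd_test (values are made up)
train = pd.DataFrame({
    "glucose": [70.0, 80.0, 90.0, np.nan],
    "BMI": [22.0, 26.0, np.nan, 30.0],
})
test = pd.DataFrame({
    "glucose": [np.nan, 100.0],
    "BMI": [25.0, np.nan],
})

numeric_cols = ["glucose", "BMI"]

# Medians come from the training data only; NaN is skipped by default
train_medians = train[numeric_cols].median()

# fillna with a Series fills each column with the value under the matching label
test_filled = test.copy()
test_filled[numeric_cols] = test_filled[numeric_cols].fillna(train_medians)
print(test_filled)
```

Glucose is the column where this choice matters most here, since it has by far the most missing values (103 in the test set).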

#####TenYearCHD
**********

Decision: no na's, mapped the 0/1 codes to categorical

In [31]:
chd_test['TenYearCHD'].value_counts()

TenYearCHD
0    903
1    157
Name: count, dtype: int64

In [75]:
chd_test['TenYearCHD'].isna().sum()

0

In [112]:
# Map numerical values to categories
chd_map = {0: 'no chd', 1: 'chd'}

# Create a new column based on the mapping
chd_test['chd_category'] = chd_test['TenYearCHD'].map(chd_map)

###Inspection
********

In [113]:
chd_test.head()

Unnamed: 0.1,Unnamed: 0,sex,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,...,glucose,TenYearCHD,sex_category,edu_category,smoker_category,BPMeds_category,stroke_category,hyp_category,diabetes_category,chd_category
0,674,0,58,1.0,1,20.0,0.0,0,0,0,...,78.0,0,female,some HS,smoking,no meds,no stroke,no hyp,no diabetes,no chd
1,4070,0,51,3.0,0,0.0,0.0,0,0,0,...,74.0,0,female,some college/vocation,no smoking,no meds,no stroke,no hyp,no diabetes,no chd
2,3150,0,44,2.0,1,9.0,0.0,0,1,0,...,78.0,1,female,HS/GED,smoking,no meds,no stroke,hyp,no diabetes,chd
3,1695,0,40,2.0,1,20.0,0.0,0,0,0,...,78.0,1,female,HS/GED,smoking,no meds,no stroke,no hyp,no diabetes,chd
4,2692,1,58,2.0,1,20.0,0.0,0,0,0,...,78.0,0,male,HS/GED,smoking,no meds,no stroke,no hyp,no diabetes,no chd


In [116]:
chd_test.shape

(1024, 25)
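The final shape can be checked against the counts reported earlier. The arithmetic below uses only numbers printed in this notebook: 1,060 starting rows, 20 rows missing `education`, 16 rows missing `BPMeds`, 17 original columns, and 8 new `*_category` columns. It assumes no row was missing both `education` and `BPMeds`, which the final row count of 1,024 is consistent with.

```python
n_start = 1060          # rows in chd_test before cleaning
dropped_education = 20  # rows with missing education
dropped_bpmeds = 16     # rows with missing BPMeds (counted before any rows were dropped)

# Assumes the two sets of dropped rows do not overlap
n_expected = n_start - dropped_education - dropped_bpmeds
pct_removed = 100 * (n_start - n_expected) / n_start

n_cols_expected = 17 + 8  # original columns plus the new *_category columns

print(n_expected, round(pct_removed, 1), n_cols_expected)  # prints: 1024 3.4 25
```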

In [115]:
for each in chd_test.columns:
    print(f"{each}: {chd_test[each].isnull().sum()}")

Unnamed: 0: 0
sex: 0
age: 0
education: 0
currentSmoker: 0
cigsPerDay: 0
BPMeds: 0
prevalentStroke: 0
prevalentHyp: 0
diabetes: 0
totChol: 0
sysBP: 0
diaBP: 0
BMI: 0
heartRate: 0
glucose: 0
TenYearCHD: 0
sex_category: 0
edu_category: 0
smoker_category: 0
BPMeds_category: 0
stroke_category: 0
hyp_category: 0
diabetes_category: 0
chd_category: 0


##**Training cleaning**
*********

###Exploration
*************

In [8]:
chd_train.head()

Unnamed: 0.1,Unnamed: 0,sex,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1267,1,58,1.0,0,0.0,0.0,0,0,0,220.0,143.0,104.0,29.85,75,87.0,1
1,1209,0,40,1.0,1,15.0,0.0,0,0,0,199.0,122.0,82.0,22.16,85,77.0,0
2,2050,0,52,1.0,0,0.0,0.0,0,0,0,275.0,112.0,71.0,25.68,80,,0
3,1183,1,38,2.0,1,43.0,0.0,0,1,0,170.0,130.0,94.0,23.9,110,75.0,0
4,3225,0,43,1.0,0,0.0,0.0,0,0,0,202.0,124.0,92.0,21.26,75,74.0,0


In [13]:
print(chd_train.shape)
print(chd_train.dtypes)

(3180, 17)
Unnamed: 0           int64
sex                  int64
age                  int64
education          float64
currentSmoker        int64
cigsPerDay         float64
BPMeds             float64
prevalentStroke      int64
prevalentHyp         int64
diabetes             int64
totChol            float64
sysBP              float64
diaBP              float64
BMI                float64
heartRate            int64
glucose            float64
TenYearCHD           int64
dtype: object


In [14]:
for each in chd_train.columns:
    print(f"{each}: {chd_train[each].isnull().sum()}")

Unnamed: 0: 0
sex: 0
age: 0
education: 85
currentSmoker: 0
cigsPerDay: 24
BPMeds: 37
prevalentStroke: 0
prevalentHyp: 0
diabetes: 0
totChol: 39
sysBP: 0
diaBP: 0
BMI: 15
heartRate: 0
glucose: 285
TenYearCHD: 0


###Cleaning
***********

#####sex
*********

Decision:

In [32]:
chd_train['sex'].value_counts()

sex
0    1803
1    1377
Name: count, dtype: int64

In [78]:
chd_train['sex'].isna().sum()

0

#####age
**********

Decision:

In [77]:
chd_train['age'].value_counts()

age
48    139
42    139
40    138
46    137
41    136
39    125
45    125
43    116
44    116
55    113
52    113
38    112
47    107
53    104
54    103
51    101
50    100
49    100
56     93
58     91
60     90
61     87
59     86
63     85
57     84
62     75
64     66
37     63
36     60
65     50
67     31
35     30
66     29
34     14
68     10
33      5
69      4
70      2
32      1
Name: count, dtype: int64

In [79]:
chd_train['age'].isna().sum()

0

#####education
************

Decision:

In [34]:
chd_train['education'].value_counts()

education
1.0    1310
2.0     949
3.0     495
4.0     341
Name: count, dtype: int64

In [80]:
chd_train['education'].isna().sum()

85

#####currentSmoker
***********

Decision:

In [35]:
chd_train['currentSmoker'].value_counts()

currentSmoker
0    1619
1    1561
Name: count, dtype: int64

In [81]:
chd_train['currentSmoker'].isna().sum()

0

#####cigsPerDay
**************

Decision:

In [36]:
chd_train['cigsPerDay'].value_counts()

cigsPerDay
0.0     1619
20.0     528
30.0     172
15.0     159
10.0     108
5.0       95
9.0       94
3.0       73
40.0      67
1.0       53
43.0      40
25.0      33
35.0      16
6.0       15
2.0       13
8.0       10
60.0       9
7.0        9
18.0       7
23.0       5
11.0       5
50.0       5
4.0        4
17.0       4
16.0       3
12.0       2
19.0       2
29.0       1
14.0       1
70.0       1
45.0       1
38.0       1
13.0       1
Name: count, dtype: int64

In [82]:
chd_train['cigsPerDay'].isna().sum()

24

#####BPMeds
**********

Decision:

In [37]:
chd_train['BPMeds'].value_counts()

BPMeds
0.0    3050
1.0      93
Name: count, dtype: int64

In [83]:
chd_train['BPMeds'].isna().sum()

37

#####prevalentStroke
**********

Decision:

In [38]:
chd_train['prevalentStroke'].value_counts()

prevalentStroke
0    3159
1      21
Name: count, dtype: int64

In [84]:
chd_train['prevalentStroke'].isna().sum()

0

#####prevalentHyp
************

Decision:

In [39]:
chd_train['prevalentHyp'].value_counts()

prevalentHyp
0    2159
1    1021
Name: count, dtype: int64

In [85]:
chd_train['prevalentHyp'].isna().sum()

0

#####diabetes
**********

Decision:

In [40]:
chd_train['diabetes'].value_counts()

diabetes
0    3097
1      83
Name: count, dtype: int64

In [86]:
chd_train['diabetes'].isna().sum()

0

#####totChol
***********

Decision:

In [41]:
chd_train['totChol'].value_counts()

totChol
240.0    63
220.0    50
232.0    49
260.0    48
210.0    45
         ..
385.0     1
359.0     1
355.0     1
119.0     1
392.0     1
Name: count, Length: 243, dtype: int64

In [87]:
chd_train['totChol'].isna().sum()

39

#####sysBP
***********

Decision:

In [42]:
chd_train['sysBP'].value_counts()

sysBP
110.0    72
130.0    71
120.0    70
115.0    69
125.0    68
         ..
213.0     1
187.5     1
85.0      1
201.0     1
169.5     1
Name: count, Length: 229, dtype: int64

In [88]:
chd_train['sysBP'].isna().sum()

0

#####diaBP
**************

Decision:

In [43]:
chd_train['diaBP'].value_counts()

diaBP
80.0     198
82.0     106
84.0     102
85.0     102
81.0      98
        ... 
115.5      1
135.0      1
110.5      1
129.0      1
122.5      1
Name: count, Length: 139, dtype: int64

In [89]:
chd_train['diaBP'].isna().sum()

0

#####BMI
************

Decision:

In [44]:
chd_train['BMI'].value_counts()

BMI
23.48    17
22.91    15
22.54    12
23.09    12
25.09    11
         ..
17.65     1
29.46     1
32.81     1
23.83     1
26.78     1
Name: count, Length: 1234, dtype: int64

In [90]:
chd_train['BMI'].isna().sum()

15

#####heartRate
**********

Decision:

In [45]:
chd_train['heartRate'].value_counts()

heartRate
75     428
80     273
70     226
60     174
85     170
      ... 
51       1
46       1
97       1
47       1
130      1
Name: count, Length: 70, dtype: int64

In [91]:
chd_train['heartRate'].isna().sum()

0

#####glucose
************

Decision:

In [46]:
chd_train['glucose'].value_counts()

glucose
75.0     147
77.0     126
73.0     119
80.0     117
70.0     113
        ... 
348.0      1
119.0      1
320.0      1
136.0      1
186.0      1
Name: count, Length: 130, dtype: int64

In [92]:
chd_train['glucose'].isna().sum()

285

#####TenYearCHD
*************

Decision:

In [47]:
chd_train['TenYearCHD'].value_counts()

TenYearCHD
0    2693
1     487
Name: count, dtype: int64

In [93]:
chd_train['TenYearCHD'].isna().sum()

0
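The training decisions above are still blank. If the training set ends up following the same mapping policy as the test set (an assumption, since no decision is recorded), the code-to-label maps from the test cells could be applied in one loop rather than one cell per column. The sketch below runs on a made-up frame; `label_maps` and `toy_train` are hypothetical names, and only three of the maps are shown.

```python
import pandas as pd

# Source column -> (new column name, code-to-label map), copied from the test-set cells
label_maps = {
    "sex": ("sex_category", {0: "female", 1: "male"}),
    "currentSmoker": ("smoker_category", {0: "no smoking", 1: "smoking"}),
    "TenYearCHD": ("chd_category", {0: "no chd", 1: "chd"}),
}

# Toy stand-in for chd_train (values are made up)
toy_train = pd.DataFrame({
    "sex": [1, 0],
    "currentSmoker": [0, 1],
    "TenYearCHD": [1, 0],
})

# Add one labeled column per coded column, leaving the codes in place
for source_col, (new_col, mapping) in label_maps.items():
    toy_train[new_col] = toy_train[source_col].map(mapping)

print(toy_train)
```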