<a href="https://colab.research.google.com/github/Kianjputnam/project_chd/blob/main/Chd_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Import libraries**
******

In [241]:
import pandas as pd
import numpy as np

##**Load and inspect data**
*********

In [242]:
! git clone https://www.github.com/Kianjputnam/project_chd

fatal: destination path 'project_chd' already exists and is not an empty directory.
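
The `fatal` message above appears because the clone target already exists when the cell is re-run. Below is a minimal sketch of one way to make the step re-runnable. The helper `clone_if_missing` is hypothetical (it is not part of this notebook), and the sketch assumes `git` is available on the PATH.

```python
import os
import subprocess
import tempfile

def clone_if_missing(url, dest):
    """Clone url into dest unless dest already exists; return True if a clone ran."""
    if os.path.isdir(dest):
        return False  # skip, which avoids the "destination path already exists" error
    subprocess.run(["git", "clone", url, dest], check=True)
    return True

# Demo on a directory that already exists: the clone is skipped, no network access.
with tempfile.TemporaryDirectory() as existing_dir:
    skipped_result = clone_if_missing(
        "https://www.github.com/Kianjputnam/project_chd", existing_dir
    )
print(skipped_result)
```

In the notebook itself the call would presumably be `clone_if_missing(url, "project_chd")` in place of the shell command.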


In [243]:
chd_test = pd.read_csv("./project_chd/fhs_test.csv")

In [244]:
chd_train = pd.read_csv("./project_chd/fhs_train.csv")

Because predictive models generally handle missing (NA) values poorly, I made the following decisions for each variable type:

* Categorical variables: dropped rows with NA values and mapped the integer codes to descriptive labels in new columns. The models can then run on the original numeric codes, while the visuals can use the labeled categories.
* Numerical variables: replaced NA values with the column median.

Note: dropping rows with NA values in the categorical variables removed only about 3.4% of the testing data (36 of 1,060 rows) and 3.8% of the training data (122 of 3,180 rows).
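
The two rules above can be sketched as a single helper. This is an illustration only, not the code this notebook runs: the function name `clean_frame`, its parameters, and the toy data are assumptions. It also drops all rows before computing any median, whereas the cells below interleave the two steps.

```python
import pandas as pd

def clean_frame(df, categorical_maps, numeric_cols):
    """Drop rows with missing categorical codes, add label columns, median-fill numerics."""
    out = df.dropna(subset=list(categorical_maps)).copy()
    for col, (new_col, mapping) in categorical_maps.items():
        out[new_col] = out[col].map(mapping)  # keep the codes, add readable labels
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    return out

# Toy data (hypothetical values) with one missing entry per column.
toy = pd.DataFrame({
    "BPMeds": [0.0, 1.0, None, 0.0],
    "BMI": [22.0, None, 30.0, 26.0],
})
cleaned = clean_frame(
    toy,
    categorical_maps={"BPMeds": ("BPMeds_category", {0: "no meds", 1: "on meds"})},
    numeric_cols=["BMI"],
)
print(cleaned)
```

The row with a missing `BPMeds` is removed first, so the `BMI` median is computed only over the rows that remain.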

##**Test cleaning**
**********

###Exploration
****************

In [245]:
chd_test.head()

Unnamed: 0.1,Unnamed: 0,sex,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,674,0,58,1.0,1,20.0,0.0,0,0,0,,126.0,77.0,30.08,78.0,,0
1,4070,0,51,3.0,0,0.0,0.0,0,0,0,264.0,135.0,83.0,26.68,60.0,74.0,0
2,3150,0,44,2.0,1,9.0,0.0,0,1,0,,147.5,96.0,30.57,78.0,,1
3,1695,0,40,2.0,1,20.0,0.0,0,0,0,271.0,138.5,88.0,27.24,80.0,,1
4,2692,1,58,2.0,1,20.0,0.0,0,0,0,207.0,110.0,80.0,23.55,78.0,78.0,0


In [246]:
print(chd_test.shape)
print(chd_test.dtypes)

(1060, 17)
Unnamed: 0           int64
sex                  int64
age                  int64
education          float64
currentSmoker        int64
cigsPerDay         float64
BPMeds             float64
prevalentStroke      int64
prevalentHyp         int64
diabetes             int64
totChol            float64
sysBP              float64
diaBP              float64
BMI                float64
heartRate          float64
glucose            float64
TenYearCHD           int64
dtype: object


###Cleaning
**********

#####sex
*******

Decision: no na's to drop, mapped the 0/1 codes to categories

In [247]:
chd_test['sex'].value_counts()

sex
0    617
1    443
Name: count, dtype: int64

In [248]:
chd_test['sex'].isna().sum()

0

In [249]:
# Map numerical values to categories
sex_map = {0: 'female', 1: 'male'}

# Create a new column 'sex_category' based on the mapping
chd_test['sex_category'] = chd_test['sex'].map(sex_map)
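
`Series.map` returns NaN for any value that is missing from the dictionary, so an unexpected code would silently become a missing label. A small sketch of a check for that, using toy data with a deliberately invalid code (the value `2` is made up for illustration):

```python
import pandas as pd

sex_map = {0: 'female', 1: 'male'}

# Toy codes; 2 is not a valid code and stands in for a data-entry error.
codes = pd.Series([0, 1, 2])
labels = codes.map(sex_map)

# Codes that the mapping could not translate
unmapped = codes[labels.isna()].unique().tolist()
print(unmapped)
```

An empty list means every code was translated.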

#####age
*********

Decision: no na's to drop, no changes to be made

In [250]:
chd_test['age'].value_counts()

age
40    54
44    50
46    45
51    45
39    45
43    43
42    41
50    40
57    39
41    38
45    37
52    36
53    35
47    34
48    34
59    33
55    32
49    32
38    32
56    30
37    29
54    29
64    27
58    26
63    25
62    24
36    24
61    23
60    21
67    14
35    12
66     9
68     8
65     7
34     4
69     3
Name: count, dtype: int64

In [251]:
chd_test['age'].isna().sum()

0

#####education
**********

Decision: drop na's, mapped the ordinal codes (1-4) to categories

In [252]:
chd_test['education'].value_counts()

education
1.0    410
2.0    304
3.0    194
4.0    132
Name: count, dtype: int64

In [253]:
chd_test['education'].isna().sum()

20

In [254]:
chd_test.dropna(subset=['education'], inplace=True)

In [255]:
# Map numerical values to categories
education_map = {1: 'some HS', 2: 'HS/GED', 3: 'some college/vocation', 4: 'college'}

# Create a new column based on the mapping
chd_test['edu_category'] = chd_test['education'].map(education_map)
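
The education codes have a natural order, which plain string labels lose (they sort alphabetically). If later visuals should keep that order, one option is an ordered categorical. This is a sketch on toy data; treating the codes as ordered from 1 (lowest) to 4 (highest) is an assumption based on the labels.

```python
import pandas as pd

education_map = {1: 'some HS', 2: 'HS/GED', 3: 'some college/vocation', 4: 'college'}

# Toy codes in arbitrary order
codes = pd.Series([2.0, 1.0, 4.0, 3.0])
labels = codes.map(education_map)

# Category order follows the order of the codes in education_map
ordered = pd.Categorical(
    labels, categories=list(education_map.values()), ordered=True
)
print(ordered.min(), ordered.max())
```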

#####currentSmoker
**********

Decision: no na's, mapped 0/1 codes to categorical

In [256]:
chd_test['currentSmoker'].value_counts()

currentSmoker
1    523
0    517
Name: count, dtype: int64

In [257]:
chd_test['currentSmoker'].isna().sum()

0

In [258]:
# Map numerical values to categories
smoker_map = {0: 'no smoking', 1: 'smoking'}

# Create a new column based on the mapping
chd_test['smoker_category'] = chd_test['currentSmoker'].map(smoker_map)

#####cigsPerDay
**************

Decision: filled na's with the median score

In [259]:
chd_test['cigsPerDay'].value_counts()

cigsPerDay
0.0     517
20.0    202
15.0     51
30.0     45
9.0      34
10.0     34
3.0      27
5.0      25
25.0     21
43.0     16
1.0      14
40.0     13
35.0      6
2.0       5
4.0       4
17.0      3
7.0       3
6.0       3
13.0      2
45.0      2
60.0      2
8.0       1
12.0      1
50.0      1
18.0      1
23.0      1
14.0      1
Name: count, dtype: int64

In [260]:
chd_test['cigsPerDay'].isna().sum()

5

In [261]:
median_value = chd_test['cigsPerDay'].median()

# Replace missing values with the median
chd_test['cigsPerDay'] = chd_test['cigsPerDay'].fillna(median_value)

#####BPMeds
***********

Decision: remove na's, mapped 0/1 codes to categorical

In [262]:
chd_test['BPMeds'].value_counts()

BPMeds
0.0    993
1.0     31
Name: count, dtype: int64

In [263]:
chd_test['BPMeds'].isna().sum()

16

In [264]:
chd_test.dropna(subset=['BPMeds'], inplace=True)

In [265]:
# Map numerical values to categories
BPMeds_map = {0: 'no meds', 1: 'on meds'}

# Create a new column based on the mapping
chd_test['BPMeds_category'] = chd_test['BPMeds'].map(BPMeds_map)

#####prevalentStroke
****************

Decision: no na's, mapped 0/1 codes to categorical

In [266]:
chd_test['prevalentStroke'].value_counts()

prevalentStroke
0    1020
1       4
Name: count, dtype: int64

In [267]:
chd_test['prevalentStroke'].isna().sum()

0

In [268]:
# Map numerical values to categories
stroke_map = {0: 'no stroke', 1: 'had stroke'}

# Create a new column based on the mapping
chd_test['stroke_category'] = chd_test['prevalentStroke'].map(stroke_map)

#####prevalentHyp
*************

Decision: no na's, mapped 0/1 codes to categorical

In [269]:
chd_test['prevalentHyp'].value_counts()

prevalentHyp
0    740
1    284
Name: count, dtype: int64

In [270]:
chd_test['prevalentHyp'].isna().sum()

0

In [271]:
# Map numerical values to categories
hyp_map = {0: 'no hyp', 1: 'hyp'}

# Create a new column based on the mapping
chd_test['hyp_category'] = chd_test['prevalentHyp'].map(hyp_map)

#####diabetes
*************

Decision: no na's, mapped 0/1 codes to categorical

In [272]:
chd_test['diabetes'].value_counts()

diabetes
0    999
1     25
Name: count, dtype: int64

In [273]:
chd_test['diabetes'].isna().sum()

0

In [274]:
# Map numerical values to categories
diabetes_map = {0: 'no diabetes', 1: 'diabetes'}

# Create a new column based on the mapping
chd_test['diabetes_category'] = chd_test['diabetes'].map(diabetes_map)

#####totChol
************

Decision: replace na's with median value

In [275]:
chd_test['totChol'].value_counts()

totChol
240.0    20
220.0    19
250.0    16
210.0    16
226.0    16
         ..
314.0     1
366.0     1
322.0     1
327.0     1
184.0     1
Name: count, Length: 196, dtype: int64

In [276]:
chd_test['totChol'].isna().sum()

11

In [277]:
median_value_chol = chd_test['totChol'].median()

# Replace missing values with the median
chd_test['totChol'] = chd_test['totChol'].fillna(median_value_chol)

#####sysBP
***********

Decision: no na's, no mapping needed

In [278]:
chd_test['sysBP'].value_counts()

sysBP
120.0    37
130.0    30
119.0    28
110.0    23
122.0    22
         ..
164.5     1
135.5     1
186.5     1
99.5      1
145.5     1
Name: count, Length: 166, dtype: int64

In [279]:
chd_test['sysBP'].isna().sum()

0

#####diaBP
***********

Decision: no na's, no mapping needed

In [280]:
chd_test['diaBP'].value_counts()

diaBP
80.0     63
82.0     45
78.0     41
72.0     36
70.0     36
         ..
88.5      1
103.5     1
69.5      1
100.5     1
58.0      1
Name: count, Length: 116, dtype: int64

In [281]:
chd_test['diaBP'].isna().sum()

0

#####BMI
**************

Decision: replace na's with median value

In [282]:
chd_test['BMI'].value_counts()

BMI
22.19    8
25.38    7
22.18    6
25.94    6
22.54    6
        ..
23.02    1
37.62    1
23.75    1
31.50    1
42.15    1
Name: count, Length: 652, dtype: int64

In [283]:
chd_test['BMI'].isna().sum()

4

In [284]:
median_value_BMI = chd_test['BMI'].median()

# Replace missing values with the median
chd_test['BMI'] = chd_test['BMI'].fillna(median_value_BMI)

#####heartRate
***********

Decision: replace na with median value

In [285]:
chd_test['heartRate'].value_counts()

heartRate
75.0     132
80.0     106
70.0      77
85.0      55
60.0      55
65.0      50
72.0      48
90.0      41
68.0      37
78.0      26
67.0      23
100.0     22
95.0      21
62.0      21
63.0      20
83.0      18
64.0      17
66.0      16
58.0      16
77.0      15
88.0      15
96.0      14
69.0      13
76.0      13
82.0      11
73.0      11
55.0      10
87.0       9
71.0       9
92.0       9
79.0       8
110.0      7
74.0       7
56.0       6
52.0       6
84.0       6
94.0       6
86.0       5
105.0      5
81.0       4
57.0       4
108.0      3
102.0      3
50.0       3
48.0       2
91.0       2
98.0       2
45.0       2
125.0      1
103.0      1
143.0      1
89.0       1
101.0      1
112.0      1
107.0      1
104.0      1
122.0      1
59.0       1
61.0       1
120.0      1
Name: count, dtype: int64

In [286]:
chd_test['heartRate'].isna().sum()

1

In [287]:
median_value_hr = chd_test['heartRate'].median()

# Replace missing values with the median
chd_test['heartRate'] = chd_test['heartRate'].fillna(median_value_hr)

#####glucose
**************

Decision: converted na's to median value

In [288]:
chd_test['glucose'].value_counts()

glucose
76.0     45
75.0     44
83.0     41
77.0     40
78.0     40
         ..
57.0      1
325.0     1
177.0     1
386.0     1
145.0     1
Name: count, Length: 88, dtype: int64

In [289]:
chd_test['glucose'].isna().sum()

101

In [290]:
median_value_gluc = chd_test['glucose'].median()

# Replace missing values with the median
chd_test['glucose'] = chd_test['glucose'].fillna(median_value_gluc)

#####TenYearCHD
**********

Decision: no na's, mapped 0/1 codes to categorical

In [291]:
chd_test['TenYearCHD'].value_counts()

TenYearCHD
0    875
1    149
Name: count, dtype: int64

In [292]:
chd_test['TenYearCHD'].isna().sum()

0

In [293]:
# Map numerical values to categories
chd_map = {0: 'no chd', 1: 'chd'}

# Create a new column based on the mapping
chd_test['chd_category'] = chd_test['TenYearCHD'].map(chd_map)

####Inspection
********

In [294]:
chd_test.head()

Unnamed: 0.1,Unnamed: 0,sex,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,...,glucose,TenYearCHD,sex_category,edu_category,smoker_category,BPMeds_category,stroke_category,hyp_category,diabetes_category,chd_category
0,674,0,58,1.0,1,20.0,0.0,0,0,0,...,78.0,0,female,some HS,smoking,no meds,no stroke,no hyp,no diabetes,no chd
1,4070,0,51,3.0,0,0.0,0.0,0,0,0,...,74.0,0,female,some college/vocation,no smoking,no meds,no stroke,no hyp,no diabetes,no chd
2,3150,0,44,2.0,1,9.0,0.0,0,1,0,...,78.0,1,female,HS/GED,smoking,no meds,no stroke,hyp,no diabetes,chd
3,1695,0,40,2.0,1,20.0,0.0,0,0,0,...,78.0,1,female,HS/GED,smoking,no meds,no stroke,no hyp,no diabetes,chd
4,2692,1,58,2.0,1,20.0,0.0,0,0,0,...,78.0,0,male,HS/GED,smoking,no meds,no stroke,no hyp,no diabetes,no chd


In [295]:
chd_test.shape
#originally had 1060 observations

(1024, 25)

In [296]:
for each in chd_test.columns:
    print(f"{each}: {chd_test[each].isnull().sum()}")

Unnamed: 0: 0
sex: 0
age: 0
education: 0
currentSmoker: 0
cigsPerDay: 0
BPMeds: 0
prevalentStroke: 0
prevalentHyp: 0
diabetes: 0
totChol: 0
sysBP: 0
diaBP: 0
BMI: 0
heartRate: 0
glucose: 0
TenYearCHD: 0
sex_category: 0
edu_category: 0
smoker_category: 0
BPMeds_category: 0
stroke_category: 0
hyp_category: 0
diabetes_category: 0
chd_category: 0
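
The per-column loop above can also be written as one vectorized call, since `DataFrame.isna().sum()` returns the count of missing values for every column at once. A sketch on toy data (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({"age": [40, 51, 44], "glucose": [None, 74.0, None]})

# One Series of missing-value counts, indexed by column name
null_counts = toy.isna().sum()
print(null_counts)
```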


##**Training cleaning**
*********

###Exploration
*************

In [297]:
chd_train.head()

Unnamed: 0.1,Unnamed: 0,sex,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1267,1,58,1.0,0,0.0,0.0,0,0,0,220.0,143.0,104.0,29.85,75,87.0,1
1,1209,0,40,1.0,1,15.0,0.0,0,0,0,199.0,122.0,82.0,22.16,85,77.0,0
2,2050,0,52,1.0,0,0.0,0.0,0,0,0,275.0,112.0,71.0,25.68,80,,0
3,1183,1,38,2.0,1,43.0,0.0,0,1,0,170.0,130.0,94.0,23.9,110,75.0,0
4,3225,0,43,1.0,0,0.0,0.0,0,0,0,202.0,124.0,92.0,21.26,75,74.0,0


In [298]:
print(chd_train.shape)
print(chd_train.dtypes)

(3180, 17)
Unnamed: 0           int64
sex                  int64
age                  int64
education          float64
currentSmoker        int64
cigsPerDay         float64
BPMeds             float64
prevalentStroke      int64
prevalentHyp         int64
diabetes             int64
totChol            float64
sysBP              float64
diaBP              float64
BMI                float64
heartRate            int64
glucose            float64
TenYearCHD           int64
dtype: object


In [299]:
for each in chd_train.columns:
    print(f"{each}: {chd_train[each].isnull().sum()}")

Unnamed: 0: 0
sex: 0
age: 0
education: 85
currentSmoker: 0
cigsPerDay: 24
BPMeds: 37
prevalentStroke: 0
prevalentHyp: 0
diabetes: 0
totChol: 39
sysBP: 0
diaBP: 0
BMI: 15
heartRate: 0
glucose: 285
TenYearCHD: 0


###Cleaning
***********

#####sex
*********

Decision: no na's, mapped the 0/1 codes to categories

In [300]:
chd_train['sex'].value_counts()

sex
0    1803
1    1377
Name: count, dtype: int64

In [301]:
chd_train['sex'].isna().sum()

0

In [302]:
# Map numerical values to categories
sex_map = {0: 'female', 1: 'male'}

# Create a new column based on the mapping
chd_train['sex_category'] = chd_train['sex'].map(sex_map)

#####age
**********

Decision: no na's dropped, no changes needed

In [303]:
chd_train['age'].value_counts()

age
48    139
42    139
40    138
46    137
41    136
39    125
45    125
43    116
44    116
55    113
52    113
38    112
47    107
53    104
54    103
51    101
50    100
49    100
56     93
58     91
60     90
61     87
59     86
63     85
57     84
62     75
64     66
37     63
36     60
65     50
67     31
35     30
66     29
34     14
68     10
33      5
69      4
70      2
32      1
Name: count, dtype: int64

In [304]:
chd_train['age'].isna().sum()

0

#####education
************

Decision: drop na's, mapped the ordinal codes (1-4) to categories

In [305]:
chd_train['education'].value_counts()

education
1.0    1310
2.0     949
3.0     495
4.0     341
Name: count, dtype: int64

In [306]:
chd_train['education'].isna().sum()

85

In [307]:
chd_train.dropna(subset=['education'], inplace=True)

In [308]:
# Map numerical values to categories
education_map = {1: 'some HS', 2: 'HS/GED', 3: 'some college/vocation', 4: 'college'}

# Create a new column based on the mapping
chd_train['edu_category'] = chd_train['education'].map(education_map)

#####currentSmoker
***********

Decision: no na's, mapped 0/1 codes to categorical

In [309]:
chd_train['currentSmoker'].value_counts()

currentSmoker
0    1572
1    1523
Name: count, dtype: int64

In [310]:
chd_train['currentSmoker'].isna().sum()

0

In [311]:
# Map numerical values to categories
smoker_map = {0: 'no smoking', 1: 'smoking'}

# Create a new column based on the mapping
chd_train['smoker_category'] = chd_train['currentSmoker'].map(smoker_map)

#####cigsPerDay
**************

Decision: filled na's with the median score

In [312]:
chd_train['cigsPerDay'].value_counts()

cigsPerDay
0.0     1572
20.0     518
30.0     169
15.0     156
10.0     106
5.0       92
9.0       91
3.0       71
40.0      65
1.0       51
43.0      40
25.0      30
35.0      16
6.0       15
2.0       12
8.0       10
7.0        9
60.0       8
18.0       7
11.0       5
50.0       5
23.0       4
4.0        4
17.0       4
16.0       3
12.0       2
19.0       2
29.0       1
14.0       1
70.0       1
45.0       1
38.0       1
13.0       1
Name: count, dtype: int64

In [313]:
chd_train['cigsPerDay'].isna().sum()

22

In [314]:
median_value_cig = chd_train['cigsPerDay'].median()

# Replace missing values with the median
chd_train['cigsPerDay'] = chd_train['cigsPerDay'].fillna(median_value_cig)

#####BPMeds
**********

Decision: remove na's, mapped 0/1 codes to categorical

In [315]:
chd_train['BPMeds'].value_counts()

BPMeds
0.0    2968
1.0      90
Name: count, dtype: int64

In [316]:
chd_train['BPMeds'].isna().sum()

37

In [317]:
chd_train.dropna(subset=['BPMeds'], inplace=True)

In [318]:
# Map numerical values to categories
BPMeds_map = {0: 'no meds', 1: 'on meds'}

# Create a new column based on the mapping
chd_train['BPMeds_category'] = chd_train['BPMeds'].map(BPMeds_map)


#####prevalentStroke
**********

Decision: no na's, mapped 0/1 codes to categorical

In [319]:
chd_train['prevalentStroke'].value_counts()

prevalentStroke
0    3038
1      20
Name: count, dtype: int64

In [320]:
chd_train['prevalentStroke'].isna().sum()

0

In [321]:
# Map numerical values to categories
stroke_map = {0: 'no stroke', 1: 'had stroke'}

# Create a new column based on the mapping
chd_train['stroke_category'] = chd_train['prevalentStroke'].map(stroke_map)

#####prevalentHyp
************

Decision: no na's, mapped 0/1 codes to categorical


In [322]:
chd_train['prevalentHyp'].value_counts()

prevalentHyp
0    2077
1     981
Name: count, dtype: int64

In [323]:
chd_train['prevalentHyp'].isna().sum()

0

In [324]:
# Map numerical values to categories
hyp_map = {0: 'no hyp', 1: 'hyp'}

# Create a new column based on the mapping
chd_train['hyp_category'] = chd_train['prevalentHyp'].map(hyp_map)

#####diabetes
**********

Decision: no na's, mapped 0/1 codes to categorical

In [325]:
chd_train['diabetes'].value_counts()

diabetes
0    2979
1      79
Name: count, dtype: int64

In [326]:
chd_train['diabetes'].isna().sum()

0

In [327]:
# Map numerical values to categories
diabetes_map = {0: 'no diabetes', 1: 'diabetes'}

# Create a new column based on the mapping
chd_train['diabetes_category'] = chd_train['diabetes'].map(diabetes_map)

#####totChol
***********

Decision: replace na's with median value

In [328]:
chd_train['totChol'].value_counts()

totChol
240.0    61
220.0    47
232.0    47
260.0    46
210.0    43
         ..
359.0     1
148.0     1
355.0     1
119.0     1
392.0     1
Name: count, Length: 242, dtype: int64

In [329]:
chd_train['totChol'].isna().sum()

37

In [330]:
median_value_chol = chd_train['totChol'].median()

# Replace missing values with the median
chd_train['totChol'] = chd_train['totChol'].fillna(median_value_chol)

#####sysBP
***********

Decision: no na's, no mapping needed

In [331]:
chd_train['sysBP'].value_counts()

sysBP
130.0    70
110.0    68
120.0    68
125.0    66
115.0    65
         ..
185.5     1
85.0      1
201.0     1
295.0     1
169.5     1
Name: count, Length: 228, dtype: int64

In [332]:
chd_train['sysBP'].isna().sum()

0

#####diaBP
**************

Decision: no na's, no mapping needed

In [333]:
chd_train['diaBP'].value_counts()

diaBP
80.0     190
82.0     105
81.0      96
85.0      96
84.0      95
        ... 
54.0       1
105.5      1
109.5      1
129.0      1
122.5      1
Name: count, Length: 138, dtype: int64

In [334]:
chd_train['diaBP'].isna().sum()

0

#####BMI
************

Decision: replace na's with median value

In [335]:
chd_train['BMI'].value_counts()

BMI
23.48    17
22.91    15
22.54    12
23.09    12
25.09    11
         ..
27.71     1
23.02     1
33.68     1
27.67     1
26.78     1
Name: count, Length: 1213, dtype: int64

In [336]:
chd_train['BMI'].isna().sum()

14

In [337]:
median_value_BMI = chd_train['BMI'].median()

# Replace missing values with the median
chd_train['BMI'] = chd_train['BMI'].fillna(median_value_BMI)

#####heartRate
**********

Decision: no na's, no mapping needed

In [338]:
chd_train['heartRate'].value_counts()

heartRate
75     416
80     261
70     224
60     169
72     163
      ... 
51       1
46       1
97       1
47       1
130      1
Name: count, Length: 70, dtype: int64

In [339]:
chd_train['heartRate'].isna().sum()

0

#####glucose
************

Decision: converted na's to median value

In [340]:
chd_train['glucose'].value_counts()

glucose
75.0     138
77.0     126
73.0     114
70.0     112
74.0     108
        ... 
119.0      1
320.0      1
136.0      1
210.0      1
186.0      1
Name: count, Length: 126, dtype: int64

In [341]:
chd_train['glucose'].isna().sum()

277

In [342]:
median_value_gluc = chd_train['glucose'].median()

# Replace missing values with the median
chd_train['glucose'] = chd_train['glucose'].fillna(median_value_gluc)
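
This notebook imputes each split with its own median. A common alternative, shown here only as a sketch and not as what the cells above do, is to compute the median on the training split and reuse it for the test split, so that no statistic is derived from test data. The toy frames and values are assumptions for illustration.

```python
import pandas as pd

# Toy splits with missing glucose readings
train = pd.DataFrame({"glucose": [70.0, 80.0, None, 90.0]})
test = pd.DataFrame({"glucose": [None, 100.0]})

# Median is learned from the training split only
train_median = train["glucose"].median()

train["glucose"] = train["glucose"].fillna(train_median)
test["glucose"] = test["glucose"].fillna(train_median)
print(test)
```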

#####TenYearCHD
*************

Decision: no na's, mapped 0/1 codes to categorical

In [343]:
chd_train['TenYearCHD'].value_counts()

TenYearCHD
0    2590
1     468
Name: count, dtype: int64

In [344]:
chd_train['TenYearCHD'].isna().sum()

0

In [345]:
# Map numerical values to categories
chd_map = {0: 'no chd', 1: 'chd'}

# Create a new column based on the mapping
chd_train['chd_category'] = chd_train['TenYearCHD'].map(chd_map)

####Inspection
********

In [346]:
chd_train.head()

Unnamed: 0.1,Unnamed: 0,sex,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,...,glucose,TenYearCHD,sex_category,edu_category,smoker_category,BPMeds_category,stroke_category,hyp_category,diabetes_category,chd_category
0,1267,1,58,1.0,0,0.0,0.0,0,0,0,...,87.0,1,male,some HS,no smoking,no meds,no stroke,no hyp,no diabetes,chd
1,1209,0,40,1.0,1,15.0,0.0,0,0,0,...,77.0,0,female,some HS,smoking,no meds,no stroke,no hyp,no diabetes,no chd
2,2050,0,52,1.0,0,0.0,0.0,0,0,0,...,78.0,0,female,some HS,no smoking,no meds,no stroke,no hyp,no diabetes,no chd
3,1183,1,38,2.0,1,43.0,0.0,0,1,0,...,75.0,0,male,HS/GED,smoking,no meds,no stroke,hyp,no diabetes,no chd
4,3225,0,43,1.0,0,0.0,0.0,0,0,0,...,74.0,0,female,some HS,no smoking,no meds,no stroke,no hyp,no diabetes,no chd


In [347]:
chd_train.shape
#originally had 3180 observations

(3058, 25)
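
The share of rows lost to dropping can be worked out from the shapes printed before and after cleaning. The row counts below are copied from the outputs in this notebook; the variable names are only illustrative.

```python
# (rows before cleaning, rows after cleaning), taken from the .shape outputs
splits = {"test": (1060, 1024), "train": (3180, 3058)}

pct_dropped = {}
for name, (before, after) in splits.items():
    pct_dropped[name] = 100 * (before - after) / before
    print(f"{name}: {before - after} rows dropped ({pct_dropped[name]:.1f}%)")
```

This gives 36 rows (about 3.4%) for the test split and 122 rows (about 3.8%) for the training split.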

In [348]:
for each in chd_train.columns:
    print(f"{each}: {chd_train[each].isnull().sum()}")

Unnamed: 0: 0
sex: 0
age: 0
education: 0
currentSmoker: 0
cigsPerDay: 0
BPMeds: 0
prevalentStroke: 0
prevalentHyp: 0
diabetes: 0
totChol: 0
sysBP: 0
diaBP: 0
BMI: 0
heartRate: 0
glucose: 0
TenYearCHD: 0
sex_category: 0
edu_category: 0
smoker_category: 0
BPMeds_category: 0
stroke_category: 0
hyp_category: 0
diabetes_category: 0
chd_category: 0


####Saving new files

In [349]:
# Save cleaned chd_test to a CSV file
chd_test.to_csv('clean_chd_test.csv', index=False)

# Save cleaned chd_train to a CSV file
chd_train.to_csv('clean_chd_train.csv', index=False)
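
A quick way to confirm that a saved file round-trips is to read it back and compare it with the frame in memory. The sketch below uses a toy frame and a temporary directory so that it does not touch the real output files; the file name `clean_toy.csv` is hypothetical.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"sex": [0, 1], "sex_category": ["female", "male"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clean_toy.csv")
    toy.to_csv(path, index=False)  # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
no_missing = int(reloaded.isna().sum().sum()) == 0
print(same_shape, no_missing)
```

The same two checks could be applied to `clean_chd_test.csv` and `clean_chd_train.csv` after the cell above runs.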