### Preprocessing

Preprocessing involves the following steps:
- Import dataset
- Take a look at its contents
- Check for missing values
- Convert all values of "num" (angiographic disease status) greater than 0 to 1 (binary classification problem)
- Drop all columns other than the predictors "thalach" and "chol" and the response "num"
- Divide into train and test sets (random split, approximately 75:25)
- Write to separate CSV files, train.csv and test.csv

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read the dataset and work on a copy so the raw dataframe stays unchanged
rawdf = pd.read_csv("raw-data/data.csv")
fulldf = rawdf.copy()

In [3]:
#check out its contents
fulldf.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


In [4]:
fulldf.shape

(303, 14)

In [5]:
fulldf.dtypes

age         float64
sex         float64
cp          float64
trestbps    float64
chol        float64
fbs         float64
restecg     float64
thalach     float64
exang       float64
oldpeak     float64
slope       float64
ca           object
thal         object
num           int64
dtype: object

In [6]:
#check for missing values
fulldf.isnull().values.any()

False
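
Note that `isnull()` only flags NaN/None values. The `object` dtype reported for `ca` and `thal` above may mean those columns hold non-numeric strings, which `isnull()` would not catch. Below is a minimal sketch of one way to check, using a small made-up frame rather than the real data; the `"?"` placeholder is an assumption for illustration and is not confirmed from data.csv. Both columns are dropped later, so this does not affect the train and test files.

```python
import pandas as pd

# Toy frame standing in for fulldf; the "?" placeholder is hypothetical.
toy = pd.DataFrame({"ca": ["0.0", "3.0", "?"], "thal": ["6.0", "3.0", "7.0"]})

# isnull() sees only NaN/None, so a string placeholder goes unnoticed.
has_nan = toy.isnull().values.any()

# Coerce each column to numeric; entries that cannot be parsed become NaN.
coerced = toy.apply(pd.to_numeric, errors="coerce")

# Count the entries per column that were not valid numbers.
hidden_missing = coerced.isnull().sum()
print(has_nan)
print(hidden_missing)
```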

In [7]:
#check values of response
fulldf["num"].unique()

array([0, 2, 1, 3, 4])

In [8]:
# isnull() reports no missing values, so go to the next step
# Binarize the response: every num value > 0 becomes 1
fulldf.loc[fulldf["num"] > 0, "num"] = 1

In [9]:
fulldf["num"].unique()

array([0, 1])
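
The same binarization can also be written as a boolean comparison cast to integers, and `value_counts()` then shows how many rows fall into each class. This is a sketch on a made-up `num` column, not the real data, so the counts below say nothing about the actual class balance.

```python
import pandas as pd

# Made-up response values covering the five original levels 0..4.
toy = pd.DataFrame({"num": [0, 2, 1, 3, 4, 0]})

# True where num > 0, then cast to 0/1.
toy["num"] = (toy["num"] > 0).astype(int)

# Rows per class after binarization.
counts = toy["num"].value_counts()
print(counts)
```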

In [10]:
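# Divide into train and test dataframes, keeping only thalach, chol and the response num
# Each row lands in train with probability 0.75, so the split is only approximately 75:25
np.random.seed(143)
msk = np.random.rand(len(fulldf)) < 0.75
train = fulldf[msk][["thalach", "chol", "num"]]
test = fulldf[~msk][["thalach", "chol", "num"]]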

In [11]:
train.shape, test.shape

((243, 3), (60, 3))
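
Because the random mask assigns each row independently, the resulting sizes (243 and 60 rows, roughly 80:20) differ from the nominal 75:25. If an exact proportion matters, one alternative is `DataFrame.sample` with `frac=0.75`, taking the remaining rows as the test set. This is a sketch on synthetic data, not the approach used in this notebook; the names `train_exact` and `test_exact` are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for fulldf with the three columns that are kept.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "thalach": rng.integers(90, 200, size=100),
    "chol": rng.integers(150, 350, size=100),
    "num": rng.integers(0, 2, size=100),
})

# Draw exactly 75% of the rows without replacement for training.
train_exact = toy.sample(frac=0.75, random_state=143)

# Everything not drawn for training becomes the test set.
test_exact = toy.drop(train_exact.index)

print(train_exact.shape, test_exact.shape)
```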

In [12]:
# Write train and test dataframes to CSV files without index columns
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
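
As a sanity check, a written file can be read back and compared against the dataframe it came from. The sketch below does this for a tiny made-up frame in a temporary directory, so it does not touch the real train.csv; `toy_train` and `reloaded` are illustrative names.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up frame with the same three columns as train.
toy_train = pd.DataFrame({
    "thalach": [150.0, 108.0],
    "chol": [233.0, 286.0],
    "num": [0, 1],
})

# Write without the index, then read the file back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.csv")
    toy_train.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# With index=False, no extra index column appears after the round trip.
same_columns = list(reloaded.columns) == ["thalach", "chol", "num"]
same_shape = reloaded.shape == toy_train.shape
print(same_columns, same_shape)
```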