# ISM6251.003S23- Assignment 1

## Data Preprocessing
This notebook explores the dataset and pre-processes it to produce clean data for the models fitted in the model fitting notebook.

## 1. Package Import

In [112]:
# Import packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


In [113]:
# set random seed to ensure that results are repeatable
np.random.seed(1)

## 2. Data Import

In [114]:
# Import data from csv file
heart = pd.read_csv("heart_disease_uci.csv")

In [115]:
#Data Exploration
heart.head()

Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


In [116]:
# exploration of numerical variables
heart.describe()

Unnamed: 0,id,age,trestbps,chol,thalch,oldpeak,ca,num
count,920.0,920.0,861.0,890.0,865.0,858.0,309.0,920.0
mean,460.5,53.51087,132.132404,199.130337,137.545665,0.878788,0.676375,0.995652
std,265.725422,9.424685,19.06607,110.78081,25.926276,1.091226,0.935653,1.142693
min,1.0,28.0,0.0,0.0,60.0,-2.6,0.0,0.0
25%,230.75,47.0,120.0,175.0,120.0,0.0,0.0,0.0
50%,460.5,54.0,130.0,223.0,140.0,0.5,0.0,1.0
75%,690.25,60.0,140.0,268.0,157.0,1.5,1.0,2.0
max,920.0,77.0,200.0,603.0,202.0,6.2,3.0,4.0


### 2.1 Target and Input variables description

The target variable in this dataset is the 'num' column, which indicates whether a person has heart disease. Any non-zero value in this column indicates that the person has heart disease.

The dataset has 15 input variables, but not every one is logically related to the target. The unrelated variables ('id' and 'dataset') contribute nothing to predicting the target and are dropped.

In [117]:
#Dropping unrelated variables
heart = heart.drop(['id', 'dataset'], axis=1)

### 2.2 Checking the data for null values

In [118]:
heart.isnull().sum()


age           0
sex           0
cp            0
trestbps     59
chol         30
fbs          90
restecg       2
thalch       55
exang        55
oldpeak      62
slope       309
ca          611
thal        486
num           0
dtype: int64

#### 2.2.1 Handling Null values

In [119]:
#Checking datatype for each column
heart.dtypes

age           int64
sex          object
cp           object
trestbps    float64
chol        float64
fbs          object
restecg      object
thalch      float64
exang        object
oldpeak     float64
slope        object
ca          float64
thal         object
num           int64
dtype: object

The columns 'slope', 'ca' and 'thal' have the highest numbers of null values in the dataset (309, 611 and 486 of 920 rows). 'slope' and 'thal' are categorical and 'ca' takes only a few discrete values, so imputing these missing values with the mean or median would introduce bias and affect the analysis.

Given how much of each column is missing, these three columns are dropped from the analysis.

In [120]:
heart = heart.drop(['slope', 'ca','thal'], axis=1)
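
The choice of columns to drop can also be derived from the missing-value fractions rather than listed by hand. The following is a minimal sketch on a made-up toy frame; the 30% cutoff (`MAX_MISSING`) is an assumption for illustration, not a value used in this notebook. In the counts above, 'slope', 'ca' and 'thal' are missing in roughly 34%, 66% and 53% of the 920 rows, so they would exceed such a cutoff, while 'fbs' (about 10%) would not.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the heart data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [63, 67, 37, 41],
    "chol": [233.0, np.nan, 250.0, 204.0],
    "ca": [0.0, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column.
missing_frac = toy.isnull().mean()

# Hypothetical cutoff: drop any column with more than 30% missing values.
MAX_MISSING = 0.30
cols_to_drop = missing_frac[missing_frac > MAX_MISSING].index.tolist()
toy_reduced = toy.drop(columns=cols_to_drop)

print(missing_frac)
print("Dropped:", cols_to_drop)
```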

In [121]:
heart.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,num
0,63,Male,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,0
1,67,Male,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,2
2,67,Male,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,1
3,37,Male,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,0
4,41,Female,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,0


In [122]:
heart.isnull().sum()

age          0
sex          0
cp           0
trestbps    59
chol        30
fbs         90
restecg      2
thalch      55
exang       55
oldpeak     62
num          0
dtype: int64

In [123]:
# imputing missing values in numerical columns with the median
heart.trestbps = heart.trestbps.fillna(value=heart['trestbps'].median())
heart.chol = heart.chol.fillna(value=heart['chol'].median())
heart.thalch = heart.thalch.fillna(value=heart['thalch'].median())
heart.oldpeak = heart.oldpeak.fillna(value=heart['oldpeak'].median())
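
The four assignments above repeat the same pattern, so they can be written as one step. This is a sketch on a toy frame with made-up values; it relies on `DataFrame.fillna` accepting a Series of per-column fill values, matched by column name.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (values are made up).
toy = pd.DataFrame({
    "trestbps": [145.0, np.nan, 120.0, 130.0],
    "chol": [233.0, 286.0, np.nan, 250.0],
})

num_cols = ["trestbps", "chol"]

# median() skips NaN by default and returns one value per column.
medians = toy[num_cols].median()

# fillna with a Series fills each column with its own median.
toy[num_cols] = toy[num_cols].fillna(medians)

print(toy)
```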

In [124]:
#Check for missing values
heart.isnull().sum()

age          0
sex          0
cp           0
trestbps     0
chol         0
fbs         90
restecg      2
thalch       0
exang       55
oldpeak      0
num          0
dtype: int64

Missing values in the numerical columns ('trestbps', 'chol', 'thalch', 'oldpeak') were imputed with the column median. A few missing values remain in the categorical columns ('fbs', 'restecg' and 'exang'); a mean or median is not meaningful for these columns, and imputing them could introduce bias. Therefore, the rows with missing values in these columns will be dropped.

In [125]:
# Replace empty strings with NaN so that they are treated as missing values
heart.replace('', np.nan, inplace=True)

In [126]:
# Drop rows with missing values only in categorical columns
heart.dropna(subset=['fbs'], inplace=True)
heart.dropna(subset=['restecg'], inplace=True)
heart.dropna(subset=['exang'], inplace=True)
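
The three `dropna` calls can be combined, since `subset` accepts a list of columns. The sketch below uses a made-up toy frame and also counts how many rows were removed; the names `cat_cols` and `dropped` are illustrative only.

```python
import pandas as pd

# Toy frame: rows 1, 2 and 3 each miss one categorical value (made-up data).
toy = pd.DataFrame({
    "fbs": [True, None, False, True],
    "restecg": ["normal", "normal", None, "lv hypertrophy"],
    "exang": [False, True, True, None],
    "num": [0, 1, 1, 0],
})

cat_cols = ["fbs", "restecg", "exang"]

before = len(toy)
# A row is dropped if any of the listed columns is missing.
toy_clean = toy.dropna(subset=cat_cols)
dropped = before - len(toy_clean)

print(f"Dropped {dropped} of {before} rows")
```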

In [127]:
#Final check for any missing values
heart.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalch      0
exang       0
oldpeak     0
num         0
dtype: int64

In [128]:
heart.describe()

Unnamed: 0,age,trestbps,chol,thalch,oldpeak,num
count,774.0,774.0,774.0,774.0,774.0,774.0
mean,53.071059,132.775194,219.301034,138.677003,0.885401,0.919897
std,9.43097,18.577723,92.594114,25.808812,1.08189,1.133424
min,28.0,0.0,0.0,60.0,-1.0,0.0
25%,46.0,120.0,198.0,120.0,0.0,0.0
50%,54.0,130.0,228.0,140.0,0.5,1.0
75%,60.0,140.0,269.0,159.0,1.5,1.0
max,77.0,200.0,603.0,202.0,6.2,4.0


A total of 146 rows (920 - 774) have been dropped from the dataset.

### 2.3 Encoding categorical variables

In [129]:
heart.dtypes

age           int64
sex          object
cp           object
trestbps    float64
chol        float64
fbs          object
restecg      object
thalch      float64
exang        object
oldpeak     float64
num           int64
dtype: object

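The dataset contains several categorical variables ('sex', 'cp', 'fbs', 'restecg' and 'exang'). These variables need to be encoded before model fitting.

The categorical columns do not appear to have an inherent order or ranking, and a deeper understanding of these attributes would be needed to establish one. One-hot encoding would suit such variables; the cell below, however, uses LabelEncoder, which replaces each category with an integer code assigned in sorted order of the labels. For the binary columns ('sex', 'fbs', 'exang') this is equivalent to a 0/1 indicator, but for 'cp' and 'restecg' the codes imply an ordering that a model may treat as meaningful.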

In [130]:
labelencoder = LabelEncoder()
heart['sex'] = labelencoder.fit_transform(heart['sex'])
heart['cp'] = labelencoder.fit_transform(heart['cp'])
heart['fbs'] = labelencoder.fit_transform(heart['fbs'])
heart['restecg'] = labelencoder.fit_transform(heart['restecg'])
heart['exang'] = labelencoder.fit_transform(heart['exang'])

In [131]:
heart.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,num
0,63,1,3,145.0,233.0,1,0,150.0,0,2.3,0
1,67,1,0,160.0,286.0,0,0,108.0,1,1.5,2
2,67,1,0,120.0,229.0,0,0,129.0,1,2.6,1
3,37,1,2,130.0,250.0,0,1,187.0,0,3.5,0
4,41,0,1,130.0,204.0,0,0,172.0,0,1.4,0
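
Because the multi-class columns have no clear order, one-hot encoding is an alternative to the integer codes shown above. The following is a minimal sketch using `pd.get_dummies` on a made-up toy frame; it is not the encoding applied in this notebook, and the resulting column names follow pandas' `column_value` convention.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (made-up rows).
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female"],
    "cp": ["typical angina", "asymptomatic", "non-anginal"],
    "age": [63, 67, 41],
})

# Each category becomes its own 0/1 indicator column; 'age' is left untouched.
encoded = pd.get_dummies(toy, columns=["sex", "cp"], dtype=int)

print(encoded.columns.tolist())
```

For linear models, passing `drop_first=True` removes one indicator per variable to avoid redundant columns.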


### 2.4 Scaling continuous variables

In [132]:
# Create scaler for standardization
scaler = StandardScaler()

stand_col = ['age','trestbps','chol','thalch','oldpeak']

# Apply to numerical columns: age, trestbps, chol, thalch, oldpeak
heart[stand_col] = scaler.fit_transform(heart[stand_col])

In [133]:
heart.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,num
count,774.0,774.0,774.0,774.0,774.0,774.0,774.0,774.0,774.0,774.0,774.0
mean,-5.909714000000001e-17,0.76615,0.794574,2.003852e-16,2.62638e-16,0.151163,0.936693,5.531034e-16,0.394057,-9.848568e-16,0.919897
std,1.000647,0.423552,0.947754,1.000647,1.000647,0.358439,0.624943,1.000647,0.488963,1.000647,1.133424
min,-2.660094,0.0,0.0,-7.151632,-2.369944,0.0,0.0,-3.050426,0.0,-1.743819,0.0
25%,-0.7502549,1.0,0.0,-0.6881066,-0.2301961,0.0,1.0,-0.7241356,0.0,-0.8189126,0.0
50%,0.09856263,1.0,0.0,-0.1494795,0.09400804,0.0,1.0,0.05129461,0.0,-0.3564594,1.0
75%,0.7351758,1.0,2.0,0.3891477,0.5370871,0.0,1.0,0.7879533,1.0,0.568447,1.0
max,2.538913,1.0,3.0,3.620911,4.14656,1.0,2.0,2.455128,1.0,4.915507,4.0
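
The standardized columns in the summary above show a standard deviation of 1.000647 rather than exactly 1. This is consistent with StandardScaler dividing by the population standard deviation (`ddof=0`) while `describe()` reports the sample standard deviation (`ddof=1`), which leaves a factor of sqrt(n / (n - 1)). The sketch below reproduces the effect by hand on the first five 'age' values from the earlier output and checks the factor for n = 774.

```python
import numpy as np
import pandas as pd

# First five 'age' values from the head() output earlier in the notebook.
x = pd.Series([63.0, 67.0, 67.0, 37.0, 41.0])
n = len(x)

# Standardize with the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

# pandas' default std (and describe()) uses ddof=1, so the result is not exactly 1.
sample_std = z.std(ddof=1)
expected = np.sqrt(n / (n - 1))

# The same factor for the 774 rows of the cleaned dataset.
factor_774 = np.sqrt(774 / 773)

print(sample_std, expected, factor_774)
```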


In [134]:
heart.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,num
0,1.053482,1,3,0.658461,0.148042,1,0,0.43901,0,1.308372,0
1,1.477891,1,0,1.466402,0.720803,0,0,-1.189394,1,0.568447,2
2,1.477891,1,0,-0.688107,0.104815,0,0,-0.375192,1,1.585844,1
3,-1.705175,1,2,-0.149479,0.331758,0,1,1.873556,0,2.41826,0
4,-1.280766,0,1,-0.149479,-0.165355,0,0,1.291983,0,0.475956,0


### 2.5 Converting target into binary classification

The target variable 'num' has five classes, from 0 to 4. The value 0 indicates no heart disease, and the other values (1, 2, 3, 4) indicate the presence of heart disease at different levels. This project focuses only on determining whether a person has heart disease, not on its level.

The value 0 will represent no heart disease.
The values 1, 2, 3 and 4 will be mapped to 1 and will represent the presence of heart disease.

In [135]:
# Simplify target variable to a binary classification: map 2, 3 and 4 to 1
# (regex matching is not needed for integer values)
heart.num = heart.num.replace([2, 3, 4], 1)

In [136]:
heart.num.value_counts()

1    397
0    377
Name: num, dtype: int64
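
The same mapping can be expressed as a comparison, which avoids listing the non-zero classes. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up target values covering all five original classes.
num = pd.Series([0, 2, 1, 0, 3, 4], name="num")

# Any value above 0 means heart disease is present.
binary = (num > 0).astype(int)

print(binary.tolist())
```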

### 2.6 Splitting the data into train and test

In [137]:
# split the data into training and test sets (70% / 30%)
train_df, test_df = train_test_split(heart, test_size=0.3)
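
The split above depends on the global NumPy seed set earlier, and the scaler in section 2.4 was fitted on all rows before splitting. A possible variant, sketched here on synthetic data, passes `random_state` and `stratify` to `train_test_split` and fits StandardScaler on the training rows only, so that test rows do not influence the scaling statistics. The toy columns, seed values and variable names (`toy_train`, `toy_test`) are assumptions for illustration, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data: two numeric inputs, balanced binary target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(28, 78, size=100).astype(float),
    "chol": rng.normal(220, 40, size=100),
    "num": np.repeat([0, 1], 50),
})

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=1, stratify=toy["num"]
)

num_cols = ["age", "chol"]

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(toy_train[num_cols])

train_scaled = toy_train.copy()
test_scaled = toy_test.copy()
train_scaled[num_cols] = scaler.transform(toy_train[num_cols])
test_scaled[num_cols] = scaler.transform(toy_test[num_cols])

print(len(train_scaled), len(test_scaled))
```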

In [138]:
# defining predictors and target
target = 'num'
predictors = list(heart.columns)
predictors.remove(target)


### 2.7 Save data to csv

In [111]:
train_X = train_df[predictors]
train_y = train_df[target] # train_y is a Series object
test_X = test_df[predictors]
test_y = test_df[target] # test_y is a Series object

train_df.to_csv('heart_train_df.csv', index=False)
train_X.to_csv('heart_train_X.csv', index=False)
train_y.to_csv('heart_train_y.csv', index=False)
test_df.to_csv('heart_test_df.csv', index=False)
test_X.to_csv('heart_test_X.csv', index=False)
test_y.to_csv('heart_test_y.csv', index=False)
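
When the target files are loaded again in another notebook, `pd.read_csv` returns a one-column DataFrame rather than a Series. The sketch below shows the round trip on a made-up Series written to a temporary directory; the file name `toy_y.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up target values standing in for train_y.
toy_y = pd.Series([0, 1, 1, 0], name="num")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_y.csv"

    # index=False writes only the header 'num' and the values.
    toy_y.to_csv(path, index=False)

    loaded = pd.read_csv(path)   # one-column DataFrame
    loaded_y = loaded["num"]     # select the column to get a Series back

print(loaded.shape, loaded_y.tolist())
```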