heart.csv 
1. Load the heart disease dataset into a pandas DataFrame
2. Remove outliers using the Z score. The usual guideline is to remove anything with a Z score > 3 or < -3
3. Convert text columns to numbers using label encoding and one-hot encoding
4. Apply scaling
5. Build a classification model using various methods (SVM, logistic regression, random forest) and check which model gives the best accuracy
6. Now use PCA to reduce dimensions, retrain your model, and see what impact it has on accuracy. Keep in mind that PCA often reduces accuracy, but computation is much lighter; that is the trade-off to consider when building models in real life

In [68]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [69]:
heart = pd.read_csv('heart.csv')
heart.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


### Remove outliers using Z score

In [70]:
# Report the rows whose Z score marks them as outliers in each numeric column
from scipy import stats

columns = ['Age','RestingBP','Cholesterol','MaxHR','Oldpeak','HeartDisease']
for col in columns:
    z_scores = stats.zscore(heart[col])
    for i, z_score in enumerate(z_scores):
        # abs() already covers both tails, so a single comparison is enough
        if abs(z_score) > 3:
            print(f"Row {i} in column {col} is an outlier.")

Row 109 in column RestingBP is an outlier.
Row 241 in column RestingBP is an outlier.
Row 365 in column RestingBP is an outlier.
Row 399 in column RestingBP is an outlier.
Row 449 in column RestingBP is an outlier.
Row 592 in column RestingBP is an outlier.
Row 732 in column RestingBP is an outlier.
Row 759 in column RestingBP is an outlier.
Row 76 in column Cholesterol is an outlier.
Row 149 in column Cholesterol is an outlier.
Row 616 in column Cholesterol is an outlier.
Row 390 in column MaxHR is an outlier.
Row 166 in column Oldpeak is an outlier.
Row 324 in column Oldpeak is an outlier.
Row 702 in column Oldpeak is an outlier.
Row 771 in column Oldpeak is an outlier.
Row 791 in column Oldpeak is an outlier.
Row 850 in column Oldpeak is an outlier.
Row 900 in column Oldpeak is an outlier.


In [71]:
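# NOTE: this Z score is computed on the binary target column only. For a roughly
# balanced 0/1 column |z| stays near 1, so the filter in the next cell is unlikely
# to remove any rows, and the outliers printed above stay in the data.
heart['z_score'] = np.abs(stats.zscore(heart['HeartDisease']))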

In [72]:
# z_score already holds absolute values, so only the upper bound matters
heart = heart[heart['z_score'] < 3]

In [73]:
# Reassign instead of inplace=True to avoid a SettingWithCopyWarning on the filtered frame
heart = heart.drop('z_score', axis=1)

In [74]:
heart.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
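The listing cell above finds outliers in `RestingBP`, `Cholesterol`, `MaxHR` and `Oldpeak`, while the filter cells only look at `HeartDisease`. A minimal sketch of filtering on the feature columns instead is shown here. It runs on a small synthetic frame (the values, the planted outliers and the names `demo`, `numeric_cols`, `keep` and `demo_clean` are assumptions for illustration, not the real `heart.csv`).

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for heart.csv (hypothetical values, not the real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'RestingBP': rng.normal(130, 15, 200),
    'Cholesterol': rng.normal(200, 40, 200),
    'HeartDisease': rng.integers(0, 2, 200),
})
demo.loc[0, 'RestingBP'] = 400      # plant one obvious outlier
demo.loc[1, 'Cholesterol'] = 900    # and another in a different column

numeric_cols = ['RestingBP', 'Cholesterol']   # features only, not the target

# |z| for every numeric feature, computed column by column
z = np.abs(stats.zscore(demo[numeric_cols].to_numpy(), axis=0))

# Keep a row only if none of its features is more than 3 standard deviations out
keep = (z <= 3).all(axis=1)
demo_clean = demo[keep]

print(len(demo), '->', len(demo_clean))
```

On the real data the same pattern would use `heart` and the five numeric feature columns in place of `demo` and `numeric_cols`. The scores reported later in this notebook were produced without such a filter.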


### OneHotEncoding to change text to numerical data

In [75]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()  # not used below; pd.get_dummies does the encoding


In [76]:
dummies1 = pd.get_dummies(heart.Sex)
dummies2 = pd.get_dummies(heart.ChestPainType)
dummies3 = pd.get_dummies(heart.RestingECG)
dummies4 = pd.get_dummies(heart.ExerciseAngina)
dummies5 = pd.get_dummies(heart.ST_Slope)

In [77]:
merge = pd.concat([heart,dummies1,dummies2,dummies3,dummies4,dummies5],axis='columns')

In [78]:
fh = merge.drop(['Sex','ChestPainType','RestingECG','ExerciseAngina','ST_Slope'],axis=1)
fh.head()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,F,M,ASY,...,NAP,TA,LVH,Normal,ST,N,Y,Down,Flat,Up
0,40,140,289,0,172,0.0,0,0,1,0,...,0,0,0,1,0,1,0,0,0,1
1,49,160,180,0,156,1.0,1,1,0,0,...,1,0,0,1,0,1,0,0,1,0
2,37,130,283,0,98,0.0,0,0,1,0,...,0,0,0,0,1,1,0,0,0,1
3,48,138,214,0,108,1.5,1,1,0,1,...,0,0,0,1,0,0,1,0,1,0
4,54,150,195,0,122,0.0,0,0,1,0,...,1,0,0,1,0,1,0,0,0,1


In [79]:
# Drop one dummy column per categorical feature to avoid the dummy variable trap
fh = fh.drop(['F','ASY','ST','N','Down'],axis=1)
fh.head()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,M,ATA,NAP,TA,LVH,Normal,Y,Flat,Up
0,40,140,289,0,172,0.0,0,1,1,0,0,0,1,0,0,1
1,49,160,180,0,156,1.0,1,0,0,1,0,0,1,0,1,0
2,37,130,283,0,98,0.0,0,1,1,0,0,0,0,0,0,1
3,48,138,214,0,108,1.5,1,0,0,0,0,0,1,1,1,0
4,54,150,195,0,122,0.0,0,1,0,1,0,0,1,0,0,1


In [80]:
y = fh['HeartDisease']

In [81]:
fh = fh.drop('HeartDisease',axis=1)
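Step 4 of the task list asks for scaling, but no cell in this notebook applies it. A minimal sketch with `StandardScaler` follows. The array `X_demo` holds the `Age`, `RestingBP` and `Cholesterol` values of the five rows printed by `heart.head()`, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age, RestingBP, Cholesterol for the five rows shown by heart.head()
X_demo = np.array([[40, 140, 289],
                   [49, 160, 180],
                   [37, 130, 283],
                   [48, 138, 214],
                   [54, 150, 195]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# After scaling, every column has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

On the full data, `scaler.fit_transform(fh)` would play the same role. Fitting the scaler on training data only avoids leaking test-set statistics into the model.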

### Classification Model to use

In [58]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB

In [84]:

model_params = {
    'svm' : {
        'model' : svm.SVC(gamma='auto'),
        'params' : {
            'C' : [1,10,20],
            'kernel' : ['rbf','linear']
        }
    },
    'random_forest' : {
        'model' : RandomForestClassifier(),
        'params' : {
            'n_estimators' : [1,5,10]
        }
    },
    'decision_tree' : {
        'model' : DecisionTreeClassifier(),
        'params' : {
            'splitter' : ['best','random'],
            
        }
    },
    'logistic_regression' : {
        'model' : LogisticRegression(),
        'params' : {
            'solver' : ['liblinear','saga'],
            
        }
    },
    'gaussianNB' : {
        'model' : GaussianNB(),
        'params' : {}
    }
}

In [85]:
from sklearn.model_selection import GridSearchCV
scores = []
for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'],mp['params'],cv=5,return_train_score = False)
    clf.fit(fh,y)
    scores.append({
        'model' : model_name,
        'best_score' : clf.best_score_,
        'best_params' : clf.best_params_
    })



In [86]:
df = pd.DataFrame(scores)
df.head()

Unnamed: 0,model,best_score,best_params
0,svm,0.83325,"{'C': 20, 'kernel': 'linear'}"
1,random_forest,0.801681,{'n_estimators': 5}
2,decision_tree,0.76242,{'splitter': 'random'}
3,logistic_regression,0.834379,{'solver': 'liblinear'}
4,gaussianNB,0.840871,{}


From these results, GaussianNB gives the best cross-validated accuracy (0.841), with logistic regression (0.834) and SVM (0.833) close behind.

522    1
513    1
326    0
757    1
684    1
      ..
53     0
110    0
88     1
26     0
896    0
Name: HeartDisease, Length: 184, dtype: int64
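The grid search above fits each model on unscaled features. One way to add scaling without leaking information between folds is to wrap each estimator in a pipeline, so the scaler is refit inside every cross-validation split. The sketch below is an illustration only: it uses synthetic data from `make_classification` in place of `fh` and `y`, a reduced set of models, and a hypothetical `C` grid for logistic regression, so its scores say nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: swap in fh and y from the notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=15, random_state=0)

# make_pipeline names each step after its lowercased class name,
# so grid keys take the form '<step>__<parameter>'
candidates = {
    'svm': (
        make_pipeline(StandardScaler(), SVC(gamma='auto')),
        {'svc__C': [1, 10, 20], 'svc__kernel': ['rbf', 'linear']},
    ),
    'logistic_regression': (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {'logisticregression__C': [0.1, 1, 10]},
    ),
}

results = {}
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, cv=5)
    search.fit(X_demo, y_demo)
    results[name] = (search.best_score_, search.best_params_)
    print(name, round(search.best_score_, 3), search.best_params_)
```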

### PCA

In [110]:
from sklearn.decomposition import PCA
pca = PCA(0.90)
x_pca = pca.fit_transform(fh)
x_pca

array([[ 9.23110772e+01],
       [-1.71438907e+01],
       [ 8.19069658e+01],
       [ 1.36543440e+01],
       [-4.34875075e+00],
       [ 1.41767320e+02],
       [ 4.00777910e+01],
       [ 9.09400808e+00],
       [ 8.06494898e+00],
       [ 8.39083066e+01],
       [ 1.25788659e+01],
       [-3.68925980e+01],
       [ 5.57560165e+00],
       [ 3.54882788e+01],
       [ 1.19896491e+01],
       [ 7.46078161e+01],
       [-1.36254737e+00],
       [ 3.70133024e+00],
       [ 4.78147623e+01],
       [ 6.93508607e+01],
       [ 2.39893107e+01],
       [-1.46037317e+01],
       [ 3.66071599e+00],
       [ 9.01816473e+01],
       [ 1.63140901e+01],
       [ 1.26674065e+01],
       [ 5.95163071e+01],
       [ 8.37577139e+01],
       [ 2.67807946e+02],
       [-1.04121733e+01],
       [ 3.18436479e+02],
       [-3.31194658e+01],
       [ 2.41680374e+01],
       [-2.70759981e+01],
       [-1.13898136e+01],
       [ 5.62061700e+01],
       [ 1.04158556e+02],
       [ 5.11291099e+01],
       ...
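The output above has a single column, meaning `PCA(0.90)` kept only one component. A likely reason is that PCA was fit on unscaled features, where a wide-range column can account for almost all of the variance on its own. The sketch below illustrates the effect on synthetic data. The column scales and the names `X_demo`, `pca_raw` and `pca_scaled` are assumptions, not values taken from `heart.csv`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features: one wide-range column and three narrow, independent ones
X_demo = np.column_stack([
    rng.normal(200, 100, 500),   # large variance, like a lab measurement
    rng.normal(0, 1, 500),
    rng.normal(0, 1, 500),
    rng.normal(0, 1, 500),
])

# Same variance threshold, with and without standardizing first
pca_raw = PCA(0.90).fit(X_demo)
pca_scaled = PCA(0.90).fit(StandardScaler().fit_transform(X_demo))

print(pca_raw.n_components_, pca_raw.explained_variance_ratio_)
print(pca_scaled.n_components_, pca_scaled.explained_variance_ratio_)
```

Checking `pca.n_components_` and `pca.explained_variance_ratio_` on the notebook's own `pca` object would show whether the same thing happened here.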

In [113]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_pca,y,test_size=0.2)

In [114]:
model = RandomForestClassifier()
model.fit(x_train,y_train)
model.score(x_test,y_test)

0.6358695652173914
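The 0.636 above comes from one random 80/20 split on the PCA output, so it is not directly comparable with the 5-fold scores in the model table. A fairer comparison would scale the features, apply PCA, and evaluate with the same cross-validation. The sketch below shows that pipeline on synthetic data. `make_classification` stands in for `fh` and `y`, and the resulting score is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: swap in fh and y from the notebook)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=15, n_informative=5, random_state=0
)

# Scale, reduce to the components covering 90% of the variance, then classify
pca_model = make_pipeline(
    StandardScaler(),
    PCA(0.90),
    RandomForestClassifier(random_state=0),
)

# 5-fold cross-validation, matching cv=5 in the grid search
cv_scores = cross_val_score(pca_model, X_demo, y_demo, cv=5)
print(cv_scores.mean())
```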