Download the heart disease dataset heart.csv from the Exercise folder and do the following (dataset credit: https://www.kaggle.com/fedesoriano/heart-failure-prediction):

Load the heart disease dataset into a pandas DataFrame.

Remove outliers using the Z-score. The usual guideline is to remove any row with a Z-score above 3 or below -3.

Convert text columns to numbers using label encoding and one-hot encoding.

Apply scaling.

Build a classification model using several methods (SVM, logistic regression, random forest) and check which one gives the best accuracy.

Now use PCA to reduce dimensions, retrain the models, and see how accuracy changes. Keep in mind that PCA often reduces accuracy somewhat while making computation much lighter; that is the trade-off to weigh when building models in real life.

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import zscore
%matplotlib inline

# Load the dataset and preview the first rows
df = pd.read_csv('heart.csv')
df.head()

,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [20]:
# Calculate Z-scores for all numeric columns
z_scores = df.select_dtypes(include=['float64', 'int64']).apply(zscore)

# Z-score threshold
threshold = 3

# Keep rows whose Z-scores are all below +3. This checks the upper tail only;
# comparing z_scores.abs() with the threshold would also drop rows below -3.
df_no_outliers = df[(z_scores < threshold).all(axis=1)]

print("Original DataFrame shape:", df.shape)
print("DataFrame shape after removing outliers:", df_no_outliers.shape)
df_no_outliers['RestingECG'].unique()

Original DataFrame shape: (918, 12)
DataFrame shape after removing outliers: (902, 12)


array(['Normal', 'ST', 'LVH'], dtype=object)
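
The filter in the cell above compares the signed Z-score with +3, so it only removes high extremes. As a minimal sketch of the two-sided rule from the exercise text, the snippet below compares both variants on synthetic data. The `toy` frame and its values are made up for illustration and are not taken from heart.csv.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic data (not from heart.csv): 200 ordinary readings plus one very
# low and one very high value.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(100, 10, 200), [0.0, 250.0]])
toy = pd.DataFrame({"RestingBP": values})

z = toy[["RestingBP"]].apply(zscore)

# Signed comparison: only the high extreme is removed
one_sided = toy[(z < 3).all(axis=1)]

# Absolute comparison: both the low and the high extreme are removed
two_sided = toy[(z.abs() < 3).all(axis=1)]

print(len(toy), len(one_sided), len(two_sided))
```

Applying the absolute-value version to the real dataset would likely give a row count different from the one printed above, so downstream scores could shift as well.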

In [21]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Work on an explicit copy so the assignments below do not trigger
# pandas' SettingWithCopyWarning
df_no_outliers = df_no_outliers.copy()

# Label-encode each text column (fit_transform refits the encoder every time)
le = LabelEncoder()
for col in ['Sex', 'ExerciseAngina', 'ChestPainType', 'RestingECG', 'ST_Slope']:
    df_no_outliers[col] = le.fit_transform(df_no_outliers[col])
df_no_outliers

,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,1,1,140,289,0,1,172,0,0.0,2,0
1,49,0,2,160,180,0,1,156,0,1.0,1,1
2,37,1,1,130,283,0,2,98,0,0.0,2,0
3,48,0,0,138,214,0,1,108,1,1.5,1,1
4,54,1,2,150,195,0,1,122,0,0.0,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...
913,45,1,3,110,264,0,1,132,0,1.2,1,1
914,68,1,0,144,193,1,1,141,0,3.4,1,1
915,57,1,0,130,131,0,1,115,1,1.2,1,1
916,57,0,1,130,236,0,0,174,0,0.0,1,1
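
To read the integer codes in the table above, it can help to inspect the mapping that `LabelEncoder` builds. The sketch below uses a short hand-written list of chest pain labels (an illustrative sample, not the full column); `le_demo`, `codes`, and `mapping` are names introduced here.

```python
from sklearn.preprocessing import LabelEncoder

# A short, hand-written sample of labels (illustrative only)
le_demo = LabelEncoder()
codes = le_demo.fit_transform(["ATA", "NAP", "ATA", "ASY", "NAP"])

# classes_ holds the labels in sorted order; a label's position is its code
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(codes)
```

Because the codes follow alphabetical order, they imply an ordering (ASY < ATA < NAP) that has no clinical meaning, which is the usual motivation for one-hot encoding nominal columns.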


Scaling the overall dataset

In [22]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
scaling=StandardScaler()

In [23]:
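# Separate features and target
X = df_no_outliers.drop('HeartDisease', axis='columns')
Y_final = df_no_outliers['HeartDisease'].values
# Standardize every feature column
X_final = scaling.fit_transform(X.values)
# One-hot encode column 0, pass the rest through. Column 0 of X is Age, so
# this expands the scaled Age values; nominal columns such as ChestPainType
# are the usual targets for one-hot encoding.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough')
X_final = ct.fit_transform(X_final)
# Drop the first dummy column
X_final = X_final[:, 1:]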

Hyperparameter tuning across different models

In [24]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
model_params = {
    'svm': {
        'model': SVC(gamma='auto'),
        'params': {
            'C': [1, 10, 20],
            'kernel': ['rbf', 'linear']
        }
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': [1, 5, 10]
        }
    },
    'logistic_regression': {
        'model': LogisticRegression(solver='liblinear'),
        'params': {
            'C': [1, 5, 10]
        }
    }
}

In [25]:
from sklearn.model_selection import GridSearchCV
scores = []

# Grid-search each model with 5-fold cross-validation and record its best result
for model_name, mp in model_params.items():
    clf_new = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf_new.fit(X_final, Y_final)
    scores.append({
        'model': model_name,
        'best_score': clf_new.best_score_,
        'best_params': clf_new.best_params_
    })

df_new = pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])
df_new



,model,best_score,best_params
0,svm,0.833646,"{'C': 20, 'kernel': 'rbf'}"
1,random_forest,0.812566,{'n_estimators': 10}
2,logistic_regression,0.808183,{'C': 1}
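
The scores above come from data that was scaled once, before cross-validation, so each validation fold influenced the scaler. A common way to avoid that, sketched here under the assumption of a synthetic stand-in dataset from `make_classification` (not heart.csv), is to put the scaler and the model in a `Pipeline` so the scaler is refit on each training fold. `X_demo`, `y_demo`, `pipe`, and `search` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (assumption: 11 numeric features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),  # refit on the training part of every fold
    ("svm", SVC(gamma="auto")),
])

# The "svm__" prefix routes each parameter to the SVC step of the pipeline
param_grid = {"svm__C": [1, 10, 20], "svm__kernel": ["rbf", "linear"]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern should extend to the other two models by swapping the final pipeline step and its parameter prefix.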


Results using the PCA-reduced feature array

In [26]:
from sklearn.decomposition import PCA
pca=PCA(n_components=2)
X_pca=pca.fit_transform(X_final)

In [27]:
scores_new = []

# Repeat the same grid search on the two PCA components
for model_name, mp in model_params.items():
    clf_new = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf_new.fit(X_pca, Y_final)
    scores_new.append({
        'model': model_name,
        'best_score': clf_new.best_score_,
        'best_params': clf_new.best_params_
    })

df_pca = pd.DataFrame(scores_new, columns=['model', 'best_score', 'best_params'])
df_pca



,model,best_score,best_params
0,svm,0.841418,"{'C': 1, 'kernel': 'rbf'}"
1,random_forest,0.797053,{'n_estimators': 10}
2,logistic_regression,0.822591,{'C': 1}
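
The run above fixes `n_components=2`. One way to choose the number of components less arbitrarily is to look at the cumulative explained variance, or to pass a fraction to `PCA` and let it pick the count. The sketch below assumes a synthetic dataset from `make_classification` in place of heart.csv, so its numbers say nothing about the real data; `X_demo`, `pca_full`, and `pca_95` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (assumption: 11 features)
X_demo, _ = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit all components and inspect how much variance each prefix retains
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(np.round(cumulative, 3))

# A float in (0, 1) asks PCA to keep the smallest number of components
# whose cumulative explained variance reaches that fraction
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print(pca_95.n_components_)
```

Comparing cross-validated accuracy at a few component counts then makes the accuracy versus computation trade-off from the exercise text concrete.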
