<a href="https://colab.research.google.com/github/SrijanxxxSharma/Heart_disease_classification/blob/main/Heart_disease_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Heart disease dataset**

### This data set can be found in my GitHub repository at https://github.com/SrijanxxxSharma/Heart_disease_classification.

Looking at the data, it is clear that this is a classification problem (the target column is categorical). Eight of the feature columns are categorical (0, 1, 2, 3, ...), while the remaining five are continuous numeric columns.

Since we don't have many features, we won't use any dimensionality reduction (DR). We'll just scale the five continuous features (min-max scaling).

In [21]:
import pandas as pd
import numpy as np
url = "https://github.com/SrijanxxxSharma/Heart_disease_classification/raw/main/heart-disease-dataset.csv"
raw_data=pd.read_csv(url)

# Now we will use train test split for creating 3 sets
1. Training set (70% of data)

2. Test set (20% of data)

3. Validation set (10% of data)

In [22]:
from sklearn.model_selection import train_test_split

target=raw_data.target
features=raw_data.drop("target",axis=1)

# Splitting the data into train and remain sets (70% / 30%)
X_train, X_remain, y_train, y_remain = train_test_split(features, target, test_size=0.30, random_state=42)

# Splitting the remaining 30% into test and validation sets (66% / 34% of the remainder),
# which gives roughly 20% and 10% of the full data

X_test, X_val, y_test, y_val = train_test_split(X_remain, y_remain, test_size=0.34, random_state=42)
X_train.reset_index(inplace=True,drop=True)
X_test.reset_index(inplace=True,drop=True)
X_val.reset_index(inplace=True,drop=True)
y_train=y_train.reset_index(drop=True)
y_test=y_test.reset_index(drop=True)
y_val=y_val.reset_index(drop=True)
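
The two test_size values above can be checked with a little arithmetic. This is only a sketch of the fractions involved; the exact row counts also depend on how train_test_split rounds the set sizes.

```python
# Fractions used in the two train_test_split calls above
first_test_size = 0.30   # share of the full data held out as "remain"
second_test_size = 0.34  # share of "remain" that becomes the validation set

train_frac = 1 - first_test_size
val_frac = first_test_size * second_test_size
test_frac = first_test_size * (1 - second_test_size)

print(f"train: {train_frac:.3f}")  # 0.700
print(f"test:  {test_frac:.3f}")   # 0.198
print(f"val:   {val_frac:.3f}")    # 0.102
```

So the split is close to the intended 70% / 20% / 10%.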



# Preprocessing






Now we extract the continuous variables and convert them to NumPy arrays so that we can scale them.

In [23]:
def preprocess(raw_data):
  # Continuous columns to rescale; the categorical columns are kept as they are
  columns=['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

  for column in columns:
    feature=raw_data[column].to_numpy()
    # Min-max scaling to the range [0, 3], using the min and max of the data passed in
    new=3*(feature-np.min(feature))/(np.max(feature)-np.min(feature))
    #new=(4*feature-feature.mean())/feature.std()
    # Append the scaled column; this relies on raw_data having a 0..n-1 index (hence reset_index earlier)
    raw_data=pd.concat([raw_data,pd.DataFrame(new, columns=['Scaled_'+column])], axis=1)

  # Drop the original, unscaled columns
  raw_data=raw_data.drop(columns,axis=1)
  return raw_data

scaled_data = preprocess(X_train) 
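
Note that preprocess rescales whatever set it receives with that set's own minimum and maximum, so the training and test sets end up scaled with slightly different parameters. A minimal sketch of one alternative is to compute the minimum and maximum on the training column only and reuse them for the other sets. The helpers fit_min_max and apply_min_max and the toy values are hypothetical and are not part of this notebook.

```python
import numpy as np

def fit_min_max(train_column):
    """Return the (min, max) of a training column; these are the fitted parameters."""
    return float(np.min(train_column)), float(np.max(train_column))

def apply_min_max(column, col_min, col_max, upper=3.0):
    """Rescale a column to [0, upper] using previously fitted min and max."""
    return upper * (np.asarray(column, dtype=float) - col_min) / (col_max - col_min)

# Toy values standing in for one continuous feature (not taken from the dataset)
train_age = np.array([30.0, 45.0, 60.0, 75.0])
test_age = np.array([40.0, 50.0])

# Fit on the training column only, then apply the same parameters to both sets
age_min, age_max = fit_min_max(train_age)
scaled_train = apply_min_max(train_age, age_min, age_max)
scaled_test = apply_min_max(test_age, age_min, age_max)

print(scaled_train)
print(scaled_test)
```

With this approach a test value outside the training range can fall outside [0, 3], which is expected. The accuracies reported in this notebook were obtained with preprocess as defined above, not with this variant.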

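## Error analysis
We use mean squared error and mean absolute error to analyse our models. Because the labels are 0 or 1, both errors equal the fraction of misclassified samples, so 100 minus the percentage error is the accuracy in percent.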


In [24]:
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as err
def sq_err_per(original,predictions):
  return 100*mse(original,predictions)

def err_per(original,predictions):
  return 100*err(original,predictions)
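
As a quick check of why sq_err_per and err_per give the same number for 0/1 labels, the sketch below rebuilds label lists from the confusion matrix reported for the KNN model in this notebook and compares the three quantities. It uses plain Python instead of sklearn so that it runs on its own.

```python
# Confusion matrix reported for the KNN model (rows: true class, columns: predicted class)
cm = [[96, 16],
      [13, 78]]

# Rebuild true and predicted label lists that produce this matrix
y_true, y_pred = [], []
for true_label in (0, 1):
    for pred_label in (0, 1):
        count = cm[true_label][pred_label]
        y_true += [true_label] * count
        y_pred += [pred_label] * count

n = len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

print(100 - 100 * mse)   # 100 minus squared error percentage
print(100 - 100 * mae)   # 100 minus absolute error percentage
print(100 * accuracy)    # accuracy in percent
```

All three printed values agree, because for 0/1 labels a wrong prediction contributes exactly 1 to both the squared and the absolute error.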

# Now the Model
Now we have nicely scaled data in scaled_data. We will train the following classifiers:

1. K Nearest Neighbours

2. Logistic Regression

3. Naïve Bayes

4. Support Vector Machine

5. Decision Tree

6. Random Forest

7. XGBoost

## K Nearest Neighbours (85.7% accuracy)

This classifier is based on a simple concept: it labels a point by looking at the labels of its nearest neighbours, measured by Euclidean distance. Here K is the number of neighbours to consider.
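
The idea can be sketched in a few lines of NumPy. This is only an illustration with made-up 2-D points; knn_predict is a hypothetical helper, not the sklearn implementation used below.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy 2-D points (made up for illustration): class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.5], [5.0, 5.0], [5.5, 4.5], [4.5, 5.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.8, 0.8])))  # 0
print(knn_predict(X_toy, y_toy, np.array([5.2, 5.1])))  # 1
```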

# Curse of Dimensionality

KNN performs better with a small number of features than with a large number. As the number of features increases, more data is required. An increase in dimension also leads to the problem of overfitting: to avoid it, the amount of data needed grows exponentially with the number of dimensions. This problem of higher dimensions is known as the Curse of Dimensionality.
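
One way to see the effect is to look at how distances behave as the number of features grows. The sketch below uses uniformly random points (an assumption made purely for illustration) and compares the nearest and the farthest distance from a random query point.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_features, n_points=500):
    """Ratio of the nearest to the farthest Euclidean distance from a random query point."""
    points = rng.random((n_points, n_features))
    query = rng.random(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"{d:>4} features: nearest/farthest distance ratio = {ratio:.3f}")
```

In high dimensions the ratio moves towards 1, meaning the nearest and the farthest points are almost equally far away, which makes the notion of a nearest neighbour much less informative.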

In [25]:

# Import the relevant model from sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
model = KNeighborsClassifier(n_neighbors=10)

# Fit the model on the scaled training data
model.fit(scaled_data,y_train)
# Predict for the test set, remembering to preprocess it first, otherwise the features will not match
predictions = model.predict(preprocess(X_test))
cm = metrics.confusion_matrix(y_test, predictions) # Confusion matrix

# Printing accuracy (100 - error percentage) and the confusion matrix
print((100-sq_err_per(y_test, predictions)))
print((100-err_per(y_test, predictions)))
print(cm)

85.71428571428572
85.71428571428572
[[96 16]
 [13 78]]


# Logistic regression (79.8% accuracy)

Logistic Regression is used when the dependent variable (target) is categorical.
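
Logistic regression turns a weighted sum of the features into a probability with the sigmoid function and then applies a threshold to get a class label. Here is a minimal sketch of that prediction step. The weights and bias are hypothetical values chosen by hand, not fitted to this dataset, and predict_label is an illustrative helper.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Hypothetical weights and bias, chosen by hand for illustration
weights = [1.5, -2.0]
bias = 0.25

print(predict_label([2.0, 0.5], weights, bias))  # z = 2.25, so the label is 1
print(predict_label([0.0, 1.0], weights, bias))  # z = -1.75, so the label is 0
```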

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
model2=LogisticRegression()
model2.fit(scaled_data,y_train)

# Predict for the test set, remembering to preprocess it first
predictions = model2.predict(preprocess(X_test))

cm = metrics.confusion_matrix(y_test, predictions) # Confusion matrix

# Printing accuracy (100 - error percentage) and the confusion matrix
print((100-sq_err_per(y_test, predictions)))
print((100-err_per(y_test, predictions)))
print(cm)

79.80295566502463
79.80295566502463
[[79 33]
 [ 8 83]]


# Naïve Bayes (78.3% accuracy)
A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks. The crux of the classifier is Bayes' theorem.

We make two assumptions here. First, we consider the predictors to be independent of each other given the class. For example, in a weather dataset, a hot temperature does not necessarily mean that the humidity is high. Second, all the predictors are assumed to have an equal effect on the outcome.
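
To make the independence assumption concrete, here is a small from-scratch sketch of a Gaussian Naive Bayes classifier on made-up data. The helper names are hypothetical and this is not the sklearn implementation used below. The per-feature log likelihoods are simply added up, which is exactly where the independence assumption enters.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for every class."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[int(label)] = {
            "prior": len(rows) / len(X),
            "mean": rows.mean(axis=0),
            "var": rows.var(axis=0) + 1e-9,  # small constant avoids division by zero
        }
    return params

def predict_gaussian_nb(params, x):
    """Pick the class with the largest log prior plus summed per-feature log likelihoods."""
    scores = {}
    for label, p in params.items():
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"])
        scores[label] = np.log(p["prior"]) + log_likelihood
    return max(scores, key=scores.get)

# Toy data (made up for illustration): two well separated classes
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(params, np.array([1.1, 2.1])))  # 0
print(predict_gaussian_nb(params, np.array([4.1, 4.9])))  # 1
```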

In [27]:
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

model3 = GaussianNB()
model3.fit(scaled_data,y_train)



# Predict for the test set, remembering to preprocess it first
predictions = model3.predict(preprocess(X_test))

cm = metrics.confusion_matrix(y_test, predictions) # Confusion matrix

# Printing accuracy (100 - error percentage) and the confusion matrix
print((100-sq_err_per(y_test, predictions)))
print((100-err_per(y_test, predictions)))
print(cm)

78.32512315270935
78.32512315270935
[[86 26]
 [18 73]]


# Support Vector Machines (85.7%)
Support Vector Machines (SVMs) are generally considered a classification approach, but they can be employed in both classification and regression problems. They can easily handle multiple continuous and categorical variables. An SVM constructs a hyperplane in multidimensional space to separate the different classes. It finds the optimal hyperplane iteratively, minimizing an error. The core idea of SVM is to find the maximum marginal hyperplane (MMH) that best divides the dataset into classes.
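
For a linear hyperplane w.x + b = 0, the side a point falls on is given by the sign of w.x + b, and the width of the margin is 2 / ||w||. The sketch below only illustrates these two quantities. The values of w and b are hypothetical and picked by hand, not learned from data. Also note that sklearn's SVC uses an RBF kernel by default, so in the model below the separating hyperplane lives in the kernel-induced feature space rather than in the original feature space.

```python
import numpy as np

def decision_value(x, w, b):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return float(np.dot(w, x) + b)

def margin_width(w):
    """Distance between the two margin boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / float(np.linalg.norm(w))

# Hypothetical hyperplane parameters, picked by hand for illustration
w = np.array([3.0, 4.0])
b = -10.0

print(margin_width(w))                             # 2 / ||w|| = 2 / 5 = 0.4
print(decision_value(np.array([4.0, 3.0]), w, b))  # 14.0, positive side
print(decision_value(np.array([1.0, 1.0]), w, b))  # -3.0, negative side
```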




In [28]:
from sklearn import svm
from sklearn import metrics
model4=svm.SVC()#SVC is support vector classifier
model4.fit(scaled_data,y_train)
predictions=model4.predict(preprocess(X_test))

cm=metrics.confusion_matrix(y_test,predictions)

# Printing accuracy (100 - error percentage) and the confusion matrix
print((100-sq_err_per(y_test, predictions)))
print((100-err_per(y_test, predictions)))
print(cm)


85.71428571428572
85.71428571428572
[[88 24]
 [ 5 86]]


# Decision Trees (81.7%)

A decision tree is a flowchart-like tree structure in which an internal node represents a feature (or attribute), a branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, and it does so recursively, which is called recursive partitioning. This flowchart-like structure helps in decision making.
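
The model below uses criterion="entropy", so each split is chosen by how much it reduces entropy (the information gain). Here is a small sketch of that calculation on made-up labels; the two helper functions are illustrative and not sklearn's internal code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up labels at a node, and two candidate ways of splitting them
parent = [0, 0, 0, 0, 1, 1, 1, 1]
mixed_gain = information_gain(parent, [0, 0, 0, 1], [0, 1, 1, 1])
pure_gain = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])

print(entropy(parent))  # 1.0 for a 50/50 node
print(mixed_gain)       # a small gain: the children are still mixed
print(pure_gain)        # 1.0: the split separates the classes perfectly
```

A tree built with this criterion prefers the second split, since it removes all of the uncertainty at that node.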

In [29]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
model5=DecisionTreeClassifier(criterion="entropy", max_depth=5)

model5.fit(scaled_data,y_train)
predictions=model5.predict(preprocess(X_test))

cm=metrics.confusion_matrix(y_test,predictions)

# Printing accuracy (100 - error percentage) and the confusion matrix
print((100-sq_err_per(y_test, predictions)))
print((100-err_per(y_test, predictions)))
print(cm)


81.77339901477832
81.77339901477832
[[82 30]
 [ 7 84]]


# Random forest classifier (94.5%)

Technically, it is an ensemble method (based on the divide-and-conquer approach) of decision trees generated on a randomly split dataset. This collection of decision tree classifiers is also known as the forest. The individual decision trees are generated using an attribute selection indicator such as information gain, gain ratio, or Gini index for each attribute.
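
Two ingredients of a random forest can be sketched without any library: drawing a bootstrap sample of rows for each tree, and combining the trees' predictions by voting. The helpers and the predictions below are made up for illustration. Majority voting is the usual textbook description; sklearn's RandomForestClassifier averages the trees' predicted class probabilities instead.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def bootstrap_indices(n_samples):
    """Draw n_samples row indices with replacement, as used to build one tree's training set."""
    return rng.integers(0, n_samples, size=n_samples)

def majority_vote(tree_predictions):
    """Combine per-tree label predictions (one row per tree) into one label per sample."""
    per_sample = np.asarray(tree_predictions).T
    return [Counter(votes.tolist()).most_common(1)[0][0] for votes in per_sample]

# Row indices one tree would be trained on (some rows repeat, some are left out)
indices = bootstrap_indices(10)
print(indices)

# Made-up predictions from three hypothetical trees for four samples
tree_predictions = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
print(majority_vote(tree_predictions))  # [0, 1, 1, 0]
```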

In [30]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

model6=RandomForestClassifier(n_estimators=100)

model6.fit(scaled_data,y_train)
predictions=model6.predict(preprocess(X_test))

cm=metrics.confusion_matrix(y_test,predictions)

# Printing accuracy (100 - error percentage) and the confusion matrix
print((100-sq_err_per(y_test, predictions)))
print((100-err_per(y_test, predictions)))
print(cm)

94.58128078817734
94.58128078817734
[[106   6]
 [  5  86]]


# XGBoost (84.7%)
XGBoost builds boosted trees. Boosting turns weak learners into strong learners by focusing on where the individual models (usually decision trees) went wrong. In gradient boosting, individual models are trained on the residuals, the differences between the predictions and the actual results. Instead of aggregating independently built trees, gradient boosted trees learn from the errors made in each boosting round.

The basic version is applied here for simplicity.
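
The idea of training on residuals can be sketched with very small trees (stumps) on made-up 1-D regression data. This is a simplified illustration of gradient boosting with squared error, not the XGBoost algorithm itself, which adds regularisation and other refinements. The helper names and the learning rate are assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold on a 1-D feature that best fits the residuals with two constants."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = residual[x <= threshold], residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    """Predict the left or right constant depending on the side of the threshold."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Toy regression data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 0.9, 3.1, 2.9, 3.0])

learning_rate = 0.5
prediction = np.full_like(y, y.mean())   # round 0: predict the mean everywhere
errors = [float(np.mean((y - prediction) ** 2))]
for _ in range(5):
    residual = y - prediction            # what the current ensemble still gets wrong
    stump = fit_stump(x, residual)       # the next weak learner is trained on the residuals
    prediction = prediction + learning_rate * predict_stump(stump, x)
    errors.append(float(np.mean((y - prediction) ** 2)))

print([round(e, 4) for e in errors])
```

The mean squared error shrinks from round to round, because each new stump corrects part of what the previous rounds got wrong.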

In [31]:
import xgboost as xgb
from sklearn import metrics

model7 = xgb.XGBClassifier(objective="binary:logistic")

model7.fit(scaled_data,y_train)
predictions=model7.predict(preprocess(X_test))

cm=metrics.confusion_matrix(y_test,predictions)

# Printing accuracy (100 - error percentage) and the confusion matrix
print((100-sq_err_per(y_test, predictions)))
print((100-err_per(y_test, predictions)))
print(cm)

84.72906403940887
84.72906403940887
[[87 25]
 [ 6 85]]


# Conclusion

With the scaled data (min-max scaling with some tweaks), we reached an accuracy of 94.5% using the Random Forest classifier.

## Note
#### Accuracy may vary with the model parameters and between runs, since no random seed is fixed for models such as the random forest.