<a href="https://colab.research.google.com/github/SrijanxxxSharma/Heart_disease_classification/blob/main/Heart_disease_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Heart disease dataset**

### This dataset can be found in my GitHub repository as __.

Looking at the data, it is clear that this is a classification problem (the target column is categorical). Eight of the feature columns are categorical (0, 1, 2, 3, ...), while the remaining five are continuous numeric columns.

Since we don't have many features, we won't use any dimensionality reduction (DR). We'll only scale the five continuous features, using min-max scaling to the range [0, 4].

In [114]:
import pandas as pd
import numpy as np
url = "https://github.com/SrijanxxxSharma/Heart_disease_classification/raw/main/heart-disease-dataset.csv"
raw_data=pd.read_csv(url)

# Now we will use train test split to create 3 sets
1. Training set (70% of data)

2. Test set (20% of data)

3. Validation set (10% of data)

In [115]:
from sklearn.model_selection import train_test_split

target = raw_data.target
features = raw_data.drop("target", axis=1)

# Split the data into a training set (70%) and a remaining set (30%)
X_train, X_remain, y_train, y_remain = train_test_split(features, target, test_size=0.30, random_state=42)

# Split the remaining set into test (about 20% of all data) and validation (about 10%) sets
X_test, X_val, y_test, y_val = train_test_split(X_remain, y_remain, test_size=0.34, random_state=42)

# Reset the indices so that preprocess() can concatenate the scaled columns row by row
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
X_val = X_val.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
y_val = y_val.reset_index(drop=True)

# Show the shape of each set and the training features
print(X_train.shape, X_test.shape, X_val.shape)
print(X_train)



(717, 13) (203, 13) (105, 13)
     age  sex  cp  trestbps  chol  ...  exang  oldpeak  slope  ca  thal
0     59    1   1       140   221  ...      1      0.0      2   0     2
1     58    1   0       128   216  ...      1      2.2      1   3     3
2     44    0   2       118   242  ...      0      0.3      1   1     2
3     50    1   2       140   233  ...      0      0.6      1   1     3
4     43    0   2       122   213  ...      0      0.2      1   0     2
..   ...  ...  ..       ...   ...  ...    ...      ...    ...  ..   ...
712   41    1   2       130   214  ...      0      2.0      1   0     2
713   61    1   0       140   207  ...      1      1.9      2   1     3
714   51    1   0       140   299  ...      1      1.6      2   0     3
715   43    1   0       110   211  ...      0      0.0      2   0     3
716   52    1   0       112   230  ...      0      0.0      2   1     2

[717 rows x 13 columns]
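
As a quick check on the split sizes, the sketch below reproduces the row counts printed above from the two `test_size` values. It assumes that `train_test_split` rounds a fractional test size up to the next whole row, and it takes the total number of rows to be the sum of the three printed shapes. The variable names are illustrative only.

```python
import math

n_rows = 717 + 203 + 105  # total rows, taken from the shapes printed above

# First split: 30% of all rows are held out (assumed to be rounded up)
n_remain = math.ceil(0.30 * n_rows)
n_train = n_rows - n_remain

# Second split: 34% of the held-out rows become the validation set
n_val = math.ceil(0.34 * n_remain)
n_test = n_remain - n_val

# Row counts, then each set as a fraction of the whole dataset
print(n_train, n_test, n_val)
print(round(n_train / n_rows, 3), round(n_test / n_rows, 3), round(n_val / n_rows, 3))
```

This is why the second split uses 0.34 rather than 0.10: the validation share is taken from the 30% that remains, and 0.34 of 0.30 is roughly 10% of the full dataset.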


# Preprocessing

Now we extract the continuous variables and convert them to NumPy arrays so that we can scale them.

In [116]:
def preprocess(raw_data):
  # Columns holding continuous values; the remaining columns are categorical
  columns = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

  # Extract each continuous column as a NumPy array
  continuous_data = [raw_data[member].to_numpy() for member in columns]

  # Min-max scale each column to the range [0, 4].
  # Note: the min and max come from the data passed in, so each set is scaled by its own range.
  names = ['Scaled_' + column for column in columns]
  for name, feature in zip(names, continuous_data):
    new = 4*(feature-np.min(feature))/(np.max(feature)-np.min(feature))
    #new=(4*feature-feature.mean())/feature.std()
    # Relies on raw_data having a default 0..n-1 index (see reset_index above)
    raw_data = pd.concat([raw_data, pd.DataFrame(new, columns=[name])], axis=1)

  # Replace the original continuous columns with their scaled versions
  raw_data = raw_data.drop(columns, axis=1)
  return raw_data

scaled_data = preprocess(X_train)
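
One thing to keep in mind with `preprocess` is that it takes the minimum and maximum from whatever data it receives, so the training and test sets are each scaled by their own range. A common alternative is to compute the range on the training set only and reuse it for the other sets. A minimal sketch of that idea follows. The helper names `fit_min_max` and `apply_min_max` and the toy values are hypothetical and not part of the original notebook.

```python
import numpy as np

def fit_min_max(train_values):
    """Return the (min, max) of a training column."""
    return float(np.min(train_values)), float(np.max(train_values))

def apply_min_max(values, lo, hi, scale=4.0):
    """Scale values to [0, scale] using a previously fitted range."""
    return scale * (np.asarray(values, dtype=float) - lo) / (hi - lo)

# Toy ages, for illustration only
train_age = np.array([40, 50, 60, 80])
test_age = np.array([45, 70])

# Fit the range on the training column, then apply it to both columns
lo, hi = fit_min_max(train_age)
scaled_train = apply_min_max(train_age, lo, hi)
scaled_test = apply_min_max(test_age, lo, hi)
print(scaled_train)
print(scaled_test)
```

With this approach, test values that fall outside the training range map outside [0, 4], which is expected.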

## Error analysis
We use mean squared error and mean absolute error, expressed as percentages, to analyse our model.


In [117]:
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as err

# Mean squared error as a percentage
def sq_err_per(original, predictions):
  return 100*mse(original, predictions)

# Mean absolute error as a percentage
def err_per(original, predictions):
  return 100*err(original, predictions)
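
When the labels are 0 and 1, every wrong prediction contributes an error of exactly 1, so the mean squared error and the mean absolute error both equal the fraction of misclassified samples. Subtracting either percentage from 100 then reads as accuracy in percent. The plain-Python sketch below illustrates this on made-up labels. It assumes a binary 0/1 target and does not use the helpers defined in this notebook.

```python
# Made-up binary labels, for illustration only
original = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

n = len(original)
# Mean squared error, mean absolute error and accuracy, each as a fraction
mse_value = sum((o - p) ** 2 for o, p in zip(original, predictions)) / n
mae_value = sum(abs(o - p) for o, p in zip(original, predictions)) / n
accuracy = sum(o == p for o, p in zip(original, predictions)) / n

# All three printed values agree for 0/1 labels
print(100 - 100 * mse_value, 100 - 100 * mae_value, 100 * accuracy)
```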

# Model
Now we have nicely scaled data in `scaled_data`. We will train our model with the following classifiers:

1. K Nearest Neighbours

2. Logistic Regression

3. Naïve Bayes

4. Stochastic Gradient Descent

5. Support Vector Machine

6. Decision Tree

7. Random Forest

8. XGBoost

## K Nearest Neighbours

This classifier is based on a simple concept: it labels a point according to the labels of its nearest neighbours, measured by Euclidean distance. Here K is the number of neighbours to consider.
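
To make the idea concrete, here is a minimal from-scratch sketch of the voting rule, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy points are hypothetical. The notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = [(math.dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Keep the k closest neighbours and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D points: label 0 clusters near the origin, label 1 near (5, 5)
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```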

### Curse of Dimensionality

KNN performs better with a small number of features than with a large number. As the number of features increases, more data is required. An increase in dimension also leads to overfitting, and to avoid it the amount of data needed grows exponentially with the number of dimensions. This problem is known as the Curse of Dimensionality.
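
A rough numeric illustration of this growth is sketched below. It assumes data spread uniformly over a unit hypercube and a fixed resolution of 10 points per feature axis. Both are simplifying assumptions chosen for illustration, and the function names are hypothetical.

```python
def samples_needed(n_features, points_per_axis=10):
    """Samples needed to keep `points_per_axis` points along every feature axis."""
    return points_per_axis ** n_features

def edge_length(fraction, n_features):
    """Edge of the sub-cube that contains `fraction` of uniformly spread data."""
    return fraction ** (1.0 / n_features)

# Show how both quantities grow with the number of features
for d in (1, 2, 5, 13):  # 13 is the number of features in this dataset
    print(d, samples_needed(d), round(edge_length(0.01, d), 3))
```

Under these assumptions, with 13 features a neighbourhood that holds just 1% of the data already spans roughly 70% of each feature's range, so the "nearest" neighbours are no longer very local.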

In [None]:

# Import the relevant model from sklearn
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=10)

# Fit the model on the scaled training data
model.fit(scaled_data, y_train)
# Predict for the test set, remembering to preprocess it in the same way first
predictions = model.predict(preprocess(X_test))

# Print 100 minus each percentage error
print(100 - sq_err_per(y_test, predictions))
print(100 - err_per(y_test, predictions))

# Logistic regressor