<a href="https://colab.research.google.com/github/Arjunmukundann/Heart-Disease-Prediction/blob/main/Heart_disease_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Heart Disease Prediction


In this machine learning project, I use the Heart Disease UCI dataset from Kaggle (https://www.kaggle.com/ronitf/heart-disease-uci) to predict whether a person has heart disease.

# Importing libraries


Let's first import all the necessary libraries. I'll use numpy and pandas to start with.  For implementing Machine Learning models and processing of data, I will use the sklearn library.

In [None]:
import numpy as np
import pandas as pd

To process the data, I'll import two utilities. To split the available dataset into training and test sets, I'll use the train_test_split function. To scale the features, I'll use StandardScaler.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Next, I'll import all the Machine Learning algorithms I will be using.



*   K Neighbors Classifier
*   Support Vector Classifier
*   Decision Tree Classifier
*   Random Forest Classifier


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Importing dataset


Now that we have all the libraries we will need, I can import the dataset and take a look at it. The dataset is stored in the file dataset.csv. I'll use the pandas read_csv method to read the dataset.



In [None]:
dataset = pd.read_csv('dataset.csv')


The dataset is now loaded into the variable dataset. I'll take a quick look at the data using the info() method before I start processing and visualizing it.


<class 'pandas.core.frame.DataFrame'>
* RangeIndex: 303 entries, 0 to 302
* Data columns (total 14 columns):
* age ---- 303 non-null int64
* sex ---- 303 non-null int64
* cp ---- 303 non-null int64
* trestbps ---- 303 non-null int64
* chol ---- 303 non-null int64
* fbs ---- 303 non-null int64
* restecg ---- 303 non-null int64
* thalach ---- 303 non-null int64
* exang ---- 303 non-null int64
* oldpeak ---- 303 non-null float64
* slope ---- 303 non-null int64
* ca ---- 303 non-null int64
* thal ---- 303 non-null int64
* target ---- 303 non-null int64
* dtypes: float64(1), int64(13)
* memory usage: 33.3 KB



Looks like the dataset has a total of 303 rows and there are no missing values. There are a total of 13 features along with one target value which we wish to find.
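
The "no missing values" reading of info() can also be checked directly, along with how many rows fall into each class. This is a minimal sketch on a made-up frame named toy (a hypothetical stand-in, not the real dataset.csv); the same two calls would work on the dataset variable.

```python
import pandas as pd

# Toy frame standing in for dataset.csv (values are made up for illustration)
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 0, 0],
})

missing_per_column = toy.isnull().sum()      # number of NaN values in each column
class_counts = toy["target"].value_counts()  # number of rows per class label

print(missing_per_column)
print(class_counts)
```

A strongly unbalanced class_counts would be a reason to look beyond plain accuracy when comparing models.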

# Splitting the dataset into the Training set and Test set

I'll now separate the 13 features from the target column and use train_test_split to hold out 25% of the rows as a test set. The remaining 75% are used to train the models.

In [None]:
from sklearn.model_selection import train_test_split
X, y = dataset.drop(columns='target'), dataset['target']  # assumption: features are all columns except 'target'
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Data Processing


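After exploring the dataset, I observed that I need to convert some categorical variables into dummy variables and scale all the values before training the Machine Learning models. I will use StandardScaler from sklearn to scale the dataset. The scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.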

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
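
The dummy-variable conversion mentioned above is not shown in this notebook. A minimal sketch of what it could look like with pandas get_dummies follows. The toy frame and the choice of cp and thal as the categorical columns are assumptions made for illustration; in a full workflow this step would run before the train/test split.

```python
import pandas as pd

# Hypothetical mini-dataset; treating cp and thal as categorical is an assumption
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "cp": [3, 2, 1, 1],
    "thal": [1, 2, 2, 3],
    "target": [1, 1, 0, 0],
})

# Replace each listed column with one indicator column per category value,
# e.g. cp becomes cp_1, cp_2, cp_3
encoded = pd.get_dummies(toy, columns=["cp", "thal"])

print(encoded.columns.tolist())
```

The original cp and thal columns are dropped and replaced by their indicator columns, while age and target are left unchanged.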

# Support Vector Machine

The Support Vector Classifier supports several kernels. I'll test the linear kernel.


In [None]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)
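
Other kernels could be compared in the same way. The sketch below loops over the four built-in SVC kernels on synthetic data from make_classification, which is an assumed stand-in for the heart disease data; the names X_demo, y_demo, and kernel_scores are illustrative, and the scores will not match the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the dataset (303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Fit the scaler on the training part only, then apply it to both parts
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = SVC(kernel=kernel, random_state=0)
    model.fit(X_tr, y_tr)
    kernel_scores[kernel] = model.score(X_te, y_te)  # test-set accuracy

print(kernel_scores)
```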

Now I'll compute the confusion matrix for the SVM, followed by cross-validation.


**Confusion Matrix**

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[24  9]

 [ 2 41]]

0.8552631578947368

The above matrix indicates that,


* True Negatives - 24
* False Positives - 9
* False Negatives - 2
* True Positives - 41

* *Key insights*:

  * False Positives (9): These are cases incorrectly flagged as "Disease" when there is none.
  * False Negatives (2): These are cases missed by the model (it predicts "No Disease" when there is actually a disease). Only 2 of the 43 actual disease cases in the test set are missed, a recall of 41/43 ≈ 95.3%.
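
The precision, recall, and F1 figures quoted for the SVM follow directly from the four counts above. A short sketch of the arithmetic, using the confusion matrix printed earlier (rows are actual classes and columns are predicted classes, the layout scikit-learn uses):

```python
import numpy as np

# SVM confusion matrix from the output above
cm = np.array([[24, 9],
               [2, 41]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted "Disease" cases, how many are real
recall = tp / (tp + fn)      # of the real "Disease" cases, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# prints: precision=0.820 recall=0.953 f1=0.882 accuracy=0.855
```

The same four lines apply to the confusion matrices of the other models.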






**Cross Validation**

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')

print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())
print("Standard Deviation:", scores.std())


Cross-Validation Scores: [0.81967213 0.8852459  0.80327869 0.86666667 0.76666667]


Mean Accuracy: 0.8283060109289618

Standard Deviation: 0.042927871064461436


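1. Mean Accuracy (82.83%): On average, the model classifies about 83% of held-out samples correctly across the 5 folds, slightly below its test set accuracy (85.53%).
2. Standard Deviation (4.29%): Fold accuracies range from 76.67% to 88.52%. A relatively low standard deviation means the model's performance is fairly stable across the different data splits.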


**Observation**

* Best recall of the four models (95.3%), crucial in medical diagnosis, where missing a true disease case is costly.
* Highest cross-validation accuracy (82.83%) and lowest standard deviation (4.29%), making it the most stable and generalizable model.
* Strong F1-Score (88.2%), close to k-NN (88.6%).
* Slightly lower precision (82%) compared to k-NN (86.7%).

**Conclusion**

   * SVM is the best overall model due to its excellent generalization, stability, and high recall, making it ideal for medical applications.

# K Neighbors classifier

The classification score varies based on different values of neighbors that we choose.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 8, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
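
To see how the score varies with the number of neighbors, one option is to cross-validate a range of values and keep the best. The sketch below does this on synthetic data (an assumed stand-in for the heart disease data); the range 1 to 20 and the names k_scores and best_k are illustrative choices, not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the dataset (303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

k_scores = {}
for k in range(1, 21):
    # Scaling sits inside the pipeline so it is refit on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    k_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()

best_k = max(k_scores, key=k_scores.get)  # k with the highest mean CV accuracy
print(best_k, k_scores[best_k])
```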

Now I'll compute the confusion matrix for k-NN, followed by cross-validation.


**Confusion Matrix**

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[27  6]

 [ 4 39]]

0.868421052631579

The above matrix indicates that,

1. True Negatives - 27
2. False Positives - 6
3. False Negatives - 4
4. True Positives - 39

* *Key insights*:
  * True Negatives (27): k-NN correctly identifies more cases as "No Disease" than the SVM, Decision Tree, and Random Forest models (24 each).
  * False Positives (6): The lowest among the four models.
  * False Negatives (4): Matches the Random Forest model, but worse than SVM (2).



**Cross Validation**

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')

print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())
print("Standard Deviation:", scores.std())


Cross-Validation Scores: [0.57377049 0.63934426 0.62295082 0.73333333 0.53333333]

Mean Accuracy: 0.6205464480874316

Standard Deviation: 0.06763747153255116

1. Mean Accuracy (62.05%): Significantly lower than the SVM (82.83%) and Random Forest (79.21%) models, showing that the k-NN model struggles to generalize well on different splits of the data.
2. Standard Deviation (6.76%): Higher than SVM (4.29%), indicating less stability across folds.

* **Observation**
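 * *High Test Accuracy*: While k-NN achieves the highest test set accuracy (86.84%), its cross-validation mean accuracy is much lower (62.05%). Note that the test accuracy was measured on scaled features, whereas cross_val_score above receives the unscaled X. Because k-NN is distance-based, the missing scaling likely explains much of this gap, in addition to any overfitting to this particular split.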
 * *Low Stability*: The cross-validation results show that k-NN performs inconsistently across different data splits.
 * *Strengths*: Very low false positives (6), making it better for avoiding unnecessary follow-ups.
 * *Weaknesses*: The poor generalization ability (as shown by the low cross-validation mean accuracy) is a significant drawback.
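
One way to make the cross-validation comparable to the scaled test-set evaluation is to wrap the scaler and the classifier in a pipeline, so that scaling is refit inside every fold. The sketch below is an illustration on synthetic data with deliberately mismatched feature ranges (an assumption standing in for columns such as age, chol, and oldpeak); it does not reproduce this notebook's numbers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; each column is multiplied by a different factor
# (1 up to 1000) so the features live on very different ranges
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_demo = X_demo * np.logspace(0, 3, 13)

# Cross-validation on raw, unscaled features
knn = KNeighborsClassifier(n_neighbors=8)
raw_scores = cross_val_score(knn, X_demo, y_demo, cv=5, scoring="accuracy")

# Cross-validation with the scaler refit on each training fold
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=8))
scaled_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5, scoring="accuracy")

print("raw:   ", raw_scores.mean())
print("scaled:", scaled_scores.mean())
```

Passing the pipeline instead of the bare classifier also avoids fitting the scaler on data from the validation fold.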

**Conclusion**

*  k-NN achieves the best precision and test accuracy, but its low cross-validation accuracy suggests that it generalizes poorly as configured here.



# Decision Tree Classification

Here, I'll use the Decision Tree Classifier to model the problem at hand.

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

Now I'll compute the confusion matrix for the Decision Tree classifier, followed by cross-validation.



**Confusion Matrix**

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[24  9]

 [ 6 37]]

0.8026315789473685


The above matrix indicates that,

1. True Negatives - 24
2. False Positives - 9
3. False Negatives - 6
4. True Positives - 37

* *Key insights*:

 *  True Negatives (24): The Decision Tree correctly identifies 24 non-disease cases.
 *  False Positives (9): Same as SVM and Random Forest, and higher than k-NN (6), meaning the model flags 9 healthy patients as having the disease.
 *  False Negatives (6): This is higher than both k-NN and SVM (4 and 2, respectively), meaning the Decision Tree misses more true disease cases.
 *  True Positives (37): Correctly predicts "Disease" in 37 cases.

**Cross Validation**

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())
print("Standard Deviation:", scores.std())

Cross-Validation Scores: [0.7704918  0.85245902 0.73770492 0.73333333 0.66666667]

Mean Accuracy: 0.7521311475409835

Standard Deviation: 0.060445754781489364

1. Mean Accuracy (75.21%): Higher than k-NN (62.05%) but lower than Random Forest (79.21%) and SVM (82.83%).
2. Standard Deviation (6.04%): Similar to Random Forest (6.13%) and k-NN (6.76%), indicating some variability in performance across folds.

**Observations**
 * *False Negatives*: The Decision Tree has more false negatives (6) than the other models, which is a concern in a medical diagnosis setting where missing positive cases is costly.
 * *Test Set Accuracy*: At 80.26%, the test set accuracy is decent but the lowest of the four models, below k-NN (86.84%), SVM (85.53%), and Random Forest (82.89%).
 * *Cross-Validation*: The Decision Tree's mean cross-validation accuracy (75.21%) is better than k-NN's (62.05%) but worse than Random Forest's (79.21%). Its standard deviation (6.04%) is comparable to both.

**Conclusion**

* Decision Tree is the weakest model in this comparison on the test set, with the lowest accuracy (80.26%) and the lowest recall (37/43 ≈ 86.0%).

# Random Forest Classification

Now, I'll use the ensemble method, Random Forest Classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
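
The cell above fixes n_estimators at 10. A sketch of how such hyperparameters could be searched with GridSearchCV follows; the parameter grid, the synthetic data, and the names param_grid and search are assumptions for illustration, not the settings used to produce the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with the same shape as the dataset (303 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# Hypothetical grid: number of trees and maximum tree depth
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [None, 3, 5]}

search = GridSearchCV(
    RandomForestClassifier(criterion="entropy", random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

After fitting, search.best_estimator_ holds the model refit on all the data with the best combination found.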

Now I'll compute the confusion matrix for the Random Forest classifier, followed by cross-validation.



**Confusion Matrix**

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[24  9]

 [ 4 39]]

0.8289473684210527

The above matrix indicates that,

1. True Negatives - 24
2. False Positives - 9
3. False Negatives - 4
4. True Positives - 39

*Key insights*:

   * False Positives (9): Incorrectly predicted as "Disease" when there is no disease.
   * False Negatives (4): Cases where the model missed predicting "Disease" correctly (a higher number than SVM's 2 false negatives).


**Cross Validation**




In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')

print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())
print("Standard Deviation:", scores.std())

Cross-Validation Scores: [0.81967213 0.85245902 0.70491803 0.85       0.73333333]

Mean Accuracy: 0.7920765027322405

Standard Deviation: 0.061335237887616995

1. Mean Accuracy (79.21%): Lower than both the Random Forest test accuracy (82.89%) and the SVM model's cross-validation mean accuracy (82.83%).
2. Standard Deviation (6.13%): Higher than the SVM model's standard deviation (4.29%), indicating the Random Forest model is less stable across different folds.

**Observations**:
* *Performance*: The Random Forest model is solid but slightly underperforms compared to the SVM model, both in test accuracy and cross-validation scores.
* *False Negatives*: Random Forest has higher false negatives (4) compared to SVM (2), which could be crucial in a medical context where missing actual cases is more severe than false alarms.
* *Model Stability*: The higher standard deviation (6.13%) compared to SVM (4.29%) suggests the Random Forest model may be more sensitive to changes in the dataset.

**Conclusion:**
  * Random Forest provides a good trade-off between accuracy and recall but is slightly weaker than SVM in generalization.

# Conclusion


In this project, I applied Machine Learning techniques to predict whether a person is suffering from heart disease. After importing the dataset, I performed data preprocessing, including data visualization, generating dummy variables for categorical features, and scaling the numerical features for uniformity.

I implemented and compared the performance of four Machine Learning algorithms: k-Nearest Neighbors (k-NN), Support Vector Machine (SVM), Decision Tree Classifier, and Random Forest Classifier. I fine-tuned the hyperparameters for each model to optimize their performance.

Among the models, the k-Nearest Neighbors Classifier achieved the highest test accuracy of 87%, using 8 nearest neighbors. The Support Vector Machine showed strong generalization capability with the highest cross-validation score, making it another robust choice.

This project highlights the potential of Machine Learning in accurately predicting heart disease, contributing to more efficient and early diagnosis in the medical field.

