<a href="https://colab.research.google.com/github/JogendraSingh1879/Random-Forest-Project/blob/main/Random_Forest_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Heart Disease Prediction**

# **INSPIRATION OF THE PROJECT**

The World Health Organization has estimated that 12 million deaths occur worldwide every year due to heart disease. Half of the deaths in the United States and other developed countries are due to cardiovascular diseases. Early prognosis of cardiovascular disease can inform decisions on lifestyle changes in high-risk patients and in turn reduce complications. This project aims to pinpoint the most relevant risk factors for heart disease and to predict the overall risk.

# **About Dataset (Meta data)**
# **Context**
This is a multivariate dataset, meaning that each record comprises several separate numerical or categorical variables. It is composed of 14 attributes: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels, thalassemia, and the predicted attribute. The full database includes 76 attributes, but all published studies use a subset of 14 of them. The Cleveland database is the only one that has been used by ML researchers to date. One major task on this dataset is to predict, from a patient's attributes, whether that person has heart disease. The other is exploratory: to analyze the dataset for insights that help in understanding the problem.

# **Content**
Column descriptions:

- id: unique id for each patient
- age: age of the patient in years
- origin: place of study
- sex: Male/Female
- cp: chest pain type
  - typical angina
  - atypical angina
  - non-anginal
  - asymptomatic
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: whether fasting blood sugar > 120 mg/dl
- restecg: resting electrocardiographic results
  - Values: [normal, stt abnormality, lv hypertrophy]
- thalach: maximum heart rate achieved
- exang: exercise-induced angina (True/False)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: [normal; fixed defect; reversible defect]
- num: the predicted attribute

# **Acknowledgements**
Creators:

- Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Relevant Papers:

- Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304--310.
- David W. Aha & Dennis Kibler. "Instance-based prediction of heart-disease presence with the Cleveland database."
- Gennari, J.H., Langley, P., & Fisher, D. (1989). Models of incremental concept formation. Artificial Intelligence, 40, 11--61.

Citation Request:

The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the data collection at each institution. They would be:

- Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

# **Aims & Objectives**
We will fill this in after some exploratory data analysis.

In [108]:
import pandas as pd
import numpy as np

# **Loading the Data Sets**

In [129]:
data = pd.read_csv('heart.csv')

In [130]:
data.head()

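| | age | sex | chest pain type | resting blood pressure | serum cholestoral | fasting blood sugar | resting electrocardiographic results | max heart rate | exercise induced angina | oldpeak | ST segment | major vessels | thal | heart disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 1 | 4 | 130 | 322 | 0 | 2 | 109 | 0 | 2.4 | 2 | 3 | 3 | 2 |
| 1 | 67 | 0 | 3 | 115 | 564 | 0 | 2 | 160 | 0 | 1.6 | 2 | 0 | 7 | 1 |
| 2 | 57 | 1 | 2 | 124 | 261 | 0 | 0 | 141 | 0 | 0.3 | 1 | 0 | 7 | 2 |
| 3 | 64 | 1 | 4 | 128 | 263 | 0 | 0 | 105 | 1 | 0.2 | 2 | 1 | 7 | 1 |
| 4 | 74 | 0 | 2 | 120 | 269 | 0 | 2 | 121 | 1 | 0.2 | 1 | 1 | 3 | 1 |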


# **SUMMARIZING THE DATASET**

In [131]:
data.shape

(270, 14)
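
Beyond the shape, a summary usually also covers column types, value ranges and the class distribution. The sketch below is only illustrative: it rebuilds the five rows printed by `data.head()` by hand so that it runs without `heart.csv`, and the name `sample` is introduced here for that purpose. On the full notebook the same calls would be applied to `data`.

```python
import pandas as pd

# The five rows printed by data.head(), re-entered by hand so the
# snippet runs without heart.csv.
columns = [
    "age", "sex", "chest pain type", "resting blood pressure",
    "serum cholestoral", "fasting blood sugar",
    "resting electrocardiographic results", "max heart rate",
    "exercise induced angina", "oldpeak", "ST segment",
    "major vessels", "thal", "heart disease",
]
rows = [
    [70, 1, 4, 130, 322, 0, 2, 109, 0, 2.4, 2, 3, 3, 2],
    [67, 0, 3, 115, 564, 0, 2, 160, 0, 1.6, 2, 0, 7, 1],
    [57, 1, 2, 124, 261, 0, 0, 141, 0, 0.3, 1, 0, 7, 2],
    [64, 1, 4, 128, 263, 0, 0, 105, 1, 0.2, 2, 1, 7, 1],
    [74, 0, 2, 120, 269, 0, 2, 121, 1, 0.2, 1, 1, 3, 1],
]
sample = pd.DataFrame(rows, columns=columns)

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # one dtype per column
print(sample.describe().loc[["mean", "min", "max"]])

# How many rows carry each target label.
class_counts = sample["heart disease"].value_counts()
print(class_counts)
```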

# **PRE-PROCESSING THE DATASET**

In [132]:
data.isnull().sum()

| column | null count |
|---|---|
| age | 0 |
| sex | 0 |
| chest pain type | 0 |
| resting blood pressure | 0 |
| serum cholestoral | 0 |
| fasting blood sugar | 0 |
| resting electrocardiographic results | 0 |
| max heart rate | 0 |
| exercise induced angina | 0 |
| oldpeak | 0 |
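
The check above reports no missing values for the columns shown, so no imputation is needed here. For comparison, the sketch below shows one common way such gaps could be handled if they did appear. The frame `toy` and its values are hypothetical and are not taken from `heart.csv`, and median filling is just one option among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap; not taken from heart.csv.
toy = pd.DataFrame({
    "age": [70, 67, np.nan, 64],
    "oldpeak": [2.4, 1.6, 0.3, 0.2],
})

# Same check as in the notebook: number of missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Fill each numeric gap with that column's median.
filled = toy.fillna(toy.median(numeric_only=True))
print(filled)
```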


# **SEGREGATING THE DATASET INTO INPUT(x) AND OUTPUT(y)**

In [133]:
# Features: every column except the last one.
x = data.iloc[:, :-1].values
print(x)
print('-' * 40)
# Target: the last column ("heart disease").
y = data.iloc[:, -1].values
print(y)

[[70.  1.  4. ...  2.  3.  3.]
 [67.  0.  3. ...  2.  0.  7.]
 [57.  1.  2. ...  1.  0.  7.]
 ...
 [56.  0.  2. ...  2.  0.  3.]
 [57.  1.  4. ...  2.  0.  6.]
 [67.  1.  4. ...  2.  3.  3.]]
----------------------------------------
[2 1 2 1 1 1 2 2 2 2 1 1 1 2 1 1 2 2 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2 2 2
 2 1 1 2 1 1 1 2 1 2 2 2 2 2 1 1 1 1 1 2 1 2 2 1 2 1 1 1 2 1 2 1 2 2 1 1 1
 1 2 1 1 1 1 2 2 2 1 1 1 1 1 1 2 1 2 2 2 2 2 1 2 1 1 1 2 1 2 2 2 1 2 2 1 2
 1 2 1 1 1 2 2 1 2 2 2 2 1 1 1 2 1 1 2 2 2 1 2 1 1 1 2 1 1 2 1 2 1 2 2 2 2
 2 1 1 1 1 1 1 1 2 1 1 2 2 2 1 2 1 1 1 1 1 2 1 2 2 1 1 2 2 2 2 1 1 2 2 1 1
 1 2 1 1 2 1 2 1 2 1 1 1 1 1 2 1 2 2 2 2 1 1 1 2 1 2 1 1 2 1 1 1 1 1 1 2 2
 1 2 1 1 2 2 1 1 2 2 1 2 1 2 1 2 1 1 2 1 1 2 1 2 2 1 2 2 2 1 2 1 1 1 1 2 2
 1 1 2 2 1 2 1 1 1 1 2]
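
The target printed above takes the values 1 and 2. A quick look at the class balance, and an optional recoding to 0/1, could be sketched as follows. The snippet uses only the first ten printed labels, and it assumes that 1 means no heart disease and 2 means heart disease. That coding is not stated anywhere in this notebook, so treat it as an assumption to verify against the data source.

```python
import numpy as np

# First ten target values printed above.
y_sample = np.array([2, 1, 2, 1, 1, 1, 2, 2, 2, 2])

# Class balance: how many samples carry each label.
labels, counts = np.unique(y_sample, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))

# ASSUMPTION (not stated in the notebook): 1 = no heart disease, 2 = heart disease.
y_binary = (y_sample == 2).astype(int)
print(y_binary)
```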


# **PERFORMING FEATURE SCALING**

In [134]:
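from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance.
# Note: the scaler is fit on all 270 rows here, i.e. before the train/test split.
scaling = StandardScaler()
scaled_input = scaling.fit_transform(x)

scaled_input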

array([[ 1.71209356,  0.6894997 ,  0.87092765, ...,  0.67641928,
         2.47268219, -0.87570581],
       [ 1.38213977, -1.45032695, -0.18355874, ...,  0.67641928,
        -0.71153494,  1.18927733],
       [ 0.2822938 ,  0.6894997 , -1.23804513, ..., -0.95423434,
        -0.71153494,  1.18927733],
       ...,
       [ 0.1723092 , -1.45032695, -1.23804513, ...,  0.67641928,
        -0.71153494, -0.87570581],
       [ 0.2822938 ,  0.6894997 ,  0.87092765, ...,  0.67641928,
        -0.71153494,  0.67303154],
       [ 1.38213977,  0.6894997 ,  0.87092765, ...,  0.67641928,
         2.47268219, -0.87570581]])
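
Because the scaler in this notebook is fit on the whole dataset, the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to wrap the scaler and the model in a scikit-learn pipeline so that the scaler only ever sees the training rows. The sketch below illustrates that idea on synthetic stand-in data with the same shape (270 x 13). The names `X_demo`, `y_demo` and `leak_free` are introduced here for illustration, and the printed accuracy says nothing about the heart data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# SYNTHETIC stand-in with the same shape as the notebook's data (270 x 13).
X_demo, y_demo = make_classification(n_samples=270, n_features=13, random_state=0)

# Split first, so the test rows stay unseen during fitting.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when it transforms the test rows.
leak_free = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
leak_free.fit(X_tr, y_tr)

test_accuracy = leak_free.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Tree-based models split on thresholds and are generally insensitive to this kind of rescaling, so for a random forest the scaling step mostly matters for consistency with other models rather than for accuracy.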

# **SPLITTING THE DATASET INTO TRAINING AND TESTING DATA**

In [136]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(scaled_input, y, test_size=0.2, random_state=42)

In [137]:
x.shape, y.shape

((270, 13), (270,))

In [138]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((216, 13), (54, 13), (216,), (54,))
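
With only 54 test rows, the class mix in the test set can drift from that of the full data. Passing `stratify=` to `train_test_split` keeps the class proportions the same in both parts. The following is a minimal sketch on hypothetical labels (150 of class 1 and 120 of class 2, counts chosen purely for illustration), not on the notebook's actual target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels for illustration: 150 samples of class 1, 120 of class 2.
y_demo = np.array([1] * 150 + [2] * 120)
X_demo = np.zeros((270, 13))  # placeholder features; only the labels matter here

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both parts keep the 150:120 class ratio of the full set.
train_counts = np.bincount(y_tr)[1:]  # drop the unused count for label 0
test_counts = np.bincount(y_te)[1:]
print(train_counts, test_counts)
```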

# **LOADING THE RANDOM FOREST MODEL**

In [139]:
from sklearn.ensemble import RandomForestClassifier

Random_Forest_Model = RandomForestClassifier()
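
`RandomForestClassifier()` is created here with default settings and no fixed seed, so retraining can give slightly different trees and a slightly different accuracy from run to run. Fixing `random_state` makes a run repeatable. The sketch below demonstrates this on synthetic stand-in data; `forest_a` and `forest_b` are illustrative names, and the seed value 42 is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# SYNTHETIC stand-in data with the notebook's shape (270 x 13).
X_demo, y_demo = make_classification(n_samples=270, n_features=13, random_state=0)

# Same seed and same data: both forests are grown identically.
forest_a = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)
forest_b = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

same_predictions = (forest_a.predict(X_demo) == forest_b.predict(X_demo)).all()
print(same_predictions)
```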



# **TRAINING THE MODEL**

In [140]:
Random_Forest_Model.fit(x_train,y_train)
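
Since one stated goal of the project is to pinpoint the most relevant risk factors, a fitted forest's `feature_importances_` attribute is a natural thing to inspect. The sketch below shows the mechanics only. It attaches the column names printed by `data.head()` to SYNTHETIC values, so the ranking it prints carries no information about real risk factors.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column names as printed by data.head(); the values below are SYNTHETIC.
feature_names = [
    "age", "sex", "chest pain type", "resting blood pressure",
    "serum cholestoral", "fasting blood sugar",
    "resting electrocardiographic results", "max heart rate",
    "exercise induced angina", "oldpeak", "ST segment",
    "major vessels", "thal",
]
X_demo, y_demo = make_classification(n_samples=270, n_features=13, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

# Impurity-based importances: one non-negative weight per feature, summing to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

In the notebook itself, the same idea would presumably use `Random_Forest_Model.feature_importances_` together with `data.columns[:-1]`.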

# **PREDICTING THE RESULT USING THE TRAINED MODEL**

In [141]:
y_pred = Random_Forest_Model.predict(x_test)
y_pred

array([1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 1,
       1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1,
       1, 1, 1, 2, 2, 1, 1, 1, 1, 2])
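
`predict` returns only the winning class. When a risk estimate is more useful than a hard label, `predict_proba` returns the share of the forest's vote for each class. The sketch below runs on synthetic stand-in data relabelled to 1/2 to mirror the notebook's target; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# SYNTHETIC stand-in data; labels shifted from 0/1 to 1/2 to mirror the notebook.
X_demo, y_demo = make_classification(n_samples=270, n_features=13, random_state=0)
y_demo = y_demo + 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# One row per test sample, one column per class listed in forest.classes_.
proba = forest.predict_proba(X_te)
print(forest.classes_)
print(proba[:5].round(2))
```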

# **CALCULATING THE ACCURACY OF THE TRAINED MODEL**

In [142]:
from sklearn.metrics import accuracy_score

print("Accuracy_Score:",np.round(accuracy_score(y_test, y_pred),2)*100,"%")

Accuracy_Score: 85.0 %
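
Accuracy alone does not show which kind of error the model makes. A confusion matrix breaks the 54 test predictions down by actual and predicted class. The sketch below copies the `y_pred` array printed above and the `y_test` array that this notebook prints in the following section, so it runs without the trained model. The names `y_true_doc` and `y_pred_doc` are introduced here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# y_test as printed by the notebook (54 values).
y_true_doc = np.array([
    2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 1,
    2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,
    1, 1, 1, 2, 2, 1, 1, 1, 1, 2,
])
# y_pred as printed by the notebook (54 values).
y_pred_doc = np.array([
    1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 1,
    1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1,
    1, 1, 1, 2, 2, 1, 1, 1, 1, 2,
])

# Rows: actual class 1, 2. Columns: predicted class 1, 2.
cm = confusion_matrix(y_true_doc, y_pred_doc, labels=[1, 2])
print(cm)

accuracy = accuracy_score(y_true_doc, y_pred_doc)
print(f"Accuracy: {accuracy:.1%}")
```

For these arrays the matrix is `[[31, 2], [6, 15]]`: 46 of 54 predictions are correct, and 6 of the 21 samples whose actual class is 2 are predicted as class 1. If class 2 denotes the presence of heart disease (an assumption, since the notebook does not define the labels), those 6 would be missed cases, which usually matter more than the 2 errors in the other direction.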


# **PREDICTING THE OUTPUT OF SINGLE TEST DATA USING THE TRAINED MODEL**

In [143]:
y_test

array([2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 1,
       2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,
       1, 1, 1, 2, 2, 1, 1, 1, 1, 2])

In [144]:
x_test[2]  # a single test sample: 1-D array of 13 scaled features

array([ 0.1723092 ,  0.6894997 , -1.23804513, -0.63630951, -0.26476342,
       -0.41702883, -1.02628472,  1.22486246, -0.7012223 , -0.21870599,
       -0.95423434, -0.71153494, -0.87570581])

In [145]:
x_test[2].shape

(13,)

In [146]:
x_test[2].reshape(1,13)

array([[ 0.1723092 ,  0.6894997 , -1.23804513, -0.63630951, -0.26476342,
        -0.41702883, -1.02628472,  1.22486246, -0.7012223 , -0.21870599,
        -0.95423434, -0.71153494, -0.87570581]])

In [147]:
x_test[2].reshape(1,13).shape

(1, 13)

In [148]:
print("ACTUAL OUTPUT    : ",y_test[2])
print("PREDICTED OUTPUT : ",Random_Forest_Model.predict(x_test[2].reshape(1,13))[0])

ACTUAL OUTPUT    :  1
PREDICTED OUTPUT :  1
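
The sample above comes from `x_test`, which is already scaled. A new patient record would arrive as raw values, so it has to pass through the same fitted scaler before prediction. The sketch below shows that flow on synthetic stand-in data, using `reshape(1, -1)` so that the column count is inferred rather than hard-coded as 13. All names here are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# SYNTHETIC stand-in data; labels shifted from 0/1 to 1/2 to mirror the notebook.
X_demo, y_demo = make_classification(n_samples=270, n_features=13, random_state=0)
y_demo = y_demo + 1
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler and the model on the training rows only.
scaler = StandardScaler().fit(X_tr)
forest = RandomForestClassifier(random_state=0).fit(scaler.transform(X_tr), y_tr)

raw_patient = X_te[2]                  # unscaled 1-D array, shape (13,)
as_row = raw_patient.reshape(1, -1)    # 2-D, shape (1, 13); -1 infers the column count
prediction = forest.predict(scaler.transform(as_row))[0]

print("ACTUAL OUTPUT    : ", y_te[2])
print("PREDICTED OUTPUT : ", prediction)
```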
