### The dataset contains 303 patient records, each labeled as a heart-disease or non-heart-disease case, which can be used to train a model that predicts whether a patient has heart disease.

### You can find the dataset here: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

In [47]:
# Let's import the necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [17]:
# Loading the data using pandas

df = pd.read_csv("Documents/heart.csv")
df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [18]:
# Defining x and y
x = df.drop("target", axis=1) # drop the 'target' column: it is the label we want to predict from the other columns
y = df["target"]

In [21]:
# Splitting the data
# We train the model on 85 percent of the data and evaluate it on the remaining 15 percent.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15) # you can change the test size
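
Because the split above is random, every run produces a different test set and therefore a different score. The sketch below shows one way to make the split repeatable. It runs on a small synthetic table (an assumption, standing in for the real data), and the names `demo`, `x_demo`, `y_demo`, `x_tr`, `x_te`, `y_tr`, `y_te` and the seed value 42 are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart-disease table (assumption: 303 rows and
# a binary "target" column). Replace it with the real DataFrame.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(29, 78, size=303),
    "chol": rng.integers(126, 565, size=303),
    "target": rng.integers(0, 2, size=303),
})

x_demo = demo.drop("target", axis=1)
y_demo = demo["target"]

# random_state fixes the shuffle, so the same rows land in the test set on
# every run; stratify keeps the class ratio similar in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.15, random_state=42, stratify=y_demo
)

print(len(x_tr), len(x_te))  # 257 46
```

With 303 rows and `test_size=0.15`, scikit-learn rounds the test part up to 46 rows, which matches the support of 46 in the classification report later in this notebook.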

In [22]:
# Fitting the model (training process)

clf = RandomForestClassifier()
clf.fit(x_train, y_train)

In [25]:
### Storing the model's predictions for the test set in a new variable called 'y_preds', so we can compare them with the actual labels

y_preds = clf.predict(x_test)
y_preds

array([0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1,
       0, 1], dtype=int64)

In [41]:
# The actual values vs predicted values (array)
y_test, y_preds

(210    0
 10     1
 234    0
 285    0
 263    0
 194    0
 173    0
 228    0
 299    0
 222    0
 233    0
 95     1
 257    0
 92     1
 259    0
 260    0
 17     1
 15     1
 191    0
 57     1
 136    1
 18     1
 75     1
 198    0
 172    0
 107    1
 226    0
 68     1
 27     1
 224    0
 101    1
 91     1
 190    0
 127    1
 20     1
 145    1
 90     1
 29     1
 297    0
 52     1
 166    0
 13     1
 182    0
 87     1
 42     1
 216    0
 Name: target, dtype: int64,
 array([0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1,
        1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1,
        0, 1], dtype=int64))

In [42]:
### Evaluating the model on the held-out 15 percent of the data.
accuracy = clf.score(x_test, y_test)
print(f'The accuracy score of the model is {round(accuracy * 100, 2)} percent')

The accuracy score of the model is 71.74 percent


In [46]:
# Let's evaluate the model in another way (classification report)
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.73      0.70      0.71        23
           1       0.71      0.74      0.72        23

    accuracy                           0.72        46
   macro avg       0.72      0.72      0.72        46
weighted avg       0.72      0.72      0.72        46



In [49]:
# Let's try another way (confusion matrix)

con_mat = confusion_matrix(y_test, y_preds)
con_mat

array([[16,  7],
       [ 6, 17]], dtype=int64)
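
The confusion matrix above carries the same information as the accuracy score and the classification report. As a worked check, the sketch below recomputes those numbers by hand from the four counts. The variable names are illustrative; the layout (rows are actual classes, columns are predicted classes) is how scikit-learn's `confusion_matrix` orders its output.

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
con_mat_demo = np.array([[16, 7],
                         [6, 17]])

# For binary labels, ravel() yields: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = con_mat_demo.ravel()

accuracy_demo = float((tp + tn) / con_mat_demo.sum())  # 33 / 46
precision_1 = float(tp / (tp + fp))                    # 17 / 24
recall_1 = float(tp / (tp + fn))                       # 17 / 23
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy_demo, 4))  # 0.7174
print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))  # 0.71 0.74 0.72
```

These values agree with the 71.74 percent accuracy and with the row for class 1 in the classification report.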

In [50]:
# There is also the 'predict_proba' method, which returns the predicted class probabilities for x_test (column 0: no heart disease, column 1: heart disease):
clf.predict_proba(x_test)

array([[0.5 , 0.5 ],
       [0.22, 0.78],
       [0.69, 0.31],
       [0.96, 0.04],
       [0.58, 0.42],
       [0.42, 0.58],
       [0.56, 0.44],
       [0.27, 0.73],
       [0.56, 0.44],
       [0.38, 0.62],
       [0.82, 0.18],
       [0.83, 0.17],
       [0.87, 0.13],
       [0.19, 0.81],
       [0.35, 0.65],
       [0.85, 0.15],
       [0.44, 0.56],
       [0.02, 0.98],
       [1.  , 0.  ],
       [0.12, 0.88],
       [0.13, 0.87],
       [0.19, 0.81],
       [0.04, 0.96],
       [0.83, 0.17],
       [0.21, 0.79],
       [0.24, 0.76],
       [0.63, 0.37],
       [0.  , 1.  ],
       [0.27, 0.73],
       [0.96, 0.04],
       [0.66, 0.34],
       [0.61, 0.39],
       [0.86, 0.14],
       [0.14, 0.86],
       [0.6 , 0.4 ],
       [0.35, 0.65],
       [0.11, 0.89],
       [0.07, 0.93],
       [0.9 , 0.1 ],
       [0.77, 0.23],
       [0.97, 0.03],
       [0.46, 0.54],
       [0.1 , 0.9 ],
       [0.25, 0.75],
       [0.67, 0.33],
       [0.49, 0.51]])
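
Each row of the output above holds the probability of class 0 followed by the probability of class 1, and `predict` returns the class with the larger value. The sketch below reproduces that step on a few rows copied from the output, and also shows how a stricter cut-off could be applied. The 0.7 threshold and the variable names are hypothetical, chosen only for illustration.

```python
import numpy as np

# Rows 2 to 6 of the predict_proba output above.
proba_demo = np.array([
    [0.22, 0.78],
    [0.69, 0.31],
    [0.96, 0.04],
    [0.58, 0.42],
    [0.42, 0.58],
])

# Default behaviour: pick the class with the highest probability.
default_labels = np.argmax(proba_demo, axis=1)
print(default_labels)  # [1 0 0 0 1]

# Hypothetical stricter rule: predict heart disease only when the
# probability of class 1 is at least 0.7.
strict_labels = (proba_demo[:, 1] >= 0.7).astype(int)
print(strict_labels)  # [1 0 0 0 0]
```

The default labels match entries 2 to 6 of `y_preds` shown earlier. Moving the threshold trades false positives against false negatives, so the right value depends on which error is more costly.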

In [54]:
# Let's give parameters by ourselves and find out the prediction of the model: 
sohibjon = np.array([16, 1, 3, 125, 230, 1, 0, 170, 1, 1.8, 2, 2, 3])

# Let's reshape the 1-D array into a single row (1 sample, 13 features), because the model expects 2-D input

sohibjon = sohibjon.reshape(1, -1)

# Let's predict whether I have a heart-disease or not:
predictions = clf.predict(sohibjon)

# Print predictions
print(predictions)

[1]
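
A model fitted on a DataFrame remembers its column names, and scikit-learn may warn when it later receives a bare NumPy array. One possible alternative is to build the single sample as a one-row DataFrame. This is a sketch that assumes the model was trained on exactly the 13 feature columns listed in the table above, in that order; `feature_names`, `values` and `sample_df` are illustrative names.

```python
import pandas as pd

# Feature columns as they appear in the table above (assumed training order).
feature_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

# The same arbitrary values used in the cell above.
values = [16, 1, 3, 125, 230, 1, 0, 170, 1, 1.8, 2, 2, 3]

# One row, thirteen named columns.
sample_df = pd.DataFrame([values], columns=feature_names)
print(sample_df.shape)  # (1, 13)

# With the fitted classifier from this notebook, the call would be:
# clf.predict(sample_df)
```

Naming the columns also makes it harder to pass values in the wrong order by accident.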




In [55]:
# It says I have heart disease! That is because I entered the parameters randomly, so they are not my real values

### With only 303 records available, the model's accuracy is modest. A larger dataset would likely improve it, although more data alone does not guarantee higher accuracy.
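
With so few records, a single 46-row test set gives a noisy estimate of accuracy. A common way to get a steadier number is k-fold cross-validation, sketched below. The data here is synthetic (generated with `make_classification` as a stand-in, not the heart-disease table), and the fold count of 5 and the seed are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as the dataset (303 rows, 13 features).
x_syn, y_syn = make_classification(n_samples=303, n_features=13, random_state=0)

# Train and score the model five times, each time holding out a different fifth.
cv_scores = cross_val_score(
    RandomForestClassifier(random_state=0), x_syn, y_syn, cv=5
)

print(f"mean accuracy: {cv_scores.mean():.3f}, std: {cv_scores.std():.3f}")
```

On the real data, `x_syn` and `y_syn` would be replaced by `x` and `y` from this notebook; the spread across folds shows how much the score depends on which rows end up in the test part.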