DATASET: https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset?datasetId=1226038&sortBy=voteCount

# BUSINESS UNDERSTANDING

Cardiovascular disease (CVD) is the leading cause of death worldwide, claiming an estimated 17.9 million lives each year, or 31% of all deaths globally. Four out of five CVD deaths are caused by heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVD. This dataset contains 13 features, 11 of which are used here to predict the likelihood of heart disease.

People with cardiovascular disease, or at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidemia, or an already established disease), need early detection and management, and a machine learning model can help considerably with both.

This model is built to make it easier to predict whether a patient has a higher or lower chance of having a heart attack. It is a classification model that uses the logistic regression algorithm.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns


In [2]:
df = pd.read_csv('heart.csv')

# DATA UNDERSTANDING

Column descriptions:

age : Age of the patient

sex : Sex of the patient (0 = Female, 1 = Male)

cp : Chest pain type ~ 0 = Typical angina, 1 = Atypical angina, 2 = Non-anginal pain, 3 = Asymptomatic

trtbps : Resting blood pressure (in mmHg)

chol : Cholesterol in mg/dl, fetched via BMI sensor

fbs : Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

restecg : Resting electrocardiographic results ~ 0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy

thalachh : Maximum heart rate achieved

oldpeak : Previous peak (ST depression induced by exercise relative to rest)

slp : Slope (of the peak exercise ST segment)

exng : Exercise-induced angina (1 = yes; 0 = no)

caa : Number of major vessels (0-3)

thall : Thallium stress test result (0-3)

output : 0 = lower chance of heart attack, 1 = higher chance of heart attack

In [3]:
df.head()

   age  sex  cp  trtbps  chol  fbs  restecg  thalachh  exng  oldpeak  slp  caa  thall  output
0   63    1   3     145   233    1        0       150     0      2.3    0    0      1       1
1   37    1   2     130   250    0        1       187     0      3.5    0    0      2       1
2   41    0   1     130   204    0        0       172     0      1.4    2    0      2       1
3   56    1   1     120   236    0        1       178     0      0.8    2    0      2       1
4   57    0   0     120   354    0        1       163     1      0.6    2    0      2       1


In [4]:
df.shape

(303, 14)

In [5]:
df.columns

Index(['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
       'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output'],
      dtype='object')

In [6]:
df['output'].value_counts()

1    165
0    138
Name: output, dtype: int64
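
The counts above put class 1 at roughly 54% of the 303 rows, so the two classes are only mildly imbalanced. The sketch below rebuilds the label column from those printed counts (it does not read `heart.csv`) and shows the class shares together with the accuracy a trivial majority-class predictor would reach, which is a useful floor when reading the model accuracy later. The names `output`, `shares`, and `majority_baseline` are illustrative only.

```python
import pandas as pd

# Stand-in for df['output'], rebuilt from the counts printed above.
output = pd.Series([1] * 165 + [0] * 138, name="output")

# normalize=True returns class shares instead of raw counts.
shares = output.value_counts(normalize=True)
print(shares.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```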

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trtbps    303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalachh  303 non-null    int64  
 8   exng      303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    int64  
 11  caa       303 non-null    int64  
 12  thall     303 non-null    int64  
 13  output    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [8]:
df.isnull().sum()

age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
output      0
dtype: int64

In [9]:
df.duplicated().sum()

1

# DATA PREPROCESSING

In [10]:
df = df.drop_duplicates()

In [11]:
df.duplicated().sum()

0
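
Before dropping duplicates it can help to look at the rows involved. The following is a minimal sketch on a small made-up frame (the values and the name `toy` are hypothetical, not taken from `heart.csv`): `duplicated(keep=False)` flags every member of a duplicate group, and `drop_duplicates()` keeps the first occurrence by default.

```python
import pandas as pd

# Tiny hypothetical frame; rows 1 and 2 are identical.
toy = pd.DataFrame({
    "age": [63, 38, 38, 57],
    "chol": [233, 175, 175, 354],
    "output": [1, 1, 1, 0],
})

# keep=False marks every member of a duplicate group, not only the later copies.
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# drop_duplicates() keeps the first occurrence by default.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduped))
```

The `reset_index(drop=True)` call is optional; the notebook keeps the original index, which is why the last row label is still 302 after one row is removed.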

In [12]:
columns_to_drop = ['slp','oldpeak']
df = df.drop(columns=columns_to_drop)
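
The notebook does not state why `slp` and `oldpeak` are removed. One way to sanity-check such a choice is to look at how strongly each feature is associated with the label before dropping it. The sketch below does this on synthetic data (all values are made up, and the label is deliberately built from `oldpeak` so the ranking is easy to read); it says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the heart data; values are made up for illustration.
toy = pd.DataFrame({
    "oldpeak": rng.uniform(0, 4, n),
    "slp": rng.integers(0, 3, n),
    "chol": rng.normal(245, 50, n),
})
# The toy label depends on oldpeak only (plus noise).
toy["output"] = (toy["oldpeak"] + rng.normal(0, 1, n) < 1.5).astype(int)

# Absolute Pearson correlation of each feature with the label, strongest first.
assoc = toy.corr()["output"].drop("output").abs().sort_values(ascending=False)
print(assoc)
```

Pearson correlation is only a rough screen for a binary label; a feature with a weak linear correlation can still be useful to the model.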

In [13]:
# Show the distribution of the heart disease label, grouped by sex
fig = px.histogram(df,
                   x="output",
                   color="sex",
                   hover_data=df.columns,
                   title="Distribution of Heart Disease by Sex (1 = Male, 0 = Female)",
                   barmode="group")
fig.show()

In [14]:
df.columns

Index(['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
       'exng', 'caa', 'thall', 'output'],
      dtype='object')

# MODELLING

In [15]:
X = df.drop(columns='output')  # columns= already selects the axis, so axis=1 is redundant
Y = df['output']

In [16]:
print(X)

     age  sex  cp  trtbps  chol  fbs  restecg  thalachh  exng  caa  thall
0     63    1   3     145   233    1        0       150     0    0      1
1     37    1   2     130   250    0        1       187     0    0      2
2     41    0   1     130   204    0        0       172     0    0      2
3     56    1   1     120   236    0        1       178     0    0      2
4     57    0   0     120   354    0        1       163     1    0      2
..   ...  ...  ..     ...   ...  ...      ...       ...   ...  ...    ...
298   57    0   0     140   241    0        1       123     1    0      3
299   45    1   3     110   264    0        1       132     0    0      3
300   68    1   0     144   193    1        1       141     0    2      3
301   57    1   0     130   131    0        1       115     1    1      3
302   57    0   1     130   236    0        0       174     0    1      2

[302 rows x 11 columns]


In [17]:
print(Y)

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: output, Length: 302, dtype: int64


In [18]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2, stratify=Y, random_state=2)

In [19]:
print(X.shape, X_train.shape, X_test.shape)

(302, 11) (241, 11) (61, 11)
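
The split above passes `stratify=Y`, which keeps the class ratio nearly identical in the training and test sets. A small sketch with hypothetical labels (160 positives and 140 negatives, not the real counts) illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 160 positives, 140 negatives.
y = np.array([1] * 160 + [0] * 140)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# Share of class 1 overall, in the training set, and in the test set.
print(y.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of class 1 in a 20% test split can drift by several points on a dataset this small.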


In [20]:
model = LogisticRegression()

In [21]:
model.fit(X_train, Y_train)


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
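
The warning above means the `lbfgs` solver stopped at its iteration cap (`max_iter` defaults to 100) before converging, which is common when features sit on very different scales, as `age`, `chol`, and `thalachh` do. The warning itself suggests two remedies: scale the data or raise `max_iter`. The sketch below combines both in a scikit-learn pipeline on synthetic data (the feature values and the names `X_toy`, `y_toy`, and `scaled_model` are assumptions for illustration). Applying the same change to the notebook's model may shift the reported accuracies slightly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic features on very different scales, loosely like age/chol/thalachh.
X_toy = np.column_stack([
    rng.normal(54, 9, n),
    rng.normal(245, 50, n),
    rng.normal(150, 23, n),
])
y_toy = (X_toy[:, 2] - X_toy[:, 0] + rng.normal(0, 15, n) > 96).astype(int)

# Standardize each feature, then fit logistic regression with a higher iteration cap.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_toy, y_toy)

n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("lbfgs iterations used:", n_iter)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged at prediction time.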



In [22]:
X_train_prediction = model.predict(X_train)
# accuracy_score expects (y_true, y_pred)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [23]:
print("Training accuracy : ", training_data_accuracy*100)

Training accuracy :  85.06224066390041


In [24]:
X_test_prediction = model.predict(X_test)
# accuracy_score expects (y_true, y_pred)
testing_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [25]:
print("Testing accuracy : ", testing_data_accuracy*100)

Testing accuracy :  81.9672131147541
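
Accuracy alone does not show which kind of error the model makes, and with only 61 test rows each misclassified patient moves accuracy by about 1.6 points. For a screening task, missed positives (false negatives) usually matter most, so a confusion matrix with precision and recall is a natural complement. The sketch below uses hypothetical labels and predictions; in the notebook these would be `Y_test` and `X_test_prediction`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes (order: 0, 1).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)

# Precision: share of predicted positives that are correct.
# Recall: share of actual positives that were found.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print("precision:", precision)
print("recall   :", recall)
```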


In [26]:
df[df['output'] == 0].head()

     age  sex  cp  trtbps  chol  fbs  restecg  thalachh  exng  caa  thall  output
165   67    1   0     160   286    0        0       108     1    3      2       0
166   67    1   0     120   229    0        0       129     1    2      3       0
167   62    0   0     140   268    0        0       160     0    2      2       0
168   63    1   0     130   254    0        0       147     0    1      3       0
169   53    1   0     140   203    1        0       155     1    0      3       0


# EVALUATION

In [27]:
input_data = (67,1,0,160,286,0,0,108,1,3,2)  # row 165 above, true label 0

input_data_as_numpy_array = np.asarray(input_data)

# reshape to (1, n_features): a single sample for predict()
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

prediction = model.predict(input_data_reshaped)
print(prediction)

if prediction[0] == 0:
    print('lower chance of heart attack')
else:
    print('higher chance of heart attack')

[0]
lower chance of heart attack



X does not have valid feature names, but LogisticRegression was fitted with feature names
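
The warning above appears because the model was fitted on a DataFrame (with column names) but is asked to predict on a bare NumPy array. The prediction is still computed, but nothing checks that the values are in the same column order as in training. A minimal sketch of one way to avoid it, using a toy model with two of the notebook's column names and made-up values (`X_toy`, `toy_model`, and `new_patient` are hypothetical names), is to wrap the new record in a DataFrame that reuses the training columns. `predict_proba` is shown as well, since a probability is often more informative than a hard 0/1 label.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training frame with two of the notebook's column names; values are made up.
X_toy = pd.DataFrame({
    "age": [40, 45, 50, 55, 60, 65, 70, 75],
    "thalachh": [180, 175, 170, 160, 150, 140, 130, 120],
})
y_toy = [1, 1, 1, 1, 0, 0, 0, 0]
toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Reusing the training columns keeps feature names aligned with the fitted model.
new_patient = pd.DataFrame([(67, 108)], columns=X_toy.columns)

label = toy_model.predict(new_patient)[0]
proba = toy_model.predict_proba(new_patient)[0]  # [P(class 0), P(class 1)]
print(label, proba)
```

In the notebook the equivalent would be `pd.DataFrame([input_data], columns=X.columns)`.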



# DEPLOYMENT

In [28]:
import pickle

In [29]:
filename = 'heart_model.sav'
# the with-block closes the file handle once the model is written
with open(filename, 'wb') as f:
    pickle.dump(model, f)
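
To use the saved model in an application, it has to be loaded back with `pickle.load`. The sketch below round-trips a small toy model through a temporary file (the file name `toy_model.sav` and the toy data are assumptions for illustration) and checks that the restored model predicts the same labels.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Toy model standing in for the notebook's fitted `model`.
toy_model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_model.sav")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Load the model back and compare predictions with the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

samples = [[0.5], [2.5]]
same = (restored.predict(samples) == toy_model.predict(samples)).all()
print(same)
```

Pickle files should only be loaded from trusted sources, and a model pickled with one scikit-learn version is not guaranteed to load correctly in another, so it is worth recording the library version alongside the file.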