## Packages

In [95]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

%matplotlib inline

## Dataset
This dataset is taken from Kaggle: https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression.

The dataset is used to predict heart disease. It consists of **4238 rows** and **16 columns**: the first 15 columns serve as the features (X), and the last column, "TenYearCHD", serves as the target (y).

In [96]:
# load dataset
data = pd.read_csv("framingham.txt")
data.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


In [97]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4238 non-null   int64  
 1   age              4238 non-null   int64  
 2   education        4133 non-null   float64
 3   currentSmoker    4238 non-null   int64  
 4   cigsPerDay       4209 non-null   float64
 5   BPMeds           4185 non-null   float64
 6   prevalentStroke  4238 non-null   int64  
 7   prevalentHyp     4238 non-null   int64  
 8   diabetes         4238 non-null   int64  
 9   totChol          4188 non-null   float64
 10  sysBP            4238 non-null   float64
 11  diaBP            4238 non-null   float64
 12  BMI              4219 non-null   float64
 13  heartRate        4237 non-null   float64
 14  glucose          3850 non-null   float64
 15  TenYearCHD       4238 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 529.9 KB


## Data Cleaning

Before moving on to data preparation, the dataset needs to be cleaned, because **it contains a number of NaN (null) values**. The cells below fill these missing values using *data cleaning* techniques: the most frequent category for low-cardinality columns, and linear interpolation for the remaining columns.

In [98]:
null_perc=data.isnull().sum()/len(data)*100
null=data.isnull().sum()
overview=pd.concat((null,null_perc,data.nunique()),axis=1, keys=['Null counts','Null %','Cardinality'])
overview

Unnamed: 0,Null counts,Null %,Cardinality
male,0,0.0,2
age,0,0.0,39
education,105,2.477584,4
currentSmoker,0,0.0,2
cigsPerDay,29,0.684285,33
BPMeds,53,1.25059,2
prevalentStroke,0,0.0,2
prevalentHyp,0,0.0,2
diabetes,0,0.0,2
totChol,50,1.179802,248


In [99]:
# Generating list of categorical factors (fewer than 5 unique values):
temp=data.drop(columns=['TenYearCHD']).nunique()
cat=temp.loc[temp.values < 5].index.to_list()

# Updating null values to the most dominant category.
# Assigning the result back avoids the chained inplace fillna that newer pandas versions warn about:
for factor in cat:
    data[factor]=data[factor].fillna(data[factor].value_counts().idxmax())
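
As an illustration of the step above, here is a minimal sketch of most-frequent-value imputation on a toy column. The `toy` DataFrame and its values are made up for this example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for a low-cardinality feature such as `education`.
toy = pd.DataFrame({"education": [1.0, 2.0, np.nan, 1.0, 4.0, np.nan]})

# The most frequent value is 1.0 (it appears twice); NaN is ignored when counting.
most_common = toy["education"].value_counts().idxmax()

# Assign the filled column back rather than modifying it in place.
toy["education"] = toy["education"].fillna(most_common)

print(toy["education"].tolist())  # [1.0, 2.0, 1.0, 1.0, 4.0, 1.0]
```

Both missing entries receive the dominant category, so the set of distinct values in the column does not change.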

In [100]:
# Generating list of non-categorical factors (5 or more unique values,
# so that no column falls between the two lists):
temp=data.drop(columns=['TenYearCHD']).nunique()
non_cat=temp.loc[temp.values >= 5].index.to_list()

# Implementing interpolation (with linear method), on known data for null values:
for factor in non_cat:
    data[factor]=data[factor].interpolate(method='linear')
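
To make the effect of linear interpolation concrete, the sketch below applies it to a small made-up Series (the values are illustrative, not from the dataset). It also suggests why the cardinality of a column such as `cigsPerDay` can increase after cleaning: interpolated values need not match any value that was already present.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps, in row order.
s = pd.Series([0.0, 20.0, np.nan, 30.0, np.nan, np.nan, 0.0])

# Each gap is filled along a straight line between its known neighbours.
filled = s.interpolate(method="linear")

print(filled.tolist())               # [0.0, 20.0, 25.0, 30.0, 20.0, 10.0, 0.0]
print(s.nunique(), filled.nunique()) # 3 5
```

Note that the result depends on the order of the rows, and with the default settings a missing value at the very start of a column is left unfilled because it has no earlier neighbour.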

In [101]:
null_perc=data.isnull().sum()/len(data)*100
null=data.isnull().sum()
overview=pd.concat((null,null_perc,data.nunique()),axis=1, keys=['Null counts','Null %','Cardinality'])
overview

Unnamed: 0,Null counts,Null %,Cardinality
male,0,0.0,2
age,0,0.0,39
education,0,0.0,4
currentSmoker,0,0.0,2
cigsPerDay,0,0.0,41
BPMeds,0,0.0,2
prevalentStroke,0,0.0,2
prevalentHyp,0,0.0,2
diabetes,0,0.0,2
totChol,0,0.0,274


## Data Preparation
In this section, *data preparation* is carried out by splitting the dataset into two parts: **training (80%) and testing (20%).** The features are then standardized with `StandardScaler`, which is fitted on the training set only.

In [102]:
y = data["TenYearCHD"]
X = data.drop('TenYearCHD',axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state = 0)

In [103]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
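
The sketch below illustrates, on made-up numbers, what the scaling cell does: the mean and standard deviation are learned from the training rows only, and the same transformation is then reused for the test rows. The arrays `train` and `test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows with two features each.
train = np.array([[30.0, 100.0], [40.0, 120.0], [50.0, 140.0]])
test = np.array([[40.0, 160.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)               # per-feature training means
print(train_scaled.mean(axis=0))  # approximately 0 for each feature
print(train_scaled.std(axis=0))   # approximately 1 for each feature
print(test_scaled)                # test rows expressed in training units
```

Fitting on the training set alone keeps information about the test rows from leaking into the preprocessing step.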

In [104]:
print(y_test.unique())
Counter(y_train)

[0 1]


Counter({0: 2884, 1: 506})
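
The counts above show that the two classes are far from balanced. The sketch below computes the positive share from those counts and then shows, on a synthetic label vector, how the `stratify` argument of `train_test_split` keeps the class ratio equal in both splits. The notebook itself does not stratify; this is only an optional variation, and `X_toy` and `y_toy` are made-up data.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Class counts reported for y_train in the cell above.
train_counts = Counter({0: 2884, 1: 506})
total = sum(train_counts.values())
positive_share = train_counts[1] / total
print(f"positive share in the training set: {positive_share:.1%}")

# Synthetic labels with a similar imbalance (85 negatives, 15 positives).
y_toy = np.array([0] * 85 + [1] * 15)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy
)
print(Counter(y_tr.tolist()), Counter(y_te.tolist()))
```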

## Logistic Regression

To predict heart disease in this case, a **logistic regression** model is used. Logistic regression is a widely used model for classification, which is a *supervised learning* task.

The core idea of the model is to use the **sigmoid function** to classify a data point. The sigmoid function is defined as:
$$g(z) = \frac{1}{1+e^{-z}}$$

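The sigmoid function maps any real-valued input $z$ to a value strictly between 0 and 1, which is interpreted as the probability of the positive class. Applying a threshold to this probability (typically 0.5) yields one of two *outputs*, 0 or 1, which is why logistic regression is commonly used for binary classification.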
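
A minimal sketch of the sigmoid function $g(z)$ and of turning its output into a class label is given below. The helper name `sigmoid`, the input values, and the 0.5 threshold are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scores z (in a fitted model, z is a linear combination of the features).
z = np.array([-4.0, -1.0, 1.0, 4.0])
probs = sigmoid(z)

# Probabilities above the threshold are assigned to class 1, the rest to class 0.
labels = (probs > 0.5).astype(int)

print(np.round(probs, 3))  # values strictly between 0 and 1
print(labels)              # [0 0 1 1]
```

The function itself never returns exactly 0 or 1. The two discrete labels only appear after the threshold is applied.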

In [105]:
m1 = 'Logistic Regression'
lr = LogisticRegression()

model = lr.fit(X_train, y_train)
lr_predict = lr.predict(X_test)

lr_conf_matrix = confusion_matrix(y_test, lr_predict)
lr_acc_score = accuracy_score(y_test, lr_predict)

print("Confusion matrix")
print(lr_conf_matrix)
print("\n")
print("Accuracy of Logistic Regression:",lr_acc_score*100,'\n')
print(classification_report(y_test,lr_predict))

Confusion matrix
[[708   2]
 [129   9]]


Accuracy of Logistic Regression: 84.55188679245283 

              precision    recall  f1-score   support

           0       0.85      1.00      0.92       710
           1       0.82      0.07      0.12       138

    accuracy                           0.85       848
   macro avg       0.83      0.53      0.52       848
weighted avg       0.84      0.85      0.79       848



The prediction and evaluation results show that:
1. In scikit-learn's confusion matrix, rows are the actual classes and columns are the predicted classes. The test results therefore contain 708 *True Negatives* (TN), 2 *False Positives* (FP), 129 *False Negatives* (FN), and 9 *True Positives* (TP).
2. **The logistic regression model reaches an accuracy of 84.55%** in predicting (classifying) heart disease on the test set. However, this is only slightly above the 83.73% obtained by always predicting class 0 (710 of 848 test rows), and recall for class 1 is just 0.07 (9 of 138), so the model misses most patients who go on to develop heart disease.
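
As a cross-check, the sketch below recomputes the headline metrics directly from the confusion matrix printed above. It relies on scikit-learn's layout for binary labels, where `ravel()` on the matrix yields TN, FP, FN, TP in that order. The variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual class, columns = predicted class.
cm = np.array([[708, 2],
               [129, 9]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)        # of the predicted positives, how many are correct
recall = tp / (tp + fn)           # of the actual positives, how many are found
baseline = cm[0].sum() / cm.sum() # accuracy of always predicting class 0

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"baseline  = {baseline:.4f}")
```

These values agree with the class 1 row of the classification report (precision 0.82, recall 0.07). Comparing the accuracy with the majority-class baseline shows how much of it comes from class imbalance alone.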