# Stroke Prediction

Berdasarkan informasi WHO, stroke merupakan penyebab kematian terbanyak nomor 2 di dunia dan menjadi penyebab dari 11% total kematian. [Dataset](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) ini digunakan untuk memprediksi apakah seorang pasien memiliki kemungkinan besar untuk terkena stroke berdasarkan informasi tentang pasien seperti jenis kelamin, umur, penyakit dan status merokok.

Tujuan eksperimen:
1.   Peserta memahami rangkaian proses analitika data menggunakan pendekatan pembelajaran mesin. 
2.   Peserta memahami bahwa proses pengembangan model pembelajaran mesin juga ditentukan dari kualitas data, penanganan data, dan penentuan algoritma serta hiperparameternya; tidak cukup hanya dengan memastikan implementasi algoritma berjalan tanpa kesalahan.
3. Peserta mampu menginterpretasikan hasil dari evaluasi model dalam proses analitika menggunakan pendekatan pembelajaran mesin.

Praktikum dilaksanakan secara berkelompok, dengan 1 kelompok terdiri atas 2 mahasiswa. Soal praktikum terdapat di bagian bawah berkas ini. Harap diperhatikan bahwa terdapat berkas yang harus dikumpulkan sebelum waktu praktikum selesai (4 April 2022 11.00 WIB) dan berkas yang dikumpulkan setelah waktu praktikum selesai (4 April 2022 23.59 WIB). Untuk detil deliverables dan soal, dapat dilihat pada bagian bawah notebook

# Data Preparation

In [103]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
import numpy as np

In [104]:
data = pd.read_csv("healthcare-dataset-stroke-data.csv")
X = data.drop(columns="stroke")
y = data["stroke"].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=123)

df_train = pd.concat([X_train, y_train], axis=1)
df_val = pd.concat([X_val, y_val], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

# Soal Eksperimen

Disediakan data yang sudah dibagi menjadi data train (df_train), validasi (df_val), dan test (df_test). Lakukanlah:
1. Buatlah baseline dengan menggunakan model Logistic Regression
2. Lakukan analisa data terkait:
- Duplicate value
- Missing value
- Outlier
- Balance of data
3. Jelaskan bagaimana kalian akan menangani permasalahan yang disebutkan pada poin 2
4. Sebutkan dan jelaskan alasan dari teknik encoding yang akan kalian gunakan terhadap data tersebut
5. Buatlah desain eksperimen dengan menentukan hal berikut:
- Tujuan eksperimen
- Dependent dan Independent variabel
- Strategi eksperimen
- Skema validasi
6. Implementasikan strategi eksperimen dan skema validasi yang sudah kalian buat
7. Berdasarkan hasil prediksi yang kalian hasilkan, buatlah kesimpulan analisis karakteristik pasien yang terkena stroke

Poin 1 - 5 dikerjakan saat praktikum berlangsung (pukul 09.00 WIB - 11.00 WIB)
Poin 6 - 7 dikerjakan saat setelah praktikum berlangsung (pukul 11.00 WIB - 23.59 WIB)

Jika terdapat perubahan jawaban pada poin 1 - 5 (semisal perbedaan cara melakukan handling missing value), dapat dijelaskan pada laporan mengenai jawaban sebelum, jawaban sesudah, dan alasan merubah jawaban tersebut (semisal menemukan suatu hal menarik pada data, sehingga missing value dapat dihandle dengan metode yang lebih bagus) 

In [105]:
# 1. Ubah categorical 
lb = LabelEncoder()
df_train['gender'] = lb.fit_transform(df_train['gender'] ) 
df_val['gender'] = lb.fit_transform(df_val['gender'] ) 
df_test['gender'] = lb.fit_transform(df_test['gender'] ) 

df_train['ever_married'] = lb.fit_transform(df_train['ever_married'] ) 
df_val['ever_married'] = lb.fit_transform(df_val['ever_married'] ) 
df_test['ever_married'] = lb.fit_transform(df_test['ever_married'] ) 

df_train['work_type'] = lb.fit_transform(df_train['work_type'] ) 
df_val['work_type'] = lb.fit_transform(df_val['work_type'] ) 
df_test['work_type'] = lb.fit_transform(df_test['work_type'] ) 

df_train['work_type'] = lb.fit_transform(df_train['work_type'] ) 
df_val['work_type'] = lb.fit_transform(df_val['work_type'] ) 
df_test['work_type'] = lb.fit_transform(df_test['work_type'] ) 

df_train['Residence_type'] = lb.fit_transform(df_train['Residence_type'] ) 
df_val['Residence_type'] = lb.fit_transform(df_val['Residence_type']) 
df_test['Residence_type'] = lb.fit_transform(df_test['Residence_type']) 

df_train['smoking_status'] = lb.fit_transform(df_train['smoking_status'] ) 
df_val['smoking_status'] = lb.fit_transform(df_val['smoking_status']) 
df_test['smoking_status'] = lb.fit_transform(df_test['smoking_status']) 


df_train['gender'] = lb.fit_transform(df_train['gender'] ) 
df_val['gender'] = lb.fit_transform(df_val['gender'] ) 
df_test['gender'] = lb.fit_transform(df_test['gender'] ) 

df_train['gender'].fillna(df_train['gender'].mean(), inplace= True)
df_test['gender'].fillna(df_test['gender'].mean(), inplace= True)
df_val['gender'].fillna(df_val['gender'].mean(), inplace= True)

df_train['bmi'].fillna(df_train['bmi'].mean(), inplace= True)
df_test['bmi'].fillna(df_test['bmi'].mean(), inplace= True)
df_val['bmi'].fillna(df_val['bmi'].mean(), inplace= True)
model = LogisticRegression()
X_train = df_train.drop(columns="stroke")
y_train = df_train["stroke"].copy()
model.fit(X_train, y_train)
model.predict(X_train)

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [106]:
boolean = df_train.duplicated().any()
boolean = df_train.duplicated(subset=['id','gender','age','hypertension','heart_disease','ever_married','work_type','Residence_type','avg_glucose_level','bmi','smoking_status','stroke']).any()
boolean
#Tidak ada duplicate Value

False

2. Lakukan analisa data terkait:
- Duplicate value
- Missing value
- Outlier
- Balance of data

In [107]:
print("Duplicate value: ")
print(" Data train duplicate value: ", df_train.duplicated().sum())
print(" Data validasi duplicate value: ", df_val.duplicated().sum())
print(" Data test duplicate value: ", df_test.duplicated().sum())

Duplicate value: 
 Data train duplicate value:  0
 Data validasi duplicate value:  0
 Data test duplicate value:  0


In [108]:
print("Missing value: ")
print(" Data train missing value: ", df_train.isna().sum())

Missing value: 
 Data train missing value:  id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64


In [109]:
print(" Data validasi missing value: ", df_val.isna().sum())

 Data validasi missing value:  id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64


In [110]:
print(" Data test missing value: ", df_test.isna().sum())

 Data test missing value:  id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64


In [111]:
print("Outlier: ")
print("Outlier Data Train: ")
def outlier_data_train(column):
    Q1 = np.percentile(df_train[column], 25,
                    interpolation = 'midpoint')
    
    Q3 = np.percentile(df_train[column], 75,
                   interpolation = 'midpoint')

    IQR = Q3 - Q1

    upper = df_train[column] >= (Q3+1.5*IQR)
    count_upper = len(np.where(upper)[0])

    lower = df_train[column] <= (Q1-1.5*IQR)
    count_lower = len(np.where(lower)[0])

    count_outlier = count_upper + count_lower
    return(count_outlier)

for col in df_train.columns:
    print(col + ": " + str(outlier_data_train(col)))

Outlier: 
Outlier Data Train: 
id: 0
gender: 0
age: 0
hypertension: 6244
heart_disease: 6355
ever_married: 0
work_type: 427
Residence_type: 0
avg_glucose_level: 414
bmi: 91
smoking_status: 0
stroke: 6379


In [112]:
print("Outlier Data Validasi: ")
def outlier_data_validasi(column):
    Q1 = np.percentile(df_val[column], 25,
                    interpolation = 'midpoint')
    
    Q3 = np.percentile(df_val[column], 75,
                   interpolation = 'midpoint')

    IQR = Q3 - Q1

    upper = df_val[column] >= (Q3+1.5*IQR)
    count_upper = len(np.where(upper)[0])

    lower = df_val[column] <= (Q1-1.5*IQR)
    count_lower = len(np.where(lower)[0])

    count_outlier = count_upper + count_lower
    return(count_outlier)

for col in df_val.columns:
    print(col + ": " + str(outlier_data_validasi(col)))

Outlier Data Validasi: 
id: 0
gender: 0
age: 0
hypertension: 1550
heart_disease: 1592
ever_married: 0
work_type: 107
Residence_type: 0
avg_glucose_level: 106
bmi: 20
smoking_status: 0
stroke: 1598


In [113]:
print("Outlier Data Test: ")
def outlier_data_test(column):
    Q1 = np.percentile(df_test[column], 25,
                    interpolation = 'midpoint')
    
    Q3 = np.percentile(df_test[column], 75,
                   interpolation = 'midpoint')

    IQR = Q3 - Q1

    upper = df_test[column] >= (Q3+1.5*IQR)
    count_upper = len(np.where(upper)[0])

    lower = df_test[column] <= (Q1-1.5*IQR)
    count_lower = len(np.where(lower)[0])

    count_outlier = count_upper + count_lower
    return(count_outlier)

for col in df_test.columns:
    print(col + ": " + str(outlier_data_test(col)))

Outlier Data Test: 
id: 0
gender: 0
age: 0
hypertension: 1928
heart_disease: 1997
ever_married: 0
work_type: 123
Residence_type: 0
avg_glucose_level: 101
bmi: 19
smoking_status: 0
stroke: 1994


In [121]:
print("Balance of Data: ")
print(" Oversampling: ")
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
# define pipeline
steps = [('over', RandomOverSampler()), ('model', model)]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X_train, y_train, scoring='f1_micro', cv=cv, n_jobs=-1)
score = mean(scores)
print(' F1 Score: %.3f' % score)

Balance of Data: 
 Oversampling: 


ImportError: cannot import name '_euclidean_distances' from 'sklearn.metrics.pairwise' (C:\Users\joseg\anaconda3\lib\site-packages\sklearn\metrics\pairwise.py)

In [None]:
print(" Undersampling: ")
# define pipeline
steps = [('under', RandomUnderSampler()), ('model', model)]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X_train, y_train, scoring='f1_micro', cv=cv, n_jobs=-1)
score = mean(scores)
print(' F1 Score: %.3f' % score)

 Undersampling: 
 F1 Score: 0.687


3. Jelaskan bagaimana kalian akan menangani permasalahan yang disebutkan pada poin 2

1. Pada data duplicate kami tidak perlu menangani apa-apa dikarenakan tidak terdapat data duplicate pada dataset
2. Pada missing value kami menanganinya dengan mengisi missing value dengan nilai mean dari instance-instance yang terdapat pada feature tersebut
3. Pada outlier kami mendeteksinya dengan menggunakan metode IQR (Inter Quartile Range) dan akan menghapus outlier-outlier tersebut
4. Pada Balance of Data kami tidak perlu menangani apa-apa dikarenakan data sudah cukup balance

In [None]:
df_train.isna().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  130
smoking_status         0
stroke                 0
dtype: int64

In [None]:
# Missing Value Handling
df_train['bmi'].fillna(df_train['bmi'].mean(), inplace= True)
df_test['bmi'].fillna(df_test['bmi'].mean(), inplace= True)
df_val['bmi'].fillna(df_val['bmi'].mean(), inplace= True)

df_train.isna().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

4. Sebutkan dan jelaskan alasan dari teknik encoding yang akan kalian gunakan terhadap data tersebut
Encoding yang digunakan adalah dengan menggunakan LabelEncoder untuk mengubah kategorikal atribut

5. Buatlah desain eksperimen dengan menentukan hal berikut:
- Tujuan eksperimen
- Dependent dan Independent variabel
- Strategi eksperimen
- Skema validasi

Tujuan Eksperimen :
1.   Peserta memahami rangkaian proses analitika data menggunakan pendekatan pembelajaran mesin. 
2.   Peserta memahami bahwa proses pengembangan model pembelajaran mesin juga ditentukan dari kualitas data, penanganan data, dan penentuan algoritma serta hiperparameternya; tidak cukup hanya dengan memastikan implementasi algoritma berjalan tanpa kesalahan.
3. Peserta mampu menginterpretasikan hasil dari evaluasi model dalam proses analitika menggunakan pendekatan pembelajaran mesin.

Dependent Variabel : 'stroke
Independent Variable : 'id','gender','age','hypertension','heart_disease','ever_married','work_type','Residence_type','avg_glucose_level','bmi','smoking_status','stroke'

Strategi Eksperimen : Grid Search

Skema Validasi : K - Fold


6. Implementasi Strategi Eksperimen dan Skema Validasi

In [117]:
from sklearn.model_selection import GridSearchCV
logModel = LogisticRegression()

param_grid = [
    {'penalty' : ['11','12','elasticnet','none'],
    'C' : np.logspace(-4,4,20),
    'solver' : ['lbfgs','newton-cg','liblinear','sag','saga'],
    'max_iter' : [100,1000,2500,5000]
    }
]

clf = GridSearchCV(logModel, param_grid=param_grid, cv = 3, verbose=True, n_jobs=-1)

best_clf = clf.fit(X_train,y_train)

Fitting 3 folds for each of 1600 candidates, totalling 4800 fits




In [118]:
best_clf.best_estimator_

LogisticRegression(C=0.0001, penalty='none')

In [120]:
#Check Accuracy
print(f'Accuracy - : {best_clf.score(X_train,y_train):.3f}')

Accuracy - : 0.950


K Fold Cross Validation

In [124]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
     print("TRAIN:", train_index, "TEST:", test_index)
     X_train, X_test = X.iloc[train_index], X.iloc[test_index]
     y_train, y_test = y.iloc[train_index], y.iloc[test_index]


TRAIN: [2555 2556 2557 ... 5107 5108 5109] TEST: [   0    1    2 ... 2552 2553 2554]
TRAIN: [   0    1    2 ... 2552 2553 2554] TEST: [2555 2556 2557 ... 5107 5108 5109]


7. Berdasarkan hasil prediksi yang kalian hasilkan, buatlah kesimpulan analisis karakteristik pasien yang terkena stroke

In [136]:
model = LogisticRegression()
X_train['ever_married'] = lb.fit_transform(X_train['ever_married'] ) 

X_train['work_type'] = lb.fit_transform(X_train['work_type'] ) 

X_train['work_type'] = lb.fit_transform(X_train['work_type'] ) 

X_train['Residence_type'] = lb.fit_transform(X_train['Residence_type'] ) 

X_train['smoking_status'] = lb.fit_transform(X_train['smoking_status'] )

X_train['bmi'].fillna(df_train['bmi'].mean(), inplace= True)

model.fit(X_train,y_train)
print(model.coef_)
X_train.head()



[[-1.17621241e-05 -2.16705433e-03  4.44544031e-02  1.93261650e-03
   1.68858375e-03 -1.07100679e-03 -1.60420874e-02 -2.61508086e-03
   2.40582032e-03 -1.49977854e-01 -6.90380742e-03]]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['ever_married'] = lb.fit_transform(X_train['ever_married'] )
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['work_type'] = lb.fit_transform(X_train['work_type'] )
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['work_type'] = lb.fit_transform(X_train['work_type'] )
A value is

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
0,9046,1,67.0,0,1,1,2,1,228.69,36.6,1
1,51676,0,61.0,0,0,1,3,0,202.21,28.914618,2
2,31112,1,80.0,0,1,1,2,0,105.92,32.5,2
3,60182,0,49.0,0,0,1,2,1,171.23,34.4,3
4,1665,0,79.0,1,0,1,3,0,174.12,24.0,2


Berdasarkan hasil dari coef_ maka karakteristik dari orang yang mengalami stroke adalah sudah berumur, mengalami hypertension, mengalami heart_disease, memiliki avg_glucose_level yang tinggi

Jelaskan pembagian tugas/ kerja antar anggota kelompok saat eksperimen,  pada sel ini.

Jose Galbraith H.
1. Transform categorical value
2. Mengisi NaN value dengan mean
3. Mengerjakan Desain Eksperimen

Muhammad Fahkry Malta
1. Menganalisa Duplicate Value, Missing Value, Outlier, dan Balance of Data
2. Membuat hasil analisis dari analisa diatas (nomor 3)
3. Membuat laporan

# Deliverables

1. Notebook dengan nama file PraktikumIF3270_M1_NIM1_NIM2.ipynb untuk poin 1 - 5.
2. Notebook dengan nama file PraktikumIF3270_M2_NIM1_NIM2.ipynb yang merupakan kelanjutan dari notebook poin 1, dengan tambahan hasil poin 6 dan 7.
3. Laporan dengan nama file PraktikumIF3270_NIM1_NIM2.pdf dengan isi sebagai berikut:
- Hasil analisa terhadap data, penanganan yang dilakukan serta justifikasi teknik-teknik yang dipilih
- Perubahan yang dilakukan pada jawaban poin 1 - 5 jika ada
- Desain eksperimen
- Hasil eksperimen
- Analisis dan kesimpulan
- Pembagian Tugas / Kerja antar anggota kelompok

Deadline pengumpulan:
- Deliverables poin 1 dikumpulkan sebelum <b>pukul 11.00 WIB</b>, Senin 4 April 2022
- Deliverables poin 2 dikumpulkan sebelum <b>pukul 23.59 WIB</b>, Senin 4 April 2022
- Deliverables poin 3 dikumpulkan sebelum <b>pukul 23.59 WIB</b>, Senin 4 April 2022