# Hacktiv8 Phase 1: Non Graded Challenge 1

---

This non-graded assignment evaluates what was learned in the Hacktiv8 Data Science Fulltime Program, specifically the concept of Logistic Regression.

## Introduction

By [Rifky Aliffa](https://github.com/Penzragon)

### Dataset

The dataset used in this project is the stroke dataset, which contains 5110 rows and 12 columns: id, gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, and stroke. The dataset is available on [Kaggle](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset).

The columns in this dataset are described below:

| Feature           | Description                                                                            |
| ----------------- | -------------------------------------------------------------------------------------- |
| id                | unique identifier                                                                      |
| gender            | "Male", "Female" or "Other"                                                            |
| age               | age of the patient                                                                     |
| hypertension      | 0 if the patient doesn't have hypertension, 1 if the patient has hypertension          |
| heart_disease     | 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease |
| ever_married      | "No" or "Yes"                                                                          |
| work_type         | "children", "Govt_job", "Never_worked", "Private" or "Self-employed"                   |
| Residence_type    | "Rural" or "Urban"                                                                     |
| avg_glucose_level | average glucose level in blood                                                         |
| bmi               | body mass index                                                                        |
| smoking_status    | "formerly smoked", "never smoked", "smokes" or "Unknown"                               |
| stroke            | 1 if the patient had a stroke or 0 if not                                              |


### Objectives

- Clean and preprocess the data that will be used.
- Build a classification model using Logistic Regression, with stroke prediction as the target.
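As a rough orientation for the second objective, the sketch below fits a logistic regression classifier on synthetic data. It is only an illustration under stated assumptions: the features `X`, the labels `y`, the split ratio, and the random seeds are all made up here and are not taken from the stroke dataset or from this notebook's later cells.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, linearly separable data (assumption: NOT the stroke dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 25% of the rows for evaluation, keeping the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the classifier and measure accuracy on the held-out rows.
model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

On an imbalanced target such as `stroke`, accuracy alone can be misleading, so precision and recall are usually worth checking as well.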

## Import Libraries

This project uses **Pandas**, **NumPy**, **Matplotlib**, **Seaborn**, and **scikit-learn**.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error

## Data Loading

Create a dataframe named **stroke** from the CSV file `healthcare-dataset-stroke-data.csv`.

In [2]:
stroke = pd.read_csv('healthcare-dataset-stroke-data.csv')
stroke.head()

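|     | id    | gender | age  | hypertension | heart_disease | ever_married | work_type     | Residence_type | avg_glucose_level | bmi  | smoking_status  | stroke |
| --- | ----- | ------ | ---- | ------------ | ------------- | ------------ | ------------- | -------------- | ----------------- | ---- | --------------- | ------ |
| 0   | 9046  | Male   | 67.0 | 0            | 1             | Yes          | Private       | Urban          | 228.69            | 36.6 | formerly smoked | 1      |
| 1   | 51676 | Female | 61.0 | 0            | 0             | Yes          | Self-employed | Rural          | 202.21            | NaN  | never smoked    | 1      |
| 2   | 31112 | Male   | 80.0 | 0            | 1             | Yes          | Private       | Rural          | 105.92            | 32.5 | never smoked    | 1      |
| 3   | 60182 | Female | 49.0 | 0            | 0             | Yes          | Private       | Urban          | 171.23            | 34.4 | smokes          | 1      |
| 4   | 1665  | Female | 79.0 | 1            | 0             | Yes          | Self-employed | Rural          | 174.12            | 24.0 | never smoked    | 1      |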


### Data Characteristics

In [3]:
stroke.shape

(5110, 12)

The dataframe has **5110 rows** and **12 columns**.

In [4]:
stroke.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


The dataframe consists of:
- 3 columns of type **float**
- 4 columns of type **integer**
- 5 columns of type **object**

The info output also shows missing values in the `bmi` column, which has only 4909 non-null entries out of 5110.
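The same dtype tally and missing-value check can be done programmatically instead of reading the `info()` output by eye. The sketch below uses a small hypothetical sample (three rows, five columns, values borrowed from the preview above) rather than the full file, so the counts it prints differ from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample mirroring a few columns of the stroke data.
sample = pd.DataFrame({
    "id": [9046, 51676, 31112],
    "gender": ["Male", "Female", "Male"],
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, np.nan, 32.5],
    "stroke": [1, 1, 1],
})

# Count how many columns share each dtype.
dtype_counts = sample.dtypes.value_counts()

# Names of the numeric columns (integers and floats).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Names of the columns that contain at least one missing value.
missing_cols = sample.columns[sample.isna().any()].tolist()

print(dtype_counts)
print(numeric_cols)
print(missing_cols)
```

Applied to the full `stroke` dataframe, the same three expressions would be expected to reproduce the float/integer/object split and single out `bmi`.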

In [5]:
stroke.describe().T

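|                   | count  | mean         | std          | min   | 25%      | 50%     | 75%     | max     |
| ----------------- | ------ | ------------ | ------------ | ----- | -------- | ------- | ------- | ------- |
| id                | 5110.0 | 36517.829354 | 21161.721625 | 67.0  | 17741.25 | 36932.0 | 54682.0 | 72940.0 |
| age               | 5110.0 | 43.226614    | 22.612647    | 0.08  | 25.0     | 45.0    | 61.0    | 82.0    |
| hypertension      | 5110.0 | 0.097456     | 0.296607     | 0.0   | 0.0      | 0.0     | 0.0     | 1.0     |
| heart_disease     | 5110.0 | 0.054012     | 0.226063     | 0.0   | 0.0      | 0.0     | 0.0     | 1.0     |
| avg_glucose_level | 5110.0 | 106.147677   | 45.28356     | 55.12 | 77.245   | 91.885  | 114.09  | 271.74  |
| bmi               | 4909.0 | 28.893237    | 7.854067     | 10.3  | 23.5     | 28.1    | 33.1    | 97.6    |
| stroke            | 5110.0 | 0.048728     | 0.21532      | 0.0   | 0.0      | 0.0     | 0.0     | 1.0     |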


The table above shows that:
- The mean and median of `age` (43.23 vs. 45.0) and `bmi` (28.89 vs. 28.1) are close, which suggests that both distributions are roughly symmetric and approximately normal. Note that the `bmi` maximum of 97.6 lies far above the 75th percentile (33.1), so some large values sit in the right tail.
- The smallest Body Mass Index (**BMI**) is **10.30** and the largest is **97.60**, so the range of that column is **87.30**.
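Comparing the mean and the median is a quick symmetry check, and the sample skewness is a complementary number to look at. The sketch below computes all three, plus the range, on a made-up series of BMI-like values with one large outlier. The values are purely illustrative and are not drawn from the dataset.

```python
import pandas as pd

# Hypothetical BMI-like values with one large outlier (illustration only).
bmi = pd.Series([22.0, 24.5, 27.0, 28.1, 30.2, 33.0, 97.6])

mean_bmi = bmi.mean()
median_bmi = bmi.median()              # middle value of the sorted series
value_range = bmi.max() - bmi.min()    # largest minus smallest value
skewness = bmi.skew()                  # positive means a longer right tail

print(f"mean={mean_bmi:.2f}, median={median_bmi:.2f}")
print(f"range={value_range:.2f}, skewness={skewness:.2f}")
```

In this toy series the single outlier pulls the mean well above the median, which is the pattern to watch for before treating a column as approximately normal.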

## Data Cleansing

### Missing Value

In [6]:
stroke.isna().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

There are **201 missing values** in the `bmi` column.

Because the distribution of `bmi` is approximately normal, the missing values are filled with the **mean** of that column.

In [7]:
stroke['bmi'] = stroke['bmi'].fillna(stroke['bmi'].mean())

In [8]:
stroke.isna().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

After filling with the mean of `bmi`, the column no longer contains any missing values.
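To make the imputation step concrete, the sketch below fills a missing value with the column mean on a five-row sample taken from the preview rows, and computes the median alongside for comparison. The median variant is an alternative shown here as an assumption for illustration. It is not what this notebook uses.

```python
import numpy as np
import pandas as pd

# Five bmi values from the preview rows; the second one is missing.
sample = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, 24.0]})

# mean() and median() skip NaN by default.
bmi_mean = sample["bmi"].mean()
bmi_median = sample["bmi"].median()

# Mean imputation, the approach taken in this notebook.
filled_mean = sample["bmi"].fillna(bmi_mean)

# Median imputation (assumed alternative, less sensitive to outliers).
filled_median = sample["bmi"].fillna(bmi_median)

print(filled_mean.tolist())
print(filled_median.tolist())
```

If the column turned out to be noticeably skewed, the median would usually be the safer fill value.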

### Duplicate Data

In [9]:
stroke.duplicated().any()

False

The dataset contains no duplicated rows.
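Since `id` is a unique identifier, a full-row `duplicated()` check can never flag two records that differ only in `id`. The sketch below shows, on a hypothetical three-row frame, how dropping `id` before the check may surface such repeats. Whether any exist in the real dataset is not established here.

```python
import pandas as pd

# Hypothetical records: rows 0 and 2 differ only in their id.
records = pd.DataFrame({
    "id": [1, 2, 3],
    "gender": ["Male", "Female", "Male"],
    "age": [67.0, 61.0, 67.0],
    "stroke": [1, 0, 1],
})

# Full-row check: the unique id makes every row distinct.
full_row_duplicates = int(records.duplicated().sum())

# Check again with the identifier column removed.
duplicates_without_id = int(records.drop(columns="id").duplicated().sum())

print(full_row_duplicates, duplicates_without_id)
```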

## Exploratory Data Analysis (EDA)