# Project: Stroke Prediction Binary Classification
---
This part of the project performs binary classification on the [Kaggle Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) using a neural network. It is recommended to run this notebook on Windows to benefit from GPU-accelerated TensorFlow computations (it also runs on Linux if you have the patience to deal with driver problems). The notebook is structured as follows:
* Import necessary Libraries
* Load the pre-downloaded dataset
* Process and clean up the data
* Analyze features and build different models


**Note**: The pre-downloaded data was generated on 12/13/2024.

## Import necessary Libraries

In [1]:
import tensorflow as tf
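
Because the recommendation above hinges on GPU acceleration, it can help to confirm that TensorFlow actually sees a GPU before training. The following is a minimal sketch, assuming TensorFlow 2.x. The helper name `list_gpus` is hypothetical, and the import guard is an assumption added only so the snippet also runs where TensorFlow is not installed:

```python
def list_gpus():
    """Return the GPU devices TensorFlow can see (empty list if none)."""
    try:
        import tensorflow as tf
    except ImportError:
        # Assumption: treat a missing TensorFlow install as "no GPU available".
        return []
    # Lists the physical GPU devices visible to the TensorFlow runtime.
    return tf.config.list_physical_devices('GPU')

gpus = list_gpus()
print("GPUs visible to TensorFlow:", len(gpus))
```

An empty list means computations will fall back to the CPU.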

In [2]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

In [3]:
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Dense, Activation
import tensorflow.keras.backend as K

In [4]:
import pandas as pd
import seaborn as sns

## Load the pre-downloaded dataset:
As in the non-neural-network part of the project, the dataset is loaded from the Dataset folder.

In [5]:
df = pd.read_csv('Dataset/healthcare-dataset-stroke-data.csv')
print("Dataset Shape:", df.shape)
df.head()

Dataset Shape: (5110, 12)


|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


## Data processing and clean up

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [7]:
df.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

The two cells above show that 201 samples are missing bmi values, so we replace the missing entries with the mean bmi.

In [8]:
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())
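
To make the effect of this fill concrete, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from the dataset. Note that pandas skips missing values by default when computing the mean, so only the observed entries contribute:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the stroke dataset).
toy = pd.DataFrame({'bmi': [20.0, np.nan, 30.0, np.nan]})

# mean() ignores NaN by default, so only 20.0 and 30.0 contribute.
bmi_mean = toy['bmi'].mean()

# Replace every missing entry with that mean.
toy['bmi'] = toy['bmi'].fillna(bmi_mean)

print(bmi_mean)                   # 25.0
print(toy['bmi'].isnull().sum())  # 0
```

If the data is later split into training and test sets, computing the mean on the training rows only would avoid leaking test information into the fill value.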

## Analyze features and build different models:
---

### Model Plans:
Since the dataset contains multiple categorical features, such as ever_married, work_type, Residence_type, etc., the following models are planned:
* XL - X Light: One-hot-encoded lightweight model that can run on simple hardware (NPU)
* XM - X Medium: Partially one-hot-encoded model that is slightly larger than XL. A few categorical features will be converted to standardized weights (based on numerical analysis of the data, such as correlations)
* XS - X Super: Model in its full complexity, with all categorical features converted based on analysis and external research papers
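
As a rough illustration of the one-hot encoding that the XL plan relies on, the sketch below applies `pd.get_dummies` to a toy frame. The rows are made up, and the choice of `get_dummies` is an assumption: the notebook does not state which encoder it uses.

```python
import pandas as pd

# Toy rows (hypothetical values) using column names from the dataset.
toy = pd.DataFrame({
    'age': [67.0, 61.0, 49.0],
    'ever_married': ['Yes', 'Yes', 'No'],
    'work_type': ['Private', 'Self-employed', 'Private'],
})

# Columns to expand into one indicator column per category.
categorical_cols = ['ever_married', 'work_type']

# Each category becomes its own 0/1 column; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=categorical_cols, dtype='float32')

print(encoded.columns.tolist())
```

Every category adds one input column, so the input width of the network grows with the number of distinct categories.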


### Analysis Plans:
1. Each feature will first be compared against the others to determine the availability of samples. Then, each distinct feature will be graphed against whether the group with that feature had a stroke. Finally, correlations among features will be determined. Numerical weights will be extracted from categorical features that show a clear correlation.

2. For categorical data with unclear or multiple correlations, external research papers will be consulted so that the categorical data can be converted to weights.
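
Step 1 of the analysis plan can be sketched as follows on a toy frame. The rows are made up, and mapping each category to its standardized stroke rate is only one possible reading of "numerical weights", not necessarily the method used later in the project. The helper name `category_weights` is hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical values) using column names from the dataset.
toy = pd.DataFrame({
    'smoking_status': ['smokes', 'smokes', 'never smoked', 'never smoked',
                       'formerly smoked', 'formerly smoked'],
    'hypertension':   [1, 0, 0, 0, 1, 0],
    'stroke':         [1, 0, 0, 0, 1, 1],
})

# Availability of samples: number of rows per category.
counts = toy['smoking_status'].value_counts()

# Share of stroke cases within each category.
stroke_rate = toy.groupby('smoking_status')['stroke'].mean()

def category_weights(rates):
    """Standardize per-category rates to zero mean and unit variance.

    Assumes the rates are not all identical (otherwise the std is zero).
    """
    return (rates - rates.mean()) / rates.std(ddof=0)

weights = category_weights(stroke_rate)

# Replace the categorical column with its numeric weight.
toy['smoking_weight'] = toy['smoking_status'].map(weights)

# Pearson correlation of each numeric column with the target.
corr = toy[['hypertension', 'smoking_weight', 'stroke']].corr()['stroke']

print(counts)
print(stroke_rate)
print(corr)
```

Because these weights are derived from the target itself, computing them on the training split only would keep test labels out of the features.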
---

"I have a plan. We just need time and money." - Dutch poet: Van Der Linde