## Stroke Predictor using C-Support Vector Classification (SVC)

[Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)

[Google Colab Notebook](https://colab.research.google.com/drive/1reXaYY2rGvJkvWl8G6Em7j6QWIAt5gkc)

[GitHub](https://colab.research.google.com/drive/14Sw8YTHfaTO38bznu9BpVqdYMRFamPYo)

According to the Australian Institute of Health and Welfare (AIHW), stroke affected 1.3% of the Australian population. A stroke occurs when a blood vessel supplying blood to the brain suddenly becomes obstructed or bleeds. As a result, the brain fails to operate properly, which can impair activities such as speaking, thinking and movement.

The dataset can be used for binary classification of whether a patient is likely to have a stroke. It uses attributes such as smoking status, gender, and diagnoses such as heart disease or hypertension.

The target column of the dataset is `stroke`, which takes a binary value of 0 or 1. A simple way to present these values to a user is shown in this script:
```py
for target in df['stroke'].unique():
  if target == 0:
    print(f"{target}: Stroke unlikely")
  elif target == 1:
    print(f"{target}: Stroke possible")
```
where
- 1: Stroke possible
- 0: Stroke unlikely

The input features given to the model are:
- gender
- age
- hypertension
- heart_disease
- ever_married
- work_type
- Residence_type
- avg_glucose_level
- bmi
- smoking_status

```py
import numpy as np
import pandas as pd
import pickle

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import binarize, LabelEncoder, MinMaxScaler, StandardScaler
from sklearn import metrics
from sklearn.metrics import accuracy_score, mean_squared_error, precision_recall_curve
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

import warnings
warnings.filterwarnings('ignore')
```

```py
# Load and inspect the data

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/z5208980/machine-learning-health/main/stroke/data/raw.csv')

print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns, including the target")

# Preview the first five rows
df.head(5)
```

The dataset has 5110 rows and 12 columns, including the target


| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


```py
# Process the data

df.drop("id", axis=1, inplace=True)  # Drop the patient identifier column
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())  # Fill missing BMI values with the column mean

# Scale the continuous features to the [0, 1] range
features = ["age", "avg_glucose_level", "bmi"]
for feature in features:
  scaler = MinMaxScaler()
  df[feature] = scaler.fit_transform(df[[feature]])

# Encode the categorical features as integer labels
features = ["gender", "ever_married", "work_type", "smoking_status", "Residence_type"]
for feature in features:
  encoder = LabelEncoder()
  df[feature] = encoder.fit_transform(df[feature])

# Alternative: one-hot encoding
# df = pd.get_dummies(df, columns=['gender', 'work_type', 'smoking_status'], prefix=['gender', 'work_type', 'smoking_status'])

# Save the processed data
filename = '/content/sample_data/processed.csv'
df.to_csv(filename, index=False)
```
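
To make the label-encoding step more concrete, the sketch below mimics in plain Python how `LabelEncoder` assigns integer codes: it sorts the distinct values and numbers them from 0. The helper name `label_encode` is illustrative only, and the sample values are the `gender` entries from the five preview rows; the codes used in the notebook depend on every distinct value present in the full column.

```python
def label_encode(values):
    """Map each distinct value to its position in sorted order."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# gender values from the five preview rows of the dataset
genders = ["Male", "Female", "Male", "Female", "Female"]
codes, mapping = label_encode(genders)

print(mapping)  # {'Female': 0, 'Male': 1}
print(codes)    # [1, 0, 1, 0, 0]
```

Because the codes follow sort order rather than any medical meaning, the integers should be read as category identifiers, not as magnitudes.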


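```py
# Separate the target from the features
y = df.stroke
X = df.drop("stroke", axis=1)

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=200)

# Standardize the features using statistics from the training split only
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```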
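
The scaler is fitted on the training split and then reused for the test split, so the test rows are transformed with statistics they did not contribute to. A minimal plain-Python sketch of what `StandardScaler` computes for one column follows; the numbers are made up and the helper names `fit_standardizer` and `standardize` are hypothetical.

```python
import statistics


def fit_standardizer(train_column):
    """Return the mean and population standard deviation of the training values."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)  # population std (ddof=0), as StandardScaler uses
    return mean, std


def standardize(column, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(x - mean) / std for x in column]


train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values
test = [5.0, 11.0]                                # illustrative values

mean, std = fit_standardizer(train)     # statistics come from the training data only
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)   # the test data reuses the training statistics

print(mean, std)  # 5.0 2.0
print(test_z)     # [0.0, 3.0]
```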

The chosen model is **SVC**, which yields about 96% accuracy in training and testing. Its parameters are `C = 1.0` and a linear kernel.
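
With a linear kernel, a trained binary classifier predicts from the sign of a weighted sum of the features plus an intercept. The sketch below illustrates that rule in plain Python. The values of `weights` and `bias` are hypothetical and are not taken from the trained model; in scikit-learn the fitted values of a linear-kernel `SVC` are exposed as `coef_` and `intercept_`.

```python
def linear_decision(features, weights, bias):
    """Signed score w.x + b; positive scores fall on the class-1 side."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict(features, weights, bias):
    """Return 1 when the score is positive, otherwise 0."""
    return 1 if linear_decision(features, weights, bias) > 0 else 0


weights = [0.5, -1.0, 2.0]  # hypothetical weights for three features
bias = -1.0                 # hypothetical intercept

print(predict([1.0, 0.0, 1.0], weights, bias))  # score 0.5 + 2.0 - 1.0 = 1.5, prints 1
print(predict([1.0, 1.0, 0.0], weights, bias))  # score 0.5 - 1.0 - 1.0 = -1.5, prints 0
```

The parameter `C` does not appear in this rule: it controls how strongly margin violations are penalized during training, and so influences which weights are learned.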

## Using the model

The input list should have the format `[gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status]`, with the values encoded and scaled in the same way as the training data.
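
Because the model sees only positions and not column names, one way to avoid ordering mistakes is to build the list from a dictionary keyed by feature name. This is a small sketch: `FEATURE_ORDER`, `to_feature_list` and the patient values are illustrative, and the values are assumed to be already encoded and scaled.

```python
FEATURE_ORDER = [
    "gender", "age", "hypertension", "heart_disease", "ever_married",
    "work_type", "Residence_type", "avg_glucose_level", "bmi", "smoking_status",
]


def to_feature_list(record):
    """Return the record's values in the order the model expects."""
    missing = [name for name in FEATURE_ORDER if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [record[name] for name in FEATURE_ORDER]


# Hypothetical, already encoded and scaled patient record (keys in arbitrary order)
patient = {
    "age": 0.66, "bmi": 0.25, "avg_glucose_level": 0.34,
    "gender": 0, "hypertension": 0, "heart_disease": 0,
    "ever_married": 1, "work_type": 2, "Residence_type": 1,
    "smoking_status": 2,
}

print(to_feature_list(patient))
# [0, 0.66, 0, 0, 1, 2, 1, 0.34, 0.25, 2]
```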

```py
def cSVC():
  # Train a linear C-SVC and report its accuracy on the test split
  model = SVC(C=1, kernel='linear')
  model.fit(X_train, y_train)
  y_pred_class = model.predict(X_test)

  print('RESULT')
  print('Accuracy:', metrics.accuracy_score(y_test, y_pred_class))

  return model

model = cSVC()

# Save the trained model
filename = '/content/sample_data/model.sav'
with open(filename, 'wb') as f:
  pickle.dump(model, f)
```

RESULT
Accuracy: 0.9577464788732394
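
Accuracy is the fraction of test rows that were predicted correctly. As a quick arithmetic check, the sketch below assumes that a `test_size` of 0.25 on 5110 rows gives 1278 test rows (scikit-learn rounds a fractional test size up). The count of correct predictions is inferred from the printed accuracy rather than read from the notebook.

```python
import math

n_rows = 5110
n_test = math.ceil(0.25 * n_rows)  # assumed test-split size: 1278 rows
reported = 0.9577464788732394      # accuracy printed by the notebook

n_correct = round(reported * n_test)  # inferred number of correct predictions
accuracy = n_correct / n_test

print(n_test, n_correct, n_test - n_correct)  # 1278 1224 54
print(abs(accuracy - reported) < 1e-12)       # True
```

Under these assumptions, the reported figure corresponds to 54 misclassified rows out of 1278.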


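```py
# Load the saved model
with open('/content/sample_data/model.sav', 'rb') as f:
  model = pickle.load(f)

row = 526
val = X.iloc[row].tolist()  # Feature values in the same order as the training columns

# The model was trained on standardized features, so apply the same fitted scaler
sample = scaler.transform(X.iloc[[row]])
output = model.predict(sample)

print("X=%s, Predicted=%s, Actually=%s" % (val, output[0], y.iloc[row]))
```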

X=[0.0, 0.658203125, 0.0, 0.0, 1.0, 2.0, 1.0, 0.3417966946726987, 0.2531500572737686, 2.0], Predicted=0, Actually=0
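
The feature values printed above are min-max scaled, so they are not directly readable. Min-max scaling maps a value x to (x - min) / (max - min), and it can be inverted when the original column range is known. The sketch below assumes that the raw `age` column ranges from 0.08 to 82; this is an assumption that should be checked with `df['age'].min()` and `df['age'].max()` before scaling. The helper names are illustrative.

```python
def min_max_scale(x, lo, hi):
    """Map x from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)


def min_max_unscale(z, lo, hi):
    """Invert min_max_scale: map z from [0, 1] back to [lo, hi]."""
    return z * (hi - lo) + lo


AGE_MIN, AGE_MAX = 0.08, 82.0  # assumed range of the raw age column

scaled_age = 0.658203125  # second value in the printed feature list
age = min_max_unscale(scaled_age, AGE_MIN, AGE_MAX)

print(round(age, 2))  # 54.0
```

Under that assumed range, the patient in row 526 is 54 years old.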
