# Stroke Prediction


## Context

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.

This dataset is used to predict whether a patient is likely to have a stroke, based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides the relevant information about one patient.


## Attribute Information

### Features
* id: unique identifier
* gender: `Male`, `Female` or `Other`
* age: age of the patient
* hypertension: `0` if the patient doesn't have hypertension, `1` if the patient has hypertension
* heart_disease: `0` if the patient doesn't have any heart diseases, `1` if the patient has a heart disease
* ever_married: `No` or `Yes`
* work_type: `children`, `Govt_job`, `Never_worked`, `Private` or `Self-employed`
* Residence_type: `Rural` or `Urban`
* avg_glucose_level: average glucose level in blood in mg/dL
* bmi: body mass index
* smoking_status: `formerly smoked`, `never smoked`, `smokes` or `Unknown`

### Target
* stroke: `1` if the patient had a stroke, `0` if not
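The attribute list above can be turned into a small sanity check for individual records. The following is a minimal sketch, not part of the original workflow: the helper `find_problems`, the example patient, and the `Govt_job` spelling of the government work type are assumptions made for illustration.

```python
# Allowed categories, taken from the attribute list above.
# Assumption: the government work type is spelled `Govt_job` in the data.
CATEGORICAL = {
    "gender": {"Male", "Female", "Other"},
    "ever_married": {"No", "Yes"},
    "work_type": {"children", "Govt_job", "Never_worked", "Private", "Self-employed"},
    "Residence_type": {"Rural", "Urban"},
    "smoking_status": {"formerly smoked", "never smoked", "smokes", "Unknown"},
}
BINARY_FLAGS = ("hypertension", "heart_disease", "stroke")
NUMERIC = ("age", "avg_glucose_level", "bmi")


def find_problems(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for column, allowed in CATEGORICAL.items():
        if record.get(column) not in allowed:
            problems.append(f"{column}: unexpected value {record.get(column)!r}")
    for column in BINARY_FLAGS:
        if record.get(column) not in (0, 1):
            problems.append(f"{column}: expected 0 or 1")
    for column in NUMERIC:
        value = record.get(column)
        if not isinstance(value, (int, float)) or value < 0:
            problems.append(f"{column}: expected a non-negative number")
    return problems


# Made-up example patient, not taken from the data set.
patient = {
    "id": 1, "gender": "Female", "age": 61, "hypertension": 0,
    "heart_disease": 0, "ever_married": "Yes", "work_type": "Private",
    "Residence_type": "Urban", "avg_glucose_level": 105.3, "bmi": 27.4,
    "smoking_status": "never smoked", "stroke": 0,
}

print(find_problems(patient))  # []
print(find_problems(dict(patient, work_type="Student")))
```

A check like this catches misspelled category values before they reach the model, where they would otherwise be treated as a new category.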


## Goal
You want to build a predictive model for pre-screening patients for a high stroke risk. The idea is that each patient's features are analyzed automatically and, if the model detects a high stroke risk, you are informed about it.

__Important:__ This new tool is only meant to assist you in your daily work; it is not meant to diagnose a patient without human interaction. All patients will still be checked by a doctor.



## Data Set

You have collected ~4000 historical patient records, each including whether the patient had a stroke or not. A preview of this data set is available below.

In [None]:
import sklearn.metrics

from xautoml.util.config import plot_runhistory
from xautoml.util.datasets import stroke

X_train, y_train = stroke('stroke.csv', train=True)
X_train

## Start the Model Building

To create a predictive model, you load the data set into an AutoML tool you have found on the internet. After you start the optimization, the AutoML tool tests various possible models and evaluates how good each candidate is. In the meantime, you have to wait for the program to finish its optimization.
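The "test candidates and evaluate each one" loop can be illustrated with a deliberately simplified sketch. It is not how auto-sklearn works internally (auto-sklearn uses Bayesian optimization and builds ensembles); it uses plain random search over a hypothetical one-parameter model. The data, the `make_candidate` helper, and the age-threshold "model" are all made up for illustration.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (both classes must occur)."""
    recalls = []
    for label in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == label]
        recalls.append(sum(1 for i in idx if y_pred[i] == label) / len(idx))
    return sum(recalls) / len(recalls)


def make_candidate(threshold):
    """Hypothetical candidate: predict a stroke if age exceeds the threshold."""
    def predict(ages):
        return [1 if age > threshold else 0 for age in ages]
    return predict


# Made-up validation data, not taken from the stroke data set.
ages = [25, 34, 41, 52, 58, 63, 67, 71, 76, 82]
strokes = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

rng = random.Random(0)  # fixed seed so the search is repeatable
run_history = []
for _ in range(20):
    # Sample one candidate configuration, then score it.
    threshold = rng.uniform(20, 90)
    predict = make_candidate(threshold)
    score = balanced_accuracy(strokes, predict(ages))
    run_history.append((score, threshold))

# The best candidate is the one with the highest score.
best_score, best_threshold = max(run_history)
print(f"best threshold: {best_threshold:.1f}, balanced accuracy: {best_score:.3f}")
```

A real AutoML tool searches over whole pipelines (preprocessing steps, model types, hyperparameters) rather than a single number, but the record of evaluated candidates plays the same role as `run_history` here.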

In [None]:
import pickle
import autosklearn.classification
from autosklearn.metrics import balanced_accuracy

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=900,
    per_run_time_limit=10,
    tmp_folder='/opt/xautoml/autosklearn/stroke/',
    max_models_on_disc=None,
    delete_tmp_folder_after_terminate=False,
    metric=balanced_accuracy
)
automl.fit(X_train, y_train, dataset_name='stroke')

with open('/opt/xautoml/autosklearn/stroke/autosklearn.pkl', 'wb') as f:
    pickle.dump(automl, f)

In [None]:
import pickle

with open('/opt/xautoml/autosklearn/stroke/autosklearn.pkl', 'rb') as f:
    automl = pickle.load(f)

After waiting for 15 minutes, you are presented with the following results:

### The score of the Final Model

Internally, the AutoML tool uses a measure to determine how good a candidate is, for example the share of correct predictions (accuracy). In this run, the measure is balanced accuracy. After the optimization, you want to check how well the best model actually performs before using it with patients. Therefore, you have held back a part of the data set, which you now use to test the best model:

In [None]:
from sklearn.metrics import balanced_accuracy_score

X_test, y_test = stroke('stroke.csv', test=True)

predictions = automl.predict(X_test)
balanced_accuracy_score(y_test, predictions)

This value is the balanced accuracy on new patients the model has never seen before: the average of the share of stroke patients and the share of non-stroke patients that the model classifies correctly.
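To make the score computed with `balanced_accuracy_score` more tangible, the following sketch computes balanced accuracy by hand for binary labels, as the mean of the per-class recall, and compares it with plain accuracy. The labels are made up for illustration and are not taken from the stroke data set; the helper functions are hypothetical and assume that both classes occur in `y_true`.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels 0 and 1."""
    recalls = []
    for label in (0, 1):
        # Indices of all samples that truly belong to this class.
        idx = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Made-up labels: 95 patients without a stroke, 5 with a stroke.
y_true = [0] * 95 + [1] * 5
# A useless model that never predicts a stroke.
always_negative = [0] * 100

print(accuracy(y_true, always_negative))           # 0.95
print(balanced_accuracy(y_true, always_negative))  # 0.5
```

When one class is much rarer than the other, plain accuracy can look high even though every patient in the rare class is missed. Balanced accuracy weights both classes equally, so the same model scores no better than guessing.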


### View the Models found by auto-sklearn

Besides the raw performance, the tool also tells you which models performed best:

In [None]:
automl.leaderboard()

With this information, you are good to go and can decide whether you actually want to use the generated model.

## Load the Same Results in XAutoML

In [None]:
from xautoml.main import XAutoML
from xautoml.adapter import import_auto_sklearn
from xautoml.util.datasets import stroke
import pickle

with open('/opt/xautoml/autosklearn/stroke/autosklearn.pkl', 'rb') as f:
    automl = pickle.load(f)

X_test, y_test = stroke('stroke.csv', test=True)


rh = import_auto_sklearn(automl)
main = XAutoML(rh, X_test, y_test)

main

In [None]:
main.explain(include={'overview', 'candidate:domain', 'ensemble'})

In [None]:
main.explain_domain(rank=0, exclude={'candidate:domain:performance'})