## Loading Dataset

We begin by loading the original healthcare stroke dataset, which is located in the Dataset folder. We'll use pandas to read it into a DataFrame.


In [1]:
import pandas as pd

# Load the dataset
dataset_path = 'Dataset/healthcare-dataset-stroke-data.csv'
df = pd.read_csv(dataset_path)

# Display basic information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


We will take exactly 100 rows, preserving the proportion of stroke == 1 and stroke == 0 from the original dataset. The output will be saved to the Dataset/ folder.

In [8]:
from sklearn.model_selection import train_test_split

# Get stratified sample of exactly 100 rows
# First, split out 100 samples using stratified sampling on the 'stroke' column
_, sample_df = train_test_split(
    df,
    stratify=df['stroke'],
    test_size=100,
    random_state=42
)

# Verify the distribution
sample_df['stroke'].value_counts(normalize=True)

stroke
0    0.95
1    0.05
Name: proportion, dtype: float64

In [9]:
# Count how many samples have stroke == 1
stroke_positive_count = sample_df['stroke'].sum()
print(f"Number of positive stroke cases in sample: {stroke_positive_count}")

Number of positive stroke cases in sample: 5
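
To see why the sample keeps the class ratio, here is a minimal sketch on synthetic data. The toy frame, its 5% positive rate, and the names `toy_df` and `toy_sample` are assumptions made for illustration; only the `train_test_split` call mirrors the cell above. Passing an integer `test_size` requests an absolute number of rows, and `stratify` keeps the class proportions of the column it is given.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke data: 1,000 rows, 5% positive (assumed numbers).
toy_df = pd.DataFrame({
    "feature": np.arange(1000),
    "stroke": [1] * 50 + [0] * 950,
})

# An integer test_size asks for that many rows instead of a fraction of the data.
_, toy_sample = train_test_split(
    toy_df,
    stratify=toy_df["stroke"],
    test_size=100,
    random_state=0,
)

print(len(toy_sample))                   # 100
print(int(toy_sample["stroke"].sum()))   # 5
```

With only five positive rows in a 100-row sample, any metric computed on the positive class rests on very few distinct cases, which is worth keeping in mind when reading the evaluation later on.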


To make the sample dataset easier to trace, we will overwrite the id column so that it runs from 1 to 100 sequentially.

In [10]:
# Overwrite the 'id' column with values from 1 to 100
sample_df = sample_df.copy()
sample_df['id'] = range(1, 101)

# Confirm the change
sample_df[['id', 'stroke']].head()

|      | id | stroke |
|------|----|--------|
| 379  | 1  | 0      |
| 4847 | 2  | 0      |
| 1834 | 3  | 0      |
| 3341 | 4  | 0      |
| 1265 | 5  | 0      |


In [11]:
# Save the sample to the Dataset folder
sample_df.to_csv('Dataset/SP_sample.csv', index=False)
print("✅ Stratified 100-sample dataset saved to 'Dataset/SP_sample.csv'")

✅ Stratified 100-sample dataset saved to 'Dataset/SP_sample.csv'


## Evaluate and Test the Models

The Stroke Prediction project produces a model that is exported in the TFLite and ONNX formats. The exported models are evaluated and compared in terms of their performance.

In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.utils import resample
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix

import onnxruntime as ort

Below is a preprocessing function that will be used to prepare the data for training and testing; it follows the same steps as the original code.

In [3]:
def preprocess_sample_data(df_sample):
    df = df_sample.copy()
    
    df = df.drop(columns="id")
    
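    # Bucket ages into named groups.
    # NOTE: ages in (18, 19] match no branch and fall through to "Old Aged Adults".
    # The thresholds are left unchanged because this function is meant to mirror
    # the original preprocessing.
    df["age_group"] = df["age"].apply(lambda x: "Infant" if (x >= 0) & (x <= 2)
        else ("Child" if (x > 2) & (x <= 12)
        else ("Adolescent" if (x > 12) & (x <= 18)
        else ("Young Adults" if (x > 19) & (x <= 35)
        else ("Middle Aged Adults" if (x > 35) & (x <= 60)
        else "Old Aged Adults")))))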

    df['bmi'] = df['bmi'].fillna(df.groupby(["gender", "ever_married", "age_group"])["bmi"].transform('mean'))
    
    df = df[(df["bmi"] < 66) & (df["bmi"] > 12)]
    df = df[(df["avg_glucose_level"] > 56) & (df["avg_glucose_level"] < 250)]
    df = df.drop(df[df["gender"] == "Other"].index)
    
    had_stroke = df[df["stroke"] == 1]
    no_stroke = df[df["stroke"] == 0]
    upsampled_had_stroke = resample(had_stroke, replace=True, n_samples=no_stroke.shape[0], random_state=123)
    upsampled_data = pd.concat([no_stroke, upsampled_had_stroke])
    
    # One-hot encoding
    cols = ['gender', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
    dums = pd.get_dummies(upsampled_data[cols], dtype=int)
    
    # Ensure all expected dummy columns are present
    expected_dummy_cols = [
        'gender_Female', 'gender_Male',
        'ever_married_No', 'ever_married_Yes',
        'work_type_Govt_job', 'work_type_Never_worked',
        'work_type_Private', 'work_type_Self-employed', 'work_type_children',
        'Residence_type_Rural', 'Residence_type_Urban',
        'smoking_status_Unknown', 'smoking_status_formerly smoked',
        'smoking_status_never smoked', 'smoking_status_smokes'
    ]
    
    for col in expected_dummy_cols:
        if col not in dums:
            dums[col] = 0  # Add missing columns as 0s
    
    # Reorder to match model input
    dums = dums[expected_dummy_cols]
    
    model_data = pd.concat([upsampled_data.drop(columns=cols), dums], axis=1)

    # Encode ordinal column
    encoder = LabelEncoder()
    model_data["age_group"] = encoder.fit_transform(model_data["age_group"])
    
    # Normalize numerical features
    scaler = MinMaxScaler()
    for col in ['age', 'avg_glucose_level', 'bmi']:
        scaler.fit(model_data[[col]])
        model_data[col] = scaler.transform(model_data[[col]])
        
    return model_data
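
Two steps in the function above are easier to follow on a tiny example: the group-mean BMI imputation and the upsampling of the minority class. The sketch below uses a made-up six-row frame and groups by `gender` only (the function groups by three columns); the names `toy`, `group_mean`, and `balanced` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "bmi":    [20.0, 30.0, np.nan, 22.0, np.nan, 26.0],
    "stroke": [0, 0, 1, 0, 0, 1],
})

# Group-mean imputation: transform("mean") returns a Series aligned with the
# original rows, so fillna only replaces the missing entries.
group_mean = toy.groupby("gender")["bmi"].transform("mean")
toy["bmi"] = toy["bmi"].fillna(group_mean)
print(toy["bmi"].tolist())  # [20.0, 30.0, 25.0, 22.0, 24.0, 26.0]

# Upsampling: draw minority rows with replacement until both classes are equal in size.
minority = toy[toy["stroke"] == 1]
majority = toy[toy["stroke"] == 0]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=123)
balanced = pd.concat([majority, minority_up])
print(balanced["stroke"].value_counts().sort_index().to_dict())  # {0: 4, 1: 4}
```

If a group has no BMI values at all, its mean is NaN and the row stays missing; in the function above such a row is then dropped by the BMI range filter, because comparisons with NaN evaluate to false.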


Load and preprocess the dataset

In [None]:
# Load raw dataset
df_raw = pd.read_csv("Dataset/SP_sample.csv")

model_data = preprocess_sample_data(df_raw)

X_processed = model_data.drop(columns="stroke")
y_true = model_data["stroke"]

model_data.head()


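|     | age      | avg_glucose_level | bmi      | stroke | age_group | gender_Female | gender_Male | ever_married_No | ever_married_Yes | work_type_Govt_job | work_type_Never_worked | work_type_Private | work_type_Self-employed | work_type_children | Residence_type_Rural | Residence_type_Urban | smoking_status_Unknown | smoking_status_formerly smoked | smoking_status_never smoked | smoking_status_smokes |
|-----|----------|-------------------|----------|--------|-----------|---------------|-------------|-----------------|------------------|--------------------|------------------------|-------------------|-------------------------|--------------------|----------------------|----------------------|------------------------|--------------------------------|-----------------------------|-----------------------|
| 249 | 0.035645 | 0.20208           | 0.108571 | 0      | 1         | 0             | 1           | 1               | 0                | 0                  | 0                      | 0                 | 0                       | 1                  | 1                    | 0                    | 1                      | 0                              | 0                           | 0                     |
| 250 | 0.707031 | 0.165028          | 0.512381 | 0      | 3         | 0             | 1           | 0               | 1                | 0                  | 0                      | 1                 | 0                       | 0                  | 0                    | 1                    | 0                      | 0                              | 1                           | 0                     |
| 251 | 0.09668  | 0.283689          | 0.100952 | 0      | 1         | 1             | 0           | 1               | 0                | 0                  | 0                      | 1                 | 0                       | 0                  | 0                    | 1                    | 1                      | 0                              | 0                           | 0                     |
| 252 | 0.853516 | 0.067119          | 0.449524 | 0      | 4         | 1             | 0           | 0               | 1                | 0                  | 0                      | 1                 | 0                       | 0                  | 1                    | 0                    | 0                      | 1                              | 0                           | 0                     |
| 253 | 0.169922 | 0.544452          | 0.129524 | 0      | 0         | 0             | 1           | 1               | 0                | 0                  | 1                      | 0                 | 0                       | 0                  | 1                    | 0                    | 1                      | 0                              | 0                           | 0                     |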
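
The fixed column order in this output comes from aligning the one-hot columns to an expected list. A small sample may lack a category entirely, so the alignment step has to add the missing columns as zeros. The sketch below shows the same idea with `DataFrame.reindex`, which is a swapped-in alternative to the loop in `preprocess_sample_data`, not the notebook's own code; the three-column `expected_cols` list and the toy frame are assumptions for illustration.

```python
import pandas as pd

# Shortened, illustrative version of the expected column list.
expected_cols = ["work_type_Govt_job", "work_type_Never_worked", "work_type_Private"]

# A small sample that happens to contain no 'Never_worked' rows.
toy = pd.DataFrame({"work_type": ["Private", "Govt_job", "Private"]})
dums = pd.get_dummies(toy[["work_type"]], dtype=int)

# reindex adds any missing column filled with 0, drops unexpected columns,
# and puts everything in the requested order.
aligned = dums.reindex(columns=expected_cols, fill_value=0)

print(aligned.columns.tolist())
print(aligned["work_type_Never_worked"].tolist())  # [0, 0, 0]
```

Either approach gives the model the same feature layout on every run, regardless of which categories the sample contains.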


Create Dataset File for RKNN

In [5]:
X_processed.to_csv('./Models/dataset.txt', index=False, header=False, float_format='%.6f')
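
To check what the exported file contains, the sketch below writes a made-up two-row frame to an in-memory buffer with the same `to_csv` arguments. The toy values and the name `buffer` are assumptions; whether this layout is exactly what the RKNN tooling expects is not confirmed here.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "age": [0.035645, 0.707031],
    "avg_glucose_level": [0.20208, 0.165028],
    "gender_Male": [1, 1],
})

buffer = io.StringIO()  # in-memory stand-in for './Models/dataset.txt'
toy.to_csv(buffer, index=False, header=False, float_format="%.6f")
lines = buffer.getvalue().splitlines()

# float_format applies to float columns only; integer columns are written as-is.
print(lines[0])  # 0.035645,0.202080,1
```

Each line is one preprocessed row with no header and no index, so the column order must match the order of the model's input features.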

Run ONNX Inference

In [None]:
onnx_path = "Models/best_model_91.onnx"
session = ort.InferenceSession(onnx_path)
input_name = session.get_inputs()[0].name

X_input = np.array(X_processed, dtype=np.float32)
onnx_preds = session.run(None, {input_name: X_input})[0].flatten()
y_pred_onnx = (onnx_preds > 0.5).astype(int)
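
The last two lines of the cell above turn raw model output into class labels. Here is a minimal NumPy-only sketch of that step, assuming the model emits one probability per row with shape `(N, 1)`; the values in `raw_output` are hypothetical.

```python
import numpy as np

# Hypothetical raw model output: one probability per row, shape (N, 1).
raw_output = np.array([[0.91], [0.08], [0.50], [0.73]], dtype=np.float32)

probs = raw_output.flatten()         # shape (4,)
labels = (probs > 0.5).astype(int)   # strict '>' means exactly 0.5 maps to class 0

print(labels.tolist())  # [1, 0, 0, 1]
```

Before running inference, `session.get_inputs()[0].shape` can be compared with `X_input.shape` to confirm that the number of feature columns matches what the model expects.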


Below is the function definition for evaluating the ONNX model performance:

In [39]:
def evaluate_model(y_true, y_pred, label):
    print(f"--- {label} ---")
    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score:  {f1_score(y_true, y_pred):.4f}")
    print("Confusion Matrix:")
    print(confusion_matrix(y_true, y_pred))
    print()

In [40]:
evaluate_model(y_true, y_pred_onnx, "ONNX Model")

--- ONNX Model ---
Accuracy:  0.9211
Precision: 0.8636
Recall:    1.0000
F1 Score:  0.9268
Confusion Matrix:
[[80 15]
 [ 0 95]]
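
The four scores can be re-derived by hand from the confusion matrix, which is a useful sanity check. The sketch below uses the matrix printed above and scikit-learn's layout of `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

# Confusion matrix from the ONNX run; layout is [[TN, FP], [FN, TP]].
cm = np.array([[80, 15],
               [0, 95]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()                      # 175 / 190
precision = tp / (tp + fp)                           # 95 / 110
recall = tp / (tp + fn)                              # 95 / 95
f1 = 2 * precision * recall / (precision + recall)   # 190 / 205

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
# 0.9211 0.8636 1.0000 0.9268
```

The two rows each sum to 95 because the preprocessing upsamples the stroke class to the size of the non-stroke class, so the 95 positive rows are resampled copies of the few positive cases in the 100-row sample.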



Below is a comparison between the performance of the original model and the ONNX model (TL;DR: they're practically the same):

| **Metric**             | **Original Model (Full Test Set)** | **ONNX Model (Sample Dataset)** | **Remarks**                                |
|------------------------|------------------------------------|---------------------------------|--------------------------------------------|
| **Accuracy**           | 91.04%                             | 92.11%                          | Very close; ONNX model slightly higher.    |
| **Precision (Stroke)** | 85%                                | 86.36%                          | Slight improvement in the sample run.      |
| **Recall (Stroke)**    | 100%                               | 100%                            | Perfect in both; no false negatives.       |
| **F1 Score (Stroke)**  | 92%                                | 92.68%                          | Slightly higher in the sample run.         |
| **False Positives**    | 169                                | 15                              | Comparable ratio relative to set size.     |
| **False Negatives**    | 0                                  | 0                               | Critical metric maintained.                |


## Convert to an RKNN Model for Deployment on an Embedded Device

In [16]:
model_data.info()


<class 'pandas.core.frame.DataFrame'>
Index: 190 entries, 0 to 91
Data columns (total 20 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   age                             190 non-null    float64
 1   avg_glucose_level               190 non-null    float64
 2   bmi                             190 non-null    float64
 3   stroke                          190 non-null    int64  
 4   age_group                       190 non-null    int64  
 5   gender_Female                   190 non-null    int64  
 6   gender_Male                     190 non-null    int64  
 7   ever_married_No                 190 non-null    int64  
 8   ever_married_Yes                190 non-null    int64  
 9   work_type_Govt_job              190 non-null    int64  
 10  work_type_Never_worked          190 non-null    int64  
 11  work_type_Private               190 non-null    int64  
 12  work_type_Self-employed         190 non-nu