Avisprof/stroke-prediction

Stroke Prediction Project

Overview

This project focuses on building a machine learning model to predict the likelihood of a patient having a stroke. According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. Early and accurate prediction can be crucial for preventive healthcare and improving patient outcomes.

In this project, I developed a classification model based on patient input parameters such as gender, age, existing conditions (hypertension, heart disease), and smoking status. The goal is to provide a reliable tool for identifying high-risk individuals.

Dataset

The dataset contains relevant medical and demographic information about patients. Each row represents a unique patient record.

Features include:

  • id: Unique identifier
  • gender: Patient's gender ("Male", "Female")
  • age: Patient's age
  • hypertension: Boolean for hypertension (0 = No, 1 = Yes)
  • heart_disease: Boolean for heart disease (0 = No, 1 = Yes)
  • ever_married: Marital status ("No", "Yes")
  • work_type: Type of employment ("children", "Govt_job", "Never_worked", "Private", "Self-employed")
  • residence_type: Area of residence ("Rural", "Urban")
  • avg_glucose_level: Average glucose level in blood
  • bmi: Body Mass Index
  • smoking_status: Smoking status ("formerly smoked", "never smoked", "smokes", "Unknown")
  • stroke: Target variable (0 = No Stroke, 1 = Stroke)
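The schema above can be checked programmatically before a record reaches the model. The following is a minimal sketch, not code from the repository: the `validate_record` helper and the sample record are hypothetical, and the allowed values are copied from the feature list.

```python
# Allowed values copied from the feature list above.
CATEGORICAL = {
    "gender": {"Male", "Female"},
    "ever_married": {"No", "Yes"},
    "work_type": {"children", "Govt_job", "Never_worked", "Private", "Self-employed"},
    "residence_type": {"Rural", "Urban"},
    "smoking_status": {"formerly smoked", "never smoked", "smokes", "Unknown"},
}
BINARY = ("hypertension", "heart_disease")
NUMERIC = ("age", "avg_glucose_level", "bmi")


def validate_record(record):
    """Return a list of problems found in one patient record (hypothetical helper)."""
    problems = []
    for name, allowed in CATEGORICAL.items():
        if record.get(name) not in allowed:
            problems.append(f"{name}: unexpected value {record.get(name)!r}")
    for name in BINARY:
        if record.get(name) not in (0, 1):
            problems.append(f"{name}: expected 0 or 1")
    for name in NUMERIC:
        value = record.get(name)
        # bmi may be missing in the raw data, so None is tolerated for it only.
        if value is None and name == "bmi":
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)) or value < 0:
            problems.append(f"{name}: expected a non-negative number")
    return problems


# Illustrative record, not taken from the dataset.
sample = {
    "gender": "Female", "age": 61, "hypertension": 0, "heart_disease": 0,
    "ever_married": "Yes", "work_type": "Self-employed", "residence_type": "Rural",
    "avg_glucose_level": 202.2, "bmi": None, "smoking_status": "never smoked",
}
print(validate_record(sample))
```

An empty list means the record matches the documented schema. The target column (stroke) is left out on purpose, since it is not available at prediction time.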

Methodology

1. Exploratory Data Analysis (EDA)

The initial phase involved a thorough EDA inside the notebook.ipynb file to understand the data's structure, distribution, and relationships. This included:

  • Handling missing values (e.g., in the bmi column).
  • Analyzing the distribution of numerical and categorical variables.
  • Creating various visualizations (histograms, count plots) to uncover patterns.
  • Identifying a significant class imbalance in the target variable (stroke), where the number of positive cases (stroke) was much lower than negative cases.
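Two of the EDA steps above, filling missing bmi values and measuring the class imbalance, can be sketched in a few lines. This is an illustration under assumptions: the rows are toy values, and median imputation is one common choice; the notebook may handle missing bmi values differently.

```python
from collections import Counter
from statistics import median

# Toy rows for illustration only; they are not taken from the project's dataset.
rows = [
    {"bmi": 36.6, "stroke": 1},
    {"bmi": None, "stroke": 0},
    {"bmi": 24.1, "stroke": 0},
    {"bmi": 28.9, "stroke": 0},
    {"bmi": 31.0, "stroke": 0},
]


def impute_median(rows, column):
    """Fill missing values in `column` with the median of the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def class_share(rows, target="stroke"):
    """Return the fraction of rows per class label."""
    counts = Counter(r[target] for r in rows)
    return {label: n / len(rows) for label, n in counts.items()}


clean = impute_median(rows, "bmi")
print(clean[1]["bmi"])     # the previously missing value
print(class_share(clean))  # share of each stroke label
```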

2. Data Preprocessing

To prepare the data for modeling, the following steps were taken:

  • Feature Encoding: Categorical variables (e.g., gender, ever_married, work_type, smoking_status) were converted into numerical format using appropriate techniques like Label Encoding or One-Hot Encoding.

  • Handling Class Imbalance: The Synthetic Minority Over-sampling Technique (SMOTE) was applied to address the severe class imbalance. SMOTE generates synthetic samples from the minority class (stroke patients) to create a balanced dataset, which helps prevent the model from being biased towards the majority class and improves its ability to recognize the positive cases.

  • Data Splitting: The dataset was split into training and testing sets to ensure unbiased evaluation.

  • Feature Scaling: Numerical features were standardized to have a mean of 0 and a standard deviation of 1. This matters for models such as Logistic Regression, whose regularization and gradient-based solvers are sensitive to feature scale.
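The interaction between splitting, scaling, and oversampling can be sketched in plain Python. This is a simplified illustration, not the project's code: `standardize` and `smote_like` are hypothetical helpers, the feature values are made up, and the sketch assumes that oversampling is applied to the training portion only so that the test set contains no synthetic patients. In practice a library implementation such as imbalanced-learn's `SMOTE` would normally be used.

```python
import random
from statistics import mean, pstdev


def standardize(train, test):
    """Scale columns using the mean and standard deviation of the training rows only."""
    columns = list(zip(*train))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # avoid dividing by zero

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]

    return scale(train), scale(test)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like(minority, n_new, k=2, seed=0):
    """Interpolate n_new points between minority samples and their k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy numeric features (age, avg_glucose_level); values are illustrative only.
X = [[67, 228.7], [49, 171.2], [80, 105.9], [35, 82.0], [74, 70.1],
     [23, 95.1], [41, 77.3], [58, 88.4], [62, 120.5], [30, 99.0]]
y = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]

# Hold out the last two rows; a real split would be shuffled and stratified.
X_train, X_test, y_train, y_test = X[:8], X[8:], y[:8], y[8:]
X_train, X_test = standardize(X_train, X_test)

# Oversample only the training minority class.
minority = [x for x, label in zip(X_train, y_train) if label == 1]
new_points = smote_like(minority, y_train.count(0) - len(minority))
X_balanced = X_train + new_points
y_balanced = y_train + [1] * len(new_points)
print(y_balanced.count(0), y_balanced.count(1))
```

Each synthetic point lies on the line segment between two real minority samples, which is the core idea of SMOTE. Scaling statistics come from the training rows alone, so nothing about the held-out rows leaks into preprocessing.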

3. Model Training and Evaluation

Several machine learning classifiers were implemented and compared:

  • LogisticRegression
  • DecisionTreeClassifier
  • RandomForestClassifier
  • XGBClassifier

The models were evaluated using metrics suitable for imbalanced datasets, including:

  • Precision, Recall, and F1-Score
  • Accuracy (reported for reference; on its own it can be misleading when classes are imbalanced)
  • ROC-AUC Score
  • Confusion Matrix
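To see why precision and recall matter more than accuracy on imbalanced data, the metrics can be computed directly from confusion-matrix counts. The sketch below uses made-up counts, not results from this project, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute metrics for the positive (stroke) class from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up counts: a model that predicts "no stroke" for all 1000 patients,
# 50 of whom actually had a stroke.
always_negative = classification_metrics(tp=0, fp=0, fn=50, tn=950)
print(always_negative)

# Made-up counts for a model that trades some precision for recall.
recall_oriented = classification_metrics(tp=30, fp=70, fn=20, tn=880)
print(recall_oriented)
```

The first model reaches high accuracy while finding no stroke patients at all, which is why recall for the stroke class is tracked separately.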

4. Model Selection and Hyperparameter Tuning

After an initial evaluation, hyperparameter tuning was performed to optimize the performance of the promising models.

The Logistic Regression model was selected as the final model for this project. The key reasons for this choice are:

  • Performance: After tuning and with SMOTE applied, it performed consistently well on the key evaluation metrics, particularly recall for the stroke class.

  • Interpretability: The model's coefficients are easily interpretable, allowing us to understand the influence of each feature on the prediction (e.g., "An increase in age is associated with an increase in the log-odds of having a stroke").

  • Speed & Efficiency: Logistic Regression is computationally efficient for both training and prediction, making it suitable for potential deployment in real-time applications.
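The interpretability point can be made concrete: in logistic regression, each coefficient is a change in log-odds, and its exponential is an odds ratio. The coefficients below are hypothetical values chosen for illustration and are not the fitted model's parameters.

```python
import math


def stroke_probability(features, coefficients, intercept):
    """Logistic regression: sigmoid of the linear combination of features."""
    log_odds = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients on standardized features; NOT the fitted model's values.
coefficients = {"age": 1.2, "hypertension": 0.4, "avg_glucose_level": 0.3}
intercept = -2.0

# exp(coefficient) is the odds ratio for a one-unit increase in that feature.
odds_ratios = {name: math.exp(c) for name, c in coefficients.items()}

baseline = stroke_probability(
    {"age": 0.0, "hypertension": 0, "avg_glucose_level": 0.0}, coefficients, intercept
)
older = stroke_probability(
    {"age": 1.0, "hypertension": 0, "avg_glucose_level": 0.0}, coefficients, intercept
)
print(round(odds_ratios["age"], 2), round(baseline, 3), round(older, 3))
```

With standardized inputs, a one-unit increase in age means one standard deviation, so the odds ratio describes the effect of being one standard deviation older while the other features stay fixed.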

Results

The final Logistic Regression model achieved a satisfactory balance between precision and recall for the positive class. The ROC-AUC score confirms the model's good capability to distinguish between patients who are likely to have a stroke and those who are not.

Key findings from the model coefficients align with known medical insights, such as age, hypertension, and average glucose level being significant risk factors.

Deployment

The model has been deployed as a web service with two access points:

  • Port 9696: REST API endpoint for programmatic predictions
  • Port 8501: Streamlit web interface for interactive use

How to reproduce this project?

Option 1: Build and Run from Source

  1. Clone the repository:

git clone https://gitlab.com/avisprof/stroke-prediction.git
cd stroke-prediction

  2. Build the Docker image:

sudo docker build -t stroke-prediction .

  3. Run the container:

sudo docker run -it --rm -p 9696:9696 -p 8501:8501 stroke-prediction

Option 2: Use Pre-built Image from Docker Hub

For quick deployment without building, use the pre-built image:

sudo docker run -it --rm -p 9696:9696 -p 8501:8501 interests/stroke-prediction

Access the Application

After running the container, access the services through:

  • Web Interface

Open http://localhost:8501 in your browser

To enhance accessibility and provide a user-friendly interface for model interaction, an interactive web application was built using Streamlit. This interface allows users to easily experiment with different input parameters (such as age, BMI, smoking status, etc.) using intuitive sliders and dropdown menus, and immediately see the model's stroke risk prediction without any technical knowledge required.

[Image: streamlit_query]

[Image: streamlit_answer]

  • API Endpoint

Send POST requests to http://localhost:9696/predict

You can use predict-test.ipynb to send a POST request to the prediction endpoint with patient data.
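For a script-based alternative to the notebook, a request can be built with the Python standard library. This is a sketch under assumptions: the JSON field names are assumed to mirror the dataset columns, the patient values are illustrative, and the response format is not documented here, so check http://localhost:9696/docs for the actual schema. `build_request` and `send` are hypothetical helpers.

```python
import json
import urllib.request

URL = "http://localhost:9696/predict"  # endpoint given in this README


def build_request(patient, url=URL):
    """Build (but do not send) a JSON POST request for one patient record."""
    body = json.dumps(patient).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def send(request, timeout=10):
    """Send the request and decode the JSON reply; needs the container running."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Field names are ASSUMED to match the dataset columns; values are illustrative.
patient = {
    "gender": "Female", "age": 61, "hypertension": 0, "heart_disease": 0,
    "ever_married": "Yes", "work_type": "Self-employed", "residence_type": "Rural",
    "avg_glucose_level": 202.2, "bmi": 29.9, "smoking_status": "never smoked",
}

request = build_request(patient)
print(request.get_method(), request.full_url)
# With the container running locally: print(send(request))
```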

  • Swagger UI interface

For developers and API testing, the service also exposes the interactive documentation that FastAPI generates automatically. Once the container is running, you can open it at http://localhost:9696/docs. This interface lets you test the prediction endpoint directly, send sample requests with properly formatted JSON data, and see the API's responses in real time.

[Image: swagger]

Live Demo

The service has been successfully deployed to a cloud environment and is publicly accessible for testing.

You can experience the live application and interact with the stroke prediction model by visiting the following URL: http://79.137.198.248:8501/.

This hosted instance allows you to test the model's functionality without any local setup.

Architecture Note

While running both FastAPI and Streamlit services within a single Docker container is not considered a best practice in production environments, this implementation choice was made deliberately for simplicity and ease of deployment.

A more robust architecture would typically use docker-compose to orchestrate separate, dedicated containers for each service.

However, for the purposes of this demonstration and to minimize setup complexity for end-users, the combined container approach provides a straightforward and functional solution.
