This project focuses on building a machine learning model to predict the likelihood of a patient having a stroke. According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. Early and accurate prediction can be crucial for preventive healthcare and improving patient outcomes.
In this project, I developed a classification model based on patient input parameters such as gender, age, existing conditions (hypertension, heart disease), and smoking status. The goal is to provide a reliable tool for identifying high-risk individuals.
The dataset contains relevant medical and demographic information about patients. Each row represents a unique patient record.
- id: Unique identifier
- gender: Patient's gender ("Male", "Female")
- age: Patient's age
- hypertension: Boolean for hypertension (0 = No, 1 = Yes)
- heart_disease: Boolean for heart disease (0 = No, 1 = Yes)
- ever_married: Marital status ("No", "Yes")
- work_type: Type of employment ("children", "Govt_job", "Never_worked", "Private", "Self-employed")
- residence_type: Area of residence ("Rural", "Urban")
- avg_glucose_level: Average glucose level in blood
- bmi: Body Mass Index
- smoking_status: Smoking status ("formerly smoked", "never smoked", "smokes", "Unknown")
- stroke: Target variable (0 = No Stroke, 1 = Stroke)
The initial phase involved exploratory data analysis (EDA) in the notebook.ipynb file to understand the data's structure, distribution, and relationships. This included:
- Handling missing values (e.g., in the bmi column).
- Analyzing the distribution of numerical and categorical variables.
- Creating various visualizations (histograms, count plots) to uncover patterns.
- Identifying a significant class imbalance in the target variable (stroke), where the number of positive cases (stroke) was much lower than negative cases.
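The missing-value and class-imbalance checks listed above can be sketched with pandas. This is a minimal illustration on a tiny made-up sample, not the project's notebook code: the rows are invented, and filling bmi with the median is only one possible choice that I am assuming here.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that mimics a few of the dataset's columns
# (illustrative values only, not real patient records).
df = pd.DataFrame({
    "age": [67, 45, 80, 23, 59, 34],
    "bmi": [36.6, np.nan, 32.5, 22.1, np.nan, 27.8],
    "smoking_status": ["formerly smoked", "never smoked", "smokes",
                       "Unknown", "never smoked", "never smoked"],
    "stroke": [1, 0, 0, 0, 0, 0],
})

# Count missing values per column; only bmi has gaps in this sample.
missing_counts = df.isna().sum()

# Assumed strategy: fill missing bmi values with the column median.
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# Share of each class in the target, which exposes the imbalance.
class_share = df["stroke"].value_counts(normalize=True)

print(missing_counts)
print(class_share)
```

On the real data, the same two calls (`isna().sum()` and `value_counts(normalize=True)`) give a quick overview before any plotting.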
To prepare the data for modeling, the following steps were taken:
- Feature Encoding: Categorical variables (e.g., gender, ever_married, work_type, smoking_status) were converted into numerical format using Label Encoding or One-Hot Encoding.
- Handling Class Imbalance: The Synthetic Minority Over-sampling Technique (SMOTE) was applied to address the severe class imbalance. SMOTE generates synthetic samples from the minority class (stroke patients) to create a balanced dataset, which helps prevent the model from being biased towards the majority class and improves its ability to recognize the positive cases.
- Data Splitting: The dataset was split into training and testing sets to ensure unbiased evaluation.
- Feature Scaling: Numerical features were standardized to have a mean of 0 and a standard deviation of 1, which helps scale-sensitive models such as regularized Logistic Regression converge faster and weight features comparably.
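The splitting, scaling, and oversampling steps above can be sketched as follows. Everything here is an assumption for illustration: the data is synthetic, and `smote_like_oversample` is a hypothetical, simplified stand-in for a library SMOTE implementation that only shows the core idea of interpolating between minority-class neighbours. The sketch splits first and oversamples only the training portion, a common way to keep synthetic points out of the test set; the notebook may order the steps differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Synthetic stand-in data: 200 rows, 2 numeric features, 10% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = np.zeros(200, dtype=int)
y[:20] = 1

# Stratified split keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Oversample the minority class until both classes have equal counts.
minority = X_train_s[y_train == 1]
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
synthetic = smote_like_oversample(minority, n_new)

X_balanced = np.vstack([X_train_s, synthetic])
y_balanced = np.concatenate([y_train, np.ones(n_new, dtype=int)])
print(np.bincount(y_balanced))
```

Because each synthetic row lies on a line segment between two real minority samples, the new points stay inside the region the minority class already occupies.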
Several machine learning classifiers were implemented and compared:
- LogisticRegression
- DecisionTreeClassifier
- RandomForestClassifier
- XGBClassifier
The models were evaluated using metrics suitable for imbalanced datasets, including:
- Precision, Recall, and F1-Score
- Accuracy
- ROC-AUC Score
- Confusion Matrix
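As a rough sketch of how these metrics can be computed with scikit-learn, the example below trains a Logistic Regression on synthetic, imbalanced data. The data and the coefficients used to generate it are made up, and `class_weight="balanced"` is used here as a simple stand-in for the SMOTE step, so the numbers it prints say nothing about the project's actual results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic data with a rare positive class whose probability
# rises with the first feature (assumed setup, not the stroke data).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
logits = -3.0 + 1.5 * X[:, 0]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes instead of resampling them.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                # hard 0/1 labels
y_score = model.predict_proba(X_test)[:, 1]   # probability of class 1

cm = confusion_matrix(y_test, y_pred)         # rows: true, columns: predicted
auc = roc_auc_score(y_test, y_score)          # needs scores, not labels

print(cm)
print(f"ROC-AUC: {auc:.3f}")
print(classification_report(y_test, y_pred, digits=3, zero_division=0))
```

The classification report lists precision, recall, and F1 per class, which is more informative than accuracy alone when positives are rare.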
After an initial evaluation, hyperparameter tuning was performed to optimize the performance of the promising models.
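One way such tuning might look is a cross-validated grid search scored on recall. This is a sketch under assumptions: the data is synthetic, the grid over `C` is hypothetical, and the notebook may have tuned different parameters or used a different search method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (assumed setup, not the stroke dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 3))
logits = -2.5 + 1.5 * X[:, 0]
y = (rng.random(600) < 1 / (1 + np.exp(-logits))).astype(int)

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid,
    scoring="recall",  # optimize recall for the positive class
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds keep the share of positive cases roughly equal across folds, which matters when the positive class is small.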
The Logistic Regression model was selected as the final model for this project. The key reasons for this choice are:
- Performance: After tuning and using SMOTE, it demonstrated a strong and robust performance on the key evaluation metrics, particularly recall for the stroke class.
- Interpretability: The model's coefficients are easily interpretable, allowing us to understand the influence of each feature on the prediction (e.g., "An increase in age is associated with an increase in the log-odds of having a stroke").
- Speed & Efficiency: Logistic Regression is computationally efficient for both training and prediction, making it suitable for potential deployment in real-time applications.
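To make the log-odds interpretation above concrete: a coefficient β on the log-odds scale corresponds to an odds ratio of exp(β) per unit increase in the feature. The coefficients below are hypothetical values chosen for illustration, not the fitted values from this project.

```python
import math

# Hypothetical coefficients on the log-odds scale (illustrative only,
# not taken from the project's fitted model).
coefficients = {"age": 0.07, "hypertension": 0.40}


def odds_ratio(beta, delta=1.0):
    """Multiplicative change in the odds when the feature rises by delta."""
    return math.exp(beta * delta)


for name, beta in coefficients.items():
    print(f"{name}: odds ratio per unit = {odds_ratio(beta):.3f}")

# Ten extra years of age multiply the odds by exp(0.07 * 10) = exp(0.7).
ten_year_ratio = odds_ratio(coefficients["age"], delta=10)
print(f"age, +10 years: odds ratio = {ten_year_ratio:.3f}")
```

If the features were standardized before fitting, one "unit" is one standard deviation of the original feature, so the ratios should be read on that scale.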
The final Logistic Regression model achieved a satisfactory balance between precision and recall for the positive class. The ROC-AUC score confirms the model's good capability to distinguish between patients who are likely to have a stroke and those who are not.
Key findings from the model coefficients align with known medical insights, such as age, hypertension, and average glucose level being significant risk factors.
The model has been deployed as a web service with two access points:
- Port 9696: REST API endpoint for programmatic predictions
- Port 8501: Streamlit web interface for interactive use
- Clone the repository:
git clone https://gitlab.com/avisprof/stroke-prediction.git
cd stroke-prediction
- Build the Docker image:
sudo docker build -t stroke-prediction .
- Run the container:
sudo docker run -it --rm -p 9696:9696 -p 8501:8501 stroke-prediction
For quick deployment without building, use the pre-built image:
sudo docker run -it --rm -p 9696:9696 -p 8501:8501 interests/stroke-prediction
After running the container, access the services through:
- Web Interface
Open http://localhost:8501 in your browser
To enhance accessibility and provide a user-friendly interface for model interaction, an interactive web application was built using Streamlit. This interface allows users to easily experiment with different input parameters (such as age, BMI, smoking status, etc.) using intuitive sliders and dropdown menus, and immediately see the model's stroke risk prediction without any technical knowledge required.
- API Endpoint
Send POST requests to http://localhost:9696/predict
You can use predict-test.ipynb to send a POST request to the prediction endpoint with patient data.
- Swagger UI interface
Additionally, for developers and API testing, the service includes the built-in interactive documentation provided by FastAPI. Once the container is running, you can access this documentation at http://localhost:9696/docs. This interface allows you to directly test the prediction endpoint, send sample requests with properly formatted JSON data, and see the API's responses in real time.
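For programmatic access, a POST request to the /predict endpoint mentioned above might look like the sketch below, which uses only the Python standard library. The payload field names mirror the dataset columns and are an assumption about what the service expects, and the response format is not specified here; the Swagger UI at http://localhost:9696/docs shows the actual schema. `build_request` and `predict` are hypothetical helper names.

```python
import json
import urllib.request

PREDICT_URL = "http://localhost:9696/predict"  # endpoint named in this README

# Example patient. Field names follow the dataset columns and are an
# assumption about the API's input schema; values are made up.
patient = {
    "gender": "Male",
    "age": 67,
    "hypertension": 0,
    "heart_disease": 1,
    "ever_married": "Yes",
    "work_type": "Private",
    "residence_type": "Urban",
    "avg_glucose_level": 228.69,
    "bmi": 36.6,
    "smoking_status": "formerly smoked",
}


def build_request(payload, url=PREDICT_URL):
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(payload, url=PREDICT_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(payload, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# With the container running locally, uncomment to call the service:
# print(predict(patient))
```

Building the request separately from sending it makes it easy to inspect the payload before the service is up.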
The service has been successfully deployed to a cloud environment and is publicly accessible for testing.
You can experience the live application and interact with the stroke prediction model by visiting the following URL: http://79.137.198.248:8501/.
This hosted instance allows you to test the model's functionality without any local setup.
While running both FastAPI and Streamlit services within a single Docker container is not considered a best practice in production environments, this implementation choice was made deliberately for simplicity and ease of deployment.
A more robust architecture would typically use docker-compose to orchestrate separate, dedicated containers for each service.
However, for the purposes of this demonstration and to minimize setup complexity for end-users, the combined container approach provides a straightforward and functional solution.


