A comprehensive project to predict and analyze diabetes health data using advanced machine learning models, including Logistic Regression, Random Forest, and XGBoost. 📊🔍

Diabetes Health Prediction and Analysis 🎉

Diabetes Health Prediction


Welcome to the Diabetes Health Prediction and Analysis project! This repository contains a comprehensive pipeline for predicting diabetes diagnosis using various machine learning and deep learning models, along with an in-depth exploratory data analysis and feature engineering steps.

🚀 Project Overview

This project aims to provide a thorough analysis of diabetes-related health data, develop predictive models, and evaluate their performance. The key components of the project include:

  • 📊 Data Preprocessing
  • 🔍 Exploratory Data Analysis (EDA)
  • 🛠️ Feature Engineering
  • 🧠 Model Training
  • 📈 Model Evaluation
  • 📑 Comprehensive Reports

📂 Project Structure

Here's an overview of the project directory structure:

Diabetes_Health_Prediction_and_Analysis/
├── data/
│   ├── raw/
│   │   └── diabetes_data.csv
│   └── processed/
│       ├── X_train.csv
│       ├── X_train_engineered.csv
│       ├── X_test.csv
│       ├── X_test_engineered.csv
│       ├── y_train.csv
│       └── y_test.csv
├── app/
│   ├── app.py
│   ├── templates/
│   │   └── index.html
│   └── static/
│       └── styles.css
├── models/
│   ├── logistic_regression.pkl
│   ├── random_forest.pkl
│   └── xgboost.pkl
├── notebooks/
│   └── exploratory_data_analysis.ipynb
├── scripts/
│   ├── plots/
│   ├── reports/
│   ├── data_preprocessing.py
│   ├── feature_engineering.py
│   ├── model_training.py
│   ├── model_evaluation.py
│   └── model_performance_report.py
├── tests/
│   ├── models/
│   ├── test_data_preprocessing.py
│   ├── test_feature_engineering.py
│   └── test_model_training.py
├── requirements.txt
└── README.md

🔧 Setup and Installation

To get started with this project, follow the steps below:

  1. Clone the repository:

    git clone https://github.com/ThecoderPinar/Diabetes_Health_Prediction_and_Analysis.git
    cd Diabetes_Health_Prediction_and_Analysis
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  3. Install the required packages:

    pip install -r requirements.txt
  4. Run the data preprocessing script:

    python scripts/data_preprocessing.py
  5. Run the feature engineering script:

    python scripts/feature_engineering.py
  6. Train the models:

    python scripts/model_training.py
  7. Evaluate the models:

    python scripts/model_evaluation.py
  8. Generate comprehensive model performance reports:

    python scripts/model_performance_report.py

🚀 Usage

  • Exploratory Data Analysis: Check the notebooks/exploratory_data_analysis.ipynb notebook for detailed data analysis and visualizations.
  • Scripts: All scripts for data preprocessing, feature engineering, model training, and evaluation are located in the scripts/ directory.
  • Tests: To ensure code quality and correctness, tests are included in the tests/ directory. Run them with pytest.

📊 Models

The following models are trained and evaluated in this project:


Logistic Regression

ROC Curve:

Logistic Regression ROC Curve

The ROC curve illustrates the true positive rate (sensitivity) versus the false positive rate (1-specificity) for different threshold settings. A higher area under the curve (AUC) indicates better model performance.

Confusion Matrix:

Logistic Regression Confusion Matrix

The confusion matrix provides a summary of the prediction results on the classification problem. It shows the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions.
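As an illustration of how such a curve and matrix are produced, here is a minimal scikit-learn sketch. It uses synthetic data as a stand-in for the project's actual train/test split, so the numbers it prints are not the ones reported below.

```python
# Sketch: computing ROC-curve points, AUC, and a confusion matrix with
# scikit-learn. Synthetic data stands in for the project's real test split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # scores for the positive class

fpr, tpr, thresholds = roc_curve(y_test, proba)  # points of the ROC curve
auc = roc_auc_score(y_test, proba)               # area under that curve
cm = confusion_matrix(y_test, model.predict(X_test))  # rows: true, cols: predicted

print(f"AUC = {auc:.3f}")
print(cm)
```

Plotting `fpr` against `tpr` (e.g. with matplotlib) yields the curve shown in the figure above.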


Random Forest

ROC Curve:

Random Forest ROC Curve

The ROC curve illustrates the true positive rate (sensitivity) versus the false positive rate (1-specificity) for different threshold settings. A higher area under the curve (AUC) indicates better model performance.

Confusion Matrix:

Random Forest Confusion Matrix

The confusion matrix provides a summary of the prediction results on the classification problem. It shows the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions.

🎯 Performance Metrics

The performance of the models is evaluated using the following metrics:

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • ROC AUC Score
  • Confusion Matrix
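
All of these metrics are available in scikit-learn; the sketch below shows how the suite can be computed. The labels and scores here are made up for illustration — the project's own evaluation lives in scripts/model_evaluation.py.

```python
# Sketch: computing the metric suite listed above with scikit-learn.
# y_true / y_pred / y_score are illustrative stand-ins for real model output.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # actual labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                     # hard predictions
y_score = [0.1, 0.6, 0.9, 0.8, 0.4, 0.2, 0.7, 0.3]    # positive-class scores

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),  # needs scores, not hard labels
}
print(metrics)
print(confusion_matrix(y_true, y_pred))
```

Note that ROC AUC is computed from the predicted scores (probabilities), while the other metrics use the thresholded class predictions.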

Logistic Regression

  • Accuracy: 78.99%
  • Precision: 73.19%
  • Recall: 70.63%
  • F1 Score: 71.89%
  • ROC AUC: 83.86%

Confusion Matrix:

[[196  37]
 [ 42 101]]

Model file:

models/logistic_regression.pkl
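
The .pkl extension suggests the models were serialized with Python's pickle (or joblib, which writes pickle-compatible files). Assuming that, a saved model can be reloaded and used for prediction along these lines; the demo trains and round-trips its own throwaway model so it is self-contained, but in the repo you would open models/logistic_regression.pkl instead.

```python
# Sketch: saving and reloading a model as a .pkl file, assuming plain pickle.
# In the repo, the path would be e.g. "models/logistic_regression.pkl", and the
# input columns must match what feature_engineering.py produced for training.
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with open("model_demo.pkl", "wb") as f:   # stand-in for models/*.pkl
    pickle.dump(model, f)

with open("model_demo.pkl", "rb") as f:
    loaded = pickle.load(f)

preds = loaded.predict(X[:5])  # the reloaded model predicts like the original
print(preds)
```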

Random Forest

  • Accuracy: 91.22%
  • Precision: 94.35%
  • Recall: 81.82%
  • F1 Score: 87.64%
  • ROC AUC: 97.69%

Confusion Matrix:

[[226   7]
 [ 26 117]]

Model file:

models/random_forest.pkl

Explanations:
  1. Accuracy: The ratio of correctly predicted instances to the total instances.
  2. Precision: The ratio of true positive predictions to the total predicted positives. It measures the accuracy of positive predictions.
  3. Recall: The ratio of true positive predictions to the actual positives. It measures the model's ability to identify positive instances.
  4. F1 Score: The harmonic mean of precision and recall. It provides a balance between precision and recall.
  5. ROC AUC: The area under the ROC curve. It summarizes the model's ability to distinguish between classes.

Confusion Matrix:

  • True Positive (TP): 117 - The number of actual positive cases correctly identified by the model.
  • True Negative (TN): 226 - The number of actual negative cases correctly identified by the model.
  • False Positive (FP): 7 - The number of actual negative cases incorrectly identified as positive by the model.
  • False Negative (FN): 26 - The number of actual positive cases incorrectly identified as negative by the model.
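
The Random Forest scores above follow directly from these four counts; only ROC AUC is different, since it needs the underlying prediction scores rather than the matrix alone. A quick sketch of the arithmetic:

```python
# Sketch: deriving the Random Forest metrics from its confusion-matrix counts.
TP, TN, FP, FN = 117, 226, 7, 26

accuracy = (TP + TN) / (TP + TN + FP + FN)  # 343 / 376
precision = TP / (TP + FP)                  # 117 / 124
recall = TP / (TP + FN)                     # 117 / 143
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9122 precision=0.9435 recall=0.8182 f1=0.8764
```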

XGBoost

  • Accuracy: 91.76%
  • Precision: 93.08%
  • Recall: 84.62%
  • F1 Score: 88.64%
  • ROC AUC: 98.41%

Confusion Matrix:

[[224   9]
 [ 22 121]]

Model file:

models/xgboost.pkl

Confusion Matrix:

  • True Positive (TP): 121 - The number of actual positive cases correctly identified by the model.
  • True Negative (TN): 224 - The number of actual negative cases correctly identified by the model.
  • False Positive (FP): 9 - The number of actual negative cases incorrectly identified as positive by the model.
  • False Negative (FN): 22 - The number of actual positive cases incorrectly identified as negative by the model.

📈 Results

Model performance reports and evaluation metrics are saved and displayed in the output of the scripts/model_performance_report.py script.

💡 Future Work

  • Implement more advanced deep learning models (e.g., Neural Networks, LSTM).
  • Perform hyperparameter tuning to optimize model performance.
  • Explore feature selection techniques to improve model accuracy.
  • Integrate additional health datasets for broader analysis.

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Whether it's improving the documentation, adding new features, or fixing bugs, your contributions are highly appreciated. Let's make this project better together! 🚀

How to Contribute:

  1. Fork the Repository: Click on the 'Fork' button at the top right corner of this page to create a copy of this repository in your GitHub account.

  2. Clone the Forked Repository:

    git clone https://github.com/your-username/Diabetes_Health_Prediction_and_Analysis.git
  3. Create a New Branch:

    git checkout -b feature/your-feature-name
  4. Make Your Changes: Implement your feature, bug fix, or improvement.

  5. Commit Your Changes:

    git commit -m "Add your commit message here"
  6. Push to Your Forked Repository:

    git push origin feature/your-feature-name
  7. Open a Pull Request: Go to the original repository on GitHub and click on the 'New Pull Request' button. Compare changes from your forked repository and submit the pull request.


Thank you for your contributions! Together, we can build a more robust and efficient Diabetes Health Prediction and Analysis tool. 🌟

📄 License

This project is licensed under the MIT License.

📬 Contact

If you have any questions or suggestions, feel free to open an issue or contact me directly. I am always open to feedback and would love to hear from you!




Thank you for your interest in the Diabetes Health Prediction and Analysis project! Your feedback and suggestions are invaluable in making this project better and more useful for everyone. 🌟




โญ๏ธ Don't forget to give this project a star if you found it useful! โญ๏ธ