A comprehensive project to predict and analyze diabetes health data using advanced machine learning models, including Logistic Regression, Random Forest, and XGBoost. 📊🔍

Diabetes Health Prediction and Analysis 🎉

Diabetes Health Prediction


Welcome to the Diabetes Health Prediction and Analysis project! This repository contains a comprehensive pipeline for predicting diabetes diagnosis using various machine learning and deep learning models, along with an in-depth exploratory data analysis and feature engineering steps.

🚀 Project Overview

This project aims to provide a thorough analysis of diabetes-related health data, develop predictive models, and evaluate their performance. The key components of the project include:

  • 📊 Data Preprocessing
  • 🔍 Exploratory Data Analysis (EDA)
  • 🛠️ Feature Engineering
  • 🧠 Model Training
  • 📈 Model Evaluation
  • 📑 Comprehensive Reports

📂 Project Structure

Here's an overview of the project directory structure:

Diabetes_Health_Prediction_and_Analysis/
├── data/
│   ├── raw/
│   │   └── diabetes_data.csv
│   └── processed/
│       ├── X_train.csv
│       ├── X_train_engineered.csv
│       ├── X_test.csv
│       ├── X_test_engineered.csv
│       ├── y_train.csv
│       └── y_test.csv
├── app/
│   ├── app.py
│   ├── templates/
│   │   └── index.html
│   └── static/
│       └── styles.css
├── models/
│   ├── logistic_regression.pkl
│   ├── random_forest.pkl
│   └── xgboost.pkl
├── notebooks/
│   └── exploratory_data_analysis.ipynb
├── scripts/
│   ├── plots/
│   ├── reports/
│   ├── data_preprocessing.py
│   ├── feature_engineering.py
│   ├── model_training.py
│   ├── model_evaluation.py
│   └── model_performance_report.py
├── tests/
│   ├── models/
│   ├── test_data_preprocessing.py
│   ├── test_feature_engineering.py
│   └── test_model_training.py
├── requirements.txt
└── README.md

🔧 Setup and Installation

To get started with this project, follow the steps below:

  1. Clone the repository:

    git clone https://github.com/ThecoderPinar/Diabetes_Health_Prediction_and_Analysis.git
    cd Diabetes_Health_Prediction_and_Analysis
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  3. Install the required packages:

    pip install -r requirements.txt
  4. Run the data preprocessing script:

    python scripts/data_preprocessing.py
  5. Run the feature engineering script:

    python scripts/feature_engineering.py
  6. Train the models:

    python scripts/model_training.py
  7. Evaluate the models:

    python scripts/model_evaluation.py
  8. Generate comprehensive model performance reports:

    python scripts/model_performance_report.py

🚀 Usage

  • Exploratory Data Analysis: Check the notebooks/exploratory_data_analysis.ipynb notebook for detailed data analysis and visualizations.
  • Scripts: All scripts for data preprocessing, feature engineering, model training, and evaluation are located in the scripts/ directory.
  • Tests: To ensure code quality and correctness, tests are included in the tests/ directory. Run them with pytest.

📊 Models

The following models are trained and evaluated in this project:


Logistic Regression

ROC Curve:

Logistic Regression ROC Curve

The ROC curve illustrates the true positive rate (sensitivity) versus the false positive rate (1-specificity) for different threshold settings. A higher area under the curve (AUC) indicates better model performance.

Confusion Matrix:

Logistic Regression Confusion Matrix

The confusion matrix provides a summary of the prediction results on the classification problem. It shows the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions.
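As an illustration of how such a curve and matrix are produced, here is a minimal scikit-learn sketch. It uses synthetic data as a stand-in for the project's actual train/test split, so the numbers it prints are not the ones reported below.

```python
# Sketch: computing ROC-curve points, AUC, and a confusion matrix with
# scikit-learn. Synthetic data stands in for the project's real test split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # scores for the positive class

fpr, tpr, thresholds = roc_curve(y_test, proba)  # points of the ROC curve
auc = roc_auc_score(y_test, proba)               # area under that curve
cm = confusion_matrix(y_test, model.predict(X_test))  # rows: true, cols: predicted

print(f"AUC = {auc:.3f}")
print(cm)
```

Plotting `fpr` against `tpr` (e.g. with matplotlib) yields the curve shown in the figure above.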


Random Forest

ROC Curve:

Random Forest ROC Curve

The ROC curve illustrates the true positive rate (sensitivity) versus the false positive rate (1-specificity) for different threshold settings. A higher area under the curve (AUC) indicates better model performance.

Confusion Matrix:

Random Forest Confusion Matrix

The confusion matrix provides a summary of the prediction results on the classification problem. It shows the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions.

🎯 Performance Metrics

The performance of the models is evaluated using the following metrics:

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • ROC AUC Score
  • Confusion Matrix
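
All of these metrics are available in scikit-learn; the sketch below shows how the suite can be computed. The labels and scores here are made up for illustration — the project's own evaluation lives in scripts/model_evaluation.py.

```python
# Sketch: computing the metric suite listed above with scikit-learn.
# y_true / y_pred / y_score are illustrative stand-ins for real model output.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # actual labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                     # hard predictions
y_score = [0.1, 0.6, 0.9, 0.8, 0.4, 0.2, 0.7, 0.3]    # positive-class scores

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),  # needs scores, not hard labels
}
print(metrics)
print(confusion_matrix(y_true, y_pred))
```

Note that ROC AUC is computed from the predicted scores (probabilities), while the other metrics use the thresholded class predictions.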

Logistic Regression

  • Accuracy: 78.99%
  • Precision: 73.19%
  • Recall: 70.63%
  • F1 Score: 71.89%
  • ROC AUC: 83.86%

Confusion Matrix:

[[196  37]
 [ 42 101]]

Model file:

models/logistic_regression.pkl
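
The .pkl extension suggests the models were serialized with Python's pickle (or joblib, which writes pickle-compatible files). Assuming that, a saved model can be reloaded and used for prediction along these lines; the demo trains and round-trips its own throwaway model so it is self-contained, but in the repo you would open models/logistic_regression.pkl instead.

```python
# Sketch: saving and reloading a model as a .pkl file, assuming plain pickle.
# In the repo, the path would be e.g. "models/logistic_regression.pkl", and the
# input columns must match what feature_engineering.py produced for training.
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with open("model_demo.pkl", "wb") as f:   # stand-in for models/*.pkl
    pickle.dump(model, f)

with open("model_demo.pkl", "rb") as f:
    loaded = pickle.load(f)

preds = loaded.predict(X[:5])  # the reloaded model predicts like the original
print(preds)
```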

Random Forest

  • Accuracy: 91.22%
  • Precision: 94.35%
  • Recall: 81.82%
  • F1 Score: 87.64%
  • ROC AUC: 97.69%

Confusion Matrix:

[[226   7]
 [ 26 117]]

Model file:

models/random_forest.pkl

Explanations:
  1. Accuracy: The ratio of correctly predicted instances to the total instances.
  2. Precision: The ratio of true positive predictions to the total predicted positives. It measures the accuracy of positive predictions.
  3. Recall: The ratio of true positive predictions to the actual positives. It measures the model's ability to identify positive instances.
  4. F1 Score: The harmonic mean of precision and recall. It provides a balance between precision and recall.
  5. ROC AUC: The area under the ROC curve. It summarizes the model's ability to distinguish between classes.

Confusion Matrix:

  • True Positive (TP): 117 - The number of actual positive cases correctly identified by the model.
  • True Negative (TN): 226 - The number of actual negative cases correctly identified by the model.
  • False Positive (FP): 7 - The number of actual negative cases incorrectly identified as positive by the model.
  • False Negative (FN): 26 - The number of actual positive cases incorrectly identified as negative by the model.
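
The Random Forest scores above follow directly from these four counts; only ROC AUC is different, since it needs the underlying prediction scores rather than the matrix alone. A quick sketch of the arithmetic:

```python
# Sketch: deriving the Random Forest metrics from its confusion-matrix counts.
TP, TN, FP, FN = 117, 226, 7, 26

accuracy = (TP + TN) / (TP + TN + FP + FN)  # 343 / 376
precision = TP / (TP + FP)                  # 117 / 124
recall = TP / (TP + FN)                     # 117 / 143
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9122 precision=0.9435 recall=0.8182 f1=0.8764
```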

XGBoost

  • Accuracy: 91.76%
  • Precision: 93.08%
  • Recall: 84.62%
  • F1 Score: 88.64%
  • ROC AUC: 98.41%

Confusion Matrix:

[[224   9]
 [ 22 121]]

Model file:

models/xgboost.pkl

Confusion Matrix:

  • True Positive (TP): 121 - The number of actual positive cases correctly identified by the model.
  • True Negative (TN): 224 - The number of actual negative cases correctly identified by the model.
  • False Positive (FP): 9 - The number of actual negative cases incorrectly identified as positive by the model.
  • False Negative (FN): 22 - The number of actual positive cases incorrectly identified as negative by the model.

📈 Results

Model performance reports and evaluation metrics are saved and displayed in the output of the scripts/model_performance_report.py script.

💡 Future Work

  • Implement more advanced deep learning models (e.g., Neural Networks, LSTM).
  • Perform hyperparameter tuning to optimize model performance.
  • Explore feature selection techniques to improve model accuracy.
  • Integrate additional health datasets for broader analysis.

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Whether it's improving the documentation, adding new features, or fixing bugs, your contributions are highly appreciated. Let's make this project better together! 🚀

How to Contribute:

  1. Fork the Repository: Click on the 'Fork' button at the top right corner of this page to create a copy of this repository in your GitHub account.

  2. Clone the Forked Repository:

    git clone https://github.com/your-username/Diabetes_Health_Prediction_and_Analysis.git
  3. Create a New Branch:

    git checkout -b feature/your-feature-name
  4. Make Your Changes: Implement your feature, bug fix, or improvement.

  5. Commit Your Changes:

    git commit -m "Add your commit message here"
  6. Push to Your Forked Repository:

    git push origin feature/your-feature-name
  7. Open a Pull Request: Go to the original repository on GitHub and click on the 'New Pull Request' button. Compare changes from your forked repository and submit the pull request.


Thank you for your contributions! Together, we can build a more robust and efficient Diabetes Health Prediction and Analysis tool. 🌟

📄 License

This project is licensed under the MIT License.

📬 Contact

If you have any questions or suggestions, feel free to open an issue or contact me directly. I am always open to feedback and would love to hear from you!




Thank you for your interest in the Diabetes Health Prediction and Analysis project! Your feedback and suggestions are invaluable in making this project better and more useful for everyone. 🌟




โญ๏ธ Don't forget to give this project a star if you found it useful! โญ๏ธ