## **Project Title: Diabetes Progression Prediction**  

### **Introduction**  
Diabetes is a chronic disease that affects many people around the world. If not detected early, it can cause severe health complications. In this project, we build a machine learning model to predict diabetes risk using the [Early Stage Diabetes Risk Prediction Dataset](https://www.kaggle.com/datasets/ishandutta/early-stage-diabetes-risk-prediction-dataset). By analyzing features such as age, gender, and symptoms like polyuria, polydipsia, and sudden weight loss, along with the target variable (Class: Positive/Negative), we can estimate whether a person is at risk of diabetes.

### **Overview**  
This project involves an end-to-end machine learning pipeline, including Exploratory Data Analysis (EDA), feature engineering, model training, and evaluation. We will compare various machine learning techniques such as Logistic Regression, Neural Networks, Decision Trees, and Random Forests to determine the best-performing model. 

Short descriptions of each column in the dataset:

1. **Age** – Patient's age (20 to 65 years).  
2. **Gender** – Male or Female.  
3. **Polyuria** – Excessive urination (Yes/No).  
4. **Polydipsia** – Excessive thirst (Yes/No).  
5. **Sudden Weight Loss** – Rapid weight loss (Yes/No).  
6. **Weakness** – Feeling of fatigue (Yes/No).  
7. **Polyphagia** – Excessive hunger (Yes/No).  
8. **Genital Thrush** – Fungal infection (Yes/No).  
9. **Visual Blurring** – Blurred vision (Yes/No).  
10. **Itching** – Skin irritation (Yes/No).  
11. **Irritability** – Frequent mood swings (Yes/No).  
12. **Delayed Healing** – Slow wound healing (Yes/No).  
13. **Partial Paresis** – Muscle weakness (Yes/No).  
14. **Muscle Stiffness** – Stiff or tight muscles (Yes/No).  
15. **Alopecia** – Hair loss (Yes/No).  
16. **Obesity** – Overweight condition (Yes/No).  
17. **Class** – Target variable (Positive: Diabetic, Negative: Non-Diabetic).  
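
Every column except Age is a two-valued category, so each one needs a numeric encoding before most models can use it. The following is a minimal sketch of one way to do that, assuming the raw CSV column names and using two rows copied from the dataset preview. The `BINARY_MAP` dictionary and the `encode_binary_columns` helper are hypothetical names introduced for illustration, not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the dataset preview, trimmed to a few columns for brevity.
sample = pd.DataFrame(
    {
        "Age": [40, 58],
        "Gender": ["Male", "Male"],
        "Polyuria": ["No", "No"],
        "Polydipsia": ["Yes", "No"],
        "class": ["Positive", "Positive"],
    }
)

# Assumed mapping: 1 means "symptom present", "Male", or "Positive".
BINARY_MAP = {"Yes": 1, "No": 0, "Male": 1, "Female": 0, "Positive": 1, "Negative": 0}


def encode_binary_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every non-numeric column mapped to 0/1 (hypothetical helper)."""
    encoded = frame.copy()
    for column in encoded.select_dtypes(exclude="number").columns:
        encoded[column] = encoded[column].map(BINARY_MAP)
    return encoded


encoded_sample = encode_binary_columns(sample)
print(encoded_sample)
```

An explicit dictionary keeps the meaning of 0 and 1 visible, whereas `LabelEncoder` assigns codes alphabetically. Any value missing from the mapping becomes `NaN`, which makes unexpected categories easy to spot.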

### **Objectives**  
This project aims to build a predictive model for diabetes progression using the Early Stage Diabetes Risk Prediction Dataset. We will explore different machine learning algorithms, evaluate their performance, and identify the most effective model for accurate predictions.
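
The compare-and-select idea can be sketched as a small loop over candidate classifiers. This is only an illustration under stated assumptions: the data comes from scikit-learn's `make_classification` as a synthetic stand-in (not the diabetes dataset), and the 80/20 split, the hyperparameters, and the choice of F1 as the selection metric are assumptions rather than the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the diabetes dataset): 520 rows, 16 features.
X, y = make_classification(n_samples=520, n_features=16, random_state=42)

# Hold out 20% of the rows for evaluation, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model on the training part and score it on the held-out part.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions),
    }

# Pick the model with the highest F1 score on the held-out data.
best_model = max(scores, key=lambda name: scores[name]["f1"])
for name, metrics in scores.items():
    print(f"{name}: accuracy={metrics['accuracy']:.3f}, f1={metrics['f1']:.3f}")
print("Best by F1:", best_model)
```

Scoring every model on the same held-out split keeps the comparison fair. A single split can still be noisy on a dataset of this size, so cross-validation may give a steadier ranking.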

### **Step 1: Import Necessary Libraries and Load the Dataset**

In [4]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, f1_score, classification_report, precision_score, confusion_matrix, recall_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.svm import SVC
import joblib

import warnings
warnings.filterwarnings("ignore")

# Load dataset
df = pd.read_csv("diabetes.csv")

# Display first few rows
df.head()

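| | Age | Gender | Polyuria | Polydipsia | sudden weight loss | weakness | Polyphagia | Genital thrush | visual blurring | Itching | Irritability | delayed healing | partial paresis | muscle stiffness | Alopecia | Obesity | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | Male | No | Yes | No | Yes | No | No | No | Yes | No | Yes | No | Yes | Yes | Yes | Positive |
| 1 | 58 | Male | No | No | No | Yes | No | No | Yes | No | No | No | Yes | No | Yes | No | Positive |
| 2 | 41 | Male | Yes | No | No | Yes | Yes | No | No | Yes | No | Yes | No | Yes | Yes | No | Positive |
| 3 | 45 | Male | No | No | Yes | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No | Positive |
| 4 | 60 | Male | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Positive |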


### **Step 2: Basic Data Exploration**

In [5]:
# Check dataset shape
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")

Dataset contains 520 rows and 17 columns.


In [6]:
df.shape

(520, 17)
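
Beyond the shape, a few more checks are common at this stage: missing values, duplicate rows, and the balance of the target classes. The sketch below bundles them into one function. The `summarize` helper is a hypothetical name, and the small `preview` frame holds only the five rows shown in the `df.head()` output, trimmed to three columns, so its class counts say nothing about the full dataset.

```python
import pandas as pd

# Five rows copied from the df.head() preview, trimmed to three columns.
preview = pd.DataFrame(
    {
        "Age": [40, 58, 41, 45, 60],
        "Polyuria": ["No", "No", "Yes", "No", "Yes"],
        "class": ["Positive"] * 5,
    }
)


def summarize(frame: pd.DataFrame, target: str = "class") -> dict:
    """Collect basic data-quality checks for a DataFrame (hypothetical helper)."""
    return {
        "shape": frame.shape,
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "class_counts": frame[target].value_counts().to_dict(),
    }


summary = summarize(preview)
print(summary)
```

Calling `summarize(df)` on the full DataFrame would report the same checks for all 520 rows.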