### Import the important libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix , accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

### Read the dataset

In [None]:
df=pd.read_csv('/kaggle/input/diabetes/diabetes_prediction_dataset.csv')

### Show the first 5 rows

In [None]:
df.head()

<div style="border-radius:10px; border:#808080 solid; padding: 15px; background-color:#F0E68C ; font-size:100%; text-align:left">


## 📊 Dataset Features – Diabetes Prediction

Below is a detailed description of each feature in the dataset:

---

### 🧓 **1. Age**
- **Description**: Age of the patient in years, ranging from **0.08 to 80**.
- **Significance**: Risk of diabetes increases with age, particularly for Type 2 diabetes. The mean age is approximately **41.79 years**.

---

### 🚻 **2. Gender**
- **Description**: Biological sex of the individual — **Male**, **Female**, or **Other**.
- **Significance**: Gender can influence diabetes susceptibility due to differences in hormones, body composition, and lifestyle factors. The dataset includes **Male: 39,537** and **Female: 55,262**.

---

### ⚖️ **3. Body Mass Index (BMI)**
- **Description**: A measure of body fat based on height and weight, with values in the dataset ranging from **10.16 to 71.55**.
- **Categories**:
  - Underweight: < 18.5  
  - Normal weight: 18.5 – 24.9  
  - Overweight: 25 – 29.9  
  - Obese: ≥ 30  
- **Significance**: A higher BMI increases the likelihood of developing Type 2 diabetes. The most common value in the dataset is **27.32** (Overweight).

---

### ❤️ **4. Hypertension**
- **Description**: Indicates if a patient has high blood pressure — **0 = No**, **1 = Yes**.
- **Distribution**: **0: 87,482**, **1: 7,317**
- **Significance**: Hypertension often coexists with diabetes and is linked to insulin resistance.

---

### 💔 **5. Heart Disease**
- **Description**: Indicates if the patient has any form of heart disease — **0 = No**, **1 = Yes**.
- **Distribution**: **0: 90,907**, **1: 3,892**
- **Significance**: Heart disease is both a complication and a risk factor for diabetes.

---

### 🚬 **6. Smoking History**
- **Description**: Smoking status of the patient.
- **Categories & Counts**:
  - Never: 34,011  
  - No Info: 32,242  
  - Former: 9,176  
  - Current: 9,109  
  - Not Current: 6,299  
  - Ever: 3,962  
- **Significance**: Smoking is known to contribute to insulin resistance and other metabolic issues that increase diabetes risk.

---

### 🧪 **7. HbA1c Level**
- **Description**: Reflects the average blood sugar level over the past 2–3 months.
- **Significance**: An HbA1c level of **6.5% or more** typically indicates diabetes. Values in the dataset range from **3.5 to 9.0**, with common values being **6.6, 5.7, 6.5, and 6.0**.

---

### 💉 **8. Blood Glucose Level**
- **Description**: The current level of glucose in the blood.
- **Significance**: A crucial diagnostic marker. Normal fasting blood glucose is generally between **70 and 99 mg/dL**. Values above **250** are considered very high and potentially dangerous. The dataset contains values from **80 to 300**, with common values like **159, 130, 126, and 140**.

---

### 🎯 **9. Diabetes**
- **Description**: Target variable — **0 = No Diabetes**, **1 = Has Diabetes**
- **Significance**: This is the outcome we're trying to predict based on the above features using machine learning models.

---

### Show how many columns and rows

In [None]:
df.shape

### Check for missing values

In [None]:
df.isna().sum()

### Check for duplicated rows

In [None]:
df.duplicated().sum()

### Drop the duplicated rows

In [None]:
df.drop_duplicates(inplace=True)

### Check whether any duplicated rows remain

In [None]:
df.duplicated().sum()

### Show information about the data

In [None]:
df.info()

### Show summary statistics

In [None]:
df.describe()

### Handling Outliers

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[["age"]], palette="coolwarm")
plt.title("Age Checking for Outliers", fontsize=14)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[["bmi"]], palette="coolwarm")
plt.title("BMI Checking for Outliers", fontsize=14)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[["HbA1c_level"]], palette="coolwarm")
plt.title("HbA1c Level Checking for Outliers", fontsize=14)
plt.show()

In [None]:
# Compute Q1 (25%) and Q3 (75%)
Q1 = df["bmi"].quantile(0.25)
Q3 = df["bmi"].quantile(0.75)

# Compute IQR (Interquartile Range)
IQR = Q3 - Q1

# Define outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detect outliers
outliers = df[(df["bmi"] < lower_bound) | (df["bmi"] > upper_bound)]
print(f"Number of outliers in BMI: {len(outliers)}")
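
The IQR computation above is written for the `bmi` column only. As a minimal sketch, the same rule could be wrapped in a small helper and reused for other numeric columns. The name `iqr_bounds` and the sample values are assumptions made for illustration and are not taken from the dataset.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy values standing in for df["bmi"]; they are made up for this example.
bmi = pd.Series([18.0, 22.0, 24.0, 26.0, 27.0, 28.0, 30.0, 33.0, 60.0])

lower, upper = iqr_bounds(bmi)
inside = bmi.between(lower, upper)  # True for values within the fences
print(f"fences: ({lower:.2f}, {upper:.2f}), kept {inside.sum()} of {len(bmi)} values")
```

On the real data, `iqr_bounds(df["bmi"])` would give the same `lower_bound` and `upper_bound` as the cell above.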

In [None]:
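# Keep rows with a BMI strictly between 14 and 50. These are hand-picked
# cutoffs, not the lower_bound / upper_bound computed above.
# .copy() makes df an independent DataFrame so later assignments are safe.
df = df[(df['bmi'] > 14) & (df['bmi'] < 50)].copy()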

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[["bmi"]], palette="coolwarm")
plt.title("BMI Checking for Outliers", fontsize=14)
plt.show()

# **Univariate Analysis**

In [None]:
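# Fold the rare 'Other' category into 'Male'. Assign the result back instead of
# using inplace=True on a column selection, which may not modify df in recent pandas.
df['gender'] = df['gender'].replace('Other', 'Male')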

## Gender Distribution

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='gender', data=df)
plt.title("Gender Distribution")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.show()

## 📊 Gender Distribution in the Dataset

The **Gender** feature represents the gender of the individuals in the dataset. Below is the distribution of gender in the dataset.

### ✅ **Gender Value Distribution:**

| Gender | Count   |
|--------|---------|
| **Male**   | 39,537  |
| **Female** | 55,262  |

---

### 🧠 **Interpretation:**
- **39,537** individuals in the dataset are **male**, while **55,262** are **female**.
- The dataset contains a larger number of **female** individuals, making up a greater proportion of the dataset.

### ⚠️ **Health Implications:**
- The gender distribution might have an impact on health-related outcomes, as the prevalence of **diabetes** or other conditions might vary between men and women. It's important to explore whether gender correlates with other health metrics in the dataset, such as **BMI**, **hypertension**, or **smoking history**.


## Distribution of Age

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df['age'], bins=30, kde=True)
plt.title("Distribution of Age")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

## 📊 Age Distribution in the Dataset

The **Age** feature in the dataset represents the age of the individuals. Below are the statistical details for the age distribution.

### ✅ **Age Statistics:**

| Statistic  | Value        |
|------------|--------------|
| **Mean**   | 41.79        |
| **Standard Deviation (STD)** | 22.46       |
| **Minimum**| 0.08         |
| **Maximum**| 80           |
| **25th Percentile (25%)** | 24           |
| **50th Percentile (Median)** | 43         |
| **75th Percentile (75%)** | 59           |

### 🧠 **Interpretation:**

- **Mean Age (41.79)**: The average age of individuals in the dataset is approximately **41.79 years**, close to the median of **43**. The quartiles show that the middle half of individuals are between **24 and 59 years old**, so the dataset is centered on adults but is not concentrated in a narrow age band.
  
- **Standard Deviation (22.46)**: The wide standard deviation indicates a **diverse age range**. The age distribution includes both younger and older individuals, with some very young patients (e.g., 0.08 years old) and some older patients (up to 80 years old).
  
- **Age Range**: The **minimum age** is extremely low (**0.08 years**, roughly one month), which indicates infants, while the **maximum age** is 80 years, so the dataset spans infancy through old age.
  
- **Percentiles**:
  - The **25th percentile (24 years)** indicates that **25%** of the individuals are younger than **24** years.
  - The **50th percentile (Median)** is **43 years**, meaning that half of the individuals in the dataset are younger than **43**, and half are older.
  - The **75th percentile (59 years)** shows that **75%** of individuals are younger than **59 years**.
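
The percentile readings above can be reproduced with `Series.quantile`. The sketch below uses a handful of made-up ages, chosen only so that the quartiles come out at 24, 43, and 59. On the notebook's data the same call would be applied to `df["age"]`.

```python
import pandas as pd

# Toy ages standing in for df["age"]; made up for this example.
age = pd.Series([0.08, 5, 24, 30, 43, 50, 59, 70, 80])

# 25th, 50th (median), and 75th percentiles in one call.
percentiles = age.quantile([0.25, 0.50, 0.75])
print(percentiles)

# Share of individuals strictly younger than the median.
share_below_median = (age < percentiles.loc[0.50]).mean()
print(f"share below median: {share_below_median:.2f}")
```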

### ⚠️ **Health Implications:**
- The presence of younger individuals (with age as low as **0.08**) could indicate that some cases in the dataset involve children or infants, which may need to be verified.
- The **older individuals** (up to **80 years**) suggest that the dataset spans across a wide age group, including middle-aged and elderly individuals. This could be important for analyzing health trends related to **age**, such as the risk of **diabetes** in different age groups.

---

### 📝 **Note:**
- The age distribution in this dataset could provide valuable insights for predicting **diabetes risk**, as the risk tends to increase with age. Further investigation into how age interacts with other features (like BMI or hypertension) could yield meaningful insights.


## Distribution of BMI

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df['bmi'], bins=30, kde=True)
plt.title("Distribution of BMI")
plt.xlabel("BMI")
plt.ylabel("Count")
plt.xticks([i for i in range(14, 51, 2)])
plt.show()

## 📊 BMI Categories and Common Values in the Dataset

After the outlier filtering above, the BMI values in the dataset lie between 14 and 50, with the most common value being **27.32**. Here's a breakdown of the BMI categories and what they represent in terms of health risks:

### ✅ **BMI Categories:**

| Category            | BMI Range (kg/m²)    | Description                        |
|---------------------|----------------------|------------------------------------|
| **Underweight**      | Less than 18.5       | A BMI below 18.5 indicates a person is underweight, which may lead to potential health problems like malnutrition. |
| **Normal weight**    | 18.5 – 24.9          | A BMI in this range is considered normal and healthy, associated with lower health risks. |
| **Overweight**       | 25.0 – 29.9          | A BMI of **27.32** falls into the **overweight** category. This range indicates an increased risk of developing health problems, including diabetes and heart disease. |
| **Obese (Class I)**  | 30.0 – 34.9          | A BMI in this range suggests higher risk for serious health issues, including Type 2 diabetes and heart disease. |
| **Obese (Class II)** | 35.0 – 39.9          | Severe health risks, requiring medical intervention for weight management. |
| **Obese (Class III)**| 40.0 or more         | Extremely high risk of serious health problems, often requiring intensive treatment or surgery. |
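
The category boundaries in the table can be applied to a BMI column with `pd.cut`. This is a sketch on made-up values. The variable name `bmi_category` is an assumption, and `right=False` is used so that each bin includes its lower edge (for example, 25.0 counts as Overweight).

```python
import numpy as np
import pandas as pd

# Bin edges and labels follow the table above.
bins = [0, 18.5, 25.0, 30.0, 35.0, 40.0, np.inf]
labels = [
    "Underweight",
    "Normal weight",
    "Overweight",
    "Obese (Class I)",
    "Obese (Class II)",
    "Obese (Class III)",
]

# Toy values standing in for df["bmi"], one per category.
bmi = pd.Series([17.0, 22.4, 27.32, 31.0, 36.5, 42.0])

# right=False makes the intervals left-closed: [18.5, 25.0), [25.0, 30.0), ...
bmi_category = pd.cut(bmi, bins=bins, labels=labels, right=False)
print(bmi_category.tolist())
```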

---

### 🧠 **Common Value:**
- The most common **BMI value in the dataset** is **27.32**, which falls into the **Overweight** category. This suggests that many individuals in this dataset have a BMI above the healthy range, which may indicate a higher risk of developing chronic health conditions like diabetes.

---

### ⚠️ **Note:**
- BMI is a general tool for assessing body weight but does not account for factors like muscle mass or distribution of fat. Always consider additional health indicators for a more comprehensive evaluation.


## Hypertension Distribution

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x=df['hypertension'], order=df['hypertension'].value_counts().index)
plt.title("Hypertension Distribution")
plt.xlabel("Hypertension (0 = No, 1 = Yes)")
plt.ylabel("Count")
plt.show()

## 📊 Hypertension Values in the Dataset

The **Hypertension** feature in the dataset indicates whether a patient has hypertension (1 = Yes, 0 = No).

### ✅ **Hypertension Value Distribution:**

| Value | Description            | Count   |
|-------|------------------------|---------|
| **0** | No Hypertension         | 87,482  |
| **1** | Hypertension (Yes)      | 7,317   |

---

### 🧠 **Interpretation:**
- **87,482** individuals in the dataset do **not** have hypertension (value 0).
- **7,317** individuals have hypertension (value 1).

This suggests that the majority of individuals in this dataset **do not** suffer from hypertension, with only a smaller portion (about **7.7%**) having high blood pressure. 
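
The percentage quoted above follows directly from the two counts in the table. A minimal sketch of the arithmetic, using only those counts:

```python
import pandas as pd

# Counts copied from the table above (0 = no hypertension, 1 = hypertension).
counts = pd.Series({0: 87_482, 1: 7_317}, name="hypertension")

# Divide each count by the total to get the share of each class.
share = counts / counts.sum()
print((share * 100).round(1))
```

On the DataFrame itself, `df["hypertension"].value_counts(normalize=True)` should give the same shares without typing the counts by hand.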

### ⚠️ **Health Implication:**
- **Hypertension** is a major risk factor for cardiovascular diseases, stroke, and kidney disease. 
- Since the share of individuals with hypertension is relatively low, the sample may have **generally healthier blood pressure levels**. Alternatively, hypertension may be under-recorded or the sample may not be representative of a broader population, so this figure should not be read as a population prevalence.

---

### 📝 **Note:**
- If you're building a model, **hypertension** as a feature may have significant predictive power for **diabetes** risk, as hypertension and diabetes often co-occur.


## Heart Disease Distribution

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x=df['heart_disease'], order=df['heart_disease'].value_counts().index)
plt.title("Heart Disease Distribution")
plt.xlabel("Heart Disease (0 = No, 1 = Yes)")
plt.ylabel("Count")
plt.show()

## 📊 Heart Disease Distribution in the Dataset

The **Heart Disease** feature indicates whether an individual has heart disease (1 = Yes, 0 = No).

### ✅ **Heart Disease Value Distribution:**

| Value | Description           | Count   |
|-------|-----------------------|---------|
| **0** | No Heart Disease      | 90,907  |
| **1** | Heart Disease (Yes)   | 3,892   |

---

### 🧠 **Interpretation:**
- **90,907** individuals in the dataset **do not** have heart disease (value 0).
- **3,892** individuals have **heart disease** (value 1).

This suggests that the majority of individuals in this dataset **do not** suffer from heart disease, with only a smaller portion (about **4.1%**) having heart disease. 

### ⚠️ **Health Implication:**
- **Heart disease** is one of the leading causes of morbidity and mortality worldwide. 
- The **low prevalence** of heart disease in this dataset (4.1%) may indicate a **healthier population** or a sample that is not heavily affected by cardiovascular issues. However, it could also point to **selection bias**, as individuals with heart disease may be underrepresented or excluded from this dataset.

---

### 📝 **Note:**
- **Heart disease** and **diabetes** often have overlapping risk factors (such as high BMI, hypertension, and age). This relationship may make the **heart disease** feature valuable in predictive models, as those with heart disease may also have a higher likelihood of developing diabetes.


## Smoking History Distribution

In [None]:
labels = ['Never', 'No Info', 'Former', 'Current', 'Not Current', 'Ever']
sizes = [34011, 32242, 9176, 9109, 6299, 3962]
colors = ['#66b3ff', '#ff6666', '#99ff99', '#ffcc99', '#c2c2f0', '#ffb3e6']  # Colors for each category
explode = (0.1, 0, 0, 0, 0, 0) 
plt.figure(figsize=(8, 7))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.title('Smoking History Distribution')
plt.axis('equal')
plt.show()
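
The pie chart above uses counts typed in by hand, which can drift out of sync with `df` after rows are dropped. One possible alternative is to derive the labels and sizes with `value_counts`. The sketch below runs on a made-up column; the category spellings are illustrative and may differ from those in the CSV.

```python
import pandas as pd

# Toy column standing in for df["smoking_history"]; values are made up.
smoking = pd.Series(
    ["never", "never", "never", "No Info", "No Info", "former"],
    name="smoking_history",
)

# value_counts() sorts categories from most to least frequent.
counts = smoking.value_counts()
labels = counts.index.tolist()
sizes = counts.tolist()

# Pull out only the first (largest) slice, as in the chart above.
explode = [0.1] + [0.0] * (len(sizes) - 1)
print(labels, sizes, explode)
```

These `labels`, `sizes`, and `explode` lists could then be passed to `plt.pie` in place of the hard-coded ones.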

## 📊 Smoking History Distribution in the Dataset

The **Smoking History** feature indicates the smoking status of individuals. Below is the distribution of smoking history in the dataset.

### ✅ **Smoking History Value Distribution:**

| Smoking History Type | Count   |
|----------------------|---------|
| **Never**            | 34,011  |
| **No Info**          | 32,242  |
| **Former**           | 9,176   |
| **Current**          | 9,109   |
| **Not Current**      | 6,299   |
| **Ever**             | 3,962   |

---

### 🧠 **Interpretation:**
- The largest single group is **"Never"** smokers (**34,011** individuals, about **35.9%** of the dataset), which suggests that a significant portion of the population has never smoked.
- A substantial portion of individuals have **no information** about their smoking history (**32,242**), which may indicate missing or incomplete data.
- **Former smokers** (**9,176**) and **current smokers** (**9,109**) together account for **18,285** individuals. This indicates that roughly **19.3%** of the population either used to smoke or currently smokes.
- The category **"Not Current"** (**6,299**) could refer to people who were smokers in the past but are currently not smoking, or those who are classified differently from "former" smokers.

### ⚠️ **Health Implications:**
- **Smoking** is a known risk factor for various chronic diseases, including **heart disease**, **diabetes**, and **respiratory illnesses**. The dataset shows a significant number of individuals with smoking history, both **former** and **current** smokers.
- The large number of individuals in the **"No Info"** category suggests that missing or unclear data may need to be addressed during analysis or modeling. The lack of smoking information could affect the accuracy of predictions, especially when building models related to health risks.
  
---

### 📝 **Note:**
- Smoking history is an important feature when analyzing the risk of developing **diabetes** or **heart disease**, as smoking is closely linked to both conditions. The **"Never"** smokers may serve as a healthier reference group in predictive modeling.


## Distribution of HbA1c Level

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df['HbA1c_level'], bins=15, kde=True)
plt.title("Distribution of HbA1c Level")
plt.xlabel("HbA1c Level")
plt.ylabel("Count")
plt.xticks([i for i in range(3, 10)])
plt.show()

## 📊 HbA1c Level Distribution in the Dataset

The **HbA1c Level** feature represents the average blood sugar levels over the past 2-3 months, measured as a percentage. Below is the distribution of HbA1c levels in the dataset.

### ✅ **HbA1c Level Value Distribution:**

| HbA1c Level | Count   |
|-------------|---------|
| **6.6**     | 8,040   |
| **5.7**     | 8,015   |
| **6.5**     | 7,934   |
| **5.8**     | 7,874   |
| **6.0**     | 7,874   |
| **6.2**     | 7,861   |
| **6.1**     | 7,600   |
| **4.8**     | 7,220   |
| **3.5**     | 7,218   |
| **4.5**     | 7,200   |
| **4.0**     | 7,113   |
| **5.0**     | 7,075   |
| **8.8**     | 642     |
| **8.2**     | 641     |
| **9.0**     | 635     |
| **6.8**     | 628     |
| **7.5**     | 621     |
| **7.0**     | 608     |

---

### 🧠 **Interpretation:**
- **Most Common HbA1c Levels**: The most frequent HbA1c levels in the dataset are **6.6**, **5.7**, **6.5**, **5.8**, and **6.0**, with each having over **7,500** individuals.
- **Low HbA1c Levels**: The dataset also contains individuals with **lower HbA1c levels** (such as **3.5**, **4.0**, **4.5**, and **4.8**), which might indicate individuals with controlled or non-diabetic levels.
- **High HbA1c Levels**: There are individuals with **higher HbA1c levels**, such as **8.8**, **8.2**, and **9.0**, indicating poor blood sugar control or potential diabetes.

### ⚠️ **Health Implications:**
- **HbA1c levels** are crucial in diagnosing and managing **diabetes**. A higher HbA1c level generally indicates **poor blood sugar control** and a higher risk of complications.
- A **normal HbA1c** level is typically below **5.7%**, while levels between **5.7% and 6.4%** indicate **pre-diabetes**. Levels **6.5% and above** are typically used to diagnose **diabetes**.
- The distribution shows a significant portion of individuals have **HbA1c levels in the range** where diabetes could be diagnosed, which may be useful when analyzing the dataset for **diabetes risk**.
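
The HbA1c cut-offs listed above can be written as a small function. This is a sketch only: the name `hba1c_category` and the label strings are assumptions, and the function classifies a single reading rather than making a diagnosis.

```python
def hba1c_category(value: float) -> str:
    """Map an HbA1c percentage to the band described above."""
    if value < 5.7:
        return "normal"
    if value < 6.5:
        return "prediabetes"
    return "diabetes range"


# A few readings around the two thresholds.
for level in (5.0, 5.7, 6.4, 6.5, 9.0):
    print(level, hba1c_category(level))
```

With pandas, `df["HbA1c_level"].map(hba1c_category)` would apply the same rule to every row.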
  
---

### 📝 **Note:**
- **Monitoring HbA1c levels** is essential for managing **diabetes** and preventing complications. It could be helpful to use this feature in predictive models to assess the likelihood of **diabetes** or evaluate the effectiveness of treatments and lifestyle changes.


## Distribution of Blood Glucose Level

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df['blood_glucose_level'], bins=15, kde=True)
plt.title("Distribution of Blood Glucose Level")
plt.xlabel("Blood Glucose Level")
plt.ylabel("Count")
plt.xticks(range(80, 301, 5), rotation=270)
plt.show()

## 📊 Blood Glucose Level Distribution in the Dataset

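The **Blood Glucose Level** feature represents the blood glucose concentration at the time of measurement. The dataset does not state whether readings were taken after fasting, so fasting-based thresholds should be read as a rough guide. Below is the distribution of blood glucose levels in the dataset.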

### ✅ **Blood Glucose Level Value Distribution:**

| Blood Glucose Level | Count   |
|---------------------|---------|
| **159**             | 7,371   |
| **130**             | 7,355   |
| **126**             | 7,335   |
| **140**             | 7,314   |
| **160**             | 7,271   |
| **145**             | 7,266   |
| **200**             | 7,202   |
| **155**             | 7,178   |
| **90**              | 6,740   |
| **100**             | 6,700   |
| **80**              | 6,700   |
| **158**             | 6,656   |
| **85**              | 6,536   |
| **280**             | 706     |
| **300**             | 649     |
| **240**             | 620     |
| **260**             | 618     |
| **220**             | 582     |

---

### 🧠 **Interpretation:**
- **Most Common Blood Glucose Levels**: The most frequent blood glucose levels are **159**, **130**, **126**, **140**, and **160**, with over **7,200** individuals in each category. These levels might suggest a higher concentration of individuals with **elevated blood glucose** or those at risk for diabetes.
- **Lower Blood Glucose Levels**: There are also many individuals with **lower blood glucose levels**, such as **80**, **85**, **90**, and **100**, which are generally considered within the normal range for fasting blood glucose.
- **High Blood Glucose Levels**: There are **few individuals** with significantly higher blood glucose levels, such as **240**, **260**, **280**, and **300**, which could indicate **severe hyperglycemia** and possibly poorly managed diabetes or a related condition.

### ⚠️ **Health Implications:**
- **Blood Glucose Levels** are key indicators of **diabetes risk**. According to the American Diabetes Association:
  - **Normal fasting blood glucose**: less than **100 mg/dL**.
  - **Prediabetes**: fasting blood glucose between **100 mg/dL** and **125 mg/dL**.
  - **Diabetes**: fasting blood glucose **126 mg/dL** or higher.
- Many of the most common values (e.g., **130**, **140**, **159**) are at or above **126 mg/dL**. If these were fasting measurements they would fall in the **diabetic range** rather than the prediabetic range (100–125 mg/dL), so the thresholds above should be applied cautiously unless fasting status is known.
- The presence of higher blood glucose levels (e.g., **200**, **240**, **280**) may indicate poorly controlled diabetes or individuals in need of urgent medical intervention.
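
To see how many readings fall into each of the bands listed above, the column can be binned with `pd.cut`. The sketch below uses made-up readings and assumes they are fasting values, which the data itself may not guarantee. The label strings are also assumptions.

```python
import numpy as np
import pandas as pd

# Toy readings standing in for df["blood_glucose_level"]; made up for this example.
glucose = pd.Series([80, 99, 100, 125, 126, 159, 300])

# Left-closed bins: [0, 100) normal, [100, 126) prediabetes, [126, inf) diabetes range.
bands = pd.cut(
    glucose,
    bins=[0, 100, 126, np.inf],
    labels=["normal", "prediabetes", "diabetes range"],
    right=False,
)

# sort=False keeps the bands in their defined order instead of by frequency.
print(bands.value_counts(sort=False))
```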

---

### 📝 **Note:**
- **Blood glucose levels** are essential for diagnosing **diabetes** and monitoring glucose control in individuals with the condition. This feature could be very useful in predicting **diabetes risk** and creating personalized treatment plans.


# **Bivariate Analysis**

## **Compare Numerical columns with Diabetes**

In [None]:
numerical_features = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']
for feature in numerical_features:
    plt.figure(figsize=(8, 8))
    sns.boxplot(x='diabetes', y=feature, data=df)
    plt.title(f'{feature} vs Diabetes')
    plt.show()

## 📊 Bivariate Analysis: Numerical Features and Diabetes Status

### Introduction
In this analysis, we explore the relationship between **diabetes status** and various **numerical features** in the dataset, such as **Age**, **BMI**, **HbA1c Level**, and **Blood Glucose Level**. These numerical features provide insights into the physical health characteristics of individuals, and understanding how they correlate with **diabetes status** is crucial for identifying risk factors and developing predictive models for diabetes.

### Numerical Features Analyzed
The following numerical features are considered in this analysis:
- **Age**: The age of the individuals in years.
- **BMI (Body Mass Index)**: A measure of body fat based on height and weight.
- **HbA1c Level**: The average blood sugar levels over the past 2-3 months.
- **Blood Glucose Level**: The fasting blood glucose concentration.

---

### 1. **Age vs Diabetes Status**

#### Boxplot Analysis:
We observe how the **Age** distribution varies between **diabetic** and **non-diabetic** groups. The boxplot visualizes the median, interquartile range, and potential outliers for both groups. Generally, the **diabetic** group tends to have a slightly higher median age than the **non-diabetic** group.

#### Summary Statistics:
- **Mean Age**: Diabetic individuals tend to be older on average compared to non-diabetic individuals. This aligns with known trends, as the likelihood of developing **type 2 diabetes** increases with age.

#### Benefit:
By analyzing **Age** against **diabetes status**, we can identify the **age groups** most at risk for developing diabetes, enabling healthcare providers to target preventive measures for specific age ranges.

---

### 2. **BMI vs Diabetes Status**

#### Boxplot Analysis:
The **BMI** distribution shows that individuals with **higher BMI** tend to be more likely to be diabetic. The **diabetic** group generally has a higher **BMI** compared to the **non-diabetic** group, particularly in the **overweight** and **obese** ranges.

#### Summary Statistics:
- **Mean BMI**: Diabetic individuals typically have a BMI above the **normal range**, which is consistent with known associations between **obesity** and **diabetes** risk. Higher BMI values correlate with an increased risk of **type 2 diabetes**.

#### Benefit:
By understanding the relationship between **BMI** and diabetes, we can pinpoint **obesity** as a significant risk factor and help prioritize **weight management** interventions for at-risk populations.

---

### 3. **HbA1c Level vs Diabetes Status**

#### Boxplot Analysis:
The **HbA1c Level** is a key marker for blood glucose control. The **diabetic** group consistently shows higher **HbA1c levels** compared to the **non-diabetic** group. A higher **HbA1c** indicates **poor blood sugar control** or the presence of **diabetes**.

#### Summary Statistics:
- **Mean HbA1c**: Diabetic individuals show an **HbA1c level** above the **6.5% threshold**, which is the diagnostic criterion for diabetes.

#### Benefit:
This analysis emphasizes the critical role of **HbA1c testing** in **diagnosing and monitoring diabetes**. By identifying individuals with higher HbA1c values, we can provide more targeted **medical interventions** and **lifestyle recommendations** to manage blood glucose levels.

---

### 4. **Blood Glucose Level vs Diabetes Status**

#### Boxplot Analysis:
**Blood Glucose Levels** show a noticeable difference between the two groups. The **diabetic** group exhibits higher **fasting blood glucose levels**, with some individuals showing levels well above the upper limit of the **normal fasting range** (**100 mg/dL**).

#### Summary Statistics:
- **Mean Blood Glucose Level**: Diabetic individuals typically have higher fasting glucose concentrations, which is consistent with the **diagnosis of diabetes** based on fasting blood glucose criteria (≥126 mg/dL).

#### Benefit:
This feature is invaluable in **early detection of diabetes**. Monitoring blood glucose levels allows for **timely intervention** and **prevention** of complications such as heart disease and kidney failure associated with uncontrolled diabetes.

---

### Summary of Findings:
- **Age, BMI, HbA1c Level, and Blood Glucose Level** all show visible differences between **diabetic** and **non-diabetic** individuals in the boxplots, although no formal statistical test was run here.
- **Age** and **BMI** appear to be important risk factors for diabetes, with older age and higher BMI correlating with higher diabetes prevalence.
- **HbA1c Level** and **Blood Glucose Level** are key indicators for diabetes diagnosis and management.
- By leveraging these features, we can develop more accurate **predictive models** for diabetes risk and inform **preventive healthcare strategies**.
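
The group differences described in this section are read off the boxplots. A per-group summary table is one way to put numbers on them. The sketch below runs on a tiny made-up DataFrame that reuses the notebook's column names; the values are not from the dataset.

```python
import pandas as pd

# Toy rows standing in for df; values are made up for this example.
toy = pd.DataFrame({
    "diabetes": [0, 0, 0, 1, 1, 1],
    "age": [25, 40, 55, 50, 62, 74],
    "bmi": [22.0, 27.0, 26.0, 30.0, 33.0, 36.0],
    "HbA1c_level": [5.0, 5.7, 6.0, 6.5, 7.0, 8.2],
    "blood_glucose_level": [90, 130, 140, 160, 200, 260],
})
numerical_features = ["age", "bmi", "HbA1c_level", "blood_glucose_level"]

# Mean and median of each numeric feature, split by diabetes status.
summary = toy.groupby("diabetes")[numerical_features].agg(["mean", "median"])
print(summary)
```

Replacing `toy` with `df` would produce the same table for the full data.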

---

### Conclusion
The bivariate analysis of numerical features provides valuable insights into how various health metrics influence the likelihood of developing **diabetes**. By understanding these relationships, we can identify **high-risk individuals** and implement targeted **prevention strategies**. This analysis is crucial for developing **data-driven healthcare interventions** and improving **diabetes management** at both individual and population levels.


## **Compare Categorical columns with Diabetes**

In [None]:
categorical_features = ['gender', 'smoking_history', 'hypertension', 'heart_disease']  # add more categorical features here

for feature in categorical_features:
    plt.figure(figsize=(8, 6))
    sns.countplot(x=feature, hue='diabetes', data=df)
    plt.title(f'{feature} vs Diabetes')
    plt.show()

## 📊 Bivariate Analysis: Categorical Features and Diabetes Status

### Introduction
In this section, we explore the relationship between **diabetes status** and various **categorical features** in the dataset, including **Gender**, **Smoking History**, **Hypertension**, and **Heart Disease**. Understanding these relationships helps identify key risk factors and guides the development of targeted interventions to prevent or manage diabetes.

### Categorical Features Analyzed
The following categorical features are considered in this analysis:
- **Gender**: The sex of the individuals (Male, Female).
- **Smoking History**: The smoking status of the individuals (e.g., Never, Former, Current, No Info).
- **Hypertension**: Whether the individual has hypertension (0 for No, 1 for Yes).
- **Heart Disease**: Whether the individual has heart disease (0 for No, 1 for Yes).

---

### 1. **Gender vs Diabetes Status**

#### Count Plot Analysis:
The **Gender** distribution between **diabetic** and **non-diabetic** individuals is analyzed through a **count plot**. This visualizes the number of males and females in both groups.

#### Summary:
- The dataset contains a higher number of **female** individuals (55,262) compared to **male** individuals (39,537).
- The diabetes prevalence might differ slightly between the genders, but the count plot will give insights into whether gender influences diabetes risk.

#### Benefit:
This analysis can help healthcare providers understand if **gender** plays a significant role in diabetes risk. Knowing the distribution of diabetes cases across gender groups could guide gender-specific interventions or awareness campaigns.

---

### 2. **Smoking History vs Diabetes Status**

#### Count Plot Analysis:
We analyze how different categories of **Smoking History** (e.g., Never, Former, Current, No Info) are distributed across **diabetic** and **non-diabetic** groups using a **count plot**.

#### Summary:
- **Never smoked**: 34,011 individuals.
- **No Info**: 32,242 individuals.
- **Former smokers**: 9,176 individuals.
- **Current smokers**: 9,109 individuals.
- **Not current**: 6,299 individuals.
- **Ever smoked**: 3,962 individuals.

#### Benefit:
This analysis reveals the association between **smoking history** and diabetes status. Smoking is a known risk factor for **type 2 diabetes**, and analyzing this feature will help identify individuals who might benefit from smoking cessation programs as part of **diabetes prevention** strategies.

---

### 3. **Hypertension vs Diabetes Status**

#### Count Plot Analysis:
The **Hypertension** feature (binary: 0 for No, 1 for Yes) is analyzed to see how the presence of hypertension correlates with diabetes status. 

#### Summary:
- **Hypertension (0)**: 87,482 individuals.
- **Hypertension (1)**: 7,317 individuals.

#### Benefit:
Hypertension is a major **cardiovascular risk factor** and is strongly associated with an increased risk of **diabetes**. By analyzing this feature, we can identify individuals who have both conditions and implement strategies for **joint management** of hypertension and diabetes. Early intervention in hypertensive individuals could potentially help prevent the onset of **diabetes**.

---

### 4. **Heart Disease vs Diabetes Status**

#### Count Plot Analysis:
We examine the relationship between the presence of **Heart Disease** (binary: 0 for No, 1 for Yes) and diabetes status through a **count plot**.

#### Summary:
- **No Heart Disease (0)**: 90,907 individuals.
- **Heart Disease (1)**: 3,892 individuals.

#### Benefit:
**Heart disease** and **type 2 diabetes** frequently co-occur, in part because of the close association between **cardiovascular health** and **insulin resistance**. Understanding this relationship helps in the **early identification of at-risk individuals**, and interventions that improve **cardiovascular health** could help in managing both conditions together.

---

### Summary of Findings:
- **Gender**: The analysis of gender shows the distribution of diabetes cases between male and female individuals, helping to understand the **gender-specific diabetes prevalence**.
- **Smoking History**: The count plot shows how diabetes cases are distributed across smoking categories. Any association should be read with care, because **No Info** is the second-largest category (32,242 individuals). Where an association holds, **smoking cessation** is a potential intervention to lower diabetes risk.
- **Hypertension**: Hypertension is associated with an increased risk of diabetes, although only 7,317 individuals in the dataset are hypertensive. This relationship underlines the importance of monitoring both conditions together and providing comprehensive care for patients.
- **Heart Disease**: Heart disease is another important risk factor that often co-occurs with diabetes. Analyzing this feature highlights the need for **joint interventions** in patients with both conditions to improve health outcomes.

---

### Conclusion
The bivariate analysis of categorical features against diabetes status shows how **Gender**, **Smoking History**, **Hypertension**, and **Heart Disease** relate to **diabetes** in this dataset. Count plots show association rather than causation, but they can still help healthcare providers target **preventive measures**, provide **early interventions**, and develop **personalized treatment plans** for individuals at risk.
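
One possible way to check whether a categorical feature and diabetes status are statistically associated is a chi-square test of independence on their contingency table. The sketch below uses `scipy.stats.chi2_contingency`, which the notebook does not otherwise use. The column names and the toy counts are assumptions made for illustration:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data for illustration only; not the real dataset.
toy = pd.DataFrame({
    "hypertension": [0] * 50 + [1] * 50,
    "diabetes":     [0] * 45 + [1] * 5 + [0] * 25 + [1] * 25,
})

# Contingency table: rows are hypertension levels, columns are diabetes status.
table = pd.crosstab(toy["hypertension"], toy["diabetes"])

# Null hypothesis: the two variables are independent.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

A small p-value indicates that the observed counts are unlikely under independence. It does not measure the strength of the association or establish causation.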


# **Multivariate Analysis**

## Correlation Heatmap: Numerical Features and Diabetes

In [None]:
correlations = df[numerical_features + ['diabetes']].corr()
plt.figure(figsize=(10, 7))
sns.heatmap(correlations, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap: Numerical Features and Diabetes')
plt.show()
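
The heatmap is easier to read alongside a ranked list of each feature's correlation with the target. A minimal sketch, assuming numeric columns named `age` and `bmi` next to `diabetes`; the toy values are illustrative and not from the dataset:

```python
import pandas as pd

# Toy data for illustration only; not the real dataset.
toy = pd.DataFrame({
    "age":      [20, 35, 50, 65, 80, 45],
    "bmi":      [22.0, 27.5, 31.0, 29.0, 33.5, 24.0],
    "diabetes": [0, 0, 1, 1, 1, 0],
})

correlations = toy.corr()

# Keep only the target column, drop its self-correlation, and rank the
# features by the absolute strength of their linear correlation.
ranked = correlations["diabetes"].drop("diabetes").abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a non-linear dependence.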

## Data encoding

In [None]:
label_encoders = {}
for col in df.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le
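
Because the fitted encoders are kept in `label_encoders`, the integer codes can be mapped back to the original labels. A minimal sketch on a toy `gender` column (illustrative values only) showing how the mapping could be inspected and reversed:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only; not the real dataset.
toy = pd.DataFrame({"gender": ["Male", "Female", "Other", "Female"]})

le = LabelEncoder()
toy["gender_encoded"] = le.fit_transform(toy["gender"].astype(str))

# classes_ is sorted, and a class's position in it is its integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
restored = le.inverse_transform(toy["gender_encoded"])
print(list(restored))
```

Note that the integer codes follow sorted label order, which implies an ordering that distance-based models such as KNN and SVC may pick up. One-hot encoding (for example with `pd.get_dummies`) is a common alternative for nominal features.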

In [None]:
df.head()

## Select the Features and Target

In [None]:
X = df.drop(columns=['diabetes'])
y = df['diabetes']

## Data Splitting

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
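
When the target classes are imbalanced, passing `stratify` to `train_test_split` keeps the class ratio the same in the training and test sets. This is a sketch of an optional variation, not the split used above. The toy arrays `X_toy` and `y_toy` are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target with 10% positives (illustrative, not the real data).
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([0] * 900 + [1] * 100)

# stratify=y_toy makes both splits preserve the positive-class share.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive share:", y_tr.mean())
print("test positive share: ", y_te.mean())
```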

# **Build Logistic Regression Model**

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
# Evaluate logistic regression on the held-out test set
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))

# **Build SVC Model**

In [None]:
svm = SVC()
svm.fit(X_train, y_train)

In [None]:
y_pred1 = svm.predict(X_test)
print(accuracy_score(y_test, y_pred1))

# **Build KNeighborsClassifier Model**

In [None]:
k = 3
knn = KNeighborsClassifier(n_neighbors=k)

In [None]:
knn.fit(X_train, y_train)

In [None]:
y_pred2 = knn.predict(X_test)
print(accuracy_score(y_test, y_pred2))

# **Build Random Forest Classifier Model**

In [None]:
rf = RandomForestClassifier(random_state=42)  # fixed seed so the result is reproducible
rf.fit(X_train, y_train)

In [None]:
y_pred_rf = rf.predict(X_test)
print(accuracy_score(y_test, y_pred_rf))

In [None]:
# y_pred holds the logistic regression predictions from above
print(classification_report(y_test, y_pred))

In [None]:
df['diabetes'].value_counts()

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
# Recompute the logistic regression predictions and their confusion matrix
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

In [None]:
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])

# Add labels and title
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')

# Show the plot
plt.show()

## Model Comparison and Performance Evaluation

In this section, we compare the performance of four machine learning models applied to the diabetes prediction task. The models evaluated are:

1. **Logistic Regression**
2. **Support Vector Classifier (SVC)**
3. **K-Nearest Neighbors (KNN) Classifier**
4. **Random Forest Classifier**

### Evaluation Metric
The primary metric used to evaluate model performance is **accuracy**, which represents the proportion of correctly predicted instances out of the total instances.
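
As a worked illustration of the definition, accuracy can be computed directly from the four cells of a binary confusion matrix as (TP + TN) / (TP + TN + FP + FN). The sketch below uses made-up labels, not results from this notebook, and also computes recall to show that the two metrics can diverge when the positive class is rare:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels for illustration: 8 negatives, 2 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found

print("accuracy:", accuracy)
print("recall:  ", recall)
```

In this toy case, accuracy is high even though one of the two positive cases is missed, which is why the classification report and confusion matrix are worth reading alongside accuracy.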

### Model Performance

- **Logistic Regression**:
  - **Accuracy**: **95.8%**
  - Logistic Regression is a widely used model for binary classification tasks. It reached an accuracy of 95.8%, meaning it correctly predicted the diabetes status of most patients in the test set.

- **Support Vector Classifier (SVC)**:
  - **Accuracy**: **96.0%**
  - The SVC model, which seeks the optimal hyperplane separating the classes, slightly outperformed Logistic Regression with an accuracy of 96.0%, a margin of 0.2 percentage points.

- **K-Nearest Neighbors (KNN) Classifier**:
  - **Accuracy**: **95.6%**
  - The KNN classifier, which makes predictions based on the majority class of its nearest neighbors, achieved an accuracy of 95.6%. While slightly lower than Logistic Regression and SVC, this is still a strong performance.

- **Random Forest Classifier**:
  - **Accuracy**: **96.8%**
  - The Random Forest model, which combines an ensemble of decision trees to make predictions, achieved the highest accuracy of 96.8%. This suggests that it captures non-linear patterns in the data somewhat better than the other three models.

### Summary of Model Performance

| Model                   | Accuracy (%) |
|-------------------------|--------------|
| **Logistic Regression** | 95.8         |
| **SVC**                 | 96.0         |
| **KNN**                 | 95.6         |
| **Random Forest**       | 96.8         |

### Conclusion
- **Random Forest** achieved the highest accuracy (96.8%) among the four models, followed by **SVC** (96.0%) and **Logistic Regression** (95.8%).
- **KNN** (95.6%) was slightly lower but remains a viable option for this task.
- All four models fall within 1.2 percentage points of each other, so the ranking reflects a modest rather than decisive difference.
- Overall, **Random Forest** and **SVC** are the most suitable models for this dataset by accuracy, with **Random Forest** the top performer.
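
A comparison table like the one above can be produced in a single loop instead of one cell per model. The sketch below is self-contained and runs on synthetic data from `make_classification`, so its numbers will not match the accuracies reported here. The names `models` and `results` are illustrative:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features and target.
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

models = {
    "Logistic Regression": LogisticRegression(),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Random Forest": RandomForestClassifier(random_state=42),
}

rows = []
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    rows.append({"Model": name, "Accuracy (%)": round(100 * acc, 1)})

results = pd.DataFrame(rows).sort_values("Accuracy (%)", ascending=False)
print(results.to_string(index=False))
```

Keeping the models in a dictionary makes it straightforward to add another classifier or another metric column without duplicating cells.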
