# Dataset Overview

- **Rows:** 768  
- **Columns:** 9  
- **Description:** Medical data used to predict diabetes risk based on clinical measures.

## Observations
- Columns like *SkinThickness*, *Insulin*, and *BMI* contain physiologically implausible `0` values → likely placeholders for missing measurements.
- `Unnamed: 0` is an index column → should be dropped.
- Mean glucose level ≈ 120 → slightly high.
- Insulin and BMI have large standard deviations → possible outliers.
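
The zero-value observation can be checked with a quick count per column. This is a minimal sketch: the column list is an assumption based on the columns named in this report, `zero_counts` is a hypothetical helper, and the small `demo` frame is made-up data used only so the snippet runs on its own.

```python
import pandas as pd

# Columns where 0 is treated as implausible (assumed list, taken from this report).
ZERO_INVALID_COLS = ["BloodPressure", "SkinThickness", "Insulin", "BMI"]


def zero_counts(df: pd.DataFrame, cols=ZERO_INVALID_COLS) -> pd.Series:
    """Count entries equal to 0 per column, skipping columns that are absent."""
    present = [c for c in cols if c in df.columns]
    return (df[present] == 0).sum()


# Made-up rows for illustration only; not the real dataset.
demo = pd.DataFrame(
    {
        "SkinThickness": [35, 0, 0, 23],
        "Insulin": [0, 0, 0, 94],
        "BMI": [33.6, 26.6, 23.3, 0.0],
    }
)

counts = zero_counts(demo)
print(counts)
```

On the real data, `df.describe()` alongside these counts would also show the large standard deviations noted for `Insulin` and `BMI`.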

## Next Steps
1. Drop the `Unnamed: 0` column.
2. Replace invalid zeros with `NaN`.
3. Impute the resulting missing values with each column's median.
4. Analyze correlations between features.
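
The four steps above could be sketched with pandas roughly as follows. The `clean` function, the list of columns treated as zero-invalid, and the `demo` frame are assumptions for illustration; the real pipeline may differ, for example in which columns are cleaned.

```python
import numpy as np
import pandas as pd

# Columns where 0 is treated as a missing-value placeholder (assumed list).
ZERO_INVALID_COLS = ["BloodPressure", "SkinThickness", "Insulin", "BMI"]


def clean(df: pd.DataFrame, zero_cols=ZERO_INVALID_COLS) -> pd.DataFrame:
    """Return a cleaned copy of df; the input frame is left unchanged."""
    # Step 1: drop the leftover index column if it exists.
    out = df.drop(columns=["Unnamed: 0"], errors="ignore")
    cols = [c for c in zero_cols if c in out.columns]
    # Step 2: mark implausible zeros as missing.
    out[cols] = out[cols].replace(0, np.nan)
    # Step 3: fill missing values with each column's median (NaN is skipped).
    out[cols] = out[cols].fillna(out[cols].median())
    return out


# Made-up rows for illustration only; not the real dataset.
demo = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2, 3],
        "Insulin": [0, 100, 200, 0],
        "BMI": [30.0, 0.0, 20.0, 40.0],
        "Outcome": [1, 0, 0, 1],
    }
)

cleaned = clean(demo)
# Step 4: pairwise Pearson correlations on the cleaned data.
corr = cleaned.corr()
print(cleaned)
```

The median is used here because it is less sensitive than the mean to the extreme values expected in columns such as `Insulin`.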


## Variable Distributions

The histograms below show how each numerical feature is distributed, which makes skew, outliers, and likely missing-data placeholders easy to spot.

- **Pregnancies:** Mostly low counts, right-skewed — most women have between 0 and 5 pregnancies.  
- **Glucose:** Slightly right-skewed with a normal-like shape — a few high values that might represent outliers.  
- **BloodPressure:** Centered around 70–80 mm Hg — looks roughly normal, but some zeros are unrealistic and may need cleaning.  
- **SkinThickness:** Very right-skewed — many zero values, likely missing data.  
- **Insulin:** Highly right-skewed — a large number of zeros suggest unrecorded measurements.  
- **BMI:** Roughly normal distribution centered around 30 — indicates many participants are overweight or obese.  
- **DiabetesPedigreeFunction:** Right-skewed — most participants have low genetic risk, with a few higher-risk cases.  
- **Age:** Concentrated between 20 and 50 years — fewer older individuals in the dataset.  

**Interpretation**

These distributions highlight several key points:
- Some variables (like `Insulin` and `SkinThickness`) have **many zero values** that may represent missing data.  
- Most features are **right-skewed**, suggesting that a **transformation** (for example, a log transform) or **scaling** may be needed before modeling.  
- The data shows **realistic population patterns** — e.g., BMI around 30 and glucose peaking near 120–140 mg/dL.  
- Understanding these shapes helps in **detecting outliers**, **deciding preprocessing steps**, and **choosing suitable algorithms**.

![Histograms of each numerical feature](../reports/figures/output1.png "Variable Distributions")
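
To put a number on the right skew described above, sample skewness can be compared before and after a log transform. This is only a sketch: `skew_report` is a hypothetical helper, `log1p` is one possible transform rather than the one this analysis necessarily uses, and the `demo` values are made up.

```python
import numpy as np
import pandas as pd


def skew_report(df: pd.DataFrame) -> pd.DataFrame:
    """Sample skewness of each numeric column, raw and after log1p."""
    num = df.select_dtypes("number")
    return pd.DataFrame(
        {
            "skew_raw": num.skew(),
            # log1p(x) = log(1 + x); defined at 0, assumes non-negative values.
            "skew_log1p": np.log1p(num).skew(),
        }
    )


# Made-up, strongly right-skewed values for illustration only.
demo = pd.DataFrame({"Insulin": [10, 20, 40, 80, 160, 320]})

report = skew_report(demo)
print(report)
```

A skewness well above 0 indicates a long right tail; a value near 0 after the transform suggests the transformed feature is closer to symmetric.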


### Observation on Variable Correlations

Most variables show weak correlations, with a few moderate relationships:  
- **Pregnancies & Age (0.544):** Older individuals tend to have more pregnancies.  
- **Insulin & SkinThickness (0.437):** Higher skin thickness relates to higher insulin levels.  
- **BMI & SkinThickness (0.393):** Higher BMI corresponds to increased skin thickness.  
- **Glucose & Insulin (0.331):** Mild positive relationship between glucose and insulin.  

Overall, pairwise linear correlations are mostly weak, so the features carry largely non-redundant information for predictive modeling.

![Pairwise correlations between features](../reports/figures/output2.png "Variable Correlations")
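
A ranked list of feature pairs like the one above can be pulled from the correlation matrix. The sketch below assumes Pearson correlation (the pandas default); `top_correlations` is a hypothetical helper and the `demo` frame is made-up data, so the values it prints are not the ones reported here.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, n: int = 4) -> pd.Series:
    """Return the n feature pairs with the largest absolute correlation."""
    corr = df.select_dtypes("number").corr()
    # Keep only the upper triangle so each pair appears once, without the diagonal.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by absolute value, strongest relationship first.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Made-up rows for illustration only; not the real dataset.
demo = pd.DataFrame(
    {
        "Age": [21, 30, 40, 50, 60],
        "Pregnancies": [0, 1, 3, 4, 6],
        "BMI": [30, 22, 35, 25, 28],
    }
)

top = top_correlations(demo, n=2)
print(top)
```

Note that a low Pearson coefficient only rules out a strong linear relationship; nonlinear dependence between features would not show up in this ranking.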

