# Exploratory Data Analysis (EDA) of Diabetes Dataset

## Introduction
This notebook explores the `diabetes_dataset.csv` file, which contains health metrics related to diabetes. The goal is to understand the dataset's structure, identify patterns, and prepare the data for use in a Streamlit app.

## Dataset Description
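The dataset has 9,538 rows and the following 17 columns (units are not stated in the file):
- **Age**: Age in years.
- **Pregnancies**: Number of times pregnant.
- **BMI**: Body Mass Index.
- **Glucose**: Plasma glucose concentration.
- **BloodPressure**: Diastolic blood pressure.
- **HbA1c**: Glycated hemoglobin level.
- **LDL**: Low-density lipoprotein cholesterol.
- **HDL**: High-density lipoprotein cholesterol.
- **Triglycerides**: Triglyceride level.
- **WaistCircumference**: Waist circumference.
- **HipCircumference**: Hip circumference.
- **WHR**: Waist-to-hip ratio.
- **FamilyHistory**: Family history indicator (0 or 1).
- **DietType**: Diet category, coded 0 to 2.
- **Hypertension**: Hypertension indicator (0 or 1).
- **MedicationUse**: Medication use indicator (0 or 1).
- **Outcome**: Target variable (0 = no diabetes, 1 = diabetes).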

## Steps
1. Load the dataset.
2. Perform basic data exploration (e.g., `df.head()`, `df.info()`).
3. Visualize feature distributions (e.g., histograms, box plots).
4. Analyze correlations between features.
5. Summarize insights and observations.


### Step 1: Load the Dataset

In [30]:
import pandas as pd
import plotly.express as px

# Load the dataset
df = pd.read_csv('/Users/bmhknicks/Documents/Projects/my_first_soft_dev/data/diabetes_dataset.csv')

# Display the first few rows
df.head()

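|   | Age | Pregnancies | BMI | Glucose | BloodPressure | HbA1c | LDL | HDL | Triglycerides | WaistCircumference | HipCircumference | WHR | FamilyHistory | DietType | Hypertension | MedicationUse | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69 | 5 | 28.39 | 130.1 | 77.0 | 5.4 | 130.4 | 44.0 | 50.0 | 90.5 | 107.9 | 0.84 | 0 | 0 | 0 | 1 | 0 |
| 1 | 32 | 1 | 26.49 | 116.5 | 72.0 | 4.5 | 87.4 | 54.2 | 129.9 | 113.3 | 81.4 | 1.39 | 0 | 0 | 0 | 0 | 0 |
| 2 | 89 | 13 | 25.34 | 101.0 | 82.0 | 4.9 | 112.5 | 56.8 | 177.6 | 84.7 | 107.2 | 0.79 | 0 | 0 | 0 | 1 | 0 |
| 3 | 78 | 13 | 29.91 | 146.0 | 104.0 | 5.7 | 50.7 | 39.1 | 117.0 | 108.9 | 110.0 | 0.99 | 0 | 0 | 0 | 1 | 1 |
| 4 | 38 | 8 | 24.56 | 103.2 | 74.0 | 4.7 | 102.5 | 29.1 | 145.9 | 84.1 | 92.8 | 0.91 | 0 | 1 | 0 | 0 | 0 |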
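Because the rest of the notebook relies on specific column names, it can help to fail early when the file does not match. The following is a minimal sketch, not part of the original notebook: `load_diabetes_csv` and `EXPECTED_COLUMNS` are hypothetical names, and the demo reads an in-memory CSV built from the first two rows shown above instead of the real file.

```python
import io

import pandas as pd

# Column names as listed in this notebook's df.info() output.
EXPECTED_COLUMNS = [
    "Age", "Pregnancies", "BMI", "Glucose", "BloodPressure", "HbA1c",
    "LDL", "HDL", "Triglycerides", "WaistCircumference", "HipCircumference",
    "WHR", "FamilyHistory", "DietType", "Hypertension", "MedicationUse",
    "Outcome",
]


def load_diabetes_csv(source):
    """Read the CSV (path or file-like object) and raise if a column is missing."""
    frame = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame


# Demo on an in-memory CSV made of the first two rows of df.head().
sample_csv = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "69,5,28.39,130.1,77.0,5.4,130.4,44.0,50.0,90.5,107.9,0.84,0,0,0,1,0\n"
    "32,1,26.49,116.5,72.0,4.5,87.4,54.2,129.9,113.3,81.4,1.39,0,0,0,0,0\n"
)
sample_df = load_diabetes_csv(sample_csv)
print(sample_df.shape)
```

With the real file, the same function could be called with the path passed to `pd.read_csv` in the cell above.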


### Step 2: Perform Basic Data Exploration

In [31]:
# Display dataset information
df.info()

# Check for missing values
print("Missing Values:\n", df.isnull().sum())

# Display basic statistics
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9538 entries, 0 to 9537
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 9538 non-null   int64  
 1   Pregnancies         9538 non-null   int64  
 2   BMI                 9538 non-null   float64
 3   Glucose             9538 non-null   float64
 4   BloodPressure       9538 non-null   float64
 5   HbA1c               9538 non-null   float64
 6   LDL                 9538 non-null   float64
 7   HDL                 9538 non-null   float64
 8   Triglycerides       9538 non-null   float64
 9   WaistCircumference  9538 non-null   float64
 10  HipCircumference    9538 non-null   float64
 11  WHR                 9538 non-null   float64
 12  FamilyHistory       9538 non-null   int64  
 13  DietType            9538 non-null   int64  
 14  Hypertension        9538 non-null   int64  
 15  MedicationUse       9538 non-null   int64  
 16  Outcome             9538 non-null   int64  

|   | Age | Pregnancies | BMI | Glucose | BloodPressure | HbA1c | LDL | HDL | Triglycerides | WaistCircumference | HipCircumference | WHR | FamilyHistory | DietType | Hypertension | MedicationUse | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 | 9538.0 |
| mean | 53.577584 | 7.986161 | 27.052364 | 106.104183 | 84.475781 | 4.650661 | 100.133456 | 49.953418 | 151.147746 | 93.951678 | 103.060621 | 0.9174 | 0.302474 | 0.486161 | 0.001048 | 0.405012 | 0.344097 |
| std | 20.764651 | 4.933469 | 5.927955 | 21.91859 | 14.12348 | 0.476395 | 29.91191 | 15.242194 | 48.951627 | 15.594468 | 13.438827 | 0.140828 | 0.459354 | 0.661139 | 0.032364 | 0.49092 | 0.475098 |
| min | 18.0 | 0.0 | 15.0 | 50.0 | 60.0 | 4.0 | -12.0 | -9.2 | 50.0 | 40.3 | 54.8 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 36.0 | 4.0 | 22.87 | 91.0 | 74.0 | 4.3 | 80.1 | 39.7 | 117.2 | 83.4 | 94.0 | 0.82 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 53.0 | 8.0 | 27.05 | 106.0 | 84.0 | 4.6 | 99.9 | 50.2 | 150.55 | 93.8 | 103.2 | 0.91 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 72.0 | 12.0 | 31.18 | 121.0 | 94.0 | 5.0 | 120.2 | 60.2 | 185.1 | 104.6 | 112.1 | 1.01 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| max | 89.0 | 16.0 | 49.66 | 207.2 | 138.0 | 6.9 | 202.2 | 107.8 | 345.8 | 163.0 | 156.6 | 1.49 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 |
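The statistics above report negative minimums for `LDL` (-12.0) and `HDL` (-9.2), which are not plausible for cholesterol measurements and are worth counting before any modeling. The sketch below is one possible check, not part of the original notebook: `count_negative_values` is a hypothetical helper, and the demo frame contains made-up values.

```python
import pandas as pd


def count_negative_values(frame, columns):
    """Return the number of negative entries in each of the given columns."""
    return (frame[columns] < 0).sum()


# Tiny illustrative frame (made-up values, not rows from the dataset).
demo = pd.DataFrame({
    "LDL": [130.4, -12.0, 87.4],
    "HDL": [44.0, 54.2, -9.2],
})
negatives = count_negative_values(demo, ["LDL", "HDL"])
print(negatives)
```

On the loaded data, the equivalent call would be `count_negative_values(df, ["LDL", "HDL"])`. How to treat such rows (drop, clip, or impute) is a separate decision that this sketch does not make.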


### Step 3: Visualize Feature Distributions

In [32]:
# Histogram for Age
fig_age = px.histogram(df, x='Age', nbins=20, title="Distribution of Age")
fig_age.show()
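A histogram gives a visual impression of skew, and a numeric check can back it up. The sketch below compares mean, median, and sample skewness using `Series.skew()` from pandas. `describe_skew` is a hypothetical helper and both demo series are made up for illustration.

```python
import pandas as pd


def describe_skew(series):
    """Return mean, median, and sample skewness of a numeric series."""
    return {
        "mean": series.mean(),
        "median": series.median(),
        "skew": series.skew(),
    }


# Made-up examples: one symmetric series, one with a long right tail.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 20])

print(describe_skew(symmetric))
print(describe_skew(right_skewed))
```

A skewness near zero with mean close to median suggests a roughly symmetric distribution, while a clearly positive value with mean above median suggests right skew. On the loaded data this would be `describe_skew(df["Age"])`.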

In [33]:
# Boxplot for BMI
fig_bmi = px.box(df, y='BMI', title="Boxplot of BMI")
fig_bmi.show()
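The points a box plot draws beyond the whiskers usually follow the 1.5 × IQR rule, and the same fences can be computed directly. The sketch below assumes that rule. `iqr_fences` and `flag_outliers` are hypothetical helpers, and Plotly may compute quartiles slightly differently from pandas, so the fences can differ a little from the plot.

```python
import pandas as pd


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(series, k=1.5):
    """Return the values of a series that fall outside its k * IQR fences."""
    lower, upper = iqr_fences(series.quantile(0.25), series.quantile(0.75), k)
    return series[(series < lower) | (series > upper)]


# Fences from the BMI quartiles reported by df.describe() (25% = 22.87, 75% = 31.18).
bmi_lower, bmi_upper = iqr_fences(22.87, 31.18)
print(round(bmi_lower, 3), round(bmi_upper, 3))

# Made-up series to show the flagging step.
demo_values = pd.Series([20, 22, 24, 26, 28, 60])
print(flag_outliers(demo_values).tolist())
```

With the reported quartiles, the upper fence is about 43.6, so the BMI maximum of 49.66 counts as an outlier under this rule. On the loaded data the flagged rows would come from `flag_outliers(df["BMI"])`.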

### Step 4: Analyze Correlations Between Features

In [34]:
# Calculate the correlation matrix
corr_matrix = df.corr()

# Create an interactive heatmap
fig_heatmap = px.imshow(
    corr_matrix,
    text_auto=True,
    title="Correlation Heatmap",
    labels=dict(x="Features", y="Features", color="Correlation"),
    color_continuous_scale="Viridis"
)
fig_heatmap.show()
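With 17 columns the heatmap is dense, so it can be easier to read the target column of the correlation matrix as a ranked list. The sketch below is one way to do that. `rank_by_target_correlation` is a hypothetical helper and the demo frame holds made-up values, not rows from the dataset.

```python
import pandas as pd


def rank_by_target_correlation(frame, target="Outcome"):
    """Return Pearson correlations with the target, ordered by absolute value."""
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up values: Glucose separates the two classes, Age barely does.
demo = pd.DataFrame({
    "Glucose": [90, 100, 110, 150, 160, 170],
    "Age": [30, 60, 40, 50, 35, 55],
    "Outcome": [0, 0, 0, 1, 1, 1],
})
ranked = rank_by_target_correlation(demo)
print(ranked)
```

On the loaded data, `rank_by_target_correlation(df)` would list every feature's correlation with `Outcome`. Pearson correlation only captures linear association, so a low value does not rule out a nonlinear relationship.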

### Step 5: Summarize Insights and Observations

- **Insight 1**: The dataset contains no missing values: all 17 columns have 9,538 non-null entries.
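- **Insight 2**: `Age` ranges from 18 to 89 with a mean (53.6) close to the median (53), so the summary statistics suggest a roughly symmetric distribution over a broad adult age range, not a right-skewed or predominantly young one.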
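- **Insight 3**: The correlation heatmap shows a strong positive correlation between `Glucose` and `Outcome`; the coefficient itself can be read from `corr_matrix.loc['Glucose', 'Outcome']`.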
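- **Insight 4**: The `BMI` distribution has high-end outliers: the maximum (49.66) lies above the upper 1.5 × IQR fence of about 43.6 (Q3 = 31.18, IQR = 8.31), so these rows may need further investigation.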
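One more number worth stating next to these insights is the class balance of the target. `df.describe()` reports a mean of 0.344 for `Outcome`, and for a 0/1 column the mean equals the share of positive cases. The sketch below shows that equivalence on made-up labels. `class_balance` is a hypothetical helper.

```python
import pandas as pd


def class_balance(labels):
    """Return the share of each class in a label column, indexed by class."""
    return labels.value_counts(normalize=True).sort_index()


# Made-up labels: 3 positives out of 10.
demo_outcome = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="Outcome")
balance = class_balance(demo_outcome)
print(balance)
print(demo_outcome.mean())
```

On the loaded data, `class_balance(df["Outcome"])` would give the exact split, which matters when choosing evaluation metrics for any model built on this dataset.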