# Data Exploration: Diabetes Risk Dataset

This notebook performs basic exploratory data analysis on the diabetes dataset.

**Purpose**: Understand the data structure, distributions, and basic characteristics.

**Approach**: Keep it simple. We are not looking for hidden patterns, just understanding what we are working with.

In [None]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from data_loader import load_diabetes_data, get_feature_info
from preprocessing import check_missing_values, prepare_features_target, get_class_distribution

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## 1. Load Data

The data is loaded directly from the UCI repository so that the analysis is reproducible.

In [None]:
# Load dataset
df = load_diabetes_data()

# Display basic information
print("\nDataset shape:", df.shape)
print("\nFirst few rows:")
df.head()

In [None]:
# Get feature information
feature_info = get_feature_info()
print(f"Target variable: {feature_info['target']}")
print(f"Description: {feature_info['description']}")
print(f"\nNumber of features: {len(feature_info['features'])}")

## 2. Data Quality Check

Checking for missing values and basic data types.
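
The implementation of `check_missing_values` is not shown in this notebook. As a rough sketch of the same kind of check in plain pandas, assuming a small made-up frame (`toy` and its values are hypothetical, only the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: values are invented for illustration only.
toy = pd.DataFrame({
    "BMI": [27.0, np.nan, 31.5, 22.0],
    "HighBP": [1, 0, 1, 0],
})

# Count and percentage of missing entries per column.
missing_counts = toy.isna().sum()
missing_pct = toy.isna().mean() * 100

report = pd.DataFrame({"missing": missing_counts, "percent": missing_pct})
print(report)
```

A column with a non-zero `missing` count would need imputation or row removal before modeling.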

In [None]:
# Check missing values
missing = check_missing_values(df)

# Check data types
print("\nData types:")
print(df.dtypes.value_counts())

In [None]:
# Basic statistics
df.describe()

## 3. Target Variable Analysis

Understanding the class distribution is critical for diabetes prediction.

**Why this matters**: Class imbalance affects model training and evaluation.
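
To illustrate the evaluation side of this, here is a minimal sketch on synthetic labels. The 900/100 split is a hypothetical example, not a figure from this dataset. A "model" that always predicts the majority class scores high accuracy while detecting no positive cases:

```python
import numpy as np

# Hypothetical labels: 900 negatives (0) and 100 positives (1).
y_true = np.array([0] * 900 + [1] * 100)

# A trivial baseline that always predicts the majority class (0).
y_pred = np.zeros_like(y_true)

# Accuracy looks good because most labels are 0.
accuracy = (y_pred == y_true).mean()

# Recall on the positive class: true positives / all actual positives.
recall_positive = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall on positive class: {recall_positive:.2f}")
```

This is why metrics such as recall, precision, or balanced accuracy are usually reported alongside plain accuracy on imbalanced data.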

In [None]:
# Separate features and target
X, y = prepare_features_target(df)

# Get detailed distribution
distribution = get_class_distribution(y)

In [None]:
# Visualize class distribution
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

# Count plot (sort_index() keeps classes in 0, 1 order so the tick labels match)
y.value_counts().sort_index().plot(kind='bar', ax=ax[0], color=['skyblue', 'salmon'])
ax[0].set_xlabel('Diabetes Status')
ax[0].set_ylabel('Count')
ax[0].set_title('Class Distribution (Counts)')
ax[0].set_xticklabels(['No Diabetes', 'Diabetes'], rotation=0)

# Proportion plot (sort_index() keeps classes in 0, 1 order so the tick labels match)
y.value_counts(normalize=True).sort_index().plot(kind='bar', ax=ax[1], color=['skyblue', 'salmon'])
ax[1].set_xlabel('Diabetes Status')
ax[1].set_ylabel('Proportion')
ax[1].set_title('Class Distribution (Proportions)')
ax[1].set_xticklabels(['No Diabetes', 'Diabetes'], rotation=0)

plt.tight_layout()
plt.savefig('../outputs/figures/class_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"\nImbalance ratio: {distribution['imbalance_ratio']:.2f}:1")
print("This means non-diabetic cases outnumber diabetic cases significantly.")

## 4. Feature Distributions

Looking at a few key health indicators.

In [None]:
# Select key features to visualize
key_features = ['BMI', 'GenHlth', 'Age', 'HighBP', 'HighChol']

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.ravel()

for idx, feature in enumerate(key_features):
    df[feature].hist(bins=30, ax=axes[idx], edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'{feature} Distribution')
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Frequency')

# Remove extra subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.savefig('../outputs/figures/feature_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

## 5. Feature Correlation with Target

Which features are most associated with diabetes?
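
Sorting correlations in descending order surfaces only positive associations. A feature that is strongly negatively correlated with the target can be just as informative. The sketch below, on synthetic data with hypothetical column names (`pos_feature`, `neg_feature`, `noise`), ranks features by absolute correlation instead. With a binary target, the Pearson correlation computed here is the point-biserial correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
target = rng.integers(0, 2, size=n)

# Synthetic features: one rises with the target, one falls, one is unrelated.
toy = pd.DataFrame({
    "target": target,
    "pos_feature": target + rng.normal(0, 1, size=n),
    "neg_feature": -target + rng.normal(0, 1, size=n),
    "noise": rng.normal(0, 1, size=n),
})

# Correlation of each feature column with the target.
corr = toy.drop(columns="target").corrwith(toy["target"])

# Order by strength of association, keeping the original sign.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```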

In [None]:
# Correlation of each feature with the target (the target's self-correlation is dropped)
correlations = (
    df.corr()['Diabetes_binary']
    .drop('Diabetes_binary')
    .sort_values(ascending=False)
)

# Plot the 10 most positively correlated features
plt.figure(figsize=(10, 8))
correlations.head(10).plot(kind='barh', color='steelblue')
plt.xlabel('Correlation with Diabetes')
plt.title('Top 10 Features Positively Correlated with Diabetes')
plt.tight_layout()
plt.savefig('../outputs/figures/feature_correlations.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nTop 5 positively correlated features:")
print(correlations.head(5))

## Summary

**Key Findings**:
1. Dataset has no missing values (pre-cleaned)
2. Significant class imbalance: non-diabetic cases outnumber diabetic cases
3. Features like GenHlth, HighBP, and BMI show correlation with diabetes
4. Data needs minimal preprocessing before modeling, apart from handling the class imbalance

**Implications**:
- Class imbalance will need to be addressed in classification
- Self-reported health indicators may contain bias
- The dataset represents survey responses, not clinical diagnoses
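
One common way to address the class imbalance noted above is to weight classes inversely to their frequency. The sketch below uses hypothetical labels (a 900/100 split, not this dataset's actual counts) and the "balanced" heuristic `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn applies for `class_weight='balanced'`. It is one option among several, such as resampling, and not necessarily what the modeling step of this project uses.

```python
import numpy as np

# Hypothetical labels: 900 negatives (0) and 100 positives (1).
y_toy = np.array([0] * 900 + [1] * 100)

# Count samples per class.
classes, counts = np.unique(y_toy, return_counts=True)

# Balanced weights: rarer classes receive proportionally larger weights.
weights = len(y_toy) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

print(class_weight)
```

The resulting dictionary can be passed to estimators that accept per-class weights, so that errors on the minority class count more during training.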