# Exploratory Data Analysis (EDA) for Diabetes Prediction

In this notebook, we will perform exploratory data analysis (EDA) on the diabetes dataset. The goal is to understand the dataset, identify patterns, and prepare the data for further analysis and model training.


Let's start by importing the necessary libraries and loading the dataset.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualisation style
sns.set(style="whitegrid")

# Load the dataset
file_path = '../data/raw/diabetes.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
df.head()

In [None]:
# Inspect the distinct values recorded for 'Pregnancies'
df['Pregnancies'].unique()

In [None]:
# Display basic information about the dataset
df.info()



In [None]:
# Check for missing values
df.isnull().sum()
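
`isnull()` only counts explicit NaN values. If this file follows the layout of the widely used Pima Indians diabetes data (an assumption based on the column names, not something this notebook states), a literal 0 in columns such as `Glucose` or `BMI` is physiologically implausible and may stand in for a missing reading. A minimal sketch of counting such zeros follows. The helper name `count_zero_placeholders`, the column list, and the toy values are all illustrative.

```python
import pandas as pd

# Columns where an exact 0 is treated as a likely placeholder (assumption, see above).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def count_zero_placeholders(frame, columns=ZERO_AS_MISSING):
    """Count exact zeros in each listed column that is present in `frame`."""
    present = [c for c in columns if c in frame.columns]
    return (frame[present] == 0).sum()

# Toy frame with made-up values, only to show the call; pass `df` in the notebook.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Insulin": [0, 0, 0, 94],
    "Outcome": [1, 0, 1, 0],
})
zero_counts = count_zero_placeholders(toy)
print(zero_counts)
```

In the notebook, `count_zero_placeholders(df)` would show whether zeros are frequent enough to need imputation before modelling. `Pregnancies` and `Outcome` are left out on purpose because 0 is a valid value for both.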



In [None]:
# Summary statistics for numerical features
df.describe()
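
Summary statistics do not show directly how the target classes are split, which matters for choosing evaluation metrics later. The sketch below tabulates class counts and shares. It assumes the target column is named `Outcome`, as in the plots in this notebook; `class_balance` is a hypothetical helper and the toy labels are made up.

```python
import pandas as pd

def class_balance(frame, target="Outcome"):
    """Return the count and share of each class in the target column."""
    counts = frame[target].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})

# Toy labels with made-up values; pass `df` in the notebook.
toy = pd.DataFrame({"Outcome": [0, 0, 0, 1, 0, 1, 0, 0]})
balance = class_balance(toy)
print(balance)
```

A strongly uneven split would suggest reporting precision and recall alongside accuracy.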

In [None]:
# Plot histograms for each feature
df.hist(figsize=(12, 10), bins=30, edgecolor='black')
plt.suptitle('Distribution of Features')
plt.show()
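
Histograms give a visual impression of shape; sample skewness puts a number on it. The following sketch ranks columns by absolute skewness and flags those above a cutoff. The cutoff of 1.0 is a common rule of thumb chosen here as an assumption, `skew_report` is a hypothetical helper, and the toy values are made up.

```python
import pandas as pd

def skew_report(frame, threshold=1.0):
    """Per-column sample skewness, sorted by magnitude, with a flag for |skew| > threshold."""
    skew = frame.skew(numeric_only=True)
    report = pd.DataFrame({"skew": skew, "flag": skew.abs() > threshold})
    order = skew.abs().sort_values(ascending=False).index
    return report.reindex(order)

# Toy frame with made-up values: one heavily right-skewed column, one symmetric column.
toy = pd.DataFrame({
    "Insulin": [0, 0, 0, 0, 0, 600],
    "Glucose": [90, 100, 110, 120, 130, 140],
})
report = skew_report(toy)
print(report)
```

Flagged columns are candidates for a log or similar transform before fitting models that are sensitive to skew.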

In [None]:
# Calculate the correlation matrix
correlation_matrix = df.corr()

# Plot heatmap of correlations
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
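
A heatmap with many cells can be hard to read at a glance. As a complement, the sketch below extracts only the correlations with the target and sorts them by absolute value. It assumes the target column is named `Outcome`; `correlation_with_target` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def correlation_with_target(frame, target="Outcome"):
    """Pearson correlation of each numeric feature with the target, strongest first."""
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy frame with made-up values; pass `df` in the notebook.
toy = pd.DataFrame({
    "Glucose": [85, 90, 150, 160, 95, 170],
    "BloodPressure": [60, 90, 80, 70, 85, 82],
    "Outcome": [0, 0, 1, 1, 0, 1],
})
ranked = correlation_with_target(toy)
print(ranked)
```

Pearson correlation only captures linear association, so a low value here does not rule out a useful non-linear relationship.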

In [None]:
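# Plot pairplot for the dataset
g = sns.pairplot(df, hue='Outcome', palette='husl')
# pairplot returns a grid of axes, so title the whole figure;
# plt.title would only label the last subplot
g.fig.suptitle('Pairplot of Features', y=1.02)
plt.show()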

In [None]:
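# Plot boxplots for features grouped by 'Outcome'
# Select features by name so the loop does not depend on 'Outcome' being the last column
features = df.columns.drop('Outcome')
n_rows = (len(features) + 1) // 2  # two plots per row, however many features there are
plt.figure(figsize=(15, 10))
for i, column in enumerate(features):
    plt.subplot(n_rows, 2, i + 1)
    sns.boxplot(x='Outcome', y=column, data=df, palette='husl')
    plt.title(f'Boxplot of {column}')
plt.tight_layout()
plt.show()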
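
The boxplots compare classes visually. A small table of per-class medians makes the same comparison numeric. The sketch below assumes `Outcome` is coded as 0 and 1; `median_by_outcome` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def median_by_outcome(frame, target="Outcome"):
    """Median of each feature per class, plus the difference (class 1 minus class 0)."""
    medians = frame.groupby(target).median(numeric_only=True).T
    medians["diff"] = medians[1] - medians[0]
    return medians

# Toy frame with made-up values; pass `df` in the notebook.
toy = pd.DataFrame({
    "Glucose": [90, 100, 110, 150, 160, 170],
    "BMI": [25, 27, 29, 30, 32, 34],
    "Outcome": [0, 0, 0, 1, 1, 1],
})
by_class = median_by_outcome(toy)
print(by_class)
```

The raw differences depend on each feature's units, so they are best compared within a feature rather than across features.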

- **Distribution of Features**: The histograms show the shape of each feature's distribution. For instance, `Glucose` and `BMI` are spread over a wide range of values, while `Pregnancies` is concentrated at low counts.

- **Correlation Analysis**: The correlation heatmap indicates that some features are correlated with each other. For example, `Glucose` and `Insulin` are positively correlated; the annotated coefficient on the heatmap shows how strong that relationship is. Pairs with a high absolute coefficient carry overlapping information, which can inform feature selection and engineering.

- **Feature Relationships**: The pairplot highlights how features interact with the target variable `Outcome`. Notably, features like `Glucose` and `BMI` show clear patterns that differentiate between diabetic and non-diabetic cases.

- **Boxplots**: The boxplots illustrate how feature distributions vary with the target variable. Features such as `BloodPressure` and `SkinThickness` exhibit distinct distributions between diabetic and non-diabetic patients.

## Conclusion

This exploratory analysis covered the dataset's structure, missing values, feature distributions, correlations, and per-class differences. These observations guide the next steps in data preprocessing and model training.

