# Exploratory Data Analysis (EDA)

In this notebook, we perform exploratory data analysis on two healthcare datasets: the Pima Indians diabetes dataset and a heart disease dataset. The goal is to understand the feature distributions and correlations, and to gather initial insights that can guide further analysis and modeling.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style (set_theme is the current name; sns.set is an alias for it)
sns.set_theme(style='whitegrid')

In [None]:
# Load the datasets
diabetes_data = pd.read_csv('../data/raw/pima-indians-diabetes.csv')
heart_disease_data = pd.read_csv('../data/raw/heart_disease.csv')

# Display the first few rows of the diabetes dataset
diabetes_data.head()

In [None]:
# Display the first few rows of the heart disease dataset
heart_disease_data.head()

In [None]:
# Summary statistics for diabetes dataset
diabetes_data.describe()

In [None]:
# Summary statistics for heart disease dataset
heart_disease_data.describe()
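
`describe()` reports counts and quantiles, but it does not flag placeholder values. In some medical datasets a zero in a physiological measurement stands in for a missing reading; whether that applies to the files loaded above is an assumption to verify, not something this notebook establishes. The sketch below uses a hypothetical helper, `missing_overview`, and a toy frame with made-up column names to show one way to list NaN and zero counts per column.

```python
import numpy as np
import pandas as pd


def missing_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, NaN count, and count of exact zeros (hypothetical helper)."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
        'n_zero': (df == 0).sum(),  # NaN == 0 is False, so NaNs are not counted here
    })


# Toy frame standing in for a loaded dataset; the column names are illustrative only.
toy = pd.DataFrame({
    'feature_a': [1.0, 0.0, np.nan, 4.0],
    'feature_b': [0, 0, 3, 5],
})
overview = missing_overview(toy)
print(overview)
```

The same helper could be called as `missing_overview(diabetes_data)` or `missing_overview(heart_disease_data)`. A high zero count is only a prompt to check the data dictionary, since zero is a legitimate value for many columns.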

In [None]:
# Visualize the distribution of the target variable in the diabetes dataset
plt.figure(figsize=(8, 5))
sns.countplot(x='Outcome', data=diabetes_data)
plt.title('Distribution of Diabetes Outcome')
plt.xlabel('Diabetes Outcome (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

In [None]:
# Visualize the distribution of the target variable in the heart disease dataset
plt.figure(figsize=(8, 5))
sns.countplot(x='target', data=heart_disease_data)
plt.title('Distribution of Heart Disease Outcome')
plt.xlabel('Heart Disease Outcome (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()
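
The count plots show class balance visually. The exact proportions are often useful as well, for example when choosing evaluation metrics later. A minimal sketch follows, using a hypothetical helper `class_balance` and a toy label series rather than the real target columns.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Counts and shares per class, ordered by class label (hypothetical helper)."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame({'count': counts, 'share': counts / counts.sum()})


# Toy labels for illustration; not taken from either dataset.
toy_labels = pd.Series([0, 0, 0, 1], name='label')
balance = class_balance(toy_labels)
print(balance)
```

Assuming the target columns are named as in the plots above, the helper could be applied as `class_balance(diabetes_data['Outcome'])` and `class_balance(heart_disease_data['target'])`.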

In [None]:
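# Correlation heatmap for diabetes dataset
# Restrict to numeric columns so corr() does not fail if a non-numeric column is present
plt.figure(figsize=(10, 8))
sns.heatmap(diabetes_data.select_dtypes(include='number').corr(), annot=True, fmt='.2f', cmap='coolwarm', square=True)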
plt.title('Correlation Heatmap for Diabetes Dataset')
plt.show()

In [None]:
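# Correlation heatmap for heart disease dataset
# Restrict to numeric columns so corr() does not fail if a non-numeric column is present
plt.figure(figsize=(10, 8))
sns.heatmap(heart_disease_data.select_dtypes(include='number').corr(), annot=True, fmt='.2f', cmap='coolwarm', square=True)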
plt.title('Correlation Heatmap for Heart Disease Dataset')
plt.show()
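
A full heatmap can be hard to scan when the main question is which features relate most strongly to the target. One option is to pull out the target's row of the correlation matrix and sort it by absolute value. The helper name `rank_by_target_correlation` and the toy data below are assumptions for illustration. Pearson correlation captures only linear association, so treat this as a rough first pass.

```python
import pandas as pd


def rank_by_target_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric feature with `target`,
    ordered by absolute value, strongest first (hypothetical helper)."""
    corr = df.select_dtypes(include='number').corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data for illustration; not taken from either dataset.
toy = pd.DataFrame({
    'strong':   [1, 2, 3, 4, 5],
    'negative': [5, 5, 4, 1, 1],
    'weak':     [3, 1, 2, 3, 1],
    'label':    [0, 0, 1, 1, 1],
})
ranked = rank_by_target_correlation(toy, 'label')
print(ranked)
```

Assuming the target column names used in the count plots, this could be applied as `rank_by_target_correlation(diabetes_data, 'Outcome')` or `rank_by_target_correlation(heart_disease_data, 'target')`.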

## Conclusion

In this notebook, we performed exploratory data analysis on the diabetes and heart disease datasets. We previewed and summarized each dataset, visualized the distributions of the target variables, and examined the correlations between features. These findings will inform the preprocessing and modeling steps in subsequent notebooks.