# UCI Heart Disease Dataset: Exploratory Data Analysis

<a id="toc"></a>
## Table of Contents
1. [Introduction](#introduction)
2. [Setup & Data Loading](#setup-and-data-loading)
3. [Dataset Overview](#dataset-overview)
4. [Univariate Analysis](#univariate-analysis)
5. [Target Variable Relationships](#target-variable-relationships)
6. [Correlation Analysis](#correlation-analysis)
7. [Bivariate Relationships](#bivariate-relationships)
8. [Feature Engineering Ideas](#feature-engineering-ideas)
9. [Conclusions & Modeling Preview](#conclusions-and-modeling-preview)

<a id='introduction'></a>
## Introduction
- Reference to Notebooks 1 & 2
- Objectives
- Research questions

### Context
Following the population analysis and data cleaning in Notebook 2, we now have a prepared dataset ready for in-depth exploratory analysis. [*The decision on how to proceed with the analysis will be recorded here*].

### Objective
This notebook explores relationships between features and heart disease severity, both to understand which factors are most strongly associated with cardiovascular disease and to generate hypotheses for predictive modeling.

### Research Questions
1. Which individual features show the strongest relationships with heart disease severity?
2. Are there non-linear relationships that need to be captured?
3. What interactions exist between predictor variables?
4. Do any features require transformation for modeling?
5. What patterns emerge that align with or challenge medical knowledge?

<a id='setup-and-data-loading'></a>
## Setup & Data Loading
- Import libraries
- Load cleaned dataset from Notebook 2
- Verify data quality and completeness
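
The verification step above could be sketched roughly as follows. This is a minimal sketch, not the notebook's actual loader: the file name in the comment is hypothetical, the column list assumes the standard UCI attribute names (the cleaned file from Notebook 2 may differ), and the small frame holds made-up values so the example runs on its own.

```python
import pandas as pd

# Assumed column names: the standard UCI Heart Disease attributes.
# The cleaned file from Notebook 2 may use different names.
EXPECTED_COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]


def check_quality(df: pd.DataFrame, expected_columns=EXPECTED_COLUMNS) -> dict:
    """Return a small report on schema, missing values, and duplicates."""
    return {
        "n_rows": len(df),
        "missing_columns": [c for c in expected_columns if c not in df.columns],
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# In the notebook this would be something like
#   df = pd.read_csv("heart_disease_cleaned.csv")  # hypothetical path
# Here a tiny made-up frame stands in so the sketch is self-contained.
demo = pd.DataFrame({
    "age": [63, 67, 67, 41],
    "chol": [233.0, 286.0, 286.0, None],
    "num": [0, 2, 2, 0],
})
report = check_quality(demo, expected_columns=["age", "chol", "num"])
print(report)
```

Returning a plain dictionary keeps the check easy to assert on at the top of the notebook, for example by stopping early if `missing_columns` is non-empty.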

<a id='dataset-overview'></a>
## Dataset Overview
- Final sample size
- Target variable distribution
- Summary statistics
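
One possible way to produce the three items above with pandas is shown below. The rows are made-up illustrative values, and `num` is assumed to be the name of the 0-4 severity target.

```python
import pandas as pd

# Made-up illustrative rows; `num` is assumed to be the severity target (0-4).
df = pd.DataFrame({
    "age": [63, 67, 41, 56, 62, 57],
    "chol": [233, 286, 204, 236, 268, 354],
    "num": [0, 2, 0, 0, 3, 1],
})

n_rows, n_cols = df.shape

# Counts and proportions per severity level, ordered by level.
target_counts = df["num"].value_counts().sort_index()
target_share = (target_counts / n_rows).round(3)

# Summary statistics for the predictors only.
summary = df.drop(columns="num").describe().T

print(f"Sample size: {n_rows} rows, {n_cols} columns")
print(pd.DataFrame({"count": target_counts, "share": target_share}))
print(summary[["mean", "std", "min", "max"]])
```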

<a id='univariate-analysis'></a>
## Univariate Analysis
- Distribution of each predictor variable
- Identify skewness and outliers
- Check whether any variables need transformation
- Visualizations: histograms, box plots
- Identify which variables are consistent vs. population-specific
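
The skewness and outlier checks could be summarized per variable along these lines. This is a sketch under assumptions: outliers are defined here by Tukey's 1.5 × IQR fences, which is only one common convention, and the demo values are made up.

```python
import pandas as pd


def univariate_report(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Skewness and Tukey-fence (1.5 * IQR) outlier counts per column."""
    rows = []
    for col in columns:
        s = df[col].dropna()
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        rows.append({
            "column": col,
            "skew": s.skew(),
            "n_outliers": int(((s < lower) | (s > upper)).sum()),
        })
    return pd.DataFrame(rows).set_index("column")


# Made-up illustrative values: one extreme cholesterol reading, evenly spaced ages.
demo = pd.DataFrame({
    "chol": [200, 210, 220, 230, 240, 250, 564],
    "age": [40, 45, 50, 55, 60, 65, 70],
})
uni = univariate_report(demo, ["chol", "age"])
print(uni)
```

A strongly positive `skew` together with a non-zero outlier count is a typical signal that a variable may be a candidate for a log-type transformation.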

<a id='target-variable-relationships'></a>
## Target Variable Relationships
- For each predictor:
  - Relationship with heart disease severity
  - Visualization (box plots for categorical, scatter/violin for continuous)
  - Statistical significance
- Rank variables by apparent association strength
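
A rough sketch of the ranking step for continuous predictors follows. It assumes Spearman rank correlation as the association measure, which is one reasonable choice for an ordinal 0-4 target rather than a method this notebook has committed to; the rows are made up, and `num` is the assumed target name.

```python
import pandas as pd

# Made-up illustrative rows; `num` is assumed to be the ordinal severity target.
df = pd.DataFrame({
    "age":     [41, 45, 52, 57, 60, 63, 66, 70],
    "thalach": [178, 170, 165, 150, 152, 140, 132, 120],
    "chol":    [240, 210, 250, 230, 220, 260, 215, 245],
    "num":     [0, 0, 1, 1, 2, 3, 3, 4],
})

# Spearman correlation equals Pearson correlation computed on ranks
# (ties receive average ranks), which suits an ordinal 0-4 target.
ranked = df.rank()
rho = ranked.drop(columns="num").corrwith(ranked["num"])

# Rank predictors by absolute association strength.
ranking = (
    rho.rename("spearman_rho")
       .to_frame()
       .assign(abs_rho=lambda t: t["spearman_rho"].abs())
       .sort_values("abs_rho", ascending=False)
)
print(ranking)
```

This covers association strength only. For the significance bullet, p-values could come from a dedicated test such as `scipy.stats.spearmanr` or `scipy.stats.kruskal`, and categorical predictors would need a different measure.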

<a id='correlation-analysis'></a>
## Correlation Analysis
- Correlation matrix (for continuous variables)
- Identify multicollinearity issues
- Visualization: heatmap
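
To make the multicollinearity check concrete, highly correlated predictor pairs can be listed directly from the correlation matrix. The 0.8 threshold and the helper name `high_correlation_pairs` are illustrative assumptions, and the demo values are made up.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().rename("r").reset_index()
    pairs.columns = ["feature_a", "feature_b", "r"]
    return pairs[pairs["r"].abs() > threshold].reset_index(drop=True)


# Made-up illustrative values: trestbps is nearly linear in age, chol is not.
demo = pd.DataFrame({
    "age": [40, 45, 50, 55, 60, 65],
    "trestbps": [118, 124, 129, 136, 141, 147],
    "chol": [250, 210, 260, 205, 255, 215],
})
flagged = high_correlation_pairs(demo, threshold=0.8)
print(flagged)
```

The same `df.corr()` matrix can be passed to a plotting function such as `seaborn.heatmap` for the visualization step.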

<a id='bivariate-relationships'></a>
## Bivariate Relationships
- Key predictor pairs
- Interaction effects
- Conditional relationships
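
One simple way to look for interaction effects is to compare cell means across two predictors. The sketch below uses made-up rows and assumes the usual UCI encodings (`sex`: 1 = male, 0 = female; `exang`: 1 = exercise-induced angina present).

```python
import pandas as pd

# Made-up illustrative rows with assumed UCI-style encodings.
df = pd.DataFrame({
    "sex":   [1, 1, 1, 1, 0, 0, 0, 0],
    "exang": [0, 0, 1, 1, 0, 0, 1, 1],
    "num":   [0, 1, 2, 4, 0, 0, 1, 1],
})

# Mean severity for every sex-by-exang cell.
cell_means = df.pivot_table(index="sex", columns="exang", values="num", aggfunc="mean")

# Change in mean severity associated with exang, computed within each sex.
# Clearly unequal differences hint at an interaction worth modeling.
exang_effect = cell_means[1] - cell_means[0]

print(cell_means)
print(exang_effect)
```

With real data, the cell counts should be checked alongside the means, since sparse cells can make apparent interactions unreliable.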

<a id='feature-engineering-ideas'></a>
## Feature Engineering Ideas
- Potential transformations (log, polynomial, binning)
- Interaction terms to create
- Domain-knowledge based features
- Rationale for each
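
The ideas above might translate into code along these lines. Everything here is illustrative rather than a decided feature set: the values are made up, the age cut points are arbitrary placeholders, and the heart-rate ratio uses the common `220 - age` rule of thumb purely as an example of a domain-motivated feature.

```python
import numpy as np
import pandas as pd

# Made-up illustrative rows using assumed UCI-style column names.
df = pd.DataFrame({
    "age": [35, 48, 59, 71],
    "oldpeak": [0.0, 1.2, 2.6, 4.0],
    "thalach": [180, 160, 140, 115],
    "exang": [0, 0, 1, 1],
})

features = df.assign(
    # log1p handles zeros and compresses a right-skewed variable.
    oldpeak_log=np.log1p(df["oldpeak"]),
    # Illustrative age bands; intervals are closed on the left.
    age_band=pd.cut(
        df["age"],
        bins=[0, 45, 60, 120],
        labels=["<45", "45-59", "60+"],
        right=False,
    ),
    # Simple product interaction between two predictors.
    oldpeak_x_exang=df["oldpeak"] * df["exang"],
    # Achieved heart rate relative to the 220 - age rule-of-thumb maximum.
    thalach_pct_max=df["thalach"] / (220 - df["age"]),
)
print(features)
```

Using `assign` keeps the original columns untouched, so each engineered feature can be compared against its source when documenting the rationale.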

<a id='conclusions-and-modeling-preview'></a>
## Conclusions & Modeling Preview
- Top features associated with heart disease
- Hypotheses for modeling
- Summary of exploratory findings
- Expected important features
- Transition to modeling phase
