<a href="https://colab.research.google.com/github/OscarTMa/heart-disease-classification./blob/main/notebooks/Heart_Disease_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Heart Disease Classification

This notebook performs an exploratory data analysis (EDA) on the **Heart Disease UCI** dataset. The goal is to identify patterns and relationships in the data to aid in building a classification model.

## 1. Importing Libraries
Let's start by importing the required Python libraries.

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Settings for visualizations
sns.set_theme(style='whitegrid')  # set_theme is the current name; sns.set is an alias
plt.rcParams['figure.figsize'] = (10, 6)

## 2. Loading the Dataset
The dataset can be downloaded from Kaggle. Either upload `heart.csv` manually (for example, into the Colab session) or place it in the `data/` directory so that the relative path below resolves.

In [2]:
# Load the dataset
file_path = 'data/heart.csv'  # Relative path for GitHub compatibility
df = pd.read_csv(file_path)

# Display the first few rows
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data/heart.csv'
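
The `FileNotFoundError` above means that `data/heart.csv` did not resolve from the notebook's working directory. One way to make the loading step more forgiving is sketched below. The helper name `load_heart_data` and the idea of passing a list of candidate paths are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_heart_data(candidates):
    """Return the first existing CSV among `candidates` as a DataFrame.

    `candidates` is a hypothetical list of places the file might live.
    """
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return pd.read_csv(path)
    # Nothing matched: fail with a message that lists every path tried.
    tried = ', '.join(str(c) for c in candidates)
    raise FileNotFoundError(f'heart.csv not found; tried: {tried}')
```

A call such as `df = load_heart_data(['data/heart.csv', '../data/heart.csv', 'heart.csv'])` would cover running the notebook from the repository root, from a `notebooks/` subfolder, or from a Colab session with a manually uploaded file. Which of these applies here is a guess.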

## 3. Data Overview
### 3.1. Dataset Shape
Check how many rows and columns the dataset has.

In [None]:
# Shape of the dataset
print(f'Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.')

### 3.2. Inspecting Columns
Check the data types and non-null counts, then summarize each numeric column.

In [None]:
# Dataset information
df.info()

# Statistical summary
df.describe()
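
`df.describe()` summarizes every numeric column as if it were continuous, but integer-coded categories (a binary `target`, for instance) are better read as counts. A rough way to tell the two kinds of column apart is to count distinct values per column. The sketch below uses a made-up toy frame in place of `df`; the column names other than `target` and the cutoff `MAX_LEVELS` are assumptions.

```python
import pandas as pd

# Toy stand-in for `df`; names and values are made up for illustration.
toy = pd.DataFrame({
    'continuous_a': [63, 37, 41, 56, 57, 44],
    'continuous_b': [233, 250, 204, 236, 354, 263],
    'binary_flag': [1, 1, 0, 1, 0, 1],
    'target': [1, 1, 1, 0, 0, 0],
})

MAX_LEVELS = 3  # assumed cutoff; tune it for the real data

# Number of distinct values in each column
unique_counts = toy.nunique()

categorical_like = unique_counts[unique_counts <= MAX_LEVELS].index.tolist()
continuous_like = unique_counts[unique_counts > MAX_LEVELS].index.tolist()

print('Categorical-like:', categorical_like)
print('Continuous-like:', continuous_like)
```

On the real dataset, replace `toy` with `df` and raise the cutoff to something that fits the number of category levels actually present.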

## 4. Handling Missing Values
Count the missing values in each column so that any gaps can be handled before modeling.

In [None]:
# Checking for missing values
missing_values = df.isnull().sum()
print(f'Missing values per column:\n{missing_values}')
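
The cell above only counts missing values. If any turn up, one simple way to handle them is to fill numeric columns with the median and non-numeric columns with the most frequent value. This is a sketch of that strategy on a made-up toy frame, not necessarily what this dataset needs; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; names and values are made up for illustration.
toy = pd.DataFrame({
    'numeric_col': [233.0, np.nan, 204.0, 236.0],
    'category_col': ['normal', 'fixed', None, 'normal'],
})

filled = toy.copy()
for column in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[column]):
        # The median is less sensitive to outliers than the mean.
        filled[column] = filled[column].fillna(filled[column].median())
    else:
        # Use the most frequent value for non-numeric columns.
        filled[column] = filled[column].fillna(filled[column].mode().iloc[0])

print(filled.isnull().sum())
```

Dropping the affected rows with `df.dropna()` is a reasonable alternative when only a handful of rows are incomplete.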

## 5. Data Visualization
Exploring the relationships between variables and the target.

- Distribution of numerical variables.
- Correlation heatmap.
- Distribution of the target variable (class balance).

In [None]:
# Distribution of numerical variables
numerical_features = df.select_dtypes(include='number').columns  # all numeric dtypes, not only 64-bit
df[numerical_features].hist(bins=15, figsize=(15, 10), color='steelblue', edgecolor='black')
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
correlation_matrix = df.corr(numeric_only=True)  # skip non-numeric columns, which raise an error in pandas 2.x
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
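
A full heatmap can be hard to read when the question is simply which features move with the target. A small sketch of one way to answer it is to take the `target` column of the correlation matrix and sort it by absolute value. The toy frame and its `feature_*` names are made up for illustration; only `target` comes from the notebook.

```python
import pandas as pd

# Toy stand-in for `df`; names and values are made up for illustration.
toy = pd.DataFrame({
    'feature_a': [1, 2, 3, 4, 5, 6],
    'feature_b': [6, 5, 4, 3, 2, 1],
    'feature_c': [1, 3, 2, 3, 1, 2],
    'target': [0, 0, 0, 1, 1, 1],
})

target_corr = (
    toy.corr(numeric_only=True)['target']
    .drop('target')  # a column's correlation with itself is always 1
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strong negative correlations near the top instead of burying them at the bottom. Pearson correlation only captures linear association, so a low value here does not rule a feature out.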

In [None]:
# Class balance of the target variable
sns.countplot(data=df, x='target', palette='viridis')
plt.title('Target Variable Distribution')
plt.show()
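
The count plot shows how many samples fall into each class. To see how a feature relates to the target, a quick numeric complement is to compare class shares and per-class feature means. The sketch below does this on a made-up toy frame; `feature_a` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-in for `df`; names and values are made up for illustration.
toy = pd.DataFrame({
    'feature_a': [50, 60, 70, 40, 45, 35],
    'target': [1, 1, 1, 0, 0, 0],
})

# Share of each class, as a fraction of all rows
class_share = toy['target'].value_counts(normalize=True)

# Mean of every feature within each class
group_means = toy.groupby('target').mean()

print(class_share)
print(group_means)
```

A large gap between the per-class means of a feature suggests it may help separate the classes. A strongly uneven class share suggests that accuracy alone may be a misleading metric later on.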

## 6. Next Steps
- Feature engineering and preprocessing.
- Building classification models.
- Evaluating model performance.
