# Module 2: Exploratory Data Analysis (EDA)

**Course**: End-to-End Machine Learning (DataCamp)  
**Case Study**: CardioCare Heart Disease Prediction  
**Author**: Seif

---

## Overview

In this module, we explore the Exploratory Data Analysis (EDA) stage:
1. Understanding our data structure
2. Checking class balance
3. Identifying missing values
4. Detecting outliers
5. Visualizing data patterns

**EDA is critical** — it helps us understand the dataset and identify issues that could affect model performance.

---

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

%matplotlib inline

---

## Hands-On Exercise: Basic EDA

Let's practice the fundamental EDA techniques with actual code. The cells below assume `heart_disease_df` is already loaded as a pandas DataFrame.

In [None]:
# Print the first 5 rows of the DataFrame
print(heart_disease_df.head())

In [None]:
# Print information about the DataFrame
# (.info() prints its summary directly and returns None, so no print() is needed)
heart_disease_df.info()

In [None]:
# Visualize the cholesterol column
heart_disease_df['chol'].plot(kind='hist')

# Set the title and axis labels
plt.title('Cholesterol distribution')
plt.xlabel('Cholesterol')
plt.ylabel('Frequency')
plt.show()

### What This Exercise Shows:

1. **`.head()`** — Quick snapshot of the first 5 rows to understand data structure
2. **`.info()`** — Summary of data types, non-null counts, and memory usage
3. **Histogram visualization** — Shows distribution of cholesterol values
   - Helps judge whether values look roughly normal or skewed
   - Can reveal outliers or unusual patterns
   - Shows the frequency of different cholesterol ranges

---

## What is EDA?

**Exploratory Data Analysis (EDA)** is the process of examining and analyzing data to:
- Gain insights and discover patterns
- Understand the characteristics of the data
- Identify issues that could affect model performance

EDA helps us answer critical questions:
- Do men have higher rates of heart disease than women?
- Does data fall within acceptable ranges?
- Are there any unexpected patterns or anomalies?
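
As a minimal sketch of how the first two questions translate into pandas, the snippet below uses a tiny synthetic table. The values are made up for illustration and say nothing about real patients; the column names and the encodings (`sex`: 1 = male, 0 = female; `target`: 1 = disease) are assumptions.

```python
import pandas as pd

# Synthetic toy data for illustration only; NOT real CardioCare records.
# Assumed encodings: sex 1 = male, 0 = female; target 1 = disease, 0 = no disease.
toy_df = pd.DataFrame({
    "age":    [63, 45, 58, 51, 39, 70, 48, 66],
    "sex":    [1, 0, 1, 0, 0, 1, 1, 0],
    "target": [1, 0, 1, 0, 0, 1, 0, 1],
})

# The mean of a 0/1 target within each group is the share of patients with disease.
disease_rate_by_sex = toy_df.groupby("sex")["target"].mean()
print(disease_rate_by_sex)

# Range check: the minimum and maximum show whether values look plausible.
age_range = toy_df["age"].agg(["min", "max"])
print(age_range)
```

On real data, a difference in group rates is only a starting point for a hypothesis, not evidence of a causal effect.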

---

## 1. Understanding Our Data

We'll use pandas methods to get a quick overview of the dataset:
- **`.head()`** — shows the first few rows (snapshot of the data)
- **`.info()`** — summary of DataFrame (non-null entries, data types per column)

In [None]:
# Example: Load a sample heart disease dataset (placeholder)
# In practice, you would load the CardioCare clinic data here

# For demonstration, let's create a sample dataset structure
# You'll replace this with actual data loading: df = pd.read_csv('heart_disease.csv')

# Sample column names based on typical heart disease datasets:
# age, sex, chest_pain_type, resting_bp, cholesterol, fasting_bs, 
# resting_ecg, max_heart_rate, exercise_angina, oldpeak, 
# st_slope, target (0 = no disease, 1 = disease)

# Uncomment when you have actual data:
# df = pd.read_csv('data/heart_disease.csv')

print("Data loading placeholder - replace with actual CardioCare data")

In [None]:
# View the first few rows of the dataset
# df.head()

# Example output:
# Shows first 5 rows with all columns (age, sex, cholesterol, etc.)

In [None]:
# Get summary information about the DataFrame
# df.info()

# This shows:
# - Number of entries (rows)
# - Number of non-null values per column
# - Data type of each column (int, float, object)
# - Memory usage
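
Until the real data is available, a runnable stand-in can help you try `.head()` and `.info()`. The sketch below generates a synthetic DataFrame; the column names follow the sample list above, but every value and distribution parameter is an arbitrary assumption, not CardioCare data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; all distributions below are arbitrary assumptions.
rng = np.random.default_rng(seed=0)
n_rows = 200

demo_df = pd.DataFrame({
    "age": rng.integers(29, 78, size=n_rows),           # integers from 29 to 77
    "sex": rng.integers(0, 2, size=n_rows),             # 0 or 1
    "cholesterol": rng.normal(245, 50, size=n_rows).round(),
    "oldpeak": rng.exponential(1.0, size=n_rows).round(1),
    "target": rng.integers(0, 2, size=n_rows),          # 0 = no disease, 1 = disease
})

print(demo_df.head())   # first 5 rows
demo_df.info()          # prints dtypes, non-null counts, memory usage
print(demo_df.shape)    # (rows, columns)
```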

## 2. Class (Im)balance

**Class imbalance** occurs when one class has significantly more samples than another.

### Why it matters:
- Can cause the model to always predict the majority class
- Reduces model's ability to detect minority class (e.g., patients with heart disease)
- Affects performance metrics
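
To make the first two points concrete, here is a small sketch with hypothetical labels (500 negative, 300 positive) and a "model" that always predicts the majority class. Accuracy looks acceptable while recall on the disease class is zero.

```python
import numpy as np

# Hypothetical labels: 500 patients without disease (0), 300 with disease (1).
y_true = np.array([0] * 500 + [1] * 300)

# A degenerate "model" that always predicts the majority class (0).
y_pred = np.zeros_like(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = (y_pred == y_true).mean()

# Recall for class 1: share of actual disease cases that were detected.
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_positives / (y_true == 1).sum()

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

This is why accuracy alone can be misleading on imbalanced data.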

### How to check:
Use **`.value_counts()`** to count occurrences of each class.

In [None]:
# Count the number of patients with and without heart disease
# df['target'].value_counts()

# Example output:
# 0    500  (no heart disease)
# 1    300  (heart disease)
# Name: target, dtype: int64
# (pandas >= 2.0 prints "Name: count" instead of "Name: target")

In [None]:
# Get proportions instead of counts
# df['target'].value_counts(normalize=True)

# Example output:
# 0    0.625  (62.5% no heart disease)
# 1    0.375  (37.5% heart disease)
# Name: target, dtype: float64
# (pandas >= 2.0 prints "Name: proportion" instead of "Name: target")

# A balanced dataset would have close to a 50/50 split
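
A runnable version of the two checks above, sketched on a hypothetical label column with the same 500/300 split. The `summary` table and `imbalance_ratio` are illustrative additions, not part of the course material.

```python
import pandas as pd

# Hypothetical target column: 500 patients without disease, 300 with disease.
target = pd.Series([0] * 500 + [1] * 300, name="target")

counts = target.value_counts()                      # absolute counts per class
proportions = target.value_counts(normalize=True)   # shares that sum to 1

# Combine both views into one table, aligned on the class label.
summary = pd.DataFrame({"count": counts, "proportion": proportions})
print(summary)

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```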

## 3. Missing Values

**Missing values** can bias results and affect model performance.

### Example Bias Scenario:
If we have less information about healthier patients (shorter screenings), this might bias results toward sicker patients.
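
One way to look for this kind of bias is to compare the missing-value rate across classes. The sketch below uses a made-up table in which `oldpeak` is missing more often for patients without disease; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic example: oldpeak is missing more often when target == 0.
toy_df = pd.DataFrame({
    "target":  [0, 0, 0, 0, 1, 1, 1, 1],
    "oldpeak": [np.nan, 0.0, np.nan, 0.5, 2.3, 1.4, np.nan, 3.1],
})

# Share of missing oldpeak values within each class.
missing_rate_by_class = toy_df["oldpeak"].isnull().groupby(toy_df["target"]).mean()
print(missing_rate_by_class)
```

A large gap between the classes suggests the data is not missing at random, which matters when choosing how to impute or drop rows.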

### How to check:
Use **`.isnull()`** to detect missing data.

In [None]:
# Check for missing values in a specific column (e.g., oldpeak, the exercise-induced ST depression)
# df['oldpeak'].isnull()

# Returns True if value is null, False otherwise
# Example output:
# 0      False
# 1      False
# 2      True   <- missing value
# 3      False
# ...
# Name: oldpeak, dtype: bool

In [None]:
# Check whether ALL values in a column are missing
# df['oldpeak'].isnull().all()

# Returns True only if every value is null, False otherwise
# (use df['oldpeak'].notnull().all() to confirm that no values are missing)

In [None]:
# Check missing values across entire DataFrame
# df.isnull().sum()

# Example output:
# age                  0
# sex                  0
# cholesterol         15
# oldpeak              8
# target               0
# dtype: int64

# This shows how many missing values exist in each column
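
The checks above can be run together on a small synthetic table. This is a sketch only: the `notes` column is hypothetical and exists just to show a column that is entirely empty.

```python
import numpy as np
import pandas as pd

# Synthetic data; "notes" is a hypothetical, completely empty column.
toy_df = pd.DataFrame({
    "age": [63, 45, 58, 51],
    "cholesterol": [233.0, np.nan, 250.0, np.nan],
    "notes": [np.nan, np.nan, np.nan, np.nan],
})

missing_counts = toy_df.isnull().sum()        # number of missing values per column
missing_pct = toy_df.isnull().mean() * 100    # percentage missing per column
has_any_missing = toy_df.isnull().any()       # True if at least one value is missing
entirely_missing = toy_df.isnull().all()      # True only if every value is missing

print(pd.DataFrame({
    "missing": missing_counts,
    "pct": missing_pct,
    "any": has_any_missing,
    "all": entirely_missing,
}))
```

Percentages are often easier to act on than raw counts, since 15 missing values means something different in 100 rows than in 100,000.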

## 4. Outliers

**Outliers** are data points significantly different from other observations.

### Causes:
- Measurement errors
- Data entry errors  
- Rare events (sometimes legitimate!)

### Example:
Patient age = 500 → clearly anomalous
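
Values like this can be caught with a simple domain range check before any statistical method is applied. In the sketch below, the bounds of 0 and 120 years are an assumed plausible range, and the ages are made up.

```python
import pandas as pd

# Made-up ages, including two impossible values.
ages = pd.Series([54, 61, 500, 47, -3, 38], name="age")

# Assumed plausible range for a patient's age in years (hypothetical bounds).
MIN_AGE, MAX_AGE = 0, 120

# between() is inclusive on both ends; ~ selects the values outside the range.
implausible = ages[~ages.between(MIN_AGE, MAX_AGE)]
print(implausible)
```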

### Impact:
- Can skew model performance
- Cause model to learn from extreme values that aren't representative
- May reduce generalization

### When to keep outliers:
Keep outliers when they represent rare but **valid** events (e.g., extremely high cholesterol in a sick patient).

### Detection methods:
- **Box plots** — visualize distribution and outliers
- **Interquartile Range (IQR)** — statistical method to identify outliers

In [None]:
# Detect outliers using box plot
# df.boxplot(column='age')
# plt.title('Age Distribution - Outlier Detection')
# plt.ylabel('Age')
# plt.show()

# Box plot shows:
# - Median (middle line)
# - Quartiles (box edges)
# - Outliers (dots beyond whiskers)

In [None]:
# Detect outliers using Interquartile Range (IQR) method
# Q1 = df['age'].quantile(0.25)
# Q3 = df['age'].quantile(0.75)
# IQR = Q3 - Q1

# Define outlier bounds
# lower_bound = Q1 - 1.5 * IQR
# upper_bound = Q3 + 1.5 * IQR

# Find outliers
# outliers = df[(df['age'] < lower_bound) | (df['age'] > upper_bound)]
# print(f"Number of outliers in age: {len(outliers)}")
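
The same IQR rule as a runnable sketch, wrapped in a small helper. The function name `iqr_bounds` and the sample ages are illustrative assumptions; the 1.5 multiplier is the conventional choice used above.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds using the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ages with one obvious data-entry error.
ages = pd.Series([29, 41, 45, 52, 54, 58, 61, 67, 500], name="age")

lower, upper = iqr_bounds(ages)
outliers = ages[(ages < lower) | (ages > upper)]

print(f"bounds: [{lower}, {upper}]")
print(f"Number of outliers in age: {len(outliers)}")
```

The rule only flags candidates; whether a flagged value is an error or a valid rare event still needs domain judgment.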

## 5. Visualizing Our Data

**Visualizations** make it easy to:
- See general trends
- Spot missing values
- Identify outliers
- Understand distributions

We can use pandas **`.plot()`** method or dedicated libraries like **Seaborn**.

In [None]:
# Visualize age distribution using pandas
# df['age'].plot(kind='hist', bins=20, title='Age Distribution', edgecolor='black')
# plt.xlabel('Age')
# plt.ylabel('Frequency')
# plt.show()

In [None]:
# Visualize correlations between features using heatmap
# plt.figure(figsize=(10, 8))
# sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)  # numeric_only skips text columns (pandas >= 1.5)
# plt.title('Feature Correlation Heatmap')
# plt.show()

# This helps identify:
# - Which features are highly correlated with target
# - Multicollinearity between features
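
Alongside the heatmap, it can help to rank features by their correlation with the target. The sketch below does this on synthetic data in which `target` is deliberately constructed to depend on `age`, so the ranking is known in advance. The `sex_label` column is a hypothetical text column included to show why `numeric_only=True` is needed.

```python
import numpy as np
import pandas as pd

# Synthetic data; the age/target relationship is built in on purpose.
rng = np.random.default_rng(seed=0)
n_rows = 300
age = rng.integers(29, 78, size=n_rows)
target = (age + rng.normal(0, 10, size=n_rows) > 55).astype(int)

demo_df = pd.DataFrame({
    "age": age,
    "cholesterol": rng.normal(245, 50, size=n_rows),    # unrelated to target here
    "sex_label": rng.choice(["F", "M"], size=n_rows),   # text column, not numeric
    "target": target,
})

# numeric_only=True skips the text column instead of raising an error.
corr = demo_df.corr(numeric_only=True)

# Rank features by the absolute strength of their correlation with the target.
corr_with_target = corr["target"].drop("target").sort_values(key=abs, ascending=False)
print(corr_with_target)
```

Pearson correlation only captures linear relationships, so a low value does not rule out a useful feature.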

In [None]:
# Visualize class distribution
# df['target'].value_counts().plot(kind='bar', color=['green', 'red'])
# plt.title('Heart Disease Distribution')
# plt.xlabel('Target (0 = No Disease, 1 = Disease)')
# plt.ylabel('Count')
# plt.xticks(rotation=0)
# plt.show()

## 6. Goals of EDA

EDA serves multiple critical purposes:

### 1. Understand the Data
- Identify patterns and trends
- Example: Do men have higher rates of heart disease than women?

### 2. Detect Outliers
- Find data points outside acceptable ranges
- Determine if they're errors or legitimate rare events

### 3. Design Hypotheses
- Validate assumptions
- Check if expectations align with reality

### 4. Inform Downstream Decisions
EDA influences:
- **Choice of ML algorithm** (e.g., linear vs. non-linear models)
- **Feature selection** (which variables to include)
- **Feature engineering** (creating new features from existing ones)
- **Data preprocessing strategy** (how to handle missing values, outliers, scaling)

**Remember**: EDA is vital to the future success of the project!

---

## Key Takeaways

1. **EDA is Critical**: It helps identify issues before they affect model performance
2. **Check Class Balance**: Imbalanced data can cause models to always predict the majority class
3. **Missing Values Matter**: They can introduce bias and reduce model accuracy
4. **Outliers Need Attention**: Decide whether to keep, remove, or transform them based on context
5. **Visualizations are Powerful**: They make patterns, trends, and issues immediately visible
6. **EDA Informs Everything**: Insights from EDA guide algorithm choice, feature selection, and preprocessing

---

## Practice Exercises

When you have real CardioCare data:
1. Load the dataset and use `.head()` and `.info()` to understand structure
2. Check class balance with `.value_counts(normalize=True)`
3. Identify missing values with `.isnull().sum()`
4. Detect outliers using box plots and IQR method
5. Create visualizations (histograms, correlation heatmaps, bar charts)
6. Answer: Do certain demographics have higher heart disease rates?

---

## References
- DataCamp: End-to-End Machine Learning Course
- Video 2: Exploratory Data Analysis
- [Seaborn Visualization Tutorial](https://seaborn.pydata.org/tutorial/distributions.html)
- [DataCamp: Intermediate Data Visualization with Seaborn](https://app.datacamp.com/learn/courses/intermediate-data-visualization-with-seaborn)

# Module 2: Data Preparation

**Course**: End-to-End Machine Learning (DataCamp)  
**Case Study**: CardioCare Heart Disease Prediction  
**Author**: Seif

---

## Overview

This notebook will cover:
1. Loading patient health data
2. Exploratory Data Analysis (EDA)
3. Data cleaning and preprocessing
4. Feature engineering
5. Handling missing values and outliers

**Status**: Placeholder — to be completed after next video

---

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

%matplotlib inline

## Coming Soon

This module will be populated after the next video on data preparation.