
# 🔍 **10.1 Introduction: Exploratory Data Analysis (EDA)**

This chapter introduces **Exploratory Data Analysis (EDA)**: a systematic yet creative approach to understanding your data before formal modeling or inference.

EDA is an **iterative cycle**, not a one-time step.

---

## 🔁 The EDA Cycle

You continuously move through three stages:

1. **Ask questions** about your data  
2. **Search for answers** using visualization, transformation, and modeling  
3. **Refine or generate new questions** based on what you learn  

Each step informs the next, creating a feedback loop of discovery.

---

## 🧠 A Mindset, Not a Recipe

EDA doesn't follow strict rules or checklists. Instead, it's a **state of mind**:
- Explore freely
- Follow curiosity
- Expect dead ends
- Let insights emerge gradually

Early exploration is messy by design. Over time, you'll narrow in on the most meaningful patterns and relationships; those are what you eventually communicate to others.

---

## 🧹 EDA and Data Quality

Even when research questions are already defined, EDA is still essential because:
- You must **check data quality**
- You must confirm whether data **matches expectations**
- Data cleaning is itself a form of EDA  

To do this well, you rely on the same tools:
- **Visualization**
- **Transformation**
- **Modeling**

---

## 🧰 What You'll Use

In this chapter, EDA is performed by **combining**:
- **ggplot2** → to see patterns
- **dplyr** → to manipulate and summarize data

Together, these allow you to interactively ask questions, answer them, and iterate toward insight.


In [None]:
library(tidyverse)


# ❓ **10.2 Questions: The Heart of EDA**

> *“There are no routine statistical questions, only questionable statistical routines.”*  
> – **Sir David Cox**

> *“Far better an approximate answer to the right question … than an exact answer to the wrong question.”*  
> – **John Tukey**

Exploratory Data Analysis begins with **questions**. Questions focus your attention, guide your choice of visualizations and transformations, and help you uncover structure in your data.

---

## 🧭 Questions as Tools

When you ask a question, you:
- Narrow your focus to a specific aspect of the data
- Decide which plots, summaries, or models to use
- Create direction in what could otherwise be an overwhelming dataset

EDA isn't about immediately finding answers; it's about **learning where to look next**.

---

## 🎨 A Creative, Iterative Process

EDA is fundamentally **creative**:
- Good questions are hard to ask at the beginning
- Early questions often lead to dead ends
- Each new question reveals a new angle on the data

The key is **quantity**:
- Ask many questions
- Follow each answer with a new question
- Gradually zoom in on the most interesting patterns

This iterative questioning dramatically increases the chance of discovery.

---

## 🔍 Two Questions That Always Matter

While there are no strict rules for which questions to ask, two types are universally useful:

### 1️⃣ Variation
**What type of variation occurs within my variables?**  
- How are values distributed?
- Are there outliers or unusual patterns?
- Does the data behave as expected?

### 2️⃣ Covariation
**What type of covariation occurs between my variables?**  
- How do variables change together?
- Are there relationships or associations?
- Do patterns differ across groups?

---

## 🚀 What's Next

The rest of the chapter explores these two ideas, **variation** and **covariation**, and demonstrates practical ways to investigate them using visualization and transformation.


# 📊 **10.3 Variation: Understanding How Variables Change**

Variation describes how the values of a variable **change across measurements, observations, or time**. Every variable has a pattern of variation, and understanding that pattern is a key step in Exploratory Data Analysis (EDA).

---

## 🔍 Visualizing Variation

The best way to understand variation is to examine the **distribution** of a variable's values.

- For **numerical variables** → use histograms
- For **categorical variables** → use bar charts

Distributions reveal how values are spread, where most observations occur, and whether unusual patterns exist.
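
As a minimal sketch of both plot types, the block below assumes the `diamonds` dataset that ships with ggplot2 (the same data used in later cells of this notebook). The 0.5-carat `binwidth` and the object names `carat_hist` and `cut_bars` are illustrative choices, not recommendations.

```r
library(tidyverse)

# Numerical variable: a histogram bins the values and counts each bin
carat_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)
carat_hist

# Categorical variable: a bar chart counts observations per level
cut_bars <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()
cut_bars

# The same information as tables, computed with dplyr
diamonds |> count(cut_width(carat, 0.5))
diamonds |> count(cut)
```

Reading the plot and the table side by side is a quick way to confirm that the bar heights mean what you think they mean.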

---

## ⭐ Typical Values

Tall bars in histograms or bar charts represent **common values**, while short or missing bars indicate **rare or unseen values**.

When analyzing distributions, ask:

- Which values occur most often?
- Which values are rare?
- Are there unexpected patterns or clusters?
- Do clusters suggest subgroups in the data?

Clusters may indicate groups of observations that share similar characteristics, which can lead to deeper investigation and new insights.
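
One way to look for clusters is to zoom in on the bulk of the data and use much narrower bins. The sketch below assumes `diamonds`; the 3-carat cutoff and the 0.01 bin width are picked purely for illustration.

```r
library(tidyverse)

# Keep only the diamonds below 3 carats so the bulk of the data fills the plot
smaller <- diamonds |>
  filter(carat < 3)

# Narrow bins make clusters and gaps visible
ggplot(smaller, aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

# The most frequent individual carat values, as a table
smaller |>
  count(carat, sort = TRUE) |>
  head(5)
```

If the histogram shows spikes at particular values, the table helps you name them, and each spike is a prompt for a follow-up question about why those values are so common.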

---

## ⚠️ Unusual Values (Outliers)

Outliers are values that **do not follow the overall pattern** of the data. They may arise from:

- Data entry or measurement errors
- Rare but legitimate extreme values
- Important discoveries or new phenomena

Large datasets can hide outliers because common values dominate visualizations. Zooming into plots using coordinate limits can reveal these unusual observations.
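
A sketch of that zooming step, assuming the `diamonds` data and its `y` dimension (the variable cleaned in section 10.4). The `binwidth` and the upper limit of 50 are illustrative values.

```r
library(tidyverse)

# Default view: rare values are invisible next to the tall, common bins
ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5)

# Zoom the y-axis to small counts; no data is dropped, only the view changes
y_zoomed <- ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
y_zoomed
```

`coord_cartesian()` is used here rather than `ylim()` because `ylim()` discards data outside the limits before the bins are drawn, which would change the plot instead of zooming it.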

---

## 🔎 Investigating Outliers

When outliers are found, you should:

- Determine if they are errors or valid observations
- Consider recoding incorrect values (e.g., replacing impossible values with `NA`)
- Compare analysis results with and without outliers
- Clearly document any data removal decisions

Outliers should never be removed without proper justification.
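
The checklist above can be sketched with dplyr. The cutoffs `y < 3` and `y > 20` are the ones used in the code cell of section 10.4; the column names `unusual_y`, `mean_y_all`, `mean_y_without`, and `n_unusual` are made up for this example.

```r
library(tidyverse)

# Pull out the suspicious rows so they can be inspected one by one
unusual <- diamonds |>
  filter(y < 3 | y > 20) |>
  select(price, x, y, z) |>
  arrange(y)
unusual

# Compare a summary with and without the suspicious values
diamonds |>
  mutate(unusual_y = y < 3 | y > 20) |>
  summarize(
    mean_y_all = mean(y),
    mean_y_without = mean(y[!unusual_y]),
    n_unusual = sum(unusual_y)
  )
```

If the two means barely differ, the outliers have little effect on that summary. If they differ a lot, the decision to keep or recode them needs to be justified and documented.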

---

## 🧠 Why Variation Matters

Understanding variation helps you:

- Detect errors and missing data
- Identify clusters or subgroups
- Generate new research questions
- Better understand how variables behave

Variation is often the first step toward discovering relationships between variables, which leads into studying **covariation**.


# 🚨 **10.4 Unusual Values: Handling the Weird Stuff**

Unusual values don't mean your analysis is doomed, but **how you handle them matters**. When you encounter implausible or incorrect values, there are two main strategies.

---

## ❌ Dropping Rows (Not Recommended)

You *can* remove entire observations that contain strange values, but this is risky:

- One bad value doesn't mean the whole row is bad
- Repeated filtering across variables can wipe out your dataset
- You may lose valid and valuable information

This approach should be a last resort.

---

## ✅ Replacing with Missing Values (Recommended)

A better approach is to **replace unusual values with `NA`**: they no longer interfere with the analysis, and the rest of the observation is preserved.

- Keeps valid data intact
- Makes issues explicit
- Plays nicely with ggplot2 and dplyr tools

---

## 📉 Visualizing Missing Values

ggplot2 leaves missing values out of a plot, but not silently: by default it **warns that rows were removed**.  
Set `na.rm = TRUE` in the geom to suppress that warning.

Sometimes, missing values are meaningful. For example:
- In flight data, missing departure times indicate **cancelled flights**
- Comparing missing vs non-missing values can reveal important patterns

Creating an indicator variable with `is.na()` lets you **analyze missingness directly**.

---

## 🧠 Key Takeaways

- Don't delete data unless you're sure
- Use `NA` to neutralize problematic values
- Missing values can carry real-world meaning
- Always think about *why* values are missing

Handling unusual values thoughtfully leads to cleaner analysis and more trustworthy conclusions.


In [None]:
library(tidyverse)
library(nycflights13)

# ❌ Dropping rows with unusual values (not recommended)
diamonds2 <- diamonds |>
  filter(between(y, 3, 20))

# ✅ Replacing unusual values with NA (recommended)
diamonds2 <- diamonds |>
  mutate(y = if_else(y < 3 | y > 20, NA, y))

# Plot with missing values removed (warning shown)
ggplot(diamonds2, aes(x = x, y = y)) +
  geom_point()

# Suppress warning about missing values
ggplot(diamonds2, aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)

# Understanding missing values: cancelled flights
nycflights13::flights |>
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) |>
  ggplot(aes(x = sched_dep_time)) +
  geom_freqpoly(aes(color = cancelled), binwidth = 0.25)


# 🔗 **10.5 Covariation: How Variables Move Together**

While **variation** looks at patterns *within* a single variable, **covariation** examines how **two or more variables change together**. Detecting covariation is central to EDA because it helps uncover relationships, trends, and structure in the data.

The most effective way to study covariation is through **visualization**.

---

## 🟦 Categorical × Numerical

When one variable is categorical and the other is numerical, useful tools include:

- **Frequency polygons** to compare distribution *shapes*
- **Boxplots** to compare medians, spread, and outliers
- **Reordering categories** to reveal trends more clearly

Key ideas:
- Raw counts can be misleading when group sizes differ → use **density** for fair comparison
- Compact plots (like boxplots) trade detail for clarity
- Reordering factors (e.g., by median) often reveals hidden structure
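
The count-versus-density point can be sketched with frequency polygons. This assumes the `diamonds` data; the `binwidth` of 500 and the name `price_density` are illustrative, and `after_stat()` requires ggplot2 3.3.0 or later.

```r
library(tidyverse)

# Counts: the curves for small groups sit near zero and are hard to compare
ggplot(diamonds, aes(x = price)) +
  geom_freqpoly(aes(color = cut), binwidth = 500)

# Density: each polygon is rescaled to area 1, so shapes are comparable
price_density <- ggplot(diamonds, aes(x = price, y = after_stat(density))) +
  geom_freqpoly(aes(color = cut), binwidth = 500)
price_density

# Group sizes differ a lot, which is why raw counts mislead
diamonds |> count(cut)
```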

---

## 🟩 Categorical × Categorical

To understand how two categorical variables relate:

- **Counts** reveal how often combinations occur
- **Size** or **fill color** can encode frequency
- **Heatmaps** (tiles) help spot strong associations

Covariation appears when certain category combinations dominate.
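
As a sketch of encoding frequency with size, `geom_count()` draws one point per combination and scales it by the number of observations. This assumes the `diamonds` data; `color_cut_counts` is a name chosen for this example.

```r
library(tidyverse)

# One row per (color, cut) combination, with its frequency in column n
color_cut_counts <- diamonds |>
  count(color, cut)
color_cut_counts

# Point size encodes how often each combination occurs
ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()
```

The table and the plot carry the same information. The plot makes dominant combinations easy to spot, while the table gives the exact numbers.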

---

## 🟥 Numerical × Numerical

For two numerical variables, **scatterplots** are the starting point.

Challenges and solutions:
- **Overplotting** → use transparency (`alpha`)
- **Very large datasets** → use 2D binning (`geom_bin2d()`, `geom_hex()`)
- **Conditional distributions** → bin one variable and summarize with boxplots

These techniques reveal trends, nonlinear relationships, clusters, and outliers that aren't visible in one-dimensional plots.

---

## 🧠 Big Picture

- Covariation helps explain *why* patterns exist
- Different variable types require different visual tools
- There's no single “best” plot; use multiple views
- Strong EDA often combines several techniques

Understanding covariation turns raw data into insight.


In [None]:
library(tidyverse)

# Categorical × Numerical: boxplots
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()

# Reordering categories by median
ggplot(mpg, aes(x = fct_reorder(class, hwy, median), y = hwy)) +
  geom_boxplot()

# Categorical × Categorical: counts and tiles
diamonds |>
  count(color, cut) |>
  ggplot(aes(x = color, y = cut)) +
  geom_tile(aes(fill = n))

# Numerical × Numerical: scatterplot with transparency
smaller <- diamonds |> filter(carat < 3)

ggplot(smaller, aes(x = carat, y = price)) +
  geom_point(alpha = 1/100)

# 2D binning for large datasets
ggplot(smaller, aes(x = carat, y = price)) +
  geom_bin2d()

# Conditional distribution using binned boxplots
ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)))


# 📈 **10.6 Patterns and Models**

When two variables have a **systematic relationship**, it shows up as a **pattern** in your data. Spotting these patterns is a key goal of EDA, but noticing them isn't enough. You should always question and probe what you see.

---

## 🔍 Questions to Ask When You See a Pattern

When a pattern appears, ask yourself:
- Could this be **random chance**?
- How can I **describe** the relationship (linear, curved, exponential)?
- How **strong** is the relationship?
- What **other variables** might influence it?
- Does the relationship change across **subgroups**?

Patterns point to **covariation**, which reduces uncertainty. If two variables covary, knowing one helps you predict the other. In special cases, this covariation may even reflect **causation**.
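
Two of these questions, how strong the relationship is and whether it changes across subgroups, can be roughed out numerically. The sketch below assumes the `diamonds` data and uses Pearson correlation on log-transformed `carat` and `price` (the same transformation used in the model cell of this section). Correlation only measures linear association, so treat it as one rough indicator rather than a full description. The names `overall` and `by_cut` are made up for this example.

```r
library(tidyverse)

# Overall strength of the log-log linear association
overall <- diamonds |>
  summarize(r = cor(log(carat), log(price)))
overall

# Does the strength of the association change across subgroups?
by_cut <- diamonds |>
  group_by(cut) |>
  summarize(r = cor(log(carat), log(price)), n = n())
by_cut
```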

---

## 🧠 Why Use Models in EDA?

Sometimes relationships are **hidden** because multiple variables are tightly linked.  
Models help by:
- **Extracting dominant patterns**
- **Removing strong effects** to reveal subtler ones
- Allowing you to focus on what's *left over* (residuals)

In the diamonds data:
- `price` is strongly related to `carat`
- `cut` also affects `price`, but that effect is harder to see because `cut` and `carat` are themselves related
- A model can remove the carat effect so we can study cut more clearly

---

## 🧮 Residuals Reveal Subtle Structure

By modeling **price as a function of carat** and examining the **residuals**, we can ask:
> “Is this diamond more or less expensive than we'd expect for its size?”

After removing the size effect:
- Differences by **cut** become clear
- Higher-quality cuts are **more expensive relative to size**

---

## 🧩 Big Takeaway

- Patterns suggest relationships, not explanations
- Models are tools to **clarify**, not replace, visualization
- Residuals help uncover structure hidden by dominant trends
- EDA + simple models = deeper insight

Formal modeling comes later; here, models are used as **exploratory lenses**, not final answers.


In [None]:
library(tidyverse)
library(tidymodels)

# Log-transform to linearize the relationship
diamonds <- diamonds |>
  mutate(
    log_price = log(price),
    log_carat = log(carat)
  )

# Fit a simple linear model
diamonds_fit <- linear_reg() |>
  fit(log_price ~ log_carat, data = diamonds)

# Compute residuals on the log scale, then exponentiate: .resid becomes actual price / predicted price
diamonds_aug <- augment(diamonds_fit, new_data = diamonds) |>
  mutate(.resid = exp(.resid))

# Residuals vs carat
ggplot(diamonds_aug, aes(x = carat, y = .resid)) +
  geom_point()

# Residuals vs cut
ggplot(diamonds_aug, aes(x = cut, y = .resid)) +
  geom_boxplot()
