# Lesson 4: Advanced Visualization, Reproducible Reporting, and Final Projects
**Author:** Petr Čala  
**Date:** 2025-02-26

# Lesson 4 Notebook

Welcome to **Lesson 4**! By now, you’ve covered:

1. **Basic & Intermediate Data Cleaning** (missing values, merges, reshaping, text handling).
2. **Introduction to SQL in R**.
3. **Data exploration** and some elementary visuals.

In this final lesson, we’ll focus on:

1. **Advanced Data Visualization** using `ggplot2`.
2. **Reproducible Workflows & Reporting** with R Markdown (or Quarto).
3. **Project Best Practices** – how to structure a final project/thesis.
4. **Wrap-up & Additional Resources**.


---

## 1. Advanced Data Visualization

In previous lessons, we created basic **histograms**, **bar plots**, and **boxplots**. Now let’s explore some more powerful techniques and customizations in **`ggplot2`**.

### 1.1 Adding Facets

Facets split a plot into panels, one per level of a chosen variable, so you can compare several small charts side by side.


In [None]:
library(tidyverse)

# Example dataset
set.seed(123)
cars_df <- mtcars %>%
  mutate(
    cyl = as.factor(cyl),  # number of cylinders, treated as a category
    gear = as.factor(gear) # number of forward gears, treated as a category
  )

# Scatter plot of mpg vs. hp (miles per gallon vs. horsepower), one panel per gear count
ggplot(cars_df, aes(x = hp, y = mpg, color = cyl)) +
  geom_point(size = 3) +
  facet_wrap(~gear) +
  labs(
    title = "MPG vs. Horsepower, Faceted by # of Gears",
    x = "Horsepower",
    y = "Miles per Gallon",
    color = "Cylinders"
  ) +
  theme_minimal()
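
`facet_wrap()` splits by a single variable. As a minimal sketch that goes one step beyond the example above, `facet_grid()` can lay panels out by two variables at once (rows by cylinders, columns by gears). The choice of variables and the `label_both` labeller are illustrative assumptions, not part of the original lesson.

```r
library(ggplot2)

# Rebuild the example data so this block runs on its own
cars_df <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

# Rows = cylinders, columns = gears; label_both prints "cyl: 4", "gear: 3", etc.
p_grid <- ggplot(cars_df, aes(x = hp, y = mpg)) +
  geom_point(size = 2) +
  facet_grid(cyl ~ gear, labeller = label_both) +
  labs(
    title = "MPG vs. Horsepower by Cylinders and Gears",
    x = "Horsepower",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

p_grid
```

Some cylinder/gear combinations have few or no cars, so a few panels will be sparse or empty. That is often informative in itself.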


### 1.2 Custom Themes & Labels

`ggplot2` comes with built-in themes like `theme_minimal()`, `theme_bw()`, etc. You can also customize fonts, colors, or even build your own theme.


In [None]:
# Customizing the theme
ggplot(cars_df, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "MPG Distribution by Cylinder Type",
    x = "Cylinders",
    y = "Miles per Gallon",
    fill = "Cylinders"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    axis.title = element_text(face = "bold", size = 12)
  )
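
If you reuse the same styling across many plots, one option is to wrap it in a small function. The sketch below assumes a hypothetical helper called `theme_lesson()` (the name and the specific settings are illustrative, not from the lesson); it starts from `theme_minimal()` and layers a few overrides on top.

```r
library(ggplot2)

# theme_lesson() is a hypothetical helper name chosen for this example
theme_lesson <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", hjust = 0.5), # centered bold title
      axis.title = element_text(face = "bold"),
      legend.position = "bottom"
    )
}

# Apply the custom theme exactly like a built-in one
p_themed <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot() +
  labs(
    title = "MPG Distribution by Cylinder Type",
    x = "Cylinders",
    y = "Miles per Gallon",
    fill = "Cylinders"
  ) +
  theme_lesson()

p_themed
```

Keeping the theme in one function means a later style change (say, a larger base font) only has to be made in one place.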


### 1.3 Advanced Geoms

Consider **geoms** beyond `geom_point()` and `geom_bar()`:

- **`geom_smooth()`** for smoothed trend lines (use `method = "lm"` for a straight regression line).
- **`geom_density()`** or **`geom_violin()`** for distribution insights.
- **`geom_text()`** / **`geom_label()`** to annotate plots.
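
As a rough illustration of two of the geoms listed above, the sketch below draws overlapping density curves and then labels a single point on a scatter plot. The object names (`p_density`, `p_label`, `best`) and the choice to label the most fuel-efficient car are assumptions made for this example.

```r
library(ggplot2)

# Rebuild the example data so this block runs on its own
cars_df <- transform(mtcars, cyl = factor(cyl))

# Distribution of mpg per cylinder group; alpha makes overlaps visible
p_density <- ggplot(cars_df, aes(x = mpg, fill = cyl)) +
  geom_density(alpha = 0.5) +
  labs(x = "Miles per Gallon", y = "Density", fill = "Cylinders") +
  theme_minimal()

# Pick the single most fuel-efficient car and keep its name for the label
best <- cars_df[which.max(cars_df$mpg), ]
best$model <- rownames(best)

# Annotate only that car instead of cluttering the plot with every name
p_label <- ggplot(cars_df, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_label(data = best, aes(label = model), hjust = 0, nudge_x = 10) +
  labs(x = "Horsepower", y = "Miles per Gallon") +
  theme_minimal()

p_density
p_label
```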

> **Exercise**: Try plotting a **smoothed** line (`geom_smooth`) over scatter points to see a trend, or a **violin plot** to compare distributions across groups.


---
## 2. Reproducible Workflows & Reporting
As you head toward a final project or thesis, it’s best to keep everything **reproducible**. This means someone else (or future you) can rerun all the code and get the same results.

### 2.1 Using R Markdown or Quarto
While we’re using Jupyter Notebooks here, **R Markdown** (or the newer **Quarto**) is a powerful way to **mix code, text, and output** in one document. You can generate HTML, PDF, or Word reports directly.

#### Example R Markdown Setup
```yaml
---
title: "My Data Analysis"
author: "My Name"
output: html_document
---
```

Then, in R code chunks, you can do:
```{r}
library(tidyverse) # provides read_csv()

# Load data
df <- read_csv("mydata.csv")
summary(df)
```

Then **knit** the document to produce an HTML report that combines text, code, and results.
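
Knitting does not have to go through a button in the editor. As a hedged sketch, the block below writes a tiny R Markdown file to a temporary folder and renders it with `rmarkdown::render()`. The file name `demo_report.Rmd` and its contents are made up for illustration, and rendering is skipped if Pandoc is not installed.

```r
library(rmarkdown)

# Write a minimal .Rmd file to a temporary folder (hypothetical file name and content)
rmd_path <- file.path(tempdir(), "demo_report.Rmd")
writeLines(c(
  "---",
  'title: "My Data Analysis"',
  "output: html_document",
  "---",
  "",
  "A short summary of fuel economy in `mtcars`:",
  "",
  "```{r}",
  "summary(mtcars$mpg)",
  "```"
), rmd_path)

# Render from code; render() returns the path of the generated file
if (pandoc_available()) {
  html_path <- render(rmd_path, quiet = TRUE)
  print(html_path)
} else {
  message("Pandoc not found; skipping the render step.")
}
```

Rendering from a script is handy when a report has to be rebuilt automatically, for example whenever the underlying data is updated.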

> **Tip**: Quarto (https://quarto.org) is the next-generation version of R Markdown, supporting multiple languages (R, Python, Julia) in one framework.


### 2.2 Project Structure

A recommended layout:

```
my-project/
├─ data/
│   ├─ raw/       (store raw, unmodified data)
│   └─ processed/ (store cleaned datasets)
├─ code/
│   ├─ cleaning_scripts.R
│   └─ analysis_scripts.R
├─ output/
│   ├─ figures/   (optional folder for saved plots)
│   └─ final_results.csv
├─ my_report.Rmd  (or .qmd, .ipynb)
└─ README.md
```

A clear, predictable layout makes the project easier to put under **version control** and to share with collaborators.
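
A skeleton like this can be created from R itself. The following is a minimal sketch that builds the top-level folders under a temporary directory; the `project_root` location is an assumption for the demo, and in practice you would point it at your real project folder.

```r
# Build the skeleton under a temporary directory so the demo leaves no clutter
project_root <- file.path(tempdir(), "my-project")

folders <- c("data/raw", "data/processed", "code", "output")

for (folder in folders) {
  # recursive = TRUE also creates parent folders such as data/
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# Start an empty README so the project is documented from day one
file.create(file.path(project_root, "README.md"))

# Inspect what was created
list.files(project_root, recursive = TRUE, include.dirs = TRUE)
```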


### 2.3 Version Control with Git (Optional)

If you want to track changes:

1. **Initialize** a Git repository in your project folder.
2. Commit changes regularly (e.g., after each major step).
3. Push to **GitHub** (or similar) if you’d like an online backup.
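
These steps are normally typed in a terminal, but to stay in R the sketch below issues the same Git commands through `system2()`. It assumes Git is installed and skips the Git steps otherwise. The demo folder, the `git()` helper, and the placeholder name and e-mail are all invented for this example. The push step is left as a comment because it needs a remote repository that this demo does not have.

```r
# Create a throwaway project folder with one file to commit
repo <- file.path(tempdir(), "git-demo")
dir.create(repo, showWarnings = FALSE)
writeLines("# My project", file.path(repo, "README.md"))

# Hypothetical helper: run a git command inside the demo folder
git <- function(...) system2("git", c("-C", shQuote(repo), ...))

if (nzchar(Sys.which("git"))) {
  git("init")                                   # 1. initialize the repository
  git("add", "README.md")                       # stage the file
  git("-c", "user.name=Demo",                   # placeholder identity for the demo
      "-c", "user.email=demo@example.com",
      "commit", "-m", shQuote("Add README"))    # 2. commit with a message
  # 3. git("push", "origin", "main") would upload to a remote, once one is configured
} else {
  message("Git not found; skipping the Git steps.")
}
```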

> **Exercise**: Try creating a small Git repo for one of your class projects. Make commits after you add data or update code.


---

## 3. Project Best Practices

From a **journalism** standpoint:

1. **Document Data Sources**: Cite where you got the data and any transformations.
2. **Show Your Work**: If you changed data (e.g., replaced missing values), mention how and why.
3. **Focus on Clarity**: Visuals should be clear, well-labeled, and highlight the story.
4. **Ethics & Privacy**: Check if the data includes personal info. Anonymize as necessary.
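
Points 2 and 4 can be made concrete in code. The sketch below uses a small invented table (the names and incomes are made up) to show one possible approach: fill a missing value with the median, keep a flag column recording which rows were changed, and drop the personal identifier before sharing.

```r
library(dplyr)

# Hypothetical survey data, invented purely for illustration
survey <- tibble(
  name   = c("Alice", "Bob", "Carla"),
  income = c(32000, NA, 41000)
)

clean <- survey %>%
  mutate(
    # Record which values were filled in BEFORE replacing them
    income_imputed = is.na(income),
    # Replace missing incomes with the median of the observed ones
    income = if_else(is.na(income), median(income, na.rm = TRUE), income),
    # Neutral identifier to use instead of the name
    respondent_id = row_number()
  ) %>%
  # Drop the personal identifier from the shared dataset
  select(respondent_id, income, income_imputed)

clean
```

The flag column lets a reader see exactly which figures are original and which were imputed. Note that removing names alone does not always guarantee anonymity, so treat this as a starting point.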

### 3.1 Outline for a Final Report

A typical structure might look like:

1. **Introduction**: Problem statement or news angle.
2. **Data & Sources**: Where data comes from, how it was collected.
3. **Methodology**: Data cleaning steps, analysis approach, mention of tools.
4. **Results / Findings**: Key insights, tables, and plots.
5. **Discussion**: Limitations, significance, or potential biases.
6. **Conclusion**: Summarize the story or main point.
7. **Appendix (optional)**: Code listings, extended tables.

> **Tip**: For maximum clarity, keep each step well documented. If you’re using R Markdown, consider building a table of contents.


---

## 4. Lesson 4 Workflow Example

Below is a brief demonstration combining advanced visualization with a reproducible approach. We’ll simulate a scenario:

1. Use `ggplot2`'s built-in `mpg` **fuel economy** dataset (a stand-in for a real dataset).
2. Create advanced visualizations.
3. Show how we might embed these in an R Markdown doc.


In [None]:
# Scatter plot with one smoothed trend line per vehicle class (se = FALSE hides the confidence band)
ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(
    title = "Engine Displacement vs. Highway MPG",
    subtitle = "Data from ggplot2's built-in mpg dataset",
    x = "Displacement (L)",
    y = "MPG (Highway)"
  ) +
  theme_minimal()
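
To reuse a figure outside the notebook, it can be written to disk with `ggsave()`. The sketch below is a simplified variant of the plot above (a single linear trend line, which is an assumption made to keep the example quiet) and saves it to a temporary folder; the file name and dimensions are illustrative.

```r
library(ggplot2)

# Simplified variant of the plot: colored points, one overall linear trend
p_mpg <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title = "Engine Displacement vs. Highway MPG",
    x = "Displacement (L)",
    y = "MPG (Highway)",
    color = "Class"
  ) +
  theme_minimal()

# Save to a temporary folder; in a real project, point this at your figures folder
out_file <- file.path(tempdir(), "displ_vs_hwy.png")
ggsave(out_file, plot = p_mpg, width = 7, height = 4, dpi = 300)
```

Fixing `width`, `height`, and `dpi` in code means the figure comes out the same size every time the script is rerun.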


#### R Markdown Embedding Example

In an R Markdown file, you’d write:

````md
```{r fancy-plot, echo=FALSE}
ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(title = "Engine Displacement vs. Highway MPG") +
  theme_minimal()
```
````

This chunk, labeled `fancy-plot`, renders the plot directly in your report. The `echo=FALSE` option hides the code so that only the figure appears.

> **Exercise**: Take any dataset from prior lessons, put it into an R Markdown file, and produce a short 1-page “mini-report” with an advanced plot.
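
Chunk options such as `echo=FALSE` do not have to be repeated on every chunk. As a minimal sketch, defaults for the whole document can be set once with `knitr::opts_chunk$set()`, typically in a setup chunk near the top of the file. The particular values below are illustrative assumptions.

```r
library(knitr)

# Document-wide defaults; individual chunks can still override them
opts_chunk$set(
  echo = FALSE,     # hide code, show only output
  message = FALSE,  # suppress package start-up messages
  warning = FALSE,  # suppress warnings in the rendered report
  fig.width = 7,    # default figure width in inches
  fig.height = 4    # default figure height in inches
)

# Check the current default for one option
opts_chunk$get("echo")
```

For a reader-facing story, hiding code by default usually makes sense. For a methods appendix, you may prefer `echo = TRUE` so that the work is visible.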


---

## 5. Additional Resources

- **R for Data Science** (Hadley Wickham & Garrett Grolemund): a free online book.
- **R Markdown** documentation: <https://rmarkdown.rstudio.com>
- **Quarto**: <https://quarto.org>
- **The Data Visualization Catalogue**: <https://datavizcatalogue.com> for chart ideas.
- **Plotting Extensions**:
  - `plotly` for interactive plots.
  - `leaflet` for maps.
  - `highcharter` for dynamic charts.

> **Tip**: If you’re telling a story, consider interactive visuals with `shiny` or `plotly`, especially if it’s for online journalism.


---

## Summary & Wrap-Up

In **Lesson 4**, you learned:

1. **Advanced Visualization** with `ggplot2` (facets, custom themes, advanced geoms).
2. **Reproducible Reporting** with R Markdown or Quarto.
3. **Project Organization & Best Practices** for your final thesis or data story.

**Key Takeaways**:

- Keep your data workflow **clean and documented**.
- Use **version control** where possible.
- Spend time on **visual clarity**; a well-crafted plot can make or break a story.
- For your final thesis or journalism project, consider **R Markdown** to combine text and code seamlessly.

This completes your structured journey from basic R data handling to advanced visualization and reproducible workflows. Best of luck applying these skills to your final projects, newsroom investigations, or future data-driven stories!

# End of Lesson 4
