# **Week 1: Introduction to Data Wrangling and Visualisation in R**
```
.------------------------------------.
|   __  ____  ______  _  ___ _____   |
|  |  \/  \ \/ / __ )/ |/ _ \___  |  |
|  | |\/| |\  /|  _ \| | | | | / /   |
|  | |  | |/  \| |_) | | |_| |/ /    |
|  |_|  |_/_/\_\____/|_|\___//_/     |
'------------------------------------'

```


This week, we will dive into data wrangling and data visualisation in R using the `dplyr` and `ggplot2` packages. If you are not familiar with basic R data types and operations, please take some time to review [Week 0](https://colab.research.google.com/github/edelweiss611428/MXB107-Notebooks/blob/main/notebooks/Week_0.ipynb) content.

## **Pre-Configuring the Notebook**

### **Switching to the R Kernel on Colab**

By default, Google Colab uses Python as its programming language. To use R instead, you’ll need to manually switch the kernel by going to **Runtime > Change runtime type**, and selecting R as the kernel. This allows you to run R code in the Colab environment.

However, our notebook is already configured to use R by default. Unless something goes wrong, you shouldn’t need to manually change runtime type.

### **Importing Required Datasets and Packages**
**Run the following lines of code**:

In [1]:
#Do not modify

setwd("/content")

# Remove `MXB107-Notebooks` if it already exists
if (dir.exists("MXB107-Notebooks")) {
  system("rm -rf MXB107-Notebooks")
}

# Clone the repository
system("git clone https://github.com/edelweiss611428/MXB107-Notebooks.git")

# Change working directory to "MXB107-Notebooks"
setwd("MXB107-Notebooks")

# Load the required packages and utility functions
invisible(source("R/preConfigurated.R"))

Loading required package: ggplot2

Loading required package: dplyr


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: tidyr

Loading required package: stringr

Loading required package: magrittr


Attaching package: ‘magrittr’


The following object is masked from ‘package:tidyr’:

    extract


Loading required package: IRdisplay

Loading required package: png

“there is no package called ‘png’”
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Loading required package: grid

Loading required package: knitr



**Do not modify the following**:

In [2]:
if (!require("testthat")) install.packages("testthat"); library("testthat")

test_that("Test if all packages have been loaded", {

  expect_true(all(c("ggplot2", "tidyr", "dplyr", "stringr", "magrittr") %in% loadedNamespaces()))

})

test_that("Test if all utility functions have been loaded", {
  expect_true(exists("skewness"))
  expect_true(exists("kurtosis"))
})

Loading required package: testthat


Attaching package: ‘testthat’


The following objects are masked from ‘package:magrittr’:

    equals, is_less_than, not


The following object is masked from ‘package:tidyr’:

    matches


The following object is masked from ‘package:dplyr’:

    matches




Test passed 🥳
Test passed 😸


## **Data Wrangling in R via `dplyr` and `magrittr`**

Data wrangling is the process of cleaning, transforming, and reshaping data for analysis. The `dplyr` package, often used with the pipe operator `%>%` from `magrittr`, provides a set of intuitive functions for these tasks.

### **Can We Do This in Base R?**

Yes, but the syntax tends to be more verbose and less readable. For demonstration, we will load the CSV file `MXB107_2025.csv`.

In [3]:
MXB107_Info = read.csv("./datasets/MXB107_2025.csv")
str(MXB107_Info)

'data.frame':	12 obs. of  8 variables:
 $ Class         : chr  "LEC01 01" "LEC01 01" "PRC01 01" "PRC01 01" ...
 $ Type          : chr  "Lecture (Internal)" "Lecture (Online)" "Practical (Online)" "Practical (Internal)" ...
 $ Day           : chr  "Wed" "Wed" "Wed" "Thu" ...
 $ Location      : chr  "GP B117" "Online" "Online" "GP D413" ...
 $ Limit         : int  240 1000 30 35 30 35 25 30 35 35 ...
 $ Teaching_Staff: chr  "Chris Drovandi" "Chris Drovandi" "Narayan Srinivasan" "Narayan Srinivasan" ...
 $ From          : int  11 11 16 16 16 9 14 9 9 11 ...
 $ To            : int  13 13 18 18 18 11 16 11 11 13 ...


We want to extract all classes that start after 9:00 AM on either Thursday or Friday and are not online.

In [5]:
notOnline = MXB107_Info$Location != "Online"
onThuOrFri = MXB107_Info$Day %in% c("Thu", "Fri")
startAfter9AM = MXB107_Info$From > 9
subset(MXB107_Info, notOnline & onThuOrFri & startAfter9AM) ### subset() takes the data frame first, then the row condition

### Or, equivalently, with dplyr:

results = MXB107_Info %>%
  filter(Location != "Online",
         Day %in% c("Thu", "Fri"),
         From > 9)
print(results) ### print() shows plain text output instead of the rendered table

,Class,Type,Day,Location,Limit,Teaching_Staff,From,To
,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
4,PRC01 01,Practical (Internal),Thu,GP D413,35,Narayan Srinivasan,16,18
7,PRC01 07,Practical (Internal),Thu,GP S520,25,Ryan Kelly,14,16
10,PRC01 04,Practical (Internal),Fri,GP G216,35,Arwen Nugteren,11,13
11,PRC01 05,Practical (Internal),Fri,GP S502,35,Arwen Nugteren,15,17
12,PRC01 06,Practical (Internal),Fri,GP S519,35,Minh Long Nguyen,15,17


     Class                 Type Day Location Limit     Teaching_Staff From To
1 PRC01 01 Practical (Internal) Thu  GP D413    35 Narayan Srinivasan   16 18
2 PRC01 07 Practical (Internal) Thu  GP S520    25         Ryan Kelly   14 16
3 PRC01 04 Practical (Internal) Fri  GP G216    35     Arwen Nugteren   11 13
4 PRC01 05 Practical (Internal) Fri  GP S502    35     Arwen Nugteren   15 17
5 PRC01 06 Practical (Internal) Fri  GP S519    35   Minh Long Nguyen   15 17


### **Defining a Data-Processing Pipeline**

The `dplyr` and `magrittr` packages simplify and streamline data manipulation by providing a data-processing pipeline.

When we write

```
input %>% do_something_1() %>% do_something_2()
```

the pipe operator `%>%` takes the output of the expression on its left and passes it as the **first argument** to the function on its right. This allows chaining multiple operations in a clear, readable sequence.


In base R, it would be something like

```
output1 = do_something_1(input)
output2 = do_something_2(output1)
```
Which one looks better to you?
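
To check that the two styles compute the same thing, here is a minimal sketch on a toy numeric vector. The vector and the variable names are made up for illustration and are not part of the course data.

```r
library(magrittr)

# Hypothetical toy input (not from the course dataset)
input <- c(4, 9, 16)

# Piped version: the left-hand side becomes the first argument of the next call
piped <- input %>% sqrt() %>% sum()

# Base R version with an intermediate variable
output1 <- sqrt(input)
output2 <- sum(output1)

identical(piped, output2)  # TRUE
```

Both versions run the same two steps; the pipe simply removes the need to name the intermediate result.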




Another great feature of `dplyr` is that its functions evaluate column names inside the data frame (an approach known as data masking), so columns can be used as if they were ordinary variables. `dplyr` also works with **tibbles**, a more user-friendly data frame format.

This means we don’t need to repeatedly write `MXB107_Info$From` or similar; inside `dplyr` **verbs** like `filter()`, we can simply refer to columns by their names, such as `From` and `To`, which makes the code much cleaner and easier to read.

In [7]:
MXB107_Info %>%
  filter(Location != "Online",
         Day %in% c("Thu", "Fri"),
         From > 9) -> results
print(results)

### Explanation: filter() is called as filter(dataFrame, cond1, cond2, ...)
### In the example above, MXB107_Info is already passed as the dataFrame argument of filter() because of %>%
### The result of the whole expression (from MXB107_Info down to `From > 9)`) is then assigned to results

     Class                 Type Day Location Limit     Teaching_Staff From To
1 PRC01 01 Practical (Internal) Thu  GP D413    35 Narayan Srinivasan   16 18
2 PRC01 07 Practical (Internal) Thu  GP S520    25         Ryan Kelly   14 16
3 PRC01 04 Practical (Internal) Fri  GP G216    35     Arwen Nugteren   11 13
4 PRC01 05 Practical (Internal) Fri  GP S502    35     Arwen Nugteren   15 17
5 PRC01 06 Practical (Internal) Fri  GP S519    35   Minh Long Nguyen   15 17


That indeed looks much better! Here the right-arrow assignment operator really shines.
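
As a small sketch of why the right-arrow form is only a matter of style, the snippet below assigns the same pipeline result with `<-` and with `->`. The data frame `toy` is a hypothetical example, not part of the course data.

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(x = 1:5)

# Left assignment: the result name comes first
left_result <- toy %>% filter(x > 3)

# Right assignment: the result name comes last, after the pipeline
toy %>% filter(x > 3) -> right_result

identical(left_result, right_result)  # TRUE
```

With `->`, the code reads top to bottom in the order the steps run: start from the data, transform it, then store it.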

### **Common `dplyr` Verbs**

#### **`filter()` — subset rows by condition**

Which sessions are classified as internal lectures in the schedule?

In [12]:
MXB107_Info %>%
  filter(Type == "Lecture (Internal)")

### Or, without the pipe:
filter(MXB107_Info, Type == "Lecture (Internal)")
### Or, matching any type that contains "Lecture":
filter(MXB107_Info, str_detect(Type, "Lecture"))

Class,Type,Day,Location,Limit,Teaching_Staff,From,To
<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
LEC01 01,Lecture (Internal),Wed,GP B117,240,Chris Drovandi,11,13


Class,Type,Day,Location,Limit,Teaching_Staff,From,To
<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
LEC01 01,Lecture (Internal),Wed,GP B117,240,Chris Drovandi,11,13


Class,Type,Day,Location,Limit,Teaching_Staff,From,To
<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
LEC01 01,Lecture (Internal),Wed,GP B117,240,Chris Drovandi,11,13
LEC01 01,Lecture (Online),Wed,Online,1000,Chris Drovandi,11,13


Which practical sessions are scheduled on Wednesday?

In [15]:
MXB107_Info %>%
  filter(str_detect(Type, "Practical"), # outside a dplyr verb: str_detect(MXB107_Info$Type, pattern = "Practical")
         Day == "Wed")

### The following is wrong: %in% tests for exact equality, not for a substring, so no rows match
filter(MXB107_Info, Type %in% "Practical", Day == "Wed")

### %in% behaves like ==, but it can compare against several values at once: x %in% c("A", "B", "C") checks whether x equals "A", "B", or "C"

### This is correct:
filter(MXB107_Info, str_detect(Type, "Practical"), Day %in% "Wed")


Class,Type,Day,Location,Limit,Teaching_Staff,From,To
<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
PRC01 01,Practical (Online),Wed,Online,30,Narayan Srinivasan,16,18


Class,Type,Day,Location,Limit,Teaching_Staff,From,To
<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>


Class,Type,Day,Location,Limit,Teaching_Staff,From,To
<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
PRC01 01,Practical (Online),Wed,Online,30,Narayan Srinivasan,16,18
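
The difference between exact matching and pattern matching can be seen on a plain character vector. This is a minimal sketch; the vector `type` is a hypothetical stand-in for the `Type` column.

```r
library(stringr)

# Hypothetical stand-in for the Type column
type <- c("Practical (Online)", "Practical (Internal)", "Lecture (Online)")

# %in% compares whole strings, so a partial string never matches
exact <- type %in% "Practical"
exact    # FALSE FALSE FALSE

# str_detect() looks for the pattern anywhere inside each string
pattern <- str_detect(type, "Practical")
pattern  # TRUE TRUE FALSE

# %in% is useful when matching against a set of complete values
in_set <- type %in% c("Lecture (Online)", "Practical (Online)")
in_set   # TRUE FALSE TRUE
```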


#### **`select()` — pick specific columns**

Which classes does Chris Drovandi teach, and when and where are they scheduled?

In [17]:
MXB107_Info %>%
  filter(Teaching_Staff == "Chris Drovandi") %>%
  select(Class, Teaching_Staff, Day, Location, From, To)


### Filter the data frame down to Chris Drovandi first, then select columns from the result

### A more roundabout way to write it, with an intermediate variable (Python-style :)

filter_data = filter(MXB107_Info, Teaching_Staff == "Chris Drovandi")
select(filter_data, Class, Teaching_Staff, Day, Location, From, To )

Class,Teaching_Staff,Day,Location,From,To
<chr>,<chr>,<chr>,<chr>,<int>,<int>
LEC01 01,Chris Drovandi,Wed,GP B117,11,13
LEC01 01,Chris Drovandi,Wed,Online,11,13


Class,Teaching_Staff,Day,Location,From,To
<chr>,<chr>,<chr>,<chr>,<int>,<int>
LEC01 01,Chris Drovandi,Wed,GP B117,11,13
LEC01 01,Chris Drovandi,Wed,Online,11,13


#### **`mutate()` — add or modify columns**

How long is each Wednesday session, and what are their other scheduled details?

In [18]:
MXB107_Info %>%
  filter(Day == "Wed") %>%
  mutate(Duration = To-From)

### mutate() simply modifies existing columns or adds new ones
### Above, mutate() adds a Duration column whose values are To minus From
### This does not change the original MXB107_Info unless you assign the result to a variable

Class,Type,Day,Location,Limit,Teaching_Staff,From,To,Duration
<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>,<int>
LEC01 01,Lecture (Internal),Wed,GP B117,240,Chris Drovandi,11,13,2
LEC01 01,Lecture (Online),Wed,Online,1000,Chris Drovandi,11,13,2
PRC01 01,Practical (Online),Wed,Online,30,Narayan Srinivasan,16,18,2


#### **`arrange()` — reorder rows by column(s)**

Which sessions are scheduled on Wednesday, and how do they compare in the number of scheduled students (`Limit`), arranged from smallest to largest?

In [19]:
MXB107_Info %>%
  filter(Day == "Wed") %>%
  arrange(Limit) #Ascending order

### arrange() sorts the rows by Limit

Class,Type,Day,Location,Limit,Teaching_Staff,From,To
<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
PRC01 01,Practical (Online),Wed,Online,30,Narayan Srinivasan,16,18
LEC01 01,Lecture (Internal),Wed,GP B117,240,Chris Drovandi,11,13
LEC01 01,Lecture (Online),Wed,Online,1000,Chris Drovandi,11,13


#### **`group_by()` and `summarise()` — group data and aggregate**

For each day of the week, how many students are scheduled in total, what is the average number of students per session, and how many sessions are there?

In [23]:
MXB107_Info %>%
  group_by(Day) %>%
  summarise(totalLimit = sum(Limit),
            averageLimit = mean(Limit),
            Count = n(),
            .groups = "drop") # `.groups = "drop"` returns an ungrouped result

### Remember `.groups` to ungroup; otherwise the result can stay grouped, and later operations become confusing
### To be safe, add `.groups = "drop"` whenever you use group_by() with summarise()


Day,totalLimit,averageLimit,Count
<chr>,<int>,<dbl>,<int>
Fri,140,35.0,4
Thu,155,31.0,5
Wed,1270,423.3333,3
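
The effect of `.groups` is easiest to see with two grouping variables. The sketch below uses a hypothetical toy data frame and compares `"drop_last"` (which keeps all but the last grouping variable) with `"drop"`.

```r
library(dplyr)

# Hypothetical toy data with two candidate grouping columns
toy <- data.frame(
  Day   = c("Wed", "Wed", "Thu", "Thu"),
  Mode  = c("Online", "Internal", "Internal", "Internal"),
  Limit = c(30, 240, 35, 25)
)

# "drop_last" removes only the last grouping variable (Mode)
still_grouped <- toy %>%
  group_by(Day, Mode) %>%
  summarise(total = sum(Limit), .groups = "drop_last")

# "drop" removes all grouping
ungrouped <- toy %>%
  group_by(Day, Mode) %>%
  summarise(total = sum(Limit), .groups = "drop")

group_vars(still_grouped)  # "Day"
group_vars(ungrouped)      # character(0)
```

A result that is still grouped by `Day` would make a later `mutate()` or `summarise()` operate within each day rather than over the whole table, which is why dropping the groups explicitly is a sensible default.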


#### **`rename()` — rename columns**

In [24]:
MXB107_Info %>%
  rename(Start = From, End = To) %>%
  head(3)

### Straightforward: rename() changes column names, using new_name = old_name

,Class,Type,Day,Location,Limit,Teaching_Staff,Start,End
,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
1,LEC01 01,Lecture (Internal),Wed,GP B117,240,Chris Drovandi,11,13
2,LEC01 01,Lecture (Online),Wed,Online,1000,Chris Drovandi,11,13
3,PRC01 01,Practical (Online),Wed,Online,30,Narayan Srinivasan,16,18


#### **`slice()` — select rows by position**

In [25]:
MXB107_Info %>%
  slice(1:3) #Similar to indexing

### This selects rows by position, here rows 1 to 3


Class,Type,Day,Location,Limit,Teaching_Staff,From,To
<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<int>
LEC01 01,Lecture (Internal),Wed,GP B117,240,Chris Drovandi,11,13
LEC01 01,Lecture (Online),Wed,Online,1000,Chris Drovandi,11,13
PRC01 01,Practical (Online),Wed,Online,30,Narayan Srinivasan,16,18


Some relevant verbs include:
- `slice_max(data, order_by, n)`: extracts the top `n` rows from data with the highest values in the `order_by` column.
- `slice_min(data, order_by, n)`: extracts the bottom `n` rows from data with the lowest values in the `order_by` column.
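
As a short sketch of these two verbs, using a hypothetical toy data frame rather than the course data:

```r
library(dplyr)

# Hypothetical toy data (class labels are made up)
toy <- data.frame(Class = c("A", "B", "C", "D"), Limit = c(30, 240, 1000, 35))

# The two rows with the largest Limit, largest first
largest <- toy %>% slice_max(order_by = Limit, n = 2)
largest$Class   # "C" "B"

# The two rows with the smallest Limit, smallest first
smallest <- toy %>% slice_min(order_by = Limit, n = 2)
smallest$Class  # "A" "D"
```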

#### **`distinct()` — get unique rows by columns**


Who are the unique teaching staff members listed in the dataset?

In [27]:
MXB107_Info %>%
  distinct(Teaching_Staff)

### This lists every staff member in the dataset; distinct() keeps only the unique values


Teaching_Staff
<chr>
Chris Drovandi
Narayan Srinivasan
Oliver Vu
Minh Long Nguyen
Ryan Kelly
Nicholas Gecks-Preston
Arwen Nugteren


#### **`pivot_longer()` — reshapes wide-format data into long format**

We are often more familiar with wide-format data, where each row represents an observation and each column represents a variable or feature.

However, in many situations — particularly for modelling, statistical analysis, and plotting — it is more convenient or even required to work with data in long format. In long format, each row corresponds to a single measurement or value, along with its associated identifiers.

For example, this data frame is in wide-format.

In [None]:
MXB107_Info %>%
  select(Class, Type, Day, Teaching_Staff, From, To) %>%
  head(3) %>%
  mutate(id = row_number())

To convert this to long-format, we can use `pivot_longer()`:

In [None]:
MXB107_Info %>%
  select(Class, Teaching_Staff, From, To) %>%
  head(3) %>%
  mutate(id = row_number()) %>%
  pivot_longer(
    cols = c(From, To),
    names_to = "timeType",
    values_to = "Hour"
  )

We observe that the number of rows has doubled. This is because the original `From` and `To` columns have been reshaped into a single column called `Hour`, with a corresponding column `timeType` indicating whether the value refers to the start or end time.

Instead of storing both `From` and `To` in the same row, each is now represented as a separate row — one for the start time and one for the end time. This structure is characteristic of long format data. If you are familiar with databases, now `<id, timeType>` becomes the new `key` (identifier).

Long-format data frames are less memory-efficient but more convenient for many modelling, statistical analysis, and data visualisation tasks.
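
The row doubling described above can be checked on a tiny example. This is a minimal sketch with a hypothetical two-row data frame `wide`; only the column names `From`, `To`, `timeType`, and `Hour` are taken from the text.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row wide table
wide <- data.frame(id = 1:2, From = c(11, 16), To = c(13, 18))

long <- wide %>%
  pivot_longer(
    cols = c(From, To),
    names_to = "timeType",
    values_to = "Hour"
  )

nrow(wide)   # 2
nrow(long)   # 4: one row per (id, timeType) pair
names(long)  # "id" "timeType" "Hour"
```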

Suppose you want to predict the 2023 math grades of Alice, Bob, and Jane using `Year` as a predictor. However, the data is currently in wide format (`grade_wide`). This format is not ideal for modelling or prediction tasks where you want to use `Year` as a predictor variable, because each year is a different column instead of being a value in a single variable.

In [None]:
grade_wide = data.frame(
  Name = c("Alice", "Bob", "Jane"),
  `2020` = c(88, 75, 93),
  `2021` = c(90, 78, 95),
  `2022` = c(92, 82, 97),
   check.names = FALSE
)
grade_wide

Converting the data to long format simplifies the modelling process. Now you can easily fit a statistical model to predict `Grade`, e.g., `%>% lm(formula = Grade ~ Year + Name)`, a linear regression model with `Grade` as the response variable and `Name` and `Year` as the predictors. Naming the `formula` argument matters here: the pipe then passes the data frame to the next unnamed argument of `lm()`, which is `data`.

In [None]:
grade_wide %>%
  pivot_longer(
    cols = `2020`:`2022`,
    names_to = "Year",
    values_to = "Grade") %>%
  mutate(Year = as.integer(Year))


In [None]:
grade_wide %>%
  pivot_longer(
    cols = `2020`:`2022`,
    names_to = "Year",
    values_to = "Grade") %>%
  mutate(Year = as.integer(Year)) %>%
  lm(formula = Grade ~ Year + Name) %>%
  summary()
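
To go one step further and obtain the 2023 predictions mentioned earlier, one possible approach is to store the fitted model and call `predict()` on a new data frame. This is a sketch only: the names `grade_long`, `fit`, and `new_students` are introduced here for illustration, and with three years of data per student the predictions should not be taken too seriously.

```r
library(dplyr)
library(tidyr)

grade_wide <- data.frame(
  Name = c("Alice", "Bob", "Jane"),
  `2020` = c(88, 75, 93),
  `2021` = c(90, 78, 95),
  `2022` = c(92, 82, 97),
  check.names = FALSE
)

# Long format: one row per (Name, Year) pair
grade_long <- grade_wide %>%
  pivot_longer(cols = `2020`:`2022`, names_to = "Year", values_to = "Grade") %>%
  mutate(Year = as.integer(Year))

# Same model as above, stored so it can be reused
fit <- lm(Grade ~ Year + Name, data = grade_long)

# New rows to predict for: each student in 2023
new_students <- data.frame(Name = c("Alice", "Bob", "Jane"), Year = 2023L)
new_students$Predicted <- predict(fit, newdata = new_students)
new_students
```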

A `pivot_wider()` counterpart exists. We will discuss it later.

#### **Exercise**

How many classes are held online vs in-person for each type (Lecture, Practical)?

<details>
<summary>▶️ Click to show the solution</summary>

```r
MXB107_Info %>%
  mutate(Mode = ifelse(Location == "Online", "Online", "In-Person")) %>%
  group_by(Type, Mode) %>%
  summarize(Count = n(), .groups = "drop") %>%
  arrange(Type, Mode)
```

</details>

## **Data Visualisation via `ggplot2`**

`ggplot2` is a powerful and widely-used R package for data visualisation based on the "Grammar of Graphics" concept. It allows you to create complex and elegant plots by layering components step-by-step.

Key features of `ggplot2`:

- **Layered approach**: Build plots by adding layers like stacking LEGO blocks.
- **Consistent syntax**: Uses a clear, declarative style making plots easy to read and modify.
- **Highly customisable**: Control every detail of your plot’s appearance.
- **Works well with "tidy" data**: Designed to work seamlessly with data in long format.

### **Basic Data Visualisation Principles**

Depending on how we create graphical depictions of data, we can alter the viewer’s impression of the data; in other words, if a picture is worth a thousand words, then how we make the picture can change the story. Ideally, we want our graphical summaries to be as objective as possible; we want the data to speak for themselves. There are no hard and fast rules for creating graphical summaries. Still, there are some basic principles to follow:

- Always have a title for your graphical summary.
- Titles should accurately describe the variables and the relationship shown in the summary. If one of the axes is time or the data are for a specific period, that should be in the title.
- Clearly label the axes and include units.
- When comparing two data sets, the axes for each summary should match.


### **`iris` dataset**

To demonstrate the use of `ggplot2`, we will use a new dataset named `iris`, a classic dataset containing measurements of `Sepal.Length`, `Sepal.Width`, `Petal.Length`, and `Petal.Width` for three `Species` of iris flowers.

In [None]:
iris = read.csv("./datasets/iris.csv")
iris %>% head()

### **Stacking LEGO Blocks**

Think of building a plot in `ggplot2` like stacking LEGO blocks, where each block adds something new.


#### **First LEGO Block: Data**

`iris %>% ggplot()` tells `ggplot2` that we will use data from `iris`.

In [None]:
iris %>%
  ggplot()

Alternatively, you can write this explicitly as:

In [None]:
ggplot(data = iris)

#### **Second LEGO Block: Aesthetics**

The aesthetics block `aes()` is like setting up the grid and rules of your LEGO baseplate — it defines how data variables map to visual properties on the plot.

For example:
- `x = Sepal.Width` defines what data goes along the horizontal axis (x-axis).
- `y = Petal.Length` defines what data goes along the vertical axis (y-axis).
- `color = Species` decides how points are colored based on their group.

Yes, `ggplot2` natively understands column names. In fact, `ggplot2` and `dplyr` belong to a bigger family of packages called the `tidyverse`.

In [None]:
iris %>%
  ggplot(aes(x = Sepal.Width, y = Petal.Length, color = Species))

#### **Third LEGO Block: The Actual Plot**

The third block defines the type of plot or visual representation you want — this is called a `geom` (geometric object). It tells `ggplot2` how to draw your data on the axes set up by the aesthetics.

Examples of `geom` include:
- `geom_point()` for scatter-plots
- `geom_line()` for line plots
- `geom_bar()` for bar charts
- `geom_histogram()` for histograms
- and many more.

Without adding a geom, your plot has no visual marks — just empty axes.

From now on, we use the `+` operator in `ggplot2` to add layers onto our plot one by one.

In [None]:
iris %>%
  ggplot(aes(x = Sepal.Width, y = Petal.Length, color = Species)) +
  geom_point(size = 4)
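
Swapping the geom changes the kind of plot while the data and aesthetics blocks keep the same form. The sketch below draws a histogram instead of a scatter plot. It assumes that R's built-in `datasets::iris` has the same columns as the course's `iris.csv`, and the choice of 20 bins is arbitrary.

```r
library(ggplot2)

# Built-in iris used as a stand-in for the course CSV (assumption)
p_hist <- ggplot(datasets::iris, aes(x = Sepal.Width, fill = Species)) +
  geom_histogram(bins = 20)  # a histogram only needs an x aesthetic

p_hist
```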

#### **Fourth LEGO Block: Customisation**

The fourth block in `ggplot2` is all about customising the appearance and style of your plot.

This includes things like:

- Adding titles, axis labels, and captions using `labs()`
- Changing the theme (background, grid lines, fonts) with functions like `theme_minimal()`, `theme_classic()`, or customising with `theme()`
- Adjusting scales for axes, colors, and sizes (e.g., `scale_color_manual()`, `scale_x_continuous()`)
- Adding facets to create small multiples (`facet_wrap()`, `facet_grid()`)

This block is like painting and decorating your LEGO model — after you’ve built the structure, you choose colors, textures, and details to make it look exactly how you want.

In [None]:
iris %>%
  ggplot(aes(x = Sepal.Width, y = Petal.Length, color = Species)) +
  geom_point(size = 4) +
  labs(
    title = "Petal Length vs Sepal Width",
    x = "Sepal Width (cm)",
    y = "Petal Length (cm)",
    color = "Species Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    legend.position = "top"
  )
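
Scales are added in the same way as the other customisation layers. As a sketch, `scale_color_manual()` below assigns a specific colour to each species. The colour choices are arbitrary, and the built-in `datasets::iris` is assumed to match the course's `iris.csv`.

```r
library(ggplot2)

# Built-in iris used as a stand-in for the course CSV (assumption)
p_manual <- ggplot(datasets::iris,
                   aes(x = Sepal.Width, y = Petal.Length, color = Species)) +
  geom_point(size = 4) +
  # Named vector: species level = colour (colours chosen arbitrarily)
  scale_color_manual(values = c(
    setosa     = "darkorange",
    versicolor = "purple",
    virginica  = "steelblue"
  )) +
  theme_minimal()

p_manual
```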


#### **Exercise**

Create a scatter plot of `Sepal.Length` vs `Sepal.Width`, colored by `Species`. Follow the same block-based `ggplot2` structure we discussed.


<details>
<summary>▶️ Click to show the solution</summary>

```r
iris %>%
  ggplot(aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point(size = 4) +
  labs(
    title = "Sepal Length vs Sepal Width",
    x = "Sepal Width (cm)",
    y = "Sepal Length (cm)",
    color = "Species Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    legend.position = "top"
  )
```

</details>

### **Visualising Subgroups with `facet_wrap()`**

Suppose you want a scatter plot of `Petal.Length` vs `Sepal.Width`, drawn separately for each species.

One approach is to split the dataset by `Species`, extract the `Petal.Length` and `Sepal.Width` columns for each group, create a separate scatter plot for each, and then combine or "stack" the plots to compare across species. This is not convenient.

`facet_wrap()` simplifies the process.

In [None]:
iris %>%
  ggplot(aes(x = Sepal.Width, y = Petal.Length)) +
  geom_point(color = "steelblue", size = 2) +
  facet_wrap(~ Species) +
  theme_minimal() +
  labs(
    title = "Petal Length vs Sepal Width by Species",
    x = "Sepal Width",
    y = "Petal Length"
  )

### **When Long-Format Data Are Needed**

Suppose you want to plot `Sepal.Length` and `Petal.Length` values on the y-axis against `Sepal.Width` on the x-axis separately.

Wide-format `iris` won't work in this case because `Sepal.Length` and `Petal.Length` are different columns — `ggplot` doesn’t know how to treat them as the same variable. Here, we can't set `y = c("Sepal.Length", "Petal.Length")` for example — it simply doesn't work.

**Solution**: We need to convert `iris` to long-format.

In [None]:
iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length, Species) %>%
  pivot_longer(
    cols = c(Sepal.Length, Petal.Length),
    names_to = "Measurement",
    values_to = "Length"
  ) -> long_iris

long_iris %>%
  head()

Now `Length` can be mapped to the y-axis, and `Measurement` can be used to split the plot into one panel per measurement.

In [None]:
long_iris %>%
  ggplot(aes(x = Sepal.Width, y = Length, color = Species)) +
  geom_point(size = 4) +
  facet_wrap(~ Measurement) +
  labs(
    title = "Sepal and Petal Length vs Sepal Width by Measurement",
    x = "Sepal Width",
    y = "Length (cm)"
  ) +
  theme_minimal()+
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    legend.position = "top"
  )

## **Workshop Questions**


### **EPA Fuel Economy Dataset**

This dataset contains information on over 13,500 cars sold in the US from 2010 to 2020, including measurements and characteristics related to vehicle fuel economy and specifications. The data are sourced from the [US Fuel Economy website](https://www.fueleconomy.gov/feg/download.shtml).

| Variable | Description                                    |
|----------|------------------------------------------------|
| `city`   | EPA measured fuel economy in miles per gallon (city driving) |
| `hwy`    | EPA measured fuel economy in miles per gallon (highway driving) |
| `cyl`    | Number of cylinders in the engine              |
| `disp`   | Engine displacement (litres)                    |
| `drive`  | Vehicle drivetrain layout (e.g., FWD, RWD, AWD) |
| `make`   | Vehicle manufacturer name                       |
| `model`  | Vehicle model name                              |
| `trans`  | Transmission type (manual or automatic)        |
| `year`   | Vehicle model year                              |


In [None]:
epa_data = read.csv("./datasets/epa_data.csv")
str(epa_data)

### **Question 1**

Suppose you want to compare the fuel economy in city driving between manual and automatic transmissions using the EPA dataset.
- What type of graphical summary would best display this comparison?
- Use `ggplot` to produce this visualisation.

**Hint**:
- Only specify the x-axis inside `ggplot(aes())`.
- Use `geom_histogram(aes(y = after_stat(density)))` to plot normalised histograms for comparison (y-axis is defined here).
- Use `facet_wrap()` to create small multiples.
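
The pattern described in the hint can be sketched on a different dataset so that the workshop question stays open. The example uses the built-in `datasets::iris` (an assumption, standing in for any data with a numeric variable and a grouping variable), and the bin count is arbitrary.

```r
library(ggplot2)

# Only x is set in ggplot(aes()); y is computed by the histogram layer
p_density <- ggplot(datasets::iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15) +
  facet_wrap(~ Species)  # one panel per group

p_density
```

Plotting densities rather than counts keeps panels comparable even when the groups contain different numbers of observations.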

<details>
<summary>▶️ Click to show the solution</summary>

```r
Solution will be released at the end of the week!
```

</details>

### **Question 2**


Suppose you want to compare the combined fuel economy (city and highway driving) between manual and automatic transmissions using the EPA dataset.

- What steps would you take to prepare the data?
- What type of graphical summary would best display this comparison?
- Use `ggplot` to produce this visualisation.

**Hint**: We need a new long-format data frame.


<details>
<summary>▶️ Click to show the solution</summary>

```r
Solution will be released at the end of the week!
```

</details>

### **Question 3**

Suppose that you want to explore how engine displacement changed over time.

- What type of graph or chart would you use and why?
- Use `ggplot` to produce this graphical summary.

**Hint**:
- Use `stat_summary(fun = "mean", geom = "line")` instead of `geom_line()`.
- A simpler approach is to use `group_by() %>% summarise()`. Use `na.rm = TRUE` option in `mean()`.
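
Both approaches from the hint can be sketched on toy data so that the workshop question stays open. The data frame `toy` and its columns `year` and `value` are hypothetical; one value is missing on purpose to show why `na.rm = TRUE` is needed.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data with one missing value
toy <- data.frame(
  year  = c(2010, 2010, 2011, 2011, 2012, 2012),
  value = c(2.0, 4.0, 3.0, NA, 1.0, 2.0)
)

# Option 1: let ggplot2 compute the mean for each x value
p1 <- ggplot(toy, aes(x = year, y = value)) +
  stat_summary(fun = "mean", geom = "line")

# Option 2: summarise first, then draw an ordinary line
yearly <- toy %>%
  group_by(year) %>%
  summarise(mean_value = mean(value, na.rm = TRUE), .groups = "drop")

p2 <- ggplot(yearly, aes(x = year, y = mean_value)) +
  geom_line()

yearly$mean_value  # 3.0 3.0 1.5
```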

<details>
<summary>▶️ Click to show the solution </summary>

Solution will be released at the end of the week!

</details>

### **Question 4**

Suppose you want to identify which manufacturers produced the most fuel-efficient cars, based on city-driving fuel economy (`city`) in the `epa_data` dataset.

- Use `group_by() %>% summarise()` to summarise `city` by manufacturer (`make`). Name the summary column `mean_mpg`.

<details>
<summary>▶️ Click to show the solution </summary>

Solution will be released at the end of the week!

</details>

- Use `ggplot` to create a Pareto plot (i.e., a sorted bar chart) to support your analysis. Is the plot descriptive enough?

**Hint**: Use `ggplot(aes(x = reorder(make, -mean_mpg)))` to put `make` on the x-axis, sorted by `-mean_mpg` (i.e., larger values come first).
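
How `reorder()` sorts the bars can be sketched on a toy summary table. The manufacturers and values below are made up; only the column names `make` and `mean_mpg` come from the question.

```r
library(ggplot2)

# Hypothetical summary table (values are made up)
toy <- data.frame(make = c("A", "B", "C"), mean_mpg = c(20, 35, 27))

# reorder() turns make into a factor whose levels are sorted by -mean_mpg
sorted_levels <- levels(reorder(toy$make, -toy$mean_mpg))
sorted_levels  # "B" "C" "A"

# The bars follow the factor levels, so the largest mean_mpg comes first
p_sorted <- ggplot(toy, aes(x = reorder(make, -mean_mpg), y = mean_mpg)) +
  geom_col()

p_sorted
```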

<details>
<summary>▶️ Click to show the solution </summary>

Solution will be released at the end of the week!

</details>

- How might you improve the previous plot to better answer the question?

**Hint**: Use `slice_max(order_by = mean_mpg, n)` to select the top `n` manufacturers by `mean_mpg`.

<details>
<summary>▶️ Click to show the solution </summary>

Solution will be released at the end of the week!

</details>

Solution notebook has been published! See [Week 1 Solutions](https://colab.research.google.com/github/edelweiss611428/MXB107-Notebooks/blob/main/notebooks/solutions/Week_1_Solutions.ipynb#scrollTo=aT7mowtWOSOR)!