#### **Tidyverse: readr, dplyr, tidyr, purrr**
###### **Ní Leathlobhair lab, Trinity College Dublin 2025**

###### *Designed by Jorge G. García, 2025*
---
Welcome to this short introduction to tabular data manipulation with the **[Tidyverse](https://www.tidyverse.org/)** in **R**. While R remains one of the most powerful and versatile tools for statistical computing, it has often been criticized for its inconsistent syntax and steep learning curve. **Tidyverse** comes to the rescue, addressing these challenges through a coherent suite of packages that promote **readable**, **consistent**, and **expressive** code for data manipulation, visualization, and analysis.

But more than just a collection of tools, the **Tidyverse** embodies a philosophy of data science. It encourages working with data in a **tidy**, **rectangular** form where each **variable is a column**, each **observation a row**, and each **type of observational unit a table**. This consistent data structure, paired with human-readable verbs like [`filter()`](https://dplyr.tidyverse.org/reference/filter.html), [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html), and [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.html), makes code both expressive and easy to reason about. The **Tidyverse** favors **declarative pipelines** over procedural loops, aiming to mirror how analysts think about data.

As we begin this journey, we'll learn how to transform a chaotic, messy dataset—the norm rather than the exception in real-world data—into something structured and insightful. We'll do this programmatically, but in a way that's *intuitive*, *coherent*, and *fluent*, thanks to the **Tidyverse**’s design. At the core of this workflow is the powerful **pipe operator** ([`%>%`](https://magrittr.tidyverse.org/reference/pipe.html)), which lets us chain operations together in a readable, step-by-step flow. It's not just a convenience—it's the beating heart of the **Tidyverse** paradigm, enabling data transformations that feel natural and expressive.

This workshop is designed to equip you with tools and techniques that feel *intuitive* and *natural* to use—tools that might even change how you see R. I’ll admit, I was once firmly in the **R-skeptic** camp, but this approach transformed my experience: its **clarity**, **consistency**, and **expressive syntax** made R not just usable, but genuinely enjoyable—and my go-to language for anything involving **tabular data**. I haven’t looked back since.

Even though we will use **real-world data** throughout most of the exercise, it is directed towards a wide audience and doesn't require any previous knowledge of **bioinformatic pipelines**, or even **R**! If you have no experience with code whatsoever, *fret not*! This exercise is designed to lead you effortlessly straight to the end. If for some reason you break it, restart and *voilà*. If you are already familiar with programming, this exercise should be of use to understand how **tabular data**, whatever its origin, can be manipulated in the **Tidyverse way**. Feel free to tinker with other parameters and explore how the code works!

Last but not least, across the exercise you will see many **hyperlinks** scattered across the sections. Click on them to access **extra resources** that will enrich your experience, add context or reveal interesting facts about the lesson.

With all this in mind, take a deep breath and press the **Run** button in the first cell. Welcome to the **Tidyverse**.


In [None]:
suppressMessages({
    library(arrow)
    library(readr)
    library(dplyr)
    library(tidyr)
    library(purrr)
})

**Congratulations!** You just imported the necessary libraries to perform all the analysis in this exercise. 

In programming, a **library** is a collection of precompiled routines—sets of instructions—that a program can use. These routines are designed to accomplish specific tasks, such as **handling files and data**, **performing mathematical computations**, or **managing network connections**—and that is exactly what we are going to do with these. Many of the libraries we will be using, like [`dplyr`](https://dplyr.tidyverse.org/) for **tabular data manipulation** or the powerful [`ggplot2`](https://ggplot2.tidyverse.org/) (next tutorial) for **plot generation**, are used everywhere across the world in many different fields, from **Google Analytics** to **epidemiology**.

But where is the **Tidyverse**? Well, you see, these are the core components of the Tidyverse:

| Package       | Description                                             |
|---------------|---------------------------------------------------------|
| [`ggplot2`](https://ggplot2.tidyverse.org/)   | Grammar of graphics for data visualization        |
| [`dplyr`](https://dplyr.tidyverse.org/)       | Data manipulation (`filter`, `mutate`, `summarise`) |
| [`tidyr`](https://tidyr.tidyverse.org/)       | Reshaping and tidying data                        |
| [`readr`](https://readr.tidyverse.org/)       | Fast and consistent file I/O                      |
| [`tibble`](https://tibble.tidyverse.org/)     | Modern reimagining of the dataframe               |
| [`purrr`](https://purrr.tidyverse.org/)       | Functional programming with lists and vectors     |
| [`stringr`](https://stringr.tidyverse.org/)   | Consistent, easy string manipulation              |
| [`forcats`](https://forcats.tidyverse.org/)   | Tools for working with categorical variables      |
| [`lubridate`](https://lubridate.tidyverse.org/)| Tools for date-time data handling                |

But we could very much just do:


In [None]:
suppressMessages({
    library(tidyverse)
})

And it would work just as well. So why don’t we always do it that way?

Well—if you need to hammer a nail into a wall, do you bring the entire toolbox, or just the hammer?

The **Tidyverse** is a collection of packages designed to work together seamlessly, but that doesn’t mean you need to load all of them every time. In practice, it's often better to load only the packages you need for the task at hand. This keeps your code **cleaner**, **leaner**, and **easier to reason about**—especially in **collaborative** or **production environments**.

That said, these libraries—when combined—cover the overwhelming majority of **tabular data manipulation tasks** a data scientist is likely to encounter. Their shared design principles mean that you can confidently **mix and match** them, chaining operations across packages without breaking mental flow. It’s not about loading everything—it’s about having the **right tool**, when you need it, ready to fit naturally into the pipeline.

The **Tidyverse** is the result of a broader movement within the R community led by [Hadley Wickham](https://hadley.nz/), a statistician and software engineer who recognized the need for **consistency**, **expressiveness**, and **usability** in data science workflows within R. Prior to the **Tidyverse**, R was powerful but incoherent:


- Functions for similar tasks had wildly inconsistent interfaces.

- Base R’s syntax was often cryptic or verbose.

- There were multiple ways to do everything, but no clear best practices.

- Data cleaning and manipulation—often the bulk of data work—was painful.




We are ready to start our analysis. What will we be analysing? This is your story: you have just been accepted to the prestigious **[Trinity College Dublin](https://www.tcd.ie/)**, and you are eager to start your new life as a student. But your first challenge lies outside the campus walls. First stop: the **Irish accommodation market**. You need to find a place to live, and you want to make sure you get the **best deal possible**.




In [None]:
data <- readr::read_csv('irish_accomodations_augmented.csv', show_col_types = FALSE)

So far so good, right?

Well, kind of.

Well, actually, no.

Let's take a look at some qualities of our loaded dataset:

In [None]:
head(data)

From the outset, we can see several things that would put any data scientist worth their salt on alert:

- Multiple types of data: **Strings**, **decimals** (*floats* or *doubles*), **integers**, **lists**.
- Liberal use of **upper case**, both in features (column names) and in the data.
- `NA` (*Null*) values present in several rows.  

Let’s go with the first (and most urgent) one.  
The parser has done an admirable job of guessing a type for each column (`<dbl>`, `<chr>`),  
but many of those guesses are either unhelpful or outright wrong.

So, where’s the problem?

It’s the **CSV file**.  
It’s *always* been the CSV file.

---

**CSV files** are the data scientist’s perennial source of headaches.  
They are a **messy way of storing complex datasets** for mainly (but not only) the following reasons:

- **Everything** in them is stored as **plain text**.  
  That means **hidden unicode characters**, **linebreaks**, **invisible tab spaces** — the list goes on and on.

- **Everything** in them is stored as **plain text**.  
  That means **no compression** whatsoever. Every value is spelled out character by character, which makes files larger and slower to parse.

- **Everything** in them is stored as **plain text**.  
  That means **no data types**.  
  There are no **numbers**, no **integers**, no **factors**, no **lists**.  
  Everything in them is a **character**.  
  You wanted to keep those **three decimals** in the second column?  
  Too bad! Depending on the next interpreter, it may decide **two are enough**.

- **Everything** in them is stored as **plain text**.  
  **No traceability** whatsoever.  
  It’s just impossible to reverse-engineer what type of data each feature contained when this dataset was turned into a **text file**.  
  We are off to a bad start as far as **reproducibility** goes!
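
To make the "no data types" point concrete, here is a minimal sketch of a round trip through a CSV file. The `toy` tibble and its three columns are invented for illustration and are not part of the accommodation dataset.

```r
library(readr)
library(tibble)

# Hypothetical toy data: a factor, an integer and a double column
toy <- tibble(
  rating = factor(c("3", "5", "4")),
  beds   = c(2L, 4L, 1L),
  price  = c(750.125, 980.5, 640)
)

# Write it to a temporary CSV and read it straight back
path <- tempfile(fileext = ".csv")
write_csv(toy, path)
back <- read_csv(path, show_col_types = FALSE)

# Compare the column classes before and after the round trip
sapply(toy, class)
sapply(back, class)
```

After the round trip, `rating` is no longer a factor and `beds` is no longer an integer: the file only stored text, so `readr` had to guess the types again.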

---

You are a perceptive fellow (you’ve been accepted at [TCD](https://www.tcd.ie/), after all), so you start to pinpoint where the trouble lies.

Let’s make it clear one last time:

### **Plain text bad**

---

Okay, but you must be wondering:

> *"It can’t be that bad, right? It’s the format I’ve used my whole life—and the one every course, teacher, and tutorial insisted on."*

Well, you are **partially right**.

Perhaps data types in your particular dataset are not very difficult to infer or reconstruct.  
Perhaps most people simply don’t care if an invisible space gets in the way or the file is 2MB heavier.

But we are **data scientists**.  
Our data might be messy (it usually is), but we are **precise**.  
If we say a column contains **three decimals**, there **must** be three decimals.

---

Let’s try something else:


In [None]:
data <- arrow::read_parquet('irish-accomodations-augmented.parquet')

head(data)

Notice anything different? Here’s what’s changed:

- We now have **many more data types**: `<int>`, `<dbl>`, `<chr>`, `<fct>`.  
  We even have some odd ones like `<list<dbl>>`. Would you have noticed without additional info?

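- Parquet is a **columnar** format, so it lends itself to **_lazy loading_**.  
  What does that mean?  
  You can load **only the parts of the dataset you're actually using** (for example, a handful of columns) and nothing more.  
  Note that `read_parquet()` as called above still reads the whole file into memory; to read selectively, pass `col_select`, or open the file with `arrow::open_dataset()` and let `dplyr` verbs decide what gets read.  
  With a `.csv`, the parser has to work through the **entire** text file even when you only need a few columns.

- The generated object is still a **`tibble`** (just like the one returned by `readr`), not a base `data.frame`.  
  A tibble is a **modern, strict, and tidy-aware** form of tabular data that’s easier to read and work with.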

- The file size is **~3× smaller** than its `.csv` equivalent.

- And while it might not be obvious at this scale, the file has been loaded **much** faster than the `.csv` version.
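
As a rough illustration of these points, the sketch below writes a small made-up tibble (`toy`, not the accommodation data) to a temporary Parquet file and reads it back. It assumes the `arrow` package is installed with Parquet support.

```r
library(arrow)
library(tibble)

# Hypothetical toy data: a factor, an integer and a double column
toy <- tibble(
  rating = factor(c("3", "5", "4")),
  beds   = c(2L, 4L, 1L),
  price  = c(750.125, 980.5, 640)
)

# Write it to a temporary Parquet file and read it back
path <- tempfile(fileext = ".parquet")
write_parquet(toy, path)
back <- read_parquet(path)

# The column classes are stored in the file, so nothing has to be guessed
sapply(back, class)

# col_select reads only the requested columns from the file
only_price <- read_parquet(path, col_select = "price")
names(only_price)
```

Unlike a CSV round trip, the factor and integer columns come back with their original types, and `col_select` lets us pull a single column without materialising the rest.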

---

Now imagine you're working with a massive **genomic dataset**: Tens of millions of rows, hundreds of columns.  
Chances are your laptop can’t even load the full CSV into memory — and if it does, you’ll lose metadata and pay the price in speed.

This is where **Parquet** comes in. Enter: `arrow` and the Parquet format.

**Officially endorsed by Hadley Wickham** (chief tidyverse author), the [`arrow`](https://arrow.apache.org/docs/r/) library is the **go-to** for working with `.parquet` files in R.  
These files belong to a family of **columnar binary formats**. They are not human-readable, but they are:

- **Highly compressed**
- **Efficiently queryable**
- **Rich in metadata**
- **Built for scale**

Something as simple as switching to Parquet can **speed up and lighten your workflows by orders of magnitude**, especially as data generation accelerates in the coming years.

---

But enough about file formats. Let's look at the data once more:

In [None]:
head(data)

Our first stop is the incredibly powerful [`dplyr`](https://dplyr.tidyverse.org/) (pronounced “**dee-ply-er**”) library from the **Tidyverse**.  
`dplyr` contains several commands and options for **tabular data manipulation**.

For example, notice another flaw in our dataset: **The column names**.

In computer science in general, **simple usually beats complex** nine out of ten times.  
In feature and file naming, it's actually **ten out of ten** times. This is because:

- **Longer names** → Higher chance of human error  
- **More case types** → Higher chance of human error  
- **Special characters** → Higher chance of human error  
- **Spaces** → Higher chance of human error + Computer mishandling  
- **More complex names** → Higher chance of human error + Variable confusion  

There are few adages repeated more often in **bioinformatics**, or in **data science** in general, than this one:  
**Clear, simple, informative names.**

So we are changing the column names to something **clear**, **simple**, and **informative**.  
Enter [`rename()`](https://dplyr.tidyverse.org/reference/rename.html):


In [None]:
data_renamed <- dplyr::rename(data, name = Name)

head(data_renamed)

Notice the straightforward syntax (`new_name = old_name`), which reads almost like **natural language**.  
We can rename as many columns as we want in a single call by separating the pairs with commas.  
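
As a small sketch of that syntax, the block below renames several columns at once on a made-up one-row table. The `toy` tibble and its values are hypothetical; only the three column names are borrowed from the dataset. It also shows `rename_with()`, a `dplyr` helper not used elsewhere in this tutorial, which applies a function to every column name.

```r
library(dplyr)
library(tibble)

# Hypothetical one-row table reusing three column names from the dataset
toy <- tibble(Name = "Example Hostel", Price_EUR = 900, AddressRegion = "Dublin")

# rename() takes new_name = old_name pairs, separated by commas
renamed <- rename(toy, name = Name, price = Price_EUR, region = AddressRegion)
names(renamed)

# rename_with() applies a function (here tolower) to every column name at once
lowered <- rename_with(toy, tolower)
names(lowered)
```

`rename_with()` is handy for fixing the case of every column in one go, but it cannot shorten names such as `AddressRegion`, which is why explicit pairs are still useful.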

Let’s finish with the rest of the columns:


In [None]:
data_renamed <- rename(data_renamed, 
                       url = Url,
                       telephone = Telephone,
                       longitude = Longitude,
                       latitude = Latitude,
                       region = AddressRegion,
                       locality = AddressLocality,
                       country = AddressCountry,
                       tags = Tags,
                       price = Price_EUR,
                       tag_region = Tag_Region,
                       rating = Rating,
                       unit = Unit,
                       section = Section,
                       c_list = ComplexList)

head(data_renamed)

Much better. What about some light filtering?

In [None]:
data_filtered <- filter(data_renamed, price <= 800)

head(data_filtered)

_Notice how our column remains with the new name `price`. Always remember to store your modifications within a variable (`data_XXX <- operation(data)`)_

`filter` lets us select rows based on comparison operators. This command kept every row where `price` was less than or equal to 800. We can specify more complex conditions using logical operators; conditions separated by commas are combined with AND:

In [None]:
data_filtered <- filter(data_renamed,
                        price >= 800 & price < 1000,   # AND operator
                        latitude > 45 | longitude < -5,    # OR operator
                        between(section, 10, 100))    # between function, defines a range

head(data_filtered)
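
To see how the conditions combine, here is a minimal sketch on a made-up table (`toy` and its values are hypothetical). It checks that separating conditions with commas gives the same rows as joining them with `&`, and then mixes AND and OR as in the cell above.

```r
library(dplyr)
library(tibble)

# Hypothetical listings with made-up prices and coordinates
toy <- tibble(
  name      = c("A", "B", "C", "D"),
  price     = c(700, 850, 950, 1200),
  latitude  = c(53.3, 44.0, 55.1, 54.9),
  longitude = c(-6.2, -8.5, -7.3, -4.0)
)

# Two conditions separated by a comma ...
with_commas <- filter(toy, price >= 800, price < 1000)

# ... select the same rows as the same conditions joined with &
with_and <- filter(toy, price >= 800 & price < 1000)

# Mixing AND (comma, &) with OR (|)
mixed <- filter(toy,
                price >= 800 & price < 1000,
                latitude > 45 | longitude < -5)

with_commas$name
with_and$name
mixed$name
```

Row `B` survives the mixed filter even though its latitude is below 45, because the OR condition only needs one of its two sides to be true.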

We can also perform categorical comparisons:

In [None]:
data_filtered <- filter(data_filtered,
                        tag_region == 'Camping_Donegal')    # Notice we are using "==", not "="

head(data_filtered)

Aren't there too many columns? We can easily fix that with the `select` function from `dplyr`, which lets us simply and intuitively pick the features we want:

In [None]:
data_selected <- dplyr::select(data_renamed,
                               name,
                               url,
                               telephone,
                               region,
                               locality,
                               price)

head(data_selected)

But selecting each column we want to keep one by one can quickly become tedious and, worse still, **error-prone**. `select` also allows us to "drop" features from the dataset just by adding a `-` symbol before the undesired columns:

In [None]:
data_selected <- dplyr::select(data_renamed,
                               -latitude,
                               -longitude,
                               -country,
                               -tags,
                               -tag_region,
                               -unit,
                               -section,
                               -c_list)

head(data_selected)

And of course, you can seamlessly select and drop in the same function call.

_1. You may have noticed that for some functions we use the prefix `dplyr::`, as in `dplyr::select`. This explicitly calls the `select` function from `dplyr`. "Select", being a rather common word, is also used as a function name by other packages, which leads to "namespace conflicts". They are a silent but vicious source of bugs, trust me!_

_2. Dropping is not always the shorter route: you may end up having to "deselect" more features than you would have had to select. Always strive to make your code both efficient and easy to read!_
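
As a sketch of selecting and dropping in the same call, the block below uses a made-up one-row table (`toy` is hypothetical; its column names mimic the renamed dataset). It also uses the range operator `:` and the `starts_with()` helper, two `select` features not covered above.

```r
library(dplyr)
library(tibble)

# Hypothetical one-row table with five of the renamed columns
toy <- tibble(
  name       = "Example Hostel",
  url        = "https://example.org",
  price      = 900,
  tags       = "hostel",
  tag_region = "Camping_Donegal"
)

# Keep the range name:price, then drop url from that range
kept <- select(toy, name:price, -url)
names(kept)

# Drop every column whose name starts with "tag"
tagless <- select(toy, -starts_with("tag"))
names(tagless)
```

Helpers like `starts_with()` can shorten a long list of dropped columns, at the cost of depending on a naming convention.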

---

We can also sort our data with the `arrange` function. By default, `arrange` sorts the rows in ascending order of the columns you give it: numerically for numbers, alphabetically for character columns, and by level order for factors. To sort our data by just one variable, we can do it like this: 

In [None]:
data_arranged <- arrange(data_renamed,
                         rating)

head(data_arranged)

We can sort the rows in descending order using the `desc()` helper function:

In [None]:
data_arranged <- arrange(data_renamed,
                         desc(rating))

head(data_arranged)

And of course, we can arrange the dataset by two or more features, in any order:

In [None]:
data_arranged <- arrange(data_renamed,
                         desc(rating),
                         price,
                         locality)

head(data_arranged)
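
Since `rating` is stored as a factor, it is worth seeing how `arrange` treats factors compared with plain text. The sketch below uses a made-up table (`toy`, with hypothetical labels `"low"`, `"mid"` and `"high"`) holding the same values once as a character column and once as a factor with explicit level order.

```r
library(dplyr)
library(tibble)

# Hypothetical ratings: the same labels as text and as an ordered set of levels
toy <- tibble(
  name       = c("A", "B", "C"),
  rating_chr = c("mid", "high", "low"),
  rating_fct = factor(rating_chr, levels = c("low", "mid", "high"))
)

# Character columns are sorted alphabetically: high, low, mid
by_chr <- arrange(toy, rating_chr)
by_chr$name

# Factors are sorted by their level order: low, mid, high
by_fct <- arrange(toy, rating_fct)
by_fct$name

# desc() reverses the level order: high, mid, low
by_fct_desc <- arrange(toy, desc(rating_fct))
by_fct_desc$name
```

This is one more reason to keep the type information that Parquet preserves: the level order of a factor carries meaning that a plain string does not.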

Now I’d like to introduce what is probably `dplyr`’s most used function: [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html).  
`mutate()` generates **a whole new column** based on the data available in other specified columns.

For example, let’s say we want to create a column with the price in **Chinese yuan (CNY)**.  
Assuming the current exchange rate is **1 EUR = 8.29 CNY**, the operation would go like this:


In [None]:
data_mutated_yuan <- mutate(data_renamed,
                            price_yuan = price * 8.29)

head(data_mutated_yuan)

Or we can define a new, more complex feature.  
For example, we can enrich our *rating* metric with a **normalized** version of the *price* metric.

Let’s define the **minimum** and **maximum** values of the *price* column,  
then use them to obtain a **rescaled** metric that we can add to our *rating* feature.


In [None]:
price_min <- min(data_renamed$price)

price_max <- max(data_renamed$price)

data_mutated_rating <- mutate(data_mutated_yuan,
                              rating_enriched = as.integer(rating) + ((price - price_min) / (price_max - price_min)))

head(data_mutated_rating)

There it is! Although you may be missing it because it's at the very far end of the table. Fret not! `relocate` comes to the rescue!

---

_Yes, there is a `dplyr` way to obtain the minimum and maximum values of a column. We will cover it later in this tutorial._

In [None]:
data_relocated <- relocate(data_mutated_rating,
                           price_yuan,
                           .after = price)

data_relocated <- relocate(data_relocated,
                           rating_enriched,
                           .before = rating)

head(data_relocated)

Of course, you can also reuse an existing column name, in which case the original column is overwritten with the new version. For example:

In [None]:
price_min <- min(data_renamed$price)

price_max <- max(data_renamed$price)

data_mutated_rating_original <- mutate(data_renamed,
                                rating = as.integer(rating) + ((price - price_min) / (price_max - price_min)))

head(data_mutated_rating_original)

_Observe how the new rating type went from `<fct>` to `<dbl>`. This is the type of metadata that we want to preserve whenever possible!_
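
One thing worth checking when converting a factor to a number: `as.integer()` on a factor returns the internal level codes, which only match the labels when the levels happen to be `"1"`, `"2"`, `"3"` and so on. Whether that matters for `rating` depends on its actual levels, which are not shown here; the vector below is made up for illustration.

```r
# Hypothetical ratings stored as a factor with numeric-looking labels
rating <- factor(c("3", "5", "4", "5"))

# as.integer() returns the position of each value among the levels "3", "4", "5"
codes <- as.integer(rating)

# Converting through character first recovers the numbers written in the labels
values <- as.numeric(as.character(rating))

codes   # 1 3 2 3
values  # 3 5 4 5
```

If `levels(data_renamed$rating)` shows labels that do not start at 1, `as.numeric(as.character(rating))` may be the safer conversion inside `mutate()`.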

There's a variant of `mutate`, `transmute`, that performs the exact same operation. The only difference is that it **returns only the columns it creates or modifies**. (Recent versions of `dplyr` mark `transmute` as superseded in favour of `mutate(.keep = "none")`, but it still works.) If we run the exact same command as before, but with `transmute` instead of `mutate`, this is what happens:

In [None]:
price_min <- min(data_renamed$price)

price_max <- max(data_renamed$price)

data_transmuted_rating_original <- transmute(data_renamed,
                                rating = as.integer(rating) + ((price - price_min) / (price_max - price_min)))

head(data_transmuted_rating_original)

Neat, isn’t it? Well,

### **Buckle up**

This is the perfect moment to introduce the hero of our story: the pipe operator [`%>%`](https://magrittr.tidyverse.org/reference/pipe.html).

This deceptively simple symbol is incredibly powerful. It allows us to **chain together any function** within the **entire Tidyverse ecosystem**,  
acting as seamless glue between libraries and operations. It turns step-by-step logic into clean, readable pipelines.

Let’s now perform **all** the previous operations in a **single command**.


In [None]:
data_transformed <- data_renamed %>%
                        filter(price >= 800 & price < 1000,
                               latitude > 45 | longitude < -5,
                               between(section, 10, 100)) %>%
                        select(-country,
                               -section) %>%
                        arrange(desc(rating),
                                price,
                                locality) %>%
                        mutate(price_yuan = price * 8.29,
                               rating_enriched = as.integer(rating) + ((price - price_min) / (price_max - price_min))) %>%
                        relocate(price_yuan, .after = price) %>%
                        relocate(rating_enriched, .before = rating)

head(data_transformed)

And as long as the operations are **logically valid**, you can keep passing the dataset to other functions **indefinitely**.  
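
To see what the pipe actually does, the sketch below writes the same three steps as nested calls and as a pipeline, on a made-up table (`toy` is hypothetical). `x %>% f(y)` is simply another way of writing `f(x, y)`. The last variant uses R's native pipe `|>`, which assumes R 4.1 or later.

```r
library(dplyr)
library(tibble)

# Hypothetical listings with made-up prices
toy <- tibble(
  name  = c("A", "B", "C", "D"),
  price = c(700, 850, 950, 1200)
)

# Nested calls: read from the inside out
nested <- head(arrange(filter(toy, price < 1000), desc(price)), 2)

# Piped calls: read from top to bottom
piped <- toy %>%
  filter(price < 1000) %>%
  arrange(desc(price)) %>%
  head(2)

# Native pipe (assumes R >= 4.1)
native <- toy |>
  filter(price < 1000) |>
  arrange(desc(price)) |>
  head(2)

nested$name
piped$name
native$name
```

All three return the two most expensive listings under 1000, `C` and `B`; only the readability changes.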

In the following tutorials, we’ll explore how the **exact same syntax** allows us to pipe this tibble directly into Tidyverse’s powerful [`ggplot2`](https://ggplot2.tidyverse.org/) library to start creating plots immediately.  
The same holds true for **any other library** within the broader **Tidyverse ecosystem**.

Even more exciting: in future practicals, we’ll learn how to apply this same syntax to **genomic ranges** using **[Bioconductor](https://www.bioconductor.org/)** — so make sure to commit it to memory!


Finally, let’s turn our attention to two powerful `dplyr` functions that are sure to become some of your most trusted allies when working with all kinds of data: `group_by` and `summarize`.

`group_by` is a somewhat unique function in the `dplyr` toolkit. On its own, it doesn’t **visibly change the data**. Instead, it silently records groups within your dataset based on the features you select. These groups then allow subsequent operations to be performed **independently** on each one. It’s a bit tricky to describe, and its learning curve is slightly steeper than that of most other `dplyr` functions: definitely a challenge for the bravest of data-driven house hunters!

In [None]:
data_group <- data_transformed %>%
                  group_by(region)

head(data_group)

See? Nothing happened.

The magic begins when we combine it with other functions—chief among them, [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.html):


In [None]:
data_group <- data_transformed %>%
                  group_by(region) %>%
                  summarize(min_price = min(price),
                            max_price = max(price))

data_group

The returned tibble is **not** our original—and that’s perfectly fine.  

We can use [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.html) to compute **all kinds of statistics** on our groups.  
Keep in mind: without [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.html), these operations are performed on **the entire dataset**.

Feel free to tinker with it and see what happens!

_1. See? I told you we can calculate mins and maxes with `dplyr`._
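
Here is a minimal sketch of a few more summaries on a made-up table (`toy` and its values are hypothetical). It uses `group_vars()` to reveal the grouping that `group_by()` records, `n()` to count rows per group, and `na.rm = TRUE` so that a missing price does not turn the whole summary into `NA`.

```r
library(dplyr)
library(tibble)

# Hypothetical listings; the NA stands in for a missing price
toy <- tibble(
  region = c("Dublin", "Dublin", "Donegal", "Donegal", "Kerry"),
  price  = c(950, 870, 640, NA, 720)
)

# group_by() alone only records the grouping ...
grouped <- group_by(toy, region)
group_vars(grouped)

# ... which summarise() then uses to compute one row per group
by_region <- grouped %>%
  summarise(
    n_listings = n(),
    mean_price = mean(price, na.rm = TRUE),
    .groups    = "drop"   # return an ungrouped tibble
  )

by_region
```

The `.groups = "drop"` argument is optional here; it just makes explicit that the result should no longer carry any grouping.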

---

Let’s take another look at our newly transformed dataset:


In [None]:
head(data_transformed)

You’re ready to start contacting some of the available options, sorted by **price** and **rating**.  

But wait—didn’t we have some **empty cells** in there?  
Let’s take a look:


In [None]:
data_transformed %>% filter(is.na(url))

See what we did there? Yes, you can use `filter` with **any** function, as long as it returns a logical vector (`TRUE` or `FALSE` for each row). Keep that in mind!

_1. Also notice that the result was printed straight away: since we didn't store it in a variable, there was nothing to call afterwards._
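
Filtering shows the affected rows for one column at a time. As a sketch of how to get an overview instead, the block below counts the `NA` values in every column of a made-up table (`toy` is hypothetical), using `summarise()` together with `across()`, two `dplyr` functions that work without any grouping.

```r
library(dplyr)
library(tibble)

# Hypothetical contact details with some missing values
toy <- tibble(
  name      = c("A", "B", "C"),
  url       = c("https://example.org/a", NA, "https://example.org/c"),
  telephone = c(NA, NA, "555-0101")
)

# For every column, count how many values are NA
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

The result is a one-row tibble with the same column names as the input, which makes it easy to spot the columns that need attention.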

---

You’re a thorough person—and you like your data thorough as well.  
Those `NA` values are looking at you funny, and you don’t like that.

In science, `NA` values range from **tiresome inconveniences** to **Lovecraftian nightmares** that can ruin your entire analysis.  
We don’t like them.


In [None]:
data_full <- data_transformed %>%
                tidyr::drop_na(url)

data_full %>% filter(is.na(url))

Welcome to your very first [`tidyr`](https://tidyr.tidyverse.org/) function.  
`tidyr` is a close cousin of `dplyr`, and a powerful toolkit for **data rearrangement**.  

Remember: full compatibility with `dplyr`—and almost any other Tidyverse function—is guaranteed by the [`%>%`](https://magrittr.tidyverse.org/reference/pipe.html) operator!

In this case, our previous command to detect `NA` values in the *url* column returns an **empty tibble**.  
[`drop_na()`](https://tidyr.tidyverse.org/reference/drop_na.html) took care of them!
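
The sketch below, on a made-up table (`toy` is hypothetical), shows the difference between dropping rows with a missing value in one column and dropping rows with a missing value anywhere, and compares `drop_na()` with the equivalent `filter()` call.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical contact details with some missing values
toy <- tibble(
  name      = c("A", "B", "C"),
  url       = c("https://example.org/a", NA, "https://example.org/c"),
  telephone = c(NA, "555-0100", "555-0101")
)

# Drop rows where url is missing
drop_url <- drop_na(toy, url)
drop_url$name

# With no columns named, drop rows with a missing value in ANY column
drop_any <- drop_na(toy)
drop_any$name

# The same rows as drop_na(toy, url), written with filter()
filtered <- filter(toy, !is.na(url))
filtered$name
```

Calling `drop_na()` without naming columns can discard far more rows than intended, so it is usually safer to name the columns you actually depend on.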

There are **many, many ways** to perform the same operation in R—but always remember the **first principle of engineering**:


$\pi \approx e \approx 3$

Nope! Not that one. This one:

### **Don’t reinvent the wheel.**

If an open-source solution used and scrutinized by a very large community every day **works**,  
it’s very likely to cover **more edge cases** and have **more robust fallback mechanisms** than your well-intentioned custom function.  
**Trust the Tidyverse.**

---

But data rarely comes with **clean column names**, the right number of `NA` values to make us feel good,  
or **neatly defined categorical features**.  
More often than not, we have to **create** or **reshape** them ourselves.

Introducing `tidyr`’s [`unite()`](https://tidyr.tidyverse.org/reference/unite.html) and [`separate()`](https://tidyr.tidyverse.org/reference/separate.html):


In [None]:
data_composite_location <- data_transformed %>%
                               unite(comp_location, longitude, latitude, sep = " & ", remove = FALSE)  # Observe the remove argument. What does it do?

head(data_composite_location)

See what we got there? A new column, made from two. Very useful, although often not as much as its counterpart:

In [None]:
data_separate_modality <- data_transformed %>%
                              separate(tag_region, into = c("mode", "region_2"), sep = "_")

head(data_separate_modality)

And now we have two new columns from one. Observe that we named the second one _region_2_ because we already had a _region_ column. Observe as well that we have to specify the new column names as strings (`"mode"`, `"region_2"`). Try using the name _region_ as the second name and see what happens!

_1. You are a smart fellow, so you have noticed outright that we didn't have to `mutate` the tibble to integrate the new columns._
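
As a sketch of how the two functions mirror each other, the block below splits a made-up `tag_region` column and glues it back together. `"Camping_Donegal"` is a value seen earlier; `"Hostel_Dublin"` is invented. The last call uses `separate_wider_delim()`, which assumes tidyr 1.3.0 or later, where it is offered as the successor to `separate()`.

```r
library(tidyr)
library(tibble)

# Hypothetical tag_region values ("Hostel_Dublin" is made up)
toy <- tibble(tag_region = c("Camping_Donegal", "Hostel_Dublin"))

# One column into two
sep <- separate(toy, tag_region, into = c("mode", "region_2"), sep = "_")
names(sep)

# Two columns back into one
back <- unite(sep, tag_region, mode, region_2, sep = "_")
back$tag_region

# Newer spelling of the same split (assumes tidyr >= 1.3.0)
wider <- separate_wider_delim(toy, tag_region, delim = "_",
                              names = c("mode", "region_2"))
names(wider)
```

Note that `unite()` only restores the original column because the separator never appears inside the pieces; values containing extra underscores would need more care.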

---

Last but not least, I’d like to introduce you to the [`purrr`](https://purrr.tidyverse.org/) package.

R is a language that lends itself particularly well to the **functional programming paradigm**.  
In this paradigm, you define **functions**, and then **map** those functions over your data (for instance, over every element of a column).  
Mapped functions offer an elegant alternative to classic loops, letting you remain in full control of the operation simply by inspecting the mapped function.

As an alternative to the well-known `apply` family in base R (`lapply`, `sapply`, etc.),  
`purrr` provides:

- **Safer execution**
- **Predictable return types**
- **Seamless integration** with other Tidyverse tools
- A **clean, readable syntax**

We’ll use it alongside our favorite function, [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html):


In [None]:
data_purrred <- data_transformed %>%
                    mutate(c_list_cv = map_dbl(c_list, ~ sd(.) / mean(.)))

head(data_purrred)

OK, what just happened? What’s that `~`? Why is there a dot out of nowhere? What’s even a *c_list_cv*?  
Let’s break it down:

- `~` is used in R… for a lot of things. But one of its most powerful use cases is for **anonymous functions**.  
  These are function definitions written on the fly—identical in purpose to [*lambda functions*](https://python-tutorials.in/python-lambda-functions-with-examples/) in Python.  
  They let us define a "single-use" function directly in place.  
  It’s a compact and expressive way to write logic without cluttering the code with one-off named functions.  
  Still, when in doubt, always favor clarity: **write the full function explicitly** if it improves legibility.

- `.` refers to the **current element** being processed.  
  In this case, the `c_list` column has type `<list<dbl>>`, so each of its elements is a numeric vector.  
  As [`map_dbl()`](https://purrr.tidyverse.org/reference/map.html) applies the function, `.` stands in for each of these vectors in turn.  
  The function then runs **vectorized operations** on them one by one. *(See why data types are a big deal yet?)*  
  Could you pull this off using just [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html)? Try it and find out—**exploration is part of the journey**.  
  And remember: every Tidyverse tool can be chained with `%>%`!

- What’s CV? The **coefficient of variation**, and here’s the formula:  
  $\text{CV} = \frac{\sigma}{\mu}$

- What’s `c_list`?  
  *I have no idea either!*
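
To tie the pieces together, here is a minimal sketch that computes the coefficient of variation three equivalent ways: with the `~` shorthand, with an explicit anonymous function, and with a named function. The `toy` tibble, its values, and the helper name `cv` are made up for illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical list-column: each element is a numeric vector
toy <- tibble(
  name   = c("A", "B"),
  c_list = list(c(1, 2, 3), c(10, 10, 10))
)

# A named helper (hypothetical name) for the coefficient of variation
cv <- function(x) sd(x) / mean(x)

out <- toy %>%
  mutate(
    cv_formula = map_dbl(c_list, ~ sd(.) / mean(.)),              # ~ shorthand
    cv_lambda  = map_dbl(c_list, function(x) sd(x) / mean(x)),    # anonymous function
    cv_named   = map_dbl(c_list, cv)                              # named function
  )

out$cv_formula
out$cv_lambda
out$cv_named
```

For the first row, the standard deviation of `1, 2, 3` is 1 and the mean is 2, so the CV is 0.5; the second row has no spread at all, so its CV is 0. All three columns agree, and `map_dbl()` guarantees a plain numeric column rather than a list.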

---

With your newly polished dataset in hand, you’re more prepared than ever to tackle your housing hunt.  
**Good luck out there—we’ll meet again soon to learn how to visualize your findings!**

---

This concludes our brief introduction to the **Tidyverse** and some of its most essential functions.  
The good news? We’ve only scratched the surface.  
There’s a vast landscape of tools waiting to be explored—entire libraries packed with functionality.  
(If you’re curious, [`stringr`](https://stringr.tidyverse.org/) is a great next stop, especially for mastering **regular expressions** and **string manipulation**.)

Many of the functions we’ve just encountered have **alternate forms**, **variants**, and **deep parameter sets** that let you refine behavior with surgical precision.  
I wasn’t exaggerating: the Tidyverse really does contain **most of the tools a data scientist needs** in their day-to-day work.

The bad news?  
We’ve only scratched the surface.  

Like with any form of coding, true fluency comes not from reading, but from **doing**—your fingers will learn before your brain does.  
And with great power comes great… debugging.  
Some errors will silently haunt your code, leaving you bewildered for days, only to be traced back to a **missing comma**.  
You’ll be **awed** by what the Tidyverse can do, and occasionally **infuriated** by its more cryptic behavior.  
**Be patient.** This toolkit powers insights at some of the world’s most advanced organizations.  
If it doesn’t work the way you want, don’t blame the hammer—**learn how to swing it better**.

As you grow more comfortable, you’ll realize the **Tidyverse isn’t just a collection of packages—it’s a philosophy**.  
You’ll stop thinking in terms of functions and start thinking in terms of **transformations**.  
You’ll begin to **intuit how tools can chain together**, how **pipelines** can flow, and how operations can be reduced to their most **expressive form**.  
Eventually, you’ll reach the point where you use R not just as a programming language,  
but as a **language for thinking statistically**.  
And that’s exactly what it was built for.

---

You’ve now taken your first confident steps into the **Tidyverse**—but this is just the beginning.  
The ecosystem is vast, the tools are deep, and your journey toward fluent, expressive data science in R has (perhaps) only just started.

Below is a curated set of **hand-picked resources** to help you go further, deeper, and faster.

---

### 📚 **Official Documentation & Cheat Sheets**

- **Tidyverse Main Site**: [https://www.tidyverse.org](https://www.tidyverse.org)  
  Gateway to all packages, updates, and community links.

- **Tidyverse Cheat Sheets (PDF)**:  
  The fastest way to internalize syntax and workflows.  
  → [https://posit.co/resources/cheatsheets/](https://posit.co/resources/cheatsheets/)

- **R for Data Science** (2nd edition) by Hadley Wickham & Mine Çetinkaya-Rundel  
  The canonical book—free online!  
  → [https://r4ds.hadley.nz/](https://r4ds.hadley.nz/)

---

### 🧬 **Bioinformatics Extensions**

If you’ve come for the transcriptomics, you'll want to check out:

- **Bioconductor**:  
  The de facto standard for R-based bioinformatics workflows  
  → [https://www.bioconductor.org](https://www.bioconductor.org)

- **`tidybulk`**: Tidyverse-style RNA-seq analysis  
  → [https://stemangiola.github.io/tidybulk/](https://stemangiola.github.io/tidybulk/)

- **plyranges**: Tidyverse-style genomic range manipulation  
  → [https://www.bioconductor.org/packages/release/bioc/html/plyranges.html](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html)

---

### 🎓 **Learning by Doing**

- **Posit Cloud (formerly RStudio Cloud)**:  
  Write and run R code in the cloud, no setup required  
  → [https://posit.cloud](https://posit.cloud)

- **TidyTuesday**:  
  Weekly datasets + community visualizations  
  → [https://github.com/rfordatascience/tidytuesday](https://github.com/rfordatascience/tidytuesday)

- **Swirl**: Learn R inside R, interactively  
  → [https://swirlstats.com](https://swirlstats.com)

- **Kaggle**: Challenges and datasets (like the one you just used)  
  → [https://www.kaggle.com/](https://www.kaggle.com/)