
# 📥 **7.1 Introduction: Reading Data**

Working with datasets that come bundled with R packages is a fantastic way to learn data science tools. But sooner or later, you'll want to apply those skills to **your own data**. That's where this chapter comes in.

In this chapter, you'll learn the fundamentals of **reading data files into R**, focusing specifically on **plain-text rectangular data**, the most common and practical format you'll encounter in the real world.

You'll start with hands-on guidance for dealing with common data issues, including:
- **Column names**
- **Column types**
- **Missing values**

From there, you'll move on to more advanced (and very useful!) workflows:
- Reading data from **multiple files at once**
- **Writing data** from R back to disk
- **Handcrafting data frames** directly in R

By the end of this chapter, you'll be able to confidently move data **into and out of R**, which is a critical step in any real analysis pipeline.

---

## 📦 **7.1.1 Prerequisites**

This chapter focuses on the **readr** package, which provides fast, consistent tools for reading flat files into R.  
`readr` is part of the **core tidyverse**, so loading the tidyverse gives you everything you need.

From here on, we'll assume the tidyverse is available and ready to go.


In [None]:
library(tidyverse)


# 📄 **7.2 Reading Data from a File**

The most common rectangular data format you'll encounter is the **CSV file** (comma-separated values). CSV files store data in rows and columns, where:
- The **first row** usually contains column names (the header),
- Each **subsequent row** represents an observation,
- Columns are **delimited by commas**.

Once a CSV file exists in your project (typically inside a `data/` folder), you can read it into R using `read_csv()`. When you do, readr automatically:
- Uses the first row as column names,
- Guesses each column's type,
- Prints a message summarizing the number of rows and columns, the delimiter, and the guessed column specification.
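
As a minimal sketch of that behavior, the snippet below reads a small inline CSV instead of a real file. The data and the names `csv_text` and `scores` are made up for illustration; `I()` marks the string as literal data rather than a path, which assumes readr 2.0 or later.

```r
library(tidyverse)

# Hypothetical inline data standing in for a file such as "data/scores.csv"
csv_text <- "id,name,score
1,Ada,9.5
2,Grace,8"

# read_csv() prints a message describing the rows, columns, delimiter,
# and the type it guessed for each column
scores <- read_csv(I(csv_text))

# spec() returns the column specification readr settled on
spec(scores)

scores
```

Calling `spec()` after the import is a quick way to confirm that the guessed types match what you expect before moving on.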

---

## 🧭 Practical Data-Cleaning Workflow

After reading data, your *very next step* is almost always cleaning and standardizing it so it's easier to analyze. Common tasks include:

- **Handling missing values**  
  Some datasets use strings like `"N/A"` instead of real `NA`s. You can tell `read_csv()` exactly which values should be treated as missing.

- **Fixing column names**  
  Columns with spaces or symbols become *non-syntactic names*, requiring backticks. These are best cleaned immediately using either `rename()` or `janitor::clean_names()`.

- **Correcting column types**  
  - Categorical variables → factors  
  - Numeric values stored as text → numbers  
  - Inconsistent entries (e.g., `"five"` instead of `5`) → standardized values

This early cleanup prevents subtle bugs and confusion later.
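
The three cleanup steps can be tried on a small inline example. This is only a sketch: the data and the names `raw` and `cleaned` are hypothetical, and it uses `rename()` with backticks rather than `janitor::clean_names()` so that it needs nothing beyond the tidyverse.

```r
library(tidyverse)

# Hypothetical data with non-syntactic names, an "N/A" marker, and a stray "five"
raw <- read_csv(
  I("Student ID,Full Name,AGE
1,Sunil Huffmann,4
2,Barclay Lynn,N/A
3,Jayendra Lyne,five"),
  na = c("N/A", ""),        # treat "N/A" and empty fields as missing
  show_col_types = FALSE
)

cleaned <- raw |>
  # Non-syntactic names must be wrapped in backticks to refer to them
  rename(student_id = `Student ID`, full_name = `Full Name`, age = AGE) |>
  # AGE was read as character because of "five"; repair it, then convert
  mutate(age = parse_number(if_else(age == "five", "5", age)))

cleaned
```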

---

## ⚙️ Useful `read_csv()` Arguments

Beyond the file path, a few arguments handle most real-world cases:

- `na`: define which strings represent missing values  
- `skip`: ignore the first *n* lines (metadata)  
- `comment`: drop lines starting with a comment character  
- `col_names`: specify whether headers exist (or supply your own)

A neat trick: `read_csv()` can even read **inline CSV text**, which is great for examples and debugging.
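
The arguments above can be tried out on inline text. The snippet below is a small sketch with made-up data; the object names are illustrative, and `I()` and `show_col_types` assume readr 2.0 or later.

```r
library(tidyverse)

# skip: ignore the first two lines of metadata before the header row
with_meta <- read_csv(
  I("The first line of metadata
The second line of metadata
x,y,z
1,2,3"),
  skip = 2,
  show_col_types = FALSE
)

# comment: ignore the line that starts with "#"
with_comment <- read_csv(
  I("# A comment I want to skip
x,y,z
1,2,3"),
  comment = "#",
  show_col_types = FALSE
)

# col_names = FALSE: the first row is data, so readr names the columns X1, X2, X3
no_header <- read_csv(I("1,2,3\n4,5,6"), col_names = FALSE, show_col_types = FALSE)

# col_names as a character vector: supply your own column names
named <- read_csv(
  I("1,2,3\n4,5,6"),
  col_names = c("x", "y", "z"),
  show_col_types = FALSE
)
```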

---

## 📁 Other File Types in `readr`

Once you understand `read_csv()`, the rest are easy:

- `read_csv2()`: semicolon-delimited files, which use `,` as the decimal mark  
- `read_tsv()`: tab-delimited files  
- `read_delim()`: arbitrary delimiters (it guesses the delimiter if you don't supply one)  
- `read_fwf()`: fixed-width files; `read_table()`: the common variant where columns are separated by whitespace  
- `read_log()`: Apache-style log files  

The interface stays consistent; only the delimiter logic changes.
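
To illustrate how little changes between these functions, here is a short sketch that reads the same kind of two-column table in three delimiter styles. The inline data and object names are invented for the example.

```r
library(tidyverse)

# Semicolon-delimited, with "," as the decimal mark
eu <- read_csv2(I("x;y\n1,5;2\n3,25;4"), show_col_types = FALSE)

# Tab-delimited
tabbed <- read_tsv(I("x\ty\n1\t2\n3\t4"), show_col_types = FALSE)

# Any other delimiter, here a pipe character
piped <- read_delim(I("x|y\n1|2\n3|4"), delim = "|", show_col_types = FALSE)

eu$x  # decimal commas are parsed as numbers: 1.5 and 3.25
```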

---

## 🧠 Key Takeaway

Reading data is not just about importing files; it's about **establishing clean, reliable structure** at the very beginning. If you standardize names, fix types, and handle missing values up front, the rest of your analysis becomes dramatically easier.


In [None]:
library(tidyverse)

# Read CSV and handle missing values
students <- read_csv(
  "data/students.csv",
  na = c("N/A", "")
)

# Clean names, fix types, and repair age values
students <- students |>
  janitor::clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = parse_number(if_else(age == "five", "5", age))
  )

students


# 🎛️ **7.3 Controlling Column Types**

CSV files don't store information about variable types, so **readr must guess** whether each column contains logicals, numbers, dates, or strings. This section explains how that guessing works, why it sometimes fails, and how to take control when needed.

---

## 🔍 7.3.1 Guessing Types

When you read a CSV, readr:
- Samples values from **1,000 rows per column**, spaced evenly from the first row to the last,
- Ignores missing values,
- Applies a set of rules in order:

1. Logical values only (`TRUE`, `FALSE`, `T`, `F`, ignoring case) → logical  
2. Numbers only (`1`, `-4.5`, `Inf`, `5e6`) → double  
3. ISO8601 format → date / datetime  
4. Otherwise → character  

This heuristic works well for clean data, but real-world data often breaks these assumptions.
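
One way to probe the heuristic on individual vectors is readr's `guess_parser()` helper, which is not mentioned above and is brought in here only for illustration. Treat it as an approximation: `read_csv()` does its guessing over a whole file, so edge cases may differ.

```r
library(readr)

# Each call returns the name of the parser readr would pick for that vector
guess_parser(c("TRUE", "FALSE"))      # "logical"
guess_parser(c("1", "-4.5", "5e6"))   # "double"
guess_parser("2021-01-15")            # "date"
guess_parser(c("abc", "def"))         # "character"
```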

---

## ⚠️ 7.3.2 Missing Values and Parsing Problems

A very common failure happens when **missing values are encoded unexpectedly** (e.g. `"."`, `"N/A"`, `"NULL"`). By default readr treats only empty fields and `NA` as missing, so a single unexpected marker in an otherwise numeric column makes readr fall back to character type.

To debug this, you can:
1. **Force a column type** using `col_types`
2. Inspect failures with `problems()`

Once you identify the offending value, you can usually fix the issue by telling readr which strings represent missing values using the `na` argument.

---

## 🧱 7.3.3 Column Types You Can Specify

readr provides several column type helpers, including:

- `col_logical()`, `col_double()`, `col_integer()`
- `col_character()` for IDs or codes that look numeric
- `col_factor()`, `col_date()`, `col_datetime()`
- `col_number()` for messy numeric data (e.g. currencies)
- `col_skip()` to ignore columns entirely

You can also:
- Override the default guessing for *all* columns using `.default`
- Read only selected columns using `cols_only()`

Taking control of column types is especially useful when working with large or messy datasets, where silent parsing errors can lead to subtle bugs later in your analysis.
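
Several of these helpers can be combined in a single `cols()` specification. The sketch below uses an invented orders table (`orders_csv` and its column names are hypothetical) to show leading zeros being preserved, currency being parsed, and a column being skipped.

```r
library(tidyverse)

# Hypothetical data: IDs with leading zeros and prices with currency symbols
orders_csv <- 'order_id,price,status,ordered_on,notes
00123,"$1,200.50",shipped,2021-01-15,fragile
00124,$15,pending,2021-02-01,none'

orders <- read_csv(
  I(orders_csv),
  col_types = cols(
    order_id   = col_character(),  # keep "00123" as text so the zeros survive
    price      = col_number(),     # drops "$" and the grouping comma
    status     = col_factor(levels = c("pending", "shipped")),
    ordered_on = col_date(),       # ISO8601 dates parse with the default format
    notes      = col_skip()        # leave this column out entirely
  )
)

orders
```

Spelling out the types this way also documents your expectations: if a later version of the file breaks them, the mismatch shows up in `problems()` instead of silently changing a column's type.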


In [None]:
library(tidyverse)

# Example: guessing column types
read_csv("
logical,numeric,date,string
TRUE,1,2021-01-15,abc
false,4.5,2021-02-15,def
T,Inf,2021-02-16,ghi
")

# Example: parsing problem caused by unexpected missing value
simple_csv <- "
x
10
.
20
30"

df <- read_csv(
  simple_csv,
  col_types = list(x = col_double())
)

# Inspect parsing problems
problems(df)

# Fix by specifying missing value
read_csv(simple_csv, na = ".")

# Override all column types
another_csv <- "
x,y,z
1,2,3"

read_csv(
  another_csv,
  col_types = cols(.default = col_character())
)

# Read only selected columns
read_csv(
  another_csv,
  col_types = cols_only(x = col_character())
)


# 📂 **7.4 Reading Data from Multiple Files**

Sometimes your data isn't neatly stored in a single file. Instead, it's **split across multiple files**, for example monthly sales data like `01-sales.csv`, `02-sales.csv`, and `03-sales.csv`. Luckily, **readr** makes it easy to read them all at once and **stack them into one tidy data frame**.

---

## 🧩 Reading Multiple Files at Once

If you pass a **vector of file paths** to `read_csv()`, it will:
- Read each file,
- Stack the rows on top of each other,
- Return a single tibble.

By using the **`id` argument**, you can also keep track of **which file each row came from**, which is extremely useful when the original files don't contain an identifying variable.
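
Here is a self-contained sketch of that behavior. It writes two small, made-up monthly files to a temporary directory first, so it runs without any files in `data/`; the values and object names are illustrative only.

```r
library(tidyverse)

# Create two hypothetical monthly files in a temporary directory
tmp <- tempfile("sales-demo-")
dir.create(tmp)
jan <- file.path(tmp, "01-sales.csv")
feb <- file.path(tmp, "02-sales.csv")
write_csv(tibble(brand = c(1, 2), n = c(10, 20)), jan)
write_csv(tibble(brand = c(1, 2), n = c(30, 40)), feb)

# Passing a vector of paths stacks the rows; `id` adds a column holding the source path
sales <- read_csv(c(jan, feb), id = "file", show_col_types = FALSE)

sales
```

The `file` column comes first in the result, and each row carries the path it was read from.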

---

## 🌐 Local Files vs URLs

This approach works whether:
- The files live in a local project directory (e.g. `data/`)
- Or the files are hosted online and accessed via URLs

In both cases, the workflow is exactly the same.

---

## 🔍 Automatically Finding Files

When you have **many files**, writing them all out manually is tedious and error-prone. Instead, you can use `list.files()` to:
- Search a directory,
- Match file names using a **pattern**,
- Automatically return all relevant file paths.

This scales cleanly as your project grows and keeps your code flexible and reproducible.
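
As a runnable sketch of the pattern matching, the snippet below creates a temporary directory with two made-up sales files plus one unrelated file, then shows that only the sales files are picked up. All file names and values are hypothetical.

```r
library(tidyverse)

# Set up a temporary directory with hypothetical files
tmp <- tempfile("sales-demo-")
dir.create(tmp)
write_csv(tibble(n = 1), file.path(tmp, "01-sales.csv"))
write_csv(tibble(n = 2), file.path(tmp, "02-sales.csv"))
write_csv(tibble(n = 99), file.path(tmp, "inventory.csv"))  # should not match

# The pattern is a regular expression: file names ending in "sales.csv"
found <- list.files(tmp, pattern = "sales\\.csv$", full.names = TRUE)
basename(found)

# The resulting vector of paths goes straight into read_csv()
all_sales <- read_csv(found, id = "file", show_col_types = FALSE)
```

Setting `full.names = TRUE` matters here: without it, `list.files()` returns bare file names that `read_csv()` would look for in the working directory.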


In [None]:
library(tidyverse)

# Manually listing multiple CSV files
sales_files <- c(
  "data/01-sales.csv",
  "data/02-sales.csv",
  "data/03-sales.csv"
)

read_csv(sales_files, id = "file")

# Reading the same files directly from URLs
sales_files <- c(
  "https://pos.it/r4ds-01-sales",
  "https://pos.it/r4ds-02-sales",
  "https://pos.it/r4ds-03-sales"
)

read_csv(sales_files, id = "file")

# Automatically finding all sales CSV files in a directory
sales_files <- list.files(
  "data",
  pattern = "sales\\.csv$",
  full.names = TRUE
)

sales_files


# 💾 **7.5 Writing Data to a File**

The **readr** package doesn't just help you read data; it also provides tools to **write data back to disk**. This is essential for saving results, sharing data, or caching intermediate steps in your analysis.

---

## 🧾 Writing Plain-Text Files

The two most common functions are:

- **`write_csv()`**: writes comma-separated values  
- **`write_tsv()`**: writes tab-separated values  

The key arguments are:
- **`x`**: the data frame to save  
- **`file`**: the file path where the data will be written  

You can also control how **missing values** are written with `na`, and whether to **append** to an existing file.
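
A brief sketch of those two options, using a temporary file and invented values: the first call writes missing values as the string `"missing"`, and the second appends a row to the same file.

```r
library(tidyverse)

path <- tempfile(fileext = ".csv")

# na: choose the string written in place of missing values
write_csv(tibble(x = c(1, NA), y = c("a", "b")), path, na = "missing")

# append = TRUE: add rows to the existing file; the header is not written again
write_csv(tibble(x = 3, y = "c"), path, append = TRUE)

# Inspect the raw lines of the file
lines <- read_lines(path)
lines
```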

⚠️ **Important caveat:**  
When you save data to a CSV file, **column type information is lost**. When you read the file back in, R must guess the types again, which may not match the original object (e.g., factors becoming characters).

This makes CSV files **less reliable for caching interim results** during analysis.
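
The loss of type information is easy to see in a round trip. This sketch uses a small made-up tibble with a factor column and a temporary file; the names `plans` and `plans_back` are illustrative.

```r
library(tidyverse)

# A hypothetical tibble with a factor column
plans <- tibble(
  id = c(1, 2, 3),
  meal_plan = factor(c("Lunch only", "Breakfast and lunch", "Lunch only"))
)

csv_path <- tempfile(fileext = ".csv")
write_csv(plans, csv_path)
plans_back <- read_csv(csv_path, show_col_types = FALSE)

class(plans$meal_plan)       # factor before writing
class(plans_back$meal_plan)  # character after reading back
```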

---

## 📦 Better Alternatives for Saving R Objects

### 🧠 **RDS Files**
- Use **`write_rds()`** and **`read_rds()`**
- Store data in R's native **binary format**
- Reloading restores the **exact same R object**, including column types

This is ideal for internal workflows and reproducibility.
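
For comparison with CSV, the following sketch round-trips a small made-up tibble through an RDS file in a temporary location and checks that the factor column survives.

```r
library(tidyverse)

# A hypothetical tibble with a factor column
plans <- tibble(
  id = c(1, 2),
  meal_plan = factor(c("Lunch only", "Breakfast and lunch"))
)

rds_path <- tempfile(fileext = ".rds")
write_rds(plans, rds_path)
plans_back <- read_rds(rds_path)

class(plans_back$meal_plan)   # still a factor
levels(plans_back$meal_plan)  # levels are preserved too
```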

---

## ⚡ Cross-Language, High-Performance Storage with Parquet

The **arrow** package allows you to work with **Parquet files**, which are:
- **Very fast**
- **Compressed**
- **Usable outside of R** (e.g., Python, Spark)

Parquet combines performance with portability, but requires the **arrow** package.

We'll explore this format in more detail later, but it's a powerful option for larger or shared datasets.


In [None]:
library(tidyverse)

# Write a CSV file
write_csv(students, "students.csv")

# Write another CSV and read it back in
write_csv(students, "students-2.csv")
read_csv("students-2.csv")

# Save and load using RDS (preserves column types)
write_rds(students, "students.rds")
read_rds("students.rds")

# Save and load using Parquet (fast, cross-language)
library(arrow)
write_parquet(students, "students.parquet")
read_parquet("students.parquet")


# ✍️ **7.6 Data Entry**

Sometimes you'll need to create a small dataset **by hand** directly in your R script. This is common for toy examples, look-up tables, or quick tests. The **tibble** package provides two especially useful functions for this, depending on whether you want to think in terms of **columns** or **rows**.

---

## 🧱 Creating a Tibble by Columns with `tibble()`

`tibble()` works column-by-column, similar to how you'd build a data frame in base R. Each column is defined as a vector.

This approach is straightforward, but when datasets get wider, it can be harder to visually match values across rows.

---

## 🔄 Creating a Tibble by Rows with `tribble()`

`tribble()` (short for **transposed tibble**) is designed specifically for **data entry in code**. Instead of defining columns as vectors, you define the data **row by row**, which often makes small datasets much easier to read.

Key features of `tribble()`:
- Column names start with `~`
- Values are separated by commas
- Each row appears on its own line

This layout closely mirrors how we naturally read tables and is ideal for small, hand-typed datasets.

---

## ✅ When to Use Each

- Use **`tibble()`** when:
  - You already have vectors
  - You're thinking column-wise

- Use **`tribble()`** when:
  - You're entering data manually
  - You want maximum readability


In [None]:
library(tidyverse)

# Column-wise data entry with tibble()
tibble(
  x = c(1, 2, 5),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)

# Row-wise data entry with tribble()
tribble(
  ~x, ~y, ~z,
   1, "h", 0.08,
   2, "m", 0.83,
   5, "g", 0.60
)
