# 🌾 Week 3: Data Manipulation with dplyr
**PLS 120 - Applied Statistics in Agriculture**

**Binder Developer:** Mohammadreza Narimani  
**Lab Content Developer:** Parastoo Farajpoor  
**Date:** 2025-10-15

This week we'll learn how to manipulate data with dplyr, a package in the tidyverse. The tidyverse is a collection of R packages for manipulating, organizing, and visualizing data. We'll also explore data visualization with ggplot2, another tidyverse package.

## 📊 Loading Libraries and Data

First, we'll load the dplyr package and the iris dataset we've been working with.

**Expected Output:** Libraries will load and the iris dataset will be available for manipulation.

In [None]:
# Load dplyr and ggplot2 (both part of the tidyverse), hiding their startup messages
suppressPackageStartupMessages({
  library(dplyr)
  library(ggplot2)
})

# Load the iris dataset
data <- iris

# Examine the structure
head(data)
str(data)

## 🔍 Basic Data Subsetting

Base R lets you subset a data frame with square brackets, giving the row indices first and the column indices second: data[rows, columns]. Leaving one side empty selects all rows or all columns.

**Expected Output:** You'll see different ways to subset data using row and column indices.

In [None]:
# Subset row 1 column 1. In the bracket, the first number specifies the row number 
# and the second number specifies the column number.
example <- data[1,1]
example

# Subset row 1, columns 1 to 4:
example <- data[1,1:4]
example

# Subset row 1, all columns:
example <- data[1, ]
head(example)

# Subset all rows, column 1:
example <- data[, 1]
head(example)
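
One detail worth noting about the bracket examples: selecting a single column with data[, 1] returns a plain vector rather than a data frame. The sketch below, which assumes you want to keep the data frame structure, uses base R's `drop = FALSE` argument. The names `col_vector` and `col_df` are illustrative and not part of the lab.

```r
data <- iris

# Selecting one column with brackets simplifies the result to a numeric vector
col_vector <- data[, 1]
class(col_vector)

# drop = FALSE keeps the result as a one-column data frame
col_df <- data[, 1, drop = FALSE]
class(col_df)
head(col_df)
```

This matters later on, because functions such as head() and nrow() behave differently on vectors and data frames.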

## 🎯 Data Manipulation with Pipes

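In the tidyverse, we chain steps together with the "pipe" operator (%>%). The pipe passes the result on its left to the function on its right as that function's first argument. Think of a pipe like a conjunction in a sentence, where %>% is equivalent to "then".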

**Expected Output:** You'll learn to use pipes to chain operations together.

In [None]:
# filter() function allows you to subset rows based on criteria
# Create subsets for each species
setosa_dataset <- data %>% filter(Species == "setosa")
versicolor_dataset <- data %>% filter(Species == "versicolor")
virginica_dataset <- data %>% filter(Species == "virginica")

# Check the number of rows in each subset
nrow(setosa_dataset)
nrow(versicolor_dataset)
nrow(virginica_dataset)

# You can also use subset() function
setosa_subset <- subset(data, Species == "setosa")
nrow(setosa_subset)
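
To make the "then" reading concrete, here is a small sketch showing that a piped call and an ordinary nested call give the same result, and that filter() accepts more than one condition. The object names (`nested_version`, `piped_version`, `large_setosa`) and the cutoff of 5 are made up for illustration.

```r
library(dplyr)
data <- iris

# These two lines do the same thing: the pipe passes `data` as the first argument
nested_version <- filter(data, Species == "setosa")
piped_version  <- data %>% filter(Species == "setosa")
identical(nested_version, piped_version)

# Conditions separated by commas are combined with AND:
# keep setosa rows whose sepal length is greater than 5
large_setosa <- data %>% filter(Species == "setosa", Sepal.Length > 5)
nrow(large_setosa)
```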

## ✂️ Slicing Rows

slice() subsets a data frame by row position: you give it the row numbers you want to keep.

**Expected Output:** Different ways to slice rows from the dataset.

In [None]:
# slice() rows 10 to 20
sliced_10_20 <- data %>% slice(10:20)
nrow(sliced_10_20)

# slice_head() and slice_tail() subset the top and bottom of the dataset
slice_head_5 <- data %>% slice_head(n=5)
slice_tail_5 <- data %>% slice_tail(n=5)

# slice_sample() randomly selects rows
slice_random_5 <- data %>% slice_sample(n=5)
head(slice_random_5)
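
Because slice_sample() draws rows at random, it returns different rows on each run. A minimal sketch of two related options, assuming you want a repeatable sample or a fraction of the rows instead of a fixed count (the seed value 120 and the object names are arbitrary choices for this example):

```r
library(dplyr)
data <- iris

# Setting the same seed before each draw makes the random sample repeatable
set.seed(120)  # arbitrary seed chosen for this sketch
sample_a <- data %>% slice_sample(n = 5)
set.seed(120)
sample_b <- data %>% slice_sample(n = 5)
identical(sample_a, sample_b)

# prop selects a fraction of the rows instead of a fixed number
first_tenth <- data %>% slice_head(prop = 0.1)
nrow(first_tenth)  # 10% of the 150 rows
```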

## 🎯 Your Turn: Practice Slicing

The dataset has 150 rows in total. Use each of the slice functions below to pull out 20 rows.

**Expected Output:** Different subsets of the iris data.

In [None]:
# Slice 20 rows using different methods
# Hint: slice_example_1 <- data %>% slice(41:60)
slice_example_1 <-

# Hint: slice_example_2 <- data %>% slice_head(n=20)
slice_example_2 <-

# Hint: slice_example_3 <- data %>% slice_tail(n=20)
slice_example_3 <-

# Hint: slice_example_4 <- data %>% slice_sample(n=20)
slice_example_4 <-

# Check the number of rows (each should be 20)
nrow(slice_example_1)
nrow(slice_example_2)
nrow(slice_example_3)
nrow(slice_example_4)

## 📋 Selecting Columns

Selecting columns is helpful when you have a dataset with lots of variables, but you're only interested in one or two variables.

**Expected Output:** Subsets of data with specific columns.

In [None]:
# select() allows you to subset columns by name or column number
# Select sepal length and species
select_by_sepal_length <- data %>% select(Sepal.Length, Species)
head(select_by_sepal_length)

# Same thing using column numbers
select_by_sepal_length <- data %>% select(1, 5)
head(select_by_sepal_length)

# Select petal variables
select_by_petal <- data %>% select(Petal.Length, Petal.Width)
head(select_by_petal)
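
select() can also drop columns or take a run of adjacent columns. The sketch below shows both forms; the object names `measurements_only` and `sepal_to_petal_length` are illustrative only.

```r
library(dplyr)
data <- iris

# A minus sign drops a column: keep everything except Species
measurements_only <- data %>% select(-Species)
names(measurements_only)

# A colon selects a run of adjacent columns, from the first name to the last
sepal_to_petal_length <- data %>% select(Sepal.Length:Petal.Length)
names(sepal_to_petal_length)
```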

## 🔍 Advanced Column Selection

There are several functions for precise column selection: starts_with(), ends_with(), contains(), and matches().

**Expected Output:** Different ways to select columns based on patterns.

In [None]:
# Select columns that start with "Sepal"
select_by_sepal <- data %>% select(starts_with("Sepal"))
head(select_by_sepal)

# Select columns that end with "Length"
select_by_length <- data %>% select(ends_with("Length"))
head(select_by_length)

# Select columns that contain "Petal"
select_by_petal <- data %>% select(contains("Petal"))
head(select_by_petal)
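
The fourth helper listed above, matches(), selects columns whose names match a regular expression. A small sketch, with patterns chosen here purely for illustration:

```r
library(dplyr)
data <- iris

# "\\.Width$" means: ends with a literal dot followed by "Width"
select_by_width <- data %>% select(matches("\\.Width$"))
head(select_by_width)

# "^(Sepal|Petal)\\.Length$" matches either length column by its full name
select_lengths <- data %>% select(matches("^(Sepal|Petal)\\.Length$"))
names(select_lengths)
```

For simple prefixes and suffixes, starts_with() and ends_with() are easier to read. matches() is useful when the pattern is more specific than a fixed piece of text.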

## 🔗 Combining Functions

You can combine slice() and select() functions to choose specific rows and columns.

**Expected Output:** Subsets with both specific rows and columns.

In [None]:
# Combine slice() and select() functions
# Subset rows 4 to 7 and columns 1 to 2
subset_data <- data %>% slice(4:7) %>% select(1:2)
subset_data

# Same thing using base R
subset_data <- data[4:7, 1:2]
subset_data
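
The same chaining works with filter(). As a sketch, the pipeline below keeps the versicolor rows, then the two sepal columns, then the first three rows. The object name `first_versicolor` is made up for this example.

```r
library(dplyr)
data <- iris

first_versicolor <- data %>%
  filter(Species == "versicolor") %>%    # keep only versicolor rows
  select(Sepal.Length, Sepal.Width) %>%  # then keep the two sepal columns
  slice_head(n = 3)                      # then keep the first three rows
first_versicolor
```

The order of the steps matters here: filter() needs the Species column, so it has to run before the select() step that drops it.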

## 🔄 Data Manipulation Functions

Let's explore arrange(), rename(), and mutate() functions for data manipulation.

**Expected Output:** Sorted, renamed, and transformed datasets.

In [None]:
# arrange() sorts rows by the values of one or more variables (ascending by default)
sepal_length_arranged_ascending <- data %>% arrange(Sepal.Length)
head(sepal_length_arranged_ascending)

# Use desc() for descending order
sepal_length_arranged_descending <- data %>% arrange(desc(Sepal.Length))
head(sepal_length_arranged_descending)

# rename() changes column names, using the form new_name = old_name
renamed_data <- data %>% rename(Sepal_Length = Sepal.Length, Plant_Species = Species)
head(renamed_data)

In [None]:
# mutate() adds new columns (existing columns are kept)
# Convert sepal length from centimeters to millimeters
new_df <- data %>% mutate(mm_Sepal.Length = Sepal.Length * 10)
head(new_df %>% select(Sepal.Length, mm_Sepal.Length))
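
mutate() can create several columns in one call, and a later column can use one defined just before it. A minimal sketch, where the column names `Sepal.Ratio` and `Long.Sepal` and the threshold of 2 are hypothetical choices for illustration:

```r
library(dplyr)
data <- iris

ratio_df <- data %>%
  mutate(
    Sepal.Ratio = Sepal.Length / Sepal.Width,  # length-to-width ratio
    Long.Sepal  = Sepal.Ratio > 2              # TRUE/FALSE, uses the column created just above
  )
head(ratio_df %>% select(Sepal.Length, Sepal.Width, Sepal.Ratio, Long.Sepal))
```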

## 📊 Grouping and Summarizing Data

group_by() splits the data into groups defined by one or more variables, and summarize() then calculates summary statistics for each group, returning one row per group.

**Expected Output:** Summary statistics grouped by species.

In [None]:
# Group by species and calculate mean sepal length
grouped_df <- data %>% group_by(Species) %>% summarize(mean_sepal_length = mean(Sepal.Length))
grouped_df

# Calculate multiple statistics
grouped_summary <- data %>% 
  group_by(Species) %>% 
  summarize(
    mean_sepal = mean(Sepal.Length),
    sd_sepal = sd(Sepal.Length),
    count = n()
  )
grouped_summary
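
Two follow-ups to the grouped summaries above, offered as a sketch rather than a required step: count() is a shortcut for counting rows per group, and a summary table can be piped into further verbs such as arrange(). The object names `species_counts` and `petal_ranking` are illustrative.

```r
library(dplyr)
data <- iris

# count() is shorthand for group_by() followed by summarize(n = n())
species_counts <- data %>% count(Species)
species_counts

# A grouped summary is an ordinary table, so it can be sorted with arrange()
petal_ranking <- data %>%
  group_by(Species) %>%
  summarize(mean_petal_length = mean(Petal.Length)) %>%
  arrange(desc(mean_petal_length))
petal_ranking
```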

## 🎯 Your Turn: Practice Exercise

Create a subset containing only petal widths for the setosa species, then calculate the mean.

**Expected Output:** A subset of data and its mean value.

In [None]:
# Create a subset of petal widths for setosa species
# Hint: setosa_petal_width <- data %>% filter(Species == "setosa") %>% select(Petal.Width)
setosa_petal_width <-

# Calculate the mean
# Hint: mean_setosa_petal_width <- mean(setosa_petal_width$Petal.Width)
mean_setosa_petal_width <-

# Print the result
print(mean_setosa_petal_width)

## 📊 Data Visualization with ggplot2

Let's create some visualizations using ggplot2 to explore our data.

**Expected Output:** Professional-looking plots with customizable aesthetics.

In [None]:
# Create a histogram with ggplot2
ggplot(data, aes(x = Petal.Width)) + 
  geom_histogram(bins = 20, fill = "lightblue", color = "black") +
  labs(title = "Distribution of Petal Width", x = "Petal Width", y = "Frequency")

In [None]:
# Create a boxplot by species
ggplot(data, aes(x = Species, y = Petal.Width, fill = Species)) + 
  geom_boxplot() +
  labs(title = "Petal Width by Species", x = "Species", y = "Petal Width")

In [None]:
# Create a scatter plot
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point(size = 3) +
  labs(title = "Sepal Length vs Width by Species", 
       x = "Sepal Length", y = "Sepal Width")
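
One more option that builds on the histogram above: facet_wrap() splits a plot into one panel per group. The sketch below stores the plot in an object (here called `p`, an arbitrary name) before printing it, which is optional but makes the plot easy to reuse.

```r
library(ggplot2)
data <- iris

# Same histogram as before, but with one panel per species
p <- ggplot(data, aes(x = Petal.Width)) +
  geom_histogram(bins = 20, fill = "lightblue", color = "black") +
  facet_wrap(~ Species) +
  labs(title = "Distribution of Petal Width by Species",
       x = "Petal Width", y = "Frequency")

# Printing the object draws the plot
p
```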

## 🎯 Your Turn: Create Your Own Plot

Create a density plot for sepal length using ggplot2.

**Expected Output:** A density plot showing the distribution of sepal length.

In [None]:
# Create a density plot for sepal length
# Hint: ggplot(data, aes(x = Sepal.Length)) + geom_density()
ggplot(data, aes()) +
  geom_density() +
  labs(title = "Distribution of Sepal Length", x = "Sepal Length", y = "Density")

## 🎉 Congratulations!

You've completed Week 3 of PLS 120! You've learned:

✅ **Data Subsetting** - Using brackets and indices  
✅ **dplyr Functions** - filter(), select(), slice(), arrange()  
✅ **Pipes** - Chaining operations with %>%  
✅ **Column Selection** - starts_with(), ends_with(), contains()  
✅ **Data Manipulation** - rename(), mutate(), group_by(), summarize()  
✅ **Data Visualization** - Creating plots with ggplot2  

---

## 📧 Questions?

If you have more questions about this lab or need help with R programming, please contact:

**Mohammadreza Narimani**  
📧 mnarimani@ucdavis.edu  
🏫 Department of Biological and Agricultural Engineering, UC Davis

---

*Next week: We'll explore more advanced statistical concepts and analysis!* 🚀