## Lesson 3: Data Transformation with dplyr - Part 1 (Select, Filter, Arrange)

Welcome to Lesson 3! Now that you've learned data cleaning, let's explore **data transformation** - the art of reshaping and manipulating your data to extract meaningful insights.

---

### 🎯 What is Data Transformation?

Think of data transformation like organizing your closet:
- 📦 **Selecting** = Keeping only the clothes you need for a specific occasion
- 🔍 **Filtering** = Pulling out only items that match certain criteria (e.g., all blue shirts)
- 📊 **Arranging** = Organizing items in a specific order (by color, size, or season)

| Transformation | Real-World Analogy | R Function |
|---------------|-------------------|------------|
| Choosing columns | Selecting which spreadsheet columns to print | `select()` |
| Keeping specific rows | Filtering emails by sender | `filter()` |
| Sorting data | Arranging files by date | `arrange()` |

---

### 💼 Why is Data Transformation Important for Business?

In the business world, raw data rarely comes in the format you need:
- **Sales Reports**: You might have 50 columns but only need Product, Region, and Revenue
- **Customer Analysis**: You want to focus only on customers from specific states
- **Performance Reviews**: You need data sorted by highest sales first

| Business Question | Transformation Needed |
|------------------|----------------------|
| "Show me only customer names and purchase amounts" | `select()` |
| "Find all orders over $1,000" | `filter()` |
| "List employees from highest to lowest salary" | `arrange()` |

---

### 📚 In This Lesson, You'll Learn:

| Function | What It Does | Example |
|----------|-------------|---------|
| `select()` | Chooses specific columns to keep | Keep only `Name`, `Price`, `Date` |
| `filter()` | Keeps rows that meet your conditions | Keep only rows where `Price > 100` |
| `arrange()` | Sorts rows in a specific order | Sort by `Price` from low to high |
| `%>%` (pipe) | Chains operations together | Do multiple steps in one flow |

**🚀 By the end of this lesson**, you'll be able to extract exactly the data you need from any dataset!

## Loading Required Packages

For data transformation, we'll use the **dplyr** package, which is part of the **tidyverse** collection.

---

### 📦 What is the tidyverse?

The **tidyverse** is like a toolbox 🧰 containing multiple specialized tools for data work:

| Package | What It Does | Analogy |
|---------|-------------|---------|
| **dplyr** | Data manipulation (filter, select, arrange) | Your main data scissors & organizer |
| **ggplot2** | Creating visualizations | Your graph-making paintbrush |
| **tidyr** | Reshaping data layout | Your data reformatting tool |
| **readr** | Reading data files | Your data import machine |

---

### 🔧 Understanding the Code Below:

```r
library(tidyverse)
```

| Code Part | What It Means |
|-----------|--------------|
| `library()` | "Load this package so I can use its functions" |
| `tidyverse` | The name of the package collection to load |

**💡 Analogy**: Think of `library(tidyverse)` like opening your toolbox before starting a project. You need to open it first before you can use the tools inside!

---

### ⚠️ Common Question: "Why Do I Need to Load Packages?"

R comes with basic functions built-in, but **packages add superpowers**:

| Without tidyverse | With tidyverse |
|------------------|----------------|
| Complex, hard-to-read code | Clean, readable code |
| Multiple lines for one task | Single-line solutions |
| `subset(data, column > 5)` | `data %>% filter(column > 5)` |

Let's load the tidyverse package:

In [1]:
# Load necessary packages
library(tidyverse)   # Load the tidyverse collection of packages
                     # Core packages: dplyr, ggplot2, tidyr, readr, purrr, tibble,
                     # stringr, forcats, lubridate (see the startup message below)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become erro

## Creating Sample Sales Dataset

For this lesson, we'll work with a realistic sales dataset. Before we create it, let's understand what each column represents:

---

### 📋 Dataset Column Definitions

| Column Name | Data Type | Description | Example Values |
|------------|-----------|-------------|----------------|
| **OrderID** | Number | Unique identifier for each order | 101, 102, 103 |
| **CustomerName** | Text | Name of the customer | "Alice", "Bob" |
| **Product** | Text | Item that was purchased | "Laptop", "Mouse" |
| **Category** | Text | Product category | "Electronics" |
| **Price** | Number | Price per item in dollars | 1200, 25, 75 |
| **Quantity** | Number | Number of items ordered | 1, 2, 3 |
| **OrderDate** | Date | When the order was placed | "2024-01-15" |
| **Region** | Text | Geographic sales region | "East", "West" |

---

### 🔧 Understanding the Code That Creates This Data:

The code below uses several R functions. Here's what each one does:

| Function | Purpose | Example |
|----------|---------|---------|
| `data.frame()` | Creates a table (data frame) | Like creating a new Excel spreadsheet |
| `101:110` | Creates a sequence of numbers | Produces 101, 102, 103... up to 110 |
| `c()` | Combines values into a vector | `c("A", "B", "C")` creates a vector of three items |
| `as.Date()` | Converts text to a proper date | `as.Date("2024-01-15")` |
| `print()` | Displays output on screen | Shows the result so you can see it |

---

### 💼 Business Context

This dataset mimics what you'd see in a real company's sales database:
- **10 orders** from different customers
- **4 products**: Laptop ($1150-$1250), Monitor ($300-$320), Keyboard ($75-$80), Mouse ($20-$30)
- **4 regions**: East, West, North, South
- **Date range**: January 15-23, 2024

---

### 🎯 HOW TO ADAPT THIS CODE FOR YOUR OWN DATA

**Template for creating your own dataset:**
```r
my_data <- data.frame(
  Column1 = c("value1", "value2", "value3"),    # Text values
  Column2 = c(100, 200, 300),                    # Numeric values
  Column3 = 1:3,                                 # Sequence of numbers
  DateColumn = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))  # Dates
)
```

**💡 Tips:**
- Each column must have the same number of values
- Text values need quotes: `"like this"`
- Numbers don't need quotes: `123`
- Dates need `as.Date()` to be treated as dates

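To check that the tips above took effect, you can inspect the type R assigned to each column. The sketch below rebuilds the template dataset (its column names and values are placeholders, not real data) and assumes R 4.0 or later, where text columns stay as `character`.

```r
# Placeholder three-row dataset, following the template above
my_data <- data.frame(
  Column1 = c("value1", "value2", "value3"),   # quoted text
  Column2 = c(100, 200, 300),                  # unquoted numbers
  Column3 = 1:3,                               # integer sequence
  DateColumn = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))
)

# sapply() applies class() to every column and reports each column's type
column_types <- sapply(my_data, class)
print(column_types)

# A date written as plain text is just text until as.Date() converts it
print(class("2024-01-01"))            # "character"
print(class(as.Date("2024-01-01")))   # "Date"
```
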
In [2]:
# Create a sample sales dataset
sales_data <- data.frame(              # Create a data frame (R's table structure)
  OrderID = 101:110,                   # Create sequence from 101 to 110 using :
  CustomerName = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Heidi", "Ivan", "Judy"),
                                       # c() combines values into a vector
  Product = c("Laptop", "Mouse", "Keyboard", "Monitor", "Laptop", "Mouse", "Keyboard", "Monitor", "Laptop", "Mouse"),
                                       # Character vector with product names
  Category = c("Electronics", "Electronics", "Electronics", "Electronics", "Electronics", "Electronics", "Electronics", "Electronics", "Electronics", "Electronics"),
                                       # All same category - shows repetitive data
  Price = c(1200, 25, 75, 300, 1150, 20, 80, 320, 1250, 30),
                                       # Numeric vector with price values
  Quantity = c(1, 2, 1, 1, 1, 3, 1, 1, 1, 2),
                                       # Numeric vector of quantities ordered
  OrderDate = as.Date(c("2024-01-15", "2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18",
                        "2024-01-19", "2024-01-20", "2024-01-21", "2024-01-22", "2024-01-23")),
                                       # as.Date() converts text to proper date format
  Region = c("East", "West", "North", "South", "East", "West", "North", "South", "East", "West")
                                       # Character vector of geographic sales regions
)

print("Original Sales Data:")          # print() displays output to console
print(sales_data)                      # Show the entire data frame

[1] "Original Sales Data:"
   OrderID CustomerName  Product    Category Price Quantity  OrderDate Region
1      101        Alice   Laptop Electronics  1200        1 2024-01-15   East
2      102          Bob    Mouse Electronics    25        2 2024-01-15   West
3      103      Charlie Keyboard Electronics    75        1 2024-01-16  North
4      104        David  Monitor Electronics   300        1 2024-01-17  South
5      105          Eve   Laptop Electronics  1150        1 2024-01-18   East
6      106        Frank    Mouse Electronics    20        3 2024-01-19   West
7      107        Grace Keyboard Electronics    80        1 2024-01-20  North
8      108        Heidi  Monitor Electronics   320        1 2024-01-21  South
9      109         Ivan   Laptop Electronics  1250        1 2024-01-22   East
10     110         Judy    Mouse Electronics    30        2 2024-01-23   West


## Introduction to dplyr and the Pipe Operator (%>%)

The **pipe operator (%>%)** is one of the most powerful features in R for data manipulation. It's also one of the most confusing for beginners - so let's break it down completely!

---

### 🔧 What Does `%>%` Actually Mean?

**The pipe `%>%` means: "Take what's on the left and pass it to the function on the right."**

| Symbol | Pronunciation | Meaning |
|--------|--------------|---------|
| `%>%` | "pipe" or "then" | Take the result and pass it to the next step |

---

### 📚 Reading Code with Pipes

When you see `%>%`, mentally replace it with **"and then"**:

```r
sales_data %>%                # Take sales_data, AND THEN...
  select(Product, Price) %>%  # select the Product and Price columns, AND THEN...
  filter(Price > 100)         # keep rows where Price > 100
```

---

### ⚔️ Old Way vs. New Way (Comparison)

| Approach | Code | How It Reads |
|----------|------|--------------|
| **Old way (nested)** | `filter(select(sales_data, Product, Price), Price > 100)` | Read inside-out 😕 |
| **New way (piped)** | `sales_data %>% select(Product, Price) %>% filter(Price > 100)` | Read left-to-right ✅ |

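To confirm that the two styles differ only in how they read, the sketch below runs both on a small stand-in table and compares the results. The `sales` data frame and its values are made up for illustration and are not the lesson's `sales_data`.

```r
library(dplyr)

# Small stand-in table (illustrative values only)
sales <- data.frame(
  Product = c("Laptop", "Mouse", "Monitor", "Keyboard"),
  Price   = c(1200, 25, 300, 75),
  Region  = c("East", "West", "South", "North")
)

# Nested: select() runs first even though filter() is written first
nested <- filter(select(sales, Product, Price), Price > 100)

# Piped: the steps run in the order they are written
piped <- sales %>%
  select(Product, Price) %>%
  filter(Price > 100)

print(identical(nested, piped))   # TRUE: both forms build the same data frame
print(piped)
```
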
---

### 💡 Real-World Analogy

Think of `%>%` like a **factory assembly line**:

```
📦 Raw Materials → 🔧 Step 1 → 🔧 Step 2 → 🔧 Step 3 → 📦 Final Product
```

In R:
```
sales_data %>% select(...) %>% filter(...) %>% arrange(...)
     ↓              ↓             ↓              ↓
   Start         Step 1        Step 2        Step 3
```
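
The assembly line can be sketched as a runnable three-step pipeline. The `sales` table below is a made-up stand-in with illustrative values, so the result only demonstrates the flow.

```r
library(dplyr)

# Stand-in table (illustrative values only)
sales <- data.frame(
  Product = c("Laptop", "Mouse", "Monitor", "Keyboard", "Laptop"),
  Price   = c(1200, 25, 300, 75, 1150),
  Region  = c("East", "West", "South", "North", "East")
)

result <- sales %>%
  select(Product, Price) %>%   # Step 1: keep two columns
  filter(Price > 100) %>%      # Step 2: keep rows priced above 100
  arrange(desc(Price))         # Step 3: most expensive first

print(result)
```

Order matters: `filter(Price > 100)` works here only because Step 1 kept the `Price` column.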

---

### 🎹 Keyboard Shortcut

**To type `%>%` quickly in RStudio:**
- **Windows**: `Ctrl` + `Shift` + `M`
- **Mac**: `Cmd` + `Shift` + `M`

---

### ✅ Why Use the Pipe Operator?

| Benefit | Explanation |
|---------|------------|
| **Readability** | Code reads like a sentence: "Take data, then select, then filter" |
| **Debugging** | Run one line at a time to see intermediate results |
| **Less clutter** | No need for temporary variables between steps |
| **Professional standard** | Industry-standard practice in data science |

Let's see it in action:

In [3]:
# Example: without pipe (harder to read)
# result <- filter(select(sales_data, Product, Price), Price > 100)
# This nests functions: select first, then filter the result

# Example: with pipe (easier to read)
result <- sales_data %>%               # Start with sales_data, then pipe to next function
  select(Product, Price) %>%           # Select only Product and Price columns, then pipe
  filter(Price > 100)                  # Filter rows where Price is greater than 100
                                       # %>% passes result from left side to right side

print("Example of pipe operator - Products with Price > 100:")
print(result)                          # Display the final filtered result

[1] "Example of pipe operator - Products with Price > 100:"
  Product Price
1  Laptop  1200
2 Monitor   300
3  Laptop  1150
4 Monitor   320
5  Laptop  1250


## 1. select(): Choosing Columns

The **select()** function lets you choose which columns to keep from your dataset. Think of it like highlighting specific columns in an Excel spreadsheet and deleting the rest.

---

### 🎯 When to Use select()

| Business Scenario | What You'd Select |
|------------------|-------------------|
| Creating a customer mailing list | `CustomerName`, `Address`, `Email` |
| Analyzing sales performance | `Product`, `Revenue`, `Region` |
| Building a simple report | Only the 3-4 columns your boss needs |

---

### 🔧 select() Syntax Breakdown

```r
new_data <- original_data %>% select(Column1, Column2, Column3)
```

| Part | What It Does |
|------|-------------|
| `new_data` | Name for your result (you choose this name) |
| `<-` | Assignment operator ("save this as...") |
| `original_data` | Your starting dataset |
| `%>%` | Pipe operator ("and then...") |
| `select()` | The function that chooses columns |
| `Column1, Column2` | Names of columns to keep (separate with commas) |

---

### 📋 The 6 Ways to Use select()

| Method | Code | What It Does |
|--------|------|-------------|
| **By name** | `select(col1, col2)` | Keep only these specific columns |
| **By range** | `select(col1:col5)` | Keep columns from col1 through col5 |
| **Exclude columns** | `select(-col1, -col2)` | Keep everything EXCEPT these columns |
| **Starts with** | `select(starts_with("O"))` | Keep columns starting with "O" |
| **Ends with** | `select(ends_with("ID"))` | Keep columns ending with "ID" |
| **Contains** | `select(contains("Name"))` | Keep columns containing "Name" anywhere |

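The `ends_with()` row can be tried on its own with a small table. The sketch below uses a hypothetical `orders` data frame (the `CustomerID` column is invented for this example) and also shows a helper mixed with a plain column name in one call.

```r
library(dplyr)

# Hypothetical table; CustomerID is added only to give ends_with() two matches
orders <- data.frame(
  OrderID      = 101:103,
  CustomerName = c("Alice", "Bob", "Charlie"),
  CustomerID   = c(7, 8, 9),
  Price        = c(1200, 25, 75)
)

# ends_with(): keep columns whose names end in "ID"
id_columns <- orders %>% select(ends_with("ID"))
print(names(id_columns))

# A helper and a plain column name can be combined in one select() call
mixed <- orders %>% select(ends_with("ID"), Price)
print(names(mixed))
```
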
---

### 💡 Understanding the Minus Sign `-`

The minus sign **excludes** columns instead of including them:

| Code | Result |
|------|--------|
| `select(Name, Price)` | Keep ONLY Name and Price |
| `select(-Name, -Price)` | Keep everything EXCEPT Name and Price |

**💼 Business Use**: When you have 50 columns and only want to remove 2, use the minus sign!

---

### 🎯 HOW TO ADAPT THIS CODE

**Template for selecting specific columns:**
```r
my_result <- my_data %>%
  select(column_name_1, column_name_2, column_name_3)
```

**Template for excluding columns:**
```r
my_result <- my_data %>%
  select(-column_to_remove_1, -column_to_remove_2)
```

**Template for pattern matching:**
```r
# Select all columns that start with "Sales"
my_result <- my_data %>%
  select(starts_with("Sales"))
```

**💡 Remember**: Column names are case-sensitive! `Price` ≠ `price`

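The reminder above applies to column names typed directly. Helper functions such as `starts_with()` behave differently: they ignore case unless you pass `ignore.case = FALSE`. The sketch below illustrates both behaviors on a hypothetical `orders` table whose `price_note` column exists only for this demonstration.

```r
library(dplyr)

# Hypothetical table with two columns that differ mainly by case
orders <- data.frame(
  OrderID    = 101:102,
  Price      = c(1200, 25),
  price_note = c("a", "b")
)

# Helpers ignore case by default, so both Price and price_note match
loose <- names(select(orders, starts_with("price")))
print(loose)

# ignore.case = FALSE makes the helper case-sensitive
strict <- names(select(orders, starts_with("price", ignore.case = FALSE)))
print(strict)

# A typed column name must match exactly; `price` is not a column, so select() errors
outcome <- tryCatch(
  select(orders, price),
  error = function(e) "error: column not found"
)
print(outcome)
```
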
---

Let's see each selection method in action:

In [4]:
# Select specific columns
selected_columns <- sales_data %>%    # Take sales_data and pipe it to select()
  select(OrderID, CustomerName, Product, Price)
                                       # select() keeps only the named columns
                                       # All other columns (Category, Quantity, etc.) are dropped
print("Selected Columns (OrderID, CustomerName, Product, Price):")
print(selected_columns)               # Result has only 4 columns instead of original 8

[1] "Selected Columns (OrderID, CustomerName, Product, Price):"
   OrderID CustomerName  Product Price
1      101        Alice   Laptop  1200
2      102          Bob    Mouse    25
3      103      Charlie Keyboard    75
4      104        David  Monitor   300
5      105          Eve   Laptop  1150
6      106        Frank    Mouse    20
7      107        Grace Keyboard    80
8      108        Heidi  Monitor   320
9      109         Ivan   Laptop  1250
10     110         Judy    Mouse    30


In [5]:
# Select columns by range
selected_range <- sales_data %>%      # Take sales_data and pipe to select()
  select(OrderID:Product)              # : means "from OrderID through Product"
                                       # This selects OrderID, CustomerName, and Product
                                       # Range selection based on column position
print("Selected Columns by Range (OrderID to Product):")
print(selected_range)                 # Shows first 3 columns only

[1] "Selected Columns by Range (OrderID to Product):"
   OrderID CustomerName  Product
1      101        Alice   Laptop
2      102          Bob    Mouse
3      103      Charlie Keyboard
4      104        David  Monitor
5      105          Eve   Laptop
6      106        Frank    Mouse
7      107        Grace Keyboard
8      108        Heidi  Monitor
9      109         Ivan   Laptop
10     110         Judy    Mouse


In [6]:
# Select all columns EXCEPT some
except_columns <- sales_data %>%      # Take sales_data and pipe to select()
  select(-Category, -OrderDate)       # Minus sign (-) means "exclude these columns"
                                       # Keep everything except Category and OrderDate
                                       # Useful when you want most columns but not all
print("Selected All Columns Except Category and OrderDate:")
print(except_columns)                 # Result has 6 columns instead of 8

[1] "Selected All Columns Except Category and OrderDate:"
   OrderID CustomerName  Product Price Quantity Region
1      101        Alice   Laptop  1200        1   East
2      102          Bob    Mouse    25        2   West
3      103      Charlie Keyboard    75        1  North
4      104        David  Monitor   300        1  South
5      105          Eve   Laptop  1150        1   East
6      106        Frank    Mouse    20        3   West
7      107        Grace Keyboard    80        1  North
8      108        Heidi  Monitor   320        1  South
9      109         Ivan   Laptop  1250        1   East
10     110         Judy    Mouse    30        2   West


In [7]:
# Select columns that start with a specific string
starts_with_o <- sales_data %>%       # Take sales_data and pipe to select()
  select(starts_with("O"))            # starts_with() is a helper function
                                       # Finds columns beginning with "O"
                                       # Matching ignores case by default; here it finds "OrderID" and "OrderDate"
print("Selected Columns Starting with 'O':")
print(starts_with_o)                  # Result shows OrderID and OrderDate columns only

[1] "Selected Columns Starting with 'O':"
   OrderID  OrderDate
1      101 2024-01-15
2      102 2024-01-15
3      103 2024-01-16
4      104 2024-01-17
5      105 2024-01-18
6      106 2024-01-19
7      107 2024-01-20
8      108 2024-01-21
9      109 2024-01-22
10     110 2024-01-23


In [8]:
# Select columns that contain a specific string
contains_name <- sales_data %>%       # Take sales_data and pipe to select()
  select(contains("Name"))            # contains() is a helper function
                                       # Finds columns with "Name" anywhere in column name
                                       # Matches "CustomerName" in our dataset
print("Selected Columns Containing 'Name':")
print(contains_name)                  # Result shows only CustomerName column

[1] "Selected Columns Containing 'Name':"
   CustomerName
1         Alice
2           Bob
3       Charlie
4         David
5           Eve
6         Frank
7         Grace
8         Heidi
9          Ivan
10         Judy


## 2. filter(): Subsetting Rows Based on Conditions

The **filter()** function lets you keep only the rows that match your criteria. Think of it like using the filter feature in Excel to show only specific records.

---

### 🎯 When to Use filter()

| Business Question | filter() Condition |
|------------------|-------------------|
| "Show me expensive products" | `filter(Price > 100)` |
| "Find orders from the East region" | `filter(Region == "East")` |
| "Get orders placed on or after January 15, 2024" | `filter(OrderDate >= as.Date("2024-01-15"))` |
| "Find laptop OR monitor orders" | `filter(Product %in% c("Laptop", "Monitor"))` |

---

### 🔧 filter() Syntax Breakdown

```r
result <- data %>% filter(column_name operator value)
```

| Part | What It Does |
|------|-------------|
| `data` | Your starting dataset |
| `%>%` | Pipe operator ("and then...") |
| `filter()` | The function that keeps matching rows |
| `column_name` | Which column to check |
| `operator` | How to compare (see table below) |
| `value` | What to compare against |

---

### 📋 Comparison Operators Explained

| Operator | Meaning | Example | Result |
|----------|---------|---------|--------|
| `==` | Equals exactly | `Region == "East"` | Keep rows where Region is "East" |
| `!=` | Not equal to | `Region != "Test"` | Keep rows where Region is not "Test" (drops test data) |
| `>` | Greater than | `Price > 100` | Prices above $100 |
| `>=` | Greater than or equal | `Price >= 100` | $100 and above |
| `<` | Less than | `Quantity < 5` | Small orders |
| `<=` | Less than or equal | `Quantity <= 5` | 5 items or fewer |

---

### ⚠️ Common Mistake: `=` vs `==`

| Symbol | Use For | Example |
|--------|---------|---------|
| `=` | Assigning values | `x = 5` (x now equals 5) |
| `==` | Comparing values | `x == 5` (is x equal to 5?) |

**🚨 In filter(), ALWAYS use `==` to test for equality! A single `=` is read as a named argument and causes an error.**

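The difference is easy to see by running both versions. The sketch below uses a small hypothetical `orders` table and wraps the mistaken call in `tryCatch()` so the script keeps running; the replacement message is ours, not dplyr's own error text.

```r
library(dplyr)

# Hypothetical table (illustrative values only)
orders <- data.frame(
  Region = c("East", "West", "East"),
  Price  = c(1200, 25, 1150)
)

# Correct: == tests every row and keeps the matches
east <- orders %>% filter(Region == "East")
print(nrow(east))

# Mistake: a single = is read as a named argument, so filter() stops with an error
attempt <- tryCatch(
  orders %>% filter(Region = "East"),
  error = function(e) "filter() raised an error"
)
print(attempt)
```
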
---

### 🔗 Combining Conditions (AND / OR)

| Logic | Symbol | Code | Meaning |
|-------|--------|------|---------|
| **AND** | `,` or `&` | `filter(Price > 100, Region == "East")` | BOTH must be true |
| **OR** | `\|` | `filter(Region == "East" \| Region == "West")` | EITHER can be true |
| **IN list** | `%in%` | `filter(Product %in% c("Laptop", "Monitor"))` | Match any in the list |

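As a quick check of the AND and OR rows, the sketch below compares the comma form with the `&` form and then runs an OR filter. The `orders` table is hypothetical and its values are chosen only to make the results easy to verify by eye.

```r
library(dplyr)

# Hypothetical table (illustrative values only)
orders <- data.frame(
  Product = c("Laptop", "Mouse", "Monitor", "Laptop"),
  Price   = c(1200, 25, 300, 1150),
  Region  = c("East", "West", "South", "North")
)

# AND: a comma and & keep exactly the same rows
and_comma <- orders %>% filter(Price > 100, Region == "East")
and_amp   <- orders %>% filter(Price > 100 & Region == "East")
print(identical(and_comma, and_amp))

# OR: a row is kept when at least one condition is TRUE
either <- orders %>% filter(Price > 1000 | Region == "South")
print(either)
```
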
---

### 💡 Understanding `%in%`

The `%in%` operator checks whether each value appears in a vector of allowed values, which is much cleaner than chaining multiple OR conditions:

| Instead of this... | Use this! |
|-------------------|-----------|
| `filter(Product == "Laptop" \| Product == "Monitor" \| Product == "Keyboard")` | `filter(Product %in% c("Laptop", "Monitor", "Keyboard"))` |

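It can help to see what `%in%` returns before using it inside `filter()`. The sketch below applies it to a plain vector, confirms that it matches the OR chain, and shows the negated form with `!`. The `products` vector and `orders` table are hypothetical.

```r
library(dplyr)

products <- c("Laptop", "Mouse", "Monitor", "Keyboard")

# %in% returns one TRUE/FALSE per element on the left-hand side
matches <- products %in% c("Laptop", "Monitor")
print(matches)

# Hypothetical table built from the same products
orders <- data.frame(Product = products, Price = c(1200, 25, 300, 75))

with_in <- orders %>% filter(Product %in% c("Laptop", "Monitor"))
with_or <- orders %>% filter(Product == "Laptop" | Product == "Monitor")
print(identical(with_in, with_or))

# Put ! in front of the condition to keep everything NOT in the set
others <- orders %>% filter(!Product %in% c("Laptop", "Monitor"))
print(others)
```
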
---

### 🎯 HOW TO ADAPT THIS CODE

**Template for simple filter:**
```r
result <- my_data %>%
  filter(column_name > value)
```

**Template for text matching (exact):**
```r
result <- my_data %>%
  filter(column_name == "exact_text")
```

**Template for multiple conditions (AND):**
```r
result <- my_data %>%
  filter(column1 > value1, column2 == "text")
```

**Template for matching a list (OR):**
```r
result <- my_data %>%
  filter(column_name %in% c("option1", "option2", "option3"))
```

**Template for date filtering:**
```r
result <- my_data %>%
  filter(date_column >= as.Date("2024-01-01"))
```

**💡 Remember**: Text values need quotes `"like this"`, numbers don't!

---

Let's see each filter method in action:

In [9]:
# Filter rows where Price is greater than 100
high_price_items <- sales_data %>%    # Take sales_data and pipe to filter()
  filter(Price > 100)                 # filter() keeps rows where condition is TRUE
                                       # > is the "greater than" comparison operator
                                       # Only keeps rows where Price column > 100
print("Items with Price > 100:")
print(high_price_items)               # Shows only expensive items (laptops, monitors)

[1] "Items with Price > 100:"
  OrderID CustomerName Product    Category Price Quantity  OrderDate Region
1     101        Alice  Laptop Electronics  1200        1 2024-01-15   East
2     104        David Monitor Electronics   300        1 2024-01-17  South
3     105          Eve  Laptop Electronics  1150        1 2024-01-18   East
4     108        Heidi Monitor Electronics   320        1 2024-01-21  South
5     109         Ivan  Laptop Electronics  1250        1 2024-01-22   East


In [10]:
# Filter rows where Product is 'Laptop' and Quantity is 1
laptop_single_quantity <- sales_data %>%  # Take sales_data and pipe to filter()
  filter(Product == "Laptop", Quantity == 1)
                                       # Multiple conditions separated by comma = AND logic
                                       # == is "exactly equal to" (use == not = for comparison)
                                       # Both conditions must be TRUE for row to be kept
print("Laptops with Quantity = 1:")
print(laptop_single_quantity)         # Shows only laptop orders with quantity of 1

[1] "Laptops with Quantity = 1:"
  OrderID CustomerName Product    Category Price Quantity  OrderDate Region
1     101        Alice  Laptop Electronics  1200        1 2024-01-15   East
2     105          Eve  Laptop Electronics  1150        1 2024-01-18   East
3     109         Ivan  Laptop Electronics  1250        1 2024-01-22   East


In [11]:
# Filter rows where Region is 'East' or 'West'
east_west_region <- sales_data %>%    # Take sales_data and pipe to filter()
  filter(Region == "East" | Region == "West")
                                       # | is the OR operator (either condition can be true)
                                       # Keep rows where Region equals "East" OR "West"
                                       # Excludes "North" and "South" regions
print("Orders from East or West Region:")
print(east_west_region)               # Shows only orders from East or West regions

[1] "Orders from East or West Region:"
  OrderID CustomerName Product    Category Price Quantity  OrderDate Region
1     101        Alice  Laptop Electronics  1200        1 2024-01-15   East
2     102          Bob   Mouse Electronics    25        2 2024-01-15   West
3     105          Eve  Laptop Electronics  1150        1 2024-01-18   East
4     106        Frank   Mouse Electronics    20        3 2024-01-19   West
5     109         Ivan  Laptop Electronics  1250        1 2024-01-22   East
6     110         Judy   Mouse Electronics    30        2 2024-01-23   West


In [12]:
# Filter using %in% operator
selected_products <- sales_data %>%   # Take sales_data and pipe to filter()
  filter(Product %in% c("Laptop", "Monitor"))
                                       # %in% checks whether each value appears in a vector
                                       # c("Laptop", "Monitor") creates a vector of allowed values
                                       # More concise than Product == "Laptop" | Product == "Monitor"
print("Orders for Laptop or Monitor:")
print(selected_products)              # Shows only laptop and monitor orders

[1] "Orders for Laptop or Monitor:"
  OrderID CustomerName Product    Category Price Quantity  OrderDate Region
1     101        Alice  Laptop Electronics  1200        1 2024-01-15   East
2     104        David Monitor Electronics   300        1 2024-01-17  South
3     105          Eve  Laptop Electronics  1150        1 2024-01-18   East
4     108        Heidi Monitor Electronics   320        1 2024-01-21  South
5     109         Ivan  Laptop Electronics  1250        1 2024-01-22   East


In [13]:
# Filter rows based on date
orders_jan_17_onwards <- sales_data %>%  # Take sales_data and pipe to filter()
  filter(OrderDate >= as.Date("2024-01-17"))
                                       # >= means "greater than or equal to"
                                       # as.Date() converts text string to date format
                                       # Keeps orders from Jan 17, 2024 and later
print("Orders from Jan 17, 2024 onwards:")
print(orders_jan_17_onwards)          # Shows orders from Jan 17 through Jan 23

[1] "Orders from Jan 17, 2024 onwards:"
  OrderID CustomerName  Product    Category Price Quantity  OrderDate Region
1     104        David  Monitor Electronics   300        1 2024-01-17  South
2     105          Eve   Laptop Electronics  1150        1 2024-01-18   East
3     106        Frank    Mouse Electronics    20        3 2024-01-19   West
4     107        Grace Keyboard Electronics    80        1 2024-01-20  North
5     108        Heidi  Monitor Electronics   320        1 2024-01-21  South
6     109         Ivan   Laptop Electronics  1250        1 2024-01-22   East
7     110         Judy    Mouse Electronics    30        2 2024-01-23   West


## 3. arrange(): Reordering Rows

The **arrange()** function sorts your data in a specific order. Think of it like clicking a column header in Excel to sort A→Z or largest→smallest.

---

### 🎯 When to Use arrange()

| Business Need | arrange() Code |
|--------------|----------------|
| List employees alphabetically | `arrange(Name)` |
| Show highest sales first | `arrange(desc(Sales))` |
| Sort by date (oldest first) | `arrange(Date)` |
| Sort by date (newest first) | `arrange(desc(Date))` |
| Sort by region, then by sales within each region | `arrange(Region, Sales)` |

---

### 🔧 arrange() Syntax Breakdown

```r
result <- data %>% arrange(column_name)
```

| Part | What It Does |
|------|-------------|
| `data` | Your starting dataset |
| `%>%` | Pipe operator ("and then...") |
| `arrange()` | The function that sorts rows |
| `column_name` | Which column to sort by |

---

### ⬆️⬇️ Ascending vs. Descending

| Type | Code | Result |
|------|------|--------|
| **Ascending** (default) | `arrange(Price)` | $20, $25, $75... → $1250 |
| **Descending** | `arrange(desc(Price))` | $1250, $1200... → $20 |

---

### 🔧 Understanding `desc()`

The `desc()` function reverses the sort order:

| Code | Meaning | Result Order |
|------|---------|--------------|
| `arrange(Price)` | Sort Price ascending | Lowest → Highest |
| `arrange(desc(Price))` | Sort Price descending | Highest → Lowest |
| `arrange(Name)` | Sort Name ascending | A → Z |
| `arrange(desc(Name))` | Sort Name descending | Z → A |
| `arrange(Date)` | Sort Date ascending | Oldest → Newest |
| `arrange(desc(Date))` | Sort Date descending | Newest → Oldest |

---

### 📊 Sorting by Multiple Columns

When you sort by multiple columns, R uses the **first column as primary sort, second as tiebreaker**:

```r
arrange(Region, Price)
```

| Step | What Happens |
|------|-------------|
| 1 | Sort all rows alphabetically by Region (A→Z) |
| 2 | Within each Region, sort by Price (low→high) |

**💼 Business Example**: "List all employees by department, and within each department, by salary from highest to lowest"
```r
arrange(Department, desc(Salary))
```
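
A runnable version of this business example might look like the sketch below. The `staff` table, its names, and its salaries are invented for illustration.

```r
library(dplyr)

# Invented staff table (illustrative values only)
staff <- data.frame(
  Name       = c("Ana", "Ben", "Cho", "Dev", "Eli"),
  Department = c("Sales", "IT", "Sales", "IT", "Sales"),
  Salary     = c(52000, 61000, 58000, 67000, 49000)
)

# Primary sort: Department (A to Z); tiebreaker: Salary (highest first)
sorted <- staff %>% arrange(Department, desc(Salary))
print(sorted)
```

The IT rows come first (Dev, then Ben), followed by the Sales rows (Cho, Ana, Eli), because `desc()` applies only to the column it wraps.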

---

### 📋 How arrange() Handles Different Data Types

| Data Type | Default Sort (Ascending) |
|-----------|-------------------------|
| **Numbers** | Smallest → Largest (1, 2, 3...) |
| **Text** | Alphabetical (A, B, C...) |
| **Dates** | Earliest → Latest (Jan 1 → Dec 31) |

---

### 🎯 HOW TO ADAPT THIS CODE

**Template for simple ascending sort:**
```r
result <- my_data %>%
  arrange(column_name)
```

**Template for descending sort:**
```r
result <- my_data %>%
  arrange(desc(column_name))
```

**Template for multi-column sort:**
```r
result <- my_data %>%
  arrange(first_sort_column, second_sort_column)
```

**Template for mixed sort (desc first, asc second):**
```r
result <- my_data %>%
  arrange(desc(column1), column2)
```

**💡 Common Patterns:**
- Sort products by price (highest first): `arrange(desc(Price))`
- Sort employees alphabetically: `arrange(LastName, FirstName)`
- Sort orders by date (newest first): `arrange(desc(OrderDate))`

---

Let's see each sorting method in action:

In [14]:
# Arrange by Price in ascending order (default)
arranged_by_price_asc <- sales_data %>%  # Take sales_data and pipe to arrange()
  arrange(Price)                      # arrange() sorts rows by specified column
                                       # Default is ascending order (lowest to highest)
                                       # Cheapest items appear first
print("Arranged by Price (Ascending):")
print(arranged_by_price_asc)          # Shows data sorted from $20 to $1250

[1] "Arranged by Price (Ascending):"
   OrderID CustomerName  Product    Category Price Quantity  OrderDate Region
1      106        Frank    Mouse Electronics    20        3 2024-01-19   West
2      102          Bob    Mouse Electronics    25        2 2024-01-15   West
3      110         Judy    Mouse Electronics    30        2 2024-01-23   West
4      103      Charlie Keyboard Electronics    75        1 2024-01-16  North
5      107        Grace Keyboard Electronics    80        1 2024-01-20  North
6      104        David  Monitor Electronics   300        1 2024-01-17  South
7      108        Heidi  Monitor Electronics   320        1 2024-01-21  South
8      105          Eve   Laptop Electronics  1150        1 2024-01-18   East
9      101        Alice   Laptop Electronics  1200        1 2024-01-15   East
10     109         Ivan   Laptop Electronics  1250        1 2024-01-22   East


In [15]:
# Arrange by Price in descending order
arranged_by_price_desc <- sales_data %>%  # Take sales_data and pipe to arrange()
  arrange(desc(Price))                # desc() function reverses sort order
                                       # desc = descending (highest to lowest)
                                       # Most expensive items appear first
print("Arranged by Price (Descending):")
print(arranged_by_price_desc)          # Shows data sorted from $1250 to $20

[1] "Arranged by Price (Descending):"
   OrderID CustomerName  Product    Category Price Quantity  OrderDate Region
1      109         Ivan   Laptop Electronics  1250        1 2024-01-22   East
2      101        Alice   Laptop Electronics  1200        1 2024-01-15   East
3      105          Eve   Laptop Electronics  1150        1 2024-01-18   East
4      108        Heidi  Monitor Electronics   320        1 2024-01-21  South
5      104        David  Monitor Electronics   300        1 2024-01-17  South
6      107        Grace Keyboard Electronics    80        1 2024-01-20  North
7      103      Charlie Keyboard Electronics    75        1 2024-01-16  North
8      110         Judy    Mouse Electronics    30        2 2024-01-23   West
9      102          Bob    Mouse Electronics    25        2 2024-01-15   West
10     106        Frank    Mouse Electronics    20        3 2024-01-19   West


In [16]:
# Arrange by multiple columns (e.g., Region then Price)
arranged_by_region_price <- sales_data %>%  # Take sales_data and pipe to arrange()
  arrange(Region, Price)              # Multiple columns: first sort by Region
                                       # Then within each region, sort by Price
                                       # Primary sort = Region (alphabetical)
                                       # Secondary sort = Price (within each region)
print("Arranged by Region then Price:")
print(arranged_by_region_price)        # Sorted by Region (East, North, South, West), then by Price within each

[1] "Arranged by Region then Price:"
   OrderID CustomerName  Product    Category Price Quantity  OrderDate Region
1      105          Eve   Laptop Electronics  1150        1 2024-01-18   East
2      101        Alice   Laptop Electronics  1200        1 2024-01-15   East
3      109         Ivan   Laptop Electronics  1250        1 2024-01-22   East
4      103      Charlie Keyboard Electronics    75        1 2024-01-16  North
5      107        Grace Keyboard Electronics    80        1 2024-01-20  North
6      104        David  Monitor Electronics   300        1 2024-01-17  South
7      108        Heidi  Monitor Electronics   320        1 2024-01-21  South
8      106        Frank    Mouse Electronics    20        3 2024-01-19   West
9      102          Bob    Mouse Electronics    25        2 2024-01-15   West
10     110         Judy    Mouse Electronics    30        2 2024-01-23   West


## Combining Operations: The Power of the Pipe

The real power of dplyr comes from **chaining multiple operations** together. A single pipeline can answer a multi-step business question with short, readable code.

---

### 🏭 Think of It Like an Assembly Line

```
📦 Raw Data → 🔍 filter() → 📋 select() → 📊 arrange() → ✅ Final Result
```

Each step transforms the data, and the result flows to the next step!

---

### 💼 Business Example: From Question to Code

**Business Question**: *"I need a list of high-value orders, showing customer name, product, price, and quantity, sorted by highest value first."*

**Breaking it down:**

| Step | Business Need | R Code |
|------|--------------|--------|
| 1 | Start with all data | `sales_data %>%` |
| 2 | Keep high-value orders | `filter(Price * Quantity > 1000) %>%` |
| 3 | Keep only relevant columns | `select(CustomerName, Product, Price, Quantity) %>%` |
| 4 | Sort by value (highest first) | `arrange(desc(Price * Quantity))` |

**Complete code:**
```r
sales_data %>%
  filter(Price * Quantity > 1000) %>%
  select(CustomerName, Product, Price, Quantity) %>%
  arrange(desc(Price * Quantity))
```

---

### 🔧 Reading a Pipeline Step by Step

```r
sales_data %>%               # Step 1: Start with the sales_data dataset
  filter(Price > 100) %>%    # Step 2: Keep only rows where Price > 100
  select(Product, Price) %>% # Step 3: Keep only Product and Price columns
  arrange(desc(Price))       # Step 4: Sort by Price, highest first
```

| After Step | What You Have |
|------------|--------------|
| Step 1 | All 10 rows, all 8 columns |
| Step 2 | Only 5 rows (items priced over $100), still 8 columns |
| Step 3 | Still 5 rows, but only 2 columns now |
| Step 4 | Same 5 rows, 2 columns, now sorted |

---

### 💡 Pro Tip: Testing Pipelines Step by Step

You can run **part of a pipeline** to see intermediate results:

```r
# Run just this to see filtered data before selecting:
sales_data %>%
  filter(Price > 100)

# Then add more steps once you're happy:
sales_data %>%
  filter(Price > 100) %>%
  select(Product, Price)
```
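
One way to check each stage is to save it and look at its dimensions. The sketch below assumes a `sales_data` rebuilt from the output printed earlier in this lesson; `step_filtered` and `step_selected` are hypothetical names.

```r
library(dplyr)

# sales_data reconstructed from the printed output earlier in this lesson
sales_data <- data.frame(
  OrderID      = 101:110,
  CustomerName = c("Alice", "Bob", "Charlie", "David", "Eve",
                   "Frank", "Grace", "Heidi", "Ivan", "Judy"),
  Product      = c("Laptop", "Mouse", "Keyboard", "Monitor", "Laptop",
                   "Mouse", "Keyboard", "Monitor", "Laptop", "Mouse"),
  Category     = "Electronics",
  Price        = c(1200, 25, 75, 300, 1150, 20, 80, 320, 1250, 30),
  Quantity     = c(1, 2, 1, 1, 1, 3, 1, 1, 1, 2),
  OrderDate    = as.Date("2024-01-15") + c(0, 0, 1, 2, 3, 4, 5, 6, 7, 8),
  Region       = c("East", "West", "North", "South", "East",
                   "West", "North", "South", "East", "West")
)

# Save each intermediate stage under its own name
step_filtered <- sales_data %>% filter(Price > 100)
step_selected <- step_filtered %>% select(Product, Price)

# dim() returns c(rows, columns)
dim(sales_data)     # before any step
dim(step_filtered)  # fewer rows, same columns
dim(step_selected)  # same rows, fewer columns
```

Comparing the row and column counts before and after each step is a quick sanity check that a filter or select did what you intended.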

---

### ⚠️ Order Matters!

The order of operations can change your results:

| Order | What Happens |
|-------|-------------|
| `filter()` then `select()` | Filter uses all columns, then select reduces columns ✅ |
| `select()` then `filter()` | ⚠️ Can only filter on columns you kept! |

**Example of a problem:**
```r
# This WON'T work - we selected away the Price column before filtering!
sales_data %>%
  select(Product, CustomerName) %>%  # Oops! No Price column anymore
  filter(Price > 100)                 # ERROR: Price doesn't exist!
```
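
To see the failure and the fix side by side without stopping your script, here is a minimal sketch. The `orders` data frame is made up for illustration, and `tryCatch()` is used only to capture the error as a message.

```r
library(dplyr)

# Hypothetical mini dataset, for illustration only
orders <- data.frame(
  CustomerName = c("Alice", "Bob", "David"),
  Product      = c("Laptop", "Mouse", "Monitor"),
  Price        = c(1200, 25, 300)
)

# Wrong order: Price is dropped before filter() needs it.
# tryCatch() captures the error so the script keeps running.
wrong_order <- tryCatch(
  orders %>%
    select(Product, CustomerName) %>%
    filter(Price > 100),
  error = function(e) "error: Price is no longer available"
)

# Right order: filter on Price first, then drop it
right_order <- orders %>%
  filter(Price > 100) %>%
  select(Product, CustomerName)

print(wrong_order)
print(right_order)
```

One caveat: if an object named `Price` happens to exist in your workspace, `filter()` may use it instead of raising an error, which can hide this mistake.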

---

### 🎯 HOW TO ADAPT THIS CODE

**Template for a typical analysis pipeline:**
```r
result <- my_data %>%
  filter(condition1, condition2) %>%    # Step 1: Keep relevant rows
  select(col1, col2, col3) %>%          # Step 2: Keep relevant columns
  arrange(desc(col1))                   # Step 3: Sort the results
```

**Template for calculating then sorting:**
```r
result <- my_data %>%
  filter(column1 * column2 > threshold) %>%   # Filter on a calculation
  arrange(desc(column1 * column2))            # Sort by the same calculation
```

**Common Pipeline Patterns:**

| Pattern | Use Case |
|---------|----------|
| `filter() %>% select() %>% arrange()` | Standard analysis flow |
| `filter() %>% arrange()` | When you need all columns |
| `select() %>% filter()` | When filtering on selected columns only |

---

Let's see a combined pipeline in action:

In [17]:
# Combine operations: filter and then arrange
high_value_orders_arranged <- sales_data %>%  # Start with sales_data
  filter(Price * Quantity > 1000) %>% # Step 1: Calculate total value (Price × Quantity)
                                       # Keep only rows where total > 1000
  arrange(desc(Price * Quantity))     # Step 2: Sort by total value (highest first)
                                       # desc() sorts in descending order
                                       # Pipeline: data → filter → arrange → result
print("High Value Orders (Price * Quantity > 1000) arranged by total value (Descending):")
print(high_value_orders_arranged)      # Shows high-value orders sorted by total value

[1] "High Value Orders (Price * Quantity > 1000) arranged by total value (Descending):"
  OrderID CustomerName Product    Category Price Quantity  OrderDate Region
1     109         Ivan  Laptop Electronics  1250        1 2024-01-22   East
2     101        Alice  Laptop Electronics  1200        1 2024-01-15   East
3     105          Eve  Laptop Electronics  1150        1 2024-01-18   East


## Summary and Next Steps

### 📚 What You Learned in This Lesson

| Function | Purpose | Example |
|----------|---------|---------|
| `select()` | Choose specific columns | `select(Name, Price, Region)` |
| `filter()` | Keep rows matching conditions | `filter(Price > 100)` |
| `arrange()` | Sort rows in order | `arrange(desc(Price))` |
| `%>%` | Chain operations together | `data %>% filter() %>% select()` |

---

### 🔑 Key Takeaways

| Concept | Remember |
|---------|----------|
| **Pipe operator** | `%>%` means "and then..." - read code left to right |
| **select() syntax** | Column names without quotes, separated by commas |
| **filter() comparisons** | Use `==` for equals (not `=`), text in quotes |
| **arrange() default** | Ascending order; use `desc()` for descending |
| **Order matters** | Filter before select if you need all columns for filtering |

---

### ⚠️ Common Mistakes to Avoid

| Mistake | Problem | Solution |
|---------|---------|----------|
| Using `=` instead of `==` | `filter(x = 5)` causes an error | Use `filter(x == 5)` |
| Forgetting quotes for text | `filter(Region == East)` fails | Use `filter(Region == "East")` |
| Selecting away needed columns | Can't filter on removed columns | Filter before select |
| Typos in column names | R is case-sensitive | Match exact column names |
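
The mistakes in this table can be tried safely with a small helper that reports whether an expression raises an error. This is only a sketch: the `orders` data and the `fails()` helper are hypothetical and not part of dplyr.

```r
library(dplyr)

# Hypothetical mini dataset, for illustration only
orders <- data.frame(
  Region = c("East", "West", "East"),
  Price  = c(1200, 25, 1150)
)

# Hypothetical helper: returns TRUE if evaluating `expr` raises an error
fails <- function(expr) {
  tryCatch({ expr; FALSE }, error = function(e) TRUE)
}

fails(orders %>% filter(Region = "East"))  # `=` instead of `==`
fails(orders %>% filter(Region == East))   # text value without quotes
fails(orders %>% filter(price > 100))      # wrong case in column name

# Corrected version: `==`, quoted text, exact column name
east_orders <- orders %>% filter(Region == "East")
print(east_orders)
```

Each of the three `fails()` calls should report an error in a clean session. The last two depend on no objects named `East` or `price` existing in your workspace.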

---

### 🎯 Quick Reference Cheat Sheet

```r
# SELECT - Keep only these columns
data %>% select(col1, col2, col3)
data %>% select(-col_to_remove)
data %>% select(starts_with("Sales"))

# FILTER - Keep rows matching conditions
data %>% filter(column > value)
data %>% filter(column == "text")
data %>% filter(col1 > 5, col2 == "A")        # AND
data %>% filter(col1 == "A" | col1 == "B")    # OR
data %>% filter(col1 %in% c("A", "B", "C"))   # IN list

# ARRANGE - Sort rows
data %>% arrange(column)                       # Ascending
data %>% arrange(desc(column))                 # Descending
data %>% arrange(col1, col2)                   # Multi-column

# COMBINE - Chain operations
data %>%
  filter(condition) %>%
  select(columns) %>%
  arrange(sort_column)
```

---

### 📅 Coming Up in Part 2

| Function | What It Does |
|----------|-------------|
| `mutate()` | Create new calculated columns |
| `summarize()` | Calculate summary statistics (mean, sum, count) |
| `group_by()` | Perform operations by group (e.g., by region) |

---

### 📝 Practice Exercises

Try these on your own to reinforce your learning:

1. **Select** only the `CustomerName`, `Product`, and `OrderDate` columns
2. **Filter** for orders where `Quantity > 1`
3. **Arrange** the data by `OrderDate` (newest first)
4. **Combine**: Find all Laptop orders, show only `CustomerName` and `Price`, sorted by `Price` descending
5. **Challenge**: Find orders from "East" or "West" regions with `Price > 50`, arranged by `Region` then `Price`

**💡 Solution Pattern:**
```r
# Exercise 4 solution:
sales_data %>%
  filter(Product == "Laptop") %>%
  select(CustomerName, Price) %>%
  arrange(desc(Price))
```