# Workshop: R Basics

Welcome to this R workshop! The goal of this notebook is to introduce you to the essentials of data analysis in R, from the very basics to some more advanced data wrangling techniques.


---
Before you start, make sure you are familiar with the [R Syntax Fundamentals](https://github.com/RaHub4AI/MI7032/blob/main/Introduction_to_R_and_Python/R_Syntax_Fundamentals.md).  
These rules cover things like comments, variables, assignment, reserved words, and code readability - and you should keep them in mind while writing your code.


## General Information

In [1]:
# Checking the complete version of the installed R
R.version.string  # Print the R version string

In [2]:
# Check the current working directory
getwd()

# Set your working directory (example path)
#setwd("C:/Users/YourName/Documents/EDS_projects")

# Check the files in the directory
dir()


## Operators

---

#### Arithmetic

Used for basic mathematical calculations.  

| Operator | Description             | Example   | Result |
|----------|-------------------------|-----------|--------|
| `+`      | Addition                | `5 + 3`   | 8      |
| `-`      | Subtraction             | `5 - 3`   | 2      |
| `*`      | Multiplication          | `5 * 3`   | 15     |
| `/`      | Division                | `5 / 2`   | 2.5    |
| `^` or `**` | Exponentiation       | `2 ^ 3`   | 8      |
| `%%`     | Modulus                 | `5 %% 2`  | 1      |
| `%/%`    | Integer division        | `5 %/% 2` | 2      |

---

#### Assignment

Used to assign values to objects.  

| Operator | Example   | Explanation |
|----------|-----------|-------------|
| `<-`     | `x <- 5`  | Assign 5 to `x` (recommended in R) |
| `->`     | `5 -> x`  | Same as above, but reversed |
| `=`      | `x = 5`   | Assign 5 to `x` (less common, also used in function arguments) |
| `<<-`    | `x <<- 5` | Assign 5 to `x` in an enclosing environment (typically the global one); mainly used inside functions |

---

#### Comparison

Used to compare values. They return logical values (`TRUE` / `FALSE`).  

| Operator | Description      | Example   | Result |
|----------|------------------|-----------|--------|
| `==`     | Equal to         | `5 == 3`  | FALSE |
| `!=`     | Not equal to     | `5 != 3`  | TRUE  |
| `>`      | Greater than     | `5 > 3`   | TRUE  |
| `<`      | Less than        | `5 < 3`   | FALSE |
| `>=`     | Greater or equal | `5 >= 5`  | TRUE  |
| `<=`     | Less or equal    | `3 <= 5`  | TRUE  |

---

#### Logical

Used to combine logical values.  

| Operator | Description                       | Example | Result |
|----------|-----------------------------------|---|---|
| `&`      | Element-wise `AND` | `((-2:2) >= 0) & ((-2:2) <= 0)` | `FALSE FALSE TRUE FALSE FALSE` |
| `\|`      | Element-wise `OR` | `c(T,T,F,F) \| c(T,F,T,F)` | `TRUE TRUE TRUE FALSE` |
| `&&`     | Scalar `AND` (single values only, short-circuits) | `(3 > 2) && (1 > 2)` | `FALSE` |
| `\|\|`     | Scalar `OR` (single values only, short-circuits) | `(3 > 2) \|\| (1 > 2)` | `TRUE` |
| `!`      | NOT (negation)                    | `!(2 == 3)` | `TRUE` |

---
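
The difference between the element-wise and the scalar forms is easiest to see side by side. The following is a small sketch; the vector `x` is made up for illustration. In recent R versions (4.3.0 and later), `&&` and `||` raise an error when either side is longer than one element, so they are best reserved for single conditions such as those in `if` statements.

```r
x <- -2:2   # illustrative vector: -2 -1 0 1 2

# Element-wise: one result per element
(x >= 0) & (x <= 0)   # FALSE FALSE TRUE FALSE FALSE
(x < -1) | (x > 1)    # TRUE FALSE FALSE FALSE TRUE

# Scalar: exactly one TRUE/FALSE, evaluated left to right with short-circuiting
(length(x) > 0) && (x[1] < 0)    # TRUE
(length(x) == 0) || (x[1] > 0)   # FALSE

# Negation flips each logical value
!c(TRUE, FALSE)   # FALSE TRUE
```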

#### Miscellaneous

Other useful operators in R.  

| Operator | Description                                | Example      | Result |
|----------|--------------------------------------------|--------------|--------|
| `:`      | Sequence operator                          | `1:5`        | `1 2 3 4 5` |
| `%in%`   | Matching operator (is element in set)      | `2 %in% c(1,2,3)` | `TRUE` |
| `%*%`    | Matrix multiplication                      | `matrix(1:4,2) %*% matrix(1:4,2)` | Matrix result |
| `$`      | Access variables in a data frame/list      | `df$col`     | Extract column `col` |
| `[]`     | Indexing / subsetting                      | `x[1:3]`     | First 3 elements |
| `[[]]`   | Extract single element from list/data frame | `my_list[[1]]` | First element |


## R as a Calculator
The simplest use of R is doing computations directly:

In [5]:
# Let's try the arithmetic operators
3 + 5

In [6]:
25 ** 2

In [7]:
1 / 2

In [8]:
1 %% 2

Core mathematical functions and constants are readily available:

In [11]:
2 + sqrt(375769) - 25^2  # exponentiation can be ^ or ** (e.g., 2**3)

In [21]:
sin(pi/6) + acosh(1)

In [13]:
log(exp(1))

In [14]:
sqrt(-1+0i)  # complex numbers

In [15]:
factorial(6) / choose(4, 2)

In **R**, some special values are used to represent results that go beyond ordinary numbers:

- `Inf` (infinity): appears when dividing a nonzero number by zero, e.g. `1/0`.  
- `-Inf` (negative infinity): appears with negative numbers divided by zero, e.g. `-1/0`.  
- `NaN` (“Not a Number”): appears when the result is undefined, e.g. `0/0` or `Inf - Inf`.  

These values come from the [IEEE 754 floating-point standard](https://standards.ieee.org/ieee/754/6210/) and allow computations to continue even when a calculation cannot produce an ordinary number.  

You can test for them using R’s built-in functions:  
- `is.infinite(x)` checks for `Inf` or `-Inf`.  
- `is.nan(x)` checks for `NaN`.

*Read more: https://doi.org/10.1080/09332480.2025.2510168*

In [9]:
1 / 0

In [10]:
0 / 0

In [24]:
x = 2
is.infinite(x)

In [26]:
is.infinite(x/0)

In [31]:
Inf - Inf

Functions (see more in [Functions](#functions)) are called by **name** followed by parentheses containing **arguments**. You may omit argument names if you provide values in the documented order. These two calls are equivalent:

In [None]:
log(x = 25, base = 5)
log(25, 5)
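
As a small sketch of how argument matching behaves: once arguments are named, their order no longer matters, and an argument with a default value can be left out entirely. The numbers are the same illustrative ones used above.

```r
# Positional and named calls give the same result
log(25, 5)               # 2
log(x = 25, base = 5)    # 2

# Named arguments may be given in any order
log(base = 5, x = 25)    # 2

# Omitting `base` falls back to its default, exp(1), i.e. the natural logarithm
log(25)
```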

## Variables

Creating objects means defining and assigning values to variables. In R, everything is stored as an object, including data, functions, models, and results. Objects are created using the assignment operator (`<-` or `=`). While both operators work, `<-` is recommended because `=` is also used for specifying function arguments.

An object’s name should describe the data it contains, allowing the name itself to serve as a clear reference to the underlying information.

In [35]:
sum1 = 2 + 3
sum2 <- 2 + 3

In [36]:
# Print the result
print(paste("The sum1 of the numbers is:", sum1))  # display the sum
print(paste0("The sum2 of the numbers is:", sum2))

[1] "The sum1 of the numbers is: 5"
[1] "The sum2 of the numbers is:5"


><font color='gold'> Do you notice the difference between `paste()` and `paste0()`?</font>
> - `paste()` adds a space (by default) between the elements.
> - `paste0()` concatenates the elements without any space.
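
The separator used by `paste()` can be changed with its `sep` argument, and `collapse` joins the elements of a single vector into one string. A short sketch with made-up values:

```r
paste("2025", "10", "02", sep = "-")       # "2025-10-02"
paste0("station_", 1:3)                    # "station_1" "station_2" "station_3"
paste(c("A", "B", "C"), collapse = ", ")   # "A, B, C"
```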

## Getting Help

Many functions have **default values**, so you don't always need to set every argument. Open a function's help page by prefixing it with `?` (in RStudio, this shows in the Help pane):

In [37]:
?log

Help pages typically include:
- **Usage**: function signature with parameters and defaults.
- **Arguments**: meaning of each parameter.
- **Details**: implementation details and caveats.
- **Examples**: runnable examples—often the fastest way to learn a new function.

### <font color='gold'> Task 1 </font>
1. What does the function `rm` do?
2. Try one example from the `rm` help page.

In [None]:
# YOUR CODE HERE!

## Data Types

In R, data types define the kind of values a variable can hold and how they are processed. Use `class()` or `typeof()` to check an object’s type.

In [38]:
weight <- 80
height <- 180
bmi <- weight / (height / 100)^2
bmi

In [39]:
class(bmi)

In [40]:
typeof(bmi)

#### Common Data Types
- `numeric`: real numbers (covers both integers and doubles; plain numbers such as `80` are stored as doubles unless you ask for an integer)

In [None]:
class(weight)
typeof(weight)

In [None]:
# Use suffix L to specify integer data
weight2 <- 80L
class(weight2)
typeof(weight2)

- `complex`: numbers with real and imaginary parts

In [None]:
complex_variable <- 3 + 2i
class(complex_variable)

- `logical`: boolean values `TRUE`/`FALSE` (or `T`/`F`)

In [None]:
logical_variable1 <- TRUE
logical_variable2 <- F

class(logical_variable1)
class(logical_variable2)

- `character`: strings (surrounded by quotes, either `"double quotes"` or `'single quotes'` are fine)

In [2]:
string1 <- 'I’m going to become an Environmental Data Scientist!'

class(string1)

You can convert types using `as.<type>`:

In [None]:
'5' + 5

ERROR: Error in "5" + 5: non-numeric argument to binary operator


In [None]:
as.numeric('5') + 5
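
A few more conversions, as a sketch of how the `as.<type>` family behaves. The input values are made up; note that text which cannot be read as a number becomes `NA` (with a warning, suppressed here to keep the output clean).

```r
as.integer("42")       # 42, stored as an integer
as.character(3.14)     # "3.14"
as.numeric("3.5e2")    # 350

# Text that cannot be interpreted as a number becomes NA
suppressWarnings(as.numeric("five"))   # NA
```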

### <font color='gold'> Task 2 </font>
- Can you add logical values?
- What is the numeric value of `TRUE`?

In [None]:
# YOUR CODE HERE!

## Functions

Functions are a way to group code that performs a specific task, so you can call it whenever needed.  
They can take in data (arguments), perform operations, and return a result. Functions make code easier to read, reduce repetition, and allow you to organize your work more clearly.

---

#### General syntax

```r
function_name <- function(arguments) {
  # operations
  return(value)
}
```

- `function_name` is how you will call the function later.
- `arguments` are the inputs the function accepts. They can have default values.
- `return()` sends a value back from the function (if omitted, the last expression is returned automatically).

**Example: Body Mass Index (BMI)**

In [5]:
bmi <- function(weight, height = 180) {
  res <- weight / (height / 100) ^ 2
  return(res)
}

# Using the function
bmi(85)       # height defaults to 180 cm
bmi(85, 195)  # override default height


> - Parameters are the variable names you specify in the function definition (e.g., `weight`, `height`). Arguments are the actual values you pass when you call the function (e.g., `85`, `195`).
>
> R supports:
> - Positional arguments: `bmi(85, 180)`
> - Named arguments: `bmi(weight = 85, height = 180)` (clearer, order-independent)
> - Default arguments: `height = 180` allows you to omit `height`
>
> The `return()` function specifies what value is given back.
> If you omit it, R returns the last evaluated expression by default.

In [6]:
square_of <- function(number) {
  return(number ^ 2)
}

# Without explicit return()
square_of2 <- function(number) {
  number ^ 2   # last expression is returned automatically
}


In [7]:
square_of(3)

In [8]:
square_of2(3)

> Some functions are called mainly for their side effects, such as printing, rather than for the value they return.

In [9]:
print_greeting <- function(name) {
  print(paste("Hello,", name, "!"))
}

print_greeting("ACES")


[1] "Hello, ACES !"


It’s good practice to add comments inside your function that explain what it does, what inputs it expects, and what it returns.
Unlike Python, R does not have a special “docstring” syntax, but comments with `#` serve the same purpose.

In [10]:
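celsius_to_kelvin <- function(temp_c) {
  # Convert a temperature from Celsius to Kelvin (K = C + 273.15).
  # Input:   temp_c, numeric vector of temperatures in degrees Celsius
  # Returns: numeric vector of temperatures in Kelvin
  return(temp_c + 273.15)
}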

There are several types of functions in R:

- **Built-in functions**: These come with base R and are always available.  
  Examples: `sum()`, `mean()`, `sqrt()`, `log()`  
- **Imported functions**: These are provided by external packages. To use them, you first need to install the package with `install.packages()` and then load it with `library()`.  
  Examples: `lubridate::as_date()` for dates, `ggplot2::ggplot()` for plotting  
  (We haven’t used these yet - you’ll learn more about them in the next section, [Packages](#packages).)

- **User-defined functions**: These are the functions you write yourself to extend R with custom behavior.  
  Example: the `bmi()` function we defined above

## Packages

Many extra functions come in packages; load them with `library()`. For example, `as_date()` is in the `lubridate` package:

In [None]:
# This will error if lubridate is not yet loaded with library():
# as_date('2025-10-02')

library(lubridate)
as_date('2025-10-02')

In [None]:
class(as_date('2025-10-02'))

Packages from CRAN [(available packages)](https://cran.r-project.org/web/packages/available_packages_by_date.html) can be installed with `install.packages()`.

> Sometimes you’ll see a function written with the syntax `package::function()`.
> This means: “use this function from that package directly.”
> This is useful because:
> - It lets you call a function without loading the whole package with `library()`.
> - It avoids conflicts when two packages provide functions with the same name.
>
> For example, both `dplyr` and `stats` have a function named `filter()`.
> Using `dplyr::filter()` or `stats::filter()` makes it clear which one you want.
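
A minimal sketch of the `::` syntax. The first call uses `stats`, which ships with R; the second part assumes nothing about your setup and first checks with `requireNamespace()` whether `lubridate` is installed.

```r
# stats is shipped with R, so this works without any library() call
stats::median(c(3, 1, 10))   # 3

# Check whether a package is installed before calling one of its functions via ::
if (requireNamespace("lubridate", quietly = TRUE)) {
  lubridate::as_date("2025-10-02")
} else {
  message("lubridate is not installed; install it with install.packages('lubridate')")
}
```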

### <font color='gold'> Task 3 </font>
Install the package `tidyverse` and load it.

In [None]:
# YOUR CODE HERE!

## Vectors

The central object type in R is the vector (similar to a 1D NumPy array): a one-dimensional sequence of values of the same type. There are several ways to create vectors.

In [None]:
1:10                       # sequences
9:2                        # reverse sequence
c(1, 4, 2, 6)              # arbitrary elements
c(1:10, 4, c(2, 4))        # arguments can be either vectors or individual values
c("A", "B", "C")           # character vector
seq(0, 1, length.out = 10) # sequences with an arbitrary start and end point and equal intervals
rep(1:2, times = 5)        # repeat the vector 1:2 five times

Extracting data from a vector is done using square brackets (`[]`).

<font color='gold'>It’s important to remember that in **R indexing starts at `1`**.</font>

In [13]:
x <- 10:1   # 10 down to 1

print(x[1])     # first element (10)
print(x[1:5])   # first 5 elements

print(x[-1])    # all elements EXCEPT the first one

# To get the last element, use length(x)
print(x[length(x)])   # last element (1)


[1] 10
[1] 10  9  8  7  6
[1] 9 8 7 6 5 4 3 2 1
[1] 1
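
Besides single positions and ranges, the square brackets also accept a vector of positions or a logical condition. A short sketch using the same vector `x` as above:

```r
x <- 10:1

x[c(1, 3, 5)]    # elements at positions 1, 3 and 5: 10 8 6
x[-(1:3)]        # drop the first three elements: 7 6 5 4 3 2 1
x[x > 7]         # keep elements that satisfy a condition: 10 9 8
x[x %% 2 == 0]   # even values only: 10 8 6 4 2
```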


In fact, even single-value objects are treated as vectors in R. Therefore, it makes little difference whether an operation involves a vector or just individual values. For vectors, operations are carried out element by element. If the input vectors have different lengths, the shorter one is repeated until it matches the length of the longer one. The same logic applies to both arithmetic operations and many functions.

In [None]:
1:10 + 5
1:10 + 10:1
log(c(1, 10, 100, 1000), base = 10)
bmi(c(85, 90, 95, 100, 105), 194)
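
The recycling rule is easiest to see when the two vectors clearly differ in length. The sketch below uses made-up numbers; in the last call R still returns a result but warns, because the longer length is not a multiple of the shorter one (the warning is suppressed here).

```r
# Length 6 and length 2: the shorter vector is recycled three times
1:6 + c(10, 20)   # 11 22 13 24 15 26

# Length 6 and length 1: the single value is reused for every element
1:6 * 2           # 2 4 6 8 10 12

# Length 5 and length 2: 5 is not a multiple of 2, so R normally warns
suppressWarnings(1:5 + c(10, 20))   # 11 22 13 24 15
```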

Many functions take vectors and return a single number:

In [None]:
min(1:10)
max(1:10)
mean(1:10)

### <font color='gold'> Task 4</font>
Define a function that **rescales** a numeric vector to the range [0, 1] so that the minimum becomes 0 and the maximum becomes 1.

> *Hint:* for a vector $x$ and element $x_i$, use:
> $$ \frac{x_i - \min(x)}{\max(x) - \min(x)} $$

Test your function on vectors `0:10` and `-5:5`. Are the results similar or different?

In [None]:
# YOUR CODE HERE!

## Data Tables

In practical data analysis, data are often given in tables where the columns contain variables of different types. Such tables are central in R: historically the `data.frame` object has been used, and more recently the `tibble`, an enhanced version of `data.frame` provided by the `tibble` package (part of the `tidyverse`). You can think of such a table as a collection of column vectors (one per variable), all of which must have the same length.

In [None]:
library(tidyverse)

stations <- tibble(
  ID = c('101', '102', '103', '104'),
  name = c('Umeå', 'Vindeln', 'Siljansfors', 'Asa'),
  longitude = c(20.19, 19.46, 14.24, 14.47),
  latitude = c(63.49, 64.14, 60.53, 57.10),
  altitude = c(33, 225, 320, 180)
)

stations

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


ID,name,longitude,latitude,altitude
<chr>,<chr>,<dbl>,<dbl>,<dbl>
101,Umeå,20.19,63.49,33
102,Vindeln,19.46,64.14,225
103,Siljansfors,14.24,60.53,320
104,Asa,14.47,57.1,180


In [None]:
class(stations)

To add new data to your data frames (or to combine two data frames), you can use the functions `cbind()` and `rbind()`.
- `cbind()` (column bind): adds columns side by side.
    - The number of rows must match.
    - Row names are ignored.

- `rbind()` (row bind): stacks rows on top of each other.
    - Both the number and the names of columns must match.

> If data frames don’t have identical columns, use one of the following, both of which fill missing columns with `NA`:
> - `plyr::rbind.fill()` (from the `plyr` package), or
> - `dplyr::bind_rows()` (from the `dplyr` package, part of the `tidyverse`).

In [None]:
# Add a new column with cbind()
regions <- c('Norrland', 'Norrland', 'Svealand', 'Götaland')
stations <- cbind(stations, region = regions)
stations

ID,name,longitude,latitude,altitude,region
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
101,Umeå,20.19,63.49,33,Norrland
102,Vindeln,19.46,64.14,225,Norrland
103,Siljansfors,14.24,60.53,320,Svealand
104,Asa,14.47,57.1,180,Götaland


In [None]:
# Add a new row with rbind()
new_station <- tibble(
  ID = '105',
  name = 'Tönnersjöheden',
  longitude = 13.07,
  latitude = 56.42,
  altitude = 80,
  region = 'Götaland'
)
stations <- rbind(stations, new_station)
stations

ID,name,longitude,latitude,altitude,region
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
101,Umeå,20.19,63.49,33,Norrland
102,Vindeln,19.46,64.14,225,Norrland
103,Siljansfors,14.24,60.53,320,Svealand
104,Asa,14.47,57.1,180,Götaland
105,Tönnersjöheden,13.07,56.42,80,Götaland


In [None]:
# Missing values
extra_station <- tibble(
  ID = '106',
  name = 'Kulbäcksliden',
  longitude = 19.49,
  latitude = 64.52,
  region = 'Norrland'
)

extra_station

ID,name,longitude,latitude,region
<chr>,<chr>,<dbl>,<dbl>,<chr>
106,Kulbäcksliden,19.49,64.52,Norrland


In [None]:
# Use bind_rows() function
stations <- bind_rows(stations, extra_station)
stations

ID,name,longitude,latitude,altitude,region
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
101,Umeå,20.19,63.49,33,Norrland
102,Vindeln,19.46,64.14,225,Norrland
103,Siljansfors,14.24,60.53,320,Svealand
104,Asa,14.47,57.1,180,Götaland
105,Tönnersjöheden,13.07,56.42,80,Götaland
106,Kulbäcksliden,19.49,64.52,NA,Norrland


We often need to subset (extract rows and columns) our data frames. We can access elements in a data frame using square brackets `[]` or the dollar `$` operator.

In [None]:
# Access the "region" column with $
stations$region

In [None]:
# Access the cell in the 3rd row and 2nd column
stations[3, 2]

In [None]:
# Access the 2nd column for all rows except the 3rd
stations[-3, 2]

In [None]:
# Filter rows where altitude is exactly (`==`) 80
stations[stations$altitude == 80, ]

,ID,name,longitude,latitude,altitude,region
,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
5,105,Tönnersjöheden,13.07,56.42,80,Götaland
NA,NA,NA,NA,NA,NA,NA
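
The row of missing values in the output above comes from station 106, whose altitude is `NA`: comparing `NA` with any number gives `NA` rather than `FALSE`, and indexing rows with `NA` produces an all-`NA` row. A minimal sketch of this behaviour, using a small made-up data frame `df` instead of `stations`:

```r
df <- data.frame(name = c("A", "B", "C"), altitude = c(80, NA, 120))

NA == 80             # NA: the result of the comparison is unknown
df$altitude == 80    # TRUE NA FALSE

df[df$altitude == 80, ]                         # row "A" plus an all-NA row
df[which(df$altitude == 80), ]                  # which() drops NA: only row "A"
df[!is.na(df$altitude) & df$altitude == 80, ]   # same result, written explicitly
```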


In [None]:
# Keep only the rows without missing values (NA) in any column
stations[complete.cases(stations), ]

,ID,name,longitude,latitude,altitude,region
,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
1,101,Umeå,20.19,63.49,33,Norrland
2,102,Vindeln,19.46,64.14,225,Norrland
3,103,Siljansfors,14.24,60.53,320,Svealand
4,104,Asa,14.47,57.1,180,Götaland
5,105,Tönnersjöheden,13.07,56.42,80,Götaland


In [None]:
# Filter rows where altitude is exactly 80, then drop rows containing NA
na.omit(stations[stations$altitude == 80, ])

,ID,name,longitude,latitude,altitude,region
,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
5,105,Tönnersjöheden,13.07,56.42,80,Götaland


In [None]:
# Filter rows where altitude is NOT equal to 80
stations[stations$altitude != 80, ]

,ID,name,longitude,latitude,altitude,region
,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
1,101,Umeå,20.19,63.49,33,Norrland
2,102,Vindeln,19.46,64.14,225,Norrland
3,103,Siljansfors,14.24,60.53,320,Svealand
4,104,Asa,14.47,57.1,180,Götaland
NA,NA,NA,NA,NA,NA,NA


In [None]:
# Filter rows where altitude is between 100 and 200
stations[stations$altitude > 100 & stations$altitude < 200, ]

,ID,name,longitude,latitude,altitude,region
,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
4,104,Asa,14.47,57.1,180,Götaland
NA,NA,NA,NA,NA,NA,NA


In [None]:
# Filter rows where altitude is either > 200 or < 100
stations[stations$altitude > 200 | stations$altitude < 100, ]

,ID,name,longitude,latitude,altitude,region
,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
1,101,Umeå,20.19,63.49,33,Norrland
2,102,Vindeln,19.46,64.14,225,Norrland
3,103,Siljansfors,14.24,60.53,320,Svealand
5,105,Tönnersjöheden,13.07,56.42,80,Götaland
NA,NA,NA,NA,NA,NA,NA


### Factors
Factors are a special type of vector for categorical variables with levels.
We can convert the `region` column to a factor using `factor()`.

In [None]:
# Convert 'region' column to a factor
region_factor <- factor(stations$region)

# Display the attributes
attributes(region_factor)
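
By default, `factor()` sorts the levels alphabetically; the `levels` argument lets you set the order yourself. A small sketch with a made-up character vector `sizes`:

```r
sizes <- c("small", "large", "medium", "small", "large")

# Without `levels`, the levels are sorted alphabetically
f_default <- factor(sizes)
levels(f_default)      # "large" "medium" "small"

# With `levels`, you control the order
f_custom <- factor(sizes, levels = c("small", "medium", "large"))
levels(f_custom)       # "small" "medium" "large"

table(f_custom)        # counts per level, in level order
as.integer(f_custom)   # underlying integer codes: 1 3 2 1 3
```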


### <font color='gold'>Task 5</font>  

Create a data frame named `earthquake_classes` with three columns:
- `class`
- `magnitude`
- `description`  
Use the information from this image provided by the Alaska Earthquake Center:  
![earthquake.png](https://earthquake.alaska.edu/sites/default/files/inline-images/magnitude%20classes_0.png)  

Finally, convert the `class` column into a factor variable.  


In [None]:
# YOUR CODE HERE!

## Reading and Writing Data Files

Hopefully, Task 5 gave you an idea that **entering data manually can be tedious and error-prone**.  
Fortunately, in most real workflows we can **read data directly from files** instead of typing everything by hand.  

Depending on the field, data can come in very different formats.  
For example:  
- **Text-based tables** (e.g., medical records stored as `.txt` files) that can be read with simple text import functions.  
- **Highly specialized formats**, such as weather radar data stored in the `ODIM HDF5` standard, which require special packages to read and interpret.

In this course, however, we will mainly deal with **tabular data**, most often stored in:  
- **CSV** (`.csv`) → comma-separated (`,`) values  
- **TSV** (`.tsv`) → tab-separated (`\t`) values  
- **Excel** (`.xlsx`, `.xls`) → spreadsheet files  

I <font color='#2fa1ff'>strongly recommend</font> using **TSV** files whenever possible. Although CSV files are popular, they can be problematic because commas often appear inside data values (e.g., in addresses or descriptions), which makes parsing difficult and error-prone.  
**TSV files** are usually safer: tabs rarely appear inside the data, making these files easier to parse reliably.  


There are **several ways** to import data into R.  
The most flexible function is `read_delim()` (from the `readr` package, part of the `tidyverse`), where you specify the delimiter yourself.  

On top of that, there are convenient shortcut functions like `read_csv()` and `read_tsv()` (also from `readr`) for the most common file types.
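
To see how the delimiter argument works without needing a file, here is a small sketch that reads literal text instead. It assumes `readr` version 2.0 or later, where literal data is wrapped in `I()`; the station values are made up.

```r
library(readr)

# Literal data wrapped in I() stands in for a file path (illustrative values)
semicolon_text <- I("station;altitude\n101;33\n102;225\n")
semi <- read_delim(semicolon_text, delim = ";", show_col_types = FALSE)
semi

# read_tsv() is the same idea with the delimiter fixed to a tab
tab_text <- I("station\taltitude\n101\t33\n102\t225\n")
tabs <- read_tsv(tab_text, show_col_types = FALSE)
tabs
```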


To try this out, download the file `berry_data.csv` from the following source:  
Langvall, O. (2021). *Swedish Forest Phenology dataset (Version 1)* [Data set]. Swedish University of Agricultural Sciences. https://doi.org/10.5878/jbab-cy46  
> Dataset page: https://researchdata.se/en/catalogue/dataset/2021-194-1/1  

Once you have the file, place it in your working directory (or upload it if you are using Google Colab).  

Now let’s read it into R using both `read_delim()` and `read_csv()`.

In [16]:
# Check files in the working directory
dir()

In [None]:
# Read with explicit delimiter
#read_delim('berry_data.csv', delim = ',')

# Read with shortcut
#read_csv('berry_data.csv')

Rows: 1889 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): Station, Species, Year, doy, Flowers, Unripe, Ripe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Station,Species,Year,doy,Flowers,Unripe,Ripe
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
103,2440100,2006,135,0.0,,
103,2440100,2006,142,50.3,,
103,2440100,2006,144,83.4,,
103,2440100,2006,149,111.7,,
103,2440100,2006,153,110.5,0.0,
103,2440100,2006,156,81.5,0.0,
103,2440100,2006,160,39.6,39.7,
103,2440100,2006,163,8.8,76.3,
103,2440100,2006,166,1.0,95.0,
103,2440100,2006,170,0.4,70.0,


### <font color='gold'> Task 6 </font>  

1. Read the file `berry_data.csv` into an object called `berry_data`.  
2. Explore the dataset:  
   - How many **rows** and **columns** does it contain?  
   - What are the **features** (**variables**) in the dataset?  

> *Hint:* For the first part, check the dimensions with `dim(berry_data)`,  
> or use `nrow()` and `ncol()` separately.  
> For the second part, check `names(berry_data)` and consult the accompanying `Metadata.pdf` file.

In [14]:
# YOUR CODE HERE!

>Files may have quirks (custom missing value markers, wrong types, etc.). Most can be handled via `read_*` parameters. Check the help pages and examples.
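
Two such parameters are `na` (which strings count as missing) and `col_types` (which type each column should get). The sketch below uses made-up literal data in which `-999` marks a missing altitude and the IDs should stay as text; it assumes `readr` 2.0 or later for the `I()` syntax.

```r
library(readr)

# Illustrative data: -999 marks a missing value, and IDs should stay as text
raw_text <- I("ID,altitude\n101,33\n102,-999\n103,320\n")

stations_raw <- read_csv(
  raw_text,
  na = c("", "NA", "-999"),   # treat -999 as missing
  col_types = cols(
    ID = col_character(),     # keep IDs as character, not numbers
    altitude = col_double()
  )
)
stations_raw
```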

In [None]:
glimpse(berry_data)

Rows: 1,889
Columns: 7
$ Station <dbl> 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 10…
$ Species <dbl> 2440100, 2440100, 2440100, 2440100, 2440100, 2440100, 2440100,…
$ Year    <dbl> 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 20…
$ doy     <dbl> 135, 142, 144, 149, 153, 156, 160, 163, 166, 170, 173, 177, 18…
$ Flowers <dbl> 0.0, 50.3, 83.4, 111.7, 110.5, 81.5, 39.6, 8.8, 1.0…

After processing data in R, we often want to export it back to a file.
The `readr` package provides matching functions `write_csv()` and `write_tsv()` for writing CSV and TSV files.

In [None]:
# Save as TSV (tab-separated values)
#write_tsv(berry_data, 'berry_data.tsv')

In [None]:
# Check files in the working directory
#dir()
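
A self-contained round trip, as a sketch: write a small made-up data frame `toy` to a temporary TSV file and read it back. The comma inside `"dry, sunny"` needs no special handling because the fields are separated by tabs.

```r
library(readr)

toy <- data.frame(ID = c("101", "102"), note = c("dry, sunny", "wet"))

tsv_path <- tempfile(fileext = ".tsv")   # temporary file, removed at the end
write_tsv(toy, tsv_path)

readLines(tsv_path)                      # header and rows, fields separated by tabs
toy_back <- read_tsv(tsv_path, col_types = cols(.default = col_character()))
toy_back$note                            # "dry, sunny" "wet"

unlink(tsv_path)                         # clean up the temporary file
```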

### Saving and loading R objects (RData)

Saving R tables as text files and reading them back in later can lose information, such as column types or factor levels. If a file is created in R and will later be processed again in R, it is useful to save it as a binary R object, which can be restored exactly as it was. This is done with the functions `save()` and `load()`. `save()` can also store multiple objects at once.

In [None]:
x <- 1
save(x, stations, file = "objects.RData")  # save variable x and the stations data frame into an RData file

In [None]:
load("objects.RData", verbose = TRUE)  # load all objects saved in "objects.RData" and print their names as they are restored

Loading objects:
  x
  stations
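
One concrete case where a text file loses information is a factor with a custom level order. The sketch below compares a CSV round trip with an RData round trip on a made-up data frame `sizes`; it uses the base R functions `write.csv()` and `read.csv()` so that it runs without extra packages, and loads the RData file into a separate environment to avoid overwriting the original object.

```r
sizes <- data.frame(
  site = c("a", "b", "c"),
  size = factor(c("low", "high", "mid"), levels = c("low", "mid", "high"))
)

# Round trip through a text file: the custom level order is not stored in the CSV
csv_path <- tempfile(fileext = ".csv")
write.csv(sizes, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)
identical(sizes$size, from_csv$size)   # FALSE

# Round trip through an RData file: the object comes back unchanged
rdata_path <- tempfile(fileext = ".RData")
save(sizes, file = rdata_path)
restored <- new.env()                  # load into a separate environment
load(rdata_path, envir = restored)
identical(sizes, restored$sizes)       # TRUE

unlink(c(csv_path, rdata_path))        # clean up the temporary files
```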


## Data Wrangling and Data Manipulation  

In real projects, data is **rarely provided in the exact format we need for analysis**.  
Before we can clean or transform anything, the **first step is to explore the dataset** and understand what it contains.  

Exploration helps us answer questions like:  
- How many rows and columns are there?  
- What variables (features) are included?  
- What are the data types (numeric, character, factor…)?  
- Are there missing or unusual values?  

Some useful functions for exploring a data frame (`berry_data` as example):


In [None]:
dim(berry_data) # number of rows and columns

In [None]:
names(berry_data) # column names

In [None]:
str(berry_data) # structure and types

spc_tbl_ [1,889 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Station: num [1:1889] 103 103 103 103 103 103 103 103 103 103 ...
 $ Species: num [1:1889] 2440100 2440100 2440100 2440100 2440100 ...
 $ Year   : num [1:1889] 2006 2006 2006 2006 2006 ...
 $ doy    : num [1:1889] 135 142 144 149 153 156 160 163 166 170 ...
 $ Flowers: num [1:1889] 0 50.3 83.4 111.7 110.5 ...
 $ Unripe : num [1:1889] NA NA NA NA 0 0 39.7 76.3 95 70 ...
 $ Ripe   : num [1:1889] NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   Station = col_double(),
  ..   Species = col_double(),
  ..   Year = col_double(),
  ..   doy = col_double(),
  ..   Flowers = col_double(),
  ..   Unripe = col_double(),
  ..   Ripe = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 


In [None]:
summary(berry_data) # quick statistics

    Station         Species             Year           doy       
 Min.   :102.0   Min.   :2440100   Min.   :2006   Min.   :105.0  
 1st Qu.:102.0   1st Qu.:2440100   1st Qu.:2009   1st Qu.:160.0  
 Median :103.0   Median :2440100   Median :2013   Median :194.0  
 Mean   :103.3   Mean   :2440142   Mean   :2013   Mean   :199.9  
 3rd Qu.:104.0   3rd Qu.:2440200   3rd Qu.:2017   3rd Qu.:238.0  
 Max.   :105.0   Max.   :2440200   Max.   :2020   Max.   :320.0  
                                                  NA's   :1      
    Flowers           Unripe            Ripe       
 Min.   :  0.00   Min.   :  0.00   Min.   : 0.000  
 1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 0.000  
 Median :  0.50   Median :  4.10   Median : 2.000  
 Mean   : 10.96   Mean   : 15.29   Mean   : 7.075  
 3rd Qu.: 10.50   3rd Qu.: 22.20   3rd Qu.:10.175  
 Max.   :134.50   Max.   :120.00   Max.   :69.200  
 NA's   :985      NA's   :622      NA's   :859     

In [None]:
glimpse(berry_data) # tidyverse-friendly overview

Rows: 1,889
Columns: 7
$ Station <dbl> 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 10…
$ Species <dbl> 2440100, 2440100, 2440100, 2440100, 2440100, 2440100, 2440100,…
$ Year    <dbl> 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 20…
$ doy     <dbl> 135, 142, 144, 149, 153, 156, 160, 163, 166, 170, 173, 177, 18…
$ Flowers <dbl> 0.0, 50.3, 83.4, 111.7, 110.5, 81.5, 39.6, 8.8, 1.0…

In [None]:
unique(berry_data$Station) # distinct values in a column

In [None]:
table(berry_data$Station) # frequency counts


102 103 104 105 
581 451 494 363 

In [None]:
head(berry_data)  # first rows

Station,Species,Year,doy,Flowers,Unripe,Ripe
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
103,2440100,2006,135,0.0,,
103,2440100,2006,142,50.3,,
103,2440100,2006,144,83.4,,
103,2440100,2006,149,111.7,,
103,2440100,2006,153,110.5,0.0,
103,2440100,2006,156,81.5,0.0,
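
The question about missing values can also be answered directly by counting `NA`s per column. A small sketch on a made-up data frame `toy`; the same calls should work on `berry_data`.

```r
toy <- data.frame(
  doy     = c(135, 142, NA, 149),
  Flowers = c(0, 50.3, 83.4, NA),
  Ripe    = c(NA, NA, NA, 2)
)

colSums(is.na(toy))        # number of missing values per column: 1 1 3
colMeans(is.na(toy))       # share of missing values per column: 0.25 0.25 0.75
sum(complete.cases(toy))   # rows without any missing value: 0
```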


Once we know what the dataset looks like, the next step is wrangling:
the process of cleaning, structuring, and transforming raw or messy data into a usable format.

Typical wrangling tasks include:
- Handling missing values
- Merging multiple datasets
- Reshaping between wide and long formats
- Converting variables to the right data types

Closely related is data manipulation, which covers operations like:
- Filtering rows
- Sorting observations
- Aggregating values
- Selecting or mutating variables

There are many ways to process data in R. In recent years a very popular approach has been the [`tidyverse`](https://www.tidyverse.org/packages/): a collection of packages that share a common design philosophy and consistent APIs (e.g., `dplyr`, `tidyr`, `readr`, `ggplot2`). You can think of it as a “mini-language” inside R that borrows ideas from `SQL` and `bash`, aiming to make common operations work with clear, uniform principles.

#### The pipe `%>%`
One of the most important tools in the `tidyverse` is the [**pipe** operator `%>%` (from `magrittr`)](https://magrittr.tidyverse.org/reference/pipe.html). It passes the result of one expression as the **first argument** to the next function, letting you write long sequences of operations **left-to-right** in a readable way.

Compare the following approaches:
- without `%>%`
- with `%>%`

In [None]:
prices <- c("$1423.55", "$556.98", "$4321.99", "$657.01")

# Stepwise (many temporary variables):
prices_trim <- str_replace(prices, "\\$", "") # remove the dollar sign from each string
prices_trim_num <- as.numeric(prices_trim) # convert the cleaned strings into numeric values
prices_trim_num_round <- round(prices_trim_num, digits = 0) # round the numeric values to zero decimal places
prices_round_final <- str_c("$", prices_trim_num_round) # add the dollar sign back to the rounded values
prices_round_final

In [None]:
# One-liner (harder to read):
str_c("$", round(as.numeric(str_replace(prices, "\\$", "")), digits = 0))

In [None]:
# With the pipe (left-to-right):
prices %>%
  str_replace("\\$", "") %>% # remove the dollar sign
  as.numeric() %>% # convert strings to numeric
  round() %>% # round values to 0 decimals
  str_c("$", .) # add the dollar sign back

> *Hint:* Note that if the output of the previous function is used in the first argument of the next function call, you can omit it. If the output should go into another argument position, you can use the placeholder `.` (see the last line above).
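
To see the placeholder in isolation, here is a minimal sketch that uses only `magrittr` and base R's `round()`. The vector `values` and the variable `n_digits` are made up for illustration and are not part of the workshop data.

```r
library(magrittr)

values <- c(3.14159, 2.71828)   # made-up numbers for illustration

# The piped value goes into the first argument: same as round(values, 2)
rounded_first <- values %>% round(2)

# The piped value goes into another argument via the placeholder `.`:
# same as round(values, digits = 2)
n_digits <- 2
rounded_placeholder <- n_digits %>% round(values, digits = .)

rounded_first
rounded_placeholder
```

Both calls should give the same result; only the position of the piped value differs.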

### Tidyverse Functions  

The tidyverse is built on a set of functions where **each function takes in a data frame and returns a modified data frame**.  
Each function performs only one operation, but when combined with the `%>%`, very complex tasks can be expressed clearly and concisely.  

The most important tidyverse functions for data manipulation are:  
- `select`: choose columns from a data frame  
- `filter`: filter rows based on conditions  
- `mutate`: create new columns or modify existing ones  
- `group_by` + `summarize`: summarize values within groups defined by one or more variables  
- `arrange`: sort the data frame by one or more variables  

In the following sections, we will look at each function with simple examples, using the dataset `berry_data` that we loaded earlier.  


#### `select()`
The `select()` function allows you to choose columns from a data frame and also rename them in the process. Column names can be provided without quotation marks.

In [None]:
berry_data %>% select(doy, Ripe) # keep only the columns 'doy' (day of year) and 'Ripe' from the dataset

doy,Ripe
<dbl>,<dbl>
135,
142,
144,
149,
153,
156,
160,
163,
166,
170,


In [None]:
berry_data %>% select(-doy, -Ripe)  # select all columns except 'doy' and 'Ripe'

Station,Species,Year,Flowers,Unripe
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
103,2440100,2006,0.0,
103,2440100,2006,50.3,
103,2440100,2006,83.4,
103,2440100,2006,111.7,
103,2440100,2006,110.5,0.0
103,2440100,2006,81.5,0.0
103,2440100,2006,39.6,39.7
103,2440100,2006,8.8,76.3
103,2440100,2006,1.0,95.0
103,2440100,2006,0.4,70.0


In [None]:
berry_data %>% select(day_of_year = doy) # select the column 'doy' and rename it to 'day_of_year'

day_of_year
<dbl>
135
142
144
149
153
156
160
163
166
170


#### `filter()`
The `filter()` function allows you to filter rows by setting logical conditions on columns. The column names in the input data frame are recognized automatically by `filter()`.

In [None]:
berry_data %>% filter(Year > 2019) # keep only the rows where the value in 'Year' is greater than 2019

Station,Species,Year,doy,Flowers,Unripe,Ripe
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
103,2440100,2020,141,0.0,,
103,2440100,2020,148,15.0,0.0,
103,2440100,2020,155,39.2,0.0,
103,2440100,2020,164,3.7,26.7,
103,2440100,2020,170,0.1,21.6,
103,2440100,2020,178,0.0,22.9,
103,2440100,2020,183,0.0,24.1,0.0
103,2440100,2020,189,,26.2,0.0
103,2440100,2020,197,,22.7,2.0
103,2440100,2020,206,,5.3,18.3


In [None]:
berry_data %>% filter(Year == 2012) # keep only the rows where the value in 'Year' is exactly 2012

Station,Species,Year,doy,Flowers,Unripe,Ripe
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
103,2440100,2012,135,0.0,,
103,2440100,2012,142,1.0,0.0,
103,2440100,2012,146,87.5,0.0,
103,2440100,2012,153,74.3,25.4,
103,2440100,2012,159,29.7,75.9,
103,2440100,2012,167,2.3,86.7,
103,2440100,2012,178,0.1,46.2,0.0
103,2440100,2012,190,0.0,50.1,0.0
103,2440100,2012,197,0.0,44.7,0.9
103,2440100,2012,205,,31.5,9.1


In [None]:
berry_data %>% filter(Station %in% c(104, 105)) # keep only the rows where 'Station' is either 104 or 105

Station,Species,Year,doy,Flowers,Unripe,Ripe
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
104,2440100,2006,125,0.0,,
104,2440100,2006,131,6.0,,
104,2440100,2006,139,14.3,,
104,2440100,2006,142,14.3,0.0,
104,2440100,2006,149,7.9,0.0,
104,2440100,2006,151,5.6,7.6,
104,2440100,2006,153,4.6,11.2,
104,2440100,2006,160,1.6,16.8,
104,2440100,2006,163,0.5,13.4,
104,2440100,2006,166,0.2,13.0,


In [None]:
berry_data %>% filter(Station == 105 & Year == 2020) # keep only the rows where 'Station' is 105 AND 'Year' is 2020

Station,Species,Year,doy,Flowers,Unripe,Ripe
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
105,2440100,2020,120,0.0,,
105,2440100,2020,128,20.5,0.0,
105,2440100,2020,132,40.5,3.0,
105,2440100,2020,141,18.8,28.5714,
105,2440100,2020,147,5.1,26.8,
105,2440100,2020,154,2.4,19.3,
105,2440100,2020,163,0.6,17.0,
105,2440100,2020,170,0.0,16.9,
105,2440100,2020,184,,3.9,0.0
105,2440100,2020,191,,1.3,1.2
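
The ways of combining conditions can be compared side by side on a small table. The sketch below assumes `dplyr` is installed and uses a hypothetical data frame `obs` (not `berry_data`) so that the full result fits on screen.

```r
library(dplyr)

# Hypothetical mini data set, not the workshop's berry_data
obs <- data.frame(
  Station = c(102, 103, 104, 105, 105),
  Year    = c(2019, 2020, 2020, 2019, 2020),
  Flowers = c(5.0, NA, 12.5, 0.0, 40.5)
)

# AND: conditions separated by commas behave like `&`
and_comma <- obs %>% filter(Station == 105, Year == 2020)
and_amp   <- obs %>% filter(Station == 105 & Year == 2020)

# OR: use `|` when either condition is enough
either <- obs %>% filter(Station == 105 | Year == 2020)

# Rows where the condition evaluates to NA are dropped, not kept
many_flowers <- obs %>% filter(Flowers > 10)

and_comma
either
many_flowers
```

Note that the row with a missing `Flowers` value disappears from `many_flowers`, because `NA > 10` is `NA` rather than `TRUE`.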


#### `mutate()`
The `mutate()` function allows you to create new columns or modify existing ones, depending on whether the column to which a value is assigned already exists. As with the previous functions, column names can be used directly in expressions inside `mutate()`.

In [None]:
berry_data %>%
  mutate(second_half = doy > 365/2) %>%   # create a new column 'second_half': TRUE if day-of-year > 182.5, FALSE otherwise
  select(doy, second_half)                # keep only 'doy' and the new 'second_half' column

doy,second_half
<dbl>,<lgl>
135,FALSE
142,FALSE
144,FALSE
149,FALSE
153,FALSE
156,FALSE
160,FALSE
163,FALSE
166,FALSE
170,FALSE


Now we generated a new column, but what if we want more descriptive values, like `'first_half'` or `'second_half'` instead of logical values?
That’s where the function `if_else()` is useful.

##### `if_else()`

- Works like a simple **if–then–else** statement.  
- You give it a logical test, a value if the test is `TRUE`, and a value if the test is `FALSE`.  
- It is **vectorized** → meaning it checks the condition for every row in the dataset.

In [None]:
berry_data %>%
  mutate(half = if_else(doy < 365/2, "first_half", "second_half")) %>%
  select(doy, half)

doy,half
<dbl>,<chr>
135,first_half
142,first_half
144,first_half
149,first_half
153,first_half
156,first_half
160,first_half
163,first_half
166,first_half
170,first_half


Now the new column clearly shows `'first_half'` or `'second_half'` instead of `TRUE`/`FALSE`.

But what if we need more than two categories, for example, dividing the year into quarters?
In that case, we use `case_when()`, which allows multiple conditions.

##### `case_when()`

- Useful when you have more than two conditions.
- Each condition is written on the left of `~` and its result on the right.
- The first condition that matches will be applied.

In [None]:
berry_data %>%
  mutate(quarter = case_when(
    doy <= 91 ~ "Q1",           # days 1–91
    doy <= 182 ~ "Q2",          # days 92–182
    doy <= 273 ~ "Q3",          # days 183–273
    TRUE ~ "Q4"                 # all remaining days
  )) %>%
  select(doy, quarter)

doy,quarter
<dbl>,<chr>
135,Q2
142,Q2
144,Q2
149,Q2
153,Q2
156,Q2
160,Q2
163,Q2
166,Q2
170,Q2
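
Because the first matching condition wins, the order of the conditions matters. The following sketch applies `case_when()` directly to a made-up vector `doy` (outside `mutate()`) to show what can go wrong when the widest condition comes first.

```r
library(dplyr)

doy <- c(45, 120, 200, 300)   # made-up days of year

# Narrowest condition first: each day falls into the intended quarter
ordered <- case_when(
  doy <= 91  ~ "Q1",
  doy <= 182 ~ "Q2",
  doy <= 273 ~ "Q3",
  TRUE       ~ "Q4"
)

# Widest condition first: it already matches days 1-273,
# so the "Q2" and "Q1" branches are never reached
misordered <- case_when(
  doy <= 273 ~ "Q3",
  doy <= 182 ~ "Q2",
  doy <= 91  ~ "Q1",
  TRUE       ~ "Q4"
)

ordered
misordered
```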


#### `summarize()`
The `summarize()` function calculates summary statistics from a dataset. Unlike `mutate()`, which keeps every row, it collapses the input to a single row (or to one row per group) that contains only the computed values.

In [None]:
berry_data %>% summarize( mean_flower_count = mean(Flowers)) # calculate mean of 'Flowers' column

mean_flower_count
<dbl>
NA


In [None]:
berry_data %>% summarize(mean_flower_count = mean(Flowers, na.rm = T)) # calculate mean of 'Flowers' while ignoring NA values

mean_flower_count
<dbl>
10.95753


In [None]:
berry_data %>% summarize(mean_flower_count = mean(Flowers, na.rm = T), N = n()) # the function n() inside summarize returns the number of rows in the input table.

mean_flower_count,N
<dbl>,<int>
10.95753,1889


#### `group_by()`
The `group_by()` function enables the so-called “split-apply-combine” strategy: a dataset is split into subsets based on the values of one or more variables, a function is applied to each subset, and the results are then combined. Once you apply `group_by()` to a dataset, any following operations will respect this grouping.

Used together, `group_by()` and `summarize()` let you calculate summary statistics for groups of observations defined by one or more variables. The result keeps the grouping variables and adds the summary statistics defined in the `summarize()` step. You can group by a single variable or by multiple variables at once.

In [None]:
berry_data %>% group_by(Station) %>% summarize(mean_flower_count = mean(Flowers, na.rm = T))

Station,mean_flower_count
<dbl>,<dbl>
102,12.955981
103,14.519676
104,4.528571
105,11.472199


In [None]:
berry_data %>% group_by(Station, Year) %>% summarize(mean_flower_count = mean(Flowers, na.rm = T))

[1m[22m`summarise()` has grouped output by 'Station'. You can override using the
`.groups` argument.


Station,Year,mean_flower_count
<dbl>,<dbl>,<dbl>
102,2006,12.98
102,2007,18.18125
102,2008,6.1529412
102,2009,10.2333333
102,2010,15.7666667
102,2011,9.2571429
102,2012,9.6692308
102,2013,7.5769231
102,2014,11.5733333
102,2015,17.1466667
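
The message printed above is informational: after summarizing by two variables, the result is still grouped by the first one (`Station`). If an ungrouped result is wanted, one option is the `.groups` argument mentioned in the message. The sketch below uses a hypothetical data frame `obs` as a stand-in for `berry_data`.

```r
library(dplyr)

# Hypothetical mini data set standing in for berry_data
obs <- data.frame(
  Station = c(102, 102, 102, 103, 103),
  Year    = c(2019, 2019, 2020, 2019, 2020),
  Flowers = c(4, NA, 8, 10, 20)
)

station_year_means <- obs %>%
  group_by(Station, Year) %>%
  summarize(mean_flower_count = mean(Flowers, na.rm = TRUE), .groups = "drop")

station_year_means
is_grouped_df(station_year_means)   # FALSE: the result is no longer grouped
```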


The function `n()` inside `summarize()` returns the number of rows that correspond to a particular combination of grouping variables. This is very useful for creating frequency tables.

In [None]:
berry_data %>% group_by(Station) %>% summarize(mean_flower_count = mean(Flowers, na.rm = T), N = n())

Station,mean_flower_count,N
<dbl>,<dbl>,<int>
102,12.955981,581
103,14.519676,451
104,4.528571,494
105,11.472199,363
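
For a plain frequency table, `dplyr` also offers `count()`, which groups and counts in one step. A minimal sketch on a hypothetical data frame `obs` (the column name `N` is chosen here only to match the example above):

```r
library(dplyr)

# Hypothetical mini data set standing in for berry_data
obs <- data.frame(Station = c(102, 102, 103, 104, 104, 104))

# Frequency table with group_by() + summarize() + n()
freq_long <- obs %>% group_by(Station) %>% summarize(N = n())

# Shortcut: count() does the grouping and counting in one step
freq_short <- obs %>% count(Station, name = "N")

freq_long
freq_short
```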


When you apply `mutate()` after `group_by()`, the `mutate()` function operates separately within each subset defined by `group_by()`. For example, this allows you to add the group mean to each row, or to assign a sequence number within each group.

In [None]:
berry_data %>%
  select(Station, Flowers) %>%
  group_by(Station) %>%
  mutate(mean_flower_count_per_station =  mean(Flowers, na.rm = T))

Station,Flowers,mean_flower_count_per_station
<dbl>,<dbl>,<dbl>
103,0.0,14.51968
103,50.3,14.51968
103,83.4,14.51968
103,111.7,14.51968
103,110.5,14.51968
103,81.5,14.51968
103,39.6,14.51968
103,8.8,14.51968
103,1.0,14.51968
103,0.4,14.51968


In [None]:
berry_data %>%
  select(Station, Flowers) %>%
  group_by(Station) %>%
  mutate(ID_in_group = 1:n())

Station,Flowers,ID_in_group
<dbl>,<dbl>,<int>
103,0.0,1
103,50.3,2
103,83.4,3
103,111.7,4
103,110.5,5
103,81.5,6
103,39.6,7
103,8.8,8
103,1.0,9
103,0.4,10


#### `arrange()`
The `arrange()` function simply sorts a data frame by the specified variable(s).

In [None]:
berry_data %>% arrange(Flowers) # by default, from smallest to largest.

Station,Species,Year,doy,Flowers,Unripe,Ripe
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
103,2440100,2006,135,0,,
103,2440100,2006,173,0,75.9,
103,2440100,2006,177,0,80.7,
103,2440100,2007,128,0,,
103,2440100,2007,166,0,50.3,
103,2440100,2007,180,0,32.9,0.00000
103,2440100,2007,187,0,31.7,8.80000
103,2440100,2008,123,0,,
103,2440100,2008,165,0,64.0,
103,2440100,2008,169,0,62.8,


In [None]:
berry_data %>% arrange(desc(Flowers)) # change the order (largest to smallest)

Station,Species,Year,doy,Flowers,Unripe,Ripe
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
105,2440200,2014,151,134.5,1.0,
105,2440200,2016,159,134.5,12.5,
102,2440100,2020,161,126.7,0.0,
105,2440200,2018,142,118.5,0.0,
102,2440100,2016,160,115.4,4.0,
103,2440100,2006,149,111.7,,
102,2440100,2016,151,111.2,0.0,
103,2440100,2006,153,110.5,0.0,
105,2440200,2019,156,110.0,,
103,2440100,2009,139,109.5,0.0,


### Joining Data Frames  

In many cases, information is spread across multiple data frames (or tables), and we need to **combine them based on shared keys** (e.g., IDs, station codes, years).  
This process is called a **join**.  

The `dplyr` package provides several join functions, all of which work in a similar way:  
- `inner_join(df1, df2, by = 'key')`: keeps only rows with matching keys in both tables  
- `left_join(df1, df2, by = 'key')`: keeps all rows from `df1` (the left table) and adds matching information from `df2`
- `right_join(df1, df2, by = 'key')`: keeps all rows from `df2` (the right table) and adds matching information from `df1`
- `full_join(df1, df2, by = 'key')`: keeps all rows from both tables, filling in `NA` where no match is found  
- `semi_join(df1, df2, by = 'key')`: keeps only rows from `df1` that have a match in `df2` (like a filter)  
- `anti_join(df1, df2, by = 'key')`: keeps only rows from `df1` that **do not** have a match in `df2`.


![Joins.png](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/joins.jpg)
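
The differences between the join types are easiest to see on two tiny tables. In the sketch below, `sites` and `counts` are hypothetical data frames that deliberately share the key name `id`; they are not the `stations` and `berry_data` tables used in the task.

```r
library(dplyr)

# Two hypothetical tables sharing the key column `id`
sites  <- data.frame(id = c(1, 2, 3), name = c("North", "Centre", "South"))
counts <- data.frame(id = c(2, 3, 3, 4), berries = c(10, 5, 7, 2))

inner <- inner_join(sites, counts, by = "id")  # ids 2 and 3 only
left  <- left_join(sites, counts, by = "id")   # all sites; id 1 gets NA berries
full  <- full_join(sites, counts, by = "id")   # ids 1 to 4; NA where no match
semi  <- semi_join(sites, counts, by = "id")   # sites with at least one count
anti  <- anti_join(sites, counts, by = "id")   # sites without any count

nrow(inner)
nrow(left)
nrow(full)
semi
anti
```

Site 3 appears twice in `inner`, `left`, and `full` because it matches two rows in `counts`, whereas `semi_join()` only filters `sites` and never duplicates rows.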


### <font color='gold'>Task 7</font>  

You will work with the two data frames: `stations` and `berry_data`.  

1. **Combine the datasets**  
   Use `inner_join()` to combine them into a new data frame called `stations_berries`, based on the station ID.  
   > *Hint:* Check the column names in both data frames. How can you join data frames when the key variable has different names?
   >
   > Remember that you can change the data type of a variable using functions such as `as.numeric()`, `as.character()`, `as.factor()`, etc.  

2. **Identify missing measurements**  
   Use appropriate join commands to find out which stations do **not** have berry measurements.  
   *Reflect: does it make sense that some stations have no berry data? Why might that be the case?*

3. **Analyze berry observations**  
   Using `tidyverse` commands, answer the following:  
   - Which station had the **highest average count of ripe berries per 0.25 m²** in the year **2020**?  
   - For each station, what was the **earliest day of year (doy) per year** when the **average number of ripe berries per 0.25 m²** was greater than the **average number of unripe berries per 0.25 m²**?  


In [None]:
# YOUR CODE HERE!

## Iterations

A loop is a programming construct that repeats a certain action multiple times in sequence. It is useful, for example, when we want to go through the elements of a vector one by one and perform some action depending on the value of the element. In R you’ll mainly use two loop constructs: **`while`** and **`for`**.


### `while` loops

A `while` loop is repeated as long as a certain condition is satisfied. The so-called stopping condition must be defined when setting up the loop. It is important to be careful when writing the loop to ensure it does not repeat indefinitely. A while loop is useful when implementing an algorithm whose termination depends on convergence. To reduce the risk of infinite repetition, it is often sensible to include a loop counter as part of the stopping conditions in the loop header.
![while_loop.jpg](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/while_loop.jpg)

In [11]:
# Sum daily CO₂ measurements until a threshold is reached
co2_readings <- c(392, 401, 407, 415, 420, 425)
i <- 1
total <- 0

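while (i <= length(co2_readings) && total < 1200) {  # `<=` so the last reading can be used; `&&` for single TRUE/FALSE values
    total <- total + co2_readings[i]
    i <- i + 1
}

total  # cumulative sum
i - 1  # how many readings were used (i already points to the next reading)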
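
As an illustration of a convergence-based stopping condition combined with a loop counter, the sketch below approximates a square root with the Babylonian (Heron's) method. The variable names, the tolerance, and the iteration cap are assumptions chosen for this example.

```r
# Approximate sqrt(2) with the Babylonian (Heron's) method
target <- 2
estimate <- 1          # starting guess
tolerance <- 1e-8      # stop when an update changes the estimate by less than this
max_iter <- 50         # safety cap on the number of iterations
iter <- 0
change <- Inf

while (change > tolerance && iter < max_iter) {
  new_estimate <- (estimate + target / estimate) / 2
  change <- abs(new_estimate - estimate)
  estimate <- new_estimate
  iter <- iter + 1
}

estimate   # close to sqrt(2)
iter       # number of iterations actually needed
```

The counter `iter` guarantees that the loop ends even if the estimate were never to settle within the tolerance.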


#### Infinite loops (and how they happen)
If the condition never becomes `FALSE`, the loop never ends.

In [13]:
# Example of an infinite loop (bug), kept commented out so it cannot run by accident
#i <- 0
#while (i < 5) {
    #print(i)
    # the increment `i <- i + 1` is missing here, so `i < 5` stays TRUE forever
#}

### `for` loops

A `for` loop performs a predetermined number of steps. When defining the loop, you specify a vector whose elements will be traversed one by one. Most often, this vector consists of consecutive integers, such as the integers `1:10`.
![for_loop.jpg](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/for_loop.jpg)

In [20]:
# Days of year: first six days
for (doy in 1:6) {    # 1..6
  print(doy)
}


[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6


In [14]:
# Print each station name
stations <- c("Umeå", "Vindeln", "Siljansfors", "Asa")
for (name in stations) {
  print(name)
}


[1] "Umeå"
[1] "Vindeln"
[1] "Siljansfors"
[1] "Asa"


In [21]:
# Simple growing degree-day (GDD) accumulator
base <- 10
daily_tmean <- c(8, 12, 15, 9, 14, 11)  # °C
gdd <- 0

for (t in daily_tmean) {
  gdd <- gdd + max(0, t - base)
}

gdd
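
Many accumulation loops like this one can also be written without an explicit loop, because most R functions work on whole vectors. A possible vectorized sketch of the same calculation, using base R's `pmax()` and `sum()` (the name `gdd_vectorized` is introduced here for illustration):

```r
base <- 10
daily_tmean <- c(8, 12, 15, 9, 14, 11)  # °C, same values as in the loop above

# pmax() compares element by element, so no explicit loop is needed
gdd_vectorized <- sum(pmax(0, daily_tmean - base))
gdd_vectorized
```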

### Nested Loops
Sometimes we need a loop inside another loop - this is called a nested loop.
Nested loops are useful when:
- You want to process **two dimensions of data** (e.g., rows × columns in a table or grid).
- You need to **compare every element of one collection with every element of another**.
- You are **working with spatial or temporal data where multiple variables interact**.

In [25]:
for (i in 1:4) {
  for (j in 1:3) {
    print(i+j)
  }
}

[1] 2
[1] 3
[1] 4
[1] 3
[1] 4
[1] 5
[1] 4
[1] 5
[1] 6
[1] 5
[1] 6
[1] 7


In [26]:
# Monitoring daily rainfall across multiple stations

stations <- c("Umeå", "Vindeln", "Siljansfors")
rainfall_data <- list(
  c(5.2, 0.0, 1.3),   # day 1
  c(0.0, 0.0, 0.0),   # day 2
  c(12.1, 3.4, 0.0)   # day 3
)

# Outer loop: iterate over days
for (day in seq_along(rainfall_data)) {
  daily_measurements <- rainfall_data[[day]]

  # Inner loop: pair station names with rainfall values
  for (i in seq_along(stations)) {
    cat("Day", day, "at", stations[i], ":", daily_measurements[i], "mm\n")
  }
}


Day 1 at Umeå : 5.2 mm
Day 1 at Vindeln : 0 mm
Day 1 at Siljansfors : 1.3 mm
Day 2 at Umeå : 0 mm
Day 2 at Vindeln : 0 mm
Day 2 at Siljansfors : 0 mm
Day 3 at Umeå : 12.1 mm
Day 3 at Vindeln : 3.4 mm
Day 3 at Siljansfors : 0 mm


> *Hints*:
> - `seq_along(rainfall_data)` → generates a sequence of indices `(1, 2, 3)` for the list of daily rainfall data.
> - `rainfall_data[[day]]` → extracts the day-th element from the list (double brackets `[[ ]]` are used to pull out the actual vector, not a sublist).
> - `seq_along(stations)` → iterates through indices of the stations.
> - `cat()` → concatenates and prints text in R.
>
> The outer loop goes over days, and the inner loop goes over stations within each day — exactly the same logic as in the Python version.

### Controlling loop flow: `break` and `next`
- `break` → exit the loop immediately.
- `next` → skip the rest of the current iteration and move to the next one.

In [22]:
# BREAK: stop scanning once PM2.5 exceeds a health threshold
pm25 <- c(8, 11, 4, 39, 51, 22, 10)
threshold <- 15
for (value in pm25) {
  if (value > threshold) {
    print(paste("Alert: unhealthy air quality!", value))
    break
  }
}

[1] "Alert: unhealthy air quality! 39"


In [24]:
# NEXT: skip missing temperature readings (NaN) when averaging
temps <- c(12.5, NaN, 13.2, 11.8, NaN, 12.9)
count <- 0
total <- 0.0

for (t in temps) {
  if (is.nan(t)) {
    next
  }
  total <- total + t
  count <- count + 1
}

if (count > 0) {
  avg <- total / count
} else {
  avg <- NA
}

avg
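
One detail worth knowing: `is.nan()` is `TRUE` only for `NaN`, whereas `is.na()` is `TRUE` for both `NaN` and `NA`. If a vector may contain either, testing with `is.na()` is the safer choice. A small sketch with made-up `readings`:

```r
readings <- c(12.5, NaN, 13.2, NA, 12.9)   # made-up values with both NaN and NA

is.nan(readings)   # TRUE only for NaN
is.na(readings)    # TRUE for both NaN and NA

# A loop that skips with is.na() ignores both kinds of missing value
count <- 0
total <- 0
for (r in readings) {
  if (is.na(r)) next
  total <- total + r
  count <- count + 1
}
avg_loop <- total / count

# Base R shortcut that should give the same result
avg_builtin <- mean(readings, na.rm = TRUE)

avg_loop
avg_builtin
```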

## Conditional Statements
When we learned about **controlling loops**, we already saw how the keyword `if` can decide whether a block of code should run.
Conditional statements are one of the most important building blocks of programming. They let us make decisions in code:
- Run certain parts of code only if conditions are `TRUE`.
- Skip or choose alternatives if they’re `FALSE`.

This enables us to write programs that adapt to different situations, much like we do in real-world decision-making.

### `if` Statement

The simplest form is the `if` statement.
It checks whether a condition is `TRUE` and, if so, executes the block of code inside it.
![if_condition.jpg](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/if_condition.jpg)

In [28]:
# Check if temperature is above freezing
temperature <- -3

if (temperature > 0) {
  print("Water is liquid.")
}



> Nothing is printed here, since the condition was `FALSE`.\
> If the temperature had been above 0, the program would have printed `"Water is liquid."`.

If the body of an `if` block consists of a single statement, the curly braces can be omitted.

In [34]:
temperature <- 10

# One-line if without braces
if (temperature > 0) print("Water is liquid.")

[1] "Water is liquid."


It is also possible to place multiple commands on a single line, separated by semicolons. However, from the perspective of code readability, this is generally not recommended.

In [36]:
# Multiple statements on one line with semicolons
temp_c <- -2; state <- "unknown"; if (temp_c <= 0) state <- "freezing"; cat("State:", state, "\n")

State: freezing 


### `if`/`else`
We can add an `else` block to specify what should happen when the condition is `FALSE`.
![if_and_else_condition.jpg](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/if_and_else_condition.jpg)

In [31]:
# Are trees actively photosynthesizing?
sunlight <- FALSE

if (sunlight) {
  print("Photosynthesis is happening.")
} else {
  print("Trees are not photosynthesizing right now.")
}


[1] "Trees are not photosynthesizing right now."


> Earlier, when wrangling data, we already used `if_else()` from `dplyr`.
> - `if_else()` is a vectorized version of `if-else`, meaning it checks the condition for each element in a column and returns a new column.
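> - Base R also provides a similar function called `ifelse()`. It also checks a vector element by element, but it is less strict about types: it silently coerces mixed results to a common type, whereas `if_else()` raises an error when the `TRUE` and `FALSE` values have incompatible types.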
> - In contrast, the base R `if`/`else` statement works on a single logical condition at a time.
> So:
> - Use `if_else()` (`tidyverse`) or `ifelse()` (base R) inside `mutate()` or whenever you need vectorized column transformations.
> - Use plain `if`/`else` when you need control flow in scripts, not element-wise operations.

In [37]:
precip <- c(0, 0.3, 5.1, NA, 2.2, 0)
label <- ifelse(is.na(precip), "missing",
                ifelse(precip == 0, "dry", "wet"))
label
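
The difference in type handling between the two functions can be sketched as follows. The vector `is_wet` is made up, and the error text returned by the `tryCatch()` handler is a placeholder written for this example, not the actual `dplyr` message.

```r
library(dplyr)

is_wet <- c(TRUE, FALSE, NA)

# Base R ifelse() silently coerces mixed types to a common one (here: character)
base_result <- ifelse(is_wet, 1, "dry")
base_result

# dplyr's if_else() refuses to mix a number with a string and stops with an error
strict_result <- tryCatch(
  if_else(is_wet, 1, "dry"),
  error = function(e) "error: incompatible types"
)
strict_result

# With matching types, if_else() can also label missing conditions explicitly
labelled <- if_else(is_wet, "wet", "dry", missing = "unknown")
labelled
```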


### `else if`

You can also check multiple conditions sequentially using `else if`.
![else_if_condition.jpg](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/else_if_condition.jpg)

In [32]:
aqi <- 135

if (aqi <= 50) {
  print("Air quality is good.")
} else if (aqi <= 100) {
  print("Air quality is moderate.")
} else if (aqi <= 150) {
  print("Air quality is unhealthy for sensitive groups.")
} else {
  print("Air quality is unhealthy for everyone.")
}


[1] "Air quality is unhealthy for sensitive groups."


> R evaluates the conditions in order and executes only the first one that is `TRUE`.

### Nested `if` Statements

Sometimes, one decision depends on another. In that case, you can nest `if` statements.

In [33]:
rainfall_mm <- 8

if (rainfall_mm > 0) {
  print("It rained today.")

  if (rainfall_mm >= 50) {
    print("Soil moisture is sufficient for crops.")
  } else {
    print("Rainfall was too low, irrigation might be needed.")
  }
} else {
  print("No rainfall today.")
}


[1] "It rained today."
[1] "Rainfall was too low, irrigation might be needed."


> Be careful: nesting too many `if` statements can make your code hard to read and maintain.

In addition, the function `switch()` can be very useful. Its first argument is a single value (one value only, not a vector). When that value is an integer, the remaining arguments are the results for 1, 2, 3, ... in that order.

In [38]:
# Map season code → name
season_code <- 3L  # 1: Winter, 2: Spring, 3: Summer, 4: Autumn
season_name <- switch(season_code,
  "Winter",   # 1
  "Spring",   # 2
  "Summer",   # 3
  "Autumn"    # 4
)
season_name
# [1] "Summer"


In [39]:
# Map quality-control level → action
qc_level <- 2L
action <- switch(qc_level,
  "use_as_is",         # 1 = high quality
  "use_with_caution",  # 2 = medium quality
  "exclude"            # 3 = low quality
)
action
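
`switch()` also accepts a single character string, in which case the value is matched against the argument names and a final unnamed argument serves as the default. The function `describe_station()` and its codes below are hypothetical and serve only to illustrate this behaviour.

```r
# Character input: the value is matched against argument names;
# the final unnamed argument acts as the default
describe_station <- function(code) {
  switch(code,
    "N" = "northern site",
    "S" = "southern site",
    "unknown site"
  )
}

describe_station("N")
describe_station("X")

# Integer input outside the listed range returns NULL (invisibly)
out_of_range <- switch(5L, "use_as_is", "use_with_caution", "exclude")
is.null(out_of_range)
```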

### <font color='gold'>Task 8</font>

Using the dataset `berry_data.csv`, your goal is to calculate the maximum number of ripe berries per 0.25 m² for each station across all years.

Steps to guide you:
- Load the dataset into an object called `berry_data`.
- Identify the unique stations in the dataset.
- Use a loop to go through each station.
- Inside the loop, use an `if`–`else` statement to check whether the station has ripe berry data.
- If data exists, calculate the maximum number of ripe berries for that station.
- If no data exists, store a value like `NA`.
- Collect the results into a named vector or a data frame, where:
  - names (or column names) = station IDs
  - values = maximum ripe berry counts.

> *Hint*: use `unique()` to get the list of stations.


In [None]:
# YOUR CODE HERE!