# Introduction to `tidyverse` Round 2
The `tidyverse` is an ["opinionated" collection of R packages](https://tidyverse.org) for data science, built on a coherent underlying design philosophy.
These packages are meant to help you with two essential processes:
1. **Data clean-up and organization**: Structure should be intuitive, so that it's easy to model, manipulate, and think about the data
2. **Data plotting**: The grammar of graphics (Week 13)

In [60]:
library(tidyverse)

## Data preparation

### Loading the data

In [78]:
data <- read_csv("data/148338_220209_095045_M057814.csv", skip=2)

Rows: 241 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (21): type, stim1, stim2, stimPos, stimFormat, feedbackIncorrect, head, ...
dbl (11): rowNo, ITI, feedbackTime, random, ITI_ms, ITI_f, ITI_fDuration, ti...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [79]:
education_level <- data %>% pull(response) %>% first()  

data <- data %>%
    # Keep only useful columns
    select(c(rowNo, type, stim1, stim2, stimPos, trialType, response, RT)) %>%
    
    # Keep only useful rows
    filter(type != "form") %>%
    
    # Add demographic and trial-number info, turn trial type to factor
    mutate(
        education_level = education_level, # Add info
        trial_number = row_number(),
        trialType = factor(trialType, levels = c("incongruent", "congruent"))
    ) %>%
    
    # Rename trialType to trial_type
    rename(trial_type = trialType)

head(data)

rowNo,type,stim1,stim2,stimPos,trial_type,response,RT,education_level,trial_number
<dbl>,<chr>,<chr>,<chr>,<chr>,<fct>,<chr>,<dbl>,<chr>,<int>
190,test,croissantLarge,breadSmall,-341 0; 341 0,incongruent,j,729,College or Technical School,1
125,test,lampSmall,shelfLarge,-235 0; 235 0,congruent,f,625,College or Technical School,2
200,test,schoolbusLarge,bikeSmall,-370 0; 370 0,congruent,j,1477,College or Technical School,3
152,test,orangeLarge,raspberrySmall,-307 0; 307 0,congruent,j,793,College or Technical School,4
110,test,hammerLarge,ladderSmall,-156 0; 156 0,incongruent,j,900,College or Technical School,5
215,test,wheelSmall,coinLarge,-300 0; 300 0,incongruent,f,699,College or Technical School,6


### Tidying the data

Our data is almost tidy, but we still need to do several things.

#### Exercise 1
There are two columns that are uninformative. Remove them.

In [63]:
data <- data %>% 
  select(-rowNo, -type) # type (always "test") and rowNo seem uninformative

head(data)

stim1,stim2,stimPos,trial_type,response,RT,education_level,trial_number
<chr>,<chr>,<chr>,<fct>,<chr>,<dbl>,<chr>,<int>
croissantLarge,breadSmall,-341 0; 341 0,incongruent,j,729,College or Technical School,1
lampSmall,shelfLarge,-235 0; 235 0,congruent,f,625,College or Technical School,2
schoolbusLarge,bikeSmall,-370 0; 370 0,congruent,j,1477,College or Technical School,3
orangeLarge,raspberrySmall,-307 0; 307 0,congruent,j,793,College or Technical School,4
hammerLarge,ladderSmall,-156 0; 156 0,incongruent,j,900,College or Technical School,5
wheelSmall,coinLarge,-300 0; 300 0,incongruent,f,699,College or Technical School,6


#### Exercise 2
The columns `stim1` and `stim2` refer to the stimuli presented on the left and right, respectively. Rename the columns to make them more informative.

In [64]:
data <- data %>%
  rename(left_stim = stim1, right_stim = stim2)

head(data)

left_stim,right_stim,stimPos,trial_type,response,RT,education_level,trial_number
<chr>,<chr>,<chr>,<fct>,<chr>,<dbl>,<chr>,<int>
croissantLarge,breadSmall,-341 0; 341 0,incongruent,j,729,College or Technical School,1
lampSmall,shelfLarge,-235 0; 235 0,congruent,f,625,College or Technical School,2
schoolbusLarge,bikeSmall,-370 0; 370 0,congruent,j,1477,College or Technical School,3
orangeLarge,raspberrySmall,-307 0; 307 0,congruent,j,793,College or Technical School,4
hammerLarge,ladderSmall,-156 0; 156 0,incongruent,j,900,College or Technical School,5
wheelSmall,coinLarge,-300 0; 300 0,incongruent,f,699,College or Technical School,6


#### Exercise 3
Add a new column called `subject_id` and set this participant to 1.

In [65]:
data <- data %>%
  mutate(subject_id = 1)

head(data)

left_stim,right_stim,stimPos,trial_type,response,RT,education_level,trial_number,subject_id
<chr>,<chr>,<chr>,<fct>,<chr>,<dbl>,<chr>,<int>,<dbl>
croissantLarge,breadSmall,-341 0; 341 0,incongruent,j,729,College or Technical School,1,1
lampSmall,shelfLarge,-235 0; 235 0,congruent,f,625,College or Technical School,2,1
schoolbusLarge,bikeSmall,-370 0; 370 0,congruent,j,1477,College or Technical School,3,1
orangeLarge,raspberrySmall,-307 0; 307 0,congruent,j,793,College or Technical School,4,1
hammerLarge,ladderSmall,-156 0; 156 0,incongruent,j,900,College or Technical School,5,1
wheelSmall,coinLarge,-300 0; 300 0,incongruent,f,699,College or Technical School,6,1


#### Exercise 4
Add a new column called `correct_side`. This column should encode the side of the correct answer (equivalently, the side of the smaller image). Use `str_detect(string, pattern)` to compute where the smaller image was on the screen and use a conditional `mutate` to fill in the values of `correct_side`.

In [66]:
data <- data %>%
  mutate(
    correct_side = case_when(
      str_detect(left_stim,  "Small") ~ "left",
      str_detect(right_stim, "Small") ~ "right"
    )
  )

head(data)

left_stim,right_stim,stimPos,trial_type,response,RT,education_level,trial_number,subject_id,correct_side
<chr>,<chr>,<chr>,<fct>,<chr>,<dbl>,<chr>,<int>,<dbl>,<chr>
croissantLarge,breadSmall,-341 0; 341 0,incongruent,j,729,College or Technical School,1,1,right
lampSmall,shelfLarge,-235 0; 235 0,congruent,f,625,College or Technical School,2,1,left
schoolbusLarge,bikeSmall,-370 0; 370 0,congruent,j,1477,College or Technical School,3,1,right
orangeLarge,raspberrySmall,-307 0; 307 0,congruent,j,793,College or Technical School,4,1,right
hammerLarge,ladderSmall,-156 0; 156 0,incongruent,j,900,College or Technical School,5,1,right
wheelSmall,coinLarge,-300 0; 300 0,incongruent,f,699,College or Technical School,6,1,left


#### Exercise 5
Add a new column called `correct_key`. This column should be equal to `f` if the smaller image was on the left, and to `j` if the smaller image was on the right.

In [67]:
data <- data %>%
  mutate(
    correct_key = case_when(
      correct_side == "left"  ~ "f",
      correct_side == "right" ~ "j"
    )
  )

head(data)

left_stim,right_stim,stimPos,trial_type,response,RT,education_level,trial_number,subject_id,correct_side,correct_key
<chr>,<chr>,<chr>,<fct>,<chr>,<dbl>,<chr>,<int>,<dbl>,<chr>,<chr>
croissantLarge,breadSmall,-341 0; 341 0,incongruent,j,729,College or Technical School,1,1,right,j
lampSmall,shelfLarge,-235 0; 235 0,congruent,f,625,College or Technical School,2,1,left,f
schoolbusLarge,bikeSmall,-370 0; 370 0,congruent,j,1477,College or Technical School,3,1,right,j
orangeLarge,raspberrySmall,-307 0; 307 0,congruent,j,793,College or Technical School,4,1,right,j
hammerLarge,ladderSmall,-156 0; 156 0,incongruent,j,900,College or Technical School,5,1,right,j
wheelSmall,coinLarge,-300 0; 300 0,incongruent,f,699,College or Technical School,6,1,left,f


#### Exercise 6
Add a new column called `correct`. This column should be equal to 1 if the participant provided a correct response, 0 otherwise.

In [68]:
data <- data %>%
  mutate(correct = as.integer(response == correct_key))

head(data)

left_stim,right_stim,stimPos,trial_type,response,RT,education_level,trial_number,subject_id,correct_side,correct_key,correct
<chr>,<chr>,<chr>,<fct>,<chr>,<dbl>,<chr>,<int>,<dbl>,<chr>,<chr>,<int>
croissantLarge,breadSmall,-341 0; 341 0,incongruent,j,729,College or Technical School,1,1,right,j,1
lampSmall,shelfLarge,-235 0; 235 0,congruent,f,625,College or Technical School,2,1,left,f,1
schoolbusLarge,bikeSmall,-370 0; 370 0,congruent,j,1477,College or Technical School,3,1,right,j,1
orangeLarge,raspberrySmall,-307 0; 307 0,congruent,j,793,College or Technical School,4,1,right,j,1
hammerLarge,ladderSmall,-156 0; 156 0,incongruent,j,900,College or Technical School,5,1,right,j,1
wheelSmall,coinLarge,-300 0; 300 0,incongruent,f,699,College or Technical School,6,1,left,f,1


#### Exercise 7
The experiment had 240 trials, equally divided into two blocks. Add a new column called `trial_block` that encodes this.

In [69]:
data <- data %>%
  mutate(trial_block = if_else(trial_number <= 120, 1, 2)) # 2 blocks for 240 trials so the first 120 are block 1 and the second 120 are block 2

#### Exercise 8
Recode `trial_number` so that it codes the trial number within a block. Instead of going from 1 to 240, it should go from 1 to 120 twice.

In [70]:
data <- data %>%
  mutate(trial_number_block = trial_number %% 120,
         trial_number_block = if_else(trial_number_block == 0, 120, trial_number_block))
head(data)

left_stim,right_stim,stimPos,trial_type,response,RT,education_level,trial_number,subject_id,correct_side,correct_key,correct,trial_block,trial_number_block
<chr>,<chr>,<chr>,<fct>,<chr>,<dbl>,<chr>,<int>,<dbl>,<chr>,<chr>,<int>,<dbl>,<dbl>
croissantLarge,breadSmall,-341 0; 341 0,incongruent,j,729,College or Technical School,1,1,right,j,1,1,1
lampSmall,shelfLarge,-235 0; 235 0,congruent,f,625,College or Technical School,2,1,left,f,1,1,2
schoolbusLarge,bikeSmall,-370 0; 370 0,congruent,j,1477,College or Technical School,3,1,right,j,1,1,3
orangeLarge,raspberrySmall,-307 0; 307 0,congruent,j,793,College or Technical School,4,1,right,j,1,1,4
hammerLarge,ladderSmall,-156 0; 156 0,incongruent,j,900,College or Technical School,5,1,right,j,1,1,5
wheelSmall,coinLarge,-300 0; 300 0,incongruent,f,699,College or Technical School,6,1,left,f,1,1,6
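
As an alternative to `if_else`, the block number and the within-block trial number can both be derived with integer arithmetic. The sketch below uses a made-up tibble of 6 trials and a hypothetical block size of 3; for the real data the block size would be 120.

```R
library(dplyr)

# Toy example: 6 trials split into 2 blocks of 3 (hypothetical sizes)
block_size <- 3

toy <- tibble(trial_number = 1:6) %>%
  mutate(
    # Integer division gives a zero-based block index, so add 1
    trial_block = (trial_number - 1) %/% block_size + 1,
    # The remainder gives a zero-based position within the block, so add 1
    trial_number_block = (trial_number - 1) %% block_size + 1
  )

toy
```

Subtracting 1 before dividing avoids the special case for the last trial of each block, where `trial_number %% block_size` would be 0.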


#### Exercise 9
It is good practice to use a consistent naming style throughout your script. One such style is **snake case**, the convention in Python, which uses only lowercase letters and underscores: `variable_name`. Another common style is **camel case**, the convention in JavaScript, which uses capital letters to mark the beginning of each new word: `variableName`. Our tibble currently mixes both styles, so convert all variable names to snake case. You can use `colnames` to see the vector of column names.

In [71]:
data <- data %>%
  # Only stimPos and RT are not yet in snake case
  rename(stim_pos = stimPos, rt = RT)

colnames(data)

#### Exercise 10
Using `select`, reorder the columns such that participant information comes first, followed by trial block and number, followed by trial info.

In [72]:
data <- data %>%
  select(
    # participant information
    subject_id, education_level,
    # + trials + the rest
    trial_block, trial_number, everything()
  )

head(data)

subject_id,education_level,trial_block,trial_number,left_stim,right_stim,stim_pos,trial_type,response,rt,correct_side,correct_key,correct,trial_number_block
<dbl>,<chr>,<dbl>,<int>,<chr>,<chr>,<chr>,<fct>,<chr>,<dbl>,<chr>,<chr>,<int>,<dbl>
1,College or Technical School,1,1,croissantLarge,breadSmall,-341 0; 341 0,incongruent,j,729,right,j,1,1
1,College or Technical School,1,2,lampSmall,shelfLarge,-235 0; 235 0,congruent,f,625,left,f,1,2
1,College or Technical School,1,3,schoolbusLarge,bikeSmall,-370 0; 370 0,congruent,j,1477,right,j,1,3
1,College or Technical School,1,4,orangeLarge,raspberrySmall,-307 0; 307 0,congruent,j,793,right,j,1,4
1,College or Technical School,1,5,hammerLarge,ladderSmall,-156 0; 156 0,incongruent,j,900,right,j,1,5
1,College or Technical School,1,6,wheelSmall,coinLarge,-300 0; 300 0,incongruent,f,699,left,f,1,6


#### Exercise 11
Concatenate all the commands in Exercises 1–10 into a single cell and store the output in a variable called `tidy_data`.

In [None]:
tidy_data <- read_csv("data/148338_220209_095045_M057814.csv", skip = 2) %>%

  # Data preparation: store the education level (first response),
  # then keep only the useful rows and columns (Exercise 1)
  mutate(education_level = first(response)) %>%
  filter(type != "form") %>%
  select(stim1, stim2, stimPos, trialType, response, RT, education_level) %>%

  # Exercises 2 and 9: informative, snake-case column names
  rename(
    left_stim = stim1, right_stim = stim2,
    stim_pos = stimPos, trial_type = trialType, rt = RT
  ) %>%

  mutate(
    # Trial type as a factor, plus participant and trial identifiers (Exercise 3)
    trial_type   = factor(trial_type, levels = c("incongruent", "congruent")),
    subject_id   = 1,
    trial_number = row_number(),

    # Exercises 4-6: side and key of the correct answer, and response accuracy
    correct_side = case_when(
      str_detect(left_stim,  "Small") ~ "left",
      str_detect(right_stim, "Small") ~ "right",
      TRUE ~ NA_character_
    ),
    correct_key = case_when(
      correct_side == "left"  ~ "f",
      correct_side == "right" ~ "j"
    ),
    correct = as.integer(response == correct_key),

    # Exercises 7-8: block number and trial number within the block
    trial_block = if_else(trial_number <= 120, 1, 2),
    trial_number_block = trial_number %% 120,
    trial_number_block = if_else(trial_number_block == 0, 120, trial_number_block)
  ) %>%

  # Exercise 10: participant info first, then block and trial number, then the rest
  select(subject_id, education_level, trial_block, trial_number, everything())

head(tidy_data)



### Summary statistics

In [None]:
tidy_data %>%
    summarize(rt = mean(rt), accuracy = mean(correct), error = mean(1 - correct))

In [None]:
tidy_data %>%
    mutate(avg_error = mean(1 - correct))

#### Exercise 12
Using the `.by` argument in the call to `summarize`, find out if our participant showed the predicted size-Stroop effect in reaction times and error rates.
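
To illustrate how `.by` works, here is a minimal sketch on made-up trials (not the real data). It assumes dplyr 1.1.0 or later, where `.by` is available, and the column names `trial_type`, `rt`, and `correct` are chosen to mirror the tidy tibble.

```R
library(dplyr)

# Made-up trials, for illustration only
toy <- tibble(
  trial_type = c("congruent", "congruent", "incongruent", "incongruent"),
  rt         = c(600, 700, 800, 900),
  correct    = c(1, 1, 1, 0)
)

# One summary row per trial type
toy_summary <- toy %>%
  summarize(
    rt    = mean(rt),
    error = mean(1 - correct),
    .by   = trial_type
  )

toy_summary
```

Unlike `group_by()`, `.by` groups only for this one call, so the result is an ordinary ungrouped tibble.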

#### Exercise 13
Using a similar logic, find out if our participant also showed a SNARC effect: Was the participant faster when the small image was on the left?

#### Exercise 14
Find out if the SNARC effect depends on trial type.

## From a single participant to a full dataset
We first list all 12 CSV files in the `data` folder, then read each one with `read_csv` and stack the results using `map_dfr`. The `.id` argument adds a column that records which file each row came from. As long as there is one CSV file per participant, this column serves as a subject ID.

In [None]:
# Fetch all the files in the 'data' folder that end in .csv
raw_data <- list.files(path = 'data', pattern = "\\.csv$", full.names = TRUE) %>%

  # Map the read_csv function to all of them, skipping the first 2 rows and creating a new id column called 'id' so that each file gets its own id
  # col_types = cols() just makes explicit that you want tidyverse to do its best to guess the type of each column (string, numeric, etc.)
  map_dfr(read_csv, col_types = cols(), skip = 2, .id = 'id') 
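
In purrr 1.0.0 and later, `map_dfr` is marked as superseded in favor of `map()` followed by `list_rbind()`. The sketch below shows that variant on two tiny made-up CSV files written to a temporary folder; the folder name, file names, and values are all hypothetical.

```R
library(dplyr)
library(purrr)
library(readr)

# Write two tiny made-up CSV files to a temporary folder
tmp_dir <- file.path(tempdir(), "toy_csv")
dir.create(tmp_dir, showWarnings = FALSE)
write_csv(data.frame(response = c("f", "j"), RT = c(600, 700)), file.path(tmp_dir, "a.csv"))
write_csv(data.frame(response = c("j", "j"), RT = c(650, 720)), file.path(tmp_dir, "b.csv"))

files <- list.files(path = tmp_dir, pattern = "\\.csv$", full.names = TRUE)

toy_raw <- files %>%
  # Name each element after its file so the id column is informative
  set_names(basename) %>%
  map(read_csv, col_types = cols()) %>%
  # Stack the tibbles and store the element names in a column called 'id'
  list_rbind(names_to = "id")

toy_raw
```

Naming the list elements first means `id` holds the file name rather than a running index, which makes it easier to trace a row back to its source file.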

#### Exercise 15
Tidy the dataset exactly as we did for subject 1, while keeping the education-level information for each subject. Store it as `full_data`.

Hint: **group** the tibble before calling `first(response)`, then follow the same steps as before to obtain a tidy dataset. If you group with `group_by()`, use `ungroup()` afterwards to return the tibble to the ungrouped state. In fact, most of the exercises that follow require careful use of grouping.

```R
full_data <- raw_data %>% 
    mutate(education_level = first(response), .by = id) %>%
    ...
```
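
The two ways of grouping can be compared on a small made-up tibble. In this sketch the education levels and responses are invented, and the first row of each `id` stands in for the form answer.

```R
library(dplyr)

# Made-up rows: the first response of each file stands in for the form answer
toy_raw <- tibble(
  id       = c("1", "1", "1", "2", "2", "2"),
  type     = c("form", "test", "test", "form", "test", "test"),
  response = c("High School", "f", "j", "Graduate School", "j", "f")
)

# Persistent grouping: group, mutate, then ungroup explicitly
with_group_by <- toy_raw %>%
  group_by(id) %>%
  mutate(education_level = first(response)) %>%
  ungroup()

# Per-operation grouping: .by applies to this mutate only
with_by <- toy_raw %>%
  mutate(education_level = first(response), .by = id)

all.equal(with_group_by, with_by)
```

Both versions copy each subject's first response into every row of that subject, so the form rows can then be filtered out without losing the education level.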

#### Exercise 16
Trials where the response was too slow or too fast should be excluded from the analysis. (Why?)  
Exclude the trials where the response time is below 200 ms or above 1,500 ms. How many trials were excluded?
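
One way to express such a range filter is `between()`, sketched here on made-up response times. Whether the boundary values themselves should be kept is an assumption; `between()` keeps them.

```R
library(dplyr)

# Made-up response times in ms
toy <- tibble(rt = c(150, 250, 900, 1500, 1600))

# between() is inclusive on both ends, so 200 and 1500 themselves would be kept
kept <- toy %>% filter(between(rt, 200, 1500))

# Number of excluded trials = rows before minus rows after
n_excluded <- nrow(toy) - nrow(kept)
n_excluded
```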

#### Exercise 17 
Exclude participants who didn't achieve 93% overall accuracy.  
How many subjects were excluded?
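
Excluding whole participants calls for a grouped filter, where the condition is computed once per subject. The sketch below uses made-up accuracy data and assumes dplyr 1.1.0 or later for the `.by` argument of `filter`; treating exactly 93% as passing is also an assumption.

```R
library(dplyr)

# Made-up accuracy data: subject "1" is 100% correct, subject "2" is 50% correct
toy <- tibble(
  id      = c("1", "1", "2", "2"),
  correct = c(1, 1, 1, 0)
)

# Keep only the rows of subjects whose mean accuracy reaches the threshold
kept <- toy %>% filter(mean(correct) >= 0.93, .by = id)

# Count the subjects that dropped out
n_excluded <- n_distinct(toy$id) - n_distinct(kept$id)
n_excluded
```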

#### Exercise 18
Summarize the response-time and accuracy measures by trial type to check whether there's a Stroop effect.

## Changing the format of the data: `pivot_wider`, `pivot_longer`

#### Exercise 19: `pivot_wider`
Compute the average Stroop effects for each participant. One column should be called `stroop_rt`, the other should be called `stroop_error`. Using `pull`, extract the reaction-time Stroop vector and plot its histogram.
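
The reshaping step can be sketched on made-up per-subject averages. The numbers are invented, and defining the Stroop effect as incongruent minus congruent is an assumption of this example.

```R
library(dplyr)
library(tidyr)

# Made-up per-subject averages, one row per subject and trial type
toy_avg <- tibble(
  id         = c("1", "1", "2", "2"),
  trial_type = c("congruent", "incongruent", "congruent", "incongruent"),
  rt         = c(600, 650, 700, 780),
  error      = c(0.02, 0.05, 0.00, 0.04)
)

toy_stroop <- toy_avg %>%
  # Spread trial types into columns: rt_congruent, rt_incongruent, error_congruent, ...
  pivot_wider(names_from = trial_type, values_from = c(rt, error)) %>%
  # Stroop effect computed as incongruent minus congruent (assumed definition)
  mutate(
    stroop_rt    = rt_incongruent - rt_congruent,
    stroop_error = error_incongruent - error_congruent
  )

# Extract the reaction-time Stroop effects as a plain vector
toy_stroop %>% pull(stroop_rt)
```

After `pivot_wider`, each subject occupies a single row, which is what makes the column-wise subtraction possible.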

#### Exercise 20: `pivot_longer`
Building on the output tibble from Exercise 19, remove all columns except `id`, `stroop_rt`, and `stroop_error`, then gather the two Stroop columns into a single column called `measure`. This column should take one of two values for each subject (`stroop_rt` or `stroop_error`), while the `value` column should hold that subject's corresponding Stroop effect.
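
The reverse reshaping can be sketched the same way. The tibble below is made up and only mimics the shape of the Exercise 19 output.

```R
library(dplyr)
library(tidyr)

# Made-up Stroop effects, one row per subject
toy_stroop <- tibble(
  id           = c("1", "2"),
  stroop_rt    = c(50, 80),
  stroop_error = c(0.03, 0.04)
)

toy_long <- toy_stroop %>%
  select(id, stroop_rt, stroop_error) %>%
  # Stack the two Stroop columns: names go to 'measure', numbers go to 'value'
  pivot_longer(
    cols      = c(stroop_rt, stroop_error),
    names_to  = "measure",
    values_to = "value"
  )

toy_long
```

The long format has two rows per subject, one for each measure, which is the shape `ggplot` expects when a measure is mapped to an aesthetic or a facet.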

## Data plotting

### Rule 1: If your data is in the tidy format (one variable per column, one observation per row), plotting with `ggplot` will be very easy.

### Rule 2: No barplots.

### One possible way to do it

In [None]:
average_data <- full_data %>% summarize(rt = mean(rt), .by = c(id, trial_type)) 

ggplot(average_data, aes(x = trial_type, y = rt, fill = trial_type)) +
  geom_boxplot(width = 0.5, alpha = 0.45) +
  geom_point(size = 2) +
  geom_line(aes(group = id), color = 'gray') +         
  stat_summary(fun.data = mean_se, linewidth = 2, shape = 21, size = 1.5) +
  labs(title = "Average reaction times by trial type (ms)", x = "Trial type", y = "") +
  theme_minimal() + 
  theme(
    legend.position = "none", 
    plot.title = element_text(face = "bold", size = 20),
    axis.title = element_text(size = 18),
    axis.text = element_text(size = 16))