# Analysis

Today we'll mainly work through the analysis of the Demo Experiment data. We have data from 14 participants who completed the German version of the experiment.

**Before we can start:**

**a**) [Download the data](demo_data.zip) as a `.zip` file.

**b**) Extract the contents of the `.zip` file to a location that makes sense for you. The data files should end up in a folder called `data` at that location.

**c**) Start a new R script in RStudio.

Once everyone has completed these steps, we'll begin...

## 1: Load the Relevant Libraries

These are the libraries we'll use for the analysis. Each library has a comment explaining what we will use it for.

In [None]:
options(repr.plot.width=3.5, repr.plot.height=3)

In [None]:
library(readr)    # for reading the data into R
library(purrr)    # for easily importing multiple files
library(dplyr)    # for wrangling data (e.g., adding/renaming columns)
library(tidyr)    # for switching between long and wide formats of data
library(ordinal)  # for fitting CLMMs
library(lme4)     # for fitting LMMs

library(ggplot2)  # for visualising data
theme_set(theme_bw())

## 2: Import the Data

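Let's import the data extracted from the `.zip` file. The path `"data"` used below is relative to R's working directory, so the working directory should be the location that contains the `data` folder: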

In [None]:
# first, get a list of paths to all .csv data files
data_paths <- list.files("data", pattern=".*\\.csv$", full.names=TRUE)

# now, iterate over these with `read_csv()` to import them
# (note: col_types just says that we want these two columns to be stored as text, not numbers)
raw_data <- map_df(data_paths, read_csv, col_types=c(participant="c", frameRate="c"))
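
The import pattern above can be tried out without the real data. The following is a minimal sketch, assuming two made-up participant files written to a temporary folder; the folder name `toy_data`, the file names, and the column `resp` are invented for illustration and are not part of the experiment's output.

```r
library(readr)
library(purrr)
library(dplyr)

# write two made-up participant files to a temporary folder (hypothetical names)
toy_dir <- file.path(tempdir(), "toy_data")
dir.create(toy_dir, showWarnings = FALSE)
write_csv(tibble(participant = "01", resp = c(1, 4)), file.path(toy_dir, "p01.csv"))
write_csv(tibble(participant = "02", resp = c(5, 2)), file.path(toy_dir, "p02.csv"))

# same approach as above: list the .csv files, then read and row-bind them
toy_paths <- list.files(toy_dir, pattern = "\\.csv$", full.names = TRUE)
toy_raw <- map_df(toy_paths, read_csv, col_types = cols(participant = col_character()))

# sanity check: how many participants were imported?
n_distinct(toy_raw$participant)
```

Running the same `n_distinct()` check on `raw_data$participant` is a quick way to confirm that all 14 participants were imported. Reading IDs as text also means they come back exactly as written in the file, for example with a leading zero intact.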

Let's have a look at the first few rows of the data with `print()`:

In [None]:
print(raw_data)

## 3: Format the Data

Here we make two quick changes:

1) We rename the column containing the response variable to a shorter name: `resp`

2) We create a new column, `resp_fct`, which holds the same variable as an ordered factor, with the order of the response levels set explicitly

We store the resulting data frame in a new variable with a handy short name, `d`:

In [None]:
d <- raw_data |>
  # 1) rename to a shorter name
  rename(resp = resp_scale.response) |>
  # 2) set the factor levels explicitly
  mutate( resp_fct = factor(resp, levels=1:5, ordered=TRUE) )

Coding the variable as a factor in this last step is useful for the CLMM approach. It ensures that R knows the order of our Likert responses.
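
To see what the ordered factor does, here is a small sketch using made-up responses (the values in `toy_resp` are invented for illustration). It shows that all five scale points are kept as levels, even one that nobody chose, and that comparisons respect the order.

```r
# made-up Likert responses; note that nobody chose 3
toy_resp <- c(2, 5, 1, 4, 4)

# same call as in the pipeline above
toy_fct <- factor(toy_resp, levels = 1:5, ordered = TRUE)

levels(toy_fct)          # all five scale points, including the unused 3
is.ordered(toy_fct)      # R treats the levels as ordered
toy_fct[1] < toy_fct[2]  # comparisons follow the level order (2 is below 5)
table(toy_fct)           # the unused level 3 appears with a count of 0
```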

## 4: Clean the Data

To remove outliers, we may want to filter trials by response time, or exclude participants who gave the same response on every trial of the experiment.

Here are the rules we'll use:

* *Rule a)* We will exclude participants who only pressed one button for the whole experiment

* *Rule b)* We will exclude trials where participants were too slow to be paying attention (slower than 15 seconds).

* *Rule c)* We will exclude trials where participants were too fast to be paying attention (faster than 500 ms).

Remember that for your actual experiment, you will need to preregister these criteria.

We'll start by excluding participants who gave too few unique responses. To do this, we first count the number of unique responses given by each participant. From these counts, we can then extract the IDs of participants who only ever pressed one button.

In [None]:
# get the number of unique responses from each participant
participant_variety <- d |>
  # select the relevant columns
  select(participant, resp_fct) |>
  # get unique responses
  distinct() |>
  # count unique responses per participant
  count(participant) |>
  # sort in ascending order
  arrange(n)

# get IDs of participants who only ever pressed one button
bad_participants <- participant_variety |>
  filter(n==1) |>
  pull(participant)

Then we can filter our data to not include these bad participants. This code says that we should only keep participants who are not (`!`) in (`%in%`) the vector containing bad participant IDs (`bad_participants`).

In [None]:
# a) exclude participants who only ever pressed one button
d_clean_a <- filter(d, ! participant %in% bad_participants)
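
Because no participant in the demo data ends up being excluded by this rule, it can help to watch the same steps catch someone. Below is a minimal sketch on made-up data, where the hypothetical participant `p2` pressed the same button on every trial; the names `toy`, `toy_bad`, and `toy_clean` are invented for illustration.

```r
library(dplyr)

# made-up data: p1 varies their responses, p2 always presses 4
toy <- tibble(
  participant = c("p1", "p1", "p1", "p2", "p2", "p2"),
  resp_fct    = factor(c(1, 3, 3, 4, 4, 4), levels = 1:5, ordered = TRUE)
)

# count unique responses per participant and keep the IDs with only one
toy_bad <- toy |>
  distinct(participant, resp_fct) |>
  count(participant) |>
  filter(n == 1) |>
  pull(participant)

# keep only participants who are not in the vector of bad IDs
toy_clean <- filter(toy, !participant %in% toy_bad)

print(toy_bad)
print(toy_clean)
```

Here `toy_bad` contains only `p2`, and `toy_clean` retains the three trials from `p1`.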

Excluding trials by response time (RT) is much easier. The thresholds in the code below are given in seconds, so we keep trials where the RT was below the maximum (15 s) and above the minimum (0.5 s).

In [None]:
# b) exclude trials that were too slow
d_clean_b <- filter(d_clean_a, resp_scale.rt < 15)

# c) exclude trials that were too fast
d_clean_c <- filter(d_clean_b, resp_scale.rt > 0.5)
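
The two RT filters can also be written as a single `filter()` call. This is a sketch on made-up RTs rather than the approach used above, which keeps the steps separate so that the trials lost at each step can be counted. The toy values include RTs that fall exactly on each cut-off, to show that the strict comparisons (`>` and `<`) exclude them.

```r
library(dplyr)

# made-up RTs in seconds, including values exactly on each cut-off
toy_rt <- tibble(resp_scale.rt = c(0.3, 0.5, 2.1, 14.9, 15, 20))

# conditions separated by commas are combined with AND
toy_kept <- filter(toy_rt, resp_scale.rt > 0.5, resp_scale.rt < 15)

print(toy_kept$resp_scale.rt)
```

Only the RTs of 2.1 and 14.9 seconds survive. If trials at exactly 0.5 or 15 seconds should be kept, `>=` and `<=` would be needed instead.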

The last step is to calculate the number of participants and trials lost at each step. We can use the `nrow()` function to count the number of rows (i.e., trials) in each data frame, and compare the counts before and after each filter. To count the number of bad participants, we can use `length()`.

In [None]:
# there were 0 participants excluded on the basis of only ever pressing one button
print(length(bad_participants))

In [None]:
# there were 14 trials excluded on the basis of slow responses
print( nrow(d_clean_a) - nrow(d_clean_b) )

In [None]:
# there were 0 trials excluded on the basis of fast responses
print( nrow(d_clean_b) - nrow(d_clean_c) )
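
The individual counts above can also be collected into one summary table, which may be handy when reporting exclusions. The following is a minimal base R sketch on made-up data: the helper `rows_lost()` and the `toy_*` data frames are hypothetical stand-ins for `d`, `d_clean_a`, `d_clean_b`, and `d_clean_c`.

```r
# hypothetical helper: number of rows lost between two successive data frames
rows_lost <- function(before, after) nrow(before) - nrow(after)

# made-up stand-ins for d, d_clean_a, d_clean_b and d_clean_c
toy_d <- data.frame(rt = c(0.2, 1.0, 3.5, 16.0, 2.2))
toy_a <- toy_d                                   # no participants removed
toy_b <- toy_a[toy_a$rt < 15, , drop = FALSE]    # drop trials that were too slow
toy_c <- toy_b[toy_b$rt > 0.5, , drop = FALSE]   # drop trials that were too fast

# one row per cleaning step, with the trials lost at that step
exclusions <- data.frame(
  step        = c("one button", "too slow", "too fast"),
  trials_lost = c(rows_lost(toy_d, toy_a), rows_lost(toy_a, toy_b), rows_lost(toy_b, toy_c))
)

# express each loss as a percentage of all trials before cleaning
exclusions$pct_of_all <- 100 * exclusions$trials_lost / nrow(toy_d)

print(exclusions)
```

Note that the first row counts trials lost by removing participants, not the number of participants removed; the latter is still given by `length(bad_participants)`.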