## Data wrangling & summary statistics
### Tidy up the mess (20 points)

Here is a messy data set from an experiment in which participants saw three critical conditions and had to respond by pressing a button for either option A or option B. There were four participants, identified anonymously in the variable subject_id. The button presses and the associated reaction times (in milliseconds) for each of the three trials are stored in the columns choices and reaction_times, respectively. Each column holds one string per participant, which separates the data from different trials either with a comma (for choices) or with a single white space (for reaction_times).

messy_data <- tribble(
  ~subject_id,  ~choices,  ~reaction_times,
  1,            "A,B,B",   "312 433 365",
  2,            "B,A,B",   "393 491 327",
  3,            "B,A,A",   "356 313 475",
  4,            "A,B,B",   "292 352 378"
)

The task is to tidy up this data so that it looks like this:

## # A tibble: 12 x 4
##    subject_id condition response    RT
##         <dbl> <chr>     <chr>    <int>
##  1          1 C_1       A          312
##  2          1 C_2       B          433
##  3          1 C_3       B          365
##  4          2 C_1       B          393
##  5          2 C_2       A          491
##  6          2 C_3       B          327
##  7          3 C_1       B          356
##  8          3 C_2       A          313
##  9          3 C_3       A          475
## 10          4 C_1       A          292
## 11          4 C_2       B          352
## 12          4 C_3       B          378

In [1]:
import pandas as pd
import numpy as np

data = {
    "subject_id": [1, 2, 3, 4],
    "choices": ["A,B,B", "B,A,B", "B,A,A", "A,B,B"],
    "reaction_times": ["312 433 365", "393 491 327", "356 313 475", "292 352 378"]
}

messy_data = pd.DataFrame(data)
print(messy_data)

   subject_id choices reaction_times
0           1   A,B,B    312 433 365
1           2   B,A,B    393 491 327
2           3   B,A,A    356 313 475
3           4   A,B,B    292 352 378
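
To make the two string layouts concrete, here is a minimal sketch in plain Python that parses the strings of participant 1 only. The variable names `responses`, `rts` and `trials` are illustrative and not part of the exercise.

```python
# Strings for participant 1, copied from messy_data
choices = "A,B,B"
reaction_times = "312 433 365"

# choices are comma-separated, reaction times are whitespace-separated
responses = choices.split(",")
rts = [int(rt) for rt in reaction_times.split()]

# Pair each response with its reaction time, one pair per trial
trials = list(zip(responses, rts))
print(trials)  # [('A', 312), ('B', 433), ('B', 365)]
```

The pandas solution applies the same two splits to every row at once through the `.str` accessor.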


In [14]:
# Tidy the data: split the strings in 'choices' and 'reaction_times', then add a
# 'condition' column (C_1, C_2, C_3), so that we end up with 12 rows, one per trial
# Re-create the initial DataFrame so that this cell can be re-run safely
messy_data = pd.DataFrame(data)

# Split the 'choices' and 'reaction_times' columns into lists
messy_data['choices'] = messy_data['choices'].str.split(',')
messy_data['reaction_times'] = messy_data['reaction_times'].str.split()

# Expand the lists into individual rows (exploding several columns at once needs pandas >= 1.3)
tidy_data = messy_data.explode(['choices', 'reaction_times'])

# Add the 'condition' column
conditions = ['C_1', 'C_2', 'C_3']
tidy_data['condition'] = conditions * (len(tidy_data) // len(conditions))

# Rename columns for clarity
tidy_data.rename(columns={'choices': 'response', 'reaction_times': 'RT'}, inplace=True)

# Convert 'RT' column to integer
tidy_data['RT'] = tidy_data['RT'].astype(int)

# Reset index for a clean DataFrame
tidy_data.reset_index(drop=True, inplace=True)

# Print the tidy DataFrame
print(tidy_data)


    subject_id response   RT condition
0            1        A  312       C_1
1            1        B  433       C_2
2            1        B  365       C_3
3            2        B  393       C_1
4            2        A  491       C_2
5            2        B  327       C_3
6            3        B  356       C_1
7            3        A  313       C_2
8            3        A  475       C_3
9            4        A  292       C_1
10           4        B  352       C_2
11           4        B  378       C_3
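
The printed result holds the same information as the target table, only with a different column order. As a possible variant, the sketch below derives the condition label from the position of each trial within its participant instead of repeating a fixed list, and reorders the columns to match the target. The name `tidy_alt` is hypothetical, and the approach assumes that the trials of each participant are listed in condition order.

```python
import pandas as pd

messy_data = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "choices": ["A,B,B", "B,A,B", "B,A,A", "A,B,B"],
    "reaction_times": ["312 433 365", "393 491 327", "356 313 475", "292 352 378"],
})

tidy_alt = (
    messy_data
    .assign(
        response=messy_data["choices"].str.split(","),
        RT=messy_data["reaction_times"].str.split(),
    )
    .drop(columns=["choices", "reaction_times"])
    # Exploding two columns at once needs pandas >= 1.3
    .explode(["response", "RT"], ignore_index=True)
)

# Number the trials within each participant (1, 2, 3) and build C_1, C_2, C_3
trial_number = tidy_alt.groupby("subject_id").cumcount() + 1
tidy_alt["condition"] = "C_" + trial_number.astype(str)

# Reaction times are still strings after the split
tidy_alt["RT"] = tidy_alt["RT"].astype(int)

# Same column order as the target table
tidy_alt = tidy_alt[["subject_id", "condition", "response", "RT"]]
print(tidy_alt)
```

Deriving the label with `cumcount` keeps working if a participant has a different number of trials, whereas the list repetition relies on every participant having exactly three.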


### Summarize the reaction times

Use the final tidy representation of the messy_data from the previous exercise, stored in a variable tidy_data. Produce a summary table of mean reaction times per condition, using the tools from the tidyverse. Your output should look like this:
## # A tibble: 3 x 2
##   condition mean_RT
##   <chr>       <dbl>
## 1 C_1          338.
## 2 C_2          397.
## 3 C_3          386.

Now produce a table of the mean reaction times for each participant. In this case, make sure that the mean reaction times are rounded to whole numbers. (Hint: you can use mutate in a final step, or round inside a call to summarise.) The output should look like this:

## # A tibble: 4 x 2
##   subject_id mean_RT
##        <dbl>   <dbl>
## 1          1     370
## 2          2     404
## 3          3     381
## 4          4     341

In [15]:
# mean reaction time for each condition
mean_RT = tidy_data.groupby('condition')['RT'].mean()
print(mean_RT)

condition
C_1    338.25
C_2    397.25
C_3    386.25
Name: RT, dtype: float64
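
The result above is a Series, while the target is a two-column table with a `mean_RT` column. One way to get closer to the `summarise` output in pandas is named aggregation, sketched below. The tidy data is rebuilt inline so that the block runs on its own, and the name `summary` is illustrative.

```python
import pandas as pd

# Tidy data from the previous exercise, rebuilt here so the block is self-contained
tidy_data = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "condition": ["C_1", "C_2", "C_3"] * 4,
    "response": list("ABBBABBAAABB"),
    "RT": [312, 433, 365, 393, 491, 327, 356, 313, 475, 292, 352, 378],
})

# Named aggregation: output column 'mean_RT' is the mean of 'RT' per condition;
# as_index=False keeps 'condition' as a regular column
summary = tidy_data.groupby("condition", as_index=False).agg(mean_RT=("RT", "mean"))
print(summary)
```

The tibble prints these means with three significant digits (338., 397., 386.); the underlying values are 338.25, 397.25 and 386.25.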


In [17]:
# mean reaction time for each participant, rounded to the nearest integer
mean_RT = tidy_data.groupby('subject_id')['RT'].mean().round().astype(int)
print(mean_RT)

subject_id
1    370
2    404
3    381
4    341
Name: RT, dtype: int64
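
The hint in the exercise mentions rounding with `mutate` in a final step. A rough pandas counterpart, assuming `assign` as the stand-in for `mutate`, could look like the sketch below. The tidy data is rebuilt inline and the name `by_subject` is illustrative.

```python
import pandas as pd

# Tidy data from the previous exercise, rebuilt here so the block is self-contained
tidy_data = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "condition": ["C_1", "C_2", "C_3"] * 4,
    "response": list("ABBBABBAAABB"),
    "RT": [312, 433, 365, 393, 491, 327, 356, 313, 475, 292, 352, 378],
})

by_subject = (
    tidy_data
    .groupby("subject_id", as_index=False)
    .agg(mean_RT=("RT", "mean"))  # summarise step: one mean per participant
    .assign(mean_RT=lambda d: d["mean_RT"].round().astype(int))  # mutate step: round afterwards
)
print(by_subject)
```

This yields a two-column table with `subject_id` and `mean_RT`, which is closer in shape to the target tibble than the Series printed above.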

