Osnabrück University - A&C: Computational Cognition (Summer Term 2019)

# Exercise Sheet 02: Basic statistics

## Introduction

This week's sheet should be solved and handed in by 14:00 on **Tuesday, April 30, 2019**. If you need help (and Google and other resources were not enough), feel free to contact your tutors. Please push your results to your GitHub group folder.

In this exercise sheet you will have to work with ```pandas``` and ```seaborn```. ```pandas``` is one of the most widely used tools in data processing. What’s cool about ```pandas``` is that it takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a 'data frame' that looks very similar to tables in statistical software (think Excel or SPSS for example). ```pandas``` makes data processing a lot easier in comparison to working with lists and/or dictionaries through for-loops or list comprehensions.  
```seaborn``` is a library for making plots. It is based on ```matplotlib``` but offers more functions specialized for statistical visualization. Also, most people agree that ```seaborn``` plots look more polished.
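
To make the comparison with loops concrete, here is a minimal sketch that computes a per-subject mean twice: once with plain dictionaries and once with ```pandas```. The trial values are invented for illustration and are not part of the exercise data.

```python
import pandas as pd

# Toy trials, invented for illustration only
trials = [
    {"SubjectID": 1, "RT": 300},
    {"SubjectID": 1, "RT": 500},
    {"SubjectID": 2, "RT": 200},
]

# Plain Python: collect the RTs per subject, then average them
rts_per_subject = {}
for trial in trials:
    rts_per_subject.setdefault(trial["SubjectID"], []).append(trial["RT"])
means_loop = {subject: sum(rts) / len(rts) for subject, rts in rts_per_subject.items()}

# pandas: the same computation as a single grouped mean
df = pd.DataFrame(trials)
means_pandas = df.groupby("SubjectID")["RT"].mean()

print(means_loop)
print(means_pandas.to_dict())
```

Both approaches give the same numbers; the grouped version stays a one-liner as more columns and conditions are added.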

Don't forget that we will also give **2 points** for nice coding style!

## Assignment 0: Peer review for sheet 01 [3 pts]

Beginning this week, you will have to peer review the other groups' solutions. Each group reviews the solutions of two other groups and gives points according to the given point distribution, considering the correctness of the solution. For these reviews the tutors will give you up to 3 points each week.

| * |Group 1|Group 2|Group 3|Group 4|Group 5|Group 6|Group 7|Group 8|Group 9|Group 10|Group 11|
| ------- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ------ | ------ |
| check solutions of group: | 10, 7 | 4, 9  | 1, 4  | 11, 1 | 8, 11 | 5, 3  | 9, 10 | 6, 5  | 3, 2  | 2, 8   | 7, 6   |

You should open an issue in the repositories of the groups you have to check. The title of the issue should be your group name (e.g. "Group 1"). In the issue, comment on what was good and bad, how many points they get, etc.  
Refer to https://guides.github.com/features/issues/ to learn more about issues.

## Assignment 1: Dataframes [4 pts]

In [1]:
# import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

```matplotlib``` and ```seaborn``` should already be installed in your environment. If not, please run:
```sh
pip install seaborn
```

### a) Importing a csv file [2 pts]

Import the CSV files of all subjects into one dataframe. Make sure that each row has a unique index. You might want to take a look at what ***pandas.concat*** does.<br>
Extra fun: Use the ***pandas.set_option*** function to display the dataframe in a well-arranged way. Play around a little with the settings that you are allowed to change.<br>
Save ```df_concatenated```.
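
As a small illustration of why the index needs attention, the sketch below concatenates two toy frames (invented values standing in for two subjects' files) with and without `ignore_index=True`. It also uses `pd.option_context`, which applies display settings only inside a `with` block, as one possible alternative to setting them globally.

```python
import pandas as pd

# Two toy frames standing in for two subjects' CSV files (values invented)
subject_a = pd.DataFrame({"SubjectID": [1, 1], "RT": [300, 450]})
subject_b = pd.DataFrame({"SubjectID": [2, 2], "RT": [280, 0]})

# Without ignore_index each frame keeps its own row labels: 0, 1, 0, 1
stacked = pd.concat([subject_a, subject_b])
# With ignore_index=True the rows are relabelled: 0, 1, 2, 3
reindexed = pd.concat([subject_a, subject_b], ignore_index=True)

print(stacked.index.is_unique)
print(reindexed.index.is_unique)

# Display options set here are reset automatically after the with-block
with pd.option_context("display.max_rows", 10, "display.width", 120):
    print(reindexed)
```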


In [4]:
import glob
import os
import pandas as pd

PATH = os.path.join(os.getcwd(), "Data")
all_files = glob.glob(os.path.join(PATH, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent
                                                       # http://www.oipapio.com/question-88634
all_files.sort() # read data of subjects in order

frames = [] # one dataframe per subject file
for filename in all_files:
    data_frame = pd.read_csv(filename, header=0) # infer column names from first row
    frames.append(data_frame) # add dataframe to list
df_concatenated = pd.concat(frames, axis=0, ignore_index=True) # concat list to one big dataframe with a unique index

# use pd.set_option here to display in a nice way
# (the options have to be set before printing to take effect)
pd.set_option('display.max_rows', 2000)
pd.set_option('display.max_columns', 6)
pd.set_option('display.width', 1000)

print(df_concatenated)

# save concatenated dataframe
DATAPATH = os.path.join(os.getcwd(), 'Processed', 'data_concatenated.csv')
df_concatenated.to_csv(path_or_buf=DATAPATH)

      SubjectID  StimulusType  response    RT
0          9001             1         1   369
1          9001             1         1   532
2          9001             1         1   154
3          9001             0         1   151
4          9001             1         1   418
5          9001             1         1   185
6          9001             1         1   451
7          9001             1         1   402
8          9001             1         1   202
9          9001             1         1   353
10         9001             1         1   381
11         9001             0         1   236
12         9001             1         1   251
13         9001             1         1   204
14         9001             0         1   150
15         9001             1         1   613
16         9001             1         1   671
17         9001             1         1   219
18         9001             0         1   215
19         9001             1         1   985
20         9001             1     

### b) Working with dataframes [2 pts]

- Add a column called "congruence" to ```df_concatenated```. The column should have the value *True* if "StimulusType" and "response" match. Otherwise the column should have the value *False*.

- Create a new dataframe which has "SubjectID", "StimulusType", "RT" and "congruence" as columns. For each combination of "SubjectID" and "StimulusType" (e.g. "7001" and "0") compute the average RT and congruence level.

- When computing the average RT, omit all reaction times which are 0 as these will distort the mean.

- Rename "congruence" as "accuracy" and save the dataframe as a csv file. 
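
The steps above can be sketched on a toy dataframe as follows. The trial values are invented, and masking `RT == 0` as `NaN` is only one possible way to exclude those trials from the mean (it relies on `mean()` skipping `NaN` by default).

```python
import pandas as pd

# Toy trials (invented); RT == 0 stands for a trial without a reaction
df = pd.DataFrame({
    "SubjectID":    [9001, 9001, 9001, 9001],
    "StimulusType": [1, 1, 0, 0],
    "response":     [1, 0, 0, 1],
    "RT":           [400, 0, 0, 200],
})

# True where stimulus type and response match
df["congruence"] = df["StimulusType"] == df["response"]

# Turn RT == 0 into NaN so that these trials do not enter the RT mean
df["RT"] = df["RT"].mask(df["RT"] == 0)

# The mean of a boolean column is the fraction of True values
averaged = (
    df.groupby(["SubjectID", "StimulusType"], as_index=False)[["RT", "congruence"]]
      .mean()
      .rename(columns={"congruence": "accuracy"})
)
print(averaged)
```

Note that only the RT of a zero trial is dropped; the trial still counts towards accuracy. The result could then be written out with `averaged.to_csv(..., index=False)`.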

In [14]:
# add a column "congruence"
# TODO
df_concatenated["congruence"] = df_concatenated["StimulusType"] == df_concatenated["response"] 



# create a new dataframe with averaged data

df_concatenated_avg = df_concatenated[["SubjectID", "StimulusType","RT","congruence"]]
print(df_concatenated_avg.groupby(["SubjectID","StimulusType"]).mean())

# save averaged dataframe
#DATAPATH = os.getcwd() + '/Processed/data_concatenated_averaged.csv'
# TODO



                              RT  congruence
SubjectID StimulusType                      
9001      0              77.2000      0.6000
          1             161.6500      0.4125
9002      0              17.7000      0.9500
          1             362.8750      1.0000
9003      0              14.4000      0.9500
          1             392.2625      0.9750
9004      0              11.9000      0.9500
          1             305.0375      1.0000
9005      0               0.0000      1.0000
          1             364.8750      0.9875
9006      0              17.1000      0.9500
          1             427.4375      1.0000
9007      0              11.7000      0.9500
          1             307.9375      1.0000
9008      0               2.5000      0.9500
          1             328.1875      1.0000
9009      0               0.0000      1.0000
          1             439.7000      1.0000
9010      0              54.2000      0.8000
          1             313.7750      1.0000
9011      

## Assignment 2: Statistical plotting [6 pts]

### a) Boxplot and Violinplot [2 pts]

Plot the RT of each trial for all subjects as a stripplot and a boxplot on top of each other. Do the same with a stripplot and a violinplot. Plot go trials as green dots and no-go trials as red dots. Reminder: don't forget to mask the data where RT=0. Make sure that the legends are informative (don't display duplicated legends).

In [None]:
# read data
data_concat = pd.read_csv(os.getcwd() + "/Processed/data_concatenated.csv")

# create two axes
fig, axes = plt.subplots(nrows=1,ncols=2)

# first subplot with stripplot and boxplot
# TODO 

# second subplot with stripplot and violinplot
# TODO

# handling legends
# TODO

fig.tight_layout()
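
One possible way to layer the plots, shown on synthetic data rather than the real `data_concatenated.csv`: the helper column `"Trial type"`, the colour mapping, and the choice to keep only the legend of the second subplot are assumptions made for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic trials (invented values)
rng = np.random.default_rng(0)
n = 60
data = pd.DataFrame({
    "SubjectID": rng.choice([9001, 9002, 9003], size=n),
    "StimulusType": rng.choice([0, 1], size=n),
    "RT": rng.integers(150, 600, size=n),
})
data.loc[:4, "RT"] = 0                 # a few trials without a reaction
data = data[data["RT"] > 0].copy()     # mask the data where RT == 0
data["Trial type"] = data["StimulusType"].map({1: "go", 0: "no-go"})

palette = {"go": "green", "no-go": "red"}
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4), sharey=True)

# Background distributions: boxplot on the left, violinplot on the right
sns.boxplot(data=data, x="SubjectID", y="RT", color="white", ax=axes[0])
sns.violinplot(data=data, x="SubjectID", y="RT", color="white", inner=None, ax=axes[1])

# Single trials on top of both, coloured by trial type
for ax in axes:
    sns.stripplot(data=data, x="SubjectID", y="RT", hue="Trial type",
                  hue_order=["go", "no-go"], palette=palette, ax=ax)

# Both subplots would show the same legend, so drop the first copy
axes[0].get_legend().remove()
fig.tight_layout()
```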

### b) Violinplot combining all data of all groups [3 pts]

- Make a dataframe consisting of all data across groups. You already did this in 1.a). In the end, this dataframe should have 8 * 11 * 100 rows.

- Every group has used their own ID convention. Make sure that every data point follows this SubjectID system: group number + "00" + subject number.  
e.g. 3002 for the second subject of the third group.

- Compute the average RT and accuracy for each subject in the big dataframe you just created. You already did this in 1.b). In the end, this dataframe will have 8 * 11 rows.

- In the first column, plot the average RT and accuracy for the 8 subjects from your group's data. Use a violinplot and split go/no-go conditions.

- In the second column, plot the average RT and accuracy for the 80 subjects from all data. Use a violinplot and split go/no-go conditions.

- Do you see any difference between the first column and the second column? What does this tell us about the central limit theorem (CLT)?
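
The ID convention from the list above can be sketched as a small helper. The function name `normalize_subject_ids` is hypothetical, and since the other groups' conventions are not known here, the sketch simply assumes that subjects are numbered 1..n in ascending order of their original IDs.

```python
import pandas as pd

def normalize_subject_ids(df: pd.DataFrame, group_number: int) -> pd.DataFrame:
    """Return a copy of df with SubjectID = group number + "00" + subject number."""
    out = df.copy()
    # Assumption: subject numbers follow the sorted order of the original IDs
    codes, _ = pd.factorize(out["SubjectID"], sort=True)  # 0-based subject codes
    out["SubjectID"] = [int(f"{group_number}00{code + 1}") for code in codes]
    return out

# Toy data for a third group that used its own IDs (values invented)
group3 = pd.DataFrame({"SubjectID": [31, 17, 31], "RT": [300, 250, 410]})
normalized = normalize_subject_ids(group3, group_number=3)
print(normalized)
```

Applying the helper to each group's dataframe before calling `pd.concat` keeps the IDs unique across groups.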

In [None]:
# again create a concatenated dataframe over all (averaged) groups.
# Don't forget to modify the Subject ID
# TODO

# Now it's time to plot your results
figs, axes = plt.subplots(nrows=2, ncols=2, sharey="row")

# violin plot for your group's data
# TODO

# violin plot of all groups' data
# TODO


Compare the two datasets and relate them to the CLT. Write your interpretation here.
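
As background for this comparison, here is a purely synthetic simulation of the CLT, not an analysis of the exercise data. It assumes skewed "reaction times" drawn from an exponential distribution with mean 300 and looks at how the mean of 8 versus 80 draws varies across many repetitions.

```python
import numpy as np

rng = np.random.default_rng(42)
population_mean = 300.0  # invented value, for illustration only
n_repetitions = 10_000

spread = {}
for n_samples in (8, 80):
    # Each row is one repetition with n_samples draws; average within each row
    draws = rng.exponential(population_mean, size=(n_repetitions, n_samples))
    sample_means = draws.mean(axis=1)
    spread[n_samples] = sample_means.std()
    print(n_samples, round(sample_means.mean(), 1), round(spread[n_samples], 1))
```

The means of the larger samples scatter less around the population mean, roughly in proportion to 1/sqrt(n), and their distribution looks closer to a normal distribution even though the underlying values are skewed.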

### c) Scatterplot [1 pt]

Make a scatterplot comparing RT and accuracy. Do you see any correlation?

In [None]:
# TODO
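
A minimal sketch of such a scatterplot, using invented per-subject averages instead of the real data. Computing the Pearson coefficient with `Series.corr` is an addition of this sketch to put a number on the visual impression.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented per-subject averages, for illustration only
averaged = pd.DataFrame({
    "RT":       [310.0, 355.0, 402.0, 298.0, 440.0],
    "accuracy": [0.90, 0.95, 0.97, 0.88, 1.00],
})

fig, ax = plt.subplots()
sns.scatterplot(data=averaged, x="RT", y="accuracy", ax=ax)

# Pearson correlation coefficient between RT and accuracy
r = averaged["RT"].corr(averaged["accuracy"])
print(f"Pearson r = {r:.2f}")
```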