Osnabrück University - A&C: Computational Cognition (Summer Term 2019)

# Exercise Sheet 02: Basic statistics

## Introduction

This week's sheet should be solved and handed in by 14:00 on **Tuesday, April 30, 2019**. If you need help (and Google and other resources were not enough), feel free to contact your tutors. Please push your results to your GitHub group folder.

In this exercise sheet you will work with ```pandas``` and ```seaborn```. ```pandas``` is one of the most widely used tools for data processing. What’s cool about ```pandas``` is that it takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a 'data frame', which looks very similar to a table in statistical software (think Excel or SPSS, for example). ```pandas``` makes data processing a lot easier than working with lists and/or dictionaries through for-loops or list comprehensions.  
```seaborn``` is a library for making plots. It is based on ```matplotlib``` but offers more functions specialized for statistical visualization. Many people also find that ```seaborn``` plots look more polished by default.
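
To make the comparison concrete, here is a minimal sketch that computes a per-subject mean once with a plain loop and once with ```pandas```. The column names and values are made up for illustration and are not part of the exercise data.

```python
import pandas as pd

# Hypothetical toy data: reaction times (ms) for two made-up subjects.
records = [
    {"subject": "A", "rt": 300},
    {"subject": "A", "rt": 340},
    {"subject": "B", "rt": 410},
    {"subject": "B", "rt": 390},
]

# Plain Python: collect the values per subject, then average them by hand.
grouped = {}
for row in records:
    grouped.setdefault(row["subject"], []).append(row["rt"])
loop_means = {subject: sum(values) / len(values) for subject, values in grouped.items()}

# pandas: the same computation as a single groupby expression.
df = pd.DataFrame(records)
pandas_means = df.groupby("subject")["rt"].mean().to_dict()

print(loop_means)
print(pandas_means)
```

Both approaches give the same means; the ```pandas``` version stays a one-liner even when more grouping columns are added.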

Don't forget that we will also give **2 points** for nice coding style!

## Assignment 0: Peer review for sheet 01 [3 pts]

Beginning this week you will have to peer review other groups' solutions. Each group reviews the solutions of two other groups and gives points according to the given point distribution, based on the correctness of the solution. For these reviews the tutors will give you up to 3 points each week.

| * |Group 1|Group 2|Group 3|Group 4|Group 5|Group 6|Group 7|Group 8|Group 9|Group 10|Group 11|
| ------- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ------ | ------ |
| check solutions of group: | 10, 7 | 4, 9  | 1, 4  | 11, 1 | 8, 11 | 5, 3  | 9, 10 | 6, 5  | 3, 2  | 2, 8   | 7, 6   |

You should open an issue in the repository of each group you have to check. The title of the issue should be your group name (e.g. "Group 1"). In the issue, comment on what was good and bad, how many points the group gets, etc.  
Refer to https://guides.github.com/features/issues/ to learn more about issues.

## Assignment 1: Dataframes [4 pts]

In [1]:
# import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

```matplotlib``` and ```seaborn``` should already be installed in your environment. If not, please run:
```sh
pip install seaborn
```

### a) Importing a csv file [2 pts]

Import the csv files of all subjects into one dataframe. Make sure that each row has a unique index. You might want to take a look at what ***pandas.concat*** does.<br>
Extra fun: Use the ***pandas.set_option*** function to display the dataframe in a well-arranged way. Play around a little with the settings that you are allowed to change.<br>
Save ```df_concatenated```.
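
As a hint, the following sketch shows what `ignore_index` changes when concatenating and how display options can be set. The two small dataframes are made up; `display.max_columns` and `display.width` are just two of the available options.

```python
import pandas as pd

# Two hypothetical per-subject tables; each one starts its own index at 0.
df_a = pd.DataFrame({"SubjectID": [1, 1], "RT": [310, 0]})
df_b = pd.DataFrame({"SubjectID": [2, 2], "RT": [295, 402]})

# Without ignore_index the original indices are kept, so 0 and 1 appear twice.
kept = pd.concat([df_a, df_b])

# With ignore_index=True the rows are renumbered 0..n-1, so every index is unique.
renumbered = pd.concat([df_a, df_b], ignore_index=True)

# Display settings: show all columns and allow wider lines when printing.
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)

print(kept.index.tolist())
print(renumbered.index.tolist())
```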


In [5]:
import glob
import os

# os.path.join keeps path concatenation OS independent
# http://www.oipapio.com/question-88634
PATH = os.path.join(os.getcwd(), "Data")

all_files = glob.glob(os.path.join(PATH, "*.csv"))

# use pd.set_option here to display in a nice way
df_from_each_file = (pd.read_csv(f) for f in all_files)
# combine all dataframes into a new one; ignore_index=True gives each row a unique index
df_concatenated = pd.concat(df_from_each_file, ignore_index=True)
df_concatenated

# save the concatenated dataframe as a csv in the Processed folder
DATAPATH = os.path.join(os.getcwd(), "Processed", "data_concatenated.csv")
df_concatenated.to_csv(DATAPATH, index=False)

### b) Working with dataframes [2 pts]

- Add a column called "congruence" to ```df_concatenated```. The column should have the value *True* if "StimulusType" and "response" match. Otherwise the column should have the value *False*.

- Create a new dataframe which has "SubjectID", "StimulusType", "RT" and "congruence" as columns. For each combination of "SubjectID" and "StimulusType" (e.g. "7001" and "0") compute the average RT and congruence level.

- When computing the average RT, omit all reaction times which are 0, as these would distort the mean.

- Rename "congruence" to "accuracy" and save the dataframe as a csv file.
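
The steps above can be sketched on a tiny made-up table. The column names follow the task, but the values are invented, and masking zeros with `Series.mask` is only one possible way to omit them.

```python
import pandas as pd

# Hypothetical trials; the column names follow the task, the values are made up.
trials = pd.DataFrame({
    "SubjectID":    [7001, 7001, 7001, 7001],
    "StimulusType": [1, 1, 0, 0],
    "response":     [1, 1, 0, 1],
    "RT":           [300, 400, 0, 250],
})

# True where the response matches the stimulus type.
trials["congruence"] = trials["StimulusType"] == trials["response"]

# Replace RT == 0 with NaN. mean() skips NaN, so those trials do not affect
# the average RT but still count towards the average congruence.
trials["RT"] = trials["RT"].mask(trials["RT"] == 0)

averaged = (
    trials.groupby(["SubjectID", "StimulusType"], as_index=False)[["RT", "congruence"]]
    .mean()
    .rename(columns={"congruence": "accuracy"})
)
print(averaged)
```

Masking instead of dropping rows keeps the trial in the accuracy average, which matters because a correct no-go trial has no reaction time.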

In [35]:
# add a column "congruence":
# True if StimulusType and response match, False otherwise
df_concatenated['congruence'] = df_concatenated['StimulusType'] == df_concatenated['response']


# create a new dataframe with averaged data
df_concatenated_avg = df_concatenated.copy() # creates a deep copy of the old dataframe
df_concatenated_avg = df_concatenated_avg.drop(columns=['response']) # deletes the column 'response'
# calculates the mean RT for each combination of SubjectID and StimulusType
# NOTE: trials with RT == 0 are not omitted here yet, although the task asks for it
average_RT = df_concatenated_avg.groupby(['SubjectID', 'StimulusType'])['RT'].mean()
print(average_RT)
# calculates the mean congruence for different combinations of SubjectID and StimulusType
average_congruence = df_concatenated_avg.groupby(['SubjectID', 'StimulusType'])['congruence'].mean()
print(average_congruence)

# combines both averages into one dataframe and renames 'congruence' to 'accuracy'
df_concatenated_avg = pd.concat([average_RT, average_congruence], axis=1).reset_index()
df_concatenated_avg = df_concatenated_avg.rename(columns={'congruence':'accuracy'})

# saves the averaged dataframe as a csv file
DATAPATH = os.path.join(os.getcwd(), 'Processed', 'data_concatenated_averaged.csv')
df_concatenated_avg.to_csv(DATAPATH, index=False)
print(df_concatenated_avg)

SubjectID  StimulusType
8001       0                90.6000
           1               342.2125
8002       0                49.2000
           1               357.7625
8003       0                 0.0000
           1               435.1625
8004       0                32.1500
           1               340.2125
8005       0                99.7500
           1               321.7500
8006       0                74.4500
           1               328.9625
8007       0                30.4500
           1               355.3875
8008       0                 0.0000
           1               358.2000
Name: RT, dtype: float64
SubjectID  StimulusType
8001       0               0.7000
           1               1.0000
8002       0               0.8500
           1               1.0000
8003       0               1.0000
           1               1.0000
8004       0               0.9000
           1               0.9875
8005       0               0.7000
           1               1.0000
8006       

## Assignment 2: Statistical plotting [6 pts]

### a) Boxplot and Violinplot [2 pts]

Plot the RT of each trial for all subjects as a stripplot and a boxplot on top of each other. Do the same with a stripplot and a violinplot. Plot go trials as green dots and no-go trials as red dots. Reminder: don't forget to mask the data where RT=0. Make sure that the legends are informative (don't display duplicated legend entries).
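
One possible way to avoid duplicated legend entries is sketched below on made-up data. Treating "StimulusType" 1 as go and 0 as no-go is an assumption here, and the helper column "Trial" is hypothetical; check both against your own data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical trials; mapping 1 -> "go" and 0 -> "no-go" is an assumption.
trials = pd.DataFrame({
    "SubjectID":    [8001] * 4 + [8002] * 4,
    "StimulusType": [1, 1, 0, 0, 1, 1, 0, 0],
    "RT":           [320, 350, 0, 210, 300, 380, 0, 190],
})
trials["Trial"] = trials["StimulusType"].map({1: "go", 0: "no-go"})
masked = trials[trials["RT"] > 0]  # drop trials without a reaction time

palette = {"go": "green", "no-go": "red"}
fig, ax = plt.subplots()
sns.boxplot(data=masked, x="SubjectID", y="RT", hue="Trial", palette=palette, ax=ax)
sns.stripplot(data=masked, x="SubjectID", y="RT", hue="Trial", palette=palette,
              dodge=True, ax=ax)

# Both calls register "go" and "no-go", so keep only one handle per label.
handles, labels = ax.get_legend_handles_labels()
unique = dict(zip(labels, handles))
ax.legend(list(unique.values()), list(unique.keys()), title="Trial")
```

The dictionary keeps only one handle per label, so the legend shows each condition once no matter how many plotting calls registered it.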

In [None]:
# read data
data_concat = pd.read_csv(os.getcwd() + "/Processed/data_concatenated.csv")

# create two axes
fig, axes = plt.subplots(nrows=1,ncols=2)

# first subplot with stripplot and boxplot
# TODO 

# second subplot with stripplot and violinplot
# TODO

# handling legends
# TODO

fig.tight_layout()

### b) Violinplot combining all data of all groups [3 pts]

- Make a dataframe consisting of all data across groups. You already did this in 1.a). In the end this dataframe should have 8 * 11 * 100 rows.

- Every group has used its own ID convention. Make sure that every data point follows this SubjectID system: group number + "00" + subject number.  
e.g. 3002 for the second subject of the third group.

- Compute the average RT and accuracy for each subject in the big dataframe you just created. You already did this in 1.b). In the end this dataframe will have 8 * 11 rows.

- In the first column plot the average RT and accuracy for the 8 subjects from your group's data. Use a violinplot and split go/no-go conditions.

- In the second column plot the average RT and accuracy for all 88 (8 * 11) subjects from all data. Use a violinplot and split go/no-go conditions.

- Do you see any difference between the first column and the second column? What does this tell us about the central limit theorem (CLT)?
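
The SubjectID convention from the list above could be applied with a small helper like the following sketch. The function `unify_subject_ids` is hypothetical, and it assumes that the subjects of a group can simply be numbered 1, 2, 3, ... in sorted order of their original IDs.

```python
import pandas as pd

def unify_subject_ids(df, group_number):
    """Return a copy of df whose SubjectID is group number + "00" + subject number.

    Hypothetical helper: subjects are numbered 1, 2, 3, ... in sorted order
    of their original IDs, which is an assumption about the group's data.
    """
    out = df.copy()
    original_ids = sorted(out["SubjectID"].unique())
    subject_number = {old: i for i, old in enumerate(original_ids, start=1)}
    out["SubjectID"] = out["SubjectID"].map(
        lambda old: int(f"{group_number}00{subject_number[old]}")
    )
    return out

# Made-up data for a group that used its own IDs ("s01", "s02").
group3 = pd.DataFrame({"SubjectID": ["s02", "s01", "s02"], "RT": [310, 295, 402]})
unified = unify_subject_ids(group3, 3)
print(unified["SubjectID"].tolist())
```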

In [None]:
# again create a concatenated dataframe over all (averaged) groups.
# Don't forget to modify the Subject ID
# TODO

# Now it's time to plot your results
figs, axes = plt.subplots(nrows=2, ncols=2, sharey="row")

# violin plot for your group's data
# TODO

# violin plot of all groups' data
# TODO


Compare the two datasets and relate your observations to the CLT. Write your interpretation here.
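
As background, the following sketch simulates one consequence of the CLT on made-up data, not on the experiment's data. It draws skewed (exponential) samples and compares how much the sample means vary for a small and a larger sample size. The distribution, seed, and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def spread_of_sample_means(sample_size, repetitions=5000):
    """Standard deviation of the means of `repetitions` samples of the given size."""
    # Exponential data with scale 1.0 is skewed; its mean and standard deviation are both 1.
    samples = rng.exponential(scale=1.0, size=(repetitions, sample_size))
    return samples.mean(axis=1).std()

small = spread_of_sample_means(8)    # a small sample
large = spread_of_sample_means(88)   # a larger sample

# For standard deviation 1, theory predicts a spread of about 1 / sqrt(n).
print(f"n=8:  spread of means = {small:.3f}, theory = {1 / np.sqrt(8):.3f}")
print(f"n=88: spread of means = {large:.3f}, theory = {1 / np.sqrt(88):.3f}")
```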

### c) Scatterplot [1 pt]

Make a scatterplot comparing RT and accuracy. Do you see any correlation?
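
To back up a visual impression with a number, you could also compute the Pearson correlation coefficient. This is a minimal sketch on made-up averages; the column names follow assignment 1.b), the values are invented.

```python
import pandas as pd

# Made-up per-subject averages; the values are for illustration only.
averaged = pd.DataFrame({
    "RT":       [320.0, 340.0, 355.0, 360.0, 435.0],
    "accuracy": [0.70, 0.85, 0.90, 0.95, 1.00],
})

# Pearson correlation coefficient between the two columns (between -1 and 1).
r = averaged["RT"].corr(averaged["accuracy"])
print(f"Pearson r = {r:.2f}")
```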

In [None]:
# TODO