Osnabrück University - A&C: Computational Cognition (Summer Term 2019)

# Exercise Sheet 02: Basic statistics

## Introduction

This week's sheet should be solved and handed in by 14:00 on **Tuesday, April 30, 2019**. If you need help (and Google and other resources were not enough), feel free to contact your tutors. Please push your results to your GitHub group folder.

In this exercise sheet you will work with ```pandas``` and ```seaborn```. ```pandas``` is one of the most widely used tools for data processing. It takes data (such as a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a 'data frame', which looks very similar to a table in statistical software (think Excel or SPSS). ```pandas``` makes data processing a lot easier than working with lists and/or dictionaries through for-loops or list comprehensions.  
```seaborn``` is a library for making plots. It is based on ```matplotlib``` but offers more functions specialized for statistical visualization, and many people find its default styling more polished.
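
To make the idea of a data frame concrete, here is a minimal sketch that reads a tiny CSV from a string instead of a file. The values are made up for illustration; only the column names are borrowed from the data used later in this sheet.

```python
import io
import pandas as pd

# Hypothetical miniature dataset (invented values, real column names)
csv_text = """SubjectID,StimulusType,response,RT
1,1,1,320
1,0,0,0
2,1,1,298
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # number of rows and columns
print(df["RT"].mean())  # column-wise operations need no explicit loop
```

The same `pd.read_csv` call works with a path to a file on disk, which is what you will need in Assignment 1.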

Don't forget that we will also give **2 points** for nice coding style!

## Assignment 0: Peer review for sheet 01 [3 pts]

Beginning this week, you will have to peer review the other groups' solutions. Each group reviews the solutions of two other groups and gives points according to the given point distribution, based on the correctness of the solution. For these reviews the tutors will give you up to 3 points each week.

| * |Group 1|Group 2|Group 3|Group 4|Group 5|Group 6|Group 7|Group 8|Group 9|Group 10|Group 11|
| ------- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ------ | ------ |
| check solutions of group: | 10, 7 | 4, 9  | 1, 4  | 11, 1 | 8, 11 | 5, 3  | 9, 10 | 6, 5  | 3, 2  | 2, 8   | 7, 6   |

You should open an issue in the repositories of the groups you have to check. The title of the issue should be your group name (e.g. "Group 1"). In the issue, comment on what was good and bad, how many points they get, etc.  
Refer to https://guides.github.com/features/issues/ to learn more about issues.

## Assignment 1: Dataframes [4 pts]

In [1]:
# import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

```matplotlib``` and ```seaborn``` should already be installed in your environment. If not please run:
```sh
pip install seaborn
```

### a) Importing a csv file [2 pts]

Import the csv files of all subjects into one dataframe. Make sure that each row has a unique index. You might want to take a look at what ***pandas.concat*** does.<br>
Extra fun: Display the dataframe using the ***pandas.set_option*** function to show the data in a well-arranged way. Play around a little with the settings that you are allowed to change.<br>
Save ```df_concatenated```.
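
As a hint for the unique-index requirement, the sketch below shows what `pandas.concat` does with and without `ignore_index=True`. The two small dataframes are hypothetical stand-ins for per-subject files, and the display settings are arbitrary example values.

```python
import pandas as pd

# Two hypothetical per-subject tables, each with its own index 0..1
df_a = pd.DataFrame({"SubjectID": [1, 1], "RT": [320, 0]})
df_b = pd.DataFrame({"SubjectID": [2, 2], "RT": [298, 411]})

# Without ignore_index the original row labels are kept: 0, 1, 0, 1
stacked = pd.concat([df_a, df_b], axis=0)
# With ignore_index=True the rows are relabelled: 0, 1, 2, 3
unique = pd.concat([df_a, df_b], axis=0, ignore_index=True)

print(stacked.index.is_unique)
print(unique.index.is_unique)

# option_context changes display settings only inside the with-block;
# pd.set_option would change them for the rest of the session
with pd.option_context("display.max_rows", 10, "display.width", 120):
    print(unique)
```

Note that display options only affect output printed after they are set.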


In [2]:
import glob
import os

PATH = os.getcwd()+ "/Data"
all_files = glob.glob(os.path.join(PATH, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent
                                                       # http://www.oipapio.com/question-88634

#iterates through files in all_files list and creates a dataframe each
df_to_concat = [pd.read_csv(f) for f in all_files]
#iterates through list of dataframes and overwrites the SubjectID column with a running number
for count, x in enumerate(df_to_concat, start=1):
    x['SubjectID'] = count

df_concatenated = pd.concat(df_to_concat, axis=0, ignore_index = True)
print("df_concatenated: \n", df_concatenated)

#some display settings to make it look nice (not all data will be displayed)
#displayed rows depends on number of subjects (which also influences length of list)
pd.set_option('display.max_rows', len(all_files)*5)

# save concatenated dataframe
DATAPATH = os.path.join(os.getcwd(),'Processed', 'data_concatenated.csv')
# making sure that directory exists
if not os.path.isdir(os.path.join(os.getcwd(), 'Processed')):
    os.mkdir(os.path.join(os.getcwd(), 'Processed'))

# save concatenated file
df_concatenated.to_csv(DATAPATH)


### b) Working with dataframes [2 pts]

- Add a column called "congruence" to ```df_concatenated```. The column should have the value *True* if "StimulusType" and "response" match. Otherwise the column should have the value *False*.

- Create a new dataframe which has "SubjectID", "StimulusType", "RT" and "congruence" as columns. For each combination of "SubjectID" and "StimulusType" (e.g. "7001" and "0") compute the average RT and congruence level.

- When computing the average RT, omit all reaction times which are 0, as these would distort the mean.

- Rename "congruence" as "accuracy" and save the dataframe as a csv file. 
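
The steps above can also be sketched without explicit loops. This is one possible approach on a small made-up table, not the required solution; the trial values are invented, and only the column names come from the task.

```python
import numpy as np
import pandas as pd

# Hypothetical trials for two subjects (invented values)
trials = pd.DataFrame({
    "SubjectID":    [1, 1, 1, 1, 2, 2],
    "StimulusType": [1, 1, 0, 0, 1, 0],
    "response":     [1, 1, 0, 1, 1, 0],
    "RT":           [300, 340, 0, 250, 280, 0],
})

# True where stimulus and response agree
trials["congruence"] = trials["StimulusType"] == trials["response"]

# Turn RT == 0 into NaN; mean() skips NaN, so these trials are omitted
trials["RT"] = trials["RT"].replace(0, np.nan)

# One row per (SubjectID, StimulusType) with the mean of both columns
averaged = (
    trials.groupby(["SubjectID", "StimulusType"], as_index=False)[["RT", "congruence"]]
    .mean()
    .rename(columns={"congruence": "accuracy"})
)
print(averaged)
```

In this sketch a condition with no responses at all keeps `NaN` as its mean RT. Whether to leave it as `NaN` or replace it with 0 is a design decision you should state in your solution.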

In [19]:
#add a column "congruence" that has value true if Stimulus and response hold the same value
df_concatenated['congruence'] = (df_concatenated['StimulusType'] == df_concatenated['response'])

#create a new dataframe with averaged data
#create columns
df_concatenated_avg = pd.DataFrame(columns=['SubjectID', 'StimulusType', 'RT', 'congruence'])

#group dataframe by SubjectID and StimulusType (for each Subject we have 2 types of Stimuli 0 - 1)
grouped = df_concatenated.groupby(['SubjectID','StimulusType'])

#helper for adding rows to dataframe
row = 0

#we have 2 groups per subject (one group with stimulus 0 and one with stimulus 1)
#thus the number of subjects is half the number of created groups
for subject in range(1, (len(grouped)//2)+1):
    #for each subject there are two groups / stimulus conditions
    print("sub:", subject)
    for stimType in range(2):
        #get the groups with current subject index and stimulus Type
        print("stim:", stimType)
        group = grouped.get_group((subject,stimType))
        #computing the mean of the reaction time
        #the zeros have to be replaced since we don't want them to manipulate the mean
        meanRT = group.RT.replace(0, np.nan).mean()
        #computing the congruence mean
        meanCongruence = group.congruence.mean()
        #if all reaction time values are zero, the mean will be NaN due to our conversion measure
        #thus we have to replace the NaN value with zero again since we only want numeric values
        if(np.isnan(meanRT)):
            meanRT = 0
        #adding the new row to the dataframe while rounding the mean to 2 digits after the decimal point
        df_concatenated_avg.loc[row] = [subject, stimType, round(meanRT,2), meanCongruence]
        #increasing the row index to make sure that new rows are added while looping through the loops
        row = row + 1

#renaming the column called congruence to accuracy
df_concatenated_avg.rename(columns={'congruence':'accuracy'}, inplace=True)

print(df_concatenated_avg)

# save averaged dataframe
DATAPATH = os.getcwd() + '/Processed/data_concatenated_averaged.csv'
df_concatenated_avg.to_csv(DATAPATH)

sub: 1
stim: 0
stim: 1
sub: 2
stim: 0
stim: 1
sub: 3
stim: 0
stim: 1
sub: 4
stim: 0
stim: 1
sub: 5
stim: 0
stim: 1
sub: 6
stim: 0
stim: 1
    SubjectID  StimulusType      RT  accuracy
0         1.0           0.0  316.67      0.85
1         1.0           1.0  357.00      1.00
2         2.0           0.0  263.45      0.45
3         2.0           1.0  343.05      1.00
4         3.0           0.0  181.00      0.85
5         3.0           1.0  330.24      1.00
6         4.0           0.0  339.00      0.95
7         4.0           1.0  339.01      1.00
8         5.0           0.0  271.00      0.95
9         5.0           1.0  299.06      1.00
10        6.0           0.0  254.00      0.90
11        6.0           1.0  311.38      1.00


## Assignment 2: Statistical plotting [6 pts]

### a) Boxplot and Violinplot [2 pts]

Plot the RT of each trial for all subjects as a stripplot and a boxplot on top of each other. Do the same with a stripplot and a violinplot. Plot go trials as green dots and no-go trials as red dots. Reminder: don't forget to mask the data where RT=0. Make sure that the legends are informative (don't display duplicated legends).

In [None]:
print(df_concatenated_avg)
print(grouped)

In [None]:
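# read data
data_concat = pd.read_csv(os.path.join(os.getcwd(), "Processed", "data_concatenated.csv"))

# mask trials without a response (RT == 0) so that they are not plotted
data_concat['RT'] = data_concat['RT'].replace(0, np.nan)
# readable legend labels: StimulusType 1 = go, 0 = no-go
data_concat['Condition'] = data_concat['StimulusType'].map({1: 'go', 0: 'no-go'})
# helper column so that all trials share one position on the x-axis
data_concat['Trials'] = 'all subjects'
# fixed colours per condition: go trials green, no-go trials red
palette = {'go': 'g', 'no-go': 'r'}

# create two axes
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 8), sharey=True)

# first subplot: boxplot with the stripplot drawn on top
sns.boxplot(x='Trials', y='RT', data=data_concat, color='lightgray', ax=axes[0])
sns.stripplot(x='Trials', y='RT', hue='Condition', data=data_concat, palette=palette, ax=axes[0])

# second subplot: violinplot with the stripplot drawn on top
sns.violinplot(x='Trials', y='RT', data=data_concat, color='lightgray', inner=None, ax=axes[1])
sns.stripplot(x='Trials', y='RT', hue='Condition', data=data_concat, palette=palette, ax=axes[1])

# handling legends: both stripplots create the same legend, so keep only the first one
legend = axes[1].get_legend()
if legend is not None:
    legend.remove()

fig.tight_layout()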

### b) Violinplot combining all data of all groups [3 pts]

- Make a dataframe consisting of all data across groups. You already did this in 1.a). At the end, this dataframe should have 8 * 11 * 100 rows.

- Every group has used its own ID convention. Make sure that every data point follows this SubjectID system: group number + "00" + subject number.  
e.g. 3002 for the second subject of the third group.

- Compute average RT and accuracy for each subject in the big dataframe you just created. You already did this in 1.b). At the end this dataframe will have 8 * 11 rows.

- On the first column plot average RT and accuracy for 8 subjects from your group's data. Use violinplot and split go/no-go conditions.

- On the second column plot average RT and accuracy for 80 subjects from all data. Use violinplot and split go/no-go conditions.

- Do you see any difference between the first column and the second column? What does this tell us about the central limit theorem (CLT)?
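
To build some intuition for the CLT question, the following simulation compares how much the mean of a small sample (8 values) and of a larger sample (80 values) scatters across repetitions. It is only a sketch: the skewed "reaction time" distribution (200 ms plus an exponential with scale 100 ms) and the helper name `spread_of_sample_means` are assumptions made for illustration, not properties of the experimental data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def spread_of_sample_means(sample_size, n_repeats=2000):
    """Standard deviation of the sample mean across n_repeats simulated samples."""
    # Assumed skewed distribution: 200 ms offset plus exponential noise
    samples = 200 + rng.exponential(scale=100, size=(n_repeats, sample_size))
    # One mean per simulated sample, then the spread of those means
    return samples.mean(axis=1).std()

spread_small = spread_of_sample_means(8)
spread_large = spread_of_sample_means(80)

print("spread of means, n=8: ", spread_small)
print("spread of means, n=80:", spread_large)
```

The spread of the means shrinks roughly with the square root of the sample size, even though the underlying distribution is skewed. Keep this in mind when you compare the two columns of your plot.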

In [None]:
# again create a concatenated dataframe over all (averaged) groups.
# Don't forget to modify the Subject ID
all_groups = []
for groupNr in range(1, 12):
    print("x:", groupNr)
    PATH = os.getcwd()+ "/experimental_Data/Group_{}".format(groupNr)
    all_files_group = glob.glob(os.path.join(PATH, "*.csv")) 
    #iterates through files in all_files list and creates a dataframe each
    df_to_concat_group = [pd.read_csv(g) for g in all_files_group]
    #iterates through list of dataframes and modifies subjectIDs
    for y, subjectNr in zip(df_to_concat_group, range(1, 9)):
        print(subjectNr)
        y['SubjectID'] = int(str(groupNr) + "00" + str(subjectNr))
        # add the created dataframe to the list for all files
        all_groups.append(y)
        
df_concatenated_all = pd.concat(all_groups, axis=0, ignore_index = True)

# Now it's time to plot your results
figs, axes = plt.subplots(nrows=2, ncols=2, sharey="row")

# violin plot for your group's data
# TODO

# violin plot of all group's data
# TODO


x: 1
1
2
3
4
5
6
7
8
x: 2
1
2
3
4
5
6
7
8
x: 3
1
2
3
4
5
6
7
8


In [61]:
#add a column "accuracy" that is True if stimulus and response hold the same value
df_concatenated_all['accuracy'] = (df_concatenated_all['StimulusType'] == df_concatenated_all['response'])

#create a new dataframe with averaged data
df_concatenated_avg_all = pd.DataFrame(columns=['SubjectID', 'StimulusType', 'RT', 'accuracy'])

#group dataframe by SubjectID and StimulusType (for each subject we have 2 types of stimuli: 0 and 1)
grouped_all = df_concatenated_all.groupby(['SubjectID','StimulusType'])

#for adding rows to dataframe
row = 0
#iterate over the groups directly: each key is a (SubjectID, StimulusType) pair,
#so no SubjectID has to be reconstructed from a running counter
#(the reconstructed key was what raised "KeyError: (1002, 0)")
for (subject, stimType), group in grouped_all:
    #computing the mean of the reaction time
    #the zeros have to be replaced since we don't want them to manipulate the mean
    meanRT = group.RT.replace(0, np.nan).mean()
    #computing the accuracy mean (the column is called "accuracy" here, not "congruence")
    meanAccuracy = group.accuracy.mean()
    #if all reaction time values are zero, the mean will be NaN due to our conversion measure
    #thus we have to replace the NaN value with zero again since we only want numeric values
    if(np.isnan(meanRT)):
        meanRT = 0
    #adding the new row to the dataframe while rounding the mean to 2 digits after the decimal point
    df_concatenated_avg_all.loc[row] = [subject, stimType, round(meanRT,2), meanAccuracy]
    #increasing the row index to make sure that new rows are added while looping through the groups
    row = row + 1

Compare the two datasets and relate your observations to the CLT. Write your interpretation here.

### c) Scatterplot [1 pts]

Make a scatterplot comparing RT and accuracy. Do you see any correlation?
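
A visual impression of correlation can be backed up with a number. The sketch below computes the Pearson correlation coefficient with `Series.corr`; the per-subject averages are invented for illustration and are not results from the experiment.

```python
import pandas as pd

# Hypothetical per-subject averages (invented values)
averaged = pd.DataFrame({
    "RT":       [310.0, 295.5, 402.1, 350.8, 280.3, 365.0],
    "accuracy": [0.85, 0.80, 0.98, 0.93, 0.75, 0.95],
})

# Series.corr uses the Pearson coefficient by default;
# the result always lies between -1 and 1
r = averaged["RT"].corr(averaged["accuracy"])
print("Pearson r:", round(r, 2))
```

For the plot itself, `sns.scatterplot(x="RT", y="accuracy", data=averaged)` is one option, applied to your own averaged dataframe.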

In [None]:
# TODO