# Primer on Python - Part 1 of 3

This Jupyter notebook is a primer on Python coding for data analysis, written for the Cognitive Neuroimaging Skills Training In Cambridge (COGNESTIC) summer school. The focus is on getting started with Python with minimal jargon, so we will use two common scenarios in cognitive neuroscience – working with fMRI events files and working with correlation matrices – as illustrative examples. Along the way, we will cover basic programming concepts such as variables, loops, conditional statements, functions, etc., and at the end of each part, we will put the different snippets together in the form of a complete script. We will use Python (3.11+) and popular libraries pandas, numpy, matplotlib and seaborn. 

In this first part, we will use the **pandas** library to:
1. Get data from files
2. Access and manipulate data
3. Save outputs to files

Pandas is a Python library for working with tabular data sets. Its central object, the dataframe, is conceptually similar to a dataframe in R.

# 1. Read data from files

## At its simplest:

In [5]:
import pandas

pandas.read_csv('~/COGNESTIC/01_Primer_on_Python/data/sub-04/sub-04_ses-mri_task-facerecognition_run-01_events.tsv', sep='\t')

Unnamed: 0,onset,duration,circle_duration,stim_type,trigger,button_pushed,response_time,stim_file
0,0.000,0.983,0.492,SCRAMBLED,17,7,1.553,func/s008.bmp
1,3.240,0.971,0.488,SCRAMBLED,17,7,1.904,func/s004.bmp
2,6.498,0.832,0.409,SCRAMBLED,18,4,1.548,func/s004.bmp
3,9.538,0.898,0.441,FAMOUS,5,4,1.685,func/f002.bmp
4,12.644,0.993,0.445,FAMOUS,6,4,1.482,func/f002.bmp
...,...,...,...,...,...,...,...,...
95,384.952,0.825,0.475,SCRAMBLED,19,7,1.136,func/s013.bmp
96,388.008,0.931,0.464,UNFAMILIAR,13,7,1.230,func/u015.bmp
97,391.266,0.977,0.552,UNFAMILIAR,13,7,0.907,func/u011.bmp
98,394.506,0.852,0.491,UNFAMILIAR,14,7,1.018,func/u011.bmp


## Variables in Python:

First, let's use variables to define the paths to the project folder, the input folder and the results folder. This makes it easy to reuse a script or to update it if you need to rename folders or move files around.

Use a consistent naming style. The preferred format for variable names in Python is **lowercase_with_underscores**.

In [8]:
project = '~/COGNESTIC/01_Primer_on_Python/'

inpath = project + 'data/'
respath = project + 'outputs/'

print(inpath, respath)

~/COGNESTIC/01_Primer_on_Python/data/ ~/COGNESTIC/01_Primer_on_Python/outputs/


In [9]:
# The code to read a file, now with variables:

import pandas as pd

filename = 'sub-04/sub-04_ses-mri_task-facerecognition_run-01_events.tsv'

df = pd.read_csv(inpath + filename, sep='\t')

type(df)

pandas.core.frame.DataFrame
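
Joining paths with `+` only works when every folder string ends with a slash. As a hedged alternative, the standard-library `pathlib` module inserts the separators for you. The sketch below reuses the folder names from the cells above and only builds the paths; it does not read any files.

```python
from pathlib import Path

# Same folders as above, joined with the / operator instead of +
project = Path('~/COGNESTIC/01_Primer_on_Python')
inpath = project / 'data'
respath = project / 'outputs'

print(inpath)
print(respath)

# expanduser() replaces ~ with your home folder when you need the full path
print(inpath.expanduser())
```

If you switch to `Path` objects, keep joining with `/` (for example `inpath / filename`), because `+` between a `Path` and a string raises an error.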

## Looping:

Our subject has multiple files, one per run, so let's read each one and combine them.

In [11]:
# Structure of a loop

for i in range(9):
    print(i)    

0
1
2
3
4
5
6
7
8


In [12]:
# Print all subject file names

for i in range(1, 10):
    print('sub-04_ses-mri_task-facerecognition_run-0' + str(i) + '_events.tsv')

sub-04_ses-mri_task-facerecognition_run-01_events.tsv
sub-04_ses-mri_task-facerecognition_run-02_events.tsv
sub-04_ses-mri_task-facerecognition_run-03_events.tsv
sub-04_ses-mri_task-facerecognition_run-04_events.tsv
sub-04_ses-mri_task-facerecognition_run-05_events.tsv
sub-04_ses-mri_task-facerecognition_run-06_events.tsv
sub-04_ses-mri_task-facerecognition_run-07_events.tsv
sub-04_ses-mri_task-facerecognition_run-08_events.tsv
sub-04_ses-mri_task-facerecognition_run-09_events.tsv


In [13]:
# Skip some files in loop: for instance if you know sub-04 is missing the 2nd run

for i in range(1, 10):
    if i == 2:
        continue
    else:
        print('sub-04_ses-mri_task-facerecognition_run-0' + str(i) + '_events.tsv')

sub-04_ses-mri_task-facerecognition_run-01_events.tsv
sub-04_ses-mri_task-facerecognition_run-03_events.tsv
sub-04_ses-mri_task-facerecognition_run-04_events.tsv
sub-04_ses-mri_task-facerecognition_run-05_events.tsv
sub-04_ses-mri_task-facerecognition_run-06_events.tsv
sub-04_ses-mri_task-facerecognition_run-07_events.tsv
sub-04_ses-mri_task-facerecognition_run-08_events.tsv
sub-04_ses-mri_task-facerecognition_run-09_events.tsv


In [14]:
# To combine data from all the files, we will:
# 1. Create an empty dataframe for all the data for our subject ("sub_data")
# 2. Loop over each individual file and add its data to sub_data
    
sub_data = pd.DataFrame()

for i in range(9):
    run_file = 'sub-04/sub-04_ses-mri_task-facerecognition_run-0' + str(i+1) +'_events.tsv'
    run_data = pd.read_csv(inpath + run_file, sep='\t')
    sub_data = pd.concat([sub_data, run_data])

print(sub_data)

      onset  duration  circle_duration   stim_type  trigger  button_pushed  \
0     0.000     0.983            0.492   SCRAMBLED       17            7.0   
1     3.240     0.971            0.488   SCRAMBLED       17            7.0   
2     6.498     0.832            0.409   SCRAMBLED       18            4.0   
3     9.538     0.898            0.441      FAMOUS        5            4.0   
4    12.644     0.993            0.445      FAMOUS        6            4.0   
..      ...       ...              ...         ...      ...            ...   
93  379.122     0.859            0.421  UNFAMILIAR       13            4.0   
94  382.229     0.895            0.468  UNFAMILIAR       14            4.0   
95  385.353     0.982            0.455      FAMOUS        5            4.0   
96  388.644     0.999            0.551      FAMOUS        7            7.0   
97  391.450     0.012            0.000         NaN      999            0.0   

    response_time        stim_file  
0           1.553    func/
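
Note that `pd.concat` keeps each file's own row labels, so the labels restart at 0 for every run. A common variation, sketched here with two tiny made-up runs instead of the real files, is to collect the runs in a list and concatenate once with `ignore_index=True`. The contents of `fake_runs` are invented purely for illustration.

```python
import io
import pandas as pd

# Two made-up runs standing in for the real events files (illustrative values only)
fake_runs = [
    "onset\tstim_type\n0.0\tFAMOUS\n3.2\tSCRAMBLED\n",
    "onset\tstim_type\n0.0\tUNFAMILIAR\n3.1\tFAMOUS\n",
]

run_list = []
for i, text in enumerate(fake_runs, start=1):
    # io.StringIO lets read_csv treat a string as if it were a file
    run_data = pd.read_csv(io.StringIO(text), sep='\t')
    run_data['run'] = i          # remember which run each row came from
    run_list.append(run_data)

# Concatenate once; ignore_index=True renumbers the rows 0, 1, 2, ...
sub_data = pd.concat(run_list, ignore_index=True)
print(sub_data)
```

Unique row labels matter later on, because `df.loc[2, :]` would otherwise return one row per run rather than a single row.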

In [15]:
# Zero-padding: useful for some naming conventions

for i in range(1, 17):
    x = str(i)
    x = x.zfill(2)
    print('sub-' + x)

sub-01
sub-02
sub-03
sub-04
sub-05
sub-06
sub-07
sub-08
sub-09
sub-10
sub-11
sub-12
sub-13
sub-14
sub-15
sub-16
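
The same zero-padding can be written in one step with an f-string, which is a string prefixed with `f` that fills in the values placed inside curly brackets. This is just an alternative sketch of the loop above; the format code `02d` means "an integer, padded with zeros to a width of 2".

```python
for i in range(1, 17):
    # {i:02d} inserts i as a 2-digit, zero-padded integer
    name = f'sub-{i:02d}'
    print(name)

# The same idea works for the run number inside a file name
run_file = f'sub-04_ses-mri_task-facerecognition_run-{3:02d}_events.tsv'
print(run_file)
```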


# 2. Access and manipulate dataframes

## Examining the data:

First, we will examine the dataframe and see what we have imported

In [18]:
# Print the first n rows of the dataframe (5 rows if you do not specify n)

df.head(n=4)

Unnamed: 0,onset,duration,circle_duration,stim_type,trigger,button_pushed,response_time,stim_file
0,0.0,0.983,0.492,SCRAMBLED,17,7,1.553,func/s008.bmp
1,3.24,0.971,0.488,SCRAMBLED,17,7,1.904,func/s004.bmp
2,6.498,0.832,0.409,SCRAMBLED,18,4,1.548,func/s004.bmp
3,9.538,0.898,0.441,FAMOUS,5,4,1.685,func/f002.bmp


In [19]:
# Similarly, you can print the final rows (5 rows if you do not specify n)

df.tail(n=4)

Unnamed: 0,onset,duration,circle_duration,stim_type,trigger,button_pushed,response_time,stim_file
96,388.008,0.931,0.464,UNFAMILIAR,13,7,1.23,func/u015.bmp
97,391.266,0.977,0.552,UNFAMILIAR,13,7,0.907,func/u011.bmp
98,394.506,0.852,0.491,UNFAMILIAR,14,7,1.018,func/u011.bmp
99,397.128,0.012,0.0,,999,0,0.0,func/Circle.bmp


In [20]:
# How many rows and columns?

df.shape

(100, 8)

In [21]:
# What are the column names?

df.columns

Index(['onset', 'duration', 'circle_duration', 'stim_type', 'trigger',
       'button_pushed', 'response_time', 'stim_file'],
      dtype='object')

In [22]:
# What types of data?

df.dtypes

onset              float64
duration           float64
circle_duration    float64
stim_type           object
trigger              int64
button_pushed        int64
response_time      float64
stim_file           object
dtype: object

In [23]:
# Use a function to look at these together in a single command

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   onset            100 non-null    float64
 1   duration         100 non-null    float64
 2   circle_duration  100 non-null    float64
 3   stim_type        94 non-null     object 
 4   trigger          100 non-null    int64  
 5   button_pushed    100 non-null    int64  
 6   response_time    100 non-null    float64
 7   stim_file        100 non-null    object 
dtypes: float64(4), int64(2), object(2)
memory usage: 6.4+ KB


## Accessing dataframe:

In [25]:
# Get a specific column

df['stim_type']

0      SCRAMBLED
1      SCRAMBLED
2      SCRAMBLED
3         FAMOUS
4         FAMOUS
         ...    
95     SCRAMBLED
96    UNFAMILIAR
97    UNFAMILIAR
98    UNFAMILIAR
99           NaN
Name: stim_type, Length: 100, dtype: object

In [26]:
# Two columns

df[['stim_type', 'trigger']]

Unnamed: 0,stim_type,trigger
0,SCRAMBLED,17
1,SCRAMBLED,17
2,SCRAMBLED,18
3,FAMOUS,5
4,FAMOUS,6
...,...,...
95,SCRAMBLED,19
96,UNFAMILIAR,13
97,UNFAMILIAR,13
98,UNFAMILIAR,14


In [27]:
# Get a specific row

df.loc[2, :]

onset                      6.498
duration                   0.832
circle_duration            0.409
stim_type              SCRAMBLED
trigger                       18
button_pushed                  4
response_time              1.548
stim_file          func/s004.bmp
Name: 2, dtype: object

In [28]:
# Multiple rows

df.loc[2:4, :]

Unnamed: 0,onset,duration,circle_duration,stim_type,trigger,button_pushed,response_time,stim_file
2,6.498,0.832,0.409,SCRAMBLED,18,4,1.548,func/s004.bmp
3,9.538,0.898,0.441,FAMOUS,5,4,1.685,func/f002.bmp
4,12.644,0.993,0.445,FAMOUS,6,4,1.482,func/f002.bmp


In [29]:
# A specific cell by row and column names

df.loc[3, 'stim_type']

'FAMOUS'

In [30]:
# The same cell, using only integer positions (row 3, column 3)

df.iloc[3, 3]

'FAMOUS'
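
In the examples above `loc` and `iloc` return the same cell only because the row labels happen to match the row positions. The sketch below uses a small made-up table to show how they differ once rows have been removed (filtering is covered in the next section). All values in `toy` are invented for illustration.

```python
import pandas as pd

# A small made-up table (illustrative values only)
toy = pd.DataFrame({
    'stim_type': ['SCRAMBLED', 'FAMOUS', 'FAMOUS', 'UNFAMILIAR'],
    'response_time': [1.5, 0.9, 1.2, 0.8],
})

# Keep only the fast trials; the surviving rows keep their original labels 1 and 3
fast = toy[toy['response_time'] <= 1.0]
print(fast)

by_label = fast.loc[3, 'stim_type']    # the row whose LABEL is 3
by_position = fast.iloc[0, 0]          # the FIRST row, first column
print(by_label, by_position)

# Slices also differ: loc includes the end label, iloc excludes the end position
print(len(toy.loc[1:2]), len(toy.iloc[1:2]))
```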

## Cleaning data:

We saw previously that there were some NaN values in the stim_type column. Let's remove these.

In [32]:
# dropna() drops every row that contains at least one NaN value (somewhat reckless!)
# It returns a new dataframe and leaves df itself unchanged

df.dropna()

# Safer: drop a row only if ALL of its values are NaN

df.dropna(how='all')

Unnamed: 0,onset,duration,circle_duration,stim_type,trigger,button_pushed,response_time,stim_file
0,0.000,0.983,0.492,SCRAMBLED,17,7,1.553,func/s008.bmp
1,3.240,0.971,0.488,SCRAMBLED,17,7,1.904,func/s004.bmp
2,6.498,0.832,0.409,SCRAMBLED,18,4,1.548,func/s004.bmp
3,9.538,0.898,0.441,FAMOUS,5,4,1.685,func/f002.bmp
4,12.644,0.993,0.445,FAMOUS,6,4,1.482,func/f002.bmp
...,...,...,...,...,...,...,...,...
95,384.952,0.825,0.475,SCRAMBLED,19,7,1.136,func/s013.bmp
96,388.008,0.931,0.464,UNFAMILIAR,13,7,1.230,func/u015.bmp
97,391.266,0.977,0.552,UNFAMILIAR,13,7,0.907,func/u011.bmp
98,394.506,0.852,0.491,UNFAMILIAR,14,7,1.018,func/u011.bmp


In [33]:
# A more cautious approach, keep only non-NaNs in an important column like stim_type

df[df['stim_type'].notna()]

Unnamed: 0,onset,duration,circle_duration,stim_type,trigger,button_pushed,response_time,stim_file
0,0.000,0.983,0.492,SCRAMBLED,17,7,1.553,func/s008.bmp
1,3.240,0.971,0.488,SCRAMBLED,17,7,1.904,func/s004.bmp
2,6.498,0.832,0.409,SCRAMBLED,18,4,1.548,func/s004.bmp
3,9.538,0.898,0.441,FAMOUS,5,4,1.685,func/f002.bmp
4,12.644,0.993,0.445,FAMOUS,6,4,1.482,func/f002.bmp
...,...,...,...,...,...,...,...,...
94,381.861,0.853,0.478,FAMOUS,6,4,0.600,func/f007.bmp
95,384.952,0.825,0.475,SCRAMBLED,19,7,1.136,func/s013.bmp
96,388.008,0.931,0.464,UNFAMILIAR,13,7,1.230,func/u015.bmp
97,391.266,0.977,0.552,UNFAMILIAR,13,7,0.907,func/u011.bmp
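
The same cautious clean-up can also be written with `dropna` itself, by naming the columns that matter in its `subset` argument. The sketch below uses a small made-up table rather than the real events file, so the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up table with missing values in two different columns
toy = pd.DataFrame({
    'stim_type': ['FAMOUS', np.nan, 'SCRAMBLED'],
    'response_time': [0.9, 0.0, np.nan],
})

# Drop rows only when stim_type is missing; NaN in other columns is tolerated
cleaned = toy.dropna(subset=['stim_type'])
print(cleaned)

# toy itself is unchanged, because the result was stored in a new variable
print(len(toy), len(cleaned))
```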


In [34]:
# Remove rows with response times over 1 second

new_df = df[df['response_time'] <= 1.0]

new_df.head()

Unnamed: 0,onset,duration,circle_duration,stim_type,trigger,button_pushed,response_time,stim_file
7,22.483,0.946,0.5,SCRAMBLED,18,7,0.955,func/s001.bmp
8,25.774,0.998,0.584,SCRAMBLED,17,7,0.826,func/s010.bmp
14,61.553,0.848,0.57,UNFAMILIAR,14,7,0.861,func/u010.bmp
15,64.609,0.844,0.445,FAMOUS,5,7,0.976,func/f012.bmp
18,73.947,0.81,0.463,FAMOUS,6,7,0.75,func/f003.bmp


## Summarising data:

In [36]:
# Get mean and median of response time

x = df['response_time'].mean()
y = df['response_time'].median()

print(x,y)

1.99844 1.0070000000000001


In [37]:
# Tabulate the values in a specific column

df['stim_type'].value_counts()

stim_type
FAMOUS        32
SCRAMBLED     31
UNFAMILIAR    31
Name: count, dtype: int64

In [38]:
# Get mean RT for each type of stimulus

df.groupby('stim_type')['response_time'].mean()

stim_type
FAMOUS        1.093094
SCRAMBLED     1.072935
UNFAMILIAR    1.019484
Name: response_time, dtype: float64
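
If you want several summary measures per group at once, `agg` accepts a list of function names. This is a minimal sketch on a made-up table (the numbers are not from the real data), returning one row per stimulus type and one column per measure.

```python
import pandas as pd

# Made-up response times for two stimulus types
toy = pd.DataFrame({
    'stim_type': ['FAMOUS', 'FAMOUS', 'SCRAMBLED', 'SCRAMBLED'],
    'response_time': [1.0, 2.0, 3.0, 5.0],
})

# One row per stim_type, one column per summary measure
summary = toy.groupby('stim_type')['response_time'].agg(['mean', 'median', 'count'])
print(summary)
```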

### More about data types

Let's save some of these summary measures and dataframe properties as variables

To do that, we shall learn more about the data types used to store collections of data

In [40]:
# Pandas has two main data types: dataframes and series

print(type(df))
print(type(df['stim_type']))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [41]:
# Python has 4 built-in data types for variable collections: list, tuple, set, dictionary

#### I. List

List is an extremely versatile data type that you will see and use regularly

In [43]:
# Lists are created using [] or the function list()

x = [1, 2, 3]
column_list = list(df.columns)

print(type(x), x)
print(type(column_list), column_list)

<class 'list'> [1, 2, 3]
<class 'list'> ['onset', 'duration', 'circle_duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file']


In [44]:
# You can easily combine lists of different variable types

y = x + column_list + [list(range(4))*2]
print(y)

[1, 2, 3, 'onset', 'duration', 'circle_duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file', [0, 1, 2, 3, 0, 1, 2, 3]]


In [45]:
## You can add items

# Add to end
column_list.append('accuracy')
print(column_list)

# Add at specific index
column_list.insert(1, 'group')
print(column_list)

['onset', 'duration', 'circle_duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file', 'accuracy']
['onset', 'group', 'duration', 'circle_duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file', 'accuracy']


In [46]:
## You can remove items

# Remove first occurrence
column_list.remove('onset')
print(column_list)

# Remove by index (here from the combined list y created earlier)
y.pop(5)
print(y)

['group', 'duration', 'circle_duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file', 'accuracy']
[1, 2, 3, 'onset', 'duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file', [0, 1, 2, 3, 0, 1, 2, 3]]


In [47]:
# Replace an item by its index: change 'group' to 'subjects'

column_list[0] = 'subjects'
print(column_list)

['subjects', 'duration', 'circle_duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file', 'accuracy']


In [48]:
# Loop through list and print each name

for col in column_list:
    print(col)

subjects
duration
circle_duration
stim_type
trigger
button_pushed
response_time
stim_file
accuracy


In [49]:
# Loop through list and print index and name

for i, col in enumerate(column_list):
    print(i, col)

0 subjects
1 duration
2 circle_duration
3 stim_type
4 trigger
5 button_pushed
6 response_time
7 stim_file
8 accuracy


In [50]:
# Loop through list and print only column names containing underscores

for col in column_list:
    if ('_' in col):
        print(col)

circle_duration
stim_type
button_pushed
response_time
stim_file


In [51]:
# This can be shortened to a single line, called a list comprehension:

[col for col in column_list if '_' in col]

['circle_duration', 'stim_type', 'button_pushed', 'response_time', 'stim_file']

In [52]:
# Syntax: [expression for item in list if condition]

# Print first two letters of column names containing underscore
[col[:2] for col in column_list if '_' in col]

['ci', 'st', 'bu', 're', 'st']

In [158]:
# Another example: first 3 letters of column names longer than 3 characters
[col[:3] for col in column_list if len(col)>3]

['sub', 'dur', 'cir', 'sti', 'tri', 'but', 'res', 'sti', 'acc']

In [55]:
# IMPORTANT: assigning a list to a new variable does NOT copy it, even though it looks as if it does

x = column_list
print(x)

['subjects', 'duration', 'circle_duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file', 'accuracy']


In [56]:
# x is only a reference to the original list, and any changes to x WILL ALSO CHANGE the original list

x.append('dummy_column')
print(column_list)

['subjects', 'duration', 'circle_duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file', 'accuracy', 'dummy_column']


In [57]:
# To create a copy of a list, use the copy method

x = column_list.copy()

x.append('dummy2')
print(x)
print(column_list)

['subjects', 'duration', 'circle_duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file', 'accuracy', 'dummy_column', 'dummy2']
['subjects', 'duration', 'circle_duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file', 'accuracy', 'dummy_column']


#### II. Tuple

You won't use tuples much, but you will see them from time to time, so it's useful to understand their properties.

In [59]:
# Tuples are ordered and unchangeable

x = df.shape

print(x)
print(type(x))
print(x[0])

(100, 8)
<class 'tuple'>
100
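
Because a tuple is ordered, its items can be assigned to separate variables in one line, which is often called unpacking. The sketch below hard-codes the shape shown above instead of reading the file; the names `n_rows` and `n_cols` are just illustrative.

```python
shape = (100, 8)   # the value that df.shape returned above

# Unpack the tuple: first item to n_rows, second item to n_cols
n_rows, n_cols = shape
print(n_rows, 'rows and', n_cols, 'columns')
```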


In [60]:
# You cannot modify a tuple: uncommenting the line below raises a TypeError

#x[0] = 5

In [61]:
# You cannot add items to a tuple: it has no append method, so the line below raises an AttributeError

#x.append(5)

#### III. Set

Sets are unordered and cannot have duplicate elements

You can perform set operations

In [63]:
# Get unique values of a collection

stim_list = set(df['stim_type'])

print(stim_list)

{'SCRAMBLED', 'FAMOUS', 'UNFAMILIAR', nan}


In [64]:
# Create a new set of the desired stimulus types

stim_keep = {'FAMOUS', 'UNFAMILIAR'}

print(stim_keep)

{'FAMOUS', 'UNFAMILIAR'}


In [65]:
# Perform set operations, for example:

print(stim_list & stim_keep) # Items present in both sets
print(stim_list ^ stim_keep) # Items present in exactly one of the two sets

{'FAMOUS', 'UNFAMILIAR'}
{'SCRAMBLED', nan}
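
A set of wanted values is also a convenient way to select rows of a dataframe. The sketch below, on a made-up table with invented values, uses the pandas method `isin` for this and shows one more set operation, the difference (`-`).

```python
import numpy as np
import pandas as pd

# Made-up table (illustrative values only)
toy = pd.DataFrame({
    'stim_type': ['FAMOUS', 'SCRAMBLED', 'UNFAMILIAR', np.nan],
    'response_time': [0.9, 1.1, 0.8, 0.0],
})

stim_keep = {'FAMOUS', 'UNFAMILIAR'}

# isin() is True for rows whose stim_type is one of the wanted values
faces = toy[toy['stim_type'].isin(stim_keep)]
print(faces)

# Set difference: items in the first set that are not in the second
all_stim = {'SCRAMBLED', 'FAMOUS', 'UNFAMILIAR'}
print(all_stim - stim_keep)
```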


#### IV. Dictionary

Dictionaries store data in key:value pairs.

They are a popular way to keep related values together under descriptive names.

In [67]:
# Save summary values of df

df_summary = {
    'subject' : 'sub-04',
    'mean': df['response_time'].mean()
}

print(type(df_summary), df_summary)

<class 'dict'> {'subject': 'sub-04', 'mean': 1.99844}


In [68]:
# Add entry to dictionary

df_summary['median'] = df['response_time'].median()

print(type(df_summary), df_summary)

<class 'dict'> {'subject': 'sub-04', 'mean': 1.99844, 'median': 1.0070000000000001}


In [69]:
## Loop through dictionary

# Using the keys
for key in df_summary.keys():
    print(key)

subject
mean
median


In [70]:
## Loop through dictionary

# Using the values
for value in df_summary.values():
    print(value)

sub-04
1.99844
1.0070000000000001
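
Two further points are worth a quick sketch: `items()` loops over keys and values together, and dictionaries behave like lists when assigned to a new variable (the new name refers to the same dictionary unless you use `copy`). The numbers below are copied from the outputs above rather than recomputed, and `alias` and `backup` are illustrative names.

```python
# Values copied from the outputs above
df_summary = {'subject': 'sub-04', 'mean': 1.99844}

alias = df_summary            # a second name for the SAME dictionary
backup = df_summary.copy()    # an independent copy

alias['median'] = 1.007       # this also changes df_summary, but not backup
print(df_summary)
print(backup)

# items() gives the key and the value together
for key, value in df_summary.items():
    print(key, value)
```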


In [71]:
# IMPORTANT: to make a copy of a dictionary, use the copy method; do not assign it directly to a new variable

## Editing dataframe:

Sometimes, we want to add columns or update values

In [73]:
## Adding columns

# Let's add a subject column
df['subject'] = 'sub-04'

# Add a column with the run number
df['run'] = 1

df.head()

Unnamed: 0,onset,duration,circle_duration,stim_type,trigger,button_pushed,response_time,stim_file,subject,run
0,0.0,0.983,0.492,SCRAMBLED,17,7,1.553,func/s008.bmp,sub-04,1
1,3.24,0.971,0.488,SCRAMBLED,17,7,1.904,func/s004.bmp,sub-04,1
2,6.498,0.832,0.409,SCRAMBLED,18,4,1.548,func/s004.bmp,sub-04,1
3,9.538,0.898,0.441,FAMOUS,5,4,1.685,func/f002.bmp,sub-04,1
4,12.644,0.993,0.445,FAMOUS,6,4,1.482,func/f002.bmp,sub-04,1


In [74]:
## Rearranging columns

# Let's re-arrange the columns into a more pleasing order
df = df[['subject', 'run', 'onset', 'duration', 'circle_duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file']]

df.head()

Unnamed: 0,subject,run,onset,duration,circle_duration,stim_type,trigger,button_pushed,response_time,stim_file
0,sub-04,1,0.0,0.983,0.492,SCRAMBLED,17,7,1.553,func/s008.bmp
1,sub-04,1,3.24,0.971,0.488,SCRAMBLED,17,7,1.904,func/s004.bmp
2,sub-04,1,6.498,0.832,0.409,SCRAMBLED,18,4,1.548,func/s004.bmp
3,sub-04,1,9.538,0.898,0.441,FAMOUS,5,4,1.685,func/f002.bmp
4,sub-04,1,12.644,0.993,0.445,FAMOUS,6,4,1.482,func/f002.bmp


In [75]:
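# We can also drop some of the columns altogether by selecting only the ones we want to keep
# .copy() makes new_df an independent dataframe, so editing it later cannot affect df

new_df = df[['subject', 'run', 'stim_type', 'response_time', 'stim_file']].copy()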

new_df.head()

Unnamed: 0,subject,run,stim_type,response_time,stim_file
0,sub-04,1,SCRAMBLED,1.553,func/s008.bmp
1,sub-04,1,SCRAMBLED,1.904,func/s004.bmp
2,sub-04,1,SCRAMBLED,1.548,func/s004.bmp
3,sub-04,1,FAMOUS,1.685,func/f002.bmp
4,sub-04,1,FAMOUS,1.482,func/f002.bmp


In [76]:
## Editing values

# Change the units of a column, for instance convert response_time from seconds to milliseconds
new_df.loc[:, 'response_time'] *= 1000

new_df.head()

Unnamed: 0,subject,run,stim_type,response_time,stim_file
0,sub-04,1,SCRAMBLED,1553.0,func/s008.bmp
1,sub-04,1,SCRAMBLED,1904.0,func/s004.bmp
2,sub-04,1,SCRAMBLED,1548.0,func/s004.bmp
3,sub-04,1,FAMOUS,1685.0,func/f002.bmp
4,sub-04,1,FAMOUS,1482.0,func/f002.bmp


In [77]:
## Sorting 

# We could also sort the data by the stim_type column
new_df.sort_values(by='stim_type')


Unnamed: 0,subject,run,stim_type,response_time,stim_file
72,sub-04,1,FAMOUS,1023.0,func/f004.bmp
52,sub-04,1,FAMOUS,852.0,func/f008.bmp
25,sub-04,1,FAMOUS,981.0,func/f012.bmp
71,sub-04,1,FAMOUS,767.0,func/f010.bmp
70,sub-04,1,FAMOUS,1166.0,func/f010.bmp
...,...,...,...,...,...
29,sub-04,1,,20000.0,func/i999.bmp
50,sub-04,1,,20000.0,func/i999.bmp
69,sub-04,1,,20000.0,func/i999.bmp
89,sub-04,1,,20000.0,func/i999.bmp


In [78]:
# Or by two columns

new_df.sort_values(by=['stim_type', 'response_time'])

Unnamed: 0,subject,run,stim_type,response_time,stim_file
94,sub-04,1,FAMOUS,600.0,func/f007.bmp
73,sub-04,1,FAMOUS,747.0,func/f004.bmp
18,sub-04,1,FAMOUS,750.0,func/f003.bmp
71,sub-04,1,FAMOUS,767.0,func/f010.bmp
91,sub-04,1,FAMOUS,787.0,func/f005.bmp
...,...,...,...,...,...
11,sub-04,1,,20000.0,func/i999.bmp
29,sub-04,1,,20000.0,func/i999.bmp
50,sub-04,1,,20000.0,func/i999.bmp
69,sub-04,1,,20000.0,func/i999.bmp
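
`sort_values` sorts in ascending order and puts missing values last by default, which is why the NaN rows appear at the bottom above. As a hedged sketch on a made-up table, the `ascending` and `na_position` arguments change both behaviours; note that the result is a new dataframe and the original stays in its original order.

```python
import numpy as np
import pandas as pd

# Made-up table (illustrative values only)
toy = pd.DataFrame({
    'stim_type': ['SCRAMBLED', np.nan, 'FAMOUS', 'FAMOUS'],
    'response_time': [900.0, 20000.0, 750.0, 1200.0],
})

# stim_type A to Z, slowest response first within each type, missing stim_type at the top
sorted_toy = toy.sort_values(
    by=['stim_type', 'response_time'],
    ascending=[True, False],
    na_position='first',
)
print(sorted_toy)
```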


# 3. Saving data externally

We simply use a pandas function to save our dataframe. You can view this file outside Python using a text editor or Excel, etc.

In [80]:
# This time, let's export to the more common csv format. We also specify that we don't want an extra index column in the file.

new_df.to_csv(respath + 'just_an_example.csv', index=False)
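
A quick way to check an export is to read the file straight back in. The sketch below does this with a made-up dataframe and a temporary folder, so it does not touch the project's outputs folder; the file name `toy_example.tsv` is hypothetical. It also shows that `to_csv` can write tab-separated files when given `sep='\t'`.

```python
import os
import tempfile
import pandas as pd

# Made-up data (illustrative values only)
toy = pd.DataFrame({
    'subject': ['sub-04', 'sub-04'],
    'response_time': [1553.0, 1904.0],
})

# Write into a temporary folder so nothing is left behind on disk
with tempfile.TemporaryDirectory() as tmp:
    out_file = os.path.join(tmp, 'toy_example.tsv')

    # sep='\t' writes tab-separated values; index=False leaves out the row labels
    toy.to_csv(out_file, sep='\t', index=False)

    # Read the file back in to check what was saved
    check = pd.read_csv(out_file, sep='\t')

print(check)
```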

# **Putting it all together**

Let's put what we've learnt so far into a coherent block of code, and we shall use the resulting dataframe for the next part of the primer on plotting data.

In [82]:
# --- Libraries --- #

import pandas as pd

# --- Set paths --- #

project = '~/COGNESTIC/01_Primer_on_Python/'
inpath = project + 'data/'
respath = project + 'outputs/'

# --- Get data from sixteen subjects (assume that sub-04 is missing run-05) --- #

all_data = pd.DataFrame()

# For each subject
for s in range(1, 17):

    # Zeropad integer to get subject name
    subject = 'sub-' + str(s).zfill(2)
    
    # For each run
    for i in range(1, 10):

        # If this is sub-04's 5th run, skip to the next run
        if subject == 'sub-04' and i == 5:
            continue

        # Get the file and read data
        run_file = subject + '/' + subject + '_ses-mri_task-facerecognition_run-0' + str(i) +'_events.tsv'
        run_data = pd.read_csv(inpath + run_file, sep='\t')

        # Add columns for subject and run
        run_data['subject'] = subject
        run_data['run'] = i
        
        # Add to main dataframe
        all_data = pd.concat([all_data, run_data])
        
# --- Clean data --- #

# Remove any rows without a valid stim_type or button_pushed value
all_data = all_data[all_data['stim_type'].notna() & all_data['button_pushed'].notna()]

# Remove response times over 1 second
all_data = all_data[all_data['response_time'] <= 1.0]

# --- Output collated data to a single csv file --- #

# Export to csv file
all_data.to_csv(respath + 'events_16sub.csv', index=False)

# --- Print out summary: mean RT for each stimulus type --- #

all_data.groupby('stim_type')['response_time'].mean()


stim_type
FAMOUS        0.770255
SCRAMBLED     0.781195
UNFAMILIAR    0.762843
Name: response_time, dtype: float64