# Primer on Python - Part 1 of 3

This Jupyter notebook is a primer on Python coding for data analysis, written for the Cognitive Neuroimaging Skills Training In Cambridge (COGNESTIC) summer school. The focus is on getting started with Python with minimal jargon, so we will use two common scenarios in cognitive neuroscience – working with fMRI events files and working with correlation matrices – as illustrative examples. Along the way, we will cover basic programming concepts such as variables, loops, conditional statements, functions, etc., and at the end of each part, we will put the different snippets together in the form of a complete script. We will use Python (3.11+) and popular libraries pandas, numpy, matplotlib and seaborn. 

In this first part, we will use the **pandas** library to:
1. Get data from files
2. Access and manipulate data
3. Save outputs to files

Pandas is a Python library for working with tabular data sets. Its central object, the dataframe, is conceptually similar to a data frame in R.

# 1. Read data from files

## At its simplest

In [None]:
import pandas

pandas.read_csv('~/COGNESTIC/01_Primer_on_Python/data/sub-04/sub-04_ses-mri_task-facerecognition_run-01_events.tsv', sep='\t')

## Variables in Python

First, let's use variables to define the paths to the project folder, the input folder and the results folder. This makes it easy to reuse a script or to update it if you need to rename folders or move files around.

Use a consistent naming style. The preferred format for variable names is **lowercase_with_underscores**.

In [None]:
project = '~/COGNESTIC/01_Primer_on_Python/'

inpath = project + 'data/'
respath = project + 'outputs/'

print(inpath, respath)
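
String concatenation like this relies on every folder name ending in a slash. As one possible alternative, the standard-library `pathlib` module joins path parts with the `/` operator and handles the separators itself. The names `project_dir`, `data_dir` and `results_dir` below are illustrative, chosen so that they do not overwrite the variables used in the rest of this notebook.

```python
from pathlib import Path

# Same folders as above, built with pathlib instead of string concatenation
# (project_dir, data_dir and results_dir are illustrative names)
project_dir = Path('~/COGNESTIC/01_Primer_on_Python')
data_dir = project_dir / 'data'
results_dir = project_dir / 'outputs'

# Joining a further sub-folder works the same way
print(data_dir / 'sub-04')
print(results_dir.name)
```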

In [None]:
# The code to read a file, now with variables:

import pandas as pd

filename = 'sub-04/sub-04_ses-mri_task-facerecognition_run-01_events.tsv'

df = pd.read_csv(inpath + filename, sep='\t')

type(df)

## Looping

Our subject has multiple files, so let's read each one and combine them

In [None]:
# Structure of a loop

for i in range(9):
    print(i)    

In [None]:
# Print all subject file names

for i in range(1, 10):
    print('sub-04_ses-mri_task-facerecognition_run-0' + str(i) + '_events.tsv')

In [None]:
# Skip some files in loop: for instance if you know sub-04 is missing the 2nd run

for i in range(1, 10):
    if i == 2:
        continue
    else:
        print('sub-04_ses-mri_task-facerecognition_run-0' + str(i) + '_events.tsv')

In [None]:
# To combine data from all the files, we will:
# 1. Create an empty dataframe for all the data for our subject ("sub_data")
# 2. Loop over each individual file and add its data to sub_data
    
sub_data = pd.DataFrame()

for i in range(9):
    run_file = 'sub-04/sub-04_ses-mri_task-facerecognition_run-0' + str(i+1) +'_events.tsv'
    run_data = pd.read_csv(inpath + run_file, sep='\t')
    sub_data = pd.concat([sub_data, run_data])

print(sub_data)
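
Calling `pd.concat` inside the loop copies the growing dataframe on every pass. A common alternative is to collect the individual dataframes in a list and concatenate once at the end. The sketch below reads two small made-up tables from strings (via `io.StringIO`) in place of the real events files, so the column names and values are illustrative only.

```python
import io
import pandas as pd

# Two made-up tab-separated "files" standing in for two runs
fake_files = [
    "onset\tstim_type\n0.0\tFAMOUS\n3.1\tSCRAMBLED\n",
    "onset\tstim_type\n0.5\tUNFAMILIAR\n",
]

# Read each table and keep it in a list
run_list = []
for text in fake_files:
    run_list.append(pd.read_csv(io.StringIO(text), sep='\t'))

# Concatenate once; ignore_index=True renumbers the rows from 0
combined = pd.concat(run_list, ignore_index=True)
print(combined)
```

Without `ignore_index=True`, each run keeps its own row labels, so labels such as 0 and 1 would appear once per run in the combined dataframe.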

In [None]:
# Zero-padding: useful for some naming conventions

for i in range(1, 17):
    x = str(i)
    x = x.zfill(2)
    print('sub-' + x)
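
The same zero-padding can also be written with a formatted string (f-string), where the format specifier `02d` means "an integer, padded with zeros to a width of 2". This is a sketch of an equivalent approach; `sub_num` and `sub_label` are illustrative names.

```python
# Zero-pad the subject number inside an f-string
for sub_num in range(1, 17):
    sub_label = f'sub-{sub_num:02d}'
    print(sub_label)
```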

# 2. Access and manipulate dataframes

## Examining the data

First, we will examine the dataframe and see what we have imported

In [None]:
# Print the first n rows of the dataframe (5 rows if you do not specify n)

df.head(n=4)

In [None]:
# Similarly, you can print the final rows (5 rows if you do not specify n)

df.tail(n=4)

In [None]:
# How many rows and columns?

df.shape

In [None]:
# What are the column names?

df.columns

In [None]:
# What types of data?

df.dtypes

In [None]:
# Use a function to look at these together in a single command

df.info()

## Accessing dataframe

In [None]:
# Get a specific column

df['stim_type']

In [None]:
# Two columns

df[['stim_type', 'trigger']]

In [None]:
# Get a specific row

df.loc[2, :]

In [None]:
# Multiple rows

df.loc[2:4, :]

In [None]:
# A specific cell by row and column names

df.loc[3, 'stim_type']

In [None]:
# Using only the cell indices

df.iloc[3, 3]
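
One difference between `.loc` and `.iloc` is easy to miss: a `.loc` slice is based on labels and includes its end point, whereas an `.iloc` slice is based on positions and excludes its end point, like `range`. The sketch below uses a small made-up dataframe (`toy`) rather than the events file.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'onset': [0.0, 3.1, 6.4, 9.2],
    'stim_type': ['FAMOUS', 'UNFAMILIAR', 'FAMOUS', 'SCRAMBLED'],
})

by_label = toy.loc[1:2, 'stim_type']   # rows labelled 1 and 2 (end included)
by_position = toy.iloc[1:2, 1]         # row at position 1 only (end excluded)

print(len(by_label), len(by_position))
```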

## Cleaning data

We saw previously that there were some NaN values in the stim_type column. Let's remove these.

In [None]:
# Function to drop all rows with any NaN values (somewhat reckless!)

df.dropna()

# Best to use this only if ALL values in a row are NaN

df.dropna(how='all')

In [None]:
# A more cautious approach, keep only non-NaNs in an important column like stim_type

df[df['stim_type'].notna()]

In [None]:
# Remove rows with response times over 1 second

new_df = df[df['response_time'] <= 1.0]

new_df.head()
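
Note that any comparison with NaN evaluates to False, so a filter like the one above also drops rows where the response time is missing. If you want to keep those rows, you can combine two conditions with `|` (or). The sketch below uses made-up values and illustrative variable names.

```python
import pandas as pd

# Made-up response times, including one missing value
toy_rt = pd.DataFrame({'response_time': [0.45, 1.30, float('nan'), 0.80]})

# The comparison alone drops both the slow trial and the missing one
fast_only = toy_rt[toy_rt['response_time'] <= 1.0]

# Keep rows that are fast OR missing
fast_or_missing = toy_rt[(toy_rt['response_time'] <= 1.0) | toy_rt['response_time'].isna()]

print(len(fast_only), len(fast_or_missing))
```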

## Summarising data

In [None]:
# Get mean and median of response time

x = df['response_time'].mean()
y = df['response_time'].median()

print(x,y)

In [None]:
# Tabulate the values in a specific column

df['stim_type'].value_counts()

In [None]:
# Get mean RT for each type of stimulus

df.groupby('stim_type')['response_time'].mean()
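
If you need several summary measures per group, the `agg` method accepts a list of function names and returns one column per measure. A minimal sketch with made-up trials:

```python
import pandas as pd

# Made-up trials for illustration only
toy_trials = pd.DataFrame({
    'stim_type': ['FAMOUS', 'FAMOUS', 'UNFAMILIAR', 'UNFAMILIAR'],
    'response_time': [0.50, 0.70, 0.60, 1.00],
})

# One row per stimulus type, one column per summary measure
rt_summary = toy_trials.groupby('stim_type')['response_time'].agg(['mean', 'median', 'count'])
print(rt_summary)
```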

### More about data types

Let's save some of these summary measures and dataframe properties as variables

To do that, we shall learn more about the data types used to store collections of data

In [None]:
# Pandas has two main data types: dataframes and series

print(type(df))
print(type(df['stim_type']))

In [None]:
# Python has 4 built-in data types for storing collections of values: list, tuple, set, dictionary

#### I. List

List is an extremely versatile data type that you will see and use regularly

In [None]:
# Lists are created using [] or the function list()

x = [1, 2, 3]
column_list = list(df.columns)

print(type(x), x)
print(type(column_list), column_list)

In [None]:
# You can easily combine lists of different variable types

y = x + column_list + [list(range(4))*2]
print(y)

In [None]:
## You can add items

# Add to end
column_list.append('accuracy')
print(column_list)

# Add at specific index
column_list.insert(1, 'group')
print(column_list)

In [None]:
## You can remove items

# Remove first occurrence
column_list.remove('onset')
print(column_list)

# Remove by index (pop also returns the removed item)
column_list.pop(5)
print(column_list)

In [None]:
# Change an item by index: rename the first item to 'subjects'

column_list[0] = 'subjects'
print(column_list)

In [None]:
# Loop through list and print each name

for col in column_list:
    print(col)

In [None]:
# Loop through list and print index and name

for i, col in enumerate(column_list):
    print(i, col)

In [None]:
# Loop through list and print only column names containing underscores

for col in column_list:
    if ('_' in col):
        print(col)

In [None]:
# This can be shortened to:

[col for col in column_list if '_' in col]

In [None]:
# Syntax: [expression for item in list if condition == True]

# Print first two letters of column names containing underscore
[col[:2] for col in column_list if '_' in col]

In [None]:
# Syntax: [expression for item in list if condition == True]

# Print first 3 letters of column names longer than 3
[col[:3] for col in column_list if len(col)>3]

In [None]:
# IMPORTANT: you cannot copy a list by assigning it to a new variable, even if it looks possible

x = column_list
print(x)

In [None]:
# x is only a reference to the original list, and any changes to x WILL ALSO CHANGE the original list

x.append('dummy_column')
print(column_list)

In [None]:
# To create a copy of a list, use the copy method

x = column_list.copy()

x.append('dummy2')
print(x)
print(column_list)
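
One way to check whether two names refer to the same list is the `is` operator. The sketch below uses a short made-up list; `original`, `alias` and `duplicate` are illustrative names.

```python
original = ['onset', 'duration']

alias = original              # a second name for the SAME list
duplicate = original.copy()   # a new, independent list

alias.append('stim_type')     # changes `original` too, but not `duplicate`

print(alias is original, original)
print(duplicate is original, duplicate)
```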

#### II. Tuple

You won't use it much, but you will see this from time to time so it's useful to understand its properties

In [None]:
# Tuples are ordered and unchangeable

x = df.shape

print(x)
print(type(x))
print(x[0])

In [None]:
# You cannot modify a tuple (this line raises a TypeError)

x[0] = 5

In [None]:
# You cannot add items to a tuple (this line raises an AttributeError)

x.append(5)

#### III. Set

Sets are unordered and cannot have duplicate elements

You can perform set operations

In [None]:
# Get unique values of a collection

stim_list = set(df['stim_type'])

print(stim_list)

In [None]:
# Create a new set of the desired stimulus types

stim_keep = {'FAMOUS', 'UNFAMILIAR'}

print(stim_keep)

In [None]:
# Perform set operations, for example:

print(stim_list & stim_keep) # Items present in both sets
print(stim_list ^ stim_keep) # Items only present in one set
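
Other set operations follow the same pattern. The sketch below uses two made-up sets (`found` and `wanted` are illustrative names) to show union, difference and a subset test.

```python
# Made-up sets for illustration only
found = {'FAMOUS', 'UNFAMILIAR', 'SCRAMBLED'}
wanted = {'FAMOUS', 'UNFAMILIAR'}

print(found | wanted)    # Union: items present in either set
print(found - wanted)    # Difference: items in found but not in wanted
print(wanted <= found)   # Subset test: is every wanted item in found?
```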

#### IV. Dictionaries

Dictionaries store data in key:value pairs

They are a popular way to keep related values together under descriptive names

In [None]:
# Save summary values of df

df_summary = {
    'subject' : 'sub-04',
    'mean': df['response_time'].mean()
}

print(type(df_summary), df_summary)

In [None]:
# Add entry to dictionary

df_summary['median'] = df['response_time'].median()

print(type(df_summary), df_summary)

In [None]:
## Loop through dictionary

# Using the keys
for key in df_summary.keys():
    print(key)

In [None]:
## Loop through dictionary

# Using the values
for value in df_summary.values():
    print(value)

In [None]:
# IMPORTANT: to make a copy of a dictionary, use the copy method; do not assign it directly to a new variable
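
A minimal sketch of the difference between assigning and copying, using a made-up dictionary (`summary`, `summary_alias` and `summary_backup` are illustrative names, and the values are invented). It also shows `items()`, which loops over keys and values together.

```python
summary = {'subject': 'sub-04', 'mean': 0.62}   # made-up values

summary_alias = summary           # the same dictionary under a second name
summary_backup = summary.copy()   # an independent copy

summary_alias['mean'] = 0.99      # also changes `summary`, but not `summary_backup`

# Loop over keys and values at the same time
for key, value in summary.items():
    print(key, value)

print(summary_backup)
```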

## Editing dataframe

Sometimes, we want to add columns or update values

In [None]:
## Adding columns

# Let's add a subject column
df['subject'] = 'sub-04'

# Add a column with the run number
df['run'] = 1

df.head()

In [None]:
## Rearranging columns

# Let's re-arrange the columns into a more pleasing order
df = df[['subject', 'run', 'onset', 'duration', 'circle_duration', 'stim_type', 'trigger', 'button_pushed', 'response_time', 'stim_file']]

df.head()

In [None]:
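# We can also drop some of the columns altogether
# (.copy() makes new_df an independent dataframe, so editing it later does not trigger a SettingWithCopyWarning)

new_df = df[['subject', 'run', 'stim_type', 'response_time', 'stim_file']].copy()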

new_df.head()

In [None]:
## Editing values

# Change the units of a column, for instance convert response_time from seconds to milliseconds
new_df.loc[:, 'response_time'] *= 1000

new_df.head()

In [None]:
## Sorting 

# We could also sort the data by the stim_type column
new_df.sort_values(by='stim_type')


In [None]:
# Or by two columns

new_df.sort_values(by=['stim_type', 'response_time'])
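
Note that `sort_values` returns a new, sorted dataframe and leaves the original unchanged, so assign the result to a variable if you want to keep it. The sketch below, on a small made-up dataframe, also uses the `ascending` argument to sort one column A-Z and the other from slowest to fastest.

```python
import pandas as pd

# Made-up data for illustration only
toy_sort = pd.DataFrame({
    'stim_type': ['UNFAMILIAR', 'FAMOUS', 'FAMOUS'],
    'response_time': [0.6, 0.7, 0.5],
})

# stim_type ascending, then response_time descending within each type
sorted_toy = toy_sort.sort_values(by=['stim_type', 'response_time'], ascending=[True, False])

print(sorted_toy)
print(toy_sort['stim_type'].iloc[0])   # the original row order is untouched
```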

# 3. Saving data externally

We simply use a pandas function to save our dataframe. You can view this file outside Python using a text editor or Excel, etc.

In [None]:
# This time, let's export to the more common csv format. We also specify that we don't want an extra index column in the file.

new_df.to_csv(respath + 'just_an_example.csv', index=False)
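
A quick way to check an export is to read the file back in and compare it with what you saved. The sketch below writes a small made-up dataframe to a temporary folder (created with the standard-library `tempfile` module) so that it does not touch your outputs folder; the file name `toy_example.csv` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Made-up data for illustration only
toy_out = pd.DataFrame({
    'subject': ['sub-04', 'sub-04'],
    'response_time': [450.0, 800.0],
})

# Save to a temporary folder, then read the file back in
with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = os.path.join(tmp_dir, 'toy_example.csv')
    toy_out.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

print(reloaded)
```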

# **Putting it all together**

Let's put what we've learnt so far into a coherent block of code, and we shall use the resulting dataframe for the next part of the primer on plotting data.

In [None]:
# --- Libraries --- #

import pandas as pd

# --- Set paths --- #

project = '~/COGNESTIC/01_Primer_on_Python/'
inpath = project + 'data/'
respath = project + 'outputs/'

# --- Get data from sixteen subjects (assume that sub-04 is missing run-05) --- #

all_data = pd.DataFrame()

# For each subject
for s in range(1, 17):

    # Zeropad integer to get subject name
    subject = 'sub-' + str(s).zfill(2)
    
    # For each run
    for i in range(1, 10):

        # If this is sub-04's 5th run, skip to the next iteration
        if subject == 'sub-04' and i == 5:
            continue

        # Get the file and read data
        run_file = subject + '/' + subject + '_ses-mri_task-facerecognition_run-0' + str(i) +'_events.tsv'
        run_data = pd.read_csv(inpath + run_file, sep='\t')

        # Add columns for subject and run
        run_data['subject'] = subject
        run_data['run'] = i
        
        # Add to main dataframe
        all_data = pd.concat([all_data, run_data])
        
# --- Clean data --- #

# Remove any rows without a valid stim_type or button_pushed value
all_data = all_data[all_data['stim_type'].notna() & all_data['button_pushed'].notna()]

# Remove response times over 1 second
all_data = all_data[all_data['response_time'] <= 1.0]

# --- Output collated data to single csv file --- #

# Export to csv file (no extra index column)
all_data.to_csv(respath + 'events_16sub.csv', index=False)

# --- Print out summary: mean RT for each stimulus type --- #

all_data.groupby('stim_type')['response_time'].mean()
