In [80]:
"""
Importing libraries: The first step when initiating a data wrangling script is to import your libraries. NumPy and pandas are two excellent and versatile libraries 
that you can always import at the beginning of a script; together they will allow you to do a great deal of data wrangling. You can also import more depending on your needs.

Commenting out code: It is also incredibly important to create well-commented code! You can comment with "#" or with triple quotes. An outside individual should be able
to understand what you are doing in your script based just upon reading comments.
"""
# Import Libraries
import pandas as pd 
import numpy as np 

In [81]:
"""
Read in your data! 
The data we will be using today, a test excerpt from the Stroop Task, is stored in the same GitHub folder as this notebook. Therefore, your computer should be able to
read it in using just the name of the file, 'stroop_test.csv'. Bear in mind that pathnames must ALWAYS be passed in as a string, wrapped in either double or single quotes.

Below, we call pandas and use its function 'read_csv' to establish our data frame. 'read_csv' already returns a pandas DataFrame, so the 'DataFrame' call that follows 
is optional: it simply makes the format explicit and gives us a clearly named object, stroop_df, to work with.
"""
# Read in stroop task data frame
stroop = pd.read_csv('stroop_test.csv')
#stroop = pd.read_csv('https://raw.githubusercontent.com/DANCE-Lab/RA_scripts/refs/heads/main/Learning_Python_Notebooks/stroop_test.csv')

# Convert to data frame
stroop_df = pd.DataFrame(stroop)
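
As a minimal sketch of the reading step, the snippet below parses a small in-memory CSV and checks what type 'read_csv' returns. The rows are made up and stand in for 'stroop_test.csv'; the names toy_csv, toy, and toy_df are illustrative assumptions, not part of the notebook's data.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (illustrative only)
toy_csv = io.StringIO("text,letterColor,congruent\nred,red,1\nblue,red,0\n")

# read_csv accepts a path string or a file-like object
toy = pd.read_csv(toy_csv)

# read_csv already returns a DataFrame
print(type(toy))
print(isinstance(toy, pd.DataFrame))

# Wrapping the result in pd.DataFrame() gives an equivalent data frame
toy_df = pd.DataFrame(toy)
print(toy_df.equals(toy))
```

If the file lived somewhere else, the same call would take the full path (or a URL) as its string argument.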

In [82]:
"""
Let's get a better sense of what this data looks like! You can call "head" (which will return the first five rows of your df) or "tail" (which will return the last five
rows of your df) to get a quick look at your data. You can also select specific rows by taking a slice of your data frame. 
"""
# Return the first five rows of the df
stroop_df.head()

Unnamed: 0,text,letterColor,corrAns,congruent,trials.thisRepN,trials.thisTrialN,trials.thisN,trials.thisIndex,instrText.started,instrText.stopped,...,Resp.rt,Resp.started,Resp.stopped,participant,session,date,expName,psychopyVersion,frameRate,Unnamed: 23
0,red,red,left,1,0,0,0,0,12.890473,13.897167,...,9.476369,46.314857,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
1,blue,red,left,0,0,1,1,5,,,...,0.549709,56.296472,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
2,green,green,down,1,0,2,2,2,,,...,0.348712,57.346441,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
3,green,blue,right,0,0,3,3,3,,,...,0.45476,58.196366,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
4,blue,blue,right,1,0,4,4,4,,,...,0.36104,59.163069,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,


In [85]:
# Return rows 2-6 by specifying a df slice (the stop value, 7, is not included)
stroop_df[2:7]

Unnamed: 0,text,letterColor,corrAns,congruent,trials.thisRepN,trials.thisTrialN,trials.thisN,trials.thisIndex,instrText.started,instrText.stopped,...,Resp.rt,Resp.started,Resp.stopped,participant,session,date,expName,psychopyVersion,frameRate,Unnamed: 23
2,green,green,down,1,0,2,2,2,,,...,0.348712,57.346441,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
3,green,blue,right,0,0,3,3,3,,,...,0.45476,58.196366,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
4,blue,blue,right,1,0,4,4,4,,,...,0.36104,59.163069,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
5,red,green,down,0,0,5,5,1,,,...,0.344365,60.029694,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
6,green,green,down,1,1,0,6,2,,,...,0.290436,60.879626,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
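
The slice rule can be sketched on a toy data frame (the 'trial' column and the variable names here are illustrative assumptions). A slice includes its start position and excludes its stop position, which is why [2:7] returns five rows.

```python
import pandas as pd

# Toy data frame with 10 rows, indexed 0-9 (illustrative only)
toy = pd.DataFrame({"trial": range(10)})

# [2:7] returns the rows at positions 2, 3, 4, 5, and 6
subset = toy[2:7]
print(subset.index.tolist())

# .iloc makes the position-based selection explicit
same_subset = toy.iloc[2:7]
print(same_subset.equals(subset))
```

To look at rows 10 through 15 inclusive, the slice would be written [10:16].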


In [84]:
""" 
Now we know that this data frame has 30 rows (numbered 0-29). A glance at the head, middle, and tail shows us that this is one session of the Stroop task, completed by a single 
individual. 
"""
# Return the last five rows of the df
stroop_df.tail()

Unnamed: 0,text,letterColor,corrAns,congruent,trials.thisRepN,trials.thisTrialN,trials.thisN,trials.thisIndex,instrText.started,instrText.stopped,...,Resp.rt,Resp.started,Resp.stopped,participant,session,date,expName,psychopyVersion,frameRate,Unnamed: 23
25,red,red,left,1,4,1,25,0,,,...,0.281424,78.646116,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
26,blue,red,left,0,4,2,26,5,,,...,0.247083,79.429428,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
27,green,green,down,1,4,3,27,2,,,...,0.211002,80.179474,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
28,blue,blue,right,1,4,4,28,4,,,...,0.129443,80.896041,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,
29,green,blue,right,0,4,5,29,3,,,...,0.211121,81.529325,,0,1,2020_Nov_16_1327,stroop_test_1,2020.2.5,60.022071,


In [86]:
""" 
It also looks like we can't see all of the columns that are included in this data frame! We can ask python to list them for us so we have a better sense of what we're 
working with. 
"""
list(stroop_df) 

['text',
 'letterColor',
 'corrAns',
 'congruent',
 'trials.thisRepN',
 'trials.thisTrialN',
 'trials.thisN',
 'trials.thisIndex',
 'instrText.started',
 'instrText.stopped',
 'Word.started',
 'Word.stopped',
 'Resp.keys',
 'Resp.corr',
 'Resp.rt',
 'Resp.started',
 'Resp.stopped',
 'participant',
 'session',
 'date',
 'expName',
 'psychopyVersion',
 'frameRate',
 'Unnamed: 23']

In [88]:
""" 
If we want to get a better look at any particular column, we can also call the head, tail, or index of that column. Let's take a closer look at 'Resp.corr'.
First, we can select the column within the df with brackets [] to specify which column we want to look at. Then, return the first five rows
"""
stroop_df['Resp.corr'].head()


0    0
1    0
2    1
3    0
4    0
Name: Resp.corr, dtype: int64

In [91]:
""" 
Now let's focus on figuring out which columns contain information that might be useful to us. It could be helpful to know if any columns contain ONLY NaN values! 
We can figure this out by having python iterate through the columns in this data frame and return whether any contain only NaN values. 
"""
#Determine whether any stroop_df columns contain ONLY NaN values
nan_columns = stroop_df.columns[stroop_df.isna().all()].tolist()
print(nan_columns)

""" 
This is a pretty dense line of code! Let's break it down. 
1. "nan_columns": we start by establishing a variable called nan_columns. This variable will be a list containing the names of any columns that contain ONLY NaN values. 
2. "stroop_df.columns": we clarify what data we would like our code to access. Here, we specify that we will be looking at the stroop_df data frame, and break it down by
its columns
3. []: These square brackets select from stroop_df's columns. Inside them we place a boolean test built from TWO actions (.isna() and .all()). The test produces one boolean
result (either True or False) per column, and only the columns whose result is True are kept. We need to name stroop_df again inside the brackets so that the test is run
on its values.
4. stroop_df.isna(): .isna() will test to see whether a value in the data frame is a 'NaN' value. It will return True if it is a NaN value, and False if it is not
5. .all(): .all() extends .isna() to the entire column. Now, instead of the boolean test returning True if a single value in a column is a NaN value, it will only return True if
the entire column is NaN
6. .tolist(): This command at the end converts the selected column names (i.e., the columns for which the boolean test returned True) into a plain python list, which is stored in 'nan_columns'
"""

""" 
In this code, we switch .all() with .any() to produce a list with a modified boolean test that checks whether ANY nan value is in a column, as opposed to whether ALL 
values are NaNs
"""
#Columns including NaN values
incl_nans = stroop_df.columns[stroop_df.isna().any()].tolist()
print(incl_nans)



['Word.stopped', 'Resp.stopped', 'Unnamed: 23']
['instrText.started', 'instrText.stopped', 'Word.stopped', 'Resp.stopped', 'Unnamed: 23']
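
To see each intermediate step of these one-liners, here is a small sketch on a made-up data frame. The column names 'full', 'partial', and 'empty' and the variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one complete column, one with a single NaN, one entirely NaN
toy = pd.DataFrame({
    "full": [1.0, 2.0, 3.0],
    "partial": [1.0, np.nan, 3.0],
    "empty": [np.nan, np.nan, np.nan],
})

is_nan = toy.isna()          # data frame of True/False, one per value
all_nan_mask = is_nan.all()  # one True/False per column: is EVERY value NaN?
any_nan_mask = is_nan.any()  # one True/False per column: is ANY value NaN?

print(all_nan_mask)
print(toy.columns[all_nan_mask].tolist())
print(toy.columns[any_nan_mask].tolist())
```

Printing all_nan_mask on its own is a handy way to check the boolean test before using it to select columns.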


In [92]:
""" 
Let's break these single-lines further, and re-write them as a for loop
1. all_nan = []: first, we must establish an empty list. This is just like establishing the 'nan_columns' or 'incl_nans' lists, except that breaking the operations into 
multiple lines of code requires first establishing an empty list that our loop will later fill with columns meeting our requirements
2. for col in stroop_df.columns: This is the line that establishes our for loop! It contains quite a bit of information. First, 'for' initiates a loop. 'col' is a 
placeholder that allows us to specify individual elements of whatever we are iterating through. In this case, we are iterating through the columns in stroop_df, or
stroop_df.columns. Remember, for lines must always END in a colon and be followed by an INDENTED line to execute properly.
3. if stroop_df[col].isna().all(): This is an if statement, a logic statement, that performs the boolean test to see whether any columns in the df are ONLY NaNs. Here, we 
clarify what df we are using (stroop_df), we specify that instead of looking at the WHOLE df we are interested in each column, in turn (stroop_df[col]). We then specify two
boolean operations to be executed in our if statement: .isna() (are values NaN?) and .all() (do ALL values satisfy our condition?). ONLY if our boolean statement returns a
True value will the code continue to the next line. Remember, if statements must always END in a colon and be followed by an INDENTED line to execute properly.
4. all_nan.append(col): Now we are asking the empty list that we established at the beginning to append any column that met the conditions set in the if statement. Because
if statements will only pass inputs that meet their logic conditions to the following line, ONLY columns that include only NaN values will be appended to the list.
5. print(all_nan): Print the list of columns that contain only NaN values. 
It matches the list generated by the code above!
"""

#Use a for loop to iterate through stroop_df columns
all_nan = []
for col in stroop_df.columns:
    if stroop_df[col].isna().all():
        all_nan.append(col)
print(all_nan)

""" 
Of course, the term 'col' is only a placeholder in the loop above that instructs python what to do with whatever element is being passed through the list. Because loops
run more than once, the VALUE of col changes each time. The first time the loop runs, 'col' contains the value of the first item in the list. The second time it runs, 'col'
is column 2. The last time it runs, 'col' is the final column in the data frame. Because col is just a placeholder, the letters could spell out anything. Below, the code 
runs identically with the placeholder 'silly'. However, for readability, it's always best to make the placeholder a word that makes sense.
"""

#demonstrate placeholder--can be any word!
silly_nan = []
for silly in stroop_df.columns:
    if stroop_df[silly].isna().all():
        silly_nan.append(silly)
print(silly_nan)




['Word.stopped', 'Resp.stopped', 'Unnamed: 23']
['Word.stopped', 'Resp.stopped', 'Unnamed: 23']
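
For comparison, the same loop can be condensed into a list comprehension, which keeps the 'for' and 'if' logic but fits on one line. This is a sketch on made-up data; the columns 'a', 'b', and 'c' are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data: only column 'b' is entirely NaN
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, np.nan], "c": [np.nan, 5.0]})

# Long form: empty list, loop, if statement, append
loop_result = []
for col in toy.columns:
    if toy[col].isna().all():
        loop_result.append(col)

# Short form: the same three pieces inside one pair of square brackets
comprehension_result = [col for col in toy.columns if toy[col].isna().all()]

print(loop_result)
print(comprehension_result)
```

Both versions visit every column once, so which one to use is mostly a question of readability.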


In [93]:
""" 
Let's say that these columns, which contain all NaN values, are not helpful to us. We can drop any column we like from the df with the function '.drop'. However,
when dropping columns, it's always a good idea to create a NEW data frame so that the original is preserved in python's memory.
"""

""" 
In this line, we create our new df (stroop_nonan) and drop the columns our boolean test told us contained only NaN values. 'drop' will remove a list of strings 
corresponding to column names. It will NOT work if the column names are misspelled in any way, so it can be cleaner to ask it to drop a computer-generated list (all_nan).
But the following line of code is equally valid: 

stroop_nonan = stroop_df.drop(columns= ['Word.stopped', 'Resp.stopped', 'Unnamed: 23'])
"""
# Drop columns that contain only NaN values
stroop_nonan = stroop_df.drop(columns=all_nan)
#print list of columns in new data frame
list(stroop_nonan)

['text',
 'letterColor',
 'corrAns',
 'congruent',
 'trials.thisRepN',
 'trials.thisTrialN',
 'trials.thisN',
 'trials.thisIndex',
 'instrText.started',
 'instrText.stopped',
 'Word.started',
 'Resp.keys',
 'Resp.corr',
 'Resp.rt',
 'Resp.started',
 'participant',
 'session',
 'date',
 'expName',
 'psychopyVersion',
 'frameRate']
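
As an aside, pandas also offers 'dropna', which can remove all-NaN columns without a list of names. The sketch below compares it with '.drop' on made-up data; the toy columns and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one column that is entirely NaN
toy = pd.DataFrame({
    "rt": [0.4, 0.5],
    "partial": [np.nan, 1.0],
    "empty": [np.nan, np.nan],
})

# Drop by name, as in the notebook
manual = toy.drop(columns=["empty"])

# Drop every column in which ALL values are NaN
automatic = toy.dropna(axis="columns", how="all")

print(list(manual))
print(list(automatic))

# Both calls return NEW data frames; the original still has all its columns
print(list(toy))
```

Note that 'partial' survives in both results because how="all" only removes columns with no real values at all.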

In [95]:
""" 
Let's refine some of those columns even further! If we're interested in stroop task outcomes, some of these columns are probably not relevant to our analyses.
We will create a new df containing only columns we want to analyze.
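Note: Single square brackets around one column name select a single column! To select many columns, pass a LIST of names inside the brackets, either written out 
(which gives double square brackets) or stored in a variable, as we do with useful_list below.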
"""
#select columns relevant for analysis
useful_list = ['text', 'letterColor', 'corrAns', 'congruent', 'Resp.keys', 'Resp.corr', 'Resp.rt', 'participant']
stroop_data = stroop_nonan[useful_list]

#Call the first five rows of our new data frame
stroop_data.head()

Unnamed: 0,text,letterColor,corrAns,congruent,Resp.keys,Resp.corr,Resp.rt,participant
0,red,red,left,1,down,0,9.476369,0
1,blue,red,left,0,down,0,0.549709,0
2,green,green,down,1,down,1,0.348712,0
3,green,blue,right,0,down,0,0.45476,0
4,blue,blue,right,1,down,0,0.36104,0
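
The difference between the bracket styles can be sketched on a toy data frame (the columns and variable names below are illustrative assumptions). One name in single brackets gives a Series, while a list of names gives a DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({"text": ["red", "blue"], "resp": [1, 0], "rt": [0.4, 0.5]})

one_column = toy["text"]           # single brackets: a Series
one_column_df = toy[["text"]]      # a list with one name: a one-column DataFrame
two_columns = toy[["text", "rt"]]  # a list with two names: a two-column DataFrame

# Storing the list in a variable first is equivalent to double brackets
wanted = ["text", "rt"]
also_two = toy[wanted]

print(type(one_column).__name__)
print(type(one_column_df).__name__)
print(two_columns.equals(also_two))
```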


In [96]:
""" 
For our first analysis, let's see if there's a significant difference between response rate in correct versus incorrect trials! But first, let's rename some columns to
make the column list read better. (PsychoPy's 'Resp.rt' column records how long each response took; this notebook calls that value the response rate and names the column 'resp_rate'.)
Note: the .rename command requires curly brackets. This is because this function takes a DICTIONARY, while '.drop' takes a LIST.
"""
# Rename columns in df for clarity
useful_dict = {'corrAns': 'correct_answer', 'Resp.keys': 'resp_key', 'Resp.corr': 'resp_correct', 'Resp.rt': 'resp_rate'}
stroop_data = stroop_data.rename(columns= useful_dict)

# List new column names
list(stroop_data)

['text',
 'letterColor',
 'correct_answer',
 'congruent',
 'resp_key',
 'resp_correct',
 'resp_rate',
 'participant']
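
One thing worth knowing about '.rename' is how it treats misspelled names. By default it silently skips dictionary keys that do not match any column, whereas '.drop' raises an error. The sketch below uses made-up data and a deliberately misspelled key ('Resp.cor') to illustrate this.

```python
import pandas as pd

toy = pd.DataFrame({"Resp.rt": [0.4], "Resp.corr": [1]})

# 'Resp.cor' is deliberately misspelled, so that entry has no effect
renamed = toy.rename(columns={"Resp.rt": "resp_rate", "Resp.cor": "resp_correct"})
print(list(renamed))

# .drop, by contrast, complains about a name it cannot find
try:
    toy.drop(columns=["Resp.cor"])
except KeyError as err:
    print("drop failed:", err)

# rename returned a new data frame; the original names are untouched
print(list(toy))
```

Listing the columns after renaming, as the notebook does, is therefore a useful check that every rename actually took effect.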

In [97]:
""" 
Let's create two data frames now: one that contains ONLY correct trials, and one that contains ONLY incorrect trials. We can do this with a one-line command that will filter
based on the values in the data frame. We will name our data frame containing correct values 'stroop_correct'.
This is another one-liner containing two different commands! An iteration through data and a boolean statement. Let's break it down:
1. stroop_correct: creating our NEW data frame that will contain only correct rows
2. stroop_data: we are telling python what data we would like it to iterate through in this command
3. []: Our single brackets isolate our logic statement
4. stroop_data['resp_correct']: this isolates the column that has information that we care about
5. == 1: Two equals next to each other create a logic statement that means "is equal to one". Therefore, in this logic statement, rows in our data frame will return either
True (when the value in 'resp_correct' equals 1) or False (when the value in 'resp_correct' does not equal 1). stroop_correct will maintain only rows in stroop_data that 
pass the boolean test.
"""
# Create one data frame with only correct trials, and one with only incorrect trials
stroop_correct = stroop_data[stroop_data['resp_correct'] == 1]
stroop_incorrect = stroop_data[stroop_data['resp_correct'] == 0]

# Return the head of the data frame containing only correct trials
stroop_correct.head()

Unnamed: 0,text,letterColor,correct_answer,congruent,resp_key,resp_correct,resp_rate,participant
2,green,green,down,1,down,1,0.348712,0
5,red,green,down,0,down,1,0.344365,0
6,green,green,down,1,down,1,0.290436,0
10,red,green,down,0,down,1,0.31667,0
12,red,green,down,0,down,1,0.314894,0
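
To make the boolean test visible, the sketch below stores it in its own variable before filtering. The toy values and the name 'mask' are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "resp_correct": [0, 1, 1, 0],
    "resp_rate": [0.9, 0.3, 0.4, 0.5],
})

# The comparison alone produces one True/False per row
mask = toy["resp_correct"] == 1
print(mask.tolist())

# Placing the mask in brackets keeps only the rows marked True
correct = toy[mask]

# ~ flips True and False; this matches '== 0' only because the column holds just 0s and 1s
incorrect = toy[~mask]

print(correct.index.tolist())
print(incorrect.index.tolist())
```

Notice that the filtered data frames keep their original row labels, which is why the head of stroop_correct starts at row 2 rather than row 0.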


In [None]:
""" 
Now let's try breaking that one-line statement into a longer-form loop! Even though loops are more computationally expensive, it can be really helpful to know exactly
what each element of a single-line command does. The loop is equally valid and produces the exact same result.
"""

"""
Let's break the loop down: 
1. stroop_loop: first we create a new data frame that is identical to stroop_data 
2. for row in stroop_loop.index: Just like the command stroop_df.columns indicated that the loop would iterate through the df's columns, stroop_loop.index signifies that
the loop will be iterating through stroop_loop's rows. This line establishes the loop, and names the iteration placeholder "row" for clarity. 
3. if stroop_loop['resp_correct'][row] == 0: this is the boolean statement that tests whether the trial in question was correct (stroop_loop['resp_correct'][row] == 1)
or incorrect (stroop_loop['resp_correct'][row] == 0). In this case, we only want trials that are incorrect to pass the test because they will be removed from the data frame
in the next line of code.
4. stroop_loop = stroop_loop.drop(row): This line removes any incorrect trial from the stroop_loop data frame. At the end of the iterations, stroop_loop will contain ONLY
correct trials.
5. stroop_loop.head(): this line returns our new df! The head reveals that it is identical to the df stroop_correct that we created in the last block.
"""
# Create a data frame with only correct trials
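# .copy() makes stroop_loop an independent data frame rather than a second name for stroop_data
stroop_loop = stroop_data.copy()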
for row in stroop_loop.index: 
    if stroop_loop['resp_correct'][row] == 0:
        stroop_loop = stroop_loop.drop(row)
stroop_loop.head()
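
Two details of this loop can be checked on a toy data frame: plain assignment only creates a second name for the same object (while '.copy()' creates an independent one), and the loop ends up with the same rows as the one-line filter. The data and variable names below are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({"resp_correct": [0, 1, 1, 0], "resp_rate": [0.9, 0.3, 0.4, 0.5]})

alias = toy               # a second name for the SAME data frame
independent = toy.copy()  # a separate data frame with the same contents
print(alias is toy)
print(independent is toy)

# Loop version: start from a copy and drop incorrect rows one at a time
loop_df = toy.copy()
for row in toy.index:
    if toy.loc[row, "resp_correct"] == 0:
        loop_df = loop_df.drop(row)

# One-line version: keep rows where the boolean test is True
filtered = toy[toy["resp_correct"] == 1]

print(loop_df.equals(filtered))
```

The sketch uses '.loc[row, column]' to read a single value, which is an alternative to chaining two sets of brackets.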



In [98]:
""" 
We are almost ready to perform our Student's t-test! To perform this test, we must first import the function ttest_ind from the library scipy.stats.

Then we can plug in the two samples that we would like to have tested. First, we call 'ttest_ind', which accepts two series of numbers that will be tested against each other.
In this case, we are testing response rate in correct trials versus response rates in incorrect trials. All we need to do then is specify the relevant column (resp_rate)
in both correct trial and incorrect trial dfs.

We can see from the results that the difference is not significant (p = 0.39).
"""
# Import scipy.stats library 
import scipy.stats as stats
from scipy.stats import ttest_ind

# Perform a Student's t-test on response rate data
ttest_ind(stroop_correct['resp_rate'], stroop_incorrect['resp_rate'])



TtestResult(statistic=-0.8673530747844189, pvalue=0.39312138016192943, df=28.0)

In [99]:
""" 
However, before we accept this result, we should check how many trials fall into each of the groups we are passing into the ttest_ind function. Student's t-test assumes that
the two groups have equal variances, and that assumption matters most when the group sizes are unequal. You can easily determine the number of rows and columns of a data frame
with the '.shape' attribute. To keep ourselves organized, let's create a dictionary that will contain a label and value so we can see 
how many correct trials versus incorrect trials our participant had.
"""

# Create a dictionary containing df shapes
size_dict = {}
size_dict['correct'] = stroop_correct.shape
size_dict['incorrect'] = stroop_incorrect.shape
print(size_dict)

{'correct': (10, 8), 'incorrect': (20, 8)}
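
Another way to count trials per group, sketched here on made-up data, is to call 'value_counts' on the column itself. Because the labels come straight from the data, there is no chance of attaching a count to the wrong name. The toy values and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy data: 2 correct trials (1) and 4 incorrect trials (0)
toy = pd.DataFrame({"resp_correct": [1, 0, 0, 1, 0, 0]})

# Count how many times each value occurs in the column
counts = toy["resp_correct"].value_counts()
print(counts.to_dict())

# Build a labelled dictionary by looking up each value explicitly
size_dict = {
    "correct": int(counts.loc[1]),
    "incorrect": int(counts.loc[0]),
}
print(size_dict)
```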


In [100]:
""" 
Our dictionary shows us that one group is actually twice as big as the other (20 trials versus 10)! Student's t-test can be run on unequal sample sizes, but it assumes that
both groups have equal variances, and it becomes unreliable when the sizes AND the variances differ. Welch's t-test does not assume equal variances, so it is the safer choice here.

You can easily turn your Student's t-test into a Welch's t-test by adding one of the ttest_ind built-in parameters to your function. We will use the 'equal_var' 
parameter, and set it to False so that the test no longer assumes equal variances.

Our Welch's t-test has shown us that there is not a significant difference in response rate between correct and incorrect trials (p = 0.23).
"""
# Perform a Welch's t-test on correct and incorrect trial response rates
ttest_ind(stroop_correct['resp_rate'], stroop_incorrect['resp_rate'], equal_var= False)

TtestResult(statistic=-1.2370175611973058, pvalue=0.23112047170698558, df=19.04092672589239)
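
The effect of 'equal_var' can be sketched with two made-up groups that differ in both size and spread. The numbers and group names below are invented for illustration and are not taken from the Stroop data.

```python
from scipy.stats import ttest_ind

# Made-up values: a small, tightly clustered group and a larger group with one outlier
fast_group = [0.30, 0.32, 0.29, 0.35, 0.31]
slow_group = [0.45, 0.55, 9.40, 0.36, 0.40, 0.50, 0.42, 0.38, 0.47, 0.60]

# Student's t-test (the default) pools the variance of the two groups
student = ttest_ind(fast_group, slow_group)

# Welch's t-test estimates each group's variance separately
welch = ttest_ind(fast_group, slow_group, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

The two tests are given identical data, so any difference in the printed statistics comes only from how the variance is handled.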

In [None]:
stroop_correct.head()

In [None]:
#Find what colors occur in the task and their frequency using the value_counts function
stroop_data['text'].value_counts()

In [78]:
""" 
Let's try creating a short function that will give us some more information about colors in correct versus incorrect trials! A function
is a collection of commands that takes a series of parameters (such as a data frame, a list, a dictionary, or a string) that can be 
changed to suit the programmer's needs. For example, commands we have been using such as "ttest_ind" are functions! In ttest_ind, 
the required parameters are two series that will be tested against each other, with optional parameters such as "equal_var". Being able
to write your own gives you a lot of flexibility when working with specific data.

Functions always begin with "def", then the name of the function, and then the parameters that function takes followed by a colon.
For example: 

def foo(data_frame):
    for row in data_frame.index:
        print('hello_world')
    return data_frame

The function is then populated with whatever actions you need it to execute (indented, just like a for loop or an if statement). Often,
you will need your function to return something that will be useful to you! In that case, you can call the command "return" at the end
of your function. This will yield a transformed product. You can save the transformed product with your syntax when you call the function.
To save the product of our function, "foo", I would write:

saved_data_frame = foo(data_frame)

Let's create a function that records the number and color of congruent trials in a given df
"""
#function to record the number and color of congruent trials in a given df
def color_pairs(df):
    color_count = {}
    for row in df.index:
        if df['text'][row] == df['letterColor'][row]:
            if df['text'][row] not in color_count:
                color_count[df['text'][row]] = 0
            color_count[df['text'][row]] += 1
    return color_count

In [None]:
#use "color pairs" function on correct and incorrect data frames
incorrect_color_dic = color_pairs(stroop_incorrect)
correct_color_dic = color_pairs(stroop_correct)
print('incorrect congruent colors:', incorrect_color_dic)
print('correct congruent colors:', correct_color_dic)
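
As a cross-check, the counts from a loop-based function like color_pairs can be compared with a filter followed by 'value_counts'. The sketch below redefines color_pairs (using '.loc' to read each value) so that it runs on its own, and uses a made-up data frame rather than the Stroop data.

```python
import pandas as pd

def color_pairs(df):
    """Count congruent trials (text matches letterColor) for each color."""
    color_count = {}
    for row in df.index:
        word = df.loc[row, "text"]
        if word == df.loc[row, "letterColor"]:
            if word not in color_count:
                color_count[word] = 0
            color_count[word] += 1
    return color_count

# Toy data: one congruent 'red' trial and two congruent 'green' trials
toy = pd.DataFrame({
    "text": ["red", "blue", "green", "green", "red"],
    "letterColor": ["red", "red", "green", "green", "blue"],
})

loop_counts = color_pairs(toy)

# Same idea without a loop: keep congruent rows, then count each color
congruent_rows = toy[toy["text"] == toy["letterColor"]]
vectorized_counts = congruent_rows["text"].value_counts().to_dict()

print(loop_counts)
print(vectorized_counts)
```

When two different approaches give the same dictionary, that is good evidence the function is doing what we intended.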