# Introduction to Pandas

`pandas` is an awesome library for csv file manipulations, and so much more!

## First, setup

In [1]:
import pandas as pd
import numpy as np

## Series

The basic `pandas` data structure is `Series`. It represents a one-dimensional collection of values (similar to python native `list` and `numpy` `array`)

In [2]:
s = pd.Series([1, -4, 0])

In [3]:
print(s[1])

-4


## Data Frame

`DataFrame` would be the main way to interact with csv files (**tables**). Let's see an example

### First, read from csv file (or any seperated files) and display them

In [4]:
df_samping = pd.read_csv("data/4123_IST_sampling.csv")
df_memory = pd.read_csv("data/4123_IST_memory.csv")
df_samping.columns = df_samping.columns.str.replace(' ', '')
df_memory.columns = df_memory.columns.str.replace(' ', '')

In [5]:
df_samping.head()

Unnamed: 0,participant_id,session,trial_no,global_time_onset_trial,ready_screen_on,ready_screen_duration,category_of_pics,probability,reward_type,majority_cat,...,sample_no,picture_path,pic_name,choose_sample_time,global_choose_sample,picture_on_global,picture_on_trial,picture_off_global,picture_off_trial,Unnamed: 21
0,4123,1,1,6.175991,6.176686,3.0574,ioc,0.6,5,indoor,...,1,C:/Users/EYETRACKER/Documents/info_sample_tas...,outdoor_82.jpg,1.058262,10.303376,10.332037,4.155356,12.341411,6.164731,
1,4123,1,1,6.175991,6.176686,3.0574,ioc,0.6,5,indoor,...,2,C:/Users/EYETRACKER/Documents/info_sample_tas...,outdoor_133.jpg,0.977646,13.567511,13.593086,7.416406,15.608749,9.432069,
2,4123,1,1,6.175991,6.176686,3.0574,ioc,0.6,5,indoor,...,3,C:/Users/EYETRACKER/Documents/info_sample_tas...,indoor_122.jpg,0.909801,16.767339,16.787396,10.610716,18.796897,12.620216,
3,4123,1,1,6.175991,6.176686,3.0574,ioc,0.6,5,indoor,...,4,C:/Users/EYETRACKER/Documents/info_sample_tas...,outdoor_114.jpg,0.777908,19.823399,19.844268,13.667587,21.853322,15.676641,
4,4123,1,2,26.578054,26.578616,3.051034,lnc,0.6,1,nonliving,...,1,C:/Users/EYETRACKER/Documents/info_sample_tas...,living_152.jpg,0.841852,30.479303,30.497736,3.919125,32.506476,5.927865,


In [6]:
df_memory.head()

Unnamed: 0,participant_id,session,trial_number,global_time,picture,old_new,old_new_judge,time_of_choice,confidence,time_con_rating,picture_onset,picture_offset,Unnamed: 13
0,4123,2,1,14.694848,outdoor_208.jpg,new,new,6.818242537,3,1.463256,5.389009,12.207348,
1,4123,2,2,18.178009,living_72.jpg,old,old,1.631794628,3,0.80873,14.719182,16.35107,
2,4123,2,3,21.713991,living_155.jpg,old,old,1.454035321,3,1.039911,18.201309,19.655411,
3,4123,2,4,26.417422,nonliving_247.jpg,old,new,2.600254011,2,1.063993,21.734634,24.334965,
4,4123,2,5,30.930341,living_123.jpg,old,old,1.784202552,2,1.688185,26.438702,28.223023,


### Questions we will investigate:
1. Accuracy on trials, how many trials did the participant answer correctly?
2. What is the average number of times the participant sampled per trial? Is this different for trials worth \$5.00 and \$1.00?
3. What is the participant's memory score?

#### question 1

In [39]:
total_length = len(df_samping[df_samping['sample_no'] == 1])
correct = len(df_samping[(df_samping['sample_no'] == 1) & (df_samping['majority_cat'] == df_samping['final_choice'])])
accu = correct / total_length

In [40]:
print("accuracy: ", accu)

accuracy:  0.7708333333333334


#### question 2

In [41]:
high_groups = df_samping[df_samping['reward_type'] == 5].groupby(['trial_no'])
low_groups = df_samping[df_samping['reward_type'] == 1].groupby(['trial_no'])

In [42]:
high_trials = high_groups['sample_no'].count()
print("high: ", high_trials.mean())
low_trials = low_groups['sample_no'].count()
print("low: ", low_trials.mean())

high:  7.0
low:  7.70833333333


#### question 3

In [44]:
false_alarm = 0 # new old
correct_hit = 0 # old old
miss = 0 # old new
correct_rejection = 0 # new new

correct_hit = df_memory[(df_memory["old_new_judge"] == " old") & (df_memory["old_new_judge"] == " old")]
print("correct_hit: ", len(correct_hit))

correct_hit:  245


In [45]:
# find the missing answer row
i = 0
for idx, row in df_memory.iterrows():
    if row.old_new_judge != " new" and row.old_new_judge != " old":
        i = idx
        print("one missing row right here")
        
print(df_memory.iloc[i])

one missing row right here
participant_id               4123
session                         2
trial_number                   28
global_time               133.194
picture             outdoor_6.jpg
old_new                       old
old_new_judge        No Selection
time_of_choice            No Time
confidence                      1
time_con_rating           2.40687
picture_onset             119.768
picture_offset            129.768
                              NaN
Name: 27, dtype: object
