# Statistical Plan

This document explains the statistical analysis of M.O.R.E. (Morphed Object Recognition Experiment) in detail. We start with a description of the data itself. In total we collected 22 different types of data; only those relevant to the analysis are described here (for the rest, please see "fulldataset.csv").

**1) Participant Number:** Number identifying the participant; in total we have 30 participants. |pt_num|

**2) Trial Number:** As the name suggests, the number of the corresponding trial; for each participant the whole experiment consisted of 1440 trials (excluding the training phase). |trial_nbr|

**3) Block Number:** The experiment consists of 10 blocks (with 9 breaks in between the blocks), so each block contained 144 trials. |block_number|

**4) Reaction Time:** One of the key measures to be analyzed: the recorded reaction time for a trial. A timer starts after the stimulus presentation and runs until a key response is given. To clarify, reaction time does not include the stimulus presentation itself, only the time after the stimulus presentation until a key is pressed. A maximum window of 10 seconds was allowed; if no response was given within these 10 seconds, the next trial began. |rt|

**5) Stimulus Onset Asynchrony (SOA):** This term is currently a bit misleading: here it means the duration of the stimulus presentation (the time the image stayed on the screen), which varied across the experiment between 25 ms, 50 ms and 100 ms. The durations were evenly distributed (in a randomized order) across the experiment, with 480 trials per duration. |soa|

**6) Accuracy:** Record of whether the participant correctly chose the category of the given stimulus; it is either TRUE or FALSE. |acc|

**7) Category:** Category of the presented object, one of the following options: bird, tree, cat, building, bus, person, fire hydrant, banana. |category|

**8) Chosen Category:** The category that the participant selected for the object. |choiced_category|

**9) Difficulty:** Difficulty of the occlusion (how much of the image is occluded): control = no occlusion, low = smaller occlusion, high = greater occlusion. |difficulty|

**10) Size of Occlusion:** As we opted for the partial-viewing occlusion type, the occlusion comes in two variants: few but large blobs blocking the image, or many but small blobs blocking the image. |size_occl|

**11) Pressed Key:** The key that was pressed in response to the trial; we recorded the pressed key for every trial. |pressed_key|

**12) Correct Key:** The key that had to be pressed in order to give the correct answer. |correct_key|





*In total, we have 3 effect parameters (SOA, Difficulty and Size of Occlusion) that are hypothesized to influence accuracy and reaction time.*
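The trial counts described above (1440 trials per participant, 480 per presentation duration) can be checked with a simple grouping. The following is only a minimal sketch on toy data: the helper name `design_counts` and the toy values are hypothetical, and it assumes `soa` is stored in seconds (0.025, 0.05, 0.1). Only the column names `pt_num` and `soa` come from the dataset description.

```python
import pandas as pd


def design_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count trials per participant (rows) and SOA level (columns)."""
    return df.groupby(["pt_num", "soa"]).size().unstack("soa")


# Toy data (hypothetical): 2 participants, 2 trials per SOA level each.
toy = pd.DataFrame({
    "pt_num": [1] * 6 + [2] * 6,
    "soa": [0.025, 0.05, 0.1] * 4,
})

counts = design_counts(toy)

# In the toy data every cell should equal 2; for the full experiment
# the description above implies 480 per cell before any rows are removed.
balanced = bool((counts == 2).all().all())
print(counts)
print("balanced:", balanced)
```

Applied to the merged dataset, the same table would be expected to show 480 in every cell before any rows are removed.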

## Processing Data

We start by importing the necessary libraries and datasets and merging everything into one large dataset. From there we move on to data cleaning and the determination of outliers.

In [114]:
import numpy as np
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import glob
import statistics

# use a personal style sheet
plt.style.use("./styles/mystyle.mplstyle")

# list all participant data files
data_files = glob.glob(r'../Experiment/data/*.csv')

# read every file and concatenate them in a single call
# (avoids repeatedly copying a growing DataFrame inside a loop)
df = pd.concat([pd.read_csv(f) for f in data_files])

# select only the main task
df = df.loc[df['task']=='experiment']

# dropping unnecessary columns in terms of data analysis
df = df.drop(labels = ["Age", "Gender", "Handedness", "manipulation", "mask", "type_occl", "filename",
                       "mask_filename", "task", "Experiment Duration", "Version"], axis=1)
df = df.reset_index(drop = True)

# extracting sample size (number of unique participants)
n = df["pt_num"].nunique()
print(n)

# preview the first rows
df.head()

# overall trial number
len(df.index) # 43200

30


43200

### Outlier Determination and Managing Missing Datapoints

Now that the merged dataset (df) is available, we inspect it closely, which includes managing missing datapoints. The way the experiment was coded makes such cases easy to find: we look for trials with a reaction time greater than or equal to 10 seconds and a pressed key recorded as "NAN". We will remove such rows from the dataset.



In [115]:
df[df["pressed_key"] == "NAN"]
# there is only one such case: participant 19, trial number 1122

Unnamed: 0,pt_num,trial_nbr,block_number,rt,soa,acc,category,choiced_category,difficulty,size_occl,pressed_key,correct_key
15521,19,1122,Block 8,10.648954,0.025,False,building,NAN,low,few large,NAN,q


In [116]:
df = df.drop(labels = 15521, axis = 0)

In [117]:
df[df["pressed_key"] == "NAN"]
# Now it is removed

Unnamed: 0,pt_num,trial_nbr,block_number,rt,soa,acc,category,choiced_category,difficulty,size_occl,pressed_key,correct_key
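The same removal can also be expressed with a boolean mask instead of a hard-coded row label, which keeps working if the set of data files changes. This is a minimal sketch on toy data, not the code used above: the helper `drop_missed_trials`, the constant `TIMEOUT_S` and the toy values are hypothetical, and it assumes `rt` is recorded in seconds and missed responses are stored as the string "NAN".

```python
import pandas as pd

TIMEOUT_S = 10.0  # assumed response window in seconds


def drop_missed_trials(df: pd.DataFrame):
    """Remove trials with no key press within the response window.

    Returns the cleaned DataFrame and the number of removed rows.
    """
    missed = (df["pressed_key"] == "NAN") & (df["rt"] >= TIMEOUT_S)
    cleaned = df.loc[~missed].reset_index(drop=True)
    return cleaned, int(missed.sum())


# Toy data (hypothetical values): one missed trial out of four.
toy = pd.DataFrame({
    "pt_num": [19, 19, 20, 20],
    "rt": [0.61, 10.65, 0.48, 0.95],
    "pressed_key": ["q", "NAN", "w", "e"],
})

cleaned, n_removed = drop_missed_trials(toy)
print(n_removed, len(cleaned))
```

Returning the number of removed rows makes it easy to report how many trials were excluded.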


**For outliers**: A 99th-percentile cut-off will be applied to reaction time. For accuracy, we will check whether any participant performs close to (or even below) chance level, which is 12.5% given that there are 8 possible response options. These outliers will be removed from the dataset.
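The two outlier rules could be sketched as follows. This is an illustration on toy data under stated assumptions, not the final analysis code: the cut-off is computed over all trials pooled (the plan does not yet say whether it is pooled or per participant), and the `margin` that defines "close to" chance is a hypothetical parameter. The helper names are hypothetical as well.

```python
import pandas as pd

CHANCE_LEVEL = 1 / 8  # 8 response categories -> 12.5%


def rt_cutoff(df: pd.DataFrame, q: float = 0.99) -> float:
    """Pooled reaction-time quantile (assumption: not per participant)."""
    return float(df["rt"].quantile(q))


def participant_accuracy(df: pd.DataFrame) -> pd.Series:
    """Proportion of correct trials for each participant."""
    return df.groupby("pt_num")["acc"].mean()


# Toy data (hypothetical values): participant 2 never answers correctly
# and has one very slow trial.
toy = pd.DataFrame({
    "pt_num": [1] * 4 + [2] * 4,
    "rt": [0.4, 0.5, 0.6, 0.7, 0.5, 0.6, 0.7, 9.0],
    "acc": [True, True, True, False, False, False, False, False],
})

cutoff = rt_cutoff(toy)
trimmed = toy[toy["rt"] <= cutoff]  # drop trials above the cut-off

margin = 0.05  # hypothetical tolerance for "close to" chance
acc = participant_accuracy(toy)
near_chance = acc[acc <= CHANCE_LEVEL + margin]
print(len(trimmed), list(near_chance.index))
```

Whether the reaction-time cut-off should be pooled or computed per participant, and how large the margin around chance should be, are open choices to settle before the rule is applied to the full dataset.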

## Work in Progress...