# 1.0 Analyzing Initial Participants

# Initial Setup

## Jupyter Extensions

Load [watermark](https://github.com/rasbt/watermark) to see the state of the machine and environment that's running the notebook. To make sense of the options, take a look at the [usage](https://github.com/rasbt/watermark#usage) section of the readme.

In [1]:
# Load `watermark` extension
%load_ext watermark

In [2]:
# Display the status of the machine and other non-code related info
%watermark -m -g -b -h

Compiler    : GCC 12.3.0
OS          : Linux
Release     : 5.15.0-117-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 4
Architecture: 64bit

Hostname: apra-desktop-ubuntu

Git hash: 2af72ed6d09ec71dd7758a7d6f41dec7d980da17

Git branch: main



Load [autoreload](https://ipython.org/ipython-doc/3/config/extensions/autoreload.html). In mode `1`, it reloads only the modules marked with `%aimport` before each cell is executed.

Running `%autoreload 2` inverts this behavior: every module is reloaded automatically *except* those explicitly excluded with `%aimport -module_name`.

In [3]:
# Load `autoreload` extension
%load_ext autoreload
# Set autoreload behavior
%autoreload 1
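
For context, outside IPython the closest manual equivalent of autoreload is `importlib.reload`. The sketch below is not part of the notebook's setup; it writes a throwaway module named `demo_settings` (a hypothetical name) to a temporary directory, edits it on disk, and reloads it to show the step that `%autoreload` automates.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip bytecode caching so the quick edit below is always picked up
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp_dir:
    module_path = Path(tmp_dir) / "demo_settings.py"  # hypothetical module
    module_path.write_text("VALUE = 1\n")

    sys.path.insert(0, tmp_dir)
    try:
        import demo_settings

        value_before = demo_settings.VALUE

        # Edit the source on disk, as you would in an editor
        module_path.write_text("VALUE = 200\n")

        # Without a reload, the already-imported module keeps the old value
        value_stale = demo_settings.VALUE

        # `%autoreload` performs this step automatically before a cell runs
        importlib.reload(demo_settings)
        value_after = demo_settings.VALUE
    finally:
        # Clean up so the temporary module does not linger in the session
        sys.path.remove(tmp_dir)
        sys.modules.pop("demo_settings", None)

print(value_before, value_stale, value_after)
```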

Load `matplotlib` in one of the more `jupyter`-friendly [rich-output modes](https://ipython.readthedocs.io/en/stable/interactive/plotting.html). Options include `inline`, `notebook`, and `gtk`, although not all of them are guaranteed to work in every environment.

In [4]:
# Set the matplotlib mode
%matplotlib inline

## Imports

In [15]:
# Standard library imports
import json
from pathlib import Path

# Third party
import numpy as np
import pandas as pd
import seaborn as sns
from loguru import logger

# Quickly reference all git-tracked folders
import index

# Display versions of everything
%watermark -v -iv

Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.26.0

numpy  : 2.0.1
json   : 2.0.9
pandas : 2.2.2
seaborn: 0.13.2



# Loading Participant Data

In [125]:
dataset_name = "hbb_dataset_240409_121323"
study_id = "669e784b617d540aa357abf4"
path_study_id = index.dir_data_participant_responses / study_id
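
The loader defined in the next cell assumes that `path_study_id` contains one folder per participant, each holding a JSON export with a top-level `"trials"` key. The sketch below recreates that layout in a temporary directory to illustrate the lookup; the file name `responses.json` and the participant IDs are placeholders, not names from the real dataset.

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    path_study = Path(tmp_dir) / "study"  # stands in for `path_study_id`

    # Build <study>/<participant_id>/<export>.json for two fake participants
    for participant_id in ("participant_a", "participant_b"):
        participant_dir = path_study / participant_id
        participant_dir.mkdir(parents=True)
        (participant_dir / "responses.json").write_text(json.dumps({"trials": []}))

    # Map each participant folder to the first JSON file found inside it
    participant_jsons = {
        participant_dir.name: sorted(participant_dir.glob("*.json"))[0]
        for participant_dir in sorted(path_study.iterdir())
        if participant_dir.is_dir()
    }

    participant_files = {pid: path.name for pid, path in participant_jsons.items()}

print(participant_files)
```

Filtering on `is_dir()` and `*.json` is a defensive choice in this sketch: it keeps stray files in the study folder from being treated as participants.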

In [126]:
def load_all_participant_responses(
    path_study_id, 
    dataset_name,
    validate=True,
    valid_colors={1: "red", 2: "green", 3: "blue"},
    columns_to_keep=(
        "rt",
        "response",
        "trial_index",
        "internal_node_id",
        "correct_response",
    ),
):
    participant_ids = [path.stem for path in (path_study_id).iterdir()]
    participant_jsons = {
        participant_id : list((path_study_id / participant_id).iterdir())[0]
        for participant_id in participant_ids
    }

    participant_responses = {}

    for participant_id, participant_json in participant_jsons.items():
        print(f"Loading participant {participant_id}...")
        with open(participant_json) as f:
            json_participant_data = json.load(f)

        df_participant_data = pd.DataFrame(json_participant_data["trials"])
    
        # Keep only the trials that have a correct response value
        df_response = df_participant_data.loc[~df_participant_data.correct_response.isna()]
        
        # Get the paths to all the videos
        path_videos = (
            df_participant_data[np.logical_or(
                    df_participant_data.trial_type == "video-keyboard-response",
                    df_participant_data.trial_type == "video-button-response"
                )
            ]
            .stimulus.apply(lambda x: str(index.dir_public / x[0]))
            .to_list()
        )
        
        # Only keep the desired columns
        df_response = df_response[list(columns_to_keep)]
        
        # Add in the path to videos
        df_response.loc[:, "Path Video"] = path_videos

        # Split the rows into those with a response and those that were missed,
        # aligning the mask to the rows that are still present in df_response
        isna = df_participant_data.response.loc[df_response.index].isna()
        
        df_response_filtered = df_response.loc[~isna]
        df_response_dict = {"missed" : df_response.loc[isna]}
        path_videos_filtered = df_response_filtered["Path Video"].tolist()
        n_responses, n_missed = len(df_response_filtered), len(df_response_dict["missed"])
        print(f"Found {n_responses} responses and {n_missed} miss(es)")

        # Do some type conversions
        columns_astype_int = ["rt", "response", "correct_response"]
        columns_astype_int = [
            col for col in columns_astype_int if col in columns_to_keep
        ]
        for column in columns_astype_int:
            df_response_filtered[column] = df_response_filtered[column].astype(int)

        if validate:
            # All collected responses are in the valid responses
            assert all(
                val in valid_colors.keys() for val in df_response_filtered.response.unique()
            )
        
            # Compare the path colors to the correct_responses
            colors_from_video_paths = [
                Path(path).stem.split("_")[-1] for path in path_videos_filtered
            ]
            colors_from_correct_response = [
                valid_colors[val] for val in df_response_filtered.correct_response
            ]
            assert colors_from_video_paths == colors_from_correct_response
    
        # Group the responses by stage, based on the folder names in each video path
        index_videos_walkthrough = [
            i for i, path in enumerate(path_videos_filtered) if "walkthrough" in Path(path).parts
        ]
        df_response_dict["walkthrough"] = df_response_filtered.iloc[index_videos_walkthrough]
        
        index_videos_examples = [
            i for i, path in enumerate(path_videos_filtered) if "examples" in Path(path).parts
        ]
        df_response_dict["examples"] = df_response_filtered.iloc[index_videos_examples]
        
        # Everything that is neither a walkthrough nor an example is an experiment trial
        index_videos_practice = set(index_videos_walkthrough + index_videos_examples)
        index_videos_experiment = [
            i for i in range(len(path_videos_filtered))
            if i not in index_videos_practice
        ]
        df_response_dict["experiment"] = df_response_filtered.iloc[index_videos_experiment]
        
        participant_responses[participant_id] = df_response_dict

    return participant_responses

all_participant_responses = load_all_participant_responses(
    path_study_id,
    dataset_name,
    validate=True,
)

Loading participant 66b030a3b56fc1387defa633...
Found 285 responses and 0 miss(es)
Loading participant 669ead0b8baa798838ac2787...
Found 284 responses and 1 miss(es)
Loading participant 653bbdaa0cf432b1c544b303...
Found 285 responses and 0 miss(es)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_response_filtered[column] = df_response_filtered[column].astype(int)

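The warning above is raised because `df_response_filtered` is produced by a `.loc` selection and is then modified column by column. One common way to avoid it, sketched here on a toy frame with made-up values rather than the real participant data, is to take an explicit `.copy()` of the filtered rows before converting types.

```python
import pandas as pd

# Toy stand-in for the participant responses; all values are made up
df_response = pd.DataFrame(
    {
        "rt": [512.0, None, 430.0],
        "response": [1.0, None, 3.0],
        "correct_response": [1.0, 2.0, 2.0],
    }
)

# Copy the filtered rows so later assignments modify an independent frame
df_response_filtered = df_response.loc[df_response.response.notna()].copy()

# The type conversion no longer writes into a slice of `df_response`
for column in ["rt", "response", "correct_response"]:
    df_response_filtered[column] = df_response_filtered[column].astype(int)

print(df_response_filtered.dtypes)
```

The original frame is left untouched, so the rows without a response can still be inspected from `df_response` afterwards.
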
## Shortening the Participant ID

In [140]:
shortened_length = 6
participant_ids = all_participant_responses.keys()
participant_ids_short_dict = {
    participant_id[:shortened_length] : participant_id for participant_id in participant_ids
}
participant_ids_short_dict

{'66b030': '66b030a3b56fc1387defa633',
 '669ead': '669ead0b8baa798838ac2787',
 '653bbd': '653bbdaa0cf432b1c544b303'}

In [142]:
all_participant_responses = {
    key[:shortened_length] : val 
    for key, val in all_participant_responses.items()
}
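
Truncating IDs to `shortened_length` characters is only safe while the prefixes stay unique; otherwise the dictionary comprehension silently overwrites one participant with another. A minimal check, using an illustrative helper named `shorten_ids` that is not part of the notebook, might look like this:

```python
from collections import Counter


def shorten_ids(participant_ids, length=6):
    """Map each shortened ID to its full ID, failing on prefix collisions.

    `shorten_ids` is a hypothetical helper written for this example.
    """
    participant_ids = list(participant_ids)
    prefix_counts = Counter(pid[:length] for pid in participant_ids)
    collisions = sorted(prefix for prefix, count in prefix_counts.items() if count > 1)
    if collisions:
        raise ValueError(f"Shortened IDs are not unique: {collisions}")
    return {pid[:length]: pid for pid in participant_ids}


short_ids = shorten_ids(
    [
        "66b030a3b56fc1387defa633",
        "669ead0b8baa798838ac2787",
        "653bbdaa0cf432b1c544b303",
    ]
)
print(short_ids)
```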

## Overall Accuracy

In [143]:
participant_accuracies = {}
for participant_id, responses_dict in all_participant_responses.items():
    participant_accuracies[participant_id] = {
        f"Accuracy {key.title()}" : (df.correct_response == df.response).mean()
        for key, df in responses_dict.items()
        if key != "missed"
    }

pd.DataFrame.from_dict(participant_accuracies, orient='index')

| Participant ID | Accuracy Walkthrough | Accuracy Examples | Accuracy Experiment |
|---|---|---|---|
| 66b030 | 1.0 | 0.8 | 0.636364 |
| 669ead | 1.0 | 0.4 | 0.587591 |
| 653bbd | 1.0 | 0.4 | 0.552727 |
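
Each accuracy is the fraction of trials in which `response` equals `correct_response`: comparing the two columns yields booleans, and their mean is the proportion of matches. The toy data below is invented for illustration and is not taken from any participant.

```python
import pandas as pd

# Five made-up trials: the first, second, and fourth are answered correctly
df = pd.DataFrame(
    {
        "response": [1, 2, 3, 3, 1],
        "correct_response": [1, 2, 2, 3, 3],
    }
)

# Element-wise comparison gives True for a correct trial, False otherwise
is_correct = df.correct_response == df.response

# The mean of a boolean Series is the share of True values
accuracy = is_correct.mean()

print(f"{is_correct.sum()} of {len(df)} correct, accuracy = {accuracy:.2f}")
```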
