# Preliminary setup and data retrieval

Run the cell below to install the `PsyTrack` package (version 1.3). The cell also defines `SPATH`, the directory where all data files and figures produced by the notebook are saved.

The notebook relies on several standard Python packages: `numpy`, `scipy`, `matplotlib`, and `pandas`. The cell below imports the ones needed up front and sets several `matplotlib` parameters to standardize the figures produced.

In [None]:
import os
import re
from IPython.display import clear_output
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

# Install then import PsyTrack
!pip install psytrack==1.3
import psytrack as psy

# Set save path for all data files and figures, creating the directory if needed
SPATH = "ColabFigureData/"
!mkdir -p "{SPATH}"

# Set matplotlib defaults for making files consistent in Illustrator
colors = psy.COLORS
zorder = psy.ZORDER
plt.rcParams['figure.dpi'] = 140
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['savefig.facecolor'] = (1,1,1,0)
plt.rcParams['savefig.bbox'] = "tight"
plt.rcParams['font.size'] = 10
# plt.rcParams['font.family'] = 'sans-serif'     # not available in Colab
# plt.rcParams['font.sans-serif'] = 'Helvetica'  # not available in Colab
plt.rcParams['pdf.fonttype'] = 42
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['axes.labelsize'] = 12

clear_output()

---

## Download and pre-process IBL mouse data

1) Use the commands below to install the IBL's [ONE Light](https://github.com/int-brain-lab/ibllib/tree/master/oneibl) Python library, download the [IBL mouse behavior dataset](https://doi.org/10.6084/m9.figshare.11636748.v7) _(version 7, uploaded February 7, 2020)_ to our `SPATH` directory as `ibl-behavior-data-Dec2019.zip`, and unzip the file.

In [None]:
!pip install ibllib
!wget -nc -O "{SPATH}ibl-behavior-data-Dec2019.zip" "https://ndownloader.figshare.com/files/21623715"
!unzip -d "{SPATH}" -n "{SPATH}ibl-behavior-data-Dec2019.zip"
clear_output()

2) Use the [ONE Light](https://github.com/int-brain-lab/ibllib/tree/master/oneibl) library to build a table of all the subject and session data contained within the dataset.

In [None]:
from oneibl.onelight import ONE

ibl_data_path = SPATH + 'ibl-behavioral-data-Dec2019'
current_cwd = os.getcwd()
os.chdir(ibl_data_path)

# Search all sessions that have these dataset types.
required_vars = ['_ibl_trials.choice', '_ibl_trials.contrastLeft',
                 '_ibl_trials.contrastRight','_ibl_trials.feedbackType']
one = ONE()
eids = one.search(required_vars)

# Collect one record per session, then build the table in a single step
# (DataFrame.append was removed in pandas 2.0)
sessions = []
for eid in eids:
    lab, _, subject, date, session = eid.split("/")
    sessions.append({
        "eid": eid,
        "lab": lab,
        "subject": subject,
        "date": date,
        "session": session,
    })
mouseData = pd.DataFrame(sessions)

os.chdir(current_cwd)
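
As a minimal sketch of how the session table is assembled, the snippet below splits a few session identifiers into their five `/`-separated fields and collects them into a `DataFrame`. The identifiers here are hypothetical placeholders (the real ones come from `one.search`), and the unused second field is simply discarded, as in the cell above.

```python
import pandas as pd

# Hypothetical session identifiers with five "/"-separated fields
# (real identifiers are returned by one.search in the cell above)
eids = [
    "labA/Subjects/mouse_01/2019-08-01/001",
    "labA/Subjects/mouse_01/2019-08-02/001",
    "labB/Subjects/mouse_07/2019-08-01/002",
]

rows = []
for eid in eids:
    # The second field is not needed, so it is assigned to "_"
    lab, _, subject, date, session = eid.split("/")
    rows.append(dict(eid=eid, lab=lab, subject=subject, date=date, session=session))

table = pd.DataFrame(rows)

# Number of sessions found for each subject
print(table.groupby("subject").size())
```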

3) Next, the cell below uses the table of session data to process the raw trial data into a single CSV file, `ibl_processed.csv`, saved to our `SPATH` directory.

There are several known anomalies in the raw data:
 - CSHL_002 codes left contrasts as negative right contrasts on 81 trials (these trials are corrected)
 - ZM_1084 has `feedbackType` of 0 for 3 trials (these trials are omitted)
 - DY_009, DY_010, DY_011 each have <5000 trials total (no adjustment)
 - ZM_1367, ZM_1369, ZM_1371, ZM_1372, and ZM_1743 are shown non-standard contrast values of 0.04 and 0.08 (no adjustment)

In [None]:
all_vars = ["contrastLeft", "contrastRight", "choice", "feedbackType", "probabilityLeft"]
df = pd.DataFrame()

all_mice = []
for j, s in enumerate(mouseData["subject"].unique()):
    print("\rProcessing " + str(j+1) + " of " + str(len(mouseData["subject"].unique())), end="")
    mouse = mouseData[mouseData["subject"]==s].sort_values(['date', 'session']).reset_index()
    for i, row in mouse.iterrows():
        myVars = {}
        for v in all_vars:
            filename = "_ibl_trials." + v + ".npy"
            var_file = os.path.join(ibl_data_path, row.eid, "alf", filename)
            myVars[v] = list(np.load(var_file).flatten())

        num_trials = len(myVars[all_vars[0]])  # every variable has one entry per trial
        myVars['lab'] = [row.lab]*num_trials
        myVars['subject'] = [row.subject]*num_trials
        myVars['date'] = [row.date]*num_trials
        myVars['session'] = [row.session]*num_trials

        all_mice += [pd.DataFrame(myVars, columns=myVars.keys())]
        
df = pd.concat(all_mice, ignore_index=True)

df = df[df['choice'] != 0]        # dump mistrials
df = df[df['feedbackType'] != 0]  # 3 anomalous trials from ZM_1084, omit
df.loc[np.isnan(df['contrastLeft']), "contrastLeft"] = 0
df.loc[np.isnan(df['contrastRight']), "contrastRight"] = 0
df.loc[df["contrastRight"] < 0, "contrastLeft"] = np.abs(df.loc[df["contrastRight"] < 0, "contrastRight"])
df.loc[df["contrastRight"] < 0, "contrastRight"] = 0  # 81 anomalous trials in CSHL_002, correct
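# feedbackType * choice equals the choice on correct trials and its opposite on
# incorrect trials, so it gives the correct answer in the same coding as choice
df["answer"] = df["feedbackType"] * df["choice"]      # new column to indicate correct answer
# Recode all three columns from the raw {-1, 1} coding to {0, 1}
df.loc[df["answer"]==1, "answer"] = 0
df.loc[df["answer"]==-1, "answer"] = 1
df.loc[df["feedbackType"]==-1, "feedbackType"] = 0    # incorrect trials: -1 -> 0
df.loc[df["choice"]==1, "choice"] = 0
df.loc[df["choice"]==-1, "choice"] = 1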
df.to_csv(SPATH+"ibl_processed.csv", index=False)
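
To make the cleaning steps concrete, here is a small sketch that applies the same operations to a toy table. All trial values are made up for illustration, and the recoding is written with boolean comparisons rather than the sequential `.loc` assignments above, which should give the same result for columns containing only -1 and 1.

```python
import numpy as np
import pandas as pd

# Toy trials (made-up values) in the raw coding:
# choice in {-1, 0, 1}, feedbackType in {-1, 1}, NaN for an absent contrast
toy = pd.DataFrame({
    "contrastLeft":  [0.5, np.nan, np.nan, 1.0, np.nan],
    "contrastRight": [np.nan, 0.25, -0.125, np.nan, 1.0],
    "choice":        [1, -1, -1, 0, 1],
    "feedbackType":  [1, 1, -1, -1, -1],
})

toy = toy[toy["choice"] != 0].copy()                       # drop the mistrial
toy = toy.fillna({"contrastLeft": 0, "contrastRight": 0})  # absent contrast -> 0

# A negative right contrast is moved over to the left side
neg = toy["contrastRight"] < 0
toy.loc[neg, "contrastLeft"] = toy.loc[neg, "contrastRight"].abs()
toy.loc[neg, "contrastRight"] = 0

# Recode the {-1, 1} columns to {0, 1}
product = toy["feedbackType"] * toy["choice"]              # +1 or -1
toy["answer"] = (product == -1).astype(int)                # -1 -> 1, +1 -> 0
toy["feedbackType"] = (toy["feedbackType"] == 1).astype(int)
toy["choice"] = (toy["choice"] == -1).astype(int)

print(toy)
```

In this toy table, `answer` is 0 exactly on the trials whose stimulus ends up in `contrastLeft`, and `feedbackType` is 1 exactly when `choice` matches `answer`.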

4) Next, we run a few sanity checks on our data to make sure everything was processed correctly.

In [None]:
print("contrastLeft: ", np.unique(df['contrastLeft']))   # [0, 0.0625, 0.125, 0.25, 0.5, 1.0] and [0.04, 0.08]
print("contrastRight: ", np.unique(df['contrastRight'])) # [0, 0.0625, 0.125, 0.25, 0.5, 1.0] and [0.04, 0.08]
print("choice: ", np.unique(df['choice']))               # [0, 1]
print("feedbackType: ", np.unique(df['feedbackType']))   # [0, 1]
print("answer: ", np.unique(df['answer']))               # [0, 1]
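
The printed values can also be checked programmatically. The sketch below is one possible way to do so, assuming the expected value sets listed in the comments above; `check_processed` is a hypothetical helper, not part of the notebook or of PsyTrack, and the demo table contains made-up values.

```python
import pandas as pd

# Expected contrast levels, including the non-standard 0.04 and 0.08
ALLOWED_CONTRASTS = {0, 0.04, 0.0625, 0.08, 0.125, 0.25, 0.5, 1.0}

def check_processed(df):
    """Hypothetical helper: raise AssertionError if a column holds unexpected values."""
    for col in ("contrastLeft", "contrastRight"):
        assert set(df[col].unique()) <= ALLOWED_CONTRASTS, col
    for col in ("choice", "feedbackType", "answer"):
        assert set(df[col].unique()) <= {0, 1}, col
    return True

# Made-up rows standing in for the processed table
demo = pd.DataFrame({
    "contrastLeft":  [0.0, 0.25, 0.0625],
    "contrastRight": [1.0, 0.0, 0.0],
    "choice":        [1, 0, 1],
    "feedbackType":  [1, 1, 0],
    "answer":        [1, 0, 0],
})
print(check_processed(demo))
```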

5) Finally, we define a function `getMouse` that extracts the data for a single mouse from our CSV file and returns it as a PsyTrack-compatible `dict`. We will use this function to access IBL mouse data in the figures below. The keyword argument `p` (default $p=5$) controls the strength of the $\tanh$ transformation applied to the contrast values. See Figure S3 and the STAR Methods of the accompanying paper for more details.

**Note:** After steps 1-5 have been run once, only step 5 needs to be rerun on subsequent uses.

In [None]:
ibl_mouse_data_path = SPATH + "ibl_processed.csv"

MOUSE_DF = pd.read_csv(ibl_mouse_data_path)
def getMouse(subject, p=5):
    df = MOUSE_DF[MOUSE_DF['subject']==subject]   # Restrict data to the subject specified
    
    cL = np.tanh(p*df['contrastLeft'])/np.tanh(p)   # tanh transformation of left contrasts
    cR = np.tanh(p*df['contrastRight'])/np.tanh(p)  # tanh transformation of right contrasts
    inputs = dict(cL = np.array(cL)[:, None], cR = np.array(cR)[:, None])

    dat = dict(
        subject=subject,
        lab=np.unique(df["lab"])[0],
        contrastLeft=np.array(df['contrastLeft']),
        contrastRight=np.array(df['contrastRight']),
        date=np.array(df['date']),
        dayLength=np.array(df.groupby(['date','session']).size()),
        correct=np.array(df['feedbackType']),
        answer=np.array(df['answer']),
        probL=np.array(df['probabilityLeft']),
        inputs = inputs,
        y = np.array(df['choice'])
    )
    
    return dat
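
To get a feel for the $\tanh$ transformation, the sketch below applies the same formula used in `getMouse` to the standard contrast levels. The helper name `tanh_contrast` is ours for illustration and is not part of PsyTrack.

```python
import numpy as np

def tanh_contrast(c, p=5):
    """Rescale contrasts with tanh(p*c)/tanh(p), so that 0 maps to 0 and 1 maps to 1."""
    return np.tanh(p * np.asarray(c, dtype=float)) / np.tanh(p)

contrasts = np.array([0, 0.0625, 0.125, 0.25, 0.5, 1.0])
transformed = tanh_contrast(contrasts)

for c, t in zip(contrasts, transformed):
    print(f"{c:6.4f} -> {t:.3f}")
```

With the default $p=5$, a contrast of 0.5 already maps to roughly 0.99, so the transformation mainly stretches the low-contrast range and compresses the high-contrast range.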

---

## Download and pre-process Akrami rat data

1) Download the [Akrami rat behavior dataset](https://doi.org/10.6084/m9.figshare.12213671.v1) _(version 1, uploaded May 18, 2020)_ to the `SPATH` directory as `rat_behavior.csv`.

In [None]:
!wget -nc -O "{SPATH}rat_behavior.csv" "https://ndownloader.figshare.com/files/22461707"
clear_output()

2) Sessions in the data corresponding to early shaping stages will be omitted, as will all mistrials (see the dataset's README for more info). The `getRat` function will then load a particular rat into a PsyTrack-compatible `dict`.

`getRat` has two optional parameters: `first` restricts the returned data to a rat's first `first` trials (the default of 20,000 works for all analyses), and `cutoff` excludes sessions with fewer than `cutoff` valid trials (default 50). We will use this function to access Akrami rat data in the figures below.
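
As a small illustration of how `first` and `cutoff` interact, the sketch below applies the same two operations (slicing, then `groupby(...).filter(...)`) to a toy table. The session and trial values are made up, and the parameter values are deliberately tiny.

```python
import pandas as pd

# Toy table (made-up values): three sessions with 3, 1, and 2 trials
toy = pd.DataFrame({
    "session": [1, 1, 1, 2, 3, 3],
    "trial":   [1, 2, 3, 1, 1, 2],
})

first, cutoff = 5, 2

subset = toy[:first]  # keep only the first 5 trials
# Drop sessions that have fewer than `cutoff` trials left
subset = subset.groupby("session").filter(lambda x: len(x) >= cutoff)

print(subset)
```

Because the slice is applied before the filter, a session that is cut short by `first` (session 3 here) can fall below `cutoff` and be dropped along with the genuinely short session 2.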

In [None]:
akrami_rat_data_path = SPATH + "rat_behavior.csv"

RAT_DF = pd.read_csv(akrami_rat_data_path)
RAT_DF = RAT_DF[RAT_DF["training_stage"] > 2]  # Remove trials from early training
RAT_DF = RAT_DF[~np.isnan(RAT_DF["choice"])]   # Remove mistrials
def getRat(subject, first=20000, cutoff=50):

    df = RAT_DF[RAT_DF['subject_id']==subject]  # restrict dataset to single subject
    df = df[:first]  # restrict to "first" trials of data
    # remove sessions with fewer than "cutoff" valid trials
    df = df.groupby('session').filter(lambda x: len(x) >= cutoff)   

    # Normalize the stimuli to standard normal
    s_a = (df["s_a"] - np.mean(df["s_a"]))/np.std(df["s_a"])
    s_b = (df["s_b"] - np.mean(df["s_b"]))/np.std(df["s_b"])
    
    # Determine which trials do not have a valid previous trial (mistrial or session boundary)
    t = np.array(df["trial"])
    prior = ((t[1:] - t[:-1]) == 1).astype(int)
    prior = np.hstack(([0], prior))

    # Calculate previous average tone value
    s_avg = (df["s_a"][:-1] + df["s_b"][:-1])/2
    s_avg = (s_avg - np.mean(s_avg))/np.std(s_avg)
    s_avg = np.hstack(([0], s_avg))
    s_avg = s_avg * prior  # for trials without a valid previous trial, set to 0

    # Calculate previous correct answer
    h = (df["correct_side"][:-1] * 2 - 1).astype(int)   # map from (0,1) to (-1,1)
    h = np.hstack(([0], h))
    h = h * prior  # for trials without a valid previous trial, set to 0
    
    # Calculate previous choice
    c = (df["choice"][:-1] * 2 - 1).astype(int)   # map from (0,1) to (-1,1)
    c = np.hstack(([0], c))
    c = c * prior  # for trials without a valid previous trial, set to 0
    
    inputs = dict(s_a = np.array(s_a)[:, None],
                  s_b = np.array(s_b)[:, None],
                  s_avg = np.array(s_avg)[:, None],
                  h = np.array(h)[:, None],
                  c = np.array(c)[:, None])

    dat = dict(
        subject = subject,
        inputs = inputs,
        s_a = np.array(df['s_a']),
        s_b = np.array(df['s_b']),
        correct = np.array(df['hit']),
        answer = np.array(df['correct_side']),
        y = np.array(df['choice']),
        dayLength=np.array(df.groupby(['session']).size()),
    )
    return dat
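
The history regressors in `getRat` rely on the `prior` mask to zero out trials with no valid predecessor. The sketch below reproduces that logic on a toy sequence. The trial numbers and choices are made up, and it assumes (as the comment in the cell above implies) that trial numbers are not consecutive across a removed mistrial or a session boundary.

```python
import numpy as np

# Made-up trial numbers: trial 3 was a removed mistrial, and the
# counter restarts when a new session begins
trial = np.array([1, 2, 4, 5, 1, 2])
choice = np.array([1, 0, 0, 1, 1, 0])  # coded 0/1

# prior[i] is 1 only when trial i directly follows trial i-1
prior = np.hstack(([0], (np.diff(trial) == 1).astype(int)))

# Previous choice mapped from {0, 1} to {-1, +1},
# then set to 0 wherever there is no valid previous trial
prev_choice = np.hstack(([0], choice[:-1] * 2 - 1)) * prior

print(prior)        # [0 1 0 1 0 1]
print(prev_choice)  # [ 0  1  0 -1  0  1]
```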

---

## Download and pre-process Akrami human subject data

1) Download the [Akrami human subject behavior dataset](https://doi.org/10.6084/m9.figshare.12213671.v1) _(version 1, uploaded May 18, 2020)_ to the `SPATH` directory as `human_auditory.csv`. See the dataset's README for more info.

In [None]:
!wget -nc -O "{SPATH}human_auditory.csv" "https://ndownloader.figshare.com/files/22461695"
clear_output()

2) We define a function `getHuman` that extracts the data for a single human subject from the downloaded CSV file and returns it as a PsyTrack-compatible `dict`. We will use this function to access Akrami human subject data in the figures below.

In [None]:
akrami_human_data_path = SPATH + "human_auditory.csv"

HUMAN_DF = pd.read_csv(akrami_human_data_path)
def getHuman(subject):
    
    df = HUMAN_DF[HUMAN_DF['subject_id']==subject]
    
    s_a = (df["s_a"] - np.mean(df["s_a"]))/np.std(df["s_a"])
    s_b = (df["s_b"] - np.mean(df["s_b"]))/np.std(df["s_b"])
    
    s_avg = (df["s_a"][:-1] + df["s_b"][:-1])/2
    s_avg = (s_avg - np.mean(s_avg))/np.std(s_avg)
    s_avg = np.hstack(([0], s_avg))
    
    inputs = dict(s_a = np.array(s_a)[:, None],
                  s_b = np.array(s_b)[:, None],
                  s_avg = np.array(s_avg)[:, None])

    dat = dict(
        subject = subject,
        inputs = inputs,
        s_a = np.array(df['s_a']),
        s_b = np.array(df['s_b']),
        correct = np.array(df['reward']),
        answer = np.array(df['correct_side']),
        y = np.array(df['choice'])
    )
    return dat
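
The `s_avg` regressor is the z-scored average of the two stimuli on the previous trial, with a 0 for the first trial, which has no predecessor. A minimal sketch of that computation on made-up stimulus values:

```python
import numpy as np

# Made-up stimulus values for four consecutive trials
s_a = np.array([60.0, 70.0, 80.0, 90.0])
s_b = np.array([70.0, 60.0, 90.0, 80.0])

# Average of the two stimuli on every trial except the last
avg = (s_a[:-1] + s_b[:-1]) / 2

# z-score (np.std uses the population standard deviation by default)
z = (avg - np.mean(avg)) / np.std(avg)

# Shift by one trial: entry i now describes trial i-1, and the first entry is 0
s_avg = np.hstack(([0], z))

print(s_avg)
```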