# Psychophysics Tutorial I: Signal Detection Theory and 2AFC

This is the first Psychophysics tutorial, covering Signal Detection
Theory, ROC curves and the 2AFC paradigm.  See also the sdtTutorial which
covers some of the same material.

Written by G.M. Boynton for CSHL 2012

Python translation M.L. Waskom at CSHL 2018

In [None]:
%matplotlib inline
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from ipywidgets import interact

----

## Signal Detection Theory

Suppose you want to determine if a subject can reliably detect a weak
stimulus.  The simplest experiment would be to present this stimulus over
multiple trials and ask if the subject saw it.  But this won't work
because, for example, the subject could simply say 'yes' on each trial.
To alleviate this, catch trials can be included to keep the subject from
cheating.

Now suppose you introduce catch trials (no stimulus trials) randomly on
half of the trials.  The subject's task is to determine if the signal was
present on any given trial.  Stimulus present trials are called 'signal'
trials, and stimulus absent trials are called 'noise' trials. A subject
that guesses, or says 'yes' or 'no' on every trial will be performing at
50%, or chance level.  No more cheating.

There is a range of stimulus intensities where a subject will perform
somewhere between chance and 100% correct performance.  The presence of
such a 'soft' threshold is most commonly explained in terms of Signal
Detection Theory (SDT).

SDT assumes that subjects base their decision on an internal response to
a stimulus that varies from trial to trial.  If this internal response
exceeds some criterion, the subject reports having perceived the
stimulus.

This trial-to-trial variability of the internal response could be due to
variability in the stimulus itself (as in the case of Poisson noise for
very dim lights), or to random neuronal noise at the sensory
representation of the stimulus, or due to higher level variability in the
attentional or motivational state of the subject.  

Most commonly, this variability is modeled as a normal distribution
centered around some mean. The simplest implementation has the mean
response for the noise trials be zero and signal trials some larger
value, with the standard deviations of the signal and noise responses the
same.  

Here are some example parameters, all in a single structure:



In [None]:
p = dict(noise_mean=0, signal_mean=1, sd=1)

Here is a graph of the probability distribution for the internal responses to signal and noise trials:

In [None]:
z = np.linspace(-4, 6, 200)
f, ax_z = plt.subplots()
noise_dist = norm(p["noise_mean"], p["sd"])
signal_dist = norm(p["signal_mean"], p["sd"])
ax_z.plot(z, noise_dist.pdf(z), label="noise")
ax_z.plot(z, signal_dist.pdf(z), label="signal")
ax_z.set(xlabel="Internal response")
ax_z.legend(loc="best")
ax_z.figure.tight_layout()

We next need to set a criterion value for determining which internal responses lead to 'Yes' responses.  We'll show it in the figure:

In [None]:
p["criterion"] = 1
ax_z.axvline(p["criterion"], ls="--", color=".6", label="criterion")
ax_z.legend()
display(ax_z.figure)

On any trial, one of four things will happen. Either the signal is
present or absent crossed with the subject reporting 'yes' or 'no'.
Trial types are labeled this way:

|         |  "Yes"    |  "No"    |
|---------|-----------|----------|
|Present  |    Hit    |  Miss    |
|Absent   |  False alarm   | Correct rejection  |

It's easy to see that SDT predicts the probability of each of these four
trial types by areas under the normal curve.  The probability of a hit is
the probability of drawing a value above the criterion, given that it
came from the signal distribution:


In [None]:
p_hit = signal_dist.sf(p["criterion"])
print(f"P(Hit) = {p_hit:.2f}")

In [None]:
p_fa = noise_dist.sf(p["criterion"])
print(f"P(FA) = {p_fa:.2f}")

The whole table looks like this:

In [None]:
display(Markdown(
f"""
||"Yes"|"No"|
|-|-|-|
|Present|{p_hit:.1%}|{1-p_hit:.1%}|
|Absent|{p_fa:.1%}|{1-p_fa:.1%}|
"""))

Since half the trials are signal trials, the overall performance will be the average of the hit and correct rejection rates:

In [None]:
p_c = .5 * p_hit + .5 * (1 - p_fa)
print(f"P(Correct) = {p_c:.1%}")

---

Play around with the parameters. See how:

1. If you shift the criterion very low or very high, performance falls to chance.
2. Performance is maximized when the criterion is halfway between the signal and noise means.  This is the criterion an 'ideal observer' should choose. 
3. Performance increases as either the standard deviation decreases or the difference between the signal and noise means increases.
4. You can offset an increase in the difference between the signal and noise means by increasing the standard deviation by the same factor. The model is over-parameterized.
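
The first two points can also be checked numerically rather than by hand. The following is a minimal sketch that sweeps the criterion and reports where percent correct peaks. The parameter values are assumed to match the example above, and the variable names are chosen here so that they don't overwrite anything defined earlier in the notebook.

```python
import numpy as np
from scipy.stats import norm

# Assumed example parameters, matching the values used earlier in the tutorial
noise_mean_sweep, signal_mean_sweep, sd_sweep = 0.0, 1.0, 1.0

# Sweep the criterion and compute percent correct, assuming equal
# numbers of signal and noise trials
criteria = np.linspace(-4, 5, 901)
p_hit_sweep = norm.sf(criteria, signal_mean_sweep, sd_sweep)
p_fa_sweep = norm.sf(criteria, noise_mean_sweep, sd_sweep)
p_c_sweep = .5 * p_hit_sweep + .5 * (1 - p_fa_sweep)

best_criterion = criteria[p_c_sweep.argmax()]
print(f"Best criterion: {best_criterion:.2f}")
print(f"Best P(Correct): {p_c_sweep.max():.1%}")
print(f"P(Correct) at the extremes: {p_c_sweep[0]:.1%}, {p_c_sweep[-1]:.1%}")
```

With these values the peak should land midway between the two means, and the extreme criteria should sit close to 50%.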

---

## Estimating d-prime from Hits and False Alarms

You should see that simply reporting percent correct in a yes/no
experiment is a problem because performance varies with criterion: you
cannot estimate d-prime from percent correct alone.

Fortunately we can estimate d-prime by finding the difference in the
corresponding z-values from the hit and false alarm rates:


In [None]:
z_hit = norm.ppf(p_hit)
z_fa = norm.ppf(p_fa)
dprime_est = z_hit - z_fa
print(f"Z_hit: {z_hit:.1f} | Z_fa: {z_fa:.1f} | d': {dprime_est:.1f}")

This is a criterion-free estimate of d-prime, and is what is often reported instead of percent correct for a Yes/No experiment.
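
To see why the criterion drops out, it helps to write the two z-values in terms of the model parameters. The symbols here are introduced only for this sketch: $c$ is the criterion, $\mu_S$ and $\mu_N$ are the signal and noise means, $\sigma$ is the shared standard deviation, and $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution (what `norm.ppf` computes).

$$
z_{hit} = \Phi^{-1}\big(P(\mathrm{Hit})\big) = \frac{\mu_S - c}{\sigma},
\qquad
z_{FA} = \Phi^{-1}\big(P(\mathrm{FA})\big) = \frac{\mu_N - c}{\sigma}
$$

$$
z_{hit} - z_{FA} = \frac{\mu_S - c}{\sigma} - \frac{\mu_N - c}{\sigma} = \frac{\mu_S - \mu_N}{\sigma} = d^\prime
$$

The criterion $c$ cancels in the difference, which is why the estimate does not depend on where the subject places it.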

Use this interactive widget to play with the parameters.  See how $d^\prime$ stays constant for different criterion values.

In [None]:
@interact
def dprime_tutorial(signal_mean=(0, 2, .1), sd=(.1, 2, .1), criterion=(0, 2, .1)):

    signal_dist = norm(signal_mean, sd)    
    noise_dist = norm(0, sd)

    p_hit = signal_dist.sf(criterion)
    p_fa = noise_dist.sf(criterion)
    p_c = .5 * p_hit + .5 * (1 - p_fa)
    z_hit = norm.ppf(p_hit)
    z_fa = norm.ppf(p_fa)
    dprime_est = z_hit - z_fa

    f, ax = plt.subplots()
    z = np.linspace(-4, 6, 200)
    ax.plot(z, noise_dist.pdf(z), label="noise")
    ax.plot(z, signal_dist.pdf(z), label="signal")
    ax.axvline(criterion, ls="--", color=".6", label="criterion")

    text = f"P(correct) = {p_c:.1%}\n$d^\\prime$ = {dprime_est:.1f}"
    ax.text(.05, .85, text, size=12, transform=ax.transAxes)

    ax.set(xlabel="Internal response")
    ax.legend(loc="best")
    ax.figure.tight_layout()

---

## The ROC curve

The criterion determines the trade-off between hits and false alarms.  A
low (liberal) criterion is sure to get a hit but will lead to lots of
false alarms. A high (conservative) criterion will miss a lot of signals,
but will also minimize false alarms.  This trade-off is typically
visualized in the form of a 'Receiver Operating Characteristic' or ROC
curve.  An ROC curve is a plot of hits against false alarms for a range
of criterion values:


In [None]:
p_hits = norm.sf(z, p["signal_mean"], p["sd"])
p_fas = norm.sf(z, p["noise_mean"], p["sd"])
f, ax_roc = plt.subplots()
ax_roc.plot([0, 1], [0, 1], c=".5", ls=":")
ax_roc.plot(p_fas, p_hits)
ax_roc.set(xlim=(0, 1), ylim=(0, 1), aspect="equal",
       xlabel="p(FA)", ylabel="p(Hit)");

We can plot our example hit rate against our example FA rate too:

In [None]:
ax_roc.scatter(p_fa, p_hit)
display(ax_roc.figure)

Play around again.  See how:

1. The point in the ROC curve moves around as you vary the criterion.  

2. The 'bow' of the ROC curve varies with d-prime (either by increasing
the signal mean or reducing the standard deviation).

In [None]:
@interact
def roc_tutorial(signal_mean=(0, 2, .1), sd=(.1, 2, .1), criterion=(0, 2, .1)):
    
    signal_dist = norm(signal_mean, sd)
    noise_dist = norm(0, sd)

    p_hit = signal_dist.sf(criterion)
    p_fa = noise_dist.sf(criterion)

    p_c = .5 * p_hit + .5 * (1 - p_fa)
    z_hit = norm.ppf(p_hit)
    z_fa = norm.ppf(p_fa)
    dprime_est = z_hit - z_fa

    p_hits = signal_dist.sf(z)
    p_fas = noise_dist.sf(z)

    f, ax = plt.subplots()
    ax.plot([0, 1], [0, 1], c=".5", ls=":")
    ax.plot(p_fas, p_hits)
    ax.scatter(p_fa, p_hit)
    ax.set(xlim=(0, 1), ylim=(0, 1), aspect="equal",
           xlabel="p(FA)", ylabel="p(Hit)",)

    text = f"P(correct) = {p_c:.1%}\n$d^\\prime$ = {dprime_est:.1f}"
    ax.text(.45, .05, text, size=12, transform=ax.transAxes)
    ax.figure.tight_layout()

## Area under the ROC curve

You hopefully saw that increasing d-prime increases the bow of the ROC
curve away from the diagonal.  A measure of this bowing is the area under
the ROC curve.  This can be estimated by numerically integrating the
sampled curve.  We'll use scipy's `trapezoid` function (called `trapz`
in older SciPy releases). (The negative sign is needed because the ROC
curve is traced from right to left for increasing criterion values, so
`p_fas` is in decreasing order.)

In [None]:
from scipy.integrate import trapezoid
roc_area = -trapezoid(p_hits, p_fas)
print(f"Area under ROC curve: {roc_area:.2f}")

We'll see later that this area has a special meaning - it's the percent
correct that is expected in a two-alternative forced choice (2AFC)
experiment.

## The relationship between d-prime and the area under the ROC curve

d-prime can be calculated from the area under the ROC curve by:

In [None]:
dprime_from_area = np.sqrt(2) * norm.ppf(roc_area)
print(f"d' from area under ROC curve: {dprime_from_area:.2f}")

The calculus behind this is interesting but we'll pass on it.
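
Even without the calculus, the relationship can be checked numerically. The formula above is the inverse of area $= \Phi(d^\prime / \sqrt{2})$, where $\Phi$ is the standard normal cumulative distribution. The sketch below compares that prediction with a numerically integrated ROC area for a few d-prime values. The grid and the list of d-prime values are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

# A wide, fine grid of criterion values (an arbitrary choice for this check)
criteria_grid = np.linspace(-10, 12, 4001)

dprime_values = [0.5, 1.0, 2.0]
areas, predictions = [], []
for d in dprime_values:
    # Noise is standard normal; signal is shifted by d (sd = 1 for both)
    p_fas_d = norm.sf(criteria_grid, 0, 1)
    p_hits_d = norm.sf(criteria_grid, d, 1)
    # p_fas_d decreases along the grid, hence the sign flip
    area = -trapezoid(p_hits_d, p_fas_d)
    predicted = norm.cdf(d / np.sqrt(2))
    areas.append(area)
    predictions.append(predicted)
    print(f"d' = {d:.1f} | numerical area: {area:.4f} | predicted: {predicted:.4f}")
```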

## Simulating a Yes/No experiment

Next we'll use SDT to simulate a subject's response to a series of trials
in a Yes/No experiment and estimate the d-prime value that was used in
the simulation.

In [None]:
n_trials = 100
signal = np.random.rand(n_trials) > .5

# Generate the internal response for each trial:
x = np.where(signal, signal_dist.rvs(n_trials), noise_dist.rvs(n_trials))

# Simulate responses and behavioral metrics
response = x > p["criterion"]
p_hit_sim = response[signal].mean()
p_fa_sim = response[~signal].mean()

display(Markdown(
f"""
||"Yes"|"No"|
|-|-|-|
|Present|{p_hit_sim:.1%}|{1-p_hit_sim:.1%}|
|Absent|{p_fa_sim:.1%}|{1-p_fa_sim:.1%}|
"""))

# Plot it on the ROC curve from above:
ax_roc.scatter(p_fa_sim, p_hit_sim, c="C1")
display(ax_roc.figure)

Calculate d prime:

In [None]:
z_hit_sim = norm.ppf(p_hit_sim)
z_fa_sim = norm.ppf(p_fa_sim)
dprime_sim = z_hit_sim - z_fa_sim
print(f"d' from simulation: {dprime_sim:.2f}")

---

Compare the simulated values to the expected values from the SDT model.
You can run this section over and over to look at the variability of the
d-prime estimate. You can see that:

1) The estimates of d-prime become more accurate with increasing number
of trials

2) The estimates of d-prime become less accurate with criterion values
that deviate from the ideal value.  This is important. We typically don't
have control over the criterion.  If a lame subject says 'yes' or 'no'
almost all the time then there is very little information for estimating
d-prime.

3) If you're motivated, add a loop to simulate a bunch of simulations to
estimate the variability in the estimate for a range of model parameters.

Simulations like this illustrate an often neglected fact:  A 'perfect'
subject that makes decisions according to Signal Detection Theory will
still have variability in performance from experimental run to
experimental run.  That is, ideal observers will still generate data
with finite-sized error bars.  Simulations can give you a feel for how
small the error bars should be under ideal conditions.
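
Here is one way the loop suggested in point 3 might look. This is a sketch, not the original tutorial's code: the helper `simulate_yes_no_dprime` is a name made up for this example, and it applies a log-linear correction (adding 0.5 to the counts and 1 to the totals) so that hit or false alarm rates of exactly 0 or 1 don't produce infinite z-values. That correction is an assumption of this sketch and slightly biases the estimates.

```python
import numpy as np
from scipy.stats import norm

def simulate_yes_no_dprime(n, criterion, signal_mean=1.0, sd=1.0, rng=None):
    """Simulate one Yes/No run (noise mean 0) and return the estimated d-prime."""
    rng = np.random.default_rng() if rng is None else rng
    is_signal = rng.random(n) > .5
    x = rng.normal(np.where(is_signal, signal_mean, 0.0), sd)
    said_yes = x > criterion
    n_signal = is_signal.sum()
    n_noise = n - n_signal
    # Assumption: log-linear correction keeps the rates away from 0 and 1,
    # where norm.ppf would return +/- infinity
    hit_rate = (said_yes[is_signal].sum() + .5) / (n_signal + 1)
    fa_rate = (said_yes[~is_signal].sum() + .5) / (n_noise + 1)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

rng_sim = np.random.default_rng(0)
n_sims = 500
sd_by_condition = {}
for n in [50, 100, 400]:
    for crit in [0.5, 2.0]:
        runs = np.array([simulate_yes_no_dprime(n, crit, rng=rng_sim)
                         for _ in range(n_sims)])
        sd_by_condition[(n, crit)] = runs.std()
        print(f"n = {n:3d} | criterion = {crit:.1f} | "
              f"mean d' = {runs.mean():.2f} | SD = {runs.std():.2f}")
```

The spread of the estimates should shrink as the number of trials grows, and should be larger for the criterion that sits far from the ideal value of 0.5.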

----

## Yes/No with rating scales

One way to get around the criterion problem is to allow subjects more
options in their response.  For example, rather than having two buttons,
let them have four to indicate their confidence that a signal was
present:

1. Definitely no
2. Probably no
3. Probably yes
4. Definitely yes

This effectively allows the subject to have more than one criterion.  To
model this with SDT, three criterion values will divide the internal
response range into the four response categories:


In [None]:
p["criterion"] = [-.5, .5, 1.5]
for c in p["criterion"]:
    ax_z.axvline(c, ls=":", c=".5")
display(ax_z.figure)

## Rating scales and the ROC curve

We can visualize these criterion values on the ROC curve like we did for
the single criterion earlier:

In [None]:
p_hit = signal_dist.sf(p["criterion"])
p_fa = noise_dist.sf(p["criterion"])

# Clean up the ROC plot
plt.setp(ax_roc.collections, visible=False)

# Plot the new expected points
ax_roc.scatter(p_fa, p_hit, c="C0")
display(ax_roc.figure)

For example, `p_hit` for the highest point on the ROC curve (the lowest
criterion) corresponds to the probability that the subject will report a
2, 3 or 4 on a signal trial.  The next one down is for 3 or 4, and the
lowest is the probability of a hit if the subject only responds 4.

## Simulating a Yes/No experiment with rating scales


In [None]:
n_trials = 100
signal = np.random.rand(n_trials) > .5
x = np.where(signal, signal_dist.rvs(n_trials), noise_dist.rvs(n_trials))
response = np.digitize(x, p["criterion"])

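# One ROC point per criterion (indices 0, 1, 2). Using the criterion indices
# rather than the observed ratings avoids an extra point at (0, 0) and
# still works if some rating was never used.
crit_idxs = np.arange(len(p["criterion"]))
p_hit = [(response > i)[signal].mean() for i in crit_idxs]
p_fa = [(response > i)[~signal].mean() for i in crit_idxs]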

ax_roc.scatter(p_fa, p_hit, c="C1")
display(ax_roc.figure)

You can run this section over and over to get an idea of the variability of the simulation with respect to the expected values on the ROC curve.

---

## Maximum likelihood fit of ROC curve

You probably have the (correct) intuition that the rating scale
information adds reliability to the estimate of d-prime. But now we have
the problem of translating these three points on the ROC curve to a
single estimate of d-prime.  

There are a variety of ways of doing this.  One would be to use the
cumulative normals like we did before for each of the three points on the
ROC curve and average them.  But this doesn't seem right - different
points should have different weights due to variability in their
reliability.  

We'll implement a model fitting method.  You should appreciate that every
point in the 2D ROC space corresponds to a unique pair of d-prime and
criterion values.  That's why there is a direct translation between a
single point on the ROC curve and d-prime.  But now we have three points
in ROC space that don't necessarily fall on the same ROC curve.  Our goal
is to find the single ROC curve that passes closest to the three points.
Specifically, we need a four-parameter fit: what d-prime and three
criterion values best fit our observed values?
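
The one-to-one mapping between a point in ROC space and a (d-prime, criterion) pair can be made concrete with a short sketch. The two helper functions below are hypothetical names introduced for this illustration, and they assume the noise distribution has mean 0 and both distributions have a standard deviation of 1, so the criterion is expressed in those units.

```python
from scipy.stats import norm

def roc_point_to_params(p_fa, p_hit):
    """Convert one ROC point to (d_prime, criterion), assuming noise ~ N(0, 1)."""
    z_fa = norm.ppf(p_fa)
    z_hit = norm.ppf(p_hit)
    return z_hit - z_fa, -z_fa

def params_to_roc_point(d_prime, criterion):
    """Convert (d_prime, criterion) back to an ROC point (p_fa, p_hit)."""
    return norm.sf(criterion), norm.sf(criterion - d_prime)

# Round trip: start from parameters, go to ROC space, and come back
example_fa, example_hit = params_to_roc_point(1.0, 0.5)
recovered_dprime, recovered_criterion = roc_point_to_params(example_fa, example_hit)
print(f"ROC point: ({example_fa:.3f}, {example_hit:.3f})")
print(f"Recovered d' = {recovered_dprime:.2f}, criterion = {recovered_criterion:.2f}")
```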

To do this we need a cost function that takes in a set of model
parameters and data points and returns a value that represents goodness
of fit. Then we'll use scipy's optimization routine to find the model
parameters that minimize this cost function.

When dealing with proportions, the cost function is always in terms of
likelihood (You should never use a least-squares criterion for
proportional data!) Our cost function will be the probability of our
observed data for a given set of model parameters.

Suppose the first trial was a noise trial and the subject reported '1'
(Definitely no).  Looking at the figure of the internal response
distributions, the probability of this happening is the area under the
noise curve (blue) to the left of the 1st criterion:


In [None]:
p_noise_resp = noise_dist.cdf(p["criterion"][0])
print(f"P(respond '1' | noise) = {p_noise_resp:.2f}")

The probability of responding '2' is the area between the 1st and second
criteria and so on.  The whole table of probabilities can be computed
like this:

In [None]:
bounds = np.r_[-np.inf, p["criterion"], np.inf]
noise_ps = np.diff(noise_dist.cdf(bounds))
signal_ps = np.diff(signal_dist.cdf(bounds))
ps = np.vstack([noise_ps, signal_ps]).T
print(ps.round(2))

The first column is for noise trials, the second is for signal trials,
and each row corresponds to one response ('1'-'4').  Verify that each
column sums to 1.

The probability of obtaining our observed data set based on these
probabilities is the product of the probabilities associated with each
trial.  Multiplying 100 values that are less than 1 produces a
ludicrously small number, so we almost always maximize log likelihood
instead.  Actually, we'll use the negative of the log likelihood because
optimization routines minimize functions. The negative of the log
likelihood of our observed data can be computed by turning the computations
above into a function:

In [None]:
def roc_cost_func(p, signal, response):

    bounds = np.r_[-np.inf, p["criterion"], np.inf]
    noise_ps = np.diff(norm.cdf(bounds, p["noise_mean"], p["sd"]))
    signal_ps = np.diff(norm.cdf(bounds, p["signal_mean"], p["sd"]))
    lls = np.log(np.where(signal, signal_ps[response], noise_ps[response]))
    cost = -lls.sum()

    return cost

In [None]:
cost = roc_cost_func(p, signal, response)
print(f"Cost (negative log-likelihood) = {cost:.1f}")

To find the parameters that minimize our cost function we'll use scipy's
`minimize` function with `method="nelder-mead"`, which will use the same
algorithm as MATLAB's `fminsearch`.

TODO The original MATLAB tutorial used a custom interface that translated
from a structure with named fields to a 1d vector of parameters that both
`fminsearch` and `minimize` work with. It also let the user specify which
parameters should be held fixed and which should be optimized. I have a
similar Python class that could be bundled with the tutorials (and others
have implemented similar things that live in libraries you can get from `pip`).
But to keep things simple let's just define a slightly less flexible but similar
interface function here:

In [None]:
from scipy.optimize import minimize
def fit(cost_func, p, signal, response):

    p_fit = p.copy()
    
    def cost_func_wrapped(xvec):
        signal_mean, *criterion = xvec
        p_fit["signal_mean"] = signal_mean
        p_fit["criterion"] = criterion
        return cost_func(p_fit, signal, response)

    x0 = np.r_[p["signal_mean"], p["criterion"]]
    res = minimize(cost_func_wrapped, x0, method="nelder-mead")
    signal_mean, *criterion = res.x
    p_fit["signal_mean"] = signal_mean
    p_fit["criterion"] = criterion
    
    return p_fit

In [None]:
p_fit = fit(roc_cost_func, p, signal, response)

We can look at the best-fitting parameters:

In [None]:
print(p_fit)

Do they resemble the original model parameters?

In [None]:
print(p)

Let's add the results of the fitting to the plot:

In [None]:
# First, the full estimated ROC curve
p_hits_fit = norm.sf(z, p_fit["signal_mean"], p_fit["sd"])
p_fas_fit = norm.sf(z, p_fit["noise_mean"], p_fit["sd"])
ax_roc.plot(p_fas_fit, p_hits_fit, c="C2")

# Second, the three ROC points
p_hit_best = norm.sf(p_fit["criterion"], p_fit["signal_mean"], p_fit["sd"])
p_fa_best = norm.sf(p_fit["criterion"], p_fit["noise_mean"], p_fit["sd"])
ax_roc.scatter(p_fa_best, p_hit_best, c="C2")

display(ax_roc.figure)

Our estimate of d-prime is then simply:

In [None]:
d_prime_est = (p_fit["signal_mean"] - p_fit["noise_mean"]) / p_fit["sd"]
print(f"d' estimated from fitting: {d_prime_est:.2f}")

Run this over and over to get a feel for the variability of the estimate
of d-prime.  Does the rating scale provide a more reliable estimate of
d-prime than the standard Yes/No experiment?  You can test this more
formally by repeating a bunch of simulations.

----

## Two-alternative forced choice (2AFC)

Another way to avoid the criterion problem is to use a
two-alternative-forced-choice paradigm (2AFC) where a trial consists of
both a signal and noise draw in either random temporal order or spatial
position.  The subject must choose which draw contains the signal.  2AFC
is easily modelled with SDT by drawing once from the signal, and once
from the noise distribution.  The subject decides that the signal came
from the draw with the larger value.  


Here's a simulation:

In [None]:
n_trials = 100
signal = signal_dist.rvs(n_trials)
noise = noise_dist.rvs(n_trials)
response = signal > noise

Response is 1 for correct trials, when the draw from the signal exceeds the noise draw.

Overall performance is the mean of the response vector:

In [None]:
pc_2afc = response.mean()
print(f"P(Correct) = {pc_2afc:.2f}")

Note that percent correct is greater than the best percent correct in the
yes/no experiment.  
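
This comparison can also be made without simulation noise. In the yes/no task the best criterion sits halfway between the means, so the best percent correct is $\Phi(d^\prime/2)$. In 2AFC the decision depends on the difference between the signal and noise draws, which is normal with mean $d^\prime$ and variance 2 (in units of the shared standard deviation), so the expected percent correct is $\Phi(d^\prime/\sqrt{2})$. A small sketch, assuming the example parameters used so far (d-prime of 1):

```python
import numpy as np
from scipy.stats import norm

# Assumed: signal_mean = 1, noise_mean = 0, sd = 1, as in the running example
d_prime_example = 1.0

# Yes/No with the ideal criterion halfway between the two means
pc_yes_no_best = norm.cdf(d_prime_example / 2)

# 2AFC: the signal-minus-noise difference has mean d' and sd sqrt(2)
pc_2afc_expected = norm.cdf(d_prime_example / np.sqrt(2))

print(f"Best Yes/No P(Correct): {pc_yes_no_best:.1%}")
print(f"Expected 2AFC P(Correct): {pc_2afc_expected:.1%}")
```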

Here's something interesting: the expected percent correct in 2AFC should
be equal to the area under the ROC curve:

In [None]:
roc_area

Our `pc_2afc` is from simulation, so you should run this section several 
times over to convince yourself that this is true. This means that d-prime
can be directly estimated from percent correct for a 2AFC experiment
(because d-prime is directly related to area under the ROC curve).

In [None]:
dprime_2afc = np.sqrt(2) * norm.ppf(pc_2afc)
print(f"d' = {dprime_2afc:.2f}")

How reliable is this estimate of d-prime?  An interesting exercise would
be to measure the standard deviation of the d-prime estimates for
repeated simulations of the standard Yes/No, the Yes/No rating method,
and the 2AFC method.  Which wins?
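
As a starting point for that exercise, here is a sketch that compares only the standard Yes/No and 2AFC methods; the rating method is left out because it needs the model fit inside the loop. The number of simulations, the trial count, and the log-linear correction in `clip_rate` (a helper invented for this example) are all assumptions, and the Yes/No subject is assumed to use the ideal criterion.

```python
import numpy as np
from scipy.stats import norm

rng_cmp = np.random.default_rng(1)
n_sims_cmp, n_per_run = 1000, 100
signal_mu, noise_mu, sigma = 1.0, 0.0, 1.0   # assumed, as in the running example
ideal_criterion = (signal_mu + noise_mu) / 2

def clip_rate(k, n):
    # Assumption: log-linear correction to avoid rates of exactly 0 or 1
    return (k + .5) / (n + 1)

dprime_yes_no_runs = np.empty(n_sims_cmp)
dprime_2afc_runs = np.empty(n_sims_cmp)
for i in range(n_sims_cmp):
    # Yes/No: one draw per trial, compared against the criterion
    is_signal = rng_cmp.random(n_per_run) > .5
    x = rng_cmp.normal(np.where(is_signal, signal_mu, noise_mu), sigma)
    yes = x > ideal_criterion
    hit = clip_rate(yes[is_signal].sum(), is_signal.sum())
    fa = clip_rate(yes[~is_signal].sum(), (~is_signal).sum())
    dprime_yes_no_runs[i] = norm.ppf(hit) - norm.ppf(fa)

    # 2AFC: one signal draw and one noise draw per trial
    signal_draws = rng_cmp.normal(signal_mu, sigma, n_per_run)
    noise_draws = rng_cmp.normal(noise_mu, sigma, n_per_run)
    pc = clip_rate((signal_draws > noise_draws).sum(), n_per_run)
    dprime_2afc_runs[i] = np.sqrt(2) * norm.ppf(pc)

print(f"Yes/No: mean d' = {dprime_yes_no_runs.mean():.2f}, "
      f"SD = {dprime_yes_no_runs.std():.2f}")
print(f"2AFC:   mean d' = {dprime_2afc_runs.mean():.2f}, "
      f"SD = {dprime_2afc_runs.std():.2f}")
```

Keep in mind that the two methods are matched here on the number of trials, and each 2AFC trial presents two stimuli, so the comparison depends on what you count as equal effort.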

----

## N-alternative forced choice 

Forced choice experiments can include more than two options.  For
example, suppose a target could appear randomly in one of four spatial
positions.  This can be modeled with SDT as taking three samples from the
noise distribution and one from the signal distribution.  Like for 2AFC,
the subject chooses the location that generated the largest
response. A correct response occurs when the draw from the signal
distribution exceeds the maximum of the noise draws.  This is easy to
simulate:

In [None]:
n_alt = 4

signal = signal_dist.rvs(n_trials)
noise = noise_dist.rvs((n_alt - 1, n_trials))

response = signal > noise.max(axis=0)
pc_nafc_sim = response.mean()

print(f"P(Correct) = {pc_nafc_sim:.2f}")

There is a solution (using numerical integration) for percent correct in an NAFC experiment:

In [None]:
d_prime = (p["signal_mean"] - p["noise_mean"]) / p["sd"]
z = np.linspace(-5, 5, 501)
dz = z[1] - z[0]
pc_nafc_num = (norm.pdf(z - d_prime) * norm.cdf(z) ** (n_alt - 1)).sum() * dz

f, ax_nafc = plt.subplots(figsize=(3, 4))
ax_nafc.bar(["Simulated", "Numerical"], [pc_nafc_sim, pc_nafc_num])
ax_nafc.set(ylabel="P(Correct)")
ax_nafc.figure.tight_layout()
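
As a cross-check on the grid sum above, the same integral can be handed to an adaptive integrator. The sketch below uses `scipy.integrate.quad` and wraps the integrand in a helper called `pc_nafc`, a name made up for this example. Two sanity checks follow from the model: with two alternatives the result should match the 2AFC value $\Phi(d^\prime/\sqrt{2})$, and with a d-prime of zero the result should be chance, one over the number of alternatives.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def pc_nafc(d_prime, n_alt):
    """Expected proportion correct in an n_alt-AFC task (equal-variance SDT)."""
    # P(signal draw = x) times P(all n_alt - 1 noise draws fall below x)
    def integrand(x):
        return norm.pdf(x - d_prime) * norm.cdf(x) ** (n_alt - 1)
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

print(f"4AFC, d' = 1: {pc_nafc(1.0, 4):.3f}")
print(f"2AFC, d' = 1: {pc_nafc(1.0, 2):.3f} "
      f"(compare {norm.cdf(1 / np.sqrt(2)):.3f})")
print(f"4AFC, d' = 0: {pc_nafc(0.0, 4):.3f} (chance is 0.250)")
```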

## Divided Attention

Let's simulate a real divided attention experiment. Again there are four
spatial positions, but this time it's a 2AFC experiment in which a signal
(like a grating or something) appears in one of the positions on one of
two intervals.  The subject's job is to determine the interval that
contained the signal, wherever it was.  Now, consider two conditions, one
in which you have no idea which of the four positions will have the
signal and one in which you are cued to the correct location.  It's still
a 2AFC task, but one forces you to divide your spatial attention.  Again,
it's easy to simulate.  


In [None]:
n_pos = 4

noise = noise_dist.rvs((n_pos, n_trials))
signal = np.vstack([noise_dist.rvs((n_pos - 1, n_trials)),
                    signal_dist.rvs((1, n_trials))])

# Decision Rule: choose the signal interval if the max of the
# 3 noise + 1 signal draws exceeds the max of the four noise draws

response = signal.max(axis=0) > noise.max(axis=0)
p_uncued = response.mean()

If there is a cue to the spatial position, then we can assume that the
subject ignores the three uncued locations.  This is just the same old
2AFC experiment. Equivalently, we can use the same code with `n_pos`
set to 1:

In [None]:
noise = noise_dist.rvs(n_trials)
signal = signal_dist.rvs(n_trials)
response = signal > noise
p_cued = response.mean()

In [None]:
f, ax_att = plt.subplots(figsize=(3, 4))
ax_att.bar(["Cued", "Uncued"], [p_cued, p_uncued])
ax_att.set(ylabel="P(Correct)")
ax_att.figure.tight_layout()

This result is both obvious and deep. Imagine viewing this graph before
going through the tutorial.  Suppose spatial attention leads to enhanced
responses in neurons with receptive fields at potential relevant
locations.  Suppose also that attention is resource limited so that
dividing your attention leads to a weaker attentional gain for each
location compared to the cued condition.  Like the spotlight of attention
has a fixed amount of stuff to spread around. Finally, assume that a gain
change can help performance by increasing the signal-to-noise ratio.
It would follow that weaker gain changes for the uncued (divided
attention) condition should lead to poorer performance.

This argument is all over the attention field. But this simulation shows
that you can get strong behavioral effects for attention without any gain
changes at all!  This idea was elegantly described by John Palmer in
the 80's, was largely ignored in the 90's and 00's, but has returned
recently in light of optical imaging and fMRI results showing that V1
responses don't seem to change much with divided attention.
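
To see how this effect scales, the simulation can be repeated for different numbers of monitored positions while keeping the signal and noise distributions fixed, so that there is no gain change anywhere in the model. This is a sketch under assumed values (signal mean 1, standard deviation 1, and an arbitrary trial count), using the same max-rule decision as above.

```python
import numpy as np

rng_att = np.random.default_rng(0)
n_sim_trials = 20000
signal_mean_att, sd_att = 1.0, 1.0   # assumed; identical in every condition

pc_by_positions = {}
for n_positions in [1, 2, 4, 8]:
    # Both intervals contain noise at every monitored position
    noise_interval = rng_att.normal(0, sd_att, (n_positions, n_sim_trials))
    signal_interval = rng_att.normal(0, sd_att, (n_positions, n_sim_trials))
    # In the signal interval, one position also contains the signal
    signal_interval[0] += signal_mean_att

    # Max rule: pick the interval with the largest single response
    correct = signal_interval.max(axis=0) > noise_interval.max(axis=0)
    pc_by_positions[n_positions] = correct.mean()
    print(f"{n_positions} position(s): P(Correct) = {correct.mean():.3f}")
```

Monitoring a single position corresponds to the cued condition, and performance should fall as more positions are monitored even though the response to the signal itself never changes.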