# Definitions

Experiments are carried out on [samples](https://colab.research.google.com/drive/1VFe7ItPEsP7ZMxpAULO_kfKZzEPYLI0R#scrollTo=tteEm2Qlgbb3) of [experimental units](https://colab.research.google.com/drive/1VFe7ItPEsP7ZMxpAULO_kfKZzEPYLI0R#scrollTo=tteEm2Qlgbb3). We can directly observe what happens in the sample. Suppose we are interested in a particular phenotype of flies, say red eyes. In this experiment, 67% of the flies in the sample have red eyes. For this particular experiment, there is no uncertainty whatsoever in the [sample proportion](https://colab.research.google.com/drive/1VFe7ItPEsP7ZMxpAULO_kfKZzEPYLI0R#scrollTo=tteEm2Qlgbb3) of flies with red eyes, which is 67%. However, if we repeated the experiment, it is possible, and indeed likely, that the sample proportion of flies with red eyes would differ from 67%. Variation among samples is a key concept in statistics.

For this strain of flies, or [population](https://colab.research.google.com/drive/1VFe7ItPEsP7ZMxpAULO_kfKZzEPYLI0R#scrollTo=tteEm2Qlgbb3), we know that 75% of the flies have red eyes. We might know this because of our theoretical understanding of fly genetics, or we might know this because we have observed the strain in our lab for many years, and have counted thousands of fly eyes. There is only one population; there are many possible samples.
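The contrast between one fixed population and many possible samples can be illustrated with a small simulation. This is only a sketch: the population proportion of 0.75 comes from the text, but the sample size (`N_FLIES`), the number of repeated experiments (`N_EXPERIMENTS`), and the helper `sample_proportion` are assumptions chosen for illustration.

```python
import random

def sample_proportion(n_flies, p_red, rng):
    """Simulate one experiment: draw n_flies and return the fraction with red eyes."""
    n_red = sum(rng.random() < p_red for _ in range(n_flies))
    return n_red / n_flies

rng = random.Random(0)  # fixed seed so the sketch is reproducible
P_RED = 0.75            # population proportion, as stated in the text
N_FLIES = 100           # hypothetical sample size (not given in the text)
N_EXPERIMENTS = 1000    # hypothetical number of repeated experiments

# Each entry is the sample proportion from one simulated experiment.
proportions = [sample_proportion(N_FLIES, P_RED, rng) for _ in range(N_EXPERIMENTS)]
mean_prop = sum(proportions) / len(proportions)

print(f"smallest sample proportion: {min(proportions):.2f}")
print(f"largest sample proportion:  {max(proportions):.2f}")
print(f"mean across experiments:    {mean_prop:.3f}")
```

Individual sample proportions scatter around the population value, while their average across many simulated experiments should land close to 0.75.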

When we do experiments, we know exactly what happens in a particular sample. We also know that there is variability in results from sample to sample, experiment to experiment. We want to generalize our conclusions to the population. Statistics is the field of mathematics that seeks to understand sampling variability. By understanding this variability, we can extend our findings beyond our specific experiment, quantify the variability due to sampling, draw general conclusions, and express the uncertainty in those conclusions.
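One common way to express the uncertainty in a sample proportion is a standard error and an approximate confidence interval. The sketch below uses the normal-approximation (Wald) interval, which is one choice among several and is not a method prescribed by this text. The counts (67 red-eyed flies out of 100) are hypothetical, picked only to match the 67% sample proportion used above, and the function name is likewise an assumption.

```python
import math

def proportion_confidence_interval(n_success, n_total, z=1.96):
    """Return (estimate, standard error, approximate 95% interval) for a proportion.

    Uses the normal approximation, which is reasonable only when both
    n_success and n_total - n_success are not too small.
    """
    p_hat = n_success / n_total
    se = math.sqrt(p_hat * (1 - p_hat) / n_total)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# Hypothetical counts: 67 red-eyed flies in a sample of 100.
p_hat, se, (lower, upper) = proportion_confidence_interval(67, 100)
print(f"estimate={p_hat:.2f}, standard error={se:.3f}, interval=({lower:.3f}, {upper:.3f})")
```

With these assumed counts, the interval is wide enough to include the population value of 0.75, which illustrates how a single sample can differ from the population without contradicting it.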

**Parameters versus statistics**: When we talk about a [parameter](https://colab.research.google.com/drive/1VFe7ItPEsP7ZMxpAULO_kfKZzEPYLI0R#scrollTo=tteEm2Qlgbb3) of a [probability distribution](https://colab.research.google.com/drive/1VFe7ItPEsP7ZMxpAULO_kfKZzEPYLI0R#scrollTo=tteEm2Qlgbb3) (often the Greek character theta, shown as $\Theta$, is used to represent the set of parameters used for a particular probability distribution), we are describing a characteristic of the population. These characteristics are things that we typically want to know but do not (and often cannot) know exactly -- because we do not have access to the full population. Instead, we have access to samples, and from those samples we compute statistics that we can use to estimate the characteristics of the population. For example, we can compute maximum-likelihood estimates of parameters (often denoted as $\hat{\Theta}$) fit to the sampled data. Some decent discussions of these concepts are [here](https://statisticsbyjim.com/basics/populations-parameters-samples-inferential-statistics/) and [here](https://mathbitsnotebook.com/Algebra1/StatisticsData/STPopSample.html).
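As a concrete illustration of a maximum-likelihood estimate, the sketch below treats the number of red-eyed flies as binomially distributed and searches a grid of candidate parameter values for the one that maximizes the log-likelihood. The counts (`N_RED`, `N_TOTAL`) and the grid resolution are assumptions for illustration. For this model the maximum-likelihood estimate is known to equal the sample proportion, so the grid search should simply recover it.

```python
import math

def binomial_log_likelihood(p, n_red, n_total):
    """Log-likelihood of proportion p given n_red successes in n_total trials.

    The binomial coefficient is omitted because it does not depend on p
    and therefore does not change where the maximum occurs.
    """
    return n_red * math.log(p) + (n_total - n_red) * math.log(1 - p)

N_TOTAL, N_RED = 100, 67  # hypothetical counts matching a 67% sample proportion

# Candidate parameter values strictly between 0 and 1 (log is undefined at the ends).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda p: binomial_log_likelihood(p, N_RED, N_TOTAL))

print(f"maximum-likelihood estimate: {theta_hat:.3f}")
print(f"sample proportion:           {N_RED / N_TOTAL:.3f}")
```

Here `theta_hat` is a statistic computed from the sample, while the population parameter (0.75 in the fly example) is the fixed quantity it estimates.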

# Selection Bias

A major concern related to samples and populations when designing your experiments is the problem of selection bias; that is, whether or not the samples you will measure and test are, in fact, appropriately representative of the population. Examples of selection bias abound in the scientific literature. A good place to start reading about them is [here](https://catalogofbias.org/biases/selection-bias/), which also includes [detailed descriptions of other kinds of biases](https://catalogofbias.org).

# Neuroscience Examples

**Sampling neurons in *in vivo* recordings**

A fundamental challenge for interpreting the results of many experiments in systems neuroscience, particularly those involving extracellular recordings of single-unit activity in awake animals, is uncertainty about whether and how the sampled units are representative of the population being studied. In these experiments, microelectrodes are typically advanced blindly into a targeted brain region, then held in position when task-related neural activity is found. Under these conditions, the sample can differ systematically from the full population of neurons in the given brain region or functional neural circuit. Such biases can result from limited sampling times or conditions used to identify neurons, from the fact that some neurons and neural types tend to be more active than others and thus are more likely to be identified in neural recordings, and from other factors discussed [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5244825/). We also lack a detailed understanding of what, exactly, constitutes a functional neural circuit, adding further uncertainty to questions about whether or not an experiment is appropriately sampling one such circuit.
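The effect of preferentially finding more active neurons can be sketched with a toy simulation. Everything here is hypothetical: the two cell types, their proportions, their firing rates, and the assumption that the chance of recording a neuron is proportional to its firing rate are illustrative simplifications, not measurements or a model taken from the studies cited above.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 70% "quiet" cells (2 spikes/s), 30% "active" cells (20 spikes/s).
population = [("quiet", 2.0)] * 7000 + [("active", 20.0)] * 3000

def fraction_active(cells):
    """Fraction of cells in the list labeled 'active'."""
    return sum(label == "active" for label, _ in cells) / len(cells)

# Unbiased sample: every neuron is equally likely to be recorded.
uniform_sample = rng.sample(population, 500)

# Activity-biased sample: the chance of recording a neuron is assumed to be
# proportional to its firing rate (drawn with replacement for simplicity).
rates = [rate for _, rate in population]
biased_sample = rng.choices(population, weights=rates, k=500)

print(f"population fraction active: {fraction_active(population):.2f}")
print(f"uniform sample:             {fraction_active(uniform_sample):.2f}")
print(f"activity-biased sample:     {fraction_active(biased_sample):.2f}")
```

Under these assumptions the biased sample substantially over-represents the active cell type, so a prevalence estimate based on it would not generalize to the population.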

This uncertainty leads to different approaches by different experimenters. Some try to sample relatively large numbers of neurons in an "unbiased" manner (although note that, as per the discussion above, this approach may be unbiased with respect to the experimenter's choices but not necessarily with respect to the data obtained). For example, [this study](https://www.nature.com/articles/35082081) used such a sampling strategy to draw conclusions about the relative prevalence of particular tuning properties in the prefrontal cortex (PFC) based on the sampled findings:

*from the Abstract*: "The most prevalent neuronal activity observed in the PFC reflected the coding of these abstract rules".

*from Methods*: "Recordings were localized using magnetic resonance imaging and neurons were randomly sampled; no attempt was made to select neurons on the basis of responsiveness."

Alternatively, some studies use more explicit sampling strategies to target particular neurons based on measurable properties, then test those neurons under other conditions. These studies are less able to draw conclusions about the full population of neurons in a given brain area but can be used to test more specific hypotheses about relationships between different response properties. For example, [this study](https://www.jneurosci.org/content/22/21/9475) tested the hypothesis that neurons with spatially selective persistent activity in parietal cortex contribute to information accumulation during decision making. Testing this hypothesis involved a particular sampling strategy:

"We targeted neurons in area LIP that discharge during a delay period after an instruction to make an eye movement into a particular region of the visual field... This is an appealing idea because it could explain many features of the LIP response. For example, it would provide a qualitative explanation for the persistence of activity in the delay period seen in the fixed duration version of the task. In general, it would cast the “memory” response in simpler memory-guided eye movements as the integral of a discrete impulse (e.g., a transient representation of a suprathreshold target)."

# Credits

Copyright 2021 by Joshua I. Gold, University of Pennsylvania