# Constructing a testable hypothesis

Sections:

* Constructs to models
* Falsifiability
* Parsimony

This lecture is not associated with any readings. However, this quote by Lao Tzu is likely important for how to think about things.

    "In order to understand something's essence,
    you must know it as nameless.
    In order to observe its manifestations,
    you must know its name."

---

# 1. Constructs to models

The core of the modern scientific method is testing hypotheses. Thus, the application of data science methods to empirical scientific research is primarily oriented towards evaluating testable hypotheses. Now this does not mean that all data science is null hypothesis test statistics (we'll get into those later). Instead, this means that we should always frame any data pipeline, from data types and data tables to visualization and statistics, around the goal of evaluating a hypothesis.

But what is a hypothesis?

Unfortunately, the term "hypothesis" has taken on many different meanings and the colloquial definitions often make their way into our scientific language. In order to understand what we mean when we talk about a hypothesis in a scientific setting, you should understand what precedes it.

![Terms](imgs/L6Terms.jpeg)

We begin with the idea of a **construct**. A construct is an abstract idea that remains theoretical in form. It cannot be directly tested itself, but its theoretical form provides a mechanism for understanding the world. The idea of working memory is a construct. You cannot see or touch or smell a working memory. Positive affect is another example of a construct. You cannot physically hold onto a positive emotion per se. Yet, as theoretical processes, constructs can serve to explain a collected set of observations. A construct is often misidentified as a "theory", mainly because that is the colloquial use of the term theory.

If a construct is not testable, how do we evaluate it empirically? Well, we create an **operational definition** of the construct. An operational definition is a reformulation of the construct in terms of observable or tangible phenomena. Returning to our working memory example, an operational definition of working memory might be "the ability to report explicit information about the recent past." Notice that this is a much narrower view of the construct and there might be other operational definitions that are just as valid. For example, working memory might also be operationally defined as "the ability to sustain activation of sensory events for a period of time ranging from milliseconds to minutes." Thus, any operational definition of a construct is, by nature, much narrower in scope and reflects only a portion of the entire construct you are interested in.

A **hypothesis**, in the scientific sense of the word, exists in the space of your operational definition. A hypothesis is a statement of a set of relations that are presumed from the operational definition. Thus a hypothesis is an even narrower view of your theoretical construct. An example of a hypothesis might be "humans can only hold a limited amount of information in memory for a limited amount of time." In this definition we have two associations: amount of information and duration of time.

Now a hypothesis, as defined above, still isn't in a form to be evaluated. It remains theoretical in form. In order to narrow down an empirical form of the hypothesis, you have to specify the research **context**. The context is the task or environment that the hypothesis is evaluated in. For example, it can be a behavioral task where you ask participants to report back a number that is _n_ digits long after a varying level of delay. This behavioral context is a valid space to test your hypothesis. Alternatively, you can evaluate the hypothesis about working memory in the context of functional MRI (fMRI). For example, you could measure the degree of activation in memory-associated areas as a function of digit span length and the time required to hold it in memory.

The research context is important because it determines the **predictions** you can make stemming from your hypothesis. Predictions are the tangible form of your hypothesis in a specific research context. For example, in the behavioral context described in the previous paragraph, our prediction might be that accuracy of recall drops as a function of digit span length and the time required to hold the information in memory. Alternatively, in the fMRI context, it might be that prefrontal cortex activation increases with digit span length, but decays the longer you have to hold the information in memory.

It is in the space of predictions that we will largely be working. Our statistical tools and data visualizations are all evaluating the predicted form of your hypothesis in the specific data space you have. This is illustrated in the figure below.

![Perspectives](imgs/L6Perspectives.jpeg)

Here your hypothesis is technically untouchable. You can't see it, but it exists as some shape in space. When evaluated from a particular angle (reflected as the position of the lights), the shape manifests as a particular "shadow" at a particular point in space. Here the points in space (i.e., the wall and ground in this example) are the research contexts and the shadows are the predictions that your hypothesis makes in these contexts.

This visual analogy summarizes the problem you will face trying to get insights from your data. We will formulate our predictions as **statistical models** (e.g., $ Y_{recall} = \beta_0 + \beta_1 X_{length} $). A model is a form of your presumed relationship between your dependent variable and the independent variables, in a form consistent with the predictions from your hypothesis. These models serve as a proxy for your hypothesis. We will discuss these more later.
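As a minimal sketch of what fitting such a model can look like, the snippet below simulates digit-span data and estimates $ \beta_0 $ and $ \beta_1 $ by ordinary least squares with NumPy. The data, the variable names, and the "true" coefficient values are all assumptions invented for illustration; they are not results from a real experiment.

```python
import numpy as np

# Hypothetical data: a recall score for digit spans of length 3..10.
# The "true" intercept (1.1) and slope (-0.08) are invented for illustration,
# and the simulated scores are not clipped to a valid accuracy range.
rng = np.random.default_rng(seed=0)
x_length = np.repeat(np.arange(3, 11), 20).astype(float)  # 20 simulated trials per length
y_recall = 1.1 - 0.08 * x_length + rng.normal(0.0, 0.05, size=x_length.size)

# Design matrix: a column of ones for the intercept (beta_0) plus the predictor.
X = np.column_stack([np.ones_like(x_length), x_length])

# Ordinary least squares estimate of [beta_0, beta_1].
beta_hat, *_ = np.linalg.lstsq(X, y_recall, rcond=None)
beta_0, beta_1 = beta_hat

print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.3f}")
```

A negative estimate of $ \beta_1 $ is the form the behavioral prediction takes in this model: recall drops as length grows.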

# 2. Falsifiability

Above we went over the space of how you get from your theory to a testable form of your hypothesis as a model prediction in a particular research context. But how do we know whether the model is correct or not?

The way we deal with this in modern science relies on Karl Popper's concept of falsifiability. According to Popper, the only valid or testable hypotheses are those that are constructed so as to be falsifiable. In other words, they can be proven wrong. Consider a classic example.

* **Falsifiable:** "All swans are white."

This version of the hypothesis about the universal color of swans is falsifiable because it takes just one counterexample to reject it. Just one observation of a non-white swan means that it is false.

Now let's consider an alternative, non-falsifiable version of the hypothesis.

* **Non-falsifiable:** "There are black swans."

How is this not falsifiable? Well, to show that it is false you would essentially have to tally every swan in existence. Just because you've never seen a black swan doesn't mean they aren't real. It could be that you haven't looked hard enough. So you must do an exhaustive search and, even then, you have to have zero uncertainty. Unfortunately, the process of science almost always has uncertainty.

So the Popperian logic is that we can only reject hypotheses, not necessarily accept them. In applied contexts (and many statistical tests), your goal is to reject or falsify as many alternative hypotheses as possible. This leads to one of the most critical concepts in modern science.

* **Null ($ H_0 $) Hypothesis:** The falsifiable hypothesis that needs to be rejected in order to provide support for your theoretical claims. 

The Null Hypothesis (we'll just call it the "Null" from here on out) is critical because it is the hypothesis we most often evaluate in data science. Essentially, the Null needs to describe how you think the world should look if your theoretical hypothesis is incorrect. If the falsifiable claim to be rejected is that all swans are white, then my Null model would be

$$ P(swan) = \sum X_{white} = 1 $$.

Put into words this says that the probability of all swan observations is the sum of all white swans. 

Now, the alternative to the Null is 

* **Research ($ H_1 $) Hypothesis:** _An_ alternative to the $ H_0 $ that is consistent in form with your theoretical claims.

The reason that it is "an alternative" and not "the alternative" is that it is just one of many alternative explanations that could better explain your data than the Null. In our swan case, an example $ H_1 $ would be something like

$$ P(swan) = \sum X_{white} + \sum X_{black} = 1 $$.

The hard part about modern science is that we currently do not prove $ H_1 $; we simply reject $ H_0 $ (or other alternative hypotheses).
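One way to make "rejecting the Null" concrete is a permutation test, sketched below on simulated digit-span data. This is a stand-in technique chosen for illustration, not a method prescribed here, and the data and effect size are invented. The Null in this sketch is that recall is unrelated to length, so shuffling the recall values produces data sets that the Null could have generated.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical data: recall score falls as digit span length grows (invented effect).
x_length = np.repeat(np.arange(3, 11), 10).astype(float)
y_recall = 1.1 - 0.08 * x_length + rng.normal(0.0, 0.1, size=x_length.size)


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    x_centered = x - x.mean()
    return float(np.dot(x_centered, y - y.mean()) / np.dot(x_centered, x_centered))


observed = slope(x_length, y_recall)

# Under H0 the pairing of x and y is arbitrary, so each shuffle of y
# yields one slope that could have occurred if H0 were true.
n_permutations = 2000
null_slopes = np.array(
    [slope(x_length, rng.permutation(y_recall)) for _ in range(n_permutations)]
)

# Two-sided p-value: how often is a shuffled slope at least as extreme as the
# observed one? The +1 terms count the observed data as one of the permutations.
p_value = (np.sum(np.abs(null_slopes) >= abs(observed)) + 1) / (n_permutations + 1)

print(f"observed slope = {observed:.3f}, p = {p_value:.4f}")
```

A small p-value is grounds for rejecting $ H_0 $. It does not prove any particular $ H_1 $; it only says the data would be surprising if the Null were true.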

# 3. Parsimony

So far we have gone over the definition of hypotheses and the way we evaluate them. But what about their form? What are the principles for constructing a good/valid hypothesis?

To understand how we constrain the form of hypotheses, we once again take a step back into the philosophy of the modern scientific method. As mentioned in the previous section, we rely on falsifiability to evaluate hypotheses. But for the hypotheses that we do not reject, how should we best formulate them?

Well, science relies on a particular type of reasoning.

* **Abductive Reasoning:** A form of logical inference that tries to find the simplest explanation for a set of observations. 

In colloquial terms, this is known as "parsimony". In essence, you want to infer the simplest form of the problem that provides the best explanation with the fewest degrees of freedom.

The reason abductive reasoning is so powerful is that it tries to maximize both explainability and generalizability at the same time. Simple models essentially compress a complex story into the smallest information packet possible.

As an example of this, let's consider the scenario where you evaluate the average reaction time (RT) in a sample of participants and also measure the number of hours of sleep they got the night before. A plot of the results would look like this.

![Scatter](imgs/L6Scatter.jpeg)

Now the most complete model of the relationship between sleep and RT would be one that relies on each individual observation.

$$ \hat{Y} = \sum_{i=1}^{n} \hat{\beta_i}X_i + \hat{\beta_0} $$

Here Y is RT and each $ X_i $ is the number of hours of sleep for subject _i_. Basically, to know RT you learn a weight ($ \hat{\beta_i} $) for each sample. You essentially have _n+1_ degrees of freedom (a degree for every subject plus the mean). This is the model that has the most information complexity. It also means that it has the highest _entropy_ and the lowest parsimony.

Looking at the relationship between sleep and RT in the graph, it's clear that there is a simpler way to describe the relationship.

$$ \hat{Y} = \hat{\beta_1}X + \hat{\beta_0} $$

This is a model that compresses the relationship between X and Y to a more parsimonious form. Instead of _n+1_ degrees of freedom, it has 2. This is the lowest entropy model and can generalize to new observations of Y (something that the previous model cannot).
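The trade-off between the two models can be sketched with simulated sleep and RT data. As a stand-in for the per-observation model, the code below uses a polynomial with as many coefficients as there are participants (_n_ rather than _n+1_); it is not the same model, but it shares the property of being flexible enough to pass through every training point. All numbers are invented for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(seed=2)


def simulate(hours):
    """Hypothetical RT (ms) that falls linearly with sleep, plus noise."""
    return 600.0 - 30.0 * hours + rng.normal(0.0, 20.0, size=hours.size)


# Training sample: 8 participants with distinct amounts of sleep.
sleep_train = np.linspace(4.0, 9.0, 8)
rt_train = simulate(sleep_train)

# New participants drawn from the same hypothetical process.
sleep_new = rng.uniform(4.0, 9.0, size=200)
rt_new = simulate(sleep_new)

# Parsimonious model: 2 parameters (intercept and slope).
simple = Polynomial.fit(sleep_train, rt_train, deg=1)
# Flexible stand-in: 8 parameters, one per observation.
flexible = Polynomial.fit(sleep_train, rt_train, deg=sleep_train.size - 1)


def mse(model, x, y):
    """Mean squared prediction error of a fitted model."""
    return float(np.mean((model(x) - y) ** 2))


for name, model in [("simple", simple), ("flexible", flexible)]:
    print(f"{name:8s} train MSE = {mse(model, sleep_train, rt_train):10.2f}  "
          f"new-data MSE = {mse(model, sleep_new, rt_new):10.2f}")
```

The flexible model reproduces the training data almost exactly because it has a free parameter for every observation. On most random draws that comes at the price of a larger error on new participants, whereas the 2-parameter model gives up a little training accuracy in exchange for a description that carries over to new data.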



# Goals for data science

The previous two sections point to the two fundamental goals of data science.

* **1. Model parsimony:** Explain the most variance in your data in the simplest way possible.
* **2. Falsifiability:** Reject the default assumption about your data, that is, what you would expect to see if your model were wrong.

These two principles together give us the foundation of everything we'll be going over in the class.

