# 11. Types of Investigation

## 11.1 Intended Learning Outcomes:

At the end of the session you should be able to:

1. Describe three different types of investigations that arise in medical statistics and health data science.
2. Link a research question to an investigation type and compare the properties of different investigation types.
3. Explain how and why explanatory variables are used differently in prediction studies and in causal investigations. 


## 11.2 Specifying research questions:

Specifying the research question or questions is a crucial starting point for an investigation. In some cases the research question will be highly specific; in others it may be more wide-ranging, with several components. The research question then informs every subsequent stage of the investigation: choice of study population, study design, data collection, monitoring and quality control, data analysis, presentation of conclusions, and interpretation. Figure 1 illustrates one way of representing the whole process of an investigation.

![image.png](attachment:image.png)

The statistician/data scientist plays an important role at all stages of an investigation, not just at
the data analysis phase. It is perhaps most usual for collaborators who are subject-matter experts
(e.g. clinicians) to pose the initial research question. However, the statistician very often plays a
key part in refining these initial ideas in order to translate them into something formal and clearly
specified.

## 11.3 Different types of investigation:

### 11.3.1 Classification

The research question informs what type of investigation is required. Investigations can be divided
broadly into the following types:

1. Description  
2. Prediction
3. Causality and explanation

Hernan, Hsu & Healy (Chance, 2019) set out to classify data science tasks and used three classifications: *Description*, *Prediction*, and *Counterfactual prediction* (meaning causality). Shmueli (Statistical Science, 2010) described similar classifications: *Descriptive modelling*, *Predictive modelling*, and *Explanatory modelling*. See also Hand (Harvard Data Science Review, 2019) for a nice discussion of this topic.

### 11.3.2 Implications of investigation type

The distinction between the different types of investigation is crucial because it has a fundamental impact on the steps of the analysis and beyond. For example, the investigation type influences:

* How we decide what variables are to be included in the analysis
* What analysis methods to use
* How we assess the fit/performance of the model or other analysis approach used
* How we present the results from the analysis
* How the findings might be used in practice
* How we need to work with other experts at different stages


### 11.3.3 The role of study design

The different types of investigation may be performed using data from studies of different design. Having posed a research question, we can consider (with input from collaborators) what data are required to answer it robustly, including whether new data collection is needed, or whether there are existing data that could be used to address the question. This process needs to take into account considerations of cost, timeliness, feasibility and ethics. For example, for some questions our ideal study could be a randomized controlled trial, but to perform one would require such long follow-up that it would be infeasible and unethical, and so we would turn to observational data to address the research question. There is a major emphasis in the recent biostatistical and epidemiological literature on the use of ‘found’ data from sources such as electronic health records, which present great opportunities to answer research questions using data on a large number of individuals, but also present challenges for analysis and interpretation. All three types of investigation may make use of observational data. Randomized controlled trials are designed to estimate treatment effects (i.e. for causal investigations), but secondary analyses of trial data can be used in other types of investigation, such as to develop a prediction model.

## 11.4 Properties of different types of investigation:

### 11.4.1 Description

In a descriptive investigation the data are used to provide a quantitative summary of features of the
population of interest, or in other words the data are summarised in a compact way.

Simple descriptive analyses involve calculating proportions of individuals with a particular characteristic (e.g. males and females; smokers and non-smokers), or estimating features of the distribution of continuous variables (e.g. mean and variance of weight or blood pressure). The resulting information is then presented using tables and data visualisation.
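As a minimal sketch of such a summary, the following Python snippet computes a proportion and the mean and sample variance of a continuous variable. The five records and the field names `smoker` and `weight_kg` are made up purely for illustration; they are not course data.

```python
from statistics import mean, variance

# Hypothetical records, invented for illustration only.
records = [
    {"smoker": True, "weight_kg": 70.0},
    {"smoker": False, "weight_kg": 82.0},
    {"smoker": False, "weight_kg": 64.0},
    {"smoker": True, "weight_kg": 76.0},
    {"smoker": False, "weight_kg": 58.0},
]


def describe(rows):
    """Return the proportion of smokers and the mean and sample variance of weight."""
    weights = [r["weight_kg"] for r in rows]
    n_smokers = sum(1 for r in rows if r["smoker"])
    return {
        "n": len(rows),
        "prop_smokers": n_smokers / len(rows),
        "mean_weight": mean(weights),
        "var_weight": variance(weights),  # sample variance (divides by n - 1)
    }


summary = describe(records)
print(summary)
```

In practice these numbers would be reported in a table or plotted, rather than printed.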

Some descriptive analyses may extend to use of more complex methods of analysis. For example, the research question may concern how individuals within a population cluster together in terms of their dietary habits, requiring clustering methods. It may be of interest to estimate the expected survival time post-disease diagnosis in the presence of censored survival times, which would require survival analysis techniques.

All investigations should start with some basic descriptive analysis to gain understanding of the features of the data at hand. It is at this stage that we can uncover challenges such as missing data, gain insights into how certain variables are distributed, and, where relevant, gain understanding of correlations between key variables, including to identify collinearities. Some investigations then go on to the main research question, which goes beyond description, and others may be entirely descriptive and not proceed onto other questions.

Huebner et al. (2019) provide useful guidance on ‘initial data analysis’. See also Spiegelhalter
(2019) for an accessible discussion of summarising and communicating descriptions of data.


### 11.4.2 Prediction

Prediction is about using data on some features of individuals to predict other features, typically with the aim of predicting the outcome for new or future observations. More formally, prediction is concerned with mapping data on variables $X_{1}, X_{2}, \ldots, X_{p}$ to an outcome $Y$. The prediction model could be developed using statistical models such as regression, or approaches that would be described as machine learning algorithms.

Results from prediction investigations are used for a range of purposes: to inform people of their risk or prognosis; to identify people at high risk of an adverse event and hence take action such as more frequent screening (though the investigation will not tell us whether such screening would be effective).

Prediction models are typically developed using observational data. A well known example is the Framingham Risk Score, which provides predictions of a person’s 10-year risk of developing coronary heart disease (D’Agostino et al. 2008).

There is a huge literature on prediction in the medical setting. See for example the books by Riley et al. (2019) and Steyerberg (2019).

### 11.4.3 Causality and explanation

In causal investigations we seek to understand the causal effect of one or more variables on an outcome. Hernan et al. (2019) describe this as “Using data to predict certain features of the world as if the world had been different”. For a simple example of a causal investigation, consider a continuous outcome $Y$ (e.g. blood pressure) and a binary treatment variable $X$, where $X = 1$ denotes treated and $X = 0$ denotes untreated. A causal investigation asks how the mean of $Y$ would be different if all individuals had $X = 1$ compared with if all individuals had $X = 0$. In other words, if we could change $X$, what would be the expected change in $Y$?
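One common way to write this question down uses potential-outcome notation. This notation is an assumption introduced here for illustration and is not used elsewhere in these notes: let $Y^{x}$ denote the outcome that would be observed if $X$ were set to the value $x$. The contrast described above is then

$$
E\left(Y^{x=1}\right) - E\left(Y^{x=0}\right),
$$

which compares the same population under two hypothetical scenarios. In general this differs from the associational contrast $E(Y \mid X = 1) - E(Y \mid X = 0)$, which compares two different groups of individuals: those who happened to be treated and those who did not.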

Questions such as this are arguably simplest to answer using a randomized controlled trial, where there is no confounding of the treatment-outcome association. However, issues of drop-out and non-compliance are important to consider. Historically, some have considered answering causal questions to lie only in the domain of randomized experiments. However, randomized experiments are not feasible or ethical to address many important questions. It is now recognised that causality is often the goal of investigations using observational data. See for example the paper of Hernan (2018), who wrote “being explicit about the causal objective of a study reduces ambiguity in the scientific question, errors in the data analysis, and excesses in the interpretation of the results”. The field of ‘causal inference’ has developed in recent decades, with particular advances in recent years, to enable this.

Shmueli (2010) equates causality with ‘explanation’, meaning explanation of mechanisms of how one (or more) variable affects another. However, Hernan et al. (2019) make the point that we may be able to say that $X$ causes $Y$ without understanding the underlying mechanism. For example we may find strong evidence from a trial that a drug is effective for a given outcome, but the precise biological mechanisms through which the effect is transmitted are not well understood.

The variable of interest in a causal investigation could be use of a medical treatment (a drug) or application of a procedure. More generally it could be an ‘exposure’ such as ‘smoking’ or ‘exercising for at least 30 minutes per day’. The ‘hypothetical intervention’ of interest should be (reasonably) well defined, even if we could never intervene on it in reality (e.g. it would be impractical, not to say unethical, to intervene on smoking status). See Hernan (2016) for a discussion of related issues.

### 11.4.4 Is there a fourth investigation type?

There is arguably a fourth investigation type which is concerned with exploring how several explanatory variables $X_{1}$, . . . , $X_{p}$ are associated with an outcome $Y$. This might be described as an “exploration of risk factors” investigation. It may involve univariable analyses, looking at the association of each explanatory variable (“risk factor”) individually with the outcome, and multivariable analyses which look at association of several variables with the outcome in a single model. These types of analysis are typically carried out using observational data, and many (or perhaps most) epidemiological studies are investigations of this type, at least historically.

These types of investigation can be useful for understanding associations between variables in the population of interest and, as such, some may consider these analyses to be descriptive. However, as we all know, association is not causation! These types of investigation often do not consider the relative temporal ordering of explanatory variables, which means that interpretation of estimated associations as causal effects can be misleading. There is recent emphasis in the epidemiological literature on more principled investigations which are more explicit about the aim of the investigation.

As in a prediction investigation, the interest is in several explanatory variables. However, unlike in a prediction investigation, the aim is to explore quantitatively the unconditional and conditional associations of the explanatory variables with $Y$, rather than purely to predict $Y$. Unlike in a causal investigation, there is not a particular focus on a single variable. However, there is often an attempt to discuss the associations as though they may be causal even though an explicit causal question has not been posed.

Investigators should be wary of over-interpreting findings from “exploration of risk factors” investigations. And if we are really interested in addressing a causal question we should be explicit about that and carry out our analysis and interpretations accordingly.

## 11.5 An example:

Table 1 provides an example of the features of different investigation types. The overall topic is stroke in women. The table (taken from Hernan et al. 2019) provides an example research question, the features of data that would be required to answer it, and the types of analysis that could be used for investigations of three types: Description, Prediction and Causal inference.

Table 1: From Hernan, Hsu & Healy 2019. Examples of Tasks Conducted by Data Scientists Working with Electronic Health Records

| | Description | Prediction | Causal inference |
| :--- | :--- | :--- | :--- |
|Example of scientific question | How can women aged 60-80 years with stroke history be partitioned in classes defined by their characteristics? | What is the probability of having a stroke next year for women with certain characteristics? | Will starting a statin reduce, on average, the risk of stroke in women with certain characteristics?  |
|<br> Data|<br> - Eligibility criteria <br> - Features (symptoms, clinical parameters, ...)|<br> - Eligibility criteria <br> - Output (diagnosis of stroke over the next year) <br> - Inputs (age, blood pressure, history of stroke, diabetes at baseline) | <br> - Eligibility criteria <br> - Outcome (diagnosis of stroke over the next year) <br> - Treatment (initiation of statins at baseline) <br> - Confounders <br> - Effect modifiers (optional) |
|<br> Example of analytics | <br> Cluster analysis | <br> Regression <br> Decision trees <br> Random forests <br> Support vector machines <br> Neural networks | <br> Regression <br> Matching <br> Inverse probability weighting <br> G-formula <br> G-estimation <br> Instrumental variable estimation |




## 11.6 Role of explanatory variables in different types of investigation:

The role of explanatory variables in different types of investigation differs. We focus here on prediction investigations and causal investigations.

### 11.6.1 Prediction

In prediction investigations the aim is to use $X_{1}, \ldots, X_{p}$ to predict $Y$. In this setting the $X_{1}, \ldots, X_{p}$ are often referred to as the ‘predictors’ for obvious reasons. For a prediction problem we may well use all of the explanatory variables $X_{1}, \ldots, X_{p}$ in the prediction model or algorithm. Crucially, in prediction we are not interested in the inter-relationships between the explanatory variables $X_{1}, \ldots, X_{p}$ and their temporal ordering. The only aim is to achieve a good prediction of the outcome $Y$. It may be desirable to reduce the number of explanatory variables, particularly in settings where the number of potential predictors $p$ is very large. Various principled procedures are available for reducing the number of predictor variables.

### 11.6.2 Causality and explanation

In investigations of causality, one of the explanatory variables is designated as the treatment or exposure of interest. Let’s suppose this is variable $X_{1}$ and the research question is about how $X_{1}$ affects $Y$. Or, in other words, if $X_{1}$ had been different, how would $Y$ have been different? Let’s consider the settings of a randomized controlled trial and an observational study separately, and think of the situation where $X_{1}$ is a binary treatment variable.

*Randomized controlled trials (RCT)*

Suppose individuals are randomized to receive treatment ($X_{1} = 1$) or not ($X_{1} = 0$), and the outcome $Y$ is observed after some period of follow-up. It is straightforward to estimate the treatment effect in this setting because of the randomization. For a continuous outcome, we would quantify the treatment effect using the difference in the mean outcome between the two treatment groups, $E(Y \mid X_{1} = 1) - E(Y \mid X_{1} = 0)$. For a binary outcome we could quantify the treatment effect in terms of, for example:

* a risk difference, $\Pr(Y = 1 \mid X_{1} = 1) - \Pr(Y = 1 \mid X_{1} = 0)$
* a risk ratio, $\Pr(Y = 1 \mid X_{1} = 1) / \Pr(Y = 1 \mid X_{1} = 0)$
* an odds ratio, $\dfrac{\Pr(Y = 1 \mid X_{1} = 1) / \Pr(Y = 0 \mid X_{1} = 1)}{\Pr(Y = 1 \mid X_{1} = 0) / \Pr(Y = 0 \mid X_{1} = 0)}$
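The three measures for a binary outcome can be computed directly from the event counts in each arm. The sketch below is illustrative only: the function name `effect_measures` and the trial counts (20 events in 100 treated, 40 events in 100 untreated) are hypothetical, and no confidence intervals are calculated.

```python
def effect_measures(events_treated, n_treated, events_control, n_control):
    """Risk difference, risk ratio and odds ratio from two-arm event counts."""
    risk1 = events_treated / n_treated   # estimates Pr(Y = 1 | X1 = 1)
    risk0 = events_control / n_control   # estimates Pr(Y = 1 | X1 = 0)
    odds1 = risk1 / (1 - risk1)          # odds of the outcome among the treated
    odds0 = risk0 / (1 - risk0)          # odds of the outcome among the untreated
    return {
        "risk_difference": risk1 - risk0,
        "risk_ratio": risk1 / risk0,
        "odds_ratio": odds1 / odds0,
    }


# Hypothetical trial counts, invented for illustration.
result = effect_measures(20, 100, 40, 100)
print(result)
```

Note that the three measures give different numerical summaries of the same data, which is one reason the choice of effect measure should be stated explicitly.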

Some of the other explanatory variables $X_{2}, \ldots, X_{p}$ are likely to be associated with $Y$, but we do not need to use them to estimate the treatment effect due to the study design. Sometimes investigators will adjust for baseline variables, measured at the start of the trial prior to treatment. By the study design, baseline variables are not associated with the treatment (other than by chance). There can be advantages to adjusting for baseline variables that are predictors of the outcome, though there are particular nuances to the interpretation of the resulting estimates, depending on the type of outcome (continuous, binary, etc.) and on how the treatment effect is quantified.

Of course, there are many important considerations surrounding the validity and interpretation of treatment effects estimated using RCTs, such as whether the effect is a ‘per-protocol’ or ‘intention-to-treat’ effect, and whether there is drop-out, non-adherence or treatment switching.

*Observational studies*

Suppose we have available observational data on the treatment variable $X_{1}$ and the outcome $Y$, for example from electronic health records. In this setting the treatment is non-randomized, and there are very likely to be confounders of the association between the treatment and the outcome.

A confounder is a variable that affects both the treatment and the outcome. Confounding variables occur prior in time to both the treatment/exposure and the outcome. See VanderWeele and Shpitser (2013) for a formal statistical discussion of confounding.

To estimate the causal effect of $X_{1}$ on $Y$ requires us to control for confounding. Consider a simple setting in which there is only one other variable at play, $X_{2}$, which in the observational setting affects whether a person gets the treatment $X_{1}$ and also affects their outcome $Y$. For example, if $X_{1}$ is a blood pressure-lowering medication and $Y$ is blood pressure 1 year later, then $X_{2}$ could be the person’s blood pressure at the time origin. The assumed relationships between the three variables $X_{1}$, $X_{2}$ and $Y$ are illustrated in Figure 2 using directed acyclic graphs (DAGs), contrasting the relationships in an RCT and in an observational study. 

![image.png](attachment:image.png)

DAGs, also called ‘causal diagrams’, are used to graphically describe mechanistic relationships between variables using uni-directional arrows. An arrow connecting two variables indicates (potential) causation in the direction of the arrow and the absence of an arrow indicates an assumption that there is no direct causal effect of the first variable on the second. See Greenland et al. (Epidemiology, 1999) and Shrier and Platt (2008) for introductions to causal diagrams. Some other useful more recent articles on this are from Etminan et al. (2020) and Tennant et al. (2019). In a simple situation such as this example, we don’t need a DAG to tell us that we need to account for the confounding by $X_{2}$ in our analysis in order to estimate the effect of $X_{1}$ on $Y$. However, when there are lots of variables at play DAGs become very useful, and have formal theory attached.
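To make the idea concrete, a DAG can be stored as a mapping from each variable to the variables its arrows point to. The sketch below is an illustration under stated assumptions: the edge lists are read from the description in the text ($X_{2}$ affects both $X_{1}$ and $Y$ in the observational study, and randomization removes the arrow into $X_{1}$), and `common_causes` implements only a simplified rule (a variable with a directed path into the treatment, and a directed path into the outcome that avoids the treatment). It is not a substitute for the formal theory in the references.

```python
# Each key points to the variables it has an arrow into.
# Assumed edges for the observational study: X2 -> X1, X2 -> Y, X1 -> Y.
dag_observational = {"X2": ["X1", "Y"], "X1": ["Y"], "Y": []}
# Assumed edges for the RCT: randomization removes the arrow X2 -> X1.
dag_rct = {"X2": ["Y"], "X1": ["Y"], "Y": []}


def descendants(dag, node):
    """All variables reachable from `node` by following arrows."""
    seen, stack = set(), list(dag[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dag[current])
    return seen


def common_causes(dag, treatment, outcome):
    """Simplified rule: variables that cause the treatment and also cause
    the outcome through a path that does not pass through the treatment."""
    # Copy of the DAG with the treatment node and all arrows into it removed.
    pruned = {
        v: [c for c in children if c != treatment]
        for v, children in dag.items()
        if v != treatment
    }
    found = []
    for v in dag:
        if v in (treatment, outcome):
            continue
        if treatment in descendants(dag, v) and outcome in descendants(pruned, v):
            found.append(v)
    return found


print(common_causes(dag_observational, "X1", "Y"))
print(common_causes(dag_rct, "X1", "Y"))
```

Under these assumed edges the rule flags $X_{2}$ in the observational DAG and nothing in the RCT DAG, matching the discussion above.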

In summary, in a causal investigation the variables on which the research question focuses are $X_{1}$ and $Y$. However, depending on the study design, we may need to account for other variables in the analysis, though those other variables are not our main focus. The concept of confounding is not relevant in prediction investigations.


## 11.7 Example: the role of regression in different types of investigation:

Above, we have contrasted the different types of investigation. Although there are important conceptual differences between the methods, we often use the same statistical tools to address research questions for investigations of different types. Regression is a key tool for analyses in all the types of investigation. In this section we illustrate how the same regression model could be used in prediction investigations and causal investigations, but that the output from the regression should be used and interpreted differently.

We focus on a simple (fictitious) observational study involving three variables: two binary explanatory variables ‘maternal smoking status’ ($X_{1}$ = 1: smoker, $X_{1}$ = 0: non-smoker) and maternal socioeconomic status ($X_{2}$ = 1: low, $X_{2}$ = 0: high), and a continuous outcome ‘birth weight’ (measured in grams). The assumed relationships between the three variables are summarised in the DAG in Figure 3.

![image.png](attachment:image.png)

For the purposes of a simple illustration, we suppose that these are the only three variables at play in this ‘system’. In reality of course there are many other maternal and other characteristics that affect a baby’s birthweight, such as genetics, maternal diet and alcohol consumption, mother’s access to prenatal care, and other features of the environment.

Consider a linear regression of $Y$ on $X_{1}$, $X_{2}$, i.e.

<center>$Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \epsilon$</center>
<div style="text-align: right"> (1) </div>

The estimated regression coefficients and corresponding 95% confidence intervals for this fictitious example are $\hat{\beta}_{0} = 3227$ (95% CI $1603$ to $4851$), $\hat{\beta}_{1} = -341$ (95% CI $-513$ to $-169$), and $\hat{\beta}_{2} = -214$ (95% CI $-410$ to $-18$). We now consider how the output from this regression could be used in different investigation types.
   

### 11.7.1 Prediction

If the aim is to predict birth weight based on the two characteristics of the mother, this model allows us to do this. We could obtain the expected value of $Y$ given $X_{1}$ and $X_{2}$ (in this very simple example there are only 4 possible combinations). In this prediction setting we do not, however, particularly care about the estimates of the regression coefficients. We should instead be concerned with the predictive performance of the model. This could be measured, for example, using $R^2$. There are many details about how to appropriately assess and quantify the predictive performance of a prediction model which we do not discuss here.
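The four predicted values follow directly from the fitted coefficients of model (1). The short sketch below simply evaluates $\hat{\beta}_{0} + \hat{\beta}_{1}X_{1} + \hat{\beta}_{2}X_{2}$ for each combination; the function name `predicted_birth_weight` is a hypothetical helper introduced here for illustration.

```python
# Fitted coefficients quoted in the text for model (1), in grams.
BETA0, BETA1, BETA2 = 3227, -341, -214


def predicted_birth_weight(smoker, low_ses):
    """Expected birth weight (g) given X1 (smoker) and X2 (low SES), each 0 or 1."""
    return BETA0 + BETA1 * smoker + BETA2 * low_ses


# Evaluate the model at all four combinations of the two binary predictors.
predictions = {
    (x1, x2): predicted_birth_weight(x1, x2)
    for x1 in (0, 1)
    for x2 in (0, 1)
}

for (x1, x2), grams in predictions.items():
    print(f"X1={x1}, X2={x2}: {grams} g")
```

For example, the prediction for a smoking mother of low socioeconomic status is $3227 - 341 - 214 = 2672$ grams.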

### 11.7.2 Causality and explanation

Suppose instead that the aim is to assess the causal effect of maternal smoking ($X_{1}$) on birth weight ($Y$). In the simple setting of Figure 3, maternal socioeconomic status ($X_{2}$) is the only confounder of the association between $X_{1}$ and $Y$. The regression model in equation (1) adjusts for $X_{2}$ and hence the coefficient for $X_{1}$ can be interpreted as the conditional causal effect of $X_{1}$ on $Y$. The 95% confidence interval excludes 0. We can make the interpretation that if all mothers in the study population had smoked, the mean birth weight would have been 341 grams lower than had all mothers in the study population not smoked. This is referred to as an ‘average causal effect’. Here, we have not given any interpretation of the estimate of $\beta_{2}$ because it wasn’t relevant for our research question, even though it was important to include $X_{2}$ in the model to control for confounding. In a more realistic setting, there will be many other variables that confound the association between $X_{1}$ and $Y$ and which would need to be accounted for to enable a causal interpretation of $\beta_{1}$.
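A small simulation can illustrate why the adjustment matters. The sketch below generates data under an assumed version of Figure 3 in which the true coefficients equal the estimates quoted above. Everything else is an assumption made for illustration: the 50% prevalence of low socioeconomic status, the smoking prevalences of 0.4 and 0.2, and the residual standard deviation of 400 grams. To avoid external libraries it adjusts for $X_{2}$ by stratifying and averaging the stratum-specific differences, rather than by fitting regression model (1).

```python
import random
from statistics import mean

random.seed(2024)  # fixed seed so the sketch is reproducible


def simulate(n=20000):
    """Simulate (x1, x2, y) under an assumed version of Figure 3."""
    rows = []
    for _ in range(n):
        x2 = 1 if random.random() < 0.5 else 0      # low SES (assumed 50%)
        p_smoke = 0.4 if x2 == 1 else 0.2           # assumed smoking prevalences
        x1 = 1 if random.random() < p_smoke else 0
        # Outcome uses the coefficients from the text; SD of 400 g is assumed.
        y = 3227 - 341 * x1 - 214 * x2 + random.gauss(0, 400)
        rows.append((x1, x2, y))
    return rows


def mean_outcome(rows, x1, x2=None):
    """Mean of y among rows with the given x1 (and x2, if supplied)."""
    return mean([y for a, b, y in rows if a == x1 and (x2 is None or b == x2)])


rows = simulate()

# Crude comparison: ignores the confounder X2.
crude = mean_outcome(rows, 1) - mean_outcome(rows, 0)

# Adjusted comparison: difference within each stratum of X2,
# averaged over the observed distribution of X2.
adjusted = 0.0
for x2 in (0, 1):
    share = sum(1 for _, b, _ in rows if b == x2) / len(rows)
    adjusted += share * (mean_outcome(rows, 1, x2) - mean_outcome(rows, 0, x2))

print(f"crude difference:    {crude:.0f} g")
print(f"adjusted difference: {adjusted:.0f} g")
```

Under these assumed parameters the crude difference is, in expectation, about 392 grams lower among smokers, which overstates the simulated causal effect of 341 grams because smokers are more often of low socioeconomic status. The adjusted comparison should recover a value close to 341 grams.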

### 11.7.3 The “Table 2 Fallacy”

After adjusting for maternal socioeconomic status, maternal smoking was associated with a lowering of 341 grams in mean birth weight. After adjusting for maternal smoking, low maternal socioeconomic status was associated with a lowering of 214 grams in mean birth weight. However, $\beta_{1}$ and $\beta_{2}$ in model (1) do not have the same type of interpretation and this is due to the relationships between the three variables. According to the causal diagram (Figure 3), maternal smoking status is on the causal pathway from socioeconomic status to birth weight. Hence the parameter $\beta_{2}$ in fact represents the effect of socioeconomic status on birth weight that does not go through smoking status - this is a ‘direct effect’ rather than a ‘total effect’. By contrast, $\beta_{1}$ represents the total effect of smoking status on birth weight. We do not go into details about definitions of different types of effect. The aim here is simply to point out that the correct interpretation of the coefficients in the regression model in (1) depends on assumptions about the inter-relationships between the three variables, including how they are ordered in time.
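The gap between the direct and total effect of socioeconomic status can be sketched numerically. Under the relationships assumed in Figure 3 and the linear model (1) with no interaction, the total effect of $X_{2}$ is the direct effect $\beta_{2}$ plus an indirect part, $\beta_{1}$ multiplied by the difference in smoking prevalence between the two socioeconomic groups. The prevalences used below (0.4 and 0.2) are hypothetical, since the text does not report them, and the function name is introduced only for illustration.

```python
BETA1, BETA2 = -341, -214  # estimates quoted in the text (grams)


def ses_effects(p_smoke_low_ses, p_smoke_high_ses):
    """Direct, indirect and total effect of low SES on mean birth weight (g),
    assuming the DAG of Figure 3 and linear model (1) without interaction."""
    direct = BETA2
    # Part of the effect transmitted through smoking status.
    indirect = BETA1 * (p_smoke_low_ses - p_smoke_high_ses)
    return direct, indirect, direct + indirect


# Hypothetical smoking prevalences in the low and high SES groups.
direct, indirect, total = ses_effects(0.4, 0.2)
print(f"direct: {direct:.1f} g, indirect: {indirect:.1f} g, total: {total:.1f} g")
```

With these made-up prevalences the total effect (about 282 grams lower) is noticeably larger in magnitude than the coefficient of 214 grams reported for $X_{2}$, which is why reading $\hat{\beta}_{2}$ as ‘the effect of socioeconomic status’ would be misleading.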

In some (or perhaps many) epidemiological investigations that involve exploration of risk factors, estimates of regression coefficients from multivariable models such as that in (1) (and versions with many more explanatory variables) are presented alongside one another in a table, together with confidence intervals and p-values. They may then be interpreted as though all coefficients had the same meaning, ignoring possible inter-relationships between the variables and temporal ordering. As we have seen from the above example, this could be misleading. This problem has been referred to in the literature as the ‘Table 2 fallacy’, because the estimates of regression coefficients are often presented in ‘Table 2’ in a paper (where ‘Table 1’ is usually a table of descriptive statistics). See Westreich and Greenland (2013) for a description of the Table 2 fallacy. Bandoli et al. (2018) provide an example in the context of preeclampsia and preterm birth.


## 11.8 Concluding remarks:

Above we placed some emphasis on how the investigation type affects what variables should be included in the analysis and on how the results might be interpreted. There are naturally many other things to consider which are beyond the scope of this session. The above example focused on regression. The next few sessions in this module will focus on regression models of different types. They are a fundamental part of the statistician’s toolbox and are used in investigations of different types. However, there are many other specialised methods available for specific tasks. For example, in descriptive analyses we may use clustering methods and principal components analysis. In prediction tasks, machine learning methods not based on regression are increasingly used. In studies of causal effects many specialised methods have been developed over recent years. Some of these involve regression and others not.

The type of investigation affects how we should assess the performance and assumptions of a model/analysis. For example, in prediction tasks we should assess how well the prediction model performs in terms of predicting the outcome for a new individual. This requires tools such as cross-validation, and measures of predictive performance such as $R^2$, area under the curve, sensitivity and specificity. In causal analyses we are concerned with whether the assumptions of the models used are valid and whether the model is correctly specified, alongside the validity of untestable assumptions such as whether there are any important confounders that have not been accounted for in the analysis.
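As a rough sketch of how out-of-sample performance might be assessed, the code below runs 5-fold cross-validation on simulated data and reports a cross-validated $R^2$. Several things here are assumptions for illustration: the data are simulated (using the coefficients from the earlier birth weight example with an assumed residual standard deviation of 400 grams), the ‘model’ is simply the mean outcome within each combination of the two binary predictors rather than regression model (1), and $R^2$ is computed relative to the training-fold mean, which is one of several possible conventions.

```python
import random
from statistics import mean

random.seed(7)  # fixed seed so the sketch is reproducible

# Simulated data (hypothetical): two binary predictors and a continuous outcome.
data = []
for _ in range(1000):
    x1 = random.randint(0, 1)
    x2 = random.randint(0, 1)
    y = 3227 - 341 * x1 - 214 * x2 + random.gauss(0, 400)
    data.append((x1, x2, y))


def fit_cell_means(train):
    """Fit a very simple predictor: the mean outcome in each (x1, x2) cell."""
    overall = mean([y for _, _, y in train])
    cells = {}
    for x1, x2, y in train:
        cells.setdefault((x1, x2), []).append(y)
    means = {key: mean(values) for key, values in cells.items()}
    # Fall back to the overall mean for a cell not seen in training.
    return lambda x1, x2: means.get((x1, x2), overall)


def cross_validated_r2(rows, k=5):
    """k-fold cross-validated R^2: 1 - (prediction error) / (error of the
    training-fold mean), accumulated over the held-out folds."""
    rows = rows[:]
    random.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    ss_res = 0.0
    ss_tot = 0.0
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        predict = fit_cell_means(train)
        train_mean = mean([y for _, _, y in train])
        ss_res += sum((y - predict(x1, x2)) ** 2 for x1, x2, y in held_out)
        ss_tot += sum((y - train_mean) ** 2 for _, _, y in held_out)
    return 1 - ss_res / ss_tot


r2 = cross_validated_r2(data)
print(f"cross-validated R^2: {r2:.2f}")
```

The key design point is that each observation is predicted by a model fitted without it, so the resulting measure reflects performance for new individuals rather than fit to the data used for model development.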

This session aimed to provide a broad overview of different types of investigation used in medical statistics/health data science, and which you are likely to encounter in your future careers. This topic has seen some recent emphasis in the literature. The statistical and epidemiological community is increasingly emphasising the need for researchers to ensure they conduct meaningful studies and interpret findings appropriately, particularly relating to the use of observational data. It is a wide topic, and we have only touched on some aspects here.


## References

NOTE: You are not expected to read all of these references! It is intended as a list of resources that you may find useful in the future or if you wish to follow-up on some of the topics discussed in more detail.

Bandoli G., Palmsten K., Chambers C.D., et al. Revisiting the Table 2 fallacy: A motivating example examining preeclampsia and preterm birth. Pediatric and Perinatal Epidemiology 2018; 32: 390-397.

D’Agostino R.B., Vasan R.S., Pencina M.J., et al. General Cardiovascular Risk Profile for Use in Primary Care: The Framingham Heart Study. Circulation 2008; 117: 743–753.

Etminan M, Collins GS, Mansournia MA. Using Causal Diagrams to Improve the Design and Interpretation of Medical Research. CHEST 2020; 158: Supplement S21-S28.

Greenland S., Pearl J., Robins J.M. Causal diagrams for epidemiological research. Epidemiology 1999; 10:37–48.

Hand D. What is the Purpose of Statistical Modelling? Harvard Data Science Review 2019 https://doi.org/10.1162/99608f92.4a85af74

Hernan M.A. Does water kill? A call for less casual causal inferences. Annals of Epidemiology 2016; 26: 674-680.

Hernan M.A. The C-Word: Scientific Euphemisms Do Not Improve Causal Inference From Observational Data. Am J Public Health. 2018;108: 616–619.

Hernan M.A., Hsu J., Healy B. A second chance to get causal inference right: a classification of data science tasks. Chance 2019; 32: 42-49.

Huebner M., le Cessie S., Schmidt C., Wach W. A Contemporary Conceptual Framework for Initial Data Analysis. Observational Studies 2019; 4: 171-192.

Riley R.D. et al. Prognosis Research in Healthcare: Concepts, Methods, and Impact. 2019. Oxford University Press.

Shmueli G. To explain or to predict? Statistical Science 2010; 25: 289-310.

Schooling CM, Jones H. Clarifying questions about “risk factors”: predictors versus explanation. Emerging Themes in Epidemiology 2018; 15: 10.

Shrier I., Platt R.W. Reducing bias through directed acyclic graphs. BMC Medical Research Methodology 2008; 8: 70.

Spiegelhalter D. The Art of Statistics: Learning from Data. 2019. Penguin.

Steyerberg E. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd Edition. 2019. Springer.

Tennant PWG, Harrison WJ, Murray EJ, et al. Use of directed acyclic graphs (DAGs) in applied health research: review and recommendations. MedRxiv 2019.
https://www.medrxiv.org/content/10.1101/2019.12.20.19015511v1

VanderWeele T.J., Shpitser I. On the definition of a confounder. Annals of Statistics 2013; 41: 196-220.

Westreich D., Greenland S. The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients. American Journal of Epidemiology 2013; 177: 292-298.
