# Data Structures
... Note that many of these examples are taken from [West, Welch & Galecki (2022)](https://www.taylorfrancis.com/books/mono/10.1201/9781003181064/linear-mixed-models-brady-west-kathleen-welch-andrzej-galecki?context=ubx&refId=4dbda113-7f59-4ba9-a710-897ef536cfee), which is the recommended text for this part of the unit.

## Types of Structure

### Clustered Data
The first type of data structure is referred to as *clustered data*. Unlike all the examples we have seen so far, clustered data has no repeated measurements on the units of analysis. Instead, the units are grouped, or nested, within *clusters* of units. 

For example, the first clustered dataset we will see concerns the birth weight of rat pups. The rat pups represent the units of analysis and each only has a single weight, so there are no repeated measurements here. However, the pups can be grouped into *litters* from the same mother. The assumption here is that pups from the same litter may be correlated in some fashion. Perhaps one mother consistently has pups that are *underweight*, or *overweight*. We can also think of this as *two levels* of variation, one between the pups within a litter and one between litters. An LME model can be used to capture this structure.
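The idea of two levels of variation can be sketched with a small simulation. This is a minimal illustration rather than an analysis of the rat pup data: the number of litters, the litter size, the grand mean and both standard deviations are invented, and the function names are hypothetical. Each weight is built as a grand mean, plus an effect shared by the whole litter, plus pup-level noise.

```python
import random
import statistics


def simulate_litters(n_litters=200, pups_per_litter=8,
                     grand_mean=6.0, sd_litter=0.5, sd_pup=0.4, seed=1):
    """Simulate pup weights as grand mean + litter effect + pup-level noise.

    All numbers are made up for illustration; they are not estimates
    from the rat pup dataset.
    """
    rng = random.Random(seed)
    litters = []
    for _ in range(n_litters):
        litter_effect = rng.gauss(0.0, sd_litter)  # shared by every pup in the litter
        weights = [grand_mean + litter_effect + rng.gauss(0.0, sd_pup)
                   for _ in range(pups_per_litter)]
        litters.append(weights)
    return litters


def variance_components(litters):
    """Moment-based between- and within-litter variances (equal litter sizes)."""
    n = len(litters[0])
    within = statistics.fmean([statistics.variance(w) for w in litters])
    litter_means = [statistics.fmean(w) for w in litters]
    # The variance of the litter means still contains within/n of pup-level
    # noise, so subtract it to isolate the between-litter part.
    between = max(statistics.variance(litter_means) - within / n, 0.0)
    return between, within


litters = simulate_litters()
between, within = variance_components(litters)
share = between / (between + within)  # proportion of variance due to litters
print(f"between = {between:.3f}, within = {within:.3f}, share = {share:.3f}")
```

The larger the between-litter share, the more strongly pups from the same litter resemble each other. An LME model estimates these two components directly rather than through the moment calculation used here.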

Similarly, our second clustered dataset concerns maths scores from students. Each student is the unit of analysis and only has a single maths score, so there are no repeated measurements. Nevertheless, there is structure here, as each student is clustered within a class, and classes are clustered within a school. Much like the rat pups, the assumption here is that students from the same class may be correlated, and that classes from the same school may be correlated too. Perhaps some schools generally produce higher maths grades than other schools, and within those schools some classes also perform consistently better or worse. There are therefore *three levels* of variation here: students, classes and schools. Again, even without repeated measurement of the unit of analysis, an LME model can capture this structure.

### Repeated Measurements
The second type of data structure is *repeated measurements*. Obviously, we know a fair amount about these types of data by this point. However, as has been mentioned multiple times so far, there is a key difference between repeated measurements *with* replications and repeated measurements *without* replications. The power of LME models is only really available when we have replications, in other words, when there are *multiple* observations per subject and per repeat. Without this, we are confined to *random intercepts* and must assume a compound symmetric covariance structure. This is effectively a repeated measures ANOVA without any sphericity correction.
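To show what a compound symmetric structure looks like, the sketch below builds the covariance matrix implied by a random-intercepts model for one subject. The function name and the two variance values are assumptions chosen for illustration. Every diagonal entry is the subject variance plus the residual variance, and every off-diagonal entry is the subject variance alone.

```python
def compound_symmetric(n_repeats, var_subject, var_residual):
    """Covariance matrix for one subject under a random-intercepts model."""
    return [[var_subject + (var_residual if i == j else 0.0)
             for j in range(n_repeats)]
            for i in range(n_repeats)]


# Hypothetical variances: 4.0 between subjects, 1.0 residual.
cov = compound_symmetric(n_repeats=3, var_subject=4.0, var_residual=1.0)
for row in cov:
    print(row)

# Every pair of repeats shares the same correlation.
correlation = cov[0][1] / cov[0][0]
print("correlation between any two repeats:", correlation)
```

The restriction is visible in the output: one variance and one covariance describe the whole matrix, no matter how many repeats there are.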

For example, the repeated measurements dataset we will examine in this lesson concerns measurements of a chemical in the brains of rats under two different treatments. The treatments are defined *within-subject*, so each rat (the unit of analysis) gets *both*. Importantly, these measurements are collected from *three* different brain regions. Each rat therefore has 6 measurements, 3 for each treatment. If our interest is only in the effects of treatment, we can think of these as 3 replications per repeat. Rather than averaging over these replications to create a mean per rat and per treatment, we can instead use the replications directly in the model. This then allows much greater scope to capture complex random-effects structures, as we will see later in this lesson.
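The layout described above can be sketched as a small long-format table. The rat IDs, treatment labels, region names and values below are all placeholders rather than the real dataset. The point is only to show the 6 rows per rat, and what is lost when the regions are averaged away.

```python
from collections import defaultdict
from itertools import product
import random

rng = random.Random(0)
rats = ["R1", "R2", "R3"]                     # hypothetical animal IDs
treatments = ["A", "B"]                       # hypothetical treatment labels
regions = ["region1", "region2", "region3"]   # placeholder region names

# Long format: one row per rat x treatment x region, with invented values.
rows = [{"rat": r, "treatment": t, "region": g,
         "value": round(rng.gauss(100, 10), 1)}
        for r, t, g in product(rats, treatments, regions)]

rows_per_rat = defaultdict(int)
for row in rows:
    rows_per_rat[row["rat"]] += 1

# Averaging over regions leaves a single value per rat and treatment.
cells = defaultdict(list)
for row in rows:
    cells[(row["rat"], row["treatment"])].append(row["value"])
cell_means = {key: sum(values) / len(values) for key, values in cells.items()}

print(dict(rows_per_rat))
print(len(rows), "rows before averaging;", len(cell_means), "after")
```

After averaging, each rat contributes one number per treatment, so there are no replications left to support anything beyond a random intercept.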

### Longitudinal
The third type of data structure is *longitudinal*. This is very similar to repeated measurements in the sense that each unit of analysis is measured *multiple times*. The difference is that these measurements are both *spaced* over a longer period of time and have a *natural order*.

Repeated measurements are typically collected close together in time and the order of collection is largely immaterial. For instance, in the rat brain example above, the measurements were taken from each brain at the same point in time and it does not matter what order the brain regions were sampled in. Typically, this means that missing data is *less* common under repeated measurements, unless there is equipment failure or a subject decides to withdraw midway through collection.

*Longitudinal* data is spaced much further apart in time, typically by days, weeks, months or years. This has *two* consequences. Firstly, there is a natural ordering to the measurements that needs to be taken into account. If we are looking for patterns in the data, we want these to be defined in terms of *time*. Secondly, the gap of time between the measurements needs to be considered. If we take measurements at 1 year, 2 years and 5 years, we cannot treat the 1-year gap between the first two measurements the same as the 3-year gap between the last two. The amount of time passed needs to be in the model. This is not true of repeated measurements, where we can model the effects of the different levels without consideration of *order* or *measurement gap*.

Missing data is also much more common in longitudinal data, as you might imagine. This is where LME models have an advantage over other methods, thanks to *pooling* and *shrinkage*.
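The point about measurement gaps can be made concrete with a little arithmetic. The sketch below uses the 1, 2 and 5 year measurement times from the text. The intercept and yearly slope are invented purely for illustration.

```python
years = [1, 2, 5]                      # measurement times from the text
visit_number = [1, 2, 3]               # what we would have if we ignored the gaps

# Time elapsed between consecutive measurements.
gaps = [later - earlier for earlier, later in zip(years, years[1:])]

# Hypothetical linear trend: the outcome rises by 2 units per year.
intercept, slope_per_year = 10.0, 2.0
expected = [intercept + slope_per_year * t for t in years]
change_between_visits = [b - a for a, b in zip(expected, expected[1:])]

print("gaps in years:", gaps)
print("expected change between visits:", change_between_visits)
```

With a constant yearly slope, the change between the second and third visits is three times the change between the first and second. Coding time by `visit_number` would treat those two steps as equivalent, which is why the elapsed time itself needs to enter the model.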

As an example of longitudinal data, we will examine a dataset of autistic children measured on a scale of *socialisation* at ages 2, 3, 5, 9 and 13 years. Notice that there are uneven gaps between these measurements, which need to be accommodated. In addition, there will almost certainly be missing data considering the practical difficulties of getting autistic children to reliably return for measurement across 11 years of their lives. Importantly, we need to consider the *reasons* for missing data. If these are innocuous, such as the family moving house, then this is unfortunate but not disastrous. However, if the missingness speaks to some sort of *confound* then we are in trouble. For instance, if the reason why children do not return is *related* to poor socialisation skills, then this will create a bias, as the *worst* cases will disappear over time and it will appear as if socialisation is improving when the *opposite* might be true. This highlights the serious difficulties around longitudinal data and the inferences that we can draw.
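The bias described above can be illustrated with a toy simulation. Nothing here comes from the autism dataset: the baseline distribution, the declining slope, the noise level and the dropout rule are all invented, and `simulate_dropout` is a hypothetical helper. Every child's true score declines slightly with age, but children with a low score at the previous visit are much more likely to stop attending.

```python
import random
import statistics


def simulate_dropout(n_children=2000, seed=42):
    """Simulate declining scores where low scorers tend to leave the study."""
    rng = random.Random(seed)
    ages = [2, 3, 5, 9, 13]
    true_slope = -0.5  # invented: every child declines by 0.5 points per year
    full = {age: [] for age in ages}       # scores we would see with no dropout
    observed = {age: [] for age in ages}   # scores from children still attending
    for _ in range(n_children):
        baseline = rng.gauss(50.0, 10.0)
        in_study = True
        previous = None
        for age in ages:
            score = baseline + true_slope * (age - ages[0]) + rng.gauss(0.0, 3.0)
            full[age].append(score)
            if in_study and previous is not None:
                # Invented rule: a low previous score makes dropout far more likely.
                p_drop = 0.6 if previous < 45.0 else 0.05
                if rng.random() < p_drop:
                    in_study = False
            if in_study:
                observed[age].append(score)
            previous = score
    return ages, full, observed


ages, full, observed = simulate_dropout()
for age in ages:
    print(f"age {age:>2}: full mean = {statistics.fmean(full[age]):.1f}, "
          f"observed mean = {statistics.fmean(observed[age]):.1f}, "
          f"n observed = {len(observed[age])}")
```

Comparing the two columns of means shows how the children who remain can paint a rosier picture than the full sample, even though no individual child has improved.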

### Clustered Longitudinal
Finally, clustered longitudinal data combines elements of both *longitudinal* and *clustered* data. In these types of datasets, measurements are taken repeatedly from the unit of analysis, but these units are also grouped within clusters. We therefore have a correlation structure that relates to both the repeated measurements over time and to the clusters. 

As an example, we will look at a dataset where measurements of gingivitis were taken from teeth that had been fitted with dental veneers. These measurements were taken 3 months after fitting and 6 months after fitting. This time gap is long and has a clear temporal ordering, so we would think of this as *longitudinal* rather than *repeated measurements*. In addition, *multiple* teeth were sampled from each patient. So, each tooth has two measurements over time and is clustered within a specific individual. This adds an additional layer of variation, because teeth from the same subject are likely to be correlated. Hence, this is clustered longitudinal data.

````{admonition} A Note on Time Series Data
:class: tip
As a point of clarity, another type of data you may come across is *time series* data. Unlike longitudinal data, time series data are generated from a *single* unit and tend to have many more measurements taken much closer in time. One example could be real-time measurements of gait from someone with Parkinson's disease who is asked to perform a walking task. By the end of the task, there may be hundreds or thousands of measurements taken across time on this single unit. Importantly, these types of data will exhibit a *very specific* correlation pattern, where values close together in time will be *more correlated* than values further apart in time. This requires a specialised correlation structure, known as an *autoregressive* model, which parameterises the correlation as decreasing with time. These types of structure are more difficult to specify using mixed-effects models and provide a good opportunity to use GLS.
````
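As a rough illustration of the autoregressive pattern mentioned in the note above, the sketch below builds a first-order autoregressive, or AR(1), correlation matrix for equally spaced time points. AR(1) is assumed here as one common choice rather than the only autoregressive structure, and the value of `rho` is invented. The correlation between two time points is `rho` raised to the power of the number of steps separating them.

```python
def ar1_correlation(n_times, rho):
    """AR(1) correlation matrix for equally spaced time points."""
    return [[rho ** abs(i - j) for j in range(n_times)]
            for i in range(n_times)]


# Hypothetical correlation of 0.8 between adjacent time points.
R = ar1_correlation(n_times=5, rho=0.8)
for row in R:
    print("  ".join(f"{value:.3f}" for value in row))
```

Reading along any row, the correlation falls away as the time points move further apart. This is the pattern the note describes, and the opposite of compound symmetry, where every pair shares the same correlation.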

## Structures as Hierarchies
... It is important to define your *unit of analysis* ...

A summary of the types of data, their associated examples and the various levels of the hierarchy are given in the table below. For each example, the *unit* of analysis is given in *italics*.


| **Data Type**    | **Clustered (2-level)** | **Clustered (3-level)** | **Repeated Measures**              | **Longitudinal**      | **Clustered Longitudinal** |
|------------------|-------------------------|-------------------------|------------------------------------|-----------------------|----------------------------|
| **Example**      | Rat Pup                 | Classroom               | Rat Brain                          | Autism                | Dental Veneer              |
| **Level 1**      | _Rat Pup_               | _Student_               | Replications within each treatment | Measures across years | Measures across months     |
| **Level 2**      | Litter                  | Classroom               | _Rat_                              | _Child_               | _Tooth_                    |
| **Level 3**      |                         | School                  |                                    |                       | Patient                    |

The easiest way to think about this is as follows:

- **Level 1** represents observations at the *most detailed* level of the dataset. For *clustered* data, this represents the *units* of analysis, whereas for *repeated measurement/longitudinal* data this represents the repeated measurements collected for each unit. Remember, the outcome variable is always measured at Level 1 and so Level 1 represents the finest granularity of whatever that measurement is.
- **Level 2** represents the next level in the hierarchy. For *clustered* data, this will be the *clusters of units*. For *repeated measurement/longitudinal* data, this will represent the *units* of analysis. 
- **Level 3** represents the next level in the hierarchy, above Level 2. For *clustered* data, this will represent *clusters of clusters*. For *clustered longitudinal* data, this will represent clusters of the units defined at Level 2.

There are not usually more than 3 levels in these types of datasets, though this is neither a rule nor a limitation. Part of the challenge of using a multilevel conceptualisation is defining what a particular dataset actually is. We can use the rules above to define each level of the data, but only if we know what type of structure we have. We will now turn to a set of steps you can use to try and determine this. Once it is known, apply the definitions of the levels above, and you should be able to write a table similar to the above for your data. This will be the first step in building the most suitable model.
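Once the levels have been written down, one quick sanity check is to count how many distinct units exist at each level. The sketch below does this for a toy version of the dental veneer structure. It assumes long-format rows with hypothetical column names (`patient`, `tooth`, `month`) and invented values, and `count_levels` is an illustrative helper rather than a function from any library.

```python
def count_levels(rows, id_columns):
    """Count distinct units per level; id_columns run from highest level to lowest."""
    counts = {}
    for depth in range(1, len(id_columns) + 1):
        keys = id_columns[:depth]
        # A lower-level unit is only unique in combination with its parents,
        # e.g. tooth "T1" in patient "P1" differs from "T1" in patient "P2".
        distinct = {tuple(row[k] for k in keys) for row in rows}
        counts[id_columns[depth - 1]] = len(distinct)
    return counts


# Toy data: 2 patients x 2 teeth x 2 time points, with made-up scores.
rows = [
    {"patient": p, "tooth": t, "month": m, "score": 0.0}
    for p in ["P1", "P2"]
    for t in ["T1", "T2"]
    for m in [3, 6]
]

counts = count_levels(rows, ["patient", "tooth", "month"])
print(counts)
```

The counts should grow as we move down the hierarchy, and the Level 1 count should match the number of rows. If it does not, the data are either not in long format or the levels have been mis-specified.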

## How to Determine the Structure of a Dataset?

## Step 1. Determine the *unit* of analysis

## Step 2. Determine what each *row* of the dataset represents
Importantly, this corresponds to rows in the *long format*, so any wide-formatted data need to be converted first.
- If each row is *one unit* measured *multiple times*, then this is either repeated measures or longitudinal.
- If each row is *one unit* measured *once*, then there are no repeated measurements and any structure must come from clustering.
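As a minimal sketch of the wide-to-long conversion mentioned in this step, the function below reshapes a list of wide rows into long rows using only the standard library. The column names, child IDs and scores are invented, and `wide_to_long` is a hypothetical helper written for this example. In practice a data-frame library would normally be used instead.

```python
def wide_to_long(wide_rows, id_column, value_columns,
                 time_name="age", value_name="score"):
    """Turn one row per unit into one row per unit and measurement occasion.

    value_columns maps each time label to the wide column holding that value.
    """
    long_rows = []
    for row in wide_rows:
        for time_label, column in value_columns.items():
            value = row.get(column)
            if value is None:  # skip occasions with no measurement
                continue
            long_rows.append({id_column: row[id_column],
                              time_name: time_label,
                              value_name: value})
    return long_rows


# Invented wide data: one row per child, one column per age.
wide = [
    {"child": "C1", "score_age2": 10, "score_age3": 14, "score_age5": 20},
    {"child": "C2", "score_age2": 8, "score_age3": None, "score_age5": 11},
]
columns = {2: "score_age2", 3: "score_age3", 5: "score_age5"}

long_rows = wide_to_long(wide, "child", columns)
for row in long_rows:
    print(row)
```

In the long version, a unit that was measured several times appears in several rows, which is what makes the row-based questions in this step answerable.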

## Step 3. Determine whether you have clusters
... We can use a *visual metaphor* here to try and make the distinction clear: *rubber bands* vs *sticky notes*.

To make the distinction, consider whether the grouping represents something the unit HAS vs something that the unit is IN. Clustering corresponds to shared context and thus is something a unit is IN: a school, a classroom, a clinic, a hospital, a litter, a neighbourhood. Between-subjects grouping corresponds to a shared characteristic of the unit and thus is something the unit HAS: a diagnosis, a blood type, a gender, an experimental treatment, an IQ level.

Another way to think about this is that a between-subjects variable represents a *systematic difference* that we want to capture. These variables affect the *mean structure*. Clusters, on the other hand, represent a *shared latent influence* that affects the *covariance structure*.
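One way to make this split concrete is to write down a simple random-intercept model. The notation is an assumption introduced here for illustration: $y_{ij}$ is the outcome for unit $i$ in cluster $j$, $g_{ij}$ is a 0/1 indicator for a between-subjects group, $u_j$ is the latent effect of cluster $j$, and $u_j$ and $\epsilon_{ij}$ are taken to be independent.

$$
y_{ij} = \beta_0 + \beta_1 g_{ij} + u_j + \epsilon_{ij}, \qquad u_j \sim \mathcal{N}(0, \sigma^2_u), \qquad \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Under these assumptions, the mean and covariance are

$$
E(y_{ij}) = \beta_0 + \beta_1 g_{ij}, \qquad
\operatorname{Cov}(y_{ij}, y_{i'j'}) =
\begin{cases}
\sigma^2_u + \sigma^2 & \text{same unit } (i = i',\ j = j') \\
\sigma^2_u & \text{different units, same cluster } (i \neq i',\ j = j') \\
0 & \text{different clusters } (j \neq j')
\end{cases}
$$

The between-subjects variable appears only in the mean, whereas the cluster appears only in the covariance.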

````{admonition} Clusters vs Between-subject Groups
:class: tip
... The difference comes down to *error structure*. Imagine we focus on a single school, take a single pupil from a specific maths classroom and find that their test score is unusually *low* compared to the population mean of *all* maths scores across *all* the maths classrooms. The question is then, does this tell us anything about another pupil from a *different* classroom? In general, *no*. Just because that pupil's score is low, it does not mean another pupil from an entirely separate classroom with a different teacher would also be low. *However*, if we take a *different* pupil from the *same* classroom, the first pupil's score may well tell us *something* about the second. Because they both share the same *context* (classroom environment, teacher, pupils), there could be a relationship here. If one pupil is scoring *below* the average, it could be that *all* pupils in that classroom are also scoring below the average due to the teacher, or other shared elements. So, in this example, the *clustering* creates the potential for correlation.

To see how this differs from a between-subject grouping, imagine we had a sample of patients diagnosed with major depressive disorder (MDD). If we take one patient at random and calculate a mood score, it might be below the average mood of all people with depression. It might also be above the average mood, or pretty much spot-on. Either way, this tells us *nothing* about the mood of someone else with MDD. We cannot predict the deviation from the average in one person based on another. So the patients are *independent* and this is a *between-subjects* grouping, rather than a *cluster*. There is no shared environment or context that would induce correlation within the errors.

However, there *could* be additional structure here that would induce correlation. For instance, imagine that the patients with MDD are also clustered within *treatment clinics*. These clinics correspond to a shared environment, with the same doctors, nurses, location and equipment. In this instance, patients treated at the same clinic *could* be correlated. Perhaps one clinic is just more effective at treatment than another. In this case, mood scores from that clinic will tend to be *higher* than the population average. Knowing one patient's deviation *does* allow us to predict another patient's deviation. Because they were treated at the *same* clinic, they may *both* have similarly high deviations from the population mean. Importantly, this does not allow us to predict the deviation in mood of a patient from an entirely separate clinic. So, in this structure, clinics are independent, but patients within clinics may share some degree of dependency.
````
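The argument in the note above can be checked with a small simulation. This is a sketch under invented settings: the classroom and pupil standard deviations are made up, scores are expressed as deviations from the population mean, and `simulate_pairs` is a hypothetical helper. It draws many pairs of pupils, either from the same classroom or from two different classrooms, and correlates their deviations.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_pairs(n_pairs=5000, sd_class=5.0, sd_pupil=5.0, seed=7):
    """Correlate deviations for same-classroom and different-classroom pairs."""
    rng = random.Random(seed)
    same_a, same_b, diff_a, diff_b = [], [], [], []
    for _ in range(n_pairs):
        shared = rng.gauss(0.0, sd_class)  # one classroom effect for both pupils
        same_a.append(shared + rng.gauss(0.0, sd_pupil))
        same_b.append(shared + rng.gauss(0.0, sd_pupil))
        # Pupils from different classrooms each get their own classroom effect.
        diff_a.append(rng.gauss(0.0, sd_class) + rng.gauss(0.0, sd_pupil))
        diff_b.append(rng.gauss(0.0, sd_class) + rng.gauss(0.0, sd_pupil))
    return pearson(same_a, same_b), pearson(diff_a, diff_b)


r_same, r_diff = simulate_pairs()
print(f"same classroom: r = {r_same:.2f}; different classrooms: r = {r_diff:.2f}")
```

Pupils who share a classroom have clearly correlated deviations, while pupils from different classrooms do not. A between-subjects grouping behaves like the second case, because there is no shared latent effect tying the units together.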