# Data Structures
... Note that many of these examples are taken from [West, Welch & Galecki (2022)](https://www.taylorfrancis.com/books/mono/10.1201/9781003181064/linear-mixed-models-brady-west-kathleen-welch-andrzej-galecki?context=ubx&refId=4dbda113-7f59-4ba9-a710-897ef536cfee), which is the recommended text for this part of the unit.

## Types of Structure

### Clustered Data
The first type of data structure is referred to as *clustered data*. Unlike all the examples we have seen so far, clustered data has no repeated measurements on the units of analysis. Instead, the units are grouped, or nested, within *clusters* of units. 

For example, the first clustered dataset we will see concerns the birth weight of rat pups. The rat pups are the units of analysis and each has only a single weight, so there are no repeated measurements. However, the pups can be grouped into *litters* from the same mother. The assumption is that pups from the same litter may be correlated in some fashion. Perhaps one mother consistently has pups that are *underweight*, or *overweight*? We can also think of this as *two levels* of variation: one between pups within a litter and one between litters. An LME model can be used to capture this structure.
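To make the idea of two levels of variation concrete, the following is a minimal sketch that simulates clustered data and splits the variance into a between-litter and a within-litter part. All of the numbers (litter count, litter size, means and standard deviations) are hypothetical values chosen for illustration and are not taken from the real rat pup dataset. The estimator used here is a simple moment-based calculation for balanced data, not what an LME model fits.

```python
import random
import statistics

random.seed(1)

# Hypothetical simulation settings (NOT taken from the real rat pup data)
n_litters = 200        # number of clusters
pups_per_litter = 10   # units of analysis per cluster
sd_between = 0.5       # SD of litter-level deviations
sd_within = 1.0        # SD of pup-level deviations within a litter

litters = []
for _ in range(n_litters):
    # The litter effect is shared by every pup in the same litter
    litter_effect = random.gauss(0.0, sd_between)
    litters.append([6.0 + litter_effect + random.gauss(0.0, sd_within)
                    for _ in range(pups_per_litter)])

# Within-litter variance: average of the per-litter sample variances
var_within = statistics.mean(statistics.variance(l) for l in litters)

# The variance of the litter means equals
# (between-litter variance) + (within-litter variance / litter size)
var_of_means = statistics.variance([statistics.mean(l) for l in litters])
var_between = max(var_of_means - var_within / pups_per_litter, 0.0)

# Intraclass correlation: implied correlation between two pups in one litter
icc = var_between / (var_between + var_within)
print(f"between={var_between:.2f} within={var_within:.2f} ICC={icc:.2f}")
```

A non-zero intraclass correlation is the signal that pups from the same litter cannot be treated as independent observations.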

Similarly, our second clustered dataset concerns maths scores from students. Each student is the unit of analysis and has only a single maths score, so there are no repeated measurements. Nevertheless, there is structure here, as each student is clustered within a class, and classes are clustered within a school. Much like the rat pups, the assumption here is that students from the same class may be correlated, and that classes from the same school may be correlated as well. Perhaps some schools generally produce higher maths grades than other schools, and within those schools some classes also perform consistently better or worse. There are therefore *three levels* of variation here: students, classes and schools. Again, even without repeated measurement of the unit of analysis, an LME model can capture this structure.
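As a rough illustration of what three-level clustered data looks like in practice, the sketch below builds a tiny long-format table with one row per student and checks the two properties described above: every student has a single score, and every class sits inside exactly one school. The identifiers and scores are made up for the example.

```python
from collections import defaultdict

# Hypothetical long-format data: one row per student (the unit of analysis)
# Columns: school, class, student, maths score
rows = [
    ("S1", "S1-A", 1, 52.0),
    ("S1", "S1-A", 2, 61.0),
    ("S1", "S1-B", 3, 48.0),
    ("S1", "S1-B", 4, 55.0),
    ("S2", "S2-A", 5, 70.0),
    ("S2", "S2-A", 6, 66.0),
    ("S2", "S2-B", 7, 73.0),
    ("S2", "S2-B", 8, 64.0),
]

classes_in_school = defaultdict(set)   # level 3 -> level 2
school_of_class = defaultdict(set)     # used to check the nesting
scores_in_class = defaultdict(list)    # level 2 -> level 1

for school, class_id, student, score in rows:
    classes_in_school[school].add(class_id)
    school_of_class[class_id].add(school)
    scores_in_class[class_id].append(score)

# Nested structure: each class belongs to exactly one school
is_nested = all(len(schools) == 1 for schools in school_of_class.values())

# No repeated measurements: each student appears exactly once
student_ids = [student for _, _, student, _ in rows]
one_score_each = len(student_ids) == len(set(student_ids))

print(f"schools={len(classes_in_school)} classes={len(scores_in_class)} "
      f"students={len(student_ids)}")
print(f"nested={is_nested} one score per student={one_score_each}")
```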

### Repeated Measurements
The second type of data structure is *repeated measurements*. Obviously, we know a fair amount about these types of data by this point. However, as has been mentioned multiple times so far, there is a key difference between repeated measurements *with* replications and repeated measurements *without* replications. The power of LME models is only really available when we have replications, that is, when there are *multiple* observations per-subject and per-repeat. Without replications, we are confined to *random intercepts* and must assume a compound symmetric covariance structure. This is effectively a repeated measures ANOVA without any sphericity correction.
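To see why a random intercept implies compound symmetry, consider the following sketch. The notation is an assumption introduced here for illustration: $y_{ij}$ is the measurement for subject $i$ on repeat $j$, $\mu_j$ is the mean of repeat $j$, $b_i$ is the random intercept for subject $i$, and $\varepsilon_{ij}$ is the residual error, with $b_i$ and $\varepsilon_{ij}$ independent.

$$
y_{ij} = \mu_j + b_i + \varepsilon_{ij}, \qquad b_i \sim \mathcal{N}(0, \sigma_b^2), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Because $b_i$ is shared by every measurement from the same subject, it follows that

$$
\operatorname{Var}(y_{ij}) = \sigma_b^2 + \sigma^2, \qquad \operatorname{Cov}(y_{ij}, y_{ik}) = \sigma_b^2 \quad (j \neq k)
$$

So, for a subject with three repeats, the implied covariance matrix has equal variances on the diagonal and equal covariances everywhere else:

$$
\operatorname{Cov}(\mathbf{y}_i) =
\begin{bmatrix}
\sigma_b^2 + \sigma^2 & \sigma_b^2 & \sigma_b^2 \\
\sigma_b^2 & \sigma_b^2 + \sigma^2 & \sigma_b^2 \\
\sigma_b^2 & \sigma_b^2 & \sigma_b^2 + \sigma^2
\end{bmatrix}
$$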

For example, the repeated measurements dataset we will examine in this lesson concerns measurements of a chemical in the brains of rats under two different treatments. The treatments are defined *within-subject*, so each rat (the unit of analysis) gets *both*. Importantly, these measurements are collected from *three* different brain regions. Each rat therefore has 6 measurements, 3 for each treatment. If our interest is only in the effects of treatment, we can think of these as 3 replications per-repeat. Rather than averaging over these replications to create a mean per-rat and per-treatment, we can instead use the replications directly in the model. This then allows much greater scope to capture complex random-effects structures, as we will see later in this lesson.
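The difference between averaging over the replications and keeping them can be sketched with a small made-up table. The rat identifiers, treatment labels, region labels and values below are all hypothetical placeholders; only the 2 treatments by 3 regions layout comes from the description above.

```python
import random
from itertools import product
from statistics import mean

random.seed(0)

# Hypothetical labels (placeholders, not the names used in the real dataset)
rats = ["R1", "R2", "R3"]
treatments = ["treatment_1", "treatment_2"]
regions = ["region_1", "region_2", "region_3"]

# Long format WITH replications: one row per rat x treatment x region
long_rows = [
    {"rat": r, "treatment": t, "region": g, "value": random.gauss(100.0, 15.0)}
    for r, t, g in product(rats, treatments, regions)
]

# Averaging over regions collapses the data to one value per rat x treatment
cell_means = {}
for r, t in product(rats, treatments):
    values = [row["value"] for row in long_rows
              if row["rat"] == r and row["treatment"] == t]
    cell_means[(r, t)] = mean(values)

rows_per_rat = len(long_rows) // len(rats)
means_per_rat = len(cell_means) // len(rats)
print(f"with replications: {rows_per_rat} rows per rat")
print(f"after averaging:   {means_per_rat} rows per rat")
```

After averaging there is only one value per rat and treatment, which is the situation without replications described earlier. Keeping all 6 rows per rat retains the information needed for richer random-effects structures.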

### Longitudinal
The third type of data structure is *longitudinal*. This is very similar to repeated measurements in the sense that each unit of analysis is measured *multiple times*. The difference is that these measurements are both *spaced* over a longer period of time and have a *natural order*. Repeated measurements are typically collected close together in time and the order of collection is largely immaterial. For instance, in the rat brain example above, the measurements were taken from each brain at the same point in time and it does not matter what order the brain regions were sampled in. Typically, this means that missing data is *less* common under repeated measurements, unless there is equipment failure or a subject decides to withdraw midway through collection.

*Longitudinal* data is spaced much further apart in time, typically by days, weeks, months or years. This has *two* consequences. Firstly, there is a natural ordering to the measurements that needs to be taken into account. If we are looking for patterns in the data, we want these to be defined in terms of *time*. Secondly, the gap of time between the measurements needs to be considered. If we take measurements at 1 year, 2 years and 5 years, we cannot treat the 1-year gap between the first two measurements the same as the 3-year gap between the last two. The amount of time passed needs to be in the model. This is not true of repeated measurements, where we can model the effects of the different levels without consideration of *order* or *measurement gap*.

Missing data is also much more common in longitudinal data, as you might imagine. This is where LME models have an advantage over other methods, thanks to *pooling* and *shrinkage* in the presence of missing values.
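The importance of putting the actual time in the model can be sketched with a small worked example. The outcome values here are hypothetical and are constructed to rise by exactly 2 units per year. The helper `ols_slope` is a plain least-squares slope written for this illustration, not a function from any modelling package.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Measurement occasions in years, as in the text above
times = [1, 2, 5]

# Hypothetical outcome that rises by exactly 2 units per year
outcome = [10 + 2 * t for t in times]

# Gaps between consecutive measurements are unequal
gaps = [later - earlier for earlier, later in zip(times, times[1:])]

# Coding the occasions as 0, 1, 2 ignores the unequal gaps
visit_index = list(range(len(times)))

slope_per_year = ols_slope(times, outcome)
slope_per_visit = ols_slope(visit_index, outcome)

# Change between consecutive visits looks uneven if the gaps are ignored
change_per_visit = [b - a for a, b in zip(outcome, outcome[1:])]

print(f"gaps in years: {gaps}")
print(f"slope using time in years: {slope_per_year:.2f}")
print(f"slope using visit index:   {slope_per_visit:.2f}")
print(f"change between visits:     {change_per_visit}")
```

When the occasions are treated as equally spaced, a perfectly steady rate of change looks like an acceleration between the second and third visits. Using the time in years recovers the constant rate.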

As an example of longitudinal data, we will examine a dataset of autistic children measured on a scale of *socialisation* at ages 2, 3, 5, 9 and 13 years. Notice that there are uneven gaps between these measurements, which need to be accommodated. In addition, there will almost certainly be missing data, considering the practical difficulties of getting autistic children to reliably return for measurement across 11 years of their lives. Importantly, we need to consider the *reasons* for missing data. If these are innocuous, such as the family moving house, then this is unfortunate but not disastrous. However, if the missingness speaks to some sort of *confound* then we are in trouble. For instance, if the reason why children do not return is *related* to poor socialisation skills, then this will create a bias, as the *worst* cases will disappear over time and it will appear as if socialisation is improving when the *opposite* might actually be true. This highlights the serious difficulties around longitudinal data and the inferences that we can draw.
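The bias described above can be demonstrated with a small simulation. This is only a sketch under stated assumptions: the scores, sample size and dropout probabilities are invented for illustration and have nothing to do with the real dataset. Each simulated child has a socialisation score that *never changes*, but children with lower scores are more likely to drop out before each later measurement.

```python
import random
from statistics import mean, median

random.seed(42)

ages = [2, 3, 5, 9, 13]  # measurement ages from the text

# Hypothetical cohort: each child's true score is constant over time
true_scores = [random.gauss(50.0, 10.0) for _ in range(2000)]
cutoff = median(true_scores)

# Assumed dropout probabilities before each follow-up (illustrative only)
p_drop_low = 0.40    # children below the baseline median
p_drop_high = 0.05   # children at or above the baseline median

remaining = list(true_scores)
observed_means = []
n_remaining = []
for wave, age in enumerate(ages):
    if wave > 0:
        # Dropout depends on the score itself, so the missingness is informative
        remaining = [
            s for s in remaining
            if random.random() > (p_drop_low if s < cutoff else p_drop_high)
        ]
    observed_means.append(mean(remaining))
    n_remaining.append(len(remaining))

for age, n, m in zip(ages, n_remaining, observed_means):
    print(f"age {age:>2}: n={n:>4}  observed mean={m:.1f}")
```

Even though no child's score changes, the observed mean rises across ages because the lowest-scoring children are progressively lost from the sample.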

### Clustered-Longitudinal

````{admonition} A Note on Time-series Data
:class: tip
As a point of clarity...
````