# Aligning

![](banner_aligning.jpg)

In [1]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)

## Introduction

Motivation, context, history, related topics ...

## Terms

## Data

Consider this pedagogical data.  Variables x1 and x2 are aligned at the same (date) resolution.

In [2]:
data = data.frame(date=ymd(c("2017-12-30","2017-12-31","2018-01-01","2018-01-02","2018-01-03","2018-01-04")),
                  x1=c(1000, 1100, 900, 950, 800, 1050),
                  x2=c(72, 86, 76, 80, 82, 74))
data

date,x1,x2
2017-12-30,1000,72
2017-12-31,1100,86
2018-01-01,900,76
2018-01-02,950,80
2018-01-03,800,82
2018-01-04,1050,74


Then consider this pedagogical data.  Variables x1 and x2 are not aligned at the same (date) resolution.  x1 has relatively coarse resolution (one observation per day).  x2 has relatively fine resolution (one observation per half-day).

In [3]:
data.1 = data.frame(date=ymd(c("2017-12-30","2017-12-31","2018-01-01","2018-01-02","2018-01-03","2018-01-04")),
                    x1=c(1000, 1100, 900, 950, 800, 1050))

data.2 = data.frame(date=ymd_h(c("2017-12-30 00","2017-12-30 12","2017-12-31 00","2017-12-31 12","2018-01-01 00","2018-01-01 12","2018-01-02 00","2018-01-02 12","2018-01-03 00","2018-01-03 12","2018-01-04 00","2018-01-04 12")),
                    x2=c(72, 56, 86, 60, 76, 63, 80, 68, 82, 59, 74, 61))

layout(data.1, data.2)

date,x1
2017-12-30,1000
2017-12-31,1100
2018-01-01,900
2018-01-02,950
2018-01-03,800
2018-01-04,1050

date,x2
2017-12-30 00:00:00,72
2017-12-30 12:00:00,56
2017-12-31 00:00:00,86
2017-12-31 12:00:00,60
2018-01-01 00:00:00,76
2018-01-01 12:00:00,63
2018-01-02 00:00:00,80
2018-01-02 12:00:00,68
2018-01-03 00:00:00,82
2018-01-03 12:00:00,59


## Align Variables by Contraction

Mark observations with sequence numbers.  The finer resolution table will adopt the coarser resolution table's sequence numbers, repeating each number for every fine observation that falls within the same coarse period.

In [4]:
data.1$step = 1:nrow(data.1)
data.1

date,x1,step
2017-12-30,1000,1
2017-12-31,1100,2
2018-01-01,900,3
2018-01-02,950,4
2018-01-03,800,5
2018-01-04,1050,6


In [5]:
data.2$step = sort(rep(1:nrow(data.1), 2))
data.2

date,x2,step
2017-12-30 00:00:00,72,1
2017-12-30 12:00:00,56,1
2017-12-31 00:00:00,86,2
2017-12-31 12:00:00,60,2
2018-01-01 00:00:00,76,3
2018-01-01 12:00:00,63,3
2018-01-02 00:00:00,80,4
2018-01-02 12:00:00,68,4
2018-01-03 00:00:00,82,5
2018-01-03 12:00:00,59,5
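
The cell above hard-codes two fine observations per coarse observation.  As a sketch, assuming the two tables cover the same calendar days, the sequence numbers could instead be looked up from the timestamps.  The names `d1`, `d2`, and `day`, and the use of base R date classes rather than the helpers loaded by `setup.R`, are choices made for this illustration only.

```r
# Illustrative sketch in base R; d1 and d2 are small stand-ins for data.1 and data.2.
d1 = data.frame(date=as.Date(c("2017-12-30","2017-12-31","2018-01-01")),
                x1=c(1000, 1100, 900))
d2 = data.frame(date=as.POSIXct(c("2017-12-30 00:00","2017-12-30 12:00",
                                  "2017-12-31 00:00","2017-12-31 12:00",
                                  "2018-01-01 00:00","2018-01-01 12:00"),
                                format="%Y-%m-%d %H:%M", tz="UTC"),
                x2=c(72, 56, 86, 60, 76, 63))

d1$step = 1:nrow(d1)

# Reduce each timestamp to its calendar day, then look up that day's step in d1.
day = format(d2$date, "%Y-%m-%d", tz="UTC")
d2$step = d1$step[match(day, format(d1$date, "%Y-%m-%d"))]
d2
```

A fine observation whose day does not appear in the coarse table would get a step of `NA`, which makes gaps visible before aggregating.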


Then aggregate the finer resolution observations to match the number of coarser resolution observations.  Choose an aggregator function that makes sense for the variable(s).

In this example, we aggregate by mean.

In [6]:
data.2.aggregated = aggregate(x2 ~ step, data.2, mean)
data.2.aggregated

step,x2
1,64.0
2,73.0
3,69.5
4,74.0
5,70.5
6,67.5
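
Which aggregator makes sense depends on what the variable measures, and this notebook does not say what x2 represents.  As a hedged illustration, the sketch below applies three candidate aggregators to the same stepped values.  The table `d2` is a stand-in for the first three steps of `data.2`.

```r
# Stand-in for the first three steps of data.2
d2 = data.frame(step=c(1, 1, 2, 2, 3, 3),
                x2=c(72, 56, 86, 60, 76, 63))

by.mean = aggregate(x2 ~ step, d2, mean)   # typical level over the period
by.max  = aggregate(x2 ~ step, d2, max)    # peak within the period
by.sum  = aggregate(x2 ~ step, d2, sum)    # total, suited to counts or amounts

# Side-by-side comparison of the three choices
data.frame(step=by.mean$step, mean=by.mean$x2, max=by.max$x2, sum=by.sum$x2)
```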


Then join the tables.

In [7]:
merge(data.1, data.2.aggregated, by="step")[,-1]

date,x1,x2
2017-12-30,1000,64.0
2017-12-31,1100,73.0
2018-01-01,900,69.5
2018-01-02,950,74.0
2018-01-03,800,70.5
2018-01-04,1050,67.5
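
By default, `merge` keeps only rows whose key appears in both tables, so a step missing from one table would be dropped silently.  The sketch below, using hypothetical stand-in tables `a` and `b`, shows how the `all.x` argument keeps every coarse observation and exposes gaps as `NA`.

```r
a = data.frame(step=1:3, x1=c(1000, 1100, 900))
b = data.frame(step=c(1, 3), x2=c(64.0, 69.5))   # step 2 is deliberately missing

inner = merge(a, b, by="step")               # default: drops step 2
left  = merge(a, b, by="step", all.x=TRUE)   # keeps step 2, with x2 = NA

nrow(inner)
left
```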


## Align Variables by Expansion

Duplicate the coarser resolution observations as necessary to match the number of finer resolution observations.  Mark observations with sequence numbers.  The coarser resolution table will adopt the finer resolution table's sequence numbers.

In [8]:
data.1.duplicated = data.1[sort(rep(1:nrow(data.1), 2)),]
data.1.duplicated$step = 1:nrow(data.2)

data.2$step = 1:nrow(data.2)

layout(data.1.duplicated, data.2)

date,x1,step
2017-12-30,1000,1
2017-12-30,1000,2
2017-12-31,1100,3
2017-12-31,1100,4
2018-01-01,900,5
2018-01-01,900,6
2018-01-02,950,7
2018-01-02,950,8
2018-01-03,800,9
2018-01-03,800,10

date,x2,step
2017-12-30 00:00:00,72,1
2017-12-30 12:00:00,56,2
2017-12-31 00:00:00,86,3
2017-12-31 12:00:00,60,4
2018-01-01 00:00:00,76,5
2018-01-01 12:00:00,63,6
2018-01-02 00:00:00,80,7
2018-01-02 12:00:00,68,8
2018-01-03 00:00:00,82,9
2018-01-03 12:00:00,59,10
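
The duplication above relies on there being exactly two fine observations per coarse observation.  As a sketch, `rep` with its `each` argument expresses the same row index directly, and a per-row `times` vector handles uneven counts.  The table `d1` and the `counts` vector are made up for this illustration.

```r
d1 = data.frame(date=as.Date(c("2017-12-30","2017-12-31","2018-01-01")),
                x1=c(1000, 1100, 900))

# Same index as sort(rep(1:nrow(d1), 2)): 1 1 2 2 3 3
idx.even = rep(1:nrow(d1), each=2)

# Uneven case: a different number of fine observations per coarse row
counts = c(2, 3, 1)
idx.uneven = rep(1:nrow(d1), times=counts)

# Duplicate rows with the even index, then number the result
d1.duplicated = d1[idx.even,]
d1.duplicated$step = 1:nrow(d1.duplicated)
d1.duplicated
```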


Then disaggregate the duplicated observations with a disaggregation function that makes sense for the variable(s).

In this example, we disaggregate by dividing by 2, which splits each daily value evenly across its two half-day observations.

In [9]:
data.1.duplicated$x1 = data.1.duplicated$x1 / 2
fmt(data.1.duplicated, NA)

date,x1,step
2017-12-30,500,1
2017-12-30,500,2
2017-12-31,550,3
2017-12-31,550,4
2018-01-01,450,5
2018-01-01,450,6
2018-01-02,475,7
2018-01-02,475,8
2018-01-03,400,9
2018-01-03,400,10
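
An even split suits a variable that is a total over the period, because the pieces add back up to the original value.  A variable that is a level or a rate would usually be carried over unchanged instead.  Which case x1 falls under is not stated here, so the following is only a sketch; the table `d1`, the column names, and the factor `k` are stand-ins.

```r
d1 = data.frame(step.coarse=1:3, x1=c(1000, 1100, 900))
k = 2   # fine observations per coarse observation (assumed constant)

# Duplicate each coarse row k times
d1.dup = d1[rep(1:nrow(d1), each=k),]

d1.dup$x1.total = d1.dup$x1 / k   # split a total evenly across the k pieces
d1.dup$x1.level = d1.dup$x1       # carry a level forward unchanged

# Check: re-aggregating the split totals by sum recovers the original values
check = aggregate(x1.total ~ step.coarse, d1.dup, sum)
check
```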


Then join the tables.

In [10]:
merge(data.1.duplicated[-1], data.2, by="step")[,c(3,2,4)]

date,x1,x2
2017-12-30 00:00:00,500,72
2017-12-30 12:00:00,500,56
2017-12-31 00:00:00,550,86
2017-12-31 12:00:00,550,60
2018-01-01 00:00:00,450,76
2018-01-01 12:00:00,450,63
2018-01-02 00:00:00,475,80
2018-01-02 12:00:00,475,68
2018-01-03 00:00:00,400,82
2018-01-03 12:00:00,400,59


## Code

### Useful Functions

In [11]:
# help(aggregate) # from stats package
# help(merge)     # from base package
# help(rep)       # from base package
# help(sort)      # from base package


## Expectations

Know about this:
* How to align variables of differing resolutions, both conceptually and in R

## Further Reading

Further reading coming soon ...

<p style="text-align:left; font-size:10px;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float:right;">
Document revised July 17, 2020
</span>
</p>