# Combine

The e-Lab studies use the `Combine.pl` script for workflows that feature multiple input files.  As you might guess from its name, this script combines data from two (or more) files into a single dataset in a single file.  This can be useful when we want to examine data from an extended time period, one so long that its data had to be broken into smaller files for upload and storage, or when we want to examine data from more than one detector.

## Using Combine.pl

To use the data transformation script `Combine.pl`, we provide it with any number of input files followed by what we want it to name the output file it creates:

`$ perl ./perl/Combine.pl <input file 1> <input file 2> ... <input file N> <output file>`

where the items in angle brackets `<>` are parameters we have to specify.  These are:

* `input file`:  The name of a file to be used as input; we can specify as many as we like
* `output file`: What we want to name the output file that the script will write its results to
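
The argument order described above can be sketched in Python. This is only an illustration of how the command line is laid out: `build_combine_command` is a hypothetical helper, not part of the e-Lab code, and the script path is assumed to be `./perl/Combine.pl` as in the usage line.

```python
def build_combine_command(inputs, output, script="./perl/Combine.pl"):
    """Return the argument list for one Combine.pl run (hypothetical helper)."""
    if not inputs:
        raise ValueError("Combine.pl needs at least one input file")
    # Input files come first; the output file name is always the last argument.
    return ["perl", script, *inputs, output]


cmd = build_combine_command(
    ["files/6148.2018.0602.0.thresh", "files/6148.2018.0603.0.thresh"],
    "outputs/combineOut-6148",
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` to run the script from Python instead of from the shell.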

### Combining data over time

QuarkNet's cosmic ray muon detectors can take data continuously, and it's not unusual for users to run their detectors for days at a time.  For example, the three Threshold files

`6148.2018.0602.0.thresh`

`6148.2018.0603.0.thresh`

and

`6148.2018.0604.0.thresh`

contain data gathered over the 3-day period from June 2 to June 4, 2018.  The Cosmic Ray e-Lab breaks such data up into individual days for the sake of organization and to keep files from becoming unmanageably large.  When running studies on the data, though, we may want to consider the entire run as a single dataset.  The `Combine` data transformation lets us do that.

As always, we'll start by examining the input data before we apply any transformation to it.  Look at the header and first 5 data lines of each of these three files, followed by the lines at the end of each:

In [4]:
!head -8 files/6148.2018.0602.0.thresh ; echo "..." ; tail -5 files/6148.2018.0602.0.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.2	2458271	0.5001851099136574	0.5001851099139322	23.74	4321599349654000	4321599349656374
6148.2	2458271	0.5001851099139611	0.5001851099141927	20.01	4321599349656624	4321599349658625
6148.3	2458271	0.5001851099140625	0.5001851099144532	33.76	4321599349657500	4321599349660876
6148.2	2458271	0.5002396629087673	0.5002396629089699	17.51	4322070687531750	4322070687533500
6148.3	2458271	0.5002396629091435	0.5002396629093895	21.26	4322070687535000	4322070687537125
...
6148.4	2458272	0.4996444257415654	0.4996444257419126	30.00	4316927838407125	4316927838410125
6148.2	2458272	0.4997058953934896	0.4997058953937355	21.24	4317458936199750	4317458936201875
6148.3	2458272	0.4997058953939526	0.4997058953940539	8.75	4317458936203750	4317458936204625
6148.2	2458272	0.4998563698267506	0.4998563698271123	31.25	4318759035303125	4318759035306250


In [5]:
!head -8 files/6148.2018.0603.0.thresh ; echo "..." ; tail -5 files/6148.2018.0603.0.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.2	2458272	0.5002601237136719	0.5002601237139034	20.00	4322247468886125	4322247468888125
6148.3	2458272	0.5002601237140480	0.5002601237144098	31.25	4322247468889375	4322247468892500
6148.2	2458272	0.5002869503800058	0.5002869503802229	18.75	4322479251283250	4322479251285125
6148.3	2458272	0.5002869503804398	0.5002869503808015	31.25	4322479251287000	4322479251290125
6148.2	2458272	0.5003048243718605	0.5003048243721354	23.75	4322633682572875	4322633682575250
...
6148.3	2458273	0.4998297218178096	0.4998297218180411	20.00	4318528796505875	4318528796507875
6148.2	2458273	0.4999118278728298	0.4999118278731626	28.75	4319238192821250	4319238192824125
6148.2	2458273	0.4999118278731915	0.4999118278733796	16.25	4319238192824374	4319238192826000
6148.4	2458273	0.4999118278733073	0.4999118278734520	12.50	4319238192825375	4319238192826625

In [6]:
!head -8 files/6148.2018.0604.0.thresh ; echo "..." ; tail -5 files/6148.2018.0604.0.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.2	2458273	0.5003691139060619	0.5003691139063513	25.01	4323189144148375	4323189144150875
6148.3	2458273	0.5003691139064960	0.5003691139068287	28.75	4323189144152125	4323189144155000
6148.2	2458273	0.5004505377767506	0.5004505377769820	19.99	4323892646391125	4323892646393124
6148.3	2458273	0.5004505377772280	0.5004505377775318	26.24	4323892646395250	4323892646397875
6148.2	2458273	0.5004672020113861	0.5004672020117332	30.00	4324036625378375	4324036625381375
...
6148.3	2458274	0.4997994782379630	0.4997994782382378	23.75	4318267491976000	4318267491978375
6148.1	2458274	0.4998511005815972	0.4998511005818142	18.75	4318713509025000	4318713509026875
6148.3	2458274	0.4998511005818432	0.4998511005820891	21.25	4318713509027125	4318713509029250
6148.2	2458274	0.4999579281848959	0.4999579281853009	34.99	4319636499517500	4319636499521000

In the clock system the e-Lab uses to record rising and falling edge times, each time is stored as a fraction of a [Julian day](https://en.wikipedia.org/wiki/Julian_day).  Julian days begin at noon [Coordinated Universal Time](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) (UTC), so the value `0.5` represents midnight UTC.  The first file (`6148.2018.0602.0.thresh`) therefore begins just after UTC midnight on Julian day 2458271, which started at noon on June 1, 2018 (here's a [Julian day converter](http://www.onlineconversion.com/julian_date.htm) for the curious), and it ends just before UTC midnight on Julian day 2458272, which started at noon on June 2, 2018.  The next two files continue this pattern into the next two days.

*Challenge: explain the mismatch between the dates in the file names and the Julian day dates.  Hint: Julian days run noon-to-noon UTC, while each of these files covers one calendar day, midnight-to-midnight UTC.*
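
To check dates like these without an online converter, here is a minimal Python sketch. It assumes the second column of a threshold file is a Julian day number and the third is a fraction of that day, and it ignores leap seconds. The function name `julian_to_utc` is made up for this example; the reference point used is Julian day 2451545, which begins at noon on January 1, 2000.

```python
from datetime import datetime, timedelta, timezone

# Reference point: Julian day 2451545 begins at noon on January 1, 2000.
REFERENCE_JD = 2451545
REFERENCE_TIME = datetime(2000, 1, 1, 12, tzinfo=timezone.utc)


def julian_to_utc(julian_day, day_fraction=0.0):
    """Convert a Julian day number plus a fraction of a day to a UTC datetime."""
    whole_days = timedelta(days=julian_day - REFERENCE_JD)
    return REFERENCE_TIME + whole_days + timedelta(days=day_fraction)


print(julian_to_utc(2458271))       # 2018-06-01 12:00:00+00:00
print(julian_to_utc(2458271, 0.5))  # 2018-06-02 00:00:00+00:00
```

A day fraction of `0.5` lands on midnight UTC at the start of June 2, which matches the date in the name of the first file.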

We can also check the number of lines in each file using the UNIX utility `wc` (short for "word count") with the `-l` flag to count lines instead of words:

In [8]:
!wc -l files/6148.2018.0602.0.thresh

23055 files/6148.2018.0602.0.thresh


In [9]:
!wc -l files/6148.2018.0603.0.thresh

23468 files/6148.2018.0603.0.thresh


In [10]:
!wc -l files/6148.2018.0604.0.thresh

23449 files/6148.2018.0604.0.thresh


### Applying the data transformation

We'll use `Combine` to join these three test files into a single dataset, calling it `combineOut-6148`:

In [11]:
!perl ./perl/Combine.pl files/6148.2018.0602.0.thresh files/6148.2018.0603.0.thresh files/6148.2018.0604.0.thresh outputs/combineOut-6148

Now we'll examine the output file the same way we did for the input files:

In [12]:
!head -8 outputs/combineOut-6148; echo "..." ; tail -5 outputs/combineOut-6148

#16457cb3f02c44fe538b5c73ef984126
#md5_hex(1528996872 1529340168 1529340168 1529340169  files/6148.2018.0602.0.thresh files/6148.2018.0603.0.thresh files/6148.2018.0604.0.thresh)
#Combined data for files: files/6148.2018.0602.0.thresh files/6148.2018.0603.0.thresh files/6148.2018.0604.0.thresh 
6148.2	2458271	0.5001851099136574	0.5001851099139322	23.74	4321599349654000	4321599349656374
6148.2	2458271	0.5001851099139611	0.5001851099141927	20.01	4321599349656624	4321599349658625
6148.3	2458271	0.5001851099140625	0.5001851099144532	33.76	4321599349657500	4321599349660876
6148.2	2458271	0.5002396629087673	0.5002396629089699	17.51	4322070687531750	4322070687533500
6148.3	2458271	0.5002396629091435	0.5002396629093895	21.26	4322070687535000	4322070687537125
...
6148.3	2458274	0.4997994782379630	0.4997994782382378	23.75	4318267491976000	4318267491978375
6148.1	2458274	0.4998511005815972	0.4998511005818142	18.75	4318713509025000	4318713509026875
6148.3	2458274	0.4998511005818432	0.49

We can see at a glance that the first five data lines of the output are the first five data lines of the first input file, and that the output ends with the same data lines as the last input file.  That's what we'd expect for a transformation that does nothing to the data but concatenate files together.

What's more, we can check the line count of the output file:

In [13]:
!wc -l outputs/combineOut-6148

69966 outputs/combineOut-6148


Is this number what you expect it to be?
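
One way to work out the expected number is sketched below in Python. It assumes that every input file starts with three `#` header lines, as the `head` output above shows, and that `Combine.pl` writes a single three-line header of its own in their place. The variable names are only illustrative.

```python
HEADER_LINES = 3  # each file shown above starts with three '#' lines

# Line counts reported by `wc -l` for the three input files.
input_line_counts = {
    "files/6148.2018.0602.0.thresh": 23055,
    "files/6148.2018.0603.0.thresh": 23468,
    "files/6148.2018.0604.0.thresh": 23449,
}

total_input_lines = sum(input_line_counts.values())

# Drop each input's header, then add back the single combined header.
data_lines = total_input_lines - HEADER_LINES * len(input_line_counts)
expected_output_lines = data_lines + HEADER_LINES

print(total_input_lines)      # 69972
print(expected_output_lines)  # 69966
```

Under those assumptions the expected count agrees with the `wc -l` result for `outputs/combineOut-6148`.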

### Combining data from multiple detectors

---

Now we'll see what happens when we run two short test files, each from a different detector, through `Combine.pl` using

`$ perl ./perl/Combine.pl test_data/6119.2016.0104.1.test.thresh test_data/6203.2016.0104.1.test.thresh test_data/combineOut`

The file `combineOut` gets created in the `test_data/` directory.  Before we try to `cat` it, let's see how many lines it has:

In [5]:
!wc -l test_data/combineOut

17 test_data/combineOut


Not so bad.  Let's take a look:

In [6]:
!cat test_data/combineOut

#344f56cc2ab825588ae1315357ab3096
#md5_hex(1528996872 1530043861 1530043909  test_data/6119.2016.0104.1.test.thresh test_data/6203.2016.0104.1.test.thresh)
#Combined data for files: test_data/6119.2016.0104.1.test.thresh test_data/6203.2016.0104.1.test.thresh 
6119.1	2457392	0.3721863017828993	0.3721863017831598	22.50	3215689647404250	3215689647406500
6119.3	2457392	0.3721863017829138	0.3721863017831598	21.25	3215689647404375	3215689647406500
6119.2	2457392	0.3721885846820747	0.3721885846822772	17.50	3215709371653125	3215709371654875
6119.4	2457392	0.3721885846820747	0.3721885846822917	18.75	3215709371653125	3215709371655000
6119.4	2457392	0.3721901866161603	0.3721901866163773	18.75	3215723212363625	3215723212365500
6119.1	2457392	0.3721901866161748	0.3721901866164496	23.75	3215723212363750	3215723212366125
6119.1	2457392	0.3721903650327546	0.3721903650329427	16.25	3215724753883000	3215724753884625
6203.1	2457392	0.2452114384916088	0.2452114384919415	28.75	2118626828567500	21

From the name `Combine.pl`, it's not hard to guess that the data transformation takes all of its input files and combines them into one file, and that's exactly what we see here.

Each of the input files was 10 lines long -- 3 lines of header, and 7 lines of data.  Here, we have a new header of 3 lines followed by 14 lines of data, for a total of 17 lines (just like `wc -l` gave us).  All of the data from the first input file, `test_data/6119.2016.0104.1.test.thresh`, comes first, followed by all of the data from the second input file, `test_data/6203.2016.0104.1.test.thresh`.  All the data is kept in the same order, and individual lines have not been altered: all of the columns, values, and decimal places from the original threshold files are the same.
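
The same check can be automated rather than done by eye. The Python sketch below is not how `Combine.pl` itself works; it only tests the behavior described above, under the assumption that header lines are exactly the lines beginning with `#`. The function names are hypothetical.

```python
def data_lines(path):
    """Return the lines of a threshold file that are not '#' header lines."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle if not line.startswith("#")]


def is_plain_concatenation(input_paths, output_path):
    """True if the output's data lines are the inputs' data lines, in order."""
    expected = []
    for path in input_paths:
        expected.extend(data_lines(path))
    return data_lines(output_path) == expected
```

For the run above, this could be called with the two `test_data/` input files as `input_paths` and `test_data/combineOut` as `output_path`; a result of `True` would confirm that no data line was altered, dropped, or reordered.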