*Instructors:*
*This Notebook uses the Combine.pl script of the Cosmic Ray e-Lab analyses to provide material for the following learning goals:*

* *Combining blocks of data to investigate different questions*
* *The 'distance = rate x time' formula*
* *Light Python programming to accomplish the above*
* *Understanding how measuring devices can affect the precision of scientific data*
* *Estimating reasonable expectations for quantitative results*

# Combine

Sarah is a data-minded track & field athlete who keeps quantitative records of her weekly training routines for the sprint, 5000 meter run, 100m hurdles, long jump, and shot put events.  She can combine this data for the past year to track her progress in each event and see where she has improved most, which tells her and her coach which training practices have been working and which haven't.

There are other ways Sarah could use this data, though.  For example, if other athletes record the same data for themselves, Sarah could combine her latest training data with theirs to see which events she's currently most competitive in, which tells her and her coach which events she should focus on for the next meet.

### Building blocks of data

More data is generally better than less data, especially in science.  That doesn't mean that we always want all possible data in one big chunk, though.  Storing and analyzing scientific data in smaller units gives us the flexibility to combine the data in different ways in order to investigate different questions.

The same is true of the cosmic ray data generated by QuarkNet's Cosmic Ray Muon Detectors (CRMD).  This data is stored in individual files, with each file representing data taken by a single detector over a time period of less than 24 hours.  For some studies -- say, investigating changes in cosmic ray flux as a function of fluctuations in atmospheric pressure -- we might want to look at data taken by a single detector over the course of several days.  In other studies -- say, searching for showers of cosmic rays caused by especially high-energy primary rays -- we might want to look at data taken by many detectors covering a wide geographic area, but only over the course of a few hours.

The e-Lab studies use the `Combine.pl` script for workflows that feature multiple input files.  As you might guess from its name, it combines data from two (or more) files into a single dataset that's contained in a single file.  This gives us the flexibility to examine the same set of raw data files in different ways: to look at extended time periods for one detector, or to look at data from more than one individual detector at a time.

## Using Combine.pl

To use the data transformation script `Combine.pl`, we provide it with any number of input files followed by what we want it to name the output file it creates:

`$ perl ./perl/Combine.pl <input file 1> <input file 2> ... <input file N> <output file>`

where the items in angled brackets `<>` are parameters we have to specify.  These are:

* `input file`:  The name of a file to be used as input; we can specify as many as we like
* `output file`: What we want to name the output file that the script will write its results to
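
Since this notebook runs shell commands from Python cells, the same command line can also be assembled in Python.  The following is a minimal sketch, not part of the e-Lab code: `build_combine_command` is a hypothetical helper, `combineOut-example` is a made-up output name, and the script path is the one from the usage line above.  It only builds the argument list; handing that list to `subprocess.run` would actually execute it.

```python
from typing import List, Sequence


def build_combine_command(input_files: Sequence[str], output_file: str,
                          script: str = "./perl/Combine.pl") -> List[str]:
    """Return the argument list for one Combine.pl call (hypothetical helper)."""
    if not input_files:
        raise ValueError("Combine needs at least one input file")
    # Input files come first, in order; the output file name is always last.
    return ["perl", script, *input_files, output_file]


command = build_combine_command(
    ["files/6148.2018.0602.0.thresh", "files/6148.2018.0603.0.thresh"],
    "outputs/combineOut-example",  # made-up output name for illustration
)
print(" ".join(command))
```

Keeping the output name as a separate argument makes it harder to accidentally list it among the inputs.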

## Combining data over time

QuarkNet's cosmic ray muon detectors can take data continuously, and it's not unusual for users to run their detectors for days at a time.  For example, DAQ 6148 generated the three Threshold files

`6148.2018.0602.0.thresh`

`6148.2018.0603.0.thresh`

and

`6148.2018.0604.0.thresh`

over the 3-day period from June 2 to June 4, 2018.  The Cosmic Ray e-Lab breaks such data up into individual days for the sake of organization and to keep files from becoming unmanageably large.  When running studies on the data, though, we may want to consider the entire run all at once as a single dataset.  The `Combine` data transformation lets us do that.
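
When a run spans several days, it helps to put the daily files in date order before combining them.  Here is a minimal sketch that assumes the file names follow the pattern seen above, `<DAQ ID>.<year>.<month+day>.<index>.thresh`.  That layout is inferred from the three example names, the meaning of the fourth field is not stated in this notebook, and `parse_threshold_name` is a hypothetical helper.

```python
from datetime import date
from typing import NamedTuple


class ThresholdFileName(NamedTuple):
    detector: str   # DAQ ID, e.g. "6148"
    day: date       # calendar date encoded in the name
    index: str      # fourth field; its meaning is an open assumption here


def parse_threshold_name(name: str) -> ThresholdFileName:
    """Split a name like '6148.2018.0602.0.thresh' into its parts (assumed layout)."""
    detector, year, month_day, index, extension = name.split(".")
    if extension != "thresh":
        raise ValueError(f"not a Threshold file name: {name}")
    day = date(int(year), int(month_day[:2]), int(month_day[2:]))
    return ThresholdFileName(detector, day, index)


names = ["6148.2018.0604.0.thresh", "6148.2018.0602.0.thresh", "6148.2018.0603.0.thresh"]
# Sort by (detector, date) so a multi-day run reaches Combine in time order.
ordered = sorted(names, key=lambda n: parse_threshold_name(n)[:2])
print(ordered)
```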

As always, we'll start by examining the input data before we apply any transformation to it.  The following UNIX shell command will let us look at the header lines plus the first 5 and last 5 data lines of each of these three files:

In [4]:
!head -8 files/6148.2018.0602.0.thresh ; echo "..." ; tail -5 files/6148.2018.0602.0.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.2	2458271	0.5001851099136574	0.5001851099139322	23.74	4321599349654000	4321599349656374
6148.2	2458271	0.5001851099139611	0.5001851099141927	20.01	4321599349656624	4321599349658625
6148.3	2458271	0.5001851099140625	0.5001851099144532	33.76	4321599349657500	4321599349660876
6148.2	2458271	0.5002396629087673	0.5002396629089699	17.51	4322070687531750	4322070687533500
6148.3	2458271	0.5002396629091435	0.5002396629093895	21.26	4322070687535000	4322070687537125
...
6148.4	2458272	0.4996444257415654	0.4996444257419126	30.00	4316927838407125	4316927838410125
6148.2	2458272	0.4997058953934896	0.4997058953937355	21.24	4317458936199750	4317458936201875
6148.3	2458272	0.4997058953939526	0.4997058953940539	8.75	4317458936203750	4317458936204625
6148.2	2458272	0.4998563698267506	0.4998563698271123	31.25	4318759035303125	4318759035306250
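
Each data line in the output above carries the seven fields named in the header line.  As a minimal sketch, assuming the fields are separated by whitespace, one line can be parsed in Python like this; `ThresholdRecord` and `parse_threshold_line` are hypothetical names, not part of the e-Lab code.

```python
from typing import NamedTuple


class ThresholdRecord(NamedTuple):
    channel_id: str               # "ID.CHANNEL", e.g. "6148.2"
    julian_day: int
    rising_edge: float            # stored as a fraction of the day
    falling_edge: float           # stored as a fraction of the day
    time_over_threshold_ns: float
    rising_edge_int: int
    falling_edge_int: int


def parse_threshold_line(line: str) -> ThresholdRecord:
    """Parse one whitespace-separated data line of a Threshold file."""
    f = line.split()
    return ThresholdRecord(f[0], int(f[1]), float(f[2]), float(f[3]),
                           float(f[4]), int(f[5]), int(f[6]))


sample = ("6148.2\t2458271\t0.5001851099136574\t0.5001851099139322\t"
          "23.74\t4321599349654000\t4321599349656374")
record = parse_threshold_line(sample)
# Integer edges can be subtracted without floating-point rounding.
print(record.falling_edge_int - record.rising_edge_int)   # 2374
print(record.time_over_threshold_ns)                      # 23.74
```

For this sample line, the difference between the two integer edges is 100 times the time over threshold in nanoseconds, which hints at the resolution of the integer clock values.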


In [5]:
!head -8 files/6148.2018.0603.0.thresh ; echo "..." ; tail -5 files/6148.2018.0603.0.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.2	2458272	0.5002601237136719	0.5002601237139034	20.00	4322247468886125	4322247468888125
6148.3	2458272	0.5002601237140480	0.5002601237144098	31.25	4322247468889375	4322247468892500
6148.2	2458272	0.5002869503800058	0.5002869503802229	18.75	4322479251283250	4322479251285125
6148.3	2458272	0.5002869503804398	0.5002869503808015	31.25	4322479251287000	4322479251290125
6148.2	2458272	0.5003048243718605	0.5003048243721354	23.75	4322633682572875	4322633682575250
...
6148.3	2458273	0.4998297218178096	0.4998297218180411	20.00	4318528796505875	4318528796507875
6148.2	2458273	0.4999118278728298	0.4999118278731626	28.75	4319238192821250	4319238192824125
6148.2	2458273	0.4999118278731915	0.4999118278733796	16.25	4319238192824374	4319238192826000
6148.4	2458273	0.4999118278733073	0.4999118278734520	12.50	4319238192825375	4319238192826625

In [6]:
!head -8 files/6148.2018.0604.0.thresh ; echo "..." ; tail -5 files/6148.2018.0604.0.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.2	2458273	0.5003691139060619	0.5003691139063513	25.01	4323189144148375	4323189144150875
6148.3	2458273	0.5003691139064960	0.5003691139068287	28.75	4323189144152125	4323189144155000
6148.2	2458273	0.5004505377767506	0.5004505377769820	19.99	4323892646391125	4323892646393124
6148.3	2458273	0.5004505377772280	0.5004505377775318	26.24	4323892646395250	4323892646397875
6148.2	2458273	0.5004672020113861	0.5004672020117332	30.00	4324036625378375	4324036625381375
...
6148.3	2458274	0.4997994782379630	0.4997994782382378	23.75	4318267491976000	4318267491978375
6148.1	2458274	0.4998511005815972	0.4998511005818142	18.75	4318713509025000	4318713509026875
6148.3	2458274	0.4998511005818432	0.4998511005820891	21.25	4318713509027125	4318713509029250
6148.2	2458274	0.4999579281848959	0.4999579281853009	34.99	4319636499517500	4319636499521000

In the clock system the e-Lab uses to record rising and falling edge times, the value `0.5` represents midnight in [Coordinated Universal Time](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) (UTC), because a [Julian day](https://en.wikipedia.org/wiki/Julian_day) begins at noon UTC.  So, the first file (`6148.2018.0602.0.thresh`) begins just after UTC midnight, halfway through Julian day 2458271, which started at noon on June 1, 2018 (here's a [Julian day converter](http://www.onlineconversion.com/julian_date.htm) for the curious), and it ends just before UTC midnight, halfway through Julian day 2458272, which started at noon on June 2, 2018.  The next two files continue this pattern into the next two days.

*Challenge: explain the mismatch in dates.  The likely reason is that Julian days run noon-to-noon UTC, while the e-Lab files run midnight-to-midnight UTC, so each file covers the second half of one Julian day and the first half of the next.*
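
The Julian day arithmetic can be checked in Python.  This is a minimal sketch, assuming that the third column is the fraction of the Julian day that has elapsed; it relies on the standard fact that Julian date 2440587.5 is midnight UTC on January 1, 1970.  `julian_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Julian date 2440587.5 is 1970-01-01 00:00 UTC (the Unix epoch).
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
UNIX_EPOCH_JULIAN_DAY = 2440587
UNIX_EPOCH_DAY_FRACTION = 0.5


def julian_to_utc(julian_day: int, day_fraction: float) -> datetime:
    """Convert a Julian day number plus a day fraction to a UTC datetime."""
    whole_days = timedelta(days=julian_day - UNIX_EPOCH_JULIAN_DAY)
    part_of_day = timedelta(days=day_fraction - UNIX_EPOCH_DAY_FRACTION)
    return UNIX_EPOCH + whole_days + part_of_day


print(julian_to_utc(2458271, 0.0))                  # start of the Julian day
print(julian_to_utc(2458271, 0.5))                  # 2018-06-02 00:00:00+00:00
print(julian_to_utc(2458271, 0.5001851099136574))   # first event in the file
```

Keeping the whole days and the day fraction separate avoids losing precision by adding a small fraction to a seven-digit day number.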

We can also check the number of lines in each file using the UNIX utility `wc` (short for "word count") with the `-l` flag to count lines instead of words:

In [8]:
!wc -l files/6148.2018.0602.0.thresh

23055 files/6148.2018.0602.0.thresh


In [9]:
!wc -l files/6148.2018.0603.0.thresh

23468 files/6148.2018.0603.0.thresh


In [10]:
!wc -l files/6148.2018.0604.0.thresh

23449 files/6148.2018.0604.0.thresh


### Applying the data transformation

We'll use `Combine` to join these three test files into a single dataset, calling it `combineOut-6148`:

In [11]:
!perl ./perl/Combine.pl files/6148.2018.0602.0.thresh files/6148.2018.0603.0.thresh files/6148.2018.0604.0.thresh outputs/combineOut-6148

Now we'll examine the output file the same way we did for the input files:

In [12]:
!head -8 outputs/combineOut-6148; echo "..." ; tail -5 outputs/combineOut-6148

#16457cb3f02c44fe538b5c73ef984126
#md5_hex(1528996872 1529340168 1529340168 1529340169  files/6148.2018.0602.0.thresh files/6148.2018.0603.0.thresh files/6148.2018.0604.0.thresh)
#Combined data for files: files/6148.2018.0602.0.thresh files/6148.2018.0603.0.thresh files/6148.2018.0604.0.thresh 
6148.2	2458271	0.5001851099136574	0.5001851099139322	23.74	4321599349654000	4321599349656374
6148.2	2458271	0.5001851099139611	0.5001851099141927	20.01	4321599349656624	4321599349658625
6148.3	2458271	0.5001851099140625	0.5001851099144532	33.76	4321599349657500	4321599349660876
6148.2	2458271	0.5002396629087673	0.5002396629089699	17.51	4322070687531750	4322070687533500
6148.3	2458271	0.5002396629091435	0.5002396629093895	21.26	4322070687535000	4322070687537125
...
6148.3	2458274	0.4997994782379630	0.4997994782382378	23.75	4318267491976000	4318267491978375
6148.1	2458274	0.4998511005815972	0.4998511005818142	18.75	4318713509025000	4318713509026875
6148.3	2458274	0.4998511005818432	0.49

We can see at a glance that the first five data lines of the output are the first five data lines of the first input file, and the last five data lines of the output are the same as the last five data lines of the last input file.  That's what we'd expect for a transformation that does nothing but concatenate files together.

What's more, we can check the line count of the output file:

In [13]:
!wc -l outputs/combineOut-6148

69966 outputs/combineOut-6148


Is this number what you expect it to be?

Note that it's almost, but not quite, equal to the sum of the line counts of each of the individual input files,

```
23055 + 23468 + 23449 = 69972
```

Why are the two counts not exactly equal?
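
One way to investigate is to count the header lines (those starting with `#`) separately from the data lines in each file.  The sketch below is one possible approach, not part of the e-Lab code; `count_header_and_data_lines` is a hypothetical helper, and it writes a tiny stand-in file so that it runs without the real data.

```python
import os
import tempfile
from typing import Tuple


def count_header_and_data_lines(path: str) -> Tuple[int, int]:
    """Return (header_lines, data_lines); header lines start with '#'."""
    header = data = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):
                header += 1
            else:
                data += 1  # note: blank lines would be counted here too
    return header, data


# Tiny stand-in file so the sketch runs anywhere; swap in a real path such as
# "files/6148.2018.0602.0.thresh" or "outputs/combineOut-6148" when available.
with tempfile.NamedTemporaryFile("w", suffix=".thresh", delete=False) as tmp:
    tmp.write("#$md5\n#md5_hex(0)\n#stand-in column header\n")
    tmp.write("6148.2\t2458271\t0.5001851099136574\t0.5001851099139322\t"
              "23.74\t4321599349654000\t4321599349656374\n")
    demo_path = tmp.name

demo_counts = count_header_and_data_lines(demo_path)
os.remove(demo_path)
print(demo_counts)   # (3, 1)
```

Running it on the three input files and on the output file, then comparing the header totals, should point to where the difference of 69972 - 69966 = 6 lines comes from.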

## Combining data over place

The `Combine` data transformation script also allows us to join data from two different sources into a single dataset.  For example, we may wish to look at data taken at the same time by two nearby detectors to see if there are any correlations in muon events between them.

We'll try that with the following two files containing data taken by detectors 6148 and 6478 at Fermilab on May 7, 2017:

`6148.2017.0507.0.thresh`

`6478.2017.0507.0.thresh`

First, we'll do the same trick with the UNIX `head` and `tail` commands to see the beginning and end of each file.

In [14]:
!head -8 files/6148.2017.0507.0.thresh; echo "..." ; tail -5 files/6148.2017.0507.0.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.1	2457880	0.5017287926934751	0.5017287926938802	34.99	4334936768871625	4334936768875125
6148.1	2457880	0.5017287926939091	0.5017287926941262	18.75	4334936768875375	4334936768877250
6148.4	2457880	0.5017287926940828	0.5017287926943577	23.75	4334936768876876	4334936768879250
6148.2	2457880	0.5030325167516059	0.5030325167520978	42.50	4346200944733875	4346200944738125
6148.1	2457880	0.5030325167519096	0.5030325167522569	30.00	4346200944736500	4346200944739500
...
6148.2	2457881	0.4991468762158276	0.4991468762161170	25.00	4312629010504750	4312629010507250
6148.1	2457881	0.4991468762158854	0.4991468762162615	32.49	4312629010505250	4312629010508500
6148.2	2457881	0.4996565376225549	0.4996565376225984	3.76	4317032485058874	4317032485059250
6148.3	2457881	0.4996565376228299	0.4996565376230469	18.75	4317032485061250	4317032485063125


In [18]:
!head -8 files/6478.2017.0507.0.thresh; echo "..." ; tail -5 files/6478.2017.0507.0.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6478.2	2457880	0.5801056986752026	0.5801056986755063	26.24	5012113236553750	5012113236556375
6478.1	2457880	0.5801056986751737	0.5801056986756077	37.50	5012113236553501	5012113236557250
6478.3	2457880	0.5801059826235389	0.5801059826239294	33.75	5012115689867375	5012115689870750
6478.4	2457880	0.5801059826234953	0.5801059826238426	30.00	5012115689867000	5012115689870000
6478.4	2457880	0.5801063615076100	0.5801063615078559	21.25	5012118963425750	5012118963427875
...
6478.4	2457881	0.1316323887125434	0.1316323887129051	31.25	1137303838476375	1137303838479500
6478.1	2457881	0.1316350724972946	0.1316350724976128	27.50	1137327026376625	1137327026379375
6478.2	2457881	0.1316350724973235	0.1316350724975984	23.75	1137327026376875	1137327026379250
6478.2	2457881	0.1316383624849537	0.1316383624852720	27.50	1137355451870000	1137355451872750

We can see that the first file, `6148.2017.0507.0.thresh`, covers the whole UTC day midnight-to-midnight (equivalent to the span from the midpoint of Julian day 2457880 to the midpoint of Julian day 2457881).  The second file, `6478.2017.0507.0.thresh`, covers a time period fully within that of the first.
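
The overlap can be checked by comparing the timestamps directly.  This is a minimal sketch that takes the first and last lines shown above as approximate endpoints of each file; `contains` is a hypothetical helper.

```python
# (Julian day, day fraction) pairs copied from the first and last lines shown above.
span_6148 = ((2457880, 0.5017287926934751), (2457881, 0.4996565376228299))
span_6478 = ((2457880, 0.5801056986752026), (2457881, 0.1316383624849537))


def contains(outer, inner) -> bool:
    """True if the inner (start, end) span lies within the outer span."""
    # Tuples compare element by element: Julian day first, then day fraction.
    return outer[0] <= inner[0] and inner[1] <= outer[1]


print(contains(span_6148, span_6478))   # True
print(contains(span_6478, span_6148))   # False
```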

For completeness, we'll take a look at the line counts:

In [16]:
!wc -l files/6148.2017.0507.0.thresh

4424 files/6148.2017.0507.0.thresh


In [19]:
!wc -l files/6478.2017.0507.0.thresh

7776 files/6478.2017.0507.0.thresh


### Applying the data transformation

We feed these two input files into the data transformation script `Combine`, calling the output file `combineOut-05072017`:

In [20]:
!perl ./perl/Combine.pl files/6148.2017.0507.0.thresh files/6478.2017.0507.0.thresh outputs/combineOut-05072017

And examine the result:

In [24]:
!head -13 outputs/combineOut-05072017; echo "..." ; tail -10 outputs/combineOut-05072017

#90995444efbf33e84090296aa80383c4
#md5_hex(1528996872 1498068825 1494356148  files/6148.2017.0507.0.thresh files/6478.2017.0507.0.thresh)
#Combined data for files: files/6148.2017.0507.0.thresh files/6478.2017.0507.0.thresh 
6148.1	2457880	0.5017287926934751	0.5017287926938802	34.99	4334936768871625	4334936768875125
6148.1	2457880	0.5017287926939091	0.5017287926941262	18.75	4334936768875375	4334936768877250
6148.4	2457880	0.5017287926940828	0.5017287926943577	23.75	4334936768876876	4334936768879250
6148.2	2457880	0.5030325167516059	0.5030325167520978	42.50	4346200944733875	4346200944738125
6148.1	2457880	0.5030325167519096	0.5030325167522569	30.00	4346200944736500	4346200944739500
6148.1	2457880	0.5031973088185040	0.5031973088187355	20.00	4347624748191875	4347624748193875
6148.1	2457880	0.5031973088187789	0.5031973088189960	18.76	4347624748194250	4347624748196126
6148.2	2457880	0.5031973088185330	0.5031973088189815	38.75	4347624748192125	4347624748196000
6148.1	2457880	0.503

In [23]:
!wc -l outputs/combineOut-05072017

12197 outputs/combineOut-05072017


The line count looks like what we expect for combining two files of 4424 lines and 7776 lines, respectively: their sum is 12200, and the output falls short of it by only three lines, much as in the previous example.  However, there's one other aspect of this output we might not have expected: all of the DAQ 6148 data is at the beginning of the file, while all of the DAQ 6478 data is at the end of the file, even though the 6478 muon events were mixed in with the 6148 muon events over the time period that this data was collected!

This isn't a mistake at all, of course.  The `Combine` data transformation does just what its name implies, and only that: it joins two or more files into a single file.  Re-arranging the lines of data to put them in order of time is a separate data transformation handled by a separate script; we'll examine that one soon enough.
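
The difference between joining and time-ordering can be seen with a handful of events.  The following is an illustrative sketch only: it uses four events copied from the listings above and Python's built-in `sorted`, and it is not the e-Lab's own sorting script.

```python
# (detector, Julian day, rising edge) triples taken from lines shown above.
events_6148 = [("6148", 2457880, 0.5017287926934751),
               ("6148", 2457881, 0.4991468762158276)]
events_6478 = [("6478", 2457880, 0.5801056986752026),
               ("6478", 2457881, 0.1316323887125434)]

# Concatenation, as in the Combine output: one file's lines, then the other's.
combined = events_6148 + events_6478

# Time ordering is a separate step: sort by Julian day, then rising edge.
time_ordered = sorted(combined, key=lambda event: (event[1], event[2]))

print([event[0] for event in combined])       # ['6148', '6148', '6478', '6478']
print([event[0] for event in time_ordered])   # ['6148', '6478', '6478', '6148']
```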