*Instructors:*
*This Notebook uses the Combine.pl script of the Cosmic Ray e-Lab analyses to provide material for the following learning goals:*

* *Combining blocks of data in different ways to investigate different questions*
* *Understanding the problem of "background" data in scientific observation*

# Combine

## Motivation: Combining cosmic ray data to improve neutrino experiments

Underneath the black tape covering each of the four detector panels of QuarkNet's cosmic ray muon detectors (CRMDs) is a clear plastic material called a [**scintillator**](https://en.wikipedia.org/wiki/Scintillator).  Scintillators are a broad class of materials defined by the property that they emit light as charged particles -- like cosmic ray muons -- pass through them.

Fig. 1 shows how many neutrino detection experiments work on a similar principle, except that these phantom particles interact so rarely that a much larger amount of scintillator is required to see any evidence of them.  Rather than a small plastic panel, these experiments use large tanks filled with thousands of tons of liquid scintillator.  And since neutrinos are uncharged, the scintillator material in these experiments does not respond to them directly, but rather to the charged particles (electrons and sometimes muons) that are "kicked out" on those rare occasions when a neutrino does interact with the nucleus of an atom of scintillator.

These charged particles *do* leave glowing tracks in the scintillator, and photomultiplier tubes (PMTs) placed around the tank record that light.  Investigators can then use the properties of the light signal to make inferences about the neutrino that started the chain of events.

![A neutrino detector](../Files/Images/TankDetector-Beam_838x363.png)*Fig 1.  A beam of neutrinos passes through a tank of scintillator, occasionally kicking out a charged particle whose scintillation tracks can be seen by a PMT.*

This detector design works well, but Fig. 2 shows one problem with it.  We know that cosmic ray muons strike the earth from the upper atmosphere constantly, so unless the detector is located deep underground (and some are!), some cosmic ray muons will pass through the tank and leave scintillation tracks that register as data.

![A muon striking a neutrino detector](../Files/Images/TankDetector-Muon_838x418.png)*Fig 2.  A cosmic ray muon passes through the detector, leaving a track that might be mistaken for a particle produced by a neutrino event.*

That's bad.  Neutrino events are rare, and the presence of these extra muons makes the few true neutrino interactions that occur even harder to spot.  Almost all experiments have to deal with this sort of unwanted data that, at least on the surface, looks the same as the data that investigators really want to collect.  This "false-positive" data is called **background**, and understanding it is a major part of the design and interpretation of scientific experiments.

Fortunately, there are ways to deal with background.  Fig. 3 shows one example: we can place muon detectors on the top and bottom of the tank of scintillator.  When a cosmic ray muon passes through the tank, it will trigger CRMDs on both the top and bottom of the tank in rapid succession (almost instantaneously, in fact, since the muons are travelling close to the speed of light).  Particle tracks that originate within the tank, however, won't ever activate detectors on both sides of the tank at once.  This means that we can use software or electronic hardware to look for two closely spaced CRMD events and prevent any signal in the detector tank from being recorded when that happens.  This rejection of bad data before it can be recorded is called a **veto**.

![A neutrino detector with a veto system](../Files/Images/TankDetector-Veto_838x432.png)*Fig 3.  A background muon will strike two CRMDs on opposite sides of the tank as it passes through, while a neutrino-produced particle within the tank will not.  This lets us "filter out," or **veto**, cosmic ray muons from the data.*
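The veto idea can be sketched in a few lines of Python.  This is only an illustration of the logic, not how any particular experiment implements it: the function names, the coincidence `window`, and the made-up hit times (in arbitrary time units) are all assumptions, and a real veto usually acts in hardware before data is recorded rather than filtering a list afterwards.

```python
from bisect import bisect_left


def coincident_times(top_hits, bottom_hits, window):
    """Return the top-panel hit times that have a bottom-panel hit within `window`."""
    bottom_sorted = sorted(bottom_hits)
    matches = []
    for t in top_hits:
        # Find the first bottom hit at or after (t - window) ...
        i = bisect_left(bottom_sorted, t - window)
        # ... and check that it is no later than (t + window).
        if i < len(bottom_sorted) and bottom_sorted[i] <= t + window:
            matches.append(t)
    return matches


def apply_veto(tank_events, top_hits, bottom_hits, window):
    """Keep only the tank events that are NOT close in time to a top/bottom coincidence."""
    vetoes = sorted(coincident_times(top_hits, bottom_hits, window))
    kept = []
    for t in tank_events:
        i = bisect_left(vetoes, t - window)
        vetoed = i < len(vetoes) and vetoes[i] <= t + window
        if not vetoed:
            kept.append(t)
    return kept


# Made-up hit times: a through-going muon near t=100 fires both panels.
top = [100, 5000]
bottom = [110, 9000]
tank = [105, 5002, 7000]

print(apply_veto(tank, top, bottom, window=50))  # [5002, 7000]
```

The tank event at 105 is dropped because both panels fired around the same moment; the event at 5002 survives because only the top panel fired then.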

In order for this to work, however, we need a way to combine the data from all of the CRMDs into a single block so that we can quickly look for nearly simultaneous events in two different detectors.  This is the role played by the `Combine` data transformation script.  We'll study two slightly different applications of this data transformation: combining data sets from a single detector that were recorded at different times, and combining data from multiple detectors taken at the same time, as happens in the neutrino detector veto system we just explored.

### Building blocks of data

In a scientific experiment, more signal data leads to more reliable results.  That doesn't mean that we always want all possible data in one big chunk, though.  Not only are large data sets hard to store and transfer, but patterns in relationships between two variables can be hard to see if they're mixed in with data on unrelated variables.  Storing and analyzing scientific data in smaller units gives us the flexibility to combine the data in different ways in order to investigate different questions.

The same is true of the cosmic ray data generated by QuarkNet's CRMDs.  This data is stored in individual files, with each file representing data taken by a single detector over a time period of up to 24 hours.  For some studies -- say, investigating changes in cosmic ray flux as a function of fluctuations in atmospheric pressure -- we might want to look at data taken by a single detector over the course of several days.  In other studies -- say, searching for showers of cosmic rays caused by especially high-energy primary rays -- we might want to look at data taken by many detectors covering a wide geographic area, but only over the course of a few hours.

The e-Lab studies use the `Combine.pl` script for workflows that feature multiple input files.  As you might guess from its name, it combines data from two (or more) files into a single dataset that's contained in a single file.  This gives us the flexibility to examine the same set of raw data files in different ways depending on what variables are relevant to a given question.

## Using Combine.pl

To use the data transformation script `Combine.pl`, we provide it with any number of input files followed by what we want it to name the output file it creates:

`$ perl Combine.pl <input file 1> <input file 2> ... <input file N> <output file>`

where the items in angled brackets `<>` are parameters we have to specify.  These are:

* `input file`:  The name of a file to be used as input; we can specify as many as we like
* `output file`: What we want to name the output file that the script will write its results to
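If we'd rather assemble this command from Python than type it by hand, the argument order above (all inputs first, output last) can be sketched as follows.  The helper name `build_combine_command` is hypothetical, and requiring at least one input file is our own choice for the sketch rather than a documented rule of the script.

```python
def build_combine_command(input_files, output_file, script="Combine.pl"):
    """Build the argument list: perl <script> <input 1> ... <input N> <output>."""
    input_files = list(input_files)
    if not input_files:
        # Our own safeguard for this sketch, not a documented rule of Combine.pl.
        raise ValueError("at least one input file is required")
    return ["perl", script, *input_files, output_file]


cmd = build_combine_command(["day1.thresh", "day2.thresh"], "combined.out")
print(" ".join(cmd))  # perl Combine.pl day1.thresh day2.thresh combined.out

# To actually run it (assuming perl and the script are available):
#   import subprocess
#   subprocess.run(cmd, check=True)
```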

## I) Combining data over time

QuarkNet's cosmic ray muon detectors can take data continuously, and it's not unusual for users to run their detectors for days at a time.  For example, DAQ 6148 generated the three Threshold files

`6148.2018.0602.0.thresh`

`6148.2018.0603.0.thresh`

`6148.2018.0604.0.thresh`

over the 3-day period from June 2 to June 4, 2018.  The Cosmic Ray e-Lab breaks such data up into individual days for the sake of organization and to keep files from becoming unmanageably large.  When running studies on the data, though, we may want to consider the entire run all at once as a single dataset.  The `Combine` data transformation lets us do that.  We'll apply it to the above three files to see how it works.

### 1) Investigate the input data

As always, we'll start by examining the input data before we apply any transformation to it.  The following UNIX shell commands will use the `head` and `tail` programs to let us look at the three header lines plus the first 5 and last 5 data lines of each of these files.

First, for the file `6148.2018.0602.0.thresh`:

In [None]:
!head -8 ../Files/Data/6148.2018.0602.0.thresh ; echo "..." ; tail -5 ../Files/Data/6148.2018.0602.0.thresh

Next, for the file `6148.2018.0603.0.thresh`:

In [None]:
!head -8 ../Files/Data/6148.2018.0603.0.thresh ; echo "..." ; tail -5 ../Files/Data/6148.2018.0603.0.thresh

Finally, for the file `6148.2018.0604.0.thresh`:

In [None]:
!head -8 ../Files/Data/6148.2018.0604.0.thresh ; echo "..." ; tail -5 ../Files/Data/6148.2018.0604.0.thresh

In the clock system the DAQ uses to record `RISING EDGE` and `FALLING EDGE` times, the value `0.5` represents midnight in [Coordinated Universal Time](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) (UTC), because each [Julian day](https://en.wikipedia.org/wiki/Julian_day) begins at noon UTC.  We see that the first file, `6148.2018.0602.0.thresh`, begins just after the `0.5` mark of Julian day 2458271, which is 12:00 AM UTC on June 2, 2018, and it ends just before the `0.5` mark of Julian day 2458272, which is 12:00 AM UTC on June 3, 2018.  You can see that the next two files continue this pattern into the next two days.

*You can find many Julian day converters online, such as [this one](http://www.onlineconversion.com/julian_date.htm), if you'd like to explore more*
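We can also do this conversion ourselves.  The sketch below relies on the fact that Julian date 2440587.5 is 00:00 UTC on January 1, 1970; the function name `julian_to_utc` is our own, and it assumes the Julian day number and the day fraction are passed as two separate values.

```python
from datetime import datetime, timedelta, timezone

# Julian date of 00:00 UTC on January 1, 1970 (the Unix epoch).
UNIX_EPOCH_JULIAN_DATE = 2440587.5
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def julian_to_utc(julian_day, day_fraction):
    """Convert a Julian day number plus a fraction of a day to a UTC datetime."""
    days_since_epoch = (julian_day - UNIX_EPOCH_JULIAN_DATE) + day_fraction
    return UNIX_EPOCH + timedelta(days=days_since_epoch)


# A day fraction of 0.5 lands exactly on midnight UTC.
print(julian_to_utc(2458271, 0.5))  # 2018-06-02 00:00:00+00:00
```

Passing a data line's Julian day and `RISING EDGE` value to the same function gives the UTC time of that hit.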

We can also check the number of lines in each file using the UNIX utility `wc` (short for "word count") with the `-l` flag to count lines instead of words:

In [None]:
!wc -l ../Files/Data/6148.2018.0602.0.thresh

In [None]:
!wc -l ../Files/Data/6148.2018.0603.0.thresh

In [None]:
!wc -l ../Files/Data/6148.2018.0604.0.thresh

### 2) Apply the data transformation

We'll use the `Combine` data transformation to join these three test files into a single dataset, calling it `combineOut-6148`:

In [None]:
!perl ../Files/eLabScripts/Combine.pl ../Files/Data/6148.2018.0602.0.thresh ../Files/Data/6148.2018.0603.0.thresh \
../Files/Data/6148.2018.0604.0.thresh OutputFiles/combineOut-6148

### 3) Investigate the output data

Now we'll examine the output file the same way we did for the input files:

In [None]:
!head -8 OutputFiles/combineOut-6148; echo "..." ; tail -5 OutputFiles/combineOut-6148

We can see at a glance that the first five data lines of the output are the first five data lines of the first input file, `6148.2018.0602.0.thresh`, and the last five data lines of the output are the same as the last five data lines of the last input file, `6148.2018.0604.0.thresh`.  Using the [converter](http://www.onlineconversion.com/julian_date.htm) that we linked to earlier, we can see from the Julian day and first `RISING EDGE` times of the first and last data lines that the output file seems to cover a timespan of

```
2458271	0.5001851099136574   ->   00:00:16 UTC, 2 June 2018
2458274	0.4999579281853588   ->   23:59:57 UTC, 4 June 2018
```

That's what we'd expect for a transformation that adds files from June 2, June 3, and June 4 together one after the other.

Going further, we can check the line count of the output file just as we did for the input files:

In [None]:
!wc -l OutputFiles/combineOut-6148

There are 69966 total lines in the output file.  Is this number what you expect it to be?  Note that it's almost, but not quite, equal to the sum of the line counts of each of the individual input files that we found earlier,

```
6148.2018.0602.0.thresh:        23055
6148.2018.0603.0.thresh:        23468
6148.2018.0604.0.thresh:      + 23449
                                -----
                                69972
```

Why are the two counts not exactly equal?
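One way to investigate is to count header lines and data lines separately in each file.  The sketch below assumes that header lines begin with `#`, which is an assumption about the `.thresh` format that you should check against the `head` output above; the helper name `count_header_and_data` and the sample lines are made up for illustration.

```python
def count_header_and_data(lines):
    """Return (header_count, data_count) for an iterable of text lines."""
    header = 0
    data = 0
    for line in lines:
        if line.startswith("#"):  # assumed header marker
            header += 1
        elif line.strip():        # any other non-blank line counts as data
            data += 1
    return header, data


# Made-up lines standing in for a small file.
sample = [
    "#header line 1\n",
    "#header line 2\n",
    "#header line 3\n",
    "made-up data line 1\n",
    "made-up data line 2\n",
]
print(count_header_and_data(sample))  # (3, 2)

# With a real file:
#   with open("OutputFiles/combineOut-6148") as f:
#       print(count_header_and_data(f))
```

Comparing the two numbers for each input file and for the output file shows which kind of line accounts for the difference.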

## II) Combining data over place

The `Combine` data transformation script also allows us to join data from two different sources into a single dataset.  For example, we may wish to look at data taken at the same time by two nearby detectors to see if there are any correlations in muon events between them.  This happens, for example, during a cosmic ray "shower" that results from a high-energy primary cosmic ray hitting the upper atmosphere and creating a large number of secondary cosmic rays in a cone that can cover up to a couple of square miles of the earth's surface.  When such a shower occurs, muon detectors within that cone will all register increases in muon counts at the same (or very close to the same) time.  By combining data from multiple detectors, we can reconstruct the direction from which the shower-causing primary cosmic ray arrived, and we can even get a rough idea of its total energy.

QuarkNet's [Cosmic Ray e-Lab](https://www.i2u2.org/elab/cosmic/home/project.jsp) features an analysis tool for conducting studies on cosmic ray showers using uploaded CRMD data.  You can learn more about cosmic ray showers there, but for now we'll focus on how the `Combine` data transformation script works when combining data from multiple detectors.

We'll try that with the following two files containing data taken by detectors 6148 and 6478 on May 7, 2017:

`6148.2017.0507.0.thresh`

`6478.2017.0507.0.thresh`

Both of these detectors were located at [Fermilab](https://www.fnal.gov), so it's reasonable to suspect that if a cosmic ray shower occurred in the area, both detectors would have recorded it.

### 1) Investigate the input data

We'll do the usual trick with the UNIX `head` and `tail` commands to see the beginning and end of each file.

First, the file `6148.2017.0507.0.thresh`:

In [None]:
!head -8 ../Files/Data/6148.2017.0507.0.thresh; echo "..." ; tail -5 ../Files/Data/6148.2017.0507.0.thresh

And then the file `6478.2017.0507.0.thresh`:

In [None]:
!head -8 ../Files/Data/6478.2017.0507.0.thresh; echo "..." ; tail -5 ../Files/Data/6478.2017.0507.0.thresh

We can see that the first file, `6148.2017.0507.0.thresh`, covers the whole UTC day midnight-to-midnight:

```
(First data line)     2457880	0.5017287926934751   ->   00:02:29 UTC, 7 May 2017
(Last data line)      2457881	0.4996565376231048   ->   23:59:31 UTC, 7 May 2017
```

The second file, `6478.2017.0507.0.thresh`, covers a time period fully within that of the first:

```
(First data line)     2457880	0.5801056986752026   ->   01:55:22 UTC, 7 May 2017
(Last data line)      2457881	0.1316323887125434   ->   15:09:33 UTC, 7 May 2017
```

That is, from a little before 2 AM to a little after 3 PM UTC.

Next we'll take a look at the line counts:

In [None]:
!wc -l ../Files/Data/6148.2017.0507.0.thresh

In [None]:
!wc -l ../Files/Data/6478.2017.0507.0.thresh

Interestingly, even though detector 6478 took data for only about 13 hours, it recorded many more lines of data than the nearby detector 6148, which ran for the full 24-hour day on May 7, 2017.
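To put numbers on that comparison, we can divide each file's line count by the hours it spans.  The sketch below uses the first and last timestamps listed above and line counts of 4424 (detector 6148) and 7776 (detector 6478); those counts include each file's few header lines, which barely affect the result.  The function names are our own.

```python
def hours_spanned(first, last):
    """Elapsed hours between two (julian_day, day_fraction) timestamps."""
    (day0, frac0), (day1, frac1) = first, last
    return ((day1 - day0) + (frac1 - frac0)) * 24.0


def lines_per_hour(line_count, first, last):
    """Average number of file lines recorded per hour of running."""
    return line_count / hours_spanned(first, last)


rate_6148 = lines_per_hour(
    4424, (2457880, 0.5017287926934751), (2457881, 0.4996565376231048)
)
rate_6478 = lines_per_hour(
    7776, (2457880, 0.5801056986752026), (2457881, 0.1316323887125434)
)

print(f"6148: {rate_6148:.0f} lines per hour")
print(f"6478: {rate_6478:.0f} lines per hour")
print(f"ratio: {rate_6478 / rate_6148:.1f}")
```

Detector 6478 comes out at roughly three times the hourly rate of detector 6148, which is the difference the discussion question below asks about.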

**Discussion:** What do you think might explain this?

In [1]:
# Textarea widget 
import ipywidgets as widgets
widgets.Textarea(
    value='',
    description='Answer:',
    disabled=False,
    layout=widgets.Layout(width='100%')
    #layout=widgets.Layout(width='100%', height='200px')
)

### 2) Apply the data transformation

We'll feed the two input files we've selected into the data transformation script `Combine`, calling the output file `combineOut-05072017`.

In [None]:
!perl ../Files/eLabScripts/Combine.pl ../Files/Data/6148.2017.0507.0.thresh \
../Files/Data/6478.2017.0507.0.thresh OutputFiles/combineOut-05072017

### 3) Investigate the output data

And we examine the result:

In [None]:
!head -8 OutputFiles/combineOut-05072017; echo "..." ; tail -5 OutputFiles/combineOut-05072017

In [None]:
!wc -l OutputFiles/combineOut-05072017

The line count looks like what we expect for combining two files of 4424 lines and 7776 lines, respectively.  However, there's one other aspect of this output we might not have expected.  Whenever we've looked at `.thresh` cosmic ray data files before, all of the data has been in chronological order with the earliest muon hits at the beginning and the latest ones at the end.  In the case of this output file, `combineOut-05072017`, though, all of the detector 6148 data is at the beginning of the file, and all of the detector 6478 data is at the end of the file, even though data from the two detectors overlapped in time.  This means that the data can't possibly be in chronological order any more!

This isn't a mistake at all, of course.  The `Combine` data transformation does just what its name implies, and only that: it joins two or more files into a single file.  Re-arranging the lines of data to put them in order of time is a separate data transformation handled by a separate script; we'll examine that one soon enough.
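A small Python sketch makes the point concrete.  It stands in for the idea of joining files end to end, not for the Perl script itself: each record is a `(Julian day, day fraction)` pair taken from the first and last data lines listed earlier, and the function names are our own.  The call to Python's built-in `sorted` is only there to show that ordering is a separate step; it is not the e-Lab's sorting script.

```python
def combine(*datasets):
    """Join datasets end to end, in the order given, without reordering."""
    combined = []
    for dataset in datasets:
        combined.extend(dataset)
    return combined


def is_chronological(records):
    """True if every record is at or after the one before it."""
    return all(a <= b for a, b in zip(records, records[1:]))


# First and last timestamps from each detector's file on May 7, 2017.
det_6148 = [(2457880, 0.5017287926934751), (2457881, 0.4996565376231048)]
det_6478 = [(2457880, 0.5801056986752026), (2457881, 0.1316323887125434)]

combined = combine(det_6148, det_6478)

print(is_chronological(det_6148))          # True
print(is_chronological(det_6478))          # True
print(is_chronological(combined))          # False
print(is_chronological(sorted(combined)))  # True
```

Each input is in time order on its own, but the joined result is not, because the last 6148 record comes later than the first 6478 record that follows it.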