*Instructors:*
*This Notebook uses the SingleChannel.pl script of the Cosmic Ray e-Lab analyses to provide material for the following learning goals:*

* *Identifying and selecting relevant data from a larger dataset*
* *Understanding the use of 'control' data in an experiment*
* *Introductory principles of statistics*

# SingleChannel

## Motivation: Comparing muon absorption data to a control

When cosmic ray muons pass through a material, there's some chance that the material will absorb each of them.  Can we quantify this effect?  Fig. 1 shows a simple experiment you can do with a cosmic ray muon detector (CRMD) to begin answering this question.

In this experiment, we place a layer of a dense material (we'll use lead as an example) between two detector panels stacked vertically.  Many of the cosmic ray muons that pass through the top panel also pass through the lead and reach the bottom panel, but some are absorbed by the lead along the way.

![A muon striking a neutrino detector](img/MuonAbsorption_800x401.png)*Fig. 1.  A layer of lead placed between two CRMD detector panels absorbs some fraction of cosmic ray muons that pass through.*

By comparing the data from the bottom panel to the data from the top panel, we can calculate what fraction of incident muons are blocked by the lead.  If we want to be thorough, we can then change the thickness of the lead and perform the measurement again multiple times in order to determine the fraction of absorbed muons as a function of lead thickness -- but that's getting ahead of ourselves.

For now, what's important is to recognize the difference between the top panel and the bottom panel.  The bottom panel takes the data that we're most directly interested in: the muon count when the path is blocked by lead.  We can't interpret that data, however, unless we also measure the *unblocked* muon count using the top panel.  Otherwise, we don't know exactly how many are being stopped, only that there seem to be fewer of them.

In this case, the top detector panel acts as a **control**: a set of data taken *without* imposing the conditions of the experiment so that the experimental data can be compared to it.

This also implies that the data from the top panel and the data from the bottom panel are meaningfully distinct and that we can't treat them as part of one big set.  After all, we've specifically designed our experiment around the expectation that there will be a physical difference between the two:  We expect fewer muon counts in the panel that's shielded by lead than in the control panel.  When we analyze the data recorded by the DAQ, we'll have to distinguish the data of the upper channel from the data of the lower channel so that we can handle them differently.  This is what the `SingleChannel` data transformation is designed to do.
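Once the two channels have been separated, the comparison with the control comes down to simple arithmetic.  The sketch below is only an illustration: the function name `absorbed_fraction` and the counts are made up, and it assumes that both panels are equally efficient and were counting for the same length of time.

```python
def absorbed_fraction(control_count, shielded_count):
    """Fraction of muons seen by the control panel that are missing below the lead.

    Assumes both panels have the same efficiency and the same exposure time.
    """
    if control_count <= 0:
        raise ValueError("control_count must be positive")
    return 1.0 - shielded_count / control_count


# Hypothetical counts, chosen only to illustrate the arithmetic.
top_count = 1000     # control panel (above the lead)
bottom_count = 900   # shielded panel (below the lead)

fraction = absorbed_fraction(top_count, bottom_count)
print(f"Absorbed fraction: {fraction:.2f}")
```

With real data, `top_count` and `bottom_count` would be the line counts of the two single-channel files.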

## Using SingleChannel.pl

To use the data transformation script `SingleChannel.pl`, we provide it with a single input file followed by what we want it to name the output file it creates and a channel number:

`$ perl ./perl/SingleChannel.pl <input file> <output file> <channel number>`

where the items in angle brackets `<>` are parameters we have to specify.  These are:

* `input file`:  The name of a file to be used as input; we can specify only one for this script
* `output file`: What we want to name the output file that the script will write its results to
* `channel number`: Which DAQ channel (1-4) we're selecting for output

We'll try it out on the test data in the `test_data` directory.  Use the UNIX shell command `$ ls test_data` to see what's there:

In [1]:
!ls test_data

6119.2016.0104.1.test.thresh  combineOut  sortOut15
6148.2016.0109.0.test.thresh  sortOut	  sortOut51
6203.2016.0104.1.test.thresh  sortOut11


Selecting randomly, we'll choose the threshold file `6148.2016.0109.0.test.thresh` to test the `SingleChannel` data transformation on.

### 1) Investigating the input data

To get a sense for how big the data file is and what its data looks like before the `SingleChannel` data transformation, we'll first use the UNIX `wc` (word count) utility to count its lines:

In [2]:
!wc -l test_data/6148.2016.0109.0.test.thresh

1003 test_data/6148.2016.0109.0.test.thresh


(The `-l` flag tells `wc` to count lines instead of words.  The first number in the output, before the filename, is the number of lines: in this case, 1003.)

We can see that `6148.2016.0109.0.test.thresh` has just over a thousand lines.  In fact, since we already know that each `.thresh` file begins with three commented lines as a header, we can tell that this file has exactly 1000 lines of data.
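The same subtraction can be done directly by skipping the header.  This is a minimal sketch, assuming that header lines are exactly the ones that begin with `#`; the helper name `count_data_lines` is made up for this example.

```python
def count_data_lines(lines):
    """Count lines that are neither blank nor '#'-prefixed header lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


# A tiny stand-in for a threshold file: three header lines, two data lines.
sample = [
    "#$md5\n",
    "#md5_hex(0)\n",
    "#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), "
    "TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)\n",
    "6148.4\t2457396\t0.5006992493422453\t0.5006992493424479\t17.51\t4326041514317000\t4326041514318750\n",
    "6148.3\t2457396\t0.5006992493422887\t0.5006992493424768\t16.25\t4326041514317375\t4326041514319000\n",
]

data_line_count = count_data_lines(sample)
print(data_line_count)
```

An open file can be passed in place of the list, for example `with open("test_data/6148.2016.0109.0.test.thresh") as handle: count_data_lines(handle)`.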

We'll use the UNIX utility `head` to see only the first 25 lines:

In [3]:
!head -25 test_data/6148.2016.0109.0.test.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.4	2457396	0.5006992493422453	0.5006992493424479	17.51	4326041514317000	4326041514318750
6148.3	2457396	0.5006992493422887	0.5006992493424768	16.25	4326041514317375	4326041514319000
6148.2	2457396	0.5007005963399161	0.5007005963400029	7.49	4326053152376876	4326053152377625
6148.3	2457396	0.5007005963401910	0.5007005963404514	22.49	4326053152379250	4326053152381500
6148.4	2457396	0.5007005963401765	0.5007005963404658	25.00	4326053152379125	4326053152381624
6148.1	2457396	0.5014987243978154	0.5014987243980903	23.75	4332948978797125	4332948978799500
6148.2	2457396	0.5014987243980759	0.5014987243982495	15.00	4332948978799376	4332948978800875
6148.1	2457396	0.5020062862072049	0.5020062862076967	42.49	4337334312830250	4337334312834500
6148.2	2457396	0.5020062862074218	0.5020062862076389	18.75	4337334312832125	4337334312834000
6148.

This certainly looks like standard threshold data.
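To work with these columns in a program rather than by eye, each data line can be split into named fields.  The sketch below assumes the columns are separated by whitespace and appear in the order given by the header line; the names `ThresholdRecord` and `parse_threshold_line` are invented for this example and are not part of the e-Lab scripts.

```python
from typing import NamedTuple


class ThresholdRecord(NamedTuple):
    detector_id: str
    channel: int
    julian_day: int
    rising_edge: float
    falling_edge: float
    time_over_threshold_ns: float
    rising_edge_int: int
    falling_edge_int: int


def parse_threshold_line(line):
    """Split one data line of a threshold file into named fields."""
    fields = line.split()
    detector_id, channel = fields[0].split(".")
    return ThresholdRecord(
        detector_id=detector_id,
        channel=int(channel),
        julian_day=int(fields[1]),
        rising_edge=float(fields[2]),
        falling_edge=float(fields[3]),
        time_over_threshold_ns=float(fields[4]),
        rising_edge_int=int(fields[5]),
        falling_edge_int=int(fields[6]),
    )


# First data line of the file shown above.
record = parse_threshold_line(
    "6148.4\t2457396\t0.5006992493422453\t0.5006992493424479\t17.51"
    "\t4326041514317000\t4326041514318750"
)
print(record.detector_id, record.channel, record.time_over_threshold_ns)
```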

### 2) Applying the data transformation

Remember that the command-line call to `SingleChannel.pl` is of the form

`$ perl ./perl/SingleChannel.pl <input file> <output file> <channel number>`.

We'll run

`$ perl ./perl/SingleChannel.pl test_data/6148.2016.0109.0.test.thresh outputs/singleChannelOut1 1`

to see what happens.  Notice that we've named the output file `singleChannelOut1` to indicate which channel number we're using; this will help us compare files from different trials.

This is what the first 25 lines of the resulting file `singleChannelOut1` look like:

In [5]:
!head -25 outputs/singleChannelOut1

6148.1	2457396	0.5014987243978154	0.5014987243980903	23.75	4332948978797125	4332948978799500
6148.1	2457396	0.5020062862072049	0.5020062862076967	42.49	4337334312830250	4337334312834500
6148.1	2457396	0.5021121718857783	0.5021121718861401	31.26	4338249165093124	4338249165096250
6148.1	2457396	0.5023430585295574	0.5023430585298612	26.24	4340244025695376	4340244025698000
6148.1	2457396	0.5051481486754340	0.5051481486758101	32.50	4364480004555750	4364480004559000
6148.1	2457396	0.5052315883537616	0.5052315883540365	23.75	4365200923376500	4365200923378874
6148.1	2457396	0.5068497615760561	0.5068497615764758	36.26	4379181940017124	4379181940020750
6148.1	2457396	0.5087878610252894	0.5087878610255497	22.49	4395927119258500	4395927119260750
6148.1	2457396	0.5087878610255786	0.5087878610258102	20.01	4395927119261000	4395927119263000
6148.1	2457396	0.5106959616681134	0.5106959616684317	27.50	4412413108812500	4412413108815250
6148.1	2457396	0.5106963821786025	0.5106963821788774	23.75	4

The first thing we notice is that the output of `SingleChannel` has no header lines.  We've seen data transformation scripts that alter the header of the input file to better suit the output file format, but this script deletes it entirely.  Headers are an important source of information for other users to understand the meaning of a data file, so in this case we'll have to rely on memory and documentation to interpret the columns.  Comparing the two files, though, makes it evident that `SingleChannel` hasn't altered the column structure of the input threshold file.

For a quick comparison, we can check the column structure of the original in the usual way:

In [1]:
!head -7 test_data/6148.2016.0109.0.test.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.4	2457396	0.5006992493422453	0.5006992493424479	17.51	4326041514317000	4326041514318750
6148.3	2457396	0.5006992493422887	0.5006992493424768	16.25	4326041514317375	4326041514319000
6148.2	2457396	0.5007005963399161	0.5007005963400029	7.49	4326053152376876	4326053152377625
6148.3	2457396	0.5007005963401910	0.5007005963404514	22.49	4326053152379250	4326053152381500

The second thing we notice is that column 1 contains only `6148.1` values - that is, only data from the first channel of the detector with DAQ ID #6148.  Given a name like `SingleChannel`, it should be clear that this is what the script does -- it filters the input data to select only data lines for a specific DAQ channel, discarding the rest.
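The behavior observed so far (header dropped, one channel kept, columns unchanged) can be imitated in a few lines.  This is a sketch of what we see in the output, not the Perl script's actual implementation; `select_channel` is a made-up name, and the sample lines are taken from the file above.

```python
def select_channel(lines, channel):
    """Keep data lines whose ID.CHANNEL column ends in the given channel number.

    Header lines (starting with '#') and blank lines are dropped.
    """
    wanted = str(channel)
    selected = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        id_channel = line.split()[0]             # e.g. "6148.1"
        if id_channel.rsplit(".", 1)[-1] == wanted:
            selected.append(line)
    return selected


sample = [
    "#$md5\n",
    "6148.4\t2457396\t0.5006992493422453\t0.5006992493424479\t17.51\t4326041514317000\t4326041514318750\n",
    "6148.2\t2457396\t0.5007005963399161\t0.5007005963400029\t7.49\t4326053152376876\t4326053152377625\n",
    "6148.1\t2457396\t0.5014987243978154\t0.5014987243980903\t23.75\t4332948978797125\t4332948978799500\n",
    "6148.1\t2457396\t0.5020062862072049\t0.5020062862076967\t42.49\t4337334312830250\t4337334312834500\n",
]

channel_1 = select_channel(sample, 1)
print(len(channel_1))
```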

It turns out that `SingleChannel` has a little bit more power, though.  It can actually handle multiple single channels at a time, as odd as that might sound.  We'll try specifying additional channels, each with its own output filename:

`$ perl ./perl/SingleChannel.pl test_data/6148.2016.0109.0.test.thresh "outputs/singleChannelOut1 outputs/singleChannelOut2 outputs/singleChannelOut3 outputs/singleChannelOut4" "1 2 3 4"`

Note that for multiple channels and outputs, we have to wrap each list in quotes `"` so that the shell passes it to `SingleChannel` as a single argument.  That is how the script can tell which items are the output filenames and which ones are the channel numbers.
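When the script is launched from Python instead of a shell, the same grouping can be expressed as a list of arguments, with each space-separated list held in a single list element.  The helper below is a hypothetical convenience, not part of the e-Lab code; it only builds the argument list, and its check that there is one output file per channel is an assumption based on the example above.

```python
def build_single_channel_command(input_file, output_files, channels,
                                 script="./perl/SingleChannel.pl"):
    """Build the argument list for a SingleChannel.pl call.

    The output names and the channel numbers are each joined into one
    space-separated argument, mirroring the quoted form used in the shell.
    """
    if len(output_files) != len(channels):
        raise ValueError("need exactly one output file per channel")
    return [
        "perl",
        script,
        input_file,
        " ".join(output_files),
        " ".join(str(channel) for channel in channels),
    ]


command = build_single_channel_command(
    "test_data/6148.2016.0109.0.test.thresh",
    [f"outputs/singleChannelOut{n}" for n in (1, 2, 3, 4)],
    [1, 2, 3, 4],
)
print(command)
```

The list could then be handed to `subprocess.run(command, check=True)`, which passes each element to the script as one argument without any shell quoting.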

Running the four-channel command from the command line does in fact produce four separate output files:

In [6]:
!ls -1 outputs/

6119.2016.0104.1.test.thresh
6148.2016.0109.0.test.thresh
6203.2016.0104.1.test.thresh
combineOut
singleChannelOut1
singleChannelOut2
singleChannelOut3
singleChannelOut4
sortOut
sortOut11
sortOut15
sortOut51


Out of curiosity, let's line-count them using the UNIX `wc` utility:

In [12]:
!wc -l test_data/singleChannelOut1

258 test_data/singleChannelOut1


In [13]:
!wc -l test_data/singleChannelOut2

265 test_data/singleChannelOut2


In [14]:
!wc -l test_data/singleChannelOut3

239 test_data/singleChannelOut3


In [15]:
!wc -l test_data/singleChannelOut4

238 test_data/singleChannelOut4


Recall that the original input threshold file `6148.2016.0109.0.test.thresh` had 1003 lines - three header lines, and 1000 data lines.

**Exercise 1**

Add the line counts of the four output files above.  Do you get what you expect?

**Exercise 2**

In a well-functioning cosmic ray muon detector using 4 channels, what percentage of the total number of counts do you expect each channel to record?  Using the example above of a file with 1000 counts, how many counts would you expect each channel to have?  If the actual results differ from what you would have expected, try to explain why.

**Exercise 3**

Find a file with a much larger number of counts (that is, lines) than `6148.2016.0109.0.test.thresh` has, perhaps in the `files/` directory.  Repeat the above process of using `SingleChannel` to separate the file into individual-channel files, naming the outputs `test_data/singleChannelOut-Big1`, `test_data/singleChannelOut-Big2`, etc.  

Calculate what percentage of the total number of counts each output file has.  How do these compare to your expectations?  How do they compare to the percentages you found for the 1000 counts of `6148.2016.0109.0.test.thresh`?
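If you'd rather not do the percentages by hand, a small helper like the one below may be useful.  The name `channel_shares` and the counts are hypothetical; substitute the line counts that `wc -l` reports for your own output files.

```python
def channel_shares(counts):
    """Convert per-channel counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("total count is zero")
    return {channel: 100.0 * n / total for channel, n in counts.items()}


# Hypothetical line counts, keyed by channel number.
example_counts = {1: 30, 2: 20, 3: 25, 4: 25}

shares = channel_shares(example_counts)
for channel, share in sorted(shares.items()):
    print(f"channel {channel}: {share:.1f}%")
```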

**A Word of Warning**

If you've been playing around with line counts for a bit, you may have noticed that `SingleChannel` has a quirk: if you specify an output file that already exists, `SingleChannel` will *add to* the existing file rather than replacing it with the new output.  Most of the other e-Lab data transformations will replace the existing file, so this may be a bug in this particular script.

*Be aware of this when running similar commands multiple times!*
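One way to guard against this is to delete stale output files before each rerun.  The sketch below does not call `SingleChannel.pl` at all: it imitates an appending script by opening a file in append mode inside a temporary directory, and `remove_stale_outputs` is a made-up helper name.

```python
import os
import tempfile


def remove_stale_outputs(paths):
    """Delete any existing output files so that a rerun starts from scratch."""
    removed = []
    for path in paths:
        if os.path.exists(path):
            os.remove(path)
            removed.append(path)
    return removed


def simulate_appending_run(path):
    """Stand-in for a script that appends to its output file."""
    with open(path, "a") as handle:
        handle.write("one data line\n")


def count_lines(path):
    with open(path) as handle:
        return sum(1 for _ in handle)


with tempfile.TemporaryDirectory() as workdir:
    out_path = os.path.join(workdir, "singleChannelOut1")

    # Two runs without cleanup: the second run's output is added to the first.
    simulate_appending_run(out_path)
    simulate_appending_run(out_path)
    lines_after_two_runs = count_lines(out_path)

    # Cleaning up first gives the line count of a single run.
    remove_stale_outputs([out_path])
    simulate_appending_run(out_path)
    lines_after_cleanup = count_lines(out_path)

print(lines_after_two_runs, lines_after_cleanup)
```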

**Further Exploration**

1) What happens with a command like

`$ perl ./perl/SingleChannel.pl test_data/6148.2016.0109.0.test.thresh "test_data/singleChannelOut test_data/singleChannelOut test_data/singleChannelOut test_data/singleChannelOut" "1 2 3 4"`

where we apply `SingleChannel.pl` to all four channels but provide the same filename?  Does it concatenate each output into the file, effectively sorting the whole thing by channel?

2) What does `SingleChannel.pl` do for an input file with multiple detectors?