In [1]:
import ipywidgets as widgets
from IPython.display import display

*Instructors:*
*This Notebook uses the Sort.pl script of the Cosmic Ray e-Lab analyses to provide material for the following learning goals:*

* *Understanding energy scales in physics*
* *Understanding ultra-high-energy cosmic rays as a real-world research problem*
* *Identifying interesting data within a large dataset*
* *Formulating questions and transforming data to answer them*

# Sort

## Motivation: Identifying high-energy cosmic rays

**Understanding cosmic ray energy scales**

Cosmic rays are considered "high-energy" radiation.  The typical energy of a muon-producing primary cosmic ray particle is 1-100 GeV.  "GeV" stands for "Giga electron-Volt"; the electron-Volt (eV) is the standard unit of energy in particle physics, and "Giga" is a prefix meaning "one billion of them", or $10^9$.  For comparison, 

* A single electron accelerated through a voltage of 1 Volt gains an energy equal to 1 eV (hence the name),

* The energy of a neutron emitted during the fission of a uranium nucleus is typically ~1 MeV, or ~0.001 GeV, and

* The energy of a proton in one beam of the LHC, the highest-energy particle accelerator in history, is 6.5 TeV, or 6,500 GeV.

("M" is a shorthand prefix for "Mega", meaning "one million" or $10^6$.  "T" is a shorthand prefix for "Tera", meaning "one trillion", or $10^{12}$).

Your everyday primary cosmic ray particle isn't quite as energetic as what's collided at the LHC, but it's over 1,000 times more energetic than individual particles emitted during nuclear fission events, such as in a nuclear reactor or the detonation of a nuclear weapon.
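These comparisons are easy to check with a little arithmetic.  The sketch below is only an illustration: the names `PREFIX` and `to_eV` are invented for this example, and the energies are the representative values quoted in the text above.

```python
# Prefix multipliers, expressed in eV (illustrative helper, not part of the e-Lab)
PREFIX = {"eV": 1.0, "MeV": 1e6, "GeV": 1e9, "TeV": 1e12}

def to_eV(value, unit):
    """Convert an energy in the given unit to electron-volts."""
    return value * PREFIX[unit]

# Representative energies quoted in the text
fission_neutron = to_eV(1, "MeV")
cosmic_ray_low = to_eV(1, "GeV")      # low end of the typical range
cosmic_ray_high = to_eV(100, "GeV")   # high end of the typical range
lhc_proton = to_eV(6.5, "TeV")

# How many times more energetic is each one than the other?
print("Cosmic ray (low end) / fission neutron:", cosmic_ray_low / fission_neutron)
print("LHC proton / cosmic ray (high end):", lhc_proton / cosmic_ray_high)
```

Working in a single base unit (eV here) before taking ratios avoids mixing up prefixes.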

Note that we say a "typical" cosmic ray particle has an energy of 1-100 GeV, but that isn't meant to be a limit.  Cosmic rays with energies higher than this range are measured frequently, and cosmic rays with energies of thousands or tens of thousands of GeV are usually considered unremarkable, though they become much rarer as you move further away from the 1-100 GeV range.

**Ultra-high-energy cosmic rays**

Every once in a while, though, a cosmic ray is detected with an energy far greater than the typical 1-100 GeV.  Called "ultra-high-energy cosmic rays", these particles have energies of 1 *billion*, or $10^9$, GeV or greater.

Such rare events are intriguing in their own right, but what's most significant about such particles is that *we don't know where they come from!*  We know of no natural process in the universe -- whether supernovas, quasars, or black hole accretion jets -- that is capable of accelerating particles to this energy.

Naturally, ultra-high-energy cosmic rays are the subject of scientific interest and research for what they might be able to reveal about unknown physics in the cosmos.  But in order to study them, we need to observe them, and in order to observe them, we need to find them in the data.

![The Auger Observatory](img/AugerTank.jpg) *Fig 1: A photograph taken at the Pierre Auger Observatory in Argentina.  The Auger Observatory was constructed to measure ultra-high-energy cosmic ray events so that we can learn more about this rare and mysterious phenomenon.*

**Energy in CRMD data**

Most cosmic ray muon detectors (CRMDs), including QuarkNet's, don't measure primary cosmic rays or their energies directly.  Rather, they measure muons created as secondary particles when primary cosmic ray particles strike the Earth's atmosphere.

QuarkNet CRMDs don't measure the energy of these secondary muons, either, at least not directly.  Recall how the CRMD works, though:

First, a cosmic ray muon passes through one of the scintillating detector panels, which causes it to emit light.  Then, this light is gathered by a photomultiplier tube (PMT) that converts it to a voltage.  The detector's data acquisition board (DAQ) notes the times when this voltage first crosses a certain value (called the threshold), and then it notes again when that voltage falls back below the threshold.

The duration of time that the voltage pulse exceeds the threshold is called the "time over threshold" value, and it's recorded as one of the data columns in the threshold data files that the Cosmic Ray e-Lab analyzes.

The more energy a muon has, the more light it generates in the scintillator panel of the detector.  This additional light registers at the PMT for a longer duration, meaning that the voltage pulse it produces lasts for a longer duration. Thus, the "time over threshold" value recorded by the DAQ is greater for a higher-energy muon than for a lower-energy muon.

In other words, the "time over threshold" value is *correlated* with the energy of the muon, and this correlation is a *positive correlation*: as one increases, so does the other.  This means that, while we can't assign an energy value in GeV to a cosmic ray muon using QuarkNet's CRMDs, we can reasonably compare the energies of different muon events by looking at the "time over threshold" value in the data.  In effect, the "time over threshold" data serves as a placeholder variable for energy.

## Analyzing "time over threshold" in CRMD data

Below, we use the UNIX utility `head` to examine a bit of data from the threshold file of a CRMD:

In [2]:
!head -8 files/6148.2016.0109.0.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.4	2457396	0.5006992493422453	0.5006992493424479	17.51	4326041514317000	4326041514318750
6148.3	2457396	0.5006992493422887	0.5006992493424768	16.25	4326041514317375	4326041514319000
6148.2	2457396	0.5007005963399161	0.5007005963400029	7.49	4326053152376876	4326053152377625
6148.3	2457396	0.5007005963401910	0.5007005963404514	22.49	4326053152379250	4326053152381500
6148.4	2457396	0.5007005963401765	0.5007005963404658	25.00	4326053152379125	4326053152381624


The values we're interested in are in the `TIME OVER THRESHOLD (nanosec)` column (the ones with values 17.51, 16.25, 7.49, and so on).
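If you'd like to pull that column out with code, here is a minimal sketch.  It assumes the fields on each data line are separated by tabs, as they are in the output above; the function name `parse_threshold_line` is made up for this example.

```python
# One data line copied from the `head` output above (fields are tab-separated)
line = ("6148.4\t2457396\t0.5006992493422453\t0.5006992493424479\t"
        "17.51\t4326041514317000\t4326041514318750\n")

def parse_threshold_line(line):
    """Return (ID.CHANNEL, time over threshold in ns) from one data line."""
    fields = line.rstrip("\n").split("\t")
    id_channel = fields[0]
    # Column 5 when counting from 1 is index 4 when counting from 0
    time_over_threshold = float(fields[4])
    return id_channel, time_over_threshold

print(parse_threshold_line(line))
```

Note the off-by-one between "column 5" in the file and index `4` in Python, which counts from zero.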

**Your assignment:** Running the next cell will display 100 lines of data similar to those above, but from a different threshold file.  When you run the cell, start a timer.  Identify all lines in the data that have a "time over threshold" value between **15.24** and **21.80** by copying and pasting them into the cell below it.  Stop the timer when you're satisfied that you've found all values.

In [None]:
!head -103 files/6148.2017.0507.0-100.thresh

In [4]:
#
# Ignore this box.  It's just to make the text entry box below.
#
widgets.Textarea(
    value='',
    placeholder='Paste matching data lines here',
    description='Data:',
    disabled=False,
    layout=widgets.Layout(width='100%', height='200px')
)

How many lines did you find?  How long did it take?

In [5]:
#
# Ignore this box.  It's just to make the two entry boxes below.
#
numberFound = widgets.Text(
    value='',
    placeholder='Number of data lines',
    description='# lines:',
    disabled=False
)

timeToFind = widgets.Text(
    value='',
    placeholder='Time',
    description='Time (s):',
    disabled=False
)

display(numberFound)
display(timeToFind)

*Instructor's copy: there are 23 lines with TOT between 15.24 and 21.80*

You'll probably agree that the method you just used to identify certain data in a large set is not the most efficient way of doing things.  If we're looking for particular values, or values in a certain range, it's much easier to do that if the data is *sorted* by the value of interest.
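For comparison, here is the same line-by-line scan written as code.  This is only a sketch: it uses the five data lines from the first `head` output (reduced to two columns) rather than the 100-line file, and it assumes the range is meant to include both endpoints.

```python
# (ID.CHANNEL, time over threshold) pairs from the first `head` output above
sample = [
    ("6148.4", 17.51),
    ("6148.3", 16.25),
    ("6148.2", 7.49),
    ("6148.3", 22.49),
    ("6148.4", 25.00),
]

LOW, HIGH = 15.24, 21.80  # assumed inclusive on both ends

# Examine every line in turn, exactly as you did by eye
matches = [row for row in sample if LOW <= row[1] <= HIGH]

print(len(matches), "matching lines:")
for row in matches:
    print(row)
```

Like the manual search, this has to look at every single line, because the values of interest could be anywhere in the file.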

In their standard form, QuarkNet's CRMD threshold files are sorted by "rising edge" times (you can see this easily in the 100-line output of `6148.2017.0507.0-100.thresh` that you just scanned through).  That is, they're organized chronologically by the time that each muon's signal pulse first exceeded the DAQ threshold.  If we care about the "time over threshold" values more than the time of day the muons arrived, then it makes more sense to re-organize the data by that value.

This is what the `Sort.pl` data transformation does.  We'll apply it to the data you examined above to see how much easier it makes the task of finding just the right data in a set.

## Using Sort.pl

To use the data transformation script `Sort.pl`, we provide it with (in order): an input file, what to name its output file, and two column-sorting parameters:

`$ perl ./perl/Sort.pl <input file> <output file> <primary sort column> <secondary sort column>`

where the items in angled brackets `<>` are parameters we have to specify.  In more detail, these are:

* `input file`:  The name of the input file to be sorted; we can specify only one for this script
* `output file`: What we want to name the output file that the script will write its results to
* `primary sort column`: The number of the data column we want to sort by, counting from 1 at the left
* `secondary sort column`: The number of the data column we want to sub-sort by, again counting from 1
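If you end up running the script many times, it can help to assemble the command in code.  The helper below, `build_sort_command`, is hypothetical (it isn't part of the e-Lab), and the file names are placeholders; it only arranges the four parameters in the order described above.

```python
def build_sort_command(input_file, output_file, primary_column, secondary_column,
                       script="./perl/Sort.pl"):
    """Return the argument list for one Sort.pl call (hypothetical helper)."""
    return ["perl", script, input_file, output_file,
            str(primary_column), str(secondary_column)]

# Placeholder file names, for illustration only
command = build_sort_command("input.thresh", "output.sorted", 5, 1)
print(" ".join(command))

# To actually run it from Python (requires perl and the script to be present):
# import subprocess
# subprocess.run(command, check=True)
```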

### Understanding primary and secondary sorts

"Primary" and "secondary" sort might need a little more explanation, which we can illustrate with some fake threshold data:

```
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.1	2457880	0.50751	0.50802	34.99	43325	43325
6148.1	2457880	0.50091	0.50262	18.75	43375	43350
6148.4	2457880	0.50828	0.50577	23.75	43376	43350
6148.2	2457880	0.50059	0.50978	42.50	43475	43425
6148.1	2457880	0.50096	0.50569	30.00	43400	43400
6148.1	2457880	0.50040	0.50355	20.00	43475	43475
6148.1	2457880	0.50789	0.50960	18.76	43450	43426
6148.2	2457880	0.50330	0.50815	38.75	43425	43400
6148.1	2457880	0.50539	0.50972	 3.74	43426	43400
6148.2	2457880	0.50564	0.50616	35.01	43500	43500
6148.4	2457880	0.50616	0.50497	16.25	43500	43526
6148.2	2457880	0.50050	0.50508	21.24	43575	43500
6148.4	2457880	0.50246	0.50139	25.00	43500	43500
```

This data is unsorted, in the sense that none of the columns have their data in order if you scan them from top-to-bottom.  To sort the data, we'll rearrange the lines so that one column *does* have its data in order from lowest-to-highest.  This will be the **primary sort** column.  Let's say that we want to sort the data by detector channel, which is given by the `ID.CHANNEL` values in the first column, so that we can easily see all the data taken by each detector panel.  Once sorted on that column, the data looks like this:

```
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.1	2457880	0.50751	0.50802	34.99	43325	43325
6148.1	2457880	0.50091	0.50262	18.75	43375	43350
6148.1	2457880	0.50096	0.50569	30.00	43400	43400
6148.1	2457880	0.50040	0.50355	20.00	43475	43475
6148.1	2457880	0.50789	0.50960	18.76	43450	43426
6148.1	2457880	0.50539	0.50972	 3.74	43426	43400
6148.2	2457880	0.50059	0.50978	42.50	43475	43425
6148.2	2457880	0.50330	0.50815	38.75	43425	43400
6148.2	2457880	0.50564	0.50616	35.01	43500	43500
6148.2	2457880	0.50050	0.50508	21.24	43575	43500
6148.4	2457880	0.50828	0.50577	23.75	43376	43350
6148.4	2457880	0.50616	0.50497	16.25	43500	43526
6148.4	2457880	0.50246	0.50139	25.00	43500	43500
```

You can see that the first column is in numerical order according to the decimal after `6148`, which is the DAQ ID for this detector and which doesn't change throughout the data.

All of the `ID.CHANNEL` values occur multiple times, however.  Our primary sort on column 1 means that all `6148.1` lines come first in the ordering, but there's still a "tie" among the six different lines that have that value.  Within that block of six lines, we can re-arrange the individual lines so that the values of a different column are put into order, all while keeping the primary sort order of (`6148.1`, `6148.2`, `6148.4`) intact.  This is a **secondary sort**, which "breaks the tie" whenever the primary sort values are equal.

For instance, with the primary sort accomplished as above, we may want to perform a secondary sort on the `TIME OVER THRESHOLD` column.  The results look like this:

```
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6148.1	2457880	0.50539	0.50972	 3.74	43426	43400
6148.1	2457880	0.50091	0.50262	18.75	43375	43350
6148.1	2457880	0.50789	0.50960	18.76	43450	43426
6148.1	2457880	0.50040	0.50355	20.00	43475	43475
6148.1	2457880	0.50096	0.50569	30.00	43400	43400
6148.1	2457880	0.50751	0.50802	34.99	43325	43325
6148.2	2457880	0.50050	0.50508	21.24	43575	43500
6148.2	2457880	0.50564	0.50616	35.01	43500	43500
6148.2	2457880	0.50330	0.50815	38.75	43425	43400
6148.2	2457880	0.50059	0.50978	42.50	43475	43425
6148.4	2457880	0.50616	0.50497	16.25	43500	43526
6148.4	2457880	0.50828	0.50577	23.75	43376	43350
6148.4	2457880	0.50246	0.50139	25.00	43500	43500
```

The data remains sorted by `ID.CHANNEL`, but whenever there's a tie for that value, the data is now sorted by the `TIME OVER THRESHOLD` value.
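The same idea can be sketched in Python by sorting on a pair of values.  This uses Python's built-in `sorted`, not `Sort.pl` itself, and keeps only the `ID.CHANNEL` and `TIME OVER THRESHOLD` columns of the fake data to stay short.

```python
# (ID.CHANNEL, TIME OVER THRESHOLD) pairs from the unsorted fake data, in file order
rows = [
    ("6148.1", 34.99), ("6148.1", 18.75), ("6148.4", 23.75), ("6148.2", 42.50),
    ("6148.1", 30.00), ("6148.1", 20.00), ("6148.1", 18.76), ("6148.2", 38.75),
    ("6148.1", 3.74),  ("6148.2", 35.01), ("6148.4", 16.25), ("6148.2", 21.24),
    ("6148.4", 25.00),
]

# The key is a tuple: the first element is the primary sort value and the
# second is the secondary sort value, used only to break ties in the first
sorted_rows = sorted(rows, key=lambda row: (row[0], row[1]))

for channel, time_over_threshold in sorted_rows:
    print(channel, time_over_threshold)
```

Python compares tuples element by element, so the second element only matters when the first elements are equal.  That is exactly the "tiebreaker" behavior of a secondary sort.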

You might ask, "What if there's a tie in the secondary sort column?"  Could we define a "tertiary sort" to break that tie, too?  Absolutely!  In principle, you can define as many "tiebreaker" sorts as there are columns of data.  In this specific case, however, the `Sort.pl` script doesn't define anything beyond a secondary sort because it has never been useful for any of the analyses in the Cosmic Ray e-Lab.  If two lines of data are tied for both the primary sort and secondary sort values, `Sort.pl` leaves them in whatever order it found them in the input file (which is typically in order of the `RISING EDGE` value).
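A sort that leaves tied lines in their original order is called a *stable* sort.  Python's built-in `sorted` is guaranteed to be stable, so it can illustrate the behavior.  The rows below are invented for this example; the third field just records where each row started.

```python
# Hypothetical rows: (ID.CHANNEL, time over threshold, original position)
rows = [
    ("6148.1", 30.00, "first in file"),
    ("6148.2", 12.50, "second in file"),
    ("6148.1", 30.00, "third in file"),
]

# Sort on the first two fields only; the third field is not part of the key
result = sorted(rows, key=lambda row: (row[0], row[1]))

for row in result:
    print(row)
```

The two `6148.1` rows are tied on both sort values, so they come out in the same relative order they went in: "first in file" still precedes "third in file".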

Now that you understand what "primary" and "secondary" sorts are, we're ready to explore the `Sort.pl` data transformation using our standard method of investigation.

### 1) Investigating the input data

In this case, you've already thoroughly investigated the input data when you read through the file output looking for "time over threshold" values in a given range.  Good work!

### 2) Applying the data transformation

Remember that the command-line call to `Sort.pl` is of the form

`$ perl ./perl/Sort.pl <input file> <output file> <primary sort column> <secondary sort column>`

We're interested in "time over threshold," so we'll sort by that as the primary column (that's column 5).  We don't have any particular interest in any other values, so we'll take column 1, the `ID.CHANNEL` value, as secondary sort.  We'll call the output file `6148.2017.0507.0-100.sorted01` and put it in the `outputs/` folder. The command will be

`$ perl ./perl/Sort.pl files/6148.2017.0507.0-100.thresh outputs/6148.2017.0507.0-100.sorted01 5 1`

Let's run it:

In [6]:
!perl ./perl/Sort.pl files/6148.2017.0507.0-100.thresh outputs/6148.2017.0507.0-100.sorted01 5 1

md5s COMPUTED:e1195c7001d8913fddce1469f5031eb7 FROMFILE:ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)


### 3) Investigating the output data

Before we look at the output, we'll set up the same test as before.  At the moment you run the next cell, start a timer.  In the entry box below the cell's output, copy and paste all lines of data with a "time over threshold" value between **15.24** and **21.80**.  When you're satisfied you've found them all, stop the timer.

In [None]:
!head -103 outputs/6148.2017.0507.0-100.sorted01

In [8]:
#
# Ignore this box.  It's just to make the text entry boxes below.
#
dataFound = widgets.Textarea(
    value='',
    placeholder='Paste matching data lines here',
    description='Data:',
    disabled=False,
    layout=widgets.Layout(width='100%', height='200px')
)

numberFound = widgets.Text(
    value='',
    placeholder='Number of data lines',
    description='# lines:',
    disabled=False
)

timeToFind = widgets.Text(
    value='',
    placeholder='Time',
    description='Time (s):',
    disabled=False
)

display(dataFound)
display(numberFound)
display(timeToFind)

At this point, you should have a good idea of how sorting works as a data transformation, how to use the `Sort.pl` data transformation script, and why sorting data strategically can make data easier to find in a large set.  We'll provide one final example of the utility of sorting data for easy interpretation.
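Sorting helps computers for the same reason it helped you: once the values are in order, everything in a given range sits in one contiguous block.  The sketch below uses Python's standard `bisect` module to locate that block; the values are the "time over threshold" numbers from the fake data earlier, already sorted, and the range is assumed to include both endpoints.

```python
from bisect import bisect_left, bisect_right

# "Time over threshold" values from the fake data above, in ascending order
sorted_values = [3.74, 16.25, 18.75, 18.76, 20.00, 21.24, 23.75,
                 25.00, 30.00, 34.99, 35.01, 38.75, 42.50]

LOW, HIGH = 15.24, 21.80

# Index of the first value >= LOW, and one past the last value <= HIGH
start = bisect_left(sorted_values, LOW)
stop = bisect_right(sorted_values, HIGH)

in_range = sorted_values[start:stop]
print("Block runs from index", start, "to", stop - 1)
print(in_range)
```

`bisect` only works correctly on data that is already sorted, which is the whole point: it finds both ends of the block without examining every value in between.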

## More: Binning unsorted and sorted data

As you saw during the timed exercises, sorting data helps identify when data falls into specific "bins" or intervals.  The following bonus demonstration will help you visualize that effect.
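As a warm-up, here is a minimal sketch of what "binning" means.  It assumes equal-width bins that start at zero, and the helper name `bin_index` is invented for this example; the sample values come from the first `head` output in this notebook.

```python
from collections import defaultdict

def bin_index(value, bin_size):
    """Return which equal-width bin a value falls in, counting bins from 0."""
    return int(value // bin_size)

bin_size = 3  # width of each bin, in nanoseconds
values = [17.51, 16.25, 7.49, 22.49, 25.00]

# Group the values by bin number
bins = defaultdict(list)
for value in values:
    bins[bin_index(value, bin_size)].append(value)

for index in sorted(bins):
    lower = index * bin_size
    upper = lower + bin_size
    print("bin", index, "covers", lower, "to", upper, "and holds", bins[index])
```

Each bin includes its lower edge but not its upper edge, so a value of exactly 18 would land in the bin that starts at 18.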

In [31]:
from IPython.display import HTML, display

binNumber_Max = 38
binSize = 3
binList = [i*binSize for i in range(binNumber_Max+1)]
### binList = [0, 3, 6, 9, 12, 15, ...]
#print(binList)

# A 19-step blue gradient:
binColors1 = ['#f1f1ff', '#e4e4ff', '#d6d6ff', '#c9c9ff', '#bbbbff', '#aeaeff', '#a1a1ff', '#9393ff', '#8686ff', '#7878ff', '#6b6bff', '#5d5dff', '#5050ff', '#4343ff', '#3535ff', '#2828ff', '#1a1aff', '#0d0dff', '#0000ff']
#print(binColors)

# A 19-step green gradient:
binColors2 = ['#f1f7f1', '#e4efe4', '#d6e6d6', '#c9e0c9', '#bbd9bb', '#aed1ae', '#a1caa1', '#93c293', '#86bb86', '#78b378', '#6bac6b', '#5da55d', '#509d50', '#439543', '#358e35', '#288628', '#1a7f1a', '#0d770d', '#007000']
    
# For now:
#binColors = binColors1

# TODO: Alternate bins with blue/green
binColors = []
# "Zip" the color lists together into a list of alternating colors
for i in range(len(binColors1)):
    binColors.append(binColors1[i])
    binColors.append(binColors2[i])
    
sortColumn = 4

##header = '<tr style="background:light grey; text-align:left; float:left; width:100%; margin: -7px 0 -7px 0;">\
##            <th>ID.Channel</th><th>Day</th><th>Rising Edge</th><th>Falling Edge</th>\
##            <th>Time over Threshold</th></tr>'
header = '<tr style="background:lightgrey; text-align:left;">\
            <th>ID.Channel</th><th>Day</th><th>Rising Edge</th><th>Falling Edge</th>\
            <th>Time over Threshold</th></tr>'

htmlRows = [header]

# Read in the file
#with open('files/6148.2017.0507.0-100.thresh') as data:
with open('outputs/6148.2017.0507.0-100.sorted01') as data:

    linecount = 0
    for line in data:
        # If it's a comment, set it to a neutral color
        if str(line[0]) == '#':
            #htmlRows.append( '<tr style="background:{0}; text-align:left; float:left; width:100%; \
            #margin: -7px 0 -7px 0;"><td>{1}</td></tr>'.format('light grey', line) )
            pass
        else:
            # If it's data, break the line into an array of values
            values = line.split('\t')
            # Remove the linebreak from the last value
            values[-1] = values[-1].strip('\n')

            # If it's a rising or falling edge, round it to 5 decimal places
            values[2] = str(round(float(values[2]), 5))
            values[3] = str(round(float(values[3]), 5))
            
            # Remove the integer-value rising and falling edges
            del values[6]
            del values[5]
            
            # This is the number we want sorted:
            dataValue = float(values[sortColumn])

            # Scan through the bins to find which one the relevant value is in
            assignedBin = None
            for binIndex, binLower in enumerate(binList):
                binUpper = binLower + binSize
                if binLower <= dataValue < binUpper:
                    assignedBin = binIndex
                    break

            # If none of those worked, add the line to the last bin:
            if assignedBin is None:
                assignedBin = len(binList) - 1
            
            #newLine = '\t'.join(values)
            newLine = '<td>' + '</td><td>'.join(values) + '</td>'
            
            ##htmlRows.append( '<tr style="background:{0}; text-align:left; float:left; width:100%;\
            ##margin: -7px 0 -7px 0;">{1}</tr>'.format(str(binColors[assignedBin+1]), newLine) )

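            # The +1 offset skips the first color; clamp so the index stays inside the color list
            colorIndex = min(assignedBin + 1, len(binColors) - 1)
            htmlRows.append( '<tr style="background:{0}; text-align:left;">{1}</tr>'.format(binColors[colorIndex], newLine) )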
            
#htmlOutput = '<table style="text-align:left">'
htmlOutput = '<table style="font-family: monospace; margin: 0 auto; text-align:left; float:left;">'

for row in htmlRows:
    htmlOutput = htmlOutput + row

htmlOutput = htmlOutput + '</table>'

#print(htmlOutput)

display(HTML(htmlOutput))

ID.Channel,Day,Rising Edge,Falling Edge,Time over Threshold
6148.1,2457880,0.5032,0.5032,3.74
6148.2,2457880,0.5084,0.5084,6.24
6148.4,2457880,0.50776,0.50776,11.24
6148.1,2457880,0.50611,0.50611,11.25
6148.2,2457880,0.5156,0.5156,11.25
6148.3,2457880,0.5084,0.5084,11.25
6148.1,2457880,0.51524,0.51524,12.5
6148.2,2457880,0.50708,0.50708,13.75
6148.3,2457880,0.50781,0.50781,13.75
6148.4,2457880,0.50611,0.50611,13.75


In [10]:
# Excised from the above as overkill, but keep here 'til I figure out where to store it
# Create the list of bin colors
binColors = []
step = int(255/binNumber_Max)
for i in range(len(binList)):

    # Red and green increase together with the bin number; blue stays at maximum
    RR = i*step
    GG = i*step
    BB = 255

    # Write each channel as two zero-padded hex digits so every color has
    # the full '#RRGGBBAA' length; the trailing '80' is the alpha value
    col = '#{:02x}{:02x}{:02x}80'.format(RR, GG, BB)
    #print(col)

    binColors.append(col)

In [20]:
htmlOutput = """
<table style="font-family: monospace; margin: 0 auto;">
    <tr><th>Cats</th><th>Dogs</th><th>Mice</th></tr>
    <tr><td>Garfield</td><td>Odie</td><td>Jerry</td></tr>
    <tr><td>Tom</td><td>Marmaduke</td><td>Speedy</td></tr>
    <tr><td>Heathcliff</td><td>Pluto</td><td>Mickey</td></tr>
</table>
"""

display(HTML(htmlOutput))

Cats,Dogs,Mice
Garfield,Odie,Jerry
Tom,Marmaduke,Speedy
Heathcliff,Pluto,Mickey


In [11]:
!ls test_data

6119.2016.0104.1.test.thresh  freqOut02  freqOut07	    singleChannelOut4
6148.2016.0109.0.test.thresh  freqOut03  freqOut08	    sortOut
6203.2016.0104.1.test.thresh  freqOut04  singleChannelOut1  sortOut11
combineOut		      freqOut05  singleChannelOut2  sortOut15
freqOut01		      freqOut06  singleChannelOut3  sortOut51


`Sort` isn't typically applied directly to threshold files in practice, but just for fun we'll see what happens when we apply its data transformation to `test_data/6203.2016.0104.1.test.thresh`.

First, use the UNIX shell command `head` to get a quick look at what this input looks like:

In [12]:
!head -10 test_data/6203.2016.0104.1.test.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6203.1	2457392	0.2452114384916088	0.2452114384919415	28.75	2118626828567500	2118626828570375
6203.4	2457392	0.2452114384916232	0.2452114384919705	30.00	2118626828567625	2118626828570625
6203.2	2457392	0.2452114384916232	0.2452114384920283	35.00	2118626828567625	2118626828571125
6203.1	2457392	0.2452182596452402	0.2452182596455440	26.25	2118685763334875	2118685763337500
6203.4	2457392	0.2452182596452402	0.2452182596455874	30.00	2118685763334875	2118685763337875
6203.2	2457392	0.2452182596452402	0.2452182596456308	33.75	2118685763334875	2118685763338250
6203.4	2457392	0.2452190121639323	0.2452190121641204	16.25	2118692265096375	2118692265098000


`$ head -10` shows the first 10 lines of the file; in the case of `6203.2016.0104.1.test.thresh`, that happens to be the entire file.

We'll call the output file `sortOut`.  What do we pick for the two column sort parameters, though?  That isn't necessarily clear, so we'll experiment.

First, try the simplest option: 1 and 1.  And since we're going to experiment, we'll want to use different names for the output files so that we can compare them.  In this case, we'll use `sortOut11` instead, for a command of

`$ perl ./perl/Sort.pl test_data/6203.2016.0104.1.test.thresh test_data/sortOut11 1 1`

Running this on the command line, the output `sortOut11` looks like this:

In [13]:
!cat test_data/sortOut11

#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
#$md5
#md5_hex(0)
6203.1	2457392	0.2452114384916088	0.2452114384919415	28.75	2118626828567500	2118626828570375
6203.1	2457392	0.2452182596452402	0.2452182596455440	26.25	2118685763334875	2118685763337500
6203.2	2457392	0.2452114384916232	0.2452114384920283	35.00	2118626828567625	2118626828571125
6203.2	2457392	0.2452182596452402	0.2452182596456308	33.75	2118685763334875	2118685763338250
6203.4	2457392	0.2452114384916232	0.2452114384919705	30.00	2118626828567625	2118626828570625
6203.4	2457392	0.2452182596452402	0.2452182596455874	30.00	2118685763334875	2118685763337875
6203.4	2457392	0.2452190121639323	0.2452190121641204	16.25	2118692265096375	2118692265098000


With a name like `Sort.pl`, we're naturally looking for one of these columns to be sorted. A quick inspection reveals that only column 1, `ID.CHANNEL`, is in any sort of order, and that's by the channel number in ascending order (1,2,4).  

Any time we're sorting multi-column data like this, though, there's a little ambiguity.  Take the three `6203.4` entries, for example: they all have the same value for column 1, so how do we decide what order to put *them* in?  Looking just at those three values, we can see that they're in 

* Ascending order by all four RISING EDGE and FALLING EDGE columns
* Descending order by the TIME OVER THRESHOLD column

In fact, both of the above points are true for the two respective `6203.1` and `6203.2` data lines, as well.  Can we conclude anything about how `Sort.pl` performs what we'll call a "secondary sort?"

I can think of four possibilities offhand:
Having performed the primary sort on column 1, `Sort.pl`

1) Searches for the next column that isn't already sorted and sorts by it - in this case column 3, the first RISING EDGE column

2) Is programmed to secondary sort in descending order on column 5, the TIME OVER THRESHOLD column

3) Performs no further sorting; the apparent sorting by RISING EDGE is due to the fact that data lines in threshold files are already sequenced by time

4) Performs no further sorting; the apparent sorting by EDGE times or TIME OVER THRESHOLD is the type of coincidence that shows up when you look at small numbers of data points




**Exercise 1**

Based on the output `sortOut11`, can you think of any other explanations for what `Sort.pl` might perform as a secondary sort?

*write your answer here*

**Exercise 2**

Create a mock data file of 5 lines of data in 7 columns, like

```
 1  7 11  3  5  1  2
 5  5 12  4  2 11 32
77  1  6  1  6  1  6
10  4  4  4 21  7 12
19 30  1  7  5  2  9
```

For simplicity, use integers between 1-100 and don't include a header.  Design your mock data file so that if we apply `Sort.pl` to it, the result would distinguish among the four possibilities listed above.  If you think one or more of the possibilities isn't resolvable in this way, explain why.

*write your mock data here, between the code block delimiters:*

*Hint: start by creating a mock data file that will indicate if option one is true.  Then, adapt it for each subsequent option*

```

```

We'll experiment some more.  If the two column sort parameters both refer to columns, then `1 1` is redundant.  We should test a different combination, say `1 5`:

`$ perl ./perl/Sort.pl test_data/6203.2016.0104.1.test.thresh test_data/sortOut15 1 5`

(Note that we've changed the output file to `test_data/sortOut15` so it won't overwrite the previous output file.  That way, we can compare the two).

The output of this command is

In [14]:
!cat test_data/sortOut15

#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
#$md5
#md5_hex(0)
6203.1	2457392	0.2452182596452402	0.2452182596455440	26.25	2118685763334875	2118685763337500
6203.1	2457392	0.2452114384916088	0.2452114384919415	28.75	2118626828567500	2118626828570375
6203.2	2457392	0.2452182596452402	0.2452182596456308	33.75	2118685763334875	2118685763338250
6203.2	2457392	0.2452114384916232	0.2452114384920283	35.00	2118626828567625	2118626828571125
6203.4	2457392	0.2452190121639323	0.2452190121641204	16.25	2118692265096375	2118692265098000
6203.4	2457392	0.2452114384916232	0.2452114384919705	30.00	2118626828567625	2118626828570625
6203.4	2457392	0.2452182596452402	0.2452182596455874	30.00	2118685763334875	2118685763337875


The first column is still sorted, and column 5 (TIME OVER THRESHOLD) isn't, overall.  But compare column 5 in this output to column 5 in the previous output, `sortOut11`.  It's clearly in a different order -- looking more closely, we see that the values in column 5 are in order *for each value in column 1*. That is, for all lines that start with `6203.1`, the lines have been rearranged so that the column 5 values are in ascending order.  For all lines that start with `6203.2`, the same is true, and so on.

It seems like we've figured out what the sort parameters do: the first parameter is the *primary sort*, the column that is sorted over the entirety of the output file, while the second parameter is the *secondary sort*, the column that is sorted only within matching values of the primary sort column.
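A hypothesis like this can also be checked with code instead of by eye.  The sketch below is one possible check, not part of the e-Lab: the function `is_sorted_by` is invented for this example, column numbers count from 1 as they do for `Sort.pl`, and the data is the `sortOut15` output trimmed to its first five columns.

```python
# The sortOut15 data lines, trimmed to the first five tab-separated columns
lines = [
    "6203.1\t2457392\t0.2452182596452402\t0.2452182596455440\t26.25",
    "6203.1\t2457392\t0.2452114384916088\t0.2452114384919415\t28.75",
    "6203.2\t2457392\t0.2452182596452402\t0.2452182596456308\t33.75",
    "6203.2\t2457392\t0.2452114384916232\t0.2452114384920283\t35.00",
    "6203.4\t2457392\t0.2452190121639323\t0.2452190121641204\t16.25",
    "6203.4\t2457392\t0.2452114384916232\t0.2452114384919705\t30.00",
    "6203.4\t2457392\t0.2452182596452402\t0.2452182596455874\t30.00",
]
rows = [[float(field) for field in line.split("\t")] for line in lines]

def is_sorted_by(rows, primary, secondary):
    """True if rows ascend by the primary column, with ties ascending by the secondary."""
    keys = [(row[primary - 1], row[secondary - 1]) for row in rows]
    # Every (primary, secondary) pair must be <= the pair on the next line
    return all(a <= b for a, b in zip(keys, keys[1:]))

print("Sorted by column 1, then column 5?", is_sorted_by(rows, 1, 5))
print("Sorted by column 5, then column 1?", is_sorted_by(rows, 5, 1))
```

A check like this can only show that the output is *consistent* with the hypothesis; it can't rule out every other explanation on its own.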

To test that hypothesis, let's play around a bit.  We'll do the same as the last data transformation, but reverse the column parameters.

`$ perl ./perl/Sort.pl test_data/6203.2016.0104.1.test.thresh test_data/sortOut51 5 1`

In [15]:
!cat test_data/sortOut51

#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
#$md5
#md5_hex(0)
6203.4	2457392	0.2452190121639323	0.2452190121641204	16.25	2118692265096375	2118692265098000
6203.1	2457392	0.2452182596452402	0.2452182596455440	26.25	2118685763334875	2118685763337500
6203.1	2457392	0.2452114384916088	0.2452114384919415	28.75	2118626828567500	2118626828570375
6203.4	2457392	0.2452114384916232	0.2452114384919705	30.00	2118626828567625	2118626828570625
6203.4	2457392	0.2452182596452402	0.2452182596455874	30.00	2118685763334875	2118685763337875
6203.2	2457392	0.2452182596452402	0.2452182596456308	33.75	2118685763334875	2118685763338250
6203.2	2457392	0.2452114384916232	0.2452114384920283	35.00	2118626828567625	2118626828571125


The first column is no longer in order, but column 5 is.  Since there are few repeating values within column 5, it's hard to see that column one is the secondary sort, but the output is at least consistent with that idea.

Play around with the other data available in the `test_data` directory until you're satisfied you understand how `Sort.pl` works.

To wrap things up, we'll do one more important test of the `Sort.pl` script: what happens if we try to use it in the wrong manner?  For example, what if we attempt to sort on column 0?

`$ perl ./perl/Sort.pl test_data/6203.2016.0104.1.test.thresh test_data/sortOut01 0 1`

Trying this from the command line, we find

```
$ perl ./perl/Sort.pl test_data/6203.2016.0104.1.test.thresh test_data/sortOut01 0 1
sort: field number is zero: invalid field specification ‘0,0’
```

The script has given us an error that (mostly) explains the mistake.  It's always good programming practice to anticipate user errors!
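If you were wrapping `Sort.pl` in your own code, you could catch this kind of mistake before the script ever runs.  The sketch below shows one way to do that; the function `validate_sort_columns` is hypothetical, and the default of 7 columns is an assumption based on the threshold file format shown in this notebook.

```python
def validate_sort_columns(primary, secondary, n_columns=7):
    """Raise ValueError unless both column numbers are integers from 1 to n_columns."""
    for name, column in (("primary", primary), ("secondary", secondary)):
        if not isinstance(column, int) or not 1 <= column <= n_columns:
            raise ValueError(
                "{0} sort column must be a whole number from 1 to {1}, got {2!r}".format(
                    name, n_columns, column)
            )

# A valid choice passes silently
validate_sort_columns(5, 1)

# An invalid choice is reported with a message that names the problem
try:
    validate_sort_columns(0, 1)
except ValueError as error:
    message = str(error)
    print("Rejected:", message)
```

Checking inputs early lets the error message speak in terms the user chose (a "primary sort column") rather than in terms of whatever tool happens to fail further down the line.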