# Sort

To use the data transformation script `Sort.pl`, we provide it with (in order): an input file, what to name its output file, and two column-sorting parameters.

`$ perl ./perl/Sort.pl [file to sort] [output file] [1st column sorting for (start at #1)] [2nd column sorting for]`


We'll try it out on the test data in the `test_data` directory.  Use the UNIX shell command `$ ls test_data` to see what's there:

In [1]:
!ls test_data

6119.2016.0104.1.test.thresh  6203.2016.0104.1.test.thresh  combineOut


`Sort` isn't typically applied directly to threshold files in practice, but just for fun we'll see what happens when we apply its data transformation to `test_data/6203.2016.0104.1.test.thresh`.

First, use the UNIX shell command `head` to get a quick look at what this input looks like:

In [2]:
!head -10 test_data/6203.2016.0104.1.test.thresh

#$md5
#md5_hex(0)
#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
6203.1	2457392	0.2452114384916088	0.2452114384919415	28.75	2118626828567500	2118626828570375
6203.4	2457392	0.2452114384916232	0.2452114384919705	30.00	2118626828567625	2118626828570625
6203.2	2457392	0.2452114384916232	0.2452114384920283	35.00	2118626828567625	2118626828571125
6203.1	2457392	0.2452182596452402	0.2452182596455440	26.25	2118685763334875	2118685763337500
6203.4	2457392	0.2452182596452402	0.2452182596455874	30.00	2118685763334875	2118685763337875
6203.2	2457392	0.2452182596452402	0.2452182596456308	33.75	2118685763334875	2118685763338250
6203.4	2457392	0.2452190121639323	0.2452190121641204	16.25	2118692265096375	2118692265098000


`$ head -10` shows the first 10 lines of the file; in the case of `6203.2016.0104.1.test.thresh`, that happens to be the entire file.
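The layout `head` reveals (comment lines starting with `#`, followed by data lines of seven fields) can be captured in a few lines of Python. This is only an illustrative sketch: the helper name `split_threshold_text` is hypothetical, and it assumes the fields are tab-separated, as they appear in the listing above.

```python
def split_threshold_text(text):
    """Separate '#' header lines from data rows (assumed tab-separated)."""
    header, rows = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            header.append(line)
        else:
            rows.append(line.split("\t"))
    return header, rows


# Two header lines and the first two data lines from the listing above.
sample = (
    "#$md5\n"
    "#md5_hex(0)\n"
    "6203.1\t2457392\t0.2452114384916088\t0.2452114384919415\t28.75"
    "\t2118626828567500\t2118626828570375\n"
    "6203.4\t2457392\t0.2452114384916232\t0.2452114384919705\t30.00"
    "\t2118626828567625\t2118626828570625\n"
)

header, rows = split_threshold_text(sample)
print(len(header), "header lines,", len(rows), "data rows,", len(rows[0]), "columns")
```

Keeping the header separate from the data makes it easier to reason about which lines a sort should actually reorder.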

We'll call the output file `sortOut`.  What do we pick for the two column sort parameters, though?  That isn't necessarily clear, so we'll experiment.

First, try the simplest option: 1 and 1.  And since we're going to experiment, we'll want to use different names for the output files so that we can compare them.  In this case, we'll use `sortOut11` instead, for a command of

`$ perl ./perl/Sort.pl test_data/6203.2016.0104.1.test.thresh test_data/sortOut11 1 1`

Running this on the command line produces the output file `sortOut11`, which looks like this:

In [5]:
!cat test_data/sortOut11

#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
#$md5
#md5_hex(0)
6203.1	2457392	0.2452114384916088	0.2452114384919415	28.75	2118626828567500	2118626828570375
6203.1	2457392	0.2452182596452402	0.2452182596455440	26.25	2118685763334875	2118685763337500
6203.2	2457392	0.2452114384916232	0.2452114384920283	35.00	2118626828567625	2118626828571125
6203.2	2457392	0.2452182596452402	0.2452182596456308	33.75	2118685763334875	2118685763338250
6203.4	2457392	0.2452114384916232	0.2452114384919705	30.00	2118626828567625	2118626828570625
6203.4	2457392	0.2452182596452402	0.2452182596455874	30.00	2118685763334875	2118685763337875
6203.4	2457392	0.2452190121639323	0.2452190121641204	16.25	2118692265096375	2118692265098000


With a name like `Sort.pl`, we're naturally looking for one of these columns to be sorted. A quick inspection reveals that only column 1, `detectorID.channel`, is in any sort of order, and that's by the channel number in ascending order (1,2,4).  

Any time we're sorting multi-column data like this, though, there's a little ambiguity.  Take the three `6203.4` entries, for example: they all have the same value for column 1, so how do we decide what order to put *them* in?  Looking just at those three values, we can see that they're in 

* Ascending order by all four RISING EDGE and FALLING EDGE columns
* Descending order by the TIME OVER THRESHOLD column

In fact, both points also hold for the two `6203.1` lines and for the two `6203.2` lines.  Can we conclude anything about how `Sort.pl` performs what we'll call a "secondary sort"?

I can think of four possibilities offhand.  Having performed the primary sort on column 1, `Sort.pl`:

1) Searches for the next column that isn't already sorted and sorts by it - in this case column 3, the first RISING EDGE column

2) Is programmed to secondary sort in descending order on column 5, the TIME OVER THRESHOLD column

3) Performs no further sorting; the apparent sorting by RISING EDGE is due to the fact that data lines in threshold files are already sequenced by time

4) Performs no further sorting; the apparent sorting by EDGE times or TIME OVER THRESHOLD is the type of coincidence that shows up when you look at small numbers of data points
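The within-group ordering noted above (EDGE columns ascending, TIME OVER THRESHOLD descending) can be checked mechanically rather than by eye. The sketch below is a hypothetical helper, not part of `Sort.pl`; it assumes every tested column can be read as a number, and it classifies a group with ties such as `30.00, 30.00, 16.25` as descending.

```python
from itertools import groupby

# Data lines of sortOut11 as shown above (whitespace-separated here for readability).
SORT_OUT_11 = """\
6203.1 2457392 0.2452114384916088 0.2452114384919415 28.75 2118626828567500 2118626828570375
6203.1 2457392 0.2452182596452402 0.2452182596455440 26.25 2118685763334875 2118685763337500
6203.2 2457392 0.2452114384916232 0.2452114384920283 35.00 2118626828567625 2118626828571125
6203.2 2457392 0.2452182596452402 0.2452182596456308 33.75 2118685763334875 2118685763338250
6203.4 2457392 0.2452114384916232 0.2452114384919705 30.00 2118626828567625 2118626828570625
6203.4 2457392 0.2452182596452402 0.2452182596455874 30.00 2118685763334875 2118685763337875
6203.4 2457392 0.2452190121639323 0.2452190121641204 16.25 2118692265096375 2118692265098000
"""


def order_within_groups(rows, group_col, test_col):
    """Classify the order of test_col inside each run of equal group_col values.

    Column numbers start at 1. Rows sharing a group_col value must be adjacent,
    which holds for output that is already sorted on that column.
    """
    report = {}
    for key, group in groupby(rows, key=lambda row: row[group_col - 1]):
        values = [float(row[test_col - 1]) for row in group]
        pairs = list(zip(values, values[1:]))
        ascending = all(a <= b for a, b in pairs)
        descending = all(a >= b for a, b in pairs)
        if ascending and descending:
            report[key] = "constant"
        elif ascending:
            report[key] = "ascending"
        elif descending:
            report[key] = "descending"
        else:
            report[key] = "unsorted"
    return report


rows = [line.split() for line in SORT_OUT_11.splitlines()]
for col in (3, 4, 5, 6, 7):
    print(col, order_within_groups(rows, 1, col))
```

A check like this only describes the output; it cannot say which of the four possibilities produced it, which is why the mock data in Exercise 2 is still needed.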




**Exercise 1**

Based on the output `sortOut11`, can you think of any other explanations for what `Sort.pl` might perform as a secondary sort?

*write your answer here*

**Exercise 2**

Create a mock data file of 5 lines of data in 7 columns, like

```
 1  7 11  3  5  1  2
 5  5 12  4  2 11 32
77  1  6  1  6  1  6
10  4  4  4 21  7 12
19 30  1  7  5  2  9
```

For simplicity, use integers between 1 and 100 and don't include a header.  Design your mock data file so that if we apply `Sort.pl` to it, the result would distinguish among the four possibilities listed above.  If you think one or more of the possibilities isn't resolvable in this way, explain why.

*Hint: start by creating a mock data file that will indicate whether option 1 is true.  Then adapt it for each subsequent option.*

*write your mock data here, between the code block delimiters:*

```

```

We'll experiment some more.  If the two column sort parameters both refer to columns, then `1 1` is redundant.  We should test a different combination, say `1 5`:

`$ perl ./perl/Sort.pl test_data/6203.2016.0104.1.test.thresh test_data/sortOut15 1 5`

(Note that we've changed the output file to `test_data/sortOut15` so it won't overwrite the previous output file.  That way, we can compare the two.)

The output of this command is

In [1]:
!cat test_data/sortOut15

#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
#$md5
#md5_hex(0)
6203.1	2457392	0.2452182596452402	0.2452182596455440	26.25	2118685763334875	2118685763337500
6203.1	2457392	0.2452114384916088	0.2452114384919415	28.75	2118626828567500	2118626828570375
6203.2	2457392	0.2452182596452402	0.2452182596456308	33.75	2118685763334875	2118685763338250
6203.2	2457392	0.2452114384916232	0.2452114384920283	35.00	2118626828567625	2118626828571125
6203.4	2457392	0.2452190121639323	0.2452190121641204	16.25	2118692265096375	2118692265098000
6203.4	2457392	0.2452114384916232	0.2452114384919705	30.00	2118626828567625	2118626828570625
6203.4	2457392	0.2452182596452402	0.2452182596455874	30.00	2118685763334875	2118685763337875


The first column is still sorted, and column 5 (TIME OVER THRESHOLD) isn't, overall.  But compare column 5 in this output to column 5 in the previous output, `sortOut11`.  It's clearly in a different order. Looking more closely, we see that the values in column 5 are in order *for each value in column 1*. That is, for all lines that start with `6203.1`, the lines have been rearranged so that the column 5 values are in ascending order.  For all lines that start with `6203.2`, the same is true, and so on.

It seems like we've figured out what the sort parameters do: the first parameter is the *primary sort*, the column that is sorted over the entirety of the output file, while the second parameter is the *secondary sort*, the column that is sorted only within matching values of the primary sort column.
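The primary/secondary hypothesis can be written down as a two-key sort and compared against the real output. The following is a minimal Python sketch of that idea, not the actual implementation of `Sort.pl`; the function name `two_key_sort` is hypothetical, and it assumes both columns are compared numerically in ascending order.

```python
# Data lines of the input file in their original order
# (whitespace-separated here for readability).
INPUT_ROWS = """\
6203.1 2457392 0.2452114384916088 0.2452114384919415 28.75 2118626828567500 2118626828570375
6203.4 2457392 0.2452114384916232 0.2452114384919705 30.00 2118626828567625 2118626828570625
6203.2 2457392 0.2452114384916232 0.2452114384920283 35.00 2118626828567625 2118626828571125
6203.1 2457392 0.2452182596452402 0.2452182596455440 26.25 2118685763334875 2118685763337500
6203.4 2457392 0.2452182596452402 0.2452182596455874 30.00 2118685763334875 2118685763337875
6203.2 2457392 0.2452182596452402 0.2452182596456308 33.75 2118685763334875 2118685763338250
6203.4 2457392 0.2452190121639323 0.2452190121641204 16.25 2118692265096375 2118692265098000
"""


def two_key_sort(rows, primary, secondary):
    """Sort by the primary column, breaking ties with the secondary column.

    Column numbers start at 1. Values are compared as numbers (an assumption).
    """
    def key(row):
        return (float(row[primary - 1]), float(row[secondary - 1]))

    # sorted() is stable: rows that tie on both keys keep their input order.
    return sorted(rows, key=key)


rows = [line.split() for line in INPUT_ROWS.splitlines()]
sorted_15 = two_key_sort(rows, 1, 5)
for row in sorted_15:
    print(row[0], row[4])
```

The printed pairs of column 1 and column 5 follow the same sequence as the data lines of `sortOut15`, which is what the hypothesis predicts. Agreement on one small file supports the idea but does not prove it.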

To test that hypothesis, let's play around a bit.  We'll do the same as the last data transformation, but reverse the column parameters.

`$ perl ./perl/Sort.pl test_data/6203.2016.0104.1.test.thresh test_data/sortOut51 5 1`

In [2]:
!cat test_data/sortOut51

#ID.CHANNEL, Julian Day, RISING EDGE(sec), FALLING EDGE(sec), TIME OVER THRESHOLD (nanosec), RISING EDGE(INT), FALLING EDGE(INT)
#$md5
#md5_hex(0)
6203.4	2457392	0.2452190121639323	0.2452190121641204	16.25	2118692265096375	2118692265098000
6203.1	2457392	0.2452182596452402	0.2452182596455440	26.25	2118685763334875	2118685763337500
6203.1	2457392	0.2452114384916088	0.2452114384919415	28.75	2118626828567500	2118626828570375
6203.4	2457392	0.2452114384916232	0.2452114384919705	30.00	2118626828567625	2118626828570625
6203.4	2457392	0.2452182596452402	0.2452182596455874	30.00	2118685763334875	2118685763337875
6203.2	2457392	0.2452182596452402	0.2452182596456308	33.75	2118685763334875	2118685763338250
6203.2	2457392	0.2452114384916232	0.2452114384920283	35.00	2118626828567625	2118626828571125


The first column is no longer in order, but column 5 is.  The only repeated value in column 5 is `30.00`, and both of those lines have the same column 1 value (`6203.4`), so this output can't show directly that column 1 is the secondary sort.  It is at least consistent with that idea.
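Whether a data set can reveal a secondary sort at all depends on its ties: a primary value must repeat, and the repeated lines must differ in the secondary column. The sketch below (a hypothetical helper, using only columns 1 and 5 of the seven test lines) looks for such informative ties.

```python
from collections import defaultdict

# (column 1, column 5) pairs from the seven data lines of the test file.
PAIRS = [
    ("6203.1", "28.75"), ("6203.4", "30.00"), ("6203.2", "35.00"),
    ("6203.1", "26.25"), ("6203.4", "30.00"), ("6203.2", "33.75"),
    ("6203.4", "16.25"),
]


def informative_ties(pairs):
    """Return primary values that occur with at least two distinct secondary values.

    Each pair is (primary_value, secondary_value).
    """
    seen = defaultdict(set)
    for primary_value, secondary_value in pairs:
        seen[primary_value].add(secondary_value)
    return sorted(value for value, others in seen.items() if len(others) > 1)


# Column 1 as primary, column 5 as secondary.
ties_15 = informative_ties(PAIRS)
# Column 5 as primary, column 1 as secondary: swap each pair.
ties_51 = informative_ties([(tot, channel) for channel, tot in PAIRS])

print(ties_15)
print(ties_51)
```

For this file the second list is empty, so a `5 1` run cannot confirm the secondary sort on its own. A mock file with repeated column 5 values on different channels would.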

Play around with the other data available in the `test_data` directory until you're satisfied you understand how `Sort.pl` works.

To wrap things up, we'll do one more important test of the `Sort.pl` script: what happens if we try to use it in the wrong manner?  For example, what if we attempt to sort on column 0?

`$ perl ./perl/Sort.pl test_data/6203.2016.0104.1.test.thresh test_data/sortOut01 0 1`

Trying this from the command line, we find

```
$ perl ./perl/Sort.pl test_data/6203.2016.0104.1.test.thresh test_data/sortOut01 0 1
sort: field number is zero: invalid field specification ‘0,0’
```

The script has given us an error that (mostly) explains the mistake.  It's always good programming practice to anticipate user errors!
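Anticipating this kind of mistake can be as simple as validating the column numbers before any sorting starts. Here is a minimal Python sketch of that practice; the function name and the wording of the message are hypothetical and are not taken from `Sort.pl`.

```python
def check_sort_columns(columns, n_columns):
    """Raise ValueError unless every column number lies in 1..n_columns."""
    for col in columns:
        if not 1 <= col <= n_columns:
            raise ValueError(
                f"sort column {col} is out of range; columns are numbered 1 to {n_columns}"
            )


check_sort_columns((1, 5), 7)  # valid: returns without raising

try:
    check_sort_columns((0, 1), 7)
except ValueError as err:
    message = str(err)
    print(message)
```

Checking up front lets the message name the offending parameter and the valid range, so the user doesn't have to decode an error raised deeper in the program.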