# Selecting Data

This notebook shows how to select data in `dysh`.
We illustrate this using the `Selection` class of `dysh`.
We use this approach to show the various methods available; however, using a `Selection` object on its own has no effect on the data itself.
At the end of the notebook we show how the same selections can be accomplished using a `GBTFITSLoad` object, so that the selections made are actually applied to the data.

You can find a copy of this tutorial as a Jupyter notebook [here](https://github.com/GreenBankObservatory/dysh/blob/main/notebooks/examples/selection.ipynb) or download it by right-clicking <a href="https://raw.githubusercontent.com/GreenBankObservatory/dysh/refs/heads/main/notebooks/examples/selection.ipynb" download>here</a> and selecting "Save Link As".

## Loading Modules
We start by loading the modules we will use in this tutorial. 

In [1]:
# These modules are required for the tutorial.
import astropy.units as u
from astropy.time import Time
from dysh.fits.gbtfitsload import GBTFITSLoad
from dysh.util.selection import Selection

# These modules are only used to download the data.
from pathlib import Path
from dysh.util.download import from_url

## Data Retrieval

Download the example SDFITS data, if necessary.

The code below will download an SDFITS file from http://www.gb.nrao.edu/dysh/example_data and put it in a data directory.
The data directory must exist in the directory this notebook is run from (the code below creates it if it is missing); otherwise, the downloaded SDFITS file will be named `data`.
The example will work either way, but be aware if you find a new file named data after running it.

In [2]:
url = "http://www.gb.nrao.edu/dysh/example_data/hi-survey/data/AGBT04A_008_02.raw.acs/AGBT04A_008_02.raw.acs.fits"
savepath = Path.cwd() / "data"
savepath.mkdir(exist_ok=True) # Create the data directory if it does not exist.
filename = from_url(url, savepath)

## Data Loading

Next, we use `GBTFITSLoad` to load the data, and then its `summary` method to inspect its contents.

In [3]:
sdfits = GBTFITSLoad(filename)
sdfits.summary()

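| SCAN | OBJECT | VELOCITY | PROC | PROCSEQN | RESTFREQ | DOPFREQ | # IF | # POL | # INT | # FEED | AZIMUTH | ELEVATION |
|------|--------|----------|------|----------|----------|---------|------|-------|-------|--------|---------|-----------|
| 220 | 3C286 | 0.0 | OffOn | 1 | 1.4 | 1.4 | 1 | 2 | 6 | 1 | 185.280583 | 82.024626 |
| 221 | 3C286 | 0.0 | OffOn | 2 | 1.4 | 1.4 | 1 | 2 | 6 | 1 | 187.213578 | 81.998047 |
| 222 | 3C286 | 0.0 | OffOn | 1 | 1.4 | 1.4 | 1 | 2 | 6 | 1 | 193.833116 | 81.841281 |
| 223 | 3C286 | 0.0 | OffOn | 2 | 1.4 | 1.4 | 1 | 2 | 6 | 1 | 195.676641 | 81.778794 |
| 224 | 3C286 | 0.0 | OffOn | 1 | 1.4 | 1.4 | 1 | 2 | 6 | 1 | 195.518231 | 80.291009 |
| 225 | 3C286 | 0.0 | OffOn | 2 | 1.4 | 1.4 | 1 | 2 | 5 | 1 | 199.935766 | 81.600451 |
| 226 | 3C286 | 0.0 | OffOn | 1 | 1.4 | 1.4 | 1 | 2 | 6 | 1 | 200.833322 | 80.026463 |
| 227 | 3C286 | 0.0 | OffOn | 2 | 1.4 | 1.4 | 1 | 2 | 6 | 1 | 205.94706 | 81.260862 |
| 228 | B1328+254 | 0.0 | OffOn | 1 | 1.4 | 1.4 | 1 | 2 | 6 | 1 | 207.52569 | 73.984375 |
| 229 | B1328+254 | 0.0 | OffOn | 2 | 1.4 | 1.4 | 1 | 2 | 6 | 1 | 210.959989 | 75.158373 |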


## Create a Selection Object for SDFITS Data

We will show how to select data using a `Selection` object.
We start by creating the `Selection` object and putting it into a variable named `selection_object`.

In [4]:
selection_object = Selection(sdfits)

## Using Selection

Now we show various ways in which the `Selection` object can be used to select data.

### Select by Column Names

One way of selecting data is by specifying a value for a column name.
For example, we can select data which has OBJECT="U8249" and polarization number 0 using the following.

In [5]:
selection_object.select(object="U8249", plnum=0)

We can view the contents of the selection using its `show` method.

In [6]:
selection_object.show()

 ID    TAG    OBJECT PLNUM # SELECTED
--- --------- ------ ----- ----------
  0 89b653b7e  U8249     0        152


This displays the selection as a table.
In the background, each time we create a new selection, it is assigned an id and a tag.

We can also specify the tag name to have a more meaningful value. In this case we will select both polarizations.

In [7]:
selection_object.select(plnum=[0, 1], tag="plnums")

In [8]:
selection_object.show()

 ID    TAG    OBJECT PLNUM # SELECTED
--- --------- ------ ----- ----------
  0 89b653b7e  U8249     0        152
  1    plnums        [0,1]       3766


### Combining Selections

Once we have multiple selection rules in our `Selection` object, we can combine them into a single selection using the `final` attribute of `Selection`.
This returns a `pandas.DataFrame`.

In [9]:
selection_object.final

Unnamed: 0,OBJECT,BANDWID,DATE-OBS,DURATION,EXPOSURE,TSYS,TDIM7,TUNIT7,CTYPE1,CRVAL1,...,SITELAT,SITEELEV,EXTNAME,FITSINDEX,UTC,CHAN,PROC,OBSTYPE,SUBOBSMODE,INTNUM
0,U8249,12500000.0,2004-04-22T06:44:49.00,5.005,4.779488,1.0,"(32768,1,1,1)",Counts,FREQ-OBS,1.408421e+09,...,38.43312,824.595,SINGLE DISH,0,2004-04-22 06:44:49.000,,OffOn,PSWITCHOFF,TPWCAL,0
1,U8249,12500000.0,2004-04-22T06:44:49.00,5.005,4.779488,1.0,"(32768,1,1,1)",Counts,FREQ-OBS,1.408421e+09,...,38.43312,824.595,SINGLE DISH,0,2004-04-22 06:44:49.000,,OffOn,PSWITCHOFF,TPWCAL,0
2,U8249,12500000.0,2004-04-22T06:44:59.01,5.005,4.779488,1.0,"(32768,1,1,1)",Counts,FREQ-OBS,1.408421e+09,...,38.43312,824.595,SINGLE DISH,0,2004-04-22 06:44:59.010,,OffOn,PSWITCHOFF,TPWCAL,1
3,U8249,12500000.0,2004-04-22T06:44:59.01,5.005,4.779488,1.0,"(32768,1,1,1)",Counts,FREQ-OBS,1.408421e+09,...,38.43312,824.595,SINGLE DISH,0,2004-04-22 06:44:59.010,,OffOn,PSWITCHOFF,TPWCAL,1
4,U8249,12500000.0,2004-04-22T06:45:09.03,5.005,4.779488,1.0,"(32768,1,1,1)",Counts,FREQ-OBS,1.408421e+09,...,38.43312,824.595,SINGLE DISH,0,2004-04-22 06:45:09.030,,OffOn,PSWITCHOFF,TPWCAL,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
147,U8249,12500000.0,2004-04-22T07:45:01.00,5.005,4.779488,1.0,"(32768,1,1,1)",Counts,FREQ-OBS,1.408411e+09,...,38.43312,824.595,SINGLE DISH,0,2004-04-22 07:45:01.000,,Track,NONE,TPWCAL,0
148,U8249,12500000.0,2004-04-22T07:45:11.01,5.005,4.779488,1.0,"(32768,1,1,1)",Counts,FREQ-OBS,1.408411e+09,...,38.43312,824.595,SINGLE DISH,0,2004-04-22 07:45:11.010,,Track,NONE,TPWCAL,1
149,U8249,12500000.0,2004-04-22T07:45:11.01,5.005,4.779488,1.0,"(32768,1,1,1)",Counts,FREQ-OBS,1.408411e+09,...,38.43312,824.595,SINGLE DISH,0,2004-04-22 07:45:11.010,,Track,NONE,TPWCAL,1
150,U8249,12500000.0,2004-04-22T07:45:21.03,5.005,4.779488,1.0,"(32768,1,1,1)",Counts,FREQ-OBS,1.408411e+09,...,38.43312,824.595,SINGLE DISH,0,2004-04-22 07:45:21.030,,Track,NONE,TPWCAL,2


In this particular case, we have 152 rows.
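
The row count is consistent with the rules being combined as a logical AND: rule 0 matches 152 rows, rule 1 matches 3766 rows, and the combined result has the 152 rows common to both. The following is a minimal sketch of that idea in plain Python. It is not `dysh`'s implementation; the toy rows and the helpers `matching_rows` and `combine_rules` are hypothetical and exist only for illustration.

```python
# Toy stand-in for the SDFITS index: one dict per row (hypothetical values).
rows = [
    {"OBJECT": "U8249", "PLNUM": 0},
    {"OBJECT": "U8249", "PLNUM": 1},
    {"OBJECT": "3C286", "PLNUM": 0},
    {"OBJECT": "3C286", "PLNUM": 1},
]


def matching_rows(rows, **criteria):
    """Return the set of row indices that satisfy every criterion.

    A list value means "any of these values is acceptable".
    """
    matched = set()
    for index, row in enumerate(rows):
        keep = True
        for column, wanted in criteria.items():
            allowed = wanted if isinstance(wanted, list) else [wanted]
            if row[column] not in allowed:
                keep = False
                break
        if keep:
            matched.add(index)
    return matched


def combine_rules(rules):
    """Intersect the row sets of all rules (assumed logical AND)."""
    combined = None
    for rule in rules:
        combined = set(rule) if combined is None else combined & rule
    return set() if combined is None else combined


rule_0 = matching_rows(rows, OBJECT="U8249", PLNUM=0)
rule_1 = matching_rows(rows, PLNUM=[0, 1])
final_rows = sorted(combine_rules([rule_0, rule_1]))
print(final_rows)  # only the row shared by both rules remains
```

Under this assumption, adding a rule can only keep or shrink the combined result, never enlarge it.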

### Remove Selections

Selections can be removed by id or by tag.
Multiple rows with the same tag will all be removed.

In [10]:
selection_object.remove(id=0)
selection_object.remove(tag='plnums')
selection_object.show()

 ID TAG OBJECT PLNUM # SELECTED
--- --- ------ ----- ----------


To remove all selections, use `Selection.clear`:

In [11]:
selection_object.clear()

In [12]:
selection_object.show()

 ID TAG OBJECT BANDWID DATE-OBS ... PROC OBSTYPE SUBOBSMODE INTNUM # SELECTED
--- --- ------ ------- -------- ... ---- ------- ---------- ------ ----------


### Select by Range

It is also possible to define a selection given a range of values.
In this case the selection must be specified using either a list, `[]`, or a tuple, `()`, with a start and an end value.
Lower limits are given by `(value,None)` or `(value,)`.
Upper limits are given by `(None,value)`, since `(,value)` is not valid Python.
For coordinates the default unit is taken to be degrees.
Other units can be given explicitly.
Both `()` and `[]` are valid for indicating ranges, but the `(value,)` form of a lower limit must be a tuple.
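
The way open-ended limits behave can be sketched in plain Python. This is an illustration only, not `dysh`'s implementation: the helper `in_range` and the elevation values are hypothetical, and treating the limits as inclusive is an assumption.

```python
def in_range(value, bounds):
    """Check a value against (low, high) limits.

    A limit of None, or a missing upper limit as in (low,), is open-ended.
    ASSUMPTION: both limits are inclusive.
    """
    low = bounds[0]
    high = bounds[1] if len(bounds) > 1 else None
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


# Hypothetical elevations in degrees.
elevations = [73.98, 79.5, 80.0, 82.02]

above_75 = [e for e in elevations if in_range(e, (75,))]        # lower limit only
below_80 = [e for e in elevations if in_range(e, [None, 80])]   # upper limit only
between = [e for e in elevations if in_range(e, (75, 80))]      # both limits

print(above_75)
print(below_80)
print(between)
```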

For example, to select only rows where the right ascension is greater than 114 degrees, we would use

In [13]:
selection_object.select_range(ra=(114,))

In [14]:
selection_object.show()

 ID    TAG     CRVAL2 # SELECTED
--- --------- ------- ----------
  0 b7c5ff23d [114.0]       3766


and to select rows where the elevation is below 80 degrees

In [15]:
selection_object.select_range(elevation=[None,80])

In [16]:
selection_object.show()

 ID    TAG     CRVAL2  ELEVATIO # SELECTED
--- --------- ------- --------- ----------
  0 b7c5ff23d [114.0]                 3766
  1 097a9d362         [None,80]       3582


We can check that the selections were applied properly by inspecting the final result and its "ELEVATIO" column.

It is also possible to use units during selection.
For example

In [17]:
selection_object.select_range(dec=[854, 855] * u.arcmin)
selection_object.show()

 ID    TAG     CRVAL2           CRVAL3            ELEVATIO # SELECTED
--- --------- ------- -------------------------- --------- ----------
  0 b7c5ff23d [114.0]                                            3766
  1 097a9d362                                    [None,80]       3582
  2 70e4164a0         [14.233333333333333,14.25]                  132
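
The `CRVAL3` entry for rule 2 shows the limits in degrees rather than in the arcminutes that were passed in. The arithmetic can be checked with a short plain-Python sketch; the constant and variable names here are illustrative and not part of `dysh`.

```python
# One degree is 60 arcminutes.
ARCMIN_PER_DEGREE = 60.0

# Limits passed to select_range, in arcminutes.
dec_limits_arcmin = [854, 855]

# The same limits expressed in degrees.
dec_limits_deg = [limit / ARCMIN_PER_DEGREE for limit in dec_limits_arcmin]
print(dec_limits_deg)
```

The two values agree with the `[14.233333333333333,14.25]` range shown in the table.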


Selection keywords are case insensitive, so, for example, using `DeC` is the same as `dec`.
Note also that `elevation` is aliased here to `elevatio` (the actual SDFITS keyword).

In [18]:
selection_object.select_range(eLEVaTIon=[None,80])
selection_object.show()

 ID    TAG     CRVAL2           CRVAL3            ELEVATIO # SELECTED
--- --------- ------- -------------------------- --------- ----------
  0 b7c5ff23d [114.0]                                            3766
  1 097a9d362                                    [None,80]       3582
  2 70e4164a0         [14.233333333333333,14.25]                  132
  3 954670db8                                    [None,80]       3582


Notice that the selections with ids 1 and 3 are the same. 
By default, `Selection` will not check for duplicates (this makes it faster).

### Select Within a Range

It is also possible to make a selection by specifying a midpoint and a range around it.
In this case we use `select_within` and specify the midpoint and the ± half-width.

For example, to select elevations between 50-10 and 50+10 degrees we would use

In [19]:
selection_object.select_within(eleVation=(50,10))
selection_object.show()

 ID    TAG     CRVAL2           CRVAL3            ELEVATIO # SELECTED
--- --------- ------- -------------------------- --------- ----------
  0 b7c5ff23d [114.0]                                            3766
  1 097a9d362                                    [None,80]       3582
  2 70e4164a0         [14.233333333333333,14.25]                  132
  3 954670db8                                    [None,80]       3582
  4 8eacb1cc1                                      [40,60]       1694


This adds a selection between 40 and 60 degrees of elevation.
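
For the call above, `(50, 10)` corresponds to the limits 40 and 60, matching the `[40,60]` entry in the table. A minimal sketch of that conversion follows; the helper `within_to_range` is hypothetical and is not a `dysh` function.

```python
def within_to_range(midpoint, half_width):
    """Convert a (midpoint, half-width) pair into (low, high) limits."""
    return (midpoint - half_width, midpoint + half_width)


low, high = within_to_range(50, 10)
print(low, high)  # 40 60
```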

### Using Aliases

`Selection` knows about certain aliases for column names.
For example, the SDFITS column ELEVATIO can also be selected using ELEVATION.
The aliases are defined in the `aliases` attribute of `Selection`.

In [20]:
selection_object.aliases

{'FREQ': 'CRVAL1',
 'RA': 'CRVAL2',
 'DEC': 'CRVAL3',
 'GLON': 'CRVAL2',
 'GLAT': 'CRVAL3',
 'GALLON': 'CRVAL2',
 'GALLAT': 'CRVAL3',
 'ELEVATION': 'ELEVATIO',
 'SOURCE': 'OBJECT',
 'POL': 'PLNUM',
 'SUBREF': 'SUBREF_STATE'}

It is also possible to add your own aliases.
For example, to use `target` and `az` as aliases for `OBJECT` and `AZIMUTH` we would use

In [21]:
selection_object.alias({'target':'object','az':'azimuth'})

In [22]:
selection_object.aliases

{'FREQ': 'CRVAL1',
 'RA': 'CRVAL2',
 'DEC': 'CRVAL3',
 'GLON': 'CRVAL2',
 'GLAT': 'CRVAL3',
 'GALLON': 'CRVAL2',
 'GALLAT': 'CRVAL3',
 'ELEVATION': 'ELEVATIO',
 'SOURCE': 'OBJECT',
 'POL': 'PLNUM',
 'SUBREF': 'SUBREF_STATE',
 'TARGET': 'OBJECT',
 'AZ': 'AZIMUTH'}

In [23]:
selection_object.select(target="U8249")
selection_object.show()

 ID    TAG    OBJECT  CRVAL2           CRVAL3            ELEVATIO # SELECTED
--- --------- ------ ------- -------------------------- --------- ----------
  0 b7c5ff23d        [114.0]                                            3766
  1 097a9d362                                           [None,80]       3582
  2 70e4164a0                [14.233333333333333,14.25]                  132
  3 954670db8                                           [None,80]       3582
  4 8eacb1cc1                                             [40,60]       1694
  5 e3987a8e5  U8249                                                     304


Notice that this will only affect the aliases for this particular instance of a `Selection`.
Any new Selection objects will not know about these aliases.

In [24]:
Selection(sdfits).aliases

{'FREQ': 'CRVAL1',
 'RA': 'CRVAL2',
 'DEC': 'CRVAL3',
 'GLON': 'CRVAL2',
 'GLAT': 'CRVAL3',
 'GALLON': 'CRVAL2',
 'GALLAT': 'CRVAL3',
 'ELEVATION': 'ELEVATIO',
 'SOURCE': 'OBJECT',
 'POL': 'PLNUM',
 'SUBREF': 'SUBREF_STATE'}
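
The two behaviours seen in this section, case-insensitive keywords and aliases that belong to a single instance, can be sketched in plain Python. This is an illustration under stated assumptions, not `dysh`'s implementation: the class `AliasTable`, its `resolve` method, and the shortened default table are hypothetical.

```python
# A shortened, illustrative subset of the default aliases shown above.
DEFAULT_ALIASES = {"ELEVATION": "ELEVATIO", "SOURCE": "OBJECT", "POL": "PLNUM"}


class AliasTable:
    """Hypothetical helper for case-insensitive keyword lookup."""

    def __init__(self):
        # Copy the defaults so additions stay local to this instance.
        self.aliases = dict(DEFAULT_ALIASES)

    def alias(self, new_aliases):
        """Register extra aliases, storing names and columns in upper case."""
        for name, column in new_aliases.items():
            self.aliases[name.upper()] = column.upper()

    def resolve(self, keyword):
        """Map a keyword to its column name; unknown keywords map to themselves."""
        key = keyword.upper()
        return self.aliases.get(key, key)


first = AliasTable()
first.alias({"target": "object", "az": "azimuth"})
second = AliasTable()  # a fresh instance only knows the defaults

print(first.resolve("eLEVaTIon"))  # ELEVATIO
print(first.resolve("target"))     # OBJECT
print(second.resolve("target"))    # TARGET (no alias registered here)
```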

### Empty Selections

Any selection that results in no data being selected is ignored.
You will get a warning message in this case.

In [25]:
selection_object.select(target='foobar')



### Time Selections

UTC time ranges can be selected with `Time` objects.
The range is checked against the UTC timestamp column.
For LST, use `select_range(lst=[number1,number2])`.



In [26]:
selection_object.select_range(utc=(Time("2004-04-22T06:08:05", scale="utc"),
                                   Time("2004-04-22T06:08:26", scale="utc")))

In [27]:
selection_object.show()

 ID    TAG    OBJECT ...  ELEVATIO               UTC                # SELECTED
--- --------- ------ ... --------- -------------------------------- ----------
  0 b7c5ff23d        ...                                                  3766
  1 097a9d362        ... [None,80]                                        3582
  2 70e4164a0        ...                                                   132
  3 954670db8        ... [None,80]                                        3582
  4 8eacb1cc1        ...   [40,60]                                        1694
  5 e3987a8e5  U8249 ...                                                   304
  6 58eb427ee        ...           [numpy.datetime...6.000000000')]         12


In [28]:
selection_object.final["UTC"]

Series([], Name: UTC, dtype: datetime64[ns])

### Channel Selection

To select channels there is a special method, `Selection.select_channel`.
Channels can be ranges, individual channels, or combinations thereof.
Note that selecting channels does not down-select rows.

In [29]:
a = [1, 4, (30, 40)]
selection_object.select_channel(a)
selection_object.show()

 ID    TAG    OBJECT ...      CHAN     # SELECTED
--- --------- ------ ... ------------- ----------
  0 b7c5ff23d        ...                     3766
  1 097a9d362        ...                     3582
  2 70e4164a0        ...                      132
  3 954670db8        ...                     3582
  4 8eacb1cc1        ...                     1694
  5 e3987a8e5  U8249 ...                      304
  6 58eb427ee        ...                       12
  7 d80c4b618        ... [1,4,(30,40)]       3766


Note that you can only have one channel selection rule at a time.

In [30]:
try: 
    selection_object.select_channel([60,70])
except Exception as e:
    print(e)

You can only have one channel selection rule. Remove the old rule before creating a new one.
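
To make the channel specification concrete, the sketch below expands a mixed list such as `[1, 4, (30, 40)]` into individual channel numbers. It is a plain-Python illustration, not `dysh`'s implementation: the helper `expand_channels` is hypothetical, and treating a nested pair as an inclusive range is an assumption.

```python
def expand_channels(spec):
    """Expand a mixed channel specification into a flat list of channels.

    Integers are single channels. A nested (first, last) pair is treated
    as a range. ASSUMPTION: the range includes both end channels.
    """
    channels = []
    for item in spec:
        if isinstance(item, (tuple, list)):
            first, last = item
            channels.extend(range(first, last + 1))
        else:
            channels.append(item)
    return channels


channels = expand_channels([1, 4, (30, 40)])
print(channels)
print(len(channels))  # 13: two single channels plus 30 through 40
```

Under the same reading, a flat list such as `[60, 70]` names two individual channels rather than a range.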


### Applying Selections to Your Data

So far we have seen how to create and manage selections.
However, these have been made with a separate `Selection` object.
All of the methods exposed above are also available through the `GBTFITSLoad` object.
For example, to list the selections we'd use `GBTFITSLoad.selection.show()`, to clear the selections `GBTFITSLoad.selection.clear()`, and to select in a range `GBTFITSLoad.select_range()`.

To show the effect we start by using `gettp` with the basic required selection of ifnum, plnum and fdnum.

In [31]:
tp_all = sdfits.gettp(ifnum=0, plnum=0, fdnum=0)

In [32]:
len(tp_all)

81

That is, all 81 scans were selected.

Now we select something and show the selection.

In [33]:
sdfits.select_range(eLEVaTIon=[None,80])
sdfits.selection.show()

 ID    TAG     ELEVATIO # SELECTED
--- --------- --------- ----------
  0 f78d75c89 [None,80]       3582


Now repeat the `gettp` call and notice the difference.

In [34]:
tp_selection = sdfits.gettp(ifnum=0, plnum=0, fdnum=0)
len(tp_selection)

74

Now only 74 scans are selected, the ones that have an elevation below 80 degrees.