# Flagging

This guide shows how to use flags in `dysh`.

You can find a copy of this tutorial as a Jupyter notebook [here](https://github.com/GreenBankObservatory/dysh/blob/main/notebooks/examples/flagging.ipynb) or download it by right-clicking <a href="https://raw.githubusercontent.com/GreenBankObservatory/dysh/refs/heads/main/notebooks/examples/flagging.ipynb" download>here</a> and selecting "Save Link As".

## Loading Modules
We start by loading the modules we will use for this example. 

For display purposes, we use the static (non-interactive) matplotlib backend in this tutorial. However, you can tell `matplotlib` to use the `ipympl` backend to enable interactive plots. This only applies when working in JupyterLab or Jupyter Notebook.

In [None]:
# Set interactive plots in jupyter.
#%matplotlib ipympl

# We will use matplotlib for plotting.
import matplotlib.pyplot as plt

# These modules are required for working with the data.
from dysh.fits.gbtfitsload import GBTFITSLoad
from dysh.util.selection import Selection
import numpy as np

# These modules are only used to download and unpack the data.
import tarfile
from pathlib import Path
from dysh.util.download import from_url

## Data Retrieval

We download the data from a tar.gz file and then unpack it.

In [None]:
url = "http://www.gb.nrao.edu/dysh/example_data/rfi-L/data/AGBT17A_404_01.tar.gz"
savepath = Path.cwd() / "data"
savepath.mkdir(exist_ok=True) # Create the data directory if it does not exist.
filename = from_url(url, savepath)

In [None]:
# Unpack.
with tarfile.open(filename) as targz:
    # The context manager closes the archive when the block exits.
    targz.extractall('./data/')
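
If you want to check what an archive contains before unpacking it, the standard `tarfile` module can list its members. The following is a minimal sketch that builds a small throwaway archive in a temporary directory (the `demo` directory and `readme.txt` file are made up for illustration), so it runs without downloading anything.

```python
import tarfile
import tempfile
from pathlib import Path

# Build a tiny throwaway archive so the example runs without any download.
workdir = Path(tempfile.mkdtemp())
(workdir / "demo").mkdir()
(workdir / "demo" / "readme.txt").write_text("hello")
archive = workdir / "demo.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(workdir / "demo", arcname="demo")

# List the members first, then unpack into a separate directory.
target = workdir / "unpacked"
target.mkdir()
with tarfile.open(archive) as tar:
    names = tar.getnames()
    tar.extractall(target)

print(names)
```

The same two calls, `getnames` and `extractall`, apply to the downloaded `filename` used in this tutorial.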

## Data Loading

After unpacking the data we load it. Notice how `dysh` tells us that it found a flag file.

In [None]:
sdfits = GBTFITSLoad("./data/AGBT17A_404_01.raw.vegas")

What flags were loaded?

In [None]:
sdfits.flags.show()

The above shows that the flag file was empty, so no flags were loaded.

Now, let's look at the summary.

In [None]:
sdfits.summary()

## Data Inspection

There are two scans, which form a pair of position-switched observations. We will calibrate them and see what the data look like.

We start by looking at the time average of the calibrated data.

In [None]:
# Calibrate the data.
ps_scanblock = sdfits.getps(scan=19, plnum=0, ifnum=0, fdnum=0)

# Compute the time average.
ps = ps_scanblock.timeaverage()

# Plot.
ps.plot(xaxis_unit="chan")

There is radio frequency interference (RFI) for channels above ~2300. We will plot a waterfall to see if the RFI is confined in time.
This is done using the `plot` method of a `ScanBlock`.

In [None]:
psp = ps_scanblock.plot()

The RFI is confined to integrations 42 to 52 and affects channels above 2300. We will flag this range. Since the RFI appears as a negative feature in the calibrated data, it is likely present in the off scan, `scan=20`.

## Data Flagging

We use the `GBTFITSLoad.flag` method to generate flags.

In [None]:
sdfits.flags.clear()
sdfits.flag(scan=20,
            channel=[[2300, 4096]],
            intnum=list(range(42, 53)))
sdfits.flags.show()
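
Conceptually, a flag rule like this one marks a rectangular block in the (integration, channel) plane. The NumPy sketch below does not use `dysh`. It assumes a hypothetical scan of 61 integrations by 4097 channels and treats both ends of each range as inclusive, which is an assumption made for illustration and not a statement about how `dysh` stores flags internally.

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration.
n_int, n_chan = 61, 4097

# True marks a flagged sample; start with nothing flagged.
mask = np.zeros((n_int, n_chan), dtype=bool)

# Flag integrations 42-52 and channels 2300-4096 (both ends assumed inclusive).
mask[42:53, 2300:4097] = True

flagged_fraction = mask.mean()
print(f"{mask.sum()} samples flagged ({flagged_fraction:.1%} of the scan)")
```

Everything outside the block stays unflagged, so the unaffected integrations and channels still contribute to the calibrated spectrum.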

We repeat the calibration after generating the flags.

In [None]:
pssb = sdfits.getps(scan=19, plnum=0, fdnum=0, ifnum=0, apply_flags=True)

pssb.plot()

The channels and times affected by RFI have been flagged. We can time average to generate the final spectrum without the RFI.

In [None]:
ps = pssb.timeaverage()
ps.plot(xaxis_unit="chan")
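
To see why flagging before averaging matters, here is a small sketch on synthetic data using NumPy masked arrays. The array sizes, noise level, and RFI amplitude are invented for illustration, and the average is a plain unweighted mean, so it is only an analogy for what `timeaverage` does and may not match its weighting.

```python
import numpy as np

rng = np.random.default_rng(0)
n_int, n_chan = 60, 512  # hypothetical sizes

# Noise around 1.0 with a negative RFI burst in a few integrations and channels.
data = rng.normal(1.0, 0.1, size=(n_int, n_chan))
data[42:53, 300:] -= 5.0

# Mask exactly the samples affected by the burst.
mask = np.zeros_like(data, dtype=bool)
mask[42:53, 300:] = True

# Average over integrations with and without the mask.
naive = data.mean(axis=0)
clean = np.ma.masked_array(data, mask=mask).mean(axis=0)

print(naive[300:].mean(), clean[300:].mean())
```

The unmasked average is pulled down in the affected channels, while the masked average stays close to the noise mean because the flagged samples are left out of the sum.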

## Removing Flags

To remove flags from the `GBTFITSLoad` object use the `clear_flags` method.

In [None]:
sdfits.clear_flags()
sdfits.flags.show()

## Statistics-based Flagging

We can assume that any significant increase in the standard deviation of the raw spectra is due to heavy RFI. Below, we calculate a threshold of mu + 3*sigma (the mean plus three standard deviations) for each of the 8 individual switching states and flag any integration that exceeds it.

The last integration has been blanked by VEGAS, and is not plotted below.

In [None]:
# Get the raw spectra and the standard deviation of each one.
specs = sdfits.rawspectra(0, 0)
stdevs = np.std(specs, axis=1)

# Organize by scan, integration, and switching state.
# There are 2 scans (target and reference pointings), each with
# 4 switching states (2 calibration diode states x 2 polarizations).
stdevs = np.reshape(stdevs, (2, -1, 4))

# Inspect the data, skipping the last (blanked) integration.
for scan in range(2):
    for sw_state in range(4):
        plt.plot(stdevs[scan, :-1, sw_state], label=f'State {(4 * scan) + sw_state}')

plt.xlabel('Integration #')
plt.ylabel('sigma')
plt.legend()
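
The reshape above only makes sense if the rows are ordered by scan first, then by integration, with the four switching states cycling fastest. The sketch below illustrates that assumed ordering with a stand-in array of row numbers. The number of integrations is hypothetical.

```python
import numpy as np

n_int = 5  # hypothetical number of integrations per scan
rows = np.arange(2 * n_int * 4)  # stand-in for the row number of each raw spectrum

# Same reshape as above: (scan, integration, switching state).
grid = rows.reshape(2, -1, 4)

print(grid[0, 0])     # rows of the first integration in the first scan
print(grid[1, :, 2])  # rows of one switching state across the second scan
```

If the rows in a file follow a different order, the axes of the reshaped array no longer correspond to scan, integration, and state, so it is worth checking the ordering against the index table before relying on it.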

The 4 states of the OFF scan show a significant jump caused by the GPS L3 RFI. The jump does not appear to start until the 40th integration, so we use that as the cutoff when calculating the statistics of the good data and the flagging thresholds.

In [None]:
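# Integrations before this index are treated as RFI-free.
cutoff = 40

# Mean and 3-sigma spread of the good integrations, per scan and state.
mean = np.mean(stdevs[:, :cutoff, :], axis=1)
spread = 3 * np.std(stdevs[:, :cutoff, :], axis=1)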

Now we create our flagging mask of zeros and ones, where a one corresponds to a flag to be applied.

In [None]:
flag_mask = np.zeros(stdevs.shape)

mean = np.expand_dims(mean,axis=1)
spread = np.expand_dims(spread,axis=1)

flag_mask[stdevs > mean+spread] = 1
flag_mask = flag_mask.flatten()

flag_rows = np.where(flag_mask==1)[0].tolist()
print(flag_rows)
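
The same thresholding can be tried on synthetic data to check that it behaves as expected. The sketch below is one possible variant and does not use `dysh`: it uses `keepdims=True` instead of `np.expand_dims`, and `np.unravel_index` to map the flat row numbers back to (scan, integration, state). All sizes and noise levels are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_int, cutoff = 60, 40  # hypothetical sizes

# Synthetic per-spectrum standard deviations: (scan, integration, state).
stdevs = rng.normal(1.0, 0.01, size=(2, n_int, 4))
stdevs[1, 45:50, :] += 1.0  # inject a jump in the second scan

# keepdims=True keeps the integration axis so the threshold broadcasts directly.
good = stdevs[:, :cutoff, :]
threshold = good.mean(axis=1, keepdims=True) + 3 * good.std(axis=1, keepdims=True)

# Flat indices of every sample above its threshold.
flag_rows = np.flatnonzero(stdevs > threshold)

# Map the flat row numbers back to (scan, integration, state).
scan_idx, int_idx, state_idx = np.unravel_index(flag_rows, stdevs.shape)
print(sorted(set(int_idx[scan_idx == 1].tolist())))
```

A 3-sigma cut will also flag a small number of clean integrations purely by chance, so a few rows outside the injected range may appear in the output.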

We apply the flags using the `row` keyword. The RFI is removed, and the exposure time drops from the original 150 seconds to 112 seconds.

In [None]:
sdfits.flag(row=flag_rows)


ps = sdfits.getps(plnum=0, ifnum=0, fdnum=0).timeaverage()
print(ps.meta['EXPOSURE'])

ps.plot()