# Tutorial of `fitburst` -- II: Pre-processing your Data

For the purposes of model-fitting, it is crucial to ensure that your data are adequately prepared (or "pre-processed") in order to maximize the chance of convergence. The usual steps for pre-processing include baseline subtraction, normalization, and/or the removal of data corrupted by radio frequency interference (RFI). 

For this reason, `fitburst` requires a minimal but explicit call to the `preprocess_data()` method of the `DataReader` object. This method produces a spectrum that is cleaned of RFI, along with quantities needed for downstream fitting, e.g., a mask of "bad" channels to ignore during the fitting procedure. The `preprocess_data()` method supports several modes of use, as demonstrated in this tutorial.

## Step 1: investigate the `DataReader.data_weights` attribute 

The `DataReader` object contains an attribute -- called `data_weights` -- that indicates the samples that are "good" (i.e., to be used for fitting) or "bad."

In [1]:
# import the necessary utilities.
from fitburst.backend.generic import DataReader
import matplotlib.pyplot as plt
import numpy as np

# initialize the DataReader object.
input_file = "./data_fitburst_CHIMEFRB_StokesI_FRB20220502B.npz"
data = DataReader(input_file)
data.load_data()

# the spectrum-wide mask is contained in the 'data_weights' attribute; 
# print data.data_weights to see what the data look like.
print(f"data weights: {data.data_weights}")
print(f"size of data_full: {data.data_full.shape}")
print(f"size of data_weights: {data.data_weights.shape}")

data weights: [[1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
size of data_full: (16384, 162)
size of data_weights: (16384, 162)


The `data_weights` attribute contains 1s and 0s, indicating which data are usable (1) or bad (0).
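A quick way to summarize such a mask is to count the usable samples overall and per channel. The sketch below uses a small, hypothetical NumPy array as a stand-in for `data_weights` (the array shape and the masked entries are invented for illustration), so it runs without `fitburst` or the data file.

```python
import numpy as np

# Hypothetical stand-in for DataReader.data_weights:
# 6 frequency channels x 4 time samples, with 1 = usable and 0 = bad.
weights = np.ones((6, 4))
weights[4:, :] = 0.0   # last two channels are fully masked
weights[1, 2] = 0.0    # one isolated bad sample in channel 1

# Fraction of all samples that are marked as usable.
frac_good = weights.mean()

# Number of usable samples in each channel (sum over the time axis).
good_per_channel = weights.sum(axis=1)

print(f"fraction of usable samples: {frac_good:.3f}")
print(f"usable samples per channel: {good_per_channel}")
```

The same two reductions can be applied directly to `data.data_weights` once the data are loaded.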

## Step 2: investigate the `DataReader.good_freq` attribute

The `DataReader` object contains a key attribute, called `good_freq`, that records whether each frequency channel should be used for fitting or ignored. By default, this attribute is not set when `load_data()` is executed:

In [2]:
print(data.good_freq)

None


It is natural to ask: _If it's so important, then why is `DataReader.good_freq` not initialized by default?_ 

The answer to this question may not be satisfying, but it's very important: if you want an adequate best-fit model, you must have a fairly robust understanding of your data prior to executing the fit routines. You must ensure that your data are baseline-subtracted, adequately normalized, and cleaned of any RFI-like features that are not part of the signal you seek to model. In our experience, a significant number of unsuccessful `fitburst` fits occur because one or more of these operations were not performed on the data.
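To make "baseline-subtracted" and "adequately normalized" concrete, here is a minimal NumPy sketch on a synthetic dynamic spectrum. It is not the routine `fitburst` uses internally; the per-channel median baseline and the unit-standard-deviation scaling are assumptions chosen for illustration, and the array `raw` is invented.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical raw dynamic spectrum: 8 channels x 256 time samples,
# where each channel has its own offset (baseline) and noise level.
offsets = np.linspace(10.0, 50.0, 8)[:, None]
scales = np.linspace(1.0, 5.0, 8)[:, None]
raw = offsets + scales * rng.standard_normal((8, 256))

# Baseline subtraction: remove the per-channel median over the time axis.
baseline = np.median(raw, axis=1, keepdims=True)
subtracted = raw - baseline

# Normalization: scale each channel to unit standard deviation.
sigma = subtracted.std(axis=1, keepdims=True)
cleaned = subtracted / sigma

print(np.median(cleaned, axis=1))  # per-channel medians after cleaning
print(cleaned.std(axis=1))         # per-channel standard deviations after cleaning
```

With real data, the baseline and noise level would normally be estimated from samples that exclude the burst, so that the signal itself does not bias the statistics.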

## Step 3: execute the minimal form of the `DataReader.preprocess_data()` method

We can now try using the built-in cleaning routines provided in the `DataReader.preprocess_data()` method.

In [3]:
# before running, check and see what the good_freq attribute looks like.
print(f"good_freq: {data.good_freq}")

# now run the preprocessing step and print again.
data.preprocess_data()
print(f"good_freq: {data.good_freq}")

good_freq: None
INFO: flagged and removed 5672 out of 16384 channels!
good_freq: [ True  True  True ... False False False]


As can be seen above, the `preprocess_data()` method populates the `good_freq` attribute with a boolean array that has one entry per frequency channel.

However, we should understand exactly how the `good_freq` attribute was populated. When none of its options are used, the `preprocess_data()` method uses only the values in `data_weights` to determine which frequency channels are usable and which should be avoided. This operation amounts to flagging the channels that contain only 0s in `data_weights`, which can be done by summing all `data_weights` values over the time axis (`axis=1`) and checking which of the resulting values are 0. For example,

In [4]:
# determine which channels have non-zero data in data_weights. 
is_freq_good = (data.data_weights.sum(axis=1) > 0)

# just to be safe, make sure that this array matches the one generated by preprocess_data().
if np.all(data.good_freq == is_freq_good):
    print("the good_freq arrays are indeed the same")

the good_freq arrays are indeed the same


In principle, the default behavior of `preprocess_data()` leaves full control of RFI flagging to the user: just define `data_weights` to your liking, call `preprocess_data()`, and you're ready to go!
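As a rough illustration of defining `data_weights` by hand, the sketch below builds a mask with NumPy and derives the per-channel flags with the same reduction used in the cell above. The array `weights` and the list `bad_channels` are hypothetical stand-ins; with a real `DataReader` object, the mask would be stored in `data_weights` before calling `preprocess_data()`.

```python
import numpy as np

# Hypothetical stand-in for DataReader.data_weights:
# 8 frequency channels x 5 time samples, all usable to begin with.
num_freq, num_time = 8, 5
weights = np.ones((num_freq, num_time))

# Suppose channels 2 and 5 are known to be contaminated by RFI
# (an illustrative choice, not taken from the tutorial data).
bad_channels = [2, 5]
weights[bad_channels, :] = 0.0

# Same reduction as above: a channel is good if it has any non-zero weight.
is_freq_good = weights.sum(axis=1) > 0

print(is_freq_good)
```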

## Step 4: explore the options of `preprocess_data()`

If you instead want to "be safe" and utilize other aspects of `preprocess_data()`, feel free to experiment with its options.
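Before looking at the options themselves, it may help to see what a variance- or skewness-based channel cut means in practice. The NumPy sketch below is not the `fitburst` implementation: the per-channel statistics, the threshold values, and the synthetic array `spectrum` are all assumptions made for illustration, and they differ from the defaults shown in the help output below.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical spectrum: 6 channels x 2000 samples of unit-variance noise.
spectrum = rng.standard_normal((6, 2000))
spectrum[3] *= 10.0          # channel 3: strongly inflated variance
spectrum[5, ::200] += 10.0   # channel 5: sparse positive spikes (skewed)

# Per-channel variance and skewness over the time axis.
mean = spectrum.mean(axis=1, keepdims=True)
std = spectrum.std(axis=1, keepdims=True)
variance = (std ** 2).ravel()
skewness = (((spectrum - mean) / std) ** 3).mean(axis=1)

# Illustrative thresholds (assumed here; not the fitburst defaults).
variance_range = [0.5, 2.0]
skewness_range = [-1.0, 1.0]

# Keep only channels whose statistics fall inside both ranges.
keep_variance = (variance >= variance_range[0]) & (variance <= variance_range[1])
keep_skewness = (skewness >= skewness_range[0]) & (skewness <= skewness_range[1])
good = keep_variance & keep_skewness

print(f"variance: {np.round(variance, 2)}")
print(f"skewness: {np.round(skewness, 2)}")
print(f"good channels: {good}")
```

In this toy example the two cuts catch different problems: the variance cut removes the noisy channel, while the skewness cut removes the channel with sparse spikes, whose variance alone is still within range.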

In [5]:
# use the help() function to access info on options for methods and functions within fitburst.
help(data.preprocess_data)

Help on method preprocess_data in module fitburst.utilities.bases:

preprocess_data(apply_cut_variance: bool = False, apply_cut_skewness: bool = False, normalize_variance: bool = True, remove_baseline: bool = False, skewness_range: list = [-3.0, 3.0], variance_range: list = [0.2, 0.8], variance_weight: float = 1.0) -> None method of fitburst.backend.generic.DataReader instance
    Applies pre-fit routines for cleaning raw dynamic spectrum (e.g., RFI-masking,
    baseline subtraction, normalization, etc.).
    
    Parameters
    ----------
    apply_cut_variance : bool, optional
        if True, then update mask to exclude channels with variance values that exceed 
        the range specified in the 'variance_range' list
    
    apply_cut_skewness : bool, optional
        if True, then update mask to exclude channels with skewness values that exceed 
        the range specified in the 'skewness_range' list
    
    normalize_variance: bool, optional
        if true, then normalize var