_This notebook was put together by [Keneth Garcia](https://stivengarcia7113.wixsite.com/kenethgarcia). Source and license info are on [GitHub](https://github.com/KenethGarcia/GRB_ML)._

# Topic of Current Interest: t-SNE in Swift Data
The Neil Gehrels Swift Observatory presents analysis results for the Swift/BAT Gamma-Ray Bursts (GRBs) on [this website](https://swift.gsfc.nasa.gov/results/batgrbcat/) (open access).

As suggested by [Jespersen et al. (2020)](https://ui.adsabs.harvard.edu/abs/2020ApJ...896L..20J/abstract), Swift GRBs can be separated into two groups when t-SNE is performed. In this Jupyter notebook, we replicate this work by adding more recent data and an in-depth analysis of other preprocessed subsets (e.g., noise-filtered data). Throughout this document, we use the _python3_ implementations from the _scripts_ folder. A working _Python 3_ installation with _Jupyter Notebook_ is required.

First, we need to import the **main.py** file to our notebook (and some packages needed):

In [1]:
from scripts import main
import os  # Import os to handle folders and files
import numpy as np  # Import numpy module to read tables, manage data, etc

Then, create an instance of the `SwiftGRBWorker` class defined in `main.py` and, if needed, set the data, table, and results folder paths (by default, these are the "Data", "Table", and "Results" folders inside the path containing this notebook):

In [2]:
%matplotlib inline
object1 = main.SwiftGRBWorker()
object1.original_data_path = r'G:\Mi unidad\Cursos\Master_Degree_Project\GRB_ML\Data\Original_Data'  # Change original data path
object1.table_path = r'G:\Mi unidad\Cursos\Master_Degree_Project\GRB_ML\Tables'  # Change table path
object1.results_path = r'G:\Mi unidad\Cursos\Master_Degree_Project\GRB_ML\Results'  # Change results path
object1.noise_data_path = r'G:\Mi unidad\Cursos\Master_Degree_Project\GRB_ML\Data\Noise_Filtered_Data'
object1.noise_images_path = r'G:\Mi unidad\Cursos\Master_Degree_Project\GRB_ML\Results\Noise_Filter_Images'

If you haven't downloaded the data yet, check the _Swift_Data_Download_ notebook.

**REMARK:** This notebook uses the results obtained in previous notebooks; before continuing, check at least the _Swift_Data_Download_ and _Data_Preprocessing_ notebooks.

## Changing the Swift GRB binning
By default, this notebook uses the Swift data with 64 ms binning. Some cases call for a different data resolution or binning; this package handles them through the _resolution_ and _end_ variables.

You can set the _resolution_ variable to $2$, $8$, $16$, $64$, or $256$ ms. Additionally, you can set it to $1$ for 1 s binning, or change the _end_ variable to "sn5_10s" to use data with a signal-to-noise ratio higher than 5 or 10 s binning (these data don't have uniform time spacing).

In [3]:
object1.res = 64  # Resolution for the Light Curve Data in ms, could be 2, 8, 16, 64 (default), 256 and 1 (this last in s)
# object1.end = "sn5_10s"  # Uncomment this line if you need to use signal-to-noise higher than 5 or 10s binning

It is advisable not to change both variables at the same time; this could cause unknown bugs when running package routines and sub-routines. Additionally, you will need the data downloaded for the selected binning.
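The rule above can be sketched as a small guard. This is only an illustration: `binning_label` is a hypothetical helper that is not part of the _scripts_ package, and it assumes that 64 ms is the default resolution and that a non-default `end` should not be combined with a non-default `res`.

```python
VALID_RES_MS = (2, 8, 16, 64, 256)  # millisecond binnings listed above


def binning_label(res=64, end=None):
    """Return a short label for the chosen binning (hypothetical helper)."""
    if end is not None:
        # Assumption: changing `end` is only safe with the default resolution
        if res != 64:
            raise ValueError("change either `res` or `end`, not both")
        return end
    if res == 1:
        return "1s"  # the value 1 is interpreted in seconds
    if res in VALID_RES_MS:
        return f"{res}ms"
    raise ValueError(f"unsupported resolution: {res!r}")


print(binning_label(64))
print(binning_label(1))
print(binning_label(end="sn5_10s"))
```

Calling a guard like this before setting `object1.res` or `object1.end` would surface an unsupported combination early, instead of failing deep inside a package routine.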


# Relevant topics

In this study, we are interested in finding patterns in t-SNE embeddings. To that end, we explored several relevant subsets of GRBs (and topics), motivated by the variation in t-SNE convergence explained in the _t-SNE_Introduction_ notebook. In particular, we try to separate two groups (usually named short and long) by their underlying physical process. The following subsections review the main findings of this task.

Before continuing, let's define the perplexity values used for the t-SNE animations:

In [None]:
pp = np.array([4, 5, 6, 7, 8, 9, 10, 15, 17, 20, 25, 30, 50, 75, 100, 150, 200, 250, 300, 350, 400, 450])
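Recent versions of scikit-learn require the perplexity to be smaller than the number of samples, so the largest values in `pp` only work while enough GRBs remain after filtering. A minimal sketch of a filter, assuming a hypothetical helper `usable_perplexities` that is not part of the _scripts_ package:

```python
import numpy as np


def usable_perplexities(perplexities, n_samples):
    """Keep only perplexity values strictly smaller than the sample count."""
    perplexities = np.asarray(perplexities)
    return perplexities[perplexities < n_samples]


# Toy example: with 120 samples, the value 450 is dropped
pp_demo = np.array([4, 5, 10, 30, 100, 450])
print(usable_perplexities(pp_demo, n_samples=120))
```

For the subsets in this notebook, `n_samples` would be the number of rows in the feature array, for example `len(features)`.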

In addition, as a reference, here is the animation obtained in the _t-SNE_Introduction_ notebook:

![](Animations/perplexity_animation.gif)

## Preprocessed data without DFT

As Jespersen et al. point out: _"It should be noted that the DFT is not necessarily an optimal preprocessing solution for separating GRBs (...)"_. Motivated by this, the simplest alternative is to skip the DFT altogether. The t-SNE results for the preprocessed data without DFT follow:

In [None]:
non_DFT_data_loaded = np.load(os.path.join(object1.results_path, f"non_DFT_Preprocessed_data_{object1.end}.npz"))
non_DFT_GRB_names, non_DFT_features = non_DFT_data_loaded['GRB_Names'], non_DFT_data_loaded['Data']
durations_data_array = object1.durations_checker(non_DFT_GRB_names, t=90)  # Check for name, t_start, and t_end
start_times, end_times = durations_data_array[:, :, 1].astype(float), durations_data_array[:, :, 2].astype(float)
non_DFT_durations = np.reshape(end_times - start_times, len(durations_data_array))  # T_90 is equal to t_end - t_start
file_name = os.path.join('Animations', 'perplexity_animation_non_DFT.gif')
object1.tsne_animation(non_DFT_features, iterable='perplexity', perplexity=pp, library='sklearn', duration_s=non_DFT_durations, filename=file_name)

![](Animations/perplexity_animation_non_DFT.gif)

In this case, there are two relevant remarks:
1. The visualization map is more diffuse than in the DFT pre-processed case.
2. Mainly at perplexity < 20, there are two clear subgroups with different mean durations. More importantly, these subgroups are visible without any customized optimization parameters (specific learning rate, iteration number, etc.).

## Removing suspicious GRBs
The [lists of GRBs with special comments](https://swift.gsfc.nasa.gov/results/batgrbcat/summary_cflux/summary_GRBlist) contain information about GRBs whose measurement failed or partially failed. These GRBs can distract the t-SNE algorithm and fill the space between well-defined groups, breaking their general structure.

The GRBs removed are part of the lists:
1. `GRBlist_not_enough_evt_data.txt`:  The event data are only available for part of the burst duration.
2. `GRBlist_tentative_detection_with_note.txt` and `GRBlist_tentative_detection.txt`: GRBs with tentative detection.
3. `Obvious_data_gap.txt`: Obvious data gap within the burst duration.

You can download these tables using the `summary_tables_download` method:

In [None]:
tables = ('GRBlist_not_enough_evt_data.txt', 'GRBlist_tentative_detection_with_note.txt', 'GRBlist_tentative_detection.txt', 'Obvious_data_gap.txt')
for name in tables:  # Comment out this loop if the tables are already downloaded
    object1.summary_tables_download(name=name, other=True)

Read the tables and index the GRB names:

In [None]:
tables_path = object1.table_path
excluded_names = np.array([])
for table in tables:
    names_i = np.genfromtxt(os.path.join(tables_path, table), usecols=(0, 1), dtype=str)[:, 0]
    excluded_names = np.append(excluded_names, names_i)
excluded_names = np.unique(excluded_names)
print(f"There are {len(excluded_names)} GRBs to be excluded")
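One edge case in the reading step: `np.genfromtxt` returns a one-dimensional array when a table has a single row, and indexing it with `[:, 0]` then raises an `IndexError`. A minimal sketch of a more defensive read, using made-up GRB names and an in-memory table (the real table layout may differ; `read_first_column` is a hypothetical helper):

```python
import io

import numpy as np


def read_first_column(source):
    """Read the first column of a whitespace-separated table as strings."""
    rows = np.genfromtxt(source, usecols=(0, 1), dtype=str)
    # atleast_2d keeps the [:, 0] indexing valid for single-row tables
    return np.atleast_2d(rows)[:, 0]


# Toy tables with invented names; lines starting with '#' are skipped
toy_a = io.StringIO("# name note\nGRB200101A x\nGRB200102B y\n")
toy_b = io.StringIO("GRB200102B z\n")  # single-row table

names = np.unique(np.concatenate([read_first_column(toy_a),
                                  read_first_column(toy_b)]))
print(names)
```

`np.unique` both removes the duplicated name and returns the result sorted, as in the cell above.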

Read the preprocessed data with DFT and index their durations:

In [90]:
data_loaded = np.load(os.path.join(object1.results_path, f"DFT_Preprocessed_data_{object1.end}.npz"))
GRB_names, features = data_loaded['GRB_Names'], data_loaded['Data']
durations_data_array = object1.durations_checker(GRB_names, t=90)  # Check for name, t_start, and t_end
start_times, end_times = durations_data_array[:, :, 1].astype(float), durations_data_array[:, :, 2].astype(float)
durations = np.reshape(end_times - start_times, len(durations_data_array))  # T_90 is equal to t_end - t_start

Finding Durations: 100%|██████████| 1318/1318 [00:09<00:00, 138.43GRB/s]
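The slicing used to obtain the durations can be checked on a toy array. The sketch below assumes that `durations_checker` returns an array of shape `(N, 1, 3)` holding the name, start time, and end time as strings, which is what the indexing in the cell implies; the names and times are invented.

```python
import numpy as np

# Toy stand-in for the durations array: one (name, t_start, t_end) row per GRB
toy = np.array([
    [["GRB_A", "-0.5", "1.5"]],
    [["GRB_B", "2.0", "62.0"]],
])  # shape (2, 1, 3), all entries are strings

start = toy[:, :, 1].astype(float)  # shape (2, 1)
end = toy[:, :, 2].astype(float)    # shape (2, 1)
t90 = np.reshape(end - start, len(toy))  # flatten to shape (2,)
print(t90)
```

The reshape matters because the animation receives one duration per feature row, so both arrays must have the same length along their first axis.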


Remove elements from the original GRB names and features array:

In [None]:
non_match = np.where(np.isin(GRB_names, excluded_names, invert=True))[0]
GRB_names_excluded = GRB_names[non_match]
features_excluded = features[non_match]
durations_excluded = durations[non_match]
print(f"Now there are {len(GRB_names_excluded)} GRBs to perform tSNE")
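The same filtering can be written with a boolean mask, which avoids the intermediate index array. A minimal sketch on invented names, showing that names, features, and durations stay aligned and that excluded names missing from the dataset are simply ignored:

```python
import numpy as np

names = np.array(["GRB_A", "GRB_B", "GRB_C", "GRB_D"])
features = np.arange(8).reshape(4, 2)   # one feature row per GRB
durations = np.array([0.3, 12.0, 45.0, 1.1])
excluded = np.array(["GRB_B", "GRB_D", "GRB_Z"])  # GRB_Z is not in `names`

keep = ~np.isin(names, excluded)  # True for GRBs that are not excluded
names_kept = names[keep]
features_kept = features[keep]
durations_kept = durations[keep]
print(names_kept, durations_kept)
```

Applying one mask to all three arrays guarantees that row `i` of the features still corresponds to entry `i` of the names and durations.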

With the remaining GRBs, the t-SNE embedding follows:

In [None]:
file_name = os.path.join('Animations', 'perplexity_animation_2.gif')  # Same folder as the image displayed below
object1.tsne_animation(features_excluded, iterable='perplexity', perplexity=pp, library='sklearn', duration_s=durations_excluded, filename=file_name)

![](Animations/perplexity_animation_2.gif)

As you can see, without these suspicious GRBs the two subgroups are evident at perplexity < 10, while the t-SNE convergence changes little compared with the reference animation.

Now, taking the pre-processed data without DFT and removing the same suspicious GRBs:

In [None]:
non_match_2 = np.where(np.isin(non_DFT_GRB_names, excluded_names, invert=True))[0]
non_DFT_GRB_names_excluded = non_DFT_GRB_names[non_match_2]
non_DFT_features_excluded = non_DFT_features[non_match_2]
non_DFT_durations_excluded = non_DFT_durations[non_match_2]
print(f"Now there are {len(non_DFT_GRB_names_excluded)} GRBs to perform tSNE")
file_name = os.path.join('Animations', 'perplexity_animation_3.gif')
object1.tsne_animation(non_DFT_features_excluded, iterable='perplexity', perplexity=pp, library='sklearn', duration_s=non_DFT_durations_excluded, filename=file_name)

![](Animations/perplexity_animation_3.gif)

At low perplexity values, it seems that there is a small grouping of very long GRBs.

## Noise Reduction
Another way to improve the t-SNE results is to reduce the noise in the data. Swift data are particularly noisy, and using cleaner light curves for t-SNE can refine its results.

We use the non-parametric noise reduction technique results obtained from the FABADA package (see _Noise_Reduction_ notebook for further details). First, we need to load the features and durations:

In [None]:
fabada_data_loaded = np.load(os.path.join(object1.results_path, f"DFT_Noise_Filtered_data_{object1.end}.npz"))
fabada_GRB_names, fabada_features = fabada_data_loaded['GRB_Names'], fabada_data_loaded['Data']
durations_data_array = object1.durations_checker(fabada_GRB_names, t=90)  # Check for name, t_start, and t_end
start_times, end_times = durations_data_array[:, :, 1].astype(float), durations_data_array[:, :, 2].astype(float)
fabada_durations = np.reshape(end_times - start_times, len(durations_data_array))  # T_90 is equal to t_end - t_start
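The durations passed to the animation must follow the same GRB order as the feature rows. When durations are already available for another subset, a name-based lookup is one way to reuse them safely. A minimal sketch with invented names, where `durations_for` is a hypothetical helper that is not part of the _scripts_ package:

```python
import numpy as np


def durations_for(names, ref_names, ref_durations):
    """Return durations ordered like `names`, NaN where a name is unknown."""
    lookup = dict(zip(ref_names.tolist(), ref_durations.tolist()))
    return np.array([lookup.get(name, np.nan) for name in names.tolist()])


ref_names = np.array(["GRB_A", "GRB_B", "GRB_C"])
ref_durations = np.array([0.3, 12.0, 45.0])
subset_names = np.array(["GRB_C", "GRB_A", "GRB_X"])  # different order, one unknown

print(durations_for(subset_names, ref_names, ref_durations))
```

A `NaN` in the result flags a GRB that has features but no known duration, which is easier to spot than a silently misaligned array.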

The embedding follows:

In [None]:
file_name = os.path.join('Animations', 'perplexity_animation_noise_filtered.gif')  # Same folder as the image displayed below
object1.tsne_animation(fabada_features, iterable='perplexity', perplexity=pp, library='sklearn', duration_s=fabada_durations, filename=file_name)

![](Animations/perplexity_animation_noise_filtered.gif)

In this case, there are two relevant remarks:
1. The visualization map changes its shape compared with the DFT pre-processed case. It is more curved, but the main idea is the same: duration and GRB map position are correlated.
2. For all perplexity < 50, there are two clear subgroups with different mean durations. Furthermore, this is the clearest case: the subgroups are wholly separate. Again, they are visible without any customized optimization parameters (specific learning rate, iteration number, etc.), and this result uses the DFT approach.

As expected, when we strongly reduce the noise in the whole dataset, the t-SNE visualization maps improve.

Now, for the pre-processed data without DFT we get:

![](Animations/perplexity_animation_noise_filtered_2.gif)

The visualization maps follow the same structure as in the DFT case, but no subgroups stand out (except for perplexity = 300).

At this point, the most important result is: **There are two subgroups separated in the visualization maps**. The reader can check the GRBs involved in each cloud and analyze them.


## Using data with a high Signal-to-noise ratio
Another approach to the noise problem is to use high signal-to-noise ratio data from Swift. In particular, there is one file named "sn5_10s_lc_ascii.dat" for each GRB in the Swift database. This file contains the light curve with 10-second average binning. In most cases, the GRB structure is clearer than with 64 ms binning. We can then perform t-SNE on these data to check what happens when we use fewer but more consistent points.

First, as we pointed out in the _Data_Interpolating_ notebook, we need to handle the time basis before pre-processing, because the data don't have uniform time spacing. To solve this problem, we interpolate between data points using a fixed time step and pre-process the interpolated data as usual.
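The resampling idea can be sketched with linear interpolation. This is an assumption made for illustration: the _Data_Interpolating_ notebook may use a different interpolation scheme, and `resample_uniform` is a hypothetical helper that is not part of the _scripts_ package.

```python
import numpy as np


def resample_uniform(times, counts, step):
    """Linearly interpolate an unevenly sampled light curve onto a fixed step."""
    times = np.asarray(times, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Half a step of slack keeps the last sample when it falls on the grid
    grid = np.arange(times[0], times[-1] + step / 2, step)
    return grid, np.interp(grid, times, counts)


# Toy light curve with non-uniform time spacing (invented values)
t = [0.0, 10.0, 15.0, 40.0]
y = [0.0, 2.0, 3.0, 0.5]
grid, resampled = resample_uniform(t, y, step=5.0)
print(grid)
print(resampled)
```

After this step every light curve shares a constant time spacing, so the usual pre-processing can be applied without further changes.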

Now, to perform t-SNE, we need to read the pre-processed data:

In [None]:
interpolate_data_loaded = np.load(os.path.join(object1.results_path, f"DFT_Interpolated_data_{object1.end}.npz"))
interpolate_GRB_names, interpolate_features = interpolate_data_loaded['GRB_Names'], interpolate_data_loaded['Data']
durations_data_array = object1.durations_checker(interpolate_GRB_names, t=90)  # Check for name, t_start, and t_end
start_times, end_times = durations_data_array[:, :, 1].astype(float), durations_data_array[:, :, 2].astype(float)
interpolate_durations = np.reshape(end_times - start_times, len(durations_data_array))  # T_90 is equal to t_end - t_start

And to perform the embedding:

In [None]:
file_name = os.path.join('Animations', 'perplexity_animation_interpolated.gif')
object1.tsne_animation(interpolate_features, iterable='perplexity', perplexity=pp, library='sklearn', duration_s=interpolate_durations, filename=file_name)