---
title: Reproducing Chen et al. (2021)'s unsupervised machine learning
subtitle: Using UMAP to reduce dimensions and HDBSCAN to cluster datapoints
author: Murthadza Aznam
date: '2022-10-20'
---

:::{.callout-note}

 📌 Goal: This notebook tries to reproduce the results from [https://ui.adsabs.harvard.edu/abs/2022MNRAS.509.1227C/abstract](https://ui.adsabs.harvard.edu/abs/2022MNRAS.509.1227C/abstract)

:::

## 0. Requirements
The paper uses data from the CHIME/FRB Catalog. The CHIME/FRB team makes this data easy to access through the [CHIME/FRB Open Data](https://github.com/chime-frb-open-data/chime-frb-open-data) package, `cfod`.

In [4]:
import pandas as pd
from cfod import catalog

cat: pd.DataFrame = catalog.as_dataframe()
cat.columns

Index(['tns_name', 'previous_name', 'repeater_name', 'ra', 'ra_err',
       'ra_notes', 'dec', 'dec_err', 'dec_notes', 'gl', 'gb', 'exp_up',
       'exp_up_err', 'exp_up_notes', 'exp_low', 'exp_low_err', 'exp_low_notes',
       'bonsai_snr', 'bonsai_dm', 'low_ft_68', 'up_ft_68', 'low_ft_95',
       'up_ft_95', 'snr_fitb', 'dm_fitb', 'dm_fitb_err', 'dm_exc_ne2001',
       'dm_exc_ymw16', 'bc_width', 'scat_time', 'scat_time_err', 'flux',
       'flux_err', 'flux_notes', 'fluence', 'fluence_err', 'fluence_notes',
       'sub_num', 'mjd_400', 'mjd_400_err', 'mjd_inf', 'mjd_inf_err',
       'width_fitb', 'width_fitb_err', 'sp_idx', 'sp_idx_err', 'sp_run',
       'sp_run_err', 'high_freq', 'low_freq', 'peak_freq', 'excluded_flag'],
      dtype='object')

## 0.1 Parameters
The parameters selected by the paper are as follows:
- Boxcar Width
- Width of Sub-Burst
- Flux
- Fluence
- Scattering Time
- Spectral Index
- Spectral Running
- Highest Frequency
- Lowest Frequency
- Peak Frequency
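
As a cross-check, the ten parameters listed above can be mapped onto catalog column names. The sketch below is an illustration only: the `PARAM_COLUMNS` mapping and the `select_parameters` helper are hypothetical names, and the pairing of each label with a column is inferred from the column names printed by `cat.columns`. Note that Fluence appears in the list, so `fluence` is included here.

```python
import pandas as pd

# Hypothetical mapping from the paper's parameter labels to catalog columns.
# The pairing is inferred from the column names, not taken from the paper.
PARAM_COLUMNS: dict[str, str] = {
    "Boxcar Width": "bc_width",
    "Width of Sub-Burst": "width_fitb",
    "Flux": "flux",
    "Fluence": "fluence",
    "Scattering Time": "scat_time",
    "Spectral Index": "sp_idx",
    "Spectral Running": "sp_run",
    "Highest Frequency": "high_freq",
    "Lowest Frequency": "low_freq",
    "Peak Frequency": "peak_freq",
}


def select_parameters(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the parameter columns, failing loudly if any is absent."""
    columns = list(PARAM_COLUMNS.values())
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise KeyError(f"Missing catalog columns: {missing}")
    return df[columns]


# Stand-in for the real catalog: an empty frame with the relevant columns.
demo = pd.DataFrame(columns=["tns_name", *PARAM_COLUMNS.values()])
selected_demo = select_parameters(demo)
print(list(selected_demo.columns))
```

With the real catalog, `select_parameters(cat)` would play the same role as indexing with a list of column names, with the added check that every expected column is present.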

In [8]:
# Catalog column names for the selected parameters
params: list[str] = [
    "bc_width",
    "width_fitb",
    "flux",
    "scat_time",
    "sp_idx",
    "sp_run",
    "high_freq",
    "low_freq",
    "peak_freq"
]

selected = cat[params]
selected

| | bc_width | width_fitb | flux | scat_time | sp_idx | sp_run | high_freq | low_freq | peak_freq |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00295 | 0.000296 | 1.70 | 0.00110 | 38.20 | -45.80 | 760.1 | 485.3 | 607.4 |
| 1 | 0.00295 | 0.00139 | 0.58 | <0.0017 | 3.80 | -9.20 | 800.2 | 400.2 | 493.3 |
| 2 | 0.00098 | <0.00010 | 11.70 | 0.0001574 | 16.46 | -30.21 | 692.7 | 400.2 | 525.6 |
| 3 | 0.00197 | 0.000314 | 0.92 | 0.00066 | 14.50 | -14.60 | 800.2 | 441.8 | 657.5 |
| 4 | 0.00492 | 0.000468 | 5.20 | 0.002073 | 4.27 | -11.31 | 759.2 | 400.2 | 483.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 595 | 0.00197 | 0.000608 | 1.26 | <0.00072 | -1.10 | 3.30 | 800.2 | 400.2 | 800.2 |
| 596 | 0.00295 | 0.00063 | 1.10 | 0.00034 | 3.90 | -11.80 | 732.8 | 400.2 | 471.5 |
| 597 | 0.00197 | 0.00144 | 0.88 | <0.0018 | 46.20 | -211.00 | 495.5 | 402.2 | 446.4 |
| 598 | 0.00885 | 0.00140 | 1.33 | 0.00153 | 6.49 | -20.90 | 651.8 | 400.2 | 467.6 |
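
Some entries in the output above, such as `<0.0017` in `scat_time` and `<0.00010` in `width_fitb`, are upper limits rather than measurements, so these columns cannot be fed directly to a numerical algorithm. The following is a minimal sketch of one way to separate the number from the limit flag, assuming the columns are stored as strings. The helper name `split_upper_limits` is hypothetical, and how the paper itself treats upper limits is not covered here.

```python
import pandas as pd


def split_upper_limits(column: pd.Series) -> pd.DataFrame:
    """Split strings such as '<0.0017' into a float and an upper-limit flag."""
    as_text = column.astype(str).str.strip()
    is_upper_limit = as_text.str.startswith("<")
    # Drop the leading '<' before converting; unparseable entries become NaN.
    value = pd.to_numeric(as_text.str.lstrip("<"), errors="coerce")
    return pd.DataFrame({"value": value, "is_upper_limit": is_upper_limit})


# Values copied from the first five rows of `scat_time` shown above.
scat_time = pd.Series(["0.00110", "<0.0017", "0.0001574", "0.00066", "0.002073"])
parsed = split_upper_limits(scat_time)
print(parsed)
```

Keeping the flag in a separate column leaves the decision open: rows with upper limits can later be dropped, kept at the limit value, or handled in some other way.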


## 1.1 Model Dependent Parameters

<!-- TODO: Implement these parameters -->
Quoted verbatim from the paper:

> - Redshift: We apply the redshift of each sub-burst as one of the input elements. The redshift is spectroscopic redshift if available; otherwise, the value is derived from the dispersion measure (see Hashimoto et al. 2020b, for details). These values are common for the sub-bursts from the same FRBs.
> - Radio Energy: This is the logarithm of radio energy integrated over 400 MHz at the emitter’s frame. The values are common for the sub-bursts from the same FRBs.
> - Rest-Frame Intrinsic Duration: This is the logarithm of the rest-frame intrinsic duration of the sub-burst. These values are different between the sub-bursts of the same FRBs.
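
Of the three quantities quoted above, the rest-frame duration is the simplest to illustrate once a redshift is available. The sketch below assumes the usual cosmological time-dilation correction, in which the observed width is divided by $(1 + z)$ before taking the base-10 logarithm. The function name `log_rest_frame_duration` and the redshift values are hypothetical, and the paper's exact procedure (following Hashimoto et al. 2020b) may differ.

```python
import numpy as np


def log_rest_frame_duration(width_obs_s, redshift):
    """log10 of the duration in the emitter's frame, assuming w_rest = w_obs / (1 + z)."""
    width_obs_s = np.asarray(width_obs_s, dtype=float)
    redshift = np.asarray(redshift, dtype=float)
    return np.log10(width_obs_s / (1.0 + redshift))


# Widths (in seconds) taken from `width_fitb` in the first two catalog rows;
# the redshifts are made-up placeholders, NOT values from the paper.
widths = [0.000296, 0.00139]
placeholder_redshifts = [0.3, 0.5]
print(log_rest_frame_duration(widths, placeholder_redshifts))
```

Redshift and radio energy are not sketched here because both depend on the dispersion-measure model referenced in the quote.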