---
title: Reproducing Chen et al. (2021)'s unsupervised machine learning
subtitle: Using UMAP to reduce dimensions and HDBSCAN to cluster datapoints
author: Murthadza Aznam
date: '2022-10-20'
---

:::{.callout-note}

 📌 Goal: This notebook tries to reproduce the results from [https://ui.adsabs.harvard.edu/abs/2022MNRAS.509.1227C/abstract](https://ui.adsabs.harvard.edu/abs/2022MNRAS.509.1227C/abstract). As stated in the paper:
 
 > Our goal is to **map several observational and model-dependent parameters of each FRB to a 2D embedding plane** by training the UMAP algorithm on the features of the training samples in CHIME/FRB dataset **and finally identify possibly misclassified non-repeating FRBs** which in fact have latent features of repeating FRBs. We define these possibly misclassified non-repeating FRBs as FRB repeater candidates in our paper.

:::

## 0. Getting the Data

### 0.1 Source
The paper uses data from the CHIME/FRB Catalog, with parameters calculated in [Hashimoto et al. 2022](https://doi.org/10.1093/mnras/stac065).

In [1]:
import pandas as pd

catalog: pd.DataFrame = pd.read_csv('../data/raw/chimefrbcat1_Hashimoto_2022.csv')
catalog

| Unnamed: 0 | tns_name | previous_name | repeater_name | ra | ra_err | ra_notes | dec | dec_err | dec_notes | gl | ... | weight_fluence_error_m | weight | weight_error_p | weight_error_m | weighted_logrhoA | weighted_logrhoA_error_p | weighted_logrhoA_error_m | weighted_logrhoB | weighted_logrhoB_error_p | weighted_logrhoB_error_m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FRB20180725A | 180725.J0613+67 | -9999 | 93.42 | 0.04 | -9999 | 67.10 | 0.20 | -9999 | 147.29 | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | FRB20180727A | 180727.J1311+26 | -9999 | 197.70 | 0.10 | -9999 | 26.40 | 0.30 | -9999 | 24.76 | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | FRB20180729A | 180729.J1316+55 | -9999 | 199.40 | 0.10 | -9999 | 55.58 | 0.08 | -9999 | 115.26 | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | FRB20180729B | 180729.J0558+56 | -9999 | 89.90 | 0.30 | -9999 | 56.50 | 0.20 | -9999 | 156.90 | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | FRB20180730A | 180730.J0353+87 | -9999 | 57.39 | 0.03 | -9999 | 87.20 | 0.20 | -9999 | 125.11 | ... |  |  |  |  |  |  |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 594 | FRB20190701A | -9999 | -9999 | 277.50 | 0.20 | -9999 | 59.00 | 0.20 | -9999 | 88.29 | ... |  |  |  |  |  |  |  |  |  |  |
| 595 | FRB20190701B | -9999 | -9999 | 302.90 | 0.20 | -9999 | 80.20 | 0.20 | -9999 | 112.88 | ... |  |  |  |  |  |  |  |  |  |  |
| 596 | FRB20190701C | -9999 | -9999 | 96.40 | 0.20 | -9999 | 81.60 | 0.30 | -9999 | 132.18 | ... |  |  |  |  |  |  |  |  |  |  |
| 597 | FRB20190701D | -9999 | -9999 | 112.10 | 0.20 | -9999 | 66.70 | 0.20 | -9999 | 149.28 | ... | 0.000649 | 1.361506 | 0.574701 | 0.206776 | 0.813142 | 0.485293 | 0.109293 | 0.894742 | 0.25519 | 0.19035 |


### 0.2 Dataset Validation

We first validate that the data is as described in the paper. According to the paper:

1. > The initial dataset includes 501 non-repeating FRB sub-bursts from 474 sources and 93 repeating FRB sub-bursts from 18 sources.
2. > The catalogue includes 535 FRBs at a frequency range between 400 and 800 MHz from **2018 July 25** to **2019 July 1**. Since a repeating FRB source provides several FRBs and each FRB might include several sub-bursts, the actual number of applying subburst samples are 501 non-repeating + 93 repeating = 594 sub-bursts.

We can verify the sub-burst counts by filtering the dataframe on the observation date and on `repeater_name`.

In [2]:
from astropy.time import Time

start: float = Time('2018-07-25').mjd
end: float = Time('2019-07-01').mjd

interval: pd.Series = (start <= catalog['mjd_400']) & (catalog['mjd_400'] <= end)
catalog: pd.DataFrame = catalog[interval]

repeating: pd.DataFrame = catalog[(catalog['repeater_name'] != "-9999")]
non_repeating: pd.DataFrame = catalog[(catalog['repeater_name'] == "-9999")]
print(f"Total repeaters\t\t: {len(repeating)}",f"Total non-repeaters\t: {len(non_repeating)}", sep="\n")
print(f"Total sub-bursts\t: {len(repeating) + len(non_repeating)}")

Total repeaters		: 93
Total non-repeaters	: 501
Total sub-bursts	: 594
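
The date filter above compares `mjd_400` against Modified Julian Dates (MJD). As a rough cross-check that does not depend on astropy, MJD can be computed as the number of days since 1858-11-17 00:00 UTC. The helper `to_mjd` below is a hypothetical name introduced only for this sketch, and it handles whole calendar days only.

```python
from datetime import date

MJD_EPOCH = date(1858, 11, 17)  # MJD 0 corresponds to 1858-11-17 00:00 UTC


def to_mjd(day: date) -> int:
    """Return the Modified Julian Date at 00:00 UTC of a calendar day."""
    return (day - MJD_EPOCH).days


start_mjd = to_mjd(date(2018, 7, 25))
end_mjd = to_mjd(date(2019, 7, 1))
print(start_mjd, end_mjd)
```

Because the upper bound is midnight at the start of 2019 July 1, a burst detected later that day has a larger `mjd_400` and would fall outside an interval built this way.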


:::{.callout-note} TODO
Apply a filter to get the number of sources.
The filter above only counts sub-bursts.
:::
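
One possible way to count sources rather than sub-bursts is sketched below on a small hypothetical table. It assumes that sub-bursts of one non-repeating burst share a `tns_name` and that all bursts of one repeater share a `repeater_name`. This is an assumption about the catalog layout, not something the paper states, so the result should be checked against the 474 and 18 sources quoted above.

```python
import pandas as pd

# Hypothetical miniature catalog; the real one has one row per sub-burst.
toy = pd.DataFrame({
    "tns_name": ["FRB_A", "FRB_A", "FRB_B", "FRB_C", "FRB_D", "FRB_E"],
    "repeater_name": ["-9999", "-9999", "-9999", "FRB_R1", "FRB_R1", "FRB_R2"],
})

is_repeater = toy["repeater_name"] != "-9999"

# Assumption: one distinct `tns_name` per non-repeating source,
# one distinct `repeater_name` per repeating source.
n_non_repeating_sources = toy.loc[~is_repeater, "tns_name"].nunique()
n_repeating_sources = toy.loc[is_repeater, "repeater_name"].nunique()

print(f"Non-repeating sources\t: {n_non_repeating_sources}")
print(f"Repeating sources\t: {n_repeating_sources}")
```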

:::{.callout-question}
What do the authors mean by "The catalogue includes 535 FRBs"?
I could not think of a filter that fits the description.
:::

## 1. Preprocessing

### 1.1 Sample Selection

These are the criteria:

1. Observed between 2018 July 25 and 2019 July 1. (Already filtered in [Validation](#02-dataset-validation))
2. We exclude the FRB sub-bursts which have neither `flux` nor `fluence` measurements.
3. The input data for unsupervised learning includes a total of 10 observational and 3 model-dependent parameters. (Described in [Parameters](#12-parameters))
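
Criterion 2 can be read as "drop a sub-burst only when both `flux` and `fluence` are missing". A minimal sketch of that reading on a hypothetical table follows. It assumes missing measurements appear as NaN, which may not hold for the real catalog if it uses a sentinel value instead.

```python
import numpy as np
import pandas as pd

# Hypothetical sub-bursts with different patterns of missing measurements.
toy = pd.DataFrame({
    "tns_name": ["FRB_A", "FRB_B", "FRB_C", "FRB_D"],
    "flux": [1.2, np.nan, 0.4, np.nan],
    "fluence": [3.0, 2.1, np.nan, np.nan],
})

# how="all" drops a row only when every column in `subset` is missing,
# i.e. when the sub-burst has neither a flux nor a fluence measurement.
kept = toy.dropna(subset=["flux", "fluence"], how="all")

print(kept["tns_name"].tolist())
```

By contrast, `dropna` with its default `how="any"` would also discard sub-bursts that lack only one of the two measurements, which is a stricter cut than the criterion as worded.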

### 1.2 Parameters

#### 1.2.1 Observational Parameters
The parameters selected by the paper are as follows:
- Boxcar Width `bc_width`
- Width of Sub-Burst `width_fitb`
- Flux `flux`
- Fluence `fluence`
- Scattering Time `scat_time`
- Spectral Index `sp_idx`
- Spectral Running `sp_run`
- Highest Frequency `high_freq`
- Lowest Frequency `low_freq`
- Peak Frequency `peak_freq`

#### 1.2.2 Model Dependent Parameters
- Redshift `z`
- Radio Energy `logE_rest_400`
- Rest-Frame Intrinsic Duration `logsubw_int_rest`

In [34]:
from typing import List

params : List[str] = [
    # Observational
    "bc_width",
    "width_fitb",
    "flux",
    "fluence",
    "scat_time",
    "sp_idx",
    "sp_run",
    "high_freq",
    "low_freq",
    "peak_freq",
    # Model dependent
    "z",
    "logE_rest_400",
    "logsubw_int_rest"
]

identifiers: List[str] = [
    "tns_name",
    "repeater_name"
]

dropna_subset = ['flux', 'fluence', 'logE_rest_400']

non_repeating = non_repeating[[*params, *identifiers]]
repeating = repeating[[*params, *identifiers]]

In [35]:
from sklearn.model_selection import train_test_split

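# train_test_split returns (train, test). With test_size=0.9 the second part holds 90%
# of the repeaters, so the names are swapped here to train on that larger part.
test, train = train_test_split(repeating, test_size=0.9)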

In [36]:
# Drop rows with a missing value in any of the columns listed in `dropna_subset`.
selected = pd.concat([train, non_repeating]).dropna(subset=dropna_subset)

## 2. UMAP

Parameters:

1. `n_neighbors = 8`
    - controls how UMAP balances the local structure against the global structure of the data manifold
2. `n_components = 2`
    - the dimensionality of the reduced embedding
3. `min_dist = 0.1`
    - the minimum distance between points in the embedding, which prevents the low-dimensional projection from clumping together

In [37]:
import umap
import seaborn as sns

model: umap.UMAP = umap.UMAP(n_neighbors=8, n_components=2, min_dist=0.1)
map = model.fit(selected[params])
test_map = map.transform(test[params])

ValueError: Input contains NaN.
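
The `ValueError` indicates that at least one value passed to UMAP is missing. A minimal sketch for locating the offending columns follows, using a hypothetical three-column table in place of `selected[params]` and `test[params]`. Dropping incomplete rows is only one possible remedy and is not necessarily what the paper does.

```python
import numpy as np
import pandas as pd

cols = ["flux", "fluence", "scat_time"]  # stand-in for `params`
toy = pd.DataFrame({
    "flux": [1.2, 0.8, 0.4, 2.0],
    "fluence": [3.0, np.nan, 1.1, 0.9],
    "scat_time": [0.001, 0.002, np.nan, np.nan],
})

# Number of missing values per feature column.
missing_per_column = toy[cols].isna().sum()
print(missing_per_column[missing_per_column > 0])

# One possible remedy: keep only rows that are complete in every feature.
complete = toy.dropna(subset=cols)
print(len(toy), "->", len(complete))
```

The same check would need to be applied to both frames in the cell above, since `fit` and `transform` each reject missing values.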

In [32]:
# The embedding has one row per sample passed to `fit` (i.e. `selected`),
# so attach the coordinates to a copy of `selected` rather than to `train`.
embedded = selected.copy()
embedded['x'] = map.embedding_[:, 0]
embedded['y'] = map.embedding_[:, 1]
embedded['color'] = ['non-repeater' if name == '-9999' else 'repeater' for name in embedded['repeater_name']]

sns.relplot(data=embedded, kind='scatter', x='x', y='y', style='color', hue='color')