# Feature Extraction

In this notebook we follow the wavelet decomposition approach of [Lochner et al. (2016)](https://iopscience.iop.org/article/10.3847/0067-0049/225/2/31) to extract features. We also include the photometric redshift and its uncertainty as classification features.

#### Index<a name="index"></a>
1. [Import Packages](#imports)
2. [Load Dataset](#loadData)
3. [Extract Features](#features)
    1. [Fit Gaussian Processes](#gps)
    2. [Wavelet Decomposition](#waveletDecomp)
    3. [Include Redshift Information](#addZ)
    4. [Save the Features](#saveFeatures)

## 1. Import Packages<a name="imports"></a>

In [None]:
!pip install ../snmachine/

In [None]:
import collections
import os
import pickle
import sys
import time

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
import lightgbm as lgb
import scipy.stats

In [None]:
from snmachine import gps, snfeatures
from utils.plasticc_pipeline import load_dataset

In [None]:
%config Completer.use_jedi = False  # enable autocomplete

## 2. Load Dataset<a name="loadData"></a>

First, **write** the path to the folder that contains the dataset we want to extract features from, `folder_path`.

In [None]:
root_dir = '/share/hypatia/snmachine_resources/plasticc'
folder_path = os.path.join(root_dir, 'data', 'raw_data')

Then, **write** in `data_file_name` the name of the file where your dataset is saved.

In this notebook we use the dataset saved in [4_augment_data]().

In [None]:
data_file_name = 'example_dataset_aug.pckl'

Load the dataset.

In [None]:
data_path = os.path.join(folder_path, data_file_name)
dataset = load_dataset(data_path)

## 3. Extract Features<a name="features"></a>

### 3.1. Fit Gaussian Processes<a name="gps"></a>

To obtain the wavelet decomposition, we first use the GPs to interpolate all light curves onto the same time grid. We choose approximately one grid point per day and use a two-level wavelet decomposition, following [Lochner et al. (2016)](https://iopscience.iop.org/article/10.3847/0067-0049/225/2/31).
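
The interpolation step can be pictured with the minimal sketch below. It is not the `snmachine` implementation: it treats a single passband in time only, uses a zero-mean GP with a Matérn-3/2 kernel chosen purely for illustration, and fixes the hyperparameters (`amplitude`, `length_scale`) by hand instead of optimising them. The function names and the toy light curve are hypothetical.

```python
import numpy as np


def matern32(t1, t2, amplitude=1.0, length_scale=20.0):
    """Matern-3/2 covariance between two sets of times (illustrative kernel)."""
    r = np.abs(t1[:, None] - t2[None, :]) / length_scale
    return amplitude**2 * (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)


def gp_mean(obs_time, obs_flux, obs_flux_err, grid_time, **kernel_kwargs):
    """Posterior mean of a zero-mean GP evaluated at `grid_time`."""
    # Covariance of the observations, with the flux variances on the diagonal
    k_obs = matern32(obs_time, obs_time, **kernel_kwargs)
    k_obs = k_obs + np.diag(obs_flux_err**2)
    # Cross-covariance between the grid points and the observations
    k_grid = matern32(grid_time, obs_time, **kernel_kwargs)
    return k_grid @ np.linalg.solve(k_obs, obs_flux)


# Toy light curve in one passband: sparse, noisy observations
rng = np.random.default_rng(0)
obs_time = np.arange(5, 277, 10.0)
obs_flux_err = np.full_like(obs_time, 0.05)
obs_flux = np.exp(-0.5 * ((obs_time - 100) / 20) ** 2)
obs_flux = obs_flux + rng.normal(0, 0.05, size=obs_time.size)

# Common time grid shared by every light curve
grid_time = np.linspace(0, 277, 276)
grid_flux = gp_mean(obs_time, obs_flux, obs_flux_err, grid_time)
```

Because every event is evaluated on the same `grid_time`, the interpolated fluxes of different events can be compared coefficient by coefficient in the wavelet step.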

If you have not fitted the GPs previously, **run** **<font color=Orange>A)</font>**; it follows the GP modeling of light curves described in [3_model_lightcurves]().
Otherwise, follow **<font color=Orange>B)</font>** to **read in** the previously saved GPs. 

First **write** the path to the folder where the GP files will be/were saved (`saved_gps_path`).

In [None]:
saved_gps_path = os.path.join(folder_path, os.path.splitext(data_file_name)[0])

**<font color=Orange>A)</font>** **Choose**:
- `t_min`: minimum time to evaluate the Gaussian Process Regression at.
- `t_max`: maximum time to evaluate the Gaussian Process Regression at.
- `gp_dim`: dimension of the Gaussian Process Regression. If `gp_dim` is 1, the filters are fitted independently. If `gp_dim` is 2, the Matérn kernel is used to fit the light curves in both time and wavelength.
- `number_gp`: number of points to evaluate the Gaussian Process Regression at.
- `number_processes`: number of processors to use for parallelisation (**<font color=green>optional</font>**).
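
As a quick check of the "approximately one grid point per day" choice, the sketch below computes the spacing implied by the values of `t_min`, `t_max` and `number_gp` used in this notebook. It assumes an evenly spaced grid that includes both end points; how `snmachine` builds its grid internally is not shown here.

```python
import numpy as np

t_min, t_max, number_gp = 0, 277, 276

# Evenly spaced grid including both end points (an assumption of this sketch)
grid_time = np.linspace(t_min, t_max, number_gp)
spacing = grid_time[1] - grid_time[0]
print(f'{spacing:.3f} days between grid points')  # 1.007 days between grid points
```

Note also that `number_gp = 276` is a multiple of `2**2`. Some stationary wavelet implementations, such as `pywt.swt` in PyWavelets, require the input length to be a multiple of `2**level`, so this is worth keeping in mind when changing `number_gp` for a two-level decomposition.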

In [None]:
t_min = 0
t_max = 277

gp_dim = 2
number_gp = 276
number_processes = 1

In [None]:
gps.compute_gps(dataset, number_gp=number_gp, t_min=t_min, t_max=t_max, 
                gp_dim=gp_dim, output_root=saved_gps_path, 
                number_processes=number_processes)

**<font color=Orange>B)</font>** Read in the previously saved GPs.

In [None]:
gps.read_gp_files_into_models(dataset, saved_gps_path)

### 3.2. Wavelet Decomposition<a name="waveletDecomp"></a>

Now, we perform a wavelet decomposition of the events. **Write** in `saved_wavelets_path` the path to the folder in which to save the wavelet features.

In [None]:
saved_wavelets_path = saved_gps_path
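
To illustrate what a two-level decomposition of one interpolated light curve produces, here is a minimal pure-NumPy sketch. It assumes a stationary (undecimated) transform, as in Lochner et al. (2016), but swaps in the Haar wavelet with periodic boundaries for simplicity, whereas this notebook uses `sym2`. The function `haar_swt` and the toy light curve are hypothetical and not part of `snmachine`.

```python
import numpy as np


def haar_swt(signal, number_decomp_levels=2):
    """Undecimated Haar transform; returns one (approx, detail) pair per level."""
    approx = np.asarray(signal, dtype=float)
    coeffs = []
    for level in range(number_decomp_levels):
        # The filter is dilated at each level instead of downsampling the signal
        step = 2 ** level
        shifted = np.roll(approx, -step)  # shifted[n] = approx[n + step]
        detail = (approx - shifted) / np.sqrt(2)
        approx = (approx + shifted) / np.sqrt(2)
        coeffs.append((approx, detail))
    return coeffs


number_gp = 276
grid_time = np.linspace(0, 277, number_gp)
grid_flux = np.exp(-0.5 * ((grid_time - 100) / 20) ** 2)  # toy passband

coeffs = haar_swt(grid_flux, number_decomp_levels=2)
# Concatenate approximation and detail coefficients of every level
wavelet_features = np.concatenate([c for pair in coeffs for c in pair])
print(wavelet_features.shape)  # (1104,)
```

Since nothing is downsampled, each level returns as many approximation and detail coefficients as there are grid points, so a single passband already yields `2 * 2 * 276 = 1104` highly redundant coefficients. This redundancy is what motivates the dimensionality reduction that follows.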

Following [Lochner et al. (2016)](https://iopscience.iop.org/article/10.3847/0067-0049/225/2/31), we then reduce the dimensionality of this wavelet space using Principal Component Analysis (PCA). Therefore, **choose** the number of PCA components to keep (`number_comps`) and **write** the path to the folder in which to save the reduced wavelets (`saved_reduced_wavelets_path`).

In [None]:
number_comps = 40
saved_reduced_wavelets_path = saved_gps_path

**<font color=Orange>A)</font>** Perform the wavelet decomposition and dimensionality reduction.

In [None]:
wf = snfeatures.WaveletFeatures(output_root=saved_wavelets_path)

reduced_wavelet_features = wf.compute_reduced_features(
    dataset, number_comps=number_comps, 
    **{'wavelet_name': 'sym2', 'number_decomp_levels': 2,
       'path_save_eigendecomp': saved_reduced_wavelets_path})
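
The dimensionality reduction can be sketched with plain NumPy as below. This is a generic PCA written for illustration, not the code that `compute_reduced_features` runs; the helper names (`fit_pca`, `project`) and the random stand-in for the wavelet feature matrix are hypothetical.

```python
import numpy as np


def fit_pca(feature_space, number_comps):
    """Return the feature means and the leading principal directions."""
    means = feature_space.mean(axis=0)
    centered = feature_space - means
    # Rows of `vt` are principal directions, sorted by decreasing variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvectors = vt[:number_comps].T
    eigenvalues = singular_values[:number_comps] ** 2 / (len(feature_space) - 1)
    return means, eigenvectors, eigenvalues


def project(feature_space, means, eigenvectors):
    """Project (new) events onto a previously fitted PCA space."""
    return (feature_space - means) @ eigenvectors


rng = np.random.default_rng(0)
toy_wavelets = rng.normal(size=(200, 1104))  # 200 events, 1104 coefficients

number_comps = 40
means, eigenvectors, eigenvalues = fit_pca(toy_wavelets, number_comps)
reduced = project(toy_wavelets, means, eigenvectors)
print(reduced.shape)  # (200, 40)
```

Keeping `means` and `eigenvectors` separate from the projection mirrors the reason for saving the eigendecomposition: events that were not used to fit the PCA can later be projected onto the same lower-dimensional space.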

If you previously calculated the wavelet decomposition of the events and only want to project them onto the lower-dimensional space saved in `saved_reduced_wavelets_path`, run **<font color=Orange>B)</font>**.

**<font color=Orange>B)</font>** Project previously calculated wavelet features onto a lower-dimensional space.

```python
wf = snfeatures.WaveletFeatures(output_root=saved_wavelets_path)
feature_space = wf.load_feature_space(dataset)

reduced_wavelet_features = wf.project_to_space(
    feature_space, path_saved_eigendecomp=saved_reduced_wavelets_path,
    number_comps=number_comps)
```

Save the reduced features.

In [None]:
wf.save_reduced_features(reduced_wavelet_features, saved_reduced_wavelets_path)

### 3.3. Include Redshift Information<a name="addZ"></a>

In [paper]() we found that the photometric redshift and its uncertainty are crucial for classification. Therefore, in the cell below, we include these properties as features. **Modify** it to include other properties as features.

In [None]:
features = reduced_wavelet_features.copy()  # only the wavelet features

metadata = dataset.metadata
features['hostgal_photoz'] = metadata.hostgal_photoz.values.astype(float)
features['hostgal_photoz_err'] = metadata.hostgal_photoz_err.values.astype(float)
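
Assigning with `.values` copies the metadata by position, so it relies on `features` and `metadata` listing the events in the same order. If that is not guaranteed in your workflow, one option is to align the metadata on the index first. The sketch below uses toy tables with hypothetical object names and assumes both tables are indexed by the object identifier.

```python
import pandas as pd

# Toy tables indexed by object name, deliberately in a different order
features = pd.DataFrame({'wavelet_0': [0.1, 0.2, 0.3]},
                        index=['obj_a', 'obj_b', 'obj_c'])
metadata = pd.DataFrame({'hostgal_photoz': [0.9, 0.5, 0.7]},
                        index=['obj_c', 'obj_a', 'obj_b'])

# Reorder the metadata rows to match the rows of `features`
aligned = metadata.loc[features.index]
features['hostgal_photoz'] = aligned['hostgal_photoz'].to_numpy(dtype=float)
print(features['hostgal_photoz'].tolist())  # [0.5, 0.7, 0.9]
```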

### 3.4. Save the Features<a name="saveFeatures"></a>

**Write** in `saved_features_path` the path to the folder in which to save the final set of features.

In [None]:
saved_features_path = saved_gps_path

Save the features and the class of the events.

In [None]:
features.to_pickle(os.path.join(saved_features_path, 'features.pckl'))

data_labels = dataset.labels.astype(int)  # class label of each event
data_labels.to_pickle(os.path.join(saved_features_path, 'data_labels.pckl'))
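
To reuse the saved files later, for example in a classification notebook, they can be read back with `pandas.read_pickle`. The sketch below shows the round trip on toy data written to a temporary folder; the table contents and the series name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the real features and class labels
features = pd.DataFrame({'wavelet_0': [0.1, 0.2], 'hostgal_photoz': [0.5, 0.7]},
                        index=['obj_a', 'obj_b'])
data_labels = pd.Series([42, 90], index=features.index, name='target')

with tempfile.TemporaryDirectory() as saved_features_path:
    features_file = os.path.join(saved_features_path, 'features.pckl')
    labels_file = os.path.join(saved_features_path, 'data_labels.pckl')
    features.to_pickle(features_file)
    data_labels.to_pickle(labels_file)

    # Read the files back and check that nothing changed
    loaded_features = pd.read_pickle(features_file)
    loaded_labels = pd.read_pickle(labels_file)

print(loaded_features.equals(features) and loaded_labels.equals(data_labels))
```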

[Go back to top.](#index)