# Feature Extraction

In this notebook we follow the wavelet decomposition approach of [Lochner et al. (2016)](https://iopscience.iop.org/article/10.3847/0067-0049/225/2/31) to extract features. We also include the photometric redshift and its uncertainty as classification features.

#### Index<a name="index"></a>
1. [Import Packages](#imports)
2. [Load Dataset](#loadData)
3. [Extract Features](#features)
    1. [Fit Gaussian Processes](#gps)
    2. [Wavelet Decomposition](#waveletDecomp)
    3. [Include Redshift Information](#addZ)
    4. [Save the Features](#saveFeatures)
    5. [Load Features](#load)

## 1. Import Packages<a name="imports"></a>

In [1]:
import collections
import os
import pickle
import sys
import time

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [3]:
import lightgbm as lgb
import scipy.stats

In [4]:
from snmachine import gps, snfeatures
from utils.plasticc_pipeline import create_folder_structure, get_directories, load_dataset

No module named 'pymultinest'

                PyMultinest not found. If you would like to use, please install
                Mulitnest with 'sh install/multinest_install.sh; source
                install/setup.sh'
                


In [5]:
%config Completer.use_jedi = False  # enable autocomplete

## 2. Load Dataset<a name="loadData"></a>

First, **write** the path to the folder that contains the dataset we want to extract features from, `folder_path`.

In [6]:
folder_path = '../snmachine/example_data'

Then, **write** in `data_file_name` the name of the file where your dataset is saved.

In this notebook we use the dataset saved in [4_augment_data](4_augment_data.ipynb).

In [7]:
data_file_name = 'example_dataset_aug.pckl'

Load the dataset.

In [8]:
data_path = os.path.join(folder_path, data_file_name)
dataset = load_dataset(data_path)

Opening from binary pickle
Dataset loaded from pickle file as: <snmachine.sndata.PlasticcData object at 0x7faa88591b10>


## 3. Extract Features<a name="features"></a>

### 3.1. Fit Gaussian Processes<a name="gps"></a>

To obtain the wavelet decomposition, we first use GPs to interpolate all light curves onto the same time grid. We choose approximately one grid point per day and use a two-level wavelet decomposition, following [Lochner et al. (2016)](https://iopscience.iop.org/article/10.3847/0067-0049/225/2/31).
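
To make the interpolation step concrete, the following is a minimal one-dimensional sketch in plain NumPy; it is not the `snmachine` implementation. It assumes a zero-mean GP with a Matern-3/2 kernel and hand-picked hyperparameters (`length_scale`, `amplitude`). The toy light curve and the helper names `matern32` and `gp_interpolate` are hypothetical.

```python
import numpy as np


def matern32(t1, t2, length_scale=20.0, amplitude=1.0):
    """Matern-3/2 covariance between two sets of times (assumed kernel)."""
    r = np.abs(t1[:, None] - t2[None, :]) / length_scale
    return amplitude**2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)


def gp_interpolate(t_obs, flux, flux_err, t_grid, **kernel_kwargs):
    """Posterior mean of a zero-mean GP evaluated on `t_grid`."""
    # Observation noise enters on the diagonal of the training covariance.
    cov_obs = matern32(t_obs, t_obs, **kernel_kwargs) + np.diag(flux_err**2)
    cov_grid_obs = matern32(t_grid, t_obs, **kernel_kwargs)
    return cov_grid_obs @ np.linalg.solve(cov_obs, flux)


# Toy, irregularly sampled light curve (hypothetical data).
rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0, 278, size=30))
flux = np.exp(-0.5 * ((t_obs - 100) / 25) ** 2)
flux_err = np.full_like(t_obs, 0.05)

# Common grid shared by all events, roughly one point per day.
t_grid = np.linspace(0, 278, 276)
flux_grid = gp_interpolate(t_obs, flux, flux_err, t_grid)
print(flux_grid.shape)  # (276,)
```

Because every event is evaluated on the same `t_grid`, the resulting arrays have equal length and can be passed to the same wavelet transform.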

If you have not fitted the GPs previously, **run** **<font color=Orange>B1)</font>**; it follows the GP modeling of light curves described in [3_model_lightcurves]().
Otherwise, follow **<font color=Orange>B2)</font>** to **read in** the previously saved GPs.

First, **write** the path to the folder where the GP files will be (or were) saved, `path_saved_gps`. As in previous notebooks, you can choose one of two options:

**<font color=Orange>A1)</font>** Obtain GP path from folder structure.

If you created a folder structure, you can obtain the path from there. **Write** the name of the folder in `analysis_name`. 

In [9]:
analysis_name = data_file_name[:-5]
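
The slice `[:-5]` drops the five characters of `.pckl`. As a sketch of an alternative that does not depend on the length of the extension, `os.path.splitext` from the standard library gives the same result for the file name used here:

```python
import os

data_file_name = 'example_dataset_aug.pckl'

# Split the file name into its stem and its extension.
analysis_name, extension = os.path.splitext(data_file_name)
print(analysis_name)  # example_dataset_aug
print(extension)      # .pckl
```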

Create the folder structure, if needed.

In [10]:
create_folder_structure(folder_path, analysis_name)


                Folders already exist with this analysis name.

                Are you sure you would like to proceed, this will overwrite the
                example_dataset_aug folder [Y/n]
                

Please respond with 'yes' or 'no'

Obtain the required GP path.

In [11]:
directories = get_directories(folder_path, analysis_name) 
path_saved_gps = directories['intermediate_files_directory']

**<font color=Orange>A2)</font>** Directly **write** where you saved the GP files.

```python
path_saved_gps = os.path.join(folder_path, data_file_name[:-5])
```

**<font color=Orange>B1)</font>** **Choose**:
- `t_min`: minimum time to evaluate the Gaussian Process Regression at.
- `t_max`: maximum time to evaluate the Gaussian Process Regression at.
- `gp_dim`: dimension of the Gaussian Process Regression. If `gp_dim` is 1, the filters are fitted independently. If `gp_dim` is 2, the Matern kernel is used to fit light curves both in time and wavelength.
- `number_gp`: number of points to evaluate the Gaussian Process Regression at.
- `number_processes`: number of processors to use for parallelisation (**<font color=green>optional</font>**).

In [12]:
t_min = 0
t_max = 278

gp_dim = 2
number_gp = 276
number_processes = 1

In [13]:
gps.compute_gps(dataset, number_gp=number_gp, t_min=t_min, t_max=t_max, 
                gp_dim=gp_dim, output_root=path_saved_gps, 
                number_processes=number_processes)

Performing Gaussian process regression.
Models fitted with the Gaussian Processes values.
Time taken for Gaussian process regression: 5.58s.
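
As a quick check of the "approximately one grid point per day" choice, the sketch below computes the grid spacing implied by the values above. It assumes an evenly spaced grid that includes both end points, which may differ from how `snmachine` builds its grid internally.

```python
import numpy as np

t_min, t_max, number_gp = 0, 278, 276

# Evenly spaced grid including both end points (an assumption).
time_grid = np.linspace(t_min, t_max, number_gp)
spacing = time_grid[1] - time_grid[0]
print(f"{number_gp} points, spacing = {spacing:.3f} days")
# 276 points, spacing = 1.011 days
```

Note that 276 is divisible by 4, which is convenient if the two-level wavelet transform requires the signal length to be a multiple of 2 raised to the number of levels.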


**<font color=Orange>B2)</font>** Read in the previously saved GPs.

```python
gps.read_gp_files_into_models(dataset, path_saved_gps)
```

### 3.2. Wavelet Decomposition<a name="waveletDecomp"></a>

Now, we do a wavelet decomposition of the events. **Write** in `path_saved_wavelets` the path to the folder where to save them.

In [14]:
path_saved_wavelets = directories['intermediate_files_directory']
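
To illustrate what a two-level decomposition produces, here is a minimal NumPy sketch of an undecimated (stationary) wavelet transform. It makes two simplifying assumptions: it uses the Haar wavelet rather than `sym2` to keep the filters short, and it uses periodic boundaries. The function name `haar_swt` and the toy light curve are hypothetical, and this is not the `snmachine` code.

```python
import numpy as np


def haar_swt(signal, number_decomp_levels=2):
    """Undecimated Haar transform with periodic boundaries.

    Returns one (approximation, detail) pair per level; every array has
    the same length as the input signal.
    """
    approx = np.asarray(signal, dtype=float)
    coefficients = []
    for level in range(number_decomp_levels):
        shift = 2 ** level  # the filter is dilated by a factor 2 per level
        shifted = np.roll(approx, -shift)
        detail = (approx - shifted) / np.sqrt(2.0)
        approx = (approx + shifted) / np.sqrt(2.0)
        coefficients.append((approx, detail))
    return coefficients


# Toy light curve in one passband, sampled on the common time grid.
time_grid = np.linspace(0, 278, 276)
light_curve = np.exp(-0.5 * ((time_grid - 100) / 25) ** 2)

coefficients = haar_swt(light_curve, number_decomp_levels=2)
features_one_band = np.concatenate([c for pair in coefficients for c in pair])
print(features_one_band.shape)  # (1104,)
```

Under these assumptions each passband yields 4 × 276 = 1104 coefficients, which is why a dimensionality reduction step is useful afterwards.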

Following [Lochner et al. (2016)](https://iopscience.iop.org/article/10.3847/0067-0049/225/2/31), we then reduce the dimensionality of this wavelet space using Principal Component Analysis (PCA). **Choose** the number of PCA components to keep (`number_comps`) and **write** the path to the folder in which to save the reduced wavelets (`path_saved_reduced_wavelets`).

In [15]:
number_comps = 40
path_saved_reduced_wavelets = directories['features_directory']

**<font color=Orange>A)</font>** Perform the wavelet decomposition and dimensionality reduction.

In [16]:
wf = snfeatures.WaveletFeatures(output_root=path_saved_wavelets)

reduced_wavelet_features = wf.compute_reduced_features(
    dataset, number_comps=number_comps,
    wavelet_name='sym2', number_decomp_levels=2,
    path_save_eigendecomp=path_saved_reduced_wavelets)

The wavelet used is sym2.
Each passband is decomposed in 2 levels.
Performing wavelet decomposition.
Time taken for wavelet decomposition: 0.92s.
Performing eigendecomposition.
Time taken for eigendecomposition: 0.19s.
Dimensionality reduced feature space with 40 components.
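
The dimensionality reduction reported above can be sketched with a singular value decomposition in NumPy. This is an illustrative stand-in, not the `snmachine` routine: the function name `pca_reduce` and the random placeholder matrix are assumptions, and `snmachine` may compute the eigendecomposition differently.

```python
import numpy as np


def pca_reduce(features, number_comps):
    """Project `features` (events x coefficients) onto the leading components."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of `vt` are principal directions, sorted by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:number_comps]
    reduced = centered @ components.T
    explained = singular_values**2 / np.sum(singular_values**2)
    return reduced, components, explained[:number_comps]


# Placeholder wavelet feature matrix: 200 events, 500 coefficients each.
rng = np.random.default_rng(0)
wavelet_features = rng.normal(size=(200, 500))

reduced, components, explained = pca_reduce(wavelet_features, number_comps=40)
print(reduced.shape)  # (200, 40)
```

Inspecting the cumulative sum of `explained` is one way to judge whether the chosen `number_comps` retains enough of the variance.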


In [17]:
path_saved_reduced_wavelets

'../snmachine/example_data/example_dataset_aug/wavelet_features'

If you previously calculated the wavelet decomposition of the events, and are only looking to project them into a lower dimensional space saved in `path_saved_reduced_wavelets`, run **<font color=Orange>B)</font>**.

**<font color=Orange>B)</font>** Project previously calculated wavelet features onto a lower dimensional space.

```python
wf = snfeatures.WaveletFeatures(output_root=path_saved_wavelets)
feature_space = wf.load_feature_space(dataset)

reduced_wavelet_features = wf.project_to_space(
    feature_space, path_saved_eigendecomp=path_saved_reduced_wavelets,
    number_comps=10)
```

Save the reduced features.

In [18]:
wf.save_reduced_features(reduced_wavelet_features, path_saved_reduced_wavelets)

### 3.3. Include Redshift Information<a name="addZ"></a>

In [paper]() we found that photometric redshift and its uncertainty are crucial for classification. Therefore, in the cell below, we include these properties as features. **Modify** it to include other properties as features.

In [19]:
features = reduced_wavelet_features.copy()  # only the wavelet features

metadata = dataset.metadata
features['hostgal_photoz'] = metadata.hostgal_photoz.values.astype(float)
features['hostgal_photoz_err'] = metadata.hostgal_photoz_err.values.astype(float)
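
As a sketch of how further properties could be added, the example below joins several metadata columns onto the features by object identifier rather than by position. The toy tables, the object names, and the extra column `mwebv` are assumptions for illustration; replace them with columns that exist in your own metadata.

```python
import pandas as pd

# Toy stand-ins for `reduced_wavelet_features` and `dataset.metadata`.
features = pd.DataFrame(
    {'C0': [0.1, 0.2, 0.3]}, index=['obj_b', 'obj_a', 'obj_c'])
metadata = pd.DataFrame(
    {'hostgal_photoz': [0.5, 0.7, 0.9],
     'hostgal_photoz_err': [0.05, 0.07, 0.09],
     'mwebv': [0.01, 0.02, 0.03]},  # hypothetical extra property
    index=['obj_a', 'obj_b', 'obj_c'])

extra_columns = ['hostgal_photoz', 'hostgal_photoz_err', 'mwebv']

# Select the rows in the same order as `features`, then join on the index.
aligned = metadata.loc[features.index, extra_columns].astype(float)
features = features.join(aligned)
print(features.loc['obj_b', 'hostgal_photoz'])  # 0.7
```

Joining on the index guards against a silent mismatch if the metadata and the features are not stored in the same order.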

### 3.4. Save the Features<a name="saveFeatures"></a>

**Write** in `path_saved_features` the path to the folder in which to save the final set of features.

In [20]:
path_saved_features = directories['features_directory']

Save the features and the class of the events.

In [21]:
features.to_pickle(os.path.join(path_saved_features, 'features.pckl'))

data_labels = dataset.labels.astype(int)  # class label of each event
data_labels.to_pickle(os.path.join(path_saved_features, 'data_labels.pckl'))

### 3.5. Load Features<a name="load"></a> <font color=salmon>(Optional)</font>

We can load the saved files to verify whether they were correctly saved.

In [22]:
saved_features = pd.read_pickle(os.path.join(path_saved_features, 'features.pckl')) 
saved_data_labels = pd.read_pickle(os.path.join(path_saved_features, 'data_labels.pckl'))

The checks below confirm that the loaded quantities match the originals.

In [23]:
print(np.allclose(saved_features, features))
print(np.allclose(saved_data_labels, data_labels))

True
True
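
`np.allclose` compares values only. If you also want to confirm that the index, column names and dtypes survived the round trip, a stricter check is possible with the `pandas.testing` helpers. The sketch below uses small toy objects and a temporary directory; the object names and class labels are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for `features` and `data_labels`.
features = pd.DataFrame(
    {'C0': [0.1, 0.2], 'hostgal_photoz': [0.5, 0.7]},
    index=['obj_a', 'obj_b'])
data_labels = pd.Series([42, 90], index=features.index, name='target')

with tempfile.TemporaryDirectory() as tmp_dir:
    features_path = os.path.join(tmp_dir, 'features.pckl')
    labels_path = os.path.join(tmp_dir, 'data_labels.pckl')
    features.to_pickle(features_path)
    data_labels.to_pickle(labels_path)
    saved_features = pd.read_pickle(features_path)
    saved_data_labels = pd.read_pickle(labels_path)

# Both helpers raise an AssertionError if values, index, names or dtypes differ.
pd.testing.assert_frame_equal(saved_features, features)
pd.testing.assert_series_equal(saved_data_labels, data_labels)
```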


[Go back to top.](#index)

*Previous notebook:* [4_augment_data](4_augment_data.ipynb)

**Next notebook:** [6_train_classifier](6_train_classifier.ipynb)