# Load dataset into `snmachine`

In this notebook we show how to load data into `snmachine` by creating an instance of the `PlasticcData` class from `.csv` files.

#### Index<a name="index"></a>
1. [Import packages](#imports)
2. [Dataset paths](#paths)
3. [Create PlasticcData instance](#createPlasticc)
    1. [Select a subset](#subset) <font color=salmon>(Optional)</font>
4. [Save PlasticcData instance](#save)
    1. [Load PlasticcData instance](#load) <font color=salmon>(Optional)</font>
5. [Repeat for test dataset](#repeat)

## 1. Import packages<a name="imports"></a>

In [1]:
import os
import pickle
import sys

In [2]:
import numpy as np

In [3]:
from snmachine import sndata
from utils.plasticc_pipeline import get_directories, load_dataset

No module named 'pymultinest'

                PyMultinest not found. If you would like to use, please install
                Mulitnest with 'sh install/multinest_install.sh; source
                install/setup.sh'
                


In [4]:
%config Completer.use_jedi = False  # activate autocomplete 

## 2. Dataset paths<a name="paths"></a>

First, we need to **write** the path to the folder where the dataset and metadata are, `folder_path`.

In [5]:
folder_path = '../snmachine/example_data'

Then, **write** the name of the dataset and its metadata, respectively `data_file_name` and `metadata_file_name`.

In [6]:
data_file_name = 'plasticc_train_lightcurves.csv'
metadata_file_name = 'plasticc_train_metadata.csv'
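
Since creating the instance takes a couple of minutes, it can be worth checking first that both files exist in `folder_path`. The following is a minimal sketch using only the standard library; `find_missing_files` is a hypothetical helper introduced here for illustration and is not part of `snmachine`.

```python
import os


def find_missing_files(folder, file_names):
    """Return the file names that do not exist inside `folder`."""
    return [name for name in file_names
            if not os.path.isfile(os.path.join(folder, name))]


# Hypothetical usage with the variables defined above:
# missing = find_missing_files(folder_path,
#                              [data_file_name, metadata_file_name])
# if missing:
#     raise FileNotFoundError(f'Missing in {folder_path}: {missing}')
```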

## 3. Create PlasticcData instance<a name="createPlasticc"></a>

We now create a `PlasticcData` instance. The following cell takes $\sim2$min to run.

In [7]:
dataset = sndata.PlasticcData(folder=folder_path, data_file=data_file_name,
                              metadata_file=metadata_file_name)

Reading data...
10%
20%
30%
40%
50%
60%
70%
80%
90%
7848 objects read into memory.
This has taken 0 days 00:01:46

Reading metadata...
10%
20%
30%
40%
50%
60%
70%
80%
90%
Finished getting the metadata for 7848 objects.
This has taken 0 days 00:00:10



See the first entries of the metadata.

In [8]:
dataset.metadata.head(10)

object_id,ra,decl,ddf,hostgal_specz,hostgal_photoz,hostgal_photoz_err,distmod,mwebv,target,true_target,...,true_av,true_peakmjd,libid_cadence,tflux_u,tflux_g,tflux_r,tflux_i,tflux_z,tflux_y,object_id
615,349.0461,-61.9438,1,0.0,0.0,0.0,-9.0,0.017,92,92,...,0.0,59570.0,69,484.7,3286.7,3214.1,3039.7,2854.5,2837.0,615
713,53.0859,-27.7844,1,1.818,1.627,0.255,45.406,0.007,88,88,...,0.0,59570.0,34,108.7,117.7,119.9,149.6,147.9,150.5,713
730,33.5742,-6.5796,1,0.232,0.226,0.016,40.256,0.021,42,42,...,0.0,60444.379,9,0.0,0.0,0.0,0.0,0.0,0.0,730
745,0.1899,-45.5867,1,0.304,0.281,1.152,40.795,0.007,90,90,...,0.0,60130.453,38,0.0,0.0,0.0,0.0,0.0,0.0,745
1124,352.7113,-63.8237,1,0.193,0.241,0.018,40.417,0.024,90,90,...,0.0,60452.641,1,0.0,0.0,0.0,0.0,0.0,0.0,1124
1227,35.6836,-5.3794,1,0.0,0.0,0.0,-9.0,0.02,65,65,...,0.0,59570.0,47,2.3,11.6,31.6,240.0,632.4,1187.7,1227
1598,347.8467,-64.7609,1,0.135,0.182,0.03,39.728,0.019,90,90,...,0.0,60628.816,20,0.0,0.0,0.0,0.0,0.0,0.0,1598
1632,348.5959,-63.0726,1,0.686,0.701,0.01,43.152,0.021,42,42,...,0.051,59602.09,93,0.0,0.0,0.0,0.0,0.0,0.0,1632
1920,149.4141,3.4338,1,0.309,0.323,0.336,41.14,0.027,90,90,...,0.0,59996.625,107,0.0,0.0,0.0,0.0,0.0,0.0,1920
1926,149.4141,1.9401,1,0.0,0.0,0.0,-9.0,0.018,65,65,...,0.0,59570.0,15,16.6,130.6,450.4,2237.3,4903.2,8229.6,1926


### 3.1 Select a subset<a name="subset"></a> <font color=salmon>(Optional)</font>

Sometimes we want a subset of the dataset. Here we illustrate how to generate a `PlasticcData` instance of that subset.

In this example, we randomly choose 30 events from each of SN Ia, SN Ibc and SN II, for a total of 90 SNe. See `note2_modelNames` in [Zenodo](https://zenodo.org/record/2539456#.YGM6R2RKjAM) for the mapping between class numbers and names.

**Replace** the step below with your chosen subset, or skip it to use all events.

In [9]:
metadata = dataset.metadata
is_snia = metadata.target == 90  # SN Ia
is_snibc = metadata.target == 62  # SN Ibc
is_snii = metadata.target == 42  # SN II

In [10]:
np.random.seed(42)  # for reproducibility 

objs_to_keep = []
for is_sn in [is_snia, is_snibc, is_snii]:
    objs_to_keep.append(np.random.choice(a=metadata['object_id'][is_sn], 
                                         size=30, replace=False))
objs_to_keep = np.array(objs_to_keep).flatten()

In [11]:
print(f'We keep {len(objs_to_keep)} events.')

We keep 90 events.
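
When adapting this selection to other classes or sizes, keep in mind that drawing with `replace=False` raises a `ValueError` if a class has fewer events than the requested `size`. The sketch below shows the same per-class sampling on toy arrays with an explicit check. It is an illustration only: `sample_per_class` is a hypothetical helper, the toy arrays stand in for the `object_id` and `target` columns of the metadata, and it uses `np.random.default_rng`, so its draws will not match the ones obtained above with `np.random.seed`.

```python
import numpy as np


def sample_per_class(object_ids, targets, classes, size, seed=42):
    """Draw `size` distinct object ids from each class in `classes`."""
    object_ids = np.asarray(object_ids)
    targets = np.asarray(targets)
    rng = np.random.default_rng(seed)
    chosen = []
    for label in classes:
        ids_in_class = object_ids[targets == label]
        if len(ids_in_class) < size:
            raise ValueError(
                f'Class {label} has only {len(ids_in_class)} events; '
                f'cannot draw {size} without replacement.')
        chosen.append(rng.choice(ids_in_class, size=size, replace=False))
    return np.concatenate(chosen)


# Toy data standing in for the metadata columns `object_id` and `target`.
toy_ids = np.arange(30)
toy_targets = np.repeat([90, 62, 42], 10)
subset = sample_per_class(toy_ids, toy_targets,
                          classes=[90, 62, 42], size=3)
print(f'We keep {len(subset)} events.')
```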


Update the dataset.

In [12]:
dataset.update_dataset(objs_to_keep)

Notice how the metadata changed; it now contains only the 90 selected events.

In [13]:
dataset.metadata

object_id,ra,decl,ddf,hostgal_specz,hostgal_photoz,hostgal_photoz_err,distmod,mwebv,target,true_target,...,true_av,true_peakmjd,libid_cadence,tflux_u,tflux_g,tflux_r,tflux_i,tflux_z,tflux_y,object_id
3910,0.5895,-47.1613,1,0.197,2.677,0.593,46.727,0.009,62,62,...,0.062,60545.609,10,0.0,0.0,0.0,0.0,0.0,0.0,3910
7033,52.2070,-28.2916,1,0.083,0.085,0.007,37.941,0.007,42,42,...,0.000,60630.609,87,0.0,0.0,0.0,0.0,0.0,0.0,7033
19866,359.8148,-44.3998,1,0.261,0.288,0.023,40.851,0.009,90,90,...,0.000,59815.918,118,0.0,0.0,0.0,0.0,0.0,0.0,19866
87180,150.8203,1.6415,1,0.178,0.182,0.010,39.729,0.020,62,62,...,0.008,60418.770,60,0.0,0.0,0.0,0.0,0.0,0.0,87180
96284,152.0508,3.2844,1,0.159,2.401,0.412,46.442,0.019,42,42,...,0.000,60438.047,94,0.0,0.0,0.0,0.0,0.0,0.0,96284
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112748248,100.0195,-32.7972,0,0.140,0.154,0.012,39.323,0.076,42,42,...,0.000,60006.770,35726,0.0,0.0,0.0,0.0,0.0,0.0,112748248
122452449,16.6992,-14.4775,0,0.262,0.356,1.260,41.389,0.021,42,42,...,0.000,60577.996,17875,0.0,0.0,0.0,0.0,0.0,0.0,122452449
128451231,15.2206,-52.0297,0,0.181,2.483,1.110,46.530,0.011,90,90,...,0.000,59876.805,21794,0.0,0.0,0.0,0.0,0.0,0.0,128451231
128839699,20.2148,-15.7139,0,0.090,0.105,0.012,38.436,0.018,62,62,...,0.049,59734.602,12729,0.0,0.0,0.0,0.0,0.0,0.0,128839699


## 4. Save PlasticcData instance<a name="save"></a>

Now, **choose** a path to save the `PlasticcData` instance created (`folder_path_to_save`) and the name of the file (`file_name`).

In [14]:
folder_path_to_save = folder_path
file_name = 'example_dataset.pckl'

Finally, save the `PlasticcData` instance.

In [15]:
path_to_save = os.path.join(folder_path_to_save, file_name)
with open(path_to_save, 'wb') as f:
    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)

### 4.1 Load PlasticcData instance<a name="load"></a> <font color=salmon>(Optional)</font>

We can load the saved file to verify whether it was correctly saved.

In [16]:
saved_dataset = load_dataset(path_to_save)

Opening from binary pickle
Dataset loaded from pickle file as: <snmachine.sndata.PlasticcData object at 0x7fb141743290>
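
The printed message suggests that `load_dataset` reads the file with `pickle`, although we have not confirmed this against its source. Under that assumption, the save and load steps amount to the round trip sketched below. The dictionary is a stand-in for the `PlasticcData` instance and the temporary folder replaces `folder_path_to_save`, so the sketch runs without `snmachine` or the example data.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the `PlasticcData` instance.
toy_dataset = {'object_names': ['615', '713'], 'n_objects': 2}

with tempfile.TemporaryDirectory() as tmp_folder:
    toy_path = os.path.join(tmp_folder, 'toy_dataset.pckl')
    # Save, as in the cell of section 4.
    with open(toy_path, 'wb') as f:
        pickle.dump(toy_dataset, f, pickle.HIGHEST_PROTOCOL)
    # Load it back with plain `pickle`.
    with open(toy_path, 'rb') as f:
        restored = pickle.load(f)

print(restored == toy_dataset)
```

Unpickling a real `PlasticcData` instance requires `snmachine` to be importable in the session, and pickle files should only be loaded from trusted sources.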


The metadata of the loaded instance matches the original, as the following check confirms.

In [17]:
np.allclose(np.array(saved_dataset.metadata, dtype=float), 
            np.array(dataset.metadata, dtype=float))

True
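
One caveat when reusing this check on other datasets: `np.allclose` treats `NaN` as different from `NaN` by default, so metadata with missing values would compare as different even after a perfect round trip. A minimal sketch on toy arrays (not the notebook's metadata) shows the `equal_nan` option that handles this case.

```python
import numpy as np

# Toy array with a missing value, standing in for a metadata column.
toy_column = np.array([0.197, np.nan, 0.261])
toy_copy = toy_column.copy()

strict_match = np.allclose(toy_column, toy_copy)
nan_aware_match = np.allclose(toy_column, toy_copy, equal_nan=True)
print(strict_match, nan_aware_match)
```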

## 5. Repeat for test dataset<a name="repeat"></a>

Here we load an example test set that already contains only SN Ia, SN Ibc and SN II.

First, we need to **write** the path to the folder where the dataset and metadata are, `folder_path`.

In [18]:
folder_path = '../snmachine/example_data'

Then, **write** the name of the dataset and its metadata, respectively `data_file_name` and `metadata_file_name`.

In [19]:
data_file_name = 'sniabcii_test_lightcurves_example.csv'
metadata_file_name = 'sniabcii_test_metadata_example.csv'

We now create a `PlasticcData` instance. The following cell takes $\sim1$min to run.

In [20]:
dataset = sndata.PlasticcData(folder=folder_path, data_file=data_file_name,
                              metadata_file=metadata_file_name)

Reading data...
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
5000 objects read into memory.
This has taken 0 days 00:00:52

Reading metadata...
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Finished getting the metadata for 5000 objects.
This has taken 0 days 00:00:11



Now, **choose** a path to save the `PlasticcData` instance created (`folder_path_to_save`) and the name of the file (`file_name`).

In [21]:
folder_path_to_save = folder_path
file_name = 'example_test_dataset.pckl'

Finally, save the `PlasticcData` instance.

In [22]:
path_to_save = os.path.join(folder_path_to_save, file_name)
with open(path_to_save, 'wb') as f:
    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)

[Go back to top.](#index)

**Next notebook:** [2_preprocess_data](2_preprocess_data.ipynb)