# Merge datasets


In this notebook we show how to merge two previously generated `SnanaData` datasets. In particular, we merge a DDF and a WFD `SnanaData` dataset.

#### Index<a name="index"></a>
1. [Import packages](#imports)
2. [Load datasets](#data)
    1. [Dataset 1](#data1)
    2. [Dataset 2](#data2)
    3. [Diagnostics](#diagPre) <font color=salmon>(Optional)</font>
3. [Merge datasets](#merge)
    1. [Diagnostics](#diagPost)
4. [Save SnanaData instance](#save)
    1. [Load SnanaData instance](#load) <font color=salmon>(Optional)</font>

## 1. Import packages<a name="imports"></a>

In [None]:
!pip install ../snmachine/

In [None]:
import collections
import os
import pickle
import time

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
from importlib import reload  # if we need to reload any module do `reload(module_to_reload)`
from snmachine import sndata
from utils.plasticc_pipeline import get_directories, load_dataset

In [None]:
%config Completer.use_jedi = False  # enable autocomplete

#### Aesthetic settings

In [None]:
%matplotlib inline

sns.set(font_scale=1.3, style="ticks")

## 2. Load datasets<a name="data"></a>

First, we need to **write** the path to the folder where the dataset and metadata are, `folder_path`. If the files are not in the same folder, add the correct path in each subsection.

In [None]:
# os_name = 'baseline_v2_0_paper'
# os_name = 'noroll_v2_0_paper'
os_name = 'presto_v2_0_paper'

folder_path = f'/folder/path/'

subset_name = 'train' # we only need to merge the train set

### 2.1. Dataset 1<a name="data1"></a>

**Write** the name of the file that contains the dataset and its metadata, `data_file_name`.

In [None]:
is_roll = 0
is_updated = 1

In [None]:
extra_name_to_save_1 = 'ddf'

file_id = '000'

data_file_name = f'{subset_name}_{extra_name_to_save_1}_{file_id}_gapless50.pckl'
if is_roll:
    data_file_name = f'{subset_name}_{extra_name_to_save_1}_{file_id}_roll_gapless50.pckl'
if is_updated:
    data_file_name = data_file_name[:-5] + '_updated.pckl'
data_file_name
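
The file-name logic above can also be wrapped in a small helper so that both datasets and the merged output share one naming rule. This is a minimal sketch: `build_file_name` is a hypothetical helper, not part of `snmachine`, and it only reproduces the pattern used in this notebook.

```python
def build_file_name(subset_name, survey, file_id, is_roll=False, is_updated=False):
    """Build a dataset file name following the pattern used in this notebook.

    Hypothetical helper for illustration; it is not provided by `snmachine`.
    """
    parts = [subset_name, survey, file_id]
    if is_roll:
        parts.append('roll')
    parts.append('gapless50')
    if is_updated:
        parts.append('updated')
    return '_'.join(parts) + '.pckl'


print(build_file_name('train', 'ddf', '000', is_roll=False, is_updated=True))
```

Building the name from a list of parts avoids slicing the extension off by hand (`data_file_name[:-5]`), which silently breaks if the extension changes.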

In [None]:
data_path = os.path.join(folder_path, data_file_name)
dataset_1 = load_dataset(data_path)

In [None]:
meta_1 = dataset_1.metadata

In [None]:
# Add the ddf info if the datasets do not have it yet
try:
    meta_1.ddf
except AttributeError:
    print('Add ddf column to metadata.')
    meta_1['ddf'] = extra_name_to_save_1 == 'ddf'
    dataset_1.metadata = meta_1
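
An alternative to the `try`/`except` check is an explicit column test. The sketch below assumes the metadata is a `pandas.DataFrame` and uses a toy table in place of `dataset_1.metadata`; `ensure_ddf_column` is a hypothetical helper.

```python
import pandas as pd


def ensure_ddf_column(metadata, is_ddf):
    """Return a copy of `metadata` with a boolean `ddf` column, added only if absent."""
    metadata = metadata.copy()
    if 'ddf' not in metadata.columns:
        metadata['ddf'] = is_ddf
    return metadata


# Toy metadata standing in for `dataset_1.metadata` (assumption).
toy_meta = pd.DataFrame({'object_id': ['a', 'b'], 'target': [90, 42]})
toy_meta = ensure_ddf_column(toy_meta, is_ddf=True)
print(toy_meta['ddf'].tolist())
```

Calling the helper a second time leaves an existing `ddf` column untouched, so rerunning the cell is safe.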

### 2.2. Dataset 2<a name="data2"></a>

**Write** the name of the file that contains the dataset and its metadata, `data_file_name`.

In [None]:
extra_name_to_save_2 = 'wfd'

file_id = '000'

data_file_name = f'{subset_name}_{extra_name_to_save_2}_{file_id}_gapless50.pckl'
if is_roll:
    data_file_name = f'{subset_name}_{extra_name_to_save_2}_{file_id}_roll_gapless50.pckl'
if is_updated:
    data_file_name = data_file_name[:-5] + '_updated.pckl'
data_file_name

In [None]:
data_path = os.path.join(folder_path, data_file_name)
dataset_2 = load_dataset(data_path)

In [None]:
meta_2 = dataset_2.metadata

In [None]:
# Add the ddf info if the datasets do not have it yet
try:
    meta_2.ddf
except AttributeError:
    print('Add ddf column to metadata.')
    meta_2['ddf'] = extra_name_to_save_2 == 'ddf'
    dataset_2.metadata = meta_2

Go to:
* [Index](#index)
* [Merge datasets](#merge)

### 2.3. Diagnostics<a name="diagPre"></a> <font color=salmon>(Optional)</font>

Here we simply look at some statistics to ensure the above datasets are what we expected.

In [None]:
print(f'{extra_name_to_save_1} {subset_name} ;', 
      collections.Counter(dataset_1.metadata['target']), len(dataset_1.metadata))
print(f'{extra_name_to_save_2} {subset_name} ;', 
      collections.Counter(dataset_2.metadata['target']), len(dataset_2.metadata))

PLAsTiCC-like Baseline v2.0 differs from PLAsTiCC only in the cadence. Hence, we expect the ratio between the numbers of events to be similar to the ratio between the solid angles.

I obtain about twice as many DDF training set events because the DDF solid angle doubles in the Baseline v2.0 cadence. This doubling is due to the dithers, with a small additional contribution from correctly including the WFD events that fall into the DDF area. I am happy with this, so I will now merge the WFD and DDF training sets.

In [None]:
print('DDF')
print('Baseline v2.0/ PLAsTiCC')
print(0.030/0.01451)
print(len(dataset_1.metadata)/1270)

In [None]:
print('WFD')
print('Baseline v2.0/ PLAsTiCC')
print(5.694/5.468)
print(len(dataset_2.metadata)/2720)
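
The comparison in the two cells above can be written as one reusable check. This is only a sketch: the solid angles and PLAsTiCC training-set sizes are the values quoted above, while the new event counts (`n_ddf_new`, `n_wfd_new`) are made-up placeholders to be replaced by `len(dataset_1.metadata)` and `len(dataset_2.metadata)`.

```python
def relative_difference(observed_ratio, expected_ratio):
    """Fractional difference between the observed and expected ratios."""
    return abs(observed_ratio - expected_ratio) / expected_ratio


# Solid-angle ratios (Baseline v2.0 / PLAsTiCC), values quoted in this notebook.
ddf_expected = 0.030 / 0.01451
wfd_expected = 5.694 / 5.468

# Placeholder event counts (assumption); use the real dataset sizes instead.
n_ddf_new, n_wfd_new = 2600, 2830

# PLAsTiCC training-set sizes used in the cells above.
ddf_observed = n_ddf_new / 1270
wfd_observed = n_wfd_new / 2720

print(f'DDF: expected {ddf_expected:.3f}, observed {ddf_observed:.3f}, '
      f'difference {relative_difference(ddf_observed, ddf_expected):.1%}')
print(f'WFD: expected {wfd_expected:.3f}, observed {wfd_observed:.3f}, '
      f'difference {relative_difference(wfd_observed, wfd_expected):.1%}')
```

A small relative difference supports the expectation that the event counts scale with the solid angle.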

We now check the position on the sky of the events.

In [None]:
plt.plot(meta_2.ra, meta_2.dec, '.', label='WFD')
plt.plot(meta_1.ra, meta_1.dec, '.', label='DDF')
plt.xlabel('RA')
plt.ylabel('DEC')
plt.legend()

Check the redshift distribution per class.

In [None]:
datasets_metadata = [meta_1, meta_2]
datasets_label = ['DDF', 'WFD']
datasets_bw_adjust = [.3, .3]
datasets_colors = ['C0', 'C1']
sn_type_name = {42: 'SN II', 62: 'SN Ibc', 90: 'SN Ia',
                95: 'SLSN-I', 67: 'SN Ia-91bg', 52: 'SN Iax'}
unique_types = [90, 42, 62]
unique_types2 = [90, 42, 62, 95, 67, 52]

bins = np.linspace(0, 2.2, 50)

for sn_type in unique_types:
    plt.figure()
    for i, metadata in enumerate(datasets_metadata):
        label = datasets_label[i]
        bw_adjust = datasets_bw_adjust[i]
        
        is_sn_type = (metadata['target'] == sn_type)
        sn_type_metadata = metadata[is_sn_type]
        sns.distplot(a=sn_type_metadata['hostgal_photoz'], kde=True,
                     hist=True, label=label, color=datasets_colors[i],
                     bins=bins, kde_kws={'bw_adjust':bw_adjust})
    sn_name = sn_type_name[sn_type]
    plt.title('WFD+DDF train set\n'+sn_name)
    plt.xlim(-.1, 1.2)
    #plt.ylim(0, 3.6)
    plt.xlabel('Photometric redshift')
    plt.ylabel('Density')
    plt.legend(handletextpad=.3)
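
If a numerical summary is preferred over plots, the same comparison can be sketched with a `groupby`. The two toy tables below stand in for `meta_1` and `meta_2`; the `survey` column is added only for this illustration.

```python
import pandas as pd

# Toy metadata tables standing in for `meta_1` (DDF) and `meta_2` (WFD).
toy_ddf = pd.DataFrame({'target': [90, 90, 42],
                        'hostgal_photoz': [0.3, 0.5, 0.2]})
toy_wfd = pd.DataFrame({'target': [90, 42, 42],
                        'hostgal_photoz': [0.4, 0.1, 0.3]})

# Label each table with its survey and stack them.
combined = pd.concat([toy_ddf.assign(survey='DDF'),
                      toy_wfd.assign(survey='WFD')], ignore_index=True)

# Redshift statistics per class and per survey.
summary = (combined.groupby(['target', 'survey'])['hostgal_photoz']
           .agg(['count', 'median', 'min', 'max']))
print(summary)
```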

[Go back to top.](#index)

## 3. Merge datasets<a name="merge"></a>

First, we merge the datasets.

In [None]:
new_dataset = sndata.EmptyDataset().merge_dataset(dataset_1, dataset_2)

### 3.1. Diagnostics<a name="diagPost"></a>

Now, we do some diagnostics to ensure the datasets were successfully merged.

In [None]:
print('Number of events')
print('Before')
number_1 = len(dataset_1.object_names)
number_2 = len(dataset_2.object_names)
print(f'{extra_name_to_save_1:<10}: {number_1} {np.shape(meta_1)}')
print(f'{extra_name_to_save_2:<10}: {number_2} {np.shape(meta_2)}')
name_total = 'total'
print(f'{name_total:<10}: {number_1+number_2}')
print(35*'-')
print('After')
meta_new = new_dataset.metadata
is_ddf = meta_new.ddf == 1
print(f'{extra_name_to_save_1:<10}: {np.sum(is_ddf)}')
print(f'{extra_name_to_save_2:<10}: {np.sum(~is_ddf)}')
print(f'{name_total:<10}: {len(is_ddf)} {np.shape(meta_new)}')
print(f'{name_total:<10}: {len(new_dataset.object_names)} {len(new_dataset.data)}')
print(35*'-')
print('If the counts before and after match, the merge was successful.')
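
The visual comparison above can be complemented by assertions that fail loudly. The sketch below works on toy metadata tables and uses `pd.concat` as a stand-in for the merge; it does not call `merge_dataset`. The identifier column name `object_id` and the helper `check_merge` are assumptions.

```python
import pandas as pd


def check_merge(meta_a, meta_b, meta_merged, id_col='object_id'):
    """Raise AssertionError if the merge lost, duplicated, or mislabelled events."""
    assert len(meta_merged) == len(meta_a) + len(meta_b), 'event count changed'
    assert meta_merged[id_col].is_unique, 'duplicated object identifiers'
    expected_ids = set(meta_a[id_col]) | set(meta_b[id_col])
    assert set(meta_merged[id_col]) == expected_ids, 'object identifiers differ'
    n_ddf_before = meta_a['ddf'].sum() + meta_b['ddf'].sum()
    assert meta_merged['ddf'].sum() == n_ddf_before, 'DDF count changed'
    return True


# Toy metadata standing in for `meta_1`, `meta_2` and `meta_new`.
toy_ddf = pd.DataFrame({'object_id': ['d1', 'd2'], 'ddf': [True, True]})
toy_wfd = pd.DataFrame({'object_id': ['w1', 'w2', 'w3'], 'ddf': [False] * 3})
toy_merged = pd.concat([toy_ddf, toy_wfd], ignore_index=True)

print(check_merge(toy_ddf, toy_wfd, toy_merged))
```

The uniqueness check matters because the same object could appear in both input datasets, and a count comparison alone would not reveal it.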

In [None]:
diverg_color = sns.color_palette("Set2", 6, desat=1)
sn_type_color = {42: diverg_color[1], 62: diverg_color[0], 90: diverg_color[2],
                 95: diverg_color[3], 67: diverg_color[4], 52: diverg_color[5]}

# `unique_types`, `bins` and `sn_type_name` are defined in Section 2.3;
# run that cell first.
for sn_type in unique_types:
    plt.figure()
    is_sn_type = (meta_new['target'] == sn_type)
    sn_type_metadata = meta_new[is_sn_type]
    sns.distplot(a=sn_type_metadata['hostgal_photoz'], kde=True,
                 hist=True, label='WFD+DDF', color=sn_type_color[sn_type],
                 bins=bins, kde_kws={'bw_adjust':.3})
    sn_name = sn_type_name[sn_type]
    plt.title('WFD+DDF merged train set\n'+sn_name)
    plt.xlim(-.1, 1.2)
    #plt.ylim(0, 3.6)
    plt.xlabel('Photometric redshift')
    plt.ylabel('Density')
    plt.legend(handletextpad=.3)

[Go back to top.](#index)

## 4. Save SnanaData instance<a name="save"></a>

Now, **choose** a path to save the `SnanaData` instance created (`folder_path_to_save`) and the name of the file (`file_name`).

In [None]:
folder_path_to_save = folder_path
file_name = f'train_{extra_name_to_save_1}_{extra_name_to_save_2}_{file_id}_gapless50.pckl'
if is_roll:
    file_name = f'train_{extra_name_to_save_1}_{extra_name_to_save_2}_{file_id}_roll_gapless50.pckl'
if is_updated:
    file_name = file_name[:-5] + '_updated.pckl'
file_name

Finally, save the `SnanaData` instance. 

For the PLAsTiCC-like Baseline v2.0 train set, saving takes about 10 s.

In [None]:
path_to_save = os.path.join(folder_path_to_save, file_name)
print(f'File to save in {path_to_save}')

In [None]:
time_start_saving = time.time()
with open(path_to_save, 'wb') as f:
    pickle.dump(new_dataset, f, pickle.HIGHEST_PROTOCOL)
print(f'{time.time() - time_start_saving}s')

### 4.1. Load SnanaData instance<a name="load"></a> <font color=salmon>(Optional)</font>

We can load the saved file to verify whether it was correctly saved.

In [None]:
time_start_loading = time.time()
saved_dataset = load_dataset(path_to_save)
print(f'{time.time() - time_start_loading}s')

In [None]:
metadata = saved_dataset.metadata
np.unique(metadata.target)

In [None]:
numerical_cols = ['hostgal_photoz', 'hostgal_photoz_err']

In [None]:
np.allclose(saved_dataset.metadata[numerical_cols], 
            new_dataset.metadata[numerical_cols])
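
`np.allclose` only covers the numerical columns. A stricter round-trip check compares the full metadata table, including dtypes and non-numerical columns. The sketch below pickles a toy `pandas.DataFrame` to a temporary folder rather than a real `SnanaData` instance; the column names are illustrative.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy metadata standing in for `new_dataset.metadata` (assumption).
toy_meta = pd.DataFrame({'object_id': ['a', 'b'],
                         'target': [90, 42],
                         'hostgal_photoz': [0.31, 0.12],
                         'ddf': [True, False]})

# Save and reload with the same pickle protocol used in Section 4.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_metadata.pckl')
    with open(path, 'wb') as f:
        pickle.dump(toy_meta, f, pickle.HIGHEST_PROTOCOL)
    with open(path, 'rb') as f:
        reloaded = pickle.load(f)

# Raises AssertionError if any value, dtype, index, or column differs.
pd.testing.assert_frame_equal(reloaded, toy_meta)
print('Round trip preserved the metadata.')
```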

[Go back to top.](#index)