# Formatting

When all the TARDIS simulations are done, we format the data into tensors, split them into training and testing data sets, and apply a mask to the spectra to mimic the wavelength limits of telescope spectrographs.  
I have collected all the input data and the TARDIS simulated spectra into the folder "ContSend".  


Please don't run this notebook unless you know what you are doing: it overwrites existing files, in particular the training and testing data sets, which are used in other notebooks and are directly tied to the results in the paper.  


In the "ContSend" folder, "elemList.npy" and "auxiList.npy" store the element abundances and the auxiliary data (time after explosion, luminosity, density, photosphere velocity), with the same structure as in the "1_ElemPrepare" notebook. "photList.npy" stores the photosphere temperature, "tempList.npy" stores the temperature structure, and "specList.npy" stores the simulated spectra.

In [12]:
import os
import tqdm
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
from scipy.ndimage import gaussian_filter1d
from sklearn.model_selection import train_test_split



# The Mask

Here I store the beginning and ending wavelengths of all the observed spectra on WISEREP (as of 2020) in the "minList.npy" and "maxList.npy" files.

In [19]:
minList=np.load('minList.npy')
maxList=np.load('maxList.npy')
wave=np.genfromtxt('Prim.ascii')[:,0]

In [20]:
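def Normalizer(spec,shortwave=6500,longwave=7500):
    # Divide the flux (column 1) by its mean between shortwave and longwave.
    # Note: spec is modified in place.
    small=np.argmin(abs(spec[:,0]-shortwave))
    long=np.argmin(abs(spec[:,0]-longwave))
    lo,hi=min(small,long),max(small,long)
    if lo<hi:spec[:,1]=spec[:,1]/np.average(spec[lo:hi,1])
    return spec
def windowSpec(spec):
    # Interpolate onto the common grid "wave"; pixels outside the input range become NaN.
    spFunc=interp1d(spec[:,0],spec[:,1],fill_value=np.nan,bounds_error=False)
    smFlux=spFunc(wave)
    # Normalize by the mean over the covered pixels, then flag uncovered pixels with -1.
    smFlux=smFlux/np.nanmean(smFlux)
    smFlux[np.isnan(smFlux)]=-1
    return np.array([wave,smFlux]).T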

In [21]:
Xraw=np.load('ContSend/specList.npy')
Y=np.load('ContSend/elemList.npy')
X=[]
for flux in Xraw:
    spec=np.array([wave,flux]).T
    spec=Normalizer(spec)
    X.append(spec[:,1])
X=np.array(X)



In [24]:
Xc=[]
for i in tqdm.tqdm(range(len(X))):
    spec=np.array([wave,X[i]]).T
    spec=windowSpec(spec)
    Xc.append(spec[:,1].reshape([-1,1]))
Xc=np.array(Xc)

100%|█████████████████████████████████| 112090/112090 [00:15<00:00, 7038.77it/s]


# The Formatted Data

The "Xc.npy" file now contains the TARDIS simulated spectra. NaN values are replaced with -1, and all flux values are normalized to lie approximately between 0 and 10, so no information on the total luminosity remains.

In [None]:
np.save('DataSet/110KRun/Xc.npy',Xc)
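
The formatting described above (regrid, normalize by the mean, fill uncovered pixels with -1) can be illustrated on a toy spectrum. This is only a minimal sketch: `toy_wave`, `obs_wave`, `obs_flux` and `window_spec_sketch` are made-up names for illustration, and the real grid in this notebook comes from "Prim.ascii".

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical toy grid: 2000, 3000, ..., 10000 (not the notebook's real grid).
toy_wave = np.linspace(2000.0, 10000.0, 9)

def window_spec_sketch(spec_wave, spec_flux, grid):
    """Regrid a spectrum, normalize by its mean, and flag uncovered pixels with -1."""
    func = interp1d(spec_wave, spec_flux, fill_value=np.nan, bounds_error=False)
    flux = func(grid)                 # NaN wherever the grid leaves the covered range
    flux = flux / np.nanmean(flux)    # mean taken over the covered pixels only
    flux[np.isnan(flux)] = -1
    return flux

# A flat toy spectrum that only covers 4000-8000.
obs_wave = np.linspace(4000.0, 8000.0, 5)
obs_flux = np.full(5, 3.0)
toy_out = window_spec_sketch(obs_wave, obs_flux, toy_wave)
print(toy_out)
```

Because the flux is divided by its own mean, the absolute flux level (3.0 here) disappears from the output, which is why the formatted spectra carry no total luminosity information.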

# The Mask
Here I apply the observational mask to the simulated data: the flux outside the observed wavelength range, at the beginning and the end of each spectrum, is filled with -1. The mask for each spectrum is picked at random from the WISEREP lists of minimum and maximum wavelengths.

In [28]:
for iterRun in range(1,5):
    Xc=[]
    for i in tqdm.tqdm(range(len(X))):
        flux=X[i].copy()
        # Draw a random observed wavelength range; redraw until it spans more than 100 grid points.
        while True:
            chooseWave=np.random.randint(len(minList))
            choLaW=np.argmin(np.abs(maxList[chooseWave]-wave))
            choSmW=np.argmin(np.abs(minList[chooseWave]-wave))
            if choLaW+100<choSmW:break
        # Cut the spectrum to that range, then regrid; the removed ends are filled with -1.
        spec=np.array([wave,flux.flatten()]).T
        spec=spec[choLaW:choSmW]
        spec=windowSpec(spec)
        Xc.append(spec[:,1].reshape([-1,1]))
    Xc=np.array(Xc)
    np.save('DataSet/110KRun/Xc_'+str(iterRun)+'.npy',Xc)
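
The random masking can be sketched on toy data as follows. Everything here is an assumption made for illustration: `toy_wave`, `toy_min`, `toy_max` and `apply_random_window` are made-up stand-ins for the real grid and the WISEREP lists, and the sketch slices the array directly instead of going through the interpolation used in the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-ins for the wavelength grid and the WISEREP coverage lists.
toy_wave = np.linspace(10000.0, 2000.0, 801)   # descending grid, 10 Angstrom steps
toy_min = np.array([3200.0, 3800.0, 9990.0])   # starting wavelengths
toy_max = np.array([9000.0, 7500.0, 9995.0])   # ending wavelengths

def apply_random_window(flux, grid, min_list, max_list, min_pixels=100):
    """Keep a randomly chosen observed wavelength range and set the rest to -1."""
    while True:
        k = rng.integers(len(min_list))
        i_max = np.argmin(np.abs(max_list[k] - grid))  # index of the long-wavelength end
        i_min = np.argmin(np.abs(min_list[k] - grid))  # index of the short-wavelength end
        if i_max + min_pixels < i_min:                 # reject windows that are too short
            break
    masked = np.full_like(flux, -1.0)
    masked[i_max:i_min] = flux[i_max:i_min] / flux[i_max:i_min].mean()
    return masked, k

toy_flux = np.ones_like(toy_wave)
masked, k = apply_random_window(toy_flux, toy_wave, toy_min, toy_max)
print(k, (masked != -1).sum())
```

The third toy entry covers only 5 Angstroms, so the rejection loop never selects it; this mirrors how very short observed ranges are skipped.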


# Train and Test
Here we split the data into training and testing data sets. Note that if a supernova model is in the training data set, then all spectra of this model, with or without an observational mask, must be in the training data set. The same rule applies to the testing data set. This prevents data leakage, which could cause unrealistically high testing performance.  
Finally, the data are saved into the "DataSet" folder, where they will be used to train and validate the neural networks.  
The normalization parameters, which are the mean and the standard deviation of the luminosity, time after explosion, photosphere velocity, and density profile values, are stored in "YauxNorm.npy".  
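
The leakage rule above can be checked on toy data. This is a minimal sketch with made-up names (`n_models`, `views`, `train_mask`): each toy "view" stands for one copy of the data set (unmasked or masked), and a single per-model flag is reused for all of them.

```python
import numpy as np

rng = np.random.default_rng(42)

n_models = 10  # hypothetical number of supernova models
# Five toy "views" of every model: one unmasked copy plus four masked copies.
# Each entry encodes 100 * view + model_id, so model_id = entry % 100.
views = [np.arange(n_models) + 100 * v for v in range(5)]

# One boolean flag per model, shared by every view.
train_mask = rng.random(n_models) < 0.8

train_ids = np.concatenate([v[train_mask] % 100 for v in views])
test_ids = np.concatenate([v[~train_mask] % 100 for v in views])

# No model may appear on both sides of the split.
print(set(train_ids).isdisjoint(test_ids))
```

Drawing a separate random mask for each view would break this property, since copies of the same model could then land on both sides.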

In [None]:
Y=np.load('ContSend/elemList.npy')
Yaux=np.load('ContSend/auxiList.npy')
YauxNorm=np.array([Yaux.mean(axis=0),Yaux.std(axis=0)])
Yaux=(Yaux-YauxNorm[0])/YauxNorm[1]

XcFiles=['Xc.npy','Xc_1.npy','Xc_2.npy','Xc_3.npy','Xc_4.npy']
XcList=[np.load('DataSet/110KRun/'+name) for name in XcFiles]
X_train=[]
X_test=[]
Y_train=[]
Y_test=[]
Yaux_train=[]
Yaux_test=[]

# One train/test flag per supernova model, reused for every masked copy to avoid leakage.
trainMask=np.random.choice([True,False],p=[0.8,0.2],size=len(XcList[0]))
for i in range(len(XcList)):
    X_train.append(XcList[i][trainMask])
    X_test.append(XcList[i][~trainMask])
    Y_train.append(Y[trainMask])
    Y_test.append(Y[~trainMask])
    Yaux_train.append(Yaux[trainMask])
    Yaux_test.append(Yaux[~trainMask])
X_train=np.concatenate(X_train)
X_test=np.concatenate(X_test)
Y_train=np.concatenate(Y_train)
Y_test=np.concatenate(Y_test)
Yaux_train=np.concatenate(Yaux_train)
Yaux_test=np.concatenate(Yaux_test)

# Keep only the spectra whose maximum normalized flux is below 15.
mask=(np.max(X_train,axis=(1,2))<15)
X_train=X_train[mask]
Y_train=Y_train[mask]
Yaux_train=Yaux_train[mask]
mask=(np.max(X_test,axis=(1,2))<15)
X_test=X_test[mask]
Y_test=Y_test[mask]
Yaux_test=Yaux_test[mask]

In [None]:
np.save('DataSet/110KRun/X_train.npy',X_train)
np.save('DataSet/110KRun/X_test.npy',X_test)
np.save('DataSet/110KRun/Y_train.npy',Y_train)
np.save('DataSet/110KRun/Y_test.npy',Y_test)
np.save('DataSet/110KRun/Yaux_train.npy',Yaux_train)
np.save('DataSet/110KRun/Yaux_test.npy',Yaux_test)
np.save('DataSet/110KRun/YauxNorm.npy',YauxNorm)
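
Since "YauxNorm.npy" stores the mean in row 0 and the standard deviation in row 1, the scaling can be undone later, for example on values predicted by a network. A minimal sketch on a made-up table (`toy_aux`, `toy_norm`, `toy_scaled` and `toy_restored` are hypothetical names, not files or variables from this project):

```python
import numpy as np

# Hypothetical auxiliary table: 4 models x 3 parameters, in arbitrary units.
toy_aux = np.array([[10.0, 1.0, 5.0],
                    [12.0, 2.0, 7.0],
                    [14.0, 3.0, 9.0],
                    [16.0, 4.0, 11.0]])

# Same layout as YauxNorm: row 0 holds the means, row 1 the standard deviations.
toy_norm = np.array([toy_aux.mean(axis=0), toy_aux.std(axis=0)])
toy_scaled = (toy_aux - toy_norm[0]) / toy_norm[1]

# Invert the scaling to get back to the original units.
toy_restored = toy_scaled * toy_norm[1] + toy_norm[0]
print(np.allclose(toy_restored, toy_aux))
```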