## Prediction of normalized physical parameters of detached data with a simple combined NN model
In this Jupyter Notebook we train a neural network (NN) model to predict the normalized physical parameters of detached binary systems. Contents:
* Libraries, functions
* Data preparation
* Create architecture of NN model
* Evaluation of model
* Predictions
* Evaluation of predictions

## 1. Environment set-up
* Importing libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Layers, models and callbacks are all imported from tensorflow.keras to avoid mixing two Keras packages
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Input, Dense, LSTM, Dropout, Flatten
from tensorflow.keras.models import Model, load_model

np.random.seed(1234)
pd.set_option('display.max_rows', None)

* Defining functions for noise generation, including a random generator of the noise sigma.
* Inputs:
    * *generate_observation_sigma(space_obs_frac=0.5)* - fraction of space based observations (default 0.5)
    * *stochastic_noise_generator(curve)* - vector of light curve points

In [2]:
def generate_observation_sigma(space_obs_frac=0.5):
    """
    Draws a standard deviation of noise in light curve points from a "true" value provided in synthetic light curve.
    Noise sigma is drawn from bimodal distribution taking into account contributions from space based and earth based
    observations which have different levels of stochastic noise.

    :param space_obs_frac: float; fraction of space based observations (the remaining ones are earth based)
    :return: float; standard deviation of the light curve noise
    """
    earth_based_sigma = 4e-3
    space_based_sigma = 2e-4
    sigma = np.random.choice([earth_based_sigma, space_based_sigma], p=[1-space_obs_frac, space_obs_frac])
    return np.random.rayleigh(sigma)

def stochastic_noise_generator(curve):
    """
    Introduces gaussian noise into synthetic observation provided in `curve`.

    :param curve: numpy.array; normalized light curve
    :return: Tuple(numpy.array, numpy.array); normalized light curve with added noise, array of the same shape filled with the noise standard deviation
    """
    sigma = generate_observation_sigma()
    return np.random.normal(curve, sigma), np.full(curve.shape, sigma)
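
To see what the two functions above produce, the following minimal sketch applies the same recipe to a toy light curve. The toy curve (a flat line with one Gaussian dip), the helper name `draw_sigma` and the explicit `rng` argument are assumptions made for illustration; the notebook itself uses the global `np.random` state and real synthetic curves.

```python
import numpy as np

def draw_sigma(rng, space_obs_frac=0.5):
    # Choose the Rayleigh scale of an earth based (4e-3) or space based (2e-4) observation,
    # then draw the actual noise standard deviation from that Rayleigh distribution.
    scale = rng.choice([4e-3, 2e-4], p=[1 - space_obs_frac, space_obs_frac])
    return rng.rayleigh(scale)

rng = np.random.default_rng(1234)

# Toy stand-in for a normalized light curve: flux 1.0 with a single dip (illustrative only).
phase = np.linspace(0.0, 1.0, 400)
toy_curve = 1.0 - 0.4 * np.exp(-((phase - 0.5) / 0.03) ** 2)

sigma = draw_sigma(rng)
noisy_curve = rng.normal(toy_curve, sigma)      # one Gaussian draw per light curve point
residual_std = np.std(noisy_curve - toy_curve)  # should be close to the drawn sigma

print(f"drawn sigma: {sigma:.2e}, std of added noise: {residual_std:.2e}")
```

Every call draws a new sigma, so repeated calls on the same curve give noise levels that cluster around the two scales.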

## 2. Data loading
* Loading synthetic data from .pkl file

In [4]:
data = pd.read_pickle("detached_all_parameters.pkl").reset_index()
data.head()

| | index | id | curve | primary__t_eff | secondary__t_eff | inclination | mass_ratio | primary__surface_potential | secondary__surface_potential | t1_t2 | filter | critical_surface_potential | primary__equivalent_radius | secondary__equivalent_radius | primary__filling_factor | secondary__filling_factor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 38 | [0.6055271686415179, 0.9842041250556204, 0.999... | 7000 | 4000 | 1.560796 | 10.0 | 110.00005 | 996.5005 | 1.75 | Bessell_U | 15.09104 | 0.009996 | 0.009996 | -145.333979 | -1502.830354 |
| 1 | 1 | 38 | [0.608985656265516, 0.9846965713304289, 0.9998... | 7000 | 4000 | 1.560796 | 10.0 | 110.00005 | 996.5005 | 1.75 | Bessell_B | 15.09104 | 0.009996 | 0.009996 | -145.333979 | -1502.830354 |
| 2 | 2 | 38 | [0.6189025614226916, 0.9837351924934223, 0.999... | 7000 | 4000 | 1.560796 | 10.0 | 110.00005 | 996.5005 | 1.75 | Bessell_V | 15.09104 | 0.009996 | 0.009996 | -145.333979 | -1502.830354 |
| 3 | 3 | 38 | [0.6292771409565273, 0.9832675811171884, 0.999... | 7000 | 4000 | 1.560796 | 10.0 | 110.00005 | 996.5005 | 1.75 | Bessell_R | 15.09104 | 0.009996 | 0.009996 | -145.333979 | -1502.830354 |
| 4 | 4 | 38 | [0.6543378609145588, 0.9835188424579704, 0.999... | 7000 | 4000 | 1.560796 | 10.0 | 110.00005 | 996.5005 | 1.75 | Bessell_I | 15.09104 | 0.009996 | 0.009996 | -145.333979 | -1502.830354 |


* Selecting a random sample of 100 000 records

In [5]:
data_sample = data.sample(n=100000)

## 3. Data preparation

* Create multi-dimensional array of vectors of light curves

In [6]:
# Stack the per-row light curve vectors into one array
X = np.array(data_sample["curve"].tolist())
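
The stacking step can be checked on a small example. The sketch below uses a hypothetical five-row data frame with 400-point curves (the length is taken from the model input shape used later) and also shows how a channel axis can be added for 1D convolution layers. Whether the stored curves already carry such an axis is not shown in the notebook, so treat the last step as an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for data_sample: five rows, each holding a 400-point curve as a list.
toy_sample = pd.DataFrame({"curve": [rng.random(400).tolist() for _ in range(5)]})

# Stack the per-row lists into one (n_curves, n_points) array.
X_toy = np.array(toy_sample["curve"].tolist())

# Conv1D layers expect (n_curves, n_points, n_channels); add a single channel axis.
X_toy_3d = X_toy[..., np.newaxis]

print(X_toy.shape, X_toy_3d.shape)
```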

* Create an array of the target parameters that the model will predict

In [7]:
y = np.array(data_sample[[
    "primary__t_eff",
    "secondary__t_eff",
    "inclination",
    "mass_ratio",
    "primary__surface_potential",
    "secondary__surface_potential",
    "t1_t2",
    "critical_surface_potential",
    "primary__equivalent_radius",
    "secondary__equivalent_radius",
    "primary__filling_factor",
    "secondary__filling_factor"]])

* Defining the MinMax scaler object
* Fitting the scaler on the target parameters and scaling each of them to the range [0, 1]

In [8]:
scaler = MinMaxScaler()
y_minmax_scaled = scaler.fit_transform(y)
y_minmax_scaled[0]

array([1.        , 0.02439024, 0.71584635, 0.05050505, 0.17208073,
       0.0020424 , 0.7804878 , 0.08409557, 0.07311887, 0.36706089,
       0.96675942, 0.99809033])
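
With its default feature range of (0, 1), *MinMaxScaler* rescales every column independently as (value - min) / (max - min). The sketch below reproduces that formula and its inverse in plain NumPy on a toy target matrix; the numbers are illustrative and are not taken from the data set.

```python
import numpy as np

# Toy target matrix: 4 samples x 2 parameters (values chosen for illustration only).
y_toy = np.array([[4000.0, 1.0],
                  [5500.0, 2.0],
                  [7000.0, 10.0],
                  [4750.0, 5.5]])

col_min = y_toy.min(axis=0)
col_max = y_toy.max(axis=0)

# Forward transform: each column is mapped linearly onto [0, 1].
y_scaled = (y_toy - col_min) / (col_max - col_min)

# Inverse transform: undo the mapping to get the physical values back.
y_restored = y_scaled * (col_max - col_min) + col_min

print(y_scaled)
```

The smallest value of each column maps to 0 and the largest to 1, which is why the first sample above contains values such as 1.0 and values close to 0.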

* Splitting data into training and testing data sets in 80:20 ratio

In [9]:
X_train1, X_test, y_train1, y_test = train_test_split(X, y_minmax_scaled, test_size=0.2)

* Adding noise into training datasets (noise generated with functions defined earlier)

In [10]:
X_train_n = []
y_train_n = []
for i in range(len(X_train1)):
    for j in range(3):
        curve = stochastic_noise_generator(X_train1[i])
        X_train_n.append(curve[0])
        y_train_n.append(y_train1[i])
X_train_n = np.array(X_train_n)
y_train_n = np.array(y_train_n)
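
The same augmentation can be written without Python loops. The following is a sketch of a vectorized variant, not the notebook's code: it uses toy arrays in place of *X_train1* and *y_train1* and a local random generator, and it inlines the sigma draw instead of calling the functions defined earlier.

```python
import numpy as np

rng = np.random.default_rng(1234)
n_copies = 3

# Toy stand-ins for X_train1 and y_train1: 4 curves of 400 points, 12 targets each.
X_clean = rng.random((4, 400))
y_clean = rng.random((4, 12))

# Repeat every curve and its targets n_copies times, keeping copies of one curve adjacent.
X_rep = np.repeat(X_clean, n_copies, axis=0)
y_aug = np.repeat(y_clean, n_copies, axis=0)

# One noise sigma per augmented curve: pick the Rayleigh scale, then draw sigma from it.
scales = rng.choice([4e-3, 2e-4], size=len(X_rep), p=[0.5, 0.5])
sigmas = rng.rayleigh(scales)

# Broadcast each sigma over all points of its curve.
X_aug = rng.normal(X_rep, sigmas[:, np.newaxis])

print(X_aug.shape, y_aug.shape)
```

As in the loop version, the three noisy copies of a curve share the same targets and sit next to each other in the output arrays.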

* Details about number of records in specific data sets

In [15]:
print("Number of records in dataset: ", len(data),
    "\nNumber of records in sample: ", len(X),
    "\nNumber of train data without noise: ", len(X_train1),
    "\nNumber of train data with noise: ", len(X_train_n),
    "\nNumber of test data without noise: ", len(X_test))

Number of records in dataset:  1300000 
Number of records in sample:  100000 
Number of train data without noise:  80000 
Number of train data with noise:  240000 
Number of test data without noise:  20000


## 4. Modeling

* Defining neural network model architecture
    * it is a simple combined architecture with a 1D CNN layer and a recurrent LSTM layer
    * the input is a vector of shape 400x1, the output is an array of shape 12x1 holding the 12 predicted physical parameters
    * the best model is saved as *norm_detached_all_params.hdf5* in the *models* folder

In [20]:
inputs = Input(shape=(400, 1))
b = Conv1D(64, kernel_size = 3, padding = "valid")(inputs)
b = MaxPooling1D(2)(b)
b = Dropout(0.2)(b)
b = LSTM(64, return_sequences=True)(b)
b = Flatten()(b)
b = Dense(64, activation='relu')(b)
x = Dense(32, activation='relu')(b)
output = Dense(12, activation='linear')(x)
model = Model(inputs=inputs, outputs=output)
model.compile(loss='mse', optimizer='adam', metrics=["mae", "mape"])

saved_model = "models/norm_detached_all_params.hdf5"
checkpoint = ModelCheckpoint(saved_model, monitor = 'val_mae', verbose = 1, save_best_only = True, mode = 'min')
early = EarlyStopping(monitor = "val_mae", mode = "min", patience = 25)
callbacks_list = [checkpoint, early]

model.summary()  # summary() prints the table itself and returns None

Model: "functional_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 400, 1)]          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 398, 64)           256       
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 199, 64)           0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 199, 64)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 199, 64)           33024     
_________________________________________________________________
flatten_1 (Flatten)          (None, 12736)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)               
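
The output shapes and parameter counts in the summary above can be reproduced by hand from the layer definitions. The sketch below does this with the standard formulas for Conv1D and LSTM layers and covers only the rows that are visible in the summary.

```python
# Shapes and parameter counts follow from the layer definitions alone.
n_points, n_channels = 400, 1
filters, kernel_size = 64, 3
lstm_units = 64

# Conv1D with "valid" padding shortens the sequence by kernel_size - 1.
conv_len = n_points - kernel_size + 1
conv_params = kernel_size * n_channels * filters + filters  # weights + biases

# MaxPooling1D(2) halves the sequence length (integer division).
pool_len = conv_len // 2

# An LSTM has 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((filters + lstm_units) * lstm_units + lstm_units)

# Flatten concatenates all time steps of the LSTM output.
flat_size = pool_len * lstm_units

print(conv_len, conv_params, pool_len, lstm_params, flat_size)
# → 398 256 199 33024 12736
```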

* Model training
    * The model is trained for 10 epochs
    * 10% of the training data is used as the validation data set
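
Keras takes the *validation_split* fraction from the end of the arrays, before any shuffling. The sketch below works out the resulting set sizes for the 240 000 noisy training curves reported above, assuming the split point sits at exactly 90% of the samples. It also checks that the split falls between two clean curves, which matters because the three noisy copies of a curve are stored next to each other.

```python
import math

n_train_noisy = 240_000  # augmented training set size reported above
n_copies = 3             # noisy copies per clean curve
batch_size = 64

# validation_split = 0.1 -> the last 10% of the samples form the validation set.
n_val = n_train_noisy // 10
n_fit = n_train_noisy - n_val

steps_per_epoch = math.ceil(n_fit / batch_size)

# The split falls on a curve boundary if n_fit is a multiple of n_copies,
# so no clean curve has noisy copies on both sides of the split.
split_on_curve_boundary = n_fit % n_copies == 0

print(n_fit, n_val, steps_per_epoch, split_on_curve_boundary)
```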

In [22]:
history = model.fit(X_train_n, y_train_n, validation_split = 0.1, epochs = 10, verbose = 1, callbacks = callbacks_list, batch_size = 64)

Epoch 1/10
Epoch 00001: val_mae improved from inf to 0.09556, saving model to models\norm_detached_all_params.hdf5
Epoch 2/10
Epoch 00002: val_mae improved from 0.09556 to 0.08633, saving model to models\norm_detached_all_params.hdf5
Epoch 3/10
Epoch 00003: val_mae improved from 0.08633 to 0.08368, saving model to models\norm_detached_all_params.hdf5
Epoch 4/10
Epoch 00004: val_mae improved from 0.08368 to 0.07832, saving model to models\norm_detached_all_params.hdf5
Epoch 5/10
Epoch 00005: val_mae improved from 0.07832 to 0.07459, saving model to models\norm_detached_all_params.hdf5
Epoch 6/10
Epoch 00006: val_mae did not improve from 0.07459
Epoch 7/10
Epoch 00007: val_mae improved from 0.07459 to 0.07284, saving model to models\norm_detached_all_params.hdf5
Epoch 8/10
Epoch 00008: val_mae improved from 0.07284 to 0.07153, saving model to models\norm_detached_all_params.hdf5
Epoch 9/10
Epoch 00009: val_mae improved from 0.07153 to 0.06848, saving model to models\norm_detached_all_par

* Loading of trained model

In [11]:
model = load_model("models/norm_detached_all_params.hdf5")

## 5. Model evaluation

* Model evaluation on test data without added noise
* The output shows the loss (MSE) and MAE values, both computed on the normalized parameters

In [12]:
scores = model.evaluate(X_test, y_test)
print('Loss: {:.4f}, MAE: {:.4f}'.format(scores[0], scores[1]))

Loss: 0.0143, MAE: 0.0676
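
For reference, the two reported numbers are defined as follows. This is a sketch of the definitions on toy arrays, not a re-computation of the result above; taking the mean over all elements matches the usual definition when every sample has the same number of outputs.

```python
import numpy as np

# Toy normalized targets and predictions: 3 samples x 2 parameters (illustrative values).
y_true = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])
y_pred = np.array([[0.1, 0.8], [0.5, 0.6], [0.7, 0.0]])

errors = y_pred - y_true

mse = np.mean(errors ** 2)     # 'mse' loss: mean squared error over all elements
mae = np.mean(np.abs(errors))  # 'mae' metric: mean absolute error over all elements
rmse = np.sqrt(mse)            # square root of the loss, same units as the targets

print(f"MSE: {mse:.4f}, MAE: {mae:.4f}, RMSE: {rmse:.4f}")
```

Because the targets are min-max scaled, an MAE of 0.0676 means that a prediction is off by roughly 7% of a parameter's range on average.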


* Adding random noise to test data

In [13]:
X_test_n = []
y_test_norm_n = []
for i in range(len(X_test)):
    for j in range(3):
        curve = stochastic_noise_generator(X_test[i])
        X_test_n.append(curve[0])
        y_test_norm_n.append(y_test[i])
X_test_n = np.array(X_test_n)
y_test_norm_n = np.array(y_test_norm_n)

* Model evaluation on test data with added noise
* The output shows the loss (MSE) and MAE values, both computed on the normalized parameters

In [14]:
scores_n = model.evaluate(X_test_n, y_test_norm_n)
print('Loss: {:.4f}, MAE: {:.4f}'.format(scores_n[0], scores_n[1]))

Loss: 0.0149, MAE: 0.0691


## 6. Predictions on test data without noise

* Predictions on test data without noise
* Predictions are saved into *y_pred_norm* variable in the form of multi-dimensional array

In [15]:
y_pred_norm = model.predict(X_test)

* Since the model predicts normalized values, we denormalize the array of predictions with the scaler's inverse transformation

In [16]:
denorm = scaler.inverse_transform(y_pred_norm)
denorm[0]

array([ 2.1713391e+04,  1.7028617e+04,  1.3953224e+00,  1.6369009e+00,
        8.6988602e+00,  1.2150850e+01,  1.1674731e+00,  4.5563955e+00,
        1.6504228e-01,  1.7309594e-01, -2.1386898e+01, -9.7847910e+00],
      dtype=float32)
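
Because the inverse transformation is linear, an absolute error in normalized units corresponds to that error multiplied by the parameter's range (max - min) in physical units. The sketch below illustrates this with hypothetical ranges for two parameters; the real ranges are held by the fitted scaler (its *data_min_* and *data_max_* attributes) and are not shown in this notebook.

```python
import numpy as np

# Hypothetical column ranges for two parameters (not taken from the notebook's data).
col_min = np.array([4000.0, 0.1])
col_max = np.array([50000.0, 10.0])
col_range = col_max - col_min

# Toy normalized prediction and target for one sample.
pred_norm_toy = np.array([0.40, 0.15])
true_norm_toy = np.array([0.38, 0.20])

# Inverse min-max transform back to physical units.
pred_phys = pred_norm_toy * col_range + col_min
true_phys = true_norm_toy * col_range + col_min

# The offset cancels, so absolute errors simply scale with the column range.
abs_err_norm = np.abs(pred_norm_toy - true_norm_toy)
abs_err_phys = np.abs(pred_phys - true_phys)

print(abs_err_phys, abs_err_norm * col_range)
```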

* We create a data frame of the denormalized predictions with specific column names

In [17]:
y_pred_denorm_df = pd.DataFrame(denorm,
                            columns = [
                                "P_prim__t_eff",
                                "P_sec__t_eff",
                                "P_inclination",
                                "P_mass_ratio",
                                "P_prim__surface_potential",
                                "P_sec__surface_potential",
                                "P_t1_t2",
                                "P_critical_surface_potential",
                                "P_primary_equivalent_radius",
                                "P_secondary_equivalent_radius",
                                "P_primary_filling_factor",
                                "P_secondary_filling_factor"
                            ])
y_pred_denorm_df.head()

| | P_prim__t_eff | P_sec__t_eff | P_inclination | P_mass_ratio | P_prim__surface_potential | P_sec__surface_potential | P_t1_t2 | P_critical_surface_potential | P_primary_equivalent_radius | P_secondary_equivalent_radius | P_primary_filling_factor | P_secondary_filling_factor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21713.390625 | 17028.617188 | 1.395322 | 1.636901 | 8.69886 | 12.15085 | 1.167473 | 4.556396 | 0.165042 | 0.173096 | -21.386898 | -9.784791 |
| 1 | 20370.228516 | 9271.480469 | 1.432273 | 1.192739 | 20.403212 | 31.860313 | 2.317413 | 3.877478 | 0.108669 | 0.06683 | -57.957951 | -66.856216 |
| 2 | 20484.207031 | 13641.383789 | 1.417863 | 1.242624 | 2.690766 | 8.863263 | 1.41286 | 4.12885 | 0.284711 | 0.271785 | 2.91847 | 4.684364 |
| 3 | 26482.248047 | 10883.576172 | 1.366974 | 1.097659 | 3.090334 | 3.780988 | 2.763849 | 3.897063 | 0.32097 | 0.324458 | 8.847303 | 9.564446 |
| 4 | 28366.121094 | 7792.821777 | 1.367452 | 1.169786 | 11.70927 | 15.829687 | 3.665346 | 3.899266 | 0.095586 | 0.143613 | -27.030075 | -17.82349 |


* Average values for each predicted attribute calculated with *mean()* function

In [18]:
pred_mean = y_pred_denorm_df.mean(axis=0)
pred_mean

P_prim__t_eff                    21964.074219
P_sec__t_eff                      9857.833984
P_inclination                        1.369297
P_mass_ratio                         1.618292
P_prim__surface_potential           17.017254
P_sec__surface_potential            17.733107
P_t1_t2                              2.610183
P_critical_surface_potential         4.525213
P_primary_equivalent_radius          0.173925
P_secondary_equivalent_radius        0.209024
P_primary_filling_factor           -28.454815
P_secondary_filling_factor         -20.850000
dtype: float32

* Data frame created from the denormalized test targets
* The average value of each attribute is calculated with the *mean()* function

In [22]:
denorm_test = scaler.inverse_transform(y_test)
y_test_norm_df = pd.DataFrame(denorm_test,
                            columns = [
                                "prim__t_eff",
                                "sec__t_eff",
                                "inclination",
                                "mass_ratio",
                                "prim__surface_potential",
                                "sec__surface_potential",
                                "t1_t2",
                                "critical_surface_potential",
                                "primary_equivalent_radius",
                                "secondary_equivalent_radius",
                                "primary_filling_factor",
                                "secondary_filling_factor"
                            ])
true_mean = y_test_norm_df.mean(axis=0)
true_mean

prim__t_eff                    22565.200000
sec__t_eff                     10129.450000
inclination                        1.375910
mass_ratio                         1.747564
prim__surface_potential           18.126015
sec__surface_potential            16.289401
t1_t2                              2.675162
critical_surface_potential         4.666303
primary_equivalent_radius          0.176560
secondary_equivalent_radius        0.200172
primary_filling_factor           -34.399594
secondary_filling_factor         -23.618069
dtype: float64

* Data frame comparing the average true and predicted values; the *MAE* column holds the absolute difference between the two averages, which is not the same as the per-sample mean absolute error

In [26]:
eval_pred = pd.DataFrame({'attribute': true_mean.index,
            'avg_true': true_mean.values,
            'avg_pred': pred_mean.values,
            'MAE': abs(true_mean.values - pred_mean.values)})
eval_pred

| | attribute | avg_true | avg_pred | MAE |
|---|---|---|---|---|
| 0 | prim__t_eff | 22565.2 | 21964.074219 | 601.125781 |
| 1 | sec__t_eff | 10129.45 | 9857.833984 | 271.616016 |
| 2 | inclination | 1.37591 | 1.369297 | 0.006613 |
| 3 | mass_ratio | 1.747564 | 1.618292 | 0.129272 |
| 4 | prim__surface_potential | 18.126015 | 17.017254 | 1.108761 |
| 5 | sec__surface_potential | 16.289401 | 17.733107 | 1.443706 |
| 6 | t1_t2 | 2.675162 | 2.610183 | 0.064979 |
| 7 | critical_surface_potential | 4.666303 | 4.525213 | 0.14109 |
| 8 | primary_equivalent_radius | 0.17656 | 0.173925 | 0.002635 |
| 9 | secondary_equivalent_radius | 0.200172 | 0.209024 | 0.008852 |
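
The *MAE* column above is computed as |avg_true - avg_pred|, so errors of opposite sign cancel out before the absolute value is taken. The sketch below contrasts this with the per-sample mean absolute error on a toy attribute; the numbers and the data frame names are illustrative assumptions.

```python
import pandas as pd

# Toy true and predicted values for one attribute (illustrative numbers only).
true_df = pd.DataFrame({"inclination": [1.30, 1.40, 1.50, 1.20]})
pred_df = pd.DataFrame({"inclination": [1.40, 1.30, 1.40, 1.30]})

# What the table above reports: absolute difference of the two column averages.
diff_of_means = (true_df.mean(axis=0) - pred_df.mean(axis=0)).abs()

# Per-sample mean absolute error: average of the row-wise absolute differences.
per_sample_mae = (true_df - pred_df).abs().mean(axis=0)

print(pd.DataFrame({"diff_of_means": diff_of_means, "per_sample_MAE": per_sample_mae}))
```

In this toy case every prediction is off by 0.1, yet the difference of the averages is zero. The same calculation could be applied to the true and predicted data frames of this notebook once their column names are aligned.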


## 7. Prediction on test data with noise

* Prediction on test data with noise
* Predictions are saved into *y_pred_norm_n* variable in the form of multi-dimensional array

In [27]:
y_pred_norm_n = model.predict(X_test_n)

* Since the model predicts normalized values, we denormalize the array of predictions with the scaler's inverse transformation

In [28]:
denorm_n = scaler.inverse_transform(y_pred_norm_n)
denorm_n[0]

array([ 2.3069885e+04,  1.7985830e+04,  1.3949709e+00,  1.7647982e+00,
        9.1150284e+00,  1.2332879e+01,  1.1173550e+00,  4.7598162e+00,
        1.5445419e-01,  1.7978655e-01, -2.1069891e+01, -8.1847010e+00],
      dtype=float32)

* We create a data frame of the denormalized predictions with specific column names

In [30]:
y_pred_denorm_n_df = pd.DataFrame(denorm_n,
                            columns = [
                                "P_prim__t_eff",
                                "P_sec__t_eff",
                                "P_inclination",
                                "P_mass_ratio",
                                "P_prim__surface_potential",
                                "P_sec__surface_potential",
                                "P_t1_t2",
                                "P_critical_surface_potential",
                                "P_primary_equivalent_radius",
                                "P_secondary_equivalent_radius",
                                "P_primary_filling_factor",
                                "P_secondary_filling_factor"
                            ])
y_pred_denorm_n_df.head()

| | P_prim__t_eff | P_sec__t_eff | P_inclination | P_mass_ratio | P_prim__surface_potential | P_sec__surface_potential | P_t1_t2 | P_critical_surface_potential | P_primary_equivalent_radius | P_secondary_equivalent_radius | P_primary_filling_factor | P_secondary_filling_factor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23069.884766 | 17985.830078 | 1.394971 | 1.764798 | 9.115028 | 12.332879 | 1.117355 | 4.759816 | 0.154454 | 0.179787 | -21.069891 | -8.184701 |
| 1 | 21730.755859 | 17053.150391 | 1.394916 | 1.641296 | 8.711674 | 12.138206 | 1.1677 | 4.561847 | 0.164768 | 0.173658 | -21.355547 | -9.754962 |
| 2 | 22437.886719 | 17392.103516 | 1.406102 | 1.690352 | 9.202767 | 14.912174 | 1.154073 | 4.652238 | 0.157752 | 0.171174 | -21.776283 | -12.940541 |
| 3 | 18172.566406 | 9790.380859 | 1.408084 | 1.414821 | 12.439978 | 30.931646 | 2.002754 | 4.23466 | 0.120293 | 0.077331 | -40.354721 | -63.336521 |
| 4 | 20089.316406 | 10899.501953 | 1.352746 | 1.216016 | 14.105429 | 27.077154 | 1.836332 | 4.010816 | 0.115363 | 0.10615 | -47.517998 | -56.508617 |
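
Since the noisy test set stores three noisy copies of every test curve one after another, rows 0 to 2 above belong to the same clean curve. Grouping the predictions by curve shows how much the noise draw alone moves each predicted parameter. The sketch below does this on toy predictions; the array names and the scatter of 0.01 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_curves, n_copies, n_params = 5, 3, 12

# Toy stand-in for the predictions: one base prediction per curve plus small scatter per copy.
base = rng.random((n_curves, 1, n_params))
scatter = rng.normal(0.0, 0.01, (n_curves, n_copies, n_params))
toy_pred = (base + scatter).reshape(-1, n_params)

# Group consecutive rows by clean curve:
# (n_curves * n_copies, n_params) -> (n_curves, n_copies, n_params).
grouped = toy_pred.reshape(n_curves, n_copies, n_params)

per_curve_mean = grouped.mean(axis=1)  # average prediction over the noisy copies
per_curve_std = grouped.std(axis=1)    # sensitivity of each parameter to the noise draw

print(per_curve_mean.shape, per_curve_std.shape)
```

This grouping relies on the copies of one curve being adjacent, which holds for the loop that builds the noisy test set.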


* Average values for each predicted attribute calculated with *mean()* function

In [31]:
n_pred_mean = y_pred_denorm_n_df.mean(axis=0)
n_pred_mean

P_prim__t_eff                    22021.871094
P_sec__t_eff                      9883.876953
P_inclination                        1.369198
P_mass_ratio                         1.614515
P_prim__surface_potential           17.051184
P_sec__surface_potential            17.807707
P_t1_t2                              2.608001
P_critical_surface_potential         4.520045
P_primary_equivalent_radius          0.173390
P_secondary_equivalent_radius        0.209630
P_primary_filling_factor           -28.575169
P_secondary_filling_factor         -20.845806
dtype: float32

* We create dataframe of denormalized test values
* Calculate average values of each attribute

In [34]:
denorm_test_n = scaler.inverse_transform(y_test_norm_n)
y_test_norm_df_n = pd.DataFrame(denorm_test_n,
                            columns = [
                                "prim__t_eff",
                                "sec__t_eff",
                                "inclination",
                                "mass_ratio",
                                "prim__surface_potential",
                                "sec__surface_potential",
                                "t1_t2",
                                "critical_surface_potential",
                                "primary_equivalent_radius",
                                "secondary_equivalent_radius",
                                "primary_filling_factor",
                                "secondary_filling_factor"
                            ])
true_mean = y_test_norm_df_n.mean(axis=0)
true_mean

prim__t_eff                    22565.200000
sec__t_eff                     10129.450000
inclination                        1.375910
mass_ratio                         1.747564
prim__surface_potential           18.126015
sec__surface_potential            16.289401
t1_t2                              2.675162
critical_surface_potential         4.666303
primary_equivalent_radius          0.176560
secondary_equivalent_radius        0.200172
primary_filling_factor           -34.399594
secondary_filling_factor         -23.618069
dtype: float64

* Data frame comparing the average true and predicted values; the *MAE* column holds the absolute difference between the two averages, which is not the same as the per-sample mean absolute error

In [35]:
n_eval_pred = pd.DataFrame({'attribute': true_mean.index,
            'avg_true': true_mean.values,
            'avg_pred': n_pred_mean.values,
            'MAE': abs(true_mean.values - n_pred_mean.values)})
n_eval_pred

| | attribute | avg_true | avg_pred | MAE |
|---|---|---|---|---|
| 0 | prim__t_eff | 22565.2 | 22021.871094 | 543.328906 |
| 1 | sec__t_eff | 10129.45 | 9883.876953 | 245.573047 |
| 2 | inclination | 1.37591 | 1.369198 | 0.006712 |
| 3 | mass_ratio | 1.747564 | 1.614515 | 0.13305 |
| 4 | prim__surface_potential | 18.126015 | 17.051184 | 1.074831 |
| 5 | sec__surface_potential | 16.289401 | 17.807707 | 1.518306 |
| 6 | t1_t2 | 2.675162 | 2.608001 | 0.067161 |
| 7 | critical_surface_potential | 4.666303 | 4.520045 | 0.146258 |
| 8 | primary_equivalent_radius | 0.17656 | 0.17339 | 0.003171 |
| 9 | secondary_equivalent_radius | 0.200172 | 0.20963 | 0.009458 |
