# Predicting all physical parameters of detached binaries with a simple combined NN model
In this Jupyter Notebook we train a neural network (NN) model to predict all physical parameters of a detached binary system.
Contents:
* Environment set-up (libraries, functions)
* Data loading
* Data preparation
* Modeling (architecture and training of the NN model)
* Model evaluation
* Predictions on test data without noise
* Predictions on test data with noise

## 1. Environment set-up
* Importing libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Conv1D, MaxPooling1D
from tensorflow.keras.layers import Input, Dense, LSTM, Dropout, Flatten
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

np.random.seed(1234)
pd.set_option('display.max_rows', None)

* Defining functions for noise generation, including a generator of random noise sigma values

In [2]:
def generate_observation_sigma(space_obs_frac=0.5):
    """
    Draws the standard deviation of the noise that scatters light curve points around the "true" values provided
    in a synthetic light curve. The noise sigma is drawn from a bimodal distribution that accounts for space-based
    and Earth-based observations, which have different levels of stochastic noise.

    :param space_obs_frac: float; fraction of observations that are space based (the rest are Earth based)
    :return: float; standard deviation of the light curve noise
    """
    earth_based_sigma = 4e-3
    space_based_sigma = 2e-4
    sigma = np.random.choice([earth_based_sigma, space_based_sigma], p=[1-space_obs_frac, space_obs_frac])
    return np.random.rayleigh(sigma)

def stochastic_noise_generator(curve):
    """
    Introduces Gaussian noise into the synthetic observation provided in `curve`.

    :param curve: numpy.array; normalized light curve
    :return: Tuple(numpy.array, numpy.array); normalized light curve with added noise, array of the same shape filled with the standard deviation of the noise
    """
    sigma = generate_observation_sigma()
    return np.random.normal(curve, sigma), np.full(curve.shape, sigma)
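
To see what these two functions produce, the sketch below applies the same noise model to a made-up sinusoidal curve. It is only an illustration: the toy curve, the helper names `draw_sigma` and `add_noise`, and the use of a local `np.random.default_rng` generator (instead of the global seed set above) are assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(1234)  # local generator; an assumption made for reproducibility

def draw_sigma(rng, space_obs_frac=0.5):
    # Pick the Rayleigh scale of Earth-based (4e-3) or space-based (2e-4) noise.
    scale = rng.choice([4e-3, 2e-4], p=[1 - space_obs_frac, space_obs_frac])
    return rng.rayleigh(scale)

def add_noise(curve, rng):
    # One noise level per curve, applied to every point, as in stochastic_noise_generator.
    sigma = draw_sigma(rng)
    return rng.normal(curve, sigma), np.full(curve.shape, sigma)

# Hypothetical stand-in for a normalized 400-point light curve.
phase = np.linspace(0.0, 1.0, 400)
toy_curve = 0.9 + 0.1 * np.cos(4 * np.pi * phase)

noisy_curve, sigmas = add_noise(toy_curve, rng)
print(noisy_curve.shape, sigmas[0])
```

Every point of one curve shares the same sigma, so the returned sigma array is constant along the curve.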

## 2. Data loading
* Loading synthetic data from .pkl file

In [3]:
data = pd.read_pickle("detached_all_parameters.pkl").reset_index()

* Selecting a random sample of 100 000 records

In [4]:
data_sample = data.sample(n=100000)
data_sample.head()

| | index | id | curve | primary__t_eff | secondary__t_eff | inclination | mass_ratio | primary__surface_potential | secondary__surface_potential | t1_t2 | filter | critical_surface_potential | primary__equivalent_radius | secondary__equivalent_radius | primary__filling_factor | secondary__filling_factor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 647973 | 647973 | 10032749 | [0.48835963039886077, 0.4913287663336375, 0.49...] | 45000 | 5000 | 1.334076 | 0.6 | 20.601251 | 4.071076 | 9.0 | Bessell_B | 3.063442 | 0.049981 | 0.210724 | -49.955205 | -2.870176 |
| 1109932 | 1109932 | 16384835 | [0.5932746668156491, 0.594846617739326, 0.5994...] | 12000 | 5000 | 1.310729 | 1.111111 | 7.008254 | 5.425004 | 2.4 | SLOAN_u | 3.928447 | 0.170106 | 0.25046 | -5.570841 | -2.707013 |
| 731380 | 731380 | 10579972 | [0.0559029780103817, 0.05590501277853746, 0.05...] | 10000 | 5000 | 1.427398 | 1.666667 | 9.3676 | 5.812 | 2.0 | Bessell_U | 4.772403 | 0.130062 | 0.330428 | -7.853457 | -1.776732 |
| 692703 | 692703 | 10341530 | [0.9719087814691745, 0.9719390320925176, 0.971...] | 9000 | 5000 | 1.223879 | 1.666667 | 9.3676 | 8.640628 | 1.8 | GaiaDR2 | 4.772403 | 0.130062 | 0.209983 | -7.853457 | -6.611019 |
| 1145336 | 1145336 | 16615611 | [0.8654052436804602, 0.8654499986635169, 0.865...] | 20000 | 16000 | 1.180696 | 0.9 | 3.676833 | 19.051127 | 1.25 | Kepler | 3.585603 | 0.373969 | 0.049982 | -0.183357 | -31.082753 |


## 3. Data preparation

* Create a two-dimensional array of light curve vectors (one row per curve)

In [5]:
# Stack the per-row curve lists into one array (one row per light curve)
X = np.array(data_sample["curve"].tolist())

* Create the array of target parameters that the model will predict

In [6]:
y = np.array(data_sample[[
    "inclination",
    "mass_ratio",
    "primary__surface_potential",
    "secondary__surface_potential",
    "t1_t2",
    "critical_surface_potential",
    "primary__equivalent_radius",
    "secondary__equivalent_radius",
    "primary__filling_factor",
    "secondary__filling_factor"]])

* Splitting data into training and testing data sets in 80:20 ratio

In [7]:
X_train1, X_test, y_train1, y_test = train_test_split(X, y, test_size=0.2)

* Adding noise to the training data set: three noisy copies of each curve are generated with the functions defined earlier

In [8]:
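X_train = []
y_train = []
# Generate three independent noisy copies of each curve; the targets are repeated unchanged
for curve, target in zip(X_train1, y_train1):
    for _ in range(3):
        noisy_curve, _sigma = stochastic_noise_generator(curve)
        X_train.append(noisy_curve)
        y_train.append(target)
X_train = np.array(X_train)
y_train = np.array(y_train)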
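
The loop above makes three noisy copies of every training curve and repeats the matching target row each time. A vectorized sketch of the same idea is shown below on small random stand-in arrays. The helper `augment`, its parameters, and the local random generator are hypothetical and are not used elsewhere in the notebook.

```python
import numpy as np

def augment(curves, targets, n_copies=3, space_obs_frac=0.5, rng=None):
    # Hypothetical helper: repeat each curve n_copies times and add noise to each copy.
    rng = np.random.default_rng() if rng is None else rng
    curves_rep = np.repeat(curves, n_copies, axis=0)
    targets_rep = np.repeat(targets, n_copies, axis=0)
    # One noise level per copy, drawn from the same two-component Rayleigh mixture.
    scales = rng.choice([4e-3, 2e-4], size=len(curves_rep),
                        p=[1 - space_obs_frac, space_obs_frac])
    sigmas = rng.rayleigh(scales)
    # Broadcast each copy's sigma over all points of that copy.
    noisy = rng.normal(curves_rep, sigmas[:, None])
    return noisy, targets_rep

rng = np.random.default_rng(0)
toy_curves = rng.random((5, 400))   # stand-in for X_train1
toy_targets = rng.random((5, 10))   # stand-in for y_train1
X_aug, y_aug = augment(toy_curves, toy_targets, rng=rng)
print(X_aug.shape, y_aug.shape)
```

Because `np.repeat` keeps the copies of one curve adjacent, row `k` of the augmented targets always belongs to original curve `k // n_copies`.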

* Details about number of records in specific data sets

In [9]:
print("Number of records in dataset: ", len(data),
    "\nNumber of records in sample: ", len(X),
    "\nNumber of train data without noise: ", len(X_train1),
    "\nNumber of train data with noise: ", len(X_train),
    "\nNumber of test data without noise: ", len(X_test))

Number of records in dataset:  1300000 
Number of records in sample:  100000 
Number of train data without noise:  80000 
Number of train data with noise:  240000 
Number of test data without noise:  20000


## 4. Modeling

* Defining the neural network model architecture
    * it is a simple combined architecture with a 1D CNN layer and a recurrent LSTM layer
    * the input is a vector of shape 400x1, the output is an array of 10 predicted physical parameters
    * the best model is saved as *detached_allParams.hdf5* in the *models* folder

In [13]:
inputs = Input(shape=(400, 1))
b = Conv1D(64, kernel_size = 3, padding = "valid")(inputs)
b = MaxPooling1D(2)(b)
b = Dropout(0.2)(b)
b = LSTM(64, return_sequences=True)(b)
b = Flatten()(b)
b = Dense(64, activation='relu')(b)
x = Dense(32, activation='relu')(b)
output = Dense(10, activation='linear')(x)
model = Model(inputs=inputs, outputs=output)
model.compile(loss='mse', optimizer='adam', metrics=["mae", "mape"])

saved_model = "models/detached_allParams.hdf5"
checkpoint = ModelCheckpoint(saved_model, monitor = 'val_mae', verbose = 1, save_best_only = True, mode = 'min')
early = EarlyStopping(monitor = "val_mae", mode = "min", patience = 25)
callbacks_list = [checkpoint, early]

# summary() prints the table itself and returns None, so no print() is needed
model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 400, 1)]          0         
_________________________________________________________________
conv1d (Conv1D)              (None, 398, 64)           256       
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 199, 64)           0         
_________________________________________________________________
dropout (Dropout)            (None, 199, 64)           0         
_________________________________________________________________
lstm (LSTM)                  (None, 199, 64)           33024     
_________________________________________________________________
flatten (Flatten)            (None, 12736)             0         
_________________________________________________________________
dense (Dense)                (None, 64)               
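
The shapes and parameter counts in the summary can be checked by hand from the layer definitions. The sketch below uses the standard formulas for Conv1D, LSTM, and Dense layers with bias terms (the Keras defaults, which the model above does not override); the helper function names are made up for this illustration. The first two results agree with the 256 and 33024 printed in the summary.

```python
# Output lengths along the time axis.
conv_len = 400 - 3 + 1      # "valid" convolution with kernel_size=3
pool_len = conv_len // 2    # MaxPooling1D(2)
flat_len = pool_len * 64    # Flatten over the 64 LSTM units at every time step

def conv1d_params(kernel_size, in_channels, filters):
    # Kernel weights plus one bias per filter.
    return kernel_size * in_channels * filters + filters

def lstm_params(in_features, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (in_features + units) + units)

def dense_params(in_features, units):
    # Weight matrix plus one bias per unit.
    return in_features * units + units

counts = {
    "conv1d": conv1d_params(3, 1, 64),
    "lstm": lstm_params(64, 64),
    "dense (64 units)": dense_params(flat_len, 64),
    "dense (32 units)": dense_params(64, 32),
    "dense (10 units)": dense_params(32, 10),
}
print("time steps:", conv_len, pool_len, "flattened:", flat_len)
for name, n in counts.items():
    print(f"{name}: {n}")
print("total:", sum(counts.values()))
```

Most of the weights sit in the first Dense layer, because it is fully connected to all 12736 flattened LSTM outputs.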

* Model training
    * The model is trained for 10 epochs; since the early-stopping patience is 25 epochs, early stopping cannot trigger in this run
    * 10% of the training data is used as the validation data set
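
Keras takes the `validation_split` subset from the end of the arrays passed to `fit`, before any shuffling. A quick arithmetic sketch of the resulting sizes follows; the variable names are illustrative, and the counts are taken from the data set summary printed earlier.

```python
import math

n_train = 240_000          # noisy training curves (3 copies of 80 000 originals)
n_copies = 3
validation_split = 0.1
batch_size = 64

# Samples before this index are used for fitting, the rest for validation.
n_fit = math.floor(n_train * (1 - validation_split))
n_val = n_train - n_fit
steps_per_epoch = math.ceil(n_fit / batch_size)

# The three copies of one curve are adjacent, so the split boundary must fall
# on a multiple of n_copies to keep all copies of a curve on the same side.
copies_stay_together = (n_fit % n_copies == 0)

print(n_fit, n_val, steps_per_epoch, copies_stay_together)
```

With these numbers the boundary falls between two different original curves, so no curve has noisy copies in both the fitting and the validation subset.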

In [15]:
history = model.fit(X_train, y_train, validation_split = 0.1, epochs = 10, verbose = 1, callbacks = callbacks_list, batch_size = 64)

Epoch 1/10
Epoch 00001: val_mae improved from inf to 8.66192, saving model to models\detached_allParams.hdf5
Epoch 2/10
Epoch 00002: val_mae did not improve from 8.66192
Epoch 3/10
Epoch 00003: val_mae improved from 8.66192 to 7.64794, saving model to models\detached_allParams.hdf5
Epoch 4/10
Epoch 00004: val_mae did not improve from 7.64794
Epoch 5/10
Epoch 00005: val_mae did not improve from 7.64794
Epoch 6/10
Epoch 00006: val_mae did not improve from 7.64794
Epoch 7/10
Epoch 00007: val_mae did not improve from 7.64794
Epoch 8/10
Epoch 00008: val_mae improved from 7.64794 to 7.24666, saving model to models\detached_allParams.hdf5
Epoch 9/10
Epoch 00009: val_mae improved from 7.24666 to 6.96629, saving model to models\detached_allParams.hdf5
Epoch 10/10
Epoch 00010: val_mae improved from 6.96629 to 6.64964, saving model to models\detached_allParams.hdf5


* Loading of trained model

In [9]:
model = load_model("models/detached_allParams.hdf5")

## 5. Model evaluation

* Model evaluation on test data without added noise
* The output shows the loss (MSE) and MAE values

In [10]:
scores = model.evaluate(X_test, y_test)
print('Loss: {:.4f}, MAE: {:.4f}'.format(scores[0], scores[1]))

Loss: 980.3248, MAE: 6.5634
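
Both numbers are averages over all ten outputs, so parameters with a large numeric range (filling factors of about -50 appear in the sample shown earlier) weigh far more than, say, the inclination. The toy example below, with invented numbers, shows how a single large-scale output can dominate both metrics.

```python
import numpy as np

# Invented values: column 0 behaves like an inclination (~1.3),
# column 1 like a filling factor (tens in magnitude).
y_true = np.array([[1.30, -30.0],
                   [1.40, -10.0],
                   [1.35, -50.0]])
y_hat = np.array([[1.32, -25.0],
                  [1.38, -18.0],
                  [1.36, -44.0]])

err = y_hat - y_true
mse_per_output = np.mean(err ** 2, axis=0)
mae_per_output = np.mean(np.abs(err), axis=0)

# A single 'mse' loss or 'mae' metric reports the average over all outputs.
print("MSE per output:", mse_per_output, "overall:", mse_per_output.mean())
print("MAE per output:", mae_per_output, "overall:", mae_per_output.mean())
```

Looking at the per-output values separates a good fit of the small-scale parameters from a poor fit of the large-scale ones, which the overall averages hide.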


* Adding random noise to test data

In [11]:
X_test_n = []
y_test_n = []
for i in range(len(X_test)):
    for j in range(3):
        curve = stochastic_noise_generator(X_test[i])
        X_test_n.append(curve[0])
        y_test_n.append(y_test[i])
X_test_n = np.array(X_test_n)
y_test_n = np.array(y_test_n)

* Model evaluation on test data with added noise
* The output shows the loss (MSE) and MAE values

In [12]:
scores_n = model.evaluate(X_test_n, y_test_n)
print('Loss: {:.4f}, MAE: {:.4f}'.format(scores_n[0], scores_n[1]))

Loss: 990.9431, MAE: 6.6170


## 6. Predictions on test data without noise

* Predictions on test data without noise
* Predictions are saved into *y_pred* variable in the form of multi-dimensional array

In [13]:
y_pred = model.predict(X_test)

* Data frame created from the *y_pred* array with specified column names

In [14]:
pred_df = pd.DataFrame(y_pred,
                        columns = [
                            "p_inclination",
                            "p_mass_ratio",
                            "p_primary__surface_potential",
                            "p_secondary__surface_potential",
                            "p_t1_t2",
                            "p_critical_surface_potential",
                            "p_primary_equivalent_radius",
                            "p_secondary_equivalent_radius",
                            "p_primary_filling_factor",
                            "p_secondary_filling_factor"
                            ])
pred_df.head()

| | p_inclination | p_mass_ratio | p_primary__surface_potential | p_secondary__surface_potential | p_t1_t2 | p_critical_surface_potential | p_primary_equivalent_radius | p_secondary_equivalent_radius | p_primary_filling_factor | p_secondary_filling_factor |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.447385 | 1.397426 | 7.617913 | 7.339401 | 1.774296 | 4.291509 | 0.200144 | 0.234835 | -10.480977 | -5.427356 |
| 1 | 1.368769 | 1.541101 | 32.897808 | 39.313465 | 2.067462 | 4.323411 | 0.079443 | 0.151454 | -69.918755 | -83.33828 |
| 2 | 1.348097 | 1.281925 | 4.772484 | 4.516875 | 2.221587 | 4.046817 | 0.243484 | 0.278222 | -3.849552 | -1.336509 |
| 3 | 1.34155 | 1.59399 | 3.961015 | 5.562094 | 2.179913 | 4.422863 | 0.300006 | 0.303935 | -1.142694 | -3.159188 |
| 4 | 1.313521 | 1.387535 | 7.820548 | 7.853835 | 2.84284 | 4.085032 | 0.18563 | 0.18544 | -10.351947 | -9.845681 |


* Average values for each predicted attribute are calculated with the *mean()* function

In [15]:
pred_mean = pred_df.mean(axis=0)
pred_mean

p_inclination                      1.380883
p_mass_ratio                       1.591343
p_primary__surface_potential      17.445621
p_secondary__surface_potential    17.667978
p_t1_t2                            2.692803
p_critical_surface_potential       4.502497
p_primary_equivalent_radius        0.193741
p_secondary_equivalent_radius      0.225057
p_primary_filling_factor         -36.413895
p_secondary_filling_factor       -26.370497
dtype: float32

* Data frame created from the true values of the test data set
* Average values for each attribute are calculated with the *mean()* function

In [19]:
y_test_df = pd.DataFrame(y_test,
                        columns = [
                            "inclination",
                            "mass_ratio",
                            "primary__surface_potential",
                            "secondary__surface_potential",
                            "t1_t2",
                            "critical_surface_potential",
                            "primary_equivalent_radius",
                            "secondary_equivalent_radius",
                            "primary_filling_factor",
                            "secondary_filling_factor"
                            ])
test_mean = y_test_df.mean(axis=0)
test_mean

inclination                      1.375910
mass_ratio                       1.747564
primary__surface_potential      18.126015
secondary__surface_potential    16.289401
t1_t2                            2.675162
critical_surface_potential       4.666303
primary_equivalent_radius        0.176560
secondary_equivalent_radius      0.200172
primary_filling_factor         -34.399594
secondary_filling_factor       -23.618069
dtype: float64

* Data frame comparing the average true and predicted values; the *MAE* column holds the absolute difference between the two averages, not the per-sample mean absolute error

In [20]:
eval_pred = pd.DataFrame({'Attribute': test_mean.index,
            'AVG True values': test_mean.values,
            'AVG Pred Values': pred_mean.values,
            'MAE': abs(test_mean.values - pred_mean.values)})
eval_pred

| | Attribute | AVG True values | AVG Pred Values | MAE |
|---|---|---|---|---|
| 0 | inclination | 1.37591 | 1.380883 | 0.004973 |
| 1 | mass_ratio | 1.747564 | 1.591343 | 0.156221 |
| 2 | primary__surface_potential | 18.126015 | 17.445621 | 0.680394 |
| 3 | secondary__surface_potential | 16.289401 | 17.667978 | 1.378578 |
| 4 | t1_t2 | 2.675162 | 2.692803 | 0.017641 |
| 5 | critical_surface_potential | 4.666303 | 4.502497 | 0.163806 |
| 6 | primary_equivalent_radius | 0.17656 | 0.193741 | 0.017181 |
| 7 | secondary_equivalent_radius | 0.200172 | 0.225057 | 0.024886 |
| 8 | primary_filling_factor | -34.399594 | -36.413895 | 2.0143 |
| 9 | secondary_filling_factor | -23.618069 | -26.370497 | 2.752428 |
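
Positive and negative errors cancel when column means are compared, so the absolute difference of two means can be much smaller than the mean absolute error computed sample by sample. The sketch below contrasts the two quantities on invented numbers; the function names are hypothetical, and with the notebook's arrays one would pass `y_test` and `y_pred` instead of the toy arrays.

```python
import numpy as np

def mean_abs_error(y_true, y_hat):
    # Per-sample absolute errors, averaged for each output column.
    return np.mean(np.abs(y_true - y_hat), axis=0)

def abs_diff_of_means(y_true, y_hat):
    # What the comparison table computes: |mean(true) - mean(pred)| per column.
    return np.abs(y_true.mean(axis=0) - y_hat.mean(axis=0))

# Invented single-column example in which the errors cancel exactly.
toy_true = np.array([[1.0], [2.0], [3.0]])
toy_hat = np.array([[3.0], [2.0], [1.0]])

print("difference of means:", abs_diff_of_means(toy_true, toy_hat))
print("mean absolute error:", mean_abs_error(toy_true, toy_hat))
```

Here the difference of means is zero even though two of the three predictions are off by 2, which is why the per-sample value is the more informative measure of prediction quality.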


## 7. Prediction on test data with noise

* Prediction on test data with noise
* Predictions are saved into the *y_pred_n* variable in the form of a multi-dimensional array

In [21]:
y_pred_n = model.predict(X_test_n)

* Data frame created from the *y_pred_n* array with specified column names

In [22]:
pred_df_n = pd.DataFrame(y_pred_n,
                        columns = [
                            "p_inclination",
                            "p_mass_ratio",
                            "p_primary__surface_potential",
                            "p_secondary__surface_potential",
                            "p_t1_t2",
                            "p_critical_surface_potential",
                            "p_primary_equivalent_radius",
                            "p_secondary_equivalent_radius",
                            "p_primary_filling_factor",
                            "p_secondary_filling_factor"
                            ])
pred_df_n.head()

| | p_inclination | p_mass_ratio | p_primary__surface_potential | p_secondary__surface_potential | p_t1_t2 | p_critical_surface_potential | p_primary_equivalent_radius | p_secondary_equivalent_radius | p_primary_filling_factor | p_secondary_filling_factor |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.43487 | 1.526116 | 7.810741 | 8.243753 | 1.777654 | 4.459871 | 0.207764 | 0.243389 | -10.568967 | -7.017222 |
| 1 | 1.44708 | 1.39698 | 7.607511 | 7.338118 | 1.776797 | 4.290952 | 0.200623 | 0.234561 | -10.423254 | -5.423965 |
| 2 | 1.444691 | 1.358741 | 7.50276 | 6.912149 | 1.805448 | 4.243828 | 0.202868 | 0.234922 | -10.15129 | -4.55794 |
| 3 | 1.406243 | 1.510198 | 27.451214 | 42.401932 | 1.892819 | 4.307508 | 0.095731 | 0.098716 | -59.907612 | -89.814438 |
| 4 | 1.367107 | 1.57832 | 29.017546 | 42.870331 | 2.027571 | 4.379554 | 0.109217 | 0.125922 | -62.5896 | -90.09726 |


* Average values for each predicted attribute are calculated with the *mean()* function

In [23]:
pred_mean_n = pred_df_n.mean(axis=0)
pred_mean_n

p_inclination                      1.379663
p_mass_ratio                       1.584814
p_primary__surface_potential      17.508308
p_secondary__surface_potential    17.623581
p_t1_t2                            2.699197
p_critical_surface_potential       4.496512
p_primary_equivalent_radius        0.194844
p_secondary_equivalent_radius      0.224914
p_primary_filling_factor         -36.504875
p_secondary_filling_factor       -26.280840
dtype: float32

* Data frame created from the true values that belong to the noisy test data
* Average values for each attribute are calculated with the *mean()* function

In [26]:
y_test_df_n = pd.DataFrame(y_test_n,
                        columns = [
                            "inclination",
                            "mass_ratio",
                            "primary__surface_potential",
                            "secondary__surface_potential",
                            "t1_t2",
                            "critical_surface_potential",
                            "primary_equivalent_radius",
                            "secondary_equivalent_radius",
                            "primary_filling_factor",
                            "secondary_filling_factor"
                            ])
test_mean_n = y_test_df_n.mean(axis=0)
test_mean_n

inclination                      1.375910
mass_ratio                       1.747564
primary__surface_potential      18.126015
secondary__surface_potential    16.289401
t1_t2                            2.675162
critical_surface_potential       4.666303
primary_equivalent_radius        0.176560
secondary_equivalent_radius      0.200172
primary_filling_factor         -34.399594
secondary_filling_factor       -23.618069
dtype: float64

* Data frame comparing the average true and predicted values; the *MAE* column holds the absolute difference between the two averages, not the per-sample mean absolute error

In [27]:
eval_pred_n = pd.DataFrame({'Attribute': test_mean_n.index,
            'AVG True values': test_mean_n.values,
            'AVG Pred Values': pred_mean_n.values,
            'MAE': abs(test_mean_n.values - pred_mean_n.values)})
eval_pred_n

| | Attribute | AVG True values | AVG Pred Values | MAE |
|---|---|---|---|---|
| 0 | inclination | 1.37591 | 1.379663 | 0.003752 |
| 1 | mass_ratio | 1.747564 | 1.584814 | 0.16275 |
| 2 | primary__surface_potential | 18.126015 | 17.508308 | 0.617707 |
| 3 | secondary__surface_potential | 16.289401 | 17.623581 | 1.33418 |
| 4 | t1_t2 | 2.675162 | 2.699197 | 0.024035 |
| 5 | critical_surface_potential | 4.666303 | 4.496512 | 0.169791 |
| 6 | primary_equivalent_radius | 0.17656 | 0.194844 | 0.018284 |
| 7 | secondary_equivalent_radius | 0.200172 | 0.224914 | 0.024742 |
| 8 | primary_filling_factor | -34.399594 | -36.504875 | 2.105281 |
| 9 | secondary_filling_factor | -23.618069 | -26.28084 | 2.662771 |
