# MLR model construction

## Data loading and data analysis

In [1]:
# import some packages
import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

In [2]:
# read in the data
data = pd.read_csv('ps_usable_hydrogen_storage_capacity_gcmcv2.csv')

## Information about the data
This dataset contains 98,694 different MOFs with 7 structural properties and usable gravimetric storage capacity (UG) and usable volumetric storage capacity (UV) to measure the hydrogen storage capacity of each MOF. 

The 7 structural properties and 2 responses in this dataset, with their units, are as follows:

Property | unit
--- | ---
Density | $$g/cm^3$$
Gravimetric surface area (GSA) | $$m^2/g$$
Volumetric surface area (VSA) | $$m^2/cm^3$$
Void fraction (VF) | dimensionless
Pore volume (PV) | $$cm^3/g$$
Largest cavity diameter (LCD) | $$Å$$
Pore limiting diameter (PLD) | $$Å$$
Usable gravimetric storage capacity (UG) | $$wt.\%$$
Usable volumetric storage capacity (UV) | $$g/cm^3$$

## Data cleaning

In [3]:
# check the size of the loaded data 
assert(data.shape[0] == 98694)
assert(data.shape[1] == 17)

### 1. Column name modifications

In [4]:
data = data.rename(columns=lambda x: x.rstrip()) # strip trailing whitespace from the column names
data = data.rename(columns={'UG at PS':'UG', 'UV at PS': 'UV'}) # simplify some columns' names

### 2. Abnormal data removal

In [5]:
# remove rows in which any feature is less than or equal to 0
features_name = data.columns[5:12].tolist()  # the 7 structural features
for feature in features_name:
    data = data.drop(data[data[feature] <= 0].index)
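The same filter can be written as a single boolean mask. The following is a minimal sketch on a made-up toy frame (the column values are illustrative, not taken from the dataset), assuming all feature columns are numeric:

```python
import pandas as pd

# Toy stand-in for the MOF table (illustrative values, not from the dataset).
toy = pd.DataFrame({
    "Density": [0.5, 1.2, 0.0, 0.9],
    "GSA": [3000.0, -1.0, 1500.0, 2000.0],
    "VF": [0.8, 0.6, 0.7, 0.5],
})
toy_features = ["Density", "GSA", "VF"]

# Keep only the rows in which every feature is strictly positive.
mask = (toy[toy_features] > 0).all(axis=1)
cleaned = toy[mask]
print(cleaned.index.tolist())
```

One difference from the loop: a row with NaN in a feature is dropped by this mask (because `NaN > 0` is False), whereas `data[feature] <= 0` leaves it in place.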

## Check missing values

In [6]:
def check_nan(col, data):
    '''Count the number of nan for a specific column in a dataset.'''
    return data.shape[0] - data[col].dropna().size

def print_nan(features):
    '''Print the NaN count of each listed column of the global `data`.'''
    for feature in features:
        print('  Nan of ' + str(feature) + ': ' + str(check_nan(feature, data)))

print_nan(features_name)

  Nan of Density: 0
  Nan of GSA: 0
  Nan of VSA: 0
  Nan of VF: 0
  Nan of PV: 0
  Nan of LCD: 0
  Nan of PLD: 0


Comparing the cleaned dataset with the original one shows that the dataset contains some abnormal entries. Structures with a feature less than or equal to 0 probably result from mistakes made when collecting the data and, given the known relationships between these features and adsorption capacity, cannot show outstanding hydrogen adsorption performance. However, there are no missing values for the 7 features and the 2 responses. 
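The cell above passes only the seven feature names to `print_nan`. A minimal sketch of a check that also covers the two responses, using pandas' built-in `isna`, might look like this (the toy frame and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (illustrative values only).
toy = pd.DataFrame({
    "Density": [0.5, np.nan, 0.9],
    "UG": [5.1, 4.2, np.nan],
    "UV": [30.0, 25.0, 28.0],
})

# Count the missing values of features and responses in one call.
nan_counts = toy[["Density", "UG", "UV"]].isna().sum()
print(nan_counts.to_dict())
```

In the notebook, the column list would be the seven features plus `'UG'` and `'UV'`.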

# Visualization of the dataset

## Distributions of single features and UG/UV

In [7]:
def plot_dist(data, var_names):
    '''Plot the distribution of each variable.'''
    for var in var_names:
        plt.figure(figsize=(20, 6))
        sns.histplot(data[var], kde = True)
        plt.xlabel(var, fontsize = 15)
        plt.ylabel('Count', fontsize = 15)
        
features_name.append('UG')
features_name.append('UV')
#plot_dist(data, features_name)

From the above visualizations, one can observe that the distributions of the features and responses differ greatly. Some are right-skewed while others are left-skewed, and their ranges also differ.

## Relationship between UV and UG

In [8]:
r1 = np.corrcoef(data['UV'],data['UG'])[0][1]
r1

0.761160782050035

The correlation coefficient above shows that the linear correlation between the two responses is not as high as expected. Points with relatively high UG and low UV occur in the dataset. Intuitively, this is related to the low density of these structures. The visualization below supports this assumption.

## Relationship between each structural property and UG/UV

In [9]:
def single_feature_plot(data, single_feature):
    '''Plot the relationship between single feature and UG/UV'''
    
    # check that the input single_feature is a str
    if not isinstance(single_feature, str):
        raise TypeError('The input single_feature is not a string.')

    if single_feature == 'Density':
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (15, 6))
        ax1.scatter(data[single_feature], data['UG'])
        ax1.set_xlabel(single_feature, fontsize = 15)
        ax1.set_ylabel('UG',fontsize = 15)
        ax1.set_title('Relationship between '+single_feature+ ' and UG', fontsize = 15)
        ax2.scatter(data[single_feature], data['UV'])
        ax2.set_xlabel(single_feature, fontsize = 15)
        ax2.set_ylabel('UV', fontsize = 15)
        ax2.set_title('Relationship between '+single_feature+ ' and UV', fontsize = 15)
    else:
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (15, 5))
        # color each point by its density ('viridis' is an arbitrary colormap choice)
        sc1 = ax1.scatter(data[single_feature], data['UG'], c = data['Density'], cmap = 'viridis')
        ax1.set_xlabel(single_feature, fontsize = 15)
        ax1.set_ylabel('UG',fontsize = 15)
        ax1.set_title('Relationship between '+single_feature+ ' and UG', fontsize = 15)
        fig.colorbar(sc1, ax = ax1)
        sc2 = ax2.scatter(data[single_feature], data['UV'], c = data['Density'], cmap = 'viridis')
        ax2.set_xlabel(single_feature, fontsize = 15)
        ax2.set_ylabel('UV', fontsize = 15)
        ax2.set_title('Relationship between '+single_feature+ ' and UV', fontsize = 15)
        fig.colorbar(sc2, ax = ax2)

#for single_feature in features_name[:7]:
#    single_feature_plot(data, single_feature)

## Identify the importance of each feature

In [10]:
# calculating the correlation coefficient
corr_coefs = []
for single_feature in features_name[:7]:
    corr_coefs.append([np.corrcoef(data[single_feature],data['UV'])[0][1],np.corrcoef(data[single_feature],data['UG'])[0][1]])

df_rs = pd.DataFrame(columns =['UV', 'UG'], data = corr_coefs, index = features_name[:7])
df_rs['Avg_corr'] = (df_rs['UV'] + df_rs['UG'])/2
df_rs

Feature | UV | UG | Avg_corr
--- | --- | --- | ---
Density | -0.859508 | -0.76609 | -0.812799
GSA | 0.879863 | 0.832255 | 0.856059
VSA | 0.5327 | 0.083976 | 0.308338
VF | 0.931162 | 0.792888 | 0.862025
PV | 0.531987 | 0.9337 | 0.732844
LCD | 0.598503 | 0.828389 | 0.713446
PLD | 0.616799 | 0.826018 | 0.721409


According to the above analysis, density is negatively correlated with UG and UV, while the other features are positively correlated with them. However, for several features other than density the linear correlation with at least one response is weak (for example, 0.08 between VSA and UG). This may be due to differences in density. 
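The correlation table can also be built without an explicit loop. The sketch below uses pandas' `DataFrame.corrwith` on synthetic data; the toy columns and the random values are assumptions for illustration and do not reproduce the numbers above:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the MOF table (illustrative, not the real data).
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 2.0, size=200)
toy = pd.DataFrame({
    "Density": x,
    "PV": 1.0 / x,
    "UG": 10.0 - 3.0 * x + rng.normal(0, 0.1, size=200),
    "UV": 40.0 - 5.0 * x + rng.normal(0, 0.1, size=200),
})
toy_features = ["Density", "PV"]

# One column of Pearson coefficients per response, indexed by feature name.
corr_table = pd.DataFrame({
    resp: toy[toy_features].corrwith(toy[resp]) for resp in ["UV", "UG"]
})
corr_table["Avg_corr"] = corr_table[["UV", "UG"]].mean(axis=1)
print(corr_table)
```

`corrwith` computes the Pearson coefficient by default, so each entry should match the corresponding `np.corrcoef(...)[0][1]` value.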

In [11]:
df_rs_abs = df_rs.copy()
df_rs_abs = abs(df_rs_abs) # consider the feature importance as its absolute value
df_rs_abs_1 = df_rs_abs.sort_values(by = 'Avg_corr', ascending = False)
df_rs_abs_2 = df_rs_abs.sort_values(by = 'UV', ascending = False)
df_rs_abs_3 = df_rs_abs.sort_values(by = 'UG', ascending = False)
print('Features ranking:')
print('   UV + UG: ')
print('   '+str(df_rs_abs_1.index.tolist()))
print('   UV: ')
print('   '+str(df_rs_abs_2.index.tolist()))
print('   UG: ')
print('   '+str(df_rs_abs_3.index.tolist()))

Features ranking:
   UV + UG: 
   ['VF', 'GSA', 'Density', 'PV', 'PLD', 'LCD', 'VSA']
   UV: 
   ['VF', 'GSA', 'Density', 'PLD', 'LCD', 'VSA', 'PV']
   UG: 
   ['PV', 'GSA', 'LCD', 'PLD', 'VF', 'Density', 'VSA']


From the above analysis, we find that VF has the largest influence on the overall hydrogen adsorption performance, while VSA has the smallest. For UV, the most important factor is VF and the least important is PV. For UG, the most important factor is PV and the least important is VSA.

# Data transformation
Data transformation can be hard because it must produce linear relationships between each feature and both responses simultaneously. Here, I try to achieve this by testing both logarithmic transformations and power transformations with different exponents. Only the best results are kept in this part.
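The trial-and-error search described here could be automated along the following lines. This is only a sketch under stated assumptions: the helper names `avg_abs_corr` and `best_power` are hypothetical, the score is the mean absolute Pearson correlation (the notebook itself averages the signed coefficients), and the data are synthetic:

```python
import numpy as np

def avg_abs_corr(x, ug, uv):
    """Mean absolute Pearson correlation of x with the two responses."""
    return (abs(np.corrcoef(x, ug)[0, 1]) + abs(np.corrcoef(x, uv)[0, 1])) / 2

def best_power(x, ug, uv, powers):
    """Score x**p for each candidate p (p == 0 stands for log(x)); return the best p and all scores."""
    scores = {}
    for p in powers:
        xt = np.log(x) if p == 0 else x ** p
        scores[p] = avg_abs_corr(xt, ug, uv)
    return max(scores, key=scores.get), scores

# Synthetic check: both responses are linear in x**2, so p = 2 should win.
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 3.0, size=500)
ug = 2.0 * x**2 + rng.normal(0, 0.05, size=500)
uv = -1.5 * x**2 + rng.normal(0, 0.05, size=500)
p_best, scores = best_power(x, ug, uv, powers=[0, 0.5, 1, 2, 3])
print(p_best)
```

The candidate list is arbitrary; a finer grid would be needed to land on exponents such as 1.1 or 4.2.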

In [12]:
# create an empty list to store the transformed average correlation coefficients
transformed_avg_corrs = []

# Density
data_trans = data.copy()
data_trans['Density'] = -np.log(data_trans['Density'])
val1 = sum([np.corrcoef(data_trans['Density'],data['UV'])[0][1],np.corrcoef(data_trans['Density'],data['UG'])[0][1]])/2
transformed_avg_corrs.append(val1)

# GSA
data_trans['GSA'] = data_trans['GSA']**1.1
val2 = sum([np.corrcoef(data_trans['GSA'],data['UV'])[0][1],np.corrcoef(data_trans['GSA'],data['UG'])[0][1]])/2
transformed_avg_corrs.append(val2)

# VSA
data_trans['VSA'] = (data_trans['VSA'])**0.2
val3 = sum([np.corrcoef(data_trans['VSA'],data['UV'])[0][1],np.corrcoef(data_trans['VSA'],data['UG'])[0][1]])/2
transformed_avg_corrs.append(val3)

# VF
data_trans['VF'] = (data_trans['VF'])**4.2
val4 = sum([np.corrcoef(data_trans['VF'],data['UV'])[0][1],np.corrcoef(data_trans['VF'],data['UG'])[0][1]])/2
transformed_avg_corrs.append(val4)

# PV
data_trans['PV'] = (data_trans['PV'])**0.1
val5 = sum([np.corrcoef(data_trans['PV'],data['UV'])[0][1],np.corrcoef(data_trans['PV'],data['UG'])[0][1]])/2
transformed_avg_corrs.append(val5)

# LCD
data_trans['LCD'] = np.log(data_trans['LCD'])
val6 = sum([np.corrcoef(data_trans['LCD'],data['UV'])[0][1],np.corrcoef(data_trans['LCD'],data['UG'])[0][1]])/2
transformed_avg_corrs.append(val6)

# PLD
data_trans['PLD'] = np.log(data_trans['PLD'])
val7 = sum([np.corrcoef(data_trans['PLD'],data['UV'])[0][1],np.corrcoef(data_trans['PLD'],data['UG'])[0][1]])/2
transformed_avg_corrs.append(val7)

In [13]:
df_rs['Transformed_avg_corr'] = transformed_avg_corrs

Transforming all the features to have a more linear relationship with the 2 responses may lead to higher accuracy of the multiple linear regression model. Whether this assumption holds still requires further modeling work.
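One way to test this assumption is to fit a multiple linear regression on the raw and on the transformed features and compare the MSE. The sketch below shows such a fit with `np.linalg.lstsq` on synthetic data; the helper names `fit_mlr` and `predict_mlr` and all values are assumptions for illustration, not results from this dataset:

```python
import numpy as np

def fit_mlr(X, Y):
    """Least-squares fit of Y ~ [1, X] @ coef; row 0 of coef holds the intercepts."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Predict both responses from the fitted coefficients."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Synthetic data: 7 features, 2 responses that are linear in the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))
true_coef = rng.normal(size=(8, 2))
Y = np.column_stack([np.ones(300), X]) @ true_coef + rng.normal(0, 0.01, size=(300, 2))

coef = fit_mlr(X, Y)
mse = np.mean((predict_mlr(coef, X) - Y) ** 2, axis=0)  # one MSE per response
print(mse)
```

In the notebook, `X` would be `train[features_name]` or `train_trans[features_name]`, and the MSE would be evaluated on the corresponding test split.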

# Neural Network

In [14]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [15]:
features_name = ['Density', 'GSA', 'VSA', 'VF', 'PV', 'LCD', 'PLD']

### Single output model
Step 1. trans or not -- it doesn't matter -- no trans

Step 2. batch_size -- it doesn't matter -- batch_size = len(trains)

Step 3. which act_func and number of dense layer and epoch
* The best act_func is sigmoid.
* Adding dense layers doesn't improve the accuracy measured by MSE.
* The larger the number of epochs, the better the model. But when epoch > 7, the influence is insignificant.

### Step 1. trans or not

In [16]:
train, test = train_test_split(data, train_size = 0.9, random_state=42)
train_trans, test_trans = train_test_split(data_trans, train_size = 0.9, random_state=42)

In [17]:
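def final_model_single(inputs, num, act, res):
    '''Build a single-output dense network; the output layer is named after `res`.'''
    inputs = tf.keras.Input(shape=(7,))  # one-element tuple: 7 input features
    x = tf.keras.layers.Dense(7, activation = act)(inputs)
    # add num-1 further hidden layers (none when num is 0 or 1)
    for _ in range(num-1):
        x = tf.keras.layers.Dense(7, activation = act)(x)

    # the output layer reuses `act`, so predictions are limited to that activation's range
    outputs = tf.keras.layers.Dense(1, name = res, activation = act)(x)
    model = tf.keras.Model(inputs, outputs)
    return model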
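Because the output layer above reuses `act`, it may help to check what each activation can produce for a single output unit. The following NumPy sketch is not part of the original notebook; `softmax` and `sigmoid` here are hand-written stand-ins for the Keras activations:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Pre-activations of shape (batch, 1): a single output unit, as in the model above.
pre_activations = np.array([[-3.0], [0.0], [5.0]])
print(softmax(pre_activations).ravel())  # each row has one entry, so it normalizes to 1.0
print(sigmoid(pre_activations).ravel())  # always strictly between 0 and 1
```

With one unit, softmax returns 1.0 for every input, sigmoid stays inside (0, 1), and relu cannot go below 0. For regression targets outside these ranges, a linear output layer (`activation=None`) is the usual choice; whether that would change the conclusions below was not tested here.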

In [18]:
def define_and_compile_model(num, act, res, lr=0.1):
    '''Define and compile the model.
    num: int -- number of dense layers in the model - 1
    act: str -- activation function
    res: str -- name of the response column, 'UG' or 'UV'
    lr: float -- SGD learning rate
    '''
    inputs = tf.keras.Input(shape = (7,))
    model = final_model_single(inputs, num, act, res)
    
    opt = tf.keras.optimizers.SGD(learning_rate=lr)
    model.compile(optimizer = opt, loss = 'mse', metrics = ['mse'])
    return model

In [19]:
def model_pipeline(train, test, num, act, epoch, batch, res, lr):
    model = define_and_compile_model(num, act, res, lr)
    model.summary()
    history = model.fit(train[features_name], train[res], epochs = epoch, batch_size = batch, validation_data = (test[features_name], test[res]))
    return model

In [20]:
model_UV_trans = model_pipeline(train_trans, test_trans, 0, 'relu', 3, len(train_trans), 'UV', 0.1)

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 7)]               0         
_________________________________________________________________
dense (Dense)                (None, 7)                 56        
_________________________________________________________________
UV (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [21]:
model_UV = model_pipeline(train, test, 0, 'relu', 3, len(train), 'UV', 0.1)

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 7)]               0         
_________________________________________________________________
dense_1 (Dense)              (None, 7)                 56        
_________________________________________________________________
UV (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [22]:
model_UG_trans = model_pipeline(train_trans, test_trans, 0, 'relu', 3, len(train_trans), 'UG', 0.1)

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         [(None, 7)]               0         
_________________________________________________________________
dense_2 (Dense)              (None, 7)                 56        
_________________________________________________________________
UG (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [23]:
model_UG = model_pipeline(train, test, 0, 'relu', 3, len(train), 'UG', 0.1)

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_8 (InputLayer)         [(None, 7)]               0         
_________________________________________________________________
dense_3 (Dense)              (None, 7)                 56        
_________________________________________________________________
UG (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [24]:
batches = [100, 1000, 10000, len(train)]
for batch in batches:
    model_pipeline(train, test, 0, 'relu', 3, batch,'UG', 0.1)

Model: "model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_10 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_4 (Dense)              (None, 7)                 56        
_________________________________________________________________
UG (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
Model: "model_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_12 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_5 (Dense)              (None, 7)                 56        
________________

In [25]:
for batch in batches:
    model_pipeline(train, test, 0, 'relu', 3, batch,'UV', 0.1)

Model: "model_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_18 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_8 (Dense)              (None, 7)                 56        
_________________________________________________________________
UV (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
Model: "model_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_20 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_9 (Dense)              (None, 7)                 56        
________________

### Conclusion: batch size doesn't help.
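When comparing batch sizes at a fixed number of epochs, note that the number of gradient updates per epoch changes with the batch size. A small arithmetic sketch, assuming a training set of 79,619 rows (an assumed value for illustration; use `len(train)` in the notebook):

```python
import math

n_train = 79619  # assumed training-set size; replace with len(train)
batches = [100, 1000, 10000, n_train]

# Number of gradient updates performed per epoch for each batch size.
steps_per_epoch = [math.ceil(n_train / b) for b in batches]
print(steps_per_epoch)  # [797, 80, 8, 1]
```

With 3 epochs, full-batch training therefore performs only 3 parameter updates in total, while a batch size of 100 performs 2,391.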

In [26]:
act_funcs = ['relu', 'sigmoid','softmax']
nums = [1, 2, 3]
for act in act_funcs:
    for num in nums:
        print("Activation func is "+act)
        print("Number of dense layer is "+str(num))
        model_pipeline(train, test, num, act, 3, len(train), 'UV', 0.1)

Activation func is relu
Number of dense layer is 1
Model: "model_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_26 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_12 (Dense)             (None, 7)                 56        
_________________________________________________________________
UV (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
Activation func is relu
Number of dense layer is 2
Model: "model_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_28 (InputLayer)        [(None, 7)]               0         
____________________________________________

Epoch 2/3
Epoch 3/3
Activation func is sigmoid
Number of dense layer is 2
Model: "model_16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_34 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_19 (Dense)             (None, 7)                 56        
_________________________________________________________________
dense_20 (Dense)             (None, 7)                 56        
_________________________________________________________________
UV (Dense)                   (None, 1)                 8         
Total params: 120
Trainable params: 120
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
Activation func is sigmoid
Number of dense layer is 3
Model: "model_17"
_________________________________________________________________
Layer (type)    

Epoch 2/3
Epoch 3/3
Activation func is softmax
Number of dense layer is 2
Model: "model_19"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_40 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_25 (Dense)             (None, 7)                 56        
_________________________________________________________________
dense_26 (Dense)             (None, 7)                 56        
_________________________________________________________________
UV (Dense)                   (None, 1)                 8         
Total params: 120
Trainable params: 120
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
Activation func is softmax
Number of dense layer is 3
Model: "model_20"
_________________________________________________________________
Layer (type)    

In [27]:
for act in act_funcs:
    for num in nums:
        print("Activation func is "+act)
        print("Number of dense layer is "+str(num))
        model_pipeline(train, test, num, act, 3, len(train), 'UG', 0.1)

Activation func is relu
Number of dense layer is 1
Model: "model_21"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_44 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_30 (Dense)             (None, 7)                 56        
_________________________________________________________________
UG (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
Activation func is relu
Number of dense layer is 2
Model: "model_22"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_46 (InputLayer)        [(None, 7)]               0         
____________________________________________

Epoch 2/3
Epoch 3/3
Activation func is sigmoid
Number of dense layer is 1
Model: "model_24"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_50 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_36 (Dense)             (None, 7)                 56        
_________________________________________________________________
UG (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
Activation func is sigmoid
Number of dense layer is 2
Model: "model_25"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_52 (InputLayer)        [(None, 7)]               0         
__________________

Epoch 2/3
Epoch 3/3
Activation func is softmax
Number of dense layer is 1
Model: "model_27"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_56 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_42 (Dense)             (None, 7)                 56        
_________________________________________________________________
UG (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
Activation func is softmax
Number of dense layer is 2
Model: "model_28"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_58 (InputLayer)        [(None, 7)]               0         
__________________

Epoch 2/3
Epoch 3/3


In [28]:
lrs = [0.01, 0.1, 0.5, 1]
for lr in lrs:
    print("learning rate is "+ str(lr))
    model_pipeline(train, test, 0, 'softmax', 3, len(train), 'UV', lr)

learning rate is 0.01
Model: "model_30"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_62 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_48 (Dense)             (None, 7)                 56        
_________________________________________________________________
UV (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
learning rate is 0.1
Model: "model_31"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_64 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_49 (Dense)             (None, 7

Epoch 2/3
Epoch 3/3


In [29]:
lrs = [0.01, 0.1, 0.5, 1]
for lr in lrs:
    print("learning rate is "+ str(lr))
    model_pipeline(train, test, 0, 'softmax', 3, len(train), 'UG', lr)

learning rate is 0.01
Model: "model_34"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_70 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_52 (Dense)             (None, 7)                 56        
_________________________________________________________________
UG (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
learning rate is 0.1
Model: "model_35"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_72 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_53 (Dense)             (None, 7

Epoch 1/3
Epoch 2/3
Epoch 3/3


Tuning the learning rate doesn't help.

In [30]:
epoches = [3, 5, 7, 9]
for epoch in epoches:
    print("Number of epoch: "+str(epoch))
    model_pipeline(train, test, 1, 'softmax', epoch, len(train), 'UV', 0.1)
    print('\n')

Number of epoch: 3
Model: "model_38"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_78 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_56 (Dense)             (None, 7)                 56        
_________________________________________________________________
UV (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3


Number of epoch: 5
Model: "model_39"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_80 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_57 (Dense)             (None, 7)  

Epoch 2/9
Epoch 3/9
Epoch 4/9
Epoch 5/9
Epoch 6/9
Epoch 7/9
Epoch 8/9
Epoch 9/9




In [31]:
epoches = [3, 5, 7, 9]
for epoch in epoches:
    print("Number of epoch: "+str(epoch))
    model_pipeline(train, test, 1, 'softmax', epoch, len(train), 'UG', 0.1)
    print('\n')

Number of epoch: 3
Model: "model_42"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_86 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_60 (Dense)             (None, 7)                 56        
_________________________________________________________________
UG (Dense)                   (None, 1)                 8         
Total params: 64
Trainable params: 64
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3


Number of epoch: 5
Model: "model_43"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_88 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_61 (Dense)             (No

### Number of epoch doesn't help.

For the UV model
* Increasing nums doesn't help. 
* sigmoid and softmax perform roughly the same.

For the UG model
* Increasing nums doesn't help and even starts to cause overfitting.
* softmax is the best.

### Multi-output model
Step 1. trans or not -- it doesn't matter -- no trans

Step 2. batch_size -- it doesn't matter -- batch_size = len(trains)

Step 3. which act_func and number of dense layer and epoch
* The best act_func is sigmoid.
* Adding dense layers doesn't improve the accuracy measured by MSE.
* The larger the number of epochs, the better the model. But when epoch > 7, the influence is insignificant.

### Step 1. trans or not

In [32]:
train, test = train_test_split(data, train_size = 0.9, random_state=42)
train_trans, test_trans = train_test_split(data_trans, train_size = 0.9, random_state=42)

In [33]:
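def final_model(inputs, num, act):
    '''Build a dense network with a 2-unit output layer for UG and UV.'''
    inputs = tf.keras.Input(shape=(7,))  # one-element tuple: 7 input features
    x = tf.keras.layers.Dense(7, activation = act)(inputs)
    # add num-1 further hidden layers (none when num is 0 or 1)
    for _ in range(num-1):
        x = tf.keras.layers.Dense(7, activation = act)(x)

    # the output layer reuses `act`, so both predictions are limited to that activation's range
    outputs = tf.keras.layers.Dense(2, name = 'UG_and_UV', activation = act)(x)
    model = tf.keras.Model(inputs, outputs)
    return model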

In [34]:
def define_and_compile_model(num, act):
    '''Define and compile the multi-output model.
    num: int -- number of dense layers in the model - 1
    act: str -- activation function
    '''
    inputs = tf.keras.Input(shape = (7,))
    model = final_model(inputs, num, act)
    
    opt = tf.keras.optimizers.SGD(learning_rate=0.1)
    model.compile(optimizer = opt, loss = 'mse', metrics = ['mse'])
    return model

In [35]:
def model_pipeline(train, test, num, act, epoch, batch):
    model = define_and_compile_model(num, act)
    model.summary()
    history = model.fit(train[features_name], train[['UG', 'UV']], epochs = epoch, batch_size = batch, validation_data = (test[features_name], test[['UG','UV']]))
    return model

In [36]:
model_pipeline(train_trans, test_trans, 0, 'relu', 3, len(train_trans))

Model: "model_46"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_94 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_64 (Dense)             (None, 7)                 56        
_________________________________________________________________
UG_and_UV (Dense)            (None, 2)                 16        
Total params: 72
Trainable params: 72
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.engine.functional.Functional at 0x7ff8217295d0>

In [37]:
model_pipeline(train, test, 0, 'relu', 3, len(train))

Model: "model_47"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_96 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_65 (Dense)             (None, 7)                 56        
_________________________________________________________________
UG_and_UV (Dense)            (None, 2)                 16        
Total params: 72
Trainable params: 72
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.engine.functional.Functional at 0x7ff82315f2d0>

### Step 2. Batch size

In [38]:
batches = [100, 1000, 10000, len(train)]
for batch in batches:
    model_pipeline(train, test, 0, 'relu', 3, batch)

Model: "model_48"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_98 (InputLayer)        [(None, 7)]               0         
_________________________________________________________________
dense_66 (Dense)             (None, 7)                 56        
_________________________________________________________________
UG_and_UV (Dense)            (None, 2)                 16        
Total params: 72
Trainable params: 72
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
Model: "model_49"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_100 (InputLayer)       [(None, 7)]               0         
_________________________________________________________________
dense_67 (Dense)             (None, 7)                 56        
______________
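
Batch size determines how many weight updates happen in each epoch, which matters when only 3 epochs are run. The arithmetic below is a sketch that assumes a training set of 79,619 rows, the value that `batch` holds after this loop according to output printed later in the notebook. The function name `updates_per_epoch` is hypothetical.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches (and so gradient updates) in one epoch.

    The last batch may be smaller than batch_size, hence the ceiling.
    """
    return math.ceil(n_samples / batch_size)


n_train = 79619  # assumed training-set size (see lead-in)
n_epochs = 3
for batch_size in [100, 1000, 10000, n_train]:
    steps = updates_per_epoch(n_train, batch_size)
    print(batch_size, steps, steps * n_epochs)
```

Under that assumption, a batch size of 100 gives 797 updates per epoch, while a full-batch run gives a single update per epoch and only 3 updates in total.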

### Step 3. Activation function and number of dense layers

In [39]:
act_funcs = ['relu', 'sigmoid','softmax']
nums = [1, 2, 3]
for act in act_funcs:
    for num in nums:
        print("Activation func is "+act)
        print("Number of dense layer is "+str(num))
        model_pipeline(train, test, num, act, 3, len(train))

Activation func is relu
Number of dense layer is 1
Model: "model_52"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_106 (InputLayer)       [(None, 7)]               0         
_________________________________________________________________
dense_70 (Dense)             (None, 7)                 56        
_________________________________________________________________
UG_and_UV (Dense)            (None, 2)                 16        
Total params: 72
Trainable params: 72
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
Activation func is relu
Number of dense layer is 2
Model: "model_53"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_108 (InputLayer)       [(None, 7)]               0         
____________________________________________

Epoch 2/3
Epoch 3/3
Activation func is sigmoid
Number of dense layer is 2
Model: "model_56"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_114 (InputLayer)       [(None, 7)]               0         
_________________________________________________________________
dense_77 (Dense)             (None, 7)                 56        
_________________________________________________________________
dense_78 (Dense)             (None, 7)                 56        
_________________________________________________________________
UG_and_UV (Dense)            (None, 2)                 16        
Total params: 128
Trainable params: 128
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
Activation func is sigmoid
Number of dense layer is 3
Model: "model_57"
_________________________________________________________________
Layer (type)    

Epoch 2/3
Epoch 3/3
Activation func is softmax
Number of dense layer is 2
Model: "model_59"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_120 (InputLayer)       [(None, 7)]               0         
_________________________________________________________________
dense_83 (Dense)             (None, 7)                 56        
_________________________________________________________________
dense_84 (Dense)             (None, 7)                 56        
_________________________________________________________________
UG_and_UV (Dense)            (None, 2)                 16        
Total params: 128
Trainable params: 128
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
Activation func is softmax
Number of dense layer is 3
Model: "model_60"
_________________________________________________________________
Layer (type)    

Of the three activation functions tested, sigmoid is selected for the following steps.
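
For reference, the three activation functions compared in Step 3 can be written out in a few lines. This is a plain-Python sketch of their standard definitions, not the Keras implementation, and it does not by itself explain which one performs best on this dataset.

```python
import math


def relu(xs):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in xs]


def sigmoid(xs):
    """Element-wise 1 / (1 + exp(-x)); each output lies in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]


def softmax(xs):
    """Normalized exponentials; the outputs are positive and sum to 1."""
    shift = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


values = [-1.0, 0.0, 2.0]  # arbitrary example inputs
print(relu(values))
print(sigmoid(values))
print(softmax(values))
```

Unlike ReLU and sigmoid, softmax acts on the whole vector at once, so the outputs of a layer that uses it are coupled through the normalization.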

In [40]:
act_funcs = ['sigmoid']
nums = [1, 5, 10]
for act in act_funcs:
    for num in nums:
        print("Activation func is "+act)
        print("Number of dense layer is "+str(num))
        model_pipeline(train, test, num, act, 3, len(train))
        print('\n')

Activation func is sigmoid
Number of dense layer is 1
Model: "model_61"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_124 (InputLayer)       [(None, 7)]               0         
_________________________________________________________________
dense_88 (Dense)             (None, 7)                 56        
_________________________________________________________________
UG_and_UV (Dense)            (None, 2)                 16        
Total params: 72
Trainable params: 72
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3


Activation func is sigmoid
Number of dense layer is 5
Model: "model_62"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_126 (InputLayer)       [(None, 7)]               0         
____________________________________

Epoch 2/3
Epoch 3/3




Adding more dense layers does not improve the model, so the number of dense layers is set to 1.

In [41]:
epochs = [3, 5, 7, 9]
for epoch in epochs:
    # print the loop variable; the earlier version printed the stale `batch` value
    print("Number of epochs: " + str(epoch))
    model_pipeline(train, test, 1, 'sigmoid', epoch, len(train))
    print('\n')

Number of epochs: 3
Model: "model_64"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_130 (InputLayer)       [(None, 7)]               0         
_________________________________________________________________
dense_104 (Dense)            (None, 7)                 56        
_________________________________________________________________
UG_and_UV (Dense)            (None, 2)                 16        
Total params: 72
Trainable params: 72
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3


Number of epochs: 5
Model: "model_65"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_132 (InputLayer)       [(None, 7)]               0         
_________________________________________________________________
dense_105 (Dense)            (No

Epoch 2/9
Epoch 3/9
Epoch 4/9
Epoch 5/9
Epoch 6/9
Epoch 7/9
Epoch 8/9
Epoch 9/9




The model improves as the number of epochs increases, but beyond 7 epochs the further decrease in loss is not significant.
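
The judgment that the loss stops improving after a certain epoch can be made explicit with a threshold on the relative improvement between consecutive epochs. The sketch below is one possible rule, not the method used in this notebook. The function name, the 1% threshold, and the loss values are all hypothetical.

```python
def first_plateau_epoch(losses, min_rel_improvement=0.01):
    """Return the 1-based epoch after which the loss first improves by less
    than ``min_rel_improvement`` (relative to the previous epoch's loss),
    or None if every epoch improves by at least that much.

    Assumes strictly positive losses, as with a mean squared error.
    """
    for i in range(1, len(losses)):
        previous, current = losses[i - 1], losses[i]
        relative_gain = (previous - current) / previous
        if relative_gain < min_rel_improvement:
            return i  # losses[i - 1] belongs to epoch i
    return None


# Made-up validation losses for illustration; not results from this notebook.
example_losses = [10.0, 6.0, 4.0, 3.0, 2.5, 2.2, 2.0, 1.99, 1.985]
print(first_plateau_epoch(example_losses))
```

Keras also provides an `EarlyStopping` callback, with `monitor`, `min_delta`, and `patience` arguments, that applies a comparable rule during training; its `min_delta` is an absolute rather than a relative change.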