### Thermophysical Property Prediction

Empirical regression has limitations, especially when predictions are requested outside the training region. Physics-based information can help overcome this limitation by building fundamental engineering knowledge, such as constraints, into the training process.

![Thermo Properties](https://apmonitor.com/pds/uploads/Main/thermophysical_properties.png)


__Background__: Parachor values are a factor in the prediction of several thermophysical properties such as surface tension and thermal conductivity. The parachor value ($P$) is used to predict surface tension with the difference between the density of saturated liquid $\rho_L$ and saturated vapor $\rho_V$ at the temperature of interest.

$\sigma = \left(P\left(\rho_L-\rho_V\right)\right)^4$
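
As a quick numerical illustration of the relation above, the sketch below evaluates it for made-up inputs. The function name `surface_tension` and all numbers are hypothetical. The unit convention in the comments (molar densities in mol/cm³, surface tension in mN/m) is an assumption that is common for parachor correlations but is not stated in this notebook.

```python
def surface_tension(parachor, rho_liq, rho_vap):
    """Evaluate sigma = (P*(rho_L - rho_V))**4."""
    return (parachor * (rho_liq - rho_vap)) ** 4

# hypothetical inputs, for illustration only (not taken from the data set)
P = 200.0        # parachor
rho_L = 0.011    # saturated liquid density (assumed mol/cm^3)
rho_V = 0.0001   # saturated vapor density (assumed mol/cm^3)

sigma = surface_tension(P, rho_L, rho_V)

# the fourth power amplifies parachor errors: +1% in P gives about +4% in sigma
ratio = surface_tension(1.01 * P, rho_L, rho_V) / sigma

print('sigma =', sigma)
print('sigma ratio for +1% parachor =', ratio)
```

Because the parachor enters to the fourth power, a small relative error in $P$ is roughly quadrupled in $\sigma$, which is why a more accurate parachor predictor matters.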

Surface tension and thermal conductivity are two specific properties that need improved predictions. A more accurate predictor of the parachor value ($P$) is an important step toward improving predictions of those thermophysical properties. Most deep learning methods use a limited set of activation functions (ReLU, sigmoid, tanh, and linear) and perform unconstrained regression to minimize a loss function. The purpose of this case study is to explore the addition of physics-based information in the fitting process. This may include the use of new types of activation functions or constraints on the adjustable weights. The data for this case study are from Gharagheizi et al. (2011), who explored deep learning (a multi-layered neural network) to improve parachor predictions for 277 compounds from 40 functional groups.

- Name: The common chemical name
- Formula: Chemical formula of the compound
- CASN: Chemical Abstracts Service Registry Number
- Family: Chemical family of the compound
- Parachor: Estimate of parachor value
- Grp1-Grp40: Number of functional groups in the compound
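
The column layout above can be split into model inputs and the target by name rather than by position. The following is a minimal sketch on a tiny made-up frame. The helper `split_features_target` is hypothetical, and it assumes the group columns are literally named `Grp1` through `Grp40` as in the list above.

```python
import pandas as pd

def split_features_target(data):
    """Return (X, y, group_names) from a frame laid out like the data set."""
    # assumption: every functional-group column name starts with 'Grp'
    grp_cols = [c for c in data.columns if c.startswith('Grp')]
    X = data[grp_cols].to_numpy(dtype=float)
    y = data['Parachor'].to_numpy(dtype=float)
    return X, y, grp_cols

# made-up rows with the same column names (values are not real data)
cols = {
    'Name': ['compound A', 'compound B'],
    'Formula': ['X', 'Y'],
    'CASN': ['0-00-0', '0-00-1'],
    'Family': ['family 1', 'family 2'],
    'Parachor': [100.0, 150.0],
}
cols.update({f'Grp{i}': [i % 2, (i + 1) % 2] for i in range(1, 41)})
demo = pd.DataFrame(cols)

X, y, names = split_features_target(demo)
print(X.shape, y.shape, names[0], names[-1])
```

Selecting by name avoids relying on the group columns being the last 40 columns of the file.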

__Objective__: Develop a prediction of the parachor from the chemical compound data set. Report the coefficient of determination ($R^2$) for predicting Parachor in the test set. Randomly split the data into train (80%) and test (20%) sets. Use Linear Regression and Neural Network (Deep Learning) with constraints. The solutions for regression without constraints or feature engineering are provided in this notebook. For the constrained cases, enforce a positive parachor contribution for each group. Discuss the performance of each on the train and test sets. Submit source code and a summary memo (max 2 pages) of your results.
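
The random 80/20 split and the $R^2$ report requested above could be set up as in the following sketch. It uses scikit-learn's `train_test_split` and `r2_score`, which the notebook itself does not use, and it runs on synthetic stand-in data (277 rows, 40 group counts) so that it is self-contained. The synthetic coefficients and noise level are assumptions for illustration, not properties of the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the data set (not real parachor data)
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(277, 40)).astype(float)   # group counts
true_b = rng.uniform(5.0, 60.0, size=40)               # made-up contributions
y = X @ true_b + rng.normal(0.0, 5.0, size=277)

# random 80% / 20% split; random_state fixes the shuffle for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# unconstrained linear fit without intercept, as in d * b = meas
model = LinearRegression(fit_intercept=False).fit(X_train, y_train)

r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
print('R2 train:', r2_train)
print('R2 test: ', r2_test)
```

With the real data, `X` and `y` would come from the group columns and the `Parachor` column of the loaded file.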

### Load Data

[Chemical Compound Data Set](https://apmonitor.com/pds/uploads/Main/thermo.txt)


In [None]:
url = 'https://apmonitor.com/pds/uploads/Main/thermo.txt'

### Linear Regression without Constraints

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv(url)

# input
d = np.array(data)[:,-40:]
d = np.array(d,dtype=float)

d_train = d[0:205]
d_valid = d[205:]

# measured output
meas = np.array(data['Parachor'])
meas_train = meas[0:205]
meas_valid = meas[205:]

# linear regression (least squares with the normal equations)
#  d * b = meas
#  (d^T * d) * b = (d^T * meas)
#  A * b = rhs
A = np.dot(d_train.T,d_train)
rhs = np.dot(d_train.T,meas_train)
# solve for
#  b = inv(d^T*d)*d^T*meas
b = np.linalg.solve(A,rhs)

# predicted output
pred_train = np.dot(d_train,b)
pred_valid = np.dot(d_valid,b)

# mean absolute relative error
print('Mean abs relative error (Train)')
print(np.mean(np.abs((meas_train-pred_train)/meas_train)))
print('Mean abs relative error (Validate)')
print(np.mean(np.abs((meas_valid-pred_valid)/meas_valid)))

# parity plot
plt.loglog([80,2000],[80,2000],'k-')
plt.loglog(meas_train,pred_train,'b.',label='Linear (Train)')
plt.loglog(meas_valid,pred_valid,'r.',label='Linear (Validate)')
plt.legend()
plt.xlabel('Measured')
plt.ylabel('Predicted')
plt.show()
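
For the constrained linear case, one possible approach (not necessarily the intended solution) is non-negative least squares, which solves the same least-squares problem subject to `b >= 0`. The sketch below uses `scipy.optimize.nnls` on synthetic stand-in data. The names `d_demo`, `meas_demo`, and `b_pos` are hypothetical, and the synthetic coefficients are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import nnls

# synthetic stand-in for d_train and meas_train (not real parachor data)
rng = np.random.default_rng(1)
d_demo = rng.integers(0, 4, size=(205, 40)).astype(float)
b_true = rng.uniform(5.0, 60.0, size=40)
meas_demo = d_demo @ b_true + rng.normal(0.0, 5.0, size=205)

# unconstrained least squares for comparison
b_free = np.linalg.lstsq(d_demo, meas_demo, rcond=None)[0]
norm_free = np.linalg.norm(d_demo @ b_free - meas_demo)

# non-negative least squares: minimize ||d*b - meas|| subject to b >= 0
b_pos, norm_pos = nnls(d_demo, meas_demo)

print('smallest constrained coefficient:', b_pos.min())
print('residual norm (free, constrained):', norm_free, norm_pos)
```

The constrained residual can never be smaller than the unconstrained one on the training data, so any benefit of the constraint would show up in the test set. In scikit-learn, `LinearRegression(positive=True)` offers a similar non-negativity constraint.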

### Neural Network (Deep Learning) without Constraints

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt

#################################################################
### Import Data #################################################
#################################################################

data = pd.read_csv(url)

# input
d = np.array(data)[:,-40:]
d = np.array(d,dtype=float)
x_train = d[0:205]
x_valid = d[205:]

# measured output
meas = np.array(data['Parachor'])
y_train = meas[0:205]
y_valid = meas[205:]

train = np.vstack((x_train.T,y_train)).T
valid = np.vstack((x_valid.T,y_valid)).T

#################################################################
### Scale data ##################################################
#################################################################

# scale values to 0 to 1 for the ANN to work well
s = MinMaxScaler(feature_range=(0,1))

# scale training and test data
sc_train = s.fit_transform(train)
xs_train = sc_train[:,0:-1]
ys_train = sc_train[:,-1]

sc_valid = s.transform(valid)
xs_valid = sc_valid[:,0:-1]
ys_valid = sc_valid[:,-1]

#################################################################
### Train model #################################################
#################################################################

# create neural network model
model = Sequential()
model.add(Dense(40, input_dim=40, activation='linear'))
model.add(Dense(40, activation='linear'))
model.add(Dense(5, activation='tanh'))
model.add(Dense(5, activation='linear'))
model.add(Dense(1, activation='linear'))
model.compile(loss="mean_squared_error", optimizer="adam")

# load training data
X1 = xs_train
Y1 = ys_train

# train the model
model.fit(X1,Y1,epochs=200,verbose=1,shuffle=True)

#################################################################
### Test model ##################################################
#################################################################

# load test data
X2 = xs_valid
Y2 = ys_valid

# test the model
mse_train = model.evaluate(X1,Y1, verbose=1)
mse_valid = model.evaluate(X2,Y2, verbose=1)

print('Mean Squared Error (Train): ', mse_train)
print('Mean Squared Error (Valid): ', mse_valid)

#################################################################
### Predictions on Train and Validation Sets ####################
#################################################################

# predict
Y1P = model.predict(X1)
Y2P = model.predict(X2)

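# unscale for plotting and analysis
# MinMaxScaler applies y_scaled = y*scale_ + min_
#  so the inverse is y = (y_scaled - min_)/scale_
ymin = s.min_[-1]      # offset term of the scaler (last column = Parachor)
yrange = s.scale_[-1]  # scale factor of the scaler (last column = Parachor)

Y1u = (Y1-ymin)/yrange
Y1Pu = (Y1P-ymin)/yrange

Y2u = (Y2-ymin)/yrange
Y2Pu = (Y2P-ymin)/yrange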

# mean absolute relative error (predictions have shape (n,1))
sae1 = np.mean(np.abs(Y1u-Y1Pu[:,0])/Y1u)
sae2 = np.mean(np.abs(Y2u-Y2Pu[:,0])/Y2u)

print('Mean abs relative error - Training ' + str(sae1))
print('Mean abs relative error - Validate ' + str(sae2))

plt.figure()
plt.plot(Y1u, Y1Pu, 'b.',label='train')
plt.plot(Y2u, Y2Pu, 'r.',label='validate')
plt.xlabel('Measured')
plt.ylabel('Predicted')
plt.legend(loc='best')
plt.show()
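
For the constrained neural network, one option worth exploring is a non-negativity constraint on the layer weights. The sketch below uses the Keras `NonNeg` weight constraint with monotonically increasing activations (`tanh` and `linear`), so the prediction cannot decrease when a group count increases. This is only one possible reading of "positive parachor contribution for each group", and the small architecture, epoch count, and synthetic data are assumptions for illustration.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Input, Dense
from keras.constraints import NonNeg

# synthetic stand-in for the scaled training data (not real parachor data)
rng = np.random.default_rng(2)
X_demo = rng.integers(0, 4, size=(200, 40)).astype(float) / 3.0
w_true = rng.uniform(0.0, 1.0, size=40)
y_demo = X_demo @ w_true / w_true.sum()

# every kernel is projected onto non-negative values after each update
model_c = Sequential([
    Input(shape=(40,)),
    Dense(10, activation='tanh', kernel_constraint=NonNeg()),
    Dense(1, activation='linear', kernel_constraint=NonNeg()),
])
model_c.compile(loss='mean_squared_error', optimizer='adam')
model_c.fit(X_demo, y_demo, epochs=20, verbose=0)

# check 1: all trained kernels are non-negative
kernels = [l.get_weights()[0] for l in model_c.layers if isinstance(l, Dense)]
all_nonneg = all((k >= 0).all() for k in kernels)

# check 2: adding one unit of the first group does not lower the prediction
x0 = X_demo[:1].copy()
x1 = x0.copy()
x1[0, 0] += 1.0 / 3.0
delta = float(model_c.predict(x1, verbose=0)[0, 0]
              - model_c.predict(x0, verbose=0)[0, 0])

print('all kernels non-negative:', all_nonneg)
print('prediction change for one extra group:', delta)
```

Biases are left unconstrained here because they shift the output without affecting how it responds to a group count.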