## 4-MFR Regression

The objective is to minimize a loss function such as the sum of squared errors between the measured values $z_i$ and the predicted values $y_i$:

$Loss = \sum_{i=1}^{n}\left(y_i-z_i\right)^2$

where `n` is the number of observations. Regression requires labelled data (output values) for training. Classification, on the other hand, can be either supervised (with labels, the `z` measurements) or unsupervised (without labels or `z` measurements).
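As a minimal sketch of the loss defined above, the sum of squared errors can be evaluated directly with `numpy`. The arrays `z` (measured) and `y` (predicted) hold made-up numbers for illustration only; they are not taken from the MFR data set.

```python
import numpy as np

# Hypothetical measured (z) and predicted (y) values, for illustration only
z = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.7])

# Sum of squared errors: Loss = sum_i (y_i - z_i)^2
loss = np.sum((y - z)**2)
print('Loss:', loss)
```

A smaller loss means the predictions sit closer to the measurements; a perfect model gives a loss of zero.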

```python
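import pickle
import numpy as np
import matplotlib.pyplot as plt
# load the data sets saved by the prior notebook
with open('mfr_data.pkl', 'rb') as handle:
    info = pickle.load(handle)
data,test,train,ds,s = info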
```

Load the `pkl` file from the prior notebook.

![idea](https://apmonitor.com/che263/uploads/Begin_Python/idea.png)

### Linear Regression

There are many model forms such as linear, polynomial, and nonlinear. A familiar linear model is a line with slope `a` and intercept `b` with `y = a x + b`.   
    
```python
x = data['H2R'].values
z = data['lnMFR'].values
p1 = np.polyfit(x,z,1)
```
    
A simple method for linear regression is to fit the model with `numpy` using `p=np.polyfit(x,z,1)` and evaluate it with `np.polyval(p,x)`. Determine the slope and intercept that minimize the sum of squared errors (least squares) between the predicted `lnMFR` and measured `lnMFR` output using `H2R` as the input.
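The fit-then-evaluate pattern can be sketched on synthetic data. The slope of 2, intercept of 1, and noise level below are assumptions chosen for illustration, not properties of the `H2R` and `lnMFR` data.

```python
import numpy as np

# Synthetic data (assumed for illustration): a line with slope 2 and intercept 1
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
z = 2.0*x + 1.0 + 0.1*rng.standard_normal(x.size)

# Fit a first-order polynomial; p[0] is the slope and p[1] is the intercept
p = np.polyfit(x, z, 1)

# Evaluate the fitted line and compute the sum of squared errors
y = np.polyval(p, x)
sse = np.sum((y - z)**2)
print('slope:', p[0], 'intercept:', p[1], 'SSE:', sse)
```

`np.polyfit` returns coefficients from the highest power to the lowest, so the slope comes first for a first-order fit.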

Another package is `statsmodels`, which performs standard Ordinary Least Squares (OLS) analysis and produces a report summary.

```python
import statsmodels.api as sm
xc = sm.add_constant(x)
model = sm.OLS(z,xc).fit()
predictions = model.predict(xc)
model.summary()
```

The input `x` is augmented with a ones column so that it also predicts the intercept. This is accomplished with `xc=sm.add_constant(x)`. Perform a multiple linear regression with all of the data columns to predict `lnMFR`.

```python
# all columns except the last are inputs; the last column is the output
x_columns = data.columns[0:-1]; print(x_columns)
z_column  = data.columns[-1]; print(z_column)

x = data[x_columns]
z = data[z_column]
```
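To show what the ones column does in a multiple linear regression, the sketch below solves the same least-squares problem with plain `numpy` instead of `statsmodels`. The data, the coefficient values, and the names `X`, `Xc`, and `beta` are assumptions for illustration only.

```python
import numpy as np

# Synthetic inputs (assumed for illustration): 100 samples, 3 features
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
true_coef = np.array([0.5, -1.0, 2.0])
z = X @ true_coef + 3.0 + 0.05*rng.standard_normal(100)

# Augment with a leading column of ones so the first coefficient is the intercept
Xc = np.column_stack((np.ones(len(X)), X))

# Solve the least-squares problem min ||Xc @ beta - z||^2
beta, residuals, rank, sv = np.linalg.lstsq(Xc, z, rcond=None)
print('intercept:', beta[0])
print('coefficients:', beta[1:])
```

Without the ones column the fitted plane would be forced through the origin, so the intercept could not be estimated.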

### Select Best Features

Rank the features to determine the best set that predicts `lnMFR`.

```python
from sklearn.feature_selection import SelectKBest, f_regression
best = SelectKBest(score_func=f_regression, k='all')
fit = best.fit(x,z)
plt.bar(x=x.columns,height=fit.scores_)
```

There is additional information on [Select K Best Features](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html).
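To give a feel for what a univariate regression score measures, the sketch below hand-computes an F-statistic from the Pearson correlation between each feature and the output. It does not call `scikit-learn`; it is an illustration of the kind of score `f_regression` reports, using synthetic data in which feature 0 is constructed to dominate.

```python
import numpy as np

# Synthetic data (assumed): feature 0 drives the output, feature 2 is pure noise
rng = np.random.default_rng(2)
n = 200
X = rng.standard_normal((n, 3))
z = 3.0*X[:, 0] + 0.5*X[:, 1] + 0.1*rng.standard_normal(n)

# Pearson correlation of each feature with the output
Xm = X - X.mean(axis=0)
zm = z - z.mean()
r = (Xm.T @ zm) / (np.linalg.norm(Xm, axis=0) * np.linalg.norm(zm))

# Univariate F-statistic with n-2 degrees of freedom
F = r**2 / (1.0 - r**2) * (n - 2)

# Rank features from highest to lowest score
ranking = np.argsort(F)[::-1]
print('F scores:', F)
print('ranking :', ranking)
```

Each feature is scored on its own, so this kind of ranking does not account for interactions or redundancy between features.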

![exercise](https://apmonitor.com/che263/uploads/Begin_Python/exercise.png)

### Machine Learning

Machine learning uses computer algorithms and statistical models that rely on patterns and inference to perform a specific task without explicit instructions. Machine learned regression models can be as simple as linear regression or as complex as deep learning. This tutorial demonstrates several regression methods with `scikit-learn`.

#### Function for Plotting

Run this code so that each regressor model trains and displays on a 3D scatter and surface plot of `lnMFR` versus `H2R` and `Pressure`.

```python
def fit(method):
    # create points for plotting surface
    xp = np.arange(-5, 5, 0.2)
    yp = np.arange(-5, 5, 0.2)
    XP, YP = np.meshgrid(xp, yp)

    # train on the two inputs and predict on the surface grid
    model = method.fit(train[['H2R','Pressure']],train['lnMFR'])
    zp = method.predict(np.vstack((XP.flatten(),YP.flatten())).T)
    ZP = zp.reshape(XP.shape)

    # coefficient of determination on the test set
    r2 = method.score(test[['H2R','Pressure']],test['lnMFR'])
    print('R^2: ' + str(r2))

    # 3D scatter of the data with the fitted surface
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(ds['H2R'],ds['Pressure'],ds['lnMFR'],c=ds['lnMFR'],cmap='plasma',label='data')
    ax.plot_surface(XP, YP, ZP, cmap='coolwarm',alpha=0.7,
                    linewidth=0, antialiased=False)
    plt.show()
    return
```
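The `score` method of a `scikit-learn` regressor returns the coefficient of determination $R^2$. As a sketch of what that number means, the helper below (the name `r_squared` and the sample values are hypothetical) computes it directly from measured and predicted values.

```python
import numpy as np

def r_squared(z_meas, z_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    z_meas = np.asarray(z_meas, dtype=float)
    z_pred = np.asarray(z_pred, dtype=float)
    ss_res = np.sum((z_meas - z_pred)**2)          # residual sum of squares
    ss_tot = np.sum((z_meas - z_meas.mean())**2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only
z_meas = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(z_meas, z_meas)                        # exact predictions
baseline = r_squared(z_meas, np.full(4, z_meas.mean()))    # always predict the mean
print(perfect, baseline)
```

A value of 1 indicates a perfect fit, 0 matches a model that always predicts the mean, and negative values indicate a fit worse than that baseline.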

#### Linear Regression with `sklearn`

The simplest regressor is a linear model.

```python
from sklearn import linear_model
lm = linear_model.LinearRegression()
fit(lm)
```

This model is not expected to perform very well with the nonlinear data but it does predict the slope of the data.

#### K-Nearest Neighbors

Use the `KNeighborsRegressor` and adjust `n_neighbors` (initially 20) to achieve a better $R^2$ value.

```python
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=20)
fit(knn)
```

`n_neighbors` is an example of a hyper-parameter that can be optimized with a package such as `hyperopt` or tuned manually based on experience.
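A simple alternative to a dedicated optimizer is to scan a few candidate values and keep the best one. The sketch below does this with a hand-written nearest-neighbor average in `numpy` rather than `KNeighborsRegressor` or `hyperopt`; the functions `knn_predict` and `r_squared`, the synthetic data, and the candidate list are all assumptions for illustration.

```python
import numpy as np

def knn_predict(X_train, z_train, X_new, k):
    """Average the outputs of the k closest training points (Euclidean distance)."""
    # pairwise distances, shape (n_new, n_train)
    d = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbors
    return z_train[idx].mean(axis=1)

def r_squared(z_meas, z_pred):
    return 1.0 - np.sum((z_meas - z_pred)**2) / np.sum((z_meas - z_meas.mean())**2)

# Synthetic nonlinear data (assumed for illustration)
rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(300, 2))
z = np.sin(X[:, 0]) + X[:, 1]**2 + 0.1*rng.standard_normal(300)
X_tr, z_tr, X_te, z_te = X[:200], z[:200], X[200:], z[200:]

# Scan candidate values of k and keep the one with the best held-out R^2
scores = {k: r_squared(z_te, knn_predict(X_tr, z_tr, X_te, k))
          for k in (1, 3, 5, 10, 20, 50)}
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

Small values of `k` follow the noise and large values smooth over real structure, so the best choice usually lies in between. In practice a separate validation set, not the final test set, is the safer place to make this choice.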

#### Support Vector Regressor

Use a Support Vector Regressor (`SVR`) to perform the regression.

```python
from sklearn import svm
# named svr to avoid overwriting `s` loaded from the pickle file
svr = svm.SVR(gamma='scale')
fit(svr)
```

What are the hyper-parameters for this regressor?
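One way to explore this question is to ask the estimator itself: every `scikit-learn` estimator provides `get_params()`, which returns its adjustable settings and their current values. The variable name `svr` below is only illustrative.

```python
from sklearn import svm

# List the adjustable settings of the regressor and their current values
svr = svm.SVR(gamma='scale')
params = svr.get_params()
for name, value in sorted(params.items()):
    print(name, '=', value)
```

Entries such as `C`, `epsilon`, `kernel`, and `gamma` are common starting points for tuning.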

#### Multilayer Perceptron (Neural Network)

Train a neural network to predict the `lnMFR`.

```python
from sklearn.neural_network import MLPRegressor
# activation options: 'identity', 'logistic', 'tanh', 'relu' (default 'relu')
nn = MLPRegressor(hidden_layer_sizes=(3,),
                  activation='tanh', solver='lbfgs')
fit(nn)
```

Adjust `hidden_layer_sizes`, for example to a deeper network with `(3,5,3)`, to achieve a better fit.
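To make the meaning of `hidden_layer_sizes` concrete, the sketch below builds an untrained network in `numpy` and runs a forward pass with `tanh` hidden layers and a linear output. It is not how `MLPRegressor` is implemented; the functions `init_layers` and `forward` and the random weights are assumptions for illustration.

```python
import numpy as np

def init_layers(n_inputs, hidden_layer_sizes, n_outputs=1, seed=0):
    """Random weights and zero biases for each layer (illustrative, not trained)."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    return [(rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(X, layers):
    """tanh on hidden layers, linear (identity) output layer."""
    a = X
    for W, b in layers[:-1]:
        a = np.tanh(a @ W + b)
    W, b = layers[-1]
    return (a @ W + b).ravel()

X = np.zeros((4, 2))   # 4 samples, 2 inputs
for hidden in [(3,), (3, 5, 3)]:
    layers = init_layers(2, hidden)
    n_weights = sum(W.size + b.size for W, b in layers)
    print(hidden, 'output shape:', forward(X, layers).shape, 'parameters:', n_weights)
```

Each entry in the tuple adds one hidden layer with that many nodes, so `(3,5,3)` has three hidden layers and more adjustable weights than `(3,)`. More weights add flexibility but also raise the risk of overfitting.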

### Additional Features

Repeat the analysis but now generate a parity plot of measured versus predicted values with all features, not just `Pressure` and `H2R`.

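```python
def fitn(method):
    # all columns except the last (lnMFR) are inputs
    f = train.columns
    model = method.fit(train[f[0:-1]],train['lnMFR'])

    r2 = method.score(test[f[0:-1]],test['lnMFR'])
    print('R^2: ' + str(r2))

    lnMFR_pred = method.predict(test[f[0:-1]])

    # parity plot: points on the diagonal are perfect predictions
    plt.plot(test['lnMFR'],lnMFR_pred,'b.')
    plt.plot([-1,2],[-1,2],'k-')
    plt.xlabel('Measured lnMFR')
    plt.ylabel('Predicted lnMFR')
    plt.show()
    return
```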

#### Linear Regression

```python
lm = linear_model.LinearRegression()
fitn(lm)
```

#### K-Nearest Neighbors

```python
knn = KNeighborsRegressor(n_neighbors=20)
fitn(knn)
```

#### Support Vector Regressor

```python
# named svr to avoid overwriting `s` loaded from the pickle file
svr = svm.SVR(gamma='scale')
fitn(svr)
```

#### Neural Network

```python
nn = MLPRegressor(hidden_layer_sizes=(3,), max_iter=1000,
                  activation='tanh', solver='lbfgs')
fitn(nn)
```

Repeat the neural network fit, but use TensorFlow and Keras instead of scikit-learn.

```python
from keras.models import Sequential
from keras.layers import Dense

#################################################################
### Train model #################################################
#################################################################
f = train.columns
n_inputs = len(f)-1
nodes = 10

# create neural network model
model = Sequential()
model.add(Dense(n_inputs, input_dim=n_inputs, activation='linear'))
model.add(Dense(nodes, activation='linear'))
model.add(Dense(nodes, activation='tanh'))
model.add(Dense(nodes, activation='tanh'))
model.add(Dense(nodes, activation='linear'))
model.add(Dense(1, activation='linear'))
model.compile(loss="mean_squared_error", optimizer="adam")

# load training data
X1 = train.drop('lnMFR', axis=1).values
Y1 = train[['lnMFR']].values

# train the model
model.fit(X1,Y1,epochs=300,verbose=1,shuffle=True)

# Save the model to hard drive
#model.save('model.h5')
```

```python
#################################################################
### Test model ##################################################
#################################################################

# Load the model from hard drive
#from keras.models import load_model
#model = load_model('model.h5')

# load test data
X2 = test.drop('lnMFR', axis=1).values
Y2 = test[['lnMFR']].values

# test the model
mse = model.evaluate(X2,Y2, verbose=1)

print('Mean Squared Error: ', mse)
```

```python
# parity plot of measured versus predicted values
lnMFR_pred = model.predict(X2)
plt.plot(test['lnMFR'],lnMFR_pred,'b.')
plt.plot([-1,2],[-1,2],'k-')
plt.show()
```