In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Homework: Basics of Machine Learning

### 3 Phase Oil Dataset

Description (<a href="https://inverseprobability.com/3PhaseData">source</a>)

This is synthetic data modelling non-intrusive measurements on a pipeline transporting a mixture of oil, water and gas. The flow in the pipe takes one out of three possible configurations: horizontally stratified, nested annular or homogeneous mixture flow. The data lives in a 12-dimensional measurement space, but for each configuration, there are only two degrees of freedom: the fraction of water and the fraction of oil. (The fraction of gas is redundant, since the three fractions must sum to one.) Hence, the data lives on a number of ‘sheets’ which locally are approximately 2-dimensional.

In [None]:
os.system("wget http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/resources/3PhData.tar.gz")
os.system("tar -xvzf 3PhData.tar.gz")

In [None]:
# Load training data
X_train = np.loadtxt('DataTrn.txt')
X_train_frac = np.loadtxt('DataTrnFrctns.txt')
y_train = np.loadtxt('DataTrnLbls.txt')
y_train = np.argmax(y_train, axis=1)

# Load validation data
X_valid = np.loadtxt('DataVdn.txt')
X_valid_frac = np.loadtxt('DataVdnFrctns.txt')
y_valid = np.loadtxt('DataVdnLbls.txt')
y_valid = np.argmax(y_valid, axis=1)

# Load test data
X_test = np.loadtxt('DataTst.txt')
X_test_frac = np.loadtxt('DataTstFrctns.txt')
y_test = np.loadtxt('DataTstLbls.txt')
y_test = np.argmax(y_test, axis=1)
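
The label files appear to store one row per sample with one column per flow configuration, which is why each split is passed through `np.argmax`. A minimal sketch of that conversion on a made-up array, followed by a class-count check (the array `onehot` and the name `class_counts` are illustrative and not part of the dataset):

```python
import numpy as np

# Hypothetical one-hot label rows (not taken from the dataset files).
onehot = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 0],
                   [0, 0, 1]])

# argmax along axis 1 returns the column index of the 1 in each row.
labels = np.argmax(onehot, axis=1)

# Count how many samples fall in each of the three classes.
class_counts = np.bincount(labels, minlength=3)
```

Applying `np.bincount(y_train, minlength=3)` in the same way gives a quick view of class balance before any plotting.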

## Question 1: Data Visualization

Examine the distribution of each feature and of the labels in the training and validation datasets using histograms. For each feature and for the labels, comment on any imbalance in the distribution of values (is the distribution binomial? does it span multiple orders of magnitude?) and on any difference between the training and validation distributions. Label and title each plot appropriately.

Example:

```
# plot histogram of first feature in training datasets
plt.hist(X_train[:, 0])
plt.title("X_train Feature: 1")
plt.xlabel("X_1")
plt.ylabel("Unnormalized Counts")
```
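
To compare two splits feature by feature, one option is to overlay their histograms on shared bin edges. The sketch below is only one way to lay this out; the helper name `plot_feature_histograms` and its parameters are hypothetical, and `density=True` is an assumption made so that splits of different sizes remain comparable.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_feature_histograms(X_a, X_b, name_a="train", name_b="valid",
                            n_bins=30, n_cols=4):
    """Overlay histograms of every column of X_a and X_b on shared bin edges."""
    n_features = X_a.shape[1]
    n_rows = int(np.ceil(n_features / n_cols))
    fig, axes = plt.subplots(n_rows, n_cols,
                             figsize=(3 * n_cols, 2.5 * n_rows), squeeze=False)
    for j, ax in enumerate(axes.ravel()):
        if j >= n_features:
            ax.set_visible(False)  # hide unused panels
            continue
        # Shared edges so both splits are binned identically.
        both = np.concatenate([X_a[:, j], X_b[:, j]])
        edges = np.histogram_bin_edges(both, bins=n_bins)
        ax.hist(X_a[:, j], bins=edges, alpha=0.5, density=True, label=name_a)
        ax.hist(X_b[:, j], bins=edges, alpha=0.5, density=True, label=name_b)
        ax.set_title(f"Feature {j + 1}")
        ax.set_xlabel(f"X_{j + 1}")
        ax.set_ylabel("Density")
    axes[0, 0].legend()
    fig.tight_layout()
    return fig
```

With the arrays loaded above, `plot_feature_histograms(X_train, X_valid)` followed by `plt.show()` would draw one panel per measurement.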

## Question 2: Train a Linear Classification Model using the Measurements

Train a logistic regression model (i.e., a linear classifier) using the training dataset. Build three models:

1. No regularization applied
2. L1-regularization applied
3. L2-regularization applied

For models with regularization, adjust the regularization strength to maximize performance on the validation set. Evaluate the generalization of each final model with the test dataset. It is not necessary to show the performance of every model trained; evaluate only the three best models (one per regularization setting), as defined by validation-set performance.

To get started:
```
from sklearn.linear_model import LogisticRegression
# No regularization (scikit-learn >= 1.2; older releases used penalty='none')
clf = LogisticRegression(penalty=None)
# L1 regularization, lambda = C**-1 = 0.01
clf_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=100.)
# L2 regularization, lambda = C**-1 = 0.01
clf_l2 = LogisticRegression(penalty='l2', solver='liblinear', C=100.)
```
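
Tuning the regularization strength amounts to fitting one model per candidate `C` (where `C` is the inverse of the regularization strength) and keeping the one with the highest validation accuracy. The following is a minimal sketch of that loop, not a prescribed solution. The helper name `sweep_regularization` and the grid values are hypothetical, and the `saga` solver is swapped in here as an assumption because it handles both L1 and L2 penalties on multiclass problems; it may need standardized features or a larger `max_iter` to converge.

```python
from sklearn.linear_model import LogisticRegression

def sweep_regularization(X_tr, y_tr, X_va, y_va, penalty, C_grid):
    """Fit one model per C; return (best_C, best_model, validation scores by C)."""
    scores = {}
    best_C, best_model = None, None
    for C in C_grid:
        # Assumption: saga solver; smaller C means stronger regularization.
        model = LogisticRegression(penalty=penalty, solver="saga",
                                   C=C, max_iter=5000)
        model.fit(X_tr, y_tr)
        scores[C] = model.score(X_va, y_va)  # mean accuracy on validation data
        if best_C is None or scores[C] > scores[best_C]:
            best_C, best_model = C, model
    return best_C, best_model, scores

# Example call with the notebook's arrays (grid values are illustrative):
# best_C, best_l1, scores = sweep_regularization(
#     X_train, y_train, X_valid, y_valid, "l1", [0.01, 0.1, 1.0, 10.0, 100.0])
```

Only the selected model should then be scored on the test set, so that the test data plays no part in choosing `C`.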

## Question 3: Train 2-Feature Model

As mentioned in the data description, the phase of the oil/water mixture is uniquely determined by the mol fractions of water and oil present in the mixture. Train a linear classifier using `X_train_frac` instead of `X_train` and evaluate the train, validation, and test scores. Because this model has only two features, applying regularization is not strictly necessary.
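
One way to collect the three scores is to fit once and score each named split in a dictionary. The sketch below assumes a hypothetical helper `fit_and_score`. It uses scikit-learn's default `LogisticRegression`, which applies L2 regularization with `C=1`, so it is not a strictly unregularized model.

```python
from sklearn.linear_model import LogisticRegression

def fit_and_score(X_tr, y_tr, eval_splits):
    """Fit a default logistic regression; return it with accuracy per named split."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr, y_tr)
    scores = {name: model.score(X, y) for name, (X, y) in eval_splits.items()}
    return model, scores

# Example call with the notebook's two-feature arrays:
# model, scores = fit_and_score(
#     X_train_frac, y_train,
#     {"train": (X_train_frac, y_train),
#      "valid": (X_valid_frac, y_valid),
#      "test": (X_test_frac, y_test)})
```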

## Question 4: Model Evaluation and Interpretation

Why are we unable to train a linear classifier to correctly identify the phase of the 3-component mixture even though we have the minimum required amount of information from a thermodynamic viewpoint to exactly specify phase and composition? If you feel stuck here, try running the function `plot_2d_decision_boundary` to inspire your thinking.

In [None]:
def plot_2d_decision_boundary():

  # adapted from: 
  # https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html
  from sklearn.linear_model import LogisticRegression
  clf = LogisticRegression()
  clf.fit(X_train_frac, y_train)
  X = X_train_frac
  Y = y_train
  # Plot the decision boundary. For that, we will assign a color to each
  # point in the mesh [x_min, x_max] * [y_min, y_max].
  # Here, x is mol fraction 1 and y is mol fraction 2
  x_min, x_max = X[:, 0].min() - .05, X[:, 0].max() + .05
  y_min, y_max = X[:, 1].min() - .05, X[:, 1].max() + .05
  h = .02  # step size in the mesh
  xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
  Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

  # Put the result into a color plot
  Z = Z.reshape(xx.shape)
  plt.figure(1, figsize=(4, 3))
  plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

  # Plot also the training points
  # Each class (Stratified, Annular, Homogeneous) is assigned a color
  # that matches the corresponding decision area
  plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
  plt.xlabel('mol fraction 1')
  plt.ylabel('mol fraction 2')
  plt.xlim(xx.min(), xx.max())
  plt.ylim(yy.min(), yy.max())
  plt.xticks(())
  plt.yticks(())

  plt.show()

plot_2d_decision_boundary()
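
As a numeric complement to the decision-boundary plot, a table of true versus predicted labels shows which configurations the classifier confuses. This is a minimal sketch using NumPy only; the helper name `confusion_counts` is hypothetical, and it assumes integer labels in the range 0 to `n_classes - 1`, as produced by the `np.argmax` calls above.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes=3):
    """Return a table of counts: rows are true labels, columns are predictions."""
    table = np.zeros((n_classes, n_classes), dtype=int)
    rows = np.asarray(y_true, dtype=int)
    cols = np.asarray(y_pred, dtype=int)
    # add.at accumulates correctly even when the same (row, col) pair repeats.
    np.add.at(table, (rows, cols), 1)
    return table

# Example call, given a fitted two-feature classifier named clf:
# confusion_counts(y_valid, clf.predict(X_valid_frac))
```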