<h1 style="text-align: center;">AAE 590 Surrogate Methods</h1>

## Design of Experiments

This notebook supports the class material on design of experiments. We will use [`pyDOE2`](https://pythonhosted.org/pyDOE/) to generate samples. The following topics are covered:

1. [Full Factorial Sampling](#Full-Factorial-Sampling)
2. [Latin Squares Sampling](#Latin-Squares-Sampling)
3. [Latin Hypercube Sampling](#Latin-Hypercube-Sampling)

Before you proceed, make sure the `pyDOE2` package is installed. If it is not, close Jupyter Notebook and install the package in your environment by running `pip install pyDOE2` in the Anaconda prompt.

<font color='red'>**Please run the below block of code before you run any other block**</font> - it imports all the packages needed for this notebook.

In [None]:
from pyDOE2 import fullfact, lhs
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from scipy.spatial.distance import pdist

### Full Factorial Sampling

First, we will look at full factorial sampling. You can read about how to use pyDOE2 for generating full-factorial samples in the [documentation](https://pythonhosted.org/pyDOE/factorial.html#general-full-factorial). The block of code below generates full-factorial samples for a *two* dimensional problem and plots them. Read the comments for more details.

In [None]:
# Defining lower and upper bound
# Number of entries in the array will be 
# equal to number of dimensions
lb = np.array([-4, -4])
ub = np.array([4, 4])

# Defining number of samples to have in each dimension
# Total number of samples will be product of number of
# samples in each dimension
levels = np.array([7, 7])

# Generating normalized samples. Note that the
# output of `fullfact` is not between 0 and 1: it holds
# integer level indices from 0 to levels-1 in each dimension.
# So, we divide by (levels-1) to normalize it.
normalized_samples = fullfact(levels)/(levels-1)

# Scaling the normalized samples
samples = lb + (ub - lb)*normalized_samples

# Plotting the samples
fig, ax = plt.subplots(figsize=(6,5))
ax.scatter(samples[:,0], samples[:,1])
ax.set_xlabel("$X$", fontsize=12)
ax.set_ylabel("$Y$", fontsize=12)
ax.set_title("Samples", fontsize=14)
ax.grid()
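
The normalization and scaling steps above can be reproduced without pyDOE2, which may help clarify why the division by `levels-1` is needed. The sketch below is a minimal illustration using only NumPy; `full_factorial` is a hypothetical helper name, and the row order of its output may differ from that of `fullfact`. It assumes every dimension has at least two levels.

```python
import numpy as np

def full_factorial(levels, lb, ub):
    """Hypothetical helper: full-factorial grid scaled to [lb, ub]."""
    levels = np.asarray(levels)
    lb = np.asarray(lb, dtype=float)
    ub = np.asarray(ub, dtype=float)

    # Integer level indices for every combination, one row per sample.
    # np.indices returns shape (dim, *levels); flatten to (N, dim).
    idx = np.indices(tuple(levels)).reshape(len(levels), -1).T

    # Indices run from 0 to levels-1, so dividing by (levels-1)
    # maps each column onto [0, 1] (requires levels >= 2)
    normalized = idx / (levels - 1)

    # Scale from [0, 1] to [lb, ub]
    return lb + (ub - lb) * normalized

samples = full_factorial([7, 7], [-4, -4], [4, 4])
print("Number of samples:", samples.shape[0])
print("Column minima:", samples.min(axis=0))
print("Column maxima:", samples.max(axis=0))
```

The number of rows equals the product of the entries in `levels`, and the smallest and largest values in each column coincide with the bounds.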

Now, we will generate full-factorial samples in *three* dimensions. The block of code below generates the samples and plots them. Read the comments in the code for more details.

In [None]:
# Defining lower and upper bound
# Number of entries in the array will be 
# equal to number of dimensions
lb = np.array([-3, -1, -4])
ub = np.array([1, 5, 4])

# Defining number of samples to have in each dimension
# Total number of samples will be product of number of
# samples in each dimension
levels = np.array([3,4,5])

# Generating normalized samples. Note that the
# output of `fullfact` is not between 0 and 1: it holds
# integer level indices from 0 to levels-1 in each dimension.
# So, we divide by (levels-1) to normalize it.
normalized_samples = fullfact(levels)/(levels-1)

# Scaling the normalized samples
samples = lb + (ub - lb)*normalized_samples

# Plotting the samples
fig = plt.figure(figsize=(7,6))
ax = fig.add_subplot(projection='3d')
ax.scatter3D(samples[:,0], samples[:,1], samples[:,2], color="k")
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_zlabel('Z-axis')

# Plotting X and Y
fig, ax = plt.subplots(figsize=(5,4))
ax.scatter(samples[:,0], samples[:,1])
ax.set_xlabel("$X$", fontsize=12)
ax.set_ylabel("$Y$", fontsize=12)
ax.set_title("Samples", fontsize=14)
ax.grid()

# Plotting Y and Z
fig, ax = plt.subplots(figsize=(5,4))
ax.scatter(samples[:,1], samples[:,2])
ax.set_xlabel("$Y$", fontsize=12)
ax.set_ylabel("$Z$", fontsize=12)
ax.set_title("Samples", fontsize=14)
ax.grid()

# Plotting X and Z
fig, ax = plt.subplots(figsize=(5,4))
ax.scatter(samples[:,0], samples[:,2])
ax.set_xlabel("$X$", fontsize=12)
ax.set_ylabel("$Z$", fontsize=12)
ax.set_title("Samples", fontsize=14)
ax.grid()

plt.show()
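
Since the total number of samples is the product of the number of levels in each dimension, the size of a full-factorial design grows quickly with the number of dimensions. The short sketch below illustrates this with plain arithmetic; the choice of 7 levels per dimension and the list of dimensions are arbitrary values picked for illustration.

```python
import numpy as np

# Total samples for the three dimensional example above
levels = np.array([3, 4, 5])
total = int(np.prod(levels))
print("Samples for levels {}: {}".format(levels, total))

# Same number of levels in every dimension (illustrative values)
levels_per_dim = 7
counts = {}
for dim in (2, 3, 5, 10):
    counts[dim] = levels_per_dim**dim
    print("{} dimensions: {} samples".format(dim, counts[dim]))
```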

### Latin Squares Sampling

Now, we will look into latin squares sampling, i.e., two dimensional latin hypercube sampling. The `lhs` function within pyDOE2 generates latin hypercube samples. We will generate two different kinds of samples: *random* and *centermaximin*. The important parameter in the `lhs` function that determines the sampling type is `criterion`. If it is kept as `None`, then random lhs samples are generated. If it is set to `centermaximin`, then the points are centered within their intervals and a heuristic is used to increase the minimum distance between the points. You can read about how to use pyDOE2 for generating lhs samples in the [documentation](https://pythonhosted.org/pyDOE/randomized.html#latin-hypercube).
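
To make the idea of a latin hypercube concrete, here is a minimal sketch that builds one using only NumPy. It is an illustration of the general technique, not pyDOE2's implementation; `random_lhs` and its `centered` flag are hypothetical names. Each dimension is split into `num_samples` equal intervals, and every interval receives exactly one point.

```python
import numpy as np

def random_lhs(dim, num_samples, rng, centered=False):
    """Hypothetical sketch of latin hypercube sampling on [0, 1]^dim."""
    samples = np.empty((num_samples, dim))
    for j in range(dim):
        # Position of each point inside its interval:
        # the midpoint if centered, otherwise a random location
        if centered:
            offsets = np.full(num_samples, 0.5)
        else:
            offsets = rng.random(num_samples)

        # Shuffle the interval order independently in each dimension
        perm = rng.permutation(num_samples)
        samples[:, j] = (perm + offsets) / num_samples
    return samples

rng = np.random.default_rng(0)
pts = random_lhs(2, 5, rng)

# Latin property: every interval index appears once per column
bins = np.floor(pts * 5).astype(int)
print(np.sort(bins, axis=0))
```

Sorting the interval indices column by column should give `0, 1, 2, 3, 4` in each dimension, which is the defining property of a latin square design.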

The block of code below generates random lhs samples of different sizes for a **two dimensional problem**, computes the minimum distance between the points, and plots the points using `seaborn`. Read the comments for more details.

In [None]:
# Defining lower and upper bound
# Number of entries in the array will be 
# equal to number of dimensions
lb = np.array([-2, -3])
ub = np.array([5, 3])

# Number of variables
dim = len(lb)

# Generate random lhs samples of size 5, 25, 125, and 625
for itr in range(4):
    # Total number of samples
    num_samples = 5**(itr+1)
    
    # Generating samples. Output will be normalized
    # We are fixing the random state so that we can
    # compare with different criterion. But, in general,
    # you don't have to set the random state
    normalized_samples = lhs(dim, num_samples, iterations=10000, random_state=56, criterion=None)

    # Scaling the normalized variables
    samples = lb + (ub - lb)*normalized_samples
    
    # Computing the minimum distance between all the samples
    min_dist = np.min(pdist(samples))
    
    # Print the minimum distance
    print("Minimum distance between the {} samples: {}".format(num_samples, min_dist))

    # Plotting the samples using seaborn
    sns.jointplot(x=samples[:,0], y=samples[:,1])
    plt.xlabel("$X$", fontsize=12)
    plt.ylabel("$Y$", fontsize=12)
    plt.xlim(left=lb[0], right=ub[0])
    plt.ylim(bottom=lb[1], top=ub[1])

Note that the minimum distance between the samples decreases as the number of points increases, since more points are packed into the same domain. Also, the marginal distribution of the samples approaches a uniform distribution, which means that samples are equally likely anywhere in the given interval. Now, we will change the criterion and see if it increases the minimum distance between the samples.

In [None]:
# Defining lower and upper bound
# Number of entries in the array will be 
# equal to number of dimensions
lb = np.array([-2, -3])
ub = np.array([5, 3])

# Number of variables
dim = len(lb)

# Generate centermaximin lhs samples of size 5, 25, 125, and 625
for itr in range(4):
    # Total number of samples
    num_samples = 5**(itr+1)
    
    # Generating samples. Output will be normalized
    normalized_samples = lhs(dim, num_samples, iterations=10000, random_state=56, criterion="centermaximin")

    # Scaling the normalized variables
    samples = lb + (ub - lb)*normalized_samples
    
    # Computing the minimum distance between all the samples
    min_dist = np.min(pdist(samples))
    
    # Print the minimum distance
    print("Minimum distance between the {} samples: {}".format(num_samples, min_dist))

The minimum distance in this case is higher than in the case of random lhs sampling. This shows that pyDOE2 uses a heuristic to increase the minimum distance. **Note**: This is not space-filling lhs, which involves solving an optimization problem. Now, we will generate samples in higher dimensions.
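
One simple heuristic of the kind mentioned above is to draw several candidate designs and keep the one with the largest minimum distance. The sketch below illustrates that idea using only NumPy. It is an assumption about the general approach, not a copy of pyDOE2's code; `centered_lhs`, `best_of_n_lhs`, and `min_pairwise_distance` are hypothetical helper names.

```python
import numpy as np

def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two distinct points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    # Keep only pairs with i < j to skip the zero self-distances
    upper = np.triu_indices(len(points), k=1)
    return dist[upper].min()

def centered_lhs(dim, num_samples, rng):
    """Latin hypercube with each point at the center of its interval."""
    cols = [(rng.permutation(num_samples) + 0.5) / num_samples
            for _ in range(dim)]
    return np.column_stack(cols)

def best_of_n_lhs(dim, num_samples, iterations, rng):
    """Keep the candidate design with the largest minimum distance."""
    best, best_dist = None, -np.inf
    for _ in range(iterations):
        candidate = centered_lhs(dim, num_samples, rng)
        d = min_pairwise_distance(candidate)
        if d > best_dist:
            best, best_dist = candidate, d
    return best, best_dist

# Same seed for both runs, so the first candidate is identical
single, d_single = best_of_n_lhs(2, 25, 1, np.random.default_rng(56))
best, d_best = best_of_n_lhs(2, 25, 200, np.random.default_rng(56))
print("Minimum distance with 1 candidate:", d_single)
print("Minimum distance with 200 candidates:", d_best)
```

Because both runs start from the same seed, the 200-candidate result can never be worse than the single-candidate one. A search like this only picks the best of a random set, so it is not an optimal space-filling design.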

### Latin Hypercube Sampling

The process to generate samples is similar to what is described in the previous section. In this section, we will generate samples in four dimensions. Samples of different sizes are generated, and the minimum distance between the points is calculated. The points are plotted using `pairplot` from seaborn. You can refer to the [tutorial section](https://seaborn.pydata.org/tutorial/distributions.html#plotting-many-distributions) of seaborn for more details.

In [None]:
# Defining lower and upper bound
# Number of entries in the array will be 
# equal to number of dimensions
lb = np.array([-2, -3, -3, -4])
ub = np.array([5, 3, 4, 5])

# Number of variables
dim = len(lb)

# Generate random lhs samples of size 5, 25, 125, and 625
for itr in range(4):
    # Total number of samples
    num_samples = 5**(itr+1)
    
    # Generating samples. Output will be normalized
    normalized_samples = lhs(dim, num_samples, iterations=100, random_state=56, criterion=None)

    # Scaling the normalized variables
    samples = lb + (ub - lb)*normalized_samples
    
    # Computing the minimum distance between all the samples
    min_dist = np.min(pdist(samples))
    
    print("Minimum distance between the {} samples: {}".format(num_samples, min_dist))

    # Creating a pandas dataframe for plotting
    df = pd.DataFrame(samples, columns = ['A','B','C','D'])
    
    # Plotting the samples
    sns.pairplot(data=df)

As mentioned earlier, the minimum distance between the samples decreases as the number of points increases. The diagonal plots show the distribution of samples in each variable, and you can see that it approaches a uniform distribution, which means that samples are equally likely anywhere in the given interval. Each off-diagonal plot shows the distribution of points for a pair of variables.
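
One way to see why the minimum distance shrinks: adding points to an existing set can never increase the smallest pairwise distance. The sketch below checks this on nested subsets of a single point set. Note the assumptions: it uses plain uniform random points rather than lhs samples (a subset of an lhs design is generally not itself an lhs design), and `min_pairwise_distance` is a hypothetical NumPy-only stand-in for `np.min(pdist(...))`.

```python
import numpy as np

def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two distinct points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    # Keep only pairs with i < j to skip the zero self-distances
    upper = np.triu_indices(len(points), k=1)
    return dist[upper].min()

# Same bounds as the four dimensional example above
lb = np.array([-2, -3, -3, -4])
ub = np.array([5, 3, 4, 5])

# Uniform random points (not lhs), scaled to the bounds
rng = np.random.default_rng(0)
points = lb + (ub - lb) * rng.random((125, 4))

# Minimum distance for the first 5, 25, and 125 points
sizes = (5, 25, 125)
dists = [min_pairwise_distance(points[:n]) for n in sizes]
for n, d in zip(sizes, dists):
    print("Minimum distance between the first {} points: {}".format(n, d))
```

For independently generated designs of different sizes, as in the loop above, the decrease is typical rather than guaranteed.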

Now, we will change the criterion and see if it increases the minimum distance between the samples.

In [None]:
# Defining lower and upper bound
# Number of entries in the array will be 
# equal to number of dimensions
lb = np.array([-2, -3, -3, -4])
ub = np.array([5, 3, 4, 5])

# Number of variables
dim = len(lb)

# Generate centermaximin lhs samples of size 5, 25, 125, and 625
for itr in range(4):
    # Total number of samples
    num_samples = 5**(itr+1)
    
    # Generating samples. Output will be normalized
    normalized_samples = lhs(dim, num_samples, iterations=100, random_state=56, criterion="centermaximin")

    # Scaling the normalized variables
    samples = lb + (ub - lb)*normalized_samples
    
    # Computing the minimum distance between all the samples
    min_dist = np.min(pdist(samples))
    
    print("Minimum distance between the {} samples: {}".format(num_samples, min_dist))

The minimum distance in this case is higher than in the case of random lhs sampling. This shows that pyDOE2 uses a heuristic to increase the minimum distance. **Note**: This is not space-filling lhs, which involves solving an optimization problem.