# Exercise 7A - Full-factorial Experimental Design

In this exercise, we will build full-factorial designs in Python and demonstrate a practical limitation of this method.


### Colour codes

<span style="color:orange;"> Orange text is for emphasis and definitions </span>

<span style="color:lime;"> Green text is for tasks to be completed by the student </span>

<span style="color:dodgerblue;"> Blue text is for Python coding tricks and references </span>

## Load all the necessary Python packages
All packages should work in a Conda environment if one is installed on your machine. Otherwise, all necessary packages can be installed in a virtual environment (.venv) in VS Code using: Ctrl+Shift+P > Python: Create Environment > Venv > Python 3.12.x > requirements.txt

In [None]:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import pandas as pd
import psutil
import sys

svmem = psutil.virtual_memory()
mem_avail_begin = svmem.available / 1024 ** 2

## 1. Full-Factorial Sampling

Demonstrate how to generate full-factorial sampling lists in Python, first for a simple case and then for the larger case shown in the lecture.

### 1.1 Generation in Python
In Python there are two common ways to generate full-factorial lists:
* Python's `itertools`
* NumPy's `meshgrid`

We will be using the following dictionary of variables:

In [None]:
variables_1 = {
    "integers" : [1, 3, 5, 10, 100],
    "floats" : [1.4142, 1.6180, 2.7182, 3.1415, 6.2831],
    "categoricals" : ["red", "blue", "orange"],
}

### 1.1.1 Itertools Product

`itertools` is a standard-library module with many useful functions for building iterators and for combinatorics. We will be using its *Cartesian product* function, `itertools.product`, to generate the full-factorial set.

In [None]:
# First, get the values from the dictionary using its values() method
v = variables_1.values()

# Next, pass the values to itertools.product to obtain the full-factorial set.
# NOTE: The * unpacks the collection so that each list is passed as a separate argument
iterations_1 = list(itertools.product(*v))

print (f"{len(iterations_1)} iterations have been generated.")
print ("The first 20 combinations are:")
for i in range (20):
    print (iterations_1[i])

print (sys.getsizeof(iterations_1))
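
The role of the `*` in `itertools.product(*v)` can be seen in a minimal sketch. The lists `sizes` and `colours` are hypothetical and are not part of the exercise data.

```python
import itertools

# Hypothetical example lists, not part of the exercise data
sizes = [1, 2]
colours = ["red", "blue"]
options = [sizes, colours]

# Unpacking with * is equivalent to passing each list as its own argument
unpacked = list(itertools.product(*options))
explicit = list(itertools.product(sizes, colours))

print(unpacked)
print(unpacked == explicit)

# Without the *, product receives a single iterable (the outer list)
# and wraps each inner list in a one-element tuple
not_unpacked = list(itertools.product(options))
print(not_unpacked)
```

Forgetting the `*` does not raise an error, so it is worth checking the length of the result against the expected number of combinations.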


We can transform the full-factorial list into a dataframe for easier reference in the future.

In [None]:
# Because the values are stored in a dictionary, we can recover the variable names easily
df_1 = pd.DataFrame(iterations_1, columns = variables_1.keys())

print ("The first 20 combinations are:")
print (df_1.head(20))
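
As an aside, pandas can also build the same table without going through a list of tuples, by using `pd.MultiIndex.from_product` followed by `to_frame`. The sketch below is only an illustration of that alternative on a hypothetical, smaller dictionary (`variables_small`); it is not the approach used in the rest of this exercise.

```python
import itertools
import pandas as pd

# Hypothetical, smaller version of the dictionary used in the exercise
variables_small = {
    "integers": [1, 3],
    "floats": [1.5, 2.5],
    "categoricals": ["red", "blue", "orange"],
}

# Build the Cartesian product as a MultiIndex, then convert it to columns
index = pd.MultiIndex.from_product(
    list(variables_small.values()), names=list(variables_small)
)
df_alt = index.to_frame(index=False)

# Compare against the itertools result, row by row
rows_alt = list(df_alt.itertuples(index=False, name=None))
rows_itertools = list(itertools.product(*variables_small.values()))

print(df_alt.head())
print(rows_alt == rows_itertools)
```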

### 1.1.2 Numpy MeshGrid
The same process, but using NumPy's `meshgrid`, which builds one coordinate array per variable. With three variables, each of these arrays is three-dimensional.

In [None]:
# First, get the values from the dictionary using its values() method
v = variables_1.values()

# Pass the values of the variables to the np.meshgrid function.
# indexing="ij" keeps the same ordering of combinations as itertools.product
X = np.meshgrid(*v, indexing="ij")
print (X)
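
To make the structure of the `np.meshgrid` output easier to see, the following sketch uses three short hypothetical lists of different lengths (`a`, `b`, `c`) and prints only the shapes. It also shows how the `indexing` argument affects the axis order.

```python
import numpy as np

# Hypothetical inputs with different lengths so the axes are easy to tell apart
a = [1, 2]                  # 2 values
b = [10, 20, 30]            # 3 values
c = [0.1, 0.2, 0.3, 0.4]    # 4 values

grids_xy = np.meshgrid(a, b, c)                 # default is indexing="xy"
grids_ij = np.meshgrid(a, b, c, indexing="ij")  # matrix-style indexing

print(len(grids_xy))        # one array per input variable
print(grids_xy[0].shape)    # (3, 2, 4): the first two axes are swapped
print(grids_ij[0].shape)    # (2, 3, 4): axes follow the input order
```

Every returned array has the full grid shape, so each one holds as many elements as there are combinations.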

The resultant data is in an unfriendly form. An extra step is required to 'ravel' out the data for each variable, i.e. convert it from three-dimensional coordinate arrays into one long array.

In [None]:
# Flatten each coordinate array into a 1D array using a list comprehension
X = [x.ravel() for x in X]
# Stack the 1D arrays as columns so that each row is one combination
X = np.column_stack(X)

print ("The first 20 combinations are:")
for i in range (20):
    print (X[i])

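Both methods produce the same 75 combinations. Because a NumPy array holds a single dtype, `np.column_stack` coerces the mixed-type columns to a common type (here, strings), so the NumPy rows print as strings rather than as the original integers and floats.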
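
As a quick check of how the two methods compare, the sketch below uses small, purely numeric hypothetical inputs (`a`, `b`, `c`) so that the arrays can be compared directly. It suggests that the row order only matches `itertools.product` when `indexing="ij"` is passed to `np.meshgrid`. With the default `"xy"` indexing the same combinations appear, but in a different order.

```python
import itertools
import numpy as np

# Hypothetical numeric inputs (no strings, so the arrays stay numeric)
a = [1, 3, 5]
b = [1.5, 2.5]
c = [10, 20]

from_product = np.array(list(itertools.product(a, b, c)), dtype=float)

# Matrix-style indexing: same row order as itertools.product
grids_ij = np.meshgrid(a, b, c, indexing="ij")
from_meshgrid_ij = np.column_stack([g.ravel() for g in grids_ij])
same_order_ij = np.array_equal(from_product, from_meshgrid_ij)

# Default "xy" indexing: same combinations, different row order
grids_xy = np.meshgrid(a, b, c)
from_meshgrid_xy = np.column_stack([g.ravel() for g in grids_xy])
same_order_xy = np.array_equal(from_product, from_meshgrid_xy)
same_set_xy = set(map(tuple, from_meshgrid_xy)) == set(map(tuple, from_product))

print(same_order_ij, same_order_xy, same_set_xy)
```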

### 1.2 Let's Break Things!

Let's follow the same process but for a much larger solution space inspired by the problem shown in the lecture.

<span style = "color:orange;"> The last few entries have been commented out. It will become clear why. </span>

<span style = "color:lime;">Calculate how many iterations the Cartesian product will generate. </span>

In [None]:
variables_2 = {
    "length" : [5, 6, 7, 8, 9, 10],
    "width" : [5, 6, 7, 8, 9, 10],
    "height" : [2.5, 2.7, 2.9, 3.1, 3.3, 3.5],
    "u_window" : [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "g_value" : [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "wwr" : [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "ventilationRate" : [0.0, 0.25, 0.5, 0.75, 1.0],
    "infiltrationRate" : [1.0, 3.0, 5.0, 7.0, 9.0],
    "slabInsulation" : [0, 0.05, 0.10, 0.15, 0.2],
    #"wallInsulation" : [0, 0.05, 0.10, 0.15, 0.2, 1, 3],
    #"roofInsulation" : [0, 0.05, 0.10, 0.15, 0.2, 1, 3],
    #"overhangDepth" : [0, 0.1, 0.2, 0.3, 0.4],
    #"roofAbsorptivity" : [0.2, 0.35, 0.5, 0.75, 0.9]
}
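
One way to check your answer to the task above without building the list is to multiply the number of options per variable. The sketch below is self-contained, so it restates the option counts of the nine uncommented variables in a hypothetical dictionary called `option_counts`. In the notebook itself, the same result should follow from `math.prod(len(values) for values in variables_2.values())`.

```python
import math

# Number of options for each uncommented variable in variables_2
# (option_counts is a hypothetical helper name used only in this sketch)
option_counts = {
    "length": 6,
    "width": 6,
    "height": 6,
    "u_window": 7,
    "g_value": 7,
    "wwr": 7,
    "ventilationRate": 5,
    "infiltrationRate": 5,
    "slabInsulation": 5,
}

# The size of a Cartesian product is the product of the individual sizes
n_combinations = math.prod(option_counts.values())
print(f"{n_combinations:,} combinations")  # 9,261,000 combinations
```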

Let's create a full-factorial list using `itertools.product`.

In [None]:
v = variables_2.values()

iterations_2 = list(itertools.product(*v))

print (f"{len(iterations_2):.4e} iterations have been generated.")
print ("The first 20 iterations are:")
for i in range (20):
    print (iterations_2[i])


### 1.2.1 Memory Usage
You will likely have noticed that this cell took much longer to run (or did not complete at all). Let's examine why.

We printed out how many iterations were generated. Let's compare how much memory *iterations_1* and *iterations_2* take up. The *sys* module has a function called *getsizeof* which returns the size of an object in bytes.

In [None]:
# sys.getsizeof reports the size of the list object itself, converted here to MB
memory_1 = sys.getsizeof(iterations_1) / 1024 ** 2
memory_2 = sys.getsizeof(iterations_2) / 1024 ** 2

print (f"Iterations_1 = {memory_1:.4f} MB")
print (f"Iterations_2 = {memory_2:.4f} MB")


As you can see, iterations_2 consumes orders of magnitude more memory than iterations_1. This is to be expected.

The size of iterations_2 should be on the order of tens of MB, which should be fine for your computer. However, this is misleading, as it only counts the size of the list itself (one 8-byte reference per entry on a 64-bit system) and **not** the sizes of the items in the list. [Link if you are interested in the specifics](https://python.land/check-memory-usage-of-your-python-objects).
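
The gap between the size of the list and the size of its contents can be illustrated on a small hypothetical example. The sketch below adds up the sizes of the tuples stored in the list. It still does not count the numbers inside the tuples, many of which are shared between tuples, so treat it as a rough lower bound rather than an exact figure.

```python
import itertools
import sys

# Small hypothetical full-factorial set: 10 x 10 x 10 = 1000 tuples
small = list(itertools.product(range(10), range(10), range(10)))

# Shallow size: the list object and its references only
shallow_bytes = sys.getsizeof(small)

# Size of the tuple objects that the list refers to
tuple_bytes = sum(sys.getsizeof(t) for t in small)

print(f"List only        : {shallow_bytes} bytes")
print(f"Tuples in list   : {tuple_bytes} bytes")
print(f"Ratio tuples/list: {tuple_bytes / shallow_bytes:.1f}")
```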

To get a sense of how much memory is being taken up we need to access your computer's performance diagnostics. Using psutil we can access how much memory is currently available and used. 

In [None]:
svmem = psutil.virtual_memory()
mem_total = svmem.total / 1024 ** 2
mem_avail = svmem.available / 1024 ** 2
mem_used = svmem.used / 1024 ** 2

print (f"Total memory = {mem_total:.0f} MB")
print (f"Available memory = {mem_avail:.0f} MB")
print (f"Used memory = {mem_used:.0f} MB")


I secretly measured how much memory your system had available at the start of the notebook! Let's compare the results.

In [None]:
print (f"Memory Available Before = {mem_avail_begin:.0f} MB, Memory Available Now = {mem_avail:.0f} MB.")

mem_object =  mem_avail_begin - mem_avail

print (f"The memory of the iterations_2 object is approximately {mem_object:.0f} MB, or about {mem_object / mem_total * 100:.0f}% of your total memory.")
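
The difference in available system memory is only an approximation, because other programs on your machine also allocate and release memory in the meantime. As an alternative, the standard-library `tracemalloc` module tracks only the memory allocated by Python in the current process. The sketch below applies it to a small hypothetical product (`combos`). Tracing slows execution noticeably, so it is probably best tried on a small case first.

```python
import itertools
import tracemalloc

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()

# Small hypothetical full-factorial set: 20 x 20 x 20 = 8000 tuples
combos = list(itertools.product(range(20), range(20), range(20)))

# get_traced_memory returns (current, peak) in bytes
after, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

allocated_kb = (after - before) / 1024
print(f"Approximately {allocated_kb:.0f} KB allocated for {len(combos)} combinations")
```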

### 1.2.2 Let's see how much more your machine can handle!

Add one more variable with 5 options to the dict and run the code again. <span style="color:orange;"> This may take considerably longer to run than the previous step, or it may not complete at all. </span>

In [None]:
# First clear the original iterations objects so the old and new lists do not coexist in memory
del iterations_1, iterations_2

# Add wallInsulation into the variables_2 dictionary
variables_2["wallInsulation"] = [0, 0.05, 0.10, 0.15, 0.2]

v = variables_2.values()
iterations_2 = list(itertools.product(*v))

svmem = psutil.virtual_memory()
mem_avail = svmem.available / 1024 ** 2

print (f"Memory Available Before = {mem_avail_begin:.0f} MB, Memory Available Now = {mem_avail:.0f} MB.")

mem_object =  mem_avail_begin - mem_avail

print (f"The memory of the iterations_2 object is approximately {mem_object:.0f} MB, or about {mem_object / mem_total * 100:.0f}% of your total memory.")

One of three things may happen:
* The cell completes and you see that iterations_2 uses significantly more memory.
* You get a memory error. You can try removing a couple of values from *variables_2["wallInsulation"]* and try again.
* It is very slow. Likely your computer is writing everything it can't store in memory to disk. If it is taking more than a minute, you can stop the cell's execution.

It is not difficult to see that if we included the full set of parameters the memory problem would be worse.
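
A rough estimate gives a sense of scale. The sketch below assumes 64-bit CPython, one tuple per combination and one 8-byte list reference per tuple, and it ignores the memory of the values themselves. The option counts are restated from the full `variables_2` dictionary, including the four commented-out entries, in a hypothetical list called `option_counts`.

```python
import math
import sys

# Options per variable for all 13 variables, including the commented-out ones
# (option_counts is a hypothetical name used only in this sketch)
option_counts = [6, 6, 6, 7, 7, 7, 5, 5, 5, 7, 7, 5, 5]

n_combinations = math.prod(option_counts)

# Size of one tuple holding 13 references, measured on this interpreter
bytes_per_tuple = sys.getsizeof(tuple(range(len(option_counts))))
bytes_per_reference = 8  # assumption: 64-bit CPython

estimate_gb = n_combinations * (bytes_per_tuple + bytes_per_reference) / 1024 ** 3

print(f"{n_combinations:,} combinations")
print(f"Roughly {estimate_gb:,.0f} GB for the list and its tuples")
```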

<span style="color:dodgerblue;">There are several ways around this problem when working with large datasets: using NumPy arrays instead of Python lists (they are more memory-efficient), processing the data in chunks, and storing the data on disk, for example in a SQL database or an HDF5 file.</span>
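
Chunking can be sketched with the standard library alone. `itertools.product` is lazy, so combinations are only created when requested. The full list is what consumes the memory, not the product itself. The example below uses a hypothetical helper, `iter_chunks`, and a small hypothetical dictionary, and it keeps only a running result (the largest room volume) instead of storing every combination.

```python
import itertools

def iter_chunks(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items (hypothetical helper)."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# Small hypothetical variable set, not the one used in the exercise
variables_small = {
    "length": [5, 6, 7],
    "width": [5, 6, 7],
    "height": [2.5, 3.0],
}

# No list() call here: the combinations are generated on demand
combos = itertools.product(*variables_small.values())

total = 0
max_volume = 0.0
for chunk in iter_chunks(combos, chunk_size=5):
    # Only one chunk of combinations is held in memory at a time
    for length, width, height in chunk:
        total += 1
        max_volume = max(max_volume, length * width * height)

print(f"Processed {total} combinations, largest volume = {max_volume}")
```

This keeps memory use roughly constant, but the run time still grows with the number of combinations.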

## 2. Summary

* Demonstrated how to put together a full-factorial set for building simulation problems.
* Demonstrated the practical limitations of this method for large solution spaces.