# CHEM 60 - February 14th, 2024 (Intro to Optimization)

We are going to wind our way to the notion of optimization today. Optimization problems come up pretty much everywhere in chemistry (and other fields) and can take many forms. We actually did one type of optimization last week (the brute-force curve fitting example was an optimization). When doing your Molecular Monday calculations, a lot of optimization has gone on behind the scenes.

From optimizing conditions in chemical synthesis to optimizing molecular geometry in a DFT calculation, there is shared logic and mathematics that is really worth understanding.

When we optimize *something*, that means we are searching for some maximum or minimum value - a maximum yield, a maximum $r$-coefficient, a minimum energy, minimum residuals, etc. In order to 'optimize' our *something* (target, let's say), we need to have variables or parameters we can, well, vary that will impact said target.

In the example with curve fitting last week, our target was the sum of the squared residuals, and we wanted to minimize it ("least squares"). Our variables - or parameter space - consisted of a range of possible slopes and intercepts.
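
As a reminder of what that brute-force search looks like, here is a minimal sketch. The data points and the grid ranges are made up for illustration (they are not last week's dataset); the point is just that we evaluate the target at every spot in the parameter space and keep the smallest one.

```python
import numpy as np

# Hypothetical data: points scattered around y = 2x + 1 (made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Parameter space: a grid of candidate slopes and intercepts (ranges are assumptions)
slopes = np.linspace(0, 4, 81)        # steps of 0.05
intercepts = np.linspace(-2, 4, 121)  # steps of 0.05

# Target: sum of squared residuals (SSR) for every (slope, intercept) pair
M, B = np.meshgrid(slopes, intercepts, indexing='ij')
predicted = M[..., None] * x + B[..., None]   # shape: (n_slopes, n_intercepts, n_points)
ssr = np.sum((y - predicted) ** 2, axis=-1)

# Optimize: pick the grid point with the smallest SSR
i, j = np.unravel_index(np.argmin(ssr), ssr.shape)
best_slope, best_intercept = slopes[i], intercepts[j]
print("best slope:", best_slope, "best intercept:", best_intercept, "SSR:", ssr[i, j])
```

The answer can only ever be as fine as the grid, which is one reason brute force gets expensive quickly as you add parameters.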


To get started, click on '**File**' in the left menu, then '**Save a copy in Drive**' to ensure you are editing *your* version of this assignment (if you don't, your changes won't be saved!). After you click '**Save a copy in Drive**' a popup that says **Notebook copy complete** should appear, and it may ask you to <font color='blue'>**Open in a new tab**</font>. When open, your new file will be named `Copy of CHEM60_Class_8... .ipynb` (you may want to rename it before/after you move it to your chosen directory).

# Imports

Here are the Python imports that we will need today. A couple of new ones are included, along with some extra formatting (a custom colour map we might use).

Run the below code block to get started.

In [None]:
# Third party imports
import matplotlib.colors as colours  # Library for defining colours for plotting
import matplotlib.pyplot as plt  # Library for creating plots and visualizations
import numpy as np  # Library for numerical computing
import pandas as pd  # Library for data manipulation and analysis
import plotly.graph_objects as go  # Library for creating interactive plots
from statsmodels.stats.multicomp import pairwise_tukeyhsd # a stats test we'll use
from scipy import linalg  # Library for linear algebra operations
from scipy.stats import distributions # Module that loads in pre-made statistical distributions
from scipy.stats import f # Module for F statistics to get P-values
from scipy.interpolate import griddata  # Module for data interpolation on a grid

# This part of the code block is telling matplotlib to make certain font sizes extra, extra large by default
# Here is where I list what parameters I want to set new defaults for
params = {'legend.fontsize': 'xx-large',
         'axes.labelsize': 'xx-large',
         'axes.titlesize':'xx-large',
         'xtick.labelsize':'xx-large',
         'ytick.labelsize':'xx-large'}
plt.rcParams.update(params)

# below sets up some custom colour maps for certain visualizations
# colour blind friendly colour map (blue to red)
cmap_hex_temp = ['#2c7bb6','#abd9e9','#ffffbf','#fdae61','#d7191c']
cmap_RBG_temp = [colours.to_rgb(cmap_hex_temp[i]) for i in range(len(cmap_hex_temp))]
cmap_temp = colours.LinearSegmentedColormap.from_list('cmap',cmap_RBG_temp,N=256)
# Calculate color level values for each color
colour_levels = [i/float(len(cmap_hex_temp)-1) for i in range(len(cmap_hex_temp))]
# Create the colorscale list for Plotly
colourscale = [[colour_levels[i], cmap_hex_temp[i]] for i in range(len(cmap_hex_temp))]

On Monday, you saw a nice example of finding the distance associated with the minimum energy for the O-O system (ie. $\text{O}_2$). This is an example of geometry optimization.

In [None]:
# we're not re-running these today!
dist = [0.5       , 0.5862069 , 0.67241379, 0.75862069, 0.84482759,  0.93103448, 1.01724138, 1.10344828, 1.18965517, 1.27586207,1.36206897, 1.44827586, 1.53448276, 1.62068966, 1.70689655,1.79310345, 1.87931034, 1.96551724, 2.05172414, 2.13793103,2.22413793, 2.31034483, 2.39655172, 2.48275862, 2.56896552,2.65517241, 2.74137931, 2.82758621, 2.9137931 , 3.        ]
energy = [-142.83145598888603,-145.75368546778373, -147.48581182092795, -148.48178800434505, -149.03081058146063, -149.32484884392164,-149.4693374365733, -149.52713923490964, -149.53543166711282, -149.51619309250424, -149.4826317854426,-149.4422527260286, -149.39936341364097, -149.3559334484894, -149.31439704290997, -149.2750698381664,-149.23838283368121, -149.20411880460887, -149.1730433537303, -149.1446194036704, -149.11867257567388,-149.09501513175158, -149.07322402577395, -149.05362473403744, -149.03579277603419, -149.01957653617893,-149.00483321293848, -148.9912790073016, -148.97909612853542, -148.96800065923728]

# here is the plot
plt.figure(figsize=(6,5))
plt.plot(dist,energy)
plt.xlabel('distance (Å)')
plt.ylabel('energy ($E_h$)')

# Finding the distance at which the energy is lowest
d_min = dist[np.argmin(energy)]

# Annotating the minimum-energy distance on the plot
plt.annotate('Minimum Distance = '+str(round(d_min,3))+" Å", xy=(d_min, np.min(energy)), xytext=(d_min, np.min(energy) + 1),
             arrowprops=dict(facecolor='black', arrowstyle='->'))
plt.title("$O_2$ Energy vs O-O Distance \n psi4, scf/6-31++G(d,p)")
plt.show()

Here, the target was energy (and we wanted a minimum), and the parameter space consisted of only one variable - distance.
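
Notice that `np.argmin` can only return a distance that is actually on our scan grid. One common way to squeeze a bit more out of a coarse scan is to fit a parabola through the points around the lowest one and take its vertex. Here is a rough sketch using the three scan points that bracket the minimum above. Treating the well as locally parabolic is my assumption here, not something the calculation itself guarantees.

```python
import numpy as np

# The three scan points bracketing the lowest energy (copied from the lists above)
d = np.array([1.10344828, 1.18965517, 1.27586207])  # distance (Å)
e = np.array([-149.52713923490964, -149.53543166711282, -149.51619309250424])  # energy (E_h)

# Fit E(d) = a*d**2 + b*d + c exactly through the three points
a, b, c = np.polyfit(d, e, 2)

# The vertex of the parabola is our interpolated minimum
d_fit = -b / (2 * a)
e_fit = np.polyval([a, b, c], d_fit)
print(f"Interpolated minimum near {d_fit:.3f} Å (E = {e_fit:.4f} E_h)")
```

The interpolated minimum lands slightly shorter than the grid value, because the energy on the short-distance side of the lowest point is lower than on the long-distance side.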

All manner of chemists think about optimization problems. One that we'll look at today might seem quite different from the above, but it is the same fundamental operation: Reaction Optimization.

When designing a new chemical synthesis, how do you know how much reagent or catalyst to use? How do you understand the optimal reaction conditions? You vary them, just like the distance was varied above.

# Reaction Optimization

Jonathan and Matthew picked a paper for us in our first class together that was all about optimization!

The paper is Glavinović et al.'s "[A chlorine-free protocol for processing germanium](https://www.science.org/doi/10.1126/sciadv.1700149)". You'll spend some time with it today and in your homework (a PDF of the paper and its supplementary material can also be found in the [In_Class_Notebooks](https://drive.google.com/drive/folders/1hqeix2IYjN9PxOK2pn63Y4gucbuMONVr?usp=share_link) directory in the Drive).

Here is the abstract to get you thinking about what it's about and why it's important:
> Replacing molecular chlorine and hydrochloric acid with less energy- and risk-intensive reagents would markedly improve the environmental impact of metal manufacturing at a time when demand for metals is rapidly increasing. We describe a recyclable quinone/catechol redox platform that provides an innovative replacement for elemental chlorine and hydrochloric acid in the conversion of either germanium metal or germanium dioxide to a germanium tetrachloride substitute. Germanium is classified as a “critical” element based on its high dispersion in the environment, growing demand, and lack of suitable substitutes. Our approach replaces the oxidizing capacity of chlorine with molecular oxygen and replaces germanium tetrachloride with an air- and moisture-stable Ge(IV)-catecholate that is kinetically competent for conversion to high-purity germanes.


So, traditional germanium processing has some big problems from a Green Chemistry lens. The authors present a new synthesis - that they claim is generalizable beyond germanium - to process this metal more safely and efficiently. Sounds good. So, how do you systematically optimize a reaction?

One way to do it is by varying individual components of the reaction and seeing what happens (that's what the authors of this paper did).
![Glavinović et al., supplementary figure 1 showing reaction scheme](https://kavassalis.space/s/Glavinovic_2017_S1_Optimization_of_LAG_additive_composition.png)


We'll step through their process so you can see what this looks like. This reaction involves grinding up the reactants - they use something called liquid-assisted grinding (LAG) to do so. What liquid did they pick? And how did they pick it?

In the supplement, the authors explain their process.

> The mechanical milling conditions were optimized to promote full conversion of starting materials into either crystalline phases 3 or 3a. By qualitative evaluation of the PXRD patterns of crude reaction mixtures, the LAG additive composition, LAG additive volume, Py equivalents, milling time and milling frequency were optimized so that crystalline phases corresponding to starting materials were not detected, and crystalline phases corresponding to 3 or 3a were dominant.

A key thing to note here is "qualitative evaluation" of the PXRD. PXRD, or Powder X-ray Diffraction, is a technique used to study crystalline materials. It operates by directing X-rays at a sample, which then diffract in various directions. By studying the angles and intensities of these diffracted beams, chemists can gain detailed information about the structure and chemical composition of the material.

# Optimization of LAG additive composition.

Check out table S1 in the supplement (make sure I didn't make any mistakes!)

Does looking at this table tell you how/why 1:1 PhMe:H₂O ended up being the winning LAG liquid?

In [None]:
# Define the DataFrame columns
columns = ['LAG Liquid', 'LAG Volume (μL)', 'Pyridine (equiv)', 'Milling Time (mins)', 'Milling frequency (Hz)', 'Scale (mg)', 'Observed SM Phase', 'Observed Product Phase']

# Initialize an empty DataFrame
LAG_liquid_df = pd.DataFrame(columns=columns)

# Fill in the DataFrame with your data
LAG_liquid_df.loc[0] = ['Neat', 60, 2, 90, 25, 200, 'Ge. 1', '3'] # values for Entry (1)
LAG_liquid_df.loc[1] = ['H₂O', 60, 2, 90, 25, 200, 'Ge. 1', '3, 3a'] # values for Entry (2)
LAG_liquid_df.loc[2] = ['MeOH', 60, 2, 90, 25, 200, 'Ge. 1', '3a'] # values for Entry (3)
LAG_liquid_df.loc[3] = ['PhOH', 60, 2, 90, 25, 200, 'Ge. 1', '3'] # values for Entry (4)
LAG_liquid_df.loc[4] = ['1:1 PhMe:H₂O', 60, 2, 90, 25, 200, 'Ge. 1', '3, 3a'] # values for Entry (5)
LAG_liquid_df.loc[5] = ['1:1 PhMe:MeOH', 60, 2, 90, 25, 200, 'Ge. 1', '3, 3a'] # values for Entry (6)

# now let's call the dataframe to look at the table
LAG_liquid_df

The authors state:
> The resulting green/brown crude solids were analyzed by PXRD (fig. S2). The optimal liquid additive composition was determined to be 1:1 PhMe:H2O (entry 5)

Okay - how did they know? Looking at the PXRD data.

They didn't provide their raw PXRD data, so... I have made some up! This part - making up data - is not part of the reaction optimization process. But it is useful for demonstrating the optimization techniques we want to work through.

If you look at the PXRD data in the paper, you'll see peaks that look kind-of Gaussian associated with some funny "2 theta" angles. In PXRD, the peaks in the "diffractogram" (the graph resulting from the X-ray analysis) represent the response from the sample when it is struck by the X-rays. Each peak corresponds to a specific plane of atoms within the crystal structure of the material. The 2 theta (2Θ) angles are the angles at which these peaks occur. ie. 2Θ is the diffracted angle at which X-rays scatter off of the planes in the crystalline sample. The specific angles of diffraction (and the positions of the peaks) are unique to the atomic arrangement of the sample analyzed. This allows the authors to identify the structures of unknown samples by comparing the 2Θ values to known diffraction patterns.
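
If you are curious how a 2Θ angle maps onto a spacing between planes of atoms, the relationship is Bragg's law, $n\lambda = 2d\sin\theta$. Here is a small sketch that converts the two angles we'll care about later into plane spacings. The wavelength is an assumption on my part (Cu Kα, 1.5406 Å, a very common lab X-ray source); check the paper's methods for what was actually used.

```python
import numpy as np

def d_spacing(two_theta_deg, wavelength=1.5406, order=1):
    """Plane spacing d (Å) from Bragg's law, n*lambda = 2*d*sin(theta).

    wavelength defaults to Cu K-alpha (1.5406 Å), which is an assumption here.
    """
    theta = np.radians(two_theta_deg / 2)  # the diffractogram reports 2θ, Bragg's law uses θ
    return order * wavelength / (2 * np.sin(theta))

for two_theta in (13, 27):
    print(f"2θ = {two_theta}°  ->  d ≈ {d_spacing(two_theta):.2f} Å")
```

Smaller angles correspond to larger spacings, which is why big molecular crystals tend to show peaks at low 2Θ.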

You don't need to know the details of how PXRD works, but the big things to know are that the peak heights (intensities) and 2Θ angles let you know what you are looking at chemically.

Because we know we expect some Gaussian-y peaks at specific 2Θs, plus some noise, I wrote a quick function to simulate their data.

In [None]:
# Function to generate a Gaussian peak
def gaussian(x, height, centre, width):
    return height * np.exp(-(x - centre) ** 2 / (2 * width ** 2))

## PRACTICE QUESTION

Test the above function to see what it gives you so you know *how* the synthetic data is being generated.

What type of object is it spitting out? What do the different terms mean?


---



In [None]:
# Create a range of 2-theta values
theta = np.linspace(4, 40, 1000)
### try some different values below (width must be non-zero, or you'll be dividing by zero)
height = 0
centre = 0
width = 0
# testing the thing!
gaussian(theta, height, centre, width)



---


Okay, if you look at Fig. S2 from the authors, you'll notice many peaks. I eye-balled the ones that looked important and created a list of fake peaks below. This part is distinctly different from what the authors did (we assume!).

In [None]:
# Define parameters for the Gaussian peaks for each PXRD pattern: height, mean, width for each peak
all_params = [
    { 'params': [(1, 21, .1), (3, 25.5, .1), (15, 27, .1)], 'name': 'Ge', 'colour': '#377eb8', 'noise':.2},
    { 'params': [(1, 7, .1), (1, 12.2, .2), (15, 13, .1), (1, 17, .2), (1.5, 19, .2), (1, 22, .2), (1, 25, .2), (1, 30, .2)], 'name': '$C_{14}H_{20}O_2$', 'colour': '#e41a1c', 'noise':.3},
    { 'params': [(1, 8.5, .1), (1, 12.2, .2), (15, 13, .1), (3, 17, .2), (3.5, 19, .2), (1, 22, .2), (2, 25, .2), (1, 25.5, .1), (10, 27, .1), (1, 30, .2)], 'name': 'Neat', 'colour': '#4daf4a', 'noise':.5},
    { 'params': [(15, 8.5, .1), (1, 12.2, .2), (15, 13, .1), (3, 17, .2), (3.5, 19, .2), (1, 22, .2), (2, 25, .2), (1, 25.5, .1), (10, 27, .1), (1, 30, .2)], 'name': 'H₂O', 'colour': '#4daf4a', 'noise':.5},
    { 'params': [(15, 9.5, .1), (2, 12.2, .2), (6, 13, .1), (3, 17, .2), (3.5, 19, .2), (1, 22, .2), (2, 25, .2), (1, 25.5, .1), (5, 27, .1), (1, 30, .2)], 'name': 'MeOH', 'colour': '#4daf4a', 'noise':.5},
    { 'params': [(6, 8.5, .1), (2, 12.2, .2), (15, 13, .1), (3, 17, .2), (3.5, 19, .2), (1, 22, .2), (2, 25, .2), (1, 25.5, .1), (6, 27, .1), (1, 30, .2)], 'name': 'PhMe', 'colour': '#4daf4a', 'noise':.5},
    { 'params': [(15, 8.5, .1), (10, 10, .4), (2, 13, .1), (2, 17, .2), (2, 19, .2), (2, 22, .2), (2, 25, .2), (2, 25.5, .1), (2, 27, .1), (2, 30, .2)], 'name': '1:1 PhMe:H₂O', 'colour': '#4daf4a', 'noise':.5},
    { 'params': [(5, 8.5, .1), (1, 12.2, .2), (15, 13, .1), (5, 17, .2), (6.5, 19, .2), (1, 22, .2), (2, 25, .2), (1, 25.5, .1), (10, 27, .1), (1, 30, .2)], 'name': '1:1 PhMe:MeOH', 'colour': '#4daf4a', 'noise':.5},
]


Now, let's look at what this looks like.

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
theta = np.linspace(4, 40, 1000)

# Generate and plot PXRD patterns for each parameter set
for i, param_dict in enumerate(all_params):

    # Generate noise and add it to the y-data
    np.random.seed(0)  # For reproducibility
    noise = np.random.normal(0, param_dict['noise'], theta.shape)

    # Generate the y-data for the PXRD pattern
    y = sum(gaussian(theta, height, mean, width) for height, mean, width in param_dict['params'])
    y += noise

    # Add the offset for easier visualization
    y -= i * 20

    # Plot the PXRD pattern
    ax.plot(theta, y, lw=3, color=param_dict['colour'])

    # Create a text label for the PXRD pattern
    ax.text(42, -i*20, param_dict['name'], verticalalignment='center', color=param_dict['colour'])

# Set the x-ticks to start at 4 and increment by 5
ax.set_xticks(range(4, 40, 5))

# Remove y-axis ticks
ax.yaxis.set_ticklabels([])

# Hide grid lines
ax.grid(False)

# Set the x-axis label
ax.set_xlabel('$2 \\theta$')

plt.show()

Look! Fake data! It looks pretty decent. Okay, how did the above figure let them pick the optimal LAG liquid?

## PRACTICE QUESTION

Talk with your neighbours - why did the above let the authors pick the ideal LAG liquid? They did this step qualitatively - how would the above let you do that?


---



**you can write notes if that will help you**



---



# Which is best?

What if we wanted to quantify which one was better? First, we'd need to have defined the notion of 'better' (what you discussed above, presumably). Then, we'd want to know if - statistically - the different liquids had significantly different performance.

To do this, we'd need repeated measurements (ie. not just one PXRD pattern for each liquid, but several, so we could say if the differences were meaningfully consistent).

Let's functionalize our fake data maker to make fake replicates now, too.

In [None]:
# Create a function to generate Gaussian peaks and return the y-values
def generate_pattern(theta, params, noise_level):
    noise = np.random.normal(0, noise_level, theta.shape)
    y = sum(gaussian(theta, height, mean, width) for height, mean, width in params)
    y += noise
    return y

Now, instead of looking at the entire pattern, let's only concern ourselves with the two regions we are interested in (the regions of the pattern that let us know whether some of our reactants ended up in our final product).

We aren't doing fancy peak finding or integrating under curves yet (soon, though!). Instead, we'll just pick the signal values associated with the two peaks we know to be significant (because our two main reactants have distinct, characteristic 2Θ values).

In [None]:
# Initializing an empty list to store each dataframe
df_list = []

# Iterate through each entry in all_params
for i, entry in enumerate(all_params[2:]):
    # Repeat three times for replicates
    for j in range(3):
        y_vals = generate_pattern(theta, entry['params'], entry['noise'])
        # 2 theta locations of the Peaks
        peak_I_location = 13
        peak_Ge_location = 27
        # Find the closest theta value to the peak location & get corresponding y value from y_vals
        peak_I_value = y_vals[(np.abs(theta - peak_I_location)).argmin()]
        peak_Ge_value = y_vals[(np.abs(theta - peak_Ge_location)).argmin()]
        # Create a new dataframe for each peak value and append it to df_list
        new_df = pd.DataFrame({'LAG Liquid': [entry['name']], '1 Peak': [peak_I_value], 'Ge Peak': [peak_Ge_value], 'Replicate': [j]})
        df_list.append(new_df)

# Combine all the dataframes in df_list
peak_values = pd.concat(df_list, ignore_index=True)

Let's look at what we've made now:

In [None]:
peak_values

Okay, so it is a table storing the signal value for the characteristic reactant peaks (for Ge and '1' - $C_{14}H_{20}O_2$).

As always, let's look at it:

In [None]:
# Plotting the data
fig, ax1 = plt.subplots(figsize=(10,5))

ax1.set_xlabel('LAG Liquid')
ax1.tick_params(axis='x',rotation=45)

colour = '#e41a1c'
ax1.set_ylabel('1 Peak (signal at 13 $2 \\theta$)', color=colour)

# Add '1 Peak' scatter plot
for liquid in peak_values['LAG Liquid'].unique():
    i_peak_values = peak_values[peak_values['LAG Liquid'] == liquid]['1 Peak']
    ax1.scatter([liquid]*len(i_peak_values), i_peak_values, color=colour)

ax1.tick_params(axis='y',labelcolor=colour)
ax1.set_ylim([0,17])

ax2 = ax1.twinx()

colour = '#377eb8'
ax2.set_ylabel('Ge Peak (signal at 27 $2 \\theta$)', color=colour)

# Add 'Ge Peak' scatter plot
for liquid in peak_values['LAG Liquid'].unique():
    ge_peak_values = peak_values[peak_values['LAG Liquid'] == liquid]['Ge Peak']
    ax2.scatter([liquid]*len(ge_peak_values), ge_peak_values, color=colour)

ax2.tick_params(axis='y', labelcolor=colour)
ax2.set_ylim([0,17])

plt.show()

## PRACTICE QUESTION

Talk with your neighbours - what is the above showing us?



---



**some notes for you**



---



I don't know about you, but looking at repeated measurements makes me think we need some stats!

## Within-sample variation
For each sample, a variance can be calculated by using the **sample variance** (like sample standard deviation) we talked about in [Class 4](https://colab.research.google.com/drive/1KJVhs9Wmx7ff7NyPytj8p9v7u9Zmbhgx?usp=sharing).

$$s^2 = \frac{\sum \left( x_i -\bar{x}\right)^2}{n-1} $$

This tells us the variance associated with each LAG liquid. Some instrumental techniques are extremely repeatable, and some are not. If you don't know how repeatable your methods are, it's always a good idea to see if you can create replicates. Without replicates, it can be hard to know if values are really as different from each other as they look.

We can write the above as a simple function.

In [None]:
def within_sample_variance(x):
    x_bar = np.mean(x)  # mean of the sample
    diff_squared = (x - x_bar) ** 2  # squared difference from mean for each observation
    s_squared = np.sum(diff_squared) / (len(x) - 1)  # sample variance
    return s_squared
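
A quick sanity check you can do on a function like this: NumPy's built-in `np.var` gives the same number as long as you pass `ddof=1` (the default, `ddof=0`, divides by $n$ instead of $n-1$). The replicate values below are made up for illustration.

```python
import numpy as np

def within_sample_variance(x):
    x_bar = np.mean(x)  # mean of the sample
    return np.sum((x - x_bar) ** 2) / (len(x) - 1)  # sample variance

# Hypothetical replicate peak heights (made up for this check)
replicates = np.array([14.8, 15.3, 15.1])

ours = within_sample_variance(replicates)
numpys = np.var(replicates, ddof=1)  # ddof=1 -> divide by n - 1
print(ours, numpys)
```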

And test it for all our data.

In [None]:
# Calculate mean and variance for each liquid for each Peak
liquid_stats = peak_values.groupby('LAG Liquid').agg(
    {'1 Peak': ['mean', within_sample_variance],
     'Ge Peak': ['mean', within_sample_variance]}).reset_index()
liquid_stats

This did something fun! Check out the [`pd.DataFrame.agg`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) function to see how that worked.
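
If the output of that cell looked odd, here is a stripped-down sketch of the same pattern on a toy table (the data and the `value_range` helper are hypothetical, just for illustration). Passing a dictionary of lists to `.agg` after a `groupby` gives you one column per (original column, statistic) pair, and custom functions are labelled by their function name.

```python
import numpy as np
import pandas as pd

def value_range(x):
    # hypothetical custom statistic, for illustration only
    return np.max(x) - np.min(x)

toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'signal': [1.0, 3.0, 10.0, 14.0],
})

# One row per group, one column per (column, statistic) pair
stats = toy.groupby('group').agg({'signal': ['mean', value_range]}).reset_index()
print(stats)

# The columns are two-level, so you index them with a tuple
print(stats[('signal', 'value_range')].values)
```

That tuple-style indexing is worth remembering: it is how you pull individual statistics back out of a table like `liquid_stats`.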

Let's add those basic stats to our plot too.

# Adding some stats to our plot

In [None]:
# Improved version
fig, ax1 = plt.subplots(figsize=(10,5))

ax1.set_xlabel('LAG Liquid')
ax1.tick_params(axis='x', rotation=45)

colour = '#e41a1c'
ax1.set_ylabel('1 Peak (signal at 13 $2 \\theta$)', color=colour)

# For each unique liquid
for liquid in peak_values['LAG Liquid'].unique():
    # Plot scatter points
    i_peak_values = peak_values[peak_values['LAG Liquid'] == liquid]['1 Peak']
    ax1.scatter([liquid]*len(i_peak_values), i_peak_values, color=colour, alpha=0.4)

    # Plot mean and variance
    mean = liquid_stats.loc[liquid_stats['LAG Liquid'] == liquid, ('1 Peak', 'mean')]
    variance = liquid_stats.loc[liquid_stats['LAG Liquid'] == liquid, ('1 Peak', 'within_sample_variance')]
    ax1.errorbar(liquid, mean, yerr=np.sqrt(variance), color=colour, fmt='o')

ax1.tick_params(axis='y', labelcolor=colour)
ax1.set_ylim([0,17])

ax2 = ax1.twinx()

colour = '#377eb8'
ax2.set_ylabel('Ge Peak (signal at 27 $2 \\theta$)', color=colour)

for liquid in peak_values['LAG Liquid'].unique():
    ge_peak_values = peak_values[peak_values['LAG Liquid'] == liquid]['Ge Peak']
    ax2.scatter([liquid]*len(ge_peak_values), ge_peak_values, color=colour, alpha=0.4)

    mean = liquid_stats.loc[liquid_stats['LAG Liquid'] == liquid, ('Ge Peak', 'mean')]
    variance = liquid_stats.loc[liquid_stats['LAG Liquid'] == liquid, ('Ge Peak', 'within_sample_variance')]
    ax2.errorbar(liquid, mean, yerr=np.sqrt(variance), color=colour, fmt='o')

ax2.tick_params(axis='y', labelcolor=colour)
ax2.set_ylim([0,17])

plt.show()

In this case, it might be obvious to you that one of the liquids is significantly different than the others. We can be precise about it though.

# ANOVA

If you have the means and variances for each condition (LAG liquid), as we do, we can run hypothesis tests to determine if there are statistically significant differences between the groups. ANOVA (Analysis of Variance) is commonly used when dealing with more than two groups/conditions.

If using ANOVA, the null hypothesis is that all samples have the same mean, while the alternate hypothesis is that at least one group (LAG liquid) has a different mean. If the p-value is less than your chosen significance level (typically 0.05), you can reject the null hypothesis and conclude that the differences are statistically significant. A p-value is the probability of observing data at least as extreme as ours (ie. means that look at least this different), given that the null hypothesis is true (ie. that the groups actually have the same mean). We'll go more into them later.

Before running ANOVA, you need to ensure the assumptions of normality, equal variances (homoscedasticity), and independent observations are met. Post-hoc tests such as Tukey's test can be used after ANOVA to determine which group(s) is different.
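
We won't formally test those assumptions today, but for reference, here is a rough sketch of how you might check two of them with `scipy.stats`: Levene's test for equal variances and the Shapiro-Wilk test for normality. The data are randomly generated stand-ins, not our peak values, and with only three replicates per group (like we'll have) these tests have very little power, so treat them as a sanity check rather than proof.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical replicate measurements for three conditions
# (same spread, different means - made up for illustration)
groups = [rng.normal(loc=mu, scale=0.5, size=10) for mu in (2.0, 6.0, 15.0)]

# Levene's test: the null hypothesis is that all groups have equal variance
levene_stat, levene_p = stats.levene(*groups)

# Shapiro-Wilk test on each group: the null hypothesis is that the group is normally distributed
shapiro_p = [stats.shapiro(g).pvalue for g in groups]

print(f"Levene p = {levene_p:.3f}")
print("Shapiro-Wilk p-values:", np.round(shapiro_p, 3))
```

In both tests, a small p-value is the warning sign: it suggests the assumption does not hold.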

In this case, the variance between our replicates should come from the method (perhaps our milling consistency and, importantly, the PXRD precision). Since this is fake data, I made the replicates with equal noise! That lets us assume the variance is the same for each of these.

How to run an ANOVA?



## Step 1: Compute the Within Group Variance.

You've already done this with our `within_sample_variance()` function, which calculates the average squared difference of each observation from the mean of its sample.

In [None]:
variance_1 = liquid_stats[('1 Peak', 'within_sample_variance')].values
variance_Ge = liquid_stats[('Ge Peak', 'within_sample_variance')].values

## Step 2: Compute the Between Group Variance.

This involves comparing the mean of each sample to the mean of all observations.

$$s_{b}^{2} = \frac{\sum (n_j (\bar{x_{j}} - \bar{x})^{2})}{k - 1}$$

where:

- $s_{b}^{2}$ is the between-group variance
- $n_j$ is the size (number of observations) of the $j$th sample
- $\bar{x_{j}}$ is the mean of the $j$th sample
- $\bar{x}$ is the mean of all observations
- $k$ is the number of samples

Here's a function for that:


In [None]:
def between_sample_variance(sample_means, overall_mean, sample_sizes):
    diff_squared = (sample_means - overall_mean) ** 2  # squared difference between each sample mean and the overall mean
    s_squared = np.sum(diff_squared * sample_sizes) / (len(sample_means) - 1)  # sample variance; note that this is weighted by sample size
    return s_squared

To test it, we'll need a few more pieces of information about each sample:

In [None]:
group_sizes = peak_values.groupby('LAG Liquid').size().values
group_means_1 = liquid_stats[('1 Peak', 'mean')].values
group_means_Ge = liquid_stats[('Ge Peak', 'mean')].values

Now, calculate the overall means:

In [None]:
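# weight each group mean by its size so this is the mean of all observations
# (with equal group sizes, as here, this is the same as a plain average of the group means)
overall_mean_1 = np.average(group_means_1, weights=group_sizes)
overall_mean_Ge = np.average(group_means_Ge, weights=group_sizes)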

Then, we use these results to calculate the between-group variances:

In [None]:
between_variance_1 = between_sample_variance(group_means_1, overall_mean_1, group_sizes)
between_variance_Ge = between_sample_variance(group_means_Ge, overall_mean_Ge, group_sizes)

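Next, compute the pooled within-sample variance: a weighted average of the group variances, with each group weighted by its degrees of freedom ($n_j - 1$). With equal group sizes, as we have here, this is just the average of the group variances.

In [None]:
weighted_var_1 = np.sum(variance_1 * (group_sizes - 1)) / (np.sum(group_sizes) - len(group_sizes))
weighted_var_Ge = np.sum(variance_Ge * (group_sizes - 1)) / (np.sum(group_sizes) - len(group_sizes))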

## Step 3: Compute the F-statistic.

"F-statistic" is ANOVA jargon. It is the ratio of the between-sample variance to the within-sample variance.

In [None]:
def f_statistic(between_variance, within_variance):
    return between_variance / within_variance


The larger the F-statistic is, the more likely it is that the differences in sample means are not due to random chance.


In [None]:
f_stat_1 = f_statistic(between_variance_1, weighted_var_1)
f_stat_Ge = f_statistic(between_variance_Ge, weighted_var_Ge)
print("For Peak 1, F = ", str(f_stat_1), "For Peak Ge, F = ", str(f_stat_Ge))

These seem like big numbers! And larger F values suggest that the group means are significantly different (so our liquids have different performance in our synthesis). But we need to know more.

## Step 4: Obtain the F-critical value and P-value.

The F-critical value is looked up from the F-distribution table given the degrees of freedom and the significance level. These tables are all based on properties obtained from the normal distribution.

Degrees of freedom for the numerator (Between Group Variance) is k-1 where k is the number of groups. For the denominator (Within Group Variance), it is N-k, where N is the total number of observations.
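
You don't actually need a printed table for this. Given those two degrees of freedom and a significance level, `scipy.stats.f.ppf` (the inverse of the cumulative distribution) returns the critical value directly. A small sketch, assuming our setup of 6 LAG liquids with 3 replicates each and a significance level of 0.05:

```python
from scipy.stats import f

k = 6         # number of groups (LAG liquids)
N = 18        # total observations (6 liquids x 3 replicates)
alpha = 0.05  # chosen significance level

dfn = k - 1   # numerator (between-group) degrees of freedom
dfd = N - k   # denominator (within-group) degrees of freedom

# The value that an F-distributed statistic exceeds only alpha of the time
f_critical = f.ppf(1 - alpha, dfn, dfd)
print(f"F-critical({dfn}, {dfd}) at alpha = {alpha}: {f_critical:.2f}")
```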

If your calculated F-statistic is greater than the F-critical value, you reject the null hypothesis and conclude that there is a statistically significant difference. As for the p-value, it involves integration of the F-distribution. We won't go through that math today (because integration is another class), but we can run the stat through `scipy.stats`.

In [None]:
p_value_1 = f.sf(f_stat_1, len(group_means_1)-1, len(peak_values)-len(group_means_1))
p_value_Ge = f.sf(f_stat_Ge, len(group_means_Ge)-1, len(peak_values)-len(group_means_Ge))
print("For Peak 1, p = ", str(p_value_1), "For Peak Ge, p = ", str(p_value_Ge))

These are very small numbers! This tells us a statistically significant thing has occurred! Specifically, that not all LAG liquids are the same as far as this synthesis is concerned.
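
It is always worth checking hand-rolled statistics against a library implementation. Here is a sketch that does the whole ANOVA calculation by hand on a small made-up dataset and compares it with `scipy.stats.f_oneway`, which takes one array of observations per group. The numbers are hypothetical stand-ins, not our peak values.

```python
import numpy as np
from scipy import stats

# Hypothetical replicates for three conditions (made up for illustration)
groups = [
    np.array([14.6, 15.2, 15.4]),
    np.array([5.8, 6.3, 6.1]),
    np.array([2.1, 1.7, 2.4]),
]

k = len(groups)
sizes = np.array([len(g) for g in groups])
N = sizes.sum()
means = np.array([g.mean() for g in groups])
grand_mean = np.concatenate(groups).mean()

# Between-group and (pooled) within-group variances
between = np.sum(sizes * (means - grand_mean) ** 2) / (k - 1)
within = np.sum([(len(g) - 1) * g.var(ddof=1) for g in groups]) / (N - k)

f_manual = between / within
p_manual = stats.f.sf(f_manual, k - 1, N - k)

# The library version
f_scipy, p_scipy = stats.f_oneway(*groups)

print("F:", f_manual, f_scipy)
print("p:", p_manual, p_scipy)
```

To try this on our data, you would build one array per LAG liquid from `peak_values` and pass them all to `f_oneway`.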

But wait, which is the best?? To identify which sample mean is the most different, you usually perform a post-hoc test after ANOVA has found a significant result. A commonplace post-hoc test is Tukey's Honestly Significant Difference (HSD) test. I appreciate its name.

# Tukey's Honestly Significant Difference test

The basic idea of Tukey's honestly significant difference (HSD) test is to say which pairs of means are significantly different from each other.

Tukey's HSD calculates a range around each mean, and if the ranges of two means do not overlap, the test concludes that they are significantly different. The formula for Tukey's test statistic is:

$$ q = \frac{Y_i - Y_j}{\sqrt{MSE/n}} $$

where:
- $Y_i$ and $Y_j$ are the sample means of group i and j
- MSE is the mean squared error - the average of the squared difference between each observation and its group mean
- n is the sample size

The test statistic q has what is known as a 'studentized range distribution' (think, 'normal' distribution, again). It is used to determine the critical value, which is compared to the calculated difference of each pair of means to determine significance. Instead of writing out all of this math ourselves, we'll use a [function](https://www.statsmodels.org/devel/generated/statsmodels.stats.multicomp.pairwise_tukeyhsd.html) that does it for us.

In [None]:
# Execute Tukey's HSD
tukey = pairwise_tukeyhsd(peak_values['1 Peak'], peak_values['LAG Liquid'], alpha=0.05)

# Convert the summary to dataframe for easy comparison
tukey_1_peak_df = pd.DataFrame(data=tukey._results_table.data[1:], columns=tukey._results_table.data[0])
tukey_1_peak_df
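
For a feel of where the critical value comes from, here is a rough sketch using `scipy.stats.studentized_range` (available in SciPy 1.7 and later). It looks up the critical $q$ for 6 groups with 3 replicates each and turns it into a minimum difference between means. The MSE value is a made-up placeholder, not one calculated from our data.

```python
import numpy as np
from scipy.stats import studentized_range

k = 6                    # number of groups (LAG liquids)
n = 3                    # replicates per group
df_within = k * (n - 1)  # within-group degrees of freedom, N - k
alpha = 0.05
mse = 0.25               # hypothetical mean squared error (placeholder value)

# Critical value of the studentized range distribution
q_critical = studentized_range.ppf(1 - alpha, k, df_within)

# Smallest difference between two group means that counts as significant
hsd = q_critical * np.sqrt(mse / n)
print(f"q-critical = {q_critical:.2f}, HSD = {hsd:.2f}")
```

Any pair of means further apart than that HSD value would be flagged as significantly different.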

## PRACTICE QUESTION

What is 'meandiff' showing? What does 'reject' mean? Jump back to the plot (**Adding some stats to our plot**) and confirm these match your intuition from looking at the data.


---



**notes**



---



Could we say MeOH was a better LAG liquid than PhMe based on only the '1 Peak', for example? We need to be able to say if two groups (LAG liquids) are significantly different for both the 1 Peak and the Ge Peak. **Qualitatively, we are really just saying the averages are far enough apart that the spreads around them are not overlapping.** You can actually visually determine that 1:1 PhMe:H₂O is the right choice from the above plot. Sometimes, data points are closer together though, and that is not a given.

Try it out looking at the Ge Peak too (how much of our other reactant makes it into the product with the different liquids - ideally, neither reactant is significantly in our final product, remember).

## PRACTICE QUESTION

Confirm to yourself that 1:1 PhMe:H₂O really is the best LAG liquid (by checking the Ge peak too).



---



In [None]:
# you'll need to figure out which of the above pieces of code to include!



---



# Optimization of LAG additive volume.

Every part of the synthesis should be optimized!

Next, they looked at the volume of liquid added.


In [None]:
# Define the DataFrame columns
columns = ['LAG Liquid', 'LAG Volume (μL)', 'Pyridine (equiv)', 'Milling Time (mins)', 'Milling frequency (Hz)', 'Scale (mg)', 'Observed SM Phase', 'Observed Product Phase']

# Initialize an empty DataFrame
LAG_volume_df = pd.DataFrame(columns=columns)

# Fill in the DataFrame with your data
LAG_volume_df.loc[0] = ['1:1 PhMe:H₂O', 30, 2, 90, 25, 200, 'Ge. 1', '3'] # values for Entry (1)
LAG_volume_df.loc[1] = ['1:1 PhMe:H₂O', 60, 2, 90, 25, 200, 'Ge. 1', '3, 3a'] # values for Entry (2)
LAG_volume_df.loc[2] = ['1:1 PhMe:H₂O', 120, 2, 90, 25, 200, 'Ge. 1', '3, 3a'] # values for Entry (3)

# now let's call the dataframe to look at the table
LAG_volume_df

Let's make some fake data again based on what they provide in the supplement.

In [None]:
# Define parameters for the Gaussian peaks for each PXRD pattern: height, mean, width for each peak
all_params = [
    { 'params': [(1, 21, .1), (3, 25.5, .1), (15, 27, .1)], 'name': 'Ge', 'colour': '#377eb8', 'noise':.2},
    { 'params': [(1, 7, .1), (1, 12.2, .2), (15, 13, .1), (1, 17, .2), (1.5, 19, .2), (1, 22, .2), (1, 25, .2), (1, 30, .2)], 'name': '1', 'colour': '#e41a1c', 'noise':.3},
    { 'params': [(15, 8.5, .1), (6, 10, .4), (6, 13, .1), (2, 17, .2), (2, 19, .2), (2, 22, .2), (2, 25, .2), (2, 25.5, .1), (2, 27, .1), (2, 30, .2)], 'name': '30 µL', 'colour': '#4daf4a', 'noise':.5},
    {  'params': [(15, 8.5, .1), (10, 10, .4), (2, 13, .1), (2, 17, .2), (2, 19, .2), (2, 22, .2), (2, 25, .2), (2, 25.5, .1), (2, 27, .1), (2, 30, .2)], 'name': '60 µL', 'colour': '#4daf4a', 'noise':.5},
    {  'params': [(3, 8.5, .1), (5, 10, .4), (15, 13, .1), (2, 17, .2), (2, 19, .2), (2, 22, .2), (2, 25, .2), (2, 25.5, .1), (2, 27, .1), (2, 30, .2)], 'name': '120 µL', 'colour': '#4daf4a', 'noise':.5},
]

And they found:
> The resulting green/brown crude solids were analyzed by PXRD (fig S3). The optimal liquid additive volume was found to be 60 µL

How? Let's look at our version of their plot for this step:

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

# Generate and plot PXRD patterns for each parameter set
for i, param_dict in enumerate(all_params):

    # Generate noise and add it to the y-data
    np.random.seed(0)  # For reproducibility
    noise = np.random.normal(0, param_dict['noise'], theta.shape)

    # Generate the y-data for the PXRD pattern
    y = sum(gaussian(theta, height, mean, width) for height, mean, width in param_dict['params'])
    y += noise

    # Add the offset for easier visualization
    y -= i * 20

    # Plot the PXRD pattern
    ax.plot(theta, y, lw=3, color=param_dict['colour'])

    # Create a text label for the PXRD pattern
    ax.text(42, -i*20, param_dict['name'], verticalalignment='center', color=param_dict['colour'])

# Set the x-ticks to start at 4 and increment by 5
ax.set_xticks(range(4, 40, 5))

# Remove y-axis ticks
ax.yaxis.set_ticklabels([])

# Hide grid lines
ax.grid(False)

# Set the x-axis label
ax.set_xlabel('$2 \\theta$')

plt.show()

## PRACTICE QUESTION

How come 60 µL was the winner?



---



**This can be qualitative - you'll do a quantitative example on the homework**



---



# Optimization of Py equivalents.

Let's look at their next step.

> The resulting green/brown crude solids were analyzed by PXRD (fig S4). The optimal Py stoichiometry was found to be 2 equivalents

In [None]:
# Define the DataFrame columns
columns = ['LAG Liquid', 'LAG Volume (μL)', 'Pyridine (equiv)', 'Milling Time (mins)', 'Milling frequency (Hz)', 'Scale (mg)', 'Observed SM Phase', 'Observed Product Phase']

# Initialize an empty DataFrame (named for this step: pyridine equivalents)
py_equiv_df = pd.DataFrame(columns=columns)

# Fill in the DataFrame with your data
py_equiv_df.loc[0] = ['1:1 PhMe:H₂O', 60, 2, 90, 25, 200, 'Ge. 1', '3, 3a'] # values for Entry (1)
py_equiv_df.loc[1] = ['1:1 PhMe:H₂O', 60, 3, 90, 25, 200, 'Ge. 1*', '3'] # values for Entry (2)
py_equiv_df.loc[2] = ['1:1 PhMe:H₂O', 60, 4, 90, 25, 200, 'Ge. 1*', '3'] # values for Entry (3)

# now let's call the dataframe to look at the table
py_equiv_df

And a visualization:

In [None]:
# Define parameters for the Gaussian peaks for each PXRD pattern: height, mean, width for each peak
all_params = [
    { 'params': [(1, 21, .1), (3, 25.5, .1), (15, 27, .1)], 'name': 'Ge', 'colour': '#377eb8', 'noise':.2},
    { 'params': [(1, 7, .1), (1, 12.2, .2), (15, 13, .1), (1, 17, .2), (1.5, 19, .2), (1, 22, .2), (1, 25, .2), (1, 30, .2)], 'name': '1', 'colour': '#e41a1c', 'noise':.3},
    { 'params': [(15, 8.5, .1), (10, 10, .4), (2, 13, .1), (2, 17, .2), (2, 19, .2), (2, 22, .2), (2, 25, .2), (2, 25.5, .1), (2, 27, .1), (2, 30, .2)], 'name': 'Py (2 equiv)', 'colour': '#4daf4a', 'noise':.5},
    { 'params': [(15, 8.5, .1), (10, 10, .4), (2, 13, .1), (2, 17, .2), (2, 19, .2), (2, 22, .2), (2, 25, .2), (2, 25.5, .1), (2, 27, .1), (2, 30, .2)], 'name': 'Py (3 equiv)', 'colour': '#4daf4a', 'noise':.5},
    { 'params': [(15, 8.5, .1), (10, 10, .4), (2, 13, .1), (2, 17, .2), (2, 19, .2), (2, 22, .2), (2, 25, .2), (2, 25.5, .1), (2, 27, .1), (2, 30, .2)], 'name': 'Py (4 equiv)', 'colour': '#4daf4a', 'noise':.5},
]

Now that we've set up the fake peak parameters, let's plot them.

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

# Generate and plot PXRD patterns for each parameter set
for i, param_dict in enumerate(all_params):

    # Generate noise and add it to the y-data
    np.random.seed(0)  # For reproducibility
    noise = np.random.normal(0, param_dict['noise'], theta.shape)

    # Generate the y-data for the PXRD pattern
    y = sum(gaussian(theta, height, mean, width) for height, mean, width in param_dict['params'])
    y += noise

    # Add the offset for easier visualization
    y -= i * 20

    # Plot the PXRD pattern
    ax.plot(theta, y, lw=3, color=param_dict['colour'])

    # Create a text label for the PXRD pattern
    ax.text(42, -i*20, param_dict['name'], verticalalignment='center', color=param_dict['colour'])

# Set the x-ticks to start at 4 and increment by 5
ax.set_xticks(range(4, 40, 5))

# Remove y-axis ticks
ax.yaxis.set_ticklabels([])

# Hide grid lines
ax.grid(False)

# Set the x-axis label
ax.set_xlabel('$2 \\theta$')

plt.show()

## PRACTICE QUESTION

Hmm, how did they optimize this one? What was different about this step?



---



**notes**



---



# Optimization of milling time.

> The resulting green/brown crude solids were analyzed by PXRD (fig S5). The optimal milling time changes to 180 minutes for near quantitative conversion of starting materials as observed by PXRD

Alright, let's see how this goes:

In [None]:
# Define the DataFrame columns
columns = ['LAG Liquid', 'LAG Volume (μL)', 'Pyridine (equiv)', 'Milling Time (mins)', 'Milling frequency (Hz)', 'Scale (mg)', 'Observed SM Phase', 'Observed Product Phase']

# Initialize an empty DataFrame (named for this step: milling time)
milling_time_df = pd.DataFrame(columns=columns)

# Fill in the DataFrame with your data
milling_time_df.loc[0] = ['1:1 PhMe:H₂O', 60, 2, 30, 25, 200, 'Ge. 1', '3, 3a'] # values for Entry (1)
milling_time_df.loc[1] = ['1:1 PhMe:H₂O', 60, 2, 60, 25, 200, 'Ge. 1', '3, 3a'] # values for Entry (2)
milling_time_df.loc[2] = ['1:1 PhMe:H₂O', 60, 2, 90, 25, 200, 'Ge. 1', '3, 3a'] # values for Entry (3)
milling_time_df.loc[3] = ['1:1 PhMe:H₂O', 60, 2, 180, 25, 200, 'trace Ge. 1', '3a'] # values for Entry (4)

# now let's call the dataframe to look at the table
milling_time_df

And some fake data:

In [None]:
# Define parameters for the Gaussian peaks for each PXRD pattern: height, mean, width for each peak
all_params = [
    { 'params': [(1, 21, .1), (3, 25.5, .1), (15, 27, .1)], 'name': 'Ge', 'colour': '#377eb8', 'noise':.2},
    { 'params': [(1, 7, .1), (1, 12.2, .2), (15, 13, .1), (1, 17, .2), (1.5, 19, .2), (1, 22, .2), (1, 25, .2), (1, 30, .2)], 'name': '1', 'colour': '#e41a1c', 'noise':.3},
    { 'params': [(15, 8.5, .1), (10, 10, .4), (7, 13, .1), (2, 17, .2), (2, 19, .2), (2, 22, .2), (2, 25, .2), (2, 25.5, .1), (6, 27, .1), (2, 30, .2)], 'name': '30 mins', 'colour': '#4daf4a', 'noise':.5},
    { 'params': [(15, 8.5, .1), (10, 10, .4), (4, 13, .1), (2, 17, .2), (2, 19, .2), (2, 22, .2), (2, 25, .2), (2, 25.5, .1), (4, 27, .1), (2, 30, .2)], 'name': '60 mins', 'colour': '#4daf4a', 'noise':.5},
    { 'params': [(15, 8.5, .1), (10, 10, .4), (2, 13, .1), (2, 17, .2), (2, 19, .2), (2, 22, .2), (2, 25, .2), (2, 25.5, .1), (2, 27, .1), (2, 30, .2)], 'name': '90 mins', 'colour': '#4daf4a', 'noise':.5},
    { 'params': [(15, 8.5, .1), (10, 10, .4), (1, 13, .1), (2, 17, .2), (2, 19, .2), (2, 22, .2), (2, 25, .2), (2, 25.5, .1), (0.5, 27, .1), (1, 30, .2)], 'name': '180 mins', 'colour': '#4daf4a', 'noise':.5},
]

And the plot:

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

# Generate and plot PXRD patterns for each parameter set
for i, param_dict in enumerate(all_params):

    # Generate noise and add it to the y-data
    np.random.seed(0)  # For reproducibility
    noise = np.random.normal(0, param_dict['noise'], theta.shape)

    # Generate the y-data for the PXRD pattern
    y = sum(gaussian(theta, height, mean, width) for height, mean, width in param_dict['params'])
    y += noise

    # Add the offset for easier visualization
    y -= i * 20

    # Plot the PXRD pattern
    ax.plot(theta, y, lw=3, color=param_dict['colour'])

    # Create a text label for the PXRD pattern
    ax.text(42, -i*20, param_dict['name'], verticalalignment='center', color=param_dict['colour'])

# Set the x-ticks to start at 4 and increment by 5
ax.set_xticks(range(4, 40, 5))

# Remove y-axis ticks
ax.yaxis.set_ticklabels([])

# Hide grid lines
ax.grid(False)

# Set the x-axis label
ax.set_xlabel('$2 \\theta$')

plt.show()

Now, something important should be noticed here...



## PRACTICE QUESTION

180 minutes is the optimal milling time. What value of milling time was used for all of the above? Does this... matter?



---



**Notes**



---



Optimization is non-trivial. What if some other grouping of conditions - even some other LAG liquid - with a different milling time and Pyridine equivalent - would have actually been better?

This style of optimization - change one variable at a time, find the best value, then move on to the next - is quite common. But it can miss the true optimum when the variables interact, i.e. when the system has non-linearities (the most efficient set of conditions for one solvent may be very different for another).
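
To make that pitfall concrete, here is a minimal sketch using a made-up two-variable "yield" function with a strong interaction between the variables. The function `toy_yield`, its coefficients, and the six settings per variable are all assumptions for illustration, not data from the paper. It compares a single one-variable-at-a-time pass against checking every combination.

```python
import itertools

def toy_yield(a, b):
    """Hypothetical response: a diagonal ridge, so the best 'a' depends on 'b'."""
    return 1.0 - (a - b) ** 2 - 0.05 * (a + b - 8) ** 2

levels = range(6)  # six settings (0..5) for each variable

# One variable at a time: hold b at its starting value, pick the best a,
# then hold a at that value and pick the best b
b_start = 0
a_best = max(levels, key=lambda a: toy_yield(a, b_start))
b_best = max(levels, key=lambda b: toy_yield(a_best, b))
ovat_best = (a_best, b_best)

# Exhaustive search: evaluate all 36 combinations
grid_best = max(itertools.product(levels, levels), key=lambda ab: toy_yield(*ab))

print('one at a time:', ovat_best, 'yield =', toy_yield(*ovat_best))
print('full grid:    ', grid_best, 'yield =', toy_yield(*grid_best))
```

In this toy case the one-at-a-time pass never leaves its starting corner, while the full grid finds a much better combination further along the ridge. The price is 36 evaluations instead of 12, and that gap grows quickly with more variables.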

So how would we account for this? Do we need to search every possible combination of things? (That is a lot of things!) I bet you want to know the answer to this...

# Submit your notebook

But... It's time to download your notebook and submit it on Canvas. Go to the File menu and click **Download** -> **Download .ipynb**

Then, go to **Canvas** and **submit your assignment** on the assignment page. Once it is submitted, swing over to the homework now and start working through the paper.

# Another way
You can stop here and submit your notebook if you are not excited for more... BUT, if you want to see another way to optimize a reaction, read on. If you haven't done linear algebra yet, maybe also stop...

Two Mudd profs (Prof. Van Ryswyk and Prof. Van Hecke) wrote a nice, simple [paper](https://drive.google.com/file/d/1kJsliksMzu2i9wUqIg6w5w6zHI8dkOk9/view?usp=share_link) on experimental optimization for students. They showed a different - linear algebra based - way of optimizing a reaction that doesn't involve testing every single combination of things! Much of the below math comes from Prof. VR (and I translated it into Python).

We will walk (run?) through the optimization of the synthesis of acetylferrocene using the VR/VH scheme to optimize time, temperature, and mole ratio (in this example, they had three parameters to vary). This will make a lot more sense if you read the paper alongside it (it was too much for me to type out though!).
\begin{align}
\text{Fe}\left(\text{C}_5\text{H}_5\right)_2+\left(\text{CH}_3\text{CO}\right)_2\text{O} \rightarrow \left(\text{C}_5\text{H}_5\right)\text{Fe}\left(\text{C}_5\text{H}_4\right)\text{COCH}_3+\text{Fe}\left[\left(\text{C}_5\text{H}_4\right)\text{COCH}_3\right]_2 + ...
\end{align}

---
![https://kavassalis.space/s/ferrocene_scence.png](https://kavassalis.space/s/ferrocene_scence.png)


---


From Prof. VR:
> The experimental design aims to establish a response surface in the minimum number of experiments. Variables are encoded, and a central composite design is used to efficiently explore the response surface.

I wrote out the functions they used.

In [None]:
# encoding equations for dimensionless time, temperature, and mole ratio
def encode_time(t):
  '''t in seconds'''
  return (t - 120) / 90

def encode_temp(T):
  '''T in degrees Celsius'''
  return (T - 100) / 15

def encode_ratio(R):
  '''R as moles acetic anhydride : moles ferrocene'''
  return (R - 10) / 7

Let's make sure those functions work.

In [None]:
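# Example encoding
t_encoded = encode_time(130)  # Example time in seconds
T_encoded = encode_temp(105)  # Example temperature in °C
R_encoded = encode_ratio(12)  # Example mole ratio

# Print the encoded (dimensionless) values so we can check them
print(t_encoded, T_encoded, R_encoded)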
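
Each encoding is just a shift and a scale, so it can be inverted. Below is a minimal sketch of inverse ("decode") helpers plus a round-trip check. The `decode_*` names are made up for this sketch and are not from the paper; the encoding functions are copied from above so the cell runs on its own.

```python
# Encoding functions copied from above so this sketch is self-contained
def encode_time(t):
    return (t - 120) / 90

def encode_temp(T):
    return (T - 100) / 15

def encode_ratio(R):
    return (R - 10) / 7

# Hypothetical inverse helpers (names are assumptions, not from the paper)
def decode_time(x1):
    return x1 * 90 + 120   # back to seconds

def decode_temp(x2):
    return x2 * 15 + 100   # back to degrees Celsius

def decode_ratio(x3):
    return x3 * 7 + 10     # back to mol acetic anhydride : mol ferrocene

# Round trip: encode a real value, then decode it again
checks = [(encode_time, decode_time, 130),
          (encode_temp, decode_temp, 105),
          (encode_ratio, decode_ratio, 12)]
for encode, decode, value in checks:
    coded = encode(value)
    print(f'{value} -> {coded:.3f} -> {decode(coded):.1f}')
```

Reading the formulas this way, a coded value of 0 is the centre of the design (120 s, 100 °C, ratio 10) and a coded value of +1 is one scale unit above it (210 s, 115 °C, ratio 17).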

Okay, now, it starts to get a bit jargon heavy (if you are here, make sure you have the paper open too!).

## Central composite design matrix for three factors $(k=3)$.
> This is a full $2^3$ design with star augmentation along the principal axes. The second-order fitting equation with cross-terms is:
\begin{align}
y=b_0+b_1x_1 + b_2x_2 + b_3x_3 + b_4x_1^2 + b_5x_2^2 + b_6x_3^2 + b_7x_1x_2 + b_8x_1x_3 + b_9x_2x_3
\end{align}

In [None]:
# star augmentation
alpha = 1.2

# The design matrix
# row format: 1, x1, x2, x3, x1^2, x2^2, x3^2, x1x2, x1x3, x2x3
X = np.array([
    [1, -1, -1, -1, 1, 1, 1, 1, 1, 1],
    [1, -1, -1, 1, 1, 1, 1, 1, -1, -1],
    [1, -1, 1, -1, 1, 1, 1, -1, 1, -1],
    [1, -1, 1, 1, 1, 1, 1, -1, -1, 1],
    [1, 1, -1, -1, 1, 1, 1, -1, -1, 1],
    [1, 1, -1, 1, 1, 1, 1, -1, 1, -1],
    [1, 1, 1, -1, 1, 1, 1, 1, -1, -1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, -alpha, 0, 0, alpha**2, 0, 0, 0, 0, 0],
    [1, alpha, 0, 0, alpha**2, 0, 0, 0, 0, 0],
    [1, 0, -alpha, 0, 0, alpha**2, 0, 0, 0, 0],
    [1, 0, alpha, 0, 0, alpha**2, 0, 0, 0, 0],
    [1, 0, 0, -alpha, 0, 0, alpha**2, 0, 0, 0],
    [1, 0, 0, alpha, 0, 0, alpha**2, 0, 0, 0],
])


# yield result from the experiments
y = np.array([0, 0.032, 0.008, 0.048, 0.427, 0.215, 0.516, 0.281, 0.648, 0.168, 0.613, 0.453, 0.488, 0.499, 0.396])

# least squares estimate of model coefficients
beta = np.linalg.inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
print('β:', beta)
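
Typing a 15 × 10 matrix by hand is error prone. As a sketch, the same matrix can be generated from the list of coded design points, and the fit can be done with `np.linalg.lstsq`, which avoids explicitly inverting $X^TX$. The helper name `model_matrix` and the variable names below are made up for this sketch; the run order is assumed to match the hand-typed matrix.

```python
import itertools
import numpy as np

alpha = 1.2  # star-point distance, as above

# Coded design points: 2^3 factorial corners, then the centre, then the star points
corners = list(itertools.product([-1, 1], repeat=3))
centre = [(0, 0, 0)]
stars = []
for axis in range(3):
    for sign in (-1, 1):
        point = [0, 0, 0]
        point[axis] = sign * alpha
        stars.append(tuple(point))
design_points = np.array(corners + centre + stars, dtype=float)

def model_matrix(points):
    """Columns: 1, x1, x2, x3, x1^2, x2^2, x3^2, x1x2, x1x3, x2x3."""
    x1, x2, x3 = points.T
    return np.column_stack([
        np.ones(len(points)), x1, x2, x3,
        x1**2, x2**2, x3**2,
        x1 * x2, x1 * x3, x2 * x3,
    ])

X_built = model_matrix(design_points)

# Yields in the same run order as the design points (copied from above)
y_obs = np.array([0, 0.032, 0.008, 0.048, 0.427, 0.215, 0.516, 0.281,
                  0.648, 0.168, 0.613, 0.453, 0.488, 0.499, 0.396])

# Least-squares fit without forming an explicit matrix inverse
beta_lstsq, *_ = np.linalg.lstsq(X_built, y_obs, rcond=None)
print('β (lstsq):', beta_lstsq)
```

Both routes solve the same least-squares problem, so the coefficients should agree with the normal-equations result to within rounding.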

That sure seems like a lot to take in! Let's make a plot that maybe looks more straightforward (we like residuals, right?)

In [None]:
# residuals
residuals = y - X.dot(beta)

# Plot residuals
plt.plot(residuals, linestyle='', marker='o')
plt.xlabel('run #')
plt.ylabel('y observed - y predicted')
plt.title('Residuals')
plt.show()

How does this let us optimize things? This looks like scattered runs with varying residual values?

Back to the paper:

> Decompose β in preparation for eigenvalue and stationary point calculation. Vector b contains the first-order coefficients b1, b2, and b3. Matrix B has the coefficients for the squared terms, b4 through b6, on the diagonal and symmetrical cross terms/2 derived from b7 through b9.

We can write that out:

In [None]:
# decomposition of β
b = np.array([0.173, 0.0203, -0.0458])
B = np.array([[-0.187, 0.00819, -0.0324],
              [0.00819, -0.131, -0.000938],
              [-0.0324, -0.000938, -0.147]])
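
The numbers in `b` and `B` above are typed in by hand. Here is a minimal sketch of building them from a coefficient vector instead, assuming the coefficient order $b_0 \dots b_9$ from the fitting equation. The function name `decompose_beta` is made up for this sketch, and β is refit from the same design and yields so the cell runs on its own.

```python
import itertools
import numpy as np

def decompose_beta(beta):
    """Split beta = [b0, b1..b3, b4..b6, b7, b8, b9] into (b0, b, B)."""
    b0 = beta[0]
    b = np.asarray(beta[1:4], dtype=float)      # first-order coefficients
    B = np.diag(beta[4:7]).astype(float)        # squared terms on the diagonal
    # Each cross term is split equally between the two symmetric entries
    B[0, 1] = B[1, 0] = beta[7] / 2  # x1*x2
    B[0, 2] = B[2, 0] = beta[8] / 2  # x1*x3
    B[1, 2] = B[2, 1] = beta[9] / 2  # x2*x3
    return b0, b, B

# Refit beta from the central composite design (same run order as above)
alpha = 1.2
pts = [list(p) for p in itertools.product([-1, 1], repeat=3)] + [[0, 0, 0]]
for axis in range(3):
    for sign in (-1, 1):
        star = [0, 0, 0]
        star[axis] = sign * alpha
        pts.append(star)
x1, x2, x3 = np.array(pts, dtype=float).T
X_fit = np.column_stack([np.ones(len(pts)), x1, x2, x3,
                         x1**2, x2**2, x3**2, x1*x2, x1*x3, x2*x3])
y_fit = np.array([0, 0.032, 0.008, 0.048, 0.427, 0.215, 0.516, 0.281,
                  0.648, 0.168, 0.613, 0.453, 0.488, 0.499, 0.396])
beta_fit, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)

b0_fit, b_fit, B_fit = decompose_beta(beta_fit)
print('b:', b_fit)
print('B:\n', B_fit)
```

The printed values should match the hand-typed `b` and `B` to about three significant figures.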


> The eigenvectors of B describe a linear combination of our time, temperature, and mole ratio axes such that the origin is now at the stationary point

(again, this will be a lot of jargon and a very fast pace if this math isn't comfortable - that is okay - this is just for flavour).

> The eigenvalues of B tell us if the stationary point is a maxima, minima, or saddle point with respect to yield. If we have found a maxima, then the eigenvalues will all be negative. If the signs of the eigenvalues differ, then we have found a saddle point and we should proceed in the direction of the eigenvector with the largest positive eigenvalue in order to maximize our yield.

In [None]:
# canonical analysis
w, v = np.linalg.eig(B)  # w: eigenvalues, v: eigenvectors (one per column)
print('eigenvalues λ:', w)
print('eigenvectors (columns):', v)
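
The sign rule in the quote can be turned into a small check. This is a sketch only: the function name `classify_stationary_point` and the tolerance are assumptions, and it uses `np.linalg.eigvalsh`, which is intended for symmetric matrices like `B`.

```python
import numpy as np

# B copied from the decomposition above
B = np.array([[-0.187, 0.00819, -0.0324],
              [0.00819, -0.131, -0.000938],
              [-0.0324, -0.000938, -0.147]])

def classify_stationary_point(B, tol=1e-12):
    """Classify the stationary point of a quadratic from the eigenvalues of B."""
    eigenvalues = np.linalg.eigvalsh(B)  # real eigenvalues of a symmetric matrix
    if np.all(eigenvalues < -tol):
        return 'maximum'        # curves downward in every direction
    if np.all(eigenvalues > tol):
        return 'minimum'        # curves upward in every direction
    return 'saddle point (or flat direction)'

print(classify_stationary_point(B))
```

For this `B` every eigenvalue is negative, so the fitted surface has a maximum at its stationary point.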

Let's calculate the stationary point

In [None]:
# stationary point
xs = -np.linalg.inv(B).dot(b) / 2
print('xs:', xs)
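
As a sanity check, the same stationary point can be found with `np.linalg.solve` (no explicit inverse), and we can confirm that the gradient of the fitted quadratic, $\mathbf{b} + 2\mathbf{B}\mathbf{x}$, really is zero there. A minimal sketch, with `b` and `B` copied from above and new variable names chosen for this example:

```python
import numpy as np

b = np.array([0.173, 0.0203, -0.0458])
B = np.array([[-0.187, 0.00819, -0.0324],
              [0.00819, -0.131, -0.000938],
              [-0.0324, -0.000938, -0.147]])

# Solve 2 B x = -b directly instead of forming the inverse of B
xs_solve = np.linalg.solve(2 * B, -b)

# Gradient of the quadratic model evaluated at the stationary point
gradient_at_xs = b + 2 * B @ xs_solve

print('xs:', xs_solve)
print('gradient at xs:', gradient_at_xs)
```

The gradient should print as zeros to within floating-point rounding.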

The yield of acetylferrocene (y) can be modeled as follows:

$$
y = f(t, T, R) + \epsilon
$$

where $\epsilon$ represents the residual error in each run.

Next, calculate the yield at the stationary point and express it as a percent:

In [None]:
ys = beta[0] + 0.5*np.dot(xs, b)

ys_percentage = 100 * ys  # Convert ys to a percentage
formatted_yield = f"{ys_percentage:.2f}% yield"  # Format to 2 decimal places and add unit

print(formatted_yield)
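
For reference, here is a short derivation of where the stationary point and yield formulas come from. It assumes the fitted model is written in matrix form with the $\mathbf{b}$ and $\mathbf{B}$ defined earlier, $b_0$ = `beta[0]`, and $\mathbf{B}$ symmetric:

```latex
\begin{align}
y(\mathbf{x}) &= b_0 + \mathbf{x}^T\mathbf{b} + \mathbf{x}^T\mathbf{B}\,\mathbf{x} \\
\nabla y &= \mathbf{b} + 2\mathbf{B}\,\mathbf{x} = 0
  \quad\Rightarrow\quad \mathbf{x}_s = -\tfrac{1}{2}\,\mathbf{B}^{-1}\mathbf{b} \\
y_s &= b_0 + \mathbf{x}_s^T\mathbf{b} + \mathbf{x}_s^T\mathbf{B}\,\mathbf{x}_s
     = b_0 + \mathbf{x}_s^T\mathbf{b} - \tfrac{1}{2}\,\mathbf{x}_s^T\mathbf{b}
     = b_0 + \tfrac{1}{2}\,\mathbf{x}_s^T\mathbf{b}
\end{align}
```

The last line uses $\mathbf{B}\,\mathbf{x}_s = -\tfrac{1}{2}\mathbf{b}$, which is why the code only needs `beta[0]`, `xs`, and `b`.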

And for time:

In [None]:
# xs[0] holds the stationary point value for x1 (encoded time)
time_opt = xs[0] * 90 + 120  # Invert the time encoding to get back to seconds

# Format to 3 decimal places (note: decimal places, not significant figures)
# and add the unit
formatted_time = f"{time_opt:.3f} s"

print(formatted_time)

And temperature:

In [None]:
temp_opt = xs[1] * 15 + 100  # Apply the formula to calculate optimal temperature

# Format for display with 3 decimal places and unit °C
formatted_temp = f"{temp_opt:.3f} °C"

print(formatted_temp)

And mole ratio:

In [None]:
mole_ratio_opt = xs[2] * 7 + 10  # Apply the provided formula

# Format to 2 decimal places and add a description of the ratio
formatted_mole_ratio = f"{mole_ratio_opt:.2f} mole ratio acetic anhydride:ferrocene"

print(formatted_mole_ratio)

Let's now visualize the results as a 3D heat map. First, we need to set up the experimental data.

In [None]:
# 4D data set consisting of x1, x2, x3, yield
AFc = [
    (-1, -1, -1, 0.0),
    (-1, -1, 1, 0.032),
    (-1, 1, -1, 0.008),
    (-1, 1, 1, 0.048),
    (1, -1, -1, 0.427),
    (1, -1, 1, 0.215),
    (1, 1, -1, 0.516),
    (1, 1, 1, 0.281),
    (0, 0, 0, 0.648),
    (-1.2, 0, 0, 0.168),
    (1.2, 0, 0, 0.613),
    (0, -1.2, 0, 0.453),
    (0, 1.2, 0, 0.488),
    (0, 0, -1.2, 0.499),
    (0, 0, 1.2, 0.396)
]

AFc_np = np.array(AFc)

Now we can make a plot:

In [None]:
# Extracting x1 (time), x2 (temp), x3 (mole ratio), and yields
x1 = AFc_np[:,0]
x2 = AFc_np[:,1]
x3 = AFc_np[:,2]
yields = AFc_np[:,3]

# Setup for 3D scatter plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Scatter plot with colormap
img = ax.scatter(x1, x2, x3, c=yields, cmap=cmap_temp)

# Colour bar indicating yield values
cbar = fig.colorbar(img, ax=ax, shrink=0.8, pad=0.1)
cbar.set_label('Yield')

# Setting labels
ax.set_xlabel('Scaled Time')
ax.set_ylabel('Scaled Temp')
ax.set_zlabel('Scaled Mol Ratio')

# Plot title
plt.title('Optimization of Acetylferrocene Synthesis')

plt.show()

Okay, so we want yield to be a maximum, and have a 3-variable parameter space. You can see the experiments that have been done: rather than filling a dense grid, they sample the parameter space at a small set of spread-out points (the cube corners, centre, and star points of the central composite design above). This is good! We'll talk more later on how to set up this kind of sampling.

It's kind of hard to look at this plot and say where the optimized values occur though. Let's interpolate things to make something a bit easier to look at.

In [None]:
grid_x, grid_y, grid_z = np.mgrid[
    AFc_np[:,0].min():AFc_np[:,0].max():100j,
    AFc_np[:,1].min():AFc_np[:,1].max():100j,
    AFc_np[:,2].min():AFc_np[:,2].max():100j
]

points = AFc_np[:, :3]
values = AFc_np[:, 3]

grid_yields = griddata(points, values, (grid_x, grid_y, grid_z), method='linear')
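
One thing worth knowing about `griddata` with `method='linear'`: it only interpolates inside the convex hull of the measured points, and by default returns `NaN` anywhere outside it (this is set by its `fill_value` argument). A tiny 2D sketch with made-up points, where the "measured" value is simply x + y:

```python
import numpy as np
from scipy.interpolate import griddata

# Four made-up measurement locations on the corners of a unit square
demo_points = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
demo_values = demo_points[:, 0] + demo_points[:, 1]  # value = x + y

# One query inside the square, one well outside it
inside = griddata(demo_points, demo_values, np.array([[0.25, 0.5]]), method='linear')
outside = griddata(demo_points, demo_values, np.array([[2.0, 2.0]]), method='linear')

print('inside the hull: ', inside)
print('outside the hull:', outside)
```

This matters for our data because a cube-shaped grid has corners that lie outside the convex hull of the design points, so some grid cells come back as `NaN` and are simply not drawn. Use NaN-aware functions such as `np.nanmin` and `np.nanmax` when summarizing the interpolated values.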

Now let's plot the interpolated data:

In [None]:
# Define the grid where we want to interpolate.
grid_x, grid_y, grid_z = np.mgrid[-2:2:100j, -2:2:100j, -2:2:100j]

# Perform the 3D interpolation.
grid = griddata(points, values, (grid_x, grid_y, grid_z), method='linear')

# Now create the 3D scatter plot.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

scatter = ax.scatter(grid_x, grid_y, grid_z, c=grid.ravel(), cmap=cmap_temp)

# Colour bar indicating yield values
cbar = fig.colorbar(scatter, ax=ax, shrink=0.8, pad=0.1)
cbar.set_label('Yield')

# Customizing the plot
ax.set_xlabel('Scaled time')
ax.set_ylabel('Scaled temp')
ax.set_zlabel('Scaled mol ratio')
ax.set_title('Optimization of acetylferrocene synthesis')

plt.show()

Now that helps us see what ranges of parameters might have optimal yield.

It would be nice to zoom in though...

Look!

In [None]:
# Separate the coordinates (x1, x2, x3) from the yield values
points = AFc_np[:, :3]   # x1, x2, x3 coordinates
values = AFc_np[:, 3]    # yields

# Define the grid where we want to interpolate
grid_x, grid_y, grid_z = np.mgrid[
    AFc_np[:,0].min():AFc_np[:,0].max():50j,
    AFc_np[:,1].min():AFc_np[:,1].max():50j,
    AFc_np[:,2].min():AFc_np[:,2].max():50j
]

# Perform the 3D interpolation
grid_values = griddata(points, values, (grid_x, grid_y, grid_z), method='linear')

# Compute global min and max of the yield
min_yield = np.nanmin(AFc_np[:, 3])
max_yield = np.nanmax(AFc_np[:, 3])

fig = go.Figure()

# Add invisible trace with global min yield
fig.add_trace(go.Volume(
    x=[AFc_np[0, 0]], y=[AFc_np[0, 1]], z=[AFc_np[0, 2]],
    value=[min_yield], opacity=0.))

# Add invisible trace with global max yield
fig.add_trace(go.Volume(
    x=[AFc_np[0, 0]], y=[AFc_np[0, 1]], z=[AFc_np[0, 2]],
    value=[max_yield], opacity=0.))

# Add your main volume plot
fig.add_trace(go.Volume(
    x=grid_x.flatten(),
    y=grid_y.flatten(),
    z=grid_z.flatten(),
    value=grid_values.flatten(),
    isomin=np.nanmin(grid_values),  # NaN-aware: grid points outside the data's convex hull are NaN
    isomax=np.nanmax(grid_values),
    opacity=0.1,
    surface_count=16,
    colorscale=colourscale))

fig.show()

Okay, that doesn't actually seem as helpful as I wanted (but it can be optimized more). There are several different ways to visualize this kind of space.

We can just look at slices of the observations:

In [None]:
# Create masks for specific slices (you can adjust these as needed)
t_mask = np.abs(AFc_np[:, 0] - 0) < 1e-3  # slice at scaled time = 0
temp_mask = np.abs(AFc_np[:, 1] - 0) < 1e-3  # slice at scaled temp = 0
mol_mask = np.abs(AFc_np[:, 2] - 0) < 1e-3  # slice at scaled mol ratio = 0

# Compute global min and max of the yield
# This way, the same color scale will be used for all of your contour plots.
min_yield = np.nanmin(AFc_np[:, 3])
max_yield = np.nanmax(AFc_np[:, 3])


# Draw the slice plots
fig, ax = plt.subplots(1, 3, figsize=(10, 5))

# Yield vs mol_ratio and time (temp = 0)
ax[0].scatter(AFc_np[temp_mask][:, 2], AFc_np[temp_mask][:, 0], c=AFc_np[temp_mask][:, 3], cmap=cmap_temp, vmin=min_yield, vmax=max_yield)
ax[0].set_xlabel('Scaled Mol Ratio')
ax[0].set_ylabel('Scaled Time')
ax[0].set_title('Scaled Temp = 0')
ax[0].grid(True)

# Yield vs time and temp (mol_ratio = 0)
ax[1].scatter(AFc_np[mol_mask][:, 0], AFc_np[mol_mask][:, 1], c=AFc_np[mol_mask][:, 3], cmap=cmap_temp, vmin=min_yield, vmax=max_yield)
ax[1].set_xlabel('Scaled Time')
ax[1].set_ylabel('Scaled Temp')
ax[1].set_title('Scaled Mol Ratio = 0')
ax[1].grid(True)

# Yield vs temp and mol_ratio (time = 0)
ax[2].scatter(AFc_np[t_mask][:, 1], AFc_np[t_mask][:, 2], c=AFc_np[t_mask][:, 3], cmap=cmap_temp, vmin=min_yield, vmax=max_yield)
ax[2].set_xlabel('Scaled Temp')
ax[2].set_ylabel('Scaled Mol Ratio')
ax[2].set_title('Scaled Time = 0')
ax[2].grid(True)

plt.tight_layout()
plt.show()

Or we can look at slices of the filled versions (note, I have sliced it differently below):

In [None]:
# Initialize the grid points
grid_x = np.linspace(AFc_np[:,0].min(), AFc_np[:,0].max(), num=50)
grid_y = np.linspace(AFc_np[:,1].min(), AFc_np[:,1].max(), num=50)

# Initialize for the three slices
slices = [-1, 0, 1]

# Making a subplot with 1 row and 3 columns
fig, axs = plt.subplots(1, 3, figsize=(18, 6), sharey=True)

contour_plots = []

# Compute global min and max of the yield
# This way, the same color scale will be used for all of your contour plots.
min_yield = np.nanmin(AFc_np[:, 3])
max_yield = np.nanmax(AFc_np[:, 3])

# Loop over slices and create a contour plot for each slice
for i, slc in enumerate(slices):
    # Create a mask for the slice
    mask = np.isclose(AFc_np[:,2], slc, rtol=1e-5)

    # Apply the mask to get the points and values for this slice
    points_slc = AFc_np[mask, :2]   # x1, x2 coordinates
    values_slc = AFc_np[mask, 3]    # yields

    # Create a grid for the slice
    grid_slc_x, grid_slc_y = np.meshgrid(grid_x, grid_y)

    # Interpolate the values for the grid
    grid_slc_z = griddata(points_slc, values_slc, (grid_slc_x, grid_slc_y), method='linear')

    # Plot the Contour for this slice
    contour = axs[i].contourf(grid_slc_x, grid_slc_y, grid_slc_z, levels=20, cmap=cmap_temp, vmin=min_yield, vmax=max_yield)
    contour_plots.append(contour)
    axs[i].set_title(f'Slice at scaled mol ratio = {slc}')
    axs[i].set_xlabel('Scaled Time')         # x-axis comes from column 0 (x1, time)
    axs[i].set_ylabel('Scaled Temperature')  # y-axis comes from column 1 (x2, temp)

# Adding color bar
fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7]) # Position of colorbar [left, bottom, width, height]
fig.colorbar(contour_plots[1], cax=cbar_ax, label='Yield')

plt.show()

Now, does this tell us the exact correct parameters to use? Not necessarily, but it does tell us where we might want to conduct extra experiments to confirm we are using the optimal parameters. We can see what regions of our parameter space are unlikely to be worth looking at further and what regions are worth optimizing in.
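
Another way to see where the promising region is: evaluate the fitted quadratic model on a grid over the explored (coded) region and look for the largest predicted yield, in the same brute-force spirit as last week's curve fit. This is a sketch under assumptions: it evaluates the model, not new experiments, the grid spacing of 0.05 is an arbitrary choice, and the constant $b_0$ is left out because adding a constant does not move the location of the maximum.

```python
import numpy as np

# b and B copied from the decomposition above
b = np.array([0.173, 0.0203, -0.0458])
B = np.array([[-0.187, 0.00819, -0.0324],
              [0.00819, -0.131, -0.000938],
              [-0.0324, -0.000938, -0.147]])

# Grid over the coded region covered by the design (-1.2 to 1.2 on each axis)
axis = np.linspace(-1.2, 1.2, 49)  # spacing of 0.05
g1, g2, g3 = np.meshgrid(axis, axis, axis, indexing='ij')
grid_points = np.column_stack([g1.ravel(), g2.ravel(), g3.ravel()])

# Model yield relative to b0: x.b + x.B.x for every grid point
linear_part = grid_points @ b
quadratic_part = np.einsum('ij,jk,ik->i', grid_points, B, grid_points)
relative_yield = linear_part + quadratic_part

best_on_grid = grid_points[np.argmax(relative_yield)]
xs_check = np.linalg.solve(2 * B, -b)  # analytical stationary point

print('best grid point (coded): ', best_on_grid)
print('stationary point (coded):', xs_check)
```

The two answers should agree to within about one grid step. The grid needs over 100,000 model evaluations to get there, while the linear algebra route needs a single solve.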

Could you apply this framework to the Germanium paper?

The optimization scheme done above is extremely general, and has applications across STEM. Visualizations for higher order parameter spaces get tricky though - this example had three variables, so we could make 3D plots or 2D slices. With more variables than that, plotting the whole space at once is no longer straightforward.