# Project 3

## Examining the geometry of molecules

Note that this project does not require any specialist knowledge of Chemistry. It is about simple geometry and tests your data analysis skills. A key skill of data scientists is to apply their data analysis techniques to unfamiliar situations.

Molecular geometry is the three-dimensional arrangement of atoms within a molecule. This geometry is often described by considering the lengths and angles of the bonds within the system. For a central atom with four bonds to other atoms or molecules, two of the three-dimensional shapes the system could take are:

Square Planar

![Square planar](Square-planar-shape-with-angle.png)

Tetrahedral

![Tetrahedral](Tetrahedral-shape-with-angle.png)

Both the Square Planar and Tetrahedral shapes are symmetric under rotation. In an ideal system:
 - For the square planar shape, the angle between adjacent bonds (in the plane) is $90^{\mathrm{o}}$
 - For the tetrahedral shape, the angle between any two bonds is $109.5^{\mathrm{o}}$
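
The $109.5^{\mathrm{o}}$ value is the rounded form of $\arccos(-1/3)$, the angle subtended at the centre of a regular tetrahedron by any two of its vertices. A minimal sketch to check this numerically is given below. The vertex coordinates (alternating corners of a cube) are an illustrative choice for this example and are not taken from the data files.

```python
import itertools
import math

# Alternating corners of a cube centred on the origin form a regular tetrahedron.
# (Illustrative coordinates only, not part of the project data.)
vertices = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]


def angle_between(u, v):
    """Return the angle in degrees between two bond vectors from the central atom."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(dot / (norm_u * norm_v)))


# Angle for every pair of bonds (4 bonds give 6 distinct pairs)
tetrahedral_angles = [
    angle_between(u, v) for u, v in itertools.combinations(vertices, 2)
]
print([round(angle, 2) for angle in tetrahedral_angles])
```

Every pair of bonds gives the same angle, which is what makes the ideal tetrahedral case easy to compare against.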

### Data files

For this project, we have provided two data files containing the measured bond angles for different systems with a central atom and four surrounding bonds (to other atoms or molecules).
1. With Rhodium (Rh) as the central atom
2. With Cobalt (Co) as the central atom

In both cases, we expect the majority of these systems to be either Square Planar (angles of $90^{\mathrm{o}}$) or Tetrahedral (angles of $109.5^{\mathrm{o}}$) in shape.

First 5 lines of "Rh_CN4-molecules.csv"
```
# Missing entries are shown as -999 value.
Query,Refcode,ANG1,ANG2,ANG3,ANG4,R-factor,Study Temp.
1,ABANEX,86.286,91.877,91.402,90.744,4.110,150
1,ABEJIA,98.505,100.045,83.356,79.567,4.160,120
1,ABEJOG,100.686,99.261,78.550,82.795,3.910,120
```

First 5 lines of "Co_CN4-molecules.csv"
```
# Missing entries are shown as -999 value.
Query,Refcode,ANG1,ANG2,ANG3,ANG4,R-factor,Study Temp.
2,ABEBUG,107.775,104.241,106.684,115.986,2.920,150
2,ABECER,119.168,107.584,103.463,105.580,4.860,150
2,ABECIV,109.473,104.126,109.126,120.683,3.330,150
```

Column details:
 - "Query" - Database query number (*can be ignored*)
 - "Refcode" - Reference code for the molecular structure
 - "ANG1", "ANG2", "ANG3", "ANG4" - Values in degrees for each of the four bond angles
 - "R-factor" - Reliability factor. This is a measure of the quality of the data where a lower value is better
 - "Study Temp." - Temperature used when completing the measurements.

In [4]:
filename_Rh = "../csv/molecules/Rh_CN4-molecules.csv"
filename_Co = "../csv/molecules/Co_CN4-molecules.csv"

---

### Programming project

The aim of this project is to see how well these two different molecular shapes (Square Planar and Tetrahedral) can be distinguished within this data.

Start by reading in the data from `filename_Rh` and `filename_Co` and then create the following outputs:

- The mean and standard deviation of the first bond angle ("ANG1") for both data sets
   - Compare these values to the expected angles for the ideal case ($90^{\mathrm{o}}$ for the Square Planar shape and $109.5^{\mathrm{o}}$ for the Tetrahedral shape).
- For each entry, calculate the average bond angle across all four of the bonds ("ANG1", "ANG2", "ANG3", "ANG4") (i.e. total divided by 4)
- From these data sets, create at least one plot to compare the distribution of these bond angles

Based on the analysis above, comment on which of the two shapes discussed is the more likely for the molecules with the following "Refcode" values:
   - "YIQQUK" (Rh)
   - "ZOJPUJ01" (Co)
   - "ICAYES" (Co)
   - "ZABVIK" (Co)
   
*Make sure to consider the following*
 - How the data files are laid out and how best to read the data
 - How to handle any missing data
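
Both points can be tried out on a small in-memory sample before touching the real files. The sketch below uses the three Rh rows shown above plus one made-up row (`XXXXXX`, a hypothetical entry added only to demonstrate the `-999` marker). It assumes the `#` comment appears only as the first line of the file.

```python
import io

import pandas as pd

# Three rows copied from the Rh excerpt, plus one hypothetical row with missing entries
sample_csv = """# Missing entries are shown as -999 value.
Query,Refcode,ANG1,ANG2,ANG3,ANG4,R-factor,Study Temp.
1,ABANEX,86.286,91.877,91.402,90.744,4.110,150
1,ABEJIA,98.505,100.045,83.356,79.567,4.160,120
1,ABEJOG,100.686,99.261,78.550,82.795,3.910,120
1,XXXXXX,90.100,-999,89.900,90.000,5.000,-999
"""

df_sample = pd.read_csv(
    io.StringIO(sample_csv),
    skiprows=1,           # skip the leading comment line
    na_values="-999",     # read the missing-data marker as NaN
    index_col="Refcode",
)

angle_columns = ["ANG1", "ANG2", "ANG3", "ANG4"]

# Count missing values per angle column, then keep only rows with all four angles
print(df_sample[angle_columns].isna().sum())
df_complete = df_sample.dropna(subset=angle_columns)
print(df_complete.index.tolist())
```

`skiprows=1` drops the first line by position. `read_csv` also has a `comment="#"` option, which ignores everything after a `#` character; that is only safe if no data field contains `#`.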

#### Approach

Consider the overall approach, layout and naming conventions used in your code. Make sure to explain what you are doing at each stage (add comments and/or additional markdown cells). Consider what you could add to the analysis described above to build upon your conclusions.

For your plot (or plots), consider how best to present the output. Consider how this would look as a standalone product without the context of the code.
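
As one possible illustration of a plot that reads on its own, the sketch below adds a title, labels on both axes, a legend and reference lines at the two ideal angles. The data are synthetic random numbers generated purely for the example (they are not the project measurements), and the labels and bin range are assumptions you would adapt to your own analysis.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data, NOT the project measurements
rng = np.random.default_rng(seed=0)
synthetic_a = rng.normal(loc=90.0, scale=2.0, size=200)
synthetic_b = rng.normal(loc=109.5, scale=4.0, size=200)

# Shared bin edges (1 degree wide) so both histograms are directly comparable
shared_bins = np.linspace(80.0, 125.0, 46)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(synthetic_a, bins=shared_bins, histtype="step", density=True,
        label="Sample A (synthetic)")
ax.hist(synthetic_b, bins=shared_bins, histtype="step", density=True,
        label="Sample B (synthetic)")

# Reference lines at the ideal angles give the reader context without the code
ax.axvline(90.0, color="grey", linestyle="--", label="Ideal square planar")
ax.axvline(109.5, color="grey", linestyle=":", label="Ideal tetrahedral")

ax.set_xlabel(r"Mean bond angle ($^\circ$)")
ax.set_ylabel("Probability density")
ax.set_title("Distribution of mean bond angle (synthetic example)")
ax.legend()
```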

More code and markdown cells can be added below as required ("Insert" --> "Cell Above" or "Cell Below")

# Task 1

Using `pandas`, read the `csv` datasets into two variables named

- `df_rh`
- `df_co`

Be careful: some entries are not valid and are marked with the value `-999`. Make sure these are excluded, for example by reading them in as missing values.

Also, index the dataframe using the `Refcode` column.

In [5]:
### BEGIN SOLUTION
import pandas as pd

# Read in data and take account of na_values. Set index to "Refcode" (optional but helps later on)
df_rh = pd.read_csv(filename_Rh, skiprows=1, na_values="-999", index_col="Refcode")
df_co = pd.read_csv(filename_Co, skiprows=1, na_values="-999", index_col="Refcode")

### END SOLUTION

In [9]:
assert df_rh.equals(pd.read_csv(filename_Rh, skiprows=1, na_values="-999", index_col="Refcode")), "The Rh dataframe has not been read correctly"
assert df_co.equals(pd.read_csv(filename_Co, skiprows=1, na_values="-999", index_col="Refcode")), "The Co dataframe has not been read correctly"

In [None]:

### BEGIN SOLUTION
# Calculate mean and standard deviation of the first bond angle ("ANG1")
angle_mean_rh = df_rh["ANG1"].mean()
angle_std_rh = df_rh["ANG1"].std()

angle_mean_co = df_co["ANG1"].mean()
angle_std_co = df_co["ANG1"].std()

comparison_dict = {"Mean":[angle_mean_rh, angle_mean_co], "Standard deviation": [angle_std_rh, angle_std_co]}


In [None]:

# Check difference between these values and the ideal solutions
square_planar = 90
tetrahedral = 109.5


In [None]:

# Calculate the differences from the expected angles
diff_rh_square_planar = angle_mean_rh - square_planar
diff_co_square_planar = angle_mean_co - square_planar

diff_rh_tetrahedral = angle_mean_rh - tetrahedral
diff_co_tetrahedral = angle_mean_co - tetrahedral

comparison_dict["Square planar (diff)"] = [diff_rh_square_planar, diff_co_square_planar]
comparison_dict["Tetrahedral (diff)"] = [diff_rh_tetrahedral, diff_co_tetrahedral]


In [None]:

# OPTIONAL - creating a DataFrame to display this in a nicely formatted table
comparison_df = pd.DataFrame(comparison_dict, index=["Rh molecules","Co molecules"])
comparison_df
### END SOLUTION

In [None]:
### BEGIN SOLUTION
# Calculate the Mean angle from the angle columns (and assign back to DataFrame)
df_rh["ANG_MEAN"] = (df_rh["ANG1"] + df_rh["ANG2"] + df_rh["ANG3"] + df_rh["ANG4"])/4.
df_co["ANG_MEAN"] = (df_co["ANG1"] + df_co["ANG2"] + df_co["ANG3"] + df_co["ANG4"])/4.

print(df_co["ANG_MEAN"].head())
### END SOLUTION
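
An alternative way to write the same row-wise average, shown here on made-up numbers rather than the project data, is `DataFrame.mean(axis=1)`. The two forms treat missing values differently: adding the columns gives `NaN` whenever any angle is missing, whereas `mean(axis=1)` skips `NaN` by default and averages the remaining angles unless `skipna=False` is passed.

```python
import numpy as np
import pandas as pd

# Made-up angles for illustration; the second row has one missing value
df_demo = pd.DataFrame(
    {
        "ANG1": [90.0, 110.0],
        "ANG2": [92.0, np.nan],
        "ANG3": [88.0, 108.0],
        "ANG4": [90.0, 112.0],
    },
    index=["DEMO01", "DEMO02"],
)
angle_columns = ["ANG1", "ANG2", "ANG3", "ANG4"]

# Explicit sum divided by 4: NaN propagates if any angle is missing
mean_explicit = (
    df_demo["ANG1"] + df_demo["ANG2"] + df_demo["ANG3"] + df_demo["ANG4"]
) / 4.0

# Default mean: missing angles are skipped and the rest are averaged
mean_skipna = df_demo[angle_columns].mean(axis=1)

# Strict mean: same NaN behaviour as the explicit sum
mean_strict = df_demo[angle_columns].mean(axis=1, skipna=False)

print(pd.DataFrame({"explicit": mean_explicit, "skipna": mean_skipna, "strict": mean_strict}))
```

Which behaviour is appropriate is a judgement call: an average over three angles is not directly comparable to an average over four, so it is worth stating the choice in a comment.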

In [None]:
### BEGIN SOLUTION
import matplotlib.pyplot as plt

## Creating 2 panel plot - one for "ANG1" and one for "ANG_MEAN"
# - Plotting these two histogram comparisons side by side since these are the quantities we have considered in this 
# analysis so far

fig, ax_arr = plt.subplots(nrows=1, ncols=2, figsize=(12,6))

ax_angle1 = ax_arr[0]
ax_mean_angle = ax_arr[1]

## Plot histograms for Rh vs Co for the first bond angle
# - Choosing consistent number of bins for both inputs (Rh and Co)
# - Plotting as a density to allow for a consistent comparison between Rh and Co
# - Choosing histtype of "step" for overlapping histograms (could have chosen transparency)
# - Making sure each data set is clearly labelled (legend)
# - Labelling the axes (including LaTeX for superscript)

num_bins=25

df_rh.plot.hist(y="ANG1", ax=ax_angle1, histtype="step", density=True, bins=num_bins, label="Rh molecules")
df_co.plot.hist(y="ANG1", ax=ax_angle1, histtype="step", density=True, bins=num_bins, label="Co molecules")

ax_angle1.set_xlabel("Angle 1 ($^o$)") # Label x axis
ax_angle1.set_ylabel("Probability density") # Label y axis (density=True, so not raw counts)

## Plot histograms for Rh vs Co for the average bond angle
# - Same decisions made for this plot

df_rh.plot.hist(y="ANG_MEAN", ax=ax_mean_angle, histtype="step", density=True, bins=num_bins, label="Rh molecules")
df_co.plot.hist(y="ANG_MEAN", ax=ax_mean_angle, histtype="step", density=True, bins=num_bins, label="Co molecules")

ax_mean_angle.set_xlabel("Mean angle ($^o$)") # Label x axis
ax_mean_angle.set_ylabel("Probability density") # Label y axis (density=True, so not raw counts)

### END SOLUTION

In [None]:
### BEGIN SOLUTION
# Find the average angle (or other individual angle) related to the reference code values
rh_codes = ["YIQQUK"]
co_codes = ["ZOJPUJ01", "ICAYES", "ZABVIK"]

ang_mean_rh_codes = df_rh["ANG_MEAN"].loc[rh_codes]
ang_mean_co_codes = df_co["ANG_MEAN"].loc[co_codes]

print(ang_mean_rh_codes)
print("\n")
print(ang_mean_co_codes)

## From looking at these values
# "YIQQUK" (Rh)   - 89.99525 - square planar
# "ZOJPUJ01" (Co) - 102.06200 - in the middle but more likely to be tetrahedral based on the histogram split above
# "ICAYES" (Co)   - 111.09500 - tetrahedral
# "ZABVIK" (Co)   - 90.02175 - square planar

### END SOLUTION
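
The reasoning in the comments above can also be written as a simple rule: assign each molecule to whichever ideal angle its mean angle is closer to. The sketch below is one possible rule, not the only reasonable one. The function name `classify_shape` is hypothetical, and the input values are the mean angles quoted in the comments above.

```python
import pandas as pd

# Ideal bond angles (degrees) for the two candidate shapes
IDEAL_ANGLES = {"square planar": 90.0, "tetrahedral": 109.5}


def classify_shape(mean_angle):
    """Return the shape whose ideal angle is closest to the given mean angle."""
    return min(IDEAL_ANGLES, key=lambda shape: abs(mean_angle - IDEAL_ANGLES[shape]))


# Mean angles quoted in the solution comments above
mean_angles = pd.Series(
    {"YIQQUK": 89.99525, "ZOJPUJ01": 102.062, "ICAYES": 111.095, "ZABVIK": 90.02175}
)

predicted_shapes = mean_angles.apply(classify_shape)
print(predicted_shapes)
```

Under this rule the decision boundary sits halfway between the ideal angles, at $99.75^{\mathrm{o}}$. "ZOJPUJ01" falls on the tetrahedral side by only about $2.3^{\mathrm{o}}$, so it is worth flagging as a borderline case rather than a confident classification.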

In [None]:
### BEGIN SOLUTION

## EXAMPLE OF FURTHER ANALYSIS - plotting these angle values on top of the histograms to see the comparison ###
# Create 2 subplots to look at Rh and Co separately
fig, ax_arr = plt.subplots(nrows=1, ncols=2, figsize=(12,6))

ax_rh = ax_arr[0]
ax_co = ax_arr[1]

# Plot histograms for mean angle for Co and Rh on side-by-side plots
bins=25
df_rh.plot.hist(y="ANG_MEAN", ax=ax_rh, histtype="step", density=True, bins=bins, label="Rh molecules")
df_co.plot.hist(y="ANG_MEAN", ax=ax_co, histtype="step", density=True, bins=bins, label="Co molecules")

# Plot the mean angle for the reference codes within the Rh dataset
y_arbitrary = [0.05]
rh_angle_mean = df_rh["ANG_MEAN"]
for rh_code in rh_codes:
    ax_rh.scatter(rh_angle_mean.loc[rh_code], y_arbitrary, label=rh_code, marker='x')

# Plot the mean angle for the reference codes within the Co dataset  
y_arbitrary = [0.01]
co_angle_mean = df_co["ANG_MEAN"]
for co_code in co_codes:
    ax_co.scatter(co_angle_mean.loc[co_code], y_arbitrary, label=co_code, marker='x')

# Label x axis (using LaTeX for superscript)
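ax_co.set_xlabel("Mean angle ($^o$)")
ax_rh.set_xlabel("Mean angle ($^o$)")

# Label y axis: density=True means the histograms show a density, not raw counts
ax_co.set_ylabel("Probability density")
ax_rh.set_ylabel("Probability density")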

# Making sure to display the legend so the reference code for each point is labelled
ax_co.legend()
ax_rh.legend()

### END SOLUTION

## Mark scheme

This project is marked under the following categories:

- Correct output (up to 40%)
  - Has the student successfully read in the files, accounted for missing data, calculated appropriate quantities, created the appropriate visualisations, and derived correct classifications?

- Code quality and methodology (up to 30%)
  - Is the code well structured, clear and easy to understand, appropriately commented? Does it use the correct libraries and methods?
  - Are there elements that go beyond the brief?

- Presentation (up to 30%)
  - Are the plot types appropriately chosen and are the plots clear, comprehensive and appropriately annotated and described?