## Assignment 1 ##
**Deadline: March 1st, 2014, midnight**

<a target="_blank" href="https://colab.research.google.com/github/RodrigoAVargasHdz/CHEM-4PB3/blob/w2024/Course_Notes/assignment1.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
#!pip install py3Dmol

1. Load the data from the paper: <br>
   [Using Gradients in Permutationally Invariant Polynomial Potential Fitting: A Demonstration for CH4 Using as Few as 100 Configurations, JCTC **15** (5), 2826 (2019).](https://pubs.acs.org/doi/10.1021/acs.jctc.9b00043)

This data set contains different geometries of methane, together with the electronic energy and the forces for each geometry.
We will be fitting the potential energy surface (PES) for methane ($\text{CH}_4$).

In [None]:
# load all the necessary libraries that you will require for this assignment here
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

## Basic data analysis ##
1. Load the data with Pandas from the URL:
```python
data_url = "https://github.com/RodrigoAVargasHdz/CHEM-4PB3/raw/w2024/Course_Notes/data/methane.csv"
```
2. Inspect the data:
   1. How many data points does the data set have?
   2. What type of information does the data set contain?
   3. What is the geometry with the lowest energy?

(tips)<br>
Each geometry data point is a flat vector of length 15 (5 atoms × 3 coordinates). If reshaped to (5,3), ```np.reshape(x,(5,3))```, each row corresponds to the xyz coordinates of one atom. <br>
From the original file, the order of the atoms is ```['H','H','H','H','C']```
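
As a minimal sketch of the reshape tip, the snippet below uses random stand-in arrays instead of the real file. The names `geoms` and `energies` are hypothetical, and the actual column layout of `methane.csv` should be checked before adapting it.

```python
import numpy as np

# Stand-in data (assumption): 4 geometries, each a flat vector of 15 numbers,
# plus one energy per geometry. The real values come from methane.csv.
rng = np.random.default_rng(0)
geoms = rng.normal(size=(4, 15))
energies = np.array([-40.2, -40.5, -40.1, -40.3])

# Index of the lowest-energy geometry.
i_min = int(np.argmin(energies))

# Reshape the flat vector so that each row holds the (x, y, z) of one atom.
atoms = ['H', 'H', 'H', 'H', 'C']
xyz_min = np.reshape(geoms[i_min], (5, 3))

for symbol, (x, y, z) in zip(atoms, xyz_min):
    print(f"{symbol}  {x: .4f}  {y: .4f}  {z: .4f}")
```

The reshaped array `xyz_min` can be passed directly as the `xyz` argument of a plotting helper that expects one row per atom.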

In [None]:
data_url = "https://github.com/RodrigoAVargasHdz/CHEM-4PB3/raw/w2024/Course_Notes/data/methane.csv"
data = pd.read_csv(data_url)

#code here!

You can use the following function ```draw_molecule()``` to plot individual geometries.

In [None]:
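def get_xyz_str(z, xyz):
    """Build an XYZ-format string from atomic symbols z and coordinates xyz."""
    n_atoms = len(z)
    # header: number of atoms, then a comment line (the energy is a placeholder)
    xyz_str = '%s\n * (null), Energy   -1000.0000000\n' % (n_atoms)
    for zi, xyzi in zip(z, xyz):
        # one line per atom: symbol followed by its x, y, z coordinates
        xyzi_str = '%s     %.4f     %.4f     %.4f\n' % (
            zi, float(xyzi[0]), float(xyzi[1]), float(xyzi[2]))
        xyz_str += xyzi_str
    return xyz_str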


def draw_molecule(view, z, xyz):
    """Add a molecule to a py3Dmol view and refresh it.

    Args:
        view: py3Dmol view object, e.g. py3Dmol.view(width=400, height=400)
        z (list of str): atomic symbols; for CH4, z = ['H','H','H','H','C']
        xyz (np.ndarray): xyz coordinates, shape (n_atoms, 3)
    """
    xyz_str = get_xyz_str(z, xyz)
    view.addModel(xyz_str, 'xyz')
    view.setStyle({'sphere': {'radius': 0.35}, 'stick': {'radius': 0.1}})
    view.zoomTo()
    view.update()
    view.clear()

'''
#Example Ozone
import py3Dmol  # requires the pip install at the top of the notebook
z = ['O','O','O']
xyz = np.array([[0.4496,   0.0000000,   0.0000000],
    [-0.2248,   0.0000000,  1.0927],
    [ -0.2248,  0.0000000,  -1.0927]])
view = py3Dmol.view(width=400, height=400)
view.show()
draw_molecule(view, z, xyz)
'''

3. Plot the histogram of the energies for $\text{CH}_4$. <br>
   **(grad students):** Over the histogram, fit a **Gaussian kernel density estimation model**. You are allowed to use the Scikit-Learn package, [link](https://scikit-learn.org/stable/modules/density.html).<br>

4. What can you tell from the histogram?
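
A minimal sketch of a Gaussian kernel density estimate with Scikit-Learn's `KernelDensity` is shown below. The energies are random stand-ins and the bandwidth of 0.3 is an arbitrary assumption; both should be replaced by the real energy column and a bandwidth suited to its units.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Stand-in energies (assumption): replace with the energy column of the data.
rng = np.random.default_rng(0)
energies = rng.normal(loc=0.0, scale=1.0, size=500)

# KernelDensity expects a 2D array of shape (n_samples, n_features).
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(energies[:, None])

# Evaluate the estimate on a grid; score_samples returns the log-density.
grid = np.linspace(energies.min() - 1.0, energies.max() + 1.0, 400)
density = np.exp(kde.score_samples(grid[:, None]))
```

To overlay the curve on the histogram, the histogram has to be normalized as well, for example with `plt.hist(energies, bins=30, density=True)` followed by `plt.plot(grid, density)`.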

In [None]:
# code here!


## Training, validation and test data splitting ##
Code a function that splits the data into training, validation and test sets. Your function must accept two integers: the number of training points and the number of validation points.<br>
**The splitting must be random**, and you are allowed to use Scikit-Learn's function [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

```python
def split_data(n_tr, n_val):
    # split the data into training, validation and test

    return (X_tr, y_tr), (X_val, y_val), (X_tst, y_tst)
```
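
One possible way to fill in this skeleton is sketched below. It is an illustration under assumptions, not the required solution: it uses a NumPy random permutation instead of `train_test_split`, and it takes the arrays `X`, `y` and a `seed` as extra arguments, which the skeleton above does not.

```python
import numpy as np

def split_data(X, y, n_tr, n_val, seed=0):
    """Randomly split (X, y) into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random order of all row indices
    i_tr = idx[:n_tr]
    i_val = idx[n_tr:n_tr + n_val]
    i_tst = idx[n_tr + n_val:]             # everything left over is test data
    return (X[i_tr], y[i_tr]), (X[i_val], y[i_val]), (X[i_tst], y[i_tst])

# Toy usage with stand-in arrays (20 samples, 2 features).
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20, dtype=float)
(X_tr, y_tr), (X_val, y_val), (X_tst, y_tst) = split_data(X, y, n_tr=5, n_val=3)
```

Because the three index sets are slices of one permutation, no data point can appear in more than one split.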

In [None]:
# code here!

# PES: Potential Energy surface model #

For all the following models, we will use the interatomic distances between all atoms as the input for the PES models. 

For each geometry, compute the interatomic distances between all pairs of atoms,
   $$
   r(x^i,x^j) = \sqrt{\sum_{\ell} (x^i_\ell - x^j_\ell)^2},
   $$
where $x^i$ is the XYZ coordinates of atom $i$, $x^j$ is the XYZ coordinates of atom $j$, and $\ell$ runs over the three Cartesian components. <br>
(tip)  <br>
You can use ```np.linalg.norm()``` and expand the dimensions of the array.
The final vector should contain the distances in the order <br>
```[H1-H2, H1-H3, H1-H4, H2-H3, H2-H4, H3-H4, C-H1, C-H2, C-H3, C-H4]```
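
The tip about expanding dimensions can be sketched as follows. The function name `interatomic_distances` is hypothetical, and the test geometry is an idealized tetrahedron with a C-H length of 1.09 Å chosen for illustration, not a point from the data set.

```python
import numpy as np

def interatomic_distances(xyz):
    """Return the 10 pairwise distances of a (5, 3) geometry ordered H1..H4, C."""
    # Broadcasting: diff[i, j] = xyz[i] - xyz[j], shape (5, 5, 3).
    diff = xyz[:, None, :] - xyz[None, :, :]
    dmat = np.linalg.norm(diff, axis=-1)   # (5, 5) matrix of distances
    hh = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    ch = [(4, 0), (4, 1), (4, 2), (4, 3)]
    return np.array([dmat[i, j] for i, j in hh + ch])

# Idealized tetrahedral CH4 (illustrative values): four H atoms, then C at the origin.
a = 1.09 / np.sqrt(3.0)
xyz = np.array([[a, a, a], [a, -a, -a], [-a, a, -a], [-a, -a, a], [0.0, 0.0, 0.0]])
r = interatomic_distances(xyz)
```

For this symmetric geometry the first six entries (H-H) are all equal and the last four (C-H) are all 1.09, which is a quick check that the ordering is right.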


In [None]:
# code here!

1. For each interatomic distance, create a histogram showing the range of the data points, and plot all of them in a single figure (**not individual figures**).<br>
   To plot all histograms in a single figure you can use the [mosaic function](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplot_mosaic.html#matplotlib.pyplot.subplot_mosaic) from Matplotlib, or [subplots](https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure.subplots).
   Add a title to each panel stating which distance it shows.
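
A minimal sketch of the single-figure layout using `plt.subplots` is given below. The array `R` is a random stand-in with one column per distance (an assumption), and the 2×5 grid is just one convenient arrangement.

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ["H1-H2", "H1-H3", "H1-H4", "H2-H3", "H2-H4", "H3-H4",
          "C-H1", "C-H2", "C-H3", "C-H4"]

# Stand-in distances (assumption): shape (n_geometries, 10).
rng = np.random.default_rng(0)
R = rng.normal(loc=1.5, scale=0.1, size=(200, 10))

# One panel per distance, arranged in a 2 x 5 grid.
fig, axes = plt.subplots(2, 5, figsize=(15, 5), sharey=True)
for ax, label, column in zip(axes.ravel(), labels, R.T):
    ax.hist(column, bins=20)
    ax.set_title(label)
    ax.set_xlabel("distance (Å)")
fig.tight_layout()
```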

In [None]:
# code here!

## Linear models ##
The original paper used a polynomial-based expansion to fit the PES for $\text{CH}_4$; however, we will first use only the distances as the input for our model,
$$
f(\mathbf{w},\mathbf{r}) = \mathbf{w}^\top \mathbf{r} = \sum_i w_i r_i
$$
where $r_i$ is one of the interatomic distances. 

1. Fit a linear model with and without a regularization term. You can use the code from the lecture.
2. For the regularized least-squares model, use a cross-validation search to optimize the value of $\lambda$ (**only using the training data**).

**Details:** <br>
Consider **50** training data points and **250** validation points. You must also use an additional **1,000** test points to assess the accuracy of the model. 

**Some analysis** <br>
Using the validation data, compare both models and report the optimal value of $\lambda$ found.<br>
You can make a predicted vs. true plot to visualize the results of the models.
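
The regularized fit and the search over $\lambda$ can be sketched with NumPy alone, as below. This is an illustration under assumptions, not the lecture code: the data are synthetic stand-ins, the helper names `fit_ridge` and `cv_error` are hypothetical, the grid of $\lambda$ values and the use of 5 folds are arbitrary choices, and no intercept is included because the model above is $\mathbf{w}^\top \mathbf{r}$.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form regularized least squares: w = (X^T X + lam I)^{-1} X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

def cv_error(X, y, lam, n_folds=5, seed=0):
    """Mean squared error of the ridge fit, averaged over k held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    errors = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        w = fit_ridge(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((X[test_idx] @ w - y[test_idx]) ** 2))
    return float(np.mean(errors))

# Stand-in training data (assumption): 50 points, 10 distances each.
rng = np.random.default_rng(1)
X_tr = rng.normal(size=(50, 10))
w_true = rng.normal(size=10)
y_tr = X_tr @ w_true + 0.1 * rng.normal(size=50)

# Search lambda on a logarithmic grid using the training data only.
lambdas = np.logspace(-6, 2, 9)
scores = [cv_error(X_tr, y_tr, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(scores))]
```

Setting `lam=0.0` in `fit_ridge` recovers the unregularized least-squares fit, so the same function covers both models in the comparison.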


In [None]:
#code here!

## Polynomial linear models ##
The original paper used a polynomial-based expansion to fit the PES for $\text{CH}_4$. Here we extend the linear model by replacing the distances with a polynomial expansion of them,
$$
f(\mathbf{w},\mathbf{r}) = \mathbf{w}^\top \phi(\mathbf{r}) = \sum_i w_i  \phi_i(\mathbf{r})
$$
where $\mathbf{r}$ is the vector of interatomic distances and $\phi(\mathbf{r})$ is the polynomial expansion. This model is more similar to the one used in the original [paper](https://pubs.acs.org/doi/10.1021/acs.jctc.9b00043).

* Fit a polynomial model with a regularization term. You can use the code from the lecture. Use a cross-validation search to optimize the value of $\lambda$ (**only using the training data**).

**Details:** <br>
* Use the same training and validation data as the ones used to fit the **Linear models**.<br>
* You are allowed to use the function [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) from Scikit-Learn.
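
As a small sketch of what `PolynomialFeatures` produces, the snippet below expands random stand-in distances (an assumption) to degree 2. For 10 inputs this gives 66 columns: 1 constant, 10 linear and 55 quadratic terms.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Stand-in distances (assumption): 50 geometries, 10 distances each.
rng = np.random.default_rng(0)
R = rng.uniform(1.0, 2.0, size=(50, 10))

# Degree-2 expansion phi(r): constant, r_i, and all products r_i * r_j.
poly = PolynomialFeatures(degree=2, include_bias=True)
Phi = poly.fit_transform(R)
print(Phi.shape)  # (50, 66)
```

The fitted `poly` object should be reused with `poly.transform(...)` on the validation and test distances so that all sets share the same feature ordering.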


**Some analysis** <br>
Plot the validation error as a function of the polynomial degree; you can consider degrees from 1 to 5. <br>

**(graduate students)**<br>
 Also plot the optimal value of $\lambda$ as a function of the polynomial degree.


In [None]:
# code here!

## Kernel based models for PES ##
In class, we also studied that a linear model can be transformed into a kernel model by using the kernel trick. <br>
Using Scikit-Learn, let's fit the PES for $\text{CH}_4$ using either the RBF or Matern kernel.

**Important links:**<br>
* [Kernel Ridge regression](https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html#sklearn.kernel_ridge.KernelRidge)
* [RBF](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.RBF.html#sklearn.gaussian_process.kernels.RBF)
* [Matern 2.5](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.Matern.html)

Train a kernel model using individual length scale parameters for each interatomic distance.<br>
According to the documentation of both kernels, one can define a ```length_scale``` as a vector where each element of the array corresponds to an individual length scale parameter. 
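
A minimal sketch of kernel ridge regression with one length scale per distance is shown below. The data are synthetic stand-ins, and the unit length scales and `alpha=1e-6` are arbitrary assumptions, not tuned values.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process.kernels import RBF

# Stand-in data (assumption): 50 geometries, 10 distances, a smooth toy target.
rng = np.random.default_rng(0)
R_tr = rng.uniform(1.0, 2.0, size=(50, 10))
y_tr = np.sin(R_tr).sum(axis=1)

# A vector-valued length_scale gives one length scale per input dimension.
length_scale = np.ones(10)
kernel = RBF(length_scale=length_scale)

# KernelRidge accepts a kernel object from sklearn.gaussian_process.kernels.
model = KernelRidge(kernel=kernel, alpha=1e-6)
model.fit(R_tr, y_tr)
y_pred = model.predict(R_tr)
```

`KernelRidge` does not optimize the kernel parameters during `fit`, so the length scales and `alpha` have to be selected separately, for example with a validation or cross-validation search. Swapping `RBF` for `Matern(length_scale=length_scale, nu=2.5)` follows the same pattern.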


In [None]:
#code here!

# PES Inspection #
Choose one of the previously trained models, then predict and plot the energy as a function of the **C-H** distance and of the **H-H** distance.
* For the **C-H** distance, consider the range of 0.9 $\AA$ to 1.7 $\AA$.
* For the **H-H** distance, consider the range of 1.0 $\AA$ to 2.4 $\AA$.

You can use the ground state geometry as reference for the other distances required as input to the model.

(tips)<br>
See Figure 5 of the [original paper](https://pubs.acs.org/doi/10.1021/acs.jctc.9b00043) as reference.
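
Building the model inputs for such a scan can be sketched as follows. The helper `scan_distance` is hypothetical, and the reference distances are illustrative round numbers, not the ground-state values from the data set.

```python
import numpy as np

def scan_distance(r_ref, index, values):
    """Copy r_ref once per scan value and overwrite one distance with the values."""
    R_scan = np.tile(r_ref, (len(values), 1))
    R_scan[:, index] = values
    return R_scan

# Reference distances in the assignment's order: 6 H-H, then 4 C-H
# (illustrative values, assumption).
r_ref = np.array([1.78] * 6 + [1.09] * 4)

ch_values = np.linspace(0.9, 1.7, 50)
R_ch = scan_distance(r_ref, index=6, values=ch_values)   # vary C-H1 only

hh_values = np.linspace(1.0, 2.4, 50)
R_hh = scan_distance(r_ref, index=0, values=hh_values)   # vary H1-H2 only

# energies_ch = model.predict(R_ch)  # `model` stands for any trained PES model
```

Changing a single distance while holding the other nine fixed is a one-dimensional cut through the model's input space; it may not correspond to a geometry that can be realized exactly in 3D, which is worth keeping in mind when comparing with the paper's figure.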


In [None]:
#code here!