## Assessment: Classification of Reaction data

This is an assessment that forms part of the ML for Chemistry workshop for Advanced Topics in Physical Chemistry. 

### What can/cannot be used for this coursework?
**Using Google** to find programmatic solutions to a problem is perfectly acceptable. However, if you copy more than a single line of code from an online source, make sure you indicate this with a Python comment in the following way:

```python
# The following xxx lines follow a solution found here
# https://stackoverflow.com/questions/39324039/highlight-typos-in-the-jupyter-notebook-markdown on 20/10/2021
```

What you **shouldn't have** to do is:
- Install additional packages in your Noteable environment
- Use code you do not understand found on the web

What is clearly **not allowed**:
<div class="alert alert-danger">
    You are NOT allowed to send/give/receive Python code to/from classmates and others.
The standard examination rules apply to this project.
</div>

The coursework will be marked by hand and will be checked for plagiarism both manually and through Turnitin.

### How is the coursework assessed?

Think of this as a guided lab report in the form of a Jupyter notebook. It has an introduction section exploring the data, two sections tackling the classification problem with a Random Forest and a neural network, and a final section with conclusions and discussion. A skeleton structure for the report is provided in the cells below.

- Each part of your code should run.
- Every function you write should have an appropriate doc-string.
- All plots should be labelled correctly and all fonts should have a legible size.
- Write an introduction, a discussion, and a conclusion section in markdown. Use references where appropriate. 
- Remember you can use LaTeX in markdown by using $$ to start a maths environment. 
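
As an illustration of the points on docstrings and output formatting, here is a minimal sketch. The function name `interatomic_distance`, the NumPy-style docstring layout, and the unit used in the print statement are illustrative assumptions, not requirements of the coursework.

```python
import math


def interatomic_distance(atom_a, atom_b):
    """Return the Euclidean distance between two atoms.

    Parameters
    ----------
    atom_a, atom_b : sequence of float
        Cartesian (x, y, z) coordinates, both in the same length unit.

    Returns
    -------
    float
        Distance in the same unit as the inputs.
    """
    return math.dist(atom_a, atom_b)


# Print with a sensible number of significant figures and an explicit unit
# (the unit is assumed here purely for the example).
distance = interatomic_distance((0.0, 0.0, 0.0), (1.0, 2.0, 2.0))
print(f"Distance: {distance:.2f} angstrom")
```

Any consistent docstring convention is fine, as long as it states what the function takes, what it returns, and in which units.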

Overall criteria:
- 10 % Code presentation and readability
- 20 % Production quality (plots, etc.)
- 50 % Report structure and readability
- 20 % Data interpretation and conclusion

**The assessment is worth 20% of the overall module mark.**   

**Deadline: 10th March 2023 5pm**   

<div class="alert alert-success">
Reminder: Comment your code, use markdown to explain your working where appropriate, and make sure all your variable names are sensible! Also make sure to choose appropriate levels of significant figures for print out and make sure you use the correct units. 
</div>

## The Data
You will be working with the `QMrxn20: Thousands of reactants and transition states for competing E2 and SN2 reactions` dataset. The whole dataset can be found here: [https://archive.materialscloud.org/record/2020.55](https://archive.materialscloud.org/record/2020.55). It is a dataset of SN2 and E2 reactions with reactant, product, and transition state coordinates. 

The task:
- Use xyz coordinates to classify each geometry as either a transition state geometry (1) or any other geometry (0)
- Use a Random Forest as a classifier similar to the one in Unit 2 for this task
- Use a neural network similar to the one in Unit 3 for this task

The file `dataset.csv` contains the preprocessed data for these tasks, derived from the original dataset. In this file the coordinates and information on energies etc. have already been combined, and the data has been subsampled down to ~80k datapoints.

The structure of the dataset looks as follows:

```
label,reaction,geometry,number,energy,method,element_0, element_0,x coordinates_0,y coordinates_0,z coordinates_0,element_1,...,z_coordinates_20
A_A_A_A_A_A,e2,ts,0,transition-states/e2/A_A_A_A_A_A.xyz,-179.132058577095,mp2,C,-0.04447,-0.0119,-0.3780,C,...,-0.153736
A_A_A_A_B_A,e2,ts,0,-539.161015594451,mp2,F,-0.044,-0.011,-0.3780,C,-0.1537 
```

You will find the label for the data in the column named `geometry`. If it is `ts`, i.e. a transition state, the label should be 1; if it is anything else, the label should be 0. You will need to prepare the label in such a way that you can train on it. Columns 7 to the end contain the element names and the x, y, z coordinates of each atom. Train on the coordinates, but be careful not to include the element names as well!
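
The label and feature preparation described above could be sketched as follows. This uses a tiny made-up table instead of `dataset.csv`; the column names (`x_0`, `y_0`, ...), the `reactant` geometry value, and the third row are assumptions for illustration and may not match the real file, so check `df.columns` on the actual data first.

```python
import pandas as pd

# Toy stand-in for dataset.csv with two atoms per row. The values and the
# column names are made up for illustration and may differ in the real file.
toy = pd.DataFrame(
    {
        "label": ["A_A_A_A_A_A", "A_A_A_A_B_A", "A_A_A_A_C_A"],
        "reaction": ["e2", "e2", "sn2"],
        "geometry": ["ts", "reactant", "ts"],
        "element_0": ["C", "F", "C"],
        "x_0": [-0.044, 0.120, 0.300],
        "y_0": [-0.012, 0.450, -0.200],
        "z_0": [-0.378, 0.010, 0.150],
        "element_1": ["C", "C", "H"],
        "x_1": [1.10, 0.98, 0.00],
        "y_1": [0.25, -0.31, 0.00],
        "z_1": [-0.15, 0.42, 0.00],
    }
)

# Binary target: 1 for transition states, 0 for every other geometry.
labels = (toy["geometry"] == "ts").astype(int).to_numpy()

# Take every column from the first atom onwards, then drop the element-name
# columns so that only numeric coordinates remain.
first_atom_column = toy.columns.get_loc("element_0")
atom_columns = toy.columns[first_atom_column:]
coordinate_columns = [c for c in atom_columns if not c.startswith("element")]
coordinates = toy[coordinate_columns].to_numpy(dtype=float)

print(labels)
print(coordinates.shape)
```

Converting with `dtype=float` is a useful safeguard: it raises an error if a non-numeric column such as an element name slips into the feature matrix.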

### External references
**Journal reference**
G. F. von Rudorff, S. N. Heinen, M. Bragato, O. A. von Lilienfeld, Machine Learning: Science and Technology 1, 045026 (2020). doi:10.1088/2632-2153/aba822

**Preprint (where the data generation is discussed)**
G. F. von Rudorff, S. N. Heinen, M. Bragato, O. A. von Lilienfeld, arXiv:2006.00504

------

## Start of the Report

Please leave the header sections as you find them, but feel free to add as many cells in each section as you need to answer them. 

## 1. Introduction and exploring the data

In the directory `data` you will find a file called `dataset.csv`. Freely explore the dataset in this section.

Things you definitely need:

- an **array of labels**, which you can extract from the `geometry` column (1 for `ts`, 0 for everything else)
- an **array of x, y, and z coordinates** for all atoms.

Some of the coordinates will be 0 because the molecule has fewer than 21 atoms. Please explore some statistics around the data.
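
Two statistics that follow directly from this are the number of real atoms per molecule and the class balance. The sketch below uses made-up coordinates padded to four atoms rather than 21, and it assumes that padded atoms have all three coordinates exactly equal to zero.

```python
import numpy as np

# Made-up coordinates for three molecules, padded to four atoms each
# (x, y, z per atom); the real data is padded to 21 atoms.
coordinates = np.array(
    [
        [0.1, 0.2, 0.3, 1.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.4, 0.1, 0.2, 0.9, 0.3, 0.1, 1.5, 0.2, 0.7, 0.0, 0.0, 0.0],
        [0.2, 0.2, 0.2, 0.8, 0.1, 0.4, 1.1, 0.6, 0.3, 2.0, 0.5, 0.9],
    ]
)
labels = np.array([1, 0, 0])

# Reshape to (molecules, atoms, 3) and treat an atom as padding when all
# three of its coordinates are exactly zero (an assumption about the padding).
per_atom = coordinates.reshape(len(coordinates), -1, 3)
atom_counts = np.any(per_atom != 0.0, axis=2).sum(axis=1)

# Class balance: fraction of rows labelled as transition states.
ts_fraction = labels.mean()

print("Atoms per molecule:", atom_counts)
print(f"Fraction of transition states: {ts_fraction:.2f}")
```

The class balance matters later on: if one class dominates, plain accuracy can look good even for a classifier that always predicts the majority class.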

In this part of your submission notebook you should also introduce what a classification problem is, and this dataset in particular.

In [None]:
# Your solution here



## 2. Classify the reaction data into transition state and non-transition state geometries using a Random Forest

Here you will use your labels and x, y, z coordinates to train a Random Forest classifier to distinguish transition states from non-transition states.
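
A minimal sketch of this workflow with scikit-learn is shown below. It runs on synthetic data with an arbitrary labelling rule, not on the reaction dataset, and the hyperparameters (`n_estimators=100`, a 25 % test split) are placeholder assumptions rather than recommended settings or the exact setup from Unit 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 "geometries" with 6 coordinate features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 6))
# Arbitrary rule so that the toy labels are learnable.
targets = (features[:, 0] + features[:, 1] > 0).astype(int)

# Hold out a test set; stratify keeps the class ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.25, stratify=targets, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

predictions = forest.predict(X_test)
test_accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Fixing `random_state` makes the split and the forest reproducible, which helps when comparing results between runs in the report.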

In [None]:
# Your solution here


## 3. Classify the reaction data into transition state and non-transition state geometries using a Neural Network

Use the same data as before, but now set up a neural network using PyTorch for this classification task. Train the network and use appropriate plots to show your results. 
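
As a rough sketch of what such a setup can look like in PyTorch, the example below trains a small fully connected network on synthetic data. The architecture, learning rate, and number of epochs are arbitrary assumptions and not the network from Unit 3; for brevity it also trains on the full toy set without a validation split, which a real analysis should include.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-in data: 200 samples with 6 features and a binary label.
features = torch.randn(200, 6)
targets = (features[:, 0] + features[:, 1] > 0).float().unsqueeze(1)

# Small fully connected network that outputs one logit per sample.
model = nn.Sequential(
    nn.Linear(6, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
loss_fn = nn.BCEWithLogitsLoss()  # applies the sigmoid internally
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Full-batch training loop; the loss history can be plotted afterwards.
losses = []
for epoch in range(100):
    optimizer.zero_grad()
    logits = model(features)
    loss = loss_fn(logits, targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Convert logits to class predictions with a 0.5 probability threshold.
with torch.no_grad():
    predicted = (torch.sigmoid(model(features)) > 0.5).float()
    train_accuracy = (predicted == targets).float().mean().item()

print(f"Final loss: {losses[-1]:.3f}, training accuracy: {train_accuracy:.2f}")
```

The `losses` list is the natural input for a training-curve plot, and the same thresholding step can be applied to held-out data to build a confusion matrix.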

In [None]:
# Your solution here


## 4. Discussion of results and conclusion
In this part of your notebook you should discuss your results using Markdown text. 

In [None]:
# Your solution here
