## Assessment Input Data cleaning
<a rel="license" href="https://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons Licence" style="width=50" src="https://licensebuttons.net/l/by/4.0/88x31.png" title='This work is licensed under a Creative Commons Attribution 4.0 International License.' align="right"/></a>

**Authors**: 
- Dr Antonia Mey (antonia.mey@ed.ac.uk)

In [None]:
import pandas as pd
import numpy as np
import warnings; warnings.simplefilter('ignore')

## 1. The Dataset

You will be working with the 'QMrxn20: Thousands of reactants and transition states for competing E2 and SN2 reactions' dataset. The whole dataset can be found here: [https://archive.materialscloud.org/record/2020.55](https://archive.materialscloud.org/record/2020.55).

The data below has already been prepared for machine learning. Some of the original data is included with the notebook so that you can explore it. This includes a file called `energies.txt`, which contains a subset of the dataset as a table: each row references a molecular geometry file and carries a label indicating whether the geometry is a transition state.

```
label,reaction,geometry,number,filename,energy,method
A_A_A_A_A_A,e2,ts,0,transition-states/e2/A_A_A_A_A_A.xyz,-179.132058577095,mp2
A_A_A_A_B_A,e2,ts,0,transition-states/e2/A_A_A_A_B_A.xyz,-539.161015594451,mp2
A_A_A_A_B_B,e2,ts,0,transition-states/e2/A_A_A_A_B_B.xyz,-638.348834529304,mp2
A_A_A_A_B_C,e2,ts,0,transition-states/e2/A_A_A_A_B_C.xyz,-998.3806344381261,mp2
A_A_A_A_B_D,e2,ts,0,transition-states/e2/A_A_A_A_B_D.xyz,-3111.578751937588,mp2
A_A_A_A_C_A,e2,ts,0,transition-states/e2/A_A_A_A_C_A.xyz,-2652.360661690992,mp2
```

The `xyz` files for some of the transition-state geometries are also included. In the first part, you are encouraged to explore the dataset a little while writing your introduction.
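
As a starting point for that exploration, the sketch below loads the six sample rows shown above from an in-memory string, so it runs without the data files, and summarises them with pandas. With the real file you would call `pd.read_csv('data/energies.txt')` instead; the names `energies_sample` and `summary` are illustrative.

```python
import io
import pandas as pd

# The six rows shown above, copied verbatim so the example runs without data/energies.txt
sample = io.StringIO(
    "label,reaction,geometry,number,filename,energy,method\n"
    "A_A_A_A_A_A,e2,ts,0,transition-states/e2/A_A_A_A_A_A.xyz,-179.132058577095,mp2\n"
    "A_A_A_A_B_A,e2,ts,0,transition-states/e2/A_A_A_A_B_A.xyz,-539.161015594451,mp2\n"
    "A_A_A_A_B_B,e2,ts,0,transition-states/e2/A_A_A_A_B_B.xyz,-638.348834529304,mp2\n"
    "A_A_A_A_B_C,e2,ts,0,transition-states/e2/A_A_A_A_B_C.xyz,-998.3806344381261,mp2\n"
    "A_A_A_A_B_D,e2,ts,0,transition-states/e2/A_A_A_A_B_D.xyz,-3111.578751937588,mp2\n"
    "A_A_A_A_C_A,e2,ts,0,transition-states/e2/A_A_A_A_C_A.xyz,-2652.360661690992,mp2\n"
)
energies_sample = pd.read_csv(sample)

# How many entries are there per reaction type and geometry label?
summary = energies_sample.groupby(['reaction', 'geometry']).size()
print(summary)

# Basic statistics of the energy column in the sample
print(energies_sample['energy'].describe())
```

On the full table, the same `groupby` call gives a quick view of how balanced the reaction and geometry labels are.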

### External references
**Journal reference**
G. F. von Rudorff, S. N. Heinen, M. Bragato, O. A. von Lilienfeld, Machine Learning: Science and Technology 1, 045026 (2020). doi:10.1088/2632-2153/aba822    
**Preprint (where the data generation is discussed)**
G. F. von Rudorff, S. N. Heinen, M. Bragato, O. A. von Lilienfeld, arXiv:2006.00504

## 2. Sub-sampling the dataset
A classification model does not need all of the more than 400k entries, so we subsample the data to roughly 80k structures by keeping every fifth row.
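
Taking every fifth row with `iloc[::5]` is a systematic subsample: it keeps one fifth of the rows and preserves their order. A minimal sketch on a toy frame (the frame `toy` and its values are made up for illustration) shows this next to a random alternative, `DataFrame.sample`, which this notebook does not use.

```python
import pandas as pd

# Toy stand-in for the energies table: 20 rows instead of several hundred thousand
toy = pd.DataFrame({'energy': range(20)})

# Systematic subsample: every fifth row, as done in this notebook
every_fifth = toy.iloc[::5, :]
print(every_fifth.index.tolist())  # [0, 5, 10, 15]

# Random alternative (not used here): 20% of the rows, reproducible via random_state
random_fifth = toy.sample(frac=0.2, random_state=0)
print(len(random_fifth))  # 4
```

Systematic sampling is reproducible without a seed, but it assumes that the row order does not correlate with the labels in a way that would bias the subset.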

In [None]:
energies = pd.read_csv('data/energies.txt')
subsampled_energies = energies.iloc[::5, :]

In [None]:
# save the subsample; index=False avoids writing the old row index as an extra unnamed column
subsampled_energies.to_csv('data/subsampled_energies.txt', index=False)

## 3. Putting coordinates and labels into one file
Here we extract the xyz coordinates from the xyz files and write everything into one large csv file, which is the basis of the input dataset for the assessment.

For this step you will need the xyz files from the dataset, all placed in the `data` subdirectory. The directories are:

```bash
product-conformers 
reactant-complex-constrained-conformers 
reactant-complex-unconstrained-conformers
reactant-conformers
transition-states
```

Please download them separately, as they are too large for the GitHub repo. 
They can be found here: [https://archive.materialscloud.org/record/2020.55](https://archive.materialscloud.org/record/2020.55).
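
An xyz file starts with a line giving the number of atoms, followed by a comment line, followed by one line per atom with the element symbol and its x, y and z coordinates. The sketch below parses such a file from an in-memory string and flattens it into a single row padded with NaN. The water geometry, the helper name `xyz_to_row`, and the small `max_atoms=5` are illustrative assumptions and are not taken from the dataset.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical example geometry (water, approximate values), not from QMrxn20
xyz_text = """3
water, illustrative only
O 0.000 0.000 0.117
H 0.000 0.757 -0.471
H 0.000 -0.757 -0.471
"""

def xyz_to_row(handle, max_atoms=5):
    """Flatten one xyz file into a single row, padded with NaN up to max_atoms."""
    cols = ['element', 'x', 'y', 'z']
    # Skip the atom-count and comment lines, then read one atom per row
    atoms = pd.read_csv(handle, skiprows=2, sep=r'\s+', header=None, names=cols)
    # Row-major flattening gives element_0, x_0, y_0, z_0, element_1, ...
    flat = atoms.to_numpy(dtype=object).reshape(-1)
    # Fill the unused atom slots with NaN so every row has the same width
    padding = np.full(len(cols) * (max_atoms - len(atoms)), np.nan, dtype=object)
    all_cols = [f'{c}_{i}' for i in range(max_atoms) for c in cols]
    return pd.DataFrame([np.concatenate([flat, padding])], columns=all_cols)

row = xyz_to_row(io.StringIO(xyz_text))
print(row.shape)  # (1, 20)
print(row.loc[0, ['element_0', 'element_1', 'element_3']].tolist())
```

Padding every structure to the same number of atom slots is what allows molecules of different sizes to share one table.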

In [None]:
path = 'data/'
energies = pd.read_csv('data/subsampled_energies.txt')

max_atoms = 21  # atom slots per structure; smaller molecules are padded with NaN
cols = ['element', 'x coordinates', 'y coordinates', 'z coordinates']
# make enough columns to store all coordinates: four columns per atom slot
all_cols = [x + '_' + str(i) for i in range(max_atoms) for x in cols]

def get_coordinates(n, energies, cols, all_cols):
    """Return a one-row DataFrame with the flattened coordinates for row n of energies."""
    coord_path = path + energies['filename'][n]
    # skip the atom-count line; the xyz comment line is consumed as the header
    # (sep=r'\s+' replaces the deprecated delim_whitespace=True)
    coords = pd.read_csv(coord_path, skiprows=1, sep=r'\s+')
    # make temporary df with one row per atom
    coord_df = pd.DataFrame(data=coords.values, columns=cols)
    # if there are fewer than max_atoms atoms, pad the left-over columns with NaN
    buffer = np.full((1, len(cols) * (max_atoms - len(coord_df.index))), np.nan)
    data = np.concatenate((coord_df.values.reshape((1, -1)), buffer), axis=1)
    return pd.DataFrame(data, columns=all_cols)

# get coordinates for all entries; collect the rows in a list and concatenate once,
# because DataFrame.append was removed in pandas 2.0
rows = []
for i in range(len(energies)):
    if i % 5000 == 0:
        print(f'At entry {i}/{len(energies)}')
    rows.append(get_coordinates(i, energies, cols, all_cols))
df = pd.concat(rows, ignore_index=True)

energies_new = pd.concat([energies, df], axis=1)
energies_new.drop(columns='filename', inplace=True)
energies_new.to_csv(path + 'energies_coordinates.csv')

In [None]:
test = pd.read_csv('data/energies_coordinates.csv')

In [None]:
test.drop(['Unnamed: 0'], axis=1, inplace=True)
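
The `Unnamed: 0` column dropped here arises because `to_csv` writes the row index as an unnamed first column by default, and `read_csv` then reads it back as ordinary data. A minimal sketch on a toy frame (all names are illustrative) shows the effect and how `index=False` avoids it:

```python
import io
import pandas as pd

toy = pd.DataFrame({'energy': [-1.0, -2.0]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'energy']

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['energy']
```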

In [None]:
# fillna returns a new DataFrame, so assign it back to keep the filled values
test = test.fillna(0)

In [None]:
test.to_csv('data/dataset.csv', index=False)

## END
----