<a name="top"></a>
<br/>
# Machine learning thermodynamic perturbation theory (MLPT) - Position
<br/>

**École Nationale Supérieure des Mines de Nancy**  
Project under the supervision of [Dario Rocca](http://crm2.univ-lorraine.fr/lab/fr/personnel/dario-rocca/) and [Fabien Pascale](https://www.researchgate.net/profile/Fabien_Pascale).  

Title: *High-accuracy materials modeling by machine learning quantum simulations*.  
Authors: [Lucas Lherbier](https://www.linkedin.com/in/lucas-lherbier/).

Last update: Feb 3rd, 2020.
<br/>

---
This notebook accompanies my project report. The Jupyter notebooks are divided into three parts:
* the **[MLPT_position](MLPT_position.ipynb)** file: it creates the dataset whose features are the positions of the atoms relative to a reference atom.
* the **[MLPT_Distance](MLPT_Distance.ipynb)** file: it creates the dataset whose features are the distances from a reference atom to its neighbors. For each configuration, every atom in the primitive cell is used in turn as the reference atom.
* the **[MLPT_MachineLearning](MLPT_MachineLearning.ipynb)** file: it applies machine learning algorithms to the datasets.

This notebook is the **MLPT_position** file.

---

### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import ase
import ase.io

---
<a name="def"></a>

# Position descriptors

## Files extraction

Our data comes from materials science. The data file describes a single material whose primitive cell contains 42 atoms. For each configuration, we have the coordinates of every atom.
The position of the molecule changes with time, so the atomic coordinates differ from one configuration to the next.

In [2]:
# Read every 95th frame of the trajectory file
atoms = ase.io.read('MLPOS.19000', index='::95', format='xyz')
# One energy value per selected configuration
data_energy = pd.read_csv('CORRECTIONS', header=None, sep=' ', names=['Id', 'Energy'])

In [3]:
print('The length of the data atoms is:', len(atoms))
print('The shape of the data energy_target is:', data_energy.shape)

The length of the data atoms is: 200
The shape of the data energy_target is: (200, 2)
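
The index string `'::95'` passed to `ase.io.read` behaves like a Python slice with a step of 95. The sketch below illustrates that sampling on plain frame indices. It assumes the trajectory contains 19000 frames, which the file name suggests but the notebook does not confirm.

```python
# Assumption: the trajectory holds 19000 frames (suggested by the file name only).
n_frames = 19000
frame_ids = list(range(n_frames))

# '::95' corresponds to the slice [::95]: start at frame 0 and keep every 95th frame.
selected = frame_ids[::95]

print("Number of selected frames:", len(selected))
print("First selected frames:", selected[:3])
print("Last selected frame:", selected[-1])
```

Under that assumption, the number of selected frames matches the 200 configurations reported above.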


<a name="launch"></a>
## Dataset creation

First, we convert our data, a list of ASE *Atoms* objects, into a *pandas* data frame.
The second file loaded above holds the target values. We transform them before prediction: the quantity we will try to predict is the *adsorption energy*, defined here as the difference between the potential energy of each configuration (which depends on its structure) and the corresponding value from this file. We then subtract the mean of the adsorption energy to center the values.
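
As a minimal sketch of this target transformation, the example below computes the difference and centers it with NumPy. The energy values and the names `potential_energy` and `correction` are hypothetical and chosen only for illustration; they are not taken from the data files.

```python
import numpy as np

# Hypothetical values for illustration only (not taken from the data files).
potential_energy = np.array([-310.2, -310.5, -309.9, -310.1])
correction = np.array([-0.8, -1.0, -0.7, -0.9])

# Target before centering: potential energy minus the value read from the second file.
raw_target = potential_energy - correction

# Center the target by subtracting its mean.
centered_target = raw_target - raw_target.mean()

print(centered_target)
```

After centering, the target has zero mean up to floating-point error, so a model only has to learn deviations from the average.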

In [4]:
get_energy = []
get_position = []

energy_data = data_energy.Energy.tolist()
# Adsorption energy and flattened coordinates of each configuration
for i in range(len(atoms)):  # iterate over the configurations
    get_energy.append(atoms[i].get_potential_energy() - energy_data[i])
    b = np.ravel(atoms[i].get_positions())
    get_position.append(b.tolist())

# Subtract the mean of the adsorption energy
mean_energy = np.mean(get_energy)
adsorption_energy = [energy - mean_energy for energy in get_energy]

# Data frame creation: one column per atom (A1..A42) and coordinate (P1..P3)
name = []
for i in range(1, 43):
    for j in range(1, 4):
        name.append('A' + str(i) + ' - P' + str(j))

data_position = pd.DataFrame(get_position, columns=name)
data_position['Energy'] = pd.Series(adsorption_energy)
data_position.head()

| | A1 - P1 | A1 - P2 | A1 - P3 | A2 - P1 | A2 - P2 | A2 - P3 | A3 - P1 | A3 - P2 | A3 - P3 | A4 - P1 | ... | A40 - P1 | A40 - P2 | A40 - P3 | A41 - P1 | A41 - P2 | A41 - P3 | A42 - P1 | A42 - P2 | A42 - P3 | Energy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2.064315 | 0.329469 | 6.646192 | 0.453543 | -2.909306 | 4.566102 | 0.078484 | -2.056794 | 5.019568 | 0.058888 | ... | 0.649065 | -1.836236 | 11.482952 | 1.992832 | 0.398697 | 3.203989 | 4.855793 | 1.956977 | 6.490609 | -0.006111 |
| 1 | 2.083317 | 0.513189 | 6.729427 | 0.714838 | -3.324936 | 4.664158 | 0.038516 | -2.496338 | 5.014388 | 0.875942 | ... | 0.608554 | -1.94871 | 11.346007 | 2.023643 | 0.361726 | 3.367719 | 4.901972 | 1.879437 | 6.41756 | -0.046628 |
| 2 | 1.984442 | 0.548754 | 6.592113 | 1.399551 | -3.581412 | 4.86789 | 2.280274 | -3.28676 | 5.424982 | -2.701399 | ... | 0.558757 | -1.861783 | 11.471564 | 1.925063 | 0.470993 | 3.316946 | 4.761561 | 1.903112 | 6.411927 | -0.070909 |
| 3 | 2.055063 | 0.382656 | 6.701059 | 0.694871 | -3.164923 | 4.59699 | 1.086989 | -2.834942 | 5.531246 | -3.254791 | ... | 0.615444 | -1.846 | 11.470749 | 1.985753 | 0.429084 | 3.273696 | 4.81036 | 1.948595 | 6.534493 | -0.036038 |
| 4 | 2.013034 | 0.428079 | 6.662569 | 1.162095 | -3.648593 | 4.892707 | 1.750001 | -3.989799 | 5.736131 | -3.180766 | ... | 0.698391 | -1.726378 | 11.341388 | 1.899574 | 0.447717 | 3.283343 | 4.61357 | 2.040584 | 6.539217 | -0.030054 |
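
The flattening and column naming used for this dataset can also be sketched in vectorized form. The example below is an illustration on a small random array, not the notebook's data: `positions`, `features`, `toy_frame` and the toy sizes are assumptions introduced here. It shows that row-major flattening places the three coordinates of atom 1 first, then those of atom 2, which is the order the `A{i} - P{j}` labels expect.

```python
import numpy as np
import pandas as pd

# Toy sizes for illustration; the notebook uses 200 configurations of 42 atoms.
n_configs, n_atoms = 2, 3

# Random stand-in for the stacked coordinates, shape (configurations, atoms, 3).
rng = np.random.default_rng(0)
positions = rng.normal(size=(n_configs, n_atoms, 3))

# Row-major flattening: one row per configuration, 3 * n_atoms columns.
features = positions.reshape(n_configs, n_atoms * 3)

# Same naming scheme as the notebook: atom index first, then coordinate index.
columns = [f"A{i} - P{j}" for i in range(1, n_atoms + 1) for j in range(1, 4)]
toy_frame = pd.DataFrame(features, columns=columns)

print(toy_frame.columns.tolist())
```

Each column `A{i} - P{j}` then holds coordinate `j` of atom `i`, for example `toy_frame.loc[0, "A2 - P1"]` equals `positions[0, 1, 0]`.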


<a name="launch"></a>
## Dataset copy

We save the final dataset so that it can be reused later without re-running the cells above, which saves computation time.  
We store it in [pickle](https://docs.python.org/3/library/pickle.html#module-pickle) format.

In [5]:
data_position.to_pickle("./data_position.pkl")

In [6]:
data_position_new = pd.read_pickle("./data_position.pkl")
data_position_new.head()

| | A1 - P1 | A1 - P2 | A1 - P3 | A2 - P1 | A2 - P2 | A2 - P3 | A3 - P1 | A3 - P2 | A3 - P3 | A4 - P1 | ... | A40 - P1 | A40 - P2 | A40 - P3 | A41 - P1 | A41 - P2 | A41 - P3 | A42 - P1 | A42 - P2 | A42 - P3 | Energy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2.064315 | 0.329469 | 6.646192 | 0.453543 | -2.909306 | 4.566102 | 0.078484 | -2.056794 | 5.019568 | 0.058888 | ... | 0.649065 | -1.836236 | 11.482952 | 1.992832 | 0.398697 | 3.203989 | 4.855793 | 1.956977 | 6.490609 | -0.006111 |
| 1 | 2.083317 | 0.513189 | 6.729427 | 0.714838 | -3.324936 | 4.664158 | 0.038516 | -2.496338 | 5.014388 | 0.875942 | ... | 0.608554 | -1.94871 | 11.346007 | 2.023643 | 0.361726 | 3.367719 | 4.901972 | 1.879437 | 6.41756 | -0.046628 |
| 2 | 1.984442 | 0.548754 | 6.592113 | 1.399551 | -3.581412 | 4.86789 | 2.280274 | -3.28676 | 5.424982 | -2.701399 | ... | 0.558757 | -1.861783 | 11.471564 | 1.925063 | 0.470993 | 3.316946 | 4.761561 | 1.903112 | 6.411927 | -0.070909 |
| 3 | 2.055063 | 0.382656 | 6.701059 | 0.694871 | -3.164923 | 4.59699 | 1.086989 | -2.834942 | 5.531246 | -3.254791 | ... | 0.615444 | -1.846 | 11.470749 | 1.985753 | 0.429084 | 3.273696 | 4.81036 | 1.948595 | 6.534493 | -0.036038 |
| 4 | 2.013034 | 0.428079 | 6.662569 | 1.162095 | -3.648593 | 4.892707 | 1.750001 | -3.989799 | 5.736131 | -3.180766 | ... | 0.698391 | -1.726378 | 11.341388 | 1.899574 | 0.447717 | 3.283343 | 4.61357 | 2.040584 | 6.539217 | -0.030054 |
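
To check that a pickle round trip preserves the dataset, one option is to compare the reloaded frame with the original. The sketch below does this on a small hypothetical frame written to a temporary directory; the frame `toy`, its values and the file name `toy_position.pkl` are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Small hypothetical frame standing in for data_position.
toy = pd.DataFrame({"A1 - P1": [2.06, 2.08], "Energy": [-0.006, -0.047]})

# Write to a temporary directory so the example leaves no file behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_position.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises an AssertionError if values, dtypes, index or columns differ.
pd.testing.assert_frame_equal(toy, restored)
print("Round trip preserved the data frame.")
```

Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.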


---
Back to [top](#top).