# Project: Data Retrieval and Manipulation
Submitted by: Mayank Agrawal and Abhinav Malhotra (Group 11)

In this part of the project, we write functions to retrieve and manipulate the project data, which is stored in .csv and .xyz files. All the data was downloaded from the Kaggle competition page 'https://www.kaggle.com/c/nomad2018-predict-transparent-conductors/data'. 
A copy of the data is stored in a Google Drive folder that can be accessed using the following link: https://drive.google.com/drive/folders/1YQ17lTCj9D_Xpo0S_NSlxRM04YjsRP-2 

Data downloaded:
1. train.csv file
2. test.csv file
3. train folder
4. test folder

The 'train.csv' file holds the training data set for 2400 conductors, including the values of the quantities to predict (bandgaps and formation energies). The 'test.csv' file holds the data for 600 conductors on which to test our learned model. The columns of these files are described in the section below, where pandas dataframes are created so that all the data can be accessed using a structure's id. 

Besides the csv files, the data contains .xyz structure files that describe the atomic positions in each material. The folder 'train' has 2400 subfolders, each named after the 'id' of a structure and containing the corresponding 'geometry.xyz' file. In the same way, the folder 'test' has 600 subfolders, one per test material. In this submission, functions are written to retrieve this data and convert each xyz file to a csv file, stored in the folders 'train_xyz2csv' and 'test_xyz2csv' respectively. Each csv file is named after the id of the structure and contains the element label and the fractional coordinates of every atom. Note that the atomic coordinates are converted from Cartesian to fractional so that we have normalized data to work with. 


### Instructions to access the data:
The same Jupyter notebook file is available in the Google Drive folder (link given above). If the notebook is run from anywhere else, it must first be copied into that folder, because the functions defined here read the data through paths relative to it.


## Column names for dataframe

index: id = id of the crystal structure, which ranges from 1 to 2400 for the training dataset and from 1 to 600 for the test dataset

0: spacegroup = spacegroup of the crystal structure

1: N_total = number of total atoms in the unit cell

2: x_Al = percent_atom_al

3: x_Ga = percent_atom_ga

4: x_In = percent_atom_in

5: a = lattice_vector_1_ang

6: b = lattice_vector_2_ang

7: c = lattice_vector_3_ang

8: alpha = lattice_angle_alpha_degree

9: beta = lattice_angle_beta_degree

10: gamma = lattice_angle_gamma_degree

11: del_Hf = formation_energy_ev_natom (available only for training data)

12: bandgap = bandgap_energy_ev (available only for training data)

In [1]:
import pandas as pd
import numpy as np

# Importing training and test data csv files to pandas dataframes
column_names=['spacegroup','N_total','x_Al','x_Ga','x_In','a','b','c','alpha', 'beta','gamma', 'del_Hf', 'bandgap']
df_train = pd.read_csv('./train.csv', header=0, index_col = 0, names = column_names)

df_test = pd.read_csv('./test.csv', header=0, index_col = 0, names = column_names[0:11])

print('Training dataset shape: ', df_train.shape)
#print(df_train.head(10))

print('\nTest dataset shape: ', df_test.shape)
#print(df_test.head(10))

Training dataset shape:  (2400, 13)

Test dataset shape:  (600, 11)
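
To illustrate how the id index and the short column names are meant to be used, here is a minimal sketch of id-based lookups. It does not read the project files: the two rows in `csv_text` are made-up placeholder values, the long header names for the first two columns are an assumption, and `df_demo`, `short_names` and `al_rich` are hypothetical names used only for this example.

```python
import io
import pandas as pd

# Two placeholder rows laid out like test.csv (NOT project data; values are made up)
csv_text = """id,spacegroup,number_of_total_atoms,percent_atom_al,percent_atom_ga,percent_atom_in,lattice_vector_1_ang,lattice_vector_2_ang,lattice_vector_3_ang,lattice_angle_alpha_degree,lattice_angle_beta_degree,lattice_angle_gamma_degree
1,33,80,0.25,0.25,0.50,10.0,9.0,9.5,90.0,90.0,90.0
2,194,40,0.50,0.50,0.00,6.0,6.0,12.0,90.0,90.0,120.0
"""

short_names = ['spacegroup', 'N_total', 'x_Al', 'x_Ga', 'x_In',
               'a', 'b', 'c', 'alpha', 'beta', 'gamma']

# Use the first column (id) as the index, then swap in the short column names
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
df_demo.columns = short_names

# All properties of the structure with id 2
row = df_demo.loc[2]

# Only the lattice lengths of that structure
lattice = df_demo.loc[2, ['a', 'b', 'c']]

# Ids of all structures with at least 50% Al
al_rich = df_demo[df_demo['x_Al'] >= 0.5]

print(row['gamma'])
print(lattice.tolist())
print(al_rich.index.tolist())
```

The same `.loc[id]` pattern is what the coordinate functions in this notebook rely on when they look up the lattice parameters of a structure.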



## Importing atom coordinates for a structure to csv files

The functions defined in this section import the coordinates of all atoms from the xyz file of a structure, given its id and the dataset it belongs to (training set or test set). The Cartesian coordinates are converted to fractional coordinates before being stored in the corresponding csv file. For further calculations, the resulting csv files can be used instead of repeatedly calling the functions to retrieve the data from the xyz files. 

In [2]:
# Function to build the matrix that converts cartesian coordinates to fractional coordinates
def get_cartesian_to_fractional_matrix(id_structure,df):
    # Lattice parameters of the structure (angles converted from degrees to radians)
    row = df.loc[id_structure]
    a = float(row['a'])
    b = float(row['b'])
    c = float(row['c'])
    alpha = np.deg2rad(float(row['alpha']))
    beta = np.deg2rad(float(row['beta']))
    gamma = np.deg2rad(float(row['gamma']))
    cosa = np.cos(alpha)
    cosb = np.cos(beta)
    cosg = np.cos(gamma)
    sing = np.sin(gamma)
    # Unit-cell volume divided by a*b*c
    volume = 1.0 - cosa**2.0 - cosb**2.0 - cosg**2.0 + 2.0 * cosa * cosb * cosg
    volume = np.sqrt(volume)
    # Upper-triangular conversion matrix; it assumes that lattice vector a lies
    # along x and that lattice vector b lies in the xy-plane
    r = np.zeros((3, 3))
    r[0, 0] = 1.0 / a
    r[0, 1] = -cosg / (a * sing)
    r[0, 2] = (cosa * cosg - cosb) / (a * volume * sing)
    r[1, 1] = 1.0 / (b * sing)
    r[1, 2] = (cosb * cosg - cosa) / (b * volume * sing)
    r[2, 2] = sing / (c * volume)
    return r


def xyz2df(id_structure=1, df=df_train):
    ''' 
    Function to extract fractional coordinates of all atoms from the xyz files to a pandas dataframe
    @ param id_structure = id of the conductor that ranges 1-2400 for training dataset and 1-600 for test dataset
    @ param df = df_train or df_test; the data type ('train' or 'test') is inferred from its number of columns
    '''
    # Only the training dataframe has the two target columns (13 columns instead of 11)
    if df.shape[1]>12:
        data_type = 'train'
    else:
        data_type = 'test'
    # The with-block closes the file once its lines have been read
    with open('./'+data_type+'/'+str(id_structure)+'/geometry.xyz') as file_open:
        lines = file_open.readlines()
    
    # Matrix to convert cartesian coordinates to fractional coordinates
    r_matrix = get_cartesian_to_fractional_matrix(id_structure, df)
    # Collecting the coordinates of all atoms in a list of rows
    # (DataFrame.append was removed in pandas 2.0, so the dataframe is built once at the end)
    rows = []
    for line in lines:
        elements = line.split()
        # Atom lines have the form: atom <x> <y> <z> <element>
        if elements and elements[0] == 'atom':
            xyz_cart = [float(elements[1]), float(elements[2]), float(elements[3])]
            xyz_frac = np.matmul(r_matrix,xyz_cart)
            rows.append({'atom':elements[4], 'x':xyz_frac[0], \
                         'y':xyz_frac[1], 'z':xyz_frac[2]})
    return pd.DataFrame(rows, columns=['atom','x','y','z'])

xyz2df().head(10)

# The following for loops were used to convert xyz files to csv files and store them. 
# for i in df_train.index:
#     df2write = xyz2df(i, df_train)
#     df2write.to_csv('./train_xyz2csv/'+str(i)+'.csv', index=False)

# for i in df_test.index:
#     df2write = xyz2df(i, df_test)
#     df2write.to_csv('./test_xyz2csv/'+str(i)+'.csv', index=False)

  atom         x         y         z
0   Ga  0.161707  0.850947  0.695522
1   Al  0.661697  0.848185  0.693634
2   Al  0.345234  0.147329  0.195545
3   Ga  0.845223  0.144567  0.193657
4   Ga  0.096062  0.350708  0.196279
5   Al  0.596052  0.347945  0.194391
6   Al  0.410879  0.647569  0.694788
7   Al  0.910868  0.644806  0.692900
8   Al  0.092180  0.657496 -0.001022
9   Ga  0.592170  0.654733 -0.002910
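
One way to sanity-check the Cartesian-to-fractional conversion is to confirm that the conversion matrix is the inverse of the cell matrix built from the same lattice parameters. The sketch below assumes the usual convention that lattice vector a lies along x and b lies in the xy-plane. The helper names `cell_matrix` and `cartesian_to_fractional_matrix` are hypothetical: the second one repeats the entries used in this notebook but takes the lattice parameters directly instead of a dataframe. The first parameter set is the one printed for test structure 1 in this notebook; the second is a made-up non-orthogonal cell.

```python
import numpy as np

def cell_matrix(a, b, c, alpha_deg, beta_deg, gamma_deg):
    """Cell matrix whose columns are the lattice vectors (a along x, b in the xy-plane)."""
    alpha, beta, gamma = np.deg2rad([alpha_deg, beta_deg, gamma_deg])
    cosa, cosb, cosg, sing = np.cos(alpha), np.cos(beta), np.cos(gamma), np.sin(gamma)
    v = np.sqrt(1.0 - cosa**2 - cosb**2 - cosg**2 + 2.0 * cosa * cosb * cosg)
    return np.array([
        [a,   b * cosg, c * cosb],
        [0.0, b * sing, c * (cosa - cosb * cosg) / sing],
        [0.0, 0.0,      c * v / sing],
    ])

def cartesian_to_fractional_matrix(a, b, c, alpha_deg, beta_deg, gamma_deg):
    """Conversion matrix with the same entries as the one used in this notebook."""
    alpha, beta, gamma = np.deg2rad([alpha_deg, beta_deg, gamma_deg])
    cosa, cosb, cosg, sing = np.cos(alpha), np.cos(beta), np.cos(gamma), np.sin(gamma)
    v = np.sqrt(1.0 - cosa**2 - cosb**2 - cosg**2 + 2.0 * cosa * cosb * cosg)
    r = np.zeros((3, 3))
    r[0, 0] = 1.0 / a
    r[0, 1] = -cosg / (a * sing)
    r[0, 2] = (cosa * cosg - cosb) / (a * v * sing)
    r[1, 1] = 1.0 / (b * sing)
    r[1, 2] = (cosb * cosg - cosa) / (b * v * sing)
    r[2, 2] = sing / (c * v)
    return r

# Lattice parameters of test structure 1 (a, b, c, alpha, beta, gamma)
params = (10.5381, 9.0141, 9.6361, 89.9997, 90.0003, 90.0006)
# Made-up non-orthogonal cell, used only to exercise the off-diagonal terms
params_hex = (3.0, 3.0, 5.0, 90.0, 90.0, 120.0)

r = cartesian_to_fractional_matrix(*params)
m = cell_matrix(*params)

# The conversion matrix should be the inverse of the cell matrix
is_inverse = np.allclose(r @ m, np.eye(3))
is_inverse_hex = np.allclose(
    cartesian_to_fractional_matrix(*params_hex) @ cell_matrix(*params_hex), np.eye(3)
)

# Round trip for one position: fractional -> cartesian -> fractional
frac = np.array([0.25, 0.50, 0.75])
round_trip = r @ (m @ frac)

print(is_inverse, is_inverse_hex, np.allclose(round_trip, frac))
```

This check only confirms that the matrix matches the assumed cell orientation. If the lattice vectors stored in an xyz file are oriented differently, the fractional coordinates obtained this way would differ from those computed from the stored vectors.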



## Function to import .csv files for atomic coordinates into pandas dataframe

This section defines a function that, given the id of a structure, loads all of its atomic coordinates from the stored csv file into a pandas dataframe.

In [3]:
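def fractional_coordinates_csv2df(structure_id=1, data_type = 'train'):
    '''
    Function to read the fractional coordinates of a structure from its csv file to a pandas dataframe
    @ param structure_id = id of the structure
    @ param data_type = 'train' or 'test'
    '''
    try:
        df_xyz = pd.read_csv('./'+data_type+'_xyz2csv/'+str(structure_id)+'.csv',\
                             header=0, names = ['atom', 'x', 'y', 'z'])
        return df_xyz
    except FileNotFoundError:
        # pandas raises FileNotFoundError (not ValueError) when the csv file does not exist
        print('Data not found for the given structure id and data type.')
        return None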
    
print(df_test.loc[1,['a','b','c','alpha','beta','gamma','N_total']])
print(fractional_coordinates_csv2df(1, 'test').head(10))

a          10.5381
b           9.0141
c           9.6361
alpha      89.9997
beta       90.0003
gamma      90.0006
N_total    80.0000
Name: 1, dtype: float64
  atom         x         y         z
0   In  0.162177  0.852531  0.693671
1   In  0.662165  0.850221  0.690835
2   In  0.345430  0.148048  0.194619
3   In  0.845418  0.145737  0.191783
4   In  0.096117  0.351201  0.195460
5   In  0.596105  0.348890  0.192624
6   Ga  0.411490  0.649378  0.692830
7   Al  0.911478  0.647067  0.689994
8   Ga  0.091714  0.657580 -0.002388
9   In  0.591702  0.655269 -0.005225
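
The write-once, read-many pattern behind the `train_xyz2csv` and `test_xyz2csv` folders can be sketched as below. This is an illustration under stated assumptions, not the notebook's own code: it writes to a temporary directory instead of the project folder, `load_fractional_coordinates` and its `root` argument are hypothetical, and the three coordinate rows are copied from the output for training structure 1 shown in this notebook.

```python
import os
import tempfile
import pandas as pd

def load_fractional_coordinates(structure_id, data_type, root):
    """Return the stored coordinates as a dataframe, or None if no csv file exists."""
    path = os.path.join(root, data_type + '_xyz2csv', str(structure_id) + '.csv')
    if not os.path.exists(path):
        return None
    return pd.read_csv(path)

# First three atoms of training structure 1
coords = pd.DataFrame({
    'atom': ['Ga', 'Al', 'Al'],
    'x': [0.161707, 0.661697, 0.345234],
    'y': [0.850947, 0.848185, 0.147329],
    'z': [0.695522, 0.693634, 0.195545],
})

with tempfile.TemporaryDirectory() as root:
    # Store the coordinates once; index=False keeps the row index out of the file
    os.makedirs(os.path.join(root, 'train_xyz2csv'))
    coords.to_csv(os.path.join(root, 'train_xyz2csv', '1.csv'), index=False)

    # Later calculations read the csv file instead of parsing the xyz file again
    loaded = load_fractional_coordinates(1, 'train', root)
    missing = load_fractional_coordinates(9999, 'train', root)

print(loaded)
print(missing)
```

Writing with `index=False` is what keeps an extra `Unnamed: 0` column from appearing when the file is read back.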


## Data in usable form:
1. train.csv
2. test.csv
3. train_xyz2csv folder
4. test_xyz2csv folder