# Generating a DF with PyMatGen

This notebook reuses code written by Karen, C. Its main objective is to extract the relevant information from the CIF files and store it as a DataFrame. The relevant information is:

* Crystal system, space group, lattice parameters and Wyckoff sites.

## Versions:

This code requires older versions of the following libraries:

In [None]:
#! pip install monty==2021.12.1 plotly==5.4.0 pymatgen==2020.6.8 ruamel.yaml==0.17.17 ruamel.yaml.clib==0.2.6 spglib==1.16.3 tenacity==8.0.1
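
Because the notebook depends on these exact pins, it can help to confirm what is actually installed before running the rest. The following is a minimal sketch using only the standard library; `PINS` and `check_pins` are hypothetical names introduced here, and the versions are copied from the pip command above.

```python
from importlib import metadata

# Pins copied from the pip command above.
PINS = {
    "monty": "2021.12.1",
    "plotly": "5.4.0",
    "pymatgen": "2020.6.8",
    "ruamel.yaml": "0.17.17",
    "ruamel.yaml.clib": "0.2.6",
    "spglib": "1.16.3",
    "tenacity": "8.0.1",
}


def check_pins(pins):
    """Return {package: (expected, installed)} for every mismatch.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Report every package whose installed version differs from its pin.
for name, (expected, installed) in check_pins(PINS).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty report means the environment matches the pins; any printed line names a package to reinstall.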

## Libraries

In [1]:
import pymatgen as mg
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer
import glob
import pandas as pd
import re
import shutil, os
import numpy as np
import time
import h5py
import matplotlib.pyplot as plt
import random
from tqdm.notebook import tqdm

In [2]:
!pip show pymatgen

Name: pymatgen
Version: 2020.6.8
Summary: Python Materials Genomics is a robust materials analysis code that defines core object representations for structures and molecules with support for many electronic structure codes. It is currently the core analysis code powering the Materials Project (https://www.materialsproject.org).
Home-page: http://www.pymatgen.org
Author: Pymatgen Development Team
Author-email: ongsp@eng.ucsd.edu
License: MIT
Location: /home/bokhimi/.conda/envs/tf-2.11/lib/python3.8/site-packages
Requires: matplotlib, monty, networkx, numpy, palettable, pandas, plotly, requests, ruamel.yaml, scipy, spglib, sympy, tabulate
Required-by: emmet-core, mp-api


## DF Generation

The next function takes the path (a glob pattern) where the CIFs are stored. In the previous notebook we split the COD into three groups: organic, inorganic and errors. This function works exactly the same for organic and inorganic compounds.

In [3]:
def create_database(ruta: str):
    '''
    Generate a DataFrame with the CIF information relevant to the XRP simulation.
    
    Args:
        ruta (str): Glob pattern matching the CIF files to process,
            e.g. /home/bokhimi/COD/notebooks_de_preprocesamiento/vero_test/*.cif

    Returns:
        df: DataFrame with the information of interest for each CIF file matched by ruta.
        errors: List containing the paths of all the files that raised an error.
    '''

    addrs = glob.glob(ruta)
    cif = []
    sg_number = []
    sg_symbol = []
    comp = []
    par1 = []
    par2 = []
    site = []
    system = []
    errors = []
    # len(addrs)
    for i in tqdm(range( len(addrs) ), desc = 'Generating DF: ') :
        addr = addrs[i]
        try:
            analyzer = SpacegroupAnalyzer(mg.Structure.from_file(addr))
            number=analyzer.get_space_group_number()
            symbol=analyzer.get_space_group_symbol()
            cs = analyzer.get_crystal_system()
            wy = str(analyzer.get_symmetrized_structure()).split('\n')
            compound = wy[2].split(' ')[-1]
            abc = list(filter(lambda x: x != "", wy[3].split(' ')))[2:]
            angles = list(filter(lambda x: x != "", wy[4].split(' ')))[1:]
            sites = []
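            # Site rows start at line 8 of the printed structure; a separate
            # loop variable avoids shadowing the outer index i.
            for j in range(8, len(wy)):
                lista = list(filter(lambda x: x != "", wy[j].split(' ')))[1:]
                if lista[0].find(':')<0:
                    lista[0] = lista[0] + ':1'
                sites.append(lista)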
            # File name without directory or extension; safe even if a folder name contains a dot.
            cif.append(os.path.splitext(os.path.basename(addr))[0])
            sg_number.append(number)
            sg_symbol.append(symbol)
            system.append(cs)
            comp.append(compound)
            par1.append(abc)
            par2.append(angles)
            site.append(sites)
            
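        except Exception:
            # Any file that fails to load or analyze is recorded for later inspection.
            # UserWarning is a subclass of Exception, so a separate handler placed
            # after this one would never run.
            errors.append(addr)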

    df = pd.DataFrame({'cif': cif, 'compound': comp,
                       'cs': system, 'sg_number': sg_number, 'sg_symbol': sg_symbol,
                       'abc': par1, 'angles': par2, 'sites': site})
    
    #df.drop_duplicates()
    return df, errors
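
The text-parsing steps inside `create_database` can be tried in isolation. The sketch below runs them on hand-written lines that imitate the printed layout of a symmetrized structure; the sample lines are illustrative (their values are taken from the DataFrame row shown in this notebook), not captured pymatgen output, and `tokens` and `parse_site` are hypothetical helper names. `str.split()` with no argument is used in place of the `filter` idiom; both drop empty fields when the line contains only spaces.

```python
# Hand-written lines that imitate the printed layout; not real pymatgen output.
sample = [
    "Reduced Formula: ZnH15C8N7Cl2O5",
    "abc   :   6.213000  17.514000   7.173000",
    "angles:  90.000000  99.330000  90.000000",
    "  0  Zn    0.29901  0.19128  0.49746  2a",
]


def tokens(line):
    """Split on runs of whitespace, dropping empty strings."""
    return line.split()


def parse_site(line):
    """Drop the row index and append ':1' when the species has no ':' suffix."""
    fields = tokens(line)[1:]
    if ":" not in fields[0]:
        fields[0] += ":1"
    return fields


compound = sample[0].split(" ")[-1]  # last space-separated field of the formula line
abc = tokens(sample[1])[2:]          # skip "abc" and ":"
angles = tokens(sample[2])[1:]       # skip "angles:"
site = parse_site(sample[3])

print(compound, abc, angles, site)
```

Note that the parsed values stay as strings, which is also how they are stored in the `abc`, `angles` and `sites` columns.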

In [7]:
df_raw, errors = create_database('/home/bokhimi/COD/notebooks_de_preprocesamiento/vero_test/*.cif') # Example

Generating DF:   0%|          | 0/1 [00:00<?, ?it/s]

_audit_creation_date   1989-05-23
_audit_creation_method   CSD-ConQuest-V1
_database_code_CSD   GEXGUL
_chemical_formula_sum   'C4 H12 N5 O3.5 S1'
_chemical_formula_moiety   'C4 H11 N5 O3 S1,0.5(H2 O1)'
_journal_coeditor_code   'IUCr BX0221'
_journal_coden_Cambridge   591
_journal_volume   44
_journal_year   1988
_journal_page_first   1452
_journal_name_full   'Acta Crystallogr.,Sect.C:Cryst.Struct.Commun.'
loop_
 _publ_author_name
  J.M.Amigo
  J.M.Martinez-Calatayud
  A.Cantarero
  T.Debaerdemaeker
_chemical_name_systematic   '1-(2-Sulfoethyl)biguanide hemihydrate'
_cell_volume   1844.671
_exptl_crystal_density_diffrn   1.505
_exptl_special_details
'Fw of 209.1 and dx of 1.505 are for the unsolvated complex'
_diffrn_ambient_temperature   ?
_diffrn_special_details
'The study was carried out at room temperature,in the range 283-303K'
_refine_ls_R_factor_gt   0.0553
_refine_ls_wR_factor_gt   0.0553
_symmetry_cell_setting   monoclinic
_symmetry_space_group_name_H-M   'C 2/c'
_symmetry_In

We can now inspect the first five rows of the DataFrame:

In [8]:
df_raw.head()

| | cif | compound | cs | sg_number | sg_symbol | abc | angles | sites |
|---|---|---|---|---|---|---|---|---|
| 0 | search_Metformin | ZnH15C8N7Cl2O5 | monoclinic | 4 | P2_1 | [6.213000, 17.514000, 7.173000] | [90.000000, 99.330000, 90.000000] | [[Zn:1, 0.29901, 0.19128, 0.49746, 2a], [H:1, ... |


In [9]:
df_raw.shape

(1, 8)

We also build a DataFrame containing only the CIF identifiers:

In [10]:
# Selecting with a list of column names returns a one-column DataFrame.
cifs_raw = df_raw[['cif']].copy()

In [11]:
cifs_raw.shape

(1, 1)

Save both DataFrames in Parquet format:

In [12]:
df_raw.to_parquet('/home/bokhimi/COD/database/dataframes/df_raw_test.parquet', index=False)

In [13]:
cifs_raw.to_parquet('/home/bokhimi/COD/database/dataframes/cifs_raw_test.parquet', index=False)

Finally, move the files that raised errors into the errors folder:

In [14]:
for path_er in tqdm(errors, desc = 'Moving errors'):
    if os.path.isfile(path_er):
        try:
            shutil.move(path_er, os.path.join('/home/bokhimi/COD/database/errors', os.path.basename(path_er) ) )
        except Exception as e:
            print(e)

Moving errors: 0it [00:00, ?it/s]
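
The move step can be rehearsed on throwaway directories before pointing it at the real database. This is a minimal sketch, assuming only the standard library; `move_failed_files` is a hypothetical helper, and unlike the cell above it creates the destination folder if missing and returns the paths it could not move.

```python
import os
import shutil
import tempfile


def move_failed_files(paths, dest_dir):
    """Move each existing file in `paths` into `dest_dir`.

    Returns the list of paths that were missing or could not be moved.
    """
    os.makedirs(dest_dir, exist_ok=True)
    not_moved = []
    for path in paths:
        if not os.path.isfile(path):
            not_moved.append(path)
            continue
        try:
            shutil.move(path, os.path.join(dest_dir, os.path.basename(path)))
        except OSError as exc:  # shutil.Error is a subclass of OSError
            print(exc)
            not_moved.append(path)
    return not_moved


# Demo on a temporary directory: one real file and one path that does not exist.
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "bad.cif")
    open(src, "w").close()
    errors_dir = os.path.join(root, "errors")
    leftover = move_failed_files([src, os.path.join(root, "missing.cif")], errors_dir)
    moved_ok = os.path.isfile(os.path.join(errors_dir, "bad.cif"))

print(moved_ok, leftover)
```

Returning the leftover paths makes it easy to see which error files still sit among the valid CIFs after the loop finishes.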