# CIF Split into Organic and Inorganic Compounds

We present code that iterates over the Crystallography Open Database (COD) and splits the compounds into organic and inorganic based on their carbon content. We discard all files that cannot be read with PyCifRW and pymatgen.

To filter the compounds, we treat the first block of the CIF as a dictionary and read the key `_chemical_formula_sum`, which holds the chemical formula as a string. To detect carbon, we test whether 'C' appears in that string. Some inorganic compounds also contain carbon; under this criterion they are classified as organic. Note that this substring test also matches element symbols that begin with C, such as Ca, Cl or Cu.

## Libraries

In [1]:
import os 
from CifFile import CifFile, ReadCif # 4.4.6 version
import shutil
from tqdm import tqdm # To visualize the iteration 
from pathlib import Path
#!jupyter nbextension enable --py widgetsnbextension


In [6]:
pip show PyCifRW

Name: PyCifRW
Version: 4.4.6
Summary: CIF/STAR file support for Python
Home-page: https://github.com/jamesrhester/pycifrw/blob/development/README.md
Author: James Hester
Author-email: jamesrhester@gmail.com
License: Python 2.0
Location: /home/bokhimi/.conda/envs/tf-2.11/lib/python3.8/site-packages
Requires: numpy, ply
Required-by: 
Note: you may need to restart the kernel to use updated packages.


## Example

Declare the path to the _cif_ file and read it with the ReadCif function.

In [None]:
cif_data = ReadCif('./2300563.cif') 

Extract the relevant properties of the first block:

In [None]:
structure = cif_data.first_block() # Extract the information

The available keys are:

In [None]:
structure.keys()

We are going to filter by carbon content. The presence of carbon is written in the chemical formula with the letter C.

In [None]:
formula = structure['_chemical_formula_sum']
formula

In [None]:
type(formula) 

In [None]:
'C' in formula
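
The substring test above is also true for formulas that contain Ca, Cl, Cu and other symbols starting with C. As a minimal sketch of a stricter check, assuming the formula is written as space-separated element symbols each followed by an optional count (as in 'C12 H22 O11'), one could match whole tokens instead. The names `formula_has_carbon` and `_CARBON_TOKEN` are hypothetical and not part of the original notebook.

```python
import re

# Hypothetical pattern: a token that is exactly the symbol C with an optional
# count, e.g. 'C', 'C12' or 'C0.5', but not 'Ca', 'Cl' or 'Cu'.
_CARBON_TOKEN = re.compile(r'C(\d+(\.\d+)?)?')

def formula_has_carbon(formula: str) -> bool:
    """Return True if a space-separated formula contains a carbon token."""
    return any(_CARBON_TOKEN.fullmatch(token) for token in formula.split())

print(formula_has_carbon('C12 H22 O11'))  # True
print('C' in 'Ca F2')                     # True: the substring test matches calcium
print(formula_has_carbon('Ca F2'))        # False
```

Whether this stricter test is preferable depends on how the resulting split will be used; it would change the counts reported later in the notebook.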

Define a function that returns True if the compound's formula contains carbon:

In [None]:
def does_it_has_carbon(path: str) -> bool:
    '''
    Returns True if the compound's formula contains the letter 'C'.

    Args:
        path (str): Path to the CIF file

    Returns:
        bool: Whether 'C' appears in the _chemical_formula_sum of the first block
    '''
    return 'C' in ReadCif(path).first_block()['_chemical_formula_sum']
    

In [None]:
does_it_has_carbon('/home/bokhimi/COD/database/cif/1/00/02/1000229.cif')

## Iteration:

Specify the path where the database is stored:

In [None]:
path = '/home/COD/database'

In this path, we create three folders for the organic compounds, the inorganic compounds and the files that raised errors.

In [None]:
organic = os.path.join(path, 'organic')
inorganic = os.path.join(path, 'inorganic')
er_f = os.path.join(path, 'errors')

os.makedirs(organic, exist_ok=True)
os.makedirs(inorganic, exist_ok=True)
os.makedirs(er_f, exist_ok=True)
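
The same folder setup can be sketched with `pathlib`, which the notebook already imports. This is only an illustration run inside a temporary directory; the helper name `make_output_dirs` is hypothetical.

```python
import tempfile
from pathlib import Path

def make_output_dirs(base, names=('organic', 'inorganic', 'errors')):
    """Create one subfolder per name under base and return them as a dict."""
    dirs = {name: Path(base) / name for name in names}
    for folder in dirs.values():
        # exist_ok=True makes the call safe to repeat on later runs
        folder.mkdir(parents=True, exist_ok=True)
    return dirs

# Demonstration in a throwaway directory instead of the real database path
with tempfile.TemporaryDirectory() as tmp:
    out = make_output_dirs(tmp)
    print(sorted(p.name for p in out.values()))  # ['errors', 'inorganic', 'organic']
```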


Get the path where the CIF files are stored:

In [None]:
path_dataset = Path(path) / 'cif'

Now we iterate over the dataset folder. Because the main folder contains a substantial number of subfolders, this process may take a while. The path of each compound is collected into one of three separate lists.

In [None]:
organic_list = []
inorganic_list = []
errors = []

i = 0  # Number of CIF files visited
for filename in tqdm(path_dataset.rglob('*'), desc='Splitting compounds...'):  # Walk every subfolder
    if filename.suffix == '.cif':  # Keep only CIF files
        i += 1
        path_cif = str(filename)

        try:
            # Read the file once and classify it by carbon content
            if does_it_has_carbon(path_cif):
                organic_list.append(path_cif)
            else:
                inorganic_list.append(path_cif)
        except Exception:
            # The file could not be read or lacks the formula key
            errors.append(path_cif)
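
The loop above can also be sketched as a reusable function that takes the classification test as an argument, which makes it possible to try the logic on a few dummy files before running it on the whole database. The names `split_by_predicate` and `toy_has_carbon` are hypothetical; in the notebook, `does_it_has_carbon` would be passed as the predicate.

```python
import tempfile
from pathlib import Path

def split_by_predicate(root, has_carbon):
    """Sort every *.cif file under root into three lists of path strings."""
    organic, inorganic, failed = [], [], []
    for cif_path in sorted(Path(root).rglob('*.cif')):
        try:
            target = organic if has_carbon(str(cif_path)) else inorganic
        except Exception:
            # Any failure while reading or parsing sends the file to `failed`
            target = failed
        target.append(str(cif_path))
    return organic, inorganic, failed

def toy_has_carbon(path):
    """Stand-in predicate for the demo: inspects the raw text of the file."""
    text = Path(path).read_text()
    if not text:
        raise ValueError('empty file')
    return 'carbon' in text

with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / '1' / '00'
    sub.mkdir(parents=True)
    (sub / 'a.cif').write_text('carbon')
    (sub / 'b.cif').write_text('salt')
    (sub / 'c.cif').write_text('')
    (sub / 'notes.txt').write_text('ignored')
    org, inorg, bad = split_by_predicate(tmp, toy_has_carbon)
    print(len(org), len(inorg), len(bad))  # 1 1 1
```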



In [None]:
print('''Number of CIF files: {}

Number of organic compounds: {}

Number of inorganic compounds: {}

Number of errors: {}'''.format(i,
                               len(organic_list),
                               len(inorganic_list),
                               len(errors)))

With the paths collected, we now copy the files into the folders created earlier:

In [None]:
for path_org in tqdm(organic_list, desc='Copying organic'):
    if os.path.isfile(path_org):
        try:
            # Copy the file, keeping its original name
            shutil.copy(path_org, os.path.join(organic, os.path.basename(path_org)))
        except Exception as e:
            print(e)

In [None]:
for path_inorg in tqdm(inorganic_list, desc='Copying inorganic'):
    if os.path.isfile(path_inorg):
        try:
            # Copy the file, keeping its original name
            shutil.copy(path_inorg, os.path.join(inorganic, os.path.basename(path_inorg)))
        except Exception as e:
            print(e)

In [None]:
for path_er in tqdm(errors, desc='Copying errors'):
    if os.path.isfile(path_er):
        try:
            # Copy the file, keeping its original name
            shutil.copy(path_er, os.path.join(er_f, os.path.basename(path_er)))
        except Exception as e:
            print(e)
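
The three copy loops share the same structure, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own code: the name `copy_files` is hypothetical, and it returns the failures as a list instead of printing them.

```python
import shutil
import tempfile
from pathlib import Path

def copy_files(paths, destination):
    """Copy each existing file in paths into destination; return failures."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    failures = []
    for src in map(Path, paths):
        if not src.is_file():
            continue  # Skip paths that no longer point to a file
        try:
            shutil.copy(src, destination / src.name)
        except OSError as exc:
            failures.append((str(src), str(exc)))
    return failures

# Demonstration with a dummy file in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'src' / '1000229.cif'
    src.parent.mkdir()
    src.write_text('data_1000229')
    failed = copy_files([str(src), str(Path(tmp) / 'missing.cif')], Path(tmp) / 'organic')
    print(failed)                                            # []
    print((Path(tmp) / 'organic' / '1000229.cif').exists())  # True
```

In the notebook this would be called once per list, for example with `organic_list` and the `organic` folder.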