Installation of the following featurization schemes:
- RDKit
- ElemNet (onehot)
- Jarvis
- Mat2vec
- Atom2vec
- Magpie
- Oliynyk

Existing featurizers don't perform as well as Dr. Oliynyk's list of properties and featurization code
- See: https://doi.org/10.1007/s40192-020-00179-z

Goal:
- Summarize the available featurizers, install and test each one with the provided dataset (check Teams soon), and compare the results with the code that we will put together collectively.

### RDKit Installation

In [2]:
# https://www.rdkit.org/
# https://github.com/rdkit/rdkit
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors

ModuleNotFoundError: No module named 'rdkit'
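
The ModuleNotFoundError above means RDKit is not visible to the Python environment the kernel is running in. A minimal sketch for checking up front which of the packages used in this notebook are importable is given here. It uses only the standard library; REQUIRED_MODULES and missing_modules are illustrative names chosen for this example, not part of any of these packages.

```python
import importlib.util

# Import names used in this notebook (these are not always the pip package names).
REQUIRED_MODULES = ["rdkit", "mat2vec", "gensim", "jarvis"]


def missing_modules(names):
    """Return the entries of `names` that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All listed modules are importable.")
```

RDKit is published on PyPI under the name rdkit and is also available from conda-forge, so installing it into the active conda environment (rather than through Homebrew) is probably the simplest way to make the import above succeed.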

I ended up searching online because either the RDKit documentation isn't completely accurate or the Homebrew installation didn't include some modules. I found this repository, courtesy of Dr. Goshu (https://www.youtube.com/watch?v=9i9SY6Nd1Zw):
   
       https://github.com/gashawmg/molecular-descriptors/blob/main/Molecular%20descriptors.ipynb
  
Considering we don't have to reinvent the wheel, we'll repurpose his function below:

In [None]:
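# Define the featurization function
# (needs Chem, Descriptors and MoleculeDescriptors imported from rdkit)

def RDkit_descriptors(smiles):
    """Return (descriptor rows, descriptor names) for an iterable of SMILES strings."""
    # Note: MolFromSmiles returns None for a SMILES string it cannot parse
    mols = [Chem.MolFromSmiles(i) for i in smiles]
    calc = MoleculeDescriptors.MolecularDescriptorCalculator([x[0] for x in Descriptors._descList])
    desc_names = calc.GetDescriptorNames()

    Mol_descriptors = []
    for mol in mols:
        # Add hydrogens to molecules
        mol = Chem.AddHs(mol)
        # Calculate every available descriptor for each molecule
        # (the exact count depends on the RDKit version)
        descriptors = calc.CalcDescriptors(mol)
        Mol_descriptors.append(descriptors)
    return Mol_descriptors, desc_names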

In [3]:
# Function call
Mol_descriptors,desc_names = RDkit_descriptors(editedData['smiles'])

NameError: name 'RDkit_descriptors' is not defined

In [4]:
import pandas as pd

# Extract column names from 'editedData'
editedData_columns = editedData.columns

# Create DataFrame from 'Mol_descriptors' using 'desc_names' as column names
result_df = pd.DataFrame(Mol_descriptors, columns=desc_names)

# Concatenate 'result_df' and 'editedData' horizontally
result_df = pd.concat([result_df, editedData], axis=1)
result_df

NameError: name 'editedData' is not defined

### ElemNet Installation

ElemNet is a deep neural network model that takes only elemental compositions as inputs and leverages artificial intelligence to automatically capture the essential chemistry needed to predict materials properties. It can automatically learn the chemical interactions and similarities between different elements, which allows it to predict the phase diagrams of chemical systems absent from the training dataset more accurately than conventional machine learning models based on physical attributes leveraging domain knowledge.

    https://github.com/NU-CUCIS/ElemNet/blob/master/README.md
    
Also needs Magpie...
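
To make "takes only the elemental compositions as inputs" concrete, here is a minimal sketch of turning a chemical formula into a fixed-length vector of elemental fractions. It is not ElemNet's own preprocessing code: the short ELEMENTS list and the parse_formula and fraction_vector helpers are hypothetical, and the parser only handles simple formulas without parentheses.

```python
import re

# Hypothetical, shortened element list for illustration only; a real model
# fixes its own (much longer) ordered list of elements.
ELEMENTS = ["H", "Li", "O", "Si", "Co"]


def parse_formula(formula):
    """Parse a simple formula such as 'LiCoO2' into {element: amount}.

    Parentheses, hydrates and charges are not handled.
    """
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def fraction_vector(formula, elements=ELEMENTS):
    """Return the elemental fractions of `formula`, ordered as in `elements`."""
    counts = parse_formula(formula)
    unknown = sorted(set(counts) - set(elements))
    if unknown:
        raise ValueError(f"Elements not in the element list: {unknown}")
    total = sum(counts.values())
    return [counts.get(el, 0.0) / total for el in elements]


print(fraction_vector("LiCoO2"))  # [0.0, 0.25, 0.5, 0.0, 0.25]
```

Every composition maps to a vector of the same length, with zeros for absent elements, which is what lets a network consume arbitrary compositions without hand-built physical descriptors.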

In [6]:
# Requirements:
import warnings
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
# Conda env: SciPy, pymatgen, matminer (runs with an older pandas version), tensorflow

### Magpie Installation

Magpie is an extensible platform for using machine learning to predict the properties of materials.

Magpie is also an acronym for “Material-Agnostic Platform for Informatics and Exploration”, and is named after an intelligent bird.

Begin here:
https://bitbucket.org/wolverton/magpie/src/master/README.md
tutorial:
https://wolverton.bitbucket.io/installation.html

I could not get Magpie working and don't yet know why. I spent hours trying to troubleshoot JDK/gradlew issues.

### Mat2Vec Installation

See the following:

    https://github.com/materialsintelligence/mat2vec
    
Downgrade your Python version to 3.8; otherwise DAWG can't be installed.

In [9]:
import mat2vec
from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
text_processor.process("LiCoO2 is a battery cathode material.")

(['CoLiO2', 'is', 'a', 'battery', 'cathode', 'material', '.'],
 [('LiCoO2', 'CoLiO2')])

In [10]:
from mat2vec import gensim
from gensim.models import Word2Vec
w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
w2v_model.wv.most_similar("thermoelectric")

ImportError: cannot import name 'gensim' from 'mat2vec' (/Users/emiljaffal/anaconda3/envs/MLProj/lib/python3.8/site-packages/mat2vec/__init__.py)

Note: gensim is a standalone package, not a submodule of mat2vec, so the first import line is the one that fails. Removing it and keeping "from gensim.models import Word2Vec" should get past this error. The NameErrors in the following cells are a consequence of w2v_model never being created.

In [12]:
w2v_model.wv.most_similar("band_gap", topn=5)

NameError: name 'w2v_model' is not defined

In [13]:
from mat2vec.processing import MaterialsTextProcessor
text_processor = MaterialsTextProcessor()
w2v_model.wv.most_similar(
    positive=["cubic", text_processor.normalized_formula("CdSe")], 
    negative=[text_processor.normalized_formula("GaAs")], topn=1)

NameError: name 'w2v_model' is not defined

### Atom2Vec Installation

A Python implementation of Atom2Vec: a simple way to describe atoms for machine learning.

See the following:

    https://github.com/idocx/Atom2Vec
    
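During installation, pip reported: "The 'sklearn' PyPI package is deprecated, use 'scikit-learn' rather than 'sklearn' for pip commands." The package is installed as scikit-learn but is still imported as sklearn.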
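
The pip message above comes down to a naming mismatch: the name used with pip (the distribution name) can differ from the name used in an import statement. A small sketch, using only the standard library (Python 3.8+), for checking whether a given distribution is installed. installed_version and NAME_PAIRS are hypothetical names made up for this example.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a pip distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# pip (distribution) name on the left, import name on the right
NAME_PAIRS = {"scikit-learn": "sklearn", "jarvis-tools": "jarvis"}

for dist, module in NAME_PAIRS.items():
    print(f"pip name {dist!r} (import name {module!r}): {installed_version(dist)}")
```

If a requirements file lists sklearn, replacing that entry with scikit-learn before running pip is the change the message asks for.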

### Jarvis Installation

JARVIS-Tools is an open-access software package for atomistic data-driven materials design. It can be used for a) setting up calculations, b) analysis and informatics, c) plotting, d) database development and e) web-page development.

See the following:
    
    https://pages.nist.gov/jarvis/#install
    
For tutorials:
    
    https://pages.nist.gov/jarvis/tutorials/
    
All info:

    https://jarvis.nist.gov

In [1]:
import jarvis
from jarvis.core.atoms import Atoms
box = [[2.715, 2.715, 0], [0, 2.715, 2.715], [2.715, 0, 2.715]]
coords = [[0, 0, 0], [0.25, 0.25, 0.25]]
elements = ["Si", "Si"]
Si = Atoms(lattice_mat=box, coords=coords, elements=elements)
density = round(Si.density, 2)
print(density)  # 2.33

from jarvis.db.figshare import data
dft_3d = data(dataset='dft_3d')
print(len(dft_3d))  # 75993


from jarvis.io.vasp.inputs import Poscar
for i in dft_3d:
    atoms = Atoms.from_dict(i['atoms'])
    poscar = Poscar(atoms)
    jid = i['jid']
    filename = 'POSCAR-'+jid+'.vasp'
    poscar.write_file(filename)

dft_2d = data(dataset='dft_2d')
print(len(dft_2d))  # 1109 in the README example; this run printed 1103

for i in dft_2d:
    atoms = Atoms.from_dict(i['atoms'])
    poscar = Poscar(atoms)
    jid = i['jid']
    filename = 'POSCAR-'+jid+'.vasp'
    poscar.write_file(filename)
# Example to parse DOS data from JARVIS-DFT webpages
from jarvis.db.webpages import Webpage
from jarvis.core.spectrum import Spectrum
import numpy as np
new_dist=np.arange(-5, 10, 0.05)
all_atoms = []
all_dos_up = []
all_jids = []
for ii, i in enumerate(dft_3d):
    try:
        w = Webpage(jid=i['jid'])
        edos_data = w.get_dft_electron_dos()
        ens = np.array(edos_data['edos_energies'].strip("'").split(','), dtype='float')
        tot_dos_up = np.array(edos_data['total_edos_up'].strip("'").split(','), dtype='float')
        s = Spectrum(x=ens, y=tot_dos_up)
        interp = s.get_interpolated_values(new_dist=new_dist)
        atoms = Atoms.from_dict(i['atoms'])
        ase_atoms = atoms.ase_converter()  # needs ASE; otherwise a notice is printed
        all_dos_up.append(interp)
        all_atoms.append(atoms)
        # Record the jid only on success so the three lists stay aligned
        all_jids.append(i['jid'])
        filename = i['jid'] + '.cif'
        atoms.write_cif(filename)
        break  # stop after the first structure that parses successfully
    except Exception as exp:
        print(exp, i['jid'])

2.33
Obtaining 3D dataset 76k ...
Reference:https://www.nature.com/articles/s41524-020-00440-1
Other versions:https://doi.org/10.6084/m9.figshare.6815699


100%|█████████████████████████████████████| 40.8M/40.8M [00:21<00:00, 1.89MiB/s]


Loading the zipfile...
Loading completed.
75993
Obtaining 2D dataset 1.1k ...
Reference:https://www.nature.com/articles/s41524-020-00440-1
Other versions:https://doi.org/10.6084/m9.figshare.6815705


100%|█████████████████████████████████████| 8.39M/8.39M [00:03<00:00, 2.54MiB/s]


Loading the zipfile...
Loading completed.
1103
Requires ASE for this functionality.


An atomic structure can consist of atomic element types, the corresponding xyz coordinates in space (either in real or reciprocal space), and a lattice matrix used to set periodic boundary conditions.

An example of constructing an atomic structure with jarvis.core.Atoms is given below. After creating the Atoms object, we can simply print it and visualize the resulting POSCAR-format file in software such as VESTA. While the examples below create and analyze an elemental silicon crystal, the same approach can be used for multi-component systems as well.

In [14]:
from jarvis.core.atoms import Atoms
box = [[2.715, 2.715, 0], [0, 2.715, 2.715], [2.715, 0, 2.715]]
coords = [[0, 0, 0], [0.25, 0.25, 0.25]]
elements = ["Si", "Si"]
Si = Atoms(lattice_mat=box, coords=coords, elements=elements, cartesian=False)
print (Si) # To visualize 
Si.write_poscar('POSCAR.vasp')
Si.write_cif('POSCAR.cif')  # separate file name so the CIF does not overwrite the POSCAR file

System
1.0
2.715 2.715 0.0
0.0 2.715 2.715
2.715 0.0 2.715
Si 
2 
direct
0.0 0.0 0.0 Si
0.25 0.25 0.25 Si



The Atoms object here is created from raw data, but it can also be read from file formats such as '.cif', 'POSCAR', '.xyz', '.pdb', '.sdf', '.mol2', etc. It can also be written to files in formats such as POSCAR or .cif.

Note that for molecular systems, we use a large vacuum padding (say 50 Angstrom in each direction) and set lattice_mat accordingly, e.g. lattice_mat = [[50,0,0],[0,50,0],[0,0,50]]. Similarly, for free surfaces we set a large vacuum in one of the crystallographic directions (say z) by giving a large z-component in the lattice matrix while keeping the x and y components intact.
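
The two padding conventions described above can be sketched with plain lists, without importing jarvis. The helper names molecule_box and slab_box are hypothetical, and slab_box assumes the first two lattice vectors already lie in the xy plane and that the third can simply be replaced by a vector along z. The resulting matrix would then be passed as lattice_mat.

```python
def molecule_box(padding=50.0):
    """Cubic lattice matrix with edge length `padding` (Angstrom) for an isolated molecule."""
    return [[padding, 0.0, 0.0], [0.0, padding, 0.0], [0.0, 0.0, padding]]


def slab_box(lattice_mat, c_length=50.0):
    """Copy of `lattice_mat` with the third vector replaced by [0, 0, c_length].

    `c_length` is the total cell height along z (slab thickness plus vacuum).
    Assumes the first two lattice vectors lie in the xy plane.
    """
    a, b = lattice_mat[0], lattice_mat[1]
    return [list(a), list(b), [0.0, 0.0, c_length]]


print(molecule_box())
print(slab_box([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]], c_length=60.0))
```

With a padded box it is usually easier to give atomic positions in Cartesian coordinates, since fractional coordinates would have to be rescaled to the enlarged cell.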

In [15]:
# Write a POSCAR file first so that there is something to read back
Si.write_poscar('POSCAR')
my_atoms = Atoms.from_poscar('POSCAR')
my_atoms.write_poscar('MyPOSCAR')

Once this Atoms object is created, several useful properties can be obtained, such as:

In [16]:
print ('volume',Si.volume)
print ('density in g/cm3', Si.density)
print ('composition as dictionary', Si.composition)
print ('Chemical formula', Si.composition.reduced_formula)
print ('Spacegroup info', Si.spacegroup())
print ('lattice-parameters', Si.lattice.abc, Si.lattice.angles)
print ('packing fraction',Si.packing_fraction)
print ('number of atoms',Si.num_atoms)
print ('Center of mass', Si.get_center_of_mass())
print ('Atomic number list', Si.atomic_numbers)

volume 40.02575174999999
density in g/cm3 2.3303545408113413
composition as dictionary OrderedDict([('Si', 2)])
Chemical formula Si
Spacegroup info Fd-3m (227)
lattice-parameters [3.83959, 3.83959, 3.83959] [60.0, 60.0, 60.0]
packing fraction 0.27858
number of atoms 2
Center of mass [0.67875 0.67875 0.67875]
Atomic number list [14, 14]
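
As a sanity check, the volume, density and lattice parameters printed above can be reproduced by hand from the lattice matrix: the cell volume is the absolute determinant of the matrix, and the density is the total mass divided by that volume. The sketch below uses only the standard library. The silicon atomic mass and the unit-conversion constants are values assumed here, not read from jarvis, so the last digits can differ slightly from the library's output.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def angle_deg(u, v):
    """Angle between two vectors in degrees."""
    cos_uv = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
    # Clamp to guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_uv))))


box = [[2.715, 2.715, 0], [0, 2.715, 2.715], [2.715, 0, 2.715]]
n_atoms = 2

SI_MASS_AMU = 28.0855      # assumed atomic mass of silicon
AMU_IN_G = 1.66053907e-24  # grams per atomic mass unit
A3_IN_CM3 = 1e-24          # cubic centimetres per cubic Angstrom

volume = abs(det3(box))  # in cubic Angstrom
density = n_atoms * SI_MASS_AMU * AMU_IN_G / (volume * A3_IN_CM3)
lengths = [norm(v) for v in box]
angles = [angle_deg(box[1], box[2]), angle_deg(box[0], box[2]), angle_deg(box[0], box[1])]

print("volume", volume)
print("density in g/cm3", round(density, 2))
print("lattice-parameters", [round(x, 5) for x in lengths], [round(x, 1) for x in angles])
```

For this cell the determinant works out to 2 × 2.715³, about 40.03 cubic Angstrom, and the density rounds to 2.33 g/cm3, in line with the values printed by jarvis.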


For creating/accessing dataset(s), we use Atoms.from_dict() and Atoms.to_dict() methods:

In [17]:
d = Si.to_dict()
new_atoms = Atoms.from_dict(d)

The jarvis.core.Atoms object can be converted back and forth to other simulation toolsets such as Pymatgen and ASE, if installed, as follows:

In [18]:
pmg_struct = Si.pymatgen_converter()
ase_atoms = Si.ase_converter()

Requires ASE for this functionality.


To make a supercell, the following example can be used:

In [19]:
supercell_1 = Si.make_supercell([2,2,2])
supercell_2 = Si.make_supercell_matrix([[2,0,0],[0,2,0],[0,0,2]])
supercell_1.density == supercell_2.density

True
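
The two densities agree because a supercell multiplies the cell volume and the number of atoms by the same factor, the determinant of the scaling matrix (8 for a 2x2x2 supercell). A short sketch of that arithmetic using plain lists. This illustrates the general relation and is not how jarvis builds supercells internally; it assumes lattice vectors are stored as rows, so the supercell lattice is the scaling matrix times the original lattice.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


box = [[2.715, 2.715, 0], [0, 2.715, 2.715], [2.715, 0, 2.715]]
scaling = [[2, 0, 0], [0, 2, 0], [0, 0, 2]]
n_atoms = 2

super_box = matmul3(scaling, box)                          # supercell lattice vectors (rows)
volume_ratio = abs(det3(super_box)) / abs(det3(box))       # factor by which the volume grows
super_n_atoms = n_atoms * round(abs(det3(scaling)))        # atoms are replicated by the same factor

print("volume ratio:", volume_ratio)
print("atom count ratio:", super_n_atoms / n_atoms)
```

Since mass scales with the atom count, mass divided by volume is unchanged, which is what the comparison above checks.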