# Question 1: Structural features
**1.1 cif-cn-featurizer**

Our friend Anton Oliynyk has recently released a new python script `cif-cn-featurizer`. 

Here is their description "A Python script designed to process CIF (Crystallographic Information File) files and extract various features from them. These features include interatomic distances, atomic environment information, and coordination numbers. The script can handle binary and ternary compounds."

Let's test it out and see how it works! To do so we will need some cifs. Right now the featurizer only works with binaries made up of certain elements shown in this plot. 

![allowed elements](https://github.com/sp8rks/MaterialsInformatics/blob/main/HW/HW2/cif-cn-featurizer-allowed-elements.png?raw=true)

**<font color='teal'>a)</font>** Download the `cif-cn-featurizer` files and run it on the cif files in the `HW\cn-featurizer\cifs` folder. 

Note: in case you can't get it working, you'll also find a csv folder with all the extracted features for these cifs already complete, but try and get it working so you can use it in the future!

In [1]:
#you can run this in your miniconda command prompt if you prefer\

# Do i need to import each python code individually that is in the cif-cn-featurizer folder???

#from pathlib import Path
#import sys
#files_dir = Path("C:\MaterialsInformatics\cn-featurizer\cifs")
#files = files_dir.iterdir()
#sys.path.append( 'C:\cif-cn-featurizer' )
#import main as ccnfeat
#features = ccnfeat.main()


# I used the miniconda command prompt


**1.2 Getting labeled data for the cifs**

**<font color='teal'>b)</font>** Now that you've got feature vectors in a series of .csv files, let's use them to build a model to predict a property. To get a property let's search for a materials project entry using the cif cards! If you've forgotten how, go back to the `legacy_MPRester_tutorial.ipynb` notebook where we did an example. Once you have the material project id, run a query to extract a property like bulk modulus (["elasticity"]["K_VRH"])

In [2]:
#your code goes here
from mp_api.client import MPRester
from pymatgen.io.cif import CifParser
import pandas as pd
import os
from pathlib import Path

filename = r'C:\Users\Aidan Belanger\OneDrive\Desktop\Materials Informatics\MyMatProjApiKey.txt'
def get_file_contents(filename):
    try:
        with open(filename, 'r') as f:
            # It's assumed our file contains a single line,
            # with our API key
            return f.read().strip()
    except FileNotFoundError:
        print("'%s' file not found" % filename)

Aidans_API = get_file_contents(filename)


directory_in_str = "C:\cif-cn-featurizer\cifs"
directory = os.fsencode(directory_in_str)

fpaths = []
trackf = []
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith(".cif"): 
        fpaths.append(os.path.join(directory_in_str, filename))
        path = Path(filename)
        trackf.append(path.stem)
        continue
    else:
        continue
print(fpaths)
print(trackf)


  from .autonotebook import tqdm as notebook_tqdm


['C:\\cif-cn-featurizer\\cifs\\250022.cif', 'C:\\cif-cn-featurizer\\cifs\\250065.cif', 'C:\\cif-cn-featurizer\\cifs\\250101.cif', 'C:\\cif-cn-featurizer\\cifs\\250125.cif', 'C:\\cif-cn-featurizer\\cifs\\250186.cif', 'C:\\cif-cn-featurizer\\cifs\\250191.cif', 'C:\\cif-cn-featurizer\\cifs\\250223.cif', 'C:\\cif-cn-featurizer\\cifs\\250225.cif', 'C:\\cif-cn-featurizer\\cifs\\250236.cif', 'C:\\cif-cn-featurizer\\cifs\\250328.cif', 'C:\\cif-cn-featurizer\\cifs\\250329.cif', 'C:\\cif-cn-featurizer\\cifs\\250330.cif', 'C:\\cif-cn-featurizer\\cifs\\250331.cif', 'C:\\cif-cn-featurizer\\cifs\\250332.cif', 'C:\\cif-cn-featurizer\\cifs\\250333.cif', 'C:\\cif-cn-featurizer\\cifs\\250334.cif', 'C:\\cif-cn-featurizer\\cifs\\250335.cif', 'C:\\cif-cn-featurizer\\cifs\\250336.cif', 'C:\\cif-cn-featurizer\\cifs\\250337.cif', 'C:\\cif-cn-featurizer\\cifs\\250338.cif', 'C:\\cif-cn-featurizer\\cifs\\250339.cif', 'C:\\cif-cn-featurizer\\cifs\\250340.cif', 'C:\\cif-cn-featurizer\\cifs\\250363.cif', 'C:\\cif-c

In [3]:
docs = []
ciflabel = []

for fpath in fpaths:
    with MPRester(Aidans_API) as mpr:
        # open cif file with cif parser
        parser = CifParser(fpath)
        path = Path(fpath)
        try:
            # get structural composition from the cif file
            structure = parser.parse_structures(primitive=True)[0]
            # get the compositional formula from the parsed information and remove white space
            formula = str(structure.composition.formula).replace(" ","")
            # search materials project for cif files with the compositional formula
            summary = mpr.materials.summary._search(formula=formula,fields=["material_id"])

            for doc in summary:
                print(doc.material_id)
                docs.append(doc.material_id)
                for x in doc.material_id[0]:
                    ciflabel.append(path.stem)
                #print(ciflabel)
        except:
            print(fpath, " contains an invalid cif")
            

Retrieving SummaryDoc documents: 0it [00:00, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 2/2 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 3/3 [00:00<00:00, 3010.27it/s]
Retrieving SummaryDoc documents: 0it [00:00, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 2/2 [00:00<00:00, 1999.67it/s]
Retrieving SummaryDoc documents: 0it [00:00, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 2/2 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|█████

C:\cif-cn-featurizer\cifs\250615.cif  contains an invalid cif


Retrieving SummaryDoc documents: 100%|██████████| 4/4 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 0it [00:00, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 988.29it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 986.66it/s]
Retrieving SummaryDoc documents: 100%|██████████| 2/2 [00:00<00:00, 2007.32it/s]
Retrieving SummaryDoc documents: 0it [00:00, ?it/s]
Retrieving SummaryDoc documents: 0it [00:00, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 995.56it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 1002.94it/s]
Retrieving SummaryDoc documents: 100%|██████████| 4/4 [00:00<?, ?it/s]
Retrieving Summa

C:\cif-cn-featurizer\cifs\250732.cif  contains an invalid cif


Retrieving SummaryDoc documents: 0it [00:00, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 5/5 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 4/4 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 2/2 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 0it [00:00, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 999.83it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 986.20it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 0it [00:00, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 999.36it/s]
Retrieving SummaryDoc documents: 0it [00:00, ?it/s]
Retrieving SummaryDoc documents: 0it [00:00, ?it/s]
Retrieving SummaryDoc documents: 0it [00:00, ?it/s]
Retrieving SummaryDoc 

In [4]:
d = {"Cif_num":ciflabel, "MP_num":docs}
df = pd.DataFrame(data=d)
print(df)

    Cif_num      MP_num
0    250065      mp-801
1    250101   mp-979040
2    250101      mp-768
3    250125     mp-1139
4    250125  mp-1008279
..      ...         ...
131  250934      mp-718
132  250934  mp-1218937
133  250962    mp-20971
134  250963    mp-20516
135  250977     mp-1051

[136 rows x 2 columns]


In [26]:
with MPRester(Aidans_API) as mpr:
    material = mpr.summary.search(material_ids=['mp-801'])
    print(material[0].density)

  material = mpr.summary.search(material_ids=['mp-801'])
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]

12.883572424806513





In [27]:
material_ids = []
#bulkmod = []
density = []
for row in df["MP_num"]:
    #print(row)
    material_id = row
    
    try:
        with MPRester(Aidans_API) as mpr:
            material = mpr.summary.search(material_ids=[row])
            print(material_id)
            print(material[0].density)
            #bulkmod.append(material[0].bulk_modulus["reuss"])
            density.append(material[0].density)
    except:
        #bulkmod.append("NA")
        density.append("NA")
        print("fail")
    

mp-801


  material = mpr.summary.search(material_ids=[row])
Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


12.883572424806513
mp-979040


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


5.527285146110387
mp-768


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


5.632610075827367
mp-1139


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 1005.83it/s]


9.96299728602397
mp-1008279


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


10.015481379001777
mp-1226451


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


9.182845677359099
mp-1232


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


12.898261191603371
mp-1186016


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


12.777078339068995
mp-1220074


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


9.257027075671665
mp-11506


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


9.769613810100632
mp-1326


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


8.743524434942238
mp-674


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


10.24939082494553
mp-2092


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 989.69it/s]


10.760259389502755
mp-1571


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


10.397704278275409
mp-2333


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


10.615888317928517
mp-357


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


10.984967698041599
mp-636279


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 1004.38it/s]


10.820994191815037
mp-21427


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 989.92it/s]


11.316717971608051
mp-2747


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


11.524663797651474
mp-797


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


11.700858341922407
mp-1979


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


11.846010450424608
mp-2825


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


11.994827919381457
mp-1158


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


12.413465658612973
mp-633


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.415102440372493
mp-1911


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.828775749263485
mp-376


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.546256330916136
mp-1977


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.691879011502618
mp-2484


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 1022.00it/s]


7.930209209150159
mp-20387


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.8012568552109
mp-19919


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 988.06it/s]


8.156980369979735
mp-623460


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.143449785750475
mp-20131


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.270778164452664
mp-20729


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.376635651289854
mp-20369


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.775449563030647
mp-20903


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 987.59it/s]


7.656968652399108
mp-21197


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.823092406876472
mp-19977


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 984.81it/s]


8.104522072794246
mp-20258


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


8.343110213259171
mp-20920


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 985.97it/s]


8.47215828655415
mp-20236


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 990.16it/s]


8.584764503741077
mp-21431


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


8.677201098283199
mp-1291


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


8.763547703005178
mp-21177


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


8.842670849139486
mp-977


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


9.013708392223592
mp-801


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


12.883572424806513
mp-1192425


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.572105974393418
mp-2588


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.541135564711174
mp-1216480


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 1003.18it/s]


6.649320580371982
mp-22568


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 999.83it/s]


6.834443953341236
mp-16513


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


4.431460698497046
mp-865411


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 987.13it/s]


4.9748359632746535
mp-976358


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 989.69it/s]


6.563350676639624
mp-1080098


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


6.486766794944658
mp-1101986


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 998.64it/s]


9.597363238603098
mp-1192321


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


6.915798314053175
mp-1102392


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


6.870259267313855
mp-1192425


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 1001.74it/s]


7.572105974393418
mp-2588


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.541135564711174
mp-569196


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.788807287711017
mp-19977


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


8.104522072794246
mp-20309


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


10.671595644085569
mp-865527


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 997.93it/s]


3.610804964601175
mp-11232


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


2.391904116949502
mp-11231


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 1026.51it/s]


3.700060960389031
mp-2451


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 496.72it/s]


3.6346952134378188
mp-865411


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


4.9748359632746535
mp-16513


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


4.431460698497046
mp-12553


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


4.2842805337050525
mp-570409


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


4.383239011456298
mp-567305


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


4.334028134283033
mp-959


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


4.069623314700671
mp-1184171


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


5.443394957351932
mp-2134


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


5.554689859094507
mp-862656


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 999.12it/s]


5.5240221520341235
mp-1409


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 985.04it/s]


8.170464969881639
mp-21432


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


10.422672224131672
mp-1220420


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


8.559896739233848
mp-1451


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 1000.55it/s]


9.201398899391465
mp-999188


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


8.850508014459278
mp-11513


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


9.17745354406204
mp-999378


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


8.884123207300053
mp-1226802


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 1004.62it/s]


7.326271106559092
mp-30745


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


21.74541412225068
mp-1219506


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


13.83568242201397
mp-30866


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


14.496788904720361
mp-11482


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


19.324106016376305
mp-1221397


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 487.43it/s]


11.430488729323681
mp-30787


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


11.949571843809524
mp-1018021


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


9.484248854182471
mp-1220660


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 331.17it/s]


8.952881583522853
mp-977426


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


9.500977262376134
mp-570557


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


9.003407474216944
mp-2351


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 1000.55it/s]


18.661532327243094
mp-790


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


19.06247638492796
mp-481


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 499.68it/s]


16.546175481474044
mp-2092


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]

10.760259389502755
mp-21427



Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


11.316717971608051
mp-718


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 989.69it/s]


11.303794193467914
mp-1218937


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


10.38693842737166
mp-640095


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 985.74it/s]


6.599269145470333
mp-20309


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 987.13it/s]


10.671595644085569
mp-2092


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 986.66it/s]


10.760259389502755
mp-1189958


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


5.181760423731233
mp-1186855


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


5.210643938605863
mp-369


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


5.129861982817475
mp-867232


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


5.091299866816359
mp-865411


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


4.9748359632746535
mp-1220420


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


8.559896739233848
mp-1451


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


9.201398899391465
mp-999188


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 986.90it/s]


8.850508014459278
mp-11513


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 984.58it/s]


9.17745354406204
mp-999378


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 1000.55it/s]


8.884123207300053
mp-569776


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


12.339145193230443
mp-891


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


12.337173801007728
mp-570491


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


12.31760096600835
mp-1217931


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


11.478490785641336
mp-1082


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 990.62it/s]


18.74230964429206
mp-865496


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


18.802697413478537
mp-2006


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


10.606847605102757
mp-1080590


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


15.828538267315666
mp-30386


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


16.047024263509954
mp-1080756


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


16.24291569765432
mp-980752


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


7.834426899135408
mp-1101053


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


8.996006147572587
mp-2465


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


18.091854211021605
mp-1549


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


17.864325596737118
mp-674


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


10.24939082494553
mp-2092


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


10.760259389502755
mp-1571


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 496.37it/s]


10.397704278275409
mp-559


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


9.834653340340134
mp-30634


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


8.91538949231219
mp-718


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


11.303794193467914
mp-1218937


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 985.50it/s]


10.38693842737166
mp-20971


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 997.46it/s]


17.869663336391046
mp-20516


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<?, ?it/s]


17.91275209288864
mp-1051


Retrieving SummaryDoc documents: 100%|██████████| 1/1 [00:00<00:00, 990.16it/s]

10.347702857282641





In [32]:
d = {"CIF_id":ciflabel, "MP_num":docs, "Density":density}
df = pd.DataFrame(data=d)
print(df)

df.to_csv("Density_data.csv")


     CIF_id      MP_num    Density
0    250065      mp-801  12.883572
1    250101   mp-979040   5.527285
2    250101      mp-768   5.632610
3    250125     mp-1139   9.962997
4    250125  mp-1008279  10.015481
..      ...         ...        ...
131  250934      mp-718  11.303794
132  250934  mp-1218937  10.386938
133  250962    mp-20971  17.869663
134  250963    mp-20516  17.912752
135  250977     mp-1051  10.347703

[136 rows x 3 columns]


In [34]:
# Delete multiple instances of CIF_id even though the densities and MP numbers are different
# this is a quick and hacky way to match data to cif-cn-featurizer data

result_df = df.drop_duplicates(subset=['CIF_id'], keep='first')
print(result_df)



     CIF_id      MP_num    Density
0    250065      mp-801  12.883572
1    250101   mp-979040   5.527285
3    250125     mp-1139   9.962997
6    250191     mp-1232  12.898261
8    250225  mp-1220074   9.257027
..      ...         ...        ...
130  250920    mp-30634   8.915389
131  250934      mp-718  11.303794
133  250962    mp-20971  17.869663
134  250963    mp-20516  17.912752
135  250977     mp-1051  10.347703

[97 rows x 3 columns]


In [31]:
# import csv files generated by cif-cn-featurizer

df_ccnfeat = pd.read_csv("./cn-featurizer/csv/_atomic_environment_features_binary.csv")
print(df_ccnfeat)

     CIF_id      Compound   A   B  A_shortest_dist_count  \
0    250977         ErNi3  Er  Ni                      6   
1    250476         EuPb3  Eu  Pb                     12   
2    250804  La0.75Pt2.25  La  Pt                     12   
3    250191         Mo3Pt  Mo  Pt                      2   
4    250966  Pd3.25In9.75  Pd  In                      3   
..      ...           ...  ..  ..                    ...   
119  250125         MoCo3  Mo  Co                     12   
120  250390         Mo3Os  Mo  Os                      2   
121  250372          YIn3   Y  In                     12   
122  250654         YbIn3  Yb  In                     12   
123  250646         TmPt3  Tm  Pt                     12   

     B_shortest_dist_count  A_avg_shortest_dist_count  \
0                        6                        6.0   
1                       12                       12.0   
2                        6                       12.0   
3                       12                        2

In [48]:
# merge featurized data with data from Materials Project

df_ccnfeat['CIF_id'] = df_ccnfeat['CIF_id'].astype(str)
result_df['CIF_id'] = result_df['CIF_id'] .astype(str)
svm_df=df_ccnfeat.merge(result_df, on='CIF_id')
print(svm_df)
svm_df.to_csv("svm_df.csv")

    CIF_id Compound   A   B  A_shortest_dist_count  B_shortest_dist_count  \
0   250977    ErNi3  Er  Ni                      6                      6   
1   250476    EuPb3  Eu  Pb                     12                     12   
2   250191    Mo3Pt  Mo  Pt                      2                     12   
3   250368    EuSn3  Eu  Sn                     12                     12   
4   250367    SmSn3  Sm  Sn                     12                     12   
..     ...      ...  ..  ..                    ...                    ...   
92  250336    DyPd3  Dy  Pd                     12                     12   
93  250125    MoCo3  Mo  Co                     12                     10   
94  250390    Mo3Os  Mo  Os                      2                     12   
95  250372     YIn3   Y  In                     12                     12   
96  250646    TmPt3  Tm  Pt                     12                     12   

    A_avg_shortest_dist_count  B_avg_shortest_dist_count  \
0              

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  result_df['CIF_id'] = result_df['CIF_id'] .astype(str)


**1.3 Comparing structural features to compositional features**

**<font color='teal'>c)</font>** Now that you've got structural features and you can get compositional features (use CBFV), let's compare them! Build a Support vector machine regressor model with each feature set and determine which works better. 

In [76]:
# SVM on Structural Features from cif-cn-featurizer

# load stored data to save time
svm_df=pd.read_csv('svm_df.csv')

from sklearn.svm import SVR
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

X = svm_df[['A_shortest_dist_count','B_shortest_dist_count','A_avg_shortest_dist_count','B_avg_shortest_dist_count',
        'A_shortest_tol_dist_count','B_shortest_tol_dist_count','A_avg_shortest_dist_within_tol_count',
        'B_avg_shortest_dist_within_tol_count','A_second_by_first_shortest_dist','B_second_by_first_shortest_dist',
        'A_avg_second_by_first_shortest_dist','B_avg_second_by_first_shortest_dist','A_second_shortest_dist_count',
        'B_second_shortest_dist_count','A_avg_second_shortest_dist_count','B_avg_second_shortest_dist_count',
        'A_homoatomic_dist_by_shortest_dist','B_homoatomic_dist_by_shortest_dist','A_avg_homoatomic_dist_by_shortest_dist',
        'B_avg_homoatomic_dist_by_shortest_dist','A_count_at_A_shortest_dist','A_count_at_B_shortest_dist',	
        'A_avg_count_at_A_shortest_dist','A_avg_count_at_B_shortest_dist','B_count_at_A_shortest_dist',
        'B_count_at_B_shortest_dist','B_avg_count_at_A_shortest_dist','B_avg_count_at_B_shortest_dist']]
y = svm_df['Density']




X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RNG_SEED)
svr = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('the r2 score is',r2)
mae = mean_absolute_error(y_test, y_pred)
print('the mean absolute error is',mae)
rmse_val = mean_squared_error(y_test, y_pred, squared=False)





the r2 score is 0.2815057887221405
the mean absolute error is 1.771342258353955


In [80]:
# SVM on Compositional Features from CBFV

from CBFV import composition

svm_df = svm_df.dropna()
rename_dict = {'Density': 'target', 'Compound':'formula'}
svm_df = svm_df.rename(columns=rename_dict)

RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

X = svm_df[['formula']]
        #    ,'A_shortest_dist_count','B_shortest_dist_count','A_avg_shortest_dist_count','B_avg_shortest_dist_count',
        #'A_shortest_tol_dist_count','B_shortest_tol_dist_count','A_avg_shortest_dist_within_tol_count',
        #'B_avg_shortest_dist_within_tol_count','A_second_by_first_shortest_dist','B_second_by_first_shortest_dist',
        #'A_avg_second_by_first_shortest_dist','B_avg_second_by_first_shortest_dist','A_second_shortest_dist_count',
        #'B_second_shortest_dist_count','A_avg_second_shortest_dist_count','B_avg_second_shortest_dist_count',
        #'A_homoatomic_dist_by_shortest_dist','B_homoatomic_dist_by_shortest_dist','A_avg_homoatomic_dist_by_shortest_dist',
        #'B_avg_homoatomic_dist_by_shortest_dist','A_count_at_A_shortest_dist','A_count_at_B_shortest_dist',	
        #'A_avg_count_at_A_shortest_dist','A_avg_count_at_B_shortest_dist','B_count_at_A_shortest_dist',
        #'B_count_at_B_shortest_dist','B_avg_count_at_A_shortest_dist','B_avg_count_at_B_shortest_dist']]
y = svm_df['target']

new_df = X.join(y)
print(new_df)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RNG_SEED)

X_train, y_train, formulae_train, skipped_train = composition.generate_features(new_df, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)
X_test, y_test, formulae_train, skipped_train = composition.generate_features(new_df, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)


svr = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('the r2 score is',r2)
mae = mean_absolute_error(y_test, y_pred)
print('the mean absolute error is',mae)
rmse_val = mean_squared_error(y_test, y_pred, squared=False)




   formula     target
0    ErNi3  10.347703
1    EuPb3  10.671596
2    Mo3Pt  12.898261
3    EuSn3   7.801257
4    SmSn3   7.930209
..     ...        ...
92   DyPd3  11.700858
93   MoCo3   9.962997
94   Mo3Os  12.883572
95    YIn3   7.270778
96   TmPt3  18.661532

[97 rows x 2 columns]


Processing Input Data: 100%|██████████| 97/97 [00:00<00:00, 8820.92it/s]




	Featurizing Compositions...


Assigning Features...: 100%|██████████| 97/97 [00:00<00:00, 4041.60it/s]

	Creating Pandas Objects...



Processing Input Data: 100%|██████████| 97/97 [00:00<00:00, 16174.27it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 97/97 [00:00<00:00, 6941.25it/s]


	Creating Pandas Objects...
the r2 score is 0.9993457723977782
the mean absolute error is 0.09845562750331356
