# Raw data preprocessing

This notebook handles the raw data from the two data sources used in this study: 
1. [Effective masses](https://www.nature.com/articles/sdata201785)
2. [Dielectric constants](https://www.nature.com/articles/sdata201865)

#### Requirements
- The datasets must be downloaded in their entirety and put into the `raw_data/` directory.
- You will need a (free) API key from the [Materials Project](http://materialsproject.org).

`pip install numpy scipy pandas pymatgen matminer smact`


#### Steps
- Limit the effective mass data to compounds that contain at least one metal or semi-metal element (Si, As and At are added to the list of metals explicitly).
- Write to `processed_data/raw_eff_mass_data.json`, which is used in the main notebook.
- Get the intersection of the datasets by checking that *any* of the unique Materials Project `task_id`s associated with each compound in set 1 is used to identify a compound in set 2.
- Write to `processed_data/collected_raw_data.json`, which is used in the main notebook.
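
The intersection step can be illustrated with toy data. This is only a minimal sketch, not the code used in the cells below: the formulas, the IDs and the `all_task_ids` lists are made up, and set 1 is assumed to already carry every `task_id` associated with each compound.

```python
# Toy stand-ins: each set-1 compound carries all task_ids associated with it
set_1 = [
    {'formula': 'AB', 'all_task_ids': ['mp-1', 'mp-10']},
    {'formula': 'CD', 'all_task_ids': ['mp-2']},
]
# Set-2 compounds are identified by a single task_id
set_2 = [{'formula': 'AB', 'task_id': 'mp-10'}]

# Collect the identifiers used in set 2 once
set_2_ids = {rec['task_id'] for rec in set_2}

# Keep a set-1 compound if *any* of its task_ids identifies a set-2 compound
in_both = [rec for rec in set_1
           if set_2_ids.intersection(rec['all_task_ids'])]

print([rec['formula'] for rec in in_both])  # ['AB']
```

Here `AB` is kept even though its primary ID (`mp-1`) differs from the one used in set 2 (`mp-10`), because both IDs belong to the same compound.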

In [2]:
# Imports
# Pymatgen
from pymatgen import MPRester, Element, Composition
from matminer.data_retrieval.retrieve_MP import MPDataRetrieval
from matminer.utils.conversions import str_to_composition

# Data
import pandas as pd
import json
import gzip

# Maths
import numpy as np
import scipy

# System
import os
from datetime import datetime
import multiprocessing

# SMACT
from smact import metals

# Pandas options
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999

# MP API - Add yours below
m = MPRester(os.environ.get('MP_API_KEY'))

## Preliminaries

In [3]:
# Function to pull necessary data out of a file and put into dict
def extract_data_diel(filepath):
    if filepath.endswith(".json"):
        try:
            with open(filepath , 'r') as f:
                data = json.load(f)
                if data['dielectric']:
                    task_id = data['metadata']['material_id']
                    formula = data['metadata']['formula']
                    eps_total = data['dielectric']['eps_total']
                    eps_electronic = data['dielectric']['eps_electronic']
                    
                    # Return as dict
                    return {'task_id': task_id, 'formula': formula, 
                            'eps_total': eps_total, 'eps_electronic': eps_electronic}

        except Exception:
            # Skip files that cannot be parsed or lack the expected keys
            pass

In [4]:
# Query to get all Materials Project task_ids
multiple_MPIDs = m.query(criteria={'task_id': {'$exists': True }}, properties = ['task_ids'])


## Effective masses
Data are extracted from many `.json.gz` files and initially put into a list of dicts.
Dict keys are [task_id, formula, p_eff_mass, n_eff_mass]

Note: The extracted data correspond to the conductivity effective mass at a temperature of 300 K and a carrier concentration of 1e18.
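
To make the lookup concrete, the sketch below builds a toy record with the nesting that the extraction function indexes into: `GGA`, then `cond_eff_mass`, then carrier type, temperature and carrier concentration. The ID, the formula, the numbers and the shape of the stored values are placeholders, not values from the dataset, and real files contain more fields.

```python
import gzip
import json
import os
import tempfile

# Toy record mimicking the nesting used in this notebook (placeholder values)
record = {
    'task_id': {'GGA': 'mp-0000'},
    'pretty_formula': 'XY',
    'GGA': {'cond_eff_mass': {
        'p': {'300': {'1e+18': [1.0, 1.0, 1.0]}},
        'n': {'300': {'1e+18': [0.5, 0.5, 0.5]}},
    }},
}

# Write it as gzipped JSON, then read it back like a .json.gz data file
path = os.path.join(tempfile.mkdtemp(), 'toy.json.gz')
with gzip.open(path, 'wt') as f:
    json.dump(record, f)

with gzip.open(path, 'rt') as f:
    data = json.load(f)

# Temperature and carrier concentration are string keys, not numbers
p_eff_mass = data['GGA']['cond_eff_mass']['p']['300']['1e+18']
n_eff_mass = data['GGA']['cond_eff_mass']['n']['300']['1e+18']
print(p_eff_mass, n_eff_mass)
```

Note that the keys `'300'` and `'1e+18'` must match the strings in the file exactly; `'1e18'` would raise a `KeyError`.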

In [5]:
# Function to pull necessary data out of a file and put into dict
def extract_data(filepath):
    if filepath.endswith(".gz"):
        try:
            with gzip.GzipFile(filepath , 'r') as f:
                data = json.load(f)
                if data['GGA']['cond_eff_mass']:
                    task_id = data['task_id']['GGA']
                    formula = data['pretty_formula']
                    p_eff_mass = data['GGA']['cond_eff_mass']['p']['300']['1e+18']
                    n_eff_mass = data['GGA']['cond_eff_mass']['n']['300']['1e+18']

                    # Return as dict
                    return {'task_id': task_id, 'formula': formula, 
                            'p_eff_mass': p_eff_mass, 'n_eff_mass': n_eff_mass}

        except Exception:
            # Skip files that cannot be parsed or lack the expected keys
            pass

Use multiprocessing to extract all the data.
This takes a few minutes.

In [7]:
start = datetime.now()
with multiprocessing.Pool() as p:
    dirname = '../raw_data/eff_mass_data_1/'
    result_1 = p.map(extract_data, [os.path.join(dirname,filename) for filename in os.listdir(dirname)])
    dirname = '../raw_data/eff_mass_data_2/'
    result_2 = p.map(extract_data, [os.path.join(dirname,filename) for filename in os.listdir(dirname)])
print('Time: {}'.format(datetime.now() - start))

Time: 0:03:15.078614


In [8]:
# Combine into one list
e_mass_data = result_1 + result_2
# Remove Nones from list
print(len(e_mass_data))
e_mass_data = [x for x in e_mass_data if x is not None]
print(len(e_mass_data))
# Create dataframe
effmass_df = pd.DataFrame(e_mass_data)

47738
22976


In [9]:
# Get rid of gases and other nonsense
# The compounds must include one of these:
must_include = metals + ['Si','As','At']

# Function to label whether the compound contains any of the necessary elements or not
def label_unwanted(form):
    comp = Composition(form)
    symbols = [s.symbol for s in comp.elements]
    contain_wanted = any(i in must_include for i in symbols)
    return contain_wanted

effmass_df['wanted_formula'] = effmass_df.apply(lambda x: label_unwanted(x['formula']), axis=1)
effmass_df = effmass_df.loc[effmass_df['wanted_formula'] == True]

In [10]:
# Reduce down to the columns we need
effmass_df = effmass_df.filter(['task_id', 'formula', 'n_eff_mass',
                                      'p_eff_mass'], axis=1)

In [11]:
# Drop none values
effmass_df = effmass_df.dropna()
print('Number of compounds: {}'.format(len(effmass_df)))

Number of compounds: 21942


In [12]:
# Write out to json
with open('../processed_data/raw_eff_mass_data.json', 'w') as f:
    out = effmass_df.to_json()
    f.write(out)

## Dielectric constants

In [13]:
time = datetime.now()
with multiprocessing.Pool() as p:
    dirname = '../raw_data/dielectric_pettreto/phonon/'
    diel_result = p.map(extract_data_diel, 
                  [os.path.join(dirname,filename) for filename in os.listdir(dirname)])
    print(len(diel_result))

1521


In [14]:
# Add additional task_ids
for d in diel_result:
    for e in multiple_MPIDs:
        if d['task_id'] in e['task_ids']:
            d['all_task_ids'] = e['task_ids']
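
The nested loop above compares every dielectric record with every Materials Project entry. One possible alternative, sketched here with toy data, is to build a dictionary from each `task_id` to its full `task_ids` list once and then look each record up directly. The names `add_all_task_ids` and `ids_by_task` are hypothetical, and the sketch assumes that each `task_id` belongs to a single entry.

```python
def add_all_task_ids(records, mp_entries):
    """Attach 'all_task_ids' to each record whose 'task_id' is known."""
    # Build the lookup once: task_id -> all task_ids of the same entry
    ids_by_task = {}
    for entry in mp_entries:
        for tid in entry['task_ids']:
            ids_by_task[tid] = entry['task_ids']

    # A single dictionary lookup per record replaces the inner loop
    for rec in records:
        if rec['task_id'] in ids_by_task:
            rec['all_task_ids'] = ids_by_task[rec['task_id']]
    return records


# Toy data (made-up IDs)
toy_entries = [{'task_ids': ['mp-1', 'mp-10']}, {'task_ids': ['mp-2']}]
toy_records = [{'task_id': 'mp-10'}, {'task_id': 'mp-99'}]
add_all_task_ids(toy_records, toy_entries)
print(toy_records)
```

As in the loop above, records whose `task_id` is not found are left without an `all_task_ids` key.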

### Combine dielectric and effective mass data


In [15]:
for d in diel_result:
    for e in e_mass_data:
        try:
            if e['task_id'] in d['all_task_ids']:
                for key in e.keys():
                    d['e_mass_{}'.format(key)] = e[key]
        except (KeyError, TypeError):
            # Skip records without 'all_task_ids' (or empty records)
            pass

# Only store info for compounds in both datasets
all_data = [e for e in diel_result if 'e_mass_formula' in e]
print(len(diel_result))
print(len(all_data))

# create dataframe
all_data_df = pd.DataFrame(all_data)

1521
1236


In [16]:
# Get rid of gases and other nonsense again 
all_data_df['wanted_formula'] = all_data_df.apply(lambda x: label_unwanted(x['formula']), axis=1)
all_data_df = all_data_df.loc[all_data_df['wanted_formula'] == True]

In [17]:
# Drop none values
all_data_df = all_data_df.dropna()
print('Number of compounds: {}'.format(len(all_data_df)))

Number of compounds: 1131


In [18]:
# Reduce down to columns we need, rename them to something useful, and save to json
reduced_raw_data = all_data_df.filter(['task_id', 'formula', 'e_mass_n_eff_mass',
                                      'e_mass_p_eff_mass', 'eps_electronic', 'eps_total'], axis=1)

reduced_raw_data = reduced_raw_data.rename(columns={'e_mass_n_eff_mass': 'n_eff_mass',
                                                    'e_mass_p_eff_mass': 'p_eff_mass' })


# Write out to json
with open('../processed_data/collected_raw_data.json', 'w') as f:
    out = reduced_raw_data.to_json()
    f.write(out)