# Data Featurization

Here, we show some simple examples of featurizing materials composition data using so-called "composition-based feature vectors", or CBFVs. This method represents a single chemical formula as one vector based on the chemical properties of its constituent atoms (refer to the paper for more information and references).
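To make the idea concrete, the sketch below builds a tiny feature vector for Al2O3 by hand from two elemental properties. It only illustrates the concept and is not the implementation used by any particular package. The `ELEMENT_PROPS` table and the `toy_cbfv` helper are hypothetical names introduced here, and only a sum, a composition-weighted average, and a simple range are computed.

```python
# Toy illustration of a composition-based feature vector (CBFV).
# ELEMENT_PROPS and toy_cbfv are hypothetical names used only in this sketch.
ELEMENT_PROPS = {
    "Al": {"atomic_number": 13, "atomic_weight": 26.982},
    "O": {"atomic_number": 8, "atomic_weight": 15.999},
}


def toy_cbfv(composition):
    """Map {element: count} to a flat dict of sum/avg/range features."""
    n_atoms = sum(composition.values())
    features = {}
    for prop in ("atomic_number", "atomic_weight"):
        # Property value of each distinct element in the formula
        values = [ELEMENT_PROPS[el][prop] for el in composition]
        # Property summed over all atoms (weighted by the atom counts)
        total = sum(ELEMENT_PROPS[el][prop] * n for el, n in composition.items())
        features[f"sum_{prop}"] = total
        features[f"avg_{prop}"] = total / n_atoms
        features[f"range_{prop}"] = max(values) - min(values)
    return features


al2o3 = toy_cbfv({"Al": 2, "O": 3})
print(al2o3)
```

Every formula is mapped to a vector of the same length, regardless of how many elements it contains, which is what makes this representation convenient as model input.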

Note that the steps shown in this notebook are intended to demonstrate the best practices associated with featurizing materials data, using *one* way of featurizing materials composition data as an example. 
Depending on your input data and your particular modeling needs, the data featurization method and procedure you use may be different than the example shown here.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

from collections import OrderedDict

# Set a random seed to ensure reproducibility across runs
RNG_SEED = 42
np.random.seed(RNG_SEED)

## Loading data


We will start with the dataset splits that we saved from the last notebook.

In [2]:
#getting the paths for all the data
PATH = os.getcwd()
train_path = os.path.join(PATH, "../data_for_notebook_bestpractice/cp_train_byme.csv")
val_path = os.path.join(PATH, "../data_for_notebook_bestpractice/cp_val_byme.csv")
test_path = os.path.join(PATH, "../data_for_notebook_bestpractice/cp_test_byme.csv")

#now that the paths are created, we can make the dataframes
df_train = pd.read_csv(train_path)
df_val = pd.read_csv(val_path)
df_test = pd.read_csv(test_path)

print("df_train dataframe shape is", df_train.shape)
print("df_val dataframe shape is", df_val.shape)
print("df_test dataframe shape is", df_test.shape)

df_train dataframe shape is (3214, 3)
df_val dataframe shape is (980, 3)
df_test dataframe shape is (370, 3)


## Sub-sampling your data (optional)

If your dataset is too large, you can subsample it to be a smaller size.
This is useful for prototyping and for making quick sanity tests of new models / parameters.

Just be careful not to introduce any bias into your data through the sampling.
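One quick way to look for sampling bias is to compare summary statistics of the subsample against the full dataset. The following is a minimal sketch on synthetic data; the DataFrame names and the `target` column are assumptions made for illustration, and matching summary statistics do not by themselves prove that a sample is unbiased.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real dataset; the column name "target" is an assumption
rng = np.random.default_rng(42)
df_full = pd.DataFrame({"target": rng.normal(loc=100.0, scale=20.0, size=5000)})

# Draw a random subsample without replacement
df_sub = df_full.sample(n=1000, random_state=42)

# Compare summary statistics of the full data and the subsample side by side
summary = pd.concat(
    [df_full["target"].describe(), df_sub["target"].describe()],
    axis=1,
    keys=["full", "subsample"],
)
print(summary)
```

Large differences in the mean, spread, or quantiles between the two columns would suggest that the subsample does not represent the full data well.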

In [3]:
df_train_sample = df_train.sample(n = 2000, random_state = RNG_SEED) #https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html
df_val_sample = df_val.sample(n = 200, random_state = RNG_SEED)
df_test_sample = df_test.sample(n = 200, random_state = RNG_SEED)

print(f"shape of df_train_sample: {df_train_sample.shape}\nshape of df_val_sample: {df_val_sample.shape}\nshape of df_test_sample: {df_test_sample.shape}")

shape of df_train_sample: (2000, 3)
shape of df_val_sample: (200, 3)
shape of df_test_sample: (200, 3)


## Generate features using the `CBFV` package

To featurize the chemical compositions from a chemical formula (e.g. "Al2O3") into a composition-based feature vector (CBFV), we use the open-source [`CBFV` package](https://github.com/kaaiian/CBFV).

We have downloaded and saved a local copy of the package into this repository for your convenience.
For the most updated version, refer to the GitHub repository linked above.

In [4]:
#Import the package and the generate_features function
from CBFV.cbfv.composition import generate_features

The `generate_features` function from the CBFV package expects an input DataFrame containing at least the columns `['formula', 'target']`. You may also have extra feature columns (e.g., `temperature` or `pressure`, other measurement conditions, etc.).

In our dataset, `Cp` represents the target variable, and `T` is the measurement condition.
Since the `generate_features` function expects the target variable column to be named `target`, we have to rename the `Cp` column.

In [5]:
print('DataFrame column names before renaming:')
print(df_train.columns)
print(df_val.columns)
print(df_test.columns)

#renaming Cp column to "target"
rename_cp = {"Cp" : "target"}
df_train = df_train.rename(columns = rename_cp)
df_val = df_val.rename(columns = rename_cp)
df_test = df_test.rename(columns = rename_cp)

df_train_sample = df_train_sample.rename(columns = rename_cp)
df_val_sample = df_val_sample.rename(columns = rename_cp)
df_test_sample = df_test_sample.rename(columns = rename_cp)

print('\nDataFrame column names after renaming:')
print(df_train.columns)
print(df_val.columns)
print(df_test.columns)

DataFrame column names before renaming:
Index(['formula', 'T', 'Cp'], dtype='object')
Index(['formula', 'T', 'Cp'], dtype='object')
Index(['formula', 'T', 'Cp'], dtype='object')

DataFrame column names after renaming:
Index(['formula', 'T', 'target'], dtype='object')
Index(['formula', 'T', 'target'], dtype='object')
Index(['formula', 'T', 'target'], dtype='object')


Now we can use the `generate_features` function to generate the CBFVs from the input data.

Note that we have specified several keyword arguments in our call to `generate_features`:
* `elem_prop='oliynyk'`
* `drop_duplicates=False`
* `extend_features=True`
* `sum_feat=True`

A short explanation for the choice of keyword arguments is below:
* The `elem_prop` parameter specifies which CBFV featurization scheme to use (there are several). For this tutorial, we have chosen to use the `oliynyk` CBFV featurization scheme.
* The `drop_duplicates` parameter specifies whether to drop duplicate formulae during featurization. In our case, we want to preserve duplicate formulae in our data (`False`), since we have multiple heat capacity measurements (performed at different temperatures) for the same compound.
* The `extend_features` parameter specifies whether to include extended features (features that are not part of `['formula', 'target']`) in the featurized data. In our case, this is our measurement temperature, and we want to include this information (`True`), since this is pertinent information for the heat capacity prediction.
* The `sum_feat` parameter specifies whether to calculate the sum features when generating the CBFVs for the chemical formulae. We include them in our case (`True`).
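To see why dropping duplicate formulae would be harmful for this kind of data, the sketch below mimics the effect with plain pandas. It does not call the CBFV package, and the numerical values are made up for illustration.

```python
import pandas as pd

# Toy data: one compound measured at three temperatures (values are made up)
df = pd.DataFrame({
    "formula": ["Al2O3", "Al2O3", "Al2O3", "NaF"],
    "T": [300.0, 600.0, 900.0, 300.0],
    "target": [79.0, 112.5, 121.0, 46.9],
})

# Keeping only the first row per formula discards the other temperatures
df_deduplicated = df.drop_duplicates(subset="formula")

print(len(df), "rows before,", len(df_deduplicated), "rows after")
```

After deduplication, only one temperature per compound remains, so the temperature dependence of the target is lost.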

For more information about the `generate_features` function and the CBFV featurization scheme, refer to the GitHub repository and the accompanying paper to this notebook.

In [17]:
# Note: the feature matrices returned here are unscaled
X_train_unscaled, y_train, formulae_train, skipped_train = generate_features(df_train_sample, elem_prop = 'oliynyk', drop_duplicates = False, extend_features = True, sum_feat = True)
X_val_unscaled, y_val, formulae_val, skipped_val = generate_features(df_val_sample, elem_prop = 'oliynyk', drop_duplicates = False, extend_features = True, sum_feat = True)
X_test_unscaled, y_test, formulae_test, skipped_test = generate_features(df_test_sample, elem_prop = 'oliynyk', drop_duplicates = False, extend_features = True, sum_feat = True)

# generate_features returns four objects, in this order:
#   X        - the featurized data (one CBFV per input row)
#   y        - the target values
#   formulae - the chemical formula corresponding to each row
#   skipped  - the formulae that were skipped during featurization
#
# drop_duplicates=False keeps repeated formulae. Our data contain the same
# formula measured at many temperatures, so we do not want to drop them.
#
# sum_feat=True adds the "sum_" features, in which the property values of the
# elements are summed over all atoms in the formula (e.g., H2O2 has a summed
# atomic number of 2*1 + 2*8 = 18).

Processing Input Data: 100%|████████████████████████████████████████████████████| 2000/2000 [00:00<00:00, 14302.41it/s]


	Featurizing Compositions...


Assigning Features...: 100%|█████████████████████████████████████████████████████| 2000/2000 [00:00<00:00, 8416.98it/s]


	Creating Pandas Objects...


Processing Input Data: 100%|██████████████████████████████████████████████████████| 200/200 [00:00<00:00, 14359.63it/s]


	Featurizing Compositions...


Assigning Features...: 100%|███████████████████████████████████████████████████████| 200/200 [00:00<00:00, 4332.85it/s]


	Creating Pandas Objects...


Processing Input Data: 100%|██████████████████████████████████████████████████████| 200/200 [00:00<00:00, 17487.20it/s]


	Featurizing Compositions...


Assigning Features...: 100%|███████████████████████████████████████████████████████| 200/200 [00:00<00:00, 5505.74it/s]


	Creating Pandas Objects...


To see what a featurized X matrix looks like, `.head()` will show us some rows:

In [18]:
X_train_unscaled.head()

Unnamed: 0,sum_Atomic_Number,sum_Atomic_Weight,sum_Period,sum_group,sum_families,sum_Metal,sum_Nonmetal,sum_Metalliod,sum_Mendeleev_Number,sum_l_quantum_number,...,range_Melting_point_(K),range_Boiling_Point_(K),range_Density_(g/mL),range_specific_heat_(J/g_K)_,range_heat_of_fusion_(kJ/mol)_,range_heat_of_vaporization_(kJ/mol)_,range_thermal_conductivity_(W/(m_K))_,range_heat_atomization(kJ/mol),range_Cohesive_energy,T
0,32.0,65.11604,8.0,30.0,15.0,1.0,2.0,0.0,162.0,2.0,...,2642621.0,4742507.0,0.858492,0.021622,2388.183171,22965.815879,3091.366423,66594.888889,7.034755,600.0
1,28.0,53.4912,9.0,36.0,43.0,0.0,6.0,0.0,544.0,2.0,...,4363.94,8544.527,2e-06,40.816697,1.69693,17.270367,0.00603,22037.555556,4.284089,457.7
2,46.0,98.887792,14.0,72.0,36.0,3.0,4.0,0.0,441.0,4.0,...,400905.7,1662798.0,0.601941,1.321867,10.138486,13933.526946,6716.9217,10368.666667,1.070067,300.0
3,20.0,41.988171,5.0,18.0,9.0,1.0,1.0,0.0,95.0,1.0,...,25217.44,286813.8,0.234886,0.042025,1.372178,2194.463394,4968.283245,225.0,0.018632,2800.0
4,82.0,207.2,6.0,14.0,5.0,1.0,0.0,0.0,81.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1400.0


In [19]:
X_train_unscaled.shape

(2000, 177)

Note the `sum` features in the CBFV, which we have included by using `sum_feat=True` in the call to `generate_features`.

Also note the temperature column `T` at the end of this featurized data.

What we have done above is featurize the input data. In the featurized data, each row contains the CBFV that describes one chemical composition, together with its measurement temperature `T`.

## Data scaling & normalization

For numerical input data, scaling and normalizing the features often improves model performance.
Scaling can partially correct for differences in the orders of magnitude of the features (e.g., some numerical features being much larger or smaller than others).
This typically helps the model learn more effectively, which in turn improves its predictive performance.

We will scale and then normalize our input data using scikit-learn's built-in `StandardScaler` class and `normalize` function.

Note, in addition to `StandardScaler`, other scalers such as `RobustScaler` and `MinMaxScaler` are also available in scikit-learn. Consult the documentation for the details and when to use them.
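As a rough illustration of how these scalers differ, the sketch below applies all three to a single made-up feature column that contains one large outlier. The values are arbitrary and are not taken from the heat capacity dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature column with a single large outlier (made-up values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "StandardScaler": StandardScaler().fit_transform(x),  # zero mean, unit variance
    "MinMaxScaler": MinMaxScaler().fit_transform(x),      # rescaled to [0, 1]
    "RobustScaler": RobustScaler().fit_transform(x),      # median and IQR based
}

for name, values in scaled.items():
    print(f"{name:>14}: {np.round(values.ravel(), 3)}")
```

With `StandardScaler` and `MinMaxScaler`, the outlier squeezes the remaining points into a narrow band, whereas `RobustScaler` keeps them spread out because it uses the median and interquartile range.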

In [20]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize

## Scaling the data

First, we instantiate the scaler object.

In a `StandardScaler` object:
* During the `fit` process, the statistics of the input data (mean and standard deviation) are computed.
* Then, during the `transform` process, the mean and standard deviation values calculated above are used to scale the data to have zero mean and unit variance.

Therefore, the first time we use the scaler, we call the `.fit_transform()` method to fit the scaler to the input data and then transform the same data.
For subsequent uses, since we have already computed the statistics, we only call the `.transform()` method to scale the data.
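The fit and transform steps can be checked by hand on a small array. This is a minimal sketch with made-up values; `X_demo` and `scaler_demo` are illustrative names and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])  # made-up values

scaler_demo = StandardScaler()
scaler_demo.fit(X_demo)

# Column-wise statistics stored by fit()
print("mean_ :", scaler_demo.mean_)
print("scale_:", scaler_demo.scale_)

# transform() applies (x - mean) / std, column by column
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
X_sklearn = scaler_demo.transform(X_demo)
print(np.allclose(X_manual, X_sklearn))
```

Because the statistics live on the fitted scaler object, the same object can later transform other data using exactly these means and standard deviations.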

**Note:** you should *only* `.fit()` the scaler using the training dataset statistics, and then use these same statistics from the training dataset to `.transform()` the other datasets (validation and test).

In [21]:
# Always scale first, then normalize
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train_unscaled)
X_val = scaler.transform(X_val_unscaled)
X_test = scaler.transform(X_test_unscaled)

## Normalizing the scaled data

We repeat a similar process for normalizing the data.
Here, there is no need to first fit the normalizer, since the normalizer scales the rows of the input data to unit norm independently of other rows.

The normalizer differs from a scaler in that it acts row-wise, whereas a scaler acts column-wise on the input data.
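The row-wise behavior can be verified on a small array. This is a minimal sketch with made-up values; it relies on the default settings of `normalize` (L2 norm, applied per row).

```python
import numpy as np
from sklearn.preprocessing import normalize

X_demo = np.array([[3.0, 4.0], [1.0, 0.0], [5.0, 12.0]])  # made-up values

X_normalized = normalize(X_demo)  # default: L2 norm, applied per row (axis=1)

print(X_normalized)
print(np.linalg.norm(X_normalized, axis=1))  # every row now has unit length
```

Each row is divided by its own length, so no statistics are shared between rows and nothing needs to be fitted beforehand.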

In [22]:
X_train = normalize(X_train)
X_val = normalize(X_val)
X_test = normalize(X_test)

# Modeling using "classical" machine learning models

Here we implement some classical ML models from `sklearn`:

* Ridge regression
* Support vector machine
* Linear support vector machine
* Random forest
* Extra trees
* Adaptive boosting
* Gradient boosting
* k-nearest neighbors
* Dummy (if you can't beat this, something is wrong.)

Note: the Dummy model types from `sklearn` act as a good sanity check for your ML studies. If your models do not perform significantly better than the equivalent Dummy models, then you should know that something has gone wrong in your model implementation.
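To show what such a baseline does, the sketch below fits a `DummyRegressor` with its default settings on synthetic data. The arrays are random stand-ins, not the heat capacity data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))  # synthetic features (ignored by the dummy model)
y_demo = rng.normal(size=50)       # synthetic targets

dummy = DummyRegressor()  # default strategy="mean"
dummy.fit(X_demo, y_demo)
y_pred = dummy.predict(X_demo)

print("constant prediction:", y_pred[0])
print("training mean      :", y_demo.mean())
print("r2 on training data:", r2_score(y_demo, y_pred))
```

Since the model always predicts the training mean, its $r^2$ on the training data is zero up to floating-point error, which gives a reference point for any real model.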

In [12]:
from time import time

from sklearn.dummy import DummyRegressor

from sklearn.linear_model import Ridge

from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.neighbors import KNeighborsRegressor

from sklearn.svm import SVR
from sklearn.svm import LinearSVR

from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

In addition, we define some helper functions.

In [13]:
# Helper functions for instantiating, fitting, and evaluating the models

# model_name is a model class (e.g., Ridge), not an instance
def instantiate_model(model_name):
    model = model_name()  # calling the class creates a model object with default parameters
    return model

def fit_model(model, X_train, y_train):
    ti = time()
    model = instantiate_model(model)
    model.fit(X_train, y_train)
    fit_time = time() - ti
    return model, fit_time

def evaluate_model(model, X, y_act):
    y_pred = model.predict(X)
    r2 = r2_score(y_act, y_pred)
    mae = mean_absolute_error(y_act, y_pred)
    # RMSE computed without the `squared` argument, which was removed in recent scikit-learn releases
    rmse_val = np.sqrt(mean_squared_error(y_act, y_pred))
    return r2, mae, rmse_val

def fit_evaluate_model(model, model_name, X_train, y_train, X_val, y_act_val):
    model, fit_time = fit_model(model, X_train, y_train)
    r2_train, mae_train, rmse_train = evaluate_model(model, X_train, y_train)
    r2_val, mae_val, rmse_val = evaluate_model(model, X_val, y_act_val)
    result_dict = {
        'model_name': model_name,
        'model_name_pretty': type(model).__name__,
        'model_params': model.get_params(),
        'fit_time': fit_time,
        'r2_train': r2_train,
        'mae_train': mae_train,
        'rmse_train': rmse_train,
        'r2_val': r2_val,
        'mae_val': mae_val,
        'rmse_val': rmse_val}
    return model, result_dict

def append_result_df(df, result_dict):
    # DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
    df_result_appended = pd.concat([df, pd.DataFrame([result_dict])], ignore_index=True)
    return df_result_appended

def append_model_dict(dic, model_name, model):
    dic[model_name] = model
    return dic

Build an empty DataFrame to store model results:

In [25]:
df_classics = pd.DataFrame(columns=['model_name',
                                    'model_name_pretty',  # class name of the model, e.g. 'Ridge'
                                    'model_params',
                                    'fit_time',
                                    'r2_train',
                                    'mae_train',
                                    'rmse_train',
                                    'r2_val',
                                    'mae_val',
                                    'rmse_val'])
df_classics

Unnamed: 0,model_name,model_name_pretty,model_params,fit_time,r2_train,mae_train,rmse_train,r2_val,mae_val,rmse_val


## Define the models

Here, we instantiate several classical machine learning models for use.
For demonstration purposes, we instantiate the models with their default model parameters.

Some of the models listed above can perform either regression or classification tasks.
Because our ML task is a regression task (prediction of the continuous-valued target, heat capacity), we choose the regression variant of these models.

Note: the `DummyRegressor()` instance acts as a good sanity check for your ML studies. If your models do not perform significantly better than the `DummyRegressor()`, then you know something has gone awry.

In [26]:
# Build a dictionary that maps short model names to model classes
# (OrderedDict is a dict subclass that remembers insertion order)
classic_model_names = OrderedDict({
    'dumr': DummyRegressor,
    'rr': Ridge,
    'abr': AdaBoostRegressor,
    'gbr': GradientBoostingRegressor,
    'rfr': RandomForestRegressor,
    'etr': ExtraTreesRegressor,
    'svr': SVR,
    'lsvr': LinearSVR,
    'knr': KNeighborsRegressor,
})

## Instantiate and fit the models

Now, we can fit the ML models.

We will loop through each of the models listed above. For each of the models, we will:
* instantiate the model (with default parameters)
* fit the model using the training data
* use the fitted model to generate predictions from the validation data
* evaluate the performance of the model using the predictions
* store the results in a DataFrame for analysis

Note: this may take several minutes, depending on your hardware/software environment, dataset size and featurization scheme (CBFV).
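The steps listed above can be sketched in a compact, self-contained form. The example below uses synthetic data and only two models so that it runs in well under a second; the names `demo_models`, `rows`, and `df_demo` are hypothetical and separate from the helper functions and DataFrames defined in this notebook.

```python
from time import time

import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-in data: a noisy linear target (not the heat capacity dataset)
rng = np.random.default_rng(42)
X_tr, X_va = rng.normal(size=(200, 5)), rng.normal(size=(50, 5))
coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_tr = X_tr @ coef + rng.normal(scale=0.1, size=200)
y_va = X_va @ coef + rng.normal(scale=0.1, size=50)

demo_models = {"dumr": DummyRegressor, "rr": Ridge}

rows = []
for name, model_class in demo_models.items():
    start = time()
    model = model_class()         # instantiate with default parameters
    model.fit(X_tr, y_tr)         # fit on the training data
    fit_time = time() - start
    y_pred = model.predict(X_va)  # predict on the validation data
    rows.append({
        "model_name": name,
        "fit_time": fit_time,
        "r2_val": r2_score(y_va, y_pred),
        "mae_val": mean_absolute_error(y_va, y_pred),
    })

df_demo = pd.DataFrame(rows)      # collect the results for analysis
print(df_demo)
```

Collecting the rows in a list and building the DataFrame once at the end avoids repeatedly copying a growing DataFrame inside the loop.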