## Overview

The task here is to featurize the atoms of molecules in the QM9 dataset (only the first 2000 molecules are considered, to limit the scope) and train a simple ML model that predicts the reported Mulliken charges using linear regression.

* The Mulliken charge is a per-atom quantity: one value is computed for each atom in a molecule, so each atom is one regression sample.

## Load Data

In [1]:
import numpy as np
from GMPFeaturizer import GMPFeaturizer, ASEAtomsConverter, PymatgenStructureConverter
import pickle

with open('QM9_charge.p', 'rb') as handle:
    partial_charge_data = pickle.load(handle)
    
systems = [entry["system"] for entry in partial_charge_data]
charges = [entry["charges"] for entry in partial_charge_data]

## Compute features

In [2]:
GMPs = {
    "GMPs": {
        "orders": [-1, 0, 1, 2, 3],
        "sigmas": [0.1, 0.2, 0.3, 0.4, 0.5],
    },
    "psp_path": "./NC-SR.gpsp",  # path to the pseudopotential file
    "overlap_threshold": 1e-16,  # controls the accuracy of the resulting features
    # "square": False,  # whether the features are squared; no need to change if you are not computing the feature derivatives
}

converter = ASEAtomsConverter()

In [3]:
featurizer = GMPFeaturizer(GMPs=GMPs, calc_derivatives=False)
features = featurizer.prepare_features(systems, cores=5, converter=converter)

2023-06-07 11:11:03,519	INFO worker.py:1518 -- Started a local Ray instance.
100%|██████████████████████████████████████| 2000/2000 [00:02<00:00, 988.85it/s]


## Prepare data for the model

In [4]:
X_list = [entry["features"] for entry in features]
X = np.vstack(X_list)
y = np.concatenate(charges)
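
The stacking step above can be illustrated on toy data. This is only a sketch: `toy_features` and `toy_charges` are hypothetical stand-ins for `X_list` and `charges`, assuming each molecule contributes an array of shape `(n_atoms, n_features)` and a matching array of `n_atoms` charges. The point is that `np.vstack` and `np.concatenate` preserve atom order, so row `i` of `X` lines up with entry `i` of `y`.

```python
import numpy as np

# Hypothetical toy data: a 2-atom and a 3-atom "molecule", 4 features per atom.
toy_features = [np.zeros((2, 4)), np.ones((3, 4))]
toy_charges = [np.array([0.1, -0.1]), np.array([0.2, -0.1, -0.1])]

# Stack per-molecule feature matrices row-wise: one row per atom.
X_toy = np.vstack(toy_features)
# Flatten per-molecule charge arrays into a single per-atom target vector.
y_toy = np.concatenate(toy_charges)

# Every atom must have exactly one target value.
assert X_toy.shape[0] == y_toy.shape[0]
print(X_toy.shape, y_toy.shape)  # (5, 4) (5,)
```

Running the same row-count check on the real `X` and `y` is a cheap way to catch a mismatch between the featurized systems and the loaded charges.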

## Train the regression model and print the score

In [5]:
# !pip install scikit-learn

In [6]:
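from sklearn.linear_model import LinearRegression

# Fit on all atoms; score() returns the coefficient of determination (R^2),
# evaluated here on the same data used for fitting.
reg = LinearRegression().fit(X, y)
reg.score(X, y)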

0.9487785353138086
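
The score above is computed on the same atoms used for fitting, so it may overstate how well the model generalizes. One possible follow-up, sketched here on synthetic data rather than the QM9 features, is to hold out whole molecules with scikit-learn's `GroupShuffleSplit`, so that atoms from one molecule never appear in both the training and test sets. All names ending in `_demo`, the synthetic weights, and the 80/20 split are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in: 50 "molecules" with 3 to 8 "atoms" each, 4 features per atom.
atoms_per_molecule = rng.integers(3, 9, size=50)
X_list_demo = [rng.normal(size=(n, 4)) for n in atoms_per_molecule]
true_weights = np.array([0.5, -0.2, 0.1, 0.3])
y_list_demo = [x @ true_weights + 0.01 * rng.normal(size=len(x)) for x in X_list_demo]

X_demo = np.vstack(X_list_demo)
y_demo = np.concatenate(y_list_demo)

# One group label per atom: the index of the molecule it belongs to.
groups = np.repeat(np.arange(len(X_list_demo)), atoms_per_molecule)

# Split by molecule so no molecule contributes atoms to both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X_demo, y_demo, groups=groups))

reg_demo = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
train_r2 = reg_demo.score(X_demo[train_idx], y_demo[train_idx])
test_r2 = reg_demo.score(X_demo[test_idx], y_demo[test_idx])
print(f"train R^2 = {train_r2:.4f}, held-out R^2 = {test_r2:.4f}")
```

With the real data, `groups` could be built the same way from the lengths of the entries in `X_list`, and the held-out R^2 would then be the number to compare against the training score.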