In [2]:
import __init__

Navigated to package root: /home/cyprien/CrystaLLMv2_PKV
Added package root to Python path


#### 1st pass finetune - Mattergen XRD
- Dataset Source: [Mattergen Alex-MP-20](https://github.com/microsoft/mattergen/tree/main/data-release/alex-mp)
  - Columns:
    - Database (manual)
    - Reduced Formula (Source)
    - CIF (pmg - CifWriter with symprec 0.1)
    - XRD 'Condition Vector' (with [_calculate_XRD.py](_utils/_preprocessing/_calculate_XRD.py))
      - pmg - XRDCalculator(wavelength="CuKa")
      - top 20 most intense peaks selected ($2\theta$ and intensity)
      - Normalisations
        - $2\theta$: min-max scaled with fixed bounds of 0 and 90
        - intensities: min-max scaled with fixed bounds of 0 and 100
- Deduplicated
- Cleaned for CIF augmentation
  - Note: I didn't filter to the context length here because that was not implemented yet. Instead, filtering to the context length was set to True during model training, which has the same effect but is less efficient.
- Dataset pushed to HuggingFace as: c-bone/mattergen_XRD (90:10 train/valid sets)
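
The 'Condition Vector' construction described above can be sketched in a few lines. This is a minimal illustration, not the code in `_calculate_XRD.py`: the function name `build_condition_vector`, the rounding to three decimals, and the layout (20 angles followed by 20 intensities, padded with -100) are assumptions based on the vectors printed later in this notebook. The peak lists would come from something like the `x` and `y` arrays of a pymatgen `XRDCalculator` pattern.

```python
def build_condition_vector(two_theta, intensity, n_peaks=20, pad=-100.0):
    """Return [angles..., intensities...] using fixed-bound min-max scaling.

    Hypothetical helper: the name, the rounding and the padding layout are
    assumptions, not taken from the repository.
    """
    if len(two_theta) != len(intensity):
        raise ValueError("two_theta and intensity must have the same length")

    # Keep the n_peaks most intense reflections, strongest first.
    peaks = sorted(zip(two_theta, intensity), key=lambda p: p[1], reverse=True)
    peaks = peaks[:n_peaks]

    # Min-max scaling with fixed bounds: 2theta over [0, 90], intensity over [0, 100].
    # Peaks outside those bounds are not treated specially in this sketch.
    angles = [round((t - 0.0) / (90.0 - 0.0), 3) for t, _ in peaks]
    heights = [round((i - 0.0) / (100.0 - 0.0), 3) for _, i in peaks]

    # Pad both halves so the vector always has 2 * n_peaks entries.
    n_missing = n_peaks - len(peaks)
    return angles + [pad] * n_missing + heights + [pad] * n_missing


# Illustrative three-peak pattern (made-up numbers, intensities already on a 0-100 scale).
vector = build_condition_vector([25.3, 48.0, 37.8], [100.0, 30.0, 20.0])
print(len(vector), vector[:3], vector[20:23])
```

Using fixed bounds rather than per-pattern minima and maxima keeps the scaled angles comparable between materials.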

In [None]:
!torchrun --nproc_per_node=2 _train.py --config '_config_files/training/conditional/xrd_studies/mattergen_XRD-slider.jsonc'

#### 2nd pass finetune - COD XRD
- Dataset Source: [COD hkl data](https://www.crystallography.net/hkl/)
  - Columns:
    - Database (manual)
    - Reduced Formula (automated extraction from source)
    - CIF
      - automated extraction of the material ID from the COD source
      - converted to a structure using pmg COD.get_structure_by_id
      - CifWriter with symprec 0.1 for the CIF
      - Note: this was done because a lot of COD CIFs aren't in a clean, standard format. Pymatgen already does much of the work of cleaning them up, so we rely on it instead of reinventing the wheel by parsing the CIF data straight from the source.
    - XRD data
      - For every material ID that has experimental hkl data and associated intensities, we extract them
      - Then:
        1. Calculate d_hkl from the crystal structure: structure.lattice.d_hkl([h, k, l])
        2. Use Bragg's law: $\sin\theta = \lambda / (2\,d_{hkl})$
        3. Find $\theta = \arcsin\left(\lambda / (2\,d_{hkl})\right)$
        4. Convert to degrees: $2\theta = 2 \times \theta \times (180/\pi)$, with $\theta$ in radians on the right-hand side
      - Where:
        - $\lambda$: X-ray wavelength ($1.5406$ Å for Cu K$\alpha$)
        - $d_{hkl}$: d-spacing for the (hkl) planes
        - $\theta$: Bragg angle
      - Created 'Condition Vector'
        - top 20 most intense peaks selected ($2\theta$ and intensity)
        - Normalisations
          - $2\theta$: min-max scaled with fixed bounds of 0 and 90
          - intensities: min-max scaled with fixed bounds of 0 and 100
- Filtered out all hydrocarbons
  - symbols = struct.symbol_set
  - if "C" in symbols and "H" in symbols, remove the structure
- Then cleaned for CIF augmentation
  - set the --make_disordered_ordered flag
    - Rounds an occupancy to the nearest integer when it lies within $\pm 0.05$ of that integer. The element set must be preserved exactly; otherwise the structure is discarded.
  - Filtered to a context length of 1024
- Pushed to HuggingFace as c-bone/COD_XRD_small_nohc
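
Three of the preprocessing steps above are simple enough to sketch directly: the Bragg conversion from d-spacing to $2\theta$, the hydrocarbon filter, and the occupancy rounding. This is an illustration under assumptions, not the repository's code: the function names are hypothetical, and returning `None` for reflections that cannot diffract at the given wavelength is a choice made for this sketch only.

```python
import math

CU_KA_WAVELENGTH = 1.5406  # Angstrom, Cu K-alpha as stated above


def two_theta_degrees(d_hkl, wavelength=CU_KA_WAVELENGTH):
    """Bragg's law, sin(theta) = wavelength / (2 * d_hkl); returns 2*theta in degrees."""
    sin_theta = wavelength / (2.0 * d_hkl)
    if sin_theta > 1.0:
        return None  # no real Bragg angle exists at this wavelength
    return 2.0 * math.degrees(math.asin(sin_theta))


def is_hydrocarbon(symbols):
    """True when both C and H are present, the removal criterion used above.

    `symbols` must be a collection of element symbols (set, tuple, list),
    not a single string, otherwise "C" would also match "Cl" or "Ca".
    """
    return "C" in symbols and "H" in symbols


def snap_occupancy(occupancy, tol=0.05):
    """Round to the nearest integer if within tol of it, else leave unchanged."""
    nearest = round(occupancy)
    return float(nearest) if abs(occupancy - nearest) <= tol else occupancy


print(two_theta_degrees(3.52))          # 2theta for an example d-spacing of 3.52 Angstrom
print(is_hydrocarbon({"C", "H", "O"}))  # flagged for removal
print(snap_occupancy(0.97))             # snapped to a full site
```

Note that an occupancy rounded down to 0 would remove a species from the structure, which is presumably the case the element-set check is meant to catch.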

### Training
> **Note**: Here the hyperparameters differ from regular finetuning because this is the 2nd pass. Backbone learning rates were set to decay from $5\times10^{-8}$ to $5\times10^{-10}$, while the learning rates for the newly initialised conditioning parameters were set 100 times higher.
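
To make the two learning-rate groups concrete, here is a small sketch of the schedule. The shape of the decay is not stated above, so a cosine decay is assumed purely for illustration; `total_steps`, the function names, and the choice to keep the conditioning rate at 100 times the backbone rate at every step are likewise assumptions. The actual values live in the training config.

```python
import math

BACKBONE_LR_START = 5e-8   # from the note above
BACKBONE_LR_END = 5e-10    # from the note above
CONDITIONING_SCALE = 100   # conditioning parameters train 100x faster


def decayed_lr(step, total_steps, lr_start, lr_end):
    """Cosine decay from lr_start at step 0 to lr_end at the final step (assumed shape)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))


def learning_rates(step, total_steps):
    """Return the learning rate of each parameter group at a given step."""
    backbone = decayed_lr(step, total_steps, BACKBONE_LR_START, BACKBONE_LR_END)
    return {"backbone": backbone, "conditioning": CONDITIONING_SCALE * backbone}


for step in (0, 500, 1000):
    print(step, learning_rates(step, total_steps=1000))
```

In a PyTorch-style setup, the two entries would correspond to two optimizer parameter groups sharing the same scheduler.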

In [5]:
!python _utils/_preprocessing/_save_dataset_to_HF.py \
    --input_parquet 'HF-databases/COD_dev/COD_xrd_clean_nohc_small.parquet' \
    --output_parquet 'HF-databases/COD_XRD_small_nohc_full.parquet' \
    --valid_size 0.000 \
    --test_size 0.125 \
    --save_hub

Loading Hugging Face API key from API_keys.jsonc
Loading data from HF-databases/COD_dev/COD_xrd_clean_nohc_small.parquet as Parquet with zstd compression
Uploading the dataset shards:   0%|                       | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 100%|█████████| 2/2 [00:00<00:00, 65.69ba/s][A
Uploading the dataset shards: 100%|███████████████| 1/1 [00:01<00:00,  1.23s/it]
Uploading the dataset shards:   0%|                       | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 100%|████████| 1/1 [00:00<00:00, 164.92ba/s][A
Uploading the dataset shards: 100%|███████████████| 1/1 [00:00<00:00,  1.48it/s]
Dataset saved to Hugging Face Hub as c-bone/COD_XRD_small_nohc_full


In [None]:
!torchrun --nproc_per_node=2 _train.py --config '_config_files/training/conditional/xrd_studies/COD_XRD_small-slider-opt.jsonc'

### Generating

In [None]:
!python _utils/_generating/make_prompts.py \
    --HF_dataset 'c-bone/COD_XRD_small_nohc' \
    --split 'test' \
    --automatic \
    --output_parquet '_artifacts/cod-XRD/COD-small-XRD-nohc-test_prompts.parquet' \
    --level 'level_2' \
    --condition_columns 'Condition Vector'

#### Generate materials using only the 1-pass finetuning on Mattergen XRD

In [None]:
import __init__

Navigated to package root: /home/cyprien/CrystaLLMv2_PKV
Added package root to Python path


In [None]:
!python _utils/_generating/generate_CIFs.py --config '_config_files/generation/conditional/xrd_studies/COD-mgen-xrd_eval.jsonc'

python: can't open file '/home/cyprien/CrystaLLMv2_PKV/notebooks/_utils/_generating/generate_CIFs.py': [Errno 2] No such file or directory


In [None]:
!python _utils/_generating/postprocess.py \
    --input_parquet '_artifacts/cod-xrd/COD-mgen-topp-test_gen.parquet' \
    --output_parquet '_artifacts/cod-xrd/COD-mgen-topp-test_post.parquet' \
    --num_workers 32 \
    --column_name 'Generated CIF'

In [None]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/cod-xrd/COD-mgen-topp-test_post.parquet' \
    --num_gens 20 \
    --ref_parquet '_artifacts/cod-xrd/COD-test_ref.parquet' \
    --output_parquet '_artifacts/cod-xrd/COD-mgen-topp-test_metrics.parquet' \
    --num_workers 32 \
    --validity_check "none"

In [None]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/cod-xrd/COD-mgen-topp-test_post.parquet' \
    --num_gens 1 \
    --ref_parquet '_artifacts/cod-xrd/COD-test_ref.parquet' \
    --output_parquet '_artifacts/cod-xrd/COD-mgen-top1p-test_metrics.parquet' \
    --num_workers 32 \
    --validity_check "none"

In [None]:
from _utils import get_metrics_xrd
import pandas as pd

df_test = pd.read_parquet('_artifacts/cod-xrd/COD-test_ref.parquet')
df_metrics = pd.read_parquet('_artifacts/cod-xrd/COD-mgen-topp-test_metrics.parquet')
metrics = get_metrics_xrd(df_metrics, n_test=len(df_test), only_matched=False)

Number of matched structures: 86 / 198
Mean RMS-d: 0.0978
Percent Matched (%): 43.43% (86/198)
a MAE: 0.7486
b MAE: 0.6023
c MAE: 0.9354
Volume MAE: 47.8279
a R^2: 0.8873
b R^2: 0.9117
c R^2: 0.9290
Volume R^2: 0.9842


In [None]:
df_test = pd.read_parquet('_artifacts/cod-xrd/COD-test_ref.parquet')
df_metrics = pd.read_parquet('_artifacts/cod-xrd/COD-mgen-top1p-test_metrics.parquet')
metrics = get_metrics_xrd(df_metrics, n_test=len(df_test), only_matched=False)

Number of matched structures: 61 / 198
Mean RMS-d: 0.0627
Percent Matched (%): 30.81% (61/198)
a MAE: 2.1395
b MAE: 1.9055
c MAE: 2.5930
Volume MAE: 76.2166
a R^2: 0.5288
b R^2: 0.4970
c R^2: 0.5610
Volume R^2: 0.9680
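
For reading the printouts above, it may help to recall the standard definitions of MAE and $R^2$. The sketch below uses those textbook definitions; whether `get_metrics_xrd` computes them in exactly this way, and how it treats unmatched structures, is not shown in this notebook, so treat this as an assumption. The lattice values are made up for illustration.

```python
def mean_absolute_error(true, pred):
    """Average absolute difference between reference and generated values."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def r_squared(true, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot


# Hypothetical lattice parameter `a` (in Angstrom) for three reference/generated pairs.
a_true = [4.6, 3.8, 9.2]
a_pred = [4.5, 3.9, 9.0]

print(f"a MAE: {mean_absolute_error(a_true, a_pred):.4f}")
print(f"a R^2: {r_squared(a_true, a_pred):.4f}")
```

Because $R^2$ is normalised by the spread of the reference values, it can stay high for the volume while being much lower for individual lattice parameters, and it can even turn negative when predictions are worse than predicting the mean.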


#### Generate materials using 2-pass finetuning (Mattergen XRD + COD XRD nohc) but no XRD information fed during inference
> **Note**: The condition_vector column in the prompt df made above was replaced with a series of [-100] missing values, so no XRD information is fed to the model during generation.
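
A minimal sketch of that masking step, assuming the condition vectors are stored as comma-separated strings (as in the vectors printed later in this notebook) and that the masked vector keeps the original length. The helper name `mask_condition_vector` is hypothetical.

```python
MISSING = -100  # value used for missing or padded entries


def mask_condition_vector(condition_vector):
    """Replace every entry of a comma-separated condition vector with -100."""
    n_entries = len(condition_vector.split(","))
    return ",".join([str(MISSING)] * n_entries)


example = "0.28,0.342,-100,1.0,0.528,-100"
print(mask_condition_vector(example))
```

With a pandas DataFrame, the same function could be applied to the whole column, for example with `df["condition_vector"].map(mask_condition_vector)`.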

In [2]:
import __init__


Navigated to package root: /home/cyprien/CrystaLLMv2_PKV
Added package root to Python path


In [None]:
!python _utils/_generating/generate_CIFs.py --config '_config_files/generation/conditional/xrd_studies/COD-xrd_uncond_eval.jsonc'

python: can't open file '/home/cyprien/CrystaLLMv2_PKV/notebooks/_utils/_generating/generate_CIFs.py': [Errno 2] No such file or directory


In [None]:
!python _utils/_generating/postprocess.py \
    --input_parquet '_artifacts/cod-xrd/COD-uncond-topp-test_gen.parquet' \
    --output_parquet '_artifacts/cod-xrd/COD-uncond-topp-test_post.parquet' \
    --num_workers 32 \
    --column_name 'Generated CIF'

In [None]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/cod-xrd/COD-uncond-topp-test_post.parquet' \
    --num_gens 20 \
    --ref_parquet '_artifacts/cod-xrd/COD-test_ref.parquet' \
    --output_parquet '_artifacts/cod-xrd/COD-uncond-topp-test_metrics.parquet' \
    --num_workers 32 \
    --validity_check "none"

In [None]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/cod-xrd/COD-uncond-topp-test_post.parquet' \
    --num_gens 1 \
    --ref_parquet '_artifacts/cod-xrd/COD-test_ref.parquet' \
    --output_parquet '_artifacts/cod-xrd/COD-uncond-top1p-test_metrics.parquet' \
    --num_workers 32 \
    --validity_check "none"

In [3]:
from _utils import get_metrics_xrd
import pandas as pd

df_test = pd.read_parquet('_artifacts/cod-xrd/COD-test_ref.parquet')
df_metrics = pd.read_parquet('_artifacts/cod-xrd/COD-uncond-topp-test_metrics.parquet')
metrics = get_metrics_xrd(df_metrics, n_test=len(df_test), only_matched=False)

Computing metrics on all (also unmatched) structures (198 entries, 77 matched)
Number of matched structures: 77 / 198
Mean RMS-d: 0.1081
Percent Matched (%): 38.89% (77/198)
a MAE: 0.9435
b MAE: 0.7416
c MAE: 1.0863
Volume MAE: 68.7130
a R^2: 0.8617
b R^2: 0.8886
c R^2: 0.9142
Volume R^2: 0.9569


In [4]:
df_test = pd.read_parquet('_artifacts/cod-xrd/COD-test_ref.parquet')
df_metrics = pd.read_parquet('_artifacts/cod-xrd/COD-uncond-top1p-test_metrics.parquet')
metrics = get_metrics_xrd(df_metrics, n_test=len(df_test), only_matched=False)

Computing metrics on all (also unmatched) structures (198 entries, 48 matched)
Number of matched structures: 48 / 198
Mean RMS-d: 0.0693
Percent Matched (%): 24.24% (48/198)
a MAE: 2.2364
b MAE: 2.1577
c MAE: 3.0707
Volume MAE: 95.0437
a R^2: 0.4711
b R^2: 0.4079
c R^2: 0.5081
Volume R^2: 0.9419


#### Generate materials using 2-pass finetuning and XRD information

In [2]:
import __init__

Navigated to package root: /home/cyprien/CrystaLLMv2_PKV
Added package root to Python path


In [None]:
!python _utils/_generating/generate_CIFs.py --config '_config_files/generation/conditional/xrd_studies/COD-xrd_eval.jsonc'

python: can't open file '/home/cyprien/CrystaLLMv2_PKV/notebooks/_utils/_generating/generate_CIFs.py': [Errno 2] No such file or directory


In [None]:
!python _utils/_generating/postprocess.py \
    --input_parquet '_artifacts/cod-xrd/COD-topp-test_gen.parquet' \
    --output_parquet '_artifacts/cod-xrd/COD-topp-test_gen.parquet' \
    --num_workers 32 \
    --column_name 'Generated CIF'

In [None]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/cod-xrd/COD-topp-test_gen.parquet' \
    --num_gens 20 \
    --ref_parquet '_artifacts/cod-xrd/COD-test_ref.parquet' \
    --output_parquet '_artifacts/cod-xrd/COD-topp-test_metrics.parquet' \
    --num_workers 32 \
    --validity_check "none"

In [None]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/cod-xrd/COD-topp-test_gen.parquet' \
    --num_gens 1 \
    --ref_parquet '_artifacts/cod-xrd/COD-test_ref.parquet' \
    --output_parquet '_artifacts/cod-xrd/COD-top1p-test_metrics.parquet' \
    --num_workers 32 \
    --validity_check "none"

In [3]:
from _utils import get_metrics_xrd
import pandas as pd

df_test = pd.read_parquet('_artifacts/cod-xrd/COD-test_ref.parquet')
df_metrics = pd.read_parquet('_artifacts/cod-xrd/COD-topp-test_metrics.parquet')
metrics = get_metrics_xrd(df_metrics, n_test=len(df_test), only_matched=False)

Number of matched structures: 86 / 198
Mean RMS-d: 0.0978
Percent Matched (%): 43.43% (86/198)
a MAE: 0.7486
b MAE: 0.6023
c MAE: 0.9354
Volume MAE: 47.8279
a R^2: 0.8873
b R^2: 0.9117
c R^2: 0.9290
Volume R^2: 0.9842


In [4]:
df_test = pd.read_parquet('_artifacts/cod-xrd/COD-test_ref.parquet')
df_metrics = pd.read_parquet('_artifacts/cod-xrd/COD-top1p-test_metrics.parquet')
metrics = get_metrics_xrd(df_metrics, n_test=len(df_test), only_matched=False)

Number of matched structures: 61 / 198
Mean RMS-d: 0.0627
Percent Matched (%): 30.81% (61/198)
a MAE: 2.1395
b MAE: 1.9055
c MAE: 2.5930
Volume MAE: 76.2166
a R^2: 0.5288
b R^2: 0.4970
c R^2: 0.5610
Volume R^2: 0.9680


### Testing on some real data

- I was given some XRD data calculated by another group for the brookite, anatase and rutile polymorphs of TiO2
- Anatase and rutile were seen during training and finetuning (in the pretraining data and the Mattergen XRD 1st pass finetune dataset); brookite was not
- Can the model generate the correct structures from experimental XRD patterns, both for materials seen in training and for one that is unseen?

1. First, we build a dataset of the true structures, taken from their Materials Project entries
2. To this we add a prompt for each structure
3. And a condition vector, built as shown below

In [1]:
import __init__
from _utils import process_xrd_to_condition_vector

anatase_raw_data = """2θ [°] Cu	Intensity
25.2280719392351	281.55012
30.7984477760649	148.62471
36.4922566866002	122.62704
37.6908921523268	119.93007
41.9139352787292	114.27506
48.0759377770583	93.23776
55.0743148175043	135.06294
62.592362748181	    81.58042
65.9512366402823	78.86014
77.6190112038205	75.32634"""

rutile_raw_data = """2θ [°] Cu	Intensity
23.4685203891387	203.0
27.4456323189006	922.0
30.8154418109335	163.0
36.1036065436626	473.0
39.224627779168	    112.0
41.2593176108602	270.0
44.0079436046877	133.0
46.2453487299531	109.0
54.3526968350901	450.0
56.6773553113041	186.0
62.8816130603703	127.0
64.1161640720417	117.0
69.0374099680916	164.0
69.9017250464323	127.0
82.4052975899525	90.0"""

brookite_raw_data = """2θ [°] Cu	Intensity
21.7068951575405	197.0
25.4158554191106	615.0
30.8494290781386	362.0
36.357093687628	159.0
37.387173876481	130.0
40.2006181429811	139.0
42.4506080337455	123.0
46.2786823244331	116.0
48.1590101936239	180.0
49.2875816475943	124.0
54.3691099189334	142.0
55.3364355858856	166.0
57.3462424126941	95.0
62.2385043787467	93.0
63.7639034593297	107.0
65.1227905836474	107.0
68.8799626318172	80.0
84.4701610611719	103.0"""


# Test the function
anatase = process_xrd_to_condition_vector(anatase_raw_data)
rutile = process_xrd_to_condition_vector(rutile_raw_data)
brookite = process_xrd_to_condition_vector(brookite_raw_data)

print(anatase)
print(rutile)
print(brookite)

Navigated to package root: /home/cyprien/CrystaLLMv2_PKV
Added package root to Python path
Theta scaled to [0,1] (0 to 90), Intensity scaled to [0,1] (relative to max in pattern), -100 for padding
Theta scaled to [0,1] (0 to 90), Intensity scaled to [0,1] (relative to max in pattern), -100 for padding
Theta scaled to [0,1] (0 to 90), Intensity scaled to [0,1] (relative to max in pattern), -100 for padding
0.28,0.342,0.612,0.405,0.419,0.466,0.534,0.695,0.733,0.862,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,1.0,0.528,0.48,0.436,0.426,0.406,0.331,0.29,0.28,0.268,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100
0.305,0.401,0.604,0.458,0.261,0.63,0.767,0.342,0.489,0.699,0.777,0.712,0.436,0.514,0.916,-100,-100,-100,-100,-100,1.0,0.513,0.488,0.293,0.22,0.202,0.178,0.177,0.144,0.138,0.138,0.127,0.121,0.118,0.098,-100,-100,-100,-100,-100
0.282,0.343,0.241,0.535,0.615,0.404,0.604,0.447,0.415,0.548,0.472,0.514,0.708,0.724,0.939,0.637,0.692,0.765,-100,-100,1.0,0.589,0.32,0.293,0.27,0.259,0

In [3]:
import pandas as pd
df = pd.read_parquet('_artifacts/cod-xrd/amil/amil-TiO2_ref_prompts.parquet')
df.head()

| Unnamed: 0 | MP_ehull | CIF | is_in_train | Material ID | name | Prompt | condition_vector |
|---|---|---|---|---|---|---|---|
| 0 | 0.043639 | `# generated using pymatgen\ndata_TiO2\n_symmet...` | True | mp-2657 | Rutile (P4_2/mnm) | `<bos>\ndata_[Ti2O4]\n` | `0.305,0.401,0.604,0.458,0.261,0.63,0.767,0.342...` |
| 1 | 0.0 | `# generated using pymatgen\ndata_TiO2\n_symmet...` | True | mp-390 | Anatase (I4_1/amd) | `<bos>\ndata_[Ti4O8]\n` | `0.28,0.342,0.612,0.405,0.419,0.466,0.534,0.695...` |
| 2 | 0.003041 | `# generated using pymatgen\ndata_TiO2\n_symmet...` | False | mp-1840 | Brookite (Pbca) | `<bos>\ndata_[Ti8O16]\n` | `0.282,0.343,0.241,0.535,0.615,0.404,0.604,0.44...` |


> Note: we can use this as both prompt and ref dfs because it contains all the relevant columns

In [6]:
!python _utils/_generating/generate_CIFs.py \
    --config '_config_files/generation/conditional/xrd_studies/COD-amil-xrd_eval.jsonc'

Environment info
Available GPUs: 1
GPU 0: NVIDIA L4

Generation settings
Total sequences per prompt-condition pair: 20
Will save generated CIFs to _artifacts/cod-xrd/amil/amil-TiO2_gen.parquet
Model's max_length: 1024
Tokenizer validation passed: token vocabulary is consistent.
Generation kwargs: {'max_length': 1024, 'pad_token_id': 371, 'eos_token_id': 373, 'renormalize_logits': True, 'remove_invalid_values': True, 'num_return_sequences': 20, 'do_sample': True, 'top_k': 15, 'top_p': 0.95, 'temperature': 1.0}

Generation Strategy
Number of condition-prompt pairs: 3
Target valid CIFs per prompt: 20
Will save all CIFs ranked by LOGP score (up to 20 per prompt)
Tokenizer validation passed: token vocabulary is consistent.
Tokenizer validation passed: token vocabulary is consistent.
Generating CIFs...:  18%|████▏                  | 11/60 [00:02<00:09,  5.22it/s]Prompt 0: Generated 20 valid CIFs, LOGP scores: 1.1035 to 1.1984
Generating CIFs...:  60%|█████████████▊         | 36/60 [00:06<00:

In [7]:
!python _utils/_generating/postprocess.py \
    --input_parquet '_artifacts/cod-xrd/amil/amil-TiO2_gen.parquet' \
    --output_parquet '_artifacts/cod-xrd/amil/amil-TiO2_gen.parquet' \
    --num_workers 32 \
    --column_name 'Generated CIF'

Processing 60 records using 32 worker(s)
Processing Generated CIF: 100%|█████████████████| 60/60 [00:00<00:00, 94.92it/s]
Multi-worker processing completed: 60 records


Did we recover structures in the 20 generations?

In [8]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/cod-xrd/amil/amil-TiO2_gen.parquet' \
    --num_gens 20 \
    --ref_parquet '_artifacts/cod-xrd/amil/amil-TiO2_ref_prompts.parquet' \
    --output_parquet '_artifacts/cod-xrd/amil/amil-TiO2_metrics.parquet' \
    --num_workers 32 \
    --validity_check "none"

Using 20 generation(s) per compound
Score column detected, will include scores in output
Loaded 3 materials from _artifacts/cod-xrd/amil/amil-TiO2_gen.parquet
Using 3 matched materials from test DB
Parsing true CIFs: 100%|█████████████████████████| 3/3 [00:00<00:00, 404.27it/s]
Processing 60 CIFs across 3 materials
Parsing and sensible check for gen CIFs: 100%|█| 60/60 [00:00<00:00, 239.96it/s]
Materials processed: 3
Materials with sensible structures: 3
Comparing structures: 100%|███████████████████████| 3/3 [00:00<00:00,  3.59it/s]

Results saved to: _artifacts/cod-xrd/amil/amil-TiO2_metrics.parquet

Metrics:
  match_rate: 1.0000
  rms_dist: 0.1654
  n_matched: 3.0000
  a_diff: 0.1738
  b_diff: 0.3734
  c_diff: 0.2987


And what if we take only the most 'confident' one (rank 1 by LOGP score)?

In [13]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/cod-xrd/amil/amil-TiO2_gen.parquet' \
    --num_gens 1 \
    --ref_parquet '_artifacts/cod-xrd/amil/amil-TiO2_ref_prompts.parquet' \
    --output_parquet '_artifacts/cod-xrd/amil/amil-TiO2-top1_metrics.parquet' \
    --num_workers 32 \
    --validity_check "none"

Using 1 generation(s) per compound
Score column detected, will include scores in output
Using rank=1 rows for num_gens=1 (rank column detected)
Loaded 3 materials from _artifacts/cod-xrd/amil/amil-TiO2_gen.parquet
Using 3 matched materials from test DB
Parsing true CIFs: 100%|█████████████████████████| 3/3 [00:00<00:00, 302.34it/s]
Processing 3 CIFs across 3 materials
Parsing and sensible check for gen CIFs: 100%|████| 3/3 [00:00<00:00, 19.00it/s]
Materials processed: 3
Materials with sensible structures: 3
Comparing structures: 100%|███████████████████████| 3/3 [00:00<00:00,  9.00it/s]

Results saved to: _artifacts/cod-xrd/amil/amil-TiO2-top1_metrics.parquet

Metrics:
  match_rate: 0.3333
  rms_dist: 0.0032
  n_matched: 1.0000
  a_diff: 0.5444
  b_diff: 1.3622
  c_diff: 2.5531
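
The two runs above differ only in how many generations per material are considered: all 20, or only the top-ranked one. The sketch below shows how a match rate could be computed in both settings. It assumes a material counts as matched if any considered generation matches the reference, and that a higher score means a better rank; both are assumptions about `XRD_metrics.py`, and the toy data is made up.

```python
def match_rate(results, num_gens):
    """Fraction of materials with at least one match among the top num_gens generations.

    results: {material_id: [(score, matched), ...]} (hypothetical layout)
    """
    n_matched = 0
    for generations in results.values():
        # Rank generations by score, best first (assumes higher score is better).
        ranked = sorted(generations, key=lambda g: g[0], reverse=True)
        if any(matched for _, matched in ranked[:num_gens]):
            n_matched += 1
    return n_matched / len(results)


# Toy data: every material has a matching generation, but only "A" ranks it first.
toy = {
    "A": [(1.20, True), (1.10, False)],
    "B": [(1.30, False), (1.00, True)],
    "C": [(1.25, False), (1.15, True)],
}

print(match_rate(toy, num_gens=2))
print(match_rate(toy, num_gens=1))
```

A large gap between the two numbers suggests that correct structures are being generated but the score used for ranking does not reliably put them first.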


In [14]:
from _utils import get_metrics_xrd
import pandas as pd

df_test = pd.read_parquet('_artifacts/cod-xrd/amil/amil-TiO2_ref_prompts.parquet')
df_metrics = pd.read_parquet('_artifacts/cod-xrd/amil/amil-TiO2_metrics.parquet')
metrics = get_metrics_xrd(df_metrics, n_test=len(df_test), only_matched=False)

Computing metrics on all (also unmatched) structures (3 entries, 3 matched)
Number of matched structures: 3 / 3
Mean RMS-d: 0.1654
Percent Matched (%): 100.00% (3/3)
a MAE: 0.1738
b MAE: 0.3734
c MAE: 0.2987
Volume MAE: 29.6743
a R^2: 0.9795
b R^2: 0.9995
c R^2: 0.9937
Volume R^2: 0.9894


In [15]:
df_test = pd.read_parquet('_artifacts/cod-xrd/amil/amil-TiO2_ref_prompts.parquet')
df_metrics = pd.read_parquet('_artifacts/cod-xrd/amil/amil-TiO2-top1_metrics.parquet')
metrics = get_metrics_xrd(df_metrics, n_test=len(df_test), only_matched=False)

Computing metrics on all (also unmatched) structures (3 entries, 1 matched)
Number of matched structures: 1 / 3
Mean RMS-d: 0.0032
Percent Matched (%): 33.33% (1/3)
a MAE: 0.5444
b MAE: 1.3622
c MAE: 2.5531
Volume MAE: 22.5467
a R^2: 0.7969
b R^2: 0.9883
c R^2: 0.1767
Volume R^2: 0.9715
