In [2]:
import __init__

Navigated to package root: /home/cyprien/CrystaLLMv2_PKV
Added package root to Python path


#### Jarvis-DFT
- Dataset: Source [JDFT](https://jarvis-materials-design.github.io/dbdocs/thedownloads/)
- Columns:
  - Database (manual)
  - **Material ID** (from source)
  - Reduced Formula (pymatgen: `structure.composition.reduced_formula`)
  - CIF (pymatgen: `CifWriter` with `symprec=0.1`)

### Preprocess
- Random 90:10 train/test split, as in the benchmark (no separate evaluation set)
- Generate XRD condition vectors using
  - pymatgen `XRDCalculator(wavelength="CuKa")`
  - the 20 most intense peaks ($2\theta$ and intensity)
  - Normalisation
    - $2\theta$: min-max scaled over the range 0 to 90
    - intensities: min-max scaled over the range 0 to 100
- Cleaned for CIF augmentation
  - Note: the training data is filtered to the context length (1024 tokens). For benchmarking, however, we compare against all test structures in the database, including those dropped during cleaning (across the full dataset, 16 CIFs were unparseable and 833 exceeded the context length).
- Saved to the Hugging Face Hub as c-bone/jarvis-XRD
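
The peak selection and normalisation described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the code in `_calculate_XRD.py`: the function name `xrd_condition_vector`, the vector layout (all $2\theta$ values followed by all intensities) and the padding value are assumptions made for the example. In practice the inputs would be the `x` and `y` arrays of the pattern returned by pymatgen's `XRDCalculator.get_pattern`.

```python
def xrd_condition_vector(two_theta, intensity, n_peaks=20,
                         theta_max=90.0, intensity_max=100.0, pad_value=-100.0):
    """Build a fixed-length vector from the n_peaks most intense XRD peaks.

    The layout (2-theta block, then intensity block) and pad_value are
    assumptions for this sketch, not the project's confirmed format.
    """
    if len(two_theta) != len(intensity):
        raise ValueError("two_theta and intensity must have the same length")
    # Sort peaks by decreasing intensity and keep the strongest n_peaks.
    peaks = sorted(zip(two_theta, intensity), key=lambda p: p[1], reverse=True)[:n_peaks]
    # Min-max scaling with fixed bounds: 0..theta_max and 0..intensity_max map to 0..1.
    thetas = [t / theta_max for t, _ in peaks]
    ints = [i / intensity_max for _, i in peaks]
    # Pad so that every structure yields a vector of length 2 * n_peaks.
    padding = [pad_value] * (n_peaks - len(peaks))
    return thetas + padding + ints + padding


# Toy pattern with three peaks; the strongest peak (45 degrees) comes first.
vec = xrd_condition_vector([10.0, 45.0, 90.0], [50.0, 100.0, 25.0])
print(len(vec), vec[:3], vec[20:23])
```

Using fixed bounds rather than per-pattern minima and maxima keeps the scaling identical across structures, so the same $2\theta$ value always maps to the same number.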

In [None]:
import pandas as pd
import numpy as np
np.random.seed(1)

df = pd.read_parquet('HF-databases/jarvis-XRD-processing-2/jarvis-XRD-unproc.parquet')

# Random split for benchmarking, because the DiffractGPT paper provides no train/test split
# 90% train, 10% test is the ratio used in the paper
df['Split'] = np.random.choice(['train', 'test'], size=len(df), p=[0.9, 0.1])
df.to_parquet('HF-databases/jarvis-XRD-processing-2/jarvis-XRD-unproc.parquet', index=False)

# Make the test set parquet for benchmarking
df_test = df[df['Split'] == 'test']
df_test.to_parquet('_artifacts/jarvis-XRD/jarvis-test_ref.parquet', index=False)
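
Because each row is assigned to a split independently, the resulting ratio is only approximately 90:10. In this run the reference test set holds 7576 of 75993 entries, about 9.97%. The sketch below illustrates the same per-row assignment with the standard library `random` module as a stand-in for `np.random.choice`; the helper name `assign_split` is hypothetical, and the exact counts it produces will differ from the NumPy run.

```python
import random


def assign_split(n_rows, test_fraction=0.1, seed=1):
    """Label each row 'train' or 'test' independently, with a fixed seed."""
    rng = random.Random(seed)
    return ["test" if rng.random() < test_fraction else "train" for _ in range(n_rows)]


splits = assign_split(75993)
n_test = splits.count("test")
# The fraction is close to, but generally not exactly, 0.1.
print(n_test, round(n_test / len(splits), 4))
```

If an exact 10% test set were required, shuffling the row indices and slicing off the first tenth would be one alternative.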

In [11]:
!python _utils/_preprocessing/_calculate_XRD.py \
    --input_parquet HF-databases/jarvis-XRD-processing-2/jarvis-XRD-unproc.parquet \
    --output_parquet HF-databases/jarvis-XRD-processing-2/jarvis-XRD-unproc.parquet \
    --num_workers 32

Loading database from HF-databases/jarvis-XRD-processing-2/jarvis-XRD-unproc.parquet
Loaded 75993 entries
Processing XRD patterns for the entries
Generating XRD patterns: 100%|███| 75993/75993 [03:34<00:00, 354.67structures/s]
Computing condition vectors from XRD patterns
Processing XRD patterns: 100%|███| 75993/75993 [00:01<00:00, 57628.38patterns/s]
Theta range: 0-90, Intensity range: 0-100
Saving results to HF-databases/jarvis-XRD-processing-2/jarvis-XRD-unproc.parquet
done


In [21]:
!python _utils/_preprocessing/_cleaning.py \
    --input_parquet HF-databases/jarvis-XRD-processing-2/jarvis-XRD-unproc.parquet \
    --output_parquet HF-databases/jarvis-XRD-processing-2/jarvis-XRD-clean.parquet \
    --property_columns "['condition_vector']" \
    --num_workers 32 \
    --filter_to 1024

Loading data from HF-databases/jarvis-XRD-processing-2/jarvis-XRD-unproc.parquet as Parquet with zstd compression...

Normalizing property columns

Lets augment the CIFs now (parallelizing sometimes takes a min before speeding up
Number of CIFs before preprocessing: 75993
Number of workers: 32
100%|███████████████████████████████████| 75993/75993 [00:28<00:00, 2710.29it/s]
Number of CIFs before filtering out bad ones:  75993
Number of CIFs after filtering: 75977

Filtering dataframe of len 75977 to context length 1024
Tokenizer validation passed: token vocabulary is consistent.
Filtered dataframe length: 75144

Saving updated dataframe with len 75144 to HF-databases/jarvis-XRD-processing-2/jarvis-XRD-clean.parquet...
Preprocessing completed successfully.


In [23]:
!python _utils/_preprocessing/_save_dataset_to_HF.py \
    --input_parquet HF-databases/jarvis-XRD-processing-2/jarvis-XRD-clean.parquet \
    --output_parquet HF-databases/jarvis-XRD-processing-2/jarvis-XRD.parquet \
    --save_hub

Loading Hugging Face API key from API_keys.jsonc
Loading data from HF-databases/jarvis-XRD-processing-2/jarvis-XRD-clean.parquet as Parquet with zstd compression
Splitting dataset according to the 'Split' column
Train columns: ['Database', 'Material ID', 'Reduced Formula', 'CIF', 'condition_vector']
Uploading the dataset shards:   0%|                       | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format:   0%|                | 0/68 [00:00<?, ?ba/s]
Creating parquet from Arrow format: 100%|██████| 68/68 [00:00<00:00, 446.47ba/s]
Uploading the dataset shards: 100%|███████████████| 1/1 [00:02<00:00,  2.91s/it]
Uploading the dataset shards:   0%|                       | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 100%|████████| 8/8 [00:00<00:00, 387.95ba/s]
Uploading the dataset shards: 100%|███████████████| 1/1 [00:01<00:00,  1.24s/it]
Dataset saved to Hugging Face Hub as c-bone/jarvis-XRD


### Training

In [None]:
!torchrun --nproc_per_node=2 _train.py --config '_config_files/training/conditional/xrd_studies/jarvis-xrd-slider-opt.jsonc'

### Generating

In [None]:
!python _utils/_generating/make_prompts.py \
    --HF_dataset 'c-bone/jarvis-XRD' \
    --split 'test' \
    --automatic \
    --output_parquet '_artifacts/jarvis-XRD/jarvis-test_prompts.parquet' \
    --level 'level_3' \
    --condition_columns 'condition_vector'


In [None]:
!python _utils/_generating/generate_CIFs.py --config '_config_files/generation/conditional/xrd_studies/jarvis-xrd_eval.jsonc'

In [None]:
!python _utils/_generating/postprocess.py \
    --input_parquet '_artifacts/jarvis-XRD/jarvis-ft-20perp-test_gen.parquet' \
    --output_parquet '_artifacts/jarvis-XRD/jarvis-ft-20perp-test_post.parquet' \
    --num_workers 32 \
    --column_name 'Generated CIF'

In [3]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/jarvis-XRD/jarvis-ft-20perp-test_post.parquet' \
    --num_gens 20 \
    --ref_parquet '_artifacts/jarvis-XRD/jarvis-test_ref.parquet' \
    --output_parquet '_artifacts/jarvis-XRD/jarvis-ft-20perp-test_metrics.parquet' \
    --num_workers 16 \
    --validity_check 'none'

Using 20 generation(s) per compound
Using 16 workers for parallel processing (based on input size)
Loaded 7485 materials from _artifacts/jarvis-XRD/jarvis-ft-20perp-test_post.parquet
Using 7485 matched materials from test DB
Parsing true CIFs: 100%|███████████████████| 7485/7485 [00:38<00:00, 196.86it/s]
Processing 149700 CIFs across 7485 materials
Parsing and sensible check for gen CIFs: 100%|█| 149700/149700 [01:20<00:00, 184
Materials processed: 7485
Materials with sensible structures: 7485
Comparing structures: 100%|█████████████████| 7485/7485 [04:43<00:00, 26.40it/s]

Results saved to: _artifacts/jarvis-XRD/jarvis-ft-20perp-test_metrics.parquet

Metrics:
  match_rate: 0.8651
  rms_dist: 0.0361
  n_matched: 6475.0000
  a_diff: 0.2866
  b_diff: 0.2718
  c_diff: 0.4193
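
With `--num_gens 20`, 149700 generated CIFs are reduced to one result per material. The sketch below shows one plausible way to do that reduction. It assumes a material counts as matched if at least one of its generations matches the reference, and that the lowest RMS distance among the matching generations is kept. That aggregation rule, the helper name `best_of_n` and the toy data are assumptions for illustration; the actual logic lives in `XRD_metrics.py`.

```python
def best_of_n(results):
    """Aggregate per-generation results into per-material metrics.

    results: iterable of (material_id, rms_dist) pairs, one per generated CIF,
    where rms_dist is None when the generation did not match the reference.
    """
    best = {}
    for material_id, rms in results:
        best.setdefault(material_id, None)
        # Keep the smallest RMS distance seen so far for this material.
        if rms is not None and (best[material_id] is None or rms < best[material_id]):
            best[material_id] = rms
    matched = [r for r in best.values() if r is not None]
    match_rate = len(matched) / len(best) if best else 0.0
    mean_rms = sum(matched) / len(matched) if matched else float("nan")
    return match_rate, mean_rms, len(matched)


# Toy data: A matches twice, B never matches, C matches once.
toy = [("A", None), ("A", 0.05), ("A", 0.02),
       ("B", None), ("B", None),
       ("C", 0.10)]
match_rate, mean_rms, n_matched = best_of_n(toy)
print(match_rate, mean_rms, n_matched)
```

Under this rule, increasing the number of generations per prompt can only keep or raise the match rate, which is consistent with the gap between the 20-generation and 1-generation runs reported in this notebook.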


In [None]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/jarvis-XRD/jarvis-ft-20perp-test_post.parquet' \
    --num_gens 1 \
    --ref_parquet '_artifacts/jarvis-XRD/jarvis-test_ref.parquet' \
    --output_parquet '_artifacts/jarvis-XRD/jarvis-ft-1perp-test_metrics.parquet' \
    --num_workers 24 \
    --validity_check 'none'

Using 1 generation(s) per compound
Using 16 workers for parallel processing (based on input size)
Using rank=1 rows for num_gens=1 (rank column detected)
Loaded 7485 materials from _artifacts/jarvis-XRD/jarvis-ft-20perp-test_post.parquet
Using 7485 matched materials from test DB
Parsing true CIFs: 100%|███████████████████| 7485/7485 [00:37<00:00, 199.73it/s]
Processing 7485 CIFs across 7485 materials
Parsing and sensible check for gen CIFs: 100%|█| 7485/7485 [00:04<00:00, 1522.74
Materials processed: 7485
Materials with sensible structures: 7485
Comparing structures:  37%|█████▉          | 2753/7485 [00:19<00:23, 204.72it/s]

### Metrics

In [1]:
import __init__
from _utils import get_metrics_xrd
import pandas as pd

# Reference test set: n_test counts every test structure, including those dropped during cleaning
df_test = pd.read_parquet('_artifacts/jarvis-XRD/jarvis-test_ref.parquet')

# 20 generations per prompt
df_metrics_20 = pd.read_parquet('_artifacts/jarvis-XRD/jarvis-ft-20perp-test_metrics.parquet')
metrics_20 = get_metrics_xrd(df_metrics_20, n_test=len(df_test), only_matched=False)

# 1 generation per prompt
df_metrics_1 = pd.read_parquet('_artifacts/jarvis-XRD/jarvis-ft-1perp-test_metrics.parquet')
metrics_1 = get_metrics_xrd(df_metrics_1, n_test=len(df_test), only_matched=False)

Navigated to package root: /home/cyprien/CrystaLLMv2_PKV
Added package root to Python path
Computing metrics on all (also unmatched) structures (7485 entries, 6475 matched)
Number of matched structures: 6475 / 7576
Mean RMS-d: 0.0361
Percent Matched (%): 85.47% (6475/7576)
a MAE: 0.2866
b MAE: 0.2718
c MAE: 0.4193
Volume MAE: 14.7139
a R^2: 0.9217
b R^2: 0.9266
c R^2: 0.9482
Volume R^2: 0.8809
Average Score: 1.1458
Computing metrics on all (also unmatched) structures (7485 entries, 5017 matched)
Number of matched structures: 5017 / 7576
Mean RMS-d: 0.0347
Percent Matched (%): 66.22% (5017/7576)
a MAE: 0.6641
b MAE: 0.6946
c MAE: 1.1369
Volume MAE: 20.9998
a R^2: 0.7780
b R^2: 0.7437
c R^2: 0.8144
Volume R^2: 0.8435
Average Score: 1.1186
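
The two output blocks above correspond to 20 generations per prompt (first) and 1 generation per prompt (second). The match rate printed by `XRD_metrics.py` (0.8651) and the percentage printed here (85.47%) differ only in their denominator: the former divides by the 7485 evaluated materials, the latter by the 7576 structures in the reference test set. The short check below reproduces both from the counts in the logs; the variable names are chosen for this example.

```python
n_evaluated = 7485   # materials in the post-processed generation file
n_test_ref = 7576    # structures in jarvis-test_ref.parquet
matched = {"20 per prompt": 6475, "1 per prompt": 5017}

rates = {}
for label, n in matched.items():
    rates[label] = {
        "of_evaluated": round(n / n_evaluated, 4),
        "percent_of_reference": round(100 * n / n_test_ref, 2),
    }
    print(label, rates[label])

# Test structures that never reached evaluation count as unmatched
# when the reference set is used as the denominator.
n_missing = n_test_ref - n_evaluated
print("not evaluated:", n_missing)
```

Dividing by the full reference set is the stricter choice, since the 91 test structures that did not reach evaluation are treated as failures rather than ignored.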
