In [2]:
import __init__

Navigated to package root: /home/cyprien/CrystaLLMv2_PKV
Added package root to Python path


#### 1st pass finetune - Mattergen XRD
- Dataset Source: [Mattergen Alex-MP-20](https://github.com/microsoft/mattergen/tree/main/data-release/alex-mp)
  - Columns: Database (manual) 
  - Reduced Formula (Source)
  - CIF (pmg - CifWriter with symprec 0.1)
  - XRD 'Condition Vector' (with [_calculate_XRD.py](_utils/_preprocessing/_calculate_XRD.py))
    - pmg - XRDCalculator(wavelength="CuKa")
    - top 20 most intense peaks selected ($2\theta$ and intensity)
    - Normalisations
      - $2\theta$: min-max scaling with fixed bounds 0 and 90
      - intensities: min-max scaling with fixed bounds 0 and 100
- Deduplicated
- Cleaned for CIF augmentation
  - Note: the dataset was not filtered to the context length at this stage because that filter was not implemented yet. Filtering to context length was instead set to True during model training, which has the same effect but is less efficient.
- dataset pushed to HuggingFace as: c-bone/mattergen_XRD (90:10 train/valid sets)
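
The condition-vector construction described above can be sketched as follows. This is a minimal sketch in plain Python, not the code in `_calculate_XRD.py`: the function name `make_condition_vector`, the layout (20 scaled $2\theta$ values followed by 20 scaled intensities), the resulting [0, 1] range, and the use of -100 as padding when fewer than 20 peaks exist are all assumptions made for illustration. The peak list at the end is an illustrative input only.

```python
N_PEAKS = 20      # number of peaks kept, as in the list above
MISSING = -100.0  # ASSUMPTION: padding value used when fewer than N_PEAKS peaks exist


def min_max(value, lower, upper):
    """Min-max scale `value` with fixed bounds (not data-dependent ones)."""
    return (value - lower) / (upper - lower)


def make_condition_vector(two_thetas, intensities, n_peaks=N_PEAKS):
    """Keep the n most intense peaks and scale them with fixed bounds.

    HYPOTHETICAL layout: [2theta_1..2theta_n, intensity_1..intensity_n],
    ordered by decreasing intensity and padded with MISSING.
    """
    # Sort (2theta, intensity) pairs by intensity, strongest first.
    peaks = sorted(zip(two_thetas, intensities), key=lambda p: p[1], reverse=True)
    peaks = peaks[:n_peaks]
    angles = [min_max(t, 0.0, 90.0) for t, _ in peaks]
    heights = [min_max(i, 0.0, 100.0) for _, i in peaks]
    pad = [MISSING] * (n_peaks - len(peaks))
    return angles + pad + heights + pad


# Illustrative three-peak pattern (2theta in degrees, intensity on a 0-100 scale).
vector = make_condition_vector([47.30, 28.44, 56.12], [55.0, 100.0, 30.0])
```

With pymatgen, the two input lists would typically come from the `x` and `y` attributes of the pattern returned by `XRDCalculator(wavelength="CuKa").get_pattern(structure)`.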

In [None]:
!torchrun --nproc_per_node=2 _train.py --config '_config_files/training/conditional/xrd_studies/mattergen_XRD-slider.jsonc'

#### 2nd pass finetune - COD XRD
- Dataset Source: [COD hkl data](https://www.crystallography.net/hkl/)
  - Columns: Database (manual) 
  - Reduced Formula (automated extraction from source)
  - CIF
    - automated extraction of material id from COD source
    - converted to a structure using pmg COD.get_structure_by_id
    - CifWriter with symprec 0.1 for CIF
    - Note: this was done because a lot of COD CIFs aren't in a clean, standard format. Pymatgen already does much of the work of cleaning them up, so there is no need to reinvent the wheel by taking CIF data straight from the source.
  - XRD data
    - For every Material ID that has experimental hkl data and associated intensities, we extract it
    - Then:
      1. Calculate $d_{hkl}$ from the crystal structure with `structure.lattice.d_hkl([h, k, l])`
      2. Use Bragg's law (first order): $\sin\theta = \lambda / (2 d_{hkl})$
      3. Solve for the Bragg angle: $\theta = \arcsin(\lambda / (2 d_{hkl}))$
      4. Convert to degrees: $2\theta = 2 \times \theta \times (180/\pi)$, with $\theta$ in radians
    - Where:
      - $\lambda$: X-ray wavelength ($1.5406$ $\AA$ for Cu K$\alpha$)
      - $d_{hkl}$: d-spacing for the (hkl) planes
      - $\theta$: Bragg angle
    - Created 'Condition Vector'
      - top 20 most intense peaks selected ($2\theta$ and intensity)
      - Normalisations
        - $2\theta$: min-max scaling with fixed bounds 0 and 90
        - intensities: min-max scaling with fixed bounds 0 and 100
  - Filtered out all structures containing both C and H (referred to here as hydrocarbons, hence `nohc`)
    - symbols = struct.symbol_set
    - if "C" in symbols and "H" in symbols, the structure is removed
  - Then cleaning for CIF augmentation
    - set --make_disordered_ordered flag
      - Sets each occupancy to the nearest integer when it lies within $\pm 0.05$ of that integer. The element set must be preserved exactly; otherwise the structure is discarded.
    - Filtered to a context length of 1024
  - Pushed to HuggingFace as c-bone/COD_XRD_small_nohc
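
The Bragg's-law conversion and the C/H filter listed above can be sketched in a few lines. This is a minimal sketch, not the preprocessing code itself: the function names are hypothetical, the d-spacing is passed in directly (in the pipeline it comes from `structure.lattice.d_hkl`), and the cubic cell used in the worked example is an illustrative input, not a value from the dataset.

```python
import math

CU_KA_WAVELENGTH = 1.5406  # angstrom, Cu K-alpha as stated above


def two_theta_from_d(d_hkl, wavelength=CU_KA_WAVELENGTH):
    """Return 2-theta in degrees for a d-spacing in angstrom (first-order Bragg).

    Returns None when lambda / (2 d) > 1, i.e. the reflection cannot be
    reached at this wavelength.
    """
    sin_theta = wavelength / (2.0 * d_hkl)
    if sin_theta > 1.0:
        return None
    return 2.0 * math.degrees(math.asin(sin_theta))


def contains_c_and_h(symbols):
    """Mirror the filter above: True if both C and H are present."""
    return "C" in symbols and "H" in symbols


# Worked example: cubic cell with a = 5.431 angstrom, (111) reflection.
# For a cubic lattice, d_hkl = a / sqrt(h^2 + k^2 + l^2).
a = 5.431
d_111 = a / math.sqrt(1 + 1 + 1)
two_theta_111 = two_theta_from_d(d_111)

keep_carbonate = not contains_c_and_h({"Ca", "C", "O"})  # no H, so it is kept
keep_organic = not contains_c_and_h({"C", "H", "O"})     # C and H, so it is removed
```

Note that the filter removes any entry containing both elements, so it also drops compounds that are not hydrocarbons in the strict sense.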

### Training
> **Note**: Here the hyperparameters change compared to regular finetuning because this is the 2nd pass. Backbone learning rates were set to decay from $5\times10^{-8}$ to $5\times10^{-10}$, while the learning rates for the newly initialised conditioning parameters were set 100 times higher.
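
To make the two learning-rate groups concrete, here is a minimal sketch. Only the start value, the end value and the factor of 100 come from the note above; the cosine shape of the decay, the assumption that the conditioning parameters follow the same schedule at 100 times the value, and all names are illustrative and may differ from what the training config actually does.

```python
import math

BACKBONE_LR_START = 5e-8          # from the note above
BACKBONE_LR_END = 5e-10           # from the note above
CONDITIONING_LR_MULTIPLIER = 100  # from the note above


def cosine_decay(step, total_steps, lr_start, lr_end):
    """Cosine interpolation from lr_start (step 0) to lr_end (last step).

    ASSUMPTION: the decay shape is not stated in the note; cosine is used
    here only as a common choice.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))


def learning_rates(step, total_steps):
    """Return the learning rate of each parameter group at a given step."""
    backbone = cosine_decay(step, total_steps, BACKBONE_LR_START, BACKBONE_LR_END)
    return {
        "backbone": backbone,
        "conditioning": backbone * CONDITIONING_LR_MULTIPLIER,
    }


start = learning_rates(0, 1000)
end = learning_rates(1000, 1000)
```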

In [5]:
!python _utils/_preprocessing/_save_dataset_to_HF.py \
    --input_parquet 'HF-databases/COD_dev/COD_xrd_clean_nohc_small.parquet' \
    --output_parquet 'HF-databases/COD_XRD_small_nohc_full.parquet' \
    --valid_size 0.000 \
    --test_size 0.125 \
    --save_hub

Loading Hugging Face API key from API_keys.jsonc
Loading data from HF-databases/COD_dev/COD_xrd_clean_nohc_small.parquet as Parquet with zstd compression
Uploading the dataset shards:   0%|                       | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 100%|█████████| 2/2 [00:00<00:00, 65.69ba/s][A
Uploading the dataset shards: 100%|███████████████| 1/1 [00:01<00:00,  1.23s/it]
Uploading the dataset shards:   0%|                       | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 100%|████████| 1/1 [00:00<00:00, 164.92ba/s][A
Uploading the dataset shards: 100%|███████████████| 1/1 [00:00<00:00,  1.48it/s]
Dataset saved to Hugging Face Hub as c-bone/COD_XRD_small_nohc_full


In [None]:
!torchrun --nproc_per_node=2 _train.py --config '_config_files/training/conditional/xrd_studies/COD_XRD_small-slider.jsonc'

### Generating

In [None]:
!python _utils/_generating/make_prompts.py \
    --HF_dataset 'c-bone/COD_XRD_small_nohc' \
    --split 'test' \
    --automatic \
    --output_parquet '_artifacts/cod-XRD/COD-small-XRD-nohc-test_prompts.parquet' \
    --level 'level_2' \
    --condition_columns 'Condition Vector'

#### Generate materials using only 1-pass finetuning on Mattergen XRD

In [None]:
!python _utils/_generating/generate_CIFs.py --config '_config_files/generation/conditional/xrd_studies/COD-small-mgen_XRD_eval.jsonc'

In [None]:
!python _utils/_generating/postprocess.py \
    --input_parquet '_artifacts/cod-XRD/COD-small-XRD-test-nohc-mgen-10T15K_gen.parquet' \
    --output_parquet '_artifacts/cod-XRD/COD-small-XRD-test-nohc-mgen-10T15K_gen.parquet' \
    --num_workers 32 \
    --column_name 'Generated CIF'

In [None]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/cod-XRD/COD-small-XRD-test-nohc-mgen-10T15K_gen.parquet' \
    --num_gens 20 \
    --ref_parquet '_artifacts/cod-XRD/COD-small-XRD-nohc-test_ref.parquet' \
    --output_parquet '_artifacts/cod-XRD/COD-small-XRD-test-nohc-mgen-10T15K_metrics.parquet' \
    --num_workers 24

In [9]:
from _utils import get_metrics
import pandas as pd

df_test = pd.read_parquet('_artifacts/cod-XRD/COD-small-XRD-nohc-test_ref.parquet')
df_metrics = pd.read_parquet('_artifacts/cod-XRD/COD-small-XRD-test-nohc-mgen-10T15K_metrics.parquet')
metrics = get_metrics(df_metrics, n_test=len(df_test), only_matched=False)

Number of matched structures: 42 / 198
Mean RMS-d: 0.0422
Percent Matched (%): 21.21% (42/198)
a MAE: 1.1414
b MAE: 1.1397
c MAE: 1.1404
Volume MAE: 88.1647
a R^2: 0.7690
b R^2: 0.8077
c R^2: 0.8981
Volume R^2: 0.9420


#### Generate materials using 2-pass finetuning (Mattergen XRD + COD XRD nohc) but no XRD information fed during inference
> **Note**: the condition vector column (`Condition Vector`) in the prompt df made above was replaced with a series of [-100] missing values, meaning no XRD information is fed during generation.
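
The replacement described in the note can be sketched as below. This is a hedged illustration in plain Python on a list of row dictionaries rather than the actual prompt DataFrame: the helper name `drop_xrd_condition`, the `Prompt` key and its value, and the choice to keep each vector at its original length are assumptions; only the column name `Condition Vector` and the -100 marker come from this notebook.

```python
MISSING = -100  # missing-value marker named in the note above


def drop_xrd_condition(rows, column="Condition Vector"):
    """Return copies of the prompt rows with every condition entry set to MISSING.

    ASSUMPTION: each vector keeps its original length.
    """
    masked = []
    for row in rows:
        new_row = dict(row)  # shallow copy so the input rows stay untouched
        new_row[column] = [MISSING] * len(row[column])
        masked.append(new_row)
    return masked


# Hypothetical prompt row with a shortened condition vector.
prompts = [{"Prompt": "<prompt text>", "Condition Vector": [0.32, 0.53, 1.0, 0.55]}]
unconditional = drop_xrd_condition(prompts)
```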

In [None]:
!python _utils/_generating/generate_CIFs.py --config '_config_files/generation/conditional/xrd_studies/COD-small_XRD_uncond_eval.jsonc'

In [None]:
!python _utils/_generating/postprocess.py \
    --input_parquet '_artifacts/cod-XRD/COD-small-XRD-nohc-test-uncond_gen.parquet' \
    --output_parquet '_artifacts/cod-XRD/COD-small-XRD-nohc-test-uncond_gen.parquet' \
    --num_workers 32 \
    --column_name 'Generated CIF'

In [None]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/cod-XRD/COD-small-XRD-nohc-test-uncond_gen.parquet' \
    --num_gens 20 \
    --ref_parquet '_artifacts/cod-XRD/COD-small-XRD-nohc-test_ref.parquet' \
    --output_parquet '_artifacts/cod-XRD/COD-small-XRD-nohc-test-uncond_metrics.parquet' \
    --num_workers 24

In [13]:
from _utils import get_metrics
import pandas as pd

df_test = pd.read_parquet('_artifacts/cod-XRD/COD-small-XRD-nohc-test_ref.parquet')
df_metrics = pd.read_parquet('_artifacts/cod-XRD/COD-small-XRD-nohc-test-uncond_metrics.parquet')
metrics = get_metrics(df_metrics, n_test=len(df_test), only_matched=False)

Number of matched structures: 55 / 198
Mean RMS-d: 0.0465
Percent Matched (%): 27.78% (55/198)
a MAE: 0.9464
b MAE: 1.0858
c MAE: 1.2233
Volume MAE: 79.8072
a R^2: 0.8337
b R^2: 0.7740
c R^2: 0.8811
Volume R^2: 0.9603


#### Generate materials using 2-pass finetuning (Mattergen XRD + COD XRD nohc) with XRD information fed during inference

In [None]:
!python _utils/_generating/generate_CIFs.py --config '_config_files/generation/conditional/xrd_studies/COD-small_XRD_eval.jsonc'

In [None]:
!python _utils/_generating/postprocess.py \
    --input_parquet '_artifacts/cod-XRD/COD-small-XRD-test-nohc-10T15K_gen.parquet' \
    --output_parquet '_artifacts/cod-XRD/COD-small-XRD-test-nohc-10T15K_gen.parquet' \
    --num_workers 32 \
    --column_name 'Generated CIF'

In [None]:
!python _utils/_metrics/XRD_metrics.py \
    --input_parquet '_artifacts/cod-XRD/COD-small-XRD-test-nohc-10T15K_gen.parquet' \
    --num_gens 20 \
    --ref_parquet '_artifacts/cod-XRD/COD-small-XRD-nohc-test_ref.parquet' \
    --output_parquet '_artifacts/cod-XRD/COD-small-XRD-test-nohc-10T15K_metrics.parquet' \
    --num_workers 24

In [14]:
from _utils import get_metrics
import pandas as pd

df_test = pd.read_parquet('_artifacts/cod-XRD/COD-small-XRD-nohc-test_ref.parquet')
df_metrics = pd.read_parquet('_artifacts/cod-XRD/COD-small-XRD-test-nohc-10T15K_metrics.parquet')
metrics = get_metrics(df_metrics, n_test=len(df_test), only_matched=False)

Number of matched structures: 82 / 198
Mean RMS-d: 0.0520
Percent Matched (%): 41.41% (82/198)
a MAE: 0.8390
b MAE: 0.6970
c MAE: 1.0016
Volume MAE: 62.4154
a R^2: 0.8662
b R^2: 0.9072
c R^2: 0.9276
Volume R^2: 0.9648
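
The match rates from the three evaluation outputs above can be put side by side. The snippet only reuses the matched counts printed above (42, 55 and 82 out of 198); the dictionary keys are the generation config names used in each run, and the variable names are illustrative.

```python
N_TEST = 198  # size of the test set reported in each output above

# Matched structures per run, keyed by the generation config name.
matched = {
    "COD-small-mgen_XRD_eval": 42,
    "COD-small_XRD_uncond_eval": 55,
    "COD-small_XRD_eval": 82,
}

match_rates = {name: 100.0 * count / N_TEST for name, count in matched.items()}

for name, rate in match_rates.items():
    print(f"{name}: {rate:.2f}%")
```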
