# Notebook to show example preprocessing steps

### First, we go to the correct directory

In [3]:
import pandas as pd
import __init__

Navigated to package root: /home/cyprien/CrystaLLMv2_PKV
Added package root to Python path


In [4]:
df = pd.read_parquet('/home/cyprien/Data_Gen/datasets_dev/mattergen_props.parquet')
# show column names
print(df.columns)

Index(['material_id', 'space_group', 'chemical_system', 'num_sites',
       'energy_above_hull', 'dft_band_gap', 'dft_bulk_modulus',
       'dft_mag_density', 'hhi_score', 'ml_bulk_modulus', 'Database',
       'Reduced Formula', 'CIF', 'ALIGNN_BG', 'Density (g/cm^3)'],
      dtype='object')


### Given the dataframe with the above columns, let's say we want to make the Mattergen Density dataset

Columns you need:
- For unconditional training (just structure prediction): symmetrised CIFs (these can be generated from a pymatgen structure), the Reduced Formula for each CIF, and preferably their material IDs if they come from a database.
- For conditional training: in addition, label each CIF with a functional property value (such as density, here also taken from pymatgen calculators).
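
Before running the pipeline, it can help to confirm that the dataframe actually has the columns listed above. The snippet below is a minimal sketch: the helper `missing_columns` is hypothetical (not part of the package), and the column names are copied from the `df.columns` output shown earlier.

```python
# Column names copied from the printed df.columns output above.
available = [
    "material_id", "space_group", "chemical_system", "num_sites",
    "energy_above_hull", "dft_band_gap", "dft_bulk_modulus",
    "dft_mag_density", "hhi_score", "ml_bulk_modulus", "Database",
    "Reduced Formula", "CIF", "ALIGNN_BG", "Density (g/cm^3)",
]

# Columns needed for unconditional training, per the list above.
UNCONDITIONAL = ["CIF", "Reduced Formula", "material_id"]


def missing_columns(available, required):
    """Hypothetical helper: return the required columns that are absent."""
    return [col for col in required if col not in available]


# Conditional training additionally needs the property column(s).
conditional = UNCONDITIONAL + ["Density (g/cm^3)"]

missing_uncond = missing_columns(available, UNCONDITIONAL)
missing_cond = missing_columns(available, conditional)
print("Missing for unconditional:", missing_uncond)
print("Missing for conditional:", missing_cond)
```

With a real dataframe, `list(df.columns)` would take the place of the hard-coded `available` list.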

First, we deduplicate: if several structures share the same formula and space group, we keep only the one with the lowest volume per formula unit (vpfu).

In [None]:
!python _utils/_preprocessing/_deduplicate.py \
    --input_file '/home/cyprien/Data_Gen/datasets_dev/mattergen_props.parquet' \
    --output_parquet 'test_dedup.parquet' \
    --property_columns "['Density (g/cm^3)', 'energy_above_hull']" \
    --filter_na_columns "['Density (g/cm^3)', 'energy_above_hull']"
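
The deduplication rule itself can be sketched in a few lines of pandas. This is an illustration only, not the code in `_deduplicate.py`: the `vpfu` column and the toy values are assumptions, and in practice the volume per formula unit would have to be computed from each structure first.

```python
import pandas as pd

# Toy data. The "vpfu" column is hypothetical: it stands in for a
# volume-per-formula-unit value computed from each CIF.
toy = pd.DataFrame(
    {
        "material_id": ["a-1", "a-2", "b-1", "c-1"],
        "Reduced Formula": ["NaCl", "NaCl", "NaCl", "SiO2"],
        "space_group": [225, 225, 221, 152],
        "vpfu": [44.8, 43.9, 40.1, 37.7],
    }
)


def keep_lowest_vpfu(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per (formula, space group): the one with the lowest vpfu."""
    # idxmin returns, for each group, the index label of the minimum-vpfu row.
    idx = df.groupby(["Reduced Formula", "space_group"])["vpfu"].idxmin()
    # Restore the original row order, then renumber the rows.
    return df.loc[idx].sort_index().reset_index(drop=True)


deduped = keep_lowest_vpfu(toy)
print(deduped)
```

Here the two NaCl entries in space group 225 collapse to the one with the smaller vpfu, while the NaCl entry in space group 221 is kept because it is a different (formula, space group) pair.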

Next, we 'clean' the data, or do data augmentation. For the CIFs, this means standardising the format for training; for the properties, we normalise their values to ensure training stability.

In [None]:
!python _utils/_preprocessing/_cleaning.py \
    --input_parquet test_dedup.parquet \
    --output_parquet test_clean.parquet \
    --property_columns "['Density (g/cm^3)', 'energy_above_hull']" \
    --property1_normaliser "linear" \
    --property2_normaliser "linear" \
    --num_workers 8
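
To give a feel for what property normalisation does, here is a minimal sketch that assumes `"linear"` means min-max scaling to the range [0, 1]. That interpretation is an assumption; the actual behaviour of `_cleaning.py` may differ, and the function name `linear_normalise` is hypothetical.

```python
def linear_normalise(values):
    """Min-max scale a list of numbers to [0, 1] (assumed meaning of 'linear')."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy density values in g/cm^3, made up for illustration.
densities = [2.0, 4.0, 6.0, 10.0]
normalised = linear_normalise(densities)
print(normalised)  # [0.0, 0.25, 0.5, 1.0]
```

Keeping the minimum and maximum used for scaling makes it possible to map model outputs back to physical units later.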

Next, we save our dataset to Hugging Face. At this point, we're pretty much ready for training.

In [None]:
!python _utils/_preprocessing/_save_dataset_to_HF.py \
    --input_parquet 'test_clean.parquet' \
    --output_parquet 'upload_test.parquet' \
    --valid_size 0.1 \
    --test_size 0 \
    --save_hub
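
The effect of `--valid_size 0.1` and `--test_size 0` can be pictured as a shuffled index split. The sketch below is one plausible way to do it, not the script's actual implementation; the function `split_indices` and the fixed seed are assumptions made for illustration.

```python
import random


def split_indices(n_rows, valid_size=0.1, test_size=0.0, seed=0):
    """Hypothetical helper: shuffle row indices and cut them into three parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_valid = int(round(n_rows * valid_size))
    n_test = int(round(n_rows * test_size))
    valid = indices[:n_valid]
    test = indices[n_valid:n_valid + n_test]
    train = indices[n_valid + n_test:]
    return train, valid, test


train, valid, test = split_indices(1000, valid_size=0.1, test_size=0.0)
print(len(train), len(valid), len(test))  # 900 100 0
```

With a test size of 0, every row ends up in either the training or the validation split.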

Optional: if for some reason you wish to change the tokenizer, it can be changed and tested with this script.

In [None]:
!python _utils/_preprocessing/_save_tokenizer_to_HF.py \
    --vocab_file '_utils/_tokenizer_utils/vocabulary.json' \
    --spacegroups_file '_utils/_tokenizer_utils/spacegroups.txt' \
    --path 'HF-cif-tokenizer' \
    --hub_path 'c-bone/cif-tokenizer' \
    --push_to_hub \
    --testing