### This notebook shows how to load any model from the Hugging Face Hub repo and generate CIFs with it.
> In development

In [2]:
import __init__
import pandas as pd

Navigated to package root: /home/cyprien/CrystaLLMv2_PKV
Added package root to Python path


In [9]:
# Log in to the Hugging Face Hub
from huggingface_hub import login, HfApi
import os
from _utils import load_api_keys

# Read the Hugging Face token from the local API-key file and authenticate
API_KEY_PATH = "API_keys.jsonc"
data = load_api_keys(API_KEY_PATH)
hf_key_json = str(data['HF_key'])
login(token=hf_key_json)
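
The cell above reads the token from a `.jsonc` file through `load_api_keys`, whose implementation is not shown in this notebook. As a rough illustration only, a loader of that kind could strip comments and then parse plain JSON. The names `strip_jsonc_comments` and `load_api_keys_sketch` are hypothetical, and the real helper in `_utils` may behave differently (for example, this sketch does not handle trailing commas).

```python
import json


def strip_jsonc_comments(text: str) -> str:
    """Remove // and /* */ comments that appear outside of string literals."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Keep the escaped character so an escaped quote does not end the string.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Line comment: skip to the end of the line (the newline itself is kept).
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("/*", i):
            # Block comment: skip past the closing delimiter.
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def load_api_keys_sketch(path: str) -> dict:
    """Hypothetical stand-in for _utils.load_api_keys (an assumption, not the real code)."""
    with open(path, encoding="utf-8") as fh:
        return json.loads(strip_jsonc_comments(fh.read()))


# Placeholder content; the "//" inside the URL string must survive comment stripping.
example = '''{
  // Hugging Face token (placeholder value)
  "HF_key": "hf_xxx", /* keep private */
  "url": "https://huggingface.co"
}'''
keys = json.loads(strip_jsonc_comments(example))
print(keys["HF_key"])
```

Tracking whether the scanner is inside a string keeps values such as URLs intact, which a plain regular expression on `//` would break.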


In [12]:
!python _load_and_generate.py \
    --hf_model_path "c-bone/CrystaLLM-2.0_bandgap" \
    --manual \
    --compositions "Ti2O4" \
    --condition_lists "10.0" "0.0" \
    --spacegroups "Imma" \
    --level level_4 \
    --num_return_sequences 5 \
    --max_return_attempts 10 \
    --output_parquet generated_structures.parquet

Model: c-bone/CrystaLLM-2.0_bandgap
Type: Bandgap + stability conditioning
Needs 2 conditions
Example: ['1.1', '0.0']
Compositions: Ti2O4
Conditions: ['10.0', '0.0']

Generating Prompts 
Making prompts from compositions and conditions
Normalizing with power log method for prop_0 (beta = 0.8)...
Normalizing with linear method for prop_1...
Made 1 prompts at level_4
Got 1 prompts

Generating CIFs 
Generating with: c-bone/CrystaLLM-2.0_bandgap
Tokenizer validation passed: token vocabulary is consistent.
Loading from HF: c-bone/CrystaLLM-2.0_bandgap
Using PKVGPT
Downloading... (might take a few mins first time)
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Loaded PKV model
~61.6M parameters
Generation settings: {'max_length': 1024, 'pad_token_id': 371, 'eos_token_id': 373, 'renormalize_logits': True, 'remove_invalid_values': True, 'num_return_sequences': 5, 'do_sample': True, 'top_k': 15, 'top_p': 0.95, 'temperature': 1.0}
Got 50 CI
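
The `!python` line above can also be assembled from Python, which helps when looping over several models or compositions. The sketch below only builds the argument list; it uses the flags that appear in this notebook and nothing else. The helper name `build_generate_command` and its defaults are assumptions, and whether `--compositions` and `--spacegroups` accept several values is not confirmed here.

```python
import shlex
import sys


def build_generate_command(hf_model_path, output_parquet, *, model_type=None,
                           input_parquet=None, compositions=(), condition_lists=(),
                           spacegroups=(), level=None, num_return_sequences=5,
                           max_return_attempts=10, verbose=False):
    """Assemble the argv list for _load_and_generate.py (hypothetical helper)."""
    cmd = [sys.executable, "_load_and_generate.py", "--hf_model_path", hf_model_path]
    if model_type is not None:
        cmd += ["--model_type", model_type]
    if input_parquet is not None:
        # Prompts come from a parquet file, as in the SLME and COD-XRD cells.
        cmd += ["--input_parquet", input_parquet]
    else:
        # Prompts are specified by hand, as in the bandgap and base cells.
        cmd += ["--manual", "--compositions", *compositions]
        if condition_lists:
            cmd += ["--condition_lists", *condition_lists]
        if spacegroups:
            cmd += ["--spacegroups", *spacegroups]
        if level is not None:
            cmd += ["--level", level]
    cmd += ["--num_return_sequences", str(num_return_sequences),
            "--max_return_attempts", str(max_return_attempts),
            "--output_parquet", output_parquet]
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_generate_command(
    "c-bone/CrystaLLM-2.0_bandgap",
    "generated_structures.parquet",
    compositions=["Ti2O4"],
    condition_lists=["10.0", "0.0"],
    spacegroups=["Imma"],
    level="level_4",
)
# Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with model paths or file names that contain spaces.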

In [13]:
!python _load_and_generate.py \
    --hf_model_path "c-bone/CrystaLLM-2.0_SLME" \
    --input_parquet "_artifacts/slme/slme-PKV-opt_prompt.parquet" \
    --num_return_sequences 5 \
    --max_return_attempts 10 \
    --output_parquet generated_structures.parquet \
    --verbose

Model: c-bone/CrystaLLM-2.0_SLME
Type: Solar cell efficiency (SLME) conditioning
Needs 1 conditions
Example: ['25.0']
Input: _artifacts/slme/slme-PKV-opt_prompt.parquet

Generating Prompts 
Loading prompts from: _artifacts/slme/slme-PKV-opt_prompt.parquet
Note: should already have normalized [0-1] values
Got 1 prompts

Example:
<bos>
data_[


Generating CIFs 
Generating with: c-bone/CrystaLLM-2.0_SLME
Tokenizer validation passed: token vocabulary is consistent.
Loading from HF: c-bone/CrystaLLM-2.0_SLME
Using PKVGPT
Downloading... (might take a few mins first time)
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Loaded PKV model
~39.7M parameters
Generation settings: {'max_length': 1024, 'pad_token_id': 371, 'eos_token_id': 373, 'renormalize_logits': True, 'remove_invalid_values': True, 'num_return_sequences': 5, 'do_sample': True, 'top_k': 15, 'top_p': 0.95, 'temperature': 1.0}
Got 50 CIFs (no validation)

Post-processing 
Proces
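
The generation settings logged above enable sampling with `temperature` 1.0, `top_k` 15 and `top_p` 0.95. The following pure-Python sketch illustrates what those three settings conventionally do to a next-token distribution. It is a textbook version of the technique, not the code that `transformers` or `_load_and_generate.py` actually runs, and the function name `sampling_distribution` is made up for this example.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=15, top_p=0.95):
    """Return the token probabilities left after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the indices of the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept tokens sum to one; all other tokens get zero.
    norm = sum(p for _, p in kept)
    result = [0.0] * len(logits)
    for idx, p in kept:
        result[idx] = p / norm
    return result


probs = sampling_distribution([2.0, 1.0, 0.0, -1.0], top_k=2, top_p=0.95)
print([round(p, 3) for p in probs])
```

Lower `top_k` or `top_p` values narrow the set of candidate tokens, which generally trades diversity of the generated CIFs for more conservative outputs.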

In [14]:
!python _load_and_generate.py \
    --hf_model_path "c-bone/CrystaLLM-2.0_base" \
    --model_type "Base" \
    --manual \
    --compositions "Ti2O4" \
    --spacegroups "Imma" \
    --level level_4 \
    --num_return_sequences 5 \
    --max_return_attempts 10 \
    --output_parquet generated_structures.parquet \
    --verbose

Model: c-bone/CrystaLLM-2.0_base
Type: Unconditional generation
Compositions: Ti2O4

Generating Prompts 
Making prompts from compositions and conditions
Made 1 prompts at level_4
Got 1 prompts

Example:
<bos>
data_[Ti2O4]
loop_
 _atom_type_symbol
 _atom_type_electronegativity
 _atom_type_radius
 _atom_type_ionic_radius
[
  Ti  1.5400  1.4000  0.8517
  O  3.4400  0.6000  1.2600
]
_symmetry_space_group_name_H-M [Imma]



Generating CIFs 
Generating with: c-bone/CrystaLLM-2.0_base
Tokenizer validation passed: token vocabulary is consistent.
Loading from HF: c-bone/CrystaLLM-2.0_base
Using GPT2LMHeadModel
Downloading... (might take a few mins first time)
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Loaded Base model
~25.9M parameters
Generation settings: {'max_length': 1024, 'pad_token_id': 371, 'eos_token_id': 373, 'renormalize_logits': True, 'remove_invalid_values': True, 'num_return_sequences': 5, 'do_sample': True, 'top_k': 15,
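
The example prompt printed above for `Ti2O4` in space group `Imma` has a regular layout: a `data_` line, an atom-property loop, and a space-group line, with the variable parts in square brackets. A minimal sketch of how such a string could be assembled is shown below. The function `build_prompt_like_example` is hypothetical, the property values are copied from the output above, and the script's real prompt builder (including its exact whitespace) may differ.

```python
def build_prompt_like_example(composition, atom_props, spacegroup):
    """Assemble a prompt string shaped like the example printed by the script.

    atom_props maps an element symbol to a tuple of
    (electronegativity, radius, ionic_radius).
    """
    lines = [
        "<bos>",
        f"data_[{composition}]",
        "loop_",
        " _atom_type_symbol",
        " _atom_type_electronegativity",
        " _atom_type_radius",
        " _atom_type_ionic_radius",
        "[",
    ]
    for symbol, (electronegativity, radius, ionic_radius) in atom_props.items():
        # Four decimal places, matching the values shown in the logged example.
        lines.append(
            f"  {symbol}  {electronegativity:.4f}  {radius:.4f}  {ionic_radius:.4f}"
        )
    lines += ["]", f"_symmetry_space_group_name_H-M [{spacegroup}]"]
    return "\n".join(lines) + "\n"


# Property values taken from the example output above.
prompt = build_prompt_like_example(
    "Ti2O4",
    {"Ti": (1.54, 1.40, 0.8517), "O": (3.44, 0.60, 1.26)},
    "Imma",
)
print(prompt)
```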

In [15]:
!python _load_and_generate.py \
    --hf_model_path "c-bone/CrystaLLM-2.0_COD-XRD" \
    --model_type "Slider" \
    --input_parquet "_artifacts/cod-xrd/amil/amil-TiO2-nosc-nosg_ref_prompts.parquet" \
    --num_return_sequences 5 \
    --max_return_attempts 1 \
    --output_parquet generated_structures.parquet

Model: c-bone/CrystaLLM-2.0_COD-XRD
Type: Experimental XRD conditioning
Needs 40 conditions
Example: ['-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100', '-100']
Input: _artifacts/cod-xrd/amil/amil-TiO2-nosc-nosg_ref_prompts.parquet

Generating Prompts 
Loading prompts from: _artifacts/cod-xrd/amil/amil-TiO2-nosc-nosg_ref_prompts.parquet
Note: should already have normalized [0-1] values
Got 3 prompts

Generating CIFs 
Generating with: c-bone/CrystaLLM-2.0_COD-XRD
Tokenizer validation passed: token vocabulary is consistent.
Loading from HF: c-bone/CrystaLLM-2.0_COD-XRD
Using SliderGPT
Downloading... (might take a few mins first time)
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is igno
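
Each model in this notebook reports how many condition values it needs (`Needs 2 conditions`, `Needs 1 conditions`, `Needs 40 conditions`). A small check before launching a run can catch a mismatched condition list early. The table below is transcribed from the logs above; the entry of 0 for the base model is an assumption based on its "Unconditional generation" label, and `check_condition_count` is a hypothetical helper, not part of the package.

```python
# Number of condition values per model, as reported in the logs above.
EXPECTED_CONDITIONS = {
    "c-bone/CrystaLLM-2.0_base": 0,      # assumption: "Unconditional generation"
    "c-bone/CrystaLLM-2.0_SLME": 1,
    "c-bone/CrystaLLM-2.0_bandgap": 2,
    "c-bone/CrystaLLM-2.0_COD-XRD": 40,
}


def check_condition_count(model, conditions):
    """Validate the number of conditions and return them as floats."""
    expected = EXPECTED_CONDITIONS[model]
    if len(conditions) != expected:
        raise ValueError(
            f"{model} needs {expected} condition(s), got {len(conditions)}"
        )
    return [float(c) for c in conditions]


print(check_condition_count("c-bone/CrystaLLM-2.0_bandgap", ["10.0", "0.0"]))
```

For the COD-XRD model, the logged example is a list of forty `'-100'` strings; what that value means to the model is not stated in the output, so the sketch only checks the length.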