# Error Correction of Curie Temperature

In [1]:
import sys
from pathlib import Path
import os

# Get project root (parent of notebooks/)
PROJECT_ROOT = Path.cwd().parent.resolve()

# Add project root to Python path so src/ is importable
sys.path.insert(0, str(PROJECT_ROOT))

# Change working directory to project root
os.chdir(PROJECT_ROOT)

print("Project root set to:", PROJECT_ROOT)
print("Current working directory:", Path.cwd())

Project root set to: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc
Current working directory: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc


In [2]:
from src.augment_data import augment_data

### 1. Data augmentation

In [3]:
augment_data()

Input: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/data/EC_curie_temp.csv
Output directory: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/outputs

CURIE TEMPERATURE DATA AUGMENTATION
Random seed: 42
Loading data from /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/data/EC_curie_temp.csv
mammos_entity not available, using pandas directly
Loaded 27754 entries using pandas

FILTERING PAIRED DATA
All paired data: 839 samples
RE paired data: 547 samples
RE-free paired data: 292 samples

STATISTICAL DISTRIBUTION TESTS: RE VS RE-FREE

Performing Kolmogorov-Smirnov test on RE vs RE-free Tc_delta distributions
Null hypothesis: The two distributions are the same
Alternative hypothesis: The two distributions are different

Kolmogorov-Smirnov test results:
  KS statistic: 0.1933
  p-value: 0.00000105
  Result: Reject the null hypothesis (p < 0.05)
  The distributions of Tc_delta in RE and RE-free datasets are significantly different.
  This suggests that develop
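
The two-sample Kolmogorov-Smirnov test reported above can be reproduced in isolation. The following is a minimal sketch, assuming `scipy.stats.ks_2samp` and synthetic stand-in arrays; the internals of `augment_data` are not shown in this notebook, so the variable names and the generated values are illustrative only. Only the group sizes (547 and 292) and the 0.05 threshold are taken from the output.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Synthetic stand-ins for the Tc_delta values of the two groups.
# These are NOT the real data; only the sample sizes match the log above.
tc_delta_re = rng.normal(loc=0.0, scale=50.0, size=547)
tc_delta_re_free = rng.normal(loc=40.0, scale=80.0, size=292)

# Null hypothesis: both samples are drawn from the same distribution.
result = ks_2samp(tc_delta_re, tc_delta_re_free)

alpha = 0.05
reject_null = result.pvalue < alpha

print(f"KS statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.8f}")
if reject_null:
    print("Reject the null hypothesis (p < 0.05)")
else:
    print("Fail to reject the null hypothesis (p >= 0.05)")
```

The KS statistic is the largest vertical distance between the two empirical cumulative distribution functions, so it is sensitive to differences in both location and spread.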

### 2. Creation of embeddings

In [4]:
from src.create_embeddings import create_embeddings

In [5]:
create_embeddings()

script_dir /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/src
project_root /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc
COMPOUND EMBEDDING CREATION
Random seed: 42
Element embeddings: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/data/embeddings/element/matscholar200.json
Data directory: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/outputs
Output directory: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/outputs

Loading element embeddings from /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/data/embeddings/element/matscholar200.json
✓ Loaded 103 element embeddings (200D)

LOADING EMBEDDING-READY DATASETS
Loading All_orig dataset from Pairs_all_emb.csv...
✓ Loaded All_orig: 764 samples
Loading RE_orig dataset from Pairs_RE_emb.csv...
✓ Loaded RE_orig: 507 samples
Loading RE-free_orig dataset from Pairs_RE_Free_emb.csv...
✓ Loaded RE-free_orig: 257 samples
Loading All_aug dataset from Augm_all_emb.csv...
✓ Loaded All_a
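
How `create_embeddings` combines element vectors into a compound vector is not shown in this output. One common choice is a composition-weighted average of the element embeddings, sketched below as an assumption rather than a description of the actual implementation. The helper names (`parse_formula`, `compound_embedding`) and the toy 4-dimensional vectors are hypothetical; the real matscholar200 vectors are 200-dimensional.

```python
import re
import numpy as np

# Toy element vectors with made-up values, used only for illustration.
element_embeddings = {
    "Fe": np.array([1.0, 0.0, 0.0, 0.0]),
    "Nd": np.array([0.0, 1.0, 0.0, 0.0]),
    "B": np.array([0.0, 0.0, 1.0, 0.0]),
}


def parse_formula(formula):
    """Return {element: amount} for a flat formula such as 'Nd2Fe14B'.

    Brackets and nested groups are not handled in this sketch.
    """
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def compound_embedding(formula, embeddings):
    """Average the element vectors, weighted by atomic fraction."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    return sum((n / total) * embeddings[el] for el, n in counts.items())


vec = compound_embedding("Nd2Fe14B", element_embeddings)
print(vec)
```

With this pooling, every compound maps to a vector of the same length as the element embeddings, regardless of how many elements it contains.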

### 3. Compress embeddings with PCA

In [6]:
from src.compress_embedding_PCA import compress_embeddings_PCA

In [7]:
compress_embeddings_PCA()

PCA EMBEDDING COMPRESSION PIPELINE
Random seed: 42
Input directory: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/outputs
Output directory: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/outputs


PROCESSING DATASET: All_orig
Input file: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/outputs/Pairs_all_emb_w_embeddings.pkl
  Loaded 764 samples
  Adding PCA components: [8, 16, 32]

COMPRESSING EMBEDDINGS WITH PCA
Original embeddings shape: (764, 200)

Creating PCA compressions:

  Fitting PCA with 8 components...
  ✓ Explained variance: 0.7833 (78.33%)
  ✓ Added column: comp_emb_pca_8_components

  Fitting PCA with 16 components...
  ✓ Explained variance: 0.8996 (89.96%)
  ✓ Added column: comp_emb_pca_16_components

  Fitting PCA with 32 components...
  ✓ Explained variance: 0.9776 (97.76%)
  ✓ Added column: comp_emb_pca_32_components

✓ Saved: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/outputs/Pairs_all_emb_w_embeddings_PCA.pkl
  Samples:
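
The compression step logged above fits PCA with 8, 16 and 32 components and reports the cumulative explained variance for each. A minimal sketch of that pattern follows, assuming scikit-learn's `PCA` and a synthetic embedding matrix with the same shape as in the log, (764, 200). The synthetic data and the dictionary names are illustrative; the real pipeline reads the embeddings from the pickle file and may differ in detail.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-in for the (764, 200) compound embedding matrix:
# a low-rank signal plus a small amount of noise.
latent = rng.normal(size=(764, 20))
mixing = rng.normal(size=(20, 200))
embeddings = latent @ mixing + 0.1 * rng.normal(size=(764, 200))

compressed = {}
explained = {}
for n_components in (8, 16, 32):
    pca = PCA(n_components=n_components, random_state=42)
    compressed[n_components] = pca.fit_transform(embeddings)
    # Sum of per-component ratios gives the cumulative explained variance.
    explained[n_components] = float(pca.explained_variance_ratio_.sum())
    print(f"{n_components} components: explained variance {explained[n_components]:.4f}")
```

The explained variance grows with the number of components, which is the trade-off visible in the log: 8 components keep roughly 78% of the variance, 32 components almost 98%.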

### 4. Model Training

#### 4.1 Original Dataset
Train ML models on materials for which both experimental and simulated Curie temperatures have been recorded.

In [9]:
from src.training_original import training_original

In [10]:
training_original()

ORIGINAL DATA TRAINING (NO EMBEDDINGS)

------------------------------------------------------------
Training on All-Pairs dataset type: all
------------------------------------------------------------

Training Symbolic Regression: All-Pairs
Loading original (pairs) data for all dataset from: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/outputs/Pairs_all.csv

Dataset diagnostics:
  - Dataset type: all
  - Target column: Tc_exp
  - Total rows: 839
  - Rows with NaN in Tc_sim: 0
  - Rows with NaN in Tc_exp: 0
Training samples: 671
Test samples: 168
Error running Symbolic Regression: PySR is required for symbolic regression

Training Linear Models: All-Pairs
Using only Tc_sim (no embedding)
Loading original (pairs) data for all dataset from: /raven/ptmp/cwinkler/ML-models/experimental-simulation-tc/outputs/Pairs_all.csv

Dataset diagnostics:
  - Dataset type: all
  - Target column: Tc_exp
  - Total rows: 839
  - Rows with NaN in Tc_sim: 0
  - Rows with NaN in Tc_exp: 0
Train
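
The sample counts in the diagnostics (839 rows, 671 for training, 168 for testing) are consistent with an 80/20 split. The sketch below illustrates that split together with a linear model that predicts `Tc_exp` from `Tc_sim` alone. It assumes scikit-learn and synthetic data; the split settings, model choice and metrics actually used by `training_original` are not visible here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_rows = 839  # total rows reported in the dataset diagnostics

# Synthetic stand-ins for the Tc_sim and Tc_exp columns (not the real data).
tc_sim = rng.uniform(100.0, 1200.0, size=n_rows)
tc_exp = 0.8 * tc_sim + 50.0 + rng.normal(scale=60.0, size=n_rows)

# An assumed 80/20 split; with 839 rows this yields 671 train and 168 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    tc_sim.reshape(-1, 1), tc_exp, test_size=0.2, random_state=42
)

# Single-feature linear correction: Tc_exp ~ a * Tc_sim + b.
model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Slope: {model.coef_[0]:.3f}, intercept: {model.intercept_:.1f}")
print(f"Test MAE: {mae:.1f}")
```

A model of this form serves as a baseline: any embedding-based model in the later sections would need to beat its test error to justify the extra features.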

#### 4.2 Original Dataset with embedding

#### 4.3 Augmented dataset

#### 4.4 Augmented dataset with embedding