### Tutorial: Data Preprocessing

This notebook demonstrates how to convert raw molecular datasets into the 2D and 3D feature tensors required by HME.

**Steps:**
1. Inspect raw JSON/SMILES input.
2. Run the preprocessing functions.
3. Verify the output tensor files.

Before running the cells below, make sure you have downloaded the datasets:
```bash
# 1. Create the target directory
mkdir -p datasets  # the expected path is 'HME/datasets'

# 2. Download the data archive from Zenodo
wget -O datasets/data.zip "https://zenodo.org/records/16963804/files/data.zip?download=1"

# 3. Unzip the archive
unzip datasets/data.zip -d datasets/
# rm datasets/data.zip  # Optional: clean up the zip file
```
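
As a quick sanity check after unzipping, the sketch below reports which expected files are missing. The helper name `missing_files` is hypothetical, and it assumes the archive unpacks `property_qa_test_2.json.cfm` (the file used later in this notebook) directly into `datasets/`; adjust the list if your layout differs.

```python
from pathlib import Path


def missing_files(root, names):
    """Return the entries of `names` that are not regular files under `root`."""
    root = Path(root)
    return [name for name in names if not (root / name).is_file()]


# Assumption: this file sits directly inside datasets/ after unzipping.
expected = ["property_qa_test_2.json.cfm"]
missing = missing_files("datasets", expected)
if missing:
    print(f"Missing files: {missing}")
else:
    print("All expected files found.")
```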

In [2]:
import os
import sys
import torch
import json


# Import HME preprocessing modules
from hme.preprocess_mol import get_2d3d_tensors

print("Imports successful. Ready to preprocess.")

Imports successful. Ready to preprocess.


### 1. Inspecting the Raw Data

Before processing, let's take a quick look at the format of our input data. We will inspect the test set used for Property QA.

In [3]:
input_file = '../datasets/property_qa_test_2.json.cfm'

print("Loading sample data...\n")
with open(input_file, 'r') as f:
    # Load the json data
    data_list = json.load(f)
    
# Display the first sample to show the structure
if len(data_list) > 0:
    print("Keys in the first sample:")
    print(data_list[0].keys())
    print("Structure of the first sample:")
    print(json.dumps(data_list[0], indent=2))
    print(f"\nTotal samples: {len(data_list)}")

Loading sample data...

Keys in the first sample:
dict_keys(['smiles', 'atoms', 'coordinates'])
Structure of the first sample:
{
  "smiles": "C[C@H](CC[C@H](C)OC[C@@H]1CO1)OC[C@@H]1CO1",
  "atoms": [
    "C",
    "C",
    "C",
    "C",
    "C",
    "C",
    "O",
    "C",
    "C",
    "C",
    "O",
    "O",
    "C",
    "C",
    "C",
    "O",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H",
    "H"
  ],
  "coordinates": [
    [
      2.372255563735962,
      1.7100319862365723,
      -0.10524449497461319
    ],
    [
      1.8879988193511963,
      0.2865290939807892,
      -0.37209394574165344
    ],
    [
      0.36822670698165894,
      0.15030863881111145,
      -0.19707560539245605
    ],
    [
      -0.4420231282711029,
      0.9235133528709412,
      -1.2427006959915161
    ],
    [
      -1.947428822517395,
      0.6457188129425049,
      -1
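
The sample above suggests that each record pairs one `[x, y, z]` entry in `coordinates` with one symbol in `atoms`. A minimal sketch of a structural check under that assumption follows; `validate_record` is a hypothetical helper, not part of HME, and the toy record is made up for illustration.

```python
def validate_record(record):
    """Return a list of problems found in one raw record (empty if none)."""
    problems = [
        f"missing key: {key}"
        for key in ("smiles", "atoms", "coordinates")
        if key not in record
    ]
    if problems:
        return problems
    # Assumption: one coordinate triple per atom, in the same order.
    if len(record["atoms"]) != len(record["coordinates"]):
        problems.append("atoms and coordinates differ in length")
    if any(len(xyz) != 3 for xyz in record["coordinates"]):
        problems.append("coordinate without exactly 3 components")
    return problems


# Toy record (water) invented for this example; it is not from the dataset.
toy = {
    "smiles": "O",
    "atoms": ["O", "H", "H"],
    "coordinates": [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
}
print(validate_record(toy))  # prints []
```

Running the same function over `data_list` before preprocessing would surface malformed records early.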

### 2. Run Preprocessing
We use `get_2d3d_tensors` to generate the 2D (graph) and 3D (coordinate) features.

To keep the demonstration fast, we first create a temporary subset of the data.

In [4]:
full_file = input_file
demo_file = './demo_subset_1000.json.cfm'

# Stop early if the source file is missing, instead of processing a stale or absent subset
if not os.path.exists(full_file):
    raise FileNotFoundError(f"Source file {full_file} not found.")

with open(full_file, 'r') as f:
    full_data = json.load(f)

# Take only the first 1000 samples
demo_data = full_data[:1000]
with open(demo_file, 'w') as f:
    json.dump(demo_data, f)

print(f"Created demo subset with {len(demo_data)} samples.")

# --- Process Molecular Data ---
print("Processing Property QA Test Set (Demo: utilizing the first 1000 samples for speed)...")
get_2d3d_tensors(demo_file)

# The output is written next to demo_file, with '.pt' appended to its name

Created demo subset with 1000 samples.
Processing Property QA Test Set (Demo: utilizing the first 1000 samples for speed)...
Loaded 1000 molecules with conformations.
Loading from /home/lvliuzhenghao/llzh/NC_minor/HME/src/hme/molecule_towers/model1.pth ...


2025-12-08 06:32:17 | unimol_tools/models/unimol.py | 135 | INFO | Uni-Mol Tools | Loading pretrained weights from /home/lvliuzhenghao/miniconda3/envs/mollama/lib/python3.10/site-packages/unimol_tools/weights/mol_pre_all_h_220816.pt
Generating 2D features: 100%|██████████| 1000/1000 [00:03<00:00, 271.40it/s]
2025-12-08 06:32:21 | unimol_tools/tasks/trainer.py | 77 | INFO | Uni-Mol Tools | Number of GPUs available: 8
2025-12-08 06:32:21 | unimol_tools/tasks/trainer.py | 97 | INFO | Uni-Mol Tools | Using single GPU.


Generating 3D features for 1000 valid molecules...


100%|██████████| 32/32 [00:01<00:00, 17.24it/s]


Saved 2D & 3D features for 1000 molecules to ./demo_subset_1000.json.cfm.pt
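
The log above shows the output saved at the input path with `.pt` appended. Assuming that naming holds in general (it is inferred from this one run, not from documented behavior), the sketch below skips preprocessing when the output already exists. `preprocess_if_needed` and `fake_preprocess` are hypothetical names used only for illustration.

```python
import os
import tempfile


def preprocess_if_needed(input_path, preprocess_fn):
    """Run preprocess_fn(input_path) unless the expected output already exists.

    Returns (output_path, ran), where `ran` tells whether preprocess_fn was called.
    """
    # Assumption: the output path is the input path plus '.pt'.
    output_path = input_path + ".pt"
    if os.path.exists(output_path):
        return output_path, False
    preprocess_fn(input_path)
    return output_path, True


def fake_preprocess(path):
    """Stand-in for the real preprocessing call: writes an empty file at path + '.pt'."""
    with open(path + ".pt", "wb"):
        pass


with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.json.cfm")
    first = preprocess_if_needed(demo, fake_preprocess)
    second = preprocess_if_needed(demo, fake_preprocess)

print(first[1], second[1])  # prints: True False
```

In this notebook, `preprocess_fn` would be `get_2d3d_tensors` and `input_path` would be `demo_file`.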


### 3. Verify Output
Load the generated file to confirm that the features were created correctly.

In [9]:
# The preprocessing step above writes './demo_subset_1000.json.cfm.pt'
expected_output = './demo_subset_1000.json.cfm.pt'

data = torch.load(expected_output)
print(f"Successfully loaded processed data from {expected_output}")
print(f"Number of samples: {len(data)}")

# Data is a dictionary where keys are SMILES strings
first_smiles = list(data.keys())[0]
first_features = data[first_smiles]

print(f"\nSample Key (SMILES): {first_smiles}")
# Display the available feature fields (e.g., 'molecule_raw_2d_features', 'molecule_raw_3d_features')
print(f"Feature fields: {list(first_features.keys())}")

# Verify tensor shapes for the first sample
for k, v in first_features.items():
    if hasattr(v, 'shape'):
        print(f"  - {k}: shape {v.shape}")
        
print('Files with the .pt extension can be used directly for model training or inference.')

Successfully loaded processed data from ./demo_subset_1000.json.cfm.pt
Number of samples: 1000

Sample Key (SMILES): C[C@H](CC[C@H](C)OC[C@@H]1CO1)OC[C@@H]1CO1
Feature fields: ['smiles', 'atoms', 'coordinates', 'molecule_raw_2d_features', 'molecule_raw_3d_features']
  - molecule_raw_2d_features: shape torch.Size([16, 300])
  - molecule_raw_3d_features: shape torch.Size([38, 512])
Files with the .pt extension can be used directly for model training or inference.
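
In this sample, the 2D tensor has 16 rows and the 3D tensor has 38. These match the 16 non-hydrogen atoms and the 38 total atoms in the first record's `atoms` list. This suggests, although the notebook does not state it, that the 2D features are per heavy atom and the 3D features are per atom including hydrogens. The sketch below computes both counts from an atom list so the first dimension of each tensor can be compared against them; `expected_rows` is a hypothetical helper.

```python
def expected_rows(atoms):
    """Return (heavy_atom_count, total_atom_count) for a list of element symbols."""
    heavy = sum(1 for symbol in atoms if symbol != "H")
    return heavy, len(atoms)


# Atom list of the first sample, written compactly in the order shown earlier.
atoms = (
    ["C"] * 6 + ["O"] + ["C"] * 3 + ["O", "O"] + ["C"] * 3 + ["O"] + ["H"] * 22
)
print(expected_rows(atoms))  # prints (16, 38)
```

With the loaded data, the comparison would be `first_features['molecule_raw_2d_features'].shape[0]` against the first count and `first_features['molecule_raw_3d_features'].shape[0]` against the second. Treat a mismatch as a prompt to investigate, not as proof of an error, since the row interpretation is an assumption.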
