# XRD Database Builder

This notebook processes the crystal structure database and generates the XRD pattern database.

## Steps:
1. Import required modules
2. Configure paths and XRD parameters
3. Initialize the database builder
4. Process all entries, calculate XRD patterns, and save the database as a pickle file
5. Inspect sample entries from the saved database

## 1. Import Modules

In [3]:
import time
import pickle
from pathlib import Path
from process.build_database import DatabaseBuilder

print("✓ Modules imported successfully")

✓ Modules imported successfully


## 2. Configuration

In [4]:
# Database paths
db_path = 'UniqCryLabeled.db'  # Input crystal structure database
output_path = 'xrd_database.pkl'  # Output XRD database

# XRD calculation parameters
wavelength = 'CuKa'  # Cu K-alpha radiation (1.54184 Å)
n_peaks = 5  # Number of top peaks to extract per structure
two_theta_range = (10, 90)  # 2θ range in degrees

# Processing options
max_entries = None  # Set to None to process ALL entries, or a number for testing (e.g., 100)
skip_errors = True  # Skip entries that fail processing

# Memory management options
batch_size = 5000  # Process in batches to manage memory
save_interval = 1000  # Save checkpoint every N entries

print("Configuration:")
print(f"  Input database: {db_path}")
print(f"  Output file: {output_path}")
print(f"  Wavelength: {wavelength}")
print(f"  Number of peaks: {n_peaks}")
print(f"  2θ range: {two_theta_range}")
print(f"  Max entries: {max_entries if max_entries else 'ALL'}")
print(f"  Skip errors: {skip_errors}")
print(f"  Batch size: {batch_size}")
print(f"  Save interval: {save_interval}")

Configuration:
  Input database: UniqCryLabeled.db
  Output file: xrd_database.pkl
  Wavelength: CuKa
  Number of peaks: 5
  2θ range: (10, 90)
  Max entries: ALL
  Skip errors: True
  Batch size: 5000
  Save interval: 1000
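
The wavelength and 2θ range together fix which lattice spacings can produce a peak. As a rough illustration, Bragg's law (first order, λ = 2d sin θ) converts the configured 2θ limits into a d-spacing window. The helper name `two_theta_to_d` is hypothetical and not part of the `process` package; the wavelength value is the one quoted in the configuration comment.

```python
import math

CU_KA_ANGSTROM = 1.54184  # Cu K-alpha wavelength quoted in the configuration cell


def two_theta_to_d(two_theta_deg: float, wavelength: float = CU_KA_ANGSTROM) -> float:
    """Return the d-spacing (in Å) of a first-order reflection at 2θ (in degrees)."""
    theta = math.radians(two_theta_deg / 2.0)  # Bragg angle is half of 2θ
    return wavelength / (2.0 * math.sin(theta))


two_theta_range = (10, 90)
d_max = two_theta_to_d(two_theta_range[0])  # low angle corresponds to large d
d_min = two_theta_to_d(two_theta_range[1])  # high angle corresponds to small d
print(f"d-spacing window: {d_min:.3f} Å to {d_max:.3f} Å")
```

Under these assumptions, reflections with d-spacings outside this window fall outside the configured 2θ range and cannot appear among the extracted peaks.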


## 3. Initialize Database Builder

In [5]:
# Initialize builder
builder = DatabaseBuilder(
    db_path=db_path,
    wavelength=wavelength,
    n_peaks=n_peaks,
    two_theta_range=two_theta_range
)

# Get total entries
total_entries = builder.db_processor.get_total_entries()
print(f"\nTotal entries in database: {total_entries}")

if max_entries:
    print(f"Will process: {min(max_entries, total_entries)} entries")
else:
    print(f"Will process: ALL {total_entries} entries")
    print(f"Estimated time: ~{total_entries * 7 / 3600:.1f} hours (at ~7 sec/entry)")

INFO:process.db_processor:Successfully connected to database: UniqCryLabeled.db
INFO:process.db_processor:Total entries in database: 100315
INFO:process.xrd_calculator:XRD Calculator initialized with wavelength=CuKa, 2θ range=(10, 90), min d-spacing=0.5 Å
INFO:process.build_database:DatabaseBuilder initialized for UniqCryLabeled.db



Total entries in database: 100315
Will process: ALL 100315 entries
Estimated time: ~195.1 hours (at ~7 sec/entry)
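
The serial estimate above is simply entries × seconds per entry. As a rough sketch, the same arithmetic can be extended to several workers, assuming ideal linear scaling (real speedups are usually lower because of I/O and process overhead). The function name `estimate_hours` is illustrative and not part of the project; the worker count of 32 is the one used in the parallel command later in this notebook.

```python
def estimate_hours(n_entries: int, sec_per_entry: float, workers: int = 1) -> float:
    """Estimate wall-clock hours, ASSUMING perfect linear scaling across workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return n_entries * sec_per_entry / workers / 3600


serial_hours = estimate_hours(100315, 7)
parallel_hours = estimate_hours(100315, 7, workers=32)
print(f"Serial: ~{serial_hours:.1f} h, 32 workers (ideal): ~{parallel_hours:.1f} h")
```

Treat the parallel figure as a lower bound on run time, not a prediction.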


## 4. Build Database

**Note**: Processing every entry is slow. At the ~7 sec/entry assumed above, the 100315 entries take roughly 195 hours.
- For testing: set `max_entries = 100` in the configuration cell above
- For the full database: set `max_entries = None`

**If interrupted**: re-run this cell. It resumes automatically from the last checkpoint.

In [None]:
# Start timer
start_time = time.time()

print("Starting database build...")
print("=" * 80)

# Build database with batch processing and auto-save
xrd_database = builder.build_database(
    output_path=output_path,
    max_entries=max_entries,
    skip_errors=skip_errors,
    batch_size=batch_size,
    save_interval=save_interval
)

# End timer
end_time = time.time()
elapsed_time = end_time - start_time

print("\n" + "=" * 80)
print("Database Build Complete!")
print("=" * 80)
print(f"Total entries processed: {len(xrd_database)}")
print(f"Output file: {output_path}")
print(f"Time elapsed: {elapsed_time / 60:.1f} minutes ({elapsed_time / 3600:.2f} hours)")
if len(xrd_database) > 0:
    print(f"Average time per entry: {elapsed_time / len(xrd_database):.2f} seconds")
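
The internals of `build_database` are not shown in this notebook. The following is only a minimal sketch of how checkpoint-and-resume with a `save_interval` could work, not the project's actual implementation. The names `save_checkpoint`, `build_with_resume`, and `fake_process` are hypothetical, and the XRD calculation is replaced by a stand-in function.

```python
import os
import pickle
import tempfile


def save_checkpoint(results: dict, path: str) -> None:
    """Write to a temp file, then rename, so an interrupt cannot leave a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(results, f)
    os.replace(tmp_path, path)


def build_with_resume(entry_ids, process_fn, path, save_interval=1000):
    """Process entries, skipping any that are already stored in the checkpoint."""
    results = {}
    if os.path.exists(path):
        with open(path, "rb") as f:
            results = pickle.load(f)  # resume from the previous run
    new_since_save = 0
    for entry_id in entry_ids:
        if entry_id in results:
            continue  # already processed in an earlier run
        results[entry_id] = process_fn(entry_id)
        new_since_save += 1
        if new_since_save >= save_interval:
            save_checkpoint(results, path)
            new_since_save = 0
    save_checkpoint(results, path)  # final save
    return results


calls = []  # records which entries were actually processed


def fake_process(entry_id):
    """Stand-in for the XRD calculation (hypothetical)."""
    calls.append(entry_id)
    return {"entry_id": entry_id}


with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = os.path.join(tmp_dir, "checkpoint.pkl")
    # First run covers entries 1-3, standing in for a run that stopped early
    build_with_resume(range(1, 4), fake_process, ckpt, save_interval=2)
    # Second run covers entries 1-5; only 4 and 5 are new
    resumed = build_with_resume(range(1, 6), fake_process, ckpt, save_interval=2)

print(calls)
print(sorted(resumed))
```

In this sketch each entry is processed exactly once across both runs, which is the property a resumable build needs.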

In [None]:
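%%bash
# Alternative to the cell above: run the parallel builder in the background, logging to build_log.txt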
nohup python build_database_parallel.py \
  --db-path UniqCryLabeled.db \
  --output xrd_database.pkl \
  --workers 32 \
  --batch-size 1000 \
  --save-interval 500 \
  --wavelength CuKa \
  --n-peaks 5 \
  > build_log.txt 2>&1 &

## Sample Entries
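
The cell in this section treats the pickle as a dict with three keys: `xrd_database`, `element_index`, and `metadata`. As a small illustration of how such a package might be queried by element set, the sketch below builds a mock from two entries shown in this notebook's sample output (values rounded as printed). It assumes the index keys are sorted tuples of element symbols, which this notebook does not confirm; the helper `entries_for_elements` is hypothetical.

```python
from collections import defaultdict

# Two entries copied from the sample output of this notebook (rounded as printed)
xrd_database = {
    1: {"formula": "CIr", "elements": ["C", "Ir"],
        "peaks": {"positions": [35.28, 40.96, 70.93, 59.31, 74.60],
                  "intensities": [100.0, 62.1, 41.7, 41.7, 14.2]}},
    2: {"formula": "CoV", "elements": ["Co", "V"],
        "peaks": {"positions": [44.24, 81.41, 64.35, 30.88, 73.08],
                  "intensities": [100.0, 28.5, 14.7, 1.6, 0.5]}},
}

# ASSUMPTION: the index maps a sorted tuple of element symbols to a list of entry IDs
element_index = defaultdict(list)
for entry_id, entry in xrd_database.items():
    element_index[tuple(sorted(entry["elements"]))].append(entry_id)


def entries_for_elements(elements):
    """Return all entries whose element set matches exactly (order-insensitive)."""
    key = tuple(sorted(elements))
    return [xrd_database[i] for i in element_index.get(key, [])]


matches = entries_for_elements(["Ir", "C"])
for entry in matches:
    strongest = entry["peaks"]["positions"][0]  # peaks are listed by descending intensity
    print(f"{entry['formula']}: strongest peak at 2θ = {strongest:.2f}°")
```

Restricting candidates by element set first keeps any later peak comparison limited to a short list of entries.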

In [4]:
import pickle
import os

# File path
file_path = "xrd_database.pkl"

# Check that the file exists
if not os.path.exists(file_path):
    raise FileNotFoundError(f"{file_path} not found.")

# Read the pickle file
with open(file_path, "rb") as f:
    db = pickle.load(f)  # the full database package

# Extract the individual parts
xrd_database = db['xrd_database']      # main database
element_index = db['element_index']    # element index
metadata = db['metadata']              # metadata

# Show the metadata
print("Database Metadata:")
print("=" * 80)
print(f"Source: {metadata['source_db']}")
print(f"Wavelength: {metadata['wavelength']}")
print(f"Number of peaks: {metadata['n_peaks']}")
print(f"2θ range: {metadata['two_theta_range']}")
print(f"Total entries: {metadata['total_entries']}")
print()

# Show the first 5 records
print("Sample Entries:")
print("=" * 80)

for i, (entry_id, entry) in enumerate(list(xrd_database.items())[:5], 1):
    print(f"\n{i}. Entry ID: {entry_id}")
    print(f"   MPID: {entry.get('mpid', 'N/A')}")
    print(f"   Formula: {entry.get('formula', 'N/A')}")
    print(f"   Elements: {', '.join(entry.get('elements', []))}")
    print(f"   Space Group: {entry.get('spacegroup_number', 'N/A')} ({entry.get('spacegroup_symbol', 'N/A')})")
    print(f"   Number of atoms: {entry.get('n_atoms', 'N/A')}")
    peaks = entry.get('peaks', {})
    print(f"   Peak positions (2θ): {[f'{x:.2f}' for x in peaks.get('positions', [])]}")
    intensities = peaks.get('intensities', [])
    print(f"   Peak intensities: {[f'{x:.1f}' for x in intensities]}")

# Show element index statistics
print("\n" + "=" * 80)
print("Element Index Statistics:")
print(f"Total element combinations: {len(element_index)}")
print("\nTop 10 most common element combinations:")
sorted_elements = sorted(element_index.items(), key=lambda x: len(x[1]), reverse=True)[:10]
for i, (elements, entry_ids) in enumerate(sorted_elements, 1):
    print(f"{i:2d}. {'-'.join(elements):20s}: {len(entry_ids):6d} entries")

Database Metadata:
Source: UniqCryLabeled.db
Wavelength: CuKa
Number of peaks: 5
2θ range: (10.0, 90.0)
Total entries: 100315

Sample Entries:

1. Entry ID: 1
   MPID: mp-1001824.cif
   Formula: CIr
   Elements: C, Ir
   Space Group: 225 (F m -3 m)
   Number of atoms: 2
   Peak positions (2θ): ['35.28', '40.96', '70.93', '59.31', '74.60']
   Peak intensities: ['100.0', '62.1', '41.7', '41.7', '14.2']

2. Entry ID: 2
   MPID: mp-1002109.cif
   Formula: CoV
   Elements: Co, V
   Space Group: 221 (P m -3 m)
   Number of atoms: 2
   Peak positions (2θ): ['44.24', '81.41', '64.35', '30.88', '73.08']
   Peak intensities: ['100.0', '28.5', '14.7', '1.6', '0.5']

3. Entry ID: 3
   MPID: mp-1002117.cif
   Formula: NTc
   Elements: N, Tc
   Space Group: 221 (P m -3 m)
   Number of atoms: 2
   Peak positions (2θ): ['47.99', '33.42', '89.55', '80.03', '59.74']
   Peak intensities: ['100.0', '84.1', '31.7', '28.2', '21.8']

4. Entry ID: 4
   MPID: mp-1001834.cif
   Formula: HfN
   Elements: Hf, N
 