# **Phase 1**: Parsing `.xyz` files from the dataset and extracting SMILES strings and target properties

## Overview

This document explains the procedure and functions used to process the QM9 dataset, which contains over 130,000 small organic molecules in `.xyz` files. The goal is to extract 12 quantum properties and SMILES strings from these files and convert them into a structured CSV dataset for machine learning applications.

_Note that the `.tar.gz` file is a subset of the full QM9 dataset and contains 11001 small organic molecules._


## Data Preparation Procedure

1. **Access the `.tar.gz` archive**  
   Use Python’s `tarfile` module to read each `.xyz` file sequentially without fully extracting the archive.

2. **Parse individual `.xyz` files**  
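   - Read the number of atoms from the first line and the header from the second line. The header holds the `gdb` tag, the molecule index, and 15 scalar properties (17 fields in total).
   - Select 12 relevant properties (electronic and thermodynamic) for prediction, skipping the 3 rotational constants.
   - Extract the SMILES string (the first tab-separated field) from the penultimate line.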

3. **Error Handling**  
   Use `try-except` blocks to skip files with malformed data or missing SMILES, and log errors without stopping the entire process.

4. **Data Assembly**  
   Build a dictionary of `{index, smiles, 12 properties}` for each molecule, collect the dictionaries in a list, convert the list to a pandas DataFrame, and save it as a CSV file.
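
As a rough illustration of step 1, the sketch below reads `.xyz` members straight from a gzip-compressed tar into memory. It is a variation on the notebook's approach, which writes each member to a temporary file first. The names `iter_xyz_texts` and `build_demo_archive` are hypothetical, and the tiny archive is built in memory purely for demonstration.

```python
import io
import tarfile
from typing import BinaryIO, Iterator


def iter_xyz_texts(fileobj: BinaryIO) -> Iterator[tuple[str, str]]:
    """Yield (member name, decoded text) for each .xyz member of a gzip tar."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            # Skip directories and anything that is not an .xyz file.
            if not (member.isfile() and member.name.endswith(".xyz")):
                continue
            handle = tar.extractfile(member)
            if handle is None:
                continue
            yield member.name, handle.read().decode("utf-8")


def build_demo_archive() -> io.BytesIO:
    """Build a small in-memory .tar.gz with placeholder contents (demo only)."""
    buf = io.BytesIO()
    demo_files = [("a.xyz", "first\n"), ("notes.txt", "skip me\n"), ("b.xyz", "second\n")]
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in demo_files:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


names = [name for name, _ in iter_xyz_texts(build_demo_archive())]
print(names)
```

For an archive on disk, the same loop should work with `tarfile.open(tar_path, "r:gz")`, which is the call the notebook uses.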


## Code Explanation

### 1. Imports and Configurations

- **Libraries:** `tarfile`, `pathlib`, `pandas`, `tqdm`, `logging`, `rdkit.Chem`, `tempfile`, `uuid`.
- **Logging:** Configured to show timestamps and levels for process visibility.
- **Target Properties:** Defined in `PROP_LABELS`, which omits the 3 rotational constants; the `IGNORED` constant records how many header columns to skip.
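
To make the column arithmetic concrete, here is a minimal sketch of how the tab-separated header fields line up with `PROP_LABELS`. It assumes the standard QM9 header order (the `gdb <index>` tag, then the rotational constants `A`, `B`, `C`, then the 12 targets). `HEADER_FIELDS` and `target_slice` are illustrative names that do not appear in the notebook.

```python
# Illustrative only: HEADER_FIELDS and target_slice are hypothetical names.
PROP_LABELS = ["mu", "alpha", "homo", "lumo", "gap", "r2",
               "zpve", "U0", "U", "H", "G", "Cv"]
IGNORED = 3  # rotational constants A, B, C

# Assumed order of the tab-separated fields on line 2 of a QM9 .xyz file.
# "gdb idx" counts as one field because the tag and index are space-separated.
HEADER_FIELDS = ["gdb idx", "A", "B", "C"] + PROP_LABELS


def target_slice() -> slice:
    """Return the slice of header fields that holds the 12 targets."""
    start = 1 + IGNORED  # skip "gdb idx" plus A, B, C
    return slice(start, start + len(PROP_LABELS))


selected = HEADER_FIELDS[target_slice()]
print(selected == PROP_LABELS)
```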


### 2. Function `parse_qm9_xyz(xyz_path)`

**Purpose:**  
Reads one `.xyz` file and returns a dictionary with the molecule’s index, SMILES, and 12 selected quantum properties.

**Steps:**  
- Reads file into lines.
- Extracts molecule index and properties from the second line (header).
- Parses 12 properties into floats, stores them in a dictionary.
- Extracts the SMILES string from the penultimate line.
- Returns a dictionary of the molecule data.
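
The steps above can be sketched on an in-memory string, which is handy for quick checks without touching the archive. This is a hedged variant rather than the notebook's function: `parse_qm9_text` and `SAMPLE` are hypothetical names, the sample record is synthetic and not chemically meaningful, and the line-count check assumes the usual QM9 layout (atom count, header, one line per atom, frequencies, SMILES, InChI).

```python
PROP_LABELS = ["mu", "alpha", "homo", "lumo", "gap", "r2",
               "zpve", "U0", "U", "H", "G", "Cv"]
IGNORED = 3  # rotational constants A, B, C


def parse_qm9_text(text: str) -> dict:
    """Parse the text of one QM9-style .xyz record (hypothetical helper)."""
    lines = text.splitlines()
    n_atoms = int(lines[0])
    # Assumed layout: count + header + n_atoms atoms + frequencies + SMILES + InChI.
    if len(lines) != n_atoms + 5:
        raise ValueError(f"expected {n_atoms + 5} lines, found {len(lines)}")

    header_parts = lines[1].split("\t")
    _tag, idx = header_parts[0].split()  # e.g. "gdb 7"

    start = 1 + IGNORED
    values = [float(v) for v in header_parts[start:start + len(PROP_LABELS)]]
    if len(values) != len(PROP_LABELS):
        raise ValueError("header has too few property columns")

    smiles = lines[-2].split("\t")[0]  # first field on the penultimate line
    return {"index": int(idx), "smiles": smiles,
            "props": dict(zip(PROP_LABELS, values))}


# Synthetic record: header values are simply 1.0 .. 15.0, strings are placeholders.
SAMPLE = "\n".join([
    "1",
    "gdb 7\t" + "\t".join(str(float(v)) for v in range(1, 16)),
    "C\t0.0\t0.0\t0.0\t0.0",
    "100.0",
    "C\tC",
    "placeholder-inchi\tplaceholder-inchi",
])

record = parse_qm9_text(SAMPLE)
print(record["index"], record["smiles"], record["props"]["mu"])
```

Raising `ValueError` on a malformed record fits the `try`/`except` handling in `process_tar_gz`, which logs the error and moves on to the next file.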


### 3. Function `process_tar_gz(tar_path, output_csv)`

**Purpose:**  
Processes a `.tar.gz` archive of QM9 `.xyz` files, parses each using `parse_qm9_xyz()`, and saves the dataset as a CSV file.

**Steps:**  
- Opens the archive and finds `.xyz` files.
- Uses a temporary directory to extract and process each file.
- For each file:
  - Reads content, writes to a temp file.
  - Calls `parse_qm9_xyz` to extract data.
  - Appends valid records to a list.
  - Deletes temp file.
- After processing all files:
  - Cleans up temp directory.
  - Saves records as a DataFrame to CSV, sorted by index.
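
The skip-and-continue pattern and the final assembly can be illustrated without the archive. The sketch below flattens parsed records into rows, skips malformed ones with a warning, sorts by index, and writes CSV text. It uses the standard-library `csv` module as a stand-in for the pandas step in the notebook; `assemble_rows`, `rows_to_csv`, the shortened label list, and the sample values are all hypothetical.

```python
import csv
import io
import logging

logger = logging.getLogger("qm9_demo")


def assemble_rows(parsed: list[dict], labels: list[str]) -> list[dict]:
    """Flatten parsed records into rows, skipping malformed ones."""
    rows = []
    for item in parsed:
        try:
            row = {"index": item["index"], "smiles": item["smiles"]}
            row.update({label: item["props"][label] for label in labels})
        except (KeyError, TypeError) as exc:
            # Log and move on instead of aborting the whole run.
            logger.warning("Skipping malformed record %r: %s", item, exc)
            continue
        rows.append(row)
    return sorted(rows, key=lambda r: r["index"])


def rows_to_csv(rows: list[dict], labels: list[str]) -> str:
    """Serialize rows to CSV text with a fixed column order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["index", "smiles", *labels],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Synthetic records; only two labels are used to keep the demo short.
demo_labels = ["mu", "alpha"]
parsed = [
    {"index": 2, "smiles": "N", "props": {"mu": 1.5, "alpha": 9.0}},
    {"index": 1, "smiles": "C", "props": {"mu": 0.0, "alpha": 13.0}},
    {"index": 3, "smiles": "O"},  # missing "props", so it is skipped
]
rows = assemble_rows(parsed, demo_labels)
csv_text = rows_to_csv(rows, demo_labels)
print(csv_text)
```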




In [None]:
# Import the required Libraries
import tarfile
from pathlib import Path
import pandas as pd
from tqdm import tqdm
import logging
from rdkit import Chem
import tempfile
import uuid

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# keep only the 12 targets we care about
PROP_LABELS = [
    "mu", "alpha",                  # dipole moment (D), polarizability (a₀³)
    "homo", "lumo", "gap",          # orbital energies (Ha) and gap (Ha)
    "r2",                           # <R²> (a₀²)
    "zpve",                         # zero‑point vibrational energy (Ha)
    "U0", "U", "H", "G",            # energies / enthalpy / free energy (Ha)
    "Cv"                            # heat capacity (cal mol‑¹ K‑¹)
]

IGNORED = 3                         # rotA, rotB, rotC -> Skipped Properties 

def parse_qm9_xyz(xyz_path: str | Path) -> dict:
    """
    Read one QM9‑style .xyz file and return:
        {
            'index' : 3895,
            'smiles': 'O=C1C=CON=N1',
            'props' : {label: value, …}   # the 12 targets (rotA/B/C omitted)
        }
    """
    lines = Path(xyz_path).read_text().splitlines()
    
    n_atoms = int(lines[0])                 # sanity‑check only
    header_parts = lines[1].split('\t')     # header parts are tab-separated

    # “gdb 3895” → tag = 'gdb', idx = 3895
    tag, idx = header_parts[0].split()      # split on whitespace
    idx = int(idx)                          # convert index to integer

    # Skip the first three values, then take the next 12
    start = 1 + IGNORED                    # 1 = first value column after “gdb idx”
    end   = start + len(PROP_LABELS)
    prop_values = list(map(float, header_parts[start:end]))     # make a list of properties
    props = dict(zip(PROP_LABELS, prop_values))

    smiles = lines[-2].split('\t')[0]      # first token on the penultimate line

    return {"index": idx, "smiles": smiles, "props": props}


def process_tar_gz(tar_path: str | Path, output_csv: str | Path) -> None:
    """
    Process a .tar.gz file containing QM9 .xyz files and save the dataset to a CSV.

    Args:
        tar_path: Path to the .tar.gz file.
        output_csv: Path to save the output CSV file.
    """
    records = []
    temp_dir = Path(tempfile.gettempdir()) / f"qm9_processing_{uuid.uuid4()}"

    try:
        # Create temporary directory
        temp_dir.mkdir(parents=True, exist_ok=True)

        with tarfile.open(tar_path, "r:gz") as tar:
            xyz_files = [member for member in tar.getmembers() if member.name.endswith(".xyz")]
            logger.info(f"Found {len(xyz_files)} .xyz files in {tar_path}")

            for member in tqdm(xyz_files, desc="Processing .xyz files"):
                try:
                    f = tar.extractfile(member)         # Extract the file from the tar
                    if f is None:
                        logger.warning(f"Could not extract {member.name}")
                        continue

                    # Use unique temporary file name
                    temp_xyz = temp_dir / f"{uuid.uuid4()}.xyz"
                    content = f.read().decode("utf-8")
                    temp_xyz.write_text(content)

                    result = parse_qm9_xyz(temp_xyz)    # Parse the extracted .xyz file
                    if result:
                        record = {"index": result["index"], "smiles": result["smiles"]} # Initialize record with index and smiles
                        record.update(result["props"])
                        records.append(record)
                    else:
                        logger.warning(f"Skipping {member.name}: Parsing failed")

                    # Clean up temporary file
                    temp_xyz.unlink()
                except Exception as e:
                    logger.error(f"Error processing {member.name}: {e}")
                    continue
    except Exception as e:
        logger.error(f"Error opening tar file {tar_path}: {e}")
        return
    finally:
        # Clean up temporary directory
        try:
            for temp_file in temp_dir.glob("*.xyz"):
                temp_file.unlink()
            temp_dir.rmdir()
        except Exception as e:
            logger.warning(f"Error cleaning up temporary directory {temp_dir}: {e}")

    if records:
        df = pd.DataFrame(records)          # Create DataFrame from records
        columns = ["index", "smiles"] + PROP_LABELS     # Define the order of columns
        df = df[columns]                # Reorder DataFrame columns
        df = df.sort_values("index")    # Sort DataFrame by index
        df.to_csv(output_csv, index=False)      # Save DataFrame to CSV
        logger.info(f"Saved dataset with {len(df)} records to {output_csv}")
    else:
        logger.warning("No valid records processed.")


# Main execution block
if __name__ == "__main__":
    tar_path = "dataset.tar.gz"  # The provided .tar.gz file
    output_csv = "qm9_dataset.csv"  # Output CSV file
    process_tar_gz(tar_path, output_csv)

2025-05-02 17:19:22,684 - INFO - Found 11001 .xyz files in dataset.tar.gz
Processing .xyz files: 100%|██████████| 11001/11001 [06:27<00:00, 28.36it/s]
2025-05-02 17:25:50,928 - INFO - Saved dataset with 11001 records to qm9_dataset.csv
