# Parsing: From Strings to Structures

Molecular structures and substructure patterns can be written as compact text strings using notations such as SMILES and SMARTS. MolPy's `parser` module parses these strings into an intermediate representation (IR), which is then converted into `Atomistic` structures you can manipulate, react, and simulate.

**What the parser handles:**

- **SMILES** (Simplified Molecular Input Line Entry System) – Describes molecular structure
- **SMARTS** (SMILES Arbitrary Target Specification) – Describes patterns for substructure matching
- **BigSMILES** (in development) – Describes polymers and macromolecules

**When to use the parser:**
- Starting from chemical database entries (PubChem, ChEMBL use SMILES)
- Defining atom typing rules (SMARTS patterns)
- Building structures programmatically from text

**When NOT to use the parser:**
- If you have coordinate files (PDB, XYZ), use `molpy.io` readers instead
- For complex 3D structures, prefer the RDKit adapter with 3D coordinate generation

---

## Parsing SMILES: From String to Molecule

SMILES provides a compact way to represent molecular connectivity:
- `C` = carbon
- `O` = oxygen  
- `-` = single bond (often implicit)
- `=` = double bond
- `#` = triple bond

Example: `"CCO"` = ethanol (CH₃CH₂OH)
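To make the notation above concrete, here is a minimal sketch of a tokenizer for a tiny SMILES subset, written in plain Python. It is not MolPy's `SmilesParser`: `ToyMolecule` and `parse_toy_smiles` are hypothetical names introduced only for illustration. The subset covers single-letter atoms, `-`/`=`/`#` bonds, parenthesized branches, and single-digit ring closures. Implicit bonds are stored with order 1, including bonds between aromatic atoms, which a real parser would track separately.

```python
from dataclasses import dataclass, field

BOND_ORDERS = {"-": 1, "=": 2, "#": 3}
ATOM_SYMBOLS = "BCNOFPS" + "bcnops"  # aliphatic and aromatic single-letter atoms


@dataclass
class ToyMolecule:
    """Hypothetical container: element symbols plus (i, j, order) bonds."""
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)


def parse_toy_smiles(smiles: str) -> ToyMolecule:
    """Parse a very small SMILES subset into atoms and bonds."""
    mol = ToyMolecule()
    prev = None         # index of the atom the next atom will bond to
    pending = None      # explicit bond order waiting for its second atom
    branch_stack = []   # atoms to return to when a branch closes
    ring_open = {}      # ring-closure digit -> (atom index, bond order or None)

    for ch in smiles:
        if ch in BOND_ORDERS:
            pending = BOND_ORDERS[ch]
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in ring_open:
                # Second occurrence of the digit closes the ring.
                start, order = ring_open.pop(ch)
                mol.bonds.append((start, prev, pending or order or 1))
            else:
                # First occurrence remembers where the ring opened.
                ring_open[ch] = (prev, pending)
            pending = None
        elif ch in ATOM_SYMBOLS:
            mol.atoms.append(ch)
            idx = len(mol.atoms) - 1
            if prev is not None:
                mol.bonds.append((prev, idx, pending or 1))
            prev, pending = idx, None
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return mol


ethanol = parse_toy_smiles("CCO")
print(ethanol.atoms, ethanol.bonds)  # ['C', 'C', 'O'] [(0, 1, 1), (1, 2, 1)]
```

The same function handles `"c1ccccc1"` (benzene): the two `1` digits mark the atoms that close the ring, giving six atoms and six bonds. Bracket atoms, charges, stereochemistry, and two-letter elements are deliberately left out of this sketch.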


```python
from molpy.parser import SmilesParser

# Create parser instance
smiles_parser = SmilesParser()

# Parse SMILES string to intermediate representation (IR)
smiles_ir = smiles_parser.parse_smiles("CCO")
print(f"Parsed SMILES: {smiles_ir}")

# Note: The IR is an intermediate format
# You typically convert it to Atomistic next (see below)
```


### Converting to Atomistic

The parsed intermediate representation needs to be converted into an `Atomistic` structure:

```python
# (API still evolving - check molpy.parser for current conversion functions)
# Typical pattern:
# atomistic = smiles_ir_to_atomistic(smiles_ir)
```

> **Note:** For production use with SMILES, we currently recommend the RDKit adapter (`molpy.adapter.RDKitWrapper.from_smiles()`), which provides a more complete implementation, including 3D coordinate generation.

---

## Parsing SMARTS: Pattern Matching

SMARTS extends SMILES with wildcards and logical operators for matching molecular patterns:

- `[C]` = aliphatic carbon
- `[C;H2,H3]` = aliphatic carbon with 2 OR 3 hydrogens
- `[#6]` = any atom with atomic number 6 (carbon, aliphatic or aromatic)
- `c` = aromatic carbon

**Common use case:** Defining atom typing rules in force fields
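The logical operators are easier to see in code. In SMARTS, `;` is a low-precedence AND and `,` is OR, so `[C;H2,H3]` reads as "aliphatic carbon AND (two hydrogens OR three hydrogens)". The sketch below evaluates that logic against a plain dictionary describing one atom. It is not MolPy's `SmartsParser`: the function names and the dictionary keys (`atomic_number`, `aromatic`, `h_count`, `connections`) are assumptions made for this illustration, and only the `#n`, `Hn`, `Xn`, `C`, and `c` primitives are handled.

```python
def parse_bracket_atom(pattern: str) -> list:
    """Split '[C;H2,H3]' into AND-groups of OR-alternatives: [['C'], ['H2', 'H3']]."""
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("expected a bracketed atom expression")
    body = pattern[1:-1]
    return [group.split(",") for group in body.split(";")]


def primitive_matches(primitive: str, atom: dict) -> bool:
    """Test one SMARTS primitive against a hypothetical atom description."""
    kind, rest = primitive[:1], primitive[1:]
    if kind == "#" and rest.isdigit():   # atomic number
        return atom["atomic_number"] == int(rest)
    if kind == "H" and rest.isdigit():   # attached hydrogen count
        return atom["h_count"] == int(rest)
    if kind == "X" and rest.isdigit():   # total connections
        return atom["connections"] == int(rest)
    if primitive == "C":                 # aliphatic carbon
        return atom["atomic_number"] == 6 and not atom["aromatic"]
    if primitive == "c":                 # aromatic carbon
        return atom["atomic_number"] == 6 and atom["aromatic"]
    raise ValueError(f"unsupported primitive: {primitive!r}")


def atom_matches(pattern: str, atom: dict) -> bool:
    """Every ';' group must hold; within a group, any ',' alternative suffices."""
    return all(
        any(primitive_matches(p, atom) for p in group)
        for group in parse_bracket_atom(pattern)
    )


methyl_carbon = {"atomic_number": 6, "aromatic": False, "h_count": 3, "connections": 4}
benzene_carbon = {"atomic_number": 6, "aromatic": True, "h_count": 1, "connections": 3}

print(atom_matches("[C;H2,H3]", methyl_carbon))   # True
print(atom_matches("[C;H2,H3]", benzene_carbon))  # False
print(atom_matches("[#6]", benzene_carbon))       # True
```

Note the difference between `[C]` and `[#6]` on the aromatic atom: the first fails because uppercase `C` means aliphatic carbon, while the second matches on atomic number alone. Ring primitives such as `r6`, implicit AND (as in `[CH2]`), and negation are outside this sketch.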


```python
from molpy.parser import SmartsParser

# Create parser instance
smarts_parser = SmartsParser()

# Parse SMARTS pattern
# This pattern matches: "Carbon with 2 or 3 hydrogens"
smarts_ir = smarts_parser.parse_smarts("[C;H2,H3]")
print(f"Parsed SMARTS: {smarts_ir}")

# More complex example: aromatic carbon in 6-membered ring
aromatic_pattern = smarts_parser.parse_smarts("[c;r6]")
print(f"Aromatic pattern: {aromatic_pattern}")
```


### Using SMARTS with Typifier

SMARTS patterns are heavily used in `molpy.typifier` for atom type assignment:

```python
# Example workflow (see user-guide/typifier.ipynb for details)
from molpy.typifier import Rule

# Define typing rule with SMARTS pattern
rule = Rule(
    name="sp3_carbon",
    pattern="[C;X4]",  # aliphatic carbon with four connections (sp3)
    atom_type="C_sp3"
)
```

---

## Integration with Molecular Building

Once the parsed IR has been converted to `Atomistic`, the resulting structures feed into the rest of MolPy:

### Typical Pipeline

```
SMILES/SMARTS string
        ↓
    Parser (parse_smiles/parse_smarts)
        ↓
  Intermediate Representation (IR)
        ↓
  Convert to Atomistic
        ↓
  Wrap in Monomer (optional)
        ↓
  Use with Builder / Reacter / Typifier
```

### Example Workflow

```python
# 1. Parse SMILES
from molpy.parser import SmilesParser

smiles_parser = SmilesParser()
ir = smiles_parser.parse_smiles("c1ccccc1")  # benzene

# 2. Convert to Atomistic (API evolving)
# atomistic = convert_ir_to_atomistic(ir)

# 3. Wrap as Monomer for polymer building
# from molpy.core.wrappers.monomer import Monomer
# monomer = Monomer.from_atomistic(atomistic)

# 4. Use in reactions, typing, etc.
# (see user-guide/reacter.ipynb, user-guide/typifier.ipynb)
```

---

## Parser vs. Adapter

You might wonder: when should I use `molpy.parser` vs. `molpy.adapter.RDKitWrapper`?

| Feature | molpy.parser | RDKit Adapter |
|---------|-------------|---------------|
| **Dependencies** | Pure Python, no RDKit needed | Requires RDKit installation |
| **SMILES support** | Basic parsing | Full SMILES support |
| **3D coordinates** | ❌ No | ✅ Yes (via `generate_3d`) |
| **SMARTS patterns** | ✅ Yes (for typing) | ✅ Yes (via RDKit) |
| **Maturity** | In development | Production-ready |

**Recommendation:**
- Use the **RDKit adapter** for general SMILES → structure conversion
- Use **molpy.parser** for:
  - SMARTS pattern definition (typing rules)
  - Situations where an RDKit dependency is problematic
  - BigSMILES polymer notation (future)
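The recommendation can be turned into a small runtime check. The sketch below uses only the standard library to detect whether RDKit is importable and picks a backend accordingly. `choose_smiles_backend` and the returned labels are hypothetical names for illustration, not part of MolPy.

```python
import importlib.util
from typing import Optional


def rdkit_available() -> bool:
    """Return True if the `rdkit` package can be imported in this environment."""
    return importlib.util.find_spec("rdkit") is not None


def choose_smiles_backend(need_3d: bool, have_rdkit: Optional[bool] = None) -> str:
    """Pick a backend label following the recommendation above (hypothetical helper)."""
    if have_rdkit is None:
        have_rdkit = rdkit_available()
    if have_rdkit:
        # Full SMILES support and 3D coordinates: prefer the RDKit adapter.
        return "rdkit-adapter"
    if need_3d:
        # The pure-Python parser does not generate coordinates.
        raise RuntimeError("3D coordinates require RDKit, which is not installed")
    return "molpy.parser"


print(choose_smiles_backend(need_3d=False, have_rdkit=False))  # molpy.parser
```

Passing `have_rdkit` explicitly keeps the decision testable; leaving it as `None` defers to whatever is installed.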

---

## Future Directions

The parser module is actively evolving. Planned features:

- **BigSMILES support** – Parse polymer notation directly
- **Direct Atomistic conversion** – Skip the intermediate representation for a simpler API
- **Enhanced SMARTS** – Support for more advanced patterns and queries

Check the [MolPy roadmap](https://github.com/MolCrafts/molpy) for updates!
