# Parser Module: Complete Guide

The MolPy parser module converts molecular string notations (SMILES, BigSMILES, GBigSMILES, SMARTS) into structured intermediate representations (IR). This guide covers the SMILES, BigSMILES, and GBigSMILES parser APIs, their edge cases, and advanced usage. SMARTS patterns are covered in the Typifier SMARTS notebook listed under See Also.

## Table of Contents

1. [Overview](#overview)
2. [SMILES Parser](#smiles-parser)
3. [BigSMILES Parser](#bigsmiles-parser)
4. [GBigSMILES Parser](#gbigsmiles-parser)
5. [Conversion Functions](#conversion-functions)
6. [Advanced Topics](#advanced-topics)
7. [Summary](#summary)
8. [See Also](#see-also)

## Overview

### Parser Architecture

MolPy uses a **two-stage parsing architecture**:

1. **Lexing & Parsing**: String → Parse Tree (using Lark parser)
2. **Transformation**: Parse Tree → IR (using custom transformers)

This separation provides:
- **Modularity**: Grammar changes don't affect IR structure
- **Testability**: Each stage can be tested independently
- **Extensibility**: Easy to add new notations or IR formats
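
The two stages can be illustrated with a deliberately tiny, standard-library-only sketch. It is not MolPy's implementation, which uses Lark grammars and transformer classes. The names `ToyGraphIR`, `build_tree`, and `tree_to_ir` are hypothetical, and the toy grammar only handles unbranched chains of single-letter atoms with optional `-`, `=`, or `#` bonds.

```python
from dataclasses import dataclass, field

BOND_ORDERS = {"-": 1, "=": 2, "#": 3}


@dataclass
class ToyGraphIR:
    """Hypothetical IR: atom symbols plus (i, j, order) bond triples."""
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)


def build_tree(src: str) -> list:
    """Stage 1: string -> flat parse tree of ('atom', s) / ('bond', s) nodes."""
    tree = []
    for ch in src:
        if ch.isalpha():
            tree.append(("atom", ch))
        elif ch in BOND_ORDERS:
            tree.append(("bond", ch))
        else:
            raise ValueError(f"Unexpected character: {ch!r}")
    return tree


def tree_to_ir(tree: list) -> ToyGraphIR:
    """Stage 2: parse tree -> IR, linking each atom to the previous one."""
    ir = ToyGraphIR()
    pending_order = 1  # implicit single bond unless a bond node precedes the atom
    for kind, value in tree:
        if kind == "bond":
            pending_order = BOND_ORDERS[value]
        else:
            ir.atoms.append(value)
            if len(ir.atoms) > 1:
                ir.bonds.append((len(ir.atoms) - 2, len(ir.atoms) - 1, pending_order))
            pending_order = 1
    return ir


toy_ir = tree_to_ir(build_tree("CC=O"))
print(toy_ir.atoms, toy_ir.bonds)
```

Because `build_tree` knows nothing about `ToyGraphIR`, either stage can be swapped or tested on its own, which is the point of the separation described above.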

### Parser Instances

All parsers use the **singleton pattern** for efficiency:

```python
_smiles_parser = SmilesParserImpl()      # Created once
_bigsmiles_parser = BigSmilesParserImpl()
_gbigsmiles_parser = GBigSmilesParserImpl()
```

Functions like `parse_smiles()` delegate to these singletons.
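
The delegation pattern can be sketched without MolPy as follows. `ToyParserImpl` and `parse_toy` are hypothetical stand-ins; the counter exists only to show that repeated calls reuse one instance.

```python
class ToyParserImpl:
    """Hypothetical parser whose construction is the expensive step."""

    instances_created = 0

    def __init__(self):
        # A real parser would load and compile its grammar here, once.
        ToyParserImpl.instances_created += 1

    def parse(self, src: str) -> list:
        return list(src)


_toy_parser = ToyParserImpl()  # module-level instance, created at import time


def parse_toy(src: str) -> list:
    """Public entry point: reuses the shared instance on every call."""
    return _toy_parser.parse(src)


parse_toy("CCO")
parse_toy("CC")
print(ToyParserImpl.instances_created)
```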

## SMILES Parser

### Basic Usage

The `parse_smiles()` function converts SMILES strings into `SmilesGraphIR` objects.

In [1]:
from molpy.parser.smiles import parse_smiles

# Simple molecule: ethanol (CH3CH2OH)
ethanol_ir = parse_smiles("CCO")

print(f"Type: {type(ethanol_ir).__name__}")
print(f"Atoms: {len(ethanol_ir.atoms)}")
print(f"Bonds: {len(ethanol_ir.bonds)}")

# Examine atoms
for i, atom in enumerate(ethanol_ir.atoms):
    print(f"  Atom {i}: {atom.element}")

Type: SmilesGraphIR
Atoms: 3
Bonds: 2
  Atom 0: C
  Atom 1: C
  Atom 2: O


### Atom Properties

`SmilesAtomIR` records the following SMILES atom features (the example below accesses them as attributes such as `element` and `hydrogens`):

- **element**: Element symbol (C, N, O, etc.)
- **isotope**: Isotope number (e.g., 13 for ¹³C)
- **charge**: Formal charge (-1, 0, +1, etc.)
- **hydrogens**: Explicit hydrogen count
- **aromatic**: Aromatic flag (lowercase symbol = aromatic)
- **atom_class**: Atom class for reaction mapping
- **chirality**: Stereochemistry (@, @@, etc.)

In [2]:
# Bracket atom with isotope, chirality, and an explicit hydrogen
complex_smiles = "[13C@@H](O)(N)C"
complex_ir = parse_smiles(complex_smiles)

atom = complex_ir.atoms[0]
print("First atom properties:")
print(f"  Symbol: {atom.element}")
print(f"  Isotope: {getattr(atom, 'isotope', None)}")
print(f"  Chirality: {getattr(atom, 'chirality', None)}")
print(f"  H count: {getattr(atom, 'hydrogens', 0)}")
print(f"  Charge: {getattr(atom, 'charge', 0)}")
print(f"  Aromatic: {getattr(atom, 'aromatic', False)}")

First atom properties:
  Symbol: C
  Isotope: None
  Chirality: None
  H count: 1
  Charge: None
  Aromatic: False


### Bond Types

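SMILES supports multiple bond types. The orders below apply to bond symbols written explicitly; in the example output that follows, the implicit bonds of `c1ccccc1` are reported with `order=1`.

| Symbol | Order | Type |
|--------|-------|------|
| `-` or implicit | 1 | Single |
| `=` | 2 | Double |
| `#` | 3 | Triple |
| `:` | 1.5 | Aromatic |
| `/` or `\` | 1 | Stereo single |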

In [3]:
# Different bond types
examples = {
    "Single": "CC",
    "Double": "C=C",
    "Triple": "C#C",
    "Aromatic": "c1ccccc1"
}

for name, smiles in examples.items():
    ir = parse_smiles(smiles)
    if ir.bonds:
        bond = ir.bonds[0]
        print(f"{name}: order={bond.order}, stereo={bond.stereo}")

Single: order=1, stereo=None
Double: order=2, stereo=None
Triple: order=3, stereo=None
Aromatic: order=1, stereo=None


### Ring Notation

Rings use numeric labels to indicate closure points:

In [4]:
# Benzene: 6-membered aromatic ring
benzene = parse_smiles("c1ccccc1")
print(f"Benzene: {len(benzene.atoms)} atoms, {len(benzene.bonds)} bonds")

# Naphthalene: fused rings
naphthalene = parse_smiles("c1ccc2ccccc2c1")
print(f"Naphthalene: {len(naphthalene.atoms)} atoms")

# Spiro compound: two rings sharing one atom
spiro = parse_smiles("C1CCC2(C1)CCC2")
print(f"Spiro: {len(spiro.atoms)} atoms")

Benzene: 6 atoms, 6 bonds
Naphthalene: 10 atoms
Spiro: 8 atoms


### Branches

Parentheses denote branches:

In [5]:
# Isobutane: C-C(C)-C
isobutane = parse_smiles("CC(C)C")
print(f"Isobutane: {len(isobutane.atoms)} atoms")

# Multiple branches
complex_branch = parse_smiles("CC(C)(C)C")  # Neopentane
print(f"Neopentane: {len(complex_branch.atoms)} atoms")

Isobutane: 4 atoms
Neopentane: 5 atoms


### Mixtures (Dot Notation)

Dots separate disconnected components. `parse_smiles()` returns a **list** for mixtures:

In [6]:
# Single molecule → SmilesGraphIR
single = parse_smiles("CCO")
print(f"Single: {type(single).__name__}")

# Mixture → list[SmilesGraphIR]
mixture = parse_smiles("C.C.O")  # Methane + Methane + Water
print(f"Mixture: {type(mixture).__name__}")
print(f"Components: {len(mixture)}")

for i, component in enumerate(mixture):
    print(f"  Component {i}: {len(component.atoms)} atoms")

Single: SmilesGraphIR
Mixture: list
Components: 3
  Component 0: 1 atoms
  Component 1: 1 atoms
  Component 2: 1 atoms
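
Since the return type depends on the input, calling code often wants a uniform shape. A minimal sketch of such a normalizer follows; `as_components` is a hypothetical helper, not part of MolPy, and the strings stand in for IR objects so the snippet runs without MolPy installed.

```python
def as_components(result):
    """Return a list of components whether `result` is one IR or a list of IRs."""
    return result if isinstance(result, list) else [result]


# Stand-ins for parse_smiles("CCO") and parse_smiles("C.C.O")
single = "ethanol-ir"
mixture = ["methane-ir", "methane-ir", "water-ir"]

print(len(as_components(single)), len(as_components(mixture)))
```

With this in place, `for component in as_components(parse_smiles(src))` works for both single molecules and mixtures.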


### Error Handling

`parse_smiles()` rejects malformed input, typically by raising `ValueError`, in cases such as:
- Invalid syntax
- Unclosed rings
- Invalid atom symbols
- Mismatched brackets

In [7]:
from molpy.parser.smiles import parse_smiles

# Successful parsing examples
examples = [
    "CCO",  # ethanol
    "c1ccccc1",  # benzene
    "CC(=O)O",  # acetic acid
]

for smiles in examples:
    ir = parse_smiles(smiles)
    print(f"{smiles}: {len(ir.atoms)} atoms, {len(ir.bonds)} bonds")

# Note: the exception type may vary with the kind of syntax error,
# so production code should not assume ValueError is the only one raised.


CCO: 3 atoms, 2 bonds
c1ccccc1: 6 atoms, 6 bonds
CC(=O)O: 4 atoms, 3 bonds
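
The cell above only shows successful parses. One way to handle failures is a small wrapper that converts exceptions into return values. This is a sketch under assumptions: `try_parse` is a hypothetical helper, it catches only `ValueError` (widen the `except` clause if other exception types need handling), and `fake_parser` stands in for `parse_smiles` so the snippet runs without MolPy.

```python
def try_parse(parser, src):
    """Return (ir, None) on success or (None, message) if parsing raises ValueError."""
    try:
        return parser(src), None
    except ValueError as exc:
        return None, str(exc)


def fake_parser(src):
    # Stand-in for parse_smiles; rejects empty input only.
    if not src:
        raise ValueError("empty input")
    return src.upper()


for text in ["cco", ""]:
    ir, error = try_parse(fake_parser, text)
    print(repr(text), "->", ir if error is None else f"error: {error}")
```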


## BigSMILES Parser

### Overview

BigSMILES extends SMILES with **stochastic objects** for polymers:

- **Stochastic objects**: `{...}` denote repeating units
- **Bonding descriptors**: `[<]`, `[>]`, `[$]` control connectivity
- **Repeat units**: Monomer structures inside `{}`
- **End groups**: Terminal groups

### Basic Polymer

In [8]:
from molpy.parser import parse_bigsmiles

# Polyethylene: {[<]CC[>]}
pe_ir = parse_bigsmiles("{[<]CC[>]}")

print(f"Type: {type(pe_ir).__name__}")
print(f"Backbone atoms: {len(pe_ir.backbone.atoms)}")
print(f"Stochastic objects: {len(pe_ir.stochastic_objects)}")

# Examine stochastic object
sobj = pe_ir.stochastic_objects[0]
print(f"\nStochastic object:")
print(f"  Repeat units: {len(sobj.repeat_units)}")
print(f"  End groups: {len(sobj.end_groups)}")
print(f"  Left descriptors: {len(sobj.left_terminal.descriptors)}")
print(f"  Right descriptors: {len(sobj.right_terminal.descriptors)}")

Type: BigSmilesMoleculeIR
Backbone atoms: 0
Stochastic objects: 1

Stochastic object:
  Repeat units: 1
  End groups: 0
  Left descriptors: 1
  Right descriptors: 1


### Bonding Descriptors

Descriptors control how polymer chains connect:

| Descriptor | Meaning (BigSMILES notation) |
|------------|----------|
| `[<]` | Directional; bonds only with `[>]` |
| `[>]` | Directional; bonds only with `[<]` |
| `[$]` | Non-directional; bonds with another `[$]` |
| `[<]...[>]` | Head-to-tail (directional) repeat unit |
| `[$]...[>]` | Repeat unit mixing both descriptor types |

In [9]:
# Different descriptor patterns
patterns = {
    "Bidirectional": "{[<]CC[>]}",
    "Terminal left": "{[$]CC[>]}",
    "Terminal both": "{[$]CC[$]}",
}

for name, bigsmiles in patterns.items():
    ir = parse_bigsmiles(bigsmiles)
    sobj = ir.stochastic_objects[0]
    left_desc = [d.symbol for d in sobj.left_terminal.descriptors]
    right_desc = [d.symbol for d in sobj.right_terminal.descriptors]
    print(f"{name}: left={left_desc}, right={right_desc}")

Bidirectional: left=['<'], right=['>']
Terminal left: left=['$'], right=['>']
Terminal both: left=['$'], right=[]
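
In the BigSMILES notation, `<` pairs with `>` and `$` pairs with `$`. The check below sketches that pairing rule on plain symbol strings like those printed above; `can_bond` is a hypothetical helper and not a MolPy API.

```python
# Each descriptor symbol mapped to the symbol it is allowed to bond with.
CONJUGATE = {"<": ">", ">": "<", "$": "$"}


def can_bond(a: str, b: str) -> bool:
    """True if descriptor symbols `a` and `b` form a compatible pair."""
    return CONJUGATE.get(a) == b


for pair in [("<", ">"), ("<", "<"), ("$", "$"), ("$", ">")]:
    print(pair, can_bond(*pair))
```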


### Multiple Repeat Units

Stochastic objects can have multiple repeat units (random copolymers):

In [10]:
# Random copolymer: ethylene + propylene
copolymer = parse_bigsmiles("{[<]CC[>],[<]CC(C)[>]}")

sobj = copolymer.stochastic_objects[0]
print(f"Repeat units: {len(sobj.repeat_units)}")

for i, ru in enumerate(sobj.repeat_units):
    print(f"  Unit {i}: {len(ru.graph.atoms)} atoms")

Repeat units: 2
  Unit 0: 2 atoms
  Unit 1: 3 atoms


### End Groups

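End groups terminate polymer chains. In BigSMILES notation they follow the repeat units after a semicolon (`;`). The string below uses only a comma, so no end groups are reported:

In [11]:
# Comma-separated fragments only (no ';'), so end_groups is empty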
with_endgroups = parse_bigsmiles("{[$]CC,CC(C)[$]}")

sobj = with_endgroups.stochastic_objects[0]
print(f"End groups: {len(sobj.end_groups)}")

for i, eg in enumerate(sobj.end_groups):
    print(f"  End group {i}: {len(eg.graph.atoms)} atoms")

End groups: 0


## GBigSMILES Parser

### Overview

GBigSMILES (Generative BigSMILES) adds:
- **Molecular weight distributions**: Schulz-Zimm, etc.
- **System size specifications**: Total mass constraints
- **Descriptor weights**: Bond formation probabilities

### System Size

> **Note**: GBigSMILES parsing examples have been moved to the dedicated `gbigsmiles_parser.ipynb` notebook.
> See that notebook for comprehensive GBigSMILES parsing documentation.


### Molecular Weight Distributions

The Schulz-Zimm distribution is parameterized by the number-average molecular weight (Mn) and the weight-average molecular weight (Mw). Worked parsing examples are in the dedicated `gbigsmiles_parser.ipynb` notebook.
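
For a Schulz-Zimm distribution the two averages are related by Mw/Mn = (z + 1)/z, so the shape parameter is z = Mn/(Mw - Mn). The sketch below computes it from the two inputs. The function names are hypothetical, and whether MolPy uses this exact parameterization internally is an assumption.

```python
def dispersity(mn: float, mw: float) -> float:
    """Dispersity (Mw / Mn); equals 1 for a perfectly uniform sample."""
    return mw / mn


def schulz_zimm_shape(mn: float, mw: float) -> float:
    """Shape parameter z from Mw/Mn = (z + 1)/z, i.e. z = Mn / (Mw - Mn)."""
    if mn <= 0 or mw <= mn:
        raise ValueError("require 0 < Mn < Mw")
    return mn / (mw - mn)


mn, mw = 1000.0, 1500.0
print(dispersity(mn, mw), schulz_zimm_shape(mn, mw))
```

Larger values of z correspond to narrower distributions, since the dispersity (z + 1)/z approaches 1.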


### Descriptor Weights

Descriptor weights control bond formation probabilities. Worked examples are in the dedicated `gbigsmiles_parser.ipynb` notebook.


## Conversion Functions

### BigSMILES IR → PolymerSpec

BigSMILES IR can be converted to a `PolymerSpec` for polymer building.

> **Note**: Examples for this section have been moved to the dedicated `gbigsmiles_parser.ipynb` notebook.


## Advanced Topics

### Parser Configuration

Parsers use Lark grammars with specific configurations:

```python
GrammarConfig(
    grammar_path=Path(...) / 'gbigsmiles_new.lark',
    start='big_smiles_molecule',
    parser='earley',              # Earley parser for complex grammars
    propagate_positions=True,     # Track source positions
    maybe_placeholders=False,
    auto_reload=True              # Reload grammar on changes
)
```

### Custom Transformers

Parsers use transformer classes to convert parse trees to IR:

- `SmilesTransformer`: SMILES → SmilesGraphIR
- `BigSmilesTransformer`: BigSMILES → BigSmilesMoleculeIR
- `GBigSmilesTransformer`: GBigSMILES → GBigSmilesSystemIR

### Ring Closure Validation

Parsers track ring openings and validate closure:

In [12]:
# Valid ring
valid_ring = parse_smiles("C1CCCCC1")
print("Valid ring parsed successfully")

# Invalid: unclosed ring
try:
    invalid_ring = parse_smiles("C1CCCC")
except ValueError as e:
    print(f"Error: {e}")

Valid ring parsed successfully
Error: Unclosed rings: ['1']
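
The bookkeeping behind this check can be sketched in a few lines. This is a simplified illustration, not MolPy's implementation: `check_ring_closures` is a hypothetical function, it handles single-digit labels only (no `%nn` labels), and it skips digits inside bracket atoms so that isotopes and hydrogen counts are not mistaken for ring labels. The error message imitates the format shown in the output above.

```python
def check_ring_closures(src: str) -> None:
    """Raise ValueError if a single-digit ring label is opened but never closed."""
    open_labels = set()
    in_bracket = False
    for ch in src:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif ch.isdigit() and not in_bracket:
            # A label toggles: the first occurrence opens, the second closes.
            if ch in open_labels:
                open_labels.remove(ch)
            else:
                open_labels.add(ch)
    if open_labels:
        raise ValueError(f"Unclosed rings: {sorted(open_labels)}")


check_ring_closures("C1CCCCC1")  # balanced, returns silently
try:
    check_ring_closures("C1CCCC")
except ValueError as exc:
    print(f"Error: {exc}")
```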


## Summary

### Key Takeaways

1. **Three parsers**: SMILES (small molecules), BigSMILES (polymers), GBigSMILES (polydisperse systems)
2. **IR-based**: All parsers produce intermediate representations
3. **Lossless by design**: The IR is intended to preserve all information from the input string
4. **Extensible**: Easy to add new notations or IR formats
5. **Error handling**: Clear error messages for invalid syntax

### API Reference

**Functions:**
- `parse_smiles(src: str) → SmilesGraphIR | list[SmilesGraphIR]`
- `parse_bigsmiles(src: str) → BigSmilesMoleculeIR`
- `parse_gbigsmiles(src: str) → GBigSmilesSystemIR`
- `parse_gbigsmiles_to_polymerspec(src: str) → PolymerSpec`

**Classes:**
- `SmilesParser`: Backward-compatibility wrapper
- `SmilesGraphIR`, `BigSmilesMoleculeIR`, `GBigSmilesSystemIR`: IR types

## See Also

- [Monomer IR](monomer_ir.ipynb): Understanding IR design
- [GBigSMILES Parser](gbigsmiles_parser.ipynb): Detailed GBigSMILES guide
- [Typifier SMARTS](typifier_smarts.ipynb): SMARTS patterns for typing