# Parsing SMILES: From Strings to Structures

Molecular structures can be represented as compact text strings using line notations such as SMILES and SMARTS. MolPy's `parser` module converts these strings into an intermediate representation (IR) that you can inspect, manipulate, and convert to other data structures.

**What the parser handles:**

- **SMILES** (Simplified Molecular Input Line Entry System) – describes molecular structure
- **BigSMILES** – a line notation built on SMILES that supports intrinsically stochastic molecules such as polymers
- **SMARTS** (SMILES Arbitrary Target Specification) – describes patterns for substructure matching

---

## Parsing SMILES: From String to Molecule

A SMILES string encodes atoms and bonds as text. For example, `"CCO"` is ethanol (CH₃CH₂OH).


In [1]:
from molpy.parser import SmilesParser

# Create parser instance
smiles_parser = SmilesParser()

# Parse SMILES string to intermediate representation (IR)
smiles_ir = smiles_parser.parse_smiles("CCO")
print(f"Parsed SMILES: {smiles_ir}")

Parsed SMILES: SmilesIR(atoms=[AtomIR(symbol='C'), AtomIR(symbol='C'), AtomIR(symbol='O')], bonds=[BondIR('C', 'C', '-'), BondIR('C', 'O', '-')])


The IR is an intermediate format; you typically convert it into a concrete structure afterwards.

### Converting to Atomistic

The parsed intermediate representation needs to be converted into an `Atomistic` structure:

In [2]:
# from molpy.parser import smiles_ir_to_atomistic
# atomistic = smiles_ir_to_atomistic(smiles_ir)

## Parsing BigSMILES: SMILES with Bonding Descriptors

Let's start with `[$]CC[$]`: a two-carbon fragment with a `[$]` bonding descriptor at each end.

In [3]:
# BigSMILES is a superset of SMILES, so we reuse the SMILES parser
bigsmiles_ir = smiles_parser.parse_bigsmiles("[$]CC[$]")
print(f"Parsed BigSmiles: {bigsmiles_ir}")

Parsed BigSmiles: BigSmilesIR(atoms=[AtomIR(symbol='C'), AtomIR(symbol='C')], bonds=[BondIR('C', 'C', '-')], chain=BigSmilesChainIR(start_smiles=SmilesIR(atoms=[AtomIR(symbol='C'), AtomIR(symbol='C')], bonds=[BondIR('C', 'C', '-')]), repeat_segments=[]))
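To make the bonding-descriptor idea concrete, here is a minimal sketch that locates descriptors in a BigSMILES string with a regular expression. It is independent of MolPy: `find_bonding_descriptors` is a hypothetical helper, and the pattern only covers the `[$]`, `[<]`, and `[>]` forms with an optional numeric index, not the full BigSMILES grammar.

```python
import re

# Matches [$], [<], [>] with an optional integer index, e.g. [$1] or [<2].
# Simplifying assumption: the full BigSMILES grammar allows more forms.
_DESCRIPTOR = re.compile(r"\[([$<>])(\d*)\]")


def find_bonding_descriptors(bigsmiles: str) -> list[tuple[int, str, str]]:
    """Return (position, kind, index) for each bonding descriptor found."""
    return [
        (m.start(), m.group(1), m.group(2))
        for m in _DESCRIPTOR.finditer(bigsmiles)
    ]


print(find_bonding_descriptors("[$]CC[$]"))
# [(0, '$', ''), (5, '$', '')]
```

Ordinary bracket atoms such as `[NH4+]` are not matched, because the pattern requires one of `$`, `<`, or `>` immediately after the opening bracket.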


## Parsing SMARTS: Pattern Matching

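SMARTS is a pattern language that combines atom and bond primitives with logical operators to match molecular substructures:

- `[C]` = aliphatic carbon
- `[C;H2,H3]` = aliphatic carbon with 2 OR 3 hydrogens
- `[#6]` = any atom with atomic number 6 (carbon, aliphatic or aromatic)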
- `c` = aromatic carbon

**Common use case:** Defining atom typing rules in force fields

In [4]:
from molpy.parser import SmartsParser

# Create parser instance
smarts_parser = SmartsParser()

# Parse SMARTS pattern
# This pattern matches an aliphatic carbon with 2 or 3 hydrogens
smarts_ir = smarts_parser.parse_smarts("[C;H2,H3]")
print(f"Parsed SMARTS: {smarts_ir}")

# More complex example: aromatic carbon in a 6-membered ring
aromatic_pattern = smarts_parser.parse_smarts("[c;r6]")
print(f"Aromatic pattern: {aromatic_pattern}")

Parsed SMARTS: SmartsIR(atoms=[SmartsAtomIR(expression=AtomExpressionIR(op='weak_and', children=[AtomPrimitiveIR(type='symbol', value='C'), AtomExpressionIR(op='or', children=[AtomPrimitiveIR(type='hydrogen_count', value=2), AtomPrimitiveIR(type='hydrogen_count', value=3)])]))], bonds=[])
Aromatic pattern: SmartsIR(atoms=[SmartsAtomIR(expression=AtomExpressionIR(op='weak_and', children=[AtomPrimitiveIR(type='symbol', value='c'), AtomPrimitiveIR(type='ring_size', value=6)]))], bonds=[])
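The printed expression trees can be read as small boolean programs. The sketch below evaluates such a tree against a toy atom description. The two dataclasses are stand-ins that only mirror the field names visible in the output above; they are not MolPy's classes. The `matches` function and the dictionary-based atom are hypothetical, and treating `ring_size` as membership in a set of ring sizes is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class AtomPrimitiveIR:  # stand-in, not MolPy's class
    type: str
    value: object


@dataclass
class AtomExpressionIR:  # stand-in, not MolPy's class
    op: str
    children: list = field(default_factory=list)


def matches(node, atom: dict) -> bool:
    """Evaluate an expression tree against a toy atom description."""
    if isinstance(node, AtomPrimitiveIR):
        if node.type == "symbol":
            # Lowercase symbols denote aromatic atoms in SMARTS.
            aromatic = node.value.islower()
            same_element = atom["element"] == node.value.capitalize()
            return same_element and atom["aromatic"] == aromatic
        if node.type == "hydrogen_count":
            return atom["hydrogens"] == node.value
        if node.type == "ring_size":
            return node.value in atom["ring_sizes"]
        raise ValueError(f"unsupported primitive: {node.type}")
    results = (matches(child, atom) for child in node.children)
    if node.op in ("and", "weak_and"):
        return all(results)
    if node.op == "or":
        return any(results)
    raise ValueError(f"unsupported operator: {node.op}")


# Hand-built tree equivalent to the parsed "[C;H2,H3]" shown above.
pattern = AtomExpressionIR("weak_and", [
    AtomPrimitiveIR("symbol", "C"),
    AtomExpressionIR("or", [
        AtomPrimitiveIR("hydrogen_count", 2),
        AtomPrimitiveIR("hydrogen_count", 3),
    ]),
])

methyl = {"element": "C", "aromatic": False, "hydrogens": 3, "ring_sizes": set()}
quaternary = {"element": "C", "aromatic": False, "hydrogens": 0, "ring_sizes": set()}
print(matches(pattern, methyl))      # True
print(matches(pattern, quaternary))  # False
```

An atom-typing rule in a force field could pair a tree like this with a type name and assign the type to every atom for which the evaluation returns `True`.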


## Summary

The parsers do not return a specific molecular data structure. Instead, they return an intermediate representation (IR), which decouples parsing from any particular structure. If you define your own data structure, you can write a converter function that turns the IR into your custom class.
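As an illustration of that decoupling, here is a minimal sketch of a converter from an IR to a custom class. Everything in it is an assumption made for the example: the `AtomIR`, `BondIR`, and `SmilesIR` dataclasses are stand-ins rather than MolPy's definitions (in particular, the index-based bond fields are invented here), and `MyMolecule` and `ir_to_my_molecule` are hypothetical names.

```python
from dataclasses import dataclass, field


# Stand-ins for the parser's IR. Field names and the index-based bond
# layout are assumptions for this sketch, not MolPy's actual definitions.
@dataclass
class AtomIR:
    symbol: str


@dataclass
class BondIR:
    start: int  # index into SmilesIR.atoms (assumed)
    end: int    # index into SmilesIR.atoms (assumed)
    order: str


@dataclass
class SmilesIR:
    atoms: list
    bonds: list


# A custom target structure: a plain adjacency-list graph.
@dataclass
class MyMolecule:
    elements: list = field(default_factory=list)
    neighbors: dict = field(default_factory=dict)


def ir_to_my_molecule(ir: SmilesIR) -> MyMolecule:
    """Copy atoms and bonds from the IR into the custom graph."""
    mol = MyMolecule(elements=[atom.symbol for atom in ir.atoms])
    mol.neighbors = {i: [] for i in range(len(ir.atoms))}
    for bond in ir.bonds:
        # Bonds are undirected, so record both directions.
        mol.neighbors[bond.start].append(bond.end)
        mol.neighbors[bond.end].append(bond.start)
    return mol


ethanol_ir = SmilesIR(
    atoms=[AtomIR("C"), AtomIR("C"), AtomIR("O")],
    bonds=[BondIR(0, 1, "-"), BondIR(1, 2, "-")],
)
mol = ir_to_my_molecule(ethanol_ir)
print(mol.elements)   # ['C', 'C', 'O']
print(mol.neighbors)  # {0: [1], 1: [0, 2], 2: [1]}
```

Because the converter only reads the IR, the same parsed result could feed several target structures without re-parsing the string.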

---

## Future Directions

The parser module is actively evolving. Planned features:

- **G-BigSMILES support**

