# CGSmiles Tutorial: Building Coarse-Grained Polymer Structures

This notebook demonstrates how MolPy's `parse_cgsmiles` API turns CGSmiles notation into structured `CGSmilesIR` objects. The examples cover linear, ring, branched, and annotated chains, and each one notes how the parser's source code behaves:

- `CGSmilesParserImpl` (see `molpy/parser/smiles/cgsmiles_parser.py`) walks the grammar and emits `CGSmilesNodeIR`/`CGSmilesBondIR` with stable IDs so later builders can track topology deterministically.
- `CGSmilesIR.base_graph` keeps the chain as an adjacency list, and fragments/annotations share the same data model that `PolymerBuilder` consumes when assembling Atomistic structures.

Use this tutorial whenever you need a concise reference for how MolPy interprets CGSmiles strings before feeding them into higher-level builders or typifiers.

## Import Libraries

The parser entry point lives in `molpy.parser.smiles`. Importing `molpy` itself ensures support data (force fields, builders) is registered so the parsed IR can be routed directly into the rest of the toolkit.

In [1]:
from molpy.parser.smiles import parse_cgsmiles
import molpy as mp

### Inspecting CGSmiles graphs

`CGSmilesGraphIR` exposes nodes, bonds, and annotations exactly as generated by `CGSmilesTransformer`. The helper below mirrors the internal schema (IDs, labels, annotations) so each example can focus on the structural idea rather than reimplementing traversal logic.

In [None]:
from pprint import pprint

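def describe_graph(cg_ir):
    """Summarize the base graph of a parsed CGSmiles IR as plain dicts."""
    graph = cg_ir.base_graph
    return {
        "nodes": [
            {"id": node.id, "label": node.label, "annotations": dict(node.annotations)}
            for node in graph.nodes
        ],
        "bonds": [
            # Report endpoints by node ID: labels repeat (e.g. two PEO nodes),
            # so labels alone cannot identify which nodes a bond connects.
            {"nodes": (bond.node_i.id, bond.node_j.id), "order": bond.order}
            for bond in graph.bonds
        ],
    }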


## Example 1: Linear Chain

Linear sequences are the canonical starting point for CGSmiles. The parser allocates one `CGSmilesNodeIR` per bracketed label and wires them in order; those node IDs propagate into `PolymerBuilder` so downstream connectors know which monomer copy each bond belongs to. Use this form when you want full manual control over the monomer order.

In [2]:
# Parse a simple linear chain
cgsmiles = "{[#PEO][#PMA][#PEO]}"
linear_ir = parse_cgsmiles(cgsmiles)

print(cgsmiles)
pprint(describe_graph(linear_ir))

Linear Chain: {[#PEO][#PMA][#PEO]}
Nodes: 3
Labels: ['PEO', 'PMA', 'PEO']
Bonds: 2
  Bond 0: 0(PEO) -> 1(PMA) (order=1)
  Bond 1: 1(PMA) -> 2(PEO) (order=1)
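
The wiring rule described above can be sketched without MolPy. The snippet below is an illustrative stand-in, not the library's implementation: `linear_chain` is a hypothetical helper that assigns sequential IDs to a list of labels and bonds each node to its predecessor with order 1.

```python
def linear_chain(labels):
    """Return (nodes, bonds) for a chain; bonds are (i, j, order) tuples.

    Hypothetical helper for illustration only; it is not part of MolPy.
    """
    # One node per label, with IDs assigned in reading order.
    nodes = [{"id": i, "label": label} for i, label in enumerate(labels)]
    # Each node is bonded to the one before it with a single bond.
    bonds = [(i, i + 1, 1) for i in range(len(labels) - 1)]
    return nodes, bonds


nodes, bonds = linear_chain(["PEO", "PMA", "PEO"])
for index, (i, j, order) in enumerate(bonds):
    print(f"Bond {index}: {i}({nodes[i]['label']}) -> {j}({nodes[j]['label']}) (order={order})")
```

Because the two `PEO` nodes share a label, the IDs (0 and 2) are what distinguish them in the bond list.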


## Example 2: Using Repeat Operators

Repeat operators (`|N`) expand into N copies of the immediately preceding node, including any branch attached to it. In the source, `CGSmilesTransformer.branched_atom` stores this multiplier and emits duplicated `CGSmilesNodeIR` objects so the IR remains explicit; `PolymerBuilder` never needs to interpret repeat metadata. Use repeats when you want compact strings yet still need every monomer spelled out for later topology edits.

In [3]:
# Repeat operator example
cgsmiles = "{[#PMA]|10}"
repeat_ir = parse_cgsmiles(cgsmiles)

print({"cgsmiles": cgsmiles, "dp": len(repeat_ir.base_graph.nodes)})
pprint(describe_graph(repeat_ir))

Repeated Chain: {[#PMA]|10}
Nodes: 10
Bonds: 9

Block Copolymer: {[#PEO]|5=[#PMA]|5}
Nodes: 10
Labels: ['PEO', 'PEO', 'PEO', 'PEO', 'PEO', 'PMA', 'PMA', 'PMA', 'PMA', 'PMA']
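
To make the expansion concrete, here is a minimal sketch that reproduces the label lists above with a regular expression. It is an assumption-laden toy: `REPEAT_TOKEN` and `expand_repeats` are hypothetical names, and the pattern only understands `[#LABEL]` with an optional `|N` suffix (it ignores bond symbols, branches, and rings).

```python
import re

# Hypothetical pattern: "[#LABEL]" followed by an optional "|N" repeat count.
REPEAT_TOKEN = re.compile(r"\[#(\w+)\](?:\|(\d+))?")


def expand_repeats(cgsmiles):
    """Expand every "[#LABEL]|N" token into N explicit labels (toy version)."""
    labels = []
    for label, count in REPEAT_TOKEN.findall(cgsmiles):
        # A missing count is captured as "", which means a single copy.
        labels.extend([label] * int(count or 1))
    return labels


homopolymer = expand_repeats("{[#PMA]|10}")
block = expand_repeats("{[#PEO]|5=[#PMA]|5}")

# A linear chain of n nodes has n - 1 bonds.
print(len(homopolymer), len(homopolymer) - 1)
print(block)
```

The point of expanding eagerly is the same as in the parser: once the list is explicit, nothing downstream has to know that a repeat operator was ever used.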


## Example 3: Ring Structures

Ring indices trigger the `ring_openings` table inside `CGSmilesTransformer`. When the parser later encounters the matching digit, it creates a bond back to the stored node with the specified order. This guarantees that every closure is explicit in the IR (no deferred constraints), which is vital when `PolymerBuilder` replays the topology to close cycles.

In [4]:
# Simple ring (triangle)
cgsmiles = "{[#PMA]1[#PEO][#PMA]1}"
ring_ir = parse_cgsmiles(cgsmiles)

print(cgsmiles)
pprint(describe_graph(ring_ir))

Triangle Ring: {[#PMA]1[#PEO][#PMA]1}
Nodes: 3
Bonds: 3
  Bond 0: 0 -> 1
  Bond 1: 1 -> 2
  Bond 2: 0 -> 2

Square Ring: {[#A]1[#B][#C][#D]1}
Nodes: 4
Bonds: 4

Ring with Double Bond: {[#PMA]1=[#PEO][#PMA]1}
  Bond 0: 0 -> 1 (order=2)
  Bond 1: 1 -> 2 (order=1)
  Bond 2: 0 -> 2 (order=1)
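
The ring-closure bookkeeping can be sketched with a plain dictionary. This is an independent toy, not MolPy's code: `RING_TOKEN` and `parse_ring_chain` are hypothetical, the pattern handles only unbranched chains with single-digit ring indices and an optional `=` before a node, and closure bonds are assumed to have order 1.

```python
import re

# Hypothetical pattern: optional "=", then "[#LABEL]", then an optional ring digit.
RING_TOKEN = re.compile(r"(=?)\[#(\w+)\](\d?)")


def parse_ring_chain(cgsmiles):
    """Return (labels, bonds) for an unbranched chain with ring digits."""
    labels, bonds = [], []
    ring_openings = {}  # ring digit -> ID of the node that opened it
    for bond_symbol, label, ring_digit in RING_TOKEN.findall(cgsmiles):
        node_id = len(labels)
        labels.append(label)
        if node_id > 0:
            # Chain bond to the previous node; "=" marks a double bond.
            bonds.append((node_id - 1, node_id, 2 if bond_symbol == "=" else 1))
        if ring_digit in ring_openings:
            # Second occurrence of the digit: close the ring explicitly.
            bonds.append((ring_openings.pop(ring_digit), node_id, 1))
        elif ring_digit:
            # First occurrence: remember which node opened the ring.
            ring_openings[ring_digit] = node_id
    if ring_openings:
        raise ValueError(f"unclosed ring indices: {sorted(ring_openings)}")
    return labels, bonds


labels, bonds = parse_ring_chain("{[#PMA]1=[#PEO][#PMA]1}")
print(bonds)
```

Popping the digit from the table when the ring closes means a leftover entry at the end signals an unmatched opening, which the sketch reports as an error.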


## Example 4: Branched Structures

Branches become child `CGSmilesGraphIR` objects whose nodes inherit fresh IDs before being merged back into the parent graph. Because each branch bond stores its order, the resulting IR precisely mirrors the BigSMILES semantics that `PolymerBuilder` expects when deciding how many ports remain on each branch node.

In [5]:
# Simple branch
cgsmiles = "{[#PMA]([#PEO])[#PMA]}"
branch_ir = parse_cgsmiles(cgsmiles)

print(cgsmiles)
pprint(describe_graph(branch_ir))

Simple Branch: {[#PMA]([#PEO])[#PMA]}
Nodes: 3
Labels: ['PMA', 'PEO', 'PMA']
Bonds: 2

Branch with Double Bond: {[#PMA](=[#PEO])[#PMA]}
  Bond 0: 0(PMA) -> 1(PEO) (order=2)
  Bond 1: 0(PMA) -> 2(PMA) (order=1)

Branched with Repeat: {[#PMA]([#PEO][#PEO]=[#OH])|3}
Nodes: 12
Structure: 3 PMA units, each with a PEO-PEO=OH branch
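
The attachment rule for branches (a branch bonds to the node before the opening parenthesis, and the chain resumes from that same node after the closing one) can be illustrated with a stack. Note that this sketch swaps in a flat stack-based pass rather than the child-graph merge described above, and `BRANCH_TOKEN` and `parse_branches` are hypothetical names; repeats and rings are not handled.

```python
import re

# Hypothetical tokenizer: either a "[#LABEL]" node or one of "(", ")", "=".
BRANCH_TOKEN = re.compile(r"\[#(\w+)\]|([()=])")


def parse_branches(cgsmiles):
    """Return (labels, bonds) for a string with parenthesized branches."""
    labels, bonds = [], []
    previous = None      # node the next node will bond to
    branch_points = []   # stack of nodes to return to after ")"
    order = 1            # order of the next bond
    for label, symbol in BRANCH_TOKEN.findall(cgsmiles):
        if symbol == "(":
            branch_points.append(previous)
        elif symbol == ")":
            previous = branch_points.pop()
        elif symbol == "=":
            order = 2
        else:
            node_id = len(labels)
            labels.append(label)
            if previous is not None:
                bonds.append((previous, node_id, order))
            previous = node_id
            order = 1  # a bond symbol applies to one bond only
    return labels, bonds


labels, bonds = parse_branches("{[#PMA](=[#PEO])[#PMA]}")
print(labels)
print(bonds)
```

Both bonds start at node 0, which is why the branch point ends up with two neighbors while the side-chain `PEO` has one.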


## Example 5: Nested Branches

Nested parentheses recurse through the same branch assembler, meaning each sub-branch keeps its own adjacency before being grafted back. This matches the invariant noted in `CGSmilesTransformer.branch`: every branch carries its bond order so deeply nested dendrons can be reconstructed without guesswork.

In [6]:
# Nested branches (dendritic structure)
cgsmiles = "{[#PMA][#PMA]([#PEO][#PEO]([#OH])[#PEO])[#PMA]}"
nested_ir = parse_cgsmiles(cgsmiles)

print(cgsmiles)
pprint(describe_graph(nested_ir))

Dendritic Structure: {[#PMA][#PMA]([#PEO][#PEO]([#OH])[#PEO])[#PMA]}
Nodes: 7
Labels: ['PMA', 'PMA', 'PEO', 'PEO', 'OH', 'PEO', 'PMA']
Bonds: 6

Structure:
  Main chain: PMA-PMA-PMA
  Branch on 2nd PMA: PEO-PEO-PEO
  Sub-branch on 2nd PEO: OH
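
The recursion can be made visible by parsing the same string into nested Python lists, where each inner list is one branch. This is a sketch under stated assumptions: `parse_nested` and `branch_depth` are hypothetical helpers, only `[#LABEL]` tokens and parentheses are recognized, and unbalanced parentheses are not diagnosed.

```python
import re

# Hypothetical tokenizer: a "[#LABEL]" node or a parenthesis.
NESTED_TOKEN = re.compile(r"\[#(\w+)\]|([()])")


def parse_nested(cgsmiles):
    """Parse labels into nested lists; each sub-list is a branch."""
    tokens = [m.group(1) or m.group(2) for m in NESTED_TOKEN.finditer(cgsmiles)]
    position = 0

    def parse_level():
        nonlocal position
        items = []
        while position < len(tokens):
            token = tokens[position]
            position += 1
            if token == "(":
                items.append(parse_level())  # recurse into the branch
            elif token == ")":
                return items                 # branch finished, return to parent
            else:
                items.append(token)
        return items

    return parse_level()


def branch_depth(tree):
    """Depth of the deepest branch; 0 for an unbranched chain."""
    return max((1 + branch_depth(item) for item in tree if isinstance(item, list)), default=0)


tree = parse_nested("{[#PMA][#PMA]([#PEO][#PEO]([#OH])[#PEO])[#PMA]}")
print(tree)
print(branch_depth(tree))
```

Each branch list directly follows the label it is attached to, so the `OH` sub-branch sits right after the second `PEO` in the inner list.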


## Example 6: Annotations

Key/value annotations live directly on `CGSmilesNodeIR.annotations`. The parser strips the semicolons and keeps both keys and values as strings (`q=-1` is stored as `{'q': '-1'}`), so consumers such as `PolymerBuilder` that read them later (for charge, coarse mass, etc.) need to convert the values to numbers. Use annotations when you need to propagate physical metadata with the topology.

In [7]:
# Annotations example
cgsmiles = "{[#PEO;q=1][#PMA;q=-1][#PEO;q=1]}"
annotated_ir = parse_cgsmiles(cgsmiles)

print(cgsmiles)
pprint(describe_graph(annotated_ir))

Annotated Chain: {[#PEO;q=1][#PMA;q=-1][#PEO;q=1]}
Nodes: 3
  Node 0: PEO, annotations={'q': '1'}
  Node 1: PMA, annotations={'q': '-1'}
  Node 2: PEO, annotations={'q': '1'}
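
Since the annotation values arrive as strings, a consumer has to convert them before doing arithmetic. The sketch below mimics the split on `;` and `=` with plain string methods and then sums the charges; `parse_node` is a hypothetical helper that operates on the text inside the brackets, not a MolPy function.

```python
def parse_node(body):
    """Split '#LABEL;key=value;...' into (label, annotations).

    Hypothetical helper: keys and values are kept as strings.
    """
    label, *pairs = body.split(";")
    # Split each pair on the first "=" only, so "q=-1" becomes ("q", "-1").
    annotations = dict(pair.split("=", 1) for pair in pairs)
    return label.lstrip("#"), annotations


parsed = [parse_node(body) for body in ("#PEO;q=1", "#PMA;q=-1", "#PEO;q=1")]
for label, annotations in parsed:
    print(label, annotations)

# Convert the string values explicitly before summing them.
net_charge = sum(float(annotations["q"]) for _, annotations in parsed)
print(net_charge)
```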


## Example 7: Fragment Definitions

Fragments (defined after the dot) are parsed into `CGSmilesFragmentIR` objects. The base graph references fragment names, while `fragments` stores the reusable bodies so builders can expand them on demand. This mirrors MolPy's requirement that monomer libraries be explicit before any reaction logic runs.

In [8]:
# Fragment definitions
cgsmiles = "{[#PEO][#PMA]}.{#PEO=[$]COC[$],#PMA=[$]CC(C)C[$]}"
fragment_ir = parse_cgsmiles(cgsmiles)

print(cgsmiles)
pprint({
    "graph": describe_graph(fragment_ir),
    "fragments": [(frag.name, frag.body) for frag in fragment_ir.fragments],
})

CGSmiles with Fragments: {[#PEO][#PMA]}.{#PEO=[$]COC[$],#PMA=[$]CC(C)C[$]}

Base Graph:
  Nodes: ['PEO', 'PMA']

Fragment Definitions: 2
  PEO = [$]COC[$]
  PMA = [$]CC(C)C[$]
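
A quick consistency check that is useful before building anything is to confirm that every label in the base graph has a fragment definition. The sketch below does this with string splitting and regular expressions; `split_fragments` is hypothetical, and it assumes fragment bodies contain no commas or closing braces, which holds for this example but not necessarily in general.

```python
import re

# Hypothetical pattern: "#NAME=" followed by a body that runs to "," or "}".
FRAGMENT_DEF = re.compile(r"#(\w+)=([^,}]+)")


def split_fragments(cgsmiles):
    """Return (graph_labels, fragments) for '{graph}.{fragment definitions}'."""
    graph_part, _, fragment_part = cgsmiles.partition("}.{")
    graph_labels = re.findall(r"\[#(\w+)", graph_part)
    fragments = dict(FRAGMENT_DEF.findall(fragment_part))
    return graph_labels, fragments


graph_labels, fragments = split_fragments(
    "{[#PEO][#PMA]}.{#PEO=[$]COC[$],#PMA=[$]CC(C)C[$]}"
)
# Labels used in the graph that have no fragment body.
missing = [label for label in graph_labels if label not in fragments]

print(graph_labels)
print(fragments)
print(missing)
```

An empty `missing` list means the monomer library is complete for this string.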


## Example 8: Complex Structure - Star Polymer

Complex CGSmiles strings simply layer the primitives above. The ring IDs wire the core, each branch arm uses the repeat operator, and the parser never infers missing atoms: `CGSmilesTransformer` enumerates every node so typifiers and connectors can inspect the final adjacency directly.

In [9]:
# Star polymer: cyclic core with 4 arms
cgsmiles = "{[#Core]1([#PEO]|5)[#Core]([#PMA]|5)[#Core]([#PEO]|5)[#Core]1([#PMA]|5)}"
star_ir = parse_cgsmiles(cgsmiles)

print({
    "cgsmiles": cgsmiles,
    "total_nodes": len(star_ir.base_graph.nodes),
    "total_bonds": len(star_ir.base_graph.bonds),
})
pprint(describe_graph(star_ir))

Star Polymer: {[#Core]1([#PEO]|5)[#Core]([#PMA]|5)[#Core]([#PEO]|5)[#Core]1([#PMA]|5)}

Structure:
  Total nodes: 24
  Core nodes: 4 (forming a ring)
  Arms: 4 branches, each with 5 nodes
  Total bonds: 24
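
The totals above follow from simple counting, which can serve as a sanity check on any star-shaped string. The function below is a worked-arithmetic sketch (`star_counts` is a hypothetical name) that assumes one arm per core node and a core of at least three nodes closed into a single ring.

```python
def star_counts(core_size, arm_length):
    """Return (nodes, bonds) for a ring core with one linear arm per core node."""
    arms = core_size
    nodes = core_size + arms * arm_length
    # A ring of k nodes has k bonds (k - 1 chain bonds plus one closure).
    ring_bonds = core_size
    # Each arm has arm_length - 1 internal bonds plus 1 bond to the core.
    arm_bonds = arms * arm_length
    return nodes, ring_bonds + arm_bonds


total_nodes, total_bonds = star_counts(core_size=4, arm_length=5)
print(total_nodes, total_bonds)
```

Equal node and bond counts are what you expect from a connected graph with exactly one cycle, so a mismatch in the parsed IR would point to a missing ring closure or a dropped arm bond.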


## Summary

This tutorial walks through the core CGSmiles motifs and ties each example to how `CGSmilesParserImpl` and the IR classes behave:

- `CGSmilesNodeIR` IDs, annotations, and fragments mirror the data consumed by `PolymerBuilder` and downstream typifiers.
- Repeat operators, rings, and branches expand immediately so later builders never need to interpret extra syntax.
- Inspecting the parsed IR is the fastest way to verify that a CGSmiles string satisfies the invariants enforced in `molpy/parser/smiles/cgsmiles_parser.py` before converting it into Atomistic structures.