# Case 2: GBigSMILES-based Polydisperse Polymer System

This notebook builds a polydisperse polymer system from a GBigSMILES string
that carries a molecular weight distribution.

**Key features:**
- Uses MolPy's parser for GBigSMILES
- Uses a three-layer architecture (SystemPlanner → PolydisperseChainGenerator → WeightedSequenceGenerator)
- Supports large systems with hundreds of polymer chains
- Generates molecular weight distribution plots
- Exports molecular weight data to JSON
- Generates polymer structure images (SVG format)

## Step 1: Import Required Libraries

Import all necessary modules from MolPy for building polydisperse polymer systems, handling distributions, and exporting to LAMMPS format.

In [1]:

import json
from pathlib import Path
from random import Random

import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.ticker import ScalarFormatter
import scienceplots
plt.style.use(["science", "nature", "no-latex"])
import numpy as np

import molpy as mp
from molpy.external import RDKitAdapter, Generate3D
from molpy.builder.polymer.connectors import ReacterConnector
from molpy.builder.polymer.linear import linear
from molpy.builder.polymer.placer import Placer, CovalentSeparator, LinearOrienter
from molpy.builder.polymer.sequence_generator import WeightedSequenceGenerator
from molpy.builder.polymer.system import (
    PolydisperseChainGenerator,
    SystemPlanner,
    SchulzZimmPolydisperse,
    UniformPolydisperse,
    PoissonPolydisperse,
    FlorySchulzPolydisperse,
)
from molpy.core.atomistic import Atomistic, Bond
from molpy.core.element import Element
from molpy.core.frame import Frame
from molpy.io.data.lammps import LammpsDataWriter
from molpy.parser.smiles import parse_gbigsmiles, parse_gbigsmiles_to_polymerspec
from molpy.reacter import Reacter, select_hydroxyl_group
from molpy.reacter.selectors import select_one_hydrogen, select_port_atom, select_prev_atom
from molpy.reacter.transformers import form_single_bond
from molpy.typifier.atomistic import OplsAtomisticTypifier

## Step 2: Load Force Field

Load the OPLS-AA force field and create a typifier.

In [2]:
# Load force field
forcefield_path = "oplsaa.xml"
ff = mp.io.read_xml_forcefield(forcefield_path)
typifier = OplsAtomisticTypifier(ff, strict_typing=True)

print("✅ Force field loaded successfully")



✅ Force field loaded successfully


## Step 3: Define Helper Functions

Define helper functions for:
- Extracting, embedding, and typifying monomers from GBigSMILES
- Extracting the distribution, monomer weights, and system mass from the parsed IR
- Calculating monomer and polymer molecular weights

In [3]:
def extract_monomers_from_gbigsmiles(
    gbigsmiles_str: str,
    typifier: OplsAtomisticTypifier,
) -> list[Atomistic]:
    """Extract monomers from GBigSMILES using molpy's parser."""
    spec = parse_gbigsmiles_to_polymerspec(gbigsmiles_str)
    segment = spec.segments[0]
    monomers = segment.monomers
    
    processed_monomers = []
    for monomer in monomers:
        adapter = RDKitAdapter(internal=monomer)
        generate_3d = Generate3D(add_hydrogens=True, embed=True, optimize=True, update_internal=True)
        adapter = generate_3d(adapter)
        monomer = adapter.get_internal()
        
        monomer.get_topo(gen_angle=True, gen_dihe=True)
        
        for idx, atom in enumerate(monomer.atoms):
            atom["id"] = idx + 1
        
        typifier.typify(monomer)
        processed_monomers.append(monomer)
    
    return processed_monomers


def extract_distribution_and_weights(
    ir: mp.parser.smiles.gbigsmiles_ir.GBigSmilesSystemIR,
    n_units: int,
) -> tuple[mp.parser.smiles.gbigsmiles_ir.DistributionIR, dict[int, float], float]:
    """Extract distribution, weights, and system molecular weight from GBigSMILES IR."""
    system_mw = ir.total_mass
    mol_ir = ir.molecules[0].molecule
    
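    distribution_ir = None
    for meta in mol_ir.stochastic_metadata:
        if meta.distribution:
            distribution_ir = meta.distribution
            break
    if distribution_ir is None:
        # Fail early: later steps read distribution_ir.name and .params
        raise ValueError("No distribution annotation found in GBigSMILES")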
    
    weights: dict[int, float] = {i: 1.0 for i in range(n_units)}
    for gb_desc in mol_ir.descriptor_weights:
        if gb_desc.pair_weights:
            for i, weight in enumerate(gb_desc.pair_weights):
                if i < n_units:
                    weights[i] = float(weight)
    
    return distribution_ir, weights, system_mw


def calculate_polymer_molecular_weight(polymer: Atomistic) -> float:
    """Calculate the molecular weight of a structure by summing its atomic masses."""
    return sum(Element(atom["symbol"].upper()).mass for atom in polymer.atoms)


def calculate_avg_monomer_mw(monomers: list[Atomistic]) -> float:
    """Calculate the unweighted average molecular weight over the monomer types."""
    return sum(calculate_polymer_molecular_weight(m) for m in monomers) / len(monomers)


print("✅ Helper functions defined")

✅ Helper functions defined


## Step 4: Define GBigSMILES and Extract Monomers

Define the GBigSMILES string for a polydisperse system built from two repeat units with a Flory-Schulz chain-length distribution.
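To make the layout of the string easier to read, the following sketch splits it into its visible parts with regular expressions. It is only an illustration of the notation used in this notebook, not a replacement for `parse_gbigsmiles`; the variable names (`text`, `repeat_units`, `name`, `args`, `total_mass`) are hypothetical, and the comma split only works because these particular repeat units contain no commas.

```python
import re

text = "{[<]OCCOCCOCCOCCO[>],[<]OCC(c1ccccc1)CO[>]}|flory_schulz(0.8)|[H].|1e7|"

# Repeat units: comma-separated entries inside the braces.
repeat_units = re.search(r"\{(.*?)\}", text).group(1).split(",")

# Distribution annotation: |name(arg, ...)| directly after the closing brace.
dist_match = re.search(r"\}\|(\w+)\(([^)]*)\)\|", text)
name = dist_match.group(1)
args = [float(x) for x in dist_match.group(2).split(",")]

# System size: trailing .|mass| annotation (read later as ir.total_mass).
total_mass = float(re.search(r"\.\|([^|]+)\|$", text).group(1))

print(repeat_units)
print(name, args, total_mass)
```

The `[<]` and `[>]` bond descriptors mark where each repeat unit connects to its neighbours, which is what the port map in Step 8 relies on.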

In [None]:
# Define GBigSMILES with Flory-Schulz distribution
gbigsmiles = "{[<]OCCOCCOCCOCCO[>],[<]OCC(c1ccccc1)CO[>]}|flory_schulz(0.8)|[H].|1e7|"

# Parse GBigSMILES and extract monomers
monomers = extract_monomers_from_gbigsmiles(gbigsmiles, typifier)
avg_monomer_mw = calculate_avg_monomer_mw(monomers)

print(f"✅ Extracted {len(monomers)} monomer types")
print(f"   Average monomer MW: {avg_monomer_mw:.1f} g/mol")

## Step 5: Extract Distribution Parameters

Extract distribution type, parameters, weights, and system molecular weight from the parsed GBigSMILES IR.
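As background for the Flory-Schulz case, here is a minimal NumPy sketch of the classical most-probable distribution, assuming the parameter `p` is the propagation probability so that the number fraction of chains of length n is (1 - p) p^(n-1). Whether `FlorySchulzPolydisperse` uses exactly this convention is an assumption; the helper `sample_flory_schulz_dp` is hypothetical and not part of MolPy.

```python
import numpy as np


def sample_flory_schulz_dp(p: float, n_chains: int, seed: int = 42) -> np.ndarray:
    """Draw chain lengths with P(n) = (1 - p) * p**(n - 1), n >= 1."""
    rng = np.random.default_rng(seed)
    # A geometric variable with success probability (1 - p) has this PMF.
    return rng.geometric(1.0 - p, size=n_chains)


p = 0.8
dp = sample_flory_schulz_dp(p, n_chains=200_000)

dp_n_theory = 1.0 / (1.0 - p)        # number-average degree of polymerization
dp_w_theory = (1.0 + p) / (1.0 - p)  # weight-average degree of polymerization

dp_n_sampled = dp.mean()
dp_w_sampled = (dp.astype(float) ** 2).sum() / dp.sum()

print(f"DPn sampled={dp_n_sampled:.2f} theory={dp_n_theory:.2f}")
print(f"DPw sampled={dp_w_sampled:.2f} theory={dp_w_theory:.2f}")
```

Under this convention the dispersity in chain length is DPw / DPn = 1 + p, so it approaches 2 as p approaches 1.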

In [None]:
# Parse GBigSMILES IR
ir = parse_gbigsmiles(gbigsmiles)
distribution_ir, weights, system_mw = extract_distribution_and_weights(ir, len(monomers))

print("✅ Distribution extracted:")
print(f"   Type: {distribution_ir.name}")
print(f"   Parameters: {distribution_ir.params}")
print(f"   System MW: {system_mw:.1f} g/mol")

# Create Polydisperse distribution from IR
dist_name = distribution_ir.name
params = distribution_ir.params

if dist_name == "schulz_zimm":
    Mn = float(params["p0"])
    Mw = float(params["p1"])
    dp_dist = SchulzZimmPolydisperse(Mn=Mn, Mw=Mw, random_seed=42)
elif dist_name == "uniform":
    min_dp = int(params["p0"])
    max_dp = int(params["p1"])
    dp_dist = UniformPolydisperse(min_dp=min_dp, max_dp=max_dp, random_seed=42)
elif dist_name == "poisson":
    lambda_param = float(params["p0"])
    dp_dist = PoissonPolydisperse(lambda_param=lambda_param, random_seed=42)
elif dist_name == "flory_schulz":
    p = float(params["p0"])
    dp_dist = FlorySchulzPolydisperse(p=p, random_seed=42)
else:
    raise ValueError(f"Unsupported distribution type: {dist_name}")

print(f"   Distribution created: {dist_name}")

## Step 6: Generate Polymer Sequences

Use the three-layer architecture to generate polymer sequences:
1. **WeightedSequenceGenerator**: Generates sequences based on monomer weights
2. **PolydisperseChainGenerator**: Generates chains with specified degree of polymerization distribution
3. **SystemPlanner**: Plans the system to match target molecular weight
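The division of labour between the three layers can be sketched in plain Python. This is a simplified stand-in under stated assumptions, not MolPy's implementation: the functions `weighted_sequence`, `sample_dp`, and `plan_system` are hypothetical, the monomer masses are illustrative, and the planner here simply stops once the target mass is reached, without the relative-error check or trimming that `SystemPlanner` exposes.

```python
from random import Random


def weighted_sequence(weights: dict[str, float], dp: int, rng: Random) -> list[str]:
    """Layer 1: pick `dp` monomer labels according to their weights."""
    labels = list(weights)
    return rng.choices(labels, weights=[weights[k] for k in labels], k=dp)


def sample_dp(p: float, rng: Random) -> int:
    """Layer 2: draw a chain length by propagating with probability p."""
    dp = 1
    while rng.random() < p:
        dp += 1
    return dp


def plan_system(weights, masses, p, target_mass, rng):
    """Layer 3: keep adding chains until the total mass reaches the target."""
    chains, total = [], 0.0
    while total < target_mass:
        seq = weighted_sequence(weights, sample_dp(p, rng), rng)
        chains.append(seq)
        total += sum(masses[m] for m in seq)
    return chains, total


rng = Random(42)
weights = {"0": 1.0, "1": 1.0}
masses = {"0": 194.2, "1": 152.2}  # illustrative monomer masses in g/mol
chains, total = plan_system(weights, masses, p=0.8, target_mass=1e5, rng=rng)
print(f"{len(chains)} chains, total mass {total:.1f} g/mol")
```

Keeping the layers separate means the sequence rule, the chain-length distribution, and the stopping criterion can each be swapped without touching the other two.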

In [None]:
# Calculate individual monomer masses
monomer_masses = {
    str(i): calculate_polymer_molecular_weight(monomer)
    for i, monomer in enumerate(monomers)
}

# Convert weights dict to monomer_weights format
monomer_weights = {str(i): weights[i] for i in range(len(monomers))}
seq_generator = WeightedSequenceGenerator(monomer_weights=monomer_weights)

# Create chain generator
chain_gen = PolydisperseChainGenerator(
    seq_generator=seq_generator,
    monomer_mass=monomer_masses,
    end_group_mass=0.0,
    distribution=dp_dist,
)

# Create system planner
planner = SystemPlanner(
    chain_generator=chain_gen,
    target_total_mass=system_mw,
    max_rel_error=0.02,
    max_chains=None,
    enable_trimming=True,
)

# Generate system plan
rng = Random(42)
system_plan = planner.plan_system(rng=rng)

print(f"✅ Generated {len(system_plan.chains)} chains, total mass: {system_plan.total_mass:.1f} g/mol")

# Filter out chains with dp < 2
valid_chains = [chain for chain in system_plan.chains if chain.dp >= 2]

# Convert chains to sequences
sequences = []
for chain in valid_chains:
    seq = [int(m) for m in chain.monomers]
    sequences.append(seq)

seq_lengths = [len(seq) for seq in sequences]
print(f"   Sequence length stats: min={min(seq_lengths)}, max={max(seq_lengths)}, "
      f"mean={np.mean(seq_lengths):.1f}, median={np.median(seq_lengths):.1f}, n_seqs={len(sequences)}")

## Step 7: Preview Theoretical Molecular Weight Distribution

Preview the theoretical molecular weight distribution based on the generated sequences before building the actual polymer structures.
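The averages reported for the preview follow the standard definitions: Mn is the arithmetic mean of the chain masses, Mw is the sum of squared masses divided by the sum of masses, and PDI is their ratio. A small worked example with three made-up chain masses, using a hypothetical helper `mw_statistics`:

```python
import numpy as np


def mw_statistics(masses) -> tuple[float, float, float]:
    """Return (Mn, Mw, PDI) for a collection of chain masses."""
    m = np.asarray(masses, dtype=float)
    mn = m.mean()                # number-average molecular weight
    mw = (m**2).sum() / m.sum()  # weight-average molecular weight
    return float(mn), float(mw), float(mw / mn)


mn, mw, pdi = mw_statistics([100.0, 200.0, 300.0])
print(f"Mn={mn:.1f} Mw={mw:.1f} PDI={pdi:.3f}")  # Mn=200.0 Mw=233.3 PDI=1.167
```

Because Mw weights each chain by its own mass, it is never smaller than Mn, and the PDI equals 1 only when all chains have the same mass.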

In [None]:
# Calculate estimated molecular weights from sequences using actual monomer masses
estimated_molecular_weights = []
for sequence in sequences:
    # Sum actual monomer masses for this sequence
    mw = sum(monomer_masses[str(m)] for m in sequence)
    estimated_molecular_weights.append(mw)

print(f"✅ Estimated molecular weights for {len(estimated_molecular_weights)} sequences")

# Calculate statistics from sequences
mw_array = np.array(estimated_molecular_weights)
Mn_estimated = np.mean(mw_array)
Mw_estimated = np.sum(mw_array**2) / np.sum(mw_array)
PDI_estimated = Mw_estimated / Mn_estimated

print(f"   Estimated Mn (number-average): {Mn_estimated:.1f} g/mol")
print(f"   Estimated Mw (weight-average): {Mw_estimated:.1f} g/mol")
print(f"   Estimated PDI: {PDI_estimated:.3f}")

# Create output directory
output_dir = Path("case2_output")
output_dir.mkdir(parents=True, exist_ok=True)

# Plot distribution
M_range = np.linspace(max(0, np.min(mw_array) * 0.3), np.max(mw_array) * 1.3, 3000)

fig, ax = plt.subplots(figsize=(2.0, 2.0))
fmt = ScalarFormatter(useMathText=True)
fmt.set_scientific(True)
fmt.set_powerlimits((-1, 1))
ax.yaxis.set_major_formatter(fmt)

# Histogram of estimated MWs from sequences
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
ax.hist(
    estimated_molecular_weights,
    bins=80,
    density=True,
    zorder=2,
    edgecolor="black",
    linewidth=0.5,
    color=colors[6],
    label="Generated chains"
)

# Theoretical curve
if isinstance(dp_dist, SchulzZimmPolydisperse):
    pdf_theoretical = dp_dist.molecular_weight_pdf(M_range)
else:
    pdf_theoretical = dp_dist.molecular_weight_pdf(M_range, avg_monomer_mass=avg_monomer_mw)

ax.plot(M_range, pdf_theoretical, zorder=3, color=colors[0], label="Theoretical")

ax.set_xlabel('Molecular Weight (g/mol)')
ax.set_ylabel('Probability Density')
ax.set_title('')
ax.grid(True, alpha=0.3)
ax.set_axisbelow(True)
ax.legend(loc='upper right')

plt.tight_layout()
plot_file = output_dir / "molecular_weight_distribution_preview.png"
plt.savefig(plot_file, dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none')
plt.show()

print(f"   Distribution preview plot saved to: {plot_file}")

## Step 8: Build Polymers from Sequences

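Build polymer structures from the generated sequences using the linear builder with a dehydration (condensation) reaction: each new bond removes a hydroxyl group from one monomer and a hydrogen from the other.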

In [None]:

# Create library: map labels to monomers
library = {}
labels = []
for i, monomer in enumerate(monomers):
    label = chr(ord("A") + i)
    library[label] = monomer
    labels.append(label)

# Create Reacter for dehydration
default_reacter = Reacter(
    name="-OH + -OH -> -O-",
    port_selector_left=select_prev_atom,
    port_selector_right=select_port_atom,
    leaving_selector_left=select_hydroxyl_group,
    leaving_selector_right=select_one_hydrogen,
    bond_former=form_single_bond,
)

# Create port_map
port_map = {}
for left_label in labels:
    for right_label in labels:
        port_map[(left_label, right_label)] = (">", "<")

# Create ReacterConnector
connector = ReacterConnector(
    default=default_reacter,
    port_map=port_map,
)

# Create Placer
separator = CovalentSeparator(buffer=0.0)
orienter = LinearOrienter()
placer = Placer(separator, orienter)

# Build polymers from sequences
polymers = []
for sequence in sequences:
    sequence_str = "".join(labels[i] for i in sequence)
    
    build_result = linear(
        sequence=sequence_str,
        library=library,
        connector=connector,
        typifier=typifier,
        placer=placer,
    )
    polymer = build_result.polymer
    polymers.append(polymer)

print(f"✅ Built {len(polymers)} polymers")
for i, polymer in enumerate(polymers[:5]):
    print(f"   Polymer {i+1}: {len(polymer.atoms)} atoms, {len(polymer.bonds)} bonds")

## Step 9: Calculate Actual Molecular Weights and Compare with Distribution

Calculate actual molecular weights from built polymers and compare with the theoretical distribution.
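One reason the built polymers can come out lighter than the Step 7 estimate is the dehydration reaction itself: the estimate sums whole monomer masses, while each linkage formed by the Reacter removes an OH group and an H atom. The sketch below quantifies that offset, assuming exactly one water molecule is lost per bond and that nothing else is added or removed during building. The helper `condensed_chain_mass` is hypothetical, and the two input masses are approximate values for the free diols, used only as illustrative inputs.

```python
WATER_MASS = 2 * 1.008 + 15.999  # g/mol, mass of the H2O lost per linkage


def condensed_chain_mass(monomer_masses: list[float]) -> float:
    """Mass of a linear chain after losing one water per bond formed."""
    n_bonds = max(len(monomer_masses) - 1, 0)
    return sum(monomer_masses) - n_bonds * WATER_MASS


# A three-unit chain A-B-A built from two diols (approximate masses in g/mol).
mass_a, mass_b = 194.23, 152.19
naive = mass_a + mass_b + mass_a
condensed = condensed_chain_mass([mass_a, mass_b, mass_a])
print(f"naive={naive:.2f} condensed={condensed:.2f} loss={naive - condensed:.2f}")
```

For a chain of n units the offset is (n - 1) × 18.015 g/mol, so it grows with chain length and shifts the whole histogram slightly to the left of the estimate.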

In [None]:
# Calculate molecular weights
molecular_weights = []
for polymer in polymers:
    mw = calculate_polymer_molecular_weight(polymer)
    molecular_weights.append(mw)

print(f"✅ Calculated molecular weights for {len(molecular_weights)} polymers")

# Calculate statistics
mw_array = np.array(molecular_weights)
Mn_actual = np.mean(mw_array)
Mw_actual = np.sum(mw_array**2) / np.sum(mw_array)
PDI_actual = Mw_actual / Mn_actual

print(f"   Mn (number-average): {Mn_actual:.1f} g/mol")
print(f"   Mw (weight-average): {Mw_actual:.1f} g/mol")
print(f"   PDI: {PDI_actual:.3f}")

# Plot distribution
M_range = np.linspace(max(0, np.min(mw_array) * 0.3), np.max(mw_array) * 1.3, 3000)

fig, ax = plt.subplots(figsize=(2.0, 2.0))
fmt = ScalarFormatter(useMathText=True)
fmt.set_scientific(True)
fmt.set_powerlimits((-1, 1))
ax.yaxis.set_major_formatter(fmt)

# Histogram
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
ax.hist(
    molecular_weights,
    bins=80,
    density=True,
    zorder=2,
    edgecolor="black",
    linewidth=0.5,
    color=colors[6],
    label="Actual polymers"
)

# Theoretical curve
if isinstance(dp_dist, SchulzZimmPolydisperse):
    pdf_theoretical = dp_dist.molecular_weight_pdf(M_range)
else:
    pdf_theoretical = dp_dist.molecular_weight_pdf(M_range, avg_monomer_mass=avg_monomer_mw)

ax.plot(M_range, pdf_theoretical, zorder=3, color=colors[0], label="Theoretical")

ax.set_xlabel('Molecular Weight (g/mol)')
ax.set_ylabel('Probability Density')
ax.set_title('')
ax.grid(True, alpha=0.3)
ax.set_axisbelow(True)
ax.legend(loc='upper right')

plt.tight_layout()
plot_file = output_dir / "molecular_weight_distribution.png"
plt.savefig(plot_file, dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none')
plt.show()

print(f"   Distribution plot saved to: {plot_file}")

## Step 10: Save Molecular Weight Data to JSON

Save molecular weight statistics and polymer sequences to a JSON file for post-processing.
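As an example of the kind of post-processing this file supports, the sketch below reads the `polymers` list and computes the overall fraction of each monomer index across all sequences. It uses the key names written in this step; the helper `monomer_fractions` and the tiny stand-in file are hypothetical, so the snippet runs without the notebook's output.

```python
import json
import tempfile
from pathlib import Path

# Tiny stand-in for molecular_weight_distribution.json (illustrative values only).
example = {
    "statistics": {"n_polymers": 3},
    "polymers": [
        {"polymer_id": 1, "molecular_weight": 100.0, "sequence_length": 1, "sequence": [0]},
        {"polymer_id": 2, "molecular_weight": 200.0, "sequence_length": 2, "sequence": [0, 1]},
        {"polymer_id": 3, "molecular_weight": 300.0, "sequence_length": 3, "sequence": [1, 1, 0]},
    ],
}


def monomer_fractions(json_path: Path) -> dict[int, float]:
    """Fraction of each monomer index over all saved sequences."""
    data = json.loads(json_path.read_text())
    counts: dict[int, int] = {}
    for polymer in data["polymers"]:
        for m in polymer["sequence"]:
            counts[m] = counts.get(m, 0) + 1
    total = sum(counts.values())
    return {m: c / total for m, c in sorted(counts.items())}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "molecular_weight_distribution.json"
    path.write_text(json.dumps(example, indent=2))
    fractions = monomer_fractions(path)

print(fractions)
```

Comparing these fractions with the monomer weights from Step 5 is a quick check that the sequence generator sampled the composition as intended.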

In [None]:
# Prepare data for JSON
# For distributions that don't have Mn/Mw/PDI properties, use estimated values
if hasattr(dp_dist, 'Mn'):
    target_Mn = dp_dist.Mn
    target_Mw = dp_dist.Mw
    target_PDI = dp_dist.PDI
else:
    # Use estimated values from generated sequences
    target_Mn = Mn_estimated
    target_Mw = Mw_estimated
    target_PDI = PDI_estimated

data = {
    "distribution_parameters": {
        "type": distribution_ir.name,
        "params": {k: float(v) for k, v in distribution_ir.params.items()},
        "Mn_target": float(target_Mn),
        "Mw_target": float(target_Mw),
        "PDI_target": float(target_PDI),
        "avg_monomer_mw": float(avg_monomer_mw),
    },
    "statistics": {
        "n_polymers": len(molecular_weights),
        "Mn_actual": float(Mn_actual),
        "Mw_actual": float(Mw_actual),
        "PDI_actual": float(PDI_actual),
        "min_mw": float(np.min(mw_array)),
        "max_mw": float(np.max(mw_array)),
        "std_mw": float(np.std(mw_array)),
        "median_mw": float(np.median(mw_array)),
    },
    "polymers": [
        {
            "polymer_id": i + 1,
            "molecular_weight": float(mw),
            "sequence_length": len(seq),
            "sequence": seq,
        }
        for i, (mw, seq) in enumerate(zip(molecular_weights, sequences))
    ],
}

json_file = output_dir / "molecular_weight_distribution.json"
with open(json_file, 'w') as f:
    json.dump(data, f, indent=2)

print(f"✅ Molecular weight data saved to: {json_file}")

## Step 11: Pack Polymers into Simulation Box

Pack all polymers into a simulation box using Packmol, with the box size calculated from the total system mass and a target density.
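The box-size calculation can be sketched with plain arithmetic: the volume is the total molar mass divided by Avogadro's number and the density, and a cubic box has an edge equal to the cube root of that volume. The helper `cubic_box_length_nm` is hypothetical, and the density of 1.0 g/cm³ is an assumed placeholder rather than a value taken from this notebook.

```python
AVOGADRO = 6.02214076e23  # 1/mol


def cubic_box_length_nm(total_mass_g_per_mol: float, density_g_per_cm3: float) -> float:
    """Edge length (nm) of a cubic box holding the given mass at the given density."""
    volume_cm3 = total_mass_g_per_mol / (AVOGADRO * density_g_per_cm3)
    volume_nm3 = volume_cm3 * 1e21  # 1 cm^3 = 1e21 nm^3
    return volume_nm3 ** (1.0 / 3.0)


# Target system mass from the GBigSMILES string (1e7 g/mol), assumed density 1.0 g/cm^3.
length_nm = cubic_box_length_nm(1e7, 1.0)
print(f"Box edge: {length_nm:.2f} nm ({length_nm * 10:.1f} Å)")
```

With these inputs the edge comes out at roughly 25.5 nm. Packing is often done into a somewhat larger box at reduced density and then compressed during equilibration, so the density passed in here is a design choice rather than a fixed property of the system.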

## Step 12: Export Packed System to LAMMPS

Export the packed system to LAMMPS data file format for molecular dynamics simulations.

In [None]:
# Export system
system_data_file = output_dir / "system.data"
writer = LammpsDataWriter(system_data_file, atom_style="full")
writer.write(packed_frame)

print(f"✅ Exported packed system to: {system_data_file}")
print(f"   Total atoms: {packed_frame['atoms'].nrows}")
print(f"   Box size: {box_size:.2f} nm")

## Summary

This notebook demonstrated the complete workflow for building a polydisperse polymer system:

1. ✅ Loaded OPLS-AA force field
2. ✅ Parsed GBigSMILES and extracted monomers
3. ✅ Extracted distribution parameters
4. ✅ Generated polymer sequences using three-layer architecture
5. ✅ Built polymer structures from sequences
6. ✅ Calculated molecular weights and generated distribution plot
7. ✅ Saved molecular weight data to JSON
8. ✅ Packed polymers into simulation box
9. ✅ Exported to LAMMPS format

The generated files can be used for:
- Visualization of molecular weight distribution
- Post-processing analysis (JSON data)
- LAMMPS molecular dynamics simulations (system.data)