
# Experiment Documentation & Story Templates

This notebook provides **ready-to-use templates** for experimentation docs.

The goal is to make it easy to turn analysis code into a **decision memo** by generating
structured markdown from simple Python objects.

Included templates:

1. **Experiment design pre-doc**
   - Hypotheses, metrics, MDE/power, design, analysis plan, decision rule.

2. **Auto-generated result summary**
   - Primary metric effect, uncertainty, decision outcome.  
   - Guardrail changes and risk commentary.

3. **Risk assessment & rollout plan**
   - Risk level, blast radius, rollout strategy, monitoring plan.

You can copy these helpers into any A/B notebook (or import them from a small module) and
generate markdown that can be pasted into internal docs, tickets, or emails.


## 0) Setup

In [None]:

from __future__ import annotations

from dataclasses import dataclass, asdict
from typing import List, Dict, Optional

import textwrap
import pandas as pd



## 1) Experiment design pre-doc template

Define a structured configuration object and render it into markdown.

Typical sections:

- Context & objective  
- Hypotheses  
- Metrics (primary and guardrails)  
- Population & exposure  
- Sample size, power & MDE  
- Design & randomization  
- Analysis plan  
- Decision rule

You can fill the dataclasses once and re-use the same structure across experiments.


In [None]:

@dataclass
class MetricSpec:
    """Specification of a single experiment metric."""
    name: str
    description: str
    direction: str  # 'higher_is_better' or 'lower_is_better'
    type: str       # 'binary', 'rate', 'continuous', etc.
    target_effect: Optional[float] = None  # e.g. 0.01 for +1pp
    notes: Optional[str] = None


@dataclass
class ExperimentDesign:
    """Structured description of an experiment for pre-doc generation."""
    experiment_name: str
    owner: str
    context: str
    objective: str
    primary_hypothesis: str
    secondary_hypotheses: Optional[List[str]]
    primary_metric: MetricSpec
    guardrail_metrics: List[MetricSpec]
    population: str
    traffic_allocation: str
    planned_duration: str
    max_exposure: str
    alpha: float
    power: float
    mde_description: str
    randomization_unit: str
    segmentation_plan: Optional[str]
    analysis_plan: str
    decision_rule: str
    risks_and_assumptions: Optional[str]
    notes: Optional[str] = None


def _render_metric_block(metric: MetricSpec) -> str:
    """Render a metric specification as markdown bullet points."""
    direction_map = {
        "higher_is_better": "Higher is better",
        "lower_is_better": "Lower is better",
    }
    direction_txt = direction_map.get(metric.direction, metric.direction)
    lines = [
        f"- **Name:** {metric.name}",
        f"  - **Description:** {metric.description}",
        f"  - **Type:** {metric.type}",
        f"  - **Direction:** {direction_txt}",
    ]
    if metric.target_effect is not None:
        lines.append(f"  - **Target effect (MDE / uplift):** {metric.target_effect:.4f}")
    if metric.notes:
        lines.append(f"  - **Notes:** {metric.notes}")
    return "\n".join(lines)


def render_experiment_design_md(design: ExperimentDesign) -> str:
    """Render an ExperimentDesign into a markdown doc string.

    Parameters
    ----------
    design : ExperimentDesign
        Structured experiment description.

    Returns
    -------
    str
        Markdown text describing the experiment design.
    """
    primary_metric_md = _render_metric_block(design.primary_metric)
    guardrails_md = "\n".join(
        _render_metric_block(m) for m in design.guardrail_metrics
    )

    secondary_md = ""
    if design.secondary_hypotheses:
        sec_lines = "\n".join(f"- {h}" for h in design.secondary_hypotheses)
        secondary_md = f"""
### Secondary hypotheses

{sec_lines}
"""

    segmentation_md = ""
    if design.segmentation_plan:
        segmentation_md = f"""
### Segmentation plan

{design.segmentation_plan}
"""

    risks_md = ""
    if design.risks_and_assumptions:
        risks_md = f"""
### Risks and assumptions

{design.risks_and_assumptions}
"""

    notes_md = ""
    if design.notes:
        notes_md = f"""
### Additional notes

{design.notes}
"""

    md = f"""
# Experiment design: {design.experiment_name}

**Owner:** {design.owner}

## 1. Context and objective

**Context.**  
{design.context}

**Objective.**  
{design.objective}

## 2. Hypotheses

### Primary hypothesis

{design.primary_hypothesis}
{secondary_md}

## 3. Metrics

### Primary metric

{primary_metric_md}

### Guardrail metrics

{guardrails_md}

## 4. Population and exposure

- **Population:** {design.population}
- **Traffic allocation:** {design.traffic_allocation}
- **Planned duration:** {design.planned_duration}
- **Max exposure / safety cap:** {design.max_exposure}
- **Randomization unit:** {design.randomization_unit}

## 5. Power, alpha, and MDE

- **Significance level (alpha):** {design.alpha:.3f}
- **Power:** {design.power:.3f}
- **MDE / detectable effect:** {design.mde_description}

## 6. Analysis plan

{design.analysis_plan}
{segmentation_md}

## 7. Decision rule

{design.decision_rule}
{risks_md}
{notes_md}
"""

    # Remove any common indentation and strip leading/trailing whitespace
    return textwrap.dedent(md).strip()
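
Note that `_render_metric_block` falls back to the raw `direction` string when it is not one of the two expected values, so a typo such as `"higher_better"` would pass silently into the rendered doc. A minimal sketch of failing fast instead, using a dataclass `__post_init__` hook. The class name `CheckedMetricSpec` and the constant `ALLOWED_DIRECTIONS` are hypothetical and not part of the templates above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical constant: the two direction labels the renderer knows how to map.
ALLOWED_DIRECTIONS = {"higher_is_better", "lower_is_better"}


@dataclass
class CheckedMetricSpec:
    """Hypothetical variant of MetricSpec that validates `direction` on creation."""
    name: str
    description: str
    direction: str
    type: str
    target_effect: Optional[float] = None
    notes: Optional[str] = None

    def __post_init__(self) -> None:
        # Runs right after the generated __init__; reject unknown labels early.
        if self.direction not in ALLOWED_DIRECTIONS:
            raise ValueError(
                f"direction must be one of {sorted(ALLOWED_DIRECTIONS)}, "
                f"got {self.direction!r}"
            )
```

Whether to validate strictly or keep the permissive fallback is a design choice; the fallback is convenient for ad hoc labels, while validation catches typos before a doc is shared.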



### Example usage

Fill an `ExperimentDesign`, render markdown, and display it as a rich cell.


In [None]:

from IPython.display import Markdown, display

primary = MetricSpec(
    name="revenue_per_user",
    description="Average revenue per user over 14 days after exposure.",
    direction="higher_is_better",
    type="continuous",
    target_effect=0.05,
)

guardrails = [
    MetricSpec(
        name="refund_rate",
        description="Share of orders that result in a refund.",
        direction="lower_is_better",
        type="binary",
        target_effect=0.01,
        notes="Degradation above +1pp is considered unacceptable.",
    ),
    MetricSpec(
        name="support_ticket_rate",
        description="Share of users who contact support within 14 days.",
        direction="lower_is_better",
        type="binary",
        target_effect=0.005,
    ),
]

design_example = ExperimentDesign(
    experiment_name="Discount banner on checkout",
    owner="your.name@company.com",
    context="We are testing a new discount banner on the checkout page for logged-in users.",
    objective="Increase revenue per user without creating excessive refunds or support load.",
    primary_hypothesis="Showing the banner increases revenue per user by at least 5%.",
    secondary_hypotheses=[
        "The effect is stronger for high pre-activity users.",
        "Retention at D+30 is not negatively impacted.",
    ],
    primary_metric=primary,
    guardrail_metrics=guardrails,
    population="Logged-in web users in markets A, B, C.",
    traffic_allocation="50% control, 50% treatment",
    planned_duration="14 days",
    max_exposure="Max 50% of eligible traffic until decision.",
    alpha=0.05,
    power=0.80,
    mde_description="Detect a +5% relative uplift in revenue per user at 80% power.",
    randomization_unit="user_id",
    segmentation_plan="Pre-specified splits by device type and pre_activity tertiles.",
    analysis_plan=(
        "Primary analysis: two-sample t-test / regression-adjusted difference in mean RPU, "
        "with CUPED using pre-period revenue.\n"
        "Secondary: Beta-Binomial for conversion, survival analysis for retention."
    ),
    decision_rule=(
        "Ship if RPU lift > 0 and statistically significant, with refund and support rates "
        "not degraded beyond the pre-specified thresholds."
    ),
    risks_and_assumptions=(
        "Assumes no major concurrent pricing or UX changes on checkout. "
        "Assumes traffic composition is stable over the experiment window."
    ),
    notes="This doc should be agreed before starting the experiment.",
)

md_text = render_experiment_design_md(design_example)
display(Markdown(md_text))
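
The `alpha`, `power`, and MDE fields in the design are recorded as given; the template does not compute a sample size. As a rough illustration of how the three relate, here is a minimal sketch of the textbook normal-approximation formula for a two-sided, two-sample comparison of means with equal allocation. The function name `required_n_per_arm` and the example numbers are hypothetical, and this is only one possible approach (it ignores variance reduction such as CUPED).

```python
import math
from statistics import NormalDist


def required_n_per_arm(
    sigma: float,
    mde_abs: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate per-arm sample size for a two-sided two-sample z-test.

    Uses n = 2 * sigma^2 * (z_{1 - alpha/2} + z_{power})^2 / mde_abs^2,
    assuming equal variance and equal allocation across arms.
    """
    if sigma <= 0 or mde_abs <= 0:
        raise ValueError("sigma and mde_abs must be positive.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile matching the target power
    n = 2 * sigma**2 * (z_alpha + z_power) ** 2 / mde_abs**2
    return math.ceil(n)


# Illustrative inputs only: outcome standard deviation of 10, absolute MDE of 0.5.
n_per_arm = required_n_per_arm(sigma=10.0, mde_abs=0.5, alpha=0.05, power=0.80)
```

For a relative MDE such as +5%, convert it to an absolute difference first by multiplying by the baseline mean of the metric.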



## 2) Auto-generated result summary template

Once the experiment is analyzed, you can pass key numbers into a summary function
that produces a short, readable markdown block.

Typical content:

- Primary metric effect (estimate + interval / p-value or posterior).  
- Direction and significance.  
- Guardrails: which moved, which stayed clean.  
- Final recommendation: ship / hold / do not ship, with one-line rationale.

This is intentionally compact so it can be pasted into a ticket or email.
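
The summary function only formats numbers; where the estimate and interval come from is left to the analysis. As one illustration, here is a minimal sketch that computes a difference in means with a normal-approximation confidence interval and two-sided p-value, returning a dict whose keys mirror the `estimate`, `ci_low`, `ci_high`, and `p_value` fields of the `PrimaryResult` dataclass in the next cell. The function name `diff_in_means_summary` is hypothetical, and the normal approximation is an assumption that is reasonable only for large samples.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Dict, Sequence


def diff_in_means_summary(
    control: Sequence[float],
    treatment: Sequence[float],
    alpha: float = 0.05,
) -> Dict[str, float]:
    """Treatment-minus-control difference in means with a z-based CI and p-value."""
    n_c, n_t = len(control), len(treatment)
    if n_c < 2 or n_t < 2:
        raise ValueError("Need at least two observations per arm.")

    estimate = mean(treatment) - mean(control)
    # Unpooled (Welch-style) standard error from the sample variances.
    se = math.sqrt(variance(treatment) / n_t + variance(control) / n_c)
    if se == 0:
        raise ValueError("Standard error is zero; data have no variation.")

    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_stat = estimate / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return {
        "estimate": estimate,
        "ci_low": estimate - z_crit * se,
        "ci_high": estimate + z_crit * se,
        "p_value": p_value,
    }


# Toy data for illustration only.
toy = diff_in_means_summary(control=[1, 2, 3, 4, 5], treatment=[2, 3, 4, 5, 6])
```

The resulting dict can be unpacked into the result dataclass together with a metric name and unit.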


In [None]:

@dataclass
class PrimaryResult:
    metric_name: str
    estimate: float
    ci_low: float
    ci_high: float
    unit: str            # e.g. "pp", "%", "absolute"
    p_value: Optional[float] = None
    posterior_prob_positive: Optional[float] = None


@dataclass
class GuardrailResult:
    metric_name: str
    estimate: float
    ci_low: float
    ci_high: float
    unit: str
    direction: str       # 'higher_is_better' or 'lower_is_better'
    acceptable_degradation: Optional[float] = None
    note: Optional[str] = None


@dataclass
class DecisionSummary:
    decision: str        # 'ship', 'hold', 'do_not_ship'
    rationale: str
    next_steps: Optional[str] = None


def render_result_summary_md(
    primary: PrimaryResult,
    guardrails: List[GuardrailResult],
    decision: DecisionSummary,
) -> str:
    """Render a concise result summary as markdown.

    Parameters
    ----------
    primary : PrimaryResult
        Effect for the primary metric.
    guardrails : list of GuardrailResult
        Effects for guardrail metrics.
    decision : DecisionSummary
        Final decision and rationale.

    Returns
    -------
    str
        Markdown text summarizing the experiment outcome.
    """
    # Primary effect text
    pv_text = ""
    if primary.p_value is not None:
        pv_text = f", p = {primary.p_value:.3f}"
    post_text = ""
    if primary.posterior_prob_positive is not None:
        post_text = f", P(effect > 0) = {primary.posterior_prob_positive:.3f}"

    primary_line = (
        f"Estimated effect on **{primary.metric_name}** is "
        f"{primary.estimate:+.3f} {primary.unit} "
        f"(95% CI [{primary.ci_low:+.3f}, {primary.ci_high:+.3f}]"
        f"{pv_text}{post_text})."
    )

    # Guardrails block
    guard_lines: List[str] = []
    for g in guardrails:
        dir_map = {
            "higher_is_better": "higher is better",
            "lower_is_better": "lower is better",
        }
        dir_txt = dir_map.get(g.direction, g.direction)
        line = (
            f"- **{g.metric_name}**: {g.estimate:+.3f} {g.unit} "
            f"(95% CI [{g.ci_low:+.3f}, {g.ci_high:+.3f}]; {dir_txt})."
        )
        if g.acceptable_degradation is not None:
            line += f" Threshold: {g.acceptable_degradation:+.3f} {g.unit}."
        if g.note:
            line += f" {g.note}"
        guard_lines.append(line)

    guardrails_block = "\n".join(guard_lines) if guard_lines else "_No guardrails tracked._"

    decision_title_map = {
        "ship": "Ship",
        "hold": "Hold / rerun",
        "do_not_ship": "Do not ship",
    }
    decision_title = decision_title_map.get(decision.decision, decision.decision)

    next_steps_md = ""
    if decision.next_steps:
        next_steps_md = f"""
### Next steps

{decision.next_steps}
"""

    md = f"""
## Experiment result summary

### Primary effect

{primary_line}

### Guardrail metrics

{guardrails_block}

### Decision

**Decision:** {decision_title}  

**Rationale:**  
{decision.rationale}
{next_steps_md}
"""

    return textwrap.dedent(md).strip()



### Example usage


In [None]:

primary_res = PrimaryResult(
    metric_name="revenue_per_user",
    estimate=0.052,
    ci_low=0.020,
    ci_high=0.084,
    unit="relative",
    p_value=0.004,
    posterior_prob_positive=0.985,
)

guard_res = [
    GuardrailResult(
        metric_name="refund_rate",
        estimate=0.003,
        ci_low=-0.001,
        ci_high=0.007,
        unit="absolute",  # proportion scale: 0.010 = 1pp
        direction="lower_is_better",
        acceptable_degradation=0.010,
        note="Change is small and well within the +1pp tolerance.",
    ),
    GuardrailResult(
        metric_name="support_ticket_rate",
        estimate=0.000,
        ci_low=-0.002,
        ci_high=0.002,
        unit="absolute",  # proportion scale: 0.005 = 0.5pp
        direction="lower_is_better",
        acceptable_degradation=0.005,
    ),
]

decision_res = DecisionSummary(
    decision="ship",
    rationale=(
        "Primary revenue per user shows a clear positive effect, both statistically "
        "and in practical terms. Guardrail metrics (refund and support tickets) "
        "did not degrade beyond pre-specified thresholds."
    ),
    next_steps=(
        "Roll out the treatment to 100% of eligible traffic over 3 days while "
        "monitoring revenue, refunds, and support tickets. Prepare a follow-up "
        "analysis on retention after 30 days."
    ),
)

summary_md = render_result_summary_md(primary_res, guard_res, decision_res)
display(Markdown(summary_md))
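
The example above sets `decision="ship"` by hand. When the decision rule is pre-specified, the label can also be derived mechanically from the intervals. A minimal sketch, assuming one possible rule: ship when the primary CI lies entirely above zero and every guardrail's worst-case CI bound stays within its tolerance; do not ship when the primary CI lies entirely below zero or a guardrail point estimate exceeds its tolerance; otherwise hold. The names `GuardrailInterval` and `suggest_decision` are hypothetical, and the rule itself is an assumption rather than the rule of any particular experiment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GuardrailInterval:
    """Hypothetical minimal stand-in for a guardrail result."""
    estimate: float
    ci_low: float
    ci_high: float
    direction: str                 # 'higher_is_better' or 'lower_is_better'
    acceptable_degradation: float  # non-negative tolerance, same unit as estimate


def _harm(value_if_lower_better: float, value_if_higher_better: float, direction: str) -> float:
    """Express a change as a positive 'degradation' regardless of metric direction."""
    if direction == "lower_is_better":
        return value_if_lower_better
    if direction == "higher_is_better":
        return -value_if_higher_better
    raise ValueError(f"Unknown direction: {direction!r}")


def suggest_decision(
    primary_ci_low: float,
    primary_ci_high: float,
    guardrails: List[GuardrailInterval],
) -> str:
    """Return 'ship', 'hold', or 'do_not_ship' under the assumed rule."""
    # Point estimates already past tolerance, or a clearly negative primary effect.
    point_breach = any(
        _harm(g.estimate, g.estimate, g.direction) > g.acceptable_degradation
        for g in guardrails
    )
    if primary_ci_high < 0 or point_breach:
        return "do_not_ship"

    # Worst-case bound: upper CI for lower-is-better, lower CI for higher-is-better.
    guardrails_ok = all(
        _harm(g.ci_high, g.ci_low, g.direction) <= g.acceptable_degradation
        for g in guardrails
    )
    if primary_ci_low > 0 and guardrails_ok:
        return "ship"
    return "hold"


# Same numbers as the example summary above.
suggested = suggest_decision(
    0.020,
    0.084,
    [
        GuardrailInterval(0.003, -0.001, 0.007, "lower_is_better", 0.010),
        GuardrailInterval(0.000, -0.002, 0.002, "lower_is_better", 0.005),
    ],
)
```

The returned label uses the same keys that `render_result_summary_md` maps to display titles, so it can be passed straight into `DecisionSummary(decision=...)` alongside a written rationale.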



## 3) Risk assessment and rollout plan template

This template focuses on **risk framing** and operational rollout:

- Risk level and blast radius.  
- Failure modes and mitigations.  
- Rollout plan (phases and checkpoints).  
- Monitoring and rollback triggers.

It is independent of the actual statistics and can be re-used across experiments.


In [None]:

@dataclass
class RiskAssessment:
    risk_level: str          # 'low', 'medium', 'high'
    blast_radius: str        # e.g. 'checkout only', 'all logged-in users'
    key_failure_modes: List[str]
    mitigations: List[str]
    monitoring_plan: str
    rollback_triggers: str
    owner_on_call: str
    additional_notes: Optional[str] = None


def render_risk_rollout_md(
    risk: RiskAssessment,
    rollout_plan: str,
) -> str:
    """Render a risk assessment and rollout plan as markdown.

    Parameters
    ----------
    risk : RiskAssessment
        Structured description of risk and monitoring.
    rollout_plan : str
        Description of rollout phases and timeline.

    Returns
    -------
    str
        Markdown text for risk and rollout.
    """
    fm_lines = "\n".join(f"- {m}" for m in risk.key_failure_modes) or "_None specified_"
    mit_lines = "\n".join(f"- {m}" for m in risk.mitigations) or "_None specified_"

    notes_md = ""
    if risk.additional_notes:
        notes_md = f"""
### Additional notes

{risk.additional_notes}
"""

    md = f"""
## Risk assessment and rollout plan

### Risk level and blast radius

- **Risk level:** {risk.risk_level}
- **Blast radius:** {risk.blast_radius}
- **On-call / owner:** {risk.owner_on_call}

### Key failure modes

{fm_lines}

### Mitigations

{mit_lines}

### Rollout plan

{rollout_plan}

### Monitoring and rollback

**Monitoring plan**  
{risk.monitoring_plan}

**Rollback triggers**  
{risk.rollback_triggers}
{notes_md}
"""

    return textwrap.dedent(md).strip()



### Example usage


In [None]:

risk_example = RiskAssessment(
    risk_level="medium",
    blast_radius="Checkout page for logged-in users in markets A, B, C",
    key_failure_modes=[
        "Unexpected spike in refund rate due to confusing discount messaging.",
        "Increased support contacts due to perceived pricing inconsistency.",
    ],
    mitigations=[
        "Limit exposure to 50% of traffic during the first week.",
        "Add clear copy in FAQ explaining the discount logic.",
    ],
    monitoring_plan=(
        "Daily monitoring of revenue per user, refund rate, and support ticket rate "
        "split by control/treatment. Weekly deep dive by data science + product."
    ),
    rollback_triggers=(
        "Immediate rollback if refund rate increases by >2pp vs control for 2 consecutive days, "
        "or if revenue per user drops below baseline by >3%."
    ),
    owner_on_call="your.name@company.com",
    additional_notes="Align rollback criteria with finance and support teams before rollout.",
)

rollout_plan_text = (
    "1. Week 1: 25% rollout with tight monitoring.\n"
    "2. Week 2: Increase to 50% if metrics remain within guardrail thresholds.\n"
    "3. Week 3: Roll out to 100% of traffic, keep separate logging for at least 30 days."
)

risk_md = render_risk_rollout_md(risk_example, rollout_plan_text)
display(Markdown(risk_md))
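
The rollback trigger in the example ("increases by >2pp vs control for 2 consecutive days") is written as free text. A minimal sketch of how such a condition could be checked against a series of daily treatment-minus-control deltas; the function name `has_consecutive_breach` and the sample values are hypothetical.

```python
from typing import Sequence


def has_consecutive_breach(
    daily_deltas: Sequence[float],
    threshold: float,
    min_days: int = 2,
) -> bool:
    """Return True if deltas exceed `threshold` on `min_days` consecutive days."""
    if min_days < 1:
        raise ValueError("min_days must be at least 1.")
    run = 0  # length of the current streak of breaching days
    for delta in daily_deltas:
        run = run + 1 if delta > threshold else 0
        if run >= min_days:
            return True
    return False


# Illustrative daily refund-rate deltas in percentage points (treatment - control).
refund_delta_pp = [0.5, 2.4, 1.1, 2.1, 2.6]
should_rollback = has_consecutive_breach(refund_delta_pp, threshold=2.0, min_days=2)
```

Here the single breach on day 2 does not trigger on its own; the streak on days 4 and 5 does.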



## 4) How to use these templates in your experiment notebooks

Typical workflow:

1. **Before running the experiment**  
   - Fill an `ExperimentDesign` object.  
   - Render it with `render_experiment_design_md` and paste the markdown into
     your tracking system (doc, ticket, etc.).

2. **After analyzing results**  
   - Build a `PrimaryResult` + list of `GuardrailResult` from your metrics tables.  
   - Decide on `DecisionSummary` using your pre-specified rule.  
   - Render with `render_result_summary_md` for a one-page summary.

3. **For rollout**  
   - Fill a `RiskAssessment` and write a short `rollout_plan` string.  
   - Render with `render_risk_rollout_md` and attach to the same doc.

You can either:

- Copy these cells into each analysis notebook and adjust, or  
- Move them into a small internal module (e.g. `exp_docs.py`) and import the
  functions / dataclasses wherever needed.
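
If the three rendered sections should end up in one file rather than being pasted by hand, they can be joined and written out. A minimal sketch, assuming the sections are plain markdown strings such as those returned by the render functions; the helper name `write_experiment_doc` and the file name are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_experiment_doc(sections: Iterable[str], path: str) -> Path:
    """Join non-empty markdown sections with horizontal rules and write a UTF-8 file."""
    cleaned = [s.strip() for s in sections if s and s.strip()]
    body = "\n\n---\n\n".join(cleaned)
    out = Path(path)
    out.write_text(body + "\n", encoding="utf-8")
    return out


# Usage with placeholder strings standing in for rendered sections.
with tempfile.TemporaryDirectory() as tmp:
    out_path = write_experiment_doc(
        ["# Experiment design: demo", "", "## Experiment result summary"],
        f"{tmp}/experiment_doc.md",
    )
    written = out_path.read_text(encoding="utf-8")
```

Empty sections are skipped, so the same call works before results or the rollout plan exist.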
