# Model v1 — Step 1: Build owned-ecosystem state `S_k` and inspect `S_k → S_{k+1}`

## What we’re doing (lock these assumptions)
- We work on `order_pg_ecosystem_sets_3y_k5.parquet` (already filtered to 3-year timeline and k≤5).
- Each row = one **order** for a customer, indexed by `purchase_k ∈ {1..5}`.
- Each order contains a **set of ecosystems purchased in that order** (a Python `set` or list-like).
- Define the **owned ecosystem state** after purchase `k`:
  \[
  S_k = \bigcup_{i=1}^{k} \text{ecosystems bought in order } i
  \]
- Structural + transparent: we will **not** model revenue, ML, or optimisation here. Just state construction + empirical transition inspection.

---

## 1) Load data + identify the ecosystem-set column (minimal + robust)


In [1]:
import pandas as pd
import numpy as np

path = "../data/interim/order_pg_ecosystem_sets_3y_k5.parquet"
df = pd.read_parquet(path)

df.columns


Index(['anon', 'purchase_k', 'product_groups', 'ecosystems', 'date',
       'n_ecosystems', 'n_product_groups', 'n_purchases_in_horizon'],
      dtype='object')

Now **set these explicitly** (don’t proceed until they match your file):
- `ID_COL` = customer id column (usually `anon`)
- `K_COL`  = purchase index column (usually `purchase_k`)
- `ECOS_COL` = ecosystems-in-this-order set column (from the candidates above)


In [2]:
ID_COL = "anon"
K_COL = "purchase_k"
ECOS_COL = "ecosystems"  # <-- change if needed

# quick sanity check
df[[ID_COL, K_COL, ECOS_COL]].head(10)


Unnamed: 0,anon,purchase_k,ecosystems
0,ANON_0000011,1,[pitcher]
1,ANON_0000011,2,"[other, pitcher]"
2,ANON_0000012,1,[pitcher]
3,ANON_0000012,2,"[bottle, pitcher]"
4,ANON_0000019,1,[bottle]
5,ANON_0000019,2,[bottle]
6,ANON_0000019,3,[bottle]
7,ANON_0000019,4,[bottle]
8,ANON_0000027,1,[bottle]
9,ANON_0000027,2,[pitcher]


## 2) Clean + normalize the per-order ecosystem sets

We enforce:
- missing -> empty set
- list/tuple -> set
- items must be hashable (strings recommended); `set(x)` raises `TypeError` on unhashable items rather than dropping them


In [3]:
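def to_set(x):
    """Normalize one cell of the ecosystems column to a Python set."""
    # missing value (None or float NaN) -> empty set
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return set()
    if isinstance(x, set):
        return x
    # list-like (parquet list columns typically load as np.ndarray) -> set
    if isinstance(x, (list, tuple, np.ndarray)):
        return set(x)
    # scalar (e.g. a single string) -> singleton set
    return {x}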


tmp = df[[ID_COL, K_COL, ECOS_COL]].copy()
tmp[ECOS_COL] = tmp[ECOS_COL].apply(to_set)

# ensure sorted by customer, then purchase_k
tmp = tmp.sort_values([ID_COL, K_COL]).reset_index(drop=True)

tmp.head(10)


Unnamed: 0,anon,purchase_k,ecosystems
0,ANON_0000011,1,{pitcher}
1,ANON_0000011,2,"{pitcher, other}"
2,ANON_0000012,1,{pitcher}
3,ANON_0000012,2,"{pitcher, bottle}"
4,ANON_0000019,1,{bottle}
5,ANON_0000019,2,{bottle}
6,ANON_0000019,3,{bottle}
7,ANON_0000019,4,{bottle}
8,ANON_0000027,1,{bottle}
9,ANON_0000027,2,{pitcher}
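
The owned state \(S_k\) defined at the top is a running union of these per-order sets within each customer. A minimal sketch of that construction on toy data follows. The names `toy` and `cumulative_union_keys` are hypothetical, and the sorted, `", "`-joined string format is an assumption chosen to resemble the `S_k` strings that appear later in this notebook; it is not necessarily how the upstream table was produced.

```python
import pandas as pd

def cumulative_union_keys(sets):
    """Running union over an iterable of sets, returned as sorted ', '-joined strings."""
    owned = set()
    keys = []
    for s in sets:
        owned |= set(s)                       # S_k = S_{k-1} ∪ (ecosystems in order k)
        keys.append(", ".join(sorted(owned)))
    return keys

# Toy data (hypothetical), mirroring the (anon, purchase_k, ecosystems) layout above.
toy = pd.DataFrame({
    "anon": ["A", "A", "A", "B", "B"],
    "purchase_k": [1, 2, 3, 1, 2],
    "ecosystems": [{"pitcher"}, {"other"}, {"pitcher"}, {"bottle"}, {"pitcher"}],
}).sort_values(["anon", "purchase_k"]).reset_index(drop=True)

# Rows are sorted by customer, so iterating groups in sorted order visits rows
# in the same order as the frame; the collected keys therefore align row by row.
keys = []
for _, grp in toy.groupby("anon", sort=True):
    keys.extend(cumulative_union_keys(grp["ecosystems"]))
toy["S_k"] = keys

print(toy[["anon", "purchase_k", "S_k"]])
```

Customer `A` ends at `other, pitcher` even though the third order contains only `pitcher`, which is the memory effect the state is meant to capture.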


In [4]:
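# Switch to the precomputed events table; note that this overwrites `df` from In [1].
df = pd.read_parquet("../data/interim/ecosystem_add_events.parquet")

df = df.sort_values(["anon", "purchase_k"]).reset_index(drop=True)

# prev_ecos_str / curr_ecos_str appear to hold the owned state before and after
# each order as comma-separated strings (see the output of In [6]).
df = df.rename(columns={
    "prev_ecos_str": "S_prev",
    "curr_ecos_str": "S_k"
})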





In [5]:
# Redundant with the sort in In [4], but harmless: keeps the (anon, purchase_k) order explicit.
df = df.sort_values(["anon", "purchase_k"]).reset_index(drop=True)

len(df)

77578

In [6]:
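# Within each customer, look one row ahead to get the next state and its purchase index.
# The last row per customer gets NaN, which also turns k_next into a float column.
df["S_next"] = df.groupby("anon")["S_k"].shift(-1)
df["k_next"] = df.groupby("anon")["purchase_k"].shift(-1)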

df.head(21)

Unnamed: 0,anon,purchase_k,date,ecos_set,added_ecosystem,has_expansion,S_prev,S_k,S_next,k_next
0,ANON_0000009,2,2025-11-19,[bottle],bottle,True,pitcher,"bottle, pitcher",,
1,ANON_0000011,2,2024-02-19,"[other, pitcher]",other,True,pitcher,"other, pitcher",,
2,ANON_0000012,2,2024-09-26,"[bottle, pitcher]",bottle,True,pitcher,"bottle, pitcher",,
3,ANON_0000019,2,2023-06-20,[bottle],,False,bottle,bottle,bottle,3.0
4,ANON_0000019,3,2024-04-29,[bottle],,False,bottle,bottle,bottle,4.0
5,ANON_0000019,4,2025-04-06,[bottle],,False,bottle,bottle,,
6,ANON_0000027,2,2023-04-06,[pitcher],pitcher,True,bottle,"bottle, pitcher","bottle, pitcher",3.0
7,ANON_0000027,3,2023-06-07,[pitcher],,False,"bottle, pitcher","bottle, pitcher","bottle, pitcher",4.0
8,ANON_0000027,4,2024-11-17,[pitcher],,False,"bottle, pitcher","bottle, pitcher",,
9,ANON_0000029,2,2025-02-08,"[bottle, pitcher]",,False,"bottle, pitcher","bottle, pitcher",,


In [7]:
# Keep only valid consecutive steps: a next row exists and it is exactly purchase_k + 1.
trans = df[
    (df["S_next"].notna()) &
    (df["k_next"] == df["purchase_k"] + 1)
].copy()


In [8]:
trans[["anon", "purchase_k", "S_k", "S_next", "added_ecosystem"]].head(10)


Unnamed: 0,anon,purchase_k,S_k,S_next,added_ecosystem
3,ANON_0000019,2,bottle,bottle,
4,ANON_0000019,3,bottle,bottle,
6,ANON_0000027,2,"bottle, pitcher","bottle, pitcher",pitcher
7,ANON_0000027,3,"bottle, pitcher","bottle, pitcher",
14,ANON_0000050,2,bottle,bottle,
15,ANON_0000050,3,bottle,bottle,
18,ANON_0000057,2,bottle,bottle,
19,ANON_0000057,3,bottle,"bottle, other",
21,ANON_0000058,2,pitcher,"bottle, pitcher",
22,ANON_0000058,3,"bottle, pitcher","bottle, pitcher",bottle


In [9]:
# Monotonicity check: parse both state strings and test S_k ⊆ S_next for every transition.
(
    trans.apply(
        lambda r: set(r["S_k"].split(", ")) <= set(r["S_next"].split(", ")),
        axis=1
    )
).value_counts()


True    24203
Name: count, dtype: int64

In [10]:
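# Use the state strings as keys as-is. This assumes upstream always writes the
# ecosystems in one canonical order (the sample rows are alphabetically sorted).
trans["S_k_key"] = trans["S_k"]
trans["S_next_key"] = trans["S_next"]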

trans["S_k_key"].nunique()



159
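
In [10] reuses the state strings as keys unchanged, which works as long as equal sets are always written the same way. If that cannot be guaranteed, a canonicalizing helper is one option. This is a minimal sketch, assuming states are comma-separated strings as in the sample rows; `parse_state` and `state_key` are hypothetical names, not part of the notebook.

```python
def parse_state(s):
    """Parse 'bottle, pitcher' into a frozenset; missing or blank input gives the empty set."""
    if not isinstance(s, str) or not s.strip():
        return frozenset()
    return frozenset(part.strip() for part in s.split(",") if part.strip())

def state_key(s):
    """Canonical key: members sorted alphabetically and joined with ', '."""
    return ", ".join(sorted(parse_state(s)))

# Two spellings of the same set map to one key.
print(state_key("pitcher, bottle"))   # bottle, pitcher
print(state_key("bottle, pitcher"))   # bottle, pitcher
```

This also sidesteps an edge case in the In [9] check: `"".split(", ")` returns `[""]`, so an empty state string would be treated as a one-element set there.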

In [11]:
trans["S_next_key"].nunique()

177

In [12]:
trans.groupby(["S_k_key", "S_next_key"]).size().shape[0]


493

In [13]:
cols = [
    "S_k_key",
    "S_next_key",
    "purchase_k"   # keep for stage-specific transitions later
]

trans[cols].to_parquet(
    "../data/interim/markov_transitions_v1.parquet",
    index=False
)


# Model v1 — Step 1 (Owned-ecosystem memory): State construction + transition inspection

## Goal
Extend the baseline structural simulator (v0) by adding **equipment ownership memory**.
We keep the model **structural, transparent, and Markov**, but augment the state.

This notebook step builds and validates the owned-ecosystem state:
\[
S_k = \{\text{ecosystems owned up to and including purchase } k\}
\]
and inspects empirical transitions \(S_k \to S_{k+1}\).

---

## Data + scope
- Input table is at **order level**: one row per customer order with `purchase_k ∈ {1..5}` on a 3-year timeline.
- Each order has `ecos_set` (named `ecosystems` in `order_pg_ecosystem_sets_3y_k5.parquet`): the ecosystems purchased in that order.
- We assume **ownership is monotone** (customers do not "un-own" equipment).

---

## State definition (Model v1)
### Owned ecosystems (memory state)
\[
S_k = \bigcup_{i=1}^{k} \text{ecos\_set}_i
\]
This is a **cumulative union** within each customer. Because \(S_{k+1} = S_k \cup \text{ecos\_set}_{k+1}\), it is **monotone**:
\[
S_{k+1} \supseteq S_k
\]

### Why this matters
Ownership affects future behavior (lock-in, compatibility, refills, cross-sell), so it must be part of the Markov state.  
Model v1 therefore models transitions over **owned sets**, not just "what was bought last time".

---

## Transition table (Markov edges)
We build a transition dataset with one row per **valid consecutive step**:
\[
S_k \to S_{k+1}
\]
Implementation detail:
- We shift within customer to create `S_next` and `k_next`.
- We keep only rows where `k_next = purchase_k + 1`.
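- This **drops the last observed order per customer**, because it has no `S_{k+1}`. This is expected: a customer with \(K\) rows in the table yields at most \(K-1\) transitions. In the sample output, the rows of `ecosystem_add_events.parquet` start at `purchase_k = 2`, so the step from order 1 to order 2 is recorded as `S_prev → S_k` on the `purchase_k = 2` row and does not appear among the `S_k → S_next` rows of `trans`.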
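
The shift-and-filter step can be illustrated on a small toy table. This is a sketch only: `toy` and `toy_trans` are hypothetical, and the toy data deliberately includes a customer with a missing purchase index to show why the `k_next = purchase_k + 1` filter is needed in addition to dropping the last row.

```python
import pandas as pd

toy = pd.DataFrame({
    "anon":       ["A", "A", "A", "B", "B", "C"],
    "purchase_k": [1, 2, 3, 1, 3, 1],          # customer B has no k=2 row
    "S_k":        ["bottle", "bottle", "bottle, pitcher",
                   "pitcher", "other, pitcher", "bottle"],
}).sort_values(["anon", "purchase_k"]).reset_index(drop=True)

# Look one row ahead within each customer.
grp = toy.groupby("anon")
toy["S_next"] = grp["S_k"].shift(-1)
toy["k_next"] = grp["purchase_k"].shift(-1)

# Last rows have k_next = NaN, and NaN == x is False, so they drop out here too.
toy_trans = toy[toy["k_next"] == toy["purchase_k"] + 1]

print(toy_trans[["anon", "purchase_k", "S_k", "S_next"]])
```

Customer `A` has three consecutive rows and contributes two transitions; `B` contributes none because its only candidate step skips from 1 to 3; `C` has a single row and therefore no transition.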

---

## Validation checks (passed)
- Monotonicity: all 24,203 transitions satisfy \(S_k \subseteq S_{k+1}\).
- State-space size is manageable for a structural Markov simulator:
  - `|unique S_k| = 159`
  - `|unique S_next| = 177`
  - `|unique transitions| = 493`

---

## Outputs / artifacts
- Transition-level table `trans` with columns:
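  - `anon`, `purchase_k`, `S_k`, `S_next`, `S_k_key`, `S_next_key`, (and optional order-level annotations)
- Saved file `../data/interim/markov_transitions_v1.parquet`, containing only `S_k_key`, `S_next_key`, and `purchase_k`.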
- This completes **Model v1 Step 1**: state construction + empirical transition inspection.

---

## Next notebook (v2)
Use the transition table to estimate:
\[
P(S_{k+1} \mid S_k)
\]
(optionally stage-specific by `k`), then run the structural simulation and validate vs empirical paths.
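
One straightforward estimator for this is the row-normalized transition count, \(\hat{P}(s' \mid s) = n(s, s') / \sum_{t} n(s, t)\), where \(n(s, s')\) is the number of observed transitions from \(s\) to \(s'\). The sketch below applies it to a hypothetical toy table with the same key columns as the saved file; it is an illustration of the idea, not necessarily the estimator the next notebook uses.

```python
import pandas as pd

# Toy transitions (hypothetical), using the S_k_key / S_next_key column names.
toy_trans = pd.DataFrame({
    "S_k_key":    ["bottle", "bottle", "bottle", "bottle", "pitcher", "pitcher"],
    "S_next_key": ["bottle", "bottle", "bottle", "bottle, pitcher",
                   "pitcher", "bottle, pitcher"],
})

# Count matrix n(s, s'): rows are current states, columns are next states.
counts = pd.crosstab(toy_trans["S_k_key"], toy_trans["S_next_key"])

# Divide each row by its total so that every row sums to 1.
P_hat = counts.div(counts.sum(axis=1), axis=0)

print(P_hat)
```

For a stage-specific version, the same computation could be repeated within each `purchase_k` group, at the cost of fewer observations per cell.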
