# Step-by-step tutorial on how to use different strategies for multi-domain sequence analysis

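Multi-domain sequence analysis studies several sequences per individual at once, one for each life domain (for example, leaving home, marriage, and having children), instead of analyzing each domain in isolation.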

This tutorial guides you through multi-domain sequence analysis: first assessing the association between domains, then the four strategies for multi-domain sequence analysis, including ....

Here, we use the biofam dataset from the MedSeq R package to illustrate how these four strategies differ.

Let's get started!

In [1]:
from sequenzo import *

# Load the datasets; each one corresponds to one domain
left_df = load_dataset('biofam_left_domain')
children_df = load_dataset('biofam_child_domain')
married_df = load_dataset('biofam_married_domain')

# For example, look at the dataset recording whether individuals have left home
left_df

Unnamed: 0,id,age_15,age_16,age_17,age_18,age_19,age_20,age_21,age_22,age_23,age_24,age_25,age_26,age_27,age_28,age_29,age_30
0,1,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1
1,2,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
2,3,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1
3,4,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1
4,5,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1996,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1
1996,1997,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1997,1998,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1998,1999,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Assessing whether and to what extent domains are associated with each other

To explore how different life domains (e.g. marriage, leaving home, having children) are related across time, we use **sequence association analysis**. This helps us understand **if** and **how strongly** two domains tend to move together across a person's life.

### Step 1: Create Sequence Objects

We first create sequence data objects for each domain (e.g. a sequence showing whether someone was married at each age). These objects are then compared **pairwise** to analyze their associations.


In [2]:
# Extract the age (time) columns,
# which are required to build sequence data.
time_cols = [col for col in children_df.columns if col.startswith("age_")]

# Construct a sequence object for each domain
print("\n------ seq_left ------")
seq_left = SequenceData(data=left_df, 
                        time_type="age", 
                        time=time_cols, 
                        states=[0, 1],
                        labels=["At home", "Left home"])

print("\n------ seq_child ------")
seq_child = SequenceData(data=children_df, 
                         time_type="age", 
                         time=time_cols, 
                         states=[0, 1],
                         labels=["No child", "Child"])

print("\n------ seq_married ------")
seq_married = SequenceData(data=married_df,
                           time_type="age",
                           time=time_cols,
                           states=[0, 1],
                           labels=["Not married", "Married"])


------ seq_left ------

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]

------ seq_child ------

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]

------ seq_married ------

[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 2000
[>] Min/Max sequence length: 16 / 16
[>] Alphabet: [0, 1]
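
Since the three sequence objects are compared pairwise, it can help to check that they are aligned and to list the pairs before running the analysis. The sketch below uses a plain dictionary as a hypothetical stand-in for the three objects, filled with the sizes printed in the summaries above. The order in which the library enumerates pairs is an assumption here, not something this tutorial confirms.

```python
from itertools import combinations

# Hypothetical stand-ins: domain name -> (number of sequences, sequence length),
# copied from the summaries printed above. These are not sequenzo objects.
domains = {
    "left": (2000, 16),
    "children": (2000, 16),
    "married": (2000, 16),
}

# Position-wise comparison only makes sense if every domain has the same shape.
shapes = set(domains.values())
if len(shapes) != 1:
    raise ValueError(f"domains are not aligned: {domains}")

# Three domains give three unordered pairs.
pairs = list(combinations(domains, 2))
for first, second in pairs:
    print(f"{first} vs {second}")
# left vs children
# left vs married
# children vs married
```

Keeping the domain names in the same order as the sequence objects makes it easier to read the pairwise results later.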


### Step 2: Measuring Association

We use two complementary statistical measures:

| Measure | Description | What it tells us |
|---------|-------------|------------------|
| **LRT (Likelihood Ratio Test)** | Tests **whether** two domains are statistically associated | Whether there is any significant link at all |
| **Cramér's V (`v`)** | Measures **how strong** the association is, on a scale from 0 to 1 | How strong the link is, if one exists |

Both of these are calculated **based on cross-tabulations** of aligned sequence positions (e.g., marriage status vs. childbearing at each age).
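
To make the idea of a cross-tabulation of aligned positions concrete, here is a minimal sketch on toy data. The sequences and the helper `pooled_cross_table` are hypothetical, and pooling all positions into one table is an assumption about what `rep_method="overall"` does rather than a description of the library's internals.

```python
from collections import Counter

# Toy aligned sequences: three individuals observed at four ages (hypothetical data).
married = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
child = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]

def pooled_cross_table(seqs_a, seqs_b, states=(0, 1)):
    """Count state pairs observed for the same person at the same position."""
    counts = Counter()
    for row_a, row_b in zip(seqs_a, seqs_b):
        if len(row_a) != len(row_b):
            raise ValueError("sequences must have the same length")
        counts.update(zip(row_a, row_b))
    # Rows: state in the first domain; columns: state in the second domain.
    return [[counts[(a, b)] for b in states] for a in states]

table = pooled_cross_table(married, child)
print(table)  # [[7, 0], [2, 3]]
```

Each cell counts person-age observations: for example, 7 observations are "not married, no child" and none are "not married, child".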

### What's the Difference?

- **LRT (`p(LRT)`):**  
  Think of this as a **yes/no test**: *Is there any relationship?*
  - A low p-value (e.g., < 0.05) means "Yes, the association is statistically significant."
  - A high p-value means the data provide no significant evidence of an association.

- **Cramér's V (`v` and `p(v)`):**  
  This tells you **how strong** the relationship is, whether weak or strong; `p(v)` is its p-value, based on a chi-squared test.
  - Values range from 0 (no association) to 1 (perfect association).
  - We also attach a label:
    - `None` (v < 0.1)
    - `Weak` (0.1 ≤ v < 0.3)
    - `Moderate` (0.3 ≤ v < 0.5)
    - `Strong` (v ≥ 0.5)

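> **Note:** Both measures compare states at the same position only. Even when `v = 0`, the domains *might* still be related in ways this cross-tabulation does not capture, such as lagged dependencies where one domain changes a few years after another.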
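
The two measures and the strength labels can be sketched for a single 2×2 cross table using only the standard library. This is an illustration, not sequenzo's implementation: the helpers `association_2x2` and `strength_label` are hypothetical, weights are ignored, the example counts are made up, and the p-value uses the closed form that holds only for one degree of freedom.

```python
import math

def association_2x2(table):
    """Return (G, chi2, V, p_G, p_chi2) for a 2x2 table of counts.

    Assumes every row and column total is non-zero.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    g = 0.0
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            obs = table[i][j]
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            chi2 += (obs - exp) ** 2 / exp     # Pearson chi-squared term
            if obs > 0:
                g += 2.0 * obs * math.log(obs / exp)  # likelihood-ratio (G) term
    v = math.sqrt(chi2 / n)  # Cramér's V; min(rows - 1, cols - 1) = 1 for a 2x2 table

    def p_value(x):
        # Upper tail of the chi-squared distribution with 1 degree of freedom.
        return math.erfc(math.sqrt(x / 2.0))

    return g, chi2, v, p_value(g), p_value(chi2)

def strength_label(v):
    """Map Cramér's V to the qualitative labels used in this tutorial."""
    if v < 0.1:
        return "None"
    if v < 0.3:
        return "Weak"
    if v < 0.5:
        return "Moderate"
    return "Strong"

g, chi2, v, p_g, p_chi2 = association_2x2([[40, 10], [10, 40]])
print(f"chi2={chi2:.1f}, v={v:.2f}, strength={strength_label(v)}")
# chi2=36.0, v=0.60, strength=Strong
```

Note that the test statistics grow with the number of observations while `v` does not, which is why a tiny p-value can accompany a merely moderate `v` in large samples.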

### Output Table

The result is a table that looks like the following. 

Each row shows how two domains relate to each other, how statistically significant that relationship is, and how strong it is.

In [3]:
result = get_association_between_domains(
    [seq_left, seq_child, seq_married],
    assoc=["V", "LRT"],
    rep_method="overall",
    cross_table=True,
    weighted=True,
    # dnames must follow the same order as the sequence objects listed above
    dnames=["left", "children", "married"],
    explain=True,
)


📜 Full results table:


| | df | LRT | p(LRT) | v | p(v) | strength |
|---|---|---|---|---|---|---|
| left vs children | 1.0 | 9144.680641 | 0.000 *** | 0.481817 | 0.000 *** | Moderate |
| left vs married | 1.0 | 9561.568952 | 0.000 *** | 0.531414 | 0.000 *** | Strong |
| children vs married | 1.0 | 12430.120849 | 0.000 *** | 0.626851 | 0.000 *** | Strong |



📘 Column explanations:
  - df       : Degrees of freedom for the test (typically 1 for binary state sequences).
  - LRT      : Likelihood Ratio Test statistic (higher = stronger dependence).
  - p(LRT)   : p-value for LRT + significance stars: * (p<.05), ** (p<.01), *** (p<.001)
  - v        : Cramer's V statistic (0 to 1, measures association strength).
  - p(v)     : p-value for Cramer's V (based on chi-squared test) + significance stars: * (p<.05), ** (p<.01), *** (p<.001)
  - strength : Qualitative label for association strength based on Cramer's V:
               0.00–0.09 → None, 0.10–0.29 → Weak, 0.30–0.49 → Moderate, ≥0.50 → Strong


## The first strategy to conduct multi-domain sequence analysis: IDCD

## Other strategies for conducting multi-domain sequence analysis