# Example Demonstration: Exploring get_distance_matrix, get_substitution_cost_matrix, and clara
*Author: Xinyi Li   Date: March 1, 2025*

**Data Source**: The sample dataset used in this demonstration is sourced from Gapminder and covers 193 countries over the 223 years from 1800 to 2022. We have extracted the data related to CO₂ emissions for distance matrix computation and clustering analysis.

This dataset categorizes CO₂ emission levels into five states: "Very Low" (below 20%), "Low" (20–40%), "Middle" (40–60%), "High" (60–80%), and "Very High" (above 80%).
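The five states can be pictured as quintile bins. The sketch below is only an illustration of that idea: the emission values and country names are made up, and it assumes the bins are equal-sized quantiles computed with `pandas.qcut`, which the source does not confirm.

```python
import pandas as pd

# Hypothetical emission values; the real Gapminder figures are not reproduced here.
emissions = pd.Series(
    [0.1, 0.5, 1.2, 2.0, 3.5, 4.8, 6.1, 7.9, 9.4, 15.0],
    index=[f"country_{i}" for i in range(10)],
)

labels = ["Very Low", "Low", "Middle", "High", "Very High"]

# qcut splits the values into five quantile bins (assumed binning rule).
binned = pd.qcut(emissions, q=5, labels=labels)

print(binned)
```

Each hypothetical country ends up with exactly one of the five labels, which is the shape of data the rest of this tutorial works with.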

In this article, we will use this dataset to demonstrate the functionality and usage of the following functions: *get_distance_matrix*, *get_substitution_cost_matrix*, and *clara*.

### Contents:
- Chapter 1: get_distance_matrix
- Chapter 2: get_substitution_cost_matrix
- Chapter 3: clara

Let’s get started!

In [1]:
# Import necessary libraries
from sequenzo import * # Social sequence analysis
import pandas as pd # Data manipulation
import numpy as np # Numerical arrays

In [3]:
# Load the data that we would like to explore in this tutorial
# `df` is short for `dataframe`, a common variable name for a dataset
df = load_dataset('country_co2_emissions')
# df = pd.read_csv('country_co2_emissions_missing.csv')

# show
df

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,Afghanistan,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low
1,Albania,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle
2,Algeria,Low,Low,Low,Low,Low,Low,Low,Low,Low,...,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle
3,Andorra,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,...,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High
4,Angola,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Low,Low,Low,Low,Low,Low,Low,Low,Low,Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,Venezuela,High,High,High,High,High,High,High,High,High,...,High,High,High,Middle,Middle,Middle,Low,Low,Low,Low
189,Vietnam,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,Middle,...,Low,Low,Low,Low,Low,Middle,Middle,Middle,Middle,Middle
190,Yemen,High,High,High,High,High,High,High,High,High,...,Low,Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low
191,Zambia,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low


In [4]:
# For a time-varying (3-D) substitution-cost matrix,
# label each time slice with the state names to make the printed output easier to read
def output(data, time, states): # `data` consists of two parts: 'indel' and 'sm'
    print("indel: ", data['indel'])
    labels = ["null"] + list(states) # Build a new list so the caller's `states` is not modified
    print("sm:")
    for i in range(data['sm'].shape[0]):
        print(f" , , {time[i]}")
        _df = pd.DataFrame(data['sm'][i, :, :], index=labels, columns=labels)
        print(_df)

In [5]:
# Create SequenceData
# Define the time-span variable
time = list(df.columns)[1:]

states = ['Very Low', 'Low', 'Middle', 'High', 'Very High']

sequence_data = SequenceData(df, time=time, id_col="country", states=states)

sequence_data


[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 193
[>] Number of time points: 223
[>] Min/Max sequence length: 223 / 223
[>] States: ['Very Low', 'Low', 'Middle', 'High', 'Very High']
[>] Labels: ['Very Low', 'Low', 'Middle', 'High', 'Very High']
[>] Weights: Not provided


SequenceData(193 sequences, States: ['Very Low', 'Low', 'Middle', 'High', 'Very High'])
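Before building a `SequenceData` object, it can help to confirm that the time columns and the state list agree with the table. The following is a minimal sketch on a made-up table with the same wide layout (one row per country, one column per year); the `toy` frame and its values are hypothetical, and only standard pandas calls are used.

```python
import pandas as pd

# Toy wide-format table: an id column followed by one column per year (made-up values).
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "1800": ["Very Low", "Low", "High"],
    "1801": ["Very Low", "Middle", "High"],
    "1802": ["Low", "Middle", "Very High"],
})

states = ["Very Low", "Low", "Middle", "High", "Very High"]

# Same slicing as in the tutorial: assumes the id column comes first.
time = list(toy.columns)[1:]

# Collect any cell values that are not part of the declared state list.
observed = set(pd.unique(toy[time].values.ravel()))
unexpected = observed - set(states)

print("time columns:", time)
print("unexpected values:", unexpected)
```

An empty `unexpected` set means every cell maps to a declared state. Note that `list(df.columns)[1:]` only works when the id column is the first column of the table.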

# Chapter 1: get_distance_matrix
Below are the examples from the *get_distance_matrix* official documentation.

**Note**: The last example (DHD + CONSTANT) is provided as a **counterexample**: it combines DHD with a substitution-cost option that DHD does not support, so the function raises a `ValueError` instead of returning a distance matrix.

In [6]:
# Optional reference sequence sets (not used here, because the `refseq` argument below is commented out)
refseq = [[0, 1, 2], [99, 100]]

om = get_distance_matrix(sequence_data,
                         method="OM",
                         # refseq=refseq,
                         sm="TRATE",
                         indel="auto")
om

[>] Processing 193 sequences with 5 unique states.
[>] Transition-based substitution-cost matrix (TRATE) initiated...
  - Computing transition probabilities for: [Very Low, Low, Middle, High, Very High]
[>] generated an indel of type number

[>] Identified 175 unique sequences.
[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.


Unnamed: 0,Afghanistan,Albania,Algeria,Andorra,Angola,Antigua and Barbuda,Argentina,Armenia,Australia,Austria,...,United Kingdom,United States,Uruguay,Uzbekistan,Vanuatu,Venezuela,Vietnam,Yemen,Zambia,Zimbabwe
Afghanistan,0.000000,151.094678,385.142326,446.000000,113.604016,443.712870,445.994660,324.812048,442.950901,445.999644,...,446.000000,446.000000,445.994304,265.525100,308.819210,437.852189,264.911044,322.779050,59.604016,305.921084
Albania,151.094678,0.000000,271.241177,445.992253,80.361601,310.618192,443.334344,242.349051,399.581344,445.815059,...,445.992253,445.992253,442.448375,210.000862,381.210694,429.807482,299.723092,319.760787,133.285272,310.856435
Algeria,385.142326,271.241177,0.000000,443.618181,343.441483,179.526558,355.403649,152.943244,383.899006,437.738122,...,443.618181,443.618181,325.996184,267.665222,379.657760,317.791727,269.808167,198.690426,385.332920,261.998761
Andorra,446.000000,445.992253,443.618181,0.000000,445.996068,442.749561,88.225170,444.451638,59.721371,5.881678,...,0.000000,0.000000,117.633560,445.914643,446.000000,198.292230,445.989940,443.547394,446.000000,445.989477
Angola,113.604016,80.361601,343.441483,445.996068,0.000000,375.141088,443.986566,283.144204,399.585160,445.818875,...,445.996068,445.996068,443.984786,207.033965,320.717098,430.159835,267.200780,295.094678,99.472021,295.094678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,437.852189,429.807482,317.791727,198.292230,430.159835,281.289837,133.613307,350.744276,184.232973,192.647543,...,198.292230,198.292230,134.204917,424.858818,437.845900,0.000000,427.322980,306.106116,437.852664,426.792019
Vietnam,264.911044,299.723092,269.808167,445.989940,267.200780,269.422470,443.332032,264.488181,399.585160,445.812746,...,445.989940,445.989940,442.446062,248.808641,253.062609,427.322980,0.000000,177.424519,294.944043,151.189975
Yemen,322.779050,319.760787,198.690426,443.547394,295.094678,315.563463,355.329509,283.857054,401.534274,437.665716,...,443.547394,443.547394,325.921119,267.636039,264.694983,306.106116,177.424519,0.000000,322.779050,140.045965
Zambia,59.604016,133.285272,385.332920,446.000000,99.472021,443.903464,445.994779,332.680054,442.950901,445.999644,...,446.000000,446.000000,445.994779,271.393106,324.687216,437.852664,294.944043,322.779050,0.000000,321.657095


In [7]:
# 1. OM + TRATE (labelled OMspell, but this call passes method="OM", so the output matches the previous cell)
omspell = get_distance_matrix(sequence_data,
                         method="OM",
                         sm="TRATE",
                         indel="auto")
omspell

[>] Processing 193 sequences with 5 unique states.
[>] Transition-based substitution-cost matrix (TRATE) initiated...
  - Computing transition probabilities for: [Very Low, Low, Middle, High, Very High]
[>] generated an indel of type number

[>] Identified 175 unique sequences.
[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.


Unnamed: 0,Afghanistan,Albania,Algeria,Andorra,Angola,Antigua and Barbuda,Argentina,Armenia,Australia,Austria,...,United Kingdom,United States,Uruguay,Uzbekistan,Vanuatu,Venezuela,Vietnam,Yemen,Zambia,Zimbabwe
Afghanistan,0.000000,151.094678,385.142326,446.000000,113.604016,443.712870,445.994660,324.812048,442.950901,445.999644,...,446.000000,446.000000,445.994304,265.525100,308.819210,437.852189,264.911044,322.779050,59.604016,305.921084
Albania,151.094678,0.000000,271.241177,445.992253,80.361601,310.618192,443.334344,242.349051,399.581344,445.815059,...,445.992253,445.992253,442.448375,210.000862,381.210694,429.807482,299.723092,319.760787,133.285272,310.856435
Algeria,385.142326,271.241177,0.000000,443.618181,343.441483,179.526558,355.403649,152.943244,383.899006,437.738122,...,443.618181,443.618181,325.996184,267.665222,379.657760,317.791727,269.808167,198.690426,385.332920,261.998761
Andorra,446.000000,445.992253,443.618181,0.000000,445.996068,442.749561,88.225170,444.451638,59.721371,5.881678,...,0.000000,0.000000,117.633560,445.914643,446.000000,198.292230,445.989940,443.547394,446.000000,445.989477
Angola,113.604016,80.361601,343.441483,445.996068,0.000000,375.141088,443.986566,283.144204,399.585160,445.818875,...,445.996068,445.996068,443.984786,207.033965,320.717098,430.159835,267.200780,295.094678,99.472021,295.094678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,437.852189,429.807482,317.791727,198.292230,430.159835,281.289837,133.613307,350.744276,184.232973,192.647543,...,198.292230,198.292230,134.204917,424.858818,437.845900,0.000000,427.322980,306.106116,437.852664,426.792019
Vietnam,264.911044,299.723092,269.808167,445.989940,267.200780,269.422470,443.332032,264.488181,399.585160,445.812746,...,445.989940,445.989940,442.446062,248.808641,253.062609,427.322980,0.000000,177.424519,294.944043,151.189975
Yemen,322.779050,319.760787,198.690426,443.547394,295.094678,315.563463,355.329509,283.857054,401.534274,437.665716,...,443.547394,443.547394,325.921119,267.636039,264.694983,306.106116,177.424519,0.000000,322.779050,140.045965
Zambia,59.604016,133.285272,385.332920,446.000000,99.472021,443.903464,445.994779,332.680054,442.950901,445.999644,...,446.000000,446.000000,445.994779,271.393106,324.687216,437.852664,294.944043,322.779050,0.000000,321.657095


In [8]:
# 2. OM + CONSTANT
om = get_distance_matrix(sequence_data,
                         method="OM",
                         sm="CONSTANT",
                         indel="auto")
om

[>] Processing 193 sequences with 5 unique states.
  - Creating 6x6 substitution-cost matrix using 2 as constant value
[>] generated an indel of type number

[>] Identified 175 unique sequences.
[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.


Unnamed: 0,Afghanistan,Albania,Algeria,Andorra,Angola,Antigua and Barbuda,Argentina,Armenia,Australia,Austria,...,United Kingdom,United States,Uruguay,Uzbekistan,Vanuatu,Venezuela,Vietnam,Yemen,Zambia,Zimbabwe
Afghanistan,0.0,152.0,386.0,446.0,114.0,446.0,446.0,326.0,444.0,446.0,...,446.0,446.0,446.0,268.0,314.0,438.0,266.0,324.0,60.0,308.0
Albania,152.0,0.0,272.0,446.0,82.0,312.0,446.0,244.0,400.0,446.0,...,446.0,446.0,446.0,212.0,388.0,432.0,302.0,320.0,134.0,312.0
Algeria,386.0,272.0,0.0,446.0,346.0,180.0,356.0,154.0,386.0,440.0,...,446.0,446.0,326.0,270.0,386.0,320.0,272.0,200.0,386.0,264.0
Andorra,446.0,446.0,446.0,0.0,446.0,446.0,90.0,446.0,60.0,6.0,...,0.0,0.0,120.0,446.0,446.0,202.0,446.0,446.0,446.0,446.0
Angola,114.0,82.0,346.0,446.0,0.0,378.0,446.0,286.0,400.0,446.0,...,446.0,446.0,446.0,210.0,326.0,432.0,268.0,296.0,100.0,296.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,438.0,432.0,320.0,202.0,432.0,282.0,136.0,354.0,188.0,196.0,...,202.0,202.0,136.0,428.0,438.0,0.0,432.0,308.0,438.0,432.0
Vietnam,266.0,302.0,272.0,446.0,268.0,272.0,446.0,266.0,400.0,446.0,...,446.0,446.0,446.0,250.0,258.0,432.0,0.0,178.0,296.0,152.0
Yemen,324.0,320.0,200.0,446.0,296.0,320.0,356.0,284.0,402.0,440.0,...,446.0,446.0,326.0,268.0,268.0,308.0,178.0,0.0,324.0,142.0
Zambia,60.0,134.0,386.0,446.0,100.0,446.0,446.0,334.0,444.0,446.0,...,446.0,446.0,446.0,274.0,330.0,438.0,296.0,324.0,0.0,324.0
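To make the numbers in this table easier to read, here is a minimal pure-Python sketch of optimal matching with a constant substitution cost. It is not sequenzo's implementation; the function name `om_distance` is hypothetical, the substitution cost of 2 comes from the log above, and the indel cost of 1 is an assumption.

```python
def om_distance(a, b, sub_cost=2.0, indel=1.0):
    """Optimal matching distance with a constant substitution cost (illustrative sketch)."""
    n, m = len(a), len(b)
    # dp[i][j] holds the minimal cost of transforming a[:i] into b[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel          # delete the first i elements of a
    for j in range(1, m + 1):
        dp[0][j] = j * indel          # insert the first j elements of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub,   # match or substitute
                dp[i - 1][j] + indel,     # deletion
                dp[i][j - 1] + indel,     # insertion
            )
    return dp[n][m]


always_very_low = ["Very Low"] * 223
always_very_high = ["Very High"] * 223
print(om_distance(always_very_low, always_very_high))
```

Under these assumed costs, two sequences of length 223 that never share a state are 223 × 2 = 446 apart, which is consistent with the Afghanistan and Andorra entry in the table above.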


In [9]:
# 3. HAM + TRATE
ham = get_distance_matrix(sequence_data,
                          method="HAM",
                          sm="TRATE",
                          indel="auto")
ham

[>] Processing 193 sequences with 5 unique states.
[>] Transition-based substitution-cost matrix (TRATE) initiated...
  - Computing transition probabilities for: [Very Low, Low, Middle, High, Very High]
[>] generated an indel of type number

[>] Identified 175 unique sequences.
[>] Starting (Dynamic) Hamming Distance(DHD/HAM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.


Unnamed: 0,Afghanistan,Albania,Algeria,Andorra,Angola,Antigua and Barbuda,Argentina,Armenia,Australia,Austria,...,United Kingdom,United States,Uruguay,Uzbekistan,Vanuatu,Venezuela,Vietnam,Yemen,Zambia,Zimbabwe
Afghanistan,0.000000,160.823291,443.340799,446.000000,175.184454,445.614065,446.000000,404.836349,445.967001,445.999644,...,446.000000,446.000000,445.999644,338.856108,308.819210,445.865988,331.169513,344.082524,94.416064,305.921084
Albania,160.823291,0.000000,309.865882,445.992253,91.789818,405.155906,443.629074,333.429797,445.959254,445.932951,...,445.992253,445.992253,442.743105,257.079214,381.474392,438.147116,385.158853,385.551576,175.511033,384.620215
Algeria,443.340799,309.865882,0.000000,443.618181,374.293717,295.350552,440.960273,306.449390,440.569082,443.440987,...,443.618181,443.618181,440.074303,295.018513,380.513298,370.541699,383.848780,419.332585,442.913504,357.999146
Andorra,446.000000,445.992253,443.618181,0.000000,445.996068,442.749561,88.225170,444.451638,59.721371,5.881678,...,0.000000,0.000000,117.633560,445.914643,446.000000,198.292230,445.989940,443.547394,446.000000,445.989477
Angola,175.184454,91.789818,374.293717,445.996068,0.000000,418.464915,445.813891,378.192931,445.963070,445.818875,...,445.996068,445.996068,445.045813,227.823241,351.690441,437.671657,367.221006,340.412803,158.971958,358.674970
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,445.865988,438.147116,370.541699,198.292230,437.671657,349.826799,143.218900,429.482449,207.700078,204.173908,...,198.292230,198.292230,172.627290,437.171914,445.858156,0.000000,441.218929,394.762246,445.867412,444.433286
Vietnam,331.169513,385.158853,383.848780,445.989940,367.221006,283.530799,445.691650,309.701557,401.531499,445.989584,...,445.989940,445.989940,445.691650,298.477561,278.277893,441.218929,0.000000,271.988379,402.172155,239.274814
Yemen,344.082524,385.551576,419.332585,443.547394,340.412803,325.454952,443.543003,342.763815,429.428529,443.547038,...,443.547394,443.547394,443.542172,365.015601,284.363810,394.762246,271.988379,0.000000,358.185074,229.984803
Zambia,94.416064,175.511033,442.913504,446.000000,158.971958,445.091711,445.998813,418.414883,445.967001,445.999644,...,446.000000,446.000000,445.997033,348.262277,332.423226,445.867412,402.172155,358.185074,0.000000,329.525100
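The Hamming distance compares two equal-length sequences position by position and adds up the substitution cost wherever the states differ. The sketch below illustrates that idea with integer-coded states and a made-up cost matrix; `hamming_distance` is a hypothetical helper, not a sequenzo function.

```python
import numpy as np

def hamming_distance(a, b, sm):
    """Sum of position-wise substitution costs for two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return float(sum(sm[x, y] for x, y in zip(a, b)))


# Made-up symmetric cost matrix for three states (0, 1, 2).
sm = 2.0 * (1.0 - np.eye(3))
sm[0, 1] = sm[1, 0] = 1.5

seq_a = [0, 0, 1, 2]
seq_b = [0, 1, 1, 0]
ham_ab = hamming_distance(seq_a, seq_b, sm)   # costs: 0 + 1.5 + 0 + 2.0
print(ham_ab)
```

Because Hamming only allows substitutions at the same position, it can never be smaller than an optimal matching distance computed with the same substitution costs, which fits the pattern of the HAM values above being greater than or equal to the OM values.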


In [11]:
# 4. DHD + CONSTANT

# sm="CONSTANT" is not relevant for DHD. This cell is run on purpose to show the resulting error message.
# Consider HAM instead.

dhd = get_distance_matrix(sequence_data,
                         method="DHD",
                         sm="CONSTANT",
                         indel="auto")
dhd

[>] Processing 193 sequences with 5 unique states.


ValueError: [!] 'sm = "CONSTANT"' is not relevant for DHD, consider HAM instead.
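One way to read this error: the dynamic Hamming distance is usually described as a Hamming distance whose substitution costs change from one time point to the next, so a single constant cost matrix adds nothing beyond plain HAM. The sketch below assumes a cost array of shape (time, states, states) with made-up values; `dhd_distance` is a hypothetical helper and does not reproduce how sequenzo derives the costs.

```python
import numpy as np

def dhd_distance(a, b, costs):
    """Position-wise distance where costs[t, x, y] is the cost of x vs y at time t."""
    if not (len(a) == len(b) == costs.shape[0]):
        raise ValueError("sequences and cost array must cover the same time points")
    return float(sum(costs[t, x, y] for t, (x, y) in enumerate(zip(a, b))))


# Made-up time-varying costs for two states over three time points.
costs = np.array([
    [[0.0, 4.0], [4.0, 0.0]],   # t = 0
    [[0.0, 3.0], [3.0, 0.0]],   # t = 1
    [[0.0, 1.0], [1.0, 0.0]],   # t = 2
])

dhd_ab = dhd_distance([0, 0, 0], [1, 1, 0], costs)   # 4.0 + 3.0 + 0.0
print(dhd_ab)
```

If every time slice of `costs` were identical, this would reduce to the ordinary Hamming distance, which is why the error message points to HAM.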

# Chapter 2: get_substitution_cost_matrix
Below are the examples from the *get_substitution_cost_matrix* official documentation.

**Note**: Jupyter notebooks (.ipynb) retain variables between cells, so state left over from an earlier cell can affect a later one.

Therefore, to make sure the variables are not affected by the earlier get_distance_matrix cells when testing get_substitution_cost_matrix, please do not run those cells first in the same session. This is a characteristic of .ipynb, not a problem with the code.

The same caution applies to the two get_substitution_cost_matrix code blocks below: run each one in a fresh kernel, or outside a notebook, rather than one after the other in the same .ipynb session.
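The effect of retained state is easiest to see with a list that a helper modifies in place. The sketch below uses two hypothetical helpers to contrast in-place modification with building a new list; it illustrates why passing `states.copy()`, as the tutorial does, is a safe habit in a notebook.

```python
states = ["Very Low", "Low", "Middle", "High", "Very High"]

def add_null_in_place(labels):
    """Modifies the caller's list: every call adds another 'null'."""
    labels.insert(0, "null")
    return labels

def add_null_copy(labels):
    """Returns a new list and leaves the caller's list untouched."""
    return ["null"] + list(labels)


safe = add_null_copy(states)
print(len(states))          # still 5: the original list is unchanged

add_null_in_place(states)   # simulates running a cell once
add_null_in_place(states)   # simulates re-running the same cell
print(states[:3])           # two 'null' entries have accumulated
```

Re-running a cell that mutates shared variables changes the outcome each time, whereas a cell that works on copies gives the same result on every run.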

In [12]:
# 1. TRATE + time_varying(True)
sm = get_substitution_cost_matrix(sequence_data,
                                  method="TRATE",
                                  cval=4,
                                  time_varying=True)

# sm['sm'] is a NumPy ndarray for efficiency,
# but its raw printout is hard to interpret,
# so the `output` helper defined earlier is used to label it
output(sm, time, states.copy())

[>] Transition-based substitution-cost matrix (TRATE) initiated...
  - Computing transition probabilities for: [Very Low, Low, Middle, High, Very High]
indel:  1
sm:
 , , 1800
           null  Very Low       Low    Middle  High  Very High
null        0.0       4.0  4.000000  4.000000   4.0        4.0
Very Low    4.0       0.0  4.000000  4.000000   4.0        4.0
Low         4.0       4.0  0.000000  3.896086   4.0        4.0
Middle      4.0       4.0  3.896086  0.000000   4.0        4.0
High        4.0       4.0  4.000000  4.000000   0.0        4.0
Very High   4.0       4.0  4.000000  4.000000   4.0        0.0
 , , 1801
           null  Very Low       Low    Middle      High  Very High
null        0.0       4.0  4.000000  4.000000  4.000000        4.0
Very Low    4.0       0.0  4.000000  4.000000  4.000000        4.0
Low         4.0       4.0  0.000000  3.948043  4.000000        4.0
Middle      4.0       4.0  3.948043  0.000000  3.948043        4.0
High        4.0       4.0  4.000000  3
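The values just below `cval` in this output suggest that each cost is `cval` minus the observed transition rates between the two states. The sketch below shows the time-constant version of that idea, following the TraMineR-style definition cost(a, b) = cval − p(a→b) − p(b→a). Treat this as an assumption: sequenzo's exact formula, especially for `time_varying=True`, is not shown in this tutorial, and the function names are hypothetical.

```python
import numpy as np

def transition_rates(seqs, n_states):
    """Rate of moving from state i at time t to state j at time t + 1."""
    counts = np.zeros((n_states, n_states))
    for seq in seqs:
        for cur, nxt in zip(seq[:-1], seq[1:]):
            counts[cur, nxt] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # Leave rows at zero for states that never occur before a transition.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

def trate_costs(seqs, n_states, cval=2.0):
    """Assumed TRATE-style cost: cval - p(i -> j) - p(j -> i), zero on the diagonal."""
    p = transition_rates(seqs, n_states)
    costs = cval - p - p.T
    np.fill_diagonal(costs, 0.0)
    return costs


# Made-up integer-coded sequences over two states.
toy_seqs = [[0, 0, 1, 1], [0, 1, 1, 1], [1, 1, 0, 0]]
toy_costs = trate_costs(toy_seqs, n_states=2)
print(toy_costs)
```

In the toy data, p(0→1) = 2/4 and p(1→0) = 1/5, so the cost between the two states is 2 − 0.5 − 0.2 = 1.3. Pairs of states with frequent transitions between them become cheaper to substitute.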

In [13]:
# 2. CONSTANT + time_varying(False)
sm = get_substitution_cost_matrix(sequence_data,
                                  method="CONSTANT",
                                  cval=2,
                                  time_varying=False)
sm

  - Creating 6x6 substitution-cost matrix using 2 as constant value


{'indel': 1,
 'sm':            null  Very Low  Low  Middle  High  Very High
 null        0.0       2.0  2.0     2.0   2.0        2.0
 Very Low    2.0       0.0  2.0     2.0   2.0        2.0
 Low         2.0       2.0  0.0     2.0   2.0        2.0
 Middle      2.0       2.0  2.0     0.0   2.0        2.0
 High        2.0       2.0  2.0     2.0   0.0        2.0
 Very High   2.0       2.0  2.0     2.0   2.0        0.0}
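A constant substitution-cost matrix like the one printed above is straightforward to rebuild by hand. This sketch uses only NumPy and pandas and mirrors the printed layout, with an extra "null" row and column placed before the five states; it is an illustration, not sequenzo's code.

```python
import numpy as np
import pandas as pd

states = ["Very Low", "Low", "Middle", "High", "Very High"]
labels = ["null"] + states      # same layout as the printed matrix
cval = 2.0

# Every off-diagonal entry gets the constant cost; the diagonal stays at zero.
const_sm = np.full((len(labels), len(labels)), cval)
np.fill_diagonal(const_sm, 0.0)

const_sm_df = pd.DataFrame(const_sm, index=labels, columns=labels)
print(const_sm_df)
```

With a constant matrix, every pair of distinct states is treated as equally different, so the resulting distances depend only on how many positions need to change.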

# Chapter 3: clara
Below is the example from the clara official documentation.

In [14]:
# clara
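# TODO: The output contains package upgrade notices even though the current environment is already on the latest version. It would be best to suppress them, since they can mislead users into thinking they are still on an old version.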
result = clara(sequence_data,
               R=10,
               sample_size=3000,
               kvals=range(2,21),
               criteria=['distance', 'pbm'],
               dist_args={"method": "OMspell", "sm": "TRATE", "indel": "auto"},
               stability=True)
result

[>] Starting generalized CLARA for sequence analysis.
[>] Using crisp clustering optimizing the following criterion: distance, pbm.
  - Aggregating 193 sequences...
  - OK (175 unique cases).
[>] Starting iterations...


[notice] A new release of sequenzo is available: 0.1.28 -> 0.1.30
[notice] To update, run: pip install --upgrade sequenzo==0.1.30


  - Done.
[>] Aggregating iterations for each k values...
  - Done.


{'param': {'criteria': ['distance', 'pbm'],
  'pam_combine': False,
  'all_criterias': ['distance', 'pbm'],
  'kvals': range(2, 21),
  'method': 'crisp',
  'stability': True},
 'distance': {'kvals': range(2, 21),
  'clara': {0: {'medoids': array([27,  3]),
    'clustering': array([1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1,
           1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 2,
           2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2,
           2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,
           1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2,
           1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2,
           1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1,
           2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2,
           1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1]),
    'evol_diss': array([151.19451678, 151.1945167

[notice] A new release of sequenzo is available: 0.1.28 -> 0.1.30
[notice] To update, run: pip install --upgrade sequenzo==0.1.30
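The `medoids` and `clustering` entries in this result come from the CLARA idea: cluster a small random subsample, assign every object to its nearest medoid, and keep the best of several repetitions. The sketch below illustrates only that core loop on made-up one-dimensional data. It swaps in an exhaustive medoid search on each subsample instead of PAM, uses 0-based labels, and ignores the other options seen above (`criteria`, `stability`, `kvals`); all function names are hypothetical.

```python
import itertools
import numpy as np

def assign(dist, medoids):
    """Label each object with its nearest medoid and return the mean distance."""
    d = dist[:, medoids]
    return d.argmin(axis=1), d.min(axis=1).mean()

def best_medoids_exhaustive(dist, candidates, k):
    """Try every k-subset of the sample and keep the one with the lowest mean distance."""
    sub = dist[np.ix_(candidates, candidates)]
    best, best_cost = None, np.inf
    for combo in itertools.combinations(range(len(candidates)), k):
        cost = sub[:, list(combo)].min(axis=1).mean()
        if cost < best_cost:
            best, best_cost = combo, cost
    return [int(candidates[i]) for i in best]

def clara_sketch(dist, k, n_iter=5, sample_size=6, seed=0):
    """Repeat: subsample, find medoids on the subsample, score them on all objects."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    best = None
    for _ in range(n_iter):
        sample = rng.choice(n, size=min(sample_size, n), replace=False)
        medoids = best_medoids_exhaustive(dist, list(sample), k)
        labels, cost = assign(dist, medoids)
        if best is None or cost < best[2]:
            best = (medoids, labels, cost)
    return best


# Made-up data: two well-separated groups of ten points each.
points = np.concatenate([np.arange(10), 100 + np.arange(10)]).astype(float)
toy_dist = np.abs(points[:, None] - points[None, :])

toy_medoids, toy_labels, toy_cost = clara_sketch(toy_dist, k=2)
print(toy_medoids, toy_labels)
```

Because each repetition only clusters a subsample, the expensive medoid search never sees the full distance matrix at once, which is the main reason to use CLARA on large sequence datasets.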


In [25]:
print("Thank you for learning sequence analysis with Sequenzo! ")
print("We hope you found this tutorial insightful.")
print("\n💡 Stay Curious, keep coding, and discover new insights.")
print("✉️ If you have any questions, please feel free to reach out :)")

Thank you for learning sequence analysis with Sequenzo! 
We hope you found this tutorial insightful.

💡 Stay Curious, keep coding, and discover new insights.
✉️ If you have any questions, please feel free to reach out :)
