# Example Demonstration: Exploring get_distance_matrix, get_substitution_matrix, and clara
*Author: Xinyi Li   Date: March 1, 2025*

**Data Source**: The sample dataset used in this demonstration comes from Gapminder. As loaded below, it covers 194 countries over the 223 years from 1800 to 2022. We extracted the CO₂ emissions data for distance matrix computation and clustering analysis.

This dataset categorizes CO₂ emission levels into five states: "Very Low" (below 20%), "Low" (20–40%), "Middle" (40–60%), "High" (60–80%), and "Very High" (above 80%).
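To make the five-state coding concrete, here is a minimal sketch that maps a percentile rank (0 to 100) to a state label. The function name `emission_state` and the half-open bin edges are assumptions made for illustration; the exact binning used to build the dataset is not documented in this tutorial.

```python
from bisect import bisect_right

STATES = ["Very Low", "Low", "Middle", "High", "Very High"]
# Assumed bin edges in percent: [0, 20), [20, 40), [40, 60), [60, 80), [80, 100]
EDGES = [20, 40, 60, 80]


def emission_state(percentile):
    """Map a percentile rank in [0, 100] to one of the five state labels."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    # bisect_right counts how many edges are <= percentile, which is the bin index
    return STATES[bisect_right(EDGES, percentile)]


for p in (5, 20, 59.9, 80, 100):
    print(p, "->", emission_state(p))
```

With these assumed edges, a value exactly on a boundary (for example 20) falls into the higher of the two neighbouring states.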

In this article, we will use this dataset to demonstrate the functionality and usage of the following functions: *get_distance_matrix*, *get_substitution_matrix*, *clara*.

### Contents:
- Chapter 1: get_distance_matrix
- Chapter 2: get_substitution_matrix
- Chapter 3: clara

Let’s get started!

In [1]:
# Import necessary libraries
from sequenzo import * # Social sequence analysis
import pandas as pd # Used below to label matrices for display

In [2]:
# Load the data that we would like to explore in this tutorial
# `df` is short for `dataframe`, a common variable name for a dataset
df = load_dataset('country_co2_emissions')

# show
df

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,Afghanistan,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,High,High,High,High,High,High,High,High,High,High
1,Albania,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High
2,Algeria,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High
3,Andorra,High,High,High,High,High,High,High,High,High,...,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High,Very High
4,Angola,Low,Low,Low,Low,Low,Low,Low,Low,Low,...,High,High,High,High,High,High,High,High,High,High
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
189,Venezuela,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,Very High,Very High,Very High,Very High,Very High,High,Middle,High,High,High
190,Vietnam,Low,Low,Low,Low,Low,Low,Low,Low,Low,...,High,High,High,High,High,Very High,Very High,Very High,Very High,Very High
191,Yemen,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,Very Low,...,High,High,High,High,High,High,High,High,High,High
192,Zambia,High,High,High,High,High,High,High,High,High,...,High,High,High,High,High,High,High,High,High,High


In [3]:
# Helper for multidimensional (time-varying) substitution-cost matrices:
# label each time slice with the state names so the output is easier to read
def output(data, time, states): # `data` consists of two parts: 'indel' and 'sm'
    print("indel: ", data['indel'])
    print("sm:")
    for i in range(data['sm'].shape[0]):
        print(f" , , {time[i]}")
        _df = pd.DataFrame(data['sm'][i, :, :], index=states, columns=states)
        print(_df)

In [3]:
# Create the SequenceData object
# Define the time-span variable
time = list(df.columns)[1:]

states = ['Very Low', 'Low', 'Middle', 'High', 'Very High']

sequence_data = SequenceData(df, time=time, time_type="year", id_col="country", states=states)

sequence_data


[>] SequenceData initialized successfully! Here's a summary:
[>] Number of sequences: 194
[>] Min/Max sequence length: 223 / 223
[>] Alphabet: ['Very Low', 'Low', 'Middle', 'High', 'Very High']


SequenceData(194 sequences, Alphabet: ['Very Low', 'Low', 'Middle', 'High', 'Very High'])

# Chapter 1: get_distance_matrix
Below are the examples from the official *get_distance_matrix* documentation.

**Note**: The last example (DHD + CONSTANT) is provided as a **counterexample**. It demonstrates an unsupported combination of method and substitution-cost option, so the function raises a `ValueError` instead of returning a distance matrix.

In [5]:
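# refseq: two sets of sequence indices; distances are computed
# between every sequence in the first set and every sequence in the second
refseq = [[0, 1, 2], [99, 100]] # Rows 0-2 compared against rows 99-100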

om = get_distance_matrix(sequence_data,
                         method="OM",
                         refseq=refseq,
                         sm="TRATE",
                         indel="auto")
om

[>] Processing 194 sequences with 6 unique states.
[>] Transition-based substitution-cost matrix (TRATE) initiated...
  - Computing transition probabilities for: [Very Low, Low, Middle, High, Very High]
[>] Indel cost generated.

[>] Pairwise measures between two subsets of sequences of sizes 3 and 2
[>] Identified 5 unique sequences.
[>] Sequence length: min/max = 223 / 223.

[>] Starting Optimal Matching(OM)...
[>] Computed Successfully.


Unnamed: 0,Lithuania,Luxembourg
Afghanistan,183.953446,341.902297
Albania,196.629013,386.107284
Algeria,126.0,299.890765
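For readers new to Optimal Matching, the following is a minimal pure-Python sketch of the underlying edit-distance recursion, including a small cross-subset comparison in the spirit of `refseq`. It is not Sequenzo's implementation: the names `om_distance` and `constant_cost`, the toy sequences, and the indel cost of 1 are all assumptions chosen for illustration.

```python
def om_distance(seq_a, seq_b, sub_cost, indel=1.0):
    """Optimal-matching (edit) distance between two state sequences.

    sub_cost(a, b) returns the substitution cost between states a and b.
    """
    n, m = len(seq_a), len(seq_b)
    # dist[i][j] = cheapest way to turn seq_a[:i] into seq_b[:j]
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * indel
    for j in range(1, m + 1):
        dist[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = seq_a[i - 1], seq_b[j - 1]
            substitute = dist[i - 1][j - 1] + (0.0 if a == b else sub_cost(a, b))
            delete = dist[i - 1][j] + indel
            insert = dist[i][j - 1] + indel
            dist[i][j] = min(substitute, delete, insert)
    return dist[n][m]


def constant_cost(a, b, cval=2.0):
    """Hypothetical constant substitution cost (same value for every pair)."""
    return cval


# Toy sequences (not taken from the CO2 dataset)
x = ["Low", "Low", "High", "High"]
y = ["Low", "High", "High", "High"]
z = ["High", "High", "High", "High"]
seqs = [x, y, z]

print(om_distance(x, y, constant_cost))  # 2.0

# Distances between two subsets of sequences, given by index
subset_a, subset_b = [0, 1], [2]
cross = [[om_distance(seqs[i], seqs[j], constant_cost) for j in subset_b]
         for i in subset_a]
print(cross)  # [[4.0], [2.0]]
```

The cross-subset result has one row per sequence in the first set and one column per sequence in the second, which mirrors the 3 × 2 shape of the table above.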


In [6]:
# 1. OMspell + TRATE
omspell = get_distance_matrix(sequence_data,
                              method="OMspell",
                              sm="TRATE",
                              indel="auto")
omspell

[>] Processing 194 sequences with 7 unique states.
[>] Transition-based substitution-cost matrix (TRATE) initiated...
  - Computing transition probabilities for: [Very Low, Low, Middle, High, Very High,  ]
[>] Indel cost generated.

[>] Identified 192 unique spell sequences.
[>] Sequence spell length: min/max = 1 / 24.

[>] Starting Optimal Matching with spell(OMspell)...
[>] Computing all pairwise distances...
[>] Computed Successfully.


Unnamed: 0,Afghanistan,Albania,Algeria,Andorra,Angola,Antigua and Barbuda,Argentina,Armenia,Australia,Austria,...,Uganda,Ukraine,Uruguay,Uzbekistan,Vanuatu,Venezuela,Vietnam,Yemen,Zambia,Zimbabwe
Afghanistan,0.000000,69.447981,69.466725,214.0,183.500000,180.000000,187.500000,107.499418,182.500000,188.499418,...,89.500000,117.500000,151.498799,116.500000,166.466725,91.896885,166.000000,58.499418,213.5,186.843289
Albania,69.447981,0.000000,55.966725,182.5,197.000000,179.500000,180.000000,96.967925,176.000000,165.000000,...,122.000000,102.999418,137.948563,104.999418,176.999418,76.500000,187.484482,58.999418,203.0,168.500000
Algeria,69.466725,55.966725,0.000000,153.5,187.000000,153.500000,144.000000,89.000000,145.000000,152.000000,...,106.000000,76.000000,125.997998,75.000000,190.960807,39.500000,180.483481,68.948563,205.0,195.500000
Andorra,214.000000,182.500000,153.500000,0.0,170.500000,170.000000,52.500000,165.500000,66.500000,132.500000,...,225.448563,117.500000,189.500000,106.500000,159.460807,137.000000,166.000000,171.500000,125.5,190.000000
Angola,183.500000,197.000000,187.000000,170.5,0.000000,135.500000,153.000000,167.000000,170.000000,194.000000,...,155.000000,171.000000,130.948563,170.000000,95.000000,204.500000,59.500000,155.999418,170.0,179.500000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,91.896885,76.500000,39.500000,137.0,204.500000,162.000000,134.500000,102.500000,129.500000,131.500000,...,112.500000,61.500000,134.500000,62.499418,202.500000,0.000000,199.999418,92.500000,216.5,200.000000
Vietnam,166.000000,187.484482,180.483481,166.0,59.500000,124.999418,140.500000,156.499418,157.500000,175.499418,...,150.500000,163.500000,110.396885,163.500000,76.428732,199.999418,0.000000,146.499418,171.5,163.895926
Yemen,58.499418,58.999418,68.948563,171.5,155.999418,194.500000,171.948563,116.000000,187.967344,201.967925,...,108.948563,127.948563,155.000000,126.948563,152.999418,92.500000,146.499418,0.000000,171.0,184.467925
Zambia,213.500000,203.000000,205.000000,125.5,170.000000,209.500000,124.000000,198.000000,190.000000,210.000000,...,224.948563,198.000000,222.000000,196.000000,159.000000,216.500000,171.500000,171.000000,0.0,195.500000
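OMspell compares sequences of spells (runs of consecutive identical states) rather than individual time points, and the spell length reported in the log above appears to be the number of spells per sequence. As a small illustration, the sketch below collapses a state sequence into (state, duration) spells. The helper name `to_spells` is hypothetical and the sequence is a toy example.

```python
from itertools import groupby


def to_spells(sequence):
    """Collapse consecutive identical states into (state, duration) pairs."""
    return [(state, sum(1 for _ in run)) for state, run in groupby(sequence)]


seq = ["Very Low"] * 3 + ["Low"] * 2 + ["High"]
print(to_spells(seq))  # [('Very Low', 3), ('Low', 2), ('High', 1)]
```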


In [7]:
# 2. OM + CONSTANT
om = get_distance_matrix(sequence_data,
                         method="OM",
                         sm="CONSTANT",
                         indel="auto")
om

[>] Processing 194 sequences with 8 unique states.
  - Creating 8x8 substitution-cost matrix using 2 as constant value
[>] Indel cost generated.

[>] Identified 192 unique sequences.
[>] Sequence length: min/max = 223 / 223.

[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.


Unnamed: 0,Afghanistan,Albania,Algeria,Andorra,Angola,Antigua and Barbuda,Argentina,Armenia,Australia,Austria,...,Uganda,Ukraine,Uruguay,Uzbekistan,Vanuatu,Venezuela,Vietnam,Yemen,Zambia,Zimbabwe
Afghanistan,0.0,130.0,118.0,408.0,354.0,316.0,334.0,182.0,334.0,340.0,...,140.0,226.0,272.0,228.0,316.0,162.0,306.0,110.0,408.0,334.0
Albania,130.0,0.0,88.0,286.0,330.0,350.0,284.0,180.0,322.0,304.0,...,234.0,192.0,266.0,192.0,318.0,138.0,308.0,54.0,326.0,306.0
Algeria,118.0,88.0,0.0,300.0,360.0,264.0,282.0,126.0,282.0,282.0,...,200.0,142.0,194.0,144.0,360.0,68.0,350.0,124.0,402.0,352.0
Andorra,408.0,286.0,300.0,0.0,334.0,294.0,104.0,288.0,124.0,128.0,...,446.0,188.0,324.0,182.0,250.0,264.0,294.0,310.0,250.0,250.0
Angola,354.0,330.0,360.0,334.0,0.0,256.0,300.0,292.0,334.0,318.0,...,304.0,306.0,212.0,306.0,182.0,390.0,104.0,306.0,334.0,300.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,162.0,138.0,68.0,264.0,390.0,282.0,260.0,164.0,248.0,246.0,...,216.0,118.0,212.0,120.0,388.0,0.0,388.0,176.0,414.0,366.0
Vietnam,306.0,308.0,350.0,294.0,104.0,184.0,224.0,224.0,262.0,266.0,...,242.0,292.0,198.0,290.0,116.0,388.0,0.0,282.0,304.0,224.0
Yemen,110.0,54.0,124.0,310.0,306.0,378.0,310.0,198.0,370.0,354.0,...,214.0,222.0,292.0,222.0,282.0,176.0,282.0,0.0,310.0,308.0
Zambia,408.0,326.0,402.0,250.0,334.0,406.0,246.0,360.0,370.0,354.0,...,446.0,362.0,410.0,360.0,242.0,414.0,304.0,310.0,0.0,248.0


In [8]:
# 3. HAM + TRATE
ham = get_distance_matrix(sequence_data,
                          method="HAM",
                          sm="TRATE",
                          indel="auto")
ham

[>] Processing 194 sequences with 9 unique states.
[>] Transition-based substitution-cost matrix (TRATE) initiated...
  - Computing transition probabilities for: [Very Low, Low, Middle, High, Very High,  ,  ,  ]
[>] Indel cost generated.

[>] Identified 192 unique sequences.
[>] Sequence length: min/max = 223 / 223.

[>] Starting (Dynamic) Hamming Distance(DHD/HAM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.


Unnamed: 0,Afghanistan,Albania,Algeria,Andorra,Angola,Antigua and Barbuda,Argentina,Armenia,Australia,Austria,...,Uganda,Ukraine,Uruguay,Uzbekistan,Vanuatu,Venezuela,Vietnam,Yemen,Zambia,Zimbabwe
Afghanistan,0.000000,163.715397,186.363017,445.216586,387.645710,443.941048,445.166313,310.359423,445.161511,445.183599,...,146.053322,325.044898,345.699481,325.061301,402.911112,209.301036,409.893879,117.725731,405.807815,409.940796
Albania,163.715397,0.000000,96.374471,395.635013,375.930348,353.261271,373.864089,202.372780,395.579938,383.736408,...,237.095295,273.502635,268.934419,275.479728,323.948635,175.405306,344.759112,86.924739,324.888153,344.502396
Algeria,186.363017,96.374471,0.000000,343.118210,404.868303,308.048601,342.601182,218.079263,343.063135,342.829846,...,216.382240,222.889887,227.590879,222.957028,398.449326,124.893192,395.967010,149.497771,399.508800,389.965218
Andorra,445.216586,395.635013,343.118210,0.000000,443.664682,327.011439,199.390109,343.497511,124.732541,148.543814,...,445.936207,251.617963,335.206542,249.529697,439.861780,285.355307,432.971293,441.320065,245.100821,405.073239
Angola,387.645710,375.930348,404.868303,443.664682,0.000000,324.243702,442.021243,381.264836,441.748585,442.473063,...,323.344272,422.072086,244.436368,424.102717,216.999262,425.896513,122.291845,314.959248,332.895240,343.444823
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,209.301036,175.405306,124.893192,285.355307,425.896513,321.367596,284.898547,187.857517,285.300231,296.984928,...,231.296257,145.206706,247.422288,147.157543,413.348226,0.000000,405.687704,205.404515,410.383540,378.915103
Vietnam,409.893879,344.759112,395.967010,432.971293,122.291845,241.146902,402.143620,285.897714,430.955403,423.690802,...,424.365908,392.369658,291.624536,398.285565,160.648769,405.687704,0.000000,329.511903,301.165971,290.870450
Yemen,117.725731,86.924739,149.497771,441.320065,314.959248,418.627206,441.261368,269.766225,441.264990,441.282024,...,209.962281,321.147535,320.248572,321.164780,310.928145,205.404515,329.511903,0.000000,309.927722,327.486159
Zambia,405.807815,324.888153,399.508800,245.100821,332.895240,399.131918,240.567047,355.595806,361.990136,346.585277,...,442.426815,357.348953,406.908199,355.362397,238.689852,410.383540,301.165971,309.927722,0.000000,241.817747
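The Hamming distance compares two equal-length sequences position by position and adds up the substitution cost wherever the states differ. The sketch below illustrates this with a hand-written cost table; the function name `hamming_distance` and the cost value 1.8 are assumptions for illustration and are not taken from the TRATE matrix computed above.

```python
def hamming_distance(seq_a, seq_b, sub_cost):
    """Sum of substitution costs over positions where the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(sub_cost[a][b] for a, b in zip(seq_a, seq_b) if a != b)


# Hypothetical symmetric substitution costs between two states
cost = {
    "Low": {"Low": 0.0, "High": 1.8},
    "High": {"Low": 1.8, "High": 0.0},
}

x = ["Low", "Low", "High"]
y = ["Low", "High", "Low"]
print(hamming_distance(x, y, cost))  # two mismatches, each costing 1.8
```

Because no insertions or deletions are allowed, this measure is sensitive to timing: the same trajectory shifted by one year counts as a mismatch at every affected position.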


In [9]:
# 4. DHD + CONSTANT (counterexample: this call is expected to raise a ValueError)
dhd = get_distance_matrix(sequence_data,
                          method="DHD",
                          sm="CONSTANT",
                          indel="auto")
dhd

[>] Processing 194 sequences with 10 unique states.


ValueError: [!] 'sm = "CONSTANT"' is not relevant for DHD, consider HAM instead.
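The error makes sense once DHD is viewed as a Hamming distance whose substitution costs change with the time position. The sketch below assumes that definition; `dhd_distance` and `pair_cost` are hypothetical helpers, and the cost values are invented. With a constant cost at every position there is nothing time-varying left, which is presumably why the library suggests HAM instead.

```python
def dhd_distance(seq_a, seq_b, costs_by_time):
    """Hamming-style distance where costs_by_time[t][a][b] is the cost at position t."""
    if not (len(seq_a) == len(seq_b) == len(costs_by_time)):
        raise ValueError("sequences and cost list must have the same length")
    total = 0.0
    for t, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            total += costs_by_time[t][a][b]
    return total


def pair_cost(value):
    """Hypothetical two-state cost table with a single off-diagonal value."""
    return {"Low": {"High": value}, "High": {"Low": value}}


x = ["Low", "Low", "High"]
y = ["High", "High", "High"]

varying = [pair_cost(2.0), pair_cost(1.0), pair_cost(0.5)]  # cost depends on position
constant = [pair_cost(2.0)] * 3                             # same cost everywhere

print(dhd_distance(x, y, varying))   # 3.0
print(dhd_distance(x, y, constant))  # 4.0, identical to a plain Hamming distance
```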

# Chapter 2: get_substitution_matrix
Below are the examples from the official *get_substitution_matrix* documentation. Note that the code cells call the function as `get_substitution_cost_matrix`.

In [None]:
# 1. TRATE + time_varying(True)
sm = get_substitution_cost_matrix(sequence_data,
                                  method="TRATE",
                                  cval=4,
                                  time_varying=True)

# For efficiency, the 'sm' part of the result is a NumPy ndarray,
# which is hard to read when printed directly,
# so we use the `output` helper defined earlier to label each time slice
output(sm, time, states)
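As background on what TRATE computes, here is a minimal sketch of the time-constant variant. It assumes the TraMineR-style definition, in which the cost between two different states is `cval` minus the two transition rates between them; whether Sequenzo uses exactly this formula is an assumption here. The helper names and toy sequences are hypothetical.

```python
from collections import Counter


def transition_rates(sequences, states):
    """Estimate p(a -> b): share of a's transitions that lead to b."""
    pair_counts = Counter()
    from_counts = Counter()
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            pair_counts[(cur, nxt)] += 1
            from_counts[cur] += 1
    return {
        a: {b: (pair_counts[(a, b)] / from_counts[a] if from_counts[a] else 0.0)
            for b in states}
        for a in states
    }


def trate_costs(sequences, states, cval=2.0):
    """Assumed formula: cost(a, b) = cval - p(a -> b) - p(b -> a), zero on the diagonal."""
    rates = transition_rates(sequences, states)
    return {
        a: {b: 0.0 if a == b else cval - rates[a][b] - rates[b][a] for b in states}
        for a in states
    }


states = ["Low", "High"]
toy_sequences = [
    ["Low", "Low", "High", "High"],
    ["Low", "High", "High", "Low"],
]
costs = trate_costs(toy_sequences, states)
# p(Low -> High) = 2/3 and p(High -> Low) = 1/3, so the cost is 2 - 1 = 1
print(round(costs["Low"]["High"], 6))  # 1.0
```

States that frequently follow one another end up cheaper to substitute. A time-varying version would estimate the rates separately for each time position, which is why the result in the cell above has one matrix per year.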

In [None]:
# 2. CONSTANT + time_varying(False)
sm = get_substitution_cost_matrix(sequence_data,
                                  method="CONSTANT",
                                  cval=2,
                                  time_varying=False)
sm

# Chapter 3: clara
Below is the example from the official *clara* documentation.

In [4]:
# Run clara for every number of clusters k = 2, ..., 20
result = clara(sequence_data,
               R=10,
               sample_size=3000,
               kvals=range(2,21),
               criteria=['distance', 'pbm'],
               parallel=True,
               stability=True)
result

 [>] Starting generalized CLARA for sequence analysis.
 [>] Using crisp clustering optimizing the following criterion: distance, pbm.
 [>] Aggregating 194 sequences... OK (192 unique cases).
 [>] Starting iterations...



FloatingPointError: NaN dissimilarity value in intermediate results.
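To show the general idea behind CLARA independently of the failing call above, here is a simplified sketch: cluster several random subsamples with k-medoids, assign all items to the resulting medoids, and keep the solution with the lowest total distance. It uses a basic alternating k-medoids rather than the full PAM swap search, works on toy 1-D numbers, and does not reproduce Sequenzo's `clara`. All function and parameter names (`clara_sketch`, `n_samples`, and so on) are assumptions.

```python
import random


def assign(points, medoids, dist):
    """Assign each point to its nearest medoid; return (labels, total distance)."""
    labels, total = [], 0.0
    for p in points:
        nearest = min(medoids, key=lambda m: dist(p, m))
        labels.append(nearest)
        total += dist(p, nearest)
    return labels, total


def k_medoids(points, k, dist, rng, max_iter=50):
    """Simple alternating k-medoids (a simplification of PAM)."""
    medoids = rng.sample(points, k)
    for _ in range(max_iter):
        labels, _ = assign(points, medoids, dist)
        new_medoids = []
        for m in medoids:
            members = [p for p, lab in zip(points, labels) if lab == m]
            if not members:  # keep the old medoid if its cluster is empty
                new_medoids.append(m)
                continue
            # New medoid: the member with the smallest summed distance to its cluster
            new_medoids.append(min(members, key=lambda c: sum(dist(c, p) for p in members)))
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids


def clara_sketch(points, k, dist, n_samples=5, sample_size=6, seed=0):
    """Keep the best medoid set found across several random subsamples."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        # The subsample can never be larger than the data itself
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = k_medoids(sample, k, dist, rng)
        labels, total = assign(points, medoids, dist)
        if best is None or total < best[2]:
            best = (medoids, labels, total)
    return best


def abs_diff(a, b):
    return abs(a - b)


points = [1, 2, 3, 4, 5, 101, 102, 103, 104, 105]
medoids, labels, total = clara_sketch(points, k=2, dist=abs_diff)
print("medoids:", medoids)
print("labels:", labels)
print("total distance:", total)
```

The benefit of subsampling is that the expensive medoid search runs on a small sample, while the full dataset is only used for the cheap assignment step.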

In [4]:
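# Compute the full pairwise OM distance matrix,
# then run k-medoids once on it with k = 6 clusters
diss = get_distance_matrix(sequence_data, method="OM", sm="TRATE", indel="auto")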

pam = k_medoids_once(diss=diss, k=6, weights=sequence_data.weights, npass=1)
pam

[>] Processing 194 sequences with 6 unique states.
[>] Transition-based substitution-cost matrix (TRATE) initiated...
  - Computing transition probabilities for: [Very Low, Low, Middle, High, Very High]
[>] Indel cost generated.

[>] Identified 192 unique sequences.
[>] Sequence length: min/max = 223 / 223.

[>] Starting Optimal Matching(OM)...
[>] Computing all pairwise distances...
[>] Computed Successfully.
[>] PAMonce starts ... 


array([152, 152,  68, 152, 152, 152,  68, 152,  68, 152, 152, 152, 152,
       152, 152, 152,  68, 152, 152,  68,  68,  68,  68,  68, 152,  25,
       152,  68,  28, 152,  68, 152,  68, 152, 152, 152, 152,  68, 152,
        68,  68, 152,  68, 152, 152,  68,  68,  68, 152, 152,  68,  68,
       152, 152,  68,  68,  68,  68,  68,  68, 152, 152,  68, 152, 152,
       152, 152, 152,  68,  68,  68,  68, 152, 152,  68,  68, 152, 152,
       152, 152, 152,  68,  68,  68,  68, 152, 152,  68, 152,  68,  68,
       152,  68, 152, 152,  68,  68,  68, 152,  68, 152,  68, 152,  68,
        68, 152,  68, 152,  68,  68, 152, 152, 152, 152,  68,  68,  68,
        68, 152, 152,  68, 152,  68,  68, 152, 152,  68,  68,  68, 152,
       152, 152,  68, 152, 152,  68,  68,  68,  68,  68,  68,  68,  68,
       152, 152, 145,  68, 152,  68,  68, 152,  68, 152, 152, 152, 152,
        68,  68,  68,  68, 152,  68,  68,  68, 152,  68, 152, 152, 152,
       152,  68,  68, 152,  68,  68,  68,  68,  68,  68, 152, 15
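The values in the array above look like medoid labels: each sequence seems to be tagged with the row index of the medoid it belongs to. Under that assumption, the sketch below counts cluster sizes and looks up the name of each medoid. The helper `summarize_clusters` is hypothetical, and the labels and names are toy data rather than the actual result.

```python
from collections import Counter


def summarize_clusters(labels, names):
    """Return cluster sizes and the name of each medoid, keyed by medoid index.

    Assumes each label is the row index of that item's medoid.
    """
    sizes = Counter(labels)
    medoid_names = {medoid: names[medoid] for medoid in sizes}
    return sizes, medoid_names


# Toy example: five items, two clusters with medoids at rows 2 and 0
toy_labels = [2, 2, 0, 2, 0]
toy_names = ["A", "B", "C", "D", "E"]

sizes, medoid_names = summarize_clusters(toy_labels, toy_names)
print(sizes)         # Counter({2: 3, 0: 2})
print(medoid_names)  # {2: 'C', 0: 'A'}
```

Applied to the real result, `labels` would be the `pam` array and `names` the list of country names in the same row order.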

In [25]:
print("Thank you for learning sequence analysis with Sequenzo! ")
print("We hope you found this tutorial insightful.")
print("\n💡 Stay curious, keep coding, and discover new insights.")
print("✉️ If you have any questions, please feel free to reach out :)")

Thank you for learning sequence analysis with Sequenzo! 
We hope you found this tutorial insightful.

💡 Stay curious, keep coding, and discover new insights.
✉️ If you have any questions, please feel free to reach out :)
