# **Building a Markov transition matrix**

A simple dinucleotide frequency model counts how often each pair of adjacent nucleotides appears in the sequence, which gives joint frequencies.
A first-order Markov model instead computes conditional probabilities, P(next | current), and so captures the transition biases.
To compute the transition matrix:

*   Count transitions (how many times A → T, A → C, etc. occur).
*   Normalize counts into probabilities by dividing each row by the sum of the counts in that row.

For the simple dinucleotide frequency model:
* Count all dinucleotides in the sequence.
* Normalize by dividing by the total number of dinucleotides.
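
The two normalizations above can be written side by side. As a sketch, let $n_{ab}$ denote the number of times nucleotide $a$ is immediately followed by nucleotide $b$; the symbols $n_{ab}$, $f_{ab}$ and $P(b \mid a)$ are notation introduced here for illustration and do not appear in the code below.

$$
f_{ab} = \frac{n_{ab}}{\sum_{a'} \sum_{b'} n_{a'b'}},
\qquad
P(b \mid a) = \frac{n_{ab}}{\sum_{b'} n_{ab'}}
$$

The entries of $f$ sum to 1 over the whole matrix, whereas each row of $P$ sums to 1, provided $a$ occurs at least once as the first nucleotide of a pair.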








In [1]:
import numpy as np
import pandas as pd


In [2]:
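def simple_dinucleotide_frequency(dna_seq):
    """Return (freq_matrix, count_matrix) for the dinucleotides in dna_seq.

    Rows index the first nucleotide of each pair and columns the second.
    """
    nucleotides = ['A', 'C', 'G', 'T']
    nuc_to_idx = {nuc: i for i, nuc in enumerate(nucleotides)}  # Map nucleotide to index

    count_matrix = np.zeros((4, 4), dtype=int)

    # Count overlapping dinucleotides; pairs containing a non-ACGT symbol are skipped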
    for i in range(len(dna_seq) - 1):
        current = dna_seq[i]
        next_nuc = dna_seq[i+1]
        if current in nuc_to_idx and next_nuc in nuc_to_idx:
            row = nuc_to_idx[current]
            col = nuc_to_idx[next_nuc]
            count_matrix[row, col] += 1

    # Calculate frequencies: normalize by the number of pairs actually counted,
    # so the matrix still sums to 1 when pairs with non-ACGT symbols were skipped
    total_pairs = max(1, count_matrix.sum())  # Avoid division by zero
    freq_matrix = count_matrix / total_pairs

    return freq_matrix, count_matrix

In [3]:
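def markov_transition_matrix(dna_seq):
    """Return (transition_matrix, count_matrix) for a first-order Markov model.

    transition_matrix[i, j] estimates P(next = j | current = i).
    """
    nucleotides = ['A', 'C', 'G', 'T']
    nuc_to_idx = {nuc: i for i, nuc in enumerate(nucleotides)}  # Map nucleotide to index

    count_matrix = np.zeros((4, 4), dtype=int)
    transition_matrix = np.zeros((4, 4))

    # Count transitions; pairs containing a non-ACGT symbol are skipped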
    for i in range(len(dna_seq) - 1):
        current = dna_seq[i]
        next_nuc = dna_seq[i+1]

        if current in nuc_to_idx and next_nuc in nuc_to_idx:
            row = nuc_to_idx[current]
            col = nuc_to_idx[next_nuc]
            count_matrix[row, col] += 1


    # Normalize each row by its total so that it sums to 1: P(next | current)
    row_sums = count_matrix.sum(axis=1)
    for i in range(4):
        if row_sums[i] > 0:  # Rows with no observed transitions stay all zeros
            transition_matrix[i] = count_matrix[i] / row_sums[i]

    return transition_matrix, count_matrix
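
The row-by-row normalization above can also be expressed with NumPy broadcasting. The following is a minimal sketch rather than part of the original notebook: the helper name `normalize_rows` is hypothetical, and it assumes the same convention as above, namely that rows with no observed transitions are left as zeros.

```python
import numpy as np

def normalize_rows(count_matrix):
    """Divide each row by its sum; rows that sum to zero are left as zeros."""
    counts = np.asarray(count_matrix, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)  # shape (n, 1), broadcasts over columns
    out = np.zeros_like(counts)
    # `where` restricts the division to rows with a positive sum
    np.divide(counts, row_sums, out=out, where=row_sums > 0)
    return out

example = np.array([[1, 3],
                    [0, 0]])
print(normalize_rows(example))
```

In this small example the first row becomes 0.25 and 0.75, and the empty second row stays at zero instead of producing a division-by-zero warning.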

In [4]:
dna_seq = input("Please enter the DNA sequence whose transition matrix is to be built: ")
print("You entered: ", dna_seq)
# Warn if the sequence contains anything other than A, C, G or T
for nucleo in dna_seq:
    if nucleo not in {'A','G','T','C'}:
        print("Invalid DNA sequence")
        break

Please enter the DNA sequence whose transition matrix is to be built:  ACGTTGCAACACACGTCAGAATGCATAATACGTCCTTTATGCCGGAAGACACAGATACGATACAGAT


You entered:  ACGTTGCAACACACGTCAGAATGCATAATACGTCCTTTATGCCGGAAGACACAGATACGATACAGAT
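
The check above only prints a warning, so later cells still run on an invalid sequence. One possible alternative, sketched here on the assumption that stopping early is preferred, is to raise an exception. The function name `validate_dna` and its choice to strip whitespace and upper-case the input are illustrative and not part of the original notebook.

```python
def validate_dna(seq):
    """Return seq stripped and upper-cased, or raise ValueError if it is not pure ACGT."""
    cleaned = seq.strip().upper()
    if not cleaned:
        raise ValueError("Empty DNA sequence")
    invalid = sorted(set(cleaned) - set("ACGT"))
    if invalid:
        raise ValueError(f"Invalid DNA symbols: {invalid}")
    return cleaned

print(validate_dna("acgtTGCA"))  # prints ACGTTGCA
```

Calling `validate_dna` on the raw input before building the matrices would stop the run at the first invalid sequence rather than silently skipping the affected pairs.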


In [5]:
dinuc_freq, dinuc_counts = simple_dinucleotide_frequency(dna_seq)

# Create labeled dataframes
nucleotides = ['A', 'C', 'G', 'T']
df_dinuc_freq = pd.DataFrame(dinuc_freq,
                             index=nucleotides,
                             columns=nucleotides)
df_dinuc_counts = pd.DataFrame(dinuc_counts,
                                index=nucleotides,
                                columns=nucleotides)

# Display results
print("Dinucleotide Count Matrix:")
display(df_dinuc_counts)
print("Dinucleotide Frequency Matrix:")
display(df_dinuc_freq.style.format("{:.2f}"))



Dinucleotide Count Matrix:


|   | A | C | G | T |
|---|---|---|---|---|
| A | 4 | 9 | 4 | 7 |
| C | 8 | 2 | 5 | 1 |
| G | 6 | 3 | 1 | 3 |
| T | 5 | 2 | 3 | 3 |


Dinucleotide Frequency Matrix:


|   | A | C | G | T |
|---|---|---|---|---|
| A | 0.06 | 0.14 | 0.06 | 0.11 |
| C | 0.12 | 0.03 | 0.08 | 0.02 |
| G | 0.09 | 0.05 | 0.02 | 0.05 |
| T | 0.08 | 0.03 | 0.05 | 0.05 |


In [6]:
trans_mat, count_mat = markov_transition_matrix(dna_seq)

# Display matrices with labels
nucleotides = ['A', 'C', 'G', 'T']
df_trans = pd.DataFrame(trans_mat, index=nucleotides, columns=nucleotides)
df_count = pd.DataFrame(count_mat, index=nucleotides, columns=nucleotides)

print("\nCount Matrix:")
display(df_count)
print("Transition Probability Matrix:")
display(df_trans.style.format("{:.2f}"))  # Show probabilities with 2 decimal places




Count Matrix:


|   | A | C | G | T |
|---|---|---|---|---|
| A | 4 | 9 | 4 | 7 |
| C | 8 | 2 | 5 | 1 |
| G | 6 | 3 | 1 | 3 |
| T | 5 | 2 | 3 | 3 |


Transition Probability Matrix:


|   | A | C | G | T |
|---|---|---|---|---|
| A | 0.17 | 0.38 | 0.17 | 0.29 |
| C | 0.50 | 0.12 | 0.31 | 0.06 |
| G | 0.46 | 0.23 | 0.08 | 0.23 |
| T | 0.38 | 0.15 | 0.23 | 0.23 |
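
Both matrices come from the same counts, so one can be recovered from the other. The sketch below is an illustration added here, not a cell from the original notebook: it hard-codes the count matrix printed above and checks that dividing each joint frequency by its row total reproduces the transition probabilities. The variable names `freq`, `trans` and `marginal` are illustrative.

```python
import numpy as np

# Count matrix copied from the output above
# (rows: current nucleotide A, C, G, T; columns: next nucleotide A, C, G, T)
counts = np.array([
    [4, 9, 4, 7],
    [8, 2, 5, 1],
    [6, 3, 1, 3],
    [5, 2, 3, 3],
])

freq = counts / counts.sum()                        # joint frequencies, sum to 1 overall
trans = counts / counts.sum(axis=1, keepdims=True)  # conditional probabilities, rows sum to 1

# Dividing each joint frequency by its row marginal recovers the transition matrix
marginal = freq.sum(axis=1, keepdims=True)
assert np.allclose(freq / marginal, trans)

# C followed by A: joint frequency vs. conditional probability
print(f"{freq[1, 0]:.2f} {trans[1, 0]:.2f}")  # prints 0.12 0.50
```

The pair CA makes up about 12% of all dinucleotides, yet half of all transitions out of C go to A. That difference is the transition bias the Markov model captures.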
