### Count(Text,Pattern)

To compute Count(Text, Pattern), we slide a window along Text and check whether each k-mer substring of Text (where k is the length of Pattern) matches Pattern.

In [14]:
def PatternCount(Text, Pattern):
    """
    Count the number of times a given pattern appears in a text.

    This function scans the input string `Text` and counts how many times 
    the substring `Pattern` occurs, including overlapping matches.

    Parameters
    ----------
    Text : str
        The string in which to search for the pattern.
    Pattern : str
        The substring to count within the text.

    Returns
    -------
    int
        The number of times `Pattern` appears in `Text`.

    Examples
    --------
    >>> PatternCount("GCGCG", "GCG")
    2
    >>> PatternCount("AAAAA", "AA")
    4
    """

    count = 0
    for i in range(len(Text) - len(Pattern) + 1):
        if Text[i:i+len(Pattern)] == Pattern:
            count += 1
    return count


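The regex lookahead version counts overlapping matches in a single call and is often faster on large texts, although the difference depends on the input and is worth measuring.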

In [1]:
import re

def PatternCount_regex(Text: str, Pattern: str) -> int:
    """
    Count pattern occurrences using regular expressions.
    Includes overlapping matches via lookahead (?=...).
    """
    # re.escape keeps Pattern literal even if it contains regex metacharacters.
    return len(re.findall(f"(?={re.escape(Pattern)})", Text))
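
Whether the lookahead version is actually faster depends on the text, the pattern, and the Python build, so it helps to measure. The sketch below is a minimal, self-contained check: it redefines both counters under hypothetical names (`count_loop`, `count_regex`), confirms that they agree on a random DNA string, and times them with `timeit`. The text length, seed, and repeat count are arbitrary assumptions.

```python
import random
import re
import timeit


def count_loop(text: str, pattern: str) -> int:
    """Sliding-window count, including overlapping matches."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def count_regex(text: str, pattern: str) -> int:
    """Lookahead-based count; re.escape keeps the pattern literal."""
    return len(re.findall(f"(?={re.escape(pattern)})", text))


random.seed(0)  # arbitrary seed, only for reproducibility
sample_text = "".join(random.choice("ACGT") for _ in range(100_000))
sample_pattern = "ACG"

# Both approaches must agree before a timing comparison means anything.
assert count_loop(sample_text, sample_pattern) == count_regex(sample_text, sample_pattern)

for fn in (count_loop, count_regex):
    seconds = timeit.timeit(lambda: fn(sample_text, sample_pattern), number=5)
    print(f"{fn.__name__}: {seconds:.3f} s for 5 runs")
```

The timings vary between machines, so no expected numbers are given here.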


### The Frequent Words Problem

<p> We say that Pattern is a most frequent k-mer in Text if it maximizes Count(Text, Pattern) among all k-mers.</p>
<p> Ex: ACTAT is a most frequent 5-mer in ACAACTAGCATACTATCGGGAACTATCCT, and ATA is a most frequent 3-mer of CGATATATCCATAG.</p>
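
The two examples above can be checked with a few lines. This is a minimal sketch using `collections.Counter`; the helper name `most_frequent_kmers` is hypothetical and is not one of the notebook's own functions.

```python
from collections import Counter


def most_frequent_kmers(text: str, k: int) -> set:
    """Return every k-mer that reaches the maximum count in text."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:  # k is larger than the text, so there are no k-mers
        return set()
    top = max(counts.values())
    return {kmer for kmer, n in counts.items() if n == top}


# Membership tests, because several k-mers can tie for the maximum count.
print("ATA" in most_frequent_kmers("CGATATATCCATAG", 3))
print("ACTAT" in most_frequent_kmers("ACAACTAGCATACTATCGGGAACTATCCT", 5))
```

Returning a set makes ties explicit: in the second string, ACTAT shares the maximum count with other 5-mers, which is why the definition says "a" most frequent k-mer.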



In [None]:
def frequentWords(Text, k):

    frequentPatterns = set()
    Count = {}
    for i in range(len(Text) - k + 1):
        Pattern = Text[i:i+k]
        if Pattern in Count:
            Count[Pattern] += 1
        else:
            Count[Pattern] = 1
    maxCount = max(Count.values())
    for Pattern, count in Count.items():
        if count == maxCount:
            frequentPatterns.add(Pattern)
    return frequentPatterns

if __name__ == "__main__":
    # Quick check on a sample text.
    print(frequentWords("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))


In [4]:
def FrequencyTable(Text, k):
    # Map each k-mer in Text to the number of times it occurs.
    freqMap = {}
    for i in range(len(Text) - k + 1):
        Pattern = Text[i:i+k]
        if Pattern in freqMap:
            freqMap[Pattern] += 1
        else:
            freqMap[Pattern] = 1
    return freqMap

In [5]:
def BetterFrequentWords(Text, k):
    freqMap = FrequencyTable(Text, k)
    maxCount = max(freqMap.values())
    return [Pattern for Pattern, count in freqMap.items() if count == maxCount]

In [6]:
text = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
k = 4
print(BetterFrequentWords(text, k))

['GCAT', 'CATG']


### The Reverse Complement Problem
<p> Given a nucleotide p, we denote its complementary nucleotide as p*. </p>
<p> The reverse complement of a string Pattern = p1 p2 ... pn is the string Pattern_rc = pn* ... p1*, </p>
<p> formed by taking the complement of each nucleotide in Pattern and then reversing the resulting string. </p>

In [12]:
def ReverseComplement(Pattern):
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    pattern = ''.join(complement[n] for n in Pattern)
    return pattern[::-1]
pattern = "AAAACCCGGT"
ReverseComplement(pattern)
print(f"The Pattern is {pattern} and its reverse complement is {ReverseComplement(pattern)}")

The Pattern is AAAACCCGGT and its reverse complement is ACCGGGTTTT
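
An equivalent way to build the reverse complement, sketched here as an alternative rather than a replacement, is a translation table from `str.maketrans`. The name `reverse_complement_translate` is hypothetical, and the sketch assumes an uppercase A/C/G/T alphabet. Any other character passes through unchanged, whereas the dictionary lookup above raises `KeyError`.

```python
# Translation table mapping each nucleotide to its complement.
_COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def reverse_complement_translate(pattern: str) -> str:
    """Complement every nucleotide, then reverse the string."""
    return pattern.translate(_COMPLEMENT_TABLE)[::-1]


rc = reverse_complement_translate("AAAACCCGGT")
print(rc)
# Applying the operation twice returns the original string.
print(reverse_complement_translate(rc) == "AAAACCCGGT")
```

The round-trip check is a cheap sanity test for any reverse-complement implementation.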


### Pattern Matching Problem

Input: Two strings, Pattern and Genome.

Output: A collection of space-separated integers specifying all starting positions where Pattern appears as a substring of Genome.

In [15]:
def PatternMatchingProbem(Pattern, Genome):
    positions = []
    for i in range(len(Genome) - len(Pattern) + 1):
        if Genome[i:i+len(Pattern)] == Pattern:
            positions.append(i)
    return positions

if __name__ == "__main__":
    Text = "GCGCG"
    Pattern = "GCG"
    print(f"PatternCount: {PatternCount(Text, Pattern)}")
    print(f"PatternCount_regex: {PatternCount_regex(Text, Pattern)}")
    print(f"frequentWords: {frequentWords('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4)}")
    print(f"ReverseComplement: {ReverseComplement('AAAACCCGGT')}")
    print(f"PatternMatchingProbem: {PatternMatchingProbem('ATAT', 'GATATATGCATATACTT')}")

PatternCount: 2
PatternCount_regex: 2
frequentWords: {'GCAT', 'CATG'}
ReverseComplement: ACCGGGTTTT
PatternMatchingProbem: [1, 3, 9]
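
For long genomes, slicing at every position may do more work than necessary. As a sketch of one alternative (not the method used above), `str.find` with a moving start offset lets the built-in substring search skip ahead while still reporting overlapping matches. `find_all_positions` is a hypothetical name.

```python
def find_all_positions(pattern: str, genome: str) -> list:
    """Return every 0-based start index of pattern in genome, overlaps included."""
    positions = []
    if not pattern:  # an empty pattern would otherwise match at every offset
        return positions
    start = genome.find(pattern)
    while start != -1:
        positions.append(start)
        # Restart one character later so overlapping matches are not skipped.
        start = genome.find(pattern, start + 1)
    return positions


print(find_all_positions("ATAT", "GATATATGCATATACTT"))
```

On the sample input this returns the same positions as the slicing loop, `[1, 3, 9]`.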


In [20]:
with open(r"Module-1/HiddenMessages/Vibrio_cholerae.txt") as f:
    genome = f.read().strip()
    pattern = "CTTGATCAT"
    print(f"Genome length: {len(genome)}")
    positions = PatternMatchingProbem(pattern, genome)
    print(f"Pattern {pattern} found at positions: {positions}")

Genome length: 1108250
Pattern CTTGATCAT found at positions: [60039, 98409, 129189, 152283, 152354, 152411, 163207, 197028, 200160, 357976, 376771, 392723, 532935, 600085, 622755, 1065555]


In [41]:
with open(r"Module-1/HiddenMessages/dataset_30273_5.txt") as f:
    text = f.readlines()
    genome = text[1].strip()
    pattern = text[0].strip()
    result = PatternMatchingProbem(pattern, genome)

    print(" ".join(map(str, result)))

1 24 31 127 158 233 240 247 273 280 337 365 398 423 455 462 492 536 558 597 658 722 847 862 904 945 988 1006 1052 1088 1117 1166 1214 1221 1268 1307 1346 1353 1380 1449 1575 1590 1665 1726 1771 1810 1828 1901 1908 1915 1938 1949 1984 1991 2012 2091 2098 2151 2158 2165 2191 2231 2255 2280 2290 2303 2344 2351 2358 2415 2510 2562 2623 2651 2718 2872 2879 2903 2918 2925 2936 2953 2974 3062 3147 3154 3175 3206 3213 3255 3271 3336 3365 3374 3393 3427 3434 3441 3448 3455 3478 3574 3589 3635 3644 3661 3684 3723 3748 3765 3772 3811 3818 3872 3901 3968 3986 4033 4091 4154 4222 4250 4271 4278 4317 4342 4365 4405 4429 4473 4480 4530 4562 4590 4608 4671 4695 4702 4709 4732 4739 4754 4761 4768 4835 4842 4877 4884 4917 4927 4934 4941 5004 5020 5136 5160 5284 5291 5414 5475 5499 5558 5616 5653 5660 5696 5719 5780 5803 5827 5844 5896 5961 6012 6042 6049 6077 6144 6187 6211 6237 6244 6279 6322 6431 6446 6456 6471 6499 6536 6551 6587 6594 6626 6649 6679 6750 6757 6764 6805 6846 6871 6949 7047 7077 7084 7

In [34]:
result

[34,
 39,
 1302,
 1873,
 3018,
 3126,
 3632,
 4199,
 4241,
 5228,
 5230,
 5274,
 5490,
 5501,
 5681,
 6223,
 7430,
 7575,
 7582,
 7941,
 8088,
 8193]