<div style="text-align: center;">
  <img src="src\images\test.jpg" alt="Diagram of model" style="width: 60%;">
</div>

<div style="text-align: center;">
  <h1>Implementing different K-mer counting algorithms</h1>
  <p style="text-align: center;">By Seyed Mohammadreza Javad, CE student at SUT</p>
</div>

# Introduction
I will start with three questions:  
**What is a k-mer?** A k-mer is a substring of length `k` in a biological sequence.  
**Why is that important?** Counting k-mers is a subproblem in many genomic computation tasks.  
**What are we going to cover in this notebook?** We implement different algorithms to count k-mers, assuming they can overlap.  
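To make "overlapping" concrete, here is a minimal sketch that lists every k-mer of a short string. The function name `list_kmers` and the toy sequence are made up for illustration and are not part of the notebook's data.

```python
def list_kmers(seq: str, k: int) -> list[str]:
    """Return every overlapping substring of length k, in order of position."""
    # A sequence of length n has n - k + 1 windows of length k.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


toy = "ATATA"  # hypothetical toy sequence
print(list_kmers(toy, 3))  # ['ATA', 'TAT', 'ATA']
```

Note that `ATA` appears twice because consecutive windows share `k - 1` characters.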

# Methods and Algorithms

In [16]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import os
import itertools

## Loading the DNA and inspecting the data set
We will use an [E. coli](https://www.ncbi.nlm.nih.gov/nuccore/NZ_CABFNN010000001.1) genome, which is similar to the one introduced in class.  
DEFINITION: Escherichia coli J53 isolate E. coli, whole genome shotgun

In [15]:
PATH = r"C:\Users\Asus\Desktop\BIO\kmer\src\sequence.fasta"
DNA = []
with open(PATH, 'r') as fp:
    for line in fp:
        # Skip FASTA header lines, which start with '>'
        if not line.startswith('>'):
            DNA.append(line.strip().upper())
DNA = ''.join(DNA)
"example:", DNA[:30], " size:", len(DNA)

('example:', 'GTGCCGGTCGCTAACTGCGCGGTCACTACC', ' size:', 4705562)
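The loading step can also be wrapped in a small reusable helper. The sketch below assumes a plain FASTA layout (header lines starting with `>`, sequence lines otherwise) and simply concatenates all records into one string; the name `parse_fasta_lines` and the demo lines are hypothetical.

```python
from typing import Iterable


def parse_fasta_lines(lines: Iterable[str]) -> str:
    """Join all sequence lines into one uppercase string, skipping headers."""
    parts = []
    for line in lines:
        line = line.strip()
        # Ignore blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        parts.append(line.upper())
    return "".join(parts)


# Made-up lines standing in for an open file handle.
demo = [">toy_record made-up header\n", "gtgc\n", "CGGT\n", "\n"]
print(parse_fasta_lines(demo))  # GTGCCGGT
```

Because an open file is an iterable of lines, the same function could be called as `parse_fasta_lines(fp)` inside a `with open(PATH) as fp:` block.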

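## Naive approach (sliding-window slicing)
We slide a window of length `k` across the sequence one position at a time and tally each substring in a dictionary, tracking the most frequent k-mer as we go.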

In [19]:
def k_mer_naive(k: int, dna: str) -> tuple[dict, str, int]:
    """Count overlapping k-mers; return (counts, most frequent k-mer, its count)."""
    table = {}
    winner, score = None, 0
    for i in range(len(dna) - k + 1):
        _seq = dna[i:i+k]  # current window of length k
        new_value = table.get(_seq, 0) + 1
        table[_seq] = new_value
        # Strict comparison keeps the first k-mer to reach the top count
        if score < new_value:
            score = new_value
            winner = _seq
    return table, winner, score

In [23]:
res = k_mer_naive(k=3,dna=DNA)

In [24]:
res[1],res[2]

('GCG', 116953)
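The same tally can be written more compactly with the standard library's `collections.Counter`. This is an alternative sketch, not the notebook's method; the name `k_mer_counter` is hypothetical and the input is a made-up toy sequence.

```python
from collections import Counter


def k_mer_counter(k: int, dna: str) -> Counter:
    """Count overlapping k-mers by feeding every window to a Counter."""
    return Counter(dna[i:i + k] for i in range(len(dna) - k + 1))


counts = k_mer_counter(3, "ATATA")  # toy input, not the E. coli genome
print(counts.most_common(1))  # [('ATA', 2)]
```

`Counter.most_common(1)` replaces the manual `winner`/`score` bookkeeping, at the cost of a separate pass over the table after counting.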

In [25]:
res[0]

{'GTG': 67570,
 'TGC': 97411,
 'GCC': 93450,
 'CCG': 88163,
 'CGG': 88220,
 'GGT': 75834,
 'GTC': 55588,
 'TCG': 71736,
 'CGC': 115944,
 'GCT': 82030,
 'CTA': 27557,
 'TAA': 69729,
 'AAC': 83665,
 'ACT': 50599,
 'CTG': 106391,
 'GCG': 116953,
 'TCA': 84859,
 'CAC': 67213,
 'TAC': 53387,
 'ACC': 75231,
 'CCA': 86404,
 'CAG': 104454,
 'AGC': 81511,
 'CCT': 51445,
 'CTC': 43336,
 'CGA': 72583,
 'GAC': 54999,
 'ACA': 59296,
 'ACG': 74138,
 'TGA': 85260,
 'GAA': 85040,
 'AAG': 64602,
 'GAT': 87603,
 'ATG': 78117,
 'TGG': 87508,
 'GGC': 94215,
 'ATT': 84096,
 'TTG': 77628,
 'GAG': 43412,
 'CGT': 74243,
 'GTT': 83763,
 'GGA': 56832,
 'TTT': 110482,
 'TTC': 84833,
 'TCT': 57570,
 'TTA': 69758,
 'CAA': 78019,
 'CAT': 77521,
 'ATC': 87739,
 'GCA': 96648,
 'AAA': 111353,
 'AAT': 84520,
 'ATA': 64281,
 'TAT': 64589,
 'CTT': 64360,
 'TCC': 57331,
 'AGG': 51153,
 'CCC': 48188,
 'AGA': 56379,
 'GGG': 48346,
 'TGT': 59527,
 'GTA': 53247,
 'AGT': 50563,
 'TAG': 27138}
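Two quick sanity checks apply to any table like the one above: the counts should sum to the number of windows, `len(dna) - k + 1`, and there can be at most `4**k` distinct keys if the sequence contains only `A`, `C`, `G` and `T`. The sketch below encodes both; the helper name `check_kmer_table` and the toy table are assumptions for illustration.

```python
def check_kmer_table(table: dict, seq_len: int, k: int, alphabet: str = "ACGT") -> bool:
    """Return True if the table is consistent with a sequence of length seq_len."""
    # Every window contributes exactly one count.
    total_ok = sum(table.values()) == max(seq_len - k + 1, 0)
    # Assumes the sequence uses only characters from `alphabet`.
    distinct_ok = len(table) <= len(alphabet) ** k
    return total_ok and distinct_ok


toy_table = {"ATA": 2, "TAT": 1}  # 3-mer counts of the toy string "ATATA"
print(check_kmer_table(toy_table, seq_len=5, k=3))  # True
```

On the notebook's data the call would be `check_kmer_table(res[0], len(DNA), 3)`; a `False` result could indicate a counting bug or characters outside `ACGT` in the sequence.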

