A computational framework for pattern detection on unaligned sequences: An application on SARS-CoV-2 data
An alignment-free method capable of processing and counting k-mers in a reasonable time, while evaluating multiple values of the k parameter concurrently.
kmerAnalyzer was initially implemented in Python 2.7, but it also appears to work well in Python 3.8.
The current application supports only .fasta files as input.
- To run the application, there must be exactly one fasta file inside the data/ folder. It is used as the input to the k-mer analyzer toolkit.
- The Output/ folder needs to be empty. Otherwise, the application will remove everything (files or subfolders) inside it. If the folder doesn't exist, it will be created automatically.
- Specify the parameters kmax and eval_factor inside the featuresExtraction.py script, in lines 21-22. The eval_factor parameter determines the strictness of the assessment of k-mers of each length. Recommended values for eval_factor lie in the interval [1, 2]. For optimal results, a value in [1.2, 1.5] is highly recommended. In any case, for values lower than 1 the application won't run properly.
- Execute the Python script featuresExtraction.py.
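The setup requirements above can be verified before launching the script. The following is a minimal sketch, not part of kmerAnalyzer: the function name preflight and its arguments are hypothetical, and it only assumes the folder names data/ and Output/ and the eval_factor lower bound stated above.

```python
from pathlib import Path


def preflight(project_dir, eval_factor):
    """Return a list of problems; an empty list means the setup looks ready.

    Hypothetical helper, not shipped with kmerAnalyzer.
    """
    project = Path(project_dir)
    problems = []

    # Exactly one .fasta file is expected inside data/.
    data_dir = project / "data"
    fasta_files = sorted(data_dir.glob("*.fasta")) if data_dir.is_dir() else []
    if len(fasta_files) != 1:
        problems.append(
            f"expected exactly one .fasta file in data/, found {len(fasta_files)}"
        )

    # Output/ may be missing (it is created automatically), but anything
    # already inside it would be removed by the application.
    output_dir = project / "Output"
    if output_dir.is_dir() and any(output_dir.iterdir()):
        problems.append("Output/ is not empty; its contents would be removed")

    # Values below 1 are reported not to work.
    if eval_factor < 1:
        problems.append("eval_factor must be at least 1")

    return problems
```

Calling preflight(".", 1.3) from the project directory returns an empty list when all three conditions hold. Otherwise each string in the result describes one condition to fix before running featuresExtraction.py.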
Assuming that the input file is called filename.fasta:
- Inside the Output/ directory there is a .csv file called clustData.csv, which is the data matrix that the analysis aims for. Every sequence is represented by a number of k-mer based features. The value of each feature is the number of times the corresponding k-mer was detected in that sequence. There is also a CSV file called heades_to_IDS.csv, which maps the header of each sequence in the fasta input to the code names ID-1, ID-2, etc.
- Inside the Output/filename/ sub-folder there are 3 csv files:
  - File output.csv contains the list that is generated from the k-mer tree. Every row represents a k-mer. The first column is the k-mer itself, the second is its length, the third is its frequency (the number of times it was detected in the input data), and the fourth is its evaluation in the tree.
  - The two remaining files record the sequences in which every k-mer appears and the number of times each k-mer appears in every sequence.
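To make the structure of a count matrix such as clustData.csv concrete, here is a toy sketch of per-sequence k-mer counting. It is an illustration only: it does not reproduce the k-mer tree or the evaluation step of kmerAnalyzer. The function names, the assumption of overlapping windows, and the hand-picked list of selected k-mers are all hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, kmax, kmin=1):
    """Count every k-mer of length kmin..kmax in one sequence.

    Assumes overlapping windows; the real tool may count differently.
    """
    counts = Counter()
    for k in range(kmin, kmax + 1):
        for start in range(len(sequence) - k + 1):
            counts[sequence[start:start + k]] += 1
    return counts


def count_matrix(sequences, kmers, kmax):
    """Build one row per sequence and one column per selected k-mer."""
    rows = []
    for seq in sequences:
        counts = kmer_counts(seq, kmax)
        # A Counter returns 0 for k-mers that never occur in the sequence.
        rows.append([counts[kmer] for kmer in kmers])
    return rows


sequences = ["ACGTAC", "ACACAC"]  # toy data, not from the example dataset
selected = ["AC", "ACA", "GT"]    # hypothetical k-mers kept as features
matrix = count_matrix(sequences, selected, kmax=3)
print(matrix)  # [[2, 0, 1], [3, 2, 0]]
```

Each row corresponds to one input sequence and each column to one k-mer, so the entry in row i and column j is the number of times k-mer j occurs in sequence i.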
- It's important to check the lengths of the sequences before executing kmerAnalyzer. For example, in the example dataset, the sequence with header >ERR525627.984.1 984 length=31 has length 31, so it is probably better to either exclude this sequence (data filtering) or examine lower k-values, e.g. up to 20. However, if we set kmax = 35, the code still seems to work properly.
- While kmerAnalyzer is executing, a folder called input is created inside the project directory, containing some files necessary for the execution process. The folder is deleted at the end of the process.
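The sequence-length check recommended above can be done with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the parser handles only plain-text FASTA with ">" header lines, and records that share an identical header would overwrite each other.

```python
def read_fasta_lengths(path):
    """Map each FASTA header (without the leading '>') to its sequence length."""
    lengths = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]
                lengths[header] = 0
            elif header is not None:
                # Sequences may span several lines; add up their lengths.
                lengths[header] += len(line)
    return lengths


def shorter_than(lengths, kmax):
    """Return the headers of sequences that are shorter than kmax."""
    return [header for header, n in lengths.items() if n < kmax]
```

For instance, passing the result of read_fasta_lengths("data/filename.fasta") to shorter_than with kmax=35 lists the sequences that cannot contain a k-mer of the maximum length. These are the candidates for filtering, or a reason to lower kmax.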
SARS-CoV-2 data have been downloaded from NCBI SARS-CoV-2 Resources.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License. See the LICENSE file for details.