
Code_for_Elsherbini_et_al.,_2023

If you find this repo useful for your work, kindly cite this publication and star the repo.

How did we download the SARS-CoV-2 sequences?

First, for the main dataset, covering sequences up to 3-Nov-2022 (n = 1,131,185): after we registered at GISAID,...

(screenshot: the GISAID download steps for the main dataset)

Clear, right? What about the validation dataset (n = 67,399), covering 4-Nov-2022 to 30-Nov-2023?

Here is an example of how we downloaded Clade G:

(screenshot: downloading Clade G from GISAID)

We did the same for the other clades.

How did we preprocess our input sequence data for the main dataset?

Firstly, we needed to unzip our data.

for file in *.tar.xz ; do tar -xvf "$file" ; done   # extract each clade archive

We have 9 files, each representing one clade; let's count the sequences in each clade.

for file in *.fasta ; do echo "$file $(grep -c '>' "$file")" >> count.txt ; done   # label each count with its file name

We decided to exclude clades S and O: the first had very few sequences relative to the smallest retained clade (< 10% of the GV clade), and the second represents unclassified/noisy isolates; neither belongs to the trajectory of virus evolution. Also, there was no sequence data for very early clades such as L and V.

OK, now we need to normalize our data.

Open count.txt and, based on the smallest clade, subsample each clade using the seqtk tool.

In our case, the smallest was the GV clade, with 185,207 sequences.

for file in *.fasta ; do seqtk sample -s185207 "$file" 185207 > "sample_$file.fa" ; done   # -s sets the random seed; the second 185207 is the sample size

After subsampling, make sure the counts are as expected. So, let's count again.

for file in *.fa ; do echo "$file $(grep -c '>' "$file")" >> count.txt ; done   # recount the subsampled files
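If you prefer a check outside the shell, here is a minimal Python sketch (not part of the repo) that recounts the headers in every sampled .fa file and flags any file that does not contain the expected 185,207 sequences.

# check_counts.py -- hypothetical helper, not part of the repo
import glob

EXPECTED = 185207  # the size of the smallest clade (GV), used as the per-clade target

for path in sorted(glob.glob("*.fa")):
    with open(path) as handle:
        n = sum(1 for line in handle if line.startswith(">"))
    print(f"{path}\t{n}\t{'OK' if n == EXPECTED else 'MISMATCH'}")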

How did we produce the di- and tri-nucleotide signals using GenoSig?

Independently of this work, this step can be used to produce di- and tri-nucleotide signals in just two steps.

Important: GenoSig is a variant of PaSiT (a.k.a. GenDisCal), and both were developed by Gleb Goussarov.

On a normal PC (24 GB RAM, Intel Core i5-8265U CPU @ 1.60 GHz), it took ~7 minutes (wall-clock time) for a dataset of ~37.8 GB.

You do not need to install anything: download GenoSig.zip (attached) and unzip it. Then just put your FASTA files in the "All" directory and run the command line.

perl genosig.pl

Then, clean the output with this loop:

for file in *.csv ; do
    sed -i '/nan/d' "$file"                    # drop any line with nan values
    sed -i 's/^.*hCoV-19/hCoV-19/' "$file"     # strip everything before the hCoV-19 ID
    sed -i 's/\.AltKarlinSignature//' "$file"  # remove the ".AltKarlinSignature" suffix
    sed -i 's/^/>/' "$file"                    # prepend ">" so each ID looks like a FASTA header
    cut -f1 < "$file" > "data_$file.fasta"     # keep only the ID column: a headers-only FASTA
done

This loop does four things: it removes any line with nan values from the GenoSig output, strips any text (such as a duplicated "hCoV-19/") before the sequence ID, removes the ".AltKarlinSignature" string, and prepends ">" to each ID.

But it also produces a FASTA file with just the headers. Why?

This is Step.1 in the next phase.

How did we prepare our MetaData?

Back again to the SARS-CoV-2 project...

This is a crucial step. The MetaData in our work consists of two pieces of information: the clade and the continent. The clade was inferred from the name of the FASTA file, while the continent needed a lot more work.

How did we get the continent?

From the header of the FASTA file (>SARS-COV-3356667-EGYPT-....) you can find the country, right? So, using a Python script, we looped over each header in all of the files and, based on a country-continent association file (a.k.a. guide.csv), determined for each sequence which continent it belongs to.

The following steps explain this process in detail.

Step.1:

This command extracts the first column to make a FAKE file with a ".fasta" extension, which can be used to extract the MetaData (a.k.a. the y data in the Machine Learning section below). Make sure that the ID column starts with >xxxxxxx in both files; the ">" at the start of the names was added by the sed step above.

for file in *.csv ; do cut -f1 < "$file" > "data_$file.fasta" ; done

Small trick: name them all so they start with "fake".

Then get the MetaData (a.k.a. the y data in our workflow) out of the x files (the di+tri signal data files) using the loop below.

Step.2:

So, let's extract the MetaData. Make sure you have the file named guide.csv in the same directory as the Python script.

Important: make sure that the name of each file is its clade.

for file in *.fasta ; do python get_y_data.py -i $file ; done
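For illustration only, here is a minimal sketch of the logic such a script could implement; it is hypothetical, not the repo's get_y_data.py. It assumes guide.csv holds country,continent pairs, that the country appears as a "-"-separated token in each header (as in the example above), and that the clade is encoded in the file name.

# get_y_data_sketch.py -- hypothetical illustration, not the repo's get_y_data.py
import argparse
import csv
import os

parser = argparse.ArgumentParser()
parser.add_argument("-i", required=True, help="headers-only FASTA file named after its clade")
args = parser.parse_args()

# assumed guide.csv layout: country,continent (one pair per line)
with open("guide.csv") as handle:
    country_to_continent = {c.upper(): cont for c, cont in csv.reader(handle)}

clade = os.path.splitext(os.path.basename(args.i))[0]  # the file name encodes the clade

with open(args.i) as fasta, open(f"y_data_{clade}.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["ID", "clade", "continent"])
    for line in fasta:
        if line.startswith(">"):
            header = line.strip()
            # look for a known country among the "-"-separated tokens of the header
            tokens = header.upper().split("-")
            continent = next((country_to_continent[t] for t in tokens
                              if t in country_to_continent), "Unknown")
            writer.writerow([header, clade, continent])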

Step.3:

Let's merge.

  # FNR==1 && NR!=1 skips the header line of every file after the first
  nawk 'FNR==1 && NR!=1{next;}{print}' *.csv > x_data.csv
  nawk 'FNR==1 && NR!=1{next;}{print}' y_data* > y_data.csv

Always remember that the X file holds the di- and tri-nucleotide frequency signals and the Y file holds the MetaData.
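Before training, it is worth confirming that the two files line up row-for-row. Here is a minimal pandas sketch, not part of the repo; it assumes the sequence ID is the first column of both tables and that x_data.csv is tab-separated (implied by the cut -f1 step above).

# hypothetical alignment check between the signal (x) and MetaData (y) tables
import pandas as pd

x = pd.read_csv("x_data.csv", sep="\t")  # assumed tab-separated
y = pd.read_csv("y_data.csv")

assert len(x) == len(y), f"row counts differ: x={len(x)}, y={len(y)}"
# the sequence ID is assumed to be the first column of both tables
assert (x.iloc[:, 0].values == y.iloc[:, 0].values).all(), "IDs are not aligned row-for-row"
print(f"OK: {len(x)} rows, IDs aligned")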

How did we prepare our Machine Learning models?


Well, there are plenty of GUI/CLI platforms for ML training and testing; use whatever platform you like according to your skills and preferences.

Important: the ML scripts in our work were initially developed by Ahmed M. Elshal; the DL scripts and the finalization are by Youssef M. Fadel.

In this work, mainly with Python 3 and scikit-learn, we compared ML algorithms using compare_ML_models.zip. As Random Forest (RF) turned out to be the best model, we made ML_anlysis.zip, which trains and tests using RF only.
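For a flavor of what such a comparison looks like, here is a minimal scikit-learn sketch; it is not the script inside compare_ML_models.zip, and the file layout and "clade" label column are assumptions consistent with the steps above.

# minimal model-comparison sketch; not the repo's compare_ML_models script
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x = pd.read_csv("x_data.csv", sep="\t").iloc[:, 1:]  # drop the ID column, keep the di/tri frequencies
y = pd.read_csv("y_data.csv")["clade"]               # assumed label column name

models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, n_jobs=-1),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, x, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")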

To get started with our ML work, follow these steps:

Clone the repository

git clone https://github.com/AhmedElsherbini/Code_for_Elsherbini_et_al_2023.git
cd Code_for_Elsherbini_et_al_2023
cd ML_anlysis

Create the conda environment with the dependencies

conda env create -f enviroment.yml

Activate the conda environment

conda activate GenoSig_MLDL

Check the table below:

  • Note: arguments are case-sensitive; type each mode name exactly as listed below.

Argument   Function
-x         The input data (features) for the model
-y         The label/target/class that the model will predict
-m         The tool mode: one of PCA, ML_train, ML_test, DL_train, or DL_test
-s         The number of splits required for cross-validation

Examples:

python3 run.py -m PCA -x ./data/x_data.csv -y ./data/y_data.csv 

DL and RF were the best; therefore, we used them for the final models.

Take care: RF with 100 estimators is quite a bit slower than DL, especially with cross-validation.

python3 run.py -m DL_train -x ./data/x_data.csv -y ./data/y_data.csv -s 10

To test the whole dataset against the trained model (we used the output afterwards for the confusion matrix and to draw some chord graphs):

python3 run.py -m DL_test -x ./data/x_data.csv -y ./data/y_data.csv

How did we make the confusion matrix and Chord diagram?

In Excel or a CSV file, we simply copied the ground truth and compared it to the predicted class from the previous "run.py -m DL_test ..." command.

Then we used the R script named confusion_matrix.html (available above in the repo).
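If you prefer Python over the spreadsheet step, here is a minimal scikit-learn sketch; predictions.csv and its "truth"/"predicted" column names are hypothetical.

# hypothetical confusion-matrix computation from ground truth vs. predicted classes
import pandas as pd
from sklearn.metrics import confusion_matrix

df = pd.read_csv("predictions.csv")  # assumed columns: "truth" and "predicted"
labels = sorted(df["truth"].unique())
cm = confusion_matrix(df["truth"], df["predicted"], labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))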

How did we make the statistics?

The R script, available above in the repo, is named Stat.html.

Contributing

Pull requests are very welcome.

For major changes, please open an issue (or Email me) first to discuss what you would like to change.

Contact me directly by email: drahmedsherbini@yahoo.com

About

Utilizing genomic signatures to gain insights into the dynamics of SARS-CoV-2 through Machine and Deep Learning techniques
