# Prediction of Impacts

The following steps have to be repeated for every chromosome and every plasmid!

What we need for these steps:
- Data:
    - A BLAST database of protein sequences (we use the UniRef90 database)
    - The `.fasta` files from the SNP Annotation pipeline
- Tools:
    - `DeMaSk` variant effect predictor
    - `blastp`
    - `perl`

First, we need to find homologs for each of our proteins. We do this with the `demask.homologs` command.

In [3]:
python3 -m demask.homologs -s Prot.fasta -d /BLAST_Hits --blastp '/usr/bin/blastp' --db /UnirefBLASTDB/uniref90_DB -t 20

Using the homologs found in the previous step, we can now predict the impact scores with the `demask.predict` command.

In [5]:
for i in *.a2m; do python3 -m demask.predict -i "$i" -o "/DeMasK_$i.txt"; done

# Extract Impact Score for every Mutation

`DeMaSk` calculates an impact score for every possible substitution at every position, but we are only interested in the scores for the mutations we observed in our data. With the `awk` and `perl` scripts below, we extract the impact scores for these mutations.

In [24]:
ls | awk -F'_' '{print $3, $6, $9, $10, $11}' | awk -F'.' '{print $1}' > ImpactPosition.tab
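
To see what the two `awk` calls pick out, here is a minimal sketch run on a single file name. The name is hypothetical, reconstructed from the sample output at the end of this section; the real names depend on how the SNP annotation pipeline titles its sequences, and the meaning given to each field in the comments is likewise an assumption.

```shell
# Hypothetical file name, reconstructed from the sample output (an assumption)
name='DeMasK__1019336_T_C_1018673_1019800_+_221_S_P.a2m.txt'

# First awk: split on "_" and keep fields 3, 6, 9, 10 and 11 (these look like
# SNP position, gene start, protein position, reference and variant residue).
# Second awk: split on "." and keep the first part, which drops the extensions.
fields=$(printf '%s\n' "$name" | awk -F'_' '{print $3, $6, $9, $10, $11}' | awk -F'.' '{print $1}')

echo "$fields"
# prints: 1019336 1018673 221 S P
```

The separator is passed with `-F` because assigning `FS` inside the main action only takes effect from the second input line onwards.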

In [25]:
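perl -lne 'BEGIN { $fnr = 1 }
           if ($fnr == 1) {
               # First line of each file is the header: skip it and
               # remember the file name without its last extension
               ($fn = $ARGV) =~ s/\.[^.]+$//;
           } else {
               # Prefix every data line with the file name
               print "$fn\t$_";
           }
           $fnr++;
           # Reset the per-file line counter at the end of each input file
           if (eof) { $fnr = 1 }' *.txt > Chr1_Predict.tab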
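
For readers more at home in `awk`, the same idea (skip each file's header line, prefix every remaining line with the file name minus its last extension) can be sketched as below. This is an illustrative alternative, not the command used in this pipeline; the toy files and their values are made up.

```shell
# Throwaway directory with two toy prediction files, each with a header line
demo=$(mktemp -d)
printf 'pos\tWT\tvar\tscore\n1\tV\tA\t-0.2\n' > "$demo/protA.a2m.txt"
printf 'pos\tWT\tvar\tscore\n1\tM\tC\t-0.3\n' > "$demo/protB.a2m.txt"

# FNR restarts at 1 for every input file, so FNR > 1 skips each header;
# the last extension is stripped from FILENAME before it is prepended
combined=$(cd "$demo" && awk 'FNR > 1 { fn = FILENAME; sub(/\.[^.]+$/, "", fn); print fn "\t" $0 }' *.txt)

printf '%s\n' "$combined"
rm -r "$demo"
```

Because `awk` keeps the per-file counter `FNR` itself, no manual reset at end of file is needed.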

In [26]:
tr '_' '\t' < Chr1_Predict.tab | sed 's/\.a2m//g' > Chr1_Predict.tab.new

Finally, our data should look like this:

In [27]:
head Chr1_Predict.tab.new

DeMasK		1019336	T	C	1018673	1019800	+	221	S	P	1	V	A	-0.2028	1.2983	-16.6099	-0.1977
DeMasK		1019336	T	C	1018673	1019800	+	221	S	P	1	V	C	-0.2307	1.2983	-16.6099	-0.2342
DeMasK		1019336	T	C	1018673	1019800	+	221	S	P	1	V	D	-0.3694	1.2983	-16.6099	-0.4158
DeMasK		1019336	T	C	1018673	1019800	+	221	S	P	1	V	E	-0.3267	1.2983	-16.6099	-0.3599
DeMasK		1019336	T	C	1018673	1019800	+	221	S	P	1	V	F	-0.2799	1.2983	-16.6099	-0.2987
DeMasK		1019336	T	C	1018673	1019800	+	221	S	P	1	V	G	-0.3286	1.2983	-16.6099	-0.3624
DeMasK		1019336	T	C	1018673	1019800	+	221	S	P	1	V	H	-0.3724	1.2983	-16.6099	-0.4197
DeMasK		1019336	T	C	1018673	1019800	+	221	S	P	1	V	I	-0.0705	1.2983	-3.3221	-0.1133
DeMasK		1019336	T	C	1018673	1019800	+	221	S	P	1	V	K	-0.3656	1.2983	-16.6099	-0.4109
DeMasK		1019336	T	C	1018673	1019800	+	221	S	P	1	V	L	-0.2034	1.2983	-16.6099	-0.1985
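
The table above still lists every possible substitution: the sample rows are for position 1, while the observed mutation sits at protein position 221 (S to P). One possible last step is to keep only the rows whose predicted position and variant residue match the observed ones. The sketch below assumes the column layout seen in the sample output (column 9 = observed protein position, column 11 = observed variant residue, column 12 = `DeMaSk` position, column 14 = `DeMaSk` variant residue); the toy rows and their scores are made up.

```shell
# Two toy rows in the assumed 18-column layout (values are made up)
demo=$(mktemp)
printf 'DeMasK\t\t100\tT\tC\t90\t200\t+\t4\tS\tP\t4\tS\tA\t-0.10\t1.0\t-2.0\t-0.20\n'  > "$demo"
printf 'DeMasK\t\t100\tT\tC\t90\t200\t+\t4\tS\tP\t4\tS\tP\t-0.30\t1.0\t-5.0\t-0.40\n' >> "$demo"

# Keep only rows where the predicted substitution is the observed one:
# same position (col 12 vs col 9) and same variant residue (col 14 vs col 11)
observed=$(awk -F'\t' '$12 == $9 && $14 == $11' "$demo")

printf '%s\n' "$observed"
rm "$demo"
```

On the real data, the same filter would read `awk -F'\t' '$12 == $9 && $14 == $11' Chr1_Predict.tab.new > Chr1_Observed.tab`, where the output name `Chr1_Observed.tab` is a hypothetical choice.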
