## Walkthrough for annotating a mutation list with database data using annovar and clinscore

#### Oftentimes you receive a mutation list of some sort and need to annotate it with database info to get an idea of the position and relevance of each mutation. More often than not, this task can be subdivided into several or all of the following steps:
+ wrangle the data into the format (ID) Chr Start End Ref Alt .....
+ if the coords are on hg19, convert them to hg38 to use this annotation set (the same procedure can be applied if converting back to hg19 is required)
+ annotate with annovar on the command line using any of a set of databases
+ apply clinscore calculation to get relevance of mutations in a clear (scalar) fashion

#### init paths and code base

In [None]:
# some sensible settings for better output
import os
import pandas as pd
from IPython.display import display
pd.set_option('display.max_columns', None)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('max_colwidth', 200)


# get the code
import sys
sys.path.append('../code')
from script_utils import show_output

# paths from environ file
home = os.environ['HOME']
# you need static files for the annovar annotation
static_path = os.path.join(os.environ['STATIC'], "annotation/clinical")
# here, set the working directory where your files are saved

workdir = os.path.join("../")

### load mutation file and inspect

In [None]:
df = pd.read_excel(os.path.join(workdir, "testdata/test_mutations_hg19.xlsx"))
df

There are a lot of columns and information right now. The last rows are leftovers from the Excel transformation, so we get rid of them with `iloc`. To keep things simple, we also drop every additional column and keep only the columns needed for annotation.
The coords are stored in the `Pos.` column and the mutation in the `Nuc Change` column, so both the row and the column selection can be done in a single `iloc` call.

In [None]:
# keep the first 16 rows and only the columns at positions 4 and 6 (`Pos.` and `Nuc Change`)
df = df.iloc[:16, [4,6]]
# df = df.iloc[:15, :]
df

### wrangle
Now we extract the relevant information with `str.extract` and a regex with named groups. The patterns will be different every time; if things get too complicated, consider chaining several extracts or even modifying the source file itself. Try to be efficient here!

In [None]:
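# the named groups come back as columns in the order Chr, Start, so assign them in that order
df.loc[:,["Chr", "Start"]] = df['Pos.'].str.extract(r"(?P<Chr>chr[0-9]+):g\.(?P<Start>[0-9]+)")
# single-base substitutions only; other notations stay NaN and need their own pattern
df.loc[:,["Ref", "Alt"]] = df['Nuc Change'].str.extract(r"(?P<Ref>[ACTG]) -> (?P<Alt>[ACTG])")
# for a single-base substitution, End equals Start
df.loc[:, "End"] = df['Start']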
# df = df.loc[:,["Chr", "Start", "End", "Ref", "Alt"]]
df
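
The extraction step can be tried out on a small toy table before running it on real data. The sketch below uses made-up positions and an extra, deliberately malformed row (both are assumptions for illustration, not part of the test file) to show how unmatched rows surface as `NaN` and can be inspected before they silently disappear.

```python
import pandas as pd

# made-up toy rows in the same layout as the `Pos.` and `Nuc Change` columns
toy = pd.DataFrame({
    "Pos.": ["chr1:g.12345", "chr12:g.67890", "not a position"],
    "Nuc Change": ["T -> G", "C -> A", "T -> G"],
})

# each named group becomes one output column
coords = toy["Pos."].str.extract(r"(?P<Chr>chr[0-9]+):g\.(?P<Start>[0-9]+)")
change = toy["Nuc Change"].str.extract(r"(?P<Ref>[ACTG]) -> (?P<Alt>[ACTG])")
parsed = pd.concat([coords, change], axis=1)

# rows where any pattern failed to match contain NaN and need a closer look
unmatched = parsed[parsed.isna().any(axis=1)]
parsed = parsed.dropna().copy()

# extracted values are strings, so convert the coordinate to an integer
parsed["Start"] = parsed["Start"].astype(int)
parsed["End"] = parsed["Start"]
parsed = parsed[["Chr", "Start", "End", "Ref", "Alt"]]

print(parsed)
print(unmatched)
```

Note that the pattern `chr[0-9]+` only matches numbered chromosomes; a list containing `chrX` or `chrY` would need a wider character class.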

### convert to hg38
Although there are some tools that do this automatically, I still use the UCSC website for conversion. For that, the coords have to be converted to the `Chr:Start-End` position format accepted by the website. This is always tedious and I use the helpers provided by the repo.
Here are the steps:
+ create the coords using the `pos2bed` helper. Set the option `as_string=True` and print the output; the printed string has no dataframe index, so it can be copied directly into the browser
+ copy to the website and create the conversion
+ reintroduce the hg38 coords into the file

In [None]:
from pyseq_utils import pos2bed
print(pos2bed(df, as_string=True))
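
For readers without the repo at hand, the idea behind this step can be sketched in a few lines. `to_region_strings` below is a hypothetical stand-in written for illustration; it is not the repo's `pos2bed` and may differ from it in details. The coordinates are made up.

```python
import pandas as pd

def to_region_strings(df, as_string=False):
    """Hypothetical helper: build `Chr:Start-End` strings from three columns."""
    regions = (
        df["Chr"].astype(str) + ":"
        + df["Start"].astype(str) + "-"
        + df["End"].astype(str)
    )
    # a newline-joined string prints without the dataframe index
    return "\n".join(regions) if as_string else regions

toy = pd.DataFrame({
    "Chr": ["chr1", "chr12"],
    "Start": [1000, 2000],
    "End": [1000, 2000],
})
text = to_region_strings(toy, as_string=True)
print(text)
```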

+ Now copy the coords to the clipboard and paste them into the [UCSC LiftOver website](https://genome.ucsc.edu/cgi-bin/hgLiftOver)
+ click on `View conversions` and retrieve the file from your download folder. It should be named something like `hglft_genome....be`

In [None]:
hg38 = pd.read_csv(os.path.join(home, "Downloads/hglft_genome_1181f_fb3d20.bed"), sep="\t", names=['hg38'])
hg38

`bed2pos` converts the column back into Chr, Start and End columns that can be reinserted into the dataframe. This only works by index, so make sure that the indices of your original df have not changed and are in a reset state!!

In [None]:
from pyseq_utils import bed2pos
df.loc[:, ["Chr", "Start", "End"]] = bed2pos(hg38['hg38'])
df = df.loc[:, ['Chr', 'Start', 'End', 'Ref', 'Alt']]
df
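
Because the reinsertion relies on matching indices, a quick sanity check before assigning can save some head-scratching. The sketch below uses a hypothetical `from_region_strings` helper (an illustration, not the repo's `bed2pos`) and made-up coordinates. It stops if the converted file has a different number of rows than the input, for example because a position could not be converted, and resets the index before combining.

```python
import pandas as pd

def from_region_strings(regions):
    """Hypothetical stand-in for `bed2pos`: split `Chr:Start-End` strings into columns."""
    pos = regions.str.extract(r"(?P<Chr>[^:]+):(?P<Start>[0-9]+)-(?P<End>[0-9]+)")
    return pos.astype({"Start": int, "End": int})

# toy input whose index is NOT in a reset state
mutations = pd.DataFrame({"Ref": ["T", "C"], "Alt": ["G", "A"]}, index=[3, 7])
lifted = pd.Series(["chr1:1000-1000", "chr12:2000-2000"])

# a row-count mismatch means rows can no longer be paired by position
if len(lifted) != len(mutations):
    raise ValueError("liftover returned a different number of rows than the input")

# reset the index so that both tables line up row by row
mutations = mutations.reset_index(drop=True)
result = pd.concat([from_region_strings(lifted), mutations], axis=1)
print(result)
```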

### annotate data with annovar
Annovar is a very comprehensive tool for annotation of mutations using various databases. See [here](https://annovar.openbioinformatics.org/en/latest/) for all you need to know. I provide a static file for genomic annotation that contains (among others) updated databases for use with annovar.
For downloading the static files, do this from the root folder:
+ `$ . setup/download_static.sh <path-to-static-folder>  # provide a folder for downloading and expanding ~40GB of data`
+ for annovar to work, the path to the static folder has to be set in an environment variable STATIC: 
    * `$ export STATIC=<path-to-static-folder>`
    * For making this permanent, append this line to the .bash_profile file in your HOME folder:
    * (`$ echo "export STATIC=<path-to-static-folder>" >> "${HOME}/.bash_profile"`)

This will take some time as it is downloading and unpacking a 40GB file!!

Next, you have to configure the annovar_config.yaml file, giving it
   + the absolute path to the annovar folder sitting in this repo at ./code/anno2019. This depends on where your repo is residing.
   + the path in the STATIC folder to the annovar database (only change this if you moved the humandb folder somewhere else relative to the static folder)
   + a list of databases from the annovar database to use for populating the mutations
        * see the list of currently stored databases like so:`$ ls ${STATIC}/hg38/annotation/annovar/humandb/*.txt`
        * if you want the database hg38_icgc29.txt to be used, just list "icgc29" in the yaml file
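
The naming rule from the last two bullets (file `hg38_icgc29.txt` becomes entry `icgc29`) can also be applied from within the notebook. The helper `db_names` below is a hypothetical convenience function, not part of the repo; it only assumes the `<build>_<name>.txt` file naming and the humandb location described above.

```python
import os
from glob import glob

def db_names(paths, build="hg38"):
    """Hypothetical helper: turn `<build>_<name>.txt` paths into database names."""
    prefix = f"{build}_"
    names = []
    for path in paths:
        stem, ext = os.path.splitext(os.path.basename(path))
        # skip files of other builds and anything that is not a .txt database
        if ext == ".txt" and stem.startswith(prefix):
            names.append(stem[len(prefix):])
    return sorted(names)

# location as described in the text; yields an empty list if STATIC is not set
humandb = os.path.join(os.environ.get("STATIC", ""), "hg38/annotation/annovar/humandb")
print(db_names(glob(os.path.join(humandb, "*.txt"))))
```

The printed names are the strings that would go into the database list of the yaml file.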

Now you can run the annovar tool. `run_annovar` is a convenience wrapper around the command line tool (written in perl), so you only have to worry once about providing the absolute path to the annovar code folder.
This will take some time, depending on the number of mutations and on the number and size of the databases used. You can see the long command line call in the cell output.

In [None]:
from anno import run_annovar
config_file = "../configs/annovar_config.yaml"
                         
df_anno = run_annovar(df, config_file, threads=10, cleanup=False)


### The annovar result
You can inspect the output and see that, depending on the databases used, you have several new columns in your output dataframe.
Since not all of them are needed, it is wise to keep only a selection of the columns.

In [None]:
df_anno[:3]

In [None]:
df_anno = df_anno.loc[:, ['Chr', 'Start', 'End', 'Ref', 'Alt', 'Func.ensGene34', 'Gene.ensGene34',
        'ExonicFunc.ensGene34',
       'cytoBand', 'gnomAD_exome_ALL', 'dbSNP154_AltFreq',
       'Mut_ID', 'count', 'type',
       'icgc29_ID', 'icgc29_Affected', 'CLNSIG']]
df_anno
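
Since the available columns depend on the databases listed in the config, a hard-coded selection raises a `KeyError` as soon as one of them is missing. A more forgiving variant is sketched below; `keep_existing` is a hypothetical helper and the toy frame only mimics a few of the column names above.

```python
import pandas as pd

def keep_existing(df, wanted):
    """Hypothetical helper: select the wanted columns that exist, report the rest."""
    present = [col for col in wanted if col in df.columns]
    missing = [col for col in wanted if col not in df.columns]
    return df.loc[:, present], missing

# toy annotation result with made-up values and one unwanted column
toy_anno = pd.DataFrame({
    "Chr": ["chr1"], "Start": [1000], "End": [1000],
    "Ref": ["T"], "Alt": ["G"], "cytoBand": ["1p36.33"], "extra": [0],
})
wanted = ["Chr", "Start", "End", "Ref", "Alt", "cytoBand", "CLNSIG"]

selected, missing = keep_existing(toy_anno, wanted)
print(selected)
print("not in annotation output:", missing)
```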

## include gene scores to clinscore

In [None]:
from clinscore import get_cosmic_score

clinscore_file = "../configs/clinscoreLung.yaml"
df = get_cosmic_score(df_anno, cosmic_weights_file=clinscore_file, verbose=1)
df

In [None]:
df.to_excel("/Users/martinszyska/Dropbox/Icke/Work/LO/Sequencing/LO_SequencingClaudia/M162-22AB_hg38.xlsx", sheet_name="hg38_annotated", index=False)