#  Exam I0U19A - Management of Large-Scale Omics Data

June 2024

**Note:** 

* The exam is open book and open internet. However, the **use of any communication tool (phone, chat, mail, etc.) is strictly forbidden!**
* You are allowed to use GitHub during the exam - but do not post any comments.
* You may use your phone ONLY for authentication purposes (to access Toledo & the VSC)
* For all questions - even if you cannot finish the question - please provide comments describing what you are planning to do

The exam will be evaluated based on this notebook and the accompanying files uploaded to the Toledo Assignment. You are expected to upload the following files:

* The exam ipython notebook (`exam_I0U19A_June_2024.ipynb`) with your answers. (download using `Jupyter menu / File / Download`)
* An HTML copy of the above notebook (download using `Internet Browser menu / File / Save page as`)
* Your new Snakemake file (`Snakefile`)
* In general - **Make sure the plots you make are visible in this notebook before uploading to Toledo**
   
Please zip all files into one file with your r-number in the name: `rnumber.zip` - Note - Toledo does not allow the upload of .html files - so you must create an archive!

**Note:** you will be graded not only on the outcome of these exercises, but also on a number of criteria discussed during class, such as writing resilient code, running (simple) sanity checks, properly documenting your code, and making decent visualizations.

#### Preparation

**Make sure you work on your exam in a dedicated work folder**

Prior to starting the exam make sure you create a work folder:

```
mkdir -p $VSC_DATA/large_omics_exam_2024
cd $VSC_DATA/large_omics_exam_2024
```

**Data required**

Copy the data files to your work folder:

```
cd  $VSC_DATA/large_omics_exam_2024
cp -r /staging/leuven/stg_00079/teaching/exam_June_2024/* .
```

Among these files you will find the ipython exam notebook (`exam_I0U19A_June_2024.ipynb`). Continue working there.


**Terminal/Conda**

Do your (CPU intensive) command line work in a VSC interactive session. Please do not request too many cores or too much memory. This command was sufficient for me:

```
srun -n 1 -c 2 --mem 4G --time=7:00:00 -A lp_edu_large_omics -p interactive --cluster wice --pty bash -l
```

For **all** command line work (including snakemake) - make sure you use the correct conda environment by running the following in your shell:

    export PATH=/lustre1/project/stg_00079/teaching/I0U19a_conda_2024/bin/:$PATH
    
You can check if you have the correct environment loaded by running:

    which python
    
Which should yield `/lustre1/project/stg_00079/teaching/I0U19a_conda_2024/bin/python`


**Jupyter**

Ondemand settings (as used in class):

* cluster: Wice
* Account: lp_edu_large_omics
* Partition: Batch
* Number of hours: Duration of the exam +1hr
* Number of cores: 1
* Required memory per core: 3000 
* Number of nodes: 1
* Number of GPU's: 0

Ensure you use the correct kernel for the jupyter work! You can confirm you have the correct kernel by running (in python):

    import sys
    sys.executable
    
Which should yield `/lustre1/project/stg_00079/teaching/I0U19a_conda_2024/bin/python`

If not, please check the Toledo posts.

---

**After copying the data to your work folder, you will find a notebook called `exam_I0U19A_June_2024.ipynb` in the `$VSC_DATA/large_omics_exam_2024` folder - continue your work there.**

---

**Best of luck,**
Mark

---

    # check your kernel
    import sys
    sys.executable

### Imports

    import sqlite3
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import vcfpy

---

## Question 1 - Snakemake

In your exam folder you will find a snakemake folder containing the workflow definition (`snakemake/Snakefile`) and the tumor/control fastq data. This Snakefile is the same Snakefile we created during the course. The workflow has not been executed yet.

The objective of this question is to expand the Snakemake file to further annotate the final snpEff annotated VCF file using PhastCons conservation scores.

snpEff is a powerful tool - but it only predicts coding effects. We are also potentially interested in non-coding SNPs. One method to identify potentially interesting non-coding SNPs is by looking at conservation. Regions that are evolutionarily conserved are more likely to have a function, so non-coding SNPs in conserved regions are of more interest. We will be using [Phast/PhastCons](http://compgen.cshl.edu/phast/) to find such SNPs.

I already downloaded the database (for chr9 only) and a file indicating genome sizes. These files can be found in `/staging/leuven/stg_00079/teaching/phastcons`. 

We will be using `snpEff`'s sister tool `snpSift` to annotate our vcf file with the `phastCons` scores. `snpSift` is installed in our conda environment. I already added the location of the jar file & the phastCons database to the Snakemake file.

The goal of this exercise is to extend the Snakemake workflow to automatically annotate our final vcf file (`snakemake/050.snpeff/snps.annotated.vcf`) with the PhastCons scores. Note that you must create a new rule, and this rule must be executed automatically when running Snakemake without specifying a target.

**To prove you did this - please find and copy - from the resulting vcf file - the line containing the SNP on chromosome 9, position 129702113 in the cell below**

```
**copy paste the requested vcf line (from chr9, position 129702113) here**
```

**Note**:

 * `snpSift` is available in our conda environment
 * The phastCons data is in `/lustre1/project/stg_00079/teaching/phastcons`
 * You must add at least **one new [rule](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html)** to the `Snakefile`.
 * Make sure the PhastCons annotated vcf file ends up in a dedicated subfolder.
 * Ensure the new rule(s) get executed automatically when running Snakemake without specifying a rule.
 * Make sure your new `Snakefile` is part of the Toledo assignment upload.
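
Once the workflow has produced the PhastCons annotated vcf file, the requested line can be looked up from the notebook. The following is only a minimal sketch: the helper `find_vcf_record` is a hypothetical name, the toy records are made up, and it assumes the chromosome is written as `chr9` in your vcf file (it may be `9`, depending on the reference). For the real file, pass an open file handle to your own output vcf instead of the toy data.

```python
import io


def find_vcf_record(lines, chrom, pos):
    """Return the first VCF data line matching chrom/pos, or None if absent."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[0] == chrom and fields[1] == str(pos):
            return line.rstrip("\n")
    return None


# Toy VCF content standing in for the real annotated file (made-up values).
toy_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr9\t100\t.\tA\tG\t50\t.\tPhastCons=0.120\n"
    "chr9\t200\t.\tC\tT\t60\t.\tPhastCons=1.000\n"
)

record = find_vcf_record(toy_vcf, "chr9", 200)
print(record)
```

Returning `None` for a missing position (rather than raising) makes it easy to add a sanity check that reports when the SNP was not found at all.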

 ---

## Question 2 - Extending the SNP database

You will find the reference notebook we used in class (`ParseVCF.ipynb`) in this folder. The database it created (`snps.sqlite`) is also in the data folder.

The goal is to include the phastCons scores from the VCF file into the database so that we can use this for visualizations.

**Note:**

* Please use the `ParseVCF.ipynb` notebook only for reference. Write all extra code you need below this cell.
* Continue your work using the database included (`snps.sqlite`)
* Create a **new** table with the phastcons scores (and snp identifier)
* Make sure you sanity check your data. Are all scores between 0 and 1? Do all SNPs from the input file get a PhastCons score? What do you do with the SNPs that do not? Discuss your choices.
* To be sure you are not dependent on the last exercise - I provide a vcf file with the phastCons scores: `vcf_files/snps.phastcons.vcf`. For safety I also have the snpEff annotated vcf file available (`vcf_files/snps.annotated.vcf` - which you should not use to answer the question above!)
* Test it worked by writing a SQL statement that shows all SNPs with a phastcons score indicating perfect conservation
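
The steps in the notes above (new table, sanity checks, test query) could be sketched roughly as follows. This is an illustration under assumptions, not the reference solution: the table name `phastcons`, its columns, and the use of `(chrom, pos)` as SNP identifier are hypothetical and should be adapted to the schema created by `ParseVCF.ipynb`. It also assumes the score is stored in the vcf INFO field under the key `PhastCons`. The toy records are made up, and an in-memory database stands in for `snps.sqlite`.

```python
import sqlite3


def parse_phastcons(info):
    """Extract the PhastCons value from a VCF INFO string, or None if absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "PhastCons" and value:
            return float(value)
    return None


# Toy records (chrom, pos, INFO); values are made up for illustration.
toy_records = [
    ("chr9", 100, "DP=10;PhastCons=0.120"),
    ("chr9", 200, "DP=12;PhastCons=1.000"),
    ("chr9", 300, "DP=8"),  # SNP without a PhastCons score
]

db = sqlite3.connect(":memory:")  # in the exam: connect to snps.sqlite instead
db.execute(
    """CREATE TABLE IF NOT EXISTS phastcons (
           chrom TEXT NOT NULL,
           pos   INTEGER NOT NULL,
           score REAL,              -- NULL when the SNP has no score
           PRIMARY KEY (chrom, pos)
       )"""
)

rows = [(chrom, pos, parse_phastcons(info)) for chrom, pos, info in toy_records]

# Sanity checks: all scores within [0, 1]; report SNPs without a score.
scored = [score for _, _, score in rows if score is not None]
assert all(0.0 <= score <= 1.0 for score in scored), "score outside [0, 1]"
n_missing = len(rows) - len(scored)
print(f"{n_missing} of {len(rows)} SNPs have no PhastCons score")

with db:  # commit on success, roll back on error
    db.executemany("INSERT INTO phastcons VALUES (?, ?, ?)", rows)

# Test query: SNPs with a score indicating perfect conservation.
perfect = db.execute(
    "SELECT chrom, pos, score FROM phastcons WHERE score = 1.0"
).fetchall()
print(perfect)
```

Storing missing scores as `NULL` is only one option; leaving those SNPs out of the table is another. Either way, the choice should be stated and motivated in the notebook.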


---

## Question 3 - Visualization

Given the database you just generated - I would like you to investigate whether tumor-specific SNPs are located in more conserved regions. Do we see a difference in PhastCons scores when comparing SNPs present only in the tumor sample to other SNPs?

**Note:**
 * Argue why this might be biologically relevant. Would you only look at non-coding SNPs?
 * Argue your process & thinking while exploring the data.
 * **Make a plot!** (you do not need to do statistics).
 * Make sure your plot is visible in this notebook prior to uploading it to Toledo.
 * **Discuss your interpretation of the plot, double-check your conclusions, and adapt your visualization if required**
 * If you did not manage Question 2 you can request a copy of the database from me.
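
As a starting point for the plot, a comparison of score distributions per SNP group might look like the sketch below. It assumes you have already built a DataFrame with one row per SNP and two columns, here hypothetically named `group` (for example `tumor_only` versus `other`) and `score`; the toy values are made up. A box plot is only one possible choice, and a violin or strip plot may show the distribution better.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_score_distributions(df):
    """Draw one box per SNP group showing the PhastCons score distribution."""
    fig, ax = plt.subplots(figsize=(5, 4))
    groups = sorted(df["group"].unique())
    # Drop SNPs without a score; boxplot cannot handle missing values.
    data = [df.loc[df["group"] == g, "score"].dropna() for g in groups]
    ax.boxplot(data)
    # Show group sizes in the labels so small groups are not over-interpreted.
    ax.set_xticks(range(1, len(groups) + 1))
    ax.set_xticklabels([f"{g} (n={len(d)})" for g, d in zip(groups, data)])
    ax.set_ylabel("PhastCons score")
    ax.set_ylim(-0.05, 1.05)
    ax.set_title("PhastCons score per SNP group")
    return ax


# Toy data standing in for the result of a database query (made-up values).
toy = pd.DataFrame(
    {
        "group": ["tumor_only"] * 4 + ["other"] * 4,
        "score": [0.9, 0.8, 1.0, 0.1, 0.2, 0.0, 0.5, None],
    }
)
ax = plot_score_distributions(toy)
```

In a notebook the figure is rendered when the cell finishes; check that it is actually visible before saving and uploading.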
