# Oligo library (2021)

* Amhed Missael Vargas Velazquez
* Post-doctoral fellow, SGB lab, KAUST
* May, 2021

## Description
This Jupyter notebook has been adapted to contain the terminal commands used to produce a Twist oligo library for high-throughput analysis of C. elegans regulatory elements. The working environment is R, but most of the commands are actually run in the Unix terminal.

## Software requirements
The core instance running this script is R 4.0. However, the analyses are performed by multiple programs (handled via `system()` calls) which must be installed beforehand. In particular:

* [bedtools](https://bedtools.readthedocs.io/en/latest/)
* awk
* wget
* zcat
* gzip
* perl
* [**R**](https://cran.r-project.org/)

### R libraries
* ggplot2
* stringdist

### Prerequisites
#### Working directory
Before running any code, make sure to set a working directory where everything will happen, e.g.:

In [3]:
##Run the code once to avoid recursive creation of folders
#system("mkdir Oligo_library")
setwd("Oligo_library")
getwd()

Also create extra directories for saving the outputs:

In [6]:
system("mkdir bed_and_fastas")
system("mkdir bed_and_fastas/cel_reference")
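If the notebook is likely to be rerun, `mkdir -p` may be a convenient alternative: it creates nested directories in one call and does not complain when they already exist. A minimal terminal sketch, using the same directory names as above:

```shell
# -p creates missing parent directories and is silent if the target exists,
# so the command can be repeated safely
mkdir -p bed_and_fastas/cel_reference
# Running it a second time changes nothing and raises no error
mkdir -p bed_and_fastas/cel_reference
```

From R, the same line can be wrapped as `system("mkdir -p bed_and_fastas/cel_reference")`.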

#### Download initial data
To build the library, we need the genome sequence file (WormBase release WS280):

In [4]:
##Download
system("wget ftp://ftp.wormbase.org/pub/wormbase/releases/WS280/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS280.genomic.fa.gz")
##Uncompress
system("gzip -d c_elegans.PRJNA13758.WS280.genomic.fa.gz")

The genomic annotations (gff3) file:

In [5]:
system("wget ftp://ftp.wormbase.org/pub/wormbase/releases/WS280/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS280.annotations.gff3.gz")

And the data from the Ahringer lab's [RegAtlas](https://ahringerlab.com/RegAtlas/):

In [8]:
system("wget https://github.com/js2264/RegAtlas/raw/master/dashboard.Ahringer/releases/dashboard.Ahringer_v0.5.3/data/minimal-data.RData")

Verify that all the files have been downloaded:

In [3]:
list.files(".")

## Introduction
Our [lab](https://wormbuilder.org/) specializes in the development and scalability of molecular tools in *C. elegans*, e.g. MosTi or piRNAi (to name a few). As such, since 2019 we have advocated the use of large-scale oligo libraries for directed mutagenesis via CRISPR and for studying gene regulation (see MDJ thesis proposal or our CRG2020 grant). The gist is simple: we design a particular scaffold, decide which sequences to insert in it, and ask [Twist](https://www.twistbioscience.com/) to synthesize a dozen of them.

The aim of this notebook is to guide you through the construction of these libraries. First, we will extract our genomic sequences of interest using coordinates in [bed](https://genome.ucsc.edu/FAQ/FAQformat.html) format. Subsequently, we will parse the coordinates given predetermined parameters (see below); and finally, we will assemble everything in R.

### Extracting *C. elegans* gene coordinates from .gff3 files
`.gff3`, like `.bed`, is a tab-delimited file format that stores the coordinates of any element annotated in a given genome. Given its flexibility, [Wormbase](https://wormbase.org/) prefers it, although arguably fewer programs handle it; that is why we will use the bed format instead. The main difference to keep in mind between these two formats is that .gff3 uses a 1-based, inclusive coordinate system while .bed files are [0-based](https://www.biostars.org/p/84686/) and half-open, i.e. we have to subtract one from the start position of any genomic element in .gff3 while leaving the end position unchanged.
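The coordinate shift can be checked on a single toy record. The sketch below is only an illustration: the feature, its coordinates, and the file names `toy.gff3` and `toy.bed` are made up and are not part of the WS280 annotation.

```shell
# Toy GFF3 line (tab-separated, hypothetical feature):
# 1-based, inclusive coordinates 100..200, i.e. 101 bases
printf 'I\tWormBase\tCDS\t100\t200\t.\t+\t0\tID=CDS:toy\n' > toy.gff3

# BED is 0-based and half-open: start = GFF3 start - 1, end = GFF3 end (unchanged)
awk -F'\t' 'BEGIN{OFS="\t"} {print $1, $4-1, $5, "toy", ".", $7}' toy.gff3 > toy.bed

# The feature length is preserved: 200 - 99 = 101 bases
cat toy.bed
```

The resulting line starts at 99 and ends at 200, so `end - start` in bed gives the same length as `end - start + 1` in gff3.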

Also, note that as a matter of preference we will process the bed files with bash commands, though these files could also be handled in R (in particular with the [bedr](https://cran.r-project.org/web/packages/bedr/vignettes/Using-bedr.html) package).

#### *C. elegans* genebodies
The command below is composed of three main parts:
* Opening the file and extracting the CDSs (`zcat` and `awk`, `$3=="CDS"`)
* Splitting the relevant info of each CDS and ordering it via awk (`split` function)
* Keeping only the first and last CDS per transcript to produce a "gene body", i.e. coordinates from the start to the stop codon (included).
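The third step amounts to keeping, for every transcript, the smallest start and the largest end among its CDSs. Below is a simplified sketch of that idea on toy input; the transcript names `toyA.1` and `toyB.1`, the coordinates, and the file names are hypothetical, and the awk program is a reduced illustration rather than the exact command used in this notebook.

```shell
# Toy CDS records: chrom, start, end, transcript, score, strand
# (hypothetical transcripts; toyA.1 has three CDSs listed out of order)
printf 'I\t500\t600\ttoyA.1\t.\t+\nI\t100\t200\ttoyA.1\t.\t+\nI\t300\t400\ttoyA.1\t.\t+\nII\t50\t80\ttoyB.1\t.\t-\n' > toy_cds.bed

# For each transcript (column 4), remember the minimum start and maximum end.
# Adding 0 forces numeric comparison of the coordinates.
awk -F'\t' 'BEGIN{OFS="\t"}
  !($4 in chrom) {chrom[$4]=$1; start[$4]=$2+0; stop[$4]=$3+0; strand[$4]=$6; next}
  {if ($2+0 < start[$4]) start[$4]=$2+0; if ($3+0 > stop[$4]) stop[$4]=$3+0}
  END {for (t in chrom) print chrom[t], start[t], stop[t], t, ".", strand[t]}' toy_cds.bed | sort > toy_genebodies.bed

cat toy_genebodies.bed
```

Each transcript collapses to a single line: `toyA.1` spans 100 to 600 regardless of the order in which its CDSs appear, and `toyB.1` keeps its only CDS.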

In [8]:
##Wrap command in cmd
cmd = "zcat c_elegans.PRJNA13758.WS280.annotations.gff3.gz | awk -F\"\t\" '{if($2==\"WormBase\"){if($3==\"CDS\"){print $0}}}' - | awk -F\"\t|;|=\" '{print $1\"\t\"$4\"\t\"$5\"\t\"$12\"\t.\t\"$7}' | perl -pe 's/Transcript://g' | awk -F\"\t\" '{OFS=\"\t\"; split($4,trans,\",\"); for(i=1;i<=length(trans);i++){print $1,$2,$3,trans[i],$5,$6}}' | awk -F\"\t\" '{if(array[$4] != 0){if(start[$4] > $2){start[$4]=$2};if(end[$4] < $3){end[$4]=$3};array[$4]=$1\"\t\"start[$4]\"\t\"end[$4]\"\t\"$4\"\t\"$5\"\t\"$6}else{start[$4]=$2;end[$4]=$3;array[$4]=$0}}END{for(key in array){print array[key]}}' - | sort > Cel_GeneBodies.bed"
#run command
system(cmd)
#Verify file was produced
cat(head(readLines("Cel_GeneBodies.bed"),1))

I	10014740	10015437	C03C11.1.1	.	+

As shown above, the coordinates indicate the gene body for the transcript `C03C11.1.1`; however, it is much better to have the actual gene name. Let's extract the names from the gff3 as well via the `mRNA` annotations, i.e.:
* Open gff3 and get mRNA entries
* Split mRNA entries to get WBID, genename and transcript

In [9]:
##Wrap command in cmd
cmd = "zcat c_elegans.PRJNA13758.WS280.annotations.gff3.gz | grep -v \"#\" | grep \"WormBase\" | awk -F\"\t\" '{if ($3==\"mRNA\"){print $9}}'| awk -F\";\" '{split($1,ts,\":\"); split($2,gs,\":\"); split($5,loc,\"=\"); print ts[2]\"\t\"gs[2]\";\"ts[2]\";\"loc[2]}' | awk -F\"\t\" '{OFS=\"\t\"; print $1,$2,$2,$1}' > Cel_mRNAID.txt"
#run command
system(cmd)
#Verify file was produced
cat(head(readLines("Cel_mRNAID.txt"),1))

Y74C9A.3.1	WBGene00022277;Y74C9A.3.1;homt-1	WBGene00022277;Y74C9A.3.1;homt-1	Y74C9A.3.1

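Now it is only a matter of combining them via a hash (an awk associative array), subtracting one from the start so that the coordinates become 0-based, i.e.: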

In [11]:
##Wrap command in cmd
cmd = "awk -F\"\t\" '{OFS=\"\t\"; if(array[$4]==0){array[$4]=$2}else{print $1,($2-1),$3,array[$4],$5,$6}}' Cel_mRNAID.txt Cel_GeneBodies.bed > Cel_Genes.bed"
#run command
system(cmd)
#Verify file was produced
cat(head(readLines("Cel_Genes.bed"),1))

I	10014739	10015437	WBGene00007274;C03C11.1.1;	.	+
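The hash join can also be written with the common awk `NR==FNR` idiom, which separates the two input files explicitly instead of testing whether a key is already stored. This is a swapped-in variant shown on toy data, not the command used above; the IDs, labels, and file names are hypothetical.

```shell
# Toy lookup table: transcript <tab> label (hypothetical IDs)
printf 'toyA.1\tWBGeneToy1;toyA.1;toy-1\n' > toy_ids.txt
# Toy gene bodies with 1-based starts, as derived from GFF3 coordinates
printf 'I\t100\t600\ttoyA.1\t.\t+\nII\t50\t80\ttoyB.1\t.\t-\n' > toy_bodies.bed

# NR==FNR holds only while the first file is read: fill the hash, then skip ahead.
# For the second file, print transcripts present in the hash, swapping in the
# label and shifting the start to 0-based.
awk -F'\t' 'BEGIN{OFS="\t"}
  NR==FNR {label[$1]=$2; next}
  ($4 in label) {print $1, $2-1, $3, label[$4], $5, $6}' toy_ids.txt toy_bodies.bed > toy_genes.bed

cat toy_genes.bed
```

In this sketch `toyB.1` has no entry in the lookup table, so it is left out of the output; transcripts missing from the ID file are worth counting before relying on the joined bed file.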