# Nonsense-mediated decay in sex-biased alternative splicing

This notebook uses **gencode.v30.annotation.gtf**, the associated genome assembly **GRCh38.p12.genome.fa**, and the **fromGTF.SE.txt** file from the rMATS 3.2.5 experiment to generate a single output file, **NMD_summary.txt**.

## 1. Library Dependencies

In [1]:
suppressWarnings({suppressMessages({
library(Biostrings)
library(rtracklayer)
})})

## 1.1 Obtain the appropriate genome assembly 

The rMATS 3.2.5 experiment was run with GENCODE release 30, so the genome FASTA file is taken from the same release:

ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/GRCh38.p12.genome.fa.gz

In [2]:
# check for the unzipped file, because gunzip removes the .gz archive
if (!file.exists("../data/GRCh38.p12.genome.fa")) {
    message("downloading genome assembly associated with the Gencode release 30 - GRCh38.p12.genome.fa.gz\n")
    system("wget -O ../data/GRCh38.p12.genome.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/GRCh38.p12.genome.fa.gz")
    message("Done!\n")
    message("Unzipping compressed file GRCh38.p12.genome.fa.gz..")
    system("gunzip ../data/GRCh38.p12.genome.fa.gz", intern = TRUE)
    message("Done! GRCh38.p12.genome.fa can be found in ../data/")
}
fasta.file <- "../data/GRCh38.p12.genome.fa"

## 1.2 Obtain the gencode.v30.gtf file

The rMATS 3.2.5 experiment used the gencode.v30.annotation.gtf annotation file.

In [3]:
#
# download the annotation that was used for rMATS; it supplies the chr information
# for the summary data later. Check for the unzipped file, because gunzip removes the .gz archive
#
if (!file.exists("../data/gencode.v30.annotation.gtf")) {
    message("downloading gencode v30 annotation\n")
    system("wget -O ../data/gencode.v30.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz")
    message("Done!\n")
    message("Unzipping compressed file gencode.v30.annotation.gtf.gz..")
    system("gunzip ../data/gencode.v30.annotation.gtf.gz", intern = TRUE)
    message("Done! gencode.v30.annotation.gtf can be found in ../data/")
}

### 1.2.1 Creating the internal data structure for the gencode file

Importing the GTF file with `rtracklayer::import` rearranges its contents, which causes problems for gffread and other downstream applications. To avoid this rearrangement, the file is read with plain `read.table` instead.

In [4]:
annotation.gtf <- read.table("../data/gencode.v30.annotation.gtf",sep='\t',quote="")
colnames(annotation.gtf) <- c("chr", "source","type","start","end","V6","strand","V8","V9")
head(annotation.gtf,2)

| | chr | source | type | start | end | V6 | strand | V8 | V9 |
|---|---|---|---|---|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| 1 | chr1 | HAVANA | gene | 11869 | 14409 | . | + | . | gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2"; |
| 2 | chr1 | HAVANA | transcript | 11869 | 14409 | . | + | . | gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1"; |
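
The ninth column (`V9`) packs all attributes of a record into one string, so an identifier such as `gene_id` or `transcript_id` has to be parsed out before it can be used. The following is a minimal sketch of one way to do that with base R regular expressions; `get_gtf_attribute` is a hypothetical helper that is not part of the notebook, and it only handles quoted attributes (not unquoted ones such as `level 2`).

```r
# Hypothetical helper (not part of the notebook): extract one quoted attribute,
# e.g. gene_id or transcript_id, from GTF attribute strings such as column V9.
get_gtf_attribute <- function(attributes, key) {
    attributes <- as.character(attributes)   # V9 may have been read as a factor
    pattern <- paste0('.*\\b', key, ' "([^"]*)".*')
    hit <- grepl(pattern, attributes, perl = TRUE)
    value <- rep(NA_character_, length(attributes))   # NA when the key is absent
    value[hit] <- sub(pattern, "\\1", attributes[hit], perl = TRUE)
    value
}

# Shortened copies of the two attribute strings shown in the preview above
v9 <- c('gene_id "ENSG00000223972.5"; gene_name "DDX11L1"; level 2;',
        'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";')

gene.ids       <- get_gtf_attribute(v9, "gene_id")
transcript.ids <- get_gtf_attribute(v9, "transcript_id")
```

Gene records carry no `transcript_id`, so the helper returns `NA` for them rather than failing.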


## 1.3 Import the rMATS 3.2.5 fromGTF definition file

In [5]:
from.gtf <- read.table(file = "../data/fromGTF.SE.txt", sep = "\t", quote = "\"", header = T, stringsAsFactors = F)
head(from.gtf,2)

| | ID | GeneID | geneSymbol | chr | strand | exonStart_0base | exonEnd | upstreamES | upstreamEE | downstreamES | downstreamEE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1 | ENSG00000034152.18 | MAP2K3 | chr17 | + | 21287990 | 21288091 | 21284709 | 21284969 | 21295674 | 21295769 |
| 2 | 2 | ENSG00000034152.18 | MAP2K3 | chr17 | + | 21303182 | 21303234 | 21302142 | 21302259 | 21304425 | 21304553 |
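
As the column name `exonStart_0base` indicates, rMATS reports exon starts zero-based, whereas GTF coordinates are one-based and inclusive. This is why the main program adds 1 to every start coordinate before comparing it with the annotation, while end coordinates are compared as they are. A small sketch of the conversion, using the first event from the preview above:

```r
# First skipped-exon event from the fromGTF.SE.txt preview
event <- data.frame(exonStart_0base = 21287990L, exonEnd = 21288091L)

# GTF rows are 1-based and inclusive, so only the start shifts by one
gtf.start <- event$exonStart_0base + 1L
gtf.end   <- event$exonEnd

# The exon length in nucleotides is the same under either convention
exon.length.nt <- event$exonEnd - event$exonStart_0base   # 0-based start, 1-based end
stopifnot(exon.length.nt == gtf.end - gtf.start + 1L)     # 1-based, inclusive
```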


## 1.4 Create the matrix for the nonsense-mediated decay results

In [6]:
res<-matrix(c(rep(0,nrow(from.gtf)),rep(0,nrow(from.gtf)),rep('',nrow(from.gtf))),ncol=3)
colnames(res)<-c('num.nmd','num.transcripts','nmd.ids')
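
Because one of the three columns holds text, R coerces the whole matrix to character, which is why the main program wraps the counters in `as.integer()` before incrementing them. The sketch below illustrates this on a three-row stand-in; `res.demo` and `res.df` are illustrative names, and the `data.frame` layout is shown only as a possible alternative, not as what the notebook uses.

```r
n <- 3   # stand-in for nrow(from.gtf)
res.demo <- matrix(c(rep(0, n), rep(0, n), rep('', n)), ncol = 3)
colnames(res.demo) <- c('num.nmd', 'num.transcripts', 'nmd.ids')

# The zeros were coerced to the string "0"
storage.type <- typeof(res.demo)

# Incrementing therefore needs an explicit conversion, as in the main loop
res.demo[1, 'num.transcripts'] <- as.integer(res.demo[1, 'num.transcripts']) + 1

# Possible alternative (an assumption, not the notebook's choice):
# a data.frame keeps the counters numeric and the IDs as text
res.df <- data.frame(num.nmd = integer(n), num.transcripts = integer(n),
                     nmd.ids = character(n), stringsAsFactors = FALSE)
res.df$num.transcripts[1] <- res.df$num.transcripts[1] + 1L
```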

## 2.0 Main program to obtain the NMD_summary.txt

In [None]:
finished=0

message(" dim (cur.gtf) before removing chrY ")
dim(annotation.gtf)

# chromosome Y excluded from analysis
annotation.gtf <- annotation.gtf[annotation.gtf$chr != "chrY", ]
message("dim (annotation.gtf) without chrY ")
dim(annotation.gtf)

# loop through one chromosome at a time through the fromGTF.SE.txt file from rMATS 3.2.5
for (chr in as.character(unique(from.gtf$chr))) {
    message("looking at chr -> ", chr)
    #
    # limit the annotation to the chr of interest
    #
    cur.gtf      <- annotation.gtf[annotation.gtf$chr == chr, ] 

    #
    # iterate over the current fromGTF (also limited to chr of interest)
    #
    for (exon.itr in ((1:nrow(from.gtf))[from.gtf$chr==chr])) {
 
        #
        # only consider those transcripts in the reference that match the exon of interest
        #
        exon.rows <- which((cur.gtf$start== from.gtf$exonStart_0base[exon.itr]+1) & 
                           (cur.gtf$end  == from.gtf$exonEnd        [exon.itr]) & 
                            cur.gtf$type == 'exon')
        
        for (exon.row in exon.rows) {
            transcript.first.row <- max (which((cur.gtf$type   =='transcript') & 
                                               ((1:nrow(cur.gtf)) < exon.row)  ))
            
            #
            #  find the last row of this transcript: if no transcript or gene
            #  record follows the exon, the transcript runs to the end of the table
            #
            if (sum((cur.gtf$type %in% c('transcript','gene')) & 
                    ((1:nrow(cur.gtf)) > exon.row)  )==0) {
                transcript.last.row <- nrow(cur.gtf)
            #
            #  otherwise it ends on the row just before the next
            #  transcript or gene record
            #
            } else {
                transcript.last.row <- min(which((cur.gtf$type %in% c('transcript','gene')) & 
                                                 ((1:nrow(cur.gtf)) > exon.row)  ))-1
            }
      
            out.gtf<-cur.gtf[transcript.first.row:transcript.last.row,]
    
            #
            # skip transcripts that do not contain both the upstream and
            # the downstream flanking exon of the event
            #
            has.upstream   <- any((out.gtf$start == from.gtf$upstreamES[exon.itr]+1) & 
                                   out.gtf$end   == from.gtf$upstreamEE[exon.itr])
            has.downstream <- any((out.gtf$start == from.gtf$downstreamES[exon.itr]+1) & 
                                   out.gtf$end   == from.gtf$downstreamEE[exon.itr])
            if (!(has.upstream && has.downstream)) {
                next
            }
            
            #
            # skip transcripts with fewer than 3 CDS records
            # (the reason for this threshold is not documented)
            #
            if (sum(out.gtf$type=='CDS')<3) {
                next
            }
      
            write.table(out.gtf,"../data/transcript.gtf",sep='\t',col.names = FALSE,
                        row.names = FALSE, quote = FALSE)
        
            command <- paste0('gffread ../data/transcript.gtf -g ../data/GRCh38.p12.genome.fa -y ../data/gene.fa')
            system(command)       
            seq<-readAAStringSet('../data/gene.fa')
        
            if (length(seq)==0)
                next
        
            l.inc<-length(seq[[1]])
        
            out.gtf <- out.gtf [(out.gtf$start != (from.gtf$exonStart_0base[exon.itr]+1)) & 
                                (out.gtf$end   != (from.gtf$exonEnd[exon.itr])),]
        
            write.table(out.gtf,"../data/transcript.gtf",sep='\t',col.names = FALSE,
                        row.names = FALSE,quote = FALSE)
        
            command <- paste0('gffread ../data/transcript.gtf -g ../data/GRCh38.p12.genome.fa -y ../data/gene.fa')
            system(command)       
            seq<-readAAStringSet('../data/gene.fa')
        
            # as above, skip if gffread produced no protein sequence
            if (length(seq)==0)
                next
        
            l.skip<-length(seq[[1]])
             
            #
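            # NMD test - including the exon should lengthen the protein by about one third
            # of the exon length in nucleotides (its length in amino acids); if the
            # inclusion isoform is shorter than that, the transcript is considered NMD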
            # 
            skip.exon.aa.length <- (from.gtf$exonEnd[exon.itr] - (from.gtf$exonStart_0base[exon.itr]+1))/3
        
            res[exon.itr,2] <- as.integer(res[exon.itr,2])+1
        
            if (l.inc<(l.skip+skip.exon.aa.length-1)) {
                res[exon.itr,1] <- as.integer(res[exon.itr,1])+1
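                # cur.gtf has no gene_id column, so read the transcript ID from the
                # attribute column (V9) of the transcript record instead
                # (assumes nmd.ids is meant to list the affected transcripts)
                nmd.id <- sub('.*transcript_id "([^"]+)".*', '\\1',
                              as.character(cur.gtf$V9[transcript.first.row]))
                res[exon.itr,3] <- paste(res[exon.itr,3],nmd.id,sep='***')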
            }
            
        } # end for (exon.row in exon.rows) 
    } # end for (exon.itr in ((1:nrow(from.gtf))[from.gtf$chr==chr]))
    
    finished<-finished+sum(from.gtf$chr==chr)
  
    message(paste0("Finished: ",finished))
  
} # end for (chr in as.character(unique(from.gtf$chr)))

 dim (cur.gtf) before removing chrY 



dim (annotation.gtf) without chrY 



looking at chr -> chr17
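
The main program relies on the GTF listing each transcript record immediately before its own exon and CDS rows. For a matching exon row it therefore scans back to the nearest preceding `transcript` row and forward to the row just before the next `transcript` or `gene` row. The sketch below reproduces that lookup on a toy feature-type column; `transcript_block` is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: row range of the transcript that contains row `exon.row`,
# given the GTF feature-type column in file order.
transcript_block <- function(type, exon.row) {
    rows  <- seq_along(type)
    first <- max(which(type == 'transcript' & rows < exon.row))
    nxt   <- which(type %in% c('transcript', 'gene') & rows > exon.row)
    # no later transcript/gene record: the block runs to the end of the table
    last  <- if (length(nxt) == 0) length(type) else min(nxt) - 1
    c(first = first, last = last)
}

# Toy annotation: two genes, three transcripts
type <- c('gene', 'transcript', 'exon', 'CDS', 'exon', 'CDS',
          'transcript', 'exon', 'gene', 'transcript', 'exon')

block.mid  <- transcript_block(type, 5)    # exon inside the first transcript
block.last <- transcript_block(type, 11)   # exon inside the final transcript
```

Here `block.mid` covers rows 2 to 6 and `block.last` covers rows 10 to 11, the second case being the one where the block has to end at the last row of the table.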



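The length comparison at the end of the main loop can be read in isolation as a small predicate. The sketch below copies the notebook's arithmetic into a hypothetical function, `is_nmd_candidate`, and applies it to made-up protein lengths; it is meant only to make the comparison easier to follow, not to replace the loop.

```r
# Hypothetical helper mirroring the comparison in the main loop.
# l.inc, l.skip    : protein lengths (aa) with and without the skipped exon
# exon.start.0base : rMATS exonStart_0base
# exon.end         : rMATS exonEnd
is_nmd_candidate <- function(l.inc, l.skip, exon.start.0base, exon.end) {
    skip.exon.aa.length <- (exon.end - (exon.start.0base + 1)) / 3
    l.inc < (l.skip + skip.exon.aa.length - 1)
}

# Made-up protein lengths, combined with the coordinates of the first MAP2K3 event
truncated <- is_nmd_candidate(l.inc = 150, l.skip = 300, 21287990, 21288091)
extended  <- is_nmd_candidate(l.inc = 334, l.skip = 300, 21287990, 21288091)
```

With these inputs the threshold is 300 + 100/3 - 1, roughly 332.3 amino acids, so the first call flags the transcript and the second does not.
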
## 3.0 write out **NMD_summary.txt**

In [None]:
write.table(res,"../data/NMD_summary.txt",sep='\t',row.names = TRUE,col.names = TRUE,quote = FALSE)
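
Because the file is written with both row and column names, its header line has one field fewer than the data lines; `read.table` recognises this layout and uses the first column as row names. The sketch below shows a round trip on a toy matrix written to a temporary file (all values and the `ENST_A` identifier are made up for illustration).

```r
# Toy result matrix in the same layout as `res` (values are made up)
res.demo <- matrix(c('1', '0', '2', '3', '***ENST_A', ''), ncol = 3)
colnames(res.demo) <- c('num.nmd', 'num.transcripts', 'nmd.ids')

summary.file <- tempfile(fileext = '.txt')
write.table(res.demo, summary.file, sep = '\t', row.names = TRUE,
            col.names = TRUE, quote = FALSE)

# The counters come back as integers, the IDs as character strings
nmd <- read.table(summary.file, sep = '\t', header = TRUE, quote = '',
                  stringsAsFactors = FALSE)

# Share of tested transcripts flagged as NMD for each exon
# (NaN for exons where no transcript was tested)
nmd$fraction.nmd <- nmd$num.nmd / nmd$num.transcripts
```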

## Appendix - Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, the files generated during the analysis and stored in the **`data`** directory
2. List of environment metadata, dependencies, versions of libraries using `conda list`

### Appendix - 1. Checksums with the sha256 algorithm

In [None]:
import os
import pandas as pd

notebook_id = "nonsenseMediatedDecay"
os.system("echo true")

print("Generating sha256 checksums of the artefacts in the `../data/` directory .. ")
os.system(f"cd ../data/ && sha256sum NMD_summary.txt > ../metadata/{notebook_id}_sha256sums.txt")
print("Done!\n")

# sha256sum writes "<checksum>  <file name>" lines without a header row
pd.read_csv(f"../metadata/{notebook_id}_sha256sums.txt", sep=r"\s+", header=None, names=["sha256", "file"])

### Appendix - 2. Libraries metadata

In [None]:
import os

notebook_id = "nonsenseMediatedDecay"

print(f"Saving `conda list` packages in ../metadata/{notebook_id}_conda_list.txt  ..")
os.system(f"conda list > ../metadata/{notebook_id}_conda_list.txt")
print("Done!\n")