# Nonsense-mediated decay in sex-biased alternative splicing

This notebook uses **gencode.v30.annotation.gtf** and the associated genome assembly **GRCH30.p12.genome.fa** and the **fromGTF.SE.txt** file from the rMATS 3.2.5 experiment to generate a single output file **NMD_summary.txt**

## 1. Library Dependencies

In [1]:
suppressWarnings({suppressMessages({
library(Biostrings)
library(rtracklayer)
})})

## 1.1 Obtain the appropriate genome assembly 

The rMATS 3.2.5 experiment was done with gencode v.30.  Using this release obtain the proper fasta file for the genome.   

ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/GRCh38.p12.genome.fa.gz

In [2]:
if (!("GRCh38.p12.genome.fa.gz" %in% list.files("../data/"))) {
    message("downloading genome assembly associated with the Gencode release 30 - GRCh38.p12.genome.fa.gzn")
    system("wget -O ../data/GRCh38.p12.genome.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/GRCh38.p12.genome.fa.gz")
    message("Done!\n")
    message("Unzipping compressed file GRCh38.p12.genome.fa.gz..")
    system("gunzip ../data/GRCh38.p12.genome.fa.gz", intern = TRUE)
    message("Done! GRCh38.p12.genome.fa can be found in ../data/")
}
fasta.file <- "../data/GRCh38.p12.genome.fa"

downloading genome assembly associated with the Gencode release 30 - GRCh38.p12.genome.fa.gzn

Done!


Unzipping compressed file GRCh38.p12.genome.fa.gz..

“running command 'gunzip ../data/GRCh38.p12.genome.fa.gz' had status 2”
Done! GRCh38.p12.genome.fa can be found in ../data/



## 1.2 Obtain the gencode.v30.gtf file

gencode.v30.annotation.gtf file was used for the rMATS 3.2.5 experiment.  

In [4]:
#
# add chr information for summary data later, use the annotation we used for rMATS
#
if (!("gencode.v30.annotation.gtf.gz" %in% list.files("../data/"))) {
    message("downloading gencode v30 annotation\n")
    system("wget -O ../data/gencode.v30.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz")
    message("Done!\n")
    message("Unzipping compressed file gencode.v30.annotation.gtf.gz..")
    system("gunzip ../data/gencode.v30.annotation.gtf.gz", intern = TRUE)
    message("Done! gencode.v30.annotation.gtf can be found in ../data/")
}
gencode <- rtracklayer::import("../data/gencode.v30.annotation.gtf")
gtf.df <- as.data.frame (gencode)

In [5]:
head(gtf.df)

Unnamed: 0_level_0,seqnames,start,end,width,strand,source,type,score,phase,gene_id,⋯,transcript_type,transcript_name,transcript_support_level,tag,havana_transcript,exon_number,exon_id,ont,protein_id,ccdsid
Unnamed: 0_level_1,<fct>,<int>,<int>,<int>,<fct>,<fct>,<fct>,<dbl>,<int>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,chr1,11869,14409,2541,+,HAVANA,gene,,,ENSG00000223972.5,⋯,,,,,,,,,,
2,chr1,11869,14409,2541,+,HAVANA,transcript,,,ENSG00000223972.5,⋯,processed_transcript,DDX11L1-202,1.0,basic,OTTHUMT00000362751.1,,,,,
3,chr1,11869,12227,359,+,HAVANA,exon,,,ENSG00000223972.5,⋯,processed_transcript,DDX11L1-202,1.0,basic,OTTHUMT00000362751.1,1.0,ENSE00002234944.1,,,
4,chr1,12613,12721,109,+,HAVANA,exon,,,ENSG00000223972.5,⋯,processed_transcript,DDX11L1-202,1.0,basic,OTTHUMT00000362751.1,2.0,ENSE00003582793.1,,,
5,chr1,13221,14409,1189,+,HAVANA,exon,,,ENSG00000223972.5,⋯,processed_transcript,DDX11L1-202,1.0,basic,OTTHUMT00000362751.1,3.0,ENSE00002312635.1,,,
6,chr1,12010,13670,1661,+,HAVANA,transcript,,,ENSG00000223972.5,⋯,transcribed_unprocessed_pseudogene,DDX11L1-201,,basic,OTTHUMT00000002844.2,,,PGO:0000019,,


## 1.3 Import the rMATS 3.2.5 fromGTF definition file

In [6]:
from.gtf <- read.table(file = "../data/fromGTF.SE.txt", sep = "\t", quote = "\"", header = T, stringsAsFactors = F)

## 1.4 Create the matrix for the Nonsense mediated decay results

In [7]:
res<-matrix(c(rep(0,nrow(from.gtf)),rep(0,nrow(from.gtf)),rep('',nrow(from.gtf))),ncol=3)
colnames(res)<-c('num.nmd','num.transcripts','nmd.ids')

## 2.0 Main program to obtain the NMD_summary.txt

In [9]:
finished=0

message(" dim (gtf.df) before removing chrY ")
dim(gtf.df)

gtf.df <- gtf.df[gtf.df$seqnames != "chrY", ]
message("dim (gtf.df) without chrY ")
dim(gtf.df)

#
# for testing - look at a single chromosome
#
chr <- "chr17"
# loop through one chromosome at a time through the fromGTF.SE.txt file from rMATS 3.2.5
#for (chr in as.character(unique(from.gtf$chr))) {
    message("looking at chr -> ", chr)
      
    cur.gtf <- gtf.df[gtf.df$seqnames == chr, ] 

    for (exon.itr in ((1:nrow(from.gtf))[from.gtf$chr==chr])){
 
        exon.rows <- which((cur.gtf$start == from.gtf$exonStart_0base[exon.itr]+1) & 
                           (cur.gtf$end   == from.gtf$exonEnd        [exon.itr]) & 
                            cur.gtf$type  == 'exon')

        for (exon.row in 1:exon.rows) {
            transcript.first.row<-max(which((cur.gtf$type=='transcript') & 
                                            ((1:nrow(cur.gtf))<exon.row)  ))
        
            if (sum((cur.gtf$type %in% c('transcript','gene')) & 
                    ((1:nrow(cur.gtf))>exon.row)) == 0) {
                transcript.last.row<-nrow(cur.gtf)
            } else {
                transcript.last.row<-min(which((cur.gtf$type %in% c('transcript','gene')) & 
                                               ((1:nrow(cur.gtf)) > exon.row)  ))
            }
            out.gtf<-cur.gtf[transcript.first.row:transcript.last.row,]
            if (sum(c(sum((out.gtf$start == from.gtf$upstreamES  [exon.itr]+1) & 
                           out.gtf$end   == from.gtf$upstreamEE  [exon.itr]) > 0, 
                      sum((out.gtf$start == from.gtf$downstreamES[exon.itr]+1) & 
                           out.gtf$end   == from.gtf$downstreamEE[exon.itr])>0))<2)
                next
            if (sum(out.gtf$type == 'CDS' ) < 3)
                next       
            write.table(out.gtf,"../data/transcript.gtf",sep='\t',col.names = FALSE,
                        row.names = FALSE, quote = FALSE)
        
            command <- paste0('gffread transcript.gtf -g GRCh38.p12.genome.fa -y gene.fa')
            system(paste('gffread -y ../data/gene.fa -g ',fasta.file,' ../data/transcript.gtf',sep=''))
            system(command)       
            seq<-readAAStringSet('../data/gene.fa')
        
            if (length(seq)==0)
                next
        
            l.inc<-length(seq[[1]])
        
            out.gtf <- out.gtf [(out.gtf$start != (from.gtf$exonStart_0base[exon.itr]+1)) & 
                                (out.gtf$end   != (from.gtf$exonEnd[exon.itr])),]
        
            write.table(out.gtf,"../data/transcript.gtf",sep='\t',col.names = FALSE,
                        row.names = FALSE,quote = FALSE)
        
            command <- paste0('gffread transcript.gtf -g GRCh38.p12.genome.fa -y gene.fa')
            system(paste('gffread -y gene.fa -g ',fasta.file,' ../data/transcript.gtf',sep=''))
            system(command)       
            seq<-readAAStringSet('../data/gene.fa')
        
            l.skip<-length(seq[[1]])
        
            skip.exon.aa.length <- (from.gtf$exonEnd[exon.itr] - (from.gtf$exonStart_0base[exon.itr]+1))/3
        
            res[exon.itr,2] <- as.integer(res[exon.itr,2])+1
        
            if (l.inc<(l.skip+skip.exon.aa.length-1)) {
                res[exon.itr,1] <- as.integer(res[exon.itr,1])+1
                res[exon.itr,3] <- paste(res[exon.itr,3],cur.gtf$gene_id,sep='***')
            }
            
        } # end for (exon.row in exon.rows) 
    } # end for (exon.itr in ((1:nrow(from.gtf))[from.gtf$chr==chr])){
    
    finished<-finished+sum(from.gtf$chr==chr)
  
    print(paste0("Finished: ",finished))
  
#} # for (chr in as.character(unique(from.gtf$chr))) 

 dim (gtf.df) before removing chrY 



dim (gtf.df) without chrY 



looking at chr -> chr17

“no non-missing arguments to max; returning -Inf”


ERROR: Error in transcript.first.row:transcript.last.row: result would be too long a vector


In [50]:
?which


0,1
which {base},R Documentation

0,1
x,a logical vector or array. NAs are allowed and omitted (treated as if FALSE).
arr.ind,logical; should array indices be returned when x is an array?
ind,"integer-valued index vector, as resulting from which(x)."
.dim,dim(.) integer vector
.dimnames,"optional list of character dimnames(.). If useNames is true, to be used for constructing dimnames for arrayInd() (and hence, which(*, arr.ind=TRUE)). If names(.dimnames) is not empty, these are used as column names. .dimnames[[1]] is used as row names."
useNames,logical indicating if the value of arrayInd() should have (non-null) dimnames at all.


## 3.0 write out **NMD_summary.txt**

In [None]:
write.table(res,"../data/NMD_summary.txt",sep='\t',row.names = TRUE,col.names = TRUE,quote = FALSE)

## Appendix - Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, files generated during the analysis and stored in the folder directory **`data`**
2. List of environment metadata, dependencies, versions of libraries using `conda list`

### Appendix - 1. Checksums with the sha256 algorithm

In [5]:
notebook_id = "nonsenseMediatedDecay"
os.system("echo true")

print("Generating sha256 checksums of the artefacts in the `..data/` directory .. ")
os.system(f"cd ../data/ && sha256sum NMD_summary.txt > ../metadata/{notebook_id}_sha256sums.txt")
print("Done!\n")

pd.read_csv(f"../metadata/{notebook_id}_sha256sums.txt")

Generating sha256 checksums of the artefacts in the `..data/` directory .. 
Done!



Unnamed: 0,ec38fac35613da014f73140da90c95294ba52c6e7923e380a5236762f1ca3793 GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct
0,65683ffcf6df3a68ce9e3401f2f66408231b9b9acd1c04...
1,0133c46eac7fde518f6d851287cd1933443ae3ea759d22...
2,cbabf87ae994cf76eb9b47709f6efb59e43a52af0a65f9...
3,8291be77c7ad6cd73d9f7797658e3f1ffc197532a23385...
4,5b1d46a8d2a5a2556e81d5262a50aa1ff2f31ee621193f...
...,...
4854,295ebbc27fa4e169b131781d772b4e69d528b82e728243...
4855,84f06a1cb756a11cd0306595d171d8739a8cf7fcebcc70...
4856,14d846231f11b4ff7acee9858fe6b39ecaad4079b3c1c5...
4857,66083588f477e50baa7229ee5ca9c34fbf53f7dad7940e...


### 2. Libraries metadata

In [5]:
notebook_id = "nonsenseMediatedDecay"

print(f"Saving `conda list` packages in ../metadata/{notebook_id}_conda_list.txt  ..")
os.system(f"conda list > ../metadata/{notebook_id}_conda_list.txt")
print("Done!\n")

Saving `conda list` packages in ../metadata/figure_4c_conda_list.txt  ..
Done!

