# Data access and Filtering

Transcript filtering is a crucial step in long-read transcriptome sequencing analysis: it removes low-quality transcripts and retains high-quality transcripts for downstream analysis. IsoTools implements transcript filtering as a flexible query syntax, based on logical combinations of named "tags" (by convention, a single word in capital letters). Each tag is defined by a Python expression that is evaluated in the context of the transcript dictionary, so it may depend on any metric or property of the transcript. IsoTools provides predefined tags covering technical artifacts, novelty categories, and properties of the reference annotation. Additionally, users can define custom tags to tailor filtering to their specific needs.
To apply these filters, tags are combined in query strings, which can be passed to iterator or export functions to select the desired transcripts. Importantly, filtering does not modify the underlying data; a filter is evaluated only when a query string is passed to such a function.

In this tutorial, we will learn to

* access genes by gene_name/gene_id.
* iterate over genes/transcripts, and filter them by their properties, genomic location or coverage.
* use IsoTools query syntax to filter genes and transcripts. 
* define custom tags and filter expressions, to tailor filter queries.


This tutorial assumes you have run the tutorial on [transcriptome reconstruction](03_transcriptome_reconstruction.html) already, and prepared the transcriptome pkl file "PacBio_isotools.pkl" based on the [demonstration data set](https://nc.molgen.mpg.de/cloud/index.php/s/zYe7g6qnyxGDxRd).

In [13]:
from isotools import Transcriptome
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

path='demonstration_dataset'
isoseq=Transcriptome.load(f'{path}/PacBio_isotools.pkl')

## Access genes by name or id
[Gene objects](../isotoolsAPI.html?highlight=isotools.Gene#isotools-gene) can be retrieved from the transcriptome object using the square bracket operator. This works with both the gene name and the gene id. The gene object contains all information on the gene, such as the coverage and the list of transcripts, which in turn contain the list of exons, the comparison with the reference annotation, and much more.

In [14]:
g=isoseq['ENSG00000104312.8']
if g==isoseq['RIPK2']:
    print('found the same gene by name and id')

#string representation
print(g)
# obtain the sum coverage over all transcripts and samples
total_cov=g.coverage.sum()
print(f"{total_cov} reads in total")

found the same gene by name and id
Gene RIPK2 chr8:89757805-89791064(+), 4 reference transcripts, 46 expressed transcripts
920 reads in total


In [15]:
# This gene has 46 expressed transcripts, while the reference lists only 4.
# However, most are supported by only a few reads.
n_tr = sum([cov>total_cov *.01 for cov in g.coverage.sum(0)])
print(f'{n_tr} transcripts contribute at least 1% to that gene')


7 transcripts contribute at least 1% to that gene


In [16]:
# let's look at the primary transcript
max_i=np.argmax(g.coverage.sum(0))
max_contr=g.coverage.sum(0)[max_i]
print(f'''The primary transcript is number {max_i}.
This transcript contributes {max_contr/total_cov:.2%}  ({max_contr}/{total_cov} reads)''')


The primary transcript is number 2.
This transcript contributes 40.87%  (376/920 reads)
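
The calculations above rely on the layout of the coverage table: `g.coverage.sum(0)` sums over samples and leaves one value per transcript. The following is a minimal pure-Python sketch of the same arithmetic on an invented toy table (the numbers and variable names are assumptions for illustration, not taken from the demonstration data).

```python
# Toy coverage table: one row per sample, one column per transcript.
coverage = [
    [48, 1, 0],    # sample A
    [137, 1, 5],   # sample B
    [15, 0, 92],   # sample C
]

# Column sums give the reads per transcript (the analogue of coverage.sum(0)).
per_transcript = [sum(column) for column in zip(*coverage)]
total = sum(per_transcript)

# Transcripts contributing more than 1% of the gene's reads.
substantial = [i for i, cov in enumerate(per_transcript) if cov > total * 0.01]

# Index and share of the most highly covered transcript.
primary = max(range(len(per_transcript)), key=per_transcript.__getitem__)
share = per_transcript[primary] / total
print(per_transcript, substantial, primary, f"{share:.2%}")
```

Note that the comparison is strict (`>`), so a transcript sitting exactly at the 1% boundary would not be counted.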


In [17]:
# all the information for this transcript is stored in this dict:
primary=g.transcripts[max_i]
print(f'\nThese are the infos for this transcript:')
for k,v in primary.items():
    print(f'{k}: {str(v)[:100]}{"..." if len(str(v))>100 else ""}')




These are the infos for this transcript:
exons: [[89757791, 89758233], [89762828, 89762982], [89765340, 89765496], [89769771, 89769929], [89771740, ...
strand: +
coverage: {'GM12878_a': 48, 'GM12878_b': 137, 'GM12878_c': 141, 'K562_a': 18, 'K562_b': 19, 'K562_c': 13}
TSS: {'GM12878_a': {89757763: 1, 89757766: 1, 89757778: 1, 89757782: 3, 89757783: 4, 89757784: 4, 8975778...
PAS: {'GM12878_a': {89790462: 3, 89791057: 3, 89790608: 5, 89790939: 3, 89790606: 3, 89790425: 1, 8979046...
annotation: (0, {'FSM': [1]})
TSS_unified: {'GM12878_a': {89757791: 48}, 'GM12878_b': {89757791: 137}, 'GM12878_c': {89757791: 141}, 'K562_a': ...
PAS_unified: {'GM12878_a': {89790462: 20, 89790982: 10, 89790605: 11, 89790939: 3, 89790524: 4}, 'GM12878_b': {89...
direct_repeat_len: [3, 3, 4, 5, 7, 3, 5, 4, 6, 4]
downstream_A_content: 0.2
ORF: (89758060, 89790416, {"5'UTR": 269, 'CDS': 1623, "3'UTR": 46, 'start_codon': 'ATG', 'stop_codon': 'T...


In [18]:

# this 'annotation' line reveals that it is a FSM with reference transcript nr 1:
# annotation: (0, {'FSM': [1]})

print(f'\nThe corresponding reference transcript: ')
for k,v in g.ref_transcripts[primary["annotation"][1]["FSM"][0]].items():
    print(f'{k}: {str(v)[:100]}{"..." if len(str(v))>100 else ""}')


The corresponding reference transcript: 
transcript_id: ENST00000220751.5
transcript_type: protein_coding
transcript_name: RIPK2-201
transcript_support_level: 1
exons: [(89757815, 89758233), (89762828, 89762982), (89765340, 89765496), (89769771, 89769929), (89771740, ...
CDS: (89758060, 89790416)
downstream_A_content: 0.06666666666666667
ORF: (89758060, 89790416, {"5'UTR": 245, 'CDS': 1623, "3'UTR": 648, 'start_codon': 'ATG', 'stop_codon': '...


## Iterating genes and transcripts
To iterate over genes and transcripts, the transcriptome object provides the methods [iter_transcripts](../isotoolsAPI.html?highlight=iter_transcripts#isotools.Transcriptome.iter_transcripts) and [iter_genes](../isotoolsAPI.html?highlight=iter_genes#isotools.Transcriptome.iter_genes). Both can filter by genomic region, by coverage, and with queries. The gene iterator yields gene objects, while the transcript iterator yields a 3-tuple of gene, transcript index, and a dictionary with the transcript properties.

In [19]:
for g in isoseq.iter_genes(region='chr8:89000000-90000000', min_coverage=100):
    print(g)
    print(f'\t{g.coverage.sum()} reads')

Gene OSGIN2 chr8:89901848-89927888(+), 4 reference transcripts, 47 expressed transcripts
	573 reads
Gene RIPK2 chr8:89757805-89791064(+), 4 reference transcripts, 46 expressed transcripts
	920 reads
Gene NBN chr8:89924514-90003228(-), 38 reference transcripts, 294 expressed transcripts
	2259 reads
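
To make the shape of the transcript iterator more concrete, here is a minimal sketch of a generator that yields (gene, transcript index, transcript dictionary) tuples, restricted by region and minimum coverage. It operates on invented toy records; `iter_transcripts_sketch`, `parse_region`, and all field names are hypothetical and do not reflect how IsoTools stores or searches its data.

```python
# Toy gene records; the field names are invented for this sketch.
genes = [
    {"name": "A", "chrom": "chr8", "start": 100, "end": 500,
     "transcripts": [{"coverage": 120}, {"coverage": 3}]},
    {"name": "B", "chrom": "chr8", "start": 2000, "end": 2600,
     "transcripts": [{"coverage": 40}]},
    {"name": "C", "chrom": "chr1", "start": 100, "end": 500,
     "transcripts": [{"coverage": 500}]},
]

def parse_region(region):
    """Split 'chr:start-end' into chromosome, start and end."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)

def iter_transcripts_sketch(genes, region=None, min_coverage=0):
    """Yield (gene, transcript index, transcript dict) for matching transcripts."""
    for gene in genes:
        if region is not None:
            chrom, start, end = parse_region(region)
            # keep only genes on the same chromosome that overlap the interval
            if gene["chrom"] != chrom or gene["end"] < start or gene["start"] > end:
                continue
        for i, tr in enumerate(gene["transcripts"]):
            if tr["coverage"] >= min_coverage:
                yield gene, i, tr

hits = [(g["name"], i)
        for g, i, _ in iter_transcripts_sketch(genes, "chr8:0-1000", 10)]
print(hits)
```

Because the function is a generator, nothing is computed until the loop asks for the next item, which mirrors the point that filtering leaves the underlying data untouched.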


## Filtering tags and queries
IsoTools allows filtering of genes and transcripts based on **TAGS** (single words in ALLCAPS).

Each TAG is defined by a corresponding **expression**, that gets evaluated on the properties of the gene or transcript.

A TAG is defined in a specific **context**, either gene, transcript, or reference transcript context, in which the expression gets evaluated. Expressions in gene context depend on properties of the gene, while in transcript context, the properties of the transcript are relevant. 

We already used tags in the previous tutorial to define sequencing artifacts: 'INTERNAL_PRIMING', 'FRAGMENT', and 'RTTS' all have corresponding expressions that define them. For example, the expression for the INTERNAL_PRIMING tag is 'len(exons)==1 and downstream_A_content and downstream_A_content>.5', i.e. it selects (returns True for) mono-exonic transcripts with more than 50% A content downstream of the transcript.
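
As a rough illustration of what "evaluating an expression in transcript context" means, the sketch below evaluates the INTERNAL_PRIMING expression against a few invented transcript dictionaries, using the dictionary keys as variable names. The helper `evaluate_tag` is hypothetical and uses Python's built-in `eval` purely for demonstration; it is not meant to describe the internal IsoTools implementation.

```python
# Toy transcript dictionaries; the keys mirror the properties used in the
# INTERNAL_PRIMING expression, the values are invented for illustration.
transcripts = [
    {"exons": [[100, 900]], "downstream_A_content": 0.8},
    {"exons": [[100, 300], [500, 900]], "downstream_A_content": 0.8},
    {"exons": [[100, 900]], "downstream_A_content": 0.1},
]

expression = "len(exons)==1 and downstream_A_content and downstream_A_content>.5"

def evaluate_tag(expression, transcript):
    """Evaluate a tag expression with the transcript properties as local names."""
    # eval is used only to illustrate the idea (hypothetical helper).
    return bool(eval(expression, {"len": len}, dict(transcript)))

flags = [evaluate_tag(expression, tr) for tr in transcripts]
print(flags)
```

Only the first toy transcript is flagged: the second has two exons, and the third has a low downstream A content.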


As additional examples, we print the default definitions for all defined tags.

In [20]:

#print all defined filter expressions
for context in isoseq.filter:
    print(f'\n{context}\n{"="*len(context)}')
    for tag,expression in isoseq.filter[context].items():
        print(f'- {tag}:\t{expression}')



gene
====
- NOVEL_GENE:	not reference
- EXPRESSED:	transcripts
- CHIMERIC:	chimeric

transcript
==========
- INTERNAL_PRIMING:	len(exons)==1 and downstream_A_content and downstream_A_content>.5
- RTTS:	noncanonical_splicing is not None and novel_splice_sites is not None and         any(2*i in novel_splice_sites and 2*i+1 in novel_splice_sites for i,_ in noncanonical_splicing)
- NONCANONICAL_SPLICING:	noncanonical_splicing
- NOVEL_TRANSCRIPT:	annotation[0]>0
- FRAGMENT:	fragments and any("novel exonic " in a or "fragment" in a for a in annotation[1])
- UNSPLICED:	len(exons)==1
- MULTIEXON:	len(exons)>1
- SUBSTANTIAL:	g.coverage.sum() * .01 < g.coverage[:,trid].sum()
- ANTISENSE:	"antisense" in annotation[1]
- INTERGENIC:	"intergenic" in annotation[1]
- GENIC_GENOMIC:	"genic genomic" in annotation[1]
- NOVEL_EXONIC_PAS:	"novel exonic PAS" in annotation[1]
- NOVEL_INTRONIC_PAS:	"novel intronic PAS" in annotation[1]
- READTHROUGH_FUSION:	"readthrough fusion" in annotation[1]
- NOVEL_EXON:	"novel exo

## Custom Tags
Users can modify existing criteria, for example to adjust thresholds, or define additional criteria, based on custom properties, with the [add_filter](../isotoolsAPI.html?highlight=add_filter#isotools.Transcriptome.add_filter) method. 

The following example shows how to define additional tags, in this case "HIGH_SUPPORT" and "PROTEIN_CODING" for the reference transcripts, which are based on the GENCODE annotation fields "transcript_support_level" and "transcript_type". Note that these custom filter tags, like all filter tags, are only applied when used in a filter query.


In [21]:
# add or modify custom filters

isoseq.add_filter(tag = "HIGH_SUPPORT",
                  expression = 'transcript_support_level=="1"',
                  context = 'reference')
isoseq.add_filter(tag = "PROTEIN_CODING",
                  expression = 'transcript_type=="protein_coding"',
                  context = 'reference')
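
Conceptually, the tag definitions can be pictured as one dictionary of tag names and expressions per context, similar to the structure printed from `isoseq.filter` above. The sketch below mimics this with a hypothetical `add_filter_sketch` function and an invented reference transcript; it is an assumption-laden illustration, not the IsoTools implementation.

```python
# Hypothetical registry: one dictionary of tag -> expression per context.
filters = {"gene": {}, "transcript": {}, "reference": {}}

def add_filter_sketch(tag, expression, context="transcript"):
    """Register or overwrite a tag expression in the given context."""
    if context not in filters:
        raise ValueError(f"unknown context: {context}")
    compile(expression, "<tag>", "eval")  # fail early on syntax errors
    filters[context][tag] = expression

add_filter_sketch("HIGH_SUPPORT", 'transcript_support_level=="1"', "reference")
add_filter_sketch("PROTEIN_CODING", 'transcript_type=="protein_coding"', "reference")

# Invented reference transcript carrying the two GENCODE fields.
ref_tr = {"transcript_support_level": "1", "transcript_type": "protein_coding"}

# Evaluate every reference tag on this transcript.
tags = {tag: bool(eval(expr, {}, ref_tr))
        for tag, expr in filters["reference"].items()}
print(tags)
```

Note that the HIGH_SUPPORT expression compares against the string "1", so a support level stored as the integer 1 would not match.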


## Constructing Filter Queries
### Analyse Genes of Interest
The tags can be combined into boolean expressions to query transcripts of interest. For example, to find novel exon skipping events that contribute substantially to the gene's total expression, the query would be: **"EXON_SKIPPING and SUBSTANTIAL"**.

In [22]:
for g,trnr,tr in isoseq.iter_transcripts( query='EXON_SKIPPING and SUBSTANTIAL',
                                          min_coverage=50 ):
    print(f'Transcript nr {trnr} of "{g}" with a coverage of {g.coverage.sum(0)[trnr]}')
    print(f"  -> {tr['annotation'][1]['exon skipping']}")


Transcript nr 1 of "Gene SLC20A2 chr8:42416474-42541926(-), 14 reference transcripts, 53 expressed transcripts" with a coverage of 50
  -> [[42480458, 42480513]]
Transcript nr 4 of "Gene ARMC1 chr8:65602457-65634217(-), 6 reference transcripts, 57 expressed transcripts" with a coverage of 51
  -> [[65605421, 65605538]]


Transcript nr 3 of "Gene ZNF7 chr8:144827517-144847509(+), 18 reference transcripts, 86 expressed transcripts" with a coverage of 52
  -> [[144828007, 144828053], [144828728, 144828950], [144829042, 144829604], [144830930, 144831029]]
Transcript nr 0 of "Gene CIBAR1 chr8:93698560-93731527(+), 18 reference transcripts, 30 expressed transcripts" with a coverage of 296
  -> [[93707201, 93707281], [93708010, 93708016]]
Transcript nr 7 of "Gene RIPK2 chr8:89757805-89791064(+), 4 reference transcripts, 46 expressed transcripts" with a coverage of 117
  -> [[89784049, 89784139]]
Transcript nr 12 of "Gene RIPK2 chr8:89757805-89791064(+), 4 reference transcripts, 46 expressed transcripts" with a coverage of 126
  -> [[89784049, 89784139]]
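
A query string can be thought of as a boolean expression in which every tag name stands for True or False on the transcript at hand. The following sketch evaluates such a query on invented tag values; the `matches` helper and the toy values are hypothetical and only illustrate the principle.

```python
# Tag values for three toy transcripts; in practice each value would come
# from evaluating the tag's expression on the transcript.
tag_values = [
    {"EXON_SKIPPING": True, "SUBSTANTIAL": True, "RTTS": False},
    {"EXON_SKIPPING": True, "SUBSTANTIAL": False, "RTTS": False},
    {"EXON_SKIPPING": False, "SUBSTANTIAL": True, "RTTS": True},
]

def matches(query, tags):
    """Evaluate a boolean query string with tag names bound to True/False."""
    code = compile(query, "<query>", "eval")
    # report tags used in the query that have no value
    unknown = [name for name in code.co_names if name not in tags]
    if unknown:
        raise KeyError(f"undefined tags: {unknown}")
    return bool(eval(code, {"__builtins__": {}}, tags))

query = "EXON_SKIPPING and SUBSTANTIAL"
selected = [i for i, tags in enumerate(tag_values) if matches(query, tags)]
print(selected)
```

The usual Python operators `and`, `or`, `not`, and parentheses can be used to build more complex queries in the same way.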


### Filter Queries for transcript export
One of the primary tasks for filtering is the selection of high-confidence transcripts to export for downstream analysis. Depending on the requirements of this analysis, the filtering may be more permissive or stricter.
The following code block defines three filter schemes of varying stringency. When exporting several files (e.g. gtf and transcript table), be careful to use consistent filtering parameters for all files to ensure compatibility.

* The *permissive* filtering scheme selects reference transcripts, as well as novel transcripts which are not tagged as artifacts, as long as they are supported by two or more reads.
* The *balanced* filtering scheme requires at least 7 reads for novel transcripts. This can be realized with a custom filter tag HIGH_COVER.
* The *strict* filtering scheme requires at least 7 reads for all transcripts (min_coverage parameter), and more than 1% contribution to the gene's total, ensured by the predefined tag SUBSTANTIAL.

Note that these filtering schemes may require adaptation, depending on the dataset and the specific analysis task.

In [23]:
# add the custom filter HIGH_COVER
isoseq.add_filter(tag = "HIGH_COVER",
                  expression = 'g.coverage.sum(0)[trid] >= 7',
                  context = 'transcript')

# define the filtering schemes
permissive={
  "query": "FSM or not ( RTTS or INTERNAL_PRIMING or FRAGMENT )",
  "min_coverage": 2
}

balanced={'query':
  'FSM or (HIGH_COVER and not (RTTS or FRAGMENT or INTERNAL_PRIMING))',
  'min_coverage':2
}

strict={'query':
  'SUBSTANTIAL and (FSM or not (RTTS or FRAGMENT or INTERNAL_PRIMING))',
  'min_coverage':7
}
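
To get a feeling for how the three schemes differ, the sketch below applies the same queries and coverage thresholds to four invented transcripts. The toy transcripts, their tag values, and the `select` helper are assumptions made for illustration; HIGH_COVER is recomputed from the toy coverage with the same threshold as in the cell above.

```python
schemes = {
    "permissive": {"query": "FSM or not (RTTS or INTERNAL_PRIMING or FRAGMENT)",
                   "min_coverage": 2},
    "balanced": {"query": "FSM or (HIGH_COVER and not (RTTS or FRAGMENT or INTERNAL_PRIMING))",
                 "min_coverage": 2},
    "strict": {"query": "SUBSTANTIAL and (FSM or not (RTTS or FRAGMENT or INTERNAL_PRIMING))",
               "min_coverage": 7},
}

no_artifact = {"RTTS": False, "FRAGMENT": False, "INTERNAL_PRIMING": False}
# (name, read coverage, tag values); all values are invented.
toy = [
    ("known_low", 3, dict(no_artifact, FSM=True, SUBSTANTIAL=False)),
    ("novel_low", 3, dict(no_artifact, FSM=False, SUBSTANTIAL=False)),
    ("novel_high", 20, dict(no_artifact, FSM=False, SUBSTANTIAL=True)),
    ("artifact", 20, dict(no_artifact, FSM=False, SUBSTANTIAL=True, RTTS=True)),
]

def select(scheme):
    """Return the names of toy transcripts passing a filtering scheme."""
    kept = []
    for name, cov, tags in toy:
        if cov < scheme["min_coverage"]:
            continue
        values = dict(tags, HIGH_COVER=cov >= 7)
        if eval(scheme["query"], {"__builtins__": {}}, values):
            kept.append(name)
    return kept

result = {label: select(scheme) for label, scheme in schemes.items()}
for label, kept in result.items():
    print(f"{label}: {kept}")
```

In this toy setting, the lowly covered novel transcript only passes the permissive scheme, the lowly covered known transcript is dropped by the strict scheme, and the transcript tagged as RTTS artifact is rejected by all three.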


In [24]:

isoseq.write_gtf(f'{path}/demonstration_dataset_transcripts_balanced.gtf', **balanced)

transcript_tab=isoseq.transcript_table(groups=isoseq.groups(),
                                       tpm=True,
                                       coverage=True,
                                       progress_bar=True,
                                       ** balanced)

# write to csv file
transcript_tab.to_csv(f'{path}/demonstration_dataset_transcripts_balanced.csv',
                      index=False, sep='\t')


100%|██████████| 10801/10801 [00:00<00:00, 17798.62genes/s]
