# Transcriptome Reconstruction
This tutorial processes the example data set, which is based on PacBio Iso-Seq samples of hematopoietic cells from ENCODE. The data set contains only a subset of genomic regions, so the demonstration tutorials run quickly.
All data files required to run the tutorials can be obtained from this [download link](https://nc.molgen.mpg.de/cloud/index.php/s/zYe7g6qnyxGDxRd).

To run this example, you will need the following **input files**:

* sample description file 'encode_samples.tsv'
* six .bam alignment files for the six samples
* six corresponding .bam.bai indices
* reference annotation file gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz
* corresponding .gff3.gz.tbi index file
* genomic reference file GRCh38.p13.genome_chr8.fa
* corresponding .fai index file

All files are assumed to be stored in a subfolder called 'demonstration_dataset'.
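
Before starting, it can help to confirm that the folder is complete. The following is a minimal sketch, not part of IsoTools: the helper `check_input_files` is hypothetical, and it assumes that the index files are named by appending `.tbi`, `.fai`, and `.bai` to the names of the files they belong to.

```python
from pathlib import Path

# File names taken from the list above. The index names are an assumption:
# the name of the indexed file plus the ".tbi" or ".fai" suffix.
REQUIRED_FILES = [
    "encode_samples.tsv",
    "gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz",
    "gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz.tbi",
    "GRCh38.p13.genome_chr8.fa",
    "GRCh38.p13.genome_chr8.fa.fai",
]


def check_input_files(folder):
    """Return a list of problems found in the data folder (empty if complete)."""
    folder = Path(folder)
    problems = [f"missing file: {name}" for name in REQUIRED_FILES
                if not (folder / name).is_file()]
    # The six alignment files are not named in the list, so match by suffix.
    bam_files = sorted(folder.glob("*.bam"))
    if len(bam_files) != 6:
        problems.append(f"expected 6 .bam files, found {len(bam_files)}")
    for bam in bam_files:
        index = bam.with_name(bam.name + ".bai")
        if not index.is_file():
            problems.append(f"missing index: {index.name}")
    return problems
```

Calling `check_input_files('demonstration_dataset')` returns an empty list when nothing is missing.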

In this tutorial, we import the reference annotation, specify the alignment files for the samples, and integrate the data into a common data structure. During this step, the transcriptome is reconstructed, and quality control metrics are computed. 


In [1]:
# preparation: import the libraries
from isotools import Transcriptome
from isotools import __version__ as isotools_version
import pandas as pd
import matplotlib.pyplot as plt
import logging
# set up logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)
logger=logging.getLogger('isotools')
logger.info(f'This is isotools version {isotools_version}')


path='demonstration_dataset'

INFO:This is isotools version 0.3.5rc10


## Import of reference annotation
The first step is to import the reference annotation from a gff or gtf file with the [Transcriptome.from_reference](../isotoolsAPI.html?highlight=from_reference#isotools.Transcriptome.from_reference) class method. The input file should be sorted and indexed with tabix. 

In [2]:
annotation_fn=f'{path}/gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz'
# create the IsoTools transcriptome object from the reference annotation
isoseq=Transcriptome.from_reference(annotation_fn)


INFO:importing reference from gff3 file demonstration_dataset/gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz
100%|█████████▉| 2.70M/2.70M [00:02<00:00, 1.22MB/s]
INFO:skipped the following categories: {'three_prime_UTR', 'five_prime_UTR', 'CDS'}
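
The import expects a sorted and tabix-indexed annotation file. As a rough pre-check for your own annotation files, the sketch below tests whether GFF records are grouped by sequence and ordered by start position, which is the ordering that tabix indexing relies on. The function `is_position_sorted` is a hypothetical helper for illustration and is not part of IsoTools.

```python
def is_position_sorted(gff_lines):
    """Check that GFF records are grouped by sequence and sorted by start."""
    finished_sequences = set()
    current_seq, last_start = None, -1
    for line in gff_lines:
        # Skip blank lines, comments, and directives such as "##gff-version 3".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        seqid, start = fields[0], int(fields[3])
        if seqid != current_seq:
            if seqid in finished_sequences:
                return False  # the sequence appears in two separate blocks
            if current_seq is not None:
                finished_sequences.add(current_seq)
            current_seq, last_start = seqid, -1
        elif start < last_start:
            return False  # start positions decrease within a sequence
        last_start = start
    return True
```

The function accepts any iterable of lines, for example an uncompressed file handle opened in text mode.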


## Import of sequencing data
Next, a sample table containing the sample names, the corresponding file names, and the group assignments is imported.
This information is used in the next step, to reconstruct the transcripts for each sample and integrate them with the transcriptome object, using the [add_sample_from_bam](../isotoolsAPI.html?highlight=add_sample_from_bam#isotools.Transcriptome.add_sample_from_bam) function.
When all samples are added, quality control metrics are calculated by calling [Transcriptome.add_qc_metrics](../isotoolsAPI.html?highlight=add_qc_metrics#isotools.Transcriptome.add_qc_metrics).
Last, the object with all data is stored on disk in a pickle file for later use with the [save](../isotoolsAPI.html?highlight=save#isotools.Transcriptome.save) method.
In another session, the object can be restored with the class method [Transcriptome.load](../isotoolsAPI.html?highlight=load#isotools.Transcriptome.load).

In [3]:
sample_fn=f'{path}/encode_samples.tsv'
genome_fn=f'{path}/GRCh38.p13.genome_chr8.fa'

samples=pd.read_csv(sample_fn, sep='\t')
samples.file_name=path+'/'+samples.file_name
samples

| | sample_name | file_name | group |
|---|---|---|---|
| 0 | GM12878_a | demonstration_dataset/ENCFF417VHJ_aligned_mm2_... | GM12878 |
| 1 | GM12878_b | demonstration_dataset/ENCFF450VAU_aligned_mm2_... | GM12878 |
| 2 | GM12878_c | demonstration_dataset/ENCFF694DIE_aligned_mm2_... | GM12878 |
| 3 | K562_a | demonstration_dataset/ENCFF696GDL_aligned_mm2_... | K562 |
| 4 | K562_b | demonstration_dataset/ENCFF429VVB_aligned_mm2_... | K562 |
| 5 | K562_c | demonstration_dataset/ENCFF634YSN_aligned_mm2_... | K562 |
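
The sample table assigns each sample to a group. The sketch below shows one way to summarize that assignment and to catch repeated sample names before importing any reads. It uses only the standard library, the helper `samples_by_group` is hypothetical, and the column names `sample_name` and `group` are those shown in the table above.

```python
import csv
from collections import defaultdict


def samples_by_group(tsv_path):
    """Map each group to its sample names; raise on duplicate sample names."""
    groups = defaultdict(list)
    seen = set()
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            name = row["sample_name"]
            if name in seen:
                raise ValueError(f"duplicate sample name: {name}")
            seen.add(name)
            groups[row["group"]].append(name)
    return dict(groups)
```

For the table above, `samples_by_group(sample_fn)` would list three samples each under 'GM12878' and 'K562'.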


In [4]:
# integrate the samples
for i,row in samples.iterrows():
    # this step takes about 5-30 seconds per sample
    isoseq.add_sample_from_bam(row.file_name, sample_name=row.sample_name, group=row.group)
# the sample table of the transcriptome object contains the number of imported reads
isoseq.sample_table

INFO:adding sample GM12878_a from file demonstration_dataset/ENCFF417VHJ_aligned_mm2_chr8.bam
100%|██████████| 53.0k/53.0k [00:12<00:00, 4.12kreads/s, chr=KI270757.1]
INFO:skipped 113 reads aligned fraction of less than 0.75.
INFO:skipped 10940 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)
INFO:ignoring 2235 chimeric alignments with less than 2 reads
INFO:imported 40177 nonchimeric reads (including  14 chained chimeric alignments) and 73 chimeric reads with coverage of at least 2.
INFO:adding sample GM12878_b from file demonstration_dataset/ENCFF450VAU_aligned_mm2_chr8.bam
100%|██████████| 68.3k/68.3k [00:13<00:00, 5.16kreads/s, chr=KI270757.1]
INFO:skipped 72 reads aligned fraction of less than 0.75.
INFO:skipped 12598 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)
INFO:ignoring 1275 chimeric alignments with less than 2 reads
INFO:imported 54856 nonchimeric reads (including  12 c

| | name | file | group | nonchimeric_reads | chimeric_reads |
|---|---|---|---|---|---|
| 0 | GM12878_a | demonstration_dataset/ENCFF417VHJ_aligned_mm2_... | GM12878 | 40177 | 73 |
| 1 | GM12878_b | demonstration_dataset/ENCFF450VAU_aligned_mm2_... | GM12878 | 54856 | 7 |
| 2 | GM12878_c | demonstration_dataset/ENCFF694DIE_aligned_mm2_... | GM12878 | 72445 | 12 |
| 3 | K562_a | demonstration_dataset/ENCFF696GDL_aligned_mm2_... | K562 | 59121 | 281 |
| 4 | K562_b | demonstration_dataset/ENCFF429VVB_aligned_mm2_... | K562 | 76686 | 415 |
| 5 | K562_c | demonstration_dataset/ENCFF634YSN_aligned_mm2_... | K562 | 80338 | 369 |
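
The log output above reports alignments that were skipped because of the SAM flags 0x100, 0x200, and 0x400. The following sketch illustrates how such a bitmask test works in general. It is not the IsoTools implementation, and the constant and function names are chosen for illustration only.

```python
# SAM flag bits named in the log messages above.
SECONDARY = 0x100   # secondary alignment
QC_FAIL = 0x200     # alignment failed quality checks
DUPLICATE = 0x400   # PCR or optical duplicate
SKIP_MASK = SECONDARY | QC_FAIL | DUPLICATE


def is_skipped(flag):
    """Return True if any of the three skip bits is set in the SAM flag."""
    return bool(flag & SKIP_MASK)


def count_skipped(flags):
    """Count how many SAM flags in an iterable would be skipped."""
    return sum(1 for flag in flags if is_skipped(flag))
```

For example, a flag of 0x110 (secondary alignment on the reverse strand) is skipped, whereas 0x10 (reverse strand only) is kept.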


In the next step, we compute several quality control (QC) metrics for the transcripts:

* downstream A content, 
* direct repeat length at junctions, 
* noncanonical splicing, 
* potential fragments

This information is stored with the gene objects and can be accessed in downstream analyses.

In [5]:
# compute qc metrics
isoseq.add_qc_metrics(genome_fn)
# add ORF predictions
isoseq.add_orf_prediction(genome_fn)


100%|██████████| 10801/10801 [01:06<00:00, 162.33genes/s]
100%|██████████| 10801/10801 [00:49<00:00, 220.16genes/s]
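
To give an intuition for two of the QC metrics listed above, the sketch below computes the fraction of adenines in a window downstream of a transcript end and tests whether an intron carries the canonical GT-AG splice site dinucleotides. This is a simplified illustration under stated assumptions, not the IsoTools implementation: the function names and the window size of 30 bases are hypothetical, and sequences are assumed to be given in transcript orientation.

```python
def downstream_a_content(downstream_sequence, window=30):
    """Fraction of 'A' bases within the first `window` downstream bases."""
    region = downstream_sequence[:window].upper()
    if not region:
        return 0.0
    return region.count("A") / len(region)


def is_canonical_intron(intron_sequence):
    """True if the intron starts with GT and ends with AG."""
    seq = intron_sequence.upper()
    return len(seq) >= 4 and seq[:2] == "GT" and seq[-2:] == "AG"
```

A high downstream A content can point to internal priming, and introns without canonical dinucleotides deserve a closer look, since they may stem from alignment artifacts.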


## Transcriptome Export
The identified transcripts can be exported in pickle format, in GTF format, and as a table.
* Pickle is an internal Python format, used to save the entire transcriptome data so that it can be restored in an IsoTools session without reimporting the alignment files. Export to pickle is done with the [save](../isotoolsAPI.html?highlight=save#isotools.Transcriptome.save) method.
* GTF (Gene Transfer Format) is a file format commonly used in bioinformatics to represent annotated genomic features, such as gene models, exons, and introns. Exporting to this format is done with the [write_gtf](../isotoolsAPI.html?highlight=write_gtf#isotools.Transcriptome.write_gtf) method, and facilitates the use of the reconstructed transcriptome in several external tools.
* The transcripts can also be exported as a table containing coverage information, as well as additional features such as QC metrics, specified with the "extra_columns" parameter. The use of the [transcript_table](../isotoolsAPI.html?highlight=transcript_table#isotools.Transcriptome.transcript_table) method is explained in the API documentation.


To select the transcripts to be exported to GTF and to the table, we use the IsoTools filtering query syntax, which is explained in detail in a [separate tutorial](06_filtering.html#Filtering-tags-and-queries).
The GTF file can be compressed by adding '.gz' to the file name and setting gzip=True.
The table can be set to contain samplewise and/or groupwise coverage information by setting the sample and the groups parameters. Here we sum the read counts per group.


In [6]:
# export the transcriptome object for later use.
isoseq.save(f'{path}/PacBio_isotools.pkl')
# to load the data in the next session, use
# isoseq=Transcriptome.load(f'{path}/PacBio_isotools.pkl')

INFO:saving transcriptome to demonstration_dataset/PacBio_isotools.pkl


In [7]:
# export gtf:
query_string = 'SUBSTANTIAL and not (NOVEL_TRANSCRIPT and UNSPLICED)'
isoseq.write_gtf(f'{path}/demonstration_dataset_substantial_transcripts.gtf.gz',
                 source='isoseq', min_coverage=5,
                 gzip=True,
                 query=query_string)


INFO:writing gzip compressed gtf file to demonstration_dataset/demonstration_dataset_substantial_transcripts.gtf.gz
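
As a quick sanity check of the exported file, the gzip-compressed GTF can be read back with the Python standard library. The sketch below counts records per feature type (third GTF column). The helper `count_gtf_features` is hypothetical, and which feature types appear depends on what the export writes.

```python
import gzip
from collections import Counter


def count_gtf_features(gtf_gz_path):
    """Count records per feature type in a gzip-compressed GTF file."""
    counts = Counter()
    with gzip.open(gtf_gz_path, "rt") as handle:
        for line in handle:
            # Skip comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 9:  # a GTF record has nine tab-separated columns
                counts[fields[2]] += 1
    return counts
```

For the file written above, this would be called as `count_gtf_features(f'{path}/demonstration_dataset_substantial_transcripts.gtf.gz')`.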


In [8]:
# export transcript table with the same filter criteria:
transcript_tab=isoseq.transcript_table(groups=isoseq.groups(), tpm=True, coverage=True,
                                       min_coverage=5, progress_bar=True,
                                       query=query_string)
# write to csv file
transcript_tab.to_csv(f'{path}/demonstration_dataset_substantial_transcripts.csv',
                      index=False, sep='\t')
# show the first lines
transcript_tab.head()

100%|██████████| 10801/10801 [00:00<00:00, 22686.76genes/s]


| | chr | transcript_start | transcript_end | strand | gene_id | gene_name | transcript_nr | transcript_length | num_exons | exon_starts | exon_ends | novelty_class | novelty_subclasses | GM12878_sum_coverage | K562_sum_coverage | GM12878_sum_tpm | K562_sum_tpm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr8 | 15540261 | 15764645 | + | ENSG00000104723.21 | TUSC3 | 16 | 1640 | 10 | 15540261,15623079,15650696,15659506,15662155,1... | 15540568,15623249,15650814,15659647,15662296,1... | FSM | FSM | 0 | 226 | 0.0 | 1045.594393 |
| 1 | chr8 | 15540261 | 15764645 | + | ENSG00000104723.21 | TUSC3 | 17 | 1705 | 11 | 15540261,15623079,15650696,15659506,15662155,1... | 15540568,15623249,15650814,15659647,15662296,1... | FSM | FSM | 0 | 52 | 0.0 | 240.579241 |
| 2 | chr8 | 15540261 | 15764645 | + | ENSG00000104723.21 | TUSC3 | 18 | 1549 | 9 | 15540261,15623079,15650696,15659506,15662155,1... | 15540568,15623249,15650814,15659647,15662296,1... | FSM | FSM | 0 | 6 | 0.0 | 27.759143 |
| 3 | chr8 | 15540261 | 15758273 | + | ENSG00000104723.21 | TUSC3 | 20 | 1441 | 10 | 15540261,15623079,15650696,15659506,15662155,1... | 15540568,15623249,15650814,15659647,15662296,1... | NIC | exon skipping | 0 | 20 | 0.0 | 92.530477 |
| 4 | chr8 | 15540261 | 15758273 | + | ENSG00000104723.21 | TUSC3 | 21 | 1506 | 11 | 15540261,15623079,15650696,15659506,15662155,1... | 15540568,15623249,15650814,15659647,15662296,1... | NIC | novel combination | 0 | 20 | 0.0 | 92.530477 |
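
The TPM columns can be related to the coverage columns with a short calculation. The sketch below assumes that TPM here means reads per million nonchimeric reads of the group. This assumption reproduces the K562 values shown above, but it is an observation from these numbers, not a statement about the library's internals. The function name `tpm` is hypothetical.

```python
def tpm(read_count, total_reads):
    """Scale a read count to transcripts per million of the group total."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return read_count / total_reads * 1_000_000


# Nonchimeric read counts of the three K562 samples, from the sample table.
k562_total = 59121 + 76686 + 80338

# TUSC3 transcript 16 has a K562 coverage of 226 reads.
k562_tpm = tpm(226, k562_total)  # about 1045.59, as in K562_sum_tpm
```

The same calculation with 52 reads gives about 240.58, in agreement with the second row of the table.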
