# Transcriptome Reconstruction
This tutorial processes the example dataset, which is based on PacBio Iso-Seq samples of hematopoietic cells from ENCODE. The dataset contains only a subset of genomic regions, so the demonstration tutorials run quickly.
All data files required to run the tutorials can be obtained here: [download link](https://oc-molgen.gnz.mpg.de/owncloud/s/gjG9EPiQwpRAyg3).

You will need:

* sample description file 'encode_samples.tsv'
* six .bam alignment files for the six samples
* six corresponding .bam.bai indices
* reference annotation file gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz
* corresponding .gff3.gz.tbi index file
* genomic reference file GRCh38.p13.genome_chr8.fa
* corresponding .fai index file

All files are assumed to be stored in a subfolder called 'demonstration_dataset'.
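
Before starting, it can help to confirm that the download is complete. The following is a minimal sketch using only the Python standard library; it is not part of IsoTools. The function name `check_dataset` is hypothetical, and the index file name `GRCh38.p13.genome_chr8.fa.fai` is an assumption based on the usual naming of FASTA indices. The BAM files are matched by extension, because their exact names are given in the sample table rather than in the list above.

```python
from pathlib import Path

# Names from the list above. The .fa.fai name is an assumption (usual FASTA index naming).
REQUIRED_FILES = [
    "encode_samples.tsv",
    "gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz",
    "gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz.tbi",
    "GRCh38.p13.genome_chr8.fa",
    "GRCh38.p13.genome_chr8.fa.fai",
]


def check_dataset(folder, n_samples=6):
    """Return a list of problems found in `folder`; an empty list means all is well."""
    folder = Path(folder)
    problems = [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (folder / name).is_file()
    ]
    bams = sorted(folder.glob("*.bam"))
    if len(bams) != n_samples:
        problems.append(f"expected {n_samples} .bam files, found {len(bams)}")
    # every alignment file needs an index next to it, named <file>.bam.bai
    for bam in bams:
        if not Path(str(bam) + ".bai").is_file():
            problems.append(f"missing index for {bam.name}")
    return problems
```

Calling `check_dataset('demonstration_dataset')` should return an empty list if all files are in place.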

In this tutorial, we import the reference annotation, specify the alignment files for the samples, and integrate the data into a common data structure. During this step, the transcriptome is reconstructed, and quality control metrics are computed. 


In [1]:
# preparation: import the libraries
from isotools import Transcriptome
from isotools import __version__ as isotools_version
import pandas as pd
import matplotlib.pyplot as plt
import logging
# set up logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)
logger=logging.getLogger('isotools')
logger.info(f'This is isotools version {isotools_version}')


path='demonstration_dataset'


INFO:This is isotools version 0.3.2


## Import of reference annotation
The first step is to import the reference annotation from a gff or gtf file with the [Transcriptome.from_reference](../isotoolsAPI.html?highlight=from_reference#isotools.Transcriptome.from_reference) class method. The input file should be sorted and indexed with tabix. 
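
If you use your own annotation file, you may want to check its ordering before indexing it. The sketch below is not part of IsoTools, and the function name `first_unsorted_line` is hypothetical. It assumes the ordering that tabix expects: records of one sequence form a contiguous block, and start positions do not decrease within a block. It checks the ordering only, not whether an index exists.

```python
import gzip


def first_unsorted_line(gff_path):
    """Return the 1-based line number of the first out-of-order record, or None."""
    finished = set()  # sequence names whose block has already ended
    current, last_start = None, 0
    with gzip.open(gff_path, "rt") as handle:
        for line_no, line in enumerate(handle, start=1):
            if line.startswith("#") or not line.strip():
                continue  # skip header, comment, and empty lines
            fields = line.rstrip("\n").split("\t")
            seqid, start = fields[0], int(fields[3])
            if seqid != current:
                if seqid in finished:
                    return line_no  # records of this sequence are not contiguous
                if current is not None:
                    finished.add(current)
                current, last_start = seqid, start
            elif start < last_start:
                return line_no  # start position decreases within a sequence
            else:
                last_start = start
    return None
```

A return value of `None` means that no ordering problem was found.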

In [2]:
annotation_fn=f'{path}/gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz'
#create the IsoTools transcriptome object from the reference annotation
isoseq=Transcriptome.from_reference(annotation_fn)


INFO:importing reference from gff3 file demonstration_dataset/gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz
100%|█████████▉| 2.82M/2.82M [00:02<00:00, 1.38MB/s]
INFO:skipped the following categories: {'CDS', 'five_prime_UTR', 'three_prime_UTR'}


## Import of sequencing data
Next, a sample table containing the sample names, the corresponding file names, and the group assignments is imported.
This information is used in the next step to reconstruct the transcripts for each sample and integrate them with the transcriptome object, using the [add_sample_from_bam](../isotoolsAPI.html?highlight=add_sample_from_bam#isotools.Transcriptome.add_sample_from_bam) method.
When all samples are added, quality control metrics are calculated by calling [Transcriptome.add_qc_metrics](../isotoolsAPI.html?highlight=add_qc_metrics#isotools.Transcriptome.add_qc_metrics).
Finally, the object with all data is stored on disk in a pickle file for later use with the [save](../isotoolsAPI.html?highlight=save#isotools.Transcriptome.save) method.
In another session, the object can be restored with the class method [Transcriptome.load](../isotoolsAPI.html?highlight=load#isotools.Transcriptome.load).

In [3]:
sample_fn=f'{path}/encode_samples.tsv'
genome_fn=f'{path}/GRCh38.p13.genome_chr8.fa'

samples=pd.read_csv(sample_fn, sep='\t')
samples.file_name=path+'/'+samples.file_name
samples

| Unnamed: 0 | sample_name | file_name | group |
|---|---|---|---|
| 0 | GM12878_a | demonstration_dataset/ENCFF417VHJ_aligned_mm2_... | GM12878 |
| 1 | GM12878_b | demonstration_dataset/ENCFF450VAU_aligned_mm2_... | GM12878 |
| 2 | GM12878_c | demonstration_dataset/ENCFF694DIE_aligned_mm2_... | GM12878 |
| 3 | K562_a | demonstration_dataset/ENCFF429VVB_aligned_mm2_... | K562 |
| 4 | K562_b | demonstration_dataset/ENCFF696GDL_aligned_mm2_... | K562 |
| 5 | K562_c | demonstration_dataset/ENCFF634YSN_aligned_mm2_... | K562 |
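
When preparing a sample table for your own data, a few sanity checks can prevent errors during import. The following sketch uses only the standard library instead of pandas and is not part of IsoTools. The function names are hypothetical, and the column names `sample_name`, `file_name`, and `group` are taken from the table shown above.

```python
import csv
from collections import Counter

REQUIRED_COLUMNS = {"sample_name", "file_name", "group"}


def read_sample_table(handle, folder):
    """Read a tab-separated sample table and prefix each file name with `folder`."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = list(reader)
    counts = Counter(row["sample_name"] for row in rows)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate sample names: {duplicates}")
    for row in rows:
        row["file_name"] = f"{folder}/{row['file_name']}"
    return rows


def samples_per_group(rows):
    """Map each group name to the list of its sample names, in table order."""
    groups = {}
    for row in rows:
        groups.setdefault(row["group"], []).append(row["sample_name"])
    return groups
```

With an open file handle for 'encode_samples.tsv', `samples_per_group(read_sample_table(handle, path))` lists the samples of each group.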


In [None]:
# integrate the samples
for _, row in samples.iterrows():
    # this step takes about 5-30 seconds per sample
    isoseq.add_sample_from_bam(row.file_name, sample_name=row.sample_name, group=row.group)
# the sample table of the transcriptome object contains the number of imported reads
isoseq.sample_table

INFO:adding sample GM12878_a from file demonstration_dataset/ENCFF417VHJ_aligned_mm2_chr8.bam
100%|██████████| 53.0k/53.0k [00:12<00:00, 4.14kreads/s, chr=KI270757.1]
INFO:skipped 110 reads aligned fraction of less than 0.75.
INFO:skipped 10972 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)
INFO:ignoring 2231 chimeric alignments with less than 2 reads
INFO:imported 40182 nonchimeric reads (including  14 chained chimeric alignments) and 73 chimeric reads with coverage of at least 2.
INFO:adding sample GM12878_b from file demonstration_dataset/ENCFF450VAU_aligned_mm2_chr8.bam
100%|██████████| 68.4k/68.4k [00:13<00:00, 5.01kreads/s, chr=KI270757.1]
INFO:skipped 71 reads aligned fraction of less than 0.75.
INFO:skipped 12700 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)
INFO:ignoring 1273 chimeric alignments with less than 2 reads
INFO:imported 54853 nonchimeric reads (including  12 c

In the next step, we compute several QC metrics for the transcripts:

* downstream A content
* direct repeat length at junctions
* noncanonical splicing
* potential fragments

This information is stored with the gene objects and can be accessed in downstream analyses.
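
To give an intuition for two of these metrics, the following sketch computes them on a plain sequence string. It is an illustration only and not the IsoTools implementation: the function names, the coordinate conventions, and the window size of 30 bases are assumptions. A high downstream A content can hint at internal priming, and introns without the canonical GT/AG splice sites are counted as noncanonical.

```python
def downstream_a_content(genome_seq, end, strand="+", window=30):
    """Fraction of A bases in the `window` bases downstream of a transcript end.

    `end` is the 0-based position just past the last transcribed base on the "+" strand,
    or the 0-based position of the first transcribed base on the "-" strand.
    """
    if strand == "+":
        segment = genome_seq[end:end + window].upper()
        base = "A"
    else:
        # downstream lies to the left, and A appears as T on the forward strand
        segment = genome_seq[max(0, end - window):end].upper()
        base = "T"
    return segment.count(base) / len(segment) if segment else 0.0


def is_canonical_intron(genome_seq, start, end, strand="+"):
    """True if the intron genome_seq[start:end] has GT/AG splice sites.

    On the "-" strand, the forward sequence of a canonical intron reads CT...AC.
    """
    first = genome_seq[start:start + 2].upper()
    last = genome_seq[end - 2:end].upper()
    expected = ("GT", "AG") if strand == "+" else ("CT", "AC")
    return (first, last) == expected
```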

In [None]:
# compute qc metrics
isoseq.add_qc_metrics(genome_fn)

## Transcriptome Export
The identified transcripts can be exported in pickle format, in GTF format, and as a table.
* Pickle is an internal Python format, used to save the entire transcriptome data so that it can be restored in an IsoTools session without reimporting the alignment files. Export to pickle is done with the [save](../isotoolsAPI.html?highlight=save#isotools.Transcriptome.save) method.
* GTF (Gene Transfer Format) is a file format commonly used in bioinformatics to represent annotated genomic features, such as gene models, exons, and introns. Exporting to this format is done with the [write_gtf](../isotoolsAPI.html?highlight=write_gtf#isotools.Transcriptome.write_gtf) method and facilitates the use of the reconstructed transcriptome in external tools.
* The transcripts can also be exported as a table containing coverage information as well as additional features, such as QC metrics, specified with the "extra_columns" parameter. The use of the [transcript_table](../isotoolsAPI.html?highlight=transcript_table#isotools.Transcriptome.transcript_table) method is explained in the API documentation.


To select the transcripts to be exported to GTF and to the table, we use the IsoTools filtering query syntax, which is explained in detail in a [separate tutorial](06_filtering.html#Filtering-tags-and-queries).
The GTF file can be compressed by adding '.gz' to the filename and setting gzip=True.
The table can contain samplewise and/or groupwise coverage information, by setting the sample and the groups parameters. Here we sum the read counts per group.
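
The groupwise summation can be pictured with a small sketch. It is not the IsoTools implementation: the function names are hypothetical, the read counts are made up, and the TPM normalization shown here (scaling counts to sum to one million) is an assumption about what such a table column typically contains.

```python
from collections import defaultdict


def sum_by_group(sample_counts, sample_groups):
    """Sum per-sample read counts of one transcript into per-group read counts."""
    totals = defaultdict(int)
    for sample, count in sample_counts.items():
        totals[sample_groups[sample]] += count
    return dict(totals)


def to_tpm(counts):
    """Scale the read counts of all transcripts so that they sum to one million."""
    total = sum(counts)
    return [count / total * 1e6 if total else 0.0 for count in counts]


# made-up read counts of a single transcript in three samples
sample_groups = {"GM12878_a": "GM12878", "GM12878_b": "GM12878", "K562_a": "K562"}
group_counts = sum_by_group({"GM12878_a": 3, "GM12878_b": 4, "K562_a": 5}, sample_groups)
```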


In [None]:
# export the transcriptome object for later use. 
isoseq.save('PacBio_isotools_substantial_isotools.pkl')
# to load the data in the next session, use 
# isoseq=Transcriptome.load('PacBio_isotools_substantial_isotools.pkl')

In [None]:
# export gtf:
isoseq.write_gtf(f'{path}/demonstration_dataset_substantial_transcripts.gtf', source='isoseq', min_coverage=5, gzip=False, query='SUBSTANTIAL and not (NOVEL_TRANSCRIPT and UNSPLICED)')
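
The query string used in the export calls combines filter tags with boolean operators. To illustrate how such an expression selects transcripts, the sketch below evaluates it against a set of tag values. It is not the IsoTools implementation: the function name `matches` is hypothetical, and the tag values are made up.

```python
def matches(query, tags):
    """Evaluate a boolean tag expression; `tags` maps tag names to True or False."""
    code = compile(query, "<query>", "eval")
    unknown = set(code.co_names) - set(tags)
    if unknown:
        raise KeyError(f"unknown tags: {sorted(unknown)}")
    # no builtins are exposed to the expression, only the tag values
    return bool(eval(code, {"__builtins__": {}}, dict(tags)))


query = "SUBSTANTIAL and not (NOVEL_TRANSCRIPT and UNSPLICED)"
# made-up tag values for two transcripts
novel_spliced = {"SUBSTANTIAL": True, "NOVEL_TRANSCRIPT": True, "UNSPLICED": False}
novel_unspliced = {"SUBSTANTIAL": True, "NOVEL_TRANSCRIPT": True, "UNSPLICED": True}
```

Here, `matches(query, novel_spliced)` is true, while `matches(query, novel_unspliced)` is false, because the second transcript is both novel and unspliced.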


In [None]:

# export transcript table with the same filter criteria:
transcript_tab=isoseq.transcript_table(groups=isoseq.groups(), tpm=True, coverage=True, min_coverage=5, progress_bar=True, query='SUBSTANTIAL and not (NOVEL_TRANSCRIPT and UNSPLICED)')
# write to csv file
transcript_tab.to_csv(f'{path}/demonstration_dataset_substantial_transcripts.csv', index=False, sep='\t')
# show the first lines
transcript_tab.head()

