# Transcriptome Import
This tutorial demonstrates the import of a transcriptome from a transcript table and a gtf file. The transcript table contains the read counts per sample and transcript, and the gtf file describes the transcript models, gene to transcript relationships and additional properties. Both files can be produced by IsoTools or by any other tool.
This functionality is useful to integrate external tools for transcriptome reconstruction in IsoTools, or to compare different tools for transcriptome reconstruction. Importantly, all transcript metadata from the gtf file is imported and stored in the transcriptome object.
The import functionality is implemented in the method [add_sample_from_csv](../isotoolsAPI.html?highlight=from_reference#isotools.Transcriptome.add_sample_from_csv) of the `isotools.Transcriptome` class.

The transcriptome gtf should contain a transcript entry for each transcript with long read coverage, but may also contain additional transcripts for reference. The transcript table contains one row per transcript, with transcript id, and a column for each sample specifying the number of long reads. 
All additional tags from the gtf info field get imported, and can be used within isotools for subsequent analysis. 
All files used in the tutorial can be obtained here: ([download link](https://nc.molgen.mpg.de/cloud/index.php/s/zYe7g6qnyxGDxRd)). 


You will need:

* reference annotation file (`*.gff3.gz`) and corresponding index file (`.tbi`)
* genomic reference file `genome.fa` and corresponding `.fai` index file
* gtf file with the long read transcripts (e.g. from an external tool)
* corresponding table with the **number of long reads per transcript** for the samples to be added. This is a csv file, with one column for each sample. The sample names are specified in the header (first line). Row names (first column) must correspond to transcript names from the gtf file.
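
Before importing, it can help to confirm that the row names of the table match the transcript ids in the gtf file. The following is a minimal sketch using only the Python standard library. The inline gtf and csv text, the ids (`T1`, `T2`, `T3`) and the sample names (`sampleA`, `sampleB`) are made up for illustration, and the helper functions are not part of IsoTools.

```python
import csv
import io
import re

# Toy inputs (hypothetical ids and counts), standing in for the real files.
GTF_TEXT = (
    'chr8\ttool\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    'chr8\ttool\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    'chr8\ttool\texon\t600\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    'chr8\ttool\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n'
    'chr8\ttool\texon\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n'
)
CSV_TEXT = "transcript_id,sampleA,sampleB\nT1,12,3\nT2,0,7\nT3,5,5\n"


def gtf_transcript_ids(gtf_text):
    """Collect the transcript_id of every 'transcript' feature in gtf text."""
    ids = set()
    for line in gtf_text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip empty lines and comments
        fields = line.split('\t')
        if len(fields) < 9 or fields[2] != 'transcript':
            continue  # only transcript entries, not exons
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def table_samples_and_ids(csv_text):
    """Return the sample names (header) and row names (first column)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return header[1:], [row[0] for row in reader if row]


samples, table_ids = table_samples_and_ids(CSV_TEXT)
missing = sorted(set(table_ids) - gtf_transcript_ids(GTF_TEXT))
print('samples:', samples)
print('table rows without a gtf transcript:', missing)
```

In this toy input, `T3` has read counts but no transcript entry in the gtf text, so it is reported as missing. For real files, the same functions could be fed the file contents, decompressing with `gzip.open` if the gtf is compressed.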

For demonstration, we use the gtf file and transcript table exported from isotools in the [previous tutorial](03_transcriptome_reconstruction.html). 
Note that in this table we exported not only the read counts, but also TPM values and further information on the transcripts. Therefore, we specify which columns contain the read counts (here, the columns with the suffix `_sum_coverage`).
Remember that we exported the sum coverage, so instead of individual replicates, we are adding the pooled samples. 
Also we filtered the transcripts, hence only a subset of transcripts will be imported. 
All files are assumed to be stored in a subfolder called 'demonstration_dataset'.
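
When a table has many samples, the mapping from sample name to read count column can be derived from the header instead of being typed by hand. This is a small sketch under assumptions: the header below is a toy example in which only `GM12878_sum_coverage` and `K562_sum_coverage` are taken from this tutorial, the other column names are hypothetical, and `coverage_columns` is an illustrative helper rather than an IsoTools function.

```python
import csv
import io

# Toy header: the *_sum_coverage columns appear in this tutorial,
# the remaining column names are hypothetical.
HEADER_TEXT = (
    "transcript_id,gene_name,"
    "GM12878_sum_coverage,GM12878_sum_tpm,"
    "K562_sum_coverage,K562_sum_tpm\n"
)


def coverage_columns(header, suffix='_sum_coverage', label='_pooled'):
    """Map '<sample><label>' to every column named '<sample><suffix>'."""
    mapping = {}
    for col in header:
        if col.endswith(suffix):
            sample = col[:-len(suffix)]  # strip the suffix to get the sample
            mapping[sample + label] = col
    return mapping


header = next(csv.reader(io.StringIO(HEADER_TEXT)))
read_count_cols = coverage_columns(header)
print(read_count_cols)
```

The resulting dictionary has the sample names as keys and the column names as values, which is the shape of the `read_count_cols` dictionary passed as `sample_cov_cols` in this tutorial.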

## Reference Gene and Transcript to Gene Assignment
Transcript to gene assignment is either taken from the `transcripts_file`, or recreated, as specified by the `reconstruct_genes` parameter. In the first case, the genes are matched to overlapping genes from the reference annotation by gene id. In the absence of an overlapping gene with the same id, the assignment falls back to "recreate". In that case, the gene is matched to existing genes by splice junction, and renamed accordingly. A map reflecting the renaming is returned as a dictionary. Transcripts without a matching existing gene constitute a new gene.
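
The decision order described above can be illustrated with a toy model. This is not the IsoTools implementation: the data structures, the gene ids (`ENSG_A`, `TOOL_G7`, `TOOL_G8`), the coordinates, and the flat old-id to new-id dictionary are all assumptions made for this sketch, and the real return value of `add_sample_from_csv` may be structured differently.

```python
def overlaps(a, b):
    """True if two features share chromosome, strand and at least one base."""
    return (a['chrom'] == b['chrom'] and a['strand'] == b['strand']
            and a['start'] < b['end'] and b['start'] < a['end'])


def assign_gene(transcript, genes):
    """Toy assignment: id match first, then splice junction, else new gene."""
    same_id = genes.get(transcript['gene_id'])
    if same_id is not None and overlaps(transcript, same_id):
        return transcript['gene_id'], 'matched by id'
    for gene_id, gene in genes.items():
        # fallback: an overlapping gene sharing at least one splice junction
        if overlaps(transcript, gene) and transcript['junctions'] & gene['junctions']:
            return gene_id, 'matched by splice junction'
    return transcript['gene_id'], 'new gene'


# Hypothetical reference gene with a single splice junction (300, 600)
genes = {
    'ENSG_A': dict(chrom='chr8', strand='+', start=100, end=1000,
                   junctions={(300, 600)}),
}
# Hypothetical imported transcripts
transcripts = [
    dict(gene_id='ENSG_A', chrom='chr8', strand='+', start=100, end=900,
         junctions={(300, 600)}),
    dict(gene_id='TOOL_G7', chrom='chr8', strand='+', start=150, end=950,
         junctions={(300, 600)}),
    dict(gene_id='TOOL_G8', chrom='chr8', strand='+', start=5000, end=6000,
         junctions={(5200, 5500)}),
]

id_map = {}   # old gene id -> new gene id, for renamed genes only
results = []
for transcript in transcripts:
    new_id, how = assign_gene(transcript, genes)
    results.append((transcript['gene_id'], new_id, how))
    if new_id != transcript['gene_id']:
        id_map[transcript['gene_id']] = new_id

for old_id, new_id, how in results:
    print(f'{old_id} -> {new_id} ({how})')
print('renamed:', id_map)
```

In this toy case, the first transcript keeps its gene id, the second is renamed to the reference gene it shares a splice junction with, and the third forms a new gene because nothing overlaps it.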



In [3]:
# preparation: import the libraries
from isotools import Transcriptome
from isotools import __version__ as isotools_version
import pandas as pd
import matplotlib.pyplot as plt
import logging
# set up logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)
logger=logging.getLogger('isotools')
logger.info(f'This is isotools version {isotools_version}')

path='demonstration_dataset'


INFO:This is isotools version 0.3.4


In [4]:
annotation_fn=f'{path}/gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz'
# create the IsoTools transcriptome object from the reference annotation
isoseq=Transcriptome.from_reference(annotation_fn)

#specify the columns with the read counts per transcript
read_count_cols={'GM12878_pooled':'GM12878_sum_coverage','K562_pooled':'K562_sum_coverage'}

# add the transcripts from the gtf file
id_map=isoseq.add_sample_from_csv(
    f'{path}/demonstration_dataset_substantial_transcripts.csv',
    transcripts_file=f'{path}/demonstration_dataset_substantial_transcripts.gtf.gz',
    sample_cov_cols=read_count_cols,
    reconstruct_genes=False
)
# now we want to add qc metrics
genome_fn=f'{path}/GRCh38.p13.genome_chr8.fa'
isoseq.add_qc_metrics(genome_fn)

isoseq.sample_table

INFO:importing reference from gff3 file demonstration_dataset/gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz
100%|█████████▉| 2.70M/2.70M [00:02<00:00, 1.23MB/s]
INFO:skipped the following categories: {'three_prime_UTR', 'five_prime_UTR', 'CDS'}
INFO:adding samples "GM12878_pooled", "K562_pooled" from csv
100%|██████████| 2609/2609 [00:09<00:00, 270.23genes/s]


|   | name | file | group | nonchimeric_reads | chimeric_reads |
|---|------|------|-------|-------------------|----------------|
| 0 | GM12878_pooled | demonstration_dataset/demonstration_dataset_su... | GM12878_pooled | 125956 | 0 |
| 0 | K562_pooled | demonstration_dataset/demonstration_dataset_su... | K562_pooled | 143973 | 0 |
