# RNA-seq analysis pipeline

The pipeline presented in this notebook will require the following:

1. __requests__ to download SRA files
2. __fasterq-dump__ from the SRA-toolkit to generate .fastq files
3. __FastQC__ to perform quality control
4. __Salmon__ for alignment-free transcript abundance quantification
5. __MultiQC__ to generate a QC report and check whether any low-quality samples should be excluded from downstream analysis
6. __tximport__ to aggregate transcript counts and produce gene-level count matrices and normalizing offsets
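
Before running the notebook, it can help to confirm that the command-line tools are installed. The sketch below is only a convenience check: the executable names are the ones called later in this notebook, and `find_missing_tools` is a hypothetical helper, not part of any of these packages.

```python
import shutil

# Command-line tools invoked later in this notebook.
REQUIRED_TOOLS = ["fasterq-dump", "fastqc", "salmon", "multiqc", "Rscript"]

def find_missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools were found on PATH")
```

Note that tximport is an R package, so this check only verifies that `Rscript` is available, not that the package is installed.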

In [104]:
#Imports
import os, sys, IPython, re
import ftplib, requests, shlex
import subprocess as sp
import numpy as np
import pandas as pd
from progressbar import *

In [105]:
#Sound alert
from IPython.display import Audio
sound_file = 'test.wav' #sound to play
#code to add if you want a sound alert
Audio(sound_file, autoplay=True)

## Information about the study:

### *Human MAIT cells exit peripheral tissues and re-circulate via lymph in steady state conditions: QC and mapping*

Mucosal-associated invariant T (MAIT) cells recognize bacterial metabolites as antigen and are found in blood and tissues, where they are poised to contribute to barrier immunity. Recent data demonstrate that MAIT cells located in mucosal barrier tissues are functionally distinct from their blood counterparts, but the relationship and circulation of MAIT cells between blood and different tissue compartments remains poorly understood. Previous studies raised the possibility that MAIT cells do not leave tissue and may either be retained or undergo apoptosis. To directly address if human MAIT cells exit tissues, we collected human donor-matched thoracic duct lymph and blood and analyzed MAIT cell phenotype, transcriptome and TCR diversity by RNAseq.

__Methods__

MAIT cells from cryopreserved PBMCs and lymph of 5 patients were sorted and processed with the SMARTseq v4 kit (Clontech). After cDNA amplification, sequencing libraries were prepared using the Nextera XT DNA Library Preparation Kit (Illumina). Barcoded libraries were pooled and quantified using a Qubit Fluorometer (Invitrogen). Single-read sequencing of the pooled libraries was carried out on a HiSeq2500 sequencer (Illumina) with 58-base reads, either using TruSeq v4 or Rapid Run v2 Cluster and SBS kits (Illumina).

The RNA-seq data were aligned to the human genome (UCSC Human Genome Assembly [http://hgdownload.cse.ucsc.edu/downloads.html#human], reference sequence GRCh38) using STAR (v2.4.2a) (42), and gene expression quantification was performed using RSEM (v1.2.22) (43). Genes with less than 5 nonzero read counts were discarded, leaving 15,719 expressed genes for the analysis. Libraries (samples) with less than 200,000 reads; 12,000 detected genes; and an exon rate <60% were also removed. Fourteen of the 18 prepared libraries (from 4 of 5 patients) passed these quality criteria.

The published study is available here: https://insight.jci.org/articles/view/98487#SEC4

GEO data can be accessed here: <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE106288>

Information can also be found on the SRA Run Selector:

In [106]:
url = 'https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA416130'
IPython.display.IFrame(url, width=800, height=350)

## Get SRA files

* First we need to download the RunInfo file to extract all SRR numbers

In [107]:
#SRP project number and corresponding cgi request
srp = "SRP122527"
cgi_url = 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=' +srp
SraRunInfo = pd.read_csv(cgi_url)
SraRunInfo.head()

Unnamed: 0,Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,...,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
0,SRR6231076,2018-03-26 22:10:20,2017-10-27 14:23:47,7877339,447389981,0,56,178,,https://sra-download.ncbi.nlm.nih.gov/traces/s...,...,,,,,GEO,SRA625701,,public,EB86E7B9D8A73328402599ABF1FB9C0B,CB6247FA5DB8A10DF70A6FFE074B9F12
1,SRR6231077,2018-03-26 22:10:20,2017-10-27 14:25:39,9444712,536343465,0,56,210,,https://sra-download.ncbi.nlm.nih.gov/traces/s...,...,,,,,GEO,SRA625701,,public,DE26016A65445953E09474BE585E0DFE,5A18139E9F96055BD80328EA8D38CA72
2,SRR6231078,2018-03-26 22:10:20,2017-10-27 14:24:16,6979362,396332025,0,56,154,,https://sra-download.ncbi.nlm.nih.gov/traces/s...,...,,,,,GEO,SRA625701,,public,DD9562DD1CA4B76E128BC71D61B07642,DDD11529F85680C5CC28090CF95F7632
3,SRR6231079,2018-03-26 22:10:20,2017-10-27 14:24:46,8515581,483664476,0,56,189,,https://sra-download.ncbi.nlm.nih.gov/traces/s...,...,,,,,GEO,SRA625701,,public,F24F1FA07B4B4EFC1DB7B088D6FE99E3,25F34F4E5FEC032B0BD526942BE63FF5
4,SRR6231080,2018-03-26 22:10:20,2017-10-27 14:24:39,8927960,507079642,0,56,199,,https://sra-download.ncbi.nlm.nih.gov/traces/s...,...,,,,,GEO,SRA625701,,public,F6471119A5BC2922C7831F08050C0D22,45B5C9B20E60D1D158D59DA17D1112FD
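
As a minimal sketch of how the run accessions can be pulled out of such a table, the example below parses a two-row excerpt that reuses the column names shown above. Treating `spots_with_mates == 0` as single-end is our assumption for illustration, and the `layout` column is a hypothetical addition.

```python
import io
import pandas as pd

# A tiny excerpt with the same column names as the RunInfo table above.
runinfo_csv = io.StringIO(
    "Run,spots,spots_with_mates,avgLength\n"
    "SRR6231076,7877339,0,56\n"
    "SRR6231077,9444712,0,56\n"
)
runinfo = pd.read_csv(runinfo_csv)

# Run accessions (SRR numbers) used to name the downloaded files.
srr_ids = runinfo["Run"].tolist()

# Assumption: a run without mated spots is treated as single-end.
runinfo["layout"] = ["paired" if n > 0 else "single" for n in runinfo["spots_with_mates"]]
print(srr_ids)
print(runinfo[["Run", "layout"]])
```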


* The SRA files can then be downloaded from the NCBI servers, using the URLs in the `download_path` column:

In [108]:
# Create folders to store the data
SRADIR = 'data/SRA'
try:
    os.makedirs(SRADIR)
except FileExistsError:
    # directory already exists
    pass

In [109]:
def download_file(url, local_filename):
    # Stream the response so large files are not held fully in memory
    response = requests.get(url, stream=True)
    response.raise_for_status()  # fail early instead of saving an error page
    filesize = response.headers.get('content-length')

    with open(local_filename, 'wb') as f:
        if filesize is None:
            f.write(response.content)
        else:
            # content-length is reported in bytes
            print('*** Downloading {} (size {:.1f} MB) ***'.format(local_filename, int(filesize)/1e6))
            chunk_size = max(int(int(filesize)/1000), 1024*1024)
            for chunk in log_progress(response.iter_content(chunk_size=chunk_size), every=1):
                f.write(chunk)

    return local_filename

In [7]:
for url in log_progress(SraRunInfo.download_path):
    local_filename = os.path.join(SRADIR,url.split('/')[-1]+'.sra')
    if not os.path.exists(local_filename):
        download_file(url, local_filename)
    else:
        print(local_filename, 'already downloaded')

VBox(children=(HTML(value=''), IntProgress(value=0, max=14)))

data/SRA/SRR6231076.sra already downloaded
data/SRA/SRR6231077.sra already downloaded
data/SRA/SRR6231078.sra already downloaded
data/SRA/SRR6231079.sra already downloaded
data/SRA/SRR6231080.sra already downloaded
data/SRA/SRR6231081.sra already downloaded
data/SRA/SRR6231082.sra already downloaded
data/SRA/SRR6231083.sra already downloaded
data/SRA/SRR6231084.sra already downloaded
data/SRA/SRR6231085.sra already downloaded
data/SRA/SRR6231086.sra already downloaded
data/SRA/SRR6231087.sra already downloaded
data/SRA/SRR6231088.sra already downloaded
data/SRA/SRR6231089.sra already downloaded


In [110]:
## Check the downloaded SRA files
sra_files = [f for f in os.listdir(SRADIR) if f.endswith('.sra')]
for f in sra_files: #Calculate size for all files here.
    f= os.path.join(SRADIR,f)
    print("file: {:30s} size: {:d} bytes".format(f,os.stat(f).st_size))

file: data/SRA/SRR6231076.sra        size: 187451954 bytes
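
A quick sanity check is to compare the size on disk with the `size_MB` column of the RunInfo table. The sketch below uses the two values printed above for SRR6231076; that `size_MB` counts units of 1024 × 1024 bytes and is rounded down is an inference from this single file, not documented behavior. `bytes_to_mib` is a hypothetical helper.

```python
def bytes_to_mib(n_bytes):
    """Convert a size in bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)

# Values copied from the outputs above for SRR6231076.
size_on_disk = 187451954   # bytes, from os.stat
size_in_runinfo = 178      # size_MB column of the RunInfo table

print("{:.1f} MiB on disk, {} in RunInfo".format(bytes_to_mib(size_on_disk), size_in_runinfo))
# Assumption: size_MB is rounded down, so compare the integer parts.
print("sizes agree:", int(bytes_to_mib(size_on_disk)) == size_in_runinfo)
```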


## Dump sra to fastq and perform fastQC

In [111]:
#helper function to run shell commands
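def run_command(command):
    try:
        process = sp.Popen(shlex.split(command), stdout = sp.PIPE, stderr = sp.STDOUT, shell = False)
        # stream the combined stdout/stderr of the command line by line
        for line in process.stdout:
            print(line.decode("UTF-8").replace('\n',''))
        # Popen never raises CalledProcessError, so check the exit code explicitly
        returncode = process.wait()
        if returncode != 0:
            raise Exception("Error running {} (exit code {})".format(command, returncode))
    except FileNotFoundError:
        print("command not found")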

In [112]:
########## Dump ##########
#Create output directory
FASTQDIR = 'data/fastq'
try:
    os.makedirs(FASTQDIR)
except FileExistsError:
    # directory already exists
    pass

In [113]:
#dump sra to fastq
for file in log_progress(sra_files):
    sra = os.path.join(SRADIR, file)
    fastq = os.path.join(FASTQDIR, file+'.fastq')
    if not os.path.exists(fastq):
        print('processing:',sra)
        run_command("fasterq-dump --outdir {} {} -e4".format(FASTQDIR, sra))
    else:
        print(sra, 'already dumped to fastq')
        
#sound alert when done
Audio(sound_file, autoplay=True)

VBox(children=(HTML(value=''), IntProgress(value=0, max=1)))

data/SRA/SRR6231076.sra already dumped to fastq


In [115]:
########## Step 2: QC ##########
#Create output directory
QCDIR = os.path.join(os.getcwd(),'data/fastqc_output')
try:
    os.makedirs(QCDIR)
except FileExistsError:
    # directory already exists
    pass

QCed_samples = [s.split('.')[0] for s in os.listdir(QCDIR)] # list of fastq files already gone through QC
fastq_files = [f for f in os.listdir(FASTQDIR) if f.endswith('.fastq')] # list of all fastq files

for fastq in log_progress(fastq_files):
    if fastq.split('.')[0] not in QCed_samples:
        filename = os.path.join(FASTQDIR,fastq)
        print('*** processing', filename, '***')
        run_command("fastqc {} --outdir {}".format(filename, QCDIR))

#sound alert when done
Audio(sound_file, autoplay=True)

VBox(children=(HTML(value=''), IntProgress(value=0, max=4)))
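
To screen samples programmatically rather than by eye, the per-module verdicts that FastQC reports can be parsed. This is a sketch under assumptions: it expects the tab-separated `status, module, file` layout that we understand FastQC uses for the `summary.txt` inside each result archive. The example text and the `parse_fastqc_summary` helper are made up for illustration.

```python
def parse_fastqc_summary(text):
    """Map each FastQC module name to its status (PASS, WARN or FAIL)."""
    statuses = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            status, module = fields[0], fields[1]
            statuses[module] = status
    return statuses

# Hypothetical excerpt of a summary.txt file, for illustration only.
example = (
    "PASS\tBasic Statistics\tSRR0000000.fastq\n"
    "WARN\tPer base sequence content\tSRR0000000.fastq\n"
    "FAIL\tAdapter Content\tSRR0000000.fastq\n"
)
summary = parse_fastqc_summary(example)
flagged = [module for module, status in summary.items() if status != "PASS"]
print(flagged)
```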

## Map reads to the transcriptome using Salmon

* Download the reference transcriptome

In [116]:
# Create folders to store the reference transcriptome and the Salmon index
ALIGNDIR= 'align'
try:
    os.makedirs(ALIGNDIR)
except FileExistsError:
    # directory already exists
    pass

In [117]:
def download_ftp(host, path_to_remote_file, local_name):
    if os.path.exists(local_name):
        print(local_name, 'already downloaded')
    else:
        #open ftp session (anonymous login)
        ftp = ftplib.FTP(host)
        ftp.login()
        filesize = ftp.size(path_to_remote_file) #size in bytes
        with open(local_name, 'wb') as f:
            print('*** Downloading {} (size {:.1f} MB) ***'.format(local_name, filesize/1e6))
            p = Progressbar(filesize)
            def callback(chunk):
                #write each received block and advance the progress bar
                f.write(chunk)
                p.update_progress(len(chunk))

            ftp.retrbinary('RETR '+path_to_remote_file, callback, blocksize = max(int(filesize/1000),1024*1024))
        ftp.quit() #close session

In [118]:
#download transcript file
transcript_file = os.path.join(os.getcwd(),ALIGNDIR,'gencode.v30.transcripts.fa.gz')
download_ftp('ftp.ebi.ac.uk', 'pub/databases/gencode/Gencode_human/release_30/gencode.v30.transcripts.fa.gz', transcript_file)

/Users/Jb_Macbook/Desktop/CIM/python/RNAseq/python_pipeline/align/gencode.v30.transcripts.fa.gz already downloaded


* Create Salmon index https://salmon.readthedocs.io/en/latest/salmon.html

In [119]:
salmon_index= os.path.join(ALIGNDIR,'salmon_gencode.v30_quasi_index')
if not os.path.exists(salmon_index):
    ## Create salmon index (make sure to have sufficient disk space and memory)
    run_command('salmon index -t {} -i {} -k 31 --perfectHash'.format(transcript_file,salmon_index))
    #sound alert when done
    Audio(sound_file, autoplay=True)
else:
    print(salmon_index, 'already created')

align/salmon_gencode.v30_quasi_index already created


* quantify transcripts

In [120]:
########## alignment-free transcript abundance quantification ##########
OUTDIR = os.path.join(ALIGNDIR,'salmon_output')
single_samples = [f.split('.')[0] for f in os.listdir(FASTQDIR) if f.endswith('.sra.fastq')]
paired_samples = [f.split('.')[0] for f in os.listdir(FASTQDIR) if f.endswith('.sra_1.fastq')]

## Quantify single-end sequencing reads
for sample in log_progress(single_samples):
    fastq = os.path.join(FASTQDIR,sample+'.sra.fastq')
    output = os.path.join(OUTDIR,sample)
    if not os.path.exists(output):
        print('Processing sample',sample)
        run_command('salmon quant --index {} --libType A -r {} --threads 4 --validateMappings --output {}'.format(salmon_index,fastq,output))


## Quantify paired-end sequencing reads
for sample in log_progress(paired_samples):
    fastq1 = os.path.join(FASTQDIR,sample+'.sra_1.fastq')
    fastq2 = os.path.join(FASTQDIR,sample+'.sra_2.fastq')
    output = os.path.join(OUTDIR,sample)
    if not os.path.exists(output):
        print('Processing sample',sample)
        run_command('salmon quant --index {} --libType A -1 {} -2 {} --threads 4 --validateMappings --output {}'.format(salmon_index,fastq1,fastq2,output))

print('Transcript abundance quantification done')
#sound alert when done
Audio(sound_file, autoplay=True)

VBox(children=(HTML(value=''), IntProgress(value=0, max=2)))

VBox(children=(HTML(value=''), IntProgress(value=0, max=1)))

Transcript abundance quantification done
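
The two loops above differ only in how the reads are passed to Salmon. As a sketch, the command can be assembled as an argument list by one function, which avoids spacing mistakes in hand-formatted strings. `build_salmon_quant_cmd` and the file names are hypothetical; the flags are the ones already used in this notebook.

```python
import shlex

def build_salmon_quant_cmd(index, output, reads, mates=None, threads=4):
    """Build a `salmon quant` argument list for single- or paired-end reads."""
    cmd = ["salmon", "quant", "--index", index, "--libType", "A"]
    if mates is None:
        cmd += ["-r", reads]                 # single-end reads
    else:
        cmd += ["-1", reads, "-2", mates]    # paired-end mates
    cmd += ["--threads", str(threads), "--validateMappings", "--output", output]
    return cmd

single = build_salmon_quant_cmd("idx", "out/S1", "S1.sra.fastq")
paired = build_salmon_quant_cmd("idx", "out/S2", "S2.sra_1.fastq", "S2.sra_2.fastq")
print(" ".join(shlex.quote(arg) for arg in single))
print(" ".join(shlex.quote(arg) for arg in paired))
```

An argument list like this can be passed directly to `subprocess` without going through `shlex.split`.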


## QC
* Check the quality of the raw reads (FASTQ files) and the mapping rate (Salmon) to decide which samples need to be excluded

In [121]:
#We can examine the QC report generated by multiQC to evaluate the quality of the data
!multiqc . -f --outdir data
IPython.display.IFrame('data/multiqc_report.html', width=800, height=350)

  configs = yaml.load(f)
  sp = yaml.load(f)
[INFO   ]         multiqc : This is MultiQC v1.6
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '.'
Searching 118 files..  [####################################]  100%
[INFO   ]          salmon : Found 3 meta reports
[INFO   ]          salmon : Found 3 fragment length distributions
[INFO   ]          fastqc : Found 14 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : data/multiqc_report.html
[INFO   ]         multiqc : Data        : data/multiqc_data
[INFO   ]         multiqc : MultiQC complete


No adapter contamination was apparent, and base quality was good throughout the reads for all retained samples, so we decided not to perform adapter or quality trimming.
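
The mapping rate can also be collected directly from the Salmon output folders. The sketch below assumes each sample folder contains `aux_info/meta_info.json` with a `percent_mapped` field, as in the Salmon versions we are aware of. The sample names, rates, and the 50% cutoff are made up for the demonstration, and `read_mapping_rates` is a hypothetical helper.

```python
import json
import os
import tempfile

def read_mapping_rates(salmon_outdir):
    """Return {sample: percent_mapped} for every Salmon output folder found."""
    rates = {}
    for sample in sorted(os.listdir(salmon_outdir)):
        meta_path = os.path.join(salmon_outdir, sample, "aux_info", "meta_info.json")
        if os.path.exists(meta_path):
            with open(meta_path) as handle:
                rates[sample] = json.load(handle)["percent_mapped"]
    return rates

# Demonstration on made-up values written to a temporary folder.
with tempfile.TemporaryDirectory() as outdir:
    for sample, rate in [("sampleA", 85.2), ("sampleB", 41.7)]:
        aux_dir = os.path.join(outdir, sample, "aux_info")
        os.makedirs(aux_dir)
        with open(os.path.join(aux_dir, "meta_info.json"), "w") as handle:
            json.dump({"percent_mapped": rate}, handle)
    rates = read_mapping_rates(outdir)

MIN_PERCENT_MAPPED = 50.0  # hypothetical cutoff, not taken from the study
low_mapping = [s for s, r in rates.items() if r < MIN_PERCENT_MAPPED]
print(low_mapping)
```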

## Pre-process quantification results for downstream analysis
* Keep only transcript name in the quant files and trim other identifiers:

In [122]:
def trim_ids(file):
    with open(file, 'r') as f, open('temp.txt', 'x') as t:
        for line in f:
            t.write(re.sub(r'\|.*\|','', line))
    os.remove(file)
    os.rename('temp.txt', file)

In [123]:
for sample in [s for s in os.listdir(OUTDIR) if s.startswith('SRR')]:
    file = os.path.join(OUTDIR,sample,'quant.sf')
    trim_ids(file)
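
To make the effect of the substitution in `trim_ids` concrete, here is a small sketch on a single made-up line. The identifiers and numbers are hypothetical; only the pipe-separated layout mimics GENCODE transcript names.

```python
import re

# Hypothetical quant.sf line in the GENCODE style: the transcript ID is followed by
# pipe-separated annotations, then the tab-separated Salmon columns.
line = "ENST00000000001.1|ENSG00000000001.1|-|-|GENE-201|GENE|1000|protein_coding|\t1000\t850.0\t1.5\t12.0\n"

# Same substitution as trim_ids: drop everything from the first to the last pipe.
trimmed = re.sub(r'\|.*\|', '', line)
print(repr(trimmed))
```

Because the pattern is greedy, it relies on the annotations ending with a trailing pipe; a header line without any pipe is left unchanged.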

* The tximport step is performed in the R script 'tx2gene.R'

In [124]:
#Download gtf_file
gtf_file = os.path.join(os.getcwd(), ALIGNDIR,'gencode.v30.annotation.gtf.gz')
download_ftp('ftp.ebi.ac.uk', '/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz', gtf_file)

/Users/Jb_Macbook/Desktop/CIM/python/RNAseq/python_pipeline/align/gencode.v30.annotation.gtf.gz already downloaded


In [125]:
run_command("Rscript tx2gene.R {} {} -e4".format('data/', gtf_file))

Setting WORKDIR to: data/ 
Error in library(readr) : there is no package called ‘readr’
Execution halted


In [127]:
!which Rscript

/Users/Jb_Macbook/miniconda3/envs/rnaseq/bin/Rscript


Check results and export gene counts and length as separate matrices

In [25]:
## Load the expression matrix
txi = pd.read_csv(os.path.join(os.getcwd(),'data', 'txi.csv'), index_col=0)
print(txi.shape)
txi.head(3)

(58434, 43)


Unnamed: 0,abundance.SRR6231076,abundance.SRR6231077,abundance.SRR6231078,abundance.SRR6231079,abundance.SRR6231080,abundance.SRR6231081,abundance.SRR6231082,abundance.SRR6231083,abundance.SRR6231084,abundance.SRR6231085,...,length.SRR6231081,length.SRR6231082,length.SRR6231083,length.SRR6231084,length.SRR6231085,length.SRR6231086,length.SRR6231087,length.SRR6231088,length.SRR6231089,countsFromAbundance
ENSG00000000003.14,0.904746,1.728894,2.91638,1.672736,1.783211,0.839227,2.240563,0.740904,2.093231,0.184066,...,2267.286323,1631.755398,2116.087129,2106.54757,3547.0,2008.67225,3547.0,3547.0,1375.863908,no
ENSG00000000005.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,no
ENSG00000000419.12,49.583824,42.782857,70.891166,45.358392,61.097132,49.674372,42.417422,49.326188,43.877142,53.120251,...,740.182887,659.593373,689.528985,752.2808,704.728908,675.620847,657.309812,634.535056,655.377459,no


Here we add a column with trimmed ENSEMBL IDs (remove version .XX) and check for possible duplicates

In [26]:
txi['ENSEMBL'] = [ID.split('.')[0] for ID in txi.index]
txi[txi.duplicated(subset='ENSEMBL', keep=False)].head(6)

Unnamed: 0,abundance.SRR6231076,abundance.SRR6231077,abundance.SRR6231078,abundance.SRR6231079,abundance.SRR6231080,abundance.SRR6231081,abundance.SRR6231082,abundance.SRR6231083,abundance.SRR6231084,abundance.SRR6231085,...,length.SRR6231082,length.SRR6231083,length.SRR6231084,length.SRR6231085,length.SRR6231086,length.SRR6231087,length.SRR6231088,length.SRR6231089,countsFromAbundance,ENSEMBL


No duplicates
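
For reference, this is what the check would report if stripping the version did produce collisions. The IDs below are invented for illustration and do not come from this dataset.

```python
import pandas as pd

# Made-up versioned IDs; the last two collapse to the same unversioned ID.
toy = pd.DataFrame(
    {"counts.S1": [10.0, 0.0, 3.0]},
    index=["ENSG00000000001.1", "ENSG00000000002.4", "ENSG00000000002.4_PAR_Y"],
)
toy["ENSEMBL"] = [gene_id.split(".")[0] for gene_id in toy.index]

# keep=False marks every member of a duplicated group, not only the repeats.
dups = toy[toy.duplicated(subset="ENSEMBL", keep=False)]
print(dups)
```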

In [27]:
#Drop duplicates
#txi = txi.drop_duplicates(subset='ENSEMBL', keep='first').reset_index(drop=True).set_index('ENSEMBL')
#print(txi.shape)
#txi.head(3)

In [28]:
#TPM corresponds to the abundance calculated by salmon/tximport
TPM = txi[[col for col in txi.columns if col.startswith("abundance.")]]
#remove prefix "abundance."
TPM.columns = [col.split('.')[1] for col in TPM.columns]
print(TPM.shape)
TPM.to_csv(os.path.join(os.getcwd(),'data', 'TPM.csv'))
TPM.head(3)

(58434, 14)


Unnamed: 0,SRR6231076,SRR6231077,SRR6231078,SRR6231079,SRR6231080,SRR6231081,SRR6231082,SRR6231083,SRR6231084,SRR6231085,SRR6231086,SRR6231087,SRR6231088,SRR6231089
ENSG00000000003.14,0.904746,1.728894,2.91638,1.672736,1.783211,0.839227,2.240563,0.740904,2.093231,0.184066,1.386043,0.135562,0.098742,2.971102
ENSG00000000005.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419.12,49.583824,42.782857,70.891166,45.358392,61.097132,49.674372,42.417422,49.326188,43.877142,53.120251,50.326877,34.554994,50.109105,69.003062


In [29]:
counts = txi[[col for col in txi.columns if col.startswith("counts.")]]
#remove prefix "counts."
counts.columns = [col.split('.')[1] for col in counts.columns]
print(counts.shape)
counts.to_csv(os.path.join(os.getcwd(),'data', 'counts.csv'))
counts.head(3)

(58434, 14)


Unnamed: 0,SRR6231076,SRR6231077,SRR6231078,SRR6231079,SRR6231080,SRR6231081,SRR6231082,SRR6231083,SRR6231084,SRR6231085,SRR6231086,SRR6231087,SRR6231088,SRR6231089
ENSG00000000003.14,31.812,60.686,121.274,34.846,69.902,25.15,66.644,22.725,48.893,10.447,49.374,9.59,7.226,80.811
ENSG00000000005.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419.12,574.999,565.0,1041.999,516.999,789.0,486.0,510.001,493.0,366.001,598.999,603.0,453.0,656.0,894.0


In [30]:
lengths = txi[[col for col in txi.columns if col.startswith("length.")]]
#remove prefix "length."
lengths.columns = [col.split('.')[1] for col in lengths.columns]
print(lengths.shape)
lengths.to_csv(os.path.join(os.getcwd(),'data', 'lengths.csv'))
lengths.head(3)

(58434, 14)


Unnamed: 0,SRR6231076,SRR6231077,SRR6231078,SRR6231079,SRR6231080,SRR6231081,SRR6231082,SRR6231083,SRR6231084,SRR6231085,SRR6231086,SRR6231087,SRR6231088,SRR6231089
ENSG00000000003.14,2076.907355,1807.147185,2014.507866,1320.160816,2139.434255,2267.286323,1631.755398,2116.087129,2106.54757,3547.0,2008.67225,3547.0,3547.0,1375.863908
ENSG00000000005.6,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5
ENSG00000000419.12,684.978199,679.919882,712.065911,722.332479,704.799893,740.182887,659.593373,689.528985,752.2808,704.728908,675.620847,657.309812,634.535056,655.377459
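
The three cells above repeat the same pattern with a different prefix. As a sketch, they could be folded into one helper; `extract_matrix` is a hypothetical name and the toy table only imitates the column naming of `txi.csv`.

```python
import pandas as pd

def extract_matrix(txi, prefix):
    """Return the columns starting with `prefix`, renamed to bare sample names."""
    cols = [col for col in txi.columns if col.startswith(prefix)]
    matrix = txi[cols].copy()  # copy so renaming does not touch the original table
    matrix.columns = [col[len(prefix):] for col in cols]
    return matrix

# Toy table with the same column naming scheme as txi.csv.
toy_txi = pd.DataFrame(
    {"abundance.S1": [1.0], "counts.S1": [30.0], "length.S1": [2000.0],
     "countsFromAbundance": ["no"]},
    index=["gene1"],
)
for prefix in ("abundance.", "counts.", "length."):
    print(prefix, extract_matrix(toy_txi, prefix).to_dict())
```

Slicing off the prefix by length, rather than splitting on the first dot, also keeps sample names intact if they happen to contain a dot.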


For use with limma-voom, export counts calculated with the lengthScaled TPM method (from [tximport vignette](https://bioc.ism.ac.jp/packages/3.4/bioc/vignettes/tximport/inst/doc/tximport.html): "limma-voom does not use the offset matrix stored in y$offset, so we recommend using the scaled counts generated from abundances, either 'scaledTPM' or 'lengthScaledTPM' ")

In [31]:
## Load the expression matrix
txi_lengthScaledTPM = pd.read_csv(os.path.join(os.getcwd(),'data', 'txi_lengthScaledTPM.csv'), index_col=0)
print(txi_lengthScaledTPM.shape)

(58434, 43)


In [32]:
# Add trimmed ENSEMBL IDs and drop duplicates
txi_lengthScaledTPM['ENSEMBL'] = [ID.split('.')[0] for ID in txi_lengthScaledTPM.index]
txi_lengthScaledTPM = txi_lengthScaledTPM.drop_duplicates(subset='ENSEMBL', keep='first').reset_index(drop=True).set_index('ENSEMBL')
print(txi_lengthScaledTPM.shape)
txi_lengthScaledTPM.head(3)

(58434, 43)


ENSEMBL,abundance.SRR6231076,abundance.SRR6231077,abundance.SRR6231078,abundance.SRR6231079,abundance.SRR6231080,abundance.SRR6231081,abundance.SRR6231082,abundance.SRR6231083,abundance.SRR6231084,abundance.SRR6231085,...,length.SRR6231081,length.SRR6231082,length.SRR6231083,length.SRR6231084,length.SRR6231085,length.SRR6231086,length.SRR6231087,length.SRR6231088,length.SRR6231089,countsFromAbundance
ENSG00000000003,0.904746,1.728894,2.91638,1.672736,1.783211,0.839227,2.240563,0.740904,2.093231,0.184066,...,2267.286323,1631.755398,2116.087129,2106.54757,3547.0,2008.67225,3547.0,3547.0,1375.863908,lengthScaledTPM
ENSG00000000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,624.5,lengthScaledTPM
ENSG00000000419,49.583824,42.782857,70.891166,45.358392,61.097132,49.674372,42.417422,49.326188,43.877142,53.120251,...,740.182887,659.593373,689.528985,752.2808,704.728908,675.620847,657.309812,634.535056,655.377459,lengthScaledTPM


In [33]:
counts_lengthScaledTPM = txi_lengthScaledTPM[[col for col in txi_lengthScaledTPM.columns if col.startswith("counts.")]]
#remove prefix "counts."
counts_lengthScaledTPM.columns = [col.split('.')[1] for col in counts_lengthScaledTPM.columns]
print(counts_lengthScaledTPM.shape)
counts_lengthScaledTPM.to_csv(os.path.join(os.getcwd(),'data', 'counts_lengthScaledTPM.csv'))
counts_lengthScaledTPM.head(3)

(58434, 14)


ENSEMBL,SRR6231076,SRR6231077,SRR6231078,SRR6231079,SRR6231080,SRR6231081,SRR6231082,SRR6231083,SRR6231084,SRR6231085,SRR6231086,SRR6231087,SRR6231088,SRR6231089
ENSG00000000003,33.784806,73.529314,130.314786,57.585487,72.726977,24.408778,89.253089,23.650555,50.903998,6.355981,52.17112,5.709203,4.373752,128.361618
ENSG00000000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419,568.489962,558.663062,972.589846,479.436744,765.070847,443.595365,518.798119,483.442674,327.612976,563.192875,581.622853,446.823896,681.485953,915.322233
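
To give an intuition for why these counts differ from the original ones, the sketch below reproduces our understanding of the 'lengthScaledTPM' idea: abundances are weighted by each gene's average length across samples, then each sample is rescaled to its original total count. This is an illustration on made-up numbers, not tximport's code; the R package remains the reference.

```python
import numpy as np

def length_scaled_tpm_counts(abundance, length, counts):
    """Scale TPM by average length, then rescale columns to the original library sizes."""
    # Weight each gene's abundance by its mean length across samples.
    scaled = abundance * length.mean(axis=1, keepdims=True)
    # Rescale every sample (column) so its total matches the original count total.
    return scaled * (counts.sum(axis=0) / scaled.sum(axis=0))

# Made-up matrices: 3 genes (rows) x 2 samples (columns).
abundance = np.array([[10.0, 20.0], [5.0, 5.0], [1.0, 2.0]])
length = np.array([[1000.0, 1200.0], [500.0, 500.0], [2000.0, 1800.0]])
counts = np.array([[100.0, 250.0], [20.0, 30.0], [15.0, 40.0]])

new_counts = length_scaled_tpm_counts(abundance, length, counts)
print(new_counts.sum(axis=0), counts.sum(axis=0))
```

The per-sample totals are preserved while individual gene counts shift, which matches the pattern seen when comparing `counts.csv` with `counts_lengthScaledTPM.csv` above.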


After successfully completing the steps above, you can start analyzing the processed gene expression matrix.

## References
---
<a id='ref1'></a>
1. Andrews S. (2010) **FastQC: A quality control tool for high throughput sequence data.** _Reference Source_.
<a id='ref2'></a>
2. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. (2017) **Salmon provides fast and bias-aware quantification of transcript expression.** _Nature Methods_, 14, 417-419. PMID:[28263959](https://www.ncbi.nlm.nih.gov/pubmed/28263959)
<a id='ref3'></a>
3. Soneson C., Love M.I., Robinson M.D. (2015): **Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.** _F1000Research_ http://dx.doi.org/10.12688/f1000research.7563.1
<a id='ref4'></a>