# Processing RNAseq data 

The first step is to create a directory to work in:

In [3]:
# mkdir -p /mnt/storage/$USER/jupyternotebooks/RNA-seq
cd /mnt/storage/$USER/jupyternotebooks/RNA-seq

### symlink the fastq files

Next, create symbolic links to the raw fastq files. We use symbolic links instead of full copies so that we do not clog up the system with duplicate data.

In [24]:
# ln -sf /mnt/nfs/data/RNA-seq/MCF7/*.fastq .
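
To see what a symbolic link does without touching the shared data, here is a minimal sketch that works in a throwaway directory created with `mktemp`. The names `raw`, `work` and `example.fastq` are made up for illustration and are not part of the course data.

```shell
# Hypothetical scratch area; nothing here touches the course data.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/raw" "$demo_dir/work"
printf '@read1\nACGT\n+\nIIII\n' > "$demo_dir/raw/example.fastq"

# -s makes a symbolic link, -f replaces an existing link with the same name.
ln -sf "$demo_dir/raw/example.fastq" "$demo_dir/work/example.fastq"

# readlink prints the path that the link points to.
link_target=$(readlink "$demo_dir/work/example.fastq")
echo "link points to: $link_target"

# Reading through the link returns the content of the original file.
first_line=$(head -n 1 "$demo_dir/work/example.fastq")
echo "first line: $first_line"
```

The link itself only stores a path, so it takes almost no disk space, but it stops working if the original file is moved or deleted.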

In [20]:
head test.fastq

@SRR867127.1 HWI-ST571:122:D0HFTACXX:4:1101:1454:2139 length=50
CAGCTTTCTAGGGCTGGAAGTCTCAAATAGAACTCACCTGTTCCCCAACC
+SRR867127.1 HWI-ST571:122:D0HFTACXX:4:1101:1454:2139 length=50
@@@?DFFDF?HFHIIIGG>FCAFGHG@EHGHIDE?FHHDHF@BFF@@?DG
@SRR867127.2 HWI-ST571:122:D0HFTACXX:4:1101:1439:2168 length=50
GAATTAAAAACTGGTATCAGCGATGTTTTTGCTAAAAATGATCTTGCTGT
+SRR867127.2 HWI-ST571:122:D0HFTACXX:4:1101:1439:2168 length=50
@@CDDFDDDHHHFGAFGHHIDIH<EGHIIBGIIHEGCFHA>FGHIIICHI
@SRR867127.3 HWI-ST571:122:D0HFTACXX:4:1101:1455:2177 length=50
CACGGCCAGGCCCCCCGCTGTGGATAAGTGGCTCTTGCCATAGGGGTGGT


Note that there is a test.fastq file. This is a small subset of NS1.fastq that lets us run a few operations in class without overly burdening the system.
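
As the `head` output above shows, each read occupies four lines (header, sequence, `+` line, qualities). A rough way to count reads is therefore to divide the number of lines by four. The sketch below uses a small made-up file so that it runs anywhere, and assumes sequences are not wrapped over several lines.

```shell
# Small made-up FASTQ file (two records) so the example runs anywhere.
fastq_file=$(mktemp)
cat > "$fastq_file" <<'EOF'
@read1
CAGCTTTCTA
+
@@@?DFFDF?
@read2
GAATTAAAAA
+
@@CDDFDDDH
EOF

# Each record is four lines: header, sequence, '+' separator, qualities.
n_lines=$(wc -l < "$fastq_file")
n_reads=$(( n_lines / 4 ))
echo "reads: $n_reads"

# Counting lines that start with '@' is tempting but unreliable,
# because quality strings may start with '@' as well.
naive_count=$(grep -c '^@' "$fastq_file")
echo "lines starting with @: $naive_count"
```

On the real data you would replace `"$fastq_file"` with `test.fastq`.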

## Quality control using FASTQC

After receiving sequencing data your first step should be to check your data quality. This step helps to identify potential technical biases that might affect your analysis. Based on this evaluation you might need to pre-process your data to improve the quality. 

[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a tool that produces a quality analysis report on FASTQ files.

    fastqc fastqfile -o outputdir  

Useful links:

* [FastQC report for a good Illumina dataset](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html)
* [FastQC report for a bad Illumina dataset](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html)
* [Adapter contaminated run](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/RNA-Seq_fastqc.html)
* [Online documentation for each FastQC report](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/)
* [Video tutorial](http://www.youtube.com/watch?v=bz93ReOv87Y)

Look through the individual reports and evaluate them according to your experiment type.

The FastQC reports I find most useful are:

* The Per base sequence quality report, which can help you decide if sequence trimming is needed before alignment.
* The Sequence Duplication Levels report, which helps you evaluate library enrichment / complexity. Note, however, that different experiment types are expected to have vastly different duplication profiles.
* The Overrepresented Sequences report, which helps evaluate adapter contamination.
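
The per base quality report is built from the quality line of each read. As a rough illustration, the sketch below decodes the quality string of the first read in `test.fastq` (copied from the `head` output above). It assumes the Phred+33 encoding, where the score is the ASCII code of the character minus 33; that encoding is an assumption here and is not stated in the file itself.

```shell
# Quality string of the first read shown by `head test.fastq` above.
qual='@@@?DFFDF?HFHIIIGG>FCAFGHG@EHGHIDE?FHHDHF@BFF@@?DG'

# od prints the byte values in decimal; awk subtracts the Phred+33 offset.
scores=$(printf '%s' "$qual" | od -An -tu1 |
    awk '{ for (i = 1; i <= NF; i++) printf "%d ", $i - 33 }')

n_bases=$(( $(echo $scores | wc -w) ))
first_score=$(echo $scores | cut -d' ' -f1)
mean_score=$(echo $scores |
    awk '{ for (i = 1; i <= NF; i++) sum += $i; printf "%.1f", sum / NF }')
echo "bases: $n_bases, first base: Q$first_score, mean: Q$mean_score"
```

FastQC does this for every position across all reads and summarises the result as one boxplot per position.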

We will not run FastQC on the full files as this is too computationally intensive. Instead we'll run FastQC on the small test.fastq file.

Note that we call the program by its full path (`/usr/bin/fastqc` in the command below) to ensure we use the correct version.

In [21]:
/usr/bin/fastqc -o . test.fastq

Started analysis of test.fastq
Approx 5% complete for test.fastq
Approx 10% complete for test.fastq
Approx 15% complete for test.fastq
Approx 20% complete for test.fastq
Approx 25% complete for test.fastq
Approx 30% complete for test.fastq
Approx 35% complete for test.fastq
Approx 40% complete for test.fastq
Approx 45% complete for test.fastq
Approx 50% complete for test.fastq
Approx 55% complete for test.fastq
Approx 60% complete for test.fastq
Approx 65% complete for test.fastq
Approx 70% complete for test.fastq
Approx 75% complete for test.fastq
Approx 80% complete for test.fastq
Approx 85% complete for test.fastq
Approx 90% complete for test.fastq
Approx 95% complete for test.fastq
Approx 100% complete for test.fastq
Analysis complete for test.fastq


Download the .zip file and open the .html file to check some FASTQ quality stats. Below you can see boxplots of the Illumina base quality scores.

![image.png](attachment:image.png)

Now download the test_fastqc.html file to your computer and open it in a web browser.

## Mapping to the genome

We will use STAR to map the reads to the genome.

STAR also needs a database with the human genome and gene annotation. Creating one takes a long time, so this is also pregenerated and located here: `/mnt/nfs/mfiers/STAR/hg19_star_db`.

The STAR manual gives an extensive account of all options: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

Again, if everyone runs STAR on the full files at the same time the server may crash, so we will run it on the test fastq file.

In [11]:
# You don't need to run this. It is a trick to load the genome into shared memory once,
# so that if we all run STAR the genome is not loaded X times, which would overload
# the server:
STAR --genomeLoad LoadAndExit --genomeDir /mnt/nfs/mfiers/STAR/hg19_star_db

Sep 27 16:45:24 ..... started STAR run
Sep 27 16:45:24 ..... loading genome


In [12]:
STAR --genomeDir /mnt/nfs/mfiers/STAR/hg19_star_db \
     --genomeLoad LoadAndKeep \
     --runThreadN 2 \
     --readFilesIn test.fastq \
     --outFileNamePrefix test.

Sep 27 16:45:32 ..... started STAR run
Sep 27 16:45:32 ..... loading genome
Sep 27 16:45:32 ..... started mapping
Sep 27 16:45:33 ..... finished successfully


The output file is now in `test.Aligned.out.sam`. Apart from that we have a number of additional log files generated.

In [13]:
ls -l test.*

-rw-r--r-- 1 r0773125 domain users 5239034 Sep 27 16:45 test.Aligned.out.sam
lrwxrwxrwx 1 r0773125 domain users      37 Sep 27 16:44 test.fastq -> /mnt/nfs/data/RNA-seq/MCF7/test.fastq
-rw-r--r-- 1 r0773125 domain users    1830 Sep 27 16:45 test.Log.final.out
-rw-r--r-- 1 r0773125 domain users   16038 Sep 27 16:45 test.Log.out
-rw-r--r-- 1 r0773125 domain users     246 Sep 27 16:45 test.Log.progress.out
-rw-r--r-- 1 r0773125 domain users  134953 Sep 27 16:45 test.SJ.out.tab


### SAM file structure

A SAM file is a text file containing the alignments of all reads. A SAM file starts with a header. (Please do not use `cat`, as this will crash your notebook.)

In [14]:
head -40 test.Aligned.out.sam | grep '^@'

@HD	VN:1.4
@SQ	SN:chrM	LN:16571
@SQ	SN:chr1	LN:249250621
@SQ	SN:chr2	LN:243199373
@SQ	SN:chr3	LN:198022430
@SQ	SN:chr4	LN:191154276
@SQ	SN:chr5	LN:180915260
@SQ	SN:chr6	LN:171115067
@SQ	SN:chr7	LN:159138663
@SQ	SN:chr8	LN:146364022
@SQ	SN:chr9	LN:141213431
@SQ	SN:chr10	LN:135534747
@SQ	SN:chr11	LN:135006516
@SQ	SN:chr12	LN:133851895
@SQ	SN:chr13	LN:115169878
@SQ	SN:chr14	LN:107349540
@SQ	SN:chr15	LN:102531392
@SQ	SN:chr16	LN:90354753
@SQ	SN:chr17	LN:81195210
@SQ	SN:chr18	LN:78077248
@SQ	SN:chr19	LN:59128983
@SQ	SN:chr20	LN:63025520
@SQ	SN:chr21	LN:48129895
@SQ	SN:chr22	LN:51304566
@SQ	SN:chrX	LN:155270560
@SQ	SN:chrY	LN:59373566
@PG	ID:STAR	PN:STAR	VN:STAR_2.5.4b	CL:STAR   --runThreadN 2   --genomeDir /mnt/nfs/mfiers/STAR/hg19_star_db   --genomeLoad LoadAndKeep   --readFilesIn test.fastq      --outFileNamePrefix test.
@CO	user command line: STAR --genomeDir /mnt/nfs/mfiers/STAR/hg19_star_db --genomeLoad LoadAndKeep --runThreadN 2 --readFilesIn test.fastq --outFileNamePrefix test.


Followed by the alignments:

In [15]:
head -34 test.Aligned.out.sam | grep -v '^@'

SRR867127.1	16	chr11	73373455	255	50M	*	0	0	GGTTGGGGAACAGGTGAGTTCTATTTGAGACTTCCAGCCCTAGAAAGCTG	GD?@@FFB@FHDHHF?EDIHGHE@GHGFACF>GGIIIHFH?FDFFD?@@@	NH:i:1	HI:i:1	AS:i:49	nM:i:0
SRR867127.2	16	chr2	232324890	255	50M	*	0	0	ACAGCAAGATCATTTTTAGCAAAAACATCGCTGATACCAGTTTTTAATTC	IHCIIIHGF>AHFCGEHIIGBIIHGE<HIDIHHGFAGFHHHDDDFDDC@@	NH:i:1	HI:i:1	AS:i:49	nM:i:0
SRR867127.3	16	chr8	37556053	255	50M	*	0	0	ACCACCCCTATGGCAAGAGCCACTTATCCACAGCGGGGGGCCTGGCCGTG	9A9'HE>HGGJJIIIIGIIIG@IHEJHGHDHEGIGIGFFHDFFFDFF@@@	NH:i:1	HI:i:1	AS:i:49	nM:i:0
SRR867127.4	0	chr12	49522065	255	50M	*	0	0	CACAAACTGGATGCTGCGCTTGGTTTTGATGGTGGCAATGGCAGCATTGA	@@@FFFFDFBDHHGGHEEHIJIGFGHGDGIGGFHGHGGIJJIJICHIJHE	NH:i:1	HI:i:1	AS:i:49	nM:i:0
SRR867127.5	16	chr16	3726056	255	50M	*	0	0	GGAAAACTCCTTGCAGTCGGATTTCAGGTGGATGATGATTTTTGTCCCGG	>FJIHCJJIJIJIJJJIIHJJHIGGGEGIIJIIJIGIHHHHFFFDFF@CC	NH:i:1	HI:i:1	AS:i:49	nM:i:0
SRR867127.6	16	chr17	17184449	255	50M	*	0	0	GAGCTGAGAGCTGTCGGACACTGTTCACGAACTGCTCCAGGGCAGACGCC	IHGFFGHCGDHGB>IIIDF?EHEGCGHHDHHEG

As you may notice, the SAM file is unsorted and uncompressed. To be able to quickly access the data we tend to sort the SAM file and convert it to a BAM file, which is a compressed (much smaller) version of a SAM file. The last step is to create an index for quick access.

We use `samtools` to do this. The first step sorts and outputs `bam`:

In [63]:
samtools sort -o test.bam test.Aligned.out.sam

In [64]:
samtools sort

Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -t TAG     Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]


In [65]:
ls -l test*[bs]am

-rw-r--r-- 1 u0038182 domain users 5239034 sep 22 14:35 test.Aligned.out.sam
-rw-r--r-- 1 u0038182 domain users 1573049 sep 22 14:43 test.bam


As you can see, the output bam file is much smaller. We will now generate an index, again using `samtools`:

In [66]:
samtools index test.bam

In [67]:
ls -l test*[bs]a[mi]

-rw-r--r-- 1 u0038182 domain users 5239034 sep 22 14:35 test.Aligned.out.sam
-rw-r--r-- 1 u0038182 domain users 1573049 sep 22 14:43 test.bam
-rw-r--r-- 1 u0038182 domain users 1761048 sep 22 14:43 test.bam.bai


The procedure is exactly the same for the full fastq files (NS1, NS2, S1 and S2). We've already performed the alignments. We will create a link to this data: 

### inspect the BAM file(s)

In [69]:
ls -l *.ba[mi]

-rw-r--r-- 1 u0038182 domain users 1573049 sep 22 14:43 test.bam
-rw-r--r-- 1 u0038182 domain users 1761048 sep 22 14:43 test.bam.bai


Now we have the full set of bam files. Note, there is an extra bam file (`S1-chr21.bam`) which contains only the reads of chromosome 21. We'll use this for manual inspection.

You can have a look at any one of them, but ensure that you use `head` - otherwise the notebook will crash:

In [70]:
samtools view test.bam | head -3

SRR867127.21188	0	chrM	86	255	50M	*	0	0	CATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTT	CCCFFFFFHHHHHJIJIIJJJJIIIJJJIIIJJHIIIHHHIJJJJHHGGH	NH:i:1	HI:i:1	AS:i:49	nM:i:0
SRR867127.9187	16	chrM	115	255	50M	*	0	0	TATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGC	JJJJJJJJJJJJJJJJJIJJIJJJIJJJJJJJJJJJJHHHHHFFFFFCCC	NH:i:1	HI:i:1	AS:i:47	nM:i:1
SRR867127.11621	16	chrM	214	255	50M	*	0	0	AATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCG	JJIJJJJJJJJJIJJJIIIGJIGJJIJJIJJJJJJJJHHHHHFFFFFCCC	NH:i:1	HI:i:1	AS:i:49	nM:i:0
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1


The format of the BAM output is the same as that of the SAM output.
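
Because every alignment is one tab-separated line, standard tools such as `awk` can pull out individual columns. The sketch below takes the first alignment from the `samtools view` output above (rebuilt with `printf` so that it runs without samtools) and prints the first six columns plus the `NH` tag. The `name=value` output layout is just an illustration.

```shell
# First alignment printed by `samtools view test.bam | head -3` above,
# rebuilt with printf so the tab separators are explicit.
sam_line=$(printf 'SRR867127.21188\t0\tchrM\t86\t255\t50M\t*\t0\t0\tCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTT\tCCCFFFFFHHHHHJIJIIJJJJIIIJJJIIIJJHIIIHHHIJJJJHHGGH\tNH:i:1\tHI:i:1\tAS:i:49\tnM:i:0')

# Columns 1-6 of the SAM format: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR.
summary=$(printf '%s\n' "$sam_line" |
    awk -F'\t' '{ print "read=" $1, "flag=" $2, "chr=" $3, "pos=" $4, "mapq=" $5, "cigar=" $6 }')
echo "$summary"

# Optional tags start at column 12; NH is the number of reported alignments.
nh_tag=$(printf '%s\n' "$sam_line" |
    awk -F'\t' '{ for (i = 12; i <= NF; i++) if ($i ~ /^NH:i:/) print $i }')
echo "$nh_tag"
```

On the real file the same `awk` commands can be fed from `samtools view test.bam | head`. Note that STAR by default writes a MAPQ of 255 for uniquely mapped reads and lower values for multi-mapping reads.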

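With `samtools idxstats` you can see how many reads map to each chromosome. The columns are: reference name, sequence length, number of mapped reads and number of unmapped reads: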

In [71]:
samtools idxstats test.bam

chrM	16571	878	0
chr1	249250621	2662	0
chr2	243199373	1845	0
chr3	198022430	1391	0
chr4	191154276	795	0
chr5	180915260	1530	0
chr6	171115067	1381	0
chr7	159138663	1386	0
chr8	146364022	1024	0
chr9	141213431	1302	0
chr10	135534747	899	0
chr11	135006516	1502	0
chr12	133851895	2304	0
chr13	115169878	395	0
chr14	107349540	916	0
chr15	102531392	975	0
chr16	90354753	1701	0
chr17	81195210	1923	0
chr18	78077248	291	0
chr19	59128983	1457	0
chr20	63025520	904	0
chr21	48129895	234	0
chr22	51304566	559	0
chrX	155270560	980	0
chrY	59373566	72	0
*	0	0	0
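
The idxstats table is easy to post-process with `awk`. The following sketch uses a four-line excerpt of the output above, written to a scratch file so that it runs without samtools, to sum the mapped column and to normalise by chromosome length. The per-megabase figure is only an illustration, not part of the standard workflow.

```shell
# Excerpt of the idxstats output above (name, length, mapped, unmapped).
idxstats_file=$(mktemp)
printf 'chrM\t16571\t878\t0\nchr1\t249250621\t2662\t0\nchr21\t48129895\t234\t0\n*\t0\t0\t0\n' > "$idxstats_file"

# Sum column 3 to get the number of mapped alignments in this excerpt.
total_mapped=$(awk -F'\t' '{ sum += $3 } END { print sum }' "$idxstats_file")
echo "mapped alignments in excerpt: $total_mapped"

# Alignments per megabase, highest first, skipping the '*' line (length 0).
per_mb=$(awk -F'\t' '$2 > 0 { printf "%s\t%.2f\n", $1, $3 / ($2 / 1e6) }' "$idxstats_file" |
    sort -t "$(printf '\t')" -k2,2nr)
echo "$per_mb"
top_chrom=$(echo "$per_mb" | head -n 1 | cut -f1)
echo "densest reference: $top_chrom"
```

On the real data you would pipe `samtools idxstats test.bam` into the same `awk` commands. Summing column 3 of the full table above gives 29306, the same total that `samtools flagstat` reports for this file, which suggests that secondary alignments are included in these counts.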


In [72]:
ls *.bam

test.bam


Running `samtools flagstat` summarises the mapping flags (column 2 in the SAM/BAM file) over all alignments. The flag is a sum of the following bits:

    0x0001	p	the read is paired in sequencing
    0x0002	P	the read is mapped in a proper pair
    0x0004	u	the query sequence itself is unmapped
    0x0008	U	the mate is unmapped
    0x0010	r	strand of the query (1 for reverse)
    0x0020	R	strand of the mate
    0x0040	1	the read is the first read in a pair
    0x0080	2	the read is the second read in a pair
    0x0100	s	the alignment is not primary
    0x0200	f	the read fails platform/vendor quality checks
    0x0400	d	the read is either a PCR or an optical duplicate
    0x0800	S	the alignment is supplementary
    
See [here](http://www.htslib.org/doc/samtools.html) for a more extensive explanation of the `samtools flagstat` output.
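
Since the flag is a sum of bits, a value such as 16 can be decoded by testing each bit separately. Here is a minimal sketch using shell arithmetic. The function name `decode_flag` and the labels it prints are made up for this example, and only four of the bits from the table are checked. The value 272 (256 + 16) is a constructed example of a secondary alignment on the reverse strand.

```shell
# Decode a SAM flag (column 2) by testing individual bits with shell arithmetic.
decode_flag() {
    flag=$1
    out=""
    [ $(( flag & 0x4 )) -ne 0 ]   && out="$out unmapped"
    [ $(( flag & 0x10 )) -ne 0 ]  && out="$out reverse_strand"
    [ $(( flag & 0x100 )) -ne 0 ] && out="$out secondary"
    [ $(( flag & 0x400 )) -ne 0 ] && out="$out duplicate"
    # None of the tested bits set: a primary alignment on the forward strand.
    [ -z "$out" ] && out=" forward_strand_primary"
    echo "${out# }"
}

flag_0=$(decode_flag 0)
flag_16=$(decode_flag 16)
flag_272=$(decode_flag 272)
echo "0   -> $flag_0"
echo "16  -> $flag_16"
echo "272 -> $flag_272"
```

The `-f` and `-F` options of `samtools view` filter on the same bits, for example `samtools view -c -F 0x100 test.bam` counts the alignments that are not secondary.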

In [73]:
samtools flagstat test.bam

29306 + 0 in total (QC-passed reads + QC-failed reads)
6041 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
29306 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


## Reads to gene counts

First we'll make a symbolic link to (**a part of!**) the human annotation.

Note, if you want to do this for your own project, a full gtf file (containing all chromosomes) can be found in:

    /mnt/nfs/data/RNA-seq/gencode.v19.nopseudo.plus.sort.gtf

In [80]:
ln -sf /mnt/storage/data/RNA-seq/gencode.v19.nopseudo.plus.sort.gtf .
ls -l *gtf

lrwxrwxrwx 1 u0038182 domain users 67 sep 22 14:45 gencode.v19.nopseudo.plus.sort.chr21.gtf -> /mnt/nfs/data/RNA-seq/MCF7/gencode.v19.nopseudo.plus.sort.chr21.gtf
lrwxrwxrwx 1 u0038182 domain users 60 sep 22 14:49 gencode.v19.nopseudo.plus.sort.gtf -> /mnt/storage/data/RNA-seq/gencode.v19.nopseudo.plus.sort.gtf
lrwxrwxrwx 1 u0038182 domain users 32 sep 22 14:44 *.gtf -> /mnt/nfs/data/RNA-seq/2018/*.gtf


Have a look at the annotation, for example at the entries for TP53 (when printing the file itself, use `head`!):

In [83]:
grep -w TP53 gencode.v19.nopseudo.plus.sort.gtf

chr17	HAVANA	exon	7565097	7565332	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000413465.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "PUTATIVE"; transcript_name "TP53-018"; exon_number 7;  exon_id "ENSE00001657961.2";  level 2; tag "basic"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000440236.1";
chr17	HAVANA	gene	7565097	7590856	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENSG00000141510.11"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53"; level 2; havana_gene "OTTHUMG00000162125.4";
chr17	HAVANA	transcript	7565097	7579912	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000413465.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "PUTATIVE"; transcript_name "TP53-018"; level 2; tag "bas

chr17	HAVANA	exon	7571722	7573008	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000455263.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-004"; exon_number 12;  exon_id "ENSE00002034209.1";  level 2; tag "NMD_exception"; tag "basic"; tag "CCDS"; ccdsid "CCDS45605.1"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367400.1";
chr17	HAVANA	transcript	7571722	7590799	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000420246.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-005"; level 2; tag "NMD_exception"; tag "basic"; tag "CCDS"; ccdsid "CCDS45606.1"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367401.1";
chr17	HAVANA	transcript	7571722	7590799	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000455263.2"; gene

chr17	HAVANA	exon	7573927	7574033	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000269305.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-001"; exon_number 10;  exon_id "ENSE00003545950.1";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS11118.1"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367397.1";
chr17	HAVANA	exon	7573927	7574033	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000420246.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-005"; exon_number 11;  exon_id "ENSE00003634848.1";  level 2; tag "NMD_exception"; tag "basic"; tag "CCDS"; ccdsid "CCDS45606.1"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367401.1";
chr17	HAVANA	exon	7573927	7574033	.	-	.	gene_id "ENSG00000141510.11"; t

chr17	HAVANA	CDS	7576853	7576905	.	-	2	gene_id "ENSG00000141510.11"; transcript_id "ENST00000576024.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "PUTATIVE"; transcript_name "TP53-017"; exon_number 1;  exon_id "ENSE00002672443.1";  level 2; tag "mRNA_start_NF"; tag "cds_start_NF"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000440238.1";
chr17	HAVANA	CDS	7576853	7576926	.	-	2	gene_id "ENSG00000141510.11"; transcript_id "ENST00000269305.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-001"; exon_number 9;  exon_id "ENSE00003636029.1";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS11118.1"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367397.1";
chr17	HAVANA	CDS	7576853	7576926	.	-	2	gene_id "ENSG00000141510.11"; transcript_id "ENST00000359597

chr17	HAVANA	CDS	7577019	7577155	.	-	1	gene_id "ENSG00000141510.11"; transcript_id "ENST00000359597.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "PUTATIVE"; transcript_name "TP53-019"; exon_number 7;  exon_id "ENSE00003586720.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000440237.1";
chr17	HAVANA	CDS	7577019	7577155	.	-	1	gene_id "ENSG00000141510.11"; transcript_id "ENST00000420246.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-005"; exon_number 8;  exon_id "ENSE00003586720.1";  level 2; tag "NMD_exception"; tag "basic"; tag "CCDS"; ccdsid "CCDS45606.1"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367401.1";
chr17	HAVANA	CDS	7577019	7577155	.	-	1	gene_id "ENSG00000141510.11"; transcript_id "ENST00000445888.2"; gene_type "protein_coding"

chr17	HAVANA	CDS	7577499	7577608	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "ENST00000455263.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-004"; exon_number 7;  exon_id "ENSE00003504863.1";  level 2; tag "NMD_exception"; tag "basic"; tag "CCDS"; ccdsid "CCDS45605.1"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367400.1";
chr17	HAVANA	CDS	7577499	7577608	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "ENST00000509690.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-015"; exon_number 4;  exon_id "ENSE00003504863.1";  level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367411.2";
chr17	HAVANA	exon	7577499	7577608	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000269305.4"; gene

chr17	HAVANA	CDS	7578177	7578289	.	-	2	gene_id "ENSG00000141510.11"; transcript_id "ENST00000359597.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "PUTATIVE"; transcript_name "TP53-019"; exon_number 5;  exon_id "ENSE00003462942.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000440237.1";
chr17	HAVANA	CDS	7578177	7578289	.	-	2	gene_id "ENSG00000141510.11"; transcript_id "ENST00000413465.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "PUTATIVE"; transcript_name "TP53-018"; exon_number 5;  exon_id "ENSE00003462942.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000440236.1";
chr17	HAVANA	CDS	7578177	7578289	.	-	2	gene_id "ENSG00000141510.11"; transcript_id "ENST00000420246.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_

chr17	HAVANA	CDS	7578371	7578554	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "ENST00000269305.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-001"; exon_number 5;  exon_id "ENSE00003518480.1";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS11118.1"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367397.1";
chr17	HAVANA	CDS	7578371	7578554	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "ENST00000359597.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "PUTATIVE"; transcript_name "TP53-019"; exon_number 4;  exon_id "ENSE00003518480.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000440237.1";
chr17	HAVANA	CDS	7578371	7578554	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "ENST00000413465.2"; gene_type "protein_codi

chr17	HAVANA	exon	7578371	7578811	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000510385.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "TP53-007"; exon_number 1;  exon_id "ENSE00002064269.1";  level 2; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367403.1";
chr17	HAVANA	CDS	7578434	7578554	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "ENST00000508793.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-014"; exon_number 5;  exon_id "ENSE00002073243.1";  level 2; tag "alternative_5_UTR"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367410.1";
chr17	HAVANA	exon	7578434	7578554	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000508793.1"; gene_type "protein_coding"; gene_status "KNOW

chr17	HAVANA	CDS	7579312	7579590	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "ENST00000508793.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-014"; exon_number 4;  exon_id "ENSE00003625790.1";  level 2; tag "alternative_5_UTR"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367410.1";
chr17	HAVANA	CDS	7579312	7579590	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "ENST00000604348.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "NOVEL"; transcript_name "TP53-020"; exon_number 4;  exon_id "ENSE00003625790.1";  level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000468199.1";
chr17	HAVANA	exon	7579312	7579590	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000269305.4"; gene_type 

chr17	HAVANA	CDS	7579700	7579721	.	-	1	gene_id "ENSG00000141510.11"; transcript_id "ENST00000508793.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-014"; exon_number 3;  exon_id "ENSE00002419584.1";  level 2; tag "alternative_5_UTR"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367410.1";
chr17	HAVANA	CDS	7579700	7579721	.	-	1	gene_id "ENSG00000141510.11"; transcript_id "ENST00000514944.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "NOVEL"; transcript_name "TP53-013"; exon_number 3;  exon_id "ENSE00002419584.1";  level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367409.1";
chr17	HAVANA	CDS	7579700	7579721	.	-	1	gene_id "ENSG00000141510.11"; transcript_id "ENST00000604348.1"; gene_type "

chr17	HAVANA	CDS	7579839	7579912	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "ENST00000455263.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-004"; exon_number 2;  exon_id "ENSE00002667911.1";  level 2; tag "NMD_exception"; tag "basic"; tag "CCDS"; ccdsid "CCDS45605.1"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367400.1";
chr17	HAVANA	CDS	7579839	7579912	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "ENST00000503591.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-003"; exon_number 3;  exon_id "ENSE00002667911.1";  level 2; tag "alternative_5_UTR"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367399.1";
chr17	HAVANA	CDS	7579839	7579912	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "

chr17	HAVANA	start_codon	7579910	7579912	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "ENST00000445888.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-002"; exon_number 2;  exon_id "ENSE00001596491.1";  level 2; tag "alternative_5_UTR"; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS11118.1"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367398.1";
chr17	HAVANA	start_codon	7579910	7579912	.	-	0	gene_id "ENSG00000141510.11"; transcript_id "ENST00000455263.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-004"; exon_number 2;  exon_id "ENSE00002667911.1";  level 2; tag "NMD_exception"; tag "basic"; tag "CCDS"; ccdsid "CCDS45605.1"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367400.1";
chr17	HAVANA	start_codon	7579910	7

chr17	HAVANA	exon	7590695	7590745	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000514944.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "NOVEL"; transcript_name "TP53-013"; exon_number 1;  exon_id "ENSE00002051873.1";  level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367409.1";
chr17	HAVANA	exon	7590695	7590799	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000420246.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53-005"; exon_number 1;  exon_id "ENSE00002051192.1";  level 2; tag "NMD_exception"; tag "basic"; tag "CCDS"; ccdsid "CCDS45606.1"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000367401.1";
chr17	HAVANA	exon	7590695	7590799	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000455263.2"; ge
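
A GTF file has nine tab-separated columns: chromosome, source, feature type, start, end, score, strand, frame and attributes. The sketch below builds a tiny GTF from a few of the TP53 coordinates shown above, with a shortened attribute column, and summarises it with `awk`. The helper `gtf_line` and the scratch file are made up for this example.

```shell
gtf_file=$(mktemp)
gtf_line() {  # hypothetical helper: join the nine GTF columns with tabs
    printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$@" >> "$gtf_file"
}
# Shortened attribute column; the real file carries many more attributes.
attr='gene_id "ENSG00000141510.11"; gene_name "TP53";'
gtf_line chr17 HAVANA gene       7565097 7590856 . - . "$attr"
gtf_line chr17 HAVANA transcript 7565097 7579912 . - . "$attr"
gtf_line chr17 HAVANA exon       7565097 7565332 . - . "$attr"
gtf_line chr17 HAVANA exon       7573927 7574033 . - . "$attr"
gtf_line chr17 HAVANA CDS        7576853 7576926 . - 2 "$attr"

# Count how often each feature type (column 3) occurs.
feature_counts=$(awk -F'\t' '{ n[$3]++ } END { for (f in n) print f, n[f] }' "$gtf_file" | sort)
echo "$feature_counts"

# Total exon length; GTF coordinates are 1-based and inclusive.
exon_bases=$(awk -F'\t' '$3 == "exon" { sum += $5 - $4 + 1 } END { print sum }' "$gtf_file")
echo "exon bases: $exon_bases"
```

The feature type matters because featureCounts by default counts reads that overlap `exon` features and groups them by the attribute given with `-g`.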

In [86]:
featureCounts -Q 10 -g gene_name -a gencode.v19.nopseudo.plus.sort.gtf -o test.counts test.bam


        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
	v1.5.0-p1

||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S test.bam                                       ||
||                                                                            ||
||             Output file : test.counts                                      ||
||             Annotations : gencode.v19.nopseudo.plus.sort.gtf (GTF)         ||
||

The command for the full run is:

    featureCounts \
        -Q 10 \
        -g gene_name \
        -a /mnt/nfs/data/RNA-seq/gencode.v19.nopseudo.plus.sort.gtf \
        -o all.counts \
        S1.bam S2.bam NS1.bam NS2.bam

Again, we already ran this for the full files (otherwise it would take too long), so let's create a link to the result:

In [90]:
ln -sf /mnt/nfs/data/RNA-seq/MCF7/all.count* .
ls -l *count*

lrwxrwxrwx 1 u0038182 domain users      37 sep 22 14:52 all.counts -> /mnt/nfs/data/RNA-seq/MCF7/all.counts
-rw-r--r-- 1 u0038182 domain users 8320176 sep 22 14:50 S1.counts
-rw-r--r-- 1 u0038182 domain users     283 sep 22 14:50 S1.counts.summary
-rw-r--r-- 1 u0038182 domain users 8320178 sep 22 14:51 test.counts
-rw-r--r-- 1 u0038182 domain users     283 sep 22 14:51 test.counts.summary


You can inspect both files using `head` (and `column` to nicely align the columns).

Note that there are a number of columns on the gene structure, and a number with the actual counts. We'll separate these:

In [95]:
cut -f-6 all.counts > all.genedata.tsv

In [97]:
cut -f1,7- all.counts | grep -v '^#' > all.gene.counts

How many counts does the gene CDKN2A have?

In [5]:
grep CDKN2A all.gene.counts

CDKN2AIP	1194	1018	759	853
CDKN2AIPNL	1181	1034	1096	1266
CDKN2A	585	255	372	273
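
Note that a plain `grep CDKN2A` also matches genes whose names merely contain that string (CDKN2AIP, CDKN2AIPNL). To get only the gene itself, compare the first column exactly. The sketch below does this with `awk` on the three rows shown above, written to a scratch file so that it runs without `all.gene.counts`.

```shell
# The three rows from the grep output above, written to a scratch file.
counts_file=$(mktemp)
printf 'CDKN2AIP\t1194\t1018\t759\t853\nCDKN2AIPNL\t1181\t1034\t1096\t1266\nCDKN2A\t585\t255\t372\t273\n' > "$counts_file"

# Compare column 1 exactly instead of searching for a substring.
exact_row=$(awk -F'\t' '$1 == "CDKN2A"' "$counts_file")
echo "$exact_row"

# Sum the four sample columns for that gene.
gene_total=$(awk -F'\t' '$1 == "CDKN2A" { print $2 + $3 + $4 + $5 }' "$counts_file")
echo "total: $gene_total"
```

On the real file, `awk -F'\t' '$1 == "CDKN2A"' all.gene.counts` or `grep -w CDKN2A all.gene.counts` should give the single matching row.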
