# Assignment Comparative and Regulatory Genomics
## Master - Bulk RNA Sequencing
Tim Blokker - r0639760

The dataset was taken from https://www.ncbi.nlm.nih.gov/sra/SRX8335650 and accompanies the publication by Carmona-Rivera et al. 2020 (https://insight.jci.org/articles/view/139388). The data were obtained by isolating human fibroblast-like synoviocytes (FLS) from synovial tissue of osteoarthritis (OA) or rheumatoid arthritis (RA) joints. Cells were treated with neutrophil extracellular traps (NETs) for 48 hours, after which RNA was isolated. cDNA libraries were prepared with poly(A) tail enrichment. RNA-Seq libraries were sequenced on an Illumina HiSeq 3000 as single-end 50-base reads.
The paper mainly addresses increased cartilage damage in rheumatoid arthritis cells stimulated with NETs compared with rheumatoid arthritis cells not stimulated with NETs.

In total there are 3 Jupyter notebooks in this submission:
1. Master notebook: FastQC, mapping of reads using STAR and subsequent feature counting (this notebook)
2. The DESeq2 analysis in R for the statistical analysis
3. Functional analysis using arbitrary (GProfiler, iRegulon) and "leading edge" (Gorilla, GSEA) cut-offs.

In [4]:
mkdir -p /mnt/storage/$USER/jupyternotebooks/Assignment/
cd /mnt/storage/$USER/jupyternotebooks/Assignment/

In [1]:
vdb-config -s /repository/user/cache-disabled=true
pwd

/mnt/storage/r0639760/jupyternotebooks/Assignment


### Dumping FastQ files
The read length of the second read in each spot is 0, as is also visible at https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11783198. The data were probably uploaded with a paired-end layout even though the sequencing was single end, which leaves an empty second read in every spot. This would explain the `Rejected ... READS because READLEN < 1` messages in the output below.


In [None]:
#Test run
START=11783198
END=11783201

for i in $(seq $START $END);
do
    fastq-dump --split-files -X 50 'SRR'"$i" 
done

Rejected 50 READS because READLEN < 1
Read 50 spots for SRR11783198
Written 50 spots for SRR11783198
Rejected 50 READS because READLEN < 1
Read 50 spots for SRR11783199
Written 50 spots for SRR11783199


In [3]:
START=11783198
END=11783201

for i in $(seq $START $END); 
do
    fastq-dump --split-files 'SRR'"$i";
done

Rejected 51400777 READS because READLEN < 1
Read 51400777 spots for SRR11783198
Written 51400777 spots for SRR11783198
Rejected 45674553 READS because READLEN < 1
Read 45674553 spots for SRR11783199
Written 45674553 spots for SRR11783199
Rejected 59532979 READS because READLEN < 1
Read 59532979 spots for SRR11783200
Written 59532979 spots for SRR11783200
Rejected 45036641 READS because READLEN < 1
Read 45036641 spots for SRR11783201
Written 45036641 spots for SRR11783201


In [4]:
#check number of reads (a FASTQ record is 4 lines; matching '@' could also hit quality lines)
for i in $(seq $START $END);
do
    echo 'Number of reads for SRR'"$i"'_1.fastq : ' | tr -d "\n";
    echo $(( $(wc -l < 'SRR'"$i"'_1.fastq') / 4 ));
done
#Save output below since files will be removed in next box
#Number of reads for SRR11783198_1.fastq : 51400777
#Number of reads for SRR11783199_1.fastq : 45674553
#Number of reads for SRR11783200_1.fastq : 59532979
#Number of reads for SRR11783201_1.fastq : 45036641

Number of reads for SRR11783198_1.fastq : 51400777
Number of reads for SRR11783199_1.fastq : 45674553
Number of reads for SRR11783200_1.fastq : 59532979
Number of reads for SRR11783201_1.fastq : 45036641
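
A note on the counting method: matching header lines on `@` can overcount, because `@` is also a valid quality character in Phred+33 (it encodes Q31) and may therefore appear on quality lines. A FASTQ record in these files spans four lines, so dividing the line count by four is the more robust check. A minimal sketch, where `count_fastq_reads` is a hypothetical helper name and unwrapped four-line records are assumed:

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file.
# Assumes each record is exactly four lines (no wrapped sequences).
count_fastq_reads() {
    local n_lines
    n_lines=$(wc -l < "$1")
    echo $(( n_lines / 4 ))
}

# Example call on one of the dumped files:
# count_fastq_reads SRR11783198_1.fastq
```

Here the counts agree with the number of spots reported by `fastq-dump`, so either method gives the same answer for this dataset.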


## Quality control using FASTQC

In [5]:
START=11783198
END=11783201
mkdir -p FastQC/
for i in $(seq $START $END);
do
/usr/bin/fastqc -o FastQC 'SRR'"$i"'_1.fastq'; #save the output to FastQC folder
done

Started analysis of SRR11783198_1.fastq
Approx 5% complete for SRR11783198_1.fastq
Approx 10% complete for SRR11783198_1.fastq
Approx 15% complete for SRR11783198_1.fastq
Approx 20% complete for SRR11783198_1.fastq
Approx 25% complete for SRR11783198_1.fastq
Approx 30% complete for SRR11783198_1.fastq
Approx 35% complete for SRR11783198_1.fastq
Approx 40% complete for SRR11783198_1.fastq
Approx 45% complete for SRR11783198_1.fastq
Approx 50% complete for SRR11783198_1.fastq
Approx 55% complete for SRR11783198_1.fastq
Approx 60% complete for SRR11783198_1.fastq
Approx 65% complete for SRR11783198_1.fastq
Approx 70% complete for SRR11783198_1.fastq
Approx 75% complete for SRR11783198_1.fastq
Approx 80% complete for SRR11783198_1.fastq
Approx 85% complete for SRR11783198_1.fastq
Approx 90% complete for SRR11783198_1.fastq
Approx 95% complete for SRR11783198_1.fastq
Analysis complete for SRR11783198_1.fastq
Started analysis of SRR11783199_1.fastq
Approx 5% complete for SRR11783199_1.fastq


### Initial FastQC results
#### Overrepresented sequences
The QC results show "Overrepresented sequences" in all 4 runs, and these turn out to be the sequencing adapters.
FastQC raises a warning in this module when a single sequence makes up more than 0.1% of all reads and an error when it makes up more than 1%.

    SRR11783198_1 TruSeq Adapter, Index 19 (97% over 40bp)
    SRR11783199_1 TruSeq Adapter, Index 21 (97% over 40bp)
    SRR11783200_1 TruSeq Adapter, Index 1  (100% over 50bp)
    SRR11783201_1 TruSeq Adapter, Index 10 (100% over 50bp)

    TruSeq Adapter, Index 1 from FastQC        GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGC                                                                                    
    TruSeq Adapter, Index 1 from Illumina docs:GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
                                               
https://emea.support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/experiment-design/illumina-adapter-sequences-1000000002694-14.pdf

We conclude from this section that the reads are contaminated with adapters, and FastQC does a good job of detecting them. The adapter is not sequenced in full because the read length is only 50 bp.
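
The share of reads affected by adapter sequence can also be estimated directly from the FASTQ file. The sketch below counts reads that contain the first 12 bases shared by all four TruSeq index adapters listed above (`GATCGGAAGAGC`). The helper name `adapter_fraction` and the 12-base prefix are assumptions made for illustration, and since FastQC's module is not a prefix search, the percentages need not match its report exactly.

```shell
# Hypothetical helper: percentage of reads whose sequence contains a motif.
# Usage: adapter_fraction <fastq> <motif>
adapter_fraction() {
    LC_ALL=C awk -v motif="$2" '
        NR % 4 == 2 {                 # sequence lines only
            total++
            if (index($0, motif) > 0) hits++
        }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * hits / total
        }' "$1"
}

# Example call:
# adapter_fraction SRR11783198_1.fastq GATCGGAAGAGC
```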
    
#### Sequence Duplication levels

The level of sequence duplication is very high: after deduplication only 23% to 27% of the sequences would remain. Removing duplicates is, however, a debated subject, since it is not clear whether the mRNA was really present in higher copy numbers or whether a PCR artefact caused the duplication.
![image.png](attachment:image.png)
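
As a rough cross-check of the duplication level, the fraction of distinct read sequences can be computed with standard tools. This is a simplified proxy and not a reimplementation of FastQC's module, so the numbers need not match the report exactly. `unique_fraction` is a hypothetical helper name, and the default limit of 100000 reads is an arbitrary choice to keep the sort small.

```shell
# Hypothetical helper: percentage of distinct sequences among the first N reads.
# Usage: unique_fraction <fastq> [N]   (N defaults to 100000)
unique_fraction() {
    local n="${2:-100000}"
    awk 'NR % 4 == 2' "$1" | head -n "$n" | sort | uniq -c |
        LC_ALL=C awk '{ distinct++; total += $1 }
             END { if (total) printf "%.2f\n", 100 * distinct / total }'
}

# Example call:
# unique_fraction SRR11783198_1.fastq
```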


**K-mer** levels are also high and the **Per base sequence content** is uneven.

The K-mer issue should be solved by trimming the adapters: adapter sequence only appears at the 3' end of a read, and this is exactly where we observe the K-mers.
The per base sequence content might improve after removing adapter dimers.

Trimmomatic will be used to trim the adapter sequences, and because we apply a minimum-length filter we also get rid of adapter dimers at the same time.

In [3]:
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.36.zip
unzip Trimmomatic-0.36.zip

--2020-11-18 18:47:07--  http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.36.zip
Resolving www.usadellab.org (www.usadellab.org)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘www.usadellab.org’
unzip:  cannot find or open Trimmomatic-0.36.zip, Trimmomatic-0.36.zip.zip or Trimmomatic-0.36.zip.ZIP.



In [5]:
echo '>TruSeq_Adapter_Index_1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCC

>TruSeq_Adapter_Index_10
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGC

>TruSeq_Adapter_Index_21
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTCGTATG

>TruSeq_Adapter_Index_19
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGAAACGATCTCGTATG' > adapter_sequences.fa

cat adapter_sequences.fa

>TruSeq_Adapter_Index_1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCC

>TruSeq_Adapter_Index_10
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGC

>TruSeq_Adapter_Index_21
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTCGTATG

>TruSeq_Adapter_Index_19
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGAAACGATCTCGTATG


In [7]:
#use python script to determine whether encoding is Phred+64 or Phred+33, credit to Brent Pedersen
curl https://raw.githubusercontent.com/brentp/bio-playground/master/reads-utils/guess-encoding.py > guess-encoding.py
chmod +x guess-encoding.py
awk 'NR % 4 == 0' SRR11783198_1.fastq | ./guess-encoding.py -n 1000

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4631  100  4631    0     0   2672      0  0:00:01  0:00:01 --:--:--  2670
# reading qualities from STDIN
Illumina-1.8	35	74


#### Phred Encoding
Phred encoding for Illumina-1.8 is Phred+33.
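
The same check can be done without an external script by looking at the range of ASCII codes on the quality lines: Phred+33 data starts at code 33 (`!`), whereas Phred+64 data cannot contain codes below 59. A minimal sketch, in which `quality_range` is a hypothetical helper name and the default sample of 1000 quality lines mirrors the `-n 1000` used above:

```shell
# Hypothetical helper: print the lowest and highest ASCII code found on the
# quality lines of the first N reads. Usage: quality_range <fastq> [N]
quality_range() {
    local n="${2:-1000}"
    awk 'NR % 4 == 0' "$1" | head -n "$n" | tr -d '\n' |
        od -An -v -tu1 |
        awk '{ for (i = 1; i <= NF; i++) {
                   v = $i + 0                      # force numeric comparison
                   if (!seen || v < min) min = v
                   if (!seen || v > max) max = v
                   seen = 1 } }
             END { if (seen) print min, max }'
}

# Example call:
# quality_range SRR11783198_1.fastq
```

The range of 35 to 74 reported by the script above (`#` to `J`) includes codes below 59, which is only possible with Phred+33.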

In [8]:
START=11783198
END=11783201
mkdir -p trimmomatic
for i in $(seq $START $END);
do
    java -jar Trimmomatic-0.36/trimmomatic-0.36.jar SE \
        -threads 40 \
        -phred33 \
        'SRR'"$i"'_1.fastq' \
        trimmomatic/'SRR'"$i"'_1_trimmed.fastq' \
        ILLUMINACLIP:adapter_sequences.fa:2:30:10 \
        MINLEN:36 \
        && rm 'SRR'"$i"'_1.fastq'   # delete the raw file only if trimming succeeded
done

TrimmomaticSE: Started with arguments:
 -threads 40 -phred33 SRR11783198_1.fastq trimmomatic/SRR11783198_1_trimmed.fastq ILLUMINACLIP:adapter_sequences.fa:2:30:10 MINLEN:36
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGAAACGATCTCGTATG'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCC'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGC'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTCGTATG'
ILLUMINACLIP: Using 0 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Reads: 51400777 Surviving: 50898193 (99.02%) Dropped: 502584 (0.98%)
TrimmomaticSE: Completed successfully
TrimmomaticSE: Started with arguments:
 -threads 40 -phred33 SRR11783199_1.fastq trimmomatic/SRR11783199_1_trimmed.fastq ILLUMINACLIP:adapter_sequences.fa:2:30:10 MINLEN:36
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGAAACGATCTCGTATG'
Using 

**SRR11783198_1.fastq.gz**  
    
    Dropped: 502584 (0.98%)
**SRR11783199_1.fastq.gz**

    Dropped: 597885 (1.31%)
**SRR11783200_1.fastq.gz** 

    Dropped: 2523090 (4.24%)
**SRR11783201_1.fastq.gz**

    Dropped: 305018 (0.68%)
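
The numbers above were copied by hand from the Trimmomatic output. If that output is saved to a file, the same values can be pulled out automatically. A minimal sketch, assuming the `Input Reads: ... Surviving: ... Dropped: ...` line format shown in the output above; `trim_summary` and the log file name in the usage comment are hypothetical.

```shell
# Hypothetical helper: extract input, surviving and dropped read counts
# (plus the dropped percentage) from Trimmomatic SE log lines on stdin.
trim_summary() {
    awk '/^Input Reads:/ {
             pct = $9
             gsub(/[()]/, "", pct)      # "(0.98%)" -> "0.98%"
             print $3, $5, $8, pct
         }'
}

# Example: trim_summary < trimmomatic.log
```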

In [10]:
START=11783198
END=11783201
mkdir -p FastQC_trimmed
for i in $(seq $START $END);
do
   /usr/bin/fastqc -o FastQC_trimmed/ trimmomatic/'SRR'"$i"'_1_trimmed.fastq'; #save the output to FastQC folder;
done

Started analysis of SRR11783198_1_trimmed.fastq
Approx 5% complete for SRR11783198_1_trimmed.fastq
Approx 10% complete for SRR11783198_1_trimmed.fastq
Approx 15% complete for SRR11783198_1_trimmed.fastq
Approx 20% complete for SRR11783198_1_trimmed.fastq
Approx 25% complete for SRR11783198_1_trimmed.fastq
Approx 30% complete for SRR11783198_1_trimmed.fastq
Approx 35% complete for SRR11783198_1_trimmed.fastq
Approx 40% complete for SRR11783198_1_trimmed.fastq
Approx 45% complete for SRR11783198_1_trimmed.fastq
Approx 50% complete for SRR11783198_1_trimmed.fastq
Approx 55% complete for SRR11783198_1_trimmed.fastq
Approx 60% complete for SRR11783198_1_trimmed.fastq
Approx 65% complete for SRR11783198_1_trimmed.fastq
Approx 70% complete for SRR11783198_1_trimmed.fastq
Approx 75% complete for SRR11783198_1_trimmed.fastq
Approx 80% complete for SRR11783198_1_trimmed.fastq
Approx 85% complete for SRR11783198_1_trimmed.fastq
Approx 90% complete for SRR11783198_1_trimmed.fastq
Approx 95% comple

https://workshop.eupathdb.org/bop/pdfs/fastqc_output.pdf 
#### Duplicated sequences
In RNA-Seq libraries sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over-sequence high expressed transcripts, and this will potentially create large set of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins. This duplication will come from physically connected regions, **and an examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication, but these distinctions are not possible from raw fastq files.** A similar situation can arise in highly enriched ChIP-Seq libraries although the duplication there is less pronounced. Finally, if you have a library where the sequence start points are constrained (a library constructed around restriction sites for example, or an unfragmented small RNA library) then the constrained start sites will generate huge duplication levels which should not be treated as a problem, nor removed by deduplication. In these types of library you should consider using a system such as random barcoding to allow the distinction of technical and biological duplicates.

#### Kmers
Common reasons for warnings: Any individually overrepresented sequences, even if not present at a high enough threshold to trigger the overrepresented sequences module, will cause the Kmers from those sequences to be highly enriched in this module. These will normally appear as sharp spikes of enrichment at a single point in the sequence, rather than a progressive or broad enrichment. Libraries which derive from random priming will nearly always show Kmer bias at the start of the library due to an incomplete sampling of the possible random primers.

#### Per Base Sequence Content
Biased fragmentation: Any library which is generated based on the ligation of random hexamers or through tagmentation should theoretically have good diversity through the sequence, but experience has shown that these libraries always have a selection bias in around the first 12bp of each run. This is due to a biased selection of random primers, but doesn't represent any individually biased sequences. Nearly all RNA-Seq libraries will fail this module because of this bias, but this is not a problem which can be fixed by processing, and it doesn't seem to adversely affect the ability to measure expression.
![image.png](attachment:image.png)
**As a conclusion we continue with this data set after trimming adaptors at the sequence ends and removing adapter dimers and keep the findings from the FastQC step in mind.**

## Mapping to the genome

We will use STAR (Spliced Transcripts Alignment to Reference) to map to the genome. 

A database with the human genome and gene annotation was pregenerated and so we just point there instead of copying it to our directory. 

The STAR manual gives an extensive account of all options: https://raw.githubusercontent.com/alexdobin/STAR/master/doc/STARmanual.pdf 

In [2]:
mkdir -p Mapped
START=11783198
END=11783201
for i in $(seq $START $END);
do
    STAR --genomeDir /mnt/nfs/mfiers/STAR/hg19_star_db \
         --runThreadN 40 \
         --readFilesIn trimmomatic/'SRR'"$i"'_1_trimmed.fastq' \
         --outFileNamePrefix Mapped/'SRR'"$i";
done

Oct 10 20:26:08 ..... started STAR run
Oct 10 20:26:08 ..... loading genome
Oct 10 20:26:26 ..... started mapping
Oct 10 20:27:47 ..... finished successfully
Oct 10 20:27:47 ..... started STAR run
Oct 10 20:27:47 ..... loading genome
Oct 10 20:28:05 ..... started mapping
Oct 10 20:29:16 ..... finished successfully
Oct 10 20:29:17 ..... started STAR run
Oct 10 20:29:17 ..... loading genome
Oct 10 20:29:34 ..... started mapping
Oct 10 20:31:03 ..... finished successfully
Oct 10 20:31:04 ..... started STAR run
Oct 10 20:31:04 ..... loading genome
Oct 10 20:31:22 ..... started mapping
Oct 10 20:32:34 ..... finished successfully
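
STAR writes a per-sample summary next to the alignments; with the prefix used above this should be `Mapped/SRR<run>Log.final.out`. The uniquely mapped percentage in that file is a quick sanity check before counting. A minimal sketch, assuming the usual `Uniquely mapped reads % |` line in that file; `star_unique_pct` is a hypothetical helper name.

```shell
# Hypothetical helper: print the uniquely mapped percentage from a STAR
# Log.final.out file. Usage: star_unique_pct <Log.final.out>
star_unique_pct() {
    awk -F'|' '/Uniquely mapped reads %/ {
                   gsub(/[ \t]/, "", $2)   # strip padding around the value
                   print $2
               }' "$1"
}

# Example loop over the four runs (file names follow the STAR prefix used above):
# for i in $(seq 11783198 11783201); do star_unique_pct Mapped/SRR"$i"Log.final.out; done
```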


In [3]:
head -40 Mapped/SRR11783198Aligned.out.sam | grep '^@' #

@HD	VN:1.4
@SQ	SN:chrM	LN:16571
@SQ	SN:chr1	LN:249250621
@SQ	SN:chr2	LN:243199373
@SQ	SN:chr3	LN:198022430
@SQ	SN:chr4	LN:191154276
@SQ	SN:chr5	LN:180915260
@SQ	SN:chr6	LN:171115067
@SQ	SN:chr7	LN:159138663
@SQ	SN:chr8	LN:146364022
@SQ	SN:chr9	LN:141213431
@SQ	SN:chr10	LN:135534747
@SQ	SN:chr11	LN:135006516
@SQ	SN:chr12	LN:133851895
@SQ	SN:chr13	LN:115169878
@SQ	SN:chr14	LN:107349540
@SQ	SN:chr15	LN:102531392
@SQ	SN:chr16	LN:90354753
@SQ	SN:chr17	LN:81195210
@SQ	SN:chr18	LN:78077248
@SQ	SN:chr19	LN:59128983
@SQ	SN:chr20	LN:63025520
@SQ	SN:chr21	LN:48129895
@SQ	SN:chr22	LN:51304566
@SQ	SN:chrX	LN:155270560
@SQ	SN:chrY	LN:59373566
@PG	ID:STAR	PN:STAR	VN:STAR_2.5.4b	CL:STAR   --runThreadN 40   --genomeDir /mnt/nfs/mfiers/STAR/hg19_star_db   --readFilesIn trimmomatic/SRR11783198_1_trimmed.fastq      --outFileNamePrefix Mapped/SRR11783198
@CO	user command line: STAR --genomeDir /mnt/nfs/mfiers/STAR/hg19_star_db --runThreadN 40 --readFilesIn trimmomatic/SRR11783198_1_trimmed.fastq --outFileN

In [4]:
head -34 Mapped/SRR11783198Aligned.out.sam | grep -v '^@' # everything that does not begin with an @

SRR11783198.546324	0	chr12	133350898	255	9M820N41M	*	0	0	CTCCTTTCTCGTCTTGGCCGCGCCGCGGCGTAGGTCCAGCTTGAGCTGCT	AAFFFJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJ	NH:i:1	HI:i:1	AS:i:50	nM:i:0
SRR11783198.546325	16	chr14	69926744	255	50M	*	0	0	CAGACTAGGATAATTTTTTTTTCATATTTGCCAAAATTTTTGTAAACCCT	JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA	NH:i:1	HI:i:1	AS:i:49	nM:i:0
SRR11783198.546326	0	chr2	218667286	255	50M	*	0	0	GGCTGGGACATGTGCAACCCCTCCCAATGCTGAGCCCCACACAGTCTAGG	AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ	NH:i:1	HI:i:1	AS:i:49	nM:i:0
SRR11783198.546327	0	chr19	41173139	255	50M	*	0	0	CCGGCAGTGAAATGGTTCCCTTAGCCAGGCTGGGTCCGTCCCTGAATTCC	AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ	NH:i:1	HI:i:1	AS:i:49	nM:i:0
SRR11783198.546328	0	chr7	45952672	255	16M1D34M	*	0	0	CTCCCTGTCTCTCTGTCCTCCTACCCCACGGGGCCGCAGCAAAAGCCATC	AAFFFJJJJJJJJJJJJJFJJAFFJJJ7FAJJJJJJJ-AJ<JJJJJJJFF	NH:i:1	HI:i:1	AS:i:45	nM:i:0
SRR11783198.546329	0	chr5	81571891	255	50M	*	0	0	TGAAAATTTATTACTACAGTGTTTTCACCATTAA

In [6]:
START=11783198
END=11783201
mkdir -p bam
for i in $(seq $START $END);
do
   samtools sort \
       -o bam/'SRR'"$i"'.bam' \
       Mapped/'SRR'"$i"'Aligned.out.sam'\
       --threads 40;
done

[bam_sort_core] merging from 0 files and 40 in-memory blocks...
[bam_sort_core] merging from 0 files and 40 in-memory blocks...
[bam_sort_core] merging from 0 files and 40 in-memory blocks...
[bam_sort_core] merging from 0 files and 40 in-memory blocks...


In [None]:
START=11783198
END=11783201
for i in $(seq $START $END);
do
    rm  Mapped/'SRR'"$i"'Aligned.out.sam';
done
if [ -d trimmomatic ]; then rm -r trimmomatic; fi   # -r is needed to remove a directory

rm: cannot remove 'Mapped/SRR11783198Aligned.out.sam': No such file or directory
rm: cannot remove 'Mapped/SRR11783199Aligned.out.sam': No such file or directory
rm: cannot remove 'Mapped/SRR11783200Aligned.out.sam': No such file or directory
rm: cannot remove 'Mapped/SRR11783201Aligned.out.sam': No such file or directory


### Indexing the Bam files

In [23]:
START=11783198
END=11783201

for i in $(seq $START $END);
do
    samtools index  bam/'SRR'"$i"'.bam';
done

### Symbolic link to the annotation of the human genome and then count the features.

featureCounts counts the reads mapped by STAR per genomic feature. Such features include genes, exons, promoter regions etc. The program is further described here: http://bioinf.wehi.edu.au/featureCounts/.

In [27]:
mkdir -p gtf
ln -sf /mnt/nfs/data/RNA-seq/gencode.v19.nopseudo.plus.sort.gtf gtf


In [28]:
ls -l gtf/*gtf

lrwxrwxrwx 1 r0639760 domain users 56 Oct 10 21:02 gtf/gencode.v19.nopseudo.plus.sort.gtf -> /mnt/nfs/data/RNA-seq/gencode.v19.nopseudo.plus.sort.gtf


In [29]:
grep -w TP53 gtf/gencode.v19.nopseudo.plus.sort.gtf |head

chr17	HAVANA	exon	7565097	7565332	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000413465.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "PUTATIVE"; transcript_name "TP53-018"; exon_number 7;  exon_id "ENSE00001657961.2";  level 2; tag "basic"; havana_gene "OTTHUMG00000162125.4"; havana_transcript "OTTHUMT00000440236.1";
chr17	HAVANA	gene	7565097	7590856	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENSG00000141510.11"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TP53"; level 2; havana_gene "OTTHUMG00000162125.4";
chr17	HAVANA	transcript	7565097	7579912	.	-	.	gene_id "ENSG00000141510.11"; transcript_id "ENST00000413465.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TP53"; transcript_type "protein_coding"; transcript_status "PUTATIVE"; transcript_name "TP53-018"; level 2; tag "bas
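
By default featureCounts counts reads overlapping `exon` features and sums them per meta-feature, which is the `gene_name` attribute in the call below (`-g gene_name`). To see which feature types an annotation file actually contains, column 3 of the GTF can be tabulated. A minimal sketch; `gtf_feature_types` is a hypothetical helper name.

```shell
# Hypothetical helper: count how often each feature type (GTF column 3) occurs.
# Usage: gtf_feature_types <annotation.gtf>
gtf_feature_types() {
    awk -F'\t' '!/^#/ { n[$3]++ }
                END   { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Example call:
# gtf_feature_types gtf/gencode.v19.nopseudo.plus.sort.gtf
```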

In [1]:
START=11783198
END=11783201
mkdir -p counts
for i in $(seq $START $END);
do
    featureCounts \
    -Q 10 \
    -g gene_name \
    -a /mnt/nfs/data/RNA-seq/gencode.v19.nopseudo.plus.sort.gtf \
    -o counts/'SRR'"$i"'.counts' \
    bam/'SRR'"$i"'.bam' ;
done


        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
	  v1.6.0

||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S bam/SRR11783198.bam                            ||
||                                                                            ||
||             Output file : counts/SRR11783198.counts                        ||
||                 Summary : counts/SRR11783198.counts.summary                ||
||              Annotation : /mnt/nfs/data/RNA-seq/gencode.v19.nopseudo.

The feature count resulted in 67% of the reads being assigned to known features. This is a relatively high value, so we continue with the analysis.
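
The assignment rate can be recomputed from the `.counts.summary` file that featureCounts writes next to each count table. A minimal sketch, assuming the two-column, tab-separated summary layout (status, count) with one header line; `assigned_pct` is a hypothetical helper name.

```shell
# Hypothetical helper: percentage of reads with status "Assigned" in a
# featureCounts summary file. Usage: assigned_pct <file.counts.summary>
assigned_pct() {
    LC_ALL=C awk -F'\t' '
        NR > 1 {                               # skip the "Status <bam>" header
            total += $2
            if ($1 == "Assigned") assigned = $2
        }
        END { if (total) printf "%.1f\n", 100 * assigned / total }' "$1"
}

# Example call:
# assigned_pct counts/SRR11783198.counts.summary
```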

In [35]:
echo -e "1\t2\t3\t4\t5\t6\t7"
awk "NR==2"  counts/SRR11783198.counts 

1	2	3	4	5	6	7
Geneid	Chr	Start	End	Strand	Length	bam/SRR11783198.bam


From the output above we can see that we are interested in columns 1 and 7, which we retrieve using the cut command. We then select a few genes that we know from the paper to be upregulated, and indeed we see them being upregulated in the RA + NET samples compared to the RA only samples.

In [42]:
cut -f1,7 counts/SRR11783198.counts | grep IL1B #no NET
cut -f1,7 counts/SRR11783199.counts | grep IL1B #no NET
cut -f1,7 counts/SRR11783200.counts | grep IL1B #NET -> Higher inflammation levels
cut -f1,7 counts/SRR11783201.counts | grep IL1B #NET -> Higher inflammation levels

IL1B	13
IL1B	8
IL1B	1069
IL1B	874


In [41]:
cut -f1,7 counts/SRR11783198.counts | grep IL33 #no NET
cut -f1,7 counts/SRR11783199.counts | grep IL33 #no NET
cut -f1,7 counts/SRR11783200.counts | grep IL33 #NET -> Higher inflammation levels
cut -f1,7 counts/SRR11783201.counts | grep IL33 #NET -> Higher inflammation levels

IL33	130
IL33	138
IL33	11320
IL33	9274


In [21]:
cut -f1,7 counts/SRR11783198.counts | grep -v '^#' | grep CXCL6 #no NET
cut -f1,7 counts/SRR11783199.counts | grep -v '^#' | grep CXCL6 #no NET
cut -f1,7 counts/SRR11783200.counts | grep -v '^#' | grep CXCL6 #NET -> Higher inflammation levels
cut -f1,7 counts/SRR11783201.counts | grep -v '^#' | grep CXCL6 #NET -> Higher inflammation levels

CXCL6	325
CXCL6	297
CXCL6	649948
CXCL6	540233


So we see much higher counts for cytokines produced by fibroblasts (IL33) or macrophages (IL1B) and for chemokines (CXCL6), as described in the publication. This fits the hypothesis of the authors, who report higher levels of inflammation when NETs are formed.
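
The three lookups above can be folded into one helper that matches the gene name exactly in column 1, which avoids accidental hits on longer names that merely contain the pattern. A minimal sketch; `gene_counts` is a hypothetical helper name.

```shell
# Hypothetical helper: print the count (column 7) of one gene for each
# featureCounts table given. Usage: gene_counts <gene> <file.counts>...
gene_counts() {
    local gene="$1" f
    shift
    for f in "$@"; do
        awk -F'\t' -v g="$gene" -v f="$f" \
            '$1 == g { print f "\t" g "\t" $7 }' "$f"
    done
}

# Example call:
# gene_counts IL33 counts/SRR11783198.counts counts/SRR11783200.counts
```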

Below is the final step of this quality control: we check the expected cytokines and chemokines for upregulation in IGV. Indeed we see higher read counts for the NET-induced samples compared to the control samples. The coverage might not look very different, but note the scale: for the controls it ranges from 0 to 10, whereas for the treated samples it goes up to 270 and even 1000 mapped reads.

**IL33**
![igv_snapshot.png](attachment:igv_snapshot.png)

**IL1B**
![igv_snapshot_IL1B.png](attachment:igv_snapshot_IL1B.png)

**CXCL6**
![igv_snapshoCXCL6t.png](attachment:igv_snapshoCXCL6t.png)

### Moving on to the DESeq2 notebook for the statistical analysis

In [43]:
mkdir -p DeSeq
ls

adapter_sequences.fa  DeSeq         FastQC_trimmed     Mapped
bam                   DeSeq2.ipynb  gtf                RNAseq.map.count.ipynb
counts                FastQC        guess-encoding.py  Trimmomatic-0.36
