### `Downloading FASTQ Sequences`

In [1]:
#!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR298/009/ERR2985659/ERR2985659_1.fastq.gz

In [2]:
#!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR298/009/ERR2985659/ERR2985659_2.fastq.gz
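The two download URLs above differ only in the mate suffix, so they can be generated from the run accession. The sketch below is a minimal illustration, not an official ENA client: the helper name `ena_fastq_urls` is hypothetical, and the directory rule (first six characters, then a zero-padded sub-directory for accessions longer than nine characters) is an assumption generalised from the URLs used in this notebook.

```python
def ena_fastq_urls(run_accession, paired=True):
    """Build ENA FTP URLs for a run's FASTQ files.

    Assumption: the directory layout follows the pattern seen in the
    URLs above (ERR298/009/ERR2985659/...).
    """
    prefix = run_accession[:6]
    # Characters after the 9th form a zero-padded sub-directory,
    # e.g. ERR2985659 -> "9" -> "009".
    extra = run_accession[9:]
    subdir = f"{extra.zfill(3)}/" if extra else ""
    base = f"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{prefix}/{subdir}{run_accession}"
    mates = ("_1", "_2") if paired else ("",)
    return [f"{base}/{run_accession}{mate}.fastq.gz" for mate in mates]


for url in ena_fastq_urls("ERR2985659"):
    print(url)
```

Each printed URL can be passed to `wget` exactly as in the cells above.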

### `Read Trimming with Cutadapt`

In [None]:
# Installation 
!conda install -c bioconda cutadapt 
# In a bash script, drop the leading '!'

In [1]:
!mkdir -p cutadapt_trim_output 

In [4]:
!cutadapt --adapter AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -g AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --trim-n --output cutadapt_trim_output/ERR2985659_trim_1.fastq.gz --paired-output cutadapt_trim_output/ERR2985659_trim_2.fastq.gz ERR2985659_1.fastq.gz ERR2985659_2.fastq.gz
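# Regular 3' adapter on read 1: -a ADAPTER (long form: --adapter)
# Regular 5' adapter on read 1: -g ADAPTER
# 3' adapter on read 2: -A ADAPTER (using it also switches cutadapt out of legacy paired-end mode)
# Note: the command above passes the second adapter with -g, so it is searched as a 5' adapter on read 1 and read 2 is left untrimmed (the log reports "Read 2 with adapter: 0").
# --trim-n : trim N (ambiguous) bases from the ends of reads. Ns at read ends usually mark positions the sequencer could not call.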

This is cutadapt 1.18 with Python 3.7.6
Command line parameters: --adapter AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -g AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --trim-n --output cutadapt_trim_output/ERR2985659_trim_1.fastq.gz --paired-output cutadapt_trim_output/ERR2985659_trim_2.fastq.gz ERR2985659_1.fastq.gz ERR2985659_2.fastq.gz
Processing reads on 1 core in paired-end legacy mode ...
*ignore* the second read. To switch to regular paired-end mode,
provide the --pair-filter=any option or use any of the
-A/-B/-G/-U/--interleaved options.
Finished in 394.98 s (12 us/read; 5.17 M reads/minute).

=== Summary ===

Total read pairs processed:         34,006,553
  Read 1 with adapter:                 672,021 (2.0%)
  Read 2 with adapter:                       0 (0.0%)
Pairs written (passing filters):    34,006,553 (100.0%)

Total basepairs processed: 6,615,557,434 bp
  Read 1: 3,328,882,137 bp
  Read 2: 3,286,675,297 bp
Total written (filtered):  6,613,205,317 bp (100.0%)
  Read 1: 3,326,530,020 bp
  Rea
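The log above reports `Read 2 with adapter: 0` together with the legacy-mode warning, because `-g` is applied to read 1 only. The sketch below assembles a paired-end call that gives read 2 its own 3' adapter through `-A`. It only builds and prints the command. The helper name `cutadapt_paired_cmd` is hypothetical, and treating the second sequence as the read 2 3' adapter is an assumption about the library preparation.

```python
import shlex

# Adapter sequences copied from the cutadapt cell above.
R1_ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
R2_ADAPTER = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"  # assumed to be the read 2 3' adapter


def cutadapt_paired_cmd(r1_in, r2_in, r1_out, r2_out):
    """Return the argument list for a paired-end cutadapt run."""
    return [
        "cutadapt",
        "-a", R1_ADAPTER,   # 3' adapter searched in read 1
        "-A", R2_ADAPTER,   # 3' adapter searched in read 2
        "--trim-n",
        "-o", r1_out,       # trimmed read 1
        "-p", r2_out,       # trimmed read 2
        r1_in, r2_in,
    ]


cmd = cutadapt_paired_cmd(
    "ERR2985659_1.fastq.gz",
    "ERR2985659_2.fastq.gz",
    "cutadapt_trim_output/ERR2985659_trim_1.fastq.gz",
    "cutadapt_trim_output/ERR2985659_trim_2.fastq.gz",
)
# Print a shell-ready command line; pass `cmd` to subprocess.run to execute it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the arguments in a list avoids quoting problems if the same call is later run with `subprocess.run(cmd, check=True)`.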

In [5]:
!mkdir -p cutadapt_fastqc

In [9]:
#!fastqc -o cutadapt_fastqc/ cutadapt_trim_output/*.gz

In [10]:
!mkdir -p multiqc_cutadapt

In [21]:
#!multiqc cutadapt_fastqc/ -o multiqc_cutadapt/

# `STAR (Spliced Transcripts Alignment to a Reference) Aligner`

In [27]:
#Installation
!conda install -c bioconda star

## `1-pass mapping against an indexed genome with STAR`

In [22]:
# Download the genome FASTA file and the GFF annotation file

In [22]:
!wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_genomic.fna.gz

In [23]:
!wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_genomic.gff.gz

In [24]:
#Unzipping the genome and gff file
!gzip -d GCF_000195955.2_ASM19595v2_genomic.fna.gz

In [19]:
!gzip -d GCF_000195955.2_ASM19595v2_genomic.gff.gz

#### `Genome Indexing with STAR`

In [20]:
!STAR --runMode genomeGenerate --genomeDir genome_data/ --genomeFastaFiles genome_data/GCF_000195955.2_ASM19595v2_genomic.fna --sjdbGTFfile genome_data/GCF_000195955.2_ASM19595v2_genomic.gff --runThreadN 4 --sjdbOverhang 89
#Flags
    #--runThreadN: number of threads
    #--runMode genomeGenerate: build genome indices instead of mapping reads
    #--genomeDir: /path/to/store/genome_indices [genome_data/ in this case]
    #--genomeFastaFiles: /path/to/FASTA_file
    #--sjdbGTFfile: /path/to/annotation_file (GTF expected; for a GFF3 file such as this one, the STAR manual also calls for --sjdbGTFtagExonParentTranscript Parent)
    #--sjdbOverhang: (read length - 1)
      #length of the genomic sequence on each side of an annotated junction to be used when constructing the splice-junction database

	STAR --runMode genomeGenerate --genomeDir genome_data/ --genomeFastaFiles genome_data/GCF_000195955.2_ASM19595v2_genomic.fna --sjdbGTFfile genome_data/GCF_000195955.2_ASM19595v2_genomic.gff --runThreadN 4 --sjdbOverhang 89
	STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 01 16:35:14 ..... started STAR run
Jun 01 16:35:14 ... starting to generate Genome files
Jun 01 16:35:14 ..... processing annotations GTF
Jun 01 16:35:14 ... starting to sort Suffix Array. This may take a long time...
Jun 01 16:35:14 ... sorting Suffix Array chunks and saving them to disk...
Jun 01 16:35:15 ... loading chunks from disk, packing SA...
Jun 01 16:35:15 ... finished generating suffix array
Jun 01 16:35:15 ... generating Suffix Array index
Jun 01 16:35:18 ... completed Suffix Array index
Jun 01 16:35:18 ... writing Genome to disk ...
Jun 01 16:35:18 ... writing Suffix Array to disk ...
Jun 01 16:35:18 ... writing SAindex to disk
Jun 01 16:3
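`--sjdbOverhang` is described above as read length minus 1, yet this notebook uses 89 for the first index and 99 for the second. One way to choose the value is to measure the longest read in the FASTQ file. The sketch below assumes standard 4-line FASTQ records. The helper names and the 100,000-read sampling limit are hypothetical choices, not part of STAR.

```python
import gzip
import os


def max_read_length(fastq_gz_path, max_reads=100_000):
    """Return the longest sequence among the first `max_reads` records."""
    longest = 0
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no // 4 >= max_reads:
                break
            if line_no % 4 == 1:  # second line of each record is the sequence
                longest = max(longest, len(line.strip()))
    return longest


def sjdb_overhang(read_length):
    """Value for --sjdbOverhang: read length minus 1."""
    return read_length - 1


fastq = "reads/ERR2985659_1.fastq.gz"  # path used by the mapping cells
if os.path.exists(fastq):
    print("--sjdbOverhang", sjdb_overhang(max_read_length(fastq)))
```

Sampling only the first records keeps the check fast. Raise `max_reads` if read lengths vary strongly across the file.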

#### `Mapping with STAR`

In [21]:
!STAR --genomeDir genome_data --readFilesIn reads/ERR2985659_1.fastq.gz reads/ERR2985659_2.fastq.gz --readFilesCommand zcat --outSAMunmapped Within --outFileNamePrefix ERR2985659 --runThreadN 4
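#Flags
    #--runThreadN: number of threads / cores
    #--readFilesIn: /path/to/FASTQ_file(s); for paired-end data give read 1, then read 2
    #--readFilesCommand zcat: decompress the gzipped FASTQ files on the fly
    #--genomeDir: /path/to/genome_indices_directory
    #--outFileNamePrefix: prefix for all output files (here the outputs are named ERR2985659Aligned.out.sam, ERR2985659SJ.out.tab, ...)
    #--outSAMunmapped Within: keep unmapped reads in the main alignment file
    #--outSAMtype: output file type (SAM by default; not set here)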

	STAR --genomeDir genome_data --readFilesIn reads/ERR2985659_1.fastq.gz reads/ERR2985659_2.fastq.gz --readFilesCommand zcat --outSAMunmapped Within --outFileNamePrefix ERR2985659 --runThreadN 4
	STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 01 16:35:48 ..... started STAR run
Jun 01 16:35:48 ..... loading genome
Jun 01 16:35:48 ..... started mapping
Jun 01 18:26:04 ..... finished mapping
Jun 01 18:26:04 ..... finished successfully
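The console log only shows timestamps. STAR also writes a summary file named `<prefix>Log.final.out`, which holds the mapping statistics. The sketch below reads that file into a dictionary, assuming each statistic is written as `name | value` on its own line. The function name `parse_star_log` is hypothetical.

```python
import os


def parse_star_log(text):
    """Parse 'name | value' lines from a STAR Log.final.out file."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, _, value = line.partition("|")
        stats[key.strip()] = value.strip()
    return stats


log_path = "ERR2985659Log.final.out"  # prefix used in the mapping command above
if os.path.exists(log_path):
    with open(log_path) as handle:
        stats = parse_star_log(handle.read())
    # .get() avoids a KeyError if a field name differs between STAR versions
    print(stats.get("Uniquely mapped reads %"))
```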


## `2-Pass Mapping with STAR`

`There are two ways to do 2-pass mapping with STAR:`

##### 1. Re-generating the genome indices from splice junctions obtained from `1-pass`

##### 2. Using splice junctions directly during the mapping step

### `1. Regenerating the Genome Indices`

In [26]:
#Building the genome index
!mkdir -p star_index_2pass #directory
#using the splice junctions obtained from 1-pass mapping

#Additional flag that we will use
#`--sjdbFileChrStartEnd` : a tab-delimited file of splice-junction coordinates (chromosome, start, end, strand) to insert into the index; here, the SJ.out.tab file written by the 1-pass run.

In [33]:
!STAR --runThreadN 8 --runMode genomeGenerate --genomeDir star_index_2pass --genomeFastaFiles star_index_2pass/GCF_000195955.2_ASM19595v2_genomic.fna --sjdbGTFfile star_index_2pass/GCF_000195955.2_ASM19595v2_genomic.gff --sjdbFileChrStartEnd ERR2985659SJ.out.tab --sjdbOverhang 99

	STAR --runThreadN 8 --runMode genomeGenerate --genomeDir star_index_2pass --genomeFastaFiles star_index_2pass/GCF_000195955.2_ASM19595v2_genomic.fna --sjdbGTFfile star_index_2pass/GCF_000195955.2_ASM19595v2_genomic.gff --sjdbFileChrStartEnd ERR2985659SJ.out.tab --sjdbOverhang 99
	STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 01 20:29:23 ..... started STAR run
Jun 01 20:29:23 ... starting to generate Genome files
Jun 01 20:29:23 ..... processing annotations GTF
Jun 01 20:29:23 ... starting to sort Suffix Array. This may take a long time...
Jun 01 20:29:23 ... sorting Suffix Array chunks and saving them to disk...
Jun 01 20:29:23 ... loading chunks from disk, packing SA...
Jun 01 20:29:23 ... finished generating suffix array
Jun 01 20:29:23 ... generating Suffix Array index
Jun 01 20:29:26 ... completed Suffix Array index
Jun 01 20:29:26 ..... inserting junctions into the genome indices
Jun 01 20:29:29 ... writing Geno

In [2]:
!STAR --runThreadN 6 --readFilesIn reads/ERR2985659_1.fastq.gz reads/ERR2985659_2.fastq.gz --readFilesCommand zcat --genomeDir star_index_2pass/ --outFileNamePrefix ERR2985659_2_pass --outSAMunmapped Within

	STAR --runThreadN 6 --readFilesIn reads/ERR2985659_1.fastq.gz reads/ERR2985659_2.fastq.gz --readFilesCommand zcat --genomeDir star_index_2pass/ --outFileNamePrefix ERR2985659_2_pass --outSAMunmapped Within
	STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 02 07:25:15 ..... started STAR run
Jun 02 07:25:15 ..... loading genome
Jun 02 07:25:16 ..... started mapping
Jun 02 09:46:12 ..... finished mapping
Jun 02 09:46:12 ..... finished successfully


### `2. Using splice junctions directly during the mapping step`

In [5]:
!STAR --runThreadN 12 --readFilesIn reads/ERR2985659_1.fastq.gz reads/ERR2985659_2.fastq.gz --readFilesCommand zcat --genomeDir star_index_2pass/ --sjdbFileChrStartEnd SJ_out_filtered.tab --outFileNamePrefix ERR2985659_star_direct2pass --outSAMunmapped Within

	STAR --runThreadN 12 --readFilesIn reads/ERR2985659_1.fastq.gz reads/ERR2985659_2.fastq.gz --readFilesCommand zcat --genomeDir star_index_2pass/ --sjdbFileChrStartEnd SJ_out_filtered.tab --outFileNamePrefix ERR2985659_star_direct2pass --outSAMunmapped Within
	STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 02 10:24:37 ..... started STAR run
Jun 02 10:24:37 ..... loading genome
Jun 02 10:24:37 ..... inserting junctions into the genome indices
Jun 02 10:24:40 ..... started mapping
Jun 02 11:35:42 ..... finished mapping
Jun 02 11:35:42 ..... finished successfully
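The command above reads `SJ_out_filtered.tab`, but the notebook does not show how that file was produced. The sketch below is one plausible filter, not necessarily the one used here. It relies on the `SJ.out.tab` column layout from the STAR manual (column 5 is the intron motif, with 0 meaning non-canonical, and column 7 is the number of uniquely mapped reads crossing the junction). The function name and the threshold of 3 reads are hypothetical.

```python
import os


def filter_junctions(sj_in, sj_out, min_unique_reads=3, keep_noncanonical=False):
    """Copy junctions that pass simple support filters; return how many were kept."""
    kept = 0
    with open(sj_in) as src, open(sj_out, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 7:
                continue  # skip malformed or empty lines
            motif = int(fields[4])         # 0 = non-canonical intron motif
            unique_reads = int(fields[6])  # uniquely mapped reads across the junction
            if motif == 0 and not keep_noncanonical:
                continue
            if unique_reads < min_unique_reads:
                continue
            dst.write(line if line.endswith("\n") else line + "\n")
            kept += 1
    return kept


sj_file = "ERR2985659SJ.out.tab"  # written by the 1-pass run
if os.path.exists(sj_file):
    print(filter_junctions(sj_file, "SJ_out_filtered.tab"), "junctions kept")
```

The filtered file keeps the original columns, so it can be passed to `--sjdbFileChrStartEnd` unchanged.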


## Read count generation with `Bedtools`

In [28]:
#Installation
!conda install -c bioconda bedtools
!conda install -c bioconda samtools

In [6]:
!samtools view -S -b  ERR2985659_star_direct2passAligned.out.sam >ERR2985659.bam

In [7]:
!samtools sort ERR2985659.bam -o ERR2985659_sorted.bam

[bam_sort_core] merging from 21 files and 1 in-memory blocks...


In [10]:
!samtools index ERR2985659_sorted.bam

`Bedtools`

In [20]:
!bedtools multicov -bams ERR2985659_sorted.bam -bed genome_data/GCF_000195955.2_ASM19595v2_genomic.gff > read_counts.txt
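# multicov : for each interval in a BED/GFF/VCF file, reports the number of alignments that overlap it in each of one or more position-sorted, indexed BAM files (one count column per BAM is appended to the interval's original fields)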
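Because the intervals come from a GFF file, each line of `read_counts.txt` should hold the nine GFF columns followed by one count column for the single BAM file. The sketch below collects counts per feature under that assumption. The function name `load_counts`, the choice to keep only `gene` features, and the use of the `ID` attribute as the key are illustrative choices.

```python
import os


def load_counts(path, feature_type="gene"):
    """Map feature ID -> read count for one feature type in multicov output."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#") or len(fields) < 10:
                continue  # skip comments and lines without a count column
            if fields[2] != feature_type:
                continue
            # Column 9 holds 'key=value' attributes separated by ';'
            attrs = dict(
                item.split("=", 1) for item in fields[8].split(";") if "=" in item
            )
            # Fall back to coordinates when a feature has no ID attribute
            feature_id = attrs.get("ID", f"{fields[0]}:{fields[3]}-{fields[4]}")
            counts[feature_id] = int(fields[9])
    return counts


if os.path.exists("read_counts.txt"):
    gene_counts = load_counts("read_counts.txt")
    print(len(gene_counts), "genes with counts")
```

Restricting to one feature type avoids counting the same reads several times, since a GFF file typically lists a gene, its transcripts, and its CDS as separate overlapping intervals.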