# 2-3 Mapping sequencing data
Manuel Jara-Espejo$^{1}$\
Aboobaker lab, Department of Biology, University of Oxford

## Contents of notebook
1. Introduction
2. RNA-seq data
3. Obtaining and processing Phaw reference gene annotation *phaw_30* 
4. RNA-seq data alignment
5. Transcript assembly and merge

### 1. Introduction
This notebook describes the annotation of protein-coding genes in the *Parhyale hawaiensis* genome (Phaw5.0) based on PolyA RNA-seq data from this species. Briefly, we integrated 79 RNA-seq samples and the current *Parhyale* reference annotation into a StringTie2 annotation pipeline. More information on the StringTie annotation process can be found at http://ccb.jhu.edu/software/stringtie/index.shtml.

### 2. RNA-Seq data
In total, 78 PolyA+ RNA-seq data sets from different *Parhyale hawaiensis* life stages and conditions were used as the raw pool of RNA-seq reads for the expression-driven annotation. The sample IDs are listed below:

In [37]:
%%bash
#Aboobaker embryo
ls /drives/raid/AboobakerLab/data/bams/phaw/hisat/bams_05_2022/Aboobaker_embryo_data/polyA/*bam | xargs -n 1 basename 
#Aboobaker immune polyA
ls /drives/raid/AboobakerLab/data/bams/phaw/hisat/bams_05_2022/Aboobaker_immune_data/*bam | xargs -n 1 basename 
#Public data
ls /drives/raid/AboobakerLab/data/bams/phaw/hisat/bams_05_2022/Manchester_embryo_data/*bam  | xargs -n 1 basename
#Public data
ls /drives/raid/AboobakerLab/data/bams/phaw/hisat/bams_05_2022/Patel_embryo_data/merged_bams/*bam | xargs -n 1 basename

P161.bam
P162.bam
P201.bam
P202.bam
P241.bam
P242.bam
P481.bam
P482.bam
S1.bam
S2.bam
S3.bam
S4.bam
S5.bam
S6.bam
S7.bam
S8.bam
S9.bam
A1.bam
A2.bam
A3.bam
C1.bam
C2.bam
C3.bam
F1.bam
F2.bam
F3.bam
FB1.bam
FB2.bam
FB3.bam
FI1.bam
FI2.bam
FI3.bam
FIB1.bam
FIB2.bam
FIB3.bam
H1.bam
H2.bam
H3.bam
HB1.bam
HB2.bam
HB3.bam
M1.bam
M2.bam
M3.bam
MB1.bam
MB2.bam
MB3.bam
S4H1.bam
S4H2.bam
S4H3.bam
S7d1.bam
S7d2.bam
S7d3.bam
S8H1.bam
S8H2.bam
S8H3.bam
SRR14908223.bam
SRR14908224.bam
SRR14908225.bam
SRR14908226.bam
SRR14908227.bam
SRR14908228.bam
SRR14908229.bam
SRR14908230.bam
SRR14908231.bam
SRR14908232.bam
SRR14908233.bam
SRR14908234.bam
SRR14908235.bam
SRR14908236.bam
MBE_103B_A_index2_CGATGT_L008_merged.bam
MBE_103B_D_index7_CAGATC_L008_merged.bam
MBEHB00058a_index1_ATCACG_L007_merged.bam
MBEHB00058b_index2_CGATGT_L007_merged.bam
MBEHB00058c_index3_TTAGGC_L007_merged.bam
MBEHB00058d_index4_TGACCA_L007_merged.bam
MBEHB00058e_index5_ACAGTG_L007_merged.bam
MBEHB00058f_index6_GCCAAT_L007_merged.ba
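
A per-directory count is a quick way to check that the listing matches the number of samples stated in the text. The sketch below is a minimal illustration, not part of the original notebook; `count_bams` is a hypothetical helper name, and it only counts files ending in `.bam` directly inside each directory passed to it.

```shell
#!/bin/bash
# Hypothetical helper: print "<directory><TAB><number of .bam files>" for each argument.
count_bams() {
    local dir n
    for dir in "$@"; do
        # find does not fail when nothing matches, unlike ls with a bare glob
        n=$(find "$dir" -maxdepth 1 -name '*.bam' | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$dir" "$n"
    done
}
```

Calling `count_bams` with the four BAM directories used in the cell above prints one line per directory; the per-directory counts should add up to the total number of libraries.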

### 3. Obtaining and processing Phaw reference gene annotation *phaw_30* 
##### The *phaw_30* reference transcriptome was downloaded with `wget https://research.janelia.org/pavlopoulos/fa/phaw.3.0.genes.nuc.fa`.

##### In addition, the reference Phaw gene annotation *phaw_30* (https://research.janelia.org/pavlopoulos/fa/phaw.3.0.genes.gff) was included.
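
Both reference files can be fetched in one step. The following is a minimal sketch rather than the command actually run for this notebook; `fetch_phaw30` and the destination directory argument are hypothetical, while the two URLs are the ones given above.

```shell
#!/bin/bash
# Base URL taken from the links in the text
base_url="https://research.janelia.org/pavlopoulos/fa"

# Hypothetical helper: download the phaw_30 sequences and annotation into a directory.
fetch_phaw30() {
    local dest=${1:-.} f
    mkdir -p "$dest"
    for f in phaw.3.0.genes.nuc.fa phaw.3.0.genes.gff; do
        # -nc skips files that are already present; -P sets the target directory
        wget -nc -P "$dest" "${base_url}/${f}"
    done
}
```

For example, `fetch_phaw30 ./phaw_reference` would place both files under `./phaw_reference`, a directory name chosen here only for illustration.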

##### Build gmap genome index for Phaw5.1 assembly

In [23]:
#%%bash
#gmap_build -D /drives/raid/AboobakerLab/manuel/data_phaw_analysis/phaw_reference/gmap_index/phaw_sambaTGSAsm \
#-d phaw_sambaTGSAsm \
#/drives/ssd1/manuel/phaw/2022_analysis/phaw_gapfilling/TGS-GapCloser_anlysis/results_sambaAsm/change_scafNames/phaw_sambaAsm.scaff_seqs_editedScafNames.fa

##### Map phaw3.0 transcriptome to current genome assembly

In [25]:
%%bash
#Map reference transcriptome to current genome assembly
#nohup /drives/ssd1/manuel/phaw/2022_analysis/annotation/protein_coding_genes/gmap/run_gmap.sh  > gmap.out 2>&1 & 
cat ./scripts/run_gmap.sh

#!/bin/bash 
#Map reference transcriptome to current genome assembly. Output gff file

gmap -D /drives/raid/AboobakerLab/manuel/data_phaw_analysis/phaw_reference/gmap_index/phaw_sambaTGSAsm \
-d phaw_sambaTGSAsm -f 2 /drives/ssd1/manuel/phaw/2021_analysis/annotation/gmap/phaw.3.0.genes.nuc.fa -t 30 > phaw.3.0.genes.nuc.fa.gmap.gff


In [27]:
##### Convert .gff to .gtf format
#agat_convert_sp_gff2gtf.pl --gff phaw.3.0.genes.nuc.fa.gmap.gff -o phaw.3.0.genes.nuc.fa.gmap.gtf
#mv phaw.3.0.genes.nuc.fa.gmap.gtf /drives/ssd1/manuel/phaw/2022_analysis/annotation/protein_coding_genes/stringtie2/gtf_files
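
After the conversion it can be useful to tally the feature types in the resulting file, to confirm that transcript and exon records are present before the file is used downstream. This is a small sketch assuming a standard nine-column, tab-separated GTF; `feature_counts` is a hypothetical helper and not part of the original pipeline.

```shell
#!/bin/bash
# Hypothetical helper: count records per feature type (column 3) in a GTF/GFF file.
feature_counts() {
    # Ignore comment lines and malformed rows, tally column 3, sort for stable output
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ } END { for (t in n) print t "\t" n[t] }' "$1" | sort
}
```

Running `feature_counts phaw.3.0.genes.nuc.fa.gmap.gtf` prints one line per feature type with its count.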

### 4. RNA-seq data alignment

##### A HISAT2 index was built for alignment

In [9]:
#%%bash 
#cd /drives/raid/AboobakerLab/manuel/data_phaw_analysis/asm_gap_filling/TGS-GapCloser_anlysis/sambaAsm_analysis/asm_editedNames/hisat_index/
#nohup hisat2-build /drives/ssd1/manuel/phaw/2022_analysis/phaw_gapfilling/TGS-GapCloser_anlysis/results_sambaAsm/change_scafNames/phaw_sambaAsm.scaff_seqs_editedScafNames.fa \
#phaw_sambaAsm &

##### All RNA-seq libraries were aligned to the Parhyale genome with HISAT2:

In [8]:
%%bash
#cd /drives/ssd1/manuel/phaw/2022_analysis/annotation/protein_coding_genes/
less ./scripts/align_rna.sh
#nohup ./align_rna.sh totalRNA_embryos > totalRNA_embryos_phaw51.out 2>&1 & 

#!/bin/bash 
if [ "$1" == "totalRNA_embryos" ]; then

    in=/drives/raid/AboobakerLab/data/rna_data/paired/total_082022/X204SC22071455-Z01-F001/ #raw .fastq data
    out=/drives/raid/AboobakerLab/data/bams/phaw/hisat/bams_total_082022 #Final bam files for each library
    fastq=/drives/raid/AboobakerLab/data/rna_data/paired/total_082022/trim #trimmed data after Trimmomatic
    adapters=/drives/raid/AboobakerLab/software/Trimmomatic-0.39/adapters/TruSeq2-PE.fa 
    
    paired_files_1=$(ls -R $in | grep 1.fq.gz)
        
    for i in $paired_files_1; do
        f=$(basename $i _1.fq.gz)
        echo "Running on library $f"

    java -jar /drives/raid/AboobakerLab/software/Trimmomatic-0.39/trimmomatic-0.39.jar PE ${in}/01.RawData/${f}/${f}_1.fq.gz ${in}/01.RawData/${f}/${f}_2.fq.gz ${fastq}/${f}_1_out_paired.fq ${fastq}/${f}_1_out_unpaired.fq ${fastq}/${f}_2_out_paired.fq ${fastq}/${f}_2_out_unpaired.fq -threads 16 ILLUMINACLIP:${adapters}:2:30:10 SLIDINGWINDOW:4:20 LEADING:10 TRAILING
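
The listing shows the trimming step only. As an illustration of how trimmed read pairs are commonly aligned with HISAT2 and sorted into a BAM file, here is a hedged sketch; it is not the command from `align_rna.sh`. The helper name `align_pair`, the `RUN` switch, the thread counts and the choice of `--dta` (HISAT2's option for alignments intended for transcript assemblers) are all assumptions made for this example. The index name `phaw_sambaAsm` and the `_out_paired.fq` file naming follow the cells above.

```shell
#!/bin/bash
# Hypothetical helper: align one trimmed read pair with HISAT2 and write a sorted BAM.
# By default the command is only printed; set RUN=1 to execute it.
align_pair() {
    local index=$1 r1=$2 r2=$3 out_bam=$4 threads=${5:-16}
    if [ "${RUN:-0}" = "1" ]; then
        # Stream HISAT2 SAM output straight into samtools to avoid an intermediate file
        hisat2 --dta -p "$threads" -x "$index" -1 "$r1" -2 "$r2" \
            | samtools sort -@ 4 -o "$out_bam" -
    else
        echo "hisat2 --dta -p $threads -x $index -1 $r1 -2 $r2 | samtools sort -@ 4 -o $out_bam -"
    fi
}
```

A dry run such as `align_pair phaw_sambaAsm lib_1_out_paired.fq lib_2_out_paired.fq lib.bam` prints the command for inspection, which helps when checking paths before launching a long job.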

### 5. Transcript assembly and merge

#### The assembly process used StringTie2. First, the RNA-seq reads were assembled into transcripts for each sample.

In [34]:
%%bash
#nohup ../run_stringtie2.sh assembly >> st2.out &
less ./scripts/run_stringtie2.sh

#!/bin/bash
#in=/drives/raid/AboobakerLab/data/bams/phaw/hisat/bams_05_2022/Aboobaker_embryo_data/polyA/*bam 
#in=/drives/raid/AboobakerLab/data/bams/phaw/hisat/bams_05_2022/Aboobaker_immune_data/*bam 
#in=/drives/raid/AboobakerLab/data/bams/phaw/hisat/bams_05_2022/Manchester_embryo_data/*bam 
in=/drives/raid/AboobakerLab/data/bams/phaw/hisat/bams_05_2022/Patel_embryo_data/merged_bams/*bam 
out=/drives/ssd1/manuel/phaw/2022_analysis/annotation/protein_coding_genes/stringtie2
bam_files=$(ls $in)

for i in $bam_files; do
    f=$(basename $i .bam)
    echo "Running on library $f"
    ~/software/stringtie-2.1.6/stringtie ${i} --conservative -o ${out}/gtf_files/${f}_ST2.gtf -p 48
done
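
Because the script is run once per input directory, it is easy to lose track of a library whose assembly failed. The sketch below lists BAM files that have no non-empty `<sample>_ST2.gtf` output, following the naming used in the loop above; `missing_gtfs` is a hypothetical helper added for illustration only.

```shell
#!/bin/bash
# Hypothetical helper: print sample names whose StringTie output is missing or empty.
missing_gtfs() {
    local bam_dir=$1 gtf_dir=$2 bam f
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue          # the glob matched nothing
        f=$(basename "$bam" .bam)
        # -s is true only for files that exist and are not empty
        [ -s "${gtf_dir}/${f}_ST2.gtf" ] || echo "$f"
    done
}
```

An empty result from `missing_gtfs <bam_dir> <gtf_dir>` means every library in that directory produced an annotation.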


#### Then, all sample-specific annotations were merged into one final annotation.

In [36]:
%%bash
#nohup ../run_stringtie2_merge.sh assembly >> st2.out &
less ./scripts/run_stringtie2_merge.sh

#!/bin/bash
in=/drives/ssd1/manuel/phaw/2022_analysis/annotation/protein_coding_genes/stringtie2/gtf_files
output=/drives/ssd1/manuel/phaw/2022_analysis/annotation/protein_coding_genes/stringtie2/stringtie2_merge/stringtie2_merged.gtf
gtf_files=`readlink -f ${in}/*.gtf`

~/software/stringtie-2.1.6/stringtie --merge -o ${output} -c 10 -T 0.001 -F 0.001 -i ${gtf_files}
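
A short summary of the merged annotation gives a first impression of the result. This sketch counts `transcript` records and distinct `gene_id` values in a GTF file; `merge_summary` is a hypothetical helper, and it assumes the attribute column uses the usual `gene_id "..."` form.

```shell
#!/bin/bash
# Hypothetical helper: report the number of genes and transcripts in a GTF file.
merge_summary() {
    awk -F'\t' '
        !/^#/ && $3 == "transcript" {
            tx++
            # Collect each distinct gene_id attribute once
            if (match($9, /gene_id "[^"]+"/)) genes[substr($9, RSTART, RLENGTH)] = 1
        }
        END {
            n = 0
            for (g in genes) n++
            printf "genes\t%d\ntranscripts\t%d\n", n, tx
        }' "$1"
}
```

For instance, `merge_summary stringtie2_merged.gtf` prints two tab-separated lines, one for genes and one for transcripts.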


# FINISHED