# Overview
This notebook runs the expression analysis of the imparalogs of Schistosoma mansoni. Before running it, the user needs to edit the following variables, located at the beginning of the script:

- Illumina_adapters: Location of a FASTA file with the Illumina adapters to be used
- run_interpro_sequence_dir: Directory with the imparalog sequences [check in your script where it is located]
- interprot_path: Path to the InterProScan installation
- interpro_ID_info: File with information on the InterPro IDs in interpro.xml
- schisto_genome_fasta: S. mansoni genome in fasta format (schistosoma_mansoni.PRJEA36577.WBPS18.genomic.fa)
- schisto_gff_file: S. mansoni genome annotation in GFF format (schistosoma_mansoni.PRJEA36577.WBPS18.annotations.gff3)
- schisto_trans: S. mansoni genome transcript sequences in fasta format (schistosoma_mansoni.PRJEA36577.WBPS18.mRNA_transcripts.fa)
- eggnog_annotation: File with the results of the eggNOG annotation [check in your script where it is located]
- threads: Number of threads to use when possible.
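
Since most of these variables are file or directory locations, it can help to confirm that each one exists before launching the pipeline. The following is a minimal sketch of such a check, not part of the original scripts: the function `missing_paths` is hypothetical, and every value in `settings` is a placeholder that must be replaced with the real location on your system.

```python
from pathlib import Path


def missing_paths(settings):
    """Return the names of the settings whose path does not exist on disk."""
    return [name for name, path in settings.items() if not Path(path).exists()]


# Placeholder values (assumptions): replace each with the real location.
# "threads" is left out because it is a number, not a path.
settings = {
    "Illumina_adapters": "/path/to/adapters.fa",
    "run_interpro_sequence_dir": "/path/to/imparalog_sequences",
    "interprot_path": "/path/to/interproscan",
    "interpro_ID_info": "/path/to/interpro_id_info",
    "schisto_genome_fasta": "/path/to/genome.fa",
    "schisto_gff_file": "/path/to/annotation.gff3",
    "schisto_trans": "/path/to/mRNA_transcripts.fa",
    "eggnog_annotation": "/path/to/eggnog_results",
}

for name in missing_paths(settings):
    print(f"{name}: path not found, please edit this variable")
```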

Next come the control variables, each of which runs the corresponding analysis when set to TRUE. Read mapping and counting are configured for the three datasets under the convention [Dataset Prefix] “_” [Analysis]. The prefixes are “Schisto”, “Schisto_Protasio” and “Schisto_Sanger” for the datasets “Wangwiwatsin et al. 2020”, “Protasio et al. 2012” and “Wellcome Sanger Institute”, respectively. The analysis suffixes are “TRIM”, “STAR”, “COUNT” and “MAKER” for read trimming, mapping, counting and count table re-arrangement, respectively. E.g.: Schisto_Sanger_TRIM
Before these blocks are run, a folder containing the SRA data of each dataset must be created. The name of each dataset must be given by defining the variable “species_name” in each block of code (14 instances in total).
The blocks Smansoni_Genome_Location and Smansoni_Distance_groups generate inputs for the R scripts and need to be run only once.
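
The naming convention above can be enumerated programmatically to see every control variable it implies. This is only an illustrative sketch: the dictionaries and the function `control_variables` are hypothetical helpers, and the pairing of prefixes with datasets assumes that the two lists in the text are given in the same order.

```python
from itertools import product

# Dataset prefixes and the dataset each one refers to (order assumed from the text).
PREFIXES = {
    "Schisto": "Wangwiwatsin et al. 2020",
    "Schisto_Protasio": "Protasio et al. 2012",
    "Schisto_Sanger": "Wellcome Sanger Institute",
}

# Analysis suffixes and the step each one controls.
ANALYSES = {
    "TRIM": "read trimming",
    "STAR": "mapping",
    "COUNT": "counting",
    "MAKER": "count table re-arrangement",
}


def control_variables():
    """Build every name of the form [Dataset Prefix]_[Analysis]."""
    return [f"{prefix}_{analysis}" for prefix, analysis in product(PREFIXES, ANALYSES)]


for name in control_variables():
    print(name)
```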

# Differential expression
This script calculates the differential expression for each dataset. Before running it, the lines containing setwd need to be updated for each dataset with the path to the full counts table (“*_all_counts.tab”):
- Line 3: “Wangwiwatsin et al. 2020”
- Line 139: “Protasio et al. 2012”
- Line 203: “Wellcome Sanger Institute”
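
If the script has been edited and the line numbers have shifted, the setwd calls can be located by searching the script text. The sketch below is an assumption-laden helper rather than part of the pipeline: `find_setwd_lines` is a hypothetical function, it only matches lines that begin with `setwd(`, and the sample text is invented for the demonstration.

```python
import re

# Matches lines that start (after optional indentation) with a setwd( call.
# Commented-out calls such as "# setwd(...)" are skipped on purpose.
SETWD = re.compile(r"^\s*setwd\s*\(")


def find_setwd_lines(script_text):
    """Return (line_number, stripped_line) for each line starting with setwd(."""
    return [
        (number, line.strip())
        for number, line in enumerate(script_text.splitlines(), start=1)
        if SETWD.match(line)
    ]


# Invented sample text; for the real script, read its contents instead, e.g.
# open("./rnaseq_scripts/001_Expresion_Dif.R").read()
sample = 'library(stats)\nsetwd("/placeholder/dataset_1")\n# setwd("/old/path")\n'
for number, line in find_setwd_lines(sample):
    print(number, line)
```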

In [None]:
%system Rscript ./rnaseq_scripts/001_Expresion_Dif.R

# Normalization and correlations

This script computes the correlations based on expression levels. Before running it, the working directory appropriate for each dataset must be set. It requires the following files:

1. The count table of each dataset (“*_all_counts.tab”)
2. The imparalogs of S. mansoni, in the format “Imparalog” “tab” “gene IDs, separated by a comma and a space” (“*_grupos.tab”)
3. A file with the group of each sample, in the format “ID_Sample” “tab” “group” (“*_grupos.tab”). For convenience, the groups must be numbered in the desired order as “[number]_[sample_group]”

The working directory must be set for each dataset before running.
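
To make the two tab-separated formats described above concrete, here is a minimal parsing sketch. It is not taken from the R scripts: the function names, the sample identifiers and the assumption that the files have no header line are all hypothetical.

```python
def parse_imparalogs(text):
    """Map each imparalog name to its list of gene IDs.

    Expects one line per imparalog: name, a tab, then gene IDs
    separated by a comma and a space. No header line is assumed.
    """
    groups = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, genes = line.split("\t", 1)
        groups[name] = [gene.strip() for gene in genes.split(",") if gene.strip()]
    return groups


def parse_sample_groups(text):
    """Map each sample ID to its group label (sample ID, a tab, then group)."""
    samples = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        sample_id, group = line.split("\t", 1)
        samples[sample_id] = group.strip()
    return samples


def group_order(samples):
    """Sort the distinct group labels by their numeric prefix."""
    return sorted(set(samples.values()), key=lambda label: int(label.split("_", 1)[0]))


# Invented example data, for illustration only.
imparalogs = parse_imparalogs("imp1\tgeneA, geneB\nimp2\tgeneC, geneD, geneE\n")
samples = parse_sample_groups("s1\t2_late\ns2\t1_early\ns3\t10_final\n")
print(imparalogs)
print(group_order(samples))
```

Sorting on the integer prefix rather than on the raw label keeps a group such as “10_final” after “2_late”, which a plain alphabetical sort would not.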

In [None]:
%system Rscript ./rnaseq_scripts/002_Norm_y_Correlaciones.R

# Calculate correlations based on distances
This script calculates the correlations based on genomic distances. In addition to all the previous input files, it requires a file named “distance_corr.txt” with the information on each gene pair in the paralogue. This file is generated in the block “Smansoni_Distance_groups” of the main script.

In [None]:
%system Rscript ./rnaseq_scripts/003_Norm_y_Correlaciones_grupos_distancia.R

# Heatmaps for InterPro domains
This script generates the heatmaps for the InterPro domains observed within the imparalogs of the studied species. Before running it, the working directory needs to be set to the location of the table generated in the block INTERPRO_DOMAINS_PRE_HEATMAP.

In [None]:
%system Rscript ./rnaseq_scripts/004_heatmap_domains.R