# IDENTIFICATION OF NOVEL CLASSES OF NEOANTIGENS IN CANCER | Data preprocessing

In [None]:
%load_ext rpy2.ipython

## 0. Data preparation

This first cell should be modified according to the data to be used. The pipeline only supports datasets with paired normal and tumor samples for each patient.

The **PROJECT** variable should be changed according to the GEO identifier.

From the GEO website, the *SRR_Acc_List.txt* and *SraRunTable.txt* files should be manually downloaded and saved in a directory. This directory should be specified in the **SRR** variable.

The pipeline is designed so that the most computationally expensive programs run on a cluster. 
In this case, a Gluster File System has been used. The code that runs on the cluster may need to be adapted to other setups.

In [None]:
import os,re,shutil,glob,openpyxl
import pandas as pd
from Bio import SeqIO
from gtfparse import read_gtf
from matplotlib_venn import venn2, venn2_circles, venn2_unweighted
from matplotlib import pyplot as plt
from IPython.display import Image

PROJECT="GSE193567"

DIR=os.path.join("data",PROJECT)

# path where all the intermediate steps and outputs of the pipeline are stored
os.makedirs(DIR, exist_ok=True)
    
CLUSTERDIR="/users/genomics/marta" #path where to run and store things that run in a cluster
SRR="/projects_eg/datasets/"+PROJECT # path where SRR_Acc_List.txt and SraRunTable.txt are stored. It should be inside a folder named with GEO accession
SRR_ACC=os.path.join(SRR,"SRR_Acc_List.txt") 
SRA=os.path.join(SRR,"SraRunTable.txt")

FASTQDIR=os.path.join(DIR,"fastq_files") #path where to store fastq files
os.makedirs(FASTQDIR, exist_ok=True)
    
shutil.copy(SRR_ACC, os.path.join(FASTQDIR,"SRR_Acc_List.txt"))
shutil.copy(SRA, os.path.join(FASTQDIR,"SraRunTable.txt"))

GENOMEDIR="genomes"

# create each output directory independently, so an existing one does not block the others
for subdir in ("analysis", "results"):
    os.makedirs(os.path.join(DIR, subdir), exist_ok=True)
#os.makedirs(os.path.join(DIR,"scripts"))



In [None]:
%%R

require(tidyr)
require(dplyr)
require(rtracklayer)
#library(purrr)
require(ggplot2)
require(RColorBrewer)
require(devtools)
require(stringr)
require(edgeR)

Get a three-column CSV file with patient_id, normal_id and tumor_id for later use. This code may need to be adjusted according to the metadata available for the dataset used. Only the columns with the patient id and the sample id are considered.

In [None]:
metadata = pd.read_csv(os.path.join(FASTQDIR,"SraRunTable.txt"))
metadata = metadata[['Run','Individual','tissue']]

normal = metadata[metadata['tissue'] == "non-tumor"]
normal = normal[['Individual','Run']]

tumor = metadata[metadata['tissue'] == "tumor"]
tumor = tumor[['Individual','Run']].rename(columns ={'Run' : 'Run_t'})

patients = pd.merge(normal, tumor, on=['Individual'])
patients['Individual'] = patients['Individual'].str.split(' ').str[1]
patients.to_csv(os.path.join(DIR,"results/patients.csv"),index=False, header=False)
patients_summary = os.path.join(DIR,"results/patients.csv")

patients_id=list(patients.iloc[:,0])
normal_id=list(patients.iloc[:,1])
tumor_id=list(patients.iloc[:,2])

patients
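
The pairing step above can also be written as a small function that reports patients lacking a normal or tumor run. This is only a sketch on invented toy metadata: the function name `pair_samples`, the toy accessions and the `validate="one_to_one"` check are assumptions added for illustration, and the column names follow the `SraRunTable.txt` columns used in this notebook.

```python
import pandas as pd

def pair_samples(metadata, normal_label="non-tumor", tumor_label="tumor"):
    """Return (paired table, list of patients without a complete pair)."""
    normal = metadata.loc[metadata["tissue"] == normal_label, ["Individual", "Run"]]
    tumor = metadata.loc[metadata["tissue"] == tumor_label, ["Individual", "Run"]]
    tumor = tumor.rename(columns={"Run": "Run_t"})
    # validate="one_to_one" raises MergeError if a patient has several normal or tumor runs
    paired = pd.merge(normal, tumor, on="Individual", how="inner", validate="one_to_one")
    unpaired = sorted(set(metadata["Individual"]) - set(paired["Individual"]))
    return paired, unpaired

# Toy metadata with invented accessions, for illustration only
toy = pd.DataFrame({
    "Run": ["SRR0000001", "SRR0000002", "SRR0000003"],
    "Individual": ["patient 1", "patient 1", "patient 2"],
    "tissue": ["non-tumor", "tumor", "tumor"],
})
paired, unpaired = pair_samples(toy)
print(paired)
print("Patients without a complete pair:", unpaired)
```

Listing the unpaired patients makes it visible when the inner merge silently drops samples.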

## 01. Download data


In [None]:
%%bash -s "$PROJECT" "$CLUSTERDIR" "$FASTQDIR" "$DIR"

######################################DONE IN CLUSTER###############################################

sbatch $4/scripts/0_data_preprocessing/loop_dwnl.sh $3 $1 $2

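To check the integrity of the downloaded files we use `md5sum`. Note that the checksums are computed from the local files themselves, so this check only confirms that the files are readable and unchanged since the checksums were recorded; it does not compare them against reference checksums from the archive.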

In [None]:
%%bash -s "$FASTQDIR"

for file in $1/*gz; do 
md5sum $file >> $1/md5sum.txt
done

md5sum --check $1/md5sum.txt >> $1/md5sum_check.txt

Make a summary file with the raw read counts

In [None]:
%%bash -s "$PROJECT" "$FASTQDIR" "$DIR"

# OUT must point to the summary file itself so that a previous copy is removed before appending
OUT=$3/results/raw_counts.txt
if [ -f "$OUT" ] ; then
    rm "$OUT"
fi
for file in $2/*_1*gz; do
    echo ${file##*/} >> $OUT
    zgrep '^@SRR' $file | wc -l >> $OUT
done    
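
Counting header lines that start with `@SRR` works for SRA downloads, but a quality line can in principle also begin with `@SRR`, since `@` is a valid quality character. As a cross-check, the records can be counted as the number of lines divided by four. This is a sketch that assumes unwrapped four-line FASTQ records; the function name is hypothetical.

```python
import gzip

def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

The function raises an error on a line count that is not a multiple of four, which usually indicates a truncated file.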

## 02. FastQScreen

FastQ Screen allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect. 

https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/

In [None]:
%%bash -s "$DIR"

mkdir -p $1/analysis/02_fastqscreen

In [None]:
%%bash -s "$PROJECT" "$CLUSTERDIR" "$FASTQDIR" "$DIR" "$GENOMEDIR"

######################################DONE IN CLUSTER###############################################

sbatch $4/scripts/0_data_preprocessing/loop_fastqscreen.sh $1 $2 $3 $4 $5

Generate a summary multiqc file 

In [None]:
%%bash -s "$DIR"

cd $1/analysis/02_fastqscreen

multiqc .

## 03. FastQC

Check quality of the data

In [None]:
%%bash -s "$DIR"

mkdir -p $1/analysis/03_fastqc

In [None]:
%%bash -s "$PROJECT" "$CLUSTERDIR" "$FASTQDIR" "$DIR"

######################################DONE IN CLUSTER###############################################

sbatch $4/scripts/0_data_preprocessing/loop_fastqc.sh $1 $2 $3 $4

Generate a summary multiqc file 

In [None]:
%%bash -s "$DIR"

cd $1/analysis/03_fastqc

multiqc .

## 04. Remove adapters

In [None]:
%%bash -s "$DIR"

mkdir -p $1/analysis/04_cutadapt

In [None]:
%%bash -s "$PROJECT" "$CLUSTERDIR" "$FASTQDIR" "$DIR" "$GENOMEDIR"

######################################DONE IN CLUSTER###############################################

sbatch $4/scripts/0_data_preprocessing/loop_cutadapt.sh $1 $2 $3 $4

Make a summary file with the read counts after trimming

In [None]:
%%bash -s "$DIR"

OUT=$1/results/trimmed_counts.txt
if [ -f "$OUT" ] ; then
    rm "$OUT"
fi
for file in $1/analysis/04_cutadapt/*_1*gz; do
    echo ${file##*/} >> $OUT
    zgrep '^@SRR' $file | wc -l >> $OUT
done    
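
The raw and trimmed count files both alternate a file name line and a count line, so the fraction of reads kept after trimming can be derived from them. The sketch below keys each sample by its SRR accession; the file names and counts are invented, and the trimmed file naming is an assumption.

```python
import re

def read_counts(text):
    """Parse alternating 'file name' / 'count' lines into {SRR accession: count}."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    counts = {}
    for name, count in zip(lines[0::2], lines[1::2]):
        match = re.search(r"SRR\d+", name)
        key = match.group(0) if match else name
        counts[key] = int(count)
    return counts

def fraction_retained(raw, trimmed):
    """Fraction of raw reads still present after trimming, per sample."""
    return {k: trimmed[k] / raw[k] for k in raw if k in trimmed and raw[k] > 0}

# Invented example contents of raw_counts.txt and trimmed_counts.txt
raw = read_counts("SRR0000001_1.fastq.gz\n1000\nSRR0000002_1.fastq.gz\n2000\n")
trimmed = read_counts("SRR0000001_1_trimmed.fastq.gz\n950\n"
                      "SRR0000002_1_trimmed.fastq.gz\n1500\n")
retained = fraction_retained(raw, trimmed)
print(retained)
```

A sample with an unusually low retained fraction may deserve a closer look in the FastQC reports.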

## FastQC after trimming

In [None]:
%%bash -s "$DIR"

mkdir -p $1/analysis/04_cutadapt/fastqc

In [None]:
%%bash -s "$PROJECT" "$CLUSTERDIR" "$DIR"

######################################DONE IN CLUSTER###############################################

sbatch $3/scripts/0_data_preprocessing/loop_fastqc_trimming.sh $1 $2 $3

Generate a summary multiqc file 

In [None]:
%%bash -s "$DIR"

cd $1/analysis/04_cutadapt/fastqc

multiqc .

## 05. Alignment on *H. sapiens* Genome v.38

The alignment is performed with STAR in 2-pass mode, keeping only uniquely mapped reads.
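
As an illustration of what such a call can look like, the sketch below assembles a STAR command line for one paired-end sample. It is not the content of `loop_STAR.sh`: the exact options used there are not shown in this notebook, so the option set, the paired-end layout, the thread count and all file names here are assumptions. `--twopassMode Basic` enables 2-pass mapping and `--outFilterMultimapNmax 1` discards reads that map to more than one locus.

```python
import shlex

def star_command(index_dir, read1, read2, out_prefix, threads=8):
    """Build a STAR command (as an argument list) for one paired-end sample."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", index_dir,
        "--readFilesIn", read1, read2,
        "--readFilesCommand", "zcat",            # inputs are gzipped FASTQ files
        "--twopassMode", "Basic",                # 2-pass mapping
        "--outFilterMultimapNmax", "1",          # keep uniquely mapped reads only
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]

# Hypothetical file names, for illustration only
cmd = star_command(
    "genomes/Index_Genomes_STAR/Idx_Gencode_v38_hg38_readlength75",
    "SRR0000001_1_trimmed.fastq.gz",
    "SRR0000001_2_trimmed.fastq.gz",
    "uniquely_mapped_2pass_BAM_files/SRR0000001",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.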

First the index(es) must be generated

In [None]:
%%bash -s "$GENOMEDIR"

mkdir -p $1/Index_Genomes_STAR/Idx_Gencode_v38_hg38_readlength75

In [None]:
%%bash -s "$CLUSTERDIR" "$GENOMEDIR" "$DIR"

######################################DONE IN CLUSTER###############################################

sbatch $3/scripts/0_data_preprocessing/index.sh $1 $2

Now indexes are stored in `$GENOMEDIR/Index_Genomes_STAR/Idx_Gencode_v38_hg38_readlength75`

In [None]:
%%bash -s "$DIR"

mkdir -p $1/analysis/05_STAR/uniquely_mapped_2pass_BAM_files

In [None]:
%%bash -s "$PROJECT" "$CLUSTERDIR" "$DIR" "$GENOMEDIR"

######################################DONE IN CLUSTER###############################################

sbatch $3/scripts/0_data_preprocessing/loop_STAR.sh $1 $2 $3 $4

In order to avoid future problems dealing with multimapping reads, we only keep the uniquely mapped ones

Make a summary file with the uniquely mapped reads and the percentage they represent of the whole alignment

In [None]:
%%bash -s "$DIR"

OUT=$1/results/uniquely_mapped_reads.txt
if [ -f "$OUT" ] ; then
    rm "$OUT"
fi
echo -e "Sample\tUniquely_mapped_reads\t%alignment" >> $OUT

for file in $1/analysis/05_STAR/uniquely_mapped_2pass_BAM_files/*Log.final.out; do
    name=${file%%Log*}
    name=${name##*BAM_files/}
    echo -e $name"\t"$(sed '9q;d' $file | awk '{print $6}')"\t"$(sed '10q;d' $file | awk '{print $6}') >> $OUT
done
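
The bash cell above extracts the values by line number (lines 9 and 10 of `Log.final.out`). A variant that looks the values up by their label would not depend on the line positions. The sketch below assumes the `label | value` layout of STAR's `Log.final.out`; the excerpt and its numbers are invented for illustration, and the function name is hypothetical.

```python
def parse_star_log(text):
    """Return (uniquely mapped read count, percentage) from Log.final.out text."""
    stats = {}
    for line in text.splitlines():
        if "|" in line:
            key, value = line.split("|", 1)
            stats[key.strip()] = value.strip()
    n_unique = int(stats["Uniquely mapped reads number"])
    pct_unique = float(stats["Uniquely mapped reads %"].rstrip("%"))
    return n_unique, pct_unique

# Invented excerpt in the layout of a STAR Log.final.out file
sample_log = """\
                          Number of input reads |\t1000
                      Average input read length |\t150
                                    UNIQUE READS:
                   Uniquely mapped reads number |\t900
                        Uniquely mapped reads % |\t90.00%
"""
n_unique, pct_unique = parse_star_log(sample_log)
print(n_unique, pct_unique)
```

Looking up the labels raises a `KeyError` if a field is missing, instead of silently reporting a value from the wrong line.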