## IDENTIFICATION OF NOVEL CLASSES OF NEOANTIGENS IN CANCER | Data preprocessing

In [1]:
%load_ext rpy2.ipython

## 0. Data preparation

Modify this first cell to match the dataset being analysed. The pipeline only supports datasets with paired samples per patient (normal and tumor).

Set the **PROJECT** variable to the GEO identifier of the dataset.

Download the *SRR_Acc_List.txt* and *SraRunTable.txt* files manually from the GEO website and save them in a directory. Specify this directory in the **SRR** variable.

The pipeline is designed to run the most computationally expensive programs on a cluster. 
Here, a Gluster File System was used, so the cluster-specific code may need to be adapted to other environments.
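
Before running the preparation cell, it can help to confirm that both GEO files are in place. The following is a minimal sketch, not part of the original pipeline: the helper `missing_geo_inputs` is a hypothetical name, and the example path is only an assumption that mirrors the **SRR** value used below.

```python
import os

def missing_geo_inputs(srr_dir, required=("SRR_Acc_List.txt", "SraRunTable.txt")):
    """Return the required GEO files that are absent from srr_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(srr_dir, name))]

# Example with an assumed path; replace it with the value assigned to SRR.
missing = missing_geo_inputs("/projects_eg/datasets/GSE193567")
if missing:
    print("Missing GEO files:", ", ".join(missing))
```

Checking up front avoids a less informative failure later, when `shutil.copy` is called on a file that does not exist.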

In [None]:
import os,re,shutil,glob,openpyxl
import pandas as pd
from Bio import SeqIO
from gtfparse import read_gtf
from matplotlib_venn import venn2, venn2_circles, venn2_unweighted
from matplotlib import pyplot as plt
from IPython.display import Image

PROJECT="GSE193567"

DIR=os.path.join("data",PROJECT)

try:
    os.makedirs(DIR) #path where to store all the intermediate steps and outputs of the pipeline
except FileExistsError:
    print("Directory for %s already exists" %PROJECT)
    
CLUSTERDIR="/users/genomics/marta" #path where to run and store things that run in a cluster
SRR="/projects_eg/datasets/"+PROJECT # path where SRR_Acc_List.txt and SraRunTable.txt are stored. It should be inside a folder named with GEO accession
SRR_ACC=os.path.join(SRR,"SRR_Acc_List.txt") 
SRA=os.path.join(SRR,"SraRunTable.txt")

FASTQDIR=os.path.join(DIR,"fastq_files") #path where to store fastq files
try:
    os.mkdir(FASTQDIR)
except FileExistsError:
    print("Fastq_files directory exists")
    
shutil.copy(SRR_ACC, os.path.join(FASTQDIR,"SRR_Acc_List.txt"))
shutil.copy(SRA, os.path.join(FASTQDIR,"SraRunTable.txt"))

GENOMEDIR="genomes"

# exist_ok=True ensures "results" is still created when "analysis" already exists
os.makedirs(os.path.join(DIR,"analysis"), exist_ok=True)
os.makedirs(os.path.join(DIR,"results"), exist_ok=True)
#os.makedirs(os.path.join(DIR,"scripts"))



In [None]:
%%R

require(tidyr)
require(dplyr)
require(rtracklayer)
#library(purrr)
require(ggplot2)
require(RColorBrewer)
require(devtools)
require(stringr)
require(edgeR)

Get a three-column file (patient_id, normal_id, tumor_id) for later use.

In [4]:
metadata = pd.read_csv(os.path.join(FASTQDIR,"SraRunTable.txt")) #copy made in the data preparation cell
metadata = metadata[['Run','Individual','tissue']]

normal = metadata[metadata['tissue'] == "non-tumor"]
normal = normal[['Individual','Run']]

tumor = metadata[metadata['tissue'] == "tumor"]
tumor = tumor[['Individual','Run']].rename(columns ={'Run' : 'Run_t'})

patients = pd.merge(normal, tumor, on=['Individual'])
patients['Individual'] = patients['Individual'].str.split(' ').str[1]
patients.to_csv(os.path.join(DIR,"results/patients.csv"),index=False, header=False)
patients_summary = os.path.join(DIR,"results/patients.csv")

patients_id=list(patients.iloc[:,0])
normal_id=list(patients.iloc[:,1])
tumor_id=list(patients.iloc[:,2])

patients

Unnamed: 0,Individual,Run,Run_t
0,10615,SRR17593537,SRR17593538
1,10594,SRR17593539,SRR17593540
2,10584,SRR17593542,SRR17593541
3,10635,SRR17593543,SRR17593544
4,10632,SRR17593546,SRR17593545
5,10628,SRR17593548,SRR17593547
6,10627,SRR17593550,SRR17593549
7,10622,SRR17593551,SRR17593552
8,10619,SRR17593554,SRR17593553
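
Because the pipeline expects exactly one normal and one tumor run per patient, it may be worth checking the pairing explicitly. The sketch below is an illustration on invented toy metadata, not the notebook's own code: `pair_samples` is a hypothetical helper, and the column names and tissue labels are assumed to match those in *SraRunTable.txt* above.

```python
import pandas as pd

def pair_samples(metadata, normal_label="non-tumor", tumor_label="tumor"):
    """Return one row per individual with its normal and tumor run accessions."""
    normal = metadata.loc[metadata["tissue"] == normal_label, ["Individual", "Run"]]
    tumor = metadata.loc[metadata["tissue"] == tumor_label, ["Individual", "Run"]]
    tumor = tumor.rename(columns={"Run": "Run_t"})
    # validate="one_to_one" makes pandas raise MergeError if an individual
    # has more than one normal run or more than one tumor run.
    return pd.merge(normal, tumor, on="Individual", validate="one_to_one")

# Toy metadata (invented accessions, not taken from the dataset).
toy = pd.DataFrame({
    "Run": ["RUN_A", "RUN_B", "RUN_C"],
    "Individual": ["Patient 1", "Patient 1", "Patient 2"],
    "tissue": ["non-tumor", "tumor", "tumor"],
})
pairs = pair_samples(toy)
# Individuals dropped by the inner merge lack one of the two samples.
unpaired = sorted(set(toy["Individual"]) - set(pairs["Individual"]))
print(pairs)
print("Unpaired individuals:", unpaired)
```

The default inner merge silently drops patients that lack one of the two samples, so listing them makes the loss visible.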


## 14. Immunopeptidomes

To look for immunopeptidomics evidence supporting our potential neoantigens, we use several sources:


**Chong et al. 2020.** Method: NewAnce (combination of MaxQuant and Comet). Samples: melanoma and lung. File: Chong_etal_2020_SupData3.xlsx


In [17]:
%%bash -s $DIR

mkdir -p "$1/analysis/14_immunopeptidomics"

In [None]:
chong = pd.read_excel(os.path.join(cwd,"immunopeptidomes_evidences/Chong_etal_2020_SupData3_41467_2020_14968_MOESM5_ESM.xlsx"), skiprows=1)
chong['Transcript_ID'] = chong['Transcript_ID'].str[:-2]
to_compare = chong.Sequence.values.tolist()

folders=['noncanonical_CIPHER','canonical_CDS','translation_evidence_NOCDS']

for f in folders:
    print(f)
    for p in patients_id:
        merged = pd.DataFrame()
        INDIR=DIR+"/analysis/11_PeptideBindingMHC/"+f+"/"+p
        for file in os.listdir(INDIR):
            if file.endswith(".xls"):
                full_file=os.path.join(INDIR,file)
                INFILE=pd.read_csv(full_file, skiprows=1, sep="\t")
                shared = INFILE[INFILE['Peptide'].isin(to_compare)] #predicted peptides that are present in the evidence data
                if len(shared) > 0:
                    print(p,len(shared))


In [24]:
coincidence = total_peptides_counts[total_peptides_counts['peptide'].isin(to_compare)] #how many of the neoantigens from novel genes are already described?
coincidence

Unnamed: 0,peptide,counts


**SPENCER**

In [25]:
spencer = pd.read_csv(os.path.join(cwd,"immunopeptidomes_evidences/SPENCER_Immunogenic_peptide_info.txt"), sep="\t")
spencer_to_compare = spencer.sequence.values.tolist()

folders=['noncanonical_CIPHER','canonical_CDS','translation_evidence_NOCDS']

for f in folders:
    print(f)
    for p in patients_id:
        merged = pd.DataFrame()
        INDIR=DIR+"/analysis/11_PeptideBindingMHC/"+f+"/"+p

        for file in os.listdir(INDIR):
            if file.endswith(".xls"):
                full_file=os.path.join(INDIR,file)
                INFILE=pd.read_csv(full_file, skiprows=1, sep="\t")
                predicted = INFILE.Peptide.values.tolist() #separate name so the Chong to_compare list is not overwritten

                shared = spencer[spencer['sequence'].isin(predicted)] #SPENCER immunogenic peptides present among the predicted peptides
                if len(shared) > 0:
                    print(p,len(shared))



noncanonical_CIPHER
10615 9
10594 11
10584 12
10635 8
10632 6
10628 8
10627 9
10622 6
10619 9
canonical_CDS
translation_evidence_NOCDS
10627 3


Check whether the novel peptides coincide with entries in the databases

In [27]:
coincidence = total_peptides_counts[total_peptides_counts['peptide'].isin(spencer_to_compare)] #how many of the neoantigens from novel genes are already described in SPENCER? (to_compare was overwritten inside the loop above)
coincidence

Unnamed: 0,peptide,counts
