## Proteogenomics Searches

To find non-canonical products of the genome, we need to extend the dictionary of protein sequences to be searched, the _sequence database_, with the possible new sequences.

##### _❔ Do we need to keep the canonical protein sequences?_

##### _❔ What can happen if we include too many sequences? Too few?_

## A proteogenomic database

In the _resources_ folder, you will find a [proteogenomic sequence database](https://github.com/GTPB/IBIP21/blob/master/pages/proteogenomics/Proteogenomic_searches/resources/ensembl_final_proteinDB.fasta.gz) kindly provided by Jakub Vašíček. It consists of the human Ensembl peptide database complemented with peptides possibly containing the product of sequence variants identified by [_Wang et al._](https://www.embopress.org/doi/full/10.15252/msb.20188503) by exome and transcriptome sequencing.

##### _❔ If you had to choose, would you prefer exome or transcriptome sequencing?_

##### 👨‍💻  Uncompress the file and inspect its content.
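One possible way to inspect the file without leaving R is sketched below. The helper `read_fasta()` is hypothetical (it is not part of the workshop material), and the local path is an assumption that mirrors the `resources/` path used for the data later in this tutorial.

```r
# Hypothetical helper: read a (possibly gzipped) FASTA file into a named
# character vector, with headers as names and sequences as values.
read_fasta <- function(path) {
    con <- gzfile(path, open = "rt")  # gzfile() also reads uncompressed files
    on.exit(close(con))
    lines <- readLines(con)
    lines <- lines[nzchar(lines)]     # drop empty lines

    is_header <- startsWith(lines, ">")
    n_entries <- sum(is_header)
    entry <- factor(cumsum(is_header), levels = seq_len(n_entries))

    # Concatenate the sequence lines belonging to each header
    sequences <- vapply(
        split(lines[!is_header], entry[!is_header]),
        paste,
        character(1),
        collapse = ""
    )
    names(sequences) <- sub("^>", "", lines[is_header])
    sequences
}

# Assumed location of the downloaded database; adjust to your setup.
fasta_path <- "resources/ensembl_final_proteinDB.fasta.gz"

if (file.exists(fasta_path)) {
    proteins <- read_fasta(fasta_path)
    print(length(proteins))          # number of entries
    print(head(names(proteins)))     # first few headers
    print(summary(nchar(proteins)))  # distribution of sequence lengths
}
```

Reading the whole file into memory is convenient for a database of this size, but a streaming approach may be preferable for much larger files.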

##### _❔ How does this database compare to the UniProt database that you used in the previous workshop in terms of number and type of protein sequences?_

##### 👨‍💻  Look for the sequence of the insulin protein.

##### _❔ How do you interpret the content of the header?_

##### _❔ How many different sequences do you find? What is the difference between them?_

##### _❔ How does the UniProt sequence compare to the Ensembl sequence?_

In the clinic, we are particularly interested in a byproduct of insulin production, the [C-peptide](https://en.wikipedia.org/wiki/C-peptide).

##### 👨‍💻  Find the sequence of the C-peptide.

##### _❔ Can you find this peptide using proteomics?_
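Whether a peptide can be observed depends in part on how proteins are cleaved before mass spectrometry. The sketch below assumes digestion with trypsin (an assumption, as the enzyme is not stated in this section) and uses the common rule of cleaving after K or R unless the next residue is P. Both the helper `digest_trypsin()` and the example sequence are made up for illustration.

```r
# Hypothetical helper: in silico tryptic digestion, cleaving after K or R
# unless the next residue is P. Missed cleavages are not considered.
digest_trypsin <- function(sequence) {
    # Mark every cleavage site with a space, then split on the marks
    marked <- gsub("([KR])(?!P)", "\\1 ", sequence, perl = TRUE)
    strsplit(marked, " ", fixed = TRUE)[[1]]
}

toy_protein <- "MKAAEDLQVGRPLALEGSLQKRGIVEQ"  # made-up sequence, not a real protein
peptides <- digest_trypsin(toy_protein)

# Peptide lengths hint at which products are likely to be identifiable
data.frame(peptide = peptides, length = nchar(peptides))
```

Applying the same function to the sequence you found in the database shows which peptides would cover the region of interest. Very short products, such as single residues, are unlikely to be identified.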

##### 👨‍💻  Look for the natural variants of insulin according to UniProt.

##### _❔ Can we find such variants using proteomics?_

##### _❔ What types of genomics variants do you know? What consequences do they have on proteins?_

##### 👨‍💻  Look for the variant called '1.153003465.C.G' in your database.

##### _❔ How do you interpret the header of the corresponding sequence? How many sequences can be changed depending on the allelic status of this variant?_
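As a starting point for this exercise, the sketch below searches a set of headers for the variant identifier and splits the identifier into its parts. Reading the dotted fields as chromosome, position, reference allele, and alternative allele is an assumption, although it agrees with the later statement that C is the reference and G the alternative allele. The helpers `parse_variant_id()` and `find_variant_headers()` are hypothetical, and the example headers are limited to the two names quoted in this tutorial.

```r
# Hypothetical helper: split an identifier such as '1.153003465.C.G' into
# fields, ASSUMING the order chromosome.position.reference.alternative.
parse_variant_id <- function(id) {
    parts <- strsplit(id, ".", fixed = TRUE)[[1]]
    stopifnot(length(parts) == 4)
    list(
        chromosome = parts[1],
        position = as.integer(parts[2]),
        reference = parts[3],
        alternative = parts[4]
    )
}

# Hypothetical helper: return the headers that mention a variant identifier.
find_variant_headers <- function(headers, id) {
    grep(id, headers, fixed = TRUE, value = TRUE)
}

# Example headers taken from the names quoted in this tutorial
headers <- c(
    "var_1.153003465.C.G_ENST00000295367 stop:169",
    "ENSP00000295367.4"
)

print(find_variant_headers(headers, "1.153003465.C.G"))
str(parse_variant_id("1.153003465.C.G"))
```

With the full database loaded, `headers` would be the complete vector of FASTA headers, and the number of matches gives the number of sequences affected by the variant.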

The variant with identifier 'rs1055935' corresponds to variation at this location of the genome.

##### 👨‍💻  Find this variant in Ensembl.

##### _❔ What alleles can be found at this location of the genome? How often are the different alleles encountered? What influences the likelihood of finding a specific allele? What consequences of carrying the different alleles can you expect in terms of protein sequence and disease?_

##### 👨‍💻  Find the sequence from the transcript 'ENST00000679982' affected by this variant in your database.

##### _❔ What is the difference in the sequence compared to the canonical 'ENST00000679982' sequence?_
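A position-by-position comparison is one simple way to answer this question. The sketch below uses a hypothetical helper, `compare_sequences()`, and made-up toy sequences. It is only meaningful for substitutions and truncations; insertions or deletions would require a proper alignment.

```r
# Hypothetical helper: list the positions at which two protein sequences
# differ. Positions missing from the shorter sequence are reported as NA.
compare_sequences <- function(canonical, variant) {
    a <- strsplit(canonical, "")[[1]]
    b <- strsplit(variant, "")[[1]]

    # Pad the shorter sequence with NA so both have the same length
    n <- max(length(a), length(b))
    length(a) <- n
    length(b) <- n

    differs <- is.na(a) | is.na(b) | a != b
    data.frame(
        position = which(differs),
        canonical = a[differs],
        variant = b[differs],
        stringsAsFactors = FALSE
    )
}

# Toy examples (made-up sequences): a substitution and a truncation
print(compare_sequences("MKTAYIAK", "MKTSYIAK"))
print(compare_sequences("MKTAYIAK", "MKTAY"))
```

In the exercise, the two arguments would be the canonical and the variant sequences retrieved from the database.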


## Proteogenomic Search

We used this database to reprocess a file from [_Wang et al._](https://www.embopress.org/doi/full/10.15252/msb.20188503) using SearchGUI and PeptideShaker, as you did in the previous workshop. The results are in the _resources_ folder.

##### 👨‍💻  Download and unzip the file available at this [link](https://drive.google.com/file/d/1S5plnpm8berIgA5UYUXPKJN4EzRRu3pa/view?usp=sharing) and open it using [PeptideShaker](https://github.com/compomics/peptide-shaker).

##### ❔ _How do the protein results compare to previous searches that you might have conducted?_

##### 👨‍💻  Look for the protein named 'var_1.153003465.C.G_ENST00000295367 stop:169'. (line 90)

##### ❔ _Can you find information supporting that this sample carried the alternative allele (G) for this variant?_

##### 👨‍💻  Click on the yellow box in the 'PI' column.

##### ❔ _How do you interpret this information?_

##### 💬 _How would you map peptides to proteins in proteogenomic searches?_

##### 👨‍💻  Look for the protein named 'ENSP00000295367.4'. (line 7261)

##### ❔ _Can you find information supporting that this sample carried the reference allele (C) for this variant?_

##### ❔ _Can you confidently infer the genotype of the sample based on these proteomics data?_

## Application to Cancer research

### Data set

We will now analyze the non-canonical genomic products identified in breast cancer by [Johansson _et al._](https://www.nature.com/articles/s41467-019-09018-y). The proteogenomic identification results by Johansson _et al._ are reported in Supplementary Data 6, available in the [resources folder](https://github.com/GTPB/IBIP21/blob/master/pages/proteogenomics/Proteogenomic_searches/resources/novel_peptides.gz).

### Libraries

We will need the following libraries. Please make sure that they are installed.

```r
library(tidyr)
library(dplyr)
```


##### 👨‍💻 Load the data in R as in the code below.

```r
# Read the tab-separated table of novel peptides (Supplementary Data 6)
novelPeptidesDF <- read.table(
    file = "resources/novel_peptides.gz",
    header = TRUE,
    sep = "\t",
    comment.char = "",
    quote = "",
    stringsAsFactors = FALSE
)
```

### Genomic context and function

##### 👨‍💻 Find the different classes of peptides represented.

```r
# Count the peptides in each class and sort by decreasing count
classesDF <- as.data.frame(
    table(
        novelPeptidesDF$class
    )
) %>%
    rename(
        class = Var1,
        n_peptides = Freq
    ) %>%
    arrange(
        desc(n_peptides)
    )

print(classesDF)
```

```
                     class n_peptides
1               intergenic        172
2                 intronic         91
3           ncRNA_intronic         22
4             ncRNA_exonic         18
5                   exonic         17
6                     UTR5         16
7              UTR5-exonic         14
8              exonic-UTR5         12
9          intronic-exonic         10
10         exonic-intronic          4
11                upstream          3
12     exonic-ncRNA_exonic          2
13         exonic-splicing          1
14         exonic-upstream          1
15 intergenic-ncRNA_exonic          1
16     ncRNA_exonic-exonic          1
17       splicing-intronic          1
18                    UTR3          1
19           UTR5-upstream          1
```
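To put these counts in perspective, the sketch below expresses each class as a share of all novel peptides. To remain self-contained, it rebuilds the table from the counts printed above; in the workshop session you would apply the same lines to the `classesDF` object directly. Treating hyphenated class names as peptides spanning two annotation categories is an assumption.

```r
# Counts copied from the printed table above
classesDF <- data.frame(
    class = c(
        "intergenic", "intronic", "ncRNA_intronic", "ncRNA_exonic", "exonic",
        "UTR5", "UTR5-exonic", "exonic-UTR5", "intronic-exonic",
        "exonic-intronic", "upstream", "exonic-ncRNA_exonic",
        "exonic-splicing", "exonic-upstream", "intergenic-ncRNA_exonic",
        "ncRNA_exonic-exonic", "splicing-intronic", "UTR3", "UTR5-upstream"
    ),
    n_peptides = c(172, 91, 22, 18, 17, 16, 14, 12, 10, 4, 3, 2, 1, 1, 1, 1, 1, 1, 1),
    stringsAsFactors = FALSE
)

# Share of each class among all novel peptides, in percent
total_peptides <- sum(classesDF$n_peptides)
classesDF$percent <- round(100 * classesDF$n_peptides / total_peptides, 1)

# ASSUMPTION: a hyphen in the class name marks a peptide spanning two categories
classesDF$spans_two <- grepl("-", classesDF$class, fixed = TRUE)
n_spanning <- sum(classesDF$n_peptides[classesDF$spans_two])

print(classesDF)
cat(n_spanning, "of", total_peptides, "peptides have a hyphenated class\n")
```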


##### ❔ _What do these categories represent? Would you expect the same distribution in all samples?_

##### 💬 _Can you speculate on the function or effect of these novel peptides in cancer biology? How could they be used in a clinical setting?_