-
Notifications
You must be signed in to change notification settings - Fork 2
/
extdata.Rmd
67 lines (55 loc) · 3.05 KB
/
extdata.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
---
title: "External data"
output: html_document
---
# AllPeptides.psmtsv and AllQuantifiedPeptides.tsv files
`AllPeptides.psmtsv` file collects for each identified peptide the following information:
* 'Base Sequence': a character string indicating the peptide name/id/amino acids sequence;
* 'Decoy/Contaminant/Target': a character string indicating if the peptide is a decoy, contaminant or, target;
* 'PSM Count (unambiguous, <0.01 q-value)': a numeric variable indicating how many times the peptide was identified (peptide abundance);
* 'Protein Accession': a character string indicating the isoform(s) name the peptide maps to;
* 'QValue': a numeric variable indicating the false discovery rate associated to the peptide;
* 'PEP': a numeric variable indicating the Peptide Error Probability.
`AllQuantifiedPeptides.tsv` lists for each identified peptide the base sequence and several peptide intensities.
Below we report the script used to create `AllPeptides.psmtsv` and `AllQuantifiedPeptides.tsv`, both available inside IsoBayes package.
```{r, eval=FALSE}
data_path = "/path/to/AllPeptides.psmtsv"
df = as.data.frame(data.table::fread(data_path))
# Keep useful columns for our model:
df = df[, c("Base Sequence", "Decoy/Contaminant/Target", "PSM Count (unambiguous, <0.01 q-value)", "Protein Accession", "QValue", "PEP")]
# Discard peptides with qvalue > 0.1
df = df[df$QValue < 0.1, ]
data.table::fwrite(df, "AllPeptides.psmtsv", sep = "\t")
```
```{r, eval=FALSE}
data_path = "/path/to/AllQuantifiedPeptides.tsv"
df = as.data.frame(data.table::fread(data_path))
# Keep useful columns for our model:
sel_intensities = grep("Intensity_", colnames(df))
df = df[, c("Base Sequence", colnames(df)[sel_intensities])]
data.table::fwrite(df, "AllQuantifiedPeptides.tsv", sep = "\t")
```
# mRNA data integration: jurkat_isoform_kallisto.tsv
To reduce the size of jurkat_isoform_kallisto.tsv, we take only those isoforms that are present in the peptide dataset.
```{r, eval=FALSE}
data_path = "/path/to/AllPeptides.psmtsv"
tpm_path = "/path/to/tpm.tsv"
df = as.data.frame(data.table::fread(data_path))
tpm = as.data.frame(data.table::fread(tpm_path))
# extract isoform names
iso_names_prot = unique(unlist(lapply(df$`Protein Accession`, function(x){strsplit(x, "\\|")})))
# keep only those transcript isoforms that are present in the peptide dataset
tpm = merge(tpm, as.data.frame(iso_names_prot), by.x = "isoname", by.y = "iso_names_prot")
data.table::fwrite(tpm, "jurkat_isoform_kallisto.tsv", sep = "\t")
```
# Mapping protein isofors to gene: map_iso_gene.csv
```{r, eval=FALSE}
library(IsoBayes)
data_dir = system.file("extdata", package = "IsoBayes")
path_to_peptides_psm = paste0(data_dir, "/AllPeptides.psmtsv")
f = as.data.frame(data.table::fread(path_to_peptides_psm))
iso_names_prot = unique(unlist(lapply(df$`Protein Accession`, function(x){strsplit(x, "\\|")})))
gene_name = unlist(lapply(as.list(iso_names_prot), function(x){strsplit(x, "-")[[1]][1]}))
map_iso_gene = data.frame(iso_names_prot, gene_name)
data.table::fwrite(map_iso_gene, file = "map_iso_gene.csv", col.names = FALSE, sep = ",")
```