scripts/r-scripts/getting-secreted-data/0005_process_signalp_data.Rmd

---
title: "Untitled"
author: "Ruth Kristianingsih"
date: "10/12/2019"
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding = encoding, output_dir = here::here("reports", stringr::str_remove(getwd(), here::here("scripts/r-scripts/")))) })
output:
  md_document:
    variant: markdown_github
---
# Processing SignalP data


```{r setup, include = FALSE}
# Load libraries
library(tidyverse)
```

## Background

After running SignalP, we need to collect the raw predictions into a single data frame so that the results for all sequences can be read easily. This report also explains how the SignalP pipeline works and which shell scripts it relies on.

## Function

```{r}
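# Build one table that joins organism names, protein sequences, and cleaned
# SignalP predictions by sequence ID.
# `organism` selects the column layout of the predictions file: "bacteria"
# has four columns; any other value expects an extra signal-anchor column.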

get_signalp_predictions <- function(organism_names_path, all_fasta_path, signalp_preds_path, organism) {
  # Organism names
  data_organism_names <- data.table::fread(
    organism_names_path,
    header = FALSE
  ) %>%
    `colnames<-`(c("organism_name", "ID")) %>%
    as_tibble() %>%
    dplyr::select(ID, organism_name)

  # Sequences from FASTA
  data_seqs_raw <- Biostrings::readAAStringSet(all_fasta_path)

  data_seqs <- data.frame(
    ID_raw = names(data_seqs_raw),
    sequence = paste(data_seqs_raw),
    stringsAsFactors = FALSE
  ) %>%
    as_tibble() %>%
    # Keep only the sequence ID, i.e. the header text before " pep "
    tidyr::separate(
      col = ID_raw,
      into = c("ID", "trash"),
      sep = " pep "
    ) %>%
    dplyr::select(-trash)

  # SignalP predictions: the non-bacterial output has an extra
  # signal-anchor column, which is dropped before cleaning
  signalp_colnames <- if (organism == "bacteria") {
    c("ID", "prediction", "signalp_prob", "cleavage")
  } else {
    c("ID", "prediction", "signalp_prob", "signal_anchor", "cleavage")
  }

  data_signalp_preds <- data.table::fread(
    signalp_preds_path,
    header = FALSE
  ) %>%
    `colnames<-`(signalp_colnames) %>%
    as_tibble() %>%
    dplyr::select(ID, prediction, signalp_prob, cleavage) %>%
    # Clean features: strip the fixed text labels from each column
    dplyr::mutate(
      ID = ID %>%
        stringr::str_remove_all(">"),
      prediction = prediction %>%
        stringr::str_remove_all("Prediction: "),
      signalp_prob = signalp_prob %>%
        stringr::str_remove_all("Signal peptide probability: ") %>%
        as.numeric(),
      cleavage = cleavage %>%
        stringr::str_remove_all("Max cleavage site probability: ")
    ) %>%
    # Split "<prob> between pos. <start> and <end>" into three typed columns
    tidyr::separate(
      col = cleavage,
      into = c("cleavage_prob", "position_start", "position_end"),
      sep = " between pos. | and ",
      convert = TRUE
    )

  # Join all data
  data_full_table <-
    full_join(
      data_organism_names,
      data_seqs,
      by = "ID"
    ) %>%
    left_join(
      data_signalp_preds,
      by = "ID"
    ) %>%
    dplyr::filter(!is.na(signalp_prob))

  return(data_full_table)
}
```
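
To make the string cleaning inside `get_signalp_predictions()` concrete, here is a minimal base-R sketch of what happens to a single summary row. The row itself is made up for illustration (the ID and the numbers are hypothetical), and `gsub()`, `sub()` and `strsplit()` stand in for the `stringr` and `tidyr::separate()` calls so that the snippet runs without extra packages.

```r
# Hypothetical summary fields for one sequence, shaped like the four
# columns the function expects in the bacterial predictions file
raw_id         <- ">seq_0001"
raw_prediction <- "Prediction: Signal peptide"
raw_prob       <- "Signal peptide probability: 0.998"
raw_cleavage   <- "Max cleavage site probability: 0.871 between pos. 21 and 22"

# Strip the fixed labels, mirroring the str_remove_all() calls
id           <- gsub(">", "", raw_id, fixed = TRUE)
prediction   <- sub("Prediction: ", "", raw_prediction, fixed = TRUE)
signalp_prob <- as.numeric(
  sub("Signal peptide probability: ", "", raw_prob, fixed = TRUE)
)
cleavage     <- sub("Max cleavage site probability: ", "", raw_cleavage, fixed = TRUE)

# Split the remaining text on the same separator pattern used in separate()
parts <- strsplit(cleavage, " between pos. | and ")[[1]]

parsed <- data.frame(
  ID = id,
  prediction = prediction,
  signalp_prob = signalp_prob,
  cleavage_prob = as.numeric(parts[1]),
  position_start = as.integer(parts[2]),
  position_end = as.integer(parts[3]),
  stringsAsFactors = FALSE
)
parsed
```

The separator is a regular expression with two alternatives, so one split yields the cleavage probability and both positions at once.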

## Pipeline explanation

The following steps produce the input files that `get_signalp_predictions()` reads.

### SignalP predictions

The SignalP predictions file (the `signalp_preds_path` argument) is generated by running the script:
```bash
sh signalp_to_csv.sh gram- bacteria
```

This does the following:
- Split the FASTA files into smaller chunks with AWK so that `signalp-3.0` can process them.
- Use `signalp` to make the predictions.
- Summarise the results with `sed` and `grep`.
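
The chunking idea in the first step can be sketched in R. This is only an illustration: the actual script uses AWK, its chunk size is not stated here, and `split_fasta_lines()` and the toy sequences below are hypothetical.

```r
# Toy FASTA content (hypothetical IDs and sequences)
fasta_lines <- c(
  ">seq1 pep example", "MKKLLA",
  ">seq2 pep example", "MSTNPK", "PQRKTK",
  ">seq3 pep example", "MALWMR"
)

# Hypothetical helper: group FASTA lines into chunks of `n_per_chunk` records,
# never splitting a record across two chunks
split_fasta_lines <- function(lines, n_per_chunk) {
  # Record number of each line: increases by one at every header line
  record_index <- cumsum(startsWith(lines, ">"))
  # Chunk number of each line, derived from its record number
  chunk_index <- (record_index - 1) %/% n_per_chunk + 1
  unname(split(lines, chunk_index))
}

chunks <- split_fasta_lines(fasta_lines, n_per_chunk = 2)
lengths(chunks)
```

Each element of `chunks` could then be written to its own file with `writeLines()` before being passed to the predictor.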

### Sequences with Biostrings

This file is simply all the FASTA files concatenated into a single one, which the function later reads with `Biostrings::readAAStringSet()`. It is created by running:

```bash
sh combine_fasta_files.sh bacteria
```
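
For readers who prefer R, the concatenation can be sketched as follows. The file names and sequences are invented and live in a temporary directory; the shell script above remains the pipeline's actual method.

```r
# Create two small FASTA files in a temporary directory (made-up content)
fasta_dir <- file.path(tempdir(), "fasta_example")
dir.create(fasta_dir, showWarnings = FALSE)
writeLines(c(">seqA pep example", "MKKLLA"), file.path(fasta_dir, "organism_one.fa"))
writeLines(c(">seqB pep example", "MSTNPK"), file.path(fasta_dir, "organism_two.fa"))

# Concatenate every .fa file, in alphabetical order, into one combined file
fasta_files <- sort(list.files(fasta_dir, pattern = "\\.fa$", full.names = TRUE))
combined_path <- file.path(tempdir(), "example_all_fasta.fa")
combined_lines <- unlist(lapply(fasta_files, readLines))
writeLines(combined_lines, combined_path)

readLines(combined_path)
```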

### Organism names 

For this step we created a CSV with the original filenames of the FASTA files and the IDs of each of the sequences contained in them:

```bash
sh get_organism_names.sh bacteria
```
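
The shape of that CSV can be illustrated with a small base-R sketch. The file names, IDs and sequences are hypothetical, and taking the first header token as the ID is an assumption made for this example. The column order (`organism_name`, then `ID`) follows the names assigned in `get_signalp_predictions()`.

```r
# Made-up FASTA content keyed by its (hypothetical) file name
fasta_by_file <- list(
  "organism_one.fa" = c(">seqA pep example", "MKKLLA", ">seqB pep example", "MSTNPK"),
  "organism_two.fa" = c(">seqC pep example", "MALWMR")
)

# One row per sequence: the file it came from and its ID
organism_names <- do.call(rbind, lapply(names(fasta_by_file), function(file_name) {
  # Header lines start with ">"
  headers <- grep("^>", fasta_by_file[[file_name]], value = TRUE)
  data.frame(
    organism_name = file_name,
    # Assumption: the ID is the first space-delimited token of the header
    ID = sub("^>([^ ]+).*$", "\\1", headers),
    stringsAsFactors = FALSE
  )
}))
organism_names
```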

## Run function

### Bacteria


```{r}
bacteria_full_table <- get_signalp_predictions(
  organism_names_path = "../../../data/secreted_data/updated_signalp_results/bacteria_organism_names.csv",
  all_fasta_path = "../../../data/secreted_data/updated_signalp_results/bacteria_all_fasta.fa",
  signalp_preds_path = "../../../data/secreted_data/updated_signalp_results/bacteria_signalp_results.csv", 
  organism = "bacteria"
)
```

```r
bacteria_full_table %>%
  data.table::fwrite("../../../data/secreted_data/updated_signalp_results/bacteria_full_table.csv")
```


```{r}
bacteria_full_table <-
  data.table::fread("../../../data/secreted_data/signalp-pipeline/bacteria_full_table.csv")
```

Count the sequences in each prediction class:

```{r}
bacteria_full_table %>%
  dplyr::group_by(prediction) %>%
  dplyr::summarise(count = n())
```
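
If percentages are wanted alongside the counts, a minimal sketch in base R could look like this. The toy table and its values are made up and only stand in for `bacteria_full_table`.

```r
# Toy table standing in for bacteria_full_table (values are made up)
toy_full_table <- data.frame(
  ID = c("seq1", "seq2", "seq3", "seq4"),
  prediction = c(
    "Signal peptide", "Non-secretory protein",
    "Signal peptide", "Signal peptide"
  ),
  stringsAsFactors = FALSE
)

# Counts per prediction class, then their share of all rows in percent
class_counts <- table(toy_full_table$prediction)
class_percent <- round(100 * prop.table(class_counts), 1)

class_summary <- data.frame(
  prediction = names(class_counts),
  count = as.integer(class_counts),
  percent = as.numeric(class_percent),
  stringsAsFactors = FALSE
)
class_summary
```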

### Fungi


```{r}
fungi_full_table <- get_signalp_predictions(
  organism_names_path = "../../../data/secreted_data/updated_signalp_results/fungi_organism_names.csv",
  all_fasta_path = "../../../data/secreted_data/updated_signalp_results/fungi_all_fasta.fa",
  signalp_preds_path = "../../../data/secreted_data/updated_signalp_results/fungi_signalp_results.csv", 
  organism = "fungi"
)
```

```{r}
fungi_full_table %>%
  data.table::fwrite("../../../data/secreted_data/updated_signalp_results/fungi_full_table.csv")
```


```{r}
# Testing the file
fungi_full_table <-
  data.table::fread("../../../data/secreted_data/updated_signalp_results/fungi_full_table.csv")
```

Count the sequences in each prediction class:

```{r}
fungi_full_table %>%
  dplyr::group_by(prediction) %>%
  dplyr::summarise(count = n()) %>% 
  knitr::kable()
```


### Oomycete / Protist

```{r}
oomycete_full_table <- get_signalp_predictions(
  organism_names_path = "../../../data/secreted_data/updated_signalp_results/protists_organism_names.csv",
  all_fasta_path = "../../../data/secreted_data/updated_signalp_results/protists_all_fasta.fa",
  signalp_preds_path = "../../../data/secreted_data/updated_signalp_results/protists_signalp_results.csv", 
  organism = "protist"
)
```


```{r}
oomycete_full_table %>%
  data.table::fwrite("../../../data/secreted_data/updated_signalp_results/protist_full_table.csv")
```


```{r}
# Testing the file
protist_full_table <-
  data.table::fread("../../../data/secreted_data/updated_signalp_results/protist_full_table.csv")
```

Count the sequences in each prediction class:

```{r}
protist_full_table %>%
  dplyr::group_by(prediction) %>%
  dplyr::summarise(count = n()) %>% 
  knitr::kable()
```