# Final analysis

Now that we have results for some of the samples, we are going to run a biostatistical analysis on them.

In [1]:
## Libraries
suppressPackageStartupMessages(suppressWarnings(library(tidyverse)))

## Next, we define some helper functions to reduce the amount of code we need

In [12]:
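## pre-processing, adding 0s
## value1, value2 and value3 are unused; they are kept so existing calls still work
pre_processed <- function(df,value1,value2,value3,
                          vect1,vect2,vect3,
                          value1_char,value2_char,value3_char){
    # Build every combination of the three key columns
    grid <- expand.grid(unique(as.character(vect1)),
                        unique(as.character(vect2)),
                        unique(as.character(vect3)),
                        stringsAsFactors = FALSE)
    # Name the grid columns after the join keys so left_join() can find them
    names(grid) <- c(value1_char,value2_char,value3_char)
    # Combinations missing from df get n = NA after the join; turn those into 0
    grid %>%
        left_join(df, by=c(value1_char,value2_char,value3_char)) %>%
        mutate(n=ifelse(is.na(n),0,n))
}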


In [2]:
## reading the data ##

# Biotype
apc_biotype <- read_tsv("../results/results_workflow/chromosome5/biotype.tsv") %>%
    mutate(gene="APC") 
braf_biotype <- read_tsv("../results/results_workflow/chromosome7/biotype.tsv") %>%
    mutate(gene="BRAF")
kras_biotype <- read_tsv("../results/results_workflow/chromosome12/biotype.tsv") %>%
    mutate(gene="KRAS")
tp53_biotype <- read_tsv("../results/results_workflow/chromosome17/biotype.tsv") %>%
    mutate(gene="TP53")

Rows: 30 Columns: 3
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): biotype, sample
dbl (1): n

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 41 Columns: 3
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): biotype, sample
dbl (1): n

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 33 Columns: 3
── Colum

In [8]:
## Joining the data
biotype <- rbind(apc_biotype,braf_biotype,tp53_biotype,kras_biotype)  
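The four `read_tsv()` calls and the `rbind()` follow the same pattern, so they could also be written as one loop over a named vector of paths. The following is only a sketch under assumptions: the temporary files, the toy counts, and the name `biotype_toy` are hypothetical stand-ins so the example runs on its own, and only two genes are shown.

```r
library(tidyverse)

# Hypothetical stand-ins for the per-chromosome biotype.tsv files,
# written to a temporary directory so the sketch is self-contained
tmp <- tempdir()
toy <- tibble(biotype = c("protein_coding", "retained_intron"),
              sample  = c("s1_chr5", "s1_chr5"),
              n       = c(3, 1))
paths <- c(APC  = file.path(tmp, "apc_biotype.tsv"),
           KRAS = file.path(tmp, "kras_biotype.tsv"))
walk(paths, function(p) write_tsv(toy, p))

# Read each file, tag it with the gene name taken from the vector names,
# and stack everything into a single data frame
biotype_toy <- imap(paths, function(path, gene) {
    read_tsv(path, show_col_types = FALSE) %>%
        mutate(gene = gene)
}) %>%
    bind_rows()
```

With the real data, `paths` would hold the four `../results/results_workflow/chromosome*/biotype.tsv` files named by gene, and adding another gene would only mean adding one element to the vector.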

## Now we will preprocess our data.

The reason is that the data has a problem. When one of the values (biotype, consequence, clinical significance...) does not appear in one of the samples, there is simply no row for it, so R treats that combination as nonexistent instead of as a count of 0.

With this chunk of code we solve that problem:

* We add 0s wherever a sample has no row for a particular value.
* We also separate the chromosome from the sample name.
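To make the two steps concrete, here is a minimal sketch on a toy table. The toy rows and the names `toy` and `toy_filled` are hypothetical, and `tidyr::complete()` is used here as a compact alternative to the `expand.grid()` plus `left_join()` approach in the notebook, not as the notebook's own method.

```r
library(tidyverse)

# Toy counts (hypothetical): sample "s2_chr5" has no "retained_intron" row
toy <- tribble(
    ~biotype,          ~sample,   ~gene, ~n,
    "protein_coding",  "s1_chr5", "APC", 4,
    "retained_intron", "s1_chr5", "APC", 2,
    "protein_coding",  "s2_chr5", "APC", 5
)

toy_filled <- toy %>%
    # Add the missing biotype/sample/gene combinations and set their n to 0
    complete(biotype, sample, gene, fill = list(n = 0)) %>%
    # Split "s1_chr5" into the sample name and the chromosome, then drop the latter
    separate_wider_delim(sample, delim = "_", names = c("sample", "chr")) %>%
    select(-chr)
```

After this, `toy_filled` has one row per biotype and sample, and the `retained_intron` row for sample `s2` carries `n = 0` instead of being absent.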

In [14]:
## Pre-processing the data
df_biotype <- expand.grid(biotype=as.character(unique(as.character(biotype$biotype))),
                          sample=as.character(unique(as.character(biotype$sample))),
                          gene=as.character(unique(as.character(biotype$gene)))) %>% 
    left_join(
        biotype, by=c("biotype","sample","gene")
        ) %>% 
    mutate(n=ifelse(is.na(n),0,n)) %>%
    separate_wider_delim(sample,delim = "_", names = c("sample", "chr")) %>%
    select(-"chr") 



## Starting point for the analysis

### We will start with the different biotypes of the variants for each gene

In [46]:
## Plotting ggplot2 biotype-genes
biotype_genes <- df_biotype %>%
    group_by(gene,biotype) %>%
    summarise(n=sum(n)) %>%
    mutate(biotype=factor(biotype,
                          levels = c("protein_coding", "protein_coding_CDS_not_defined", 
                                     "nonsense_mediated_decay", "retained_intron"),
                          labels = c("Protein coding", "Protein coding (CDS not defined)",
                                     "Nonsense mediated decay", "Retained intron"))) %>%
    ggplot(aes(reorder(gene,n),n, fill=biotype)) +
    geom_bar(stat="identity",position=position_dodge(.65), width=.5,
             color="black") +
    scale_y_continuous(expand=expansion(0),
                       limits=c(0,270),
                       breaks=seq(0,250,25)) +
    scale_fill_manual(values=c("darkgray","red", "blue", "forestgreen")) +
    labs(
        title="Variant biotypes per gene",
        x="Genes",
        y="Number of variants",
        fill="Biotype"
    ) +
    theme_classic() +
    theme(
        plot.title=element_text(hjust=.5, face="bold", size=14),
        axis.title=element_text(face="bold", size=12),
        axis.text=element_text(color="black", size=10),
        axis.ticks=element_blank(),
        legend.position=c(.3,.8),
        legend.background=element_rect(color="black")
    )



## Plotting ggplot2 biotype-samples

levs <- sort(unique(df_biotype$sample))

biotype_samples <-  df_biotype %>%
    group_by(sample,biotype) %>%
    summarise(n=sum(n)) %>%
    mutate(sample=factor(sample,
                          levels = levs),
          biotype=factor(biotype,
                          levels = c("protein_coding", "protein_coding_CDS_not_defined", 
                                     "nonsense_mediated_decay", "retained_intron"),
                          labels = c("Protein coding", "Protein coding (CDS not defined)",
                                     "Nonsense mediated decay", "Retained intron"))) %>% 
    ggplot(aes(sample,n, fill=biotype)) +
    geom_bar(stat="identity",position=position_dodge(.65), width=.5,
             color="black") +
    scale_y_continuous(expand=expansion(0),
                       limits=c(0,270),
                       breaks=seq(0,250,25)) +
    scale_fill_manual(values=c("darkgray","red", "blue", "forestgreen")) +
    labs(
        title="Variant biotypes per sample",
        x="Samples",
        y="Number of variants",
        fill="Biotype"
    ) +
    theme_classic() +
    theme(
        plot.title=element_text(hjust=.5, face="bold", size=14),
        axis.title=element_text(face="bold", size=12),
        axis.text=element_text(color="black", size=10),
        axis.ticks=element_blank(),
        legend.position="top",
        #legend.background=element_rect(color="black")
    )

## Argument names spelled out in full instead of relying on partial matching
ggsave(filename = "../results/results_workflow/plots_workflow/plots/biotype_genes.png",
       plot = biotype_genes,
       width = 7,
       height = 5)

ggsave(filename = "../results/results_workflow/plots_workflow/plots/biotype_samples.png",
       plot = biotype_samples,
       width = 10,
       height = 5)

[1m[22m`summarise()` has grouped output by 'gene'. You can override using the `.groups` argument.
[1m[22m`summarise()` has grouped output by 'sample'. You can override using the `.groups` argument.
“[1m[22mRemoved 2 rows containing missing values (`geom_bar()`).”
“[1m[22mRemoved 1 rows containing missing values (`geom_bar()`).”


In [30]:
sort(unique(df_biotype$sample))