In [2]:
# Load packages
library(tidyverse)
library(wesanderson)
library(fs)

# ANGSD SFS nSites test

- I suspect our jagged SFSs are partly due to having too few sites
- Here, I generate SFSs with varying total numbers of sites
- I varied the number of sites by changing the proportion of individuals required to have data when generating the SAF files (a stricter requirement leaves fewer sites)
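
As a rough illustration of the last point, a proportion of individuals can be converted into a minimum count of individuals with data per site. This sketch assumes the filter was applied as a per-site minimum (for example through ANGSD's `-minInd` option); the notebook does not state this, and `n_ind` and `props` are hypothetical values, not those used in the analysis.

```r
# Hypothetical values: neither the sample size nor the proportions come from this analysis
n_ind <- 20
props <- c(0.25, 0.5, 0.75, 1)

# Minimum number of individuals that must have data at a site for each proportion
min_ind <- ceiling(props * n_ind)

data.frame(propInd = props, minInd = min_ind)
```

The higher the required count, the fewer sites pass the filter, which is what makes the total number of sites vary between runs.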

In [8]:
load_long_sfs <- function(path){

    # Get name of folder with parameter combinations
    # (first directory level: sample set; second: proportion of individuals)
    dir <- str_split(path_dir(path), '/', simplify = TRUE)
    sample_set <- dir[1, 1]
    prop <- dir[1, 2]

    # Read in SFS (one space-delimited row) and reshape to one row per bin
    full_path <- paste0(inpath, '/', path)
    sfs <- suppressMessages(read_delim(full_path, delim = ' ', col_names = FALSE)) %>%
        t() %>%
        as.data.frame() %>%
        rename('num_sites' = 'V1') %>%
        mutate(maf = 1:n() - 1) %>%  # Index bins before filtering so labels stay aligned
        filter(num_sites != 0) %>%  # Folded SFS, so bins beyond the number of samples are 0
        mutate(prop_sites = num_sites / sum(num_sites),
               sample_set = sample_set,
               propInd = prop)
    return(sfs)
}
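
To make the reshaping step concrete, here is a minimal base R sketch of the same idea applied to a toy folded SFS. The counts in `sfs_line` are made up for illustration, and the sketch uses base R rather than the tidyverse pipeline above so that it runs without extra packages.

```r
# Toy folded SFS for 3 diploid individuals (7 bins); counts are made up for illustration
sfs_line <- "900 60 25 15 0 0 0"

# Split the single space-delimited row into a numeric vector
num_sites <- as.numeric(strsplit(trimws(sfs_line), " +")[[1]])

toy_sfs <- data.frame(
  maf = seq_along(num_sites) - 1,  # Bin index = minor allele count (0 = invariant class)
  num_sites = num_sites
)

# Drop the empty upper bins of the folded SFS, then convert counts to proportions
toy_sfs <- toy_sfs[toy_sfs$num_sites != 0, ]
toy_sfs$prop_sites <- toy_sfs$num_sites / sum(toy_sfs$num_sites)

toy_sfs
```

Only bins 0 to 3 remain, because a folded SFS for 6 chromosomes has no minor allele counts above 3.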

In [12]:
inpath <- '../results/angsd/nSites_test/'
# `pattern` is a regular expression, not a glob
sfs_df <- list.files(inpath, pattern = 'allSites\\.sfs$', recursive = TRUE) %>% 
  map_dfr(load_long_sfs)
head(sfs_df)

In [14]:
sfs_df %>% 
    group_by(sample_set, propInd) %>% 
    summarise(total_sites = sum(num_sites))

In [15]:
sfs_plot <- sfs_df %>% 
    filter(maf != 0 & maf <= 25) %>%  # Drop the invariant class (maf = 0); show bins 1 to 25
    ggplot(aes(x = maf, y = prop_sites)) + 
    geom_col(color = 'black', width = 0.70) + 
    facet_grid(sample_set ~ propInd) +
    ylab('Proportion of sites') + xlab('Minor allele frequency') +
    scale_x_continuous(breaks = seq(1, 25, 6)) +
    scale_y_continuous(breaks = seq(0, 0.13, 0.02)) + 
    theme_classic() + 
    theme(axis.text = element_text(size = 13),
          axis.title = element_text(size = 15))
sfs_plot

In [17]:
# Save plot
outpath <- snakemake@output[[1]]
print(outpath)
ggsave(filename = outpath, plot = sfs_plot, device = 'pdf', 
       dpi = 300, width = 12, height = 10, units = 'in')

## Quick take

- SFSs look more jagged when fewer sites are retained (i.e., when a greater proportion of individuals is required to have reads)
- This suggests that having more sites is better for estimating the SFS
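
One way to see why fewer sites could produce a more jagged SFS is a small sampling simulation. This is only a sketch under simplifying assumptions: sites are drawn independently from a multinomial distribution, and the smooth `1/i` shape in `expected` is an illustrative expectation, not the SFS of these samples. ANGSD estimates the SFS by maximum likelihood from genotype likelihoods, so the real noise structure will differ. The site counts in `site_counts` are hypothetical.

```r
set.seed(42)

n_bins <- 25  # Minor allele count classes 1 to 25

# Idealized smooth SFS shape (proportional to 1/i), used only for illustration
expected <- (1 / seq_len(n_bins)) / sum(1 / seq_len(n_bins))

# Mean absolute deviation of sampled bin proportions from the expected proportions
sfs_noise <- function(n_sites) {
  counts <- rmultinom(1, size = n_sites, prob = expected)[, 1]
  mean(abs(counts / n_sites - expected))
}

# Hypothetical numbers of variable sites
site_counts <- c(1e3, 1e4, 1e5, 1e6)
noise <- sapply(site_counts, sfs_noise)

data.frame(n_sites = site_counts, noise = noise)
```

Under this model the deviation from the smooth expectation shrinks as the number of sites grows, which is consistent with the pattern in the plot above.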