In [2]:
library(tidyverse)
library(wesanderson)

# QC analysis and exploration of bad samples

- Here we'll look at some of the QC output by Qualimap as part of the `multiqc` pipeline
- We'll look at per-sample heterozygosity estimated from the single-sample SFS as a measure of genetic variation
    - Samples with really weird diversity values might need to be looked into
    
- SFS were estimated in `ANGSD` with no filters (e.g., MQ, BQ) so that hopefully bad samples will be obvious
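
As a rough illustration of the heterozygosity measure described above: for a single diploid sample, the SFS reduces to counts of invariant and variable sites, and the proportion of variable sites is the diversity estimate. The sketch below uses made-up counts and a hypothetical helper, `het_from_sfs`. It mirrors the `prop_var` calculation in the loading function further down but is not part of the pipeline.

```r
# Hypothetical helper: heterozygosity from a single-sample SFS.
# `sfs` holds expected site counts in the order c(invariant, variable, ...).
het_from_sfs <- function(sfs) {
  stopifnot(length(sfs) >= 2, all(sfs >= 0))
  n_invar <- sfs[1]
  n_var <- sfs[2]
  # Proportion of variable sites among all sites considered
  n_var / (n_invar + n_var)
}

# Made-up counts, for illustration only
toy_sfs <- c(995000, 5000, 0)
toy_het <- het_from_sfs(toy_sfs)
print(toy_het)
```

A sample whose value sits far outside the distribution of the other samples would be a candidate for closer inspection.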

## Load in SFS and multiQC data

In [3]:
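# Function to load a folded single-sample SFS output by ANGSD.
# Note: relies on the global `inpath`, which is defined in the next cell before this function is called.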
load_wide_sfs <- function(path){
  
  # Get name of folder with parameter combinations
  dirname <- dirname(path)
  
  # Read in SFS
  full_path <- paste0(inpath, '/', path)
  sfs <- suppressMessages(read_table(full_path, col_names = FALSE)) %>% 
    as.data.frame() %>% 
    rename('Invar' = 'X1',
           'Var' = 'X2') %>% 
    dplyr::select(-X3) %>% 
    mutate(Sample = dirname,
           total_sites = Invar + Var,
           prop_var = Var / total_sites)

  return(sfs)
}

In [4]:
# Load in all SFS as single dataframe
inpath <- '../results/single_sample_sfs/'
# `pattern` is a regular expression (not a glob), so match a literal '.sfs' suffix
sfs_df <- list.files(inpath, recursive = TRUE, pattern = '\\.sfs$') %>% 
  map_dfr(., load_wide_sfs)

In [5]:
# Load Qualimap data from multiQC directory
multiqc <- suppressMessages(read_delim('../results/qc/multiqc/multiqc_data/multiqc_qualimap_bamqc_genome_results_qualimap_bamqc.txt', 
                      delim='\t'))

In [6]:
# Join all data into single df
allQC_data <- multiqc %>% 
    mutate(Sample = str_extract(Sample, pattern = '\\w+_\\d+_\\d+$')) %>% 
    dplyr::select(Sample, mean_coverage, percentage_aligned, general_error_rate) %>% 
    left_join(., sfs_df, by = 'Sample')

In [7]:
print(nrow(allQC_data))
head(allQC_data)

## Show some plots

### Histogram of mean coverage

- Histogram of mean coverage reported by Qualimap
- Vertical red line placed at the putative cutoff separating low-coverage from higher-coverage samples

In [8]:
cov_hist <- allQC_data %>% 
    ggplot(., aes(x = mean_coverage)) +
    geom_histogram(bins = 100, color = 'black', fill = 'white') + 
    geom_vline(xintercept = 0.31, color = 'red') +  # same 0.31X cutoff used for filtering below
    xlab('Mean depth (X)') + ylab('Sample Count') +
    theme_classic()
cov_hist

In [9]:
outpath <- snakemake@output[[1]]
ggsave(filename = outpath, plot = cov_hist, device = 'pdf', height = 8, width = 8, units = 'in', dpi = 600)

### General error rate histogram

- Histogram of error rates reported by Qualimap
- High error rates are problematic and can bias diversity upwards. 
- High error rates are likely caused by a different species having been sequenced.
- These samples should be removed right from the start.
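
One way to see which samples such a filter would drop is to flag them before removing them. The sketch below uses a toy data frame with invented values and assumes the 0.04 threshold applied later in this notebook. `toy_qc` and `error_cutoff` are illustrative names only.

```r
# Toy QC table with invented error rates (not real data)
toy_qc <- data.frame(
  Sample = c("s_1_1", "s_1_2", "s_2_1", "s_2_2"),
  general_error_rate = c(0.010, 0.012, 0.085, 0.020),
  stringsAsFactors = FALSE
)

# Assumed threshold, matching the filter applied later in this notebook
error_cutoff <- 0.04

# Flag rather than drop, so the flagged samples can be inspected first
toy_qc$high_error <- toy_qc$general_error_rate >= error_cutoff
flagged <- toy_qc$Sample[toy_qc$high_error]
print(flagged)
```

Flagging first keeps the full table intact, which makes it easy to check that only the intended samples are excluded.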

In [10]:
error_hist <- allQC_data %>% 
    ggplot(., aes(x = general_error_rate)) +
    geom_histogram(bins = 100, color = 'black', fill = 'white') + 
    xlab('General alignment error rate (%)') + ylab('Sample Count') +
    theme_classic()
error_hist

In [11]:
outpath <- snakemake@output[[2]]
ggsave(filename = outpath, plot = error_hist, device = 'pdf', height = 8, width = 8, units = 'in', dpi = 600)

#### Remove high error rate samples

- Samples with a general error rate of 0.04 or higher are removed immediately

In [12]:
# Keep only samples with low error rates
allQC_data_highErrorRemoved <- allQC_data %>% 
    filter(general_error_rate < 0.04)

### Alignment percent by coverage and SNP density

- Plot alignment percentage against mean sample coverage, colored by SNP density

#### All samples (minus high error rate)

In [14]:
cols <- wes_palette("Zissou1", 100, type = "continuous")
align_cov_SNPdens_allsamples_plot <- ggplot(data = allQC_data_highErrorRemoved, 
                                            aes(x = mean_coverage, y = percentage_aligned, color = prop_var)) + 
    scale_color_gradientn(colors = cols) +
    xlab('Mean depth (X)') + ylab('Alignment %') + labs(color = 'SNP density') + 
    geom_vline(xintercept = 0.31, color = 'red') +
    geom_point() +
    theme_classic()
align_cov_SNPdens_allsamples_plot

- Some samples have very low alignment % and similarly low coverage
    - These samples also have lower than average SNP density estimates
- Overall, filtering on coverage would probably be a good start to removing these poorly sequenced samples. 
- Will use 0.31X as our cutoff, since this corresponds to a natural split in the mean coverage histogram (see coverage histogram above). 
    - Cutoff also shown in figure above. Samples to left of this line will get tossed for some analyses. 
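
To get a feel for how many samples a coverage cutoff would discard, the samples on each side of it can be tabulated. This is a minimal sketch with invented coverage values. `toy_cov` and `cov_cutoff` are hypothetical names, and only the 0.31X value comes from the text above.

```r
# Invented mean coverage values (X), for illustration only
toy_cov <- c(0.05, 0.12, 0.29, 0.31, 0.45, 0.80, 1.10)

# Cutoff taken from the text above
cov_cutoff <- 0.31

# Samples at or above the cutoff are kept; the rest are labelled low coverage
cov_class <- ifelse(toy_cov >= cov_cutoff, "keep", "low_coverage")
cov_counts <- table(cov_class)
print(cov_counts)
```

Because the comparison is `>=`, a sample sitting exactly at the cutoff is retained, consistent with the filter used in the next section.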

In [15]:
outpath <- snakemake@output[[3]]
ggsave(filename = outpath, plot = align_cov_SNPdens_allsamples_plot, device = 'pdf', 
       height = 8, width = 8, units = 'in', dpi = 600)

#### Low coverage samples removed

- Same scatter plot as above, but with samples with coverage < 0.31X removed

In [16]:
# Keep only samples with higher coverage
allQC_data_highQualOnly <- allQC_data_highErrorRemoved %>% 
    filter(mean_coverage >= 0.31)
print(nrow(allQC_data_highQualOnly))

In [17]:
cols <- wes_palette("Zissou1", 100, type = "continuous")
align_cov_SNPdens_highQualOnly_plot <- ggplot(data = allQC_data_highQualOnly, 
                                            aes(x = mean_coverage, y = percentage_aligned, color = prop_var)) + 
    scale_color_gradientn(colors = cols) +
    xlab('Mean depth (X)') + ylab('Alignment %') + labs(color = 'SNP density') + 
    geom_vline(xintercept = 0.31, color = 'red') +
    geom_point() +
    theme_classic()
align_cov_SNPdens_highQualOnly_plot

In [19]:
outpath <- snakemake@output[[4]]
ggsave(filename = outpath, plot = align_cov_SNPdens_highQualOnly_plot, device = 'pdf', 
       height = 8, width = 8, units = 'in', dpi = 600)

## Dataframe for samples to be removed

- Create a dataframe with names of samples to be removed for use in Snakemake pipeline
    - Total of 88 samples being removed for some analyses
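
A quick consistency check on a removal list like this is to confirm that the removed and retained samples partition the full set. The sketch below uses invented sample names. `all_samples`, `kept_samples`, and `to_remove` are illustrative and stand in for the notebook's data frames.

```r
# Invented sample names, for illustration only
all_samples <- c("s_1_1", "s_1_2", "s_2_1", "s_2_2", "s_3_1")
kept_samples <- c("s_1_1", "s_2_2", "s_3_1")

# Samples to remove are those not in the retained set
to_remove <- setdiff(all_samples, kept_samples)

# The two sets should not overlap and should together cover every sample
stopifnot(length(intersect(to_remove, kept_samples)) == 0)
stopifnot(length(to_remove) + length(kept_samples) == length(all_samples))
print(to_remove)
```

Note that in a filter-based workflow, samples with missing QC values fail both filters and therefore also land in the removal list.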

In [20]:
# Get samples not in high quality sample dataframe
samples_to_remove <- allQC_data %>% 
    filter(!(Sample %in% allQC_data_highQualOnly$Sample)) %>% 
    dplyr::select(Sample)
head(samples_to_remove)

In [21]:
# Quick breakdown of number of samples being tossed by habitat and city
sample_df <- suppressMessages(read_csv('../resources/low1_libraryConcentrations.csv')) %>% 
    dplyr::select(city, site, plantID) %>% 
    rename('Sample' = 'plantID')

# Join sample_df to samples_to_remove and summarise
samples_to_remove_counts <- samples_to_remove %>% 
    left_join(., sample_df, by = 'Sample') %>% 
    group_by(city, site) %>% 
    summarise(to_remove = n())
samples_to_remove_counts

In [24]:
outpath <- snakemake@output[[5]]
write_delim(file = outpath, x = samples_to_remove_counts, col_names = TRUE)

In [25]:
outpath <- snakemake@output[[6]]
write_delim(file = outpath, x = samples_to_remove, col_names = FALSE)