In [16]:
# Load packages
library(tidyverse)
library(wesanderson)

# Global patterns of diversity and structure

In this notebook, we'll analyze the output from an `ANGSD` run on 495 samples collected from urban and rural habitats across 25 cities. Specifically, we'll look at:

1. Sequencing depth across all samples. This is mostly a check that the maximum depth threshold used for genotype likelihood and SFS estimation is reasonable.
2. General shape of the SFS estimated using all sites, 4-fold sites, and 0-fold sites across the genome
3. Genome-wide average pairwise genetic diversity (Theta<sub>pi</sub>) across all samples. Done for all sites, 4-fold sites, and 0-fold sites.
4. PCA showing population clustering of all 495 samples using 4-fold sites only

## Sequencing depth

We'll look at the depth distribution across all 495 samples using a single chromosome as an example

### Load in depth dataset

In [75]:
# Load in data and transform
depth_df <- suppressMessages(
    read_delim(
        '..//results//angsd/depth/CM019101.1/CM019101.1_allSamples_allSites.depthGlobal',
                      delim = '\t', col_names = FALSE)) %>% 
    t() %>% 
    as.data.frame() %>% 
    rename('num_sites' = 'V1') %>% 
    mutate(cov = 1:n() - 1)

The dataset has 10,000 rows. Each row gives the number of sites with a given total coverage (summed across samples). The exception is bin 10,000, which gives the number of sites with coverage equal to or greater than 10,000.
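
To make the bin layout concrete, here is a minimal sketch that uses a made-up vector of counts (not real data) in place of the single row of the `.depthGlobal` file. It assumes, as described above, that the first entry is the number of sites with 0X coverage and that the last entry pools everything at or above the largest coverage value. The names `toy_counts` and `toy_depth` are illustrative only.

```r
# Toy stand-in for one row of a .depthGlobal file (made-up counts)
toy_counts <- c(120, 300, 250, 90, 40)

# Bin i (1-based) holds the number of sites with total coverage i - 1
toy_depth <- data.frame(num_sites = toy_counts,
                        cov = seq_along(toy_counts) - 1)

# The last bin pools all sites at or above the largest coverage value
toy_depth$label <- ifelse(toy_depth$cov == max(toy_depth$cov),
                          paste0('>=', toy_depth$cov, 'X'),
                          paste0(toy_depth$cov, 'X'))
toy_depth
```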

In [4]:
head(depth_df)

### Histogram of sequencing coverage

Plotting histogram of number of sites at each X coverage along chromosome 1.

The red dashed line is the max depth cutoff used when estimating genotype likelihoods and the SFS. Sites with depth greater than this cutoff were excluded.

- This depth (1250X) was calculated as 2 × the mean coverage from qualimap (1.25X) × the number of samples (500)
- I didn't bother recalculating max depth when we switched from 500 to 495 samples. 
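
The arithmetic behind the cutoff can be written out as below. The values are taken from the bullets above; the 495-sample line is included for comparison only and is not the cutoff that was actually used.

```r
mean_cov  <- 1.25   # mean per-sample coverage reported by qualimap
n_samples <- 500    # number of samples when the cutoff was set

# Cutoff = 2 x mean coverage x number of samples
depth_cutoff_500 <- 2 * mean_cov * n_samples
depth_cutoff_500

# Same formula with the final 495 samples, for comparison only
depth_cutoff_495 <- 2 * mean_cov * 495
depth_cutoff_495
```

The two values differ by 12.5X (1250 vs. 1237.5), which is 1% of the cutoff.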

In [5]:
depth_cutoff <- 1250
depthGlobal_plot <- ggplot(depth_df, aes(x = cov, y = num_sites)) +
    geom_bar(stat = 'identity', color = 'black', fill = 'white') + 
    xlab('Coverage') + ylab('Number of sites') +
    theme_classic() + 
    geom_vline(xintercept = depth_cutoff, linetype = 'dashed', color = 'red') +
    theme(axis.text = element_text(size = 13),
        axis.title = element_text(size = 15))
depthGlobal_plot

In [6]:
# Estimate proportion of sites above and below cutoff
total_sites <- sum(depth_df$num_sites, na.rm = TRUE)
depth_df %>% 
    mutate(is_below_cutoff = ifelse(cov <= depth_cutoff, 1, 0)) %>% 
    group_by(is_below_cutoff) %>% 
    summarize(site_count = sum(num_sites, na.rm = TRUE),
              prop = site_count / total_sites)

In [8]:
# Save plot
outpath <- snakemake@output[[1]]
print(outpath)
ggsave(filename = outpath, plot = depthGlobal_plot, device = 'png', 
       dpi = 300, width = 8, height = 8, units = 'cm')

### My take

- Most sites have 1 to 2X coverage
- As coverage increases beyond 2X, there is a gradual decline in the number of sites for a given depth
- 1250X is a reasonable cutoff and captures 98% of all sites. 

## SFS and diversity

Here I'll plot the SFS for all sites across the genome, in addition to subsets of degenerate sites (4-fold and 0-fold). We'll similarly estimate diversity as the genome-wide average of pairwise nucleotide differences for these same sets of sites.

- Filters applied to sites:
    - Minimum phred-scaled read mapping quality of 30
    - Minimum phred-scaled base quality of 20
    - Max depth of 1250X across all individuals
    - 50% of all individuals required to have at least 1 read
    
In addition to the above filters, genotype likelihoods are estimated using the `samtools` model, with the `samtools` "extended BAQ" algorithm used to re-assign base quality scores around indels.

### SFS

#### Load in SFS data as single dataframe

In [73]:
load_long_sfs <- function(path){
  
    # Get name of folder with parameter combinations
    dirname <- dirname(path)

    # Read in SFS; relies on the global `inpath` being set before this function is called
    full_path <- paste0(inpath, '/', path)
    sfs <- suppressMessages(read_delim(full_path, delim = '\t', col_names = FALSE)) %>% 
        as.data.frame() %>% 
        rename('maf' = 'X1',
               'num_sites' = 'X2') %>% 
        # Folded SFS, so bins above the number of samples are 0; drop them
        filter(num_sites != 0) %>% 
        mutate(prop_sites = num_sites / sum(num_sites),
               site = dirname)
    return(sfs)
}

In [74]:
inpath <- '../results/angsd/sfs/'
# `pattern` is a regular expression, not a glob, so no leading '*' is needed
sfs_df <- list.files(inpath, pattern = 'allChroms\\.sfs', recursive = TRUE) %>% 
  map_dfr(., load_long_sfs)

In [19]:
# Quick look at the data
head(sfs_df)

#### Plot SFS

- Only plotting minor allele frequencies from 1/495 to 25/495 for clarity

In [20]:
# How many sites in each category?
sfs_df %>% 
    group_by(site) %>% 
    summarize(total_sites = sum(num_sites))

In [76]:
# SFS for all sites, 0-fold sites, and 4-fold sites
# Invariant sites not plotted
# Only show up to 25-tons
cols <- wes_palette("Darjeeling1", n = 3, type = 'discrete')
sfs_plot <- sfs_df %>% 
  filter(maf != 0 & maf <= 25)  %>%
  ggplot(., aes(x = maf, y = prop_sites, fill = site)) + 
  geom_bar(stat ='identity', color = 'black',  width=.70, position = "dodge") + 
  ylab('Proportion of sites') + xlab('Minor allele frequency') +
  scale_fill_manual(values = cols) +
  scale_x_continuous(breaks = seq(1, 40, 3)) +
  scale_y_continuous(breaks = seq(0, 0.13, 0.01)) + 
  theme_classic() + 
  theme(axis.text = element_text(size = 13),
        axis.title = element_text(size = 15))
sfs_plot

In [77]:
# Save plot
outpath <- snakemake@output[[2]]
print(outpath)
ggsave(filename = outpath, plot = sfs_plot, device = 'png', 
       dpi = 300, width = 8, height = 8, units = 'cm')

#### My take

- This SFS looks a little suspicious. Why are there more 4-fold than 0-fold rare variants? We see the same thing in the Toronto data, though we do see more 0-fold invariant sites, as expected (not shown).
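
One thing worth checking here: `prop_sites` in `load_long_sfs` divides by all sites in each class, including the invariant bin. If 0-fold sites have a larger invariant share, every variant bin will look smaller for 0-fold than for 4-fold under that normalization. The sketch below uses made-up counts (not our data) to show how the comparison can change when proportions are taken among segregating sites only. `toy_sfs` and its numbers are purely illustrative, and whether this accounts for the pattern in the real SFS is an open question.

```r
# Made-up folded SFS counts for two site classes (bin 0 = invariant sites)
toy_sfs <- data.frame(
    site      = rep(c('0fold', '4fold'), each = 4),
    maf       = rep(0:3, times = 2),
    num_sites = c(9700, 180, 80, 40,     # 0-fold: invariant, 1, 2, 3
                  9000, 500, 300, 200))  # 4-fold: invariant, 1, 2, 3

# Proportion over all sites in the class (invariant bin included)
toy_sfs$prop_all <- toy_sfs$num_sites / ave(toy_sfs$num_sites, toy_sfs$site, FUN = sum)

# Proportion among segregating sites only (invariant bin dropped first)
seg <- toy_sfs[toy_sfs$maf != 0, ]
seg$prop_seg <- seg$num_sites / ave(seg$num_sites, seg$site, FUN = sum)
seg
```

In this toy example, singletons make up a smaller share of all 0-fold sites than of all 4-fold sites, but a larger share of segregating 0-fold sites than of segregating 4-fold sites.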

### Diversity

#### Load in diversity dataframes

In [64]:
load_div_neut_df <- function(path){
  
    # Get name of folder with parameter combinations
    dirname <- dirname(path)

    # Read in stats
    full_path <- paste0(inpath, '/', path)
    stats <- suppressMessages(read_delim(full_path, delim= '\t', col_names = TRUE)) %>% 
    mutate(site = dirname,
           tp_scaled = tP / nSites,
           tw_scaled = tW / nSites)
    return(stats)
  
}

In [65]:
inpath <- '../results/angsd/summary_stats/thetas/'
thetas_df <- list.files(inpath, pattern = 'allChroms.thetas.idx.pestPG', recursive = TRUE) %>% 
  map_dfr(., load_div_neut_df)

In [25]:
# tp is theta_pi
# tw is theta_Watterson
# td is Tajima's D
thetas_df %>% 
  group_by(site) %>% 
  summarise(total_sites = sum(nSites),
            mean_tp = mean(tp_scaled),
            mean_tw = mean(tw_scaled), 
            mean_td = mean(Tajima))

#### My take

- Looks good. Lower diversity at 0-fold sites, as expected
- 1.7% diversity worldwide, slightly higher than the estimate of 1.55% in Toronto
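
For reference, the per-site values above come from dividing the `tP` and `tW` columns by `nSites` for each row of the `pestPG` output, then averaging across rows. If rows differ a lot in `nSites`, the mean of per-row ratios and the site-weighted ratio of sums can differ. A minimal sketch with made-up numbers, keeping the same column names (the chromosome names and values are hypothetical):

```r
# Made-up pestPG-like rows, one per chromosome
toy_thetas <- data.frame(
    Chr    = c('chrA', 'chrB', 'chrC'),   # hypothetical chromosome names
    tP     = c(170, 30, 8),               # made-up summed pairwise theta
    nSites = c(10000, 2000, 400))

# Mean of per-row ratios (what mean(tp_scaled) computes)
mean_of_ratios <- mean(toy_thetas$tP / toy_thetas$nSites)

# Site-weighted average: total theta over total sites
ratio_of_sums <- sum(toy_thetas$tP) / sum(toy_thetas$nSites)

c(mean_of_ratios = mean_of_ratios, ratio_of_sums = ratio_of_sums)
```

Here the small `chrC` row pulls the unweighted mean up, because it counts as much as the much larger `chrA` row.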

## PCA

- PCA is performed on genotype likelihoods
- Filtering criteria the same as those above with one exception:
    - This is based only on variant sites with MAF > 0.05

### Load in data

In [67]:
# Load covariance matrix from PCAngsd
cov_mat <- suppressMessages(
    read_delim(
        '../results//population_structure//pcangsd//allSamples_4fold_maf0.05_pcangsd.cov', 
                      col_names = FALSE, delim = ' ')) %>% 
      as.matrix()

# Load data with continent info
continent <- suppressMessages(
    read_delim(
        '../../sequencing-prep/data-clean/low-1/plantsToPrep_low1.csv', 
                           delim = ',')) %>% 
    dplyr::select(continent, city, pop, individual)

# Load data with habitat info
habitat_info <- suppressMessages(
    read_delim(
        '../resources/low1_libraryConcentrations.csv', 
                           delim = ',')) %>% 
    dplyr::select(city, pop, individual, site, plantID) %>% 
    left_join(., continent, by = c('city', 'pop', 'individual'))

# Combine continent and habitat data with sample order from ANGSD
samples <- suppressMessages(
    read_table(
        '../results/program_resources/angsd_sample_order.txt', col_names = FALSE) %>% 
  rename('plantID' = 'X1')) %>%
  left_join(., habitat_info, by = 'plantID')

### Perform the PCA and summarize

In [68]:
# Summary of PCA
# 17% explained by PC1
# 7% explained by PC2
# < 1% for all remaining PCs
summary(princomp(cov_mat))
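
As a cross-check on the percentages above, a common way to get the variance explained from a covariance matrix is to divide each of its eigenvalues by the sum of all eigenvalues. `princomp()` instead treats its input as a raw data matrix, so the two summaries are not guaranteed to agree. The sketch below uses a small made-up symmetric matrix (`toy_cov`) in place of `cov_mat`.

```r
# Made-up 3 x 3 covariance matrix standing in for cov_mat
toy_cov <- matrix(c(2.0, 0.5, 0.1,
                    0.5, 1.0, 0.2,
                    0.1, 0.2, 0.5), nrow = 3, byrow = TRUE)

# Eigen decomposition; symmetric = TRUE because covariance matrices are symmetric
toy_eig <- eigen(toy_cov, symmetric = TRUE)

# Percent variance per PC = 100 * eigenvalue / sum of eigenvalues
pct_var <- 100 * toy_eig$values / sum(toy_eig$values)
round(pct_var, 1)
```

Applying the same two lines to `cov_mat` would give percentages that can be compared directly with the `princomp` summary.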

In [29]:
# Dataframe with eigenvectors
eigenvectors <- eigen(cov_mat)
eigen_df <- eigenvectors$vectors %>% 
  as.data.frame() %>% 
  dplyr::select(V1, V2, V3, V4) %>% 
  rename('PC1' = 'V1',
         'PC2' = 'V2', 
         'PC3' = 'V3',
         'PC4' = 'V4') %>% 
  bind_cols(., samples)

#### Plot the PCA

__Colored by habitat__

In [78]:
# Plot
col1 <- wes_palette("Darjeeling1", n = 5, type = 'discrete')[2]
col2 <- wes_palette("Darjeeling1", n = 5, type = 'discrete')[4]
cols <- c(col1, col2)
pca_byHabitat <- ggplot(eigen_df, aes(x = PC1, y = PC2, color = site)) +
  geom_point(size = 2.5) + 
  scale_color_manual(values = cols) + 
  theme_classic() + 
  xlab('PC1 (17%)') + ylab('PC2 (7%)') +
  theme(axis.text = element_text(size = 13),
        axis.title = element_text(size = 15))
pca_byHabitat

In [79]:
# Save plot
outpath <- snakemake@output[[3]]
print(outpath)
ggsave(filename = outpath, plot = pca_byHabitat, device = 'png', 
       dpi = 300, width = 8, height = 8, units = 'cm')

__Colored by city__

In [80]:
# Plot
cols <- wes_palette("Darjeeling1", n = 25, type = 'continuous')
pca_byCity <- ggplot(eigen_df, aes(x = PC1, y = PC2, color = city)) +
  geom_point(size = 2.5) + 
  scale_color_manual(values = cols) + 
  theme_classic() + 
  xlab('PC1 (17%)') + ylab('PC2 (7%)') +
  theme(axis.text = element_text(size = 13),
        axis.title = element_text(size = 15))
pca_byCity

In [81]:
# Save plot
outpath <- snakemake@output[[4]]
print(outpath)
ggsave(filename = outpath, plot = pca_byCity, device = 'png', 
       dpi = 300, width = 8, height = 8, units = 'cm')

__Colored by continent__

In [82]:
# Plot
cols <- wes_palette("Darjeeling1", n = 6, type = 'continuous')
pca_byContinent <- ggplot(eigen_df, aes(x = PC1, y = PC2, color = continent)) +
  geom_point(size = 2.5) + 
  scale_color_manual(values = cols) + 
  theme_classic() + 
  xlab('PC1 (17%)') + ylab('PC2 (7%)') +
  theme(axis.text = element_text(size = 13),
        axis.title = element_text(size = 15))
pca_byContinent

In [83]:
# Save plot
outpath <- snakemake@output[[5]]
print(outpath)
ggsave(filename = outpath, plot = pca_byContinent, device = 'png', 
       dpi = 300, width = 8, height = 8, units = 'cm')

#### My take

- The worldwide distribution of clover represents a single population
    - Obviously it's not that extreme
    - There is some clustering by continent and by cities within continent
    - We should do some more digging to see whether these PCs correspond to Lat/Long, for example
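
A minimal sketch of that follow-up, assuming per-sample latitude and longitude were available. The `lat` and `long` columns and all values below are hypothetical and randomly generated; `eigen_df` does not currently contain coordinates.

```r
set.seed(42)

# Made-up data: 20 samples with hypothetical coordinates
toy_pcs <- data.frame(lat  = runif(20, -40, 60),
                      long = runif(20, -120, 150))

# Simulated PC scores loosely tied to the coordinates, plus noise
toy_pcs$PC1 <- 0.001 * toy_pcs$lat  + rnorm(20, sd = 0.01)
toy_pcs$PC2 <- 0.001 * toy_pcs$long + rnorm(20, sd = 0.01)

# Spearman rank correlation between each PC and each coordinate
pc_geo_cor <- cor(toy_pcs[, c('PC1', 'PC2')],
                  toy_pcs[, c('lat', 'long')],
                  method = 'spearman')
pc_geo_cor
```

With real data, the same `cor()` call on `eigen_df` joined to city coordinates would give a first look at whether PC1 or PC2 tracks geography.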