In [2]:
# Load packages
library(tidyverse)
library(wesanderson)
library(fs)

# Description of data and sample sets

All analyses in this notebook were performed separately on two different sample sets: 

1. _highErrorRemoved_: This sample set contains 515 samples from urban and rural habitats across 26 cities (25 cities for GLUE-LOW1 plus 20 samples from Toronto). Five samples were removed due to abnormally high alignment error rates (as reported by `Qualimap`). 
2. _finalSamples_lowCovRemoved_: This sample set contains 432 samples across 26 cities. In addition to the 5 samples with high error rates above, an additional 83 samples with low coverage (< 0.31X) were removed, since these were previously found to have a low mapping percentage and lower-than-average SNP density compared to the remaining samples. 

The 20 Toronto samples were selected as follows:
- 2 samples from each of 5 urban populations for a total of 10 urban plants
- 5 samples from one rural population and 1 sample from each of 5 other populations. These samples come from the western Toronto transect only. We couldn't evenly sample rural populations because they weren't evenly sampled for the Toronto data. Total of 10 rural plants 
- All 20 plants were downsampled by sampling 25% of their reads. Because initial coverage varied between ~7.5X and ~14X among these samples, downsampled coverage in GLUE should range from ~1.8X to ~3.5X.
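
As a quick check of the arithmetic above: the expected coverage after downsampling is the initial coverage multiplied by the fraction of reads retained. The variable names here are illustrative only, and the coverage bounds are the approximate values quoted above.

```r
# Approximate initial coverage range of the Toronto samples (values from the text above)
initial_cov <- c(min = 7.5, max = 14)

# Fraction of reads retained when downsampling
read_fraction <- 0.25

# Expected coverage after downsampling
downsampled_cov <- initial_cov * read_fraction
downsampled_cov
```

This gives roughly 1.9X and 3.5X, in line with the ~1.8X to ~3.5X range stated above.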

# Global patterns of diversity and structure

In this notebook, we'll analyze the output from `ANGSD` run separately on both sample sets described above. The notebook contains the following analyses:

1. Sequencing depth across all samples. This is mostly a check to see whether the maximum depth threshold used in genotype likelihood and SFS estimation is reasonable.
2. General shape of the SFS estimated using all sites, 4-fold sites, and 0-fold sites across the genome
3. Genome-wide average pairwise genetic diversity (Theta<sub>pi</sub>) across all samples. Done for all sites, 4-fold sites, and 0-fold sites.
4. PCA showing population clustering of all 495 samples using 4-fold sites only

## Sequencing depth

### Load in depth dataset

In [3]:
load_depth_df <- function(path){
    
    # Get name of folder with parameter combinations
    sample_set <- str_split(path_dir(path), '/', simplify = TRUE)[1,1]

    # Read in the single row of per-depth site counts and reshape to long format
    full_path <- paste0(inpath, '/', path)
    depth_df <- suppressMessages(
        read_delim(full_path, delim = '\t', col_names = FALSE)) %>% 
    t() %>% 
    as.data.frame() %>% 
    rename('num_sites' = 'V1') %>% 
    mutate(cov = 1:n() - 1,
           sample_set = sample_set)
    return(depth_df)
}

In [4]:
inpath <- '../results/angsd/depth/'
depth_df <- list.files(inpath, pattern = 'CM019101.1_\\w+_allSites.depthGlobal', recursive = TRUE) %>% 
  map_dfr(., load_depth_df)

In [5]:
head(depth_df)

The dataset has 20,000 rows. Each row gives the number of sites with a given total coverage (summed across all samples) in one of the two sample sets (10,000 rows per sample set). The exception is the last bin (10,000), which gives the number of sites with coverage equal to or greater than 10,000. 
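
To make the layout concrete, here is a minimal base-R sketch of the same reshaping on made-up data. The counts are hypothetical and are not taken from the real `.depthGlobal` files; the only thing borrowed from `load_depth_df` is the convention that the first bin corresponds to 0X coverage.

```r
# Hypothetical site counts for coverage bins 0X, 1X, 2X, 3X and 4X
toy_counts <- c(120, 340, 210, 80, 15)

# Long format: one row per coverage bin, with the first bin at 0X
toy_depth <- data.frame(
    cov = seq_along(toy_counts) - 1,
    num_sites = toy_counts
)
toy_depth
```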

### Histogram of sequencing coverage

Plotting histogram of number of sites at each X coverage along chromosome 1.

The red dashed line is the max depth cutoff used when estimating genotype likelihoods and the SFS. Sites with depth greater than this cutoff were excluded. 

- This depth (1250X) was calculated as 2 x the mean coverage from qualimap (1.25X) x Number of samples (500)
- I didn't bother recalculating max depth when we switched from 500 to 495 samples. 
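
For reference, the cutoff calculation can be written out as below. `max_depth` is a hypothetical helper defined only for this illustration. The second call shows what the cutoff would have been with 495 samples, which is only about 1% lower.

```r
# Mean coverage reported by qualimap (value from the text above)
mean_cov <- 1.25

# Hypothetical helper: max depth = multiplier x mean coverage x number of samples
max_depth <- function(n_samples, mean_cov, multiplier = 2) {
    multiplier * mean_cov * n_samples
}

max_depth(500, mean_cov)  # 1250, the cutoff used here
max_depth(495, mean_cov)  # 1237.5, if recalculated for 495 samples
```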

In [6]:
depth_cutoff <- 1250
depthGlobal_plot <- ggplot(depth_df, aes(x = cov, y = num_sites)) +
    geom_bar(stat = 'identity', color = 'black', fill = 'white') + 
    xlab('Coverage') + ylab('Number of sites') +
    facet_wrap(~sample_set) + 
    theme_classic() + 
    geom_vline(xintercept = depth_cutoff, linetype = 'dashed', color = 'red') +
    theme(axis.text = element_text(size = 13),
        axis.title = element_text(size = 15))
depthGlobal_plot

In [7]:
# Estimate proportion of sites above and below cutoff for both sample sets
total_sites <- sum(depth_df$num_sites, na.rm = TRUE)
depth_df %>% 
    mutate(is_below_cutoff = ifelse(cov <= depth_cutoff, 1, 0)) %>% 
    group_by(sample_set, is_below_cutoff) %>% 
    summarize(n = sum(num_sites, na.rm = TRUE)) %>% 
    mutate(freq = n / sum(n, na.rm = TRUE))

In [9]:
# Save plot
outpath <- snakemake@output[[1]]
print(outpath)
ggsave(filename = outpath, plot = depthGlobal_plot, device = 'pdf', 
       dpi = 300, width = 14, height = 8, units = 'in')

### My take

Regardless of sample set:
- Most sites have 1 to 2X coverage
- As coverage increases beyond 2X, there is a gradual decline in the number of sites for a given depth
- 1250X is a reasonable cutoff and captures ~98% of all sites. 

## SFS and diversity

Here I'll plot the SFS for all sites across the genome, in addition to subsets of degenerate sites (4-fold and 0-fold). We'll similarly estimate diversity as the genome-wide average of pairwise nucleotide differences for these same sets of sites. 

- Filters applied to sites:
    - Minimum phred-scaled read mapping quality of 30
    - Minimum phred-scaled base quality of 20
    - Max depth of 1250X across all individuals
    - 50% of all individuals required to have at least 1 read
    
In addition to the above filters, genotype likelihoods are estimated using the `samtools` model, with the `samtools` "extended BAQ" algorithm used to re-assign base quality scores around indels. 

### SFS

#### Load in SFS data as single dataframe

In [10]:
load_long_sfs <- function(path){
  
    # Get name of folder with parameter combinations
    dir <- str_split(path_dir(path), '/', simplify = TRUE)
    sample_set <- dir[1, 1]
    site <- dir[1, 2]

    # Read in SFS
    full_path <- paste0(inpath, '/', path)
    sfs <- suppressMessages(read_delim(full_path, delim= '\t', col_names = FALSE)) %>% 
    as.data.frame() %>% 
    rename('maf' = 'X1',
           'num_sites' = 'X2') %>% 
    filter(num_sites != 0) %>%  # Folded SFS, so bins with allele counts above the number of samples are 0
    mutate(prop_sites = num_sites / sum(num_sites),
           sample_set = sample_set,
           site = site)
    return(sfs)
}

In [11]:
inpath <- '../results/angsd/sfs/'
sfs_df <- list.files(inpath, pattern = 'allChroms\\.sfs', recursive = TRUE) %>% 
  map_dfr(., load_long_sfs)

In [12]:
# Quick look at the data
head(sfs_df)

#### Plot SFS

- Only plotting minor allele frequency bins 1 to 25. Invariant sites and sites with MAF > 25/N are therefore not shown.
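
One thing to keep in mind when reading the plot: in `load_long_sfs`, `prop_sites` is computed over all non-empty bins, including the invariant class (`maf == 0`), and the invariant class is only dropped at plotting time. The bars therefore don't sum to 1. A toy example with made-up counts (not the real SFS):

```r
# Hypothetical folded SFS: bin 0 holds the invariant sites
toy_sfs <- data.frame(
    maf = 0:4,
    num_sites = c(900, 50, 30, 15, 5)
)

# Proportions are relative to ALL sites, invariant ones included
toy_sfs$prop_sites <- toy_sfs$num_sites / sum(toy_sfs$num_sites)

# Same filter as used for plotting: drop invariant sites, keep bins 1 to 25
plotted <- subset(toy_sfs, maf != 0 & maf <= 25)

# Total height of the plotted bars = fraction of sites that are variable
sum(plotted$prop_sites)
```

In this toy case the plotted bars sum to 0.1, i.e. 10% of sites are variable.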

In [13]:
# How many sites in each category?
sfs_df %>% 
    group_by(sample_set, site) %>% 
    summarize(total_sites = sum(num_sites))

In [27]:
sfs_plot <- sfs_df %>% 
    filter(maf != 0 & maf <= 25)  %>%
    ggplot(., aes(x = maf, y = prop_sites)) + 
    geom_bar(stat ='identity', color = 'black',  width=.70) + 
    facet_grid(sample_set~ site) +
    ylab('Proportion of sites') + xlab('Minor allele frequency') +
    scale_x_continuous(breaks = seq(1, 40, 3)) +
    scale_y_continuous(breaks = seq(0, 0.13, 0.02)) + 
    theme_classic() + 
    theme(axis.text = element_text(size = 13),
        axis.title = element_text(size = 15))
sfs_plot

In [28]:
# Save plot
outpath <- snakemake@output[[2]]
print(outpath)
ggsave(filename = outpath, plot = sfs_plot, device = 'pdf', 
       dpi = 300, width = 12, height = 10, units = 'in')

#### My take

- The allSites SFS look good
- The 4-fold SFS look strange, especially for the *finalSamples_lowCovRemoved* sample set. They're not as "smooth" as I would have expected
    - Is this a real result? Do we expect this when pooling samples from across the world with different demographic histories? 
- 0-fold SFS looks OK, though maybe a little jagged. It's more skewed toward low-frequency variants than the 4-fold SFS, as expected. 
- One quick note: There are about twice as many sites in the *finalSamples* dataset as in the *highErrorRemoved* dataset, despite it having 83 fewer samples. 
    - I think this is because the proportion of individuals required to have reads at a site was fixed at 50%. For the *finalSamples* dataset, this equates to 216 samples, which is easily met in this dataset since bad individuals have already been removed. As a result, more sites pass this filter. 
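
The 50% threshold works out as follows for the two sample sets. Sample counts are taken from the descriptions at the top of the notebook. How the pipeline rounds a fractional threshold is not shown here, so the raw values are left unrounded.

```r
# Number of samples in each sample set (from the sample set descriptions)
n_samples <- c(highErrorRemoved = 515, finalSamples_lowCovRemoved = 432)

# Minimum number of individuals with at least 1 read, at a fixed 50% threshold
min_ind <- 0.5 * n_samples
min_ind
```

So a site needs reads in 216 individuals to pass in *finalSamples*, versus about 257 in *highErrorRemoved*, where the low-coverage samples are still present.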

### Diversity

#### Load in diversity dataframes

In [29]:
load_div_neut_df <- function(path){
  
    # Get name of folder with parameter combinations
    dir <- str_split(path_dir(path), '/', simplify = TRUE)
    sample_set <- dir[1, 1]
    site <- dir[1, 2]

    # Read in stats
    full_path <- paste0(inpath, '/', path)
    stats <- suppressMessages(read_delim(full_path, delim= '\t', col_names = TRUE)) %>% 
    mutate(sample_set = sample_set,
           site = site,
           tp_scaled = tP / nSites,
           tw_scaled = tW / nSites)
    return(stats)
  
}

In [30]:
inpath <- '../results/angsd/summary_stats/thetas/'
thetas_df <- list.files(inpath, pattern = 'diversityNeutrality\\.thetas\\.idx\\.pestPG', recursive = TRUE) %>% 
  map_dfr(., load_div_neut_df)

In [31]:
# tp is theta_pi
# tw is Watterson's theta
# td is Tajima's D
thetas_df %>% 
  group_by(sample_set, site) %>% 
  summarise(total_sites = sum(nSites),
            mean_tp = mean(tp_scaled),
            mean_tw = mean(tw_scaled), 
            mean_td = mean(Tajima))

#### My take

Regardless of sample set:
- Looks good. Lower diversity at 0-fold sites, as expected
- ~1.5% genome-wide diversity
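
A note on the averaging: the summary above takes the unweighted mean of the per-row `tp_scaled` values. Assuming each row of the `pestPG` output is one chromosome, a site-weighted alternative is `sum(tP) / sum(nSites)`. The two agree when per-site diversity is similar across chromosomes but can differ otherwise. A sketch with made-up numbers (the chromosome names and values are hypothetical):

```r
# Hypothetical per-chromosome theta_pi sums and site counts
toy_thetas <- data.frame(
    Chr = c("chrA", "chrB", "chrC"),
    tP = c(150, 40, 10),
    nSites = c(10000, 2000, 1000)
)

# Per-site theta_pi for each chromosome, as in load_div_neut_df
toy_thetas$tp_scaled <- toy_thetas$tP / toy_thetas$nSites

# Unweighted mean across chromosomes (what the summary above does)
mean_unweighted <- mean(toy_thetas$tp_scaled)

# Site-weighted mean: total pairwise differences over total sites
mean_weighted <- sum(toy_thetas$tP) / sum(toy_thetas$nSites)

c(unweighted = mean_unweighted, weighted = mean_weighted)
```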

## PCA

- PCA is performed on genotype likelihoods
- Filtering criteria the same as those above with one exception:
    - This is based only on variant sites with MAF > 0.05

### Load in data and perform PCA

- Perform PCA on both sample sets and merge into single dataframe
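
Since PCAngsd outputs a covariance matrix among individuals, the PCs come from its eigendecomposition, and the proportion of variance explained by each PC can be read directly off the eigenvalues. The sketch below shows this on a small made-up matrix (not real PCAngsd output). It assumes all eigenvalues are non-negative, which may not hold exactly for an estimated covariance matrix.

```r
# Hypothetical 3 x 3 symmetric covariance matrix
toy_cov <- matrix(c(2, 1, 0,
                    1, 2, 0,
                    0, 0, 1), nrow = 3, byrow = TRUE)

# Eigendecomposition: eigenvalues are returned in decreasing order
eig <- eigen(toy_cov, symmetric = TRUE)

# Proportion of variance explained by each PC
prop_var <- eig$values / sum(eig$values)
prop_var

# Cumulative proportion
cumsum(prop_var)
```

Here the eigenvalues are 3, 1 and 1, so the first PC explains 60% of the variance.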

In [42]:
# Function to get PCA summary as matrix
pca_importance <- function(x) {
  vars <- x$sdev^2
  vars <- vars/sum(vars)
  rbind(`Standard deviation` = x$sdev, `Proportion of Variance` = vars, 
      `Cumulative Proportion` = cumsum(vars))
}

In [32]:
# Load data with habitat info
habitat_info <- suppressMessages(
    read_delim(
        '../../sequencing-prep/resources/low1_sampleSheet.txt', 
                           delim = '\t')) %>% 
    dplyr::select(continent, range, city, pop, individual, site, sample)

#### highErrorRemoved

In [34]:
# Load covariance matrix from PCAngsd
cov_mat_highErrorRemoved <- suppressMessages(
    read_delim(
        '../results/population_structure/pcangsd/highErrorRemoved_4fold_maf0.05_pcangsd.cov', 
                      col_names = FALSE, delim = ' ')) %>% 
      as.matrix()

# Combine continent and habitat data with sample order from ANGSD
samples_highErrorRemoved <- suppressMessages(
    read_table(
        '../results/program_resources/angsd_highErrorRemoved_order.txt', col_names = FALSE) %>% 
  rename('sample' = 'X1')) %>%
  left_join(., habitat_info, by = 'sample')

In [55]:
pca_importance(summary(princomp(cov_mat_highErrorRemoved))) %>% 
    as.data.frame() %>% 
    dplyr::select(Comp.1:Comp.4)

In [58]:
# Dataframe with eigenvectors
eigenvectors_highErrorRemoved <- eigen(cov_mat_highErrorRemoved)
eigen_df_highErrorRemoved <- eigenvectors_highErrorRemoved$vectors %>% 
    as.data.frame() %>% 
    dplyr::select(V1, V2, V3, V4) %>% 
    rename('PC1' = 'V1',
         'PC2' = 'V2', 
         'PC3' = 'V3',
         'PC4' = 'V4') %>% 
    bind_cols(., samples_highErrorRemoved) %>% 
    mutate(sample_set = 'highErrorRemoved')

#### finalSamples_lowCovRemoved

In [60]:
# Load covariance matrix from PCAngsd
cov_mat_finalSamples <- suppressMessages(
    read_delim(
        '../results/population_structure/pcangsd/finalSamples_lowCovRemoved_4fold_maf0.05_pcangsd.cov', 
                      col_names = FALSE, delim = ' ')) %>% 
      as.matrix()

# Combine continent and habitat data with sample order from ANGSD
samples_finalSamples <- suppressMessages(
    read_table(
        '../results/program_resources/angsd_finalSamples_lowCovRemoved_order.txt', col_names = FALSE) %>% 
  rename('sample' = 'X1')) %>%
  left_join(., habitat_info, by = 'sample')

In [62]:
pca_importance(summary(princomp(cov_mat_finalSamples))) %>% 
    as.data.frame() %>% 
    dplyr::select(Comp.1:Comp.4)

In [61]:
# Dataframe with eigenvectors
eigenvectors_finalSamples <- eigen(cov_mat_finalSamples)
eigen_df_finalSamples <- eigenvectors_finalSamples$vectors %>% 
    as.data.frame() %>% 
    dplyr::select(V1, V2, V3, V4) %>% 
    rename('PC1' = 'V1',
         'PC2' = 'V2', 
         'PC3' = 'V3',
         'PC4' = 'V4') %>% 
    bind_cols(., samples_finalSamples) %>% 
    mutate(sample_set = 'finalSamples_lowCovRemoved')

In [64]:
# Join dataframes
eigen_df <- bind_rows(eigen_df_highErrorRemoved, eigen_df_finalSamples)
head(eigen_df)

### Plot the PCA

#### Colored by habitat

In [78]:
# Plot
col1 <- wes_palette("Darjeeling1", n = 5, type = 'discrete')[2]
col2 <- wes_palette("Darjeeling1", n = 5, type = 'discrete')[4]
cols <- c(col1, col2)
pca_byHabitat <- ggplot(eigen_df, aes(x = PC1, y = PC2, color = site, shape = site)) +
    geom_point(size = 2) + 
    scale_color_manual(values = cols) + 
    facet_wrap(~sample_set) +
    theme_classic() + 
    xlab('PC1') + ylab('PC2') +
    theme(axis.text = element_text(size = 13),
         axis.title = element_text(size = 15),
         legend.position = 'top')
pca_byHabitat

In [79]:
# Save plot
outpath <- snakemake@output[[3]]
print(outpath)
ggsave(filename = outpath, plot = pca_byHabitat, device = 'pdf', 
       dpi = 300, width = 14, height = 8, units = 'in')

#### Colored by city

In [83]:
# Plot
cols <- wes_palette("Darjeeling1", n = 26, type = 'continuous')
pca_byCity <- ggplot(eigen_df, aes(x = PC1, y = PC2, color = city, shape = site)) +
    geom_point(size = 2) + 
    scale_color_manual(values = cols) + 
    facet_wrap(~sample_set) +
    theme_classic() + 
    xlab('PC1') + ylab('PC2') +
    theme(axis.text = element_text(size = 13),
        axis.title = element_text(size = 15),
        legend.position = 'top')
pca_byCity

In [84]:
# Save plot
outpath <- snakemake@output[[4]]
print(outpath)
ggsave(filename = outpath, plot = pca_byCity, device = 'pdf', 
       dpi = 300, width = 14, height = 8, units = 'in')

#### Colored by continent

In [87]:
# Plot
cols <- wes_palette("Darjeeling1", n = 6, type = 'continuous')
pca_byContinent <- ggplot(eigen_df, aes(x = PC1, y = PC2, color = continent, shape = site)) +
    geom_point(size = 2) + 
    scale_color_manual(values = cols) + 
    facet_wrap(~sample_set) +
    theme_classic() + 
    xlab('PC1') + ylab('PC2') +
    theme(axis.text = element_text(size = 13),
        axis.title = element_text(size = 15),
        legend.position = 'top')
pca_byContinent

In [88]:
# Save plot
outpath <- snakemake@output[[5]]
print(outpath)
ggsave(filename = outpath, plot = pca_byContinent, device = 'pdf', 
       dpi = 300, width = 14, height = 8, units = 'in')

#### Colored by range

In [89]:
# Plot
col1 <- wes_palette("Darjeeling1", n = 5, type = 'discrete')[2]
col2 <- wes_palette("Darjeeling1", n = 5, type = 'discrete')[4]
cols <- c(col1, col2)
pca_byRange <- ggplot(eigen_df, aes(x = PC1, y = PC2, color = range, shape = site)) +
    geom_point(size = 2) + 
    scale_color_manual(values = cols) + 
    facet_wrap(~sample_set) +
    theme_classic() + 
    xlab('PC1') + ylab('PC2') +
    theme(axis.text = element_text(size = 13),
        axis.title = element_text(size = 15),
        legend.position = 'top')
pca_byRange

In [90]:
# Save plot
outpath <- snakemake@output[[6]]
print(outpath)
ggsave(filename = outpath, plot = pca_byRange, device = 'pdf', 
       dpi = 300, width = 14, height = 8, units = 'in')

#### My take

- PCAs look essentially the same, regardless of sample set
    - Suggests we could use all samples for this analysis (N = 515).
- PC2 seems to largely separate Introduced vs. Native range
- Unclear what PC1 corresponds to (maybe related to something like Lat/Long?)
- Urban/rural sites seem to be largely overlapping within cities with some exceptions (e.g., Thessaloniki). 
