In [2]:
library(tidyverse)
library(car)
library(emmeans)

# Pairwise urban-rural pi and Fst within cities

In this notebook, we'll examine urban-rural differences in diversity (theta_pi) and urban-rural Fst across all 26 cities. 

## Description of the data

- Low coverage individuals have been removed from these analyses, so we're using the samples that are part of the *finalSamples_lowCovRemoved* sample set from the previous analyses. 
- All analyses were performed using genome-wide 4-fold degenerate sites.

The basic workflow is as follows:

1. Generate the Site Allele Frequency (SAF) likelihood distribution for each habitat within each city. 

    - The same filters as in the previous analyses were applied, the most important being that a site is only included if 50% of individuals have data. Remember, this is 50% of individuals _within_ a habitat, which corresponds to 5 individuals if none have been removed due to low coverage. 

2. To estimate diversity, generate the folded, one-dimensional SFS from the SAF file in step 1 and estimate diversity separately in urban and rural habitats.
3. To estimate Fst, generate the folded, two-dimensional joint SFS of urban-rural habitats and estimate Fst. This uses only sites that are shared between both populations (i.e., the intersection of the two SAF files). 

    - For comparison, I estimated Fst using both Weir and Cockerham (1984) and Hudson (1992).
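
As a rough illustration of step 2, the sketch below computes pairwise diversity and Watterson's theta from a toy folded SFS using the textbook estimators. The counts are made up, and ANGSD works from per-site posterior expectations rather than hard counts, so this only shows what the two statistics summarise, not how the pipeline computes them.

```r
# Toy folded SFS for n = 4 chromosomes (2 diploid individuals).
# Entry k + 1 holds the number of sites with minor allele count k (k = 0, 1, 2).
# All values are hypothetical.
n_chrom <- 4
folded_sfs <- c(90, 6, 4)
total_sites <- sum(folded_sfs)

minor_count <- seq_along(folded_sfs) - 1   # 0, 1, 2

# Pairwise diversity: a site with minor count k contributes k(n - k) / choose(n, 2)
theta_pi <- sum(minor_count * (n_chrom - minor_count) / choose(n_chrom, 2) * folded_sfs)

# Watterson's theta: number of segregating sites divided by the harmonic number a_n
seg_sites <- sum(folded_sfs[minor_count > 0])
a_n <- sum(1 / seq_len(n_chrom - 1))
theta_w <- seg_sites / a_n

# Per-site estimates, analogous to tP / nSites and tW / nSites used later on
theta_pi / total_sites
theta_w / total_sites
```

Because k(n - k) is symmetric around n / 2, the folded and unfolded spectra give the same pairwise diversity, which is why folding does not matter for this statistic.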

## Pairwise diversity

### Load diversity data

In [3]:
# Function to load diversity estimates by city and habitat
load_pairwise_diversity <- function(path){
    
    # Get city and site names from file
    city <- dirname(path)
    site <- str_extract(basename(path), pattern = '(?<=_)[ru]')
    
    full_path <- paste0(inpath, path)
    df <- suppressMessages(read_delim(full_path, delim = '\t')) %>% 
        mutate(tp_scaled = tP / nSites,
               tw_scaled = tW / nSites,
               city = city,
               habitat = site) %>% 
        dplyr::select(city, habitat, tp_scaled, tw_scaled, nSites) %>% 
        group_by(city, habitat) %>% 
        
        # Mean across chromosomes
        summarise(tp_scaled = mean(tp_scaled),
                  tw_scaled = mean(tw_scaled),
                  total_sites = sum(nSites),
                  .groups = 'drop')
    return(df)
    
}
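
The function above takes the unweighted mean of per-chromosome, per-site diversity. The sketch below uses a made-up three-chromosome table (`toy` is hypothetical) to show how that can differ from a site-weighted, genome-wide estimate when chromosomes contribute very different numbers of sites. It is meant as a sanity check, not a claim about which summary is preferable here.

```r
# Hypothetical per-chromosome output with the same columns used above
toy <- data.frame(
    chrom  = c('chr1', 'chr2', 'chr3'),
    tP     = c(10, 20, 3),
    nSites = c(1000, 1000, 100)
)

# Unweighted mean of per-chromosome estimates (what the function computes)
mean_of_ratios <- mean(toy$tP / toy$nSites)

# Site-weighted alternative: total pairwise differences over total sites
ratio_of_sums <- sum(toy$tP) / sum(toy$nSites)

mean_of_ratios   # 0.02
ratio_of_sums    # about 0.0157
```

The small third chromosome has the highest per-site diversity, so it pulls the unweighted mean up more than the weighted one.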

In [4]:
# Load per-city, per-habitat diversity estimates into a single dataframe
inpath <- '../results/angsd/summary_stats/thetas/by_city/'
# `pattern` is a regular expression, not a glob
div_df <- list.files(inpath, pattern = '\\.pestPG$', recursive = TRUE) %>% 
    map_dfr(., load_pairwise_diversity) 

In [5]:
head(div_df)

In [6]:
print(mean(div_df$total_sites))
print(range(div_df$total_sites))

In [7]:
div_df %>% filter(total_sites == min(total_sites) | total_sites == max(total_sites))

### Plot differences in diversity

In [8]:
div_df_wide <- div_df %>% 
    # Calculate urban-rural difference in theta
    pivot_wider(names_from = habitat, values_from = c(tp_scaled, tw_scaled, total_sites)) %>% 
    mutate(tp_diff = tp_scaled_u - tp_scaled_r)

In [9]:
# Histogram of urban-rural differences in diversity
div_diff_hist <- div_df_wide %>% 
    ggplot(., aes(x = tp_diff)) +
        geom_histogram(color = 'black', fill = 'white', bins = 13) +
        ylab("Number of cities") + xlab("Urban - rural difference in diversity") +
        theme_classic()
div_diff_hist

## Fst

### Load Fst dataframe

In [10]:
# Function to load Fst df by city/habitat
load_fst <- function(path){
    
    # Get Fst type and city from filenames
    dir <- str_split(path, pattern = '/')
    fst_type <- dir[[1]][[1]]
    city <- dir[[1]][[2]]    
    
    full_path <- paste0(inpath, path)
    colnames <- c('chrom', 'pos', 'num', 'denom')
    df <- suppressMessages(read_delim(full_path, delim = '\t', col_names = colnames)) %>% 
        
        # Set negative numerators to 0
        # https://github.com/ANGSD/angsd/issues/309
        # Does not affect overall pattern
        mutate(num = ifelse(num < 0, 0, num)) %>% 
        
        # Estimate weighted Fst as ratio of averages
        # https://github.com/ANGSD/angsd/issues/61
        summarise(num_sum = sum(num),
                  denom_sum = sum(denom),
                  fst = num_sum / denom_sum,
                  nSites = n()) %>% 
        mutate(fst_type = ifelse(fst_type == 'fst0', 'wc', 'hudson'),
               city = city)
    
    return(df)
    
}
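
To make the "ratio of averages" choice concrete, here is a minimal sketch with four made-up sites. The `num` and `denom` values are hypothetical and only illustrate how the weighted estimate differs from the mean of per-site ratios.

```r
# Hypothetical per-site Fst numerators and denominators
num   <- c(0.010, 0.000, 0.050, -0.002)
denom <- c(0.200, 0.100, 0.250, 0.050)

# Set negative numerators to 0, as in the function above
num_floored <- pmax(num, 0)

# Weighted Fst: ratio of sums (equivalently, ratio of averages)
fst_ratio_of_averages <- sum(num_floored) / sum(denom)

# Unweighted Fst: average of per-site ratios
fst_average_of_ratios <- mean(num_floored / denom)

fst_ratio_of_averages   # 0.1
fst_average_of_ratios   # 0.0625
```

Sites with small denominators (low heterozygosity) dominate the average of ratios, which is one reason the ratio of averages is usually preferred for genome-wide estimates.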

In [11]:
# Function to get number of samples by city/habitat
get_num_samples <- function(path){
    
    city <- dirname(path)
    habitat <- str_extract(basename(path), pattern = '(?<=_)[ru]')  
    
    full_path <- paste0(inpath, path)
    df <- suppressMessages(read_table(full_path, col_names = FALSE)) 
    nsamples <- nrow(df)
    
    df <- data.frame(city = city, habitat = habitat, n = nsamples)
    return(df)
    
}

In [12]:
# Merged df with sample size info
inpath <- '../results/program_resources/bam_lists/by_city/'
sample_size_df <- list.files(inpath, pattern = '\\.list$', recursive = TRUE) %>% 
    map_dfr(., get_num_samples) %>% 
    pivot_wider(names_from = habitat, values_from = n)
head(sample_size_df)

In [13]:
# Merge Fst dataframes
inpath <- '../results/angsd/summary_stats/fst/'
fst_df <- list.files(inpath, pattern = '_readable\\.fst$', recursive = TRUE) %>% 
    map_dfr(., load_fst)

In [14]:
fst_df_withSampleSize <- fst_df %>% 
    left_join(., sample_size_df, by = 'city') %>% 
    mutate(nmax = pmax(r, u),
           nmin = pmin(r, u),
           ndiff = nmax - nmin) %>% 
    rowwise() %>% 
    mutate(nmean = mean(c(r, u)))

In [15]:
head(fst_df_withSampleSize)

In [16]:
print(mean(fst_df_withSampleSize$nSites))

In [17]:
fst_df_withSampleSize %>% ungroup() %>% filter(nSites == min(nSites) | nSites == max(nSites))

### WC vs. Hudson's Fst

In [18]:
wc <- fst_df_withSampleSize %>% filter(fst_type == 'wc') %>% pull(fst)
hudson <- fst_df_withSampleSize %>% filter(fst_type == 'hudson') %>% pull(fst)

In [19]:
wc_vs_hudson <- ggplot() + 
    geom_point(aes(x = wc, y = hudson), size = 2, alpha = 0.5) +
    geom_abline(slope = 1, intercept = 0) +
    xlab('Weir and Cockerham Fst') + ylab('Hudson Fst') +
    theme_classic()
wc_vs_hudson

In [20]:
outpath <- snakemake@output[[1]]
print(outpath)
ggsave(filename = outpath, plot = wc_vs_hudson, device = 'pdf', width = 8, height = 9, units = 'in', dpi = 300)

### Dependence of Fst on sample size

In [21]:
fst_by_ss <- ggplot(fst_df_withSampleSize, aes(x = nmin, y = fst, color = fst_type)) + 
    geom_point(size = 2, alpha = 0.5) +
    geom_smooth(method = 'loess', se = FALSE) + 
    xlab('Minimum sample size') + ylab('Fst') +
    theme_classic()
fst_by_ss

In [22]:
outpath <- snakemake@output[[2]]
print(outpath)
ggsave(filename = outpath, plot = fst_by_ss, device = 'pdf', width = 8, height = 9, units = 'in', dpi = 300)

#### My take

- WC Fst is generally higher than Hudson's Fst, as expected based on Bhatia et al. (2013)
- Highest Fst estimates occur for cities where sample sizes are lowest, suggesting these estimates may be biased upward
    - Downsampling Toronto data will help us resolve this
- That being said, Fst is generally low in cities where the estimates seem reliable, often Fst < 0.05
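
One way the downsampling could be set up is sketched below. The BAM names, the starting sample size of 20, the target sizes, and the number of replicates are all hypothetical. Each downsampled list would then have to be run through the same ANGSD steps as the full data.

```r
set.seed(42)

# Hypothetical BAM list for one habitat; real paths would come from the bam lists
bams <- paste0('Toronto_u_', 1:20, '.bam')

# Hypothetical target sample sizes and number of replicate draws per size
target_sizes <- c(5, 10, 15)
n_reps <- 3

# For each target size, draw n_reps random subsets without replacement
downsampled <- lapply(target_sizes, function(n) {
    replicate(n_reps, sample(bams, size = n, replace = FALSE), simplify = FALSE)
})
names(downsampled) <- paste0('n', target_sizes)

# Each element is a list of n_reps character vectors of the requested size
lengths(downsampled$n5)
```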

## Euclidean distance from PCA

- Estimate Euclidean distance between urban and rural centroids in PC1-PC2 space, by city

In [23]:
euclidean <- function(x1, y1, x2, y2){
    
    dist <- sqrt((x1 - x2)^2 + (y1 - y2)^2)
    return(dist)
}
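
A quick check of the helper on values with a known answer (the function is repeated here so the example runs on its own):

```r
euclidean <- function(x1, y1, x2, y2){
    sqrt((x1 - x2)^2 + (y1 - y2)^2)
}

# 3-4-5 right triangle, so the distance is 5
d_single <- euclidean(0, 0, 3, 4)

# The function is vectorised, so it can be used inside mutate():
# first pair gives 5, second pair is the same point and gives 0
d_vector <- euclidean(c(0, 1), c(0, 1), c(3, 1), c(4, 1))
```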

In [24]:
# Load data with habitat info
habitat_info <- suppressMessages(
    read_delim(
        '../../sequencing-prep/resources/low1_sampleSheet.txt', 
                           delim = '\t')) %>% 
    dplyr::select(continent, range, city, pop, individual, site, sample)

In [25]:
# Load covariance matrix from PCAngsd
cov_mat <- suppressMessages(
    read_delim(
        '../results/population_structure/pcangsd/highErrorRemoved_4fold_maf0.05_pcangsd.cov', 
                      col_names = FALSE, delim = ' ')) %>% 
      as.matrix()

# Combine continent and habitat data with sample order from ANGSD
samples <- suppressMessages(
    read_table(
        '../results/program_resources/angsd_highErrorRemoved_order.txt', col_names = FALSE) %>% 
  rename('sample' = 'X1')) %>%
  left_join(., habitat_info, by = 'sample')

In [27]:
# Dataframe with eigenvectors
eigenvectors <- eigen(cov_mat)
eigen_df <- eigenvectors$vectors %>% 
    as.data.frame() %>% 
    dplyr::select(V1, V2) %>% 
    rename('PC1' = 'V1',
         'PC2' = 'V2') %>% 
    bind_cols(., samples) %>% 
    mutate(sample_set = 'highErrorRemoved')
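
Since the centroid distances below use only PC1 and PC2, it can be useful to know how much variance those two axes capture. The sketch below shows the calculation on a toy 2 x 2 covariance matrix (`toy_cov` is made up). The same expression could be applied to the eigenvalues of the real covariance matrix, with the caveat that an estimated covariance matrix is not guaranteed to have only positive eigenvalues.

```r
# Toy symmetric covariance matrix with eigenvalues 3 and 1
toy_cov <- matrix(c(2, 1,
                    1, 2), nrow = 2, byrow = TRUE)

toy_eigen <- eigen(toy_cov, symmetric = TRUE)

# Proportion of variance explained by each principal component
var_explained <- toy_eigen$values / sum(toy_eigen$values)
var_explained   # 0.75 0.25
```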

In [28]:
euc_dist_df <- eigen_df %>% 
    group_by(city, site) %>% 
    summarise(x = mean(PC1),
              y = mean(PC2)) %>% 
    pivot_wider(names_from = site, values_from = c(x, y)) %>% 
    mutate(distance = euclidean(x_u, y_u, x_r, y_r)) %>% 
    dplyr::select(city, distance)

In [29]:
head(euc_dist_df)

## Fst vs. Euclidean distance

In [30]:
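# Join by city so each Fst is paired with the matching distance
fst_dist <- fst_df %>% filter(fst_type == 'hudson') %>% inner_join(euc_dist_df, by = 'city')
fst <- fst_dist$fst
dist <- fst_dist$distance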

In [31]:
fst_by_eucl <- ggplot() + 
    geom_point(aes(x = fst, y = dist), size = 2, alpha = 0.5) +
#     geom_abline(slope = 1, intercept = 0) +
    xlab("Hudson's Fst") + ylab('Euclidean distance') +
    theme_classic()
fst_by_eucl

In [32]:
outpath <- snakemake@output[[3]]
print(outpath)
ggsave(filename = outpath, plot = fst_by_eucl, device = 'pdf', width = 8, height = 9, units = 'in', dpi = 300)

In [33]:
# Correlation using all points
cor(fst, dist, method = 'pearson')

In [34]:
fst_highDrop <- fst[fst<0.1]
dist_highDrop <- dist[fst<0.1]

In [35]:
fst_by_eucl_highDrop <- ggplot() + 
    geom_point(aes(x = fst_highDrop, y = dist_highDrop), size = 2, alpha = 0.5) +
#     geom_abline(slope = 1, intercept = 0) +
    xlab("Hudson's Fst") + ylab('Euclidean distance') +
    theme_classic()
fst_by_eucl_highDrop

In [36]:
# Correlation when large Fst outliers are removed
cor.test(fst_highDrop, dist_highDrop)

## Models

In [37]:
# Get dataframe with slopes and significance of clines
slopes <- suppressMessages(read_csv('../../phenotypic-analyses/analysis/supplementary-tables/allCities_HCNslopes_enviroMeansSlopes.csv')) %>% 
    dplyr::select(city, betaRLM_freqHCN, sigRLM)

In [38]:
df_allStats <- euc_dist_df %>% 
    left_join(., fst_df %>% filter(fst_type == 'hudson'), by = 'city') %>% 
    left_join(., div_df_wide, by = 'city') %>% 
    left_join(., slopes, by = 'city')

### Does pi differ by habitat or city?

In [40]:
pi_mod <- aov(tp_scaled ~ city + habitat, data = div_df)
summary(pi_mod)
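
With one urban and one rural estimate per city, a model with `city` as a blocking factor tests the habitat effect in the same way as a paired t-test across cities (the habitat F statistic equals the squared paired t statistic). The sketch below checks this on simulated data. The `toy` dataframe and all effect sizes are made up.

```r
set.seed(1)

# Simulate 6 hypothetical cities, each with one rural and one urban estimate
n_city <- 6
city_effect <- rnorm(n_city, mean = 0.02, sd = 0.002)
toy <- data.frame(
    city    = rep(paste0('city', seq_len(n_city)), each = 2),
    habitat = rep(c('r', 'u'), times = n_city)
)
toy$tp_scaled <- rep(city_effect, each = 2) +
    ifelse(toy$habitat == 'u', -0.0005, 0) +
    rnorm(nrow(toy), sd = 0.0003)

# F statistic for habitat from the blocked model
f_habitat <- anova(lm(tp_scaled ~ city + habitat, data = toy))['habitat', 'F value']

# Paired t-test on the same data (rows are ordered by city within each habitat)
rural <- toy$tp_scaled[toy$habitat == 'r']
urban <- toy$tp_scaled[toy$habitat == 'u']
t_paired <- unname(t.test(urban, rural, paired = TRUE)$statistic)

# The two tests agree: F equals t squared
all.equal(f_habitat, t_paired^2)
```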

In [41]:
# Least-squares means of pi in each habitat
emmeans(pi_mod, specs = 'habitat')

In [42]:
# Standard errors from data instead of model
div_df %>% 
    group_by(habitat) %>% 
    summarise(mean = round(mean(tp_scaled), 4),
              n = n(),
              se = round(sd(tp_scaled) / sqrt(n), 6))

### Pi by habitat and clines (sig vs. ns)

In [43]:
div_df_mod <- div_df %>% 
    left_join(., slopes, by = 'city')

In [45]:
# Model to get least-squares means
pi_mod_sig <- aov(tp_scaled ~ habitat + sigRLM, data = div_df_mod)
summary(pi_mod_sig)

In [50]:
# Get least-squares means
emmeans(pi_mod_sig, specs = 'habitat', by = 'sigRLM')

In [51]:
# Standard errors from data instead of model
div_df_mod %>% 
    group_by(habitat, sigRLM) %>% 
    summarise(mean = round(mean(tp_scaled), 4),
              n = n(),
              se = round(sd(tp_scaled) / sqrt(n), 6))

### Does the strength of clines predict mean diversity?

- The model above seems to suggest that diversity is higher in cities with clines.
- Is this a real result?
- Does mean diversity across cities vary with the strength of clines?

In [52]:
div_df_mean <- div_df %>% 
    group_by(city) %>% 
    summarise(tp_scaled = mean(tp_scaled)) %>% 
    left_join(., slopes, by = 'city')

In [55]:
summary(lm(tp_scaled ~ betaRLM_freqHCN, data = div_df_mean))

### Does difference in neutral diversity predict HCN clines?

In [59]:
df_allStats$betaRLM_freqHCN

In [56]:
div_mod <- aov(betaRLM_freqHCN ~ tp_diff, data = df_allStats)
summary(div_mod)

### Does difference in pi differ between cities with and without clines?

In [57]:
tpDiff_by_sig_mod <- aov(tp_diff ~ sigRLM, data = df_allStats)
summary(tpDiff_by_sig_mod)

In [58]:
emmeans(tpDiff_by_sig_mod, specs = 'sigRLM')

### Does Fst predict HCN?

In [65]:
# Does Fst predict HCN?
fst_mod <- aov(betaRLM_freqHCN ~ fst, data = df_allStats)
summary(fst_mod)

### Does Fst differ between cities with and without clines?

In [61]:
df_allStats %>% ungroup() %>% summarise(meanFst = mean(fst), n = n(), se = sd(fst) / sqrt(n))

In [62]:
fst_mod <- aov(fst ~ sigRLM, data = df_allStats)
summary(fst_mod)

In [64]:
emmeans(fst_mod, specs = 'sigRLM')

In [102]:
# Standard errors from data instead of model
df_allStats %>% 
    group_by(sigRLM) %>% 
    summarise(mean = round(mean(fst), 4),
              n = n(),
              se = round(sd(fst) / sqrt(n), 4))

### Does Euclidean distance predict HCN?

In [66]:
dist_mod <- aov(betaRLM_freqHCN ~ distance, data = df_allStats)
summary(dist_mod)

## Important considerations

1. Some estimates of pairwise diversity and Fst come from urban/rural populations with different sample sizes. It seems common to downsample populations to the lowest sample size before estimating Fst. I haven't seen a detailed examination of the bias in Fst that results from unequal sample sizes, but downsampling is common enough that I wonder if we should consider it. However, variation in sample size seems to be a bigger problem for WC Fst than for Hudson's (Bhatia et al. 2013), so maybe we don't need to worry about it. Not really sure, but something to think about. 
2. We still need to downsample the Toronto data to get a sense of whether our estimates of diversity and Fst are biased at lower sample sizes, and by how much. 
3. Estimates of differences in pi, Fst, and Euclidean distance seem pretty small on average. If we wanted to get a sense of whether these estimates differ significantly from zero, I think a resampling approach would be most appropriate. We could generate 100 bootstrapped replicates of the one-dimensional (or 2D) SFS, recalculate theta (or Fst) for each replicate, and generate confidence intervals around our estimates to see if they overlap zero. 
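
The confidence-interval idea in point 3 can be sketched in R on the per-site Fst components. Note that this swaps in a simpler technique than the one proposed: it resamples sites with replacement rather than bootstrapping the SFS, and it ignores linkage between sites (a block bootstrap would be more defensible). The `num` and `denom` values are simulated, not real data.

```r
set.seed(123)

# Simulated per-site Fst components for a hypothetical city
n_sites <- 500
denom <- runif(n_sites, min = 0.05, max = 0.5)
num <- pmax(denom * rnorm(n_sites, mean = 0.02, sd = 0.05), 0)

# Point estimate: ratio of averages, as in load_fst()
fst_hat <- sum(num) / sum(denom)

# Resample sites with replacement and recompute Fst for each replicate
n_boot <- 100
boot_fst <- replicate(n_boot, {
    idx <- sample.int(n_sites, size = n_sites, replace = TRUE)
    sum(num[idx]) / sum(denom[idx])
})

# Percentile 95% confidence interval
ci <- quantile(boot_fst, probs = c(0.025, 0.975))
fst_hat
ci
```

Because negative numerators are set to 0, every replicate is non-negative, so an interval built this way can never overlap zero from below. That is worth keeping in mind before using it as a test against zero.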