In [2]:
# Load required packages 
library(tidyverse)

# Description of the data

The goal of this notebook is to assess potential biases in estimating the urban-rural difference in pi
and urban-rural Fst for cities with low sample sizes in at least one of the habitats. As such, I'm
testing not only cases where both habitats have few individuals, but also biases associated with urban-rural variation
in sample sizes.

To do this, I'm using high coverage data from Toronto as a way of "simulating" various sampling strategies. To start, I selected 7 individuals from a single urban Toronto population, and 7 from a single rural population. These individuals had mean coverage around 12X. I then estimated urban and rural pi in these populations and urban-rural Fst using these high coverage samples. This is the "control".

I then downsampled the urban and rural samples to approximately 1X to simulate low coverage data and re-estimated pi and Fst under varying sampling schemes for individuals within habitats. I simulated cases where each habitat was represented either by 1, 3, 5, or 7 individuals and these were fully crossed for a total of 16 comparisons. 
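
As a quick check on the design, the fully crossed sampling schemes can be enumerated as below. This is only an illustrative sketch: `sample_sizes`, `design`, and `unequal` are names introduced here and are not objects from the actual pipeline.

```r
# Hypothetical enumeration of the sampling schemes described above
sample_sizes <- c(1, 3, 5, 7)

# Every urban sample size crossed with every rural sample size
design <- expand.grid(urban_n = sample_sizes, rural_n = sample_sizes)
nrow(design)

# Subset of schemes where the two habitats differ in sample size
unequal <- design[design$urban_n != design$rural_n, ]
nrow(unequal)
```

Four sample sizes per habitat give 4 x 4 = 16 schemes, of which 4 have equal sample sizes and 12 do not.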

In all cases, individuals were randomly selected. Ideally, we would do this multiple times to get a sense of how selection of individuals impacts our estimates but I didn't bother with this. We can add this in if we think it's important. 
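
If we do decide to replicate the random selection, something along these lines could generate the repeated draws. It is a minimal sketch under stated assumptions: the sample IDs are placeholders, and `draw_replicates` is a hypothetical helper that is not part of the existing workflow.

```r
# Placeholder sample IDs; the real ones would come from the Toronto BAM lists
urban_ids <- paste0("urban_", 1:7)

# Draw `n_ind` individuals without replacement, repeated `n_reps` times
draw_replicates <- function(ids, n_ind, n_reps) {
    lapply(seq_len(n_reps), function(i) sort(sample(ids, size = n_ind)))
}

set.seed(42)  # Fixed seed so the draws are reproducible
reps <- draw_replicates(urban_ids, n_ind = 3, n_reps = 10)
length(reps)
```

Each element of `reps` would then define one replicate sample set for re-estimating pi and Fst, and the spread of estimates across replicates would show how much the choice of individuals matters.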

Finally, I used only 4fold sites along chromosome 1.

# Diversity

- Estimate the rural-urban difference in thetas for different sample sizes
- Estimate bias as the difference between the estimated difference in theta and the "true" difference in theta (i.e., estimated from high coverage samples)
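
To make the sign convention concrete, here is a toy calculation. The theta values are made up purely for illustration and are not results; the bias is taken as the high-coverage difference minus the estimated difference, matching the cells below.

```r
# Made-up per-site theta values, for illustration only (not real estimates)
theta <- data.frame(
    group = c("highCov", "lowCov"),
    rural = c(0.020, 0.018),
    urban = c(0.015, 0.016)
)

# Rural-urban difference within each group
theta$diff <- theta$rural - theta$urban

# Bias: "true" (high coverage) difference minus the estimated difference
theta$bias <- theta$diff[theta$group == "highCov"] - theta$diff
theta
```

Under this convention, a positive bias means the low-coverage estimate understates the rural-urban difference relative to the control.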

In [3]:
load_div_neut_df <- function(path){
  
    # Get name of folder with parameter combinations
    base <- basename(path)
    split <- str_split(base, pattern = '_')[[1]]
    group <- split[1]
    habitat <- split[2]
    num_ind <- as.integer(str_sub(split[3], start = 1, end = 1))

    # Read in stats
    full_path <- paste0(inpath, '/', path)
    stats <- suppressMessages(read_delim(full_path, delim= '\t', col_names = TRUE)) %>% 
    mutate(group = group,
           habitat = habitat,
           num_ind = num_ind,
           tp_scaled = tP / nSites,
           tw_scaled = tW / nSites)
    return(stats)
  
}

In [4]:
inpath <- '../results/angsd/pi_fst_sample_size_test/stats/thetas/'
thetas_df <- list.files(inpath, pattern = '\\.thetas\\.idx\\.pestPG$', recursive = TRUE) %>% 
  map_dfr(., load_div_neut_df)

## Same sample sizes

In [5]:
# Show raw values
thetas_df_mod <- thetas_df %>% 
    dplyr::select(group, num_ind, habitat, tp_scaled, tw_scaled, nSites) %>% 
    pivot_wider(names_from = habitat, values_from = c(tp_scaled, tw_scaled, nSites))
thetas_df_mod

In [6]:
# Estimate bias
thetas_df_bias <- thetas_df_mod %>% 
    mutate(tp_diff = round(tp_scaled_rural - tp_scaled_urban, 4),
           tw_diff = round(tw_scaled_rural - tw_scaled_urban, 4)) %>% 
    dplyr::select(group, num_ind, tp_diff, tw_diff) %>% 
    mutate(tp_bias = tp_diff[group == 'highCov'] - tp_diff,
           tw_bias = tw_diff[group == 'highCov'] - tw_diff)
thetas_df_bias

## Different sample sizes

In [7]:
thetas_df_mod2 <- thetas_df %>% 
    dplyr::select(group, num_ind, habitat, tp_scaled)  %>% 
    left_join(thetas_df %>% dplyr::select(group, num_ind, habitat, tp_scaled), suffix = c("1", "2"), 
              by = c("group", "habitat")) %>% 
    filter(group == 'highCov' | num_ind1 != num_ind2) %>% 
    pivot_wider(names_from = habitat, values_from = c(tp_scaled1, tp_scaled2)) %>% 
    mutate(tp_diff = round(tp_scaled1_rural - tp_scaled2_urban, 4),
           tp_bias = tp_diff[group == 'highCov'] - tp_diff) %>% 
    dplyr::select(group, num_ind1, num_ind2, tp_diff, tp_bias)
thetas_df_mod2

### My take

- Bias in difference in theta estimates is only apparent at the lowest sample size (N = 1).
- This is true as long as __one__ of the populations has N = 1
- This is true for both pi and Watterson's theta.

# Fst

- Uses Hudson's Fst
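
The loader below caps negative per-site numerators at 0 and then computes a weighted Fst as a ratio of sums rather than a mean of per-site ratios. The toy example here sketches the difference between the two; the per-site values are invented for illustration and are not ANGSD output.

```r
# Invented per-site Fst numerators and denominators (illustration only)
sites <- data.frame(
    num   = c(0.01, -0.002, 0.03),
    denom = c(0.10, 0.05, 0.20)
)

# Cap negative numerators at 0, as done in the loader
sites$num <- pmax(sites$num, 0)

# Weighted Fst: ratio of summed numerators to summed denominators
fst_weighted <- sum(sites$num) / sum(sites$denom)

# Unweighted alternative: mean of per-site ratios
fst_unweighted <- mean(sites$num / sites$denom)

c(weighted = fst_weighted, unweighted = fst_unweighted)
```

The two estimates differ whenever sites vary in their denominators, because the ratio of sums gives more weight to sites with larger denominators.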

In [8]:
# Function to load Fst df for one sampling scheme
load_fst <- function(path){
    
    # Get group and urban/rural sample sizes from filenames
    base <- basename(path)
    split <- str_split(base, pattern = '_')[[1]]
    group <- split[1]
    urban_n <- as.integer(str_sub(split[2], start = 2, end = 2))
    rural_n <- as.integer(str_sub(split[3], start = 2, end = 2))
    
    full_path <- paste0(inpath, path)
    colnames <- c('chrom', 'pos', 'num', 'denom')
    df <- suppressMessages(read_delim(full_path, delim = '\t', col_names = colnames)) %>%
        
    # Cap numerators at 0 if negative
    # https://github.com/ANGSD/angsd/issues/309
    # Does not affect overall pattern
    mutate(num = ifelse(num < 0, 0, num)) %>%
    # Estimate weighted Fst as ratio of averages # https://github.com/ANGSD/angsd/issues/61 
    summarise(num_sum = sum(num),
        denom_sum = sum(denom),
        fst = num_sum / denom_sum,
        nSites = n()) %>%
        mutate(group = group,
               urban_n = urban_n,
               rural_n = rural_n)
    return(df)
}

In [9]:
inpath <- '../results/angsd/pi_fst_sample_size_test/'
fst_df <- list.files(inpath, pattern = '\\.fst$', recursive = TRUE) %>%
    map_dfr(., load_fst)
fst_df

## Same sample sizes

In [10]:
fst_df_mod <- fst_df %>% 
    filter(urban_n == rural_n) %>% 
    select(urban_n, rural_n, group, fst) %>% 
    mutate(fst_bias = fst[group == 'highCov'] - fst)
fst_df_mod

## Different sample sizes

In [11]:
fst_df_mod <- fst_df %>% 
    filter(group == 'highCov' | urban_n != rural_n) %>% 
    select(urban_n, rural_n, group, fst) %>% 
    mutate(fst_bias = fst[group == 'highCov'] - fst)
fst_df_mod

### My take

- Once again, bias in Fst is only apparent when at least one population has N = 1
- In some cases this bias is quite large (e.g., the estimated Fst is more than 0.06 away from the "true" value)

## Does it matter?

- Let's extract urban and rural sample sizes for cities where either habitat has fewer than 5 samples
- Let's look at the Fst estimates for these cities

In [12]:
get_num_samples <- function(path){
    city <- dirname(path)
    habitat <- str_extract(basename(path), pattern = '(?<=_)[ru]')
    full_path <- paste0(inpath, path)
    df <- suppressMessages(read_table(full_path, col_names = FALSE)) 
    nsamples <- nrow(df)
    df <- data.frame(city = city, habitat = habitat, n = nsamples)
    return(df)
}

In [13]:
inpath <- '../results/program_resources/bam_lists/by_city/'
sample_size_df <- list.files(inpath, pattern = '\\.list', recursive = TRUE) %>%
    map_dfr(., get_num_samples) %>%
    pivot_wider(names_from = habitat, values_from = n) %>% 
    filter(r < 5 | u < 5)
sample_size_df

In [14]:
load_fst_glue <- function(path){
    
    # Get Fst type and city from filenames
    dir <- str_split(path, pattern = '/')
    fst_type <- dir[[1]][[1]]
    city <- dir[[1]][[2]]
    full_path <- paste0(inpath, path)
    colnames <- c('chrom', 'pos', 'num', 'denom')
    df <- suppressMessages(read_delim(full_path, delim = '\t', col_names = colnames)) %>%
    # Cap numerators at 0 if negative
    # https://github.com/ANGSD/angsd/issues/309
    # Does not affect overall pattern
    mutate(num = ifelse(num < 0, 0, num)) %>%
    # Estimate weighted Fst as ratio of averages 
    # https://github.com/ANGSD/angsd/issues/61 
    summarise(num_sum = sum(num),
            denom_sum = sum(denom),
            fst = num_sum / denom_sum,
            nSites = n()) %>%
    mutate(fst_type = ifelse(fst_type == 'fst0', 'wc', 'hudson'),
            city = city)
    return(df) 
}

In [15]:
cities <- sample_size_df %>% pull(city)
inpath <- '../results/angsd/summary_stats/fst/'
fst_df <- list.files(inpath, pattern = '_readable\\.fst', recursive = TRUE) %>%
    map_dfr(., load_fst_glue) %>% 
    filter(city %in% cities & fst_type == 'hudson')

In [16]:
fst_df %>% 
    left_join(., sample_size_df, by = 'city') %>% 
    dplyr::select(city, r, u, fst)

### My take

- Fst for Kyoto is really high, and Kyoto has only a single rural sample. I don't trust this Fst.
- The rest of the cities are probably fine. Melbourne has the second-highest Fst value, but the "simulation" above suggests minimal bias when both habitats have 3 samples, so I would leave it in.

In [17]:
outpath <- snakemake@output[[1]]
print(outpath)
file.create(outpath)