# Analysis 15: Candidate Gene Effects

In [None]:
library(dplyr)



# Overview

This script integrates the beta coefficients from the finemapping results with the candidate genes identified in `candidate_gene_identification.qmd`.

# Inputs

File paths:

In [None]:
candidate_table_fn <- "data/processed/interval_genes/candidate_genes/inbred_candidate_genes.tsv.gz"
ns_folder <- "data/processed/20231116_Analysis_NemaScan"


## Candidate gene table

The candidate gene table generated in `candidate_gene_identification.qmd` contains information about SNVs located within candidate genes, along with allele and divergence information.

In [None]:
candidate_table <- data.table::fread(candidate_table_fn)
head(candidate_table)


        MARKER  CHROM      POS GENE_NAME       WBGeneID PEAK_MARKER
        <char> <char>    <int>    <char>         <char>      <char>
1:  V_14554027      V 14554027   C53A5.1 WBGene00008270  5:14453636
2:  V_14554027      V 14554027   C53A5.1 WBGene00008270  5:14453636
3:  V_14554027      V 14554027   C53A5.1 WBGene00008270  5:14453636
4: III_8311919    III  8311919   F10E9.3 WBGene00017355   3:8311919
5: III_8311919    III  8311919   F10E9.3 WBGene00017355   3:8311919
6: IV_16629576     IV 16629576  Y51H4A.9 WBGene00194924  4:16631628
   VARIANT_LD_WITH_PEAK_MARKER MAF_variant VARIANT_IMPACT VARIANT_LOG10p
                         <num>       <num>         <char>          <num>
1:                    0.776750   0.0666667           HIGH       7.687533
2:                    0.776750   0.0666667           HIGH       7.687533
3:                    0.776750   0.0666667           HIGH       7.687533
4:                    1.000000   0.0512821           HIGH      13.064949
5:                

The candidate table from `candidate_gene_identification.qmd` has duplicate rows for the same SNV (single-nucleotide variant at a specific genomic position). The duplicates exist because the table includes per-genotype columns:

-   ALLELE - the genotype (REF or ALT)
-   divergent - divergence classification
-   n - count of strains

As a result, a single SNV in a given gene appears multiple times, once for each combination of allele type and divergence status.

This script adds beta coefficients and allele information from the finemapping data, which already contains detailed allele information, so the genotype breakdown is no longer needed here.

In [None]:
n_row_before <- nrow(candidate_table)
print(n_row_before)


[1] 410

        MARKER  CHROM      POS GENE_NAME       WBGeneID PEAK_MARKER
        <char> <char>    <int>    <char>         <char>      <char>
1:  V_14554027      V 14554027   C53A5.1 WBGene00008270  5:14453636
2: III_8311919    III  8311919   F10E9.3 WBGene00017355   3:8311919
3: IV_16629576     IV 16629576  Y51H4A.9 WBGene00194924  4:16631628
4: IV_16738382     IV 16738382  Y51H4A.2 WBGene00013118  4:16631628
5: IV_16867801     IV 16867801  T06A10.1 WBGene00235300  4:16631628
6: IV_16989319     IV 16989319 Y116A8C.1 WBGene00013791  4:16631628
   VARIANT_LD_WITH_PEAK_MARKER MAF_variant VARIANT_IMPACT VARIANT_LOG10p
                         <num>       <num>         <char>          <num>
1:                    0.776750   0.0666667           HIGH       7.687533
2:                    1.000000   0.0512821           HIGH      13.064949
3:                    0.797642   0.4871790           HIGH       5.487833
4:                    1.000000   0.4564100           HIGH       7.687347
5:                

[1] 179
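
The cell above shows only the row counts and the head of the deduplicated table; the step that drops the genotype breakdown is not shown. The following is a minimal sketch of one way to do it in base R, assuming the per-genotype columns are named `ALLELE`, `divergent`, and `n` as listed above. The toy table and its values are invented for illustration.

```r
# Toy stand-in for the candidate table; values are made up for illustration.
candidate_toy <- data.frame(
  MARKER    = c("V_14554027", "V_14554027", "V_14554027", "III_8311919", "III_8311919"),
  GENE_NAME = c("C53A5.1", "C53A5.1", "C53A5.1", "F10E9.3", "F10E9.3"),
  ALLELE    = c("REF", "ALT", "ALT", "REF", "ALT"),
  divergent = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  n         = c(180L, 10L, 3L, 185L, 10L),
  stringsAsFactors = FALSE
)

# Drop the per-genotype columns, then keep one row per remaining combination.
genotype_cols <- c("ALLELE", "divergent", "n")
keep_cols <- setdiff(names(candidate_toy), genotype_cols)
candidate_dedup <- unique(candidate_toy[, keep_cols, drop = FALSE])

nrow(candidate_toy)    # rows before deduplication
nrow(candidate_dedup)  # rows after deduplication
```

Dropping the columns before calling `unique()` matters: rows that differ only in genotype information would otherwise still count as distinct.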

## Finemapping data

Load all of the finemapping data so that we can merge the beta coefficients into the candidate table.

First, define the functions that locate the finemapping files and parse their names.

In [None]:
# Helper functions to extract information from the finemap file names
get_trait_value <- function(fn_name) {
  # get trait value from file name (keeping the length_ prefix)
  trait_value <- stringr::str_extract(fn_name, "length_[^.]+")
  return(trait_value)
}

get_chromosome <- function(fn_name) {
  # get chromosome from file name
  chromosome <- stringr::str_extract(fn_name, "\\b(I|II|III|IV|V|X)\\b")
  return(chromosome)
}

get_positions <- function(fn_name) {
  # split the filename by period
  parts <- stringr::str_split(fn_name, "\\.")[[1]]
  startPOS <- as.numeric(parts[3])
  stopPOS <- as.numeric(parts[4])
  return(list(startPOS = startPOS, stopPOS = stopPOS))
}

# Main function
aggregate_finemappings <- function(finemap_dir) {
  finemap_files <- list.files(
    path = finemap_dir,
    pattern = "finemap_inbred\\.(inbred|loco)\\.fastGWA",
    full.names = TRUE,
    recursive = TRUE
  ) %>%
    .[basename(.) %>% stringr::str_detect("^length")]

  finemap_df <- data.frame(
    trait_value = character(),
    CHROM = character(),
    startPOS = numeric(),
    endPOS = numeric(),
    file_name = character(),
    file_path = character(),
    stringsAsFactors = FALSE
  )

  for (fn in finemap_files) {
    fn_name <- basename(fn)
    trait_value <- get_trait_value(fn_name)
    chromosome <- get_chromosome(fn_name)
    positions <- get_positions(fn_name)

    finemap_df <- finemap_df %>%
      add_row(
        trait_value = trait_value,
        CHROM = chromosome,
        startPOS = positions$startPOS,
        endPOS = positions$stopPOS,
        file_name = fn_name,
        file_path = fn
      )
  }

  return(finemap_df)
}
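
To illustrate the file-name convention these helpers rely on (`<trait>.<chromosome>.<start>.<stop>.finemap_inbred...`), here is a small base R sketch that parses all four fields with a single regular expression. `parse_finemap_name` is a hypothetical helper written for this example, not part of the original script; it fails loudly when a name does not fit the pattern.

```r
# Hypothetical helper: parse trait, chromosome, and interval bounds in one pass.
parse_finemap_name <- function(fn_name) {
  # Longer chromosome names come first so "III" is never read as "I".
  pattern <- "^(length_[^.]+)\\.(III|II|IV|I|V|X)\\.([0-9]+)\\.([0-9]+)\\."
  m <- regmatches(fn_name, regexec(pattern, fn_name))[[1]]
  if (length(m) == 0) {
    stop("Unexpected finemap file name: ", fn_name)
  }
  list(
    trait_value = m[2],
    CHROM = m[3],
    startPOS = as.numeric(m[4]),
    endPOS = as.numeric(m[5])
  )
}

parsed <- parse_finemap_name(
  "length_Aldicarb.III.7790159.8989539.finemap_inbred.inbred.fastGWA"
)
str(parsed)
```

Anchoring the pattern at the start of the name means a trait that contains a period, or a file with a different layout, raises an error instead of silently producing `NA` positions.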


In [None]:
inbred_finemap_dir <-
  glue::glue("{ns_folder}/INBRED/Fine_Mappings/Data")

inbred_finemap_df <- aggregate_finemappings(inbred_finemap_dir)
head(inbred_finemap_df)


      trait_value CHROM startPOS   endPOS
1    length_2_4_D     V 13865698 15025162
2 length_Aldicarb   III  7790159  8989539
3 length_Aldicarb    IV 12966415 13439670
4 length_Aldicarb    IV 16474675 17180844
5 length_Aldicarb    IV  2179845  2507551
6 length_Aldicarb     X 13002956 13505962
                                                           file_name
1     length_2_4_D.V.13865698.15025162.finemap_inbred.inbred.fastGWA
2  length_Aldicarb.III.7790159.8989539.finemap_inbred.inbred.fastGWA
3 length_Aldicarb.IV.12966415.13439670.finemap_inbred.inbred.fastGWA
4 length_Aldicarb.IV.16474675.17180844.finemap_inbred.inbred.fastGWA
5   length_Aldicarb.IV.2179845.2507551.finemap_inbred.inbred.fastGWA
6  length_Aldicarb.X.13002956.13505962.finemap_inbred.inbred.fastGWA
                                                                                                                               file_path
1     data/processed/20231116_Analysis_NemaScan/INBRED/Fine_Mappings/Data/length_2_4_D

This data frame contains the paths to all of the inbred finemapping results. Next we will filter to just the finemappings that correspond to the candidate genes in our candidate table and then read in those finemapping results into a single data frame.

In [None]:
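# Create a column in the finemap df that matches the interval in the candidate table.
# Cast to integer first so round positions (e.g. 100000) are not pasted as "1e+05".
inbred_finemap_df_m <- inbred_finemap_df %>%
  dplyr::mutate(interval = paste0(as.integer(startPOS), "-", as.integer(endPOS)))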
head(inbred_finemap_df_m)


      trait_value CHROM startPOS   endPOS
1    length_2_4_D     V 13865698 15025162
2 length_Aldicarb   III  7790159  8989539
3 length_Aldicarb    IV 12966415 13439670
4 length_Aldicarb    IV 16474675 17180844
5 length_Aldicarb    IV  2179845  2507551
6 length_Aldicarb     X 13002956 13505962
                                                           file_name
1     length_2_4_D.V.13865698.15025162.finemap_inbred.inbred.fastGWA
2  length_Aldicarb.III.7790159.8989539.finemap_inbred.inbred.fastGWA
3 length_Aldicarb.IV.12966415.13439670.finemap_inbred.inbred.fastGWA
4 length_Aldicarb.IV.16474675.17180844.finemap_inbred.inbred.fastGWA
5   length_Aldicarb.IV.2179845.2507551.finemap_inbred.inbred.fastGWA
6  length_Aldicarb.X.13002956.13505962.finemap_inbred.inbred.fastGWA
                                                                                                                               file_path
1     data/processed/20231116_Analysis_NemaScan/INBRED/Fine_Mappings/Data/length_2_4_D

Create a table of the unique trait and interval combinations in the candidate table.

In [None]:
candidate_intervals <- candidate_table %>%
  dplyr::select(trait, interval) %>%
  dplyr::distinct()
head(candidate_intervals)


                     trait          interval
                    <char>            <char>
1:            length_2_4_D 13865698-15025162
2:         length_Aldicarb   7790159-8989539
3:         length_Aldicarb 16474675-17180844
4:         length_Aldicarb 13002956-13505962
5: length_Arsenic_trioxide 13521931-14347284
6: length_Arsenic_trioxide     28947-2527992

Filter the finemapping data frame to only the files that match the candidate intervals.

In [None]:
filtered_finemap_df <- inbred_finemap_df_m %>%
  dplyr::inner_join(candidate_intervals, by = c("trait_value" = "trait", "interval" = "interval"))
head(filtered_finemap_df)


              trait_value CHROM startPOS   endPOS
1            length_2_4_D     V 13865698 15025162
2         length_Aldicarb   III  7790159  8989539
3         length_Aldicarb    IV 16474675 17180844
4         length_Aldicarb     X 13002956 13505962
5 length_Arsenic_trioxide    II 13521931 14347284
6 length_Arsenic_trioxide   III    28947  2527992
                                                                   file_name
1             length_2_4_D.V.13865698.15025162.finemap_inbred.inbred.fastGWA
2          length_Aldicarb.III.7790159.8989539.finemap_inbred.inbred.fastGWA
3         length_Aldicarb.IV.16474675.17180844.finemap_inbred.inbred.fastGWA
4          length_Aldicarb.X.13002956.13505962.finemap_inbred.inbred.fastGWA
5 length_Arsenic_trioxide.II.13521931.14347284.finemap_inbred.inbred.fastGWA
6    length_Arsenic_trioxide.III.28947.2527992.finemap_inbred.inbred.fastGWA
                                                                                                               

[1] 31
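
As a sanity check on this filtering step, it can help to confirm that every candidate interval found a matching finemapping file. The sketch below uses base R and toy data; the third candidate trait (`length_Hypothetical`) and its interval are invented so that one row deliberately fails to match.

```r
# Toy finemap index and candidate intervals; the last candidate row is invented.
finemap_toy <- data.frame(
  trait_value = c("length_Aldicarb", "length_Aldicarb", "length_2_4_D"),
  startPOS = c(7790159, 12966415, 13865698),
  endPOS = c(8989539, 13439670, 15025162),
  stringsAsFactors = FALSE
)
candidate_toy <- data.frame(
  trait = c("length_Aldicarb", "length_2_4_D", "length_Hypothetical"),
  interval = c("7790159-8989539", "13865698-15025162", "100000-200000"),
  stringsAsFactors = FALSE
)

# Build the interval key from integers so it is never in scientific notation.
finemap_toy$interval <- paste0(
  as.integer(finemap_toy$startPOS), "-", as.integer(finemap_toy$endPOS)
)

# Inner join: keep only finemap files whose trait and interval are candidates.
matched <- merge(
  finemap_toy, candidate_toy,
  by.x = c("trait_value", "interval"),
  by.y = c("trait", "interval")
)

# Anti-join: candidate intervals for which no finemap file was found.
candidate_keys <- paste(candidate_toy$trait, candidate_toy$interval)
finemap_keys <- paste(finemap_toy$trait_value, finemap_toy$interval)
unmatched <- candidate_toy[!candidate_keys %in% finemap_keys, ]

nrow(matched)
unmatched
```

An empty `unmatched` table would indicate that the join keys line up for every candidate interval.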

Read in all of the filtered finemapping results and combine them into a single data frame.

In [None]:
combine_finemapping_files <- function(file_df) {
  combined_df <- file_df %>%
    rowwise() %>%
    mutate(
      data = list(data.table::fread(file_path) %>%
        mutate(trait = trait_value))
    ) %>%
    tidyr::unnest(cols = c(data))

  return(combined_df)
}

filtered_inbred_finemap_data <- combine_finemapping_files(filtered_finemap_df)
head(filtered_inbred_finemap_data)


# A tibble: 6 × 18
  trait_value  CHROM startPOS   endPOS file_name  file_path interval   CHR SNP  
  <chr>        <chr>    <dbl>    <dbl> <chr>      <chr>     <chr>    <int> <chr>
1 length_2_4_D V     13865698 15025162 length_2_… data/pro… 1386569…     5 5:13…
2 length_2_4_D V     13865698 15025162 length_2_… data/pro… 1386569…     5 5:13…
3 length_2_4_D V     13865698 15025162 length_2_… data/pro… 1386569…     5 5:13…
4 length_2_4_D V     13865698 15025162 length_2_… data/pro… 1386569…     5 5:13…
5 length_2_4_D V     13865698 15025162 length_2_… data/pro… 1386569…     5 5:13…
6 length_2_4_D V     13865698 15025162 length_2_… data/pro… 1386569…     5 5:13…
# ℹ 9 more variables: POS <int>, A1 <chr>, A2 <chr>, N <int>, AF1 <dbl>,
#   BETA <dbl>, SE <dbl>, P <dbl>, trait <chr>
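
The same read-and-stack pattern can be sketched in base R, which makes the per-file steps explicit: read one file, tag its rows with the trait, then bind everything together. The two toy files written to a temporary directory below are invented stand-ins with only `SNP` and `BETA` columns; they are not real finemapping output.

```r
# Write two tiny tab-separated toy files to a temporary directory.
tmp_dir <- tempfile("finemap_toy_")
dir.create(tmp_dir)
path_a <- file.path(tmp_dir, "length_A.V.100.200.finemap_inbred.inbred.fastGWA")
path_b <- file.path(tmp_dir, "length_B.III.300.400.finemap_inbred.inbred.fastGWA")
write.table(data.frame(SNP = c("5:100", "5:200"), BETA = c(0.5, -1.2)),
            path_a, sep = "\t", quote = FALSE, row.names = FALSE)
write.table(data.frame(SNP = "3:300", BETA = 2.0),
            path_b, sep = "\t", quote = FALSE, row.names = FALSE)

# Index of files to read, mirroring the trait_value and file_path columns.
file_df <- data.frame(
  trait_value = c("length_A", "length_B"),
  file_path = c(path_a, path_b),
  stringsAsFactors = FALSE
)

# Read each file, tag its rows with the trait, and stack the results.
combined <- do.call(rbind, lapply(seq_len(nrow(file_df)), function(i) {
  dat <- read.delim(file_df$file_path[i], stringsAsFactors = FALSE)
  dat$trait <- file_df$trait_value[i]
  dat
}))

combined
```

Tagging rows at read time is what later allows the join on both `trait` and `MARKER`, since the same SNV can appear in the finemapping results of more than one trait.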

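Now select and rename the relevant columns from the combined finemapping data. The SNP alleles (A1, A2) are taken from the finemapping output; per the [GCTA fastGWA output docs](https://yanglab.westlake.edu.cn/software/gcta/#fastGWA), A1 is the effect allele and A2 is the other allele, so `BETA` is reported with respect to A1. The marker IDs are also converted from the numeric chromosome form (e.g. `5:13865698`) to the form used in the candidate table (e.g. `V_13865698`).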

In [None]:
finemap_snvs <- filtered_inbred_finemap_data %>%
  dplyr::select(trait, SNP, BETA, A1, A2) %>%
  dplyr::rename(
    MARKER = SNP,
    beta_coefficient = BETA,
    effect_allele = A1,
    other_allele = A2
  ) %>%
  dplyr::mutate(MARKER = sapply(MARKER, convert_marker, to_marker = "RM", from_marker = "NUM")) %>%
  dplyr::mutate(MARKER = stringr::str_replace_all(MARKER, ":", "_"))

head(finemap_snvs)


# A tibble: 6 × 5
  trait        MARKER     beta_coefficient effect_allele other_allele
  <chr>        <chr>                 <dbl> <chr>         <chr>       
1 length_2_4_D V_13865698            0.493 G             T           
2 length_2_4_D V_13865823            5.33  A             C           
3 length_2_4_D V_13865980            2.87  A             G           
4 length_2_4_D V_13867914            3.67  C             T           
5 length_2_4_D V_13868694           -3.89  T             C           
6 length_2_4_D V_13869222            8.62  G             A           
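
`convert_marker` is a project helper whose definition is not shown here. As a rough illustration of the conversion it appears to perform together with the `str_replace_all` call, the sketch below maps a numeric chromosome prefix to a Roman numeral and swaps `:` for `_`. The function name `num_to_roman_marker` is hypothetical, and the mapping of `6` to `X` is an assumption; only the conversions of `3` and `5` can be confirmed from the tables in this document.

```r
# Hypothetical stand-in for the marker conversion; not the project's helper.
# Assumption: chromosomes 1-5 map to I-V and 6 maps to X.
num_to_roman_marker <- function(marker) {
  chrom_map <- c("1" = "I", "2" = "II", "3" = "III",
                 "4" = "IV", "5" = "V", "6" = "X")
  parts <- strsplit(marker, ":", fixed = TRUE)
  vapply(parts, function(p) {
    # `[[` raises an error for a chromosome code that is not in the map.
    chrom <- chrom_map[[p[1]]]
    paste0(chrom, "_", p[2])
  }, character(1))
}

num_to_roman_marker(c("5:13865698", "3:8311919"))
# returns "V_13865698" "III_8311919"
```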

# Main

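Join the candidate table with the finemapping SNV data on `trait` and `MARKER` to add the beta coefficients and allele information. A left join keeps every candidate row, including any without a matching finemapping SNV.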

In [None]:
candidate_table_with_betas <- candidate_table %>%
  dplyr::left_join(finemap_snvs, by = c("trait", "MARKER"))
head(candidate_table_with_betas)


        MARKER  CHROM      POS GENE_NAME       WBGeneID PEAK_MARKER
        <char> <char>    <int>    <char>         <char>      <char>
1:  V_14554027      V 14554027   C53A5.1 WBGene00008270  5:14453636
2: III_8311919    III  8311919   F10E9.3 WBGene00017355   3:8311919
3: IV_16629576     IV 16629576  Y51H4A.9 WBGene00194924  4:16631628
4: IV_16738382     IV 16738382  Y51H4A.2 WBGene00013118  4:16631628
5: IV_16867801     IV 16867801  T06A10.1 WBGene00235300  4:16631628
6: IV_16989319     IV 16989319 Y116A8C.1 WBGene00013791  4:16631628
   VARIANT_LD_WITH_PEAK_MARKER MAF_variant VARIANT_IMPACT VARIANT_LOG10p
                         <num>       <num>         <char>          <num>
1:                    0.776750   0.0666667           HIGH       7.687533
2:                    1.000000   0.0512821           HIGH      13.064949
3:                    0.797642   0.4871790           HIGH       5.487833
4:                    1.000000   0.4564100           HIGH       7.687347
5:                
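
Because this is a left join, two quick checks are worth running afterwards: the row count should be unchanged (no duplicated keys on the finemapping side), and any candidate without a match will carry `NA` for the beta coefficient. The following base R sketch shows both checks on toy data; the beta values, alleles, and the unmatched marker `IV_99999999` are invented for illustration.

```r
# Toy candidate table; the last marker is invented and has no finemap match.
candidate_toy <- data.frame(
  trait = c("length_2_4_D", "length_Aldicarb", "length_Aldicarb"),
  MARKER = c("V_14554027", "III_8311919", "IV_99999999"),
  stringsAsFactors = FALSE
)
# Toy finemapping SNVs; beta values and alleles are made up.
finemap_toy <- data.frame(
  trait = c("length_2_4_D", "length_Aldicarb"),
  MARKER = c("V_14554027", "III_8311919"),
  beta_coefficient = c(1.5, -2.0),
  effect_allele = c("A", "T"),
  other_allele = c("G", "C"),
  stringsAsFactors = FALSE
)

# Left join (all.x = TRUE); note that merge() sorts rows by the join keys.
joined <- merge(candidate_toy, finemap_toy,
                by = c("trait", "MARKER"), all.x = TRUE)

# Check 1: the join must not add or drop candidate rows.
stopifnot(nrow(joined) == nrow(candidate_toy))

# Check 2: list the candidates that did not receive a beta coefficient.
missing_beta <- joined[is.na(joined$beta_coefficient), c("trait", "MARKER")]
missing_beta
```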

# Outputs

Save the candidate table with beta coefficients as a gzip-compressed TSV.

In [None]:
candidate_table_betas_outfn <- "data/processed/interval_genes/candidate_genes/inbred_candidate_genes_with_betas.tsv.gz"
save_compressed_tsv(candidate_table_with_betas, candidate_table_betas_outfn)
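
`save_compressed_tsv` is a project helper that is not defined in this document. For readers without access to it, here is a minimal base R sketch of what such a function might look like. The name `save_tsv_gz` and its behavior (creating the output directory, writing tab-separated text through a gzip connection) are assumptions, not the project's implementation.

```r
# Hypothetical stand-in for save_compressed_tsv(); behavior is assumed.
save_tsv_gz <- function(df, out_fn) {
  # Create the output directory if it does not exist yet.
  dir.create(dirname(out_fn), recursive = TRUE, showWarnings = FALSE)
  con <- gzfile(out_fn, open = "wt")
  on.exit(close(con))
  write.table(df, con, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(out_fn)
}

# Round trip on a toy table written to the session's temporary directory.
out_fn <- file.path(tempdir(), "toy_candidates_with_betas.tsv.gz")
toy <- data.frame(MARKER = "V_14554027", beta_coefficient = 1.5)
save_tsv_gz(toy, out_fn)
round_trip <- read.delim(out_fn, stringsAsFactors = FALSE)
round_trip
```

Reading the file back is a cheap way to confirm that the column names and values survived the write unchanged.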
