# **Google Drive**

In [None]:
# Mount Google Drive:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **R Magic**

In [None]:
# Install a pinned version of rpy2 (3.5.1):
!pip install rpy2==3.5.1

Collecting rpy2==3.5.1
  Downloading rpy2-3.5.1.tar.gz (201 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.7/201.7 kB 2.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rpy2
  Building wheel for rpy2 (setup.py) ... done
  Created wheel for rpy2: filename=rpy2-3.5.1-cp310-cp310-linux_x86_64.whl size=314932 sha256=f30a057a3a6d586d11a358340705cf2ae7913bc73896e815a77c293bad920d64
  Stored in directory: /root/.cache/pip/wheels/73/a6/ff/4e75dd1ce1cfa2b9a670cbccf6a1e41c553199e9b25f05d953
Successfully built rpy2
Installing collected packages: rpy2
  Attempting uninstall: rpy2
    Found existing installation: rpy2 3.4.2
    Uninstalling rpy2-3.4.2:
      Successfully uninstalled rpy2-3.4.2
Successfully installed rpy2-3.5.1


In [None]:
# Activate R magic:
%load_ext rpy2.ipython

# **GWAS**

**Paper:**

["Genome-wide Association Study of Long COVID" Lammi et al., 2023](https://www.medrxiv.org/content/10.1101/2023.06.29.23292056v1)

**Code:**

[GitHub: Meta Analysis](https://github.com/long-covid-hg/META_ANALYSIS/tree/main)

[GitHub: PCA](https://github.com/long-covid-hg/pca_projection)

[GitHub: MR](https://github.com/marcoralab/MRcovid)

[GitHub: Tools](https://github.com/long-covid-hg/LongCovidTools/tree/main)

**GWAS:**

[LocusZoom: GWAS1](https://my.locuszoom.org/gwas/192226/?token=09a18cf9138243db9cdf79ff6930fdf8)

[LocusZoom: GWAS2](https://my.locuszoom.org/gwas/826733/?token=c7274597af504bf3811de6d742921bc8)

[LocusZoom: GWAS3](https://my.locuszoom.org/gwas/793752/?token=0dc986619af14b6e8a564c580d3220b4)

[LocusZoom: GWAS4](https://my.locuszoom.org/gwas/91854/?token=723e672edf13478e817ca44b56c0c068)

**More Information:**
- Release: 7
- Ensembl: 109
- Human Genome Build: GRCh38

**Cases vs. Controls:**
- **GWAS 1:** Long COVID cases after test-verified SARS-CoV-2 infection (strict case definition) (N=3,018) vs. population controls (broad control definition) (N=1,093,995).
- **GWAS 2:** Long COVID cases after any SARS-CoV-2 infection (test-verified or clinically confirmed without testing) (broad case definition) (N=6,450) vs. population controls (broad control definition) (N=1,093,995).
- **GWAS 3:** Long COVID cases after test-verified SARS-CoV-2 infection (strict case definition) (N=3,018) vs. SARS-CoV-2 cases without long COVID (strict control definition) (N=46,208).
- **GWAS 4:** Long COVID cases after any SARS-CoV-2 infection (broad case definition) (N=6,450) vs. SARS-CoV-2 cases without long COVID (strict control definition) (N=46,208).

- <font color='green'>**Broad Controls:**</font> General population controls.
- <font color='pink'>**Strict Controls:**</font> SARS-CoV-2 cases without long COVID.
- <font color='blue'>**Broad Cases:**</font> Long COVID cases with or without a test-verified infection.
- <font color='yellow'>**Strict Cases:**</font> Long COVID cases with a test-verified infection.

<img src="https://drive.google.com/uc?export=view&id=17vvFmvFv8mt-WKIjp7JMCwEI_fF0g_50" alt="drawing" width="500"/>

In [None]:
%%R
# Step 1: Install dplyr:
install.packages("dplyr")
library(dplyr)

In [None]:
%%R
# Step 2: Install BiocManager:
install.packages("BiocManager")
library(BiocManager)

In [None]:
%%R
# Step 3: Install biomaRt:
BiocManager::install("biomaRt")
library(biomaRt)

In [None]:
%%R
# Read the first dataset:
# Long COVID cases after test-verified SARS-CoV-2 infection (strict case definition) (N=3,018) vs. population control (broad control definition) (N=1,093,995).
gwas_1 <- base::gzfile("/content/drive/MyDrive/Colab/Long_COVID/GWAS/long_covid_gwas_1.gz") %>%
  utils::read.table(header = FALSE,
                    sep = "\t",
                    stringsAsFactors = FALSE)

print(head(gwas_1))

  V1     V2          V3 V4 V5          V6          V7        V8        V9
1  1 727242  rs61769339  G  A 0.659672903 -0.08914350 0.0725127 0.1419410
2  1 729886 rs539032812  T  C 0.203850456 -0.08556010 0.1752460 0.0278364
3  1 758351  rs12238997  A  G 0.642450093 -0.08377110 0.0694591 0.1499760
4  1 758443  rs61769351  G  C 0.462277845 -0.06721850 0.0711697 0.1478110
5  1 770988  rs12029736  A  G 0.331720340  0.04515580 0.0619260 0.4910420
6  1 771398 rs113462541  G  A 0.009011522  0.00159483 0.0619572 0.7187970
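
The sixth column (V6) can be sanity-checked against V7 and V8. The following is a minimal sketch, assuming V7 is the effect size, V8 is its standard error, and V6 is the -log10 of a two-sided Wald p-value. The helper name `neg_log10_p` is hypothetical, and the three rows are copied from the output above.

```python
import math

def neg_log10_p(beta, se):
    """-log10 of the two-sided Wald p-value for z = beta / se."""
    z = abs(beta / se)
    p = math.erfc(z / math.sqrt(2.0))  # two-sided normal tail probability
    return -math.log10(p)

# (V6, V7, V8) copied from the first rows printed above.
rows = [
    (0.659672903, -0.08914350, 0.0725127),
    (0.203850456, -0.08556010, 0.1752460),
    (0.009011522, 0.00159483, 0.0619572),
]

for v6, beta, se in rows:
    print(f"reported={v6:.4f}  recomputed={neg_log10_p(beta, se):.4f}")
```

If the reported and recomputed values agree, the column behaves like a significance measure rather than an effect size, which is worth keeping in mind when naming it.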


In [None]:
%%R
# Rename the columns of gwas_1:
# Column 6 matches -log10(p) recomputed from beta and SE, so it is named "mlog10p" rather than "logOR".
names(gwas_1) <- c("chr", "position", "variant_id", "ref_all", "alt_all", "mlog10p", "beta", "SE", "alt_all_freq")

print(dim(gwas_1))
print(length(unique(gwas_1$variant_id)))
print(head(gwas_1))

[1] 9510587       9
[1] 9023988
  chr position  variant_id ref_all alt_all     mlog10p        beta        SE
1   1   727242  rs61769339       G       A 0.659672903 -0.08914350 0.0725127
2   1   729886 rs539032812       T       C 0.203850456 -0.08556010 0.1752460
3   1   758351  rs12238997       A       G 0.642450093 -0.08377110 0.0694591
4   1   758443  rs61769351       G       C 0.462277845 -0.06721850 0.0711697
5   1   770988  rs12029736       A       G 0.331720340  0.04515580 0.0619260
6   1   771398 rs113462541       G       A 0.009011522  0.00159483 0.0619572
  alt_all_freq
1    0.1419410
2    0.0278364
3    0.1499760
4    0.1478110
5    0.4910420
6    0.7187970


In [None]:
%%R
# Subset the previous dataset to have the most important columns for MtRobin:
gwas1_subset <- subset(gwas_1, select = c("variant_id", "beta", "SE"))

print(dim(gwas1_subset))
print(length(unique(gwas1_subset$variant_id)))
print(head(gwas1_subset))

[1] 9510587       3
[1] 9023988
   variant_id        beta        SE
1  rs61769339 -0.08914350 0.0725127
2 rs539032812 -0.08556010 0.1752460
3  rs12238997 -0.08377110 0.0694591
4  rs61769351 -0.06721850 0.0711697
5  rs12029736  0.04515580 0.0619260
6 rs113462541  0.00159483 0.0619572
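
The row count (9,510,587) is larger than the number of unique `variant_id` values (9,023,988), so some IDs occur more than once. The following is a minimal sketch for listing such IDs before the file is used downstream. The function name `duplicated_ids` and the toy IDs are hypothetical. On the real file, the IDs would be streamed from the first column of the tab-separated subset.

```python
from collections import Counter

def duplicated_ids(ids):
    """Return {variant_id: count} for IDs that occur more than once."""
    counts = Counter(ids)
    return {vid: n for vid, n in counts.items() if n > 1}

# Toy example with hypothetical IDs.
example_ids = ["rs1", "rs2", "rs2", "rs3", "rs3", "rs3"]
dups = duplicated_ids(example_ids)

# Rows beyond the first occurrence of each repeated ID.
extra_rows = sum(dups.values()) - len(dups)

print(dups)
print(extra_rows)
```

How repeated IDs should be handled (dropped, or disambiguated by position and alleles) depends on the downstream method, so this sketch only reports them.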


In [None]:
%%R
# Save the output data in a file:
write.table(gwas1_subset,
            file = "/content/drive/MyDrive/Colab/Long_COVID/GWAS/long_covid_gwas_1_subset.txt",
            sep = "\t",
            row.names = FALSE,
            col.names = TRUE,
            quote = FALSE)

In [None]:
%%R
# Read the second dataset:
# Long COVID cases after any SARS-CoV-2 infection (broad case definition) (N=6,450) vs. population controls (broad control definition) (N=1,093,995).
gwas_2 <- base::gzfile("/content/drive/MyDrive/Colab/Long_COVID/GWAS/long_covid_gwas_2.gz") %>%
  utils::read.table(header = FALSE,
                    sep = "\t",
                    stringsAsFactors = FALSE)

print(head(gwas_2))

  V1     V2          V3 V4 V5         V6          V7        V8        V9
1  1 758351  rs12238997  A  G 1.42869305 -0.07519460 0.0361019 0.1500900
2  1 794707 rs148120343  T  C 0.01924556  0.00276278 0.0508297 0.0597876
3  1 796338  rs58276399  T  C 1.25511506 -0.06778640 0.0354096 0.1560230
4  1 796652  rs61770163  A  C 2.13629358 -0.15318400 0.0571040 0.1545600
5  1 797429  rs12131618  T  C 1.25887284  0.13449700 0.0701197 0.0792994
6  1 798969 rs141242758  T  C 1.78633086 -0.08296060 0.0345542 0.1549950


In [None]:
%%R
# Rename the columns of gwas_2:
# Column 6 matches -log10(p) recomputed from beta and SE, so it is named "mlog10p" rather than "logOR".
names(gwas_2) <- c("chr", "position", "variant_id", "ref_all", "alt_all", "mlog10p", "beta", "SE", "alt_all_freq")

print(dim(gwas_2))
print(length(unique(gwas_2$variant_id)))
print(head(gwas_2))

[1] 9722678       9
[1] 9135125
  chr position  variant_id ref_all alt_all    mlog10p        beta        SE
1   1   758351  rs12238997       A       G 1.42869305 -0.07519460 0.0361019
2   1   794707 rs148120343       T       C 0.01924556  0.00276278 0.0508297
3   1   796338  rs58276399       T       C 1.25511506 -0.06778640 0.0354096
4   1   796652  rs61770163       A       C 2.13629358 -0.15318400 0.0571040
5   1   797429  rs12131618       T       C 1.25887284  0.13449700 0.0701197
6   1   798969 rs141242758       T       C 1.78633086 -0.08296060 0.0345542
  alt_all_freq
1    0.1500900
2    0.0597876
3    0.1560230
4    0.1545600
5    0.0792994
6    0.1549950


In [None]:
%%R
# Subset the previous dataset to have the most important columns for MtRobin:
gwas2_subset <- subset(gwas_2, select = c("variant_id", "beta", "SE"))

print(dim(gwas2_subset))
print(length(unique(gwas2_subset$variant_id)))
print(head(gwas2_subset))

[1] 9722678       3
[1] 9135125
   variant_id        beta        SE
1  rs12238997 -0.07519460 0.0361019
2 rs148120343  0.00276278 0.0508297
3  rs58276399 -0.06778640 0.0354096
4  rs61770163 -0.15318400 0.0571040
5  rs12131618  0.13449700 0.0701197
6 rs141242758 -0.08296060 0.0345542


In [None]:
%%R
# Save the output data in a file:
write.table(gwas2_subset,
            file = "/content/drive/MyDrive/Colab/Long_COVID/GWAS/long_covid_gwas_2_subset.txt",
            sep = "\t",
            row.names = FALSE,
            col.names = TRUE,
            quote = FALSE)

In [None]:
%%R
# Read the third dataset:
# Long COVID cases after test-verified SARS-CoV-2 infection (strict case definition) (N=3,018) vs. SARS-CoV-2 cases that are not long-COVID (strict control definition) (N=46,208).
gwas_3 <- base::gzfile("/content/drive/MyDrive/Colab/Long_COVID/GWAS/long_covid_gwas_3.gz") %>%
  utils::read.table(header = FALSE,
                    sep = "\t",
                    stringsAsFactors = FALSE)

print(head(gwas_3))

  V1     V2          V3 V4 V5         V6         V7        V8        V9
1  1 727242  rs61769339  G  A 0.30395306 -0.0524928 0.0772207 0.1261010
2  1 729886 rs539032812  T  C 0.23106754 -0.0898624 0.1656110 0.0239551
3  1 758351  rs12238997  A  G 0.32343624 -0.0508974 0.0712253 0.1338670
4  1 758443  rs61769351  G  C 0.20615027 -0.0373298 0.0757353 0.1304840
5  1 771398 rs113462541  G  A 0.01595811  0.0028148 0.0622293 0.7284580
6  1 782207 rs144155419  G  A 0.39331559 -0.1775650 0.2129080 0.0128376


In [None]:
%%R
# Rename the columns of gwas_3:
# Column 6 matches -log10(p) recomputed from beta and SE, so it is named "mlog10p" rather than "logOR".
names(gwas_3) <- c("chr", "position", "variant_id", "ref_all", "alt_all", "mlog10p", "beta", "SE", "alt_all_freq")

print(dim(gwas_3))
print(length(unique(gwas_3$variant_id)))
print(head(gwas_3))

[1] 9738584       9
[1] 9120808
  chr position  variant_id ref_all alt_all    mlog10p       beta        SE
1   1   727242  rs61769339       G       A 0.30395306 -0.0524928 0.0772207
2   1   729886 rs539032812       T       C 0.23106754 -0.0898624 0.1656110
3   1   758351  rs12238997       A       G 0.32343624 -0.0508974 0.0712253
4   1   758443  rs61769351       G       C 0.20615027 -0.0373298 0.0757353
5   1   771398 rs113462541       G       A 0.01595811  0.0028148 0.0622293
6   1   782207 rs144155419       G       A 0.39331559 -0.1775650 0.2129080
  alt_all_freq
1    0.1261010
2    0.0239551
3    0.1338670
4    0.1304840
5    0.7284580
6    0.0128376


In [None]:
%%R
# Subset the previous dataset to have the most important columns for MtRobin:
gwas3_subset <- subset(gwas_3, select = c("variant_id", "beta", "SE"))

print(dim(gwas3_subset))
print(length(unique(gwas3_subset$variant_id)))
print(head(gwas3_subset))

[1] 9738584       3
[1] 9120808
   variant_id       beta        SE
1  rs61769339 -0.0524928 0.0772207
2 rs539032812 -0.0898624 0.1656110
3  rs12238997 -0.0508974 0.0712253
4  rs61769351 -0.0373298 0.0757353
5 rs113462541  0.0028148 0.0622293
6 rs144155419 -0.1775650 0.2129080


In [None]:
%%R
# Save the output data in a file:
write.table(gwas3_subset,
            file = "/content/drive/MyDrive/Colab/Long_COVID/GWAS/long_covid_gwas_3_subset.txt",
            sep = "\t",
            row.names = FALSE,
            col.names = TRUE,
            quote = FALSE)

In [None]:
%%R
# Read the fourth dataset:
# Long COVID cases after any SARS-CoV-2 infection (broad case definition) (N=6,450) vs. SARS-CoV-2 cases without long COVID (strict control definition) (N=46,208).
gwas_4 <- base::gzfile("/content/drive/MyDrive/Colab/Long_COVID/GWAS/long_covid_gwas_4.gz") %>%
  utils::read.table(header = FALSE,
                    sep = "\t",
                    stringsAsFactors = FALSE)

print(head(gwas_4))

  V1     V2          V3 V4 V5         V6          V7        V8        V9
1  1 758351  rs12238997  A  G 1.18554130 -0.07748130 0.0420257 0.1384160
2  1 794707 rs148120343  T  C 0.06082033 -0.00965152 0.0586638 0.0602122
3  1 796338  rs58276399  T  C 1.31144582 -0.08097980 0.0411023 0.1438010
4  1 796652  rs61770163  A  C 1.15127158 -0.11419100 0.0631543 0.1372740
5  1 797429  rs12131618  T  C 0.87743683  0.11802000 0.0784760 0.0777810
6  1 798969 rs141242758  T  C 1.78767653 -0.09648930 0.0401701 0.1431910


In [None]:
%%R
# Rename the columns of gwas_4:
# Column 6 matches -log10(p) recomputed from beta and SE, so it is named "mlog10p" rather than "logOR".
names(gwas_4) <- c("chr", "position", "variant_id", "ref_all", "alt_all", "mlog10p", "beta", "SE", "alt_all_freq")

print(dim(gwas_4))
print(length(unique(gwas_4$variant_id)))
print(head(gwas_4))

[1] 9753825       9
[1] 9173788
  chr position  variant_id ref_all alt_all    mlog10p        beta        SE
1   1   758351  rs12238997       A       G 1.18554130 -0.07748130 0.0420257
2   1   794707 rs148120343       T       C 0.06082033 -0.00965152 0.0586638
3   1   796338  rs58276399       T       C 1.31144582 -0.08097980 0.0411023
4   1   796652  rs61770163       A       C 1.15127158 -0.11419100 0.0631543
5   1   797429  rs12131618       T       C 0.87743683  0.11802000 0.0784760
6   1   798969 rs141242758       T       C 1.78767653 -0.09648930 0.0401701
  alt_all_freq
1    0.1384160
2    0.0602122
3    0.1438010
4    0.1372740
5    0.0777810
6    0.1431910


In [None]:
%%R
# Subset the previous dataset to have the most important columns for MtRobin:
gwas4_subset <- subset(gwas_4, select = c("variant_id", "beta", "SE"))

print(dim(gwas4_subset))
print(length(unique(gwas4_subset$variant_id)))
print(head(gwas4_subset))

[1] 9753825       3
[1] 9173788
   variant_id        beta        SE
1  rs12238997 -0.07748130 0.0420257
2 rs148120343 -0.00965152 0.0586638
3  rs58276399 -0.08097980 0.0411023
4  rs61770163 -0.11419100 0.0631543
5  rs12131618  0.11802000 0.0784760
6 rs141242758 -0.09648930 0.0401701


In [None]:
%%R
# Save the output data in a file:
write.table(gwas4_subset,
            file = "/content/drive/MyDrive/Colab/Long_COVID/GWAS/long_covid_gwas_4_subset.txt",
            sep = "\t",
            row.names = FALSE,
            col.names = TRUE,
            quote = FALSE)

# **eQTL**

**Database:**

[GTEx](https://gtexportal.org/home/datasets)
- Version: 8
- Ensembl: 88 (GENCODE v26)
- Human Genome Build: GRCh38


**Paper:**

["The GTEx Consortium atlas of genetic regulatory effects across human tissues" Aguet et al., 2020](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7737656/pdf/nihms-1651674.pdf)

In [None]:
%%R
# Pre-process the eQTL data files and merge them into a single dataset
# in the layout expected by the MR_MtRobin method:
process_data <- function(input_folder,
                         pattern,
                         columns_to_select,
                         column_names,
                         output_path,
                         output_delimiter) {

  # Load the required libraries:
  required_packages <- c("devtools", "dplyr", "utils", "tidyverse", "data.table", "R.utils")
  installed_packages <- rownames(installed.packages())
  for (pkg in required_packages) {
    if (!pkg %in% installed_packages) {
      install.packages(pkg)
    }
    library(pkg, character.only = TRUE)
  }

  # Read the input data:
  file_list <- base::list.files(path = input_folder, pattern = pattern) %>%
    base::lapply(function(x) read.table(file.path(input_folder, x),
                                        fill = TRUE, # In case the rows have unequal length, blank fields are implicitly added.
                                        header = TRUE))

  # Check if the files were read:
  if (length(file_list) == 0) {
    stop("No files found matching the pattern: ", pattern)
  }

  # Create a mapping between gene_id and gene_name:
  gene_id_to_name <- file_list %>%
    lapply(function(df) {
      df %>%
        select(gene_id, gene_name) %>%
        distinct()
    }) %>%
    bind_rows() %>%
    distinct()

  # Subset, rename, and merge the data frames:
  eqtl <- file_list %>%
    lapply(function(df) {
      df %>%
        subset(select = columns_to_select) %>%
        rename(!!column_names$gene_id := gene_id,
               !!column_names$variant_id := rs_id_dbSNP151_GRCh38p7,
               !!column_names$chr_id := variant_id) %>%
        select(-gene_name) # Remove the gene_name column
    }) %>%
    # Join by gene_id, variant_id and chr_id
    # Don't join by gene_name because it is ambiguous
    purrr::reduce(full_join, by = c("gene_id", "variant_id", "chr_id"))

  # Add gene_name as the second column:
  eqtl <- eqtl %>%
    left_join(gene_id_to_name, by = "gene_id") %>%
    relocate(gene_name, .after = gene_id)

  # Find the number of input files:
  num_files <- length(file_list)

  # Generate column indices for beta, SE, and pvalue columns:
  beta_indices <- seq(5, 5 + 3 * (num_files - 1), 3)
  SE_indices <- seq(6, 6 + 3 * (num_files - 1), 3)
  pvalue_indices <- seq(7, 7 + 3 * (num_files - 1), 3)

  # Subset the data frame:
  eqtl <- eqtl[, c(1, 2, 3, 4,
                   beta_indices,
                   SE_indices,
                   pvalue_indices)]

  # Generate column names for beta, SE, and pvalue columns:
  if (is.null(column_names$beta_names)) {
    column_names$beta_names <- paste0("beta_", 1:num_files)
  }
  if (is.null(column_names$SE_names)) {
    column_names$SE_names <- paste0("SE_", 1:num_files)
  }
  if (is.null(column_names$pvalue_names)) {
    column_names$pvalue_names <- paste0("pvalue_", 1:num_files)
  }

  # Rename the columns:
  colnames(eqtl) <- c(column_names$gene_id,
                      column_names$gene_name,
                      column_names$variant_id,
                      column_names$chr_id,
                      column_names$beta_names,
                      column_names$SE_names,
                      column_names$pvalue_names)

  # Define a function to count the non-NA values for beta_value, SE and p-value columns in each row:
  count_non_na <- function(row) {
    pval_count <- sum(!is.na(row[column_names$pvalue_names]))
    beta_count <- sum(!is.na(row[column_names$beta_names]))
    SE_count <- sum(!is.na(row[column_names$SE_names]))
    return(pval_count == beta_count && beta_count == SE_count)
  }

  # Retain the rows that have the values in the beta, SE and pvalue corresponding columns:
  eqtl <- eqtl[apply(eqtl, 1, count_non_na),]

  # Export the processed data to a .txt file:
  write.table(eqtl,
              file = file.path(output_path),
              sep = output_delimiter,
              row.names = FALSE,
              col.names = TRUE,
              quote = FALSE)
}
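
The positional bookkeeping in `process_data` can be checked in isolation. The following is a minimal sketch of the same index arithmetic, assuming four leading ID columns and exactly three value columns (beta, SE, p-value) per input file. The function name `triple_indices` is hypothetical.

```python
def triple_indices(num_files, first=5, stride=3):
    """1-based column indices of the beta, SE and p-value columns."""
    beta = list(range(first, first + stride * num_files, stride))
    se = [i + 1 for i in beta]
    pvalue = [i + 2 for i in beta]
    return beta, se, pvalue

beta, se, pvalue = triple_indices(num_files=3)
print(beta, se, pvalue)

# Total number of columns implied by this layout for a given number of files.
def total_columns(num_files):
    return 4 + 3 * num_files

print(total_columns(49))
```

Under these assumptions, a merged table with 151 columns would correspond to 49 input files (4 + 3 × 49). Because the indices are positional, they only hold if every join adds its three columns in the same order.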

In [None]:
%%R
# Call the function:
process_data(input_folder = "/content/drive/MyDrive/Colab/Long_COVID/eQTL/Input",
             pattern = "\\.txt\\.gz$", # list.files() expects a regular expression, not a glob
             columns_to_select = c("gene_id",
                                   "gene_name",
                                   "rs_id_dbSNP151_GRCh38p7",
                                   "variant_id",
                                   "slope",
                                   "slope_se",
                                   "pval_nominal"),
             column_names = list(gene_id = "gene_id",
                                 gene_name = "gene_name",
                                 variant_id = "variant_id",
                                 chr_id = "chr_id",
                                 beta_names = NULL,
                                 SE_names = NULL,
                                 pvalue_names = NULL),
             output_path = "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues.txt",
             output_delimiter = "\t")
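
The row filter inside `process_data` keeps a row when the numbers of non-missing beta, SE, and p-value entries are equal. The sketch below, with hypothetical function names and a made-up two-tissue row, contrasts that count-based check with a stricter per-tissue check. It is an illustration of the difference, not a replacement for the R code.

```python
def counts_match(row, beta_cols, se_cols, p_cols):
    """True when the row has equally many non-missing betas, SEs and p-values."""
    def n_present(cols):
        return sum(row[c] is not None for c in cols)
    return n_present(beta_cols) == n_present(se_cols) == n_present(p_cols)

def aligned_per_tissue(row, beta_cols, se_cols, p_cols):
    """Stricter variant: each tissue has all three values or none of them."""
    return all(
        len({row[b] is None, row[s] is None, row[p] is None}) == 1
        for b, s, p in zip(beta_cols, se_cols, p_cols)
    )

# Hypothetical row: beta from tissue 1, SE and p-value from tissue 2.
row = {"beta_1": 0.5, "SE_1": None, "pvalue_1": None,
       "beta_2": None, "SE_2": 0.1, "pvalue_2": 0.01}
cols = (["beta_1", "beta_2"], ["SE_1", "SE_2"], ["pvalue_1", "pvalue_2"])

print(counts_match(row, *cols))
print(aligned_per_tissue(row, *cols))
```

A row like this passes the count-based check while failing the per-tissue one. Whether such rows can occur in the GTEx files is not established here, so the stricter check is only a possible safeguard.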

In [None]:
%%R
# Open the previous file:
eqtl_all_tissues <- read.table(file = "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues.txt",
                               header = TRUE,
                               sep = "")

print(dim(eqtl_all_tissues)) # Number of rows and columns
print(length(unique(eqtl_all_tissues$gene_id))) # Number of unique genes
print(length(unique(eqtl_all_tissues$variant_id))) # Number of unique rsIDs
print(head(eqtl_all_tissues)) # Check the format of the table

[1] 1008494     151
[1] 39832
[1] 828935
            gene_id     gene_name   variant_id              chr_id    beta_1
1 ENSG00000227232.5        WASH7P  rs769952832  chr1_64764_C_T_b38  0.586346
2 ENSG00000268903.1 RP11-34P13.15  rs866355763 chr1_103147_C_T_b38 -0.612097
3 ENSG00000269981.1 RP11-34P13.16   rs62642117 chr1_108826_G_C_b38  0.431229
4 ENSG00000241860.6 RP11-34P13.13  rs201327123  chr1_14677_G_A_b38  0.700658
5 ENSG00000279457.4 RP11-34P13.18  rs188376087 chr1_599167_G_A_b38 -0.687794
6 ENSG00000228463.9    AP006222.2 rs1206875823 chr1_280550_G_A_b38 -1.822030
     beta_2 beta_3   beta_4   beta_5    beta_6 beta_7    beta_8   beta_9
1        NA     NA       NA       NA        NA     NA        NA       NA
2        NA     NA       NA       NA        NA     NA        NA       NA
3        NA     NA 0.727510       NA  0.907156     NA        NA       NA
4  0.955911     NA 0.844491 0.841275  0.924220     NA  0.832645 0.855631
5        NA     NA       NA       NA -0.672016     NA  

# **Ensembl**

The datasets were built on different Ensembl releases. It is good practice to check the releases and bring all datasets to the same, most recent one, because releases can differ in the following ways:

- **Gene Annotations:** Different Ensembl versions might have different annotations for genes, transcripts, and regulatory regions. This could affect the identification of genetic variants that are used as instrumental variables in MR.

- **Variant Mapping:** SNP IDs and their locations can change between versions. This may lead to inconsistencies in variant mapping across datasets, affecting the selection of instrumental variables.

- **InDels and Structural Variants:** Information about insertions, deletions, and structural variants might vary across versions, which could impact the analysis if these types of variants are included.

- **Functional Annotations:** Different versions may have different functional annotations (e.g., missense, synonymous) for variants, affecting the biological interpretation.

- **Compatibility with Other Tools:** Tools used in MR analysis may be optimized for specific Ensembl versions, so using different versions might lead to incompatibilities or suboptimal performance.

- **Population Stratification and Cohort Differences:** If the eQTL and GWAS datasets are from different populations or cohorts, differences in Ensembl versions might exacerbate underlying heterogeneity, further complicating the analysis.

## **eQTL**

**Ensembl 88 to 110:**

- GTEx v8 was built with GENCODE version 26, which corresponds to Ensembl release 88.
- The gene annotations should be updated to Ensembl release 110.

https://gtexportal.org/home/releaseInfoPage

<img src="https://drive.google.com/uc?export=view&id=1COIP2zk_tZfJOzMSouefKDFKFoHQ9IGA" alt="drawing" width="300"/>

https://www.gencodegenes.org/human/releases.html

<img src="https://drive.google.com/uc?export=view&id=1NlUXWwaVdW5cmm_nPF_kEvoHTXIfIZSa" alt="drawing" width="500"/>

**ENSG_IDs**

Gene stable IDs (without version numbers) tend to remain unchanged across releases, but it is still good practice to verify them.
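
Removing the version suffix is the first step in comparing IDs across releases. The following is a minimal sketch of that operation, mirroring a "drop the first dot and everything after it" rule. The function name `strip_version` is hypothetical, and the second ID is only illustrative of GENCODE IDs that carry a tag such as `_PAR_Y` after the version.

```python
import re

def strip_version(gene_id):
    """Drop the first dot and everything after it."""
    return re.sub(r"\..*", "", gene_id)

print(strip_version("ENSG00000227232.5"))

# An ID with a tag after the version collapses to the same stable ID
# as its untagged counterpart (illustrative ID, not taken from the data).
print(strip_version("ENSG00000182378.13_PAR_Y"))
```

Because tagged and untagged IDs collapse to the same stable ID, it may be worth checking for such suffixes before stripping if the two copies need to stay distinct.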

In [None]:
%%R
# Step 1: Install and load dplyr:
install.packages("dplyr")
library(dplyr)

In [None]:
%%R
# Step 2: Install and load BiocManager:
install.packages("BiocManager")
library(BiocManager)

In [None]:
%%R
# Step 3: Install and load biomaRt:
BiocManager::install("biomaRt")
library(biomaRt)

In [None]:
%%R
# Step 4: Open the eQTL file:
eqtl_all_tissues <- read.table("/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues.txt",
                               header = TRUE,
                               sep = "",
                               stringsAsFactors = FALSE)
print(dim(eqtl_all_tissues))
print(head(eqtl_all_tissues))

[1] 1008494     151
            gene_id     gene_name   variant_id              chr_id    beta_1
1 ENSG00000227232.5        WASH7P  rs769952832  chr1_64764_C_T_b38  0.586346
2 ENSG00000268903.1 RP11-34P13.15  rs866355763 chr1_103147_C_T_b38 -0.612097
3 ENSG00000269981.1 RP11-34P13.16   rs62642117 chr1_108826_G_C_b38  0.431229
4 ENSG00000241860.6 RP11-34P13.13  rs201327123  chr1_14677_G_A_b38  0.700658
5 ENSG00000279457.4 RP11-34P13.18  rs188376087 chr1_599167_G_A_b38 -0.687794
6 ENSG00000228463.9    AP006222.2 rs1206875823 chr1_280550_G_A_b38 -1.822030
     beta_2 beta_3   beta_4   beta_5    beta_6 beta_7    beta_8   beta_9
1        NA     NA       NA       NA        NA     NA        NA       NA
2        NA     NA       NA       NA        NA     NA        NA       NA
3        NA     NA 0.727510       NA  0.907156     NA        NA       NA
4  0.955911     NA 0.844491 0.841275  0.924220     NA  0.832645 0.855631
5        NA     NA       NA       NA -0.672016     NA        NA       NA
6 -

In [None]:
%%R
# Step 5: Delete the .# at the end of each gene_id:
eqtl_all_tissues$gene_id <- sub("\\..*", "", eqtl_all_tissues$gene_id)
print(dim(eqtl_all_tissues))
print(head(eqtl_all_tissues))

[1] 1008494     151
          gene_id     gene_name   variant_id              chr_id    beta_1
1 ENSG00000227232        WASH7P  rs769952832  chr1_64764_C_T_b38  0.586346
2 ENSG00000268903 RP11-34P13.15  rs866355763 chr1_103147_C_T_b38 -0.612097
3 ENSG00000269981 RP11-34P13.16   rs62642117 chr1_108826_G_C_b38  0.431229
4 ENSG00000241860 RP11-34P13.13  rs201327123  chr1_14677_G_A_b38  0.700658
5 ENSG00000279457 RP11-34P13.18  rs188376087 chr1_599167_G_A_b38 -0.687794
6 ENSG00000228463    AP006222.2 rs1206875823 chr1_280550_G_A_b38 -1.822030
     beta_2 beta_3   beta_4   beta_5    beta_6 beta_7    beta_8   beta_9
1        NA     NA       NA       NA        NA     NA        NA       NA
2        NA     NA       NA       NA        NA     NA        NA       NA
3        NA     NA 0.727510       NA  0.907156     NA        NA       NA
4  0.955911     NA 0.844491 0.841275  0.924220     NA  0.832645 0.855631
5        NA     NA       NA       NA -0.672016     NA        NA       NA
6 -2.041440     N

In [None]:
%%R
# Step 6: Print the current Mart database:
print(listMarts())

               biomart                version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 110
2   ENSEMBL_MART_MOUSE      Mouse strains 110
3     ENSEMBL_MART_SNP  Ensembl Variation 110
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 110


In [None]:
%%R
# Step 7: Print the archived databases links:
attributes <- listEnsemblArchives()
print(attributes[,1:4])

             name     date                                 url version
1  Ensembl GRCh37 Feb 2014          https://grch37.ensembl.org  GRCh37
2     Ensembl 110 Jul 2023 https://jul2023.archive.ensembl.org     110
3     Ensembl 109 Feb 2023 https://feb2023.archive.ensembl.org     109
4     Ensembl 108 Oct 2022 https://oct2022.archive.ensembl.org     108
5     Ensembl 107 Jul 2022 https://jul2022.archive.ensembl.org     107
6     Ensembl 106 Apr 2022 https://apr2022.archive.ensembl.org     106
7     Ensembl 105 Dec 2021 https://dec2021.archive.ensembl.org     105
8     Ensembl 104 May 2021 https://may2021.archive.ensembl.org     104
9     Ensembl 103 Feb 2021 https://feb2021.archive.ensembl.org     103
10    Ensembl 102 Nov 2020 https://nov2020.archive.ensembl.org     102
11    Ensembl 101 Aug 2020 https://aug2020.archive.ensembl.org     101
12    Ensembl 100 Apr 2020 https://apr2020.archive.ensembl.org     100
13     Ensembl 99 Jan 2020 https://jan2020.archive.ensembl.org      99
14    

In [None]:
%%R
# Step 8: Print the marts available in Ensembl release 110:
print(listMarts(host='https://jul2023.archive.ensembl.org'))

               biomart                version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 110
2   ENSEMBL_MART_MOUSE      Mouse strains 110
3     ENSEMBL_MART_SNP  Ensembl Variation 110
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 110


In [None]:
%%R
# Step 9: Connect to the Ensembl 110 mart to query it:
# Note: the GTEx v8 eQTL data were built on Ensembl release 88.
ensembl_110 <- useMart(host='https://jul2023.archive.ensembl.org',
                       biomart='ENSEMBL_MART_ENSEMBL',
                       dataset='hsapiens_gene_ensembl')

In [None]:
%%R
# Step 10: Check the current Ensembl attributes to select the desired ones:
print(head(listAttributes(ensembl_110)))

                           name                  description         page
1               ensembl_gene_id               Gene stable ID feature_page
2       ensembl_gene_id_version       Gene stable ID version feature_page
3         ensembl_transcript_id         Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5            ensembl_peptide_id            Protein stable ID feature_page
6    ensembl_peptide_id_version    Protein stable ID version feature_page


In [None]:
%%R
# Step 11: Check the current Ensembl filters to select the desired ones:
print(head(listFilters(ensembl_110)))

             name              description
1 chromosome_name Chromosome/scaffold name
2           start                    Start
3             end                      End
4      band_start               Band Start
5        band_end                 Band End
6    marker_start             Marker Start


In [None]:
%%R
# Step 12: Get the Gene Information for Ensembl Version 110:
genes_110 <- getBM(attributes = c("ensembl_gene_id"),
                   filters = "ensembl_gene_id",
                   values = eqtl_all_tissues$gene_id,
                   mart = ensembl_110)

print(dim(genes_110))
print(head(genes_110))

[1] 997207      1
  ensembl_gene_id
1 ENSG00000000457
2 ENSG00000000460
3 ENSG00000000938
4 ENSG00000000971
5 ENSG00000001460
6 ENSG00000001461


Other important attributes for SNPs:
- `variation_name`
- `chromosome_name`
- `start_position`
- `end_position`
- `with_validated_snp`
- `allele`

In [None]:
%%R
# Step 13: Compare the number of rows:
print(dim(eqtl_all_tissues)) # 1,008,494 rows
print(dim(genes_110)) # 997,207 rows
print(length(unique(eqtl_all_tissues$gene_id))) # 39,832 unique gene IDs
print(length(unique(genes_110$ensembl_gene_id))) # 39,278 unique gene IDs
# 11,287 rows and 554 unique gene IDs were lost

[1] 1008494     151
[1] 997207      2
[1] 39832
[1] 39278
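
From the printed dimensions, the difference is 1,008,494 − 997,207 = 11,287 rows and 39,832 − 39,278 = 554 unique gene IDs. To see which IDs are affected rather than only how many, a set difference is enough. The following is a minimal sketch with a hypothetical function name `join_loss` and toy IDs.

```python
def join_loss(eqtl_gene_ids, returned_gene_ids):
    """Summarise what an inner join on gene_id would drop."""
    returned = set(returned_gene_ids)
    missing = sorted(set(eqtl_gene_ids) - returned)
    rows_lost = sum(g not in returned for g in eqtl_gene_ids)
    return {"rows_lost": rows_lost,
            "ids_lost": len(missing),
            "missing_ids": missing}

# Toy input with hypothetical IDs: five eQTL rows, one gene absent from the mart.
eqtl_ids = ["ENSG01", "ENSG01", "ENSG02", "ENSG03", "ENSG03"]
mart_ids = ["ENSG01", "ENSG02"]
summary = join_loss(eqtl_ids, mart_ids)
print(summary)
```

Listing the missing IDs makes it possible to check whether they were retired, merged, or renamed between releases.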


In [None]:
%%R
# Step 14: Check and filter to ensure that each ensembl_gene_id appears only once:
genes_110 <- genes_110 %>%
  group_by(ensembl_gene_id) %>%
  slice(1) %>%
  ungroup()

In [None]:
%%R
# Step 15: Update the eQTL dataset:
eqtl_all_tissues_110 <- eqtl_all_tissues %>%
  inner_join(genes_110, by = c("gene_id" = "ensembl_gene_id"))
print(dim(eqtl_all_tissues_110))
print(head(eqtl_all_tissues_110))

[1] 997207    151
          gene_id     gene_name   variant_id              chr_id    beta_1
1 ENSG00000227232        WASH7P  rs769952832  chr1_64764_C_T_b38  0.586346
2 ENSG00000268903 RP11-34P13.15  rs866355763 chr1_103147_C_T_b38 -0.612097
3 ENSG00000269981 RP11-34P13.16   rs62642117 chr1_108826_G_C_b38  0.431229
4 ENSG00000241860 RP11-34P13.13  rs201327123  chr1_14677_G_A_b38  0.700658
5 ENSG00000279457 RP11-34P13.18  rs188376087 chr1_599167_G_A_b38 -0.687794
6 ENSG00000228463    AP006222.2 rs1206875823 chr1_280550_G_A_b38 -1.822030
     beta_2 beta_3   beta_4   beta_5    beta_6 beta_7    beta_8   beta_9
1        NA     NA       NA       NA        NA     NA        NA       NA
2        NA     NA       NA       NA        NA     NA        NA       NA
3        NA     NA 0.727510       NA  0.907156     NA        NA       NA
4  0.955911     NA 0.844491 0.841275  0.924220     NA  0.832645 0.855631
5        NA     NA       NA       NA -0.672016     NA        NA       NA
6 -2.041440     NA 

In [None]:
%%R
# Step 16: Export the dataset:
write.table(eqtl_all_tissues_110,
            "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt",
            sep = "\t",
            row.names = FALSE,
            quote = FALSE,
            col.names = TRUE)

**rsIDs**

Check if there are differences in the rsIDs of the eQTL based on Ensembl 88 and the rsIDs from Ensembl 110:

In [None]:
#!/bin/bash
# Interpreter for the script (the shebang must be the first line of the file).

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q normal

# Request resources:
#PBS -l ncpus=1
#PBS -l mem=100GB
#PBS -l jobfs=10GB
#PBS -l walltime=10:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS ignores directives placed after the first executable command, so keep these above it:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory where the job was submitted:
cd "$PBS_O_WORKDIR"

# Write a per-job log file (named with the job ID) instead of overwriting the same file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Specify paths
EQTL_FILE="/scratch/sq95/sp6154/eQTL/chrID_to_rsID_eQTL/eqtl_all_tissues.txt"
TEMP_EQTL_RSID="/scratch/sq95/sp6154/eQTL/chrID_to_rsID_eQTL/eqtl_rsid.tmp"
TEMP_VCF_RSID="/scratch/sq95/sp6154/eQTL/chrID_to_rsID_eQTL/vcf_rsid.tmp"
RESULT_FILE="/scratch/sq95/sp6154/eQTL/chrID_to_rsID_eQTL/missing_rsids.txt"

# Extract rsIDs (3rd column) from the eQTL dataset, skipping the header
awk 'NR>1 {print $3}' "$EQTL_FILE" > "$TEMP_EQTL_RSID"

# Extract rsIDs from all VCF files, then sort and deduplicate
> "$TEMP_VCF_RSID" # Initialize the temp VCF rsID file
for i in {1..22} X Y; do
    zcat "/scratch/sq95/sp6154/eQTL/chrID_to_rsID_eQTL/version_110/homo_sapiens-chr${i}.vcf.gz" | awk '!/^#/ {print $3}' >> "$TEMP_VCF_RSID"
done
sort -u -o "$TEMP_VCF_RSID" "$TEMP_VCF_RSID"

# Find rsIDs from the eQTL dataset that are not in the VCF files and save them to RESULT_FILE.
# -F -x force exact whole-line matches (otherwise the pattern rs123 would also match rs1234):
grep -v -F -x -f "$TEMP_VCF_RSID" "$TEMP_EQTL_RSID" > "$RESULT_FILE"

# Cleanup
rm $TEMP_EQTL_RSID $TEMP_VCF_RSID

echo "Missing rsIDs saved in $RESULT_FILE"

###########################################################################################################

# Exit cleanly
exit 0

**Result:**

I got 0 rows in the `missing_rsids.txt` file, which means there are no missing rsIDs in the `eqtl_all_tissues.txt` file.

All the SNPs in `eqtl_all_tissues.txt` (based on Ensembl 88) are present in Ensembl 110.
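
When two ID lists are compared, exact matching matters: `grep -f` without `-F -x` treats each ID as a pattern that can match inside a longer ID (`rs123` matches `rs1234`), which can hide missing IDs. The sketch below contrasts the two behaviours on toy data; the function names and rsIDs are hypothetical.

```python
def missing_exact(query_ids, reference_ids):
    """IDs in query_ids with no identical entry in reference_ids."""
    reference = set(reference_ids)
    return [rsid for rsid in query_ids if rsid not in reference]


def missing_substring(query_ids, reference_ids):
    """Mimic pattern matching: an ID counts as present if any reference ID occurs inside it."""
    return [q for q in query_ids if not any(ref in q for ref in reference_ids)]


eqtl_rsids = ["rs123", "rs1234", "rs99"]
vcf_rsids = ["rs123", "rs99"]

exact_result = missing_exact(eqtl_rsids, vcf_rsids)
substring_result = missing_substring(eqtl_rsids, vcf_rsids)
print(exact_result)      # rs1234 is reported as missing
print(substring_result)  # nothing is reported, because rs123 matches inside rs1234
```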

## **GWAS**

**Ensembl 109 to 110**:

`rsID`:
- Check for any deprecated or merged rsIDs by comparing with Ensembl 110.
- rsIDs are assigned by dbSNP and keep their meaning across Ensembl releases: once an rsID is assigned to a variant, it is not reassigned to a different one. An rsID can, however, be merged into another rsID or withdrawn in a later dbSNP build, which is why the check above is needed.
- **Content**: While the rsID's meaning remains consistent, the content or the number of SNPs in different releases may vary. New SNPs might be added, and some might be removed or merged.
- **Annotations**: The annotations or additional information related to each SNP (like its effect, associated traits, etc.) might be updated or refined in newer releases.
- **Genomic Coordinates**: As reference genomes get updated, the genomic coordinates of SNPs might change. This doesn't change the rsID, but the position on a chromosome might be different between two versions of the reference genome.
- **File Formats and Structure**: The structure or format of the files provided by Ensembl might differ between releases. While the core information remains consistent, there might be additional columns or fields in newer releases.

Check if there are differences in the rsIDs of the GWAS datasets based on Ensembl 109 and the rsIDs from Ensembl 110:

In [None]:
#!/bin/bash
# Interpreter for the script (the shebang must be the first line of the file).

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q normal

# Request resources:
#PBS -l ncpus=1
#PBS -l mem=100GB
#PBS -l jobfs=10GB
#PBS -l walltime=10:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS ignores directives placed after the first executable command, so keep these above it:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory where the job was submitted:
cd "$PBS_O_WORKDIR"

# Write a per-job log file (named with the job ID) instead of overwriting the same file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################

# Specify the work to be done:

# Temporary file for the rsIDs from the GWAS dataset
TEMP_GWAS_RSID="/scratch/sq95/sp6154/GWAS/gwas_rsid.tmp"

# Results file prefix
RESULT_PREFIX="/scratch/sq95/sp6154/GWAS/missing_rsids"

# Loop through the GWAS datasets
for GWAS_FILE in long_covid_gwas_1_subset.txt long_covid_gwas_2_subset.txt long_covid_gwas_3_subset.txt long_covid_gwas_4_subset.txt
do
    # Extract the rsIDs (1st column) from the current GWAS dataset, skipping the header
    awk 'NR>1 {print $1}' "$GWAS_FILE" > "$TEMP_GWAS_RSID"

    # Generate a distinct result file name based on the current GWAS dataset
    RESULT_FILE="${RESULT_PREFIX}_$(basename "$GWAS_FILE" .txt).result"

    # Loop through the VCF files
    for i in {1..22} X Y
    do
        # Path to the VCF file
        VCF_FILE="/scratch/sq95/sp6154/eQTL/chrID_to_rsID_eQTL/version_110/homo_sapiens-chr${i}.vcf.gz"

        # Use zcat to read the gzipped VCF and awk to extract the 3rd column (ID).
        # grep -v -F -x -f - keeps only the GWAS rsIDs that do not exactly match any VCF ID.
        # grep exits with status 1 when no rsIDs are left, so mv runs unconditionally.
        zcat "$VCF_FILE" | awk '!/^#/ {print $3}' | grep -v -F -x -f - "$TEMP_GWAS_RSID" > "${TEMP_GWAS_RSID}.temp"
        mv "${TEMP_GWAS_RSID}.temp" "$TEMP_GWAS_RSID"
    done

    # Save the remaining (missing) rsIDs to the result file
    cp "$TEMP_GWAS_RSID" "$RESULT_FILE"

    echo "Missing rsIDs for $GWAS_FILE saved in $RESULT_FILE"
done

# Remove temporary file
rm $TEMP_GWAS_RSID

###########################################################################################################

# Exit cleanly
exit 0

**Result:**

I got 0 rows in every `missing_rsids_*.result` file, which means there are no missing rsIDs in any of the GWAS datasets.

All the SNPs in the GWAS datasets (based on Ensembl 109) are present in Ensembl 110.
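
The per-chromosome check can also be written as a streaming set difference, which avoids loading every VCF ID into memory at once. This is a minimal sketch under stated assumptions: `vcf_ids`, `remaining_after`, the toy VCF file and the rsID `rs0000` are hypothetical, and the real VCF files would be passed in place of the toy one.

```python
import gzip
import os
import tempfile


def vcf_ids(path):
    """Yield the ID column (3rd field) of every non-header record in a gzipped VCF."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                yield fields[2]


def remaining_after(query_ids, vcf_paths):
    """Remove every query ID found in any VCF; what remains is missing from all of them."""
    remaining = set(query_ids)
    for path in vcf_paths:
        remaining.difference_update(vcf_ids(path))
        if not remaining:
            break  # nothing left to look for
    return remaining


# Toy gzipped VCF with two records.
toy_vcf = os.path.join(tempfile.mkdtemp(), "toy-chr1.vcf.gz")
with gzip.open(toy_vcf, "wt") as out:
    out.write("##fileformat=VCFv4.1\n")
    out.write("#CHROM\tPOS\tID\tREF\tALT\n")
    out.write("1\t13550\trs554008981\tG\tA\n")
    out.write("1\t14677\trs201327123\tG\tA\n")

still_missing = remaining_after(["rs554008981", "rs0000"], [toy_vcf])
print(sorted(still_missing))
```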

## **WGS**

**Ensembl 88**:

- Subset the WGS dataset with the rsIDs from the eQTL dataset.
- Because the rsIDs in the eQTL dataset based on Ensembl 88 are the same as those in Ensembl 110, it is not necessary to convert the rsIDs from Ensembl 88 to Ensembl 110.

**VCF file to use if the rsIDs need to be updated from one Ensembl version to another:**

Both `homo_sapiens_somatic.vcf.gz` and `homo_sapiens.vcf.gz` are VCF (Variant Call Format) files, which contain information about variants found in the genome. However, they represent different types of variants:

1. **homo_sapiens.vcf.gz**:
   - This file contains germline variants for Homo sapiens.
   - Germline variants are present in every cell of an individual and are inherited from parents. They are found in both somatic (body) cells and germ cells (sperm and egg).
   - These are the typical variants one might think of when considering genetic differences between individuals, such as the variants responsible for eye color, height, susceptibility to certain inherited diseases, etc.

2. **homo_sapiens_somatic.vcf.gz**:
   - This file contains somatic variants for Homo sapiens.
   - Somatic variants arise during an individual's lifetime in a particular cell and are passed on only to the descendants of that cell. They are not present in germ cells and, therefore, are not inherited by offspring.
   - Somatic mutations are often associated with cancers, as they can accumulate in cells over time and may lead to uncontrolled cell growth if they occur in certain genes.

In [None]:
#!/bin/bash
# Run this script in GADI. The shebang above sets the interpreter and must be the first line of the file.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q normal

# Request resources:
#PBS -l ncpus=1
#PBS -l mem=100GB
#PBS -l jobfs=10GB
#PBS -l walltime=01:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS ignores directives placed after the first executable command, so keep these above it:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory where the job was submitted:
cd "$PBS_O_WORKDIR"

# Write a per-job log file (named with the job ID) instead of overwriting the same file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Step 1: Build the ID map for PLINK (old chr_id in column 1, new rsID in column 2),
# keeping the first row for each rsID. The eQTL file was exported as tab-separated:
awk -F'\t' 'FNR>1 && !seen[$3]++ {print $4, $3}' eqtl_all_tissues_110.txt > ids_to_update_unique.txt

# Step 2: Rename the variants in the GTEx WGS data from chr_id to rsID wherever they match the map:
/scratch/sq95/sp6154/PLINK/plink --bfile WGS_GTEx --update-name ids_to_update_unique.txt --make-bed --out WGS_GTEx_covid_chrID

# Step 3: Create a list of unique rsIDs from the eQTL file (header skipped):
awk -F'\t' 'FNR>1 && !seen[$3]++ {print $3}' eqtl_all_tissues_110.txt > rs_IDs.txt

# Step 4: Keep only the GTEx variants whose rsID is in the list obtained from the eQTL file:
/scratch/sq95/sp6154/PLINK/plink --bfile WGS_GTEx_covid_chrID --extract rs_IDs.txt --make-bed --out WGS_GTEx_covid_rsID

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
%%bash
# Check the results:
wc /content/drive/MyDrive/Colab/Long_COVID/GTEx/WGS_GTEx_covid_rsID.bim
head /content/drive/MyDrive/Colab/Long_COVID/GTEx/WGS_GTEx_covid_rsID.bim

  820792  4924752 23687328 /content/drive/MyDrive/Colab/MR_MtRobin/Long_COVID/GTEx/WGS_GTEx_covid_rsID.bim
1	rs554008981	0	13550	A	G
1	rs201327123	0	14677	A	G
1	rs62636368	0	16841	T	G
1	rs3891260	0	16856	G	A
1	rs372841554	0	17407	A	G
1	rs747093451	0	17408	G	C
1	rs376731495	0	17722	G	A
1	rs1190943651	0	17730	A	C
1	rs1209835338	0	20254	A	G
1	rs141149254	0	54490	A	G


**Results:**

**eQTL:** 828,935 SNPs

**GTEx_filtered:** 820,792 SNPs

I lost 8,143 SNPs
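
The loss is the difference between the two counts (828,935 − 820,792 = 8,143). To see which SNPs were lost, the rsID list can be compared with the variant IDs in the `.bim` file. A minimal Python sketch on toy data follows; `bim_variant_ids`, `lost_snps` and the rsID `rs0000001` are hypothetical, and a real run would open the `.bim` file instead of the in-memory example.

```python
import io


def bim_variant_ids(handle):
    """Collect variant IDs (2nd column) from a PLINK .bim file handle."""
    return {line.split()[1] for line in handle if line.strip()}


def lost_snps(eqtl_rsids, bim_handle):
    """rsIDs from the eQTL list that are absent from the .bim file."""
    return sorted(set(eqtl_rsids) - bim_variant_ids(bim_handle))


# Toy .bim content: chromosome, variant ID, genetic distance, position, allele 1, allele 2.
toy_bim = io.StringIO(
    "1\trs554008981\t0\t13550\tA\tG\n"
    "1\trs201327123\t0\t14677\tA\tG\n"
)
eqtl_rsids = ["rs554008981", "rs201327123", "rs0000001"]

lost = lost_snps(eqtl_rsids, toy_bim)
print(lost)
print(828935 - 820792)
```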

# **MtRobin**

## **General Setup**

### **Packages**

In [None]:
%%R

# Install and load the lme4 package:
install.packages("lme4")
library(lme4)

In [None]:
%%R

# Install and load the dplyr package:
install.packages("dplyr")
library(dplyr)

In [None]:
%%R

# Install and load the httr package:
install.packages("httr")
library(httr)

In [None]:
%%R

# Install and load the jsonlite package:
install.packages("jsonlite")
library(jsonlite)

### **GADI**

In [None]:
# Create the .def file for the select_setup environment:
# Name: select_setup.def

Bootstrap: docker
From: rocker/r-base:4.2.2

%files
    /mnt/c/Users/pinsy007/Downloads/working/libicu60_60.2-3ubuntu3_amd64.deb /mnt/c/Users/pinsy007/Downloads/working/libicu60_60.2-3ubuntu3_amd64.deb

%post
    apt-get update
    apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev libicu-dev liblapack-dev libblas-dev gfortran

    # Install libicu60 from the provided .deb file
    dpkg -i /mnt/c/Users/pinsy007/Downloads/working/libicu60_60.2-3ubuntu3_amd64.deb

    echo "Checking R lib paths..."
    R -e ".libPaths()"

    echo "Installing the packages separately..."
    # External packages:
    R -e "install.packages('matrixStats', repos='http://cran.rstudio.com/')"
    R -e "install.packages('reshape2', repos='http://cran.rstudio.com/')"
    R -e "install.packages('BiocManager', repos='http://cran.rstudio.com/')"
    R -e "BiocManager::install('snpStats')"
    R -e "install.packages('stringi', repos='http://cran.rstudio.com/')"

%runscript
    exec R "$@"

In [None]:
# Run the following command on my local computer to create the .sif file:
$ singularity build --fakeroot select_setup.sif select_setup.def

In [None]:
# Create the .def file for the main_function environment:
# Name: main_function.def

Bootstrap: docker
From: rocker/r-base:4.2.2

%post
    apt-get update && apt-get install -y cmake \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

    R -e "install.packages(c('nloptr', 'lme4', 'data.table'), repos='http://cran.rstudio.com/')"

%runscript
    exec R "$@"

In [None]:
# Run the following command on my local computer to create the .sif file:
$ singularity build --fakeroot main_function.sif main_function.def

In [None]:
# Create the .def file for the MR_MtRobin_resample environment to calculate the p_values and FDR:
# Name: p_values.def

Bootstrap: docker
From: rocker/r-base:4.2.2

%post
    apt-get update && apt-get install -y cmake \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

    R -e "install.packages(c('nloptr', 'lme4', 'mvtnorm', 'data.table'), repos='http://cran.rstudio.com/')"

%runscript
    exec R "$@"

In [None]:
# Run the following command on my local computer to create the .sif file:
$ sudo singularity build p_values.sif p_values.def

Two functions are used to select the instrumental variables (IVs). They follow the forward and backward selection strategies from statistical modeling, which are most common in multiple regression but apply to other modeling contexts as well.

1. **Forward selection:**
- **Starting Point:** Begins with an empty model (i.e., no predictors/independent variables).
- **Process**: At each step, the variable that gives the most significant improvement to the model is added.
- **Stopping Criterion**: Continues until adding more variables does not produce a statistically significant improvement in the model's fit.
- **Advantages**:
  - Can be computationally faster when there are many predictors since it doesn't start with a model containing all predictors.
  - It might never include some predictors if they don't meet the inclusion criteria.

2. **Backward selection:**
- **Starting Point**: Begins with a full model that includes all potential predictors/independent variables.
- **Process**: At each step, the least significant variable (i.e., the one that contributes the least to the model) is removed.
- **Stopping Criterion**: Continues removing variables until removing more would result in a statistically significant loss of fit.
- **Advantages**:
  - It starts with all predictors in the model, so each one is evaluated in the presence of the others; forward selection can miss predictors that are only informative jointly.
  - More likely to find a globally optimal model in certain situations.

In summary, forward selection starts with no predictors and sequentially adds them, while backward selection starts with all predictors and sequentially removes them.

<img src="https://drive.google.com/uc?export=view&id=1ZxvYoCDcjA6CkqjcQOKcq4br8yoUy2yx" alt="drawing" width="700"/>

**Steps in the functions:**
1. **`select_IV_fs` (Forward Selection)**:
   - Start with an empty set of SNPs.
   - For each SNP (from `snp_set`), compute its p-value (`pval`).
   - If this p-value is below a threshold (`pval_thresh`), the SNP is considered significant and added to the SNP set for the gene.
   - Check for the LD (Linkage Disequilibrium) threshold and ensure that the newly added SNP is not in high LD with any of the previously added SNPs.
   - Filter each SNP based on the number of tissues (`nTiss_thresh`) in which the SNP is significant.

2. **`select_IV_bs` (Backward Selection)**:
   - Start with a full set of SNPs based on the p-value threshold (`pval_thresh`).
   - Iteratively remove SNPs from this set:
     - Identify SNPs that are in high LD (greater than `ld_thresh`) with another SNP in the set.
     - Among these high LD SNPs, keep the one with the lowest p-value and remove the others.
   - Continue until no pairs of SNPs in the set have LD above the threshold.

Based on these descriptions, the two functions use the following criteria to include or exclude SNPs:

1. **P-value of the SNP**: A threshold (`pval_thresh`) is used to determine if the SNP's association with the gene expression (or another phenotype) is statistically significant.

2. **LD (Linkage Disequilibrium) between SNPs**: SNPs that are in high LD can provide redundant information, so there's a threshold (`ld_thresh`) to ensure selected SNPs are not in high LD.

3. **Number of Tissues**: In the forward selection method, a SNP needs to be significant in a certain number of tissues (`nTiss_thresh`).

To sum it up:
- Forward selection adds SNPs based on their p-values, ensuring LD criteria and the number of tissues are met.
- Backward selection removes SNPs based on LD criteria. The initial set is based on the p-value threshold.
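
The two strategies can be sketched on toy data as follows. This is an illustration of the steps described above, not the actual `select_IV_fs` or `select_IV_bs` code: the Python function names, the SNP IDs, the p-values and the pairwise LD values are all hypothetical, and the thresholds are chosen only for the example.

```python
def forward_select(snps, ld, pval_thresh, ld_thresh, ntiss_thresh=1):
    """Start empty; add SNPs in order of p-value if they pass all three criteria."""
    selected = []
    for snp in sorted(snps, key=lambda s: snps[s]["pval"]):
        info = snps[snp]
        if info["pval"] >= pval_thresh or info["n_tissues"] < ntiss_thresh:
            continue
        # Only add the SNP if it is not in high LD with any SNP already selected.
        if all(ld[frozenset((snp, kept))] <= ld_thresh for kept in selected):
            selected.append(snp)
    return selected


def backward_select(snps, ld, pval_thresh, ld_thresh):
    """Start with all significant SNPs; drop the weaker SNP of each high-LD pair."""
    kept = [s for s in snps if snps[s]["pval"] < pval_thresh]
    while True:
        pairs = [(a, b) for i, a in enumerate(kept) for b in kept[i + 1:]
                 if ld[frozenset((a, b))] > ld_thresh]
        if not pairs:
            return sorted(kept, key=lambda s: snps[s]["pval"])
        a, b = pairs[0]
        kept.remove(a if snps[a]["pval"] > snps[b]["pval"] else b)


# Toy data: rsA and rsB are in high LD; rsD is not significant.
toy_snps = {
    "rsA": {"pval": 1e-6, "n_tissues": 3},
    "rsB": {"pval": 1e-4, "n_tissues": 2},
    "rsC": {"pval": 5e-4, "n_tissues": 1},
    "rsD": {"pval": 0.2, "n_tissues": 1},
}
toy_ld = {
    frozenset(("rsA", "rsB")): 0.80,
    frozenset(("rsA", "rsC")): 0.10,
    frozenset(("rsB", "rsC")): 0.05,
}

fs_result = forward_select(toy_snps, toy_ld, pval_thresh=0.001, ld_thresh=0.5)
bs_result = backward_select(toy_snps, toy_ld, pval_thresh=0.001, ld_thresh=0.5)
print(fs_result, bs_result)
```

On this toy input both strategies keep `rsA` and `rsC`; on real data they can differ, because the order in which SNPs are added or removed changes which LD pairs are evaluated.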

## **Common Functions**

In [None]:
%%R

# Create a function to organize the Mt_Robin results for significant and non-significant genes:

# Load necessary libraries:
#install.packages("httr")
#library(httr)
#install.packages("jsonlite")
#library(jsonlite)
#install.packages("dplyr")
#library(dplyr)
#install.packages("tidyverse")
#library(tidyverse)

analyze_gene_data <- function(result_main_path,
                              eqtl_long_covid_exposure_path,
                              save_files = list()) {
  # Read the input data:
  result_main <- readRDS(result_main_path)
  eqtl_long_covid_exposure <- read.table(eqtl_long_covid_exposure_path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
  # Filter out errors:
  result_main <- result_main[!sapply(result_main, function(x) inherits(x, "simpleError"))]
  # Create a data frame:
  results <- data.frame(gene_id = character(),
                        beta_y = numeric(),
                        Obs = integer(),
                        SNPs = integer(),
                        REML = numeric(),
                        SD = numeric(),
                        Residual_SD = numeric(),
                        SNP_IDs = character(),
                        stringsAsFactors = FALSE)
  # Loop over each gene to extract each value:
  for (i in seq_along(result_main)) {
    lme_res <- result_main[[i]]$lme_res
    gene_id <- names(result_main)[i]  # get the gene id
    # Extract the required information:
    reml <- as.numeric(-2 * logLik(lme_res))  # -2 x log-likelihood of the fit
    sd <- attr(VarCorr(lme_res)$snpID, "stddev")  # SD of the snpID random effect
    res_sd <- sigma(lme_res)  # residual SD
    obs <- length(residuals(lme_res))  # number of observations
    snp_ids <- rownames(ranef(lme_res)$snpID)  # one random-effect level per SNP
    snps <- length(snp_ids)  # number of SNPs
    beta_y <- fixef(lme_res)["beta_y"]
    # Convert the SNP IDs to a single comma-separated string:
    snp_ids_str <- paste(snp_ids, collapse = ",")
    # Add the results to the previous empty data frame including SNP IDs:
    results <- rbind(results, data.frame(gene_id = gene_id,
                                         beta_y = beta_y,
                                         Obs = obs,
                                         SNPs = snps,
                                         SNP_IDs = snp_ids_str,
                                         REML = reml,
                                         SD = sd,
                                         Residual_SD = res_sd,
                                         row.names = NULL))
  }
  # Remove duplicates based on 'gene_id', keeping the first occurrence:
  eqtl_long_covid_exposure <- eqtl_long_covid_exposure[!duplicated(eqtl_long_covid_exposure$gene_id), ]
  # Merge the two tables based on gene_id:
  merged_data <- merge(results, eqtl_long_covid_exposure[c("gene_id","gene_name")], by = "gene_id")
  # Get the column names:
  cols <- colnames(merged_data)
  # Remove 'gene_name' from the list:
  cols <- cols[cols != "gene_name"]
  # Add 'gene_name' as the second column:
  cols <- c(cols[1], "gene_name", cols[2:length(cols)])
  # Reorder the columns:
  merged_data <- merged_data[, cols]
  # Order the results by the absolute value of the beta_y column, from highest to lowest:
  merged_data <- merged_data[order(-abs(merged_data$beta_y)), ]
  # Function to get gene types in bulk from Ensembl ensuring JSON response:
  get_gene_types_bulk <- function(gene_names) {
    url <- "https://rest.ensembl.org/lookup/symbol/homo_sapiens"
    body <- list(
      symbols = gene_names
    )
    response <- POST(url, body = toJSON(body), add_headers("Content-Type" = "application/json", "Accept" = "application/json"))
    # Return NA for every gene if the request failed or did not return JSON:
    if (http_error(response) || http_type(response) != "application/json") {
      return(rep(NA_character_, length(gene_names)))
    }
    parsed <- fromJSON(content(response, "text", encoding = "UTF-8"))
    # Look up the biotype of each gene; genes missing from the response get NA:
    biotypes <- vapply(gene_names, function(gene) {
      biotype <- parsed[[gene]]$biotype
      if (is.null(biotype)) NA_character_ else as.character(biotype)
    }, character(1))
    return(biotypes)
  }
  # Split genes into smaller chunks (e.g., 100 genes per chunk) to avoid overloading the API:
  gene_chunks <- split(merged_data$gene_name, ceiling(seq_along(merged_data$gene_name)/100))
  all_biotypes <- unlist(lapply(gene_chunks, get_gene_types_bulk))
  # Add the gene types to a new data frame as the third column:
  all_genes <- merged_data %>%
    mutate(gene_type = all_biotypes) %>%
    select(1:2, gene_type, everything())
  # Filter and sort the new dataframe:
  protein_coding_genes <- all_genes %>%
    filter(gene_type == "protein_coding") %>%
    arrange(desc(abs(beta_y)))
  # Collect the result data frames:
  result_tables <- list(
    all_genes = all_genes,
    protein_coding_genes = protein_coding_genes
  )
  # Save each result data frame (when a path is given in save_files) and print a summary:
  for (name in names(result_tables)) {
    if (name %in% names(save_files)) {
      save_path <- save_files[[name]]
      if (!dir.exists(dirname(save_path))) {
        dir.create(dirname(save_path), recursive = TRUE)
      }
      # Use tab as the separator for .txt files:
      write.table(result_tables[[name]], file = save_path, sep = "\t", row.names = FALSE, quote = FALSE)
      cat("Saved", name, "to", save_path, "\n")
    }
    cat("Dimensions of", name, ":", dim(result_tables[[name]]), "\n\n")
    print(head(result_tables[[name]]))
    cat("\n---\n")
  }
  # Return the tables invisibly so that they can be reused:
  invisible(result_tables)
}
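
The batching used for the Ensembl lookup (100 gene names per request) can be sketched without network access. The sketch below only builds the JSON request bodies for the `lookup/symbol/homo_sapiens` endpoint and parses a toy response; `chunked`, `build_lookup_payloads`, `biotypes_from_response` and the gene names are hypothetical, and no request is sent.

```python
import json

ENSEMBL_LOOKUP_URL = "https://rest.ensembl.org/lookup/symbol/homo_sapiens"


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_lookup_payloads(gene_names, chunk_size=100):
    """One JSON body per chunk, in the {"symbols": [...]} shape used by the POST lookup."""
    return [json.dumps({"symbols": chunk}) for chunk in chunked(gene_names, chunk_size)]


def biotypes_from_response(gene_names, response):
    """Biotype per requested gene; None when the gene is absent from the response."""
    return [response.get(gene, {}).get("biotype") for gene in gene_names]


toy_genes = [f"GENE{i}" for i in range(250)]
payloads = build_lookup_payloads(toy_genes)
chunk_sizes = [len(json.loads(p)["symbols"]) for p in payloads]
print(chunk_sizes)

toy_response = {"GENE0": {"biotype": "protein_coding"}}
toy_biotypes = biotypes_from_response(["GENE0", "GENE1"], toy_response)
print(toy_biotypes)
```

Returning one value per requested gene, including a placeholder for genes that are absent from the response, keeps the result aligned with the rows of the table it is attached to.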

In [None]:
%%R

# Create a function to retrieve more information from the Mt_Robin and filter the significant genes:

# Install and Load necessary libraries:
#install.packages("lme4")
#library(lme4)
#install.packages("dplyr")
#library(dplyr)

# Define the function:
process_final_data <- function(input_path_1, input_path_2, input_path_3, output_path) {

  # Set the number of decimals:
  options(digits=7) # Default by R

  # Step 1: Open the input datasets
  results_1 <- readRDS(input_path_1)
  print(paste("Causal genes (noncoding genes and protein coding genes): "))
  print(length(results_1))
  print(results_1[[1]])

  results_2 <- read.table(input_path_2, header=TRUE, sep="\t")
  print(paste("Causal genes (protein coding genes): "))
  print(dim(results_2))
  print(head(results_2))

  MR_results <- read.table(input_path_3, header=TRUE, sep="\t")
  print(paste("Causal genes (noncoding genes and protein coding genes): "))
  print(dim(MR_results))
  print(head(MR_results))

  # Step 2: Initialize an empty dataframe with additional CI columns
  extracted_results <- data.frame(
    gene_id = character(),
    gene_name = character(),
    SNPs = integer(),
    SNPs_IDs = character(),
    Obs = integer(),
    beta_y = numeric(),
    Min = numeric(),
    First_Q = numeric(),
    Median = numeric(),
    Third_Q = numeric(),
    Max = numeric(),
    SD_RE = numeric(),
    Res_SD_RE = numeric(),
    Var_RE = numeric(),
    Res_Var_RE = numeric(),
    SE_FE = numeric(),
    t_value_FE = numeric(),
    REML = numeric(),
    p_value = numeric(),
    FDR = numeric(),
    CI_lower = numeric(),
    CI_upper = numeric(),
    stringsAsFactors = FALSE
  )

  # Step 3: Loop over each gene ID in results_2, enrich with MR_results and calculate other statistics:
  for (gene_id in results_2$gene_id) {
    # Check if the gene ID exists in results_1:
    if (gene_id %in% names(results_1)) {
      lme_res <- results_1[[gene_id]][["lme_res"]]
      # If lme_res is NULL, skip to the next iteration of the loop:
      if(is.null(lme_res)) next
      summary_result <- summary(lme_res)
      residuals_summary <- summary_result$residuals
      var_corr <- VarCorr(lme_res)
      beta_y <- summary_result$coefficients["beta_y", "Estimate"]
      SE_FE <- summary_result$coefficients["beta_y", "Std. Error"]
      # Calculate 95% Confidence Intervals:
      CI_lower <- beta_y - 1.96 * SE_FE
      CI_upper <- beta_y + 1.96 * SE_FE
      # Find the corresponding p_value and FDR in MR_results:
      MR_row <- MR_results[MR_results$gene_id == gene_id,]
      p_value <- ifelse(nrow(MR_row) > 0, MR_row$pvalue, NA)
      FDR <- ifelse(nrow(MR_row) > 0, MR_row$fdr, NA)
      # Extract and prepare additional metrics from lme_res and results_2:
      Min <- min(residuals_summary)
      First_Q <- quantile(residuals_summary, 0.25)
      Median <- median(residuals_summary)
      Third_Q <- quantile(residuals_summary, 0.75)
      Max <- max(residuals_summary)
      SD_RE <- sqrt(var_corr$snpID[1])
      Var_RE <- var_corr$snpID[1]
      Res_SD_RE <- attr(var_corr, "sc")
      Res_Var_RE <- Res_SD_RE^2
      t_value_FE <- summary_result$coefficients["beta_y", "t value"]
      gene_row <- results_2[results_2$gene_id == gene_id,]
      gene_name <- gene_row$gene_name
      SNPs_IDs <- gene_row$SNP_IDs
      SNPs <- length(strsplit(as.character(SNPs_IDs), ",")[[1]])
      Obs <- gene_row$Obs
      REML <- gene_row$REML
      # Append results to the dataframe:
      extracted_results <- rbind(extracted_results, data.frame(
        gene_id, gene_name, SNPs, SNPs_IDs, Obs, beta_y,
        Min, First_Q, Median, Third_Q, Max,
        SD_RE, Res_SD_RE, Var_RE, Res_Var_RE,
        SE_FE, t_value_FE, REML, p_value, FDR, CI_lower, CI_upper, stringsAsFactors = FALSE))
    }
  }

  # Step 4: Remove all row names:
  rownames(extracted_results) <- NULL

  # Step 5: Keep only the genes with both a p_value <= 0.05 and an FDR <= 0.05:
  extracted_results_filtered <- extracted_results %>%
    filter(p_value <= 0.05 & FDR <= 0.05)

  # Step 6: Print dim and head table:
  print(paste("Final table (protein coding genes): "))
  print(dim(extracted_results_filtered))
  print(extracted_results_filtered)

  # Step 7: Export the table with only the significant genes (p_value <= 0.05 and FDR <= 0.05):
  write.table(extracted_results_filtered, file = output_path, sep = "\t", row.names = FALSE, quote = FALSE, col.names = TRUE)
}
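
The confidence interval and the significance filter used in `process_final_data` reduce to two small calculations: a Wald interval, `beta_y ± 1.96 × SE_FE`, and a joint threshold on the p-value and the FDR. A minimal Python sketch with toy numbers follows; the function names and the values are hypothetical.

```python
def wald_ci(beta, se, z=1.96):
    """95% Wald confidence interval: estimate plus or minus z standard errors."""
    return beta - z * se, beta + z * se


def is_significant(p_value, fdr, alpha=0.05):
    """A gene is kept only if both the p-value and the FDR are at or below alpha."""
    if p_value is None or fdr is None:
        return False
    return p_value <= alpha and fdr <= alpha


lower, upper = wald_ci(0.50, 0.10)
print(round(lower, 3), round(upper, 3))

kept = is_significant(0.01, 0.03)
dropped = is_significant(0.01, 0.20)
print(kept, dropped)
```

A gene with a small p-value but an FDR above 0.05 is dropped, which is the behaviour of the `filter(p_value <= 0.05 & FDR <= 0.05)` step.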

In [None]:
%%R

# Install and load necessary libraries:
#install.packages("dplyr")
#library(dplyr)
#install.packages("tidyr")
#library(tidyr)
#install.packages("matrixStats")
#library(matrixStats)

# Function to retrieve the tissues and category tissues from the eQTL to the final table:
process_eqtl_mt_robin <- function(input_eqtl_path, input_mt_robin_path, output_path) {

  # Load datasets:
  eqtl_all_tissues <- read.table(file = input_eqtl_path, header = TRUE, sep = "\t")
  mt_robin <- read.table(file = input_mt_robin_path, header = TRUE, sep = "\t")

  # Map beta columns to tissue names:
  tissue_names <- c("Subcutaneous adipose", "Visceral omentum adipose", "Adrenal gland",
                    "Aorta artery", "Coronary artery", "Tibial artery", "Amygdala brain",
                    "Anterior cingulate cortex", "Caudate", "Cerebellar hemisphere",
                    "Cerebellum", "Cortex", "Frontal cortex", "Hippocampus", "Hypothalamus",
                    "Nucleus accumbens (basal ganglia)", "Putamen (basal ganglia)", "Spinal cord",
                    "Substantia nigra", "Breast mammary tissue", "Cultured fibroblasts",
                    "EBV-transformed lymphocytes", "Sigmoid colon", "Transverse colon",
                    "Gastroesophageal junction", "Esophagus mucosa", "Esophagus muscularis",
                    "Atrial appendage", "Left ventricle", "Kidney cortex", "Liver", "Lung",
                    "Minor salivary gland", "Skeletal muscle", "Tibial nerve", "Ovary",
                    "Pancreas", "Pituitary", "Prostate", "Not sun-exposed skin (suprapubic)",
                    "Sun-exposed skin (lower leg)", "Small intestine terminal ileum", "Spleen",
                    "Stomach", "Testis", "Thyroid", "Uterus", "Vagina", "Whole blood")

  # Define a mapping from tissues to categories:
  tissue_to_category <- list(
    "Adrenal gland" = "Endocrine System",
    "Amygdala brain" = "Brain and Nervous System",
    "Anterior cingulate cortex" = "Brain and Nervous System",
    "Aorta artery" = "Cardiovascular System",
    "Atrial appendage" = "Cardiovascular System",
    "Breast mammary tissue" = "Exocrine System",
    "Caudate" = "Brain and Nervous System",
    "Cerebellar hemisphere" = "Brain and Nervous System",
    "Cerebellum" = "Brain and Nervous System",
    "Coronary artery" = "Cardiovascular System",
    "Cortex" = "Brain and Nervous System",
    "Cultured fibroblasts" = "Connective Tissue",
    "EBV-transformed lymphocytes" = "Immune System",
    "Esophagus mucosa" = "Digestive System",
    "Esophagus muscularis" = "Digestive System",
    "Frontal cortex" = "Brain and Nervous System",
    "Gastroesophageal junction" = "Digestive System",
    "Hippocampus" = "Brain and Nervous System",
    "Hypothalamus" = "Brain and Nervous System",
    "Kidney cortex" = "Excretory System",
    "Left ventricle" = "Cardiovascular System",
    "Liver" = "Digestive System",
    "Lung" = "Respiratory System",
    "Minor salivary gland" = "Exocrine System",
    "Not sun-exposed skin (suprapubic)" = "Skin",
    "Nucleus accumbens (basal ganglia)" = "Brain and Nervous System",
    "Ovary" = "Reproductive System",
    "Pancreas" = "Digestive System",
    "Pituitary" = "Endocrine System",
    "Prostate" = "Reproductive System",
    "Putamen (basal ganglia)" = "Brain and Nervous System",
    "Sigmoid colon" = "Digestive System",
    "Skeletal muscle" = "Muscular System",
    "Small intestine terminal ileum" = "Digestive System",
    "Spinal cord" = "Brain and Nervous System",
    "Spleen" = "Immune System",
    "Stomach" = "Digestive System",
    "Subcutaneous adipose" = "Adipose Tissue",
    "Substantia nigra" = "Brain and Nervous System",
    "Sun-exposed skin (lower leg)" = "Skin",
    "Testis" = "Reproductive System",
    "Thyroid" = "Endocrine System",
    "Tibial artery" = "Cardiovascular System",
    "Tibial nerve" = "Brain and Nervous System",
    "Transverse colon" = "Digestive System",
    "Uterus" = "Reproductive System",
    "Vagina" = "Reproductive System",
    "Visceral omentum adipose" = "Adipose Tissue",
    "Whole blood" = "Blood"
  )

  # Create variables for the thresholds:
  pval_thresh <- 0.001
  nTiss_thresh <- 1

  # Create a function to map the tissues for each gene/SNP combination:
  process_gene_snp <- function(gene_id, snp_ids) {
    # Filter the eQTL data for the current gene and SNPs:
    eqtl_gene_snp <- eqtl_all_tissues[eqtl_all_tissues$gene_id == gene_id & eqtl_all_tissues$variant_id %in% snp_ids, ]

    # Extract the p-value matrix and beta matrix:
    pval_mat <- as.matrix(eqtl_gene_snp[, grep("^pvalue_", colnames(eqtl_gene_snp)), drop = FALSE])
    beta_mat <- as.matrix(eqtl_gene_snp[, grep("^beta_", colnames(eqtl_gene_snp)), drop = FALSE])

    # Calculate the minimum p-value for each SNP:
    min_p <- matrixStats::rowMins(pval_mat, na.rm = TRUE)

    # Calculate the p-value threshold for the specified number of tissues:
    nTissThresh_p <- apply(pval_mat, 1, function(x) sort(x, na.last = TRUE)[nTiss_thresh])

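    # Subset SNPs that meet the p-value threshold in the specified number of tissues
    # (which() drops rows where the comparison is NA, e.g. fewer than nTiss_thresh tissues with data):
    eqtl_gene_snp_filtered <- eqtl_gene_snp[which(min_p < pval_thresh & nTissThresh_p < pval_thresh), ]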

    # Extract the beta matrix and p-value matrix for the filtered SNPs:
    beta_mat_filtered <- as.matrix(eqtl_gene_snp_filtered[, grep("^beta_", colnames(eqtl_gene_snp_filtered)), drop = FALSE])
    pval_mat_filtered <- as.matrix(eqtl_gene_snp_filtered[, grep("^pvalue_", colnames(eqtl_gene_snp_filtered)), drop = FALSE])

    # Identify tissues with significant eQTL associations:
    pvals_lt001 <- pval_mat_filtered < pval_thresh
    betas_sig <- beta_mat_filtered * pvals_lt001

    # Check if each SNP has both positive and negative significant eQTL associations:
    sig_neg_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x < -1e-15))))
    sig_pos_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x > 1e-15))))

    # Filter out SNPs with both positive and negative significant eQTL associations:
    eqtl_gene_snp_filtered <- eqtl_gene_snp_filtered[!(sig_pos_eQTL == 1 & sig_neg_eQTL == 1), ]

    # Extract the list of beta columns (tissues) for the remaining SNPs:
    beta_cols <- colnames(eqtl_gene_snp_filtered)[grepl("^beta_", colnames(eqtl_gene_snp_filtered))]

    # Filter out beta columns with all NAs:
    beta_cols <- beta_cols[colSums(!is.na(eqtl_gene_snp_filtered[, beta_cols, drop = FALSE])) > 0]

    # Count the total number of tissues:
    n_tissues <- length(beta_cols)

    # Map beta column names to tissue names:
    tissue_cols <- tissue_names[as.integer(gsub("beta_", "", beta_cols))]

    # Map tissue names to categories:
    category_cols <- unname(tissue_to_category[tissue_cols])

    # Return results:
    return(list(tissue_cols = tissue_cols, n_tissues = n_tissues, category_cols = category_cols))
  }

  # Apply the previous function for each gene/SNP combination:
  mt_robin_processed <- apply(mt_robin, 1, function(row) {
    gene_id <- row["gene_id"]
    snp_ids <- strsplit(row["SNPs_IDs"], ",")[[1]]
    result <- process_gene_snp(gene_id, snp_ids)
    return(data.frame(gene_id = gene_id, SNPs_IDs = row["SNPs_IDs"], tissue_cols = paste(result$tissue_cols, collapse = ","), n_tissues = result$n_tissues, category_cols = paste(result$category_cols, collapse = ",")))
  })
  mt_robin_processed <- do.call(rbind, mt_robin_processed)

  # Rename the columns in mt_robin_processed:
  colnames(mt_robin_processed) <- c("gene_id", "SNPs_IDs", "Tissues", "n_Tissues", "Tissue_Categories")

  # Merge mt_robin and mt_robin_processed based on the gene_id column, selecting only the required columns
  mt_robin_merged <- merge(mt_robin, mt_robin_processed[, c("gene_id", "Tissues", "Tissue_Categories")], by = "gene_id", all.x = TRUE)

  # Print results:
  print(mt_robin_merged)

  # Export the final table with adjusted delimiter for tissues and categories
  write.table(mt_robin_merged, file = output_path, sep = "\t", row.names = FALSE, quote = FALSE, col.names = TRUE)
}
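
The per-SNP filter inside `process_gene_snp()` can be sketched in plain Python as follows. This is only an illustration of the logic, not a port of the R code: the function name `keep_snp` is hypothetical, and `None` stands in for R's `NA`.

```python
def keep_snp(pvals, betas, pval_thresh=0.001, n_tiss_thresh=1):
    """Return True if a SNP passes the tissue filters.

    pvals and betas are per-tissue lists of equal length; None marks a
    tissue without data (the counterpart of NA in the R code).
    """
    # The k-th smallest observed p-value must be below the threshold,
    # i.e. at least k tissues show a significant eQTL.
    sorted_p = sorted(p for p in pvals if p is not None)
    if len(sorted_p) < n_tiss_thresh:
        return False
    if sorted_p[n_tiss_thresh - 1] >= pval_thresh:
        return False

    # Collect the effect sizes of the significant tissues only.
    sig_betas = [
        b for p, b in zip(pvals, betas)
        if p is not None and b is not None and p < pval_thresh
    ]

    # Drop SNPs whose significant effects point in opposite directions.
    has_pos = any(b > 1e-15 for b in sig_betas)
    has_neg = any(b < -1e-15 for b in sig_betas)
    return not (has_pos and has_neg)


# One significant tissue, consistent direction: kept.
print(keep_snp([0.0001, 0.5, None], [0.3, -0.2, None]))
# Two significant tissues with opposite signs: dropped.
print(keep_snp([0.0001, 0.0002], [0.3, -0.2]))
```

With `n_tiss_thresh=1` the first condition reduces to "the minimum p-value is below the threshold", which is why the R code's separate `min_p` check is redundant at that setting.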

## **forward_selection**

### **Loop GWAS**

In [None]:
# Step 1: Load the required libraries:
library(matrixStats)
library(snpStats)
library(utils)
library(parallel)

# Step 2: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 3: Open the genes input data set:
all_genes <- read.table("/data/all_genes_110.txt",
                        header = TRUE,
                        sep = "",
                        stringsAsFactors = FALSE)

# Step 4: Convert the previous list of genes into a vector:
all_genes_vector <- all_genes$gene_id

# Step 5: Save the paths of the GTEx files:
gtex_covid_bed <- "/data/WGS_GTEx_long_covid_rsID.bed"
gtex_covid_bim <- "/data/WGS_GTEx_long_covid_rsID.bim"
gtex_covid_fam <- "/data/WGS_GTEx_long_covid_rsID.fam"

# Step 6: Modify the original select_IV_fs() function to accept NAs from the eQTL and calculate the LD matrix inside:
# select_IV_fs() function --> Starts with no variables and adds one at a time:
select_IV_fs <- function(geneID,
                         eqtl_data,
                         nTiss,
                         bed_file,
                         bim_file,
                         fam_file,
                         ld_matrix_path,
                         ld_thresh=0.5,
                         pval_thresh=0.001,
                         nTiss_thresh=3){

  ## Subset eQTL dataset to gene of interest:
  if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
  eqtl_data <- subset(eqtl_data, gene_id == geneID)

  # pval_mat <- as.matrix(eqtl_data[,paste0("pvalue_",1:nTiss)])
  pval_mat <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  eqtl_data$min_p <- matrixStats::rowMins(pval_mat, na.rm = TRUE) # Mod1: I added na.rm = TRUE
  eqtl_data$nTissThresh_p <- apply(pval_mat, 1, function(x) sort(x)[nTiss_thresh])
  eqtl_data <- subset(eqtl_data, nTissThresh_p < pval_thresh)

  ## Ensure same sign for tissues with p < pval_thresh:
  betas <- as.matrix(subset(eqtl_data,select=paste0("beta_",1:nTiss)))
  pvals <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  pvals_lt001 <- pvals < pval_thresh # Use the function argument instead of a hard-coded 0.001
  betas_sig <- betas*pvals_lt001
  #sig_neg_eQTL <- as.numeric(apply(betas_sig,1,function(x) any(x < -1e-15)))
  #sig_pos_eQTL <- as.numeric(apply(betas_sig,1,function(x) any(x > 1e-15)))
  sig_neg_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x < -1e-15)))) # Mod2: I added is.na()
  sig_pos_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x > 1e-15)))) # Mod3: I added is.na()
  eqtl_data$pos_sig_eQTL <- sig_pos_eQTL
  eqtl_data$neg_sig_eQTL <- sig_neg_eQTL
  eqtl_data <- subset(eqtl_data, !(pos_sig_eQTL==1 & neg_sig_eQTL==1))

  if(nrow(eqtl_data)==0) stop(paste0("No cis-SNPs for gene with p<",pval_thresh," in at least ",nTiss_thresh," tissues."))

  ## Obtain a set of candidate SNPs:
  SNP_pool <- eqtl_data$variant_id
  if(length(SNP_pool)==1){
    return(SNP_pool)
  }

  # Calculate the LD matrix:
  snp_data <- read.plink(bed_file, bim_file, fam_file)
  snp_matrix <- snp_data$genotypes
  numeric_matrix <- as(snp_matrix, "numeric")
  # Restrict LD matrix to SNPs in the data set:
  df_subset <- numeric_matrix[, colnames(numeric_matrix) %in% SNP_pool]
  LD <- cor(df_subset, use="pairwise.complete")
  LD_r2 <- LD^2
  # Export a .txt file of the LD matrix:
  utils::write.table(LD,
                     file = ld_matrix_path,
                     sep = " ",
                     row.names = TRUE,
                     col.names = TRUE,
                     quote = FALSE)

  ## Select the best SNPs remaining in SNP_pool based on minimum p-value:
  selected_snps <- NULL
  while(length(SNP_pool) > 1){

    ## Identify top SNPs in pool:
    eqtl_data <- subset(eqtl_data, variant_id %in% SNP_pool)
    next_snp_idx <- which.min(eqtl_data$min_p)
    next_snp <- eqtl_data$variant_id[next_snp_idx]
    selected_snps <-c(selected_snps,next_snp)

    ## Identify SNPs not in high LD with top SNPs in the pool:
    LD_idx <- which(row.names(LD_r2)==next_snp)
    LD_keep_idx <- which(LD_r2[LD_idx,] < ld_thresh)
    LD_r2 <- LD_r2[LD_keep_idx,LD_keep_idx]
    SNP_pool <- names(LD_keep_idx)
  }
  selected_snps <- c(selected_snps, SNP_pool)
  return(selected_snps)
}

# Step 7: Define the number of cores to use for the parallelism process:
num_cores <- 48

# Step 8: Loop over the 4 GWAS datasets:
gwas_files <- c("long_covid_gwas_1_subset.txt",
                "long_covid_gwas_2_subset.txt",
                "long_covid_gwas_3_subset.txt",
                "long_covid_gwas_4_subset.txt")
for (gwas_file in gwas_files) {

  # Step 8.1: Open the GWAS input data set:
  gwas_covid_outcome <- read.table(paste0("/data/", gwas_file),
                                   header = TRUE,
                                   sep = "",
                                   stringsAsFactors = FALSE)

  # Step 8.2: Create a unique directory for the LD matrices corresponding to this GWAS dataset:
  ld_matrix_dir <- paste0("/data/ld_matrix_loop_fs_",
                          sub("\\.txt$", "", gwas_file)) # Strip only the literal ".txt" extension

  # Step 8.3: Create the directory if it doesn't exist:
  if (!dir.exists(ld_matrix_dir)) {
    dir.create(ld_matrix_dir)
  }

  # Step 9: Apply the function to each gene id (still inside the loop over GWAS datasets):
  result_list <- mclapply(all_genes_vector, function(geneID) {

    # Step 9.1: Generate a unique file path for the LD matrix of this gene:
    ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

    tryCatch({
      select_IV_fs(geneID = geneID,
                   eqtl_data = eqtl_covid_exposure,
                   nTiss = 49,
                   bed_file = gtex_covid_bed,
                   bim_file = gtex_covid_bim,
                   fam_file = gtex_covid_fam,
                   ld_matrix_path = ld_matrix_path,
                   ld_thresh = 0.5,
                   pval_thresh = 0.001,
                   nTiss_thresh = 1)
    }, error = function(e) print(e))
  }, mc.cores = num_cores)

  # Step 10: Name the elements of the list with corresponding gene ids:
  names(result_list) <- all_genes_vector

  # Step 11: Save the previous result as a .rds file:
  saveRDS(result_list, file = paste0("/data/long_covid_IVs_loop_fs_",
                                     sub("\\.txt$", "", gwas_file), ".rds"))
}
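
The forward-selection loop at the end of `select_IV_fs()` is a greedy LD pruning step. A minimal Python sketch of the same idea is shown below. The function name `greedy_ld_prune`, the dictionary inputs, and the toy SNP IDs are all hypothetical, and ties are broken by SNP name here, whereas `which.min()` in R takes the first row.

```python
def greedy_ld_prune(min_p, ld_r2, ld_thresh=0.5):
    """Select SNPs by ascending p-value, skipping those in high LD.

    min_p: dict mapping SNP ID -> minimum p-value across tissues.
    ld_r2: nested dict, ld_r2[a][b] = squared correlation between a and b.
    """
    pool = set(min_p)
    selected = []
    while pool:
        # Take the most significant SNP that is still in the pool.
        lead = min(pool, key=lambda snp: (min_p[snp], snp))
        selected.append(lead)
        # Keep only SNPs whose r^2 with the lead SNP is below the threshold.
        pool = {
            snp for snp in pool
            if snp != lead and ld_r2[lead][snp] < ld_thresh
        }
    return selected


# Toy example: rsB is in high LD with rsA, rsC is independent of both.
toy_p = {"rsA": 1e-8, "rsB": 1e-6, "rsC": 1e-4}
toy_ld = {
    "rsA": {"rsA": 1.0, "rsB": 0.9, "rsC": 0.1},
    "rsB": {"rsA": 0.9, "rsB": 1.0, "rsC": 0.2},
    "rsC": {"rsA": 0.1, "rsB": 0.2, "rsC": 1.0},
}
print(greedy_ld_prune(toy_p, toy_ld))
```

The R version stops the loop once a single SNP remains and appends it afterwards, which gives the same result as looping until the pool is empty.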

In [None]:
#!/bin/bash
# The shebang above specifies the interpreter and must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q megamem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=2000GB
#PBS -l jobfs=1000GB
#PBS -l walltime=30:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives are only read before the first executable command, so they stay up here:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write timestamps to a log file named after the job ID, so runs do not overwrite each other:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup.sif Rscript /data/select_function_fs_loop.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

`Result: Entered into a held queue.`

Check the following links:

https://opus.nci.org.au/display/Help/Queue+Structure

https://opus.nci.org.au/pages/viewpage.action?pageId=90308823

https://opus.nci.org.au/display/Help/FAQ+2%3A+What+does+exceeded+memory+allocation+mean

**Pay attention:**

1. **Queuing**: Jobs typically enter a queue when submitted. The scheduler decides when to start each job based on available resources, job priority, and other parameters. If the cluster doesn't have sufficient resources available to start all the jobs immediately, some of the jobs may be left waiting in the queue.

2. **Fair Share and User Priority**: Many clusters implement a "fair share" policy, which means that if a single user or group submits a large number of jobs or continually monopolizes resources, their job priority may be temporarily decreased. This prevents any single user from dominating the cluster resources and ensures equitable access for all users.

3. **Resource Fragmentation**: Even if the cluster has free resources, they might not be contiguous. For example, if you request 48 cores and there are 50 cores free but spread across multiple nodes, the scheduler might not start the job until 48 cores become available on a single node or in a configuration that meets the job's specifications.

4. **Parallel Execution**: If there are enough resources on the cluster, multiple jobs can run simultaneously. If you submit two jobs, each requesting 48 cores, and there are 96 cores available, both jobs could potentially start at the same time.

5. **Job Dependencies**: Some schedulers allow you to set dependencies between jobs, meaning one job won't start until another has completed. This can be useful if one job produces data that another job will use.

6. **Array Jobs**: If you're running multiple similar jobs with the same resource requirements, some schedulers support array jobs. This allows you to submit a large number of similar jobs as a single array, and the scheduler can then manage and start these jobs efficiently.

7. **Backfilling**: Some schedulers use backfilling to optimize the job queue. If a large job is waiting for resources to become available, but there's a gap where smaller jobs can fit without delaying the large job, the scheduler might start the smaller jobs to make efficient use of resources.

8. **Job Preemption**: On some clusters, especially those with a partition or Quality of Service (QoS) for high-priority or express jobs, longer-running, lower-priority jobs might be preempted (paused or stopped) to free up resources for a high-priority job. Once the high-priority job completes, the preempted jobs may be restarted or resumed.
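
The resource fragmentation point can be made concrete with a small check. This is a sketch under a simplifying assumption (the job must fit on a single node), and the function name and node layouts are hypothetical.

```python
def fits_on_single_node(free_cores_per_node, requested_cores):
    """Return True if any one node has enough free cores for the request."""
    return any(free >= requested_cores for free in free_cores_per_node)


# 50 cores are free in total, but split 26 + 24 across two nodes:
print(fits_on_single_node([26, 24], 48))
# The same 50 free cores, with 48 of them on one node:
print(fits_on_single_node([48, 2], 48))
```

In the first layout a 48-core single-node request has to wait even though the cluster has 50 idle cores.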


**Possible solutions:**

1. **Optimize the R code**: Review the R scripts to check for potential memory inefficiencies. For instance:
    - Use more memory-efficient data structures, like data.table instead of data.frame.
    - Clear out large objects that are no longer needed using `rm(objectName)` and then run `gc()` to free up memory.
    - Use functions that perform operations in chunks rather than loading entire large datasets into memory.

2. **Parallelism**: Consider reducing the number of concurrent processes. While parallelism can speed up computations, it multiplies memory usage. If each R process requires a lot of memory, running many of them at once could quickly deplete available resources.

3. **Increase Memory Allocation**: If possible, request a node with more memory or increase the memory allocation for the job. This may involve communicating with the system administrator or adjusting the job submission scripts.

4. **Use High Memory Nodes**: If the cluster offers nodes specifically designed for high-memory tasks, request those nodes for the job.

5. **Use Disk**: For operations that deal with large datasets, consider using disk-based tools or algorithms that operate on chunks of data at a time, rather than loading everything into memory. For example, the `ff` and `bigmemory` packages in R allow for the analysis of large datasets without loading them fully into memory.

6. **Profile Memory**: Tools like `pryr` can help profile memory usage in R. This way, you can identify which parts of the code consume the most memory and target those for optimization.

7. **External Help**: Check the link provided in the error message (https://opus.nci.org.au/x/SwGRAQ). It might contain specific guidelines or recommendations for managing memory on the `gadi` system.
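
For the parallelism point, a rough way to choose the number of workers is to cap it by both the core count and the memory budget. The sketch below assumes every worker needs about the same amount of memory. The function name and the 20 GB per-worker figure are hypothetical and would have to be replaced by a measured value.

```python
def max_workers(total_mem_gb, mem_per_worker_gb, ncpus):
    """Largest worker count that fits both the core and memory budgets."""
    if mem_per_worker_gb <= 0:
        raise ValueError("mem_per_worker_gb must be positive")
    by_memory = int(total_mem_gb // mem_per_worker_gb)
    # Never return fewer than one worker, never more than the cores requested.
    return max(1, min(ncpus, by_memory))


# 500 GB requested, an assumed 20 GB per worker, 48 cores available:
print(max_workers(500, 20, 48))
```

The returned value is what would be passed as `mc.cores` to `mclapply()` in the R scripts.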

**Nodes vs. CPUs vs. Cores:**

**Node:**

- A node generally refers to a single physical or virtual machine in the cluster.
- Each node will have its own local memory and possibly local storage.
- A node might contain one or multiple CPUs (processors). Each of these CPUs can have multiple cores.
- Nodes might also contain other specialized hardware components, such as GPUs (Graphics Processing Units) or FPGAs (Field-Programmable Gate Arrays).
- In HPC contexts, there are often two main types of nodes:
  - Compute Nodes: Used primarily to execute application code.
  - Service or Head Nodes: Used for managing the cluster, submitting jobs, and other administrative tasks.

**Core:**

- A core is a component of a CPU and represents a single processing unit. Modern CPUs are multi-core, meaning they have multiple cores that can process data independently.
- Each core can execute its own thread(s), allowing for parallel processing within a single CPU.
- Cores share the resources of the CPU they are part of, such as caches, but often have their own local resources too.
- For example, an 8-core CPU can potentially run eight different processes or threads simultaneously.
- Cores can be physical or logical. Logical cores (or threads) are seen in technologies like Intel's Hyper-Threading, where each physical core can execute two threads simultaneously, effectively doubling the number of cores from a software perspective.

**Example:**

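`ncpus=48` --> 48 cores are requested, which might be distributed across one or multiple nodes, depending on the cluster's configuration and how nodes are set up. If the cluster has nodes with 48 cores each, this might equate to one node. If nodes have 24 cores each, the job would span two nodes. Note that `mclapply()` parallelises by forking processes on a single machine, so the R scripts in this notebook can only use the cores of one node.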
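
The node count in this example is a ceiling division, sketched below with a hypothetical helper name.

```python
def nodes_needed(ncpus, cores_per_node):
    """Minimum number of nodes required to supply ncpus cores."""
    if cores_per_node <= 0:
        raise ValueError("cores_per_node must be positive")
    # Integer ceiling division: round up whenever there is a remainder.
    return -(-ncpus // cores_per_node)


print(nodes_needed(48, 48))  # nodes with 48 cores each
print(nodes_needed(48, 24))  # nodes with 24 cores each
```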

### **GWAS1**

In [None]:
# select_IV function:

library(matrixStats) # It needs to be installed separately
#library(BiocManager) # It needs to be installed separately
library(snpStats) # It needs to be installed separately
library(utils) # R package installed by default
library(parallel) # R package installed by default

# Step 1: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 2: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_1_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 3: Open the genes input data set:
all_genes <- read.table("/data/all_genes_110.txt",
                        header = TRUE,
                        sep = "",
                        stringsAsFactors = FALSE)

# Step 4: Convert the previous list of genes into a vector:
all_genes_vector <- all_genes$gene_id

# Step 5: Save the paths of the GTEx files:
gtex_covid_bed <- "/data/WGS_GTEx_long_covid_rsID.bed"
gtex_covid_bim <- "/data/WGS_GTEx_long_covid_rsID.bim"
gtex_covid_fam <- "/data/WGS_GTEx_long_covid_rsID.fam"

# Step 6: Create a variable for the path to the directory to save the LD matrix results:
ld_matrix_dir <- "/data/ld_matrix_fs_1"

# Step 7: Create the directory if it doesn't exist:
if (!dir.exists(ld_matrix_dir)) {
  dir.create(ld_matrix_dir)
}

# Step 8: Modify the original select_IV_fs() function to accept NAs from the eQTL and calculate the LD matrix inside:
# select_IV_fs() function --> Starts with no variables and adds one at a time:
select_IV_fs <- function(geneID,
                         eqtl_data,
                         nTiss,
                         bed_file,
                         bim_file,
                         fam_file,
                         ld_matrix_path,
                         ld_thresh=0.5,
                         pval_thresh=0.001,
                         nTiss_thresh=3){

  ## Subset eQTL dataset to gene of interest:
  if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
  eqtl_data <- subset(eqtl_data, gene_id == geneID)

  # pval_mat <- as.matrix(eqtl_data[,paste0("pvalue_",1:nTiss)])
  pval_mat <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  eqtl_data$min_p <- matrixStats::rowMins(pval_mat, na.rm = TRUE) # Mod1: I added na.rm = TRUE
  eqtl_data$nTissThresh_p <- apply(pval_mat, 1, function(x) sort(x)[nTiss_thresh])
  eqtl_data <- subset(eqtl_data, nTissThresh_p < pval_thresh)

  ## Ensure same sign for tissues with p < pval_thresh:
  betas <- as.matrix(subset(eqtl_data,select=paste0("beta_",1:nTiss)))
  pvals <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  pvals_lt001 <- pvals < pval_thresh # Use the function argument instead of a hard-coded 0.001
  betas_sig <- betas*pvals_lt001
  #sig_neg_eQTL <- as.numeric(apply(betas_sig,1,function(x) any(x < -1e-15)))
  #sig_pos_eQTL <- as.numeric(apply(betas_sig,1,function(x) any(x > 1e-15)))
  sig_neg_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x < -1e-15)))) # Mod2: I added is.na()
  sig_pos_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x > 1e-15)))) # Mod3: I added is.na()
  eqtl_data$pos_sig_eQTL <- sig_pos_eQTL
  eqtl_data$neg_sig_eQTL <- sig_neg_eQTL
  eqtl_data <- subset(eqtl_data, !(pos_sig_eQTL==1 & neg_sig_eQTL==1))

  if(nrow(eqtl_data)==0) stop(paste0("No cis-SNPs for gene with p<",pval_thresh," in at least ",nTiss_thresh," tissues."))

  ## Obtain a set of candidate SNPs:
  SNP_pool <- eqtl_data$variant_id
  if(length(SNP_pool)==1){
    return(SNP_pool)
  }

  # Calculate the LD matrix:
  snp_data <- read.plink(bed_file, bim_file, fam_file)
  snp_matrix <- snp_data$genotypes
  numeric_matrix <- as(snp_matrix, "numeric")
  # Restrict LD matrix to SNPs in the data set:
  df_subset <- numeric_matrix[, colnames(numeric_matrix) %in% SNP_pool]
  LD <- cor(df_subset, use="pairwise.complete")
  LD_r2 <- LD^2
  # Export a .txt file of the LD matrix:
  utils::write.table(LD,
                     file = ld_matrix_path,
                     sep = " ",
                     row.names = TRUE,
                     col.names = TRUE,
                     quote = FALSE)

  ## Select the best SNPs remaining in SNP_pool based on minimum p-value:
  selected_snps <- NULL
  while(length(SNP_pool) > 1){

    ## Identify top SNPs in pool:
    eqtl_data <- subset(eqtl_data, variant_id %in% SNP_pool)
    next_snp_idx <- which.min(eqtl_data$min_p)
    next_snp <- eqtl_data$variant_id[next_snp_idx]
    selected_snps <-c(selected_snps,next_snp)

    ## Identify SNPs not in high LD with top SNPs in the pool:
    LD_idx <- which(row.names(LD_r2)==next_snp)
    LD_keep_idx <- which(LD_r2[LD_idx,] < ld_thresh)
    LD_r2 <- LD_r2[LD_keep_idx,LD_keep_idx]
    SNP_pool <- names(LD_keep_idx)
  }
  selected_snps <- c(selected_snps, SNP_pool)
  return(selected_snps)
}

# Step 9: Define the number of cores to use for the parallelism process:
num_cores <- 48

# Step 10: Apply the function to each gene id:
result_list <- mclapply(all_genes_vector, function(geneID) {
  # Generate a unique file path for the LD matrix of this gene
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  tryCatch({
    select_IV_fs(geneID = geneID,
                 eqtl_data = eqtl_covid_exposure,
                 nTiss = 49,
                 bed_file = gtex_covid_bed,
                 bim_file = gtex_covid_bim,
                 fam_file = gtex_covid_fam,
                 ld_matrix_path = ld_matrix_path,
                 ld_thresh = 0.5,
                 pval_thresh = 0.001,
                 nTiss_thresh = 1)
  }, error = function(e) print(e))
}, mc.cores = num_cores)

# Step 11: Name the elements of the list with corresponding gene ids:
names(result_list) <- all_genes_vector

# Step 12: Save the previous result as a .rds file:
saveRDS(result_list, file = "/data/long_covid_IVs_fs_1.rds")

In [None]:
#!/bin/bash
# The shebang above specifies the interpreter and must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=30:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives are only read before the first executable command, so they stay up here:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write timestamps to a log file named after the job ID, so runs do not overwrite each other:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup.sif Rscript /data/select_function_fs_1.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly:
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-23 22:20:17:
   Job Id:             93322355.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      804.92
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 257:47:02
   Memory Requested:   500.0GB               Memory Used: 391.58GB
   Walltime requested: 30:00:00            Walltime Used: 05:35:23
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================
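
The figures in the usage report above can be turned into two quick diagnostics: CPU efficiency (CPU time divided by walltime times cores) and the fraction of requested memory that was actually used. The helper names are hypothetical. The charge rate of about 3 service units per core-hour is only inferred from the reported numbers, not taken from the scheduler documentation.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time, walltime, ncpus):
    """Fraction of the reserved core time that was spent computing."""
    return hms_to_seconds(cpu_time) / (hms_to_seconds(walltime) * ncpus)


# Values copied from the resource usage report of job 93322355:
efficiency = cpu_efficiency("257:47:02", "05:35:23", 48)
memory_fraction = 391.58 / 500.0
core_hours = 48 * hms_to_seconds("05:35:23") / 3600
implied_rate = 804.92 / core_hours  # service units per core-hour

print(round(efficiency, 2), round(memory_fraction, 2), round(implied_rate, 2))
```

An efficiency of roughly 0.96 suggests that the 48 workers were busy for most of the run, while about 78% of the requested memory was used.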

In [None]:
%%R

# Open the output file:
result_selectIV_fs_1 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_IVs_fs_1.rds")

# Filter out errors:
result_selectIV_fs_1 <- result_selectIV_fs_1[!sapply(result_selectIV_fs_1, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of unique SNP sets, and the first few elements:
print(length(result_selectIV_fs_1))
print(length(unique(result_selectIV_fs_1)))
print(head(result_selectIV_fs_1))

[1] 38574
[1] 38480
$ENSG00000000003
 [1] "rs8780"      "rs1204407"   "rs189980168" "rs4145090"   "rs139645081"
 [6] "rs112002620" "rs113922837" "rs17329105"  "rs193124579" "rs146639579"
[11] "rs201585484" "rs111478058" "rs143491698" "rs572714672" "rs56100956" 
[16] "rs147601806" "rs139830751" "rs12842251"  "rs73630694" 

$ENSG00000000005
 [1] "rs6616166"   "rs4828038"   "rs12006573"  "rs73557922"  "rs141394856"
 [6] "rs140468798" "rs73555045"  "rs5921738"   "rs139231094" "rs59368006" 
[11] "rs111915382" "rs73562536"  "rs12832642"  "rs201254498" "rs186611188"
[16] "rs113422802" "rs796285800" "rs180795909" "rs1004006"   "rs148617015"
[21] "rs140109110"

$ENSG00000000419
 [1] "rs1292040204" "rs79562553"   "rs4809840"    "rs35201621"   "rs35530561"  
 [6] "rs141892692"  "rs944933"     "rs11483800"   "rs113036491"  "rs116892696" 
[11] "rs34483216"   "rs60578213"   "rs118181672"  "rs4811253"    "rs11086363"  
[16] "rs73278571"   "rs73131221"   "rs78319058"   "rs79223715"   "rs189417964" 
[2

In [None]:
# setup function:

library(parallel) # R package installed by default

# Step 1: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 2: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_1_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 3: Open the list of genes-SNPs:
result_selectedIVs <- readRDS("/data/long_covid_IVs_fs_1.rds")

# Step 4: Create a variable for the path to the LD matrix directory:
ld_matrix_dir <- "/data/ld_matrix_fs_1"

# Step 5: Setup the Mt_Robin method using the following setup function:
MR_MtRobin_setup <- function(geneID,
                             snpID,
                             eqtl_data,
                             gwas_data,
                             ld_matrix_path,
                             nTiss){
  tryCatch({
    ## Check if the gene is present and subset
    if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
    eqtl_data <- subset(eqtl_data, gene_id == geneID)

    ## Confirm all SNPs are present in the dataset
    if(!all(snpID %in% eqtl_data$variant_id)) stop("Some SNPs missing in eQTL dataset")
    if(!all(snpID %in% gwas_data$variant_id)) stop("Some SNPs missing in GWAS dataset")

    # Read the LD matrix from a file
    LD <- as.matrix(utils::read.table(ld_matrix_path,
                                      header = TRUE,
                                      check.names = FALSE))

    if(!all(snpID %in% colnames(LD))) stop("Some SNPs missing in LD matrix")

    ## Align data
    eqtl_data <- eqtl_data[match(snpID,eqtl_data$variant_id),]
    gwas_data <- gwas_data[match(snpID,gwas_data$variant_id),]
    LD <- LD[match(snpID,rownames(LD)),match(snpID,colnames(LD))]

    eqtl_betas <- as.matrix(eqtl_data[,paste0("beta_",1:nTiss)])
    eqtl_se <- as.matrix(eqtl_data[,paste0("SE_",1:nTiss)])
    eqtl_pvals <- as.matrix(eqtl_data[,paste0("pvalue_",1:nTiss)])
    gwas_betas <- gwas_data$beta
    gwas_se <- gwas_data$SE

    row.names(eqtl_betas) <- row.names(eqtl_se) <- row.names(eqtl_pvals) <- eqtl_data$variant_id

    return(list(snpID=snpID,gwas_betas=gwas_betas,gwas_se=gwas_se,
                eqtl_betas=eqtl_betas,eqtl_se=eqtl_se,eqtl_pvals=eqtl_pvals,LD=LD))
  }, error = function(e) print(e))
}

# Step 6: Apply the previous setup function:

# Step 6.1: Initialize an empty list to store the results:
MR_MtRobin_results <- list()

# Step 6.2: Define the number of cores to use for the parallelism process:
num_cores <- 36

# Step 6.3: Loop over each gene in result_selectedIVs:
MR_MtRobin_results <- mclapply(names(result_selectedIVs), function(geneID) {

  # Step 6.3.1: Get the SNPs associated with each gene:
  snpID <- result_selectedIVs[[geneID]]

  # Step 6.3.2: Generate a unique file path for the LD matrix of this gene:
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  # Step 6.3.3: Call the MR_MtRobin_setup() function with the parameters for each gene:
  return(tryCatch({
    MR_MtRobin_setup(geneID = geneID,
                     snpID = snpID,
                     eqtl_data = eqtl_covid_exposure,
                     gwas_data = gwas_covid_outcome,
                     ld_matrix_path = ld_matrix_path,
                     nTiss = 49)
  }, error = function(e) print(e)))
}, mc.cores = num_cores)

# Step 7: Assign names to the list:
names(MR_MtRobin_results) <- names(result_selectedIVs)

# Step 8: Save the previous result as a .rds file:
saveRDS(MR_MtRobin_results, file = "/data/long_covid_setup_fs_1.rds")
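
The "Align data" step in `MR_MtRobin_setup()` reorders the eQTL rows, the GWAS rows, and the LD matrix so that all three follow the order of the selected SNPs. A minimal Python sketch of that alignment is given below. The function names and toy data are hypothetical, and the first matching position is used, as R's `match()` does.

```python
def first_positions(keys):
    """Map each key to the index of its first occurrence."""
    positions = {}
    for index, key in enumerate(keys):
        positions.setdefault(key, index)
    return positions


def align_rows(snp_ids, rows):
    """Reorder rows (dicts with a 'variant_id' key) to follow snp_ids."""
    positions = first_positions(row["variant_id"] for row in rows)
    missing = [snp for snp in snp_ids if snp not in positions]
    if missing:
        raise KeyError(f"SNPs missing from dataset: {missing}")
    return [rows[positions[snp]] for snp in snp_ids]


def align_ld(snp_ids, ld, names):
    """Reorder a square LD matrix (list of lists) to follow snp_ids."""
    positions = first_positions(names)
    return [[ld[positions[a]][positions[b]] for b in snp_ids] for a in snp_ids]


# Toy data: the GWAS rows and the LD matrix are stored in a different order.
snps = ["rs2", "rs1"]
gwas = [{"variant_id": "rs1", "beta": 0.1}, {"variant_id": "rs2", "beta": -0.2}]
ld = [[1.0, 0.3], [0.3, 1.0]]
print([row["beta"] for row in align_rows(snps, gwas)])
print(align_ld(snps, ld, ["rs1", "rs2"]))
```

Raising an error for missing SNPs mirrors the `stop()` checks that precede the alignment in the R function.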

In [None]:
#!/bin/bash
# The shebang above specifies the interpreter and must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives are only read before the first executable command, so they stay up here:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write timestamps to a log file named after the job ID, so runs do not overwrite each other:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup.sif Rscript /data/setup_function_fs_1.R

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-24 14:27:55:
   Job Id:             93416546.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      77.91
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 20:20:02
   Memory Requested:   300.0GB               Memory Used: 77.45GB
   Walltime requested: 10:00:00            Walltime Used: 00:43:17
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================
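
The Service Units figure in the resource report can be checked by hand. The sketch below is a minimal base-R illustration, assuming a flat charge of 3 SU per core-hour. That rate is inferred from the reports shown in this notebook, not taken from scheduler documentation, and the real charging rule may also depend on the memory request. The helper names `walltime_hours()` and `service_units()` are hypothetical.

```r
# Convert an "HH:MM:SS" walltime string to hours.
walltime_hours <- function(walltime) {
  parts <- as.numeric(strsplit(walltime, ":", fixed = TRUE)[[1]])
  sum(parts * c(3600, 60, 1)) / 3600
}

# Service units, ASSUMING a flat charge of `rate` SU per core-hour.
# rate = 3 is an assumption inferred from the reports in this notebook.
service_units <- function(ncpus, walltime, rate = 3) {
  ncpus * rate * walltime_hours(walltime)
}

# Setup job above: 36 CPUs for 00:43:17 of walltime.
su_setup <- service_units(36, "00:43:17")
print(round(su_setup, 2))
```

Under this assumption the result agrees with the 77.91 SU reported for the job, even though only about a quarter of the requested memory and a small fraction of the requested walltime were used.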

In [None]:
%%R

# Open the output file:
result_setup_fs_1 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_setup_fs_1.rds")

# Filter out errors:
result_setup_fs_1 <- result_setup_fs_1[!sapply(result_setup_fs_1, function(x) inherits(x, "simpleError"))]

# Print the length and the first element:
print(length(result_setup_fs_1))
print(length(unique(result_setup_fs_1)))
print(result_setup_fs_1[[1]])

[1] 4881
[1] 4881
$snpID
[1] "rs147560086" "rs7955986"   "rs718389"    "rs11571383"  "rs2107614"  
[6] "rs4980935"   "rs11064536"  "rs117766797"

$gwas_betas
[1]  0.0320435 -0.0154484 -0.0713325 -0.0156255 -0.0385448 -0.0151348  0.0651312
[8] -0.1373340

$gwas_se
[1] 0.0386181 0.0307614 0.0374723 0.0667074 0.0328251 0.0408972 0.0422060
[8] 0.1174630

$eqtl_betas
            beta_1 beta_2    beta_3   beta_4 beta_5    beta_6    beta_7 beta_8
rs147560086     NA     NA -0.332722 -0.36379     NA -0.428724        NA     NA
rs7955986       NA     NA        NA       NA     NA        NA        NA     NA
rs718389        NA     NA        NA       NA     NA        NA        NA     NA
rs11571383      NA     NA        NA       NA     NA        NA -0.525715     NA
rs2107614       NA     NA        NA       NA     NA        NA        NA     NA
rs4980935       NA     NA        NA       NA     NA        NA        NA     NA
rs11064536      NA     NA        NA       NA     NA        NA        NA     NA
rs1
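
The error-filtering step above works because the `tryCatch()` handler hands back the condition object, so failed genes sit in the result list as error objects. The sketch below is a minimal base-R illustration with a hypothetical `safe_sqrt()` function and toy inputs. It tests for the class `"error"`, which every error condition carries, whereas `"simpleError"` only matches errors raised through `stop()`.

```r
# Hypothetical worker: fails for negative input, otherwise returns the square root.
safe_sqrt <- function(x) {
  tryCatch({
    if (x < 0) stop("negative input")
    sqrt(x)
  }, error = function(e) e)  # return the condition object instead of aborting
}

# Named toy inputs (stand-ins for gene IDs):
results <- lapply(c(a = 4, b = -1, c = 9), safe_sqrt)

# Flag the elements that are error conditions and drop them:
is_error <- vapply(results, inherits, logical(1), what = "error")
clean <- results[!is_error]

print(names(clean))
```

Keeping the names on the list, as done here, is what lets the surviving elements be traced back to their gene IDs after filtering.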

In [None]:
# Mt_Robin function:

# Step 1: Load the required libraries:
library(base) # R package installed by default
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(Matrix) # R package installed by default
library(methods) # R package installed by default
library(stats) # R package installed by default
library(lme4) # It needs to be installed separately
library(data.table) # It needs to be installed separately

# Step 2: Main Mt_Robin function:
MR_MtRobin <- function(MR_MtRobin_input,pval_thresh=0.001){
  snpID <- MR_MtRobin_input$snpID
  gwas_betas <- MR_MtRobin_input$gwas_betas
  gwas_se <- MR_MtRobin_input$gwas_se
  eqtl_betas <- MR_MtRobin_input$eqtl_betas
  eqtl_se <- MR_MtRobin_input$eqtl_se
  eqtl_pvals <- MR_MtRobin_input$eqtl_pvals
  LD <- MR_MtRobin_input$LD

  if(ncol(eqtl_betas) != ncol(eqtl_se) || ncol(eqtl_betas) != ncol(eqtl_pvals)){
    stop("eQTL betas, SE and p-values do not all have same number of columns")
  }

  ## determine number of tissues (or conditions/studies/etc.)
  nT <- ncol(eqtl_betas)

  ## rename columns for standardized processing
  colnames(eqtl_betas) <- paste0("beta_",1:nT)
  colnames(eqtl_se) <- paste0("SE_",1:nT)
  colnames(eqtl_pvals) <- paste0("pvalue_",1:nT)

  ## combine eQTL results into data.table
  eqtl_res <- data.table::data.table(snpID,eqtl_betas,eqtl_se,eqtl_pvals)

  # reshape to long format (each SNP-gene-tissue will be row)
  eQTL_res_melt <- data.table::melt(eqtl_res,id.vars=c("snpID"),
                                    measure.vars=patterns("^beta_","^SE_","^pvalue_"),value.name=c("beta","se","pvalue"),
                                    variable.name="tissue")

  ## subset to strong instrumental variables (based on p-value threshold)
  eQTL_res_melt_PltThresh <- subset(eQTL_res_melt, pvalue < pval_thresh)

  ## set up gwas data for merging
  gwas_res <- data.table::data.table(snpID=snpID,gwas_beta=gwas_betas,gwas_se=gwas_se)

  ## merge eQTL and GWAS results
  data.table::setkey(eQTL_res_melt_PltThresh,snpID)
  data.table::setkey(gwas_res,snpID)
  merged_res <- data.table::merge.data.table(eQTL_res_melt_PltThresh,gwas_res)

  ## set up coefficients for reverse regression
  beta_x <- matrix(merged_res$beta,ncol=1)
  beta_y <- matrix(merged_res$gwas_beta,ncol=1)

  ## standard errors (for weights)
  se_x <- matrix(merged_res$se,ncol=1)
  se_y <- matrix(merged_res$gwas_se,ncol=1) ## not used by function; used in resampling (consider returning with results)

  ## identifiers
  snpID <- merged_res$snpID
  lme_res <- lme4::lmer(beta_x~(beta_y-1)+(beta_y-1|snpID),weights=1/se_x^2)

  ## return results from weighted regression with random slopes (and correlated errors) along with GWAS SE and LD (needed for resampling)
  return(list(lme_res=lme_res,gwas_res=gwas_res,LD=LD))
}

# Step 3: Read the previous file:
result_setup <- readRDS("/data/long_covid_setup_fs_1.rds")

# Step 4: Apply the main Mt_Robin function using parallel processing:
no_cores <- 36 # Match the ncpus=36 requested in the PBS job script

cl <- makeCluster(no_cores) # Create a cluster with the number of cores
clusterExport(cl, "MR_MtRobin") # Export the MR_MtRobin function to all the nodes of the cluster
result_list_MtRobin <- parLapply(cl, result_setup, function(MR_MtRobin_input) {
  tryCatch({
    MR_MtRobin(MR_MtRobin_input = MR_MtRobin_input)
  }, error = function(e) print(e))
})
stopCluster(cl) # Stop the cluster once the computation is done

# Step 5: The names of the results will correspond to the gene ids:
names(result_list_MtRobin) <- names(result_setup)

# Step 6: Save the previous result as a .rds file:
saveRDS(result_list_MtRobin, file = "/data/long_covid_main_fs_1.rds")
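
The `lmer()` call in `MR_MtRobin()` can be read as the following model. The notation is introduced here for illustration and is an assumption, not taken from the source: $i$ indexes SNPs, $k$ indexes the tissues that pass `pval_thresh` for that SNP, $\hat\beta^{x}_{ik}$ is the eQTL effect (`beta_x`), and $\hat\beta^{y}_{i}$ is the GWAS effect (`beta_y`).

$$
\hat\beta^{x}_{ik} = (\theta + b_i)\,\hat\beta^{y}_{i} + \varepsilon_{ik},
\qquad b_i \sim N(0, \sigma_b^2),
\qquad \varepsilon_{ik} \sim N\!\left(0,\; \sigma^2\,\mathrm{se}(\hat\beta^{x}_{ik})^2\right)
$$

There is no intercept (the `- 1` terms), $\theta$ is the fixed slope reported under "Fixed Effects", $b_i$ is the SNP-specific random slope from `(beta_y - 1 | snpID)`, and the residual variance scales with the squared eQTL standard error because `lmer()` treats `weights = 1/se_x^2` as inverse-variance weights.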

In [None]:
#!/bin/bash
# The shebang above selects the interpreter and must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted (a), begins (b), ends (e).
# PBS stops reading directives at the first executable command, so these must come before it:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

# Write start and end timestamps to a per-job file (the job ID keeps the name unique):
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/main_function.sif Rscript /data/main_function_fs_1.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-24 17:11:55:
   Job Id:             93440305.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      2.73
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 00:02:51
   Memory Requested:   300.0GB               Memory Used: 9.67GB
   Walltime requested: 10:00:00            Walltime Used: 00:01:31
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R

# Open the output file:
result_main_fs_1 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_1.rds")

# Filter out errors:
result_main_fs_1 <- result_main_fs_1[!sapply(result_main_fs_1, function(x) inherits(x, "simpleError"))]

# Print the length and the first element:
print(length(result_main_fs_1))
print(length(unique(result_main_fs_1)))
print(head(result_main_fs_1))

[1] 1670
[1] 1670
$ENSG00000002016
$ENSG00000002016$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: -25.7147
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 14.411  
 Residual         1.807  
Number of obs: 23, groups:  snpID, 8
Fixed Effects:
beta_y  
0.8438  

$ENSG00000002016$gwas_res
        snpID  gwas_beta   gwas_se
1  rs11064536  0.0651312 0.0422060
2  rs11571383 -0.0156255 0.0667074
3 rs117766797 -0.1373340 0.1174630
4 rs147560086  0.0320435 0.0386181
5   rs2107614 -0.0385448 0.0328251
6   rs4980935 -0.0151348 0.0408972
7    rs718389 -0.0713325 0.0374723
8   rs7955986 -0.0154484 0.0307614

$ENSG00000002016$LD
            rs147560086   rs7955986    rs718389  rs11571383   rs2107614
rs147560086  1.00000000  0.57848413 -0.19239180  0.66694928 -0.29808391
rs7955986    0.57848413  1.00000000 -0.35065828  0.41826699 -0.19482499
rs718389    -0.19239180 -0.35065828  1.000

In [None]:
########################################################################
# GADI:

# Obtain p-values from the MR-MtRobin results.
# MR_MtRobin_resample() uses a resampling procedure to estimate a p-value for each gene's MR_MtRobin() fit.
# Parameters:
#   results_1_path    path to an .rds file holding a named list of MR_MtRobin() results (one element per gene).
#   nsamp             integer, number of null samples used to estimate each p-value.
#   use_nonconverge   logical, whether to use samples whose random-slope model does not converge.
#   output_file_path  path of the tab-separated results table to write.
#   num_cpus          integer, number of cores passed to mclapply().
# Returns a named list with one element per gene, each containing:
#   pvalue      numeric, the estimated p-value.
#   nsamp_used  integer, the number of samples used in estimating the p-value. It is returned because
#               not all nsamp samples may be used: samples whose null model fit fails are dropped.
#   fdr         numeric, the Benjamini-Hochberg adjusted p-value.

# Load packages:
library(lme4)
library(mvtnorm)
library(data.table)
library(parallel)

# Load main results:
results_1 <- readRDS("/scratch/sq95/sp6154/p_values/long_covid_main_fs_1.rds")
# Filter out errors:
results_1 <- results_1[!sapply(results_1, function(x) inherits(x, "simpleError"))]
# Export:
saveRDS(results_1, "/scratch/sq95/sp6154/p_values/long_covid_main_fs_1_no_errors.rds")

# Updated Function to calculate the p-value and FDR with parallel processing:
MR_MtRobin_resample <- function(results_1_path, nsamp=1e4, use_nonconverge=FALSE, output_file_path, num_cpus=1){
  # Check and load the results_1 dataset from the specified path:
  if(file.exists(results_1_path)){
    results_1 <- readRDS(results_1_path)
  } else {
    stop("The specified file path for results_1 does not exist.")
  }
  # Initialize a list to store results for each gene:
  pvalues_results <- list()
  # Loop over each gene:
  for(gene_id in names(results_1)) {
    # Check if necessary components are present and not NULL:
    if(is.null(results_1[[gene_id]][["lme_res"]]) || is.null(results_1[[gene_id]][["gwas_res"]]) || is.null(results_1[[gene_id]][["LD"]])) {
      next
    }
    # Extract the model fit, GWAS results and LD matrix for the current gene:
    lme_res <- results_1[[gene_id]][["lme_res"]]
    gwas_res <- results_1[[gene_id]][["gwas_res"]]
    LD <- results_1[[gene_id]][["LD"]]
    # Extract data from MR-MtRobin results:
    eqtl_betas <- lme_res@frame$beta_x
    snpID <- lme_res@frame$snpID
    weights <- lme_res@frame$weights
    tstat_MR_MtRobin <- summary(lme_res)$coefficients[1,3]
    gwas_se <- gwas_res$gwas_se
    # Bootstrapped null distribution, accounting for LD correlations:
    beta_gwas_nullMat <- mvtnorm::rmvnorm(nsamp, mean=rep(0, length(gwas_se)), sigma=diag(gwas_se) %*% LD %*% diag(gwas_se))
    colnames(beta_gwas_nullMat) <- gwas_res$snpID
    # Refit the mixed model on each null draw and collect the null t-statistics:
    tstat_nulls <- mclapply(1:nsamp, function(i) {
      beta_gwas_null <- beta_gwas_nullMat[i, snpID]
      tryCatch({
        # LMM equation:
        lme_null <- lme4::lmer(eqtl_betas~(beta_gwas_null-1)+(beta_gwas_null-1|snpID), weights=weights)
        return(summary(lme_null)$coefficients[1,3])
      }, error=function(e) {
        if(!use_nonconverge) return(NA)
      })
    }, mc.cores = num_cpus)
    tstat_nulls <- unlist(tstat_nulls)
    tstat_nulls <- tstat_nulls[!is.na(tstat_nulls)] # Remove NAs
    nsamp_used <- length(tstat_nulls)
    # Calculate the p-value and round it to 3 decimal places:
    pval <- round(mean(abs(tstat_nulls) >= abs(tstat_MR_MtRobin)), digits = 3)
    # Store the results for the current gene in the list:
    pvalues_results[[gene_id]] <- list(pvalue=pval, nsamp_used=nsamp_used)
  }
  # Calculate FDR:
  all_pvalues <- sapply(pvalues_results, function(x) x$pvalue)
  fdr_values <- p.adjust(all_pvalues, method = "BH")
  # Combine results with FDR values:
  for(i in seq_along(fdr_values)) {
    pvalues_results[[names(fdr_values)[i]]]$fdr <- fdr_values[i]
  }
  # Prepare results dataframe:
  results_df <- do.call(rbind, lapply(names(pvalues_results), function(gene_id) {
    data.frame(gene_id = gene_id,
               pvalue = sprintf("%.3f", pvalues_results[[gene_id]]$pvalue), # Format pvalue with 3 decimal places
               nsamp_used = pvalues_results[[gene_id]]$nsamp_used,
               fdr = sprintf("%.15f", pvalues_results[[gene_id]]$fdr), # Keep full precision for fdr
               stringsAsFactors = FALSE)
  }))
  # Write the results to the output file, ensuring the formatting is kept
  write.table(results_df, file = output_file_path, sep = "\t", row.names = FALSE, quote = FALSE, col.names = TRUE)
  return(pvalues_results)
}

# Call the previous function:
MR_MtRobin_resample(
  results_1_path = "/scratch/sq95/sp6154/p_values/long_covid_main_fs_1_no_errors.rds",
  nsamp = 10000, # Number of bootstrap samples
  use_nonconverge = FALSE, # Whether to use non-converging results, typically FALSE
  output_file_path = "/scratch/sq95/sp6154/p_values/MR_pvalue_FDR_results_fs1.txt",
  num_cpus = 48 # Number of CPUs
)
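
The two key steps of the resampling function are the null covariance, `diag(gwas_se) %*% LD %*% diag(gwas_se)`, and the empirical p-value. The sketch below illustrates both in base R with toy values that are hypothetical and not taken from the analysis. It swaps `mvtnorm::rmvnorm()` for a Cholesky-based draw and uses standard normal draws as a placeholder for the refitted null t-statistics, so it shows the arithmetic only, not the full procedure.

```r
# Toy inputs (hypothetical values, not taken from the analysis):
gwas_se <- c(0.04, 0.03, 0.05)
LD <- matrix(c( 1.0, 0.5, -0.2,
                0.5, 1.0,  0.1,
               -0.2, 0.1,  1.0), nrow = 3, byrow = TRUE)

# Covariance of the null GWAS effects: Sigma[i, j] = se_i * se_j * LD[i, j]
Sigma <- diag(gwas_se) %*% LD %*% diag(gwas_se)

# Draw null effects with a Cholesky factor (base-R stand-in for mvtnorm::rmvnorm):
set.seed(1)
nsamp <- 2000
Z <- matrix(rnorm(nsamp * length(gwas_se)), nrow = nsamp)
beta_null <- Z %*% chol(Sigma)  # each row is one correlated null draw

# Empirical two-sided p-value for a hypothetical observed t-statistic:
t_obs <- 1.5
t_null <- rnorm(nsamp)  # placeholder for the t-statistics of the refitted null models
pval <- mean(abs(t_null) >= abs(t_obs))
```

With `nsamp = 1e4`, the smallest nonzero p-value this estimator can produce is 1e-4, so rounding to three decimals before the FDR step maps such values to 0.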

In [None]:
#!/bin/bash
# The shebang above selects the interpreter and must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted (a), begins (b), ends (e).
# PBS stops reading directives at the first executable command, so these must come before it:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

# Write start and end timestamps to a per-job file (the job ID keeps the name unique):
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/p_values/p_values.sif Rscript /data/p_values_fs_1.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly:
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2024-02-06 20:16:00:
   Job Id:             107649683.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      656.68
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 135:10:53
   Memory Requested:   500.0GB               Memory Used: 117.95GB
   Walltime requested: 10:00:00            Walltime Used: 04:33:37
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R

# Read the results with the p_value and FDR values:
results_2 <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_fs1.txt", header=TRUE, sep="\t")
print(dim(results_2))
print(head(results_2))

[1] 1670    4
          gene_id pvalue nsamp_used       fdr
1 ENSG00000002016  0.884      10000 0.9779379
2 ENSG00000005156  0.143      10000 0.5553721
3 ENSG00000006282  0.126      10000 0.5367857
4 ENSG00000006744  0.140      10000 0.5514520
5 ENSG00000007171  0.467      10000 0.8385914
6 ENSG00000007341  0.286      10000 0.7280793
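
The `fdr` column comes from `p.adjust(..., method = "BH")`. As a minimal sketch of what that adjustment does, the block below applies it to four hypothetical p-values (not from the analysis) and reproduces it by hand: each p-value is multiplied by n / rank, and a running minimum from the largest p-value downwards keeps the adjusted values monotone.

```r
# Hypothetical p-values (not from the analysis):
p <- c(g1 = 0.001, g2 = 0.030, g3 = 0.010, g4 = 0.500)

# Built-in Benjamini-Hochberg adjustment:
fdr_builtin <- p.adjust(p, method = "BH")

# The same adjustment written out by hand:
n <- length(p)
ord <- order(p, decreasing = TRUE)                 # largest p-value first
scaled <- p[ord] * n / (n:1)                       # p * n / rank
fdr_manual <- pmin(1, cummin(scaled))[order(ord)]  # enforce monotonicity, restore input order

print(fdr_builtin)
```

Because the scaling depends on the number of tests, the adjusted values in the table above are tied to the 1670 genes tested together.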


In [None]:
%%R

# Call the analyze_gene_data() function to organize the Mt_Robin results for significant and non-significant genes:

# Create the output paths:
save_paths <- list(
  all_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs1_all_genes_sig_non_sig.txt",
  protein_coding_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs1_protein_coding_genes_sig_not_sig.txt"
)

# Call the function:
analyze_gene_data(
  result_main_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_1.rds",
  eqtl_long_covid_exposure_path = "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt",
  save_files = save_paths
)

Saved all_genes as all_genes to /content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs1_all_genes_sig_non_sig.txt 
Dimensions of all_genes : 1670 10 

          gene_id gene_name      gene_type      beta_y Obs SNPs
1 ENSG00000002016     RAD52 protein_coding   0.8437588  23    8
2 ENSG00000005156      LIG3 protein_coding  12.6260663  10    3
3 ENSG00000006282   SPATA20 protein_coding -27.0963502  46    2
4 ENSG00000006744     ELAC2 protein_coding   7.5924973  18    4
5 ENSG00000007171      NOS2 protein_coding -20.1542226  31    5
6 ENSG00000007341      ST7L protein_coding 298.0706258  12    2
                                                                               SNP_IDs
1 rs11064536,rs11571383,rs117766797,rs147560086,rs2107614,rs4980935,rs718389,rs7955986
2                                                     rs12945428,rs201016783,rs3135958
3                                                                  rs2306001,rs9890200
4                                             rs1044564,r

In [None]:
%%R

# FS1 final table:
# Call the function that retrieves additional information from the Mt_Robin results and filters the significant genes:
process_final_data(input_path_1 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_1_no_errors.rds",
                   input_path_2 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs1_protein_coding_genes_sig_not_sig.txt",
                   input_path_3 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_fs1.txt",
                   output_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs1_protein_coding_genes_sig_full_table.txt")

[1] "Causal genes (noncoding genes and protein coding genes): "
[1] 1670
$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: -25.7147
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 14.411  
 Residual         1.807  
Number of obs: 23, groups:  snpID, 8
Fixed Effects:
beta_y  
0.8438  

$gwas_res
        snpID  gwas_beta   gwas_se
1  rs11064536  0.0651312 0.0422060
2  rs11571383 -0.0156255 0.0667074
3 rs117766797 -0.1373340 0.1174630
4 rs147560086  0.0320435 0.0386181
5   rs2107614 -0.0385448 0.0328251
6   rs4980935 -0.0151348 0.0408972
7    rs718389 -0.0713325 0.0374723
8   rs7955986 -0.0154484 0.0307614

$LD
            rs147560086   rs7955986    rs718389  rs11571383   rs2107614
rs147560086  1.00000000  0.57848413 -0.19239180  0.66694928 -0.29808391
rs7955986    0.57848413  1.00000000 -0.35065828  0.41826699 -0.19482499
rs718389    -0.19239180 -0.35065828  1.00000000  0.0

In [None]:
%%R

# Call the last function to retrieve the tissue information and categories for fs1:
input_eqtl_path <- "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt"
input_mt_robin_path <- "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs1_protein_coding_genes_sig_full_table.txt"
output_path <- "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs1_protein_coding_genes_sig_full_table_tissues.txt"
process_eqtl_mt_robin(input_eqtl_path, input_mt_robin_path, output_path)

           gene_id gene_name SNPs
1  ENSG00000026950    BTN3A1    9
2  ENSG00000065457     ADAT1    4
3  ENSG00000090661     CERS4   15
4  ENSG00000113734     BNIP1    7
5  ENSG00000132507     EIF5A    2
6  ENSG00000139714     MORN3   18
7  ENSG00000146530      VWDE    8
8  ENSG00000158825       CDA   14
9  ENSG00000171160     MORN4    2
10 ENSG00000173540     GMPPB    2
11 ENSG00000176386     CDC26    2
12 ENSG00000177025  C19orf18   17
13 ENSG00000183336     BOLA2    3
14 ENSG00000184983    NDUFA6    2
                                                                                                                                                                                                SNPs_IDs
1                                                                                                         rs114157910,rs12213056,rs13194491,rs2275906,rs3208733,rs3799378,rs56325223,rs7356982,rs9467733
2                                                                                       

### **GWAS2**

In [None]:
# Step 1: Load the required libraries:
library(matrixStats) # It needs to be installed separately
#library(BiocManager) # It needs to be installed separately
library(snpStats) # It needs to be installed separately
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(reshape2) # Additional package for melt function

# Step 2: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 3: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_2_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 4: Open the genes input data set:
all_genes <- read.table("/data/all_genes_110.txt",
                        header = TRUE,
                        sep = "",
                        stringsAsFactors = FALSE)

# Step 5: Convert the previous list of genes into a vector:
all_genes_vector <- all_genes$gene_id

# Step 6: Save the paths of the GTEx files:
gtex_covid_bed <- "/data/WGS_GTEx_long_covid_rsID.bed"
gtex_covid_bim <- "/data/WGS_GTEx_long_covid_rsID.bim"
gtex_covid_fam <- "/data/WGS_GTEx_long_covid_rsID.fam"

# Step 7: Create a variable for the path to the directory to save the LD matrix results:
ld_matrix_dir <- "/data/ld_matrix_bs_2"

# Step 8: Create the directory if it doesn't exist:
if (!dir.exists(ld_matrix_dir)) {
  dir.create(ld_matrix_dir)
}

# Step 9: Modify the original select_IV_bs() function to accept NAs from the eQTL and calculate the LD matrix inside:
# select_IV_bs() --> backward selection starts with all variables and removes one at a time:
select_IV_bs <- function(geneID,
                         eqtl_data,
                         nTiss,
                         bed_file,
                         bim_file,
                         fam_file,
                         ld_matrix_path,
                         ld_thresh=0.5,
                         pval_thresh=0.001,
                         nTiss_thresh=3){

  ## Subset the eQTL data set for the gene of interest:
  if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
  eqtl_data <- subset(eqtl_data, gene_id == geneID)

  pval_mat <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
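  eqtl_data$min_p <- matrixStats::rowMins(pval_mat, na.rm = TRUE) # Mod1: I added na.rm = TRUE
  eqtl_data$nTiss_ltThresh <- rowSums(pval_mat < pval_thresh, na.rm = TRUE) # Number of tissues with p < pval_thresh (needed by Criteria 2 below)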
  eqtl_data$nTissThresh_p <- apply(pval_mat, 1, function(x) sort(x)[nTiss_thresh])
  eqtl_data <- subset(eqtl_data, nTissThresh_p < pval_thresh)

  ## Ensure same sign for tissues with p < pval_thresh:
  betas <- as.matrix(subset(eqtl_data,select=paste0("beta_",1:nTiss)))
  pvals <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  pvals_lt001 <- pvals < pval_thresh # Use the function argument rather than a hard-coded 0.001
  betas_sig <- betas*pvals_lt001
  sig_neg_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x < -1e-15)))) # Mod2: I added is.na()
  sig_pos_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x > 1e-15)))) # Mod3: I added is.na()
  eqtl_data$pos_sig_eQTL <- sig_pos_eQTL
  eqtl_data$neg_sig_eQTL <- sig_neg_eQTL
  eqtl_data <- subset(eqtl_data, !(pos_sig_eQTL==1 & neg_sig_eQTL==1))

  if(nrow(eqtl_data)==0) stop(paste0("No cis-SNPs for gene with p<",pval_thresh," in at least ",nTiss_thresh," tissues."))

  ## Obtain a set of candidate SNPs:
  SNP_pool <- eqtl_data$variant_id
  if(length(SNP_pool)==1){
    return(SNP_pool)
  }

  # Calculate the LD matrix:
  snp_data <- read.plink(bed_file, bim_file, fam_file)
  snp_matrix <- snp_data$genotypes
  numeric_matrix <- as(snp_matrix, "numeric")

  # Restrict LD matrix to SNPs in the data set:
  df_subset <- numeric_matrix[, colnames(numeric_matrix) %in% SNP_pool]
  LD <- cor(df_subset, use="pairwise.complete")
  LD_r2 <- LD^2

  # Export a .txt file of the LD matrix:
  utils::write.table(LD,
                     file = ld_matrix_path,
                     sep = " ",
                     row.names = TRUE,
                     col.names = TRUE,
                     quote = FALSE)

  ## Get pairwise r^2:
  LD_r2_pw <- LD_r2
  LD_r2_pw[lower.tri(LD_r2_pw, diag = TRUE)] <- NA
  LD_r2_pw <- reshape2::melt(LD_r2_pw)
  colnames(LD_r2_pw) <- c("SNP1","SNP2","r2")
  LD_r2_pw$SNP1 <- as.character(LD_r2_pw$SNP1)
  LD_r2_pw$SNP2 <- as.character(LD_r2_pw$SNP2)

  ## Drop SNPs paired with themselves and duplicates (resulting from lower triangle, set to NA):
  LD_r2_pw <- subset(LD_r2_pw, SNP1 != SNP2 & !is.na(r2))

  ## Sort by descending r^2:
  LD_r2_pw <- LD_r2_pw[order(LD_r2_pw$r2, decreasing = TRUE),]

  ## If all candidate SNPs have pairwise r^2 >= thresh, return the single SNP with the smallest p-value:
  if(min(LD_r2_pw$r2) >= ld_thresh){
    return(eqtl_data$variant_id[which.min(eqtl_data$min_p)])
  }

  while(max(LD_r2_pw$r2)>=ld_thresh){

    ## Identify top SNPs in the pool:
    eqtl_data <- subset(eqtl_data, variant_id %in% SNP_pool)
    LD_r2_pw <- subset(LD_r2_pw, SNP1 %in% SNP_pool & SNP2 %in% SNP_pool)
    LD_r2_pw_geThresh <- subset(LD_r2_pw, r2>=ld_thresh)

    ## Identify SNP pair with highest remaining r^2:
    next_pair <- LD_r2_pw[1,]

    ## Criteria 1: Drop the SNP with largest number of correlations > threshold with other SNPs:
    SNP1_nSNP_geThresh <- sum(LD_r2_pw_geThresh$SNP1 == next_pair$SNP1) + sum(LD_r2_pw_geThresh$SNP2 == next_pair$SNP1)
    SNP2_nSNP_geThresh <- sum(LD_r2_pw_geThresh$SNP1 == next_pair$SNP2) + sum(LD_r2_pw_geThresh$SNP2 == next_pair$SNP2)
    if(SNP1_nSNP_geThresh > SNP2_nSNP_geThresh){
      next_SNP_drop <- next_pair$SNP1
    } else if (SNP1_nSNP_geThresh < SNP2_nSNP_geThresh){
      next_SNP_drop <- next_pair$SNP2
    } else{

      ## Criteria 2: Drop the SNP with smaller number of tissues with p<0.001:
      next_pair_eqtl <- subset(eqtl_data, variant_id %in% c(next_pair$SNP1,next_pair$SNP2))
      SNP1_nTiss <- subset(next_pair_eqtl,variant_id==next_pair$SNP1)$nTiss_ltThresh
      SNP2_nTiss <- subset(next_pair_eqtl,variant_id==next_pair$SNP2)$nTiss_ltThresh

      if(SNP1_nTiss > SNP2_nTiss){
        next_SNP_drop <- next_pair$SNP2
      } else if(SNP1_nTiss < SNP2_nTiss){
        next_SNP_drop <- next_pair$SNP1
      } else{

        ## Criteria 3: Drop the SNP with larger minimum p-value:
        next_SNP_drop <- next_pair_eqtl$variant_id[which.max(next_pair_eqtl$min_p)]
      }
    }
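    SNP_pool <- SNP_pool[which(SNP_pool != next_SNP_drop)]

    ## Refresh the pairwise table so the loop condition only sees SNPs still in the pool:
    LD_r2_pw <- subset(LD_r2_pw, SNP1 %in% SNP_pool & SNP2 %in% SNP_pool)
    if(nrow(LD_r2_pw)==0) break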
  }
  return(SNP_pool)
}

# Step 10: Define the number of cores to use for the parallelism process:
num_cores <- 48

# Step 11: Run select_IV_bs() for every gene in parallel, catching errors per gene:
result_list <- mclapply(all_genes_vector, function(geneID) {
  # Generate a unique file path for the LD matrix of this gene
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  tryCatch({
    select_IV_bs(geneID = geneID,
                 eqtl_data = eqtl_covid_exposure,
                 nTiss = 49,
                 bed_file = gtex_covid_bed,
                 bim_file = gtex_covid_bim,
                 fam_file = gtex_covid_fam,
                 ld_matrix_path = ld_matrix_path,
                 ld_thresh = 0.5,
                 pval_thresh = 0.001,
                 nTiss_thresh = 1)
  }, error = function(e) print(e))
}, mc.cores = num_cores)

# Step 12: Name each element of the result list with its gene ID:
names(result_list) <- all_genes_vector

# Step 13: Save the selected IVs as a .rds file:
saveRDS(result_list, file = "/data/long_covid_IVs_bs_2.rds")
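
Inside `select_IV_bs()`, the LD matrix is turned into a table of unique SNP pairs sorted by r², and pruning starts from the top of that table. The sketch below illustrates that step with a toy correlation matrix and hypothetical SNP names (`rsA`, `rsB`, `rsC`). It uses base-R indexing where the function uses `reshape2::melt()`, and it covers only the pairwise table, not the three drop criteria.

```r
# Toy LD correlation matrix (hypothetical SNP names and values):
snps <- c("rsA", "rsB", "rsC")
LD <- matrix(c(1.0, 0.9, 0.2,
               0.9, 1.0, 0.3,
               0.2, 0.3, 1.0),
             nrow = 3, byrow = TRUE, dimnames = list(snps, snps))
LD_r2 <- LD^2

# Keep each unordered pair once by taking the upper triangle:
idx <- which(upper.tri(LD_r2), arr.ind = TRUE)
pairs <- data.frame(SNP1 = rownames(LD_r2)[idx[, "row"]],
                    SNP2 = colnames(LD_r2)[idx[, "col"]],
                    r2 = LD_r2[idx],
                    stringsAsFactors = FALSE)

# Sort by descending r^2 so the most correlated pair comes first:
pairs <- pairs[order(pairs$r2, decreasing = TRUE), ]

# Pairs at or above the LD threshold are the ones that trigger pruning:
ld_thresh <- 0.5
pairs_over <- pairs[pairs$r2 >= ld_thresh, ]
print(pairs_over)
```

In this toy case only the `rsA`/`rsB` pair exceeds the threshold, so a single drop is enough to leave a pool with all pairwise r² below 0.5.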

In [None]:
#!/bin/bash
# The shebang above selects the interpreter and must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=30:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted (a), begins (b), ends (e).
# PBS stops reading directives at the first executable command, so these must come before it:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

# Write start and end timestamps to a per-job file (the job ID keeps the name unique):
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup.sif Rscript /data/select_function_bs_2.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-24 04:51:18:
   Job Id:             93353705.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      800.16
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 258:44:17
   Memory Requested:   500.0GB               Memory Used: 393.32GB
   Walltime requested: 30:00:00            Walltime Used: 05:33:24
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R

# Open the output file:
result_selectIV_fs_2 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_IVs_fs_2.rds")

# Filter out errors:
result_selectIV_fs_2 <- result_selectIV_fs_2[!sapply(result_selectIV_fs_2, function(x) inherits(x, "simpleError"))]

# Print the length and the first element:
print(length(result_selectIV_fs_2))
print(length(unique(result_selectIV_fs_2)))
print(head(result_selectIV_fs_2))

[1] 38574
[1] 38480
$ENSG00000000003
 [1] "rs8780"      "rs1204407"   "rs189980168" "rs4145090"   "rs139645081"
 [6] "rs112002620" "rs113922837" "rs17329105"  "rs193124579" "rs146639579"
[11] "rs201585484" "rs111478058" "rs143491698" "rs572714672" "rs56100956" 
[16] "rs147601806" "rs139830751" "rs12842251"  "rs73630694" 

$ENSG00000000005
 [1] "rs6616166"   "rs4828038"   "rs12006573"  "rs73557922"  "rs141394856"
 [6] "rs140468798" "rs73555045"  "rs5921738"   "rs139231094" "rs59368006" 
[11] "rs111915382" "rs73562536"  "rs12832642"  "rs201254498" "rs186611188"
[16] "rs113422802" "rs796285800" "rs180795909" "rs1004006"   "rs148617015"
[21] "rs140109110"

$ENSG00000000419
 [1] "rs1292040204" "rs79562553"   "rs4809840"    "rs35201621"   "rs35530561"  
 [6] "rs141892692"  "rs944933"     "rs11483800"   "rs113036491"  "rs116892696" 
[11] "rs34483216"   "rs60578213"   "rs118181672"  "rs4811253"    "rs11086363"  
[16] "rs73278571"   "rs73131221"   "rs78319058"   "rs79223715"   "rs189417964" 
[2

In [None]:
library(parallel) # R package installed by default

# Step 1: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 2: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_2_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 3: Open the list of genes-SNPs:
result_selectedIVs <- readRDS("/data/long_covid_IVs_bs_2.rds")

# Step 4: Create a variable for the path to the LD matrix directory:
ld_matrix_dir <- "/data/ld_matrix_bs_2"

# Step 5: Setup the Mt_Robin method using the following setup function:
MR_MtRobin_setup <- function(geneID,
                             snpID,
                             eqtl_data,
                             gwas_data,
                             ld_matrix_path,
                             nTiss){
  tryCatch({
    ## Check if the gene is present and subset
    if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
    eqtl_data <- subset(eqtl_data, gene_id == geneID)

    ## Confirm all SNPs are present in the dataset
    if(!all(snpID %in% eqtl_data$variant_id)) stop("Some SNPs missing in eQTL dataset")
    if(!all(snpID %in% gwas_data$variant_id)) stop("Some SNPs missing in GWAS dataset")

    # Read the LD matrix from a file
    LD <- as.matrix(utils::read.table(ld_matrix_path,
                                      header = TRUE,
                                      check.names = FALSE))

    if(!all(snpID %in% colnames(LD))) stop("Some SNPs missing in LD matrix")

    ## Align data
    eqtl_data <- eqtl_data[match(snpID,eqtl_data$variant_id),]
    gwas_data <- gwas_data[match(snpID,gwas_data$variant_id),]
    LD <- LD[match(snpID,rownames(LD)),match(snpID,colnames(LD))]

    eqtl_betas <- as.matrix(eqtl_data[,paste0("beta_",1:nTiss)])
    eqtl_se <- as.matrix(eqtl_data[,paste0("SE_",1:nTiss)])
    eqtl_pvals <- as.matrix(eqtl_data[,paste0("pvalue_",1:nTiss)])
    gwas_betas <- gwas_data$beta
    gwas_se <- gwas_data$SE

    row.names(eqtl_betas) <- row.names(eqtl_se) <- row.names(eqtl_pvals) <- eqtl_data$variant_id

    return(list(snpID=snpID,gwas_betas=gwas_betas,gwas_se=gwas_se,
                eqtl_betas=eqtl_betas,eqtl_se=eqtl_se,eqtl_pvals=eqtl_pvals,LD=LD))
  }, error = function(e) print(e))
}

# Step 6: Apply the previous setup function:

# Step 6.1: Initialize an empty list to store the results:
MR_MtRobin_results <- list()

# Step 6.2: Define the number of cores to use for the parallelism process:
num_cores <- 36

# Step 6.3: Loop over each gene in result_selectedIVs:
MR_MtRobin_results <- mclapply(names(result_selectedIVs), function(geneID) {

  # Step 6.3.1: Get the SNPs associated with each gene:
  snpID <- result_selectedIVs[[geneID]]

  # Step 6.3.2: Generate a unique file path for the LD matrix of this gene:
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  # Step 6.3.3: Call the MR_MtRobin_setup() function with the parameters for each gene:
  return(tryCatch({
    MR_MtRobin_setup(geneID = geneID,
                     snpID = snpID,
                     eqtl_data = eqtl_covid_exposure,
                     gwas_data = gwas_covid_outcome,
                     ld_matrix_path = ld_matrix_path,
                     nTiss = 49)
  }, error = function(e) print(e)))
}, mc.cores = num_cores)

# Step 7: Assign names to the list:
names(MR_MtRobin_results) <- names(result_selectedIVs)

# Step 8: Save the previous result as a .rds file:
saveRDS(MR_MtRobin_results, file = "/data/long_covid_setup_bs_2.rds")

In [None]:
#!/bin/bash
# The line above specifies the interpreter and must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives must appear before the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write timestamps to a per-job file (named by job ID) so that runs do not overwrite each other:
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup.sif Rscript /data/setup_function_bs_2.R

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-24 16:29:36:
   Job Id:             93430390.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      74.34
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 19:47:13
   Memory Requested:   300.0GB               Memory Used: 83.56GB
   Walltime requested: 10:00:00            Walltime Used: 00:41:18
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R

# Open the output file:
result_setup_fs_2 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_setup_fs_2.rds")

# Filter out errors:
result_setup_fs_2 <- result_setup_fs_2[!sapply(result_setup_fs_2, function(x) inherits(x, "simpleError"))]

# Print the length and the first element:
print(length(result_setup_fs_2))
print(length(unique(result_setup_fs_2)))
print(result_setup_fs_2[[1]])

[1] 4754
[1] 4754
$snpID
[1] "rs147560086" "rs7955986"   "rs718389"    "rs11571383"  "rs2107614"  
[6] "rs4980935"   "rs11064536"  "rs117766797"

$gwas_betas
[1]  0.021120800 -0.006508970 -0.017204900 -0.045787500 -0.000839245
[6] -0.015072100  0.005801520 -0.016338200

$gwas_se
[1] 0.0258757 0.0206358 0.0234662 0.0569715 0.0220439 0.0229060 0.0281981
[8] 0.0628533

$eqtl_betas
            beta_1 beta_2    beta_3   beta_4 beta_5    beta_6    beta_7 beta_8
rs147560086     NA     NA -0.332722 -0.36379     NA -0.428724        NA     NA
rs7955986       NA     NA        NA       NA     NA        NA        NA     NA
rs718389        NA     NA        NA       NA     NA        NA        NA     NA
rs11571383      NA     NA        NA       NA     NA        NA -0.525715     NA
rs2107614       NA     NA        NA       NA     NA        NA        NA     NA
rs4980935       NA     NA        NA       NA     NA        NA        NA     NA
rs11064536      NA     NA        NA       NA     NA        NA     

In [None]:
# Step 1: Load the required libraries:
library(base) # R package installed by default
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(Matrix) # R package installed by default
library(methods) # R package installed by default
library(stats) # R package installed by default
library(lme4) # It needs to be installed separately
library(data.table) # It needs to be installed separately

# Step 2: Main Mt_Robin function:
MR_MtRobin <- function(MR_MtRobin_input,pval_thresh=0.001){
  snpID <- MR_MtRobin_input$snpID
  gwas_betas <- MR_MtRobin_input$gwas_betas
  gwas_se <- MR_MtRobin_input$gwas_se
  eqtl_betas <- MR_MtRobin_input$eqtl_betas
  eqtl_se <- MR_MtRobin_input$eqtl_se
  eqtl_pvals <- MR_MtRobin_input$eqtl_pvals
  LD <- MR_MtRobin_input$LD

  if(ncol(eqtl_betas) != ncol(eqtl_se) | ncol(eqtl_betas) != ncol(eqtl_pvals)){
    stop("eQTL betas, SE and p-values do not all have same number of columns")
  }

  ## Determine number of tissues (or conditions/studies/etc.)
  nT <- ncol(eqtl_betas)

  ## rename columns for standardized processing
  colnames(eqtl_betas) <- paste0("beta_",1:nT)
  colnames(eqtl_se) <- paste0("SE_",1:nT)
  colnames(eqtl_pvals) <- paste0("pvalue_",1:nT)

  ## combine eQTL results into data.table
  eqtl_res <- data.table::data.table(snpID,eqtl_betas,eqtl_se,eqtl_pvals)

  # reshape to long format (each SNP-gene-tissue will be row)
  eQTL_res_melt <- data.table::melt(eqtl_res,id.vars=c("snpID"),
                                    measure.vars=patterns("^beta_","^SE_","^pvalue_"),value.name=c("beta","se","pvalue"),
                                    variable.name="tissue")

  ## subset to strong instrumental variables (based on p-value threshold)
  eQTL_res_melt_PltThresh <- subset(eQTL_res_melt, pvalue < pval_thresh)

  ## set up gwas data for merging
  gwas_res <- data.table::data.table(snpID=snpID,gwas_beta=gwas_betas,gwas_se=gwas_se)

  ## merge eQTL and GWAS results
  data.table::setkey(eQTL_res_melt_PltThresh,snpID)
  data.table::setkey(gwas_res,snpID)
  merged_res <- data.table::merge.data.table(eQTL_res_melt_PltThresh,gwas_res)

  ## set up coefficients for reverse regression
  beta_x <- matrix(merged_res$beta,ncol=1)
  beta_y <- matrix(merged_res$gwas_beta,ncol=1)

  ## standard errors (for weights)
  se_x <- matrix(merged_res$se,ncol=1)
  se_y <- matrix(merged_res$gwas_se,ncol=1) ## not used by function; used in resampling (consider returning with results)

  ## identifiers
  snpID <- merged_res$snpID

  lme_res <- lme4::lmer(beta_x~(beta_y-1)+(beta_y-1|snpID),weights=1/se_x^2)

  ## return results from weighted regression with random slopes (and correlated errors) along with GWAS SE and LD (needed for resampling)
  return(list(lme_res=lme_res,gwas_res=gwas_res,LD=LD))
}

# Step 3: Read the previous file:
result_setup <- readRDS("/data/long_covid_setup_fs_2.rds")

# Step 4: Apply the main Mt_Robin function using parallel processing:
no_cores <- 36 # Match the number of CPUs requested in the PBS script (ncpus=36)

cl <- makeCluster(no_cores) # Create a cluster with the number of cores
clusterExport(cl, "MR_MtRobin") # Export the MR_MtRobin function to all the nodes of the cluster
result_list_MtRobin <- parLapply(cl, result_setup, function(MR_MtRobin_input) {
  tryCatch({
    MR_MtRobin(MR_MtRobin_input = MR_MtRobin_input)
  }, error = function(e) print(e))
})
stopCluster(cl) # Stop the cluster once the computation is done

# Step 5: The names of the results will correspond to the gene ids:
names(result_list_MtRobin) <- names(result_setup)

# Step 6: Save the previous result as a .rds file:
saveRDS(result_list_MtRobin, file = "/data/long_covid_main_fs_2.rds")
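
The reshaping step inside `MR_MtRobin()` turns the wide per-tissue columns (`beta_k`, `SE_k`, `pvalue_k`) into one row per SNP-tissue pair, keeps only pairs with a p-value below `pval_thresh`, and weights each retained row by `1/se^2`. The following is a minimal Python sketch of that idea, not the data.table implementation; the function name `melt_and_filter` and the toy values are hypothetical, and `None` stands in for `NA`.

```python
def melt_and_filter(snp_ids, betas, ses, pvals, pval_thresh=0.001):
    """Reshape wide per-tissue eQTL statistics into long rows and keep only
    SNP-tissue pairs whose p-value is below `pval_thresh`.

    betas, ses, pvals: one row per SNP, one column per tissue;
    None marks a missing SNP-tissue pair.
    """
    rows = []
    for snp, b_row, s_row, p_row in zip(snp_ids, betas, ses, pvals):
        for tissue, (b, s, p) in enumerate(zip(b_row, s_row, p_row), start=1):
            # Missing p-values are dropped, as subset() does with NA in R.
            if p is not None and p < pval_thresh:
                rows.append({"snpID": snp, "tissue": tissue, "beta": b,
                             "se": s, "pvalue": p, "weight": 1.0 / s ** 2})
    return rows


# Toy input (made-up numbers): two SNPs, two tissues.
long_rows = melt_and_filter(
    snp_ids=["rsA", "rsB"],
    betas=[[0.4, None], [None, -0.2]],
    ses=[[0.1, None], [None, 0.05]],
    pvals=[[1e-5, None], [None, 0.02]],
)
print(long_rows)
```

Only the `rsA`, tissue 1 pair survives here, because the `rsB` pair has a p-value above the threshold. A gene whose SNPs all fail the filter ends up with no rows to fit.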

In [None]:
#!/bin/bash
# The line above specifies the interpreter and must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives must appear before the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write timestamps to a per-job file (named by job ID) so that runs do not overwrite each other:
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/main_function.sif Rscript /data/main_function_fs_2.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-24 17:34:46:
   Job Id:             93443496.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      2.70
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 00:02:51
   Memory Requested:   300.0GB               Memory Used: 9.55GB
   Walltime requested: 10:00:00            Walltime Used: 00:01:30
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R

# Open the output file:
result_main_fs_2 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_2.rds")

# Filter out errors:
result_main_fs_2 <- result_main_fs_2[!sapply(result_main_fs_2, function(x) inherits(x, "simpleError"))]

# Print the length, the number of unique elements, and the first six elements:
print(length(result_main_fs_2))
print(length(unique(result_main_fs_2)))
print(head(result_main_fs_2))

[1] 1624
[1] 1624
$ENSG00000002016
$ENSG00000002016$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: -27.2451
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 24.597  
 Residual         2.247  
Number of obs: 23, groups:  snpID, 8
Fixed Effects:
beta_y  
 1.193  

$ENSG00000002016$gwas_res
        snpID    gwas_beta   gwas_se
1  rs11064536  0.005801520 0.0281981
2  rs11571383 -0.045787500 0.0569715
3 rs117766797 -0.016338200 0.0628533
4 rs147560086  0.021120800 0.0258757
5   rs2107614 -0.000839245 0.0220439
6   rs4980935 -0.015072100 0.0229060
7    rs718389 -0.017204900 0.0234662
8   rs7955986 -0.006508970 0.0206358

$ENSG00000002016$LD
            rs147560086   rs7955986    rs718389  rs11571383   rs2107614
rs147560086  1.00000000  0.57848413 -0.19239180  0.66694928 -0.29808391
rs7955986    0.57848413  1.00000000 -0.35065828  0.41826699 -0.19482499
rs718389    -0.19239180 
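
In the output above, `gwas_res` is sorted by `snpID` (a side effect of `data.table::setkey()`), whereas `LD` keeps the original instrument order. Any computation that combines the two, such as the covariance diag(se) · LD · diag(se) used when resampling, needs both in the same SNP order. The following is a minimal Python sketch of reordering by name before building that covariance; the helper name `aligned_covariance` and the toy inputs are hypothetical.

```python
def aligned_covariance(snp_order, se_by_snp, ld, ld_names):
    """Return cov[i][j] = se_i * LD[i][j] * se_j with rows and columns
    arranged in `snp_order`, regardless of the order used in `ld`."""
    index = {name: k for k, name in enumerate(ld_names)}
    cov = []
    for a in snp_order:
        row = []
        for b in snp_order:
            row.append(se_by_snp[a] * ld[index[a]][index[b]] * se_by_snp[b])
        cov.append(row)
    return cov


# Toy LD matrix stored in selection order (rsC, rsA, rsB); values are made up.
ld_names = ["rsC", "rsA", "rsB"]
ld = [[1.0, 0.8, 0.1],
      [0.8, 1.0, 0.3],
      [0.1, 0.3, 1.0]]
se_by_snp = {"rsA": 0.1, "rsB": 0.2, "rsC": 0.3}

# Covariance in sorted SNP order (rsA, rsB, rsC).
cov = aligned_covariance(sorted(se_by_snp), se_by_snp, ld, ld_names)
print(cov)
```

Looking entries up by SNP name rather than by position means the result does not depend on how either input happens to be ordered.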

In [None]:
########################################################################
# GADI:

# Obtain p-values from the MR-MtRobin results.
# Uses a resampling procedure to estimate the p-value for each gene (one MR_MtRobin() result per gene).
# Parameters:
#   results_1_path    path to an .rds file holding a named list of MR_MtRobin() results (one element per gene).
#   nsamp             integer, number of samples drawn from the null distribution to estimate the p-value.
#   use_nonconverge   logical, whether to use samples whose model fit fails. In this version a sample is
#                     dropped only when lmer() raises an error, so this flag does not change the result.
#   output_file_path  path of the tab-separated results file.
#   num_cpus          integer, number of cores passed to mclapply().
# Returns a named list with one element per gene, each containing:
#   pvalue      numeric, the estimated p-value.
#   nsamp_used  integer, the number of samples used to estimate the p-value
#               (may be smaller than nsamp because failed fits are dropped).
#   fdr         numeric, the Benjamini-Hochberg adjusted p-value.

# Load packages:
library(lme4)
library(mvtnorm)
library(data.table)
library(parallel)

# Load main results:
results_1 <- readRDS("/scratch/sq95/sp6154/p_values/long_covid_main_fs_2.rds")
# Filter out errors:
results_1 <- results_1[!sapply(results_1, function(x) inherits(x, "simpleError"))]
# Export:
saveRDS(results_1, "/scratch/sq95/sp6154/p_values/long_covid_main_fs_2_no_errors.rds")

# Updated Function to calculate the p-value and FDR with parallel processing:
MR_MtRobin_resample <- function(results_1_path, nsamp=1e4, use_nonconverge=FALSE, output_file_path, num_cpus=1){
  # Check and load the results_1 dataset from the specified path:
  if(file.exists(results_1_path)){
    results_1 <- readRDS(results_1_path)
  } else {
    stop("The specified file path for results_1 does not exist.")
  }
  # Initialize a list to store results for each gene:
  pvalues_results <- list()
  # Loop over each gene:
  for(gene_id in names(results_1)) {
    # Check if necessary components are present and not NULL:
    if(is.null(results_1[[gene_id]][["lme_res"]]) || is.null(results_1[[gene_id]][["gwas_res"]]) || is.null(results_1[[gene_id]][["LD"]])) {
      next
    }
    # Extract the components for this gene:
    lme_res <- results_1[[gene_id]][["lme_res"]]
    gwas_res <- results_1[[gene_id]][["gwas_res"]]
    LD <- results_1[[gene_id]][["LD"]]
    # Extract data from MR-MtRobin results:
    eqtl_betas <- lme_res@frame$beta_x
    # as.character() ensures that columns are later selected by SNP name, not by factor codes:
    snpID <- as.character(lme_res@frame$snpID)
    weights <- lme_res@frame$weights
    tstat_MR_MtRobin <- summary(lme_res)$coefficients[1,3]
    gwas_se <- gwas_res$gwas_se
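    # Null distribution of the GWAS betas, accounting for LD correlations.
    # gwas_res is sorted by snpID (setkey() in MR_MtRobin), so reorder LD to match before combining:
    LD <- LD[as.character(gwas_res$snpID), as.character(gwas_res$snpID), drop=FALSE]
    se_diag <- diag(gwas_se, nrow=length(gwas_se))
    beta_gwas_nullMat <- mvtnorm::rmvnorm(nsamp, mean=rep(0, length(gwas_se)), sigma=se_diag %*% LD %*% se_diag)
    colnames(beta_gwas_nullMat) <- gwas_res$snpID
    # Compute the null t-statistics in parallel: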
    tstat_nulls <- mclapply(1:nsamp, function(i) {
      beta_gwas_null <- beta_gwas_nullMat[i, snpID]
      tryCatch({
        # LMM equation:
        lme_null <- lme4::lmer(eqtl_betas~(beta_gwas_null-1)+(beta_gwas_null-1|snpID), weights=weights)
        return(summary(lme_null)$coefficients[1,3])
      }, error=function(e) {
        if(!use_nonconverge) return(NA)
      })
    }, mc.cores = num_cpus)
    tstat_nulls <- unlist(tstat_nulls)
    tstat_nulls <- tstat_nulls[!is.na(tstat_nulls)] # Remove NAs
    nsamp_used <- length(tstat_nulls)
    # Calculate the p-value and round it to 3 decimal places:
    pval <- round(mean(abs(tstat_nulls) >= abs(tstat_MR_MtRobin)), digits = 3)
    # Store the results for the current gene in the list:
    pvalues_results[[gene_id]] <- list(pvalue=pval, nsamp_used=nsamp_used)
  }
  # Calculate FDR:
  all_pvalues <- sapply(pvalues_results, function(x) x$pvalue)
  fdr_values <- p.adjust(all_pvalues, method = "BH")
  # Combine results with FDR values:
  for(i in seq_along(fdr_values)) {
    pvalues_results[[names(fdr_values)[i]]]$fdr <- fdr_values[i]
  }
  # Prepare results dataframe:
  results_df <- do.call(rbind, lapply(names(pvalues_results), function(gene_id) {
    data.frame(gene_id = gene_id,
               pvalue = sprintf("%.3f", pvalues_results[[gene_id]]$pvalue), # Format pvalue with 3 decimal places
               nsamp_used = pvalues_results[[gene_id]]$nsamp_used,
               fdr = sprintf("%.15f", pvalues_results[[gene_id]]$fdr), # Keep full precision for fdr
               stringsAsFactors = FALSE)
  }))
  # Write the results to the output file, ensuring the formatting is kept
  write.table(results_df, file = output_file_path, sep = "\t", row.names = FALSE, quote = FALSE, col.names = TRUE)
  return(pvalues_results)
}

# Call the previous function:
MR_MtRobin_resample(
  results_1_path = "/scratch/sq95/sp6154/p_values/long_covid_main_fs_2_no_errors.rds",
  nsamp = 10000, # Number of bootstrap samples
  use_nonconverge = FALSE, # Whether to use non-converging results, typically FALSE
  output_file_path = "/scratch/sq95/sp6154/p_values/MR_pvalue_FDR_results_fs2.txt",
  num_cpus = 48 # Number of CPUs
)
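
The last two steps of the script are an empirical two-sided p-value (the share of null t-statistics at least as extreme as the observed one) followed by a Benjamini-Hochberg adjustment across genes. The following is a minimal Python sketch of both calculations with made-up inputs; the function names are hypothetical, and the adjustment is intended to follow the same definition as `p.adjust(..., method = "BH")`.

```python
def empirical_pvalue(t_obs, t_nulls):
    """Share of null statistics at least as extreme as t_obs (two-sided)."""
    return sum(abs(t) >= abs(t_obs) for t in t_nulls) / len(t_nulls)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest, carrying a running minimum.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy inputs (made-up numbers).
p_example = empirical_pvalue(2.0, [0.5, -2.5, 1.0, 3.0])
fdr_example = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(p_example, fdr_example)
```

With 10,000 null samples the smallest non-zero empirical p-value is 1e-4, so rounding the p-value to three decimals before the adjustment discards some of the available resolution.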

In [None]:
#!/bin/bash
# The line above specifies the interpreter and must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives must appear before the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write timestamps to a per-job file (named by job ID) so that runs do not overwrite each other:
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Option 1: Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/p_values/p_values.sif Rscript /data/p_values_fs_2.R

# Option 2: Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly:
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2024-02-07 15:15:59:
   Job Id:             107706975.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      630.16
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 130:46:06
   Memory Requested:   500.0GB               Memory Used: 116.65GB
   Walltime requested: 10:00:00            Walltime Used: 04:22:34
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R

# Read the results with the p_value and FDR values:
results <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_fs2.txt", header=TRUE, sep="\t")
print(dim(results))
print(head(results))

[1] 1624    4
          gene_id pvalue nsamp_used       fdr
1 ENSG00000002016  0.914      10000 0.9861120
2 ENSG00000005156  0.083      10000 0.5029552
3 ENSG00000006282  0.472      10000 0.8593363
4 ENSG00000006744  0.926      10000 0.9861120
5 ENSG00000007171  0.769      10000 0.9505140
6 ENSG00000007341  0.070      10000 0.4796624


In [None]:
%%R

# Call the analyze_gene_data() function to organize the Mt_Robin results for significant and non-significant genes:

# Create the output paths:
save_paths <- list(
  all_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs2_all_genes_sig_non_sig.txt",
  protein_coding_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs2_protein_coding_genes_sig_not_sig.txt"
)

# Call the function:
analyze_gene_data(
  result_main_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_2.rds",
  eqtl_long_covid_exposure_path = "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt",
  save_files = save_paths
)

Saved all_genes as all_genes to /content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs2_all_genes_sig_non_sig.txt 
Dimensions of all_genes : 1624 10 

          gene_id gene_name      gene_type     beta_y Obs SNPs
1 ENSG00000002016     RAD52 protein_coding   1.193051  23    8
2 ENSG00000005156      LIG3 protein_coding   9.750327  10    3
3 ENSG00000006282   SPATA20 protein_coding 107.342074  46    2
4 ENSG00000006744     ELAC2 protein_coding  -1.126989  18    4
5 ENSG00000007171      NOS2 protein_coding  -9.295643  31    5
6 ENSG00000007341      ST7L protein_coding -15.155748  12    1
                                                                               SNP_IDs
1 rs11064536,rs11571383,rs117766797,rs147560086,rs2107614,rs4980935,rs718389,rs7955986
2                                                     rs12945428,rs201016783,rs3135958
3                                                                  rs2306001,rs9890200
4                                             rs1044564,rs116559

In [None]:
%%R

# FS2 final table:
# Call the function that retrieves more information from the Mt_Robin results and filters the significant genes:
process_final_data(input_path_1 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_2_no_errors.rds",
                   input_path_2 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs2_protein_coding_genes_sig_not_sig.txt",
                   input_path_3 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_fs2.txt",
                   output_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs2_protein_coding_genes_sig_full_table.txt")

[1] "Causal genes (noncoding genes and protein coding genes): "
[1] 1624
$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: -27.2451
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 24.597  
 Residual         2.247  
Number of obs: 23, groups:  snpID, 8
Fixed Effects:
beta_y  
 1.193  

$gwas_res
        snpID    gwas_beta   gwas_se
1  rs11064536  0.005801520 0.0281981
2  rs11571383 -0.045787500 0.0569715
3 rs117766797 -0.016338200 0.0628533
4 rs147560086  0.021120800 0.0258757
5   rs2107614 -0.000839245 0.0220439
6   rs4980935 -0.015072100 0.0229060
7    rs718389 -0.017204900 0.0234662
8   rs7955986 -0.006508970 0.0206358

$LD
            rs147560086   rs7955986    rs718389  rs11571383   rs2107614
rs147560086  1.00000000  0.57848413 -0.19239180  0.66694928 -0.29808391
rs7955986    0.57848413  1.00000000 -0.35065828  0.41826699 -0.19482499
rs718389    -0.19239180 -0.3506582

In [None]:
%%R

# Call the last function to retrieve the tissue information and categories for fs2:
input_eqtl_path <- "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt"
input_mt_robin_path <- "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs2_protein_coding_genes_sig_full_table.txt"
output_path <- "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs2_protein_coding_genes_sig_full_table_tissues.txt"
process_eqtl_mt_robin(input_eqtl_path, input_mt_robin_path, output_path)

           gene_id gene_name SNPs
1  ENSG00000068097    HEATR6    9
2  ENSG00000089063   TMEM230    6
3  ENSG00000090432      MUL1    5
4  ENSG00000090661     CERS4   15
5  ENSG00000107186      MPDZ   18
6  ENSG00000108439      PNPO   15
7  ENSG00000127399    LRRC61    2
8  ENSG00000139714     MORN3   18
9  ENSG00000141076      UTP4    3
10 ENSG00000142156    COL6A1   22
11 ENSG00000147576    ADHFE1    8
12 ENSG00000149089      APIP    4
13 ENSG00000163161     ERCC3    5
14 ENSG00000177082     WDR73   13
15 ENSG00000184361   SPATA32    8
16 ENSG00000197134    ZNF257    5
17 ENSG00000205078    SYCE1L    3
18 ENSG00000212127   TAS2R14    5
19 ENSG00000223510    CDRT15    3
                                                                                                                                                                                                                                      SNPs_IDs
1                                                                                

### **GWAS3**

In [None]:
library(matrixStats) # It needs to be installed separately
#library(BiocManager) # It needs to be installed separately
library(snpStats) # It needs to be installed separately
library(utils) # R package installed by default
library(parallel) # R package installed by default

# Step 1: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 2: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_3_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 3: Open the genes input data set:
all_genes <- read.table("/data/all_genes_110.txt",
                        header = TRUE,
                        sep = "",
                        stringsAsFactors = FALSE)

# Step 4: Convert the previous list of genes into a vector:
all_genes_vector <- all_genes$gene_id

# Step 5: Save the paths of the GTEx files:
gtex_covid_bed <- "/data/WGS_GTEx_long_covid_rsID.bed"
gtex_covid_bim <- "/data/WGS_GTEx_long_covid_rsID.bim"
gtex_covid_fam <- "/data/WGS_GTEx_long_covid_rsID.fam"

# Step 6: Create a variable for the path to the directory to save the LD matrix results:
ld_matrix_dir <- "/data/ld_matrix_fs_3"

# Step 7: Create the directory if it doesn't exist:
if (!dir.exists(ld_matrix_dir)) {
  dir.create(ld_matrix_dir)
}

# Step 8: Modify the original select_IV_fs() function to accept NAs from the eQTL and calculate the LD matrix inside:
# select_IV_fs() function --> Starts with no variables and adds one at a time:
select_IV_fs <- function(geneID,
                         eqtl_data,
                         nTiss,
                         bed_file,
                         bim_file,
                         fam_file,
                         ld_matrix_path,
                         ld_thresh=0.5,
                         pval_thresh=0.001,
                         nTiss_thresh=3){

  ## Subset eQTL dataset to gene of interest:
  if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
  eqtl_data <- subset(eqtl_data, gene_id == geneID)

  # pval_mat <- as.matrix(eqtl_data[,paste0("pvalue_",1:nTiss)])
  pval_mat <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  eqtl_data$min_p <- matrixStats::rowMins(pval_mat, na.rm = TRUE) # Mod1: I added na.rm = TRUE
  eqtl_data$nTissThresh_p <- apply(pval_mat, 1, function(x) sort(x)[nTiss_thresh])
  eqtl_data <- subset(eqtl_data, nTissThresh_p < pval_thresh)

  ## Ensure same sign for tissues with p < pval_thresh:
  betas <- as.matrix(subset(eqtl_data,select=paste0("beta_",1:nTiss)))
  pvals <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  pvals_sig <- pvals < pval_thresh # Use the function argument instead of a hard-coded 0.001
  betas_sig <- betas*pvals_sig
  #sig_neg_eQTL <- as.numeric(apply(betas_sig,1,function(x) any(x < -1e-15)))
  #sig_pos_eQTL <- as.numeric(apply(betas_sig,1,function(x) any(x > 1e-15)))
  sig_neg_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x < -1e-15)))) # Mod2: I added is.na()
  sig_pos_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x > 1e-15)))) # Mod3: I added is.na()
  eqtl_data$pos_sig_eQTL <- sig_pos_eQTL
  eqtl_data$neg_sig_eQTL <- sig_neg_eQTL
  eqtl_data <- subset(eqtl_data, !(pos_sig_eQTL==1 & neg_sig_eQTL==1))

  if(nrow(eqtl_data)==0) stop(paste0("No cis-SNPs for gene with p<",pval_thresh," in at least ",nTiss_thresh," tissues."))

  ## Obtain a set of candidate SNPs:
  SNP_pool <- eqtl_data$variant_id
  if(length(SNP_pool)==1){
    return(SNP_pool)
  }

  # Calculate the LD matrix:
  snp_data <- read.plink(bed_file, bim_file, fam_file)
  snp_matrix <- snp_data$genotypes
  numeric_matrix <- as(snp_matrix, "numeric")
  # Restrict LD matrix to SNPs in the data set:
  df_subset <- numeric_matrix[, colnames(numeric_matrix) %in% SNP_pool, drop = FALSE] # drop = FALSE keeps a matrix even if only one SNP matches
  LD <- cor(df_subset, use="pairwise.complete")
  LD_r2 <- LD^2
  # Export a .txt file of the LD matrix:
  utils::write.table(LD,
                     file = ld_matrix_path,
                     sep = " ",
                     row.names = TRUE,
                     col.names = TRUE,
                     quote = FALSE)

  ## Select the best SNPs remaining in SNP_pool based on minimum p-value:
  selected_snps <- NULL
  while(length(SNP_pool) > 1){

    ## Identify top SNPs in pool:
    eqtl_data <- subset(eqtl_data, variant_id %in% SNP_pool)
    next_snp_idx <- which.min(eqtl_data$min_p)
    next_snp <- eqtl_data$variant_id[next_snp_idx]
    selected_snps <-c(selected_snps,next_snp)

    ## Identify SNPs not in high LD with top SNPs in the pool:
    LD_idx <- which(row.names(LD_r2)==next_snp)
    LD_keep_idx <- which(LD_r2[LD_idx,] < ld_thresh)
    LD_r2 <- LD_r2[LD_keep_idx,LD_keep_idx]
    SNP_pool <- names(LD_keep_idx)
  }
  selected_snps <- c(selected_snps, SNP_pool)
  return(selected_snps)
}

# Step 9: Define the number of cores to use for parallel processing:
num_cores <- 48

# Step 10: Apply the function to each gene id:
result_list <- mclapply(all_genes_vector, function(geneID) {
  # Generate a unique file path for the LD matrix of this gene
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  tryCatch({
    select_IV_fs(geneID = geneID,
                 eqtl_data = eqtl_covid_exposure,
                 nTiss = 49,
                 bed_file = gtex_covid_bed,
                 bim_file = gtex_covid_bim,
                 fam_file = gtex_covid_fam,
                 ld_matrix_path = ld_matrix_path,
                 ld_thresh = 0.5,
                 pval_thresh = 0.001,
                 nTiss_thresh = 1)
  }, error = function(e) print(e))
}, mc.cores = num_cores)

# Step 11: Name the elements of the list with corresponding gene ids:
names(result_list) <- all_genes_vector

# Step 12: Save the previous result as a .rds file:
saveRDS(result_list, file = "/data/long_covid_IVs_fs_3.rds")
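
The `while` loop in `select_IV_fs()` is a greedy LD-pruning pass: pick the SNP with the smallest `min_p`, keep it, drop every SNP whose r² with it is at or above `ld_thresh`, and repeat on what is left. The Python sketch below illustrates that idea on made-up data. The SNP names, p-values and r² values are hypothetical, and `greedy_ld_prune` is an illustrative helper, not a function from MR-MtRobin.

```python
def greedy_ld_prune(min_p, r2, ld_thresh=0.5):
    """Greedy LD pruning on a toy data set.

    min_p: dict mapping SNP id -> smallest p-value across tissues.
    r2:    dict mapping frozenset({snp_a, snp_b}) -> squared correlation.
    Returns the selected SNP ids in the order they were picked.
    """
    pool = set(min_p)
    selected = []
    while pool:
        # Pick the most significant SNP still in the pool
        # (ties are broken by SNP id so the result is deterministic).
        top = min(pool, key=lambda s: (min_p[s], s))
        selected.append(top)
        # Keep only SNPs whose r2 with the picked SNP is below the threshold.
        pool = {s for s in pool
                if s != top and r2[frozenset((top, s))] < ld_thresh}
    return selected


# Hypothetical example: four SNPs with pairwise r2 values.
min_p = {"rsA": 1e-8, "rsB": 1e-6, "rsC": 1e-5, "rsD": 1e-4}
pairs = {
    ("rsA", "rsB"): 0.9, ("rsA", "rsC"): 0.1, ("rsA", "rsD"): 0.2,
    ("rsB", "rsC"): 0.3, ("rsB", "rsD"): 0.1, ("rsC", "rsD"): 0.7,
}
r2 = {frozenset(pair): value for pair, value in pairs.items()}

print(greedy_ld_prune(min_p, r2, ld_thresh=0.5))
```

Here `rsB` is removed because it is in high LD with `rsA`, and `rsD` is removed because it is in high LD with `rsC`. A higher `ld_thresh` keeps more correlated instruments.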

In [None]:
#!/bin/bash
# The shebang above must be the first line of the script; it specifies the interpreter to be used.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=30:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives must precede the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write start and end timestamps to a per-job file (the job ID keeps the file name unique):
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

# Change to the directory from which the job was submitted:
cd $PBS_O_WORKDIR

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup.sif Rscript /data/select_function_fs_3.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-24 04:50:19:
   Job Id:             93353706.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      798.52
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 257:47:58
   Memory Requested:   500.0GB               Memory Used: 393.04GB
   Walltime requested: 30:00:00            Walltime Used: 05:32:43
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================
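
The resource usage report above can be turned into a single CPU-efficiency figure: CPU time used divided by walltime used times the number of CPUs. The sketch below is illustrative only (the helper names are hypothetical) and uses the values printed in the report. It can help decide how many cores and how much walltime to request next time.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time, walltime, ncpus):
    """Fraction of the reserved CPU-seconds that was actually used."""
    return hms_to_seconds(cpu_time) / (hms_to_seconds(walltime) * ncpus)


# Values copied from the report above.
efficiency = cpu_efficiency("257:47:58", "05:32:43", 48)
print(f"CPU efficiency: {efficiency:.1%}")
```

A value close to 1 means that nearly all reserved cores were busy for the whole run.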

In [None]:
%%R
# Open the output file:
result_selectIV_fs_3 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_IVs_fs_3.rds")

# Filter out errors:
result_selectIV_fs_3 <- result_selectIV_fs_3[!sapply(result_selectIV_fs_3, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of unique SNP sets, and the first few elements:
print(length(result_selectIV_fs_3))
print(length(unique(result_selectIV_fs_3)))
print(head(result_selectIV_fs_3))

[1] 38574
[1] 38480
$ENSG00000000003
 [1] "rs8780"      "rs1204407"   "rs189980168" "rs4145090"   "rs139645081"
 [6] "rs112002620" "rs113922837" "rs17329105"  "rs193124579" "rs146639579"
[11] "rs201585484" "rs111478058" "rs143491698" "rs572714672" "rs56100956" 
[16] "rs147601806" "rs139830751" "rs12842251"  "rs73630694" 

$ENSG00000000005
 [1] "rs6616166"   "rs4828038"   "rs12006573"  "rs73557922"  "rs141394856"
 [6] "rs140468798" "rs73555045"  "rs5921738"   "rs139231094" "rs59368006" 
[11] "rs111915382" "rs73562536"  "rs12832642"  "rs201254498" "rs186611188"
[16] "rs113422802" "rs796285800" "rs180795909" "rs1004006"   "rs148617015"
[21] "rs140109110"

$ENSG00000000419
 [1] "rs1292040204" "rs79562553"   "rs4809840"    "rs35201621"   "rs35530561"  
 [6] "rs141892692"  "rs944933"     "rs11483800"   "rs113036491"  "rs116892696" 
[11] "rs34483216"   "rs60578213"   "rs118181672"  "rs4811253"    "rs11086363"  
[16] "rs73278571"   "rs73131221"   "rs78319058"   "rs79223715"   "rs189417964" 
[2

In [None]:
library(parallel) # R package installed by default

# Step 1: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 2: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_3_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 3: Open the list of genes-SNPs:
result_selectedIVs <- readRDS("/data/long_covid_IVs_fs_3.rds")

# Step 4: Create a variable for the path to the LD matrix directory:
ld_matrix_dir <- "/data/ld_matrix_fs_3"

# Step 5: Setup the Mt_Robin method using the following setup function:
MR_MtRobin_setup <- function(geneID,
                             snpID,
                             eqtl_data,
                             gwas_data,
                             ld_matrix_path,
                             nTiss){
  tryCatch({
    ## Check if the gene is present and subset
    if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
    eqtl_data <- subset(eqtl_data, gene_id == geneID)

    ## Confirm all SNPs are present in the dataset
    if(!all(snpID %in% eqtl_data$variant_id)) stop("Some SNPs missing in eQTL dataset")
    if(!all(snpID %in% gwas_data$variant_id)) stop("Some SNPs missing in GWAS dataset")

    # Read the LD matrix from a file
    LD <- as.matrix(utils::read.table(ld_matrix_path,
                                      header = TRUE,
                                      check.names = FALSE))

    if(!all(snpID %in% colnames(LD))) stop("Some SNPs missing in LD matrix")

    ## Align data
    eqtl_data <- eqtl_data[match(snpID,eqtl_data$variant_id),]
    gwas_data <- gwas_data[match(snpID,gwas_data$variant_id),]
    LD <- LD[match(snpID,rownames(LD)),match(snpID,colnames(LD))]

    eqtl_betas <- as.matrix(eqtl_data[,paste0("beta_",1:nTiss)])
    eqtl_se <- as.matrix(eqtl_data[,paste0("SE_",1:nTiss)])
    eqtl_pvals <- as.matrix(eqtl_data[,paste0("pvalue_",1:nTiss)])
    gwas_betas <- gwas_data$beta
    gwas_se <- gwas_data$SE

    row.names(eqtl_betas) <- row.names(eqtl_se) <- row.names(eqtl_pvals) <- eqtl_data$variant_id

    return(list(snpID=snpID,gwas_betas=gwas_betas,gwas_se=gwas_se,
                eqtl_betas=eqtl_betas,eqtl_se=eqtl_se,eqtl_pvals=eqtl_pvals,LD=LD))
  }, error = function(e) print(e))
}

# Step 6: Apply the previous setup function:

# Step 6.1: Initialize an empty list to store the results:
MR_MtRobin_results <- list()

# Step 6.2: Define the number of cores to use for parallel processing:
num_cores <- 36

# Step 6.3: Loop over each gene in result_selectedIVs:
MR_MtRobin_results <- mclapply(names(result_selectedIVs), function(geneID) {

  # Step 6.3.1: Get the SNPs associated with each gene:
  snpID <- result_selectedIVs[[geneID]]

  # Step 6.3.2: Generate a unique file path for the LD matrix of this gene:
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  # Step 6.3.3: Call the MR_MtRobin_setup() function with the parameters for each gene:
  return(tryCatch({
    MR_MtRobin_setup(geneID = geneID,
                     snpID = snpID,
                     eqtl_data = eqtl_covid_exposure,
                     gwas_data = gwas_covid_outcome,
                     ld_matrix_path = ld_matrix_path,
                     nTiss = 49)
  }, error = function(e) print(e)))
}, mc.cores = num_cores)

# Step 7: Assign names to the list:
names(MR_MtRobin_results) <- names(result_selectedIVs)

# Step 8: Save the previous result as a .rds file:
saveRDS(MR_MtRobin_results, file = "/data/long_covid_setup_fs_3.rds")
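
The alignment step in `MR_MtRobin_setup()` relies on `match()` to put the eQTL rows, the GWAS rows and the LD rows and columns in the same SNP order. The Python sketch below shows the same idea on hypothetical records. `align_to_snp_order` is an illustrative helper and the values are made up.

```python
def align_to_snp_order(records, snp_ids, key="variant_id"):
    """Return one record per SNP in snp_ids, in that order.

    Raises KeyError if a requested SNP is absent from records.
    """
    by_id = {}
    for rec in records:
        # Keep the first occurrence of each SNP, like match() in R.
        by_id.setdefault(rec[key], rec)
    return [by_id[snp] for snp in snp_ids]


# Hypothetical GWAS rows in arbitrary order.
gwas_rows = [
    {"variant_id": "rs3", "beta": 0.3},
    {"variant_id": "rs1", "beta": 0.1},
    {"variant_id": "rs2", "beta": 0.2},
]
snp_ids = ["rs1", "rs2", "rs3"]

aligned = align_to_snp_order(gwas_rows, snp_ids)
print([row["beta"] for row in aligned])
```

Once every table follows `snp_ids`, element `i` of the GWAS vectors, row `i` of the eQTL matrices and row and column `i` of the LD matrix all refer to the same SNP.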

In [None]:
#!/bin/bash
# The shebang above must be the first line of the script; it specifies the interpreter to be used.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives must precede the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write start and end timestamps to a per-job file (the job ID keeps the file name unique):
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

# Change to the directory from which the job was submitted:
cd $PBS_O_WORKDIR

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup.sif Rscript /data/setup_function_fs_3.R

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-24 16:29:35:
   Job Id:             93430395.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      73.56
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 19:57:44
   Memory Requested:   300.0GB               Memory Used: 81.99GB
   Walltime requested: 10:00:00            Walltime Used: 00:40:52
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R
# Open the output file:
result_setup_fs_3 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_setup_fs_3.rds")

# Filter out errors:
result_setup_fs_3 <- result_setup_fs_3[!sapply(result_setup_fs_3, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of unique results, and the first element:
print(length(result_setup_fs_3))
print(length(unique(result_setup_fs_3)))
print(result_setup_fs_3[[1]])

[1] 5167
[1] 5167
$snpID
[1] "rs147560086" "rs7955986"   "rs718389"    "rs11571383"  "rs2107614"  
[6] "rs4980935"   "rs11064536"  "rs117766797"

$gwas_betas
[1]  0.04747050 -0.03200370 -0.06199730 -0.03928230 -0.02811250 -0.00631512
[7]  0.07077150 -0.09468080

$gwas_se
[1] 0.0395567 0.0324808 0.0403831 0.0700993 0.0339167 0.0413246 0.0437185
[8] 0.1211550

$eqtl_betas
            beta_1 beta_2    beta_3   beta_4 beta_5    beta_6    beta_7 beta_8
rs147560086     NA     NA -0.332722 -0.36379     NA -0.428724        NA     NA
rs7955986       NA     NA        NA       NA     NA        NA        NA     NA
rs718389        NA     NA        NA       NA     NA        NA        NA     NA
rs11571383      NA     NA        NA       NA     NA        NA -0.525715     NA
rs2107614       NA     NA        NA       NA     NA        NA        NA     NA
rs4980935       NA     NA        NA       NA     NA        NA        NA     NA
rs11064536      NA     NA        NA       NA     NA        NA        NA   

In [None]:
# Step 1: Load the required libraries:
library(base) # R package installed by default
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(Matrix) # R package installed by default
library(methods) # R package installed by default
library(stats) # R package installed by default
library(lme4) # It needs to be installed separately
library(data.table) # It needs to be installed separately

# Step 2: Main Mt_Robin function:
MR_MtRobin <- function(MR_MtRobin_input,pval_thresh=0.001){
  snpID <- MR_MtRobin_input$snpID
  gwas_betas <- MR_MtRobin_input$gwas_betas
  gwas_se <- MR_MtRobin_input$gwas_se
  eqtl_betas <- MR_MtRobin_input$eqtl_betas
  eqtl_se <- MR_MtRobin_input$eqtl_se
  eqtl_pvals <- MR_MtRobin_input$eqtl_pvals
  LD <- MR_MtRobin_input$LD

  if(ncol(eqtl_betas) != ncol(eqtl_se) | ncol(eqtl_betas) != ncol(eqtl_pvals)){
    stop("eQTL betas, SE and p-values do not all have same number of columns")
  }

  ## determine number of tissues (or conditions/studies/etc.)
  nT <- ncol(eqtl_betas)

  ## rename columns for standardized processing
  colnames(eqtl_betas) <- paste0("beta_",1:nT)
  colnames(eqtl_se) <- paste0("SE_",1:nT)
  colnames(eqtl_pvals) <- paste0("pvalue_",1:nT)

  ## combine eQTL results into data.table
  eqtl_res <- data.table::data.table(snpID,eqtl_betas,eqtl_se,eqtl_pvals)

  # reshape to long format (each SNP-gene-tissue will be row)
  eQTL_res_melt <- data.table::melt(eqtl_res,id.vars=c("snpID"),
                                    measure.vars=patterns("^beta_","^SE_","^pvalue_"),value.name=c("beta","se","pvalue"),
                                    variable.name="tissue")

  ## subset to strong instrumental variables (based on p-value threshold)
  eQTL_res_melt_PltThresh <- subset(eQTL_res_melt, pvalue < pval_thresh)

  ## set up gwas data for merging
  gwas_res <- data.table::data.table(snpID=snpID,gwas_beta=gwas_betas,gwas_se=gwas_se)

  ## merge eQTL and GWAS results
  data.table::setkey(eQTL_res_melt_PltThresh,snpID)
  data.table::setkey(gwas_res,snpID)
  merged_res <- data.table::merge.data.table(eQTL_res_melt_PltThresh,gwas_res)

  ## set up coefficients for reverse regression
  beta_x <- matrix(merged_res$beta,ncol=1)
  beta_y <- matrix(merged_res$gwas_beta,ncol=1)

  ## standard errors (for weights)
  se_x <- matrix(merged_res$se,ncol=1)
  se_y <- matrix(merged_res$gwas_se,ncol=1) ## not used by function; used in resampling (consider returning with results)

  ## identifiers
  snpID <- merged_res$snpID

  lme_res <- lme4::lmer(beta_x~(beta_y-1)+(beta_y-1|snpID),weights=1/se_x^2)

  ## return results from weighted regression with random slopes (and correlated errors) along with GWAS SE and LD (needed for resampling)
  return(list(lme_res=lme_res,gwas_res=gwas_res,LD=LD))
}

# Step 3: Read the previous file:
result_setup <- readRDS("/data/long_covid_setup_fs_3.rds")

# Step 4: Apply the main Mt_Robin function using parallel processing:
no_cores <- 36 # Match the ncpus value requested in the PBS script

cl <- makeCluster(no_cores) # Create a cluster with the number of cores
clusterExport(cl, "MR_MtRobin") # Export the MR_MtRobin function to all the nodes of the cluster
result_list_MtRobin <- parLapply(cl, result_setup, function(MR_MtRobin_input) {
  tryCatch({
    MR_MtRobin(MR_MtRobin_input = MR_MtRobin_input)
  }, error = function(e) print(e))
})
stopCluster(cl) # Stop the cluster once the computation is done

# Step 5: The names of the results will correspond to the gene ids:
names(result_list_MtRobin) <- names(result_setup)

# Step 6: Save the previous result as a .rds file:
saveRDS(result_list_MtRobin, file = "/data/long_covid_main_fs_3.rds")
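
The fixed-effect part of the `lmer()` call in `MR_MtRobin()` is a weighted regression of the eQTL effects (`beta_x`) on the GWAS effects (`beta_y`) with no intercept and weights `1/se_x^2`. The Python sketch below computes only that weighted slope through the origin. It is a simplification: the per-SNP random slope of the mixed model is ignored, so the number will generally differ from the `lmer()` estimate. The input values are made up.

```python
def weighted_slope_through_origin(beta_x, beta_y, se_x):
    """Weighted least-squares slope of beta_x on beta_y with no intercept.

    Weights are 1 / se_x**2, as in the lmer() call. The per-SNP random
    slope of the mixed model is deliberately left out of this sketch.
    """
    weights = [1.0 / s ** 2 for s in se_x]
    numerator = sum(w * x * y for w, x, y in zip(weights, beta_x, beta_y))
    denominator = sum(w * y * y for w, y in zip(weights, beta_y))
    return numerator / denominator


# Hypothetical SNP-tissue pairs: the second one is measured more precisely.
beta_y = [1.0, 1.0]   # GWAS effects
beta_x = [1.0, 3.0]   # eQTL effects
se_x = [1.0, 0.5]     # eQTL standard errors -> weights 1 and 4

slope = weighted_slope_through_origin(beta_x, beta_y, se_x)
print(slope)
```

The more precise pair (weight 4) pulls the slope towards its own ratio of 3, which is why the result lies above the unweighted average of 2.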

In [None]:
#!/bin/bash
# The shebang above must be the first line of the script; it specifies the interpreter to be used.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives must precede the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write start and end timestamps to a per-job file (the job ID keeps the file name unique):
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

# Change to the directory from which the job was submitted:
cd $PBS_O_WORKDIR

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/main_function.sif Rscript /data/main_function_fs_3.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-24 17:34:46:
   Job Id:             93443498.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      2.73
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 00:02:59
   Memory Requested:   300.0GB               Memory Used: 9.72GB
   Walltime requested: 10:00:00            Walltime Used: 00:01:31
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R
# Open the output file:
result_main_fs_3 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_3.rds")

# Filter out errors:
result_main_fs_3 <- result_main_fs_3[!sapply(result_main_fs_3, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of unique results, and the first few elements:
print(length(result_main_fs_3))
print(length(unique(result_main_fs_3)))
print(head(result_main_fs_3))

[1] 1796
[1] 1796
$ENSG00000002016
$ENSG00000002016$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: -28.894
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 8.267   
 Residual        1.941   
Number of obs: 23, groups:  snpID, 8
Fixed Effects:
beta_y  
 1.655  

$ENSG00000002016$gwas_res
        snpID   gwas_beta   gwas_se
1  rs11064536  0.07077150 0.0437185
2  rs11571383 -0.03928230 0.0700993
3 rs117766797 -0.09468080 0.1211550
4 rs147560086  0.04747050 0.0395567
5   rs2107614 -0.02811250 0.0339167
6   rs4980935 -0.00631512 0.0413246
7    rs718389 -0.06199730 0.0403831
8   rs7955986 -0.03200370 0.0324808

$ENSG00000002016$LD
            rs147560086   rs7955986    rs718389  rs11571383   rs2107614
rs147560086  1.00000000  0.57848413 -0.19239180  0.66694928 -0.29808391
rs7955986    0.57848413  1.00000000 -0.35065828  0.41826699 -0.19482499
rs718389    -0.19239180 -0.3506582

In [None]:
########################################################################
# GADI:

# Obtain p-values from the MR-MtRobin results:
# Uses a resampling procedure to estimate the p-value of each MR-MtRobin fit.
# Arguments of the adapted MR_MtRobin_resample() defined below:
#   results_1_path: path to an .rds file with a named list of objects returned by MR_MtRobin().
#   nsamp: integer, number of samples drawn from the null distribution to estimate the p-value.
#   use_nonconverge: logical, whether to use samples resulting in non-convergence of the random slope.
#   output_file_path: path of the tab-separated table of results to be written.
#   num_cpus: integer, number of cores passed to mclapply().
# Returns a named list with one entry per gene, each holding:
#   pvalue: numeric, the estimated p-value.
#   nsamp_used: integer, the number of samples used in estimating the p-value.
#   fdr: numeric, the Benjamini-Hochberg adjusted p-value.
# nsamp_used is returned because not all nsamp samples may be used:
# in this adaptation a sample is dropped when lmer() raises an error for it.
# Original package usage, kept for reference:
# myRes <- MR_MtRobin(snpID,gwas_betas,gwas_se,eqtl_betas,eqtl_se,eqtl_pvals,LD)
# MR_MtRobin_resample(MR_MtRobin_res=myRes, nsamp=1e4, use_nonconverge=FALSE)

# Load packages:
library(lme4)
library(mvtnorm)
library(data.table)
library(parallel)

# Load main results:
results_1 <- readRDS("/scratch/sq95/sp6154/p_values/long_covid_main_fs_3.rds")
# Filter out errors:
results_1 <- results_1[!sapply(results_1, function(x) inherits(x, "simpleError"))]
# Export:
saveRDS(results_1, "/scratch/sq95/sp6154/p_values/long_covid_main_fs_3_no_errors.rds")

# Updated Function to calculate the p-value and FDR with parallel processing:
MR_MtRobin_resample <- function(results_1_path, nsamp=1e4, use_nonconverge=FALSE, output_file_path, num_cpus=1){
  # Check and load the results_1 dataset from the specified path:
  if(file.exists(results_1_path)){
    results_1 <- readRDS(results_1_path)
  } else {
    stop("The specified file path for results_1 does not exist.")
  }
  # Initialize a list to store results for each gene:
  pvalues_results <- list()
  # Loop over each gene:
  for(gene_id in names(results_1)) {
    # Check if necessary components are present and not NULL:
    if(is.null(results_1[[gene_id]][["lme_res"]]) || is.null(results_1[[gene_id]][["gwas_res"]]) || is.null(results_1[[gene_id]][["LD"]])) {
      next
    }
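    # Extract the fitted model, the GWAS summary statistics and the LD matrix for this gene:
    lme_res <- results_1[[gene_id]][["lme_res"]]
    gwas_res <- results_1[[gene_id]][["gwas_res"]]
    LD <- results_1[[gene_id]][["LD"]]
    # gwas_res was keyed (sorted) by snpID in MR_MtRobin(), whereas LD keeps the original SNP order,
    # so reorder LD to match gwas_res before building the null covariance:
    snp_order <- as.character(gwas_res$snpID)
    if(!is.null(rownames(LD))) LD <- LD[snp_order, snp_order, drop = FALSE]
    # Extract data from MR-MtRobin results:
    eqtl_betas <- lme_res@frame$beta_x
    snpID <- lme_res@frame$snpID
    weights <- lme_res@frame[["(weights)"]] # The model frame stores the weights under the name "(weights)"
    tstat_MR_MtRobin <- summary(lme_res)$coefficients[1,3]
    gwas_se <- gwas_res$gwas_se
    # Simulate GWAS effect sizes under the null (mean zero), accounting for LD correlations:
    beta_gwas_nullMat <- mvtnorm::rmvnorm(nsamp, mean=rep(0, length(gwas_se)), sigma=diag(gwas_se) %*% LD %*% diag(gwas_se))
    colnames(beta_gwas_nullMat) <- gwas_res$snpID
    # Refit the model on each null draw and collect the t-statistics: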
    tstat_nulls <- mclapply(1:nsamp, function(i) {
      beta_gwas_null <- beta_gwas_nullMat[i, snpID]
      tryCatch({
        # LMM equation:
        lme_null <- lme4::lmer(eqtl_betas~(beta_gwas_null-1)+(beta_gwas_null-1|snpID), weights=weights)
        return(summary(lme_null)$coefficients[1,3])
      }, error=function(e) {
        if(!use_nonconverge) return(NA)
      })
    }, mc.cores = num_cpus)
    tstat_nulls <- unlist(tstat_nulls)
    tstat_nulls <- tstat_nulls[!is.na(tstat_nulls)] # Remove NAs
    nsamp_used <- length(tstat_nulls)
    # Calculate the p-value and round it to 3 decimal places:
    pval <- round(mean(abs(tstat_nulls) >= abs(tstat_MR_MtRobin)), digits = 3)
    # Store the results for the current gene in the list:
    pvalues_results[[gene_id]] <- list(pvalue=pval, nsamp_used=nsamp_used)
  }
  # Calculate FDR:
  all_pvalues <- sapply(pvalues_results, function(x) x$pvalue)
  fdr_values <- p.adjust(all_pvalues, method = "BH")
  # Combine results with FDR values:
  for(i in seq_along(fdr_values)) {
    pvalues_results[[names(fdr_values)[i]]]$fdr <- fdr_values[i]
  }
  # Prepare results dataframe:
  results_df <- do.call(rbind, lapply(names(pvalues_results), function(gene_id) {
    data.frame(gene_id = gene_id,
               pvalue = sprintf("%.3f", pvalues_results[[gene_id]]$pvalue), # Format pvalue with 3 decimal places
               nsamp_used = pvalues_results[[gene_id]]$nsamp_used,
               fdr = sprintf("%.15f", pvalues_results[[gene_id]]$fdr), # Keep full precision for fdr
               stringsAsFactors = FALSE)
  }))
  # Write the results to the output file, ensuring the formatting is kept
  write.table(results_df, file = output_file_path, sep = "\t", row.names = FALSE, quote = FALSE, col.names = TRUE)
  return(pvalues_results)
}

# Call the previous function:
MR_MtRobin_resample(
  results_1_path = "/scratch/sq95/sp6154/p_values/long_covid_main_fs_3_no_errors.rds",
  nsamp = 10000, # Number of bootstrap samples
  use_nonconverge = FALSE, # Whether to use non-converging results, typically FALSE
  output_file_path = "/scratch/sq95/sp6154/p_values/MR_pvalue_FDR_results_fs3.txt",
  num_cpus = 48 # Number of CPUs
)
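
The last two steps of `MR_MtRobin_resample()` are an empirical two-sided p-value (the share of null t-statistics at least as extreme as the observed one) and a Benjamini-Hochberg adjustment across genes, as done by `p.adjust(..., method = "BH")`. The Python sketch below illustrates both steps on made-up numbers. The helper names are hypothetical and `None` stands in for a dropped sample.

```python
def empirical_pvalue(t_obs, t_nulls):
    """Two-sided empirical p-value and the number of null samples used."""
    used = [t for t in t_nulls if t is not None]  # drop failed fits
    hits = sum(abs(t) >= abs(t_obs) for t in used)
    return hits / len(used), len(used)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Walk from the largest p-value to the smallest, keeping a running minimum.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = n - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical observed statistic and null draws for one gene.
pval, nsamp_used = empirical_pvalue(2.0, [0.5, -2.5, 1.0, None, 3.0])
print(pval, nsamp_used)

# Hypothetical p-values for four genes.
fdr = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(fdr)
```

With `nsamp` null draws the smallest non-zero p-value is `1/nsamp`, so the resolution of the FDR values depends on how many samples are drawn and kept.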

In [None]:
#!/bin/bash
# The shebang above must be the first line of the script; it specifies the interpreter to be used.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives must precede the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write start and end timestamps to a per-job file (the job ID keeps the file name unique):
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

# Change to the directory from which the job was submitted:
cd $PBS_O_WORKDIR

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Option 1: Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/p_values/p_values.sif Rscript /data/p_values_fs_3.R

# Option 2: Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly:
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2024-02-07 20:15:36:
   Job Id:             107707388.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      692.00
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 139:26:08
   Memory Requested:   500.0GB               Memory Used: 114.15GB
   Walltime requested: 10:00:00            Walltime Used: 04:48:20
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R

# Read the results with the p_value and FDR values:
results <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_fs3.txt", header=TRUE, sep="\t")
print(dim(results))
print(head(results))

[1] 1796    4
          gene_id pvalue nsamp_used       fdr
1 ENSG00000002016  0.628      10000 0.9095871
2 ENSG00000005156  0.165      10000 0.6704525
3 ENSG00000006282  0.365      10000 0.8204506
4 ENSG00000006744  0.181      10000 0.6901826
5 ENSG00000007171  0.289      10000 0.7645310
6 ENSG00000007341  0.232      10000 0.7050288


In [None]:
%%R

# Call the analyze_gene_data() function to organize the Mt_Robin results for significant and non-significant genes:

# Create the output paths:
save_paths <- list(
  all_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_all_genes_sig_non_sig.txt",
  protein_coding_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_protein_coding_genes_sig_not_sig.txt"
)

# Call the function:
analyze_gene_data(
  result_main_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_3.rds",
  eqtl_long_covid_exposure_path = "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt",
  save_files = save_paths
)

Saved all_genes as all_genes to /content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_all_genes_sig_non_sig.txt 
Dimensions of all_genes : 1796 10 

          gene_id gene_name      gene_type     beta_y Obs SNPs
1 ENSG00000002016     RAD52 protein_coding   1.655462  23    8
2 ENSG00000005156      LIG3 protein_coding   9.792645  10    3
3 ENSG00000006282   SPATA20 protein_coding -18.930839  46    2
4 ENSG00000006744     ELAC2 protein_coding   6.471382  18    4
5 ENSG00000007171      NOS2 protein_coding  29.386142  31    5
6 ENSG00000007341      ST7L protein_coding -71.497400  12    2
                                                                               SNP_IDs
1 rs11064536,rs11571383,rs117766797,rs147560086,rs2107614,rs4980935,rs718389,rs7955986
2                                                     rs12945428,rs201016783,rs3135958
3                                                                  rs2306001,rs9890200
4                                             rs1044564,rs116559

In [None]:
%%R

# FS3 final table:
# Call the function that retrieves additional information from the Mt_Robin results and filters the significant genes:
process_final_data(input_path_1 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_3_no_errors.rds",
                   input_path_2 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_protein_coding_genes_sig_not_sig.txt",
                   input_path_3 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_fs3.txt",
                   output_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_protein_coding_genes_sig_full_table.txt")

[1] "Causal genes (noncoding genes and protein coding genes): "
[1] 1796
$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: -28.894
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 8.267   
 Residual        1.941   
Number of obs: 23, groups:  snpID, 8
Fixed Effects:
beta_y  
 1.655  

$gwas_res
        snpID   gwas_beta   gwas_se
1  rs11064536  0.07077150 0.0437185
2  rs11571383 -0.03928230 0.0700993
3 rs117766797 -0.09468080 0.1211550
4 rs147560086  0.04747050 0.0395567
5   rs2107614 -0.02811250 0.0339167
6   rs4980935 -0.00631512 0.0413246
7    rs718389 -0.06199730 0.0403831
8   rs7955986 -0.03200370 0.0324808

$LD
            rs147560086   rs7955986    rs718389  rs11571383   rs2107614
rs147560086  1.00000000  0.57848413 -0.19239180  0.66694928 -0.29808391
rs7955986    0.57848413  1.00000000 -0.35065828  0.41826699 -0.19482499
rs718389    -0.19239180 -0.35065828  1.00000

In [None]:
########################################################################
# GADI:

# Obtain p-values from the MR-MtRobin results:
# Uses a resampling procedure to estimate the p-value of each MR-MtRobin fit.
# Arguments of the adapted MR_MtRobin_resample() defined below:
#   results_1_path: path to an .rds file with a named list of objects returned by MR_MtRobin().
#   nsamp: integer, number of samples drawn from the null distribution to estimate the p-value.
#   use_nonconverge: logical, whether to use samples resulting in non-convergence of the random slope.
#   output_file_path: path of the tab-separated table of results to be written.
#   num_cpus: integer, number of cores passed to mclapply().
# Returns, per gene:
#   pvalue: numeric, the estimated p-value.
#   nsamp_used: integer, the number of samples used in estimating the p-value.
# nsamp_used is returned because not all nsamp samples may be used:
# in this adaptation a sample is dropped when lmer() raises an error for it.
# Original package usage, kept for reference:
# myRes <- MR_MtRobin(snpID,gwas_betas,gwas_se,eqtl_betas,eqtl_se,eqtl_pvals,LD)
# MR_MtRobin_resample(MR_MtRobin_res=myRes, nsamp=1e4, use_nonconverge=FALSE)

# Load packages:
library(lme4)
library(mvtnorm)
library(data.table)
library(parallel)

# Load main results:
results_1 <- readRDS("/scratch/sq95/sp6154/p_values/long_covid_main_fs_3.rds")
# Filter out errors:
results_1 <- results_1[!sapply(results_1, function(x) inherits(x, "simpleError"))]
# Export:
saveRDS(results_1, "/scratch/sq95/sp6154/p_values/long_covid_main_fs_3_no_errors.rds")

# Updated Function to calculate the p-value and FDR with parallel processing:
MR_MtRobin_resample <- function(results_1_path, nsamp=1e4, use_nonconverge=FALSE, output_file_path, num_cpus=1){
  # Check and load the results_1 dataset from the specified path:
  if(file.exists(results_1_path)){
    results_1 <- readRDS(results_1_path)
  } else {
    stop("The specified file path for results_1 does not exist.")
  }
  # Initialize a list to store results for each gene:
  pvalues_results <- list()
  # Loop over each gene:
  for(gene_id in names(results_1)) {
    # Check if necessary components are present and not NULL:
    if(is.null(results_1[[gene_id]][["lme_res"]]) || is.null(results_1[[gene_id]][["gwas_res"]]) || is.null(results_1[[gene_id]][["LD"]])) {
      next
    }
    # Extract the fitted model, the GWAS results and the LD matrix for this gene:
    lme_res <- results_1[[gene_id]][["lme_res"]]
    gwas_res <- results_1[[gene_id]][["gwas_res"]]
    LD <- results_1[[gene_id]][["LD"]]
    # Extract data from MR-MtRobin results:
    eqtl_betas <- lme_res@frame$beta_x
    snpID <- lme_res@frame$snpID
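    # The model frame stores the weights under the name "(weights)"; `$weights` would return NULL:
    weights <- lme_res@frame[["(weights)"]]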
    tstat_MR_MtRobin <- summary(lme_res)$coefficients[1,3]
    gwas_se <- gwas_res$gwas_se
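    # setkey() in MR_MtRobin() sorts gwas_res by snpID, so reorder LD to the same SNP order:
    LD <- as.matrix(LD)[as.character(gwas_res$snpID), as.character(gwas_res$snpID), drop = FALSE]
    # Resampled null distribution of GWAS effects, accounting for LD correlations:
    se_mat <- diag(gwas_se, nrow = length(gwas_se)) # nrow keeps a proper matrix even for one SNP
    beta_gwas_nullMat <- mvtnorm::rmvnorm(nsamp, mean=rep(0, length(gwas_se)), sigma=se_mat %*% LD %*% se_mat)
    colnames(beta_gwas_nullMat) <- gwas_res$snpID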
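    # Refit the model on each null draw and collect the null t-statistics:
    tstat_nulls <- mclapply(1:nsamp, function(i) {
      beta_gwas_null <- beta_gwas_nullMat[i, as.character(snpID)]
      tryCatch({
        # LMM equation:
        lme_null <- lme4::lmer(eqtl_betas~(beta_gwas_null-1)+(beta_gwas_null-1|snpID), weights=weights)
        summary(lme_null)$coefficients[1,3]
      }, error=function(e) {
        # A fit that errors has no t-statistic, so it is always dropped.
        # Note: lmer() reports non-convergence and singular fits as warnings/messages,
        # not errors, so those samples are kept here regardless of use_nonconverge.
        NA_real_
      })
    }, mc.cores = num_cpus)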
    tstat_nulls <- unlist(tstat_nulls)
    tstat_nulls <- tstat_nulls[!is.na(tstat_nulls)] # Remove NAs
    nsamp_used <- length(tstat_nulls)
    # Calculate the p-value and round it to 3 decimal places:
    pval <- round(mean(abs(tstat_nulls) >= abs(tstat_MR_MtRobin)), digits = 3)
    # Store the results for the current gene in the list:
    pvalues_results[[gene_id]] <- list(pvalue=pval, nsamp_used=nsamp_used)
  }
  # Calculate FDR:
  all_pvalues <- sapply(pvalues_results, function(x) x$pvalue)
  fdr_values <- p.adjust(all_pvalues, method = "BH")
  # Combine results with FDR values:
  for(i in seq_along(fdr_values)) {
    pvalues_results[[names(fdr_values)[i]]]$fdr <- fdr_values[i]
  }
  # Prepare results dataframe:
  results_df <- do.call(rbind, lapply(names(pvalues_results), function(gene_id) {
    data.frame(gene_id = gene_id,
               pvalue = sprintf("%.3f", pvalues_results[[gene_id]]$pvalue), # Format pvalue with 3 decimal places
               nsamp_used = pvalues_results[[gene_id]]$nsamp_used,
               fdr = sprintf("%.15f", pvalues_results[[gene_id]]$fdr), # Keep full precision for fdr
               stringsAsFactors = FALSE)
  }))
  # Write the results to the output file, ensuring the formatting is kept
  write.table(results_df, file = output_file_path, sep = "\t", row.names = FALSE, quote = FALSE, col.names = TRUE)
  return(pvalues_results)
}

# Call the previous function:
MR_MtRobin_resample(
  results_1_path = "/scratch/sq95/sp6154/p_values/long_covid_main_fs_3_no_errors.rds",
  nsamp = 10000, # Number of bootstrap samples
  use_nonconverge = FALSE, # Whether to use non-converging results, typically FALSE
  output_file_path = "/scratch/sq95/sp6154/p_values/MR_pvalue_FDR_results_fs3.txt",
  num_cpus = 48 # Number of CPUs
)
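
The resampling p-value above is the fraction of null t-statistics at least as extreme as the observed one, so it can be exactly zero when no null draw exceeds the observed statistic. A minimal sketch of a common alternative, which adds one to the numerator and denominator and applies the Benjamini-Hochberg adjustment to unrounded p-values, is shown below. The helper name `empirical_pvalue` and the toy statistics are assumptions for illustration only; this is not the procedure used to produce the results in this notebook.

```r
# Hypothetical helper: empirical two-sided p-value with the +1 correction.
empirical_pvalue <- function(t_obs, t_null) {
  t_null <- t_null[!is.na(t_null)]              # drop failed resamples
  n_extreme <- sum(abs(t_null) >= abs(t_obs))   # null draws at least as extreme
  (n_extreme + 1) / (length(t_null) + 1)        # can never be exactly zero
}

set.seed(1)
t_null <- rnorm(9999)                            # toy null t-statistics (not real data)
p_strong <- empirical_pvalue(t_obs = 10, t_null = t_null)
p_weak <- empirical_pvalue(t_obs = 0, t_null = t_null)

# Benjamini-Hochberg adjustment on the unrounded p-values:
fdr <- p.adjust(c(gene_a = p_strong, gene_b = p_weak), method = "BH")
print(p_strong)
print(fdr)
```

With 9,999 usable null draws the smallest attainable p-value is 1/10000, which makes the resolution of the resampling explicit before any FDR correction.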

In [None]:
#!/bin/bash
# The shebang above must be the first line of the script, and every #PBS directive
# must appear before the first executable command; later directives are ignored.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e):
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory where the job was submitted:
cd $PBS_O_WORKDIR

# Write timestamps to a per-job output file instead of overwriting the same file:
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Option 1: Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/p_values/p_values.sif Rscript /data/p_values_fs_3.R

# Option 2: Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly:
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2024-02-07 20:15:36:
   Job Id:             107707388.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      692.00
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 139:26:08
   Memory Requested:   500.0GB               Memory Used: 114.15GB
   Walltime requested: 10:00:00            Walltime Used: 04:48:20
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R

# Read the results with the p_value and FDR values:
results <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_fs3.txt", header=TRUE, sep="\t")
print(dim(results))
print(head(results))

[1] 1796    4
          gene_id pvalue nsamp_used       fdr
1 ENSG00000002016  0.628      10000 0.9095871
2 ENSG00000005156  0.165      10000 0.6704525
3 ENSG00000006282  0.365      10000 0.8204506
4 ENSG00000006744  0.181      10000 0.6901826
5 ENSG00000007171  0.289      10000 0.7645310
6 ENSG00000007341  0.232      10000 0.7050288


In [None]:
%%R

# Call the analyze_gene_data() function to organize the Mt_Robin results for significant and non-significant genes:

# Create the output paths:
save_paths <- list(
  all_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_all_genes_sig_non_sig.txt",
  protein_coding_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_protein_coding_genes_sig_not_sig.txt"
)

# Call the function:
analyze_gene_data(
  result_main_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_3.rds",
  eqtl_long_covid_exposure_path = "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt",
  save_files = save_paths
)

Saved all_genes as all_genes to /content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_all_genes_sig_non_sig.txt 
Dimensions of all_genes : 1796 10 

          gene_id gene_name      gene_type     beta_y Obs SNPs
1 ENSG00000002016     RAD52 protein_coding   1.655462  23    8
2 ENSG00000005156      LIG3 protein_coding   9.792645  10    3
3 ENSG00000006282   SPATA20 protein_coding -18.930839  46    2
4 ENSG00000006744     ELAC2 protein_coding   6.471382  18    4
5 ENSG00000007171      NOS2 protein_coding  29.386142  31    5
6 ENSG00000007341      ST7L protein_coding -71.497400  12    2
                                                                               SNP_IDs
1 rs11064536,rs11571383,rs117766797,rs147560086,rs2107614,rs4980935,rs718389,rs7955986
2                                                     rs12945428,rs201016783,rs3135958
3                                                                  rs2306001,rs9890200
4                                             rs1044564,rs116559

In [None]:
%%R

# FS3 final table:
# Call the function that retrieves more information from the Mt_Robin results and filters the significant genes:
process_final_data(input_path_1 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_3_no_errors.rds",
                   input_path_2 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_protein_coding_genes_sig_not_sig.txt",
                   input_path_3 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_fs3.txt",
                   output_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_protein_coding_genes_sig_full_table.txt")

[1] "Causal genes (noncoding genes and protein coding genes): "
[1] 1796
$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: -28.894
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 8.267   
 Residual        1.941   
Number of obs: 23, groups:  snpID, 8
Fixed Effects:
beta_y  
 1.655  

$gwas_res
        snpID   gwas_beta   gwas_se
1  rs11064536  0.07077150 0.0437185
2  rs11571383 -0.03928230 0.0700993
3 rs117766797 -0.09468080 0.1211550
4 rs147560086  0.04747050 0.0395567
5   rs2107614 -0.02811250 0.0339167
6   rs4980935 -0.00631512 0.0413246
7    rs718389 -0.06199730 0.0403831
8   rs7955986 -0.03200370 0.0324808

$LD
            rs147560086   rs7955986    rs718389  rs11571383   rs2107614
rs147560086  1.00000000  0.57848413 -0.19239180  0.66694928 -0.29808391
rs7955986    0.57848413  1.00000000 -0.35065828  0.41826699 -0.19482499
rs718389    -0.19239180 -0.35065828  1.00000

In [None]:
%%R

# Call the last function to retrieve the tissue information and categories for fs3:
input_eqtl_path <- "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt"
input_mt_robin_path <- "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_protein_coding_genes_sig_full_table.txt"
output_path <- "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_protein_coding_genes_sig_full_table_tissues.txt"
process_eqtl_mt_robin(input_eqtl_path, input_mt_robin_path, output_path)

           gene_id gene_name SNPs
1  ENSG00000139714     MORN3   18
2  ENSG00000146530      VWDE    8
3  ENSG00000156603     MED19   14
4  ENSG00000158825       CDA   14
5  ENSG00000163218   PGLYRP4    3
6  ENSG00000163879    DNALI1    9
7  ENSG00000180988    OR52N2    4
8  ENSG00000184983    NDUFA6    2
9  ENSG00000188878      FBF1    5
10 ENSG00000231887      PRH1    2
                                                                                                                                                                                                SNPs_IDs
1  rs10522094,rs112537090,rs113467236,rs11502393,rs138851658,rs144181641,rs146337807,rs148311487,rs148754753,rs208291,rs2263271,rs2458404,rs28421662,rs58423269,rs63496760,rs7313015,rs7486943,rs7957569
2                                                                                                                      rs12673085,rs12699367,rs2287067,rs28811348,rs2966627,rs4721083,rs6971764,rs970035
3                      

### **GWAS4**

In [None]:
library(matrixStats) # It needs to be installed separately
#library(BiocManager) # It needs to be installed separately
library(snpStats) # It needs to be installed separately
library(utils) # R package installed by default
library(parallel) # R package installed by default

# Step 1: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 2: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_4_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 3: Open the genes input data set:
all_genes <- read.table("/data/all_genes_110.txt",
                        header = TRUE,
                        sep = "",
                        stringsAsFactors = FALSE)

# Step 4: Convert the previous list of genes into a vector:
all_genes_vector <- all_genes$gene_id

# Step 5: Save the paths of the GTEx files:
gtex_covid_bed <- "/data/WGS_GTEx_long_covid_rsID.bed"
gtex_covid_bim <- "/data/WGS_GTEx_long_covid_rsID.bim"
gtex_covid_fam <- "/data/WGS_GTEx_long_covid_rsID.fam"

# Step 6: Create a variable for the path to the directory to save the LD matrix results:
ld_matrix_dir <- "/data/ld_matrix_fs_4"

# Step 7: Create the directory if it doesn't exist:
if (!dir.exists(ld_matrix_dir)) {
  dir.create(ld_matrix_dir)
}

# Step 8: Modify the original select_IV_fs() function to accept NAs from the eQTL and calculate the LD matrix inside:
# select_IV_fs() function --> Starts with no variables and adds one at a time:
select_IV_fs <- function(geneID,
                         eqtl_data,
                         nTiss,
                         bed_file,
                         bim_file,
                         fam_file,
                         ld_matrix_path,
                         ld_thresh=0.5,
                         pval_thresh=0.001,
                         nTiss_thresh=3){

  ## Subset eQTL dataset to gene of interest:
  if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
  eqtl_data <- subset(eqtl_data, gene_id == geneID)

  # pval_mat <- as.matrix(eqtl_data[,paste0("pvalue_",1:nTiss)])
  pval_mat <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  eqtl_data$min_p <- matrixStats::rowMins(pval_mat, na.rm = TRUE) # Mod1: I added na.rm = TRUE
  eqtl_data$nTissThresh_p <- apply(pval_mat, 1, function(x) sort(x)[nTiss_thresh])
  eqtl_data <- subset(eqtl_data, nTissThresh_p < pval_thresh)

  ## Ensure same sign for tissues with p < pval_thresh:
  betas <- as.matrix(subset(eqtl_data,select=paste0("beta_",1:nTiss)))
  pvals <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  pvals_lt001 <- pvals < pval_thresh # Use the function argument instead of a hard-coded 0.001
  betas_sig <- betas*pvals_lt001
  #sig_neg_eQTL <- as.numeric(apply(betas_sig,1,function(x) any(x < -1e-15)))
  #sig_pos_eQTL <- as.numeric(apply(betas_sig,1,function(x) any(x > 1e-15)))
  sig_neg_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x < -1e-15)))) # Mod2: I added is.na()
  sig_pos_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x > 1e-15)))) # Mod3: I added is.na()
  eqtl_data$pos_sig_eQTL <- sig_pos_eQTL
  eqtl_data$neg_sig_eQTL <- sig_neg_eQTL
  eqtl_data <- subset(eqtl_data, !(pos_sig_eQTL==1 & neg_sig_eQTL==1))

  if(nrow(eqtl_data)==0) stop(paste0("No cis-SNPs for gene with p<",pval_thresh," in at least ",nTiss_thresh," tissues."))

  ## Obtain a set of candidate SNPs:
  SNP_pool <- eqtl_data$variant_id
  if(length(SNP_pool)==1){
    return(SNP_pool)
  }

  # Calculate the LD matrix:
  snp_data <- read.plink(bed_file, bim_file, fam_file)
  snp_matrix <- snp_data$genotypes
  numeric_matrix <- as(snp_matrix, "numeric")
  # Restrict LD matrix to SNPs in the data set (drop = FALSE keeps a matrix even if only one SNP matches):
  df_subset <- numeric_matrix[, colnames(numeric_matrix) %in% SNP_pool, drop = FALSE]
  LD <- cor(df_subset, use="pairwise.complete.obs")
  LD_r2 <- LD^2
  # Export a .txt file of the LD matrix:
  utils::write.table(LD,
                     file = ld_matrix_path,
                     sep = " ",
                     row.names = TRUE,
                     col.names = TRUE,
                     quote = FALSE)

  ## Select the best SNPs remaining in SNP_pool based on minimum p-value:
  selected_snps <- NULL
  while(length(SNP_pool) > 1){

    ## Identify top SNPs in pool:
    eqtl_data <- subset(eqtl_data, variant_id %in% SNP_pool)
    next_snp_idx <- which.min(eqtl_data$min_p)
    next_snp <- eqtl_data$variant_id[next_snp_idx]
    selected_snps <-c(selected_snps,next_snp)

    ## Identify SNPs not in high LD with top SNPs in the pool:
    LD_idx <- which(row.names(LD_r2)==next_snp)
    LD_keep_idx <- which(LD_r2[LD_idx,] < ld_thresh)
    LD_r2 <- LD_r2[LD_keep_idx,LD_keep_idx]
    SNP_pool <- names(LD_keep_idx)
  }
  selected_snps <- c(selected_snps, SNP_pool)
  return(selected_snps)
}

# Step 9: Define the number of cores to use for the parallelism process:
num_cores <- 48

# Step 10: Apply the function to each gene id:
result_list <- mclapply(all_genes_vector, function(geneID) {
  # Generate a unique file path for the LD matrix of this gene
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  tryCatch({
    select_IV_fs(geneID = geneID,
                 eqtl_data = eqtl_covid_exposure,
                 nTiss = 49,
                 bed_file = gtex_covid_bed,
                 bim_file = gtex_covid_bim,
                 fam_file = gtex_covid_fam,
                 ld_matrix_path = ld_matrix_path,
                 ld_thresh = 0.5,
                 pval_thresh = 0.001,
                 nTiss_thresh = 1)
  }, error = function(e) print(e))
}, mc.cores = num_cores)

# Step 11: Name the elements of the list with corresponding gene ids:
names(result_list) <- all_genes_vector

# Step 12: Save the previous result as a .rds file:
saveRDS(result_list, file = "/data/long_covid_IVs_fs_4.rds")
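
The forward-selection loop in `select_IV_fs()` can be hard to follow inside the full function. The sketch below reproduces only the greedy step on a toy squared-correlation matrix: pick the SNP with the smallest p-value, then discard every SNP whose r² with it is at or above the threshold. The SNP names, p-values, LD values and the helper name `greedy_ld_prune` are all hypothetical, and the loop is a simplified variant that runs until the pool is empty rather than a copy of the function above.

```r
# Toy squared-correlation (r^2) matrix for four hypothetical SNPs (not real data):
snps <- c("snpA", "snpB", "snpC", "snpD")
ld_r2 <- matrix(c(1.00, 0.80, 0.10, 0.05,
                  0.80, 1.00, 0.20, 0.10,
                  0.10, 0.20, 1.00, 0.60,
                  0.05, 0.10, 0.60, 1.00),
                nrow = 4, dimnames = list(snps, snps))
# Hypothetical minimum eQTL p-value across tissues for each SNP:
min_p <- c(snpA = 1e-8, snpB = 1e-9, snpC = 1e-4, snpD = 1e-5)

greedy_ld_prune <- function(ld_r2, min_p, ld_thresh = 0.5) {
  pool <- names(min_p)
  selected <- character(0)
  while (length(pool) > 0) {
    top <- pool[which.min(min_p[pool])]   # strongest remaining SNP
    selected <- c(selected, top)
    # Keep only SNPs whose r^2 with the top SNP is below the threshold;
    # this also removes the top SNP itself, because its r^2 with itself is 1.
    pool <- pool[ld_r2[top, pool] < ld_thresh]
  }
  selected
}

selected <- greedy_ld_prune(ld_r2, min_p)
print(selected)
```

In this toy case `snpB` is chosen first and removes `snpA` (r² = 0.80); `snpD` is chosen next and removes `snpC` (r² = 0.60).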

In [None]:
#!/bin/bash
# The shebang above must be the first line of the script, and every #PBS directive
# must appear before the first executable command; later directives are ignored.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=30:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e):
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory where the job was submitted:
cd $PBS_O_WORKDIR

# Write timestamps to a per-job output file instead of overwriting the same file:
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup.sif Rscript /data/select_function_fs_4.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-24 10:26:27:
   Job Id:             93353708.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      800.88
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 258:56:13
   Memory Requested:   500.0GB               Memory Used: 393.53GB
   Walltime requested: 30:00:00            Walltime Used: 05:33:42
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R
# Open the output file:
result_selectIV_fs_4 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_IVs_fs_4.rds")

# Filter out errors:
result_selectIV_fs_4 <- result_selectIV_fs_4[!sapply(result_selectIV_fs_4, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of unique SNP sets, and the first six elements:
print(length(result_selectIV_fs_4))
print(length(unique(result_selectIV_fs_4)))
print(head(result_selectIV_fs_4))

[1] 38574
[1] 38480
$ENSG00000000003
 [1] "rs8780"      "rs1204407"   "rs189980168" "rs4145090"   "rs139645081"
 [6] "rs112002620" "rs113922837" "rs17329105"  "rs193124579" "rs146639579"
[11] "rs201585484" "rs111478058" "rs143491698" "rs572714672" "rs56100956" 
[16] "rs147601806" "rs139830751" "rs12842251"  "rs73630694" 

$ENSG00000000005
 [1] "rs6616166"   "rs4828038"   "rs12006573"  "rs73557922"  "rs141394856"
 [6] "rs140468798" "rs73555045"  "rs5921738"   "rs139231094" "rs59368006" 
[11] "rs111915382" "rs73562536"  "rs12832642"  "rs201254498" "rs186611188"
[16] "rs113422802" "rs796285800" "rs180795909" "rs1004006"   "rs148617015"
[21] "rs140109110"

$ENSG00000000419
 [1] "rs1292040204" "rs79562553"   "rs4809840"    "rs35201621"   "rs35530561"  
 [6] "rs141892692"  "rs944933"     "rs11483800"   "rs113036491"  "rs116892696" 
[11] "rs34483216"   "rs60578213"   "rs118181672"  "rs4811253"    "rs11086363"  
[16] "rs73278571"   "rs73131221"   "rs78319058"   "rs79223715"   "rs189417964" 
[2

In [None]:
library(parallel) # R package installed by default

# Step 1: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 2: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_4_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 3: Open the list of genes-SNPs:
result_selectedIVs <- readRDS("/data/long_covid_IVs_fs_4.rds")

# Step 4: Create a variable for the path to the LD matrix directory:
ld_matrix_dir <- "/data/ld_matrix_fs_4"

# Step 5: Setup the Mt_Robin method using the following setup function:
MR_MtRobin_setup <- function(geneID,
                             snpID,
                             eqtl_data,
                             gwas_data,
                             ld_matrix_path,
                             nTiss){
  tryCatch({
    ## Check if the gene is present and subset
    if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
    eqtl_data <- subset(eqtl_data, gene_id == geneID)

    ## Confirm all SNPs are present in the dataset
    if(!all(snpID %in% eqtl_data$variant_id)) stop("Some SNPs missing in eQTL dataset")
    if(!all(snpID %in% gwas_data$variant_id)) stop("Some SNPs missing in GWAS dataset")

    # Read the LD matrix from a file
    LD <- as.matrix(utils::read.table(ld_matrix_path,
                                      header = TRUE,
                                      check.names = FALSE))

    if(!all(snpID %in% colnames(LD))) stop("Some SNPs missing in LD matrix")

    ## Align data
    eqtl_data <- eqtl_data[match(snpID,eqtl_data$variant_id),]
    gwas_data <- gwas_data[match(snpID,gwas_data$variant_id),]
    LD <- LD[match(snpID,rownames(LD)),match(snpID,colnames(LD))]

    eqtl_betas <- as.matrix(eqtl_data[,paste0("beta_",1:nTiss)])
    eqtl_se <- as.matrix(eqtl_data[,paste0("SE_",1:nTiss)])
    eqtl_pvals <- as.matrix(eqtl_data[,paste0("pvalue_",1:nTiss)])
    gwas_betas <- gwas_data$beta
    gwas_se <- gwas_data$SE

    row.names(eqtl_betas) <- row.names(eqtl_se) <- row.names(eqtl_pvals) <- eqtl_data$variant_id

    return(list(snpID=snpID,gwas_betas=gwas_betas,gwas_se=gwas_se,
                eqtl_betas=eqtl_betas,eqtl_se=eqtl_se,eqtl_pvals=eqtl_pvals,LD=LD))
  }, error = function(e) print(e))
}

# Step 6: Apply the previous setup function:

# Step 6.1: Initialize an empty list to store the results:
MR_MtRobin_results <- list()

# Step 6.2: Define the number of cores to use for the parallelism process:
num_cores <- 36

# Step 6.3: Loop over each gene in result_selectedIVs:
MR_MtRobin_results <- mclapply(names(result_selectedIVs), function(geneID) {

  # Step 6.3.1: Get the SNPs associated with each gene:
  snpID <- result_selectedIVs[[geneID]]

  # Step 6.3.2: Generate a unique file path for the LD matrix of this gene:
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  # Step 6.3.3: Call the MR_MtRobin_setup() function with the parameters for each gene:
  return(tryCatch({
    MR_MtRobin_setup(geneID = geneID,
                     snpID = snpID,
                     eqtl_data = eqtl_covid_exposure,
                     gwas_data = gwas_covid_outcome,
                     ld_matrix_path = ld_matrix_path,
                     nTiss = 49)
  }, error = function(e) print(e)))
}, mc.cores = num_cores)

# Step 7: Assign names to the list:
names(MR_MtRobin_results) <- names(result_selectedIVs)

# Step 8: Save the previous result as a .rds file:
saveRDS(MR_MtRobin_results, file = "/data/long_covid_setup_fs_4.rds")
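
The "Align data" step in `MR_MtRobin_setup()` relies on `match()` to put the eQTL rows, the GWAS rows and the LD rows and columns into the same SNP order. A minimal sketch of that idea on toy inputs follows; the SNP names and values are hypothetical and only illustrate the indexing pattern.

```r
# Toy tables with hypothetical SNP IDs and values (not real data):
snp_order <- c("snpC", "snpA", "snpB")
gwas <- data.frame(variant_id = c("snpA", "snpB", "snpC", "snpD"),
                   beta = c(0.10, -0.20, 0.30, 0.40),
                   stringsAsFactors = FALSE)
ld_ids <- c("snpA", "snpB", "snpC")
ld <- matrix(c(1.0, 0.2, 0.3,
               0.2, 1.0, 0.4,
               0.3, 0.4, 1.0),
             nrow = 3, dimnames = list(ld_ids, ld_ids))

# Reorder the GWAS rows and the LD rows/columns to follow snp_order:
gwas_aligned <- gwas[match(snp_order, gwas$variant_id), ]
ld_aligned <- ld[match(snp_order, rownames(ld)),
                 match(snp_order, colnames(ld)), drop = FALSE]

# Sanity checks that every object now shares the same SNP order:
stopifnot(identical(gwas_aligned$variant_id, snp_order),
          identical(rownames(ld_aligned), snp_order),
          identical(colnames(ld_aligned), snp_order))
print(gwas_aligned$beta)
```

Checks like the `stopifnot()` call are cheap, and they catch the case where a later step (for example a keyed merge) silently changes the row order of one object but not the others.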

In [None]:
#!/bin/bash
# The shebang above must be the first line of the script, and every #PBS directive
# must appear before the first executable command; later directives are ignored.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e):
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory where the job was submitted:
cd $PBS_O_WORKDIR

# Write timestamps to a per-job output file instead of overwriting the same file:
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup.sif Rscript /data/setup_function_fs_4.R

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-24 16:32:34:
   Job Id:             93430398.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      79.80
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 20:38:10
   Memory Requested:   300.0GB               Memory Used: 83.78GB
   Walltime requested: 10:00:00            Walltime Used: 00:44:20
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R
# Open the output file:
result_setup_fs_4 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_setup_fs_4.rds")

# Filter out errors:
result_setup_fs_4 <- result_setup_fs_4[!sapply(result_setup_fs_4, function(x) inherits(x, "simpleError"))]

# Print the length and the first element:
print(length(result_setup_fs_4))
print(length(unique(result_setup_fs_4)))
print(result_setup_fs_4[[1]])

[1] 4958
[1] 4958
$snpID
[1] "rs147560086" "rs7955986"   "rs718389"    "rs11571383"  "rs2107614"  
[6] "rs4980935"   "rs11064536"  "rs117766797"

$gwas_betas
[1]  0.03468750 -0.02506380 -0.03680740 -0.05198390 -0.00444987 -0.00825649
[7]  0.01503010 -0.00869216

$gwas_se
[1] 0.0287461 0.0233427 0.0272009 0.0617035 0.0245610 0.0261202 0.0315410
[8] 0.0734673

$eqtl_betas
            beta_1 beta_2    beta_3   beta_4 beta_5    beta_6    beta_7 beta_8
rs147560086     NA     NA -0.332722 -0.36379     NA -0.428724        NA     NA
rs7955986       NA     NA        NA       NA     NA        NA        NA     NA
rs718389        NA     NA        NA       NA     NA        NA        NA     NA
rs11571383      NA     NA        NA       NA     NA        NA -0.525715     NA
rs2107614       NA     NA        NA       NA     NA        NA        NA     NA
rs4980935       NA     NA        NA       NA     NA        NA        NA     NA
rs11064536      NA     NA        NA       NA     NA        NA        NA   

In [None]:
# Step 1: Load the required libraries:
library(base) # R package installed by default
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(Matrix) # R package installed by default
library(methods) # R package installed by default
library(stats) # R package installed by default
library(lme4) # It needs to be installed separately
library(data.table) # It needs to be installed separately

# Step 2: Main Mt_Robin function:
MR_MtRobin <- function(MR_MtRobin_input,pval_thresh=0.001){
  snpID <- MR_MtRobin_input$snpID
  gwas_betas <- MR_MtRobin_input$gwas_betas
  gwas_se <- MR_MtRobin_input$gwas_se
  eqtl_betas <- MR_MtRobin_input$eqtl_betas
  eqtl_se <- MR_MtRobin_input$eqtl_se
  eqtl_pvals <- MR_MtRobin_input$eqtl_pvals
  LD <- MR_MtRobin_input$LD

  if(ncol(eqtl_betas) != ncol(eqtl_se) || ncol(eqtl_betas) != ncol(eqtl_pvals)){
    stop("eQTL betas, SE and p-values do not all have same number of columns")
  }

  ## determine number of tissues (or conditions/studies/etc.)
  nT <- ncol(eqtl_betas)

  ## rename columns for standardized processing
  colnames(eqtl_betas) <- paste0("beta_",1:nT)
  colnames(eqtl_se) <- paste0("SE_",1:nT)
  colnames(eqtl_pvals) <- paste0("pvalue_",1:nT)

  ## combine eQTL results into data.table
  eqtl_res <- data.table::data.table(snpID,eqtl_betas,eqtl_se,eqtl_pvals)

  # reshape to long format (each SNP-gene-tissue will be row)
  eQTL_res_melt <- data.table::melt(eqtl_res,id.vars=c("snpID"),
                                    measure.vars=patterns("^beta_","^SE_","^pvalue_"),value.name=c("beta","se","pvalue"),
                                    variable.name="tissue")

  ## subset to strong instrumental variables (based on p-value threshold)
  eQTL_res_melt_PltThresh <- subset(eQTL_res_melt, pvalue < pval_thresh)

  ## set up gwas data for merging
  gwas_res <- data.table::data.table(snpID=snpID,gwas_beta=gwas_betas,gwas_se=gwas_se)

  ## merge eQTL and GWAS results
  data.table::setkey(eQTL_res_melt_PltThresh,snpID)
  data.table::setkey(gwas_res,snpID)
  merged_res <- data.table::merge.data.table(eQTL_res_melt_PltThresh,gwas_res)

  ## set up coefficients for reverse regression
  beta_x <- matrix(merged_res$beta,ncol=1)
  beta_y <- matrix(merged_res$gwas_beta,ncol=1)

  ## standard errors (for weights)
  se_x <- matrix(merged_res$se,ncol=1)
  se_y <- matrix(merged_res$gwas_se,ncol=1) ## not used by function; used in resampling (consider returning with results)

  ## identifiers
  snpID <- merged_res$snpID

  lme_res <- lme4::lmer(beta_x~(beta_y-1)+(beta_y-1|snpID),weights=1/se_x^2)

  ## return results from weighted regression with random slopes (and correlated errors) along with GWAS SE and LD (needed for resampling)
  return(list(lme_res=lme_res,gwas_res=gwas_res,LD=LD))
}

# Step 3: Read the previous file:
result_setup <- readRDS("/data/long_covid_setup_fs_4.rds")

# Step 4: Apply the main Mt_Robin function using parallel processing:
no_cores <- 36 # Match the ncpus requested in the PBS job script

cl <- makeCluster(no_cores) # Create a cluster with the number of cores
clusterExport(cl, "MR_MtRobin") # Export the MR_MtRobin function to all the nodes of the cluster
result_list_MtRobin <- parLapply(cl, result_setup, function(MR_MtRobin_input) {
  tryCatch({
    MR_MtRobin(MR_MtRobin_input = MR_MtRobin_input)
  }, error = function(e) print(e))
})
stopCluster(cl) # Stop the cluster once the computation is done

# Step 5: The names of the results will correspond to the gene ids:
names(result_list_MtRobin) <- names(result_setup)

# Step 6: Save the previous result as a .rds file:
saveRDS(result_list_MtRobin, file = "/data/long_covid_main_fs_4.rds")
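
`MR_MtRobin()` returns `gwas_res` and `LD` because the resampling step draws null GWAS effects whose covariance is diag(se) × LD × diag(se). Because `setkey()` sorts `gwas_res` by `snpID`, the LD matrix has to be put in the same SNP order before the two are combined. The sketch below shows that construction on toy inputs. The SNP names and values are hypothetical, and the draw uses a base-R Cholesky factor as a stand-in for `mvtnorm::rmvnorm()`, so it is an illustration under those assumptions rather than the notebook's resampling code.

```r
# Toy inputs with hypothetical SNP IDs (not real data):
gwas_res <- data.frame(snpID = c("snpA", "snpB", "snpC"),
                       gwas_se = c(0.05, 0.10, 0.20),
                       stringsAsFactors = FALSE)
ld_ids <- c("snpC", "snpA", "snpB")   # LD stored in a different SNP order
LD <- matrix(c(1.0, 0.3, 0.4,
               0.3, 1.0, 0.2,
               0.4, 0.2, 1.0),
             nrow = 3, dimnames = list(ld_ids, ld_ids))

# Put LD in the same SNP order as the standard errors before combining them:
LD_aligned <- LD[gwas_res$snpID, gwas_res$snpID, drop = FALSE]
# diag(x, nrow = ...) stays a proper matrix even when there is a single SNP:
D <- diag(gwas_res$gwas_se, nrow = nrow(gwas_res))
sigma <- D %*% LD_aligned %*% D
dimnames(sigma) <- dimnames(LD_aligned)

# Draw null GWAS effects: rows of z %*% chol(sigma) have covariance sigma.
set.seed(42)
nsamp <- 5000
z <- matrix(rnorm(nsamp * nrow(sigma)), nrow = nsamp)
beta_null <- z %*% chol(sigma)
colnames(beta_null) <- colnames(sigma)

print(round(sigma, 4))
print(round(cov(beta_null), 4))
```

Comparing the sample covariance of the draws with `sigma` is a quick way to confirm that each standard error is paired with the correct row and column of the LD matrix.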

In [None]:
#!/bin/bash
# The shebang above must be the first line of the script, and every #PBS directive
# must appear before the first executable command; later directives are ignored.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e):
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory where the job was submitted:
cd $PBS_O_WORKDIR

# Write timestamps to a per-job output file instead of overwriting the same file:
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/main_function.sif Rscript /data/main_function_fs_4.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-24 17:34:46:
   Job Id:             93443499.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      2.73
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 00:02:56
   Memory Requested:   300.0GB               Memory Used: 9.71GB
   Walltime requested: 10:00:00            Walltime Used: 00:01:31
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================
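
The Service Units figure in the report above can be reproduced from the CPUs requested and the walltime used. The Python sketch below assumes a flat charge rate of 3 SU per CPU-hour for this queue; that rate is an assumption inferred from the numbers in the report, not a value stated in this notebook, and actual charging may also depend on the memory requested. The function name `service_units` is hypothetical.

```python
def service_units(ncpus, walltime, rate_per_cpu_hour=3.0):
    """Estimate service units from CPUs requested and walltime used (HH:MM:SS).

    rate_per_cpu_hour is an assumed charge rate, not taken from the notebook.
    """
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    elapsed_hours = (hours * 3600 + minutes * 60 + seconds) / 3600
    return ncpus * elapsed_hours * rate_per_cpu_hour


# Values from the report above: 36 CPUs and 00:01:31 of walltime.
estimate = round(service_units(36, "00:01:31"), 2)
print(estimate)  # prints 2.73
```

Because the charge scales with the CPUs requested rather than the CPU time actually used, requesting fewer cores for short jobs lowers the cost.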

In [None]:
%%R
# Open the output file:
result_main_fs_4 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_4.rds")

# Filter out errors:
result_main_fs_4 <- result_main_fs_4[!sapply(result_main_fs_4, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of unique results, and the first few elements:
print(length(result_main_fs_4))
print(length(unique(result_main_fs_4)))
print(head(result_main_fs_4))

[1] 1724
[1] 1724
$ENSG00000002016
$ENSG00000002016$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: -29.6022
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 22.469  
 Residual         1.836  
Number of obs: 23, groups:  snpID, 8
Fixed Effects:
beta_y  
 8.039  

$ENSG00000002016$gwas_res
        snpID   gwas_beta   gwas_se
1  rs11064536  0.01503010 0.0315410
2  rs11571383 -0.05198390 0.0617035
3 rs117766797 -0.00869216 0.0734673
4 rs147560086  0.03468750 0.0287461
5   rs2107614 -0.00444987 0.0245610
6   rs4980935 -0.00825649 0.0261202
7    rs718389 -0.03680740 0.0272009
8   rs7955986 -0.02506380 0.0233427

$ENSG00000002016$LD
            rs147560086   rs7955986    rs718389  rs11571383   rs2107614
rs147560086  1.00000000  0.57848413 -0.19239180  0.66694928 -0.29808391
rs7955986    0.57848413  1.00000000 -0.35065828  0.41826699 -0.19482499
rs718389    -0.19239180 -0.350658

In [None]:
########################################################################
# GADI:

# Obtain p-values from MR-MtRobin results.
# Uses a resampling procedure to estimate the p-value for each MR-MtRobin result.
# Arguments:
#   results_1_path    path to the .rds list of MR_MtRobin results (one element per gene).
#   nsamp             integer; number of null samples used to estimate the p-value.
#   use_nonconverge   logical; if FALSE, samples for which lmer() raises an error return NA and are dropped.
#                     Convergence warnings and singular fits are not caught by this version.
#   output_file_path  path of the tab-separated results file.
#   num_cpus          number of cores passed to mclapply().
# Returns a list with one element per gene containing:
#   pvalue      numeric; the estimated p-value.
#   nsamp_used  integer; number of samples used (it can be smaller than nsamp when samples are dropped).
#   fdr         numeric; Benjamini-Hochberg adjusted p-value.
# Call pattern of the single-gene interface, kept for reference:
# myRes <- MR_MtRobin(snpID,gwas_betas,gwas_se,eqtl_betas,eqtl_se,eqtl_pvals,LD)
# MR_MtRobin_resample(MR_MtRobin_res=myRes, nsamp=1e4, use_nonconverge=FALSE)

# Load packages:
library(lme4)
library(mvtnorm)
library(data.table)
library(parallel)

# Load main results:
results_1 <- readRDS("/scratch/sq95/sp6154/p_values/long_covid_main_fs_4.rds")
# Filter out errors:
results_1 <- results_1[!sapply(results_1, function(x) inherits(x, "simpleError"))]
# Export:
saveRDS(results_1, "/scratch/sq95/sp6154/p_values/long_covid_main_fs_4_no_errors.rds")

# Updated Function to calculate the p-value and FDR with parallel processing:
MR_MtRobin_resample <- function(results_1_path, nsamp=1e4, use_nonconverge=FALSE, output_file_path, num_cpus=1){
  # Check and load the results_1 dataset from the specified path:
  if(file.exists(results_1_path)){
    results_1 <- readRDS(results_1_path)
  } else {
    stop("The specified file path for results_1 does not exist.")
  }
  # Initialize a list to store results for each gene:
  pvalues_results <- list()
  # Loop over each gene:
  for(gene_id in names(results_1)) {
    # Check if necessary components are present and not NULL:
    if(is.null(results_1[[gene_id]][["lme_res"]]) || is.null(results_1[[gene_id]][["gwas_res"]]) || is.null(results_1[[gene_id]][["LD"]])) {
      next
    }
    # Extract the fitted model, the GWAS summary statistics, and the LD matrix for this gene:
    lme_res <- results_1[[gene_id]][["lme_res"]]
    gwas_res <- results_1[[gene_id]][["gwas_res"]]
    LD <- results_1[[gene_id]][["LD"]]
    # Extract data from MR-MtRobin results:
    eqtl_betas <- lme_res@frame$beta_x
    snpID <- lme_res@frame$snpID
    weights <- lme_res@frame$weights
    tstat_MR_MtRobin <- summary(lme_res)$coefficients[1,3]
    gwas_se <- gwas_res$gwas_se
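    # Null distribution of GWAS effects, sampled with the correlations implied by LD
    # (nrow is given explicitly so that a single SNP still yields a 1x1 matrix):
    se_diag <- diag(gwas_se, nrow = length(gwas_se))
    beta_gwas_nullMat <- mvtnorm::rmvnorm(nsamp, mean=rep(0, length(gwas_se)), sigma=se_diag %*% LD %*% se_diag)
    colnames(beta_gwas_nullMat) <- gwas_res$snpID
    # Refit the model on each null draw and collect the t-statistics in parallel: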
    tstat_nulls <- mclapply(1:nsamp, function(i) {
      beta_gwas_null <- beta_gwas_nullMat[i, snpID]
      tryCatch({
        # LMM equation:
        lme_null <- lme4::lmer(eqtl_betas~(beta_gwas_null-1)+(beta_gwas_null-1|snpID), weights=weights)
        return(summary(lme_null)$coefficients[1,3])
      }, error=function(e) {
        if(!use_nonconverge) return(NA)
      })
    }, mc.cores = num_cpus)
    tstat_nulls <- unlist(tstat_nulls)
    tstat_nulls <- tstat_nulls[!is.na(tstat_nulls)] # Remove NAs
    nsamp_used <- length(tstat_nulls)
    # Calculate the p-value and round it to 3 decimal places:
    pval <- round(mean(abs(tstat_nulls) >= abs(tstat_MR_MtRobin)), digits = 3)
    # Store the results for the current gene in the list:
    pvalues_results[[gene_id]] <- list(pvalue=pval, nsamp_used=nsamp_used)
  }
  # Calculate FDR:
  all_pvalues <- sapply(pvalues_results, function(x) x$pvalue)
  fdr_values <- p.adjust(all_pvalues, method = "BH")
  # Combine results with FDR values:
  for(i in seq_along(fdr_values)) {
    pvalues_results[[names(fdr_values)[i]]]$fdr <- fdr_values[i]
  }
  # Prepare results dataframe:
  results_df <- do.call(rbind, lapply(names(pvalues_results), function(gene_id) {
    data.frame(gene_id = gene_id,
               pvalue = sprintf("%.3f", pvalues_results[[gene_id]]$pvalue), # Format pvalue with 3 decimal places
               nsamp_used = pvalues_results[[gene_id]]$nsamp_used,
               fdr = sprintf("%.15f", pvalues_results[[gene_id]]$fdr), # Keep full precision for fdr
               stringsAsFactors = FALSE)
  }))
  # Write the results to the output file, ensuring the formatting is kept
  write.table(results_df, file = output_file_path, sep = "\t", row.names = FALSE, quote = FALSE, col.names = TRUE)
  return(pvalues_results)
}

# Call the previous function:
MR_MtRobin_resample(
  results_1_path = "/scratch/sq95/sp6154/p_values/long_covid_main_fs_4_no_errors.rds",
  nsamp = 10000, # Number of bootstrap samples
  use_nonconverge = FALSE, # Whether to use non-converging results, typically FALSE
  output_file_path = "/scratch/sq95/sp6154/p_values/MR_pvalue_FDR_results_fs4.txt",
  num_cpus = 48 # Number of CPUs
)
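
The resampling function above relies on two pieces of arithmetic: the null covariance `diag(gwas_se) %*% LD %*% diag(gwas_se)` and the two-sided empirical p-value `mean(abs(tstat_nulls) >= abs(tstat_MR_MtRobin))`. The Python sketch below illustrates both on toy numbers. It does not refit any mixed model; the function names `null_covariance` and `empirical_p_value` and all input values are hypothetical, and `None` stands in for a dropped (NA) sample.

```python
def null_covariance(gwas_se, ld):
    """Return diag(se) * LD * diag(se) as a nested list."""
    n = len(gwas_se)
    return [[gwas_se[i] * ld[i][j] * gwas_se[j] for j in range(n)]
            for i in range(n)]


def empirical_p_value(null_stats, observed_stat):
    """Two-sided empirical p-value and the number of usable null samples.

    Entries equal to None represent dropped samples and are ignored.
    """
    usable = [t for t in null_stats if t is not None]
    if not usable:
        raise ValueError("no usable null statistics")
    extreme = sum(1 for t in usable if abs(t) >= abs(observed_stat))
    return extreme / len(usable), len(usable)


# Toy inputs (hypothetical): two SNPs in moderate LD.
se = [0.03, 0.06]
ld = [[1.0, 0.5], [0.5, 1.0]]
cov = null_covariance(se, ld)

# Toy null t-statistics with one dropped sample, compared with an observed value.
p_value, n_used = empirical_p_value([0.2, -1.5, None, 2.4, -0.7], observed_stat=1.4)
print(cov, p_value, n_used)
```

With `nsamp = 10000` the smallest non-zero p-value this estimator can return is 1e-4, which is worth keeping in mind when the values are later rounded and adjusted.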

In [None]:
#!/bin/bash
# The shebang above names the shell used to run the job; it must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS stops reading directives at the first executable command, so these must come before it:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

# Write timestamps to a per-job output file so repeated runs do not overwrite each other:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Option 1: Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/p_values/p_values.sif Rscript /data/p_values_fs_4.R

# Option 2: Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly:
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2024-02-08 12:27:08:
   Job Id:             107797348.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      446.90
   NCPUs Requested:    24                     NCPUs Used: 24
                                           CPU Time Used: 131:46:19
   Memory Requested:   200.0GB               Memory Used: 58.68GB
   Walltime requested: 10:00:00            Walltime Used: 06:12:25
   JobFS requested:    100.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R

# Read the results with the p_value and FDR values:
results <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_fs4.txt", header=TRUE, sep="\t")
print(dim(results))
print(head(results))

[1] 1724    4
          gene_id pvalue nsamp_used       fdr
1 ENSG00000002016  0.404      10000 0.8213396
2 ENSG00000005156  0.141      10000 0.6241316
3 ENSG00000006282  0.988      10000 0.9976805
4 ENSG00000006744  0.931      10000 0.9817077
5 ENSG00000007171  0.514      10000 0.8611623
6 ENSG00000007341  0.071      10000 0.4684075
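
The `fdr` column above comes from `p.adjust(all_pvalues, method = "BH")`. As a hedged illustration of what that adjustment does, the Python sketch below implements the Benjamini-Hochberg step-up procedure with the standard library only; the function name `benjamini_hochberg` and the toy p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Visit p-values from largest to smallest so a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Toy p-values (hypothetical):
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(adjusted)
```

Each adjusted value is the raw p-value scaled by `n / rank`, capped at 1 and never larger than the adjusted value of the next larger p-value.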


In [None]:
%%R

# Call the analyze_gene_data() function to organize the Mt_Robin results for significant and non-significant genes:

# Create the output paths:
save_paths <- list(
  all_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs4_all_genes_sig_non_sig.txt",
  protein_coding_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs4_protein_coding_genes_sig_not_sig.txt"
)

# Call the function:
analyze_gene_data(
  result_main_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_4.rds",
  eqtl_long_covid_exposure_path = "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt",
  save_files = save_paths
)

Saved all_genes as all_genes to /content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs4_all_genes_sig_non_sig.txt 
Dimensions of all_genes : 1724 10 

          gene_id gene_name      gene_type      beta_y Obs SNPs
1 ENSG00000002016     RAD52 protein_coding   8.0391110  23    8
2 ENSG00000005156      LIG3 protein_coding   9.2715580  10    3
3 ENSG00000006282   SPATA20 protein_coding   6.3657734  46    2
4 ENSG00000006744     ELAC2 protein_coding   0.8208629  18    4
5 ENSG00000007171      NOS2 protein_coding -77.4824431  31    5
6 ENSG00000007341      ST7L protein_coding -19.6826570  12    1
                                                                               SNP_IDs
1 rs11064536,rs11571383,rs117766797,rs147560086,rs2107614,rs4980935,rs718389,rs7955986
2                                                     rs12945428,rs201016783,rs3135958
3                                                                  rs2306001,rs9890200
4                                             rs1044564,r

In [None]:
%%R

# FS4 final table:
# Call the function that retrieves more information from the Mt_Robin results and filters the significant genes:
process_final_data(input_path_1 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_fs_4_no_errors.rds",
                   input_path_2 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs4_protein_coding_genes_sig_not_sig.txt",
                   input_path_3 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_fs4.txt",
                   output_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs4_protein_coding_genes_sig_full_table.txt")

[1] "Causal genes (noncoding genes and protein coding genes): "
[1] 1724
$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: -29.6022
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 22.469  
 Residual         1.836  
Number of obs: 23, groups:  snpID, 8
Fixed Effects:
beta_y  
 8.039  

$gwas_res
        snpID   gwas_beta   gwas_se
1  rs11064536  0.01503010 0.0315410
2  rs11571383 -0.05198390 0.0617035
3 rs117766797 -0.00869216 0.0734673
4 rs147560086  0.03468750 0.0287461
5   rs2107614 -0.00444987 0.0245610
6   rs4980935 -0.00825649 0.0261202
7    rs718389 -0.03680740 0.0272009
8   rs7955986 -0.02506380 0.0233427

$LD
            rs147560086   rs7955986    rs718389  rs11571383   rs2107614
rs147560086  1.00000000  0.57848413 -0.19239180  0.66694928 -0.29808391
rs7955986    0.57848413  1.00000000 -0.35065828  0.41826699 -0.19482499
rs718389    -0.19239180 -0.35065828  1.0000

In [None]:
%%R

# Call the last function to retrieve the tissue information and categories for fs4:
input_eqtl_path <- "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt"
input_mt_robin_path <- "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs4_protein_coding_genes_sig_full_table.txt"
output_path <- "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs4_protein_coding_genes_sig_full_table_tissues.txt"
process_eqtl_mt_robin(input_eqtl_path, input_mt_robin_path, output_path)

           gene_id gene_name SNPs
1  ENSG00000068097    HEATR6    9
2  ENSG00000089063   TMEM230    6
3  ENSG00000089335    ZNF302    2
4  ENSG00000108439      PNPO   15
5  ENSG00000125818     PSMF1    9
6  ENSG00000127399    LRRC61    2
7  ENSG00000132305      IMMT    7
8  ENSG00000138381    ASNSD1    5
9  ENSG00000138495     COX17   18
10 ENSG00000139714     MORN3   18
11 ENSG00000149089      APIP    4
12 ENSG00000163161     ERCC3    5
13 ENSG00000164002      EXO5   10
14 ENSG00000170915     PAQR8   24
15 ENSG00000171160     MORN4    2
16 ENSG00000171495    MROH2B    5
17 ENSG00000171928    TVP23B    5
18 ENSG00000177082     WDR73   13
19 ENSG00000177409    SAMD9L    9
20 ENSG00000197013    ZNF429   10
21 ENSG00000205078    SYCE1L    3
22 ENSG00000212127   TAS2R14    5
23 ENSG00000213246   SUPT4H1   12
24 ENSG00000215251   FASTKD5   22
                                                                                                                                                      

### **All GWAS**

In [None]:
%%R

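# Load dplyr, which provides bind_rows(), group_by(), summarise(), filter(), and glimpse() used below
# (uncomment the install line if the package is missing):
#install.packages("dplyr")
library(dplyr)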

# Install and load purrr:
#install.packages("purrr")
#library(purrr)

# Read the GWAS datasets
gwas1 <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs1_protein_coding_genes_sig_full_table_tissues.txt", header=TRUE, sep="\t")
gwas2 <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs2_protein_coding_genes_sig_full_table_tissues.txt", header=TRUE, sep="\t")
gwas3 <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs3_protein_coding_genes_sig_full_table_tissues.txt", header=TRUE, sep="\t")
gwas4 <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/fs4_protein_coding_genes_sig_full_table_tissues.txt", header=TRUE, sep="\t")

# Combine all tables into one, allowing for repeated gene_ids if other values vary:
all_gwas <- bind_rows(gwas1, gwas2, gwas3, gwas4)

# Count duplicates for gene_id:
gene_id_counts <- all_gwas %>%
  group_by(gene_id) %>%
  summarise(count = n(), .groups = 'drop')
duplicates_only <- filter(gene_id_counts, count > 1)
print(duplicates_only)

# Print the dimensions and the head of the combined dataset:
print("Combined causal genes with repeated gene_id:")
glimpse(all_gwas)

# Export combined datasets:
write.table(all_gwas, "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/all_gwas_causal_repeated_genes.txt", sep="\t", row.names=FALSE, col.names=TRUE, quote=FALSE)

# A tibble: 15 × 2
   gene_id         count
   <chr>           <int>
 1 ENSG00000068097     2
 2 ENSG00000089063     2
 3 ENSG00000090661     2
 4 ENSG00000108439     2
 5 ENSG00000127399     2
 6 ENSG00000139714     4
 7 ENSG00000146530     2
 8 ENSG00000149089     2
 9 ENSG00000158825     2
10 ENSG00000163161     2
11 ENSG00000171160     2
12 ENSG00000177082     2
13 ENSG00000184983     2
14 ENSG00000205078     2
15 ENSG00000212127     2
[1] "Combined causal genes with repeated gene_id:"
Rows: 67
Columns: 24
$ gene_id           <chr> "ENSG00000026950", "ENSG00000065457", "ENSG000000906…
$ gene_name         <chr> "BTN3A1", "ADAT1", "CERS4", "BNIP1", "EIF5A", "MORN3…
$ SNPs              <int> 9, 4, 15, 7, 2, 18, 8, 14, 2, 2, 2, 17, 3, 2, 9, 6, …
$ SNPs_IDs          <chr> "rs114157910,rs12213056,rs13194491,rs2275906,rs32087…
$ Obs               <int> 22, 11, 23, 13, 20, 21, 12, 26, 3, 18, 18, 20, 21, 2…
$ beta_y            <dbl> -4.817862, -7.713004, -7.804575, 8.994436, 21.013660…
$ Mi
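
The duplicate count above groups the stacked tables by `gene_id` and keeps the IDs that occur more than once. A minimal Python sketch of the same idea, using only the standard library, is given below; the function name `repeated_ids` and the toy gene IDs are hypothetical and do not come from the results.

```python
from collections import Counter


def repeated_ids(*tables):
    """Count gene_id occurrences across tables and keep those seen more than once."""
    counts = Counter(row["gene_id"] for table in tables for row in table)
    return {gene_id: n for gene_id, n in counts.items() if n > 1}


# Toy tables (hypothetical IDs), one list of rows per GWAS:
table_1 = [{"gene_id": "GENE_A"}, {"gene_id": "GENE_B"}]
table_2 = [{"gene_id": "GENE_A"}, {"gene_id": "GENE_C"}]
table_3 = [{"gene_id": "GENE_A"}, {"gene_id": "GENE_C"}]
duplicates = repeated_ids(table_1, table_2, table_3)
print(duplicates)
```

A gene that appears in several tables was flagged in more than one GWAS, which is why repeated rows are kept rather than collapsed.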

## **backward_selection**

### **GWAS1**

In [None]:
# Step 1: Load the required libraries:
library(matrixStats) # It needs to be installed separately
#library(BiocManager) # It needs to be installed separately
library(snpStats) # It needs to be installed separately
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(reshape2) # Additional package for melt function

# Step 2: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 3: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_1_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 4: Open the genes input data set:
all_genes <- read.table("/data/all_genes_110.txt",
                        header = TRUE,
                        sep = "",
                        stringsAsFactors = FALSE)

# Step 5: Convert the previous list of genes into a vector:
all_genes_vector <- all_genes$gene_id

# Step 6: Save the paths of the GTEx files:
gtex_covid_bed <- "/data/WGS_GTEx_long_covid_rsID.bed"
gtex_covid_bim <- "/data/WGS_GTEx_long_covid_rsID.bim"
gtex_covid_fam <- "/data/WGS_GTEx_long_covid_rsID.fam"

# Step 7: Create a variable for the path to the directory to save the LD matrix results:
ld_matrix_dir <- "/data/ld_matrix_bs_1"

# Step 8: Create the directory if it doesn't exist:
if (!dir.exists(ld_matrix_dir)) {
  dir.create(ld_matrix_dir)
}

# Step 9: Modify the original select_IV_bs() function to accept NAs from the eQTL and calculate the LD matrix inside:
# select_IV_bs() --> backward selection starts with all variables and removes one at a time:
select_IV_bs <- function(geneID,
                         eqtl_data,
                         nTiss,
                         bed_file,
                         bim_file,
                         fam_file,
                         ld_matrix_path,
                         ld_thresh=0.5,
                         pval_thresh=0.001,
                         nTiss_thresh=3){

  ## Subset the eQTL data set for the gene of interest:
  if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
  eqtl_data <- subset(eqtl_data, gene_id == geneID)

  pval_mat <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  eqtl_data$min_p <- matrixStats::rowMins(pval_mat, na.rm = TRUE) # Mod1: I added na.rm = TRUE
  eqtl_data$nTiss_ltThresh <- rowSums(pval_mat < pval_thresh, na.rm = TRUE) # Tissue count used by the second tie-break below
  eqtl_data$nTissThresh_p <- apply(pval_mat, 1, function(x) sort(x)[nTiss_thresh])
  eqtl_data <- subset(eqtl_data, nTissThresh_p < pval_thresh)

  ## Ensure same sign for tissues with p < pval_thresh:
  betas <- as.matrix(subset(eqtl_data,select=paste0("beta_",1:nTiss)))
  pvals <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  pvals_lt001 <- pvals<pval_thresh # Use the function argument instead of a hard-coded 0.001
  betas_sig <- betas*pvals_lt001
  sig_neg_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x < -1e-15)))) # Mod2: I added is.na()
  sig_pos_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x > 1e-15)))) # Mod3: I added is.na()
  eqtl_data$pos_sig_eQTL <- sig_pos_eQTL
  eqtl_data$neg_sig_eQTL <- sig_neg_eQTL
  eqtl_data <- subset(eqtl_data, !(pos_sig_eQTL==1 & neg_sig_eQTL==1))

  if(nrow(eqtl_data)==0) stop(paste0("No cis-SNPs for gene with p<",pval_thresh," in at least ",nTiss_thresh," tissues."))

  ## Obtain a set of candidate SNPs:
  SNP_pool <- eqtl_data$variant_id
  if(length(SNP_pool)==1){
    return(SNP_pool)
  }

  # Calculate the LD matrix:
  snp_data <- read.plink(bed_file, bim_file, fam_file)
  snp_matrix <- snp_data$genotypes
  numeric_matrix <- as(snp_matrix, "numeric")

  # Restrict LD matrix to SNPs in the data set:
  df_subset <- numeric_matrix[, colnames(numeric_matrix) %in% SNP_pool]
  LD <- cor(df_subset, use="pairwise.complete")
  LD_r2 <- LD^2

  # Export a .txt file of the LD matrix:
  utils::write.table(LD,
                     file = ld_matrix_path,
                     sep = " ",
                     row.names = TRUE,
                     col.names = TRUE,
                     quote = FALSE)

  ## Get pairwise r^2:
  LD_r2_pw <- LD_r2
  LD_r2_pw[lower.tri(LD_r2_pw,diag=T)] <- NA
  LD_r2_pw <- reshape2::melt(LD_r2_pw)
  colnames(LD_r2_pw) <- c("SNP1","SNP2","r2")
  LD_r2_pw$SNP1 <- as.character(LD_r2_pw$SNP1)
  LD_r2_pw$SNP2 <- as.character(LD_r2_pw$SNP2)

  ## Drop SNPs paired with themselves and duplicates (resulting from lower triangle, set to NA):
  LD_r2_pw <- subset(LD_r2_pw, SNP1 != SNP2 & !is.na(r2))

  ## Sort by descending r^2:
  LD_r2_pw <- LD_r2_pw[order(LD_r2_pw$r2,decreasing = T),]

  ## If every pair of candidate SNPs has r^2 >= ld_thresh, return the single SNP with the smallest p-value:
  if(min(LD_r2_pw$r2) >=ld_thresh){
    return(eqtl_data$variant_id[which.min(eqtl_data$min_p)])
  }

  while(max(LD_r2_pw$r2)>=ld_thresh){

    ## Identify top SNPs in the pool:
    eqtl_data <- subset(eqtl_data, variant_id %in% SNP_pool)
    LD_r2_pw <- subset(LD_r2_pw, SNP1 %in% SNP_pool & SNP2 %in% SNP_pool)
    LD_r2_pw_geThresh <- subset(LD_r2_pw, r2>=ld_thresh)

    ## Identify SNP pair with highest remaining r^2:
    next_pair <- LD_r2_pw[1,]

    ## Criterion 1: Drop the SNP with the largest number of above-threshold correlations with other SNPs:
    SNP1_nSNP_geThresh <- sum(LD_r2_pw_geThresh$SNP1 == next_pair$SNP1) + sum(LD_r2_pw_geThresh$SNP2 == next_pair$SNP1)
    SNP2_nSNP_geThresh <- sum(LD_r2_pw_geThresh$SNP1 == next_pair$SNP2) + sum(LD_r2_pw_geThresh$SNP2 == next_pair$SNP2)
    if(SNP1_nSNP_geThresh > SNP2_nSNP_geThresh){
      next_SNP_drop <- next_pair$SNP1
    } else if (SNP1_nSNP_geThresh < SNP2_nSNP_geThresh){
      next_SNP_drop <- next_pair$SNP2
    } else{

      ## Criterion 2: Drop the SNP with the smaller number of tissues with p < pval_thresh:
      next_pair_eqtl <- subset(eqtl_data, variant_id %in% c(next_pair$SNP1,next_pair$SNP2))
      SNP1_nTiss <- subset(next_pair_eqtl,variant_id==next_pair$SNP1)$nTiss_ltThresh
      SNP2_nTiss <- subset(next_pair_eqtl,variant_id==next_pair$SNP2)$nTiss_ltThresh

      if(SNP1_nTiss > SNP2_nTiss){
        next_SNP_drop <- next_pair$SNP2
      } else if(SNP1_nTiss < SNP2_nTiss){
        next_SNP_drop <- next_pair$SNP1
      } else{

        ## Criterion 3: Drop the SNP with the larger minimum p-value:
        next_SNP_drop <- next_pair_eqtl$variant_id[which.max(next_pair_eqtl$min_p)]
      }
    }
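    SNP_pool <- SNP_pool[which(SNP_pool != next_SNP_drop)]

    ## Refresh the pairwise table so the loop condition only sees the remaining SNPs:
    LD_r2_pw <- subset(LD_r2_pw, SNP1 %in% SNP_pool & SNP2 %in% SNP_pool)
    if(nrow(LD_r2_pw)==0) break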
  }
  return(SNP_pool)
}

# Step 10: Define the number of cores to use for the parallelism process:
num_cores <- 48

# Step 11: Apply select_IV_bs() to every gene in parallel:
result_list <- mclapply(all_genes_vector, function(geneID) {
  # Generate a unique file path for the LD matrix of this gene
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  tryCatch({
    select_IV_bs(geneID = geneID,
                 eqtl_data = eqtl_covid_exposure,
                 nTiss = 49,
                 bed_file = gtex_covid_bed,
                 bim_file = gtex_covid_bim,
                 fam_file = gtex_covid_fam,
                 ld_matrix_path = ld_matrix_path,
                 ld_thresh = 0.5,
                 pval_thresh = 0.001,
                 nTiss_thresh = 1)
  }, error = function(e) print(e))
}, mc.cores = num_cores)

# Step 12: Name each result after its gene ID:
names(result_list) <- all_genes_vector

# Step 13: Save the selected IVs as a .rds file:
saveRDS(result_list, file = "/data/long_covid_IVs_bs_1.rds")
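
The backward selection above starts from all candidate SNPs and repeatedly drops one SNP from the most correlated pair until no pair has r^2 at or above the threshold. The Python sketch below is a simplified illustration of that greedy loop: it applies the first tie-break (number of above-threshold partners) and the last one (larger minimum p-value) and omits the tissue-count criterion. The function name `prune_by_ld`, the data layout, and the toy values are all hypothetical.

```python
def prune_by_ld(r2, min_p, ld_thresh=0.5):
    """Greedy backward selection of SNPs by pairwise LD.

    r2:    dict mapping frozenset({snp_a, snp_b}) to their r^2
    min_p: dict mapping each SNP to its smallest eQTL p-value across tissues
    """
    pool = set(min_p)
    while True:
        # Pairs still in the pool whose r^2 reaches the threshold.
        high = {pair: value for pair, value in r2.items()
                if pair <= pool and value >= ld_thresh}
        if not high:
            return sorted(pool)
        # Most correlated remaining pair (names break exact ties deterministically).
        top_pair = max(high, key=lambda pair: (high[pair], sorted(pair)))
        snp_a, snp_b = sorted(top_pair)
        partners = {snp: sum(1 for pair in high if snp in pair)
                    for snp in (snp_a, snp_b)}
        if partners[snp_a] != partners[snp_b]:
            drop = max((snp_a, snp_b), key=partners.get)  # more high-LD partners
        else:
            drop = max((snp_a, snp_b), key=min_p.get)     # weaker eQTL signal
        pool.remove(drop)


# Toy example (hypothetical): A and B are in high LD; C is independent of both.
toy_r2 = {frozenset({"A", "B"}): 0.9,
          frozenset({"A", "C"}): 0.1,
          frozenset({"B", "C"}): 0.1}
toy_min_p = {"A": 1e-6, "B": 1e-4, "C": 1e-5}
kept = prune_by_ld(toy_r2, toy_min_p)
print(kept)
```

Recomputing the above-threshold pairs at the top of every iteration keeps the stopping rule tied to the SNPs that are still in the pool.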

In [None]:
#!/bin/bash
# The shebang above names the shell used to run the job; it must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=30:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS stops reading directives at the first executable command, so these must come before it:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

# Write timestamps to a per-job output file so repeated runs do not overwrite each other:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup_bs.sif Rscript /data/select_function_bs_1.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-24 23:39:34:
   Job Id:             93444366.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      866.28
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 277:35:00
   Memory Requested:   500.0GB               Memory Used: 388.77GB
   Walltime requested: 30:00:00            Walltime Used: 06:00:57
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R
# Open the output file:
result_selectIV_bs_1 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_IVs_bs_1.rds")

# Filter out errors:
result_selectIV_bs_1 <- result_selectIV_bs_1[!sapply(result_selectIV_bs_1, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of unique SNP sets, and the first few elements:
print(length(result_selectIV_bs_1))
print(length(unique(result_selectIV_bs_1)))
print(head(result_selectIV_bs_1))

[1] 13809
[1] 13737
$ENSG00000002079
[1] "rs36168039"  "rs151109487" "rs41258334"  "rs28497153"  "rs7457787"  
[6] "rs73157578" 

$ENSG00000004846
 [1] "rs58540449"  "rs79448643"  "rs114103500" "rs111613242" "rs76325633" 
 [6] "rs117337555" "rs73682327"  "rs11981446"  "rs7801178"   "rs17455533" 
[11] "rs2108258"   "rs77254950"  "rs139614792"

$ENSG00000004939
 [1] "rs2880712"   "rs9747287"   "rs117705553" "rs79555595"  "rs78230173" 
 [6] "rs114651584" "rs34637106"  "rs7213794"   "rs144535065" "rs142313238"
[11] "rs76004777"  "rs9944407"   "rs34260468"  "rs142437085" "rs8070867"  
[16] "rs730818"    "rs200108084" "rs1728174"   "rs140638631" "rs4793150"  
[21] "rs76480869"  "rs147752346" "rs62080669"  "rs9895862"   "rs7222401"  
[26] "rs540551010" "rs9890834"   "rs76498976"  "rs138050677"

$ENSG00000005073
 [1] "rs60642981"  "rs2525760"   "rs10273618"  "rs4722617"   "rs138820646"
 [6] "rs117987507" "rs17501111"  "rs10251237"  "rs147214544" "rs6947348"  
[11] "rs71539542"  "rs3735570"   "

In [None]:
library(parallel) # R package installed by default

# Step 1: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 2: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_1_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 3: Open the list of genes-SNPs:
result_selectedIVs <- readRDS("/data/long_covid_IVs_bs_1.rds")

# Step 4: Create a variable for the path to the LD matrix directory:
ld_matrix_dir <- "/data/ld_matrix_bs_1"

# Step 5: Setup the Mt_Robin method using the following setup function:
MR_MtRobin_setup <- function(geneID,
                             snpID,
                             eqtl_data,
                             gwas_data,
                             ld_matrix_path,
                             nTiss){
  tryCatch({
    ## Check if the gene is present and subset
    if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
    eqtl_data <- subset(eqtl_data, gene_id == geneID)

    ## Confirm all SNPs are present in the dataset
    if(!all(snpID %in% eqtl_data$variant_id)) stop("Some SNPs missing in eQTL dataset")
    if(!all(snpID %in% gwas_data$variant_id)) stop("Some SNPs missing in GWAS dataset")

    # Read the LD matrix from a file
    LD <- as.matrix(utils::read.table(ld_matrix_path,
                                      header = TRUE,
                                      check.names = FALSE))

    if(!all(snpID %in% colnames(LD))) stop("Some SNPs missing in LD matrix")

    ## Align data
    eqtl_data <- eqtl_data[match(snpID,eqtl_data$variant_id),]
    gwas_data <- gwas_data[match(snpID,gwas_data$variant_id),]
    LD <- LD[match(snpID,rownames(LD)),match(snpID,colnames(LD))]

    eqtl_betas <- as.matrix(eqtl_data[,paste0("beta_",1:nTiss)])
    eqtl_se <- as.matrix(eqtl_data[,paste0("SE_",1:nTiss)])
    eqtl_pvals <- as.matrix(eqtl_data[,paste0("pvalue_",1:nTiss)])
    gwas_betas <- gwas_data$beta
    gwas_se <- gwas_data$SE

    row.names(eqtl_betas) <- row.names(eqtl_se) <- row.names(eqtl_pvals) <- eqtl_data$variant_id

    return(list(snpID=snpID,gwas_betas=gwas_betas,gwas_se=gwas_se,
                eqtl_betas=eqtl_betas,eqtl_se=eqtl_se,eqtl_pvals=eqtl_pvals,LD=LD))
  }, error = function(e) print(e))
}

# Step 6: Apply the previous setup function:

# Step 6.1: Initialize an empty list to store the results:
MR_MtRobin_results <- list()

# Step 6.2: Define the number of cores to use for the parallelism process:
num_cores <- 36

# Step 6.3: Loop over each gene in result_selectedIVs:
MR_MtRobin_results <- mclapply(names(result_selectedIVs), function(geneID) {

  # Step 6.3.1: Get the SNPs associated with each gene:
  snpID <- result_selectedIVs[[geneID]]

  # Step 6.3.2: Generate a unique file path for the LD matrix of this gene:
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  # Step 6.3.3: Call the MR_MtRobin_setup() function with the parameters for each gene:
  return(tryCatch({
    MR_MtRobin_setup(geneID = geneID,
                     snpID = snpID,
                     eqtl_data = eqtl_covid_exposure,
                     gwas_data = gwas_covid_outcome,
                     ld_matrix_path = ld_matrix_path,
                     nTiss = 49)
  }, error = function(e) print(e)))
}, mc.cores = num_cores)

# Step 7: Assign names to the list:
names(MR_MtRobin_results) <- names(result_selectedIVs)

# Step 8: Save the previous result as a .rds file:
saveRDS(MR_MtRobin_results, file = "/data/long_covid_setup_bs_1.rds")

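The alignment step in `MR_MtRobin_setup()` uses `match()` to reorder the eQTL table, the GWAS table, and the LD matrix so that they all follow the SNP order in `snpID`. The sketch below shows the same idea in plain Python, runnable in a regular Colab cell without `%%R`. The helper name `align_to_order` and the toy records are hypothetical and only serve as an illustration.

```python
def align_to_order(snp_order, records, key="variant_id"):
    """Return records reordered to follow snp_order; raise if any SNP is absent."""
    index = {}
    for rec in records:
        # Keep the first occurrence, mirroring R's match(), which returns the first hit.
        index.setdefault(rec[key], rec)
    missing = [s for s in snp_order if s not in index]
    if missing:
        raise ValueError(f"Some SNPs missing: {missing}")
    return [index[s] for s in snp_order]


# Toy GWAS summary statistics in arbitrary file order (hypothetical values).
gwas_toy = [
    {"variant_id": "rsB", "beta": -0.03, "SE": 0.04},
    {"variant_id": "rsA", "beta": 0.05, "SE": 0.03},
    {"variant_id": "rsC", "beta": 0.01, "SE": 0.05},
]

aligned = align_to_order(["rsA", "rsB"], gwas_toy)
gwas_betas = [rec["beta"] for rec in aligned]
gwas_se = [rec["SE"] for rec in aligned]
print(gwas_betas, gwas_se)
```

Checking for missing SNPs before reordering matters because `match()` silently returns `NA` for absent IDs, which would otherwise introduce all-`NA` rows.
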
In [None]:
#!/bin/bash
# The line above specifies the interpreter for this script (it must be the first line of the file).

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e):
# (PBS directives must come before the first executable command; later ones are ignored.)
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

# Write timestamps to a per-job file (named by job ID) instead of overwriting the same file:
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup_bs.sif Rscript /data/setup_function_bs_1.R

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-25 08:50:01:
   Job Id:             93510073.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      48.57
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 12:49:52
   Memory Requested:   300.0GB               Memory Used: 76.49GB
   Walltime requested: 10:00:00            Walltime Used: 00:26:59
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

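The resource report above can be sanity-checked with a little arithmetic. The sketch below is a minimal example that assumes a charge rate of 3 Service Units per CPU-hour. That rate is an assumption chosen because it reproduces the reported 48.57 SU; the scheduler's actual charging formula may also take the memory request into account.

```python
def hms_to_hours(hms):
    """Convert an 'HH:MM:SS' string into fractional hours."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours + minutes / 60 + seconds / 3600


CHARGE_RATE = 3.0  # assumption: SU per CPU-hour, inferred from the reported figures

ncpus = 36
walltime_hours = hms_to_hours("00:26:59")  # Walltime Used
cpu_hours = hms_to_hours("12:49:52")       # CPU Time Used

service_units = ncpus * walltime_hours * CHARGE_RATE
cpu_efficiency = cpu_hours / (ncpus * walltime_hours)  # share of reserved CPU time actually used
memory_fraction = 76.49 / 300.0                        # Memory Used / Memory Requested

print(round(service_units, 2), round(cpu_efficiency, 2), round(memory_fraction, 2))
```

Under these assumptions the job used roughly a quarter of the requested memory and about 27 minutes of the 10-hour walltime, so smaller requests would likely be sufficient for similar runs.
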
In [None]:
%%R
# Open the output file:
result_setup_bs_1 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_setup_bs_1.rds")

# Filter out errors:
result_setup_bs_1 <- result_setup_bs_1[!sapply(result_setup_bs_1, function(x) inherits(x, "simpleError"))]

# Print the length and the first element:
print(length(result_setup_bs_1))
print(length(unique(result_setup_bs_1)))
print(result_setup_bs_1[[1]])

[1] 2635
[1] 2635
$snpID
 [1] "rs2206271"   "rs35305915"  "rs72901027"  "rs9367397"   "rs61660288" 
 [6] "rs141707622" "rs2477488"   "rs13202872"  "rs62406441"  "rs17667036" 
[11] "rs75397901"  "rs10948573"  "rs4715208"   "rs571503"   

$gwas_betas
 [1]  0.05003660 -0.03369060 -0.00541575  0.04018010 -0.06373460  0.05997710
 [7]  0.05336300  0.02460510 -0.02075750  0.00391041 -0.06437320 -0.00690805
[13]  0.03944380 -0.00698292

$gwas_se
 [1] 0.0318188 0.0446311 0.0464996 0.0627267 0.0469640 0.0901525 0.0297989
 [8] 0.0340206 0.0360749 0.0320274 0.1337120 0.0321628 0.0345040 0.0299366

$eqtl_betas
              beta_1 beta_2 beta_3 beta_4 beta_5 beta_6 beta_7 beta_8 beta_9
rs2206271   0.160091     NA     NA     NA     NA     NA     NA     NA     NA
rs35305915        NA     NA     NA     NA     NA     NA     NA     NA     NA
rs72901027        NA     NA     NA     NA     NA     NA     NA     NA     NA
rs9367397         NA     NA     NA     NA     NA     NA     NA     NA     NA
rs61660288

In [None]:
# Step 1: Load the required libraries:
library(base) # R package installed by default
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(Matrix) # R package installed by default
library(methods) # R package installed by default
library(stats) # R package installed by default
library(lme4) # It needs to be installed separately
library(data.table) # It needs to be installed separately

# Step 2: Main Mt_Robin function:
MR_MtRobin <- function(MR_MtRobin_input,pval_thresh=0.001){
  snpID <- MR_MtRobin_input$snpID
  gwas_betas <- MR_MtRobin_input$gwas_betas
  gwas_se <- MR_MtRobin_input$gwas_se
  eqtl_betas <- MR_MtRobin_input$eqtl_betas
  eqtl_se <- MR_MtRobin_input$eqtl_se
  eqtl_pvals <- MR_MtRobin_input$eqtl_pvals
  LD <- MR_MtRobin_input$LD

  if(ncol(eqtl_betas) != ncol(eqtl_se) || ncol(eqtl_betas) != ncol(eqtl_pvals)){
    stop("eQTL betas, SE and p-values do not all have same number of columns")
  }

  ## determine number of tissues (or conditions/studies/etc.)
  nT <- ncol(eqtl_betas)

  ## rename columns for standardized processing
  colnames(eqtl_betas) <- paste0("beta_",1:nT)
  colnames(eqtl_se) <- paste0("SE_",1:nT)
  colnames(eqtl_pvals) <- paste0("pvalue_",1:nT)

  ## combine eQTL results into data.table
  eqtl_res <- data.table::data.table(snpID,eqtl_betas,eqtl_se,eqtl_pvals)

  # reshape to long format (each SNP-gene-tissue will be row)
  eQTL_res_melt <- data.table::melt(eqtl_res,id.vars=c("snpID"),
                                    measure.vars=patterns("^beta_","^SE_","^pvalue_"),value.name=c("beta","se","pvalue"),
                                    variable.name="tissue")

  ## subset to strong instrumental variables (based on p-value threshold)
  eQTL_res_melt_PltThresh <- subset(eQTL_res_melt, pvalue < pval_thresh)

  ## set up gwas data for merging
  gwas_res <- data.table::data.table(snpID=snpID,gwas_beta=gwas_betas,gwas_se=gwas_se)

  ## merge eQTL and GWAS results
  data.table::setkey(eQTL_res_melt_PltThresh,snpID)
  data.table::setkey(gwas_res,snpID)
  merged_res <- data.table::merge.data.table(eQTL_res_melt_PltThresh,gwas_res)

  ## set up coefficients for reverse regression
  beta_x <- matrix(merged_res$beta,ncol=1)
  beta_y <- matrix(merged_res$gwas_beta,ncol=1)

  ## standard errors (for weights)
  se_x <- matrix(merged_res$se,ncol=1)
  se_y <- matrix(merged_res$gwas_se,ncol=1) ## not used by function; used in resampling (consider returning with results)

  ## identifiers
  snpID <- merged_res$snpID

  lme_res <- lme4::lmer(beta_x~(beta_y-1)+(beta_y-1|snpID),weights=1/se_x^2)

  ## return results from weighted regression with random slopes (and correlated errors) along with GWAS SE and LD (needed for resampling)
  return(list(lme_res=lme_res,gwas_res=gwas_res,LD=LD))
}

# Step 3: Read the previous file:
result_setup <- readRDS("/data/long_covid_setup_bs_1.rds")

# Step 4: Apply the main Mt_Robin function using parallel processing:
no_cores <- 36

cl <- makeCluster(no_cores) # Create a cluster with the number of cores
clusterExport(cl, "MR_MtRobin") # Export the MR_MtRobin function to all the nodes of the cluster
result_list_MtRobin <- parLapply(cl, result_setup, function(MR_MtRobin_input) {
  tryCatch({
    MR_MtRobin(MR_MtRobin_input = MR_MtRobin_input)
  }, error = function(e) print(e))
})
stopCluster(cl) # Stop the cluster once the computation is done

# Step 5: The names of the results will correspond to the gene ids:
names(result_list_MtRobin) <- names(result_setup)

# Step 6: Save the previous result as a .rds file:
saveRDS(result_list_MtRobin, file = "/data/long_covid_main_bs_1.rds")

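The core of `MR_MtRobin()` is a "reverse" regression through the origin: the eQTL effect sizes (`beta_x`) are regressed on the GWAS effect sizes (`beta_y`) with weights `1/se_x^2`. The sketch below computes only the fixed slope of such a weighted regression in plain Python. It is a deliberate simplification: it leaves out the SNP-level random slope that `lme4::lmer()` fits, so it does not reproduce the mixed-model estimate. The function name and toy values are hypothetical.

```python
def weighted_slope_through_origin(response, predictor, se_response):
    """Weighted least-squares slope with no intercept, using weights 1/se^2."""
    weights = [1.0 / se ** 2 for se in se_response]
    numerator = sum(w * x * y for w, x, y in zip(weights, predictor, response))
    denominator = sum(w * x * x for w, x in zip(weights, predictor))
    return numerator / denominator


# Toy data: eQTL betas are exactly twice the GWAS betas, so the slope should be 2.
gwas_beta = [0.05, -0.03, 0.04]   # predictor (beta_y)
eqtl_beta = [0.10, -0.06, 0.08]   # response (beta_x)
eqtl_se = [0.02, 0.05, 0.03]      # standard errors of the response

slope = weighted_slope_through_origin(eqtl_beta, gwas_beta, eqtl_se)
print(round(slope, 6))
```

Omitting the intercept corresponds to the `- 1` terms in the `lmer()` formula.
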
In [None]:
#!/bin/bash
# The line above specifies the interpreter for this script (it must be the first line of the file).

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e):
# (PBS directives must come before the first executable command; later ones are ignored.)
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

# Write timestamps to a per-job file (named by job ID) instead of overwriting the same file:
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/main_function.sif Rscript /data/main_function_bs_1.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-30 10:28:34:
   Job Id:             93872797.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      2.25
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 00:01:48
   Memory Requested:   300.0GB               Memory Used: 8.24GB
   Walltime requested: 10:00:00            Walltime Used: 00:01:15
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R
# Open the output file:
result_main_bs_1 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_bs_1.rds")

# Filter out errors:
result_main_bs_1 <- result_main_bs_1[!sapply(result_main_bs_1, function(x) inherits(x, "simpleError"))]

# Print the number of results, the number of unique results, and the first few elements:
print(length(result_main_bs_1))
print(length(unique(result_main_bs_1)))
print(head(result_main_bs_1))

[1] 141
[1] 141
$ENSG00000008196
$ENSG00000008196$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: 6.9579
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 4.105   
 Residual        4.354   
Number of obs: 15, groups:  snpID, 14
Fixed Effects:
beta_y  
-2.835  

$ENSG00000008196$gwas_res
         snpID   gwas_beta   gwas_se
1   rs10948573 -0.00690805 0.0321628
2   rs13202872  0.02460510 0.0340206
3  rs141707622  0.05997710 0.0901525
4   rs17667036  0.00391041 0.0320274
5    rs2206271  0.05003660 0.0318188
6    rs2477488  0.05336300 0.0297989
7   rs35305915 -0.03369060 0.0446311
8    rs4715208  0.03944380 0.0345040
9     rs571503 -0.00698292 0.0299366
10  rs61660288 -0.06373460 0.0469640
11  rs62406441 -0.02075750 0.0360749
12  rs72901027 -0.00541575 0.0464996
13  rs75397901 -0.06437320 0.1337120
14   rs9367397  0.04018010 0.0627267

$ENSG00000008196$LD
               rs2206

In [None]:
########################################################################
# GADI:

# Obtain a p-value from MR-MtRobin results.
# Uses a resampling procedure to estimate the p-value for an MR-MtRobin object.
#
# Arguments of the original MR_MtRobin_resample():
#   MR_MtRobin_res   list object returned by MR_MtRobin().
#   nsamp            integer, number of samples used to estimate the p-value from a null distribution.
#   use_nonconverge  logical, whether to use samples resulting in non-convergence of the random slope.
#
# Returns a list of two elements:
#   pvalue      numeric, the estimated p-value.
#   nsamp_used  integer, the number of samples used in estimating the p-value.
#               It is returned because not all nsamp samples may be used if use_nonconverge=FALSE
#               (samples are dropped if the model does not converge or results in a singular fit of the random slope).
#
# Example call of the original interface:
#   myRes <- MR_MtRobin(snpID,gwas_betas,gwas_se,eqtl_betas,eqtl_se,eqtl_pvals,LD)
#   MR_MtRobin_resample(MR_MtRobin_res=myRes, nsamp=1e4, use_nonconverge=FALSE)

# Load packages:
library(lme4)
library(mvtnorm)
library(data.table)
library(parallel)

# Load main results:
results_1 <- readRDS("/scratch/sq95/sp6154/p_values/long_covid_main_bs_1.rds")
# Filter out errors:
results_1 <- results_1[!sapply(results_1, function(x) inherits(x, "simpleError"))]
# Export:
saveRDS(results_1, "/scratch/sq95/sp6154/p_values/long_covid_main_bs_1_no_errors.rds")

# Updated Function to calculate the p-value and FDR with parallel processing:
MR_MtRobin_resample <- function(results_1_path, nsamp=1e4, use_nonconverge=FALSE, output_file_path, num_cpus=1){
  # Check and load the results_1 dataset from the specified path:
  if(file.exists(results_1_path)){
    results_1 <- readRDS(results_1_path)
  } else {
    stop("The specified file path for results_1 does not exist.")
  }
  # Initialize a list to store results for each gene:
  pvalues_results <- list()
  # Loop over each gene:
  for(gene_id in names(results_1)) {
    # Check if necessary components are present and not NULL:
    if(is.null(results_1[[gene_id]][["lme_res"]]) || is.null(results_1[[gene_id]][["gwas_res"]]) || is.null(results_1[[gene_id]][["LD"]])) {
      next
    }
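    # Extract the components for this gene:
    lme_res <- results_1[[gene_id]][["lme_res"]]
    gwas_res <- results_1[[gene_id]][["gwas_res"]]
    LD <- results_1[[gene_id]][["LD"]]
    # MR_MtRobin() sorts gwas_res by snpID (setkey), while LD keeps the input SNP order;
    # reorder LD to the gwas_res order so that the SEs and correlations line up:
    if(!is.null(rownames(LD))) LD <- LD[as.character(gwas_res$snpID), as.character(gwas_res$snpID), drop = FALSE]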
    # Extract data from MR-MtRobin results:
    eqtl_betas <- lme_res@frame$beta_x
    snpID <- lme_res@frame$snpID
    weights <- lme_res@frame$weights
    tstat_MR_MtRobin <- summary(lme_res)$coefficients[1,3]
    gwas_se <- gwas_res$gwas_se
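    # Resampled null distribution of GWAS betas, accounting for LD correlations:
    # diag(x, nrow = length(x)) still gives a 1x1 matrix when a gene has a single SNP (plain diag(x) would not).
    se_diag <- diag(gwas_se, nrow = length(gwas_se))
    beta_gwas_nullMat <- mvtnorm::rmvnorm(nsamp, mean=rep(0, length(gwas_se)), sigma=se_diag %*% LD %*% se_diag)
    colnames(beta_gwas_nullMat) <- gwas_res$snpID
    # Fit the null models in parallel and collect their t-statistics: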
    tstat_nulls <- mclapply(1:nsamp, function(i) {
      beta_gwas_null <- beta_gwas_nullMat[i, snpID]
      tryCatch({
        # LMM equation:
        lme_null <- lme4::lmer(eqtl_betas~(beta_gwas_null-1)+(beta_gwas_null-1|snpID), weights=weights)
        return(summary(lme_null)$coefficients[1,3])
      }, error=function(e) {
        if(!use_nonconverge) return(NA)
      })
    }, mc.cores = num_cpus)
    tstat_nulls <- unlist(tstat_nulls)
    tstat_nulls <- tstat_nulls[!is.na(tstat_nulls)] # Remove NAs
    nsamp_used <- length(tstat_nulls)
    # Calculate the p-value and round it to 3 decimal places:
    pval <- round(mean(abs(tstat_nulls) >= abs(tstat_MR_MtRobin)), digits = 3)
    # Store the results for the current gene in the list:
    pvalues_results[[gene_id]] <- list(pvalue=pval, nsamp_used=nsamp_used)
  }
  # Calculate FDR:
  all_pvalues <- sapply(pvalues_results, function(x) x$pvalue)
  fdr_values <- p.adjust(all_pvalues, method = "BH")
  # Combine results with FDR values:
  for(i in seq_along(fdr_values)) {
    pvalues_results[[names(fdr_values)[i]]]$fdr <- fdr_values[i]
  }
  # Prepare results dataframe:
  results_df <- do.call(rbind, lapply(names(pvalues_results), function(gene_id) {
    data.frame(gene_id = gene_id,
               pvalue = sprintf("%.3f", pvalues_results[[gene_id]]$pvalue), # Format pvalue with 3 decimal places
               nsamp_used = pvalues_results[[gene_id]]$nsamp_used,
               fdr = sprintf("%.15f", pvalues_results[[gene_id]]$fdr), # Keep full precision for fdr
               stringsAsFactors = FALSE)
  }))
  # Write the results to the output file, ensuring the formatting is kept
  write.table(results_df, file = output_file_path, sep = "\t", row.names = FALSE, quote = FALSE, col.names = TRUE)
  return(pvalues_results)
}

# Call the previous function:
MR_MtRobin_resample(
  results_1_path = "/scratch/sq95/sp6154/p_values/long_covid_main_bs_1_no_errors.rds",
  nsamp = 10000, # Number of bootstrap samples
  use_nonconverge = FALSE, # Whether to use non-converging results, typically FALSE
  output_file_path = "/scratch/sq95/sp6154/p_values/MR_pvalue_FDR_results_bs1.txt",
  num_cpus = 48 # Number of CPUs
)

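Two pieces of the resampling procedure are easy to illustrate in isolation: the null covariance of the GWAS betas, `diag(se) %*% LD %*% diag(se)`, whose entries are `se_i * LD_ij * se_j`, and the empirical two-sided p-value, which is the share of null t-statistics at least as extreme as the observed one. The sketch below shows both in plain Python. The function names and toy numbers are hypothetical, and failed fits are represented by `None`.

```python
def null_covariance(gwas_se, ld):
    """Covariance of the null GWAS betas: element (i, j) is se_i * LD_ij * se_j."""
    n = len(gwas_se)
    return [[gwas_se[i] * ld[i][j] * gwas_se[j] for j in range(n)] for i in range(n)]


def empirical_pvalue(t_observed, t_nulls):
    """Two-sided empirical p-value; entries equal to None (failed fits) are dropped."""
    used = [t for t in t_nulls if t is not None]
    extreme = sum(1 for t in used if abs(t) >= abs(t_observed))
    return extreme / len(used), len(used)


# Toy example with two SNPs in moderate LD.
cov = null_covariance([0.03, 0.04], [[1.0, 0.5], [0.5, 1.0]])

# Toy null t-statistics; one resample failed and is ignored.
pvalue, nsamp_used = empirical_pvalue(-2.0, [0.5, -2.5, 1.0, None, 2.0])
print(cov, pvalue, nsamp_used)
```

With `nsamp = 10000` and rounding to three decimals, any empirical p-value below 0.0005 is reported as 0.000, which is worth keeping in mind when reading the output table.
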
In [None]:
#!/bin/bash
# The line above specifies the interpreter for this script (it must be the first line of the file).

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e):
# (PBS directives must come before the first executable command; later ones are ignored.)
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

# Write timestamps to a per-job file (named by job ID) instead of overwriting the same file:
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Option 1: Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/p_values/p_values.sif Rscript /data/p_values_bs_1.R

# Option 2: Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly:
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2024-02-08 13:49:52:
   Job Id:             107817285.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      49.06
   NCPUs Requested:    24                     NCPUs Used: 24
                                           CPU Time Used: 08:59:16
   Memory Requested:   200.0GB               Memory Used: 19.22GB
   Walltime requested: 10:00:00            Walltime Used: 00:40:53
   JobFS requested:    100.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R

# Read the results with the p_value and FDR values:
results <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_bs1.txt", header=TRUE, sep="\t")
print(dim(results))
print(head(results))

[1] 141   4
          gene_id pvalue nsamp_used       fdr
1 ENSG00000008196  0.250      10000 0.7663043
2 ENSG00000068781  0.592      10000 0.8100849
3 ENSG00000117148  0.453      10000 0.8025000
4 ENSG00000122965  0.074      10000 0.5301600
5 ENSG00000126266  0.079      10000 0.5301600
6 ENSG00000136696  0.510      10000 0.8025000


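The `fdr` column comes from `p.adjust(..., method = "BH")`. As a rough illustration of the calculation, the sketch below implements the Benjamini-Hochberg adjustment in plain Python: each p-value is multiplied by `m / rank` (rank in ascending order), and a running minimum from the largest p-value downwards keeps the adjusted values monotone. The function name and the toy p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Visit the p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


toy_pvalues = [0.01, 0.04, 0.03, 0.005]
toy_fdr = benjamini_hochberg(toy_pvalues)
print([round(q, 4) for q in toy_fdr])
```

Because the adjustment scales each p-value by the number of tests divided by its rank, moderately small p-values such as 0.074 can still end up with large FDR values when 141 genes are tested.
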
In [None]:
%%R

# Call the analyze_gene_data() function to organize the Mt_Robin results for significant and non-significant genes:

# Create the output paths:
save_paths <- list(
  all_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs1_all_genes_sig_non_sig.txt",
  protein_coding_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs1_protein_coding_genes_sig_not_sig.txt"
)

# Call the function:
analyze_gene_data(
  result_main_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_bs_1.rds",
  eqtl_long_covid_exposure_path = "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt",
  save_files = save_paths
)

Saved all_genes as all_genes to /content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs1_all_genes_sig_non_sig.txt 
Dimensions of all_genes : 141 10 

          gene_id     gene_name      gene_type     beta_y Obs SNPs
1 ENSG00000008196        TFAP2B protein_coding  -2.835254  15   14
2 ENSG00000068781 STON1-GTF2A1L protein_coding  50.950920  29    2
3 ENSG00000117148         ACTL8 protein_coding  -5.827370   5    4
4 ENSG00000122965         RBM19 protein_coding  -3.067708  49   10
5 ENSG00000126266         FFAR1 protein_coding -12.391244  34    5
6 ENSG00000136696         IL36B protein_coding   7.650008   5    3
                                                                                                                                               SNP_IDs
1 rs10948573,rs13202872,rs141707622,rs17667036,rs2206271,rs2477488,rs35305915,rs4715208,rs571503,rs61660288,rs62406441,rs72901027,rs75397901,rs9367397
2                                                                                  

In [None]:
%%R

# BS1 final table:
# Call the function that retrieves additional information from the Mt_Robin results and filters the significant genes:
process_final_data(input_path_1 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_bs_1_no_errors.rds",
                   input_path_2 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs1_protein_coding_genes_sig_not_sig.txt",
                   input_path_3 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_bs1.txt",
                   output_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs1_protein_coding_genes_sig_full_table.txt")

[1] "Causal genes (noncoding genes and protein coding genes): "
[1] 141
$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: 6.9579
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 4.105   
 Residual        4.354   
Number of obs: 15, groups:  snpID, 14
Fixed Effects:
beta_y  
-2.835  

$gwas_res
         snpID   gwas_beta   gwas_se
1   rs10948573 -0.00690805 0.0321628
2   rs13202872  0.02460510 0.0340206
3  rs141707622  0.05997710 0.0901525
4   rs17667036  0.00391041 0.0320274
5    rs2206271  0.05003660 0.0318188
6    rs2477488  0.05336300 0.0297989
7   rs35305915 -0.03369060 0.0446311
8    rs4715208  0.03944380 0.0345040
9     rs571503 -0.00698292 0.0299366
10  rs61660288 -0.06373460 0.0469640
11  rs62406441 -0.02075750 0.0360749
12  rs72901027 -0.00541575 0.0464996
13  rs75397901 -0.06437320 0.1337120
14   rs9367397  0.04018010 0.0627267

$LD
               rs2206271   rs3

**Note:** No significant protein-coding genes were found.

### **GWAS2**

In [None]:
# Step 1: Load the required libraries:
library(matrixStats) # It needs to be installed separately
#library(BiocManager) # It needs to be installed separately
library(snpStats) # It needs to be installed separately
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(reshape2) # Additional package for melt function

# Step 2: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 3: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_2_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 4: Open the genes input data set:
all_genes <- read.table("/data/all_genes_110.txt",
                        header = TRUE,
                        sep = "",
                        stringsAsFactors = FALSE)

# Step 5: Convert the previous list of genes into a vector:
all_genes_vector <- all_genes$gene_id

# Step 6: Save the paths of the GTEx files:
gtex_covid_bed <- "/data/WGS_GTEx_long_covid_rsID.bed"
gtex_covid_bim <- "/data/WGS_GTEx_long_covid_rsID.bim"
gtex_covid_fam <- "/data/WGS_GTEx_long_covid_rsID.fam"

# Step 7: Create a variable for the path to the directory to save the LD matrix results:
ld_matrix_dir <- "/data/ld_matrix_bs_2"

# Step 8: Create the directory if it doesn't exist:
if (!dir.exists(ld_matrix_dir)) {
  dir.create(ld_matrix_dir)
}

# Step 9: Modify the original select_IV_bs() function to accept NAs from the eQTL and calculate the LD matrix inside:
# select_IV_bs() --> backward selection starts with all variables and removes one at a time:
select_IV_bs <- function(geneID,
                         eqtl_data,
                         nTiss,
                         bed_file,
                         bim_file,
                         fam_file,
                         ld_matrix_path,
                         ld_thresh=0.5,
                         pval_thresh=0.001,
                         nTiss_thresh=3){

  ## Subset the eQTL data set for the gene of interest:
  if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
  eqtl_data <- subset(eqtl_data, gene_id == geneID)

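  pval_mat <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  eqtl_data$min_p <- matrixStats::rowMins(pval_mat, na.rm = TRUE) # Mod1: I added na.rm = TRUE
  eqtl_data$nTissThresh_p <- apply(pval_mat, 1, function(x) sort(x)[nTiss_thresh])
  # Criteria 2 below reads nTiss_ltThresh; compute it here if the eQTL file does not already provide it:
  if(!("nTiss_ltThresh" %in% names(eqtl_data))) eqtl_data$nTiss_ltThresh <- rowSums(pval_mat < pval_thresh, na.rm = TRUE)
  eqtl_data <- subset(eqtl_data, nTissThresh_p < pval_thresh)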

  ## Ensure same sign for tissues with p < pval_thresh:
  betas <- as.matrix(subset(eqtl_data,select=paste0("beta_",1:nTiss)))
  pvals <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  pvals_lt001 <- pvals<0.001
  betas_sig <- betas*pvals_lt001
  sig_neg_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x < -1e-15)))) # Mod2: I added is.na()
  sig_pos_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x > 1e-15)))) # Mod3: I added is.na()
  eqtl_data$pos_sig_eQTL <- sig_pos_eQTL
  eqtl_data$neg_sig_eQTL <- sig_neg_eQTL
  eqtl_data <- subset(eqtl_data, !(pos_sig_eQTL==1 & neg_sig_eQTL==1))

  if(nrow(eqtl_data)==0) stop(paste0("No cis-SNPs for gene with p<",pval_thresh," in at least ",nTiss_thresh," tissues."))

  ## Obtain a set of candidate SNPs:
  SNP_pool <- eqtl_data$variant_id
  if(length(SNP_pool)==1){
    return(SNP_pool)
  }

  # Calculate the LD matrix:
  snp_data <- read.plink(bed_file, bim_file, fam_file)
  snp_matrix <- snp_data$genotypes
  numeric_matrix <- as(snp_matrix, "numeric")

  # Restrict LD matrix to SNPs in the data set:
  df_subset <- numeric_matrix[, colnames(numeric_matrix) %in% SNP_pool]
  LD <- cor(df_subset, use="pairwise.complete")
  LD_r2 <- LD^2

  # Export a .txt file of the LD matrix:
  utils::write.table(LD,
                     file = ld_matrix_path,
                     sep = " ",
                     row.names = TRUE,
                     col.names = TRUE,
                     quote = FALSE)

  ## Get pairwise r^2:
  LD_r2_pw <- LD_r2
  LD_r2_pw[lower.tri(LD_r2_pw,diag=T)] <- NA
  LD_r2_pw <- reshape2::melt(LD_r2_pw)
  colnames(LD_r2_pw) <- c("SNP1","SNP2","r2")
  LD_r2_pw$SNP1 <- as.character(LD_r2_pw$SNP1)
  LD_r2_pw$SNP2 <- as.character(LD_r2_pw$SNP2)

  ## Drop SNPs paired with themselves and duplicates (resulting from lower triangle, set to NA):
  LD_r2_pw <- subset(LD_r2_pw, SNP1 != SNP2 & !is.na(r2))

  ## Sort by descending r^2:
  LD_r2_pw <- LD_r2_pw[order(LD_r2_pw$r2,decreasing = T),]

  ## If all candidate SNP pairs have pairwise r^2 >= thresh, return the single SNP with the smallest p-value:
  if(min(LD_r2_pw$r2) >=ld_thresh){
    return(eqtl_data$variant_id[which.min(eqtl_data$min_p)]) # which.min() selects the smallest p-value, as the comment intends
  }

  while(max(LD_r2_pw$r2)>=ld_thresh){

    ## Identify top SNPs in the pool:
    eqtl_data <- subset(eqtl_data, variant_id %in% SNP_pool)
    LD_r2_pw <- subset(LD_r2_pw, SNP1 %in% SNP_pool & SNP2 %in% SNP_pool)
    LD_r2_pw_geThresh <- subset(LD_r2_pw, r2>=ld_thresh)

    ## Identify SNP pair with highest remaining r^2:
    next_pair <- LD_r2_pw[1,]

    ## Criteria 1: Drop the SNP with largest number of correlations > threshold with other SNPs:
    SNP1_nSNP_geThresh <- sum(LD_r2_pw_geThresh$SNP1 == next_pair$SNP1) + sum(LD_r2_pw_geThresh$SNP2 == next_pair$SNP1)
    SNP2_nSNP_geThresh <- sum(LD_r2_pw_geThresh$SNP1 == next_pair$SNP2) + sum(LD_r2_pw_geThresh$SNP2 == next_pair$SNP2)
    if(SNP1_nSNP_geThresh > SNP2_nSNP_geThresh){
      next_SNP_drop <- next_pair$SNP1
    } else if (SNP1_nSNP_geThresh < SNP2_nSNP_geThresh){
      next_SNP_drop <- next_pair$SNP2
    } else{

      ## Criteria 2: Drop the SNP with smaller number of tissues with p<0.001:
      next_pair_eqtl <- subset(eqtl_data, variant_id %in% c(next_pair$SNP1,next_pair$SNP2))
      SNP1_nTiss <- subset(next_pair_eqtl,variant_id==next_pair$SNP1)$nTiss_ltThresh
      SNP2_nTiss <- subset(next_pair_eqtl,variant_id==next_pair$SNP2)$nTiss_ltThresh

      if(SNP1_nTiss > SNP2_nTiss){
        next_SNP_drop <- next_pair$SNP2
      } else if(SNP1_nTiss < SNP2_nTiss){
        next_SNP_drop <- next_pair$SNP1
      } else{

        ## Criteria 3: Drop the SNP with larger minimum p-value:
        next_SNP_drop <- next_pair_eqtl$variant_id[which.max(next_pair_eqtl$min_p)]
      }
    }
    SNP_pool <- SNP_pool[which(SNP_pool != next_SNP_drop)]
  }
  return(SNP_pool)
}

# Step 10: Define the number of cores to use for the parallelism process:
num_cores <- 48

# Step 11: Apply select_IV_bs() to every gene in parallel:
result_list <- mclapply(all_genes_vector, function(geneID) {
  # Generate a unique file path for the LD matrix of this gene
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  tryCatch({
    select_IV_bs(geneID = geneID,
                 eqtl_data = eqtl_covid_exposure,
                 nTiss = 49,
                 bed_file = gtex_covid_bed,
                 bim_file = gtex_covid_bim,
                 fam_file = gtex_covid_fam,
                 ld_matrix_path = ld_matrix_path,
                 ld_thresh = 0.5,
                 pval_thresh = 0.001,
                 nTiss_thresh = 1)
  }, error = function(e) print(e))
}, mc.cores = num_cores)

# Step 12: Name each result by its gene ID:
names(result_list) <- all_genes_vector

# Step 13: Save the result as a .rds file:
saveRDS(result_list, file = "/data/long_covid_IVs_bs_2.rds")

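The backward selection in `select_IV_bs()` repeatedly takes the SNP pair with the highest remaining r² and drops one member until no pair reaches `ld_thresh`. The sketch below is a simplified plain-Python version of that loop. It keeps Criteria 1 (drop the SNP with more partners above the threshold) and Criteria 3 (drop the SNP with the larger minimum p-value) but skips Criteria 2 (number of tissues), so it is not a drop-in replacement. The function name, data layout, and toy values are hypothetical.

```python
def prune_by_ld(snps, r2, min_p, ld_thresh=0.5):
    """Greedy backward selection.

    r2 maps frozenset({snp_a, snp_b}) to the pairwise r^2 (missing pairs count as 0);
    min_p maps each SNP to its minimum eQTL p-value across tissues.
    """
    pool = list(snps)
    while True:
        # All remaining pairs at or above the LD threshold.
        pairs = [(r2[frozenset((a, b))], a, b)
                 for i, a in enumerate(pool) for b in pool[i + 1:]
                 if r2.get(frozenset((a, b)), 0.0) >= ld_thresh]
        if not pairs:
            return pool
        _, snp1, snp2 = max(pairs, key=lambda pair: pair[0])
        # Criteria 1: count high-LD partners of each SNP in the top pair.
        n1 = sum(1 for _, x, y in pairs if snp1 in (x, y))
        n2 = sum(1 for _, x, y in pairs if snp2 in (x, y))
        if n1 != n2:
            drop = snp1 if n1 > n2 else snp2
        else:
            # Criteria 3: drop the SNP with the larger minimum p-value.
            drop = snp1 if min_p[snp1] > min_p[snp2] else snp2
        pool.remove(drop)


toy_r2 = {
    frozenset(("rs1", "rs2")): 0.9,
    frozenset(("rs2", "rs3")): 0.6,
    frozenset(("rs1", "rs3")): 0.2,
}
toy_min_p = {"rs1": 1e-8, "rs2": 1e-5, "rs3": 1e-6, "rs4": 1e-4}
selected = prune_by_ld(["rs1", "rs2", "rs3", "rs4"], toy_r2, toy_min_p)
print(selected)
```

In the toy data `rs2` is in high LD with both `rs1` and `rs3`, so it is removed first, and the remaining SNPs are mutually below the threshold.
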
In [None]:
#!/bin/bash
# The line above specifies the interpreter for this script (it must be the first line of the file).

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=30:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e):
# (PBS directives must come before the first executable command; later ones are ignored.)
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

# Write timestamps to a per-job file (named by job ID) instead of overwriting the same file:
date > test1_out_$PBS_JOBID.txt
sleep 60
date >> test1_out_$PBS_JOBID.txt

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup_bs.sif Rscript /data/select_function_bs_2.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-25 14:10:06:
   Job Id:             93508970.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      874.08
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 277:39:51
   Memory Requested:   500.0GB               Memory Used: 382.3GB
   Walltime requested: 30:00:00            Walltime Used: 06:04:12
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================
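
The `Service Units` value in the log can be reproduced from the other fields. A minimal sketch, assuming a charge rate of 3 SU per CPU-hour for this queue (an assumption inferred from the numbers in this notebook's logs, not taken from the scheduler documentation) and assuming the charge is driven by the CPU request rather than the memory request. The helper names are illustrative:

```r
# Convert an "HH:MM:SS" string to hours (illustrative helper).
hms_to_hours <- function(hms) {
  parts <- as.numeric(strsplit(hms, ":", fixed = TRUE)[[1]])
  parts[1] + parts[2] / 60 + parts[3] / 3600
}

# Assumed charge rate in SU per CPU-hour (inferred from the logs, not documented here).
charge_rate <- 3

# Service units = charge rate x requested CPUs x walltime used (in hours).
service_units <- function(ncpus, walltime, rate = charge_rate) {
  rate * ncpus * hms_to_hours(walltime)
}

su_job <- service_units(ncpus = 48, walltime = "06:04:12")
print(round(su_job, 2))
```

Under these assumptions the result agrees with the 874.08 SU reported for this job; the walltime used (about 6 hours) is well below the 30 hours requested.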

In [None]:
%%R
# Open the output file:
result_selectIV_bs_2 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_IVs_bs_2.rds")

# Filter out errors:
result_selectIV_bs_2 <- result_selectIV_bs_2[!sapply(result_selectIV_bs_2, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of distinct SNP sets, and the first few elements:
print(length(result_selectIV_bs_2))
print(length(unique(result_selectIV_bs_2)))
print(head(result_selectIV_bs_2))

[1] 13809
[1] 13737
$ENSG00000002079
[1] "rs36168039"  "rs151109487" "rs41258334"  "rs28497153"  "rs7457787"  
[6] "rs73157578" 

$ENSG00000004846
 [1] "rs58540449"  "rs79448643"  "rs114103500" "rs111613242" "rs76325633" 
 [6] "rs117337555" "rs73682327"  "rs11981446"  "rs7801178"   "rs17455533" 
[11] "rs2108258"   "rs77254950"  "rs139614792"

$ENSG00000004939
 [1] "rs2880712"   "rs9747287"   "rs117705553" "rs79555595"  "rs78230173" 
 [6] "rs114651584" "rs34637106"  "rs7213794"   "rs144535065" "rs142313238"
[11] "rs76004777"  "rs9944407"   "rs34260468"  "rs142437085" "rs8070867"  
[16] "rs730818"    "rs200108084" "rs1728174"   "rs140638631" "rs4793150"  
[21] "rs76480869"  "rs147752346" "rs62080669"  "rs9895862"   "rs7222401"  
[26] "rs540551010" "rs9890834"   "rs76498976"  "rs138050677"

$ENSG00000005073
 [1] "rs60642981"  "rs2525760"   "rs10273618"  "rs4722617"   "rs138820646"
 [6] "rs117987507" "rs17501111"  "rs10251237"  "rs147214544" "rs6947348"  
[11] "rs71539542"  "rs3735570"   "
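
The filtering step above relies on failed genes being stored as error objects inside the result list. The sketch below reproduces that pattern on a toy list (the gene and SNP names are made up). It also checks for `"try-error"`, the class `mclapply()` uses when a worker fails outside of `tryCatch()`; that extra check is a suggestion, not part of the original cell:

```r
# Toy stand-in for the list returned by mclapply(): two genes succeed, one fails.
toy_results <- list(
  geneA = c("rs1", "rs2"),
  geneB = tryCatch(stop("Gene is missing in eQTL dataset"), error = function(e) e),
  geneC = c("rs1", "rs2")
)

# Flag elements that are error conditions (or try-error objects).
is_error <- vapply(toy_results,
                   function(x) inherits(x, c("error", "try-error")),
                   logical(1))
toy_ok <- toy_results[!is_error]

print(length(toy_ok))          # genes with a usable result
print(length(unique(toy_ok)))  # distinct SNP sets (names are ignored by unique())
print(lengths(toy_ok))         # number of selected SNPs per gene
```

In the toy list, `geneA` and `geneC` carry the same SNP set, so `unique()` keeps only one of them. A gap between the two counts printed above (13809 versus 13737) is therefore consistent with some genes sharing identical SNP sets.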

In [None]:
library(parallel) # R package installed by default

# Step 1: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 2: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_2_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 3: Open the list of genes-SNPs:
result_selectedIVs <- readRDS("/data/long_covid_IVs_bs_2.rds")

# Step 4: Create a variable for the path to the LD matrix directory:
ld_matrix_dir <- "/data/ld_matrix_bs_2"

# Step 5: Setup the Mt_Robin method using the following setup function:
MR_MtRobin_setup <- function(geneID,
                             snpID,
                             eqtl_data,
                             gwas_data,
                             ld_matrix_path,
                             nTiss){
  tryCatch({
    ## Check if the gene is present and subset
    if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
    eqtl_data <- subset(eqtl_data, gene_id == geneID)

    ## Confirm all SNPs are present in the dataset
    if(!all(snpID %in% eqtl_data$variant_id)) stop("Some SNPs missing in eQTL dataset")
    if(!all(snpID %in% gwas_data$variant_id)) stop("Some SNPs missing in GWAS dataset")

    # Read the LD matrix from a file
    LD <- as.matrix(utils::read.table(ld_matrix_path,
                                      header = TRUE,
                                      check.names = FALSE))

    if(!all(snpID %in% colnames(LD))) stop("Some SNPs missing in LD matrix")

    ## Align data
    eqtl_data <- eqtl_data[match(snpID,eqtl_data$variant_id),]
    gwas_data <- gwas_data[match(snpID,gwas_data$variant_id),]
    LD <- LD[match(snpID,rownames(LD)),match(snpID,colnames(LD))]

    eqtl_betas <- as.matrix(eqtl_data[,paste0("beta_",1:nTiss)])
    eqtl_se <- as.matrix(eqtl_data[,paste0("SE_",1:nTiss)])
    eqtl_pvals <- as.matrix(eqtl_data[,paste0("pvalue_",1:nTiss)])
    gwas_betas <- gwas_data$beta
    gwas_se <- gwas_data$SE

    row.names(eqtl_betas) <- row.names(eqtl_se) <- row.names(eqtl_pvals) <- eqtl_data$variant_id

    return(list(snpID=snpID,gwas_betas=gwas_betas,gwas_se=gwas_se,
                eqtl_betas=eqtl_betas,eqtl_se=eqtl_se,eqtl_pvals=eqtl_pvals,LD=LD))
  }, error = function(e) print(e))
}

# Step 6: Apply the previous setup function:

# Step 6.1: Initialize an empty list to store the results:
MR_MtRobin_results <- list()

# Step 6.2: Define the number of cores to use for the parallelism process:
num_cores <- 36

# Step 6.3: Loop over each gene in result_selectedIVs:
MR_MtRobin_results <- mclapply(names(result_selectedIVs), function(geneID) {

  # Step 6.3.1: Get the SNPs associated with each gene:
  snpID <- result_selectedIVs[[geneID]]

  # Step 6.3.2: Generate a unique file path for the LD matrix of this gene:
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  # Step 6.3.3: Call the MR_MtRobin_setup() function with the parameters for each gene:
  return(tryCatch({
    MR_MtRobin_setup(geneID = geneID,
                     snpID = snpID,
                     eqtl_data = eqtl_covid_exposure,
                     gwas_data = gwas_covid_outcome,
                     ld_matrix_path = ld_matrix_path,
                     nTiss = 49)
  }, error = function(e) print(e)))
}, mc.cores = num_cores)

# Step 7: Assign names to the list:
names(MR_MtRobin_results) <- names(result_selectedIVs)

# Step 8: Save the previous result as a .rds file:
saveRDS(MR_MtRobin_results, file = "/data/long_covid_setup_bs_2.rds")

In [None]:
#!/bin/bash
# The shebang above specifies the program used to run the script; it must be the first line of the file.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives must come before the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

# Write a timestamped test file named after the job ID, so that each job creates its own file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup_bs.sif Rscript /data/setup_function_bs_2.R

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-29 17:00:33:
   Job Id:             93819200.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      47.70
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 13:22:08
   Memory Requested:   300.0GB               Memory Used: 75.35GB
   Walltime requested: 10:00:00            Walltime Used: 00:26:30
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R
# Open the output file:
result_setup_bs_2 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_setup_bs_2.rds")

# Filter out errors:
result_setup_bs_2 <- result_setup_bs_2[!sapply(result_setup_bs_2, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of distinct elements, and the first element:
print(length(result_setup_bs_2))
print(length(unique(result_setup_bs_2)))
print(result_setup_bs_2[[1]])

[1] 2565
[1] 2565
$snpID
 [1] "rs2206271"   "rs35305915"  "rs72901027"  "rs9367397"   "rs61660288" 
 [6] "rs141707622" "rs2477488"   "rs13202872"  "rs62406441"  "rs17667036" 
[11] "rs75397901"  "rs10948573"  "rs4715208"   "rs571503"   

$gwas_betas
 [1]  0.03032140 -0.03624490 -0.01994180 -0.00767773 -0.03838880  0.08991210
 [7]  0.03898050  0.01695140  0.01588890 -0.02039250 -0.07066480  0.01248640
[13]  0.03617890  0.01188490

$gwas_se
 [1] 0.0210871 0.0272673 0.0303522 0.0418627 0.0329034 0.0595761 0.0199385
 [8] 0.0228978 0.0242326 0.0213210 0.0829379 0.0214767 0.0236948 0.0200321

$eqtl_betas
              beta_1 beta_2 beta_3 beta_4 beta_5 beta_6 beta_7 beta_8 beta_9
rs2206271   0.160091     NA     NA     NA     NA     NA     NA     NA     NA
rs35305915        NA     NA     NA     NA     NA     NA     NA     NA     NA
rs72901027        NA     NA     NA     NA     NA     NA     NA     NA     NA
rs9367397         NA     NA     NA     NA     NA     NA     NA     NA     NA
rs61660288
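
The setup function aligns the eQTL table, the GWAS table and the LD matrix to the same SNP order with `match()`. A minimal sketch of that alignment on made-up data (all values and SNP names are hypothetical):

```r
# SNPs selected for a gene, in the order the analysis should use.
snpID <- c("rs3", "rs1", "rs2")

# Toy GWAS table in a different order, with one extra SNP.
gwas_toy <- data.frame(variant_id = c("rs1", "rs2", "rs3", "rs4"),
                       beta = c(0.10, 0.20, 0.30, 0.40),
                       stringsAsFactors = FALSE)

# Toy LD matrix with SNP names on both dimensions.
ld_toy <- matrix(c(1.0, 0.2, 0.1,
                   0.2, 1.0, 0.3,
                   0.1, 0.3, 1.0),
                 nrow = 3,
                 dimnames = list(c("rs1", "rs2", "rs3"),
                                 c("rs1", "rs2", "rs3")))

# match() returns, for each SNP in snpID, its row position in the other object.
gwas_aligned <- gwas_toy[match(snpID, gwas_toy$variant_id), ]
ld_aligned <- ld_toy[match(snpID, rownames(ld_toy)),
                     match(snpID, colnames(ld_toy)),
                     drop = FALSE]  # drop = FALSE keeps a matrix even for one SNP

print(gwas_aligned$beta)
print(ld_aligned)
```

After alignment, row `i` of every object refers to `snpID[i]`, which is what allows the betas, standard errors and LD entries to be combined position by position later on.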

In [None]:
# Step 1: Load the required libraries:
library(base) # R package installed by default
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(Matrix) # R package installed by default
library(methods) # R package installed by default
library(stats) # R package installed by default
library(lme4) # It needs to be installed separately
library(data.table) # It needs to be installed separately

# Step 2: Main Mt_Robin function:
MR_MtRobin <- function(MR_MtRobin_input,pval_thresh=0.001){
  snpID <- MR_MtRobin_input$snpID
  gwas_betas <- MR_MtRobin_input$gwas_betas
  gwas_se <- MR_MtRobin_input$gwas_se
  eqtl_betas <- MR_MtRobin_input$eqtl_betas
  eqtl_se <- MR_MtRobin_input$eqtl_se
  eqtl_pvals <- MR_MtRobin_input$eqtl_pvals
  LD <- MR_MtRobin_input$LD

  if(ncol(eqtl_betas) != ncol(eqtl_se) || ncol(eqtl_betas) != ncol(eqtl_pvals)){
    stop("eQTL betas, SE and p-values do not all have same number of columns")
  }

  ## determine number of tissues (or conditions/studies/etc.)
  nT <- ncol(eqtl_betas)

  ## rename columns for standardized processing
  colnames(eqtl_betas) <- paste0("beta_",1:nT)
  colnames(eqtl_se) <- paste0("SE_",1:nT)
  colnames(eqtl_pvals) <- paste0("pvalue_",1:nT)

  ## combine eQTL results into data.table
  eqtl_res <- data.table::data.table(snpID,eqtl_betas,eqtl_se,eqtl_pvals)

  # reshape to long format (each SNP-gene-tissue will be row)
  eQTL_res_melt <- data.table::melt(eqtl_res,id.vars=c("snpID"),
                                    measure.vars=patterns("^beta_","^SE_","^pvalue_"),value.name=c("beta","se","pvalue"),
                                    variable.name="tissue")

  ## subset to strong instrumental variables (based on p-value threshold)
  eQTL_res_melt_PltThresh <- subset(eQTL_res_melt, pvalue < pval_thresh)

  ## set up gwas data for merging
  gwas_res <- data.table::data.table(snpID=snpID,gwas_beta=gwas_betas,gwas_se=gwas_se)

  ## merge eQTL and GWAS results
  data.table::setkey(eQTL_res_melt_PltThresh,snpID)
  data.table::setkey(gwas_res,snpID)
  merged_res <- data.table::merge.data.table(eQTL_res_melt_PltThresh,gwas_res)

  ## set up coefficients for reverse regression
  beta_x <- matrix(merged_res$beta,ncol=1)
  beta_y <- matrix(merged_res$gwas_beta,ncol=1)

  ## standard errors (for weights)
  se_x <- matrix(merged_res$se,ncol=1)
  se_y <- matrix(merged_res$gwas_se,ncol=1) ## not used by function; used in resampling (consider returning with results)

  ## identifiers
  snpID <- merged_res$snpID

  lme_res <- lme4::lmer(beta_x~(beta_y-1)+(beta_y-1|snpID),weights=1/se_x^2)

  ## return results from weighted regression with random slopes (and correlated errors) along with GWAS SE and LD (needed for resampling)
  return(list(lme_res=lme_res,gwas_res=gwas_res,LD=LD))
}

# Step 3: Read the previous file:
result_setup <- readRDS("/data/long_covid_setup_bs_2.rds")

# Step 4: Apply the main Mt_Robin function using parallel processing:
no_cores <- 36

cl <- makeCluster(no_cores) # Create a cluster with the number of cores
clusterExport(cl, "MR_MtRobin") # Export the MR_MtRobin function to all the nodes of the cluster
result_list_MtRobin <- parLapply(cl, result_setup, function(MR_MtRobin_input) {
  tryCatch({
    MR_MtRobin(MR_MtRobin_input = MR_MtRobin_input)
  }, error = function(e) print(e))
})
stopCluster(cl) # Stop the cluster once the computation is done

# Step 5: The names of the results will correspond to the gene ids:
names(result_list_MtRobin) <- names(result_setup)

# Step 6: Save the previous result as a .rds file:
saveRDS(result_list_MtRobin, file = "/data/long_covid_main_bs_2.rds")

In [None]:
#!/bin/bash
# The shebang above specifies the program used to run the script; it must be the first line of the file.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives must come before the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

# Write a timestamped test file named after the job ID, so that each job creates its own file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/main_function.sif Rscript /data/main_function_bs_2.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-30 10:28:34:
   Job Id:             93872798.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      2.25
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 00:01:54
   Memory Requested:   300.0GB               Memory Used: 8.17GB
   Walltime requested: 10:00:00            Walltime Used: 00:01:15
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================
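
The resource logs also show how well each job used the cores it requested. A rough sketch, using the values from the three job logs above and defining CPU efficiency as CPU time divided by (requested CPUs x walltime used); the helper names are illustrative:

```r
# Convert an "HH:MM:SS" string to seconds (hours may exceed 24).
hms_to_seconds <- function(hms) {
  parts <- as.numeric(strsplit(hms, ":", fixed = TRUE)[[1]])
  parts[1] * 3600 + parts[2] * 60 + parts[3]
}

# Fraction of the available core-time that was actually spent computing.
cpu_efficiency <- function(cpu_time, walltime, ncpus) {
  hms_to_seconds(cpu_time) / (ncpus * hms_to_seconds(walltime))
}

eff_select <- cpu_efficiency("277:39:51", "06:04:12", 48)  # IV selection job
eff_setup  <- cpu_efficiency("13:22:08",  "00:26:30", 36)  # setup job
eff_main   <- cpu_efficiency("00:01:54",  "00:01:15", 36)  # main MR-MtRobin job

print(round(100 * c(select = eff_select, setup = eff_setup, main = eff_main), 1))
```

By this measure the selection and setup jobs kept most cores busy, while the main job used only a small fraction of the core-time it reserved. Since it finished in about a minute and used roughly 8 GB, a much smaller CPU and memory request would probably be sufficient for that step.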

In [None]:
%%R
# Open the output file:
result_main_bs_2 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_bs_2.rds")

# Filter out errors:
result_main_bs_2 <- result_main_bs_2[!sapply(result_main_bs_2, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of distinct elements, and the first few elements:
print(length(result_main_bs_2))
print(length(unique(result_main_bs_2)))
print(head(result_main_bs_2))

[1] 136
[1] 136
$ENSG00000008196
$ENSG00000008196$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: 7.0839
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 3.142   
 Residual        4.646   
Number of obs: 15, groups:  snpID, 14
Fixed Effects:
beta_y  
-3.156  

$ENSG00000008196$gwas_res
         snpID   gwas_beta   gwas_se
1   rs10948573  0.01248640 0.0214767
2   rs13202872  0.01695140 0.0228978
3  rs141707622  0.08991210 0.0595761
4   rs17667036 -0.02039250 0.0213210
5    rs2206271  0.03032140 0.0210871
6    rs2477488  0.03898050 0.0199385
7   rs35305915 -0.03624490 0.0272673
8    rs4715208  0.03617890 0.0236948
9     rs571503  0.01188490 0.0200321
10  rs61660288 -0.03838880 0.0329034
11  rs62406441  0.01588890 0.0242326
12  rs72901027 -0.01994180 0.0303522
13  rs75397901 -0.07066480 0.0829379
14   rs9367397 -0.00767773 0.0418627

$ENSG00000008196$LD
               rs2206
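
The `beta_y` value under `Fixed Effects` is the slope from regressing the eQTL effects (`beta_x`) on the GWAS effects (`beta_y`) with no intercept and weights `1/se_x^2`. The sketch below shows only that fixed-effect part with base R `lm()` on made-up numbers. It deliberately leaves out the SNP-level random slope that `lme4::lmer()` fits in `MR_MtRobin()`, so it is a simplified illustration and will not reproduce the estimates above:

```r
# Made-up summary statistics, for illustration only.
beta_y <- c(0.030, -0.036, -0.020, 0.090, 0.039)  # GWAS effects
beta_x <- c(0.16, -0.21, -0.09, 0.40, 0.22)       # eQTL effects
se_x   <- c(0.04, 0.05, 0.06, 0.10, 0.05)         # eQTL standard errors

# Inverse-variance weights, as in the lmer() call.
w <- 1 / se_x^2

# Weighted regression through the origin (fixed effect only, no random slope).
fit <- lm(beta_x ~ beta_y - 1, weights = w)

# Closed form of the same slope: sum(w * x * y) / sum(w * y^2).
slope_closed_form <- sum(w * beta_x * beta_y) / sum(w * beta_y^2)

print(coef(fit))
print(slope_closed_form)
```

The two printed values agree, which makes the role of the weights explicit: SNP-tissue pairs with small eQTL standard errors dominate the slope.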

In [None]:
########################################################################
# GADI:

# Obtain p-values from the MR-MtRobin results.
# Uses a resampling procedure to estimate the p-value for each MR-MtRobin result.
# Arguments of the updated function defined below:
#   results_1_path    path to an .rds file holding the named list of objects returned by MR_MtRobin().
#   nsamp             integer; number of samples used to estimate the p-value from a null distribution.
#   use_nonconverge   logical; whether to use samples resulting in non-convergence of the random slope.
#   output_file_path  path of the tab-separated file where the p-values and FDR values are written.
#   num_cpus          integer; number of cores passed to mclapply().
# Returns a named list with one entry per gene, each containing:
#   pvalue      numeric; the estimated p-value.
#   nsamp_used  integer; the number of samples used in estimating the p-value.
#   fdr         numeric; the Benjamini-Hochberg adjusted p-value.
# nsamp_used is returned because not all nsamp samples may be used.
# In this version, a sample is dropped only when lmer() raises an error for it.
# Usage with the original single-result signature:
# myRes <- MR_MtRobin(snpID,gwas_betas,gwas_se,eqtl_betas,eqtl_se,eqtl_pvals,LD)
# MR_MtRobin_resample(MR_MtRobin_res=myRes, nsamp=1e4, use_nonconverge=FALSE)

# Load packages:
library(lme4)
library(mvtnorm)
library(data.table)
library(parallel)

# Load main results:
results_1 <- readRDS("/scratch/sq95/sp6154/p_values/long_covid_main_bs_2.rds")
# Filter out errors:
results_1 <- results_1[!sapply(results_1, function(x) inherits(x, "simpleError"))]
# Export:
saveRDS(results_1, "/scratch/sq95/sp6154/p_values/long_covid_main_bs_2_no_errors.rds")

# Updated Function to calculate the p-value and FDR with parallel processing:
MR_MtRobin_resample <- function(results_1_path, nsamp=1e4, use_nonconverge=FALSE, output_file_path, num_cpus=1){
  # Check and load the results_1 dataset from the specified path:
  if(file.exists(results_1_path)){
    results_1 <- readRDS(results_1_path)
  } else {
    stop("The specified file path for results_1 does not exist.")
  }
  # Initialize a list to store results for each gene:
  pvalues_results <- list()
  # Loop over each gene:
  for(gene_id in names(results_1)) {
    # Check if necessary components are present and not NULL:
    if(is.null(results_1[[gene_id]][["lme_res"]]) || is.null(results_1[[gene_id]][["gwas_res"]]) || is.null(results_1[[gene_id]][["LD"]])) {
      next
    }
    # Extract the components stored for this gene:
    lme_res <- results_1[[gene_id]][["lme_res"]]
    gwas_res <- results_1[[gene_id]][["gwas_res"]]
    LD <- results_1[[gene_id]][["LD"]]
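    # Extract data from MR-MtRobin results:
    eqtl_betas <- lme_res@frame$beta_x
    # Use character IDs so that columns of the null matrix are selected by name, not by factor code:
    snpID <- as.character(lme_res@frame$snpID)
    # The model frame stores the weights in a column named "(weights)"; `$weights` would return NULL:
    weights <- lme_res@frame[["(weights)"]]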
    tstat_MR_MtRobin <- summary(lme_res)$coefficients[1,3]
    gwas_se <- gwas_res$gwas_se
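    # Resampled null distribution for the GWAS effects, accounting for LD correlations.
    # gwas_res was sorted by snpID (setkey) in MR_MtRobin(), so LD is reordered to match it when it carries SNP names:
    snp_order <- as.character(gwas_res$snpID)
    if(all(snp_order %in% rownames(LD)) && all(snp_order %in% colnames(LD))) LD <- LD[snp_order, snp_order, drop=FALSE]
    # diag(x, nrow=) also builds a 1 x 1 matrix when there is a single SNP:
    se_diag <- diag(gwas_se, nrow=length(gwas_se))
    beta_gwas_nullMat <- mvtnorm::rmvnorm(nsamp, mean=rep(0, length(gwas_se)), sigma=se_diag %*% LD %*% se_diag)
    colnames(beta_gwas_nullMat) <- gwas_res$snpID
    # Refit the model on each null sample and collect the t-statistic of the fixed effect: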
    tstat_nulls <- mclapply(1:nsamp, function(i) {
      beta_gwas_null <- beta_gwas_nullMat[i, snpID]
      tryCatch({
        # LMM equation:
        lme_null <- lme4::lmer(eqtl_betas~(beta_gwas_null-1)+(beta_gwas_null-1|snpID), weights=weights)
        return(summary(lme_null)$coefficients[1,3])
      }, error=function(e) {
        if(!use_nonconverge) return(NA)
      })
    }, mc.cores = num_cpus)
    tstat_nulls <- unlist(tstat_nulls)
    tstat_nulls <- tstat_nulls[!is.na(tstat_nulls)] # Remove NAs
    nsamp_used <- length(tstat_nulls)
    # Calculate the p-value and round it to 3 decimal places:
    pval <- round(mean(abs(tstat_nulls) >= abs(tstat_MR_MtRobin)), digits = 3)
    # Store the results for the current gene in the list:
    pvalues_results[[gene_id]] <- list(pvalue=pval, nsamp_used=nsamp_used)
  }
  # Calculate FDR:
  all_pvalues <- sapply(pvalues_results, function(x) x$pvalue)
  fdr_values <- p.adjust(all_pvalues, method = "BH")
  # Combine results with FDR values:
  for(i in seq_along(fdr_values)) {
    pvalues_results[[names(fdr_values)[i]]]$fdr <- fdr_values[i]
  }
  # Prepare results dataframe:
  results_df <- do.call(rbind, lapply(names(pvalues_results), function(gene_id) {
    data.frame(gene_id = gene_id,
               pvalue = sprintf("%.3f", pvalues_results[[gene_id]]$pvalue), # Format pvalue with 3 decimal places
               nsamp_used = pvalues_results[[gene_id]]$nsamp_used,
               fdr = sprintf("%.15f", pvalues_results[[gene_id]]$fdr), # Keep full precision for fdr
               stringsAsFactors = FALSE)
  }))
  # Write the results to the output file, ensuring the formatting is kept
  write.table(results_df, file = output_file_path, sep = "\t", row.names = FALSE, quote = FALSE, col.names = TRUE)
  return(pvalues_results)
}

# Call the previous function:
MR_MtRobin_resample(
  results_1_path = "/scratch/sq95/sp6154/p_values/long_covid_main_bs_2_no_errors.rds",
  nsamp = 10000, # Number of bootstrap samples
  use_nonconverge = FALSE, # Whether to use non-converging results, typically FALSE
  output_file_path = "/scratch/sq95/sp6154/p_values/MR_pvalue_FDR_results_bs2.txt",
  num_cpus = 48 # Number of CPUs
)

In [None]:
#!/bin/bash
# The shebang above specifies the program used to run the script; it must be the first line of the file.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Request the system to enter and save the job in the working directory:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS directives must come before the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory from which the job was submitted:
cd "$PBS_O_WORKDIR"

# Write a timestamped test file named after the job ID, so that each job creates its own file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Option 1: Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/p_values/p_values.sif Rscript /data/p_values_bs_2.R

# Option 2: Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly:
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2024-02-08 17:00:12:
   Job Id:             107828865.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      32.26
   NCPUs Requested:    24                     NCPUs Used: 24
                                           CPU Time Used: 09:35:47
   Memory Requested:   200.0GB               Memory Used: 19.28GB
   Walltime requested: 10:00:00            Walltime Used: 00:26:53
   JobFS requested:    100.0GB                JobFS used: 0B
======================================================================================
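
The resampling script draws null GWAS effects from a multivariate normal whose covariance is `diag(gwas_se) %*% LD %*% diag(gwas_se)`. The sketch below builds that covariance on made-up values and draws a few samples with a Cholesky factor in base R. The Cholesky step is a stand-in for `mvtnorm::rmvnorm()` so that the example has no package dependencies:

```r
# Made-up GWAS standard errors and LD (correlation) matrix for three SNPs.
gwas_se <- c(0.02, 0.03, 0.05)
LD <- matrix(c(1.0, 0.4, 0.1,
               0.4, 1.0, 0.2,
               0.1, 0.2, 1.0), nrow = 3)

# Passing nrow makes diag() safe for a single SNP:
# diag(0.02) alone would return a 0 x 0 matrix.
D <- diag(gwas_se, nrow = length(gwas_se))
sigma <- D %*% LD %*% D

# Equivalent element-wise form: sigma[i, j] = se_i * se_j * r_ij.
sigma_alt <- outer(gwas_se, gwas_se) * LD

# Draw null effects with covariance sigma: if z ~ N(0, I) and R'R = sigma, then z %*% R ~ N(0, sigma).
set.seed(1)
nsamp <- 5
z <- matrix(rnorm(nsamp * length(gwas_se)), nrow = nsamp)
beta_null <- z %*% chol(sigma)

print(all.equal(sigma, sigma_alt))
print(dim(beta_null))
```

The diagonal of `sigma` equals `gwas_se^2`, so each simulated effect has the variance implied by its standard error, and the off-diagonal terms carry the LD between SNPs. This only works if `gwas_se` and `LD` list the SNPs in the same order.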

In [None]:
%%R

# Read the results with the p_value and FDR values:
results <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_bs2.txt", header=TRUE, sep="\t")
print(dim(results))
print(head(results))

[1] 136   4
          gene_id pvalue nsamp_used       fdr
1 ENSG00000008196  0.192      10000 0.6866341
2 ENSG00000068781  0.638      10000 0.9640889
3 ENSG00000112494  0.751      10000 0.9825047
4 ENSG00000117148  0.164      10000 0.6760000
5 ENSG00000122965  0.144      10000 0.6712258
6 ENSG00000126266  0.563      10000 0.9380909
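
The `pvalue` column is an empirical two-sided p-value (the share of null t-statistics at least as extreme as the observed one), and the `fdr` column is its Benjamini-Hochberg adjustment across genes. A small sketch on made-up values. The standard normal draws stand in for the null t-statistics and are an assumption made only for illustration:

```r
# Empirical two-sided p-value from a null distribution of t-statistics.
set.seed(42)
t_obs  <- -1.8            # hypothetical observed t-statistic
t_null <- rnorm(10000)    # stand-in for the resampled null t-statistics
p_emp  <- mean(abs(t_null) >= abs(t_obs))

# Benjamini-Hochberg by hand on five hypothetical p-values.
p_vals <- c(0.001, 0.020, 0.030, 0.400, 0.900)
n   <- length(p_vals)
ord <- order(p_vals)
# Scale each sorted p-value by n / rank, then enforce monotonicity from the largest down.
bh_sorted <- rev(cummin(rev(p_vals[ord] * n / seq_len(n))))
bh_manual <- pmin(1, bh_sorted)[order(ord)]

print(p_emp)
print(bh_manual)
print(p.adjust(p_vals, method = "BH"))
```

The manual calculation matches `p.adjust(..., method = "BH")`. With 10,000 resamples the smallest non-zero empirical p-value is 1e-4, which becomes 0.000 once rounded to three decimals, so very small p-values cannot be told apart in the rounded output.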


In [None]:
%%R

# Call the analyze_gene_data() function to organize the Mt_Robin results for significant and non-significant genes:

# Create the output paths:
save_paths <- list(
  all_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs2_all_genes_sig_non_sig.txt",
  protein_coding_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs2_protein_coding_genes_sig_not_sig.txt"
)

# Call the function:
analyze_gene_data(
  result_main_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_bs_2.rds",
  eqtl_long_covid_exposure_path = "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt",
  save_files = save_paths
)

Saved all_genes as all_genes to /content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs2_all_genes_sig_non_sig.txt 
Dimensions of all_genes : 136 10 

          gene_id     gene_name      gene_type     beta_y Obs SNPs
1 ENSG00000008196        TFAP2B protein_coding  -3.156101  15   14
2 ENSG00000068781 STON1-GTF2A1L protein_coding  42.502211  29    2
3 ENSG00000112494        UNC93A protein_coding -10.701137   9    8
4 ENSG00000117148         ACTL8 protein_coding -13.389779   5    4
5 ENSG00000122965         RBM19 protein_coding  -4.920307  49   10
6 ENSG00000126266         FFAR1 protein_coding -14.933544  34    5
                                                                                                                                               SNP_IDs
1 rs10948573,rs13202872,rs141707622,rs17667036,rs2206271,rs2477488,rs35305915,rs4715208,rs571503,rs61660288,rs62406441,rs72901027,rs75397901,rs9367397
2                                                                                  

In [None]:
%%R

# BS2 final table:
# Call the function that retrieves more information from the Mt_Robin results and filters the significant genes:
process_final_data(input_path_1 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_bs_2_no_errors.rds",
                   input_path_2 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs2_protein_coding_genes_sig_not_sig.txt",
                   input_path_3 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_bs2.txt",
                   output_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs2_protein_coding_genes_sig_full_table.txt")

[1] "Causal genes (noncoding genes and protein coding genes): "
[1] 136
$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: 7.0839
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 3.142   
 Residual        4.646   
Number of obs: 15, groups:  snpID, 14
Fixed Effects:
beta_y  
-3.156  

$gwas_res
         snpID   gwas_beta   gwas_se
1   rs10948573  0.01248640 0.0214767
2   rs13202872  0.01695140 0.0228978
3  rs141707622  0.08991210 0.0595761
4   rs17667036 -0.02039250 0.0213210
5    rs2206271  0.03032140 0.0210871
6    rs2477488  0.03898050 0.0199385
7   rs35305915 -0.03624490 0.0272673
8    rs4715208  0.03617890 0.0236948
9     rs571503  0.01188490 0.0200321
10  rs61660288 -0.03838880 0.0329034
11  rs62406441  0.01588890 0.0242326
12  rs72901027 -0.01994180 0.0303522
13  rs75397901 -0.07066480 0.0829379
14   rs9367397 -0.00767773 0.0418627

$LD
               rs2206271   rs3

**Note:** No protein-coding genes were significant in this analysis.

### **GWAS3**

In [None]:
# Step 1: Load the required libraries:
library(matrixStats) # It needs to be installed separately
#library(BiocManager) # It needs to be installed separately
library(snpStats) # It needs to be installed separately
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(reshape2) # Additional package for melt function

# Step 2: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 3: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_3_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 4: Open the genes input data set:
all_genes <- read.table("/data/all_genes_110.txt",
                        header = TRUE,
                        sep = "",
                        stringsAsFactors = FALSE)

# Step 5: Convert the previous list of genes into a vector:
all_genes_vector <- all_genes$gene_id

# Step 6: Save the paths of the GTEx files:
gtex_covid_bed <- "/data/WGS_GTEx_long_covid_rsID.bed"
gtex_covid_bim <- "/data/WGS_GTEx_long_covid_rsID.bim"
gtex_covid_fam <- "/data/WGS_GTEx_long_covid_rsID.fam"

# Step 7: Create a variable for the path to the directory to save the LD matrix results:
ld_matrix_dir <- "/data/ld_matrix_bs_3"

# Step 8: Create the directory if it doesn't exist:
if (!dir.exists(ld_matrix_dir)) {
  dir.create(ld_matrix_dir)
}

# Step 9: Modify the original select_IV_bs() function to accept NAs from the eQTL and calculate the LD matrix inside:
# select_IV_bs() --> backward selection starts with all variables and removes one at a time:
select_IV_bs <- function(geneID,
                         eqtl_data,
                         nTiss,
                         bed_file,
                         bim_file,
                         fam_file,
                         ld_matrix_path,
                         ld_thresh=0.5,
                         pval_thresh=0.001,
                         nTiss_thresh=3){

  ## Subset the eQTL data set for the gene of interest:
  if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
  eqtl_data <- subset(eqtl_data, gene_id == geneID)

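  pval_mat <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  eqtl_data$min_p <- matrixStats::rowMins(pval_mat, na.rm = TRUE) # Mod1: I added na.rm = TRUE
  eqtl_data$nTissThresh_p <- apply(pval_mat, 1, function(x) sort(x)[nTiss_thresh])
  # Number of tissues with p < pval_thresh; needed by the Criteria 2 tie-break below, which reads nTiss_ltThresh:
  eqtl_data$nTiss_ltThresh <- rowSums(pval_mat < pval_thresh, na.rm = TRUE)
  eqtl_data <- subset(eqtl_data, nTissThresh_p < pval_thresh)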

  ## Ensure same sign for tissues with p < pval_thresh:
  betas <- as.matrix(subset(eqtl_data,select=paste0("beta_",1:nTiss)))
  pvals <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  pvals_lt001 <- pvals < pval_thresh # Use the function argument instead of a hard-coded 0.001
  betas_sig <- betas*pvals_lt001
  sig_neg_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x < -1e-15)))) # Mod2: I added is.na()
  sig_pos_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x > 1e-15)))) # Mod3: I added is.na()
  eqtl_data$pos_sig_eQTL <- sig_pos_eQTL
  eqtl_data$neg_sig_eQTL <- sig_neg_eQTL
  eqtl_data <- subset(eqtl_data, !(pos_sig_eQTL==1 & neg_sig_eQTL==1))

  if(nrow(eqtl_data)==0) stop(paste0("No cis-SNPs for gene with p<",pval_thresh," in at least ",nTiss_thresh," tissues."))

  ## Obtain a set of candidate SNPs:
  SNP_pool <- eqtl_data$variant_id
  if(length(SNP_pool)==1){
    return(SNP_pool)
  }

  # Calculate the LD matrix:
  snp_data <- read.plink(bed_file, bim_file, fam_file)
  snp_matrix <- snp_data$genotypes
  numeric_matrix <- as(snp_matrix, "numeric")

  # Restrict LD matrix to SNPs in the data set:
  df_subset <- numeric_matrix[, colnames(numeric_matrix) %in% SNP_pool]
  LD <- cor(df_subset, use="pairwise.complete")
  LD_r2 <- LD^2

  # Export a .txt file of the LD matrix:
  utils::write.table(LD,
                     file = ld_matrix_path,
                     sep = " ",
                     row.names = TRUE,
                     col.names = TRUE,
                     quote = FALSE)

  ## Get pairwise r^2:
  LD_r2_pw <- LD_r2
  LD_r2_pw[lower.tri(LD_r2_pw,diag=T)] <- NA
  LD_r2_pw <- reshape2::melt(LD_r2_pw)
  colnames(LD_r2_pw) <- c("SNP1","SNP2","r2")
  LD_r2_pw$SNP1 <- as.character(LD_r2_pw$SNP1)
  LD_r2_pw$SNP2 <- as.character(LD_r2_pw$SNP2)

  ## Drop SNPs paired with themselves and duplicates (resulting from lower triangle, set to NA):
  LD_r2_pw <- subset(LD_r2_pw, SNP1 != SNP2 & !is.na(r2))

  ## Sort by descending r^2:
  LD_r2_pw <- LD_r2_pw[order(LD_r2_pw$r2,decreasing = T),]

  ## If all candidate SNP pairs have pairwise r^2 >= threshold, return the single SNP with the smallest p-value:
  if(min(LD_r2_pw$r2) >=ld_thresh){
    return(eqtl_data$variant_id[which.min(eqtl_data$min_p)])
  }

  while(max(LD_r2_pw$r2)>=ld_thresh){

    ## Identify top SNPs in the pool:
    eqtl_data <- subset(eqtl_data, variant_id %in% SNP_pool)
    LD_r2_pw <- subset(LD_r2_pw, SNP1 %in% SNP_pool & SNP2 %in% SNP_pool)
    LD_r2_pw_geThresh <- subset(LD_r2_pw, r2>=ld_thresh)

    ## Identify SNP pair with highest remaining r^2:
    next_pair <- LD_r2_pw[1,]

    ## Criterion 1: Drop the SNP with the largest number of correlations >= threshold with other SNPs:
    SNP1_nSNP_geThresh <- sum(LD_r2_pw_geThresh$SNP1 == next_pair$SNP1) + sum(LD_r2_pw_geThresh$SNP2 == next_pair$SNP1)
    SNP2_nSNP_geThresh <- sum(LD_r2_pw_geThresh$SNP1 == next_pair$SNP2) + sum(LD_r2_pw_geThresh$SNP2 == next_pair$SNP2)
    if(SNP1_nSNP_geThresh > SNP2_nSNP_geThresh){
      next_SNP_drop <- next_pair$SNP1
    } else if (SNP1_nSNP_geThresh < SNP2_nSNP_geThresh){
      next_SNP_drop <- next_pair$SNP2
    } else{

      ## Criterion 2: Drop the SNP with the smaller number of tissues with p < threshold:
      next_pair_eqtl <- subset(eqtl_data, variant_id %in% c(next_pair$SNP1,next_pair$SNP2))
      SNP1_nTiss <- subset(next_pair_eqtl,variant_id==next_pair$SNP1)$nTiss_ltThresh
      SNP2_nTiss <- subset(next_pair_eqtl,variant_id==next_pair$SNP2)$nTiss_ltThresh

      if(SNP1_nTiss > SNP2_nTiss){
        next_SNP_drop <- next_pair$SNP2
      } else if(SNP1_nTiss < SNP2_nTiss){
        next_SNP_drop <- next_pair$SNP1
      } else{

        ## Criterion 3: Drop the SNP with the larger minimum p-value:
        next_SNP_drop <- next_pair_eqtl$variant_id[which.max(next_pair_eqtl$min_p)]
      }
    }
    SNP_pool <- SNP_pool[which(SNP_pool != next_SNP_drop)]
  }
  return(SNP_pool)
}

# Step 10: Define the number of cores to use for the parallelism process:
num_cores <- 48

# Step 11: Run select_IV_bs() for every gene in parallel (errors are caught and returned per gene):
result_list <- mclapply(all_genes_vector, function(geneID) {
  # Generate a unique file path for the LD matrix of this gene
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  tryCatch({
    select_IV_bs(geneID = geneID,
                 eqtl_data = eqtl_covid_exposure,
                 nTiss = 49,
                 bed_file = gtex_covid_bed,
                 bim_file = gtex_covid_bim,
                 fam_file = gtex_covid_fam,
                 ld_matrix_path = ld_matrix_path,
                 ld_thresh = 0.5,
                 pval_thresh = 0.001,
                 nTiss_thresh = 1)
  }, error = function(e) print(e))
}, mc.cores = num_cores)

# Step 12: Name each result after its gene ID:
names(result_list) <- all_genes_vector

# Step 13: Save the selected IVs as a .rds file:
saveRDS(result_list, file = "/data/long_covid_IVs_bs_3.rds")
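
The backward selection performed by `select_IV_bs()` can be illustrated on a toy example. The sketch below is a simplified stand-in, not the function itself: the three SNPs (`snpA`, `snpB`, `snpC`), their genotypes and their minimum p-values are made up, and the helper `prune_ld()` is hypothetical. It applies only the p-value tie-breaker (Criterion 3) to the most correlated pair at each step.

```r
# Toy LD pruning sketch (hypothetical data; simplified relative to select_IV_bs()).
geno <- cbind(
  snpA = c(0, 0, 1, 1, 2, 2, 0, 1, 2, 0),
  snpB = c(0, 0, 1, 1, 2, 2, 0, 1, 2, 1),  # differs from snpA in one sample: high LD
  snpC = c(2, 0, 1, 0, 2, 1, 1, 0, 0, 2)   # weakly correlated with both
)
min_p <- c(snpA = 1e-8, snpB = 1e-4, snpC = 1e-5)  # hypothetical minimum eQTL p-values

prune_ld <- function(geno, min_p, ld_thresh = 0.5) {
  # Pairwise r^2 between SNPs; the diagonal is zeroed so self-pairs are ignored.
  r2 <- cor(geno, use = "pairwise.complete.obs")^2
  diag(r2) <- 0
  keep <- colnames(geno)
  while (length(keep) > 1 && max(r2[keep, keep]) >= ld_thresh) {
    sub <- r2[keep, keep, drop = FALSE]
    # Locate the pair with the highest remaining r^2.
    idx <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    pair <- keep[idx]
    # Drop the member of the pair with the larger (weaker) minimum p-value.
    drop_snp <- pair[which.max(min_p[pair])]
    keep <- setdiff(keep, drop_snp)
  }
  keep
}

kept <- prune_ld(geno, min_p)
print(kept)
```

In this toy data only `snpA` and `snpB` exceed the r^2 threshold of 0.5, so one of them is removed and `snpC` is left untouched. The real function additionally compares the number of high-LD partners and the number of tissues passing the p-value threshold before falling back to the minimum p-value.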

In [None]:
#!/bin/bash
# The line above specifies the program used to run the script; it must be the first line of the file to take effect.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=30:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS ignores directives placed after the first executable command, so these come before any command:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write timestamps to a per-job file (named by job ID) instead of overwriting the same file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

# Change to the directory where the job was submitted:
cd "$PBS_O_WORKDIR"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup_bs.sif Rscript /data/select_function_bs_3.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-25 20:12:47:
   Job Id:             93508988.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      863.04
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 276:35:18
   Memory Requested:   500.0GB               Memory Used: 389.23GB
   Walltime requested: 30:00:00            Walltime Used: 05:59:36
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R
# Open the output file:
result_selectIV_bs_3 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_IVs_bs_3.rds")

# Filter out errors:
result_selectIV_bs_3 <- result_selectIV_bs_3[!sapply(result_selectIV_bs_3, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of unique IV sets, and the first few elements:
print(length(result_selectIV_bs_3))
print(length(unique(result_selectIV_bs_3)))
print(head(result_selectIV_bs_3))

[1] 13690
[1] 13619
$ENSG00000002079
[1] "rs36168039"  "rs151109487" "rs41258334"  "rs28497153"  "rs7457787"  
[6] "rs73157578" 

$ENSG00000004846
 [1] "rs58540449"  "rs79448643"  "rs114103500" "rs111613242" "rs76325633" 
 [6] "rs117337555" "rs73682327"  "rs11981446"  "rs7801178"   "rs17455533" 
[11] "rs2108258"   "rs77254950"  "rs139614792"

$ENSG00000004939
 [1] "rs2880712"   "rs9747287"   "rs117705553" "rs79555595"  "rs78230173" 
 [6] "rs114651584" "rs34637106"  "rs7213794"   "rs144535065" "rs142313238"
[11] "rs76004777"  "rs9944407"   "rs34260468"  "rs142437085" "rs8070867"  
[16] "rs730818"    "rs200108084" "rs1728174"   "rs140638631" "rs4793150"  
[21] "rs76480869"  "rs147752346" "rs62080669"  "rs9895862"   "rs7222401"  
[26] "rs540551010" "rs9890834"   "rs76498976"  "rs138050677"

$ENSG00000005073
 [1] "rs60642981"  "rs2525760"   "rs10273618"  "rs4722617"   "rs138820646"
 [6] "rs117987507" "rs17501111"  "rs10251237"  "rs147214544" "rs6947348"  
[11] "rs71539542"  "rs3735570"   "

In [None]:
library(parallel) # R package installed by default

# Step 1: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 2: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_3_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 3: Open the list of genes-SNPs:
result_selectedIVs <- readRDS("/data/long_covid_IVs_bs_3.rds")

# Step 4: Create a variable for the path to the LD matrix directory:
ld_matrix_dir <- "/data/ld_matrix_bs_3"

# Step 5: Setup the Mt_Robin method using the following setup function:
MR_MtRobin_setup <- function(geneID,
                             snpID,
                             eqtl_data,
                             gwas_data,
                             ld_matrix_path,
                             nTiss){
  tryCatch({
    ## Check if the gene is present and subset
    if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
    eqtl_data <- subset(eqtl_data, gene_id == geneID)

    ## Confirm all SNPs are present in the dataset
    if(!all(snpID %in% eqtl_data$variant_id)) stop("Some SNPs missing in eQTL dataset")
    if(!all(snpID %in% gwas_data$variant_id)) stop("Some SNPs missing in GWAS dataset")

    # Read the LD matrix from a file
    LD <- as.matrix(utils::read.table(ld_matrix_path,
                                      header = TRUE,
                                      check.names = FALSE))

    if(!all(snpID %in% colnames(LD))) stop("Some SNPs missing in LD matrix")

    ## Align data
    eqtl_data <- eqtl_data[match(snpID,eqtl_data$variant_id),]
    gwas_data <- gwas_data[match(snpID,gwas_data$variant_id),]
    LD <- LD[match(snpID,rownames(LD)),match(snpID,colnames(LD))]

    eqtl_betas <- as.matrix(eqtl_data[,paste0("beta_",1:nTiss)])
    eqtl_se <- as.matrix(eqtl_data[,paste0("SE_",1:nTiss)])
    eqtl_pvals <- as.matrix(eqtl_data[,paste0("pvalue_",1:nTiss)])
    gwas_betas <- gwas_data$beta
    gwas_se <- gwas_data$SE

    row.names(eqtl_betas) <- row.names(eqtl_se) <- row.names(eqtl_pvals) <- eqtl_data$variant_id

    return(list(snpID=snpID,gwas_betas=gwas_betas,gwas_se=gwas_se,
                eqtl_betas=eqtl_betas,eqtl_se=eqtl_se,eqtl_pvals=eqtl_pvals,LD=LD))
  }, error = function(e) print(e))
}

# Step 6: Apply the previous setup function:

# Step 6.1: Initialize an empty list to store the results:
MR_MtRobin_results <- list()

# Step 6.2: Define the number of cores to use for the parallelism process:
num_cores <- 36

# Step 6.3: Loop over each gene in result_selectedIVs:
MR_MtRobin_results <- mclapply(names(result_selectedIVs), function(geneID) {

  # Step 6.3.1: Get the SNPs associated with each gene:
  snpID <- result_selectedIVs[[geneID]]

  # Step 6.3.2: Generate a unique file path for the LD matrix of this gene:
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  # Step 6.3.3: Call the MR_MtRobin_setup() function with the parameters for each gene:
  return(tryCatch({
    MR_MtRobin_setup(geneID = geneID,
                     snpID = snpID,
                     eqtl_data = eqtl_covid_exposure,
                     gwas_data = gwas_covid_outcome,
                     ld_matrix_path = ld_matrix_path,
                     nTiss = 49)
  }, error = function(e) print(e)))
}, mc.cores = num_cores)

# Step 7: Assign names to the list:
names(MR_MtRobin_results) <- names(result_selectedIVs)

# Step 8: Save the previous result as a .rds file:
saveRDS(MR_MtRobin_results, file = "/data/long_covid_setup_bs_3.rds")

In [None]:
#!/bin/bash
# The line above specifies the program used to run the script; it must be the first line of the file to take effect.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS ignores directives placed after the first executable command, so these come before any command:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write timestamps to a per-job file (named by job ID) instead of overwriting the same file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

# Change to the directory where the job was submitted:
cd "$PBS_O_WORKDIR"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup_bs.sif Rscript /data/setup_function_bs_3.R

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-29 17:00:33:
   Job Id:             93819302.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      47.01
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 12:27:09
   Memory Requested:   300.0GB               Memory Used: 77.53GB
   Walltime requested: 10:00:00            Walltime Used: 00:26:07
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R
# Open the output file:
result_setup_bs_3 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_setup_bs_3.rds")

# Filter out errors:
result_setup_bs_3 <- result_setup_bs_3[!sapply(result_setup_bs_3, function(x) inherits(x, "simpleError"))]

# Print the length and the first element:
print(length(result_setup_bs_3))
print(length(unique(result_setup_bs_3)))
print(result_setup_bs_3[[1]])

[1] 2723
[1] 2723
$snpID
 [1] "rs2206271"   "rs35305915"  "rs72901027"  "rs9367397"   "rs61660288" 
 [6] "rs141707622" "rs2477488"   "rs13202872"  "rs62406441"  "rs17667036" 
[11] "rs75397901"  "rs10948573"  "rs4715208"   "rs571503"   

$gwas_betas
 [1]  0.03582120 -0.00993813 -0.01632840  0.03942670 -0.03288230  0.05918180
 [7]  0.04048180  0.00774911 -0.03468580  0.00966651 -0.10708000 -0.01167420
[13]  0.02275550 -0.00592643

$gwas_se
 [1] 0.0327991 0.0460413 0.0493336 0.0654533 0.0466355 0.0920342 0.0311052
 [8] 0.0349034 0.0381658 0.0338888 0.1320330 0.0338025 0.0350512 0.0311419

$eqtl_betas
              beta_1 beta_2 beta_3 beta_4 beta_5 beta_6 beta_7 beta_8 beta_9
rs2206271   0.160091     NA     NA     NA     NA     NA     NA     NA     NA
rs35305915        NA     NA     NA     NA     NA     NA     NA     NA     NA
rs72901027        NA     NA     NA     NA     NA     NA     NA     NA     NA
rs9367397         NA     NA     NA     NA     NA     NA     NA     NA     NA
rs61660288

In [None]:
# Step 1: Load the required libraries:
library(base) # R package installed by default
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(Matrix) # R package installed by default
library(methods) # R package installed by default
library(stats) # R package installed by default
library(lme4) # It needs to be installed separately
library(data.table) # It needs to be installed separately

# Step 2: Main Mt_Robin function:
MR_MtRobin <- function(MR_MtRobin_input,pval_thresh=0.001){
  snpID <- MR_MtRobin_input$snpID
  gwas_betas <- MR_MtRobin_input$gwas_betas
  gwas_se <- MR_MtRobin_input$gwas_se
  eqtl_betas <- MR_MtRobin_input$eqtl_betas
  eqtl_se <- MR_MtRobin_input$eqtl_se
  eqtl_pvals <- MR_MtRobin_input$eqtl_pvals
  LD <- MR_MtRobin_input$LD

  if(ncol(eqtl_betas) != ncol(eqtl_se) | ncol(eqtl_betas) != ncol(eqtl_pvals)){
    stop("eQTL betas, SE and p-values do not all have same number of columns")
  }

  ## determine number of tissues (or conditions/studies/etc.)
  nT <- ncol(eqtl_betas)

  ## rename columns for standardized processing
  colnames(eqtl_betas) <- paste0("beta_",1:nT)
  colnames(eqtl_se) <- paste0("SE_",1:nT)
  colnames(eqtl_pvals) <- paste0("pvalue_",1:nT)

  ## combine eQTL results into data.table
  eqtl_res <- data.table::data.table(snpID,eqtl_betas,eqtl_se,eqtl_pvals)

  # reshape to long format (each SNP-gene-tissue will be row)
  eQTL_res_melt <- data.table::melt(eqtl_res,id.vars=c("snpID"),
                                    measure.vars=patterns("^beta_","^SE_","^pvalue_"),value.name=c("beta","se","pvalue"),
                                    variable.name="tissue")

  ## subset to strong instrumental variables (based on p-value threshold)
  eQTL_res_melt_PltThresh <- subset(eQTL_res_melt, pvalue < pval_thresh)

  ## set up gwas data for merging
  gwas_res <- data.table::data.table(snpID=snpID,gwas_beta=gwas_betas,gwas_se=gwas_se)

  ## merge eQTL and GWAS results
  data.table::setkey(eQTL_res_melt_PltThresh,snpID)
  data.table::setkey(gwas_res,snpID)
  merged_res <- data.table::merge.data.table(eQTL_res_melt_PltThresh,gwas_res)

  ## set up coefficients for reverse regression
  beta_x <- matrix(merged_res$beta,ncol=1)
  beta_y <- matrix(merged_res$gwas_beta,ncol=1)

  ## standard errors (for weights)
  se_x <- matrix(merged_res$se,ncol=1)
  se_y <- matrix(merged_res$gwas_se,ncol=1) ## not used by function; used in resampling (consider returning with results)

  ## identifiers
  snpID <- merged_res$snpID

  lme_res <- lme4::lmer(beta_x~(beta_y-1)+(beta_y-1|snpID),weights=1/se_x^2)

  ## return results from weighted regression with random slopes (and correlated errors) along with GWAS SE and LD (needed for resampling)
  return(list(lme_res=lme_res,gwas_res=gwas_res,LD=LD))
}

# Step 3: Read the previous file:
result_setup <- readRDS("/data/long_covid_setup_bs_3.rds")

# Step 4: Apply the main Mt_Robin function using parallel processing:
no_cores <- 36

cl <- makeCluster(no_cores) # Create a cluster with the number of cores
clusterExport(cl, "MR_MtRobin") # Export the MR_MtRobin function to all the nodes of the cluster
result_list_MtRobin <- parLapply(cl, result_setup, function(MR_MtRobin_input) {
  tryCatch({
    MR_MtRobin(MR_MtRobin_input = MR_MtRobin_input)
  }, error = function(e) print(e))
})
stopCluster(cl) # Stop the cluster once the computation is done

# Step 5: The names of the results will correspond to the gene ids:
names(result_list_MtRobin) <- names(result_setup)

# Step 6: Save the previous result as a .rds file:
saveRDS(result_list_MtRobin, file = "/data/long_covid_main_bs_3.rds")

In [None]:
#!/bin/bash
# The line above specifies the program used to run the script; it must be the first line of the file to take effect.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS ignores directives placed after the first executable command, so these come before any command:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write timestamps to a per-job file (named by job ID) instead of overwriting the same file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

# Change to the directory where the job was submitted:
cd "$PBS_O_WORKDIR"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/main_function.sif Rscript /data/main_function_bs_3.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-30 10:29:30:
   Job Id:             93872802.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      2.25
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 00:01:51
   Memory Requested:   300.0GB               Memory Used: 8.18GB
   Walltime requested: 10:00:00            Walltime Used: 00:01:15
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R
# Open the output file:
result_main_bs_3 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_bs_3.rds")

# Filter out errors:
result_main_bs_3 <- result_main_bs_3[!sapply(result_main_bs_3, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of unique results, and the first few elements:
print(length(result_main_bs_3))
print(length(unique(result_main_bs_3)))
print(head(result_main_bs_3))

[1] 148
[1] 148
$ENSG00000008196
$ENSG00000008196$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: 7.7591
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 4.398   
 Residual        4.800   
Number of obs: 15, groups:  snpID, 14
Fixed Effects:
beta_y  
-2.306  

$ENSG00000008196$gwas_res
         snpID   gwas_beta   gwas_se
1   rs10948573 -0.01167420 0.0338025
2   rs13202872  0.00774911 0.0349034
3  rs141707622  0.05918180 0.0920342
4   rs17667036  0.00966651 0.0338888
5    rs2206271  0.03582120 0.0327991
6    rs2477488  0.04048180 0.0311052
7   rs35305915 -0.00993813 0.0460413
8    rs4715208  0.02275550 0.0350512
9     rs571503 -0.00592643 0.0311419
10  rs61660288 -0.03288230 0.0466355
11  rs62406441 -0.03468580 0.0381658
12  rs72901027 -0.01632840 0.0493336
13  rs75397901 -0.10708000 0.1320330
14   rs9367397  0.03942670 0.0654533

$ENSG00000008196$LD
               rs2206
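
The printed lengths fall from 13690 genes after IV selection to 2723 after setup and 148 after the main function, so most genes are lost to caught errors. A quick way to see why is to tabulate the error messages before filtering them out. The sketch below is a minimal illustration on a made-up list (`toy_results` and its messages are hypothetical, as are the names `is_failure`, `failure_msg` and `failure_counts`). It also flags `try-error` objects, which may appear when a failure is not caught by `tryCatch()`.

```r
# Hypothetical stand-in for a result list returned by a parallel apply with tryCatch().
toy_results <- list(
  geneA = c("rs1", "rs2"),
  geneB = simpleError("Some SNPs missing in GWAS dataset"),
  geneC = simpleError("Gene is missing in eQTL dataset"),
  geneD = simpleError("Some SNPs missing in GWAS dataset"),
  geneE = try(stop("worker failed"), silent = TRUE)
)

# Flag both condition objects and "try-error" objects as failures.
is_failure <- vapply(toy_results,
                     function(x) inherits(x, c("error", "try-error")),
                     logical(1))

# Extract one message per failed gene.
failure_msg <- vapply(toy_results[is_failure], function(x) {
  if (inherits(x, "error")) conditionMessage(x) else as.character(x)
}, character(1))

# Count how often each message occurs, most frequent first.
failure_counts <- sort(table(failure_msg), decreasing = TRUE)
print(failure_counts)

# Number of genes with a usable result.
n_ok <- sum(!is_failure)
print(n_ok)
```

Applied to the real `.rds` files, a table like this would show whether genes drop out because of missing SNPs in the GWAS subset, missing LD matrix files, or model-fitting problems.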

In [None]:
########################################################################
# GADI:

# Obtain p-values from the MR-MtRobin results:
# MR_MtRobin_resample() uses a resampling procedure to estimate a p-value for each MR-MtRobin fit.
# Arguments:
#   results_1_path:   path to a .rds file with a named list of results returned by MR_MtRobin() (one element per gene).
#   nsamp:            integer, the number of samples used to estimate the p-value from a null distribution.
#   use_nonconverge:  logical, whether to use samples resulting in non-convergence of the random slope.
#   output_file_path: path of the tab-separated file to which the results table is written.
#   num_cpus:         integer, the number of cores passed to mclapply().
# Returns a named list with one element per gene, each containing:
#   pvalue:     numeric, the estimated p-value.
#   nsamp_used: integer, the number of samples used to estimate the p-value.
#               It is returned because not all nsamp samples may be used: if use_nonconverge=FALSE,
#               samples are dropped when the model does not converge or results in a singular fit of the random slope.
#   fdr:        numeric, the Benjamini-Hochberg adjusted p-value.
# Example call: see the end of this script.

# Load packages:
library(lme4)
library(mvtnorm)
library(data.table)
library(parallel)

# Load main results:
results_1 <- readRDS("/scratch/sq95/sp6154/p_values/long_covid_main_bs_3.rds")
# Filter out errors:
results_1 <- results_1[!sapply(results_1, function(x) inherits(x, "simpleError"))]
# Export:
saveRDS(results_1, "/scratch/sq95/sp6154/p_values/long_covid_main_bs_3_no_errors.rds")

# Updated Function to calculate the p-value and FDR with parallel processing:
MR_MtRobin_resample <- function(results_1_path, nsamp=1e4, use_nonconverge=FALSE, output_file_path, num_cpus=1){
  # Check and load the results_1 dataset from the specified path:
  if(file.exists(results_1_path)){
    results_1 <- readRDS(results_1_path)
  } else {
    stop("The specified file path for results_1 does not exist.")
  }
  # Initialize a list to store results for each gene:
  pvalues_results <- list()
  # Loop over each gene:
  for(gene_id in names(results_1)) {
    # Check if necessary components are present and not NULL:
    if(is.null(results_1[[gene_id]][["lme_res"]]) || is.null(results_1[[gene_id]][["gwas_res"]]) || is.null(results_1[[gene_id]][["LD"]])) {
      next
    }
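    # Extract the model fit, the GWAS results and the LD matrix for this gene:
    lme_res <- results_1[[gene_id]][["lme_res"]]
    gwas_res <- results_1[[gene_id]][["gwas_res"]]
    LD <- results_1[[gene_id]][["LD"]]
    # Align the LD matrix with the SNP order of gwas_res (which is sorted by snpID):
    LD <- LD[match(gwas_res$snpID, rownames(LD)), match(gwas_res$snpID, colnames(LD)), drop = FALSE]
    # Extract data from MR-MtRobin results:
    eqtl_betas <- lme_res@frame$beta_x
    snpID <- as.character(lme_res@frame$snpID) # Character, so that columns are selected by SNP name
    weights <- lme_res@frame$`(weights)` # lme4 stores the weights in the model frame column "(weights)"
    tstat_MR_MtRobin <- summary(lme_res)$coefficients[1,3]
    gwas_se <- gwas_res$gwas_se
    # Bootstrapped null distribution, accounting for LD correlations:
    se_mat <- diag(gwas_se, nrow = length(gwas_se))
    beta_gwas_nullMat <- mvtnorm::rmvnorm(nsamp, mean=rep(0, length(gwas_se)), sigma=se_mat %*% LD %*% se_mat)
    colnames(beta_gwas_nullMat) <- gwas_res$snpID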
    # Refit the model on each null sample and collect the t-statistics:
    tstat_nulls <- mclapply(1:nsamp, function(i) {
      beta_gwas_null <- beta_gwas_nullMat[i, snpID]
      tryCatch({
        # LMM equation:
        lme_null <- lme4::lmer(eqtl_betas~(beta_gwas_null-1)+(beta_gwas_null-1|snpID), weights=weights)
        return(summary(lme_null)$coefficients[1,3])
      }, error=function(e) {
        if(!use_nonconverge) return(NA)
      })
    }, mc.cores = num_cpus)
    tstat_nulls <- unlist(tstat_nulls)
    tstat_nulls <- tstat_nulls[!is.na(tstat_nulls)] # Remove NAs
    nsamp_used <- length(tstat_nulls)
    # Calculate the p-value and round it to 3 decimal places:
    pval <- round(mean(abs(tstat_nulls) >= abs(tstat_MR_MtRobin)), digits = 3)
    # Store the results for the current gene in the list:
    pvalues_results[[gene_id]] <- list(pvalue=pval, nsamp_used=nsamp_used)
  }
  # Calculate FDR:
  all_pvalues <- sapply(pvalues_results, function(x) x$pvalue)
  fdr_values <- p.adjust(all_pvalues, method = "BH")
  # Combine results with FDR values:
  for(i in seq_along(fdr_values)) {
    pvalues_results[[names(fdr_values)[i]]]$fdr <- fdr_values[i]
  }
  # Prepare results dataframe:
  results_df <- do.call(rbind, lapply(names(pvalues_results), function(gene_id) {
    data.frame(gene_id = gene_id,
               pvalue = sprintf("%.3f", pvalues_results[[gene_id]]$pvalue), # Format pvalue with 3 decimal places
               nsamp_used = pvalues_results[[gene_id]]$nsamp_used,
               fdr = sprintf("%.15f", pvalues_results[[gene_id]]$fdr), # Keep full precision for fdr
               stringsAsFactors = FALSE)
  }))
  # Write the results to the output file, ensuring the formatting is kept
  write.table(results_df, file = output_file_path, sep = "\t", row.names = FALSE, quote = FALSE, col.names = TRUE)
  return(pvalues_results)
}

# Call the previous function:
MR_MtRobin_resample(
  results_1_path = "/scratch/sq95/sp6154/p_values/long_covid_main_bs_3_no_errors.rds",
  nsamp = 10000, # Number of bootstrap samples
  use_nonconverge = FALSE, # Whether to use non-converging results, typically FALSE
  output_file_path = "/scratch/sq95/sp6154/p_values/MR_pvalue_FDR_results_bs3.txt",
  num_cpus = 48 # Number of CPUs
)
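
The null distribution used in `MR_MtRobin_resample()` draws GWAS effects with covariance `diag(se) %*% LD %*% diag(se)`, which only makes sense if the standard errors and the LD matrix list the SNPs in the same order. The sketch below illustrates this on made-up inputs (the three SNPs, their standard errors and the LD values are hypothetical). It uses a Cholesky factor from base R as a stand-in for `mvtnorm::rmvnorm()`, so it is an illustration of the idea rather than the code run on Gadi.

```r
# Hypothetical inputs: three SNPs, with the LD matrix stored in a different
# SNP order than the GWAS table.
gwas_res <- data.frame(snpID = c("rs1", "rs2", "rs3"),
                       gwas_se = c(0.03, 0.05, 0.04),
                       stringsAsFactors = FALSE)
LD <- matrix(c(1.0, 0.2, 0.4,
               0.2, 1.0, 0.1,
               0.4, 0.1, 1.0), nrow = 3,
             dimnames = list(c("rs3", "rs1", "rs2"), c("rs3", "rs1", "rs2")))

# Align LD to the GWAS SNP order before scaling.
LD <- LD[gwas_res$snpID, gwas_res$snpID, drop = FALSE]

# Covariance of the null GWAS effects: Sigma = D %*% LD %*% D with D = diag(se).
# nrow is given explicitly so that a single SNP still yields a 1 x 1 matrix.
D <- diag(gwas_res$gwas_se, nrow = nrow(gwas_res))
Sigma <- D %*% LD %*% D
dimnames(Sigma) <- dimnames(LD)

# Draw null effects using a Cholesky factor (base R stand-in for rmvnorm()).
set.seed(42)
nsamp <- 5000
z <- matrix(rnorm(nsamp * nrow(Sigma)), nrow = nsamp)
beta_null <- z %*% chol(Sigma)
colnames(beta_null) <- gwas_res$snpID

# The sample correlation between rs2 and rs3 should be close to the LD value 0.4.
print(round(cor(beta_null), 2))
```

Indexing by SNP name rather than by position keeps the standard errors, the LD matrix and the simulated columns consistent even when the inputs were sorted differently upstream.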

In [None]:
#!/bin/bash
# The line above specifies the program used to run the script; it must be the first line of the file to take effect.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted(a), begins(b), ends(e).
# PBS ignores directives placed after the first executable command, so these come before any command:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Write timestamps to a per-job file (named by job ID) instead of overwriting the same file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

# Change to the directory where the job was submitted:
cd "$PBS_O_WORKDIR"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Option 1: Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/p_values/p_values.sif Rscript /data/p_values_bs_3.R

# Option 2: Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly:
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2024-02-08 17:48:59:
   Job Id:             107834555.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      47.26
   NCPUs Requested:    24                     NCPUs Used: 24
                                           CPU Time Used: 09:15:37
   Memory Requested:   200.0GB               Memory Used: 19.38GB
   Walltime requested: 10:00:00            Walltime Used: 00:39:23
   JobFS requested:    100.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R

# Read the results with the p_value and FDR values:
results <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_bs3.txt", header=TRUE, sep="\t")
print(dim(results))
print(head(results))

[1] 148   4
          gene_id pvalue nsamp_used       fdr
1 ENSG00000008196  0.449      10000 0.8968095
2 ENSG00000068781  0.483      10000 0.8968095
3 ENSG00000117148  0.451      10000 0.8968095
4 ENSG00000122965  0.118      10000 0.8902424
5 ENSG00000126266  0.056      10000 0.6375385
6 ENSG00000136696  0.456      10000 0.8968095
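
The `pvalue` and `fdr` columns above come from comparing each gene's observed t-statistic with resampled null t-statistics and then adjusting across genes. A minimal sketch of that calculation, assuming made-up observed statistics for three hypothetical genes and standard normal draws in place of the refitted null models:

```r
# Hypothetical observed t-statistics and simulated null t-statistics.
set.seed(7)
t_obs <- c(geneA = 0.5, geneB = 2.1, geneC = 4.0)
t_null <- lapply(t_obs, function(x) rnorm(10000))

# Two-sided empirical p-value: the share of null statistics that are at least
# as extreme as the observed one.
emp_p <- vapply(names(t_obs), function(g) {
  mean(abs(t_null[[g]]) >= abs(t_obs[[g]]))
}, numeric(1))

# Benjamini-Hochberg adjustment across genes.
fdr <- p.adjust(emp_p, method = "BH")

print(data.frame(gene_id = names(t_obs), pvalue = emp_p, fdr = fdr))
```

With 10000 null samples the smallest non-zero empirical p-value is 1e-4, so the resolution of the p-values (and of the adjusted values derived from them) is bounded by `nsamp`.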


In [None]:
%%R

# Call the analyze_gene_data() function to organize the Mt_Robin results for significant and non-significant genes:

# Create the output paths:
save_paths <- list(
  all_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs3_all_genes_sig_non_sig.txt",
  protein_coding_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs3_protein_coding_genes_sig_not_sig.txt"
)

# Call the function:
analyze_gene_data(
  result_main_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_bs_3.rds",
  eqtl_long_covid_exposure_path = "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt",
  save_files = save_paths
)

Saved all_genes as all_genes to /content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs3_all_genes_sig_non_sig.txt 
Dimensions of all_genes : 148 10 

          gene_id     gene_name      gene_type      beta_y Obs SNPs
1 ENSG00000008196        TFAP2B protein_coding   -2.306357  15   14
2 ENSG00000068781 STON1-GTF2A1L protein_coding -229.601291  29    2
3 ENSG00000117148         ACTL8 protein_coding    1.751772   5    1
4 ENSG00000122965         RBM19 protein_coding   -3.349624  49   10
5 ENSG00000126266         FFAR1 protein_coding   -9.198547  34    5
6 ENSG00000136696         IL36B protein_coding    9.788681   5    3
                                                                                                                                               SNP_IDs
1 rs10948573,rs13202872,rs141707622,rs17667036,rs2206271,rs2477488,rs35305915,rs4715208,rs571503,rs61660288,rs62406441,rs72901027,rs75397901,rs9367397
2                                                                           

In [None]:
%%R

# BS3 final table:
# Call the function that retrieves more information from the Mt_Robin results and filters the significant genes:
process_final_data(input_path_1 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_bs_3_no_errors.rds",
                   input_path_2 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs3_protein_coding_genes_sig_not_sig.txt",
                   input_path_3 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_bs3.txt",
                   output_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs3_protein_coding_genes_sig_full_table.txt")

[1] "Causal genes (noncoding genes and protein coding genes): "
[1] 148
$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: 7.7591
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 4.398   
 Residual        4.800   
Number of obs: 15, groups:  snpID, 14
Fixed Effects:
beta_y  
-2.306  

$gwas_res
         snpID   gwas_beta   gwas_se
1   rs10948573 -0.01167420 0.0338025
2   rs13202872  0.00774911 0.0349034
3  rs141707622  0.05918180 0.0920342
4   rs17667036  0.00966651 0.0338888
5    rs2206271  0.03582120 0.0327991
6    rs2477488  0.04048180 0.0311052
7   rs35305915 -0.00993813 0.0460413
8    rs4715208  0.02275550 0.0350512
9     rs571503 -0.00592643 0.0311419
10  rs61660288 -0.03288230 0.0466355
11  rs62406441 -0.03468580 0.0381658
12  rs72901027 -0.01632840 0.0493336
13  rs75397901 -0.10708000 0.1320330
14   rs9367397  0.03942670 0.0654533

$LD
               rs2206271   rs3

**Note:** No significant protein-coding genes were found.

### **GWAS4**

In [None]:
# Step 1: Load the required libraries:
library(matrixStats) # It needs to be installed separately
#library(BiocManager) # It needs to be installed separately
library(snpStats) # It needs to be installed separately
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(reshape2) # Additional package for melt function

# Step 2: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 3: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_4_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 4: Open the genes input data set:
all_genes <- read.table("/data/all_genes_110.txt",
                        header = TRUE,
                        sep = "",
                        stringsAsFactors = FALSE)

# Step 5: Convert the previous list of genes into a vector:
all_genes_vector <- all_genes$gene_id

# Step 6: Save the paths of the GTEx files:
gtex_covid_bed <- "/data/WGS_GTEx_long_covid_rsID.bed"
gtex_covid_bim <- "/data/WGS_GTEx_long_covid_rsID.bim"
gtex_covid_fam <- "/data/WGS_GTEx_long_covid_rsID.fam"

# Step 7: Create a variable for the path to the directory to save the LD matrix results:
ld_matrix_dir <- "/data/ld_matrix_bs_4"

# Step 8: Create the directory if it doesn't exist:
if (!dir.exists(ld_matrix_dir)) {
  dir.create(ld_matrix_dir)
}

# Step 9: Modify the original select_IV_bs() function to accept NAs from the eQTL and calculate the LD matrix inside:
# select_IV_bs() --> backward selection starts with all variables and removes one at a time:
select_IV_bs <- function(geneID,
                         eqtl_data,
                         nTiss,
                         bed_file,
                         bim_file,
                         fam_file,
                         ld_matrix_path,
                         ld_thresh=0.5,
                         pval_thresh=0.001,
                         nTiss_thresh=3){

  ## Subset the eQTL data set for the gene of interest:
  if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
  eqtl_data <- subset(eqtl_data, gene_id == geneID)

  pval_mat <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  eqtl_data$min_p <- matrixStats::rowMins(pval_mat, na.rm = TRUE) # Mod1: I added na.rm = TRUE
  eqtl_data$nTiss_ltThresh <- rowSums(pval_mat < pval_thresh, na.rm = TRUE) # Number of tissues with p < pval_thresh (used by Criterion 2 below)
  eqtl_data$nTissThresh_p <- apply(pval_mat, 1, function(x) sort(x)[nTiss_thresh])
  eqtl_data <- subset(eqtl_data, nTissThresh_p < pval_thresh)

  ## Ensure same sign for tissues with p < pval_thresh:
  betas <- as.matrix(subset(eqtl_data,select=paste0("beta_",1:nTiss)))
  pvals <- as.matrix(subset(eqtl_data,select=paste0("pvalue_",1:nTiss)))
  pvals_lt001 <- pvals<pval_thresh # TRUE where the eQTL p-value is below pval_thresh
  betas_sig <- betas*pvals_lt001
  sig_neg_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x < -1e-15)))) # Mod2: I added is.na()
  sig_pos_eQTL <- as.numeric(apply(betas_sig, 1, function(x) any((!is.na(x) & x > 1e-15)))) # Mod3: I added is.na()
  eqtl_data$pos_sig_eQTL <- sig_pos_eQTL
  eqtl_data$neg_sig_eQTL <- sig_neg_eQTL
  eqtl_data <- subset(eqtl_data, !(pos_sig_eQTL==1 & neg_sig_eQTL==1))

  if(nrow(eqtl_data)==0) stop(paste0("No cis-SNPs for gene with p<",pval_thresh," in at least ",nTiss_thresh," tissues."))

  ## Obtain a set of candidate SNPs:
  SNP_pool <- eqtl_data$variant_id
  if(length(SNP_pool)==1){
    return(SNP_pool)
  }

  # Calculate the LD matrix:
  snp_data <- read.plink(bed_file, bim_file, fam_file)
  snp_matrix <- snp_data$genotypes
  numeric_matrix <- as(snp_matrix, "numeric")

  # Restrict LD matrix to SNPs in the data set:
  df_subset <- numeric_matrix[, colnames(numeric_matrix) %in% SNP_pool]
  LD <- cor(df_subset, use="pairwise.complete")
  LD_r2 <- LD^2

  # Export a .txt file of the LD matrix:
  utils::write.table(LD,
                     file = ld_matrix_path,
                     sep = " ",
                     row.names = TRUE,
                     col.names = TRUE,
                     quote = FALSE)

  ## Get pairwise r^2:
  LD_r2_pw <- LD_r2
  LD_r2_pw[lower.tri(LD_r2_pw,diag=TRUE)] <- NA
  LD_r2_pw <- reshape2::melt(LD_r2_pw)
  colnames(LD_r2_pw) <- c("SNP1","SNP2","r2")
  LD_r2_pw$SNP1 <- as.character(LD_r2_pw$SNP1)
  LD_r2_pw$SNP2 <- as.character(LD_r2_pw$SNP2)

  ## Drop SNPs paired with themselves and duplicates (resulting from lower triangle, set to NA):
  LD_r2_pw <- subset(LD_r2_pw, SNP1 != SNP2 & !is.na(r2))

  ## Sort by descending r^2:
  LD_r2_pw <- LD_r2_pw[order(LD_r2_pw$r2,decreasing = TRUE),]

  ## If every pair of candidate SNPs has pairwise r^2 >= thresh, return the single SNP with the smallest p-value:
  if(min(LD_r2_pw$r2) >=ld_thresh){
    return(eqtl_data$variant_id[which.min(eqtl_data$min_p)])
  }

  while(max(LD_r2_pw$r2)>=ld_thresh){

    ## Identify top SNPs in the pool:
    eqtl_data <- subset(eqtl_data, variant_id %in% SNP_pool)
    LD_r2_pw <- subset(LD_r2_pw, SNP1 %in% SNP_pool & SNP2 %in% SNP_pool)
    LD_r2_pw_geThresh <- subset(LD_r2_pw, r2>=ld_thresh)

    ## Identify SNP pair with highest remaining r^2:
    next_pair <- LD_r2_pw[1,]

    ## Criterion 1: Drop the SNP with the largest number of correlations >= threshold with other SNPs:
    SNP1_nSNP_geThresh <- sum(LD_r2_pw_geThresh$SNP1 == next_pair$SNP1) + sum(LD_r2_pw_geThresh$SNP2 == next_pair$SNP1)
    SNP2_nSNP_geThresh <- sum(LD_r2_pw_geThresh$SNP1 == next_pair$SNP2) + sum(LD_r2_pw_geThresh$SNP2 == next_pair$SNP2)
    if(SNP1_nSNP_geThresh > SNP2_nSNP_geThresh){
      next_SNP_drop <- next_pair$SNP1
    } else if (SNP1_nSNP_geThresh < SNP2_nSNP_geThresh){
      next_SNP_drop <- next_pair$SNP2
    } else{

      ## Criterion 2: Drop the SNP with the smaller number of tissues with p < pval_thresh:
      next_pair_eqtl <- subset(eqtl_data, variant_id %in% c(next_pair$SNP1,next_pair$SNP2))
      SNP1_nTiss <- subset(next_pair_eqtl,variant_id==next_pair$SNP1)$nTiss_ltThresh
      SNP2_nTiss <- subset(next_pair_eqtl,variant_id==next_pair$SNP2)$nTiss_ltThresh

      if(SNP1_nTiss > SNP2_nTiss){
        next_SNP_drop <- next_pair$SNP2
      } else if(SNP1_nTiss < SNP2_nTiss){
        next_SNP_drop <- next_pair$SNP1
      } else{

        ## Criterion 3: Drop the SNP with the larger minimum p-value:
        next_SNP_drop <- next_pair_eqtl$variant_id[which.max(next_pair_eqtl$min_p)]
      }
    }
    SNP_pool <- SNP_pool[which(SNP_pool != next_SNP_drop)]
  }
  return(SNP_pool)
}

# Step 10: Define the number of cores to use for parallel processing:
num_cores <- 48

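# Step 11: Run select_IV_bs() for every gene in parallel; errors are caught and returned as condition objects: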
result_list <- mclapply(all_genes_vector, function(geneID) {
  # Generate a unique file path for the LD matrix of this gene
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  tryCatch({
    select_IV_bs(geneID = geneID,
                 eqtl_data = eqtl_covid_exposure,
                 nTiss = 49,
                 bed_file = gtex_covid_bed,
                 bim_file = gtex_covid_bim,
                 fam_file = gtex_covid_fam,
                 ld_matrix_path = ld_matrix_path,
                 ld_thresh = 0.5,
                 pval_thresh = 0.001,
                 nTiss_thresh = 1)
  }, error = function(e) print(e))
}, mc.cores = num_cores)

# Step 12: Name each result by its gene ID:
names(result_list) <- all_genes_vector

# Step 13: Save the selected IVs as a .rds file:
saveRDS(result_list, file = "/data/long_covid_IVs_bs_4.rds")
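
The backward selection in `select_IV_bs()` is easier to follow on a toy case. The cell below is a minimal Python sketch (plain Python, since this notebook runs on a Python kernel), not a port of the R function: it applies only Criterion 1 and the minimum p-value tie-break, it recomputes the high-LD pairs after every drop, and the SNP names, r² values, and p-values are made up for illustration.

```python
def prune_by_ld(snps, r2, min_p, ld_thresh=0.5):
    """Greedy backward selection: drop SNPs until no pair has r^2 >= ld_thresh.

    snps  : list of SNP names (hypothetical)
    r2    : dict mapping frozenset({a, b}) -> pairwise r^2
    min_p : dict mapping SNP name -> smallest eQTL p-value across tissues
    """
    pool = list(snps)
    while True:
        # Pairs still in the pool whose r^2 reaches the threshold.
        high = [(value, tuple(sorted(pair))) for pair, value in r2.items()
                if value >= ld_thresh and all(s in pool for s in pair)]
        if not high:
            return pool
        # Pair with the highest remaining r^2.
        _, (a, b) = max(high)
        # Criterion 1: count the high-LD pairs each SNP takes part in.
        n_a = sum(a in pair for _, pair in high)
        n_b = sum(b in pair for _, pair in high)
        if n_a != n_b:
            drop = a if n_a > n_b else b
        else:
            # Tie-break: drop the SNP with the larger minimum p-value.
            drop = a if min_p[a] > min_p[b] else b
        pool.remove(drop)


# Toy inputs (all values are made up for illustration).
toy_snps = ["rsA", "rsB", "rsC", "rsD"]
toy_r2 = {
    frozenset({"rsA", "rsB"}): 0.90,
    frozenset({"rsA", "rsC"}): 0.70,
    frozenset({"rsB", "rsC"}): 0.20,
    frozenset({"rsA", "rsD"}): 0.10,
    frozenset({"rsB", "rsD"}): 0.05,
    frozenset({"rsC", "rsD"}): 0.60,
}
toy_min_p = {"rsA": 1e-6, "rsB": 1e-4, "rsC": 1e-5, "rsD": 1e-3}

selected = prune_by_ld(toy_snps, toy_r2, toy_min_p)
print(selected)
```

In this toy case `rsA` is dropped first because it takes part in two high-LD pairs, then `rsD` loses the tie-break against `rsC` on its p-value.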

In [None]:
#!/bin/bash
# The interpreter line above must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=30:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted (a), begins (b), ends (e).
# PBS directives must appear before the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory where the job was submitted:
cd "$PBS_O_WORKDIR"

# Write start and end timestamps to a per-job file (named by job ID) instead of overwriting the same file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup_bs.sif Rscript /data/select_function_bs_4.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-26 02:15:44:
   Job Id:             93551120.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      865.48
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 278:35:30
   Memory Requested:   500.0GB               Memory Used: 388.01GB
   Walltime requested: 30:00:00            Walltime Used: 06:00:37
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================
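
The Service Units figure in the log above is consistent with a simple product of CPUs, walltime in hours, and a charge rate of 3 SU per CPU-hour. The rate is inferred from this log only and is an assumption, as is ignoring any memory-based charging; the function name is hypothetical. A small Python sketch of the arithmetic:

```python
def service_units(ncpus, walltime, rate_per_cpu_hour=3.0):
    """Estimate service units as ncpus * walltime (hours) * rate.

    The default rate of 3 SU per CPU-hour is inferred from the resource log
    in this notebook; it is an assumption, not a documented scheduler value.
    """
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    walltime_hours = hours + minutes / 60 + seconds / 3600
    return round(ncpus * walltime_hours * rate_per_cpu_hour, 2)


# Values taken from the log above: 48 CPUs and a walltime of 06:00:37.
estimated_su = service_units(48, "06:00:37")
print(estimated_su)
```

Because the charge scales with the walltime actually used rather than the walltime requested, the 30-hour request did not by itself increase the cost of this job.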

In [None]:
%%R
# Open the output file:
result_selectIV_bs_4 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_IVs_bs_4.rds")

# Filter out errors:
result_selectIV_bs_4 <- result_selectIV_bs_4[!sapply(result_selectIV_bs_4, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of unique IV sets, and the first few elements:
print(length(result_selectIV_bs_4))
print(length(unique(result_selectIV_bs_4)))
print(head(result_selectIV_bs_4))

[1] 13809
[1] 13737
$ENSG00000002079
[1] "rs36168039"  "rs151109487" "rs41258334"  "rs28497153"  "rs7457787"  
[6] "rs73157578" 

$ENSG00000004846
 [1] "rs58540449"  "rs79448643"  "rs114103500" "rs111613242" "rs76325633" 
 [6] "rs117337555" "rs73682327"  "rs11981446"  "rs7801178"   "rs17455533" 
[11] "rs2108258"   "rs77254950"  "rs139614792"

$ENSG00000004939
 [1] "rs2880712"   "rs9747287"   "rs117705553" "rs79555595"  "rs78230173" 
 [6] "rs114651584" "rs34637106"  "rs7213794"   "rs144535065" "rs142313238"
[11] "rs76004777"  "rs9944407"   "rs34260468"  "rs142437085" "rs8070867"  
[16] "rs730818"    "rs200108084" "rs1728174"   "rs140638631" "rs4793150"  
[21] "rs76480869"  "rs147752346" "rs62080669"  "rs9895862"   "rs7222401"  
[26] "rs540551010" "rs9890834"   "rs76498976"  "rs138050677"

$ENSG00000005073
 [1] "rs60642981"  "rs2525760"   "rs10273618"  "rs4722617"   "rs138820646"
 [6] "rs117987507" "rs17501111"  "rs10251237"  "rs147214544" "rs6947348"  
[11] "rs71539542"  "rs3735570"   "

In [None]:
library(parallel) # R package installed by default

# Step 1: Open the eQTL file:
eqtl_covid_exposure <- read.table("/data/eqtl_all_tissues_110.txt",
                                  header = TRUE,
                                  sep = "",
                                  stringsAsFactors = FALSE)

# Step 2: Open the GWAS input data set:
gwas_covid_outcome <- read.table("/data/long_covid_gwas_4_subset.txt",
                                 header = TRUE,
                                 sep = "",
                                 stringsAsFactors = FALSE)

# Step 3: Open the list of genes-SNPs:
result_selectedIVs <- readRDS("/data/long_covid_IVs_bs_4.rds")

# Step 4: Create a variable for the path to the LD matrix directory:
ld_matrix_dir <- "/data/ld_matrix_bs_4"

# Step 5: Set up the Mt_Robin method using the following setup function:
MR_MtRobin_setup <- function(geneID,
                             snpID,
                             eqtl_data,
                             gwas_data,
                             ld_matrix_path,
                             nTiss){
  tryCatch({
    ## Check if the gene is present and subset
    if(!(geneID %in% eqtl_data$gene_id)) stop("Gene is missing in eQTL dataset")
    eqtl_data <- subset(eqtl_data, gene_id == geneID)

    ## Confirm all SNPs are present in the dataset
    if(!all(snpID %in% eqtl_data$variant_id)) stop("Some SNPs missing in eQTL dataset")
    if(!all(snpID %in% gwas_data$variant_id)) stop("Some SNPs missing in GWAS dataset")

    # Read the LD matrix from a file
    LD <- as.matrix(utils::read.table(ld_matrix_path,
                                      header = TRUE,
                                      check.names = FALSE))

    if(!all(snpID %in% colnames(LD))) stop("Some SNPs missing in LD matrix")

    ## Align data
    eqtl_data <- eqtl_data[match(snpID,eqtl_data$variant_id),]
    gwas_data <- gwas_data[match(snpID,gwas_data$variant_id),]
    LD <- LD[match(snpID,rownames(LD)),match(snpID,colnames(LD))]

    eqtl_betas <- as.matrix(eqtl_data[,paste0("beta_",1:nTiss)])
    eqtl_se <- as.matrix(eqtl_data[,paste0("SE_",1:nTiss)])
    eqtl_pvals <- as.matrix(eqtl_data[,paste0("pvalue_",1:nTiss)])
    gwas_betas <- gwas_data$beta
    gwas_se <- gwas_data$SE

    row.names(eqtl_betas) <- row.names(eqtl_se) <- row.names(eqtl_pvals) <- eqtl_data$variant_id

    return(list(snpID=snpID,gwas_betas=gwas_betas,gwas_se=gwas_se,
                eqtl_betas=eqtl_betas,eqtl_se=eqtl_se,eqtl_pvals=eqtl_pvals,LD=LD))
  }, error = function(e) print(e))
}

# Step 6: Apply the previous setup function:

# Step 6.1: Initialize an empty list to store the results:
MR_MtRobin_results <- list()

# Step 6.2: Define the number of cores to use for parallel processing:
num_cores <- 36

# Step 6.3: Loop over each gene in result_selectedIVs:
MR_MtRobin_results <- mclapply(names(result_selectedIVs), function(geneID) {

  # Step 6.3.1: Get the SNPs associated with each gene:
  snpID <- result_selectedIVs[[geneID]]

  # Step 6.3.2: Generate a unique file path for the LD matrix of this gene:
  ld_matrix_path <- file.path(ld_matrix_dir, paste0("ld_matrix_", geneID, ".txt"))

  # Step 6.3.3: Call the MR_MtRobin_setup() function with the parameters for each gene:
  return(tryCatch({
    MR_MtRobin_setup(geneID = geneID,
                     snpID = snpID,
                     eqtl_data = eqtl_covid_exposure,
                     gwas_data = gwas_covid_outcome,
                     ld_matrix_path = ld_matrix_path,
                     nTiss = 49)
  }, error = function(e) print(e)))
}, mc.cores = num_cores)

# Step 7: Assign names to the list:
names(MR_MtRobin_results) <- names(result_selectedIVs)

# Step 8: Save the previous result as a .rds file:
saveRDS(MR_MtRobin_results, file = "/data/long_covid_setup_bs_4.rds")
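
`MR_MtRobin_setup()` relies on `match()` to put the eQTL rows, the GWAS rows, and the LD matrix in the same SNP order before they are combined. The cell below is a minimal Python sketch of that alignment step (plain Python, since this notebook runs on a Python kernel); the helper names and the toy values are hypothetical and are not part of the pipeline.

```python
def align_to(order, names, values):
    """Reorder `values` (labelled by `names`) to follow `order`, like R's match()."""
    index = {name: i for i, name in enumerate(names)}
    return [values[index[name]] for name in order]


def align_square(order, names, matrix):
    """Reorder the rows and columns of a square matrix labelled by `names`."""
    index = {name: i for i, name in enumerate(names)}
    positions = [index[name] for name in order]
    return [[matrix[i][j] for j in positions] for i in positions]


# Toy data (made up): the selected SNPs arrive in a different order
# from the GWAS table and the LD matrix.
snp_order = ["rs3", "rs1", "rs2"]
table_names = ["rs1", "rs2", "rs3"]
gwas_beta = [0.1, 0.2, 0.3]
ld = [
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.1],
    [0.5, 0.1, 1.0],
]

aligned_beta = align_to(snp_order, table_names, gwas_beta)
aligned_ld = align_square(snp_order, table_names, ld)
print(aligned_beta)
print(aligned_ld)
```

If one object is later re-sorted (for example by a keyed table) while the matrix is not, the two silently fall out of step, so it is worth re-aligning by name whenever they are combined.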

In [None]:
#!/bin/bash
# The interpreter line above must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted (a), begins (b), ends (e).
# PBS directives must appear before the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory where the job was submitted:
cd "$PBS_O_WORKDIR"

# Write start and end timestamps to a per-job file (named by job ID) instead of overwriting the same file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/select_setup_bs.sif Rscript /data/setup_function_bs_4.R

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-29 17:02:20:
   Job Id:             93819310.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      50.10
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 12:54:56
   Memory Requested:   300.0GB               Memory Used: 80.8GB
   Walltime requested: 10:00:00            Walltime Used: 00:27:50
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R
# Open the output file:
result_setup_bs_4 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_setup_bs_4.rds")

# Filter out errors:
result_setup_bs_4 <- result_setup_bs_4[!sapply(result_setup_bs_4, function(x) inherits(x, "simpleError"))]

# Print the length and the first element:
print(length(result_setup_bs_4))
print(length(unique(result_setup_bs_4)))
print(result_setup_bs_4[[1]])

[1] 2636
[1] 2636
$snpID
 [1] "rs2206271"   "rs35305915"  "rs72901027"  "rs9367397"   "rs61660288" 
 [6] "rs141707622" "rs2477488"   "rs13202872"  "rs62406441"  "rs17667036" 
[11] "rs75397901"  "rs10948573"  "rs4715208"   "rs571503"   

$gwas_betas
 [1]  0.02738150 -0.03549410 -0.01634200  0.00006950 -0.03065830  0.12363800
 [7]  0.03828220  0.01946290  0.00770182 -0.01505850 -0.09779650  0.00386748
[13]  0.02011210 -0.00302458

$gwas_se
 [1] 0.0236648 0.0311378 0.0344939 0.0471503 0.0354960 0.0676588 0.0224505
 [8] 0.0255439 0.0275417 0.0241452 0.0923557 0.0242689 0.0261452 0.0224935

$eqtl_betas
              beta_1 beta_2 beta_3 beta_4 beta_5 beta_6 beta_7 beta_8 beta_9
rs2206271   0.160091     NA     NA     NA     NA     NA     NA     NA     NA
rs35305915        NA     NA     NA     NA     NA     NA     NA     NA     NA
rs72901027        NA     NA     NA     NA     NA     NA     NA     NA     NA
rs9367397         NA     NA     NA     NA     NA     NA     NA     NA     NA
rs61660288

In [None]:
# Step 1: Load the required libraries:
library(base) # R package installed by default
library(utils) # R package installed by default
library(parallel) # R package installed by default
library(Matrix) # R package installed by default
library(methods) # R package installed by default
library(stats) # R package installed by default
library(lme4) # It needs to be installed separately
library(data.table) # It needs to be installed separately

# Step 2: Main Mt_Robin function:
MR_MtRobin <- function(MR_MtRobin_input,pval_thresh=0.001){
  snpID <- MR_MtRobin_input$snpID
  gwas_betas <- MR_MtRobin_input$gwas_betas
  gwas_se <- MR_MtRobin_input$gwas_se
  eqtl_betas <- MR_MtRobin_input$eqtl_betas
  eqtl_se <- MR_MtRobin_input$eqtl_se
  eqtl_pvals <- MR_MtRobin_input$eqtl_pvals
  LD <- MR_MtRobin_input$LD

  if(ncol(eqtl_betas) != ncol(eqtl_se) | ncol(eqtl_betas) != ncol(eqtl_pvals)){
    stop("eQTL betas, SE and p-values do not all have same number of columns")
  }

  ## determine number of tissues (or conditions/studies/etc.)
  nT <- ncol(eqtl_betas)

  ## rename columns for standardized processing
  colnames(eqtl_betas) <- paste0("beta_",1:nT)
  colnames(eqtl_se) <- paste0("SE_",1:nT)
  colnames(eqtl_pvals) <- paste0("pvalue_",1:nT)

  ## combine eQTL results into data.table
  eqtl_res <- data.table::data.table(snpID,eqtl_betas,eqtl_se,eqtl_pvals)

  # reshape to long format (each SNP-gene-tissue will be row)
  eQTL_res_melt <- data.table::melt(eqtl_res,id.vars=c("snpID"),
                                    measure.vars=patterns("^beta_","^SE_","^pvalue_"),value.name=c("beta","se","pvalue"),
                                    variable.name="tissue")

  ## subset to strong instrumental variables (based on p-value threshold)
  eQTL_res_melt_PltThresh <- subset(eQTL_res_melt, pvalue < pval_thresh)

  ## set up gwas data for merging
  gwas_res <- data.table::data.table(snpID=snpID,gwas_beta=gwas_betas,gwas_se=gwas_se)

  ## merge eQTL and GWAS results
  data.table::setkey(eQTL_res_melt_PltThresh,snpID)
  data.table::setkey(gwas_res,snpID)
  merged_res <- data.table::merge.data.table(eQTL_res_melt_PltThresh,gwas_res)

  ## set up coefficients for reverse regression
  beta_x <- matrix(merged_res$beta,ncol=1)
  beta_y <- matrix(merged_res$gwas_beta,ncol=1)

  ## standard errors (for weights)
  se_x <- matrix(merged_res$se,ncol=1)
  se_y <- matrix(merged_res$gwas_se,ncol=1) ## not used by function; used in resampling (consider returning with results)

  ## identifiers
  snpID <- merged_res$snpID

  lme_res <- lme4::lmer(beta_x~(beta_y-1)+(beta_y-1|snpID),weights=1/se_x^2)

  ## return results from weighted regression with random slopes (and correlated errors) along with GWAS SE and LD (needed for resampling)
  return(list(lme_res=lme_res,gwas_res=gwas_res,LD=LD))
}

# Step 3: Read the previous file:
result_setup <- readRDS("/data/long_covid_setup_bs_4.rds")

# Step 4: Apply the main Mt_Robin function using parallel processing:
no_cores <- 36

cl <- makeCluster(no_cores) # Create a cluster with the number of cores
clusterExport(cl, "MR_MtRobin") # Export the MR_MtRobin function to all the nodes of the cluster
result_list_MtRobin <- parLapply(cl, result_setup, function(MR_MtRobin_input) {
  tryCatch({
    MR_MtRobin(MR_MtRobin_input = MR_MtRobin_input)
  }, error = function(e) print(e))
})
stopCluster(cl) # Stop the cluster once the computation is done

# Step 5: The names of the results will correspond to the gene ids:
names(result_list_MtRobin) <- names(result_setup)

# Step 6: Save the previous result as a .rds file:
saveRDS(result_list_MtRobin, file = "/data/long_covid_main_bs_4.rds")
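
As a rough intuition for the fixed effect reported by `lmer`, consider what remains if the random slope is dropped: a weighted regression of `beta_x` on `beta_y` through the origin, with weights `1/se_x^2` and closed-form slope Σ w·x·y / Σ w·y². This is a simplification and not what `MR_MtRobin()` fits; the Python sketch below (function name and toy values are hypothetical) only illustrates the role of the weights.

```python
def weighted_slope_through_origin(beta_x, beta_y, se_x):
    """Weighted least-squares slope of beta_x on beta_y with no intercept.

    Weights are 1 / se_x^2, mirroring the `weights` argument used in the
    mixed model, but the random slope per SNP is ignored here.
    """
    weights = [1.0 / s ** 2 for s in se_x]
    numerator = sum(w * x * y for w, x, y in zip(weights, beta_x, beta_y))
    denominator = sum(w * y * y for w, y in zip(weights, beta_y))
    return numerator / denominator


# Toy values (made up): beta_x is exactly twice beta_y, so the slope is 2.
toy_beta_y = [0.01, 0.02, -0.01]
toy_beta_x = [0.02, 0.04, -0.02]
toy_se_x = [0.1, 0.2, 0.1]

slope = weighted_slope_through_origin(toy_beta_x, toy_beta_y, toy_se_x)
print(slope)
```

SNP-tissue pairs with a small eQTL standard error get a large weight, so precisely estimated eQTL effects dominate the estimate.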

In [None]:
#!/bin/bash
# The interpreter line above must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=36
#PBS -l mem=300GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted (a), begins (b), ends (e).
# PBS directives must appear before the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory where the job was submitted:
cd "$PBS_O_WORKDIR"

# Write start and end timestamps to a per-job file (named by job ID) instead of overwriting the same file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/Mt_Robin/main_function.sif Rscript /data/main_function_bs_4.R

# Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2023-08-30 10:32:18:
   Job Id:             93872803.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      2.25
   NCPUs Requested:    36                     NCPUs Used: 36
                                           CPU Time Used: 00:01:46
   Memory Requested:   300.0GB               Memory Used: 8.04GB
   Walltime requested: 10:00:00            Walltime Used: 00:01:15
   JobFS requested:    200.0GB                JobFS used: 0B
======================================================================================

In [None]:
%%R
# Open the output file:
result_main_bs_4 <- readRDS("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_bs_4.rds")

# Filter out errors:
result_main_bs_4 <- result_main_bs_4[!sapply(result_main_bs_4, function(x) inherits(x, "simpleError"))]

# Print the number of genes, the number of unique results, and the first few elements:
print(length(result_main_bs_4))
print(length(unique(result_main_bs_4)))
print(head(result_main_bs_4))

[1] 145
[1] 145
$ENSG00000008196
$ENSG00000008196$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: 7.9038
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 0.000   
 Residual        5.069   
Number of obs: 15, groups:  snpID, 14
Fixed Effects:
beta_y  
-2.161  

$ENSG00000008196$gwas_res
         snpID   gwas_beta   gwas_se
1   rs10948573  0.00386748 0.0242689
2   rs13202872  0.01946290 0.0255439
3  rs141707622  0.12363800 0.0676588
4   rs17667036 -0.01505850 0.0241452
5    rs2206271  0.02738150 0.0236648
6    rs2477488  0.03828220 0.0224505
7   rs35305915 -0.03549410 0.0311378
8    rs4715208  0.02011210 0.0261452
9     rs571503 -0.00302458 0.0224935
10  rs61660288 -0.03065830 0.0354960
11  rs62406441  0.00770182 0.0275417
12  rs72901027 -0.01634200 0.0344939
13  rs75397901 -0.09779650 0.0923557
14   rs9367397  0.00006950 0.0471503

$ENSG00000008196$LD
               rs2206

In [None]:
########################################################################
# GADI:

# Obtain p-values from the MR-MtRobin results.
# A resampling procedure estimates the p-value for an MR-MtRobin object.
# Arguments:
#   MR_MtRobin_res:  list object returned by MR_MtRobin().
#   nsamp:           integer, number of samples used to estimate the p-value from a null distribution.
#   use_nonconverge: logical, whether to use samples resulting in non-convergence of the random slope.
# Returns a list of two elements:
#   pvalue:     numeric, the estimated p-value.
#   nsamp_used: integer, the number of samples used to estimate the p-value.
# nsamp_used is returned because not all nsamp samples may be used:
# if use_nonconverge = FALSE, samples are dropped when the model does not converge or results in a singular fit of the random slope.
# Documented usage:
# myRes <- MR_MtRobin(snpID,gwas_betas,gwas_se,eqtl_betas,eqtl_se,eqtl_pvals,LD)
# MR_MtRobin_resample(MR_MtRobin_res=myRes, nsamp=1e4, use_nonconverge=FALSE)

# Load packages:
library(lme4)
library(mvtnorm)
library(data.table)
library(parallel)

# Load main results:
results_1 <- readRDS("/scratch/sq95/sp6154/p_values/long_covid_main_bs_4.rds")
# Filter out errors:
results_1 <- results_1[!sapply(results_1, function(x) inherits(x, "simpleError"))]
# Export:
saveRDS(results_1, "/scratch/sq95/sp6154/p_values/long_covid_main_bs_4_no_errors.rds")

# Updated Function to calculate the p-value and FDR with parallel processing:
MR_MtRobin_resample <- function(results_1_path, nsamp=1e4, use_nonconverge=FALSE, output_file_path, num_cpus=1){
  # Check and load the results_1 dataset from the specified path:
  if(file.exists(results_1_path)){
    results_1 <- readRDS(results_1_path)
  } else {
    stop("The specified file path for results_1 does not exist.")
  }
  # Initialize a list to store results for each gene:
  pvalues_results <- list()
  # Loop over each gene:
  for(gene_id in names(results_1)) {
    # Check if necessary components are present and not NULL:
    if(is.null(results_1[[gene_id]][["lme_res"]]) || is.null(results_1[[gene_id]][["gwas_res"]]) || is.null(results_1[[gene_id]][["LD"]])) {
      next
    }
    # Extract the fitted model, the GWAS summary statistics and the LD matrix for this gene:
    lme_res <- results_1[[gene_id]][["lme_res"]]
    gwas_res <- results_1[[gene_id]][["gwas_res"]]
    LD <- results_1[[gene_id]][["LD"]]
    # Extract data from MR-MtRobin results:
    eqtl_betas <- lme_res@frame$beta_x
    snpID <- as.character(lme_res@frame$snpID) # Character names, so that columns are indexed by SNP name and not by factor codes
    weights <- lme_res@frame$weights
    tstat_MR_MtRobin <- summary(lme_res)$coefficients[1,3]
    gwas_se <- gwas_res$gwas_se
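    # Align the LD matrix with the SNP order of gwas_res (MR_MtRobin() sorts gwas_res by snpID via setkey(), while LD keeps the setup order):
    if(is.matrix(LD) && !is.null(rownames(LD))) LD <- LD[gwas_res$snpID, gwas_res$snpID, drop = FALSE]
    # Null distribution of the GWAS effects, sampled from a multivariate normal that accounts for LD correlations:
    se_diag <- diag(gwas_se, nrow = length(gwas_se)) # nrow keeps a 1 x 1 matrix when there is a single SNP
    beta_gwas_nullMat <- mvtnorm::rmvnorm(nsamp, mean=rep(0, length(gwas_se)), sigma=se_diag %*% LD %*% se_diag)
    colnames(beta_gwas_nullMat) <- gwas_res$snpID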
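    # Refit the mixed model on each null draw and collect the t statistic of the fixed effect:
    tstat_nulls <- mclapply(1:nsamp, function(i) {
      beta_gwas_null <- beta_gwas_nullMat[i, snpID]
      tryCatch({
        # LMM equation:
        lme_null <- lme4::lmer(eqtl_betas~(beta_gwas_null-1)+(beta_gwas_null-1|snpID), weights=weights)
        return(summary(lme_null)$coefficients[1,3])
      }, error=function(e) {
        # A fit that raises an error yields NA and is dropped below. use_nonconverge does not change the outcome here:
        # non-convergence and singular fits only produce warnings or messages, so those samples are kept.
        return(NA_real_)
      })
    }, mc.cores = num_cpus)
    tstat_nulls <- unlist(tstat_nulls)
    tstat_nulls <- tstat_nulls[!is.na(tstat_nulls)] # Remove NAs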
    nsamp_used <- length(tstat_nulls)
    # Calculate the p-value and round it to 3 decimal places:
    pval <- round(mean(abs(tstat_nulls) >= abs(tstat_MR_MtRobin)), digits = 3)
    # Store the results for the current gene in the list:
    pvalues_results[[gene_id]] <- list(pvalue=pval, nsamp_used=nsamp_used)
  }
  # Calculate FDR:
  all_pvalues <- sapply(pvalues_results, function(x) x$pvalue)
  fdr_values <- p.adjust(all_pvalues, method = "BH")
  # Combine results with FDR values:
  for(i in seq_along(fdr_values)) {
    pvalues_results[[names(fdr_values)[i]]]$fdr <- fdr_values[i]
  }
  # Prepare results dataframe:
  results_df <- do.call(rbind, lapply(names(pvalues_results), function(gene_id) {
    data.frame(gene_id = gene_id,
               pvalue = sprintf("%.3f", pvalues_results[[gene_id]]$pvalue), # Format pvalue with 3 decimal places
               nsamp_used = pvalues_results[[gene_id]]$nsamp_used,
               fdr = sprintf("%.15f", pvalues_results[[gene_id]]$fdr), # Keep full precision for fdr
               stringsAsFactors = FALSE)
  }))
  # Write the results to the output file, ensuring the formatting is kept
  write.table(results_df, file = output_file_path, sep = "\t", row.names = FALSE, quote = FALSE, col.names = TRUE)
  return(pvalues_results)
}

# Call the previous function:
MR_MtRobin_resample(
  results_1_path = "/scratch/sq95/sp6154/p_values/long_covid_main_bs_4_no_errors.rds",
  nsamp = 10000, # Number of bootstrap samples
  use_nonconverge = FALSE, # Whether to use non-converging results, typically FALSE
  output_file_path = "/scratch/sq95/sp6154/p_values/MR_pvalue_FDR_results_bs4.txt",
  num_cpus = 48 # Number of CPUs
)
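
The p-value computed inside `MR_MtRobin_resample()` is the share of null t statistics that are at least as extreme, in absolute value, as the observed one. The Python sketch below (plain Python, hypothetical function name, made-up numbers) shows only that final step; it does not draw LD-correlated null effects or refit any model.

```python
def empirical_pvalue(t_observed, t_null):
    """Two-sided empirical p-value: share of |t_null| >= |t_observed|."""
    exceed = sum(abs(t) >= abs(t_observed) for t in t_null)
    return exceed / len(t_null)


# Toy null t statistics (made up for illustration).
toy_null = [-3.0, -1.0, 0.5, 2.0, 2.5]
toy_pvalue = empirical_pvalue(2.0, toy_null)
print(toy_pvalue)
```

With `nsamp` null draws the smallest non-zero value this estimate can take is `1/nsamp`, and it is exactly zero when no null statistic is as extreme as the observed one. A common alternative adds one to both the numerator and the denominator so that the estimate is never zero; the pipeline above does not do this.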

In [None]:
#!/bin/bash
# The interpreter line above must be the first line of the script.

# Specify the project code:
#PBS -P sq95

# Specify the queue:
#PBS -q hugemem

# Request resources:
#PBS -l ncpus=48
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l walltime=10:00:00

# Start the job in the directory from which it was submitted:
#PBS -l wd

# Receive alerts by email - aborted (a), begins (b), ends (e).
# PBS directives must appear before the first executable command, otherwise they are ignored:
#PBS -M pinsy007@mymail.unisa.edu.au
#PBS -m abe

# Change to the directory where the job was submitted:
cd "$PBS_O_WORKDIR"

# Write start and end timestamps to a per-job file (named by job ID) instead of overwriting the same file:
date > "test1_out_${PBS_JOBID}.txt"
sleep 60
date >> "test1_out_${PBS_JOBID}.txt"

###########################################################################################################
# Specify the work to be done:

# Load the required modules:
module load R/4.2.2
module load gcc/12.2.0
module load intel-mkl/2021.2.0
module load singularity

# Option 1: Run the R script using the Singularity container:
singularity exec --bind $PBS_O_WORKDIR:/data /scratch/sq95/sp6154/p_values/p_values.sif Rscript /data/p_values_bs_4.R

# Option 2: Run the R script without any container:
#R --vanilla < parallelism_test.R > output_test

###########################################################################################################

# Exit cleanly:
exit 0

In [None]:
======================================================================================
                  Resource Usage on 2024-02-08 18:06:52:
   Job Id:             107837488.gadi-pbs
   Project:            sq95
   Exit Status:        0
   Service Units:      33.94
   NCPUs Requested:    24                     NCPUs Used: 24
                                           CPU Time Used: 10:08:22
   Memory Requested:   200.0GB               Memory Used: 19.28GB
   Walltime requested: 10:00:00            Walltime Used: 00:28:17
   JobFS requested:    100.0GB                JobFS used: 0B
======================================================================================
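As a rough check of how well the requested resources were used, the figures in the report above can be turned into a CPU efficiency and a memory utilisation. This is a minimal sketch: the helper `to_seconds` and the variable names are illustrative, and the numbers are copied from the report.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the resource usage report above.
ncpus_used = 24
cpu_time = to_seconds("10:08:22")
walltime_used = to_seconds("00:28:17")
memory_requested_gb = 200.0
memory_used_gb = 19.28

# CPU efficiency: CPU time consumed divided by CPU time available.
cpu_efficiency = cpu_time / (ncpus_used * walltime_used)

# Fraction of the requested memory that was actually used.
memory_utilisation = memory_used_gb / memory_requested_gb

print(f"CPU efficiency:     {cpu_efficiency:.1%}")
print(f"Memory utilisation: {memory_utilisation:.1%}")
```

With these numbers the CPUs were busy for most of the run, while less than a tenth of the requested memory was used, which suggests that the memory request could probably be reduced for similar jobs.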

In [None]:
%%R

# Read the results with the p_value and FDR values:
results <- read.table("/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_bs4.txt", header=TRUE, sep="\t")
print(dim(results))
print(head(results))

[1] 145   4
          gene_id pvalue nsamp_used       fdr
1 ENSG00000008196  0.258      10000 0.8382192
2 ENSG00000068781  0.878      10000 0.9583083
3 ENSG00000112494  0.655      10000 0.8952752
4 ENSG00000113905  0.776      10000 0.9132661
5 ENSG00000117148  0.745      10000 0.9132661
6 ENSG00000122965  0.374      10000 0.8382192
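
The `fdr` column holds p-values adjusted for multiple testing. The notebook does not show how the column was computed, so the following is only a sketch of one common choice, the Benjamini-Hochberg adjustment (the procedure behind R's `p.adjust(method = "BH")`). The function name `benjamini_hochberg` and the example p-values are illustrative and are not taken from the results file.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative p-values (not from the results table).
example_pvalues = [0.01, 0.04, 0.03, 0.5]
example_fdr = benjamini_hochberg(example_pvalues)
print(example_fdr)
```

Because of the running minimum, neighbouring p-values can share the same adjusted value, which is the pattern seen in the `fdr` column above (for example, 0.258 and 0.374 both map to 0.8382192).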


In [None]:
%%R

# Call the analyze_gene_data() function to organize the Mt_Robin results for significant and non-significant genes:

# Create the output paths:
save_paths <- list(
  all_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs4_all_genes_sig_non_sig.txt",
  protein_coding_genes = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs4_protein_coding_genes_sig_not_sig.txt"
)

# Call the function:
analyze_gene_data(
  result_main_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_bs_4.rds",
  eqtl_long_covid_exposure_path = "/content/drive/MyDrive/Colab/Long_COVID/eQTL/eqtl_all_tissues_110.txt",
  save_files = save_paths
)

Saved all_genes as all_genes to /content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs4_all_genes_sig_non_sig.txt 
Dimensions of all_genes : 145 10 

          gene_id     gene_name      gene_type    beta_y Obs SNPs
1 ENSG00000008196        TFAP2B protein_coding -2.161423  15    1
2 ENSG00000068781 STON1-GTF2A1L protein_coding  8.932349  29    2
3 ENSG00000112494        UNC93A protein_coding -6.110370   9    8
4 ENSG00000113905           HRG protein_coding -1.004191  15   14
5 ENSG00000117148         ACTL8 protein_coding -2.903997   5    4
6 ENSG00000122965         RBM19 protein_coding -3.602768  49   10
                                                                                                                                                   SNP_IDs
1     rs10948573,rs13202872,rs141707622,rs17667036,rs2206271,rs2477488,rs35305915,rs4715208,rs571503,rs61660288,rs62406441,rs72901027,rs75397901,rs9367397
2                                                                                 

In [None]:
%%R

# BS4 final table:
# Call the function that retrieves more information from the Mt_Robin results and filters the significant genes:
process_final_data(input_path_1 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/long_covid_main_bs_4_no_errors.rds",
                   input_path_2 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs4_protein_coding_genes_sig_not_sig.txt",
                   input_path_3 = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/MR_pvalue_FDR_results_bs4.txt",
                   output_path = "/content/drive/MyDrive/Colab/Long_COVID/MtRobin/bs4_protein_coding_genes_sig_full_table.txt")

[1] "Causal genes (noncoding genes and protein coding genes): "
[1] 145
$lme_res
Linear mixed model fit by REML ['lmerMod']
Formula: beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
Weights: 1/se_x^2
REML criterion at convergence: 7.9038
Random effects:
 Groups   Name   Std.Dev.
 snpID    beta_y 0.000   
 Residual        5.069   
Number of obs: 15, groups:  snpID, 14
Fixed Effects:
beta_y  
-2.161  

$gwas_res
         snpID   gwas_beta   gwas_se
1   rs10948573  0.00386748 0.0242689
2   rs13202872  0.01946290 0.0255439
3  rs141707622  0.12363800 0.0676588
4   rs17667036 -0.01505850 0.0241452
5    rs2206271  0.02738150 0.0236648
6    rs2477488  0.03828220 0.0224505
7   rs35305915 -0.03549410 0.0311378
8    rs4715208  0.02011210 0.0261452
9     rs571503 -0.00302458 0.0224935
10  rs61660288 -0.03065830 0.0354960
11  rs62406441  0.00770182 0.0275417
12  rs72901027 -0.01634200 0.0344939
13  rs75397901 -0.09779650 0.0923557
14   rs9367397  0.00006950 0.0471503

$LD
               rs2206271   rs3

**Note:** No significant protein-coding genes.

## **Analysis**

**Results:**

In [None]:
$ ENSG00000102606:List of 3
  ..$ lme_res :Formal class 'lmerMod' [package "lme4"] with 13 slots
  .. .. ..@ resp   :Reference class 'lmerResp' [package "lme4"] with 9 fields
  .. .. .. ..$ Ptr    :<externalptr>
  .. .. .. ..$ mu     : num [1:37] 0.355 0.355 0.355 0.355 0.355 ...
  .. .. .. ..$ offset : num [1:37] 0 0 0 0 0 0 0 0 0 0 ...
  .. .. .. ..$ sqrtXwt: num [1:37] 42.3 31.1 25.2 28.3 18.4 ...
  .. .. .. ..$ sqrtrwt: num [1:37] 42.3 31.1 25.2 28.3 18.4 ...
  .. .. .. ..$ weights: num [1:37] 1786 966 634 803 338 ...
  .. .. .. ..$ wtres  : num [1:37] -3.0474 -0.9313 -0.0708 3.6392 0.5418 ...
  .. .. .. ..$ y      : num [1:37] 0.282 0.325 0.352 0.483 0.384 ...
  .. .. .. ..$ REML   : int 1
  .. .. .. ..and 28 methods, of which 14 are  possibly relevant:
  .. .. .. ..  allInfo, copy#envRefClass, initialize, initialize#lmResp,
  .. .. .. ..  initializePtr, initializePtr#lmResp, objective, ptr, ptr#lmResp,
  .. .. .. ..  setOffset, setResp, setWeights, updateMu, wrss
  .. .. ..@ Gp     : int [1:2] 0 7
  .. .. ..@ call   : language lme4::lmer(formula = beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID), weights = 1/se_x^2)
  .. .. ..@ frame  :'data.frame':	37 obs. of  4 variables:
  .. .. .. ..$ beta_x   : num [1:37, 1] 0.282 0.325 0.352 0.483 0.384 ...
  .. .. .. ..$ beta_y   : num [1:37, 1] -0.056 -0.056 -0.056 -0.056 -0.056 ...
  .. .. .. ..$ snpID    : Factor w/ 7 levels "rs2296353","rs41275852",..: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..$ (weights): num [1:37, 1] 1786 966 634 803 338 ...
  .. .. .. ..- attr(*, "terms")=Classes 'terms', 'formula'  language beta_x ~ (beta_y - 1) + (beta_y - 1 + snpID)
  .. .. .. .. .. ..- attr(*, "variables")= language list(beta_x, beta_y, snpID)
  .. .. .. .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
  .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. .. .. ..$ : chr [1:3] "beta_x" "beta_y" "snpID"
  .. .. .. .. .. .. .. ..$ : chr [1:2] "beta_y" "snpID"
  .. .. .. .. .. ..- attr(*, "term.labels")= chr [1:2] "beta_y" "snpID"
  .. .. .. .. .. ..- attr(*, "order")= int [1:2] 1 1
  .. .. .. .. .. ..- attr(*, "intercept")= int 0
  .. .. .. .. .. ..- attr(*, "response")= int 1
  .. .. .. .. .. ..- attr(*, ".Environment")=<environment: 0x0000013780e25478>
  .. .. .. .. .. ..- attr(*, "predvars")= language list(beta_x, beta_y, snpID)
  .. .. .. .. .. ..- attr(*, "dataClasses")= Named chr [1:4] "nmatrix.1" "nmatrix.1" "character" "nmatrix.1"
  .. .. .. .. .. .. ..- attr(*, "names")= chr [1:4] "beta_x" "beta_y" "snpID" "(weights)"
  .. .. .. .. .. ..- attr(*, "predvars.fixed")= language list(beta_x, beta_y)
  .. .. .. .. .. ..- attr(*, "varnames.fixed")= chr [1:3] "beta_x" "beta_y" "(weights)"
  .. .. .. .. .. ..- attr(*, "predvars.random")= language list(beta_x, beta_y, snpID)
  .. .. .. ..- attr(*, "formula")=Class 'formula'  language beta_x ~ (beta_y - 1) + (beta_y - 1 | snpID)
  .. .. .. .. .. ..- attr(*, ".Environment")=<environment: 0x0000013780e25478>
  .. .. ..@ flist  :List of 1
  .. .. .. ..$ snpID: Factor w/ 7 levels "rs2296353","rs41275852",..: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. .. ..- attr(*, "assign")= int 1
  .. .. ..@ cnms   :List of 1
  .. .. .. ..$ snpID: chr "beta_y"
  .. .. ..@ lower  : num 0
  .. .. ..@ theta  : num 0.712
  .. .. ..@ beta   : num -4.76
  .. .. ..@ u      : num [1:7] -2.202 0.199 -0.271 -0.757 0.492 ...
  .. .. ..@ devcomp:List of 2
  .. .. .. ..$ cmp : Named num [1:10] 7.97 1.87 267.71 20.49 288.2 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:10] "ldL2" "ldRX2" "wrss" "ussq" ...
  .. .. .. ..$ dims: Named int [1:12] 37 37 1 36 7 1 1 1 0 1 ...
  .. .. .. .. ..- attr(*, "names")= chr [1:12] "N" "n" "p" "nmp" ...
  .. .. ..@ pp     :Reference class 'merPredD' [package "lme4"] with 18 fields
  .. .. .. ..$ Lambdat:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. .. .. .. ..@ i       : int [1:7] 0 1 2 3 4 5 6
  .. .. .. .. .. ..@ p       : int [1:8] 0 1 2 3 4 5 6 7
  .. .. .. .. .. ..@ Dim     : int [1:2] 7 7
  .. .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. ..@ x       : num [1:7] 0.712 0.712 0.712 0.712 0.712 ...
  .. .. .. .. .. ..@ factors : list()
  .. .. .. ..$ LamtUt :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. .. .. .. ..@ i       : int [1:37] 0 0 0 0 0 0 0 0 0 0 ...
  .. .. .. .. .. ..@ p       : int [1:38] 0 1 2 3 4 5 6 7 8 9 ...
  .. .. .. .. .. ..@ Dim     : int [1:2] 7 37
  .. .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. .. ..$ : chr [1:37] "1" "2" "3" "4" ...
  .. .. .. .. .. ..@ x       : num [1:37] -1.685 -1.239 -1.004 -1.13 -0.733 ...
  .. .. .. .. .. ..@ factors : list()
  .. .. .. ..$ Lind   : int [1:7] 1 1 1 1 1 1 1
  .. .. .. ..$ Ptr    :<externalptr>
  .. .. .. ..$ RZX    : num [1:7, 1] 9.3981 1.0636 1.424 0.4506 0.0181 ...
  .. .. .. ..$ Ut     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. .. .. .. ..@ i       : int [1:37] 0 0 0 0 0 0 0 0 0 0 ...
  .. .. .. .. .. ..@ p       : int [1:38] 0 1 2 3 4 5 6 7 8 9 ...
  .. .. .. .. .. ..@ Dim     : int [1:2] 7 37
  .. .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. .. ..$ : chr [1:7] "rs2296353" "rs41275852" "rs9515381" "rs9522179" ...
  .. .. .. .. .. .. ..$ : chr [1:37] "1" "2" "3" "4" ...
  .. .. .. .. .. ..@ x       : num [1:37] -2.37 -1.74 -1.41 -1.59 -1.03 ...
  .. .. .. .. .. ..@ factors : list()
  .. .. .. ..$ Utr    : num [1:7] -408.842 -6.915 -11.756 -3.559 0.411 ...
  .. .. .. ..$ V      : num [1:37, 1] -2.37 -1.74 -1.41 -1.59 -1.03 ...
  .. .. .. ..$ VtV    : num [1, 1] 110
  .. .. .. ..$ Vtr    : num -631
  .. .. .. ..$ X      : num [1:37, 1] -0.056 -0.056 -0.056 -0.056 -0.056 ...
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : chr [1:37] "1" "2" "3" "4" ...
  .. .. .. .. .. ..$ : chr "beta_y"
  .. .. .. .. ..- attr(*, "assign")= int 1
  .. .. .. .. ..- attr(*, "msgScaleX")= chr(0)
  .. .. .. ..$ Xwts   : num [1:37] 42.3 31.1 25.2 28.3 18.4 ...
  .. .. .. ..$ Zt     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. .. .. .. ..@ i       : int [1:37] 0 0 0 0 0 0 0 0 0 0 ...
  .. .. .. .. .. ..@ p       : int [1:38] 0 1 2 3 4 5 6 7 8 9 ...
  .. .. .. .. .. ..@ Dim     : int [1:2] 7 37
  .. .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. .. ..$ : chr [1:7] "rs2296353" "rs41275852" "rs9515381" "rs9522179" ...
  .. .. .. .. .. .. ..$ : chr [1:37] "1" "2" "3" "4" ...
  .. .. .. .. .. ..@ x       : num [1:37] -0.056 -0.056 -0.056 -0.056 -0.056 ...
  .. .. .. .. .. ..@ factors : list()
  .. .. .. ..$ beta0  : num 0
  .. .. .. ..$ delb   : num -4.76
  .. .. .. ..$ delu   : num [1:7] -2.202 0.199 -0.271 -0.757 0.492 ...
  .. .. .. ..$ theta  : num 0.712
  .. .. .. ..$ u0     : num [1:7] 0 0 0 0 0 0 0
  .. .. .. ..and 45 methods, of which 31 are  possibly relevant:
  .. .. .. ..  b, beta, CcNumer, copy#envRefClass, initialize, initializePtr,
  .. .. .. ..  installPars, L, ldL2, ldRX2, linPred, P, ptr, RX, RXdiag, RXi,
  .. .. .. ..  setBeta0, setDelb, setDelu, setTheta, setZt, solve, solveU, sqrL, u,
  .. .. .. ..  unsc, updateDecomp, updateL, updateLamtUt, updateRes, updateXwts
  .. .. ..@ optinfo:List of 8
  .. .. .. ..$ optimizer: chr "nloptwrap"
  .. .. .. ..$ control  :List of 1
  .. .. .. .. ..$ print_level: num 0
  .. .. .. ..$ derivs   :List of 2
  .. .. .. .. ..$ gradient: num 1.19e-06
  .. .. .. .. ..$ Hessian : num [1, 1] 17.1
  .. .. .. ..$ conv     :List of 2
  .. .. .. .. ..$ opt : num 0
  .. .. .. .. ..$ lme4: list()
  .. .. .. ..$ feval    : int 21
  .. .. .. ..$ message  : chr "NLOPT_XTOL_REACHED: Optimization stopped because xtol_rel or xtol_abs (above) was reached."
  .. .. .. ..$ warnings : list()
  .. .. .. ..$ val      : num 0.712
  ..$ gwas_res:Classes ‘data.table’ and 'data.frame':	7 obs. of  3 variables:
  .. ..$ snpID    : chr [1:7] "rs2296353" "rs41275852" "rs9515381" "rs9522179" ...
  .. ..$ gwas_beta: num [1:7] -0.05602 -0.04274 0.06575 0.07151 -0.00874 ...
  .. ..$ gwas_se  : num [1:7] 0.0321 0.0409 0.0312 0.0297 0.03 ...
  .. ..- attr(*, ".internal.selfref")=<externalptr>
  .. ..- attr(*, "sorted")= chr "snpID"
  ..$ LD      : num [1:7, 1:7] 1 0.662 -0.41 0.415 -0.481 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:7] "rs2296353" "rs9560010" "rs9515381" "rs41275852" ...
  .. .. ..$ : chr [1:7] "rs2296353" "rs9560010" "rs9515381" "rs41275852" ...

**Interpretation of the Results:**

**Linear Mixed Model (`lmerMod`) components:**

1. `mu:` The vector of fitted values, or predicted means, of the response variable based on the fitted model.
  
2. `sqrtXwt:` Square root of the weights applied to the model matrices (X and Z). Used in internal calculations.

3. `sqrtrwt:` Square root of the weights applied to the residuals. Used in internal calculations.

4. `weights:` Prior weights assigned to each observation, typically used to handle heteroscedasticity or to give certain observations more influence in the model fit. Here the call sets them to `1/se_x^2` (inverse-variance weights), so more precisely estimated `beta_x` values count more.

5. `wtres:` Weighted residuals, which are the differences between observed and fitted values, each multiplied by the square root of its corresponding weight.

6. `y:` The response variable (`beta_x` in this case), which the model is trying to predict.

7. `REML:` Indicates whether Restricted Maximum Likelihood (REML) was used to estimate the parameters of the linear mixed model (`1` here, so REML was used). The related fit statistic is the `REML criterion at convergence` printed in the model summary:
- The REML criterion is -2 times the restricted log-likelihood at convergence, so it is on the deviance scale. It can be positive or negative, and among comparable models a lower value indicates a better fit.
- REML is used instead of `Maximum Likelihood Estimation` (ML) because ML tends to underestimate the variance components, whereas REML gives less biased estimates.
- Useful for comparing different models fitted to the same dataset. It is not useful for comparing models across different datasets.
- While a lower REML value indicates a better fit, it doesn't necessarily mean that the model is the best possible model. Overfitting is a concern if the model is too complex.
- REML criteria are only comparable between models that share the same fixed effects and differ in their random effects (e.g., nested models with and without a random slope).
- To compare models with different fixed effects (e.g., one model has additional predictors compared to the other), refit them with ML first.
- In a two-sample setting, with one dataset for the exposures (e.g., gene expression levels for multiple genes) and another for the outcome (e.g., a specific clinical trait), each gene's model is fitted to a different set of observations. The REML criterion can describe how well each gene's model fits its own data, but differences between genes should be treated as a rough guide at most, not as a formal comparison.
- Both input datasets should represent the same or very similar underlying populations for the model fit to be meaningful. If the datasets are drastically different, the fit may not be valid.
- Ensure that the models for different genes are specified in the same way (i.e., using the same covariates, random effects structure, etc.) before looking at their REML values side by side.
- A better fit does not necessarily mean the gene is more biologically relevant to the outcome of interest. Biological plausibility and other external evidence should also be considered.
- REML is just one measure of model fit. Other metrics and validation methods should also be considered, for example predictive accuracy on a held-out test set.
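
To illustrate why REML gives less biased variance estimates than ML, here is a minimal sketch for the simplest possible case: a weighted regression through the origin with no random effects. The data, the function `weighted_slope`, and the variable names are made up for illustration. In a full mixed model the same degrees-of-freedom idea carries over to the variance components, although the computation is more involved.

```python
def weighted_slope(x, y, w):
    """Weighted least-squares slope for a regression through the origin."""
    numerator = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    denominator = sum(wi * xi * xi for xi, wi in zip(x, w))
    return numerator / denominator


# Made-up predictor, response, and inverse-variance weights.
x = [-0.05, 0.02, 0.07, -0.01, 0.04]
y = [0.30, -0.10, -0.35, 0.08, -0.15]
w = [400.0, 250.0, 900.0, 100.0, 625.0]

slope = weighted_slope(x, y, w)

# Weighted residual sum of squares around the fitted line.
wrss = sum(wi * (yi - slope * xi) ** 2 for xi, yi, wi in zip(x, y, w))

n = len(y)  # number of observations
p = 1       # number of fixed-effect coefficients (the slope)

# ML divides by n and ignores that the slope was estimated from the same data.
sigma2_ml = wrss / n
# REML divides by n - p, which accounts for the estimated fixed effect.
sigma2_reml = wrss / (n - p)

print(f"slope = {slope:.3f}, ML variance = {sigma2_ml:.4f}, REML variance = {sigma2_reml:.4f}")
```

The ML estimate is always the smaller of the two, and the gap grows as the number of fixed effects approaches the number of observations.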

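**Data Frame (`frame`) components and related model slots:**

1. `beta_x:` The response variable in the model (a column of `frame`).

2. `beta_y:` The predictor variable in the model (a column of `frame`). Its values match `gwas_beta` in `gwas_res` (e.g., -0.056 for `rs2296353`).

3. `lower:` Lower bound for each element of `theta` during optimization (0 here).

4. `theta:` The relative covariance parameter of the random effects. With a single random slope it is the random-effect SD divided by the residual SD, not a variance.

5. `beta:` The estimated fixed-effects coefficients. The model has no intercept (`- 1` in the formula), so `beta` is the single slope for `beta_y` (-4.76 here).

6. `u:` The conditional modes of the standardized ("spherical") random effects, one per `snpID` level. The random effects on the original scale are obtained by scaling `u` with `theta`.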
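
The slots `theta`, `u`, and `beta` can be combined to recover SNP-specific slopes. The sketch below assumes the structure seen in the output above: one scalar random slope per SNP, so that the relative covariance factor is `theta` times the identity (all non-zero entries of `Lambdat` equal 0.712). Only the first five of the seven `u` values are printed, so only those are used. The variable names are illustrative.

```python
# Values copied from the str() output for ENSG00000102606.
theta = 0.712
beta = -4.76
u = [-2.202, 0.199, -0.271, -0.757, 0.492]  # first five of seven values

# Random effects on the original scale: b = Lambda * u = theta * u.
b = [theta * ui for ui in u]

# SNP-specific slope = fixed slope + that SNP's random deviation.
snp_slopes = [beta + bi for bi in b]

# Fitted value for the first SNP (rs2296353), whose GWAS beta is -0.05602.
fitted_first_snp = snp_slopes[0] * -0.05602

print([round(value, 3) for value in b])
print([round(value, 3) for value in snp_slopes])
print(round(fitted_first_snp, 3))
```

The fitted value for the first SNP comes out close to the 0.355 printed in `mu`, up to rounding of the displayed inputs.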

**Deviance Components (`devcomp`):**

1. `cmp:` Various components needed to compute the model deviance, AIC, BIC, etc.

2. `dims:` Various dimensions related to the model, like the number of observations, number of parameters, etc.

**Predictor Module (`pp`) components:**

1. `x:` Appears multiple times. In each sparse matrix (`dgCMatrix`) it stores the non-zero entries of that matrix.

2. `RZX:` The cross-term block of the Cholesky factor that links the random effects (Z) to the fixed effects (X).

3. `Utr:` An internal working vector formed by multiplying `Ut` (the weighted random-effects model matrix) by the weighted response or residual vector.

4. `V:` The weighted fixed-effects model matrix, i.e. `X` with each row multiplied by `Xwts` (e.g., 42.3 × -0.056 ≈ -2.37). `VtV` is its cross-product.

5. `Xwts:` Square-root weights applied to the predictor matrix (X). The values match `sqrtXwt`.

6. `delu:` The increment to the random effects `u` computed during optimization.
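
The printed values are consistent with `Xwts` being the square root of `weights` and `V` being `X` scaled row by row with `Xwts`. The following sketch checks this on the first five observations, which all belong to the first SNP (`rs2296353`, GWAS beta -0.05602). The numbers are copied from the output above and are rounded there, so agreement is only expected to the displayed precision.

```python
import math

# First five prior weights from the output (1/se_x^2, rounded as printed).
weights = [1786, 966, 634, 803, 338]

# The predictor for these rows is the GWAS beta of rs2296353.
x = [-0.05602] * len(weights)

# Square-root weights, as stored in Xwts and sqrtXwt.
xwts = [math.sqrt(w) for w in weights]

# Weighted predictor, as stored in V.
v = [xw * xi for xw, xi in zip(xwts, x)]

print([round(value, 1) for value in xwts])  # compare with Xwts
print([round(value, 2) for value in v])     # compare with V
```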

---

**Downstream analysis:**

**Basic Model Summary:**
1. `beta:` The relationship between the predictor and the response variable.
2. `theta:` The variability of the random effects relative to the residual variability.
3. `u:` Used to make predictions for specific groups in the random effects.

**Model Quality:**
1. `REML:` Whether REML or ML (Maximum Likelihood) was used affects the comparability of models, especially when comparing models with different fixed effects.
2. `cmp:` Components like AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion) can be computed from these, which are useful for model comparison.
3. `dims:` Dimensions (e.g., number of observations, number of parameters, etc.) to understand the scope of the model.

**Residual Analysis:**
1. `wtres:` Weighted residuals for diagnostic plots to assess model fit.
2. `mu:` The fitted values can be compared to the actual values to assess how well the model is fitting the data.
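
As a sketch of how these residual quantities relate, the weighted residuals can be recomputed from `y`, `mu`, and `weights` as the square root of the weight times (observed minus fitted). The values below are the first five entries printed in the `str()` output. Because the printed inputs are rounded, the recomputed residuals only approximately match the printed `wtres`.

```python
import math

# First five entries copied from the str() output for ENSG00000102606.
y = [0.282, 0.325, 0.352, 0.483, 0.384]    # observed beta_x
mu = [0.355, 0.355, 0.355, 0.355, 0.355]   # fitted values
weights = [1786, 966, 634, 803, 338]       # prior weights
printed_wtres = [-3.0474, -0.9313, -0.0708, 3.6392, 0.5418]

# Weighted residual = sqrt(weight) * (observed - fitted).
wtres = [math.sqrt(w) * (yi - mi) for yi, mi, w in zip(y, mu, weights)]

for computed, printed in zip(wtres, printed_wtres):
    print(f"computed {computed:8.4f}   printed {printed:8.4f}")
```

Plotting such weighted residuals against the fitted values is a common way to look for patterns that the model has missed.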

**Weights and Other Diagnostics:**
1. `weights:` Important to carry these weighted values forward into any downstream analyses.
2. `sqrtXwt, sqrtrwt:` Can be used in specific types of weighted or penalized analyses.

**Specific Genetic Analysis:**
1. `gwas_res:` Crucial for any genetic-specific downstream analysis.
2. `LD:` Linkage disequilibrium information for genetic studies.

**Enrichment Analysis:**
1. `beta:` The estimated fixed effects parameters can be crucial for enrichment analyses. These represent the relationship between the predictors and the response variable. A stronger effect (either positive or negative) can highlight genes more likely to be involved in the biological process.
2. `gwas_res:` Essential for finding enrichments specific to certain genetic loci.
3. `LD:` To understand whether multiple SNPs contribute independently to a trait or are just proxies for one another.

**Prediction:**
1. `mu:` The fitted values can be used for making predictions on the training set. They can serve as a baseline for predicting the same outcome in a different cohort.
2. `X, Zt, etc.:` To do predictions for new data points based on the model.
3. `theta:` The relative covariance parameter of the random effects can help to understand how much of the variability in the outcome is accounted for by the random effects, which might be important depending on the nature of the predictions.
4. `u:` Random effects estimates can also be useful for making predictions, especially for predicting outcomes for specific groups that are represented in the random effects.

**Drug Repurposing:**
1. `beta:` The fixed effects estimates are crucial. Genes with strong effects are candidates for targeting by drugs.
2. `gwas_res:` The original GWAS results could provide additional context for why a gene might be a good target for drug repurposing.

**General Important Variables:**
1. `weights:` Important when interpreting the results.
2. `REML:` Whether REML was used. The REML criterion at convergence summarizes the goodness of fit of the model.

---

**Most important variables for Drug Repurposing:**

`gene_id:` ID of each gene.

`gene_name:` Name of each gene.

`beta_y:`
- Causal effect value of each gene.
- A higher absolute value (further from 0) --> a stronger estimated causal effect.

`Obs:`
- Number of observations used to fit the gene's model (`Number of obs` in the model output), i.e. SNP-tissue eQTL associations. A SNP contributes one observation for each tissue in which it is associated with the expression of the gene.

`SNPs:`
- Number of SNPs that influence the expression of a specific gene (gene_expression-SNP association).

`REML criterion at convergence:`
- Goodness of fit of the model.
- Lower values --> better fit, but only among models fitted to the same data.

`SD:` Standard deviation of the random effects.

`Residual_SD:`
- The standard deviation of the residuals, i.e. of the differences between the observed and predicted values of the dependent variable (`beta_x`, the SNP effect on gene expression).
- Smaller values --> the model is doing a better job of capturing the variation in the data.

---

**Most important variables for Prediction:**

All the previous variables and:

- `theta:` Relative standard deviation of the random effects (random-effect SD divided by residual SD).
- `u:` Random effect estimate.
- `mu:` The vector of fitted values, or predicted means, of the response variable based on the fitted model.

---

**Other important variables:**
- `weights:` Higher/Lower influence of the observations.
- `cmp:` Components of the model fit, from which criteria such as AIC and BIC can be computed.
- `dims:` Number of observations/parameters to understand the scope of the model.

---

**SD vs. Residual SD:**
- The Standard Deviation (SD) and the Residual Standard Deviation (Residual SD) refer to different aspects of variability in the data.

**SD:**
- The SD of the random effects of `beta_y` across different levels of `snpID`.
- The SD for random effects helps us to understand how much the groups differ.
- The value that indicates the amount of variability in the random slope for `beta_y` across different `snpID` groups. `A higher value indicates that there is substantial variability across groups, suggesting that including a random effect is appropriate.`

**Residual SD:**
- The standard deviation of the differences between the observed and predicted values of the dependent variable (`beta_x`, the SNP effect on gene expression).
- The residual SD helps us to understand how well the model fits the data.
- The "typical" difference between the observed values of the dependent variable and the values predicted by the model. `Smaller values indicate that the model is doing a better job of capturing the variation in the data.`

**SD and Variance vs. beta_y:**

The Standard Deviation (SD) and Variance values are not directly comparable to the `beta_y` values for both fixed and random effects:

**Different Units and Interpretations:**

1. **Fixed Effects**: The SE for a fixed effect estimate `beta_y` shows how much that estimate would vary when taking many samples from the same population and estimating the model each time. It is not a measure of the "size" of the effect but rather of the precision of the estimate.
  
2. **Random Effects**: The variance and SD for a random effect are measures of how much that effect varies across different levels of the grouping variable (`snpID`) and cannot be compared to the fixed effect estimate.

**Scale Sensitivity:**

Both SD and variance are sensitive to the scale of the variables involved. If `beta_y` and `beta_x` are measured in different units or have different scales, then their SDs and variances will also be on different scales, making them incomparable.
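
A small sketch of this scale sensitivity: multiplying a variable by a constant multiplies its SD by the same constant and its variance by the constant squared. The values are the first five `beta_x` entries from the output above, and the scaling factor of 10 is arbitrary.

```python
import statistics

# First five beta_x values from the str() output.
values = [0.282, 0.325, 0.352, 0.483, 0.384]

# Arbitrary change of units (for illustration only).
scale = 10.0
rescaled = [scale * value for value in values]

sd_original = statistics.stdev(values)
sd_rescaled = statistics.stdev(rescaled)
var_original = statistics.variance(values)
var_rescaled = statistics.variance(rescaled)

print(f"SD ratio:       {sd_rescaled / sd_original:.1f}")    # equals scale
print(f"Variance ratio: {var_rescaled / var_original:.1f}")  # equals scale squared
```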

**Model Complexity and Sample Size:**

The SD and variance can also be influenced by the complexity of the model and the sample size, which are independent of the value of the `beta_y`.

**What Can Be Compared?**

1. **Coefficient Significance**: Assess the significance of the `beta_y` value by looking at its `t-value`, which takes the SE into account.

2. **Confidence Intervals**: Compute confidence intervals for `beta_y` using its SE (e.g., estimate ± 1.96 × SE for an approximate 95% interval) to get an idea of its possible range of values.
  
3. **Variance Components**: For random effects, look at the estimated variance and SD to understand the degree of variability in `beta_y` across different levels of `snpID`.

4. **Model Comparison**: While the SD and variance aren't directly comparable to `beta_y`, they can be used for model comparison. For example, among models fitted to the same data, one with a lower residual SD and smaller SEs for its estimates is generally considered to be a better-fitting model, all else being equal.
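
The first two points can be sketched as follows. The estimate is the fixed effect printed earlier in the `lmer` output for the first gene (-2.161). The standard error is not shown in that output, so the value used here is purely hypothetical, and the function name `wald_summary` is illustrative. The interval is a normal-approximation (Wald) interval, which is only one of several ways to build a confidence interval for a mixed model.

```python
def wald_summary(estimate, std_error, z=1.96):
    """Return the t-value and an approximate 95% Wald confidence interval."""
    t_value = estimate / std_error
    lower = estimate - z * std_error
    upper = estimate + z * std_error
    return t_value, (lower, upper)


beta_y_hat = -2.161      # fixed-effect estimate from the lmer output
se_hypothetical = 1.5    # hypothetical SE, not taken from the notebook

t_value, (ci_low, ci_high) = wald_summary(beta_y_hat, se_hypothetical)

print(f"t-value: {t_value:.2f}")
print(f"Approximate 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

If the interval contains 0, as it does with this hypothetical SE, the estimate is compatible with no effect at the 5% level.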