# **Validation of a Mitochondrial Polygenic Score (MGS) for Parkinson's Disease - LRRK2**

## Project Title: Validation of a Mitochondrial Polygenic Score (MGS) for Parkinson's Disease

**Versions:** GATK 4.3.0.0, Python 3.10.12, R 4.4.2

**Note:** Before running this notebook, make sure you already have the score files for all ancestries (these are generated in notebook 00).
This notebook is written for the European (EUR) ancestry group. To apply it to another ancestry group, change "EUR" to one of the following labels:

* African Admixed (AAC)
* African (AFR)
* Ashkenazi Jewish (AJ)
* American Admixed (AMR)
* Complex Admixture History (CAH)
* Central Asian (CAS)
* East Asian (EAS)
* European (EUR)
* Middle Eastern (MDE)
* South Asian (SAS)
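
Switching ancestry groups amounts to substituting the label in the file names used later in this notebook. The following is a minimal sketch, assuming the naming pattern of the score and covariate files copied in the sections below (`MGS_ALL_<LABEL>_release6_score.profile` and `<LABEL>_gp2_covs.csv`). The helper name `files_for_ancestry` is hypothetical and not part of the original workflow.

```python
# Hypothetical helper: build the per-ancestry file names used in this notebook.
# CAH is included because CAH files are copied and merged later in the notebook.
ANCESTRIES = ["AAC", "AFR", "AJ", "AMR", "CAH", "CAS", "EAS", "EUR", "MDE", "SAS"]

def files_for_ancestry(ancestry):
    """Return (score_file, covariate_file) names for one ancestry label."""
    if ancestry not in ANCESTRIES:
        raise ValueError(f"Unknown ancestry label: {ancestry}")
    score_file = f"MGS_ALL_{ancestry}_release6_score.profile"
    covs_file = f"{ancestry}_gp2_covs.csv"
    return score_file, covs_file

print(files_for_ancestry("EUR"))
```

Validating the label up front gives a clear error for a mistyped ancestry code, rather than a missing-file failure further down.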

## Description:

- [1. Getting started](#getting-started)
- [2. Copying data to workspace](#copying-data-to-workspace)
- [3. Merging Files](#merge-the-files)
- [4. LRRK2 Analyses](#lrrk2-analyses)
- [5. Saving results](#saving)


For more information contact Joshua Ooi

Last updated: 06/02/2025

# Getting Started

## Load Python Libraries

In [None]:
# Use the os package to interact with the environment (helps me find the proper paths to things)
import os
import sys

# Bring in Pandas for Dataframe functionality (popular python packages to read in my data, manipulate, subset, filter, etc.)
import pandas as pd
from functools import reduce

# Bring some visualization functionality (visualisation package useful when plotting stuff)
import seaborn as sns

# numpy for basics (mathematics package)
import numpy as np

# Use StringIO for working with file contents (so that it's interacting with terra cloud)
from io import StringIO

# Enable IPython to display matplotlib graphs (also a visualisation package)
import matplotlib.pyplot as plt
%matplotlib inline

# Enable interaction with the FireCloud API (needs to be enabled in order to have terra interacting with data in the buckets)
from firecloud import api as fapi

# Import the IPython HTML rendering for displaying links to Google Cloud Console (just to display things in a jupyter notebook)
# (IPython.display is the public module; IPython.core.display is deprecated for these imports)
from IPython.display import display, HTML

# Import urllib modules for building URLs to Google Cloud Console (for interactions between terra and the cloud)
import urllib.parse

# BigQuery for querying data (for interactions between terra and the cloud)
from google.cloud import bigquery

print('Buenos Dias, Joshua!')

## Define Python Functions to Interact with GCP/Terra

In [None]:
# Utility routine for printing a shell command before executing it
def shell_do(command):
    print(f'Executing: {command}', file=sys.stderr)
    !$command

def shell_return(command):
    print(f'Executing: {command}', file=sys.stderr)
    output = !$command
    return '\n'.join(output)

# Utility routine for printing a query before executing it
def bq_query(query):
    print(f'Executing: {query}', file=sys.stderr)
    return pd.read_gbq(query, project_id=BILLING_PROJECT_ID, dialect='standard')

# Utility routine for displaying a message and a link
def display_html_link(description, link_text, url):
    html = f'''
    <p>
    </p>
    <p>
    {description}
    <a target=_blank href="{url}">{link_text}</a>.
    </p>
    '''

    display(HTML(html))

# Utility routines for reading files from Google Cloud Storage
def gcs_read_file(path):
    """Return the contents of a file in GCS"""
    contents = !gsutil -u {BILLING_PROJECT_ID} cat {path}
    return '\n'.join(contents)

def gcs_read_csv(path, sep=None):
    """Return a DataFrame from the contents of a delimited file in GCS"""
    return pd.read_csv(StringIO(gcs_read_file(path)), sep=sep, engine='python')

# Utility routine for displaying a message and link to Cloud Console
def link_to_cloud_console_gcs(description, link_text, gcs_path):
    url = '{}?{}'.format(
        os.path.join('https://console.cloud.google.com/storage/browser',
                     gcs_path.replace("gs://","")),
        urllib.parse.urlencode({'userProject': BILLING_PROJECT_ID}))

    display_html_link(description, link_text, url)

## Initialise Work Environment Variables

In [None]:
# Set up billing project and data path variables
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
WORKSPACE_NAMESPACE = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE_NAME = os.environ['WORKSPACE_NAME']
WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']

WORKSPACE_ATTRIBUTES = fapi.get_workspace(WORKSPACE_NAMESPACE, WORKSPACE_NAME).json().get('workspace',{}).get('attributes',{})

## Accessing GP2 Data

In [None]:
##  GP2 v6.0
GP2_RELEASE_PATH = '/GP2/release/path' ##  Enter valid GP2 Release path
GP2_CLINICAL_RELEASE_PATH = f'{GP2_RELEASE_PATH}/clinical_data'
GP2_META_RELEASE_PATH = f'{GP2_RELEASE_PATH}/meta_data'
GP2_SUMSTAT_RELEASE_PATH = f'{GP2_RELEASE_PATH}/summary_statistics'
GP2_RAW_GENO_PATH = f'{GP2_RELEASE_PATH}/raw_genotypes'
GP2_IMPUTED_GENO_PATH = f'{GP2_RELEASE_PATH}/imputed_genotypes'
GP2_WGS_PATH = f'{GP2_RELEASE_PATH}/wgs'
print('GP2 v6.0')
print(f'Path to GP2 v6.0 Clinical Data: {GP2_CLINICAL_RELEASE_PATH}')
print(f'Path to GP2 v6.0 Raw Genotype Data: {GP2_RAW_GENO_PATH}')
print(f'Path to GP2 v6.0 Imputed Genotype Data: {GP2_IMPUTED_GENO_PATH}')
print(f'Path to GP2 v6.0 Metadata: {GP2_META_RELEASE_PATH}')
print(f'Path to GP2 v6.0 WGS Data: {GP2_WGS_PATH}')

## Install PLINK

In [None]:
%%capture
%%bash
# Install plink 1.9
cd /home/jupyter/
# The binary is renamed to plink1.9 below, so check for that name
if test -e /home/jupyter/plink1.9; then

echo "Plink is already installed in /home/jupyter/"
else
echo "Plink is not installed"
cd /home/jupyter

wget http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20190304.zip

unzip -o plink_linux_x86_64_20190304.zip
mv plink plink1.9

fi

In [None]:
%%capture
%%bash
# Install plink 2.0
cd /home/jupyter/
if test -e /home/jupyter/plink2; then

echo "Plink2 is already installed in /home/jupyter/"
else
echo "Plink2 is not installed"
cd /home/jupyter/

wget http://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_latest.zip

unzip -o plink2_linux_x86_64_latest.zip

fi

In [None]:
%%bash
# chmod plink 1.9 to make sure I have permission to run the program
chmod u+x /home/jupyter/plink1.9

In [None]:
%%bash
# chmod plink 2.0 to make sure I have permission to run the program
chmod u+x /home/jupyter/plink2

## Install R

In [None]:
# Install rpy2, the Python-to-R bridge that provides the %%R cell magic used below
! pip install --upgrade rpy2

In [None]:
pip install --upgrade pip

In [None]:
%load_ext rpy2.ipython

# Copying Data to Workspace

## Make a Folder

In [None]:
print("Making a working directory")
WORK_DIR = f'/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO'
shell_do(f'mkdir -p {WORK_DIR}')

## Copy Files from GP2 Buckets Over to My Folder

####  **Score files**

In [None]:
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/MGS_ALL_AFR_release6_score.profile {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/MGS_ALL_AJ_release6_score.profile {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/MGS_ALL_AAC_release6_score.profile {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/MGS_ALL_AMR_release6_score.profile {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/MGS_ALL_CAH_release6_score.profile {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/MGS_ALL_CAS_release6_score.profile {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/MGS_ALL_EAS_release6_score.profile {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/MGS_ALL_EUR_release6_score.profile {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/MGS_ALL_MDE_release6_score.profile {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/MGS_ALL_SAS_release6_score.profile {WORK_DIR}')

####  **Covariate files**

In [None]:
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/AFR_gp2_covs.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/AJ_gp2_covs.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/AAC_gp2_covs.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/AMR_gp2_covs.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/CAH_gp2_covs.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/CAS_gp2_covs.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/EAS_gp2_covs.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/EUR_gp2_covs.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/MDE_gp2_covs.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/SAS_gp2_covs.csv {WORK_DIR}')

# Merge the files

####  **Score files**

In [None]:
%%bash -s "$WORK_DIR"
cd $1

# Copy the AFR file to create the merged file, keeping its header
cp MGS_ALL_AFR_release6_score.profile MGS_ALL_MERGED_release6_score.profile

# Loop through the other files and append them without their headers
for file in MGS_ALL_AJ_release6_score.profile MGS_ALL_AAC_release6_score.profile MGS_ALL_AMR_release6_score.profile MGS_ALL_CAH_release6_score.profile MGS_ALL_CAS_release6_score.profile MGS_ALL_EAS_release6_score.profile MGS_ALL_EUR_release6_score.profile MGS_ALL_MDE_release6_score.profile MGS_ALL_SAS_release6_score.profile
do
  tail -n +2 "$file" >> MGS_ALL_MERGED_release6_score.profile
done
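
The shell loop above keeps the header of the first file and appends every other file from its second line onward. Below is a rough Python equivalent, offered only as a sketch: the function name `merge_keep_first_header` and the demo file names are hypothetical. The trailing-newline guard is an extra safety step that the `tail -n +2` loop does not perform.

```python
import os
import tempfile
from pathlib import Path

def merge_keep_first_header(paths, out_path):
    """Concatenate text files, keeping the header line of the first file only."""
    with open(out_path, "w") as out:
        for i, path in enumerate(paths):
            lines = Path(path).read_text().splitlines(keepends=True)
            # Skip the header (first line) for every file after the first
            body = lines if i == 0 else lines[1:]
            out.writelines(body)
            # Guard against a missing trailing newline gluing two rows together
            if body and not body[-1].endswith("\n"):
                out.write("\n")

# Tiny demonstration with two throwaway files (made-up contents)
tmp = tempfile.mkdtemp()
first = os.path.join(tmp, "first.csv")
second = os.path.join(tmp, "second.csv")
merged = os.path.join(tmp, "merged.csv")
Path(first).write_text("IID,SCORE\ns1,0.1\n")
Path(second).write_text("IID,SCORE\ns2,0.2")  # no trailing newline on purpose
merge_keep_first_header([first, second], merged)
print(Path(merged).read_text())
```

This approach assumes every input file has the same columns in the same order. Neither the shell loop nor this sketch checks that.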

####  **Covariate files**

In [None]:
%%bash -s "$WORK_DIR"
cd $1


# Step 1: Copy the AFR file to create the merged file, keeping its header
cp AFR_gp2_covs.csv MERGED_gp2_covs.csv

# Step 2: Loop through the other files and append them without their headers
for file in AJ_gp2_covs.csv AAC_gp2_covs.csv AMR_gp2_covs.csv CAH_gp2_covs.csv CAS_gp2_covs.csv EAS_gp2_covs.csv EUR_gp2_covs.csv MDE_gp2_covs.csv SAS_gp2_covs.csv
do
  tail -n +2 "$file" >> MERGED_gp2_covs.csv
done

####  **General**

In [None]:
%%R
pack <- "/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO"
temp_data <- read.table("/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO/MGS_ALL_MERGED_release6_score.profile", header = T)
temp_covs <- read.csv("/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO/MERGED_gp2_covs.csv", header = T, sep=",")
data <- merge(temp_data, temp_covs, by = "IID")
data$CASE <- data$PHENO - 1
data$sex_for_qc <- as.numeric(data$sex_for_qc)
meanControls <- mean(data$SCORE[data$CASE == 0])
sdControls <- sd(data$SCORE[data$CASE == 0])
data$zSCORE <- (data$SCORE - meanControls)/sdControls
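
The last three lines of the cell above standardise every sample's score against the control distribution, so a z-score of 1 means "one control standard deviation above the control mean". Here is a minimal Python sketch of the same arithmetic, using made-up numbers and a hypothetical function name:

```python
from statistics import mean, stdev

def standardise_to_controls(scores, is_case):
    """Z-score every sample against the control mean and sample SD."""
    controls = [s for s, c in zip(scores, is_case) if c == 0]
    mu = mean(controls)
    sigma = stdev(controls)  # sample SD (n - 1 denominator), like R's sd()
    return [(s - mu) / sigma for s in scores]

# Illustrative values only: three controls followed by one case
z = standardise_to_controls([1.0, 2.0, 3.0, 5.0], [0, 0, 0, 1])
print(z)
```

Here the controls have mean 2 and SD 1, so the case with raw score 5 lands three control SDs above the control mean.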

In [None]:
%%R

table(data$CASE)

In [None]:
%%R

# Drop samples whose phenotype is coded -9 (CASE = PHENO - 1 = -10)
data <- data[data$CASE != -10, ]

In [None]:
%%R

table(data$CASE)

In [None]:
%%R
grsTests <- glm(CASE ~ zSCORE + sex_for_qc + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + age, family="binomial", data = data)
summary(grsTests)

# Extract beta and SE from the logistic regression model
beta <- coef(grsTests)["zSCORE"]
SE <- summary(grsTests)$coefficients["zSCORE", "Std. Error"]

# Calculate OR, U95, and L95
OR <- exp(beta)
U95 <- exp((beta) + (1.96 * SE))
L95 <- exp((beta) - (1.96 * SE))

# Print results
print(summary(grsTests))

# Print results
print(OR)
print(L95)
print(U95)
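
The odds ratio and its 95% confidence interval come from exponentiating the log-odds coefficient and its Wald bounds, `exp(beta ± 1.96 * SE)`. The sketch below repeats that calculation in Python with made-up inputs. The function name is hypothetical and the numbers are not results from this analysis.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) for a log-odds coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Made-up inputs for illustration only
or_, lower, upper = odds_ratio_ci(beta=0.0, se=0.1)
print(or_, lower, upper)
```

Because zSCORE is standardised, the odds ratio is read as the change in odds per one standard deviation of the score.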

In [None]:
%%R
cases <- subset(data, CASE == 1)
meanPop <- mean(cases$SCORE)
sdPop <- sd(cases$SCORE)
cases$zSCORE <- (cases$SCORE - meanPop)/sdPop
grsTests <- lm(age_of_onset ~ zSCORE + sex_for_qc + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10, data = cases)
summary(grsTests)

# *LRRK2* Analyses

## Bring in *LRRK2, PRKN, PINK1* carrier lists

In [None]:
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/LRRK2_carriers_all.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/LRRK2_all_carriers_merged.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/pp_hombil.csv {WORK_DIR}')
shell_do(f'gsutil -u {BILLING_PROJECT_ID} -m cp /enter/relevant/path/pp_hombilhet.csv {WORK_DIR}')

## Risk analysis

####  **Indicate *LRRK2* carriers in merged covariate file**

In [None]:
import pandas as pd
WORK_DIR = f'/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO'
FULL_PATH = WORK_DIR + '/LRRK2_all_carriers_merged.csv'

# Load the merged covariate file (subject IDs, phenotype and covariates)
a_main_df = pd.read_csv('/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO/MERGED_gp2_covs.csv')

# Load the CSV file containing the list of carrier subject IDs
a_carrier_df = pd.read_csv(FULL_PATH, delimiter='\t')

# Create a new column in the main dataframe to indicate carrier status
a_main_df["Carrier"] = 0

# Mark every subject whose IID appears in the carrier list (vectorised, same result as looping)
a_main_df["Carrier"] = a_main_df['IID'].isin(a_carrier_df['IID']).astype(int)

# Save the updated dataframe to a new CSV file
a_main_df.to_csv('/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO/updated_all_MERGED_gp2_covs.csv', index=False)

####  **Identify *LRRK2* carriers who are also cases and standardise the scores**

In [None]:
%%R
pack <- "/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO"
temp_data <- read.table("/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO/MGS_ALL_MERGED_release6_score.profile", header = T)
temp_covs <- read.csv("/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO/updated_all_MERGED_gp2_covs.csv", header = T, sep=",")
data <- merge(temp_data, temp_covs, by = "IID")

# Create the lrrk2pd column based on phenotype and carrier status
data$lrrk2pd <- ifelse(data$PHENO == 1 & data$Carrier == 0, 0,
                       ifelse(data$PHENO == 2 & data$Carrier == 1, 1, NA))

# Create the binary CASE variable (0 = control, 1 = case) from PHENO (1/2)
data$CASE <- data$PHENO - 1

# Convert sex_for_qc to numeric
data$sex_for_qc <- as.numeric(data$sex_for_qc)

# Calculate the mean and standard deviation for the controls where lrrk2pd == 0
meanHControls <- mean(data$SCORE[data$lrrk2pd == 0], na.rm = TRUE)
sdHControls <- sd(data$SCORE[data$lrrk2pd == 0], na.rm = TRUE)

# Standardizing the score based on meanHControls and sdHControls
data$zSCORE <- (data$SCORE - meanHControls) / sdHControls
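
The nested `ifelse` above defines the outcome for the carrier analysis: non-carrier controls are coded 0, carrier cases are coded 1, and everyone else (carrier controls and non-carrier cases) becomes `NA` and drops out of the model. The same rule is written out below as a small Python function, as a sketch with a hypothetical name:

```python
def lrrk2pd_label(pheno, carrier):
    """Mirror of the nested ifelse: 0 = non-carrier control, 1 = carrier case."""
    if pheno == 1 and carrier == 0:
        return 0
    if pheno == 2 and carrier == 1:
        return 1
    return None  # everyone else is excluded (NA in R)

# All four phenotype/carrier combinations
for pheno, carrier in [(1, 0), (1, 1), (2, 0), (2, 1)]:
    print(pheno, carrier, lrrk2pd_label(pheno, carrier))
```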

In [None]:
%%R

table(data$lrrk2pd)

####  **Weighting**

In [None]:
%%R

data$weights <- ifelse(data$PHENO == 2, 3, 1)

In [None]:
%%R

# Perform logistic regression
grsTests <- glm(lrrk2pd ~ zSCORE + sex_for_qc + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + age, family = binomial, data = data, weights = weights)
summary(grsTests)

# Extract beta and SE from the logistic regression model
beta <- coef(grsTests)["zSCORE"]
SE <- summary(grsTests)$coefficients["zSCORE", "Std. Error"]

# Calculate OR, U95, and L95
OR <- exp(beta)
U95 <- exp((beta) + (1.96 * SE))
L95 <- exp((beta) - (1.96 * SE))

# Print results
print(summary(grsTests))

# Print results
print(OR)
print(L95)
print(U95)

## Comparing between cases and controls

####  **Count means and SDs of standardised scores**

In [None]:
%%R

library(dplyr)

# Calculate mean and sd for standardised scores by CASE
summary_stats <- data %>%
  group_by(CASE) %>%
  summarize(
    mean_standardized = mean(zSCORE, na.rm = TRUE),
    sd_standardized = sd(zSCORE, na.rm = TRUE)
  ) %>%
  mutate(CASE = ifelse(CASE == 0, "Controls", "Cases")) %>%
  rename(Group = CASE)


print(summary_stats)

### **Deciding which statistical test to use to compare means**

### *Check for normality of control data*

In [None]:
%%R

library(ggplot2)

# Histogram for Controls (CASE == 0)
ggplot(data[data$CASE == 0, ], aes(x = zSCORE)) +
  geom_histogram(binwidth = 0.2, fill = "blue", alpha = 0.7) +
  ggtitle("Histogram of zSCORE for Controls")

In [None]:
%%R

library(ggplot2)

# Histogram for Controls (CASE == 0)
ggplot(data[data$CASE == 0, ], aes(x = zSCORE)) +
  geom_histogram(binwidth = 0.2, fill = "blue", alpha = 0.7) +
  ggtitle("Histogram of zSCORE for Controls")


# Q-Q plots to check normality for controls
qqnorm(data$zSCORE[data$CASE == 0], main = "Q-Q Plot for Controls")
qqline(data$zSCORE[data$CASE == 0])

###  *Check for normality of cases data*

In [None]:
%%R

library(ggplot2)

# Histogram for Cases (CASE == 1)
ggplot(data[data$CASE == 1, ], aes(x = zSCORE)) +
  geom_histogram(binwidth = 0.2, fill = "red", alpha = 0.7) +
  ggtitle("Histogram of zSCORE for Cases")

In [None]:
%%R

library(ggplot2)

# Histogram for Cases (CASE == 1)
ggplot(data[data$CASE == 1, ], aes(x = zSCORE)) +
  geom_histogram(binwidth = 0.2, fill = "red", alpha = 0.7) +
  ggtitle("Histogram of zSCORE for Cases")

# Q-Q plots to check normality for cases
qqnorm(data$zSCORE[data$CASE == 1], main = "Q-Q Plot for Cases")
qqline(data$zSCORE[data$CASE == 1])

###  *Check for equal variances*

In [None]:
%%R
invisible(install.packages("car"))

In [None]:
%%R

data$CASE <- as.factor(data$CASE)

# Run Levene's test
library(car)
leveneTest(zSCORE ~ CASE, data = data)

## Comparisons

###  **T-test or Mann–Whitney**

###  *If the data are normally distributed, have no outliers and have equal variances (Student's t-test)*

In [None]:
%%R
t.test(zSCORE ~ CASE, data = data, var.equal = TRUE)

###  *If variances are unequal (Welch's t-test), or the data are not normally distributed or have outliers (Mann–Whitney)*

In [None]:
%%R
t.test(zSCORE ~ CASE, data = data, var.equal = FALSE)

In [None]:
%%R
wilcox.test(zSCORE ~ CASE, data = data)

## Visualisations

In [None]:
%%R
install.packages("wesanderson")

In [None]:
%%R
library(wesanderson)

In [None]:
%%R

library(ggplot2)
library(wesanderson)

colors <- wes_palette(name = "Chevalier1", n = 2)

swapped_colors <- colors[c(2, 1)]

p <- ggplot(data, aes(x = reorder(as.factor(CASE), zSCORE), y = zSCORE, fill = as.factor(CASE))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.4, fill = "white") +
  theme_minimal() +
  scale_fill_manual(values = swapped_colors) +
  theme_bw() +
  ylab("Standardized MGS") +
  xlab("") +
  theme(legend.position = "none")

# Save the plot as a JPEG file in the working directory created earlier in this notebook
ggsave("/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO/EUR_violin.jpeg", plot = p, dpi = 600, units = "in", height = 6, width = 6)

# Display the plot
p

In [None]:
%%R
pack <- "/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO"
temp_data <- read.table("/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO/MGS_ALL_MERGED_release6_score.profile", header = T)
temp_covs <- read.csv("/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO/updated_all_MERGED_gp2_covs.csv", header = T, sep=",")
data <- merge(temp_data, temp_covs, by = "IID")

data$CASE <- data$PHENO - 1
data <- data[data$CASE != -10, ]
meanControls <- mean(data$SCORE[data$CASE == 0])
sdControls <- sd(data$SCORE[data$CASE == 0])
head(data)
data$zSCORE <- (data$SCORE - meanControls)/sdControls

Model <- glm(CASE ~ SCORE, data = data, family = 'binomial')
data$probDisease <- predict(Model, data, type = c("response"))
data$predicted <- ifelse(data$probDisease > 0.5, "DISEASE", "CONTROL")
data$reported <- ifelse(data$CASE == 1, "DISEASE","CONTROL")

# Density plot
densPlot <- ggplot(data, aes(probDisease, fill = reported, color = reported)) + geom_density(alpha = 0.5) + theme_bw()
ggsave(plot = densPlot, filename = "/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO/density.png", width = 8, height = 5, units = "in", dpi = 300)
densPlot

####  **Downsample controls to a 1:3 case-to-control ratio**

In [None]:
%%R
num_cases <- sum(data$CASE == 1)

set.seed(123)
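controls <- data[data$CASE == 0, ]
# Sample row positions within the control subset (never more than are available)
downsampled_controls <- controls[sample(nrow(controls), min(nrow(controls), 3 * num_cases)), ]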

balanced_data <- rbind(downsampled_controls, data[data$CASE == 1, ])
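
The cell above keeps every case and draws three controls per case at random, without replacement and with a fixed seed so the draw is reproducible. Below is a Python sketch of the same idea on row indices. The function name and toy flags are hypothetical, and Python's seed 123 will not reproduce the rows chosen by R's `set.seed(123)`.

```python
import random

def downsample_controls(case_flags, ratio=3, seed=123):
    """Return sorted row indices: every case plus `ratio` random controls per case."""
    case_idx = [i for i, c in enumerate(case_flags) if c == 1]
    control_idx = [i for i, c in enumerate(case_flags) if c == 0]
    # Never ask for more controls than exist (sampling is without replacement)
    n_keep = min(len(control_idx), ratio * len(case_idx))
    rng = random.Random(seed)
    kept_controls = rng.sample(control_idx, n_keep)
    return sorted(case_idx + kept_controls)

# Toy data: 2 cases followed by 10 controls
flags = [1, 1] + [0] * 10
kept = downsample_controls(flags)
print(kept)
```

Sampling positions within the control subset, rather than positions in the full table, avoids picking rows that do not exist in the subset.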

In [None]:
%%R

library(ggplot2)
library(wesanderson)

colors <- wes_palette(name = "Chevalier1", n = 2)

swapped_colors <- colors[c(2, 1)]

p <- ggplot(balanced_data, aes(x = reorder(as.factor(CASE), zSCORE), y = zSCORE, fill = as.factor(CASE))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.4, fill = "white") +
  theme_minimal() +
  scale_fill_manual(values = swapped_colors) +
  theme_bw() +
  ylab("Standardized MGS") +
  xlab("") +
  theme(legend.position = "none")

# Save to the working directory created earlier in this notebook
ggsave("/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO/EUR.jpeg", plot = p, dpi = 600, units = "in", height = 6, width = 6)

p

## AUC analyses

In [None]:
%%R
install.packages("pROC")

In [None]:
%%R
library(pROC)

In [None]:
%%R

library(pROC)

Model <- glm(CASE ~ SCORE, data = data, family = 'binomial')

data$probDisease <- predict(Model, data, type = "response")

roc_curve <- roc(data$CASE, data$probDisease)

# Print the AUC value
auc_value <- auc(roc_curve)
cat("AUC Value:", auc_value, "\n")

# Plot the ROC curve
plot(roc_curve, main = "ROC Curve", col = "blue")

In [None]:
%%R

auc_ci <- ci(roc_curve)

# Print AUC and CI
cat("AUC:", auc(roc_curve), "\n")
cat("95% CI for AUC:", auc_ci[1], "-", auc_ci[3], "\n")

# Check whether the CI excludes 0.5
if (auc_ci[1] > 0.5 | auc_ci[3] < 0.5) {
  cat("The AUC is significantly different from 0.5 at the 95% confidence level.\n")
} else {
  cat("The AUC is not significantly different from 0.5 at the 95% confidence level.\n")
}
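
One way to read the AUC: it is the probability that a randomly chosen case has a higher predicted value than a randomly chosen control, with ties counting as one half. An AUC of 0.5 therefore means no discrimination. The brute-force Python sketch below makes this concrete with made-up scores. The function name is hypothetical, and it always scores "case higher than control", whereas pROC's `roc()` chooses the comparison direction automatically by default.

```python
def auc_from_scores(case_scores, control_scores):
    """AUC as the fraction of case/control pairs where the case scores higher (ties count 0.5)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Made-up predicted probabilities: 3 of the 4 case/control pairs are ordered correctly
print(auc_from_scores([0.8, 0.6], [0.7, 0.2]))  # 0.75
```

The pairwise loop is quadratic in sample size, so it suits small illustrations only. Rank-based formulas give the same value far faster.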

####  **Downsample**

In [None]:
%%R

set.seed(123)
cases <- data[data$CASE == 1, ]
controls <- data[data$CASE == 0, ]
downsampled_controls <- controls[sample(nrow(controls), nrow(cases)), ]
downsampled_data <- rbind(cases, downsampled_controls)

In [None]:
%%R

Model <- glm(CASE ~ SCORE, data = downsampled_data, family = 'binomial')
downsampled_data$probDisease <- predict(Model, downsampled_data, type = "response")
roc_curve <- roc(downsampled_data$CASE, downsampled_data$probDisease)
auc_value <- auc(roc_curve)
cat("AUC Value (Downsampled):", auc_value, "\n")

In [None]:
%%R

library(pROC)

Model <- glm(CASE ~ SCORE, data = downsampled_data, family = 'binomial')

downsampled_data$probDisease <- predict(Model, downsampled_data, type = "response")

roc_curve <- roc(downsampled_data$CASE, downsampled_data$probDisease)

# Print AUC value
auc_value <- auc(roc_curve)
cat("AUC Value (Downsampled):", auc_value, "\n")

# Plot the ROC curve
plot(roc_curve, main = "ROC Curve", col = "blue")

In [None]:
%%R

auc_ci <- ci(roc_curve)

# Print AUC and CI
cat("AUC:", auc(roc_curve), "\n")
cat("95% CI for AUC:", auc_ci[1], "-", auc_ci[3], "\n")

# Check whether the CI excludes 0.5
if (auc_ci[1] > 0.5 | auc_ci[3] < 0.5) {
  cat("The AUC is significantly different from 0.5 at the 95% confidence level.\n")
} else {
  cat("The AUC is not significantly different from 0.5 at the 95% confidence level.\n")
}

# Multivariable logistic regression

###  **Regression against disease status**

In [None]:
%%R
grsTests <- glm(CASE ~ zSCORE + sex_for_qc + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + age, family="binomial", data = data)
summary(grsTests)

# Extract beta and SE from the logistic regression model
beta <- coef(grsTests)["zSCORE"]
SE <- summary(grsTests)$coefficients["zSCORE", "Std. Error"]

# Calculate OR, U95, and L95
OR <- exp(beta)
U95 <- exp((beta) + (1.96 * SE))
L95 <- exp((beta) - (1.96 * SE))

# Print results
print(summary(grsTests))

# Print results
print(OR)
print(L95)
print(U95)

####  **Weigh cases to controls 1:3**

In [None]:
%%R

data$weights <- ifelse(data$PHENO == 2, 3, 1)

In [None]:
%%R

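# Perform logistic regression
# `data` was reloaded in the Visualisations section, so recreate lrrk2pd (same coding as in the risk analysis)
data$lrrk2pd <- ifelse(data$PHENO == 1 & data$Carrier == 0, 0,
                       ifelse(data$PHENO == 2 & data$Carrier == 1, 1, NA))
grsTests <- glm(lrrk2pd ~ zSCORE + sex_for_qc + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + age, family = binomial, data = data, weights = weights)
summary(grsTests)

# Extract beta and SE from the logistic regression model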
beta <- coef(grsTests)["zSCORE"]
SE <- summary(grsTests)$coefficients["zSCORE", "Std. Error"]

# Calculate OR, U95, and L95
OR <- exp(beta)
U95 <- exp((beta) + (1.96 * SE))
L95 <- exp((beta) - (1.96 * SE))

# Print results
print(summary(grsTests))

# Print results
print(OR)
print(L95)
print(U95)

####  **Age at onset (AAO; weighting does not apply)**

In [None]:
%%R
# Subset the data to cases who are also carriers
carrier_cases <- subset(data, CASE == 1 & Carrier == 1)

# Calculate mean and standard deviation for the carrier cases
meanCarrier <- mean(carrier_cases$SCORE)
sdCarrier <- sd(carrier_cases$SCORE)

# Calculate Z-scores for the carrier cases
carrier_cases$zSCORE <- (carrier_cases$SCORE - meanCarrier) / sdCarrier

# Perform your analysis using the recalculated Z-scores
grsTests <- lm(age_of_onset ~ zSCORE + sex_for_qc + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10, data = carrier_cases)
summary(grsTests)

In [None]:
%%R

carrier_cases <- subset(data, CASE == 1 & Carrier == 1)
# Calculate the correlation
cor_test <- cor.test(carrier_cases$zSCORE, carrier_cases$age_of_onset)
cor_coefficient <- cor_test$estimate  # Pearson correlation coefficient
p_value <- cor_test$p.value  # p-value

# Print the results
cat("Correlation coefficient (r):", cor_coefficient, "\n")
cat("p-value:", p_value, "\n")

# Create a correlation plot
library(ggplot2)

# Generate the correlation plot
cor_plot <- ggplot(carrier_cases, aes(x = zSCORE, y = age_of_onset)) +
  geom_point(alpha = 0.6) +  # Add points with some transparency
  geom_smooth(method = "lm", se = FALSE, color = "red") +  # Add linear regression line
  theme_minimal() +  # Minimal theme
  xlab("Z-Score") +  # Label for x-axis
  ylab("Age of Onset") +  # Label for y-axis
  ggtitle(paste("Correlation: r =", round(cor_coefficient, 2), ", p =", round(p_value, 3)))  # Title with correlation and p-value

# Display the plot
print(cor_plot)

# Optionally save the plot
ggsave("correlation_plot.jpeg", plot = cor_plot, dpi = 600, units = "in", height = 6, width = 6)

# Saving


Save the final files to your workspace bucket, since we are conducting this analysis on Terra.
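
A minimal sketch of the copy step, assuming the same `gsutil -u <project> -m cp` pattern used earlier in this notebook. The helper name `build_upload_command`, the `results` destination folder and the example bucket and project values are hypothetical placeholders. In the notebook you would pass `WORKSPACE_BUCKET` and `BILLING_PROJECT_ID` and hand the resulting string to `shell_do`.

```python
import shlex

def build_upload_command(local_path, bucket, billing_project, dest_dir="results"):
    """Build a gsutil command that copies one local file into a bucket folder."""
    dest = f"{bucket.rstrip('/')}/{dest_dir}/"
    return f"gsutil -u {billing_project} -m cp {shlex.quote(local_path)} {shlex.quote(dest)}"

# Placeholder values for illustration; substitute the notebook's own variables
cmd = build_upload_command(
    "/home/jupyter/WD_GP2_MITO_AIM1_PSMLRRK2ALLINCLUDED_PD_JO/density.png",
    "gs://example-bucket",
    "example-project",
)
print(cmd)
```

Building the command as a string first makes it easy to inspect before running it, for example with `shell_do(cmd)`.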