# Evaluating Polygenic Risk Score and Merging with Accelerometer Cohort and Pheno Cohort

### Sources:

> Bycroft C, Freeman C, Petkova D, et al. Genome-wide genetic data on ~500,000 UK Biobank participants Supplementary Material.
>
> Tamlander M, Mars N, Pirinen M, Widén E, Ripatti S. Integration of questionnaire-based risk factors improves polygenic risk scores for human coronary heart disease and type 2 diabetes. Commun Biol. 2022;5(1):1-13. doi:10.1038/s42003-021-02996-0
>
> Lambert SA, Gil L, Jupp S, et al. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nat Genet. 2021;53(4):420-425. doi:10.1038/s41588-021-00783-5



## Overview of Merging & PRS Evaluation

Finally, I merged all of the datasets, applied the kinship restriction, and evaluated the PRS. I started by restricting the sample to individuals not related at the third degree or closer, using the UsedinPCA indicator. I then merged the polygenic scores and principal components with the phenotype dataset, and ultimately merged the result with the accelerometer-wearing cohort. This creates the final accelerometer cohort with all accelerometer- and genetic-based inclusion criteria applied.

I next tested the PRS in the original, larger phenotype dataset to roughly replicate the sample used in Tamlander *et al.* Performance was very similar to the original study: an odds ratio of 1.67 (1.65 to 1.70) per SD of the score here versus 1.72 (1.70 to 1.75) in the original. The full comparison is available in the Replications folder. Finally, I saved the datasets as FinalPhenoDataset.csv and FinalPADataset.csv, which are used for subsequent analyses.

In [None]:
# In Bash kernel
dx download FULLPHENODATAALLANCESTRIES.csv
dx download FINALPGS.sscore
dx download PCsforCAD.csv
dx download PACOHORTprocessedPAVarsPAEE.csv
dx download UsedinPCA.csv



In [None]:
# In R kernel


PGS <- read.table("FINALPGS.sscore") # Polygenic scores for all genotyped individuals
FINALDATA <- read.csv("FULLPHENODATAALLANCESTRIES.csv") # Pheno Cohort
PCs <- read.csv("PCsforCAD.csv") # Principal components
PAData <- read.csv("PACOHORTprocessedPAVarsPAEE.csv") # Accelerometer Cohort
UsedinPCA <- read.csv("UsedinPCA.csv") # Indicator if used in PCA
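A note on reading the score file: if `FINALPGS.sscore` is raw PLINK 2 `--score` output (an assumption; the file may already have been reduced to two columns), its header line begins with `#IID`, which `read.table` skips by default because `#` is its comment character. The sketch below uses a toy file with made-up values and assumed column names (`ALLELE_CT`, `NAMED_ALLELE_DOSAGE_SUM`, `SCORE1_AVG`) to show one way to keep the header and pick the score column by name rather than by position.

```r
# Toy stand-in for a PLINK 2 .sscore file (hypothetical IDs and values;
# the column names are an assumption and should be checked against the real file).
tmp <- tempfile(fileext = ".sscore")
writeLines(c(
  "#IID\tALLELE_CT\tNAMED_ALLELE_DOSAGE_SUM\tSCORE1_AVG",
  "1000011\t200\t97\t0.0123",
  "1000022\t200\t104\t-0.0045"
), tmp)

# comment.char = "" stops read.table from discarding the "#IID" header line;
# check.names = FALSE keeps the column name "#IID" unchanged.
sscore <- read.table(tmp, header = TRUE, sep = "\t",
                     comment.char = "", check.names = FALSE)

# Keep only the ID and score columns, named as in this notebook.
PGS_toy <- data.frame(eid = sscore[["#IID"]], PGS = sscore[["SCORE1_AVG"]])
str(PGS_toy)
```

Selecting columns by name guards against the score landing in a different position if the PLINK output columns change.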

In [None]:
# Renaming columns
colnames(PGS) <- c("eid", "PGS")


# Merging with UsedinPCA to bring in the indicator used for the kinship criteria
FINALDATAMERGE <- merge(FINALDATA, UsedinPCA, by = "eid", all = FALSE)

dim(FINALDATA)
dim(FINALDATAMERGE)
# 486432 x 91
# 486432 x 93


# Keeping ONLY participants with p22020 (used in PCA) == "Yes", which applies the kinship exclusion criteria
FINALDATAMERGE <- subset(FINALDATAMERGE, p22020 == "Yes")
# NOW 406,554 x 93 - this makes sense

# Merging PGS w/ Pheno Cohort
PGSFINALDATA <- merge(PGS, FINALDATAMERGE, by = "eid", all = FALSE)

# Merging this dataset w/ PCs
PGSFINALDATAPCs <- merge(PGSFINALDATA, PCs, by = "eid", all = FALSE)

dim(PGSFINALDATA)
dim(PGSFINALDATAPCs)
# 406554 x 94
# 406554 x 135
# No changes in numbers here - good
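The row counts above were checked by hand after each merge. As a minimal sketch of how that check could be automated, the hypothetical helper `checked_merge()` below (not part of the original analysis) confirms that `eid` is unique in both inputs and reports how many rows an inner join drops. The data frames are toy stand-ins with made-up values.

```r
# Toy data frames standing in for the phenotype, PGS, and PC tables.
pheno <- data.frame(eid = c(1, 2, 3, 4), Status = c(0, 1, 0, 1))
pgs   <- data.frame(eid = c(1, 2, 3, 5), PGS = c(0.2, -0.1, 0.4, 0.0))
pcs   <- data.frame(eid = c(1, 2, 3), PC1 = c(0.01, -0.02, 0.03))

# Hypothetical helper: inner-join on a key and report how many rows of x were lost.
checked_merge <- function(x, y, by = "eid") {
  # Duplicate keys would silently multiply rows, so stop if any are present.
  stopifnot(!anyDuplicated(x[[by]]), !anyDuplicated(y[[by]]))
  out <- merge(x, y, by = by, all = FALSE)
  message(nrow(x) - nrow(out), " of ", nrow(x), " rows dropped")
  out
}

step1 <- checked_merge(pgs, pheno)  # eid 5 has no phenotype row, so one row drops
step2 <- checked_merge(step1, pcs)  # every remaining eid has PCs, so none drop
dim(step2)
```

Checking for duplicate keys matters because `merge()` returns one row per matching pair, so a repeated `eid` would inflate the cohort without raising an error.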

In [None]:
# Restricting to only European ancestry
summary(as.factor(PGSFINALDATAPCs$Genetic.Ethnic.Grouping))
# Either Caucasian or not: 337,005 Caucasian (vs. 358,922 in Mars et al.); the difference makes sense given the different kinship criteria

PGSFINALDATAWHITE <- subset(PGSFINALDATAPCs, Genetic.Ethnic.Grouping == "Caucasian")

dim(PGSFINALDATAWHITE)
# 337,005 x 135 variables

In [None]:
# Tamlander et al.
# OR = 1.72 (1.70-1.75)


# NOW standardizing PGS (mean 0, SD 1) so the OR is per SD of the score
# as.numeric() drops the matrix attributes that scale() returns
PGSFINALDATAWHITE$StandPGS <- as.numeric(scale(PGSFINALDATAWHITE$PGS))

# Regression covariates based on those in PGS Catalog
logitfit <- glm(Status ~ StandPGS + Age.at.Recruitment + Biological.Sex, data = PGSFINALDATAWHITE, family = "binomial")
summary(logitfit)

# Odds ratios
exp(coef(logitfit))

# Odds ratios with 95% CIs
exp(cbind(OR = coef(logitfit), confint(logitfit)))
# 1.67 (1.65 to 1.70)
# Very slightly off BUT also not exactly the same pop - stricter ancestry restriction
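For a `glm` fit, `confint()` computes profile-likelihood intervals, which can be slow on several hundred thousand rows. A possible alternative, sketched here on simulated data rather than the UK Biobank cohort, is the Wald interval from `confint.default()`, which uses only the coefficient estimates and their standard errors. All variable names and effect sizes below are made up for illustration.

```r
set.seed(1)
n <- 5000

# Simulated covariates (hypothetical values, not UK Biobank data).
toy <- data.frame(
  PGS = rnorm(n, mean = 0.5, sd = 2),
  Age = runif(n, 40, 70),
  Sex = factor(sample(c("Female", "Male"), n, replace = TRUE))
)

# Standardize the score so the odds ratio is per SD.
toy$StandPGS <- as.numeric(scale(toy$PGS))

# Simulate case status with an assumed log-odds of 0.5 per SD of the score.
lp <- -3 + 0.5 * toy$StandPGS + 0.03 * (toy$Age - 55) + 0.4 * (toy$Sex == "Male")
toy$Status <- rbinom(n, 1, plogis(lp))

fit <- glm(Status ~ StandPGS + Age + Sex, data = toy, family = binomial)

# Wald intervals: estimate +/- 1.96 * SE on the log-odds scale, then exponentiate.
or_wald <- exp(cbind(OR = coef(fit), confint.default(fit)))
round(or_wald["StandPGS", ], 2)
```

With large samples the Wald and profile-likelihood intervals usually agree closely, so the faster version may be enough for a replication check like this one.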

In [None]:
# -----------
# FINALLY merging PA cohort w/ PGSFINALDATAPCs
# -----------

FINALOVERALLDATASET <- merge(PGSFINALDATAPCs, PAData, by = "eid", all = FALSE)

dim(FINALOVERALLDATASET) # 79454 x 838 - so 17,206 of the 96,660 in PAData dropped from kinship
dim(PAData)              # 96660 x 704
dim(PGSFINALDATAPCs)     # 406554 x 135
dim(FINALDATA)           # 486432 x 91
dim(FINALDATAMERGE)      # 406554 x 93



write.csv(FINALOVERALLDATASET, "FinalPADataset.csv")
write.csv(PGSFINALDATAPCs, "FinalPhenoDataset.csv")

summary(as.factor(FINALOVERALLDATASET$Genetic.Ethnic.Grouping))
# 1 - 11592 Caucasian - 67862
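One thing to be aware of when these files are read back in later notebooks: `write.csv()` writes row names by default, and `read.csv()` then returns them as an extra first column named `X`. The sketch below, on a toy data frame with made-up values, shows the difference `row.names = FALSE` makes. Whether the extra column matters depends on how the downstream code selects columns.

```r
# Toy data frame standing in for the final dataset (hypothetical values).
toy <- data.frame(eid = c(1000011, 1000022), PGS = c(0.12, -0.05))

with_rownames    <- tempfile(fileext = ".csv")
without_rownames <- tempfile(fileext = ".csv")

write.csv(toy, with_rownames)                        # default: row names become the first column
write.csv(toy, without_rownames, row.names = FALSE)  # only the data columns are written

names(read.csv(with_rownames))     # "X" "eid" "PGS"
names(read.csv(without_rownames))  # "eid" "PGS"
```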

In [None]:
# In Bash Kernel
dx upload FinalPADataset.csv
dx upload FinalPhenoDataset.csv