# Genotype QC UKBB data

## Aim

Generate a QC'ed set of genotype array data for the whole UKBB dataset.

## Output file

The final bed file, with all QC steps applied, is located here:

`~/UKBiobank/genotype_files_processed/083021_sample_variant_qc_final/cache/UKB_genotypedatadownloaded083019.083021_sample_variant_qc_final.filtered.extracted.bed`

# Variant QC summary

Original file downloaded from the UKBB:

`~/UKBiobank_Yale_transfer/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed`

- Starting number of variants: 784,256 (starting with only autosomal variants)
- Starting number of individuals: 488,377 (individuals with genotype data; 488,221 of them also have phenotype information)

* Autosomal variants --> 784,256
* Covered by both arrays (array : (0/1/2) Presence of SNP on genotyping arrays 0=BiLEVE, 1=Axiom, 2=both) --> 733,322
* Batch level QC (BATCH_qc : (0/1) For each batch (Batch_b001-b095,UKBiLEVEAX_b1-b11), SNP passed all QC tests (no/yes)) --> 687,004
* SNPs only (remove indels) --> 674,489
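
The number of variants dropped at each step follows directly from the counts above. A minimal shell sketch of that arithmetic, where the variable names and the `removed` helper are illustrative and not part of the original workflow:

```bash
#!/usr/bin/env bash
# Variant counts after each QC step, copied from the summary above.
autosomal=784256
both_arrays=733322
batch_qc=687004
snps_only=674489

# Hypothetical helper: variants dropped between two consecutive steps.
removed() {
    echo $(( $1 - $2 ))
}

echo "Not on both arrays:  $(removed "$autosomal" "$both_arrays")"
echo "Failed batch QC:     $(removed "$both_arrays" "$batch_qc")"
echo "Indels (non-SNPs):   $(removed "$batch_qc" "$snps_only")"
```

The last difference, 12,515, is the indel count mentioned in the note at the end of Step 6.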

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.5     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [2]:
setwd('~/UKBiobank/genotype_files_processed')

## Step 1. Import file needed (ukb_snp_qc.txt)

In [3]:
ukb_snp_qc <- read.table('~/UKBiobank/data/genotype_files/ukb_snp_qc.txt', header=TRUE, sep=" ")

In [4]:
# variants genotyped and thus listed in this dataframe = 805,426
head(ukb_snp_qc)
nrow(ukb_snp_qc)

Unnamed: 0_level_0,rs_id,affymetrix_snp_id,affymetrix_probeset_id,chromosome,position,allele1_ref,allele2_alt,strand,array,Batch_b001_qc,...,PC32_loading,PC33_loading,PC34_loading,PC35_loading,PC36_loading,PC37_loading,PC38_loading,PC9_loading.3,PC40_loading,in_Phasing_Input
Unnamed: 0_level_1,<fct>,<fct>,<fct>,<int>,<int>,<fct>,<fct>,<fct>,<int>,<int>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,rs28659788,Affx-13546538,AX-32115783,1,723307,C,G,+,0,1,...,,,,,,,,,,0
2,rs116587930,Affx-35298040,AX-37361813,1,727841,G,A,+,2,1,...,,,,,,,,,,0
3,rs116720794,Affx-13637449,AX-32137419,1,729632,C,T,+,2,1,...,,,,,,,,,,0
4,rs3131972,Affx-13945728,AX-13191280,1,752721,A,G,+,2,1,...,,,,,,,,,,1
5,rs12184325,Affx-13963217,AX-11194291,1,754105,C,T,+,2,1,...,-0.00347144,0.00589896,-0.00373881,-0.00189571,-0.00286888,0.000792278,-0.00191024,0.00307453,-0.000410934,1
6,rs3131962,Affx-13995532,AX-32225497,1,756604,A,G,+,2,1,...,,,,,,,,,,1


## Step 2. Select autosomal variants only  

In [37]:
# subset dataframe 

data.SNP_QC <-subset(ukb_snp_qc, chromosome %in% 1:22)

# autosomal variants listed in dataframe = 784,256
nrow(data.SNP_QC)

In [None]:
# create list.nonautosomalvariants for reference

list.nonautosomalvariants <-subset(ukb_snp_qc, !chromosome %in% 1:22, select=c(rs_id))
outfile.name1 <- "list.nonautosomalvariants.txt"
write.table(list.nonautosomalvariants, outfile.name1, quote=FALSE, col.names=TRUE, row.names=FALSE, sep="\t")

## Step 3. Select variants typed in both arrays

In [38]:
# subset dataframe 

data.SNP_QC <-subset(data.SNP_QC, array ==2)

# autosomal variants in both arrays = 733,322 
nrow(data.SNP_QC)

In [5]:
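# variants present only on the BiLEVE array (array == 0), all chromosomes
array1 <-subset(ukb_snp_qc, array ==0)
nrow(array1)

In [7]:
# variants present only on the Axiom array (array == 1), all chromosomes
array2 <-subset(ukb_snp_qc, array ==1)
nrow(array2)

In [8]:
# variants present on both arrays (array == 2), all chromosomes
both <-subset(ukb_snp_qc, array ==2)
nrow(both)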

In [None]:
# create list.notypedinbotharrays for reference
# (subset the full table: data.SNP_QC already contains only array == 2 at this point)

list.notypedinbotharrays <-subset(ukb_snp_qc, chromosome %in% 1:22 & array != 2, select=c(rs_id))
outfile.name2 <- "list.variantsnotypedinbotharrays.txt"
write.table(list.notypedinbotharrays, outfile.name2, quote=FALSE, col.names=TRUE, row.names=FALSE, sep="\t")

## Step 4. Select variants passing all batch QC tests

In [42]:
both_arrays <- data.SNP_QC %>%
     filter_at(vars(starts_with("Batch")), all_vars(. > 0)) %>%
     filter_at(vars(starts_with("UKBiLEVEAX")), all_vars(. > 0)) %>%
     filter(array==2)
    
# create sumbatches variable (sums the 106 batch-specific indicators in columns 10:115 for each variant; sumbatches must equal 106 for inclusion)
data.SNP_QC$sumbatches <- rowSums(data.SNP_QC[,10:115])

# subset dataframe 
data.SNP_QC <-subset(data.SNP_QC,sumbatches  ==106, select=c(rs_id))

# autosomal variants = 687,004
nrow(data.SNP_QC)
write.table(data.SNP_QC, "SNPs_autosomalvariantspassingbatchqc_120219.txt", quote=FALSE, col.names=TRUE, row.names=FALSE, sep="\t")

In [None]:
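# create list.failbatchQCtests for reference
# (data.SNP_QC now holds only the passing rs_ids, so recompute the batch sum from the full table)
tmp.SNP_QC <- subset(ukb_snp_qc, chromosome %in% 1:22 & array == 2)
list.failbatchQCtests <-subset(tmp.SNP_QC, rowSums(tmp.SNP_QC[,10:115]) != 106, select=c(rs_id))
outfile.name3 <- "list.variantsfailingoneormorebatchQCtests.txt"
write.table(list.failbatchQCtests, outfile.name3, quote=FALSE, col.names=TRUE, row.names=FALSE, sep="\t")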
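
The batch filter in this step keeps a variant only when every per-batch indicator equals 1, which is the same as the indicators summing to the number of batch columns. A minimal sketch of that rule on a toy table, assuming a space-separated file with a header row; the file contents and the `pass_all_batches` helper are hypothetical:

```bash
#!/usr/bin/env bash
# Toy stand-in for ukb_snp_qc.txt: rs_id in column 1 and
# three 0/1 batch indicators in columns 2-4.
toy_qc=$(mktemp)
cat > "$toy_qc" <<'EOF'
rs_id Batch_b001_qc Batch_b002_qc UKBiLEVEAX_b1_qc
rsA 1 1 1
rsB 1 0 1
rsC 1 1 1
rsD 0 0 1
EOF

# Hypothetical helper: print rs_ids whose indicators in columns
# first..last sum to the number of columns, i.e. every batch passed.
pass_all_batches() {
    awk -v first="$2" -v last="$3" '
        NR > 1 {
            total = 0
            for (i = first; i <= last; i++) total += $i
            if (total == last - first + 1) print $1
        }' "$1"
}

pass_all_batches "$toy_qc" 2 4
```

On the real table the column range would be 10 to 115, the same range the R cell sums.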

## Step 5. Select SNPs only within PLINK (`--snps-only` flag)

In [2]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed
# Original bfile containing all of the samples (Columbia's cluster)
genoFile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
# Keep the variants that passed the previous QC: autosomal, on both arrays, passing QC in all batches
keep_variants=~/UKBiobank/genotype_files_processed/SNPs_autosomalvariantspassingbatchqc_120219.txt
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/snponly_originalbed_$(date +"%Y-%m-%d").sbatch
snps_only=True
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
mem='30G'
name='082421'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --snps_only $snps_only
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"



INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/snponly_originalbed_2021-08-24.sbatch
INFO: Workflow csg (ID=wc0e398d163b9a337) is executed successfully with 1 completed step.
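
For orientation, the effect of this step can be approximated by a single PLINK 1.9 call. The sketch below only prints such a command and does not run it. It is an assumption about what the `GWAS_QC.ipynb` workflow does internally, and the `snps_only_cmd` helper and the output prefix are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (not run) a PLINK 1.9 command that keeps the
# listed variants and drops non-SNP variants. All paths are placeholders.
snps_only_cmd() {
    bfile=$1      # PLINK fileset prefix (no .bed/.bim/.fam extension)
    keep_list=$2  # text file with one variant ID per line
    out=$3        # output prefix
    echo "plink --bfile $bfile --extract $keep_list --snps-only --make-bed --out $out"
}

snps_only_cmd UKB_genotypedatadownloaded083019 \
    SNPs_autosomalvariantspassingbatchqc_120219.txt \
    snps_only_example
```

With all other filters set to 0, as in the cell above, only the variant list and the SNP-only restriction change the output.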



In [None]:
## Step 6. Run the PERL SCRIPT for Missingness, MAF, and HWE
  # Note that this script will yield two output files: <nameoffile>.QC and <nameoffile>.out
  # Use <nameoffile>.QC to filter out the variants
  # Keep <nameoffile>.out as reference

  data.perl_SNP_QC <- read.table("/research_storage/dbgap/work/Yasmmyn/UK_Biobank/geneQC/try2_UKB_SNPs.QC", header=TRUE, sep="\t")

  # Step 6a. missingness
  names(data.perl_SNP_QC)[1] <-"rs_id"
  list.exclude.perl_SNP_QC1  <- subset(data.perl_SNP_QC,data.perl_SNP_QC$MISS==1,select=c(rs_id))
  list.exclude.perl_SNP_QC1  <- as.matrix(list.exclude.perl_SNP_QC1)


  data.SNP_QC <-subset(data.SNP_QC, ! rs_id %in% list.exclude.perl_SNP_QC1)

  # SNPs = 652,399
  nrow(data.SNP_QC)

  # Step 6b. HWE
  
  list.exclude.perl_SNP_QC2 <- subset(data.perl_SNP_QC,data.perl_SNP_QC$HWE==1,select=c(rs_id))
  list.exclude.perl_SNP_QC2 <- as.matrix(list.exclude.perl_SNP_QC2)

  data.SNP_QC <-subset(data.SNP_QC, ! rs_id %in% list.exclude.perl_SNP_QC2)

  # SNPs = 622,266
  nrow(data.SNP_QC)
  
  # Step 6c. MAF 
  
  list.exclude.perl_SNP_QC3  <- subset(data.perl_SNP_QC,data.perl_SNP_QC$MAF==1,select=c(rs_id))
  list.exclude.perl_SNP_QC3  <- as.matrix(list.exclude.perl_SNP_QC3)

  data.SNP_QC <-subset(data.SNP_QC, ! rs_id %in% list.exclude.perl_SNP_QC3)

  
  # SNPs = 541,312
  nrow(data.SNP_QC)

  # Note that these numbers are off by 12,515 because indels were deleted using the --snps-only flag in PLINK

# Sample QC summary

Original file downloaded from the UKBB:

`~/UKBiobank_Yale_transfer/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.fam`

- Starting number of variants: 784,256 (starting with only autosomal variants)
- Starting number of individuals: 488,221 (individuals that have both genotype and phenotype information)

* Keep individuals whose genetic and reported sex match --> 487,849
* Exclude sex chromosome karyotypes other than XX or XY --> 487,379
* Exclude outliers for heterozygosity or missing rate --> 486,416
* Select individuals from different ethnicities (asian N=10,18; african N=8,621; and white N=)
* Individuals call rate > 99% --> 436,698

The R script and submission file are located here:

`~/UKBiobank/genotype_files_processed/082421_sample_qc.R`

`~/UKBiobank/genotype_files_processed/082421_sample_qc.sh`

## Step 1. Import list of people with genotype data into a dataframe

In [1]:
data.havegenotypes <- read.table("~/UKBiobank_Yale_transfer/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.fam", header= FALSE, stringsAsFactors = FALSE)
names(data.havegenotypes) <-c("FID","IID","ignore1", "ignore2", "ignore3", "ignore4")
cat("The number of individuals with genotype data is:",nrow(data.havegenotypes),"\n")
 # n = 488377 subjects



## Step 2. Import phenotype data and create a dataframe

In [2]:
library(data.table)
dat <- fread("~/UKBiobank/data/ukbb_databases/ukb47922_updatedAug2021/ukb47922.tab", header=TRUE, sep="\t", select = c("f.eid","f.31.0.0","f.22001.0.0","f.22019.0.0","f.22027.0.0","f.22021.0.0","f.22006.0.0"))

In [3]:
head(dat)

f.eid,f.31.0.0,f.22001.0.0,f.22019.0.0,f.22027.0.0,f.22021.0.0,f.22006.0.0
<int>,<int>,<int>,<int>,<int>,<int>,<int>
1000019,0,0,,,1,1
1000022,1,1,,,0,1
1000035,1,1,,,1,1
1000046,0,0,,,1,1
1000054,0,0,,,1,1
1000063,1,1,,,0,1


In [4]:
#mydata <- read.table("~/UKBiobank/data/ukbb_databases/ukb47922_updatedAug2021/ukb47922.tab", header=TRUE, sep="\t")
names(dat)[1] <- "IID" 
cat("The number of individuals with phenotype data is:",nrow(dat),"\n")
 # n = 502,461 subjects

The number of individuals with phenotype data is: 502461 


## Step 3. Merge the dataframes 

In [5]:
data.haveboth <-merge(data.havegenotypes, dat, by="IID", all=FALSE)
cat("The number of individuals with both phenotype and genotype data:",nrow(data.haveboth),"\n")
# n = 488,221 (have both genotype and phenotype data; 156 individuals removed)

The number of individuals with both phenotype and genotype data: 488221 
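
The merge above is an inner join on `IID`: only individuals present in both tables survive. A minimal sketch of the same idea on toy ID lists, using `awk` to count the overlap; the IDs and the `count_in_both` helper are hypothetical:

```bash
#!/usr/bin/env bash
# Toy ID lists standing in for the .fam IIDs and the phenotype f.eid column.
geno_ids=$(mktemp)
pheno_ids=$(mktemp)
printf '%s\n' 1001 1002 1003 1004 > "$geno_ids"
printf '%s\n' 1002 1003 1004 1005 1006 > "$pheno_ids"

# Hypothetical helper: count IDs that appear in both files.
# The first file is loaded into a lookup table, the second is checked against it.
count_in_both() {
    awk 'NR == FNR { seen[$1] = 1; next }
         ($1 in seen) { n++ }
         END { print n + 0 }' "$1" "$2"
}

echo "With genotypes: $(wc -l < "$geno_ids" | tr -d ' ')"
echo "With both:      $(count_in_both "$geno_ids" "$pheno_ids")"
```

The difference between the two printed numbers is the count of genotyped individuals without phenotype data, which is 156 in the real data.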


## Step 4. Additional sex checks  

In [6]:
#4a. genetic sex vs. self-reported sex

data.haveboth$sex_match<- (data.haveboth$f.22001.0.0 == data.haveboth$f.31.0.0)
sexnomatch <- subset(data.haveboth, data.haveboth$sex_match=="FALSE")
data.haveboth<- subset(data.haveboth, data.haveboth$sex_match=="TRUE")
cat("The number of individuals whose reported and genetic sex match is:",nrow(data.haveboth),"\n")
cat("The number of individuals whose reported and genetic sex does NOT match is:",nrow(sexnomatch),"\n")
# n = 487,849 subjects match for sex (372 mismatches removed)

The number of individuals whose reported and genetic sex match is: 487849 
The number of individuals whose reported and genetic sex does NOT match is: 372 


In [7]:
#4b. Identify subjects with sex chromosome karyotypes putatively different from XX or XY

aneu.toexclude  <- subset(data.haveboth, data.haveboth$f.22019.0.0==1, select=c(IID))
aneu.toexclude  <- as.matrix(aneu.toexclude)

data.haveboth <-subset(data.haveboth, ! IID %in% aneu.toexclude)
cat("The number of individuals with aneuploidies is:",nrow(aneu.toexclude),"\n")
cat("The number of individuals with XX and XY",nrow(data.haveboth),"\n")
# n = 487,379 subjects remain (470 removed)

The number of individuals with aneuploidies is: 470 
The number of individuals with XX and XY 487379 


In [9]:
# Save the IIDs that pass these QC filters to a file
# Note: write.csv() ignores col.names, so the header row is still written (hence the warning below)
ID_keep <- data.haveboth[,1, drop=FALSE]
write.csv(ID_keep,'~/UKBiobank/genotype_files_processed/01172023_sampleQC_IID_keep_487379ind.csv', row.names = FALSE, col.names=FALSE)

“attempt to set 'col.names' ignored”


## Step 5. Identify subjects that are outliers in heterozygosity and missing rates

In [10]:
list.toexclude <- subset(data.haveboth, data.haveboth$f.22027.0.0==1, select=c(IID))
list.toexclude  <- as.matrix(list.toexclude)

data.haveboth <-subset(data.haveboth, ! IID %in% list.toexclude)
cat("The number of individuals that are outliers for heterozygosity or missing rates is:",nrow(list.toexclude),"\n")
cat("The number of individuals that are not outliers for heterozygosity or missing rates is:",nrow(data.haveboth),"\n")

The number of individuals that are outliers for heterozygosity or missing rates is: 963 
The number of individuals that are not outliers for heterozygosity or missing rates is: 486416 


## Identify ancestry and do QC for British, Asians and Africans separately

### Expanded white with exome QC of genotype array

In [None]:
exp_white <- read.table("~/UKBiobank/results/083021_PCA_results/121721_ukb42495_exomed_white_187908ind_no_outliers.id")

### Call rate 90% expanded white

In [2]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/010622_exp_white_exomed_mind0.1
# bfile with variant_qc_1
genoFile=~/UKBiobank/genotype_files_processed/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.bed
# Keep the samples that passed sample_qc_1 (N=187,908); after PCA these were determined to be the expanded white group
keep_samples=~/UKBiobank/results/083021_PCA_results/121721_ukb42495_exomed_white_187908ind_no_outliers.id
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/ExWhite_exome_QC_mind0.05_$(date +"%Y-%m-%d").sbatch
maf_filter=0
geno_filter=0
hwe_filter=0
# with mind=0.01 no samples remaining
# Keep individuals with call rate > 90%
mind_filter=0.1
mem='30G'
name='010622'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/ExWhite_exome_QC_mind0.05_2022-01-10.sbatch
INFO: Workflow csg (ID=w2a0fd98be7b35893) is executed successfully with 1 completed step.
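
The `mind_filter` value is a maximum per-sample missing rate, so `mind_filter=0.1` corresponds to a call rate of at least 90% and `mind_filter=0.05` to at least 95%. A minimal sketch of that threshold on a toy table, assuming a two-column layout (`IID`, missing rate); the data and the `keep_by_missingness` helper are hypothetical:

```bash
#!/usr/bin/env bash
# Toy per-sample missingness table: sample ID and fraction of missing calls.
toy_miss=$(mktemp)
cat > "$toy_miss" <<'EOF'
IID F_MISS
1001 0.002
1002 0.080
1003 0.120
1004 0.049
EOF

# Hypothetical helper: keep samples whose missing rate does not exceed
# the threshold, i.e. whose call rate is at least 1 - threshold.
keep_by_missingness() {
    awk -v mind="$2" 'NR > 1 && $2 + 0 <= mind + 0 { print $1 }' "$1"
}

echo "mind = 0.1 (call rate >= 90%):"
keep_by_missingness "$toy_miss" 0.1
echo "mind = 0.05 (call rate >= 95%):"
keep_by_missingness "$toy_miss" 0.05
```

Setting the threshold to 0, as in the other cells of this notebook, is used there to switch the sample missingness filter off.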



### Remove the related individuals before the variant qc2

In [11]:
### Read in the individuals kept after the sample missingness filter (mind=0.1, 90% call rate) for the expanded white subset
keep_white <- read.table("~/UKBiobank/genotype_files_processed/010622_exp_white_exomed_mind0.1/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_exp_white_exomed_mind0.1.filtered.fam", header=FALSE)
names(keep_white) <-c("FID","IID", "father", "mother", "sex", "pheno")

In [12]:
head(keep_white)

Unnamed: 0_level_0,FID,IID,father,mother,sex,pheno
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>
1,1000019,1000019,0,0,2,-9
2,1000035,1000035,0,0,1,-9
3,1000078,1000078,0,0,2,-9
4,1000081,1000081,0,0,1,-9
5,1000198,1000198,0,0,2,-9
6,1000210,1000210,0,0,1,-9


In [21]:
nrow(keep_white)

In [14]:
white_subset<-subset(data.haveboth, IID %in% keep_white$IID)

In [17]:
white.related <- subset(white_subset, white_subset$f.22021.0.0>1)  # selects kinship codes greater than 1 (all rows in the output below have code 10); rows coded exactly 1 are not selected
head(white.related)
nrow(white.related)

Unnamed: 0_level_0,IID,FID,ignore1,ignore2,ignore3,ignore4,f.31.0.0,f.22001.0.0,f.22019.0.0,f.22027.0.0,f.22021.0.0,f.22006.0.0,sex_match
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<lgl>
6955,1071720,1071720,0,0,2,-9,0,0,,,10,1.0,True
14391,1148275,1148275,0,0,1,-9,1,1,,,10,,True
25635,1263990,1263990,0,0,1,-9,1,1,,,10,1.0,True
28409,1292522,1292522,0,0,2,-9,0,0,,,10,1.0,True
33581,1345826,1345826,0,0,2,-9,0,0,,,10,1.0,True
33615,1346185,1346185,0,0,2,-9,0,0,,,10,1.0,True


In [19]:
library(dplyr)
white_iid_related <- white.related %>%
    select ("FID", "IID")
cat("The number of white individuals that are related in the sample is:",nrow(white.related),"\n")
cat("The number of individuals that are related in the sample after filtering for sample missingness is:",nrow(white_iid_related),"\n")

The number of white individuals that are related in the sample is: 74 
The number of individuals that are related in the sample after filtering for sample missingness is: 74 
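
Which individuals count as related depends on the cutoff applied to the kinship field. A minimal sketch on toy data showing how the selection changes between a cutoff of 1 (as used in the R cell) and a cutoff of 0; the toy codes and the `related_above` helper are assumptions made for illustration:

```bash
#!/usr/bin/env bash
# Toy kinship table: sample ID and a kinship code like those in f.22021.0.0.
toy_kin=$(mktemp)
cat > "$toy_kin" <<'EOF'
IID kinship
1001 0
1002 1
1003 10
1004 0
1005 10
1006 -1
EOF

# Hypothetical helper: list IIDs whose kinship code is strictly above a cutoff.
related_above() {
    awk -v cutoff="$2" 'NR > 1 && $2 + 0 > cutoff + 0 { print $1 }' "$1"
}

echo "Cutoff 1 (same comparison as the R cell):"
related_above "$toy_kin" 1
echo "Cutoff 0 (would also include code 1):"
related_above "$toy_kin" 0
```

With the cutoff at 1 only the rows coded 10 are listed, which matches the `head()` output shown for this step.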


In [20]:
write.table(white_iid_related, '~/UKBiobank/genotype_files_processed/010622_exp_white_exomed_mind0.1/010722_sampleQC_IID_white_related.txt', sep="\t", row.names = FALSE, col.names= FALSE)

## Do the variant QC2 with the subset of white expanded individuals

In [22]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/010622_exp_white_exomed_mind0.1/
# bfile with variant_qc_1 N=187,908 and variants=674,489
genoFile=~/UKBiobank/genotype_files_processed/010622_exp_white_exomed_mind0.1/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_exp_white_exomed_mind0.1.filtered.bed
#To remove related samples
remove_samples=~/UKBiobank/genotype_files_processed/010622_exp_white_exomed_mind0.1/010722_sampleQC_IID_white_related.txt
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/white_variantqc2_$(date +"%Y-%m-%d").sbatch
maf_filter=0.01
geno_filter=0.01
hwe_filter=5e-08
# Set mind filter to 0 not to filter out more individuals based on sample missingness
mind_filter=0
mem='30G'
name='010722'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/white_variantqc2_2022-01-10.sbatch
INFO: Workflow csg (ID=wa896da5b4dc83176) is executed successfully with 1 completed step.



### Variants and samples to keep from white expanded

`~/UKBiobank/genotype_files_processed/010622_exp_white_exomed_mind0.1/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_exp_white_exomed_mind0.1.filtered.pass_qc.snplist`

`UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_exp_white_exomed_mind0.1.filtered.pass_qc.samplelist`

### Asians QC

The scripts that generated this subset of the data are here:

`~/UKBiobank/genotype_files_processed/010522_sample_qc_asians.R`

`~/UKBiobank/genotype_files_processed/010522_sample_qc_asians.sh`

The ethnicity filtering is in this notebook:

`~/project/UKBB_GWAS_dev/analysis/phenotypes/122021_Asians_Africans_in_500K.ipynb`

In [10]:
# Read in the file with the asian individuals to subset
asian <- read.table("/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/010622_ukb47922_asian_10695.iid", header=TRUE, sep="\t")

In [11]:
head(asian)

Unnamed: 0_level_0,FID,IID,ethnicity
Unnamed: 0_level_1,<int>,<int>,<fct>
1,1000906,1000906,3003
2,1001874,1001874,3004
3,1002497,1002497,3001
4,1002712,1002712,3001
5,1003025,1003025,3001
6,1003083,1003083,3001


In [12]:
asian_subset<-subset(data.haveboth, IID %in% asian$IID)

In [13]:
nrow(asian_subset)

#### Call rate 95% for Asians

In [1]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/010622_asian_10189ind
# bfile with variant_qc_1
genoFile=~/UKBiobank/genotype_files_processed/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.bed
#To keep the samples that passed sample_qc_1 N=10,189
keep_samples=~/UKBiobank/genotype_files_processed/010622_sampleQC_IID_keep_asian_iid.txt
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/Asian_QC_mind0.05_$(date +"%Y-%m-%d").sbatch
maf_filter=0
geno_filter=0
hwe_filter=0
# with mind=0.01 no samples remaining
# Keep individuals with call rate > 95%
mind_filter=0.05
mem='30G'
name='010622'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"



INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/Asian_QC_mind0.05_2022-01-07.sbatch
INFO: Workflow csg (ID=w9d5378c14af9575e) is executed successfully with 1 completed step.



#### Remove the related individuals in the Asian subset

In [20]:
### Read in the removed individuals after the sample missingness filter (mind=0.05) 95% call rate for the asian subset
### In this case all of the individuals were kept N=10,189
rm_asian <- read.table("~/UKBiobank/genotype_files_processed/010622_asian_10189ind/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_asian_10189ind.filtered.mindrem.id", header=FALSE)
names(rm_asian) <-c("FID","IID")

In [14]:
asian.related <- subset(asian_subset, asian_subset$f.22021.0.0>1)  # selects kinship codes greater than 1 (the single row in the output below has code 10); rows coded exactly 1 are not selected
head(asian.related)

Unnamed: 0_level_0,IID,FID,ignore1,ignore2,ignore3,ignore4,f.31.0.0,f.22001.0.0,f.22019.0.0,f.22027.0.0,f.22021.0.0,f.22006.0.0,sex_match
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<lgl>
339249,4492584,4492584,0,0,1,-9,1,1,,,10,,True


In [21]:
library(dplyr)
asian_iid_related <- asian.related %>%
    filter(!IID %in% rm_asian$IID) %>%
    select ("FID", "IID")
cat("The number of asian individuals that are related in the sample is:",nrow(asian.related),"\n")
cat("The number of individuals that are related in the sample after filtering for sample missingness is:",nrow(asian_iid_related),"\n")

The number of asian individuals that are related in the sample is: 1 
The number of individuals that are related in the sample after filtering for sample missingness is: 1 


In [17]:
write.table(asian_iid_related, '~/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_sampleQC_IID_asian_related.txt', sep="\t", row.names = FALSE, col.names= FALSE)

#### Do the variant qc #2

In [19]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/010622_asian_10189ind/
# bfile with variant_qc_1 N=10,189 and variants=674,489
genoFile=~/UKBiobank/genotype_files_processed/010622_asian_10189ind/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_asian_10189ind.filtered.bed
#To remove related samples
remove_samples=~/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_sampleQC_IID_asian_related.txt
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/asian_variantqc2_$(date +"%Y-%m-%d").sbatch
maf_filter=0.01
geno_filter=0.01
hwe_filter=5e-08
# Set mind filter to 0 not to filter out more individuals based on sample missingness
mind_filter=0
mem='30G'
name='010722'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/asian_variantqc2_2022-01-07.sbatch
INFO: Workflow csg (ID=w183d9d34fd76a226) is executed successfully with 1 completed step.



#### Obtain the final file for all the samples after variant qc1, sample qc and variant qc2

The snplist file was created from the bim file produced by the filtering in the previous step:

`UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_asian_10189ind.filtered.010622_asian_10189ind.filtered.bim`

The sample file to keep consists of N=10,188 individuals (only 1 removed after mind=0.05):

`UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_asian_10189ind.filtered.fam`

In [22]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_sample_var_final_qc
# original bfile
genoFile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
#To keep samples
keep_samples=~/UKBiobank/genotype_files_processed/010622_asian_10189ind/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_asian_10189ind.filtered.fam
# To keep variants
keep_variants=~/UKBiobank/genotype_files_processed/010622_asian_10189ind/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_asian_10189ind.qc_pass.snplist
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/Asian_sampleqc_final_$(date +"%Y-%m-%d").sbatch
## All filters set to 0 
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
mem='30G'
name='010722'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/Asian_sampleqc_final_2022-01-07.sbatch
INFO: Workflow csg (ID=w881c9553f378afc2) is executed successfully with 1 completed step.



#### Generate the file for exome and imputed data analysis

In [43]:
#Read in the file with individuals that have exome data

exome_id <- read.table("~/UKBiobank/data/exome_files/project_VCF/072721_run/plink/ukb23156_c1.merged.filtered.fam")
names(exome_id) <-c("FID","IID","father", "mother", "sex", "pheno")
head(exome_id)
nrow(exome_id) ## 200,643

Unnamed: 0_level_0,FID,IID,father,mother,sex,pheno
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>
1,1434748,1434748,0,0,0,-9
2,5523981,5523981,0,0,0,-9
3,5023838,5023838,0,0,0,-9
4,4023729,4023729,0,0,0,-9
5,4442146,4442146,0,0,0,-9
6,5654789,5654789,0,0,0,-9


In [51]:
# Read in the file with individuals that have imputed data N=487410 file=ukb32285_imputedindiv.sample
imput_id <- read.table("~/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb32285_imputedindiv.sample", header=T)
head(imput_id)
nrow(imput_id)
nrow(imput_id[imput_id$ID_1 == "0",]) # rows with ID 0; the first row of the output below (0, 0, 0, D) is the .sample type-descriptor line, not an individual

Unnamed: 0_level_0,ID_1,ID_2,missing,sex
Unnamed: 0_level_1,<int>,<int>,<int>,<fct>
1,0,0,0,D
2,5414209,5414209,0,1
3,5296052,5296052,0,2
4,4852763,4852763,0,2
5,5230840,5230840,0,2
6,1992219,1992219,0,2


In [39]:
#For the 500K individuals
asian_qc <- read.table("~/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.fam")
names(asian_qc) <-c("FID","IID","father", "mother", "sex", "pheno")

In [71]:
head(asian_qc)
nrow(asian_qc)

Unnamed: 0_level_0,FID,IID,father,mother,sex,pheno
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>
1,1000906,1000906,0,0,1,-9
2,1001874,1001874,0,0,2,-9
3,1002712,1002712,0,0,2,-9
4,1003025,1003025,0,0,1,-9
5,1003083,1003083,0,0,1,-9
6,1004204,1004204,0,0,1,-9


In [52]:
asian_exome <- asian_qc %>%
    filter(IID %in% exome_id$IID) %>%
    select ("FID", "IID")

In [53]:
nrow(asian_exome)

In [62]:
asian_imput <- asian_qc %>%
    filter(IID %in% imput_id$ID_1) %>%
    select ("FID", "IID")
nrow(asian_imput)

In [69]:
asian_full <- asian_qc %>%
  mutate(exome = if_else(IID %in% exome_id$IID,1 , 0),
         imputed = if_else(IID %in% imput_id$ID_1,1, 0),
         both = if_else (exome == 1 & imputed == 1, 1, 0))
nrow(asian_full[asian_full$exome == 1,]) #making sure numbers match
nrow(asian_full[asian_full$imputed == 1,]) #making sure numbers match
nrow(asian_full[asian_full$both == 1,])

In [70]:
head(asian_full)

Unnamed: 0_level_0,FID,IID,father,mother,sex,pheno,exome,imputed,both
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>
1,1000906,1000906,0,0,1,-9,0,1,0
2,1001874,1001874,0,0,2,-9,0,1,0
3,1002712,1002712,0,0,2,-9,0,1,0
4,1003025,1003025,0,0,1,-9,1,1,1
5,1003083,1003083,0,0,1,-9,1,1,1
6,1004204,1004204,0,0,1,-9,0,1,0


In [72]:
write.table(asian_full, '~/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_asian_qc_10188ind_withexome_or_imputed', sep="\t", row.names = FALSE, col.names= TRUE)
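The exome/imputed/both flags built in R above can also be cross-checked from the command line. The following is a minimal sketch, assuming a PLINK `.fam` file (IID in column 2) and an Oxford `.sample` file with two header lines (IID in column 1). The function name `flag_membership` is hypothetical and is not part of the pipeline.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): flag which samples of a QC'ed .fam file
# are also present in an exome .fam file and in an imputed .sample file.
# Usage: flag_membership QC_FAM EXOME_FAM IMPUTED_SAMPLE
# Output columns (tab-separated): FID IID exome imputed both
flag_membership() {
    local fam="$1" exome_fam="$2" imputed_sample="$3"
    awk -v OFS='\t' '
        # First file: exome .fam, IID is column 2
        FILENAME == ARGV[1] { exome[$2] = 1; next }
        # Second file: .sample, skip the two header lines, ID_1 is column 1
        FILENAME == ARGV[2] { if (FNR > 2) imputed[$1] = 1; next }
        # Third file: the QC-ed .fam to annotate
        {
            e = ($2 in exome) ? 1 : 0
            i = ($2 in imputed) ? 1 : 0
            b = (e && i) ? 1 : 0
            print $1, $2, e, i, b
        }
    ' "$exome_fam" "$imputed_sample" "$fam"
}
```

Summing the third, fourth and fifth output columns should reproduce the three `nrow()` counts from the R cell, provided the same input files are used.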

### Africans QC

The scripts that generated this subset of the data are here:

`~/UKBiobank/genotype_files_processed/010522_sample_qc_africans.R`

`~/UKBiobank/genotype_files_processed/010522_sample_qc_africans.sh`

The ethnicity filtering is in this notebook:

`~/project/UKBB_GWAS_dev/analysis/phenotypes/122021_Asians_Africans_in_500K.ipynb`

In [23]:
# Read in the file with the African individuals to subset
african <- read.table("/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/010622_ukb47922_african_9096.iid", header=TRUE, sep="\t")

In [24]:
head(african)

Unnamed: 0_level_0,FID,IID,ethnicity
Unnamed: 0_level_1,<int>,<int>,<fct>
1,1000697,1000697,4001
2,1001447,1001447,4001
3,1001465,1001465,2001
4,1002004,1002004,4002
5,1002354,1002354,4001
6,1002390,1002390,4002


In [25]:
african_subset<-subset(data.haveboth, IID %in% african$IID)

In [27]:
nrow(african_subset)

In [71]:
# Save the IID that pass these QC filters to a file
ID_keep <- data.haveboth[,1, drop=FALSE]

#### Call rate 95% for Africans

In [7]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/010622_african_9096ind
# bfile with variant_qc_1
genoFile=~/UKBiobank/genotype_files_processed/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.bed
#To keep the African samples that passed sample_qc_1
keep_samples=~/UKBiobank/genotype_files_processed/010622_sampleQC_IID_keep_african_iid.txt
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/African_QC_mind0.05_$(date +"%Y-%m-%d").sbatch
maf_filter=0
geno_filter=0
hwe_filter=0
# Keep individuals with call rate > 95%
mind_filter=0.05
mem='30G'
name='010622'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/African_QC_mind0.05_2022-01-06.sbatch
INFO: Workflow csg (ID=wfc71384fbf7a3eb4) is executed successfully with 1 completed step.



#### Remove the related individuals in the African subset

In [31]:
### Read in the individuals removed by the sample missingness filter (mind=0.05, 95% call rate) for the African subset
### In this case 4 individuals were removed, leaving N=8,617
rm_african<- read.table("~/UKBiobank/genotype_files_processed/010622_african_9096ind/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_african_9096ind.filtered.mindrem.id", header=FALSE)
names(rm_african) <-c("FID","IID")
head(rm_african)

Unnamed: 0_level_0,FID,IID
Unnamed: 0_level_1,<int>,<int>
1,3656538,3656538
2,3733695,3733695
3,4958925,4958925
4,5991763,5991763


In [29]:
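african.related <- subset(african_subset, african_subset$f.22021.0.0>1)  # f.22021.0.0 > 1 keeps only code 10 (ten or more third-degree relatives); code 1 (at least one relative identified) is not selected here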
head(african.related)

Unnamed: 0_level_0,IID,FID,ignore1,ignore2,ignore3,ignore4,f.31.0.0,f.22001.0.0,f.22019.0.0,f.22027.0.0,f.22021.0.0,f.22006.0.0,sex_match
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<lgl>
223911,3304568,3304568,0,0,1,-9,1,1,,,10,,True
300586,4094127,4094127,0,0,2,-9,0,0,,,10,,True


In [30]:
african_iid_related <- african.related %>%
    filter(!IID %in% rm_african$IID) %>%
    select ("FID", "IID")
cat("The number of african individuals that are related in the sample is:",nrow(african.related),"\n")
cat("The number of individuals that are related in the sample after filtering for sample missingness is:",nrow(african_iid_related),"\n")

The number of african individuals that are related in the sample is: 2 
The number of individuals that are related in the sample after filtering for sample missingness is: 2 


In [32]:
write.table(african_iid_related, '~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_sampleQC_IID_african_related.txt', sep="\t", row.names = FALSE, col.names= FALSE)

#### Do the variant qc #2

In [33]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/010622_african_9096ind
# bfile with variant_qc_1 N=8,617 and variants=674,489
genoFile=~/UKBiobank/genotype_files_processed/010622_african_9096ind/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_african_9096ind.filtered.bed
#To remove related samples
remove_samples=~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_sampleQC_IID_african_related.txt
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/african_variantqc2_$(date +"%Y-%m-%d").sbatch
maf_filter=0.01
geno_filter=0.01
hwe_filter=5e-08
# Set mind filter to 0 not to filter out more individuals based on sample missingness
mind_filter=0
mem='30G'
name='010722'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/african_variantqc2_2022-01-07.sbatch
INFO: Workflow csg (ID=w8a9ee43c49953502) is executed successfully with 1 completed step.



#### Obtain the final file for all the samples after variant qc1, sample qc and variant qc2

The snplist file was created from the bim file produced by the previous filtering step (variants=351,690)

`UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_african_9096ind.filtered.010622_african_9096ind.filtered.bim`

The sample file to keep consists of N=8,617 individuals (only 4 removed after mind=0.05)

`UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_african_9096ind.filtered.fam`
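Deriving a variant list from a `.bim` file amounts to extracting its second column (the variant ID). A minimal sketch of that step follows. The function name `bim_to_snplist` is hypothetical, and the pipeline may have produced its `.snplist` in a different way.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): print the variant IDs of a PLINK .bim file,
# one per line. Column 2 of a .bim file holds the variant identifier.
# Usage: bim_to_snplist FILE.bim > FILE.snplist
bim_to_snplist() {
    awk '{ print $2 }' "$1"
}
```

Running `bim_to_snplist file.bim | wc -l` gives the number of variants, which can be compared against the count quoted above.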

In [34]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_sample_var_final_qc
# original bfile
genoFile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
#To keep samples N=8,617
keep_samples=~/UKBiobank/genotype_files_processed/010622_african_9096ind/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_african_9096ind.filtered.fam
# To keep variants=351,690
keep_variants=~/UKBiobank/genotype_files_processed/010622_african_9096ind/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.010622_african_9096ind.qc_pass.snplist
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/African_sampleqc_final_$(date +"%Y-%m-%d").sbatch
## All filters set to 0 
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
mem='30G'
name='010722'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/African_sampleqc_final_2022-01-07.sbatch
INFO: Workflow csg (ID=w71e740a51bd0f6fb) is executed successfully with 1 completed step.



In [73]:
#For the 500K individuals
african_qc <- read.table("~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.fam")
names(african_qc) <-c("FID","IID","father", "mother", "sex", "pheno")

In [74]:
head(african_qc)
nrow(african_qc)

Unnamed: 0_level_0,FID,IID,father,mother,sex,pheno
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>
1,1000697,1000697,0,0,2,-9
2,1001447,1001447,0,0,2,-9
3,1001465,1001465,0,0,2,-9
4,1002004,1002004,0,0,1,-9
5,1002354,1002354,0,0,2,-9
6,1002390,1002390,0,0,2,-9


In [75]:
african_exome <- african_qc %>%
    filter(IID %in% exome_id$IID) %>%
    select ("FID", "IID")

In [76]:
nrow(african_exome)

In [77]:
african_imput <- african_qc %>%
    filter(IID %in% imput_id$ID_1) %>%
    select ("FID", "IID")
nrow(african_imput)

In [78]:
african_full <- african_qc %>%
  mutate(exome = if_else(IID %in% exome_id$IID,1 , 0),
         imputed = if_else(IID %in% imput_id$ID_1,1, 0),
         both = if_else (exome == 1 & imputed == 1, 1, 0))
nrow(african_full[african_full$exome == 1,]) #making sure numbers match
nrow(african_full[african_full$imputed == 1,]) #making sure numbers match
nrow(african_full[african_full$both == 1,])

In [79]:
head(african_full)

Unnamed: 0_level_0,FID,IID,father,mother,sex,pheno,exome,imputed,both
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>
1,1000697,1000697,0,0,2,-9,1,1,1
2,1001447,1001447,0,0,2,-9,1,1,1
3,1001465,1001465,0,0,2,-9,0,1,0
4,1002004,1002004,0,0,1,-9,1,1,1
5,1002354,1002354,0,0,2,-9,1,1,1
6,1002390,1002390,0,0,2,-9,0,1,0


In [80]:
write.table(african_full, '~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_african_qc_8617ind_withexome_or_imputed', sep="\t", row.names = FALSE, col.names= TRUE)

## Step 6. Filter using plink for individual call rate > 99% White British + other

After selecting the individuals and applying mind=0.01 (keep individuals with call rate > 99%), plink reports:

49718 samples removed due to missing genotype data (--mind)
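A `mind` threshold of 0.01 means that a sample is dropped when more than 1% of its genotypes are missing. As a rough illustration, the sketch below lists such samples from a PLINK 1.9 `.imiss` report (columns `FID IID MISS_PHENO N_MISS N_GENO F_MISS`, as written by `--missing`). The function name `low_call_rate_samples` is hypothetical, and the workflow itself applies the filter through plink rather than through this script.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): list FID/IID of samples whose missing call
# rate (F_MISS, column 6 of a PLINK .imiss file) exceeds a threshold.
# Usage: low_call_rate_samples FILE.imiss 0.01
low_call_rate_samples() {
    local imiss="$1" max_missing="$2"
    # NR > 1 skips the header line; "+ 0" forces a numeric comparison
    awk -v thr="$max_missing" 'NR > 1 && ($6 + 0) > (thr + 0) { print $1, $2 }' "$imiss"
}
```

Counting the output lines with `wc -l` should match the number of samples plink reports as removed, assuming the `.imiss` file was computed on the same samples and variants.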

In [6]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/082621_sampleqc
# bfile with variant_qc_1
genoFile=~/UKBiobank/genotype_files_processed/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.bed
#To keep the samples that passed sample_qc_1
keep_samples=~/UKBiobank/genotype_files_processed/082421_sampleQC_IID_keep.txt
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/sampleqc_variantfilteredbed_$(date +"%Y-%m-%d").sbatch
maf_filter=0
geno_filter=0
hwe_filter=0
# Keep individuals with call rate > 99%
mind_filter=0.01
mem='30G'
name='082621'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/sampleqc_variantfilteredbed_2021-08-26.sbatch
INFO: Workflow csg (ID=wb3a0e834b911693a) is executed successfully with 1 completed step.



## Use a call rate >90% white British + other

In [92]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/082621_sampleqc_call90
# bfile with variant_qc_1
genoFile=~/UKBiobank/genotype_files_processed/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.bed
#To keep the samples that passed sample_qc_1
keep_samples=~/UKBiobank/genotype_files_processed/082421_sampleQC_IID_keep.txt
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/sampleqc_call90_$(date +"%Y-%m-%d").sbatch
maf_filter=0
geno_filter=0
hwe_filter=0
# Keep individuals with call rate > 90%
mind_filter=0.1
mem='30G'
name='082621'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/sampleqc_call90_2021-08-31.sbatch
INFO: Workflow csg (ID=w5c2d59f208020dfc) is executed successfully with 1 completed step.



## Step 7. Relatedness using variable f.22021.0.0

In [70]:
### Read in the removed individuals after the sample missingness filter (mind=0.01) 99% call rate
removed_IID <- read.table("~/UKBiobank/genotype_files_processed/082621_sampleqc/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.082621_sampleqc.filtered.mindrem.id", header=FALSE)
names(removed_IID) <-c("FID","IID")

In [84]:
data.related <- subset(data.haveboth, data.haveboth$f.22021.0.0==1)  #at least one relative identified
ID_related <- data.related %>%
    filter(!IID %in% removed_IID$IID) %>%
    select ("FID", "IID")
cat("The number of individuals that are related in the sample is:",nrow(data.related),"\n")
cat("The number of individuals that are related in the sample after filtering for sample missingness is:",nrow(ID_related),"\n")
write.table(ID_related, '~/UKBiobank/genotype_files_processed/082421_sampleQC_IID_related.txt', sep="\t", row.names = FALSE, col.names= FALSE)

The number of individuals that are related in the sample is: 147252 
The number of individuals that are related in the sample after filtering for sample missingness is: 132222 
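The related-sample list is an anti-join: related individuals minus those already dropped by the missingness filter. A minimal shell sketch of the same operation is given here, assuming both inputs are whitespace-separated files with FID in column 1 and IID in column 2. The function name `related_after_mind` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): print FID/IID of related samples that were
# NOT already removed by the sample missingness (--mind) filter.
# Usage: related_after_mind RELATED_IDS REMOVED_IDS
related_after_mind() {
    local related="$1" removed="$2"
    awk -v OFS='\t' '
        # First file: samples removed by --mind, remember their IIDs
        FILENAME == ARGV[1] { drop[$2] = 1; next }
        # Second file: related samples, keep those not in the removed set
        !($2 in drop) { print $1, $2 }
    ' "$removed" "$related"
}
```

The output has the same two-column, tab-separated, header-free layout as the file written by `write.table` above.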


In [59]:
### Read in the individuals retained after the sample missingness filter (mind=0.1) 90% call rate
retained_IID <- read.table("~/UKBiobank/genotype_files_processed/082621_sampleqc_call90/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.082621_sampleqc_call90.filtered.fam", header=FALSE)
names(retained_IID) <-c("FID","IID","ignore1", "ignore2", "ignore3", "ignore4")

In [13]:
data.related <- subset(data.haveboth, data.haveboth$f.22021.0.0==1)  #at least one relative identified
# NOTE: retained_IID lists the samples that were kept, so `!IID %in% retained_IID$IID`
# counts related individuals dropped by the 90% call-rate filter (0 here).
# Use `IID %in% retained_IID$IID` to list related individuals that remain in the data.
ID_related <- data.related %>%
    filter(!IID %in% retained_IID$IID) %>%
    select ("FID", "IID")
cat("The number of individuals that are related in the sample is:",nrow(data.related),"\n")
cat("The number of individuals that are related in the sample after filtering for sample missingness is:",nrow(ID_related),"\n")
#write.table(ID_related, '~/UKBiobank/genotype_files_processed/082421_sampleQC_IID_related.txt', sep="\t", row.names = FALSE, col.names= FALSE)

The number of individuals that are related in the sample is: 147252 
The number of individuals that are related in the sample after filtering for sample missingness is: 0 


### Determine kinship using king

In [11]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
##Columbia's variables
cwd=$UKBB_PATH/genotype_files_processed/082621_king
genoFile=$UKBB_PATH/genotype_files_processed/082621_sampleqc/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.082621_sampleqc.filtered.bed
king_sbatch=$USER_PATH/UKBB_GWAS_dev/output/king_genoarray_$(date +"%Y-%m-%d").sbatch
kinship=0.0625
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
job_size=1
numThreads=20
mem='30G'
walltime='60h'

king_args="""king
    --cwd $cwd
    --genoFile $genoFile
    --kinship $kinship
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --walltime $walltime
    --no-maximize-unrelated
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $king_sbatch \
    --args "$king_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/king_genoarray_2021-08-27.sbatch
INFO: Workflow csg (ID=w9ab1d59b1ee0d16a) is executed successfully with 1 completed step.
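The `kinship=0.0625` value used above equals the expected kinship coefficient of third-degree relatives (1/16). For orientation, the sketch below maps a kinship estimate to the relationship ranges given in the KING documentation (>0.354, 0.177-0.354, 0.0884-0.177, 0.0442-0.0884). The function name `classify_kinship` is hypothetical, and how the workflow itself applies the threshold is not shown here.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): classify a KING kinship coefficient using
# the inference ranges from the KING documentation.
# Usage: classify_kinship 0.0625
classify_kinship() {
    awk -v k="$1" 'BEGIN {
        k = k + 0   # force numeric interpretation
        if (k > 0.354)       print "duplicate/MZ twin"
        else if (k > 0.177)  print "1st-degree"
        else if (k > 0.0884) print "2nd-degree"
        else if (k > 0.0442) print "3rd-degree"
        else                 print "unrelated"
    }'
}
```

For example, `classify_kinship 0.0625` falls in the third-degree range, so the chosen threshold sits in the middle of the third-degree band and not at its lower bound.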



## Step 8. Variant QC number 2

### Get the unrelated individuals only, do quality controls for variant missingness (geno=0.01), HWE (5e-08) and maf filter (0.01)

In [85]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/083021_sampleqc2
# bfile with variant_qc_1
genoFile=~/UKBiobank/genotype_files_processed/082621_sampleqc/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.082621_sampleqc.filtered.bed
#To remove related samples
remove_samples=~/UKBiobank/genotype_files_processed/082421_sampleQC_IID_related.txt
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/sampleqc2_$(date +"%Y-%m-%d").sbatch
maf_filter=0.01
geno_filter=0.01
hwe_filter=5e-08
# Set mind filter to 0 not to filter out more individuals based on sample missingness
mind_filter=0
mem='30G'
name='083021'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/sampleqc2_2021-08-31.sbatch
INFO: Workflow csg (ID=w039a30c991021747) is executed successfully with 1 completed step.
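The filters in this step (MAF 0.01, variant missingness 0.01, HWE 5e-08, related samples removed) correspond to standard plink options. The sketch below only prints a plink command line for inspection and does not run it. It assumes the workflow ultimately calls plink with `--remove`, `--maf`, `--geno` and `--hwe`; the actual options used inside `GWAS_QC.ipynb` may differ. The function name `plink_variant_qc_cmd` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): print, without executing, a plink command
# that applies the variant QC thresholds used in this step.
# Usage: plink_variant_qc_cmd BFILE_PREFIX REMOVE_FILE OUT_PREFIX [MAF] [GENO] [HWE]
plink_variant_qc_cmd() {
    local bfile="$1" remove_file="$2" out="$3"
    # Defaults mirror the values set in the cell above
    local maf="${4:-0.01}" geno="${5:-0.01}" hwe="${6:-5e-08}"
    printf 'plink --bfile %s --remove %s --maf %s --geno %s --hwe %s --make-bed --out %s\n' \
        "$bfile" "$remove_file" "$maf" "$geno" "$hwe" "$out"
}
```

Printing the command first makes it easy to review the thresholds before submitting the job to the cluster.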



## Create the final file for analyses call rate >99%

In [86]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/083021_sample_variant_qc_final
# original bfile
genoFile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
#To keep samples
keep_samples=~/UKBiobank/genotype_files_processed/082621_sampleqc/cache/UKB_genotypedatadownloaded083019.082621_sampleqc.qc_pass.id
# To keep variants
keep_variants=~/UKBiobank/genotype_files_processed/083021_sampleqc2/cache/UKB_genotypedatadownloaded083019.082621_sampleqc.083021_sampleqc2.qc_pass.snplist
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/sampleqc_final_$(date +"%Y-%m-%d").sbatch
## All filters set to 0 
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
mem='30G'
name='083021'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/sampleqc2_2021-08-31.sbatch
INFO: Workflow csg (ID=wb46da603845062d4) is executed successfully with 1 completed step.



## Create the final file for analyses call rate >90% (this is the one used downstream for PCA and LMM analysis)

In [77]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90
# original bfile
genoFile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
#To keep samples
keep_samples=~/UKBiobank/genotype_files_processed/082621_sampleqc_call90/cache/UKB_genotypedatadownloaded083019.genotype_files_processed.filtered.extracted.082621_sampleqc_call90.filtered.qc_pass.id
# To keep variants
keep_variants=~/UKBiobank/genotype_files_processed/083021_sampleqc2/cache/UKB_genotypedatadownloaded083019.082621_sampleqc.083021_sampleqc2.qc_pass.snplist
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/sampleqc_final_$(date +"%Y-%m-%d").sbatch
## All filters set to 0 
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
mem='30G'
name='090221'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/sampleqc_final_2021-09-02.sbatch
INFO: Workflow csg (ID=wd3cc554baec8ffe9) is executed successfully with 1 completed step.



### Ancestry restriction (Megan's pipeline uses only people who self-identify as white British)

In this case we are defining our own ancestry based on the PC analysis

In [None]:
data.haveboth <- subset(data.haveboth, data.haveboth$f.22006.0.0==1)
#n = 408245

## Call rate

In [None]:
data.imissing <-read.table("/research_storage/scratch/UKBiobank/genotype_files/my_SNP_QC/bfiles_created_along_the_way/UKB_autosomalvariants_passing_batchQC_noindels_unrelated_whiteBritishsubjects_passing_standardexclusions_samplecallrt.irem", header=FALSE, stringsAsFactors = FALSE)

## Determine how many of the individuals from the exome data are removed with the sampleQC in the genotype array data

In [58]:
## Read-in 'white' individuals 

exomed_IID <- read.table('/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind')
names(exomed_IID) <-c("FID","IID")

In [75]:
head(exomed_IID)
nrow(exomed_IID)

Unnamed: 0_level_0,FID,IID
Unnamed: 0_level_1,<int>,<int>
1,1000019,1000019
2,1000035,1000035
3,1000078,1000078
4,1000081,1000081
5,1000198,1000198
6,1000210,1000210


In [60]:
ID_exome_not_geno <- exomed_IID %>%
    filter(!IID %in% retained_IID$IID) %>%
    select ("FID", "IID")

In [61]:
# Individuals present in the exome data but that did not pass sample-QC on the genotype array data
nrow(ID_exome_not_geno)

In [68]:
head(sexnomatch)

Unnamed: 0_level_0,IID,FID,ignore1,ignore2,ignore3,ignore4,f.31.0.0,f.22001.0.0,f.22019.0.0,f.22027.0.0,f.22021.0.0,f.22006.0.0,sex_match
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<lgl>
374,1003854,1003854,0,0,1,-9,0,1,,,0,1,False
621,1006361,1006361,0,0,1,-9,0,1,,,0,1,False
1158,1011973,1011973,0,0,1,-9,0,1,,,0,1,False
1297,1013439,1013439,0,0,1,-9,0,1,,,1,1,False
1939,1019998,1019998,0,0,2,-9,1,0,1.0,,1,1,False
3881,1040038,1040038,0,0,2,-9,1,0,1.0,,1,1,False


In [62]:
# How many removed because sex did not match
nomatchsex <-  ID_exome_not_geno %>%
    filter(IID %in% sexnomatch$IID)
nrow(nomatchsex)

In [63]:
head(aneu.toexclude)
aneu.toexclude <- as.data.frame(aneu.toexclude)

Unnamed: 0,IID
93,1000971
1780,1018401
1849,1019099
2368,1024453
2630,1027190
4269,1044050


In [64]:
# How many have sex aneuploidies
aneu <- ID_exome_not_geno %>%
    filter(IID %in% aneu.toexclude$IID)
nrow(aneu)

In [65]:
head(list.toexclude)
list.toexclude <- as.data.frame(list.toexclude)

Unnamed: 0,IID
1396,1014455
1644,1017024
1699,1017579
2996,1030922
3350,1034540
3795,1039146


In [67]:
# How many removed for being outliers in heterozygosity and missing rates
outliers <- ID_exome_not_geno %>%
    filter(IID %in% list.toexclude$IID)
nrow(outliers)

In [None]:
nrow(ID_exome_not_geno %>% filter(IID %in% dat$IID))

## Determine sample missingness for merged exomes 

In [3]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container_lmm=$HOME/containers/lmm.sif
#Columbia's cluster
cwd=~/UKBiobank/data/exome_files/project_VCF/072721_run/merged_plink/mind_0.1
# original bfile
genoFile=~/UKBiobank/data/exome_files/project_VCF/072721_run/merged_plink/ukb23155_qc_merged.bed
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/exome_sample_missingess_$(date +"%Y-%m-%d").sbatch
##remove individuals with > 10% genotypes missing
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0.1
mem='30G'
name='090221'
job_size=1
numThreads=2

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/exome_sample_missingess_2021-09-03.sbatch
INFO: Workflow csg (ID=wfadfd852309f5295) is executed successfully with 1 completed step.



In [6]:
## Select individuals for PCA phenofile 
white <- read.table("~/UKBiobank_Yale_transfer/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno", sep="\t", header=TRUE)

In [7]:
head(white)

Unnamed: 0_level_0,FID,IID,ethnicity
Unnamed: 0_level_1,<int>,<int>,<fct>
1,1000019,1000019,British
2,1000035,1000035,British
3,1000078,1000078,British
4,1000081,1000081,British
5,1000198,1000198,British
6,1000210,1000210,British


In [17]:
pca_keep <- read.table("/mnt/mfs/statgen/UKBiobank/results/083021_PCA_results/europeans/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.fam")
names(pca_keep) <-c("FID","IID", "ignore1","ignore2", "ignore3", "ignore4")

In [18]:
head(pca_keep)

Unnamed: 0_level_0,FID,IID,ignore1,ignore2,ignore3,ignore4
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>
1,1000019,1000019,0,0,2,-9
2,1000035,1000035,0,0,1,-9
3,1000078,1000078,0,0,2,-9
4,1000081,1000081,0,0,1,-9
5,1000198,1000198,0,0,2,-9
6,1000210,1000210,0,0,1,-9


In [20]:
library(tidyverse)
selected <- white %>%
    filter(IID %in% pca_keep$IID)


In [21]:
nrow(selected)
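
The selection step can be sanity-checked by also counting the IDs that fall out on either side. The following is a minimal sketch in base R, assuming toy stand-ins for `white` and `pca_keep` (the `_toy` data frames, their IDs, and the temporary output path are hypothetical and not UKBB data):

```r
# Toy stand-ins for `white` and `pca_keep` (hypothetical IDs, not UKBB data)
white_toy <- data.frame(FID = c(1, 2, 3, 4), IID = c(1, 2, 3, 4),
                        ethnicity = c("British", "British", "Irish", "British"))
pca_keep_toy <- data.frame(FID = c(2, 3, 5), IID = c(2, 3, 5))

# Keep the phenotype rows whose IID is present in the QC'ed .fam file
selected_toy <- white_toy[white_toy$IID %in% pca_keep_toy$IID, ]

# IDs in the QC'ed set without a phenotype row, and the reverse
qc_only <- setdiff(pca_keep_toy$IID, white_toy$IID)
pheno_only <- setdiff(white_toy$IID, pca_keep_toy$IID)

cat("Selected:", nrow(selected_toy), "\n")
cat("In QC'ed set without a phenotype row:", length(qc_only), "\n")
cat("In phenotype file but absent from QC'ed set:", length(pheno_only), "\n")

# Write a tab-delimited phenotype file (temporary path used for illustration only)
out_file <- tempfile(fileext = ".pheno")
write.table(selected_toy, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Reporting both set differences makes it easier to tell whether a drop in `nrow(selected)` comes from the QC step or from missing phenotype rows.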

## Evaluate whether the withdrawn participants are in our dataset

In [3]:
exclusion1 <- read.csv("~/UKBiobank/data/ukbb_databases/participant_withdrawal/32285_20210809.csv")

“cannot open file '/home/dmc2245/UKBiobank/data/ukbb_databases/participant_withdrawal/32285_20210809.csv': No such file or directory”


ERROR: Error in file(file, "rt"): cannot open the connection


In [None]:
head(exclusion1)
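
Once the withdrawal file can be read, the check itself is a membership test on `IID`. A minimal sketch, assuming the withdrawal list is a single column of participant IDs without a header (this layout is an assumption, and the file, IDs, and `dataset_toy` below are hypothetical):

```r
# Hypothetical withdrawal list: one participant ID per line, no header (assumed layout)
withdrawal_csv <- tempfile(fileext = ".csv")
writeLines(c("1000035", "9999999"), withdrawal_csv)

# Fail early with a clear message instead of the "cannot open the connection" error
if (!file.exists(withdrawal_csv)) {
  stop("Withdrawal file not found: ", withdrawal_csv)
}
withdrawn <- read.csv(withdrawal_csv, header = FALSE, col.names = "IID")

# Toy stand-in for the analysis dataset (hypothetical IDs)
dataset_toy <- data.frame(IID = c(1000019, 1000035, 1000078))

# Withdrawn participants that are still present in the dataset
still_present <- dataset_toy$IID[dataset_toy$IID %in% withdrawn$IID]
cat("Withdrawn participants still in dataset:", length(still_present), "\n")

# Dataset with the withdrawn participants removed
cleaned <- dataset_toy[!dataset_toy$IID %in% withdrawn$IID, , drop = FALSE]
```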

# Determine which individuals with exome data are removed because they did not pass genotype array QC

In [6]:
fam <- read.table("~/UKBiobank_Yale_transfer/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam", sep=' ', header=FALSE)
colnames(fam) <- c("FID","IID","fatherID", "motherID", "sex", "phenotype")
head(fam)
dim(fam)

,FID,IID,fatherID,motherID,sex,phenotype
,<int>,<int>,<int>,<int>,<int>,<int>
1,1434748,1434748,0,0,2,-9
2,5523981,5523981,0,0,1,-9
3,5023838,5023838,0,0,2,-9
4,4023729,4023729,0,0,1,-9
5,4442146,4442146,0,0,2,-9
6,5654789,5654789,0,0,2,-9


# Subset the total database based on the individuals that have exome data

In [9]:
dim(data.haveboth)

In [17]:
# Individuals with exomes that also have genotype array data
ind_with_exome  <- merge(data.haveboth, fam, by="IID", all=FALSE)
dim(ind_with_exome)
# There are 174 fewer individuals after the merge: those 174 have phenotype and exome data but no genotype array data
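
To see which individuals the inner merge drops, rather than only how many, the IDs can be compared directly. A minimal sketch with toy stand-ins for `data.haveboth` and `fam` (the `_toy` objects and their IDs are hypothetical), assuming `IID` is unique in both tables:

```r
# Toy stand-ins for `data.haveboth` and `fam` (hypothetical IDs)
data_toy <- data.frame(IID = c(1, 2, 3), f.31.0.0 = c(0, 1, 1))
fam_toy <- data.frame(FID = c(2, 3, 4, 5), IID = c(2, 3, 4, 5))

# Inner join: only IIDs present in both tables survive
merged_toy <- merge(data_toy, fam_toy, by = "IID", all = FALSE)

# Exome individuals that are absent from the genotype/phenotype table
exome_without_array <- setdiff(fam_toy$IID, data_toy$IID)

cat("Exome individuals:", nrow(fam_toy), "\n")
cat("Kept by the merge:", nrow(merged_toy), "\n")
cat("Dropped by the merge:", length(exome_without_array), "\n")

# With unique IIDs, kept + dropped must equal the size of the exome .fam file
stopifnot(nrow(merged_toy) + length(exome_without_array) == nrow(fam_toy))
```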

In [18]:
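# Flag individuals whose genetic sex (f.22001.0.0) matches their reported sex (f.31.0.0)
ind_with_exome$sex_match <- (ind_with_exome$f.22001.0.0 == ind_with_exome$f.31.0.0)
# subset() drops rows where sex_match is NA, so those individuals land in neither group
sexnomatch <- subset(ind_with_exome, !sex_match)
ind_with_exome <- subset(ind_with_exome, sex_match)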
cat("The number of individuals whose reported and genetic sex match is:",nrow(ind_with_exome),"\n")
cat("The number of individuals whose reported and genetic sex does NOT match is:",nrow(sexnomatch),"\n")

The number of individuals whose reported and genetic sex match is: 200386 
The number of individuals whose reported and genetic sex does NOT match is: 59 


In [12]:
#4b. Identify subjects with sex chromosome karyotypes putatively different from XX or XY

aneu.toexclude  <- subset(ind_with_exome, ind_with_exome$f.22019.0.0==1, select=c(IID))
aneu.toexclude  <- as.matrix(aneu.toexclude)

ind_with_exome <-subset(ind_with_exome, ! IID %in% aneu.toexclude)
cat("The number of individuals with aneuploidies is:",nrow(aneu.toexclude),"\n")
cat("The number of individuals with XX and XY",nrow(ind_with_exome),"\n")
# n = 200,210 subjects remain

The number of individuals with aneuploidies is: 176 
The number of individuals with XX and XY 200210 


In [13]:
list.toexclude <- subset(ind_with_exome, ind_with_exome$f.22027.0.0==1, select=c(IID))
list.toexclude  <- as.matrix(list.toexclude)

ind_with_exome <-subset(ind_with_exome, ! IID %in% list.toexclude)
cat("The number of individuals that are outliers for heterozygosity or missing rates is:",nrow(list.toexclude),"\n")
cat("The number of individuals that are not outliers for heterozygosity or missing rates is:",nrow(ind_with_exome),"\n")

The number of individuals that are outliers for heterozygosity or missing rates is: 388 
The number of individuals that are not outliers for heterozygosity or missing rates is: 199822 
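
The three exclusions above (sex mismatch, aneuploidy, heterozygosity/missingness outliers) follow the same pattern, so they could be wrapped in one helper that logs how many individuals each step removes. The following is a minimal sketch on toy data; `run_sample_qc`, `flagged`, `steps`, and `qc_toy` are hypothetical names, and only the field names come from the cells above. Unlike the `subset()` calls above, this sketch treats a missing sex comparison as a mismatch, which is an assumption made here for illustration.

```r
# Toy data using the field names from the cells above (values are hypothetical)
qc_toy <- data.frame(
  IID = 1:6,
  f.31.0.0 = c(0, 1, 1, 0, 1, 0),           # reported sex
  f.22001.0.0 = c(0, 1, 0, 0, 1, 0),        # genetic sex
  f.22019.0.0 = c(NA, NA, NA, 1, NA, NA),   # sex chromosome aneuploidy flag
  f.22027.0.0 = c(NA, 1, NA, NA, NA, NA)    # heterozygosity/missingness outlier flag
)

# TRUE only where the flag is present and equal to 1 (NA counts as not flagged)
flagged <- function(x) !is.na(x) & x == 1

# Each step returns a logical vector marking the rows to drop
steps <- list(
  sex_mismatch = function(d) {
    mismatch <- d$f.22001.0.0 != d$f.31.0.0
    is.na(mismatch) | mismatch              # assumption: missing comparison is dropped
  },
  aneuploidy = function(d) flagged(d$f.22019.0.0),
  het_missing_outlier = function(d) flagged(d$f.22027.0.0)
)

# Apply the steps in order and record removed/remaining counts per step
run_sample_qc <- function(d, steps) {
  log <- data.frame(step = names(steps), removed = NA_integer_, remaining = NA_integer_)
  for (i in seq_along(steps)) {
    drop <- steps[[i]](d)
    d <- d[!drop, , drop = FALSE]
    log$removed[i] <- sum(drop)
    log$remaining[i] <- nrow(d)
  }
  list(data = d, log = log)
}

res <- run_sample_qc(qc_toy, steps)
print(res$log)
```

Because the steps are held in a list, the same helper can be re-run on a different starting subset (for example the white subset) to compare counts without copying the cells.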


# How do the numbers change when the white subset is selected after the sex check?

In [16]:
white <- read.table("~/UKBiobank_Yale_transfer/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind", header=FALSE)
colnames(white) <- c("FID","IID")
dim(white)

In [19]:
ind_with_exome_white  <- merge(ind_with_exome,  white, by="IID", all=FALSE)
dim(ind_with_exome_white)

In [20]:
#4b. Identify subjects with sex chromosome karyotypes putatively different from XX or XY

aneu.toexclude  <- subset(ind_with_exome_white, ind_with_exome_white$f.22019.0.0==1, select=c(IID))
aneu.toexclude  <- as.matrix(aneu.toexclude)

ind_with_exome_white <-subset(ind_with_exome_white, ! IID %in% aneu.toexclude)
cat("The number of individuals with aneuploidies is:",nrow(aneu.toexclude),"\n")
cat("The number of individuals with XX and XY",nrow(ind_with_exome_white),"\n")
# n = 188,839 subjects remain

The number of individuals with aneuploidies is: 162 
The number of individuals with XX and XY 188839 


In [21]:
list.toexclude <- subset(ind_with_exome_white, ind_with_exome_white$f.22027.0.0==1, select=c(IID))
list.toexclude  <- as.matrix(list.toexclude)

ind_with_exome_white <-subset(ind_with_exome_white, ! IID %in% list.toexclude)
cat("The number of individuals that are outliers for heterozygosity or missing rates is:",nrow(list.toexclude),"\n")
cat("The number of individuals that are not outliers for heterozygosity or missing rates is:",nrow(ind_with_exome_white),"\n")

The number of individuals that are outliers for heterozygosity or missing rates is: 365 
The number of individuals that are not outliers for heterozygosity or missing rates is: 188474 
