# Scripts for PCA analyses

This notebook uses `Get_Job_Script.ipynb` to automatically generate the sbatch scripts that are run on Yale's or Columbia's cluster.

The generated scripts run the following analyses:

1. PCA analysis
2. Detect missingness in plink files
3. Extract SNPs/Individuals using Plink
4. Run regenie burden MWE

## File paths on Yale cluster
- Genotype files exome data:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020`
- Genotype files in PLINK format:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv`
- Genotype files in bgen format:
`SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/`
- Summary stats for imputed variants BOLT-LMM:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/results/BOLTLMM_results/results_imputed_data`
- Summary stats for imputed variants FastGWA:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/results/FastGWA_results/results_imputed_data`
- Phenotype files:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/UKB_Caucasiansubset_cholesterolfields_adjbymedstatus_062420_foranalysis`
- Relationship file:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620`
- Other traits to be analyzed:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/UKB_CAUC_lipidsforanalysis_apolipoproteinAandB,Hba1c_continuousandcategorical,egfrbyCKDEPI,serumcreatinine,UACR_inverseranknorm_110320`
- PCA results for expanded white:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/results/070921_pca_genotype_array`

## Create symlinks to necessary folders in your home dir

```
ln -s /mnt/mfs/statgen/archive/UKBiobank_Yale_transfer ~/
ln -s /mnt/mfs/statgen/UKBiobank ~/
```

## Yale's variables

In [None]:
# Common variables Yale's cluster
UKBB_PATH=/gpfs/gibbs/pi/dewan/data/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/farnam.yml
# Working directory for PCA made from exome data
pca_dir=$UKBB_PATH/results/pca_exomes
#Working directory for PCA made from genotype array data
cwd=$UKBB_PATH/results/070921_pca_genotype_array
#Use the original bed files for the genotype array for kinship calculation
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
# Container lmm 
container_lmm=$UKBB_PATH/lmm.sif
# Use a subset of the exomed markers
#genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
#phenoFile=~/scratch60/pca/cache/ukb23155_s200631.non_white_white_outliers_11971ind.pheno
#database=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb42495_updatedJune2020/ukb42495.tab
#ethnia_prefix='non_white_white_outliers_11971ind'

## Columbia's variables

In [3]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/vast/hpc/csg/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/vast/hpc/csg/UKBiobank
USER_PATH=$HOME/project
#OUT_PATH=$HOME/pca_01_18_22
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
container=$HOME/containers/lmm.sif
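
Columbia's cell above defines `container` and leaves `OUT_PATH` commented out, while the job cells below interpolate `$container_lmm` and `$OUT_PATH`. If only Columbia's cell has been run, those names expand to empty strings without any warning. The following is a minimal sketch of a guard that could be called before generating a script; `require_vars` is a hypothetical helper, not part of bioworkflows.

```bash
# Hypothetical helper (not part of bioworkflows): fail early when a variable
# that a later cell interpolates into the job arguments is unset or empty.
require_vars() {
    local missing=0 name
    for name in "$@"; do
        # ${!name} is bash indirect expansion: the value of the variable named by $name
        if [ -z "${!name:-}" ]; then
            echo "ERROR: variable '$name' is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call with names used by the cells in this notebook:
# require_vars UKBB_PATH USER_PATH OUT_PATH tpl_file container_lmm || echo "set the missing variables first"
```

The function returns a non-zero status when any name is missing, so it can gate the `sos run` call with `&&`.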




## General variables

In [4]:
# Pipeline
pca_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb
gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb
numThreads=1
job_size=1




## PCA variables

These can change depending on the work you are doing

In [5]:
#PCA variables change according to your analyses
k=10
maxiter=0
topk=10
sigma=6
window=50
shift=10
r2=0.1
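
The `window`, `shift` and `r2` values above are the LD pruning settings. The `.prune.in` files referenced later in this notebook suggest the workflow prunes with PLINK's `--indep-pairwise <window size> <step size> <r^2 threshold>`; that mapping is an assumption about the workflow, and `input_prefix` and `pruned_prefix` below are placeholders. The sketch only builds and prints the command string so the meaning of each value is explicit.

```bash
# Values from the PCA variables cell above
window=50   # window size, in variants
shift=10    # step: number of variants the window moves each time
r2=0.1      # pairwise r^2 threshold above which one variant of a pair is pruned

# Assumption: the qc workflow passes these three values to PLINK in this order.
# input_prefix and pruned_prefix are placeholders, not real files.
prune_cmd="plink --bfile input_prefix --indep-pairwise $window $shift $r2 --out pruned_prefix"
echo "$prune_cmd"
```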




In [None]:
# Name of bash script
#pca_sbatch=../output/$(date +"%Y-%m-%d")_pca_non_white.sbatch
#pca_sbatch=../output/$(date +"%Y-%m-%d")_flashpca_non_white_whiteoutliers.sbatch

## PCA jobs

### Full sample exome data UKBB

## 1. Do QC_1 on the genotype file (genotype array) that includes all samples

In [4]:
# Yale's cluster vars
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21
# Original bfile containing all of the samples Yale's cluster
#genoFile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
#To keep the samples of white individuals only
#keep_samples=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind

#Columbia's cluster
cwd=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21
# Original bfile containing all of the samples Columbia's cluster
genoFile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
#To keep the samples of white individuals only
keep_samples=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind

maf_filter=0.01
geno_filter=0.01
hwe_filter=5e-08
mind_filter=0.1
mem='30G'

gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_gwasqc1_originalbed.sbatch

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb dewan \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running dewan: Configuration for Yale `pi_dewan` partition cluster
INFO: dewan is completed.
INFO: dewan output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-14_gwasqc1_originalbed.sbatch
INFO: Workflow dewan (ID=w138e6a004ca3f3fd) is executed successfully with 1 completed step.
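
Every step in this notebook repeats the same `sos run ... Get_Job_Script.ipynb` call with a different template name, workflow file, output script and argument string. A minimal sketch of a wrapper is shown here; `make_job_script` and the `DRY_RUN` switch are hypothetical names introduced for illustration, and the flags are the ones already used in the cell above.

```bash
# Hypothetical wrapper around the repeated `sos run ... Get_Job_Script.ipynb` call.
# Usage: make_job_script <template> <template-file> <workflow-file> <output-script> <args>
# With DRY_RUN=1 the command is printed instead of executed.
make_job_script() {
    local template=$1 tpl=$2 workflow=$3 out_script=$4 args=$5
    local cmd=(sos run "$HOME/project/bioworkflows/admin/Get_Job_Script.ipynb" "$template"
               --template-file "$tpl"
               --workflow-file "$workflow"
               --to-script "$out_script"
               --args "$args")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # %q quotes each word so the printed line can be pasted back into a shell
        printf '%q ' "${cmd[@]}"; echo
    else
        "${cmd[@]}"
    fi
}

# Example (prints the command without running sos):
# DRY_RUN=1 make_job_script dewan "$tpl_file" "$gwasqc_sos" "$gwasqc_sbatch" "$gwasqc1_args"
```

Building the command as an array keeps the multi-line argument string as a single `--args` value, which is what the quoted `"$gwasqc1_args"` does in the original call.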



## 2. Run king:

Estimate the relationships among the exome-sequenced individuals.

Here the subset of white individuals is used first (file `030821_ukb42495_exomed_white_189010ind`).

In [2]:
##Yale's variables
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
#keep_samples=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind
##Use the qc'ed version of the genotype data
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed

##Columbia's variables
cwd=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed

king_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_king_extendedwhite.sbatch
kinship=0.0625
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
numThreads=20
mem='30G'
walltime='24h'

king_args="""king
    --cwd $cwd
    --genoFile $genoFile
    --kinship $kinship
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --walltime $walltime
    --no-maximize-unrelated
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $king_sbatch \
    --args "$king_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-21_flashpca_king_extendedwhite.sbatch
INFO: Workflow farnam (ID=wb21b557b74272734) is executed successfully with 1 completed step.
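
The `kinship=0.0625` cutoff used above equals 1/16, the expected kinship coefficient of third-degree relatives such as first cousins. The sketch below classifies a coefficient using the inference ranges given in the KING manual; `classify_kinship` is a hypothetical helper, and how the workflow itself applies the cutoff is not shown here.

```bash
# Hypothetical helper: map a kinship coefficient to a relationship class using
# the ranges from the KING manual (>0.354, [0.177, 0.354], [0.0884, 0.177),
# [0.0442, 0.0884)).
classify_kinship() {
    awk -v k="$1" 'BEGIN {
        if      (k > 0.354)   print "duplicate/MZ twin"
        else if (k >= 0.177)  print "1st-degree"
        else if (k >= 0.0884) print "2nd-degree"
        else if (k >= 0.0442) print "3rd-degree"
        else                  print "unrelated"
    }'
}

classify_kinship 0.25     # parent-offspring or full siblings
classify_kinship 0.0625   # the cutoff used in this notebook
```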



## 2.1 Merge all of the exome bed files

This step is only necessary when working with the exome data. The input is the UKBB exome data before any QC.

In [4]:
## Yale's cluster
#pca_dir=$UKBB_PATH/results/pca_exomes/merged_exomes
#genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`

## Columbia's cluster
pca_dir=$UKBB_yale/results/pca_exomes/merged_exomes
genoFile=`echo $UKBB_yale/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`

gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_merged_exomes.sbatch
numThreads=20
mem='60G'
merged_prefix='ukb23155_all_merged'

gwasqc_args="""merge_plink
    --cwd $pca_dir
    --genoFile $genoFile
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --merged_prefix $merged_prefix
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-23_merged_exomes.sbatch
INFO: Workflow farnam (ID=w01c904d3a0a199dc) is executed successfully with 1 completed step.
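
The `genoFile` assignment in this step relies on brace expansion: `{1..22}` produces 22 space-separated paths whether or not the files exist, because brace expansion is not globbing. A minimal sketch of a pre-merge check follows; `check_chr_filesets` is a hypothetical helper, and it assumes each chromosome has the usual `.bed`, `.bim` and `.fam` triplet.

```bash
# Hypothetical helper: report per-chromosome PLINK filesets that are incomplete.
# $1 = directory, $2 = file prefix before the chromosome number, $3 = suffix after it.
check_chr_filesets() {
    local dir=$1 pre=$2 post=$3 chr ext missing=0
    for chr in {1..22}; do
        for ext in bed bim fam; do
            # -s is true only when the file exists and is not empty
            if [ ! -s "$dir/${pre}${chr}${post}.${ext}" ]; then
                echo "missing or empty: $dir/${pre}${chr}${post}.${ext}" >&2
                missing=$((missing + 1))
            fi
        done
    done
    # Return 1 if anything was missing, 0 otherwise
    return $(( missing > 0 ))
}

# Example with the naming used in this step:
# check_chr_filesets "$UKBB_yale/ukb28374_exomedata/exome_data_OCT2020" ukb23155_c _b0_v1
```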



## 3. QC the exome data for PCA calculations 

This time I'll use the merged exomes with no QC (as downloaded directly from UKBB), because the mind filter is hard to apply to individual chromosomes.

In [2]:
## Yale's cluster
#pca_dir=$UKBB_PATH/results/pca_exomes/white_expanded_06_29_21_merged_exomes
## Use non qc'ed exome files, merged from the beginning
#genoFile=$UKBB_PATH/results/pca_exomes/merged_exomes/ukb23155_all_merged.bed
#To keep the samples of white individuals only
#keep_samples=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind
#remove_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id

## Columbia's cluster
pca_dir=$UKBB_yale/results/pca_exomes/white_expanded_06_29_21_merged_exomes
genoFile=$UKBB_yale/results/pca_exomes/merged_exomes/ukb23155_all_merged.bed
keep_samples=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind
remove_samples=$UKBB_yale/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id

#GWAS QC variables
maf_filter=0.01
geno_filter=0.01
hwe_filter=5e-08
#In this case I do want to remove individuals with more than 10% missing data (mind_filter=0.1)
mind_filter=0.1
#LD pruning variables
window=50
shift=10
r2=0.1
gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_unrelated_qc_merged_exomes.sbatch
numThreads=20
mem='30G'
merged_prefix='ukb23155_unrelated_noqc_merged_exomes'

gwasqc_args="""qc
    --cwd $pca_dir
    --genoFile $genoFile
    --keep_samples $keep_samples
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --merged_prefix $merged_prefix
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-29_flashpca_white_unrelated_qc_merged_exomes.sbatch
INFO: Workflow farnam (ID=w3a3fa030c00ef69a) is executed successfully with 1 completed step.



## 3.1 QC the genotype data we want to use for the PCA calculation on unrelated individuals

Ideal: In the case of the UKBB exome data, we would like to use the genotypes after pVCF-QC for every chromosome.

Trial run: Use the exome data without QC filters, as released by the UKBB.

In [2]:
pca_dir=$UKBB_PATH/results/pca_exomes/white_expanded_06_14_21
#pca_dir=~/scratch60/pca/white_expanded_06_14_21
## Use non qc'ed exomes files
genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
#To keep the samples of white individuals only
keep_samples=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind
remove_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#GWAS QC variables
maf_filter=0.01
geno_filter=0.01
hwe_filter=5e-08
## Do not use mind filter this time since it will remove samples based on each chromosome
mind_filter=0
#LD pruning variables
window=50
shift=10
r2=0.1
gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_unrelated_qc.sbatch
numThreads=20
mem='30G'
merged_prefix='ukb23155_unrelated_merged'

gwasqc_args="""qc
    --cwd $pca_dir
    --genoFile $genoFile
    --keep_samples $keep_samples
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --merged_prefix $merged_prefix
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-23_flashpca_white_unrelated_qc.sbatch
INFO: Workflow farnam (ID=w23511b402aa82ba0) is executed successfully with 1 completed step.
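
The mind filter is set to 0 in this step because a per-chromosome missingness filter can drop different individuals from each file, leaving filesets whose sample lists no longer match. A minimal sketch of a check that could be run on the per-chromosome `.fam` files before merging is shown here; `same_samples` is a hypothetical helper and it compares only the first two columns (FID and IID).

```bash
# Hypothetical check: do all given .fam files list the same samples (FID IID)?
# The first argument is the reference file; every other file is compared to it.
same_samples() {
    local ref=$1 f
    shift
    for f in "$@"; do
        # Compare the sorted (FID, IID) pairs; order within a file does not matter
        if ! cmp -s <(awk '{print $1, $2}' "$ref" | sort) \
                    <(awk '{print $1, $2}' "$f"   | sort); then
            echo "sample sets differ: $ref vs $f" >&2
            return 1
        fi
    done
}

# Example (paths are placeholders):
# same_samples chr1.fam chr2.fam chr3.fam && echo "all chromosomes share the same samples"
```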



## 3.2 Remove related individuals and do LD pruning on the genotype array for the PCA calculation

After the meeting on 06/30/21 the group decided that we should use the PCs calculated from the genotype array, so this new analysis reflects that decision.

In [6]:
## Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
## Use the qc version of the genotype array with the already filtered 189010 white individuals
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
#Remove the related individuals identified by KING
#remove_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id

## Columbia's cluster
cwd=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
## Use the qc version of the genotype array with the already filtered 189010 white individuals
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
#Remove the related individuals identified by KING
remove_samples=$UKBB_yale/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id

#GWAS QC variables: set all filters to 0 so that the already filtered data is not filtered again
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#LD pruning variables
window=50
shift=10
r2=0.1
gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_gwas_qc_white_expanded_unrelated_genoarray.sbatch
numThreads=20
mem='30G'
merged_prefix='ukb23155_qc_unrelated_genoarray'

gwasqc_args="""qc
    --cwd $cwd
    --genoFile $genoFile
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --merged_prefix $merged_prefix
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-30_gwas_qc_white_expanded_unrelated_genoarray.sbatch
INFO: Workflow farnam (ID=wf132c3ecc6bb7885) is executed successfully with 1 completed step.
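
Since this step only removes the individuals listed in `remove_samples`, the expected size of the unrelated set can be computed beforehand and compared with the `.fam` file the job produces. The sketch below assumes the ID file holds FID and IID in its first two columns, which is not confirmed by this notebook; `count_kept_samples` is a hypothetical helper.

```bash
# Hypothetical helper: count samples in a .fam file whose (FID, IID) pair is not
# listed in an ID file. Assumes the ID file has FID and IID in its first two columns.
count_kept_samples() {
    local fam=$1 ids=$2
    # FILENAME == ARGV[1] identifies the ID file even when it is empty,
    # which the common NR == FNR idiom does not handle.
    awk 'FILENAME == ARGV[1] { drop[$1 FS $2] = 1; next }
         !(($1 FS $2) in drop) { n++ }
         END { print n + 0 }' "$ids" "$fam"
}

# Example (uses the variables defined in the cell above; the .fam path is derived
# from genoFile by swapping the extension):
# count_kept_samples "${genoFile%.bed}.fam" "$remove_samples"
```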



### Do the merge_plink independently since it's not working in the nested workflow

In [7]:
## Yale's cluster
#genoFile=`echo $UKBB_PATH/results/pca_exomes/white_expanded_06_14_21/cache/ukb23155_c{1..22}_b0_v1.white_expanded_06_14_21.filtered.prune.bed`
#pca_dir=$UKBB_PATH/results/pca_exomes/white_expanded_06_14_21_merged_exomes

## Columbia's cluster
genoFile=`echo $UKBB_yale/results/pca_exomes/white_expanded_06_14_21/cache/ukb23155_c{1..22}_b0_v1.white_expanded_06_14_21.filtered.prune.bed`
pca_dir=$UKBB_yale/results/pca_exomes/white_expanded_06_14_21_merged_exomes

gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_merged_unrelated_pruned.sbatch
numThreads=20
mem='30G'
merged_prefix='ukb23155_unrelated_pruned_merged'

gwasqc_args="""merge_plink
    --cwd $pca_dir
    --genoFile $genoFile
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --merged_prefix $merged_prefix
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-23merged_unrelated_pruned.sbatch
INFO: Workflow farnam (ID=w008b80bc19edb218) is executed successfully with 1 completed step.



## 4. Get bed file for related individuals for exome data

Extracting the related individuals from the unmerged, per-chromosome exome data is problematic, which is why the data was merged first so that the related individuals could then be extracted.

In [None]:
##Yale's cluster
#pca_dir=$UKBB_PATH/results/pca_exomes/white_expanded_related_06_29_21
## Use non qc'ed exomes files
#genoFile=$UKBB_PATH/results/pca_exomes/merged_exomes/ukb23155_all_merged.bed
## phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
#keep_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
#keep_variants=~/scratch60/pca/white_expanded_06_29_21_merged_exomes

##Columbia's cluster
pca_dir=$UKBB_yale/results/pca_exomes/white_expanded_related_06_29_21
## Use non qc'ed exomes files
genoFile=$UKBB_yale/results/pca_exomes/merged_exomes/ukb23155_all_merged.bed
## phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
keep_samples=$UKBB_yale/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
keep_variants=~/scratch60/pca/white_expanded_06_29_21_merged_exomes

#GWAS QC variables
maf_filter=0.0
geno_filter=0.0
hwe_filter=0.0
mind_filter=0.0
gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_related_qc_merged_exomes.sbatch
numThreads=20
mem='30G'

gwasqc_args="""qc:1
    --cwd $pca_dir
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

## 4.1 Get bed file for related individuals genotype data

In [12]:
##Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray
## Use qc'ed genotype array
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
## phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
#keep_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
#keep_variants=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

## Columbia's cluster
cwd=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
## phenoFile=$UKBB_yale/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
keep_samples=$UKBB_yale/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

#GWAS QC variables
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_related_qc_genoarray.sbatch
numThreads=20
mem='30G'
merged_prefix='ukb23155_qc_related_genoarray'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --merged_prefix $merged_prefix
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-30_flashpca_white_related_qc_genoarray.sbatch
INFO: Workflow farnam (ID=w3038c5cdfd98262a) is executed successfully with 1 completed step.



## 5. Run PCA analysis for unrelated expanded white individuals with merged exome data

In [2]:
pca_dir=$UKBB_PATH/results/pca_exomes/white_expanded_06_14_21
#This is the bfile originated after filtering unrelated individuals
genoFile=$UKBB_PATH/results/pca_exomes/white_expanded_06_14_21_merged_exomes/ukb23155_unrelated_pruned_merged.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new
label_col=pop
pop_col=pop
pops=extended_white
k=10
maha_k=5
pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_unrelated.sbatch
min_axis=""
max_axis=""

pca_args="""flashpca
    --cwd $pca_dir
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-24_flashpca_white_unrelated.sbatch
INFO: Workflow farnam (ID=w3542743200dac65e) is executed successfully with 1 completed step.



## 5.1 Run PCA analysis for unrelated expanded white individuals with genotype array

In [13]:
## Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
#This is the bfile originated after filtering unrelated individuals and pruning
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new

## Columbia's cluster
cwd=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
#This is the bfile originated after filtering unrelated individuals and pruning
genoFile=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new

label_col=pop
pop_col=pop
pops=extended_white
pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_unrelated_genoarray.sbatch
k=10
maha_k=5
min_axis=""
max_axis=""

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-30_flashpca_white_unrelated_genoarray.sbatch
INFO: Workflow farnam (ID=w68e7dd36254f1a17) is executed successfully with 1 completed step.



## 6. Project the related individuals back onto the PCs (genotype array)

In [6]:
## Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected
#This is the bfile originated after filtering related individuals
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_related_06_30_21_genoarray.filtered.extracted.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new
#pca_model=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray.extended_white.pca.rds

## Columbia's cluster
cwd=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected
#This is the bfile originated after filtering related individuals
genoFile=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_related_06_30_21_genoarray.filtered.extracted.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new
pca_model=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray.extended_white.pca.rds

pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_related_genoarray_projected.sbatch
label_col=pop
pop_col=pop
pops=extended_white
k=10
maha_k=5
prob=0.997
pval=0.05
min_axis=""
max_axis=""
## Set --homogeneous TRUE to treat all the populations as one
homogeneous=TRUE
## For the plot you need to use the *.projected.rds and not the *.projected.mahalanobis.rds
#plot_data=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.rds
#outlier_file=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.outliers


pca_args="""project_samples
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --pca_model $pca_model
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --prob $prob
    --pval $pval
    --homogeneous $homogeneous
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-07-09_flashpca_white_related_genoarray_projected.sbatch
INFO: Workflow farnam (ID=w7e5b5c1fbf037041) is executed successfully with 1 completed step.
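
The `maha_k` and `prob` arguments suggest that outliers among the projected individuals are flagged by Mahalanobis distance in the space of the first `maha_k` PCs. The standard construction is written out below as an assumption about the workflow, not as a description of its code. Here $x$ is an individual's vector of the first $m$ PCs, $\mu$ and $\Sigma$ are the mean and covariance of the reference population's PCs, and $\chi^2_{m,\,p}$ is the $p$-quantile of a chi-square distribution with $m$ degrees of freedom. With the values above, $m = 5$ and $p = 0.997$. How `pval` enters is not inferred here.

```latex
D^2(x) = (x - \mu)^\top \Sigma^{-1} (x - \mu), \qquad
\text{flag } x \text{ as an outlier if } D^2(x) > \chi^2_{m,\,p}
```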



## 7. Plot the projected individuals and highlight outliers

In [5]:
##Yale's cluster
#pca_dir=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected
#This is the bfile originated after filtering related individuals
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_related_06_30_21_genoarray.filtered.extracted.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new
#pca_model=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray.extended_white.pca.rds
#plot_data=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_07_09_21_genoarray_projected.pca.projected.rds
#outlier_file=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_07_09_21_genoarray_projected.pca.projected.outliers

## Columbia's cluster
pca_dir=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected
# This is the bfile generated after filtering related individuals
genoFile=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_related_06_30_21_genoarray.filtered.extracted.bed
## I had to modify the original file to add a super_pop column and replace ethnicity with pop
phenoFile=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new
pca_model=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray.extended_white.pca.rds
plot_data=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_07_09_21_genoarray_projected.pca.projected.rds
outlier_file=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_07_09_21_genoarray_projected.pca.projected.outliers

label_col=pop
pop_col=pop
pops=extended_white
pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_related_genoarray_plot.sbatch

pca_args="""plot_pca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --plot_data $plot_data
    --outlier_file $outlier_file
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-07-09_flashpca_white_related_genoarray_plot.sbatch
INFO: Workflow farnam (ID=w77ea737f258b746f) is executed successfully with 1 completed step.



## Old run for white population (using old phenotype file)

In [19]:
pca_dir=~/scratch60/pca/white_030121_repeat
phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030121_ukb42495_exomed_white_189228ind.pheno
keep_samples=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030121_ukb42495_exomed_white_189228ind
pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_filter_white.sbatch
trait_name=ethnicity
numThreads=20

pca_args="""filter
    --cwd $pca_dir
    --bfile $bfile
    --genoFile $genoFile
    --phenoFile $phenoFile
    --keep_samples $keep_samples
    --k $k
    --window $window
    --shift $shift
    --r2 $r2
    --trait_name $trait_name
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-03-01_flashpca_filter_white.sbatch
INFO: Workflow farnam (ID=wdddf2ddca776f9f1) is executed successfully with 1 completed step.



### 1. African ancestry

In [5]:
pca_dir=~/scratch60/pca/african_ancestry
ethnia_prefix='african_3690ind'
phenoFile=~/scratch60/pca/african_ancestry/cache/ukb23155_s200631.african_3690ind.pheno
pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_african.sbatch
trait_name=ethnicity

pca_args="""flashpca
    --cwd $pca_dir
    --bfile $bfile
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maxiter $maxiter
    --topk $topk
    --sigma $sigma
    --window $window
    --shift $shift
    --r2 $r2
    --trait_name $trait_name
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-02-22_flashpca_african.sbatch
INFO: Workflow farnam (ID=wd99c56feacb4bc44) is executed successfully with 1 completed step.



### 2. Asian ancestry

In [7]:
pca_dir=~/scratch60/pca/asian_ancestry
ethnia_prefix='asian_4618ind'
phenoFile=~/scratch60/pca/asian_ancestry/cache/ukb23155_s200631.asian_4618ind.pheno
pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_asian.sbatch
trait_name=ethnicity

pca_args="""flashpca
    --cwd $pca_dir 
    --genoFile $genoFile
    --famFile $famFile
    --database $database
    --ethnia_prefix $ethnia_prefix
    --select_ethnia $select_ethnia
    --phenoFile $phenoFile
    --k $k
    --maxiter $maxiter
    --topk $topk
    --sigma $sigma
    --window $window
    --shift $shift
    --r2 $r2
    --trait_name $trait_name
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-02-22_flashpca_asian.sbatch
INFO: Workflow farnam (ID=wd32c86713853144c) is executed successfully with 1 completed step.



In [15]:
bfile=$UKBB_PATH/MWE/genotypes21_22.bed
genoFile=$UKBB_PATH/MWE/burden/ukb23155_c2*_b0_v1.plink.exome.filtered.bed
keep_samples=$UKBB_PATH/MWE/burden/unrelated_ind_burden.txt
phenoFile=$UKBB_PATH/MWE/burden/phenotype_burden.txt
kinship=0.05
maf_filter=0.01
geno_filter=0.1
mind_filter=0.2 
hwe_filter=5e-08 
numThreads=2
k=10
trait_name='ASTHMA'
sos run ~/project/bioworkflows/GWAS/PCA.ipynb flashpca:1\
    --cwd $pca_dir \
    --bfile $bfile \
    --genoFile $genoFile \
    --keep_samples $keep_samples \
    --kinship $kinship \
    --phenoFile $phenoFile \
    --window $window \
    --shift $shift \
    --r2 $r2 \
    --maf_filter $maf_filter\
    --geno_filter $geno_filter\
    --mind_filter $mind_filter \
    --hwe_filter $hwe_filter \
    --k $k\
    --trait_name $trait_name \
    --numThreads $numThreads \
    --job_size $job_size \
    --container_lmm $container_lmm

INFO: Running flashpca_1: Run PCA analysis using flashpca
ERROR: flashpca_1 (id=40e0bfbf8b836d3a) returns an error.
ERROR: [flashpca_1]: [0]: 
Failed to execute Rscript /home/dc2325/.sos/af137bdf2659ed53/flashpca_1_0_fe1199ce.R
exitcode=1, workdir=/gpfs/ysm/project/dewan/dc2325/UKBB_GWAS_dev/analysis/cluster_scripts, stdout=/home/dc2325/scratch60/pca/phenotype_burden.filtered.merged.prune.stdout, stderr=/home/dc2325/scratch60/pca/phenotype_burden.filtered.merged.prune.stderr
---------------------------------------------------------------------------



In [None]:
tpl_file=../farnam.yml
pca_dir=$UKBB_PATH/results/pca_exomes
famFile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
bedfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
bimfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/UKBexomeOQFE_chr{1..22}.bim`
database=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb42495_updatedJune2020/ukb42495.tab
# Container
container_lmm=$UKBB_PATH/lmm.sif
# Pipeline
pca_sos=~/project/UKBB_GWAS_dev/PCA.ipynb
# Name of bash script
pca_sbatch=../output/$(date +"%Y-%m-%d")_pca_white.sbatch
numThreads=1
job_size=1
#PCA variables
k=10
maxiter=5
topk=10
sigma=6
window=50
shift=5
r2=0.5
stand="binom2"
maf_filter=0.01
geno_filter=0.01
mind_filter=0.02

sos run ~/project/UKBB_GWAS_dev/PCA.ipynb smartpca \
    --cwd $pca_dir \
    --bedfiles $bedfiles \
    --bimfiles $bimfiles \
    --famFile $famFile \
    --database $database \
    --k $k \
    --stand $stand \
    --maxiter $maxiter \
    --topk $topk \
    --sigma $sigma \
    --window $window \
    --shift $shift \
    --r2 $r2 \
    --maf_filter $maf_filter\
    --geno_filter $geno_filter\
    --mind_filter $mind_filter \
    --numThreads $numThreads \
    --job_size $job_size \
    --container_lmm $container_lmm \
    -s build
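The `bedfiles` and `bimfiles` variables above rely on bash brace expansion (`{1..22}`) to list one file per autosome. The following minimal sketch shows what that expansion produces; `demo_dir` is a hypothetical path and no files need to exist for the expansion to work.

```shell
#!/usr/bin/env bash
# Hypothetical directory standing in for the exome data folder.
demo_dir=/tmp/exome_demo

# Brace expansion generates one path per chromosome (1..22); echo joins them
# with single spaces, so the variable holds one space-separated list.
bedfiles=$(echo $demo_dir/ukb23155_c{1..22}_b0_v1.bed)

# Split the list into an array to count and inspect the entries.
files=($bedfiles)
n_files=${#files[@]}
first_file=${files[0]}
last_file=${files[$((n_files - 1))]}

echo "$n_files files, from $first_file to $last_file"
```

Because brace expansion is purely textual, it does not check that the files exist; a missing chromosome file would only surface when the workflow runs.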

In [None]:
    smartpca.perl \
    -i example.geno \
    -a example.snp \
    -b example.ind \
    -k 2 \
    -o example.pca \
    -p example.plot \
    -e example.eval \
    -l example.log \
    -m 5 \
    -t 2 \
    -s 6.0

In [None]:
par.PACKEDPED.EIGENSTRAT
genotypename:    ukb23155_s200631.filtered.merged.bed
snpname:         ukb23155_s200631.filtered.merged.bim
indivname:       ukb23155_s200631.filtered.merged.fam
outputformat:    EIGENSTRAT
genotypeoutname: ukb23155_s200631.filtered.merged.eigenstratgeno
snpoutname:      ukb23155_s200631.filtered.merged.snp
indivoutname:    ukb23155_s200631.filtered.merged.ind
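The parameter file above is the kind of input that EIGENSOFT's `convertf` reads through its `-p` option. As a rough sketch, it could be written from a single file prefix with a heredoc; `prefix` and `par_file` are names chosen for this illustration, and the `convertf` call is left commented out because it needs EIGENSOFT to be loaded.

```shell
#!/usr/bin/env bash
# File prefix taken from the parameter file shown above.
prefix=ukb23155_s200631.filtered.merged

# Write the parameter file to a temporary location (hypothetical for this demo).
par_file=$(mktemp)
cat > "$par_file" <<EOF
genotypename:    $prefix.bed
snpname:         $prefix.bim
indivname:       $prefix.fam
outputformat:    EIGENSTRAT
genotypeoutname: $prefix.eigenstratgeno
snpoutname:      $prefix.snp
indivoutname:    $prefix.ind
EOF

cat "$par_file"

# With EIGENSOFT available, the conversion would then be run as:
# convertf -p "$par_file"
```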

In [None]:
#!/bin/bash
#SBATCH --partition general
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 60G
#SBATCH --time 5-0:00:00
#SBATCH --job-name ../output/2020-12-01_pca_white
#SBATCH --output ../output/2020-12-01_pca_white-%J.out
#SBATCH --error ../output/2020-12-01_pca_white-%J.log
module load EIGENSOFT/7.2.1-foss-2018b
smartpca.perl -i ukb23155_s200631.filtered.merged.bed -a ukb23155_s200631.filtered.merged.pedsnp -b ukb23155_s200631.filtered.merged.pedind -o ukb23155_s200631.filtered.merged.pca -p ukb23155_s200631.filtered.merged.plot -e ukb23155_s200631.filtered.eval -l ukb23155_s200631.filtered.merged.log

# Running plink missing pipeline

In [36]:
tpl_file=../farnam.yml
pca_dir=$UKBB_PATH/results/pca_exomes
famFile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
bedfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
bimfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/UKBexomeOQFE_chr{1..22}.bim`
# Container
container_lmm=$UKBB_PATH/lmm.sif
container_marp=$UKBB_PATH/marp.sif
# Pipeline
plink_sos=~/project/UKBB_GWAS_dev/workflow/plink_missing.ipynb
# Name of bash script
pca_sbatch=../output/$(date +"%Y-%m-%d")_plink_miss.sbatch
numThreads=1
job_size=1


sos run ~/project/UKBB_GWAS_dev/workflow/plink_missing.ipynb missing\
    --cwd $pca_dir \
    --bedfiles $bedfiles \
    --bimfiles $bimfiles \
    --famFile $famFile \
    --numThreads $numThreads \
    --job_size $job_size \
    --container_lmm $container_lmm \
    --container_marp $container_marp \
    -s build

INFO: Running missing_1: Genotype and sample missingness for exome files
INFO: Step missing_1 (index=0) is ignored with signature constructed
INFO: Step missing_1 (index=1) is ignored with signature constructed
INFO: Step missing_1 (index=2) is ignored with signature constructed
INFO: Step missing_1 (index=3) is ignored with signature constructed
INFO: Step missing_1 (index=4) is ignored with signature constructed
INFO: Step missing_1 (index=5) is ignored with signature constructed
INFO: Step missing_1 (index=6) is ignored with signature constructed
INFO: Step missing_1 (index=7) is ignored with signature constructed
INFO: Step missing_1 (index=8) is ignored with signature constructed
INFO: Step missing_1 (index=9) is ignored with signature constructed
INFO: Step missing_1 (index=10) is

## Extracting individuals for a particular SNP with PLINK

In [37]:
tpl_file=../farnam.yml
pca_dir=/home/dc2325/scratch60/plink_extract
famFile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
bedfiles=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c12_b0_v1.bed
bimfiles=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/UKBexomeOQFE_chr12.bim
snp_list=/home/dc2325/scratch60/plink_extract/snp.txt
# Container
container_lmm=$UKBB_PATH/lmm.sif
container_marp=$UKBB_PATH/marp.sif
# Pipeline
plink_sos=~/project/UKBB_GWAS_dev/plink_extract.ipynb
# Name of bash script
pca_sbatch=../output/$(date +"%Y-%m-%d")_plink_miss.sbatch
numThreads=1
job_size=1


sos run ~/project/UKBB_GWAS_dev/plink_extract.ipynb  \
    --cwd $pca_dir \
    --bedfiles $bedfiles \
    --bimfiles $bimfiles \
    --famFile $famFile \
    --snp_list $snp_list \
    --numThreads $numThreads \
    --job_size $job_size \
    --container_lmm $container_lmm \
    --container_marp $container_marp

ERROR: Failed to locate /home/dc2325/project/UKBB_GWAS_dev/plink_extract.ipynb.sos
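The error above indicates that the workflow notebook was not found at the given path. One way to catch this earlier is to test for the file before calling `sos run`. The helper below is a minimal sketch; `check_workflow` is a hypothetical name, and the demo uses temporary files rather than the real notebook.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if the notebook exists, otherwise report and return 1.
check_workflow() {
    local notebook=$1
    if [ -f "$notebook" ]; then
        return 0
    fi
    echo "Workflow notebook not found: $notebook" >&2
    return 1
}

# Demo with a temporary file standing in for a real notebook.
demo_dir=$(mktemp -d)
touch "$demo_dir/present.ipynb"

check_workflow "$demo_dir/present.ipynb" && found_status=ok
check_workflow "$demo_dir/missing.ipynb" 2>/dev/null || missing_status=missing

echo "present: $found_status, absent: $missing_status"
```

In a cell, the same idea could be written as `check_workflow "$plink_sos" && sos run "$plink_sos" ...`, so the run is skipped when the path is wrong.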



## Regenie burden

In [23]:
genoFile=`echo $UKBB_PATH/MWE/burden/ukb23155_c{21..22}_b0_v1.plink.exome.filtered.bed`
sos dryrun ~/project/bioworkflows/GWAS/LMM.ipynb regenie_burden \
    --cwd output \
    --bfile genotypes21_22.bed \
    --genoFile $genoFile \
    --sampleFile \
    --phenoFile burden/phenotype_burden.txt\
    --phenoCol ASTHMA\
    --covarCol SEX \
    --qCovarCol AGE \
    --numThreads 8 \
    --bsize 10 \
    --anno_file burden/annotation_file.txt\
    --set_list burden/set_list_file.txt \
    --mask_file burden/mask_file.txt \
    --keep_gene burden/keep_file.txt\
    --aaf_bins 0.05 \
    --trait bt \
    --build_mask max \
    --container_lmm $UKBB_PATH/lmm.sif

INFO: Checking regenie_burden: Run regenie for burden tests
HINT: singularity exec  /gpfs/gibbs/pi/dewan/data/UKBiobank/lmm.sif /bin/bash /gpfs/ysm/project/dewan/dc2325/UKBB_GWAS_dev/analysis/tmph24c8d72/singularity_run_193934.sh
set -e
regenie \
  --step 2 \
  --bed /gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/burden/ukb23155_c21_b0_v1.plink.exome.filtered \
  --phenoFile output/phenotype_burden.regenie_phenotype \
  --covarFile output/phenotype_burden.regenie_covar \
  --phenoColList ASTHMA \
  --bt
  --firth --approx \
  --pred output/phenotype_burden_ASTHMA.regenie_pred.list \
  --anno-file burden/annotation_file.txt \
  --set-list burden/set_list_file.txt \
  --extract-sets burden/keep_file.txt\
  --mask-def burden/mask_file.txt \
  --aaf-bins 0.05 \
  --write-mask \
  --build-mask \
  --bsize 10 \
  --check-burden-files \
  --gz \
  --out  output/cache/ukb23155_c21_b0_v1.plink.exome.filtered.burden


INFO: regenie_burden (index=0) is completed.
HINT: singula

## PCA to calculate PC scores per hearing impairment trait

### f.3393

#### Step 1. Produce the genotype file keeping specific samples and variants

In [9]:
##Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray
## Use qc'ed genotype array
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
## phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
#keep_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
#keep_variants=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/f3393
gwas_sbatch=$OUT_PATH/qc1_f3393_genoarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Hearing_aid_f3393_expandedwhite_6305cases_98082ctrl.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

# GWAS QC variables: set all of these to 0 to avoid any further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca/qc1_f3393_genoarray_2021-08-06.sbatch
INFO: Workflow csg (ID=we5e680985396158d) is executed successfully with 1 completed step.
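Step 2 reads the PLINK files that Step 1 writes into the `cache` folder, so it can help to confirm that the `.bed`, `.bim` and `.fam` files are all present before generating the next script. The sketch below assumes the three files share one prefix; `has_plink_trio` is a hypothetical helper and the demo runs on placeholder files in a temporary directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the .bed/.bim/.fam trio exists and is non-empty.
has_plink_trio() {
    local prefix=${1%.bed}   # accept either a prefix or a path ending in .bed
    local ext
    for ext in bed bim fam; do
        if [ ! -s "$prefix.$ext" ]; then
            echo "Missing or empty: $prefix.$ext" >&2
            return 1
        fi
    done
    return 0
}

# Demo in a temporary directory standing in for the cache folder.
demo_cache=$(mktemp -d)
for ext in bed bim fam; do
    echo "placeholder" > "$demo_cache/complete.$ext"
done
echo "placeholder" > "$demo_cache/partial.bed"

has_plink_trio "$demo_cache/complete.bed" && complete_status=ready
has_plink_trio "$demo_cache/partial.bed" 2>/dev/null || partial_status=incomplete

echo "complete: $complete_status, partial: $partial_status"
```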


#### Step 2: Get the PCs

In [11]:
## Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
#This is the bfile originated after filtering unrelated individuals and pruning
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new

## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/f3393
# This is the bfile generated by qc:1 in Step 1, ending in *.filtered.extracted.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/f3393/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.f3393.filtered.extracted.bed
# Format FID, IID, pop, superpop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Hearing_aid_f3393_expandedwhite_6305cases_98082ctrl.phenopca

label_col=pop
pop_col=pop
pops=extended_white
pca_sbatch=$OUT_PATH/flashpca_f3393_genoarray_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca/flashpca_f3393_genoarray_2021-08-07.sbatch
INFO: Workflow csg (ID=w1101c0cce37b86f9) is executed successfully with 1 completed step.
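The `phenoFile` used for PCA is described above as having the columns FID, IID, pop and superpop. As an illustration only, such a file could be derived from a two-column sample list with `awk`. The sketch assumes a headerless input with FID and IID per line, a header row named `FID IID pop super_pop`, and the same placeholder label for both population columns; none of these details are confirmed by the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical demo files in a temporary directory.
demo_dir=$(mktemp -d)
keep_id=$demo_dir/demo.keep_id
phenopca=$demo_dir/demo.phenopca

# Assumed input format: one "FID IID" pair per line, no header.
printf '1001 1001\n1002 1002\n1003 1003\n' > "$keep_id"

# Add a header and append the population labels to every sample.
awk -v pop=extended_white -v super_pop=extended_white '
    BEGIN { print "FID", "IID", "pop", "super_pop" }
    { print $1, $2, pop, super_pop }
' "$keep_id" > "$phenopca"

cat "$phenopca"
```

The value passed as `pop` matches `pops=extended_white` in the cell above, which is what the `--pops` option selects on.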


### f.3393 50K individuals

#### Step 1

In [6]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f3393_50K
gwas_sbatch=$OUT_PATH/qc1_f3393_genoarray_50K$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=~/UKBiobank/phenotype_files/hearing_impairment/080421_UKBB_Hearing_aid_f3393_expandedwhite_24496ind_50Kexomes.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

# GWAS QC variables: set all of these to 0 to avoid any further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f3393_genoarray_50K2021-09-07.sbatch
INFO: Workflow csg (ID=w189c76f0d4f4d0cb) is executed successfully with 1 completed step.
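The script names in this notebook join a prefix with `$(date +"%Y-%m-%d")`, and the output above (`..._50K2021-09-07.sbatch`) shows how easily the separating underscore can be dropped. A small helper can keep the pattern uniform. This is only a sketch; `stamped_name` and its optional third argument are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build <out_dir>/<prefix>_<YYYY-MM-DD>.sbatch.
# The date defaults to today; passing one explicitly makes the name reproducible.
stamped_name() {
    local out_dir=$1
    local prefix=$2
    local stamp=${3:-$(date +"%Y-%m-%d")}
    echo "$out_dir/${prefix}_${stamp}.sbatch"
}

# Fixed date, reproducing the run shown above but with the underscore in place.
fixed_name=$(stamped_name output qc1_f3393_genoarray_50K 2021-09-07)
# Default: today's date.
today_name=$(stamped_name output qc1_f3393_genoarray_50K)

echo "$fixed_name"
echo "$today_name"
```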



#### Step 2. 

In [17]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f3393_50K
# This is the bfile generated by qc:1 in Step 1, ending in *.filtered.extracted.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f3393_50K/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.090721_f3393_50K.filtered.extracted.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Hearing_aid_f3393_expandedwhite_24496ind_50K.phenopca

label_col=pop
pop_col=pop
pca_sbatch=$OUT_PATH/flashpca_f3393_genoarray_50K_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f3393_genoarray_50K_2021-09-07.sbatch
INFO: Workflow csg (ID=w180994a972fcdfac) is executed successfully with 1 completed step.



### f.3393 150K individuals

#### Step 1. 

In [5]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f3393_150K
gwas_sbatch=$OUT_PATH/qc1_f3393_genoarray_150K$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=~/UKBiobank/phenotype_files/hearing_impairment/080421_UKBB_Hearing_aid_f3393_expandedwhite_79891ind_150Kexomes.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

# GWAS QC variables: set all of these to 0 to avoid any further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f3393_genoarray_2021-09-07.sbatch
INFO: Workflow csg (ID=wd20db76a4d467345) is executed successfully with 1 completed step.



#### Step 2

In [18]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f3393_150K
# This is the bfile generated by qc:1 in Step 1, ending in *.filtered.extracted.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f3393_150K/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.090721_f3393_150K.filtered.extracted.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Hearing_aid_f3393_expandedwhite_79891ind_150K.phenopca

label_col=pop
pop_col=pop
pca_sbatch=$OUT_PATH/flashpca_f3393_genoarray_150K_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f3393_genoarray_150K_2021-09-07.sbatch
INFO: Workflow csg (ID=wc70a3d6332b7a074) is executed successfully with 1 completed step.



### f.2247

#### Step 1. Produce the genotype file keeping specific samples and variants

In [13]:
##Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray
## Use qc'ed genotype array
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
## phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
#keep_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
#keep_variants=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/f2247
gwas_sbatch=$OUT_PATH/qc1_f2247_genoarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_46237cases_98082ctrl.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

# GWAS QC variables: set all of these to 0 to avoid any further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca/qc1_f2247_genoarray_2021-08-06.sbatch
INFO: Workflow csg (ID=wba0b0069c01c6c33) is executed successfully with 1 completed step.


#### Step 2: Get the PCs

In [5]:
## Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
#This is the bfile originated after filtering unrelated individuals and pruning
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new

## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/f2247
# This is the bfile generated by qc:1 in Step 1, ending in *.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/f2247/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.f2247.filtered.extracted.bed
# Format FID, IID, pop, superpop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_46237cases_98082ctrl.phenopca

label_col=pop
pop_col=pop
pops=extended_white
pca_sbatch=$OUT_PATH/flashpca_f2247_genoarray_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca/flashpca_f2247_genoarray_2021-08-06.sbatch
INFO: Workflow csg (ID=wc48eaca4d58fc942) is executed successfully with 1 completed step.


### f.2247 50K

#### Step 1

In [7]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2247_50K
gwas_sbatch=$OUT_PATH/qc1_f2247_genoarray_50K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=~/UKBiobank/phenotype_files/hearing_impairment/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_35147ind_50Kexomes.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

# GWAS QC variables: set all of these to 0 to avoid any further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f2247_genoarray_50K_2021-09-07.sbatch
INFO: Workflow csg (ID=w73f26b4d9ef6b15d) is executed successfully with 1 completed step.
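
Before submitting the generated script it can help to confirm that the three input paths set in the Step 1 cell (`genoFile`, `keep_samples`, `keep_variants`) resolve on the cluster. The following is a minimal sketch, assuming a hypothetical helper named `check_inputs` that is not part of `Get_Job_Script.ipynb`:

```shell
# Hypothetical helper (not part of the pipeline): report every path that
# does not exist and return non-zero if any is missing.
check_inputs() {
    _missing=0
    for _f in "$@"; do
        if [ ! -e "$_f" ]; then
            echo "missing: $_f" >&2
            _missing=1
        fi
    done
    return "$_missing"
}

# Example with the variables defined in the Step 1 cell (left commented
# because the paths only exist on the cluster):
# check_inputs "$genoFile" "$keep_samples" "$keep_variants" || echo "check paths first"
```

The helper only tests for existence; it does not validate the contents of the PLINK or ID files.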



#### Step 2

In [20]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2247_50K
#This is the bfile produced by qc1 (Step 1), ending in *.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2247_50K/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.090721_f2247_50K.filtered.extracted.bed
# Format FID, IID, pop, superpop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_35147ind_50K.phenopca

label_col=pop
pop_col=pop
pca_sbatch=$OUT_PATH/flashpca_f2247_genoarray_50K_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f2247_genoarray_50K_2021-09-07.sbatch
INFO: Workflow csg (ID=w40eb593ada863c75) is executed successfully with 1 completed step.
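
The Step 2 `genoFile` paths in this notebook follow a visible pattern: the Step 1 `cwd`, then `cache/`, then the Step 1 genotype file name without `.bed`, the base name of `cwd`, and the suffix `.filtered.extracted.bed`. A sketch that builds this path instead of typing it by hand is shown here; the function name `qc1_output_bed` is hypothetical, and the pattern is inferred from the paths in this notebook rather than from the workflow's documentation.

```shell
# Hypothetical helper: print the expected qc:1 output .bed path.
#   $1 = Step 1 cwd, $2 = Step 1 genoFile (ending in .bed)
qc1_output_bed() {
    printf '%s/cache/%s.%s.filtered.extracted.bed\n' \
        "$1" "$(basename "$2" .bed)" "$(basename "$1")"
}

# Illustrative call with short made-up paths:
qc1_output_bed /scratch/090721_f2247_50K /data/a.filtered.bed
# prints /scratch/090721_f2247_50K/cache/a.filtered.090721_f2247_50K.filtered.extracted.bed
```

In the notebook this could be used as `genoFile=$(qc1_output_bed "$cwd" "$step1_genoFile")`, where `step1_genoFile` is an assumed variable holding the Step 1 input.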



### f.2247 150K

#### Step 1

In [14]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2247_150K
gwas_sbatch=$OUT_PATH/qc1_f2247_genoarray_150k_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=~/UKBiobank/phenotype_files/hearing_impairment/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_109172ind_150Kexomes.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f2247_genoarray_150k_2021-09-07.sbatch
INFO: Workflow csg (ID=wbabb09f1af66670f) is executed successfully with 1 completed step.



#### Step 2

In [21]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2247_150K
#This is the bfile produced by qc1 (Step 1), ending in *.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2247_150K/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.090721_f2247_150K.filtered.extracted.bed
# Format FID, IID, pop, superpop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_109172ind_150K.phenopca

label_col=pop
pop_col=pop
pca_sbatch=$OUT_PATH/flashpca_f2247_genoarray_150K_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f2247_genoarray_150K_2021-09-07.sbatch
INFO: Workflow csg (ID=w24fc03e66d394141) is executed successfully with 1 completed step.
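
The Step 2 cells pass `--label_col pop` and `--pop_col pop`, and the comments describe the `.phenopca` file as `FID, IID, pop, superpop`. A quick check that the header really contains the requested column can catch a mismatch before the job is queued. This is a sketch only: it assumes the file is whitespace-delimited with a header on the first line, and `has_column` is a hypothetical name.

```shell
# Hypothetical helper: succeed if the first line of a whitespace-delimited
# file contains a field equal to the given column name.
#   $1 = file, $2 = column name
has_column() {
    awk -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) found = 1; exit }
        END     { exit !found }
    ' "$1"
}

# Example with the notebook variables (commented because the file lives on the cluster):
# has_column "$phenoFile" "$pop_col" || echo "column $pop_col not found in $phenoFile"
```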



### f.2257

#### Step 1. Produce the genotype file keeping specific samples and variants

In [19]:
##Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray
## Use qc'ed genotype array
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
## phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
#keep_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
#keep_variants=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/f2257
gwas_sbatch=$OUT_PATH/qc1_f2257_genoarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Hearing_noise_f2257_expandedwhite_66656cases_98082ctrl.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca/qc1_f2257_genoarray_2021-08-06.sbatch
INFO: Workflow csg (ID=w6621ca45cbe5b97e) is executed successfully with 1 completed step.


#### Step 2: Get the PCs

In [14]:
## Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
#This is the bfile originated after filtering unrelated individuals and pruning
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.bed
## I had to modify the original file to add a super_pop column and replace the ethnicity column with pop
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new

## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/f2257
#This is the bfile produced by qc:1 in Step 1, ending in *.filtered.extracted.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/f2257/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.f2257.filtered.extracted.bed
# Format FID, IID, pop, superpop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Hearing_noise_f2257_expandedwhite_66656cases_98082ctrl.phenopca

label_col=pop
pop_col=pop
pops=extended_white
pca_sbatch=$OUT_PATH/flashpca_f2257_genoarray_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca/flashpca_f2257_genoarray_2021-08-10.sbatch
INFO: Workflow csg (ID=w0beb6d6cde13fddd) is executed successfully with 1 completed step.


### f.2257 50K

#### Step 1

In [9]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2257_50k
gwas_sbatch=$OUT_PATH/qc1_f2257_genoarray_50K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=~/UKBiobank/phenotype_files/hearing_impairment/080421_UKBB_Hearing_noise_f2257_expandedwhite_39344ind_50Kexomes.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f2257_genoarray_50K_2021-09-07.sbatch
INFO: Workflow csg (ID=wf7a442071013d36e) is executed successfully with 1 completed step.
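
The `keep_samples` file names in the 50K and 150K cells carry a sample count (for example `39344ind`). A sanity check is to compare that number with the number of lines in the file. The sketch below assumes that the `NNNNNind` token is the intended sample count and that the `.keep_id` file has one sample per line with no header; neither is confirmed in this notebook. Both helper names are hypothetical.

```shell
# Hypothetical helper: extract the number that precedes "ind_" in a file name.
count_from_name() {
    basename "$1" | sed -n 's/.*_\([0-9][0-9]*\)ind_.*/\1/p'
}

# Hypothetical helper: number of lines in a file, without padding.
line_count() {
    wc -l < "$1" | tr -d ' '
}

# Example with the notebook variable (commented because the file lives on the cluster):
# [ "$(line_count "$keep_samples")" = "$(count_from_name "$keep_samples")" ] \
#     || echo "line count differs from the count in the file name"
```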



#### Step 2

In [4]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2257_50k
#This is the bfile produced by qc:1 in Step 1, ending in *.filtered.extracted.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2257_50k/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.090721_f2257_50k.filtered.extracted.bed
# Format FID, IID, pop, superpop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Hearing_noise_f2257_expandedwhite_39344ind_50K.phenopca

label_col=pop
pop_col=pop
pca_sbatch=$OUT_PATH/flashpca_f2257_genoarray_50K_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f2257_genoarray_50K_2021-09-07.sbatch
INFO: Workflow csg (ID=wf2ebca4ad92fb2e5) is executed successfully with 1 completed step.



### f.2257 150K

#### Step 1

In [11]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2257_150k
gwas_sbatch=$OUT_PATH/qc1_f2257_genoarray_150K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=~/UKBiobank/phenotype_files/hearing_impairment/080421_UKBB_Hearing_noise_f2257_expandedwhite_125394ind_150Kexomes.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg (index=0) is ignored due to saved signature
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f2257_genoarray_150K_2021-09-07.sbatch
INFO: Workflow csg (ID=w16a91f2ae83f04f3) is ignored with 1 ignored step.
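
The log above reports that the step was ignored due to a saved signature, meaning SoS found a matching record of an earlier run and did not regenerate the script. If the cell was edited and a fresh script is wanted, it helps to detect this case. The sketch below scans captured `sos run` output for that message; `was_ignored` is a hypothetical helper. To my understanding `sos run` accepts `-s force` to rerun regardless of signatures, but that should be confirmed against `sos run -h` on the cluster.

```shell
# Hypothetical helper: succeed if a captured sos log reports a step that
# was skipped because of a saved signature.
was_ignored() {
    grep -q 'ignored.*due to saved signature' "$1"
}

# Example (commented; requires sos and the notebook variables):
# sos run "$USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb" csg \
#     --template-file "$tpl_file" --workflow-file "$gwasqc_sos" \
#     --to-script "$gwas_sbatch" --args "$gwasqc_args" 2>&1 | tee sos.log
# was_ignored sos.log && echo "script not regenerated; consider rerunning with -s force"
```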



#### Step 2

In [5]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2257_150k
#This is the bfile produced by qc:1 in Step 1, ending in *.filtered.extracted.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2257_150k/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.090721_f2257_150k.filtered.extracted.bed
# Format FID, IID, pop, superpop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Hearing_noise_f2257_expandedwhite_125394ind_150K.phenopca

label_col=pop
pop_col=pop
pca_sbatch=$OUT_PATH/flashpca_f2257_genoarray_150K_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f2257_genoarray_150K_2021-09-07.sbatch
INFO: Workflow csg (ID=w2869a34a897f09f1) is executed successfully with 1 completed step.



### Combined f.2247 and f.2257

#### Step 1. Produce the genotype file keeping specific samples and variants

In [26]:
##Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray
## Use qc'ed genotype array
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
## phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
#keep_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
#keep_variants=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/f2247_f2257
gwas_sbatch=$OUT_PATH/qc1_f2247_f2257_genoarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Combined_f2247_f2257_expandedwhite_39049cases_98082ctrl.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca/qc1_f2247_f2257_genoarray_2021-08-06.sbatch
INFO: Workflow csg (ID=w33ca191312732348) is executed successfully with 1 completed step.


#### Step 2: Get the PCs

In [None]:
## Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
#This is the bfile originated after filtering unrelated individuals and pruning
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.bed
## I had to modify the original file to add a super_pop column and replace the ethnicity column with pop
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new

## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/f2247_f2257
#This is the bfile produced by qc:1 in Step 1, ending in *.filtered.extracted.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/f2247_f2257/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.f2247_f2257.filtered.extracted.bed
# Format FID, IID, pop, superpop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Combined_f2247_f2257_expandedwhite_39049cases_98082ctrl.phenopca

label_col=pop
pop_col=pop
pops=extended_white
pca_sbatch=$OUT_PATH/flashpca_f2247_f2257_genoarray_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

[MetaKernelApp] ERROR | KeyboardInterrupt caught in kernel.


### Combined 50K

#### Step 1

In [12]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2247_f2257_50K
gwas_sbatch=$OUT_PATH/qc1_f2247_f2257_genoarray_50K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=~/UKBiobank/phenotype_files/hearing_impairment/080421_UKBB_Combined_f2247_f2257_expandedwhite_33399ind_50Kexomes.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f2247_f2257_genoarray_50K_2021-09-07.sbatch
INFO: Workflow csg (ID=wb8829af0783f6ffe) is executed successfully with 1 completed step.



#### Step 2

In [5]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2247_f2257_50K
#This is the bfile produced by qc:1 in Step 1, ending in *.filtered.extracted.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2247_f2257_50K/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.090721_f2247_f2257_50K.filtered.extracted.bed
# Format FID, IID, pop, superpop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Combined_f2247_f2257_expandedwhite_33399ind_50K.phenopca

label_col=pop
pop_col=pop
pca_sbatch=$OUT_PATH/flashpca_f2247_f2257_genoarray_50K_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f2247_f2257_genoarray_50K_2021-09-07.sbatch
INFO: Workflow csg (ID=w28b3f40fb289286a) is executed successfully with 1 completed step.



### Combined 150K

#### Step 1

In [7]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2247_f2257_150K
gwas_sbatch=$OUT_PATH/qc1_f2247_f2257_genoarray_150K$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=~/UKBiobank/phenotype_files/hearing_impairment/080421_UKBB_Combined_f2247_f2257_expandedwhite_103732ind_150Kexomes.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg (index=0) is ignored due to saved signature
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f2247_f2257_genoarray_150K2021-09-07.sbatch
INFO: Workflow csg (ID=w17799829d36f4355) is ignored with 1 ignored step.
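
Note that the `gwas_sbatch` name in this cell joins `150K` and the date without a separator, which is why the logged output ends in `150K2021-09-07.sbatch` while the other scripts use `_` before the date. One way to keep the names uniform is to build them in a single place. This is a sketch with a hypothetical helper name, `stamped_sbatch`:

```shell
# Hypothetical helper: append "_YYYY-MM-DD.sbatch" to a path prefix so the
# separator before the date is never forgotten.
stamped_sbatch() {
    printf '%s_%s.sbatch\n' "$1" "$(date +%Y-%m-%d)"
}

# Example with the notebook variable (commented because OUT_PATH is set elsewhere):
# gwas_sbatch=$(stamped_sbatch "$OUT_PATH/qc1_f2247_f2257_genoarray_150K")
```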



#### Step 2

In [6]:
## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2247_f2257_150K
#This is the bfile produced by qc:1 in Step 1, ending in *.filtered.extracted.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/090721_f2247_f2257_150K/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.090721_f2247_f2257_150K.filtered.extracted.bed
# Format FID, IID, pop, superpop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Combined_f2247_f2257_expandedwhite_103732ind_150K.phenopca

label_col=pop
pop_col=pop
pca_sbatch=$OUT_PATH/flashpca_f2247_f2257_genoarray_150K_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f2247_f2257_genoarray_150K_2021-09-07.sbatch
INFO: Workflow csg (ID=w319b847dbf4637b7) is executed successfully with 1 completed step.



### Mendelian

#### Step 1. Produce the genotype file keeping specific samples and variants

In [27]:
##Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray
## Use qc'ed genotype array
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
## phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
#keep_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
#keep_variants=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/mendelian
gwas_sbatch=$OUT_PATH/qc1_mendelian_genoarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
keep_samples=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Mendelian_expandedwhite_1520cases_98082ctrl.keep_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca/qc1_mendelian_genoarray_2021-08-06.sbatch
INFO: Workflow csg (ID=wd82d336f7549f02a) is executed successfully with 1 completed step.


#### Step 2: Get the PCs

In [7]:
## Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
#This is the bfile originated after filtering unrelated individuals and pruning
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.bed
## I had to modify the original file to add a super_pop column and replace the ethnicity column with pop
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new

## Columbia's cluster
cwd=$UKBB_PATH/results/080621_pca_genoarray_HI/mendelian
#This is the bfile produced by qc:1 in Step 1, ending in *.filtered.extracted.bed
genoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/mendelian/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.mendelian.filtered.extracted.bed
# Format FID, IID, pop, superpop
phenoFile=$UKBB_PATH/results/080621_pca_genoarray_HI/080421_UKBB_Mendelian_expandedwhite_1520cases_98082ctrl.phenopca

label_col=pop
pop_col=pop
pops=extended_white
pca_sbatch=$OUT_PATH/flashpca_mendelian_genoarray_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=
max_axis=

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca/flashpca_mendelian_genoarray_2021-08-06.sbatch
INFO: Workflow csg (ID=w9670c3d5696c82f5) is executed successfully with 1 completed step.


## 08-30-21 Run with QC'ed genotype array data

### Merge QC'ed exome bed files to assess sample missingness

In [5]:
## Columbia's cluster
cwd=~/UKBiobank/data/exome_files/project_VCF/072721_run/merged_plink
genoFile=`echo ~/UKBiobank/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
gwas_sbatch=$OUT_PATH/merged_qc_exomes_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='60G'
walltime='60h'
merged_prefix='ukb23155_qc_merged'

gwasqc_args="""merge_plink
    --cwd $cwd
    --genoFile $genoFile
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --walltime $walltime
    --merged_prefix $merged_prefix
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/merged_qc_exomes_2021-09-02.sbatch
INFO: Workflow csg (ID=we00f5e940d3f54a6) is executed successfully with 1 completed step.
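
The `genoFile` list above names one bed file per autosome (chromosomes 1 to 22). Before submitting the merge job it may be worth confirming that all 22 files exist. The following is a minimal sketch, not part of the original workflow: `check_chr_files` is a hypothetical helper, and the demonstration runs on empty placeholder files in a temporary directory rather than on the real exome folder.

```shell
# Hypothetical helper: count missing per-chromosome files for chromosomes 1..22.
# $1 = directory, $2 = file prefix, $3 = file suffix
check_chr_files() {
    dir=$1; prefix=$2; suffix=$3
    missing=0
    chr=1
    while [ "$chr" -le 22 ]; do
        f="$dir/${prefix}${chr}${suffix}"
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            missing=$((missing + 1))
        fi
        chr=$((chr + 1))
    done
    # Print the number of missing files on stdout
    echo "$missing"
}

# Toy demonstration with placeholder files (assumption: not the real data)
demo_dir=$(mktemp -d)
chr=1
while [ "$chr" -le 22 ]; do
    # Deliberately skip chromosomes 7 and 19 to simulate missing files
    if [ "$chr" -ne 7 ] && [ "$chr" -ne 19 ]; then
        touch "$demo_dir/ukb23156_c${chr}.merged.filtered.bed"
    fi
    chr=$((chr + 1))
done
n_missing=$(check_chr_files "$demo_dir" ukb23156_c .merged.filtered.bed 2>/dev/null)
echo "files missing: $n_missing"
rm -rf "$demo_dir"
```

On the cluster the same helper could be pointed at the `plink` folder used in `genoFile`; the names of the missing files are written to stderr.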



## Step 1. Select "European individuals" from genotype array data

In [9]:
#Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/europeans
#bfile with sample and variant QC from 083021 containing all of the samples (Columbia's cluster)
##here I used the bfile in which individuals with call rate >90% were retained
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
#To keep the samples of white individuals only
keep_samples=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind
#QC is already done, so no need to filter any more
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
mem='30G'
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/select_europeans_qcbed_$(date +"%Y-%m-%d").sbatch

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/select_europeans_qcbed_2021-09-02.sbatch
INFO: Workflow csg (ID=w2a67ebea54c884ff) is executed successfully with 1 completed step.



## Step 2. Run KING

In [12]:
##Columbia's variables
cwd=$UKBB_PATH/results/083021_PCA_results/090221_king/
genoFile=$UKBB_PATH/results/083021_PCA_results/europeans/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.bed
king_sbatch=$OUT_PATH/flashpca_king_extendedwhite_$(date +"%Y-%m-%d").sbatch
kinship=0.0625
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
numThreads=20
mem='30G'
walltime='36h'

king_args="""king
    --cwd $cwd
    --genoFile $genoFile
    --kinship $kinship
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --walltime $walltime
    --no-maximize-unrelated
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $king_sbatch \
    --args "$king_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_king_extendedwhite_2021-09-02.sbatch
INFO: Workflow csg (ID=w8aa48a3899af1935) is executed successfully with 1 completed step.
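
The value `kinship=0.0625` used above equals the expected kinship coefficient of third-degree relatives (for example first cousins), since the expected kinship for a degree-d relationship is (1/2)^(d+1). The short sketch below only illustrates that arithmetic; how the `king` workflow applies the threshold (for instance, whether the comparison is inclusive) is not shown in this notebook and is not assumed here.

```shell
# Expected kinship coefficient for a degree-d relationship: (1/2)^(d+1)
degree=1
while [ "$degree" -le 4 ]; do
    awk -v d="$degree" 'BEGIN { printf "degree %d: expected kinship %.5f\n", d, 0.5 ^ (d + 1) }'
    degree=$((degree + 1))
done

# Third-degree relatives give the value used for --kinship in this step
third_degree=$(awk 'BEGIN { printf "%.4f", 0.5 ^ 4 }')
echo "third-degree expected kinship: $third_degree"
```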



## Step 3. Remove related individuals and perform LD pruning

In [4]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated
## Use the qc version of the genotype array with the already filtered 189010 white individuals
genoFile=$UKBB_PATH/results/083021_PCA_results/europeans/cache/*.europeans.filtered.bed
#Remove the related individuals identified by KING in Step 2
remove_samples=$UKBB_PATH/results/083021_PCA_results/090221_king/*.related_id

#GWAS QC variables: set all of them to 0 so that no further filtering is applied to the already filtered data
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#LD pruning variables
window=50
shift=10
r2=0.1
gwas_sbatch=$OUT_PATH/gwas_unrelated_european_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='30G'

gwasqc_args="""qc
    --cwd $cwd
    --genoFile $genoFile
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/gwas_unrelated_european_2021-09-03.sbatch
INFO: Workflow csg (ID=w772a2dadd9f0d3c8) is executed successfully with 1 completed step.
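
Several variables in this step (`genoFile`, `remove_samples`) are wildcard patterns rather than full paths, so the job depends on each pattern matching exactly one file when it is finally expanded. A minimal sketch of a pre-flight check follows. `resolve_one` is a hypothetical helper, not part of the workflow, and it assumes paths without spaces; the demonstration uses placeholder files in a temporary directory.

```shell
# Hypothetical helper: succeed only if a pattern matches exactly one existing file.
resolve_one() {
    # $1 is left unquoted on purpose so that the shell expands the wildcard
    set -- $1
    if [ "$#" -eq 1 ] && [ -e "$1" ]; then
        echo "$1"
    else
        echo "pattern gave $# candidate(s), or the file does not exist" >&2
        return 1
    fi
}

# Toy demonstration with placeholder files (assumption: not the real data)
demo_dir=$(mktemp -d)
touch "$demo_dir/a.europeans.filtered.bed"
one=$(resolve_one "$demo_dir/*.europeans.filtered.bed")
echo "resolved: $one"

# A second matching file makes the pattern ambiguous, so the check fails
touch "$demo_dir/b.europeans.filtered.bed"
if resolve_one "$demo_dir/*.europeans.filtered.bed" > /dev/null 2>&1; then
    ambiguous=no
else
    ambiguous=yes
fi
echo "ambiguous: $ambiguous"
rm -rf "$demo_dir"
```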



## Step 4. Get bed file for related

In [11]:
##Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_related
genoFile=$UKBB_PATH/results/083021_PCA_results/europeans/cache/*.europeans.filtered.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090221_king/*.related_id
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in

#GWAS QC variables
maf_filter=0.0
geno_filter=0.0
hwe_filter=0.0
mind_filter=0.0
gwas_sbatch=$OUT_PATH/gwas_related_european_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/gwas_related_european_2021-09-03.sbatch
INFO: Workflow csg (ID=w8f33a1fa03c50cb6) is executed successfully with 1 completed step.
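
This step passes the `prune.in` list from Step 3 as `keep_variants`, so the bed file for the related individuals should end up with the same variants as the pruned bed file for the unrelated individuals. One way to confirm this after both jobs finish is to compare the variant IDs (second column) of the two `.bim` files. The sketch below shows the comparison on two toy `.bim` files; the file names are placeholders, not the real outputs.

```shell
demo_dir=$(mktemp -d)
# Toy .bim files (columns: chromosome, variant ID, cM, bp, allele 1, allele 2)
printf '1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n' > "$demo_dir/unrelated.bim"
printf '1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n' > "$demo_dir/related.bim"

# Extract and sort the variant IDs of each file
awk '{ print $2 }' "$demo_dir/unrelated.bim" | sort > "$demo_dir/unrelated.ids"
awk '{ print $2 }' "$demo_dir/related.bim"   | sort > "$demo_dir/related.ids"

# comm -3 prints only the IDs that appear in one list but not the other
n_diff=$(comm -3 "$demo_dir/unrelated.ids" "$demo_dir/related.ids" | wc -l | tr -d ' ')
echo "variants present in only one file: $n_diff"
rm -rf "$demo_dir"
```

A non-zero count would suggest that the two files were not built from the same pruned variant list.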



## Step 5. Run PCA with unrelated samples

In [10]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090321_PCA_unrelated
#This is the bfile originated after filtering unrelated individuals and pruning
genoFile=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.bed
phenoFile=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_european_unrelated_$(date +"%Y-%m-%d").sbatch
k=10
maha_k=5
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_european_unrelated_2021-09-03.sbatch
INFO: Workflow csg (ID=wf589152d03288f2a) is executed successfully with 1 completed step.



## Step 6. Project related samples back

https://privefl.github.io/blog/detecting-outlier-samples-in-pca/

In [19]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090321_PCA_related_pval0.005
#This is the bfile containing only the related individuals (output of Step 4)
genoFile=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_related/cache/*.filtered.extracted.bed
phenoFile=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
# Project using the PCA model for the unrelated individuals
pca_model=$UKBB_PATH/results/083021_PCA_results/090321_PCA_unrelated/*.pca.rds

pca_sbatch=$OUT_PATH/flashpca_european_related_projected_pval0.005_$(date +"%Y-%m-%d").sbatch
label_col=ethnicity
pop_col=ethnicity
k=10
maha_k=10
prob=0.997
#Bonferroni correction for multiple comparisons: 0.05/10 PCs
pval=0.005
min_axis=0
max_axis=0
## Set the --homogeneous option to treat all populations as a single group
## For the plot, use the *.projected.rds file and not the *.projected.mahalanobis.rds file
#plot_data=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.rds
#outlier_file=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.outliers


pca_args="""project_samples
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --pca_model $pca_model
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --prob $prob
    --pval $pval
    --homogeneous
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_european_related_projected_pval0.0052021-09-03.sbatch
INFO: Workflow csg (ID=wab37037fb8d999e2) is executed successfully with 1 completed step.
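
The `pval=0.005` threshold in this step comes from dividing a significance level of 0.05 by the 10 PCs tested. A small sketch of that arithmetic is shown here so the threshold can be recomputed if `k` changes; the variable `alpha` is introduced only for illustration and is not a workflow parameter.

```shell
# Bonferroni-style threshold: overall significance level divided by the number of PCs
alpha=0.05   # illustrative name, not a workflow parameter
k=10
pval=$(awk -v a="$alpha" -v k="$k" 'BEGIN { printf "%g", a / k }')
echo "pval=$pval"
```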



## Step 7. Plot the projected individuals and highlight outliers

In [21]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090321_PCA_related_pval0.005
#This is the bfile containing only the related individuals (output of Step 4)
genoFile=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_related/cache/*.filtered.extracted.bed
phenoFile=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
plot_data=$UKBB_PATH/results/083021_PCA_results/090321_PCA_related_pval0.005/*.pca.projected.rds
outlier_file=$UKBB_PATH/results/083021_PCA_results/090321_PCA_related_pval0.005/*.outliers
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/plot_european_projected_$(date +"%Y-%m-%d").sbatch

pca_args="""plot_pca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --label_col $label_col
    --pop_col $pop_col
    --plot_data $plot_data
    --outlier_file $outlier_file
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/plot_european_projected_2021-09-03.sbatch
INFO: Workflow csg (ID=w551e30b44159cdab) is executed successfully with 1 completed step.



## Per Suzanne's request, color the related individuals in the plots of PC1 to PC4

In [7]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090321_PCA_related_pval0.005/related_colored
#This is the bfile containing only the related individuals (output of Step 4)
genoFile=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_related/cache/*.filtered.extracted.bed
phenoFile=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
plot_data=$UKBB_PATH/results/083021_PCA_results/090321_PCA_related_pval0.005/*.pca.projected.rds
# Pass the list of related individuals in place of the outlier file
related_ind=$UKBB_PATH/results/083021_PCA_results/090221_king/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_king.related_id
##outlier_file=$UKBB_PATH/results/083021_PCA_results/090321_PCA_related_pval0.005/*.outliers
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$cwd/plot_related_$(date +"%Y-%m-%d").sbatch
k=10

pca_args="""plot_pca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --label_col $label_col
    --pop_col $pop_col
    --plot_data $plot_data
    --outlier_file $related_ind
    --numThreads $numThreads 
    --job_size $job_size
    --k $k
    --container $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/vast/hpc/csg/UKBiobank/results/083021_PCA_results/090321_PCA_related_pval0.005/related_colored/plot_related_2022-11-29.sbatch
INFO: Workflow csg (ID=wa787a1a9e09a3c38) is executed successfully with 1 completed step.



## 09-03-21 Re-run the PCA for every phenotype to obtain PC's for LMM analysis. 

### f.3393

#### Step 1.

In [4]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090821_f3393_pca
gwas_sbatch=$OUT_PATH/qc1_f3393_qcarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.keep_id
#Keep variants after LD pruning
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in

#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f3393_qcarray_2021-09-08.sbatch
INFO: Workflow csg (ID=wf5b4c32259fd836b) is executed successfully with 1 completed step.



#### Step 2. 

In [10]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090821_f3393_pca
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/results/083021_PCA_results/090821_f3393_pca/cache/*.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_f3393_pc_$(date +"%Y-%m-%d").sbatch
k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f3393_pc_2021-09-08.sbatch
INFO: Workflow csg (ID=w2819b574619af6fb) is executed successfully with 1 completed step.



### f.2247

#### Step 1. Produce the genotype file keeping specific samples and variants

In [5]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090821_f2247_pca
gwas_sbatch=$OUT_PATH/qc1_f2247_qcarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl.keep_id
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in
#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f2247_qcarray_2021-09-08.sbatch
INFO: Workflow csg (ID=wcfe536eb22d9e7dc) is executed successfully with 1 completed step.



#### Step 2.

In [13]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090821_f2247_pca
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/results/083021_PCA_results/090821_f2247_pca/cache/*.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_f2247_pc_$(date +"%Y-%m-%d").sbatch
k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f2247_pc_2021-09-08.sbatch
INFO: Workflow csg (ID=wafeca9cb7b20c80b) is executed successfully with 1 completed step.



### f.2257

#### Step 1. Produce the genotype file keeping specific samples and variants

In [7]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090821_f2257_pca
gwas_sbatch=$OUT_PATH/qc1_f2257_qcarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl.keep_id
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in
#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f2257_qcarray_2021-09-08.sbatch
INFO: Workflow csg (ID=w7ff2e9c165c744b4) is executed successfully with 1 completed step.



#### Step 2.

In [12]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090821_f2257_pca
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/results/083021_PCA_results/090821_f2257_pca/cache/*.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_f2257_pc_$(date +"%Y-%m-%d").sbatch
k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f2257_pc_2021-09-08.sbatch
INFO: Workflow csg (ID=w208de65bc74b283f) is executed successfully with 1 completed step.



### Combined trait

#### Step 1. Produce the genotype file keeping specific samples and variants

In [8]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090821_combined_f2247_f2257_pca
gwas_sbatch=$OUT_PATH/qc1_combined_qcarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl.keep_id
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in
#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_combined_qcarray_2021-09-08.sbatch
INFO: Workflow csg (ID=w7dfd0657210e6c24) is executed successfully with 1 completed step.



#### Step 2.

In [14]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/090821_combined_f2247_f2257_pca
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/results/083021_PCA_results/090821_combined_f2247_f2257_pca/cache/*.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_combined_pc_$(date +"%Y-%m-%d").sbatch
k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_combined_pc_2021-09-08.sbatch
INFO: Workflow csg (ID=w431cc450bb079b20) is executed successfully with 1 completed step.
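
The four phenotypes above repeat the same two steps with only the trait label changing in `cwd` and in the script name. As a rough sketch of how that repetition could be expressed as a loop, the snippet below only prints the per-trait working directory and sbatch name and does not call `sos`. The placeholder roots are assumptions used when `UKBB_PATH` and `OUT_PATH` are unset. The `phenoFile` and `keep_samples` names embed case and control counts, so they cannot be derived from the label and would still need to be listed explicitly. The combined trait gets a longer script name here than the `flashpca_combined_pc_` name used above.

```shell
# Placeholder roots for this sketch only; the notebook defines the real ones
UKBB_PATH=${UKBB_PATH:-/path/to/UKBiobank}
OUT_PATH=${OUT_PATH:-/path/to/output}
today=$(date +"%Y-%m-%d")

# Build one line per trait: label, working directory, sbatch script name
plan=$(
    for trait in f3393 f2247 f2257 combined_f2247_f2257; do
        cwd=$UKBB_PATH/results/083021_PCA_results/090821_${trait}_pca
        pca_sbatch=$OUT_PATH/flashpca_${trait}_pc_$today.sbatch
        printf '%s\t%s\t%s\n' "$trait" "$cwd" "$pca_sbatch"
    done
)
echo "$plan"
n_traits=$(printf '%s\n' "$plan" | wc -l | tr -d ' ')
```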



## 09-14-21 Get PCA for every phenotype for 50K and 150K samples to obtain PC's for LMM analysis. 

### f.3393 50K

#### Step 1.

In [7]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_f3393_50Kexomes_pca
gwas_sbatch=$OUT_PATH/qc1_f3393_qcarray_50K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_24189ind_50Kexomes.keep_id
#Keep variants after LD pruning
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in

#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f3393_qcarray_2021-09-14.sbatch
INFO: Workflow csg (ID=w08964e7b21063bd2) is executed successfully with 1 completed step.



#### Step 2

In [17]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_f3393_50Kexomes_pca
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/results/083021_PCA_results/091421_f3393_50Kexomes_pca/cache/*.bed
# Format FID, IID, ethnicity
phenoFile=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_24189_50K.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_f3393_pc_50K_$(date +"%Y-%m-%d").sbatch
k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f3393_pc_50K_2021-09-14.sbatch
INFO: Workflow csg (ID=w11fcf6c45777d4a0) is executed successfully with 1 completed step.



### f.3393 150K

#### Step 1

In [11]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_f3393_150Kexomes_pca
gwas_sbatch=$OUT_PATH/qc1_f3393_qcarray_150K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_78848ind_150Kexomes.keep_id
#Keep variants after LD pruning
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in

#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f3393_qcarray_150K_2021-09-14.sbatch
INFO: Workflow csg (ID=wbbcc5a29996f0bc1) is executed successfully with 1 completed step.
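
The Step 1 cells in this section differ only in `cwd`, `keep_samples`, and the output script name; every filter is set to 0. A minimal sketch of a helper that assembles the shared `qc:1` argument string is given here. The function name `build_qc1_args` is hypothetical and is not part of `Get_Job_Script.ipynb` or the QC workflow; it only prints the flags that the Step 1 cells pass through `--args`, and it expects `job_size` and `container_lmm` to be set elsewhere in the notebook.

```shell
# Hypothetical helper (not part of the workflow): print the qc:1 argument
# string shared by the Step 1 cells. Filters are fixed at 0, as in those cells.
# job_size and container_lmm are read from the environment.
build_qc1_args() {
    local cwd=$1 genoFile=$2 keep_samples=$3 keep_variants=$4
    cat <<EOF
qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter 0
    --geno_filter 0
    --hwe_filter 0
    --mind_filter 0
    --numThreads ${numThreads:-1}
    --job_size $job_size
    --container_lmm $container_lmm
    --mem ${mem:-30G}
EOF
}
```

A cell could then use `gwasqc_args=$(build_qc1_args "$cwd" "$genoFile" "$keep_samples" "$keep_variants")` and pass `"$gwasqc_args"` to `sos run` as before.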



#### Step 2

In [18]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_f3393_150Kexomes_pca
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/results/083021_PCA_results/091421_f3393_150Kexomes_pca/cache/*.bed
# Format FID, IID, ethnicity
phenoFile=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_aid_f3393_expandedwhite_78848_150K.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_f3393_pc_150K_$(date +"%Y-%m-%d").sbatch
k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f3393_pc_150K_2021-09-14.sbatch
INFO: Workflow csg (ID=wc3aa95c22f2a068c) is executed successfully with 1 completed step.



## f.2247 50K

#### Step 1

In [10]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_f2247_50Kexomes_pca
gwas_sbatch=$OUT_PATH/qc1_f2247_qcarray_50K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_34596ind_50Kexomes.keep_id
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in
# GWAS QC variables: set all of these to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f2247_qcarray_50K_2021-09-14.sbatch
INFO: Workflow csg (ID=w49665c69c9f3543b) is executed successfully with 1 completed step.



#### Step 2.

In [19]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_f2247_50Kexomes_pca
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/results/083021_PCA_results/091421_f2247_50Kexomes_pca/cache/*.bed
# Format FID, IID, ethnicity
phenoFile=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_34596ind_50K.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_f2247_pc_50K_$(date +"%Y-%m-%d").sbatch
k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f2247_pc_50K_2021-09-14.sbatch
INFO: Workflow csg (ID=wa725739c54f73ca3) is executed successfully with 1 completed step.



## f.2247 150K

#### Step 1. 

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_f2247_150Kexomes_pca
gwas_sbatch=$OUT_PATH/qc1_f2247_qcarray_150K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_107507ind_150Kexomes.keep_id
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in
# GWAS QC variables: set all of these to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

#### Step 2.

In [20]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_f2247_150Kexomes_pca
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/results/083021_PCA_results/091421_f2247_150Kexomes_pca/cache/*.bed
# Format FID, IID, ethnicity
phenoFile=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_107507ind_150K.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_f2247_pc_150K_$(date +"%Y-%m-%d").sbatch
k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f2247_pc_150K_2021-09-14.sbatch
INFO: Workflow csg (ID=w7e44a7ddb4b4f8d7) is executed successfully with 1 completed step.



## f.2257 50K

#### Step 1

In [13]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_f2257_50Kexomes_pca
gwas_sbatch=$OUT_PATH/qc1_f2257_qcarray_50K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_noise_f2257_expandedwhite_38721ind_50Kexomes.keep_id
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in
# GWAS QC variables: set all of these to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f2257_qcarray_50K_2021-09-14.sbatch
INFO: Workflow csg (ID=w3935a290a25c2248) is executed successfully with 1 completed step.



#### Step 2

In [21]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_f2257_50Kexomes_pca
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/results/083021_PCA_results/091421_f2257_50Kexomes_pca/cache/*.bed
# Format FID, IID, ethnicity
phenoFile=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_noise_f2257_expandedwhite_38723ind_50K.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_f2257_pc_50K_$(date +"%Y-%m-%d").sbatch
k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f2257_pc_50K_2021-09-14.sbatch
INFO: Workflow csg (ID=wdb9ab761be06600d) is executed successfully with 1 completed step.
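
In the f.2257 50K cells, the `keep_samples` file name carries 38721 individuals while the `phenoFile` name carries 38723. A quick way to see whether two such files really list the same number of samples is to count distinct IIDs in each. The sketch below assumes whitespace-delimited files with the IID in the second column and an optional header line; the function name `count_iids` and the demo files (with made-up IDs) are illustrative only.

```shell
# Count distinct IIDs (second column) in a whitespace-delimited file.
# Assumption: if the second field of the first line is literally "IID",
# that line is a header and is skipped.
count_iids() {
    awk 'NR == 1 && $2 == "IID" { next }
         NF >= 2 { seen[$2] = 1 }
         END { n = 0; for (id in seen) n++; print n }' "$1"
}

# Demo on throwaway files; replace these with the real keep_id and phenopca paths.
tmpdir=$(mktemp -d)
printf '1 1\n2 2\n3 3\n' > "$tmpdir/demo.keep_id"
printf 'FID IID ethnicity\n1 1 a\n2 2 a\n3 3 b\n4 4 b\n' > "$tmpdir/demo.phenopca"
n_keep=$(count_iids "$tmpdir/demo.keep_id")
n_pheno=$(count_iids "$tmpdir/demo.phenopca")
if [ "$n_keep" -ne "$n_pheno" ]; then
    echo "sample counts differ: keep_id=$n_keep phenoFile=$n_pheno"
fi
rm -rf "$tmpdir"
```

A difference is not necessarily an error, since the phenoFile may simply contain samples that are not kept, but it is worth confirming before interpreting the PCA plots.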



## f.2257 150K

#### Step 1.

In [14]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_f2257_150Kexomes_pca
gwas_sbatch=$OUT_PATH/qc1_f2257_qcarray_150K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_noise_f2257_expandedwhite_123538ind_150Kexomes.keep_id
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in
# GWAS QC variables: set all of these to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_f2257_qcarray_150K_2021-09-14.sbatch
INFO: Workflow csg (ID=w44507a1ae5adbe55) is executed successfully with 1 completed step.



#### Step 2. 

In [22]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_f2257_150Kexomes_pca
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/results/083021_PCA_results/091421_f2257_150Kexomes_pca/cache/*.bed
# Format FID, IID, ethnicity
phenoFile=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Hearing_noise_f2257_expandedwhite_123538ind_150K.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_f2257_pc_150K_$(date +"%Y-%m-%d").sbatch
k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_f2257_pc_150K_2021-09-14.sbatch
INFO: Workflow csg (ID=w0ad961c577b76fb1) is executed successfully with 1 completed step.



## combined 50K

#### Step 1

In [15]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_combined_f2247_f2257_50Kexomes_pca
gwas_sbatch=$OUT_PATH/qc1_combined_qcarray_50K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Combined_f2247_f2257_expandedwhite_32878ind_50Kexomes.keep_id
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in
# GWAS QC variables: set all of these to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_combined_qcarray_50K_2021-09-14.sbatch
INFO: Workflow csg (ID=w50102ad74cf5bbdf) is executed successfully with 1 completed step.



#### Step 2.

In [23]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_combined_f2247_f2257_50Kexomes_pca
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/results/083021_PCA_results/091421_combined_f2247_f2257_50Kexomes_pca/cache/*.bed
# Format FID, IID, ethnicity
phenoFile=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Combined_f2247_f2257_expandedwhite_32878ind_50K.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_combined_pc_50K_$(date +"%Y-%m-%d").sbatch
k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_combined_pc_50K_2021-09-14.sbatch
INFO: Workflow csg (ID=w392fb78ab856c93a) is executed successfully with 1 completed step.



## combined 150K

#### Step 1.

In [16]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_combined_f2247_f2257_150Kexomes_pca
gwas_sbatch=$OUT_PATH/qc1_combined_qcarray_150K_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Combined_f2247_f2257_expandedwhite_102133ind_150Kexomes.keep_id
keep_variants=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/*.090221_ldprun_unrelated.filtered.prune.in
# GWAS QC variables: set all of these to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/qc1_combined_qcarray_150K_2021-09-14.sbatch
INFO: Workflow csg (ID=w97617f3c320f3808) is executed successfully with 1 completed step.



#### Step 2.

In [24]:
## Columbia's cluster
cwd=$UKBB_PATH/results/083021_PCA_results/091421_combined_f2247_f2257_150Kexomes_pca
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/results/083021_PCA_results/091421_combined_f2247_f2257_150Kexomes_pca/cache/*.bed
# Format FID, IID, ethnicity
phenoFile=$UKBB_PATH/results/083021_PCA_results/090321_UKBB_Combined_f2247_f2257_expandedwhite_102133ind_150K.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$OUT_PATH/flashpca_combined_pc_150K_$(date +"%Y-%m-%d").sbatch
k=2
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_combined_pc_150K_2021-09-14.sbatch
INFO: Workflow csg (ID=w76325a9da60537fd) is executed successfully with 1 completed step.



# Get the phenoFiles for Asian and African samples

In [14]:
library(dplyr)

In [48]:
asian <- read.table("/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/010622_ukb47922_asian_10695.iid", sep="\t", header=TRUE)

In [32]:
head(asian)

| | FID | IID | ethnicity |
|---|---|---|---|
| | `<int>` | `<int>` | `<fct>` |
| 1 | 1000906 | 1000906 | 3003 |
| 2 | 1001874 | 1001874 | 3004 |
| 3 | 1002497 | 1002497 | 3001 |
| 4 | 1002712 | 1002712 | 3001 |
| 5 | 1003025 | 1003025 | 3001 |
| 6 | 1003083 | 1003083 | 3001 |


In [49]:
asian <- asian %>%
    mutate(ethnia = recode(ethnicity,
        "3001" = "Indian", "3002" = "Pakistani", "3003" = "Bangladeshi",
        "2003" = "White_and_Asian", "3004" = "Any_other_asian_background",
        "3" = "Asian_or_Asian_British")) %>%
    select(FID, IID, ethnia)

In [50]:
head(asian)

| | FID | IID | ethnia |
|---|---|---|---|
| | `<int>` | `<int>` | `<fct>` |
| 1 | 1000906 | 1000906 | Bangladeshi |
| 2 | 1001874 | 1001874 | Any_other_asian_background |
| 3 | 1002497 | 1002497 | Indian |
| 4 | 1002712 | 1002712 | Indian |
| 5 | 1003025 | 1003025 | Indian |
| 6 | 1003083 | 1003083 | Indian |


In [51]:
write.table(asian, "/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/010622_ukb47922_asian_10695_pop_superpop.iid", sep="\t", row.names = FALSE, col.names =TRUE, quote=FALSE)

In [52]:
african <- read.table("/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/010622_ukb47922_african_9096.iid", sep="\t", header=TRUE)

In [53]:
african <- african %>%
    mutate(ethnia = recode(ethnicity,
        "4001" = "Caribbean", "4002" = "African", "4003" = "Any_other_Black_background",
        "2001" = "White_and_Black_Caribbean", "2002" = "White_and_Black_African",
        "4" = "Black_or_Black_British")) %>%
    select(FID, IID, ethnia)

In [54]:
write.table(african, "/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/010622_ukb47922_african_9096_pop_superpop.iid", sep="\t", row.names = FALSE, col.names =TRUE, quote=FALSE)
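
A quick tabulation of the `ethnia` column in the written file shows whether every ethnicity code was mapped to a label; codes without a replacement would show up as numbers or `NA`. The sketch below is a minimal shell check, assuming the tab-delimited layout with a header line written by `write.table` above. The function name `tabulate_labels` and the demo file (with made-up IDs) are illustrative only.

```shell
# Tabulate the values in the third column (ethnia) of a tab-delimited file
# that has a header line.
tabulate_labels() {
    awk -F '\t' 'NR > 1 { count[$3]++ }
                 END { for (label in count) printf "%s\t%d\n", label, count[label] }' "$1" | sort
}

# Demo on a throwaway file; point this at the *_pop_superpop.iid file instead.
demo=$(mktemp)
printf 'FID\tIID\tethnia\n1\t1\tIndian\n2\t2\tIndian\n3\t3\t3002\n' > "$demo"
labels=$(tabulate_labels "$demo")
echo "$labels"
rm -f "$demo"
```

In the demo, the leftover `3002` row stands in for a code that was not recoded.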

# 01-07-22 PCA for Asian samples in the full data (N=10,189; of these, N=4,591 have exome data)

## Step 1. Run KING to check sample relatedness (N=1,111 related samples)

The file listing the Asian individuals that have imputed and exome data is:

`~/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_asian_qc_10188ind_withexome_or_imputed`

In [4]:
##Columbia's variables
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_king_asian
genoFile=~/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.bed
king_sbatch=$OUT_PATH/flashpca_king_asian_$(date +"%Y-%m-%d").sbatch
kinship=0.0625
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
numThreads=20
mem='30G'
walltime='36h'

king_args="""king
    --cwd $cwd
    --genoFile $genoFile
    --kinship $kinship
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --walltime $walltime
    --no-maximize-unrelated
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $king_sbatch \
    --args "$king_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_king_asian_2022-01-07.sbatch
INFO: Workflow csg (ID=w561b4276c3de8ab6) is executed successfully with 1 completed step.
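
The `kinship=0.0625` cutoff equals 1/16, the expected kinship coefficient of first cousins, so pairs with a kinship estimate above it are treated as related. The sketch below shows how such pairs could be pulled from a kinship table. It assumes a whitespace-delimited table whose header names the columns `ID1`, `ID2`, and `KINSHIP` (the layout of a KING-style `.kin0` table). The column names, the function name `related_pairs`, and the demo values are assumptions and are not taken from this workflow's output.

```shell
# Print ID pairs whose KINSHIP value exceeds a threshold.
# Assumption: the first line is a header; ID1, ID2 and KINSHIP are located by name.
related_pairs() {
    awk -v thr="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["KINSHIP"]) + 0 > thr + 0 { print $(col["ID1"]), $(col["ID2"]), $(col["KINSHIP"]) }
    ' "$1"
}

# Demo with made-up sample IDs and kinship values.
demo=$(mktemp)
printf 'FID1 ID1 FID2 ID2 NSNP HETHET IBS0 KINSHIP\n' > "$demo"
printf 'A A B B 1000 0.1 0.01 0.2500\n' >> "$demo"
printf 'A A C C 1000 0.1 0.05 0.0100\n' >> "$demo"
pairs=$(related_pairs "$demo" 0.0625)
echo "$pairs"
rm -f "$demo"
```

Only the first demo pair is printed, because its kinship of 0.25 is above the cutoff while 0.01 is not.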



## Step 2. Get the unrelated individuals and do LD pruning

In [4]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_ldprun_unrelated_asian
## Use the qc version of the genotype array with the already filtered asian individuals
genoFile=~/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.bed
# Remove related individuals so that only unrelated Asian samples are kept
remove_samples=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_king_asian/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_king_asian.related_id

# GWAS QC variables: leave all of these at 0 so no further filtering is applied to the already filtered data
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
# LD pruning variables
window=50
shift=10
r2=0.1
gwas_sbatch=$OUT_PATH/ldprun_unrelated_asian_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='30G'

gwasqc_args="""qc
    --cwd $cwd
    --genoFile $genoFile
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/ldprun_unrelated_asian_2022-01-10.sbatch
INFO: Workflow csg (ID=w339f1ad05015d800) is executed successfully with 1 completed step.
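
The `window`, `shift`, and `r2` values correspond to the three arguments of PLINK's `--indep-pairwise` option: window size in variants, step size, and r² threshold. Whether the QC workflow calls PLINK in exactly this way is an assumption. The sketch only prints the command so it can be inspected; the `--bfile` and `--out` prefixes are placeholders and `ld_prune_cmd` is a hypothetical name.

```shell
# Print (do not run) a PLINK LD-pruning command for the given settings.
# Assumption: the workflow's window/shift/r2 map onto --indep-pairwise.
ld_prune_cmd() {
    local bfile=$1 out=$2 window=${3:-50} shift_by=${4:-10} r2=${5:-0.1}
    echo "plink --bfile $bfile --indep-pairwise $window $shift_by $r2 --out $out"
}

cmd=$(ld_prune_cmd my_genotypes my_pruned 50 10 0.1)
echo "$cmd"
```

PLINK writes the retained variants to `<out>.prune.in`, which is the kind of file passed as `keep_variants` in the later steps.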



## Step 3. Run PCA for unrelated Asian

In [20]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_pca_unrelated_asian
# This is the bfile generated after keeping unrelated individuals and LD pruning
genoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_ldprun_unrelated_asian/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_ldprun_unrelated_asian.filtered.prune.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/010622_ukb47922_asian_10695_pop_superpop.iid
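label_col=ethnia
# pop_col is passed to flashpca below; set it here to the same column used in Steps 5 and 6
pop_col=ethnia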
pca_sbatch=$OUT_PATH/flashpca_asian_unrelated_genoarray_$(date +"%Y-%m-%d").sbatch
k=10
maha_k=5
min_axis=""
max_axis=""
homogeneous=TRUE

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --homogeneous $homogeneous
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_asian_unrelated_genoarray_2022-01-10.sbatch
INFO: Workflow csg (ID=wa8d2fc29796281ae) is executed successfully with 1 completed step.



## Step 4. Get the bed file for related Asian

In [11]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_related_asian
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.bed
keep_samples=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_king_asian/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_king_asian.related_id
#Keep the same variants as above
keep_variants=~/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_ldprun_unrelated_asian/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_ldprun_unrelated_asian.filtered.prune.in

#GWAS QC variables
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
gwas_sbatch=$OUT_PATH/flashpca_asian_related_qc_genoarray_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='30G'
merged_prefix='ukb23155_qc_related_genoarray_asian'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --merged_prefix $merged_prefix
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_asian_related_qc_genoarray_2022-01-10.sbatch
INFO: Workflow csg (ID=wc78d7a348c45c450) is executed successfully with 1 completed step.



## Step 5. Project back related Asian individuals

In [24]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_project_related_asian
# This is the bfile generated after extracting the related individuals
genoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_related_asian/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_related_asian.filtered.extracted.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/010622_ukb47922_asian_10695_pop_superpop.iid
pca_model=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_pca_unrelated_asian/010622_ukb47922_asian_10695_pop_superpop.010722_pca_unrelated_asian.pca.rds
pca_sbatch=$OUT_PATH/flashpca_asian_related_genoarray_projected_$(date +"%Y-%m-%d").sbatch
label_col=ethnia
pop_col=ethnia
k=10
maha_k=5
prob=0.997
pval=0.05
## Set --homogeneous TRUE to treat all the populations as one
homogeneous=TRUE
## For the plot you need to use the *.projected.rds and not the *.projected.mahalanobis.rds
#plot_data=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.rds
#outlier_file=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.outliers


pca_args="""project_samples
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --pca_model $pca_model
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --prob $prob
    --pval $pval
    --homogeneous $homogeneous
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_asian_related_genoarray_projected_2022-01-10.sbatch
INFO: Workflow csg (ID=w5836c0866cc3f9ed) is executed successfully with 1 completed step.



## Step 6. Plot projected individuals and look for outliers

In [28]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_project_related_asian
# This is the bfile generated after extracting the related individuals
genoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_related_asian/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_related_asian.filtered.extracted.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/asian_IID/010622_ukb47922_asian_10695_pop_superpop.iid
pca_model=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_pca_unrelated_asian/010622_ukb47922_asian_10695_pop_superpop.010722_pca_unrelated_asian.pca.rds
plot_data=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_project_related_asian/010622_ukb47922_asian_10695_pop_superpop.010722_project_related_asian.pca.projected.rds
outlier_file=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_project_related_asian/010622_ukb47922_asian_10695_pop_superpop.010722_project_related_asian.pca.projected.outliers
label_col=ethnia
pop_col=ethnia
pca_sbatch=$OUT_PATH/flashpca_asian_related_genoarray_plot_$(date +"%Y-%m-%d").sbatch

pca_args="""plot_pca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --label_col $label_col
    --pop_col $pop_col
    --plot_data $plot_data
    --outlier_file $outlier_file
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg  \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_asian_related_genoarray_plot_2022-01-10.sbatch
INFO: Workflow csg (ID=w5aed8844c8fe2afd) is executed successfully with 1 completed step.



## Get the sample id and variant list after QC and PCA

In [None]:
# N=10,157 samples and 444,076 variants
plink2 \
    --bfile ~/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted \
    --remove ~/UKBiobank/genotype_files_processed/010622_asian_10189ind/010722_project_related_asian/010622_ukb47922_asian_10695_pop_superpop.010722_project_related_asian.pca.projected.outliers \
    --write-snplist --write-samples --no-id-header \
    --threads 8 \
    --out ~/UKBiobank/genotype_files_processed/010622_asian_10189ind/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted_ASIAN
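
With `--write-samples --no-id-header` and `--write-snplist`, plink2 writes one line per sample to `<out>.id` and one line per variant to `<out>.snplist`, so the counts quoted in the comment above can be checked with a line count. The sketch below is only an illustration: `check_count` is a hypothetical helper, and the toy files stand in for the real outputs.

```shell
# check_count FILE EXPECTED: succeed only if FILE has exactly EXPECTED lines
# (hypothetical helper, not part of the workflow)
check_count() {
    n=$(wc -l < "$1" | tr -d ' ')
    if [ "$n" -eq "$2" ]; then
        echo "OK   $1: $n lines"
    else
        echo "FAIL $1: $n lines, expected $2" >&2
        return 1
    fi
}

# Toy stand-ins for the real <out>.id and <out>.snplist files
tmp=$(mktemp -d)
printf '%s\n' 1001 1002 1003 > "$tmp/toy.id"
printf '%s\n' rs1 rs2 > "$tmp/toy.snplist"

check_count "$tmp/toy.id" 3
check_count "$tmp/toy.snplist" 2
rm -rf "$tmp"
```

On the real data the same call would take the `.id` and `.snplist` paths derived from `--out` together with the expected sample and variant counts.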

# 01-07-22 PCA for African samples in the full data (N=8,617; of these, N=3,678 have exome data)

## Step 1. Run King to check sample relatedness

File listing the African individuals that have exome and imputed data:

`~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_african_qc_8617ind_withexome_or_imputed`

In [5]:
cwd=$UKBB_PATH/genotype_files_processed/010622_african_9096ind/010722_king_african
genoFile=~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.bed
king_sbatch=$OUT_PATH/flashpca_king_african_$(date +"%Y-%m-%d").sbatch
kinship=0.0625
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
numThreads=20
mem='30G'
walltime='36h'

king_args="""king
    --cwd $cwd
    --genoFile $genoFile
    --kinship $kinship
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --walltime $walltime
    --no-maximize-unrelated
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $king_sbatch \
    --args "$king_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_king_african_2022-01-10.sbatch
INFO: Workflow csg (ID=wad64520a422f6c32) is executed successfully with 1 completed step.
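
The `kinship=0.0625` cutoff used above equals the expected kinship coefficient of third-degree relatives (for example first cousins). As a rough illustration of what such a threshold does, the sketch below filters a toy pair table and collects every individual that appears in a pair at or above the cutoff. The three-column layout is an assumption made for brevity (the real KING output has more columns), and the exact comparison used inside the `king` workflow may differ.

```shell
# Toy pair table (hypothetical layout: ID1 ID2 KINSHIP)
pairs=$(mktemp)
cat > "$pairs" <<'EOF'
ID1 ID2 KINSHIP
A B 0.2500
A C 0.0300
D E 0.0700
EOF

kinship=0.0625

# Skip the header, keep pairs whose last column is >= the cutoff,
# and list each individual involved once
related_ids=$(awk -v k="$kinship" 'NR > 1 && $NF >= k {print $1; print $2}' "$pairs" \
    | sort -u | paste -s -d ' ' -)

echo "related individuals: $related_ids"
rm -f "$pairs"
```

Here the pairs A-B and D-E pass the cutoff, while A-C (0.03) does not.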



## Step 2. Get the unrelated individuals and do LD pruning

In [16]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_african_9096ind/010722_ldprun_unrelated_african
## Use the QC'ed version of the genotype array, already filtered to the African individuals
genoFile=~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.bed
#Remove the related individuals so that only unrelated African samples are kept
remove_samples=~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_king_african/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_king_african.related_id

#GWAS QC variables: set all of them to 0 so that no further filtering is applied to the already filtered data
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#LD pruning variables
window=50
shift=10
r2=0.1
gwas_sbatch=$OUT_PATH/ldprun_unrelated_african_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='30G'

gwasqc_args="""qc
    --cwd $cwd
    --genoFile $genoFile
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/ldprun_unrelated_african_2022-01-10.sbatch
INFO: Workflow csg (ID=w54542a1dafbffbf6) is executed successfully with 1 completed step.
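
The `window`, `shift` and `r2` values set above have the same meaning as the three arguments of PLINK's `--indep-pairwise` (window size in variants, step size, and r² threshold). Whether `GWAS_QC.ipynb` invokes PLINK exactly like this is an assumption; the sketch only assembles and prints such a command for a hypothetical prefix, without running it.

```shell
# LD pruning parameters, as set in the cell above
window=50
shift=10
r2=0.1

# Hypothetical bfile prefix, for illustration only
bfile_prefix=example_prefix

# Assemble the command as a string; nothing is executed here
ld_cmd="plink --bfile $bfile_prefix --indep-pairwise $window $shift $r2 --out $bfile_prefix"
echo "$ld_cmd"
```

A command of this form writes the retained variants to `<out>.prune.in`, which matches the `.prune.in` extension of the `keep_variants` file used in Step 4.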



## Step 3. Flashpca in the unrelated

In [25]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_african_9096ind/010722_pca_unrelated_african
#This is the bfile generated after keeping the unrelated individuals and pruning
genoFile=$UKBB_PATH/genotype_files_processed/010622_african_9096ind/010722_ldprun_unrelated_african/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_ldprun_unrelated_african.filtered.prune.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/010622_ukb47922_african_9096_pop_superpop.iid
label_col=ethnia
pca_sbatch=$OUT_PATH/flashpca_african_unrelated_genoarray_$(date +"%Y-%m-%d").sbatch
k=10
maha_k=5
min_axis=0
max_axis=0

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_african_unrelated_genoarray_2022-01-10.sbatch
INFO: Workflow csg (ID=w8a97bc7e44cc076f) is executed successfully with 1 completed step.



## Step 4. Get the bed file for related Africans

In [26]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_african_9096ind/010722_related_african
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.bed
keep_samples=~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_king_african/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_king_african.related_id
#Keep the same variants as above
keep_variants=$UKBB_PATH/genotype_files_processed/010622_african_9096ind/010722_ldprun_unrelated_african/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_ldprun_unrelated_african.filtered.prune.in

#GWAS QC variables
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
gwas_sbatch=$OUT_PATH/flashpca_african_related_qc_genoarray_$(date +"%Y-%m-%d").sbatch
numThreads=20
mem='30G'
merged_prefix='ukb23155_qc_related_genoarray_asian'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --merged_prefix $merged_prefix
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_african_related_qc_genoarray_2022-01-10.sbatch
INFO: Workflow csg (ID=w60c8a480c99afab0) is executed successfully with 1 completed step.
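
Step 2 removes the IDs in the `.related_id` file and Step 4 keeps exactly those IDs, so the unrelated and related sets should be disjoint and together cover every QC'ed sample. The sketch below shows one way to check this with `comm` on toy ID lists. The file names and IDs are hypothetical, and the real files would first need to be reduced to one sorted ID per line.

```shell
workdir=$(mktemp -d)

# Toy ID lists: all QC'ed samples and the subset flagged as related
printf '%s\n' 1001 1002 1003 1004 1005 | sort > "$workdir/all.ids"
printf '%s\n' 1002 1005 | sort > "$workdir/related.ids"

# Unrelated = all minus related (presumably what removing the related IDs achieves)
comm -23 "$workdir/all.ids" "$workdir/related.ids" > "$workdir/unrelated.ids"

# The two sets must not overlap, and their union must be the full sample list
n_overlap=$(comm -12 "$workdir/related.ids" "$workdir/unrelated.ids" | wc -l | tr -d ' ')
n_total=$(cat "$workdir/related.ids" "$workdir/unrelated.ids" | sort -u | wc -l | tr -d ' ')

echo "overlap=$n_overlap total=$n_total"
rm -rf "$workdir"
```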



## Step 5. Project back related Africans

In [36]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_african_9096ind/010722_project_related_african
#This is the bfile generated after filtering related individuals
genoFile=$UKBB_PATH/genotype_files_processed/010622_african_9096ind/010722_related_african/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.010722_related_african.filtered.extracted.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/010622_ukb47922_african_9096_pop_superpop.iid
pca_model=~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_pca_unrelated_african/010622_ukb47922_african_9096_pop_superpop.010722_pca_unrelated_african.pca.rds
pca_sbatch=$OUT_PATH/flashpca_african_related_genoarray_projected_$(date +"%Y-%m-%d").sbatch
label_col=ethnia
pop_col=ethnia
maha_k=5
prob=0.997
pval=0.05
## Set --homogeneous TRUE to treat all the populations as a single one
homogeneous=TRUE
## For the plot you need to use the *.projected.rds and not the *.projected.mahalanobis.rds
#plot_data=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.rds
#outlier_file=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.outliers


pca_args="""project_samples
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --pca_model $pca_model
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --prob $prob
    --pval $pval
    --homogeneous $homogeneous
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_african_related_genoarray_projected_2022-01-11.sbatch
INFO: Workflow csg (ID=w4d96ac2a1a1d0bc4) is executed successfully with 1 completed step.
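
Each `*_args` string is assembled from variables defined in this and earlier cells, and the shell silently expands a misspelled or undefined name to an empty string. A small guard run before building the string can catch that. The sketch below is one possible approach; `require_vars` is a hypothetical helper and not part of `Get_Job_Script.ipynb`.

```shell
# Return non-zero and report the first variable that is unset or empty
# (hypothetical helper, POSIX sh compatible)
require_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing variable: $name" >&2
            return 1
        fi
    done
}

label_col=ethnia
pop_col=ethnia
unset ethnia   # 'ethnia' is a column value here, not a shell variable

# Passes: both names are defined
require_vars label_col pop_col && echo "ok: label_col pop_col"

# Fails: a value was used where a variable name was meant
require_vars label_col ethnia || echo "caught: ethnia is not set"
```

Calling something like `require_vars cwd genoFile phenoFile pca_model k pop_col` just before the args string is defined would stop the cell before a script with an empty option value is generated.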



## Step 6. Plot projected individuals and look for outliers

In [37]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_african_9096ind/010722_project_related_african
## Use qc'ed genotype array
genoFile=~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.bed
## Phenofile with ethnia column
phenoFile=/mnt/mfs/statgen/UKBiobank/phenotype_files/african_IID/010622_ukb47922_african_9096_pop_superpop.iid
pca_model=~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_pca_unrelated_african/010622_ukb47922_african_9096_pop_superpop.010722_pca_unrelated_african.pca.rds
plot_data=$UKBB_PATH/genotype_files_processed/010622_african_9096ind/010722_project_related_african/010622_ukb47922_african_9096_pop_superpop.010722_project_related_african.pca.projected.rds
outlier_file=$UKBB_PATH/genotype_files_processed/010622_african_9096ind/010722_project_related_african/010622_ukb47922_african_9096_pop_superpop.010722_project_related_african.pca.projected.outliers
label_col=ethnia
pop_col=ethnia
pca_sbatch=$OUT_PATH/flashpca_african_related_genoarray_plot_$(date +"%Y-%m-%d").sbatch

pca_args="""plot_pca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --label_col $label_col
    --pop_col $pop_col
    --plot_data $plot_data
    --outlier_file $outlier_file
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg  \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/flashpca_african_related_genoarray_plot_2022-01-11.sbatch
INFO: Workflow csg (ID=w27a90c47284e126b) is executed successfully with 1 completed step.



## Get final sample_list and variant_list after QC and PCA

In [None]:
# N=8,591 samples and 351,690 variants
plink2 \
    --bfile ~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted  \
    --remove ~/UKBiobank/genotype_files_processed/010622_african_9096ind/010722_project_related_african/010622_ukb47922_african_9096_pop_superpop.010722_project_related_african.pca.projected.outliers \
    --write-snplist --write-samples --no-id-header \
    --threads 8 \
    --out ~/UKBiobank/genotype_files_processed/010622_african_9096ind/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted

# Calculate PCs for Asians on hearing impairment data

## f.3393 200K

#### Step 1.

In [8]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_f3393_asian_PC3
gwas_sbatch=$OUT_PATH/f3393_asian_PC3_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.bed
keep_samples=$UKBB_PATH/phenotype_files/hearing_impairment/011422_UKBB_Hearing_aid_f3393_asian_96cases_2395ctrl.keep_id
#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca_01_18_22/f3393_asian_PC3_2022-01-18.sbatch
INFO: Workflow csg (ID=wcb0da58e7e7ce450) is executed successfully with 1 completed step.


#### Step 2.

In [13]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_f3393_asian_PC3
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_f3393_asian_PC3/cache/*.bed
# Format FID, IID, ethnicity
phenoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_f3393_asian_PC3/cache/*.fam
pca_sbatch=$OUT_PATH/flashpca_f3393_pc3_asian_$(date +"%Y-%m-%d").sbatch
k=3

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca_01_18_22/flashpca_f3393_pc3_asian_2022-01-18.sbatch
INFO: Workflow csg (ID=wd9ec91a4951bf48d) is executed successfully with 1 completed step.
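
Step 2 points `genoFile` and `phenoFile` at `cache/*.bed` and `cache/*.fam`, which presumably is meant to pick up the single bfile written by Step 1. If the cache folder holds several filesets, or none, the pattern no longer identifies one file. The sketch below counts the matches of a pattern in a toy directory; `count_matches` is a hypothetical helper.

```shell
# Count how many of the arguments are existing files
# (an unmatched glob is passed through literally and does not exist)
count_matches() {
    n=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Toy cache folder with a single PLINK fileset
cache=$(mktemp -d)
touch "$cache/toy.bed" "$cache/toy.bim" "$cache/toy.fam"

n_bed=$(count_matches "$cache"/*.bed)
n_vcf=$(count_matches "$cache"/*.vcf)

echo "bed=$n_bed vcf=$n_vcf"
rm -rf "$cache"
```

A count other than 1 for `*.bed` would be a signal to give the full file name instead of the wildcard.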


## f.2247 200K

#### Step 1.

In [9]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_f2247_asian_PC3
gwas_sbatch=$OUT_PATH/f2247_asian_PC3_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.bed
keep_samples=$UKBB_PATH/phenotype_files/hearing_impairment/011422_UKBB_Hearing_difficulty_f2247_asian_728cases_2395ctrl.keep_id
#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca_01_18_22/f2247_asian_PC3_2022-01-18.sbatch
INFO: Workflow csg (ID=w586506ca57e6c0ab) is executed successfully with 1 completed step.


#### Step 2.

In [14]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_f2247_asian_PC3
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_f2247_asian_PC3/cache/*.bed
# Format FID, IID, ethnicity
phenoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_f2247_asian_PC3/cache/*.fam
pca_sbatch=$OUT_PATH/flashpca_f2247_pc3_asian_$(date +"%Y-%m-%d").sbatch
k=3

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca_01_18_22/flashpca_f2247_pc3_asian_2022-01-18.sbatch
INFO: Workflow csg (ID=wb6c3f8fefbec1dc6) is executed successfully with 1 completed step.


## f.2257 200K

#### Step 1.

In [11]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_f2257_asian_PC3
gwas_sbatch=$OUT_PATH/f2257_asian_PC3_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.bed
keep_samples=$UKBB_PATH/phenotype_files/hearing_impairment/011422_UKBB_Hearing_noise_f2257_asian_1515cases_2395ctrl.keep_id
#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca_01_18_22/f2257_asian_PC3_2022-01-18.sbatch
INFO: Workflow csg (ID=w2c5ce40875952475) is executed successfully with 1 completed step.


#### Step 2.

In [15]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_f2257_asian_PC3
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_f2257_asian_PC3/cache/*.bed
# Format FID, IID, ethnicity
phenoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_f2257_asian_PC3/cache/*.fam
pca_sbatch=$OUT_PATH/flashpca_f2257_pc3_asian_$(date +"%Y-%m-%d").sbatch
k=3

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca_01_18_22/flashpca_f2257_pc3_asian_2022-01-18.sbatch
INFO: Workflow csg (ID=w3f5b2e6d6361c89b) is executed successfully with 1 completed step.


## Combined 200K

#### Step 1.

In [12]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_combined_asian_PC3
gwas_sbatch=$OUT_PATH/combined_asian_PC3_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/010722_sample_var_final_qc/cache/UKB_genotypedatadownloaded083019.010722_sample_var_final_qc.filtered.extracted.bed
keep_samples=$UKBB_PATH/phenotype_files/hearing_impairment/011422_UKBB_Combined_f2247_f2257_asian_598cases_2395ctrl.keep_id
#GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca_01_18_22/combined_asian_PC3_2022-01-18.sbatch
INFO: Workflow csg (ID=w9d5f00ed9ba00513) is executed successfully with 1 completed step.


#### Step 2.

In [16]:
## Columbia's cluster
cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_combined_asian_PC3
#This is the bfile obtained in step 1
genoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_combined_asian_PC3/cache/*.bed
# Format FID, IID, ethnicity
phenoFile=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_combined_asian_PC3/cache/*.fam
pca_sbatch=$OUT_PATH/flashpca_combined_pc3_asian_$(date +"%Y-%m-%d").sbatch
k=3

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/tf2478/pca_01_18_22/flashpca_combined_pc3_asian_2022-01-18.sbatch
INFO: Workflow csg (ID=w21a0df69a9de56f3) is executed successfully with 1 completed step.
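
The four phenotype blocks above (f.3393, f.2247, f.2257 and combined) differ only in the trait tag that appears in `cwd` and in the sbatch names. As a sketch of how the naming could be driven by a loop, the code below prints the folder and script names for each trait. The default paths are placeholders used only when `UKBB_PATH` and `OUT_PATH` are not already set, and the `keep_samples` files are left out because their names include case counts that do not follow a pattern.

```shell
# Placeholders, used only if the variables are not already defined
UKBB_PATH=${UKBB_PATH:-/path/to/UKBiobank}
OUT_PATH=${OUT_PATH:-/path/to/output}
run_date=$(date +"%Y-%m-%d")

for trait in f3393 f2247 f2257 combined; do
    # Same naming scheme as the cells above, with the trait tag substituted
    cwd=$UKBB_PATH/genotype_files_processed/010622_asian_10189ind/011822_${trait}_asian_PC3
    gwas_sbatch=$OUT_PATH/${trait}_asian_PC3_${run_date}.sbatch
    pca_sbatch=$OUT_PATH/flashpca_${trait}_pc3_asian_${run_date}.sbatch

    echo "$trait"
    echo "  cwd:        $cwd"
    echo "  qc script:  $gwas_sbatch"
    echo "  pca script: $pca_sbatch"
done
```

Inside the loop, the two `sos run` calls from Step 1 and Step 2 could follow the variable definitions, which would keep the four analyses in sync.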
