# Analyze HDL as a binary trait

In [1]:
using Distributed
addprocs(30)
nprocs()

31

In [2]:
using MendelIHT
using SnpArrays
using DataFrames
using Distributions
using BenchmarkTools
using Random
using LinearAlgebra
using GLM
using CSV
using Dates

# Import data

+ ** For description of what each phenotype column means**, see here:
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/dataset.cgi?study_id=phs000276.v2.p1&phv=129612&phd=&pha=2896&pht=2005&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1

+ **To match up phenotype with genotype**, see `/Volumes/ExternalDrive/stampeed_data/original_dbgap/StudyFiles_decrypted/Release_Notes.phs000276.NFBC66.v2.p1.MULTI.pdf` on page 2. Basically, SUBJID in `phenotype_data` below corresponds to the first column of the `.fam` file (same as `kevin_imputed_filtered_HDL.fam`) at  `/Volumes/ExternalDrive/stampeed_data/original_dbgap/genotypes_decrypted/Mat/NFBC_dbGaP_20091127.fam`.

In [3]:
# import full genotype data
kevin_stampeed = SnpArray("../kevin_imputed.bed")

# import full phenotype data
phenotype_data = CSV.read("full_phenotype_sorted", delim=',', header=true)

Unnamed: 0_level_0,dbGaP_Subject_ID,SUBJID,SEX,a10atc,crp3dec,FASTING_STATUS,FB_GLUK,FS_INS,FS_KOL,FS_KOL_H,FS_KOL_L,FS_TRIGL,HOMA_IR,PAILAHDE,Pills31,PITLAHDE,ZP4202U,ZT20,ZT27,ZT28,ZT29,ZT30,CASE,RACE,BMI,Systolic Blood Pressure MEAN,Diastolic Blood Pressure MEAN
Unnamed: 0_level_1,Int64⍰,Int64⍰,Int64⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,String⍰,Int64⍰,String⍰,String⍰,String⍰,String⍰
1,356529,2864,2,0,0.373,0,4.8,8.8,6.73,2.26,4.2,0.64,1.13,1,1,1,1,2,166.8,55.7,92.4,71.8,2,Caucasian,20,110,79
2,358247,2374,1,0,0.38,0,5.6,7,3.95,1.55,2.1,0.67,0.94,1,X,1,X,X,179.6,67.2,92.5,79,2,Caucasian,20.8,117,72
3,356035,2803,1,0,9.214,0,5.7,12.9,7.44,1.19,5.6,1.52,1.72,1,X,1,X,X,177,101.7,112,117,2,Caucasian,32.5,147,97
4,353849,4709,1,0,0.61,0,4.9,8.7,3.92,1.44,2.3,0.35,1.12,1,X,1,X,X,173.4,89.9,103,102,2,Caucasian,29.9,125,60
5,354502,4350,1,0,0.593,0,4.9,3.6,3.53,1.53,1.8,0.39,0.47,1,X,1,1,2,170.8,66.5,90,75,2,Caucasian,22.8,X,X
6,356015,3480,2,0,1.774,0,4.7,8.7,3.59,1.48,1.8,0.69,1.11,1,0,1,0,2,170.5,71.6,99,78,2,Caucasian,24.6,138,84
7,356268,3729,1,0,2.564,0,4.6,5.7,5.71,2.01,3.4,0.6,0.73,1,X,1,X,X,182,73.5,91,84,2,Caucasian,22.2,132,76
8,358341,4031,1,0,3.998,0,5.9,7.6,4.11,1.31,2.3,1.03,1.03,1,X,1,X,X,182.3,85.4,102,92,2,Caucasian,25.7,132,100
9,357054,1104,1,0,3.238,0,5.4,12.2,5.93,0.92,3.5,3.24,1.6,1,X,1,X,X,178.4,91.1,102,99,2,Caucasian,28.6,140,110
10,354105,395,1,0,0.83,0,4.1,3.4,3.97,1.58,2.1,0.54,0.42,1,X,1,1,X,172.5,68.3,89.5,78.5,2,Caucasian,23,118,87


# Phenotype data was sorted so it matches genotype data

In particular, the following code was executed:

```Julia
genotype_order = CSV.read("../kevin_imputed.fam", delim=' ', header=false)[:, 1]
phenotype_order = phenotype_data[:, 2]

phenotype_data_sorted = similar(phenotype_data, 0) 
for i in 1:size(phenotype_data, 1)
    row_in_phenotype = findall(x -> x == genotype_order[i], phenotype_data[:, 2])
    length(row_in_phenotype) == 1 || error("should have only 1 element")
    push!(phenotype_data_sorted, phenotype_data[row_in_phenotype[1], :])
end
```

# Filtered phenotype and genotype data for HDL phenotype

In [4]:
# exclude samples without HDL measurements
HDL_chol = phenotype_data[:FS_KOL_H]
missing_HDL_data = HDL_chol .== "X"

# exclude people that are fasting
fasting_data = phenotype_data[:FASTING_STATUS]
contains_nonfasting_blood = (fasting_data .== "1") .+ (fasting_data .== "X")

# exlucde people on diabetes medication
diabetes_med = phenotype_data[:a10atc]
contains_diabetes_medication = (diabetes_med .== "1") .+ (diabetes_med .== "X")

# exclude SNPs with maf < 0.01 and SNPs with HWE p-value < 0.00001
rowmask, snps_to_keep = SnpArrays.filter(kevin_stampeed, min_success_rate_per_row=1.0, 
    min_success_rate_per_col=1.0, min_maf=0.01, min_hwe_pval=1e-5)

# combine
samples_to_exclude = missing_HDL_data .+ contains_nonfasting_blood .+ contains_diabetes_medication .+ rowmask
samples_to_keep = samples_to_exclude .== 0

@show count(snps_to_keep)
@show count(samples_to_keep)

count(snps_to_keep) = 324789
count(samples_to_keep) = 4907


4907

## write filtered result to new file

In [5]:
SnpArrays.filter("../kevin_imputed", samples_to_keep, snps_to_keep, des="kevin_imputed_filtered_HDL")

4907×324789 SnpArray:
 0x03  0x03  0x03  0x03  0x03  0x03  …  0x03  0x02  0x03  0x03  0x03  0x03
 0x03  0x02  0x02  0x02  0x02  0x03     0x03  0x03  0x02  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x02  0x03     0x02  0x02  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x02  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x02  0x03  …  0x03  0x02  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x02  0x03     0x02  0x02  0x02  0x02  0x02  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x00  0x00  0x00  0x00  0x03
 0x02  0x00  0x00  0x00  0x03  0x03     0x03  0x02  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x02  0x03  …  0x03  0x00  0x03  0x03  0x03  0x02
 0x03  0x03  0x03  0x03  0x02  0x03     0x03  0x00  0x00  0x03  0x03  0x03
 0x03  0x02  0x02  0x02  0x03  0x03     0x03  0x03  0x00  0x02  0x02  0x03
   

## Compute top 2 principal components on resulting file using plink2

The following command was executed:
```
./plink2 --bfile kevin_imputed_filtered_HDL --pca 2
```

In [6]:
# first check genotype and phenotype files actually match 
genotype_order = CSV.read("kevin_imputed_filtered_HDL.fam", delim=' ', header=false)[:, 1]
phenotype_order = phenotype_data[samples_to_keep, 2]
all(phenotype_order .== genotype_order)

true

# Begin analysis

Here we truncate the HDL level at 60ml/DL, which is the [desired level of HDL](https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/in-depth/hdl-cholesterol/art-20046388). 

In [10]:
cutoff = 60
molecular_weight = 386.654 #g/mol 

# convert from mmol/L to mg/dL
HDL_converted = parse.(Float64, HDL_chol[samples_to_keep]) .* molecular_weight ./ 10

# truncate 
HDL_truncated = HDL_converted .>= cutoff
y = convert(Vector{Float64}, HDL_truncated)
@show count(!iszero, y)

# check truncation
[y HDL_converted]

count(!iszero, y) = 2265


4907×2 Array{Float64,2}:
 1.0  87.3838
 0.0  59.9314
 0.0  46.0118
 0.0  55.6782
 0.0  59.1581
 0.0  57.2248
 1.0  77.7175
 0.0  50.6517
 0.0  35.5722
 1.0  61.0913
 1.0  67.2778
 0.0  54.1316
 1.0  64.1846
 ⋮           
 1.0  64.5712
 1.0  62.6379
 0.0  36.7321
 1.0  85.0639
 1.0  64.5712
 0.0  40.5987
 1.0  85.0639
 0.0  48.3317
 0.0  49.1051
 1.0  66.8911
 0.0  55.2915
 0.0  57.6114

In [8]:
x = SnpArray("kevin_imputed_filtered_HDL.bed")

4907×324789 SnpArray:
 0x03  0x03  0x03  0x03  0x03  0x03  …  0x03  0x02  0x03  0x03  0x03  0x03
 0x03  0x02  0x02  0x02  0x02  0x03     0x03  0x03  0x02  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x02  0x03     0x02  0x02  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x02  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x02  0x03  …  0x03  0x02  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x02  0x03     0x02  0x02  0x02  0x02  0x02  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x00  0x00  0x00  0x00  0x03
 0x02  0x00  0x00  0x00  0x03  0x03     0x03  0x02  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x02  0x03  …  0x03  0x00  0x03  0x03  0x03  0x02
 0x03  0x03  0x03  0x03  0x02  0x03     0x03  0x00  0x00  0x03  0x03  0x03
 0x03  0x02  0x02  0x02  0x03  0x03     0x03  0x03  0x00  0x02  0x02  0x03
   

# Include nongenetic covariates

As per Keys and Sabatti, we include intercept, sexOCPG, and first 2 principal components as nongenetic covariates. Here we standardized all covariates other than the intercept, because we want to ensure equal penalty to all covariates.

In [11]:
n, p = size(x)
z = zeros(n, 6) 

# add intercept
z[:, 1] .= ones(n) 

for i in 1:n
    # add sex: male = 0, female = 1
    z[i, 2] = (phenotype_data[:SEX][i] == 1 ? 0.0 : 1.0)
    
    # add oral contraceptives: 0 = no, 1 = yes, X = unknown
    phenotype_data[:ZP4202U][i] == "0" && (z[i, 3] = 0)
    phenotype_data[:ZP4202U][i] == "1" && (z[i, 3] = 1)
    phenotype_data[:ZP4202U][i] == "X" && (z[i, 3] = 2)
    
    # add pregnancy: 1 = yes, 2 = no, 3 and X = unknown
    phenotype_data[:ZT20][i] == "1" && (z[i, 4] = 0)
    phenotype_data[:ZT20][i] == "2" && (z[i, 4] = 1)
    phenotype_data[:ZT20][i] == "3" && (z[i, 4] = 2)
    phenotype_data[:ZT20][i] == "X" && (z[i, 4] = 2)
end

# add first 2 principal components
pc = CSV.read("kevin_imputed_filtered_HDL.eigenvec", delim="\t", header=true)
z[:, 5] .= pc[:, 3]
z[:, 6] .= pc[:, 4]

# standardize all covariates
for i in 2:size(z, 2)
    col_mean = mean(z[:, i])
    col_std  = std(z[:, i])
    z[:, i] .= (z[:, i] .- col_mean) ./ col_std
end

In [12]:
@show mean(z[:, 1]), var(z[:, 1])
@show mean(z[:, 2]), var(z[:, 2])
@show mean(z[:, 3]), var(z[:, 3])
@show mean(z[:, 4]), var(z[:, 4])
@show mean(z[:, 5]), var(z[:, 5])
@show mean(z[:, 5]), var(z[:, 5])

(mean(z[:, 1]), var(z[:, 1])) = (1.0, 0.0)
(mean(z[:, 2]), var(z[:, 2])) = (7.891701467072643e-17, 0.9999999999999981)
(mean(z[:, 3]), var(z[:, 3])) = (-5.792074471245976e-16, 1.0000000000000013)
(mean(z[:, 4]), var(z[:, 4])) = (7.67449867440092e-17, 1.0000000000000004)
(mean(z[:, 5]), var(z[:, 5])) = (0.0, 0.9999999999999997)
(mean(z[:, 5]), var(z[:, 5])) = (0.0, 0.9999999999999997)


(0.0, 0.9999999999999997)

# Run cross validation

In [13]:
Random.seed!(1111)
d = Bernoulli
l = canonicallink(d())
path = collect(1:20)
num_folds = 5
folds = rand(1:num_folds, size(x, 1))

4907-element Array{Int64,1}:
 5
 4
 3
 2
 1
 3
 5
 4
 1
 3
 1
 1
 1
 ⋮
 1
 1
 3
 5
 4
 5
 3
 2
 2
 3
 1
 5

In [16]:
println("start time = " * string(Dates.format(now(), "HH:MM")))
mses = cv_iht(d(), l, x, z, y, 1, path, num_folds, folds=folds, debias=false, parallel=true)

start time = 09:10


Crossvalidation Results:
	k	MSE
	1	1298.121183549106
	2	1264.368781367858
	3	1247.9347113977437
	4	1247.9379377229552
	5	1246.713595208504
	6	1255.2601234253634
	7	1256.1252644064878
	8	1260.007456179116
	9	1256.1451645249208
	10	1266.90953724116
	11	1266.8787293862163
	12	1269.7143185392188
	13	1275.749343499474
	14	1274.3015860912808
	15	1277.8838795416066
	16	1282.2290708620035
	17	1284.0273381000516
	18	1288.57871049886
	19	1291.5250510586757
	20	1286.1241555104014


20-element Array{Float64,1}:
 1298.121183549106 
 1264.368781367858 
 1247.9347113977437
 1247.9379377229552
 1246.713595208504 
 1255.2601234253634
 1256.1252644064878
 1260.007456179116 
 1256.1451645249208
 1266.90953724116  
 1266.8787293862163
 1269.7143185392188
 1275.749343499474 
 1274.3015860912808
 1277.8838795416066
 1282.2290708620035
 1284.0273381000516
 1288.57871049886  
 1291.5250510586757
 1286.1241555104014

In [17]:
println("end time = " * string(Dates.format(now(), "HH:MM")))

end time = 11:02


# Compute final model

In [19]:
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
@show k = argmin(mses)
d = Normal
l = canonicallink(d())
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=true)

k = argmin(mses) = 5



IHT estimated 4 nonzero SNP predictors and 1 non-genetic predictors.

Compute time (sec):     59.05154800415039
Final loglikelihood:    -3285.077257999052
Iterations:             4

Selected genetic predictors:
4×2 DataFrame
│ Row │ Position │ Estimated_β │
│     │ [90mInt64[39m    │ [90mFloat64[39m     │
├─────┼──────────┼─────────────┤
│ 1   │ 119288   │ -0.0559564  │
│ 2   │ 119293   │ 0.203675    │
│ 3   │ 119294   │ -0.0637012  │
│ 4   │ 275336   │ -0.0616348  │

Selected nongenetic predictors:
1×2 DataFrame
│ Row │ Position │ Estimated_β │
│     │ [90mInt64[39m    │ [90mFloat64[39m     │
├─────┼──────────┼─────────────┤
│ 1   │ 1        │ 0.461585    │

In [20]:
estimated_b = result.beta
position = findall(!iszero, estimated_b)
found_snps = CSV.read("kevin_imputed_filtered_HDL.bim", delim='\t', header=false)[position, :]

Unnamed: 0_level_0,Column1,Column2,Column3,Column4,Column5,Column6
Unnamed: 0_level_1,Int64⍰,String⍰,Int64⍰,Int64⍰,String⍰,String⍰
1,6,rs9261224,0,30121866,A,G
2,6,rs6917603,0,30125050,G,A
3,6,rs9261256,0,30129920,C,G
4,16,rs3764261,0,55550825,A,C


In [21]:
# try no debias
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false)


IHT estimated 4 nonzero SNP predictors and 1 non-genetic predictors.

Compute time (sec):     218.72346878051758
Final loglikelihood:    -3239.190847652173
Iterations:             18

Selected genetic predictors:
4×2 DataFrame
│ Row │ Position │ Estimated_β │
│     │ [90mInt64[39m    │ [90mFloat64[39m     │
├─────┼──────────┼─────────────┤
│ 1   │ 119288   │ -0.117137   │
│ 2   │ 119293   │ 0.202616    │
│ 3   │ 275336   │ -0.0608985  │
│ 4   │ 284637   │ 0.036083    │

Selected nongenetic predictors:
1×2 DataFrame
│ Row │ Position │ Estimated_β │
│     │ [90mInt64[39m    │ [90mFloat64[39m     │
├─────┼──────────┼─────────────┤
│ 1   │ 1        │ 0.461585    │

In [22]:
estimated_b = result.beta
position = findall(!iszero, estimated_b)
found_snps = CSV.read("kevin_imputed_filtered_HDL.bim", delim='\t', header=false)[position, :]

Unnamed: 0_level_0,Column1,Column2,Column3,Column4,Column5,Column6
Unnamed: 0_level_1,Int64⍰,String⍰,Int64⍰,Int64⍰,String⍰,String⍰
1,6,rs9261224,0,30121866,A,G
2,6,rs6917603,0,30125050,G,A
3,16,rs3764261,0,55550825,A,C
4,17,rs9898058,0,45173820,A,G
