# Sparsity projection via clustering

In iterative hard thresholding (IHT), we need to evaluate $P_k(\mathbf{v})$, where $\mathbf{v}$ is a dense vector and $k$ is the number of non-zero entries to keep: $P_k$ keeps the $k$ largest-magnitude entries of $\mathbf{v}$ and sets the rest to zero. Usually $k$ is determined via cross-validation, an expensive procedure. This notebook investigates an attractive alternative: project by clustering.
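
For reference, the standard projection $P_k$ can be sketched in a few lines of Julia. This is a minimal illustration, not MendelIHT's internal implementation; the name `project_topk!` is hypothetical and the input vector is a toy example.

```julia
# Hypothetical helper (not part of MendelIHT): keep the k largest-magnitude
# entries of v and set all other entries to zero.
function project_topk!(v::AbstractVector{T}, k::Integer) where {T<:Real}
    0 <= k <= length(v) || throw(ArgumentError("k must be between 0 and length(v)"))
    k == length(v) && return v  # nothing to zero out
    # indices of the length(v) - k smallest-magnitude entries
    drop = partialsortperm(v, 1:(length(v) - k), by=abs)
    v[drop] .= zero(T)
    return v
end

v = [0.1, -2.0, 0.05, 1.5, -0.3]
project_topk!(v, 2)  # v is now [0.0, -2.0, 0.0, 1.5, 0.0]
```

The caller has to supply $k$ here, which is exactly the quantity the clustering approach tries to avoid tuning.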

The idea is to group the large-magnitude entries of $\mathbf{v}$ into one cluster and the small-magnitude entries into another, then set every entry in the small-magnitude cluster to zero. As a first, naive attempt we use $k$-means with just 2 clusters.
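
One way to write this down, as a sketch under assumed notation ($c_1 \le c_2$ for the two cluster centers, $m$ for the length of $\mathbf{v}$, and $\hat{k}$ for the implied sparsity level, none of which appear in the original notebook), is the one-dimensional 2-means objective on the absolute values:

$$
\min_{c_1 \le c_2} \; \sum_{i=1}^{m} \min\big\{ (|v_i| - c_1)^2,\; (|v_i| - c_2)^2 \big\}
$$

Each entry is assigned to its nearest center, so the boundary between the two clusters is the midpoint $(c_1 + c_2)/2$, and the projection and implied sparsity level become

$$
\hat{v}_i =
\begin{cases}
v_i, & |v_i| > \tfrac{c_1 + c_2}{2}, \\
0, & \text{otherwise},
\end{cases}
\qquad
\hat{k} = \#\big\{\, i : |v_i| > \tfrac{c_1 + c_2}{2} \,\big\}.
$$

A fixed number of $k$-means iterations only approximates the minimizer, so in practice $c_1$ and $c_2$ are whatever centers the iterations end on.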

In [1]:
using Distributed
# addprocs(4)

@everywhere begin
    using Revise
    using MendelIHT
    using SnpArrays
    using Random
    using GLM
    using DelimitedFiles
    using Test
    using Distributions
    using LinearAlgebra
    using CSV
    using DataFrames
end

┌ Info: Precompiling MendelIHT [921c7187-1484-5754-b919-5d3ed9ac03c4]
└ @ Base loading.jl:1278


# Simulate data

We simulate $n = 1000$ samples and $p = 10000$ SNPs, of which $k = 20$ are causal.

In [32]:
n = 1000  # number of samples
p = 10000 # number of SNPs
k = 20    # number of causal SNPs per trait
d = Normal
l = canonicallink(d())

# set random seed for reproducibility
Random.seed!(2021)

# simulate `.bed` file with no missing data
x = simulate_random_snparray(undef, n, p)
xla = SnpLinAlg{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true) 

# intercept is the only nongenetic covariate
z = ones(n)
intercept = 0.0

# simulate response y, true model b, and the correct non-0 positions of b
Y, true_b, correct_position = simulate_random_response(xla, k, d, l, Zu=z*intercept);

# If $k$ unknown, try clustering by absolute value

Here is a simple projection based on k-means clustering.

In [9]:
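"""
    project_by_clustering!(x::AbstractVector, groups::Int=3)

Projects `x` to sparsity by clustering its entries by absolute value. We run
k-means with `groups` clusters for 10 iterations, then set every entry in the
cluster with the smallest mean to zero. Returns the cluster centers and the
member indices of each cluster.
"""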
function project_by_clustering!(x::AbstractVector{T}, groups::Int=3) where T <: Real
    centers = collect(T, 0:1/(groups-1):1) # initialize cluster centers
    members = [Int[] for _ in 1:groups]

    # run 10 iterations of k-means
    for iter in 1:10
        # assign xᵢ to the nearest cluster
        empty!.(members) # refresh cluster members
        for i in eachindex(x)
            # xi, best_dist, best_group = abs(log(abs(x[i]))), typemax(T), 0
            xi, best_dist, best_group = abs(x[i]), typemax(T), 0
            for j in 1:groups
                d = abs2(xi - centers[j])
                if d < best_dist
                    best_dist = d
                    best_group = j
                end
            end
            push!(members[best_group], i)
        end

        # update cluster centers
        centers .= zero(T)
        for j in 1:groups
            for i in members[j]
                centers[j] += abs(x[i])
            end
            length(members[j]) != 0 && (centers[j] /= length(members[j]))
            centers[j] ≥ 0 || error("center $j = $(centers[j]) is negative! Shouldn't happen!")
        end
    end

    # project cluster with smallest mean
    mincenter, mincluster = findmin(centers)
    for i in members[mincluster]
        x[i] = 0
    end

    return centers, members
end

project_by_clustering!

# Run full IHT with clustering

Here the `k` passed to `fit_iht` represents the number of clusters, not the sparsity level.

In [33]:
@time result = fit_iht(Y, xla, z, k=2) # k clusters

****                   MendelIHT Version 1.3.3                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 2
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 100
Converging when tol < 0.0001:

cluster 1 has mean 0.09897039900604343 with 9989 members
cluster 2 has mean 1.0296528284922977 with 12 members
Iteration 1: loglikelihood = -1706.7140423805665, backtracks = 0, tol = 1.3804772210599874
cluster 1 has mean 0.03785743987018223 with 9989 members
cluster 2 has mean 1.0544671540047375 with 12 members
Iteration 2: loglikelihood = -1657.4869549672399, backt


IHT estimated 12 nonzero SNP predictors and 0 non-genetic predictors.

Compute time (sec):     0.04872894287109375
Final loglikelihood:    -1656.666069606865
Iterations:             6

Selected genetic predictors:
12×2 DataFrame
 Row │ Position  Estimated_β 
     │ Int64     Float64     
─────┼───────────────────────
   1 │      921     0.756641
   2 │     1079    -1.53159
   3 │     1216     0.776927
   4 │     1320    -1.48857
   5 │     1733    -0.952702
   6 │     2767    -0.782935
   7 │     2942     1.24028
   8 │     3174    -0.856962
   9 │     4189     0.677355
  10 │     6889    -1.23779
  11 │     7036    -1.21663
  12 │     7960     1.25045

Selected nongenetic predictors:
0×2 DataFrame

In [23]:
# compare estimated vs true beta values
[result.beta[correct_position] true_b[correct_position]]

20×2 Array{Float64,2}:
  0.0       -0.155227
 -0.435418  -0.420876
  0.0       -0.0897132
  0.0        0.0684544
  0.0        0.139375
  0.0       -0.08059
  0.0       -0.157304
 -0.616885  -0.65508
  0.0       -0.0981301
 -0.263381  -0.240965
 -0.634225  -0.642799
  0.0       -0.175661
  0.0        0.100944
  0.468623   0.465826
  0.0        0.139876
  0.0       -0.0358273
  0.0       -0.048751
  0.0       -0.0218399
 -0.416971  -0.423826
  0.0       -0.0330992

**Conclusion:** Dynamically updating $k$ finds 12/20 predictors. If the true beta values are small (e.g. $\beta_j \sim N(0.3, 1)$), the clustering method fails miserably.

### Try cross-validation

In [24]:
mses = cv_iht(Y, x, z, path=1:30)
argmin(mses)



Crossvalidation Results:
	k	MSE
	1	2209.101834495594
	2	1836.5609091116007
	3	1600.7771519145786
	4	1440.902388132112
	5	1240.3639324002856
	6	1171.1213039329634
	7	1139.2040429143092
	8	1120.9205045580977
	9	1101.8738297278624
	10	1079.412309465611
	11	1054.9503961484781
	12	1052.7415377589205
	13	1043.623092354303
	14	1042.7011124003648
	15	1044.757978321074
	16	1045.3574516402612
	17	1041.0838535225398
	18	1039.2727503659023
	19	1039.4087108491858
	20	1042.5824463794356
	21	1047.3632497467302
	22	1052.2645940685206
	23	1054.0345432285792
	24	1053.6535606757136
	25	1054.426078307337
	26	1055.3776890272109
	27	1057.5455664997887
	28	1059.582519470663
	29	1061.5691535732124
	30	1064.2392058490943


18

### Run regular IHT with best cross-validated $k$

In [25]:
@time result = fit_iht(Y, xla, z, k=argmin(mses))
[result.beta[correct_position] true_b[correct_position]]

****                   MendelIHT Version 1.3.3                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 18
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 100
Converging when tol < 0.0001:

Iteration 1: loglikelihood = -7170.411132934618, backtracks = 0, tol = 0.6071937058669298
Iteration 2: loglikelihood = -7111.69972086593, backtracks = 0, tol = 0.05380327953646194
Iteration 3: loglikelihood = -7110.546921169437, backtracks = 0, tol = 0.03423792810475089
Iteration 4: loglikelihood = -7110.4840308483645, backtracks = 0, tol = 0.0016732061780767714
It

20×2 Array{Float64,2}:
 -0.146535   -0.155227
 -0.436508   -0.420876
 -0.0859204  -0.0897132
  0.0678852   0.0684544
  0.141589    0.139375
 -0.0737369  -0.08059
 -0.143445   -0.157304
 -0.613671   -0.65508
 -0.0703329  -0.0981301
 -0.252921   -0.240965
 -0.640562   -0.642799
 -0.184113   -0.175661
  0.088282    0.100944
  0.475128    0.465826
  0.159098    0.139876
  0.0        -0.0358273
 -0.0656724  -0.048751
  0.0        -0.0218399
 -0.409771   -0.423826
  0.0        -0.0330992

**Conclusion:** Dynamically updating $k$ finds 12/20 predictors, while cross-validation finds 17/20.
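
Counts like these can be computed directly from the estimated and true coefficient vectors instead of being read off the printed tables. The sketch below uses small toy vectors for illustration only (they are not the notebook's results); in the notebook, `est_beta` and `true_beta` would presumably correspond to `result.beta` and `true_b`.

```julia
# Toy vectors for illustration only; names are hypothetical stand-ins
# for the estimated and true coefficient vectors.
true_beta = [0.0, 0.5, 0.0, -0.2, 0.0, 0.1]
est_beta  = [0.0, 0.45, 0.3, 0.0, 0.0, 0.12]

# support = indices of the nonzero entries
true_support = findall(!iszero, true_beta)
est_support  = findall(!iszero, est_beta)

true_positives  = length(intersect(est_support, true_support))  # causal and selected
false_positives = length(setdiff(est_support, true_support))    # selected but not causal
false_negatives = length(setdiff(true_support, est_support))    # causal but missed

println("recovered $true_positives of $(length(true_support)) true predictors, ",
        "$false_positives false positives")
```

Reporting false positives alongside the recovered count matters when comparing the two approaches, since a method that selects more predictors can recover more true ones simply by being less sparse.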