### Missing Data Experiment

The raw data for this experiment can be downloaded from https://ifcs.boku.ac.at/repository/data/tetragonula_bee/index.html and is also supplied in `DATA/Bees/Tetragonula.csv`. The code loads NumPy files containing the raw data and labels (no preprocessing).

Only the vHDPMM and DPMM experiments are included, as the GMM experiment was run with sklearn in Python.

Note: here we demonstrate a single run, so no mean or standard deviation is reported. This matters especially for the vHDPMM with missing data, as we took random partitions each time.
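To report a mean and standard deviation, the whole experiment would have to be repeated over several random partitions. The following is a minimal sketch of that aggregation only: `run_once` is a hypothetical stand-in that returns a random number, and would need to be replaced by code that repartitions the data, fits the model, and returns the averaged NMI.

```julia
using Statistics
using Random

# Hypothetical stand-in for one full experiment run (partition, fit, score).
# It only draws a random number so that the sketch is runnable on its own.
run_once(rng) = 0.8 + 0.05 * randn(rng)

# Repeat `run` n_runs times with a seeded RNG and summarize the scores.
function summarize_runs(run, n_runs; seed = 0)
    rng = MersenneTwister(seed)
    scores = [run(rng) for _ in 1:n_runs]
    return (mean = mean(scores), std = std(scores), scores = scores)
end

stats = summarize_runs(run_once, 10)
println("NMI: ", stats.mean, " ± ", stats.std)
```

Passing the RNG into each run keeps the random partitions reproducible for a given seed.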

In [1]:
using NPZ
using DPMMSubClusters
using LinearAlgebra
using Clustering
using Random

#### Raw Data Loading

In [2]:
data = copy(npzread("DATA/Bees/bees_x.npy")')
labels = npzread("DATA/Bees/bees_y.npy");

In [3]:
size(data)

(13, 236)

In [4]:
DPMMPrior = DPMMSubClusters.niw_hyperparams(1,zeros(13),16,Matrix{Float64}(I, 13, 13)*0.1)
DPMMResults = DPMMSubClusters.fit(data,DPMMPrior,100.0,iters = 100, gt = labels, verbose = false);

│   caller = Distributions.Dirichlet{Float64}(::Array{Float64,1}) at dirichlet.jl:36
└ @ Distributions /home/dinari/.julia/packages/Distributions/0Wogo/src/multivariate/dirichlet.jl:36
│   caller = Distributions.Dirichlet{Float64}(::Array{Float64,1}) at dirichlet.jl:38
└ @ Distributions /home/dinari/.julia/packages/Distributions/0Wogo/src/multivariate/dirichlet.jl:38


#### DPMM NMI:

In [5]:
println(mutualinfo(DPMMResults[1], labels; normed =true))

0.6809218109002302
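For reference, normalized mutual information can be computed directly from the joint label counts. The sketch below normalizes by the arithmetic mean of the two label entropies; this is one common convention and an assumption here, so its values are not guaranteed to match those of `mutualinfo` from Clustering.jl. The name `nmi_sketch` is illustrative.

```julia
# Normalized mutual information between two label vectors, computed from
# joint and marginal label counts. Normalization: 2 * MI / (H(a) + H(b)).
function nmi_sketch(a::AbstractVector, b::AbstractVector)
    length(a) == length(b) || throw(ArgumentError("label vectors must have equal length"))
    n = length(a)
    joint = Dict{Tuple{eltype(a),eltype(b)},Int}()
    ca = Dict{eltype(a),Int}()
    cb = Dict{eltype(b),Int}()
    for (x, y) in zip(a, b)
        joint[(x, y)] = get(joint, (x, y), 0) + 1
        ca[x] = get(ca, x, 0) + 1
        cb[y] = get(cb, y, 0) + 1
    end
    # Shannon entropy of a label distribution given its counts.
    label_entropy(counts) = -sum(c / n * log(c / n) for c in values(counts))
    ha, hb = label_entropy(ca), label_entropy(cb)
    # Mutual information summed over label pairs that actually occur.
    mi = sum(c / n * log(c * n / (ca[x] * cb[y])) for ((x, y), c) in joint)
    # Both labelings constant: treat them as identical.
    return (ha + hb) == 0 ? 1.0 : 2 * mi / (ha + hb)
end

println(nmi_sketch([1, 1, 2, 2], [2, 2, 1, 1]))
```

The score is invariant to renaming the clusters, which is why it suits comparing inferred cluster indices against ground-truth labels.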


We will partition the data into 4 groups (at random), and create one version in which all features exist and one version in which a random number of features is missing.

In [6]:
# Shuffle the samples so that the groups are random.
perm = randperm(size(data,2))
mixed_data = data[:,perm]
mixed_labels = labels[perm]

# 4 equal-sized groups (236 samples, 59 per group).
interval = Int(length(mixed_labels)/4)

group_indices = Int.(collect(1:interval:length(mixed_labels)))
labels_dict = Dict()
full_data_dict = Dict()
missing_data_dict = Dict()
# Each group keeps the 6 base features plus a random number (1 to 7) of the remaining 7.
features_count = rand(1:7,4)
base_features = collect(1:6)
for (i,v) in enumerate(group_indices)
    relevant_data = mixed_data[:,v:v+interval-1]
    labels_dict[i] = mixed_labels[v:v+interval-1]
    full_data_dict[i] = relevant_data
    chosen_features = Int.(vcat(base_features,(randperm(7).+6)[1:features_count[i]]))
    missing_data_dict[i] = relevant_data[chosen_features,:]
end
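The same partitioning scheme can be written as a reusable function. This is a sketch under assumptions: the function name, its arguments, and the toy matrix are illustrative and not part of the original code, and it requires the sample count to divide evenly by the number of groups.

```julia
using Random

# Split the columns of `data` into `n_groups` equal random groups. Each group
# keeps the first `n_shared` rows (shared features) plus a random non-empty
# subset of the remaining rows, mimicking missing features.
function partition_with_missing(data::AbstractMatrix, n_groups::Int, n_shared::Int;
                                rng = MersenneTwister(0))
    d, n = size(data)
    n % n_groups == 0 || throw(ArgumentError("sample count must divide evenly"))
    n_shared < d || throw(ArgumentError("need at least one non-shared feature"))
    perm = randperm(rng, n)
    group_size = n ÷ n_groups
    optional = collect((n_shared + 1):d)
    groups = Dict{Int,Matrix{eltype(data)}}()
    kept_rows = Dict{Int,Vector{Int}}()
    for g in 1:n_groups
        cols = perm[((g - 1) * group_size + 1):(g * group_size)]
        n_keep = rand(rng, 1:length(optional))
        rows = vcat(collect(1:n_shared), shuffle(rng, optional)[1:n_keep])
        groups[g] = data[rows, cols]
        kept_rows[g] = rows
    end
    return groups, kept_rows
end

toy = rand(MersenneTwister(1), 13, 8)   # 13 features, 8 samples
groups, kept_rows = partition_with_missing(toy, 4, 6)
for g in 1:4
    println("group ", g, ": size ", size(groups[g]), ", rows ", kept_rows[g])
end
```

Returning the kept row indices alongside the data makes it possible to check afterwards which features each group is missing.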

We will now run our model, first adding worker processes (it requires at least 1 worker process).

In [7]:
using Distributed
addprocs(4)

4-element Array{Int64,1}:
 2
 3
 4
 5

In [8]:
cur_dir = pwd()
@everywhere cd("../")
include("hdp_shared_features.jl")

results_stats (generic function with 1 method)

In [9]:
local_priors = [niw_hyperparams(1.0, zeros(i),i+3,Matrix{Float64}(I, i, i)*0.1) for i in features_count]
global_hyper_params = niw_hyperparams(1.0, zeros(6), 9, Matrix{Float64}(I, 6, 6)*0.1)
constant_local_prior = niw_hyperparams(1.0, zeros(7), 10, Matrix{Float64}(I, 7, 7)*0.1)

niw_hyperparams(1.0, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 10.0, [0.1 0.0 … 0.0 0.0; 0.0 0.1 … 0.0 0.0; … ; 0.0 0.0 … 0.1 0.0; 0.0 0.0 … 0.0 0.1])
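All priors in this notebook follow one pattern: the first argument is 1, the mean is zero, the third argument is d + 3, and the matrix is 0.1 times the identity, where d is the dimension of the feature block. The sketch below reproduces that pattern with a NamedTuple as a hypothetical stand-in for `niw_hyperparams`, reading the four arguments as (κ, m, ν, Ψ), which is an assumption about the argument order.

```julia
using LinearAlgebra

# Hypothetical stand-in for niw_hyperparams: a NamedTuple with four fields,
# so the sketch runs without the package.
function niw_sketch(d::Int; kappa = 1.0, scale = 0.1)
    d >= 1 || throw(ArgumentError("dimension must be positive"))
    nu = d + 3                           # pattern used in the notebook; keeps nu > d - 1
    psi = Matrix{Float64}(I, d, d) * scale
    return (kappa = kappa, m = zeros(d), nu = Float64(nu), psi = psi)
end

local_dims = [1, 6, 5, 5]                # example feature counts, one per group
local_sketch = [niw_sketch(d) for d in local_dims]
println([p.nu for p in local_sketch])
```

Tying ν to the dimension matters because each group has a different number of local features, so each local prior needs its own mean vector and scale matrix of matching size.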

In [10]:
missing_data_results = vhdp_fit(missing_data_dict,6,100.0,100.0,100.0,global_hyper_params,local_priors,100)

Iteration: 1|| Global Counts: [6]|| iter time: 20.441547870635986
Iteration: 2|| Global Counts: [8]|| iter time: 0.4789559841156006
Iteration: 3|| Global Counts: [8]|| iter time: 0.06530284881591797
Iteration: 4|| Global Counts: [9]|| iter time: 0.02383279800415039
Iteration: 5|| Global Counts: [10, 10]|| iter time: 0.5425400733947754
Iteration: 6|| Global Counts: [10, 4]|| iter time: 0.31418704986572266
Iteration: 7|| Global Counts: [10, 4]|| iter time: 0.07028794288635254
Iteration: 8|| Global Counts: [10, 4]|| iter time: 0.03614401817321777
Iteration: 9|| Global Counts: [10, 4]|| iter time: 0.014374971389770508
Iteration: 10|| Global Counts: [10, 4, 10, 4]|| iter time: 0.0202939510345459
Iteration: 11|| Global Counts: [9, 4, 7, 1]|| iter time: 0.1972489356994629
Iteration: 12|| Global Counts: [9, 4, 7, 1]|| iter time: 0.019371986389160156
Iteration: 13|| Global Counts: [9, 4, 7, 1]|| iter time: 0.020440101623535156
Iteration: 14|| Global Counts: [9, 4, 7, 1]|| iter time: 0.020908117

(hdp_shared_features(model_hyper_params(niw_hyperparams(1.0, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 9.0, [0.1 0.0 … 0.0 0.0; 0.0 0.1 … 0.0 0.0; … ; 0.0 0.0 … 0.1 0.0; 0.0 0.0 … 0.0 0.1]), niw_hyperparams[niw_hyperparams(1.0, [0.0], 4.0, [0.1]), niw_hyperparams(1.0, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 9.0, [0.1 0.0 … 0.0 0.0; 0.0 0.1 … 0.0 0.0; … ; 0.0 0.0 … 0.1 0.0; 0.0 0.0 … 0.0 0.1]), niw_hyperparams(1.0, [0.0, 0.0, 0.0, 0.0, 0.0], 8.0, [0.1 0.0 … 0.0 0.0; 0.0 0.1 … 0.0 0.0; … ; 0.0 0.0 … 0.1 0.0; 0.0 0.0 … 0.0 0.1]), niw_hyperparams(1.0, [0.0, 0.0, 0.0, 0.0, 0.0], 8.0, [0.1 0.0 … 0.0 0.0; 0.0 0.1 … 0.0 0.0; … ; 0.0 0.0 … 0.1 0.0; 0.0 0.0 … 0.0 0.1])], 100.0, 100.0, 100.0, 1.0, 1.0, 7, 7), Dict{Any,Any}(4 => local_group(model_hyper_params(niw_hyperparams(1.0, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 9.0, [0.1 0.0 … 0.0 0.0; 0.0 0.1 … 0.0 0.0; … ; 0.0 0.0 … 0.1 0.0; 0.0 0.0 … 0.0 0.1]), niw_hyperparams(1.0, [0.0, 0.0, 0.0, 0.0, 0.0], 8.0, [0.1 0.0 … 0.0 0.0; 0.0 0.1 … 0.0 0.0; … ; 0.0 0.0 … 0.1 0.0; 0.0 

#### Missing Data NMI:

In [11]:
NMI = 0.0
for i=1:4
    group_labels = create_global_labels(missing_data_results[1].groups_dict[i])
    NMI += mutualinfo(group_labels,labels_dict[i])
end
println(NMI/4)

0.8076389920820859


In [14]:
full_data_results = vhdp_fit(full_data_dict,6,100.0,100.0,100.0,global_hyper_params,constant_local_prior,100)

Iteration: 1|| Global Counts: [6]|| iter time: 0.015303850173950195
Iteration: 2|| Global Counts: [8]|| iter time: 0.019905805587768555
Iteration: 3|| Global Counts: [8]|| iter time: 0.012077093124389648
Iteration: 4|| Global Counts: [10]|| iter time: 0.013199090957641602
Iteration: 5|| Global Counts: [13, 13]|| iter time: 0.016148805618286133
Iteration: 6|| Global Counts: [9, 12]|| iter time: 0.023788928985595703
Iteration: 7|| Global Counts: [9, 11]|| iter time: 0.022953033447265625
Iteration: 8|| Global Counts: [9, 11]|| iter time: 0.0158998966217041
Iteration: 9|| Global Counts: [8, 11]|| iter time: 0.018301010131835938
Iteration: 10|| Global Counts: [8, 10, 8, 10]|| iter time: 0.01865410804748535
Iteration: 11|| Global Counts: [9, 8, 4, 7]|| iter time: 0.024176836013793945
Iteration: 12|| Global Counts: [9, 8, 4, 7]|| iter time: 0.03275609016418457
Iteration: 13|| Global Counts: [9, 9, 4, 7]|| iter time: 0.021872997283935547
Iteration: 14|| Global Counts: [9, 8, 4, 7]|| iter time:

Iteration: 92|| Global Counts: [9, 4, 4, 1, 5, 2, 4, 4, 3, 1]|| iter time: 0.03502392768859863
Iteration: 93|| Global Counts: [9, 4, 4, 1, 5, 2, 4, 4, 3, 1]|| iter time: 0.030124902725219727
Iteration: 94|| Global Counts: [9, 4, 4, 1, 5, 2, 4, 4, 4, 1]|| iter time: 0.029729127883911133
Iteration: 95|| Global Counts: [9, 4, 4, 1, 6, 2, 4, 4, 3, 1]|| iter time: 0.029502153396606445
Iteration: 96|| Global Counts: [9, 4, 4, 1, 6, 2, 4, 4, 3, 1]|| iter time: 0.03689885139465332
Iteration: 97|| Global Counts: [9, 4, 4, 1, 6, 2, 4, 4, 3, 1]|| iter time: 0.028085947036743164
Iteration: 98|| Global Counts: [9, 4, 4, 1, 5, 2, 4, 4, 3, 1]|| iter time: 0.030404090881347656
Iteration: 99|| Global Counts: [9, 4, 4, 1, 5, 2, 4, 4, 3, 1]|| iter time: 0.03512001037597656
Iteration: 100|| Global Counts: [9, 4, 4, 1, 5, 2, 4, 4, 3, 1]|| iter time: 0.02962493896484375


(hdp_shared_features(model_hyper_params(niw_hyperparams(1.0, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 9.0, [0.1 0.0 … 0.0 0.0; 0.0 0.1 … 0.0 0.0; … ; 0.0 0.0 … 0.1 0.0; 0.0 0.0 … 0.0 0.1]), niw_hyperparams(1.0, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 10.0, [0.1 0.0 … 0.0 0.0; 0.0 0.1 … 0.0 0.0; … ; 0.0 0.0 … 0.1 0.0; 0.0 0.0 … 0.0 0.1]), 100.0, 100.0, 100.0, 1.0, 1.0, 13, 7), Dict{Any,Any}(4 => local_group(model_hyper_params(niw_hyperparams(1.0, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 9.0, [0.1 0.0 … 0.0 0.0; 0.0 0.1 … 0.0 0.0; … ; 0.0 0.0 … 0.1 0.0; 0.0 0.0 … 0.0 0.1]), niw_hyperparams(1.0, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 10.0, [0.1 0.0 … 0.0 0.0; 0.0 0.1 … 0.0 0.0; … ; 0.0 0.0 … 0.1 0.0; 0.0 0.0 … 0.0 0.1]), 100.0, 100.0, 100.0, 1.0, 1.0, 13, 7), [20.020599365234375 16.416400909423828 … 19.11910057067871 20.320600509643555; 12.112099647521973 0.0 … 14.714699745178223 12.112099647521973; … ; 12.01259994506836 11.01099967956543 … 10.810999870300293 12.013799667358398; 15.715700149536133 18.4183998107

#### Full Data NMI:

In [15]:
NMI = 0.0
for i=1:4
    group_labels = create_global_labels(full_data_results[1].groups_dict[i])
    NMI += mutualinfo(group_labels,labels_dict[i])
end
println(NMI/4)

0.8545211263751176
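The per-group averaging used for both the missing-data and the full-data results can be factored into a small helper. This is a sketch: `mean_group_score` and the toy `agreement` score are illustrative names, and any function of (predicted labels, true labels), such as `mutualinfo`, could be passed as `score`.

```julia
using Statistics

# Average a clustering score over groups. `predicted` and `truth` map a group
# key to that group's label vector.
function mean_group_score(score, predicted::AbstractDict, truth::AbstractDict)
    group_keys = sort(collect(keys(truth)))
    return mean(score(predicted[k], truth[k]) for k in group_keys)
end

# Toy score for demonstration only: fraction of exactly matching labels.
agreement(p, t) = mean(p .== t)

predicted = Dict(1 => [1, 1, 2], 2 => [1, 2, 2])
truth = Dict(1 => [1, 1, 2], 2 => [1, 1, 2])
println(mean_group_score(agreement, predicted, truth))
```

Because the groups here are of equal size, an unweighted mean over groups equals the sample-weighted mean; with unequal groups the two would differ.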
