### Missing Data Experiment

The raw data for this experiment can be downloaded from https://ifcs.boku.ac.at/repository/data/tetragonula_bee/index.html and is also supplied in `DATA/Bees/Tetragonula.csv`. The code loads NumPy files containing the raw data and labels (no preprocessing).

Only the vHDPMM and DPMM experiments are included, as the GMM experiment was run using scikit-learn in Python.

Note: here we demonstrate a single run, so no mean or standard deviation is reported. Results vary between runs, especially for the vHDPMM with missing data, because a new random partition is drawn each time.
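To report a mean and standard deviation, the experiment could be repeated over several random partitions. The following is a minimal sketch and not part of the original notebook: `run_once` is a hypothetical placeholder that would wrap the partitioning, fitting, and NMI cells below and return a single NMI value. Here it returns a dummy number so that the snippet runs on its own.

```julia
using Random
using Statistics

# Hypothetical stand-in for one full experiment (partition, fit, NMI).
# Replace the body with the actual pipeline; here it only returns a dummy score.
function run_once(rng::AbstractRNG)
    return 0.8 + 0.05 * randn(rng)
end

# Repeat the experiment with one seed per run and summarize the scores.
function summarize_runs(n_runs::Integer; base_seed::Integer = 0)
    scores = [run_once(MersenneTwister(base_seed + r)) for r in 1:n_runs]
    return (mean = mean(scores), std = std(scores), scores = scores)
end

stats = summarize_runs(10)
println("NMI: ", stats.mean, " ± ", stats.std)
```

Seeding each run separately keeps individual runs reproducible while still drawing a different partition every time.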

In [1]:
using NPZ
using DPMMSubClusters
using LinearAlgebra
using Clustering
using Random

#### Raw Data Loading

In [3]:
data = copy(npzread("DATA/Bees/bees_x.npy")')
labels = npzread("DATA/Bees/bees_y.npy");

In [4]:
size(data)

(13, 236)

In [5]:
DPMMPrior = DPMMSubClusters.niw_hyperparams(1,zeros(13),16,Matrix{Float64}(I, 13, 13)*0.1)
DPMMResults = DPMMSubClusters.fit(data,DPMMPrior,100.0,iters = 100, gt = labels, verbose = false);

#### DPMM NMI:

In [6]:
println(mutualinfo(DPMMResults[1], labels; normed = true))

0.7126463125369757


We will partition the data into 4 groups at random and create two versions: one in which all features exist, and one in which each group is missing a random number of features.

In [7]:
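# Shuffle the points so that the groups are random
perm = randperm(size(data,2))
mixed_data = data[:,perm]
mixed_labels = labels[perm]

# Number of points per group (236 / 4 = 59)
interval = Int(length(mixed_labels)/4)

# Index of the first point of each group
group_indices = Int.(collect(1:interval:length(mixed_labels)))
labels_dict = Dict()
full_data_dict = Dict()
missing_data_dict = Dict()
# Number of features (out of features 7 to 13) that each group keeps
features_count = rand(1:7,4)
# Features 1 to 6 are kept by every group
base_features = collect(1:6)
for (i,v) in enumerate(group_indices)
    relevant_data = mixed_data[:,v:v+interval-1]
    labels_dict[i] = mixed_labels[v:v+interval-1]
    full_data_dict[i] = relevant_data
    # Keep the shared features plus a random subset of features 7 to 13
    chosen_features = Int.(vcat(base_features,(randperm(7).+6)[1:features_count[i]]))
    missing_data_dict[i] = relevant_data[chosen_features,:]
end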
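As a sanity check on the partitioning scheme, the sketch below repeats the same steps on synthetic data of the same shape (13 features, 236 points) and verifies the resulting dimensions. The names `toy_data`, `toy_missing`, `kept_local`, and `shared` are illustrative and do not appear in the notebook; this is an assumed stand-in for the real data, not the experiment itself.

```julia
using Random

# Synthetic stand-in with the same shape as the bees data (13 features x 236 points).
rng = MersenneTwister(1)
toy_data = randn(rng, 13, 236)

n_groups = 4
interval = size(toy_data, 2) ÷ n_groups          # 59 points per group
starts = collect(1:interval:size(toy_data, 2))   # [1, 60, 119, 178]
kept_local = rand(rng, 1:7, n_groups)            # features 7..13 kept per group
shared = collect(1:6)                            # features kept by every group

toy_missing = Dict{Int,Matrix{Float64}}()
for (i, s) in enumerate(starts)
    block = toy_data[:, s:s+interval-1]
    local_idx = (randperm(rng, 7) .+ 6)[1:kept_local[i]]   # random subset of 7..13
    toy_missing[i] = block[vcat(shared, local_idx), :]
end

# Each group keeps the 6 shared features plus its own number of local features.
for i in 1:n_groups
    @assert size(toy_missing[i]) == (6 + kept_local[i], interval)
end
```

Each group therefore has between 7 and 13 rows, which is why the local priors need a per-group dimension.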

We will now run our model. We first add worker processes, as the model requires at least one worker process.

In [8]:
using Distributed
addprocs(4)

4-element Array{Int64,1}:
 2
 3
 4
 5

In [9]:
@everywhere using VersatileHDPMixtureModels

In [10]:
local_priors = [niw_hyperparams(1.0, zeros(i),i+3,Matrix{Float64}(I, i, i)*0.1) for i in features_count]
global_hyper_params = niw_hyperparams(1.0, zeros(6), 9, Matrix{Float64}(I, 6, 6)*0.1)
constant_local_prior = niw_hyperparams(1.0, zeros(7), 10, Matrix{Float64}(I, 7, 7)*0.1)

niw_hyperparams(1.0, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 10.0, [0.1 0.0 … 0.0 0.0; 0.0 0.1 … 0.0 0.0; … ; 0.0 0.0 … 0.1 0.0; 0.0 0.0 … 0.0 0.1])

In [13]:
missing_data_results = vhdp_fit(missing_data_dict,6,100.0,100.0,100.0,global_hyper_params,local_priors,100);

Iteration: 1|| Global Counts: [6]|| iter time: 0.06101679801940918
Iteration: 2|| Global Counts: [6]|| iter time: 0.008918046951293945
Iteration: 3|| Global Counts: [8]|| iter time: 0.01250600814819336
Iteration: 4|| Global Counts: [9]|| iter time: 0.011228084564208984
Iteration: 5|| Global Counts: [9, 9]|| iter time: 0.010521173477172852
Iteration: 6|| Global Counts: [10, 7]|| iter time: 0.015869855880737305
Iteration: 7|| Global Counts: [10, 8]|| iter time: 0.011442184448242188
Iteration: 8|| Global Counts: [10, 7]|| iter time: 0.012156009674072266
Iteration: 9|| Global Counts: [10, 7]|| iter time: 0.015585899353027344
Iteration: 10|| Global Counts: [10, 7, 7]|| iter time: 0.011366128921508789
Iteration: 11|| Global Counts: [10, 4, 6]|| iter time: 0.013006925582885742
Iteration: 12|| Global Counts: [10, 4, 6, 10]|| iter time: 0.012372016906738281
Iteration: 13|| Global Counts: [6, 4, 8, 7]|| iter time: 0.01441192626953125
Iteration: 14|| Global Counts: [6, 4, 6, 7]|| iter time: 0.013

#### Missing Data NMI:

In [14]:
NMI = 0.0
for i=1:4
    group_labels = create_global_labels(missing_data_results[1].groups_dict[i])
    NMI += mutualinfo(group_labels,labels_dict[i])
end
println(NMI/4)

0.8204475661907087
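The same per-group averaging is used for both the missing-data and the full-data results, so it could be factored into a small helper. The sketch below is one possible way to do this; `mean_group_score` and `agreement` are hypothetical names introduced for illustration, and the toy score merely stands in for `mutualinfo` so that the snippet runs without any packages.

```julia
using Statistics

# Average a per-group score over all groups.
# `score` is any function (predicted, truth) -> Real, e.g. `mutualinfo`.
function mean_group_score(score, predicted::AbstractDict, truth::AbstractDict)
    return mean(score(predicted[k], truth[k]) for k in keys(truth))
end

# Toy check with a simple agreement score (fraction of matching labels).
agreement(a, b) = count(a .== b) / length(a)
pred = Dict(1 => [1, 1, 2, 2], 2 => [1, 2, 2, 2])
gt   = Dict(1 => [1, 1, 2, 2], 2 => [1, 1, 2, 2])
println(mean_group_score(agreement, pred, gt))   # (1.0 + 0.75) / 2 = 0.875
```

In this notebook the helper would presumably be called with `mutualinfo` as the score, a dictionary of `create_global_labels(...)` outputs as the predictions, and `labels_dict` as the ground truth. The unweighted mean matches the loop above because all four groups have the same size.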


In [15]:
full_data_results = vhdp_fit(full_data_dict,6,100.0,100.0,100.0,global_hyper_params,constant_local_prior,100);

Iteration: 1|| Global Counts: [8]|| iter time: 0.7217671871185303
Iteration: 2|| Global Counts: [9]|| iter time: 0.06853103637695312
Iteration: 3|| Global Counts: [9]|| iter time: 0.009547948837280273
Iteration: 4|| Global Counts: [11]|| iter time: 0.010154962539672852
Iteration: 5|| Global Counts: [14, 14]|| iter time: 0.010442972183227539
Iteration: 6|| Global Counts: [15, 5]|| iter time: 0.06933116912841797
Iteration: 7|| Global Counts: [16, 4]|| iter time: 0.012984991073608398
Iteration: 8|| Global Counts: [16, 4]|| iter time: 0.013051033020019531
Iteration: 9|| Global Counts: [16, 4]|| iter time: 0.013823986053466797
Iteration: 10|| Global Counts: [16, 4, 16, 4]|| iter time: 0.019757986068725586
Iteration: 11|| Global Counts: [19, 4, 12, 4]|| iter time: 0.02441096305847168
Iteration: 12|| Global Counts: [17, 4, 6, 4]|| iter time: 0.01689004898071289
Iteration: 13|| Global Counts: [14, 4, 7, 4]|| iter time: 0.025344133377075195
Iteration: 14|| Global Counts: [15, 4, 6, 4]|| iter ti

#### Full Data NMI:

In [16]:
NMI = 0.0
for i=1:4
    group_labels = create_global_labels(full_data_results[1].groups_dict[i])
    NMI += mutualinfo(group_labels,labels_dict[i])
end
println(NMI/4)

0.8395588562022707
