Parameter learning takes forever #104

Open
kavir1698 opened this issue Jul 9, 2020 · 9 comments

@kavir1698

I am trying to fit the parameters of a not-so-big discrete BN. I have tested it in Netica and the learning happens instantly. But with BayesNets.jl it took 30 hours before running out of memory. Is this a bug or a limitation?

Here is the network structure and 200 samples to replicate the issue.

using BayesNets
using CSV
using DataFrames
df = CSV.read("sample.csv", DataFrame);
dbn = fit(DiscreteBayesNet, df,
  (
    :A => :S,
    :B => :S,
    :C => :S,
    :D => :S,
    :E => :S,
    
    :F => :SN,
    :G => :SN,
    :H => :SN,

    :SN => :DCN,
    :SN => :CAN,
    :SN => :VCN,

    :S => :EA,
    :A => :EA,
    :B => :EA,
    :C => :EA,
    :D => :EA,
    :E => :EA,
    
    :EA => :ARS,
    :S => :ARS,
    :SN => :ARS,
    :A => :ARS,
    :B => :ARS,
    :C => :ARS,
    :D => :ARS,
    :E => :ARS,
    :F => :ARS,
    :G => :ARS,
    :H => :ARS,
    
    :EA => :AR,
    :S => :AR,
    :SN => :AR,
    :A => :AR,
    :B => :AR,
    :C => :AR,
    :D => :AR,
    :E => :AR,
    :F => :AR,
    :G => :AR,
    :H => :AR,
    
    :AR => :CA,
    :S => :CA,
    :SN => :CA,
    :A => :CA,
    :B => :CA,
    :C => :CA,
    :D => :CA,
    :E => :CA,
    :F => :CA,
    :G => :CA,
    :H => :CA,
    :CAN => :CA,
    
    :AR => :VC,
    :S => :VC,
    :SN => :VC,
    :A => :VC,
    :B => :VC,
    :C => :VC,
    :D => :VC,
    :E => :VC,
    :F => :VC,
    :G => :VC,
    :H => :VC,
    :VCN => :VC,
    
    :AR => :ER,
    :S => :ER,
    :SN => :ER,
    :A => :ER,
    :B => :ER,
    :C => :ER,
    :D => :ER,
    :E => :ER,
    :F => :ER,
    :G => :ER,
    :H => :ER,
    
    :AR => :IN,
    :S => :IN,
    :SN => :IN,
    :A => :IN,
    :B => :IN,
    :C => :IN,
    :D => :IN,
    :E => :IN,
    :F => :IN,
    :G => :IN,
    :H => :IN,
    
    :CA => :DC,
    :S => :DC,
    :SN => :DC,
    :A => :DC,
    :B => :DC,
    :C => :DC,
    :D => :DC,
    :E => :DC,
    :F => :DC,
    :G => :DC,
    :H => :DC,
    :DCN => :DC,
  )
)
@zsunberg
Member

zsunberg commented Jul 9, 2020

I'm not intimately familiar with the inner workings of this package, but I am guessing from this that there are type instabilities and dynamic memory allocations slowing it down. The best way to find out would be to use https://github.com/timholy/ProfileView.jl on a small example.

Fortunately, the language has improved vastly since most of the work on this package was done before 2018, and I think it would be possible to realize massive speedups now! Happy to make suggestions if someone wants to work on improving it.

@mykelk
Member

mykelk commented Jul 9, 2020

It looks like you have several variables with many parents. I think the way it is set up now is that it loops over all possible parental instantiations, which is going to be enormous in your case (the complexity grows exponentially with the number of parents). The loop of interest is here. It would be much faster to just loop over your data and fill in the non-zero entries in the tables.
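The "loop over your data" idea can be sketched in plain Julia (function and variable names here are mine, not BayesNets.jl API): instead of allocating one table entry per possible parental instantiation, iterate over the rows once and tally only the parent/target combinations that actually occur, so the structure holds at most one entry per distinct observed instantiation.

```julia
# Sparse counting sketch: one pass over the data, counting only the
# parental instantiations that appear. Rows are NamedTuples for simplicity.
function sparse_counts(data::Vector{<:NamedTuple}, target::Symbol, parents::Vector{Symbol})
    counts = Dict{Tuple,Dict{Int,Int}}()        # parent assignment => target-value tallies
    for row in data
        key = Tuple(row[p] for p in parents)    # the observed parental instantiation
        tally = get!(counts, key, Dict{Int,Int}())
        tally[row[target]] = get(tally, row[target], 0) + 1
    end
    return counts   # at most length(data) entries, not prod(parental_ncategories)
end

data = [(A = 1, B = 2, S = 1), (A = 1, B = 2, S = 2), (A = 2, B = 1, S = 1)]
counts = sparse_counts(data, :S, [:A, :B])
# counts[(1, 2)] holds the tallies for S given A = 1, B = 2
```

Normalizing each tally then yields the conditional distribution for that instantiation; instantiations never seen in the data simply have no entry.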

@kavir1698
Author

> I am guessing from this that there are type instabilities and dynamic memory allocations slowing it down. The best way to find out would be to use https://github.com/timholy/ProfileView.jl on a small example.

I tried it but even with 20 samples it does not finish.

> Happy to make suggestions if someone wants to work on improving it.

I would happily work on it, but I am very new to these concepts.

> I think the way it is set up now is that it loops over all possible parental instantiations, which is going to be enormous in your case

That should be the issue. Creating a table for a single variable with many parents (cpdARS = fit(CategoricalCPD{DiscreteNonParametric}, df, :ARS, [:S, :A, :B, :C, :D, :E, :F, :G, :H, :SN, :S, :EA])) gets stuck too.

> It would be much faster to just loop over your data and fill in the non-zero entries in the tables.

I am not sure how to loop over the data and fill in the entries in tables.

@mykelk
Member

mykelk commented Jul 10, 2020

I think the issue is with this definition:

struct CategoricalCPD{D} <: CPD{D}
    target::NodeName
    parents::NodeNames
    # list of instantiation counts for each parent, in same order as parents
    parental_ncategories::Vector{Int}
    # a vector of distributions in DMU order
    distributions::Vector{D}
end

The distributions field is a Vector with each element containing a distribution associated with the corresponding parental instantiation. The number of parental instantiations is equal to the product of the numbers of possible values of all the parents. In your case, you have a huge number of parents. Generally, you want to avoid having a large number of parents in a Bayesian network. However, if this is really what you want, then an option is to modify the distributions field as follows:

distributions::Dict{Int,D}

Then, we would store distributions only for parental instantiations that actually exist in the data. If we try to access a parental instantiation that does not exist, we can either throw an error or return a uniform distribution. We would also need to update fit so that it iterates over the rows in the DataFrame and incrementally updates the distributions.

It may be better to provide this implementation as a new type, such as SparseCategoricalCPD and write another fit function that dispatches on that type.
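A rough sketch of that idea (the type and helper names are hypothetical, not BayesNets.jl API): the Dict is keyed by the linear index of the parental instantiation, computed in the same column-major order the dense Vector would use, so lookups stay consistent with the existing layout.

```julia
# Hypothetical sparse CPD: distributions keyed by linear instantiation index.
struct SparseCategoricalCPD{D}
    target::Symbol
    parents::Vector{Symbol}
    parental_ncategories::Vector{Int}
    distributions::Dict{Int,D}   # only instantiations seen in the data
end

# Map a 1-based assignment of parent values to the linear index used as the
# Dict key, in the same column-major order as the dense Vector layout.
function instantiation_index(assignment::Vector{Int}, ncategories::Vector{Int})
    idx, stride = 1, 1
    for (a, n) in zip(assignment, ncategories)
        idx += (a - 1) * stride
        stride *= n
    end
    return idx
end

instantiation_index([1, 1], [2, 3])  # first instantiation => index 1
instantiation_index([2, 3], [2, 3])  # last instantiation  => index 6
```

With this, a sparse fit only inserts entries for the indices it encounters, and a lookup miss can raise an error or fall back to a uniform distribution as described above.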

@kavir1698
Author

Thanks for the explanation. I'll see if I can implement it.

@zsunberg
Member

Ah, yeah, it was kind of silly of me to think this is a type stability/allocation issue, haha. Such a large slowdown is obviously algorithmic.

If fit is the only algorithm that is affected by this, there may be an alternative to creating a completely new type: at the beginning of fit, check the size of the data and the maximum number of parent instantiations for the node and then loop over the one that is smaller. This might be a little more user-friendly, but parameterizing the type usefully might be tricky.
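That size check could look something like this (the function name is hypothetical): compare the number of data rows against the number of possible parental instantiations and fill in whichever is smaller.

```julia
# Sketch of the dense-vs-sparse decision: loop over the data when there are
# fewer rows than possible parental instantiations, otherwise loop over the
# instantiations. BigInt avoids Int overflow when there are many parents.
function choose_representation(nrows::Int, parental_ncategories::Vector{Int})
    ninstantiations = prod(BigInt.(parental_ncategories))
    return ninstantiations <= nrows ? :dense : :sparse
end

choose_representation(200, [3, 3])       # 9 instantiations, 200 rows => :dense
choose_representation(200, fill(3, 12))  # 3^12 = 531441 instantiations => :sparse
```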

@mykelk
Member

mykelk commented Jul 10, 2020

@zsunberg It might be a memory issue. Unless we switch to a sparse representation, it will allocate memory enough to hold all the parental instantiations:

distributions = Array{Categorical{Float64,Vector{Float64}}}(undef, prod(parental_ncategories))

This might not be good if prod(parental_ncategories) is super big.
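For a sense of scale with the network in this issue: :ARS has 11 distinct parents, so even modest category counts make the dense table enormous (the 5-states-per-parent figure below is an assumption for illustration).

```julia
# 11 binary parents already need 2^11 = 2048 distributions; with a
# hypothetical 5 states per parent the dense table needs 5^11 entries.
parental_ncategories = fill(5, 11)
prod(big.(parental_ncategories))   # 48828125 distributions to allocate up front
```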

@zsunberg
Member

Right - you would have to do something like

struct CategoricalCPD{D,T} <: CPD{D}
    target::NodeName
    parents::NodeNames
    # list of instantiation counts for each parent, in same order as parents
    parental_ncategories::Vector{Int}
    # a dense Vector or sparse Dict of distributions
    distributions::T
end

and then fit determines whether T should be a dense or sparse structure. Sorry my explanation wasn't too clear. That's why I said it might be tricky to parameterize.
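A minimal sketch of how both representations share one parameterized type (the struct name and placeholder Float64 "distributions" are mine, for illustration only):

```julia
# The same struct holds either a dense Vector or a sparse Dict; a fit
# routine would pick T based on the data-size check discussed above.
struct MyCategoricalCPD{D,T}
    target::Symbol
    parents::Vector{Symbol}
    parental_ncategories::Vector{Int}
    distributions::T   # Vector{D} (dense) or Dict{Int,D} (sparse)
end

dense_cpd  = MyCategoricalCPD{Float64,Vector{Float64}}(:S, [:A], [2], [0.3, 0.7])
sparse_cpd = MyCategoricalCPD{Float64,Dict{Int,Float64}}(:S, [:A], [2], Dict(1 => 0.3))
```

Downstream code that only indexes distributions by instantiation works with either T; the tricky part Zach mentions is keeping the type parameter meaningful for dispatch.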

@mykelk
Member

mykelk commented Jul 10, 2020

Ah, clever!
