Parameter learning takes forever #104

Open
kavir1698 opened this issue Jul 9, 2020 · 9 comments

@kavir1698

I am trying to fit the parameters of a not-so-big discrete BN. I have tested it in Netica and the learning happens instantly. But with BayesNets.jl it took 30 hours before running out of memory. Is this a bug or a limitation?

Here is the network structure and 200 samples to replicate the issue.

using BayesNets
using CSV
using DataFrames
df = CSV.read("sample.csv", DataFrame);
dbn = fit(DiscreteBayesNet, df,
  (
    :A => :S,
    :B => :S,
    :C => :S,
    :D => :S,
    :E => :S,
    
    :F => :SN,
    :G => :SN,
    :H => :SN,

    :SN => :DCN,
    :SN => :CAN,
    :SN => :VCN,

    :S => :EA,
    :A => :EA,
    :B => :EA,
    :C => :EA,
    :D => :EA,
    :E => :EA,
    
    :EA => :ARS,
    :S => :ARS,
    :SN => :ARS,
    :A => :ARS,
    :B => :ARS,
    :C => :ARS,
    :D => :ARS,
    :E => :ARS,
    :F => :ARS,
    :G => :ARS,
    :H => :ARS,
    
    :EA => :AR,
    :S => :AR,
    :SN => :AR,
    :A => :AR,
    :B => :AR,
    :C => :AR,
    :D => :AR,
    :E => :AR,
    :F => :AR,
    :G => :AR,
    :H => :AR,
    
    :AR => :CA,
    :S => :CA,
    :SN => :CA,
    :A => :CA,
    :B => :CA,
    :C => :CA,
    :D => :CA,
    :E => :CA,
    :F => :CA,
    :G => :CA,
    :H => :CA,
    :CAN => :CA,
    
    :AR => :VC,
    :S => :VC,
    :SN => :VC,
    :A => :VC,
    :B => :VC,
    :C => :VC,
    :D => :VC,
    :E => :VC,
    :F => :VC,
    :G => :VC,
    :H => :VC,
    :VCN => :VC,
    
    :AR => :ER,
    :S => :ER,
    :SN => :ER,
    :A => :ER,
    :B => :ER,
    :C => :ER,
    :D => :ER,
    :E => :ER,
    :F => :ER,
    :G => :ER,
    :H => :ER,
    
    :AR => :IN,
    :S => :IN,
    :SN => :IN,
    :A => :IN,
    :B => :IN,
    :C => :IN,
    :D => :IN,
    :E => :IN,
    :F => :IN,
    :G => :IN,
    :H => :IN,
    
    :CA => :DC,
    :S => :DC,
    :SN => :DC,
    :A => :DC,
    :B => :DC,
    :C => :DC,
    :D => :DC,
    :E => :DC,
    :F => :DC,
    :G => :DC,
    :H => :DC,
    :DCN => :DC,
  )
)
@zsunberg
Member

zsunberg commented Jul 9, 2020

I'm not intimately familiar with the inner workings of this package, but I am guessing from this that there are type instabilities and dynamic memory allocations slowing it down. The best way to find out would be to use https://github.com/timholy/ProfileView.jl on a small example.

Fortunately, the language has improved vastly since most of the work on this package was done before 2018, and I think it would be possible to realize massive speedups now! Happy to make suggestions if someone wants to work on improving it.

@mykelk
Member

mykelk commented Jul 9, 2020

It looks like you have several variables with many parents. I think the way it is set up now is that it loops over all possible parental instantiations, which is going to be enormous in your case (the complexity grows exponentially with the number of parents). The loop of interest is here. It would be much faster to just loop over your data and fill in the non-zero entries in the tables.
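The "loop over your data" idea can be sketched in plain Julia (function and variable names here are mine, not BayesNets.jl API): instead of allocating one table entry per possible parental instantiation, iterate over the rows once and tally only the parent/target combinations that actually occur, so the structure holds at most one entry per distinct observed instantiation.

```julia
# Sparse counting sketch: one pass over the data, counting only the
# parental instantiations that appear. Rows are NamedTuples for simplicity.
function sparse_counts(data::Vector{<:NamedTuple}, target::Symbol, parents::Vector{Symbol})
    counts = Dict{Tuple,Dict{Int,Int}}()        # parent assignment => target-value tallies
    for row in data
        key = Tuple(row[p] for p in parents)    # the observed parental instantiation
        tally = get!(counts, key, Dict{Int,Int}())
        tally[row[target]] = get(tally, row[target], 0) + 1
    end
    return counts   # at most length(data) entries, not prod(parental_ncategories)
end

data = [(A = 1, B = 2, S = 1), (A = 1, B = 2, S = 2), (A = 2, B = 1, S = 1)]
counts = sparse_counts(data, :S, [:A, :B])
# counts[(1, 2)] holds the tallies for S given A = 1, B = 2
```

Normalizing each tally then yields the conditional distribution for that instantiation; instantiations never seen in the data simply have no entry.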

@kavir1698
Author

> I am guessing from this that there are type instabilities and dynamic memory allocations slowing it down. The best way to find out would be to use https://github.com/timholy/ProfileView.jl on a small example.

I tried it but even with 20 samples it does not finish.

> Happy to make suggestions if someone wants to work on improving it.

I would happily work on it, but I am very new to these concepts.

> I think the way it is set up now is that it loops over all possible parental instantiations, which is going to be enormous in your case

That should be the issue. Creating a table for a single variable with many parents (cpdARS = fit(CategoricalCPD{DiscreteNonParametric}, df, :ARS, [:S, :A, :B, :C, :D, :E, :F, :G, :H, :SN, :S, :EA])) gets stuck too.

> It would be much faster to just loop over your data and fill in the non-zero entries in the tables.

I am not sure how to loop over the data and fill in the entries in tables.

@mykelk
Member

mykelk commented Jul 10, 2020

I think the issue is with this definition:

struct CategoricalCPD{D} <: CPD{D}
    target::NodeName
    parents::NodeNames
    # list of instantiation counts for each parent, in same order as parents
    parental_ncategories::Vector{Int}
    # a vector of distributions in DMU order
    distributions::Vector{D}
end

The distributions field is a Vector with each element containing a distribution associated with the corresponding parental instantiation. The number of parental instantiations is equal to the product of the numbers of possible values of all the parents. In your case, you have a huge number of parents. Generally, you want to avoid having a large number of parents in a Bayesian network. However, if this is really what you want, then an option is to modify the distributions field as follows:

distributions::Dict{Int,D}

Then, we would store distributions only for parental instantiations that actually exist in the data. If we try to access a parental instantiation that does not exist, we can either throw an error or return a uniform distribution. We would also need to update fit so that it iterates over the rows in the DataFrame and incrementally updates the distributions.

It may be better to provide this implementation as a new type, such as SparseCategoricalCPD and write another fit function that dispatches on that type.
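A rough sketch of that idea (the type and helper names are hypothetical, not BayesNets.jl API): the Dict is keyed by the linear index of the parental instantiation, computed in the same column-major order the dense Vector would use, so lookups stay consistent with the existing layout.

```julia
# Hypothetical sparse CPD: distributions keyed by linear instantiation index.
struct SparseCategoricalCPD{D}
    target::Symbol
    parents::Vector{Symbol}
    parental_ncategories::Vector{Int}
    distributions::Dict{Int,D}   # only instantiations seen in the data
end

# Map a 1-based assignment of parent values to the linear index used as the
# Dict key, in the same column-major order as the dense Vector layout.
function instantiation_index(assignment::Vector{Int}, ncategories::Vector{Int})
    idx, stride = 1, 1
    for (a, n) in zip(assignment, ncategories)
        idx += (a - 1) * stride
        stride *= n
    end
    return idx
end

instantiation_index([1, 1], [2, 3])  # first instantiation => index 1
instantiation_index([2, 3], [2, 3])  # last instantiation  => index 6
```

With this, a sparse fit only inserts entries for the indices it encounters, and a lookup miss can raise an error or fall back to a uniform distribution as described above.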

@kavir1698
Author

Thanks for the explanation. I'll see if I can implement it.

@zsunberg
Member

Ah, yeah, it was kind of silly of me to think this is a type stability/allocation issue, haha. Such a large slowdown is obviously algorithmic.

If fit is the only algorithm that is affected by this, there may be an alternative to creating a completely new type: at the beginning of fit, check the size of the data and the maximum number of parent instantiations for the node and then loop over the one that is smaller. This might be a little more user-friendly, but parameterizing the type usefully might be tricky.
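That size check could look something like this (the function name is hypothetical): compare the number of data rows against the number of possible parental instantiations and fill in whichever is smaller.

```julia
# Sketch of the dense-vs-sparse decision: loop over the data when there are
# fewer rows than possible parental instantiations, otherwise loop over the
# instantiations. BigInt avoids Int overflow when there are many parents.
function choose_representation(nrows::Int, parental_ncategories::Vector{Int})
    ninstantiations = prod(BigInt.(parental_ncategories))
    return ninstantiations <= nrows ? :dense : :sparse
end

choose_representation(200, [3, 3])       # 9 instantiations, 200 rows => :dense
choose_representation(200, fill(3, 12))  # 3^12 = 531441 instantiations => :sparse
```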

@mykelk
Member

mykelk commented Jul 10, 2020

@zsunberg It might be a memory issue. Unless we switch to a sparse representation, it will allocate memory enough to hold all the parental instantiations:

distributions = Array{Categorical{Float64,Vector{Float64}}}(undef, prod(parental_ncategories))

This might not be good if prod(parental_ncategories) is super big.
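For a sense of scale with the network in this issue: :ARS has 11 distinct parents, so even modest category counts make the dense table enormous (the 5-states-per-parent figure below is an assumption for illustration).

```julia
# 11 binary parents already need 2^11 = 2048 distributions; with a
# hypothetical 5 states per parent the dense table needs 5^11 entries.
parental_ncategories = fill(5, 11)
prod(big.(parental_ncategories))   # 48828125 distributions to allocate up front
```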

@zsunberg
Member

Right - you would have to do something like

struct CategoricalCPD{D,T} <: CPD{D}
    target::NodeName
    parents::NodeNames
    # list of instantiation counts for each parent, in same order as parents
    parental_ncategories::Vector{Int}
    # a dense Vector or sparse Dict of distributions
    distributions::T
end

and then fit determines whether T should be a dense or sparse structure. Sorry my explanation wasn't too clear. That's why I said it might be tricky to parameterize.
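A minimal sketch of how both representations share one parameterized type (the struct name and placeholder Float64 "distributions" are mine, for illustration only):

```julia
# The same struct holds either a dense Vector or a sparse Dict; a fit
# routine would pick T based on the data-size check discussed above.
struct MyCategoricalCPD{D,T}
    target::Symbol
    parents::Vector{Symbol}
    parental_ncategories::Vector{Int}
    distributions::T   # Vector{D} (dense) or Dict{Int,D} (sparse)
end

dense_cpd  = MyCategoricalCPD{Float64,Vector{Float64}}(:S, [:A], [2], [0.3, 0.7])
sparse_cpd = MyCategoricalCPD{Float64,Dict{Int,Float64}}(:S, [:A], [2], Dict(1 => 0.3))
```

Downstream code that only indexes distributions by instantiation works with either T; the tricky part Zach mentions is keeping the type parameter meaningful for dispatch.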

@mykelk
Member

mykelk commented Jul 10, 2020

Ah, clever!
