In [1]:
using StatsBase
using Combinatorics

In [2]:
# toy data

Z = [1 1 2 2 3 3 4 4 4 5 5] # group partition
D = [3 4 2 5 6 4 3 2 5 2 2] # degree sequence

First thing we'll do: let's check the formula

$$\sum_{R \in [n]^\ell} \mathbb{I}(\#(\mathbf{z}_R) = p) \sigma(\theta_R) = \sum_{y: \#y = p}\prod_{a = 1}^{\ell}v_{y_a}\;,$$

In this formula, $R$ ranges over $[n]^\ell$, where $n$ is the number of nodes and $\ell$ is the number of nodes per hyperedge. $p$ is a specified partition of $\ell$, and $\#(\cdot)$ is read here as the sorted vector of group-label multiplicities, which is how the code below computes it. This is **different** from the current version of the notes, and matches the conversation we had over email.
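
As a quick sanity check on the right-hand side, take $\ell = 2$ and $p = (1, 1)$, so the sum runs over ordered pairs of distinct groups. Assuming $v_s$ is the volume (total degree) of group $s$, which is how the code in the next cell computes it, the toy data above give $v = (7, 7, 10, 10, 4)$ and

$$\sum_{y_1 \neq y_2} v_{y_1} v_{y_2} = \Big(\sum_s v_s\Big)^2 - \sum_s v_s^2 = 38^2 - 314 = 1130\;.$$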



In [3]:
# Naive evaluation of the sum (LHS)

function evalSum(p, Z, D)
    n = length(Z)
    ℓ = sum(p)

    S = 0

    T = Iterators.product((1:n for i = 1:ℓ)...)

    for R in T
        a = countmap(vec(Z[collect(R)]))
        a = -sort(-collect(values(a)))
        if a == p
            S += prod(D[collect(R)])
        end
    end
    return(S)
end

# Faster evaluation (RHS)

function evalSum2(p, Z, D)
    n = length(Z)
    ℓ = sum(p)
    S = 0

    V = [sum([(Z[i] == s)*D[i] for i = 1:n]) for s = 1:maximum(Z)] # vector of volumes

    P = Iterators.product((1:maximum(Z) for i = 1:ℓ)...)
    for p_ in P
        a = countmap(vec(collect(p_)))
        a = -sort(-collect(values(a)))
        if a == p
            S += prod(V[collect(p_)])
        end
    end

    return(S)
end;

In [5]:
# test: check that these two functions give the same result on all partitions
# this will be very slow for even moderate n

r = 3

for i = 1:r
    for j = 1:i
        for p in partitions(i,j)
            println(evalSum(p, Z, D) == evalSum2(p, Z, D))
        end
    end
end

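Ok, looks good! `evalSum2` is a lot faster than `evalSum`, although both are still very slow: `evalSum` loops over all $n^\ell$ node tuples and `evalSum2` over all $k^\ell$ group tuples, where $k$ is the number of groups. Presumably some of the cost is due to kludgy coding on my part, but even much better coding practice could only improve these so much, since the number of tuples grows exponentially in $\ell$.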
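
To separate the "kludgy coding" cost from the intrinsic cost, here is a minimal sketch of the same right-hand-side sum with fewer temporary arrays. The names `pattern` and `evalSum2Lean` are hypothetical and not part of the notebook; the loop still visits every group tuple, so only the constant factor should change, not the exponential scaling.

```julia
# Sorted multiplicities of the labels in y, e.g. (3, 1, 3) -> [2, 1].
# `pattern` is a hypothetical helper name, not part of the notebook.
function pattern(y)
    counts = Dict{Int,Int}()
    for s in y
        counts[s] = get(counts, s, 0) + 1
    end
    return sort!(collect(values(counts)), rev = true)
end

# Same sum as evalSum2, written to avoid collect() on every tuple (a sketch).
function evalSum2Lean(p, Z, D)
    k = maximum(Z)
    V = zeros(eltype(D), k)          # group volumes, accumulated in one pass
    for (z, d) in zip(Z, D)
        V[z] += d
    end
    ℓ = sum(p)
    S = zero(eltype(D))
    for y in Iterators.product(ntuple(_ -> 1:k, ℓ)...)
        if pattern(y) == p
            S += prod(V[s] for s in y)
        end
    end
    return S
end

Z = [1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5]
D = [3, 4, 2, 5, 6, 4, 3, 2, 5, 2, 2]
println(evalSum2Lean([2, 1], Z, D))
```

On the toy data this should agree with `evalSum2` for every partition tested above.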

In [7]:
p = [2, 1]
@time evalSum(p, Z, D)
@time evalSum2(p, Z, D)

  0.125355 seconds (30.42 k allocations: 94.904 MiB, 25.50% gc time)
  0.013573 seconds (2.65 k allocations: 8.912 MiB, 61.19% gc time)


27546

Next we're going to try to use the recursion lemma from the notes in order to evaluate the sums for every possible partition vector p at once. 
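
Reading the recurrence off the code in the next cell (rather than from the notes, so the notation here is an assumption): write $\mu_i = \sum_s v_s^i$ for the power sums of the group volumes, and $M(p)$ for the sum of $\prod_a v_{y_a}^{p_a}$ over ordered tuples $(y_1, \dots, y_k)$ of distinct groups. Then

$$M(p_1, \dots, p_k) = \mu_{p_k}\, M(p_1, \dots, p_{k-1}) - \sum_{i=1}^{k-1} M(p_1, \dots, p_i + p_k, \dots, p_{k-1})\;, \qquad M(\emptyset) = 1\;.$$

The first term lets $y_k$ range over all groups; the second removes the cases where $y_k$ coincides with an earlier $y_i$. For $p = (2, 1)$ on the toy data, $M(2,1) = \mu_1 \mu_2 - \mu_3 = 38 \cdot 314 - 2750 = 9182$, and multiplying by the $\binom{3}{2,1} = 3$ ways of placing the parts gives $27546$, the value returned by `evalSum2` above.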

In [8]:
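"""
Utility function: second term in the recurrence in the notes.
Sums M over the partitions obtained by merging the last part of p
into each of the earlier parts.
"""
function correctOvercounting(M, p)
    pk = p[end]
    S = 0
    for i = 1:length(p)-1
        p_ = p[1:(end-1)]       # range indexing already returns a copy
        p_[i] += pk
        S += M[-sort(-p_)]      # re-sort in decreasing order to match the keys of M
    end
    return(S)
end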

"""
Z: an Array of integer group labels
D: an Array of degrees
r: the largest hyperedge size to compute
"""
function evalSums(Z, D, r)
    
    V = [sum([D[i]*(Z[i] == j) for i in 1:length(Z)]) for j in 1:maximum(Z)]
    μ = [sum(V.^i) for i = 1:r]

    M = Dict()

    for i = 1:r
        for j = 1:i # number of nonzero entries
            for p in partitions(i, j)
                M[p] = μ[p[end]]*get(M, p[1:(end-1)], 1) - correctOvercounting(M,p)
            end
        end
    end
    N = Dict()
    for p in keys(M)
        orderCorrection = 1
        counts = values(countmap(p))
        for c in counts
            orderCorrection *= factorial(c) 
        end

        N[p] = M[p] * multinomial(p...) ÷ orderCorrection
    end
    return(N)
end

evalSums (generic function with 1 method)

In [9]:
# need to run this block twice in order to avoid timing compile time
r = 5 # size of largest hyperedge
@time M = evalSums(Z, D, r)

For comparison, let's implement `evalSums2` by just running `evalSum2` for each value of `p`: 

In [21]:
# comparison to using evalSum2 from above
function evalSums2(Z, D, ℓ)
    N = Dict()
    for i = 1:ℓ
        for j = 1:i # number of nonzero entries
            for p in partitions(i, j)
                N[p] = evalSum2(p, Z, D)
            end
        end
    end
    return(N)
end

evalSums2 (generic function with 1 method)

In [23]:
# need to run twice to avoid timing compile times
@time N = evalSums2(Z, D, r)

  1.348992 seconds (504.27 k allocations: 1.791 GiB, 22.95% gc time)


Dict{Any,Any} with 18 entries:
  [2, 2, 1]       => 23050800
  [1, 1, 1, 1]    => 346080
  [3, 2]          => 6288620
  [2, 2]          => 220614
  [3]             => 2750
  [1, 1]          => 1130
  [1, 1, 1]       => 24576
  [2]             => 314
  [2, 1, 1]       => 1175616
  [2, 1]          => 27546
  [4, 1]          => 3587830
  [4]             => 25058
  [1]             => 38
  [5]             => 234638
  [2, 1, 1, 1]    => 26997600
  [3, 1, 1]       => 16723680
  [1, 1, 1, 1, 1] => 2352000
  [3, 1]          => 317768

Ok, so the main thing these timings tell us is that I wrote `evalSum2` really inefficiently, but the approach still seems promising.
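
Independently of timing, the recursion can be cross-checked against the table above with a small self-contained sketch that uses only Base. This is a memoized, top-down variant of the same recurrence, not the notebook's `evalSums`; the names `orderedSum` and `arrangements` are hypothetical.

```julia
# Sum over ordered tuples of *distinct* groups, via the merge recurrence.
# μ[i] must hold the i-th power sum of the group volumes.
function orderedSum(p, μ, memo = Dict{Vector{Int},Any}())
    isempty(p) && return 1
    haskey(memo, p) && return memo[p]
    head, pk = p[1:end-1], p[end]
    S = μ[pk] * orderedSum(head, μ, memo)
    for i in eachindex(head)
        q = copy(head)
        q[i] += pk                       # merge the last part into part i
        S -= orderedSum(sort(q, rev = true), μ, memo)
    end
    memo[p] = S
    return S
end

# Number of ways to place the parts of p on sum(p) positions,
# treating parts of equal size as interchangeable.
function arrangements(p)
    mult = Dict{Int,Int}()
    for x in p
        mult[x] = get(mult, x, 0) + 1
    end
    nOrdered = factorial(sum(p)) ÷ prod(factorial(x) for x in p)
    return nOrdered ÷ prod(factorial(c) for c in values(mult))
end

V = [7, 7, 10, 10, 4]                    # group volumes of the toy data
μ = [sum(V .^ i) for i in 1:5]
for p in ([2, 1], [2, 2], [3, 2], [1, 1, 1])
    println(p, " => ", arrangements(p) * orderedSum(p, μ))
end
```

The four printed values should match the corresponding entries in the `evalSums2` output above (27546, 220614, 6288620 and 24576).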

# How big can we go?

Even with my highly non-optimized code, we can do medium-sized instances fairly quickly this way. Note, however, that we need BigInts to avoid overflow issues. 
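
A minimal illustration of the overflow issue, assuming a 64-bit machine and a group volume of about 20,000 (an illustrative figure, not taken from the run below): the power sums $\mu_i$ involve $v^i$, which exceeds the `Int64` range already at $i = 5$, and Julia's native integer arithmetic wraps around silently rather than raising an error.

```julia
v = 20_000                  # illustrative group volume (assumed)
wrapped = v^5               # native integer arithmetic wraps silently on overflow
exact = big(v)^5            # BigInt arithmetic is exact
println(wrapped == exact)   # false
println(exact)              # 3200000000000000000000
```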

In [28]:
# Performance test: how big can we do this?
n = 20000

Z = rand(1:50, n)
Z = convert(Array{BigInt,1}, Z)
D = rand(2:100, n)
D = convert(Array{BigInt,1}, D)

r = 15 # maximum hyperedge size

@time M = evalSums(Z, D, r)

  0.690162 seconds (4.07 M allocations: 127.180 MiB, 23.96% gc time)


Dict{Any,Any} with 683 entries:
  [7, 4, 2, 1]                => 4876097134623344827104310051502713342199818234…
  [6, 1, 1, 1, 1]             => 7201451189283449227415975087764213309607655220…
  [6, 2, 2, 2, 2]             => 1919215419355724423666487780680990394780648579…
  [3, 3, 3, 3]                => 4777112010615357290763106726350374474518877252…
  [4, 4, 2, 1, 1]             => 1487028166110726729048635662694304361555049208…
  [3, 3, 3, 2, 1, 1, 1]       => 6606175680023126384315251083952314561414384765…
  [4, 4, 2, 1]                => 2631325697865943443880271903083442217705802612…
  [6, 3, 3, 1, 1, 1]          => 2369171418104558447191591745732633753164666408…
  [4, 2, 1]                   => 19010143665488668897598211633038611500
  [7, 1]                      => 639891464698090626130351698705320520352
  [3]                         => 433061729882502
  [5, 2, 2, 1, 1, 1]          => 5340229387325443202481028581515308658211676141…
  [8, 3, 1]                   => 137017832790