# Sampling probability arrays: a Python class

When working in data science, we often need arrays of weights that sum to one; we call these probability arrays or probability vectors.
For example, in classification tasks, we want the output to be an array of probabilities representing the predicted class distribution. When implementing mixture models, probability arrays can represent the weight of each component. In finance, they might define the weights of assets in a portfolio, where the total allocation must sum to one.

However, ensuring that these vectors are correctly sampled and always sum to one is not trivial. A simple approach, like sampling random numbers and normalizing them, can lead to uneven or biased distributions. So, how can we guarantee that sampled arrays always meet this constraint?
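
To make the bias concrete, here is a minimal sketch comparing the naive approach with a reference sampler. The sample size, the three components, and the 0.9 threshold are illustrative choices, and the Dirichlet(1, 1, 1) reference is used only because it is uniform over all valid probability vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, k = 200_000, 3

# Naive approach: draw i.i.d. uniforms and divide each row by its sum.
naive = rng.random((n_samples, k))
naive /= naive.sum(axis=1, keepdims=True)

# Reference: Dirichlet(1, 1, 1) is uniform over all valid probability vectors.
reference = rng.dirichlet(np.ones(k), size=n_samples)

# Both sets of rows sum to one...
assert np.allclose(naive.sum(axis=1), 1.0)
assert np.allclose(reference.sum(axis=1), 1.0)

# ...but the naive draws land near a corner (one weight > 0.9) far less often.
naive_corner = (naive.max(axis=1) > 0.9).mean()
reference_corner = (reference.max(axis=1) > 0.9).mean()
print(f"one weight > 0.9: naive={naive_corner:.4f}, reference={reference_corner:.4f}")
```

Both samplers satisfy the sum-to-one constraint, yet the naive one visits the corners several times less often than the uniform reference, which is the kind of unevenness described above.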

One way to do so is to imagine placing beads into bins. Picture having a finite number of beads, and distributing them across five bins. Each bin represents one component of the probability array, and the fraction of beads in each bin determines the corresponding probability weight. 


<figure>
  <img src="https://github.com/PessoaP/blog/blob/master/Beads/beads1.png?raw=true" alt="S"/>
</figure>

This analogy is not only intuitive but also robust: the total is preserved by construction, with no normalization step needed afterwards. It is also flexible, since increasing the number of beads gives higher precision in the probability array. Finally, because beads are discrete, the total count is inherently stable, avoiding the floating-point errors that can arise when dealing with continuous random variables.
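
As a quick illustration of the picture above, the following sketch drops beads into five bins and converts the counts to weights. Dropping beads uniformly at random is an assumption made for the example only; the point is that the integer total is exact and the resolution is `1 / beads`.

```python
import numpy as np

rng = np.random.default_rng(1)
bins = 5

for beads in (10, 1_000, 100_000):
    # Drop each bead into one of the bins uniformly at random (illustrative choice).
    counts = rng.multinomial(beads, np.ones(bins) / bins)
    prob = counts / beads
    # The integer total is exact by construction; every weight is a multiple of 1/beads.
    assert counts.sum() == beads
    print(beads, prob, "resolution:", 1 / beads)
```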

In this blog post, we build a Python class called `prob_array` to model probability arrays using the bead-and-bin analogy. We'll cover three core functions: (i) initializing arrays from raw counts or from an existing probability vector; (ii) proposing symmetric updates by redistributing beads; and (iii) calculating the log-probability under a multinomial prior, including necessary mathematical corrections.
By the end, you'll have a clean implementation of `prob_array`, with examples showcasing initialization, symmetric proposals, and probabilistic evaluation. Let’s begin with array initialization.

In [1]:
import numpy as np
normalize = lambda x: x/x.sum()

class prob_array:
    def __init__(self,array=None,components=20,beads=10000):
        if array is None:
            self.counts = np.ones(components,dtype=int)*(beads//components)
            self.counts[:beads%components]+=1
        elif np.issubdtype(array.dtype, np.integer):
            self.counts = array
        elif np.isclose(1.,np.sum(array)):
            self.counts = np.floor(array*beads).astype(int)
            remainder = beads - self.counts.sum() #beads lost when flooring
            self.counts[np.argsort(array)[:remainder]]+=1

        self.prob = normalize(self.counts)
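
The probability branch of `__init__` can be tried in isolation. The sketch below is a standalone version under stated assumptions: `probs_to_counts` is a hypothetical helper name, and handing the leftover beads to the largest components is one possible convention for the rounding step.

```python
import numpy as np

def probs_to_counts(prob, beads):
    """Hypothetical helper: turn a probability vector into integer bead counts."""
    prob = np.asarray(prob, dtype=float)
    counts = np.floor(prob * beads).astype(int)
    remainder = beads - counts.sum()  # beads lost to flooring
    if remainder > 0:
        # Assumption: leftover beads go to the largest components.
        counts[np.argsort(prob)[-remainder:]] += 1
    return counts

counts = probs_to_counts([0.5, 0.3, 0.2], beads=7)
print(counts, counts.sum())

# When flooring loses nothing, the counts are left untouched.
exact = probs_to_counts([0.5, 0.5], beads=10)
print(exact, exact.sum())
```

The `remainder > 0` guard matters: with a remainder of zero, a slice such as `[-0:]` would select every component rather than none.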


In [None]:
def proposal(p_array,rate=.01):
    movables = np.random.binomial(p_array.counts,rate)
    new_counts = p_array.counts - movables 

    mvleft = np.random.binomial(movables,.5)
    mvright = movables - mvleft 

    new_counts[:-1] += mvleft[1:]
    new_counts[0] += mvleft[0] #the ones selected to move left from 0 stay in place

    new_counts[1:] += mvright[:-1]
    new_counts[-1] += mvright[-1] #the ones selected to move right from -1 stay in place

    return prob_array(new_counts,new_counts.size,new_counts.sum())
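
To check that a move of this kind conserves beads, here is a standalone sketch of the same steps on a plain count array. The name `propose_counts`, the use of NumPy's `Generator` API for seeding, and the example counts are assumptions made for illustration.

```python
import numpy as np

def propose_counts(counts, rate=0.01, rng=None):
    """Standalone sketch of one nearest-neighbour bead move (hypothetical name)."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts)
    movables = rng.binomial(counts, rate)  # beads picked up from each bin
    mvleft = rng.binomial(movables, 0.5)   # of those, how many try to go left
    mvright = movables - mvleft            # the rest try to go right

    new_counts = counts - movables         # new array, the input is not modified
    new_counts[:-1] += mvleft[1:]          # bin i+1 -> bin i
    new_counts[0] += mvleft[0]             # nothing left of bin 0: stay in place
    new_counts[1:] += mvright[:-1]         # bin i -> bin i+1
    new_counts[-1] += mvright[-1]          # nothing right of the last bin: stay in place
    return new_counts

rng = np.random.default_rng(2)
counts = np.array([2500, 2500, 2500, 2500])
proposed = propose_counts(counts, rate=0.05, rng=rng)

print(proposed, proposed.sum())
```

Every bead that leaves a bin is added back either to a neighbour or to its own bin, so the total stays fixed and no count can go negative.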

In [None]:
from scipy.special import loggamma
logfactorial = lambda x: loggamma(x+1)    
def multinomial_logprob(p_array,alpha):
    beads = p_array.counts.sum()
    p = alpha/alpha.sum() #alpha is the function argument, not an attribute of p_array

    log_prefactor = p_array.counts.size*np.log(beads) + logfactorial(beads) - logfactorial(p_array.counts).sum()
    return log_prefactor + (p_array.counts*np.log(p)).sum() 
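
As a sanity check on the multinomial part of this formula, the sketch below compares a hand-written log-pmf against `scipy.stats.multinomial.logpmf`. The function name `multinomial_logpmf` and the example counts and weights are illustrative. Note that the comparison deliberately leaves out the extra `counts.size * np.log(beads)` term that the post's `multinomial_logprob` adds on top.

```python
import numpy as np
from scipy.special import loggamma
from scipy.stats import multinomial

def multinomial_logpmf(counts, weights):
    """Log-pmf of `counts` under a multinomial with probabilities weights / weights.sum()."""
    counts = np.asarray(counts)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    beads = counts.sum()
    # log of the multinomial coefficient: log(beads!) - sum(log(counts_i!))
    log_coeff = loggamma(beads + 1) - loggamma(counts + 1).sum()
    return log_coeff + (counts * np.log(p)).sum()

counts = np.array([3, 5, 2])
weights = np.array([1.0, 2.0, 1.0])  # illustrative values, not from the post

by_hand = multinomial_logpmf(counts, weights)
reference = multinomial.logpmf(counts, n=counts.sum(), p=weights / weights.sum())
print(by_hand, reference)
```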

In [2]:
import numpy as np
from scipy.special import loggamma

# Utility functions
normalize = lambda x: x / x.sum()  # Ensures the input array sums to 1
logfactorial = lambda x: loggamma(x + 1)  # Computes the log-factorial using log-gamma

class prob_array:
    def __init__(self, array=None, components=20, beads=10000):
        """
        Initialize a probability array.
        
        Parameters:
        - array: Optional numpy array, can represent counts or probabilities.
        - components: Number of components in the probability array.
        - beads: Total number of beads (samples) for normalization.
        """
        if array is None:
            # Uniform distribution of beads across components
            self.counts = np.ones(components, dtype=int) * (beads // components)
            self.counts[:beads % components] += 1  # Distribute remaining beads
        elif np.issubdtype(array.dtype, np.integer):
            # If array represents counts directly
            self.counts = array
        elif np.isclose(1.0, np.sum(array)):
            # If array represents probabilities
            self.counts = np.floor(array * beads).astype(int)
            remainder = beads - self.counts.sum()  # Beads lost to flooring
            if remainder > 0:  # A slice of [-0:] would select every component
                self.counts[np.argsort(array)[-remainder:]] += 1  # Give leftover beads to the largest components
        else:
            raise ValueError("Input array must be counts (int) or probabilities (sum to 1).")
        
        self.prob = normalize(self.counts)  # Normalize counts to probabilities

    def proposal(self, rate=0.01):
        """
        Generate a proposal for a new probability array.
        
        Parameters:
        - rate: Rate of change for beads redistribution.
        
        Returns:
        - New prob_array instance with adjusted counts.
        """
        movables = np.random.binomial(self.counts, rate)  # Determine movable beads
        new_counts = self.counts - movables

        # Redistribute beads left and right
        mvleft = np.random.binomial(movables, 0.5)
        mvright = movables - mvleft

        new_counts[:-1] += mvleft[1:]
        new_counts[0] += mvleft[0]  # Beads attempting to move left from the first index stay

        new_counts[1:] += mvright[:-1]
        new_counts[-1] += mvright[-1]  # Beads attempting to move right from the last index stay

        # Ensure counts remain valid
        if np.any(new_counts < 0):
            raise ValueError("Invalid move: Negative counts detected.")
        
        return prob_array(new_counts, new_counts.size, new_counts.sum())
    
    def multinomial_logprob(self, alpha):
        """
        Compute the log-probability of the current counts given a multinomial distribution.
        
        Parameters:
        - alpha: Unnormalized multinomial weights (numpy array); normalized internally.
        
        Returns:
        - Log-probability value.
        """
        beads = self.counts.sum()
        if len(alpha) != len(self.counts):
            raise ValueError("Alpha vector must have the same length as counts.")
        
        p = alpha / alpha.sum()  # Normalize alpha to get probabilities
        
        log_prefactor = logfactorial(beads) - logfactorial(self.counts).sum() + self.counts.size * np.log(beads)
        return log_prefactor + (self.counts * np.log(p)).sum()

    def __repr__(self):
        return f"prob_array(counts={self.counts}, prob={self.prob})"
