## Sending data to Kafka server

This notebook relies on the Kafka Python client for sending messages to the Kafka cluster.
Messages are data generated from a linear model with $n$ input variables and $m$ output variables:

$$
y_j = B_{\cdot j} x + w
$$
with $x \in \mathbb{R}^n$, $B \in \mathbb{R}^{n \times m}$ and $y, w \in \mathbb{R}^m$.
$w$ is Gaussian noise.

Messages are sent every `TIME_INTERVAL` seconds. Each message is a list of size $n+m+1$ where:
* First element is the counter
* Next $m$ elements are $y$ values
* Last $n$ elements are $x$ values
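
The layout above can be illustrated with a small round-trip sketch. The helper names `encode_message` and `decode_message` are hypothetical and not part of the notebook; the sketch assumes the message is serialized with `np.array2string` (as done later in the sending loop) and that the receiver knows the number of outputs $m$.

```python
import numpy as np

def encode_message(counter, y, x):
    """Pack one observation as [counter, y_1..y_m, x_1..x_n] and serialize it to bytes."""
    arr = np.concatenate(([counter], y, x))
    return np.array2string(arr, separator=',').encode()

def decode_message(payload, n_outputs):
    """Recover (counter, y, x) from a payload produced by encode_message."""
    # Drop the surrounding brackets; float() tolerates padding and line wraps
    text = payload.decode().strip().strip('[]')
    values = np.array([float(v) for v in text.split(',')])
    counter = int(values[0])
    y = values[1:1 + n_outputs]
    x = values[1 + n_outputs:]
    return counter, y, x

payload = encode_message(3, np.array([0.5, -1.25]), np.array([0.1, 0.2, 0.3]))
counter, y, x = decode_message(payload, n_outputs=2)
```

Note that `np.array2string` prints a limited number of digits by default, so the decoded values are only approximately equal to the original ones.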

### General import

In [1]:
from kafka import KafkaProducer
import time
import numpy as np
import matplotlib.pyplot as plt

### Initialization of the Kafka producer

The server is assumed to run locally and to listen on port 9092.

In [2]:
producer = KafkaProducer(bootstrap_servers='localhost:9092')

### Sampling the parameters of the multivariate Gaussian

Instead of choosing arbitrary values for the mean vector or selecting
a covariance matrix that may not be symmetric positive semi-definite,
the normal-Wishart distribution is used to randomly generate
these parameters before running the simulation. The covariance matrix is obtained
by inverting the sampled precision matrix.
The mean vector $\mu$ and precision matrix $\Lambda$ are sampled only once,
and the coefficients of the linear model are then sampled only once from the multivariate
distribution $\mathcal{N}(\mu, \Lambda^{-1})$. More details about data generation can be found
in the report.
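
As a minimal sketch of why this construction yields a valid covariance matrix, the snippet below builds a precision matrix as a sum of outer products of Gaussian draws (a Wishart sample), inverts it, and checks that the result is symmetric with positive eigenvalues. The dimensionality `D`, the number of draws `n_draws` and the seed are values assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

D = 4          # dimensionality (assumed for illustration)
n_draws = 15   # number of Gaussian draws, acting as degrees of freedom
W = np.eye(D)  # scale matrix

# Sum of outer products of n_draws Gaussian vectors: a Wishart-distributed matrix
G = rng.multivariate_normal(np.zeros(D), W, size=n_draws)
Lambda = G.T @ G

# The covariance matrix is the inverse of the sampled precision matrix
Sigma = np.linalg.inv(Lambda)

# Sanity checks: symmetry and strictly positive eigenvalues
is_symmetric = np.allclose(Sigma, Sigma.T)
min_eigenvalue = np.linalg.eigvalsh((Sigma + Sigma.T) / 2).min()
```

When the number of draws is at least the dimensionality, the sampled precision matrix is positive definite with probability one, so the inversion is well defined.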

In [3]:
class NormalWishart:
    """Normal-Wishart distribution.
    
    Attributes:
        _mu0 (:obj:`np.ndarray`): location vector.
        _D (int): Number of components.
        _lambda (float): positive-only scalar.
        _W (:obj:`np.ndarray`): positive definite scale matrix.
        _nu (int): number of degrees of freedom.
        _n (int): sample size.
        _0 (:obj:`np.ndarray`): pre-allocated vector of zeros.
    """
    
    def __init__(self, mu0, _lambda, W, nu, n):
        self._mu0 = np.asarray(mu0)
        self._D = self._mu0.shape[0]
        self._lambda = _lambda
        self._W = np.asarray(W)
        self._nu = nu
        self._n = n
        self._0 = np.zeros(self._D)
        # Make sure that the degrees of freedom are sufficient
        # and that the dimensionality of the scale matrix is correct.
        assert (self._nu > self._D - 1) and (self._W.shape == (self._D, self._D))
    
    def sample(self):
        """Randomly generates a mean vector and a precision matrix.
        
        The precision matrix is built from `self._n` Gaussian observations.
        
        Returns:
            :obj:`np.ndarray`: a random mean vector.
            :obj:`np.ndarray`: a random precision matrix.
        """
        # Randomly samples the mean vector
        mu = np.random.multivariate_normal(self._mu0, self._W)
        
        # Randomly samples the precision matrix
        G = np.random.multivariate_normal(self._0, self._W, size=self._n)
        S = np.dot(G.T, G)
        return mu, S

Now that the normal-Wishart distribution has been defined, the actual data generation algorithm remains to be implemented.
Parameters $\mu$, $\Lambda$ and $B$ are computed in the constructor of the Generator, in order to make sure
that they are sampled only once. The hyper-parameters of the normal-Wishart distribution are the following:
the default scale matrix $W_0$ is the identity matrix, the prior mean vector $\mu_0$
is the zero vector, the scaling parameter is $1$, and the number of Gaussian observations used to build
the precision matrix (`latent_dim`, which plays the role of $\nu_0$) has been arbitrarily set to 15.

Each time the generator samples new data, the counter is automatically incremented.
$x$ is sampled using a uniform distribution $\mathcal{U}(0, 1)$, $w$ is sampled using a Gaussian distribution $\mathcal{N}(0, 1)$ and scaled by $0.1$, and variable $y_j$ is computed as follows:
$$
y_j = B_{\cdot j} x + w
$$
for each $j$. In practice, vector $y$ is computed in a vectorized fashion.
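
To make the vectorized computation concrete, the sketch below compares a column-by-column evaluation of the formula with a single matrix product. The sizes, the seed and the variable names are assumptions made for illustration; note that the notebook's `Generator` stores the transpose of $B$ (one row per output) in `_beta`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n, m = 3, 2                     # small sizes assumed for illustration

B = rng.normal(size=(n, m))     # column j holds the weights of output j
x = rng.random(n)               # inputs drawn from U(0, 1)
w = 0.1 * rng.normal(size=m)    # scaled Gaussian noise, one value per output

# Column by column: y_j = B[:, j] . x + w_j
y_loop = np.array([B[:, j] @ x + w[j] for j in range(m)])

# Vectorized: all outputs in a single matrix-vector product
y_vec = B.T @ x + w
```

Both computations give the same vector; the vectorized form simply avoids the explicit Python loop over outputs.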

In [4]:
class Generator:
    """Data generator.
    
    Attributes:
        _n_inputs (int): Number of explanatory variables.
        _n_outputs (int): Number of explained variables.
        _latent_dim (int): Latent dimension, represented by the
            number of Gaussian observations to draw for
            estimating the precision matrix.
        _beta (:obj:`np.ndarray`): Variable weights.
    """
    
    def __init__(self, n_inputs, n_outputs, latent_dim=15):
        self._counter = 0
        self._n_inputs = n_inputs
        self._n_outputs = n_outputs
        self._latent_dim = latent_dim
        
        # Initializes NW distribution with the identity
        # as scale matrix and a zero vector as location vector.
        W = np.eye(self._n_inputs)
        mu0 = np.zeros(self._n_inputs)
        nu = self._n_inputs
        nw = NormalWishart(mu0, 1., W, nu, self._latent_dim)
        
        # Lambda is a random precision matrix,
        # and needs to be inverted in order to obtain
        # a covariance matrix.
        mu, Lambda = nw.sample()
        Sigma = np.linalg.inv(Lambda)
        
        # Randomly initializes the variable weights
        self._beta = np.random.multivariate_normal(mu, Sigma, self._n_outputs)
    
    def sample(self, sample_size=1):
        """Draw random observations from the linear model.
        
        Parameters:
            sample_size (int): Number of observations to draw.
        
        Returns:
            :obj:`np.ndarray`: Array of shape (`sample_size`, `self._n_inputs`)
                containing random values for the explanatory variables.
            :obj:`np.ndarray`: Array of shape (`sample_size`, `self._n_outputs`)
                containing random values for the explained variables.
            :obj:`np.ndarray`: Array of length `sample_size` containing the unique
                identifiers of generated samples.
        """
        # Generates unique identifiers
        counters = np.arange(self._counter, self._counter + sample_size)
        self._counter += sample_size
        
        # Samples explanatory variables
        X = np.random.rand(sample_size, self._n_inputs)
        
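        # Samples random noise (standard Gaussian scaled by 0.1),
        # independently for each observation and each output
        w = np.random.normal(0, 1, size=(sample_size, self._n_outputs)) * 0.1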
        
        # Computes outputs
        y = np.dot(X, self._beta.T) + w
        return np.squeeze(X), np.squeeze(y), counters
    
    @property
    def beta(self):
        return self._beta
    
    @property
    def counter(self):
        """Returns current value for the sample counter.
        
        Returns:
            int: Identifier of the next sample to be generated.
        """
        return self._counter

Let's define the dimensionality of the data:

In [5]:
n = 10 # Number of inputs
m = 8 # Number of outputs

Let's test the generator with these parameters.

In [13]:
Generator(n, m).sample()

(array([0.29311455, 0.09541923, 0.25294935, 0.90461192, 0.31660916,
        0.21628202, 0.86068329, 0.85337652, 0.06117153, 0.09591472]),
 array([ 0.34642123, -0.09661949,  0.69663622,  0.76235547, -0.27186512,
         0.48915591,  0.92408925,  1.85420393]),
 array([0]))

In [11]:
# Waiting time before sending the next observation
TIME_INTERVAL = 1

# Create a new generator
gen = Generator(n, m)

# Let's see the coefficients B
print('beta: %s' % str(gen.beta))

# Loop for sending messages to Kafka with the topic dataLinearModel
while True:
    
    # Sample an observation and concatenate the counter
    # with the values of explanatory and explained variables.
    x, y, counter = gen.sample()
    arr = np.concatenate((counter, y, x))
    
    #print(arr)
    
    # Convert the array to text and send it to Kafka
    message = np.array2string(arr, separator=',')
    producer.send('dataLinearModel', message.encode())
    
    # Wait
    time.sleep(TIME_INTERVAL)
    

beta: [[ 0.67096096  1.62272054 -0.04350657  0.76190821  0.27915542 -0.92009199
   0.75251552  0.96051031  0.12702734 -0.22189986]
 [ 0.45158215  1.52465389  0.17136432  1.32806491  0.73294021 -0.79619713
   1.26436157  1.02246963 -0.09546401  0.19348103]
 [ 0.60467334 -0.46399002  0.54392233  1.2777495   1.39623466 -0.71617532
   1.06135513  0.83831317 -0.67125076 -0.70739602]
 [ 0.67183736  0.59650181  0.10901946  1.04807695  0.7043718  -0.9582969
   0.88958936  0.59990975 -0.61420484 -0.42472034]
 [ 0.66463092  1.00656279  0.20028729  1.09136501  0.43948152 -1.19714787
   1.17272188  1.00095635 -0.79871338 -0.36289474]
 [ 0.50669192  0.86953819  0.35188337  0.63813706 -0.27801352 -1.26849023
   1.11517122  1.15698931 -2.05286212 -0.69816222]
 [ 0.32154452 -0.15310418  0.48471017  1.59678547  1.72277444 -0.6362342
   0.7357768   0.2178628   0.23844441 -0.84115277]
 [ 0.69534023  0.92267164 -0.17876002  1.86570903  0.43905266 -1.06167573
   0.80919231  0.79126414 -1.65308907 -0.084428

KeyboardInterrupt: 