## Sending data to a Kafka server

This notebook relies on the Kafka Python client for sending messages to the Kafka cluster.
Messages are data generated from a linear model with $n$ input variables and $m$ output variables:

$$
y_j = B_{\cdot j} x + w
$$
with $x \in \mathbb{R}^n$, $B \in \mathbb{R}^{n \times m}$ and $y, w \in \mathbb{R}^m$.
$w$ is Gaussian noise.

A message is sent every `TIME_INTERVAL` seconds. Each message is a list of $n + m + 1$ values, where:
* The first element is the counter
* The next $m$ elements are the $y$ values
* The last $n$ elements are the $x$ values
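
As a minimal sketch of this layout, the snippet below packs a counter, $y$ and $x$ into one array, serializes it with `np.array2string` and a comma separator (the same call the sending loop of this notebook uses), and parses the text back. The helper names `build_message` and `parse_message` are hypothetical and are not part of the notebook.

```python
import numpy as np


def build_message(counter, y, x):
    """Pack [counter, y_1..y_m, x_1..x_n] and serialize it as text."""
    arr = np.concatenate(([counter], y, x))
    return np.array2string(arr, separator=',')


def parse_message(text, n, m):
    """Split a serialized message back into (counter, y, x)."""
    # Drop the surrounding brackets, then split on commas.
    # float() ignores the padding spaces and line breaks that
    # np.array2string inserts when it formats an array.
    values = [float(tok) for tok in text.strip().strip('[]').split(',')]
    if len(values) != n + m + 1:
        raise ValueError("unexpected message length")
    counter = int(values[0])
    y = np.array(values[1:1 + m])
    x = np.array(values[1 + m:])
    return counter, y, x


message = build_message(7, y=[1.5, -2.0], x=[0.25, 0.5, 0.75])
counter, y, x = parse_message(message, n=3, m=2)
print(counter, y, x)
```

Note that `np.array2string` rounds floats to 8 digits of precision by default, so parsed values match the originals only up to that rounding.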

### General import

In [14]:
from kafka import KafkaProducer
import time
import numpy as np
import matplotlib.pyplot as plt

### Initialization of the Kafka producer

The server is assumed to run locally and listen on port 9092.

In [15]:
producer = KafkaProducer(bootstrap_servers='localhost:9092')

### Sampling the parameters of the multivariate Gaussian

Instead of choosing arbitrary values for the mean vector or selecting
a covariance matrix that may not be symmetric positive semi-definite,
the normal-Wishart distribution is used to randomly generate
these parameters before running the simulation. The covariance matrix is obtained
by inverting the sampled precision matrix.
The mean vector $\mu$ and precision matrix $\Lambda$ are sampled only once,
and the coefficients of the linear model are then drawn once from the multivariate
Gaussian distribution $\mathcal{N}(\mu, \Lambda^{-1})$. More details about data generation can be found
in the report.

In [16]:
class NormalWishart:
    """Normal-Wishart distribution.
    
    Attributes:
        _mu0 (:obj:`np.ndarray`): location vector.
        _D (int): Number of components.
        _lambda (float): positive-only scalar.
        _W (:obj:`np.ndarray`): positive definite scale matrix.
        _nu (int): number of degrees of freedom.
        _n (int): sample size.
        _0 (:obj:`np.ndarray`): pre-allocated vector of zeros.
    """
    
    def __init__(self, mu0, _lambda, W, nu, n):
        self._mu0 = np.asarray(mu0)
        self._D = self._mu0.shape[0]
        self._lambda = _lambda
        self._W = np.asarray(W)
        self._nu = nu
        self._n = n
        self._0 = np.zeros(self._D)
        # Make sure that the degrees of freedom are sufficient
        # and that the dimensionality of the scale matrix is correct.
        assert (self._nu > self._D - 1) and (self._W.shape == (self._D, self._D))
    
    def sample(self):
        """Randomly generates a mean vector and a precision matrix.
        
        The precision matrix is built from `self._n` Gaussian observations.
        
        Returns:
            :obj:`np.ndarray`: a random mean vector.
            :obj:`np.ndarray`: a random precision matrix.
        """
        # Randomly samples the mean vector from N(mu0, W)
        mu = np.random.multivariate_normal(self._mu0, self._W)
        
        # Randomly samples the precision matrix as the scatter matrix
        # of `self._n` zero-mean Gaussian observations
        G = np.random.multivariate_normal(self._0, self._W, size=self._n)
        S = np.dot(G.T, G)
        return mu, S
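
The class above draws the mean vector from $\mathcal{N}(\mu_0, W)$ and does not use `_lambda`. For comparison, here is a hedged sketch of the textbook normal-Wishart draw, where the mean depends on the sampled precision: $\Lambda \sim \mathcal{W}(W, \nu)$, then $\mu \mid \Lambda \sim \mathcal{N}(\mu_0, (\lambda \Lambda)^{-1})$. The function name `sample_normal_wishart` and the use of `np.random.default_rng` are assumptions made for illustration, not the notebook's method.

```python
import numpy as np


def sample_normal_wishart(mu0, lam, W, nu, rng):
    """Draw (mu, Lambda) from a normal-Wishart distribution.

    Lambda ~ Wishart(W, nu), then mu | Lambda ~ N(mu0, (lam * Lambda)^-1).
    """
    mu0 = np.asarray(mu0, dtype=float)
    W = np.asarray(W, dtype=float)
    D = mu0.shape[0]
    if W.shape != (D, D) or nu < D:
        raise ValueError("need a D x D scale matrix and an integer nu >= D")

    # For integer nu, a Wishart(W, nu) draw is the scatter matrix
    # of nu independent N(0, W) vectors.
    G = rng.multivariate_normal(np.zeros(D), W, size=nu)
    Lambda = G.T @ G

    # The covariance of the mean is tied to the sampled precision.
    cov_mu = np.linalg.inv(lam * Lambda)
    cov_mu = (cov_mu + cov_mu.T) / 2  # remove round-off asymmetry
    mu = rng.multivariate_normal(mu0, cov_mu)
    return mu, Lambda


rng = np.random.default_rng(0)
mu, Lambda = sample_normal_wishart(np.zeros(4), 1.0, np.eye(4), nu=15, rng=rng)
print(mu.shape, Lambda.shape)
```

Requiring `nu >= D` keeps the sampled precision matrix invertible with probability one, which the later inversion into a covariance matrix relies on.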

Now that the normal-Wishart distribution has been defined, the actual data generation algorithm remains to be implemented.
Parameters $\mu$, $\Lambda$ and $B$ are computed in the constructor of the `Generator` class, which ensures
that they are sampled only once. The hyper-parameters of the normal-Wishart distribution are the following:
the scale matrix $W_0$ is the identity matrix, the prior mean vector $\mu_0$
is the zero vector, the scaling parameter is $1$, and $\nu_0$ has been
arbitrarily set to 15. In the code, the value 15 is the default of the `latent_dim` argument, the number of Gaussian observations drawn to build the precision matrix.

Each time the generator samples new data, the counter is automatically incremented.
$x$ is sampled from a uniform distribution $\mathcal{U}(0, 1)$, $w$ is sampled from a Gaussian distribution $\mathcal{N}(0, 1)$ and scaled by $0.1$ in the code, and variable $y_j$ is computed as follows:
$$
y_j = B_{\cdot j} x + w
$$
for each $j$. In practice, vector $y$ is computed in a vectorized fashion.
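
To illustrate what "vectorized" means here, the following sketch compares the output-by-output formula with a single matrix product. It stores $B$ with shape $(n, m)$ as in the text; the notebook's `_beta` holds the transpose, with shape $(m, n)$, which is why the class computes `np.dot(X, self._beta.T)`. The variable names and the fixed seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sample_size = 3, 2, 5

B = rng.normal(size=(n, m))                  # column j: weights of output j
X = rng.uniform(size=(sample_size, n))       # one row per observation
W = 0.1 * rng.normal(size=(sample_size, m))  # scaled Gaussian noise

# Output by output, as in the formula.
Y_loop = np.empty((sample_size, m))
for j in range(m):
    Y_loop[:, j] = X @ B[:, j] + W[:, j]

# All outputs and all observations at once.
Y_vec = X @ B + W

print(np.allclose(Y_loop, Y_vec))  # True
```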

In [17]:
class Generator:
    """Data generator.
    
    Attributes:
        _n_inputs (int): Number of explanatory variables.
        _n_outputs (int): Number of explained variables.
        _latent_dim (int): Latent dimension, represented by the
            number of Gaussian observations to draw for
            estimating the precision matrix.
        _beta (:obj:`np.ndarray`): Variable weights.
    """
    
    def __init__(self, n_inputs, n_outputs, latent_dim=15):
        self._counter = 0
        self._n_inputs = n_inputs
        self._n_outputs = n_outputs
        self._latent_dim = latent_dim
        
        # Initializes NW distribution with the identity
        # as scale matrix and a zero vector as location vector.
        W = np.eye(self._n_inputs)
        mu0 = np.zeros(self._n_inputs)
        nu = self._n_inputs
        nw = NormalWishart(mu0, 1., W, nu, self._latent_dim)
        
        # Lambda is a random precision matrix,
        # and needs to be inverted in order to obtain
        # a covariance matrix.
        mu, Lambda = nw.sample()
        Sigma = np.linalg.inv(Lambda)
        
        # Randomly initializes the variable weights
        self._beta = np.random.multivariate_normal(mu, Sigma, self._n_outputs)
    
    def sample(self, sample_size=1):
        """Draws random observations from the linear model.
        
        Parameters:
            sample_size (int): Number of observations to draw.
        
        Returns:
            :obj:`np.ndarray`: Array of shape (`sample_size`, `self._n_inputs`)
                containing random values for the explanatory variables.
            :obj:`np.ndarray`: Array of shape (`sample_size`, `self._n_outputs`)
                containing random values for the explained variables.
            :obj:`np.ndarray`: Array of length `sample_size` containing the unique
                identifiers of generated samples.
        """
        # Generates unique identifiers
        counters = np.arange(self._counter, self._counter + sample_size)
        self._counter += sample_size
        
        # Samples explanatory variables
        X = np.random.rand(sample_size, self._n_inputs)
        
        # Samples random noise independently for each observation
        w = np.random.normal(0, 1, size=(sample_size, self._n_outputs)) * 0.1
        
        # Computes outputs
        y = np.dot(X, self._beta.T) + w
        return np.squeeze(X), np.squeeze(y), counters
    
    @property
    def beta(self):
        return self._beta
    
    @property
    def counter(self):
        """Returns current value for the sample counter.
        
        Returns:
            int: Identifier of the next sample to be generated.
        """
        return self._counter

Let's define the dimensionality of the data:

In [18]:
n = 10 # Number of inputs
m = 8 # Number of outputs

Let's test the generator with these parameters.

In [19]:
Generator(n, m).sample()

(array([0.05203194, 0.12143488, 0.3856555 , 0.34441807, 0.67153769,
        0.29619884, 0.61440962, 0.88948009, 0.72289516, 0.72536867]),
 array([2.33610669, 2.6005269 , 2.57951711, 3.59446197, 2.62726151,
        2.03530953, 2.94350042, 3.14102941]),
 array([0]))

In [21]:
# Waiting time before sending the next observation
TIME_INTERVAL = 1

# Create a new generator
gen = Generator(n, m)

# Let's see the coefficients B
print('beta: %s' % str(gen.beta))

# Loop for sending messages to Kafka with the topic dataLinearModel
while True:
    
    # Sample an observation and concatenate the counter
    # with the values of explanatory and explained variables.
    x, y, counter = gen.sample()
    arr = np.concatenate((counter, y, x))
    
    #print(arr)
    
    # Convert the array to text and send it to Kafka
    message = np.array2string(arr, separator=',')
    producer.send('dataLinearModel', message.encode())
    
    # Wait
    time.sleep(TIME_INTERVAL)
    

beta: [[-1.52809964 -1.19204649  0.37486388 -1.21758472 -0.3247237  -0.02527817
  -0.22016223  1.34798948  0.2111285   1.50931203]
 [-0.10457232  1.23308178  3.3438633   2.76157794 -0.56714695  2.69501417
   0.24975084 -2.04759892 -2.06898235  1.07548952]
 [-0.18715959 -0.03208075  2.06709328 -0.02620484 -0.46086334  0.9233103
  -0.13724642 -1.87379911 -1.24861593  0.67563011]
 [-0.91901068 -1.61524963  0.77555363 -1.56797555  0.07700233 -0.38671343
   0.49605748  2.04485873  1.37608316  1.0653438 ]
 [-1.26663352 -1.30995445  0.37927989 -1.90849082 -0.32179531 -0.97661407
   0.12366479  1.33673143  1.64246086  1.51802968]
 [-0.71142619 -0.89088105  1.02388561 -0.52274731 -0.23792444  0.79970074
   0.36886396  0.55707933  0.19083248  1.48005592]
 [-0.56082966 -0.20478647  1.66907953 -0.54126108 -0.25420555  0.93459017
  -0.30837298 -0.81894012 -0.73377583  0.76207092]
 [-1.27604373 -0.6243154   1.77747854 -0.67365805 -0.67426312  0.60112505
  -0.15960459  0.32958887  0.22451513  0.60301

KeyboardInterrupt: 
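
The loop above runs until the kernel is interrupted, which is what produces the `KeyboardInterrupt` output. One possible variant, sketched below under stated assumptions, takes the send function as an argument so the loop can be exercised without a broker, and stops cleanly on interruption or after a fixed number of messages. The name `send_messages` and its parameters are hypothetical.

```python
import time


def send_messages(send, make_message, interval, max_messages=None):
    """Call send(make_message()) every `interval` seconds.

    Stops after `max_messages` messages, or on Ctrl-C when it is None.
    Returns the number of messages sent.
    """
    sent = 0
    try:
        while max_messages is None or sent < max_messages:
            send(make_message())
            sent += 1
            time.sleep(interval)
    except KeyboardInterrupt:
        # Interrupting the kernel ends the loop instead of raising.
        pass
    return sent


# Stand-in for the producer: collect the messages in a list.
outbox = []
count = send_messages(outbox.append, lambda: b"[0.,1.,2.]",
                      interval=0.0, max_messages=3)
print(count, outbox)
```

With a running broker, `send` could be `lambda msg: producer.send('dataLinearModel', msg)`, and calling `producer.flush()` after the loop would push out any messages still buffered by the client.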