# [INFO-H515 - Big Data Scalable Analytics](https://uv.ulb.ac.be/course/view.php?id=85246?username=guest)

## TP 4 - Streaming forecasting (RLS and ML) with a network socket and Spark Streaming

#### *Gianluca Bontempi, Cédric Simar and Theo Verhelst*

####  26/04/2023

## Sending data to network socket

This notebook uses a network socket to send streaming data.

In this example, the messages are data generated from a linear model with $n$ input variables, i.e., 

$$
y =x^T \beta +w
$$
with $x, \beta \in \mathbb{R}^n$, and $y, w \in \mathbb{R}$. $w$ is Gaussian noise.

A message is sent every `time_delay` seconds. Each message is a list of $n+2$ values, where:
* the first element is the message index
* the second element is $y$
* the remaining $n$ elements are the values of $x$
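
As a rough illustration of this layout, the sketch below builds one message and parses it back. The helper names `encode_message` and `parse_message` are hypothetical (they do not appear in this notebook), and the parser assumes the bracketed, comma-separated text that `np.array2string` produces.

```python
import numpy as np

def encode_message(i, y, x):
    """Serialize [i, y, x_1..x_n] as one newline-terminated line (hypothetical helper)."""
    values = np.append([i, y], x)
    # A large max_line_width keeps the whole array on a single line.
    return np.array2string(values, separator=",", max_line_width=1000) + "\n"

def parse_message(line):
    """Recover (index, y, x) from a line produced by encode_message (hypothetical helper)."""
    tokens = line.strip().strip("[]").split(",")
    values = np.array([float(tok) for tok in tokens])
    return int(values[0]), values[1], values[2:]

msg = encode_message(3, 1.5, np.array([0.1, 0.2]))
index, y, x = parse_message(msg)
```

Since `np.array2string` prints a limited number of digits by default, the parsed values are only approximately equal to the original ones.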


Let's start by importing the required libraries.

In [None]:
import time
import numpy as np

Then, let's create a socket, running on port 9999, in order to be able to send messages.

In [None]:
import socket
  
# server host name and port number
host = 'localhost'
port = 9999
  
# create a socket at server side
# using TCP / IP protocol
s = socket.socket(socket.AF_INET,
                  socket.SOCK_STREAM)
  
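# allow rebinding the port while old connections
# are still in TIME_WAIT (e.g. after a kernel restart)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# bind the socket to the host
# and port number
s.bind((host, port))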
  
# allow up to 5 pending connections
# in the accept queue
s.listen(5)
  
# block until a client connects
# to the socket
c, addr = s.accept()
  
# display client address
print("CONNECTION FROM:", str(addr))
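
`s.accept()` blocks until a client connects, so nothing is sent before a consumer (for example a Spark Streaming job) opens the port. For a quick check without Spark, a minimal client along these lines could be run from a separate process or notebook. `read_lines` is a hypothetical helper, not part of this notebook, and it assumes that the producer terminates each message with a newline.

```python
import socket

def read_lines(host="localhost", port=9999, max_lines=5, timeout=5.0):
    """Connect to the producer and return the first `max_lines` text lines."""
    lines = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Wrap the socket in a text stream so it can be iterated line by line.
        with sock.makefile("r", encoding="utf-8", newline="\n") as stream:
            for line in stream:
                lines.append(line.rstrip("\n"))
                if len(lines) >= max_lines:
                    break
    return lines
```

Calling `read_lines("localhost", 9999, max_lines=3)` while the producer loop is running should return three serialized messages.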

## Linear DGP (Data Generating Process)

In this example, the messages are data generated from a linear model with $n$ input variables and a fixed coefficient vector $\beta$ (first and last entries equal to 1, all others 0), i.e., 

$$
y =x^T \beta +w
$$
with $x, \beta \in \mathbb{R}^n$, and $y, w \in \mathbb{R}$. $w$ is Gaussian noise.

Please note that the numerical values, here encoded as a numpy array, are sent to the network socket in a serialized (string) format.


In [None]:
np.random.seed(2452020515) # Fix seed to ensure repeatability
i=0 #Initialise counter

n=10   # number of inputs
time_delay = 0.01 # Time delay between the transmission of two consecutive messages

beta=np.zeros(n) 
beta[0]=1   
beta[-1]=1 ## first and last parameters are 1, others are zeros
beta.shape=(n,1)


# Infinite loop sending linear-model messages to the client connected to the socket
while True:
    # Randomly generate x_i
    x = np.random.randn(1,n)[0]
    
    # Compute y = x^T beta + w, with w Gaussian noise (standard deviation 0.1)
    y = float(x @ beta[:, 0]) + 0.1 * np.random.randn()

    # Serialize the array as a single newline-terminated string
    message = np.array2string(np.append([i,y],x),separator=",",max_line_width=1000) +'\n'
    #print(message) # n=10 -> 12 elements in the message: cnt+y+10 xi
    
    # Send the whole message to the client
    try:  
        c.sendall(message.encode())
    except socket.error:
        # If sending failed, the client is probably disconnected.
        # Drop this message and wait for another connection
        c.close()
        c, addr = s.accept()
    
    i = i+1
    time.sleep(time_delay)
    

In [None]:
# disconnect the client (the listening socket s stays open for the next section)
c.close()

**N.B** As the cell runs an infinite loop, the producer is never going to stop by itself. 
Don't forget to stop the cell using the dedicated button (■).
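
This notebook only covers the producer side. To give an idea of how the linear stream above could be consumed, here is a minimal sketch of a recursive least-squares (RLS) estimator fed with samples from the same linear model. The function `rls_update`, the forgetting factor `mu`, the initial value `1000 * I` for `V`, and the in-process simulation (no socket involved) are all assumptions made for illustration; they are not taken from this notebook.

```python
import numpy as np

def rls_update(beta_hat, V, x, y, mu=1.0):
    """One RLS step with forgetting factor mu (1.0 means no forgetting)."""
    x = x.reshape(-1, 1)
    # Update the inverse covariance estimate (Sherman-Morrison formula).
    V = (V - (V @ x @ x.T @ V) / (mu + (x.T @ V @ x).item())) / mu
    # Prediction error made with the previous coefficients.
    error = y - (x.T @ beta_hat).item()
    # Correct the coefficients along the gain vector V @ x.
    beta_hat = beta_hat + V @ x * error
    return beta_hat, V

rng = np.random.default_rng(0)
n = 10
beta = np.zeros((n, 1))
beta[0] = 1
beta[-1] = 1

beta_hat = np.zeros((n, 1))   # initial guess
V = 1000.0 * np.eye(n)        # large initial V = weak prior on beta_hat
for _ in range(500):
    x = rng.standard_normal(n)
    y = (x @ beta).item() + 0.1 * rng.standard_normal()
    beta_hat, V = rls_update(beta_hat, V, x, y)
```

After a few hundred samples, `beta_hat` should be close to the true `beta`, with the first and last entries near 1 and the others near 0.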

## Non-linear DGP (Data Generating Process)

In this example, the messages are data generated from a non-linear model with $n$ input variables, i.e., 

$$
y = \sin(x_0) + |x_1 x_2| + \log|x_{n-1}| + w
$$
with $x \in \mathbb{R}^n$, and $y, w \in \mathbb{R}$. $w$ is Gaussian noise.

Please note that the numerical values, here encoded as a numpy array, are sent to the network socket in a serialized (string) format.

In [None]:
np.random.seed(2452020515) # Fix seed to ensure repeatability
i=0 #Initialise counter

n=10   # number of inputs
time_delay = 1 # Time delay between the transmission of two consecutive messages

# Infinite loop sending non-linear-model messages to the client connected to the socket
while True:
    # Randomly generate x_i
    x=np.random.rand(1,n)[0]
    # Compute y from x according to the formula, with w Gaussian noise (standard deviation 0.25)
    y=float(np.sin(x[0])+abs(x[1]*x[2])+np.log(abs(x[-1])))+0.25*np.random.randn()
    
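    # Serialize the array as a single newline-terminated string and print it
    # (without max_line_width, array2string wraps the 12 values over several lines)
    message=np.array2string(np.append([i,y],x),separator=",",max_line_width=1000)+'\n'
    print(message, end="") # n=10 -> 12 elements in the message: cnt+y+10 xi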
    
    # Send the whole message to the client
    try:  
        c.sendall(message.encode())
    except socket.error:
        # If sending failed, the client is probably disconnected.
        # Drop this message and wait for another connection
        c.close()
        c, addr = s.accept()
    
    
    i=i+1
    time.sleep(time_delay)
    

**N.B** As the cell runs an infinite loop, the producer is never going to stop by itself. 
Don't forget to stop the cell using the dedicated button (■).
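
Interrupting the kernel raises `KeyboardInterrupt` inside the running cell, which leaves the client connection open unless it is closed explicitly. One possible way to make the loop stop cleanly is sketched below. `run_producer` and its `max_messages` argument are hypothetical additions, not part of the cells above, and `make_message` stands for any function that returns the serialized message for index `i`.

```python
import socket
import time

def run_producer(conn, make_message, time_delay, max_messages=None):
    """Send make_message(i) for i = 0, 1, ... until interrupted or max_messages is reached.

    Returns the number of messages sent. The connection is always closed on exit.
    """
    i = 0
    try:
        while max_messages is None or i < max_messages:
            conn.sendall(make_message(i).encode())
            i += 1
            time.sleep(time_delay)
    except KeyboardInterrupt:
        # Raised when the cell is stopped from the notebook interface.
        pass
    finally:
        conn.close()
    return i
```

With the socket created earlier, a call such as `run_producer(c, make_message, time_delay)` would replace the `while True:` loop. Unlike the cells above, this sketch does not wait for a new client when sending fails.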