# Data streamer

This notebook simulates a streaming big-data source: it reads TSV files and sends their rows one at a time over a TCP socket.
The main assumption is that each dataset is structured like the response to an API request.

In [7]:
import socket
import time
import csv
import random

In [8]:
API_dataset_tsv_path = "/home/ubuntu/jupyter/Datasets/goodreads_API_dataset.tsv"

# All the datasets to send
dataset_paths = [API_dataset_tsv_path]

In [9]:
# Debugging aid: one out of every PRINT_FREQUENCY rows sent is printed
PRINT_FREQUENCY = 20

## Sending function

This function sends the data over the network. A small random delay (up to 0.5 seconds) after each row keeps the stream from overwhelming the VM.

In [10]:
# Write the dataset to the socket row by row so that Spark can read it on the other end.
def send_data(c, dataset):
  print("Start sending data")
  count = 0

  with open(dataset, encoding="utf-8", errors="ignore") as file_obj:
    # Skip the header row (the default avoids StopIteration on an empty file)
    next(file_obj, None)

    # Create a reader that splits each remaining line on tabs
    reader_obj = csv.reader(file_obj, delimiter="\t")

    # Iterate over each row of the TSV file
    for line in reader_obj:
      try:
        # c.sendall expects bytes, so the row is sent as the UTF-8 encoded
        # string form of a Python list, terminated by a newline
        line_string = f'{line}\n'
        byte_message = line_string.encode(encoding='utf-8', errors='ignore')

        c.sendall(byte_message)
        time.sleep(random.uniform(0, 0.5)) # small delay so as not to overwhelm the VM

        # Debugging print
        if count % PRINT_FREQUENCY == 0:
          print("Sent:", line_string)
        count += 1
      except Exception as ex:
        # Stop streaming this dataset on the first send error
        print("Error sending data:", ex)
        break
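
Each message on the wire is the string form of a Python list, not a tab-separated line, so the receiver has to turn it back into fields. The sketch below shows one way this could be done with `ast.literal_eval`. The helper name `parse_row` is hypothetical and not part of this notebook, and the JSON variant at the end is an alternative wire format shown only for comparison.

```python
import ast
import json

# Shortened sample row: the first three fields of a row from the dataset
row = ['68', 'The Known World', 'Edward P. Jones/Kevin R. Free']

# What send_data puts on the wire: the string form of a Python list plus a newline
wire = f'{row}\n'


def parse_row(line_string):
    """Hypothetical receiver-side helper: rebuild the list of fields."""
    # literal_eval only accepts Python literals, so it does not execute code
    fields = ast.literal_eval(line_string.strip())
    if not isinstance(fields, list):
        raise ValueError("expected a list of fields")
    return fields


print(parse_row(wire))

# Alternative wire format (an assumption, not what this notebook sends):
# JSON can be parsed without Python-specific tooling on the receiving side.
json_wire = json.dumps(row) + "\n"
print(json.loads(json_wire))
```

Whichever format is used, the trailing newline matters: it is what lets a line-oriented reader on the other end split the byte stream back into rows.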

## Connection

This section sets up a TCP server socket and waits for a client to connect.
Once the connection is established, the datasets are sent.
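
To check the stream without starting Spark, a small test client can connect to the server and print the first few rows. The following is a minimal sketch under that assumption. The names `iter_lines` and `run_client` are hypothetical, and the host and port simply mirror the values used in this notebook.

```python
import socket


def iter_lines(sock, bufsize=4096):
    """Yield newline-terminated UTF-8 lines received on a connected socket."""
    buffer = b""
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:  # the peer closed the connection
            break
        buffer += chunk
        # A single recv may hold several rows, or only part of one
        while b"\n" in buffer:
            raw, buffer = buffer.split(b"\n", 1)
            yield raw.decode("utf-8", errors="ignore")
    if buffer:  # trailing data without a final newline
        yield buffer.decode("utf-8", errors="ignore")


def run_client(host="localhost", port=9999, limit=5):
    """Connect to the streamer and print at most `limit` rows."""
    with socket.create_connection((host, port)) as sock:
        for i, line in enumerate(iter_lines(sock)):
            if i >= limit:
                break
            print("Received:", line)
```

Calling `run_client()` from a second notebook or terminal, after the server cell has printed that it is listening, should show the same rows that the server reports as sent. The buffering in `iter_lines` is needed because TCP is a byte stream: message boundaries are not preserved, so the newline is the only row delimiter.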

In [11]:
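# Configuring connection
host = "localhost"
port = 9999
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Allow re-running this cell without an "Address already in use" error
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind((host, port))
s.listen(5)
print("Server is listening on", port)

# Sending data
try:
  connection, addr = s.accept()
  print("Connected by", addr)

  try:
    for path in dataset_paths:
      send_data(connection, path)
  finally:
    # Close the client connection even if sending fails
    connection.close()
except Exception as e:
  print("Connection error:", e)
finally:
  # Release the listening socket so the port is freed
  s.close()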

Server is listening on 9999
Connected by ('127.0.0.1', 52648)
Start sending data
Sent: ['1', 'Harry Potter and the Half-Blood Prince (Harry Potter  #6)', 'J.K. Rowling/Mary GrandPré', '4.57', '0439785960', '9780439785969', 'eng', '652', '2095690', '27591', '9/16/2006', 'Scholastic Inc.']

Sent: ['29', 'The Mother Tongue: English and How It Got That Way', 'Bill Bryson', '3.93', '0380715430', '9780380715435', 'eng', '270', '28489', '2085', '9/28/1991', 'William Morrow Paperbacks']

Sent: ['68', 'The Known World', 'Edward P. Jones/Kevin R. Free', '3.83', '006076273X', '9780060762735', 'en-US', '14', '55', '12', '6/15/2004', 'HarperAudio']

Sent: ['94', 'Getting Results with Curriculum Mapping', 'Heidi Hayes Jacobs', '3.25', '0871209993', '9780871209993', 'eng', '192', '55', '5', '11/15/2004', 'ASCD']

Sent: ['135', 'What to Sell on ebay and Where to Get It: The Definitive Guide to Product Sourcing for eBay and Beyond', 'Chris Malta/Lisa Suttora', '3.62', '0072262788', '9780072262780', 'en